class: center, middle, inverse, title-slide

# Lexical Diversity Measures
## Illustrated and Compared
### E 382J Digital Text Analysis
### Lars Hinrichs | 2022

---
layout: true
class: middle

---
## Please note

For the calculations in this presentation, the `textstat_lexdiv()` function from {quanteda.textstats} was used, with {quanteda} providing the tokenization.

Explanations of the different statistics that follow are largely based on the [textstat_lexdiv() reference page](https://quanteda.io/reference/textstat_lexdiv.html) in the {quanteda} documentation.

---
# Why lexical diversity?

It is considered a measure of information density in a text sample: the more distinct word types a sample uses relative to its length, the more new information each word tends to carry.

---
# TTR

The type-token ratio is the archetype of lexical diversity measures.

$$ TTR = \frac{V}{N} $$

where `\(V\)` is the number of distinct types in a sample and `\(N\)` is the number of tokens in the same sample.

---
layout: false

TTR is a good and functional measure of LD. Its only problem: it is not robust against variation in sample length.

--

The longer your sample, the lower its TTR, when the samples are drawn from the same underlying text.

--

<hr>

.small[Why does this make sense? Why would a longer sample have a lower TTR than a shorter sample of the same text?]

---
layout: true
class: middle

---
To demonstrate, let us work with samples of different lengths from the novel previewed below.

<table class="lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> text </th>
   <th style="text-align:left;"> book </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> By Jane Austen </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Emma Woodhouse, handsome, clever, and rich, with a comfortable home </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> and happy disposition, seemed to unite some of the best blessings of </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> existence; and had lived nearly twenty-one years in the world with very </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> little to distress or vex her. </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> She was the youngest of the two daughters of a most affectionate, </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> indulgent father; and had, in consequence of her sister's marriage, been </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
 </tbody>
</table>

---
The text of *Emma* was obtained from Julia Silge's [{janeaustenr}](https://cran.r-project.org/web/packages/janeaustenr/index.html) package and prepared using the routine presented in Silge & Robinson's [*Tidy Text Mining*](https://www.tidytextmining.com).
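---
Below is a minimal sketch of that preparation step, assuming the {janeaustenr}, {tidytext}, and {dplyr} packages are installed. The object name `emma_tokens` is our own label for the result; it is the object the code on the following slides builds on.

```r
library(janeaustenr) # full text of Jane Austen's novels
library(tidytext)    # unnest_tokens()
library(dplyr)

# One row per token, in running-text order -- the shape assumed below
emma_tokens <- austen_books() %>%
  filter(book == "Emma") %>%
  unnest_tokens(word, text)
```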
---
We'll start with the TTR of the first 50 words and then increment in steps of 50. To be more exact, we will perform the following steps repeatedly:

- Extract a sample of a certain length (50 words, 100 words, 150 words, etc.) from *Emma*
- Get its lexical diversity statistic (the TTR value)
- Store the sample length and the TTR in two columns of a tibble.

Because this is repetitive, we'll define a custom function - see next slide.

---
## Define a custom function

With `n` as input, extract a sample of that size from the beginning of *Emma* and return a one-row tibble with the sample size `n_tokens` in the first column and the lexical diversity statistics (TTR among them) in the remaining columns.

```r
# Requires {quanteda}, {quanteda.textstats}, and {dplyr}
get_emma_ttr <- function(n) {
  # Paste the first n tokens back into a single string, then re-tokenize
  emstring <- emma_tokens %>% 
    slice(1:n) %>% 
    pull(word) %>% 
    paste(collapse = " ") %>% 
    tokens()
  # Compute all lexical diversity measures and record the sample size
  textstat_lexdiv(emstring, measure = "all") %>% 
    as_tibble() %>% 
    mutate(n_tokens = n) %>% 
    select(-document) %>% 
    relocate(n_tokens)
}
```

---
## Calculate a sequence of TTRs

Calculate the TTR for samples of *Emma* from n = 50 to n = 3000 in steps of 50.

.pull-left[
.small[

```r
my_seq <- seq(50, 3000, 50)
ttr_tibble <- tibble()
for (i in my_seq) {
  ttr_tibble <- rbind(ttr_tibble, get_emma_ttr(i))
}
```
]
]

.pull-right[
<p><img src="emma.png" height=300 /></p>
.right[Next, we visualize that.]
]

---
layout: false

.left-column[
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="100%" />
]

---
## An alternative: log-TTR

.left-column[
Herdan's *C*

$$ C = \frac{\log V}{\log N} $$
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="100%" />
]

---
## Corrected TTR

.left-column[
Carroll's *Corrected TTR*

$$ CTTR = \frac{V}{\sqrt{2N}} $$
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="100%" />
]

---
## Moving-Average TTR

.left-column[
.small[
The Moving-Average TTR (MATTR) slides a window of fixed size over the tokens, from the first to the last, and computes a TTR for each window position. The MATTR is the mean of these window TTRs.
]
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="100%" />
]

---
## Mean Segmental TTR

.left-column[
.small[
The MSTTR (a.k.a. "split TTR") splits the tokens into consecutive segments of a given size, calculates the TTR for each segment, and returns the mean of these values.
]
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="100%" />
]

---
class: inverse, center, middle

## Thanks for viewing

For a complete list and discussion of the lexical diversity measures implemented in the {quanteda.textstats} package, visit https://quanteda.io/reference/textstat_lexdiv.html.

My website: https://larshinrichs.site
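---
## Appendix: window and segment sizes

MATTR and MSTTR each depend on a size parameter. In {quanteda.textstats}, these are set via the `MATTR_window` and `MSTTR_segment` arguments of `textstat_lexdiv()` (100 tokens is the default for both). A minimal sketch, assuming the `emma_tokens` object from earlier is available:

```r
library(dplyr)
library(quanteda)            # tokens()
library(quanteda.textstats)  # textstat_lexdiv()

# Rebuild a quanteda tokens object from the tidy token table
emstring <- emma_tokens %>%
  pull(word) %>%
  paste(collapse = " ") %>%
  tokens()

# Window / segment length is given in tokens
textstat_lexdiv(emstring, measure = "MATTR", MATTR_window = 100)
textstat_lexdiv(emstring, measure = "MSTTR", MSTTR_segment = 100)
```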