class: center, middle, inverse, title-slide

# Lexical Diversity Measures
## Illustrated and Compared
### E 382J Digital Text Analysis
### Lars Hinrichs | 2022

---
layout: true
class: middle

---
## Please note

For the calculations in this presentation, the `textstat_lexdiv()` function from {quanteda.textstats} was used, with {quanteda} providing the tokenization.

Explanations of the different statistics that follow are largely based on the [textstat_lexdiv() reference page](https://quanteda.io/reference/textstat_lexdiv.html) in the {quanteda} documentation.

---
# Why lexical diversity?

It is considered a measure of information density in a text sample: the more distinct word types a sample uses relative to its length, the more new information each word tends to carry.

---
# TTR

The type-token ratio is the archetype of lexical diversity measures.

$$ TTR = \frac{V}{N} $$

where `\(V\)` is the number of distinct types in a sample and `\(N\)` is the number of tokens in the same sample.

---
layout: false

TTR is a good and functional measure of LD. Its only problem: it is not robust against variation in sample length.

--

The longer your sample, the lower its TTR, when the samples are drawn from the same underlying text.

--

<hr>

.small[Why does this make sense? Why would a longer sample have a lower TTR than a shorter sample of the same text?]

---
layout: true
class: middle

---
To demonstrate, let us work with samples of different lengths from the novel previewed below.

<table class="lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> text </th>
   <th style="text-align:left;"> book </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> By Jane Austen </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Emma Woodhouse, handsome, clever, and rich, with a comfortable home </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> and happy disposition, seemed to unite some of the best blessings of </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> existence; and had lived nearly twenty-one years in the world with very </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> little to distress or vex her. </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> She was the youngest of the two daughters of a most affectionate, </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
  <tr>
   <td style="text-align:left;"> indulgent father; and had, in consequence of her sister's marriage, been </td>
   <td style="text-align:left;"> Emma </td>
  </tr>
 </tbody>
</table>

---
The text of *Emma* was obtained from Julia Silge's [{janeaustenr}](https://cran.r-project.org/web/packages/janeaustenr/index.html) package and prepared using the routine presented in Silge & Robinson's [*Tidy Text Mining*](https://www.tidytextmining.com).
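---
Below is a minimal sketch of that preparation step, assuming the {janeaustenr}, {tidytext}, and {dplyr} packages are installed. The object name `emma_tokens` is our own label for the result; it is the object the code on the following slides builds on.

```r
library(janeaustenr) # full text of Jane Austen's novels
library(tidytext)    # unnest_tokens()
library(dplyr)

# One row per token, in running-text order -- the shape assumed below
emma_tokens <- austen_books() %>%
  filter(book == "Emma") %>%
  unnest_tokens(word, text)
```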
---
We'll start with the TTR of the first 50 words and then increment in steps of 50. To be more exact, we will perform the following steps repeatedly:

- Extract a sample of a certain length (50 words, 100 words, 150 words, etc.) from *Emma*
- Get its lexical diversity statistic (the TTR value)
- Store the sample length and the TTR in two columns of a tibble.

Because this is repetitive, we'll define a custom function - see next slide.

---
## Define a custom function

With `n` as input, extract a sample of that size from the beginning of *Emma* and return a one-row tibble with the sample size `n_tokens` in the first column and the lexical diversity statistics (TTR among them) in the remaining columns.

```r
# Requires {quanteda}, {quanteda.textstats}, and {dplyr}
get_emma_ttr <- function(n) {
  # Paste the first n tokens back into a single string, then re-tokenize
  emstring <- emma_tokens %>% 
    slice(1:n) %>% 
    pull(word) %>% 
    paste(collapse = " ") %>% 
    tokens()
  # Compute all lexical diversity measures and record the sample size
  textstat_lexdiv(emstring, measure = "all") %>% 
    as_tibble() %>% 
    mutate(n_tokens = n) %>% 
    select(-document) %>% 
    relocate(n_tokens)
}
```

---
## Calculate a sequence of TTRs

Calculate the TTR for samples of *Emma* from n = 50 to n = 3000 in steps of 50.

.pull-left[
.small[

```r
my_seq <- seq(50, 3000, 50)
ttr_tibble <- tibble()
for (i in my_seq) {
  ttr_tibble <- rbind(ttr_tibble, get_emma_ttr(i))
}
```
]
]

.pull-right[
<p><img src="emma.png" height=300 /></p>
.right[Next, we visualize that.]
]

---
layout: false

.left-column[
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="100%" />
]

---
## An alternative: log-TTR

.left-column[
Herdan's *C*

$$ C = \frac{\log V}{\log N} $$
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="100%" />
]

---
## Corrected TTR

.left-column[
Carroll's *Corrected TTR*

$$ CTTR = \frac{V}{\sqrt{2N}} $$
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="100%" />
]

---
## Moving-Average TTR

.left-column[
.small[
The Moving-Average TTR (MATTR) slides a window of fixed size over the tokens, from the first to the last, and computes a TTR for each window position. The MATTR is the mean of these window TTRs.
]
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="100%" />
]

---
## Mean Segmental TTR

.left-column[
.small[
The MSTTR (a.k.a. "split TTR") splits the tokens into consecutive segments of a given size, calculates the TTR for each segment, and returns the mean of these values.
]
]
.right-column[
<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="100%" />
]

---
class: inverse, center, middle

## Thanks for viewing

For a complete list and discussion of the lexical diversity measures implemented in the {quanteda.textstats} package, visit https://quanteda.io/reference/textstat_lexdiv.html.

My website: https://larshinrichs.site
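---
## Appendix: window and segment sizes

MATTR and MSTTR each depend on a size parameter. In {quanteda.textstats}, these are set via the `MATTR_window` and `MSTTR_segment` arguments of `textstat_lexdiv()` (100 tokens is the default for both). A minimal sketch, assuming the `emma_tokens` object from earlier is available:

```r
library(dplyr)
library(quanteda)            # tokens()
library(quanteda.textstats)  # textstat_lexdiv()

# Rebuild a quanteda tokens object from the tidy token table
emstring <- emma_tokens %>%
  pull(word) %>%
  paste(collapse = " ") %>%
  tokens()

# Window / segment length is given in tokens
textstat_lexdiv(emstring, measure = "MATTR", MATTR_window = 100)
textstat_lexdiv(emstring, measure = "MSTTR", MSTTR_segment = 100)
```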