Note: for code examples, see README.md

`tidylda`

package`tidylda`

implements Latent Dirichlet Allocation using
style conventions from the tidyverse and tidymodels.
`tidylda`

’s Gibbs sampler is written in C++ for performance
and offers several novel features. It also has methods for sampling from
the posterior of a trained model, for more traditional Bayesian
analyses.

*tidylda*’s Gibbs sampler has several unique features,
described below.

**Non-uniform initialization:** Most LDA Gibbs samplers
initialize by assigning words to topics and topics to documents (i.e.,
construct the \(\boldsymbol{Cd}\) and
\(\boldsymbol{Cv}\) matrices) by
sampling from a uniform distribution. This ensures initialization
without incorporating any prior information. *tidylda*
incorporates the priors in its initialization. It begins by drawing
\(P(\text{topic}|\text{document})\) and
\(P(\text{word}|\text{topic})\) from
Dirichlet distributions with parameters \(\boldsymbol\alpha\) and \(\boldsymbol\eta\), respectively. Then
*tidylda* uses the above probabilities to construct \(P(\text{topic}|\text{word},
\text{document})\) and makes a single run of the Gibbs sampler to
initialize \(\boldsymbol{Cd}\) and
\(\boldsymbol{Cv}\). This non-uniform
initialization powers transfer learning with LDA (tLDA), described
elsewhere, by starting a Gibbs run near where the previous run left off.
For initial models, it uses the user’s prior information to tune where
sampling starts.

**Flexible priors:** *tidylda* has multiple
options for setting LDA priors. Users may set scalar values for \(\boldsymbol\alpha\) and \(\boldsymbol\eta\) to construct symmetric
priors. Users may also choose to construct vector priors for both \(\boldsymbol\alpha\) and \(\boldsymbol\eta\) for a full specification
of LDA. Additionally, *tidylda* allows users to set a matrix
prior for \(\boldsymbol\eta\), enabled
by its implementation of tLDA. This lets users to set priors over
word-topic relationships informed by expert input. The best practices
for encoding expert input in this manner are not yet well studied.
Nevertheless, this capability makes *tidylda* unique among LDA
implementations.

**Burn in iterations and posterior averaging:** Most LDA
Gibbs samplers construct posterior estimates of \(\boldsymbol\Theta\) and \(\boldsymbol{B}\) from \(\boldsymbol{Cd}\) and \(\boldsymbol{Cv}\)’s values of the final
iteration of sampling, effectively using a single sample. This is
inconsistent with best practices from Bayesian statistics, which is to
average over many samples from a stable posterior. *tidylda*
enables averaging across multiple samples of the posterior with the
`burnin`

argument. When `burnin`

is set to a
positive integer, *tidylda* averages the posterior across all
iterations larger than `burnin`

. For example, if
`iterations`

is 200 and `burnin`

is 150,
*tidylda* will return a posterior estimate that is an average of
the last 50 sampling iterations. This ensures that posterior estimates
are more likely to be representative than any single sample.

**Transfer learning with tLDA:** Finally,
*tidylda*’s Gibbs sampler enables transfer learning with tLDA.
The full specification of tLDA and details on its implementation in
*tidylda* are described briefly in the tLDA vignette and more
thoroughly in forthcoming research.

*tidylda*’s construction follows *Conventions of R Modeling
Packages* [@tidymodelsbook]. In
particular, it contains methods for `print`

,
`summary`

, `glance`

, `tidy`

, and
`augment`

, consistent with other “tidy” packages. These
methods are briefly described below.

`print`

,`summary`

, and`glance`

return various summaries of the contents of a`tidylda`

object, into which an LDA model trained with*tidylda*is stored.`tidy`

returns the contents of \(\boldsymbol\Theta\), \(\boldsymbol{B}\), or \(\boldsymbol\Lambda\) (stored as`theta`

,`beta`

, and`lambda`

respectively), as specified by the user, formatted as a tidy`tibble`

, instead of a numeric matrix.`augment`

appends model outputs to observational-level data. Taking the cue from*tidytext*, “observational-level” data is one row per word per document. Therefore, the key statistic used by`augment`

is \(P(\text{topic}|\text{word}, \text{document})\).*tidylda*calculates this as \(\boldsymbol\Lambda \times P(\text{word}|\text{document})\), where \(P(\text{word}|\text{document}_d) = \frac{\boldsymbol{x}_d}{\sum_{v=1}^V x_{d,v}}\).

*tidylda* has three evaluation metrics for topic models, two
goodness-of-fit measures—\(R^2\) as
implemented from *mvrsquared*
and the log likelihood of the model given the data—and one coherence
measure—probabilistic coherence. A flag set during model fitting with
`calc_r2 = TRUE`

^{1} will return a model with an \(R^2\) statistic. Similarly, the log
likelihood of the model, given the data, is calculated at each Gibbs
iteration if the user selects `calc_likelihood = TRUE`

during
model fitting.

The coherence measure is called probabilistic coherence. (See vignette on probabilistic coherence.) Probabilistic coherence is bound between -1 and 1. Values near zero indicate that the top words in a topic are statistically independent from each other. Positive values indicate that the top words in a topic are positively correlated in a statistically-dependent manner. Negative values indicate that the words in a topic are negatively correlated with each other in a statistically-dependent manner. (In practice, negative values tend to be very near zero.)

*tidylda* enables traditional Bayesian uncertainty
quantification by sampling from the posterior. The posterior
distribution for \(\boldsymbol\theta_d\) is \(\text{Dirichlet}(\boldsymbol{Cd}_d +
\boldsymbol\alpha)\) and the posterior distribution for \(\boldsymbol\beta_k\) is \(\text{Dirichlet}(\boldsymbol{Cv}_k +
\boldsymbol\eta)\) (or \(\text{Dirichlet}(\boldsymbol{Cv}_k +
\boldsymbol\eta_k)\) for tLDA). *tidylda* enables a
`posterior`

method for `tidylda`

objects, allowing
users to sample from the posterior to quantify uncertainty for estimates
of estimated parameters.

*tidylda* uses one of two calculations for predicting topic
distributions (i.e., \(\hat{\boldsymbol\theta}_d\)) for new
documents. The first, and default, is to run the Gibbs sampler,
constructing a new \(\boldsymbol{Cd}\)
for the new documents but without updating topic-word distributions in
\(\boldsymbol{B}\). The second uses a
dot product as described in Appendix 1. *tidylda* actually uses
the dot product prediction combined with the *non-uniform
initialization*—described above—to initialize \(\boldsymbol{Cd}\) when predicting using the
Gibbs sampler.

While many other topic modeling packages exist, *tidylda* is
very user friendly and brings novel features. Its user friendliness
comes from compatibility with the tidyverse. And *tidylda*
includes tLDA and other methods contained in the previous chapters of
this dissertation. It also has methods for sampling from the posterior
of a trained model, for more traditional Bayesian analyses.
*tidylda*’s Gibbs sampler is written in C++ for performance.

You can install the development version from GitHub with:

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

```
##
## Attaching package: 'Matrix'
```

```
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
```

```
### Initial set up ---
# load some documents
docs <- nih_sample
# tokenize using tidytext's unnest_tokens
tidy_docs <- docs |>
select(APPLICATION_ID, ABSTRACT_TEXT) |>
unnest_tokens(output = word,
input = ABSTRACT_TEXT,
stopwords = stop_words$word,
token = "ngrams",
n_min = 1, n = 2) |>
count(APPLICATION_ID, word) |>
filter(n>1) #Filtering for words/bigrams per document, rather than per corpus
tidy_docs <- tidy_docs |> # filter words that are just numbers
filter(! stringr::str_detect(tidy_docs$word, "^[0-9]+$"))
# append observation level data
colnames(tidy_docs)[1:2] <- c("document", "term")
# turn a tidy tbl into a sparse dgCMatrix
# note tidylda has support for several document term matrix formats
d <- tidy_docs |>
cast_sparse(document, term, n)
# let's split the documents into two groups to demonstrate predictions and updates
d1 <- d[1:50, ]
d2 <- d[51:nrow(d), ]
# make sure we have different vocabulary for each data set to simulate the "real world"
# where you get new tokens coming in over time
d1 <- d1[, colSums(d1) > 0]
d2 <- d2[, colSums(d2) > 0]
### fit an intial model and inspect it ----
set.seed(123)
lda <- tidylda(
data = d1,
k = 10,
iterations = 200,
burnin = 175,
alpha = 0.1, # also accepts vector inputs
eta = 0.05, # also accepts vector or matrix inputs
optimize_alpha = FALSE, # experimental
calc_likelihood = TRUE,
calc_r2 = TRUE, # see https://arxiv.org/abs/1911.11061
return_data = FALSE
)
# did the model converge?
# there are actual test stats for this, but should look like "yes"
qplot(x = iteration, y = log_likelihood, data = lda$log_likelihood, geom = "line") +
ggtitle("Checking model convergence")
```

```
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

```
## # A tibble: 1 × 5
## num_topics num_documents num_tokens iterations burnin
## <int> <int> <int> <dbl> <dbl>
## 1 10 50 1524 200 175
```

```
## A Latent Dirichlet Allocation Model of 10 topics, 50 documents, and 1524 tokens:
## tidylda(data = d1, k = 10, iterations = 200, burnin = 175, alpha = 0.1,
## eta = 0.05, optimize_alpha = FALSE, calc_likelihood = TRUE,
## calc_r2 = TRUE, return_data = FALSE)
##
## The model's R-squared is 0.2503
## The 5 most prevalent topics are:
## # A tibble: 10 × 4
## topic prevalence coherence top_terms
## <dbl> <dbl> <dbl> <chr>
## 1 4 12.5 0.0527 cdk5, cns, develop, based, lsds, ...
## 2 3 11.5 0.170 cells, cell, sleep, specific, memory, ...
## 3 1 11.4 0.114 effects, v4, signaling, stiffening, wall, ...
## 4 6 10.9 0.348 diabetes, numeracy, redox, extinction, health, ...
## 5 8 10.7 0.337 cmybp, function, mitochondrial, injury, fragment, …
## # ℹ 5 more rows
##
## The 5 most coherent topics are:
## # A tibble: 10 × 4
## topic prevalence coherence top_terms
## <dbl> <dbl> <dbl> <chr>
## 1 6 10.9 0.348 diabetes, numeracy, redox, extinction, health, ...
## 2 8 10.7 0.337 cmybp, function, mitochondrial, injury, fragment, …
## 3 7 10.3 0.210 cancer, imaging, cells, rb, tumor, ...
## 4 5 9.13 0.206 program, dcis, cancer, research, disparities, ...
## 5 10 8.53 0.19 sud, plasticity, risk, factors, brain, ...
## # ℹ 5 more rows
```

```
## # A tibble: 10 × 4
## topic prevalence coherence top_terms
## <dbl> <dbl> <dbl> <chr>
## 1 1 11.4 0.114 effects, v4, signaling, stiffening, wall, ...
## 2 2 7.01 0.0779 research, natural, antibodies, hiv, core, ...
## 3 3 11.5 0.170 cells, cell, sleep, specific, memory, ...
## 4 4 12.5 0.0527 cdk5, cns, develop, based, lsds, ...
## 5 5 9.13 0.206 program, dcis, cancer, research, disparities, ...
## 6 6 10.9 0.348 diabetes, numeracy, redox, extinction, health, ...
## 7 7 10.3 0.210 cancer, imaging, cells, rb, tumor, ...
## 8 8 10.7 0.337 cmybp, function, mitochondrial, injury, fragment,…
## 9 9 8 0.184 ppg, core, pd, data, imaging, ...
## 10 10 8.53 0.19 sud, plasticity, risk, factors, brain, ...
```

```
## # A tibble: 500 × 3
## document topic theta
## <chr> <dbl> <dbl>
## 1 8574224 1 0.00238
## 2 8574224 2 0.00524
## 3 8574224 3 0.00238
## 4 8574224 4 0.00429
## 5 8574224 5 0.00238
## 6 8574224 6 0.00238
## 7 8574224 7 0.00238
## 8 8574224 8 0.00238
## 9 8574224 9 0.00238
## 10 8574224 10 0.974
## # ℹ 490 more rows
```

```
## # A tibble: 15,240 × 3
## topic token beta
## <dbl> <chr> <dbl>
## 1 1 adolescence 0.0025
## 2 1 age 0.0000648
## 3 1 application 0.0000648
## 4 1 depressive 0.0000648
## 5 1 disorder 0.0000648
## 6 1 emotionality 0.0000648
## 7 1 information 0.0025
## 8 1 mdd 0.0000648
## 9 1 onset 0.0000648
## 10 1 onset mdd 0.0000648
## # ℹ 15,230 more rows
```

```
## # A tibble: 15,240 × 3
## topic token lambda
## <dbl> <chr> <dbl>
## 1 1 adolescence 0.304
## 2 1 age 0.00938
## 3 1 application 0.00794
## 4 1 depressive 0.0206
## 5 1 disorder 0.0206
## 6 1 emotionality 0.0206
## 7 1 information 0.259
## 8 1 mdd 0.0115
## 9 1 onset 0.00795
## 10 1 onset mdd 0.0206
## # ℹ 15,230 more rows
```

`## Joining with `by = join_by(document, term, n)``

```
## # A tibble: 4,566 × 4
## document term n topic
## <chr> <chr> <int> <int>
## 1 8574224 adolescence 1 10
## 2 8646901 adolescence 1 10
## 3 8689019 adolescence 1 10
## 4 8705323 adolescence 1 10
## 5 8574224 age 1 10
## 6 8705323 age 1 10
## 7 8757072 age 1 10
## 8 8823186 age 1 10
## 9 8574224 application 1 10
## 10 8605875 application 1 10
## # ℹ 4,556 more rows
```

```
### predictions on held out data ---
# two methods: gibbs is cleaner and more technically correct in the bayesian sense
p_gibbs <- predict(lda, new_data = d2[1, ], iterations = 100, burnin = 75)
# dot is faster, less prone to error (e.g. underflow), noisier, and frequentist
p_dot <- predict(lda, new_data = d2[1, ], method = "dot")
# pull both together into a plot to compare
tibble(topic = 1:ncol(p_gibbs), gibbs = p_gibbs[1,], dot = p_dot[1, ]) |>
pivot_longer(cols = gibbs:dot, names_to = "type") |>
ggplot() +
geom_bar(mapping = aes(x = topic, y = value, group = type, fill = type),
stat = "identity", position="dodge") +
scale_x_continuous(breaks = 1:10, labels = 1:10) +
ggtitle("Gibbs predictions vs. dot product predictions")
```

```
### Augment as an implicit prediction using the 'dot' method ----
# Aggregating over terms results in a distribution of topics over documents
# roughly equivalent to using the "dot" method of predictions.
augment_predict <-
augment(lda, tidy_docs, "prob") |>
group_by(document) |>
select(-c(document, term)) |>
summarise_all(function(x) sum(x, na.rm = T))
```

`## Joining with `by = join_by(document, term, n)``

`## Adding missing grouping variables: `document``

```
# reformat for easy plotting
augment_predict <-
as_tibble(t(augment_predict[, -c(1,2)]), .name_repair = "minimal")
colnames(augment_predict) <- unique(tidy_docs$document)
augment_predict$topic <- 1:nrow(augment_predict) |> as.factor()
compare_mat <-
augment_predict |>
select(
topic,
augment = matches(rownames(d2)[1])
) |>
mutate(
augment = augment / sum(augment), # normalize to sum to 1
dot = p_dot[1, ]
) |>
pivot_longer(cols = c(augment, dot))
ggplot(compare_mat) +
geom_bar(aes(y = value, x = topic, group = name, fill = name),
stat = "identity", position = "dodge") +
labs(title = "Prediction using 'augment' vs 'predict(..., method = \"dot\")'")
```

```
# Not shown: aggregating over documents results in recovering the "tidy" lambda.
### updating the model ----
# now that you have new documents, maybe you want to fold them into the model?
lda2 <- refit(
object = lda,
new_data = d, # save me the trouble of manually-combining these by just using d
iterations = 200,
burnin = 175,
calc_likelihood = TRUE,
calc_r2 = TRUE
)
# we can do similar analyses
# did the model converge?
qplot(x = iteration, y = log_likelihood, data = lda2$log_likelihood, geom = "line") +
ggtitle("Checking model convergence")
```

```
## # A tibble: 1 × 5
## num_topics num_documents num_tokens iterations burnin
## <int> <int> <int> <dbl> <dbl>
## 1 10 99 2962 200 175
```

```
## A Latent Dirichlet Allocation Model of 10 topics, 99 documents, and 2962 tokens:
## refit.tidylda(object = lda, new_data = d, iterations = 200, burnin = 175,
## calc_likelihood = TRUE, calc_r2 = TRUE)
##
## The model's R-squared is 0.1389
## The 5 most prevalent topics are:
## # A tibble: 10 × 4
## topic prevalence coherence top_terms
## <dbl> <dbl> <dbl> <chr>
## 1 5 14.5 0.107 research, program, cancer, health, disparities, ...
## 2 3 12.6 0.141 cell, cells, lung, sleep, specific, ...
## 3 1 11.9 0.0616 effects, muscle, wall, v4, signaling, ...
## 4 10 10.4 0.0499 risk, brain, factors, sud, plasticity, ...
## 5 2 10.2 0.0305 research, center, microbiome, core, hiv, ...
## # ℹ 5 more rows
##
## The 5 most coherent topics are:
## # A tibble: 10 × 4
## topic prevalence coherence top_terms
## <dbl> <dbl> <dbl> <chr>
## 1 8 7.34 0.326 cmybp, function, mitochondrial, injury, fragment, …
## 2 9 7.55 0.187 core, data, ppg, studies, imaging, ...
## 3 7 9.9 0.159 cancer, tumor, clinical, imaging, cells, ...
## 4 3 12.6 0.141 cell, cells, lung, sleep, specific, ...
## 5 5 14.5 0.107 research, program, cancer, health, disparities, ...
## # ℹ 5 more rows
```

```
## A Latent Dirichlet Allocation Model of 10 topics, 50 documents, and 1524 tokens:
## tidylda(data = d1, k = 10, iterations = 200, burnin = 175, alpha = 0.1,
## eta = 0.05, optimize_alpha = FALSE, calc_likelihood = TRUE,
## calc_r2 = TRUE, return_data = FALSE)
##
## The model's R-squared is 0.2503
## The 5 most prevalent topics are:
## # A tibble: 10 × 4
## topic prevalence coherence top_terms
## <dbl> <dbl> <dbl> <chr>
## 1 4 12.5 0.0527 cdk5, cns, develop, based, lsds, ...
## 2 3 11.5 0.170 cells, cell, sleep, specific, memory, ...
## 3 1 11.4 0.114 effects, v4, signaling, stiffening, wall, ...
## 4 6 10.9 0.348 diabetes, numeracy, redox, extinction, health, ...
## 5 8 10.7 0.337 cmybp, function, mitochondrial, injury, fragment, …
## # ℹ 5 more rows
##
## The 5 most coherent topics are:
## # A tibble: 10 × 4
## topic prevalence coherence top_terms
## <dbl> <dbl> <dbl> <chr>
## 1 6 10.9 0.348 diabetes, numeracy, redox, extinction, health, ...
## 2 8 10.7 0.337 cmybp, function, mitochondrial, injury, fragment, …
## 3 7 10.3 0.210 cancer, imaging, cells, rb, tumor, ...
## 4 5 9.13 0.206 program, dcis, cancer, research, disparities, ...
## 5 10 8.53 0.19 sud, plasticity, risk, factors, brain, ...
## # ℹ 5 more rows
```

I plan to have more analyses and a fuller accounting of the options of the various functions when I write the vignettes.

See the “Issues” tab on GitHub to see planned features as well as bug fixes.

If you have any suggestions for additional functionality, changes to functionality, changes to arguments or other aspects of the API please let me know by opening an issue on GitHub or sending me an email: jones.thos.w at gmail.com.

Users can calculate \(R^2\) after a model is fit by using the

*mvrsquared*package or calling`tidylda:::calc_lda_rsquared`

.`calc_lda_rsquared`

is an internal function to*tidylda*, requiring the package name followed by three colons, as is R’s standard.↩︎