R - LDA Topic Model Output Data

I'm working on building some topic models in R using the 'topicmodels' package. After pre-processing and creating a document-term matrix, I am fitting the following LDA model with Gibbs sampling. This may have a simple answer, but I'm new to R, so here goes: is there a way to export the topics and term lists, along with their probabilities, to a text file or Excel file? I can print them in R (as below), but I don't know how to export them.
This is mainly so I can do some visualisation, which I'm sure can be done in Excel, but as I mentioned I'm a newbie and don't have much time available to learn visualisation techniques in R. I hope this makes sense.
library(topicmodels)
k = 33          # number of topics
burnin = 1000   # burn-in iterations for the Gibbs sampler
iter = 1000     # sampling iterations after burn-in
keep = 50       # store the log-likelihood every 50 iterations
seed = 2003
model_lda <- LDA(myDtm, k = k, method = "Gibbs",
                 control = list(seed = seed, burnin = burnin, iter = iter, keep = keep))
print(model_lda)
save(model_lda, file = "LDA_Output.RData")
topics(model_lda, 5)   # 5 most likely topics per document
terms(model_lda, 15)   # 15 most probable terms per topic
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
[1,] "seat" "dialogu" "websit" "census" "northern" "growth" "hse"
[2,] "resum" "church" "partnership" "disabl" "univers" "adjust" "legisl"
[3,] "suspend" "congreg" "nesc" "cso" "peac" "forecast" "die"
[4,] "adjourn" "school" "site" "statist" "unemploy" "bernard" "legal"
[5,] "fisheri" "survivor" "nesf" "survey" "polic" "burton" "child"

First, you can tidy the fitted model with the tidytext package, then write the results out with readr. For example:
library(tidytext)
readr::write_csv(tidy(model_lda, matrix = "beta"), "beta.csv")
readr::write_csv(tidy(model_lda, matrix = "gamma"), "gamma.csv")
The above code should save your beta matrix (per-topic per-word probabilities) and gamma matrix (per-document per-topic probabilities) to beta.csv and gamma.csv, respectively.
You can also find a chapter that was helpful for me here: http://tidytextmining.com/topicmodeling.html
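If you would rather stay within topicmodels itself, here is a minimal sketch (my own, not part of the answer above) using posterior() and plain write.csv(), which Excel can open directly:
post <- posterior(model_lda)
write.csv(post$terms, "term_probabilities.csv")     # topic x term probability matrix
write.csv(post$topics, "topic_probabilities.csv")   # document x topic probability matrix
write.csv(terms(model_lda, 15), "top_terms.csv")    # top 15 terms per topic, without probabilities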

Classic king - man + woman = queen example with pretrained word-embedding and word2vec package in R

I am really desperate: I just cannot reproduce the allegedly classic example of king - man + woman = queen with the word2vec package in R and any (!) pre-trained embedding model (as a bin file).
I would be very grateful if anybody could provide working code to reproduce this example... including a link to a pre-trained model that is actually downloadable (many are not!).
Thank you very much!
An overview of using word2vec with R is available at https://www.bnosac.be/index.php/blog/100-word2vec-in-r which even shows an example of king - man + woman = queen.
I just followed the instructions there: I downloaded the first English 300-dim word2vec embedding model I came across at http://vectors.nlpl.eu/repository, trained on the British National Corpus, and unzipped model.bin to my drive. Inspecting the terms in the model (words are apparently stored with POS tags appended), getting the word vectors, displaying them, computing king - man + woman and finding the closest vector to that result gives ... queen.
> library(word2vec)
> model <- read.word2vec("C:/Users/jwijf/OneDrive/Bureaublad/model.bin", normalize = TRUE)
> head(summary(model, type = "vocabulary"), n = 10)
[1] "vintage-style_ADJ" "Sinopoli_PROPN" "Yarrell_PROPN" "en-1_NUM" "74°–78°F_X"
[6] "bursa_NOUN" "uni-male_ADJ" "37541_NUM" "Menuetto_PROPN" "Saxena_PROPN"
> wv <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN"), type = "embedding")
> head(t(wv), n = 10)
king_NOUN man_NOUN woman_NOUN
[1,] -0.4536242 -0.47802860 -1.03320265
[2,] 0.7096733 1.40374041 -0.91597748
[3,] 1.1509652 2.35536361 1.57869458
[4,] -0.2882653 -0.59587735 -0.59021348
[5,] -0.2110678 -1.05059254 -0.64248675
[6,] 0.1846713 -0.05871651 -1.01818573
[7,] 0.5493720 0.13456300 0.38765019
[8,] -0.9401053 0.56237948 0.02383301
[9,] 0.1140556 -0.38569298 -0.43408644
[10,] 0.3657919 0.92853492 -2.56553030
> wv <- wv["king_NOUN", ] - wv["man_NOUN", ] + wv["woman_NOUN", ]
> predict(model, newdata = wv, type = "nearest", top_n = 4)
term similarity rank
1 king_NOUN 0.9332663 1
2 queen_NOUN 0.7813236 2
3 coronation_NOUN 0.7663506 3
4 kingship_NOUN 0.7626975 4
Do you prefer to build your own model based on your own text or on a larger corpus, e.g. the text8 file? Follow the instructions shown at https://www.bnosac.be/index.php/blog/100-word2vec-in-r: get a text file, use the R package word2vec to build the model, wait until training has finished and then interact with it.
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
unzip("text8.zip", files = "text8")
> library(word2vec)
> set.seed(123456789)
> model <- word2vec(x = "text8", type = "cbow", dim = 100, window = 10, lr = 0.05, iter = 5, hs = FALSE, threads = 2)
> wv <- predict(model, newdata = c("king", "man", "woman"), type = "embedding")
> wv <- wv["king", ] - wv["man", ] + wv["woman", ]
> predict(model, newdata = wv, type = "nearest", top_n = 4)
term similarity rank
1 king 0.9743692 1
2 queen 0.8295941 2
You haven't shown which pretrained models you've tried, what data you used in your attempts, what training-then-probing code you used that failed, or how it failed. So it's hard to help without writing you a whole tutorial... and there are already plenty of word2vec tutorials online.
But note:
word2vec is a data-hungry algorithm, and its useful qualities (including analogy-solving capabilities) really only become reliably demoable when using adequately large training sets
that said, most pretrained models from competent teams should easily show the classic man : king :: woman : queen analogy solution, when using the same kind of vector arithmetic and candidate-answer ranking (eliminating all words in the question) as the original work
if I recall correctly, the merely-100MB uncompressed text8 dataset from http://mattmahoney.net/dc/textdata will often succeed, or come close to succeeding, on man : king :: woman : queen, though the related text9 (1GB of data) tends to do much better. Both, though, are a bit small for making strong general word vectors. For contrast, the GoogleNews vectors Google released circa 2013, at the same time as the original word2vec papers, were said to be trained on something like 100GB of news articles.
beware, though: the text8 & text9 datasets, by stripping all punctuation/linebreaks, may need to be chunked before being passed to word2vec implementations that require training texts to fit within certain limits. For example, Python's Gensim expects training texts to be no longer than 10,000 tokens each. text8 is 17 million words on one line; if you pass that single line to Gensim as one training text, 99.94% of the tokens will be ignored as beyond the 10,000-token limit. Your R implementation may have a similar, or even tighter, limit; a base-R sketch of such chunking follows below.
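If your implementation does impose such a limit, a minimal base-R sketch (file names are mine) for splitting text8 into lines of at most 10,000 tokens:
# read all whitespace-separated tokens, then regroup them into 10,000-token lines
words <- scan("text8", what = character(), quiet = TRUE)
chunks <- split(words, ceiling(seq_along(words) / 10000))
writeLines(vapply(chunks, paste, character(1), collapse = " "), "text8_chunked.txt")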

Why is LSA in text2vec producing different results every time?

I was using latent semantic analysis in the text2vec package to generate word vectors, and using transform to fit new data, when I noticed something odd: the spaces are not lined up when trained on the same data.
There appears to be some inconsistency (or randomness?) in the method. Namely, even when re-running an LSA model on the exact same data, the resulting word vectors are wildly different, despite identical input. When looking around I only found a couple of old, closed GitHub issues and a mention in the changelog about LSA being cleaned up. I reproduced the behaviour using the movie_review dataset and (slightly modified) code from the documentation:
library(text2vec)
packageVersion("text2vec") # ‘0.5.1’
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# edit: make tcm symmetric:
tcm = tcm + Matrix::t(Matrix::triu(tcm))
n_topics = 10
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(tcm)
lsa_2 = LatentSemanticAnalysis$new(n_topics)
d2 = lsa_2$fit_transform(tcm)
# despite being trained on the same data, words have completely different vectors:
sim2(d1["film",,drop=F], d2["film",,drop=F])
# yields values like -0.993363 but sometimes 0.9888435 (should be 1)
mean(diag(sim2(d1, d2)))
# e.g. -0.2316826
hist(diag(sim2(d1, d2)), main="self-similarity between models")
# note: these numbers are different every time!
# But: within each model, results seem consistent and reasonable:
# top similar words for "film":
head(sort(sim2(d1, d1["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873934 0.9803280 0.9732380 0.9680488 0.9668800
# same in the second model:
head(sort(sim2(d2, d2["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873935 0.9803279 0.9732364 0.9680495 0.9668819
# transform works:
sim2(d2["film",,drop=F], transform(tcm["film",,drop=F], lsa_2 )) # yields 1
# LSA in quanteda doesn't have this problem, same data => same vectors
library(quanteda)
d1q = textmodel_lsa(as.dfm(tcm), 10)
d2q = textmodel_lsa(as.dfm(tcm), 10)
mean(diag(sim2(d1q$docs, d2q$docs))) # yields 1
# the top synonyms for "film" are also a bit different with quanteda's LSA
# film movie hunk show territory bay
# 1.0000000 0.9770574 0.9675766 0.9642915 0.9577723 0.9573138
What's the deal, is it a bug, is this intended behaviour for some reason, or am I having a massive misunderstanding? (I'm kind of hoping for the latter...). If it's intended, why would quanteda behave differently?
The issue is that your matrix seems ill-conditioned and hence you have numerical stability issues.
library(text2vec)
library(magrittr)
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# condition number
kappa(tcm)
# Inf
Now if you do a truncated SVD (the algorithm behind LSA), you will notice that the singular vectors are very close to zero:
library(irlba)
truncated_svd = irlba(tcm, 10)
str(truncated_svd)
# $ d : num [1:10] 2139 1444 660 559 425 ...
# $ u : num [1:4387, 1:10] -1.44e-04 -1.62e-04 -7.77e-05 -8.44e-04 -8.99e-04 ...
# $ v : num [1:4387, 1:10] 6.98e-20 2.37e-20 4.09e-20 -4.73e-20 6.62e-20 ...
# $ iter : num 3
# $ mprod: num 50
Hence the sign of the embeddings is not stable, and the cosine angle between them is not stable either.
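One way to check that the two fits differ mainly by sign is to align the sign of each latent dimension before comparing; a minimal sketch (mine, not part of the answer):
signs <- sign(colSums(d1 * d2))          # +1 / -1 per latent dimension
d2_aligned <- sweep(d2, 2, signs, `*`)   # flip the dimensions whose sign disagrees
mean(diag(sim2(d1, d2_aligned)))         # close to 1 if only the signs differed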
Similar to how truncated SVD works in Python's sklearn, the truncated SVD routine used in R has randomness built into it. That randomness is part of what makes it so powerful for large model building, but it makes small runs harder to reproduce. If you call set.seed() before the SVD is computed, you shouldn't have an issue. This used to terrify me when doing LSA.
Let me know if that helps!
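For example, a minimal sketch of that suggestion, assuming the solver behind fit_transform() consults R's RNG:
set.seed(42)
d1 = LatentSemanticAnalysis$new(n_topics)$fit_transform(tcm)
set.seed(42)
d2 = LatentSemanticAnalysis$new(n_topics)$fit_transform(tcm)
mean(diag(sim2(d1, d2)))   # should be 1 when the RNG state is identical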

H2O: how to use gradient boosting on textual data?

I am trying to implement a very simple ML problem, where I use text to predict some outcome. In R, a basic example would be:
1. Import some fake but funny text data
library(caret)
library(dplyr)
library(text2vec)
dataframe <- data_frame(id = c(1, 2, 3, 4),
                        text = c("this is a this", "this is another", "hello", "what???"),
                        value = c(200, 400, 120, 300),
                        output = c("win", "lose", "win", "lose"))
> dataframe
# A tibble: 4 x 4
id text value output
<dbl> <chr> <dbl> <chr>
1 1 this is a this 200 win
2 2 this is another 400 lose
3 3 hello 120 win
4 4 what??? 300 lose
2. Use text2vec to get a sparse matrix representation of my text (see also https://github.com/dselivanov/text2vec/blob/master/vignettes/text-vectorization.Rmd)
#these are text2vec functions to tokenize and lowercase the text
prep_fun = tolower
tok_fun = word_tokenizer
#create the tokens
train_tokens = dataframe$text %>%
  prep_fun %>%
  tok_fun
it_train = itoken(train_tokens)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
> dtm_train
4 x 6 sparse Matrix of class "dgCMatrix"
what hello another a is this
1 . . . 1 1 2
2 . . 1 . 1 1
3 . 1 . . . .
4 1 . . . . .
3. Finally, train the algorithm (for instance, using caret) to predict output using my sparse matrix.
mymodel <- train(x=dtm_train, y =dataframe$output, method="xgbTree")
> confusionMatrix(mymodel)
Bootstrapped (25 reps) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction lose win
lose 17.6 44.1
win 29.4 8.8
Accuracy (average) : 0.264
My problem is:
I see how to import data into H2O using spark_read_csv, rsparkling and as_h2o_frame.
However, for points 2. and 3. above I am completely lost.
Can someone please give me some hints or tell me if this approach is even possible with h2o?
Many thanks!!
You can solve this one of two ways: 1. do the text processing in R first and then move to H2O for modeling, or 2. do everything in H2O using H2O's word2vec implementation.
Use R data.frames and text2vec, then convert the sparse matrix to an H2O frame and do the modeling in H2O.
# Use same code as above to get to this point, then:
library(h2o)
h2o.init()
# Convert dgCMatrix to H2OFrame, cbind the response col
train <- as.h2o(dtm_train)
train$y <- as.h2o(dataframe$output)
# Train any H2O model (e.g. GBM)
mymodel <- h2o.gbm(y = "y", training_frame = train,
                   distribution = "bernoulli", seed = 1)
Or you can train a word2vec embedding in H2O and apply it to your text to get the equivalent of a sparse matrix, then train an H2O machine learning model (GBM). I will try to edit this answer later with a working example using your data, but in the meantime, here is an example demonstrating the use of H2O's word2vec functionality in R.
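A hedged sketch of that second option, loosely based on H2O's word2vec tutorials rather than the answer's own example: the functions h2o.tokenize(), h2o.word2vec() and h2o.transform() and their arguments are assumed from recent h2o releases, so check them against your version.
library(h2o)
h2o.init()
# bring the question's data frame into H2O and tokenize the text column
text_hf <- as.h2o(dataframe[, c("text", "output")])
words <- h2o.tokenize(h2o.ascharacter(text_hf$text), split = "\\\\W+")
# train a tiny word2vec model (min_word_freq lowered because the toy data is tiny)
w2v <- h2o.word2vec(words, vec_size = 10, min_word_freq = 1, epochs = 5, seed = 1)
# average the word vectors within each document: one numeric row per document
doc_vecs <- h2o.transform(w2v, words, aggregate_method = "AVERAGE")
train <- h2o.cbind(doc_vecs, text_hf["output"])
train$output <- as.factor(train$output)
mymodel <- h2o.gbm(y = "output", training_frame = train, seed = 1)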

Simulating data from multivariate distribution in R based on Winbugs/JAGS script

I am trying to simulate data based on part of a JAGS/WinBUGS script. The script comes from Eaves & Erkanli (2003; see http://psych.colorado.edu/~carey/pdffiles/mcmc_eaves.pdf, pages 295-296).
The part of the script I want to base my simulations on is as follows (with different variable names than in the original paper):
for(fam in 1:nmz) {
  a2mz[fam, 1:N] ~ dmnorm(mu[1:N], tau.a[1:N, 1:N])
  a1mz[fam, 1:N] ~ dmnorm(a2mz[fam, 1:N], tau.a[1:N, 1:N])
}
# Prior
tau.a[1:N, 1:N] ~ dwish(omega.g[,], N)
I want to simulate data in R for the parameters a2mz and a1mz as given in the script above.
So basically, I want to simulate data from -N- (e.g. 3) multivariate distributions for -fam- (e.g. 10) persons, with Sigma tau.a.
To make this more illustrative: The purpose is to simulate genetic effects for -fam- (e.g. 10) families. The genetic effect is the same for each family (e.g. monozygotic twins), with a variance of tau.a (e.g. 0.5). Of these genetic effects, 3 'versions' (3 multivariate distributions) have to be simulated.
What I tried in R to simulate the data as given in the JAGS/Winbugs script is as follows:
library(MASS)
nmz = 10 #number of families, here e.g. 10
var_a = 0.5 #tau.g in the script
a2_mz <- mvrnorm(3, mu = rep(0, nmz), Sigma = diag(nmz)*var_a)
This simulates data for the a2mz parameter as referred to in the JAGS/Winbugs script above:
> print(t(a2_mz))
[,1] [,2] [,3]
[1,] -1.1563683 -0.4478091 -0.15037563
[2,] 0.5673873 -0.7052487 0.44377336
[3,] 0.2560446 0.9901964 -0.65463341
[4,] -0.8366952 0.4924839 -0.56891991
[5,] 0.7343780 0.5429955 0.87529201
[6,] 0.5592868 -0.3899988 -0.33709105
[7,] -1.8233663 -0.7149141 -0.18153049
[8,] -0.8213804 -1.4397075 -0.09159725
[9,] -0.7002797 -0.3996970 -0.29142215
[10,] 1.1084067 0.3884869 -0.46207940
However, when I then try to use these data to simulate data for a1mz (the third line of the JAGS/WinBUGS script), something goes wrong and I am not sure what:
a1_mz <- mvrnorm(3, mu = t(a2_mz), Sigma = c(diag(nmz)*var_a, diag(nmz)*var_a, diag(nmz)*var_a))
This results in the error:
Error in eigen(Sigma, symmetric = TRUE, EISPACK = EISPACK) :
non-square matrix in 'eigen'
Can anyone give me any hints or tips on what I am doing wrong?
Many thanks,
Best regards,
inga
mvrnorm() takes a mean vector and a variance matrix as input, and that's not what you're feeding it. I'm not sure I understand your question, but if you want to simulate 3 samples from 3 different multivariate normal distributions with the same variance and different means, then just use:
a1_mz <- array(dim = c(dim(a2_mz), 3))
for (i in 1:3) a1_mz[, , i] <- mvrnorm(3, t(a2_mz)[, i], diag(nmz) * var_a)
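If you would rather follow the JAGS indexing literally (one N-dimensional genetic effect per family, with a1mz centred on that family's a2mz), here is a hedged sketch; it treats tau.a as a variance of 0.5 per dimension, as in the question, even though dmnorm in JAGS takes a precision matrix:
library(MASS)
N <- 3                                # dimension of each multivariate normal
nmz <- 10                             # number of families
Sigma <- diag(N) * 0.5                # tau.a treated as a variance here
a2mz <- mvrnorm(nmz, mu = rep(0, N), Sigma = Sigma)                      # nmz x N
a1mz <- t(apply(a2mz, 1, function(m) mvrnorm(1, mu = m, Sigma = Sigma))) # nmz x N, centred on a2mz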

How can I smooth an array in R?

I have a 2-D array in R which represents value data for a grid of rows and columns. It looks like this:
[,1] [,2] [,3] [,4]
[1,] 1 1 2 1
[2,] 1 5 6 3
[3,] 2 3 2 1
[4,] 1 1 1 1
I want to "smooth" these values. At this proof-of-concept point, I am fine with using any popular smoothing function. I am currently attempting to use the smooth.spline function:
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list())
by (naively) calling
smoothed <- smooth.spline(myarray)
When I run this, I get this error:
Error in smooth.spline(a) : need at least four unique 'x' values
My array has four or more unique values in each dimension, so I am thinking that I do not know how to properly format the input data. Can someone give me some pointers to this kind of thing? The examples for smooth-like functions seem to work with single-dimension vectors, and I can't seem to extrapolate to the 2-D world. I am an R novice, so please feel free to correct my misuse of terms here!
To do 1-D smoothing along either the rows or the columns, use apply:
apply(myarray, 1, smooth.spline)   # smooth each row
or
apply(myarray, 2, smooth.spline)   # smooth each column
I'm not familiar with 2-D smoothing, but a quick experiment with the fields package seemed to work. You will need to install the package fields and its dependencies. Where myMatrix is the matrix you had above... (I recreate it):
library(fields)
# transform data into x, y and z
m = c(1,1,2,1, 1,5,6,3, 2,3,2,1, 1,1,1,1)
myMatrix = matrix(m, 4, 4, byrow = TRUE)
myMatrix
[,1] [,2] [,3] [,4]
[1,] 1 1 2 1
[2,] 1 5 6 3
[3,] 2 3 2 1
[4,] 1 1 1 1
Z = as.vector(myMatrix)
XY = data.frame(x = as.numeric(gl(4, 1, 16)), y = as.numeric(gl(4, 4, 16)))
fit = Tps(XY, Z)
surface(fit)
Produced a pretty plot.
Smoothing is a big topic, and many functions are available in R itself and via additional packages from places like CRAN. The popular book 'Modern Applied Statistics with S' by Venables and Ripley lists a number of them in Section 8.1 and Figure 8.1 (I think -- my 4th edition is at work):
Polynomial regression: lm(y ~ poly(x))
Natural splines: lm(y ~ ns(x))
Smoothing splines: smooth.spline(x, y)
Lowess: lowess(x, y) (and the newer / preferred loess(x, y))
ksmooth: ksmooth(x, y)
supsmu: supsmu(x, y)
If you install the MASS package that goes with the book, you can run this via the file scripts/ch08.R and experiment yourself.
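As a tiny illustration (my own toy example, not from the book), lowess() applied to one row of the matrix in the question:
row2 <- c(1, 5, 6, 3)                 # second row of the example matrix
sm <- lowess(seq_along(row2), row2)   # default smoother span f = 2/3
sm$y                                  # smoothed values for that row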
Check the fields package (https://github.com/NCAR/fields) and especially the very helpful vignette: https://github.com/NCAR/fields/blob/master/fieldsVignette.pdf
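For a direct 2-D kernel smooth of the matrix itself, here is a minimal sketch using fields::image.smooth(); the argument name theta is assumed from the fields documentation, so check your version:
library(fields)
m <- matrix(c(1,1,2,1, 1,5,6,3, 2,3,2,1, 1,1,1,1), nrow = 4, byrow = TRUE)
smoothed <- image.smooth(m, theta = 1)   # theta controls the kernel bandwidth
smoothed$z                               # the smoothed 4 x 4 surface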
