H2O: how to use gradient boosting on textual data? - r

I am trying to implement a very simple ML problem, where I use text to predict some outcome. In R, a basic example would be:
1. Import some fake but funny text data:
library(caret)
library(dplyr)
library(text2vec)
dataframe <- data_frame(id = c(1, 2, 3, 4),
                        text = c("this is a this", "this is another", "hello", "what???"),
                        value = c(200, 400, 120, 300),
                        output = c("win", "lose", "win", "lose"))
> dataframe
# A tibble: 4 x 4
     id text            value output
  <dbl> <chr>           <dbl> <chr>
1     1 this is a this    200 win
2     2 this is another   400 lose
3     3 hello             120 win
4     4 what???           300 lose
2. Use text2vec to get a sparse matrix representation of my text (see also https://github.com/dselivanov/text2vec/blob/master/vignettes/text-vectorization.Rmd):
#these are text2vec functions to tokenize and lowercase the text
prep_fun = tolower
tok_fun = word_tokenizer
#create the tokens
train_tokens = dataframe$text %>%
  prep_fun %>%
  tok_fun
it_train = itoken(train_tokens)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
> dtm_train
4 x 6 sparse Matrix of class "dgCMatrix"
what hello another a is this
1 . . . 1 1 2
2 . . 1 . 1 1
3 . 1 . . . .
4 1 . . . . .
3. Finally, train the algorithm (for instance, using caret) to predict output from my sparse matrix.
mymodel <- train(x=dtm_train, y =dataframe$output, method="xgbTree")
> confusionMatrix(mymodel)
Bootstrapped (25 reps) Confusion Matrix
(entries are percentual average cell counts across resamples)

          Reference
Prediction lose  win
      lose 17.6 44.1
      win  29.4  8.8

 Accuracy (average) : 0.264
My problem is:
I see how to import data into H2O (point 1.) using spark_read_csv, rsparkling and as_h2o_frame.
However, for points 2. and 3. above I am completely lost.
Can someone please give me some hints, or tell me whether this approach is even possible with H2O?
Many thanks!!

You can solve this one of two ways: 1. in R first and then move to H2O for modeling, or 2. entirely in H2O using H2O's word2vec implementation.
1. Use R data.frames and text2vec, then convert the sparse matrix to an H2OFrame and do the modeling in H2O.
# Use the same code as above to get to this point, then:
library(h2o)
h2o.init()

# Convert the dgCMatrix to an H2OFrame and cbind the response column
train <- as.h2o(dtm_train)
train$y <- as.h2o(dataframe$output)

# Train any H2O model (e.g. GBM)
mymodel <- h2o.gbm(y = "y", training_frame = train,
                   distribution = "bernoulli", seed = 1)
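If you want the rough analogue of the caret confusionMatrix() output from the question, H2O has performance helpers; the snippet below evaluates on the training frame itself, so it is purely illustrative:
h2o.confusionMatrix(mymodel)            # confusion matrix on the training data
perf <- h2o.performance(mymodel, train)
h2o.auc(perf)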
2. Or you can train a word2vec embedding in H2O and apply it to your text to get document vectors that play the role of the sparse matrix above, then train an H2O machine learning model (GBM). I will try to edit this answer later with a working example using your data, but in the meantime, here is an example demonstrating the use of H2O's word2vec functionality in R.
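A rough sketch of how that word2vec route could look with the toy data above; the function names (h2o.tokenize, h2o.word2vec, h2o.transform) follow recent h2o releases and the parameter values are only illustrative, so treat this as an assumption-laden outline rather than a tested example:
library(h2o)
h2o.init()

text_hf <- as.h2o(dataframe)
# tokenize and lowercase; h2o.tokenize inserts NA rows that mark document boundaries
words <- h2o.tokenize(h2o.ascharacter(text_hf$text), "\\\\W+")
words <- h2o.tolower(words)

# train a small word2vec model (tiny settings because the toy corpus is tiny)
w2v <- h2o.word2vec(words, vec_size = 10, min_word_freq = 1, epochs = 5)

# average the word vectors of each document to get one row per original text
doc_vecs <- h2o.transform(w2v, words, aggregate_method = "AVERAGE")

train2 <- h2o.cbind(doc_vecs, text_hf["output"])
mymodel2 <- h2o.gbm(y = "output", training_frame = train2,
                    distribution = "bernoulli", seed = 1)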

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69

Clustering table:
   1    2   3  4
   0 2271 896 91

Mixing probabilities:
        1         2         3         4
0.2807685 0.4342499 0.2544305 0.0305511

Means:
       1        2        3        4
1381.391 1381.715 1574.335 1851.667

Variances:
       1        2        3        4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0 observations), especially since its mean is nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model; using, for example, modelNames="V", or leaving the argument at its default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem lies or whether I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike in the histogram):
library(mclust)
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value, br=50)
abline(v=fit$parameters$mean,
       col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"), lty=8)
Briefly, mclust (a Gaussian mixture model, GMM) is a probabilistic model: it estimates the mean and variance of each cluster and also the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The likelihood of the model combines, for each data point, its weighted density under every cluster; you can also check mclust's publication for the details.
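For reference, the quantity mclust maximises is the standard mixture log-likelihood (for the univariate "E" model all $\sigma_k^2$ are constrained to be equal):
$\log L = \sum_{i=1}^{n} \log \sum_{k=1}^{G} \pi_k \, \phi(x_i \mid \mu_k, \sigma_k^2)$
where the $\pi_k$ are the mixing proportions reported in fit$parameters$pro and $\phi$ is the normal density.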
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point lying around the means of clusters 1 and 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it is a pitfall, at least with this implementation of GMM. I would say it is a bit of overfitting, but you can basically keep only the clusters that have members.
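For example, to see which components actually receive observations and keep only their parameters (a small snippet based on the fit object above):
table(fit$classification)                  # component 1 receives no observations
used <- sort(unique(fit$classification))
fit$parameters$mean[used]                  # means of the non-empty components
fit$parameters$pro[used]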
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with equal (spherical) variances, consider using a fuzzy k-means instead:
library(ClusterR)
# the fitting call was missing from the original snippet; this line is assumed
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.

How to get 95% CIs for the coefficients of a linear model using the simpleboot package

I'm trying to fit a linear model (a basic linear regression with 4 predictors) with lm(). This all works fine.
What I want to do now is bootstrap the model. After a quick search on Google I found the package simpleboot, which seemed quite easy to understand.
I can easily bootstrap the lm.object using something like this:
boot_mod <- lm.boot(mod,R=100,rows=TRUE)
and afterwards print the object boot_mod.
I can also access the list that contains the coefficients for each bootstrap sample, among other metrics such as RSS, R² and so on.
Can anyone tell me how I can save all coefficients from the boot list in a list or dataframe?
Ideally, the result would look like this:
boot_coef
sample  coef1  coef2  coef3 ...
     1    1.1    1.4   ...
     2    1.2    1.5   ...
     3    1.3    1.6   ...
library(tidyverse)
library(simpleboot)
### Some Dummy-Data in a dataframe
a <- c(3,4,5,6,7,9,13,12)
b <- c(5,9,14,22,12,5,12,18)
c <- c(7,2,8,7,12,5,3,1)
df <- as_data_frame(list(x1=a,x2=b,y=c))
### Linear model
mod <- lm(y~x1+x2,data=df)
### Bootstrap
boot_mod <- lm.boot(mod,R=10,rows = TRUE)
You can also use the function samples() from the same simpleboot package:
Given the output from either lm.boot or loess.boot, you can specify what kind of information you want to extract:
samples(object, name = c("fitted", "coef", "rsquare", "rss"))
It outputs either a vector or matrix depending on the entity extracted.
Source:
https://rdrr.io/cran/simpleboot/man/samples.html
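For example, to get the layout requested in the question (one row per bootstrap sample), something like the following should work; this assumes samples() returns the coefficients with one column per bootstrap replicate, as described in its documentation:
coef_mat <- samples(boot_mod, name = "coef")   # coefficients, one column per bootstrap sample
boot_coef <- data.frame(sample = seq_len(ncol(coef_mat)), t(coef_mat))
head(boot_coef)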
Here is a tidyverse option to save all coefficients from the boot.list:
library(tidyverse)
as.data.frame(boot_mod$boot.list) %>%
  select(ends_with("coef")) %>%         # select coefficients
  t(.) %>% as.data.frame(.) %>%         # one model per row
  rownames_to_column("Sample") %>%      # set sample column
  mutate(Sample = parse_number(Sample))
# output
Sample (Intercept) x1 x2
1 1 5.562417 -0.2806786 0.12219191
2 2 8.261905 -0.8333333 0.54761905
3 3 9.406171 -0.5863124 0.07783740
4 4 8.996784 -0.6040479 0.06737891
5 5 10.908036 -0.7249561 -0.03091908
6 6 8.914262 -0.5094340 0.05549390
7 7 7.947724 -0.2501127 -0.08607481
8 8 6.255539 -0.2033771 0.07463971
9 9 5.676581 -0.2668020 0.08236743
10 10 10.118126 -0.4955047 0.01233728

Why is LSA in text2vec producing different results every time?

I was using latent semantic analysis in the text2vec package to generate word vectors, and using transform() to fit new data, when I noticed something odd: the spaces are not lined up when trained on the same data.
There appears to be some inconsistency (or randomness?) in the method. Namely, even when re-running an LSA model on the exact same data, the resulting word vectors are wildly different, despite identical input. When looking around I only found these old closed GitHub issues link link and a mention in the changelog about LSA being cleaned up. I reproduced the behaviour using the movie_review dataset and (slightly modified) code from the documentation:
library(text2vec)
packageVersion("text2vec") # ‘0.5.1’
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# edit: make tcm symmetric:
tcm = tcm + Matrix::t(Matrix::triu(tcm))
n_topics = 10
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(tcm)
lsa_2 = LatentSemanticAnalysis$new(n_topics)
d2 = lsa_2$fit_transform(tcm)
# despite being trained on the same data, words have completely different vectors:
sim2(d1["film",,drop=F], d2["film",,drop=F])
# yields values like -0.993363 but sometimes 0.9888435 (should be 1)
mean(diag(sim2(d1, d2)))
# e.g. -0.2316826
hist(diag(sim2(d1, d2)), main="self-similarity between models")
# note: these numbers are different every time!
# But: within each model, results seem consistent and reasonable:
# top similar words for "film":
head(sort(sim2(d1, d1["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873934 0.9803280 0.9732380 0.9680488 0.9668800
# same in the second model:
head(sort(sim2(d2, d2["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873935 0.9803279 0.9732364 0.9680495 0.9668819
# transform works:
sim2(d2["film",,drop=F], transform(tcm["film",,drop=F], lsa_2 )) # yields 1
# LSA in quanteda doesn't have this problem, same data => same vectors
library(quanteda)
d1q = textmodel_lsa(as.dfm(tcm), 10)
d2q = textmodel_lsa(as.dfm(tcm), 10)
mean(diag(sim2(d1q$docs, d2q$docs))) # yields 1
# the top synonyms for "film" are also a bit different with quanteda's LSA
# film movie hunk show territory bay
# 1.0000000 0.9770574 0.9675766 0.9642915 0.9577723 0.9573138
What's the deal? Is it a bug, is this intended behaviour for some reason, or am I having a massive misunderstanding? (I'm kind of hoping for the latter...) If it's intended, why would quanteda behave differently?
The issue is that your matrix seems ill-conditioned and hence you have numerical stability issues.
library(text2vec)
library(magrittr)
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# condition number
kappa(tcm)
# Inf
Now if you do a truncated SVD (the algorithm behind LSA), you will notice that some of the singular vectors are very close to zero:
library(irlba)
truncated_svd = irlba(tcm, 10)
str(truncated_svd)
# $ d : num [1:10] 2139 1444 660 559 425 ...
# $ u : num [1:4387, 1:10] -1.44e-04 -1.62e-04 -7.77e-05 -8.44e-04 -8.99e-04 ...
# $ v : num [1:4387, 1:10] 6.98e-20 2.37e-20 4.09e-20 -4.73e-20 6.62e-20 ...
# $ iter : num 3
# $ mprod: num 50
Hence the sign of the embeddings is not stable, and the cosine angle between them is not stable either.
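If you just need the two fits to be comparable, and assuming they differ mainly by per-dimension sign flips (which the near-±1 similarities above suggest), you can align the signs before comparing. This is only a diagnostic sketch, not a text2vec feature:
# flip each column of d2 so it points in the same direction as the matching column of d1
signs <- sign(colSums(d1 * d2))
d2_aligned <- sweep(d2, 2, signs, `*`)
mean(diag(sim2(d1, d2_aligned)))   # should now be much closer to 1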
Similar to TruncatedSVD in Python's sklearn, the truncated SVD functions used in R rely on randomized algorithms. That randomness is part of what makes them so powerful for building large models, but it can be awkward for smaller uses. If you call set.seed() before the SVD is computed, you shouldn't have an issue. This used to terrify me when doing LSA.
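A minimal sketch of what that looks like with the code from the question, assuming the randomness indeed comes from the truncated SVD step inside fit_transform():
set.seed(42)
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(tcm)

set.seed(42)
lsa_2 = LatentSemanticAnalysis$new(n_topics)
d2 = lsa_2$fit_transform(tcm)

mean(diag(sim2(d1, d2)))  # with the same seed the two spaces should now match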
Let me know if that helps!

R - LDA Topic Model Output Data

I'm working on building some topic models in R using the 'topicmodels' package. After pre-processing and creating a document-term matrix, I am applying the following LDA Gibbs model. This may have a simple answer, but I'm a newbie to R, so here it goes. Is there a way to export the topic and term lists, along with their probabilities, to a text file or Excel file? I can print them in R (as below), but don't know how to export :(
This is mainly so I can do some visualisation, which I'm sure can be done in Excel, but as I mentioned I'm a newbie and don't have much time to learn visualisation techniques in R. Hope this makes sense.
library(topicmodels)

k = 33
burnin = 1000
iter = 1000
keep = 50
seed = 2003
model_lda <- LDA(myDtm, k = k, method = "Gibbs",control = list(seed = seed, burnin = burnin, iter = iter, keep = keep))
print(model_lda)
save(model_lda, file = "LDA_Output.RData")
topics(model_lda, 5)
terms(model_lda, 15)
     Topic 1   Topic 2    Topic 3       Topic 4   Topic 5    Topic 6    Topic 7
[1,] "seat"    "dialogu"  "websit"      "census"  "northern" "growth"   "hse"
[2,] "resum"   "church"   "partnership" "disabl"  "univers"  "adjust"   "legisl"
[3,] "suspend" "congreg"  "nesc"        "cso"     "peac"     "forecast" "die"
[4,] "adjourn" "school"   "site"        "statist" "unemploy" "bernard"  "legal"
[5,] "fisheri" "survivor" "nesf"        "survey"  "polic"    "burton"   "child"
You can use the tidytext R package to tidy the model output and then readr to write it to CSV. For example:
library(tidytext)
readr::write_csv(tidy(model_lda, "beta"), "beta.csv")
readr::write_csv(tidy(model_lda, "gamma"), "gamma.csv")
The above code should save your beta matrix and gamma matrix in beta.csv and gamma.csv, respectively.
You can also find a chapter that was helpful for me here: http://tidytextmining.com/topicmodeling.html
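If you only need the term and topic lists themselves (rather than the full probability matrices), plain write.csv() on the terms()/topics() output is a simple base-R alternative:
# terms() returns a character matrix (terms x topics); topics() a matrix of top topics per document
write.csv(terms(model_lda, 15), "terms.csv", row.names = FALSE)
write.csv(topics(model_lda, 5), "topics.csv")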

Manually conduct leave-one-out cross validation for a GLMM using a for() loop in R

I am trying to build a for() loop to manually conduct leave-one-out cross-validation for a GLMM fit with the lmer() function from the lme4 package. I need to remove an individual, fit the model, use the beta coefficients to predict a response for the individual that was withheld, and repeat the process for all individuals.
I have created some test data to tackle the first step of simply leaving an individual out, fitting the model and repeating for all individuals in a for() loop.
The data have a binary (0,1) Response, an IndID that classifies 4 individuals, a Time variable, and a Binary variable. There are N=100 observations. The IndID is fit as a random effect.
require(lme4)
#Make data
Response <- round(runif(100, 0, 1))
IndID <- as.character(rep(c("AAA", "BBB", "CCC", "DDD"),25))
Time <- round(runif(100, 2,50))
Binary <- round(runif(100, 0, 1))
#Make data.frame
Data <- data.frame(Response, IndID, Time, Binary)
Data <- Data[with(Data, order(IndID)), ] #**Edit**: Added code to sort by IndID
#Look at head()
head(Data)
  Response IndID Time Binary
1        0   AAA   31      1
2        1   BBB   34      1
3        1   CCC    6      1
4        0   DDD   48      1
5        1   AAA   36      1
6        0   BBB   46      1
#Build model with all IndID's
#(with a binomial response this is a GLMM, so glmer() rather than lmer() is the call in current lme4)
fit <- glmer(Response ~ Time + Binary + (1|IndID), data = Data,
             family = binomial)
summary(fit)
As stated above, my hope is to get four model fits – one with each IndID left out in a for() loop. This is a new type of application of the for() command for me and I quickly reached my coding abilities. My attempt is below.
fit <- list()
for (i in Data$IndID){
fit[[i]] <- lmer(Response ~ Time + Binary + (1|IndID), data = Data[-i],
family=binomial)
}
I am not sure storing the model fits as a list is the best option, but I had seen it on a few other help pages. The above attempt results in the error:
Error in -i : invalid argument to unary operator
If I remove the [-i] subscript from the data = Data argument, the code runs four fits, but the data for each individual are not removed.
Just as an FYI, I will need to further expand the loop to:
1) extract the beta coefficients, 2) apply them to the X matrix of the individual that was withheld, and lastly 3) compare the predicted values (after a logit transformation) to the observed values. As all steps are needed for each IndID, I hope to build them into the loop. I am providing the extra details in case my planned future steps inform the more immediate question of leave-one-out model fits.
Thanks as always!
The problem you are having is that Data[-i] expects i to be an integer index. Instead, i is either "AAA", "BBB", "CCC" or "DDD". To fix the loop, set
data = Data[Data$IndID != i, ]
in your model fit.
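A minimal sketch of the corrected loop, including the prediction step the question plans to add next; it uses glmer() for the binomial GLMM and population-level predictions for the withheld individual (its random effect is unknown), and these details are assumptions rather than part of the original answer:
library(lme4)

fits  <- list()
preds <- list()
for (i in unique(Data$IndID)) {
  train_i <- Data[Data$IndID != i, ]   # drop individual i
  test_i  <- Data[Data$IndID == i, ]   # the withheld individual
  fits[[i]] <- glmer(Response ~ Time + Binary + (1 | IndID),
                     data = train_i, family = binomial)
  # predicted probabilities on the response scale, ignoring the (unknown) random effect
  preds[[i]] <- predict(fits[[i]], newdata = test_i, re.form = NA, type = "response")
}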
