How to use LSA for dimension reduction in text analytics with R

I am a beginner at data science, and I am working on a text analytics/sentiment analysis project with tweets.
What I have been trying to do is perform some dimension reduction on my training set of tweets, feed the reduced training set into a NaiveBayes learner, and use the trained NaiveBayes model to predict the sentiment of the tweets in the test set.
I have been following the steps in this article:
http://www.analyticskhoj.com/data-mining/text-analytics-part-iv-cluster-analysis-on-terms-and-documents-using-r/
Their explanation is a bit too brief for a beginner like me.
I have used lsa() to create what RStudio labels a "Large LSAspace (3 elements)". Following their example, I've created 3 more data frames:
lsa.train.tk = as.data.frame(lsa.train$tk)
lsa.train.dk = as.data.frame(lsa.train$dk)
lsa.train.sk = as.data.frame(lsa.train$sk)
When I view the lsa.train.tk data, it looks like this (lsa.train.dk looks pretty similar to this matrix):
and my lsa.train.sk looks like the following:
My question is, how do I interpret this information?
How can I use this information to create something that I can feed into my NaiveBayes learner? I tried just using lsa.train.sk for the NaiveBayes learner, but I cannot think of a good explanation that justifies what I've tried. Any help would be much appreciated!
EDIT:
What I've done so far:
Convert everything into a term-document matrix
Pass the matrix into the NaiveBayes learner
Predict using the trained model
My problems are:
Accuracy is only 50%... and I realized that it labels everything as positive sentiment (so I would have gotten 0% accuracy if my test set contained only negative sentiment tweets).
The current code is not scalable. Since it uses large matrices, I can only handle up to 3.5k rows of data; with more than that, my computer crashes. That is why I want to do dimension reduction, so that I can handle more data (such as 10k or 100k rows of tweets).
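One way to read the three pieces, as a minimal sketch rather than a definitive recipe: as far as the lsa package documents it, lsa() computes a truncated SVD of the term-document matrix, so tk holds the term vectors, dk the document vectors, and sk the singular values. The sketch below uses base R svd() as a stand-in for lsa() so that it runs without extra packages; the toy matrices tdm_train and tdm_test and the choice k = 2 are assumptions made up for illustration. It suggests that dk (one row per tweet, k columns) is the natural reduced feature matrix for a classifier, while sk alone carries no per-tweet information.

```r
# Toy term-document matrix (hypothetical data): rows are terms, columns are tweets.
tdm_train <- matrix(
  c(2, 0, 1, 0,
    1, 0, 0, 1,
    0, 3, 0, 1,
    0, 1, 2, 0,
    1, 0, 0, 2),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("good", "great", "bad", "awful", "movie"),
                  paste0("tweet", 1:4))
)

k <- 2              # number of latent dimensions to keep (an assumed value)
s <- svd(tdm_train) # base R stand-in for lsa(); truncated by hand below

tk <- s$u[, 1:k, drop = FALSE]  # term vectors     (terms x k), plays the role of lsa()$tk
sk <- s$d[1:k]                  # singular values  (length k),  plays the role of lsa()$sk
dk <- s$v[, 1:k, drop = FALSE]  # document vectors (tweets x k), plays the role of lsa()$dk

# Reduced training features: one row per tweet, k columns instead of one per term.
train_features <- dk

# Unseen tweets must be counted over the SAME terms, then projected into the
# training space: d_new = t(x_new) %*% tk %*% diag(1 / sk)
tdm_test <- matrix(
  c(1, 0, 0, 0, 1,
    0, 2, 1, 0, 0),
  nrow = 5,
  dimnames = list(rownames(tdm_train), c("new1", "new2"))
)
test_features <- t(tdm_test) %*% tk %*% diag(1 / sk, nrow = k)

# Sanity check: projecting the training tweets reproduces dk.
reprojected <- unname(t(tdm_train) %*% tk %*% diag(1 / sk, nrow = k))
```

train_features (with the sentiment labels) would then go to the learner, and test_features to predict. With thousands of tweets, a sparse matrix representation before the decomposition is probably also needed to address the memory problem.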

Related

Specification of a mixed model using glmmLasso package

I have a dataset containing repeated measures and quite a lot of variables per observation. Therefore, I need to find a way to select explanatory variables in a smart way. Regularized Regression methods sound good to me to address this problem.
Upon looking for a solution, I found out about the glmmLasso package quite recently. However, I have difficulties defining a model. I found a demo file online, but since I'm a beginner with R, I had a hard time understanding it.
(demo: https://rdrr.io/cran/glmmLasso/src/demo/glmmLasso-soccer.r)
Since I cannot share the original data, I would suggest you use the soccer dataset (the same dataset used in glmmLasso demo file). The variable team is repeated in observations and should be taken as a random effect.
# sample data
library(glmmLasso)
data("soccer")
I would appreciate it if you could explain the parameters lambda and family, and how to tune them.

Keras predict repeated columns

I have a question related to keras model code in R. I have finished training the model and need to predict. Predicting a single row is very fast, but my data has 2,000,000,000 rows and nearly 200 columns, with a structure like the attached image.
Datastructure
Does anyone have suggestions on which method to use so that predict runs quickly and uses less memory? To predict, I create matrices according to the table shown; each matrix has 200,000x200 dimensions. Then I use sapply to predict on all the remaining matrices. However, even though predict is fast for each matrix, creating the matrix is slow, so the model takes two or three times as long to run, and that is without counting the sapply step. I wonder if keras has a "smart" way to know that in each of these matrices the last N columns are exactly the same? I googled and saw someone mention RepeatVector, but I don't quite understand it, and it seems to be used only for training. I already have the model and just need to predict.
Thank you so much everyone!
One of the most performant ways to feed keras models locally is by creating a tf.data.Dataset object. Please take a look at the tfdatasets R package for guides and example usage.

When applying SVM classifier to unseen new data, I encounter an error message. (R user)

Thanks for your interest and help.
I built a kernel SVM classifier in R, using a training dataset of 30,000 rows.
I used around 2,000 word features to train the classifier. It worked very well.
But when I tried to apply the classifier to a new text dataset, a problem occurred:
the document-term matrix of the new text does not contain all 2,000 word features (columns) that the classifier expects.
Of course, I can build a classifier with a smaller number of word features. Then it works on the new text data, but the performance is not that good.
So, how do you solve the problem that the new text dataset does not have all the word features used by the SVM classifier?
I am asking this question and answering it myself for the benefit of other users.
I may have found the solution.
The problem is that the columns (word features) in the DTM of the training set and the DTM of the unseen dataset are different.
So, when building the DTM of the unseen dataset, use the word features of the training set's DTM as a dictionary.
For example,
features <- trainset_dtm$dimnames$Terms
unseen_dtm <- DocumentTermMatrix(unseen_corpus, control = list(dictionary=features))
Now the columns in both DTMs (train / unseen) are the same, so the SVM works on unseen_dtm.
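The dictionary approach can be sketched end to end as follows. This is a minimal illustration and assumes the tm package is installed; the two tiny corpora are made up, and Terms() is used here as tm's accessor for the same values as trainset_dtm$dimnames$Terms.

```r
library(tm)

# Hypothetical toy corpora standing in for the real training and unseen data.
train_corpus  <- VCorpus(VectorSource(c("the service was great",
                                        "terrible food and slow service")))
unseen_corpus <- VCorpus(VectorSource(c("great food",
                                        "slow and rude staff")))

# Build the training DTM and take its terms as the dictionary.
trainset_dtm <- DocumentTermMatrix(train_corpus)
features     <- Terms(trainset_dtm)

# Restrict the unseen DTM to exactly those terms: words the classifier never
# saw ("rude", "staff") are dropped, and missing training words get zero counts.
unseen_dtm <- DocumentTermMatrix(unseen_corpus,
                                 control = list(dictionary = features))

# Both matrices now share the same set of columns.
same_terms <- identical(sort(Terms(unseen_dtm)), sort(Terms(trainset_dtm)))
```

Any preprocessing applied to the training corpus (stemming, stop word removal, weighting) presumably has to be applied to the unseen corpus in the same way, otherwise the dictionary terms will not match the new tokens.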

Smarter than an Eighth grader? Kaggle AI Challenge. R

I am working on the Allen AI Science Challenge currently up on Kaggle.
The idea behind the challenge is to train a model using the training data provided (a set of eighth-grade-level science questions, each with four answer options and the correct answer) along with any additional knowledge sources (Wikipedia, science textbooks, etc.) so that it can answer science questions as well as an (average?) eighth grader can.
I'm thinking of taking a first crack at the problem in R (I am proficient only in R and C++, and I don't think C++ will be a very useful language for this problem). After exploring the Kaggle forums, I decided to use the tm (text mining), RWeka and topicmodels (Latent Dirichlet Allocation, LDA) packages.
My current approach is to build a text predictor of some sort which, on reading the question posed to it, outputs a string of text. I would then compute the cosine similarity between this output text and each of the four options given in the test set, and predict the option with the highest cosine similarity as the correct one.
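The cosine-similarity step of this approach can be sketched in base R as below. This is only an illustration of the scoring step: the predicted text, the four options, and the helper names tf_vector and cosine_sim are all hypothetical, and plain term counts are assumed as the text representation.

```r
# Term-frequency vector of a text over a fixed vocabulary (hypothetical helper).
tf_vector <- function(text, vocab) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  as.numeric(table(factor(tokens, levels = vocab)))
}

# Cosine similarity between two numeric vectors; 0 if either vector is all zeros.
cosine_sim <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) return(0)
  sum(a * b) / denom
}

# Made-up model output and answer options.
predicted_text <- "plants use sunlight to make food"
options <- c(A = "animals eat plants",
             B = "plants make food from sunlight",
             C = "rocks are made of minerals",
             D = "water freezes into ice")

# Shared vocabulary across the predicted text and all options.
vocab <- sort(unique(unlist(strsplit(tolower(c(predicted_text, options)), "[^a-z]+"))))
vocab <- vocab[vocab != ""]

# Score each option against the predicted text and pick the most similar one.
pred_vec <- tf_vector(predicted_text, vocab)
scores   <- sapply(options, function(o) cosine_sim(pred_vec, tf_vector(o, vocab)))
best     <- names(which.max(scores))
```

Here option B shares four of its five words with the predicted text, so it gets the highest score. The same scoring would apply unchanged if the count vectors were replaced by TF-IDF weights or topic proportions from LDA.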
I will train the model using the training data, a Wikipedia corpus, and a few science textbooks so that the model does not overfit.
I have two questions here:
Does the overall approach make sense?
What would be a good starting point for building this text predictor? Will converting the corpus (training data, Wikipedia and textbooks) to a term-document/document-term matrix help? I think forming n-grams for all the sources would help, but I don't know what the next step would be, i.e. how exactly the model will predict and output a string of text (of, say, size n) on reading a question.
I have tried implementing part of the approach: finding the optimum number of topics and performing LDA over the training set. Here's the code:
library(topicmodels)
library(RTextTools)
data<-read.delim("cleanset.txt", header = TRUE)
data$question<-as.character(data$question)
data$answerA<-as.character(data$answerA)
data$answerB<-as.character(data$answerB)
data$answerC<-as.character(data$answerC)
data$answerD<-as.character(data$answerD)
matrix <- create_matrix(cbind(as.vector(data$question),as.vector(data$answerA),as.vector(data$answerB),as.vector(data$answerC),as.vector(data$answerD)), language="english", removeNumbers=FALSE, stemWords=TRUE, weighting = tm::weightTf)
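# Fit one LDA model for each candidate number of topics, k = 2..25
best.model <- lapply(seq(2, 25, by = 1), function(k) { LDA(matrix, k) })
# Collect the log-likelihood of every fitted model
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics = c(2:25), LL = as.numeric(as.matrix(best.model.logLik)))
# Show the number of topics with the highest log-likelihood
best.model.logLik.df[which.max(best.model.logLik.df$LL), ]
# Refit with 25 topics (hard-coded; presumably the value selected above)
best.model.lda <- LDA(matrix, 25)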
Any help will be appreciated!

Random forest on a big dataset

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data, down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM on my machine and random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.
You're likely asking randomForest to create the proximity matrix for the data, which if you think about it, will be insanely big: 1 million x 1 million. A matrix this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where n,n) is found is in calculating the proximity matrix.
But it's hard to help more, given that you've provided no details about the actual code you're using.
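Since the original code was not posted, the following is only a sketch of what a call without the proximity matrix might look like, assuming the randomForest package is installed. The simulated data and the particular values of ntree, sampsize and nodesize are assumptions chosen for illustration, not tuned recommendations.

```r
library(randomForest)

set.seed(42)

# Simulated stand-in for the real data: 5 predictors and a numeric response.
n <- 2000
x <- data.frame(matrix(rnorm(n * 5), ncol = 5))
y <- x[[1]] + 2 * x[[2]] + rnorm(n, sd = 0.1)

# proximity = FALSE avoids allocating the n x n proximity matrix.
# A smaller sampsize and a larger nodesize keep each tree smaller in memory.
rf <- randomForest(x = x, y = y,
                   ntree = 50,
                   sampsize = 500,
                   nodesize = 20,
                   proximity = FALSE)

preds <- predict(rf, x)
```

Passing x and y directly instead of a formula is also commonly suggested for large data, since the formula interface builds an extra copy of the data.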
