Predicting LDA topics for new data - r

It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the ambiguity of the earlier question(s), as indicated by the comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed that those questions would not be seeing any new answers.
Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data. Ultimately I would like to extract a smaller set of topics from a very large bag of words and build a classification model using those topics as a few variables in the model. I've had success running LDA on a training set, but the problem I am having is predicting which of those same topics appear in some other test set of data. I am using R's topicmodels package right now, but if there is another way to do this using some other package I am open to that as well.
Here is an example of what I am trying to do:
library(topicmodels)
data(AssociatedPress)
train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]
train.lda <- LDA(train,5)
topics(train.lda)
#how can I predict the most likely topic(s) from "train.lda" for each document in "test"?

With the help of Ben's superior documentation-reading skills, I believe this is possible using the posterior() function:
library(topicmodels)
data(AssociatedPress)
train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]
train.lda <- LDA(train,5)
(train.topics <- topics(train.lda))
# [1] 4 5 5 1 2 3 1 2 1 2 1 3 2 3 3 2 2 5 3 4 5 3 1 2 3 1 4 4 2 5 3 2 4 5 1 5 4 3 1 3 4 3 2 1 4 2 4 3 1 2 4 3 1 1 4 4 5
# [58] 3 5 3 3 5 3 2 3 4 4 3 4 5 1 2 3 4 3 5 5 3 1 2 5 5 3 1 4 2 3 1 3 2 5 4 5 5 1 1 1 4 4 3
test.topics <- posterior(train.lda,test)
(test.topics <- apply(test.topics$topics, 1, which.max))
# [1] 3 5 5 5 2 4 5 4 2 2 3 1 3 3 2 4 3 1 5 3 5 3 1 2 2 3 4 1 2 2 4 4 3 3 5 5 5 2 2 5 2 3 2 3 3 5 5 1 2 2
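The last step above reduces each test document to its single most likely topic. A minimal sketch of that step in base R follows, using a made-up probability matrix (`probs`, a hypothetical stand-in for `posterior(train.lda, test)$topics`) so it runs without topicmodels. It also shows that keeping the full row of probabilities gives one numeric variable per topic, which may suit the classification goal from the question better than a single label.

```r
# Made-up document-by-topic probabilities (3 documents, 3 topics); each row sums to 1.
# This stands in for posterior(train.lda, test)$topics and is NOT real model output.
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6,
                  0.2, 0.5, 0.3),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc", 1:3), paste0("topic", 1:3)))

# Most likely topic per document: index of the largest probability in each row.
top_topic <- apply(probs, 1, which.max)

# Keeping all probabilities yields one numeric feature column per topic.
features <- as.data.frame(probs)
```

Here `top_topic` plays the role of `test.topics` in the answer, while `features` is the kind of table that could be passed on to a classifier.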

Related

Find minimal value for a multiple same keys in table [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 5 years ago.
I have a table which contains multiple rows of different data for each key, where the key is made up of multiple columns.
The table looks like this:
A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2
I also discovered how to remove all of the duplicate rows using the unique command on multiple columns, so data duplication is not a problem.
For every key (columns A and B in the example), I would like to find only the minimum value in the third column (column C in the table).
At the end, the table should look like this:
A B C
1 1 1 2
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
Thanks for any help. It is really appreciated.
If anything is unclear, feel free to ask.
con <- textConnection(" A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2")
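df <- read.table(con, header = TRUE)
# Sort so that the smallest C comes first within each (A, B) key; the result must be assigned back.
df <- df[with(df, order(A, B, C)), ]
# Keep the first row of each (A, B) key, which is now the row with the minimum C.
df[!duplicated(df[1:2]),]
# A B C
# 1 1 1 2
# 4 1 2 4
# 3 2 1 4
# 5 2 2 3
# 6 2 3 1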
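As an alternative sketch, base R's aggregate() can compute the per-key minimum directly, without relying on row order. The data frame below simply re-enters the table from the question; the name `min_by_key` is only illustrative.

```r
# The table from the question, entered directly.
df <- data.frame(A = c(1, 1, 2, 1, 2, 2, 2, 2),
                 B = c(1, 1, 1, 2, 2, 3, 3, 3),
                 C = c(2, 3, 4, 4, 3, 1, 2, 2))

# One row per (A, B) key, holding the smallest C seen for that key.
min_by_key <- aggregate(C ~ A + B, data = df, FUN = min)
min_by_key
```

Note that aggregate() returns only the key columns and the summarised column, so if the table had further columns that need to be kept, the sort-then-deduplicate approach is likely the better fit.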

Accessing data "cell" after reading csv file into R

Newbie, so please be gentle. On Windows 10, I am trying to read a csv file into R (by row (across), if possible), create a 60x4 matrix and access the data by "cell". When I try to access row 2, column 3 (for example), I get ALL of column 3 returned. I only want the one piece of data. What am I doing wrong?
> A <- read.csv("xxx.csv",header=TRUE)
> B <- matrix(A,nrow=60,ncol=4,byrow=TRUE)
> B[2,3]
[[1]]
[1] 1 2 4 2 5 2 2 2 8 9 3 12 2 9 6 12 4 8 6 12 7 9 12 9 4 2 8 3 3 3 1 3 2 2 2 2 1 1 1 1 3 1 1 2 3 1 2 3 4 3 2 1 1 1 2 2 1 1 1
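A likely explanation is that read.csv returns a data frame, which is a list of columns, so matrix(A, ...) builds a list-matrix in which every cell holds an entire column. A minimal sketch of two ways around this follows; the data frame `A` here is a made-up stand-in for the contents of xxx.csv, and it assumes all four columns are numeric.

```r
# Hypothetical stand-in for read.csv("xxx.csv", header = TRUE): 60 rows, 4 numeric columns.
A <- data.frame(V1 = 1:60, V2 = 61:120, V3 = 121:180, V4 = 181:240)

# Option 1: index the data frame directly; this already returns a single value.
cell_df <- A[2, 3]

# Option 2: convert with as.matrix(), which keeps the 60 x 4 layout of the file.
B <- as.matrix(A)
cell_mat <- B[2, 3]
```

Both `cell_df` and `cell_mat` hold the single value from row 2, column 3. Since read.csv already reads the file row by row, no byrow argument is needed.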

Correlate item in dataframe based on character string in r

I currently have the dataset newdat, a dataframe containing item scores:
set.seed(1)
newdat <- setNames(data.frame(replicate(5,sample(1:5,10,replace=TRUE))),paste0("i",1:5))
i1 i2 i3 i4 i5
1 2 2 5 3 5
2 2 1 2 3 4
3 3 4 4 3 4
4 5 2 1 1 3
5 2 4 2 5 3
6 5 3 2 4 4
7 5 4 1 4 1
8 4 5 2 1 3
9 4 2 5 4 4
10 1 4 2 3 4
I also have the character vectors "newCV" and "newDV", which are:
newCV <- c("i3","i2")
newDV <- c("i1")
I am attempting to correlate the DV with all of the items except itself and the items contained in newCV. I have tried the following:
corr<-cor(newdat,use="complete.obs")[-which(colnames(newdat)==c(newCV,newDV)),which(colnames(newdat)==c(newCV,newDV))]
This works if there is nothing in CV, but if there is something in CV I get an error and no results. Any thoughts? Thank you!
If you only want to calculate the specific correlations, you can select which columns to pass to cor:
cor(newdat[newDV], newdat[!(names(newdat) %in% c(newCV, newDV))],
use="complete.obs")
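The reason %in% helps here while == does not can be seen in a small sketch. The names `cols` and `exclude` are illustrative only; `exclude` mirrors c(newCV, newDV) from the question.

```r
cols <- paste0("i", 1:5)
exclude <- c("i3", "i2", "i1")   # same contents as c(newCV, newDV)

# == compares element by element and recycles the shorter vector,
# so it only matches when a name happens to sit at the same position.
by_position <- suppressWarnings(cols == exclude)

# %in% asks, for each column name, whether it appears anywhere in exclude.
by_membership <- cols %in% exclude
```

With these inputs `by_position` flags only "i2", whereas `by_membership` flags "i1", "i2" and "i3", which is the selection the question is after.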

How to get from own survey data in csv to likert stacked bar chart?

Sorry for asking this,
but all the other questions and even the help and howto seem so much more advanced, that the 'simple' thing seems to go uncovered:
I have my own survey data. It is in Excel. My Likert-like scale is coded 0-5:
0 Not considered
1 Very Low
2 Low
3 Medium
4 High
5 Very High
I exported it to a CSV. The headers are the questions, and each line below represents a respondent.
Q1;Q2;Q3;Q4
0;3;3;2
1;0;3;3
2;0;5;4
That should be straightforward, right?
I import it into a data frame:
DFG <- read.csv("export.csv", header = TRUE, sep = ";")
I can see the data fine with print(DFG), and the headers look OK, too. But that's as far as I got. likert complains:
All items (columns) must have the same number of levels
items parameter contains non-factors. Will convert to factors
All columns have the same amount of data, 58 sets, there are no unanswered items.
I'm not even thinking about adding Grouping, I am so far away from a bar-chart, I can't even get this to work. What am I missing?
Step 2:
After applying Heather's solution to my initial problem with
DFG <- lapply(DFG, factor, levels = 0:5)
apparently, now R is aware of Levels (which it wasn't before) and I get a print(DFG) result of
$H1.1..founders..education
[1] 3 4 2 2 5 4 4 3 3 4 3 1 2 2 4 4 4 3 3 4 3 3 4 3 3 3 3 2 3 4 3 2 3 4 5
[36] 2 1 2 3 3 5 5 2 3 3 3 2 3 3 3 3 3 0 4 2 4 2 0
Levels: 0 1 2 3 4 5
$H1.2..founders..past.professional.experience
[1] 5 5 5 4 5 5 5 3 3 5 5 3 2 5 4 5 5 4 4 4 3 5 4 4 3 4 3 3 5 5 4 2 4 4 5
[36] 3 5 4 4 4 5 5 4 4 5 4 3 4 5 4 5 4 3 4 3 5 5 4
Levels: 0 1 2 3 4 5
(shortened)
And if I change that to
DFG <- lapply(DFG, ordered, levels = 0:5)
I can run the net_stacked() script against it. Yay, happy! THANKS.
When you read the data in with read.csv the columns are read in as numeric variables (as you have not specified otherwise).
The function you are using requires the columns to be factors - as they are not, it attempts to convert them to factors. The problem is that it does not know that the possible levels are 0-5 for each item. Therefore each column is converted to a factor with only the observed levels and as the full set of levels is not observed for each item, you get an error because the levels are not the same.
To fix this, convert the variables to factors yourself:
DFG <- lapply(DFG, factor, levels = 0:5)
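A small sketch of the effect, using the three sample rows from the question read from an inline string. One assumption beyond the answer above: assigning into `DFG[]` is used here so that the result stays a data frame, since a plain lapply() returns a list.

```r
# The CSV excerpt from the question, read from text instead of a file.
DFG <- read.csv(text = "Q1;Q2;Q3;Q4
0;3;3;2
1;0;3;3
2;0;5;4", header = TRUE, sep = ";")

# Without explicit levels, each column only gets the values it happens to contain.
observed_levels <- lapply(lapply(DFG, factor), levels)

# With levels = 0:5 every column shares the same six levels.
# DFG[] <- keeps the data.frame structure while replacing each column.
DFG[] <- lapply(DFG, factor, levels = 0:5)
level_counts <- sapply(DFG, nlevels)
```

In `observed_levels` the columns differ (Q1 has 0, 1, 2 while Q2 has only 0 and 3), which is the mismatch behind the error message; after the conversion `level_counts` is 6 for every question.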

Multiple plot by group by one function

I have the following data:
Animal MY Age
1 17.03672067 1
1 17.00833641 2
1 16.97995215 3
1 16.95156788 4
1 16.92318362 5
1 16.88157748 6
2 16.83997133 2
2 16.79836519 3
2 16.75675905 4
2 16.7151529 5
2 16.67354676 6
2 16.63194062 7
3 16.59033447 1
3 16.54872833 2
3 16.50712219 3
3 16.46551604 4
3 16.4239099 5
3 16.38230376 6
4 16.34069761 1
4 16.29909147 2
4 16.25748533 3
4 16.21587918 4
4 16.17427304 5
4 16.1326669 6
I want to draw a scatter plot of MY vs Age for each animal. I use this call:
plot(memo$MY[memo$Animal=="1223100747"]~memo$Age[memo$Animal=="1223100747"])
If I now want to add the same plot (MY vs Age) for another animal, I just need to use the function lines.
However, since I have about 200 animals, I do not want to do this manually 100 times. My question is: how can I plot these different animals with one function call, instead of using lines, lines, ... lines?
Regards,
Phuong
You can use by, for example:
by(memo,memo$Animal,FUN=function(x) plot(x$MY~x$Age))
You could use a loop or a matplot if you want to use base R, but I advise you to use package ggplot2.
DF <- read.table(text="Animal MY Age
1 17.03672067 1
1 17.00833641 2
1 16.97995215 3
1 16.95156788 4
1 16.92318362 5
1 16.88157748 6
2 16.83997133 2
2 16.79836519 3
2 16.75675905 4
2 16.7151529 5
2 16.67354676 6
2 16.63194062 7
3 16.59033447 1
3 16.54872833 2
3 16.50712219 3
3 16.46551604 4
3 16.4239099 5
3 16.38230376 6
4 16.34069761 1
4 16.29909147 2
4 16.25748533 3
4 16.21587918 4
4 16.17427304 5
4 16.1326669 6",header=TRUE)
library(ggplot2)
DF$Animal <- factor(DF$Animal)
# MY on the y-axis against Age on the x-axis, one coloured line per animal
p1 <- ggplot(DF,aes(x=Age,y=MY,colour=Animal)) + geom_line()
print(p1)
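For completeness, the base R loop mentioned above could look roughly like the following sketch. It uses a shortened, rounded subset of the question's data (two animals, three rows each) purely to keep the example small.

```r
# Small rounded subset of the question's data, for illustration only.
DF <- data.frame(
  Animal = rep(c(1, 2), each = 3),
  MY     = c(17.04, 17.01, 16.98, 16.84, 16.80, 16.76),
  Age    = c(1, 2, 3, 2, 3, 4)
)

# Draw an empty canvas that spans all animals, then add one line per animal.
plot(DF$Age, DF$MY, type = "n", xlab = "Age", ylab = "MY")
groups <- split(DF, DF$Animal)
for (i in seq_along(groups)) {
  lines(groups[[i]]$Age, groups[[i]]$MY, col = i)
}
```

Setting up the axes once with type = "n" avoids the problem of the first animal's plot fixing the axis limits for all later lines.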
