Topicmodels transposes the term document matrix - r

I am trying to run an LDA using the topicmodels package in R. The example given in the manual uses Associated Press data and works nicely. However, when I try it on my own data I get topics whose terms are the document names. I have traced the problem to the fact that my term document matrix is the transpose of what it should be (rows and columns are swapped).
The example TDM:
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : int [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
Whereas my TDM has Terms as rows and Docs as columns:
List of 6
$ i : int [1:10489] 1 3 4 13 20 24 25 26 27 28 ...
$ j : int [1:10489] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:10489] 1 1 1 1 2 1 67 1 44 3 ...
$ nrow : int 5903
$ ncol : int 9
$ dimnames:List of 2
..$ Terms: chr [1:5903] "\u2439aa" "aars" "\u2439ab" "\u242dab" ...
..$ Docs : chr [1:9] "art111130.txt" "art111131.txt" "art111132.txt" "art111133.txt" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
Which is causing LDA(art_tdm,3) to build topics based on doc names, not terms within docs. Is this a change in the codebase of the tm package? I can't imagine what I would be doing to cause this transposition in my code:
art_cor<-Corpus(DirSource(directory = "tmptxts"))
art_tdm<-TermDocumentMatrix(art_cor)
Any help would be appreciated.

Your object has class "TermDocumentMatrix" (terms in rows), whereas the example has class "DocumentTermMatrix" (documents in rows).
You probably just need to do this:
art_tdm<-DocumentTermMatrix(art_cor)
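To see the difference between the two classes, here is a minimal sketch using a made-up two-document corpus (the texts and object names are assumptions for illustration, not the asker's data). It builds both matrices from the same corpus and also shows that tm's t() method turns one orientation into the other.

```r
library(tm)

# Tiny made-up corpus standing in for the files in "tmptxts"
docs <- c(a = "apples and pears", b = "pears and plums")
corp <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corp)  # terms in rows, documents in columns
dtm <- DocumentTermMatrix(corp)  # documents in rows, terms in columns

dim(tdm)
dim(dtm)

# Transposing an existing TermDocumentMatrix also yields a DocumentTermMatrix
dtm_from_tdm <- t(tdm)
class(dtm_from_tdm)
```

Either dtm or dtm_from_tdm has documents in rows, which is the orientation of the AssociatedPress example.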

Related

Building a table/dataframe/something exportable from Desc function output in R

I'm definitely a noob, though I have used R for various small tasks for several years.
For the life of me, I cannot figure out how to get the results from the "Desc" function into something I can work with. When I save the x<-Desc(mydata) the class(x) shows up as "Desc." In R studio it is under Values and says "List of 1." Then when I click on x it says ":List of 25" in the first line. There is a list of data in this object, but I cannot for the life of me figure out how to grab any of it.
Clearly I have a severe misunderstanding of the R data structures, but I have been searching for the past 90 minutes to no avail so figured I would reach out.
In short, I just want to pull certain aspects (N, mean, UB, LB, median) of the descriptive statistics provided from the Desc results for multiple datasets and build a little table that I can then work with.
Thanks for the help.
Say you have a dataframe, x, where:
x <- data.frame(i=c(1,2,3),j=c(4,5,6))
You could set:
desc.x <- Desc(x)
And access the info on any given column like:
desc.x$i
desc.x$i$mean
desc.x$j$sd
And any other stats Desc comes up with. The $ is the key here, it's how you access the named fields of the list that Desc returns.
Edit: In case you pass a single column (as the asker does), or simply a vector to Desc, you are then returned a 1 item list. The same principle applies but the usual syntax is different. Now you would use:
desc.x <- Desc(df$my.col)
desc.x[[1]]$mean
In the future, the way to attack this is to either look in the environment window in RStudio and play around trying to figure out how to access the fields, check the source code on github or elsewhere, or (best first choice) use str(desc.x), which gives us:
> str(desc.x)
List of 1
$ :List of 25
..$ xname : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i"
..$ label : NULL
..$ class : chr "numeric"
..$ classlabel: chr "numeric"
..$ length : int 3
..$ n : int 3
..$ NAs : int 0
..$ main : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i (numeric)"
..$ unique : int 3
..$ 0s : int 0
..$ mean : num 2
..$ meanSE : num 0.577
..$ quant : Named num [1:9] 1 1.1 1.2 1.5 2 2.5 2.8 2.9 3
.. ..- attr(*, "names")= chr [1:9] "min" ".05" ".10" ".25" ...
..$ range : num 2
..$ sd : num 1
..$ vcoef : num 0.5
..$ mad : num 1.48
..$ IQR : num 1
..$ skew : num 0
..$ kurt : num -2.33
..$ small :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 1 2 3
.. ..$ freq: num [1:3] 1 1 1
..$ large :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 3 2 1
.. ..$ freq: num [1:3] 1 1 1
..$ freq :Classes ‘Freq’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ level : Factor w/ 3 levels "1","2","3": 1 2 3
.. ..$ freq : int [1:3] 1 1 1
.. ..$ perc : num [1:3] 0.333 0.333 0.333
.. ..$ cumfreq: int [1:3] 1 2 3
.. ..$ cumperc: num [1:3] 0.333 0.667 1
..$ maxrows : num 12
..$ x : num [1:3] 1 2 3
- attr(*, "class")= chr "Desc"
"List of 1" means you access it by desc.x[[1]], and below that follow the $s. When you see something like num[1:3] that means it's an atomic vector so you access the first member like var$field$numbers[1]

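To get from that nested list to the small table the question asks for, something like the following sketch may help. It does not call Desc itself: fake_desc() is a hypothetical stand-in that builds a list with a few of the fields shown in the str() output above (n, mean, meanSE, x), and summarise_desc() is a made-up helper name. Treating LB and UB as a 95% t-interval for the mean is also an assumption about what the asker wants.

```r
# Hand-built stand-in for Desc() output on a single vector: a list of one
# element holding a few of the fields shown in the str() output above
fake_desc <- function(v) {
  list(list(n = length(v), mean = mean(v),
            meanSE = sd(v) / sqrt(length(v)), x = v))
}

# Pull n, mean, a 95% t-interval for the mean, and the median from one result
summarise_desc <- function(d) {
  d <- d[[1]]                                  # "List of 1" -> inner list
  half <- qt(0.975, df = d$n - 1) * d$meanSE   # half-width of the interval
  data.frame(N = d$n, mean = d$mean,
             LB = d$mean - half, UB = d$mean + half,
             median = median(d$x))
}

# One entry per dataset, then stack the one-row data frames into a table
results <- list(a = fake_desc(c(1, 2, 3)), b = fake_desc(c(4, 5, 6, 9)))
tab <- do.call(rbind, lapply(results, summarise_desc))
tab
```

If the real Desc results carry the same field names as in the str() output, the same helper should work on them, for example summarise_desc(Desc(df$my.col)).
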
How to cast a dataframe to a DocumentTermMatrix?

I am trying to use tidytext to transform a tibble of word frequencies into a DocumentTermMatrix, but the function doesn't seem to work as expected. I start from AssociatedPress, which I know is a DocumentTermMatrix, tidy it and cast it back, but the output is not the same as the original matrix. What am I doing wrong?
library(topicmodels)
library(tidytext)  # tidy(), cast_dtm()
library(dplyr)     # %>%
data(AssociatedPress)
ap_td <- tidy(AssociatedPress)
tt <- ap_td %>%
  cast_dtm(document, term, count)
The element $Docs is not NULL after I cast ap_td, but it was NULL in AssociatedPress:
str(tt)
List of 6
$ i : int [1:302031] 1 16 35 72 84 93 101 111 155 161 ...
$ j : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : chr [1:2246] "1" "2" "3" "4" ...
..$ Terms: chr [1:10473] "adding" "adult" "ago" "alcohol" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : num [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
cast_dtm also raises a warning:
Warning message: Trying to compute distinct() for variables not found in the data:
- row_col, column_col
This is an error, but only a warning is raised for compatibility reasons. The operation will return the input unchanged.
On GitHub, I found this issue which should have been fixed now.
I don't get your warning message using tidytext 0.1.9.900 and R 3.5.0.
The dtm's are identical for the number of terms, rows and columns. Also all the counts are correct.
The difference is indeed between tt$dimnames$Docs and AssociatedPress$dimnames$Docs.
The reason for this is that if there are no docids in the dtm before tidying as is the case with AssociatedPress, the tidy function assigns AssociatedPress$i to the document variable in the tidy_text (ap_td). Casting this back into a dtm, will fill the $dimnames$Docs with the document value from the tidy_text data.frame (ap_td). So in the end the AssociatedPress$i values will end up in tt$dimnames$Docs.
You can see that if you compare the $i from Associated Press with the Docs from tt.
all.equal(unique(as.character(AssociatedPress$i)), unique(tt$dimnames$Docs))
[1] TRUE
Or comparing AssociatedPress with ap_td, the intermediate step:
all.equal(unique(as.character(AssociatedPress$i)), unique(as.character(ap_td$document)))
[1] TRUE
If you want to follow the logic yourself, you can check all the functions used on the github page for the sparse_tidiers. Start with tidy.DocumentTermMatrix and follow the function calls to tidy.simple_triplet_matrix and finally to tidy_triplet.
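The mechanism described above can be imitated in base R. The sketch below is a toy stand-in, not tidytext's actual code: trip is a hand-built list with the same fields as the str() output, and tidy_like() and cast_like() are hypothetical helpers that only mimic the behaviour explained in this answer (row indices become document ids, and document ids become Docs names).

```r
# Toy triplet with no document names, shaped like the str() output above
trip <- list(i = c(1L, 1L, 2L, 3L), j = c(1L, 2L, 2L, 1L), v = c(2, 1, 1, 3),
             nrow = 3L, ncol = 2L,
             dimnames = list(Docs = NULL, Terms = c("apple", "pear")))

# Stand-in for tidying: with no Docs names, the row index is the document id
tidy_like <- function(m) {
  docs <- m$dimnames$Docs
  data.frame(document = if (is.null(docs)) m$i else docs[m$i],
             term = m$dimnames$Terms[m$j],
             count = m$v,
             stringsAsFactors = FALSE)
}

# Stand-in for casting: the document ids are written into the dimnames
cast_like <- function(d) {
  docs <- unique(as.character(d$document))
  terms <- unique(d$term)
  list(i = match(as.character(d$document), docs),
       j = match(d$term, terms),
       v = d$count, nrow = length(docs), ncol = length(terms),
       dimnames = list(Docs = docs, Terms = terms))
}

long <- tidy_like(trip)
back <- cast_like(long)

is.null(trip$dimnames$Docs)  # the original has no document names
back$dimnames$Docs           # the round trip has filled them in
```

The counts and dimensions survive the round trip; only the Docs names differ, which is the same pattern as tt versus AssociatedPress.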

how to get the exact observations analyzed in multinom() in R, or how to make fitted() produce a fit on entire data including NAs

My question is similar to another older question "How to get the number of observations included in a model created using the function multinom in R?" but what I want to look for are the exact observations, not the number of observations, analyzed in the model. Ultimately I want to have the original dataset combined with a new column of the predicted (fitted) probabilities. But let me use an example to illustrate my problem:
Suppose my sample has 1000 observations and some variables have NA values. I fit a multinom() in R and call fitted(), and find that the result has only 870 rows, which means 130 observations were excluded when the model was estimated. fitted() only returns an 870 x 1 matrix (i.e. one column) of probabilities, so I cannot tell which observation each probability corresponds to. I think there are two ways to solve this:
1. Find out which observations are excluded and delete them from the raw data before estimating the model.
2. Make fitted() produce a 1000 x 1 matrix with 130 elements being NA.
I don't know the answer to either one. Any advice would be appreciated. The ultimate goal is to be able to append the fitted probabilities to the original dataset (as a new column) so I can draw inferences. Thanks.
From ?multinom in nnet:
model: logical. If true, the model frame is saved as component
'model' of the returned object.
So call multinom(..., model=TRUE), and the model frame will be in the result.
EDIT:
Following the example in ?multinom:
options(contrasts = c("contr.treatment", "contr.poly"))
library(MASS)
example(birthwt)
bwt.mu <- multinom(low ~ ., bwt, model=TRUE)
Viewing the model frame inside the object:
> str(bwt.mu$model)
'data.frame': 189 obs. of 9 variables:
$ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ race : Factor w/ 3 levels "white","black",..: 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: logi FALSE FALSE TRUE TRUE TRUE FALSE ...
$ ptd : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
$ ht : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ ui : logi TRUE FALSE FALSE TRUE TRUE FALSE ...
$ ftv : Factor w/ 3 levels "0","1","2+": 1 3 2 3 1 1 2 2 2 1 ...
- attr(*, "terms")=Classes 'terms', 'formula' length 3 low ~ age + lwt + race + smoke + ptd + ht + ui + ftv
.. ..- attr(*, "variables")= language list(low, age, lwt, race, smoke, ptd, ht, ui, ftv)
.. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:9] "low" "age" "lwt" "race" ...
.. .. .. ..$ : chr [1:8] "age" "lwt" "race" "smoke" ...
.. ..- attr(*, "term.labels")= chr [1:8] "age" "lwt" "race" "smoke" ...
.. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(low, age, lwt, race, smoke, ptd, ht, ui, ftv)
.. ..- attr(*, "dataClasses")= Named chr [1:9] "factor" "numeric" "numeric" "factor" ...
.. .. ..- attr(*, "names")= chr [1:9] "low" "age" "lwt" "race" ...
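One way to get back to the goal in the question (a new column of fitted probabilities aligned with the original rows) is to use the row names of the stored model frame. The following is a minimal sketch, not part of the answer above: the NAs are planted artificially in a copy of MASS::birthwt, and the column name p_fit is made up for illustration. It assumes fitted() returns its rows in the same order as the model frame.

```r
library(nnet)  # multinom()
library(MASS)  # birthwt data

# Copy of the data with a few artificial NAs (illustration only)
dat <- birthwt[, c("low", "age", "lwt", "smoke")]
dat$low <- factor(dat$low)
dat$lwt[c(3, 10, 25)] <- NA

fit <- multinom(low ~ age + lwt + smoke, data = dat,
                model = TRUE, trace = FALSE)

# Row names of the model frame identify the observations actually used
used <- rownames(fit$model)

# Append fitted probabilities, leaving NA where a row was dropped
dat$p_fit <- NA_real_
dat[used, "p_fit"] <- fitted(fit)[, 1]

# Rows that were excluded from the fit
setdiff(rownames(dat), used)
```

The rows with a planted NA keep NA in p_fit, so the new column has the same length as the original data.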

How to convert DocumentTermMatrix (tm package) to sparse matrix in R?

I used the tm package to create a DocumentTermMatrix, and now I'd like to convert it to a sparse matrix to use as input to the glmnet function from the glmnet package.
Any idea on how to do this?
The object looks like this:
> str(yy)
List of 6
$ i : int [1:13864810] 2 2 2 2 2 2 2 2 2 2 ...
$ j : int [1:13864810] 320 334 339 346 347 348 355 360 362 363 ...
$ v : num [1:13864810] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 709678
$ ncol : int 371
$ dimnames:List of 2
..$ Docs : chr [1:709678] "1" "2" "3" "4" ...
..$ Terms: chr [1:371] "declarative_" "declarative_0" "declarative_0zc" "declarative_0zd" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
> class(yy)
[1] "DocumentTermMatrix" "simple_triplet_matrix"
Is this the only way?
sparseYY <- sparseMatrix( i = yy$i, j=yy$j, x =yy$v)
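You can use as.matrix, but note that it converts the DocumentTermMatrix to an ordinary dense matrix (see the class below), not a sparse one, so it is only practical when the full matrix fits in memory: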
> dtm_matrix <- as.matrix(dtm)
> class(dtm_matrix)
[1] "matrix"
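For a matrix of the size shown in the question, building a sparse matrix directly from the triplet fields is likely the more practical route. Here is a minimal sketch using the Matrix package; the small yy list is a hand-built stand-in with the same fields as the str(yy) output, not a real DocumentTermMatrix. Passing dims and dimnames is an addition to the asker's one-liner so that empty trailing rows or columns and the document and term names are not lost.

```r
library(Matrix)

# Small stand-in with the same fields as the str(yy) output above
yy <- list(i = c(1L, 1L, 3L), j = c(1L, 2L, 2L), v = c(2, 1, 4),
           nrow = 3L, ncol = 2L,
           dimnames = list(Docs = c("1", "2", "3"), Terms = c("a", "b")))

# i and j are 1-based in both the triplet fields and sparseMatrix()
sparseYY <- sparseMatrix(i = yy$i, j = yy$j, x = yy$v,
                         dims = c(yy$nrow, yy$ncol),
                         dimnames = unname(yy$dimnames))

class(sparseYY)
sparseYY
```

The result inherits from Matrix's sparseMatrix class, which is the sparse format the glmnet documentation describes for its x argument.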

All N Combinations of All Subsets

Given a vector of elements, I would like to obtain a list of all possible n-length combinations of subsets of elements. For example, given the (simplest) sequence 1:2, I would like to obtain a list object of the form
{ {{1},{1}}, {{1},{2}}, {{2},{2}}, {{1},{1,2}}, {{2},{1,2}}, {{1,2},{1,2}} }
when n=2.
I was able to generate a list of all non-empty subsets using the following:
listOfAllSubsets <- function (s) {
  n <- length(s)
  unlist(lapply(1:n, function (n) {
    combn(s, n, simplify=FALSE)
  }), recursive=FALSE)
}
However, I'm not sure of the best way to proceed from here. Essentially, I want a Cartesian product of this list with itself (for n=2), without counting the same pair twice.
Any suggestions? A non-iterative solution would be preferable (i.e., no for loops).
It is easier to start with a Cartesian product of the indices. Then duplication can be avoided by making sure the tuple of indices is sorted.
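combosn <- function(items, n) {
  i <- seq_along(items)
  # all n-tuples of indices into items
  idx <- do.call(expand.grid, rep(list(i), n))
  # keep only non-decreasing tuples so each combination appears once
  idx <- idx[!apply(idx, 1, is.unsorted), , drop = FALSE]
  apply(idx, 1, function(x) items[x])
}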
ss<-listOfAllSubsets(1:2)
str(combosn(ss,2))
List of 6
$ :List of 2
..$ : int 1
..$ : int 1
$ :List of 2
..$ : int 1
..$ : int 2
$ :List of 2
..$ : int 2
..$ : int 2
$ :List of 2
..$ : int 1
..$ : int [1:2] 1 2
$ :List of 2
..$ : int 2
..$ : int [1:2] 1 2
$ :List of 2
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
Or, for n=3,
str(combosn(ss,3))
List of 10
$ :List of 3
..$ : int 1
..$ : int 1
..$ : int 1
$ :List of 3
..$ : int 1
..$ : int 1
..$ : int 2
$ :List of 3
..$ : int 1
..$ : int 2
..$ : int 2
$ :List of 3
..$ : int 2
..$ : int 2
..$ : int 2
$ :List of 3
..$ : int 1
..$ : int 1
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 1
..$ : int 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 2
..$ : int 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 1
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int 2
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
$ :List of 3
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
..$ : int [1:2] 1 2
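The lengths of these outputs (6 and 10) can be checked against the standard count of multisets: choosing n items from k with repetition, ignoring order, gives choose(k + n - 1, n) results. The sketch below repeats the two functions so it runs on its own (with seq_along and drop = FALSE as small defensive tweaks) and compares the lengths with that formula.

```r
# All non-empty subsets of s, as in the question
listOfAllSubsets <- function(s) {
  unlist(lapply(seq_along(s), function(k) combn(s, k, simplify = FALSE)),
         recursive = FALSE)
}

# All size-n combinations of items with repetition, order ignored
combosn <- function(items, n) {
  i <- seq_along(items)
  idx <- do.call(expand.grid, rep(list(i), n))
  idx <- idx[!apply(idx, 1, is.unsorted), , drop = FALSE]
  apply(idx, 1, function(x) items[x])
}

ss <- listOfAllSubsets(1:2)
k <- length(ss)  # 3 non-empty subsets of a 2-element set

length(combosn(ss, 2)) == choose(k + 2 - 1, 2)
length(combosn(ss, 3)) == choose(k + 3 - 1, 3)
```

Note that expand.grid builds all k^n index tuples before filtering, so this approach is only practical for small k and n.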
This is what I would do, with, e.g., s=1:2:
1) Represent subsets with a 0/1 matrix for each element's membership.
subsets = as.matrix(do.call(expand.grid,replicate(length(s),0:1,simplify=FALSE)))
which gives
Var1 Var2
[1,] 0 0
[2,] 1 0
[3,] 0 1
[4,] 1 1
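Here, the first row is the empty subset; the second, {1}; the third, {2}; and the fourth, {1,2}. To get the subset itself, use mysubset = s[subsets[row,]==1], where row is the row of the subset you want. (The comparison with 1 matters: indexing with the raw 0/1 values would treat them as positions rather than as membership flags.)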
2) Represent pairs of subsets as pairs of rows of the matrix:
pairs <- expand.grid(Row1=1:nrow(subsets),Row2=1:nrow(subsets))
which gives
Row1 Row2
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 4 2
9 1 3
10 2 3
11 3 3
12 4 3
13 1 4
14 2 4
15 3 4
16 4 4
Here, the fourteenth row corresponds to the second and fourth rows of subsets, so {1} & {1,2}. This assumes the order of the pair matters (which is implicit in taking the Cartesian product). To recover the subsets, use mypairosubsets=lapply(pairs[p,],function(r) s[subsets[r,]==1]) where p is the row of the pair you want.
Expanding beyond pairs to the P(s)^n case (where P(s) is the power set of s) would look like
setsosets = as.matrix(do.call(expand.grid,replicate(n,1:nrow(subsets),simplify=FALSE)))
Here, each row will have a vector of numbers. Each number corresponds to a row in the subsets matrix.
Making copies of the elements of s is probably not necessary for whatever you are doing after this. However, you could do it from here by using lapply(1:nrow(pairs),function(p)lapply(pairs[p,],function(r) s[subsets[r,]==1])), which starts like...
[[1]]
[[1]]$Row1
integer(0)
[[1]]$Row2
integer(0)
[[2]]
[[2]]$Row1
[1] 1
[[2]]$Row2
integer(0)
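Putting the membership-matrix steps together, a minimal end-to-end sketch for s = 1:2 might look like this. get_subset() is a hypothetical helper name; it compares the 0/1 row with 1 so that the row is used as a logical mask rather than as numeric positions.

```r
s <- 1:2

# One row per subset, one 0/1 column per element of s
subsets <- as.matrix(do.call(expand.grid,
                             replicate(length(s), 0:1, simplify = FALSE)))

# Turn a row of the membership matrix into the subset it encodes
get_subset <- function(row) s[subsets[row, ] == 1]

# All ordered pairs of subsets, stored as pairs of row numbers
pairs <- expand.grid(Row1 = seq_len(nrow(subsets)),
                     Row2 = seq_len(nrow(subsets)))

# The fourteenth pair is rows 2 and 4 of subsets, i.e. {1} and {1,2}
pair14 <- lapply(pairs[14, ], get_subset)
pair14
```

Storing only row numbers keeps the pair table small; the actual subsets are materialised only when get_subset() is called.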
allSubsets <- function(n,                    # size of initial set
                       m,                    # number of subsets
                       includeEmpty = FALSE) # should the empty set be considered a subset?
{
  # m can't exceed the number of possible subsets
  if (includeEmpty)
    stopifnot(m <= 2^n)
  else
    stopifnot(m <= 2^n - 1)
  # get all m-combinations of subset indices
  if (includeEmpty) {
    ll <- split(t(combn(2^n, m)), seq(choose(2^n, m)))
  } else
    ll <- split(t(combn(2^n - 1, m)), seq(choose(2^n - 1, m)))
  # get the subsets of the initial set (of size n)
  subsets <- apply(do.call(expand.grid, rep(list(c(FALSE, TRUE)), n)),
                   1, which)
  # remove the empty subset if desired
  if (!includeEmpty)
    subsets <- subsets[-1]
  # convert the subsets to vectors
  subsets <- lapply(subsets, as.vector)
  # return the list of subsets
  apply(t(mapply('[', list(subsets), ll)), 1, function(x) x)
}
# returns a list where each element is a list of length 2 with
# subsets of the initial set of length 4
x <- allSubsets(4, 2, FALSE)
