How to cast a dataframe to a DocumentTermMatrix? - r

I am trying to use tidytext to transform a tibble of word frequencies into a DocumentTermMatrix, but the function doesn't seem to work as expected. I start from AssociatedPress, which I know is a DocumentTermMatrix, tidy it and cast it back, but the output is not the same as the original matrix. What am I doing wrong?
library(topicmodels)
library(tidytext)  # tidy(), cast_dtm()
library(dplyr)     # %>%
data("AssociatedPress")
ap_td <- tidy(AssociatedPress)
tt <- ap_td %>%
  cast_dtm(document, term, count)
The $dimnames$Docs element is non-NULL after I cast ap_td, but it was NULL in AssociatedPress:
str(tt)
List of 6
$ i : int [1:302031] 1 16 35 72 84 93 101 111 155 161 ...
$ j : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : chr [1:2246] "1" "2" "3" "4" ...
..$ Terms: chr [1:10473] "adding" "adult" "ago" "alcohol" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : num [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
cast_dtm also raises a warning:
Warning message:
Trying to compute distinct() for variables not found in the data:
- row_col, column_col
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged.
On GitHub, I found this issue which should have been fixed now.

I don't get your warning message using tidytext 0.1.9.900 and R 3.5.0.
The dtm's are identical for the number of terms, rows and columns. Also all the counts are correct.
The difference is indeed between tt$dimnames$Docs and AssociatedPress$dimnames$Docs.
The reason is that if the dtm has no document IDs before tidying, as is the case with AssociatedPress, the tidy function assigns AssociatedPress$i to the document variable of the tidy data (ap_td). Casting this back into a dtm fills $dimnames$Docs with the document values from the tidy data frame (ap_td). So in the end the AssociatedPress$i values end up in tt$dimnames$Docs.
You can see this if you compare $i from AssociatedPress with the Docs from tt.
all.equal(unique(as.character(AssociatedPress$i)), unique(tt$dimnames$Docs))
[1] TRUE
Or compare AssociatedPress to ap_td, the intermediate step (all.equal compares only its first two arguments, so each pair needs its own call):
all.equal(unique(as.character(AssociatedPress$i)), unique(as.character(ap_td$document)))
[1] TRUE
If you want to follow the logic yourself, you can check all the functions used on the github page for the sparse_tidiers. Start with tidy.DocumentTermMatrix and follow the function calls to tidy.simple_triplet_matrix and finally to tidy_triplet.
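To make the mechanism concrete, here is a minimal base R sketch of the round trip on a toy triplet list. It only mimics the logic described above and is not tidytext's actual code; the object names (dtm, td, back) and the toy terms are made up for illustration.

```r
# Toy stand-in for a DTM without document names (like AssociatedPress):
# 3 documents, 3 terms, stored as (i, j, v) triplets.
dtm <- list(
  i = c(1L, 1L, 2L, 3L),   # row (document) index of each non-zero cell
  j = c(2L, 3L, 1L, 3L),   # column (term) index of each non-zero cell
  v = c(1, 2, 1, 4),       # the counts
  nrow = 3L, ncol = 3L,
  dimnames = list(Docs = NULL, Terms = c("apple", "bank", "court"))
)

# "Tidy" step: with no Docs names, the row index itself becomes the document id.
docs <- if (is.null(dtm$dimnames$Docs)) dtm$i else dtm$dimnames$Docs[dtm$i]
td <- data.frame(document = docs,
                 term = dtm$dimnames$Terms[dtm$j],
                 count = dtm$v,
                 stringsAsFactors = FALSE)

# "Cast" step: document and term labels are rebuilt from the tidy columns
# (here in order of first appearance), so Docs becomes a character vector.
doc_ids  <- unique(as.character(td$document))
term_ids <- unique(td$term)
back <- list(
  i = match(as.character(td$document), doc_ids),
  j = match(td$term, term_ids),
  v = td$count,
  nrow = length(doc_ids), ncol = length(term_ids),
  dimnames = list(Docs = doc_ids, Terms = term_ids)
)

str(dtm$dimnames)   # Docs is NULL
str(back$dimnames)  # Docs now holds the old row indices as strings
```

The counts are the same in both objects; only the labels and the ordering of the terms differ, which matches what str(tt) and str(AssociatedPress) show in the question.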

Related

Building a table/dataframe/something exportable from Desc function output in R

I'm definitely a noob, though I have used R for various small tasks for several years.
For the life of me, I cannot figure out how to get the results from the "Desc" function into something I can work with. When I save x <- Desc(mydata), class(x) shows up as "Desc". In RStudio it is listed under Values as "List of 1". Then when I click on x it says ":List of 25" in the first line. There is a list of data in this object, but I cannot figure out how to grab any of it.
Clearly I have a severe misunderstanding of the R data structures, but I have been searching for the past 90 minutes to no avail so figured I would reach out.
In short, I just want to pull certain aspects (N, mean, UB, LB, median) of the descriptive statistics provided from the Desc results for multiple datasets and build a little table that I can then work with.
Thanks for the help.
Say you have a dataframe, x, where:
x <- data.frame(i=c(1,2,3),j=c(4,5,6))
You could set:
desc.x <- Desc(x)
And access the info on any given column like:
desc.x$i
desc.x$i$mean
desc.x$j$sd
And any other stats Desc comes up with. The $ is the key here: it's how you access the named fields of the list that Desc returns.
Edit: If you pass a single column (as the asker does), or simply a vector, to Desc, you get back a one-item list. The same principle applies, but the syntax is slightly different. Now you would use:
desc.x <- Desc(df$my.col)
desc.x[[1]]$mean
In the future, the way to attack this is to either look in the environment window in RStudio and play around trying to figure out how to access the fields, check the source code on github or elsewhere, or (best first choice) use str(desc.x), which gives us:
> str(desc.x)
List of 1
$ :List of 25
..$ xname : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i"
..$ label : NULL
..$ class : chr "numeric"
..$ classlabel: chr "numeric"
..$ length : int 3
..$ n : int 3
..$ NAs : int 0
..$ main : chr "data.frame(i = c(1, 2, 3), j = c(4, 5, 6))$i (numeric)"
..$ unique : int 3
..$ 0s : int 0
..$ mean : num 2
..$ meanSE : num 0.577
..$ quant : Named num [1:9] 1 1.1 1.2 1.5 2 2.5 2.8 2.9 3
.. ..- attr(*, "names")= chr [1:9] "min" ".05" ".10" ".25" ...
..$ range : num 2
..$ sd : num 1
..$ vcoef : num 0.5
..$ mad : num 1.48
..$ IQR : num 1
..$ skew : num 0
..$ kurt : num -2.33
..$ small :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 1 2 3
.. ..$ freq: num [1:3] 1 1 1
..$ large :'data.frame': 3 obs. of 2 variables:
.. ..$ val : num [1:3] 3 2 1
.. ..$ freq: num [1:3] 1 1 1
..$ freq :Classes ‘Freq’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ level : Factor w/ 3 levels "1","2","3": 1 2 3
.. ..$ freq : int [1:3] 1 1 1
.. ..$ perc : num [1:3] 0.333 0.333 0.333
.. ..$ cumfreq: int [1:3] 1 2 3
.. ..$ cumperc: num [1:3] 0.333 0.667 1
..$ maxrows : num 12
..$ x : num [1:3] 1 2 3
- attr(*, "class")= chr "Desc"
"List of 1" means you access it by desc.x[[1]], and below that follow the $s. When you see something like num[1:3] that means it's an atomic vector so you access the first member like var$field$numbers[1]
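For the table the asker wants (N, mean, LB, UB, median across several datasets), something along these lines might work. This is only a sketch: make_desc() is a hypothetical stand-in that builds a list shaped like the str() output above so the code runs without DescTools, the name "median" inside quant is an assumption (str() only shows the first four names), and LB/UB are computed here as a 95% t-interval from mean and meanSE rather than read from the Desc object.

```r
# Hypothetical stand-in for Desc(x) on a numeric vector: a list of length 1
# whose single element holds named statistics, as in the str() output above.
make_desc <- function(x) {
  probs <- c(0, .05, .10, .25, .50, .75, .90, .95, 1)
  q <- quantile(x, probs, names = FALSE)
  names(q) <- c("min", ".05", ".10", ".25", "median", ".75", ".90", ".95", "max")
  list(list(n = length(x),
            mean = mean(x),
            meanSE = sd(x) / sqrt(length(x)),
            quant = q))
}

# Pull the pieces of interest out of one Desc-like object.
pull_stats <- function(d, level = 0.95) {
  s <- d[[1]]                                    # "List of 1" -> take element 1
  half <- qt(1 - (1 - level) / 2, df = s$n - 1) * s$meanSE
  data.frame(N = s$n,
             mean = s$mean,
             LB = s$mean - half,                 # assumed: t-based lower bound
             UB = s$mean + half,                 # assumed: t-based upper bound
             median = unname(s$quant["median"]))
}

# One row per dataset, stacked into a single data frame.
datasets <- list(a = c(1, 2, 3), b = c(10, 20, 30, 40))
tab <- do.call(rbind, lapply(datasets, function(x) pull_stats(make_desc(x))))
print(tab)
```

With the real package you would replace make_desc(x) with Desc(x) and check str() first to confirm the field names in your installed version.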

how to get the exact observations analyzed in multinom() in R, or how to make fitted() produce a fit on entire data including NAs

My question is similar to another older question "How to get the number of observations included in a model created using the function multinom in R?" but what I want to look for are the exact observations, not the number of observations, analyzed in the model. Ultimately I want to have the original dataset combined with a new column of the predicted (fitted) probabilities. But let me use an example to illustrate my problem:
Suppose my sample has 1000 observations and some variables have NA values. I fit a multinom() model in R and call fitted(), and the result has length 870, which means 130 observations were excluded when the model was estimated. fitted() only produces an 870 x 1 result (i.e. one column) of probabilities, so there is no way for me to know which observation each probability corresponds to. I think there are two ways to solve this:
1. Find out which observations are excluded and delete them from the raw data before estimating the model.
2. Make fitted() produce a 1000 x 1 matrix in which the 130 excluded elements are NA.
I don't know the answer to either one. Any advice would be appreciated. The ultimate goal is to be able to append the fitted probabilities to the original dataset (as a new column) so I can draw inferences. Thanks.
From ?multinom in nnet:
model: logical. If true, the model frame is saved as component
'model' of the returned object.
So call multinom(..., model=TRUE), and the model frame will be in the result. Its row names identify the rows of the data that were actually used in the fit.
EDIT:
Following the example in ?multinom:
options(contrasts = c("contr.treatment", "contr.poly"))
library(MASS)
library(nnet)     # multinom()
example(birthwt)  # creates the bwt data frame
bwt.mu <- multinom(low ~ ., bwt, model = TRUE)
Viewing the model frame inside the object:
> str(bwt.mu$model)
'data.frame': 189 obs. of 9 variables:
$ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ race : Factor w/ 3 levels "white","black",..: 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: logi FALSE FALSE TRUE TRUE TRUE FALSE ...
$ ptd : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
$ ht : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ ui : logi TRUE FALSE FALSE TRUE TRUE FALSE ...
$ ftv : Factor w/ 3 levels "0","1","2+": 1 3 2 3 1 1 2 2 2 1 ...
- attr(*, "terms")=Classes 'terms', 'formula' length 3 low ~ age + lwt + race + smoke + ptd + ht + ui + ftv
.. ..- attr(*, "variables")= language list(low, age, lwt, race, smoke, ptd, ht, ui, ftv)
.. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:9] "low" "age" "lwt" "race" ...
.. .. .. ..$ : chr [1:8] "age" "lwt" "race" "smoke" ...
.. ..- attr(*, "term.labels")= chr [1:8] "age" "lwt" "race" "smoke" ...
.. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(low, age, lwt, race, smoke, ptd, ht, ui, ftv)
.. ..- attr(*, "dataClasses")= Named chr [1:9] "factor" "numeric" "numeric" "factor" ...
.. .. ..- attr(*, "names")= chr [1:9] "low" "age" "lwt" "race" ...
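Building on the saved model frame, one way to append the fitted probabilities to the original data is to match on row names. The following is a rough sketch under a few assumptions: the default na.action (na.omit) is in effect, the data frame has unique row names, and the data and variable names (dat, y, x1, x2) are simulated purely for illustration.

```r
library(nnet)

# Simulated data with some missing predictor values (illustrative only).
set.seed(1)
dat <- data.frame(y  = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
                  x1 = rnorm(200),
                  x2 = rnorm(200))
dat$x1[sample(200, 20)] <- NA

# model = TRUE keeps the model frame; trace = FALSE silences the optimiser log.
fit <- multinom(y ~ x1 + x2, data = dat, model = TRUE, trace = FALSE)

# Row names of the model frame are the rows that survived NA removal.
used <- rownames(fit$model)

# Start from an all-NA matrix with one row per original observation,
# then fill in the rows that were actually used.
probs <- matrix(NA_real_, nrow = nrow(dat), ncol = length(fit$lev),
                dimnames = list(rownames(dat), fit$lev))
probs[used, ] <- fitted(fit)

dat_with_probs <- cbind(dat, probs)
dropped <- setdiff(rownames(dat), used)  # the excluded observations
```

Here dropped answers the first option in the question (which rows were excluded), and dat_with_probs answers the second (fitted values aligned with the full data, NA where a row was not used).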

How to convert DocumentTermMatrix (tm package) to sparse matrix in R?

I used the tm package's DocumentTermMatrix function to create a document-term matrix, and now I'd like to convert it to a sparse matrix to use as input to the glmnet function from the glmnet package.
Any idea on how to do this?
The object looks like this:
> str(yy)
List of 6
$ i : int [1:13864810] 2 2 2 2 2 2 2 2 2 2 ...
$ j : int [1:13864810] 320 334 339 346 347 348 355 360 362 363 ...
$ v : num [1:13864810] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 709678
$ ncol : int 371
$ dimnames:List of 2
..$ Docs : chr [1:709678] "1" "2" "3" "4" ...
..$ Terms: chr [1:371] "declarative_" "declarative_0" "declarative_0zc" "declarative_0zd" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
> class(yy)
[1] "DocumentTermMatrix" "simple_triplet_matrix"
Is this the only way?
sparseYY <- sparseMatrix( i = yy$i, j=yy$j, x =yy$v)
You can use as.matrix to convert the DTM to an ordinary matrix, but note that the result is dense (class "matrix"), not sparse:
> dtm_matrix <- as.matrix(dtm)
> class(dtm_matrix)
[1] "matrix"
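If the goal is a genuinely sparse object, the sparseMatrix() call from the question can be completed along these lines. This is a sketch using a small hand-built list in place of the real DTM (the toy yy below is an assumption so the code runs without tm); passing dims and dimnames explicitly keeps empty trailing rows or columns and the document/term labels.

```r
library(Matrix)

# Toy stand-in with the same fields as the DocumentTermMatrix in the question.
# Column 4 is deliberately empty to show why dims should be passed explicitly.
yy <- list(
  i = c(1L, 1L, 2L, 3L),
  j = c(1L, 3L, 2L, 3L),
  v = c(1, 2, 1, 4),
  nrow = 3L, ncol = 4L,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("t1", "t2", "t3", "t4"))
)

sparseYY <- sparseMatrix(i = yy$i, j = yy$j, x = yy$v,
                         dims = c(yy$nrow, yy$ncol),
                         dimnames = yy$dimnames)
print(sparseYY)
```

Without dims, the matrix would be sized from the largest index present, so the empty fourth column would be silently dropped. glmnet accepts sparse matrices from the Matrix package for x, so an object like sparseYY can be passed directly.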

R tm package. Where I can find a detailed description of the components of the TermDocumentMatrix? i, j, v

As an example this is a tdm:
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : int [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
I have been trying to find the description of these columns $i, $j, $v ...
Thanks a lot,
Have a look at this: http://www.inside-r.org/packages/cran/slam/docs/as.simple_triplet_matrix
Under ?TermDocumentMatrix
We see:
Value
An object of class TermDocumentMatrix or class DocumentTermMatrix
(both inheriting from a simple triplet matrix in package slam)
containing a sparse term-document matrix or document-term matrix. The
attribute Weighting contains the weighting applied to the matrix.
When you follow the link in the phrase "both inheriting from a simple triplet matrix", you get:
Arguments
i, j
Integer vectors of row and column indices, respectively.
v
Vector of values.
and...
Details
simple_triplet_matrix is a generator for a class of
“lightweight” sparse matrices, “simply” represented by triplets (i,
j, v) of row indices i, column indices j, and values v, respectively.
simple_triplet_zero_matrix and simple_triplet_diag_matrix are
convenience functions for the creation of empty and diagonal
matrices.
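As a small illustration of what the triplets mean, the following base R sketch rebuilds an ordinary matrix from i, j and v. The list stm is a made-up toy with the same field names as a simple_triplet_matrix, not an object created by slam.

```r
# Toy triplet representation of a 3 x 3 matrix with four non-zero cells.
stm <- list(
  i = c(1L, 1L, 2L, 3L),  # row index of each non-zero cell
  j = c(1L, 3L, 2L, 3L),  # column index of each non-zero cell
  v = c(2, 1, 4, 5),      # value stored in that cell
  nrow = 3L, ncol = 3L
)

# Read the triplets side by side: cell (i, j) holds v.
triplets <- data.frame(row = stm$i, col = stm$j, value = stm$v)
print(triplets)

# Rebuild the dense matrix: start with zeros, then fill the listed cells.
dense <- matrix(0, nrow = stm$nrow, ncol = stm$ncol)
dense[cbind(stm$i, stm$j)] <- stm$v
print(dense)
```

In a DocumentTermMatrix, i indexes documents and j indexes terms, so each triplet reads as "document i contains term j, v times". Every cell not listed is zero.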

Topicmodels transposes the term document matrix

I am trying to run an LDA using the topicmodels package in R. The example given in the manual uses Associated Press data and works nicely. However, when I try it on my own data I get topics whose terms are the document names. I have traced the problem to the fact that my term document matrix is the transpose of the way it should be (rows -> columns).
The example TDM:
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : int [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
Whereas my TDM has Terms as rows and Docs as columns:
List of 6
$ i : int [1:10489] 1 3 4 13 20 24 25 26 27 28 ...
$ j : int [1:10489] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:10489] 1 1 1 1 2 1 67 1 44 3 ...
$ nrow : int 5903
$ ncol : int 9
$ dimnames:List of 2
..$ Terms: chr [1:5903] "\u2439aa" "aars" "\u2439ab" "\u242dab" ...
..$ Docs : chr [1:9] "art111130.txt" "art111131.txt" "art111132.txt" "art111133.txt" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
Which is causing LDA(art_tdm,3) to build topics based on doc names, not terms within docs. Is this a change in the codebase of the tm package? I can't imagine what I would be doing to cause this transposition in my code:
library(tm)
art_cor <- Corpus(DirSource(directory = "tmptxts"))
art_tdm <- TermDocumentMatrix(art_cor)
Any help would be appreciated.
On the one hand you have an object of class "TermDocumentMatrix" and the other you have one of "DocumentTermMatrix".
You probably just need to do this:
art_tdm<-DocumentTermMatrix(art_cor)
