Work-around to clear blank entries in a document term matrix? - r

I have some R code that I've used in the past to produce topic models. Everything was working fine until I updated all of my R packages in the hope of fixing a slightly unrelated problem. Now, code that previously worked seems to be broken, and I can't figure out what to do.
I read this post and found it very helpful in setting this up originally. It describes a method of cleaning blank rows after sparse terms have been removed to set up subsequent analysis. Here is what happens when I enter the same code with my current packages:
> rowTotals <- apply(a.dtm.t, 1, sum) #Find the sum of words in each Document
> a.dtm.t.rt <- a.dtm.t[rowTotals>0]
Error in `[.simple_triplet_matrix`(a.dtm.t, rowTotals > 0) :
Logical vector subscripting disabled for this object.
Does anyone know how I can go about locating the problem, and roll back to a working solution? Thanks.

Try a.dtm.t.rt <- a.dtm.t[which(rowTotals > 0), ] (note the comma, which makes this a row subset).
If that doesn't work, then you need to show a reproducible example. As it stands, we can't tell what your objects are or what you are doing with them.

I ran into the same problem and solved it with the slam package.
library(slam)
# take tdm as a large term-document matrix
freq <- rowapply_simple_triplet_matrix(tdm,sum)
colapply_simple_triplet_matrix is also useful for working with the sparse matrix column by column.
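A minimal sketch of dropping empty rows from a sparse matrix, assuming the slam package is installed. The 3 x 3 matrix and the names m, totals and m_clean are made up for illustration; a DocumentTermMatrix from tm is a simple_triplet_matrix underneath, so the same pattern should carry over, but that is untested here.

```r
library(slam)

# Made-up sparse matrix: row 2 has no non-zero entries.
m <- simple_triplet_matrix(i = c(1, 1, 3), j = c(1, 3, 2), v = c(2, 1, 4),
                           nrow = 3, ncol = 3)

# Row totals computed on the sparse representation, without densifying.
totals <- row_sums(m)

# Integer row indices plus an explicit comma: a row subset,
# not single-index subscripting.
m_clean <- m[which(totals > 0), ]
```

row_sums() does the same job as rowapply_simple_triplet_matrix(m, sum) for this particular case.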


How to get R to read my first column as a "header"?

I want to calculate diversity indices of different sampling sites in R. I have sites in the first row and the different species in the first column. However, R is reading the first column as normal data (not as a header so to speak).
Pics:
https://imgur.com/a/iBsFtbe
Code:
Macro <- read.csv("C:\\Users\\Carly\\OneDrive\\Desktop\\Ecology Projects\\Macroinvertebrates & Water Quality\\Macro_RData\\Macroinvert\\MacroR\\MacroCSV.csv", header = T)
You need to add row.names = 1 to your command. This will indicate that row names are stored in column number 1.
Macro <- read.csv("<...>/MacroCSV.csv", header = TRUE, row.names = 1)
I sense that you are frustrated. As r2evans said, it is easier for people to help you if you provide them with the data in text form and not with screenshots - because we can't recreate the problem or try to solve it by loading a screenshot into R.
CSV files are just text, so you can open them with a text editor such as Notepad and copy and paste the contents here. You don't need the whole file - the columns and lines needed to reproduce the problem are enough. This is what we were looking for:
Site,Aeshnidae,Amnicolidae,Ancylidae,Asellidae
AN0119A,0,0,0,6,0
AN0143,0,0,0,0,0
Programming is very frustrating for many people when they start out; don't let this discourage you!
It looks like your data is in the wrong orientation for analysis in vegan - your species are the rows, and sites are columns. From your pics, it looks like you've spotted this issue and tried transposing, but are having issues with the placement of the headers.
Try reading your csv in, and specifying that the first column should be row names:
MacroDataDataFinal <- read.csv("Path/to/file.csv",
                               row.names = 1)
Then transpose the data
MacroDataDataFinal_transposed <- t(MacroDataDataFinal)
Then try running the specaccum function:
library(vegan)
speccurve <- specaccum(comm = MacroDataDataFinal_transposed,
                       method = "random",
                       permutations = 1000)
Hopefully this will work. If you get any errors please let us know the code you typed, and the precise error message.
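To make the read-then-transpose steps above concrete, here is a small sketch using only base R. The species counts, the site names SiteA to SiteC, and the object names macro and macro_t are made up for illustration; they are not the asker's data.

```r
# Hypothetical data: species in rows, sites in columns (values are made up).
csv_text <- "Species,SiteA,SiteB,SiteC
Aeshnidae,0,2,1
Amnicolidae,3,0,0
Asellidae,6,1,4"

# row.names = 1 turns the first column into row names instead of data.
macro <- read.csv(text = csv_text, header = TRUE, row.names = 1)

# Transpose so that sites become rows and species become columns.
macro_t <- t(macro)
```

With a real file, replace text = csv_text with the path to the CSV.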

ImpulseDE2, matrix counts contains non-integer elements

Possibly it's a stupid question (but be patient, I'm a beginner in R's world)... I'm working with ImpulseDE2, a package designed for RNA-seq data analysis across different time points (see article for more information).
The main function (runImpulseDE2) requires a count matrix and an annotation data frame. I've created both, but this error message appears:
Error in checkCounts(matCountData, "matCountData"): ERROR: matCountData contains non-integer elements. Requires count data.
I have tried some solutions and nothing seems to work (and I haven't found any solution on the Internet):
as.matrix(data)
data + 1: there are no NAs or zero values that could cause this error (which(is.na(data)) and which(data < 1) both return integer(0))
as.numeric(data): this produces another error: ERROR: [Rownames of matCountData] was not given as input.
I think there's something I'm not seeing, but I'm completely stuck. Any tip will be welcome!
And here is the (silly) solution! This function seems not to accept float numbers... so applying a simple round is enough to solve this error.
Thanks for your help!
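A minimal sketch of the rounding fix in base R, using a made-up matrix (the gene and sample names are hypothetical). It also shows why the as.numeric() attempt complained about row names: as.numeric() drops the dimensions and dimnames of a matrix, while round() keeps them.

```r
# Made-up non-integer "counts" with row and column names.
counts <- matrix(c(10.4, 0, 3.7, 25.0, 1.2, 8.9), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# round() keeps the matrix shape and the row/column names.
counts_int <- round(counts)

# as.numeric() returns a plain vector: no dimensions, no row names.
flat <- as.numeric(counts)
```

Whether rounding is appropriate depends on where the fractional values come from; this only illustrates the mechanics.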

Convert from matrix to list matrix

Sorry for the noob question but I can't seem to get this to work!
X=cbind(rep(1,m), h2(x), h3(x)) #obs
So I have a 17*3 matrix X, and I have to create a matrix(list(),17,3) version of it. I did it manually below so you can see the desired result, but there must be an easier way to do this?
Z=matrix(list(X[1,1],X[2,1],X[3,1],X[4,1],X[5,1],X[6,1],X[7,1],X[8,1],X[9,1],X[10,1],X[11,1],X[12,1],X[13,1],X[14,1],X[15,1],X[16,1],X[17,1],X[1,2],X[2,2],X[3,2],X[4,2],X[5,2],X[6,2],X[7,2],X[8,2],X[9,2],X[10,2],X[11,2],X[12,2],X[13,2],X[14,2],X[15,2],X[16,2],X[17,2],X[1,3],X[2,3],X[3,3],X[4,3],X[5,3],X[6,3],X[7,3],X[8,3],X[9,3],X[10,3],X[11,3],X[12,3],X[13,3],X[14,3],X[15,3],X[16,3],X[17,3]),17,3)
I tried this (amongst others)
Z2=list(X[1:17,1],X[1:17,2],X[1:17,3])
Z3=matrix(Z2[1:3],17,3)
But it doesn't give the correct results! It just repeats the three column vectors over and over.
Can someone please explain how to do this correctly.
Apparently you want Z <- matrix(as.list(X), ncol = 3). However, I don't see how this structure could be useful.
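A small sketch of that one-liner on a made-up 4 x 3 matrix (the question's X depends on h2 and h3, which are not shown). as.list() produces one list element per matrix cell in column-major order, so matrix() can lay them out without recycling. The earlier attempt recycled because Z2 had only three elements, each a whole column.

```r
# Made-up stand-in for the 17 x 3 matrix in the question.
X <- cbind(rep(1, 4), c(2, 4, 6, 8), c(0.5, 1.5, 2.5, 3.5))

# One list element per cell, arranged with the same shape as X.
Z <- matrix(as.list(X), ncol = ncol(X))

# Cells are extracted with double brackets.
cell <- Z[[2, 3]]
```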

package tm. problems with kmeans

I have a question about k-means clustering in R. I'm doing everything according to this article. Everything is based on examples within the tm package, so no data import is required. acq contains 50 documents and crude contains 20.
library(tm)
data("acq")
data("crude")
ws <- c(acq, crude)
wsTDM <- Data(TermDocumentMatrix(ws)) #First problem here
wsKMeans <- kmeans(wsTDM, 2)
wsReutersCluster <- c(rep("acq", 50), rep("crude", 20))
cl_agreement(wsKMeans, as.cl_partition(wsReutersCluster), "diag")
Error in lapply(X, FUN, ...) :
(list) object cannot be coerced to type 'integer'
I actually want to create a cross-agreement matrix. But this article was written in 2008, and a lot has changed since then. The Data function is now only available in the RSurvey package, and I doubt it is the same function. I think the main problem is that TermDocumentMatrix used to be an S4 class and is now S3. I know it's possible to do this with the raw text only, but I want to do it this way because a TDM makes it possible to remove stopwords, punctuation, etc. for better results. If someone has a solution, that would be terrific.
The TDM is stored as a sparse matrix, as described in ?TermDocumentMatrix. This can also be seen by inspecting the object with str(wsTDM). That old Data() function was just a way to access the contents as a regular matrix. It is not needed anymore. Just create the matrix with wsTDM <- TermDocumentMatrix(ws) and run kmeans(wsTDM, 2); you'll see that the output is as expected, with clusters identified for 2775 observations (terms) on 70 features (documents). Good luck!
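Since the question asks for agreement with per-document labels, here is a hedged sketch of one way to get there in base R. The toy term-by-document counts are made up; with tm, the dense matrix would presumably come from as.matrix(wsTDM) instead. The matrix is transposed so that documents, not terms, are clustered, and a plain table() stands in for the cl_agreement call in the question.

```r
set.seed(1)

# Toy term-by-document counts (made up): 4 terms x 6 documents.
tdm_dense <- matrix(c(5, 4, 6, 0, 0, 1,
                      4, 5, 5, 1, 0, 0,
                      0, 1, 0, 6, 5, 4,
                      0, 0, 1, 5, 6, 5),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("term", 1:4), paste0("doc", 1:6)))

# Known labels, one per document.
labels <- c(rep("acq", 3), rep("crude", 3))

# Transpose so each row is a document, then cluster the documents.
km <- kmeans(t(tdm_dense), centers = 2, nstart = 10)

# Cross-tabulate cluster assignments against the known labels.
agreement <- table(cluster = km$cluster, label = labels)
```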

converting R code snippet to use the Matrix package?

I am not sure there are any R users out there, but just in case:
I am a novice at R and was kindly "handed down" the following R code snippet:
Beta <- exp(as.matrix(read.table('beta.transpose')))
WordFreq <- read.table('freq-matrix')
WordProbs <- WordFreq$V1 / sum(WordFreq)
infile <- file('freq-matrix')
outfile <- file('doc_topic_prob_matrix', 'w')
open(infile)
open(outfile)
for (i in 1:93049) {
  vec <- t(scan(infile, nlines=1))
  topics <- (vec/WordProbs) %*% Beta
  write.table(topics, outfile, append=T, row.names=F, col.names=F)
}
When I tried running this on my dataset, the system thrashed and swapped like crazy. Now I realize that has a simple reason: the file freq-matrix holds a large (22GB) matrix and I was trying to read it into memory.
I have been told to use the Matrix package, because freq-matrix has many, many zeros all over the place and it handles such cases well. Will that help? If so, any hints on how to change this code would be most welcome. I have no R experience and just started reading through the introduction PDF available on the site.
Many thanks
~l
My suggestion might be completely off, because you don't give enough details about the contents of your files, and I had to guess from the code. Anyway, here it goes.
You don't state it, but I would assume that your code crashes on the second line, when you read in the big matrix. The loop reads the lines one at a time and should not crash. The only reason you need that big matrix is to calculate the WordProbs vector. So why don't you rewrite that part using the same scan-based loop? In fact, you probably don't even need to store the WordProbs vector, just sum(WordFreq) - you can get that from an initial pass through the file. Then rewrite the formula within the loop to calculate the current WordProb.
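A rough sketch of that initial pass in base R, assuming a whitespace-separated numeric file. The three-line temporary file is a made-up stand-in for freq-matrix, and the names total, first_col and word_probs are hypothetical. Only one line is held in memory at a time.

```r
# Hypothetical small stand-in for the large 'freq-matrix' file.
path <- tempfile()
writeLines(c("1 0 2", "0 3 0", "4 0 0"), path)

con <- file(path, open = "r")
total <- 0               # running equivalent of sum(WordFreq)
first_col <- numeric(0)  # running equivalent of WordFreq$V1

# First pass: read one line at a time and accumulate.
while (length(line <- readLines(con, n = 1)) > 0) {
  vec <- scan(text = line, quiet = TRUE)
  total <- total + sum(vec)
  first_col <- c(first_col, vec[1])
}
close(con)

# Same quantity as WordFreq$V1 / sum(WordFreq) in the original snippet.
word_probs <- first_col / total
```

The second pass can then stay as the original loop, reusing word_probs.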
Belated answer, but I'd recommend reading the data into a memory-mapped file, using the bigmemory package. After that, I'd look for the non-zero entries, which can then be represented as a 3-column matrix: (ix_row, ix_col, value). This is called a coordinate list (COO) format, though the name is unimportant. From there, Matrix supports the creation of sparse matrices (via sparseMatrix). After you get the COO, you're pretty much set - conversion to and from the sparse matrix format is reasonably fast. Multiplying the matrix by Beta should be reasonably fast. If you need even greater speed, you could use an optimized BLAS library, but that opens up more questions. :)
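A minimal sketch of the COO-to-sparse step, assuming the Matrix package is available. The triplets, the 3 x 4 dimensions and the beta values are made up, and the bigmemory step is left out.

```r
library(Matrix)

# Made-up coordinate (COO) triplets: row index, column index, value.
coo <- data.frame(ix_row = c(1, 1, 3),
                  ix_col = c(2, 4, 1),
                  value  = c(5, 2, 7))

# Build the sparse matrix; dims is given explicitly so that
# all-zero rows and columns are kept.
freq <- sparseMatrix(i = coo$ix_row, j = coo$ix_col, x = coo$value,
                     dims = c(3, 4))

# Hypothetical dense Beta: one row per column of freq, two topics.
beta <- matrix(seq_len(8) / 10, nrow = 4, ncol = 2)

# Sparse-by-dense product; only the non-zero entries contribute.
topics <- freq %*% beta
```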
