Creating a sparse matrix from a TermDocumentMatrix - r

I've created a TermDocumentMatrix from the tm library in R. It looks something like this:
> inspect(freq.terms)
A document-term matrix (19 documents, 214 terms)

Non-/sparse entries: 256/3810
Sparsity           : 94%
Maximal term length: 19
Weighting          : term frequency (tf)

      Terms
Docs   abundant acid active adhesion aeropyrum alternative
  1           0    0      1        0         0           0
  2           0    0      0        0         0           0
  3           0    0      0        1         0           0
  4           0    0      0        0         0           0
  5           0    0      0        0         0           0
  6           0    1      0        0         0           0
  7           0    0      0        0         0           0
  8           0    0      0        0         0           0
  9           0    0      0        0         0           0
  10          0    0      0        0         1           0
  11          0    0      1        0         0           0
  12          0    0      0        0         0           0
  13          0    0      0        0         0           0
  14          0    0      0        0         0           0
  15          1    0      0        0         0           0
  16          0    0      0        0         0           0
  17          0    0      0        0         0           0
  18          0    0      0        0         0           0
  19          0    0      0        0         0           1
This is just a small sample of the matrix; there are actually 214 terms that I'm working with. On a small scale, this is fine. If I want to convert my TermDocumentMatrix into an ordinary matrix, I'd do:
data.matrix <- as.matrix(freq.terms)
However, the data that I've displayed above is just a subset of my overall data, which has probably at least 10,000 terms. When I try to create a TDM from the overall data, I run into an error:
> Error cannot allocate vector of size n Kb
So from here, I'm looking into alternative ways of finding efficient memory allocation for my tdm.
I tried turning my tdm into a sparse matrix from the Matrix library but ran into the same problem.
What are my alternatives at this point? I feel like I should be investigating one of:
- the bigmemory/ff packages, as talked about here (although the bigmemory package doesn't seem to be available for Windows at the moment)
- the irlba package for computing a partial SVD of my tdm, as mentioned here
I've experimented with functions from both libraries but can't seem to arrive at anything substantial. Does anyone know what the best way forward is? I've spent so long fiddling around with this that I thought I'd ask people who have much more experience than myself working with large datasets before I waste even more time going in the wrong direction.
EDIT: changed 10,00 to 10,000. Thanks @nograpes.

The qdap package seems to be able to handle a problem this large. The first part recreates a data set that matches the OP's problem, followed by the solution. As of qdap version 1.1.0 there is compatibility with the tm package:
library(qdap)
library(qdapDictionaries)

FUN <- function() {
    paste(sample(DICTIONARY[, 1], sample(seq(100, 10000, by = 1000), 1, TRUE)),
          collapse = " ")
}

mycorpus <- tm::Corpus(tm::VectorSource(lapply(paste0("doc", 1:15),
    function(i) FUN())))
This gives a similar corpus...
Now the qdap approach. You have to first convert the Corpus to a dataframe (tm_corpus2df) and then use the tdm function to create a TermDocumentMatrix.
out <- with(tm_corpus2df(mycorpus), tdm(text, docs))
tm::inspect(out)
## A term-document matrix (19914 terms, 15 documents)
##
## Non-/sparse entries: 80235/218475
## Sparsity           : 73%
## Maximal term length: 19
## Weighting          : term frequency (tf)
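A further option worth noting (a sketch, not part of the qdap answer above): tm stores a TermDocumentMatrix as a slam simple_triplet_matrix, so it can be converted into a Matrix sparse matrix directly from its i/j/v triplet slots, never allocating the dense matrix that as.matrix() builds:

```r
library(Matrix)

# Convert a tm TermDocumentMatrix (a slam simple_triplet_matrix underneath)
# into a Matrix::dgCMatrix straight from its (i, j, v) triplets.
tdm_to_sparse <- function(tdm) {
  sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
               dims = c(tdm$nrow, tdm$ncol),
               dimnames = tdm$dimnames)
}
```

Memory use is then proportional to the number of non-zero entries (256 in the sample above) rather than terms times documents.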

Related

Is there an R function for finding keywords within a certain 'word distance'?

What I need is a function that finds words within a certain 'word distance' of each other. For example, in the sentence "He had a bag of tools in his car", the words 'bag' and 'tools' are the ones of interest.
With the quanteda kwic function I can find 'bag' and 'tools' individually, but this often gives me an overload of results. I need, e.g., 'bag' and 'tools' within five words of each other.
You can use the fcm() function to count the co-occurrences within a fixed window, for instance 5 words. This creates a "feature co-occurrence matrix" and can be defined for any size of token span, or for the context of an entire document.
For your example, or at least an example based on my interpretation of your questions, this would look like:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other"
)
fcm(txt, context = "window", window = 5)
## Feature co-occurrence matrix of: 10 by 10 features.
## 10 x 10 sparse Matrix of class "fcm"
##           features
## features   He had a bag of tools in his car other
##     He      0   1 1   1  1     1  0   0   0     0
##     had     0   0 1   1  1     1  1   0   0     0
##     a       0   0 0   1  1     1  1   1   0     0
##     bag     0   0 0   0  1     2  1   1   1     4
##     of      0   0 0   0  0     1  1   1   1     0
##     tools   0   0 0   0  0     0  1   1   1     5
##     in      0   0 0   0  0     0  0   1   1     0
##     his     0   0 0   0  0     0  0   0   1     0
##     car     0   0 0   0  0     0  0   0   0     0
##     other   0   0 0   0  0     0  0   0   0    10
Here, the term bag occurs once within 5 tokens of tools, in the first document. In the second document, they are more than 5 tokens apart, so this is not counted.
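To pull out just the pair the question asks about, the fcm can be indexed like any matrix (a small follow-up sketch using the same txt object as above):

```r
library("quanteda")

txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other"
)
co <- fcm(txt, context = "window", window = 5)
co["bag", "tools"]  # windowed co-occurrence count for the bag/tools pair
```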

Document-Term Matrix with Quanteda

I have a dataframe df with this structure :
Rank  Review
5     good film
8     very goood film
..
Then I tried to create a DocumentTermMatris using quanteda package :
temp.tf <- df$Review %>%
  tokens(ngrams = 1:1) %>%  # generate tokens
  dfm %>%                   # generate dfm
  convert(to = "tm")
I get this matrix :
> inspect(temp.tf)
<<DocumentTermMatrix (documents: 63023, terms: 23892)>>
Non-/sparse entries: 520634/1505224882
Sparsity           : 100%
Maximal term length: 77
Weighting          : term frequency (tf)
Sample:

With this structure:

           Terms
Docs        good very film my excellent heart David plus always so
  text14670    1    0    0  0         1     0     0    0      2  0
  text19951    3    0    0  0         0     0     0    1      1  1
  text24305    7    0    2  1         0     0     0    2      0  0
  text26985    6    0    0  0         0     0     0    4      0  1
  text29518    4    0    1  0         1     0     0    3      0  1
  text34547    5    2    0  0         0     0     2    3      1  3
  text3781     3    0    1  4         0     0     0    3      0  0
  text5272     4    0    0  4         0     5     0    3      1  2
  text5367     3    0    1  3         0     0     1    4      0  1
  text6001     3    0    9  1         0     6     0    1      0  1
So I think this is good, but I believe that text6001, text5367, text5272, ... refer to the documents' names.
My question is: are the rows in this matrix ordered, or are they placed in the matrix at random?
Thank you
EDIT:
I created a document-feature matrix:
mydfm <- dfm(df$Review, remove = stopwords("french"), stem = TRUE)
Then I created a tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
Then I would like to merge the tfidf matrix with the Rank column to have something like this:
           features
Docs        good very film my excellent heart David plus always so Rank
  text14670    1    0    0  0         1     0     0    0      2  0    3
  text19951    3    0    0  0         0     0     0    1      1  1    2
  text24305    7    0    2  1         0     0     0    2      0  0    4
  text26985    6    0    0  0         0     0     0    4      0  1    5
Can you help to make this merge?
Thank you
The rows (documents) are alphabetically ordered, which is why text14670 comes before text19951. It is possible that the conversion has reordered the documents, but you can test this using:
sum(rownames(temp.tf) != sort(rownames(temp.tf)))
If that is not 0, then they are not alphabetically ordered.
The feature ordering, at least in the quanteda dfm, comes from the order in which the features are found in the texts. You can re-sort both using dfm_sort().
In your code, the tokens(ngrams = 1:1) call is unnecessary, since dfm() does that and ngrams = 1 is the default.
Also, do you need to convert this to a tm object? Probably most of what you need can be done in quanteda.
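For the merge asked about in the EDIT, one hedged sketch (assuming mydfm was built from df$Review, so the document order of the dfm matches the rows of df): convert the weighted dfm to a plain matrix and bind the Rank column onto it.

```r
# tfidf is the weighted dfm from the EDIT; df$Rank is assumed to be in
# the same order as the documents of mydfm (one review per document).
out <- data.frame(as.matrix(tfidf), Rank = df$Rank)
```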

Sorting a filtered TDM matrix in R

I've got a problem with a TDM matrix. I was able to sort it, and everything went fine there, but now I would like to filter it (or the other way round; I've heard it's more efficient to filter an unsorted matrix). Either way, what I want to do is filter the TDM, as in this question: subset vector by first letter in R
Now the TDM looks like this:
> inspect(tdm[1:5, 1:10])
<<TermDocumentMatrix (terms: 5, documents: 10)>>
Non-/sparse entries: 3/47
Sparsity           : 94%
Maximal term length: 10
Weighting          : term frequency (tf)

            Docs
Terms        1 2 3 4 5 6 7 8 9 10
  ability    0 0 0 0 0 0 1 0 0  0
  about      0 0 0 1 0 0 3 0 0  0
  acceptance 0 0 0 0 0 0 0 0 0  0
  accepted   0 0 0 0 0 0 0 0 0  0
  access     0 0 0 0 0 0 0 0 0  0
But I would like to filter out the terms starting with "ac" and be left with only:

         Docs
Terms     1 2 3 4 5 6 7 8 9 10
  ability 0 0 0 0 0 0 1 0 0  0
  about   0 0 0 1 0 0 3 0 0  0
I tried to use grep or subset but couldn't manage to achieve that; I get an error that there is no such case (named numeric(0)). I'm pretty new to R, so if I'm searching in the wrong direction, please point it out; I'd be really grateful. Big thanks in advance.
The code is pretty straightforward:
library("tm")
data(acq)
corpus <- Corpus(VectorSource(acq))
tdm <- TermDocumentMatrix(corpus)
final <- as.matrix(tdm)
final[grep("^[aA].*", final)]
To get all the terms that start with "a", use:
final[ grep("^[aA].*", rownames(final)) , ]
It's not the matrix values themselves that hold the terms; it's the row names of the matrix that have the values you want to grep against. And then, since you're subsetting a matrix by rows, you should use the two-value [row, col] subsetting syntax.
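If memory is a concern, the same grep can be applied before densifying: a TermDocumentMatrix can be subset by row directly, so (a sketch, reusing the tdm built above) you can drop the unwanted terms first and only then call as.matrix:

```r
# Keep every term that does NOT start with "ac", as in the question,
# subsetting the sparse TDM before building the dense matrix.
keep  <- grep("^ac", rownames(tdm), invert = TRUE)
final <- as.matrix(tdm[keep, ])
```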

plotting graphs when rownames and column names are not identical

I tried everything and could not find any meaningful answers, so I decided to post this here. I have an adjacency matrix, shown below.
I am trying to create a plot of a simple graph:
library(graph)
g = as(x4, "graphNEL")
plot(g, "neato")
I got an error message
Error in asMethod(object) : 'rownames(from)' and 'colnames(from)' must be identical
             Abdominal pain Chest pain Flu-like Liver Damage Nausea Numbness Swelling
Avandaia                  1          0        0            1      1        1        1
Warfrin                   0          1        1            0      1        1        1
Flu-like                  0          0        0            0      0        0        0
Liver Damage              0          0        0            0      0        0        0
Nausea                    0          0        0            0      0        0        0
Numbness                  0          0        0            0      0        0        0
Swelling                  0          0        0            0      0        0        0
Any advice would be helpful. Thanks.
@jlhoward: when I do rownames(x4) <- colnames(x4), I am making my row names and column names the same. I am interested in a graph where the row names and column names are not equal.
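One way around the error (a sketch; x4 and the drug/symptom names are from the question): graphNEL needs a square matrix with identical row and column names, so pad the rectangular matrix out to the union of the two name sets before converting.

```r
# Build a square 0-filled matrix over the union of row and column names,
# then copy the original adjacency entries into it.
pad_square <- function(x) {
  nm <- union(rownames(x), colnames(x))
  m  <- matrix(0, length(nm), length(nm), dimnames = list(nm, nm))
  m[rownames(x), colnames(x)] <- x
  m
}
# g <- as(pad_square(x4), "graphNEL"); plot(g, "neato")  # as in the question
```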

Random subsampling in R

I am new to R, so my question might be really simple.
I have 40 sites with abundances of zooplankton.
My data looks like this (columns are species abundances and rows are sites):
0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 85 0
0 0 0 0 0 45 5 57 0
0 0 0 0 0 13 0 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 7 0
0 3 0 0 12 8 0 57 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 59 0 0 0
0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 105 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 100 0
0 35 0 55 0 0 0 0 0
1 4 0 0 0 0 0 0 0
0 0 0 0 0 34 21 0 0
0 0 0 0 0 9 17 0 0
0 54 0 0 0 27 5 0 0
0 1 0 0 0 1 0 0 0
0 17 0 0 0 54 3 0 0
What I would like to do is take a random sub-sample (e.g. 50 individuals) from each site without replacement, several times (bootstrap), in order to calculate diversity indices on the new standardized abundances afterwards.
Try something like this:
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which rests on some assumptions about the data being sampled:
Each row is a site, each column is a species, and each value is a count.
Individuals are being sampled for the bootstrap, NOT THE COUNTS.
To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:
a b c
2 3 4
The record would be modeled as:
aabbbcccc
Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.
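The expand-and-sample model described above can be written directly with rep(), which repeats each species name by its count (a minimal base-R sketch of the a/b/c record):

```r
counts <- c(a = 2, b = 3, c = 4)
pool <- rep(names(counts), counts)         # "a" "a" "b" "b" "b" "c" "c" "c" "c"
boot <- sample(pool, 50, replace = TRUE)   # resample individuals WITH replacement
table(boot)                                # counts per species in the bootstrap sample
```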
Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.
# Test data.
data <- data.frame(a = 2, b = 3, c = 4)

# Sampling from the first row of data.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, rep = TRUE, prob = data[row, ])
Converting the sample into the format of the original table: Now we have an array of samples, with each item indicating the column number that the sample belongs to. We can convert back to the original table format in multiple ways, but here is a fairly simple one using a simple counting loop:
# Count the number of each entry and store in a list.
site_sample <- vector("list", ncol(data))
for (i in 1:ncol(data)) {
  site_sample[[i]] <- sum(samples == i)
}
# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)
Just stumbled upon this thread: the vegan package has a function called rrarefy that does precisely what you're looking to do (and in the same ecological context, too).
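For reference, a hedged sketch of how vegan's rrarefy is typically called (it draws, for each row, a random subsample of the requested number of individuals without replacement; every row sum is assumed to be at least that large):

```r
library(vegan)

# 4 sites x 10 species of mock count data; row sums comfortably exceed 50.
set.seed(1)
comm <- matrix(rpois(40, lambda = 20), nrow = 4)
rare <- rrarefy(comm, 50)  # each row rarefied to 50 individuals
rowSums(rare)              # every site now totals 50
```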
This should work. It's a little more complicated than it looks at first, since each cell contains counts of a species. The solution uses the apply function to send each row of the data to the user-defined sample_species function. Then we generate n random numbers and order them. If there are 15 of species 1, 20 of species 2, and 20 of species 3, random numbers between 1 and 15 signify species 1, between 16 and 35 signify species 2, and between 36 and 55 signify species 3.
## Takes in a row of the count data and the number of individuals to sample
sample_species <- function(counts, n) {
  num_species <- length(counts)
  total_count <- sum(counts)
  samples <- sample(1:total_count, n, replace = FALSE)
  samples <- samples[order(samples)]
  result <- array(0, num_species)
  total <- 0
  for (i in 1:num_species) {
    result[i] <- length(which(samples > total & samples <= total + counts[i]))
    total <- total + counts[i]
  }
  return(result)
}
A <- matrix(sample(0:100, 10 * 40, replace = TRUE), ncol = 10)  ## mock data
B <- t(apply(A, 1, sample_species, 50))                         ## results
