R tm TermDocumentMatrix based on a sparse matrix

I have a collection of books in txt format and want to apply some procedures of the tm R library to them. However, I prefer to clean the texts in bash rather than in R because it is much faster.
Suppose I am able to get from bash a data.frame such as:
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
I know that TermDocumentMatrices are actually sparse matrices with metadata. In fact, I can create a sparse matrix from a TDM by passing the TDM's i, j and v entries as the i, j and x arguments of the sparseMatrix function. Please help me if you know how to do the inverse, or in this case, how to construct a TDM from the three columns in the data.frame above. Thanks!

You could try
library(tm)
library(reshape2)
txt <- readLines(n = 7)
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
df <- read.table(header=T, text=txt[-2])
dfwide <- dcast(data = df, book ~ term, value.var = "frequency", fill = 0)
mat <- as.matrix(dfwide[, -1])
dimnames(mat) <- setNames(dimnames(dfwide[-1]), names(df[, 1:2]))
(tdm <- as.TermDocumentMatrix(t(mat), weighting = weightTf))
# <<TermDocumentMatrix (terms: 4, documents: 2)>>
# Non-/sparse entries: 5/3
# Sparsity : 38%
# Maximal term length: 6
# Weighting : term frequency (tf)
as.matrix(tdm)
# Docs
# Terms 1 2
# animal 0 2
# car 0 3
# the 10 20
# zoo 2 0
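If you would rather skip the dense intermediate matrix, here is a minimal sketch of the inverse construction: build a slam::simple_triplet_matrix (the class tm's matrices are based on) directly from the three columns and hand it to as.TermDocumentMatrix(). This assumes your data.frame is called df and that tm's as.TermDocumentMatrix() accepts a simple_triplet_matrix the same way it accepts the dense matrix above:
library(tm)
library(slam)
# the three-column data.frame produced in bash (assumed name: df)
df <- data.frame(book      = c(1, 1, 2, 2, 2),
                 term      = c("the", "zoo", "animal", "car", "the"),
                 frequency = c(10, 2, 2, 3, 20),
                 stringsAsFactors = FALSE)
terms <- sort(unique(df$term))
docs  <- sort(unique(df$book))
# i/j/v triplets: row = term, column = document, value = raw count
stm <- simple_triplet_matrix(i = match(df$term, terms),
                             j = match(df$book, docs),
                             v = df$frequency,
                             nrow = length(terms),
                             ncol = length(docs),
                             dimnames = list(Terms = terms,
                                             Docs  = as.character(docs)))
tdm <- as.TermDocumentMatrix(stm, weighting = weightTf)
inspect(tdm)
Because the triplets are never expanded, the object stays sparse even for a very large collection of books.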

Related

Interpretation question: textstat_simil in quanteda

I have a dataset of 310,225 tweets. I want to find out how many tweets are the same or similar. I calculated the similarity between the tweets using quanteda's textstat_simil(). The counts of the similarity values 0.9999 and 1 in the resulting similarity matrix are shown below:
0.9999      1
  2288 162743
Here's my code:
dfmat_users <- dfm_data %>%
  dfm_select(min_nchar = 2) %>%
  dfm_trim(min_termfreq = 10)
dfmat_users <- dfmat_users[ntoken(dfmat_users) > 10, ]
tstat_sim <- textstat_simil(dfmat_users, method = "cosine", margin = "documents", min_simil = 0.9998)
table(tstat_sim@x) # the result of this call is given above
I need to find out the number of similar or same tweets in the dataset. How should I interpret the results above?
The easiest way is to convert the textstat_simil() output to a data.frame of unique pairs, and then filter the ones whose cosine value is above your threshold (here, .9999).
To illustrate, we can reshape the built-in inaugural address corpus into sentences, compute the similarity matrix on those, coerce it to a data.frame, and use dplyr to filter the results you want.
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
sim_df <- data_corpus_inaugural %>%
  corpus_reshape(to = "sentences") %>%
  dfm() %>%
  textstat_simil(method = "cosine") %>%
  as.data.frame()
nrow(sim_df)
## [1] 12508670
For your data, you can adjust the condition below to 0.9999; here I'm using 0.99 as an illustration.
library("dplyr", warn.conflicts = FALSE)
filter(sim_df, cosine > .99)
## document1 document2 cosine
## 1 1861-Lincoln.69 1861-Lincoln.71 1
## 2 1861-Lincoln.69 1861-Lincoln.73 1
## 3 1861-Lincoln.71 1861-Lincoln.73 1
## 4 1953-Eisenhower.6 1985-Reagan.6 1
## 5 1953-Eisenhower.6 1989-Bush.15 1
## 6 1985-Reagan.6 1989-Bush.15 1
## 7 1989-Bush.140 2009-Obama.108 1
## 8 1989-Bush.140 2013-Obama.87 1
## 9 2009-Obama.108 2013-Obama.87 1
## 10 1989-Bush.140 2017-Trump.9 1
## 11 2009-Obama.108 2017-Trump.9 1
## 12 2013-Obama.87 2017-Trump.9 1
(And: yeah, that's a very fast computation of cosine similarity between 12.5 million sentence pairs!)
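To get the counts you actually asked about, you can apply the same coercion to your own tstat_sim object and then count both the qualifying pairs and the distinct tweets involved. A rough sketch, assuming the document1/document2/cosine column names shown above also apply to your data:
library(dplyr)
pairs_df   <- as.data.frame(tstat_sim)            # unique document pairs above min_simil
near_dupes <- filter(pairs_df, cosine >= 0.9999)  # pairs at or above your threshold
nrow(near_dupes)  # number of highly similar (or identical) tweet pairs
# number of distinct tweets that appear in at least one such pair
n_distinct(c(as.character(near_dupes$document1),
             as.character(near_dupes$document2)))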

R - text mining with the tm package

I want to use the tm package, so I have created the following code:
x <- inspect(DocumentTermMatrix(docs,
                                list(dictionary = c("survive", "survival"))))
I need to find any word beginning with "surviv" in the text, so as to include words like "survival", "survivor", "survive" and others. Is there any way to write that condition (words beginning with "surviv") in the code?
What you could do is create a general TermDocumentMatrix and then filter it with startsWith, keeping only the rows whose terms start with surviv.
corp <- VCorpus(VectorSource(c("survival", "survivance", "survival",
                               "random", "yes", "survive")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm[startsWith(rownames(tdm), "surv"), ])
Alternatively, you can stem the words with stemDocument. Then you only need to look for surviv and survivor, as these are the stems you are after. Using and expanding the list of words from @AshOfFire:
my_corpus <- VCorpus(VectorSource(c("survival", "survivance", "survival",
                                    "random", "yes", "survive", "survivors", "surviving")))
my_corpus <- tm_map(my_corpus, stemDocument)
my_dtm <- DocumentTermMatrix(my_corpus, control = list(dictionary = c("surviv", "survivor")))
inspect(my_dtm)
inspect(my_dtm)
<<DocumentTermMatrix (documents: 8, terms: 2)>>
Non-/sparse entries: 6/10
Sparsity : 62%
Maximal term length: 8
Weighting : term frequency (tf)
Sample :
Terms
Docs surviv survivor
1 1 0
2 1 0
3 1 0
4 0 0
5 0 0
6 1 0
7 0 1
8 1 0
P.S. Only do x <- inspect(DocumentTermMatrix(docs, ...)) if you want to get the first 10 rows and 10 columns in your x variable.
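If you prefer to keep the DocumentTermMatrix orientation from the question's own snippet, the same subsetting trick works on the columns; a small sketch, assuming docs is the corpus from the question:
dtm <- DocumentTermMatrix(docs)
# keep only the columns (terms) that begin with "surviv"
inspect(dtm[, startsWith(colnames(dtm), "surviv")])
# or, equivalently, with a regular expression
inspect(dtm[, grepl("^surviv", colnames(dtm))])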

Split up ngrams in (sparse) document-feature matrix

This is a follow-up question to this one. There, I asked if it's possible to split up ngram features in a document-feature matrix (the dfm class from the quanteda package) in such a way that e.g. bigrams result in two separate unigrams.
For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
library(quanteda)
eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
There was a nice answer to this example, which works absolutely fine for relatively small matrices like the one above. However, as soon as the matrix gets bigger, I constantly run into the following memory error.
> #turn the dfm into a matrix
> DF <- as.data.frame(eg.dfm)
Error in asMethod(object) :
Cholmod-error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Hence, is there a more memory-efficient way to solve this ngram problem, or to deal with large (sparse) matrices/data frames? Thank you in advance!
The problem here is that you are turning the sparse (dfm) matrix into a dense object when you call as.data.frame(). Since the typical document-feature matrix is 90% sparse, this means you are creating something larger than you can handle. The solution: use dfm handling functions to maintain the sparsity.
Note that this is not only a better solution than the one proposed in the linked question, but it should also work efficiently for your much larger object.
Here's a function that does that. It allows you to set the concatenator character(s), and works with ngrams of variable sizes. Most importantly, it uses dfm methods to make sure the dfm remains sparse.
# function to split and duplicate counts in features containing
# the concatenator character
dfm_splitgrams <- function(x, concatenator = "_") {
  # separate the unigrams
  x_unigrams <- dfm_remove(x, concatenator, valuetype = "regex")
  # separate the ngrams
  x_ngrams <- dfm_select(x, concatenator, valuetype = "regex")
  # split into components
  split_ngrams <- stringi::stri_split_regex(featnames(x_ngrams), concatenator)
  # get a repeated index for the ngram feature names
  index_split_ngrams <- rep(featnames(x_ngrams), lengths(split_ngrams))
  # subset the ngram matrix using the (repeated) ngram feature names
  x_split_ngrams <- x_ngrams[, index_split_ngrams]
  # assign the ngram dfm the feature names of the split ngrams
  colnames(x_split_ngrams) <- unlist(split_ngrams, use.names = FALSE)
  # return the column concatenation of unigrams and split ngrams
  suppressWarnings(cbind(x_unigrams, x_split_ngrams))
}
So:
dfm_splitgrams(eg.dfm)
## Document-feature matrix of: 3 documents, 9 features (40.7% sparse).
## 3 x 9 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction emission increase
##   text1        1     1       1  1   1        0         0        0        0
##   text2        0     1       1  0   0        1         1        0        0
##   text3        1     0       0  1   1        1         1        1        1
Here, splitting ngrams results in new "unigrams" of the same feature name. You can (re)combine them efficiently with dfm_compress():
dfm_compress(dfm_splitgrams(eg.dfm))
## Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    increase great plenary in the emission reduction
##   text1        1     1       1  1   1        0         0
##   text2        0     1       1  0   0        1         1
##   text3        2     0       0  1   1        2         1

How to get unigrams and trigrams only?

I need to get the unigrams and trigrams without bigrams:
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
How can I edit this code to get that?
One way is to use the dfm function from the quanteda package, as follows:
library(quanteda)
dfm('I only want uni and trigrams', ngrams = c(1,3), verbose = FALSE)
#Document-feature matrix of: 1 document, 10 features.
#1 x 10 sparse Matrix of class "dfmSparse"
# features
#docs i only want uni and trigrams i_only_want only_want_uni want_uni_and uni_and_trigrams
# text1 1 1 1 1 1 1 1 1 1 1
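Note that in newer quanteda releases (3.x) dfm() no longer takes an ngrams argument; a rough equivalent there, sketched under that assumption, is to form the n-grams at the tokens stage with tokens_ngrams() and then build the dfm:
library(quanteda)
# tokenize first, then keep unigrams and trigrams only (n = c(1, 3) skips bigrams)
toks <- tokens("I only want uni and trigrams")
dfm(tokens_ngrams(toks, n = c(1, 3)))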

Recognize/differentiate two sentences in R

Here is an example of my data.
Table 1: User table
id  address
1   mont carlo road,CA
2   mont road,IS
3   mont carlo road1-11,CA
Table 2: Similarity matrix (the output I want to get)
id  1  2  3
1
2   3
3   1  3
(scale: 1~3, from very similar to very dissimilar)
My problem is how to measure the similarity between the cases in Table 1 by their address, and then output a result such as the similarity matrix in Table 2. The point is how to compare two sentences in R, set a scale to measure the similarity of each pair, and finally output a matrix.
I'd also use the stringdist package but would make use of outer and cut to finish the job:
library(stringdist)
dat <- data.frame(
  address = c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA"),
  id = 1:3
)
m <- outer(dat[["address"]], dat[["address"]], stringdist, method="jw")
m[lower.tri(m)] <- cut(m[lower.tri(m)], 3, labels=1:3)
m[upper.tri(m)] <- cut(m[upper.tri(m)], 3, labels=1:3)
dimnames(m) <- list(dat[["id"]], dat[["id"]])
diag(m) <- NA
m
##    1  2  3
## 1 NA  3  1
## 2  3 NA  3
## 3  1  3 NA
You can use whatever method you want for calculating distance (?stringdist).
You might be interested in the Levenshtein Distance implemented in the R package stringdist. For example:
library(stringdist)
address <- c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA")
stringdist(address[1], address[2], method="lv")
[1] 8
You could then tailor these results into a matrix or whatever output you desire.
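For example, if all you need is the full pairwise distance matrix in one call, stringdist also provides stringdistmatrix(); a small sketch using the same addresses:
library(stringdist)
address <- c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA")
# full matrix of pairwise Levenshtein distances
d <- stringdistmatrix(address, address, method = "lv")
dimnames(d) <- list(1:3, 1:3)
d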
