How does removeSparseTerms in R work?

I am using the removeSparseTerms method in R, and it requires a threshold value as input. I also read that the higher the value, the more terms will be retained in the returned matrix.
How does this method work, and what is the logic behind it? I understand the concept of sparseness, but does this threshold indicate the number of documents a term must appear in, or some other ratio, etc.?

In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the threshold of relative document frequency for a term, above which the term will be removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), a larger value of sparse is a less strict threshold: the closer sparse is to 1.0, the fewer terms are removed. (Note that sparse cannot take the values 0 or 1.0, only values strictly in between.)
For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms that are more sparse than 0.99.
The exact interpretation for sparse = 0.99 is that you retain every term $j$ for which
$df_j > N \times (1 - 0.99)$, where $N$ is the number of documents; in this case probably all terms will be retained (see the example below).
Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents, and in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)
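To make the rule concrete, here is a small base-R sketch (an illustration only, not tm's internal implementation; dtm stands for any DocumentTermMatrix and sparse for the chosen threshold):
m    <- as.matrix(dtm)          # dense copy; fine for small examples
df_j <- colSums(m > 0)          # document frequency of each term
N    <- nrow(m)                 # number of documents
dtm[, which(df_j > N * (1 - sparse))]   # the terms that removeSparseTerms(dtm, sparse) keeps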
Here is an example with the sparsity threshold of 0.99, where the second term occurs in (first example) just under 1% of the documents and is therefore removed, and (second example) just over 1% of the documents and is therefore retained:
> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity : 0%
Maximal term length: 2
Weighting : term frequency (tf)
>
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity : 49%
Maximal term length: 2
Weighting : term frequency (tf)
Here are a few additional examples with actual text and terms:
> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
"the sparse brown furry matrix",
"the quick matrix")
> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .01))
Terms
Docs the
1 1
2 1
3 1
> as.matrix(removeSparseTerms(myTdm, .99))
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .5))
Terms
Docs brown furry matrix quick the
1 2 2 0 1 1
2 1 1 1 0 1
3 0 0 1 1 1
In the last example, with sparse = 0.5, only terms occurring in at least two of the three documents (two-thirds) were retained.
An alternative approach for trimming terms from document-term matrices based on document frequency is the text analysis package quanteda. The same functionality there refers not to sparsity but directly to the document frequency of terms (as used in tf-idf).
> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
a brown fox furry jumped matrix over quick second sparse the
1 2 1 2 1 2 1 2 1 1 3
> dfm_trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
features
docs brown furry the matrix quick
text1 2 2 1 0 1
text2 1 1 1 1 0
text3 0 0 1 1 1
This usage seems much more straightforward to me.
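(Note: more recent quanteda releases renamed the trimming arguments; assuming a current version, the roughly equivalent call is:)
> dfm_trim(myDfm, min_docfreq = 2)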

In the function removeSparseTerms(), the argument sparse = x means:
"remove all terms whose sparsity is greater than the threshold (x)".
e.g. removeSparseTerms(my_dtm, sparse = 0.90) means: remove all terms in the corpus whose sparsity is greater than 90%.
For example, a term that appears in, say, just 4 documents out of a corpus of 1000 documents has a relative document frequency of 4/1000 = 0.004.
This term's sparsity is (1000 - 4)/1000 = 1 - 0.004 = 0.996 = 99.6%.
Therefore, if the sparsity threshold is set to sparse = 0.90, this term will be removed, since its sparsity (0.996) is greater than the upper-bound sparsity (0.90).
However, if the sparsity threshold is set to sparse = 0.999, this term will not be removed, since its sparsity (0.996) is lower than the upper-bound sparsity (0.999).
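As a quick sanity check, the same arithmetic in R (using the numbers from the example above):
N  <- 1000               # documents in the corpus
df <- 4                  # documents containing the term
sparsity <- (N - df)/N   # 0.996
sparsity > 0.90          # TRUE  -> removed when sparse = 0.90
sparsity > 0.999         # FALSE -> kept when sparse = 0.999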

Put simply, the threshold works like a document-frequency filter. A value near 0 keeps only the terms that appear in (almost) every document, whereas a value near 1 keeps (almost) every term. If you choose 0.5, you only keep terms that appear in at least 50% of the documents. After all the usual pre-processing, the check applied to each term is effectively:
1 - (number of documents containing the term / total number of documents) < threshold

Related

tm package version 0.7 does not preserve intra-word dashes in DocumentTermMatrix

The behaviour of the tm package has changed between versions 0.6-2 and 0.7-x.
In the new version, DocumentTermMatrix does not preserve intra-word dashes. Is this a bug, or is there a new option to enforce the old behaviour? An example follows below, using both tm versions installed in different library paths. I am running R 3.3.3.
> string1 <- "big data data analysis machine learning project management"
> string2 <- "big-data data-analysis machine-learning project-management"
>
> two_strings <- c(string1, string2)
>
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.6-2")
> myCorpus <- Corpus(VectorSource(two_strings))
> dtm_0.6 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.6)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project project-management
1 1 1 0
2 0 0 1
So with the old version 0.6-2 the dashes in the second string are correctly preserved. With the new version 0.7-3 instead:
> detach("package:tm", unload=TRUE)
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> dtm_0.7 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
I tried to enforce the preservation of dashes as follows, but to no avail:
> dtm_test <- DocumentTermMatrix(myCorpus,
+ control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE)))
> inspect(dtm_test)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
Any advice? Thanks!
The answer came from the tm author himself, Dr. Ingo Feinerer (thanks!). Reproducing it here:
Since 0.7 the default corpus is a "SimpleCorpus" (if supported; that
depends on the source). See ?SimpleCorpus
That triggers a certain behavior (see in ?TermDocumentMatrix).
Use VCorpus instead of Corpus to enforce the old behavior:
inspect(TermDocumentMatrix(Corpus(VectorSource(two_strings))))
inspect(TermDocumentMatrix(VCorpus(VectorSource(two_strings))))
Returning to my example above, now using VCorpus:
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> myVCorpus <- VCorpus(VectorSource(two_strings))
> dtm_0.7 <- DocumentTermMatrix(myVCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project
1 1 1
2 0 0
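For completeness, if you are open to quanteda instead of tm, its tokenizer keeps intra-word hyphens by default, so no workaround is needed. A rough sketch (assuming a reasonably recent quanteda version):
library(quanteda)
toks <- tokens(two_strings, remove_punct = TRUE)  # hyphenated words stay whole unless split_hyphens = TRUE
dfm(toks)  # features include big-data, data-analysis, machine-learning, project-management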

Importing a Term Document Matrix in CSV format into R

So I already have a TDM, but it was in Excel, so I saved it as a CSV. Now I want to do some analysis, but I can't load it as a TDM using the tm package. My CSV looks something like this:
item01 item02 item03 item04
red 0 1 1 0
circle 1 0 0 1
fame 1 0 0 0
yellow 0 0 1 1
square 1 0 1 0
So I haven't been able to load that file as a TDM; the best I've tried so far is this:
myDTM <- as.DocumentTermMatrix(df, weighting = weightBin)
But it loads 1s into all cells:
<<DocumentTermMatrix (documents: 2529, terms: 1952)>>
Non-/sparse entries: 4936608/0
Sparsity : 0%
Maximal term length: 27
Weighting : binary (bin)
Sample :
Terms
Docs item01 item02 item03 item04
Red 1 1 1 1
Circle 1 1 1 1
fame 1 1 1 1
I've tried converting to a Corpus first, and other things, but if I try to use any function like inspect(tdm) it returns an error like this or similar:
Error in `[.simple_triplet_matrix`(x, docs, terms) :
I really don't believe there isn't a way to import it in the right format. Any suggestions? Thanks in advance.
Try first converting the CSV to a sparse matrix. My CSV is different from yours because I typed it myself, but it's the same idea.
> library(tm)
> library(Matrix)
> myDF <- read.csv("my.csv",row.names=1,colClasses=c('character',rep('integer',4)))
> mySM <- Matrix(as.matrix(myDF),sparse=TRUE)
> myDTM <- as.DocumentTermMatrix(mySM,weighting = weightBin)
> inspect(myDTM)
<<DocumentTermMatrix (documents: 5, terms: 4)>>
Non-/sparse entries: 7/13
Sparsity : 65%
Maximal term length: 6
Weighting : binary (bin)
Sample :
Terms
Docs item01 item02 item03 item04
circle 1 1 0 0
fame 1 0 0 0
red 0 0 0 0
square 1 0 1 0
yellow 0 0 1 1
>
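One caveat worth noting: in the CSV from the question the rows are terms and the columns are documents, so a TermDocumentMatrix may be the more natural target. A sketch under that assumption (reusing mySM from above; as.TermDocumentMatrix() takes the same weighting argument, though I have not tested it on every input class):
myTDM  <- as.TermDocumentMatrix(mySM, weighting = weightBin)   # rows = terms, cols = docs
# or transpose first and keep the document-term orientation
myDTM2 <- as.DocumentTermMatrix(t(mySM), weighting = weightBin)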

Feature extraction using Chi2 with Quanteda

I have a dataframe df with this structure:
Rank Review
5 good film
8 very good film
..
Then I tried to create a DocumentTermMatrix using the quanteda package:
mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)
I would like to know how to calculate, for each feature (term), the chi-squared value against the documents, in order to extract the best features in terms of chi-squared value.
Can you help me resolve this problem, please?
EDIT :
> head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
features
docs bon accueil conseillèr efficac écout répond
text1 0 0 0 0 0 0
text2 1 1 1 1 1 1
text3 0 0 0 0 0 0
text4 0 0 0 0 0 0
text5 0 0 1 0 0 0
text6 0 0 0 0 1 0
...
text60300 0 0 1 1 1 1
Here I have my dfm matrix; then I create my tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
I would like to determine the chi-squared value between these features and the documents (here I have 60,300 documents):
textstat_keyness(mydfm, target = 2)
But since I have 60,300 targets, I don't know how to do this automatically.
I see in the quanteda manual that the groups option in the dfm function may resolve this problem, but I don't see how to use it. :(
EDIT 2 :
Rank Review
10 always good
1 nice film
3 fine as usual
Here I try to group the documents with dfm:
mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)
But it fails to group the documents.
Can you please help me resolve this problem?
Thank you.
See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to set a particular document's frequencies against all other frequencies, e.g.
textstat_keyness(mydfm, target = 1)
for the first document against the frequencies of all others, or
textstat_keyness(mydfm, target = 2)
for the second against all others, etc.
If you want to compare categories of frequencies that group documents, you will need to use the groups = option in dfm(), either for a supplied variable or for one in the docvars. See the example in ?textstat_keyness.
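To make that concrete, here is a minimal sketch (the column names Review and Rank come from the question; the exact dfm()/groups syntax may vary a little between quanteda versions):
mydfm_grouped <- dfm(df$Review, remove = stopwords("english"), stem = TRUE,
                     groups = df$Rank)
# keyness (chi-squared by default) of the Rank == "10" group against all the other groups
textstat_keyness(mydfm_grouped, target = "10")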

using findAssocs in R to find frequently occurred words with central term

As I was working with findAssocs in R, I realised that the function doesn't actually pick up the words that occur together with the searched term across documents, but rather the words that occur in the documents where the searched term appears frequently.
I've tried using a simple test script below:
test <- list("housekeeping bath towel housekeeping room","housekeeping dirty","housekeeping very dirty","housekeeping super dirty")
test <-Corpus(VectorSource(test))
test_dtm<-DocumentTermMatrix(test)
test_dtms<-removeSparseTerms(test_dtm,0.99)
findAssocs(test_dtms,"housekeeping",corlimit = 0.1)
And the returning result from R is:
$housekeeping
bath room towel
1 1 1
Notice that the word "dirty" occurs in 3 of the 4 documents, whereas the returned keywords each occur in only one document.
Does anyone have any idea what went wrong in my script, or whether there is a better way to do this?
The result I want to achieve is a model that reflects the words occurring frequently with the search term across all documents, not within a specific document. I've tried combining the 4 documents into 1, but that doesn't work, since findAssocs doesn't work on a single document.
Any advice?
How about an alternative, using the quanteda package? It imposes no mystery restrictions on the correlations returned, and has many other options (see ?similarity).
require(quanteda)
testDfm <- dfm(unlist(test), verbose = FALSE)
## Document-feature matrix of: 4 documents, 7 features.
## 4 x 7 sparse Matrix of class "dfmSparse"
## features
## docs housekeeping bath towel room dirty very super
## text1 2 1 1 1 0 0 0
## text2 1 0 0 0 1 0 0
## text3 1 0 0 0 1 1 0
## text4 1 0 0 0 1 0 1
similarity(testDfm, "housekeeping", margin = "features")
## similarity Matrix:
## $housekeeping
## bath towel room very super dirty
## 1.0000 1.0000 1.0000 -0.3333 -0.3333 -1.0000
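As background on why "dirty" disappears: findAssocs() reports plain Pearson correlations between the term columns of the DTM, and here "dirty" is perfectly negatively correlated with "housekeeping" (it is absent exactly from the document where "housekeeping" occurs twice), so it falls below corlimit. A quick check with the objects from the question:
m <- as.matrix(test_dtm)
cor(m[, "housekeeping"], m[, "dirty"])   # -1: absent only where housekeeping peaks
cor(m[, "housekeeping"], m[, "bath"])    #  1: present exactly where housekeeping occurs twice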

R tree-based methods like randomForest, adaboost: interpret result of same data with different format

Suppose my dataset is a 100 x 3 matrix filled with categorical variables. I would like to do binary classification on the response variable. Let's make up a dataset with following code:
set.seed(2013)
y <- as.factor(round(runif(n=100,min=0,max=1),0))
var1 <- rep(c("red","blue","yellow","green"),each=25)
var2 <- rep(c("shortest","short","tall","tallest"),25)
df <- data.frame(y,var1,var2)
The data looks like this:
> head(df)
y var1 var2
1 0 red shortest
2 1 red short
3 1 red tall
4 1 red tallest
5 0 red shortest
6 1 red short
I tried to do random forest and adaboost on this data with two different approaches. The first approach is to use the data as it is:
> library(randomForest)
> randomForest(y~var1+var2,data=df,ntrees=500)
Call:
randomForest(formula = y ~ var1 + var2, data = df, ntrees = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 44%
Confusion matrix:
0 1 class.error
0 29 22 0.4313725
1 22 27 0.4489796
----------------------------------------------------
> library(ada)
> ada(y~var1+var2,data=df)
Call:
ada(y ~ var1 + var2, data = df)
Loss: exponential Method: discrete Iteration: 50
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 34 17
1 16 33
Train Error: 0.33
Out-Of-Bag Error: 0.33 iteration= 11
Additional Estimates of number of iterations:
train.err1 train.kap1
10 16
The second approach is to transform the dataset into wide format and treat each category as a variable. The reason I am doing this is that my actual dataset has 500+ levels in var1 and var2, and as a result tree partitioning always divides the 500 categories into just 2 splits; a lot of information is lost by doing that. To transform the data:
id <- 1:100
library(reshape2)
tmp1 <- dcast(melt(cbind(id,df),id.vars=c("id","y")),id+y~var1,fun.aggregate=length)
tmp2 <- dcast(melt(cbind(id,df),id.vars=c("id","y")),id+y~var2,fun.aggregate=length)
df2 <- merge(tmp1,tmp2,by=c("id","y"))
The new data looks like this:
> head(df2)
id y blue green red yellow short shortest tall tallest
1 1 0 0 0 2 0 0 2 0 0
2 10 1 0 0 2 0 2 0 0 0
3 100 0 0 2 0 0 0 0 0 2
4 11 0 0 0 2 0 0 0 2 0
5 12 0 0 0 2 0 0 0 0 2
6 13 1 0 0 2 0 0 2 0 0
I apply random forest and adaboost to this new dataset:
> library(randomForest)
> randomForest(y~blue+green+red+yellow+short+shortest+tall+tallest,data=df2,ntrees=500)
Call:
randomForest(formula = y ~ blue + green + red + yellow + short + shortest + tall + tallest, data = df2, ntrees = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 39%
Confusion matrix:
0 1 class.error
0 32 19 0.3725490
1 20 29 0.4081633
----------------------------------------------------
> library(ada)
> ada(y~blue+green+red+yellow+short+shortest+tall+tallest,data=df2)
Call:
ada(y ~ blue + green + red + yellow + short + shortest + tall +
tallest, data = df2)
Loss: exponential Method: discrete Iteration: 50
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 36 15
1 20 29
Train Error: 0.35
Out-Of-Bag Error: 0.33 iteration= 26
Additional Estimates of number of iterations:
train.err1 train.kap1
5 10
The results from the two approaches are different, and the difference becomes more obvious as we introduce more levels into each variable, i.e., var1 and var2. My question is: since we are using exactly the same data, why are the results different? How should we interpret the results from both approaches? Which is more reliable?
While these two models look identical, they are fundamentally different from one another. In the second model, you implicitly include the possibility that a given observation may have multiple colors and multiple heights. The correct choice between the two model formulations depends on the characteristics of your real-world observations. If these characteristics are exclusive (i.e., each observation has a single color and a single height), the first formulation of the model is the right one to use. However, if an observation may be both blue and green, or any other color combination, you may use the second formulation. From a quick look at your original data, the first formulation seems most appropriate (how would an observation have multiple heights?).
Also, why did you code your indicator columns in df2 as 0s and 2s instead of 0/1? I wonder whether that will have any impact on the fit, depending on whether the data is coded as factor or numeric.
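As a side note, if you do want explicit 0/1 indicator columns (rather than the 0/2 counts produced by the length aggregation), model.matrix() is a more conventional route. A rough sketch on the df from the question:
# full dummy coding (no reference level dropped) for var1 and var2
X <- model.matrix(~ var1 + var2 - 1, data = df,
                  contrasts.arg = lapply(df[c("var1", "var2")], contrasts, contrasts = FALSE))
df2_alt <- data.frame(y = df$y, X)
head(df2_alt)   # columns like var1red, var2shortest, ... coded 0/1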
