So I already have a TDM, but it was in Excel, so I saved it as CSV. Now I want to do some analysis, but I can't load it as a TDM using the tm package. My CSV looks something like this:
item01 item02 item03 item04
red 0 1 1 0
circle 1 0 0 1
fame 1 0 0 0
yellow 0 0 1 1
square 1 0 1 0
So I haven't been able to load that file as a TDM; the best I've tried so far is this:
myDTM <- as.DocumentTermMatrix(df, weighting = weightBin)
But it loads 1s into every cell:
<<DocumentTermMatrix (documents: 2529, terms: 1952)>>
Non-/sparse entries: 4936608/0
Sparsity : 0%
Maximal term length: 27
Weighting : binary (bin)
Sample :
Terms
Docs item01 item02 item03 item04
Red 1 1 1 1
Circle 1 1 1 1
fame 1 1 1 1
I've tried converting to a Corpus first, among other things, but if I try to use any function like inspect(tdm) it returns an error like this or similar:
Error in `[.simple_triplet_matrix`(x, docs, terms) :
I really don't believe there isn't a way to import it in the right format. Any suggestions? Thanks in advance.
Try first converting the CSV to a sparse matrix. My CSV is different from yours because I typed it myself, but it's the same idea.
> library(tm)
> library(Matrix)
> myDF <- read.csv("my.csv",row.names=1,colClasses=c('character',rep('integer',4)))
> mySM <- Matrix(as.matrix(myDF),sparse=TRUE)
> myDTM <- as.DocumentTermMatrix(mySM,weighting = weightBin)
> inspect(myDTM)
<<DocumentTermMatrix (documents: 5, terms: 4)>>
Non-/sparse entries: 7/13
Sparsity : 65%
Maximal term length: 6
Weighting : binary (bin)
Sample :
Terms
Docs item01 item02 item03 item04
circle 1 1 0 0
fame 1 0 0 0
red 0 0 0 0
square 1 0 1 0
yellow 0 0 1 1
>
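If what you need is specifically a TermDocumentMatrix (in your CSV the rows look like terms and the columns like documents), the same sparse matrix should coerce directly. A minimal sketch, assuming tm's as.TermDocumentMatrix accepts the Matrix object just as as.DocumentTermMatrix did above:
# rows become terms and columns become documents
myTDM <- as.TermDocumentMatrix(mySM, weighting = weightBin)
inspect(myTDM)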
Related
The behaviour of the tm package has changed between versions 0.6-2 and 0.7-x.
In the new version, DocumentTermMatrix does not preserve intra-word dashes. Is this a bug, or is there a new option to enforce the old behaviour? An example follows below, using both tm versions installed in different paths. I am running R 3.3.3.
> string1 <- "big data data analysis machine learning project management"
> string2 <- "big-data data-analysis machine-learning project-management"
>
> two_strings <- c(string1, string2)
>
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.6-2")
> myCorpus <- Corpus(VectorSource(two_strings))
> dtm_0.6 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.6)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project project-management
1 1 1 0
2 0 0 1
So with the old version 0.6-2 the dashes in the second string are correctly preserved. With the new version 0.7-3 instead:
> detach("package:tm", unload=TRUE)
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> dtm_0.7 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
I tried to enforce the preservation of dashes as follows, but to no avail:
> dtm_test <- DocumentTermMatrix(myCorpus,
+ control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE)))
> inspect(dtm_test)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
Any advice? Thanks!
The answer came from the tm author himself, Dr. Ingo Feinerer (thanks!). Reproducing it here:
Since 0.7 the default corpus is a "SimpleCorpus" (if supported; that
depends on the source). See ?SimpleCorpus
That triggers a certain behavior (see in ?TermDocumentMatrix).
Use VCorpus instead of Corpus to enforce the old behavior:
inspect(TermDocumentMatrix(Corpus(VectorSource(two_strings))))
inspect(TermDocumentMatrix(VCorpus(VectorSource(two_strings))))
Returning to my example above, using now VCorpus:
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> myVCorpus <- VCorpus(VectorSource(two_strings))
> dtm_0.7 <- DocumentTermMatrix(myVCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project
1 1 1
2 0 0
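For completeness, a quick way to see why the two constructors behave differently is to check the corpus class each one returns (a sketch; since tm 0.7, Corpus() builds a SimpleCorpus for vector sources):
# SimpleCorpus triggers the simplified tokenization described above
class(Corpus(VectorSource(two_strings)))   # "SimpleCorpus" "Corpus"
class(VCorpus(VectorSource(two_strings)))  # "VCorpus" "Corpus"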
I have a data frame df with this structure:
Rank Review
5 good film
8 very good film
..
Then I tried to create a document-feature matrix using the quanteda package:
mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)
I would like to calculate, for each feature (term), the chi-squared value against the documents, in order to extract the best features in terms of chi-squared value.
Can you help me resolve this problem, please?
EDIT:
> head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
features
docs bon accueil conseillèr efficac écout répond
text1 0 0 0 0 0 0
text2 1 1 1 1 1 1
text3 0 0 0 0 0 0
text4 0 0 0 0 0 0
text5 0 0 1 0 0 0
text6 0 0 0 0 1 0
...
text60300 0 0 1 1 1 1
Here I have my dfm matrix; then I create my tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
I would like to determine the chi-squared value between these features and the documents (here I have 60,300 documents):
textstat_keyness(mydfm, target = 2)
But since I have 60,300 targets, I don't know how to do this automatically.
I see in the quanteda manual that the groups option of the dfm function may resolve this problem, but I don't see how to use it. :(
EDIT 2:
Rank Review
10 always good
1 nice film
3 fine as usual
Here I try to group the documents with dfm:
mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)
But it fails to group the documents.
Can you please help me resolve this problem?
Thank you.
See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to set a particular document's frequencies against all other frequencies. e.g.
textstat_keyness(mydfm, target = 1)
for the first document against the frequencies of all others, or
textstat_keyness(mydfm, target = 2)
for the second against all others, etc.
If you want to compare categories of frequencies that group documents, you need to use the groups = option in dfm() with a supplied variable or one in the docvars. See the example in ?textstat_keyness.
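For example, a minimal sketch of the grouped comparison, assuming the df data frame with Rank and Review columns from the question:
# build the dfm grouped by Rank, so each document is one rank category
mydfm_grouped <- dfm(df$Review, remove = stopwords("english"),
                     stem = TRUE, groups = df$Rank)
# chi-squared keyness of the first rank group against all the others
textstat_keyness(mydfm_grouped, target = 1)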
I have a column in my data frame (df) as follows:
> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
The column has 4k+ unique first/last/nick names, given as a list of full names on each row as shown above. I would like to create a DocumentTermMatrix for this column in which full-name matches are found and only the most frequently occurring names are used as columns. I have tried the following code:
> people_list = strsplit(people, ", ")
> corp = Corpus(VectorSource(people_list))
> dtm = DocumentTermMatrix(corp, people_dict)
where people_dict is a list of the most commonly occurring people (~150 full names of people) from people_list as follows:
> people_dict[1:3]
[[1]]
[1] "Christian Slater"
[[2]]
[1] "Tara Reid"
[[3]]
[1] "Stephen Dorff"
However, the DocumentTermMatrix function seems not to be using people_dict at all, because I get far more columns than there are entries in people_dict. Also, I think DocumentTermMatrix is splitting each name into multiple strings; for example, "Danny Devito" becomes one column for "Danny" and another for "Devito".
> inspect(actors_dtm[1:5,1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity : 100%
Maximal term length: 9
Weighting : term frequency (tf)
Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
I have read through all the tm documentation I can find, and I have spent hours searching Stack Overflow for a solution. Please help!
The default tokenizer splits text into individual words. You need to provide a custom tokenizer function:
commasplit_tokenizer <- function(x)
unlist(strsplit(as.character(x), ", "))
Note that you do not split the actors apart before creating the corpus:
people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")
The control options didn't work with a plain Corpus, so I used VCorpus:
corp = VCorpus(VectorSource(people))
dtm = DocumentTermMatrix(corp, control = list(tokenize =
commasplit_tokenizer, dictionary = people_dict, tolower = FALSE))
All of the options are passed within control, including:
tokenize (the custom tokenizer function)
dictionary (the character vector of names to keep)
tolower = FALSE
Results:
as.matrix(dtm)
Terms
Docs Nia Long Stephen Dorff Uma Thurman
1 0 1 0
2 1 0 0
3 0 0 1
I hope this helps
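Applied to the asker's original column, the same call would look like this (a sketch; since people_dict was built as a list, it is flattened to the character vector that the dictionary control option expects):
corp <- VCorpus(VectorSource(df$people))
dtm <- DocumentTermMatrix(corp, control = list(tokenize = commasplit_tokenizer,
    dictionary = unlist(people_dict), tolower = FALSE))
inspect(dtm[1:5, 1:5])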
I am using the removeSparseTerms method in R, and it requires a threshold value as input. I have also read that the higher the value, the more terms are retained in the returned matrix.
How does this method work, and what is the logic behind it? I understand the concept of sparseness, but does this threshold indicate in how many documents a term must be present, or some other ratio?
In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the threshold of relative document frequency for a term, above which the term will be removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), sparsity is smaller as it approaches 1.0. (Note that sparsity cannot take values of 0 or 1.0, only values in between.)
For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms that are more sparse than 0.99.
The exact interpretation of sparse = 0.99 is that term $j$ will be retained only if
$df_j > N(1 - 0.99)$, where $N$ is the number of documents; in this case, probably all terms will be retained (see the examples below).
Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents, and in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)
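To make the retention rule concrete, here is a minimal sketch with hypothetical document frequencies:
# term j is kept when df_j > N * (1 - sparse); names and counts are made up
N <- 100                                    # number of documents
df_j <- c(the = 100, model = 40, rare = 1)  # document frequencies per term
sparse <- 0.99
names(df_j[df_j > N * (1 - sparse)])        # "the" and "model" survive; "rare" is dropped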
An example of the sparsity threshold of 0.99, where a term occurs in (first example) less than 0.01 of the documents, and (second example) just over 0.01 of the documents:
> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity : 0%
Maximal term length: 2
Weighting : term frequency (tf)
>
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity : 49%
Maximal term length: 2
Weighting : term frequency (tf)
Here are a few additional examples with actual text and terms:
> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
"the sparse brown furry matrix",
"the quick matrix")
> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .01))
Terms
Docs the
1 1
2 1
3 1
> as.matrix(removeSparseTerms(myTdm, .99))
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .5))
Terms
Docs brown furry matrix quick the
1 2 2 0 1 1
2 1 1 1 0 1
3 0 0 1 1 1
In the last example, with sparse = 0.5, only terms occurring in at least two of the three documents were retained.
An alternative approach for trimming terms from document-term matrices based on document frequency is the text analysis package quanteda. The equivalent functionality there refers not to sparsity but directly to the document frequency of terms (as in tf-idf).
> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
a brown fox furry jumped matrix over quick second sparse the
1 2 1 2 1 2 1 2 1 1 3
> dfm_trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
features
docs brown furry the matrix quick
text1 2 2 1 0 1
text2 1 1 1 1 0
text3 0 0 1 1 1
This usage seems much more straightforward to me.
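(In more recent quanteda releases the trimming arguments have been renamed; if minDoc is not recognized, the equivalent call should be something like the following. Check ?dfm_trim for your version.)
dfm_trim(myDfm, min_docfreq = 2)  # newer-style argument name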
In the function removeSparseTerms(), the argument sparse = x means:
"remove all terms whose sparsity is greater than the threshold (x)".
For example, removeSparseTerms(my_dtm, sparse = 0.90) removes all terms in the corpus whose sparsity is greater than 90%.
For example, a term that appears in just 4 documents out of a corpus of 1000 will have a relative document frequency of 4/1000 = 0.004.
This term's sparsity will be (1000-4)/1000 = 1- 0.004 = 0.996 = 99.6%.
Therefore, if the sparsity threshold is set to sparse = 0.90, this term will be removed, since its sparsity (0.996) is greater than the upper-bound sparsity (0.90).
However, if the threshold is set to sparse = 0.999, the term will not be removed, since its sparsity (0.996) is lower than the upper bound (0.999).
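The same arithmetic, written out as a quick R sketch (using the hypothetical counts from above):
docfreq <- 4                       # documents containing the term
n_docs  <- 1000                    # documents in the corpus
sparsity <- 1 - docfreq / n_docs   # 0.996
sparsity > 0.90                    # TRUE  -> removed when sparse = 0.90
sparsity > 0.999                   # FALSE -> kept when sparse = 0.999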
Simply put, it works like the document frequency of a term. As the value approaches 0, only terms that appear in (nearly) every document are kept, whereas as it approaches 1, (nearly) all terms are kept. For example, a threshold of 0.5 keeps only terms that appear in at least 50% of the documents. After all such preprocessing, this amounts to keeping a term when
1 - (number_of_documents_containing_the_term / total_number_of_documents) <= threshold
I have a matrix which contains genes and mRNA expression values:
ID_REF GSM362168 GSM362169 GSM362170 GSM362171 GSM362172 GSM362173 GSM362174
244901_at 5.171072 5.207896 5.191145 5.067809 5.010239 5.556884 4.879528
244902_at 5.296012 5.460796 5.419633 5.440318 5.234789 7.567894 6.908795
I wanted to find the differentially expressed genes from the matrix using a t-test, so I carried out the following:
stat=mt.teststat(control,classlabel,test="t",na=.mt.naNUM,nonpara="n")
and I get the following error:
Error in is.factor(classlabel) : object 'classlabel' not found.
I am not sure how I am supposed to assign the class labels. Is this the right way to find differentially expressed genes?
The documentation says classlabel should be a vector of integers corresponding to observation (column) class labels, but I do not understand what that means.
If you open the documentation for mt.teststat:
?mt.teststat
and scroll down to the end, you'll see an example using the "Golub data":
data(golub)
teststat <- mt.teststat(golub, golub.cl)
If you look at golub.cl, it will become clear what the classlabel vector should look like:
golub.cl
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
In this case, 0 or 1 are labels for two classes of sample. There should be as many values in the vector as you have samples, in the same order that the samples appear in the data matrix. You can also look at:
?golub
golub.cl: numeric vector indicating the tumor class, 27 acute
lymphoblastic leukemia (ALL) cases (code 0) and 11 acute
myeloid leukemia (AML) cases (code 1).
So you need to create a similar vector, with labels (0, 1, ...) for however many classes you have in your own data.
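For instance, a minimal sketch for the matrix above, assuming (hypothetically) that the first four GSM samples form one class and the last three the other; substitute the split that matches your actual experimental design:
# 7 samples in column order GSM362168 ... GSM362174
classlabel <- c(rep(0, 4), rep(1, 3))
stat <- mt.teststat(control, classlabel, test = "t")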