R: Natural Language Processing for a Support Vector Machine - TermDocumentMatrix in R

I have started working on a project which requires Natural Language Processing and building a Support Vector Machine (SVM) model in R.
I'd like to generate a Term Document Matrix with all the tokens.
Example:
testset <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
library(NLP)
library(openNLP)
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
# sentence annotations must be computed before word annotations
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)
[[1]]
[1] "From" "month" "2" "the" "AST" "and" "total"
[8] "bilirubine" "were" "not" "measured" "."
[[2]]
[1] "16:OTHER" "-"
[3] "COMMENT" "REQUIRED"
[5] "IN" "COMMENT"
[7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"
[9] "consent" "not"
[11] "offered" "until"
[13] "T4" "."
[[3]]
[1] "M6" "is" "13" "days" "out" "of" "the" "visit" "window"
And then I generated a TDM:
tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity : 0%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms NULL
16:other 1
and 1
ast 1
bilirubine 1
column;07/02/2004/genotyping;sf- 1
comment 2
consent 1
days 1
from 1
genotyping 1
measured 1
month 1
not 2
offered 1
out 1
required 1
the 2
total 1
until 1
visit 1
were 1
window 1
I actually have three documents in the dataset:
"From month 2 the AST and total bilirubine were not measured.",
"16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
"M6 is 13 days out of the visit window" so it should have shown 3 columns of documents.
But I only have one column shown here.
Could anyone please give me some advice on this?
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-2 openxlsx_3.0.0 magrittr_1.5 RWeka_0.4-28 openNLP_0.2-6 NLP_0.1-9
[7] rJava_0.9-8

I think you are taking a list of three strings and trying to turn that into a corpus. I am not sure that three strings in a single list count as three different documents.
I took your data, put it into three .txt files, and ran this:
text_name <- file.path("C:", "texts")
dir(text_name)
[1] "text1.txt" "text2.txt" "text3.txt"
If you don't want to do any cleaning, you can convert it directly to a corpus:
docs <- Corpus(DirSource(text_name))
summary(docs)
Length Class Mode
text1.txt 2 PlainTextDocument list
text2.txt 2 PlainTextDocument list
text3.txt 2 PlainTextDocument list
dtm <- DocumentTermMatrix(docs)
dtm
<<DocumentTermMatrix (documents: 3, terms: 22)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
tdm <- TermDocumentMatrix(docs)
tdm
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms text1.txt text2.txt text3.txt
16:other 0 1 0
and 1 0 0
ast 1 0 0
bilirubine 1 0 0
column;07/02/2004/genotyping;sf- 0 1 0
comment 0 2 0
consent 0 1 0
days 0 0 1
from 1 0 0
genotyping 0 1 0
measured. 1 0 0
month 1 0 0
not 1 1 0
offered 0 1 0
out 0 0 1
required 0 1 0
the 1 0 1
total 1 0 0
until 0 1 0
visit 0 0 1
were 1 0 0
window 0 0 1
I think you might want to create three separate documents and then convert them into a corpus. Let me know if this helps.

So, considering you want each element of your text vector treated as a separate document, convert it to a data frame:
df <- data.frame(testset)
install.packages("tm")
library(tm)
docs <- Corpus(VectorSource(df$testset))
summary(docs)
Length Class Mode
1 2 PlainTextDocument list
2 2 PlainTextDocument list
3 2 PlainTextDocument list
Follow the steps from the previous answer after this to get your TDM; this should solve your problem.
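For completeness, here is a minimal end-to-end sketch of this approach (assuming only the tm package is attached and no cleaning steps are applied):

library(tm)

testset <- c("From month 2 the AST and total bilirubine were not measured.",
             "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
             "M6 is 13 days out of the visit window")

# Each element of the character vector becomes its own document
docs <- Corpus(VectorSource(testset))
tdm <- TermDocumentMatrix(docs)
inspect(tdm)  # should now report "documents: 3"

From there, t(as.matrix(tdm)) gives a documents-by-terms matrix that could be passed to an SVM implementation such as e1071::svm().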

Related

Component meaning of ranger.forest

I'm working with ranger, a fast implementation of Random Forests. The problem is that I have no idea how to interpret the $forest component of the result. The documentation simply says:
forest: Saved forest (If write.forest set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily
represent the column number in R.
Well, that isn't really helpful, so I tried inspecting its components myself, but their names are not self-explanatory.
> names(ranger(Species ~ ., data = iris)$forest)
[1] "dependent.varID" "num.trees"
[3] "child.nodeIDs" "split.varIDs"
[5] "split.values" "is.ordered"
[7] "class.values" "levels"
[9] "independent.variable.names" "treetype"
Some components like num.trees are trivial to understand, but things like child.nodeIDs are really mind-blowing.
> ranger(Species ~ ., data = iris)$forest$child.nodeIDs[[1]]
[[1]]
[1] 1 3 5 0 7 9 11 0 0 0 13 15 0 0 0 0 0
[[2]]
[1] 2 4 6 0 8 10 12 0 0 0 14 16 0 0 0 0 0
Is it documented somewhere?
See the documentation for the ranger::treeInfo function: https://www.rdocumentation.org/packages/ranger/versions/0.11.2/topics/treeInfo
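Building on that pointer: treeInfo() decodes these raw components into one readable data frame per tree, so you rarely need to interpret child.nodeIDs by hand (judging from the output above, it appears to hold one vector of left-child and one of right-child node IDs per tree, with 0 marking terminal nodes). A quick sketch:

library(ranger)
rf <- ranger(Species ~ ., data = iris, num.trees = 5)
# One row per node: the IDs of the left/right children, the splitting
# variable and split value, and the prediction for terminal nodes
treeInfo(rf, tree = 1)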

tm package version 0.7 does not preserve intra-word dashes in DocumentTermMatrix

The behaviour of the tm package has changed between versions 0.6-2 and 0.7-x.
In the new version, DocumentTermMatrix does not preserve intra-word dashes. Is this a bug, or is there a new option to enforce the old behaviour? An example follows below, using both tm versions installed in different library paths. I am running R 3.3.3.
> string1 <- "big data data analysis machine learning project management"
> string2 <- "big-data data-analysis machine-learning project-management"
>
> two_strings <- c(string1, string2)
>
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.6-2")
> myCorpus <- Corpus(VectorSource(two_strings))
> dtm_0.6 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.6)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project project-management
1 1 1 0
2 0 0 1
So with the old version 0.6-2 the dashes in the second string are correctly preserved. With the new version 0.7-3 instead:
> detach("package:tm", unload=TRUE)
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> dtm_0.7 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
I tried to enforce the preservation of dashes as follows, but to no avail:
> dtm_test <- DocumentTermMatrix(myCorpus,
+ control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE)))
> inspect(dtm_test)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
Any advice? Thanks!
The answer came from the tm author himself, Dr. Ingo Feinerer - thanks! Reproducing it here:
Since 0.7 the default corpus is a "SimpleCorpus" (if supported; that
depends on the source). See ?SimpleCorpus
That triggers a certain behavior (see in ?TermDocumentMatrix).
Use VCorpus instead of Corpus to enforce the old behavior:
inspect(TermDocumentMatrix(Corpus(VectorSource(two_strings))))
inspect(TermDocumentMatrix(VCorpus(VectorSource(two_strings))))
Returning to my example above, now using VCorpus:
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> myVCorpus <- VCorpus(VectorSource(two_strings))
> dtm_0.7 <- DocumentTermMatrix(myVCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project
1 1 1
2 0 0
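A quick way to check which corpus class you actually have (it is the SimpleCorpus default introduced in tm 0.7 that triggers the simplified tokenization); a small sketch:

library(tm)
two_strings <- c("big data data analysis machine learning project management",
                 "big-data data-analysis machine-learning project-management")
class(Corpus(VectorSource(two_strings)))   # includes "SimpleCorpus" in tm >= 0.7
class(VCorpus(VectorSource(two_strings)))  # includes "VCorpus"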

Umlaut ¨ with package tm (text mining in R)

I am trying to read some PDF docs using the package tm for text mining in R. However, my PDFs are in German and I don't know how to deal with the special characters.
library(tm)
pathname <- "J:/branchwarren/docs/tm/"
raw_corpus <- VCorpus(DirSource(directory=pathname, encoding="UTF-8"), readerControl=list(reader=readPDF, language="de"))
tdm <- TermDocumentMatrix(raw_corpus)
tdm_mat <- as.data.frame(as.matrix(tdm))
The output tdm_mat, for example, looks like this (the columns are the term frequencies in each PDF):
1 geschã¤ftsverlauf 9 9 1 3 0 0
2 gesellschaft 1 3 1 1 1 1
3 gesellschaft. 0 0 1 1 1 0
4 gesellschaftskapital 1 1 1 1 1 1
5 gestaltung 1 1 1 1 1 1
6 gesteigert 0 0 2 0 2 6
7 gesunden 0 1 0 1 1 1
8 gewinnreserve 1 1 1 1 1 1
9 gewinnverwendung) 1 1
As you can see, the term in the first row is not displayed correctly: it should be geschäftsverlauf.
Any help or suggestions? Thanks in advance!
Too long for a comment, but e.g. this works for me as expected:
library(tm)
dir.create(pathname <- tempfile())
writeLines("Der Geschäftsbericht war gut. Die Maßnahmen griffen.", tf <- tempfile(fileext = ".md"))
rmarkdown::render(input=tf, output_format="pdf_document", output_file="1.pdf", output_dir=pathname)
if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) { # see ?readPDF
  raw_corpus <- VCorpus(DirSource(directory=pathname, encoding="UTF-8"), readerControl=list(reader=readPDF, language="de"))
  tdm <- TermDocumentMatrix(raw_corpus)
  tdm_mat <- as.data.frame(as.matrix(tdm))
  tdm_mat
}
# 1.pdf
# der 1
# die 1
# geschäftsbericht 1
# griffen. 1
# gut. 1
# maßnahmen 1
# war 1
My sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
...
tm_0.6-2 NLP_0.1-8
...
Maybe an encoding mismatch? Try providing input data plus your sessionInfo() so others can debug and reproduce the error.
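For what it's worth, "ã¤" is the classic signature of an encoding mix-up: the UTF-8 bytes of "ä" (0xC3 0xA4) decoded as latin1 give "Ã¤", and tm's default lowercasing then turns that into "ã¤", exactly as in the output above. A minimal sketch reproducing the symptom (assuming an R session running in a UTF-8 locale):

x <- enc2utf8("geschäftsverlauf")
bytes <- charToRaw(x)             # contains c3 a4 where the "ä" is
misread <- rawToChar(bytes)
Encoding(misread) <- "latin1"     # pretend the UTF-8 bytes were latin1
tolower(iconv(misread, "latin1", "UTF-8"))
# [1] "geschã¤ftsverlauf"

So the likely fix is to make sure the encoding declared in DirSource() matches what pdftotext actually emits on your platform.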

How does the removeSparseTerms in R work?

I am using the removeSparseTerms method in R, which requires a threshold value as input. I also read that the higher the value, the more terms are retained in the returned matrix.
How does this method work, and what is the logic behind it? I understand the concept of sparseness, but does this threshold indicate how many documents a term should be present in, or some other ratio?
In the sense of the sparse argument to removeSparseTerms(), sparsity refers to a threshold of relative document frequency for a term, above which the term will be removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), the threshold becomes less restrictive as it approaches 1.0, so fewer terms are removed. (Note that sparse cannot take the values 0 or 1.0, only values in between.)
For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), this will remove only terms that are more sparse than 0.99.
The exact interpretation of sparse = 0.99 is that term $j$ is retained if
$df_j > N (1 - 0.99)$, where $N$ is the number of documents; e.g., with $N = 101$, a term must appear in more than $1.01$ documents, i.e. in at least 2. In this case probably all terms will be retained (see the examples below).
Near the other extreme, if sparse = 0.01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents, and in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)
An example with the sparsity threshold of 0.99, where a term occurs in (first example) less than 0.01 of the documents, and (second example) just over 0.01 of the documents:
> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity : 0%
Maximal term length: 2
Weighting : term frequency (tf)
>
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)),
+ weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity : 49%
Maximal term length: 2
Weighting : term frequency (tf)
Here are a few additional examples with actual text and terms:
> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
"the sparse brown furry matrix",
"the quick matrix")
> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .01))
Terms
Docs the
1 1
2 1
3 1
> as.matrix(removeSparseTerms(myTdm, .99))
Terms
Docs brown fox furry jumped matrix over quick second sparse the
1 2 2 2 1 0 1 1 1 0 1
2 1 0 1 0 1 0 0 0 1 1
3 0 0 0 0 1 0 1 0 0 1
> as.matrix(removeSparseTerms(myTdm, .5))
Terms
Docs brown furry matrix quick the
1 2 2 0 1 1
2 1 1 1 0 1
3 0 0 1 1 1
In the last example, with sparse = 0.5, only terms occurring in at least two of the three documents were retained.
An alternative approach for trimming terms from document-term matrices based on document frequency is the text analysis package quanteda. The equivalent functionality there refers not to sparsity but directly to the document frequency of terms (as in tf-idf).
> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
a brown fox furry jumped matrix over quick second sparse the
1 2 1 2 1 2 1 2 1 1 3
> dfm_trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
features
docs brown furry the matrix quick
text1 2 2 1 0 1
text2 1 1 1 1 0
text3 0 0 1 1 1
This usage seems much more straightforward to me.
In the function removeSparseTerms(), the argument sparse = x means:
"remove all terms whose sparsity is greater than the threshold x".
E.g., removeSparseTerms(my_dtm, sparse = 0.90) means: remove all terms in the corpus whose sparsity is greater than 90%.
For example, a term that appears in just 4 documents of a 1000-document corpus has a relative document frequency of 4/1000 = 0.004.
This term's sparsity is then (1000-4)/1000 = 1 - 0.004 = 0.996, i.e. 99.6%.
Therefore, if the threshold is set to sparse = 0.90, this term will be removed, as its sparsity (0.996) is greater than the upper bound (0.90).
However, if the threshold is set to sparse = 0.999, the term will be kept, as its sparsity (0.996) is lower than the upper bound (0.999).
Simply put, it acts like a document-frequency cutoff. If you set the value close to 0, only terms that appear in (almost) all documents are returned; if you set it close to 1, (almost) all terms are returned. Choosing 0.5 keeps only the terms that appear in at least 50% of the documents. After the usual preprocessing, the retention check amounts to:
1 - (number_of_documents_containing_the_term / total_number_of_documents) <= sparse_value
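To make the rule concrete, here is a minimal sketch of the retention logic on a plain documents-by-terms matrix (an illustration only, not tm's actual implementation; keep_terms is a hypothetical helper):

# Keep the terms (columns) whose sparsity does not exceed the threshold
keep_terms <- function(dtm, sparse) {
  df <- colSums(dtm > 0)  # document frequency of each term
  dtm[, (1 - df / nrow(dtm)) <= sparse, drop = FALSE]
}

m <- matrix(c(2, 1, 0,    # "brown": appears in 2 of 3 documents
              1, 0, 0),   # "jumped": appears in 1 of 3 documents
            ncol = 2, dimnames = list(NULL, c("brown", "jumped")))
keep_terms(m, 0.5)  # keeps "brown" (sparsity 1/3), drops "jumped" (sparsity 2/3)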

Error in simpleLoess: NA/NaN/Inf in foreign function call

I am trying to use normalize.loess() through lumiN() from the lumi package.
At the 38th iteration, it fails inside loess() with:
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
NA/NaN/Inf in foreign function call (arg 1)
I have searched, and the error may be related to a missing argument, but I checked with debug(loess) and all arguments are defined.
I cannot post the data because the matrix is very large (13237x566) and confidential, but I found the following:
- a minimal example works (a random 20x5 matrix)
- the normalization fails between columns 1 and 38
- the same normalization using only those columns completes successfully
- it is not a memory issue
- the matrix has no NA values
What am I missing?
Thanks
Code
raw_matrix <- lumiR('example.txt')
norm_matrix <- lumiN(raw_matrix, method='loess')
Perform loess normalization ...
Done with 1 vs 2 in iteration 1
...
Done with 1 vs 37 in iteration 1
Done with 1 vs 38 in iteration 1
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
NA/NaN/Inf in foreign function call (arg 1)
Environment
My sessionInfo() is
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] affy_1.38.1 lumi_2.12.0 Biobase_2.20.0
[4] BiocGenerics_0.6.0 BiocInstaller_1.10.2
loaded via a namespace (and not attached):
[1] affyio_1.28.0 annotate_1.38.0 AnnotationDbi_1.22.6
[4] beanplot_1.1 Biostrings_2.28.0 colorspace_1.2-4
[7] DBI_0.2-7 GenomicRanges_1.12.5 grid_3.0.2
[10] illuminaio_0.2.0 IRanges_1.18.1 KernSmooth_2.23-10
[13] lattice_0.20-24 limma_3.16.8 MASS_7.3-29
[16] Matrix_1.0-14 matrixStats_0.8.12 mclust_4.2
[19] methylumi_2.6.1 mgcv_1.7-27 minfi_1.6.0
[22] multtest_2.16.0 nleqslv_2.0 nlme_3.1-111
[25] nor1mix_1.1-4 preprocessCore_1.22.0 RColorBrewer_1.0-5
[28] reshape_0.8.4 R.methodsS3_1.5.2 RSQLite_0.11.4
[31] siggenes_1.34.0 splines_3.0.2 stats4_3.0.2
[34] survival_2.37-4 tcltk_3.0.2 tools_3.0.2
[37] XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.6.0
I somehow figured out what was not working:
I was trying to normalize a matrix that was already log2-transformed. As far as I know, normalize.loess() log-transforms the input matrix by default, so my data was going to be log-transformed twice.
This was a problem because some values in the input matrix were equal to 1, so:
log2(log2(1)) = log2(0) = -Inf
which is clearly not an allowed value during normalization.
Hope this helps someone.
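For anyone hitting this, a minimal sketch of the pitfall and a possible workaround (assuming you can call affy::normalize.loess() directly instead of going through lumiN(); raw_log2_matrix is a hypothetical placeholder for data already on the log2 scale):

log2(1)          # 0
log2(log2(1))    # log2(0) = -Inf, which poisons the loess fit

# normalize.loess() has a log.it argument; setting it to FALSE skips the
# internal log2 transform for data that is already log-scaled
norm_matrix <- affy::normalize.loess(raw_log2_matrix, log.it = FALSE)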
