I am trying to read some PDF documents using the tm package for text mining in R. However, my PDFs are in German and I don't know how to deal with their special characters.
library(tm)
pathname <- "J:/branchwarren/docs/tm/"
raw_corpus <- VCorpus(DirSource(directory=pathname, encoding="UTF-8"), readerControl=list(reader=readPDF, language="de"))
tdm <- TermDocumentMatrix(raw_corpus)
tdm_mat <- as.data.frame(as.matrix(tdm))
The output tdm_mat looks like this, for example (the columns are the frequencies in each PDF):
1 geschã¤ftsverlauf 9 9 1 3 0 0
2 gesellschaft 1 3 1 1 1 1
3 gesellschaft. 0 0 1 1 1 0
4 gesellschaftskapital 1 1 1 1 1 1
5 gestaltung 1 1 1 1 1 1
6 gesteigert 0 0 2 0 2 6
7 gesunden 0 1 0 1 1 1
8 gewinnreserve 1 1 1 1 1 1
9 gewinnverwendung) 1 1
As you can see, the term in the first row is not displayed correctly; it should be geschäftsverlauf.
Any help or suggestions? Thanks in advance!
Too long for a comment, but e.g. this works for me as expected:
library(tm)
dir.create(pathname <- tempfile())
writeLines("Der Geschäftsbericht war gut. Die Maßnahmen griffen.", tf <- tempfile(fileext = ".md"))
rmarkdown::render(input=tf, output_format="pdf_document", output_file="1.pdf", output_dir=pathname)
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) { # see ?readPDF
  raw_corpus <- VCorpus(DirSource(directory=pathname, encoding="UTF-8"), readerControl=list(reader=readPDF, language="de"))
  tdm <- TermDocumentMatrix(raw_corpus)
  tdm_mat <- as.data.frame(as.matrix(tdm))
  tdm_mat
}
# 1.pdf
# der 1
# die 1
# geschäftsbericht 1
# griffen. 1
# gut. 1
# maßnahmen 1
# war 1
My sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
...
tm_0.6-2 NLP_0.1-8
...
Maybe an encoding mismatch? Try providing sample input data plus your sessionInfo() output so others can reproduce and debug the error.
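If the corpus already contains mojibake like "ã¤" (the classic signature of UTF-8 bytes decoded as latin1), one possible repair, as a minimal hedged sketch only, is to re-serialize the mangled characters back to latin1 bytes and then re-declare those bytes as UTF-8 before building the matrix:
fix_mojibake <- function(x) {
  y <- iconv(x, from = "UTF-8", to = "latin1")  # recover the raw byte sequence
  Encoding(y) <- "UTF-8"                        # reinterpret those bytes as UTF-8
  y
}
raw_corpus <- tm_map(raw_corpus, content_transformer(fix_mojibake))
tdm <- TermDocumentMatrix(raw_corpus)  # terms should now show "ä" etc.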
One day I tried to run my routine cspade sequence mining in R and it suddenly failed with an error and some very strange output printed to the console. Here is the example code:
library(arulesSequences)
data(zaki)
cspade(zaki, parameter=list(support=0.5))
It produces very long output (even with the option control=list(verbose=F)), followed by an error:
CONF 4 9 2.7 2.5
MINSUPPORT 2 4
MINMAX 1 4
1 SUPP 4
2 SUPP 4
4 SUPP 2
6 SUPP 4
numfreq 4 : 0 SUMSUP SUMDIFF = 0 0
EXTRARYSZ 2465792
OPENED C:\Users\Dawid\AppData\Local\Temp\Rtmp279Wy5\cspade2cd4751e5905.idx
OFF 9 38
Wrote Offt 0.00099802
BOUNDS 1 5
WROTE INVERT 0.000998974
Total elapsed time 0.00299406
MINSUPPORT 2 out of 4 sequences
1 -- 4 4
2 -- 4 4
4 -- 2 2
6 -- 4 4
1 6 -- 3 3
2 6 -- 4 4
4 -> 6 -- 2 2
4 -> 2 6 -- 2 2
1 2 6 -- 3 3
1 2 -- 3 3
4 -> 2 -- 2 2
2 -> 1 -- 2 2
4 -> 1 -- 2 2
6 -> 1 -- 2 2
4 -> 6 -> 1 -- 2 2
2 6 -> 1 -- 2 2
4 -> 2 6 -> 1 -- 2 2
4 -> 2 -> 1 -- 2 2
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file
'C:\Users\Dawid\AppData\Local\Temp\Rtmp279Wy5\cspade2cd4751e5905.out': No
such file or directory
It looks like it is printing the mined rules to the console (which has never happened before), and it ends with an error, so I can't assign the rules to a variable. Perhaps there is a problem with writing temporary files?
My configuration:
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Packages:
arulesSequences_0.2-19
arules_1.6-1
(A newer version of arulesSequences exists, but the latest version, arulesSequences_0.2-20, fails in the same way.)
Thank you!
One workaround is to use the R console, not RStudio.
Well, it should work fine then. I see that more people have the same problem. I have tried reinstalling RStudio, reinstalling the packages, and using an older RStudio version, but it didn't work.
I hope this helps, but I would be grateful for a full answer. Thanks!
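In the meantime, it may be worth confirming that the temp-file round trip cspade relies on works in your session at all; a minimal diagnostic sketch (not a fix):
tf <- tempfile(fileext = ".out")
writeLines("test", tf)
readLines(tf)  # should print "test"; if not, check permissions or antivirus
unlink(tf)
tempdir()      # the directory cspade writes its .idx/.out files to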
I have a data frame with this structure:
#Load lexicon
Lexicon_DF <- read.csv("LexiconFrancais.csv",header=F, sep=";")
The structure of "LexiconFrancais.csv" is like this:
French Translation (Google Translate);Positive;Negative
un dos;0;0
abaque;0;0
abandonner;0;1
abandonné;0;1
abandon;0;1
se calmer;0;0
réduction;0;0
abba;1;0
abbé;0;0
abréger;0;0
abréviation;0;0
> Lexicon_DF
V1 V2 V3
1 French Translation (Google Translate) Positive Negative
2 un dos 0 0
3 abaque 0 0
4 abandonner 0 1
5 abandonné 0 1
6 abandon 0 1
7 se calmer 0 0
8 réduction 0 0
9 abba 1 0
10 abbé 0 0
11 abréger 0 0
12 abréviation 0 0
I am trying to stem the first column of the data frame. To do this, I did:
Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
But after this command, only the first column remains in Lexicon_DF; the other two columns disappear.
> Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
> Lexicon_DF
[1] "French Translation (Google Translate)" "un dos" "abaqu"
[4] "abandon" "abandon" "abandon"
[7] "se calm" "réduct" "abba"
[10] "abbé" "abreg" "abrévi"
How can I do the stemming without losing the other two columns?
Thank you
You are trying to replace the whole content of Lexicon_DF with the output of wordStem.
Try this:
Lexicon_DF$V1 <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
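A fuller sketch of the same idea, assuming the first CSV row really is a header (so it should be read with header = TRUE rather than stemmed along with the words; the column names below are hypothetical):
Lexicon_DF <- read.csv("LexiconFrancais.csv", header = TRUE, sep = ";",
                       stringsAsFactors = FALSE)
names(Lexicon_DF) <- c("French", "Positive", "Negative")  # hypothetical names
Lexicon_DF$French <- SnowballC::wordStem(Lexicon_DF$French, language = "fr")
head(Lexicon_DF)  # the Positive/Negative columns are untouched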
I have started working on a project which requires Natural Language Processing and building a Support Vector Machine (SVM) model in R.
I’d like to generate a Term Document Matrix with all the tokens.
Example:
library(NLP)     # annotate(), AnnotatedPlainTextDocument(), sents()
library(openNLP) # the Maxent_* annotators
library(tm)      # TermDocumentMatrix(), as.VCorpus()
testset <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)
[[1]]
[1] "From" "month" "2" "the" "AST" "and" "total"
[8] "bilirubine" "were" "not" "measured" "."
[[2]]
[1] "16:OTHER" "-"
[3] "COMMENT" "REQUIRED"
[5] "IN" "COMMENT"
[7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"
[9] "consent" "not"
[11] "offered" "until"
[13] "T4" "."
[[3]]
[1] "M6" "is" "13" "days" "out" "of" "the" "visit" "window"
And then I generated a TDM:
tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity : 0%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms NULL
16:other 1
and 1
ast 1
bilirubine 1
column;07/02/2004/genotyping;sf- 1
comment 2
consent 1
days 1
from 1
genotyping 1
measured 1
month 1
not 2
offered 1
out 1
required 1
the 2
total 1
until 1
visit 1
were 1
window 1
I actually have three documents in the dataset:
"From month 2 the AST and total bilirubine were not measured.",
"16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
"M6 is 13 days out of the visit window" so it should have shown 3 columns of documents.
But I only have one column shown here.
Could anyone please give me some advice on this?
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-2 openxlsx_3.0.0 magrittr_1.5 RWeka_0.4-28 openNLP_0.2-6 NLP_0.1-9
[7] rJava_0.9-8
I think what you are trying to do is take a list of three strings and turn it into a corpus. I am not sure that three strings in a single list count as three different documents.
I took your data, put it into three .txt files, and ran this:
text_name <- file.path("C:", "texts")
dir(text_name)
[1] "text1.txt" "text2.txt" "text3.txt"
If you don't want to do any cleaning, you can convert it directly to a corpus:
docs <- Corpus(DirSource(text_name))
summary(docs)
Length Class Mode
text1.txt 2 PlainTextDocument list
text2.txt 2 PlainTextDocument list
text3.txt 2 PlainTextDocument list
dtm <- DocumentTermMatrix(docs)
dtm
<<DocumentTermMatrix (documents: 3, terms: 22)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
tdm <- TermDocumentMatrix(docs)
tdm
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms text1.txt text2.txt text3.txt
16:other 0 1 0
and 1 0 0
ast 1 0 0
bilirubine 1 0 0
column;07/02/2004/genotyping;sf- 0 1 0
comment 0 2 0
consent 0 1 0
days 0 0 1
from 1 0 0
genotyping 0 1 0
measured. 1 0 0
month 1 0 0
not 1 1 0
offered 0 1 0
out 0 0 1
required 0 1 0
the 1 0 1
total 1 0 0
until 0 1 0
visit 0 0 1
were 1 0 0
window 0 0 1
I think you might want to create three different documents and then convert them into a corpus. Let me know if this helps.
So, assuming you want each row in your column of text treated as a document, convert the vector to a data frame:
df <- data.frame(testset)
install.packages("tm")  # note: install.packages(), not install.package()
library(tm)
docs <- Corpus(VectorSource(df$testset))
summary(docs)
Length Class Mode
1 2 PlainTextDocument list
2 2 PlainTextDocument list
3 2 PlainTextDocument list
Follow the steps from the previous answer after this to get your TDM. This should solve your problem.
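For completeness, the final step would look like this (a minimal sketch continuing from docs above):
tdm <- TermDocumentMatrix(docs)
inspect(tdm)  # should now report documents: 3, one column per string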
So I'm trying to create bigrams and trigrams from a given set of text, which just happens to be Chinese. At first glance, the tau package seems almost perfect for the application. Given the following set-up, I get close to what I want:
library(tau)
q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
textcnt(q,method="ngram",n=3L,decreasing=TRUE)
The only problem is that the output is given as Unicode code-point strings, not the characters themselves. So I get something like:
_ + < <U <U+ > U U+ 9 +5 5 U+5 >_ _< _<U +59 59 2 29 29> 592 7 92
22 19 19 19 19 19 19 19 17 14 14 14 11 11 11 9 9 8 8 8 8 8 8
929 9> >< ><U 9>_ E +5E 3 3> 3>_ 5E 5E7 6 73 73> A E7 E73 4 8 9>< A> +6
8 8 8 8 5 5 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 2
+7 4> 4>< 7A A>< C U+6 U+7 +4 +4E +5F +66 +6C +76 +7A 0 0A 0A> 1 14 14> 4E 4EC
2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
597 5F 5F8 60 60A 66 660 68 684 6C 6C1 76 768 7A7 7A> 7D 7D> 84 84> 88 88> 8> 8><
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
97 97D A7 A7A A>_ C1 C14 CA CA> D D> D>_ EC ECA F F8 F88 U+4
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I tried to write something that would perform a similar function, but I can't wrap my head around the code for anything more than a unigram (apologies if the code is inefficient or ugly; I'm doing my best here). The advantage of this method is also that I can get word counts within individual "documents" by simply examining DTM, which is kind of nice.
data <- c(NA, NA, NA)  # dummy first row (dropped below)
names(data) <- c("doc", "term", "freq")
for(i in 1:length(q)){
  # tabulate the single characters in document i
  temp <- data.frame(i, table(strsplit(q[i], "")))
  names(temp) <- c("doc", "term", "freq")
  data <- rbind(data, temp)
}
data <- data[-1,]  # drop the dummy row
DTM <- xtabs(freq ~ doc + term, data)  # document-term matrix of counts
colSums(DTM)  # overall character frequencies
This actually gives a nice little output:
天 平 空 昊 今 好 很 气 的
8 4 1 1 1 1 1 1 1
Does anyone have any suggestions for using tau or altering my own code to achieve bigrams and trigrams for my Chinese characters?
Edit:
As requested in the comments, here is my sessionInfo() output:
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tau_0.0-15
loaded via a namespace (and not attached):
[1] tools_3.0.0
The stringdist package will do that for you:
> library(stringdist)
> q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
> v1 <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
> t(qgrams(v1, q=1))
V1
天 8
平 4
空 1
昊 1
...
> v2 <- c("天气气","平","很好平","天空天空天空","昊天","今天的天天气很好")
> t(qgrams(v2, q=2))
V1
天气 2
气气 1
空天 2
天空 3
天的 1
天天 3
今天 1
...
The reason I transpose the returned matrices is that R renders the matrices incorrectly with regard to the column width, which happens to be the length of the Unicode-ID character string (e.g. "<U+6C14><U+6C14>").
In case you are interested in further details about the stringdist package, I recommend this text: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms ;)
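To answer the bigram/trigram part directly: qgrams generalizes by raising q; a minimal sketch using the v1 vector from above:
t(qgrams(v1, q = 2))  # bigrams
t(qgrams(v1, q = 3))  # trigrams; only strings of length >= 3 contribute any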
I have some sample code in R as follows:
library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)
where I used a CSV file as input, containing the following data:
23732 23778 23824 23871 58009 58098 58256
23732 0 8 0 1 0 10 0
23778 8 0 1 15 0 1 0
23824 0 1 0 0 0 0 0
23871 1 15 0 0 1 5 0
58009 0 0 0 1 0 7 0
58098 10 1 0 5 7 0 1
58256 0 0 0 0 0 1 0
After this I used the following command to check the weight values:
E(net)$weight
The expected output is something like this:
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
But I'm getting weird values (and different ones every time):
> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316
I'm unable to figure out where and what I am doing wrong.
Please help me get the correct expected result, and please also tell me why the output is this weird and why it is different every time I run it.
Thanks,
Nitin
Just a small working example below, much clearer than CSV input.
library('igraph');
adjm1 <- matrix(sample(0:1, 100, replace=TRUE, prob=c(0.9, 0.1)), nc=10)
g1 <- graph.adjacency(adjm1)
plot(g1)
P.S. ?graph.adjacency has a lot of good examples (remember to run library('igraph') first).
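A weighted variant of the same toy example, as a minimal sketch; the matrix is symmetrized because mode="undirected" expects symmetric input:
adjm2 <- matrix(sample(0:5, 100, replace=TRUE), nc=10)
adjm2 <- adjm2 + t(adjm2)  # make it symmetric
g2 <- graph.adjacency(adjm2, mode="undirected", weighted=TRUE, diag=FALSE)
E(g2)$weight               # the nonzero entries become edge weights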
Related threads
Creating co-occurrence matrix
Co-occurrence matrix using SAC?
The problem seems to be due to the data type of the matrix elements: graph.adjacency expects elements of type numeric. Not sure if it's a bug.
After you do,
m <- as.matrix(dat)
set its mode to numeric by:
mode(m) <- "numeric"
And then do:
net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
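To confirm the diagnosis before converting, a quick check, as a minimal sketch:
mode(m)        # anything other than "numeric" will confuse graph.adjacency
is.numeric(m)  # should be TRUE after mode(m) <- "numeric"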