R: Creating n-grams in R with Asian / Chinese characters?

So I'm trying to create bigrams and trigrams of a given set of text, which just happens to be Chinese. At first glance, the tau package seems almost perfect for the application. Given the following set-up, I get close to what I want:
library(tau)
q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
textcnt(q,method="ngram",n=3L,decreasing=TRUE)
The only problem is that the output is printed as escaped Unicode code-point strings (<U+XXXX>), not the characters themselves. So I get something like:
_ + < <U <U+ > U U+ 9 +5 5 U+5 >_ _< _<U +59 59 2 29 29> 592 7 92
22 19 19 19 19 19 19 19 17 14 14 14 11 11 11 9 9 8 8 8 8 8 8
929 9> >< ><U 9>_ E +5E 3 3> 3>_ 5E 5E7 6 73 73> A E7 E73 4 8 9>< A> +6
8 8 8 8 5 5 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 2
+7 4> 4>< 7A A>< C U+6 U+7 +4 +4E +5F +66 +6C +76 +7A 0 0A 0A> 1 14 14> 4E 4EC
2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
597 5F 5F8 60 60A 66 660 68 684 6C 6C1 76 768 7A7 7A> 7D 7D> 84 84> 88 88> 8> 8><
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
97 97D A7 A7A A>_ C1 C14 CA CA> D D> D>_ EC ECA F F8 F88 U+4
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I tried to write something that would perform a similar function, but I can't wrap my head around the code for anything more than a unigram (apologies if the code is inefficient or ugly; I'm doing my best here). One advantage of this approach is that I can get word counts within individual "documents" by simply examining DTM, which is kind of nice.
data <- c(NA, NA, NA)
names(data) <- c("doc", "term", "freq")
for (i in 1:length(q)) {
  # split document i into single characters and tabulate them
  temp <- data.frame(i, table(strsplit(q[i], "")))
  names(temp) <- c("doc", "term", "freq")
  data <- rbind(data, temp)
}
data <- data[-1,]
DTM <- xtabs(freq ~ doc + term, data)
colSums(DTM)
This actually gives a nice little output:
天 平 空 昊 今 好 很 气 的
8 4 1 1 1 1 1 1 1
Does anyone have any suggestions for using tau or altering my own code to achieve bigrams and trigrams for my Chinese characters?
Edit:
As requested in the comments, here is my sessionInfo() output:
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tau_0.0-15
loaded via a namespace (and not attached):
[1] tools_3.0.0

The stringdist package will do that for you:
> library(stringdist)
> v1 <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
> t(qgrams(v1, q=1))
V1
天 8
平 4
空 1
昊 1
...
> v2 <- c("天气气","平","很好平","天空天空天空","昊天","今天的天天气很好")
> t(qgrams(v2, q=2))
V1
天气 2
气气 1
空天 2
天空 3
天的 1
天天 3
今天 1
...
The reason I transpose the returned matrices is that R renders the matrices with the wrong column width - the width of the Unicode-ID character string (e.g. "<U+6C14><U+6C14>") rather than of the displayed characters.
In case you are interested in further details about the stringdist package, I recommend this text: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms
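For completeness, a base-R sketch in the spirit of your original loop (my own helper, not part of tau or stringdist) that builds character n-grams with strsplit() and paste(); the same table()/xtabs() steps you already use would then give a bigram DTM:
char_ngrams <- function(s, n = 2) {
  chars <- strsplit(s, "")[[1]]
  if (length(chars) < n) return(character(0))
  # paste each run of n consecutive characters into one n-gram
  sapply(1:(length(chars) - n + 1), function(i) paste(chars[i:(i + n - 1)], collapse = ""))
}
sort(table(unlist(lapply(q, char_ngrams, n = 2))), decreasing = TRUE)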

Related

R ranges: 1:0 - illogical behavior

I have an array X of length N, and I'd like to compute sum(X[(i+1):N]) - sum(X[1:(i-1)]). This works fine if my index i is within 2..(N-1). If it's equal to 1, the second term will return the first element of the array rather than exclude it. If it's equal to N, the first term will return the last element of the array rather than exclude it. seq_len is the only function I'm aware of that does the job, but only for the 2nd term (it indexes 1:n). What I need is a range function that will return NULL (rather than throw an exception like seq) when its 2nd argument is below its first; the sum function will do the rest. Is anyone aware of one, or do I have to write one myself?
I suggest an alternate path for generating indexing sequences: seq_len, which behaves intuitively at the extremes.
Bottom Line Up Front: use sum(X[-seq_len(i)]) - sum(X[seq_len(i-1)]) instead.
First, some sample data:
X <- 1:10
N <- length(X)
Your approach, at the two extremes:
i <- 1
X[(i+1):N]
# [1] 2 3 4 5 6 7 8 9 10
X[1:(i-1)] # oops
# [1] 1
That should return "nothing", I believe. (More to the point, sum(...) should return 0. For the record, sum(integer(0)) is 0.)
i <- 10
X[(i+1):N] # oops
# [1] NA 10
X[1:(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
There's your other error, where you'd expect "nothing" in the first subset.
Instead, I suggest you use seq_len:
i <- 1
X[-seq_len(i)]
# [1] 2 3 4 5 6 7 8 9 10
X[seq_len(i-1)]
# integer(0)
i <- 10
X[-seq_len(i)]
# integer(0)
X[seq_len(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
Both seem fine, and something in the middle makes sense.
i <- 5
X[-seq_len(i)]
# [1] 6 7 8 9 10
X[seq_len(i-1)]
# [1] 1 2 3 4
In this contrived example, what we're looking for at any value of i:
1: sum(2:10) - 0 = 54 - 0 = 54
2: sum(3:10) - sum(1:1) = 52 - 1 = 51
3: sum(4:10) - sum(1:2) = 49 - 3 = 46
...
10: 0 - sum(1:9) = 0 - 45 = -45
And we now get that:
func <- function(i, x) sum(x[-seq_len(i)]) - sum(x[seq_len(i-1)])
sapply(c(1,2,3,10), func, X)
# [1] 54 51 46 -45
Edit:
李哲源's answer got me thinking that you don't need to re-sum the numbers before and after every time. Just do it once and reuse it. This approach could easily be a bit faster if your vector is large.
Xb <- c(0, cumsum(X)[-N])
Xb
# [1] 0 1 3 6 10 15 21 28 36 45
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xa
# [1] 54 52 49 45 40 34 27 19 10 0
sapply(c(1,2,3,10), function(i) Xa[i] - Xb[i])
# [1] 54 51 46 -45
So this suggests that your summed value at any value of i is
Xs <- Xa - Xb
Xs
# [1] 54 51 46 39 30 19 6 -9 -26 -45
where you can find the specific value with Xs[i]. No repeated summing required.
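To package that idea, here's a small sketch (the function name is mine) that computes all N values at once from the two cumulative sums and agrees with func above:
sum_after_minus_before <- function(x) {
  N <- length(x)
  after <- c(rev(cumsum(rev(x)))[-1], 0)  # sum(x[(i+1):N]) for each i
  before <- c(0, cumsum(x)[-N])           # sum(x[1:(i-1)]) for each i
  after - before
}
all.equal(sum_after_minus_before(X), sapply(seq_len(N), func, X))
# [1] TRUE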

Feature selection in document-feature matrix by using chi-squared test

I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test.
I know a lot of people have already asked this question. However, I couldn't find the relevant code for it. (The answers just give a brief concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r)
I learned that I could use chi.squared from the FSelector package, but I don't know how to apply this function to a dfm class object (trainingtfidf below). (The manual shows it applied to a predictor variable.)
Could anyone give me a hint? I appreciate it!
Example code:
description <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
code <- c(4,3,6)
example <- data.frame(description, code)
library(quanteda)
trainingcorpus <- corpus(example$description)
trainingdfm <- dfm(trainingcorpus, verbose = TRUE, stem=TRUE, toLower=TRUE, removePunct= TRUE, removeSeparators=TRUE, language="english", ignoredFeatures = stopwords("english"), removeNumbers=TRUE, ngrams = 2)
# tf-idf
trainingtfidf <- tfidf(trainingdfm, normalize=TRUE)
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
Here's a general method for computing Chi-squared values for features. It requires that you have some variable against which to form the associations, which here could be some classification variable you are using for training your classifier.
Note that I am showing how to do this in the quanteda package, but the results should be general enough to work for other text package matrix objects. Here, I am using the data from the auxiliary quanteda.corpora package, which has all of the State of the Union addresses of US presidents.
data(data_corpus_sotu, package = "quanteda.corpora")
table(docvars(data_corpus_sotu, "party"))
## Democratic Democratic-Republican Federalist Independent
## 90 28 4 8
## Republican Whig
## 92 8
sotuDemRep <- corpus_subset(data_corpus_sotu, party %in% c("Democratic", "Republican"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- dfm(sotuDemRep, remove = stopwords("english"))
# compute chi-squared values for each feature
chi2vals <- apply(sotuDfm, 2, function(x) {
    chisq.test(as.numeric(x), docvars(sotuDemRep, "party"))$statistic
})
head(sort(chi2vals, decreasing = TRUE), 10)
## government will united states year public congress upon
## 85.19783 74.55845 68.62642 66.57434 64.30859 63.19322 59.49949 57.83603
## war people
## 57.43142 57.38697
These can now be selected using the dfm_select() command. (Note that column indexing by name would also work.)
# select just 100 top Chi^2 vals from dfm
dfmTop100cs <- dfm_select(sotuDfm, names(head(sort(chi2vals, decreasing = TRUE), 100)))
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs)
## Document-feature matrix of: 182 documents, 100 features.
## (showing first 6 documents and first 6 features)
## features
## docs citizens government upon duties constitution present
## Jackson-1830 14 68 67 12 17 23
## Jackson-1831 21 26 13 7 5 22
## Jackson-1832 17 36 23 11 11 18
## Jackson-1829 17 58 37 16 7 17
## Jackson-1833 14 43 27 18 1 17
## Jackson-1834 24 74 67 11 11 29
Added: with quanteda >= v0.9.9 this can be done using the textstat_keyness() function.
# to avoid empty factors
docvars(data_corpus_sotu, "party") <- as.character(docvars(data_corpus_sotu, "party"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- data_corpus_sotu %>%
    corpus_subset(party %in% c("Democratic", "Republican")) %>%
    dfm(remove = stopwords("english"))
chi2vals <- dfm_group(sotuDfm, "party") %>%
    textstat_keyness(measure = "chi2")
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 5 million 132.3267 0 366 131
# 6 texas 101.1991 0 174 37
This information can then be used to select the most discriminating features, after the sign of the chi^2 score is removed.
# remove sign
chi2vals$chi2 <- abs(chi2vals$chi2)
# sort
chi2vals <- chi2vals[order(chi2vals$chi2, decreasing = TRUE), ]
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 29044 commission 190.3010 0 175 588
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 29043 law 137.8330 0 607 1178
dfmTop100cs <- dfm_select(sotuDfm, head(chi2vals$feature, 100))
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs, nf = 6)
Document-feature matrix of: 6 documents, 6 features (0% sparse).
6 x 6 sparse Matrix of class "dfm"
features
docs fellow citizens senate house representatives :
Jackson-1829 5 17 2 3 5 1
Jackson-1830 6 14 4 6 9 3
Jackson-1831 9 21 3 1 4 1
Jackson-1832 6 17 4 1 2 1
Jackson-1833 2 14 7 4 6 1
Jackson-1834 3 24 5 1 3 5
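To tie this back to your small example, a hedged sketch (assuming a quanteda version that has dfm_group() and textstat_keyness(); keyness needs a binary target, so splitting on code == 4 here is purely illustrative):
docvars(trainingdfm, "code") <- example$code
grouped <- dfm_group(trainingdfm, groups = docvars(trainingdfm, "code") == 4)
head(textstat_keyness(grouped, measure = "chi2", target = "TRUE"))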

Umlaut ¨ with package tm (text mining in R)

I am trying to read some PDF docs using the tm package for text mining in R. However, my PDFs are in German and I don't know how to deal with the special characters.
library(tm)
pathname <- "J:/branchwarren/docs/tm/"
raw_corpus <- VCorpus(DirSource(directory=pathname, encoding="UTF-8"), readerControl=list(reader=readPDF, language="de"))
tdm <- TermDocumentMatrix(raw_corpus)
tdm_mat <- as.data.frame(as.matrix(tdm))
The output tdm_mat looks like this, for example (the columns are the frequencies in each PDF):
1 geschã¤ftsverlauf 9 9 1 3 0 0
2 gesellschaft 1 3 1 1 1 1
3 gesellschaft. 0 0 1 1 1 0
4 gesellschaftskapital 1 1 1 1 1 1
5 gestaltung 1 1 1 1 1 1
6 gesteigert 0 0 2 0 2 6
7 gesunden 0 1 0 1 1 1
8 gewinnreserve 1 1 1 1 1 1
9 gewinnverwendung) 1 1
As you notice, the character in the first row is not displayed correctly. It should be geschäftsverlauf.
Any help or suggestions? Thanks in advance.
Too long for a comment, but e.g. this works for me as expected:
library(tm)
dir.create(pathname <- tempfile())
writeLines("Der Geschäftsbericht war gut. Die Maßnahmen griffen.", tf <- tempfile(fileext = ".md"))
rmarkdown::render(input=tf, output_format="pdf_document", output_file="1.pdf", output_dir=pathname)
if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) { # see ?readPDF
  raw_corpus <- VCorpus(DirSource(directory=pathname, encoding="UTF-8"), readerControl=list(reader=readPDF, language="de"))
  tdm <- TermDocumentMatrix(raw_corpus)
  tdm_mat <- as.data.frame(as.matrix(tdm))
  tdm_mat
}
# 1.pdf
# der 1
# die 1
# geschäftsbericht 1
# griffen. 1
# gut. 1
# maßnahmen 1
# war 1
My sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
...
tm_0.6-2 NLP_0.1-8
...
Maybe an encoding mismatch? Try providing input data + your sessionInfo to debug & reproduce the error.
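If you want to narrow it down, a purely diagnostic sketch (which conversion is correct depends on your actual bytes, so treat both iconv() calls as candidates to try, not as a fix):
txt <- content(raw_corpus[[1]])
Encoding(txt)                                   # "unknown" often signals mis-declared bytes
head(iconv(txt, from = "latin1", to = "UTF-8")) # candidate repair in one direction
head(iconv(txt, from = "UTF-8", to = "latin1")) # and the other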

R creating dataframe

I have a sequence:
seq <- seq(5, 10)
and list of floats like:
values<-runif(20,0,15)
> values
[1] 3.9826299 5.5818585 8.5928005 13.6231168 3.0252290 13.4758453
[7] 14.1701290 9.9119669 9.4367107 0.9267941 3.0896186 2.6483513
[13] 10.3053427 5.7615558 11.5476213 7.4654886 10.7642776 14.8785914
[19] 5.7005277 11.6616783
I need to create a data frame whose first column contains the sequence, and whose second column contains, for each sequence number, the count of values greater than it. Like this:
seq sum
1 5 15
2 6 12
3 7 12
4 8 11
5 9 10
6 10 8
If I understand correctly, something like this:
> set.seed(1)
> seq<-5:10
> values<-runif(20,0,15)
> values
[1] 3.9826299 5.5818585 8.5928005 13.6231168 3.0252290 13.4758453
[7] 14.1701290 9.9119669 9.4367107 0.9267941 3.0896186 2.6483513
[13] 10.3053427 5.7615558 11.5476213 7.4654886 10.7642776 14.8785914
[19] 5.7005277 11.6616783
> data.frame(seq,sum=sapply(seq,function(x)sum(values[values>x])))
seq sum
1 5 152.8775
2 6 135.8336
3 7 135.8336
4 8 128.3681
5 9 119.7753
6 10 100.4266
Edit: from your comment, it looks like you actually want this:
> data.frame(seq,sum=sapply(seq,function(x)sum(values>x)))
seq sum
1 5 15
2 6 12
3 7 12
4 8 11
5 9 10
6 10 8
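A vectorized alternative for the counting version, if you prefer to avoid sapply(): outer() compares every sequence value against every element of values in one step, and rowSums() does the counting. It reproduces the table above.
> data.frame(seq, sum = rowSums(outer(seq, values, FUN = "<")))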

Subsetting rows by passing an argument to a function

I have the following data frame, which I imported into R using read.table() (I wrapped read.table() inside read_data(), a function I created that also throws messages in case the file name is not written appropriately):
> raw_data <- read_data("n44.txt")
[1] #### Reading txt file ####
> head(raw_data)
subject block trial_num soa target_identity prime_type target_type congruency prime_exposure target_exposure button_pressed rt ac
1 99 1 1 200 82 9 1 9 0 36 1 1253 1
2 99 1 2 102 95 2 1 2 75 36 1 1895 1
3 99 1 3 68 257 2 2 1 75 36 2 1049 1
4 99 1 4 68 62 9 1 9 0 36 1 1732 1
5 99 1 5 34 482 9 3 9 0 36 3 765 1
6 99 1 6 68 63 9 1 9 0 36 1 2027 1
Then I'm using raw_data within the early_prep() function I created (I copied only the relevant part of the function):
early_prep <- function(file_name, keep_rows = NULL, id = NULL){
  if (is.null(id)) {
    # Stops running the function
    stop("~~~~~~~~~~~ id is missing. Please provide name of id column ~~~~~~~~~~~")
  }
  # Call read_data() function
  raw_data <- read_data(file_name)
  if (!is.null(keep_rows)) {
    raw_data <- raw_data[keep_rows, ]
    # Print to console
    print("#### Deleting unnecessary rows in raw_data ####", quote = FALSE)
  }
  print(dim(raw_data))
  print(head(raw_data))
  return(raw_data)
}
My problem is with raw_data <- raw_data[keep_rows, ].
When I enter keep_rows = "raw_data$block > 1" this is what I get:
> x1 <- early_prep(file_name = "n44.txt", keep_rows = "raw_data$block > 1", id = "subject")
[1] #### Reading txt file ####
[1] #### Deleting unnecessary rows in raw_data ####
[1] 1 13
subject block trial_num soa target_identity prime_type target_type congruency prime_exposure target_exposure button_pressed rt ac
NA NA NA NA NA NA NA NA NA NA NA NA NA NA
How can I solve this so it will only delete the rows I want?
Any help will be greatly appreciated
Best,
Ayala
The problem is that you pass the condition as a string, not as an actual condition, so R can't evaluate it when you want it to.
If you still want to pass it as a string, you need to parse and evaluate it in the right place, for example:
cond <- eval(parse(text = keep_rows))
raw_data <- raw_data[cond, ]
This should work, I think
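If you control the interface, a cleaner sketch (illustrative only, not your exact read_data() setup) is to capture the condition unevaluated with substitute() and evaluate it inside the data frame, so callers write the condition without quotes and without the raw_data$ prefix:
early_prep <- function(file_name, keep_rows = NULL, id = NULL) {
  raw_data <- read_data(file_name)
  keep <- substitute(keep_rows)  # the unevaluated condition, e.g. block > 1
  if (!is.null(keep)) {
    raw_data <- raw_data[eval(keep, raw_data), ]
  }
  raw_data
}
# usage: x1 <- early_prep("n44.txt", keep_rows = block > 1, id = "subject")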
