R text mining: how to segment a document into phrases, not terms

When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, much as in Chinese, English also has certain fixed phrases, such as "semantic distance" or "machine learning"; if you segment them into single words, the meaning is completely different. I want to know how to segment a document into phrases rather than words (terms).

You can do this in R using the quanteda package, which can detect multi-word expressions as statistical collocations; these are probably the phrases you are referring to in English. To remove collocations containing stop words, first tokenise the text, then remove the stop words, leaving a "pad" in place to prevent false adjacencies in the results (two words that were not actually adjacent before the stop words between them were removed).
require(quanteda)
pres_tokens <-
  tokens(data_corpus_inaugural) %>%
  tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%
  tokens_remove(stopwords("english"), padding = TRUE)
pres_collocations <- textstat_collocations(pres_tokens, size = 2)
head(pres_collocations)
# collocation count count_nested length lambda z
# 1 united states 157 0 2 7.893307 41.19459
# 2 let us 97 0 2 6.291128 36.15520
# 3 fellow citizens 78 0 2 7.963336 32.93813
# 4 american people 40 0 2 4.426552 23.45052
# 5 years ago 26 0 2 7.896626 23.26935
# 6 federal government 32 0 2 5.312702 21.80328
# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500])
tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon" "shall_endeavor" "high_sense" "official_act"
Using this "compounded" token set, we can now turn this into a document-feature matrix where the features consist of a mixture of original terms (those not found in a collocation) and the collocations. As can be seen below, "united" occurs alone and as part of the collocation "united_states".
pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
# features
# docs united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
# 1789-Washington 4 2 0 0 0 0 0 0 0 0
# 1793-Washington 1 0 0 0 0 0 0 0 0 0
# 1797-Adams 3 9 0 0 0 0 0 0 0 0
# 1801-Jefferson 0 0 0 0 0 0 0 0 0 0
# 1805-Jefferson 1 4 0 0 0 0 0 0 0 0
If you want a more brute-force approach, it's possible simply to create a document-by-bigram matrix this way:
# just form all bigrams
head(dfm(data_corpus_inaugural, ngrams = 2))
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
## features
## docs fellow-citizens_of of_the the_senate senate_and and_of the_house
## 1789-Washington 1 20 1 1 2 2
## 1797-Adams 0 29 0 0 2 0
## 1793-Washington 0 4 0 0 1 0
## 1801-Jefferson 0 28 0 0 3 0
## 1805-Jefferson 0 17 0 0 1 0
## 1809-Madison 0 20 0 0 2 0
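Note that dfm(x, ngrams = 2) is the older quanteda interface; in more recent quanteda versions the ngram step lives in tokens_ngrams(). A rough equivalent under the current API is sketched below (not the answer's original code; tokenisation defaults differ slightly, so the exact feature set may not match the output above):
library(quanteda)
# sketch: build all bigrams with the tokens-based API, then form the dfm
bigram_dfm <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  tokens_ngrams(n = 2) %>%
  dfm()
head(featnames(bigram_dfm))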

Related

Identify WHICH words in a document have been matched by dictionary lookup and how many times

Quanteda question.
For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much.
Put differently, I want to get a matrix of the features in each dictionary category that have been matched using the tokens_lookup and dfm_lookup functions, and their frequency per document. So not the aggregated frequency of all words in the category, but of each of them separately.
Is there an easy way to get this?
The easiest way to do this is to iterate over your dictionary "keys" (what you call "categories") and select the matches to create one dfm per key. There are a few steps needed to deal with the non-matches and the compound dictionary values (such as "not fail").
I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
The loop iterates over the dictionary keys to build up a list, each time doing the following:
select the tokens but leave a pad for ones not selected;
compound the multi-word tokens into single tokens;
rename the pad ("") to OTHER, so that we can count non-matches; and
create the dfm.
library("quanteda")
## Package version: 2.1.0
toks <- tokens(tail(data_corpus_inaugural, 3))
dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
    tokens_compound(data_dictionary_LSD2015[key]) %>%
    tokens_replace("", "OTHER") %>%
    dfm(tolower = FALSE)
  dfm_list <- c(dfm_list, this_dfm)
}
names(dfm_list) <- names(data_dictionary_LSD2015)
Now we have all of the dictionary matches for each key in a list of dfm objects:
dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
## features
## docs clouds raging storms crisis war against violence hatred badly
## 2009-Obama 1 1 2 4 2 1 1 1 1
## 2013-Obama 0 1 1 1 3 1 0 0 0
## 2017-Trump 0 0 0 0 0 1 0 0 0
## features
## docs weakened
## 2009-Obama 1
## 2013-Obama 0
## 2017-Trump 0
## [ reached max_nfeat ... 170 more features ]
##
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
## features
## docs grateful trust mindful thank well generosity cooperation
## 2009-Obama 1 2 1 1 2 1 2
## 2013-Obama 0 0 0 0 4 0 0
## 2017-Trump 1 0 0 1 0 0 0
## features
## docs prosperity peace skill
## 2009-Obama 3 4 1
## 2013-Obama 1 3 1
## 2017-Trump 1 0 0
## [ reached max_nfeat ... 246 more features ]
##
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
## features
## docs not_apologize OTHER
## 2009-Obama 1 2687
## 2013-Obama 0 2317
## 2017-Trump 0 1660
##
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
## features
## docs not_fight not_sap not_grudgingly not_fail OTHER
## 2009-Obama 0 0 1 0 2687
## 2013-Obama 1 1 0 0 2313
## 2017-Trump 0 0 0 1 1658
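If you prefer plain data frames over dfm objects, each element of the list can be reshaped with quanteda's convert(); a small usage sketch (match_tables is just an illustrative name):
match_tables <- lapply(dfm_list, convert, to = "data.frame")
head(match_tables$negative[, 1:5])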

Ordination plot with vectors coded as centroids

Using the vegan package, I tried to make an ordination plot with species as objects and environmental variables as vectors. However, the environmental variables are treated as centroids (factors) rather than as continuous vectors. Strangely, every distinct data frame value seems to become its own environmental factor level, so I think the data frame is not structured correctly. When I plot the ordination without the environmental variables, I don't get any problems.
summary(gutter.dca)
environfit = envfit(gutter.dca,gutterenv)
> head(environfit)
$vectors
NULL
$factors
Centroids:
DCA1 DCA2
vocht0,246435845 -0.2185 -1.0601
vocht0,249249249 0.1932 -1.1339
vocht0,251497006 0.0331 -2.0888
vocht0,264735265 -0.3353 -1.3403
vocht0,26911315 -0.0017 -0.9498
vocht0,272369715 -1.0733 0.0021
Species dataframe
head(gutter)
Acer.campestre Acer.pseudoplantanus Adoxa.moschatellina Aegopodium.podagraria Ajuga.reptans
Q1-1 0 0 5 0 0
Q1-2 0 70 15 20 0
Q1-3 0 15 0 0 0
Q1-4 0 3 0 0 0
Q2-1 0 3 0 0 0
Q2-2 1 0 0 0 0
Environmental variables dataframe
head(gutterenv)
vocht Ph.H2O ph.KCl mg.NO3.kg.soil mg.NH4.N.kg.soil litter.depth..cm.
1 0,26911315 7,41 6,686 2,811031105 4,674304351 7,5
2 0,246435845 7,225 6,349 2,567981088 6,735395066 6,5
3 0,264735265 7,001 6,491 2,336821354 8,400116244 5,1
4 0,325123153 6,732 5,444 2,518858082 7,684506342 8,25
5 0,446875 6,87 7,45 2,443686352 9,886923756 4
6 0,548476454 8,1 7,05 3,144954614 11,3179919 3
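Judging from the envfit output and the head(gutterenv) printout above, the environmental columns use commas as decimal separators, so read.csv has imported them as factors and envfit treats every distinct value as a factor level (hence the centroids). A minimal sketch of two ways to fix this, assuming the data came from a file such as "gutterenv.csv" (hypothetical file name):
# option 1: re-read the file, telling read.csv about the comma decimal separator
gutterenv <- read.csv("gutterenv.csv", dec = ",")
# option 2: convert the existing factor/character columns to numeric in place
gutterenv[] <- lapply(gutterenv, function(x) as.numeric(gsub(",", ".", as.character(x))))
str(gutterenv)   # all columns should now be numeric
environfit <- envfit(gutter.dca, gutterenv)
environfit       # continuous variables should now appear under $vectors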

Converting a term-document matrix to a TermDocumentMatrix supported by the tm library

I have a CSV file in which all of my documents are stored, stemmed, in term-document-matrix form, together with a categorical variable for sentiment.
I'd like to use tm's capabilities (terms frequencies etc.). Is there a way to do so, given the data I started with?
# given:
dtm = read.csv(file_path, na.strings="")
dtm$rating = as.factor(dtm$rating)
str(dtm)
# 'data.frame': 2000 obs. of 2002 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ abl : int 0 0 0 0 0 0 0 0 0 0 ...
# ...
head(dtm)
#ID abl absolut absorb accept
#1 1 0 0 0
#2 2 0 0 1
# I'd like to achieve...
tdm <- TermDocumentMatrix(dtm,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
Can you use as.TermDocumentMatrix(df, weighting = weightTf) (in the R package tm) to do what you seek?
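If passing the data frame directly does not work, one route is to go through a plain matrix and a sparse triplet matrix first. A sketch, assuming the column layout shown in the question (an ID column, one column per term, and a rating column):
library(tm)
library(slam)
# keep only the term columns; documents stay in rows, terms in columns
term_cols <- setdiff(names(dtm), c("ID", "rating"))
m <- as.matrix(dtm[, term_cols])
rownames(m) <- dtm$ID
# a TermDocumentMatrix has terms in rows, so transpose before converting
tdm <- as.TermDocumentMatrix(as.simple_triplet_matrix(t(m)), weighting = weightTf)
inspect(tdm[1:5, 1:3])
findFreqTerms(tdm, lowfreq = 10)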

using graph.adjacency() in R

I have a sample code in R as follows:
library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)
where I used a CSV file as input containing the following data:
23732 23778 23824 23871 58009 58098 58256
23732 0 8 0 1 0 10 0
23778 8 0 1 15 0 1 0
23824 0 1 0 0 0 0 0
23871 1 15 0 0 1 5 0
58009 0 0 0 1 0 7 0
58098 10 1 0 5 7 0 1
58256 0 0 0 0 0 1 0
After this I used the following command to check the weight values:
E(net)$weight
The expected output is something like this:
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
But I'm getting weird values (and different ones every time):
> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316
I'm unable to find what I am doing wrong. Please help me get the expected result, and please also tell me why the output is this strange and why it is different every time I run it.
Just a small working example below, much clearer than CSV input.
library('igraph')
adjm1 <- matrix(sample(0:1, 100, replace = TRUE, prob = c(0.9, 0.1)), nc = 10)
g1 <- graph.adjacency(adjm1)
plot(g1)
P.s. ?graph.adjacency has a lot of good examples (remember to run library('igraph')).
Related threads
Creating co-occurrence matrix
Co-occurrence matrix using SAC?
The problem seems to be due to the data type of the matrix elements: graph.adjacency expects elements of type numeric. Not sure if it's a bug.
After you do,
m <- as.matrix(dat)
set its mode to numeric by:
mode(m) <- "numeric"
And then do:
net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1

How can I calculate an inner product with an arbitrary number of columns using ddply?

I want to perform an inner product of the first D columns for each row in a data frame with a given array, W. I am trying the following:
W = c(1, 2, 3);
ddply(df, .(id), transform, inner_product=c(col1, col2, col3) %*% W);
This works but I typically may have an arbitrary number of columns. Can I generalize the above expression to handle that case?
Update:
This is an updated example as asked for in the comments:
library(kernlab);
data(spam);
W = array();
W[1:3] = seq(1,3);
spamdf = head(spam);
spamdf$id = seq(1,nrow(spamdf));
df_out=ddply(spamdf, .(id), transform, inner_product=c(make, address, all) %*% W);
> W
[1] 1 2 3
> spamdf[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id
1 0 0 3.756 61 278 spam 1
> df_out[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id inner_product
1 0 0 3.756 61 278 spam 1 3.2
The above example performs an inner product of the first three dimensions of the spam data set (available in the kernlab package) with an array W = (1, 2, 3). Here I have explicitly specified the first three dimensions as c(make, address, all).
Thus df_out[1,"inner_product"] = 3.2.
Instead I want to perform the inner product over all the dimensions without having to list all the dimensions. The conversion to a matrix and back to a data frame seems to be an expensive operation?
A strategy along the lines of the following should work:
Convert each chunk to a matrix
Perform a matrix multiplication
Convert results to data.frame
The code:
library(plyr)

set.seed(1)
df <- data.frame(
  id   = sample(1:5, 20, replace = TRUE),
  col1 = runif(20),
  col2 = runif(20),
  col3 = runif(20),
  col4 = runif(20)
)
W <- c(1, 2, 3, 4)
ddply(df, .(id), function(x) as.data.frame(as.matrix(x[, -1]) %*% W))
The results:
id V1
1 1 4.924994
2 1 5.076043
3 2 7.053864
4 2 5.237132
5 2 6.307620
6 2 3.413056
7 2 5.182214
8 2 7.623164
9 3 5.194714
10 3 6.733229
11 4 4.122548
12 4 3.569013
13 4 4.978939
14 4 5.513444
15 4 5.840900
16 4 6.526522
17 5 3.530220
18 5 3.549646
19 5 4.340173
20 5 3.955517
If you want to append a column of cross-products, you could do this (assuming W has the right number of elements to match the non-"id" columns):
df2 <- cbind(df, as.matrix(df[, -grep("id", names(df))]) %*% W )
It does not appear that the .(id) serves any useful purpose, since you are not computing a sum of cross-products within id; and if you were, you wouldn't be using transform but some other aggregating function.
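For completeness, if the goal really were a single summed cross-product per id (which is not what the question asks for), an aggregating call would look something like this sketch:
# one row per id, containing the sum of the row-wise inner products
ddply(df, .(id), function(x) {
  data.frame(inner_product_sum = sum(as.matrix(x[, -1]) %*% W))
})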
