R Document Term Matrix Truncating Words

Working with a document term matrix in R, the words seem to be getting truncated.
I create the document term matrix from a corpus as below:
library(tm)
docs <- c("All that we are is the result of what we have thought.",
"Wisely, and slow. They stumble that run fast.",
"The future belongs to those who prepare for it today.",
"Our life is frittered away by detail... simplify, simplify.",
"Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.")
myCorpus <- Corpus(VectorSource(docs))
ndocs <- length(myCorpus)
minTermFreq <- 0.05 * ndocs
maxTermFreq <- 0.6 * ndocs
myDTM <- DocumentTermMatrix(myCorpus,
                            control = list(stopwords = TRUE,
                                           wordLengths = c(3, Inf),
                                           removePunctuation = TRUE,
                                           removeNumbers = TRUE,
                                           tolower = TRUE,
                                           stemming = TRUE,
                                           remove_separators = TRUE,
                                           bounds = list(global = c(minTermFreq, maxTermFreq))))
When I look at the terms, longer ones are truncated, but not consistently:
myDTM[["dimnames"]][["Terms"]]
# [1] "absolut" "away" "beauti" "belong" "better"
# [6] "bore" "detail" "fast" "fritter" "futur"
# [11] "genius" "imperfect" "it’" "life" "mad"
# [16] "prepar" "result" "ridicul" "run" "simplifi"
# [21] "slow" "stumbl" "thought" "today" "wise"
"Absolutely" is truncated to 7 characters, while "beauty" is truncated to 6. What's the fix for this? Or am I missing something obvious?

You have stemmed the words by using the option stemming = TRUE.
You can either set this to FALSE to avoid stemming, meaning that words such as stumble, stumbles and stumbled will all be counted separately, or complete the stems using stemCompletion. By default this replaces each stem with the most common completion found in the text, though you can change that behaviour with the type parameter.
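For example, a minimal sketch of the stem-completion route, assuming the myCorpus and myDTM objects from the question (Terms() extracts the stems; the default type = "prevalent" picks the most frequent completion found in the dictionary):
stems <- Terms(myDTM)
stemCompletion(stems, dictionary = myCorpus)
# e.g. "absolut" should complete back to "absolutely", "beauti" to "beauty"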


tokens_compound() in quanteda changes the order of features

I found that tokens_compound() in quanteda changes the order of tokens across different R sessions. That is, the result varies every time I restart a session, even if a seed value is fixed, though it does not change within a single session.
Here is the replication procedure:
Find collocations, compound the tokens, and save the resulting types.
library(quanteda)
set.seed(12345)
data(data_corpus_inaugural)
toks <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE,
         remove_symbol = TRUE,
         padding = TRUE) %>%
  tokens_tolower()
col <- toks %>%
  textstat_collocations()
toks.col <- toks %>%
  tokens_compound(pattern = col[col$z > 3, ])
write(attr(toks.col, "types"), "col1.txt")
Quit and restart the R session, then run the above code again with "col1.txt" replaced by "col2.txt".
Compare the two sets of tokens and find they are different.
col1 <- read.table("col1.txt")
col2 <- read.table("col2.txt")
identical(col1$V1, col2$V1) # This should return FALSE.
col1$V1[head(which(col1$V1 != col2$V1))]
col2$V1[head(which(col1$V1 != col2$V1))]
This does not matter in many cases, but the result of LDA (via {topicmodels}) changes across sessions. I suspect the feature order is the cause, because the LDA result is constant if I reset the order of features in the tokens via as.list() and then as.tokens() (dfm_sort() does not work for this).
I wonder whether this happens only for me (Ubuntu 18.04.5, R 4.0.4, and quanteda 2.1.2) and would be happy to hear another (easier) solution.
Updated on Feb 20
For example, the output of LDA is not reproduced.
library(topicmodels)
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  lis[[i]] <- tokens_compound(toks, pattern = col[col$z > 3, ]) %>%
    dfm() %>%
    convert(to = "topicmodels") %>%
    LDA(k = 5,
        method = "Gibbs",
        control = list(seed = 12345,
                       iter = 100))
}
head(lis[[1]]@gamma)
head(lis[[2]]@gamma)
An interesting investigation, but this is neither an error nor anything to be concerned about. Within a quanteda tokens object, the order of the types is not determinate after a processing step such as tokens_compound(). This is because that function is parallelised in C++, and how those threads operate is not fixed by set.seed() from R. But this does not affect the important part, which is the set of types, or anything about the tokens themselves. If you want the order of the types that you extract to be the same, then you should sort them upon extraction.
library("quanteda")
## Package version: 2.1.2
toks <- data_corpus_inaugural %>%
  tokens(
    remove_punct = TRUE,
    remove_symbol = TRUE,
    padding = TRUE
  ) %>%
  tokens_tolower()
col <- quanteda.textstats::textstat_collocations(toks)
It turns out that you do not need to save the output or restart R - this happens within a single session.
# types are differently indexed, but are the same set
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  toks.col <- tokens_compound(toks, pattern = col[col$z > 3, ])
  lis <- c(lis, list(types = types(toks.col)))
}
dframe <- data.frame(lis)
sum(dframe$types != dframe$types.1)
## [1] 19898
head(dframe[dframe$types != dframe$types.1, ])
## types types.1
## 8897 at_this_second my_fellow_citizens
## 8898 to_take_the_oath_of_the_presidential_office no_people
## 8899 there_is on_earth
## 8900 occasion_for cause_to_be_thankful
## 8901 an_extended this_is_said
## 8902 there_was spirit_of
However the (unordered) set of types is identical:
# but
setequal(dframe$types, dframe$types.1)
## [1] TRUE
More importantly, when we compare the values of the tokens themselves, which are ordered, they are identical:
# tokens are the same
lis <- list()
for (i in seq_len(2)) {
  set.seed(123)
  toks.col <- tokens_compound(toks, pattern = col[col$z > 3, ])
  lis <- c(lis, list(toks = as.character(toks.col)))
}
dframe <- data.frame(lis)
all.equal(dframe$toks, dframe$toks.1)
## [1] TRUE
Created on 2021-02-18 by the reprex package (v1.0.0)
An additional comment, whose importance is underscored by this analysis: We strongly discourage direct access to object attributes. Use types(x) as above, not attr(x, "types"). The former will always work. The latter relies on our implementation of the object, which may change as we improve the package.
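Concretely, a one-line sketch of that advice, assuming the toks.col object from the code above: sorting at extraction time makes the type order deterministic across sessions.
stable_types <- sort(types(toks.col))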

What is the correct use of stemDocument?

I have already read this and this question, but I still don't understand the use of stemDocument in tm_map. Let's follow this example:
q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
readerControl = list(language = "pt",
load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
If I use:
> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"
it does work! But if I use:
> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
it doesn't work. Why so?
Unfortunately you stumbled on a bug. stemDocument works if you pass on the language when you do:
stemDocument(x = c("poder", "pode"), language = "pt")
[1] "pod" "pod"
But when you use it within tm_map, the call starts off with stemDocument.PlainTextDocument. In this function the language of the corpus is checked against the language you supply, and this works correctly. But at the end of this function everything is passed on to stemDocument.character, without the language argument. In stemDocument.character the default language is English, so within the tm_map call (or in DocumentTermMatrix) the language you supply reverts to English and the stemming doesn't work correctly.
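A possible workaround within tm itself, sketched here (it is not part of the original answer): skip the PlainTextDocument method and call the character method directly, with the language fixed, via content_transformer():
q17 <- tm_map(q17, content_transformer(
  function(x) stemDocument(x, language = "portuguese")))
lapply(q17, content)
# both documents should now stem to "pod"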
Another workaround is to use the package quanteda:
library(quanteda)
my_dfm <- dfm(x = c("poder", "pode"))
my_dfm <- dfm_wordstem(my_dfm, language = "pt")
my_dfm
Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
2 x 1 sparse Matrix of class "dfm"
features
docs pod
text1 1
text2 1
Since you are working with Portuguese, I suggest using the packages quanteda, udpipe, or both. Both packages handle non-English languages a lot better than tm.
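For completeness, a rough sketch of the udpipe route; note that udpipe lemmatises rather than stems, and the call below downloads a Portuguese model on first use:
library(udpipe)
anno <- udpipe(x = c("poder", "pode"), object = "portuguese")
anno[, c("token", "lemma")]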

String pulled directly from source data seems to not match string in source data

I have a string that is failing to evaluate as a match with itself. I am trying to do a simple subset based on one of 8 possible values in a column:
out <- df[df$`Var name` == "string",]
I've had it work multiple times with different strings, but for some reason this string fails. Thinking there may be some character-encoding issue, I tried to get the exact string from the source via the four avenues below, but had no success. Even when I make an explicit call to a cell I know contains the string, and copy that output into an evaluation statement, it fails:
> df[i,j]
[1] "string"
> df[i,j] == "string" # pasted from above line
[1] FALSE
I don't understand how I can explicitly paste the output I was just given and have it not match.
## attempts to get exact string to paste into subset statement
# from dput
"IF APPLICABLE – Which of the following best characterizes the expectations with"
# from calling a specific row/col (df[i, j])
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
# from the source pane of rstudio
IF APPLICABLE – Which of the following best characterizes the expectations with
# from the source excel file
IF APPLICABLE – Which of the following best characterizes the expectations with
I don't have a clue what could be going on here. I am explicitly drawing the string straight from the data and yet it still fails to evaluate as true. Is there something going on in the background that I'm not seeing? Am I overlooking something ridiculously simple?
edit:
I subset another way; below is a dput and an actual example of what I'm doing:
> dput(temp)
structure(list(`Item Stem` = "IF APPLICABLE – Which of the following best characterizes the expectations with",
`Item Response` = "It was required.", orgchar_group = "locale",
`Org Characteristic` = "Rural", N = 487, percent = 34.5145287030475,
`Graphs note` = NA_character_, `Report note` = NA_character_,
`Other note` = NA_character_, subsig = 1, overall = 0, varname = NA_character_,
statsig = NA_real_, use = NA_real_, difference = 9.16044821292665), .Names = c("Item Stem",
"Item Response", "orgchar_group", "Org Characteristic", "N",
"percent", "Graphs note", "Report note", "Other note", "subsig",
"overall", "varname", "statsig", "use", "difference"), row.names = 288L, class = "data.frame")
> temp[1,1]
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
> temp[1,1] == "IF APPLICABLE – Which of the following best characterizes the expectations with"
[1] FALSE
Turns out it was in fact a non-ASCII character; shoutout to the commenters for helping me figure it out by 1) suggesting it and 2) showing that the comparison worked for them.
I was able to figure it out using insights from here (& here) and here.
I used a grep command (from @Tyler Rinker) to determine that there was in fact a non-ASCII character in my string, and a stringi command (from @hadley) to determine what kind. I then used the base solution from @Josh O'Brien to remove it. Turns out it was the hyphen (in fact an en dash rather than the ASCII hyphen).
# working in the temp df
> x <- temp[1,1]
> grepl("[^ -~]", x)
[1] TRUE
> stringi::stri_enc_mark(x)
[1] "UTF-8"
> iconv(x, "UTF-8", "ASCII", sub="")
[1] "IF APPLICABLE Which of the following best characterizes the expectations with"
# set x as df$`Var name` and reassign it to fix
df$`Var name` <- iconv(df$`Var name`, "UTF-8", "ASCII", sub="")
Still don't understand it enough to explain why it happened but it's fixed now.
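For anyone curious which character it actually was, a short sketch (assuming x from above): the culprit is U+2013, an en dash, which looks almost identical to the ASCII hyphen-minus.
stringi::stri_escape_unicode(x)   # the dash shows up as \u2013
utf8ToInt(substr(x, 15, 15))      # 8211 == 0x2013, the en dash (position 15 in this string)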

Efficiently match multiple strings/keywords to multiple texts in R

I am trying to efficiently map exact peptides (short sequences of amino acids in the 26-character alphabet A-Z [1]) to proteins (longer sequences over the same alphabet). The most efficient way to do this that I'm aware of is an Aho-Corasick trie (where the peptides are the keywords). Unfortunately I can't find a version of AC in R that will work with a non-nucleotide alphabet (Biostrings' PDict and Starr's match_ac are both hard-coded for DNA).
As a crutch I've been trying to parallelize a basic grep approach. But I'm having trouble figuring out a way to do so without incurring significant IO overhead. Here is a brief example:
peptides = c("FSSSGGGGGGGR","GAHLQGGAK","GGSGGSYGGGGSGGGYGGGSGSR","IISNASCTTNCLAPLAK")
if (!exists("proteins"))
{
biocLite("biomaRt", ask=F, suppressUpdates=T, suppressAutoUpdate=T)
library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
proteins = getBM(attributes=c('peptide', 'refseq_peptide'), filters='refseq_peptide', values=c("NP_000217", "NP_001276675"), mart=ensembl)
row.names(proteins) = proteins$refseq_peptide
}
library(snowfall)
library(Biostrings)
library(plyr)
library(parallel)  # for detectCores()
sfInit(parallel=T, cpus=detectCores()-1)
allPeptideInstances = NULL
i=1
increment=100
count=nrow(proteins)
while(T)
{
print(paste(i, min(count, i+increment), sep=":"))
text_source = proteins[i:min(count, i+increment),]
text = text_source$peptide
#peptideInstances = sapply(peptides, regexpr, text, fixed=T, useBytes=T)
peptideInstances = sfSapply(peptides, regexpr, text, fixed=T, useBytes=T)
dimnames(peptideInstances) = list(text_source$refseq_peptide, colnames(peptideInstances))
sparsePeptideInstances = alply(peptideInstances, 2, .fun = function(x) {x[x > 0]}, .dims = T)
allPeptideInstances = c(allPeptideInstances, sparsePeptideInstances, recursive=T)
if (i==count | nrow(text_source) < increment)
break
i = i+increment
}
sfStop()
There are a few issues here:
1. peptideInstances here is a dense matrix, so returning it from each worker is very verbose. I have broken it up into blocks so that I'm not dealing with a 40,000 (proteins) x 60,000 (peptides) matrix.
2. I am parallelizing over the peptides, when it would make more sense to parallelize over the proteins because they're bigger. But I got frustrated with trying to do it by protein because:
3. This code breaks if there is only one protein in text_source.
Alternatively, if anyone is aware of a better solution in R, I'm happy to use that. I've spent enough time on this that I probably would have been better served by implementing Aho-Corasick myself.
[1] Some of those are ambiguity codes, but for simplicity, ignore that.
I learned Rcpp and implemented Aho-Corasick myself; the result is a good general-purpose multiple-keyword search package, AhoCorasickTrie, now on CRAN.
Here are some usage examples:
library(AhoCorasickTrie)
listEquals = function(a, b) { is.null(unlist(a)) && is.null(unlist(b)) || !is.null(a) && !is.null(b) && all(unlist(a) == unlist(b)) }
# simple search of multiple keywords in a single text
keywords = c("Abra", "cadabra", "is", "the", "Magic", "Word")
oneSearch = AhoCorasickSearch(keywords, "Is Abracadabra the Magic Word?")
stopifnot(listEquals(oneSearch[[1]][[1]], list(keyword="Abra", offset=4)))
stopifnot(listEquals(oneSearch[[1]][[2]], list(keyword="cadabra", offset=8)))
stopifnot(listEquals(oneSearch[[1]][[3]], list(keyword="the", offset=16)))
stopifnot(listEquals(oneSearch[[1]][[4]], list(keyword="Magic", offset=20)))
stopifnot(listEquals(oneSearch[[1]][[5]], list(keyword="Word", offset=26)))
# search a list of lists
# * sublists are accessed by index
# * texts are accessed by index
# * non-matched texts are kept (to preserve index order)
listSearch = AhoCorasickSearchList(keywords, list(c("What in", "the world"), c("is"), "secret about", "the Magic Word?"))
stopifnot(listEquals(listSearch[[1]][[1]], list()))
stopifnot(listEquals(listSearch[[1]][[2]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[2]][[1]][[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(listSearch[[3]], list()))
stopifnot(listEquals(listSearch[[4]][[1]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[4]][[1]][[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(listSearch[[4]][[1]][[3]], list(keyword="Word", offset=11)))
# named search of a list of lists
# * sublists are accessed by name
# * matched texts are accessed by name
# * non-matched texts are dropped
namedSearch = AhoCorasickSearchList(keywords, list(subject=c(phrase1="What in", phrase2="the world"),
verb=c(phrase1="is"),
predicate1=c(phrase1="secret about"),
predicate2=c(phrase1="the Magic Word?")))
stopifnot(listEquals(namedSearch$subject$phrase2[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$verb$phrase1[[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(namedSearch$predicate1, list()))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[3]], list(keyword="Word", offset=11)))
# named search of multiple texts in a single list with keyword grouping and aminoacid alphabet
# * all matches to a keyword are accessed by name
# * non-matched keywords are dropped
proteins = c(protein1="PEPTIDEPEPTIDEDADADARARARARAKEKEKEKEPEPTIDE",
protein2="DERPADERPAPEWPEWPEEPEERAWRAWWARRAGTAGPEPTIDEKESEQUENCE")
peptides = c("PEPTIDE", "DERPA", "SEQUENCE", "KEKE", "PEPPIE")
peptideSearch = AhoCorasickSearch(peptides, proteins, alphabet="aminoacid", groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(list(keyword="protein1", offset=1),
list(keyword="protein1", offset=8),
list(keyword="protein1", offset=37),
list(keyword="protein2", offset=38))))
stopifnot(listEquals(peptideSearch$DERPA, list(list(keyword="protein2", offset=1),
list(keyword="protein2", offset=6))))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(list(keyword="protein2", offset=47))))
stopifnot(listEquals(peptideSearch$KEKE, list(list(keyword="protein1", offset=29),
list(keyword="protein1", offset=31),
list(keyword="protein1", offset=33))))
stopifnot(listEquals(peptideSearch$PEPPIE, NULL))
# grouping by keyword without text names: offsets are given without reference to the text
names(proteins) = NULL
peptideSearch = AhoCorasickSearch(peptides, proteins, groupByKeyword=T)
stopifnot(listEquals(peptideSearch$PEPTIDE, list(1, 8, 37, 38)))
stopifnot(listEquals(peptideSearch$DERPA, list(1, 6)))
stopifnot(listEquals(peptideSearch$SEQUENCE, list(47)))
stopifnot(listEquals(peptideSearch$KEKE, list(29, 31, 33)))
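Note how groupByKeyword=T inverts the output relative to the default: each keyword (peptide) maps to its matches across the texts (proteins), which is exactly the peptide-to-protein mapping the question asks for.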

R: how to map test data into lsa space created by training data

I am trying to do text analysis using LSA. I've read many other posts regarding LSA on StackOverflow, but I have not found one similar to mine yet. If you know of one similar to mine, please kindly redirect me to it! Much appreciated!
here's my reproducible code with sample data created:
creating sample data train & test sets (first loading the packages used below)
library(data.table)
library(tm)
library(SnowballC)  # for wordStem()
library(lsa)
library(caret)
sentiment = c(1,1,0,1,0,1,0,0,1,0)
length(sentiment) #10
text = c('im happy', 'this is good', 'what a bummer X(', 'today is kinda okay day for me', 'i somehow messed up big time',
'guess not being promoted is not too bad :]', 'stayhing home is boring :(', 'kids wont stop crying QQ', 'warriors are legendary!', 'stop reading my tweets!!!')
train_data = data.table(as.factor(sentiment), text)
> train_data
sentiment text
1: 1 im happy
2: 1 this is good
3: 0 what a bummer X(
4: 1 today is kinda okay day for me
5: 0 i somehow messed up big time
6: 1 guess not being promoted is not too bad :]
7: 0 stayhing home is boring :(
8: 0 kids wont stop crying QQ
9: 1 warriors are legendary!
10: 0 stop reading my tweets!!!
sentiment = c(0,1,0,0)
text = c('running out of things to say...', 'if you are still reading, good for you!', 'nothing ended on a good note today', 'seriously sleep deprived!! >__<')
test_data = data.table(as.factor(sentiment), text)
> test_data
sentiment text
1: 0 running out of things to say...
2: 1 if you are still reading, good for you!
3: 0 nothing ended on a good note today
4: 0 seriously sleep deprived!! >__<
preprocessing for training data set
corpus.train = Corpus(VectorSource(train_data$text))
create a term document matrix for training set
tdm.train = TermDocumentMatrix(
  corpus.train,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf)
)
convert into matrix (for later use)
train_matrix = as.matrix(tdm.train)
create an lsa space using train data
lsa.train = lsa(tdm.train, dimcalc_share())
set the number of dimensions (I randomly picked one here because the data size is too small to create an elbow shape)
k = 6
project train matrix into the new LSA space
projected.train = fold_in(docvecs = train_matrix, LSAspace = lsa.train)[1:k,]
convert above projected data into a matrix
projected.train.matrix = matrix(projected.train,
nrow = dim(projected.train)[1],
ncol = dim(projected.train)[2])
train the random forest model (somehow this step no longer works with this small sample data... but that's okay, it won't be a big problem for this question; however, if you can help me with this error too, that'd be fantastic! I tried googling for the error but haven't fixed it...)
trcontrol_rf = trainControl(method = "boot", p = .75, trim = T)
model_train_caret = train(x = t(projected.train.matrix), y = train_data$sentiment, method = "rf", trControl = trcontrol_rf)
preprocessing for test data set
basically I'm repeating what I did to the training data set, except I did not use the test set to create its own LSA space
corpus.test = Corpus(VectorSource(test_data$text))
create a term document matrix for test set
tdm.test = TermDocumentMatrix(
  corpus.test,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf)
)
convert into matrix (for later use)
test_matrix = as.matrix(tdm.test)
project test matrix into the trained LSA space (here's where the question is)
projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)
but i'd get an error:
Error in crossprod(docvecs, LSAspace$tk) : non-conformable arguments
I am not finding any useful Google search results regarding this error... (there's only one page of search results on Google QQ)
any help is much appreciated! Thank you!
When you build the LSA model you are using the vocabulary of the training data. But when you build the TermDocumentMatrix for the test data, you are using the vocabulary of the test data. The LSA model only knows how to handle documents tabulated against the vocabulary of the training data.
One way to remedy this is to create your test TDM with dictionary set to the vocabulary of the training data:
tdm.test = TermDocumentMatrix(
  corpus.test,
  control = list(
    removeNumbers = TRUE,
    tolower = TRUE,
    stopwords = stopwords("en"),
    stemming = TRUE,
    removePunctuation = TRUE,
    weighting = weightTfIdf,
    dictionary = rownames(tdm.train)
  )
)
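With the dictionary shared, the test matrix now has the same rows (terms) as the training matrix, so the fold-in conforms. A minimal sketch, assuming the objects defined earlier (including k):
test_matrix = as.matrix(tdm.test)
projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)[1:k,]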
