TermDocumentMatrix in R - only 1-grams created

I just started with tm package in R and cannot seem to overcome an issue.
Even though my tokenizer functions seem to work right:
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer))
when I try to pull 2-grams from biTDM, only 1-grams come up...
findFreqTerms(biTDM, 50)
[1] "after" "and" "most" "the" "were" "years" "love"
[8] "you" "all" "also" "been" "did" "from" "get"
At the same time, the 2-gram function appears to be intact:
x <- biTokenizer(corpus)
head(x)
[1] "c in" "in the" "the years"
[4] "years thereafter" "thereafter most" "most of"

I can only guess at the cause, but NGramTokenizer appears to need a VCorpus object rather than a Corpus object.
library(tm)
library(RWeka)
# some dummy text
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr",
"sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat",
"sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet")
# create a VCorpus
corpus <- VCorpus(VectorSource(text))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
print(biTDM$dimnames$Terms)
[1] "accusam et" "aliquyam erat" "amet consetetur" "at vero" "clita kasd" "consetetur sadipscing" "diam nonumy" "diam voluptua" "dolor sit" "dolore magna"
[11] "dolores et" "duo dolores" "ea rebum" "eirmod tempor" "eos et" "est lorem" "et accusam" "et dolore" "et ea" "et justo"
[21] "gubergren no" "invidunt ut" "ipsum dolor" "justo duo" "kasd gubergren" "labore et" "lorem ipsum" "magna aliquyam" "no sea" "nonumy eirmod"
[31] "sadipscing elitr" "sanctus est" "sea takimata" "sed diam" "sit amet" "stet clita" "takimata sanctus" "tempor invidunt" "ut labore" "vero eos"
[41] "voluptua at"
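To sanity-check which terms a bigram TermDocumentMatrix should contain without relying on RWeka, the tokenization can be sketched in base R. The helper name ngrams_base is hypothetical and not part of tm or RWeka. It only approximates NGramTokenizer by lowercasing and splitting on non-alphanumeric characters.

```r
# Hypothetical helper (not part of tm or RWeka): word n-grams in base R.
ngrams_base <- function(x, n = 2L) {
  # lowercase, then split on runs of non-alphanumeric characters
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  # one n-gram starts at every position that still has n words left
  starts <- seq_len(length(words) - n + 1L)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1L)], collapse = " "),
         character(1))
}

ngrams_base("Lorem ipsum dolor sit amet, consetetur")
# [1] "lorem ipsum"     "ipsum dolor"     "dolor sit"       "sit amet"        "amet consetetur"
```

If findFreqTerms() returns single words while a check like this returns pairs, the custom tokenizer is most likely not being applied to the corpus.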

Related

R - Efficiently insert multiple strings

I would like to insert a list of sub strings (word_list) into a string (text) at specific positions (idx_list)
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")
I know there are multiple functions (gsub, stri_sub etc.) which I can use in a loop. However, this gets quite slow on large corpora. Is there a more efficient solution? Maybe vectorized?
Solution 1
A small wrapper around stringi::stri_sub_replace_all that takes the input in the form the OP wanted.
inject <- function(string, index, replacement){
stringi::stri_sub_replace_all(string, from = index,
to = index-1,
replacement = replacement)
}
text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list <- c(5, 16, 30, 50)
word_list <- c("AAA", "BBB", "CCC", "DDD")
inject(text, idx_list, word_list)
#> [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Solution 2
An adaptation of Zach Foster's answer, condensed into a single function
inject_two <- function(string, inject, index) {
inject <- inject[order(index)]
index <- sort(index)
# expand string
split <- substr(rep(string, length(index) + 1),
start = c(1, index),
stop = c(index - 1, nchar(string))
)
ord1 <- 2 * (1:length(split)) - 1
ord2 <- 2 * (1:length(inject))
paste(c(split, inject)[order(c(ord1, ord2))], collapse = "")
}
inject_two(text, word_list, idx_list)
#> [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
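The interleaving step at the end of inject_two can be looked at in isolation. This is a minimal sketch with hand-made pieces (the vectors pieces and inserts are illustrative and not taken from the answer): odd ranks go to the text pieces, even ranks to the inserted words, and order() merges the two.

```r
pieces  <- c("Lore", "m ipsum dol", "or")  # text split at the insert positions
inserts <- c("AAA", "BBB")                 # words to place between the pieces

ord1 <- 2 * seq_along(pieces) - 1  # 1 3 5: slots for the pieces
ord2 <- 2 * seq_along(inserts)     # 2 4:   slots for the inserts

# order() turns the slot numbers into a permutation that alternates
# piece, insert, piece, insert, piece
interleaved <- c(pieces, inserts)[order(c(ord1, ord2))]
paste(interleaved, collapse = "")
# [1] "LoreAAAm ipsum dolBBBor"
```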
Benchmarks
evans <- function(string, index, replacement){
ord <- order(-index)
Reduce(function(S, R) {
paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, index[ord], replacement[ord]), string)
}
ggplot2::autoplot(microbenchmark::microbenchmark(
inject(text, idx_list, word_list),
inject_two(text, word_list, idx_list),
evans(text, idx_list, word_list),
times = 1000
))
For few insertions all solutions perform similarly. What if we perform many (here N = 410 000) insertions?
text_large = paste0(rep(text, 10000), collapse = "")
idx_list_large = seq(1, nchar(text_large), by = 3)
word_list_large = sample(LETTERS, size = length(idx_list_large), replace = TRUE)
bench::mark(
inject = inject(text_large, idx_list_large, word_list_large),
inject_two = inject_two(text_large, word_list_large, idx_list_large),
iterations = 50
)[,c(1,3,5,7)]
# A tibble: 2 x 4
expression median mem_alloc n_itr
<bch:expr> <bch:tm> <bch:byt> <int>
1 inject 32.2ms 9.38MB 50
2 inject_two 157.4ms 65.69MB 50
Surprise, surprise: the C++-based stringi takes the cake in both speed and memory. Note that evans is omitted because a single pass took several minutes when tested.
I think it's important to start from the last (highest idx_list) first, since otherwise all numbers will need to be shifted. (This is certainly not hard, but going backwards seems easier.)
# 0 1 2 3 4 5 6 7 8 9 a b c
# 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")
The work:
ord <- order(-idx_list)
Reduce(function(S, R) {
paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, idx_list[ord], word_list[ord]), text)
# [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Walk-through:
ord is just the decreasing order, so for instance
word_list[ord]
# [1] "DDD" "CCC" "BBB" "AAA"
Because we're going to use Reduce (explanation in a second), we need each pair idx_list[i] and word_list[i] to arrive as one argument rather than two separate ones. For this, we combine them using Map(list, ...), which "zips" them together into a single list whose elements each contain the character position and the string to insert:
str( Map(list, idx_list[ord], word_list[ord]) )
# List of 4
# $ :List of 2
# ..$ : num 50
# ..$ : chr "DDD"
# $ :List of 2
# ..$ : num 30
# ..$ : chr "CCC"
# $ :List of 2
# ..$ : num 16
# ..$ : chr "BBB"
# $ :List of 2
# ..$ : num 5
# ..$ : chr "AAA"
(This can be used with an arbitrary number of arguments.)
Because we need to insert a string, then insert another string into the result of the first, the base function Reduce will work well here. The first arg is a function that accepts two arguments: the results from the previous call, and the next element from the Map'd argument.
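The folding can be made visible with accumulate = TRUE, which returns every intermediate string instead of only the final one. This is a minimal sketch on a shortened input; the shorter text and the two-element vectors are chosen here purely for illustration.

```r
text      <- "Lorem ipsum dolor sit amet"
idx_list  <- c(5, 16)
word_list <- c("AAA", "BBB")

ord <- order(-idx_list)  # highest position first, so earlier positions stay valid

steps <- Reduce(
  function(S, R) {
    # S: string built so far; R: list(position, word to insert)
    paste0(substring(S, 1, R[[1]] - 1), R[[2]], substring(S, R[[1]], nchar(S)))
  },
  Map(list, idx_list[ord], word_list[ord]),
  text,               # initial value: the untouched string
  accumulate = TRUE   # keep the result after every insertion
)

steps
# [1] "Lorem ipsum dolor sit amet"
# [2] "Lorem ipsum dolBBBor sit amet"
# [3] "LoreAAAm ipsum dolBBBor sit amet"
```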

Pull all 8 digit numbers from a data frame

I have this assignment where I need to pull all the 8 digit numbers from a text file. I've converted the text file into a dataframe and now have some 67 columns with 18000 rows. There are empty cells as well.
Within this table, some 8-digit numbers exist (not in any particular row or column), which is what I want to extract.
I need all these numbers to be extracted into one single column without checking for duplicates.
The only code I've written so far:
data <- read.delim("cerupload_adsi_1_01-02-2019.txt", header = FALSE, sep="|")
You may use regmatches() with the regex "\\d{8}" to match a run of exactly 8 digits. Specifying word boundaries "\\b" makes this more robust, since a bare "\\d{8}" would also match the first 8 digits of a longer number.
Example
txt <- "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore 235462354 magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. Stet clita 235 kasd gubergren, no sea
takimata sanctus est Lorem ipsum dolor sit amet. 12345678 Lorem ipsum dolor 345.454 sit amet,
12345678 consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam 345 voluptua. At vero eos et accusam et justo duo
dolores et ea rebum. Stet clita 12345.67 12345.678 kasd gubergren, no sea takimata sanctus
est Lorem ipsum dolor sit amet. 12345678"
regmatches(txt, gregexpr("\\b\\d{8}\\b", txt))
# [[1]]
# [1] "12345678" "12345678" "12345678"
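The same pattern can be applied to every cell of a data frame at once by flattening it first. This is a minimal sketch, assuming all columns can be treated as text; the two-column df is made-up sample data, and perl = TRUE is an added choice here so that "\\b" and "\\d" are handled by the PCRE engine.

```r
# Made-up sample data standing in for the imported file
df <- data.frame(
  a = c("id 12345678 x", ""),
  b = c("123456789", "87654321 and 11112222"),
  stringsAsFactors = FALSE
)

# Flatten all cells into one character vector (column by column), drop NAs
cells <- as.character(unlist(df, use.names = FALSE))
cells <- cells[!is.na(cells)]

# Extract every standalone 8-digit run; cells without a match contribute nothing
hits <- unlist(regmatches(cells, gregexpr("\\b\\d{8}\\b", cells, perl = TRUE)))

result <- data.frame(number = hits, stringsAsFactors = FALSE)
result
#     number
# 1 12345678
# 2 87654321
# 3 11112222
```

Keeping the matches as character strings preserves any leading zeros, which a conversion to integer would drop.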
First, put all of your data into a simple integer vector:
data = as.integer(unlist(data))
Next, remove any elements that weren't convertible to integers (optional):
data = data[!is.na(data)]
Next, find the integers that are 8 characters long:
data = data[nchar(as.character(data))==8]
Then, make a data.frame with the integer vector as a column:
data = data.frame(x=data)
Using str_extract_all from stringr
temp <- data.frame(col = unlist(stringr::str_extract_all(unlist(data), "\\d{8}$")))
temp
# col
#1 12352318
#2 98765432
data
Tested on this sample data with two columns.
data <- data.frame(a = "This is a text with number 1234 and 12352318",
b = "More random text 123456789 98765432")

Subsetting a vector using a list of sequences in R [closed]

I have a character vector that contains textual data which I can subset by selecting individual lines. The eventual goal is to store different sequences of the vector as independent variables or element of a list. I am able to do this using a simple loop, but I don't succeed in subsetting a character vector by a list of sequences.
See the following example:
Text<-scan("~/Desktop/Lorem Ipsum.txt", what="character", sep="\n")
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
[5] "Ut enim ad minim veniam, "
[6] "quis nostrud exercitation "
[7] "ullamco laboris nisi ut aliquip ex ea commodo consequat."
[8] "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur."
[9] "Excepteur sint occaecat cupidatat non proident,"
[10] "sunt in culpa qui officia deserunt mollit anim id est laborum."
The normal way of subsetting the vector would be Text[1:4], returning
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
In a list I have stored sequences of numbers that represent different sets of lines in the vector.
Sentence.numbers<-c(1:4, 5:7, 8, 9:10)
Now I would like to subset all the numbers that make up the different sentences at once and store them in a list for further analysis.
I used Text[Sentence.numbers], but the error message is "invalid index type 'list'".
Is there a way to use a list of values to subset?
You need to set up Sentence.numbers as a list and then use lapply -
Sentence.numbers <- list(1:4, 5:7, 8, 9:10)
lapply(Sentence.numbers, function(x) Text[x])
Here's an example -
lapply(Sentence.numbers, function(x) letters[x])
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
[[3]]
[1] "h"
[[4]]
[1] "i" "j"
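If each group of lines should end up as one string (an assumption about the "further analysis" mentioned in the question), the same lapply pattern can paste the selected lines together. The four-line Text below is a shortened stand-in for the file contents.

```r
# Shortened stand-in for the scanned file
Text <- c("Lorem ipsum dolor sit amet, ",
          "consectetur adipiscing elit,",
          "sed do eiusmod tempor incididunt ",
          "ut labore et dolore magna aliqua.")

Sentence.numbers <- list(1:2, 3, 4)

# For each index set: pick the lines, trim stray whitespace, join with a space
sentences <- lapply(Sentence.numbers,
                    function(i) paste(trimws(Text[i]), collapse = " "))
names(sentences) <- paste0("sentence", seq_along(sentences))

sentences$sentence1
# [1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit,"
```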

Efficiently break up a string based on the nth occurrence of a substring using R

Introduction
Given a string in R, is it possible to get a vectorized solution (i.e. no loops) that breaks the string into blocks, where each block ends at the nth occurrence of a substring?
Work done with Reproducible Example
Suppose we have several paragraphs of the famous Lorem Ipsum text.
library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)
my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")
> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "
We would like to break this text into segments at every 3rd occurrence of the word " in" (a space is included in order to distinguish it from words which contain "in" as part of them, such as "min").
I have the following solution with a while loop:
# We wish to break up the string at every
# 3rd occurrence of the word "in"
break.character = " in"
break.occurrence = 3
string.list = list()
i = 1
# initialize string to send into the loop
current.string = my.string
while(length(current.string) > 0){
# Enter segment into the list which occurs BEFORE nth occurrence of character of interest
string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)
# Update next string to examine.
# Next string to examine is current string AFTER nth occurrence of character of interest
current.string = str_after_nth(current.string, break.character, break.occurrence)
i = i + 1
}
We are able to get the desired output in a list with a warning (warning not shown)
> string.list (#partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"
[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...
[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
Goal
Is it possible to improve this solution by vectorizing (i.e. using apply(), lapply(), mapply() etc.)? Also, my current solution cuts off the last occurrence of the substring in a block.
The current solution may not work well on extremely long strings (such as DNA sequences where we are looking for blocks with the nth occurrence of a substring of nucleotides).
Try with this:
text_split = strsplit(my.string, " in ")[[1]]
l = length(text_split)
n = floor(l/3)
# start index of each block of 3 consecutive pieces
Seq = seq(1, by = 3, length.out = n)
L = sapply(Seq, function(x){
paste0(paste(text_split[x:(x+2)], collapse = " in "), " in ")
})
if (l > (n*3)){
L = c(L, paste(text_split[(n*3+1):l], collapse = " in "))
}
The last conditional handles the case where the number of pieces is not divisible by 3. The " in " appended inside sapply() is there because I don't know what you want to do with the " in " that separates your blocks.
Let me know if this does the trick. I will try to make it faster. It keeps the third " in " in each block. If it works I'll annotate it more too.
library(lipsum)
library(stringi)
my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
start_of_strings <- c(1, end_of_in[c(F, F, T)])
end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
stri_sub(my.string, start_of_strings, end_of_strings)
EDIT: actually, use stri_sub from stringi. It will scale much better than substring. See:
my.string <- paste(rep(my.string, 10000), collapse = " ")
nchar(my.string)
[1] 22349999
microbenchmark::microbenchmark(
sol1 = {
text_split=strsplit(my.string," in ")[[1]]
l=length(text_split)
n = floor(l/3)
Seq = seq(1,by=3,length.out = n)
L= list()
L=sapply(Seq, function(x){
paste0(paste(text_split[x:(x+2)],collapse=" in ")," in ")
})
if (l>(n*3)){
L = c(L,paste(text_split[(n*3+1):l],collapse=" in "))
}
},
sol2 = {
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
start_of_strings <- c(1, end_of_in[c(F, F, T)])
end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
stri_sub(my.string, start_of_strings, end_of_strings)
},
times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
sol1 914.1268 927.45958 941.36117 939.80361 950.18099 980.86941 10
sol2 55.4163 56.40759 58.53444 56.86043 57.03707 71.02974 10
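For readers without stringi, the locate-then-cut idea can also be sketched in base R with gregexpr() and substring(). The function name split_every_nth is hypothetical, and the sketch treats the pattern as a fixed string. Each block keeps the nth occurrence at its end, and nothing is dropped, so pasting the blocks back together gives the original string.

```r
# Hypothetical helper: cut x right after every nth occurrence of a fixed pattern.
split_every_nth <- function(x, pattern, n) {
  m <- gregexpr(pattern, x, fixed = TRUE)[[1]]
  if (m[1] == -1L) return(x)                    # pattern not found at all
  # position of the last character of each match
  ends <- as.integer(m) + attr(m, "match.length") - 1L
  cut  <- ends[seq_along(ends) %% n == 0L]      # keep every nth match only
  starts <- c(1L, cut + 1L)
  stops  <- c(cut, nchar(x))
  out <- substring(x, starts, stops)            # vectorized over starts/stops
  out[nzchar(out)]                              # drop an empty trailing block
}

split_every_nth("a in b in c in d in e", " in", 2)
# [1] "a in b in"  " c in d in" " e"
```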

Save .dta files with long strings in R

I have to save an R-dataset in Stata's .dta format.
The dataset contains, among other data, a single column containing long strings (column 3).
test data:
r_data <- data.frame( ae= 1, be= 2, ce= "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet"
,stringsAsFactors = FALSE )
export to dta
library(foreign)
write.dta(r_data, file = "r_data.dta")
results in this warning message:
Warning message:
In write.dta(r_data, file = "r_data.dta") :
character strings of >244 bytes in column 3 will be truncated
Furthermore, I can't open the file in Stata (14 SE) at all due to an error stating:
. use "r_data.dta"
file not Stata format
.dta file contains 1 invalid storage-type code.
File uses invalid codes other than code 0.
r(610);
How can I save longer strings as a .dta file?
An R solution is preferred because I am not experienced with Stata.
PS: The indirect route via a CSV-file does not work, because the resulting CSV-file is too big for my little RAM when importing in Stata.
Old question, but deserves to be closed:
Use the haven package to write to a dta-file in Stata 14 format.
library(haven)
r_data <- data.frame(ae = 1, be = 2, ce = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet",
stringsAsFactors = FALSE)
write_dta(r_data, "r_data.dta")
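Before exporting, it can help to see which columns actually exceed the 244-byte limit named in the write.dta warning. This is a minimal base-R sketch. The threshold is taken from that warning message, and the long string is generated with strrep() only to keep the example short.

```r
r_data <- data.frame(ae = 1, be = 2,
                     ce = strrep("Lorem ipsum ", 30),  # 360 bytes
                     stringsAsFactors = FALSE)

# Only character columns can hold long strings
is_chr <- vapply(r_data, is.character, logical(1))

# Longest value per character column, measured in bytes
max_bytes <- vapply(r_data[is_chr],
                    function(col) max(nchar(col, type = "bytes"), na.rm = TRUE),
                    integer(1))

# Columns that foreign::write.dta would truncate, per its warning
long_cols <- names(max_bytes)[max_bytes > 244]
long_cols
# [1] "ce"
```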

Resources