R - Efficiently insert multiple strings

I would like to insert a list of substrings (word_list) into a string (text) at specific positions (idx_list):
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")
I know there are multiple functions (gsub, stri_sub etc.) which I can use in a loop. However, this gets quite slow on large corpora. Is there a more efficient solution? Maybe a vectorized one?

Solution 1
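A small wrapper around stringi::stri_sub_replace_all() that takes the inputs in the form the OP wanted. Setting to = index - 1 makes each range zero-length, so the replacement is inserted before position index instead of overwriting anything.
inject <- function(string, index, replacement){
  stringi::stri_sub_replace_all(string, from = index,
                                to = index - 1,
                                replacement = replacement)
}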
text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list <- c(5, 16, 30, 50)
word_list <- c("AAA", "BBB", "CCC", "DDD")
inject(text, idx_list, word_list)
#> [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Solution 2
An adaptation of Zach Foster's answer, condensed into a single function
inject_two <- function(string, inject, index) {
  inject <- inject[order(index)]
  index <- sort(index)
  # expand string: cut it into the pieces between insertion points
  split <- substr(rep(string, length(index) + 1),
                  start = c(1, index),
                  stop = c(index - 1, nchar(string))
  )
  # odd slots hold the pieces, even slots hold the insertions
  ord1 <- 2 * seq_along(split) - 1
  ord2 <- 2 * seq_along(inject)
  paste(c(split, inject)[order(c(ord1, ord2))], collapse = "")
}
inject_two(text, word_list, idx_list)
#> [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
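To see what inject_two does internally, here is a minimal base R sketch on a toy string. The string "abcdef" and the names pieces and out are made up for illustration. It cuts the string at the insertion points and interleaves the pieces with the new words. Note that rbind() is used for the interleaving here, which is a swapped-in alternative to the order() trick in the function above, not the original method.

```r
# Toy inputs, chosen only for illustration
string <- "abcdef"
index  <- c(3, 5)        # insert before the 3rd and the 5th character
inject <- c("X", "Y")

# Cut the string at every insertion point: one more piece than insertions
pieces <- substring(string, c(1, index), c(index - 1, nchar(string)))
pieces
# [1] "ab" "cd" "ef"

# Interleave pieces and insertions column by column; the trailing ""
# pads inject to the same length as pieces
out <- paste(rbind(pieces, c(inject, "")), collapse = "")
out
# [1] "abXcdYef"
```

Because every piece is cut from the original string in one vectorized call, no intermediate copy of the whole string is made per insertion.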
Benchmarks
evans <- function(string, index, replacement){
ord <- order(-index)
Reduce(function(S, R) {
paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, index[ord], replacement[ord]), string)
}
ggplot2::autoplot(microbenchmark::microbenchmark(
  inject(text, idx_list, word_list),
  inject_two(text, word_list, idx_list),  # note the argument order: inject, then index
  evans(text, idx_list, word_list),
  times = 1000
))
For few insertions all solutions perform similarly. What if we perform many (here N = 410 000) insertions?
text_large = paste0(rep(text, 10000), collapse = "")
idx_list_large = seq(1, nchar(text_large), by = 3)
word_list_large = sample(LETTERS, size = length(idx_list_large), replace = TRUE)
bench::mark(
inject = inject(text_large, idx_list_large, word_list_large),
inject_two = inject_two(text_large, word_list_large, idx_list_large),
iterations = 50
)[,c(1,3,5,7)]
# A tibble: 2 x 4
expression median mem_alloc n_itr
<bch:expr> <bch:tm> <bch:byt> <int>
1 inject 32.2ms 9.38MB 50
2 inject_two 157.4ms 65.69MB 50
Surprise, surprise: the C++-based stringi takes the cake in both speed and memory. Note that evans is omitted because a single pass took multiple minutes when tested.

I think it's important to start with the last (highest idx_list) position first, since otherwise all later positions will need to be shifted after each insertion. (This is certainly not hard, but going backwards seems easier.)
#      0         1         2         3         4         5         6         7         8         9         a         b         c
#      0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")
The work:
ord <- order(-idx_list)
Reduce(function(S, R) {
  paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, idx_list[ord], word_list[ord]), text)
# [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Walk-through:
ord is just the decreasing order, so for instance
word_list[ord]
# [1] "DDD" "CCC" "BBB" "AAA"
Because we're going to use Reduce (explanation in a second), each position and its word must arrive together in one argument rather than as two separate ones. For this, we combine them using Map(list, ...), which "zips" them together into a single list, each element containing the character position and the string to insert:
str( Map(list, idx_list[ord], word_list[ord]) )
# List of 4
# $ :List of 2
# ..$ : num 50
# ..$ : chr "DDD"
# $ :List of 2
# ..$ : num 30
# ..$ : chr "CCC"
# $ :List of 2
# ..$ : num 16
# ..$ : chr "BBB"
# $ :List of 2
# ..$ : num 5
# ..$ : chr "AAA"
(This can be used with an arbitrary number of arguments.)
Because we need to insert a string, then insert another string into the result of the first, the base function Reduce will work well here. The first arg is a function that accepts two arguments: the results from the previous call, and the next element from the Map'd argument.
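To make the folding visible, here is a small sketch that runs the same Reduce call with accumulate = TRUE, so every intermediate string is returned. The toy inputs and the helper name insert_one are assumptions made for illustration only.

```r
# Toy inputs, chosen only for illustration
text_small <- "abcdef"
idx   <- c(3, 5)
words <- c("X", "Y")
ord   <- order(-idx)     # highest position first

# S is the string built so far, R is one list(position, word) pair
insert_one <- function(S, R) {
  paste0(substring(S, 1, R[[1]] - 1), R[[2]], substring(S, R[[1]], nchar(S)))
}

# text_small is the initial value; accumulate = TRUE keeps every step
steps <- Reduce(insert_one, Map(list, idx[ord], words[ord]), text_small,
                accumulate = TRUE)
steps
# [1] "abcdef"   "abcdYef"  "abXcdYef"
```

Each step rebuilds the whole string once, which is why this approach slows down sharply when the string is long and the number of insertions is large.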

Related

Search for matches to argument pattern within every item of a character vector and a window function

I have the following string
library(stringi)
s=stri_rand_lipsum(10)
Function grepl searches for matches to argument pattern within every item of a character vector. As far as I know, it performs the search of just one word at once. For example if I would like to search "conubia" and "viverra" I have to perform two searches:
x=s[grepl("conubia",s)]
x=x[grepl("viverra",x)]
Anyway, I would like to search two or more terms which appear in the same entry of s within a window of length equal to, e.g. 140 characters.
You can use the *apply family. If your source text is a character vector, I recommend vapply, which requires you to specify the type and the length of the returned values. Because you use grepl, the returned values are logical vectors.
txt = "My name is Abdur Rohman"
patt = c("na", "Ab","man", "om")
vapply(patt, function(x) grepl(x,txt),
FUN.VALUE = logical(length(txt)))
# na Ab man om
# TRUE TRUE TRUE FALSE
So, in your example you can use:
s = stri_rand_lipsum(10)
vapply(c("conubia","viverra"), function(x) grepl(x,s),
       FUN.VALUE = logical(length(s)))
# conubia viverra
# [1,] TRUE TRUE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# [6,] FALSE TRUE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
#[10,] FALSE FALSE
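The logical matrix can then be reduced to the entries that contain every pattern. A minimal sketch, assuming a made-up vector s_demo in place of the random lipsum text and fixed (non-regex) matching:

```r
# Made-up stand-in for the random text, so the result is reproducible
s_demo <- c("conubia and viverra together", "only conubia here", "neither word")
patt   <- c("conubia", "viverra")

# One column per pattern, one row per element of s_demo
hits <- vapply(patt, function(p) grepl(p, s_demo, fixed = TRUE),
               FUN.VALUE = logical(length(s_demo)))

# Keep the rows where all patterns matched
all_found <- rowSums(hits) == length(patt)
s_demo[all_found]
# [1] "conubia and viverra together"
```

This relies on hits being a matrix, which holds whenever the text vector has more than one element.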
Edit to include a 140-character window
As for the requirement to create a limiting window of 140 characters, as explained in your comment, one way of meeting it is to extract all characters between the two targeted strings and then count the extracted characters. The requirement is met only if that count is less than or equal to 140.
Extracting all characters between two strings can be done with regular expressions in gsub. However, in case the strings are repeated, you need to specify which occurrences delimit the window. Let me give examples:
txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"
This text contains two conubias and two viverras. You have four options to choose the window to specify all characters between conubia and viverra.
Option 1: between the last conubia and the first viverra
gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 2: between the first conubia and the last viverra
gsub(".*?conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
# [1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
Option 3: between the first conubia and the first viverra
gsub(".*?conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 4: between the last conubia and the last viverra
gsub(".*conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
To calculate the number of the extracted characters, nchar can be used.
# Option 1
nchar(gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE))
#[1] 68
Applying this approach:
set.seed(8)
s1 <- stri_rand_lipsum(10)
Nch <- nchar(gsub(".*conubia(.*?)viverra.*", "\\1", s1, perl = TRUE))
Nch
# [1] 637 42 512 528 595 640 522 407 388 512
We find that only the second element of s1 meets the requirement.
To print the element we can use: s1[which(Nch <= 140)].
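One caveat with the nchar approach: when an element contains no match at all, gsub returns it unchanged, so nchar reports the length of the whole element, and a short element without both words would pass the test. A possible alternative, sketched here under the assumption that the first word must come before the second, is to put the window into the pattern itself. The helper name within_window is hypothetical, and both words are treated as regular expressions.

```r
# Hypothetical helper: TRUE if `second` starts at most `window`
# characters after some occurrence of `first`
within_window <- function(x, first, second, window = 140) {
  pattern <- paste0(first, ".{0,", window, "}?", second)
  grepl(pattern, x, perl = TRUE)
}

txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"

within_window(txt, "conubia", "viverra")               # TRUE, the closest pair is 68 characters apart
within_window(txt, "conubia", "viverra", window = 10)  # FALSE
within_window("short text", "conubia", "viverra")      # FALSE, no match at all
```

The function is vectorized over x, so it can be applied directly to a character vector such as s1.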
Some great references I've been learning from:
https://www.buymeacoffee.com/wstribizew/extracting-text-two-strings-regular-expressions
https://regex101.com/
Extracting a string between other two strings in R

Pull all 8 digit numbers from a data frame

I have this assignment where I need to pull all the 8 digit numbers from a text file. I've converted the text file into a dataframe and now have some 67 columns with 18000 rows. There are empty cells as well.
Within this table, some 8 digit number exist, (not in any particular row or column) which is what I want to extract.
I need all these numbers to be extracted into one single column without checking for duplicates.
The only code I've written so far:
data <- read.delim("cerupload_adsi_1_01-02-2019.txt", header = FALSE, sep="|")
You may use regmatches() together with gregexpr() to match runs of exactly 8 digits with the regex "\\d{8}". Specifying word boundaries "\\b" makes this more robust, since it keeps the pattern from matching 8 digits inside a longer number.
Example
txt <- "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore 235462354 magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. Stet clita 235 kasd gubergren, no sea
takimata sanctus est Lorem ipsum dolor sit amet. 12345678 Lorem ipsum dolor 345.454 sit amet,
12345678 consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam 345 voluptua. At vero eos et accusam et justo duo
dolores et ea rebum. Stet clita 12345.67 12345.678 kasd gubergren, no sea takimata sanctus
est Lorem ipsum dolor sit amet. 12345678"
regmatches(txt, gregexpr("\\b\\d{8}\\b", txt))
# [[1]]
# [1] "12345678" "12345678" "12345678"
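The same pattern can be applied to a whole data frame by flattening all cells into one character vector first. A minimal sketch, assuming a small made-up data frame df in place of the real file:

```r
# Made-up stand-in for the data read from the text file
df <- data.frame(
  a = c("id 12345678 ok", "none"),
  b = c("123456789", "x 87654321 y 11112222"),
  stringsAsFactors = FALSE
)

# Flatten every cell to character, column by column
cells <- unlist(lapply(df, as.character), use.names = FALSE)

# Extract all stand-alone 8-digit numbers; duplicates are kept
found <- unlist(regmatches(cells, gregexpr("\\b\\d{8}\\b", cells, perl = TRUE)))
result <- data.frame(number = found, stringsAsFactors = FALSE)
result$number
# [1] "12345678" "87654321" "11112222"
```

Keeping the numbers as character strings preserves any leading zeros; the 9-digit value is skipped because of the word boundaries.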
First, put all of your data into a simple integer vector:
data = as.integer(unlist(data))
Next, remove any elements that weren't convertible to integers (optional):
data = data[!is.na(data)]
Next, find the integers that are 8 characters long:
data = data[nchar(as.character(data))==8]
Then, make a data.frame with the integer vector as a column:
data = data.frame(x=data)
Using str_extract_all from stringr
temp <- data.frame(col = unlist(stringr::str_extract_all(unlist(data), "\\d{8}$")))
temp
# col
#1 12352318
#2 98765432
data
Tested on this sample data with two columns.
data <- data.frame(a = "This is a text with number 1234 and 12352318",
b = "More random text 123456789 98765432")

Subsetting a vector using a list of sequences in R [closed]

I have a character vector that contains textual data which I can subset by selecting individual lines. The eventual goal is to store different sequences of the vector as independent variables or element of a list. I am able to do this using a simple loop, but I don't succeed in subsetting a character vector by a list of sequences.
See the following example:
Text<-scan("~/Desktop/Lorem Ipsum.txt", what="character", sep="\n")
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
[5] "Ut enim ad minim veniam, "
[6] "quis nostrud exercitation "
[7] "ullamco laboris nisi ut aliquip ex ea commodo consequat."
[8] "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur."
[9] "Excepteur sint occaecat cupidatat non proident,"
[10] "sunt in culpa qui officia deserunt mollit anim id est laborum."
The normal way of subsetting the vector would be Text[1:4], returning
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
In a list I have stored sequences of numbers that represent different sets of lines in the vector.
Sentence.numbers<-c(1:4, 5:7, 8, 9:10)
Now I would like to subset all the numbers that make up the different sentences at once and store them in a list for further analysis.
I used Text[Sentence.numbers], but the error message is "invalid index type 'list'".
Is there a way to use a list of values to subset?
You need to set up Sentence.numbers as a list and then use lapply -
Sentence.numbers <- list(1:4, 5:7, 8, 9:10)
lapply(Sentence.numbers, function(x) Text[x])
Here's an example -
lapply(Sentence.numbers, function(x) letters[x])
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
[[3]]
[1] "h"
[[4]]
[1] "i" "j"

TermDocumentMatrix in R - only 1-grams created

I just started with tm package in R and cannot seem to overcome an issue.
Even though my tokenizer functions seem to work right:
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer))
when I try to pull 2-grams from biTDM, only 1-grams come up...
findFreqTerms(biTDM, 50)
[1] "after" "and" "most" "the" "were" "years" "love"
[8] "you" "all" "also" "been" "did" "from" "get"
At the same time, the 2-gram function appears to be intact:
x <- biTokenizer(corpus)
head(x)
[1] "c in" "in the" "the years"
[4] "years thereafter" "thereafter most" "most of"
I can only guess at the problem here: NGramTokenizer needs a VCorpus object rather than a Corpus object.
library(tm)
library(RWeka)
# some dummy text
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr",
"sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat",
"sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet")
# create a VCorpus
corpus <- VCorpus(VectorSource(text))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
print(biTDM$dimnames$Terms)
[1] "accusam et" "aliquyam erat" "amet consetetur" "at vero" "clita kasd" "consetetur sadipscing" "diam nonumy" "diam voluptua" "dolor sit" "dolore magna"
[11] "dolores et" "duo dolores" "ea rebum" "eirmod tempor" "eos et" "est lorem" "et accusam" "et dolore" "et ea" "et justo"
[21] "gubergren no" "invidunt ut" "ipsum dolor" "justo duo" "kasd gubergren" "labore et" "lorem ipsum" "magna aliquyam" "no sea" "nonumy eirmod"
[31] "sadipscing elitr" "sanctus est" "sea takimata" "sed diam" "sit amet" "stet clita" "takimata sanctus" "tempor invidunt" "ut labore" "vero eos"
[41] "voluptua at"

How to repeat this statement in R probably using apply()

It might seem a silly question, but how can I repeat this line 152 times? I would rather not use a for loop, since it will not be efficient with larger data sets later:
reviews = as.vector(t(mydata)[,1])
mydata is a row in a data.frame and
reviews is an array of characters, also
[,1] is just the first row
The output could be a matrix or worst case a data.frame.
I tried something like this, but it did not work :
testing = apply(mydata, 1, function(x) {as.vector(t(mydata[,x]))})
Error in t(mydata)[, x] : subscript out of bounds
Thanks.
EDIT:
Quick data sample:
> reviews = as.vector(t(mydata)[,1])
> class(reviews)
[1] "character"
> length(reviews)
[1] 14
> reviews
[1] "I was involuntarily"
[2] "I was in transit"
[3] "My initial flight"
[4] "That still left"
[5] "After disembarking"
[6] "customs and proceed to my gate."
[7] "I arrived"
[8] "When my boarding pass was scanned"
[9] "No reason was given for the bump."
[10] "The UA gate staff"
[11] "I boarded Air Canada."
[12] "After arriving"
[13] "I spent 5 hours"
[14] NA
mydata data.frame:
> class(mydata)
[1] "data.frame"
> length(mydata[,1])
[1] 152
> mydata[,1]
[1] I was involuntarily... .
[2] First time... .
...
...
152 Levels: First time . ...
I have about 30.000 of these, but I want to start small, so only 152 of paragraphs split in individual sentence and put into a data.frame. Each row in the data.frame has 5-15 sentences.
I want to be able to access each row as an array, since I need to perform some action on each row of the data.frame.
Packages used: plyr, sentiment(downloaded from here and installed manually)
EDIT 2:
dput(myData[1:6, 1:6])
structure(list(V1 = structure(c(70L, 41L, 94L, 114L, 47L, 49L),
.Label = c(" Air Canada",
"their service",
"hours for de-icing",
"have flown BA",
"my booking",
"If the video screen",
"Frankfurt flights",
"and another 150 lines of text data",
Here's a recommended way to ask a question, focusing on the fact that your actual data is too big, too complicated, or too private to share.
Question: how to apply a function on each row of a data.frame?
My data:
# make up some data
s <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
mydata <- as.data.frame(matrix(strsplit(s, '\\s')[[1]][1:18], nrow=3, ncol=6), stringsAsFactors=FALSE)
mydata
## V1 V2 V3 V4 V5 V6
## 1 Lorem sit adipiscing do incididunt et
## 2 ipsum amet, elit, eiusmod ut dolore
## 3 dolor consectetur sed tempor labore magna
If you have data that you can use directly, then as has been suggested multiple times in the comments, the use of dput is helpful:
mydata <- structure(list(V1 = c("Lorem", "ipsum", "dolor"),V2 = c("sit", "amet,", "consectetur"), V3 = c("adipiscing", "elit,", "sed"),
V4 = c("do", "eiusmod", "tempor"), V5 = c("incididunt", "ut", "labore"), V6 = c("et", "dolore", "magna")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6"), row.names = c(NA, -3L), class = "data.frame")
In either order, state (i) what you are trying to do, and (ii) what you have tried and how it is not working.
My desired output:
Converting a row into a vector is ... confusing. A row is already a vector, so I don't know what you are ultimately trying to do. So, I'll come up with something short and to the point: I want the words on each row to be in reverse alphabetical order, perhaps like this:
## V1 V2 V3 V4 V5 V6
## 1 sit Lorem incididunt et do adipiscing
## 2 ut ipsum elit, eiusmod dolore amet,
## 3 tempor sed magna labore dolor consectetur
This is a good time to show the code you've tried, errors you've encountered, and/or how the non-erroring output is not what you intended.
Answer, generically:
Several ways to do something to each row:
Use apply, though this breaks if you have numeric and character intermingled. If you try this, you'll see that the output is actually the transpose of what you may expect, in which case you'll need to wrap it (and all of the other *apply-based suggestions here) in t(...). It's a little confusing, but it's necessary here. Oh, and the results will all be of class matrix, which can easily be converted to a data.frame if needed.
ret <- apply(mydata, 1, function(r) {
do_something(r)
})
Use sapply or lapply on row indices. Note that these are returning lists or vectors of results, so you'll need to convert into whatever format you ultimately need.
ret <- sapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
})
# if you need to keep each row's results rather encapsulated, use one of the following:
ret <- sapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
}, simplify=FALSE)
ret <- lapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
})
Use foreach and iterators.
library(foreach)
library(iterators)
ret <- foreach(df=iter(mydata, by='row'), .combine=rbind) %do% {
do_something(df) # just one row of mydata this time
}
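For the base R variants above, here is a minimal runnable sketch with a concrete stand-in for do_something. The two-column data frame and the choice of toupper() are assumptions made purely for illustration. It shows the transposition that apply introduces.

```r
# Small made-up data frame with three rows and two columns
mydata <- data.frame(
  V1 = c("Lorem", "ipsum", "dolor"),
  V2 = c("sit", "amet,", "consectetur"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the per-row operation
do_something <- function(r) toupper(r)

# apply() stores each row's result as a COLUMN of the output
ret <- apply(mydata, 1, do_something)
dim(ret)
# [1] 2 3

# Transpose to get back one row of results per row of mydata
ret_rows <- t(ret)
dim(ret_rows)
# [1] 3 2

# The row-index variant keeps each row's result as its own list element
ret_list <- lapply(seq_len(nrow(mydata)), function(i) {
  do_something(unlist(mydata[i, ]))
})
length(ret_list)
# [1] 3
```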
In the case of my (contrived) question, here are several ways to do it:
as.data.frame(t(apply(mydata, 1, function(r) sort(r, decreasing=TRUE))))
## V1 V2 V3 V4 V5 V6
## 1 sit Lorem incididunt et do adipiscing
## 2 ut ipsum elit, eiusmod dolore amet,
## 3 tempor sed magna labore dolor consectetur
as.data.frame(t(sapply(1:nrow(mydata), function(i) sort(mydata[i,], decreasing=TRUE))))
## same output
library(foreach)
library(iterators)
## notice the use of as.character(...), perhaps still a blasphemy
## to the structure of a data.frame
ret <- foreach(df=iter(mydata, by='row'), .combine=rbind) %do% {
sort(as.character(df), decreasing=TRUE)
}
ret
## [,1] [,2] [,3] [,4] [,5] [,6]
## result.1 "sit" "Lorem" "incididunt" "et" "do" "adipiscing"
## result.2 "ut" "ipsum" "elit," "eiusmod" "dolore" "amet,"
## result.3 "tempor" "sed" "magna" "labore" "dolor" "consectetur"
