It might seem a silly question, but how do I repeat this line 152 times? I would rather not use a for loop, since that will not be efficient later with larger data sets:
reviews = as.vector(t(mydata)[,1])
mydata is a data.frame,
reviews is a character vector, and
[,1] takes the first column of the transpose, i.e. the first row of mydata
The output could be a matrix or worst case a data.frame.
I tried something like this, but it did not work:
testing = apply(mydata, 1, function(x) {as.vector(t(mydata[,x]))})
Error in t(mydata)[, x] : subscript out of bounds
Thanks.
EDIT:
Quick data sample:
> reviews = as.vector(t(mydata)[,1])
> class(reviews)
[1] "character"
> length(reviews)
[1] 14
> reviews
[1] "I was involuntarily"
[2] "I was in transit"
[3] "My initial flight"
[4] "That still left"
[5] "After disembarking"
[6] "customs and proceed to my gate."
[7] "I arrived"
[8] "When my boarding pass was scanned"
[9] "No reason was given for the bump."
[10] "The UA gate staff"
[11] "I boarded Air Canada."
[12] "After arriving"
[13] "I spent 5 hours"
[14] NA
mydata data.frame:
> class(mydata)
[1] "data.frame"
> length(mydata[,1])
[1] 152
> mydata[,1]
[1] I was involuntarily... .
[2] First time... .
...
...
152 Levels: First time . ...
I have about 30,000 of these, but I want to start small, so only 152 paragraphs, split into individual sentences and put into a data.frame. Each row in the data.frame has 5-15 sentences.
I want to be able to access each row as a vector, since I need to perform some action on each row of the data.frame.
Packages used: plyr, sentiment (downloaded from here and installed manually)
EDIT 2:
dput(myData[1:6, 1:6])
structure(list(V1 = structure(c(70L, 41L, 94L, 114L, 47L, 49L),
.Label = c(" Air Canada",
"their service",
"hours for de-icing",
"have flown BA",
"my booking",
"If the video screen",
"Frankfurt flights",
"and another 150 lines of text data",
Here's a recommended way to ask a question, focusing on the fact that your actual data is too big, too complicated, or too private to share.
Question: how to apply a function to each row of a data.frame?
My data:
# make up some data
s <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
mydata <- as.data.frame(matrix(strsplit(s, '\\s')[[1]][1:18], nrow=3, ncol=6), stringsAsFactors=FALSE)
mydata
## V1 V2 V3 V4 V5 V6
## 1 Lorem sit adipiscing do incididunt et
## 2 ipsum amet, elit, eiusmod ut dolore
## 3 dolor consectetur sed tempor labore magna
If you have data that others can use directly, then, as has been suggested multiple times in the comments, dput is helpful:
mydata <- structure(list(V1 = c("Lorem", "ipsum", "dolor"),V2 = c("sit", "amet,", "consectetur"), V3 = c("adipiscing", "elit,", "sed"),
V4 = c("do", "eiusmod", "tempor"), V5 = c("incididunt", "ut", "labore"), V6 = c("et", "dolore", "magna")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6"), row.names = c(NA, -3L), class = "data.frame")
In either order, state (i) what you are trying to do, and (ii) what you have tried and how it is not working.
My desired output:
Converting a row into a vector is ... confusing. A row is already a vector, so I don't know what you are ultimately trying to do. So, I'll come up with something short and to the point: I want the words on each row to be in reverse alphabetical order, perhaps like this:
## V1 V2 V3 V4 V5 V6
## 1 sit Lorem incididunt et do adipiscing
## 2 ut ipsum elit, eiusmod dolore amet,
## 3 tempor sed magna labore dolor consectetur
This is a good time to show the code you've tried, errors you've encountered, and/or how the erroneous output differs from what you intended.
Answer, generically:
Several ways to do something to each row:
Use apply, though this breaks if you have numeric and character columns intermingled. If you try this, you'll see that the output is actually the transpose of what you may expect, in which case you'll need to wrap it (and all of the other *apply-based suggestions here) with t(...). It's a little confusing, but it's necessary here. Oh, and the results will all be of class matrix, which can easily be converted to a data.frame if needed. A minimal sketch of the transpose fix follows the skeleton below.
ret <- apply(mydata, 1, function(r) {
do_something(r)
})
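To make the transpose point concrete, a minimal sketch (reusing the do_something placeholder from the skeleton above):
# apply over rows hands back its results as columns, so transpose
# and convert to a data.frame if that's the format you need
ret <- as.data.frame(t(apply(mydata, 1, function(r) do_something(r))))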
Use sapply or lapply on row indices. Note that these return lists or vectors of results, so you'll need to convert them into whatever format you ultimately need.
ret <- sapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
})
# if you need to keep each row's results encapsulated (one list element per row), use one of the following:
ret <- sapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
}, simplify=FALSE)
ret <- lapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
})
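As a sketch of that conversion step (assuming each do_something(...) call returns a vector of the same length), do.call with rbind collapses the list into a matrix, which you can then make a data.frame:
ret <- lapply(1:nrow(mydata), function(i) do_something(mydata[i,]))
out <- as.data.frame(do.call(rbind, ret))  # one output row per input row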
Use foreach and iterators.
library(foreach)
library(iterators)
ret <- foreach(df=iter(mydata, by='row'), .combine=rbind) %do% {
do_something(df) # just one row of mydata this time
}
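If the per-row work is expensive, the same loop parallelizes with a %dopar% backend; a minimal sketch, assuming the doParallel package is installed (it is not used elsewhere in this answer):
library(doParallel)
registerDoParallel(cores = 2)  # register a parallel backend
ret <- foreach(df=iter(mydata, by='row'), .combine=rbind) %dopar% {
do_something(df)  # rows are distributed across the workers
}
stopImplicitCluster()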
In the case of my (contrived) question, here are several ways to do it:
as.data.frame(t(apply(mydata, 1, function(r) sort(r, decreasing=TRUE))))
## V1 V2 V3 V4 V5 V6
## 1 sit Lorem incididunt et do adipiscing
## 2 ut ipsum elit, eiusmod dolore amet,
## 3 tempor sed magna labore dolor consectetur
as.data.frame(t(sapply(1:nrow(mydata), function(i) sort(mydata[i,], decreasing=TRUE))))
## same output
library(foreach)
library(iterators)
## notice the use of as.character(...), perhaps still a blasphemy
## to the structure of a data.frame
ret <- foreach(df=iter(mydata, by='row'), .combine=rbind) %do% {
sort(as.character(df), decreasing=TRUE)
}
ret
## [,1] [,2] [,3] [,4] [,5] [,6]
## result.1 "sit" "Lorem" "incididunt" "et" "do" "adipiscing"
## result.2 "ut" "ipsum" "elit," "eiusmod" "dolore" "amet,"
## result.3 "tempor" "sed" "magna" "labore" "dolor" "consectetur"
Related
I have the following character vector:
library(stringi)
s=stri_rand_lipsum(10)
The function grepl searches for matches to the argument pattern within each element of a character vector. As far as I know, it searches for just one pattern at a time. For example, if I want to search for "conubia" and "viverra", I have to perform two searches:
x=s[grepl("conubia",s)]
x=x[grepl("viverra",x)]
Anyway, I would like to search for two or more terms that appear in the same entry of s within a window of, e.g., 140 characters.
You can use the *apply family. If your source text is a character vector, I recommend vapply, but you have to specify the type and length of the returned values. Because you use grepl, the returned values are logical vectors.
txt = "My name is Abdur Rohman"
patt = c("na", "Ab","man", "om")
vapply(patt, function(x) grepl(x,txt),
FUN.VALUE = logical(length(txt)))
# na Ab man om
# TRUE TRUE TRUE FALSE
So, in your example you can use:
s = stri_rand_lipsum(10)
vapply(c("conubia","viverra"), function(x) grepl(x,s),
FUN.VALUE = logical(length(s))
# conubia viverra
# [1,] TRUE TRUE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# [6,] FALSE TRUE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
#[10,] FALSE FALSE
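If the goal is to pull out the entries of s that contain every pattern, one way (building on the logical matrix above) is to keep the rows where all columns are TRUE:
m <- vapply(c("conubia","viverra"), function(x) grepl(x,s),
       FUN.VALUE = logical(length(s)))
s[rowSums(m) == ncol(m)]  # entries matching both patterns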
Edit to include a 140-character window
As for the requirement of a limiting window of 140 characters, as explained in your comment, one way to meet it is to extract all characters between the two targeted strings and then count the extracted characters. The requirement is met only if the count is less than or equal to 140.
Extracting all characters between two strings can be done with regular expressions in gsub. However, if the strings are repeated, you need to specify the window. Let me give examples:
txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"
This text contains two occurrences each of conubia and viverra, so you have four options for the window of characters between conubia and viverra.
Option 1: between the last conubia and the first viverra
gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 2: between the first conubia and the last viverra
gsub(".*?conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
# [1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
Option 3: between the first conubia and the first viverra
gsub(".*?conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 4: between the last conubia and the last viverra
gsub(".*conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
To count the extracted characters, nchar can be used.
# Option 1
nchar(gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE))
#[1] 68
Applying this approach:
set.seed(8)
s1 <- stri_rand_lipsum(10)
Nch <- nchar(gsub(".*conubia(.*?)viverra.*", "\\1", s1, perl = TRUE))
Nch
# [1] 637 42 512 528 595 640 522 407 388 512
we find that the second element of s1 meets the requirement.
To print the element we can use: s1[which(Nch <= 140)].
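An alternative sketch for the window test uses match positions instead of gsub. Note it compares the first occurrence of each term (start to start), which can differ from the last-conubia-to-first-viverra span measured above:
pos1 <- regexpr("conubia", s1)  # -1 where there is no match
pos2 <- regexpr("viverra", s1)
s1[pos1 > 0 & pos2 > 0 & abs(pos2 - pos1) <= 140]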
Some great references I've been learning from:
https://www.buymeacoffee.com/wstribizew/extracting-text-two-strings-regular-expressions
https://regex101.com/
Extracting a string between other two strings in R
I would like to insert a list of substrings (word_list) into a string (text) at specific positions (idx_list):
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")
I know there are multiple candidate functions (gsub, stri_sub, etc.) which I can use in a loop. However, this gets quite slow on large corpora. Is there a more efficient solution? Maybe a vectorized one?
Solution 1
A small wrapper around a stringi package function, for the input the OP wanted.
inject <- function(string, index, replacement){
stringi::stri_sub_replace_all(string, from = index,
to = index-1,
replacement = replacement)
}
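Note the from = index, to = index - 1 trick: each target range is empty, so stri_sub_replace_all inserts the replacement at index without overwriting any existing characters.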
text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list <- c(5, 16, 30, 50)
word_list <- c("AAA", "BBB", "CCC", "DDD")
inject(text, idx_list, word_list)
#> [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Solution 2
An adaptation of Zach Foster's answer, condensed into a single function.
inject_two <- function(string, inject, index) {
inject <- inject[order(index)]
index <- sort(index)
# expand string
split <- substr(rep(string, length(index) + 1),
start = c(1, index),
stop = c(index - 1, nchar(string))
)
ord1 <- 2 * (1:length(split)) - 1
ord2 <- 2 * (1:length(inject))
paste(c(split, inject)[order(c(ord1, ord2))], collapse = "")
}
inject_two(text, word_list, idx_list)
#> [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Benchmarks
evans <- function(string, index, replacement){
ord <- order(-index)
Reduce(function(S, R) {
paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, index[ord], replacement[ord]), string)
}
ggplot2::autoplot(microbenchmark::microbenchmark(
inject(text, idx_list, word_list),
inject_two(text, word_list, idx_list),
evans(text, idx_list, word_list),
times = 1000
))
For few insertions all solutions perform similarly. What if we perform many (here N = 410 000) insertions?
text_large = paste0(rep(text, 10000), collapse = "")
idx_list_large = seq(1, nchar(text_large), by = 3)
word_list_large = sample(LETTERS, size = length(idx_list_large), replace = T)
bench::mark(
inject = inject(text_large, idx_list_large, word_list_large),
inject_two = inject_two(text_large, word_list_large, idx_list_large),
iterations = 50
)[,c(1,3,5,7)]
# A tibble: 2 x 4
expression median mem_alloc n_itr
<bch:expr> <bch:tm> <bch:byt> <int>
1 inject 32.2ms 9.38MB 50
2 inject_two 157.4ms 65.69MB 50
Surprise surprise, the C++-based stringi takes the cake in both speed and memory. Note that evans is omitted because a single pass took multiple minutes when tested.
I think it's important to start from the last (highest) index in idx_list, since inserting from the front would shift all later positions. (Adjusting for the shift is certainly not hard, but going backwards seems easier.)
# 0 1 2 3 4 5 6 7 8 9 a b c
# 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")
The work:
ord <- order(-idx_list)
Reduce(function(S, R) {
paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, idx_list[ord], word_list[ord]), text)
# [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
Walk-through:
ord is just the decreasing order, so for instance
word_list[ord]
# [1] "DDD" "CCC" "BBB" "AAA"
because we're going to use Reduce (explanation in a second), we need each pair idx_list[i] and word_list[i] to arrive as one argument, not as two separate ones; for this, we combine them using Map(list, ...), which "zips" them together into a single list, each element containing the character position and the string to insert:
str( Map(list, idx_list[ord], word_list[ord]) )
# List of 4
# $ :List of 2
# ..$ : num 50
# ..$ : chr "DDD"
# $ :List of 2
# ..$ : num 30
# ..$ : chr "CCC"
# $ :List of 2
# ..$ : num 16
# ..$ : chr "BBB"
# $ :List of 2
# ..$ : num 5
# ..$ : chr "AAA"
(This can be used with an arbitrary number of arguments.)
Because we need to insert a string, then insert another string into the result of the first insertion, the base function Reduce works well here. Its first argument is a function that accepts two arguments: the result of the previous call, and the next element from the Map'd list.
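As a minimal illustration of the fold, here is a single step with one (position, word) pair; the full call just repeats this, feeding each result into the next insertion:
Reduce(function(S, R) {
paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, list(list(5, "AAA")), "Lorem ipsum")
# [1] "LoreAAAm ipsum"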
I have a character vector containing textual data which I can subset by selecting individual lines. The eventual goal is to store different sequences of the vector as independent variables or as elements of a list. I am able to do this using a simple loop, but I don't succeed in subsetting the character vector by a list of sequences.
See the following example:
Text<-scan("~/Desktop/Lorem Ipsum.txt", what="character", sep="\n")
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
[5] "Ut enim ad minim veniam, "
[6] "quis nostrud exercitation "
[7] "ullamco laboris nisi ut aliquip ex ea commodo consequat."
[8] "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur."
[9] "Excepteur sint occaecat cupidatat non proident,"
[10] "sunt in culpa qui officia deserunt mollit anim id est laborum."
The normal way of subsetting the vector would be Text[1:4], returning
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
In a list I have stored sequences of numbers that represent different sets of lines in the vector.
Sentence.numbers<-c(1:4, 5:7, 8, 9:10).
Now I would like to subset all the numbers that make up the different sentences at once and store them in a list for further analysis.
I used Text[Sentence.numbers], but the error message is "invalid index type 'list'".
Is there a way to use a list of values to subset?
You need to set up Sentence.numbers as a list and then use lapply -
Sentence.numbers <- list(1:4, 5:7, 8, 9:10)
lapply(Sentence.numbers, function(x) Text[x])
Here's an example -
lapply(Sentence.numbers, function(x) letters[x])
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
[[3]]
[1] "h"
[[4]]
[1] "i" "j"
Does anyone know of anything they can recommend for extracting just the plain text from an article in .docx format (preferably with R)?
Speed isn't crucial, and we could even use a website that has some API to upload and extract the files, but I've been unable to find one. I need to extract the introduction, the method, the results, and the conclusion; I want to delete the abstract, the references, and especially the graphics and the tables.
Thanks.
You can try the readtext library:
library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file
The variable x contains just the text without any formatting, so if you need to extract some information you need to perform a string search. For example, for the document you mentioned in your comment, one approach could be as follows:
library(readtext)
doc.text <- readtext("test.docx")$text
# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]
# First line in the document - the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"
# Similarly we can extract some other parts from a header
issn <- doc.parts[2]
issue <- doc.parts[3]
# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]
# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]
# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")
# Same way we can get the Keywords text (first find where the next section starts):
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"
# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")
# Assuming that Results is Part 3
Results.loc <- Methods.loc- + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")
# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants.
# This study was supported by a grant from the Tunisian Ministry of Health and
# Ministry of Higher Education ...
The exact approach depends on the common structure of all the documents you need to search through. For example, if the first section is always named "Background", you can use this word for your search. However, if it could sometimes be "Background" and sometimes "Introduction", then you might want to search for the "1." pattern instead.
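For instance, a sketch that accepts any of those headings via alternation (the exact patterns are assumptions about your documents):
# match "Background", "Introduction", or a numbered heading like "1."
first.section.loc <- grep("^(Background|Introduction|1\\.)", doc.parts)[1]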
You should find that one of these packages will do the trick for you.
https://davidgohel.github.io/officer/
https://cran.r-project.org/web/packages/docxtractr/index.html
At the end of the day, the modern Office file formats (OpenXML) are simply *.zip files containing structured XML content, so if you have well-structured content you may just want to open it that way. I would start here (http://officeopenxml.com/anatomyofOOXML.php), and you should be able to unpick the OpenXML SDK for guidance as well (https://msdn.microsoft.com/en-us/library/office/bb448854.aspx).
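A minimal sketch of that route, assuming the xml2 package and a hypothetical myfile.docx: unzip the main document part and pull the text of each w:p (paragraph) node.
library(xml2)
tmp <- tempdir()
unzip("myfile.docx", files = "word/document.xml", exdir = tmp)
doc <- read_xml(file.path(tmp, "word", "document.xml"))
# each w:p node is one paragraph; xml_text() drops all formatting
paragraphs <- vapply(xml_find_all(doc, "//w:p"), xml_text, character(1))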
Pandoc is a fantastic solution for tasks like this. With a document named a.docx you would run at the command line
pandoc -f docx -t markdown -o a.md a.docx
You could then use regex tools in R to extract what you need from the newly created a.md, which is plain text. By default, images are not converted.
Pandoc ships with RStudio, by the way, so you may already have it.
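If you prefer to stay inside R, a sketch using rmarkdown's wrapper around the same call (assuming a.docx sits in the working directory):
library(rmarkdown)
pandoc_convert("a.docx", to = "markdown", output = "a.md")
md <- readLines("a.md")  # plain text, ready for regex extraction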
You can do it with package officer:
library(officer)
example_pptx <- system.file(package = "officer", "doc_examples/example.docx")
doc <- read_docx(example_pptx)
summary_paragraphs <- docx_summary(doc)
summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", "text"]
#> [1] "Title 1"
#> [2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
#> [3] "Title 2"
#> [4] "Quisque tristique "
#> [5] "Augue nisi, et convallis "
#> [6] "Sapien mollis nec. "
#> [7] "Sub title 1"
#> [8] "Quisque tristique "
#> [9] "Augue nisi, et convallis "
#> [10] "Sapien mollis nec. "
#> [11] ""
#> [12] "Phasellus nec nunc vitae nulla interdum volutpat eu ac massa. "
#> [13] "Sub title 2"
#> [14] "Morbi rhoncus sapien sit amet leo eleifend, vel fermentum nisi mattis. "
#> [15] ""
#> [16] ""
#> [17] ""
I just started with the tm package in R and cannot seem to overcome an issue.
Even though my tokenizer functions seem to work correctly:
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer))
when I try to pull 2-grams from biTDM, only 1-grams come up...
findFreqTerms(biTDM, 50)
[1] "after" "and" "most" "the" "were" "years" "love"
[8] "you" "all" "also" "been" "did" "from" "get"
at the same time, the 2-gram tokenizer itself appears to be intact:
x <- biTokenizer(corpus)
head(x)
[1] "c in" "in the" "the years"
[4] "years thereafter" "thereafter most" "most of"
I can only assume what the problem is here: NGramTokenizer needs a VCorpus object rather than a Corpus object.
library(tm)
library(RWeka)
# some dummy text
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr",
"sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat",
"sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet")
# create a VCorpus
corpus <- VCorpus(VectorSource(text))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
print(biTDM$dimnames$Terms)
[1] "accusam et" "aliquyam erat" "amet consetetur" "at vero" "clita kasd" "consetetur sadipscing" "diam nonumy" "diam voluptua" "dolor sit" "dolore magna"
[11] "dolores et" "duo dolores" "ea rebum" "eirmod tempor" "eos et" "est lorem" "et accusam" "et dolore" "et ea" "et justo"
[21] "gubergren no" "invidunt ut" "ipsum dolor" "justo duo" "kasd gubergren" "labore et" "lorem ipsum" "magna aliquyam" "no sea" "nonumy eirmod"
[31] "sadipscing elitr" "sanctus est" "sea takimata" "sed diam" "sit amet" "stet clita" "takimata sanctus" "tempor invidunt" "ut labore" "vero eos"
[41] "voluptua at"