Introduction
Given a string in R, is it possible to get a vectorized solution (i.e. no loops) that breaks the string into blocks, where each block is delimited by the nth occurrence of a substring?
Work done with Reproducible Example
Suppose we have several paragraphs of the famous Lorem Ipsum text.
library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)
my.string = capture.output(lipsum(5))          # one element per printed line
my.string = paste(my.string, collapse = " ")   # collapse into a single string
> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "
We would like to break this text into segments at every 3rd occurrence of the word " in" (a leading space is included in order to distinguish it from words that merely contain "in", such as "min").
I have the following solution with a while loop:
# We wish to break up the string at every
# 3rd occurrence of the word " in"
break.character = " in"
break.occurrence = 3
string.list = list()
i = 1
# initialize string to send into the loop
current.string = my.string
while(length(current.string) > 0){
# Enter the segment that occurs BEFORE the nth occurrence of the character of interest into the list
string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)
# Update the next string to examine:
# the next string is the current string AFTER the nth occurrence of the character of interest
current.string = str_after_nth(current.string, break.character, break.occurrence)
i = i + 1
}
We are able to get the desired output in a list, along with a warning (warning not shown):
> string.list # (partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"
[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...
[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
Goal
Is it possible to improve this solution by vectorizing (i.e. using apply(), lapply(), mapply(), etc.)? Also, my current solution cuts off the last occurrence of the substring in each block.
The current solution may not work well on extremely long strings (such as DNA sequences where we are looking for blocks with the nth occurrence of a substring of nucleotides).
Try with this:
# split the string on the separator; the pieces no longer contain " in "
text_split <- strsplit(my.string, " in ")[[1]]
l <- length(text_split)
# number of complete blocks of 3 pieces
n <- floor(l / 3)
# starting index of each block of 3 pieces (1, 4, 7, ...)
Seq <- seq(1, by = 3, length.out = n)
L <- sapply(Seq, function(x) {
  # re-join 3 consecutive pieces and re-append the " in " that closed the block
  paste0(paste(text_split[x:(x + 2)], collapse = " in "), " in ")
})
# leftover pieces when the number of pieces is not divisible by 3
if (l > (n * 3)) {
  L <- c(L, paste(text_split[(n * 3 + 1):l], collapse = " in "))
}
The last conditional handles the case where the number of " in" occurrences is not divisible by 3. Also, the final " in " pasted on in the sapply() is there because I don't know what you want to do with the occurrence that separates your blocks.
Let me know if this does the trick; I will try to make it faster. It keeps the third " in" at the end of each block. If it works I'll annotate it more, too.
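For illustration, here is a minimal run of that approach on a toy string (the string is invented for this example; the step of 3 moves from one block of three pieces to the next):
s <- "a in b in c in d in e in f in g in h"
s_split <- strsplit(s, " in ")[[1]]    # 8 pieces, so 2 full blocks plus a remainder
n <- floor(length(s_split) / 3)
blocks <- sapply(seq(1, by = 3, length.out = n), function(x)
  paste0(paste(s_split[x:(x + 2)], collapse = " in "), " in "))
c(blocks, paste(s_split[(n * 3 + 1):length(s_split)], collapse = " in "))
# [1] "a in b in c in " "d in e in f in " "g in h"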
library(lipsum)
library(stringi)
my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")
# end position (the trailing space) of every " in " match
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
# blocks start at position 1 and then at the end of every 3rd match
start_of_strings <- c(1, end_of_in[c(F, F, T)])
# blocks end one character before that; the final block runs to the end of the string
end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
# drop any duplicated end positions
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
# slice the original string into the blocks
stri_sub(my.string, start_of_strings, end_of_strings)
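For intuition, the same locate-and-slice idea on a made-up toy string (note how every block after the first begins with the space that followed the 3rd "in"; the output comment comes from a manual trace, so treat it as illustrative):
library(stringi)
s <- "a in b in c in d in e in f in g in h"
ends <- stri_locate_all(fixed = " in ", s)[[1]][, 2]
stri_sub(s, c(1, ends[c(F, F, T)]), c(ends[c(F, F, T)] - 1, nchar(s)))
# [1] "a in b in c in"  " d in e in f in" " g in h"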
EDIT: actually, use stri_sub from stringi. It will scale much better than substring, and much better than splitting and re-pasting, which has to allocate every piece and then concatenate them all again. See:
my.string <- paste(rep(my.string, 10000), collapse = " ")
nchar(my.string)
[1] 22349999
microbenchmark::microbenchmark(
sol1 = {
text_split=strsplit(my.string," in ")[[1]]
l=length(text_split)
n = floor(l/3)
Seq = seq(1,by=3,length.out = n)
L= list()
L=sapply(Seq, function(x){
paste0(paste(text_split[x:(x+2)],collapse=" in ")," in ")
})
if (l>(n*3)){
L = c(L,paste(text_split[(n*3+1):l],collapse=" in "))
}
},
sol2 = {
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
start_of_strings <- c(1, end_of_in[c(F, F, T)])
end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
stri_sub(my.string, start_of_strings, end_of_strings)
},
times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
sol1 914.1268 927.45958 941.36117 939.80361 950.18099 980.86941 10
sol2 55.4163 56.40759 58.53444 56.86043 57.03707 71.02974 10
How can you take the string and replace every instance of ".", ",", " " (i.e. dot, comma or space) with one random character selected from c('|', ':', '#', '*')?
Say I have a string like this
Aenean ut odio dignissim augue rutrum faucibus. Fusce posuere, tellus eget viverra mattis, erat tellus porta mi, at facilisis sem nibh non urna. Phasellus quis turpis quis mauris suscipit vulputate. Sed interdum lacus non velit. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae;
To get one random character, we can treat the characters as a vector and then use the sample() function to select one. I assume I first need to search for the dots, commas and spaces, and then use the gsub() function to replace them all?
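A rough sketch of that character-by-character idea (assuming the paragraph above is stored in x; this is only to make the intent concrete, not necessarily the fastest way):
chars <- strsplit(x, "")[[1]]              # one element per character
idx <- chars %in% c(".", ",", " ")         # positions of dots, commas and spaces
chars[idx] <- sample(c("|", ":", "#", "*"), sum(idx), replace = TRUE)
paste(chars, collapse = "")                # reassemble the string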
Given your clarification, try this one:
x <- c("this, is nice.", "nice, this is.")
gr <- gregexpr("[., ]", x)
regmatches(x,gr) <- lapply(lengths(gr), sample, x=c('|',':','#','*'))
x
#[1] "this|*is#nice:" "nice#|this*is:"
Here is another option with chartr
# pick 3 random replacement characters; chartr() maps '.', ',' and ' ' to the
# 1st, 2nd and 3rd character of pat respectively, so every '.' gets the same
# replacement (unlike the per-match sampling above)
pat <- paste(sample(c('|', ';', '#', '*'), 3), collapse = "")
chartr('., ', pat, x)
# [1] "this|*is*nice;" "nice|*this*is;"   (one possible random result)
data
x <- c("this, is nice.", "nice, this is.")
I have to save an R-dataset in Stata's .dta format.
The dataset contains, among other data, a single column containing long strings (column 3).
test data:
r_data <- data.frame(ae = 1, be = 2,
                     ce = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet",
                     stringsAsFactors = FALSE)
export to dta
library(foreign)
write.dta(r_data, file = "r_data.dta")
results in this warning message:
Warning message:
In write.dta(r_data, file = "r_data.dta") :
character strings of >244 bytes in column 3 will be truncated
Furthermore, I can't open the file in Stata (14 SE) at all due to an error stating:
. use "r_data.dta"
file not Stata format
.dta file contains 1 invalid storage-type code.
File uses invalid codes other than code 0.
r(610);
How can I save longer strings as a .dta file?
An R solution is preferred because I am not experienced with Stata.
PS: The indirect route via a CSV file does not work, because the resulting CSV file is too big for my limited RAM when importing into Stata.
This is an old question, but it deserves to be closed out:
Use the haven package to write a .dta file in Stata 14 format, which supports long strings.
library(haven)
r_data <- data.frame(ae = 1, be = 2, ce = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet",
stringsAsFactors = FALSE)
# write_dta() defaults to a modern Stata format, so the long string in column 3 is not truncated
write_dta(r_data, "r_data.dta")
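As an optional sanity check (assuming haven is installed), the file can be read back into R to confirm the long string survived the round trip:
chk <- haven::read_dta("r_data.dta")
nchar(chk$ce)               # should match nchar(r_data$ce), i.e. well above 244 characters
all(chk$ce == r_data$ce)    # TRUE if the string was written without truncation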
I want to text-mine a set of files based on the form below. I can create a corpus where each file is a document (using tm), but I'm thinking it might be better to create a corpus where each section of the second table on the form is a document with the following metadata:
Author : John Smith
DateTimeStamp: 2013-04-18 16:53:31
Description :
Heading : Current Focus
ID : Smith-John_e.doc Current Focus
Language : en_CA
Origin : Smith-John_e.doc
Name : John Smith
Title : Manager
TeamMembers : Joe Blow, John Doe
GroupLeader : She who must be obeyed
where Name, Title, TeamMembers and GroupLeader are extracted from the first table on the form. In this way, each chunk of text to be analyzed would maintain some of its context.
What is the best way to approach this? I can think of 2 ways:
somehow parse the corpus I have into child corpora.
somehow parse the document into subdocuments and make a corpus from those.
Any pointers would be much appreciated.
This is the form:
Here is an RData file of a corpus with 2 documents. exc[[1]] came from a .doc and exc[[2]] came from a docx. They both used the form above.
Here's a quick sketch of a method; hopefully it might provoke someone more talented to stop by and suggest something more efficient and robust. Using the RData file in your question, I found that the doc and docx files have slightly different structures and so require slightly different approaches (though I see in the metadata that your docx is 'fake2.txt', so is it really docx? I see in your other question that you used a converter outside of R, which must be why it's txt).
library(tm)
First get the custom metadata for the doc file. I'm no regex expert, as you can see, but the idea is roughly: remove the punctuation, drop the field label (e.g. "Name"), and then trim the leading and trailing spaces...
# create User-defined local meta data pairs
meta(exc[[1]], type = "corpus", tag = "Name1") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[1]][3])))
meta(exc[[1]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[1]][4])))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[1]][5])))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[1]][7])))
Now have a look at the result
# inspect
meta(exc[[1]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 13:59:28
Description :
Heading :
ID : fake1.doc
Language : en_CA
Origin :
User-defined local meta data pairs are:
$Name1
[1] "John Doe"
$Title
[1] "Manager"
$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"
$ManagerName
[1] "Selma Furtgenstein"
Do the same for the docx file
# create User-defined local meta data pairs
meta(exc[[2]], type = "corpus", tag = "Name2") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[2]][2])))
meta(exc[[2]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[2]][4])))
meta(exc[[2]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[2]][6])))
meta(exc[[2]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[2]][8])))
And have a look
# inspect
meta(exc[[2]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 14:06:10
Description :
Heading :
ID : fake2.txt
Language : en
Origin :
User-defined local meta data pairs are:
$Name2
[1] "Joe Blow"
$Title
[1] "Shift Lead"
$TeamMembers
[1] "Melanie Baumgartner Toby Morrison"
$ManagerName
[1] "Selma Furtgenstein"
If you have a large number of documents then a lapply function that includes these meta functions would be the way to go.
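For a large set of documents, a minimal sketch of that idea (it assumes every document follows the same layout as exc[[1]], so the same line indices and field labels apply; the helper name extract_meta is made up here, and note that lapply returns a plain list rather than a corpus):
# hypothetical helper: pull the labelled fields out of one document and attach them as metadata
extract_meta <- function(doc) {
  clean <- function(line, label) gsub("^\\s+|\\s+$", "", gsub(label, "", gsub("[[:punct:]]", "", line)))
  meta(doc, type = "corpus", tag = "Name")        <- clean(doc[3], "Name")
  meta(doc, type = "corpus", tag = "Title")       <- clean(doc[4], "Title")
  meta(doc, type = "corpus", tag = "TeamMembers") <- clean(doc[5], "Team Members")
  meta(doc, type = "corpus", tag = "ManagerName") <- clean(doc[7], "Name of your")
  doc
}
exc_meta <- lapply(exc, extract_meta)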
Now that we've got the custom metadata, we can subset the documents to exclude that part of the text:
# create new corpus that excludes part of doc that is now in metadata. We just use square bracket indexing to subset the lines that are the second table of the forms (slightly different for each doc type)
excBody <- Corpus(VectorSource(c(paste(exc[[1]][13:length(exc[[1]])], collapse = ","),
paste(exc[[2]][9:length(exc[[2]])], collapse = ","))))
# get rid of all the white spaces
excBody <- tm_map(excBody, stripWhitespace)
Have a look:
inspect(excBody)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
|CURRENT RESEARCH FOCUS |,| |,|Lorem ipsum dolor sit amet, consectetur adipiscing elit. |,|Donec at ipsum est, vel ullamcorper enim. |,|In vel dui massa, eget egestas libero. |,|Phasellus facilisis cursus nisi, gravida convallis velit ornare a. |,|MAIN AREAS OF EXPERTISE |,|Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel. |,|In sit amet ante non turpis elementum porttitor. |,|TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED |,| Vestibulum sed turpis id nulla eleifend fermentum. |,|Nunc sit amet elit eu neque tincidunt aliquet eu at risus. |,|Cras tempor ipsum justo, ut blandit lacus. |,|INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS) |,| Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio. |,|Etiam a justo vel sapien rhoncus interdum. |,|ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT |,|(Please include anticipated percentages of your time.) |,| Proin vitae ligula quis enim vulputate sagittis vitae ut ante. |,|ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES |,|e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign |,|Canvasser (GCWCC), OSH representative, Social Committee |,|Sed nec tellus nec massa accumsan faucibus non imperdiet nibh. |,,
[[2]]
CURRENT RESEARCH FOCUS,,* Lorem ipsum dolor sit amet, consectetur adipiscing elit.,* Donec at ipsum est, vel ullamcorper enim.,* In vel dui massa, eget egestas libero.,* Phasellus facilisis cursus nisi, gravida convallis velit ornare a.,MAIN AREAS OF EXPERTISE,* Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel.,* In sit amet ante non turpis elementum porttitor. ,TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED,* Vestibulum sed turpis id nulla eleifend fermentum.,* Nunc sit amet elit eu neque tincidunt aliquet eu at risus.,* Cras tempor ipsum justo, ut blandit lacus.,INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS),* Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio.,* Etiam a justo vel sapien rhoncus interdum.,ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT ,(Please include anticipated percentages of your time.),* Proin vitae ligula quis enim vulputate sagittis vitae ut ante.,ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES,e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign Canvasser (GCWCC), OSH representative, Social Committee,* Sed nec tellus nec massa accumsan faucibus non imperdiet nibh.,,
Now the documents are ready for text mining, with the data from the upper table moved out of the document and into the document metadata.
Of course all of this depends on the documents being highly regular. If there are different numbers of lines in the first table in each doc, then the simple indexing method might fail (give it a try and see what happens) and something more robust will be needed.
UPDATE: A more robust method
Having read the question a little more carefully, and having got a bit more education about regex, here's a method that is more robust and doesn't depend on indexing specific lines of the documents. Instead, we use regular expressions to extract the text between two known words, both to build the metadata and to split the document.
Here's how we make the User-defined local meta data (a method to replace the one above)
library(gdata) # for the trim function
txt <- paste0(as.character(exc[[1]]), collapse = ",")
# inspect the document to identify the words on either side of the string
# we want, so 'Name' and 'Title' are on either side of 'John Doe'
extract <- regmatches(txt, gregexpr("(?<=Name).*?(?=Title)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Name1") <- trim(gsub("[[:punct:]]", "", extract))
extract <- regmatches(txt, gregexpr("(?<=Title).*?(?=Team)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Title") <- trim(gsub("[[:punct:]]","", extract))
extract <- regmatches(txt, gregexpr("(?<=Members).*?(?=Supervised)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- trim(gsub("[[:punct:]]","", extract))
extract <- regmatches(txt, gregexpr("(?<=your).*?(?=Supervisor)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- trim(gsub("[[:punct:]]","", extract))
# inspect
meta(exc[[1]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 13:59:28
Description :
Heading :
ID : fake1.doc
Language : en_CA
Origin :
User-defined local meta data pairs are:
$Name1
[1] "John Doe"
$Title
[1] "Manager"
$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"
$ManagerName
[1] "Selma Furtgenstein"
Similarly, we can extract the sections of your second table into separate vectors, and then you can make them into documents and corpora or just work on them as vectors.
txt <- paste0(as.character(exc[[1]]), collapse = ",")
CURRENT_RESEARCH_FOCUS <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=CURRENT RESEARCH FOCUS).*?(?=MAIN AREAS OF EXPERTISE)", txt, perl=TRUE))))
[1] "Lorem ipsum dolor sit amet consectetur adipiscing elit Donec at ipsum est vel ullamcorper enim In vel dui massa eget egestas libero Phasellus facilisis cursus nisi gravida convallis velit ornare a"
MAIN_AREAS_OF_EXPERTISE <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=MAIN AREAS OF EXPERTISE).*?(?=TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED)", txt, perl=TRUE))))
[1] "Vestibulum aliquet faucibus tortor sed aliquet purus elementum vel In sit amet ante non turpis elementum porttitor"
And so on. I hope that's a bit closer to what you're after. If not, it might be best to break down your task into a set of smaller, more focused questions, and ask them separately (or wait for one of the gurus to stop by this question!).