I have the following character vector of strings:
library(stringi)
s=stri_rand_lipsum(10)
The function grepl searches for matches to the argument pattern within each element of a character vector. As far as I know, it can only search for one term at a time. For example, to search for both "conubia" and "viverra" I have to perform two searches:
x=s[grepl("conubia",s)]
x=x[grepl("viverra",x)]
What I would like is to search for two or more terms that appear in the same element of s within a window of a given length, e.g. 140 characters.
You can use the *apply family. If your source text is a character vector, I recommend vapply, but you have to specify the type and length of the returned values. Because you use grepl, the returned values are logical vectors.
txt = "My name is Abdur Rohman"
patt = c("na", "Ab","man", "om")
vapply(patt, function(x) grepl(x, txt),
       FUN.VALUE = logical(length(txt)))
#    na    Ab   man    om
#  TRUE  TRUE  TRUE FALSE
So, in your example you can use:
s = stri_rand_lipsum(10)
vapply(c("conubia","viverra"), function(x) grepl(x,s),
FUN.VALUE = logical(length(s))
# conubia viverra
# [1,] TRUE TRUE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# [6,] FALSE TRUE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
#[10,] FALSE FALSE
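If you then want just the elements of s where every pattern matched, a small follow-up sketch storing the same call in a matrix:
# keep only the elements of s for which all patterns returned TRUE
m <- vapply(c("conubia", "viverra"), function(x) grepl(x, s),
            FUN.VALUE = logical(length(s)))
s[rowSums(m) == ncol(m)]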
Edit to include a 140-character window
As for the requirement of a 140-character window, as explained in your comment, one way to meet it is to extract all characters between the two targeted strings and then count the extracted characters. The requirement is met only if the count is less than or equal to 140.
Extracting all characters between two strings can be done with regular expressions in gsub. However, if the strings are repeated, you need to specify which occurrences delimit the window. Let me give examples:
txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"
This text contains two occurrences of conubia and two of viverra, so there are four options for the window spanning the characters between conubia and viverra.
Option 1: between the last conubia and the first viverra
gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 2: between the first conubia and the last viverra
gsub(".*?conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
# [1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
Option 3: between the first conubia and the first viverra
gsub(".*?conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 4: between the last conubia and the last viverra
gsub(".*conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
To calculate the number of the extracted characters, nchar can be used.
# Option 1
nchar(gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE))
#[1] 68
Applying this approach:
set.seed(8)
s1 <- stri_rand_lipsum(10)
Nch <- nchar(gsub(".*conubia(.*?)viverra.*", "\\1", s1, perl = TRUE))
Nch
# [1] 637 42 512 528 595 640 522 407 388 512
We find that only the second element of s1 meets the requirement.
To print the element we can use: s1[which(Nch <= 140)].
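Putting the two pieces together, here is a minimal sketch of a helper, assuming (as above) that "conubia" appears before "viverra"; it keeps only the elements where both terms occur and the gap between them is at most 140 characters:
within_window <- function(x, a = "conubia", b = "viverra", window = 140) {
  both <- grepl(a, x) & grepl(b, x)
  gap  <- nchar(gsub(paste0(".*", a, "(.*?)", b, ".*"), "\\1", x, perl = TRUE))
  both & gap <= window
}
s1[within_window(s1)]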
Some great references I've been learning from:
https://www.buymeacoffee.com/wstribizew/extracting-text-two-strings-regular-expressions
https://regex101.com/
Extracting a string between other two strings in R
Does anyone know of anything they can recommend in order to extract just the plain text from an article in .docx format (preferably with R)?
Speed isn't crucial, and we could even use a website with an API to upload and extract the files, but I've been unable to find one. I need to extract the introduction, the methods, the results and the conclusion, and I want to drop the abstract, the references, and especially the graphics and the tables.
Thanks.
You can try the readtext library:
library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file
The variable x contains just the text, without any formatting, so if you need to extract specific parts you have to do a string search. For example, for the document you mentioned in your comment, one approach could be as follows:
library(readtext)
doc.text <- readtext("test.docx")$text
# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]
# First line in the document- the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"
# Similarly we can extract some other parts from a header
issn <- doc.parts[2]
issue <- doc.parts[3]
# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]
# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]
# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")
# Same way we can get Keywords text:
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"
# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")
# Assuming that Results is Part 3
Results.loc <- Methods.loc + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")
# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants.
# This study was supported by a grant from the Tunisian Ministry of Health and
# Ministry of Higher Education ...
The exact approach depends on the structure shared by all the documents you need to search through. For example, if the first section is always named "Background" you can use that word for your search. However, if it could sometimes be "Background" and sometimes "Introduction", then you might want to search for the "1." pattern instead, as sketched below.
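A hedged sketch of that "1.", "2.", ... idea, assuming the numbered headings sit at the start of their own lines:
# find all numbered headings, then slice the text between consecutive headings
heading.locs <- grep("^[0-9]+\\. ", trimws(doc.parts))
sections <- mapply(function(from, to) paste(doc.parts[from:(to - 1)], collapse = " "),
                   heading.locs, c(heading.locs[-1], length(doc.parts) + 1))
names(sections) <- trimws(doc.parts[heading.locs])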
You should find that one of these packages will do the trick for you.
https://davidgohel.github.io/officer/
https://cran.r-project.org/web/packages/docxtractr/index.html
At the end of the day, the modern Office file formats (OpenXML) are simply *.zip files containing structured XML content, so if you have well-structured content you may just want to open them that way. I would start here (http://officeopenxml.com/anatomyofOOXML.php), and you should be able to unpick the OpenXML SDK for guidance as well (https://msdn.microsoft.com/en-us/library/office/bb448854.aspx).
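If you want to go down that route from R, here is a rough sketch using base unzip() and the xml2 package (the internal path word/document.xml reflects the usual layout, but treat it as an assumption about your file):
library(xml2)
tmp <- tempdir()
# a .docx is a zip archive; the body text lives in word/document.xml
unzip("myfile.docx", files = "word/document.xml", exdir = tmp)
doc <- read_xml(file.path(tmp, "word", "document.xml"))
# paragraphs are w:p nodes; xml_text() strips the markup
paras <- xml_text(xml_find_all(doc, "//w:p"))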
Pandoc is a fantastic solution for tasks like this. With a document named a.docx you would run at the command line
pandoc -f docx -t markdown -o a.md a.docx
You could then use regex tools in R to extract what you needed from the newly-created a.md, which is text. By default, images are not converted.
Pandoc is part of RStudio, by the way, so you may already have it.
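If you prefer to drive the conversion from within R, a hedged sketch (it assumes pandoc is on your PATH and that the article uses headings named "Introduction" and "Methods"):
system2("pandoc", c("-f", "docx", "-t", "markdown", "-o", "a.md", "a.docx"))
md <- readLines("a.md")
# e.g. pull the lines between the Introduction and Methods headings
intro <- md[seq(grep("Introduction", md)[1] + 1, grep("Methods", md)[1] - 1)]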
You can do it with package officer:
library(officer)
example_docx <- system.file(package = "officer", "doc_examples/example.docx")
doc <- read_docx(example_docx)
summary_paragraphs <- docx_summary(doc)
summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", "text"]
#> [1] "Title 1"
#> [2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
#> [3] "Title 2"
#> [4] "Quisque tristique "
#> [5] "Augue nisi, et convallis "
#> [6] "Sapien mollis nec. "
#> [7] "Sub title 1"
#> [8] "Quisque tristique "
#> [9] "Augue nisi, et convallis "
#> [10] "Sapien mollis nec. "
#> [11] ""
#> [12] "Phasellus nec nunc vitae nulla interdum volutpat eu ac massa. "
#> [13] "Sub title 2"
#> [14] "Morbi rhoncus sapien sit amet leo eleifend, vel fermentum nisi mattis. "
#> [15] ""
#> [16] ""
#> [17] ""
How can you take the string and replace every instance of ".", ",", " " (i.e. dot, comma or space) with one random character selected from c('|', ':', '#', '*')?
Say I have a string like this
Aenean ut odio dignissim augue rutrum faucibus. Fusce posuere, tellus eget viverra mattis, erat tellus porta mi, at facilisis sem nibh non urna. Phasellus quis turpis quis mauris suscipit vulputate. Sed interdum lacus non velit. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae;
To get one random character, we can treat the characters as a vector and then use the sample function to select one. I assume I first need to search for the dots, commas and spaces, and then use the gsub function to replace them all?
Given your clarification, try this one:
x <- c("this, is nice.", "nice, this is.")
gr <- gregexpr("[., ]", x)
regmatches(x,gr) <- lapply(lengths(gr), sample, x=c('|',':','#','*'))
x
#[1] "this|*is#nice:" "nice#|this*is:"
Here is another option with chartr
pat <- paste(sample(c('|', ';', '#', '*'), 3), collapse="")
chartr('., ', pat, x)
#[1] "this|*is*nice;" "nice|*this*is;"
data
x <- c("this, is nice.", "nice, this is.")
I have a character vector of sentences, about 10,000+ lines, and I want to delete 1,000+ words from it, held in another character vector. I am using the following approach:
c <- gsub(pattern = wordsToBeDeleted, replacement = "", x = mainList)
This uses only the first word of wordsToBeDeleted. How can I get this done for all of them?
gsub only takes one pattern at a time, but you could combine it with Reduce.
# sample data
sentences <- c(
  "Morbi in tempus metus, quis commodo eros",
  "Cum sociis natoque penatibus et magnis dis parturient montes",
  "Nulla diam quam, imperdiet vitae blandit eu",
  "Nullam nec pellentesque sapien, ac mollis mauris")
words <- c("quis", "eros", "diam", "nec")
Now we loop over all the words, removing them from the sentences:
Reduce(function(a, b) gsub(b, "", a, fixed = TRUE), words, sentences)
which gives us
[1] "Morbi in tempus metus, commodo "
[2] "Cum sociis natoque penatibus et magnis dis parturient montes"
[3] "Nulla quam, imperdiet vitae blandit eu"
[4] "Nullam pellentesque sapien, ac mollis mauris"
How about trying this recipe:
sentences = tolower(c("I don't like you.", "But I do like this."))
dropWords = tolower(c("I", "like"))
splitSentences = strsplit(sentences, " ")
purged = lapply(X=splitSentences, FUN=setdiff, y=dropWords)
purged
[[1]]
[1] "don't" "you."
[[2]]
[1] "but" "do" "this."
I also recommend using tolower there since it will take care of case differences.
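If you need sentences back rather than word vectors afterwards, a small follow-up sketch:
vapply(purged, paste, collapse = " ", FUN.VALUE = character(1))
# [1] "don't you."   "but do this."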
I want to text-mine a set of files based on the form below. I can create a corpus where each file is a document (using tm), but I'm thinking it might be better to create a corpus where each section of the form's second table were a document with the following metadata:
Author : John Smith
DateTimeStamp: 2013-04-18 16:53:31
Description :
Heading : Current Focus
ID : Smith-John_e.doc Current Focus
Language : en_CA
Origin : Smith-John_e.doc
Name : John Smith
Title : Manager
TeamMembers : Joe Blow, John Doe
GroupLeader : She who must be obeyed
where Name, Title, TeamMembers and GroupLeader are extracted from the first table on the form. In this way, each chunk of text to be analyzed would maintain some of its context.
What is the best way to approach this? I can think of 2 ways:
somehow parse the corpus I have into child corpora.
somehow parse the document into subdocuments and make a corpus from those.
Any pointers would be much appreciated.
This is the form:
Here is an RData file of a corpus with 2 documents. exc[[1]] came from a .doc and exc[[2]] came from a docx. They both used the form above.
Here's a quick sketch of a method; hopefully it might provoke someone more talented to stop by and suggest something more efficient and robust. Using the RData file in your question, I found that the doc and docx files have slightly different structures and so require slightly different approaches (though I see in the metadata that your docx is 'fake2.txt', so is it really a docx? I see in your other question that you used a converter outside of R, which must be why it's txt).
library(tm)
First get custom metadata for the doc file. I'm no regex expert, as you can see, but working from the inside out it's roughly 'get rid of punctuation', then 'get rid of the label word (e.g. "Name")', then 'get rid of trailing and leading spaces'...
# create User-defined local meta data pairs
meta(exc[[1]], type = "corpus", tag = "Name1") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[1]][3])))
meta(exc[[1]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[1]][4])))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[1]][5])))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[1]][7])))
Now have a look at the result
# inspect
meta(exc[[1]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 13:59:28
Description :
Heading :
ID : fake1.doc
Language : en_CA
Origin :
User-defined local meta data pairs are:
$Name1
[1] "John Doe"
$Title
[1] "Manager"
$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"
$ManagerName
[1] "Selma Furtgenstein"
Do the same for the docx file
# create User-defined local meta data pairs
meta(exc[[2]], type = "corpus", tag = "Name2") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[2]][2])))
meta(exc[[2]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[2]][4])))
meta(exc[[2]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[2]][6])))
meta(exc[[2]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[2]][8])))
And have a look
# inspect
meta(exc[[2]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 14:06:10
Description :
Heading :
ID : fake2.txt
Language : en
Origin :
User-defined local meta data pairs are:
$Name2
[1] "Joe Blow"
$Title
[1] "Shift Lead"
$TeamMembers
[1] "Melanie Baumgartner Toby Morrison"
$ManagerName
[1] "Selma Furtgenstein"
If you have a large number of documents then a lapply function that includes these meta functions would be the way to go.
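For example, a hedged sketch of that lapply idea, reusing the same meta() calls as above (the field positions are assumptions and, as noted, differ between the doc and docx versions):
exc <- lapply(exc, function(d) {
  meta(d, type = "corpus", tag = "Name") <-
    gsub("^\\s+|\\s+$", "", gsub("Name", "", gsub("[[:punct:]]", "", d[3])))
  # ...repeat for Title, TeamMembers, ManagerName...
  d
})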
Now that we've got the custom metadata, we can subset the documents to exclude that part of the text:
# create new corpus that excludes part of doc that is now in metadata. We just use square bracket indexing to subset the lines that are the second table of the forms (slightly different for each doc type)
excBody <- Corpus(VectorSource(c(paste(exc[[1]][13:length(exc[[1]])], collapse = ","),
paste(exc[[2]][9:length(exc[[2]])], collapse = ","))))
# get rid of all the white spaces
excBody <- tm_map(excBody, stripWhitespace)
Have a look:
inspect(excBody)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
|CURRENT RESEARCH FOCUS |,| |,|Lorem ipsum dolor sit amet, consectetur adipiscing elit. |,|Donec at ipsum est, vel ullamcorper enim. |,|In vel dui massa, eget egestas libero. |,|Phasellus facilisis cursus nisi, gravida convallis velit ornare a. |,|MAIN AREAS OF EXPERTISE |,|Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel. |,|In sit amet ante non turpis elementum porttitor. |,|TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED |,| Vestibulum sed turpis id nulla eleifend fermentum. |,|Nunc sit amet elit eu neque tincidunt aliquet eu at risus. |,|Cras tempor ipsum justo, ut blandit lacus. |,|INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS) |,| Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio. |,|Etiam a justo vel sapien rhoncus interdum. |,|ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT |,|(Please include anticipated percentages of your time.) |,| Proin vitae ligula quis enim vulputate sagittis vitae ut ante. |,|ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES |,|e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign |,|Canvasser (GCWCC), OSH representative, Social Committee |,|Sed nec tellus nec massa accumsan faucibus non imperdiet nibh. |,,
[[2]]
CURRENT RESEARCH FOCUS,,* Lorem ipsum dolor sit amet, consectetur adipiscing elit.,* Donec at ipsum est, vel ullamcorper enim.,* In vel dui massa, eget egestas libero.,* Phasellus facilisis cursus nisi, gravida convallis velit ornare a.,MAIN AREAS OF EXPERTISE,* Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel.,* In sit amet ante non turpis elementum porttitor. ,TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED,* Vestibulum sed turpis id nulla eleifend fermentum.,* Nunc sit amet elit eu neque tincidunt aliquet eu at risus.,* Cras tempor ipsum justo, ut blandit lacus.,INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS),* Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio.,* Etiam a justo vel sapien rhoncus interdum.,ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT ,(Please include anticipated percentages of your time.),* Proin vitae ligula quis enim vulputate sagittis vitae ut ante.,ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES,e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign Canvasser (GCWCC), OSH representative, Social Committee,* Sed nec tellus nec massa accumsan faucibus non imperdiet nibh.,,
Now the documents are ready for text mining, with the data from the upper table moved out of the document and into the document metadata.
Of course all of this depends on the documents being highly regular. If there are different numbers of lines in the first table in each doc, then the simple indexing method might fail (give it a try and see what happens) and something more robust will be needed.
UPDATE: A more robust method
Having read the question a little more carefully, and gotten a bit more education about regex, here's a method that is more robust and doesn't depend on indexing specific lines of the documents. Instead, we use regular expressions to extract the text between two known words to build the metadata and split the document.
Here's how we make the user-defined local metadata (a method to replace the one above):
library(gdata) # for the trim function
txt <- paste0(as.character(exc[[1]]), collapse = ",")
# inspect the document to identify the words on either side of the string
# we want, so 'Name' and 'Title' are on either side of 'John Doe'
extract <- regmatches(txt, gregexpr("(?<=Name).*?(?=Title)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Name1") <- trim(gsub("[[:punct:]]", "", extract))
extract <- regmatches(txt, gregexpr("(?<=Title).*?(?=Team)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Title") <- trim(gsub("[[:punct:]]","", extract))
extract <- regmatches(txt, gregexpr("(?<=Members).*?(?=Supervised)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- trim(gsub("[[:punct:]]","", extract))
extract <- regmatches(txt, gregexpr("(?<=your).*?(?=Supervisor)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- trim(gsub("[[:punct:]]","", extract))
# inspect
meta(exc[[1]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 13:59:28
Description :
Heading :
ID : fake1.doc
Language : en_CA
Origin :
User-defined local meta data pairs are:
$Name1
[1] "John Doe"
$Title
[1] "Manager"
$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"
$ManagerName
[1] "Selma Furtgenstein"
Similarly, we can extract the sections of your second table into separate vectors, and then you can make them into documents and corpora, or just work on them as vectors.
txt <- paste0(as.character(exc[[1]]), collapse = ",")
CURRENT_RESEARCH_FOCUS <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=CURRENT RESEARCH FOCUS).*?(?=MAIN AREAS OF EXPERTISE)", txt, perl=TRUE))))
[1] "Lorem ipsum dolor sit amet consectetur adipiscing elit Donec at ipsum est vel ullamcorper enim In vel dui massa eget egestas libero Phasellus facilisis cursus nisi gravida convallis velit ornare a"
MAIN_AREAS_OF_EXPERTISE <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=MAIN AREAS OF EXPERTISE).*?(?=TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED)", txt, perl=TRUE))))
[1] "Vestibulum aliquet faucibus tortor sed aliquet purus elementum vel In sit amet ante non turpis elementum porttitor"
And so on for the remaining sections. If you want to turn those vectors into documents and a corpus, see the sketch below. I hope that's a bit closer to what you're after. If not, it might be best to break down your task into a set of smaller, more focused questions, and ask them separately (or wait for one of the gurus to stop by this question!).
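A last hedged sketch of the 'make them into documents and corpora' step, reusing tm's Corpus and VectorSource on the extracted vectors:
sectionCorpus <- Corpus(VectorSource(c(CURRENT_RESEARCH_FOCUS, MAIN_AREAS_OF_EXPERTISE)))
inspect(sectionCorpus)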