How to extract multiple quotes from multiple documents in R?

I have several Word files containing articles from which I want to extract the strings between quotes. My code works fine if there is one quote per article, but when there are more, R also extracts the text that separates one quote from the next.
Here is the text from my articles:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, “I adore tigers”. This is the end.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
And this is my code:
library(readtext)
library(stringr)
#' folder where you've saved your articles
path <- "articles"
#' reads in anything saved as .docx
mydata <- readtext(paste0(path, "\\*.docx")) #' make sure the Word document is saved as .docx
#' remove curly punctuation
mydata$text <- gsub("/’", "/'", mydata$text, ignore.case = TRUE)
mydata$text <- gsub("[“”]", "\"", gsub("[‘’]", "'", mydata$text))
#' extract the quotes
stringi::stri_extract_all_regex(str = mydata$text, pattern = '(?<=").*?(?=")')
The output is:
[[1]]
[1] "We got him and he is healthy,"
[2] " said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, "
[3] "I adore tigers"
[[2]]
[1] "The target catalysed much greater conservation action, which was desperately needed,"
You can see that the second element of the first output is incorrect. I don't want to include
" said Houston Police Department (HPD) Major Offenders Commander Ron
Borza. He went on to say, "

Well, technically the second element of the first output is within quotes so the code is working correctly as per the pattern used. A quick fix would be to remove every 2nd entry from the list.
sapply(
  stringi::stri_extract_all_regex(str = mydata$text, pattern = '(?<=").*?(?=")'),
  `[`, c(TRUE, FALSE)
)
#[[1]]
#[1] "We got him and he is healthy," "I adore tigers"
#[[2]]
#[1] "The target catalysed much greater conservation action, which was desperately needed,"

We can do the same in base R:
sapply(regmatches(mydata$text, gregexpr('(?<=")[^"]+', mydata$text, perl = TRUE)),
       function(x) x[c(TRUE, FALSE)])
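Another way is to match the quoted substrings themselves, quotes included, and strip the quote characters afterwards; this never touches the text between quotations, so no odd/even filtering is needed. A small sketch using the mydata object from above:
library(stringr)
# match complete "..." runs, then drop the surrounding quote characters
quotes <- str_extract_all(mydata$text, '"[^"]*"')
lapply(quotes, function(x) str_sub(x, 2, -2))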

Related

How to remove the first words of specific rows that appear in another column?

Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?
I am working with a data frame similar to this:
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data
In some cases, the first few words of the "keyword" string are contained in the "content" string.
In others, the "keyword" string remains empty and only "content" is filled.
What I want to achieve here is to remove the first appearance of the word combination in "keyword" that appears in the same row in "content".
Unfortunately, I am only able to create code that deletes all the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than one time in a cell.
I'd like to only delete the first appearance and keep all that come after in the same cell.
My next-best solution was to use the following code:
data$content <- mapply(function(x,y)gsub(x,"",y) ,gsub(" ", "|",data$keyword),data$content)
This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).
Another option that I tried was to design a function for this:
I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:
numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)
Second, I tried to formulate a function to remove the first n words of content[i], with n = numw[i]:
shorten <- function(v, z){
  v <- gsub(".*^\\w+", z, v)
}
shorten(data$content, data$numw)
Unfortunately, I am not able to make the function work and the following error message will be generated:
Error in gsub(".*^\w+", z, v) : invalid 'replacement' argument
So, I'd be incredibly grateful if someone could help me formulate a function that deals with this issue more appropriately.
Here is a solution based on str_remove. Because str_remove warns when the pattern is '', the first mutate replaces empty keywords with NA. Then, if keyword is not NA, it is stripped from content; otherwise content is kept as is.
library(tidyverse)
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword) ~ content))
#>         keyword                                          content
#> 1     Mr. Jones                               is drinking coffee
#> 2  My uncle Sam  is sitting in the kitchen with my uncle Richard
#> 3           Tom               is playing with Tom's family's dog
#> 4          <NA>          Cassandra is jogging for her first time
#> 5 The librarian                              is jogging with her
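One caveat: str_remove treats keyword as a regular expression, so the "." in "Mr. Jones" happens to match any character. If keywords can contain regex metacharacters, wrapping them in fixed() makes the match literal; a variant of the same pipeline (my addition, not from the original answer):
library(tidyverse)
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    # fixed() disables regex interpretation of the keyword
    !is.na(keyword) ~ str_remove(content, fixed(keyword)),
    is.na(keyword) ~ content))
Either way, str_trim() or str_squish() can then tidy the leading space that the removal leaves behind.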

Matching word list to individual words in strings using R

Edit: Fixed data example issue
Background/Data: I'm working on a merge between two datasets: one is a list of the legal names of various publicly traded companies and the second is a fairly dirty field with company names, individual titles, and all sorts of other difficult to predict words. The company name list is about 14,000 rows and the dirty data is about 1.3M rows. Not every publicly traded company will appear in the dirty data and some may appear multiple times with different presentations (Exxon Mobil, Exxon, ExxonMobil, etc.).
Accordingly, my current approach is to dismantle the publicly traded company name list into the individual words used in each title (after cleaning out some common words like company, corporation, inc, etc.), resulting in the data shown below as Have1. An example of some of the dirty data is shown below as Have2. I have also cleaned these strings to eliminate words like Inc and Company in my ongoing work, but in case anyone has a better idea than my current approach, I'm leaving the data as-is. Additionally, we can assume there are very few, if any, exact matches in the data and that the Have2 data is too noisy to successfully use a fuzzy match without additional work.
Question: What is the best way to go about determining which of the items in Have2 contain the words from Have1? Specifically, I think I need the final data to look like Want, so that I can then link the public company name to the dirty data name. The plan is to hand-verify the matches given the difficulty of the Have2 data, but if anyone has any suggestions on another way to go about this, I am definitely open to suggestions (please, someone, have a suggestion haha).
Tried so far: I have code that sort of works, but takes ages to run and seems inefficient. That is:
library(data.table)
library(stringr)
company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")
have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]
have2 <- c("ceo and director, apple inc",
"current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
"xerox inc., president and ceo",
"president and ceo of the amazon apple assn., division 4")
#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))
#Creates container
store <- data.table()
#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix
for (i in 1:nrow(have1)){
matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b",have1$V1[i],"\\b"), have3[x,])))])
if (nrow(matches) == 0){
next
}
#Create combo data
matches[, have1_word := have1$V1[i]]
#Storage
store <- rbind(store, matches)
}
Want
Name (from Have2)                                                                                                              Word (from Have1)
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy  amazon
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy  radiation
vp and general bird aficionado of the amazon apple assn. branch F                                                             amazon
vp and general bird aficionado of the amazon apple assn. branch F                                                             apple
ceo and director, apple inc                                                                                                   apple
xerox inc., president and ceo                                                                                                 xerox
Have1
Word             N
amazon           1
apple            3
xerox            1
notgoingtomatch  2
radiation        1
Have2
Name
ceo and director, apple inc
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
xerox inc., president and ceo
vp and general bird aficionado of the amazon apple assn. branch F
Using what you have documented, in terms of data from company_name_data and have2 only:
library(tidytext)
library(tidyverse)
#------------ remove stop words before tokenization ---------------
# split each phrase into individual words, remove the stop words,
# then reassemble the phrases (lapply works through one element at a time)
comp2 <- unlist(lapply(company_name_data, function(x) {
  words <- unlist(strsplit(x, " "))
  paste(words[!(words %in% stop_words$word)], collapse = " ")
}))
haveItAll <- data.frame(have2)
# keep only the words of each dirty string that also appear in comp2
haveItAll$comp <- unlist(lapply(have2, function(x) {
  words <- unlist(strsplit(x, " "))
  paste(words[words %in% comp2], collapse = " ")
}))
The results in the second column, based on the text analysis, are "apple", "radiation", "xerox", and "amazon apple".
I'm certain this code isn't mine originally. I'm sure I got these ideas from somewhere on StackOverflow...
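As a rough sketch of getting straight to the Want-style long table (my own addition, using only have1 and have2 from the question): cross-join every Have1 word with every Have2 string and keep the whole-word matches. This is fine at demo scale; at 14,000 words against 1.3M strings you would want to tokenize Have2 and join on words instead.
library(data.table)
library(stringr)
# pair every candidate word with every dirty string, then keep the rows
# where the word occurs as a whole word; str_detect() is vectorized over
# both the string and the pattern
want <- CJ(Name = have2, Word = have1$V1)[str_detect(Name, paste0("\\b", Word, "\\b"))]
want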

Map through a nested list using regex to remove entries of a character vector

I have a nested list (https://www.filehosting.org/file/details/841630/bribe.RData) which I want to transform into a tibble. The list contains some character vectors which differ in length from the other vectors. This is due to a web-scraping problem that could not be fixed in an earlier question.
Since all extra character strings in these vectors have a specific pattern, I wrote a regex to find and remove them. These extra strings only appear in the 5th character vector of each sublist. How can I map over all those sublists and vectors to get rid of the extra strings at once?
The mapping version should apply the following to the 5th character vector of every sublist:
bribe.test <- stringr::str_remove(bribe.info[[1]][[5]],
                                  "(\\\r\\\n\\s*){3}.+(\\\r\\\n\\s*){2}")
How can I do this? Is transpose() in combination with simplify_all() an option? If so, how exactly do I use it in this case?
The final tibble should be structured so that character vector x of each sublist becomes one column of the data frame.
UPDATE:
To make things a little more accessible, I will include the output of the list:
bribe.info[[1]][[5]]
[1] "\r\n Hello sir, My uncle just coming india yesterday >night at ahmedabad airport from New Zealand. And i gave him 2 iphone , iphone 8 >plus and iphone 11 pro...Read more\r\n "
[2] "\r\n Date of the incident: 29th December 2019\nTime of >incident: Around 8 PM in the evening\nPlace of incident: ECR road, Pondicherry >to Tamil Nadu check pos...Read more\r\n "
[3] "\r\n Dear Sir,\n\nThis is not the first time I am >facing this issue with Rohit Gas Agency. I tried to bring it to the notice of >Indane. Its of no use. Rohit ...Read more\r\n "
[4] "\r\n \r\n \r\n >How to get a LPG gas connection\r\n \r\n "
[5] "\r\n I paid bribe today to a police officer who came >for passport verification of my mother. Even after providing all supporting >documents and required inf...Read more\r\n "
[6] "\r\n I have asked to pay bribe to avoid huge penalty >for putting tent sheet on car windows. Police asked me to pay 1100 rs fine or >pay bribe instead of tha...Read more\r\n "
[7] "\r\n Help desk officer prashant who are trapping people >to make work done by giving bribes to higher officials at malakpet rto malakpet >Hyderabad ...Read more\r\n "
[8] "\r\n Get free shipping when you buy the Revolution the >great american electric cigarette machine, within the continental US from >https://hardworkingproduct...Read more\r\n "
[9] "\r\n I Would like to Inform you that a lot of >corruption is going on in the DC Office Bangalore Urban Dept. I am not paid >bribe directly there is lot more...Read more\r\n "
[10] "\r\n Are you interested in selling one of your k1dney >for a good amount of 14Crore 7 cr Advance kindly Contact us now 9663960578 >.\n...Read more\r\n "
[11] "\r\n Are you interested in selling one of your k1dney >for a good amount of 14Crore 7 cr Advance kindly Contact us now, as we are >looking for k1dney donor, ...Read more\r\n
The regular expression filters a single vector fine. But I have no idea how I can do this for bribe.info[[2]][[5]], bribe.info[[3]][[5]], and so on with map.
I found a way to do this with base R. So basically I hope anyone can provide a simple solution with purrr.
### remove text of embedded links, then transform the list into a tibble
for (i in 1:130) {
  for (j in 1:5) {
    bribe.info[[i]][[j]] <- subset(bribe.info[[i]][[j]],
                                   !grepl("(\\\r\\\n\\s*){3}.+(\\\r\\\n\\s*){2}", bribe.info[[i]][[j]]))
  }
}
bribe.tibble <- bribe.info %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows()
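Since the question explicitly asks for purrr, here is an untested sketch of the same cleanup written with map (assuming bribe.info has the structure described above):
library(purrr)
library(dplyr)
library(tibble)
pat <- "(\\\r\\\n\\s*){3}.+(\\\r\\\n\\s*){2}"
# for every sublist, drop the entries matching the link pattern from each
# of its character vectors, then stack all sublists into one tibble
bribe.tibble <- bribe.info %>%
  map(~ map(.x, ~ .x[!grepl(pat, .x)])) %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows()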

R programming: need a string-based solution for splitting huge text

I have a text file with a sample text like below all in small case:
"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a
once powerful oil minister and former head of state oil company pdvsa, in
connection with an alleged $4.8 billion vienna-based corruption scheme, the
state prosecutor's office announced on friday.
5.5 hours ago
— reuters
amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its
online marketplace when they mistakenly search for "brikenstock",
"birkenstok", "bierkenstock" and other variations in google.
6 hours ago
— business standard"
What I need in R is to get these two pieces of text, separated out.
The first piece of text would correspond with the text1 variable and the second piece of text should correspond with the text2 variable.
Please remember I have many text-like paragraphs in this file. The solution would have to work for, say, 100,000 texts.
The only thing I thought could be used as a delimiter is "—", but with that I lose the source of the information, such as "reuters" or "business standard". I need that as well.
Would you know how to accomplish this in R?
Read the text from the file with readLines and then split on the shifted cumsum of the occurrence of the special dash in front of the publisher:
Lines <- readLines("Lines.txt")  # from file in wd()
split(Lines, cumsum(c(0, head(grepl("—", Lines), -1))))
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "
[3] "once powerful oil minister and former head of state oil company pdvsa, in "
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."
[6] "5.5 hours ago"
[7] "— reuters"
$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."
[5] "6 hours ago"
[6] "— business standard'"
It's not a regular "-", it's a "—". And notice that by default readLines will omit the blank lines.
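If the goal is one row per article with the source kept, each chunk can be reduced further; an untested sketch built on the split above (assuming, as in the output shown, that a chunk's last line is the publisher and its second-to-last line is the time):
chunks <- split(Lines, cumsum(c(0, head(grepl("—", Lines), -1))))
# collapse the article body, and peel off the time stamp and the publisher
news <- do.call(rbind, lapply(chunks, function(ch) {
  n <- length(ch)
  data.frame(text = paste(ch[1:(n - 2)], collapse = " "),
             time = ch[n - 1],
             source = sub("^—\\s*", "", ch[n]))
}))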
Here's what I could do. I do not like the loop in this, but I could not vectorize it. Hopefully this answer will at least serve as a starting point for other better answers.
Assumptions: All publisher names are preceded by "— "
TEXT <- read.delim2("C:/Users/Arani.das/Desktop/TEXT.txt", header = FALSE, quote = "", stringsAsFactors = FALSE)
TEXT$Publisher <- grepl("— ", TEXT$V1)
TEXT$V1 <- gsub("^\\s+|\\s+$", "", TEXT$V1) # trim whitespace at start and end of line
TEXT$FLAG <- 1 # grouping variable
# group the entries: start a new group on the line after each publisher line
for (i in 2:nrow(TEXT)) {
  if (TEXT$Publisher[i - 1]) {
    TEXT$FLAG[i] <- TEXT$FLAG[i - 1] + 1
  } else {
    TEXT$FLAG[i] <- TEXT$FLAG[i - 1]
  }
}
TEXT <- data.table::data.table(TEXT, key = "FLAG")
TEXT2 <- TEXT[, list(News = paste0(V1[1:(length(V1) - 2)], collapse = " "),
                     Time = V1[length(V1) - 1],
                     Publisher = V1[length(V1)]), by = "FLAG"]
Output:
FLAG  News          Time           Publisher
1     venezuela...  5.5 hours ago  — reuters
2     amazon...     6 hours ago    — business standard
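The loop can also be vectorized with cumsum, in the spirit of the first answer; a sketch (my own variant, not from the original post):
# each line that follows a publisher line starts a new group, so the group
# id is a running count of publisher lines, shifted down by one row
TEXT$FLAG <- cumsum(c(TRUE, head(TEXT$Publisher, -1)))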

How do I parse the contents of hundreds of csvs that are in a list of dataframes and split on ":" and ";" in loops?

I'm working with a large number (1,983) of CSV files. Posts on stackoverflow have said that lists are easier to work with so I've approached my task that way. I have read the CSVs in and gotten the first part of my task accomplished: what is the maximum number of concurrent users of the application? (A:203) Here's that code:
# get a list of the files
files <- list.files("my_path_here",pattern="*.CSV$", recursive = TRUE, full.names=TRUE)
#read in the csv's and store them as a list of dataframes
tables <- lapply(files, read.csv)
#store the counts of the number of users here
counts<-rep(NA,length(tables))
#loop thru the files to find the count and store that value
for (i in 1:length(files)) {
  counts[i] <- length(tables[[i]][[2]])
}
#what's the largest number?
max(counts)
#203
The 2nd part of the task is to show the count of each title for each file. The contents of each file will be something like this:
compute_0001 compute_0002
[1] 3/26/2015 6:00:00 Business System Manager;Lead CoPath Analyst
[2] Regional Histotechnologist;Hist Tech - Ht
[3] Regional Histotechnologist;Tissue Tech
[4] SDX Histotechnologist;Histology Tech
[5] SDX Histotechnologist;Histology Tech
[6] Regional Histotechnologist;Lab Asst II Histology
[7] CytoPrep Tech;Histo Tech - Ht
[8] Regional Histotechnologist;Tissue Tech
[9] Histology Supervisor;Supv Reg Lab Unit
[10] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT
What will differ from file to file is the time stamp in compute_0001, name of the file and the number of users (ie length of the file).
My approach was to try this:
col2 <- sapply(tables, summary, maxsum=300) # gives me a list of 1983 elements that is 23.6Mb
(I noticed that when doing a summary() on the files I would get something like this - which is why I was trying it)
col2[[1]]
compute_0001 compute_0002
[1] Business System Manager;Lead CoPath Analyst :1
[2] Regional Histotechnologist;Hist Tech - Ht :1
[3] Regional Histotechnologist;Tissue Tech :1
[4] SDX Histotechnologist;Histology Tech :1
[5] SDX Histotechnologist;Histology Tech :1
[6] Regional Histotechnologist;Lab Asst II Histology :2
[7] CytoPrep Tech;Histo Tech - Ht :4
[8] Regional Histotechnologist;Tissue Tech :1
[9] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT :1
The above is actually many different people. For my purposes, [2], [3], [6] and [8] are the same title (even though the stuff after the ";" is different). The truth is that even [4] and [5] could also be considered the same as [2], [3], [6], and [8].
That ":1" (or generally ":#") is the number of users with that title at that particular time. I was hoping to grab that character, make it numeric and add them up to get a count of the users with each title for each file. Each file is an observation at a particular datetime.
I tried something like this:
for (s in 1:length(col2)) {
  split <- strsplit(col2[[s]][, 2], ":")
  # ... make it numeric so I can do addition with it
  num <- as.numeric(split[[s]][2])
  # ... and put it in the correct df
  tables[[s]]$count <- num
  # After dealing with the ":" I was going to handle splitting on the first ";"
}
But I couldn't get the loop to iterate more than a single time or past the first element of col2.
A more experienced useR suggested something like this:
strsplit(x = as.character(compute2[[s]]), split = ";", fixed = TRUE)
He said "However this results in a messy list also, since there are multiple ";" in some lines. What I would #suggest is to use grep() with a regex that returns the text before the first ";"- use that with sapply(compute2,grep()) and then you can run sapply(??,table) on the list that is returned to tally the job titles."
I'd prefer not to get into regex but, following his advice, I tried:
for (s in 1:length(tables)) {
  split <- strsplit(x = as.character(compute2[[s]]), split = ";", fixed = TRUE)
}
split is a list of only 122, not nearly long enough, so it's not iterating through the loop either. So, I figured I'd skip the loop and try:
title_split <- sapply(compute2, strsplit, x = as.character(compute2[[1]]), split = ";", fixed = TRUE)
But that gave me more than 50 warnings and a matrix that had 105,000+ elements that was 20.2Mb in size.
Like I said, I'd prefer to not venture into the world of regex, since I think I should be able to split on the ":" first and then the first of the ";" and return the string that precedes the ";". I'm just not sure why the loop is failing.
What I eventually want is a table that shows the count of each title (collapsed for duplicates like [2],[3], [6] and [8] above) for each file (which represents an observation at a particular datetime). I'm pretty agnostic as to approach, so if I have to do it via regex, then so be it.
Sorry for the lengthy post but I suspect that part of my problem (besides being brand new to stackoverflow, R and not understanding regex well) is that I'm not well versed in list manipulation and I wanted you to have the context.
Many thanks for reading.
Your data isn't easily reproducible, so I've created a simple list of fake data that I hope captures the essence of your data.
Make a list of fake data frames:
string1 = "53 Regional histotechnologist;text2 - more text"
string2 = "54 Regional histotechnologist;text2 - more text"
string3 = "CytoPrep Tech;text2 - more text"
tables = list(df1 = data.frame(compute = c(string1, string2, string3)),
              df2 = data.frame(compute = c(string1, string2, string3)))
Count the number of rows in each data frame:
counts = sapply(tables, nrow)
Add a column that extracts job titles from the compute column. The regex pattern skips zero or more digit characters ([0-9]*) followed by zero or one space character ( ?), then captures everything up to, but not including, the first semicolon (([^;]*);), and then skips every character after the semicolon (.*).
tables = sapply(names(tables), function(df) {
  cbind(tables[[df]], title = gsub("[0-9]* ?([^;]*);.*", "\\1", tables[[df]][, "compute"]))
}, simplify = FALSE)
tables
$df1
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
$df2
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
Make a table of counts of each title for each data frame in tables:
title.table.list = lapply(tables, function(df) table(df$title))
title.table.list
$df1
CytoPrep Tech Regional histotechnologist
1 2
$df2
CytoPrep Tech Regional histotechnologist
1 2
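To carry this toward the final goal of one title/count table per file, the per-file tallies can also be stacked into a single long data frame; a small sketch, assuming the title.table.list object from above:
# convert each per-file table of counts into a data frame and stack them,
# keeping the file name alongside each title/count pair
title.counts = do.call(rbind, lapply(names(title.table.list), function(nm) {
  data.frame(file = nm,
             as.data.frame(title.table.list[[nm]], responseName = "count"))
}))
title.counts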
