Extract larger body of character data with stringr? - r

I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized throughout all the individual documents. See below for a reproducible example.
Is there a way I can use those two unique words ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I would like to extract? If not, what sort of approach can I take to extract the narrative data I need from each report?
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is what the expected output would be.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks so much for helping!

One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything after INVESTIGATOR: ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
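If you would rather extract the narrative than delete what surrounds it, the same two markers can serve as lookarounds. The following is a minimal base R sketch, not part of the answer above: the helper name extract_between and the shortened sample string are made up for illustration, and it assumes each document contains at most one CASE HISTORY ... INVESTIGATOR: pair.

```r
# Hypothetical helper: return the text between two literal markers.
# Assumes the markers contain no regex metacharacters and that each
# element of `x` holds at most one start/end pair.
extract_between <- function(x, start = "CASE HISTORY", end = "INVESTIGATOR:") {
  x <- as.character(x)
  # (?s) lets "." match newlines; the lookarounds keep the markers out of the match.
  pattern <- paste0("(?s)(?<=", start, ").*?(?=", end, ")")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # Fill in only the elements that matched; the rest stay NA.
  out[m != -1] <- trimws(regmatches(x, m))
  out
}

# Shortened, made-up stand-in for one scraped document.
sample_doc <- "CASE#: 012-345-678\nCASE HISTORY\n First part of the narrative\n Case#: 012-345-678\nsecond part.\nINVESTIGATOR: HERCULE POIROT \n"
extract_between(sample_doc)
```

Because the match is lazy (.*?), it stops at the first INVESTIGATOR: after CASE HISTORY. The page header lines that fall inside the narrative are still included, just as in the gsub version.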

After much deliberation I came to a solution I feel is worth sharing, so here we go:
library(stringr) # str_squish(); also provides %>%
library(purrr)   # map2()
# Unlist text_data and collapse it into a single string.
file_contents_unlist <- paste(unlist(text_data), collapse = " ")
# Read lines, squish for good measure.
file_contents_lines <- file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()
# Create indices into the lines of our text data with grepl();
# be sure the start and end indices pair up if scraping multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
# The function basically states, "give me back whatever's between those indices".
pull_case_num <- function(index_case_num_1, index_case_num_2) {
  file_contents_lines[index_case_num_1:index_case_num_2]
}
# map2() to iterate over the index pairs.
case_nums <- map2(index_case_num_1, index_case_num_2, pull_case_num)
# Transform to data frame.
case_nums_df <- as.data.frame.character(case_nums)
# Repeat the pattern for other vectors as needed.
index_case_hist_1 <- which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <- which(grepl("Case#: ", file_contents_lines))
pull_case_hist <- function(index_case_hist_1, index_case_hist_2) {
  file_contents_lines[index_case_hist_1:index_case_hist_2]
}
case_hist <- map2(index_case_hist_1, index_case_hist_2, pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)
# cbind() the vectors; also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
Thanks all for responding. I hope this solution helps someone out there in the future. :)

Related

How to filter R dataset by multiple partial match strings, similar to SQL % wildcard? [duplicate]

I have a dataset with a field of interest and a list of strings (several hundred of them).
What I want to do is, for each row of the data, check whether the field contains any of the partial strings.
Essentially, I want to replicate the SQL % wildcard. So, if for example a value is "Game123" and one of my strings is "Ga" I want that to be a match. (But I don't want "OGame" to match "Ga").
I'm hoping to write some statement like this:
df%>%
filter(My_Field contains any one of List_Of_Strings)
How do I fill in that filter statement?
I tried to use the %in% operator but couldn't make it work. I know how to use substrings to check against a single string, but I have a long list of them and need to check all of them.
R filter rows based on multiple partial strings applied to multiple columns: This post is similar to what I'm trying to do, but my list of substrings is 400 plus, so I can't write it all out manually in a grepl statement (I think?)
Since there is no particular dataset or reproducible example, I can think of a way to implement it with two apply functions and a smart use of regex. Remember that the regex anchor ^ makes the expression that follows match only at the beginning of the string.
library(dplyr)
MyField <- c("OGame","Game123","Duck","Dugame","Aldubame")
df <- data.frame(MyField)
ListOfStrings <- c("^Ga","^Du") #Notice the use of ^ here
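match_s <- function(patterns, entry) {
  # TRUE if the entry matches at least one of the patterns
  any(sapply(patterns, grepl, x = entry))
}
# sapply() returns a plain logical vector, which filter() can use directly
df$match_string <- sapply(df$MyField, match_s, patterns = ListOfStrings)
df %>% filter(match_string)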
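With several hundred literal prefixes, a regex-free variant may be easier to maintain. The sketch below is an alternative technique, not the answer's method: it uses base R's startsWith(), the helper name starts_with_any is hypothetical, and the prefixes are written without the ^ anchor because startsWith() compares literally.

```r
# Hypothetical helper: TRUE where a value starts with any of the prefixes.
# startsWith() compares literally, so the prefixes need no "^" and no escaping.
starts_with_any <- function(values, prefixes) {
  hits <- lapply(prefixes, function(p) startsWith(values, p))
  # Combine the per-prefix logical vectors with an element-wise OR.
  Reduce(`|`, hits, rep(FALSE, length(values)))
}

df <- data.frame(
  MyField = c("OGame", "Game123", "Duck", "Dugame", "Aldubame"),
  stringsAsFactors = FALSE
)
prefixes <- c("Ga", "Du")
df[starts_with_any(df$MyField, prefixes), , drop = FALSE]
```

This keeps "Game123", "Duck" and "Dugame" and drops "OGame" and "Aldubame". The same logical vector can also be passed to dplyr's filter().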
Another option uses dplyr together with grepl, prefixing each pattern with \\b so that it only matches at the beginning of a word. The example uses the words and sentences vectors that ship with stringr.
library(stringr)
library(dplyr)
set.seed(22)
tibble(sentences) %>%
  rowwise() %>%
  filter(any(sapply(words[sample(length(words), 10)], function(x)
    grepl(paste0("\\b", x), sentences)))) %>%
  ungroup()
# A tibble: 32 × 1
sentences
<chr>
1 It's easy to tell the depth of a well.
2 Kick the ball straight and follow through.
3 A king ruled the state in the early days.
4 March the soldiers past the next hill.
5 The dune rose from the edge of the water.
6 The grass curled around the fence post.
7 Cats and Dogs each hate the other.
8 The harder he tried the less he got done.
9 He knew the skill of the great young actress.
10 The club rented the rink for the fifth night.
# … with 22 more rows
I guess the problem you're facing is this:
You have a list of what could be called key words (what you call "a list of strings") and a vector/column with text (what you call "a field of interest") and your goal is to filter the vector/column on whether or not any of the key words is present. If that's correct the solution might be this:
Data:
a. List of key words:
keys <- c("how", "why", "what")
b. Dataframe with a vector/column of text:
df <- data.frame(
text = c("Hi there", "How are you?", "I'm fine.", "So how's work?", "Ah kinda stressful.", "Why?", "Well you know")
)
Solution:
To filter df on keys in text you need to convert keys into a regex alternation pattern (by collapsing the strings with |). Depending on your keys it may be useful or even necessary to also include word boundary markers (\\b), in case the keys should match only as whole words and not inside other words. And finally, if lower- or upper-case may be an issue, we can use the case-insensitive flag (?i):
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(text, str_c("(?i)\\b(", str_c(keys, collapse = "|"), ")\\b")))
text
1 How are you?
2 So how's work?
3 Why?

Remove Everything Except Specific Words From Text

I'm working with Twitter data using R. I have a large data frame where I need to remove everything from the text except specific information. Specifically, I want to remove everything except statistical information. So basically, I want to keep numbers as well as words such as "half", "quarter", "third". Is there also a way to keep symbols such as "£", "%", "$"?
I have been using "gsub" to try and do this:
df$text <- as.numeric(gsub(".*?([0-9]+).*", "\\1", df$text))
This code removes everything except numbers, but any information carried by words is lost. I'm struggling to figure out how to keep specific words within the text as well as the numbers.
Here's a mock data frame:
text <- c("here is some text with stuff inside that i dont need but also some that i do, here is a word half and quarter also 99 is too old for lego", "heres another one with numbers 132 1244 5950 303 2022 and one and a half", "plz help me with code i am struggling")
df <- data.frame(text)
I would like to be able to end up with a data frame outputting:
Also, I've included an N/A table in the picture because some of my observations will have neither a number nor the specific words. The goal of this code is really just to be able to say that these observations contain some form of statistical language and these other observations do not.
Any help would be massively appreciated and I'll do my best to answer any questions!
I am sure there is a more elegant solution, but I believe this will accomplish what you want!
df$newstrings <- unlist(lapply(regmatches(df$text, gregexpr("half|quarter|third|[[:digit:]]+", df$text)), function(x) paste(x, collapse = "")))
df$newstrings[df$newstrings == ""] <- NA
> df$newstrings
# [1] "halfquarter99" "132124459503032022half" NA
You can capture what you need to keep and then match and consume any character to replace with a backreference to the group value:
text <- c("here is some text with stuff inside that i dont need but also some that i do, here is a word half and quarter also 99 is too old for lego", "heres another one with numbers 132 1244 5950 303 2022 and one and a half", "plz help me with code i am struggling")
gsub("(half|quarter|third|\\d+)|.", "\\1", text)
See the regex demo. Details:
(half|quarter|third|\d+) - a half, quarter or third word, or one or more digits
| - or
. - any single char.
The \1 in the replacement pattern puts the captured value back into the resulting string.
Output:
[1] "halfquarter99" "132124459503032022half" ""
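Both answers glue the kept pieces together, and neither keeps the symbols the question asked about. As a hedged sketch only (the helper name keep_stats is made up, and the word list and symbol set are assumptions taken from the question), the matches can be joined with spaces, with "%", "$" and "£" added to the pattern:

```r
# Hypothetical helper: keep numbers, selected words and a few symbols,
# separated by single spaces; return NA when nothing is found.
keep_stats <- function(text, words = c("half", "quarter", "third")) {
  # \\b keeps "half" from matching inside longer words; \u00a3 is the pound sign.
  pattern <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b|[0-9]+|[%$\u00a3]")
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  vapply(
    hits,
    function(m) if (length(m) > 0) paste(m, collapse = " ") else NA_character_,
    character(1)
  )
}

text <- c(
  "here is a word half and quarter also 99 is too old for lego",
  "heres another one with numbers 132 1244 5950 303 2022 and one and a half",
  "plz help me with code i am struggling"
)
keep_stats(text)
```

For the goal stated in the question (flagging which observations contain statistical language at all), !is.na(keep_stats(df$text)) gives a logical column directly.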

How to use quanteda to find instances of appearance of certain words before certain others in a sentence

As an R newbie, by using quanteda I am trying to find instances when a certain word sequentially appears somewhere before another certain word in a sentence. To be more specific, I am looking for instances when the word "investors" is located somewhere before the word "shall" in a sentence in the corpus consisting of an international treaty concluded between Morocco and Nigeria (the text can be found here: https://edit.wti.org/app.php/document/show/bde2bcf4-e20b-4d05-a3f1-5b9eb86d3b3b).
The problem is that sometimes there are multiple words between these two words. For instance, sometimes it is written as "investors and investments shall". I tried to apply similar solutions offered on this website. When I tried the solution on (Keyword in context (kwic) for skipgrams?) and ran the following code:
kwic(corpus_mar_nga, phrase("investors * shall"))
I get 0 observations since this counts only instances when there is only one word between "investors" and "shall".
And when I follow another solution offered on (Is it possible to use `kwic` function to find words near to each other?) and ran the following code:
toks <- tokens(corpus_mar_nga)
toks_investors <- tokens_select(toks, "investors", window = 10)
kwic(toks_investors, "shall")
I get instances where "investors" also appears after "shall", which changes the context fundamentally, since in that case the subject of the sentence is something different.
At the end, in addition to instances of "investors shall", I should also be getting, for example the instances when it reads as "Investors, their investment and host state authorities shall", but I can't do it with the above codes.
Could anyone offer me a solution on this issue?
Huge thanks in advance!
Good question. Here are two methods, one relying on regular expressions on the corpus text, and the second (as @Kohei_Watanabe suggests in the comment) using window for tokens_select().
First, create some sample text.
library("quanteda")
## Package version: 2.1.2
# sample text
txt <- c("The investors and their supporters shall do something.
Shall we tell the investors? Investors shall invest.
Shall someone else do something?")
Now reshape this into sentences, since your search occurs within sentence.
# reshape to sentences
corp <- txt %>%
corpus() %>%
corpus_reshape(to = "sentences")
Method 1 uses regular expressions. We add a boundary (\\b) before "investors", and the .+ says one or more of any character in between "investors" and "shall". (This would not catch newlines, but corpus_reshape(x, to = "sentences") will remove them.)
# method 1: regular expressions
corp$flag <- stringi::stri_detect_regex(corp, "\\binvestors.+shall",
case_insensitive = TRUE
)
print(corpus_subset(corp, flag == TRUE), -1, -1)
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "The investors and their supporters shall do something."
##
## text1.2 :
## "Investors shall invest."
A second method applies tokens_select() with an asymmetric window, followed by kwic(). First we select all documents (which are sentences) containing "investors", discarding the tokens before it and keeping all tokens after it. 1000 tokens after should be enough. Then we apply kwic() to "shall", keeping all context words. Any match must come after "investors", since every token before "investors" was discarded.
# method 2: tokens_select()
toks <- tokens(corp)
tokens_select(toks, "investors", window = c(0, 1000)) %>%
kwic("shall", window = 1000)
##
## [text1.1, 5] investors and their supporters | shall | do something.
## [text1.3, 2] Investors | shall | invest.
The choice depends on what suits your needs best.

How to Count Text Lines in R?

I would like to calculate the number of lines spoken by different speakers from a text using R (it is a transcript of parliamentary speaking records). The basic text looks like:
MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that.
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.
MR. JOHN: Thank you
In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker each time they speak in a document, such that the above text would result in:
MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1
Thanks for pointers using R!
You can split each string at the : and then use table:
table(sapply(strsplit(x, ":"), "[[", 1))
# MR. JOHN MR. LEHMAN MS. SMITH
# 2 1 1
strsplit - splits strings at : and returns a list
sapply with [[ - selects the first element of each list entry
table - gets the frequency
Edit: Following OP's comment. You can save the transcripts in a text file and use readLines to read the text in R.
tt <- readLines("./tmp.txt")
Now, we'll have to find a pattern by which to filter this text for just those lines with the names of those who're speaking. I can think of two approaches based on what I saw in the transcript you linked.
Check for a : and then lookbehind the : to see if it is any of A-Z or [:punct:] (that is, if the character occurring before the : is any of the capital letters or any punctuation marks - this is because some of them have a ) before the :).
You can use strsplit followed by sapply (as shown below)
Using strsplit:
# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
There are other possible approaches (using gsub, for example) or alternate patterns, but this should give you an idea of the approach. If the pattern differs, change it to capture all required lines.
Of course, this assumes that there is no other line, for example, like this:
"Mr. Chariman, whatever (bla bla): It is not a problem"
Because our pattern will give TRUE for ):. If this happens in the text, you'll have to find a better pattern.
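One possible tighter pattern, offered as a sketch under assumptions rather than a tested fix for the real transcript: the question says speaker identifiers begin with MR/MS and are always capitalized, so the pattern below is anchored to the start of the line and only accepts capitals (plus spaces, dots, apostrophes and hyphens) before the colon. The sample lines are made up from the examples in this thread.

```r
# Assumed transcript lines; the third one is the problematic case from above.
tt <- c(
  "MR. JOHN: This activity has been going on in Tororo.",
  "MS. SMITH: Yes, I am aware of that.",
  "Mr. Chariman, whatever (bla bla): It is not a problem",
  "MR. JOHN: Thank you"
)
# Assumption: a speaker label is MR. or MS., then capitals, spaces, dots,
# apostrophes or hyphens, then a colon, all at the very start of the line.
speaker_pattern <- "^(MR|MS)\\. [A-Z .'-]+:"
is_turn <- grepl(speaker_pattern, tt, perl = TRUE)
# Drop everything from the first colon onwards to leave only the label.
speakers <- sub(":.*$", "", tt[is_turn])
speaker_counts <- table(speakers)
speaker_counts
```

Here the "Mr. Chariman, whatever (bla bla):" line is no longer counted as a speaker line. Labels with other titles or with a parenthesised role before the colon would need the pattern extended.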

How to create a word grouping report using R language and .Net?

I would like to create a simple application in C# that takes in a group of words, then returns all groupings of those individual words from a data set.
For example, given car and bike, return a list of groups/combinations of words (with the number of combinations found) from a data set.
To further clarify - given a category named "car", I would like to see a list of word groupings with the word "car". This category could also be several words rather than just one.
With a sample data set of:
CAR:
Another car for sale
Blue car on the horizon
For Sale - used car
this car is painted blue
should return
car : for sale : 2
car : blue : 2
I'd like to set a threshold, say 20 or greater, so if there are over 20 instances of the word(s) with car, then display them - category, words, count, where only category is known; words and count is determined by the algorithm.
The data set is in a SQL Server 2008 table, and I was hoping to use something like a .Net implementation of R to accomplish this.
I am guessing that the best way to accomplish this may be with the R programming language, and am only now looking at R.Net.
I would prefer to do this with .Net, as that is what I am most familiar with, but open to suggestions.
Can someone with some experience with this lead me in the right direction?
Thanks.
It seems your question consists of 4 parts:
1. Getting data from SQL Server 2008
2. Extracting substrings from a set of strings
3. Setting a threshold for when to accept that number
4. Producing some document or other output (?) containing this.
For 1, I think that's a different question (see the RODBC package), and I won't be dealing with it here as it's not the main part of your question. You've left 4 a little vague, and I think that's also peripheral to the meat of your question.
Part 2 can be easily dealt with using regular expressions:
countstring <- function(string, pattern){
  # count how many strings contain the pattern, ignoring case
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
  # label the result with the name of the variable passed as `string`
  paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
}
This function takes a vector of strings and a pattern to search for. It finds which of the strings match and sums them (i.e. the count), then returns the category name, pattern, and count pasted together in one string. For example:
car <- c("Another car for sale", "Blue car on the horizon", "For Sale - used car", "this car is painted blue")
countstring(car, "blue")
## [1] "car : blue : 2"
Part 3 requires a small change to the function
countstring <- function(string, pattern, threshold=20){
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
  if(stringcount >= threshold){
    paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
  }
}
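To report several candidate words at once, the same counting step can be applied over a vector of patterns. This is only a sketch: the function name count_patterns and its arguments are hypothetical, and the candidate words still have to be supplied by hand, so it does not discover the groupings for you.

```r
# Hypothetical helper: count, for each pattern, how many strings contain it
# (ignoring case) and keep only the patterns that reach the threshold.
count_patterns <- function(strings, patterns, category, threshold = 20) {
  counts <- vapply(
    patterns,
    function(p) sum(grepl(p, strings, ignore.case = TRUE)),
    integer(1)
  )
  out <- data.frame(
    category = category,
    words = patterns,
    count = unname(counts),
    stringsAsFactors = FALSE
  )
  out[out$count >= threshold, , drop = FALSE]
}

car <- c("Another car for sale", "Blue car on the horizon",
         "For Sale - used car", "this car is painted blue")
# A low threshold is used here only because the sample has four rows.
count_patterns(car, c("for sale", "blue", "horizon"), category = "car", threshold = 2)
```

With the sample data this keeps "for sale" and "blue" (two matches each) and drops "horizon" (one match).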

Resources