Remove the string before a certain word with R

I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thousands, so it's easier to treat it as a string.
I know that gsub("*. Votes","", text) will remove everything, but how do I just remove the number? Also, how do I collapse the repeated spaces into just one space?
Thanks for any help you might have!
Example data:
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"

You may use
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
Details
(\\s){2,} - matches 2 or more whitespace chars while capturing the last occurrence that will be reinserted using the \1 placeholder in the replacement pattern
| - or
\\d - a digit
[0-9,]* - 0 or more digits or commas
\\s* - 0+ whitespace chars
(Votes) - Group 2 (will be restored in the output using the \2 placeholder): a Votes substring.
Note that trimws will remove any leading/trailing whitespace.
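If the combined alternation feels opaque, the same job can be done in two passes; this is a sketch using the same sample `text`, producing the same result as the one-liner above:

```r
out <- gsub("\\d[0-9,]*\\s*Votes", "Votes", text)  # drop the number before "Votes"
out <- gsub("\\s{2,}", " ", out)                   # collapse runs of whitespace
trimws(out)                                        # strip leading/trailing whitespace
```

Two passes are easier to read; the one-liner's advantage is a single scan of the string.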

Easiest way is with stringr:
> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
Here's a version that will strip out all numbers before the word "Votes" even if they have commas or periods in it:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text))))
[1] "558,586"
If you want the label too, then just throw out the gsub part:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text)))
[1] "558,586 Votes"
And if you want to pull out all the numbers:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*", text)))
[1] "1" "15" "202" "558,586"

Remove specific string or blank member from character vector

I am scraping https://www.transparency.org/news/pressreleases/year/2010 to retrieve header and details from each page. But along with header and details a telephone number and a blank string is coming in the retrieved list for every page.
[1] "See our simple, animated definitions of types of corruption and the ways to challenge it."
[2] "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
[3] " "
[4] "+49 30 3438 20 666"
I have tried the following code but it didn't work:
html %>% str_remove('+49 30 3438 20 666') %>% str_remove(' ')
How these elements can be removed?
It's because you didn't escape the + sign.
From this cheatsheet,
Metacharacters (. * + etc.) can be used as
literal characters by escaping them. Characters
can be escaped using \ or by enclosing them
in \Q...\E.
s = "+49 30 3438 20 666"
str_remove(s, "\\+49 30 3438 20 666")
# ""
In case you want to drop all lines that start with a + and end with a number:
dd <- c(
"See our simple, animated definitions of types of corruption and the ways to challenge it."
, "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
," "
, "+49 30 3438 20 666")
res <- dd[!grepl("^\\+.*\\d*$", dd)]
You can also use \\s (one whitespace) and \\d{2} (exactly 2 digits) for an exact match, to be on the safe side, if all numbers have the same format. Note that you can also use the pattern in str_remove, with the end result being an empty string; grepl instead returns a logical vector that you can use to subset your string vector.
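As a sketch of that stricter, exact-format approach (the pattern below is my assumption of the phone-number shape: a "+", two digits, then space-separated digit groups):

```r
phone_pat <- "^\\+\\d{2}(\\s\\d+)+$"
grepl(phone_pat, dd)   # TRUE only for the phone-number element
dd[!grepl(phone_pat, dd)]
```

Anchoring with ^ and $ ensures you only drop elements that are nothing but a phone number, not lines that merely contain one.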
If you also want to delete all empty lines:
dd[!grepl("^\\s*$",dd)]
Note that you can do both at the same time by using "|":
dd[!grepl("^\\+.*\\d*$|^\\s*$",dd)]
You can get familiar with regex here: https://regex101.com/

Extract all characters to the right of a list of possible characters

I have a series of strings in a dataframe like the ones below:
item_time <- c("pink dress july noon", "shirt early september morning",
               "purple dress april", "tall purple shoes february")
And I want to extract all the characters to the right of a list of possible characters like these:
item<-c("pink dress","shirt","purple dress", "tall purple shoes")
The result I want would look like this:
[1] july noon
[2] early september morning
[3] april
[4] february
I can't separate them by spaces as there are varying number of words in the time and item lists. I also don't have a symbol that separates them. I feel that there should be a quite simple and elegant way of solving this but I can't figure it out.
You can do this with sub and a regular expression.
Pat = paste0("(.*)(", paste0(item, collapse="|"), ")(.*)")
sub(Pat, "\\3", item_time)
[1] " july noon" " early september morning"
[3] " april" " february"
Details: The pattern that is created is:
Pat
[1] "(.*)(pink dress|shirt|purple dress|tall purple shoes)(.*)"
The middle part "(pink dress|shirt|purple dress|tall purple shoes)" matches any one of your items. The first (.*) matches anything before the item. The second (.*) matches anything after it. The sub call then replaces the whole string with just the part after the match.
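The same pattern can also pull out the matched item itself, by substituting group 2 instead of group 3:

```r
sub(Pat, "\\2", item_time)
# the matched items: "pink dress" "shirt" "purple dress" "tall purple shoes"
```

This is handy as a sanity check that each element of item_time actually contained one of your items.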
Another way is using mapply
mapply(gsub,pattern=item,replacement='',x=item_time)
If you also want to remove the space between item and the right part of item_time, you can instead use:
mapply(gsub,pattern=paste0(item,' '),replacement='',x=item_time)
Here is another option using stringr::str_replace(string, pattern, replacement), which has the advantage that it is vectorised over both string and pattern (and replacement as well).
trimws(stringr::str_replace(item_time, item, ""))
#[1] "july noon" "early september morning"
#[3] "april" "february"
trimws removes the leading whitespace.
Note that this requires item_time and item to have pairwise matching entries.
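One caveat: str_replace treats `item` as a regular expression. If an item could ever contain regex metacharacters such as . or +, wrapping it in fixed() makes the match literal; this is a defensive sketch, not required for the sample data:

```r
# match each item literally rather than as a regex
trimws(stringr::str_replace(item_time, stringr::fixed(item), ""))
```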

Extract first sentence in string

I want to extract the first sentence from following with regex. The rule I want to implement (which I know won't be universal solution) is to extract from string start ^ up to (including) the first period/exclamation/question mark that is preceded by a lowercase letter or number.
require(stringr)
x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."
My best guess so far has been a non-greedy string-before-match approach, which fails in this case:
str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA
Any tips much appreciated.
You put the [a-z0-9][.?!] into a non-consuming lookahead pattern; you need to make it consuming if you want str_extract to include it in the match:
> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."
Details
.*? - any 0+ chars other than line break chars, as few as possible
[a-z0-9] - an ASCII lowercase letter or a digit
[.?!] - a ., ? or !
(?= ) - that is followed with a literal space.
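A small variant (an assumption on my part, not part of the original answer) also matches when the first sentence happens to end the string, by allowing end-of-input in the lookahead:

```r
# accept either a following space or the end of the string
str_extract(x, '.*?[a-z0-9][.?!](?=\\s|$)')
```

On the sample x this returns the same first sentence ending in "October 11."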
Alternatively, you may use sub:
sub("([a-z0-9][?!.])\\s.*", "\\1", x)
Details
([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
\s - a whitespace
.* - any 0+ chars, as many as possible (up to the end of string).
corpus has special handling for abbreviations when determining sentence boundaries:
library(corpus)
text_split(x, "sentences")
#>   parent index text
#> 1      1     1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2      1     2 The death toll has now risen to at least 187.
There's also a useful dataset with common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used for disambiguating the sentence boundaries.

Finding Abbreviations in Data with R

In my data (which is text), there are abbreviations.
Is there any functions or code that search for abbreviations in text? For example, detecting 3-4-5 capital letter abbreviations and letting me count how often they happen.
Much appreciated!
detecting 3-4-5 capital letter abbreviations
You may use
\b[A-Z]{3,5}\b
Details:
\b - a word boundary
[A-Z]{3,5} - 3, 4 or 5 capital letters (use [[:upper:]] to match letters other than ASCII, too)
\b - a word boundary.
R demo (leveraging the regex occurrence count code from @TheComeOnMan):
abbrev_regex <- "\\b[A-Z]{3,5}\\b";
x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
sum(gregexpr(abbrev_regex,x)[[1]] > 0)
## => [1] 3
regmatches(x, gregexpr(abbrev_regex, x))[[1]]
## => [1] "XYZ" "WXYZ" "VWXYZ"
You can use the regular expression [A-Z] to match any occurrence of a capital letter. If you want this pattern repeated 3 times, append the quantifier {3}, giving [A-Z]{3}. Consider using variables and a loop to get the job done for 3 to 5 repetitions.

How to Count Text Lines in R?

I would like to calculate the number of lines spoken by different speakers from a text using R (it is a transcript of parliamentary speaking records). The basic text looks like:
MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that.
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.
MR. JOHN: Thank you
In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken for each speaker for each time spoke in a document such that the above text would result in:
MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1
Thanks for pointers using R!
You can use the pattern : to split the string by and then use table:
table(sapply(strsplit(x, ":"), "[[", 1))
# MR. JOHN MR. LEHMAN MS. SMITH
# 2 1 1
strsplit - splits strings at : and results in a list
sapply with [[ - selects the first element of each component of the list
table - gets the frequency
Edit: Following OP's comment. You can save the transcripts in a text file and use readLines to read the text in R.
tt <- readLines("./tmp.txt")
Now, we'll have to find a pattern by which to filter this text for just those lines with the names of those who're speaking. I can think of two approaches based on what I saw in the transcript you linked.
Check for a : and then lookbehind the : to see if it is any of A-Z or [:punct:] (that is, if the character occurring before the : is any of the capital letters or any punctuation marks - this is because some of them have a ) before the :).
You can use strsplit followed by sapply (as shown below)
Using strsplit:
# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
There are other approaches possible (using gsub, for example) or alternate patterns. But this should give you an idea of the approach. If the pattern differs, just change it to capture all the required lines.
Of course, this assumes that there is no other line, for example, like this:
"Mr. Chariman, whatever (bla bla): It is not a problem"
Because our pattern will give TRUE for ):. If this happens in the text, you'll have to find a better pattern.
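Note that table gives overall totals per speaker, while the desired output counts each consecutive speaking turn separately. If you need per-turn counts, run-length encoding is one option; this is a sketch assuming each element of tt.f is one spoken line:

```r
speakers <- sapply(strsplit(tt.f, ":"), "[[", 1)  # speaker label of each line
r <- rle(speakers)                                # collapse consecutive repeats
data.frame(speaker = r$values, lines = r$lengths) # one row per speaking turn
```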
