I am scraping https://www.transparency.org/news/pressreleases/year/2010 to retrieve the header and details from each page. But along with the header and details, a telephone number and a blank string are turning up in the retrieved list for every page:
[1] "See our simple, animated definitions of types of corruption and the ways to challenge it."
[2] "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
[3] " "
[4] "+49 30 3438 20 666"
I have tried the following code, but it didn't work:
html %>% str_remove('+49 30 3438 20 666') %>% str_remove(' ')
How can these elements be removed?
Is it because you failed to escape the + sign?
From this cheatsheet,
Metacharacters (. * + etc.) can be used as
literal characters by escaping them. Characters
can be escaped using \ or by enclosing them
in \Q...\E.
s = "+49 30 3438 20 666"
str_remove(s, "\\+49 30 3438 20 666")
# ""
In case you want to drop all lines that start with a + and end with a number:
dd <- c(
"See our simple, animated definitions of types of corruption and the ways to challenge it."
, "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
," "
, "+49 30 3438 20 666")
dd_clean <- dd[!grepl("^\\+.*\\d*$", dd)]
You can also use \\s (one whitespace character) and \\d{2} (exactly 2 digits) to build an exact match, to be on the safe side, if all numbers have the same format. Note that you can also use the pattern in str_remove, with the end result being an empty string; grepl() instead returns a logical vector that subsets your strings.
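For instance, a stricter pattern pinned to this exact format (a sketch that assumes every number looks like "+49 30 3438 20 666", i.e. a two-digit country code followed by space-separated groups of 2-4 digits):
grepl("^\\+\\d{2}(\\s\\d{2,4})+$", dd)
# [1] FALSE FALSE FALSE TRUE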
If you also want to delete all empty lines:
dd[!grepl("^\\s*$",dd)]
Note that you can do both at the same time by using "|":
dd[!grepl("^\\+.*\\d*$|^\\s*$",dd)]
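If you prefer to stay within stringr, str_subset() with negate = TRUE does the same subsetting in one step (a sketch, assuming stringr >= 1.4, where the negate argument was added):
str_subset(dd, "^\\+.*\\d*$|^\\s*$", negate = TRUE)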
You can get familiar with regex here: https://regex101.com/
I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape the case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized throughout all the individual documents. See below for a reproduced example.
Is there a way I can use those two unique words ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I would like to extract? If not, what sort of approach can I take to extract the narrative data I need from each report?
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is what the expected output would be.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks so much for helping!
One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything after INVESTIGATOR: ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
After much deliberation I came to a solution I feel is worth sharing, so here we go:
library(readr)   # read_lines()
library(stringr) # str_squish()
library(purrr)   # map2()

# Unlist text_data and collapse it into a single string.
file_contents_unlist <-
  paste(unlist(text_data), collapse = " ")

# Read lines, squish whitespace for good measure.
file_contents_lines <-
  file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()

# Create indices into the lines of our text data based on grepl();
# be sure they match if scraping multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)",
                                file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)",
                                file_contents_lines))

# The function basically states, "give me back whatever's in those indices".
pull_case_num <- function(index_case_num_1, index_case_num_2) {
  file_contents_lines[index_case_num_1:index_case_num_2]
}

# map2() to iterate.
case_nums <- map2(index_case_num_1,
                  index_case_num_2,
                  pull_case_num)

# Transform to a data frame.
case_nums_df <- as.data.frame.character(case_nums)

# Repeat the pattern for other vectors as needed.
index_case_hist_1 <- which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <- which(grepl("Case#: ", file_contents_lines))

pull_case_hist <- function(index_case_hist_1, index_case_hist_2) {
  file_contents_lines[index_case_hist_1:index_case_hist_2]
}

case_hist <- map2(index_case_hist_1,
                  index_case_hist_2,
                  pull_case_hist)

case_hist_df <- as.data.frame.character(case_hist)

# cbind() the vectors; also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
Thanks all for responding. I hope this solution helps someone out there in the future. :)
I'm trying to extract phone numbers in all formats (international and otherwise) in R.
Example data:
phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"
I'd like:
extract_vector
[1] "+49 123 999"
[2] "0001 123.456"
[3] "+31 (0) 8123"
[4] "(999)9999999"
[5] "(999)999-9999"
[6] "9999999999"
[7] "9999999999999"
I tried using:
extract_vector <- str_extract_all(phonenum_txt,"^(?:\\+\\d{1,3}|0\\d{1,3}|00\\d{1,2})?(?:\\s?\\(\\d+\\))?(?:[-\\/\\s.]|\\d)+$")
which I got from HERE, but my regex skills aren't good enough to adapt it to work in R.
Thanks!
While your data does not seem realistic, this expression might help you design the expression you need to match your string.
(?=.+[0-9]{2,})([0-9+\.\-\(\)\s]+)
I have added an extra boundary, which is usually good to add when inputs are complex.
You might add or remove boundaries, if you wish. For instance, this expression might work as well:
([0-9+\.\-\(\)\s]+)
Or you can add additional left and right boundaries to it, for instance if all phone numbers are wrapped with lower/uppercase letters:
[a-z]([0-9+\.\-\(\)\s]+)[a-z]
Your desired target output is in a capturing group, which you can reference with $1.
Regular expression design works best, if/when there is real data available.
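To try it in R, the expression could be applied with stringr (a sketch; note the doubled backslashes an R string literal requires):
library(stringr)
str_extract_all(phonenum_txt, "(?=.+[0-9]{2,})([0-9+\\.\\-\\(\\)\\s]+)")[[1]]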
You can use this regex to match and extract all the phone numbers you have in your string.
(?: *[-+().]? *\d){6,14}
The idea behind this regex is to optionally allow one character from the set [-+().] before each digit, as these characters can appear within your phone number. If your phone numbers can also contain characters like { or } or [ or ], you may add them to this character set. The optional character may be surrounded by optional spaces, hence the space-star before and after the set. Finally, \d matches a digit, and the whole pattern is quantified with {6,14} so it must repeat at least 6 and at most 14 times (you can configure these numbers as per your needs; the shortest number in your sample data has 6 digits, although in practice I think the minimum is 7 or 8, as in Singapore, but that's up to you).
Regex Demo
R Code Demo
library(stringr)
str_match_all("sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj", "(?: *[-+().]? *\\d){6,14}")
Prints all your required numbers,
[[1]]
[,1]
[1,] "+49 123 999"
[2,] "0001 123.456"
[3,] "+31 (0) 8123"
[4,] "(999)9999999"
[5,] "(999)999-9999"
[6,] "9999999999"
[7,] "9999999999999"
The original title for this question was "R regex for word boundary excluding space". It reflected the manner in which I was approaching the problem. However, this is a better solution to my particular problem. It should work as long as a particular delimiter is used to separate items within a 'cell'.
This must be very simple, but I've hit a brick wall on it.
I have a dataframe column where each cell (row) is a comma-separated list of items. I want to find the rows that contain a specific item.
df<-data.frame( nms= c("XXXCAP,XXX CAPITAL LIMITED" , "XXX,XXX POLYMERS LIMITED, 3455" , "YYY,XXX REP LIMITED,999,XXX" ),
b = c('A', 'X', "T"))
nms b
1 XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3 YYY,XXX REP LIMITED,999,XXX T
I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.
However, because the XXX in row 1 is separated by spaces on each side, I am having trouble filtering it out with \\b or [[:<:]]:
grep("\\bXXX\\b",df$nms, value = F) #matches 1,2,3
The easiest way to do this, of course, is strsplit(), but I'd like to avoid it. Any suggestions on performance are welcome.
When \b does not "work", the problem usually lies in the definition of the "whole word".
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
It seems you want to only match a word in between commas or at the start/end of the string.
You may use a PCRE regex (note the perl=TRUE argument) like
(?<![^,])XXX(?![^,])
See the regex demo (the expression is "converted" to use positive lookarounds due to the fact it is a demo with a single multiline string).
Details
(?<![^,]) (equal to (?<=^|,)) - either start of the string or a comma
XXX - an XXX word
(?![^,]) (equal to (?=$|,)) - either end of the string or a comma
R demo:
> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3
The equivalent TRE regex will look like
> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)
Note that here, non-capturing groups are used to match either start of string or , (see (?:^|,)) and either end of string or , (see (?:$|,)).
This is perhaps a somewhat simplistic solution, but it works for the examples which you've provided:
library(stringr)
df$nms %>%
str_replace_all('\\s', '') %>% # Removes all spaces, tabs, newlines, etc
str_detect('(^|,)XXX(,|$)') # Detects string XXX surrounded by comma or beginning/end
[1] FALSE TRUE TRUE
Also, have a look at this cheatsheet made by RStudio on Regular Expressions - it is very nicely made and very useful (I keep going back to it when I'm in doubt).
I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thousands, so it's easier to treat it as a string.
I know that gsub("*. Votes","", text) will remove everything, but how do I just remove the number? Also, how do I collapse the repeated spaces into just one space?
Thanks for any help you might have!
Example data:
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
You may use
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
See the online R demo and the online regex demo.
Details
(\\s){2,} - matches 2 or more whitespace chars while capturing the last occurrence that will be reinserted using the \1 placeholder in the replacement pattern
| - or
\\d - a digit
[0-9,]* - 0 or more digits or commas
\\s* - 0+ whitespace chars
(Votes) - Group 2 (will be restored in the output using the \2 placeholder): a Votes substring.
Note that trimws will remove any leading/trailing whitespace.
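A stringr equivalent is also possible (a sketch, not the approach above): remove the number via a lookahead before Votes, then collapse the repeated spaces with str_squish().
library(stringr)
str_squish(str_remove(text, "\\d[0-9,]*\\s*(?=Votes)"))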
Easiest way is with stringr:
> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
Here's a version that will strip out all numbers before the word "Votes" even if they have commas or periods in it:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) )
[1] "558,586"
If you want the label too, then just throw out the gsub part:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) ))
[1] "558,586 Votes"
And if you want to pull out all the numbers:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*",text) ))
[1] "1" "15" "202" "558,586"
I'm stuck trying to extract, from a big corpus (around 17000 documents), words that contain punctuation. For example:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The
aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A
cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This
prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"
I would like to extract words like the following:
[1] c(A<sc>IMS AND</sc> M<sc>ETHODS</sc>
[2] c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>
[3] c(PATIENTS & METHODS,
[4] c(c(Introduction
but not, for example, words like "cross-sectional", or "2013.", or "2)", or "(inability". This is the first step; my idea is to be able to get to this:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\n AIMS AND OBJECTIVES, The aim of this
study is to ... MATERIALS AND METHODS, A cross-sectional study with a ...
surgeries.n), \n\n PATIENTS AND METHODS, This prospective double blind,...
[95] Introduction Silicosis is a fibrotic"
As a way to extract these words without grabbing other words that include punctuation (like "surgeries.n)"), I have noticed that they always start with or include the "c(" expression. But I had some trouble with the regex:
grep("c(", test)
Error in grep("c(", test) :
  invalid regular expression 'c(', reason 'Missing ')''
also tried with:
grep("c\\(", test, value = T)
But it returns the whole text file. I have also used str_match from the stringr package, but I don't seem to get the pattern (regex) right. Do you have any recommendations?
If I understood your problem correctly (I'm unsure whether your second text is the expected output or just an intermediate step), I would go with gsub like this:
gsub("(c\\(|<\\/?sc>)","",text)
The regex (first parameter) will match c( or <sc> or </sc> and replace them with nothing, thus cleaning the text as you expect (again, assuming I understood your expectation correctly).
more on the regex involved:
(|) is the structure for an OR condition
c\\( will match a literal c( anywhere in the text
<\\/?sc> will match <sc> or </sc>, as the ? after the / means it can appear 0 or 1 times, so it's optional
The double \\ is there so that, after the R interpreter has removed the first backslash, there's still a backslash to tell the regex engine we want to match a literal ( and a literal /
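Applied to the sample text from the question, a sketch (assuming the string is stored in text; the second gsub is optional and turns the bare & into AND to match your expected output exactly):
cleaned <- gsub("c\\(|</?sc>", "", text)
cleaned <- gsub(" & ", " AND ", cleaned)  # optional, per the expected output
cleaned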
Try this,
text <- "...urine bag tubing and the vent jutting above the summit also strapped with the white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"
require(stringr)
words <- str_split(text, " ")
words[[1]][grepl("c\\(", words[[1]])]
## [1] "\n\nc(A<sc>IMS" "c(M<sc>ATERIALS" "\n\nc(PATIENTS" "c(c(Introduction,"