Remove text starting with a certain phrase up to a certain phrase - R

I have some texts, and some of them have pre-defined templates which add no value to the analysis.
I want to use regex to systematically remove the template (which typically consists of a header text like a greeting, and a closing text like a thank-you), so that I can focus on the variable text.
Both the header and the closing may contain variable parts, like a location or a staff name. So text 1 may have location = ABC and staff name = Sofia.
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
want <- "\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n"
header <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:"
tail <- "\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
My current attempt is below.
# remove everything before 'menu'
gsub('(.*)menu:','', have)
# want to correct the above to
# remove everything that
# starts with "Hello, thank you for contacting" up to "Please find our available menu"
# remove everything after Sincerely, inclusive
gsub('Sincerely.*','', have)
# want to correct the above to
# remove everything that
# starts with "Sincerely,\nThe Awesome Pizza Team" up to "\nDelivering Pizza 24/7"
2nd attempt
# text
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
# remove any text in between 'Hello, thank you for contacting'
# up to 'Please find below our available menu:'
# and also the anchoring texts
(want <- gsub(pattern = '(Hello, thank you for contacting).*(Please find below our available menu:)',
              replacement = '', x = have))
# remove any text after '\n\n Sincerely,\nThe Awesome Pizza Team\n', including the text itself
(want <- gsub(pattern = '\n\n Sincerely,\nThe Awesome Pizza Team\n.*',
              replacement = '', x = want))

An option might be to match all lines before Menu. Then capture all consecutive lines that start with Menu and match the rest of the lines starting at Sincerely.
In the replacement use capture group 1.
^[\s\S]*?\R((?:Menu .*\R+)*)\s*Sincerely,[\s\S]*
The pattern matches:
^ Start of string
[\s\S]*?\R Match any char, as few as possible, followed by a newline
( Capture group 1
(?:Menu .*\R+)* Repeat matching all lines that start with Menu and match a newline
) Close group 1
\s* Match optional whitespace chars
Sincerely, Match literally
[\s\S]* Match the rest of the lines
Example
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
trimws(gsub('^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*','\\1', have, perl = TRUE))
Output
[1] "Menu 1 USD 1.99\nMenu 2 USD 3.99"
A bit longer and more precise pattern could be:
^(?:(?!Menu ).*(?:\R(?!Menu ).*)*\R+)?(Menu .*(?:\RMenu .*)*)\R\s*Sincerely,[\s\S]*
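A minimal sketch applying this longer pattern to the same have (no trimws() needed here, since group 1 stops at the last Menu line):
gsub('^(?:(?!Menu ).*(?:\\R(?!Menu ).*)*\\R+)?(Menu .*(?:\\RMenu .*)*)\\R\\s*Sincerely,[\\s\\S]*',
     '\\1', have, perl = TRUE)
#> [1] "Menu 1 USD 1.99\nMenu 2 USD 3.99"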

Related

Problem with web scraping from Wikipedia

I have been practicing web scraping from Wikipedia with the rvest library, and I would like to solve a problem that I found when using the str_replace_all() function. Here is the code:
library(tidyverse)
library(rvest)
pagina <- read_html("https://es.wikipedia.org/wiki/Anexo:Premio_Grammy_al_mejor_%C3%A1lbum_de_rap") %>%
  # list all tables on the page
  html_nodes(css = "table") %>%
  # convert to a table
  html_table()
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- str_replace_all(rap$Artista, '\\[[^\\]]*\\]', '')
rap$Trabajo <- str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
The problem is that when I remove the elements between brackets (hyperlinks in Wikipedia) from the Artista variable and then tabulate the counts by artist, Eminem is repeated three times as if it were three different artists; the same happens with Kanye West, which is repeated twice. I appreciate any solutions in advance.
There are some hidden bits still attached to the strings, and trimws() is not working to remove them. You can use nchar(sort(test)) to see the number of characters associated with each entry.
Here is a messy regular expression that keeps the leading letters, spaces and a few punctuation characters, and skips everything else at the end.
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista<-gsub("([a-zA-Z -,&]+).*", "\\1", rap$Artista)
rap$Trabajo <- stringr::str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
Cardi B Chance the Rapper Drake Eminem Jay Kanye West Kendrick Lamar
1 1 1 6 1 4 2
Lil Wayne Ludacris Macklemore & Ryan Lewis Nas Naughty by Nature Outkast Puff Daddy
1 1 1 1 1 2 1
The Fugees Tyler, the Creator
1 2
Here is another regular expression that seems a bit clearer:
gsub("[^[:alpha:]]*$", "", rap$Artista)
From the end of the string, this removes zero or more trailing characters which are not letters.
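To see those hidden bits, a small hypothetical illustration (assuming, say, a non-breaking space was left attached to one entry):
x <- c("Eminem", "Eminem\u00a0")   # second value carries a trailing non-breaking space
table(x)                           # tabulates as two different artists
nchar(x)                           # 6 7 -- the extra character shows up in the counts
gsub("[^[:alpha:]]*$", "", x)      # both collapse to "Eminem"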

How can I parse the text from one countryName to another countryName in R?

I'm just having a really hard time figuring this out. Let's go straight to the data.
library(countrycode)
countries <- codelist$country.name.en #list of countries from the library
text <- "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. (Spain) No information available. (Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
I'd want to create a list of the parsed text (e.g. from "(France)" to "Nissan.") for all three countries. The actual text is 30 pages long and each (countryName) is followed by several paragraphs of text.
All the countryNames are in parentheses, but there might be other non-country parentheses in the text, or countryNames without parentheses. But the general pattern is that each segment I want to parse starts with (countryName1) and ends with (countryName2).
Output:
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan."
[[2]]
[1] "(Spain) No information available."
[[3]]
[1] "(Chad) Mr.Smith (from N'Djamena) bought a new house. It's very nice."
If all the countries in the 'text' match the reference vector, we may paste the reference vector into a single alternation and split the string just before each country match:
as.list(strsplit(text, sprintf('(?<=\\s)(?=(%s))',
  paste(paste0("\\(", countries), collapse = "|")), perl = TRUE)[[1]])
-output
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. "
[[2]]
[1] "(Spain) No information available. "
[[3]]
[1] "(Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."

Obtain specific sections of a text file and their contents

I have a text file that looks like:
1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last
I have another text file that looks like.
1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last
I was wondering how you could read in data from the previous text file and use it to find the specific sections within the second data file, together with all the content after each one up until the next section. So basically, I'm trying to get something like:
Section Content
1 Hello My name is John. It was nice to meet you.
1.1 Hi Hi again. My last name is Doe. 1.1.1 Bye
1.2 Hey Greetings.
.....And so on
I was wondering how I could do so.
The following solution can certainly be improved, but it might give you an idea of how to approach your issue. Depending on the size and structure of the files you need to process, this approach might be fine or might require more tuning concerning the detection of sections and the speed.
file1 =
"1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last"
file2 =
"1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last"
# split both files into one string per line
file1 = unlist(strsplit(file1, "\n", fixed = TRUE))
file2 = unlist(strsplit(file2, "\n", fixed = TRUE))
# locate each heading of file1 in file2 (whole-line match, case-insensitive)
positions = unlist(sapply(file1, function(x) grep(paste0("^", x, "$"), file2, ignore.case = TRUE)))
# pair each start line with the line just before the next heading (or the last line)
positions = cbind(positions, c(positions[-1] - 1, length(file2)))
# collect each section's lines, then drop the heading itself
text = mapply(function(x, y) file2[x:y], positions[, 1], positions[, 2])
text = lapply(text, function(x) x[-1])
result = cbind(positions, text)
result
# positions text
# 1 Hello 1 2 "My name is John. It was nice to meet you."
# 1.1 Hi 3 5 Character,2
# 1.2 Hey 6 7 "Greetings."
# 2 Next section 8 9 "This is the second section. I am majoring in CS."
# 2.1 New section 10 15 Character,5
# 4 last 16 16 Character,0
# Note that the text column contains lists storing the individual lines.
# e.g. for "2.1 New section":
class(result[5, "text"])
# list
result[5, "text"]
# [[1]]
# [1] "Welcome. I am an undergraduate student." "3 third" #<< note the different spelling of third
# [3] "1. hi" "2. hello"
# [5] "3. hey"
The answer to this question is yes, it can be done. The implementation will vary wildly based on the programming language you are using to accomplish this task. A high-level overview would be:
split the original file into a string array by line. These are your list of keys for searching the second document.
read the second file into a string variable
iterate through all your keys (iterator x) and find their index in the second document, something like
int start = seconddocument.indexOf(keys[x]);
int end = seconddocument.indexOf(keys[x + 1]);
then with these start and end positions you can use a substring() function to extract the content.
string matchedContent = seconddocument.substring(start, end);
This works until you get to the last match, because keys[x+1] will not exist where x is the last key. In this case end needs to be set to the position of the last character in the document, or you use a substring method that just takes a starting point.
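A rough R sketch of the same idea (assuming file1 and file2 are still the single strings defined in the previous answer):
keys   <- unlist(strsplit(file1, "\n", fixed = TRUE))
starts <- vapply(keys, function(k) regexpr(k, file2, fixed = TRUE)[1], integer(1))
keys   <- keys[starts > 0]                  # drop headings not found verbatim (e.g. "3 thrid")
starts <- starts[starts > 0]
ends   <- c(starts[-1] - 1L, nchar(file2))  # the last section runs to the end of the document
sections <- setNames(substring(file2, starts, ends), keys)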
HTH

Extract state abbreviation and zip code from strings

I want to extract the state abbreviation (2 letters) and zip code (either 4 or 5 digits) from the following string:
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
For the zip code, I tried a few methods that I found on here, but they didn't work, mainly because of the 5-digit street addresses or zip codes that have only 4 digits.
text <- readLines(textConnection(address))
library(stringi)
zip <- stri_extract_last_regex(text, "\\d{5}")
zip
library(qdapRegex)
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract = TRUE)
zip <- rm_zip3(text)
zip
[1] "99577" "1670" "35007" "0360" "06854" "0404" "83338" "4333" "4330" "8223" "5495" "5233" NA
For the state abbreviation, I have no idea how to extract it.
Any help is appreciated! Thanks in advance!
Edit 1: Include phone numbers
Code to extract zip code:
library(stringr)
zip <- str_extract(text, "\\d{5}")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")
Code to extract phone numbers:
phone <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")
NOTE: Looks like there's an issue with your data, because the last 2 zip codes should be 5 characters long, not 4: 4330 should actually be 04330. If you don't have control over the data source, but know for sure that they are US codes, you could pad 0's on the left as required. However, since you are looking for a solution for 4 or 5 characters, you can use this:
Code to extract zip code (looks for a space in front and a newline at the back so that parts of a phone number or an address aren't picked up):
zip <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")
Demo: https://regex101.com/r/7Im0Mu/2
I am using address as the input, not text; see if it works for your case.
Assumptions in the regex: two capital letters followed by 4 or 5 digits are the state and zip, and the phone number is always on the next line.
Input:
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
I am using the stringr library; you may choose any other to extract the information as you wish.
library(stringr)
# extract "ST 99999 999-999-9999" chunks, then split each into three columns
matches <- str_extract_all(address, "[A-Z][A-Z]\\s\\d{4,5}\\s\\d{3}-\\d{3}-\\d{4}")[[1]]
df <- data.frame(do.call("rbind", strsplit(matches, split = "\\s|\\n")))
names(df) <- c("state", "Zip", "Phone")
EDIT:
In case someone want to use text as input,
text <- readLines(textConnection(address))
text <- data.frame(text)
st_zip <- setNames(data.frame(str_extract_all(text$text,"[A-Z][A-Z]\\s\\d{4,5}",simplify = T)),"St_zip")
pin <- setNames(data.frame(str_extract_all(text$text,"\\d{3}-\\d{3}-\\d{4}",simplify = T)),"pin")
st_zip <- st_zip[st_zip$St_zip != "",]
df1 <- setNames(data.frame(do.call("rbind",strsplit(st_zip,split=' '))),c("State","Zip"))
pin <- pin[pin$pin != "",]
df2 <- data.frame(cbind(df1,pin))
OUTPUT:
State Zip pin
1 AK 99577 907-481-1670
2 AL 35007 205-620-0360
3 CT 06854 860-409-0404
4 ID 83338 208-324-4333
5 ME 4330 207-623-8223
6 VT 5495 802-878-5233
Thank you @Rahul. Both would be great. At least can you show me how to do it with Notepad++?
Extraction using Notepad++
Well, first copy your whole data into a file.
Go to Find by pressing Ctrl + F. This will open the search dialog box. Choose the Replace tab, search with the regex ([A-Z]{2}\s*\d{4,5})$ and replace with \n-\1-\n. This will find each state abbreviation and ZIP code and place them on their own line with - as prefix and suffix.
Now go to the Mark tab. Check the Bookmark Line checkbox, then search with -(.*?)- and press Mark All. This will mark the state abbreviations and ZIPs which are on new lines wrapped in -.
Now go to Search --> Bookmark --> Remove Unmarked Lines.
Finally, search with ^-|-$ and replace with an empty string.
Update
So now there will be phone numbers too? In that case you only have to remove the $ from the regex in step 2. The regex to use will be ([A-Z]{2}\s*\d{4,5}). All the other steps stay the same.
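For comparison, a rough R equivalent of those Notepad++ steps (a sketch, assuming the address string from the question):
library(stringr)
str_extract_all(address, "[A-Z]{2}\\s*\\d{4,5}")[[1]]
#> [1] "AK 99577" "AL 35007" "CT 06854" "ID 83338" "ME 4330"  "VT 5495"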

how to extract text from anchor tag inside div class in r

I am trying to fetch text from an anchor tag, which is embedded in a div tag. Here is the link to the website: http://mmb.moneycontrol.com/forum-topics/stocks-1.html
The text I want to extract is, for example, Mawana Sugars.
So I want to extract all the stock names listed on this website together with their descriptions.
Here is my attempt to do it in R
library(XML)
doc <- htmlParse("http://mmb.moneycontrol.com/forum-topics/stocks-1.html")
xpathSApply(doc, "//div[@class='clearfix PR PB5']//text()", xmlValue)
But, it does not return anything. How can I do it in R?
My answer is essentially the same as the one I just gave here.
The data is dynamically loaded, and cannot be retrieved directly from the html. But, looking at "Network" in Chrome DevTools for instance, we can find a nicely formatted JSON at http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1
To get you started:
library(jsonlite)
dat <- fromJSON("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1")
Output looks like:
dat[1:3, c("msg_id", "user_id", "topic", "heading", "flag", "price", "message")]
# msg_id user_id topic heading flag
# 1 47730730 liontrade NMDC Stocks APR
# 2 47730726 agrawalknath Glenmark Glenmark APR
# 3 47730725 bissy91 Infosys Stocks APR
# price
# 1 Price when posted : BSE: Rs. 127.90 NSE: Rs. 128.15
# 2 Price when posted : NSE: Rs. 714.10
# 3 Price when posted : BSE: Rs. 956.50 NSE: Rs. 955.00
# message
# 1 There is no mention of dividend in the announcement.
# 2 Eagerly Waiting for 670 to 675 to BUY second phase of Buying in Cash Delivery. Already Holding # 800.
# 3 6 ✂ ✂--Don t Pay High Brokerage While Trading. Take Delivery Free & Rs 20 to trade in any size - Join Today .👉 goo.gl/hDqLnm
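Since the question was after the stock names, the topic column of the JSON already carries them; a quick follow-up (assuming dat from above):
head(unique(dat$topic))   # e.g. "NMDC" "Glenmark" "Infosys" ...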
