I want to use R to extract text and numbers from the following page: https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=PA0261696&pgm_sys_acrnm_in=NPDES
Specifically, I want the NPDES SIC code and the description, which is 6515 and "Operators of residential mobile home sites" here.
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
test <- test %>% html_nodes("tr") %>% html_text()
# This extracts 31 lines of text; here is what my target text looks like:
# [10] "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n"
Ideally, I'd like the following: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"
How would I do this? I'm failing at the regex even when just trying to extract the number 6515 on its own, which I thought would be the easiest part...
as.numeric(sub(".*?NPDES.*?(\\d{4}).*", "\\1", test))
# 4424
Any advice?
From what I can see, the information you want resides in a table, so it might be better to just extract it as a data frame. This works:
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
tables <- html_nodes(test, "table")
tables
SIC <- as.data.frame(html_table(tables[5], fill = TRUE))
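From there, pasting the code and its description together gives the single string you asked for. A minimal sketch, assuming the SIC code and description land in the first row of two adjacent columns of SIC (the exact column positions depend on how the table parses, so inspect SIC first):
# Column indices are an assumption -- check str(SIC) for the right ones
paste(SIC[1, 2], SIC[1, 3])
# should give something like "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"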
I am new to data scraping in R, but I would like to do the following. I have a list of celebrities, celebs, and I would like to grab their date of birth from Wikipedia. I know how to do it for each individual celebrity, but I am trying to automate this process.
celebs <- c("Tom Hanks", "Tim Cook", "Michael Bloomberg")
I do the following to get the information I need for the first celebrity, Tom Hanks.
library(rvest)
wiki <- read_html("https://en.wikipedia.org/wiki/Tom_Hanks")
birth_date <- wiki %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
html_text()
Is there a way to get the information I need for Tim Cook and Michael Bloomberg without manually editing the above code?
Welcome to SO.
To do any task repeatedly with code, you should always look to build a loop. Before you can build a loop, you should try to build a single iteration of the loop. You almost have that ready here, but there are a few missing steps.
First of all, we should try to generalize the code so that it works by simply switching the value of one variable taken from your vector of iterators (celebs).
person <- "Tom Hanks"
Now, using that, we need to create the wikipedia link through code. There are two things to consider here:
We need to prepend the Wikipedia base URL to the name of the person;
We should replace the space in "Tom Hanks" with an underscore.
We can do that with this code:
library(stringr) # str_replace_all() comes from stringr, which also needs loading
link <- paste0("https://en.wikipedia.org/wiki/",
               str_replace_all(person, " ", "_"))
This creates the correct link, which we can use for the subsequent steps. Now it is just a question of iterating through the celebs vector. There are many ways to do this, but in R a natural choice is sapply. For that, we will create an anonymous function that takes a person's name as input, queries Wikipedia, and extracts the birth date, using the code you have already written:
function(person) {
link <- paste0("https://en.wikipedia.org/wiki/",
str_replace_all(person, " ", "_"))
wiki <- read_html(link)
birth_date <- wiki %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
html_text()
return(birth_date)
}
You can now wrap an sapply structure around that:
birthdates <- sapply(celebs, function(person) {
link <- paste0("https://en.wikipedia.org/wiki/",
str_replace_all(person, " ", "_"))
wiki <- read_html(link)
birth_date <- wiki %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
html_text()
return(birth_date)
})
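Assuming each page returns exactly one text node at that XPath, sapply() gives you a character vector named after the celebrities, which you can tidy into a data frame as a quick check (a sketch; note that the hard-coded tr[3] row may not point at the birth date in every infobox, so some entries may need cleaning):
# Quick tidy-up of the result -- inspect and clean entries as needed
data.frame(celeb = names(birthdates), birth_date = unname(birthdates),
           stringsAsFactors = FALSE)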
I'm trying to scrape a website that has numerous pieces of information I want, stored in paragraphs. I got the extraction below to work perfectly... However, I don't understand how to break the text up and create a data frame.
Website: https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml
Code:
library(rvest)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes <- webpage %>%
  html_nodes(xpath = '//p') %>%
  html_text()
#replace multiple whitespaces with single space
p_nodes<- gsub('\\s+',' ',p_nodes)
#trim spaces from ends of elements
p_nodes <- trimws(p_nodes)
#drop blank elements
p_nodes <- p_nodes[p_nodes != '']
How I want the dataframe to look:
I'm not sure if this is even possible. I tried to extract each piece of information separately and then build the data frame that way, but it doesn't work since most of the info is stored in the p tag. I would appreciate any guidance. Thanks!
Proof-of-concept (based on what I wrote in the comment):
Code
lapply(c('data.table', 'httr', 'rvest'), library, character.only = T)
tags <- 'tr:nth-child(6) td , tr~ tr+ tr p , td+ p'
burl <- 'https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml'
url_text <- read_html(burl)
chunks <- url_text %>% html_nodes(tags) %>% html_text()
coordFunc <- function(chunk){
patter_lat <- 'Longitude:.*(-[[:digit:]]{1,2}.[[:digit:]]{0,15})'
ret <- regmatches(x = chunk, m = regexec(pattern = patter_lat, text = chunk))
return(ret[[1]][2])
}
longitudes <- as.numeric(unlist(lapply(chunks, coordFunc)))
Output
# using 'cat' to make the output easier to read
> cat(chunks[14])
Mt. Laurel DOT
Rt. 38, East
1/4 mile East of Rt. 295
Mt. Laurel Open 24 Hrs
Unleaded / Diesel
856-235-3096Latitude: 39.96744662Longitude: -74.88930386
> longitudes[14]
[1] -74.8893
If you do not coerce longitudes to be numeric, you get:
longitudes <- (unlist(lapply(chunks, coordFunc)))
> longitudes[14]
[1] "-74.88930386"
I chose the longitude as a proof-of-concept, but you can modify your function to extract all relevant bits in a single call. For getting the right tag you can use the SelectorGadget extension (it works well in Chrome for me). Alternatively, most browsers let you 'inspect element' to get the HTML tag. The function could return the extracted values in a data.table, which can then be combined into one using rbindlist.
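For example, here is a sketch of that idea, assuming the site name is the first line of each chunk and that the coordinates always follow the literal 'Latitude:'/'Longitude:' labels (both assumptions about this particular page layout):
# Sketch only: pull name, latitude and longitude out of one chunk
chunkToRow <- function(chunk){
  lines <- trimws(unlist(strsplit(chunk, '\r\n')))
  lat <- regmatches(chunk, regexec('Latitude:\\s*([[:digit:].]+)', chunk))[[1]][2]
  lon <- regmatches(chunk, regexec('Longitude:\\s*(-?[[:digit:].]+)', chunk))[[1]][2]
  data.table(name = lines[1],
             latitude = as.numeric(lat),
             longitude = as.numeric(lon))
}
sites <- rbindlist(lapply(chunks, chunkToRow), fill = TRUE)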
You could even advance pages programmatically to scrape the entire website, but be sure to check the site's usage policy first (scraping is often frowned upon or restricted).
Edit
The text is not structured the same way throughout the webpage, so you'll need to spend more time examining what exceptions can occur.
Here's a new function to resolve each chunk into separate lines and then you can try to use additional regular expressions to get what you want.
newfunc <- function(chunk){
# Each chunk is a couple of lines. First, we split at '\r\n' using strsplit
# the output is a list so we use 'unlist' to get a vector
# then use 'trimws' to remove whitespace around it - try out each of these functions
# separately to understand what is going on. The final output here is a vector.
txt <- trimws(unlist(strsplit(chunk, '\r\n')))
return(txt)
}
This returns the 'text' contained in each chunk as a vector of separate lines. Taking a look at the number of lines in the first 20 chunks, you can see it is not the same:
> unlist(lapply(chunks[1:20], function(z) length(newfunc(z))))
[1] 5 6 5 7 5 5 5 5 5 4 1 6 6 6 5 1 1 1 5 6
A good way to resolve this would be to put in a conditional statement based on the number of lines of text in each chunk, e.g. in newfunc you could add:
if(length(txt) == 1){
return(NULL)
}
This handles the entries that don't have any text in them. Since this is a proof of concept I haven't checked all entries, but there is some simple logic:
The first line is typically the name.
The coordinates are in the last line.
The fuel can be either unleaded or diesel. You can grep for these two strings to see what each depot offers, e.g. grepl('diesel', newfunc(chunks[12])).
Another approach would be to use a different set of HTML tags, e.g. all coordinates and opening hours are in boldface and have the tag strong. You can extract those separately and then use regular expressions to get what you want.
You could search for the pattern 24\\s*(Hrs|Hours) to first extract all sites that are open 24 hours and then use selective regex on the remainder to get their operating times.
With most web scraping there is no simple, easy answer: you have to find patterns and then apply some logic based on them. Only on the most structured websites will you find something that works for the entire page/range.
You can use the tidyverse packages (stringr, tibble, purrr):
library(rvest)
library(tidyverse)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes <- webpage %>%
  html_nodes(xpath = '//p') %>%
  html_text()
# Split on new line
l = p_nodes %>% stringr::str_split(pattern = "\r\n")
var1 = sapply(l, `[`, 1) # replace var by the name you want
var2 = sapply(l, `[`, 2)
var3 = sapply(l, `[`, 3)
var4 = sapply(l, `[`, 4)
var5 = sapply(l, `[`, 5)
t = tibble(var1,var2,var3,var4,var5) # make tibble
t = t %>% filter(!is.na(var2)) # delete useless lines
purrr::map_dfr(t,trimws) # delete blanks
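Note that the last line returns a new tibble rather than modifying t in place, so assign the result if you want to keep it; you can also give the columns more meaningful names (the names below are only a guess at what each line of a block holds, so inspect the output first):
res <- purrr::map_dfr(t, trimws)
# Hypothetical column names -- verify against the actual rows first
res <- res %>% rename(name = var1, address = var2, details = var3,
                      fuel = var4, contact = var5)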
I am scraping stock market prices using the rvest package in R. I would like to exclude nodes when using html_nodes().
The following classes appear on the website with stock prices:
[4] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_DifferenceBlock_lblRelativeDifferenceDown" class="ValueDown">-0,51%</span>
[5] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_ctl02_lblDifference" class="ValueDown Difference">-51%</span>
Now I would like to include only the text after class="ValueDown", and I would like to exclude the text after class="ValueDown Difference".
For this I use the following code:
library(rvest)
urlIEX <- "https://www.iex.nl/Koersen/Europa_Lokale_Beurzen/Amsterdam/AMX.aspx"
webpageIEX <- read_html(urlIEX)
percentage_change <- webpageIEX %>%
html_nodes(".ValueDown") %>%
html_text()
However, this gives me both the values -0,51% and -51%. Is there a way to include everything with class="ValueDown" and exclude everything with class="ValueDown Difference"?
I'm not an expert, but I think you should use the attribute selector:
percentage_change <- webpageIEX %>%
html_nodes("[class='ValueDown']") %>%
html_text()
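If you prefer to keep a class-based selector, the CSS engine that rvest uses (selectr) also understands :not(), so something along these lines should work too (a sketch, not tested against the live page):
percentage_change <- webpageIEX %>%
  html_nodes(".ValueDown:not(.Difference)") %>%
  html_text()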
I am pretty new to R, but I am really interested in learning how to use it (specifically the new package rvest) in order to screen scrape information from articles for research papers, etc.
I want to create a dataset of all the ratings and directors of movies on IMDB. I have the code that can get ONE rating at a time:
library(rvest)
HG_Movie <- html("http://www.imdb.com/title/tt01781922")
score <- HG_Movie %>%
html_node("strong span") %>%
html_text() %>%
as.numeric()
print(score)
That will work, and I print the score at the end to make sure it is correct (6.9).
So, now, the hard part. I want to be able to iterate over many IMDB pages and collect the rating and the name of the director as well, and I want these written into a dataset of some kind (it doesn't matter if it is .csv or .txt or whatever). The finished dataset would look something like:
Title Score Director
XX YY HH
AA BB CC
and so on. It would be amazing to learn how to do this both with a list of all the URLs and without one, using some sort of loop over a certain range of values. Any help would be greatly appreciated!
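One way to sketch this, mirroring the sapply/loop pattern from the Wikipedia answer above, is to build a vector of title URLs, extract the pieces per page, and bind the rows into a data frame. The title IDs below are just examples, and the title and director selectors are assumptions about IMDB's markup (check them with SelectorGadget before relying on them):
library(rvest)
urls <- c("http://www.imdb.com/title/tt0120338",   # example title IDs
          "http://www.imdb.com/title/tt0468569")

movies <- do.call(rbind, lapply(urls, function(u) {
  page <- html(u)
  title <- page %>% html_node("h1") %>% html_text()              # assumed selector
  score <- page %>% html_node("strong span") %>% html_text() %>% as.numeric()
  director <- page %>% html_node(".credit_summary_item a") %>%   # assumed selector
    html_text()
  data.frame(Title = title, Score = score, Director = director,
             stringsAsFactors = FALSE)
}))

write.csv(movies, "imdb_ratings.csv", row.names = FALSE)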
So I am attempting to scrape a webpage that has irregular blocks of data organized in a manner that is easy to spot with the eye. Let's imagine we are looking at Wikipedia. If I scrape the text of the article linked in the code below, I end up with 33 entries. If I instead grab just the headings, I end up with only 7 (see code below). This result is not surprising, since some sections of an article have multiple paragraphs while others have only one or none.
My question, though, is: how do I associate each heading with its text? If there were the same number of paragraphs per heading, or some multiple of it, this would be trivial.
library(rvest)
wiki <- html("https://en.wikipedia.org/wiki/Web_scraping")
wikitext <- wiki %>%
html_nodes('p+ ul li , p') %>%
html_text(trim=TRUE)
wikiheading <- wiki %>%
html_nodes('.mw-headline') %>%
html_text(trim=TRUE)
This will give you a list called content whose elements are named according to the headings and contain the corresponding text.
library(rvest) # Assumes version 0.2.0.9 (not currently on CRAN) is installed
wiki <- html("https://en.wikipedia.org/wiki/Web_scraping")
# This node set contains the headings and text
wikicontent <- wiki %>%
html_nodes("div[id='mw-content-text']") %>%
xml_children()
# Locates the positions of the headings
headings <- sapply(wikicontent,xml_name)
headings <- c(grep("h2",headings),length(headings)-1)
# Loop through the headings keeping the stuff in-between them as content
content <- list()
for (i in 1:(length(headings)-1)) {
foo <- wikicontent[headings[i]:(headings[i+1]-1)]
foo.title <- xml_text(foo[[1]])
foo.content <- xml_text(foo[-c(1)])
content[[i]] <- foo.content
names(content)[i] <- foo.title
}
The key was spotting the mw-content-text node which has all the things you want as children.
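A quick way to sanity-check the result (the heading names are whatever the page happens to use, so the one below is only an example):
names(content)
# content[["Techniques"]]  # paragraphs under that heading, assuming it exists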