Iteratively Create Dataset Using rvest - r

I am pretty new to R, but I am really interested in learning how to use it (specifically the new package rvest) in order to screen scrape information from articles for research papers, etc.
I want to create a dataset of all the ratings and directors of movies on IMDB. I have the code that can get ONE rating at a time:
library(rvest)

HG_Movie <- read_html("http://www.imdb.com/title/tt01781922")

score <- HG_Movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

print(score)
That works, and I print the score at the end to make sure it is correct (6.9).
So, now, the hard part. I want to be able to iterate over many IMDB pages and collect the rating and the name of the director as well, and I want these written into a dataset of some type (it doesn't matter if it is .csv or .txt or whatever). The finished dataset would look something like:
Title   Score   Director
XX      YY      HH
AA      BB      CC
and so on. It would be amazing to learn how to do this both with a list of all the URLs and without one, using some sort of loop over a certain range of values. Any help would be greatly appreciated!
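A minimal sketch of one way this could look, assuming you start from a character vector of title URLs; the title and director selectors below are assumptions and would need to be checked against the current IMDB page layout:

library(rvest)

urls <- c("http://www.imdb.com/title/tt01781922")  # add the other title URLs here

movies <- data.frame(Title = character(), Score = numeric(), Director = character(),
                     stringsAsFactors = FALSE)

for (u in urls) {
  page <- read_html(u)

  # Hypothetical selectors; inspect the page to find the right ones
  title <- page %>% html_node("h1") %>% html_text(trim = TRUE)

  score <- page %>%
    html_node("strong span") %>%   # same selector as above
    html_text() %>%
    as.numeric()

  director <- page %>% html_node(".credit_summary_item a") %>% html_text(trim = TRUE)

  movies <- rbind(movies, data.frame(Title = title, Score = score, Director = director,
                                     stringsAsFactors = FALSE))
}

write.csv(movies, "imdb_movies.csv", row.names = FALSE)

Each iteration appends one row to the data frame, and write.csv() saves the finished dataset at the end.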

Related

Rvest'ing using 'for' loops in R

My goal is to get weather data from one of the websites. I (with a little help from kind Stack users, thank you) have already created a vector containing the list of 1440 links and decided to try using a 'for' loop to iterate over them.
Additionally, every page holds the weather for a whole week, so there are 7 rows of data (one for each day) that I have to obtain, marked as num0/num1/num2/num3 and so on.
That's what I came up with:
Links <- # here are the 1440 links I need to iterate over

library("rvest")

for (index in seq(from = 1, to = length(Links), by = 1)) {
  link <- paste(Links[index])
  for (num in 0:7) {
    node_date       <- paste(".num", num, " .date", sep = "")
    node_conditions <- paste(".num", num, " .cond span", sep = "")
    # here I tried to create an embedded for loop to iterate 7 times over the
    # various nodes containing the data
    page <- read_html(link)
    DayOfWeek  <- page %>% html_nodes(node_date) %>% html_text()
    Conditions <- page %>% html_nodes(node_conditions) %>% html_text()
  }
}
For now I get an error:
Error in open.connection(x, "rb") : HTTP error 502
and I'm really quite confused about what I should do now.
Are there other ways to accomplish this goal? Or maybe I'm making some rookie mistakes here?
Thank you in advance!
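A 502 is returned by the server rather than caused by the R code itself, so one possible approach (a sketch under some assumptions, not a definitive fix) is to read each page only once, wrap the request in tryCatch() so a failing link is skipped instead of stopping the loop, collect the results as you go, and pause between requests:

library(rvest)

Links <- c()  # the 1440 links from above go here

results <- list()

for (link in Links) {
  # Skip this link if the server returns an error (e.g. HTTP 502)
  page <- tryCatch(read_html(link), error = function(e) NULL)
  if (is.null(page)) {
    message("Skipping ", link, ": request failed")
    next
  }

  for (num in 0:6) {  # assumes the days are marked num0 ... num6
    node_date       <- paste0(".num", num, " .date")
    node_conditions <- paste0(".num", num, " .cond span")

    DayOfWeek  <- page %>% html_nodes(node_date) %>% html_text()
    Conditions <- page %>% html_nodes(node_conditions) %>% html_text()

    # Assumes both selectors return the same number of elements for a given day
    results[[length(results) + 1]] <- data.frame(link = link,
                                                 day = DayOfWeek,
                                                 conditions = Conditions,
                                                 stringsAsFactors = FALSE)
  }

  Sys.sleep(1)  # be polite to the server; rapid repeated requests can trigger errors
}

weather <- do.call(rbind, results)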

Web Scraping Education Data in R

I was presented with a problem at work and am trying to think / work my way through it. However, I am very new to web scraping and need some help, or just good starting points.
I have a website from the education commission.
http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA
This site contains 50 tables, one for each state, with two columns in a question / answer format. My first attempt has been this...
library(tidyverse)
library(httr)
library(XML)
tibble(url = "http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA") %>%
  mutate(get_data = map(.x = url,
                        ~ GET(.x))) %>%
  mutate(list_data = map(.x = get_data,
                         ~ readHTMLTable(doc = content(.x, "text")))) %>%
  pull(list_data)
My first thought was to create multiple dataframes, one for each state, in a list format.
This idea does not seem to have worked as anticipated. I was expecting a list of 50, but it seems to be a list of one response rather than 50. It appears that this one response read each line but did not differentiate one table from the next. I'm confused about the next steps; does anyone have any ideas? Web scraping is odd to me.
My second attempt was to copy and paste the table into R as a tribble, one state at a time. This sort of worked, but not every column is formatted the same way. I attempted to use tidyr::separate() to break up the columns by "\t", and that worked for some columns, but not all.
Any help on this problem, or even just pointers on where to look to learn more about web scraping, would be very helpful. This did not seem all that difficult at first, but it seems like there are a couple of things I am missing. Maybe rvest? I have never used it, but I know it is commonly used for web scraping.
Thanks in advance!
As you already guessed, rvest is a very good choice for web scraping. Using rvest you can get the table from your desired website in just two steps. With some additional data wrangling, this can be transformed into a nice data frame.
library(rvest)
#> Loading required package: xml2
library(tidyverse)
html <- read_html("http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA")
df <- html %>%
  html_table(fill = TRUE, header = FALSE) %>%
  .[[1]] %>%
  # Remove empty rows and rows containing the table header
  filter(!(X1 == "" & X2 == ""), !(grepl("^Dual", X1) & grepl("^Dual", X2))) %>%
  # Create state column
  mutate(is_state = X1 == X2, state = ifelse(is_state, X1, NA_character_)) %>%
  fill(state) %>%
  filter(!is_state) %>%
  select(-is_state)
head(df, 2)
#> X1
#> 1 Statewide policy in place
#> 2 Definition or title of program
#> X2
#> 1 Yes
#> 2 Dual Enrollment – Postsecondary Institutions. High school students are allowed to take college courses for credit either at a high school or on a college campus.
#> state
#> 1 Alabama
#> 2 Alabama
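If the original idea of one data frame per state, held in a list, is still preferred, the tidy data frame above can simply be split on the state column (a small follow-up sketch):

# A named list with one data frame per state
state_tables <- split(df, df$state)

# e.g. look at a single state's question/answer table
state_tables[["Alabama"]]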

How to scrape data from multiple wikipedia pages with R?

I am new to data scraping in R, but I would like to do the following. I have a list of celebrities, celebs, and I would like to grab their dates of birth from Wikipedia. I know how to do it for each individual celebrity, but I am trying to automate this process.
celebs <- c("Tom Hanks", "Tim Cook", "Michael Bloomberg")
I do the following to get the information I need for the first celebrity, Tom Hanks.
library(rvest)

wiki <- read_html("https://en.wikipedia.org/wiki/Tom_Hanks")

birth_date <- wiki %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
  html_text()
Is there a way to get the information I need for Tim Cook and Michael Bloomberg without manually editing the above code?
Welcome to SO.
To do any task repeatedly with code, you should always look to build a loop. Before you can build a loop, you should try to build a single iteration of it. You almost have that ready here, but there are a few missing steps.
First of all, we should try to generalize the code so that it works by simply switching the value of one variable taken from your vector of iterators (celebs).
person <- "Tom Hanks"
Now, using that, we need to create the Wikipedia link through code. There are two things to consider here:
We need to add the base URL before the name of the person;
We should replace the space in "Tom Hanks" with an underscore.
We can do that with this code:
library(stringr)  # for str_replace_all()

link <- paste0("https://en.wikipedia.org/wiki/",
               str_replace_all(person, " ", "_"))
This creates the correct link, which we can use for the subsequent steps. Now it is just a question of iterating through the celebs vector. There are many ways to do it, but in R the most appropriate would be with an sapply. For that, we will create an anonymous function that takes a person's name as input, queries Wikipedia and extracts their birthday, using the code that you have already written:
function(person) {
  link <- paste0("https://en.wikipedia.org/wiki/",
                 str_replace_all(person, " ", "_"))
  wiki <- read_html(link)
  birth_date <- wiki %>%
    html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
    html_text()
  return(birth_date)
}
You can now wrap an sapply structure around that:
birthdates <- sapply(celebs, function(person) {
  link <- paste0("https://en.wikipedia.org/wiki/",
                 str_replace_all(person, " ", "_"))
  wiki <- read_html(link)
  birth_date <- wiki %>%
    html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
    html_text()
  return(birth_date)
})
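Assuming the XPath returns exactly one text node per page, birthdates will be a character vector named after the celebrities; if a small table is preferred, it can be combined with the original vector:

# Named vector -> two-column data frame
birth_df <- data.frame(celeb = celebs,
                       birth_date = unname(birthdates),
                       stringsAsFactors = FALSE)
birth_df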

Extract text and numbers from web page using regex in R

I want to use R to extract text and numbers from the following page: https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=PA0261696&pgm_sys_acrnm_in=NPDES
Specifically, I want the NPDES SIC code and the description, which is 6515 and "Operators of residential mobile home sites" here.
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
test <- test %>% html_nodes("tr") %>% html_text()
# This extracts 31 lines of text; here is what my target text looks like:
# [10] "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n"
Ideally, I'd like the following: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"
How would I do this? I'm trying and failing at regex here even just trying to extract the number 6515 alone, which I thought would be easiest...
as.numeric(sub(".*?NPDES.*?(\\d{4}).*", "\\1", test))
# 4424
Any advice?
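One way the extraction could be finished from the text you already have, without a complicated regex, is to find the row that contains the SIC code and split it on the embedded newlines (a sketch that assumes the target row looks exactly like the example above):

# Find the row containing "NPDES" followed by a 4-digit code, then split on newlines
target <- grep("NPDES\n\\d{4}", test, value = TRUE)[1]
parts  <- strsplit(target, "\n")[[1]]
paste(parts[2], parts[3])
# intended result: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"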
From what I can see, your information resides in a table, so it might be better to just extract it as a data frame. This works:
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
tables <- html_nodes(test, "table")
tables
SIC <- as.data.frame(html_table(tables[5], fill = TRUE))

How to fix incorrect postcodes when the address is correct

I am cleaning a company file that contains the addresses and postal codes of the companies.
Some companies were added multiple times; however, the postal codes differ. This is probably caused by human error, but it makes working with the dataset very difficult.
The dataset would look something like this:
Company  | Address     | Postal Code
Company1 | Limestreet  | 4444ER
Company1 | Limestreet  | 4445ER
Company2 | Applestreet | 3745BB
I would like to check which companies have different postal codes. Since the company names are often spelled differently too (also due to human error), it would be best to check this based on matching addresses.
I've tried to solve this with the tidyverse, but it's not working. My plan was to find all the faulty postal codes and correct them manually. However, if there are too many, I might have to find a way to do it more efficiently. So not only would I like to ask for advice on how to detect the errors, but I'd also like advice on how to correct them in R. Maybe point me towards some good packages or pages describing how to fix this?
df2 <- df1 %>%
  select(Adress, PostalCode) %>%
  group_by(Adress) %>%
  summarise(n())
To create a mock example of the dataset:
company <- c("company1", "company1", "company2", "company2", "company3")
Address <- c("Limestreet", "Limestreet", "Applestreet", "Applestreet",
             "Pearstreet")
Postal_code <- c("4444ER", "4445ER", "3745BB", "3745BC", "8743IJ")
trail_data <- data.frame(company, Address, Postal_code)
I think you were close with your code, but I would just show the rows that differ. This will show you the ones to focus on.
library(dplyr)

trail_data %>%
  select(Address, Postal_code) %>%
  group_by(Address) %>%
  unique() %>%
  filter(n() > 1)
I think we need a little more information from your database to get to the final answer, BUT you can start by writing a little code that identifies whether, when the data are sorted, there is a discrepancy in the postal codes. Note that I added one more row of data (company 3) that serves as a "non-discrepant" instance.
I created a new variable called same, which is equal to 0 when the company name and address match the previous row but the postal code differs, and 1 otherwise. You can use this information, together with other data (which we don't have), to determine which value might be the correct one.
company <- c("company1", "company1", "company2", "company2", "company3", "company3")
Address <- c("Limestreet", "Limestreet", "Applestreet", "Applestreet",
             "Pearstreet", "Pearstreet")
Postal_code <- c("4444ER", "4445ER", "3745BB", "3745BC", "8743IJ", "8743IJ")
trail_data <- data.frame(company, Address, Postal_code)
# uses dplyr::lag(); assumes dplyr (or the tidyverse) is loaded
trail_data$same <- ifelse(trail_data$company == lag(trail_data$company, 1) &
                            trail_data$Address == lag(trail_data$Address, 1) &
                            trail_data$Postal_code != lag(trail_data$Postal_code, 1),
                          0, 1)
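Note that the first row has no previous row, so lag() returns NA there and same ends up as NA for that row; if the desired behavior is to treat the first row as non-discrepant (an assumption), it can be filled in afterwards:

trail_data$same[is.na(trail_data$same)] <- 1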
