How do I add a loop when using R to scrape data?

I'm trying to create a database of crime data by zip code based on Trulia.com's data. I have the code below, but so far it only produces one line of data. In the code, Zipcodes is just a list of US zip codes. Can anyone tell me what I need to add so this runs through my entire list "i"?
Here is a link to one of the Trulia pages for reference: https://www.trulia.com/real_estate/20004-Washington/crime/
UPDATE:
Here are zip codes for download: https://www.dropbox.com/s/uxukqpu0v88d7tf/Zip%20Code%20Database%20wo%20Boston.xlsx?dl=0
I also changed the code a bit after realizing the crime stats appear in different orders depending on the zip code. Is it possible to have the loop produce four lines per zip code? The code below runs, but it only keeps the last zip code in the dataset. I can't figure out how to record each zip code's data on separate lines, so each iteration overwrites the previous one and only the last zip code remains.
Please help!!
library(rvest)

data <- data.frame(Zipcodes)

for (i in data$Zip.Code) {
  site <- paste("https://www.trulia.com/real_estate/", i, "-Boston/crime/", sep = "")
  site <- read_html(site)
  crime <- data.frame(zip = i,
                      type = site %>% html_nodes(".brs") %>% html_text(),
                      stringsAsFactors = FALSE)
}

View(crime)
If that code doesn't work, try this:
data <- data.frame(Zillow_Data_for_R_Test)

for (i in data$Zip.Code) {
  site <- paste("https://www.trulia.com/real_estate/", i, "-Boston/crime/", sep = "")
  site <- read_html(site)
  crime <- data.frame(zip = i,
                      theft     = site %>% html_nodes(".crime-text-0") %>% html_text(),
                      assault   = site %>% html_nodes(".crime-text-1") %>% html_text(),
                      arrest    = site %>% html_nodes(".crime-text-2") %>% html_text(),
                      vandalism = site %>% html_nodes(".crime-text-3") %>% html_text(),
                      robbery   = site %>% html_nodes(".crime-text-4") %>% html_text(),
                      type      = site %>% html_nodes(".clearfix") %>% html_text(),
                      stringsAsFactors = FALSE)
}

View(crime)
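For reference, a minimal sketch of the usual fix for the overwriting problem, assuming the Zipcodes data and the .brs selector from the first attempt above: accumulate each iteration's data.frame in a list and bind once at the end, so no iteration overwrites the last.

library(rvest)

results <- list()
for (i in data$Zip.Code) {
  site <- read_html(paste0("https://www.trulia.com/real_estate/", i, "-Boston/crime/"))
  # store each zip code's rows under its own list element instead of overwriting crime
  results[[as.character(i)]] <- data.frame(
    zip  = i,
    type = site %>% html_nodes(".brs") %>% html_text(),
    stringsAsFactors = FALSE
  )
}
crime <- do.call(rbind, results)  # one data.frame with several rows per zip code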

The comment from @r2evans already provides an answer. Since @ShanCham asked how to actually implement it, I wanted to offer the following code, which is simply more verbose than the comment and could therefore not be posted as an additional comment.
library(rvest)

# only two exemplary zip codes; could be more, of course
zipcodes <- c("02110", "02125")

crime <- lapply(zipcodes, function(z) {
  site <- read_html(paste0("https://www.trulia.com/real_estate/", z, "-Boston/crime/"))
  # for illustrative purposes:
  # introduced as.numeric for the numeric columns,
  # excluded some of your other columns, and shortened the text in type
  data.frame(zip     = z,
             theft   = site %>% html_nodes(".crime-text-0") %>% html_text() %>% as.numeric(),
             assault = site %>% html_nodes(".crime-text-1") %>% html_text() %>% as.numeric(),
             type    = site %>% html_nodes(".clearfix") %>% html_text() %>% paste(collapse = " ") %>% substr(1, 50),
             stringsAsFactors = FALSE)
})
class(crime)
# [1] "list"

# the output is a list of data.frames that can be bound together into one data.frame
crime <- do.call(rbind, crime)

# crime is a data.frame, hence classes/types are kept
class(crime$type)
# [1] "character"
class(crime$assault)
# [1] "numeric"

Related

How to convert this code into a for-loop and have it automatically fill in the appropriate data in r?

I created another post related to this here, but it caused a lot of confusion, so hopefully this post is clearer. FYI, the code I initially wrote for the loop (which didn't do what I needed) is also in the link above if you want to see it.
Goal
I'm trying to scrape all the competitor data from the IBJJF website: each competitor's division, gender, belt, and weight. Although these stats only appear once at the top of the page, the code below appropriately fills in the correct information for each competitor.
The problem
When this code is used in a for-loop, the loop does not automatically fill in the appropriate information for each competitor. For example, if the belt is black, I would like it to say black next to every competitor on that page, but I cannot figure out how to do that.
Question
How would you change this code into a loop so that the appropriate division, gender, belt, and weight appear next to each competitor's name? Or, if you were trying to capture all the data I mentioned from each page, how would you go about doing it?
library(rvest)
library(tidyverse)

MensUrl <- read_html('https://www.bjjcompsystem.com/tournaments/1869/categories/2053146')

## SCRAPE FIGHT INFO -------------------------------------------
# fight info
ageDivision <- MensUrl %>%
  html_nodes('.category-title__age-division') %>%
  html_text()

gender <- MensUrl %>%
  html_nodes('.category-title__age-division+ .category-title__label') %>%
  html_text()

belt <- MensUrl %>%
  html_nodes('.category-title__label:nth-child(3)') %>%
  html_text()

weight <- MensUrl %>%
  html_nodes('.category-title__label:nth-child(4)') %>%
  html_text()

fightAndMat <- MensUrl %>%
  html_nodes('.bracket-match-header__where , .bracket-match-header__fight') %>%
  html_text()

date <- MensUrl %>%
  html_nodes('.bracket-match-header__when') %>%
  html_text()

CompetitorNo <- MensUrl %>%
  html_nodes('.match-card__competitor-n') %>%
  html_text()

name <- MensUrl %>%
  html_nodes('.match-card__competitor-description div:nth-child(1)') %>%
  html_text()

gym <- MensUrl %>%
  html_nodes('.match-card__club-name') %>%
  html_text()

#### create match df ####
matches <- data.frame('division'    = ageDivision,
                      'gender'      = gender,
                      'belt'        = belt,
                      'weight'      = weight,
                      'fightAndMat' = fightAndMat,
                      'date'        = date,
                      'competitor'  = CompetitorNo,
                      'name'        = name,
                      'gym'         = gym)
This is what the above code produces when put in a data frame. I want the final result to look the same.
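To loop this over several brackets, one approach (a sketch of my own, not from the original thread) is to wrap the single-page scrape in a function and map it over the category URLs. data.frame() recycles the length-one header fields (division, belt, weight) across the competitor rows, which is what puts them next to each name. The category IDs below are hypothetical placeholders; substitute the real ones.

library(rvest)
library(tidyverse)

scrape_category <- function(url) {
  pg <- read_html(url)
  data.frame(
    division = pg %>% html_nodes('.category-title__age-division') %>% html_text(),
    belt     = pg %>% html_nodes('.category-title__label:nth-child(3)') %>% html_text(),
    weight   = pg %>% html_nodes('.category-title__label:nth-child(4)') %>% html_text(),
    name     = pg %>% html_nodes('.match-card__competitor-description div:nth-child(1)') %>% html_text(),
    gym      = pg %>% html_nodes('.match-card__club-name') %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# hypothetical category IDs for illustration only
ids  <- c(2053146, 2053147)
urls <- paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', ids)

# one combined data frame, one row per competitor, header fields repeated per row
allMatches <- map_dfr(urls, scrape_category)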

Extracting repeated class with rvest html_elements in R

How are you? I am trying to extract some info from this sports betting webpage using rvest. I asked a related question a few days ago and got almost 100% of the way to my goal. So far, thanks to you, I have successfully extracted the title, the score, and the time of the matches being played using the following code:
library(rvest)
library(tidyverse)

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
  read_html()

data <- data.frame(
  Titulo = page %>%
    html_elements(".titulo") %>%
    html_text(),
  Marcador = page %>%
    html_elements(".marcador") %>%
    html_text(),
  Tiempo = page %>%
    html_elements(".marcador+ span") %>%
    html_text() %>%
    str_squish()
)
Now I want to capture the repeated values. For example, if the country of a match is "Brasil", I want the data frame to say the country is Brasil for every match in that category. So far I have only managed to extract the countries individually. The same applies to the sport name and the tournament.
Can you help me with that? Thanks in advance.
You could rewrite your code to use separate functions that work with different levels of information. These can be called in a nested fashion, making the code easier to read.
Essentially, nested map_dfr() calls produce a single dataframe from functions working with lists at different levels within the DOM.
Below, you can think of it as an outer loop over sports, an intermediate loop over countries, and an innermost loop over events within a sport and country.
library(rvest)
library(tidyverse)

get_sport_info <- function(sport) {
  df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
  df$sport <- sport %>%
    html_element(".sport-name") %>%
    html_text()
  return(df)
}

get_play_info <- function(play) {
  df <- map_dfr(play %>% html_elements(".event"), ~
    data.frame(
      titulo   = .x %>% html_element(".titulo") %>% html_text(),
      marcador = .x %>% html_element(".marcador") %>% html_text(),
      tiempo   = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
    ))
  df$country <- play %>%
    html_element(".category-name") %>%
    html_text()
  return(df)
}

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()

sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)

What is the best practice for merging additional variables into data based on row-specific information when web scraping in R using 'rvest'?

I'm currently web scraping the IMDB website to extract movie data.
I would like to know how you would solve this problem.
library(tidyverse)
library(data.table)
library(rvest)
library(janitor)

# top rated movies website
url <- 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'

# extract the title of the movies using rvest
titles <- url %>%
  read_html() %>%
  html_nodes('.titleColumn a') %>%
  html_text() %>%
  as.data.table() %>%
  setnames(., old = colnames(.), new = 'title')

# extract links to each of the titles; this will be the reference
links <- url %>%
  read_html() %>%
  html_nodes('.titleColumn a') %>%
  html_attr('href') %>%
  as.data.table() %>%
  setnames(., old = colnames(.), new = 'links')

# creating a DT with the data
movies <- cbind(titles, links)
Now I have a movies DT with title and links as columns. Next, I would like to extract additional data for each movie using the links. I will continue using the first row as an example.
# the first link in movies
link <- 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=NJ52X0MM1V9FKSPBT46G&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'

# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'

# get budget data
budget <- link %>%
  read_html() %>%
  html_nodes(select) %>%
  html_text() %>%
  gsub('\\n', '', .) %>%
  str_split(., '\\:') %>%
  as.data.table() %>%
  janitor::row_to_names(row_number = 1) %>%
  setnames(., old = colnames(.), new = tolower(gsub(' ', '_', str_trim(colnames(.)))))

budget[, (colnames(budget))] <- lapply(budget, function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
I would like to pull data for each link of movies and merge it into DT to have a final DT with 6 columns; 'title', 'link' + four budget variables. I was trying to create a function that includes the code to get the budget data using each row's link as a parameter and the using 'lapply', I don't think this is the correct approach.
I would like to see if you have a solution to this in an efficient way.
Thanks so much for your help.
I think this would solve your problem:
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'

# get budget data, as a function
get_budget <- function(link, select) {
  budget <- link %>%
    read_html() %>%
    html_nodes(select) %>%
    html_text() %>%
    gsub('\\n', '', .) %>%
    str_split(., '\\:') %>%
    as.data.table() %>%
    janitor::row_to_names(row_number = 1) %>%
    setnames(., old = colnames(.), new = tolower(gsub(' ', '_', str_trim(colnames(.)))))
  budget[, (colnames(budget))] <- lapply(budget, function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
  return(budget)
}

# as your code is slow, I'll subset movies to 10 rows:
movies <- movies[1:10, ]

tmp <- lapply(movies[, links], function(x)
    get_budget(link = paste0("https://www.imdb.com/", x), select = select)) %>%
  rbindlist(., fill = TRUE)

movies <- cbind(movies, tmp)
And your result would look like this: movies_result
Finally, I think this little advice would make your code look cleaner:
setnames doesn't need the . from magrittr; it understands your kind of piped code automatically.
When possible, avoid setnames(., old = colnames(.), new = 'links'). In your case setnames('links') is all that's necessary, since you are renaming all the variables.
setnames(dt, old = oldnames, new = newnames) is only necessary when oldnames is not equal to names(dt).
Since DT is another popular R library, completely unrelated to data.table, it is better to refer to a data.table as what it is: a data.table. A sketch of these setnames forms follows below.
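A quick illustration of the equivalent setnames calls described above (my sketch, using a throwaway one-column table):

library(data.table)
library(magrittr)

dt <- data.table(V1 = c('a', 'b'))

# full form: only needed when renaming a subset of the columns
setnames(dt, old = 'V1', new = 'links')

# when renaming ALL columns, passing just the new names is enough
setnames(dt, 'links2')

# inside a magrittr pipe the leading . is unnecessary;
# setnames modifies dt by reference and returns it invisibly
dt %>% setnames('links3')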

R Web scraping: How to feed URLs into a function

My end goal is to take all 310 articles from this page and its following pages and run each of them through this function:
library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)
library(dplyr)

scrape_docs <- function(URL) {
  doc <- read_html(URL)

  speaker <- html_nodes(doc, ".diet-title a") %>%
    html_text()

  date <- html_nodes(doc, ".date-display-single") %>%
    html_text() %>%
    mdy()

  title <- html_nodes(doc, "h1") %>%
    html_text()

  text <- html_nodes(doc, "div.field-docs-content") %>%
    html_text()

  all_info <- list(speaker = speaker, date = date, title = title, text = text)
  return(all_info)
}
I assume the way forward is to somehow create a list of the URLs I want and then iterate that list through the scrape_docs function. As it stands, however, I'm having a hard time figuring out how to do that. I thought something like the following would work, but I seem to be missing something key, given this error:
xml_attr cannot be applied to object of class "character".
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
pages <- 4

all_links <- tibble()
for (i in seq_len(pages)) {
  page <- paste0(source_col, i) %>%
    read_html() %>%
    html_attr("href") %>%
    html_attr()
  tmp <- page[[1]]
  all_links <- bind_rows(all_links, tmp)
}
all_links
You can get all the URLs by doing:
library(rvest)

source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"

all_urls <- source_col %>%
  read_html() %>%
  html_nodes("td a") %>%
  html_attr("href") %>%
  .[c(FALSE, TRUE)] %>%
  paste0("https://www.presidency.ucsb.edu", .)
Now do the same by changing the page number in source_col to get remaining data.
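For instance, a sketch of that loop (my addition; it assumes the results span pages 0 through 3 of the page= parameter, matching 310 articles at 100 per page):

library(rvest)

base_url <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page="

# repeat the extraction for each results page and combine the links
all_urls <- unlist(lapply(0:3, function(p) {
  paste0(base_url, p) %>%
    read_html() %>%
    html_nodes("td a") %>%
    html_attr("href") %>%
    .[c(FALSE, TRUE)] %>%
    paste0("https://www.presidency.ucsb.edu", .)
}))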
You can then use a for loop or map to extract all the data.
purrr::map(all_urls, scrape_docs)
Testing the function scrape_docs on one URL:
scrape_docs(all_urls[1])
#$speaker
#[1] "Dwight D. Eisenhower"
#
#$date
#[1] "1958-04-02"
#
#$title
#[1] "Special Message to the Congress Relative to Space Science and Exploration."
#
#$text
#[1] "\n To the Congress of the United States:\nRecent developments in long-range
# rockets for military purposes have for the first time provided man with new mac......
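If you would rather end up with a single data frame than a list of lists, a small sketch (my addition): because scrape_docs() returns a named list of length-one fields, map_dfr() binds each document's result into one row.

library(purrr)

# one row per document, assuming every selector matches exactly once per page
docs <- map_dfr(all_urls, scrape_docs)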

R scraping reviews from multiple pages on TripAdvisor

I'm trying to pull a few pages of reviews from TripAdvisor for an academic project.
Here's my attempt using R:
# load libraries
library(rvest)
library(RSelenium)

# main url for stadium
urlmainlist <- c(
  hampdenpark = "http://www.tripadvisor.com.ph/Attraction_Review-g186534-d214132-Reviews-Hampden_Park-Glasgow_Scotland.html"
)

# specify how many search pages and counter
morepglist <- list(
  hampdenpark = seq(10, 360, 10)
)

#----------------------------------------------------------------------------------------------------------
# create pickstadium variable
pickstadium <- "hampdenpark"

# get list of url links corresponding to different pages
# url link for first search page
urllinkmain <- urlmainlist[pickstadium]
# counter for additional pages
morepg <- as.numeric(morepglist[[pickstadium]])

urllinkpre  <- paste(strsplit(urllinkmain, "Reviews-")[[1]][1], "Reviews", sep = "")
urllinkpost <- strsplit(urllinkmain, "Reviews-")[[1]][2]

urllink <- rep(NA, length(morepg) + 1)
urllink[1] <- urllinkmain
for (i in 1:length(morepg)) {
  urllink[i + 1] <- paste(urllinkpre, "-or", morepg[i], "-", urllinkpost, sep = "")
}
head(urllink)

write.csv(urllink, 'urllink.csv')
##########
# SCRAPING
##########
library(rvest)
library(RSelenium)
# install.packages('RSelenium')

testurl <- read.csv("urllink.csv", header = FALSE, quote = "'", stringsAsFactors = FALSE)
testurl <- testurl[-1, ]
testurl <- testurl[, -1]
testurl <- as.data.frame(testurl)
testurl <- gsub('"', "", testurl$testurl)
list <- unlist(testurl)

tripadvisor <- NULL

# scrape
for (i in 1:length(list)) {
  reviews <- list[i] %>%
    read_html() %>%
    html_nodes("#REVIEWS .innerBubble")

  id <- reviews %>%
    html_node(".quote a") %>%
    html_attr("id")

  rating <- reviews %>%
    html_node(".rating .rating_s_fill") %>%
    html_attr("alt") %>%
    gsub(" of 5 stars", "", .) %>%
    as.integer()

  date <- reviews %>%
    html_node(".rating .ratingDate") %>%
    html_attr("title") %>%
    strptime("%b %d, %Y") %>%
    as.POSIXct()

  review <- reviews %>%
    html_node(".entry .partial_entry") %>%
    html_text() %>%
    as.character()

  rowthing <- data.frame(id, review, rating, date, stringsAsFactors = FALSE)
  tripadvisor <- rbind(rowthing, tripadvisor)
}
However, this results in an empty tripadvisor dataframe. Any help fixing this would be appreciated.
Additional Question
I'd like to capture the full reviews, as my code currently captures partial entries only. For each review, I'd like to automatically click the 'More' link and then extract the full review.
Here too, any help would be greatly appreciated.
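For reference, a minimal RSelenium sketch of the click-then-scrape pattern the question describes (my illustration; the ".taLnk.ulBlueLinks" selector for the 'More' link is an assumption, and TripAdvisor's markup changes often, so inspect the live page to confirm):

library(RSelenium)
library(rvest)

# start a local browser session (assumes a compatible driver is installed)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

remDr$navigate(urllink[1])
Sys.sleep(3)  # give the page time to render

# hypothetical selector for the 'More' links
more_links <- remDr$findElements(using = "css selector", ".taLnk.ulBlueLinks")
for (el in more_links) {
  try(el$clickElement(), silent = TRUE)  # expand each truncated review
  Sys.sleep(1)
}

# hand the expanded page back to rvest, as in the code above
page <- read_html(remDr$getPageSource()[[1]])
full_reviews <- page %>%
  html_nodes("#REVIEWS .entry") %>%
  html_text()

remDr$close()
rD$server$stop()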
