Scraping Yellowpages in R

I am trying to scrape a list of plumbers from http://www.yellowpages.com.au to build a tibble.
The code works fine for each field on its own (name, phone number, email), but when I put them together in a function to build the tibble it hits an error, because some listings don't have phone numbers or emails.
url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"
testscrape <- function(){
  webpage <- read_html(url)
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","#")
  return(tibble(docname = docname, ph_no = ph_no, email = email))
}
Then I run the function:
test_run <- testscrape
test_run()
And the following errors arrive:
Error: Tibble columns must have compatible sizes.
* Size 36: Existing data.
* Size 17: Column `ph_no`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Browse[1]>
Which leaves the session hanging at the debugger prompt.
I appreciate that there are fewer phone numbers than listed plumbers, so how do I return an NA for each listing that lacks one, so that the phone numbers line up with the relevant plumbers?
Thanks in advance.

You can subset each extracted vector with a common index; indexing past the end of a vector returns NA, so the shorter columns are padded to match the longest one.
library(rvest)
library(stringr)
library(tibble)

testscrape <- function(url){
  webpage <- read_html(url)
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","#")
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)
# docname ph_no email
# <lgl> <lgl> <lgl>
#1 NA NA NA
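To see why the subsetting works: indexing a vector past its length returns NA, so the shorter columns get padded to the length of the longest one. A tiny illustration with made-up values (the vectors here are hypothetical, not scraped):
docname <- c("A Plumbing", "B Plumbing", "C Plumbing")  # hypothetical listing names
ph_no <- c("0411 111 111")                              # only one phone number found
n <- seq_len(max(length(docname), length(ph_no)))
ph_no[n]
# [1] "0411 111 111" NA NA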

Related

Trouble mapping a function to a list of scraped links using rvest

I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage, where I apply the get_injury_data function to the links, but I have been having issues executing this successfully. I get the following error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
I wonder if anyone can help me spot where I am going wrong. The code is as follows:
library(tidyverse)
library(rvest)
# create a function to grab the team links
get_team_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>% # remove rows with the # string
    paste0('https://www.transfermarkt.com', .) %>% # paste the website base onto the url strings
    unique() %>% # keep only unique links
    as_tibble() %>% # turn the strings into a tibble dataset
    rename("links" = "value") %>% # rename the value column
    filter(!grepl('profil', links)) %>% # remove the player profile links that are included
    filter(!grepl('spielplan', links)) %>% # remove links to additional team pages
    mutate(links = gsub("startseite", "kader", links)) # change the link to go to the detailed squad page
}
# create a function to grab the player links
get_player_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>% # remove rows with the # string
    paste0('https://www.transfermarkt.com', .) %>% # paste the website base onto the url strings
    unique() %>% # keep only unique links
    as_tibble() %>% # turn the strings into a tibble dataset
    rename("links" = "value") %>% # rename the value column
    filter(grepl('profil', links)) %>% # keep only the player profile links
    mutate(links = gsub("profil", "verletzungen", links)) # change the link to go to the injury page
}
# create a function to get the injury dataset
get_injury_data <- function(url){
  url %>%
    read_html() %>%
    html_nodes('#yw1') %>%
    html_table()
}
# get team links and save it as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
# get the player links by mapping the function over the team_links dataset
# and then unnest the resulting list of lists into one long list
player_injury_links <- team_links %>%
  mutate(links = map(team_links$links, get_player_links)) %>%
  unnest(links)
# using the player_injury_links list, create a dataset by web scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)
Solution
The issue I was having was that some of the links I was scraping did not have any data.
To overcome this, I used the possibly() function from the purrr package, which let me create a new, error-free version of the call.
The line of code that was giving me trouble now reads as follows:
player_injury_data <- player_injury_links$links %>%
  purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
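Since possibly() returns NULL for links that have no data, the empty results can be dropped afterwards. A minimal follow-up sketch, assuming each remaining page yields a single '#yw1' table with the same columns:
# drop the NULL entries produced by possibly(), then bind the remaining tables together
player_injury_data <- purrr::compact(player_injury_data)
injury_tables <- dplyr::bind_rows(purrr::map(player_injury_data, 1))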

Need help scraping a big archive

For a school project I have to scrape a website, which isn't a problem. But for it to be called big data I wanted to scrape the whole archive (the past 5 years). The only thing that changes in the URL is the date at the end, but I don't know how to write a script that changes only that date.
The website I'm using is this: https://www.ongelukvandaag.nl/archief/ .
The dates I need run from 01-01-2015 until 24-09-2020. I have already figured out the first part of the code and I'm able to scrape one page. I'm a beginner at R and would like to know if anyone could help me. The code is shown below. Thanks in advance!
This is what I have so far; the errors are underneath the code.
install.packages("XML")
install.packages("reshape")
install.packages("robotstxt")
install.packages("Rcrawler")
install.packages("RSelenium")
install.packages("devtools")
install.packages("exifr")
install.packages("Publish")
devtools::install_github("r-lib/xml2")
library(rvest)
library(dplyr)
library(xml)
library(stringr)
library(jsonlite)
library(xml12)
library(purrr)
library(tidyr)
library(reshape)
library(XML)
library(robotstxt)
library(Rcrawler)
library(RSelenium)
library(ps)
library(devtools)
library(exifr)
library(Publish)
#Create an url object
url<-"https://www.ongelukvandaag.nl/archief/%d "
#Verify the web can be scraped
paths_allowed(paths = c(url))
#Obtain the links for every day from 2015 to 2020
map_df(2015:2020, function(i){
page<-read_html(sprintf(url,i))
data.frame(Links = html_attr(html_nodes(page, ".archief a"),"href"))
}) -> Links %>%
Links$Links<-paste("https://www.ongelukvandaag.nl/",Links$Links,sep = "")
#Scrape what you want from each link:
d<- map(Links$Links, function(x) {
Z <- read_html(x)
Date <- Z %>% html_nodes(".text-muted") %>% html_text(trim = TRUE) # Last update
All_title <- Z %>% html_nodes("h2") %>% html_text(trim = TRUE) # Title
return(tibble(All_title,Date))
})
The errors I get:
Error in open.connection(x, "rb") : HTTP error 400.
Error in paste("https://www.ongelukvandaag.nl/", Links$Links, sep = "") : object 'Links' not found
Error in map(Links$Links, function(x) { : object 'Links' not found
Also, the packages "xml12" and "xml" don't work in this version of RStudio.
Take a look at my code and my comments:
library(purrr)
library(rvest) # don't load a lot of libraries if you don't need them
url <- "https://www.ongelukvandaag.nl/archief/"
bigdata <-
  map_dfr(
    2015:2020,
    function(year){
      year_pg <- read_html(paste0(url, year))
      list_dates <- year_pg %>% html_nodes(xpath = "//div[@class='archief']/a") %>% html_text() # in case some dates are missing
      map_dfr(
        list_dates,
        function(date) {
          pg <- read_html(paste0(url, date))
          items <- pg %>% html_nodes("div.full > div.row")
          items <- items[sapply(items, function(x) length(x %>% html_node(xpath = "./descendant::h2"))) > 0] # drop NA items
          data.frame(
            date = date,
            title = items %>% html_node(xpath = "./descendant::h2") %>% html_text(),
            update = items %>% html_node(xpath = "./descendant::h4") %>% html_text(),
            image = items %>% html_node(xpath = "./descendant::img") %>% html_attr("src")
          )
        }
      )
    }
  )
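One caveat, not part of the original answer: the archive covers roughly 2,000 daily pages, so a single failed request would abort the whole run. A hedged sketch of an optional safeguard using purrr::possibly():
# optional: tolerate an occasional failed request instead of aborting the whole run
safe_read <- possibly(read_html, otherwise = NULL)
# inside the inner function you could then write:
#   pg <- safe_read(paste0(url, date))
#   if (is.null(pg)) return(NULL)  # map_dfr drops NULL results, so that day is simply skipped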

Scraping pages with inconsistent lengths in dataframe

I want to scrape all the names from this page, with the result being one tibble with three columns. My code only works if all the data is there, hence my error:
Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 20: Columns `huisarts`, `url`
* Length 21: Column `praktijk`
How can I let my code run but fill the tibble with NAs where the data isn't there?
My code for a pausing robot, used later in the scraper function:
pauzing_robot <- function (periods = c(0, 1)) {
  tictoc <- runif(1, periods[1], periods[2])
  cat(paste0(Sys.time()),
      "- Sleeping for ", round(tictoc, 2), "seconds\n")
  Sys.sleep(tictoc)
}
Scraper:
library(tidyverse)
library(rvest)
scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  tibble(
    huisarts = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_text() %>%
      str_trim(),
    praktijk = page %>%
      html_nodes(".location") %>%
      html_text() %>%
      str_trim(),
    url = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      str_trim() %>%
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}
The total number of pages is 445, but for example's sake I'm only scraping three:
huisartsen <- map_df(sample(1:3), scrape_page)
Page 2 seems to be the one with inconsistent lengths, because this code works:
huisartsen <- map_df(3:4, scrape_page)
If possible with tidyverse code. Thanks in advance.
You need to retrieve the list of parent nodes
parents <- page %>% html_nodes("li.media")
Then parse each parent node with html_node() (note the singular form).
tibble(
  huisarts = parents %>%
    html_node(".media-heading.title.orange") %>%
    html_text() %>%
    str_trim(),
  praktijk = parents %>%
    html_node(".location") %>%
    html_text() %>%
    str_trim(),
  url = parents %>%
    html_node(".media-heading.title.orange a") %>%
    html_attr("href") %>%
    str_trim() %>%
    paste0("https://www.zorgkaartnederland.nl", .)
)
The html_node() function always returns a value, even if it is just NA, so the three columns stay the same length.
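Putting the two pieces together, a sketch of how scrape_page() from the question could be rewritten with the parent-node approach (same selectors and pauzing_robot() as above; scraping pages 1 to 3 is just an example):
scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  parents <- page %>% html_nodes("li.media")  # one parent node per listing
  tibble(
    huisarts = parents %>% html_node(".media-heading.title.orange") %>% html_text() %>% str_trim(),
    praktijk = parents %>% html_node(".location") %>% html_text() %>% str_trim(),
    url = parents %>% html_node(".media-heading.title.orange a") %>% html_attr("href") %>%
      str_trim() %>% paste0("https://www.zorgkaartnederland.nl", .)
  )
}
huisartsen <- map_df(1:3, scrape_page)  # listings without a praktijk now come back as NA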

R Webscraping: How to feed URLS into a function

My end goal is to take all 310 articles from this page and its following pages and run them through this function:
library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)
library(dplyr)
scrape_docs <- function(URL){
  doc <- read_html(URL)
  speaker <- html_nodes(doc, ".diet-title a") %>%
    html_text()
  date <- html_nodes(doc, ".date-display-single") %>%
    html_text() %>%
    mdy()
  title <- html_nodes(doc, "h1") %>%
    html_text()
  text <- html_nodes(doc, "div.field-docs-content") %>%
    html_text()
  all_info <- list(speaker = speaker, date = date, title = title, text = text)
  return(all_info)
}
I assume the way to go forward would be to somehow create a list of the URLs I want, then iterate that list through the scrape_docs function. As it stands, however, I'm having a hard time understanding how to go about that. I thought something like this would work, but I seem to be missing something key given the following error:
xml_attr cannot be applied to object of class "character".
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
pages <- 4
all_links <- tibble()
for(i in seq_len(pages)){
  page <- paste0(source_col, i) %>%
    read_html() %>%
    html_attr("href") %>%
    html_attr()
  tmp <- page[[1]]
  all_links <- bind_rows(all_links, tmp)
}
all_links
You can get all the URLs by doing:
library(rvest)
source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"
all_urls <- source_col %>%
  read_html() %>%
  html_nodes("td a") %>%
  html_attr("href") %>%
  .[c(FALSE, TRUE)] %>%
  paste0("https://www.presidency.ucsb.edu", .)
Now do the same while changing the page number in source_col to get the remaining data; a sketch of that loop follows the example output below.
You can then use a for loop or map to extract all the data from every URL.
purrr::map(all_urls, scrape_docs)
Testing the function scrape_docs on one URL:
scrape_docs(all_urls[1])
#$speaker
#[1] "Dwight D. Eisenhower"
#$date
#[1] "1958-04-02"
#$title
#[1] "Special Message to the Congress Relative to Space Science and Exploration."
#$text
#[1] "\n To the Congress of the United States:\nRecent developments in long-range
# rockets for military purposes have for the first time provided man with new mac......
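A hedged sketch of the paging loop mentioned above. It assumes the search keeps returning 100 items per page, so pages 0 through 3 cover the 310 articles; base_url is simply the original source_col with the trailing page number stripped off:
base_url <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page="
all_urls <- purrr::map(0:3, function(page_no) {
  paste0(base_url, page_no) %>%      # rebuild the search URL for this page number
    read_html() %>%
    html_nodes("td a") %>%
    html_attr("href") %>%
    .[c(FALSE, TRUE)] %>%            # keep every second link, as above
    paste0("https://www.presidency.ucsb.edu", .)
}) %>% unlist()
all_docs <- purrr::map(all_urls, scrape_docs)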

R: Using Rvest to loop through list

I am trying to scrape the prices, areas and addresses of all flats on this page (https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden).
Getting the data for one list element with rvest and XPath works fine (see the code below), but I don't know how to get the ID of each list element in order to loop through all of them.
Here is part of the HTML with the data-go-to-expose-id I need for the loop. How can I get all the IDs?
<span class="slick-bg-layer"></span><img alt="Immobilienbild" class="gallery__image block height-full" src="https://pictures.immobilienscout24.de/listings/541dfd45-c75a-4da7-a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80"></a>
And here is my current R code to fetch the data for one list element:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
address <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[2]/div[2]/a') %>% html_text()
price <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[1]/dd') %>% html_text()
area <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[2]/dd') %>% html_text()
Does this get what you are after?
library("tidyverse")
library("httr")
library("rvest")
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
x <- read_html(url)
x %>%
  html_nodes("#listings") %>%
  html_nodes(".result-list__listing") %>%
  html_attr("data-id")
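To then loop over every listing, a hedged sketch that rebuilds the question's XPath from each data-id. The div positions inside each result node are taken directly from the question's XPath and are an assumption about the page structure; purrr and tibble come with the tidyverse loaded above:
ids <- x %>%
  html_nodes("#listings") %>%
  html_nodes(".result-list__listing") %>%
  html_attr("data-id")

flats <- map_dfr(ids, function(id) {
  node <- x %>% html_node(xpath = paste0('//*[@id="result-', id, '"]'))  # the listing node for this id
  tibble(
    id      = id,
    address = node %>% html_node(xpath = './div[2]/div[2]/div[1]/div[2]/div[2]/a') %>% html_text(),
    price   = node %>% html_node(xpath = './div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[1]/dd') %>% html_text(),
    area    = node %>% html_node(xpath = './div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[2]/dd') %>% html_text()
  )
})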
