I am running into trouble and need help.
I have a list of about 9000 links that I am looping over, doing some processing on each one.
The links look like this:
link1
link2
link3
link4
.....
link9000
The trouble is that the failures are not consistent: sometimes the 2nd link fails with a timeout, and on another run the 2nd link works but the 400th (or any other random link) times out instead. Is there any way I can retry a failed link again and again? I have added:
status_c <- httr::GET(Links, config = httr::config(connecttimeout = 150))
but I still get timeouts. Please help, or suggest anything regarding it. final_links_bind holds the full list of links.
Some sample links:
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
for(i in 1:nrow(final_links_bind)) {
  Links <- final_links_bind[i,]
  BP_ID <- final_bp_bind[i,]
  #print(Links)
  status_c <- GET(Links, timeout(120))
  status <- status_code(status_c)
  if(status == 200){
    url_parse <- read_html(Links)
    col_name <- url_parse %>%
      html_nodes("tr") %>%
      html_text()
    col_name <- stringr::str_remove_all(col_name, "\\t|\\n|\\r")
    pattern_col_no <- grep("využití", col_name)
    col_name <- as.data.frame(col_name)
    method_selected <- col_name[pattern_col_no,]
    WRITE_CSV_DATA <- rbind(WRITE_CSV_DATA, data.frame(BP_ID = BP_ID, method_selected = method_selected, Links = Links))
    #METHOD_OF_USE <- rbind(method_selected, METHOD_OF_USE)
    print(WRITE_CSV_DATA)
  } else {
    print("LINK NOT WORKING")
    no_Links <- sorted_link[i,]
    not_working_link <- rbind(not_working_link, no_Links)
  }
}
It is not clear how you want the final output, but here is how to scrape the pages and skip the links that are not working.
library(rvest)
library(httr2)
library(tidyverse)
Given this data frame of links, notice the third one is not working:
df <- tibble(
  links = c(
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
  )
)
# A tibble: 4 × 1
links
<chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
Create a function that grabs the second table on the page and pulls the value in its third row, second column:
get_info <- function(link) {
  cat("Scraping", link, "\n")
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(2) %>%
    slice(3) %>%
    pull(2)
}
Then mutate() a new column with the info, NA if the link is not working. If a link fails, possibly() returns NA (NA_character_) instead of stopping the code.
df %>%
  mutate(
    info = map_chr(links, possibly(get_info, otherwise = NA_character_))
  )
# A tibble: 4 × 2
links info
<chr> <chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711 rodinný dům
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703 rodinný dům
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999 NA
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789 rodinný dům
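If you would rather retry a failing link a few times before giving up, as the original question asks, one option (a sketch on top of the answer above, not something the answer itself uses) is to wrap get_info() in purrr::insistently(), which re-runs a function with backoff when it errors, and keep possibly() as the final fallback once the retries are exhausted:

# retry each scrape up to 5 times with exponential backoff, then fall back to NA
retry_get_info <- insistently(get_info, rate = rate_backoff(pause_base = 2, max_times = 5))

df %>%
  mutate(
    info = map_chr(links, possibly(retry_get_info, otherwise = NA_character_))
  )

For plain httr requests, httr::RETRY("GET", url, times = 5) provides the same kind of retrying at the request level.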
Related
I have been trying to understand how to use possibly() to wrap a lambda/anonymous function within map_dfr() so that my iterations continue on should an error be encountered. I am currently iterating over a large number of webpages and using rvest to scrape them; however, some are not compiled correctly or do not work. I would simply like to note that error so that I can return to it at a later time while continuing to collect data from the remainder of the webpages. My current code is posted below, in addition to what I've tried:
df <- tibble(df, map_dfr(df$link, ~ {
  # Replicate Human Input by Forcing Random Pauses
  Sys.sleep(runif(1, 1, 3))
  # Read in the html links
  url <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
  # Full Job Description Text
  description <- url %>%
    html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
    html_text() %>%
    tolower()
  description <- as.character(description)
  # Hiring Insights
  hiring_insights <- url %>%
    html_elements(xpath = "//div[@id = 'hiringInsightsSectionRoot']") %>%
    html_text() %>%
    str_extract("#REGEX") %>%
    str_extract("#REGEX") %>%
    str_trim()
  hiring_insights <- as.character(hiring_insights)
  ### Extract Number of Hires
  hiring_insights <- str_trim(str_extract(hiring_insights, "#REGEX"))
  hiring_insights <- tolower(hiring_insights)
  ### Fill in all Missing Values with 1
  hiring_insights[which(is.na(hiring_insights))] <- "1"
  tibble(description, hiring_insights)
}))
I have tried wrapping the lambda function a few different ways but without success:
# First Attempt
df <- tibble(df, map_dfr(df$link, possibly(~ {——}, otherwise = "error")))
# Second Attempt
df <- tibble(df, map_dfr(df$link, possibly(function(x) {——}, otherwise = "error")))
# Third Attempt
df <- tibble(df, possibly(map_dfr(df$link, ~ {——}), otherwise = "error"))
# Fourth Attempt
df <- tibble(df, possibly(map_dfr(df$link, function(x) {——}), otherwise = "error"))
When writing the function with function(x) rather than with the ~, I change .x to x within the lambda function when defining the url variable. However, with each of these attempts I eventually hit a bad link and receive an HTTP 403 error, which stops the iteration and discards all of the data scraped from the previous pages. What I would like is either a dummy variable which notes whether or not the link was bad, or, if it is bad, to fill in the values for the scraped variables with NA (or simply whatever the otherwise argument is set to). Thank you in advance! I've really hit a wall here.
map_dfr() expects a dataframe or named vector on every iteration. Your otherwise value isn’t named, so it throws an error. To illustrate:
library(purrr)
vals <- list(1, 2, "bad", 4, 5)
map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = NA_real_
  )
)
Error in `dplyr::bind_rows()`:
! Argument 3 must have names.
But if you change otherwise to return a dataframe:
map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = data.frame(x = NA_real_)
  )
)
x
1 1
2 4
3 NA
4 16
5 25
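To apply that to your scraping code, wrap the whole lambda in possibly() and have otherwise return a one-row tibble with the same two columns. A sketch (it abbreviates the body and reuses user_agents and df from the question; the hiring_insights placeholder stands in for the real extraction):

safe_scrape <- possibly(
  function(link) {
    # Replicate Human Input by Forcing Random Pauses
    Sys.sleep(runif(1, 1, 3))
    url <- link %>% html_session(user_agent(user_agents)) %>% read_html()
    description <- url %>%
      html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
      html_text() %>%
      tolower() %>%
      paste(collapse = " ")   # collapse so the result is always length one
    hiring_insights <- NA_character_   # placeholder: the hiring-insights extraction from the question goes here
    tibble(description, hiring_insights)
  },
  otherwise = tibble(description = NA_character_, hiring_insights = NA_character_)
)

df <- tibble(df, map_dfr(df$link, safe_scrape))

The NA rows then act as the flag you asked for: filter(is.na(description)) gives you the links to revisit later.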
I am trying to scrape a list of plumbers from http://www.yellowpages.com.au to build a tibble.
The code works fine for each piece (name, phone number, email), but when I put it together in a function to build the tibble it hits an error because some listings don't have phone numbers or emails.
url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"
testscrape <- function(){
  webpage <- read_html(url)
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40", "@")
  return(tibble(docname = docname, ph_no = ph_no, email = email))
}
Then I run the function:
test_run <- testscrape
test_run()
And the following error arrives:
Error: Tibble columns must have compatible sizes.
* Size 36: Existing data.
* Size 17: Column `ph_no`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Browse[1]>
Which leaves it hanging.
I appreciate that there are fewer phone numbers than listed plumbers, so how do I return NA for a plumber with no number, so that the phone numbers align with the relevant plumbers?
Thanks in advance.
You can index each extracted vector up to the length of the longest one; positions that do not exist return NA, so the columns stay aligned even when a value is missing.
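For illustration (toy data, not from the page), indexing a vector past its length pads it with NA:

c("a", "b")[1:4]
# [1] "a" "b" NA  NA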
library(rvest)
library(stringr)
testscrape <- function(url){
  webpage <- read_html(url)
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40", "@")
  # pad every vector to the length of the longest one; missing positions become NA
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)
# docname ph_no email
# <lgl> <lgl> <lgl>
#1 NA NA NA
I'm trying to scrape the links from multiple pages of a web forum, and I'm getting an error message that I'm not sure how to fix.
I tried the following, using rvest and purrr
pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
paste0(1:18000) %>%
paste0(c(".html"))
i<-1
pages.subset<-pages[1:(i+49)==(i+49)]
pages.subset<-as_data_frame(pages.subset)
scrape_links<-function(pages.subset){read_html(pages.subset) %>% html_node(".topictitle") %>% html_attr('href')}
links<-map_df(pages.subset, scrape_links)
However, I got this error message
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=360].
Does anyone have any ideas as to how to solve this?
Although I am not 100% sure what caused the error, it seems that passing an entire data.frame (rather than a vector of URLs) to map_df() blew things up. I have readjusted your code:
library(tidyverse)
library(rvest)
pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
paste0(1:18000) %>%
paste0(c(".html"))
scrape_links <- function(url) {
out <- url %>%
read_html() %>%
html_node(".topictitle") %>%
html_attr("href")
return(out)
}
links <- tibble(page = pages[1:(50) == (50)]) %>%
mutate(url = map_chr(page, scrape_links))
head(links)
# # A tibble: 6 x 2
# page url
# <chr> <chr>
# 1 https://www.immigrationboards.com/eea-route-applications/page50.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 2 https://www.immigrationboards.com/eea-route-applications/page100.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 3 https://www.immigrationboards.com/eea-route-applications/page150.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 4 https://www.immigrationboards.com/eea-route-applications/page200.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 5 https://www.immigrationboards.com/eea-route-applications/page250.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 6 https://www.immigrationboards.com/eea-route-applications/page300.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
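As an aside, the likely cause of the original error is that map_df() was given a data frame: purrr treats a data frame as a list of its columns, so scrape_links() received the whole column of 360 URLs in one go, and read_html() complained about not getting a single string. A tiny illustration (toy data, not from the question):

library(purrr)

# a data frame is a list of its columns, so map() passes each whole column to .f
map(data.frame(urls = c("a", "b", "c")), length)
# $urls
# [1] 3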
We have a df like so:
df <- data.frame(id = c(1, 2, 3, 4, 5),
                 urls = c(NA, NA, "https://www.bing.com",
                          "https://www.bing.com https://www.google.com",
                          "https://github.com/"),
                 stringsAsFactors = FALSE)
Then we have a function that reads in real URLs and gets the title tag of each page, like so:
get_title_tag <- function(url) {
  if (is.na(ifelse(url == "", NA, url))) {
    return(NA)
  }
  else if (identical(xml2::read_html(url), character(0))) {
    return(NA)
  }
  else {
    page <- xml2::read_html(url)
    path_to_title <- "/html/head/title"
    conf_nodes <- rvest::html_nodes(page, xpath = path_to_title)
    title <- rvest::html_text(conf_nodes)
    #return(title)
    return("PAGE_TITLE")
  }
}
The problem is that the element in the 4th position of the urls column contains two consecutive URLs separated by a space, so we get errors. We have looked at several posts here in the forums, but none of them deal with the problem we are facing.
Our goal is to get this output:
> df
id urls
1 1 <NA>
2 2 <NA>
3 3 PAGE_TITLE
4 4 PAGE_TITLE PAGE_TITLE
5 5 PAGE_TITLE
I have tried this method, which separates the URLs but expands the df, which is not what I want:
urls_only_vector <- df %>%
  mutate(urls = strsplit(as.character(urls), " ")) %>%
  unnest(urls) #%>% select("urls")
Using this method I can read the URLs one at a time, but again, since it expands my data frame, I was wondering if there is something else I can do. Can I get a hint please? I will cherish any help.
It is better to split the URLs into separate rows, apply the get_title_tag function to get each title, and then combine the data again by grouping on id so that the size of the data remains the same.
library(dplyr)
df %>%
  tidyr::separate_rows(urls, sep = '\\s+') %>%
  mutate(title = purrr::map_chr(urls, get_title_tag)) %>%
  group_by(id) %>%
  summarise(title = toString(title))
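If you want the result to look exactly like the desired output (space-separated titles in the urls column, and NA kept as NA), a small variation on the same idea should work; the as.character() wrapper just guards against get_title_tag() returning a logical NA inside map_chr():

df %>%
  tidyr::separate_rows(urls, sep = '\\s+') %>%
  mutate(title = purrr::map_chr(urls, ~ as.character(get_title_tag(.x)))) %>%
  group_by(id) %>%
  summarise(urls = if (all(is.na(title))) NA_character_ else paste(title, collapse = " "))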
I am using rvest to (try to) scrape all the author affiliation data from RePEc, a database of academic publications. I have the authors' short IDs (author_reg), which I'm using to scrape the affiliation data. However, I have several columns indicating multiple authors (each of which I need the affiliation data for). When there aren't multiple authors, the cell has an NA value. Some of the columns are mostly NA values, so how do I alter my code so that it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have "character" class, the failed ones, try-error
  gdlink = which(sapply(attempts, class) != "try-error")
  if(length(gdlink) > 0){
    return(attempts[[gdlink[1]]])
  }
  else{
    return("True 404 error")
  }
})
Thanks in advance for your help!
Looking at the target links, you can try the following approach. First, scrape all the links from https://ideas.repec.org/e/ and build the complete URLs. Then, check whether each link exists or not. (There are about 26000 links at this URL, and I do not have time to check them all, so I just used 100 URLs in the following demonstration.) Finally, extract the existing links.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href") %>%
.[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_samples, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract the affiliation information. There are several spots with h3, so I specifically targeted the h3 nodes inside the element whose id = "affiliation" via xpath. If there is no affiliation information, R returns character(0); when enframe() and unnest() are applied, these empty elements are dropped. For instance, pab127 does not have any affiliation information, so there is no entry for this link.
lapply(urls, function(x){
  read_html(x, encoding = "UTF-8") %>%
    html_nodes(xpath = '//*[@id="affiliation"]') %>%
    html_nodes("h3") %>%
    html_text() %>%
    trimws() -> foo
  return(foo)
}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
  unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
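Coming back to the NA question in your original code: if you want to keep the sapply() approach and simply skip the missing IDs without dropping those rows, a small guard at the top of the function is enough. A sketch based on your code (it assumes missing IDs are stored as the string "NA", as in your example vector, and collapses multiple affiliations into one string so sapply() returns a plain character vector):

df$affiliation_author_1 <- sapply(df$author_reg, function(x) {
  # skip missing IDs but keep the row, so the result stays aligned with df
  if (is.na(x) || x == "NA") return(NA_character_)
  links <- c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  attempts <- links %>% map(function(i) {
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text(), silent = TRUE)
  })
  gdlink <- which(sapply(attempts, class) != "try-error")
  if (length(gdlink) > 0) {
    paste(attempts[[gdlink[1]]], collapse = "; ")
  } else {
    "True 404 error"
  }
})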