I'm trying to scrape some data from websites with rvest. I have a tibble of thousands of URLs, and I need to extract one piece of data from each URL. To avoid being blocked by the main site I'm visiting, I need to pause for about 2 minutes after every 200 URLs I visit (learned this via trial and error). I'm wondering how I can use Sys.sleep to do this.
My current code is below. I am going to each URL in url_tibble and pulling data (".verified").
# Function to extract data
get_data <- function(x) {
  read_html(x) %>%
    html_nodes(".verified") %>%
    html_attr("href")
}

# Extract data
data_I_need <- url_tibble %>%
  mutate(profile = map(url, ~ get_data(.x)))
This code works for a limited number of URLs, until I get blocked for trying to scrape the site too quickly. To avoid being blocked, I'd like to pause for 2 minutes after each 200 URLs using Sys.sleep. Can you help me figure out how to do this?
The best recommendation I found for how to do this was the solution posted on "Recommendation when using Sys.sleep() in R with rvest", but I can't figure out how to integrate it with my code, since that solution uses loops instead of map. I tried doing something like this:
output <- vector(length = length(url_tibble$url))

for (i in 1:length(url_tibble$url)) {
  data_I_need <- read_html(url_tibble$url[i]) %>%
    html_nodes(".verified") %>%
    html_attr("href")
  output[i] <- data_I_need
  if ((i %% 200) == 0) {
    Sys.sleep(160)
  }
}
However, this does not work either, and I receive an error message.
We can use lapply in lieu of a loop. Also, I have added https:// to each URL so that read_html recognises them as links rather than local files. For the actual data, replace 2 with 200 (and the 20-second sleep with 120 seconds for the two-minute pause).
lapply(1:length(url_tibble$url), function(x) {
  if (x %% 2 == 0) {
    print(paste0("Sleeping at ", x))
    Sys.sleep(20)
  }
  read_html(paste0("https://", url_tibble$url[x])) %>%
    html_nodes(".verified") %>%
    html_attr("href")
})
Output (truncated)
[1] "Sleeping at 2"
[1] "Sleeping at 4"
[1] "Sleeping at 6"
[[1]]
[1] "https://www.psychologytoday.com/us/therapists/aak-bright-start-rego-park-ny/936718"
[2] "https://www.psychologytoday.com/us/therapists/leslie-aaron-new-york-ny/148793"
[3] "https://www.psychologytoday.com/us/therapists/lindsay-aaron-frieman-new-york-ny/761657"
[4] "https://www.psychologytoday.com/us/therapists/fay-m-aaronson-brooklyn-ny/840861"
[5] "https://www.psychologytoday.com/us/therapists/anita-aasen-staten-island-ny/291614"
[6] "https://www.psychologytoday.com/us/therapists/aask-therapeutic-services-fishkill-ny/185423"
[7] "https://www.psychologytoday.com/us/therapists/amanda-abady-brooklyn-ny/935849"
[8] "https://www.psychologytoday.com/us/therapists/denise-abatemarco-new-york-ny/143678"
[9] "https://www.psychologytoday.com/us/therapists/raya-abat-robinson-new-york-ny/810730"
I am using JSON to scrape the content of multiple (1000) links. However, some of the links do not return valid JSON, so there is no content to be scraped. Because of this, my code stops working when it hits one of those links.
I have tried to use tryCatch to avoid the error, but it does not seem to be working.
Here is the code I am using
library(jsonlite)
library(rvest)
lapply(links_jason[1:6], function(x) {
  tryCatch(
    {
      json_data <- read_html(x) %>%
        html_text() %>%
        jsonlite::fromJSON(.) %>%
        select(1)
    },
    error = function(cond) return(NULL),
    finally = print(x)
  )
})
This is the issue I am getting
Debug location is approximate because the source is not available
Here are some examples of the links I am trying to scrape
Links number 1, 2 and 6 work fine; 3, 4 and 5 need to be skipped.
> head(links_jason)
[1] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
[2] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
[3] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
[4] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
[5] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
[6] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json"
I have also tried to use if statements with no results. Could anyone help? Thanks!
Read directly with jsonlite and test the length of the return value:
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if (length(json_data) > 0) {
    print(x)
  }
})
Or something like:
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if (length(json_data) == 0) {
    json_data <- NA
  } else {
    print('doing something with json_data')
  }
})
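If you also want to guard against links that fail outright (not just ones that return empty content), you can keep a tryCatch around read_json. A rough sketch, assuming the same links_jason vector (safe_read is a hypothetical helper name):

library(jsonlite)

# hypothetical helper: returns NULL for empty or failed responses
safe_read <- function(x) {
  tryCatch(
    {
      json_data <- jsonlite::read_json(x)
      if (length(json_data) == 0) NULL else json_data
    },
    error = function(cond) NULL
  )
}

results <- lapply(links_jason[1:6], safe_read)
# drop the links that returned nothing
results <- results[!vapply(results, is.null, logical(1))]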
I have a dataframe pubs with two columns: url and html.node. I want to write a loop that reads each url, retrieves the html contents, extracts the information indicated by the html.node column, and accumulates it in a data frame or list.
All URLs are different, and all html nodes are different. My code so far is:
score <- vector()
k <- 1

for (r in 1:nrow(pubs)) {
  art.url <- pubs[r, 1]   # column 1 contains URL
  art.node <- pubs[r, 2]  # column 2 contains html nodes as characters
  art.contents <- read_html(art.url)
  score <- art.contents %>% html_nodes(art.node) %>% html_text()
  k <- k + 1
  print(score)
}
I appreciate your help.
First of all, be sure that each site you're going to scrape allows you to scrape its data; you can run into legal issues if you break the rules.
(Note: I've used only http://toscrape.com/, a sandbox site for scraping practice, since you did not provide your data.)
After that, you can proceed with this; hope it helps:
# first, your data; I assume it looks something like this
pubs <- data.frame(site = c("http://quotes.toscrape.com/",
                            "http://quotes.toscrape.com/"),
                   html.node = c(".text", ".author"),
                   stringsAsFactors = F)
Then the loop you required:
library(rvest)

# an empty list, to fill with the scraped data
empty_list <- list()

# here you are going to fill the list with the scraped data
for (i in 1:nrow(pubs)) {
  art.url <- pubs[i, 1]   # choose the site as you did
  art.node <- pubs[i, 2]  # choose the node as you did
  # scrape it!
  empty_list[[i]] <- read_html(art.url) %>% html_nodes(art.node) %>% html_text()
}
Now the result is a plain list, but with:
names(empty_list) <- pubs$site
you add the name of each site to the corresponding element of the list, giving this result:
$`http://quotes.toscrape.com/`
[1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
[2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
[3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
[4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
[5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
[6] "“Try not to become a man of success. Rather become a man of value.”"
[7] "“It is better to be hated for what you are than to be loved for what you are not.”"
[8] "“I have not failed. I've just found 10,000 ways that won't work.”"
[9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
[10] "“A day without sunshine is like, you know, night.”"
$`http://quotes.toscrape.com/`
[1] "Albert Einstein" "J.K. Rowling" "Albert Einstein" "Jane Austen" "Marilyn Monroe" "Albert Einstein" "André Gide"
[8] "Thomas A. Edison" "Eleanor Roosevelt" "Steve Martin"
Clearly it should work with different sites, and different nodes.
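If you would rather accumulate the results in a data frame than a list, one option (just a sketch, reusing the pubs and empty_list objects from above) is to flatten the list and repeat the site/node columns to match:

# one row per scraped value; lengths() repeats each site/node the right number of times
scraped_df <- data.frame(
  site = rep(pubs$site, lengths(empty_list)),
  node = rep(pubs$html.node, lengths(empty_list)),
  text = unlist(empty_list),
  stringsAsFactors = FALSE
)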
You could also use map from the purrr package instead of a loop:
expand.grid(c("http://quotes.toscrape.com/", "http://quotes.toscrape.com/tag/inspirational/"), # vector of urls
            c(".text", ".author"),    # vector of nodes
            stringsAsFactors = FALSE) %>% # assuming that the same nodes are relevant for all urls, otherwise you would have to do something like a join
  as_tibble() %>%
  set_names(c("url", "node")) %>%
  mutate(out = map2(url, node, ~ read_html(.x) %>% html_nodes(.y) %>% html_text())) %>%
  unnest()
I am trying to download a bunch of zip files from the website
https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml
Any suggestions? I have tried using rvest to identify the href, but have not had any luck.
We can avoid platform-specific issues with download.file() and handle the downloads with httr.
First, we'll read in the page:
library(xml2)
library(httr)
library(rvest)
library(tidyverse)
pg <- read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml")
Now, we'll target all the .zip file links. They're relative paths (the visible link text on the page is just "Zip"), so we'll prepend the URL prefix to them as well:
html_nodes(pg, xpath = ".//a[contains(@href, '.zip')]") %>% # this xpath gets _all_ of them
  html_attr("href") %>%
  sprintf("https://mesonet.agron.iastate.edu%s", .) -> zip_urls
Here's a sample of what ^^ looks like:
head(zip_urls)
## [1] "https://mesonet.agron.iastate.edu/data/gis/shape/4326/us/current_ww.zip"
## [2] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip"
## [3] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_tsmf.zip"
## [4] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_all.zip"
## [5] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_tsmf.zip"
## [6] "https://mesonet.agron.iastate.edu/pickup/wwa/1988_all.zip"
There are 84 of them:
length(zip_urls)
## [1] 84
So we'll make sure to include a Sys.sleep(5) in our download walker so we aren't hammering their servers since our needs are not more important than the site's.
Make a place to store things:
dir.create("mesonet-dl")
This could also be done with a for loop but using purrr::walk makes it fairly explicit we're generating side effects (i.e. downloading to disk and not modifying anything in the R environment):
walk(zip_urls, ~{
  message("Downloading: ", .x) # keep us informed
  # this is way better than download.file(). Read the httr man page on write_disk
  httr::GET(
    url = .x,
    httr::write_disk(file.path("mesonet-dl", basename(.x)))
  )
  Sys.sleep(5) # be kind
})
We use file.path() to construct the save-file location in a platform-agnostic way, and we use basename() to extract the filename portion (rather than regex hacking) since it's a C-backed R internal function that is aware of platform idiosyncrasies.
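For example, with one of the URLs from above:

basename("https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip")
## [1] "1986_all.zip"

file.path("mesonet-dl", "1986_all.zip")
## [1] "mesonet-dl/1986_all.zip"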
This should work
library(tidyverse)
library(rvest)

setwd("YourDirectoryName") # set the directory where you want to download all files

read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml") %>%
  html_nodes(".table-striped a") %>%
  html_attr("href") %>%
  lapply(function(x) {
    filename <- str_extract(x, pattern = "(?<=wwa/).*") # this extracts the filename from the url
    paste0("https://mesonet.agron.iastate.edu", x) %>%  # this creates the full url from the href
      download.file(destfile = filename, mode = "wb")
    Sys.sleep(5)
  })
I am quite new to web scraping, and I am trying to scrape the 5-yr market value from the FiveThirtyEight page linked here (https://projects.fivethirtyeight.com/carmelo/kyrie-irving/). This is the code I am running, using the rvest package:
kyrie_irving <- read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")

kyrie_irving %>%
  html_node(".market-value") %>%
  html_text() %>%
  as.numeric()
However the output looks like this:
> kyrie_irving <-
read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
> kyrie_irving %>%
+ html_node(".market-value") %>%
+ html_text() %>%
+ as.numeric()
[1] NA
I'm just wondering where I am going wrong with this?
EDIT: I have tried using RSelenium to do this and still get no value returned. I am really lost as to what the problem is. Here is the code:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
elem <- remDr$findElement(using="css selector", value=".market-value")
elemtxt <- elem$getElementAttribute("div")
RSelenium works; you just need to change the last line of code and you can get the result.
elem$getElementText()
[[1]]
[1] "$136.5m"
By the way, the result is a string, so you need to remove the $ and the m before you can parse it into a number.
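For instance, in base R (a quick sketch; the string is the one returned above):

val <- "$136.5m"
as.numeric(gsub("[$m]", "", val))
## [1] 136.5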
I know that this question has been asked here tons of times, but after reading a bunch of topics I'm still stuck on this :(. I have a list of scraped html nodes like this
http://bit.d o/bnRinN9
and I just want to clean out all the code parts. Unfortunately I'm a newbie, and the only thing that comes to mind is the Cthulhu way (regex, argh!). How can I do this?
*I put a space between "d" and "o" in the domain name because SO doesn't allow posting that link
This uses the data linked in "Why R can't scrape these links?", which was downloaded locally.
library(rvest)
library(stringr)
# read the saved htm page and make one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")
# the links are within a table, within spans. There isn't much structure
# and no identifiers, so it needs a little hacking to get the right elements.
# There probably are smarter css selectors that could avoid the hacks
spans <- read_html(text) %>% xml_nodes(css = "table tbody tr td span")
# extract all the short links -- but remove the links to edit
# note these links have a trailing dash - links to the statistics
# not the content
short_links <- spans %>% xml_nodes("a") %>% xml_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]
# the real urls are in the html text, prefixed with http
span_text <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]
# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
The rvest library includes many simple functions for scraping and processing html. It depends on the xml2 package. Generally you can scrape and filter in one step.
It's not clear whether you want to extract the href value or the html text, which are the same in your example. This code extracts the href value by finding the a nodes and then reading the href attribute. Alternatively, you can use html_text to get the link display text.
library(rvest)
links <- list('
<a href="http://anydomain.com/bnRinN9">
<a href="domain.com/page">
')
# make one string
text <- paste0(links, collapse = "\n")
hrefs <- read_html(text) %>% xml_nodes("a") %>% xml_attr("href")
hrefs
## [1] "http://anydomain.com/bnRinN9" "domain.com/page"