I am using JSON to scrape the content of multiple (1000) links. However, some of the links do not return valid JSON, so there is no content to scrape. Because of this, my code stops working when it hits one of those links.
I have tried to use tryCatch to avoid the error, but it does not seem to be working.
Here is the code I am using:
library(jsonlite)
library(rvest)
library(dplyr)   # needed for select()

lapply(links_jason[1:6], function(x) {
  tryCatch(
    {
      json_data <- read_html(x) %>%
        html_text() %>%
        jsonlite::fromJSON(.) %>%
        select(1)
    },
    error = function(cond) return(NULL),
    finally = print(x)
  )
})
This is the issue I am getting:
Debug location is approximate because the source is not available
Here are some examples of the links I am trying to scrape. Links 1, 2 and 6 work fine; 3, 4 and 5 need to be avoided.
> head(links_jason)
[1] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
[2] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
[3] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
[4] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
[5] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
[6] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json"
I have also tried using if statements, with no results. Could anyone help? Thanks!
Read directly with jsonlite and test the length of what it returns.
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if (length(json_data) > 0) {
    print(x)
  }
})
Or something like:
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if (length(json_data) == 0) {
    json_data <- NA
  } else {
    print('doing something with json_data')
  }
})
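If some of the links can also fail outright (404s or timeouts) rather than merely returning empty JSON, the two ideas can be combined: wrap read_json() in tryCatch() and then test what comes back. A minimal sketch, assuming links_jason is defined as above:
library(jsonlite)

results <- lapply(links_jason, function(x) {
  json_data <- tryCatch(
    jsonlite::read_json(x),          # HTTP or parse errors become NULL
    error = function(cond) NULL
  )
  if (is.null(json_data) || length(json_data) == 0) {
    message("Skipping: ", x)
    return(NULL)
  }
  json_data
})

# Keep only the links that returned something
results <- Filter(Negate(is.null), results)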
Related
I'm trying to scrape some data from websites with rvest. I have a tibble of thousands of URLs, and I need to extract one piece of data from each URL. In order not to be blocked by the main site I'm visiting, I need to rest for about 2 minutes after every 200 URLs I visit (learned via trial and error). I'm wondering how I can use Sys.sleep() to do this.
My current code is below. I am going to each URL in url_tibble and pulling data (".verified").
# Function to extract data
get_data <- function(x) {
  read_html(x) %>%
    html_nodes(".verified") %>%
    html_attr("href")
}
# Extract data
data_I_need <- url_tibble %>%
  mutate(profile = map(url, ~ get_data(.x)))
This code works for a limited number of URLs, until I get blocked for scraping the site too quickly. To avoid being blocked, I'd like to pause for 2 minutes after every 200 URLs using Sys.sleep(). Can you help me figure out how to do this?
The best recommendation I found was the solution posted on Recommendation when using Sys.sleep() in R with rvest, but I can't figure out how to integrate it with my code, since that solution uses loops instead of map. I tried doing something like this:
output <- vector(length = length(url_tibble$url))
for (i in 1:length(url_tibble$url)) {
  data_I_need <- read_html(url_tibble$url[i]) %>%
    html_nodes(".verified") %>%
    html_attr("href")
  output[i] <- data_I_need
  if ((i %% 200) == 0) {
    Sys.sleep(160)
  }
}
However, this does not work either, and I receive an error message.
We can use lapply() in lieu of a loop; since lapply() returns a list, each element can hold however many hrefs a page yields. I have also added https:// to each URL so that read_html() recognises them as links rather than files. Replace 2 with 200 for the actual data.
lapply(1:length(url_tibble$url), function(x) {
  if (x %% 2 == 0) {
    print(paste0("Sleeping at ", x))
    Sys.sleep(20)
  }
  read_html(paste0("https://", url_tibble$url[x])) %>%
    html_nodes(".verified") %>%
    html_attr("href")
})
Output (truncated)
[1] "Sleeping at 2"
[1] "Sleeping at 4"
[1] "Sleeping at 6"
[[1]]
[1] "https://www.psychologytoday.com/us/therapists/aak-bright-start-rego-park-ny/936718"
[2] "https://www.psychologytoday.com/us/therapists/leslie-aaron-new-york-ny/148793"
[3] "https://www.psychologytoday.com/us/therapists/lindsay-aaron-frieman-new-york-ny/761657"
[4] "https://www.psychologytoday.com/us/therapists/fay-m-aaronson-brooklyn-ny/840861"
[5] "https://www.psychologytoday.com/us/therapists/anita-aasen-staten-island-ny/291614"
[6] "https://www.psychologytoday.com/us/therapists/aask-therapeutic-services-fishkill-ny/185423"
[7] "https://www.psychologytoday.com/us/therapists/amanda-abady-brooklyn-ny/935849"
[8] "https://www.psychologytoday.com/us/therapists/denise-abatemarco-new-york-ny/143678"
[9] "https://www.psychologytoday.com/us/therapists/raya-abat-robinson-new-york-ny/810730"
I used the following code:
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, domain = '.co.uk', quotes = TRUE)
{
  search.term <- gsub(' ', '%20', search.term)
  if(quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  getGoogleURL <- paste('http://www.google', domain, '/search?q=',
                        search.term, sep = '')
}
getGoogleLinks <- function(google.url)
{
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//a[@href][@class='l']")
  return(sapply(nodes, function(x) x <- xmlAttrs(x)[[1]]))
}
search.term <- "cran"
quotes <- "FALSE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)
I would like to find all the links that resulted from my search, but I get the following result:
> links
list()
How can I get the links?
In addition, I would like to get the headlines and summaries of the Google results; how can I get those?
And finally, is there a way to get the links that reside in the ChillingEffects.org results?
If you look at the html variable, you can see that the search result links are all nested in <h3 class="r"> tags.
Try to change your getGoogleLinks function to:
getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  return(sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]]))
}
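For the headlines the question also asks about, the anchor text of those same nodes should work; getGoogleHeadlines below is a hypothetical helper that reuses the XPath above (the summary markup is not shown here, so summaries are left out):
# Hypothetical helper: same node set as above, but return the anchor text
getGoogleHeadlines <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  sapply(nodes, xmlValue)   # anchor text = result headline
}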
I created this function to read in a list of company names and then get the top website result for each. It will get you started; you can then adjust it as needed.
# libraries (URLencode() comes from the built-in utils package, so only rvest is needed)
library(rvest)

# load data
d <- read.csv("P:\\needWebsites.csv")
c <- as.character(d$Company.Name)

# Function for getting the top website result.
getWebsite <- function(name)
{
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%   # Get all nodes of type cite. You can change this to grab other node types.
    html_text()
  result <- results[1]
  return(as.character(result)) # Return results if you want to see them all.
}

# Apply the function to the list of company names.
websites <- data.frame(Website = sapply(c, getWebsite))
The other solutions here don't work for me. Here's my take on @Bryce-Chamberlain's answer, which works for me as of August 2019. It also answers another closed question: company name to URL in R.
# install.packages("rvest")
get_first_google_link <- function(name, root = TRUE) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- xml2::read_html(url)
  # extract all links
  nodes <- rvest::html_nodes(page, "a")
  links <- rvest::html_attr(nodes, "href")
  # extract the first link of the search results
  link <- links[startsWith(links, "/url?q=")][1]
  # clean it
  link <- sub("^/url\\?q\\=(.*?)\\&sa.*$", "\\1", link)
  # keep only the root if relevant
  if(root) link <- sub("^(https?://.*?/).*$", "\\1", link)
  link
}
companies <- data.frame(company = c("apple acres llc","abbvie inc","apple inc"))
companies <- transform(companies, url = sapply(company,get_first_google_link))
companies
#> company url
#> 1 apple acres llc https://www.appleacresllc.com/
#> 2 abbvie inc https://www.abbvie.com/
#> 3 apple inc https://www.apple.com/
Created on 2019-08-10 by the reprex package (v0.2.1)
The free solutions don't work anymore, and they don't let you search for regions outside your own location. Here's a solution using the Google Custom Search API. The API allows 100 free calls per day, and a single call returns only 10 results, so the function below returns just page 1 (10 results).
library(httr)
library(dplyr)       # mutate(), rename(), select(), row_number()
library(data.table)  # rbindlist()
library(magrittr)    # %>%

Google.Search.API <- function(keyword, google.key, google.cx, country = "us")
{
  # keyword = keywords[10]; country = "us"
  url <- paste0("https://www.googleapis.com/customsearch/v1?"
                , "key=", google.key
                , "&q=", gsub(" ", "+", keyword)
                , "&gl=", country     # country
                , "&hl=en"            # language from browser, English
                , "&cx=", google.cx
                , "&fields=items(link)"
  )
  d2 <- url %>%
    httr::GET(ssl.verifypeer = TRUE) %>%
    httr::content(.) %>% .[["items"]] %>%
    data.table::rbindlist(.) %>%
    mutate(keyword, SERP = row_number(), search.engine = "Google API") %>%
    rename(source = link) %>%
    select(search.engine, keyword, SERP, source)
  pause <- round(runif(1, min = 1.1, max = 5), 1)
  if(nrow(d2) == 0)
  {cat("\nPausing", pause, "seconds. Failed for:", keyword)} else
  {cat("\nPausing", pause, "seconds. Successful for:", keyword)}
  Sys.sleep(pause)
  rm(keyword, country, pause, url, google.key, google.cx)
  return(d2)
}
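A hypothetical call; your.api.key and your.cx.id are placeholders for the credentials you create in the Google API console, and the keywords are arbitrary examples:
library(dplyr)

# Single keyword
one_page <- Google.Search.API(
  keyword    = "cran",
  google.key = "your.api.key",   # placeholder
  google.cx  = "your.cx.id",     # placeholder
  country    = "us"
)

# Several keywords, stacked into one data frame
keywords <- c("cran", "rvest web scraping")
all_pages <- lapply(keywords, Google.Search.API,
                    google.key = "your.api.key",
                    google.cx  = "your.cx.id") %>%
  bind_rows()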
I tried scraping the first two pages of topics from this discussion forum using the code below, but received an error message which I do not understand: "Error in sprintf(url_base, i) : unrecognised format specification '%2C'"
Can someone help? Thanks.
library(rvest)
library(purrr)
url_base <- "http://www.epilepsy.com/connect/forums/living-epilepsy-adults?page=0%2C"
map_df(1:2, function(i) {
  # simple but effective progress indicator
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(title = html_text(html_nodes(pg, ".field-content a")),
             excerpt = html_text(html_nodes(pg, ".field-content p")),
             date = html_text(html_nodes(pg, ".views-field-created .field-content")),
             stringsAsFactors = FALSE)
}) -> epilepsyforum
df <- data.frame(epilepsyforum)
write.csv(df,"epilepsyforum.csv")
I'm not sure exactly what you're doing with:
pg <- read_html(sprintf(url_base, i))
sprintf() treats every % in the string as the start of a format specification, so the %2C in the URL triggers the error (and there is no %s or %d placeholder for i anyway). This works just fine for the URL you specified:
pg <- read_html(url_base)
As mentioned in the comment above, if you're trying to loop through pages, then use:
pg <- read_html(paste0(url_base,i))
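For completeness, here is the original map_df() loop with paste0() in place of sprintf() — a sketch that assumes the CSS selectors from the question still match the page:
library(rvest)
library(purrr)

url_base <- "http://www.epilepsy.com/connect/forums/living-epilepsy-adults?page=0%2C"

epilepsyforum <- map_df(1:2, function(i) {
  cat(".")                                  # simple progress indicator
  pg <- read_html(paste0(url_base, i))      # paste0 avoids sprintf's % handling
  data.frame(title = html_text(html_nodes(pg, ".field-content a")),
             excerpt = html_text(html_nodes(pg, ".field-content p")),
             date = html_text(html_nodes(pg, ".views-field-created .field-content")),
             stringsAsFactors = FALSE)
})

write.csv(epilepsyforum, "epilepsyforum.csv", row.names = FALSE)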
Good evening everyone,
I am currently trying to scrape the Zalando website to get the name of every product that appears on the first two pages of the following URL: https://www.zalando.nl/damesschoenen-sneakers/
Here is my code:
require(rvest)
require(dplyr)
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
output <- html_nodes(x = url, css = selector_name) %>% html_text
The result is a list of 24 items, while there are 86 products on the page. Has anyone encountered this issue before? Any idea how to solve it?
Thank you for your help.
Thomas
I just tried what Nicolas Velasqueaz suggested
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
write_html(url, file = "test_url.html")
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
test_file <- read_html("test_url.html")
output <- html_nodes(x = test_file, css = selector_name) %>% html_text
The results are the same: I still get only 24 items. So if anyone has a solution, it would be very much appreciated.
Thank you for your kind answer. I will dive into that direction.
I also found a way to get the brand names without RSelenium; here is my code:
library('httr')
library('magrittr')
library('rvest')
################# FUNCTION #################
extract_data <- function(firstPosition, lastPosition){
  mapply(function(first, last){
    substr(pageContent, first, last) %>%
      gsub("\\W", "\\1 ", .) %>%
      gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)
  },
  firstPosition, lastPosition)
}
############################################
url <- 'https://www.zalando.nl/damesschoenen-sneakers/'
page <- GET(url)
pageContent <- content(page, as='text')
# Get the brand name of the products
firstPosition <- unlist(gregexpr('brand_name', pageContent)) + nchar('brand_name') + 1
lastPosition  <- unlist(gregexpr('is_premium', pageContent)) - 2
extract_data(firstPosition, lastPosition)
Unfortunately it gets difficult as soon as you want something other than the brand name, so maybe the best solution is to do it with RSelenium.
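That said, the same positional trick can be wrapped into a reusable helper that takes the key you want and the key that follows it in the page source. A sketch: extract_field is a hypothetical name, and only the brand_name/is_premium pair is known (from the code above) to be adjacent; any other key pair is an assumption you would have to verify against the raw page:
# Hypothetical helper: values sitting between `key` and `next_key` in the raw page text.
# Only brand_name/is_premium is known to be adjacent; other pairs are untested.
extract_field <- function(pageContent, key, next_key) {
  firstPosition <- unlist(gregexpr(key, pageContent)) + nchar(key) + 1
  lastPosition  <- unlist(gregexpr(next_key, pageContent)) - 2
  mapply(function(first, last) {
    substr(pageContent, first, last) %>%
      gsub("\\W", "\\1 ", .) %>%
      gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)
  }, firstPosition, lastPosition)
}

brands <- extract_field(pageContent, "brand_name", "is_premium")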
Here's the code I'm running
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
html(l)
})
Up until this point it seems to work fine, but when I try to extract the text:
html_text(messages)
I get:
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
Trying to extract a specific element:
html_text(messages[1])
Can't do that either...
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
So I try a different way:
html_text(messages[[1]])
This seems to at least get at the data, but is still not successful:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"
How can I extract the text material from each of the elements of my list?
There are two problems with your code. Look here for examples on how to use the package.
1. You cannot just use every function with everything.
html() is for downloading the content of a page
html_node() is for selecting node(s) from the downloaded content
html_text() is for extracting text from a previously selected node
Therefore, to download one of your pages and extract the text of the html-node, use this:
library(rvest)
old-school style:
url <- "https://github.com/rails/rails/pull/100"
url_content <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text
... or this ...
hard to read old-school style:
url_mainnode_text <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text
... or this ...
magrittr piping style:
url_mainnode_text <-
  html("https://github.com/rails/rails/pull/100") %>%
  html_node("*") %>%
  html_text()
url_mainnode_text
2. When using lists, you have to apply functions over the list, e.g. with lapply()
If you want to batch-process several URLs, you can try something like this:
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
get_html_text <- function(url, css_or_xpath = "*"){
  html_text(
    html_node(
      html(url), css_or_xpath   # use the url argument rather than a hard-coded link
    )
  )
}
lapply(url_list, get_html_text, css_or_xpath="a[class=message]")
You need to use html_nodes() and identify which CSS selectors relate to the data you're interested in. For example, if we want to extract the usernames of the people discussing pull request 200:
rootUri <- "https://github.com/rails/rails/pull/200"
page<-html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()
[1] "jaw6" "jaw6" "josevalim"