I tried scraping the first two pages of topics from this discussion forum using the code below, but received an error message which I do not understand: "Error in sprintf(url_base, i) : unrecognised format specification '%2C'"
Can someone help? Thanks.
library(rvest)
library(purrr)
url_base <- "http://www.epilepsy.com/connect/forums/living-epilepsy-adults?page=0%2C"
map_df(1:2, function(i) {
  # simple but effective progress indicator
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(title = html_text(html_nodes(pg, ".field-content a")),
             excerpt = html_text(html_nodes(pg, ".field-content p")),
             date = html_text(html_nodes(pg, ".views-field-created .field-content")),
             stringsAsFactors = FALSE)
}) -> epilepsyforum
df <- data.frame(epilepsyforum)
write.csv(df,"epilepsyforum.csv")
I'm not sure exactly what you're doing with:
pg <- read_html(sprintf(url_base, i))
but this works just fine for the url you specified:
pg <- read_html(url_base)
As mentioned in the comment above, if you're trying to loop through pages, then use:
pg <- read_html(paste0(url_base,i))
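The error itself comes from sprintf() trying to interpret the URL-encoded comma %2C as a format specification; url_base contains no %s or %d placeholder for i to fill. For reference, a minimal sketch of the corrected loop, assuming the same selectors still match the forum's markup:
library(rvest)
library(purrr)

url_base <- "http://www.epilepsy.com/connect/forums/living-epilepsy-adults?page=0%2C"

map_df(1:2, function(i) {
  cat(".")                              # simple progress indicator
  pg <- read_html(paste0(url_base, i))  # builds ...page=0%2C1, ...page=0%2C2
  data.frame(title = html_text(html_nodes(pg, ".field-content a")),
             excerpt = html_text(html_nodes(pg, ".field-content p")),
             date = html_text(html_nodes(pg, ".views-field-created .field-content")),
             stringsAsFactors = FALSE)
}) -> epilepsyforum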
Related
I am trying to scrape a website that has hundreds of pages. I have been using the following code to get through all the pages, but in order not to overwhelm the website, there must be a pause between scrapes. I have been trying to induce this pause using Sys.sleep(15), but this causes the final data frame to come out empty. Any ideas why this is happening?
Version one:
a <- lapply(paste0("https://website.com/page/", 1:500),
            function(url){
              url %>% read_html() %>%
                html_nodes(".text") %>%
                html_text()
              Sys.sleep(15)
            })
raw_posts <- unlist(a)
a <- data.frame(raw_posts)
This simply returns an empty data frame.
Version two:
url_base <- "https://website.com/page/"
map_df(1:500, function(i) {
  Sys.sleep(15)
  cat(" bababooeey ")
  pg <- read_html(sprintf(url_base, i))
  data.frame(text = html_text(html_nodes(pg, ".text")),
             date = html_text(html_nodes(pg, "time")),
             stringsAsFactors = FALSE)
}) -> b
This just repeats the same set of results from a single page over and over.
Does anything stand out as being wrongly coded?
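Two things stand out. In version one, Sys.sleep(15) is the last expression of the anonymous function, so its invisible NULL return value is what lapply() collects, and the scraped text is discarded. In version two, url_base contains no %d or %s placeholder, so sprintf(url_base, i) returns the same URL for every i, which is why one page repeats. A sketch of version one with the pause moved, assuming the .text selector is correct for the site:
a <- lapply(paste0("https://website.com/page/", 1:500),
            function(url){
              Sys.sleep(15)             # pause before each request
              url %>%
                read_html() %>%
                html_nodes(".text") %>%
                html_text()             # last expression, so this is what gets returned
            })
raw_posts <- unlist(a)
In version two the same idea applies to the URL: build it with paste0(url_base, i) instead of sprintf(url_base, i).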
I am trying to scrape some data from the Dutch train disruptions website. I have done this successfully before with multiple pages, but I am now trying to go a level deeper. Unfortunately, I am getting the following error:
Error: '/storingen/25215-29-december-2018-defect-spoor-amersfoort-ede-wageningen' does not exist.
This should be the correct URL, but I think it is missing the first part:
https://www.rijdendetreinen.nl/storingen/25235-31-december-2018-seinstoring-groningen-eemshaven
I can't seem to locate the origin of the problem. I think it might be possible that not the entire URL is retrieved.
I am using the following script:
library(tidyverse)
library(rvest)
get_element_data <- function(link){
  if(!is.na(link)){
    html <- read_html(link)
    Sys.sleep(2)
    datum <- html %>%
      html_node(".disruption-cause") %>%
      html_text()
    return(tibble(datum = datum))
  }
}

get_elements_from_url <- function(url){
  html_page <- read_html(url)
  Sys.sleep(2)
  route <- scrape_css(".disruption-line", ".resolved", html_page)
  problem <- scrape_css("em", ".resolved", html_page)
  time <- scrape_css(".timestamp", ".resolved", html_page)
  element_urls <- scrape_css_attr(".resolved", "div", "href", html_page)
  element_data_detail <- element_urls %>%
    map(get_element_data) %>%
    bind_rows()
  elements_data <- tibble(route = route, problem = problem, time = time, element_urls = element_urls)
  elements_data_overview <- elements_data[complete.cases(elements_data[, 2]), ]
  return(bind_cols(elements_data_overview, element_data_detail))
}

scrape_write_table <- function(url){
  list_of_pages <- str_c(url, 2)
  list_of_pages %>%
    map(get_elements_from_url) %>%
    bind_rows()
}
trainDisruptions <- scrape_write_table("https://www.rijdendetreinen.nl/storingen?lines=&reasons=&date_before=31-12-2018&date_after=01-01-2018&page=")
View(trainDisruptions)
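The error shows read_html() being handed the site-relative path /storingen/..., which it treats as a local file, hence "does not exist". A minimal sketch of one way to fix this inside get_elements_from_url, assuming element_urls holds those relative paths:
# turn the site-relative paths into absolute URLs before fetching the detail pages
element_urls <- xml2::url_absolute(element_urls, "https://www.rijdendetreinen.nl")
# equivalently: element_urls <- paste0("https://www.rijdendetreinen.nl", element_urls)

element_data_detail <- element_urls %>%
  map(get_element_data) %>%
  bind_rows()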
I am using JSON to scrape the content of multiple (1000) links. However, some of the links do not work in JSON format, so there is no content to be scraped. Because of this, my code stops working when it hits one of those links.
I have tried to use tryCatch to avoid the error, but it does not seem to be working.
Here is the code I am using
library(jsonlite)
library(rvest)
library(dplyr)   # select() below comes from dplyr
lapply(links_jason[1:6], function(x) {
  tryCatch(
    {
      json_data <- read_html(x) %>%
        html_text() %>%
        jsonlite::fromJSON(.) %>%
        select(1)
    },
    error = function(cond) return(NULL),
    finally = print(x)
  )
})
This is the issue I am getting:
Debug location is approximate because the source is not available
Here are some examples of the links I am trying to scrape. Links 1, 2 and 6 work fine; 3, 4 and 5 need to be avoided.
> head(links_jason)
[1] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
[2] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
[3] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
[4] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
[5] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
[6] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json"
I have also tried to use if statements with no results. Could anyone help? Thanks!
Read directly with jsonlite and test the length of what it returns:
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if(length(json_data) > 0){
    print(x)
  }
})
Or something like:
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if(length(json_data) == 0){
    json_data <- NA
  } else {
    print('doing something with json_data')
  }
})
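If you prefer to keep the tryCatch approach from the question, a sketch along the same lines that parses each URL directly with fromJSON() and returns NULL for links that do not serve valid JSON, so the loop keeps going:
library(jsonlite)

results <- lapply(links_jason[1:6], function(x) {
  tryCatch(
    jsonlite::fromJSON(x),        # fromJSON() accepts a URL directly
    error = function(cond) NULL,  # bad links yield NULL instead of stopping the loop
    finally = print(x)
  )
})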
I need to scrape some data from a website whose URL changes only by a number.
I tried to make a loop, but I just can't get it to work. This is what I've tried; I'm using the rvest library.
prueba <- data.frame(1:11)

for(KST in 861:1804)){
  url <- print(paste("https://estudiosdemograficosyurbanos.colmex.mx/index.php/edu/rt/metadata/", KST, "/0", sep = "")) ## from 861 to 1804
  webpage <- read_html(url)
  articles_data_html <- html_nodes(webpage, 'tr:nth-child(4), tr:nth-child(6), tr:nth-child(8), tr:nth-child(10)
                                             , tr:nth-child(12), tr:nth-child(20), tr:nth-child(22), tr:nth-child(28)
                                             , tr:nth-child(26), tr:nth-child(30), tr:nth-child(32)')
  articles_data <- html_text(articles_data_html)
  # putting on a dataframe
  as.data.frame(prueba[paste("a", KST, sep = "")]) <- articles_data
}
Can somebody help me with how to do it?
Thanks in advance.
I believe that the best way to solve your problem is to use an object of class "list" to hold what you are reading in. Something like the following.
library(rvest)

prueba <- vector("list", length(861:1804))

for(KST in 861:1804){
  url <- paste("https://estudiosdemograficosyurbanos.colmex.mx/index.php/edu/rt/metadata/", KST, "/0", sep = "") ## from 861 to 1804
  webpage <- read_html(url)
  articles_data_html <- html_nodes(webpage, 'tr:nth-child(4), tr:nth-child(6), tr:nth-child(8), tr:nth-child(10)
                                             , tr:nth-child(12), tr:nth-child(20), tr:nth-child(22), tr:nth-child(28)
                                             , tr:nth-child(26), tr:nth-child(30), tr:nth-child(32)')
  articles_data <- html_text(articles_data_html)
  # store each page's results in the list, shifting the index so the list is filled from position 1
  prueba[[KST - 860]] <- articles_data
}
Then, when you are done, maybe end with
closeAllConnections()
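Once the loop finishes, the list can be collapsed into one long data frame, one row per scraped node. A sketch (the column names pagina and texto are just examples):
names(prueba) <- paste0("a", 861:1804)   # label each element with the id it came from
resultado <- data.frame(pagina = rep(names(prueba), lengths(prueba)),
                        texto = unlist(prueba),
                        stringsAsFactors = FALSE)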
I'm new to R and rvest. I got help with this code two days ago; it scrapes all the player names and works well. Now I'm trying to extend the fetch_current_players function so it also creates a vector of the player codes for that website (taken from the URL). Any help would be appreciated, as I've spent a day googling, reading, and watching YouTube videos trying to teach myself. Thanks!
library(rvest)
library(purrr) # flatten/map/safely
library(dplyr) # progress bar
fetch_current_players <- function(letter){
  URL <- sprintf("http://www.baseball-reference.com/players/%s/", letter)
  pg <- read_html(URL)
  if (is.null(pg)) return(NULL)
  player_data <- html_nodes(pg, "b a")
  player_code <- html_attr(html_nodes(pg, "b a"), "href") # I'm trying to scrape the URL as well as the player name
  substring(player_code, 12, 20) # strips the code out of the URL
  html_text(player_data)
  player_code # not sure how to create a vector of all codes from all 27 webpages
}

pb <- progress_estimated(length(letters))
player_list <- flatten_chr(map(letters, function(x) {
  pb$tick()$print()
  fetch_current_players(x)
}))
I like to keep this kind of thing simple and readable; there's nothing wrong with a for loop. This code returns the names and codes in a simple data frame.
library(rvest)
library(purrr) # flatten/map/safely
library(dplyr) # progress bar
fetch_current_players <- function(letter){
  URL <- sprintf("http://www.baseball-reference.com/players/%s/", letter)
  pg <- read_html(URL)
  if (is.null(pg)) return(NULL)
  player_data <- html_nodes(pg, "b a")
  player_code <- html_attr(html_nodes(pg, "b a"), "href") # grab the href as well as the player name
  player_code <- substring(player_code, 12, 20)           # strip the code out of the URL
  player_names <- html_text(player_data)
  return(data.frame(code = player_code, name = player_names))
}
pb <- progress_estimated(length(letters))
for (x in letters) {
  pb$tick()$print()
  if (exists("player_list")) {
    player_list <- rbind(player_list, fetch_current_players(x))
  } else {
    player_list <- fetch_current_players(x)
  }
}