Rvest scraping result is empty list

I have a simple question: I tried to scrape odds from a webpage, but the result is an empty list.
library(rvest)
url<- 'http://www.betexplorer.com/soccer/italy/serie-a-2016-2017/'
html <- read_html(url)
odds <- html_nodes(html, '.table-matches__odds , .colored span')
odds <- html_text(odds)
I also tried XPath instead of a CSS selector, but got the same result.
library(rvest)
url<- 'http://www.betexplorer.com/soccer/italy/serie-a-2016-2017/'
html <- read_html(url)
odds <- html_nodes(html, xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "table-matches__odds", " " ))]')
odds <- html_text(odds)

Related

How to skip webpages during a web scraping in rvest

I'm trying to collect information using rvest package in R.
While collecting the data with a for loop, I found that some of the pages do not exist, which raises an error: Error in open.connection(x, "rb") : HTTP error 404.
Here is my R code. Pages 15138 and 15140 do have information, whereas 15139 does not. How can I skip 15139 inside the for loop?
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(stringi)
source_url <- "https://go2senkyo.com/local/senkyo/"
senkyo <- data.frame()
for (i in 15138:15140) {
  Sys.sleep(0.5)
  target_page <- paste0(source_url, i)
  recall_html <- read_html(target_page, encoding = "UTF-8")
  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
  city <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
    html_text()
  city <- trimws(gsub("[\r\n]", "", city))
  senkyo2 <- cbind(prefecture, city)
  senkyo <- rbind(senkyo, senkyo2)
}
I'm looking forward to your answer!

You can handle exceptions a few different ways. I'm a noob when it comes to scraping, but here are a few options for your situation.
Tailor Your Loop Range
If you know that you don't want the value 15139, you can remove it from the vector of pages, like:
for (i in c(15138,15140)) {
This skips 15139 entirely when the loop runs.
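If several page numbers are known to be missing, typing out the surviving ones by hand gets tedious. A minimal sketch using base R's setdiff(); the name known_missing is made up for illustration, and the list of missing pages is assumed to be known in advance:

```r
# Page numbers assumed to be known in advance as missing (only 15139 here)
known_missing <- c(15139)

# Drop them from the full range before looping
pages <- setdiff(15138:15140, known_missing)

for (i in pages) {
  print(i)  # stand-in for the scraping code in the loop body
}
```

This only helps when the bad pages are known beforehand; for pages that fail unexpectedly, error handling inside the loop is the more general tool.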
Add Control Flow
This is basically the same thing as tailoring your loop range, but it handles the exception within the loop itself, like:
for (i in 15138:15140) {
  Sys.sleep(0.5)
  # control statement
  if (i == 15139) {
    next # moves to the next iteration of the loop, in this case 15140
  }
  target_page <- paste0(source_url, i) # not run if i == 15139, since the loop skipped to the next iteration
  # ... rest of the loop body as in the question
}
Condition Handling Tools
This is where I get out of my depth, and constantly reference Advanced R. Essentially, you can wrap functions like try() around your potentially buggy code. This insulates your loop from errors so that one failure doesn't stop it, and gives you flexibility about what to do when your code breaks in specific ways.
My usual approach would be to add something to your code like:
# wrap the part of your code that can break in try()
recall_html <- try(read_html(target_page, encoding = "UTF-8"))
# the error message is still printed (unless you set silent = TRUE), but it no longer stops your code
# you'll need to add control flow to keep your loop from breaking at the next function, however
# inherits() is used because a successfully parsed page has more than one class
if (inherits(recall_html, "try-error")) {
  next
} else {
  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
  # ... rest of the loop body as in the question
}
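The same idea can also be written with tryCatch(), which lets the failing page come back as NULL. This is only a sketch: read_or_null and fake_reader are hypothetical names, and fake_reader is a stand-in for read_html() so the example runs without network access.

```r
# Hypothetical helper: returns NULL instead of stopping when the reader fails
read_or_null <- function(url, reader) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for read_html() so the sketch runs offline; it fails for page 15139
fake_reader <- function(url) {
  if (grepl("15139$", url)) stop("HTTP error 404.")
  paste("parsed", url)
}

source_url <- "https://go2senkyo.com/local/senkyo/"
pages <- list()
for (i in 15138:15140) {
  page <- read_or_null(paste0(source_url, i), fake_reader)
  if (is.null(page)) next  # skip pages that could not be read
  pages[[as.character(i)]] <- page
}
names(pages)
```

In real use the reader argument would be something like function(u) read_html(u, encoding = "UTF-8"), with the html_nodes() calls placed after the is.null() check.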

No data when scraping with rvest

I am trying to scrape a website but it does not give me any data.
#Get the Data
require(tidyverse)
require(rvest)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#get data
url %>%
read_html() %>%
html_nodes(".green div:nth-child(1)") %>%
html_text()
character(0)
I have also tried the xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "green", " " ))]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a', but this gives me the same result with no data.
I am expecting horse names. Shouldn't I at least get some JavaScript code, even if the data on the page is rendered by JavaScript?
I can't see what other CSS selector I should use here.

You can use the RSelenium package to scrape dynamic pages:
library(RSelenium)
library(rvest) # needed for read_html(), html_nodes() and html_text()
# specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
# create the remote driver / browser session
rsd <- rsDriver(browser = "chrome")
remDr <- rsd$client
# go to your url
remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
# parse the rendered page source with rvest as usual to get the horse names
page %>% html_nodes(".green div:nth-child(1)") %>% html_text()
Hope that helps
Gottavianoni
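On the question of why no JavaScript shows up either: a CSS selector only returns elements that exist in the downloaded markup, and script code sits as text inside script elements. The sketch below uses made-up HTML (not the real travsport.se page) to show what a JavaScript-filled page typically looks like to rvest before any script has run.

```r
library(rvest)

# Made-up static HTML standing in for what read_html() downloads:
# the .green container is empty, and a script would fill it in a browser
static_html <- '<html><body>
  <div class="green"></div>
  <script>
    var d = document.createElement("div");
    d.textContent = "Horse A";
    document.querySelector(".green").appendChild(d);
  </script>
</body></html>'

page <- read_html(static_html)

# The selector from the question matches nothing in the static markup
matches <- page %>% html_nodes(".green div:nth-child(1)") %>% html_text()
length(matches)

# The script is present, but only as text inside the <script> element
scripts <- page %>% html_nodes("script") %>% html_text()
any(grepl("Horse A", scripts))
```

If the selector matches nothing while the data is visible in a browser, that points to client-side rendering, which is the case a browser driver such as RSelenium is meant for.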

rvest: css / xpath from SelectorGadget not working

I want to extract all download links for txt-files from this page: https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle. To do so, I tried SelectorGadget and selected the following:
With this information I wrote this code:
library(rvest)
library(tidyverse)
protokolle <-
  read_html("https://www.bundestag.de/dokumente/protokolle/plenarprotokolle/plenarprotokolle")
txts <-
  protokolle %>%
  html_nodes(".bt-link-dokument")
The result is the same if I try
txts <-
  protokolle %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "bt-link-dokument", " " ))]')
For a reason I do not understand, txts contains only {xml_nodeset (0)}. Any ideas on what went wrong?

Extracting web table using Rvest (in R)

I am looking to pull a table in at http://www.nfl.com/inactives?week=5 in order to process active and inactive players. I am very familiar with rvest and have tried using the code:
library(rvest)
url <- paste0("http://www.nfl.com/inactives?week=5")
Table <- url %>%
read_html() %>%
html_nodes(xpath= '//*[contains(concat( " ", @class, " " ), concat( " ", "yui3-datatable-cell", " " ))]') %>%
html_table()
TableNew <- Table[[1]]
TableNew
Nothing is coming up correctly though. Ideally, I would like to be able to put all the players and their team name into one single table. I appreciate your insights.

Web scraping tables from over 5K websites listed by url in a .csv file, all in R

So, I am working to extract data from the following website: http://livingwage.mit.edu
...at the county level, and have tried many different iterations of using the rvest package to extract the data. Unfortunately, there are about 5K counties.
I have extracted all the urls into a single column .csv file. The urls have the form "http://livingwage.mit.edu/counties/..." where "..." is the state code followed by the county code.
According to SelectorGadget, the data I want has the CSS selector
css = '.wages_table .even .col-NaN , .wages_table .results .col-NaN'
or the XPath
xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "wages_table", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "even", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "col-NaN", " " ))] | //*[contains(concat( " ", @class, " " ), concat( " ", "wages_table", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "results", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "col-NaN", " " ))]'
This is where I started:
library(rvest)
url <- read_html("http://livingwage.mit.edu/counties/01001")
url %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
...but was only able to extract one table at a time, and got the headers and the final row, which I did not want.
So, I tried something like this:
counties <- 01001:54500
urls <- paste0("http://livingwage.mit.edu/counties/", counties)
get_table <- function(url) {
url %>%
read_html() %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
}
results <- sapply(urls, get_table)
...but quickly realized that the county codes are neither sequential (they are mostly odd) nor continuous, i.e., one state may have only 4 counties, with urls that go up to ~/10009 for example.
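A separate detail in that attempt: R reads 01001 as the number 1001, so the leading zero is gone before the URL is built. A small sketch of padding the codes back to five digits with sprintf(); the three codes are examples only, not a real list of counties.

```r
# R reads 01001 as the number 1001, so the leading zero is lost
codes <- c(1001, 1003, 10009)

# Pad back to five digits before building the URLs
padded <- sprintf("%05d", as.integer(codes))
urls <- paste0("http://livingwage.mit.edu/counties/", padded)
urls
```

This does not solve the gaps between codes, which is why reading the real URLs from a list or from the site itself is still needed.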
Finally, I got as far as this when trying to access the .csv list of urls on my desktop:
URL <- read.csv("~/Desktop/LW_url.csv", header=T)
URL %>%
html_nodes("table", ".wages_table .even .col-NaN , .wages_table .results .col-NaN") %>%
.[[1]] %>%
html_table()
...and know that the css and the read all do not like talking to each other nicely.
Any help in making this happen would be thoroughly appreciated.

I think this is what you are looking for.
install.packages("pbapply") # has a nice addition to lapply, estimates run time
library(rvest)
library(dplyr)
library(magrittr)
library(pbapply)
## Get state urls
lwc.url <- "http://livingwage.mit.edu"
state.urls <- read_html(lwc.url)
# html_attr() is rvest's own attribute accessor, so no separate xml2 attach is needed
state.urls %<>% html_nodes(".col-md-6 a") %>% html_attr("href") %>%
  paste0(lwc.url, .)
## Get county urls (state and county names are taken from each page title below)
county.urls <- lapply(state.urls, function(x) read_html(x) %>%
  html_nodes(".col-md-3 a") %>% html_attr("href") %>%
  paste0(lwc.url, .)) %>% unlist
## Get the tables: hourly wage & typical expenses
dfs <- pblapply(county.urls, function(x){
  LWC <- read_html(x)
  # stack the first two tables, renaming the first column to "Info"
  df <- rbind(
    LWC %>% html_nodes("table") %>% .[[1]] %>%
      html_table() %>% setNames(c("Info", names(.)[-1])),
    LWC %>% html_nodes("table") %>% .[[2]] %>%
      html_table() %>% setNames(c("Info", names(.)[-1])))
  title <- LWC %>% html_nodes("h1") %>% html_text
  df$State <- trimws(gsub(".*,", "", title))
  df$County <- trimws(gsub(".*for (.*) County.*", "\\1", title))
  df$url <- x
  df
})
df <- data.table::rbindlist(dfs)
View(df)
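If you would rather keep the list of urls from the .csv file, the piece missing from the question's last attempt is that each url has to be read with read_html() before html_nodes() can be applied; a data frame cannot be piped into html_nodes() directly. A rough sketch under stated assumptions: the column name url, the helper fake_page and the table markup are all made up so that the example runs offline.

```r
library(rvest)

# Stand-in for the .csv on disk: one column of urls (column name is an assumption)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("url",
             "http://livingwage.mit.edu/counties/01001",
             "http://livingwage.mit.edu/counties/01003"), csv_path)
urls <- read.csv(csv_path, stringsAsFactors = FALSE)$url

# Made-up page markup so the sketch runs offline; in real use, call read_html(u)
fake_page <- function(u) {
  read_html(paste0(
    "<table><tr><th>Info</th><th>1 Adult</th></tr>",
    "<tr><td>Living Wage</td><td>$10</td></tr>",
    "<tr><td>Poverty Wage</td><td>$5</td></tr></table>"))
}

# Read one page, take its first table, and remember which url it came from
get_table <- function(u, fetch = fake_page) {
  tbl <- fetch(u) %>% html_nodes("table") %>% .[[1]] %>% html_table()
  tbl$url <- u
  tbl
}

tables <- lapply(urls, get_table)
all_counties <- do.call(rbind, tables)
nrow(all_counties)
```

For the real site, replace fake_page with read_html and consider wrapping the call in try() so that one bad url does not stop the whole run.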
