Check if a table within a web page exists in R

For a little project for myself I'm trying to get the results from some races.
I can access the pages with the results and download the data from the table on each page. However, there are only 20 results per page. Luckily, the web addresses are built logically, so I can generate them and, in a loop, access these pages and download the data. Each category has a different number of racers, though, and thus a different number of pages. I want to avoid having to check manually how many racers there are in each category.
My first thought was to just generate a lot of links, making sure there are enough (based on the total amount of racers) to get all the data.
library(XML) # provides readHTMLTable
nrs <- rep(seq(1,5,1),2)
sex <- c("M","M","M","M","M","F","F","F","F","F")
links <- NULL
#Loop to create 10 links, 5 for the male age group 18-24, 5 for the female age group 18-24. However,
#there are only 3 pages in the male age group with a table.
for (i in 1:length(nrs) ) {
links[i] = paste("http://www.ironman.com/triathlon/events/americas/ironman/texas/results.aspx?p=",nrs[i],"&race=texas&rd=20160514&sex=",sex[i],"&agegroup=18-24&loc=",sep="")
}
resultlist <- list() #create empty list to store results
for (i in 1:length(links)) {
results = readHTMLTable(links[i],
as.data.frame = TRUE,
which=1,
stringsAsFactors = FALSE,
header = TRUE) #get data
resultlist[[i]] <- results #combine results in one big list
}
results = do.call(rbind, resultlist) #combine results into dataframe
When you run this code, readHTMLTable throws an error as soon as it encounters a page with no table, and the loop stops.
I thought of two possible solutions.
1) Somehow check all the links if they exist. I tried with url.exists from the RCurl package. But this doesn't work. It returns TRUE for all pages, as the page exists, it just doesn't have a table in it (so for me it would be a false positive). Somehow I would need some code to check if a table in the page exists, but I don't know how to go about that.
2) Suppress the error message from readHTMLTable so the loop continues, but I'm not sure if that's possible.
Any suggestions for these two methods, or any other suggestions?

I think that method #2 is easier. I modified your code with tryCatch, one of R's builtin exception handling mechanisms. It works for me.
PS I would recommend using rvest for web scraping like this.
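The modified code is not shown above, so here is a minimal sketch of the tryCatch approach rather than the answerer's exact version. The helper name read_all_tables and its reader argument are assumptions made for illustration, and fake_reader only stands in for readHTMLTable so the example runs offline.

```r
# Hypothetical helper: apply `reader` to every link and keep going when one fails.
read_all_tables <- function(links, reader) {
  pages <- lapply(links, function(link) {
    # A page without a table makes the reader fail; return NULL instead of stopping.
    tryCatch(reader(link), error = function(e) NULL)
  })
  # Drop the failed pages and stack the remaining data frames.
  do.call(rbind, Filter(Negate(is.null), pages))
}

# Stand-in reader so the sketch runs offline: it fails for "no-table" links.
fake_reader <- function(link) {
  if (grepl("no-table", link)) stop("no table on page")
  data.frame(link = link, rank = 1:2, stringsAsFactors = FALSE)
}

results <- read_all_tables(c("page-1", "no-table-2", "page-3"), fake_reader)
print(nrow(results))
```

In the question's setting, the reader could be something like function(link) readHTMLTable(link, which = 1, stringsAsFactors = FALSE, header = TRUE), with the generated links passed as the first argument.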

Related

Downloading and storing multiple files from URLs on R; skipping urls that are empty

Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (been working on this for months). I have a couple issues:
-Each document I want to scrape has a document number. However, the numbers don't always go up in order. For example, one document number is 2022, but the next one is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried to wrap download.file in purrr::safely(), but once it hits a document that does not exist it stops.
-Second, I'm still fairly new to R, and am having a hard time setting up destfile for multiple documents. Indexing the path for where to store downloaded data ends up with the first document stored in the named place, the next document as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
temp.doc.name <- paste0(base.url,
document.name.1,
document.numbers[i],
document.extension)
print(temp.doc.name)
#download and save data
safely <- purrr::safely(download.file(temp.doc.name,
destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to a more efficient way to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier ways to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL and see if it returns a "good" status (usually 200) and don't download when it returns a "bad" status (like 404). The following code block does that.
Note that purrr::safely doesn't run a function -- it creates another function that is safe and which you then can call. The created function returns a list with two slots: result and error.
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333,2552,2321)
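sHEAD = purrr::safely(httr::HEAD)
sdownload = purrr::safely(download.file)
for (i in seq_along(document.numbers)) {
  file_name = paste0(document.name.1,document.numbers[i],document.extension)
  temp.doc.name <- paste0(base.url,file_name)
  print(temp.doc.name)
  # Request the headers once and reuse the response
  head.result <- sHEAD(temp.doc.name)$result
  # The result is NULL when the request itself failed, so fall back to NA
  status <- if (is.null(head.result)) NA else httr::status_code(head.result)
  print(status)
  if(status %in% 200:299){
    # mode = "wb" keeps binary .docx files intact on Windows
    sdownload(temp.doc.name,destfile=file_name,mode="wb")
  }
}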
It might not be as simple as all of the valid URLs returning a '200' status. I think in general status codes in the range 200:299 are OK (I edited the answer to reflect this).
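To illustrate the note above about purrr::safely, here is a small offline sketch. It uses log as a stand-in for download.file, and the name safe_log is only for illustration.

```r
library(purrr)

# safely() wraps a function; nothing is run until the wrapper is called.
safe_log <- safely(log)

ok  <- safe_log(10)     # succeeds: result is filled, error is NULL
bad <- safe_log("ten")  # fails: result is NULL, error holds the condition

print(ok$result)
print(is.null(ok$error))
print(is.null(bad$result))
print(conditionMessage(bad$error))
```

This is why wrapping the call itself, as in purrr::safely(download.file(...)), does not protect the loop: the download runs (and can fail) before safely ever sees it.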
I used parts of this answer in my answer.
If the file does not exist, tryCatch simply skips it.
library(tidyverse)
get_data <- function(index) {
  url <- paste0(
    "https://www.europarl.europa.eu/doceo/document/",
    "P-9-2022-00",
    index,
    "_EN.docx"
  )
  # Wrap the download call itself so a missing document is reported and skipped
  tryCatch(
    download.file(url = url,
                  destfile = paste0(index, ".docx"),
                  mode = "wb",
                  quiet = TRUE),
    error = function(e) print(paste(index, "does not exist - SKIPS")))
}
map(2000:5000, get_data)

using a for loop to utilize spotifyr get_artist_audio_features function in R, skip errors in the loop

I downloaded my personal Spotify data from the Spotify website.
I converted these data from JSON to a regular R dataframe for further analysis. This personal dataframe has 4 columns:
Endtime artistName trackName Msplayed
However, Spotify has many variables coupled to songs from an artist, that you can only retrieve using the function get_artist_audio_features from the spotifyr package. I want to join these variables to my personal dataframe. The package allows data retrieval for only one artist at a time and it would be very time consuming to write a line of code for all 3000+ artists in my dataframe.
I used a for loop to try and collect the metadata for the artists:
empty_list <- vector(mode = "list")
for(i in df$artistName){
empty_list[[i]] <- get_artist_audio_features(i)
}
My dataframe also has podcasts, for which none of this metadata is available. When I try using the function on a podcast I get the error message:
Error in get_artist_audio_features(i) :
No artist found with artist_id=''.
In addition: Warning messages:
1: Unknown or uninitialised column: `id`.
2: Unknown or uninitialised column: `name`.
When I use the for loop, it stops as soon as the first error (podcast) in the dataframe occurs. When I feed it a vector of only artists and no podcasts, it works perfectly.
I checked Stack Overflow for possible answers (most notably: Skipping error in for-loop) but I can't get the loop to work.
My question: how can I use the function spotifyr::get_artist_audio_features in a for loop and skip the errors, storing the results in a list? Unfortunately, it is very difficult to post a reproducible example, since you need to activate a developer account on Spotify to use the spotifyr package.
It looks like your issue is in artist_id = '', so try the below code to see if it helps get you started (since I don't have reproducible data, not sure if it will help). In this case it should just skip the podcasts, but I'm sure some more codesmithing will allow you to put relevant data in the given list position.
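for(i in df$artistName){
  # artist_id is a placeholder for the ID looked up for artist i;
  # it is not defined in the question's code
  if(artist_id == ''){
    empty_list[[i]] <- NA
  } else {
    empty_list[[i]] <- get_artist_audio_features(i)
  }
}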
You could also use a while loop conditioning on an incremental i to restart the loop, but I can't do that without the data.
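Another way to skip the podcasts, sketched here under assumptions, is to catch the error with tryCatch and store NA in that position. The helper collect_features and the stand-in fake_fetch are hypothetical names; fake_fetch only imitates the error from the question so the example runs without a Spotify developer account.

```r
# Hypothetical helper: call `fetch` once per artist and store NA where it fails.
collect_features <- function(artists, fetch) {
  out <- list()
  for (artist in unique(artists)) {
    out[[artist]] <- tryCatch(fetch(artist), error = function(e) NA)
  }
  out
}

# Stand-in for the API call so the sketch runs without a Spotify account.
fake_fetch <- function(artist) {
  if (artist == "Some Podcast") stop("No artist found with artist_id=''.")
  data.frame(artist_name = artist, tempo = 120)
}

features <- collect_features(
  c("Artist A", "Some Podcast", "Artist A", "Artist B"),
  fake_fetch
)
print(names(features))
```

With spotifyr loaded, the call would presumably be collect_features(df$artistName, get_artist_audio_features). Looping over unique names avoids repeating the request for artists that appear many times in the listening history.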

R Web Scraping: Error handling when web page doesn't contain a table

I'm having some difficulties web scraping. Specifically, I'm scraping web pages that generally have tables embedded. However, for the instances in which there is no embedded table, I can't seem to handle the error in a way that doesn't break the loop.
Example code below:
event = c("UFC 226: Miocic vs. Cormier", "ONE Championship 76: Battle for the Heavens", "Rizin FF 12")
eventLinks = c("https://www.bestfightodds.com/events/ufc-226-miocic-vs-cormier-1447", "https://www.bestfightodds.com/events/one-championship-76-battle-for-the-heavens-1532", "https://www.bestfightodds.com/events/rizin-ff-12-1538")
testLinks = data.frame(event, eventLinks)
for (i in 1:length(testLinks)) {
print(testLinks$event[i])
event = tryCatch(as.data.frame(read_html(testLinks$eventLink[i]) %>% html_table(fill=T)),
error = function(e) {NA})
}
The second link does not have a table embedded. I thought I'd just skip it with my tryCatch, but instead of skipping it, the link breaks the loop.
What I'm hoping to figure out is a way to skip links with no tables, but continue scraping the next link in the list. To continue using the example above, I want the tryCatch to move from the second link onto the third.
Any help? Much appreciated!
There are a few things to fix here. Firstly, your links are considered factors (you can see this with testLinks %>% sapply(class)), so you'll need to convert them to character using as.character(). I've done this in the code below.
Secondly, you need to assign each scrape to a list element, so we create a list outside the loop with events <- list(), and then assign each scrape to an element of the list inside the loop, i.e. events[[i]] <- "something". Without a list, you'll simply overwrite the first scrape with the second, and the second with the third, and so on.
Now your tryCatch will work and assign NA when a URL does not contain a table (there will be no error).
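library(rvest) # read_html, html_table and the %>% pipe
events <- list()
# Loop over rows; length(testLinks) would count the data frame's columns instead
for (i in seq_len(nrow(testLinks))) {
  print(testLinks$event[i])
  events[[i]] = tryCatch(as.data.frame(read_html(testLinks$eventLink[i] %>% as.character(.)) %>% html_table(fill=T)),
    error = function(e) {NA})
}
events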
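An alternative to catching the error is to check whether the parsed page contains any table element before trying to read it. The following is a minimal sketch of that idea; has_table is a hypothetical helper, and the two inline HTML strings stand in for real event pages so the example runs offline.

```r
library(xml2)
library(rvest)

# Hypothetical helper: TRUE when the parsed page contains at least one <table>.
has_table <- function(page) {
  length(html_nodes(page, "table")) > 0
}

# Literal HTML strings stand in for downloaded pages.
with_table    <- read_html("<html><body><table><tr><td>1</td></tr></table></body></html>")
without_table <- read_html("<html><body><p>No odds yet</p></body></html>")

print(has_table(with_table))
print(has_table(without_table))
```

Inside the loop, the page would be read once with read_html, and html_table would only be called when has_table(page) is TRUE.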

Scraping data from a dynamic web page (.asp) with R

I'm trying to scrape some data using this code.
require(XML)
tables <- readHTMLTable('http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp')
str(tables, max.level = 1)
df <- tables$searchResults
It works perfectly, but the problem is that it only gives me data for the first 188 observations, which correspond to the players whose position is "Base". Whenever I try to get data for "Pivot" or "Alero" players, it gives me the same info. Since the URL never changes, I don't know how to get this info.

requesting data from the Center for Disease Control using RSocrata or XML in R

My goal is to obtain a time series from 1996 week 1 to week 46 of 2016 of legionellosis cases from this website supported by the Center for Disease Control (CDC) of the United States. A coworker attempted to scrape only tables that contain legionellosis cases with the code below:
#install.packages('rvest')
library(rvest)
## Code to get all URLS
getUrls <- function(y1,y2,clist){
root="https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
root1="&mmwr_week="
root2="&mmwr_table=2"
root3="&request=Submit&mmwr_location="
urls <- NULL
for (year in y1:y2){
for (week in 1:53){
for (part in clist) {
urls <- c(urls,(paste(root,year,root1,week,root2,part,root3,sep="")))
}
}
}
return(urls)
}
TabList<-c("A","B") ## can change to get not just 2 parts of the table but as many as needed.
WEB <- as.data.frame(getUrls(1996,2014,TabList)) # Only applies from 1996-2014. After 2014, the root url changes.
head(WEB)
#Example of how to extract data from a single webpage.
url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table, fill = TRUE)[[2]]
#test if Legionellosis is in the table. Returns a vector showing the columns index if the text is found.
#Can use this command to filter only pages that you need and select only those columns.
test <- grep("Leg", sb)
sb <- sb[,c(1,test)]
### This code only works if you have 3 columns for headings. Need to adapt to be more general for all tables.
#Get Column names
colnames(sb) <- paste(sb[2,], sb[3,], sep="_")
colnames(sb)[1] <- "Area"
sb <- sb[-c(1:3),]
#Remove commas from numbers so that you can then convert columns to numerical values. Only important if numbers above 1000
Dat <- sapply(sb, FUN= function(x)
as.character(gsub(",", "", as.character(x), fixed = TRUE)))
Dat<-as.data.frame(Dat, stringsAsFactors = FALSE)
However, the code is not finished, and I thought it might be best to use the API, since the structure and layout of the tables in the web pages change over time. This way we wouldn't have to comb through the tables to figure out when the layout changes and how to adjust the web scraping code accordingly. Thus I attempted to pull the data from the API.
Now, I found two help documents from the CDC that provide the data. One appears to provide data from 2014 onward using RSocrata, which can be seen here, while the other appears to be more general and uses XML-format requests over HTTP, which can be seen here. The XML-format request over HTTP required a database ID which I could not find. Then I stumbled onto RSocrata and decided to try that instead. But the code snippet provided, along with the token ID I set up, did not work.
install.packages("RSocrata")
library("RSocrata")
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")
How can I fix this? My end goal is a table of legionellosis cases from 1996 to 2016 on a weekly basis by state.
I'd recommend checking out this issue thread in the RSocrata GitHub repo where they're discussing a similar issue with passing tokens into the RSocrata library.
In the meantime, you can actually leave off the $$app_token parameter, and as long as you're not flooding us with requests, it'll work just fine. There's a throttling limit you can sneak under without using an app token.
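As a rough sketch of the suggestion above, the token parameter can be stripped from the URL before it is handed to read.socrata. The helper drop_app_token is a hypothetical name written for this example, and the token value "abc" is a placeholder; only base R is used, so nothing is requested from the server here.

```r
# Hypothetical helper: remove a "$$app_token=..." query parameter from a URL.
drop_app_token <- function(url) {
  # Delete the parameter, keeping the "?" or "&" that preceded it.
  url <- gsub("([?&])\\$\\$app_token=[^&]*&?", "\\1", url)
  # Tidy up a "?" or "&" left dangling at the end.
  sub("[?&]$", "", url)
}

clean_url <- drop_app_token("https://data.cdc.gov/resource/cmap-p7au?$$app_token=abc")
print(clean_url)
```

The cleaned URL could then be passed along as read.socrata(clean_url), assuming the dataset is still published at that address.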
