R: Cannot download a file from the web

I can download in the browser a file from this website
https://www.cmegroup.com/ftp/pub/settle/comex_future.csv
However, when I try the following:
url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
dest <- "C:\\COMEXfut.csv"
download.file(url, dest)
I get the following error message:
Error in download.file(url, dest) :
cannot open URL 'https://www.cmegroup.com/ftp/pub/settle/comex_future.csv'
In addition: Warning message:
In download.file(url, dest) :
InternetOpenUrl failed: 'The operation timed out'
even if I set:
options(timeout = max(600, getOption("timeout")))
Any idea why this is happening? Thanks!

The problem here is that the site you are downloading from needs a couple of additional headers. The easiest way to supply them is with the httr package:
library(httr)
url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
UA <- paste('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0)',
'Gecko/20100101 Firefox/98.0')
res <- GET(url, add_headers(`User-Agent` = UA, Connection = 'keep-alive'))
This should download in less than a second.
If you want to save the file, you can do:
writeBin(res$content, 'myfile.csv')
Or if you just want to read the data straight into R without even saving it, you can do:
content(res)
#> Rows: 527 Columns: 20
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): PRODUCT SYMBOL, CONTRACT MONTH, CONTRACT DAY, CONTRACT, PRODUCT DESCRIPTIO...
#> dbl (10): CONTRACT YEAR, OPEN, HIGH, LOW, LAST, SETTLE, EST. VOL, PRIOR SETTLE, PRIO...
#>
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 527 x 20
#> `PRODUCT SYMBOL` `CONTRACT MONTH` `CONTRACT YEAR` `CONTRACT DAY` CONTRACT
#> <chr> <chr> <dbl> <chr> <chr>
#> 1 0GC 07 2022 NA 0GCN22
#> 2 4GC 03 2022 NA 4GCH22
#> 3 4GC 05 2022 NA 4GCK22
#> 4 4GC 06 2022 NA 4GCM22
#> 5 4GC 08 2022 NA 4GCQ22
#> 6 4GC 10 2022 NA 4GCV22
#> 7 4GC 12 2022 NA 4GCZ22
#> 8 4GC 02 2023 NA 4GCG23
#> 9 4GC 04 2023 NA 4GCJ23
#> 10 4GC 06 2023 NA 4GCM23
#> # ... with 517 more rows, and 15 more variables: PRODUCT DESCRIPTION <chr>, OPEN <dbl>,
#> # HIGH <dbl>, HIGH AB INDICATOR <chr>, LOW <dbl>, LOW AB INDICATOR <chr>, LAST <dbl>,
#> # LAST AB INDICATOR <chr>, SETTLE <dbl>, PT CHG <chr>, EST. VOL <dbl>,
#> # PRIOR SETTLE <dbl>, PRIOR VOL <dbl>, PRIOR INT <dbl>, TRADEDATE <chr>
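If you prefer to stay in base R, download.file() also accepts extra HTTP headers on R 4.0.0 or later, so the same user-agent trick works there too. A minimal sketch, assuming a recent R version and using binary mode for Windows:
url  <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
dest <- "C:\\COMEXfut.csv"
UA <- paste('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0)',
            'Gecko/20100101 Firefox/98.0')
# the headers argument is only available from R 4.0.0 onwards
download.file(url, dest, mode = "wb",
              headers = c(`User-Agent` = UA, Connection = "keep-alive"))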

Related

R: Error in `html_form_submit()`: `form` doesn't contain a `action` attribute

I'm trying to automate downloading of the data contained here:
https://www.offenerhaushalt.at/gemeinde/innsbruck/download
I can fairly easily specify the form, either through the URL, like this:
https://www.offenerhaushalt.at/gemeinde/innsbruck/download?year=2022&haushalt=fhh&rechnungsabschluss=va&origin=gemeinde
Or through the rvest function html_form(), but I cannot submit the form, as html_form_submit() throws the error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
library(rvest)
library(tidyverse)
html_form(read_html("https://www.offenerhaushalt.at/gemeinde/innsbruck/download"))[[1]] %>%
html_form_set(year = "2022",
haushalt = "fhh",
rechnungsabschluss = "va",
origin = "gemeinde") %>%
html_form_submit()
Any ideas on how to capture the file that is generated afterwards and download it?
It seems to me that it sends the "action" to a url that looks like: https://www.offenerhaushalt.at/downloads/ghdByParams
But I'm not sure what to do with that.
Thanks all!
You can manually set the action URL for that form:
library(rvest)
library(purrr)
dl_url <- "https://www.offenerhaushalt.at/gemeinde/innsbruck/download"
sess <- session(dl_url)
form <- sess %>% read_html() %>% html_form() %>% .[[1]]
# list valid options for select boxes
map(form$fields, "options") %>% keep(~ length(.x) > 0) %>%
imap_dfr(~ list(field = .y, options = paste(.x, collapse = " ")))
#> # A tibble: 4 × 2
#> field options
#> <chr> <chr>
#> 1 haushalt default fhh ehh vhh
#> 2 rechnungsabschluss default ra va
#> 3 year default 2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 …
#> 4 origin default statistik_at gemeinde
# set values
form$fields$haushalt$value <- "fhh"
form$fields$rechnungsabschluss$value <- "ra"
form$fields$year$value <- "2020"
form$fields$origin$value <- "statistik_at"
# manually set form method & action
form$method <- "POST"
form$action <- "https://www.offenerhaushalt.at/downloads/ghdByParams"
# submit form
sess <- session_submit(sess, form)
# response headers
imap_dfr(sess$response$headers, ~ list(header = .y, value = .x))
#> # A tibble: 10 × 2
#> header value
#> <chr> <chr>
#> 1 date Sat, 21 Jan 2023 01:47:13 GMT
#> 2 server Apache
#> 3 content-disposition attachment; filename=offenerhaushalt_70101_2020_ra_fhh.c…
#> 4 pragma no-cache
#> 5 cache-control must-revalidate, post-check=0, pre-check=0, private
#> 6 expires 0
#> 7 set-cookie XSRF-TOKEN=eyJpdiI6IjdHd2pSakwzV09xb3Jab05zXC81em1RPT0iL…
#> 8 set-cookie offener_haushalt_session=eyJpdiI6IjI5cUN5MGhCSmVadmN5enV…
#> 9 transfer-encoding chunked
#> 10 content-type text/csv; charset=UTF-8
# parse attached CSV
httr::content(sess$response, as = "text") %>% readr::read_csv2()
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1408 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (8): ansatz_uab, ansatz_ugl, konto_grp, konto_ugl, sonst_ugl, vorhabenco...
#> dbl (2): mvag, wert
#> lgl (1): verguetung
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,408 × 11
#> ansat…¹ ansat…² konto…³ konto…⁴ sonst…⁵ vergu…⁶ vorha…⁷ mvag ansat…⁸ konto…⁹
#> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <chr> <chr>
#> 1 000 000 042 000 000 NA 0000000 3415 Gewähl… Amts-,…
#> 2 000 000 070 000 000 NA 0000000 3411 Gewähl… Aktivi…
#> 3 000 000 400 000 000 NA 0000000 3221 Gewähl… Gering…
#> 4 000 000 413 000 000 NA 0000000 3221 Gewähl… Handel…
#> 5 000 000 456 000 000 NA 0000000 3221 Gewähl… Schrei…
#> 6 000 000 457 000 000 NA 0000000 3221 Gewähl… Druckw…
#> 7 000 000 459 000 000 NA 0000000 3221 Gewähl… Sonsti…
#> 8 000 000 618 000 000 NA 0000000 3224 Gewähl… Instan…
#> 9 000 000 621 000 000 NA 0000000 3222 Gewähl… Sonsti…
#> 10 000 000 631 000 000 NA 0000000 3222 Gewähl… Teleko…
#> # … with 1,398 more rows, 1 more variable: wert <dbl>, and abbreviated variable
#> # names ¹​ansatz_uab, ²​ansatz_ugl, ³​konto_grp, ⁴​konto_ugl, ⁵​sonst_ugl,
#> # ⁶​verguetung, ⁷​vorhabencode, ⁸​ansatz_text, ⁹​konto_text
As rvest accepts and passes on httr configs, attached files can be saved directly too:
dest_file <- tempfile(fileext = ".csv")
session_submit(sess, form, submit = NULL, httr::write_disk(dest_file))
# browseURL(dirname(dest_file))
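If you would rather keep the file name the server suggests in the content-disposition header shown above, it can be pulled out of the response directly; a small sketch, assuming that header is present:
# extract the attachment file name from the response headers
cd <- httr::headers(sess$response)[["content-disposition"]]
fname <- sub(".*filename=", "", cd)
# write the raw body to a file with that name
writeBin(httr::content(sess$response, as = "raw"), file.path(tempdir(), fname))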

Webscrape Table from FantasyLabs

I am trying to webscrape historical DFS NFL ownership from fantasylabs.com using RSelenium. I am able to navigate to the page and even able to highlight the element I am trying to scrape, but am coming up with an error when I put it into a table.
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"webElement"’
I have looked up the error but cannot seem to find a reason for it. I am essentially trying to follow this Stack Overflow example for this web-scraping problem. Could someone help me understand why I am not able to scrape this table, and what I could do differently?
here is my full code:
library(RSelenium)
library(XML)
library(RCurl)
# start the Selenium server
rdriver <- rsDriver(browser = "chrome",
                    chromever = "106.0.5249.61")
# creating a client object and opening the browser
obj <- rdriver$client
# navigate to the url
appURL <- 'https://www.fantasylabs.com/nfl/contest-ownership/?date=10112022'
obj$navigate(appURL)
obj$findElement(using = 'xpath', '//*[@id="ownershipGrid"]')$highlightElement()
tableElem <- obj$findElement(using = 'xpath', '//*[@id="ownershipGrid"]')
projTable <- readHTMLTable(tableElem, header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])
dvpCTable <- projTable[[1]]
dvpCTable
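The error itself occurs because the webElement object is passed straight to readHTMLTable(), which has no method for that class. A minimal sketch of a fix that keeps your Selenium session is to hand it the element's HTML instead (assuming the grid really renders as a plain <table>):
# pull the rendered HTML out of the element and parse that
tableHTML <- tableElem$getElementAttribute("outerHTML")[[1]]
projTable <- readHTMLTable(htmlParse(tableHTML), header = TRUE)
dvpCTable <- projTable[[1]]
An alternative that avoids the browser entirely is to query the fantasylabs.com contest-ownership API directly with httr2: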
library(tidyverse)
library(httr2)
"https://www.fantasylabs.com/api/contest-ownership/1/10_12_2022/4/75377/0/" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
as_tibble
#> # A tibble: 43 x 4
#> Prope~1 $Fant~2 $Posi~3 $Play~4 $Team $Salary $Actu~5 Playe~6 SortV~7 Fanta~8
#> <int> <int> <chr> <chr> <chr> <int> <dbl> <int> <lgl> <int>
#> 1 50882 1376298 TE Albert~ "DEN" 2800 NA 50882 NA 1376298
#> 2 51124 1376299 TE Andrew~ "DEN" 2500 1.7 51124 NA 1376299
#> 3 33781 1385590 RB Austin~ "LAC" 7500 24.3 33781 NA 1385590
#> 4 55217 1376255 QB Brett ~ "DEN" 5000 NA 55217 NA 1376255
#> 5 2409 1376309 QB Chase ~ "LAC" 4800 NA 2409 NA 1376309
#> 6 40663 1385288 WR Courtl~ "DEN" 6100 3.4 40663 NA 1385288
#> 7 50854 1376263 RB Damare~ "DEN" 4000 NA 50854 NA 1376263
#> 8 8580 1376342 WR DeAndr~ "LAC" 3600 4.7 8580 NA 1376342
#> 9 8472 1376304 D Denver~ "DEN" 2500 7 8472 NA 1376304
#> 10 62112 1376262 RB Devine~ "" 4000 NA 62112 NA 1376262
#> # ... with 33 more rows, 34 more variables:
#> # Properties$`$5 NFL $70K Flea Flicker [$20K to 1st] (Mon-Thu)` <dbl>,
#> # $Average <dbl>, $Volatility <lgl>, $GppGrade <chr>, $MyExposure <lgl>,
#> # $MyLeverage <lgl>, $MyLeverage_rnk <lgl>, $MediumOwnership_pct <lgl>,
#> # $PlayerId_rnk <int>, $PlayerId_pct <dbl>, $FantasyResultId_rnk <int>,
#> # $FantasyResultId_pct <dbl>, $Position_rnk <lgl>, $Position_pct <lgl>,
#> # $Player_Name_rnk <lgl>, $Player_Name_pct <lgl>, $Team_rnk <lgl>, ...
Created on 2022-11-03 with reprex v2.0.2
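Note that the Properties column above is itself a nested data frame (that is what the $-prefixed names indicate). If you prefer plain columns, jsonlite::flatten() can unnest it first; a sketch, assuming the endpoint keeps returning the same shape:
library(httr2)
resp <- "https://www.fantasylabs.com/api/contest-ownership/1/10_12_2022/4/75377/0/" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE)
# nested columns such as Properties$Salary become Properties.Salary
own <- jsonlite::flatten(resp)
tibble::as_tibble(own)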

Scrape webpage that requires button click

I am trying to scrape data from the link below. I need to click the CSV button on the webpage to download a CSV file.
library(netstat)
library(RSelenium)
url <- "https://gtr.ukri.org/search/project?term=%22climate+change%22+OR+%22climate+crisis%22&fetchSize=25&selectedSortableField=&selectedSortOrder=&fields=pro.gr%2Cpro.t%2Cpro.a%2Cpro.orcidId%2Cper.fn%2Cper.on%2Cper.sn%2Cper.fnsn%2Cper.orcidId%2Cper.org.n%2Cper.pro.t%2Cper.pro.abs%2Cpub.t%2Cpub.a%2Cpub.orcidId%2Corg.n%2Corg.orcidId%2Cacp.t%2Cacp.d%2Cacp.i%2Cacp.oid%2Ckf.d%2Ckf.oid%2Cis.t%2Cis.d%2Cis.oid%2Ccol.i%2Ccol.d%2Ccol.c%2Ccol.dept%2Ccol.org%2Ccol.pc%2Ccol.pic%2Ccol.oid%2Cip.t%2Cip.d%2Cip.i%2Cip.oid%2Cpol.i%2Cpol.gt%2Cpol.in%2Cpol.oid%2Cprod.t%2Cprod.d%2Cprod.i%2Cprod.oid%2Crtp.t%2Crtp.d%2Crtp.i%2Crtp.oid%2Crdm.t%2Crdm.d%2Crdm.i%2Crdm.oid%2Cstp.t%2Cstp.d%2Cstp.i%2Cstp.oid%2Cso.t%2Cso.d%2Cso.cn%2Cso.i%2Cso.oid%2Cff.t%2Cff.d%2Cff.c%2Cff.org%2Cff.dept%2Cff.oid%2Cdis.t%2Cdis.d%2Cdis.i%2Cdis.oid%2Ccpro.rtpc%2Ccpro.rcpgm%2Ccpro.hlt&type=#/csvConfirm"
I am struggling to implement that using Selenium. Here is the code I have so far.
rD <- rsDriver(port= free_port(), browser = "chrome", chromever = "106.0.5249.21", check = TRUE, verbose = TRUE)
remote_driver <- rD[["client"]]
remDr <- rD$client
remDr$navigate(url)
webElem <- remDr$findElement(using = "css", "content gtr-body d-flex flex-column ng-scope")
webElem$clickElement()
You can often just record the network log and see what request is sent when hitting the download button. In Chrome, right-click, choose Inspect, then look for the Network tab. In this case there is only one request sent:
Right-click the request and choose "Copy as cURL" to see the whole request, or just copy the URL, since the cookies and headers are not necessary here. I wrote a quick function around the task of querying the site:
dl_ukri <- function(query,
                    destfile = paste0(query, ".csv"),
                    size = 25L,
                    quiet_download = FALSE) {
  url <- paste0(
    "https://gtr.ukri.org/search/project/csv?term=",
    urltools::url_encode(query),
    "&selectedFacets=&fields=acp.d,is.t,prod.t,pol.oid,acp.oid,rtp.t,pol.in,prod.i,per.pro.abs,acp.i,col.org,acp.t,is.d,is.oid,cpro.rtpc,prod.d,stp.oid,rtp.i,rdm.oid,rtp.d,col.dept,ff.d,ff.c,col.pc,pub.t,kf.d,dis.t,col.oid,pro.t,per.sn,org.orcidId,per.on,ff.dept,rdm.t,org.n,dis.d,prod.oid,so.cn,dis.i,pro.a,pub.orcidId,pol.gt,rdm.i,rdm.d,so.oid,per.fnsn,per.org.n,per.pro.t,pro.orcidId,pub.a,col.d,per.orcidId,col.c,ip.i,pro.gr,pol.i,so.t,per.fn,col.i,ip.t,ff.oid,stp.i,so.i,cpro.rcpgm,cpro.hlt,col.pic,so.d,ff.t,ip.d,dis.oid,ip.oid,stp.d,rtp.oid,ff.org,kf.oid,stp.t&type=&selectedSortableField=score&selectedSortOrder=DESC"
  )
  curl::curl_download(url, destfile, quiet = quiet_download)
}
Testing this with your original search:
dl_ukri('"climate change" OR "climate crisis"', destfile = "test.csv")
readr::read_csv("test.csv")
#> Rows: 5894 Columns: 25
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (23): FundingOrgName, ProjectReference, LeadROName, Department, ProjectC...
#> dbl (2): AwardPounds, ExpenditurePounds
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5,894 × 25
#> FundingOrgN…¹ Proje…² LeadR…³ Depar…⁴ Proje…⁵ PISur…⁶ PIFir…⁷ PIOth…⁸ PI OR…⁹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ESRC ES/W00… Univer… School… Fellow… Thew Harriet Christ… http:/…
#> 2 AHRC AH/W00… Univer… Arts L… Resear… Scott Peter Manley <NA>
#> 3 AHRC 2609218 Queen … Drama Studen… <NA> <NA> <NA> <NA>
#> 4 UKRI MR/V02… Univer… Politi… Fellow… Spaiser Viktor… <NA> http:/…
#> 5 MRC MC_PC_… Univer… <NA> Intram… Alessi Dario Renato <NA>
#> 6 AHRC 1948811 Royal … School… Studen… <NA> <NA> <NA> <NA>
#> 7 EPSRC 2688399 Brunel… Chemic… Studen… <NA> <NA> <NA> <NA>
#> 8 ESRC ES/T01… Univer… Social… Resear… Walker Cather… Louise http:/…
#> 9 AHRC AH/X00… Queen … Drama Resear… Herita… Paul <NA> http:/…
#> 10 ESRC 2272756 Univer… Sch of… Studen… <NA> <NA> <NA> <NA>
#> # … with 5,884 more rows, 16 more variables: StudentSurname <chr>,
#> # StudentFirstName <chr>, StudentOtherNames <chr>, `Student ORCID iD` <chr>,
#> # Title <chr>, StartDate <chr>, EndDate <chr>, AwardPounds <dbl>,
#> # ExpenditurePounds <dbl>, Region <chr>, Status <chr>, GTRProjectUrl <chr>,
#> # ProjectId <chr>, FundingOrgId <chr>, LeadROId <chr>, PIId <chr>, and
#> # abbreviated variable names ¹​FundingOrgName, ²​ProjectReference, ³​LeadROName,
#> # ⁴​Department, ⁵​ProjectCategory, ⁶​PISurname, ⁷​PIFirstName, ⁸​PIOtherNames, …
Created on 2022-10-17 with reprex v2.0.2
Voilà. I also played around with the fetchSize=25 parameter from the original URL, but it does not seem to do anything, so I just omitted it.
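If you need several searches, the helper can simply be called in a loop, with a short pause between downloads to stay polite; a sketch (the queries and file names here are only illustrative):
queries <- c('"climate change" OR "climate crisis"', '"biodiversity loss"')
for (q in queries) {
  dl_ukri(q, destfile = paste0(make.names(q), ".csv"))
  Sys.sleep(5)  # courtesy pause between requests
}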

Read file from Google Drive

I have a spreadsheet uploaded as a CSV file in Google Drive, unlocked so users can read from it.
This is the link to the csv file:
https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/edit?usp=sharing
I am trying to read it from R but I am getting a long list of error messages. I am using:
id = "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
read.csv(sprintf("https://docs.google.com/spreadsheets/d/uc?id=%s&export=download", id))
Could someone suggest how to read files from Google Drive directly into R?
I would try to publish the sheet as a CSV file (doc), and then read it from there.
It seems like your file is already published as a CSV, so this should work (note that the URL ends with /pub?output=csv):
read.csv("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
To read the CSV file faster you can use vroom, which is even faster than fread().
Now using vroom,
library(vroom)
vroom("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
#> Rows: 387048 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): StationCode, SampleID, WeatherCode, OrganismCode, race, race2
#> dbl (7): WaterTemperature, Turbidity, Velocity, ForkLength, Weight, Count, ...
#> date (1): SampleDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 387,048 × 14
#> StationCode SampleDate SampleID WeatherCode WaterTemperature Turbidity
#> <chr> <date> <chr> <chr> <dbl> <dbl>
#> 1 Gate 11 2000-04-25 116_00 CLD 13.1 2
#> 2 Gate 5 1995-04-26 117_95 CLR NA 2
#> 3 Gate 2 1995-04-21 111_95 W 10.4 12
#> 4 Gate 6 2008-12-13 348_08 CLR 49.9 1.82
#> 5 Gate 5 1999-12-10 344_99 CLR 7.30 1.5
#> 6 Gate 6 2012-05-25 146_12 CLR 55.5 1.60
#> 7 Gate 10 2011-06-28 179_11 RAN 57.3 3.99
#> 8 Gate 11 1996-04-25 116_96 CLR 13.8 21
#> 9 Gate 9 2007-07-02 183_07 CLR 56.6 2.09
#> 10 Gate 6 2009-06-04 155_09 CLR 58.6 3.08
#> # … with 387,038 more rows, and 8 more variables: Velocity <dbl>,
#> # OrganismCode <chr>, ForkLength <dbl>, Weight <dbl>, Count <dbl>,
#> # race <chr>, year <dbl>, race2 <chr>
Created on 2022-07-08 by the reprex package (v2.0.1)
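If you would rather not depend on a published URL at all, the googlesheets4 package can read a public sheet by its ID once authentication is switched off; a minimal sketch:
library(googlesheets4)
gs4_deauth()  # public sheet, so no login is needed
read_sheet("170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk")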

R - for loop url

Evening all, I'm having a few issues at the moment scraping data from multiple web pages.
library(RCurl)
library(XML)
tables <- readHTMLTable(getURL("https://www.basketball-reference.com/leagues/NBA_2018_games.html"))
for (i in c("october", "november", "december", "january")) {
readHTMLTable(getURL(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",i,".html")))
regular <- tables[["schedule"]]
write.csv(regular, file = paste0("./", i, i, ".csv"))
}
I'm having an issue where it doesn't appear to be looping through the months: it just saves four files that all contain the October data.
Any help appreciated.
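The immediate problem is that the result of readHTMLTable() inside the loop is never assigned, so regular is always built from the tables object created before the loop, which only holds the first month's schedule. Assigning inside the loop is the minimal fix; a sketch that keeps your original approach and writes one file per month:
library(RCurl)
library(XML)
for (i in c("october", "november", "december", "january")) {
  # assign the month's tables so each iteration uses fresh data
  tables <- readHTMLTable(getURL(paste0(
    "https://www.basketball-reference.com/leagues/NBA_2018_games-", i, ".html")))
  regular <- tables[["schedule"]]
  write.csv(regular, file = paste0("./", i, ".csv"))
}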
This is not the most elegant way, but it works well. Hope it helps.
Code for the web scraping:
rm(list = ls())
if (!require("rvest")) { install.packages("rvest"); library("rvest") }
for (i in c("october", "november", "december", "january")) {
  nba_url <- read_html(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", i, ".html"))
  # left part of the table
  left <- nba_url %>%
    html_nodes(".left") %>%   # left-aligned cells (date, teams)
    html_text()
  left <- left[-length(left)]
  left <- left[-(1:4)]
  # assign specific values
  Date    <- left[seq(1, length(left), 4)]
  Visitor <- left[seq(2, length(left), 4)]
  Home    <- left[seq(3, length(left), 4)]
  # right part of the table
  right <- nba_url %>%
    html_nodes(".right") %>%  # right-aligned cells (start time, points)
    html_text()
  right <- right[-length(right)]
  right <- right[-(1:2)]
  # assign specific values
  Start <- right[seq(1, length(right), 3)]
  PTS1  <- right[seq(2, length(right), 3)]
  PTS2  <- right[seq(3, length(right), 3)]
  nba_data <- data.frame(Date, Start, Visitor, PTS1, Home, PTS2)
  write.csv(nba_data, file = paste0("./", i, i, ".csv"))
}
This is a solution using the tidyverse to scrape this website. But first we check the site's robots.txt file to get a sense of the rate limit for requests. See the post "Analyzing 'Crawl-Delay' Settings in Common Crawl robots.txt Data with R" for further information.
library(spiderbar)
library(robotstxt)
rt <- robxp(get_robotstxt("https://www.basketball-reference.com"))
crawl_delays(rt)
#> agent crawl_delay
#> 1 * 3
#> 2 ahrefsbot -1
#> 3 twitterbot -1
#> 4 slysearch -1
#> 5 ground-control -1
#> 6 groundcontrol -1
#> 7 matrix -1
#> 8 hal9000 -1
#> 9 carmine -1
#> 10 the-matrix -1
#> 11 skynet -1
We are interested in the * value: we have to wait a minimum of 3 seconds between requests. We will use 5 seconds.
We use the tidyverse ecosystem to build the urls and iterate through them to get a table with all the data.
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
month_sub <- c("october", "november", "december", "january")
urls <- map_chr(month_sub, ~ paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", .,".html"))
urls
#> [1] "https://www.basketball-reference.com/leagues/NBA_2018_games-october.html"
#> [2] "https://www.basketball-reference.com/leagues/NBA_2018_games-november.html"
#> [3] "https://www.basketball-reference.com/leagues/NBA_2018_games-december.html"
#> [4] "https://www.basketball-reference.com/leagues/NBA_2018_games-january.html"
pb <- progress_estimated(length(urls))
map(urls, ~{
  url <- .
  pb$tick()$print()
  Sys.sleep(5) # we take 5 sec
  tables <- read_html(url) %>%
    # we select the table part by its table id tag
    html_nodes("#schedule") %>%
    # we extract the table
    html_table() %>%
    # we get a 1-element list, so we flatten it to get a tibble
    flatten_df()
}) -> tables
# we get a list of tables, one per month
str(tables, 1)
#> List of 4
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 104 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 213 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 227 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 216 obs. of 8 variables:
# we can get all the data in one table by binding rows.
# As we saw on the website, there are 2 empty columns with no names,
# so we need to take care of them with repair_names before row binding
res <- tables %>%
map_df(tibble::repair_names)
res
#> # A tibble: 760 x 8
#> Date `Start (ET)` `Visitor/Neutral` PTS
#> <chr> <chr> <chr> <int>
#> 1 Tue, Oct 17, 2017 8:01 pm Boston Celtics 102
#> 2 Tue, Oct 17, 2017 10:30 pm Houston Rockets 121
#> 3 Wed, Oct 18, 2017 7:30 pm Milwaukee Bucks 100
#> 4 Wed, Oct 18, 2017 8:30 pm Atlanta Hawks 111
#> 5 Wed, Oct 18, 2017 7:00 pm Charlotte Hornets 102
#> 6 Wed, Oct 18, 2017 7:00 pm Brooklyn Nets 140
#> 7 Wed, Oct 18, 2017 8:00 pm New Orleans Pelicans 103
#> 8 Wed, Oct 18, 2017 7:00 pm Miami Heat 116
#> 9 Wed, Oct 18, 2017 10:00 pm Portland Trail Blazers 76
#> 10 Wed, Oct 18, 2017 10:00 pm Houston Rockets 100
#> # ... with 750 more rows, and 4 more variables: `Home/Neutral` <chr>,
#> # V1 <chr>, V2 <chr>, Notes <lgl>
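Since the original goal was one CSV file per month, the list of per-month tables can also be written out directly; a short sketch using the month_sub vector from above:
purrr::walk2(tables, month_sub,
             ~ write.csv(.x, file = paste0("./", .y, ".csv"), row.names = FALSE))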
