I'm trying to get the data from the "Team Statistics" table on this webpage:
https://www.hockey-reference.com/teams/CGY/2010.html
I don't have a lot of experience with web scraping, but have made a few attempts with the XML package and now with the rvest package:
library(rvest)
url <- html("https://www.hockey-reference.com/teams/CGY/2010.html")
url %>%
html_node(xpath = "//*[@id='team_stats']")
And end up with what appears to be a single node:
{xml_node}
<table class="sortable stats_table" id="team_stats" data-cols-to-freeze="1">
[1] <caption>Team Statistics Table</caption>
[2] <colgroup>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\ ...
[3] <thead><tr>\n<th aria-label="Team" data-stat="team_name" sco ...
[4] <tbody>\n<tr>\n<th scope="row" class="left " data-stat="team ...
How do I parse this to get just the header and the information in the two-row table?
You just need to use read_html() rather than the deprecated html(), and add html_table at the end of the chain:
library(rvest)
url <- read_html("https://www.hockey-reference.com/teams/CGY/2010.html")
url %>%
html_node(xpath = "//*[@id='team_stats']") %>%
html_table()
Alternatively:
library(rvest)
url %>%
html_table() %>%
.[[1]]
Both solutions return:
Team AvAge GP W L OL PTS PTS% GF GA SRS SOS TG/G PP PPO PP% PPA PPOA PK% SH SHA S S% SA SV% PDO
1 Calgary Flames 28.8 82 40 32 10 90 0.549 201 203 -0.03 0.04 5.05 43 268 16.04 54 305 82.30 7 1 2350 8.6 2367 0.916 100.1
2 League Average 27.9 82 41 31 10 92 0.561 233 233 0.00 0.00 5.68 56 304 18.23 56 304 81.77 6 6 2486 9.1 2479 0.911 NA
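As a side note, newer rvest releases (1.0+) rename html_node()/html_nodes() to html_element()/html_elements(), and a CSS id selector is equivalent to the XPath above. A minimal, self-contained sketch of the same pipeline; the inline HTML here is just a stand-in for the real page:

```r
library(rvest)

# A stand-in for the real page: a tiny table with the same id
html <- minimal_html('
  <table id="team_stats">
    <thead><tr><th>Team</th><th>GP</th><th>W</th></tr></thead>
    <tbody>
      <tr><td>Calgary Flames</td><td>82</td><td>40</td></tr>
      <tr><td>League Average</td><td>82</td><td>41</td></tr>
    </tbody>
  </table>')

# The CSS selector "#team_stats" is equivalent to //*[@id="team_stats"]
team_stats <- html %>%
  html_element("#team_stats") %>%
  html_table()

team_stats
```

On the real page, swapping minimal_html() for read_html(url) should give the same shape of result.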
I want to scrape data from a public web page of a Google sheet. This is the link.
I am specifically interested in the data in the 4th tab, "US daily 4 pm ET", however the url for that tab is the same as for all the other tabs (at least according to the address bar of the browsers I've tried - both Chrome and Firefox). When I try to scrape the data using the rvest package in R, I end up with the data from the 2nd tab, "States current".
I did a right-click to inspect the 1st tab, "README", to see if I could figure something out about the tab names. It looks like the name of the 4th tab is sheet-button-916628299. But entering URLS in my browser that ended with /pubhtml#gid=sheet-button-916628299 or /pubhtml#gid=916628299 didn't take me to the 4th tab.
How can I find a URL that takes me (and, more importantly, the rvest package in R) to the data in the 4th tab?
This is fairly straightforward: the data for all the tabs is loaded on the page already rather than being loaded by XHR requests. The contents of each tab are just hidden or shown by CSS.
If you use the developer pane in your browser, you can see that each tab is in a div with a numerical id which is given by the number in the id of each tab.
We can get the page and make a dataframe of the correct css selectors to get each tab's contents like this:
library(rvest)
url <- paste0("https://docs.google.com/spreadsheets/u/2/d/e/",
"2PACX-1vRwAqp96T9sYYq2-i7Tj0pvTf6XVHjDSMIKBdZ",
"HXiCGGdNC0ypEU9NbngS8mxea55JuCFuua1MUeOj5/pubhtml#")
page <- read_html(url)
tabs <- html_nodes(page, xpath = "//li")
tab_df <- data.frame(name = tabs %>% html_text,
css = paste0("#", gsub("\\D", "", html_attr(tabs, "id"))),
stringsAsFactors = FALSE)
tab_df
#> name css
#> 1 README #1600800428
#> 2 States current #1189059067
#> 3 US current #294274214
#> 4 States daily 4 pm ET #916628299
#> 5 US daily 4 pm ET #964640830
#> 6 States #1983833656
So now we can get the contents of, say, the fourth tab like this:
html_node(page, tab_df$css[4]) %>% html_nodes("table") %>% html_table()
#> [[1]]
#>
#> 1 1 Date State Positive Negative Pending Death Total
#> 2 NA
#> 3 2 20200314 AK 1 143 144
#> 4 3 20200314 AL 6 22 46 74
#> 5 4 20200314 AR 12 65 26 103
#> 6 5 20200314 AZ 12 121 50 0 183
#> 7 6 20200314 CA 252 916 5 1,168
#> 8 7 20200314 CO 101 712 1 814
#> 9 8 20200314 CT 11 125 136
#> 10 9 20200314 DC 10 49 10 69
#> 11 10 20200314 DE 6 36 32 74
#> 12 11 20200314 FL 77 478 221 3 776
#> 13 12 20200314 GA 66 1 66
#> 14 13 20200314 HI 2 2
#> 15 14 20200314 IA 17 83 100
#> .... (535 rows in total)
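The id-mangling step above (stripping everything but the digits from each tab button's id) is easy to check in isolation. A small sketch with made-up ids of the same shape as the ones on the page:

```r
# Tab buttons carry ids like "sheet-button-916628299"; the numeric part
# doubles as the id of the div holding that tab's contents
tab_ids <- c("sheet-button-1600800428", "sheet-button-916628299")

# Keep only the digits, then prefix "#" to form a CSS id selector
css <- paste0("#", gsub("\\D", "", tab_ids))
css
```

Each resulting selector ("#1600800428", "#916628299") can then be passed to html_node() as in the answer above.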
Hello, I am attempting to web scrape in R, and this one particular website is giving me a lot of trouble. I wish to extract the table from here:
https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017
what I have tried
code:
url = 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
webpage = read_html(url)
data = webpage %>% html_nodes('p') %>% html_text()
data
Output:
[1] "\r\n The page could not be loaded. This web site
currently does not fully support browsers with \"JavaScript\" disabled.
Please note that if you choose to continue without enabling
\"JavaScript\" certain functionalities on this website may not be
available.\r\n
In cases like this, you may want to use RSelenium with Docker to scrape a JavaScript website:
require("RSelenium")
require("rvest")
system('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- RSelenium::remoteDriver(
remoteServerAddr = "localhost",
port = 4445L,
browserName = "firefox"
)
#Start the remote driver
remDr$open()
url <- 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
remDr$navigate(url)
doc <- read_html(remDr$getPageSource()[[1]])
table <- doc %>%
html_nodes(xpath = '//*[@id="gridAvergeScore"]/table') %>%
html_table(fill=TRUE)
head(table[[1]])
## JURISDICTION AVERAGE SCORE (0 - 500) AVERAGE SCORE (0 - 500) ACHIEVEMENT LEVEL PERCENTAGES ACHIEVEMENT LEVEL PERCENTAGES
## 1 JURISDICTION Score Difference from National public (NP) At or above Basic At or above Proficient
## 2 Massachusetts 249 10 87 53
## 3 Minnesota 249 10 86 53
## 4 DoDEA 249 9 91 51
## 5 Virginia 248 9 87 50
## 6 New Jersey 248 9 87 50
Introducing third-party dependencies increases complexity and hampers reproducibility.
That site uses XHR requests to load the data asynchronously (and, poorly IMO) after the initial page load.
Open up Developer Tools in your browser and then load the page and navigate to Network -> XHR:
Do a teensy bit of spelunking to get actual, lovely JSON data vs having to use error-prone HTML table parsing:
httr::GET(
"https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx?type=sp_state_map_datatable&subject=MAT&cohort=1&year=2017R3&_=2_0"
) -> res
str(xdat <- httr::content(res)$result, 2)
## List of 1
## $ StateMap_DataTableData:List of 6
## ..$ FocalJurisdiction: chr "NP"
## ..$ Title : chr "Mathematics, Grade 4<br />Difference in average scale scores between all jurisdictions and National public, for"| __truncated__
## ..$ TableSortPrompt : chr "Click on column headers to sort data by scores for a student group or score differences"
## ..$ TableColumns :List of 7
## ..$ Statedata :List of 54
## ..$ Footnotes :List of 4
dplyr::bind_rows(xdat$StateMap_DataTableData$Statedata)
## # A tibble: 54 x 11
## Jurisdiction JurisdictionCode MN SigDiff SigSymbol AB AP MN_FP
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Massachuset… MA 249 10 ▲ 87 53 249.…
## 2 Minnesota MN 249 10 ▲ 86 53 248.…
## 3 DoDEA DS 249 9 ▲ 91 51 248.…
## 4 Virginia VA 248 9 ▲ 87 50 248.…
## 5 New Jersey NJ 248 9 ▲ 87 50 247.…
## 6 Wyoming WY 248 9 ▲ 89 51 247.…
## 7 Indiana IN 247 7 ▲ 86 48 246.…
## 8 Florida FL 246 7 ▲ 88 48 246.…
## 9 Nebraska NE 246 6 ▲ 85 49 245.…
## 10 New Hampshi… NH 245 6 ▲ 85 48 245.…
## # ... with 44 more rows, and 3 more variables: SigDiff_FP <chr>,
## # AB_FP <chr>, AP_FP <chr>
You can select-away unnecessary columns and type.convert() or readr::type_convert() to get proper object types.
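For instance, type.convert() (which gained a data.frame method in R 3.6) can upgrade the character columns that come back from the JSON. A toy sketch with made-up values in the same shape as one Statedata row:

```r
# Everything arrives as character from the JSON payload
df <- data.frame(
  Jurisdiction = "Minnesota",
  MN = "249",
  SigDiff = "10",
  stringsAsFactors = FALSE
)

# type.convert() turns the numeric-looking columns into numbers
# while leaving genuine text columns as character
df <- type.convert(df, as.is = TRUE)
str(df)
```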
Also, consider parameterizing the GET request for potential functional use; e.g.
httr::GET(
url = "https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx",
query = list(
type = "sp_state_map_datatable",
subject = "MAT",
cohort = "1",
year = "2017R3",
`_` = "2_0"
)
) -> res
^^ could be wrapped in a function with parameters passed to the query list elements.
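A sketch of what that wrapper might look like; the function name, argument names, and defaults are my own invention based on the query parameters above:

```r
# Hypothetical wrapper around the ChartHandler endpoint; the defaults
# mirror the query used earlier in this answer
get_state_map <- function(subject = "MAT", cohort = "1", year = "2017R3") {
  res <- httr::GET(
    url = "https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx",
    query = list(
      type = "sp_state_map_datatable",
      subject = subject,
      cohort = cohort,
      year = year,
      `_` = "2_0"
    )
  )
  httr::stop_for_status(res)
  httr::content(res)$result
}
```

Calling, say, get_state_map(subject = "RED") would then fetch the reading results with no other changes.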
I recently got help with calculating net proportions for a table in R, but trying to make a summary of that hasn't worked and as I selected an answer I need to post a new question.
Here is my original data (I call qf):
genre status rb wrb inn
Fiction FAILURE 621 66 1347
Fiction FAILURE 400 46 928
Fiction FAILURE 238 35 663
Poetry FAILURE 513 105 1732
Poetry FAILURE 165 47 393
Poetry FAILURE 896 193 2350
Love-story FAILURE 5690 501 8869
Love-story FAILURE 1284 174 2793
Love-story FAILURE 7279 715 13852
Love-story SUCCESS 18150 1734 39635
Poetry SUCCESS 1988 226 4712
Love-story SUCCESS 20110 2222 43953
Love-story SUCCESS 20762 2288 46706
Poetry SUCCESS 1824 322 3984
Poetry SUCCESS 1105 148 2751
Adventure SUCCESS 4675 617 8462
Adventure SUCCESS 7943 599 17247
Adventure SUCCESS 7290 601 17774
Thanks to the answers I managed to get it to summarise by genre and success/failure like so (I like to track all transformations, hence the multiple data frames):
qf2 <- qf %>% group_by(genre,status) %>% summarise_all(sum)
qf3 <- qf2 %>% as.data.frame()
qf4 <- qf3 %>% mutate(rowSum = rowSums(.[,names(qf3)[3:5]])) %>%
group_by(genre) %>%
summarise_at(vars(names(qf3)[3:5]),
funs(net = .[status == "SUCCESS"]/rowSum[status == "SUCCESS"] -
.[status == "FAILURE"]/rowSum[status == "FAILURE"] )) %>%
as.data.frame()
However what I want to now do is get the overall proportions. But whatever I try it just won't work. I think I'm missing something obvious.
What I want to get is the output of:
Sum-FAILURE 0.329241738 0.036265536 0.634492726
Sum-SUCCESS 0.301794636 0.031519501 0.666685863
Net -0.027447103 -0.004746035 0.032193137
The calculation I'm trying to create to get this is (for rb):
Sum(success_rb)/(Sum(success_rb)+Sum(success_wrb)+Sum(success_inn)) - Sum(failure_rb)/(Sum(failure_rb)+Sum(failure_wrb)+Sum(failure_inn))
qf %>%
select(-genre)%>%
group_by(status) %>%
summarise_all(sum)%>%
{.[-1]/rowSums(.[-1])}%>%
rbind(.[2,]-.[1,])
rb wrb inn
1 0.3292417 0.036265536 0.63449273
2 0.3017946 0.031519501 0.66668586
21 -0.0274471 -0.004746035 0.03219314
library(data.table)
setDT(qf)[,lapply(.SD,sum),status,.SDcols=3:5][,
.SD/rowSums(.SD),.SDcols=-1][,rbind(.SD,.SD[2]-.SD[1])]
rb wrb inn
1: 0.3292417 0.036265536 0.63449273
2: 0.3017946 0.031519501 0.66668586
3: -0.0274471 -0.004746035 0.03219314
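The same arithmetic can be cross-checked in base R with rowsum() and prop.table(); a sketch using a cut-down toy version of qf (four rows instead of eighteen):

```r
# Toy version of qf: a few rows of each status
qf <- data.frame(
  status = c("FAILURE", "FAILURE", "SUCCESS", "SUCCESS"),
  rb  = c(621, 513, 1988, 4675),
  wrb = c(66, 105, 226, 617),
  inn = c(1347, 1732, 4712, 8462)
)

# Sum each count column within status, then turn each row into proportions
sums  <- rowsum(qf[c("rb", "wrb", "inn")], qf$status)
props <- prop.table(as.matrix(sums), margin = 1)

# Net = SUCCESS proportions minus FAILURE proportions
rbind(props, Net = props["SUCCESS", ] - props["FAILURE", ])
```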
Hi, I am trying to extract the table from the Premier League fantasy website.
The package I am using is rvest, and the code I am using in the initial phase is as follows:
library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")
I couldn't find an HTML tag that would work to extract the html_nodes for the rvest package.
I was using a similar approach to extract data from "http://admissions.calpoly.edu/prospective/profile.html" and I was able to extract the data. The code I used for calpoly is as follows:
library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")
CPadmissions %>% html_nodes("table") %>%
.[[1]] %>%
html_table()
I got the code above from YouTube through this link: https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien
Any help on getting data from fantasy.premierleague.com is highly appreciated. Do I need to use some kind of API ?
Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):
library(RSelenium)
library(rvest)
# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()
# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]
# clean up
rd$close()
ptm$stop()
# parse with rvest
df <- html %>% read_html() %>%
html_node('#ismr-event-history table.ism-table') %>%
html_table() %>%
setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>% # clean column names
setNames(gsub('\\s', '_', names(.)))
str(df)
## 'data.frame': 20 obs. of 10 variables:
## $ Gameweek : chr "GW1" "GW2" "GW3" "GW4" ...
## $ Gameweek_Points : int 34 47 53 51 66 66 65 63 48 90 ...
## $ Points_Bench : int 1 6 9 7 14 2 9 3 8 2 ...
## $ Gameweek_Rank : chr "2,406,373" "2,659,789" "541,258" "905,524" ...
## $ Transfers_Made : int 0 0 2 0 3 2 2 0 2 0 ...
## $ Transfers_Cost : int 0 0 0 0 4 4 4 0 0 0 ...
## $ Overall_Points : chr "34" "81" "134" "185" ...
## $ Overall_Rank : chr "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
## $ Value : chr "£100.0" "£100.0" "£99.9" "£100.0" ...
## $ Change_Previous_Gameweek: logi NA NA NA NA NA NA ...
As always, more cleaning is necessary, but overall, it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images, which is why the last column is all NA, but you can calculate those anyway.
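For example, on a couple of toy rows shaped like the scraped columns (a sketch, assuming dplyr and readr are available):

```r
library(dplyr)
library(readr)

# Toy rows in the same shape as the scraped history table
df <- data.frame(
  Gameweek_Rank = c("2,406,373", "541,258"),
  Value = c("\u00a3100.0", "\u00a399.9"),
  stringsAsFactors = FALSE
)

# parse_number() strips grouping commas and currency symbols
df <- df %>% mutate_if(is.character, parse_number)
df
```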
This solution uses RSelenium along with the package XML. It also assumes that you have a working installation of RSelenium that can properly work with Firefox. Just make sure you have the Firefox starter script path added to your PATH.
If you are using OS X, you will need to add /Applications/Firefox.app/Contents/MacOS/ to your PATH. Or, if you're on an Ubuntu machine, it's likely /usr/lib/firefox/. Once you're sure this is working, you can move on to R with the following:
# Install RSelenium and XML for R
#install.packages("RSelenium")
#install.packages("XML")
# Import packages
library(RSelenium)
library(XML)
# Check and start servers for Selenium
checkForServer()
startServer()
# Use firefox as a browser and a port that's not used
remote_driver <- remoteDriver(browserName="firefox", port=4444)
remote_driver$open(silent=T)
# Use RSelenium to browse the site
epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
remote_driver$navigate(epl_link)
elem <- remote_driver$findElement(using="class", value="ism-table")
# Get the HTML source
elemtxt <- elem$getElementAttribute("outerHTML")
# Use the XML package to work with the HTML source
elem_html <- htmlTreeParse(elemtxt, useInternalNodes = T, asText = TRUE)
# Convert the table into a dataframe
games_table <- readHTMLTable(elem_html, header = T, stringsAsFactors = FALSE)[[1]]
# Change the column names into something legible
names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))
# Convert the fields into numeric values
games_table <- transform(games_table, GR = as.numeric(gsub(",","",GR)),
OP = as.numeric(gsub(",","",OP)),
OR = as.numeric(gsub(",","",OR)),
Value = as.numeric(gsub("£","",Value)))
This should yield:
GW GP PB GR TM TC OP OR Value CPW
GW1 34 1 2406373 0 0 34 2406373 100.0
GW2 47 6 2659789 0 0 81 2448674 100.0
GW3 53 9 541258 2 0 134 1914025 99.9
GW4 51 7 905524 0 0 185 1461665 100.0
GW5 66 14 379438 3 4 247 958889 100.1
GW6 66 2 303704 2 4 309 510376 99.9
GW7 65 9 138792 2 4 370 232474 99.8
GW8 63 3 108363 0 0 433 87967 100.4
GW9 48 8 1114609 2 0 481 75385 100.9
GW10 90 2 71210 0 0 571 27716 101.1
GW11 71 2 421706 3 4 638 16083 100.9
GW12 35 9 2798661 2 4 669 31820 101.2
GW13 41 8 2738535 1 0 710 53487 101.1
GW14 82 15 308725 0 0 792 29436 100.2
GW15 55 9 1048808 2 4 843 29399 100.6
GW16 49 8 1801549 0 0 892 35142 100.7
GW17 48 4 2116706 2 0 940 40857 100.7
GW18 42 2 3315031 0 0 982 78136 100.8
GW19 41 9 2600618 0 0 1023 99048 100.6
GW20 53 0 1644385 0 0 1076 113148 100.8
Please note that the column CPW (change from previous week) is a vector of empty strings.
I hope this helps.
Can dplyr perform chained summarise operations on a data.frame?
My data.frame has the structure:
data_df = tbl_df(data)
data_df %.%
group_by(col_1) %.%
summarise(number_of= length(col_2)) %.%
summarise(sum_of = sum(col_3))
This causes RStudio to encounter a fatal error (an "R Session Aborted" message).
Usually with plyr I would include these summarise functions without problems.
UPDATE
Data are here.
Code is:
library(dplyr)
orth <- read.csv('orth0106.csv')
orth_df = tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure)) %.%
summarise(SSIs = sum(SSI))
I can reproduce the error on a Windows 7 machine running RStudio 0.97.551.
It may be because you're calling summarise and chaining onto something that's not there. You can summarise with 2 different columns as I've done here.
url <- "https://raw.github.com/johnmarquess/some.data/master/orth0106.csv"
library(dplyr)
orth <- read.csv(url)
orth_df <- tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure), SSIs = sum(SSI))
## Source: local data frame [18 x 3]
##
## Hospital Procs SSIs
## 1 A 865 80
## 2 B 1069 38
## 3 C 796 24
## 4 D 891 35
## 5 E 997 39
## 6 F 550 30
## 7 G 2598 128
## 8 H 373 27
## 9 I 1079 70
## 10 J 714 30
## 11 K 477 30
## 12 L 227 2
## 13 M 125 6
## 14 N 589 38
## 15 O 292 3
## 16 P 149 9
## 17 Q 1984 52
## 18 R 351 13
In any event this seems like either an RStudio or a dplyr bug. I'd open up an issue with Hadley as he probably cares either way. https://github.com/hadley/dplyr/issues
EDIT This (your first call) also causes Rgui (Windows) and the terminal to crash on:
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
This indicates a dplyr problem Hadley and Romain will want to know about.
To get to my first point, we run:
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure))
Source: local data frame [18 x 2]
Hospital Procs
1 A 865
2 B 1069
3 C 796
4 D 891
5 E 997
6 F 550
7 G 2598
8 H 373
9 I 1079
10 J 714
11 K 477
12 L 227
13 M 125
14 N 589
15 O 292
16 P 149
17 Q 1984
18 R 351
Where is %.% summarise(SSIs = sum(SSI)) supposed to find SSI?
So the chaining you think is happening fails. To my understanding, %.% isn't exactly like how ggplot2 works, but it's similar. In ggplot2, once you pass the data in the initial mapping, you can access it later on. Here, %.% seems to grab the chunk on its left and operate on it, like this:
So you're grabbing:
Hospital Procs
1 A 865
2 B 1069
3 C 796
.
.
.
17 Q 1984
18 R 351
when you use %.% summarise(SSIs = sum(SSI)), and there is no SSI to be gotten. The analogy that comes to mind is wiring Christmas lights in series vs. in parallel: %.% is serial, while ggplot2's + is parallel. This is a non-programmer's understanding of things, and the R gurus may come and tell me I'm wrong, but for now that's the best theory you've got.
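To make the point concrete with today's %>% in place of the old %.% (a sketch, assuming dplyr is available): a chained second summarise only sees the columns the first one created, so both summaries have to be computed in a single call.

```r
library(dplyr)

# Toy stand-in for the orth data
toy <- data.frame(
  Hospital  = c("A", "A", "B"),
  Procedure = c("hip", "knee", "hip"),
  SSI       = c(1, 0, 2)
)

# Works: both summaries are computed from the original columns in one call
toy %>%
  group_by(Hospital) %>%
  summarise(Procs = length(Procedure), SSIs = sum(SSI))

# Fails: after the first summarise only Hospital and Procs remain,
# so the second summarise has no SSI column to sum
# toy %>%
#   group_by(Hospital) %>%
#   summarise(Procs = length(Procedure)) %>%
#   summarise(SSIs = sum(SSI))   # error: object 'SSI' not found
```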