Web scraping with R, message that Javascript is disabled - r

Hello I am attempting to webscrape in R and this one particular website is giving me a lot of trouble. I wish to extract the table from here:
https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017
what I have tried
code:
url = 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
webpage = read_html(url)
data = webpage %>% html_nodes('p') %>% html_text()
data
Output:
[1] "\r\n The page could not be loaded. This web site
currently does not fully support browsers with \"JavaScript\" disabled.
Please note that if you choose to continue without enabling
\"JavaScript\" certain functionalities on this website may not be
available.\r\n

In this case, you may want to use RSelenium with Docker to scrape a JavaScript-rendered website:
require("RSelenium")
require("rvest")
system('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- RSelenium::remoteDriver(
remoteServerAddr = "localhost",
port = 4445L,
browserName = "firefox"
)
#Start the remote driver
remDr$open()
url = 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
remDr$navigate(url)
doc <- read_html(remDr$getPageSource()[[1]])
table <- doc %>%
html_nodes(xpath = '//*[@id="gridAvergeScore"]/table') %>%
html_table(fill=TRUE)
head(table[[1]])
## JURISDICTION AVERAGE SCORE (0 - 500) AVERAGE SCORE (0 - 500) ACHIEVEMENT LEVEL PERCENTAGES ACHIEVEMENT LEVEL PERCENTAGES
## 1 JURISDICTION Score Difference from National public (NP) At or above Basic At or above Proficient
## 2 Massachusetts 249 10 87 53
## 3 Minnesota 249 10 86 53
## 4 DoDEA 249 9 91 51
## 5 Virginia 248 9 87 50
## 6 New Jersey 248 9 87 50

Introducing third-party dependencies increases complexity and hampers reproducibility.
That site uses XHR requests to load the data asynchronously (and, poorly IMO) after the initial page load.
Open up Developer Tools in your browser and then load the page and navigate to Network -> XHR:
Do a teensy bit of spelunking to get actual, lovely JSON data vs. having to use error-prone HTML table parsing:
httr::GET(
"https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx?type=sp_state_map_datatable&subject=MAT&cohort=1&year=2017R3&_=2_0"
) -> res
str(xdat <- httr::content(res)$result, 2)
## List of 1
## $ StateMap_DataTableData:List of 6
## ..$ FocalJurisdiction: chr "NP"
## ..$ Title : chr "Mathematics, Grade 4<br />Difference in average scale scores between all jurisdictions and National public, for"| __truncated__
## ..$ TableSortPrompt : chr "Click on column headers to sort data by scores for a student group or score differences"
## ..$ TableColumns :List of 7
## ..$ Statedata :List of 54
## ..$ Footnotes :List of 4
dplyr::bind_rows(xdat$StateMap_DataTableData$Statedata)
## # A tibble: 54 x 11
## Jurisdiction JurisdictionCode MN SigDiff SigSymbol AB AP MN_FP
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Massachuset… MA 249 10 ▲ 87 53 249.…
## 2 Minnesota MN 249 10 ▲ 86 53 248.…
## 3 DoDEA DS 249 9 ▲ 91 51 248.…
## 4 Virginia VA 248 9 ▲ 87 50 248.…
## 5 New Jersey NJ 248 9 ▲ 87 50 247.…
## 6 Wyoming WY 248 9 ▲ 89 51 247.…
## 7 Indiana IN 247 7 ▲ 86 48 246.…
## 8 Florida FL 246 7 ▲ 88 48 246.…
## 9 Nebraska NE 246 6 ▲ 85 49 245.…
## 10 New Hampshi… NH 245 6 ▲ 85 48 245.…
## # ... with 44 more rows, and 3 more variables: SigDiff_FP <chr>,
## # AB_FP <chr>, AP_FP <chr>
You can select-away unnecessary columns and type.convert() or readr::type_convert() to get proper object types.
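A minimal sketch of that conversion step, using a hand-typed data frame with a few rows and columns copied from the output above (the real Statedata list would go through dplyr::bind_rows() first). It relies on base R only; the object names are just for illustration:

```r
# Character columns as they come back from the JSON
# (values copied from the output shown above)
states <- data.frame(
  Jurisdiction     = c("Massachusetts", "Minnesota", "DoDEA"),
  JurisdictionCode = c("MA", "MN", "DS"),
  MN               = c("249", "249", "249"),
  SigDiff          = c("10", "10", "9"),
  AB               = c("87", "86", "91"),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest, then let R infer proper types;
# as.is = TRUE keeps text columns as character instead of factor
keep <- c("Jurisdiction", "MN", "SigDiff", "AB")
states_typed <- type.convert(states[, keep], as.is = TRUE)
str(states_typed)
```

readr::type_convert() should play the same role if you prefer to stay within the tidyverse.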
Also, consider parameterizing the GET request for potential functional use; e.g.
httr::GET(
url = "https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx",
query = list(
type = "sp_state_map_datatable",
subject = "MAT",
cohort = "1",
year = "2017R3",
`_` = "2_0"
)
) -> res
^^ could be wrapped in a function with parameters passed to the query list elements.
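One way such a wrapper might look, as a rough sketch. The function names nrc_query() and nrc_state_table() are hypothetical, the defaults are simply the values from the request above, and whether the endpoint accepts other subject, cohort, or year values is an assumption you would need to check in the Network tab:

```r
# Build the query list for the ChartHandler endpoint shown above.
# Hypothetical helper; defaults mirror the original request.
nrc_query <- function(subject = "MAT", cohort = "1", year = "2017R3") {
  list(
    type    = "sp_state_map_datatable",
    subject = subject,
    cohort  = cohort,
    year    = year,
    `_`     = "2_0"
  )
}

# Fetch the state table for the given parameters and return a tibble.
# Requires the httr and dplyr packages to be installed.
nrc_state_table <- function(...) {
  res <- httr::GET(
    url   = "https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx",
    query = nrc_query(...)
  )
  httr::stop_for_status(res)  # fail loudly on HTTP errors
  xdat <- httr::content(res)$result
  dplyr::bind_rows(xdat$StateMap_DataTableData$Statedata)
}
```

Keeping the query construction separate from the request makes it easy to inspect what will be sent before touching the network.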

Related

Scraping data from public Google sheet - same url for different tabs

I want to scrape data from a public web page of a Google sheet. This is the link.
I am specifically interested in the data in the 4th tab, "US daily 4 pm ET", however the url for that tab is the same as for all the other tabs (at least according to the address bar of the browsers I've tried - both Chrome and Firefox). When I try to scrape the data using the rvest package in R, I end up with the data from the 2nd tab, "States current".
I did a right-click to inspect the 1st tab, "README", to see if I could figure something out about the tab names. It looks like the name of the 4th tab is sheet-button-916628299. But entering URLS in my browser that ended with /pubhtml#gid=sheet-button-916628299 or /pubhtml#gid=916628299 didn't take me to the 4th tab.
How can I find a URL that takes me (and, more importantly, the rvest package in R) to the data in the 4th tab?
This is fairly straightforward: the data for all the tabs is loaded on the page already rather than being loaded by xhr requests. The contents of each tab are just hidden or unhidden by css.
If you use the developer pane in your browser, you can see that each tab is in a div with a numerical id which is given by the number in the id of each tab.
We can get the page and make a dataframe of the correct css selectors to get each tab's contents like this:
library(rvest)
url <- paste0("https://docs.google.com/spreadsheets/u/2/d/e/",
"2PACX-1vRwAqp96T9sYYq2-i7Tj0pvTf6XVHjDSMIKBdZ",
"HXiCGGdNC0ypEU9NbngS8mxea55JuCFuua1MUeOj5/pubhtml#")
page <- read_html(url)
tabs <- html_nodes(page, xpath = "//li")
tab_df <- data.frame(name = tabs %>% html_text,
css = paste0("#", gsub("\\D", "", html_attr(tabs, "id"))),
stringsAsFactors = FALSE)
tab_df
#> name css
#> 1 README #1600800428
#> 2 States current #1189059067
#> 3 US current #294274214
#> 4 States daily 4 pm ET #916628299
#> 5 US daily 4 pm ET #964640830
#> 6 States #1983833656
So now we can get the contents of, say, the fourth tab like this:
html_node(page, tab_df$css[4]) %>% html_nodes("table") %>% html_table()
#> [[1]]
#>
#> 1 1 Date State Positive Negative Pending Death Total
#> 2 NA
#> 3 2 20200314 AK 1 143 144
#> 4 3 20200314 AL 6 22 46 74
#> 5 4 20200314 AR 12 65 26 103
#> 6 5 20200314 AZ 12 121 50 0 183
#> 7 6 20200314 CA 252 916 5 1,168
#> 8 7 20200314 CO 101 712 1 814
#> 9 8 20200314 CT 11 125 136
#> 10 9 20200314 DC 10 49 10 69
#> 11 10 20200314 DE 6 36 32 74
#> 12 11 20200314 FL 77 478 221 3 776
#> 13 12 20200314 GA 66 1 66
#> 14 13 20200314 HI 2 2
#> 15 14 20200314 IA 17 83 100
#> .... (535 rows in total)
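Since tab positions can shift when sheets are added or reordered, it may be safer to look a tab up by name rather than by index. A small sketch, with tab_df typed in by hand from the output above so that it runs offline; css_for_tab() is a hypothetical helper name:

```r
# tab_df as printed above, re-typed so the snippet needs no network access
tab_df <- data.frame(
  name = c("README", "States current", "US current",
           "States daily 4 pm ET", "US daily 4 pm ET", "States"),
  css  = c("#1600800428", "#1189059067", "#294274214",
           "#916628299", "#964640830", "#1983833656"),
  stringsAsFactors = FALSE
)

# Return the selector for exactly one tab, or stop with a clear message
css_for_tab <- function(tab_df, tab_name) {
  hit <- tab_df$css[tab_df$name == tab_name]
  if (length(hit) != 1) {
    stop("expected exactly one tab named '", tab_name, "', found ", length(hit))
  }
  hit
}

selector <- css_for_tab(tab_df, "US daily 4 pm ET")
selector
```

The result can then be passed to html_node(page, selector) in place of tab_df$css[4].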

R - Filter a specific variable based on other variables

I have the problem that I want to filter my variable Position (containing 5 levels: Analyst, CEO, Analyst level II, Manager II, Ceo Level II) by age.
This means that I want to remove "Analyst level II", "Ceo level II", and "Manager level II" if their age is below 58, and keep them if their age is above 58. The other levels (Analyst, CEO) shouldn't be affected by the age constraint. (Example: analyst, age=50 should be kept.)
library(tidyverse)
Test<- tibble(Age=50:69,Position=rep(c("Analyst","Analyst Level II","Ceo level II", "Manager", "Manager level II"), times=4),Value=201:220)
exam32 <-Test %>%
filter(!Position==c("Analyst level II","Ceo level II","Manager level II"), Age>58)
View(exam32)
Hope you can help
Use %in% to match the strings, and & to specify that both conditions must hold before a row is dropped.
Test %>%
filter(!(Position %in% c("Analyst level II",
"Ceo level II",
"Manager level II") & Age < 58))
# # A tibble: 17 x 3
# Age Position Value
# <int> <chr> <int>
# 1 50 Analyst 201
# 2 51 Analyst Level II 202
# 3 53 Manager 204
# 4 55 Analyst 206
# 5 56 Analyst Level II 207
# 6 58 Manager 209
# 7 59 Manager level II 210
# 8 60 Analyst 211
# 9 61 Analyst Level II 212
# 10 62 Ceo level II 213
# 11 63 Manager 214
# 12 64 Manager level II 215
# 13 65 Analyst 216
# 14 66 Analyst Level II 217
# 15 67 Ceo level II 218
# 16 68 Manager 219
# 17 69 Manager level II 220
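To see why the original == comparison misbehaves, here is a small base R sketch on a three-element vector (the vector pos is made up for illustration; targets is the set from the question). It also shows that matching is case-sensitive, which is why the "Analyst Level II" rows (capital L in the sample data) survive in the output above even below age 58:

```r
targets <- c("Analyst level II", "Ceo level II", "Manager level II")
pos     <- c("Ceo level II", "Analyst level II", "Manager level II")

# `==` compares element by element (recycling the shorter vector), so a
# value only matches if it sits at the same position in `targets`
eq_match <- pos == targets

# `%in%` asks, for each element of `pos`, whether it occurs anywhere in `targets`
in_match <- pos %in% targets

# Matching is case-sensitive: "Level" with a capital L is not in `targets`
case_match  <- "Analyst Level II" %in% targets
lower_match <- tolower("Analyst Level II") %in% tolower(targets)

print(eq_match)
print(in_match)
```

If the capitalisation difference in the data is unintended, comparing tolower(Position) against lower-cased targets is one way to catch those rows as well.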

Extracting html table from a website in R

Hi I am trying to extract the table from the premierleague website.
The package I am using is the rvest package, and the code I am using in the initial phase is as follows:
library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")
I couldn't find a html tag that would work to extract the html_nodes for rvest package.
I was using similar approach to extract data from "http://admissions.calpoly.edu/prospective/profile.html" and I was able to extract the data. The code I used for calpoly is as follows:
library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")
CPadmissions %>% html_nodes("table") %>%
.[[1]] %>%
html_table()
Got the code above from youtube through this link: https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien
Any help on getting data from fantasy.premierleague.com is highly appreciated. Do I need to use some kind of API ?
Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):
library(RSelenium)
library(rvest)
# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()
# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]
# clean up
rd$close()
ptm$stop()
# parse with rvest
df <- html %>% read_html() %>%
html_node('#ismr-event-history table.ism-table') %>%
html_table() %>%
setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>% # clean column names
setNames(gsub('\\s', '_', names(.)))
str(df)
## 'data.frame': 20 obs. of 10 variables:
## $ Gameweek : chr "GW1" "GW2" "GW3" "GW4" ...
## $ Gameweek_Points : int 34 47 53 51 66 66 65 63 48 90 ...
## $ Points_Bench : int 1 6 9 7 14 2 9 3 8 2 ...
## $ Gameweek_Rank : chr "2,406,373" "2,659,789" "541,258" "905,524" ...
## $ Transfers_Made : int 0 0 2 0 3 2 2 0 2 0 ...
## $ Transfers_Cost : int 0 0 0 0 4 4 4 0 0 0 ...
## $ Overall_Points : chr "34" "81" "134" "185" ...
## $ Overall_Rank : chr "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
## $ Value : chr "£100.0" "£100.0" "£99.9" "£100.0" ...
## $ Change_Previous_Gameweek: logi NA NA NA NA NA NA ...
As always, more cleaning is necessary, but overall, it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images which is why the last column is all NA, but you can calculate those anyway.
This solution uses RSelenium along with the package XML. It also assumes that you have a working installation of RSelenium that can properly work with firefox. Just make sure you have the firefox starter script path added to your PATH.
If you are using OS X, you will need to add /Applications/Firefox.app/Contents/MacOS/ to your PATH. Or, if you're on an Ubuntu machine, it's likely /usr/lib/firefox/. Once you're sure this is working, you can move on to R with the following:
# Install RSelenium and XML for R
#install.packages("RSelenium")
#install.packages("XML")
# Import packages
library(RSelenium)
library(XML)
# Check and start servers for Selenium
checkForServer()
startServer()
# Use firefox as a browser and a port that's not used
remote_driver <- remoteDriver(browserName="firefox", port=4444)
remote_driver$open(silent=T)
# Use RSelenium to browse the site
epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
remote_driver$navigate(epl_link)
elem <- remote_driver$findElement(using="class", value="ism-table")
# Get the HTML source
elemtxt <- elem$getElementAttribute("outerHTML")
# Use the XML package to work with the HTML source
elem_html <- htmlTreeParse(elemtxt, useInternalNodes = T, asText = TRUE)
# Convert the table into a dataframe
games_table <- readHTMLTable(elem_html, header = T, stringsAsFactors = FALSE)[[1]]
# Change the column names into something legible
names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))
# Convert the fields into numeric values
games_table <- transform(games_table, GR = as.numeric(gsub(",","",GR)),
OP = as.numeric(gsub(",","",OP)),
OR = as.numeric(gsub(",","",OR)),
Value = as.numeric(gsub("£","",Value)))
This should yield:
GW GP PB GR TM TC OP OR Value CPW
GW1 34 1 2406373 0 0 34 2406373 100.0
GW2 47 6 2659789 0 0 81 2448674 100.0
GW3 53 9 541258 2 0 134 1914025 99.9
GW4 51 7 905524 0 0 185 1461665 100.0
GW5 66 14 379438 3 4 247 958889 100.1
GW6 66 2 303704 2 4 309 510376 99.9
GW7 65 9 138792 2 4 370 232474 99.8
GW8 63 3 108363 0 0 433 87967 100.4
GW9 48 8 1114609 2 0 481 75385 100.9
GW10 90 2 71210 0 0 571 27716 101.1
GW11 71 2 421706 3 4 638 16083 100.9
GW12 35 9 2798661 2 4 669 31820 101.2
GW13 41 8 2738535 1 0 710 53487 101.1
GW14 82 15 308725 0 0 792 29436 100.2
GW15 55 9 1048808 2 4 843 29399 100.6
GW16 49 8 1801549 0 0 892 35142 100.7
GW17 48 4 2116706 2 0 940 40857 100.7
GW18 42 2 3315031 0 0 982 78136 100.8
GW19 41 9 2600618 0 0 1023 99048 100.6
GW20 53 0 1644385 0 0 1076 113148 100.8
Please note that the column CPW (change from previous week) is a vector of empty strings.
I hope this helps.

Measuring distance between centroids R

I want to create a matrix of the distance (in metres) between the centroids of every country in the world. Country names or country IDs should be included in the matrix.
The matrix is based on a shapefile of the world downloaded here: http://gadm.org/version2
Here is some rough info on the shapefile I'm using (I'm using shapefile@data$UN as my ID):
> str(shapefile@data)
'data.frame': 174 obs. of 11 variables:
$ FIPS : Factor w/ 243 levels "AA","AC","AE",..: 5 6 7 8 10 12 13
$ ISO2 : Factor w/ 246 levels "AD","AE","AF",..: 61 17 6 7 9 11 14
$ ISO3 : Factor w/ 246 levels "ABW","AFG","AGO",..: 64 18 6 11 3 10
$ UN : int 12 31 8 51 24 32 36 48 50 84 ...
$ NAME : Factor w/ 246 levels "Afghanistan",..: 3 15 2 11 6 10 13
$ AREA : int 238174 8260 2740 2820 124670 273669 768230 71 13017
$ POP2005 : int 32854159 8352021 3153731 3017661 16095214 38747148
$ REGION : int 2 142 150 142 2 19 9 142 142 19 ...
$ SUBREGION: int 15 145 39 145 17 5 53 145 34 13 ...
$ LON : num 2.63 47.4 20.07 44.56 17.54 ...
$ LAT : num 28.2 40.4 41.1 40.5 -12.3 ...
I tried this:
library(rgdal) # provides readOGR()
library(rgeos) # provides gCentroid()
shapefile <- readOGR("./Map/Shapefiles/World/World Map", layer = "TM_WORLD_BORDERS-0.3") # Read in world shapefile
row.names(shapefile) <- as.character(shapefile@data$UN)
centroids <- gCentroid(shapefile, byid = TRUE, id = as.character(shapefile@data$UN)) # create centroids
dist_matrix <- as.data.frame(geosphere::distm(centroids))
The result looks something like this:
V1 V2 V3 V4
1 0.0 4296620.6 2145659.7 4077948.2
2 4296620.6 0.0 2309537.4 219442.4
3 2145659.7 2309537.4 0.0 2094277.3
4 4077948.2 219442.4 2094277.3 0.0
1) Instead of the first column (1,2,3,4) and row (V1, V2, V3, V4) I would like to have country IDs (shapefile@data$UN) or names (shapefile@data$NAME). How does that work?
2) I'm not sure of the value that is returned. Is it metres, kilometres, etc?
3) Is geosphere::distm preferable to geosphere::distGeo in this instance?
1.
This should add the column and row names to your matrix, just as you did when setting the row names of shapefile:
crnames <- as.character(shapefile@data$UN)
colnames(dist_matrix) <- crnames
rownames(dist_matrix) <- crnames
2.
The default distance function in distm is distHaversine, which takes the radius of the earth as an argument, given in metres (default r = 6378137). The result is in the same unit as that radius, so the output is in metres.
3.
Look at the documentation for distGeo and distHaversine and decide the level of accuracy you want in your results. To look at the docs in R itself just enter ?distGeo.
edit: answer to q1 may be wrong since the matrix data may be aggregated, looking at alternatives
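To illustrate points 1 and 2 without needing the shapefile, here is a self-contained base R sketch. It re-implements the haversine formula by hand (this is not the geosphere code, just the standard formula) and uses the first four UN, LON and LAT values from the str() output in the question as stand-ins for the centroids, which is an assumption since gCentroid() may return slightly different points:

```r
# Great-circle distance via the haversine formula; r is the sphere radius,
# and the result is in the same unit as r (metres here)
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# First four UN codes with their LON/LAT columns from the question
ids <- c("12", "31", "8", "51")
lon <- c(2.63, 47.4, 20.07, 44.56)
lat <- c(28.2, 40.4, 41.1, 40.5)

# Pairwise distances, then label rows and columns with the country IDs
idx <- seq_along(ids)
dist_matrix <- outer(idx, idx, function(i, j) {
  haversine(lon[i], lat[i], lon[j], lat[j])
})
dimnames(dist_matrix) <- list(ids, ids)
round(dist_matrix)
```

Because the names are taken from the same vector, in the same order, as the coordinates, each row and column label is guaranteed to line up with its point. With the real data the same holds as long as the IDs are read from the object passed to distm.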

Add scale column to data frame by factor

I'm attempting to add a column to a data frame that consists of normalized values by a factor.
For example:
'data.frame': 261 obs. of 3 variables:
$ Area : Factor w/ 29 levels "Antrim","Ards",..: 1 1 1 1 1 1 1 1 1 2 ...
$ Year : Factor w/ 9 levels "2002","2003",..: 1 2 3 4 5 6 7 8 9 1 ...
$ Arrests: int 18 54 47 70 62 85 96 123 99 38 ...
I'd like to add a column containing the Arrests values normalized within each Area group.
The best I've come up with is:
data$Arrests.norm <- unlist(unname(by(data$Arrests,data$Area,function(x){ scale(x)[,1] } )))
This command processes but the data is scrambled, ie, the normalized values don't match to the correct Areas in the data frame.
Appreciate your tips.
EDIT:Just to clarify what I mean by scrambled data, subsetting the data frame after my code I get output like the following, where the normalized values clearly belong to another factor group.
Area Year Arrests Arrests.norm
199 Larne 2002 92 -0.992843957
200 Larne 2003 124 -0.404975825
201 Larne 2004 89 -1.169204397
202 Larne 2005 94 -0.581336264
203 Larne 2006 98 -0.228615385
204 Larne 2007 8 0.006531868
205 Larne 2008 31 0.418039561
206 Larne 2009 25 0.947120880
207 Larne 2010 22 2.005283518
Following up your by attempt:
df <- data.frame(A = factor(rep(c("a", "b"), each = 4)),
B = sample(1:4, 8, TRUE))
ll <- by(data = df, df$A, function(x){
x$B_scale <- scale(x$B)
x
}
)
df2 <- do.call(rbind, ll)
data <- transform(data, Arrests.norm = ave(Arrests, Area, FUN = scale))
will do the trick.
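A small base R sketch of why ave() is the safer choice here, on made-up data with interleaved groups. Concatenating per-group results (as in the unlist() approach from the question) gives values in group-level order, which need not match the row order of the data frame, whereas ave() writes each value back to the row it came from:

```r
# Toy data with the two groups interleaved (values invented for illustration)
df <- data.frame(
  Area    = factor(c("b", "a", "b", "a", "b", "a")),
  Arrests = c(10, 1, 20, 2, 30, 3)
)

# Per-group scaling, concatenated group by group: all of "a", then all of "b"
grouped <- unname(unlist(
  lapply(split(df$Arrests, df$Area), function(x) scale(x)[, 1])
))

# ave() applies the function per group but keeps the original row order
df$Arrests.norm <- ave(df$Arrests, df$Area, FUN = function(x) scale(x)[, 1])

print(grouped)
print(df)
```

Using scale(x)[, 1] inside FUN returns a plain vector rather than a one-column matrix, which keeps the new column simple.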
