Scraping data from public Google sheet - same url for different tabs - r

I want to scrape data from a public web page of a Google sheet. This is the link.
I am specifically interested in the data in the 4th tab, "US daily 4 pm ET"; however, the URL for that tab is the same as for all the other tabs (at least according to the address bar of the browsers I've tried - both Chrome and Firefox). When I try to scrape the data using the rvest package in R, I end up with the data from the 2nd tab, "States current".
I did a right-click to inspect the 1st tab, "README", to see if I could figure something out about the tab names. It looks like the name of the 4th tab is sheet-button-916628299. But entering URLs in my browser that ended with /pubhtml#gid=sheet-button-916628299 or /pubhtml#gid=916628299 didn't take me to the 4th tab.
How can I find a URL that takes me (and, more importantly, the rvest package in R) to the data in the 4th tab?

This is fairly straightforward: the data for all the tabs is already loaded on the page rather than being fetched by XHR requests. The contents of each tab are just hidden or unhidden by CSS.
If you use the developer pane in your browser, you can see that each tab's contents sit in a div whose numerical id is the number in the id of each tab button (so sheet-button-916628299 corresponds to the div with id 916628299).
We can get the page and build a dataframe of the correct CSS selectors for each tab's contents like this:
library(rvest)

url <- paste0("https://docs.google.com/spreadsheets/u/2/d/e/",
              "2PACX-1vRwAqp96T9sYYq2-i7Tj0pvTf6XVHjDSMIKBdZ",
              "HXiCGGdNC0ypEU9NbngS8mxea55JuCFuua1MUeOj5/pubhtml#")
page <- read_html(url)

# Each tab button is an <li> whose id ends in the numeric id of that tab's div
tabs <- html_nodes(page, xpath = "//li")
tab_df <- data.frame(name = tabs %>% html_text(),
                     css  = paste0("#", gsub("\\D", "", html_attr(tabs, "id"))),
                     stringsAsFactors = FALSE)
tab_df
#>                   name         css
#> 1               README #1600800428
#> 2       States current #1189059067
#> 3           US current  #294274214
#> 4 States daily 4 pm ET  #916628299
#> 5     US daily 4 pm ET  #964640830
#> 6               States #1983833656
So now we can get the contents of, say, the fourth tab like this:
html_node(page, tab_df$css[4]) %>% html_nodes("table") %>% html_table()
#> [[1]]
#>
#> 1 1 Date State Positive Negative Pending Death Total
#> 2 NA
#> 3 2 20200314 AK 1 143 144
#> 4 3 20200314 AL 6 22 46 74
#> 5 4 20200314 AR 12 65 26 103
#> 6 5 20200314 AZ 12 121 50 0 183
#> 7 6 20200314 CA 252 916 5 1,168
#> 8 7 20200314 CO 101 712 1 814
#> 9 8 20200314 CT 11 125 136
#> 10 9 20200314 DC 10 49 10 69
#> 11 10 20200314 DE 6 36 32 74
#> 12 11 20200314 FL 77 478 221 3 776
#> 13 12 20200314 GA 66 1 66
#> 14 13 20200314 HI 2 2
#> 15 14 20200314 IA 17 83 100
#> .... (535 rows in total)
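If you want this for more than one tab, the lookup can be wrapped in a small helper that selects a tab by name instead of position. This is just a sketch built on the page and tab_df objects above; get_tab is a hypothetical name, not an rvest function:
# Hypothetical helper: parse the table(s) of a tab given its name
get_tab <- function(page, tab_df, tab_name) {
  css <- tab_df$css[match(tab_name, tab_df$name)]
  html_node(page, css) %>% html_nodes("table") %>% html_table()
}

get_tab(page, tab_df, "US daily 4 pm ET")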

Related

How to change column names for mrset in R?

I am trying to create crosstabs. I have a dataframe in which I have multiple select questions. I am importing the data frame from an SPSS file using the foreign and expss packages. I am creating the multiple select questions using the mrset function. Here's the demo code to make this clear.
Banner1 = w %>%
  tab_cells(mrset(as.category(temp1, counted_value = "Checked"))) %>%
  tab_cols(total(), mrset(as.category(temp2, counted_value = "Checked"))) %>%
  tab_stat_cases(total_row_position = "none", label = "")
tab_pivot(Banner1)
The imported data table looks like this:
             Total  Q12_1  Q12_2  Q12_3  Q12_4  Q12_5
               A      B      C      D      E      F
Total Cases   803     34     18     14     38     37
Q13_1          64     11      7      8      9      7
Q13_2          12     54     54     43     13     12
Q13_3          67     54     23     21      6      4
So this is the imported dataset.
Coming to the problem: as you can see, this dataset has question numbers as column labels rather than variable labels. For single select questions everything works fine. Is there any function I can use to change the colnames for mrset dynamically?
The desired output should be something like this, for example:
             Total  Apple  Mango  Banana  Orange  Grapes
               A      B      C       D       E       F
Total Cases   803     34     18      14      38      37
Apple          64     11      7       8       9       7
Mango          12     54     54      43      13      12
Banana         67     54     23      21       6       4
Any help would be greatly appreciated.
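One common approach with expss is to attach variable labels to the underlying variables before building the multiple response set; the tab_* functions then display those labels instead of the question numbers. A minimal sketch under that assumption (the fruit labels mirror the desired output, and the variable names are illustrative):
library(expss)

# Sketch: assign descriptive variable labels; expss tables display these
# labels in place of the raw column names such as Q13_1, Q13_2, ...
w <- apply_labels(w,
  Q13_1 = "Apple",
  Q13_2 = "Mango",
  Q13_3 = "Banana"
)
# then rebuild Banner1 with the same tab_cells()/tab_cols() pipeline as above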

How can I scrape information from a website that is constantly updated?

I am trying to get information about unoccupied parking spaces in a car park. The website constantly updates the number of free parking spots.
Since I'm at the beginning of learning web scraping with R, I started with the basics.
So I tried getting the year of an IMDb movie with this code:
library(rvest)

url2 <- "https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature"
page2 <- read_html(url2)
data2 <- page2 %>%
  html_node(".lister-item-year") %>%
  html_text()
data2
This code runs with no problems.
Now I tried the same with the website about parking spots, and since the HTML code is almost the same as in the example above, I figured it shouldn't be that hard.
url <- "https://www.rosenheim.de/stadt-buerger/verkehr/parken.html"
page <- read_html(url)
data <- page %>%
  html_node(".jwGetFreeParking-8") %>%
  html_text()
data
But as a result I don't get the information about free parking spots; the result I get is "", so nothing.
Is it because the number on the second webpage updates from time to time?
This page is rendered using JavaScript, so the techniques from your example don't apply. If you use the developer tools in your browser and examine the requests on the Network tab, you will find one named "index.php". It returns a JSON file containing the parking information.
Downloading this file provides the requested information. The fromJSON function from the jsonlite library will fetch the file and convert it into a data frame.
library(jsonlite)

answer <- fromJSON("https://www.rosenheim.de/index.php?eID=jwParkingGetParkings")
answer
   uid                     title parkings occupied free isOpened link
1    4                   Reserve        0        0  ---    FALSE    0
2    7                   Reserve        0        0  ---    FALSE    0
3   13                   Reserve        0        0  ---    FALSE    0
4   14                   Reserve        0        0  ---    FALSE    0
5    0                P1 Zentrum      257      253    4     TRUE  224
6    1                  P2 KU'KO      138      133    5     TRUE  225
7    2                P3 Rathaus       31       29    2     TRUE  226
8    3                  P4 Mitte      275      275    0     TRUE  227
9    5             P6 Salinplatz      232      148   84     TRUE  228
10   6           P7 Altstadt-Ost       82      108    0     TRUE  229
11  10      P8 Beilhack-Citydome      160      130   30     TRUE  230
12   8            P9 Am Klinikum      426      424    2     TRUE 1053
13   9           P10 Stadtcenter       56       54    2     TRUE  231
14  11 P11 Beilhack-Gießereistr.      155      155  ---    FALSE 1151
15  12          P12 Bahnhof Nord      148       45  103     TRUE 1203
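Since the result is an ordinary data frame, you can then pull out just the value you need; for example, the free spots for a single car park (column names as in the output above):
# e.g. the number of free spots in the "P1 Zentrum" car park
answer$free[answer$title == "P1 Zentrum"]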

tidyr::gather can't find my value variables

I have an extremely large data.frame. I reproduce part of it below.
   RECORDING_SESSION_LABEL condition TRIAL_INDEX IA_LABEL IA_DWELL_TIME
1                       23     match           1     eyes          3580
2                       23     match           1     nose          2410
3                       23     match           1    mouth          1442
4                       23     match           1     face           841
5                       23  mismatch           3     eyes          1817
6                       23  mismatch           3     nose          1724
7                       23  mismatch           3    mouth          1600
8                       23  mismatch           3     face          1136
9                       23  mismatch           4     eyes          4812
10                      23  mismatch           4     nose          3710
11                      23  mismatch           4    mouth          4684
12                      23  mismatch           4     face          1557
13                      23  mismatch           6     eyes          4645
14                      23  mismatch           6     nose          2321
15                      23  mismatch           6    mouth           674
16                      23  mismatch           6     face           684
17                      23     match           7     eyes          1062
18                      23     match           7     nose          1359
19                      23     match           7    mouth           215
20                      23     match           7     face             0
I need to calculate the percentage of IA_DWELL_TIME for each IA_LABEL in each trial index. For that, I first put IA_LABEL in different columns:
data_IA_DWELL_TIME <- tidyr::spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)
For calculating the percentage, I create a new dataframe:
data_IA_DWELL_TIME_percentage <-data_IA_DWELL_TIME
data_IA_DWELL_TIME_percentage$eyes <- 100*(data_IA_DWELL_TIME$eyes/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$nose <- 100*(data_IA_DWELL_TIME$nose/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$mouth <- 100*(data_IA_DWELL_TIME$mouth/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$face <- 100*(data_IA_DWELL_TIME$face/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
So all is fine, and I get the wanted output. The problem comes when I want to put the columns back into rows:
data_IA_DWELL_TIME_percentage <- tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,-RECORDING_SESSION_LABEL,-condition, -TRIAL_INDEX)
I obtain this error:
Error in tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,
  -RECORDING_SESSION_LABEL, : object 'RECORDING_SESSION_LABEL' not found
Any idea of what is going on here? Thanks!
As explained, you're not referring to your data frame in the gather statement.
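The direct fix is therefore to pass the data frame as the first argument (or pipe it in):
data_IA_DWELL_TIME_percentage <- tidyr::gather(
  data_IA_DWELL_TIME_percentage,
  key = IA_LABEL, value = IA_DWELL_TIME,
  -RECORDING_SESSION_LABEL, -condition, -TRIAL_INDEX
)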
However, you could avoid the need to refer to it altogether by putting the second part in a dplyr pipeline, like below:
library(dplyr)
library(tidyr)

data_IA_DWELL_TIME <- spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)

data_IA_DWELL_TIME %>%
  mutate_at(
    vars(eyes, nose, mouth, face),
    ~ 100 * (. / rowSums(data_IA_DWELL_TIME[, c("eyes", "nose", "mouth", "face")]))
  ) %>%
  gather(key = IA_LABEL, value = IA_DWELL_TIME,
         -RECORDING_SESSION_LABEL, -condition, -TRIAL_INDEX)

Web scraping with R, message that JavaScript is disabled

Hello, I am attempting to web scrape in R, and this one particular website is giving me a lot of trouble. I wish to extract the table from here:
https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017
What I have tried:
library(rvest)

url = 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
webpage = read_html(url)
data = webpage %>% html_nodes('p') %>% html_text()
data
Output:
[1] "\r\n The page could not be loaded. This web site
currently does not fully support browsers with \"JavaScript\" disabled.
Please note that if you choose to continue without enabling
\"JavaScript\" certain functionalities on this website may not be
available.\r\n
In cases like this, you may want to use RSelenium with Docker to scrape a JavaScript-rendered website:
require("RSelenium")
require("rvest")
system('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- RSelenium::remoteDriver(
remoteServerAddr = "localhost",
port = 4445L,
browserName = "firefox"
)
#Start the remote driver
remDr$open()
url = 'https://www.nationsreportcard.gov/profiles/stateprofile?
chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
remDr$navigate(url)
doc <- read_html(remDr$getPageSource()[[1]])
table <- doc %>%
html_nodes(xpath = '//*[#id="gridAvergeScore"]/table') %>%
html_table(fill=TRUE)
head(table[[1]])
## JURISDICTION AVERAGE SCORE (0 - 500) AVERAGE SCORE (0 - 500) ACHIEVEMENT LEVEL PERCENTAGES ACHIEVEMENT LEVEL PERCENTAGES
## 1 JURISDICTION Score Difference from National public (NP) At or above Basic At or above Proficient
## 2 Massachusetts 249 10 87 53
## 3 Minnesota 249 10 86 53
## 4 DoDEA 249 9 91 51
## 5 Virginia 248 9 87 50
## 6 New Jersey 248 9 87 50
Introducing third-party dependencies increases complexity and hampers reproducibility.
That site uses XHR requests to load the data asynchronously (and, poorly IMO) after the initial page load.
Open up Developer Tools in your browser and then load the page and navigate to Network -> XHR:
Do a teensy bit of spelunking to get actual, lovely JSON data vs having to use error-prone HTML table parsing:
httr::GET(
"https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx?type=sp_state_map_datatable&subject=MAT&cohort=1&year=2017R3&_=2_0"
) -> res
str(xdat <- httr::content(res)$result, 2)
## List of 1
## $ StateMap_DataTableData:List of 6
## ..$ FocalJurisdiction: chr "NP"
## ..$ Title : chr "Mathematics, Grade 4<br />Difference in average scale scores between all jurisdictions and National public, for"| __truncated__
## ..$ TableSortPrompt : chr "Click on column headers to sort data by scores for a student group or score differences"
## ..$ TableColumns :List of 7
## ..$ Statedata :List of 54
## ..$ Footnotes :List of 4
dplyr::bind_rows(xdat$StateMap_DataTableData$Statedata)
## # A tibble: 54 x 11
## Jurisdiction JurisdictionCode MN SigDiff SigSymbol AB AP MN_FP
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Massachuset… MA 249 10 ▲ 87 53 249.…
## 2 Minnesota MN 249 10 ▲ 86 53 248.…
## 3 DoDEA DS 249 9 ▲ 91 51 248.…
## 4 Virginia VA 248 9 ▲ 87 50 248.…
## 5 New Jersey NJ 248 9 ▲ 87 50 247.…
## 6 Wyoming WY 248 9 ▲ 89 51 247.…
## 7 Indiana IN 247 7 ▲ 86 48 246.…
## 8 Florida FL 246 7 ▲ 88 48 246.…
## 9 Nebraska NE 246 6 ▲ 85 49 245.…
## 10 New Hampshi… NH 245 6 ▲ 85 48 245.…
## # ... with 44 more rows, and 3 more variables: SigDiff_FP <chr>,
## # AB_FP <chr>, AP_FP <chr>
You can select away unnecessary columns and use type.convert() or readr::type_convert() to get proper object types.
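For instance, a quick sketch (column names taken from the output above; in recent R versions type.convert() also accepts whole data frames):
df <- dplyr::bind_rows(xdat$StateMap_DataTableData$Statedata)
# keep the substantive columns and let R infer proper types
df <- df[, c("Jurisdiction", "JurisdictionCode", "MN", "SigDiff", "AB", "AP")]
df <- type.convert(df, as.is = TRUE)  # or readr::type_convert(df)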
Also, consider parameterizing the GET request for potential functional use; e.g.
httr::GET(
  url = "https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx",
  query = list(
    type = "sp_state_map_datatable",
    subject = "MAT",
    cohort = "1",
    year = "2017R3",
    `_` = "2_0"
  )
) -> res
^^ could be wrapped in a function with parameters passed to the query list elements.
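A minimal sketch of such a wrapper (the function name and defaults are illustrative, not from any package):
# Hypothetical wrapper around the endpoint used above
get_state_map_table <- function(subject = "MAT", cohort = "1", year = "2017R3") {
  res <- httr::GET(
    url = "https://www.nationsreportcard.gov/ndedataservice/ChartHandler.aspx",
    query = list(
      type = "sp_state_map_datatable",
      subject = subject,
      cohort = cohort,
      year = year,
      `_` = "2_0"
    )
  )
  xdat <- httr::content(res)$result
  dplyr::bind_rows(xdat$StateMap_DataTableData$Statedata)
}

get_state_map_table(subject = "MAT", year = "2017R3")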

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
   cluster_size start end number       p_value
13            2    12  13    131 4.209645e-233
12            1    12  12    100 6.166824e-185
22           11    12  22    132 6.916323e-143
23           12    12  23    133 1.176194e-139
13            1    13  13     31  3.464284e-38
13           68    13 117     34  3.275941e-37
23           78    23 117      2  4.503111e-32
....
[[2]]
   cluster_size start end number       p_value
13            2    12  13    131 4.209645e-233
12            1    12  12    100 6.166824e-185
22           11    12  22    132 6.916323e-143
23           12    12  23    133 1.176194e-139
13            1    13  13     31  3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row in different list elements, so I can't just do a simple sum.
The brute-force way to do this is to 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end and $number columns from the first table, and 3) pull the correct p-values from all the tables using which() statements. Is there a more clever way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happen to match. That will not always be the case.
Seems like you can do this in two steps:
1. Convert your list to a data.frame.
2. Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)

# need to convert to data.frame since all of the list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
   cluster_size start end number          sump
1             1    12  12    100 5.550142e-184
2             1    13  13     31  3.117856e-37
3             1    22  22      1  9.000000e+00
...
29          105    23 117      2  6.271469e-16
30          106    22 146     13  7.266746e-25
31          107    23 146     12  1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
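For completeness, a sketch of the same aggregation with data.table, under the same assumption that X is a list of matrices:
library(data.table)

# bind the list into one data.table, then sum p_value within each group
XDT <- rbindlist(lapply(X, as.data.frame))
XDT[, .(sump = sum(p_value)), by = .(cluster_size, start, end, number)]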
