I'm using rvest and tidyverse to scrape and process some data off the web.
There was recently a change to the website where some of the data is now in 2 tables and you can change between them using a button.
I'm trying to figure out how to scrape the data from both. They seem to have the same css class now so I can't figure out how to access each individually.
The code below seems to grab the "extended snowfall history", but I can't seem to figure out how to get the "2022-2023 winter season" data. Obviously I'll need to do a little processing and math to put the "2022-2023 winter season" into a new row in "extended snowfall history", but I can't even figure out how to grab it.
Currently I have :
library(rvest)
library(tidyverse)
mammoth <- read_html('https://www.mammothmountain.com/on-the-mountain/historical-snowfall')
snow <- mammoth %>%
html_element('table.css-86hwhl') %>%
html_table(header= TRUE, convert = TRUE) %>%
mutate_if(is.character,as.factor) %>%
mutate_if(is.integer,as.double) %>%
select(-Total)
A simple approach is to use rvest::html_elements() (plural rather than singular) with the same selector, 'table.css-86hwhl'. It extracts every table element that carries the CSS class css-86hwhl, and you can then pick the tables you want from the result.
For example:
mammoth %>%
html_elements('table.css-86hwhl') %>%
html_table(header= TRUE, convert = TRUE)
gives a list of tibbles:
[[1]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 4 × 3
Date Inches `Season Total to Date`
<chr> <chr> <chr>
1 November 8 "15\"" "28\""
2 November 7 "2\"" "13\""
3 November 3 "5\"" "11\""
4 November 2 "6\"" "6\""
[[3]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
[[4]]
# A tibble: 4 × 3
Date Inches `Season Total to Date`
<chr> <chr> <chr>
1 November 8 "15\"" "28\""
2 November 7 "2\"" "13\""
3 November 3 "5\"" "11\""
4 November 2 "6\"" "6\""
[[5]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
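The list contains repeats (here [[3]] and [[5]] duplicate [[1]], and [[4]] duplicates [[2]]), so you can just extract [[1]] and [[2]], which are the two tables you are looking for, and go from there. I'm sure there's a more principled approach out there, but this should do the job.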
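As a minimal sketch of that last step, the snippet below works on a small stand-in page built with rvest::minimal_html() (the markup is invented and far smaller than the real site). It drops the repeated tables with duplicated() and converts the inch-marked strings to numbers. The names page, history and current are illustrative only.

```r
library(rvest)

# Stand-in page (hypothetical markup): two distinct tables sharing one class,
# with the first one repeated, mimicking the duplicates seen above.
page <- minimal_html('
  <table class="css-86hwhl">
    <tr><th>Season</th><th>Total</th></tr>
    <tr><td>1969-70</td><td>244.5</td></tr>
  </table>
  <table class="css-86hwhl">
    <tr><th>Date</th><th>Inches</th></tr>
    <tr><td>November 8</td><td>15"</td></tr>
  </table>
  <table class="css-86hwhl">
    <tr><th>Season</th><th>Total</th></tr>
    <tr><td>1969-70</td><td>244.5</td></tr>
  </table>
')

# Parse every matching table into a list of tibbles
tables <- page %>%
  html_elements("table.css-86hwhl") %>%
  html_table(header = TRUE, convert = TRUE)

# Keep only the first copy of each repeated table
tables <- tables[!duplicated(tables)]

history <- tables[[1]]  # extended snowfall history
current <- tables[[2]]  # current season

# Strip the trailing inch mark so the column becomes numeric
current$Inches <- as.numeric(gsub('"', "", current$Inches, fixed = TRUE))
```

Picking tables by position still assumes the page keeps them in the same order, so it is worth checking the column names of each element before relying on it.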
I am trying to simulate a dataset for a linear regression in a bit of Bayesian stats.
Obviously the overall formula is
Y = A + BX
I have simulated a variety of values of A and B using
A <- rnorm(10,0,1)
B <- rnorm(10,0,1)
#10 Random draws from a normal distribution for the values of each of A and B
I set up a list of possible values of X
stuff <- tibble(x = seq(130,170,10)) %>%
#Make table for possible values of X between 130>170 in intervals of 10
mutate(Y = A + B*x)
# Make new value which is A plus B * each value of X
This works fine when I have only 1 value in A & B (i.e. if I do A <- rnorm(1,0,1))
But obviously it doesn't work when the length of A & B is greater than 1
What I am trying to figure out is how to do something like
mutate(Y[i] = A[i] + B[i]*x)
Resulting in 10 new columns, Y1 to Y10
Any suggestions welcomed
Here's how I would do what I think you want. I'd start long and then convert to wide...
library(tidyverse)
set.seed(123)
df <- tibble() %>%
expand(
nesting(
ID=1:10,
A=rnorm(10,0,1),
B=rnorm(10,0,1)
),
X=seq(130,170,10)
) %>%
mutate(Y=A + B*X)
df
# A tibble: 50 × 5
ID A B X Y
<int> <dbl> <dbl> <dbl> <dbl>
1 1 -1.07 0.426 130 54.4
2 1 -1.07 0.426 140 58.6
3 1 -1.07 0.426 150 62.9
4 1 -1.07 0.426 160 67.2
5 1 -1.07 0.426 170 71.4
6 2 -0.218 -0.295 130 -38.6
7 2 -0.218 -0.295 140 -41.5
8 2 -0.218 -0.295 150 -44.5
9 2 -0.218 -0.295 160 -47.4
10 2 -0.218 -0.295 170 -50.4
# … with 40 more rows
Now, pivot to wide...
df %>%
pivot_wider(
names_from=ID,
values_from=Y,
names_prefix="Y",
id_cols=X
)
# A tibble: 5 × 11
X Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 130 54.4 -38.6 115. 113. 106. 87.8 72.8 -7.90 -40.9 -48.2
2 140 58.6 -41.5 124. 122. 114. 94.7 78.4 -8.51 -44.0 -52.0
3 150 62.9 -44.5 133. 131. 123. 102. 83.9 -9.13 -47.0 -55.8
4 160 67.2 -47.4 142. 140. 131. 108. 89.5 -9.75 -50.1 -59.6
5 170 71.4 -50.4 151. 149. 139. 115. 95.0 -10.4 -53.2 -63.4
At this point you've lost A & B, because you'd need another 10 columns to store the original A's and another 10 to store the original B's.
Personally, I'd probably stick with the long format, because that's most likely going to make your future workflow easier. And I get to keep the A's and B's.
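If the wide layout is all that is needed, a base R sketch of the same arithmetic is shown below. It is an alternative to the expand/pivot_wider route above, not a replacement for it: outer() builds every B[i] * x product and sweep() adds the matching A[i] to each column. The object names Y and wide are made up for this example.

```r
set.seed(123)
A <- rnorm(10, 0, 1)   # 10 intercepts
B <- rnorm(10, 0, 1)   # 10 slopes
x <- seq(130, 170, 10) # possible values of X

# outer(x, B)[r, i] is x[r] * B[i]; sweep() then adds A[i] to column i
Y <- sweep(outer(x, B), 2, A, "+")
colnames(Y) <- paste0("Y", seq_along(A))

# One row per value of x, one column per (A, B) pair
wide <- data.frame(x = x, Y)
```

Here A and B stay available as plain vectors, so nothing is lost by going wide directly.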
I have the following dataset
out
# A tibble: 1,356 x 7
ID GROUP Gender Age Education tests score
<dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 TRAINING 1 74 18 ADAS_CogT0 14.7
2 1 TRAINING 1 74 18 ROCF_CT0 32
3 1 TRAINING 1 74 18 ROCF_IT0 3.7
4 1 TRAINING 1 74 18 ROCF_RT0 3.9
5 1 TRAINING 1 74 18 PVF_T0 41.3
6 1 TRAINING 1 74 18 SVF_T0 40
7 1 TRAINING 1 74 18 ADAS_CogT7 16
8 1 TRAINING 1 74 18 ROCF_CT7 33
9 1 TRAINING 1 74 18 ROCF_IT7 1.7
10 1 TRAINING 1 74 18 ROCF_RT7 2.4
I would like to create a column that contains the value score0 for tests ending in T0 and score7 for tests ending in T7. What are the possible ways to do this?
Please be so kind as to put the data in your posts, for example the output of dput(df).
You could use a combination of case_when and str_detect
library(dplyr)
library(stringr)
df <- structure(
list(
ID = 1:10,
GROUP = rep('TRAINING', 10),
Gender = rep(1, 10),
Education = rep(74, 10),
test = c(
'ADAS_CogT0',
'ROCF_CT0',
'ROCF_IT0',
'ROCF_RT0',
'PVF_T0',
'SVF_T0',
'ADAS_CogT7',
'ROCF_CT7',
'ROCF_IT7',
'ROCF_RT7'
),
score = c(14.7,32,3.7,3.9,41.3,40,16,33,1.7,2.4)
),
row.names = c(1:10),
class = "data.frame"
)
df2 <- df %>%
mutate(new = case_when(str_detect(test, 'T0') ~ 'score0',
str_detect(test, 'T7') ~ 'score7',
TRUE ~ test)
)
ID GROUP Gender Education test score new
1 1 TRAINING 1 74 ADAS_CogT0 14.7 score0
2 2 TRAINING 1 74 ROCF_CT0 32.0 score0
3 3 TRAINING 1 74 ROCF_IT0 3.7 score0
4 4 TRAINING 1 74 ROCF_RT0 3.9 score0
5 5 TRAINING 1 74 PVF_T0 41.3 score0
6 6 TRAINING 1 74 SVF_T0 40.0 score0
7 7 TRAINING 1 74 ADAS_CogT7 16.0 score7
8 8 TRAINING 1 74 ROCF_CT7 33.0 score7
9 9 TRAINING 1 74 ROCF_IT7 1.7 score7
10 10 TRAINING 1 74 ROCF_RT7 2.4 score7
Do you want the output to be the strings 'score0' and 'score7'?
You may try -
library(dplyr)
out %>%
mutate(result = case_when(grepl('T0$', tests) ~ 'score0',
grepl('T7$', tests) ~ 'score7'))
# ID GROUP Gender Age Education tests score result
#1 1 TRAINING 1 74 18 ADAS_CogT0 14.7 score0
#2 1 TRAINING 1 74 18 ROCF_CT0 32.0 score0
#3 1 TRAINING 1 74 18 ROCF_IT0 3.7 score0
#4 1 TRAINING 1 74 18 ROCF_RT0 3.9 score0
#5 1 TRAINING 1 74 18 PVF_T0 41.3 score0
#6 1 TRAINING 1 74 18 SVF_T0 40.0 score0
#7 1 TRAINING 1 74 18 ADAS_CogT7 16.0 score7
#8 1 TRAINING 1 74 18 ROCF_CT7 33.0 score7
#9 1 TRAINING 1 74 18 ROCF_IT7 1.7 score7
#10 1 TRAINING 1 74 18 ROCF_RT7 2.4 score7
Or another option with readr::parse_number.
out %>%
mutate(result = paste0('score', readr::parse_number(tests)))
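A base R variant of the same idea, as a small sketch: capture the digits after the final T and paste them onto "score" in one sub() call. It assumes every test name ends in T followed by digits; names that do not match are returned unchanged. The vector below is a shortened, hand-typed sample of the tests column.

```r
# Shortened sample of the tests column
tests <- c("ADAS_CogT0", "ROCF_CT0", "PVF_T0", "ADAS_CogT7", "ROCF_IT7")

# "^.*T([0-9]+)$" anchors on the trailing T<digits>;
# "\\1" re-inserts the captured digits after "score"
result <- sub("^.*T([0-9]+)$", "score\\1", tests)
```

Because the pattern is anchored at the end of the string, a T elsewhere in the name does not affect the result, and further time points such as T14 would need no extra case.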
I have a dataset below in which I want to do linear regression for each country and state and then cbind the predicted values in the dataset:
Final data frame after adding three more columns:
I have done it for one country and one area but want to do it for each country and area and put the predicted, upper and lower limit values back in the data set by cbind:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
data_1 <- data[(data$country=="US" & data$Area=="G"),]
model <- lm(amount ~ week, data = data_1)
pre <- predict(model,newdata = data_1,interval = "prediction",level = 0.95)
pre
How can I loop this for other combination of country and Area?
...and a Base R solution:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
splitVar <- paste0(data$country,"-",data$Area)
dfList <- split(data,splitVar)
result <- do.call(rbind,lapply(dfList,function(x){
model <- lm(amount ~ week, data = x)
cbind(x,predict(model,newdata = x,interval = "prediction",level = 0.95))
}))
result
...the results:
country Area week amount fit lwr upr
UK-A.11 UK A 1 45 36.6 -6.0463638 79.24636
UK-A.12 UK A 2 34 37.1 -1.3409128 75.54091
UK-A.13 UK A 3 23 37.6 0.6671656 74.53283
UK-A.14 UK A 4 43 38.1 -0.3409128 76.54091
UK-A.15 UK A 5 43 38.6 -4.0463638 81.24636
US-G.1 US G 1 12 20.8 -27.6791493 69.27915
US-G.2 US G 2 23 21.7 -21.9985147 65.39851
US-G.3 US G 3 34 22.6 -19.3841749 64.58417
US-G.4 US G 4 32 23.5 -20.1985147 67.19851
US-G.5 US G 5 12 24.4 -24.0791493 72.87915
US-I.6 US I 1 12 20.8 -33.8985900 75.49859
US-I.7 US I 2 34 30.5 -18.8046427 79.80464
US-I.8 US I 3 45 40.2 -7.1703685 87.57037
US-I.9 US I 4 65 49.9 0.5953573 99.20464
US-I.10 US I 5 45 59.6 4.9014100 114.29859
We can also use the augment function from the broom package to get your desired information:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(models = map(data, ~ lm(amount ~ week, data = .)),
aug = map(models, ~ augment(.x, interval = "prediction"))) %>%
unnest(aug) %>%
select(country, Area, amount, week, .fitted, .lower, .upper)
# A tibble: 15 x 7
# Groups: country, Area [3]
country Area amount week .fitted .lower .upper
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US G 12 1 20.8 -27.7 69.3
2 US G 23 2 21.7 -22.0 65.4
3 US G 34 3 22.6 -19.4 64.6
4 US G 32 4 23.5 -20.2 67.2
5 US G 12 5 24.4 -24.1 72.9
6 US I 12 1 20.8 -33.9 75.5
7 US I 34 2 30.5 -18.8 79.8
8 US I 45 3 40.2 -7.17 87.6
9 US I 65 4 49.9 0.595 99.2
10 US I 45 5 59.6 4.90 114.
11 UK A 45 1 36.6 -6.05 79.2
12 UK A 34 2 37.1 -1.34 75.5
13 UK A 23 3 37.6 0.667 74.5
14 UK A 43 4 38.1 -0.341 76.5
15 UK A 43 5 38.6 -4.05 81.2
Here is a tidyverse way to do this for every combination of country and Area.
library(tidyverse)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(model = map(data, ~ lm(amount ~ week, data = .x)),
result = map2(model, data, ~data.frame(predict(.x, newdata = .y,
interval = "prediction",level = 0.95)))) %>%
ungroup %>%
select(-model) %>%
unnest(c(data, result))
# country Area week amount fit lwr upr
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 US G 1 12 20.8 -27.7 69.3
# 2 US G 2 23 21.7 -22.0 65.4
# 3 US G 3 34 22.6 -19.4 64.6
# 4 US G 4 32 23.5 -20.2 67.2
# 5 US G 5 12 24.4 -24.1 72.9
# 6 US I 1 12 20.8 -33.9 75.5
# 7 US I 2 34 30.5 -18.8 79.8
# 8 US I 3 45 40.2 -7.17 87.6
# 9 US I 4 65 49.9 0.595 99.2
#10 US I 5 45 59.6 4.90 114.
#11 UK A 1 45 36.6 -6.05 79.2
#12 UK A 2 34 37.1 -1.34 75.5
#13 UK A 3 23 37.6 0.667 74.5
#14 UK A 4 43 38.1 -0.341 76.5
#15 UK A 5 43 38.6 -4.05 81.2
And one more:
library(tidyverse)
data %>%
mutate(CountryArea=paste0(country,Area) %>% factor %>% fct_inorder) %>%
split(.$CountryArea) %>%
map(~lm(amount~week, data=.)) %>%
map(predict, interval = "prediction",level = 0.95) %>%
reduce(rbind) %>%
cbind(data, .)
country Area week amount fit lwr upr
1 US G 1 12 20.8 -27.6791493 69.27915
2 US G 2 23 21.7 -21.9985147 65.39851
3 US G 3 34 22.6 -19.3841749 64.58417
4 US G 4 32 23.5 -20.1985147 67.19851
5 US G 5 12 24.4 -24.0791493 72.87915
6 US I 1 12 20.8 -33.8985900 75.49859
7 US I 2 34 30.5 -18.8046427 79.80464
8 US I 3 45 40.2 -7.1703685 87.57037
9 US I 4 65 49.9 0.5953573 99.20464
10 US I 5 45 59.6 4.9014100 114.29859
11 UK A 1 45 36.6 -6.0463638 79.24636
12 UK A 2 34 37.1 -1.3409128 75.54091
13 UK A 3 23 37.6 0.6671656 74.53283
14 UK A 4 43 38.1 -0.3409128 76.54091
15 UK A 5 43 38.6 -4.0463638 81.24636
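The split/reduce/cbind version relies on the rows of data already being arranged group by group, since the predictions are bound back purely by position. A sketch that keeps each prediction attached to its own group, using dplyr::group_modify(), might look like this (the object name out is arbitrary):

```r
library(dplyr)

data <- data.frame(
  country = c(rep("US", 10), rep("UK", 5)),
  Area    = c(rep("G", 5), rep("I", 5), rep("A", 5)),
  week    = rep(1:5, 3),
  amount  = c(12, 23, 34, 32, 12, 12, 34, 45, 65, 45, 45, 34, 23, 43, 43)
)

out <- data %>%
  group_by(country, Area) %>%
  group_modify(~ {
    # Fit one model per country/Area group
    model <- lm(amount ~ week, data = .x)
    # Bind fit, lwr and upr onto the rows of that same group
    cbind(.x, predict(model, newdata = .x,
                      interval = "prediction", level = 0.95))
  }) %>%
  ungroup()
```

Note that group_by() sorts the groups, so the UK rows come first in out, unlike the outputs shown above.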
I am doing a linear regression by group and want to extract the residuals of the regression.
library(dplyr)
library(broom)
set.seed(124)
dat <- data.frame(ID = sample(111:503, 18576, replace = T),
ID2 = sample(11:50, 18576, replace = T),
ID3 = sample(1:14, 18576, replace = T),
yearRef = sample(1998:2014, 18576, replace = T),
value = rnorm(18576))
resid <- dat %>% dplyr::group_by(ID3) %>%
do(augment(lm(value ~ yearRef, data=.))) %>% ungroup()
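How do I retain ID and ID2 as well in resid? At the moment, only ID3 is retained in the final data frame.
Use group_split, then loop through each group with map_dfr, binding ID and ID2 to the augment output with bind_cols. Note that .id = "ID3" stores the position of each group in the split list as a character string; it matches the actual ID3 value here only because ID3 takes the values 1 to 14.
library(dplyr)
library(purrr)
library(broom)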
dat %>% group_split(ID3) %>%
map_dfr(~bind_cols(select(.x,ID,ID2), augment(lm(value~yearRef, data=.x))), .id = "ID3")
# A tibble: 18,576 x 12
ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma .cooksd
<chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 196 16 -0.385 2009 -0.0406 0.0308 -0.344 1.00e-3 0.973 6.27e-5
2 1 372 47 -0.793 2012 -0.0676 0.0414 -0.726 1.81e-3 0.973 5.05e-4
3 1 470 15 -0.496 2011 -0.0586 0.0374 -0.438 1.48e-3 0.973 1.50e-4
4 1 242 40 -1.13 2010 -0.0496 0.0338 -1.08 1.21e-3 0.973 7.54e-4
5 1 471 34 1.28 2006 -0.0135 0.0262 1.29 7.26e-4 0.972 6.39e-4
6 1 434 35 -1.09 1998 0.0586 0.0496 -1.15 2.61e-3 0.973 1.82e-3
7 1 467 45 -0.0663 2011 -0.0586 0.0374 -0.00769 1.48e-3 0.973 4.64e-8
8 1 334 27 -1.37 2003 0.0135 0.0305 -1.38 9.86e-4 0.972 9.92e-4
9 1 186 25 -0.0195 2003 0.0135 0.0305 -0.0331 9.86e-4 0.973 5.71e-7
10 1 114 34 1.09 2014 -0.0857 0.0500 1.18 2.64e-3 0.973 1.94e-3
# ... with 18,566 more rows, and 1 more variable: .std.resid <dbl>
Taking the "many models" approach, you can nest the data on ID3 and use purrr::map to create a list-column of the broom::augment data frames. The data list-column has all the original columns aside from ID3; map into that and select just the ones you want. Here I'm assuming you want to keep any column that starts with "ID", but you can change this. Then unnest both the data and the augment data frames.
library(dplyr)
library(tidyr)
dat %>%
group_by(ID3) %>%
nest() %>%
mutate(aug = purrr::map(data, ~broom::augment(lm(value ~ yearRef, data = .))),
data = purrr::map(data, select, starts_with("ID"))) %>%
unnest(c(data, aug))
#> # A tibble: 18,576 x 12
#> # Groups: ID3 [14]
#> ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma
#> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 431 15 0.619 2002 0.0326 0.0346 0.586 1.21e-3 0.995
#> 2 11 500 21 -0.432 2000 0.0299 0.0424 -0.462 1.82e-3 0.995
#> 3 11 392 28 -0.246 1998 0.0273 0.0515 -0.273 2.67e-3 0.995
#> 4 11 292 40 -0.425 1998 0.0273 0.0515 -0.452 2.67e-3 0.995
#> 5 11 175 36 -0.258 1999 0.0286 0.0468 -0.287 2.22e-3 0.995
#> 6 11 419 23 3.13 2005 0.0365 0.0273 3.09 7.54e-4 0.992
#> 7 11 329 17 -0.0414 2007 0.0391 0.0274 -0.0806 7.57e-4 0.995
#> 8 11 284 23 -0.450 2006 0.0378 0.0268 -0.488 7.25e-4 0.995
#> 9 11 136 28 -0.129 2006 0.0378 0.0268 -0.167 7.25e-4 0.995
#> 10 11 118 17 -1.55 2013 0.0470 0.0470 -1.60 2.24e-3 0.995
#> # … with 18,566 more rows, and 2 more variables: .cooksd <dbl>,
#> # .std.resid <dbl>
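Another option, sketched here under the assumption that keeping every original column is acceptable, is to pass the group's data to augment() through its data argument, which returns the original columns alongside the fitted values and residuals. The example uses a smaller simulated data set (n = 500, chosen only to keep it quick) and the made-up name resid_df.

```r
library(dplyr)
library(broom)

set.seed(124)
n <- 500  # smaller than the original 18576, for illustration only
dat <- data.frame(ID      = sample(111:503, n, replace = TRUE),
                  ID2     = sample(11:50, n, replace = TRUE),
                  ID3     = sample(1:14, n, replace = TRUE),
                  yearRef = sample(1998:2014, n, replace = TRUE),
                  value   = rnorm(n))

resid_df <- dat %>%
  group_by(ID3) %>%
  # data = .x makes augment() return all columns of the group, including ID and ID2
  group_modify(~ augment(lm(value ~ yearRef, data = .x), data = .x)) %>%
  ungroup()
```

group_modify() puts the grouping column ID3 back automatically, with its original integer type.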