Predict nested model with nested newdata - r

Imagine high-resolution temperature and light time series recorded at many locations (stations) over many days. At each station, however, temperature and light are recorded by different sensors, so the two series have slightly different sets of timestamps.
To merge these into one data.frame, I've been trying to fit a model of light for each day at each station in df.light. I then want to predict light values at the exact timestamps of the temperature readings, which are nested the same way in df.temp (the temperature dataset).
station <- rep(1:5, each=36500)
dayofyear <- rep(1:365, 5, each=100)
hourofday.light <- runif(182500, min=0, max=24)
light <- runif(182500, min=0, max=40)
hourofday.temp <- runif(182500, min=0, max=24)
temp <- runif(182500, min=0, max=40)
df.light <- data.frame(station, dayofyear, hourofday.light, light)
df.temp <- data.frame(station, dayofyear, hourofday.temp, temp)
> head(df.light)
station dayofyear hourofday.light light
1 1 1 10.217349 0.120381
2 1 1 12.179213 12.423694
3 1 1 16.515400 7.277784
4 1 1 3.775723 31.793782
5 1 1 7.719266 30.578220
6 1 1 9.269916 16.937042
> tail(df.light)
station dayofyear hourofday.light light
182495 5 365 4.712285 19.2047471
182496 5 365 11.190919 39.5921675
182497 5 365 18.710969 11.8182347
182498 5 365 20.288101 11.6874453
182499 5 365 15.466373 0.3264828
182500 5 365 12.969125 29.4429034
> head(df.temp)
station dayofyear hourofday.temp temp
1 1 1 12.1298554 30.862308
2 1 1 23.6226076 9.328942
3 1 1 9.3699831 28.970397
4 1 1 0.1814767 1.405557
5 1 1 23.6300014 39.875743
6 1 1 7.6999984 39.786182
I can make the light models, e.g. GAMs, for each day at each station in df.light using dplyr. But I am stuck not knowing how to feed the nested newdata from df.temp to the models to generate the per-station-per-day predictions.
library("mgcv")
library("tidyverse")
data <- as_tibble(df.light) %>%
group_by(station, dayofyear) %>%
nest()
models <- data %>%
mutate(
model = map(data, ~ gam(light ~ s(hourofday.light), data = .x)),
predicted = map(model, ~ predict.gam(.x, newdata = hourofday.temp)) # newdata doesn't look nested
)
The last line starting with predicted does not work because newdata is not nested...I think. Please help. I'm guessing this could be a common issue in merging time series generated by multiple sources.

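First, prepare the data: give df.temp the same column names as df.light so that predict.gam() can find the predictor in newdata, then nest both data frames by station and day.
names(df.temp)[3:4] <- names(df.light)[3:4]
data1 <- df.light %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
data2 <- df.temp %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
Then fit a model per group and predict from the matching nested newdata. Note that this pairs the two nested tables by row position, so both must contain the same station/day groups in the same order.
result <- data1 %>%
mutate(data2 = data2$data,
model = map(data, ~ gam(light ~ s(hourofday.light), data = .x)),
predicted = map2(model, data2, predict.gam))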
result
# A tibble: 1,825 x 6
# station dayofyear data data2 model predicted
# <int> <int> <list> <list> <list> <list>
# 1 1 1 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 2 1 2 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 3 1 3 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 4 1 4 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 5 1 5 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 6 1 6 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 7 1 7 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 8 1 8 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 9 1 9 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
#10 1 10 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# … with 1,815 more rows
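If the two datasets might not contain exactly the same station/day groups in the same order, joining the nested tables by their keys is a safer way to pair each model with its newdata. The following is a minimal sketch under assumptions, not the answerer's method: it uses a much smaller simulated dataset, gives the time column one shared name (hourofday) in both tables, and the names light_data, temp_data, light_pred and merged are made up for illustration.

```r
library(mgcv)
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)
# Small stand-in for the question's data: 2 stations x 3 days x 50 readings
n <- 2 * 3 * 50
station <- rep(1:2, each = 150)
dayofyear <- rep(1:3, times = 2, each = 50)
df.light <- data.frame(station, dayofyear,
                       hourofday = runif(n, 0, 24), light = runif(n, 0, 40))
df.temp <- data.frame(station, dayofyear,
                      hourofday = runif(n, 0, 24), temp = runif(n, 0, 40))

# Nest each dataset by station and day (list-column names are illustrative)
light_nested <- df.light %>% nest(light_data = c(hourofday, light))
temp_nested <- df.temp %>% nest(temp_data = c(hourofday, temp))

result <- light_nested %>%
  # Pair groups by key instead of by row position
  inner_join(temp_nested, by = c("station", "dayofyear")) %>%
  mutate(
    # One GAM of light per station/day
    model = map(light_data, ~ gam(light ~ s(hourofday), data = .x)),
    # Predict light at the temperature timestamps and keep it next to temp
    temp_data = map2(model, temp_data,
                     ~ mutate(.y, light_pred = as.numeric(predict(.x, newdata = .y))))
  )

# Flatten back to one row per temperature reading
merged <- result %>%
  select(station, dayofyear, temp_data) %>%
  unnest(temp_data)
```

Here merged has one row per temperature reading, with the predicted light value in light_pred alongside temp.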

Related

how to create nested data frame by collapsing columns

I have a dataframe and want to collapse some columns (y and z) to create a nested dataframe, for instance:
df <- data.frame(x = rep(c(1,2,3,4),times=3), y = rep(c("Y","W","T","R"),times=3), z = rep(c("A","B","C","D"),times=3))
x y z
=========
1 Y A
2 W B
3 T C
4 R D
1 Y A
2 W B
3 T C
4 R D
1 Y A
2 W B
I want to collapse the y and z columns and nest them for each unique value of x. The resulting dataframe should look like this:
x zy
======
1 <dataframe>
2 <dataframe>
3 <dataframe>
4 <dataframe>
How do I accomplish this?
library(tidyverse)
df %>%
group_by(x) %>%
nest(data = c(z, y))
# A tibble: 4 × 2
# Groups: x [4]
x data
<dbl> <list>
1 1 <tibble [3 × 2]>
2 2 <tibble [3 × 2]>
3 3 <tibble [3 × 2]>
4 4 <tibble [3 × 2]>
Try a list
library(dplyr)
df %>%
group_by(x) %>%
summarise(zy = list(cbind(y, z)))
# A tibble: 4 × 2
x zy
<dbl> <list>
1 1 <chr [3 × 2]>
2 2 <chr [3 × 2]>
3 3 <chr [3 × 2]>
4 4 <chr [3 × 2]>
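If the nested column should be called zy, as in the desired output, the name can be given directly to tidyr::nest() without grouping first. This is a small sketch using the question's df; the object names nested and restored are only for illustration.

```r
library(tidyr)

df <- data.frame(x = rep(c(1, 2, 3, 4), times = 3),
                 y = rep(c("Y", "W", "T", "R"), times = 3),
                 z = rep(c("A", "B", "C", "D"), times = 3))

# Nest y and z into a list-column named zy; x is kept as the key
nested <- nest(df, zy = c(y, z))

# Inspect the nested data frame for the first value of x
nested$zy[[1]]

# unnest() reverses the operation and restores one row per observation
restored <- unnest(nested, zy)
```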

How to refer to data frames dynamically in an R for loop?

R novice, so apologies in advance. I want to write a for loop that does sequential operations on a series of dataframes and then binds them (by sequence number).
Ideally, I'd think it would be something like this, where sc2 is the base dataframe I'm working from and week3 is the selection variable. The dataframes I'm trying to create would be t1, t2, t3,... and w1, w2, w3,... etc. In other words, the 'i' in the dataframe name would be read from the for statement.
for(i in 1:16) {
ti= tail((subset(sc2, sc2$week3<i)), n=200)
wi= subset(sc2, sc2$week3==i)
mi=rbind(ti, wi)
}
Which I'm sure you know doesn't work. I've gotten this far -
for(i in 1:16) {
txi= tail((subset(sc2, sc2$week3<i)), n=200)
assign(paste0("trst",i), txi, envir = .GlobalEnv)
wxi= subset(sc2, sc2$week3==i)
assign(paste0("w",i), wxi, envir = .GlobalEnv)
}
This creates dummy dataframes (*xi) that are then assigned, for each i, to the global environment. But how do I now rbind them? Is there a more elegant way to do all of this, or am I missing something about the way to refer to the dataframes dynamically?
Don't do it in a loop!
This is much easier if you hold the data frames inside a data frame (or rather, tibbles inside a tibble). See the example below.
library(tidyverse)
sc2 = tibble(
week3 = sample(1:20, 100, replace = TRUE),
x = rnorm(100)
)
ftxi = function(i) sc2 %>% filter(week3<i)
fwxi = function(i) sc2 %>% filter(week3==i)
df = tibble(id = 1:16) %>%
group_by(id) %>%
mutate(txi = map(id, ~ftxi(.x)),
wxi = map(id, ~fwxi(.x)))
Let's see what is df.
# A tibble: 16 x 3
# Groups: id [16]
id txi wxi
<int> <list> <list>
1 1 <tibble [0 x 2]> <tibble [4 x 2]>
2 2 <tibble [4 x 2]> <tibble [6 x 2]>
3 3 <tibble [10 x 2]> <tibble [6 x 2]>
4 4 <tibble [16 x 2]> <tibble [6 x 2]>
5 5 <tibble [22 x 2]> <tibble [4 x 2]>
6 6 <tibble [26 x 2]> <tibble [4 x 2]>
7 7 <tibble [30 x 2]> <tibble [6 x 2]>
8 8 <tibble [36 x 2]> <tibble [4 x 2]>
9 9 <tibble [40 x 2]> <tibble [3 x 2]>
10 10 <tibble [43 x 2]> <tibble [6 x 2]>
11 11 <tibble [49 x 2]> <tibble [3 x 2]>
12 12 <tibble [52 x 2]> <tibble [4 x 2]>
13 13 <tibble [56 x 2]> <tibble [6 x 2]>
14 14 <tibble [62 x 2]> <tibble [5 x 2]>
15 15 <tibble [67 x 2]> <tibble [5 x 2]>
16 16 <tibble [72 x 2]> <tibble [7 x 2]>
As you can see, it is a tibble that contains other tibbles.
So let's check that everything is correct by looking at the second row.
First, let's look at the txi variable, df$txi[[2]]:
# A tibble: 4 x 2
week3 x
<int> <dbl>
1 1 -0.0829
2 1 -2.15
3 1 -0.949
4 1 -0.0583
Now for the wxi variable, df$wxi[[2]]:
# A tibble: 6 x 2
week3 x
<int> <dbl>
1 2 -0.0643
2 2 -0.228
3 2 -0.620
4 2 -1.21
5 2 0.186
6 2 1.19
Bingo, you get what you expected!
It is also a very quick method. You can see my other answer in this forum
What is faster/better: Loop over each row..
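The question also asked for only the last 200 rows of the earlier weeks and for the two pieces to be row-bound. A possible extension of the same idea, as a sketch using simulated data like the example above (the column name mi follows the question; the object name combined is made up here):

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(1)
sc2 <- tibble(
  week3 = sample(1:20, 100, replace = TRUE),
  x = rnorm(100)
)

combined <- tibble(id = 1:16) %>%
  mutate(
    # Rows from earlier weeks, keeping at most the last 200
    txi = map(id, ~ sc2 %>% filter(week3 < .x) %>% tail(200)),
    # Rows from the current week
    wxi = map(id, ~ sc2 %>% filter(week3 == .x)),
    # Row-bind the two pieces, one combined tibble per id
    mi = map2(txi, wxi, bind_rows)
  )

# The combined data for week 2
combined$mi[[2]]
```

Each element of combined$mi plays the role of the mi data frame in the original loop, so no assign() or dynamically named objects are needed.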

R: pass chr variable as variable in other data frame

Based on the response to this question that I posed I have something like the following:
library(tidyverse)
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
dataset <- tibble(
y1=rnorm(n=100),
y2=rnorm(n=100),
x1=rnorm(n=100),
x2=rnorm(n=100),)
outcomes <- dataset %>%
select(y1,y2) %>% colnames
covars <- dataset %>%
select(x1,x2) %>% colnames
paramlist <- expand_grid(outcomes, covars)
paramlist %>%
rowwise %>%
mutate(mod = list(lm(reformulate(covars, response = outcomes), data = dataset)),
res = list(broom::tidy(mod)),
predicted=list(predict(mod)),
data=list(cbind(dataset,predicted)))
# A tibble: 4 x 6
# Rowwise:
#> outcomes covars mod res predicted data
#> <chr> <chr> <list> <list> <list> <list>
#> 1 y1 x1 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 2 y1 x2 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 3 y2 x1 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
#> 4 y2 x2 <lm> <tibble [2 x 5]> <dbl [100]> <df [100 x 5]>
What I would like to do now is - for each combination of outcomes and covars - I'd like to calculate the mean or sd of the predicted value in data conditional on some value of x1. For example, x1 might be a treatment variable, and I'd like the adjusted mean of the outcome for those with x1=0. The tricky part seems to be that the outcome and conditioning variable differ across rows.
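One possible way to approach this, sketched under assumptions rather than offered as a definitive answer: because rowwise() exposes each row's list-column element directly, the conditioning variable can be looked up by name with data[[covars]]. Since the simulated covariates are continuous, the condition used here (covariate below zero) is a hypothetical stand-in for something like x1 == 0, and the names fits, mean_pred and sd_pred are made up for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)
dataset <- tibble(
  y1 = rnorm(100), y2 = rnorm(100),
  x1 = rnorm(100), x2 = rnorm(100)
)
paramlist <- expand_grid(outcomes = c("y1", "y2"), covars = c("x1", "x2"))

fits <- paramlist %>%
  rowwise() %>%
  mutate(
    # reformulate(termlabels, response): predictors first, then the outcome
    mod = list(lm(reformulate(covars, response = outcomes), data = dataset)),
    data = list(cbind(dataset, predicted = predict(mod))),
    # Hypothetical condition: this row's covariate is below zero
    mean_pred = mean(data$predicted[data[[covars]] < 0]),
    sd_pred = sd(data$predicted[data[[covars]] < 0])
  ) %>%
  ungroup()

fits %>% select(outcomes, covars, mean_pred, sd_pred)
```

To always condition on x1 regardless of the row's covariate, data$x1 could be used in place of data[[covars]].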

How to convert a list of tibbles/dataframes into a nested tibble/dataframe

Sample Data
ex_list <- list(a = tibble(x = 1:4, y = 5:8),
b = mtcars)
How do I convert this list of tibbles/dataframes into a nested tibble as shown below:
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Tidy solutions appreciated!
We may use enframe
library(tibble)
enframe(ex_list)
# A tibble: 2 x 2
name value
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
If we need to change the column names, use the name and value arguments
> enframe(ex_list, name = 'data_name', value = 'data')
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Is this what you want?
library(tidyverse)
lapply(ex_list, nest) %>%
dplyr::bind_rows(., .id = "data_name")
# # A tibble: 2 x 2
# data_name data
# <chr> <list>
# 1 a <tibble [4 x 2]>
# 2 b <tibble [32 x 11]>
#OR map
#map(ex_list, nest) %>%
# bind_rows(., .id = "data_name")
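Another option, shown here as a small sketch, is to build the nested tibble directly with tibble(), storing the list as a list-column. Unlike the nest() approach it leaves each element's class unchanged, so mtcars stays a plain data.frame. The object name nested_tbl is made up for illustration.

```r
library(tibble)

ex_list <- list(a = tibble(x = 1:4, y = 5:8),
                b = mtcars)

# Names become one column, the data frames become a list-column
nested_tbl <- tibble(
  data_name = names(ex_list),
  data = unname(ex_list)
)

nested_tbl
```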

apply function on certain columns in nested variable in r

I want to apply a vectorised operation to certain columns inside a nested variable. Specifically, I want to count the missing values in the numeric features, i.e. weight and calories. My data frame is as follows:
df <- data.frame(country = c("US", "US", "UK", "PAK"),name = c("David",
"James", "Junaid", "Ali"), fruit = c("Apple", "banana", "orange", "melon"),
weight = c(90,110,120,NA), calories = c(NA,20, NA,NA))
country name fruit weight calories
1 US David Apple 90 NA
2 US James banana 110 20
3 UK Junaid orange 120 NA
4 PAK Ali melon NA NA
When I nest the data frame
nested_df <- df %>% group_by(country) %>% nest()
# A tibble: 3 × 2
country data
<fctr> <list>
1 US <tibble [2 × 4]>
2 UK <tibble [1 × 4]>
3 PAK <tibble [1 × 4]>
I have tried the following syntax, but to no avail.
nested_df %>% mutate(missings = map(data, c("weight", "calories")) %>%
map_lgl(function(x) sum(!is.na(x))/length(x) ==1))
The result I expect is as follows:
# A tibble: 3 × 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 × 4]> FALSE
2 UK <tibble [1 × 4]> FALSE
3 PAK <tibble [1 × 4]> TRUE
However, what I am getting is:
# A tibble: 3 × 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 × 4]> NA
2 UK <tibble [1 × 4]> NA
3 PAK <tibble [1 × 4]> NA
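This checks whether more than 50% of the values in the selected columns are NA. (mean() of the logical matrix from is.na() gives the proportion of missing cells; length() of a data frame would only count its columns.)
colstocheck <- c("weight", "calories")
nested_df %>% mutate(missings = map_lgl(data,
function(x) mean(is.na(x[, colstocheck])) > 0.5))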
# A tibble: 3 x 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 x 4]> FALSE
2 UK <tibble [1 x 4]> FALSE
3 PAK <tibble [1 x 4]> TRUE
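If the actual count of missing values per group is wanted rather than a TRUE/FALSE flag, the same pattern works with map_int(). A minimal sketch using the question's data; the column name n_missing and the object name counted are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(country = c("US", "US", "UK", "PAK"),
                 name = c("David", "James", "Junaid", "Ali"),
                 fruit = c("Apple", "banana", "orange", "melon"),
                 weight = c(90, 110, 120, NA),
                 calories = c(NA, 20, NA, NA))

colstocheck <- c("weight", "calories")

counted <- df %>%
  group_by(country) %>%
  nest() %>%
  ungroup() %>%
  # Count NA cells across the selected columns of each nested tibble
  mutate(n_missing = map_int(data, ~ sum(is.na(.x[colstocheck]))))

counted
```

For this data the counts are 1 for US, 1 for UK and 2 for PAK.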
