Summation of matrices in separate tibble list columns - r

I have a tibble data frame with two list columns. Within the list column mat_base, each row contains a 2x2 matrix. In the list column mat_sim, each row contains a list of 10 2x2 matrices. I would like to create a new list column mat_out, which is the sum of the mat_base matrix and each of the mat_sim matrices (within a given row). That is, each row of mat_out should contain a list of 10 matrices.
I assume there is a way to do this using lapply or the purrr library, but I haven't been able to figure it out. Any help appreciated.
library(tibble)
library(dplyr)
library(purrr)
mat_base <- list(diag(2) * 1, diag(2) * 2, diag(2) * 3)
mat_sim_a <- replicate(10, matrix(rnorm(4), nrow = 2), simplify = F)
mat_sim_b <- replicate(10, matrix(rnorm(4), nrow = 2), simplify = F)
mat_sim_c <- replicate(10, matrix(rnorm(4), nrow = 2), simplify = F)
dat <- tibble(group = c('a', 'b', 'c')) %>%
mutate(mat_base = mat_base,
mat_sim = list(mat_sim_a, mat_sim_b, mat_sim_c))
# doesn't work
dat %>%
mutate(mat_out = lapply(.$mat_sim, function(x, y) x + y, y = .$mat_base))
# doesn't work
dat %>%
mutate(mat_out = purrr::map(.$mat_sim, function(x, y) x + y, y = .$mat_base))

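We could use a nested map2: the outer call pairs each row's 'mat_base' matrix with its 'mat_sim' list, and the inner call adds that matrix to each of the 10 simulated matrices, which gives 'mat_out' as a new list column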
dat %>%
mutate(mat_out = map2(mat_base, mat_sim, ~
map2(list(.), .y, `+`)))
# A tibble: 3 x 4
# group mat_base mat_sim mat_out
# <chr> <list> <list> <list>
#1 a <dbl [2 x 2]> <list [10]> <list [10]>
#2 b <dbl [2 x 2]> <list [10]> <list [10]>
#3 c <dbl [2 x 2]> <list [10]> <list [10]>
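The same computation can also be written with a small named helper, which may be easier to read than nested lambdas. This is only a sketch: the helper name add_base and the seed are assumptions made for illustration, not part of the answer above.

```r
library(tibble)
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed so the example is reproducible

dat <- tibble(
  group    = c("a", "b", "c"),
  mat_base = list(diag(2) * 1, diag(2) * 2, diag(2) * 3),
  # three lists, each holding ten random 2x2 matrices
  mat_sim  = replicate(
    3,
    replicate(10, matrix(rnorm(4), nrow = 2), simplify = FALSE),
    simplify = FALSE
  )
)

# Hypothetical helper: add one base matrix to every matrix in a list
add_base <- function(base, sims) map(sims, function(m) base + m)

out <- dat %>%
  mutate(mat_out = map2(mat_base, mat_sim, add_base))

# Spot check: row 2, simulation 5 equals the manual sum
all.equal(out$mat_out[[2]][[5]],
          dat$mat_base[[2]] + dat$mat_sim[[2]][[5]])
```

Naming the inner function's arguments avoids having to track which `.` or `.x` belongs to which level of the nested call.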

You can solve the issue by using lapply on position rather than the actual list, which lets you access nested levels:
dat %>%
mutate(mat_out = lapply(1:3, function(x)
lapply(dat$mat_sim[[x]],function(y) y+dat$mat_base[[x]])))
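A base R variant of the same idea, sketched on made-up data, is to let Map walk over both lists in parallel so that the number of rows does not have to be hard-coded as 1:3. The object names base_mats and sim_mats are placeholders for dat$mat_base and dat$mat_sim.

```r
set.seed(2)  # arbitrary seed

# Placeholders standing in for dat$mat_base and dat$mat_sim
base_mats <- list(diag(2) * 1, diag(2) * 2, diag(2) * 3)
sim_mats  <- replicate(
  3,
  replicate(10, matrix(rnorm(4), nrow = 2), simplify = FALSE),
  simplify = FALSE
)

# Map pairs the i-th base matrix with the i-th list of simulations;
# the inner lapply adds the base matrix to each simulated matrix
mat_out <- Map(
  function(b, s) lapply(s, function(m) m + b),
  base_mats, sim_mats
)

length(mat_out)       # one element per row
lengths(mat_out)      # ten matrices in each element
```

The result is a plain list of the same length as the inputs, so it could be assigned as a list column, for example with dat$mat_out <- mat_out.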

Related

R Use map2 to iterate over columns within a list of data frames to fit statistical models

I'm trying to figure out a purrr approach to iteratively map over columns within a list of data frames to fit univariate GLMs. Using map2, the first element, .x, would be the three pred columns, and the second element, .y, would be the list of data frames (or vice-versa). map2 seems to be able to do this, but I recognize that I need to cross the .x and .y elements first, so I use tidyr::crossing first to do this. From here, I am unsure how to properly reference the columns to select within the data frames. Example code is below:
#Sample data
set.seed(100)
test_df <- tibble(pred1 = sample(40:80, size = 1000, replace = TRUE),
pred2 = sample(40:80, size = 1000, replace = TRUE),
pred3 = sample(40:80, size = 1000, replace = TRUE),
resp = sample(100:200, size = 1000, replace = TRUE),
group = sample(c('a','b','c'), size = 1000, replace = TRUE))
#Split into list
test_ls <- test_df %>%
group_by(group) %>%
{df_groups <<- .} %>%
group_split()
#Obtain keys and name list elements
group_keys <- df_groups %>%
group_keys() %>%
pull()
test_ls <- test_ls %>% setNames(nm = group_keys)
#Cross all combinations of pred columns and list element names
preds <- c('pred1','pred2','pred3')
map_keys <- crossing(preds, group_keys)
#.y = list of data frames; iterate over data frames
#.x = three pred columns; iterate over columns
#Use purrr to fit glm of each .x columns within each of .y dfs
#Example structure - does not work
map2(.x, .y, .f = ~glm(resp ~ .x, data = .y))
#Workaround that does work
lapply(test_ls, function(x) {
x %>%
select(pred1, pred2, pred3) %>%
map(.f = ~glm(resp ~ .x, data = x))
})
There's something I'm missing, and I can't seem to figure it out. I've gotten a variety of errors with a few approaches, but I think it's coming down to not properly referencing the .x columns within the .y data frames. My approaches don't seem to recognize that .x is a column within .y. The workaround does the trick, but I'd prefer to avoid using both lapply and map.
My suggestion would be to NOT split the data before fitting models, since you are considering all possible combinations of variables that are already available directly in your original dataset. Instead, consider converting the original data frame to the "long" format, and then grouping by the necessary variables:
test_df %>% gather( pred, value, pred1:pred3 ) %>%
nest( -c(group, pred) ) %>%
mutate( models = map(data, ~glm(resp ~ value, data=.x)) )
# # A tibble: 9 x 4
# group pred data models
# <chr> <chr> <list> <list>
# 1 b pred1 <tibble [340 x 2]> <glm>
# 2 a pred1 <tibble [317 x 2]> <glm>
# 3 c pred1 <tibble [343 x 2]> <glm>
# 4 b pred2 <tibble [340 x 2]> <glm>
# 5 a pred2 <tibble [317 x 2]> <glm>
# 6 c pred2 <tibble [343 x 2]> <glm>
# 7 b pred3 <tibble [340 x 2]> <glm>
# 8 a pred3 <tibble [317 x 2]> <glm>
# 9 c pred3 <tibble [343 x 2]> <glm>
This substantially simplifies your code, and you can now split the result, if you still need those models in a list.
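In recent tidyr releases gather() is superseded by pivot_longer(), and nest() expects a named argument, so the same idea might be sketched as follows. The smaller sample size and the extra slope column are choices made for this illustration only, and the row order of the result may differ from the output shown above.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)

set.seed(100)
n <- 300  # smaller than the question's 1000, just to keep the sketch quick

test_df <- tibble(
  pred1 = sample(40:80, size = n, replace = TRUE),
  pred2 = sample(40:80, size = n, replace = TRUE),
  pred3 = sample(40:80, size = n, replace = TRUE),
  resp  = sample(100:200, size = n, replace = TRUE),
  group = sample(c("a", "b", "c"), size = n, replace = TRUE)
)

models <- test_df %>%
  # one row per (observation, predictor) pair
  pivot_longer(pred1:pred3, names_to = "pred", values_to = "value") %>%
  # everything not nested (group, pred) defines the rows of the result
  nest(data = c(resp, value)) %>%
  mutate(
    model = map(data, ~ glm(resp ~ value, data = .x)),
    slope = map_dbl(model, ~ coef(.x)[["value"]])  # illustrative extra column
  )

models
```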

Select and apply correct model from different data frame using purrr

In my data, I have correlated data (diet and liver) for 50+ different compounds (simplified here).
library(tidyverse)
Sigma <- matrix(.7, nrow=6, ncol=6) + diag(6)*.3
vars_tr <- data.frame(MASS::mvrnorm(n=10, mu=c(2:7), Sigma=Sigma))
tr<-tibble(
compound=c(rep("A", 10), rep("B", 10), rep("C",10)),
diet=c(vars_tr$X1, vars_tr$X2, vars_tr$X3),
liver=c(vars_tr$X4, vars_tr$X5, vars_tr$X6))
Following the guidance on doing regressions for multiple models, I created a nested data frame and stored the output (learning this method this week was a lifesaver!).
model<-function(df){lm(data=df, liver~diet)}
mods<- tr %>%
group_by(compound) %>%
nest() %>%
mutate(model=map(data, model))
Now I have new 'diet' data for which no 'liver' data exists.
new<-tibble(
compound=c(rep("A", 10), rep("B", 10), rep("C",10)),
diet=c(rnorm(10, 4), rnorm(10, 5), rnorm(10,6)))
What I would like to do is take advantage of purrr to generate a liver concentration for each diet concentration using the correct model for the compound. My best attempt looks like:
preds<-function(c, x){
add_predictions(tibble(diet=x), filter(mods, compound==c)$model[[1]], 'liver')$liver
}
new%>%
mutate(liver=map2(compound, diet, preds))
which returns an error.
I would greatly appreciate any help!
EDIT 6/4/2020:
Based on the helpful comments from Bruno and Ronak Shah below, I've made some progress but haven't found the solution. Both suggest joining the models to the existing table, which makes way more sense than what I was doing.
Based on that, it is relatively simple to do the following:
new_mods<-
new%>%
group_by(compound)%>%
nest()%>%
left_join(., select(mods_d, compound, model), by='compound')%>%
mutate(predicts = map2(data, model, add_predictions))%>%
unnest(predicts)
You can use a join operation and keep working on tibbles:
library(tidyverse)
library(MASS)
Sigma <- matrix(.7, nrow=6, ncol=6) + diag(6)*.3
vars_tr <- data.frame(mvrnorm(n=10, mu=c(2:7), Sigma=Sigma))
tr<-tibble(
compound=c(rep("A", 10), rep("B", 10), rep("C",10)),
diet=c(vars_tr$X1, vars_tr$X2, vars_tr$X3),
liver=c(vars_tr$X4, vars_tr$X5, vars_tr$X6))
model<-function(df){lm(data=df, liver~diet)}
mods<- tr %>%
nest_by(compound) %>%
mutate(model = list(model(data)))
new<-tibble(
compound=c(rep("A", 10), rep("B", 10), rep("C",10)),
diet=c(rnorm(10, 4), rnorm(10, 5), rnorm(10,6)))
new_nest <- new %>%
nest_by(compound)
results <- mods %>%
left_join(new_nest,by = "compound") %>%
mutate(predicts = list(predict(model,data.y)))
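You can create a function for prediction:
preds<-function(data, mod){
# add_predictions() names its output column "pred" by default, so ask for "liver" explicitly
modelr::add_predictions(data, mod, var = 'liver')$liver
}
Nest the data frame for each compound, join with mods, and apply the respective model to each group of new data. After the join, data.x holds the new diet values and data.y holds the training data stored in mods, so the predictions must be made on data.x.
library(dplyr)
new %>%
tidyr::nest(data = diet) %>%
left_join(mods, by = 'compound') %>%
mutate(liver = purrr::map2(data.x, model, preds))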
# A tibble: 3 x 5
# compound data.x data.y model liver
# <chr> <list> <list> <list> <list>
#1 A <tibble [10 × 1]> <tibble [10 × 2]> <lm> <dbl [10]>
#2 B <tibble [10 × 1]> <tibble [10 × 2]> <lm> <dbl [10]>
#3 C <tibble [10 × 1]> <tibble [10 × 2]> <lm> <dbl [10]>
Based on the requirement you can select relevant columns and unnest the results if needed.
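For reference, here is a compact end-to-end sketch of the join-then-predict pattern that also unnests the result into one row per new diet value. The simulated data and the object name predicted are assumptions for illustration, and base predict() is used in place of modelr so the example only needs the core tidyverse packages.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # arbitrary seed

# Made-up training data: liver depends linearly on diet
tr <- tibble(
  compound = rep(c("A", "B", "C"), each = 10),
  diet     = rnorm(30, mean = 5)
) %>%
  mutate(liver = 2 + 0.5 * diet + rnorm(30, sd = 0.1))

# One fitted model per compound
mods <- tr %>%
  nest(data = c(diet, liver)) %>%
  mutate(model = map(data, ~ lm(liver ~ diet, data = .x))) %>%
  select(compound, model)

# New diet values without liver measurements
new <- tibble(
  compound = rep(c("A", "B", "C"), each = 10),
  diet     = rnorm(30, mean = 5)
)

predicted <- new %>%
  nest(data = diet) %>%
  left_join(mods, by = "compound") %>%
  # predict on the NEW data using the model of the matching compound
  mutate(liver = map2(model, data, ~ unname(predict(.x, newdata = .y)))) %>%
  select(compound, data, liver) %>%
  unnest(c(data, liver))

predicted
```

Dropping the data column from mods before joining avoids the data.x/data.y suffixes, which makes it harder to pass the training data to the prediction step by mistake.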

I wish to keep/discard items in a PURRR nested list based on a sublist within each grouped level

I have sets of weather station data which I wish to compare by site. I need to do this efficiently because each set is large, and I wish to build my experience with purrr. My issue concerns the use of keep/discard (or list.exclude from rlist) to remove days (id) with incomplete data. It should be simple, but I can't get the syntax right. I have tried to approach this problem by computing the dimensions of each tibble and then extracting the first element to give me a one-element list. I am using R 3.6.1 on a PC running Windows 10. Here is a trivial example. I wish 'mylist' to comprise id = 'a' only in this example.
mylist <- tibble(id = c(rep("a",5),rep("b",4)),
dl = c(seq(1,5,1), seq(1,4,1)),
v = c(seq(0, 40, 10), seq(50, 80, 10))) %>%
group_by(id) %>%
nest() %>%
mutate(ddim = map(data, dim)) %>%
mutate(nn = map(ddim, extract(1)))
mylist
# A tibble: 2 x 4
# Groups: id [2]
id data ddim nn
<chr> <list<df[,2]>> <list> <list>
1 a [5 x 2] <int [2]> <int [1]>
2 b [4 x 2] <int [2]> <int [1]>
It is not clear how "incomplete data" is defined, but since the question is really about how to filter the rows of a nested tibble on a condition, I have used a placeholder condition: keep the rows where the first value of the v column in the nested tibble is 0. The condition can be changed once the OP clarifies.
We can use filter to select rows and map_lgl to loop over data column for each id.
library(tidyverse)
mylist %>% filter(map_lgl(data, ~first(.x$v) == 0))
# id data
# <chr> <list<df[,2]>>
#1 a [5 × 2]
Similarly, in base R, we can use subset with sapply
subset(mylist, sapply(data, function(x) x$v[1] == 0))
data
mylist <- tibble(id = c(rep("a",5),rep("b",4)),
dl = c(seq(1,5,1), seq(1,4,1)),
v = c(seq(0, 40, 10), seq(50, 80, 10))) %>%
group_by(id) %>% nest()
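If "incomplete" simply means that a day has fewer records than expected, the same pattern could be sketched with a row-count condition. The threshold expected_n is a hypothetical value chosen to fit the toy data, and the second option shows purrr::keep() on a plain named list, since the question asked about keep/discard.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)

nested <- tibble(
  id = c(rep("a", 5), rep("b", 4)),
  dl = c(1:5, 1:4),
  v  = c(seq(0, 40, 10), seq(50, 80, 10))
) %>%
  nest(data = c(dl, v))

expected_n <- 5  # assumption: a complete day has five records

# Option 1: stay in the nested tibble and filter on the row count
complete <- nested %>%
  filter(map_int(data, nrow) == expected_n)

# Option 2: work on a plain named list with purrr::keep()
by_id <- set_names(nested$data, nested$id)
kept  <- keep(by_id, ~ nrow(.x) == expected_n)

complete$id
names(kept)
```

discard() takes the same predicate and returns the complementary set, so discard(by_id, ~ nrow(.x) == expected_n) would return the incomplete days instead.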

Filtering out nested data frames by number of observations

Following from: Use filter() (and other dplyr functions) inside nested data frames with map()
I want to nest on multiple columns, and then filter out rows by the number of items that were nested into that row. For example,
df <- tibble(
a = sample(x = c(rep(c('x','y'),4), 'w', 'z')),
b = sample(c(1:10)),
c = sample(c(91:100))
)
I want to nest on column a, as in:
df_nest <- df %>%
nest(-a)
Then, I want to filter out the rows that only have 1 observation in the data column (where a = w or a = z, in this case.) How can I do that?
You can use map/map_int on the data column to return the nrow of each nested tibble, and construct the filter condition based on it. The condition below selects the single-observation rows; use > 1 instead of == 1 to drop them:
df %>%
nest(-a) %>%
filter(map_int(data, nrow) == 1)
# filter(map(data, nrow) == 1) works as well
# A tibble: 2 x 2
# a data
# <chr> <list>
#1 w <tibble [1 x 2]>
#2 z <tibble [1 x 2]>
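To drop the single-observation groups rather than select them, the comparison can be flipped. The sketch below uses a fixed (unshuffled) version of the example data so the result is predictable, and also shows a non-nested alternative with add_count(); the names kept and kept_flat are placeholders.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)

# Same shape as the question's data, without the random shuffling
df <- tibble(
  a = c(rep(c("x", "y"), 4), "w", "z"),
  b = 1:10,
  c = 91:100
)

# Keep only the groups with more than one nested observation
kept <- df %>%
  nest(data = c(b, c)) %>%
  filter(map_int(data, nrow) > 1)

# Same rows without nesting: count per group, filter, drop the helper column
kept_flat <- df %>%
  add_count(a) %>%
  filter(n > 1) %>%
  select(-n)

kept
```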

List Columns - Creating a data frame of data frames

I'd like to create a pretty simple data frame of data frames. I'd like the master data frame to have 100 rows and three columns: one column called "row" with the numbers 1-100, and two other columns called "df1" and "df2" that each hold a data frame with one column "row" and the numbers 1-100. I've tried the following:
mydf <- data.frame(row=1:100)
for(i in 1:100){
mydf$df1[i] <- data.frame(row=1:100)
mydf$df2[i] <- data.frame(row=1:100)
}
But that creates lists not data frames and the columns are unnamed. I also tried:
mydf <- data.frame(row=1:100)
mydf <- mydf %>% mutate(df1=data.frame(row=1:100),df2=data.frame(row=1:100))
But that throws an error. It seems like what I'm doing shouldn't be too difficult, what am I doing wrong and how can I accomplish this?
Thanks.
Use do on a per-row basis instead of mutate:
mydf <- data.frame( row = 1:100 ) %>% group_by(row) %>%
do( df1 = data.frame(row=1:100), df2 = data.frame(row=1:100) ) %>% ungroup
# # A tibble: 100 x 3
# row df1 df2
# <int> <list> <list>
# 1 1 <data.frame [100 x 1]> <data.frame [100 x 1]>
# 2 2 <data.frame [100 x 1]> <data.frame [100 x 1]>
# 3 3 <data.frame [100 x 1]> <data.frame [100 x 1]>
# ...
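You can use replicate to achieve that. Pass simplify = FALSE so that each element stays a data frame instead of being collapsed, i.e.
inner <- data.frame(row = 1:100) # the data frame to store in every row
mydf$df1 <- replicate(100, inner, simplify = FALSE)
mydf$df2 <- replicate(nrow(mydf), inner, simplify = FALSE) #I used nrow here to make it more generic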
I think you should use nested data frames as described in
https://www.rdocumentation.org/packages/tidyr/versions/0.6.1/topics/nest
But for what you ask, you need the function I().
mydf <- data.frame(row=1:100)
for(i in 1:100){
mydf$df1[i] <- I(data.frame(row=1:100))
mydf$df2[i] <- I(data.frame(row=1:100))
}
show(mydf)
mydf$df1
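Another way to get there, sketched here with tibble rather than base data.frame, is to build the list columns up front instead of assigning element by element. The name inner is a placeholder for the data frame stored in each cell.

```r
library(tibble)

# The data frame that every cell of df1 and df2 should hold
inner <- data.frame(row = 1:100)

# rep(list(inner), 100) is a list of 100 data frames, which tibble()
# stores as a list column
mydf <- tibble(
  row = 1:100,
  df1 = rep(list(inner), 100),
  df2 = rep(list(inner), 100)
)

class(mydf$df1[[1]])   # each cell is a data frame
names(mydf$df1[[1]])   # and keeps its column name
```

Individual cells are then accessed with double brackets, for example mydf$df2[[42]].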

Resources