List Columns - Creating a data frame of data frames - r

I'd like to create a pretty simple data frame of data frames. I'd like the master data frame to have 100 rows and three columns: one column called "row" containing the numbers 1-100, and two other columns called "df1" and "df2", each of which holds a data frame with one column "row" containing the numbers 1-100. I've tried the following:
mydf <- data.frame(row = 1:100)
for (i in 1:100) {
  mydf$df1[i] <- data.frame(row = 1:100)
  mydf$df2[i] <- data.frame(row = 1:100)
}
But that creates lists not data frames and the columns are unnamed. I also tried:
mydf <- data.frame(row=1:100)
mydf <- mydf %>% mutate(df1=data.frame(row=1:100),df2=data.frame(row=1:100))
But that throws an error. It seems like this shouldn't be too difficult. What am I doing wrong, and how can I accomplish this?
Thanks.

Use do on a per-row basis instead of mutate:
mydf <- data.frame(row = 1:100) %>%
  group_by(row) %>%
  do(df1 = data.frame(row = 1:100), df2 = data.frame(row = 1:100)) %>%
  ungroup()
# # A tibble: 100 x 3
# row df1 df2
# <int> <list> <list>
# 1 1 <data.frame [100 x 1]> <data.frame [100 x 1]>
# 2 2 <data.frame [100 x 1]> <data.frame [100 x 1]>
# 3 3 <data.frame [100 x 1]> <data.frame [100 x 1]>
# ...

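You can use replicate to achieve that. Pass simplify = FALSE so the result stays a list of data frames, and select only the "row" column so each copy has the single column you asked for:
mydf$df1 <- replicate(100, mydf["row"], simplify = FALSE)
mydf$df2 <- replicate(nrow(mydf), mydf["row"], simplify = FALSE) # nrow() makes this more generic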

I think you should use nested data frames, as described in
https://www.rdocumentation.org/packages/tidyr/versions/0.6.1/topics/nest
But for what you ask, you need the I() function:
mydf <- data.frame(row = 1:100)
for (i in 1:100) {
  mydf$df1[i] <- I(data.frame(row = 1:100))
  mydf$df2[i] <- I(data.frame(row = 1:100))
}
show(mydf)
mydf$df1
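As a minimal base-R sketch of the list-column idea these answers rely on (assuming the goal is one small data frame per row), each column can be built as a list of data frames with lapply and attached with `$<-`. The helper name make_inner is hypothetical and only used for illustration.

```r
# Hypothetical helper: returns the inner data frame stored in each cell.
make_inner <- function() data.frame(row = 1:100)

mydf <- data.frame(row = 1:100)

# Assigning a list whose length equals nrow(mydf) creates a list column.
mydf$df1 <- lapply(seq_len(nrow(mydf)), function(i) make_inner())
mydf$df2 <- lapply(seq_len(nrow(mydf)), function(i) make_inner())

# Each cell is reached with [[ ]] and is an ordinary data frame.
inner <- mydf$df1[[1]]
```

Note that the columns are still lists; what changes is that every element of the list is a complete data frame with a named "row" column.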

Related

purrr package in R: How to use a changing parameter for a function that is applied to nested data with map()

My problem is that I don't want to use a fixed parameter (n = 100) in the random_points() function applied to my nested data (last line of my code):
trk_id <- trk2 %>% mutate(random_used = map(hr1, random_points(.,n=100)))
Instead, for every id, the number of random points should equal the total number of observations for that id (data frame id_n). How can I use map() with a parameter that changes on each iteration?
This is my code using the data set amt_fisher included in the package amt:
library(purrr)
library(amt)
library(tidyverse)
trk <- amt_fisher %>% make_track(x_, y_, t_, id = id)
#nest data
data <- trk %>% nest(dat = -"id")
#calculate home range for every id (hr) and convert it to an sp object (hr1)
trk1 <- data %>% mutate(hr = map(dat, ~hr_kde(.,levels=0.95)))
trk1 <- trk1 %>% mutate(hr1 = map(hr, hr_isopleths))
#create a data frame with the number of observations for every id
animalsid <- unique(amt_fisher$id)
output <- list()
for (i in 1:length(unique(amt_fisher$id))) {
  id1 <- amt_fisher %>% filter(amt_fisher$id == animalsid[i])
  n <- length(id1$t_)
  output[[i]] <- list(n = n, id = animalsid[i])
}
id_n <- do.call(rbind.data.frame, output)
#calculate n random points for every id within the calculated homerange of that id
trk_id <- trk1 %>% mutate(random_used = map(hr1, random_points(.,n=100)))
I think this is what you're looking for - we use map2 to supply two arguments to the random_points function, the first being the shapefile and the second being the number of points we'd like to generate. To get those both in one data frame to loop over, I use a quick left_join first:
trk_id <- trk1 %>%
  left_join(id_n, by = "id") %>%
  mutate(random_used = map2(hr1, n, random_points))
# A tibble: 4 x 6
id dat hr hr1 n random_used
<chr> <list> <list> <list> <int> <list>
1 M1 <track_xyt [919 x 3]> <kde [7]> <sf [1 x 4]> 919 <random_points [919 x 3]>
2 M4 <track_xyt [8,958 x 3]> <kde [7]> <sf [1 x 4]> 8958 <random_points [8,958 x 3]>
3 F2 <track_xyt [3,004 x 3]> <kde [7]> <sf [1 x 4]> 3004 <random_points [3,004 x 3]>
4 F1 <track_xyt [1,349 x 3]> <kde [7]> <sf [1 x 4]> 1349 <random_points [1,349 x 3]>
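The same map2 pattern can be tried on toy data without the amt package. This is only a sketch: draw_points is a hypothetical stand-in for random_points(), and the per-id count is taken from the nested data with map_int(dat, nrow) instead of a separate id_n table, which is an assumption about what n should be.

```r
library(purrr)
library(tibble)

set.seed(1)

# Toy nested data: one data frame per id, with different numbers of rows.
toy <- tibble(
  id  = c("a", "b", "c"),
  dat = list(data.frame(x = 1:5), data.frame(x = 1:8), data.frame(x = 1:3))
)

# The per-id count can be read off the nested data directly.
toy$n <- map_int(toy$dat, nrow)

# Hypothetical stand-in for random_points(): draws n uniform points.
draw_points <- function(d, n) data.frame(x = runif(n), y = runif(n))

# map2() walks dat and n in parallel, so n changes for every id.
toy$random_used <- map2(toy$dat, toy$n, ~ draw_points(.x, n = .y))
```

Writing the call as a lambda with .x and .y makes it explicit which argument receives the changing value, which helps when the function takes more than two arguments.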

R Use map2 to iterate over columns within a list of data frames to fit statistical models

I'm trying to figure out a purrr approach to iteratively map over columns within a list of data frames to fit univariate GLMs. Using map2, the first element, .x, would be the three pred columns, and the second element, .y, would be the list of data frames (or vice-versa). map2 seems to be able to do this, but I recognize that I need to cross the .x and .y elements first, so I use tidyr::crossing first to do this. From here, I am unsure how to properly reference the columns to select within the data frames. Example code is below:
#Sample data
set.seed(100)
test_df <- tibble(pred1 = sample(40:80, size = 1000, replace = TRUE),
                  pred2 = sample(40:80, size = 1000, replace = TRUE),
                  pred3 = sample(40:80, size = 1000, replace = TRUE),
                  resp = sample(100:200, size = 1000, replace = TRUE),
                  group = sample(c('a','b','c'), size = 1000, replace = TRUE))
#Split into list
test_ls <- test_df %>%
  group_by(group) %>%
  {df_groups <<- .} %>%
  group_split()
#Obtain keys and name list elements
group_keys <- df_groups %>%
  group_keys() %>%
  pull()
test_ls <- test_ls %>% setNames(nm = group_keys)
#Cross all combinations of pred columns and list element names
preds <- c('pred1','pred2','pred3')
map_keys <- crossing(preds, group_keys)
#.y = list of data frames; iterate over data frames
#.x = three pred columns; iterate over columns
#Use purrr to fit glm of each .x columns within each of .y dfs
#Example structure - does not work
map2(.x, .y, .f = ~glm(resp ~ .x, data = .y))
#Workaround that does work
lapply(test_ls, function(x) {
  x %>%
    select(pred1, pred2, pred3) %>%
    map(.f = ~glm(resp ~ .x, data = x))
})
There's something I'm missing, and I can't seem to figure it out. I've gotten a variety of errors with a few approaches, but I think it's coming down to not properly referencing the .x columns within the .y data frames. My approaches don't seem to recognize that .x is a column within .y. The workaround does the trick, but I'd prefer to avoid using both lapply and map.
My suggestion would be to NOT split the data before fitting models, since you are considering all possible combinations of variables that are already available directly in your original dataset. Instead, consider converting the original data frame to the "long" format, and then grouping by the necessary variables:
test_df %>%
  gather(pred, value, pred1:pred3) %>%
  nest(-c(group, pred)) %>%
  mutate(models = map(data, ~glm(resp ~ value, data = .x)))
# # A tibble: 9 x 4
# group pred data models
# <chr> <chr> <list> <list>
# 1 b pred1 <tibble [340 x 2]> <glm>
# 2 a pred1 <tibble [317 x 2]> <glm>
# 3 c pred1 <tibble [343 x 2]> <glm>
# 4 b pred2 <tibble [340 x 2]> <glm>
# 5 a pred2 <tibble [317 x 2]> <glm>
# 6 c pred2 <tibble [343 x 2]> <glm>
# 7 b pred3 <tibble [340 x 2]> <glm>
# 8 a pred3 <tibble [317 x 2]> <glm>
# 9 c pred3 <tibble [343 x 2]> <glm>
This substantially simplifies your code, and you can now split the result, if you still need those models in a list.
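If the split list has to be kept, one possible way to make map2 work (a sketch, not the approach from the answer above) is to iterate over column names and list names rather than over the columns themselves, and to build each formula with reformulate(). The smaller sample data and the names fits and map_keys$pred are assumptions made for illustration.

```r
library(purrr)
library(tidyr)

set.seed(100)

# Smaller sample data with the same column layout as in the question.
test_df <- data.frame(
  pred1 = sample(40:80, size = 300, replace = TRUE),
  pred2 = sample(40:80, size = 300, replace = TRUE),
  pred3 = sample(40:80, size = 300, replace = TRUE),
  resp  = sample(100:200, size = 300, replace = TRUE),
  group = rep(c("a", "b", "c"), each = 100)
)

# Named list of data frames, one per group.
test_ls <- split(test_df, test_df$group)

# All combinations of predictor name and list element name.
map_keys <- crossing(pred = c("pred1", "pred2", "pred3"),
                     group = names(test_ls))

# .x is a column NAME and .y a list NAME; reformulate() turns the
# name into the formula resp ~ <pred>, evaluated in the chosen data frame.
fits <- map2(map_keys$pred, map_keys$group, function(pred, grp) {
  glm(reformulate(pred, response = "resp"), data = test_ls[[grp]])
})
names(fits) <- paste(map_keys$group, map_keys$pred, sep = "_")
```

Passing names avoids the original problem, where .x inside the formula was treated as a free-standing vector rather than as a column of .y.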

Create a list of dataframes and use it to call details about that dataframe

I am trying to create a list of dataframes and then use that list to create another dataframe describing the attributes of each dataframe. I wanted to do this with a loop.
I tried creating a list of dataframes. Then I used that list in a loop that says for each row in my new dataframe, put in the name of the dataframe in one column and the number of rows in that dataframe in another column.
df_Months <- as.list(c(df_Jan2018, df_Feb2018, df_March2018, df_April2018, df_May2018))
for (i in 1:length(df_Months)) {
  Monthly_Size$Month[i] <- paste(df_Months[i])
  Monthly_Size$Size[i] <- nrow(df_Months[i])
}
If I do nrow(df_Months[1]), the result is NULL, even though I know the data frame has rows: if I just do nrow(df_Jan2018), it gives me back the correct number of rows.
Here is a solution using the purrr and dplyr packages that should work on your data. You wouldn't need the for loop anymore.
library("purrr")
library("dplyr")
test_df <- data.frame(a = c(1, 2, 3, 4, NA),
                      b = c(NA, 6, 5, 7, 9))
test_df2 <- data.frame(c = c(1:10),
                       d = c(11:20))
df_list <- list(test_df = test_df, test_df2 = test_df2)
res <- map_dbl(df_list,nrow)
tibble(df = names(res), nrow = res)
The output looks like this
# A tibble: 2 x 2
df nrow
<chr> <dbl>
1 test_df 5
2 test_df2 10
A slightly different approach would be to put the above list df_list into a tibble, then do operations on that tibble and create new columns with the information you are looking for.
df_tibble <- tibble(name = names(df_list), df = df_list)
df_tibble %>% mutate(nrow = map_dbl(df, ~ nrow(.x)))
# A tibble: 2 x 3
name df nrow
<chr> <list> <dbl>
1 test_df <data.frame [5 × 2]> 5
2 test_df2 <data.frame [10 × 2]> 10
You could go on and include more information in this way. For example the number of columns.
df_tibble %>% mutate(nrow = map_dbl(df, ~ nrow(.x)),
                     ncol = map_dbl(df, ~ ncol(.x)))
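For completeness, here is a base-R sketch of why nrow(df_Months[1]) came back NULL in the question. Two things are going on: c() applied to data frames flattens them into a plain list of columns, and single brackets return a one-element list rather than the data frame itself. The two small data frames below are made-up stand-ins for the monthly data.

```r
# Made-up stand-ins for the monthly data frames in the question.
df_Jan2018 <- data.frame(x = 1:3, y = 4:6)
df_Feb2018 <- data.frame(x = 1:5, y = 6:10)

# c() flattens the data frames into a list of their columns.
flat <- c(df_Jan2018, df_Feb2018)
length(flat)   # 4 columns, not 2 data frames

# list() keeps each data frame intact; naming the elements helps later.
df_Months <- list(df_Jan2018 = df_Jan2018, df_Feb2018 = df_Feb2018)

# [ ] returns a list of length 1, which has no rows; [[ ]] returns the data frame.
single <- nrow(df_Months[1])
double <- nrow(df_Months[[1]])

# Summary table built without an explicit loop.
Monthly_Size <- data.frame(
  Month = names(df_Months),
  Size  = vapply(df_Months, nrow, integer(1)),
  row.names = NULL
)
```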

Summation of matrices in separate tibble list columns

I have a tibble data frame with two list columns. Within the list column mat_base, each row contains a 2x2 matrix. In the list column mat_sim, each row contains a list of 10 2x2 matrices. I would like to create a new list column mat_out, which is the sum of the mat_base matrix and each of the mat_sim matrices (within a given row). That is, each row of mat_out should contain a list of 10 matrices.
I assume there is a way to do this using lapply or the purrr library, but I haven't been able to figure it out. Any help appreciated.
library(tibble)
library(dplyr)
library(purrr)
mat_base <- list(diag(2) * 1, diag(2) * 2, diag(2) * 3)
mat_sim_a <- replicate(10, matrix(rnorm(4), nrow = 2), simplify = F)
mat_sim_b <- replicate(10, matrix(rnorm(4), nrow = 2), simplify = F)
mat_sim_c <- replicate(10, matrix(rnorm(4), nrow = 2), simplify = F)
dat <- tibble(group = c('a', 'b', 'c')) %>%
  mutate(mat_base = mat_base,
         mat_sim = list(mat_sim_a, mat_sim_b, mat_sim_c))
# doesn't work
dat %>%
  mutate(mat_out = lapply(.$mat_sim, function(x, y) x + y, y = .$mat_base))
# doesn't work
dat %>%
  mutate(mat_out = purrr::map(.$mat_sim, function(x, y) x + y, y = .$mat_base))
We could use a nested map2 to add each 'mat_base' matrix to every matrix in the matching 'mat_sim' list, storing the result in a new 'mat_out' column:
dat %>%
  mutate(mat_out = map2(mat_base, mat_sim, ~
    map2(list(.), .y, `+`)))
# A tibble: 3 x 4
# group mat_base mat_sim mat_out
# <chr> <list> <list> <list>
#1 a <dbl [2 x 2]> <list [10]> <list [10]>
#2 b <dbl [2 x 2]> <list [10]> <list [10]>
#3 c <dbl [2 x 2]> <list [10]> <list [10]>
You can solve the issue by using lapply on position rather than the actual list, which lets you access nested levels:
dat %>%
  mutate(mat_out = lapply(1:3, function(x)
    lapply(dat$mat_sim[[x]], function(y) y + dat$mat_base[[x]])))
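The two attempts in the question fail because x + y is applied to whole lists, and `+` is not defined for lists. A sketch of the same idea with named arguments, which may be easier to read than the nested map2 form: the outer map2 pairs each base matrix with its list of simulated matrices, and the inner map adds the base to each one. Plain lists are used here instead of tibble columns, and the names base, sims and sim are illustrative only.

```r
library(purrr)

set.seed(42)

# Three base matrices and, for each, a list of ten simulated matrices.
mat_base <- list(diag(2) * 1, diag(2) * 2, diag(2) * 3)
mat_sim  <- replicate(
  3,
  replicate(10, matrix(rnorm(4), nrow = 2), simplify = FALSE),
  simplify = FALSE
)

# Outer map2(): one (base, sims) pair per row.
# Inner map(): add the base matrix to every simulated matrix in that row.
mat_out <- map2(mat_base, mat_sim, function(base, sims) {
  map(sims, function(sim) base + sim)
})
```

Inside a mutate() call, the same expression would work with the column names mat_base and mat_sim in place of the standalone lists.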

Filtering out nested data frames by number of observations

Following from: Use filter() (and other dplyr functions) inside nested data frames with map()
I want to nest on multiple columns, and then filter out rows by the number of items that were nested into that row. For example,
df <- tibble(
  a = sample(x = c(rep(c('x','y'),4), 'w', 'z')),
  b = sample(c(1:10)),
  c = sample(c(91:100))
)
I want to nest on column a, as in:
df_nest <- df %>%
  nest(-a)
Then, I want to filter out the rows that only have 1 observation in the data column (where a = w or a = z, in this case.) How can I do that?
You can use map/map_int on the data column to return the nrow in each nested tibble, and construct the filter condition based on it:
df %>%
  nest(-a) %>%
  filter(map_int(data, nrow) == 1)
# filter(map(data, nrow) == 1) works as well
# A tibble: 2 x 2
# a data
# <chr> <list>
#1 w <tibble [1 x 2]>
#2 z <tibble [1 x 2]>
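The condition above keeps only the single-observation groups. If the aim is instead to drop them, the comparison can be flipped to > 1. A sketch of both directions, assuming tidyr 1.0.0 or later, where the named form nest(data = -a) is the documented syntax; the data are made deterministic here so the result is easy to check.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same shape as the question's data, without random shuffling.
df <- tibble(
  a = c(rep(c("x", "y"), 4), "w", "z"),
  b = 1:10,
  c = 91:100
)

# Nest everything except a into a list column called data.
df_nest <- df %>% nest(data = -a)

# Rows whose nested tibble has exactly one observation (w and z).
singletons <- df_nest %>% filter(map_int(data, nrow) == 1)

# Rows with more than one observation (x and y).
kept <- df_nest %>% filter(map_int(data, nrow) > 1)
```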
