Size of nested vs. unnested (tidy) data.frame? - r

This question uses a data.frame that contains list-columns (nested). It had me wondering whether, and why, there is an advantage to working this way. I assumed you would want to minimize the amount of memory each table uses, but when I checked I was surprised:
Compare table sizes for nested vs. tidy format:
1. Generate nested/tidy versions of a 2-col and 5-col data.frame:
library(pryr)
library(dplyr)
library(tidyr)
library(ggvis)
n <- 1:1E6
df <- data_frame(id = n, vars = lapply(n, function(x) x <- sample(letters,sample(1:26,1))))
dfu <- df %>% unnest(vars)
df_morecols <- data_frame(id = n, other1 = n, other2 = n, other3 = n,
vars = lapply(n, function(x) x <- sample(letters,sample(1:26,1))))
dfu_morecols <- df_morecols %>% unnest(vars)
they look like:
head(df)
#> Source: local data frame [6 x 2]
#> id vars
#> 1 1 <chr[16]>
#> 2 2 <chr[4]>
#> 3 3 <chr[26]>
#> 4 4 <chr[9]>
#> 5 5 <chr[11]>
#> 6 6 <chr[18]>
head(dfu)
#> Source: local data frame [6 x 2]
#> id vars
#> 1 1 k
#> 2 1 d
#> 3 1 s
#> 4 1 j
#> 5 1 m
#> 6 1 t
head(df_morecols)
#> Source: local data frame [6 x 5]
#> id other1 other2 other3 vars
#> 1 1 1 1 1 <chr[4]>
#> 2 2 2 2 2 <chr[22]>
#> 3 3 3 3 3 <chr[24]>
#> 4 4 4 4 4 <chr[6]>
#> 5 5 5 5 5 <chr[15]>
#> 6 6 6 6 6 <chr[11]>
head(dfu_morecols)
#> Source: local data frame [6 x 5]
#> id other1 other2 other3 vars
#> 1 1 1 1 1 r
#> 2 1 1 1 1 p
#> 3 1 1 1 1 s
#> 4 1 1 1 1 w
#> 5 2 2 2 2 l
#> 6 2 2 2 2 j
2. Calculate object sizes and col sizes
from: lapply(list(df,dfu,df_morecols,dfu_morecols),object_size)
170 MB vs. 162 MB for nested vs. tidy 2-col df
170 MB vs. 324 MB for nested vs. tidy 5-col df
col_sizes <- sapply(c(df,dfu,df_morecols,dfu_morecols),object_size)
col_names <- names(col_sizes)
parent_obj <- c(rep(c('df','dfu'),each = 2),
rep(c('df_morecols','dfu_morecols'),each = 5))
res <- data_frame(parent_obj,col_names,col_sizes) %>%
unite(elementof, parent_obj, col_names, remove = FALSE)
3. Plot column sizes coloured by parent object:
res %>%
ggvis(y = ~elementof, x = ~0, x2 = ~col_sizes, fill = ~parent_obj) %>%
layer_rects(height = band())
Questions:
What explains the smaller footprint of the tidy 2-col df compared to the nested one?
Why doesn't this effect carry over to the 5-col df?

Related

Find first occurrence of value in vector, and return length of vector if value not present

I would like to find first occurrence of a value in a vector. The value can be present or not. If not present I would like to get the length of the vector.
Why I want this: This is to slice a data frame by group, from first row up to (and including) the first row with the occurrence of the value. Or all rows if the value is not present. See below my approach for the latter as well. Maybe there is no need to take the detour over the vectors, and there is a more direct approach for this, and I'd appreciate a hint/solution very much, but this question is more about the vector problem. Thanks!
x <- 0:1
y <- c(0:2, 2)
z <- c(y, 3)
# These approaches with max()/min() and which() do not work
max(which(x < 2))
#> [1] 2
## desired result should be 3
max(which(y < 2))
#> [1] 2
## != of course does not work either
max(which(z != 2))
#> [1] 5
## desired result
library(dplyr)
## my way for the vectors
my_vecs <- list(x, y, z)
my_len <- lengths(my_vecs)
my_ind <- sapply(my_vecs, function(u) which(u == 2)[1])
coalesce(my_ind, my_len)
#> [1] 2 3 3
## in a dataframe
foo <- data.frame(id = letters[rep(my_len, my_len)], n = c(x,y,z))
foo %>%
group_by(id) %>%
mutate(cens = which(n == 2)[1],
cens = ifelse(is.na(cens), n(), cens)) %>%
slice(1:max(cens))
#> # A tibble: 8 × 3
#> # Groups: id [3]
#> id n cens
#> <chr> <dbl> <int>
#> 1 b 0 2
#> 2 b 1 2
#> 3 d 0 3
#> 4 d 1 3
#> 5 d 2 3
#> 6 e 0 3
#> 7 e 1 3
#> 8 e 2 3
match() has a third argument, nomatch, which sets the value returned when no match is found. It replaces a lengthy if/else construction and makes for much neater code.
x <- 0:1
y <- c(0:2, 2)
z <- c(y, 3)
sapply(list(x, y, z), function(u) match(2, u, length(u)))
#> [1] 2 3 3
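The same idea can be wrapped in a small helper. This is only a sketch: the name first_or_length is made up for illustration, and it assumes that a single value is being looked up.

```r
# Hypothetical helper: position of the first occurrence of `value` in `u`,
# or length(u) when `value` does not occur at all.
first_or_length <- function(u, value) {
  match(value, u, nomatch = length(u))
}

x <- 0:1
y <- c(0:2, 2)
z <- c(y, 3)

res <- vapply(list(x, y, z), first_or_length, integer(1), value = 2)
res
```

Naming the nomatch argument makes the fallback explicit, and vapply() guarantees an integer vector with one element per input vector.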
Applied to the data frame problem in the question, this will give:
library(dplyr)
foo %>%
group_by(id) %>%
## note: n and n() are not the same! the first refers to the column, the other is a dplyr function
mutate(cens = match(2, n, nomatch = n())) %>%
slice(1:max(cens))
#> # A tibble: 8 × 3
#> # Groups: id [3]
#> id n cens
#> <chr> <dbl> <int>
#> 1 b 0 2
#> 2 b 1 2
#> 3 d 0 3
#> 4 d 1 3
#> 5 d 2 3
#> 6 e 0 3
#> 7 e 1 3
#> 8 e 2 3

How to use a function with a variable number of arguments in R

I have two tibbles with different number of columns. I want to filter df1 using a value from column b and I also want to filter df2 using a value from column b and also column c. Is it possible to do this using the same function?
I followed the list(...) procedure but, of course, I got an error, since in the first case there is no x[[2]].
library(dplyr)
df1 <- tibble(a = c(4,2,3,4),
b = c(8,6,7,8))
df2 <- tibble(a = c(1,2,3,4),
b = c(5,6,7,8),
c = c(1,5,3,7))
df1
#> # A tibble: 4 × 2
#> a b
#> <dbl> <dbl>
#> 1 4 8
#> 2 2 6
#> 3 3 7
#> 4 4 8
df2
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 5 1
#> 2 2 6 5
#> 3 3 7 3
#> 4 4 8 7
createTable <- function(df, ...) {
x <- list(...)
tabl <- df %>%
filter(b < x[[1]], c < x[[2]])
return(tabl)
}
tabl1 <- createTable(df1, 8)
#> Error in `filter()`:
#> ! Problem while computing `..2 = c < x[[2]]`.
#> Caused by error in `x[[2]]`:
#> ! subscript out of bounds
tabl2 <- createTable(df2, 7, 5)
Created on 2022-07-27 by the reprex package (v2.0.1)
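One possible way around the error, sketched under the assumption that the caller is allowed to pass the filter conditions themselves instead of bare threshold values: forward the dots straight to filter(), which accepts any number of conditions. This changes how createTable() is called, so it may not suit every use case.

```r
library(dplyr)

df1 <- tibble(a = c(4, 2, 3, 4),
              b = c(8, 6, 7, 8))
df2 <- tibble(a = c(1, 2, 3, 4),
              b = c(5, 6, 7, 8),
              c = c(1, 5, 3, 7))

# The dots are handed to filter() unchanged, so the function works with
# one condition, two conditions, or more.
createTable <- function(df, ...) {
  df %>%
    filter(...)
}

tabl1 <- createTable(df1, b < 8)         # one condition
tabl2 <- createTable(df2, b < 7, c < 5)  # two conditions
```

Because the conditions are written by the caller, the function no longer needs to know which columns each tibble has.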

Mutate column using array names

Context:
My data analysis involves manipulating ~100 different trials separately, and each trial has >1000 rows. Eventually, one step requires me to combine each trial with a column value from a different dataset. I plan to combine this dataset with each trial within an array using left_join() and "ID" as the key.
Dilemma
I want to mutate() the trial name into a new column labeled "ID". I feel like this should be a simple task, but I'm still a novice when working with lists and arrays.
Working Code
I don't know how to share .csv files, but you can save the example datasets as .csv files within a practice folder named "data".
library(tidyverse)
# Create practice dataset
df1 <- tibble(Time = seq(1, 5, by = 1),
Point = seq(6, 10, by = 1)) %>% print()
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
df2 <- tibble(Time = seq(6, 10, by = 1),
Point = seq(1, 5, by = 1)) %>% print()
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 6 1
2 7 2
3 8 3
4 9 4
5 10 5
write_csv(df1, file.path("data", "21May27_CtYJ10.csv"))
write_csv(df2, file.path("data", "21May27_HrOW07.csv"))
This is the code I have working right now:
# Isolate .csv files from directory into a list
rawFiles_List <- list.files("data", pattern = "\\.csv$", full.names = TRUE) %>% print()
# Naming scheme for files w/n list
trialDate <- list(str_sub(rawFiles_List, 13, 26)) %>%
print() # Adjust the substring to include date and trial
[[1]]
[1] "21May27_CtYJ10" "21May27_HrOW07"
trial <- list(str_sub(rawFiles_List, 21, 26)) %>% print() # Only include trial
[[1]]
[1] "CtYJ10" "HrOW07"
# Combine the list and list names into an array
rawFiles <- array(map(rawFiles_List, read_csv), dimnames = trialDate) %>% print()
Parsed with column specification:
cols(
Time = col_double(),
Point = col_double()
)
Parsed with column specification:
cols(
Time = col_double(),
Point = col_double()
)
$`21May27_CtYJ10`
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
$`21May27_HrOW07`
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 6 1
2 7 2
3 8 3
4 9 4
5 10 5
This partially does what I want:
map(rawFiles, ~ data.frame(.) %>% # Convert to dataframe
# Create a new column with trial name
mutate(ID = map(trial, paste)) %>% # Pastes the list, not the respective value
as_tibble(.)) # Convert back to tibble
$`21May27_CtYJ10`
# A tibble: 5 x 3
Time Point ID
<dbl> <dbl> <list>
1 1 6 <chr [2]>
2 2 7 <chr [2]>
3 3 8 <chr [2]>
4 4 9 <chr [2]>
5 5 10 <chr [2]>
$`21May27_HrOW07`
# A tibble: 5 x 3
Time Point ID
<dbl> <dbl> <list>
1 6 1 <chr [2]>
2 7 2 <chr [2]>
3 8 3 <chr [2]>
4 9 4 <chr [2]>
5 10 5 <chr [2]>
Question:
Can you please help me make a new column filled with their respective trial IDs? I am trying to use mostly tidyverse functions, but I'm open to Base-R functions, too. If you are able to give some explanation as how you match the list elements to the array elements or refer me to a helpful resource, that would be much appreciated.
Bonus Question:
I am working on how to save each file after all manipulations, but I'm not sure if I'm writing my for loop correctly. Could you provide some guidance as how I should edit my for loop? I'm using previous code as a guide, but I'm willing to scrap it if I'm over-complicating things. The following is what I have written so far:
SaveDate <- format(Sys.Date(), format = "%y%b%d")
for (i in 1:length(combFiles)) { # Dataset combing array of trials manipulated
filename <- vector("list", length(rawFiles)) # Vector to fill
filename[[i]] <- paste( # Fill vector with respective filenames
as.data.frame(trial)[[1]][i], "_mod_", SaveDate, ".csv", sep = "")
write.csv(file = filename[[i]],
modFiles[[i]], # Array of trials manipulated
sep = ",", row.names = FALSE, col.names = TRUE)
}
library(tidyverse)
# Create practice dataset
df1 <- tibble(Time = seq(1, 5, by = 1),
Point = seq(6, 10, by = 1)) %>% print()
#> # A tibble: 5 x 2
#> Time Point
#> <dbl> <dbl>
#> 1 1 6
#> 2 2 7
#> 3 3 8
#> 4 4 9
#> 5 5 10
df2 <- tibble(Time = seq(6, 10, by = 1),
Point = seq(1, 5, by = 1)) %>% print()
#> # A tibble: 5 x 2
#> Time Point
#> <dbl> <dbl>
#> 1 6 1
#> 2 7 2
#> 3 8 3
#> 4 9 4
#> 5 10 5
write_csv(df1, "21May27_CtYJ10.csv")
write_csv(df2, "21May27_HrOW07.csv")
rm(df1, df2)
The easiest approach is to use the imap_*() family. These functions loop over every file in your list, pass both the element and its name to the function, and can combine the results if needed. For this to work, the file list must have names.
# Prepare raw file list with names equal to the values
rawFiles_List <- list.files(pattern = "^21May27") %>%
set_names()
rawFiles_List
#> 21May27_CtYJ10.csv 21May27_HrOW07.csv
#> "21May27_CtYJ10.csv" "21May27_HrOW07.csv"
imap_dfr(rawFiles_List,
~ read_csv(.x, col_types = "dd") %>%
add_column(source_file = .y))
#> # A tibble: 10 x 3
#> Time Point source_file
#> <dbl> <dbl> <chr>
#> 1 1 6 21May27_CtYJ10.csv
#> 2 2 7 21May27_CtYJ10.csv
#> 3 3 8 21May27_CtYJ10.csv
#> 4 4 9 21May27_CtYJ10.csv
#> 5 5 10 21May27_CtYJ10.csv
#> 6 6 1 21May27_HrOW07.csv
#> 7 7 2 21May27_HrOW07.csv
#> 8 8 3 21May27_HrOW07.csv
#> 9 9 4 21May27_HrOW07.csv
#> 10 10 5 21May27_HrOW07.csv
If you prefer to stay with a list of data frames and just add a column in each, use imap():
imap(rawFiles_List,
~ read_csv(.x, col_types = "dd") %>%
add_column(source_file = .y))
#> $`21May27_CtYJ10.csv`
#> # A tibble: 5 x 3
#> Time Point source_file
#> <dbl> <dbl> <chr>
#> 1 1 6 21May27_CtYJ10.csv
#> 2 2 7 21May27_CtYJ10.csv
#> 3 3 8 21May27_CtYJ10.csv
#> 4 4 9 21May27_CtYJ10.csv
#> 5 5 10 21May27_CtYJ10.csv
#>
#> $`21May27_HrOW07.csv`
#> # A tibble: 5 x 3
#> Time Point source_file
#> <dbl> <dbl> <chr>
#> 1 6 1 21May27_HrOW07.csv
#> 2 7 2 21May27_HrOW07.csv
#> 3 8 3 21May27_HrOW07.csv
#> 4 9 4 21May27_HrOW07.csv
#> 5 10 5 21May27_HrOW07.csv
Of course, if you manipulate the names of the filelist before running the map command, you can make sure the correct value is inserted in the column:
rawFiles_List <- list.files(pattern = "^21May27") %>%
set_names(str_sub(., 9L, 14L)) # characters 9-14 of these file names hold the trial ID; adjust the offsets if your paths differ
As for saving, I suggest you use iwalk(). I think your for loop is not doing what you want: it reinitializes filename on each pass, erasing its previous content.
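A minimal sketch of the iwalk() idea, assuming the manipulated trials are stored in a list named by trial ID. The modFiles contents and the out_dir folder below are stand-ins for illustration, not the real data.

```r
library(purrr)

# Hypothetical stand-in for the list of manipulated trials, named by trial ID
modFiles <- list(
  CtYJ10 = data.frame(Time = 1:5, Point = 6:10),
  HrOW07 = data.frame(Time = 6:10, Point = 1:5)
)

SaveDate <- format(Sys.Date(), format = "%y%b%d")
out_dir <- tempdir()  # assumption: replace with the real output folder

# Inside iwalk(), .x is the data frame and .y is its name in the list
iwalk(modFiles, ~ write.csv(
  .x,
  file = file.path(out_dir, paste0(.y, "_mod_", SaveDate, ".csv")),
  row.names = FALSE
))
```

iwalk() is called for its side effect (writing the files), so there is no need to pre-allocate or fill a vector of file names.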

How to Create Iterative Formula to calculate Z Score in R?

I have a number of large data frames that have the following basic format, where the final two rows are a mean (d) and standard deviation (e) - although these are calculated elsewhere.
a b c
a 4 3 4
b 3 2 6
c 2 1 8
d 3 2 6
e 1 1 2
I would like to create an iterative function that converts each raw data point into a z-score via the mean and sd value in d and e per column. The formula I would like to apply is ((x-mean)/SD).
The result would be the following:
a b c
a 1 1 -1
b 0 0 0
c -1 -1 1
I don't mind if this is added to the end, created as a new dataframe or the data is converted.
Thanks!
Here is one approach. Note that I do not use the mean/sd provided in the data but recalculate them on the fly.
Also note that the data should usually be in a tidy representation, which in your case would mean that a, b, c are columns and the mean/sd are either calculated on the fly or stored in a separate column (this would require reshaping the data, not shown here).
# your input data
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
raw_data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
#> d 3 2 6
#> e 1 1 2
# remove the mean/sd values
data <- raw_data[!rownames(raw_data) %in% c("d", "e"), ]
data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
# quick way to recalculate the values
means <- apply(data, 2, mean)
means
#> a b c
#> 3 2 6
sds <- apply(data, 2, sd)
sds
#> a b c
#> 1 1 2
z_scores <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
z_scores
#> a b c
#> a 1 1 -1
#> b 0 0 0
#> c -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
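If the mean and sd stored in rows d and e should be used instead of being recalculated, a sketch along the following lines might work. It assumes the same layout as above: the raw values in rows a to c, the mean in row d and the sd in row e.

```r
raw_data <- data.frame(
  a = c(4, 3, 2, 3, 1),
  b = c(3, 2, 1, 2, 1),
  c = c(4, 6, 8, 6, 2),
  row.names = c("a", "b", "c", "d", "e")
)

# Split the values from the stored statistics
vals  <- as.matrix(raw_data[c("a", "b", "c"), ])
mu    <- unlist(raw_data["d", ])  # per-column mean, taken from row d
sigma <- unlist(raw_data["e", ])  # per-column sd, taken from row e

# Subtract the column means, then divide by the column sds
z_supplied <- sweep(sweep(vals, 2, mu, "-"), 2, sigma, "/")
z_supplied
```

sweep() applies the statistic column by column (margin 2), so each value is standardised with the mean and sd of its own column.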
Edit / Full Code
The following code is a bit longer but most of it is spent on getting the data into the right (long/tidy) format.
If you have any questions, feel free to use the comments.
Note that the tidyverse is really helpful, but might need some time to get used to. The code used here is mostly dplyr (included in the tidyverse).
If you understand the functions %>% (pipe), group_by(), mutate(), summarise(), and pivot_longer()/pivot_wider(), you have everything you need.
library(tidyverse)
# use your original dataset again
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
### 1) Turn the data into a nicer format
# match-table how to rename the variables
var_match <- c(d = "mean", e = "sd")
# convert the raw data into a nicer format, first we do some minor changes
# (variable names, etc)
data_mixed <- raw_data %>%
# have the rownames as explicit variable
rownames_to_column("metric") %>%
# nicer printing etc
as_tibble() %>%
# replace variable names with mean/sd
mutate(metric = ifelse(metric %in% c("d", "e"),
var_match[metric], metric))
data_mixed
#> # A tibble: 5 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 4
#> 2 b 3 2 6
#> 3 c 2 1 8
#> 4 mean 3 2 6
#> 5 sd 1 1 2
# separate the dataset into two:
# data holds the values
# data_vars holds the metrics mean and sd
data <- data_mixed %>% filter(!metric %in% var_match) %>% select(-metric)
data_vars <- data_mixed %>% filter(metric %in% var_match)
data
#> # A tibble: 3 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 4 3 4
#> 2 3 2 6
#> 3 2 1 8
data_vars
#> # A tibble: 2 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 mean 3 2 6
#> 2 sd 1 1 2
# turn the value dataset into its longer form, makes it easier to work with it later
data_long <- data %>%
pivot_longer(everything(), names_to = "var", values_to = "val")
data_long
#> # A tibble: 9 x 2
#> var val
#> <chr> <dbl>
#> 1 a 4
#> 2 b 3
#> 3 c 4
#> 4 a 3
#> 5 b 2
#> 6 c 6
#> 7 a 2
#> 8 b 1
#> 9 c 8
# turn the metric dataset into another long form, allowing easy combination in the next step
data_vars2 <- data_vars %>%
pivot_longer(-metric, names_to = "var", values_to = "val") %>%
pivot_wider(id_cols = var, names_from = metric, values_from = val)
data_vars2
#> # A tibble: 3 x 3
#> var mean sd
#> <chr> <dbl> <dbl>
#> 1 a 3 1
#> 2 b 2 1
#> 3 c 6 2
# combine the datasets
data_all <- left_join(data_long, data_vars2, by = "var")
data_all
#> # A tibble: 9 x 4
#> var val mean sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 1
#> 2 b 3 2 1
#> 3 c 4 6 2
#> 4 a 3 3 1
#> 5 b 2 2 1
#> 6 c 6 6 2
#> 7 a 2 3 1
#> 8 b 1 2 1
#> 9 c 8 6 2
## 2) calculate the z-score
# now comes the actual number crunching!
# per variable var (a, b, c) compute the variable val_z as the z-score
data_res <- data_all %>%
group_by(var) %>%
mutate(val_z = (val - mean) / sd)
data_res
#> # A tibble: 9 x 5
#> # Groups: var [3]
#> var val mean sd val_z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 4 3 1 1
#> 2 b 3 2 1 1
#> 3 c 4 6 2 -1
#> 4 a 3 3 1 0
#> 5 b 2 2 1 0
#> 6 c 6 6 2 0
#> 7 a 2 3 1 -1
#> 8 b 1 2 1 -1
#> 9 c 8 6 2 1
## 3) make the results more readable
# lastly pivot the results to its original form
data_res_wide <- data_res %>%
select(var, val_z) %>%
group_by(var) %>%
mutate(id = 1:n()) %>% # needed for easier identification of values
pivot_wider(id_cols = id, names_from = var, values_from = val_z)
data_res_wide
#> # A tibble: 3 x 4
#> id a b c
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 1 -1
#> 2 2 0 0 0
#> 3 3 -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)

invoke_map has the difficulty on finding arguments

I am exploring the tidyverse packages, so I am interested in how to get the following task done in the tidy way. One can easily circumvent the problem using the *apply functions.
Consider the following data
tb <-
lapply(matrix(c("a", "b", "c")), function(x)
rep(x, 3)) %>% unlist %>% c(rep(c(1, 2, 3), 6)) %>% matrix(ncol = 3) %>%
as_tibble(.name_repair = ~ c("tag", "x1", "x2")) %>% type.convert()
# A tibble: 9 x 3
tag x1 x2
<fct> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 1 1
5 b 2 2
6 b 3 3
7 c 1 1
8 c 2 2
9 c 3 3
I group them using the nest() function, and for each group I want to apply a different function from a list of functions f_1, f_2, f_3:
f_1 <- function(x)
x[,1] + x[,2]
f_2 <- function(x)
x[,1] - x[,2]
f_3 <- function(x)
x[,1] * x[,2]
tb_func_attached <-
tb %>% group_by(tag) %>% nest() %>% mutate(func = c(f_1, f_2, f_3))
# A tibble: 3 x 3
tag data func
<fct> <list> <list>
1 a <tibble [3 x 2]> <fn>
2 b <tibble [3 x 2]> <fn>
3 c <tibble [3 x 2]> <fn>
I try to use invoke_map to apply the functions
tb_func_attached %>% {invoke_map(.$func, .$data)}
invoke_map(tb_func_attached$func, tb_func_attached$data)
But I get the error Error in (function (x) : unused arguments (x1 = 1:3, x2 = 1:3), while the following code runs
> tb_func_attached$func[[1]](tb_func_attached$data[[1]])
x1
1 2
2 4
3 6
> tb_func_attached$func[[2]](tb_func_attached$data[[2]])
x1
1 0
2 0
3 0
> tb_func_attached$func[[3]](tb_func_attached$data[[3]])
x1
1 1
2 4
3 9
But invoke_map still does not work.
So the question is, given a nested data tb_func_attached, how to apply the functions tb_func_attached$func 'rowwisely' to tb_func_attached$data?
And a side question: what is the reason for the retirement of invoke_map? It fits quite well with the concept of vectorisation, IMHO.
Update:
The previous version dealt with single-column data (tb had only the tag and x1 columns), and @A. Suliman's comment provides a solution.
However when the data column in the nested tibble has a matrix structure, the code stops running again.
Use map2 to iterate over the list of functions first, and over the data column second. Like this:
tb_func_attached %>%
mutate(output = map2(func, data, ~ .x(.y))) %>%
unnest(data, output)
The output looks this way:
# A tibble: 9 x 4
tag x1 x2 x11
<fct> <int> <int> <int>
1 a 1 1 2
2 a 2 2 4
3 a 3 3 6
4 b 1 1 0
5 b 2 2 0
6 b 3 3 0
7 c 1 1 1
8 c 2 2 4
9 c 3 3 9
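For reference, here is a self-contained variant of the same map2() pattern. It is only a sketch: the tibble is rebuilt directly, and the functions are rewritten to return plain vectors (an assumption that differs from f_1, f_2, f_3 above), so the result column is named output rather than x11.

```r
library(dplyr)
library(tidyr)
library(purrr)

tb <- tibble(
  tag = rep(c("a", "b", "c"), each = 3),
  x1  = rep(1:3, 3),
  x2  = rep(1:3, 3)
)

# One function per group; each takes the nested tibble and returns a vector
funs <- list(
  function(d) d$x1 + d$x2,
  function(d) d$x1 - d$x2,
  function(d) d$x1 * d$x2
)

res <- tb %>%
  nest(data = c(x1, x2)) %>%
  mutate(func = funs,
         # .x is the function, .y is the matching nested tibble
         output = map2(func, data, ~ .x(.y))) %>%
  select(-func) %>%
  unnest(c(data, output))
res
```

map2() walks the func and data columns in parallel, so each function is called once with the whole nested tibble as its single argument.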
