Context:
My data analysis involves manipulating ~100 different trials separately, and each trial has >1000 rows. Eventually, one step requires me to combine each trial with a column value from a different dataset. I plan to combine this dataset with each trial within an array using left_join() and "ID" as the key.
Dilemma
I want to mutate() the trial name into a new column labeled "ID". I feel like this should be a simple task, but I'm still a novice when working with lists and arrays.
Working Code
I don't know how to share .csv files, but you can save the example datasets as .csv files within a practice folder named "data".
library(tidyverse)
# Create practice dataset
df1 <- tibble(Time = seq(1, 5, by = 1),
Point = seq(6, 10, by = 1)) %>% print()
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
df2 <- tibble(Time = seq(6, 10, by = 1),
Point = seq(1, 5, by = 1)) %>% print()
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 6 1
2 7 2
3 8 3
4 9 4
5 10 5
write_csv(df1, file.path("data", "21May27_CtYJ10.csv"))
write_csv(df2, file.path("data", "21May27_HrOW07.csv"))
This is the code I have working right now:
# Isolate .csv files from directory into a list
rawFiles_List <- list.files("data", pattern = ".csv", full = TRUE) %>% print()
# Naming scheme for files w/n list
trialDate <- list(str_sub(rawFiles_List, 13, 26)) %>%
print() # Adjust the substring to include date and trial
[[1]]
[1] "21May27_CtYJ10" "21May27_HrOW07"
trial <- list(str_sub(rawFiles_List, 21, 26)) %>% print() # Only include trial
[[1]]
[1] "CtYJ10" "HrOW07"
# Combine the list and list names into an array
rawFiles <- array(map(rawFiles_List, read_csv), dimnames = trialDate) %>% print()
Parsed with column specification:
cols(
Time = col_double(),
Point = col_double()
)
Parsed with column specification:
cols(
Time = col_double(),
Point = col_double()
)
$`21May27_CtYJ10`
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
$`21May27_HrOW07`
# A tibble: 5 x 2
Time Point
<dbl> <dbl>
1 6 1
2 7 2
3 8 3
4 9 4
5 10 5
This partially does what I want:
map(rawFiles, ~ data.frame(.) %>% # Convert to dataframe
# Create a new column with trial name
mutate(ID = map(trial, paste)) %>% # Pastes the list, not the respective value
as_tibble(.)) # Convert back to tibble
$`21May27_CtYJ10`
# A tibble: 5 x 3
Time Point ID
<dbl> <dbl> <list>
1 1 6 <chr [2]>
2 2 7 <chr [2]>
3 3 8 <chr [2]>
4 4 9 <chr [2]>
5 5 10 <chr [2]>
$`21May27_HrOW07`
# A tibble: 5 x 3
Time Point ID
<dbl> <dbl> <list>
1 6 1 <chr [2]>
2 7 2 <chr [2]>
3 8 3 <chr [2]>
4 9 4 <chr [2]>
5 10 5 <chr [2]>
Question:
Can you please help me make a new column filled with their respective trial IDs? I am trying to use mostly tidyverse functions, but I'm open to Base-R functions, too. If you are able to give some explanation as how you match the list elements to the array elements or refer me to a helpful resource, that would be much appreciated.
Bonus Question:
I am working on how to save each file after all manipulations, but I'm not sure if I'm writing my for loop correctly. Could you provide some guidance as how I should edit my for loop? I'm using previous code as a guide, but I'm willing to scrap it if I'm over-complicating things. The following is what I have written so far:
SaveDate <- format(Sys.Date(), format = "%y%b%d")
for (i in 1:length(combFiles)) { # Dataset combing array of trials manipulated
filename <- vector("list", length(rawFiles)) # Vector to fill
filename[[i]] <- paste( # Fill vector with respective filenames
as.data.frame(trial)[[1]][i], "_mod_", SaveDate, ".csv", sep = "")
write.csv(file = filename[[i]],
modFiles[[i]], # Array of trials manipulated
sep = ",", row.names = FALSE, col.names = TRUE)
}
library(tidyverse)
# Create practice dataset
df1 <- tibble(Time = seq(1, 5, by = 1),
Point = seq(6, 10, by = 1)) %>% print()
#> # A tibble: 5 x 2
#> Time Point
#> <dbl> <dbl>
#> 1 1 6
#> 2 2 7
#> 3 3 8
#> 4 4 9
#> 5 5 10
df2 <- tibble(Time = seq(6, 10, by = 1),
Point = seq(1, 5, by = 1)) %>% print()
#> # A tibble: 5 x 2
#> Time Point
#> <dbl> <dbl>
#> 1 6 1
#> 2 7 2
#> 3 8 3
#> 4 9 4
#> 5 10 5
write_csv(df1, "21May27_CtYJ10.csv")
write_csv(df2, "21May27_HrOW07.csv")
rm(df1, df2)
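The easiest approach is the imap_*() family from purrr. These functions loop over every file in your list and pass both the element (.x) and its name (.y) to the function; the _dfr variant also row-binds the results into one data frame. For this to work, the file list must have names.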
# Prepare raw file list with names equal to the values
rawFiles_List <- list.files(pattern = "^21May27") %>%
set_names()
rawFiles_List
#> 21May27_CtYJ10.csv 21May27_HrOW07.csv
#> "21May27_CtYJ10.csv" "21May27_HrOW07.csv"
imap_dfr(rawFiles_List,
~ read_csv(.x, col_types = "dd") %>%
add_column(source_file = .y))
#> # A tibble: 10 x 3
#> Time Point source_file
#> <dbl> <dbl> <chr>
#> 1 1 6 21May27_CtYJ10.csv
#> 2 2 7 21May27_CtYJ10.csv
#> 3 3 8 21May27_CtYJ10.csv
#> 4 4 9 21May27_CtYJ10.csv
#> 5 5 10 21May27_CtYJ10.csv
#> 6 6 1 21May27_HrOW07.csv
#> 7 7 2 21May27_HrOW07.csv
#> 8 8 3 21May27_HrOW07.csv
#> 9 9 4 21May27_HrOW07.csv
#> 10 10 5 21May27_HrOW07.csv
If you prefer to stay with a list of data frames and just add a column in each, use imap():
imap(rawFiles_List,
~ read_csv(.x, col_types = "dd") %>%
add_column(source_file = .y))
#> $`21May27_CtYJ10.csv`
#> # A tibble: 5 x 3
#> Time Point source_file
#> <dbl> <dbl> <chr>
#> 1 1 6 21May27_CtYJ10.csv
#> 2 2 7 21May27_CtYJ10.csv
#> 3 3 8 21May27_CtYJ10.csv
#> 4 4 9 21May27_CtYJ10.csv
#> 5 5 10 21May27_CtYJ10.csv
#>
#> $`21May27_HrOW07.csv`
#> # A tibble: 5 x 3
#> Time Point source_file
#> <dbl> <dbl> <chr>
#> 1 6 1 21May27_HrOW07.csv
#> 2 7 2 21May27_HrOW07.csv
#> 3 8 3 21May27_HrOW07.csv
#> 4 9 4 21May27_HrOW07.csv
#> 5 10 5 21May27_HrOW07.csv
Of course, if you manipulate the names of the filelist before running the map command, you can make sure the correct value is inserted in the column:
# Characters 9 to 14 hold the trial ID in these file names; adjust if your paths differ
rawFiles_List <- list.files(pattern = "^21May27") %>%
set_names(str_sub(., 9L, 14L))
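Fixed character positions break as soon as the directory path changes length. As a rough sketch, the trial ID can instead be derived from the file name itself. The paths below are hypothetical, and the pattern assumes the ID is everything after the first underscore:

```r
library(purrr)

# Hypothetical paths; the directory depth should not matter
paths <- c("data/21May27_CtYJ10.csv", "some/other/dir/21May27_HrOW07.csv")

# Drop the directory and the extension, then everything up to the first "_"
stems <- tools::file_path_sans_ext(basename(paths))
trial_ids <- sub("^[^_]*_", "", stems)

# Name each path by its trial ID so imap() can pass the ID along as .y
named_paths <- set_names(paths, trial_ids)
names(named_paths)
```
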
As for saving, I suggest you use iwalk(). Your for loop is probably not doing what you want: filename is re-created on every pass, which erases the entries filled in on earlier passes.
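A minimal sketch of the iwalk() approach, assuming the manipulated trials sit in a list named by trial ID. The modFiles contents and the out_dir folder here are stand-ins, not your real data:

```r
library(purrr)
library(readr)

# Stand-in for the list of manipulated trials; the names are the trial IDs
modFiles <- list(
  CtYJ10 = data.frame(Time = 1:5, Point = 6:10),
  HrOW07 = data.frame(Time = 6:10, Point = 1:5)
)

save_date <- format(Sys.Date(), format = "%y%b%d")
out_dir <- tempdir()  # assumption: replace with your own output folder

# .x is the data frame, .y is its name in the list
iwalk(
  modFiles,
  ~ write_csv(.x, file.path(out_dir, paste0(.y, "_mod_", save_date, ".csv")))
)

list.files(out_dir, pattern = "_mod_")
```

Because iwalk() is called for its side effect, there is no need to pre-allocate or fill a vector of file names.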
Related
I'm following up on this question. My LIST of data.frames below is made from my data. However, this LIST is missing the paper column (the name(s) of the missing column(s) are always provided) which is available in the original data.
I was wondering how to put the missing paper column back into LIST to achieve my DESIRED_LIST below?
I tried the solution suggested in this answer (lapply(LIST, function(x)data[do.call(paste, data[names(x)]) %in% do.call(paste, x),])) but it doesn't produce my DESIRED_LIST.
A Base R or tidyverse solution is appreciated.
Reproducible data and code are below.
m2="
paper study sample comp ES bar
1 1 1 1 1 7
1 2 2 2 2 6
1 2 3 3 3 5
2 3 4 4 4 4
2 3 4 4 5 3
2 3 4 5 6 2
2 3 4 5 7 1"
data <- read.table(text=m2,h=T)
LIST <- list(data.frame(study=1 ,sample=1 ,comp=1),
data.frame(study=rep(3,4),sample=rep(4,4),comp=c(4,4,5,5)),
data.frame(study=c(2,2) ,sample=c(2,3) ,comp=c(2,3)))
DESIRED_LIST <- list(data.frame(paper=1 ,study=1 ,sample=1 ,comp=1),
data.frame(paper=rep(2,4),study=rep(3,4),sample=rep(4,4),comp=c(4,4,5,5)),
data.frame(paper=rep(1,2),study=c(2,2) ,sample=c(2,3) ,comp=c(2,3)))
Here is a solution using the data.table package. Is this what you were looking for?
Reprex 1
library(data.table)
cols_to_remove <- c("ES", "bar")
# Note: setDT() and := modify `data` in place
split(setDT(data)[, (cols_to_remove) := NULL], by = c("paper", "study"))
#> $`1.1`
#> paper study sample comp
#> 1: 1 1 1 1
#>
#> $`1.2`
#> paper study sample comp
#> 1: 1 2 2 2
#> 2: 1 2 3 3
#>
#> $`2.3`
#> paper study sample comp
#> 1: 2 3 4 4
#> 2: 2 3 4 4
#> 3: 2 3 4 5
#> 4: 2 3 4 5
Created on 2021-11-06 by the reprex package (v2.0.1)
EDIT
Please find solution 2 with the package dplyr
Reprex 2
library(dplyr)
drop.cols <- c("ES", "bar")
data %>%
group_by(paper, study) %>%
# all_of() makes it explicit that drop.cols is an external character vector
select(-all_of(drop.cols)) %>%
group_split()
#> <list_of<
#> tbl_df<
#> paper : integer
#> study : integer
#> sample: integer
#> comp : integer
#> >
#> >[3]>
#> [[1]]
#> # A tibble: 1 x 4
#> paper study sample comp
#> <int> <int> <int> <int>
#> 1 1 1 1 1
#>
#> [[2]]
#> # A tibble: 2 x 4
#> paper study sample comp
#> <int> <int> <int> <int>
#> 1 1 2 2 2
#> 2 1 2 3 3
#>
#> [[3]]
#> # A tibble: 4 x 4
#> paper study sample comp
#> <int> <int> <int> <int>
#> 1 2 3 4 4
#> 2 2 3 4 4
#> 3 2 3 4 5
#> 4 2 3 4 5
Created on 2021-11-07 by the reprex package (v2.0.1)
Consider ave to create a grouping column (due to repeated rows) and then run an iterative merge.
DESIRED_LIST_SO <- lapply(
LIST,
function(df) merge(
transform(data, grp = ave(paper, paper, study, sample, comp, FUN=seq_along)),
transform(df, grp = ave(study, study, sample, comp, FUN=seq_along)),
by=c("study", "sample", "comp", "grp")
)[c("paper", "study", "sample", "comp")]
)
all.equal(DESIRED_LIST, DESIRED_LIST_SO)
[1] TRUE
(Consider keeping the unique identifiers, ES and bar, in the desired list to avoid duplicate rows.)
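To illustrate what the ave() call contributes, here is a small sketch on a toy data frame (not the data above). With FUN = seq_along it numbers the rows within each group, so repeated rows get distinct grp values and can be matched one-to-one in the merge:

```r
# Toy data with repeated rows
df <- data.frame(study = c(3, 3, 3, 3), comp = c(4, 4, 5, 5))

# Number the rows within each (study, comp) combination
df$grp <- ave(df$study, df$study, df$comp, FUN = seq_along)
df
```
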
A tidyverse solution. First, create a look-up table, data2, which contains the four target columns. mutate(across(everything(), as.numeric)) makes the column types consistent with LIST; it may not be needed. Second, use map() to apply left_join() to every data frame in LIST. LIST2 and DESIRED_LIST are identical.
library(dplyr)
library(purrr)
data2 <- data %>%
distinct(paper, study, sample, comp) %>%
mutate(across(everything(), as.numeric))
LIST2 <- map(LIST, function(x){
x2 <- x %>%
left_join(data2, by = names(x)) %>%
select(all_of(names(data2)))
return(x2)
})
# Check if the results are the same
identical(DESIRED_LIST, LIST2)
# [1] TRUE
I have a number of large data frames that have the following basic format, where the final two rows are a mean (d) and standard deviation (e) - although these are calculated elsewhere.
a b c
a 4 3 4
b 3 2 6
c 2 1 8
d 3 2 6
e 1 1 2
I would like to create an iterative function that converts each raw data point into a z-score via the mean and sd value in d and e per column. The formula I would like to apply is ((x-mean)/SD).
The result would be the following:
a b c
a 1 1 -1
b 0 0 0
c -1 -1 1
I don't mind if this is added to the end, created as a new dataframe or the data is converted.
Thanks!
Here is one approach. Note that I do not use the mean/sd provided in the data but recalculate them on the fly.
Also note that the data should usually be in a tidy representation. In your case that means a, b, c would be columns, and the mean/sd would either be calculated on the fly or stored in separate columns (this requires reshaping the data, which is not shown here).
# your input data
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
raw_data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
#> d 3 2 6
#> e 1 1 2
# remove the mean/sd values
data <- raw_data[!rownames(raw_data) %in% c("d", "e"), ]
data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
# quick way to recalculate the values
means <- apply(data, 2, mean)
means
#> a b c
#> 3 2 6
sds <- apply(data, 2, sd)
sds
#> a b c
#> 1 1 2
z_scores <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
z_scores
#> a b c
#> a 1 1 -1
#> b 0 0 0
#> c -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
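If you would rather use the mean and sd already stored in rows d and e than recalculate them, one possible sketch uses sweep() from base R. It assumes the layout from the question, with the mean in row "d" and the sd in row "e":

```r
raw_data <- data.frame(
  a = c(4, 3, 2, 3, 1),
  b = c(3, 2, 1, 2, 1),
  c = c(4, 6, 8, 6, 2),
  row.names = c("a", "b", "c", "d", "e")
)

# Split the raw values from the stored summary rows
values <- as.matrix(raw_data[c("a", "b", "c"), ])
means <- unlist(raw_data["d", ])
sds <- unlist(raw_data["e", ])

# Subtract each column's mean, then divide by each column's sd
z_scores <- sweep(sweep(values, 2, means, "-"), 2, sds, "/")
z_scores
```
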
Edit / Full Code
The following code is a bit longer but most of it is spent on getting the data into the right (long/tidy) format.
If you have any questions, feel free to use the comments.
Note that the tidyverse is really helpful, but might need some time to get used to. The code used here is mostly dplyr (included in the tidyverse).
If you understand the functions %>% (pipe), group_by(), mutate(), summarise(), and pivot_longer/wider(), you have everything you need.
library(tidyverse)
# use your original dataset again
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
### 1) Turn the data into a nicer format
# match-table how to rename the variables
var_match <- c(d = "mean", e = "sd")
# convert the raw data into a nicer format, first we do some minor changes
# (variable names, etc)
data_mixed <- raw_data %>%
# have the rownames as explicit variable
rownames_to_column("metric") %>%
# nicer printing etc
as_tibble() %>%
# replace variable names with mean/sd
mutate(metric = ifelse(metric %in% c("d", "e"),
var_match[metric], metric))
data_mixed
#> # A tibble: 5 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 4
#> 2 b 3 2 6
#> 3 c 2 1 8
#> 4 mean 3 2 6
#> 5 sd 1 1 2
# separate the dataset into two:
# data holds the values
# data_vars holds the metrics mean and sd
data <- data_mixed %>% filter(!metric %in% var_match) %>% select(-metric)
data_vars <- data_mixed %>% filter(metric %in% var_match)
data
#> # A tibble: 3 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 4 3 4
#> 2 3 2 6
#> 3 2 1 8
data_vars
#> # A tibble: 2 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 mean 3 2 6
#> 2 sd 1 1 2
# turn the value dataset into its longer form, makes it easier to work with it later
data_long <- data %>%
pivot_longer(everything(), names_to = "var", values_to = "val")
data_long
#> # A tibble: 9 x 2
#> var val
#> <chr> <dbl>
#> 1 a 4
#> 2 b 3
#> 3 c 4
#> 4 a 3
#> 5 b 2
#> 6 c 6
#> 7 a 2
#> 8 b 1
#> 9 c 8
# turn the metric dataset into another long form, allowing easy combination in the next step
data_vars2 <- data_vars %>%
pivot_longer(-metric, names_to = "var", values_to = "val") %>%
pivot_wider(id_cols = var, names_from = metric, values_from = val)
data_vars2
#> # A tibble: 3 x 3
#> var mean sd
#> <chr> <dbl> <dbl>
#> 1 a 3 1
#> 2 b 2 1
#> 3 c 6 2
# combine the datasets
data_all <- left_join(data_long, data_vars2, by = "var")
data_all
#> # A tibble: 9 x 4
#> var val mean sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 1
#> 2 b 3 2 1
#> 3 c 4 6 2
#> 4 a 3 3 1
#> 5 b 2 2 1
#> 6 c 6 6 2
#> 7 a 2 3 1
#> 8 b 1 2 1
#> 9 c 8 6 2
## 2) calculate the z-score
# now comes the actual number crunching!
# per variable var (a, b, c) compute the variable val_z as the z-score
data_res <- data_all %>%
group_by(var) %>%
mutate(val_z = (val - mean) / sd)
data_res
#> # A tibble: 9 x 5
#> # Groups: var [3]
#> var val mean sd val_z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 4 3 1 1
#> 2 b 3 2 1 1
#> 3 c 4 6 2 -1
#> 4 a 3 3 1 0
#> 5 b 2 2 1 0
#> 6 c 6 6 2 0
#> 7 a 2 3 1 -1
#> 8 b 1 2 1 -1
#> 9 c 8 6 2 1
## 3) make the results more readable
# lastly pivot the results to its original form
data_res_wide <- data_res %>%
select(var, val_z) %>%
group_by(var) %>%
mutate(id = 1:n()) %>% # needed for easier identification of values
pivot_wider(id_cols = id, names_from = var, values_from = val_z)
data_res_wide
#> # A tibble: 3 x 4
#> id a b c
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 1 -1
#> 2 2 0 0 0
#> 3 3 -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
I have the following data with ID and value:
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2","1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2","2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5","2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1","2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
Notice that there are multiple values for the same id. I'd like to keep only the rows with values 3 and 6, and only for IDs that have both. For example, ID "1103-5" has both 3 and 6, so it should be in the result, but "2347-2" should not.
I'm using R
One method I tried is the following, but it gives me everything with value 3 and 6.
d <- data.frame(id, value)
group36 <- d[d$value == 3 | d$value == 6,]
and
d %>% group_by(id) %>% filter(3 == value | 6 == value)
The output should be like this:
id value
1103-5 6
1103-5 3
1104-2 6
1104-2 3
1104-4 6
1104-4 3
1106-2 6
1106-2 3
1106-3 6
1106-3 3
2294-1 3
2294-1 6
2294-2 3
2294-2 6
2294-3 3
2294-3 6
2294-4 3
2294-4 6
2294-5 3
2294-5 6
d<-group_by(d,id)
filter(d,any(value==3),any(value==6))
This gives you all the IDs that have both a value of 3 (somewhere) AND a value of 6 (somewhere). Mind you, your data contains some IDs with THREE values. In these cases, if both 3 and 6 are present, all rows of that ID, including the third value, are included in the result.
If you want to drop the remaining rows that do not equal 3 or 6, add this condition:
filter(d,value==3 | value==6)
To keep only the IDs that have both 3 and 6 and, within them, only the rows equal to 3 or 6 (which matches your desired output), combine everything in one call:
filter(d,any(value==3),any(value==6),value==3 | value==6)
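The difference between dropping rows and dropping whole IDs is easy to miss, so here is a small sketch on toy data (the ids A to D are made up). The `strict` variant is an extra that is not needed for the desired output; it shows how to exclude an ID entirely when it has any value other than 3 or 6:

```r
library(dplyr)

d_small <- data.frame(
  id    = c("A", "A", "B", "B", "B", "C", "C", "D", "D"),
  value = c(3, 6, 9, 3, 6, 9, 6, 9, 3)
)

# IDs with both 3 and 6; rows with other values are dropped, the ID is kept
both <- d_small %>%
  group_by(id) %>%
  filter(any(value == 3), any(value == 6), value %in% c(3, 6)) %>%
  ungroup()

# IDs with both 3 and 6 and nothing else; B is excluded because it also has a 9
strict <- d_small %>%
  group_by(id) %>%
  filter(any(value == 3), any(value == 6), all(value %in% c(3, 6))) %>%
  ungroup()

both
strict
```
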
Not sure if this is what you want. We can filter rows that equal to either 3 or 6 then convert from long to wide format and keep only columns which have both 3 and 6 values. After that, convert back to long format.
library(dplyr)
library(tidyr)
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2",
"1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2",
"2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5",
"2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1",
"2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
d <- data.frame(id, value)
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.)))
#> # A tibble: 2 x 11
#> rows `1103-5` `1104-2` `1104-4` `1106-2` `1106-3` `2294-1` `2294-2`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 6 6 6 6 3 3
#> 2 2 3 3 3 3 3 6 6
#> # ... with 3 more variables: `2294-3` <dbl>, `2294-4` <dbl>,
#> # `2294-5` <dbl>
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.))) %>%
select(-rows) %>%
gather(id, value)
#> # A tibble: 20 x 2
#> id value
#> <chr> <dbl>
#> 1 1103-5 6
#> 2 1103-5 3
#> 3 1104-2 6
#> 4 1104-2 3
#> 5 1104-4 6
#> 6 1104-4 3
#> 7 1106-2 6
#> 8 1106-2 3
#> 9 1106-3 6
#> 10 1106-3 3
#> 11 2294-1 3
#> 12 2294-1 6
#> 13 2294-2 3
#> 14 2294-2 6
#> 15 2294-3 3
#> 16 2294-3 6
#> 17 2294-4 3
#> 18 2294-4 6
#> 19 2294-5 3
#> 20 2294-5 6
Created on 2018-07-01 by the reprex package (v0.2.0.9000).
I've imported an excel data set and want to set nearly all columns (greater than 90) to numeric when they are initially characters. What is the best way to achieve this because importing and changing each to numeric one by one isn't the most efficient approach?
This should do as you wish:
# Random data frame for illustration (100 columns wide)
df <- data.frame(replicate(100,sample(0:1,1000,rep=TRUE)))
# Check column names / return column number (just in case you want to check)
colnames(df)
# Specify columns
cols <- seq_along(df) # covers every column, so it still works if you add more columns later
# Or, if you only want specific column numbers:
# cols <- c(1:100)
#With help of magrittr pipe function change all to numeric
library(magrittr)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Check our columns are numeric
str(df)
Assuming your data is already imported with all character columns, you can convert the relevant columns to numeric using mutate_at by position or name:
suppressPackageStartupMessages(library(tidyverse))
# Assume the imported excel file has 5 columns a to e
df <- tibble(a = as.character(1:3),
b = as.character(5:7),
c = as.character(8:10),
d = as.character(2:4),
e = as.character(2:4))
# select the columns by position (convert all except 'b')
df %>% mutate_at(c(1, 3:5), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# or drop the columns that shouldn't be used ('b' and 'd' should stay as chr)
df %>% mutate_at(-c(2, 4), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# select the columns by name
df %>% mutate_at(c("a", "c", "d", "e"), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
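In dplyr 1.0.0 and later, mutate_at() is superseded by across(). A sketch of the same idea with that interface, using a small made-up tibble (df_chr is hypothetical, with 'b' meant to stay character):

```r
library(dplyr)

df_chr <- tibble(a = as.character(1:3),
                 b = as.character(5:7),
                 c = as.character(8:10))

# Convert every column except 'b'
df_num <- df_chr %>% mutate(across(-b, as.numeric))

# Or convert every character column, whatever it is called
df_all <- df_chr %>% mutate(across(where(is.character), as.numeric))

str(df_num)
```
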
This question uses a data.frame which contains list-columns (nested). It had me wondering why/if there's an advantage to working this way. I assumed you would want to minimize the amount of memory each table uses...But when I checked I was surprised:
Compare table sizes for nested vs. tidy format:
1. Generate nested/tidy versions of a 2-col and 5-col data.frame:
library(pryr)
library(dplyr)
library(tidyr)
library(ggvis)
n <- 1:1E6
df <- data_frame(id = n, vars = lapply(n, function(x) x <- sample(letters,sample(1:26,1))))
dfu <- df %>% unnest(vars)
df_morecols <- data_frame(id = n, other1 = n, other2 = n, other3 = n,
vars = lapply(n, function(x) x <- sample(letters,sample(1:26,1))))
dfu_morecols <- df_morecols %>% unnest(vars)
they look like:
head(df)
#> Source: local data frame [6 x 2]
#> id vars
#> 1 1 <chr[16]>
#> 2 2 <chr[4]>
#> 3 3 <chr[26]>
#> 4 4 <chr[9]>
#> 5 5 <chr[11]>
#> 6 6 <chr[18]>
head(dfu)
#> Source: local data frame [6 x 2]
#> id vars
#> 1 1 k
#> 2 1 d
#> 3 1 s
#> 4 1 j
#> 5 1 m
#> 6 1 t
head(df_morecols)
#> Source: local data frame [6 x 5]
#> id other1 other2 other3 vars
#> 1 1 1 1 1 <chr[4]>
#> 2 2 2 2 2 <chr[22]>
#> 3 3 3 3 3 <chr[24]>
#> 4 4 4 4 4 <chr[6]>
#> 5 5 5 5 5 <chr[15]>
#> 6 6 6 6 6 <chr[11]>
head(dfu_morecols)
#> Source: local data frame [6 x 5]
#> id other1 other2 other3 vars
#> 1 1 1 1 1 r
#> 2 1 1 1 1 p
#> 3 1 1 1 1 s
#> 4 1 1 1 1 w
#> 5 2 2 2 2 l
#> 6 2 2 2 2 j
2. Calculate object sizes and col sizes
from: lapply(list(df,dfu,df_morecols,dfu_morecols),object_size)
170 MB vs. 162 MB for nested vs. tidy 2-col df
170 MB vs. 324 MB for nested vs. tidy 5-col df
col_sizes <- sapply(c(df,dfu,df_morecols,dfu_morecols),object_size)
col_names <- names(col_sizes)
parent_obj <- c(rep(c('df','dfu'),each = 2),
rep(c('df_morecols','dfu_morecols'),each = 5))
res <- data_frame(parent_obj,col_names,col_sizes) %>%
unite(elementof, parent_obj,col_names, remove = F)
3. Plot columns sizes coloured by parent object:
res %>%
ggvis(y = ~elementof, x = ~0, x2 = ~col_sizes, fill = ~parent_obj) %>%
layer_rects(height = band())
Questions:
What explains the smaller footprint of the tidy 2-col df compared to the nested one?
Why doesn't this effect carry over to the 5-col df?