write from nested dataframe with on-the-fly filename using purrr::walk - r

I'm applying a function to a nested dataframe using purrr::map to get a new dataframe list column.
Now I want to write each of these new dataframes to file using column values from the same row as part of the filename.
I'm stuck on how to pull the other column values out so I can build the filename. I'm confident purrr::walk should be involved, but I can't work out how to access both the column values and the list-column dataframe for each row.
Reprex below:
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
# Data
data("mtcars")
mtcars_nest <- mtcars %>% rownames_to_column() %>% rename(rowname_1 = rowname) %>% select(-mpg) %>% group_by(cyl) %>% nest()
mtcars_mpg <- mtcars %>% rownames_to_column() %>% rename(rowname_2 = rowname) %>% select(rowname_2, mpg)
# Function to apply to nested dataframe
join_df <- function(df_nest, df_other) {
  df_all <- inner_join(df_nest, df_other, by = c("rowname_1" = "rowname_2"))
  return(df_all)
}
# 1. Apply function to `$data` to get new dataframe list column and add an extra 'case' column for filename
mtcars_nest %>%
mutate(case = c("first", "second", "third")) %>%
mutate(new_mpg = map(data, ~ join_df(., mtcars_mpg)))
# 2. Now write `$new_mpg` to file with filename sources from $cyl and $case
# I think `walk` is the correct function to use, but how do I pass the two row values into the filename?
## Not real code##
# mtcars_nest %>%
# walk(., function(x) {write.csv(., file = paste0(cyl, "_", case, ".csv")})

Use pwalk:
... %>%
  select(cyl, case, new_mpg) %>%
  pwalk(~ write.csv(..3, file = paste0(..1, '_', ..2, '.csv')))
Chain after your code:
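mtcars_nest %>%
  # tidyr >= 1.0.0 keeps the cyl grouping after nest(); ungroup so `case` can be a length-3 vector
  ungroup() %>%
  mutate(case = c("first", "second", "third")) %>%
  mutate(new_mpg = map(data, ~ join_df(., mtcars_mpg))) %>%
  select(cyl, case, new_mpg) %>%
  pwalk(~ write.csv(..3, file = paste0(..1, '_', ..2, '.csv')))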
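A variant that may be easier to read is to give pwalk a function whose argument names match the column names, so each row's values are matched by name rather than by position. The sketch below is only an illustration: it writes the nested `data` column directly (skipping the join), and `out_dir` and `written` are names made up for the example, with `tempdir()` standing in for whatever output folder you actually use.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Assumption: write into the session's temporary directory
out_dir <- tempdir()

nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup() %>%                       # drop grouping so `case` can be a length-3 vector
  mutate(case = c("first", "second", "third"))

# pwalk() calls the function once per row; arguments are matched to columns by name
nested %>%
  select(cyl, case, data) %>%
  pwalk(function(cyl, case, data) {
    write.csv(data,
              file = file.path(out_dir, paste0(cyl, "_", case, ".csv")),
              row.names = FALSE)
  })

# Paths that should now exist, one per row of `nested`
written <- file.path(out_dir, paste0(nested$cyl, "_", nested$case, ".csv"))
```

Matching by name means the order of columns in select() no longer matters, which is less fragile than ..1, ..2, ..3 if the pipeline changes later.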

Related

How to apply multiple functions to a list of data frames?

I have a list of more than 50 csv files with the same numbers of columns and rows.
I want to find the percentage of missing values for each of the data frames and I have found the code that works fine with a single file which is the following:
missing.values <- estaciones2 %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing))
Now I want to apply these functions to each of my data frames in my list.
I read that I can use the map function to loop over the list and run the code for each file, but I am not sure how to insert map into the code shown above. I tried the following, but it doesn't seem right:
missing.values <- map(estaciones2, ~ map(estaciones2, ~ estaciones2 %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing)))
We need a lambda function (~) to loop over the list (assuming estaciones2 is a list object). Inside the lambda, .x refers to the current data.frame element of the list.
library(purrr)
library(tidyr)
library(dplyr)
map(estaciones2, ~ .x %>%
  gather(key = "key", value = "val") %>%
  mutate(is.missing = is.na(val)) %>%
  group_by(key, is.missing) %>%
  summarise(num.missing = n()) %>%
  filter(is.missing) %>%
  select(-is.missing) %>%
  arrange(desc(num.missing)))
In the OP's code, the nested map calls pass the whole list estaciones2 into the pipeline on every iteration instead of the individual list element.
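Since the question asks for the percentage of missing values, here is a small sketch of the same map pattern that returns percentages rather than counts. The list `stations` and the helper `pct_na` are made up for illustration (the real estaciones2 is not shown in the question), and across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the list of data frames read from the csv files
stations <- list(
  a = data.frame(temp = c(1, NA, 3, 4), rain = c(NA, NA, 0, 2)),
  b = data.frame(temp = c(5, 6, 7, 8), rain = c(1, NA, 3, 4))
)

# Percentage of NA values in every column of one data frame
pct_na <- function(df) {
  summarise(df, across(everything(), ~ mean(is.na(.x)) * 100))
}

# Apply the helper to each element of the list
pct_missing <- map(stations, pct_na)
```

Wrapping the per-data-frame logic in a named function keeps the map call short and makes it easy to test the logic on a single file first.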

Save intermediate list output in dplyr pipeline and map it back to another list further down the pipeline - R

I am running PCAs on groups in a data set using dplyr pipelines. I start with group_split, so I am working with a list. To run the prcomp() function, only the numeric columns of each list element can be included, but I would like the factor column brought back in for plotting at the end. I have tried saving an intermediate output using {. ->> temp} partway through the pipeline, but since it is a list, I don't know how to index the grouping column when plotting.
library(tidyverse)
library(ggbiplot)
iris %>%
group_split(Species, keep = T) %>% #group by species, one pca per species
{. ->> temp} %>% # save intermediate output to preserve species column for use in plotting later
map(~.x %>% select_if(is.numeric) %>% select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE))%>% #run pca on numeric columns only
map(~ggbiplot(.x), label=temp$Species)#plot each pca, labeling points as species names from the temporary object
This works to produce one PCA plot for each species in the iris data set, but since temp$Species is NULL (temp is a list, not a data frame), the points are not labelled.
If you use map2() and pass the .y argument as the species list you can get the result I think you want. Note that in your original code the labels argument was outside the ggbiplot() function and was ignored.
library(tidyverse)
library(ggbiplot)
iris %>%
  group_split(Species, keep = TRUE) %>%
  {. ->> temp} %>%
  map(~.x %>%
    select_if(is.numeric) %>%
    select_if(~var(.) != 0) %>%
    prcomp(scale. = TRUE)) %>%
  map2(map(temp, "Species"), ~ggbiplot(.x, labels = .y))
In response to your comment, if you wanted to add a third argument you could use pmap() instead of map2(). In the example below, pmap() is being passed a (nested) list of the data for the ggbiplot() arguments. Note I've changed the new variable so that it's a factor and not constant across groups.
iris %>%
  mutate(new = factor(sample(1:3, 150, replace = TRUE))) %>%
  group_split(Species, keep = TRUE) %>%
  {. ->> temp} %>%
  map(~.x %>%
    select_if(is.numeric) %>%
    select_if(~var(.) != 0) %>%
    prcomp(scale. = TRUE)) %>%
  list(map(temp, "Species"), map(temp, "new")) %>%
  pmap(~ ggbiplot(pcobj = ..1, labels = ..2, groups = ..3))
One option is to use split and imap
library(tidyverse)
library(ggbiplot)
iris %>%
  split(.$Species) %>% # named list, one element per species; imap() receives the name as .y
  map(~.x %>% select_if(is.numeric) %>% select_if(~var(.) != 0) %>%
    prcomp(scale. = TRUE)) %>%
  imap(~ggbiplot(.x, labels = .y))
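The pairing idea behind the map2() approach can be checked without ggbiplot. The sketch below is only an illustration: it keeps the split data in a separate object (`groups`, a made-up name) instead of using `->>`, runs one PCA per group, and then uses map2() to attach each group's Species column back onto its PCA scores. It assumes dplyr 1.0.0 or later for where().

```r
library(dplyr)
library(purrr)

# Keep the split data in its own object so the factor column stays available
groups <- split(iris, iris$Species)

# One PCA per species, using the numeric columns only
pcas <- map(groups, ~ prcomp(select(.x, where(is.numeric)), scale. = TRUE))

# Walk over the PCA results and the original groups in parallel:
# .x is the prcomp object, .y is the matching data frame
scores <- map2(pcas, groups, ~ data.frame(.x$x[, 1:2], Species = .y$Species))
```

Each element of `scores` now holds the first two principal components alongside the labels, which can be handed to any plotting function.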

Standardize column names in excel sheets before combining with purrr and readxl

I would like to compile an Excel file with multiple tabs labeled by year (2016, 2015, 2014, etc). Each tab has identical data, but column names may be spelled differently from year-to-year.
I would like to standardize columns in each sheet before combining.
This is the generic way of combining using purrr and readxl for such tasks:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map_dfr(read_excel, path = my.file, .id = "sheet")
...however as noted, this creates separate columns for "COLUMN ONE", and "Column One", which have the same data.
Inserting make.names into the pipeline would probably be the best solution.
Keeping it all together would be ideal...something like:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map(read_excel, path = my.file) %>%
map(~(names(.) %>% #<---WRONG
make.names() %>%
str_to_upper() %>%
str_trim() %>%
set_names()) )
..but the syntax is all wrong.
Rather than defining your own function, the clean_names function from the janitor package may be able to help you. It takes a dataframe/tibble as an input and returns a dataframe/tibble with clean names as an output.
Here's an example:
library(tidyverse)
tibble(" a col name" = 1,
"another-col-NAME" = 2,
"yet another name " = 3) %>%
janitor::clean_names()
#> # A tibble: 1 x 3
#> a_col_name another_col_name yet_another_name
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
You can then plop it right into the code you gave:
combined.df <- excel_sheets(my.file) %>%
  set_names() %>%
  map(read_excel, path = my.file) %>% #<Import as list, not dfr
  map(janitor::clean_names) %>% #<janitor::clean_names
  bind_rows(.id = "sheet")
Creating a new function is doable but is verbose and uses two maps:
# User defined function: col_rename
col_rename <- function(df){
  names(df) <- names(df) %>%
    str_trim() %>% # trim first: make.names() would turn stray spaces into dots
    str_to_upper() %>%
    make.names()
  return(df)
}
combined.df <- excel_sheets(my.file) %>%
  set_names() %>%
  map(read_excel, path = my.file) %>% #<Import as list, not dfr
  map(col_rename) %>% #<Fix colnames (user defined function)
  bind_rows(.id = "sheet")
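To try the rename-then-bind step without an Excel file, here is a minimal sketch on an in-memory list. The list `sheets` and the helper `standardize_names` are made up for illustration and stand in for the result of mapping read_excel over the sheet names.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical stand-in for sheets whose headers differ in case and spacing
sheets <- list(
  `2016` = tibble(`COLUMN ONE` = 1:2, ` Column Two` = c("a", "b")),
  `2015` = tibble(`Column One` = 3:4, `column two ` = c("c", "d"))
)

# Trim, upper-case, then make syntactically valid names
standardize_names <- function(df) {
  names(df) <- names(df) %>%
    str_trim() %>%
    str_to_upper() %>%
    make.names()
  df
}

# Standardize every sheet, then stack them with the sheet name as an id column
combined <- sheets %>%
  map(standardize_names) %>%
  bind_rows(.id = "sheet")
```

Because both sheets end up with the same column names, bind_rows() stacks them into shared columns instead of creating one column per spelling.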

R: dynamic variable name comparisons

I recoded a bunch of variables in a dataset and gave the newly recoded variables the prefix "r_".
I'd like to run table on the pairs to ensure the recoding was correct. Something like table(v1, r_v1), but I need to do it for lots of variables. They are not in any particular order, so I couldn't use indexing.
Here is a reproducible example of data one can use (also any tips on optimizing that code are appreciated!).
mtcars %>% select(c(disp,hp)) %>%
mutate_all(funs(if_else(.>100,1,0))) %>%
rename_(.dots=setNames(names(.), paste0('r_', names(.)))) %>%
cbind(mtcars,.)
Any ideas?
I would just use variable names and a simple for loop. Calling your modified data dd,
orig = c("disp", "hp")
trans = paste0("r_", orig)
check_list = list()
for (i in seq_along(orig)) {
  check_list[[i]] = table(dd[[orig[i]]], dd[[trans[i]]])
  # or whatever other check you want to do
}
check_list
You can then examine the check_list contents one at a time.
To keep things in the tidy format with which you started:
library(dplyr)
library(purrr)
library(tidyr)
mtcars %>%
  select(disp, hp) %>%
  # funs() is deprecated since dplyr 0.8.0; a named list of lambdas gives the same disp_r / hp_r columns
  mutate_all(list(r = ~ if_else(. > 100, 1, 0))) %>%
  mutate(index = row_number()) %>%
  gather(key = key, value = value, -index) %>%
  separate(key, c("Variable", "Type")) %>%
  mutate(Type = ifelse(is.na(Type), "Original", "Recode")) %>%
  spread(key = Type, value = value) %>%
  select(-index) %>%
  split(.$Variable) %>%
  map(~ select(., -Variable)) %>%
  map(~ table(.))
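A middle ground between the for loop and the reshaping pipeline is to map over the variable names themselves. The sketch below is an illustration under a few assumptions: the recode is built with across() (dplyr 1.0.0 or later), the names `dd`, `orig` and `checks` are made up for the example, and the original side of each table is reduced to the same `> 100` condition so the table stays 2 x 2.

```r
library(dplyr)
library(purrr)

# Recreate the recoded data with the "r_" prefix
dd <- mtcars %>%
  mutate(across(c(disp, hp), ~ if_else(.x > 100, 1, 0), .names = "r_{.col}"))

orig <- c("disp", "hp")

# One cross-tabulation per variable; set_names() makes the result a named list
checks <- map(set_names(orig), ~ table(
  original = dd[[.x]] > 100,
  recoded  = dd[[paste0("r_", .x)]]
))
```

If the recoding is correct, the off-diagonal cells of each table (for example `checks$disp["TRUE", "0"]`) are zero.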

Transpose data frame variables and add null, unique counts in [r]

I am trying to build a summary table of a data frame like DataProfile below.
The idea is to transform each column into a row and add variables for count, nulls, not nulls, unique, and add additional mutations of those variables.
It seems like there should be a better faster way to do this. Is there a function that does this?
#trying to write the functions within dplyr & magrittr framework
library(tidyverse)
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
Nulls = nulls$value,
Filled = filled $value,
Complete = Filled/Total,
Cardinality = uniques$value,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt
The summarise_all() function can process more than one function at a time, so you can consolidate code by doing it in one pass then formatting your data to get to the type of "profile" per variable that you want.
library(tidyverse)
library(reshape2) # provides melt(), which is not part of the tidyverse
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
  summarise_all(funs("Total" = n(),
                     "Nulls" = sum(is.na(.)),
                     "Filled" = sum(!is.na(.)),
                     "Cardinality" = length(unique(.)))) %>%
  melt() %>%
  separate(variable, into = c('variable', 'measure'), sep="_") %>%
  spread(measure, value) %>%
  mutate(Complete = Filled/Total,
         Uniqueness = Cardinality/Total,
         Distinctness = Cardinality/Filled)
DataProfile
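The same profile can also be built without melt() or funs() by summarising each column into a one-row tibble and stacking the rows. This is a sketch of an alternative, not the approach above; `cars` and `profile` are names made up for the example so that the built-in mtcars is left untouched.

```r
library(dplyr)
library(purrr)

cars <- mtcars
cars[2, 2] <- NA  # add a missing value to test completeness

# One row of summary statistics per column, stacked with the column name as id
profile <- map(cars, ~ tibble(
    Total       = length(.x),
    Nulls       = sum(is.na(.x)),
    Filled      = sum(!is.na(.x)),
    Cardinality = n_distinct(.x)   # counts NA as a distinct value, like length(unique(.))
  )) %>%
  bind_rows(.id = "variable") %>%
  mutate(Complete     = Filled / Total,
         Uniqueness   = Cardinality / Total,
         Distinctness = Cardinality / Filled)
```

Working column by column avoids the separate() step, so it keeps working even if a column name already contains an underscore.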
