library(htmlTable)
library(tidyverse)
library(ggmosaic) for "happy" dataset
I want to create a function that creates frequency tables for all the categorical variables in a dataset and then generate htmlTables for each one. However, by using purrr::map, the tables are in a list. How do I generate the tables using htmlTable? Or any better package that generates similar tables for publication? I suppose I need to split the list or use additional purrr::map functions? Help would be appreciated...
Something like this...
FUN<-function(data){
TAB<-happy%>%select_if(is.factor)%>%
map(table)
TABLES<-htmlTable(TAB)
return(TABLES)
}
Here's a solution that uses a tibble to store the arguments to the function as well as the resulting HTML strings:
Edit: added new column (percent)
library(ggmosaic)
library(purrr)
library(tidyverse)
library(htmlTable)
library(magrittr)
library(scales)
data(happy)
# Use a subset of `happy` for the example
h <- happy %>% as_tibble %>% sample_n(100)
# create the function
make_html_table <- function(data, .name, .col_names) {
data %>%
table %>%
as.data.frame %>%
set_colnames(.col_names) %>%
as.data.frame %>%
mutate(percent = scales::percent(count/sum(count))) %>% # add the percent column
htmlTable(caption = .name)
}
# Apply the function and store the results in a tibble
tbl <-
h %>%
select_if(is.factor) %>%
{ tibble(NAME = names(.),
data = map(., ~.x)) } %>%
mutate(TABLE = map2(.x = data,
.y = NAME,
.f = make_html_table,
.col_names = c("levels", "count")))
# Check out the tables in the Viewer Pane (if you're using RStudio)
tbl %>% extract2("TABLE") %>% map(htmlTableWidget)
#> $happy
#>
#> $sex
#>
#> $marital
#>
#> $degree
#>
#> $finrela
#>
#> $health
Here's a screenshot of the one of the tables this creates:
Related
I want to make a bunch of new variables a,b,c,d.....z to store tibble data frames. I will then rbind the new variables that store tibble data frames and export them as a csv. How do I do this faster without having to specify the new variables each time?
a<- subset(data.frame, variable1="condition1",....,) %>% group_by() %>% summarize( a=mean())
b<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
....
z<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
rbind(a,b,....,z)
There's got to be a faster way to do this. My data set is large so having it stored in memory as partitions of a,b,c,....z is causing the computer to crash. Typing the subset conditions to form the partitions repeatedly is tedious.
You could do something like this using purrr package:
You may need to use NSE depends on what's your condition. You can reference Programming with dplyr
purrr::map_df(
c("condition1","condition2",..., "conditionn"),
# .x for each condition
~ subset(your_data_frame, variable1=.x,....,) %>% group_by(some_columns) %>% summarise(a = mean(some_columns))
)
Example using iris:
library(rlang)
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, function(x){
iris %>%
dplyr::filter(!!rlang::parse_expr(x)) %>%
head()
})
Example using iris:
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, ~ iris %>% dplyr::filter(!!rlang::parse_expr(.x)) %>% nrow())
# or (!! is almost equivalent to eval or rlang::eval_tidy())
map(conditions, ~ iris %>% dplyr::filter(eval(rlang::parse_expr(.x))) %>% nrow())
[[1]]
[1] 113
[[2]]
[1] 50
[[3]]
[1] 118
Instead of creating multiple objects in the global environemnt, rread them in a list, and bind it
library(data.table)
files <- list.files(pattern = "\\.csv", full.names = TRUE)
rbindlist(lapply(files, fread))
It would be much faster with fread than in any other option
If we are using strings to be passed onto group_by, convert the string to symbol with sym from rlang and evaluate (!!)
library(purrr)
map2_df(c("condition1", "condition2"), c("a", "b") ~ df1 %>%
group_by(!! rlang::sym(.x)) %>%
summarise(!! .y := mean(colname)))
If the 'condition1', 'condition2' etc are expressions, place it as quosure and evaluate it
map2_df(quos(condition1, condition2), c("a", "b"), ~ df1 %>%
filter(!! .x) %>%
summarise(!! .y := mean(colnames)))
Using a reproducible example
conditions <- quos(Petal.Length>1.5,Species == 'setosa',Sepal.Length > 5)
map2(conditions, c('a', 'b', 'c'), ~
iris %>%
filter(!! .x) %>%
summarise(!! .y := mean(Sepal.Length)))
#[[1]]
# a
#1 6.124779
#[[2]]
# b
#1 5.006
#[[3]]
# c
#1 6.129661
It would be a 3 column dataset if we use map2_dfc
NOTE: It is not clear whether the OP meant 'condition1', 'condition2' as expressions to be passed on for filtering the rows or not.
In this SO Question bootstrapping by several groups and subgroups seemed to be easy using the broom::bootstrap function specifying the by_group argument with TRUE.
My desired output is a nested tibble with n rows where the data column contains the bootstrapped data generated by each bootstrap call (and each group and subgroup has the same amount of cases as in the original data).
In broom I did the following:
# packages
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(rsample)
library(broom)
# some data to bootstrap
set.seed(123)
data <- tibble(
group=rep(c('group1','group2','group3','group4'), 25),
subgroup=rep(c('subgroup1','subgroup2','subgroup3','subgroup4'), 25),
v1=rnorm(100),
v2=rnorm(100)
)
# the actual approach using broom::bootstrap
tibble(id = 1:100) %>%
mutate(data = map(id, ~ {data %>%
group_by(group,subgroup) %>%
broom::bootstrap(100, by_group=TRUE)}))
Since the broom::bootstrap function is deprecated, I rebuild my approach with the desired output using rsample::bootstraps. It seems to be much more complicated to get my desired output. Am I doing something wrong or have things gotten more complicated in the tidyverse when generating grouped bootstraps?
data %>%
dplyr::mutate(group2 = group,
subgroup2 = subgroup) %>%
tidyr::nest(-group2, -subgroup2) %>%
dplyr::mutate(boot = map(data, ~ rsample::bootstraps(., 100))) %>%
pull(boot) %>%
purrr::map(., "splits") %>%
transpose %>%
purrr::map(., ~ purrr::map_dfr(., rsample::analysis)) %>%
tibble(id = 1:length(.), data = .)
I would like to compile an Excel file with multiple tabs labeled by year (2016, 2015, 2014, etc). Each tab has identical data, but column names may be spelled differently from year-to-year.
I would like to standardize columns in each sheet before combining.
This is the generic way of combining using purrr and readxl for such tasks:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map_dfr(read_excel, path = my.file, .id = "sheet")
...however as noted, this creates separate columns for "COLUMN ONE", and "Column One", which have the same data.
Inserting make.names into the pipeline would probably be the best solution.
Keeping it all together would be ideal...something like:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map(read_excel, path = my.file) %>%
map(~(names(.) %>% #<---WRONG
make.names() %>%
str_to_upper() %>%
str_trim() %>%
set_names()) )
..but the syntax is all wrong.
Rather than defining your own function, the clean_names function from the janitor package may be able to help you. It takes a dataframe/tibble as an input and returns a dataframe/tibble with clean names as an output.
Here's an example:
library(tidyverse)
tibble(" a col name" = 1,
"another-col-NAME" = 2,
"yet another name " = 3) %>%
janitor::clean_names()
#> # A tibble: 1 x 3
#> a_col_name another_col_name yet_another_name
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
You can then plop it right into the code you gave:
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map(read_excel, path = my.file) %>% #<Import as list, not dfr
map(janitor::clean_names) %>% #<janitor::clean_names
bind_rows(.id = "sheet")
Creating a new function is doable but is verbose and uses two maps:
# User defined function: col_rename
col_rename <- function(df){
names(df) <- names(df) %>%
str_to_upper() %>%
make.names() %>%
str_trim()
return(df)
}
combined.df <- excel_sheets(my.file) %>%
set_names() %>%
map(read_excel, path = my.file) %>% #<Import as list, not dfr
map(col_rename) %>% #<Fix colnames (user defined function)
bind_rows(.id = "sheet")
I'm applying a function to a nested dataframe using purrr::map to get a new dataframe list column.
Now I want to write each of these new dataframes to file using column values from the same row as part of the filename.
I'm stuck on how to pull the other column values out in order to pass to the filename for writing to file. I'm confident purrr::walk should be involved but the manner of how to access column variables and the list dataframe contents is the problem.
Reprex below:
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
# Data
data("mtcars")
mtcars_nest <- mtcars %>% rownames_to_column() %>% rename(rowname_1 = rowname) %>% select(-mpg) %>% group_by(cyl) %>% nest()
mtcars_mpg <- mtcars %>% rownames_to_column() %>% rename(rowname_2 = rowname) %>% select(rowname_2, mpg)
# Function to apply to nested dataframe
join_df <- function(df_nest, df_other) {
df_all <- inner_join(df_nest, df_other, by = c("rowname_1" = "rowname_2"))
return(df_all)
}
# 1. Apply function to `$data` to get new dataframe list column and add an extra 'case' column for filename
mtcars_nest %>%
mutate(case = c("first", "second", "third")) %>%
mutate(new_mpg = map(data, ~ join_df(., mtcars_mpg)))
# 2. Now write `$new_mpg` to file with filename sources from $cyl and $case
# I think `walk` is the correct to use but how to pass the two row values into filename?
## Not real code##
# mtcars_nest %>%
# walk(., function(x) {write.csv(., file = paste0(cyl, "_", case, ".csv")})
Use pwalk:
... %>%
select(cyl, case, new_mpg) %>%
pwalk(~ write.csv(..3, file = paste0(..1, '_', ..2, '.csv')))
Chain after your code:
mtcars_nest %>%
mutate(case = c("first", "second", "third")) %>%
mutate(new_mpg = map(data, ~ join_df(., mtcars_mpg))) %>%
select(cyl, case, new_mpg) %>%
pwalk(~ write.csv(..3, file = paste0(..1, '_', ..2, '.csv')))
I am trying to build a summary table of a data frame like DataProfile below.
The idea is to transform each column into a row and add variables for count, nulls, not nulls, unique, and add additional mutations of those variables.
It seems like there should be a better faster way to do this. Is there a function that does this?
#trying to write the functions within dplyr & magrittr framework
library(tidyverse)
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
Nulls = nulls$value,
Filled = filled $value,
Complete = Filled/Total,
Cardinality = uniques$value,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt
The summarise_all() function can process more than one function at a time, so you can consolidate code by doing it in one pass then formatting your data to get to the type of "profile" per variable that you want.
library(tidyverse)
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
summarise_all(funs("Total" = n(),
"Nulls" = sum(is.na(.)),
"Filled" = sum(!is.na(.)),
"Cardinality" = length(unique(.)))) %>%
melt() %>%
separate(variable, into = c('variable', 'measure'), sep="_") %>%
spread(measure, value) %>%
mutate(Complete = Filled/Total,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile