I am finally trying to make the jump from for-loops to purrr. The basic examples are intuitive enough, but translating for-loops has been tricky, especially when assigning output objects to lists.
Consider the following example where I am trying to assign the output of a nested for-loop to a named list element.
library(tidyverse)
# output list
loop_list <- list()
# iterate over colors
for (color in diamonds %>% distinct(color) %>% pull()){
# sublist for each color
loop_list[[color]] <- list()
# iterate over cuts
for(cut in diamonds %>% distinct(cut) %>% pull()){
# filter data
data <- diamonds %>%
filter(color == color & cut == cut)
# define output
out <- data %>%
pull(price) %>%
mean()
# assign output to sublist of its color
loop_list[[color]][[cut]] <- out
# clean up filtered data set
rm(data)
}
}
This nested loop assigns the output object to its properly named sublist for each color of the data set. My purrr attempt creates something similar, but without the named sublists. All output objects are assigned to the same list, which is not what I'd ideally like.
grid <- expand_grid(color = diamonds %>% distinct(color) %>% pull(),
cut = diamonds %>% distinct(cut) %>% pull())
myfunc <- function(data, color, cut){
# create output object
out <- data %>%
# filter data
filter(color == color & cut == cut) %>%
pull(price) %>%
mean()
# return output
return(out)
}
purrr_list <- grid %>%
pmap(myfunc, data = diamonds)
Is there a way to arrive at the same output with purrr? I am aware that global assignment with <<- is a possibility, but this is generally discouraged, from what I understand.
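Inside filter(), color == color compares the color column with itself (and likewise for cut), so the condition is always TRUE and nothing is filtered. It is better to change the argument names to avoid a clash with the actual column names.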
myfunc <- function(data, col1, col2){
# filter data
data <- data %>%
filter(color == col1 & cut == col2)
# define output
out <- data %>%
pull(price) %>%
mean()
# return output
return(out)
}
grid %>%
pmap_dbl(~ myfunc(diamonds, ..1, ..2))
#[1] 2597.550 3538.914 3423.644 3214.652 3682.312 4451.970 5946.181 5078.533 5255.880 4685.446 4918.186 6294.592 4574.173 5103.513 4975.655
#[16] 3889.335 5216.707 4276.255 4535.390 5135.683 3374.939 4324.890 3495.750 3778.820 3827.003 3720.706 4500.742 4123.482 3872.754 4239.255
#[31] 2629.095 3631.293 3405.382 3470.467 4291.061
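If renaming the arguments is not an option, the clash can also be sidestepped with the .env pronoun, assuming a dplyr version that provides it (1.0 or later, as far as I know). This is only a minimal sketch, and mean_price is a hypothetical name that does not appear in the question:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Hypothetical helper: the arguments deliberately share names with columns.
mean_price <- function(data, color, cut) {
  data %>%
    # bare `color` is the column; `.env$color` is the function argument
    filter(color == .env$color, cut == .env$cut) %>%
    pull(price) %>%
    mean()
}

d_ideal <- mean_price(diamonds, "D", "Ideal")
d_ideal
```

Renaming the arguments, as above, is still the simpler fix; the pronoun mainly helps when the argument names are part of an existing interface.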
If we need a nested output, split the grid by color and name each sublist by cut:
library(dplyr)
grid %>%
split(.$color) %>%
map(~ pmap(.x, ~ myfunc(diamonds, ..1, ..2)) %>%
setNames(.x$cut))
Output:
#$D
#$D$Ideal
#[1] 2629.095
#$D$Premium
#[1] 3631.293
#$D$Good
#[1] 3405.382
#$D$`Very Good`
#[1] 3470.467
#$D$Fair
#[1] 4291.061
#$E
#$E$Ideal
#[1] 2597.55
#$E$Premium
#[1] 3538.914
#$E$Good
#[1] 3423.644
# ..
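As a possible alternative, the same nested list can be built without a helper function by summarising first and splitting afterwards. This is a sketch that assumes dplyr 1.0 or later (for across() and the .groups argument); the names means and nested are made up for illustration:

```r
library(dplyr)
library(purrr)
library(ggplot2)  # provides the diamonds data set

# One row per color/cut combination with its mean price.
means <- diamonds %>%
  group_by(color, cut) %>%
  summarise(price = mean(price), .groups = "drop") %>%
  # plain character names are easier to use as list names than factors
  mutate(across(c(color, cut), as.character))

# Split by color, then turn each piece into a list named by cut.
nested <- means %>%
  split(.$color) %>%
  map(~ as.list(setNames(.x$price, .x$cut)))

nested$D$Ideal
```

This filters the data once per group instead of once per grid row, which may matter on larger data sets.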
Related
I have a list of data frames with inconsistent but overlapping variables. Some of the shared variables have similar but not identical names. I would like to conditionally rename the variable so that it is consistent across datasets. The way to do this one at a time would be
library(tidyverse)
df_1 <- starwars
df_2 <- starwars %>% rename(haircolor = hair_color)
df_3 <- starwars
df_list <- list(df_1, df_2, df_3)
df_list[[2]] <- df_list[[2]] %>% rename(hair_color = haircolor)
But I would like this to be flexible such that I can just feed in a list of any size and it will rename any variable titled hair_color as haircolor. Is there a way to purrr::map over these in a way that renames conditionally on the variable existing? The most basic interpretation would look something like:
df_list %>%
purrr::map( ~ rename(., hair_color = haircolor))
We can use a select helper such as matches() inside rename_at(), which renames the column only where it exists:
library(dplyr)
library(purrr)
df_list %>%
purrr::map( ~ .x %>%
rename_at(vars(matches('hair_color')), ~ 'haircolor'))
Or use an if/else condition
df_list %>%
purrr::map( ~ if('hair_color' %in% names(.)) {
rename(., haircolor = hair_color)
} else .)
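A third option, assuming dplyr 1.0 or later, is to pass a named lookup vector through any_of(), which silently skips names that are absent. A minimal sketch, where lookup and renamed are hypothetical names:

```r
library(dplyr)
library(purrr)

df_list <- list(
  starwars,
  rename(starwars, haircolor = hair_color),
  starwars
)

# new name = old name; any_of() ignores data frames lacking the old name
lookup <- c(haircolor = "hair_color")

renamed <- map(df_list, ~ rename(.x, any_of(lookup)))

map_lgl(renamed, ~ "haircolor" %in% names(.x))
```

The lookup vector can hold several new/old pairs, so the same call covers more than one inconsistent column.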
I want to make a bunch of new variables a,b,c,d.....z to store tibble data frames. I will then rbind the new variables that store tibble data frames and export them as a csv. How do I do this faster without having to specify the new variables each time?
a<- subset(data.frame, variable1="condition1",....,) %>% group_by() %>% summarize( a=mean())
b<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
....
z<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
rbind(a,b,....,z)
There's got to be a faster way to do this. My data set is large so having it stored in memory as partitions of a,b,c,....z is causing the computer to crash. Typing the subset conditions to form the partitions repeatedly is tedious.
You could do something like this using the purrr package. Depending on what your conditions look like, you may need non-standard evaluation (NSE); see the "Programming with dplyr" vignette for reference.
purrr::map_df(
c("condition1","condition2",..., "conditionn"),
# .x for each condition
~ subset(your_data_frame, variable1 == .x,....,) %>% group_by(some_columns) %>% summarise(a = mean(some_columns))
)
Example using iris:
library(rlang)
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, function(x){
iris %>%
dplyr::filter(!!rlang::parse_expr(x)) %>%
head()
})
Another example using iris, this time counting the matching rows:
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, ~ iris %>% dplyr::filter(!!rlang::parse_expr(.x)) %>% nrow())
# or (!! is almost equivalent to eval or rlang::eval_tidy())
map(conditions, ~ iris %>% dplyr::filter(eval(rlang::parse_expr(.x))) %>% nrow())
[[1]]
[1] 113
[[2]]
[1] 50
[[3]]
[1] 118
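To get closer to the rbind(a, b, ...) result the question asks for, the per-condition summaries can be stacked into one data frame instead of being kept as separate objects. This is a sketch under the assumption that each condition is a filter expression stored as a string; the condition labels and the summaries name are invented for the example:

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical labels on the left, filter expressions on the right.
conditions <- c(
  long_petal = "Petal.Length > 1.5",
  setosa     = "Species == 'setosa'",
  long_sepal = "Sepal.Length > 5"
)

summaries <- conditions %>%
  map(function(cond) {
    iris %>%
      filter(!!parse_expr(cond)) %>%
      summarise(n = n(), mean_sepal_length = mean(Sepal.Length))
  }) %>%
  # stack the one-row summaries; list names become the `condition` column
  bind_rows(.id = "condition")

summaries
```

Only the small summaries are held in memory here, and the combined result can be written out with write.csv() or a similar function.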
Instead of creating multiple objects in the global environment, read them into a list and bind them:
library(data.table)
files <- list.files(pattern = "\\.csv", full.names = TRUE)
rbindlist(lapply(files, fread))
fread is typically much faster than base R readers such as read.csv, especially on large files.
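A self-contained way to try this pattern is to write a couple of small files to a temporary directory first. The file names and the file_id column below are assumptions made only for this sketch:

```r
library(data.table)

# Create a scratch directory with two small CSV files to read back.
tmp <- tempfile("parts_")
dir.create(tmp)
fwrite(head(iris, 3), file.path(tmp, "a.csv"))
fwrite(tail(iris, 3), file.path(tmp, "b.csv"))

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# idcol adds a column recording which list element each row came from
combined <- rbindlist(lapply(files, fread), idcol = "file_id")

nrow(combined)
```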
If we are passing strings to group_by, convert each string to a symbol with sym from rlang and unquote it (!!)
library(purrr)
map2_df(c("condition1", "condition2"), c("a", "b"), ~ df1 %>%
group_by(!! rlang::sym(.x)) %>%
summarise(!! .y := mean(colname)))
If 'condition1', 'condition2', etc. are expressions, capture them as quosures with quos() and unquote them (!!)
map2_df(quos(condition1, condition2), c("a", "b"), ~ df1 %>%
filter(!! .x) %>%
summarise(!! .y := mean(colname)))
Using a reproducible example
conditions <- quos(Petal.Length>1.5,Species == 'setosa',Sepal.Length > 5)
map2(conditions, c('a', 'b', 'c'), ~
iris %>%
filter(!! .x) %>%
summarise(!! .y := mean(Sepal.Length)))
#[[1]]
# a
#1 6.124779
#[[2]]
# b
#1 5.006
#[[3]]
# c
#1 6.129661
The result would be a single data frame with 3 columns if we use map2_dfc instead of map2.
NOTE: It is not clear whether the OP meant 'condition1', 'condition2' as expressions to be passed on for filtering the rows or not.
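For the 3 column variant, a sketch that should behave like map2_dfc is to keep map2 and bind the one-row results column-wise with bind_cols(). The name wide is hypothetical:

```r
library(dplyr)
library(purrr)
library(rlang)

conditions <- quos(Petal.Length > 1.5, Species == "setosa", Sepal.Length > 5)

wide <- map2(conditions, c("a", "b", "c"), ~
    iris %>%
      filter(!!.x) %>%
      summarise(!!.y := mean(Sepal.Length))) %>%
  # each element is a 1 x 1 data frame, so binding columns gives 1 x 3
  bind_cols()

wide
```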
I am trying to apply a custom function to a data.frame row by row, but I can't figure out how to apply the function row by row. I'm trying rowwise() as in the simple artificial example below:
library(tidyverse)
my_fun <- function(df, col_1, col_2){
df[,col_1] + df[,col_2]
}
dff <- data.frame("a" = 1:10, "b" = 1:10)
dff %>%
rowwise() %>%
mutate(res = my_fun(., "a", "b"))
However, the data does not get passed row by row. How can I achieve that?
With dplyr's rowwise(), each row is exposed through .data as a list-like object, so you need to extract values with [[. You also need to use .data rather than ., because . is the entire dff rather than the individual row.
my_fun <- function(df, col_1, col_2){
df[[col_1]] + df[[col_2]]
}
dff %>%
rowwise() %>%
mutate(res = my_fun(.data, 'a', 'b'))
You can see what .data looks like with the code below
dff %>%
rowwise() %>%
do(res = .data) %>%
.[[1]] %>%
head(1)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 1
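If the function can be written to take single values instead of a data frame, the row-wise call becomes simpler, because column names inside mutate() already refer to the current row after rowwise(). A minimal sketch; add_values is a hypothetical replacement for my_fun:

```r
library(dplyr)

dff <- data.frame(a = 1:10, b = 1:10)

# Hypothetical helper that works on single values rather than a data frame.
add_values <- function(x, y) x + y

res <- dff %>%
  rowwise() %>%
  mutate(res = add_values(a, b)) %>%
  ungroup()  # drop the row-wise grouping once it is no longer needed

res
```

For a function that is already vectorised, like this one, plain mutate() without rowwise() gives the same result and is faster.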
I'm trying to create a function that essentially gets me the MODE, or MODE-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I may be missing and I'm looking for some assistance. I believe it has to do with passing a variable into a dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))) that x is actually a whole column of mtcars. To get the column names, apply the function to names(mtcars). But then you also have to specify the data frame you're working on.
To evaluate the symbol returned by rlang::sym(x), you need to put !! in front of it.
rank is not a variable name, so there is no need for rlang::sym here.
table should be mytable in the second-to-last line of your function.
So here is how it could work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
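If we need this as a list with one element per column, we can use imap. Note that slice(seq_len(rank)) keeps the top rank values rather than only the rank-th one.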
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)
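Another way to avoid passing column names around is to operate on each column vector directly, which needs no sym() or !!. This is a sketch with hypothetical names (nth_mode, modes), and ties between equally frequent values are resolved by whatever order count() returns:

```r
library(dplyr)
library(purrr)

# Hypothetical helper: the value with the rank-th highest count in a vector.
nth_mode <- function(x, rank = 1) {
  tibble(variable = x) %>%
    count(variable, sort = TRUE, name = "counts") %>%
    slice(rank)
}

# One row per column of mtcars; the list names become the `column` field.
modes <- map(mtcars, nth_mode, rank = 2) %>%
  bind_rows(.id = "column")

modes
```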
I have a table of values that I want to save as a kable() table. Each row of the table is a variable and each column is a value of that variable (e.g., a mean, minimum, maximum, etc.). You can apply the format() function to columns of a data frame but applying it across rows seems very awkward. I finally achieved my goal with this code, but would be interested if there is a tidier way of doing it!
library(tidyverse)
library(broom)
library(kableExtra)
# test data
all <- tibble(PARAMETER=c("A","B", "C"),
Value1=c(0.0123, 1230, NA),
Value2=c(0.0234, 2340, 1.23),
Units=c("m", "Pa", "ha"),
Description=c("Length", "Pressure", "Area"))
# my formatting function
my_format <- function(x){
y <- format(x, digits=3, scientific=FALSE, trim=TRUE)
y[is.na(x)] <- ""
y
}
# format values by row
all_formatted <- all %>%
`row.names<-`(.$PARAMETER) %>% # set row names for transpose
select(-PARAMETER, -Units, -Description) %>% # only numeric columns
t() %>% # transpose
tidy() %>% # convert to tibble (creates .rownames column)
modify(my_format) %>% # apply format function to each column of values in place
`row.names<-`(.$.rownames) %>% # set row names for transpose
select(-.rownames) %>% # drop rownames column
t() %>% # transpose
tidy() %>% # convert to tibble (creates .rownames column)
select(-.rownames) %>% # drop rownames
add_column(PARAMETER=all$PARAMETER, .before=1) %>% # add back nonnumeric columns
add_column(UNITS=all$Units,
DESCRIPTION=all$Description)
# print formatted table
all_formatted %>%
kable() %>%
kable_styling(
bootstrap_options = c("condensed", "striped", "hover"),
full_width=FALSE, position="left", font_size=12) %>%
save_kable(file="temp.html", self_contained=TRUE) # very slow
Not sure how attached you are to kable, but flextable is one of my favorite things and it usually renders to HTML and .doc the same way. The trick is to do it by column, not row... just specify the characteristics of the numbers you want to format differently:
library(flextable)
all %>% regulartable() %>% align(align="center",part="all") %>%
set_formatter(Value1 = function(x)
ifelse(x > 1, sprintf("%.0f", x), sprintf("%.03f", x) ),
Value2 = function(x)
ifelse(x > 1, sprintf("%.0f", x), sprintf("%.03f", x) ))
Hope this helps :)
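If kable is still preferred, one possibly tidier route is to format every value on its own, so that no transposing is needed at all. This is only a sketch: format_each is a hypothetical helper, and it rounds to 3 significant digits with signif(), which may not match the original my_format output in every case:

```r
library(dplyr)

# test data from the question
all <- tibble(PARAMETER = c("A", "B", "C"),
              Value1 = c(0.0123, 1230, NA),
              Value2 = c(0.0234, 2340, 1.23),
              Units = c("m", "Pa", "ha"),
              Description = c("Length", "Pressure", "Area"))

# Hypothetical helper: format each value separately, blank out NAs.
format_each <- function(x, digits = 3) {
  out <- vapply(
    x,
    function(v) format(signif(v, digits), scientific = FALSE),
    character(1)
  )
  out[is.na(x)] <- ""
  unname(out)
}

all_formatted <- all %>%
  mutate(across(c(Value1, Value2), format_each))

all_formatted
```

The result can then be piped into kable() and kable_styling() exactly as in the question.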