I have a list of tibbles. I'm trying to filter on a column common to all tibbles, and then remove any tibbles that end up with zero rows (but are not technically empty since they have columns). It seems like purrr::compact() is intended for this purpose, but I don't think I've got it quite right. Is there a better solution?
require(tidyverse)
#> Loading required package: tidyverse
mylst <- lst(cars1 = cars %>% as_tibble(), cars2 = cars %>% as_tibble() %>% mutate(speed = speed + 100))
#This produces a list with zero-row tibble elements:
mylst %>% map(function(x) filter(x, speed == 125))
#> $cars1
#> # A tibble: 0 x 2
#> # ... with 2 variables: speed <dbl>, dist <dbl>
#>
#> $cars2
#> # A tibble: 1 x 2
#> speed dist
#> <dbl> <dbl>
#> 1 125. 85.
#This results in the same thing:
mylst %>% map(function(x) filter(x, speed == 125)) %>% compact()
#> $cars1
#> # A tibble: 0 x 2
#> # ... with 2 variables: speed <dbl>, dist <dbl>
#>
#> $cars2
#> # A tibble: 1 x 2
#> speed dist
#> <dbl> <dbl>
#> 1 125. 85.
#Putting compact inside the map function reduces $cars1 to 0x0, but it's still there:
mylst %>% map(function(x) filter(x, speed == 125) %>% compact())
#> $cars1
#> # A tibble: 0 x 0
#>
#> $cars2
#> # A tibble: 1 x 2
#> speed dist
#> <dbl> <dbl>
#> 1 125. 85.
#This finally drops the empty element, but seems clumsy.
mylst %>% map(function(x) filter(x, speed == 125) %>% compact()) %>% compact()
#> $cars2
#> # A tibble: 1 x 2
#> speed dist
#> <dbl> <dbl>
#> 1 125. 85.
Created on 2018-04-06 by the reprex package (v0.2.0).
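You are trying to use compact, but it only drops elements that are NULL or have length zero. A zero-row tibble still has its columns, so its length is not zero and compact keeps it. To filter out zero-row elements, you can use discard: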
mylst %>%
map(function(x) filter(x, speed == 125)) %>%
discard(function(x) nrow(x) == 0)
#> $cars2
#> # A tibble: 1 x 2
#> speed dist
#> <dbl> <dbl>
#> 1 125. 85.
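To see why compact() leaves the zero-row tibble in place, it can help to compare length() and nrow() on the filtered list. The following is only a sketch: it rebuilds the question's list with as_tibble(), and the names filtered and non_empty are made up for illustration. It also shows keep(), which is the complement of discard().

```r
library(dplyr)
library(purrr)
library(tibble)

mylst <- list(
  cars1 = as_tibble(cars),
  cars2 = as_tibble(cars) %>% mutate(speed = speed + 100)
)

filtered <- map(mylst, ~ filter(.x, speed == 125))

# compact() drops elements whose length is 0. The length of a tibble is its
# number of columns, so a 0-row, 2-column tibble survives.
map_int(filtered, length)
map_int(filtered, nrow)

# keep() retains the elements for which the predicate is TRUE,
# here the tibbles with at least one row
non_empty <- keep(filtered, ~ nrow(.x) > 0)
names(non_empty)
```

Whether discard() or keep() reads better is mostly a matter of taste; both take the same kind of predicate.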
I am trying to run a for loop in which I randomly subsample a dataset using sample_n. I also want to name each subsampled data frame "df1", "df2", "df3", where the number corresponds to i in the for loop. I know the code below is wrong, and I understand why I am getting the error. How can I combine "df" and i inside the loop so that the names read df1, df2, etc.? Happy to clarify if needed. Thanks!
for (i in 1:9){ print(get(paste("df", i, sep=""))) = sub %>%
group_by(dietAandB) %>%
sample_n(1) }
Error in print(get(paste("df", i, sep = ""))) = sub %>% group_by(dietAandB) %>% :
target of assignment expands to non-language object
Instead of using get you could use assign.
Using some fake example data:
library(dplyr, warn.conflicts = FALSE)
sub <- data.frame(
dietAandB = LETTERS[1:2]
)
for (i in 1:2) {
assign(paste0("df", i), sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
}
df1
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
df2
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
But the more R-ish way to do this would be to use a list instead of creating single objects:
df <- list(); for (i in 1:2) { df[[i]] = sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup() }
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
Or, more concisely, use lapply instead of a for loop:
df <- lapply(1:2, function(x) sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
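If the df1, df2, ... names matter, one option is to keep the list and name its elements. This is a sketch under a few assumptions: n_samples and samples are illustrative names, and slice_sample() is used as the newer dplyr equivalent of sample_n().

```r
library(dplyr)

sub <- data.frame(dietAandB = LETTERS[1:2])

n_samples <- 3

# One subsample per iteration, collected in a list
samples <- lapply(seq_len(n_samples), function(i) {
  sub %>%
    group_by(dietAandB) %>%
    slice_sample(n = 1) %>%
    ungroup()
})

# Name the elements df1, df2, ... instead of creating separate objects
names(samples) <- paste0("df", seq_len(n_samples))

names(samples)
samples$df2
```

Each subsample is then available as samples$df1, samples[["df2"]], and so on, and the whole collection can still be passed to lapply or purrr::map.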
It depends on the sample size, which is missing from your question. As an example, I used the mtcars dataset (32 rows) and drew three subsamples of size 20 from it:
library(dplyr)
for (i in 1:3) {
assign(paste0("df", i), sample_n(mtcars, 20))
}
Using the {{var}} notation, the following code works.
The variables used for grouping and for summarizing can be given as parameters to my_summary.
I would like to modify my_summary so that I can pass a varying number of variables for both grouping and summarizing. Can this be done?
suppressPackageStartupMessages({
library(tidyverse)
})
set.seed(4321)
demo_df <-
tibble(age=as.integer(rep(c(10,20),each=10)),
gender=rep(c("f","m"),10),
weight=rnorm(20,70,7),
size=rnorm(20,160,15))
my_summary <- function(df_in,group_var,summary_var){
df_in |>
group_by({{group_var}}) |>
summarise_at(vars({{summary_var}}),mean)
}
my_summary(demo_df,gender,weight)
Another possible solution, allowing for multiple grouping variables:
library(tidyverse)
my_summary <- function(df_in, group_var,summary_var){
df_in %>%
group_by(!!!group_var) %>%
summarise(across({{summary_var}}, mean), .groups = "drop")
}
my_summary(demo_df, vars(age,gender), c(weight,size))
#> # A tibble: 4 × 4
#> age gender weight size
#> <int> <chr> <dbl> <dbl>
#> 1 10 f 71.5 159.
#> 2 10 m 72.4 158.
#> 3 20 f 64.3 167.
#> 4 20 m 71.6 164.
Alternatively, without vars() (which is superseded in current dplyr):
library(tidyverse)
my_summary <- function(df_in, summary_var , ...){
summary_var <- enquos(summary_var)
group_var <- enquos(...)
df_in %>%
group_by(!!!group_var) %>%
summarise(across(!!!summary_var,mean), .groups = "drop")
}
my_summary(demo_df, c(weight, size), age, gender)
#> # A tibble: 4 × 4
#> age gender weight size
#> <int> <chr> <dbl> <dbl>
#> 1 10 f 71.5 159.
#> 2 10 m 72.4 158.
#> 3 20 f 64.3 167.
#> 4 20 m 71.6 164.
Use summarise(across(.)).
suppressPackageStartupMessages({
library(tidyverse)
})
set.seed(4321)
demo_df <-
tibble(age=as.integer(rep(c(10,20),each=10)),
gender=rep(c("f","m"),10),
weight=rnorm(20,70,7),
size=rnorm(20,160,15))
my_summary <- function(df_in,group_var,summary_var){
df_in |>
group_by({{group_var}}) |>
summarise(across({{summary_var}}, mean))
}
my_summary(demo_df, gender, weight:size)
#> # A tibble: 2 × 3
#> gender weight size
#> <chr> <dbl> <dbl>
#> 1 f 67.9 163.
#> 2 m 72.0 161.
Created on 2022-06-09 by the reprex package (v2.0.1)
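When the column names are already available as strings (for example, read from a config file), a variant that takes character vectors may be convenient. This is a sketch, not one of the answers above: my_summary_chr is a hypothetical name, and it relies on across() together with all_of() for both grouping and summarizing.

```r
library(dplyr)
library(tibble)

set.seed(4321)
demo_df <- tibble(
  age = as.integer(rep(c(10, 20), each = 10)),
  gender = rep(c("f", "m"), 10),
  weight = rnorm(20, 70, 7),
  size = rnorm(20, 160, 15)
)

# group_vars and summary_vars are character vectors of column names
my_summary_chr <- function(df_in, group_vars, summary_vars) {
  df_in %>%
    group_by(across(all_of(group_vars))) %>%
    summarise(across(all_of(summary_vars), mean), .groups = "drop")
}

res <- my_summary_chr(demo_df, c("age", "gender"), c("weight", "size"))
res
```

all_of() raises an error if a requested column is missing, which is usually preferable to silently summarizing fewer columns.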
I have a dataframe (df) like this. I have 82 SKUs started from M1 to M82.
SKU date sales
M1 2-jan 4
M2 2-jan 5
M1 3-jan 8
M82 3-jan 1
...
M82 31-dec 9
I want to filter each SKU separately, then group_by(date) and summarise(sales_perday = sum(sales)).
Something like this
for(i in SKU){
SKU_M[i] <- df %>% filter(SKU == SKU_M[i]) %>% group_by(date)
%>% summarise(sales_perday = sum(sales))
Expected output are 82 dataframes with each SKU in 1 dataframe.
I did this below for 1 SKU but i want it for all 82 in an easy way.
M50 <- df %>% filter(SKU == 'M50') %>% group_by(date) %>% summarise(sales_perday = sum(sales))
You probably want to group by multiple columns:
library(tidyverse)
data <- tribble(
~SKU, ~date, ~sales,
"M1", "2-jan",4,
"M2", "2-jan",5,
"M1", "3-jan",8
)
# the concise way
data %>%
group_by(SKU, date) %>%
summarise(sales_perday = sum(sales))
#> `summarise()` has grouped output by 'SKU'. You can override using the `.groups`
#> argument.
#> # A tibble: 3 × 3
#> # Groups: SKU [2]
#> SKU date sales_perday
#> <chr> <chr> <dbl>
#> 1 M1 2-jan 4
#> 2 M1 3-jan 8
#> 3 M2 2-jan 5
# if one really wants to have multiple data frames
data %>%
group_by(SKU, date) %>%
summarise(sales_perday = sum(sales)) %>%
nest(data = -SKU) %>%
pull(data)
#> `summarise()` has grouped output by 'SKU'. You can override using the `.groups`
#> argument.
#> [[1]]
#> # A tibble: 2 × 2
#> date sales_perday
#> <chr> <dbl>
#> 1 2-jan 4
#> 2 3-jan 8
#>
#> [[2]]
#> # A tibble: 1 × 2
#> date sales_perday
#> <chr> <dbl>
#> 1 2-jan 5
Created on 2022-06-08 by the reprex package (v2.0.0)
Another option with split:
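# group by SKU as well as date, so that the SKU column survives summarise()
df_summary <- df |>
group_by(SKU, date) |>
summarise(sales_perday = sum(sales), .groups = "drop")
# split() returns a list named by SKU
split(df_summary, df_summary$SKU)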
If you really do want separate data frames, then after grouping by SKU and date, and then summarizing, use group_split() to partition by SKU.
library(tidyverse)
df <- tribble(
~SKU, ~date, ~sales,
"M1", "2-jan",4,
"M2", "2-jan",5,
"M1", "3-jan",8
)
df |>
group_by(SKU, date) |>
summarise(sales_perday = sum(sales)) |>
group_split()
#> `summarise()` has grouped output by 'SKU'. You can override using the `.groups`
#> argument.
#> <list_of<
#> tbl_df<
#> SKU : character
#> date : character
#> sales_perday: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 2 × 3
#> SKU date sales_perday
#> <chr> <chr> <dbl>
#> 1 M1 2-jan 4
#> 2 M1 3-jan 8
#>
#> [[2]]
#> # A tibble: 1 × 3
#> SKU date sales_perday
#> <chr> <chr> <dbl>
#> 1 M2 2-jan 5
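Note that group_split() returns an unnamed list. If the elements should be addressable by SKU, one possible approach is to take the names from group_keys(). This is a sketch; per_day and by_sku are illustrative names.

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~SKU, ~date,   ~sales,
  "M1", "2-jan", 4,
  "M2", "2-jan", 5,
  "M1", "3-jan", 8
)

per_day <- df |>
  group_by(SKU, date) |>
  summarise(sales_perday = sum(sales), .groups = "drop") |>
  group_by(SKU)

# group_keys() has one row per group, in the same order as group_split()
by_sku <- setNames(group_split(per_day), group_keys(per_day)$SKU)

names(by_sku)
by_sku[["M1"]]
```

With 82 SKUs this gives a single list of 82 tibbles, so by_sku[["M50"]] plays the role of the hand-written M50 object in the question.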
I'm using the following script to make a table in R:
library(dplyr)
library(tidyr)
get_probability <- function(parameter_array, threshold) {
return(round(100 * sum(parameter_array >= threshold) /
length(parameter_array)))
}
thresholds = c(75, 100, 125)
mtcars %>% group_by(gear) %>%
dplyr::summarise(
low=get_probability(disp, thresholds[[1]]),
medium=get_probability(disp, thresholds[[2]]),
high=get_probability(disp, thresholds[[3]]),
)
The table that comes out is the following:
# A tibble: 3 x 4
gear low medium high
<dbl> <dbl> <dbl> <dbl>
1 3 100 100 93
2 4 92 67 50
3 5 100 80 60
My question is, how can I condense what I have passed to summarise into a single line? That is, is there a way to iterate over the thresholds vector while also passing custom variable names?
In recent versions of dplyr, summarise will auto-splice data.frames created within it into new columns. So, you just need a way to iterate over thresholds to create a data.frame.
One option is purrr::map_dfc.
library(dplyr, warn.conflicts = FALSE)
get_probability <- function(parameter_array, threshold) {
return(round(100 * sum(parameter_array >= threshold) /
length(parameter_array)))
}
thresholds = c(75, 100, 125)
thresholds <- setNames(thresholds, c('low', 'medium', 'high'))
mtcars %>%
group_by(gear) %>%
summarise(purrr::map_dfc(thresholds, ~ get_probability(disp, .x)))
#> # A tibble: 3 × 4
#> gear low medium high
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 100 100 93
#> 2 4 92 67 50
#> 3 5 100 80 60
If you prefer not to use an extra package, you could just use lapply and convert the output to a data.frame. (The \(x) shorthand requires R 4.1 or later; replace it with function(x) in older versions.)
mtcars %>%
group_by(gear) %>%
summarise(as.data.frame(lapply(thresholds, \(x) get_probability(disp, x))))
#> # A tibble: 3 × 4
#> gear low medium high
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 100 100 93
#> 2 4 92 67 50
#> 3 5 100 80 60
Created on 2021-08-17 by the reprex package (v2.0.1)
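Since map_dfc() is marked as superseded as of purrr 1.0.0, a similar result can be sketched with map() followed by as_tibble(). The idea is the same as above: build a one-row data frame per group and let summarise() splice it into columns. The name res is only for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

get_probability <- function(parameter_array, threshold) {
  round(100 * sum(parameter_array >= threshold) / length(parameter_array))
}

# A named vector, so the names become the output column names
thresholds <- c(low = 75, medium = 100, high = 125)

res <- mtcars %>%
  group_by(gear) %>%
  # map() returns a named list of length-1 values; as_tibble() turns it
  # into a one-row tibble that summarise() unpacks into columns
  summarise(as_tibble(map(thresholds, \(thr) get_probability(disp, thr))))

res
```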
I think this question is related to Using map_dfr and .id for list names and list of list names but not identical ...
I often use map_dfr for a case where I want to use the value of each argument, not its name, as the .id variable. Here's a silly example: I am computing the mean of mtcars$mpg raised to the second, fourth, and sixth power:
library(tidyverse)
list(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
## name x
## <chr> <dbl>
## 1 1 439.
## 2 2 262350.
## 3 3 198039783.
I would like the name variable to be 2, 4, 6 instead of 1, 2, 3. I can hack this by including setNames(.data) in the pipeline:
list(2,4,6) %>%
setNames(.data) %>%
map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
but I wonder if there is a more idiomatic approach I'm missing?
As for the suggestion of using something like ~ tibble(name=., ...): nice, but slightly less convenient for the case where the mapping function already returns a tibble, because we have to add an otherwise unnecessary tibble() call:
list(2, 4, 6) %>%
map_dfr(~ tibble(name=.,
broom::tidy(lm(mpg~cyl, data=mtcars, offset=rep(., nrow(mtcars))))))
OK, I think I found this shortly before posting (so I'll answer). This answer points out that tibble::lst() is a self-naming list function, so as long as we use tibble::lst(2,4,6) instead of list(2,4,6), it Just Works, e.g.
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
This can work too:
library(tidyverse)
## Ben Bolker's answer
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="power")
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
list(2, 4, 6) %>% map_df(~ tibble(power = as.character(.x) , x = mean(mtcars$mpg^.)))
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
#another option
seq(2, 6, 2) %>% map2_df(rerun(length(.), mtcars$mpg), ~ c(x = as.character(.x), mean = round(mean(.y^.x), 0)))
#> # A tibble: 3 x 2
#> x mean
#> <chr> <chr>
#> 1 2 439
#> 2 4 262350
#> 3 6 198039783
Created on 2021-06-06 by the reprex package (v2.0.0)
This is also possible, although it would not have been my first choice, since a plain map would suffice:
library(purrr)
list(2, 4, 6) %>%
pmap_dfr(~ tibble(power = c(...), x = map_dbl(c(...), ~ mean(mtcars$mpg ^ .x))))
# A tibble: 3 x 2
power x
<dbl> <dbl>
1 2 439.
2 4 262350.
3 6 198039783.
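With purrr 1.0.0 or later, where map_dfr() is marked as superseded, the same idea can be sketched with set_names(), map() and list_rbind(). This assumes a plain numeric vector of powers instead of a list; powers and res are illustrative names.

```r
library(purrr)
library(tibble)

powers <- c(2, 4, 6)

res <- powers |>
  set_names() |>                            # name each element after its own value
  map(\(p) tibble(x = mean(mtcars$mpg^p))) |>
  list_rbind(names_to = "power")            # the list names become the "power" column

res
```

As with .id in map_dfr(), the power column comes back as character, so it needs as.numeric() if it is used in later arithmetic.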