I'm dealing with a big dataframe that has a number of columns I want to group by. I'd like to do something like this:
output <- df %>%
group_by(starts_with("GEN", ignore.case=TRUE),x,y) %>%
summarize(total=n()) %>%
arrange(desc(total))
Is there a way to do this? Maybe with group_by_at or some other similar function?
To use starts_with() in group_by(), you need to wrap it in across(). Here is an example using some built-in data.
library(dplyr)
mtcars %>%
group_by(across(starts_with("c"))) %>%
summarize(total = n()) %>%
arrange(-total)
# A tibble: 9 x 3
# Groups: cyl [3]
cyl carb total
<dbl> <dbl> <int>
1 4 2 6
2 8 4 6
3 4 1 5
4 6 4 4
5 8 2 4
6 8 3 3
7 6 1 2
8 6 6 1
9 8 8 1
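Applied to the shape of the original question, the same idea might look like the sketch below. The data frame and its column names (Gender, genre, x, y) are made up for illustration, since the asker's data was not shown; it assumes dplyr 1.0.0 or later, where across() is available.

```r
library(dplyr)

# Hypothetical data: the column names are assumptions based on the question.
df <- tibble(
  Gender = c("F", "M", "F", "F"),
  genre  = c("a", "a", "a", "b"),
  x      = c(1, 1, 1, 2),
  y      = c(0, 0, 0, 0),
  value  = 1:4
)

output <- df %>%
  # c() combines the tidyselect helper with plain column names inside across()
  group_by(across(c(starts_with("GEN", ignore.case = TRUE), x, y))) %>%
  summarize(total = n(), .groups = "drop") %>%
  arrange(desc(total))

output
```

Putting all of the grouping columns inside a single across(c(...)) keeps the selection in one place; mixing across() with bare column names as separate group_by() arguments should also work.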
Yes, there is. You could use the group_by_at function (superseded by across() since dplyr 1.0.0, but still available):
mtcars %>% group_by_at(vars(starts_with("c"), gear))
Group by all columns whose name starts with "c" and by the column gear
Output
# A tibble: 32 x 11
# Groups: cyl, carb, gear [12]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
I'm trying to use dplyr::mutate to change a dynamic column with conditions using other columns dynamically.
I've got this bit of code:
d <- mtcars %>% tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
d <- d %>% mutate(!!fld_name := ifelse(!!other_fld_name < 5,NA,!!fld_name))
which sets mpg to
mpg
<chr>
1 mpg
2 mpg
3 mpg
4 mpg
5 mpg
6 mpg
7 mpg
8 mpg
9 mpg
10 mpg
It seems to select the field on the LHS of the assignment operator, but just pastes the field name on the RHS.
Removing the unquotes on the RHS yields the same result.
Any help is much appreciated.
Use get() to retrieve the column values instead:
library(tidyverse)
d <- mtcars %>% tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
d %>% mutate(!!fld_name := ifelse(get(other_fld_name) < 5 ,NA, get(fld_name)))
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 NA 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 NA 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 NA 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Created on 2021-06-22 by the reprex package (v2.0.0)
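As for why the original attempt fills the column with "mpg": unquoting a string with !! inserts the string itself, not the column it names. A small sketch of what the right-hand side effectively turns into (the object name e is just for illustration):

```r
fld_name <- "mpg"
other_fld_name <- "cyl"

# Build the expression the same way mutate() would see it after unquoting
e <- rlang::expr(ifelse(!!other_fld_name < 5, NA, !!fld_name))
print(e)

# The test compares the string "cyl" with 5 (coerced to "5"), which is FALSE,
# so ifelse() returns the string "mpg" and mutate() recycles it to every row.
result <- eval(e)
result
```

That is why the new column is a character vector of repeated names rather than numbers.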
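We can also use the ensym() function to turn a variable name stored as a string into a symbol and unquote it with !!. Note that ensym() is mainly intended for capturing function arguments, and the code also needs dplyr for %>%, tibble and mutate:
library(dplyr)
library(rlang)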
d <- mtcars %>% tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
d %>%
mutate(!!ensym(fld_name) := ifelse(!!ensym(other_fld_name) < 5, NA, !!ensym(fld_name)))
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 NA 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 NA 4 147. 62 3.69 3.19 20 1 0 4 2
9 NA 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows
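Outside of a function, rlang::sym() is probably the more conventional way to turn a string into a symbol. A minimal sketch of the same mutate written with sym(), under the same setup as above (the name res is only for illustration):

```r
library(dplyr)
library(rlang)

d <- as_tibble(mtcars)
fld_name <- "mpg"
other_fld_name <- "cyl"

res <- d %>%
  # sym() converts each string to a symbol; !! injects it as a column reference
  mutate(!!fld_name := ifelse(!!sym(other_fld_name) < 5, NA, !!sym(fld_name)))

res
```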
We could also use the .data pronoun, which looks up columns by name:
library(dplyr)
d %>%
mutate(!! fld_name := case_when(.data[[other_fld_name]] >=5 ~
.data[[fld_name]]))
Output:
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 NA 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 NA 4 147. 62 3.69 3.19 20 1 0 4 2
9 NA 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
Data:
d <- mtcars %>%
as_tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
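If this needs to be reused, the .data approach can be wrapped in a small function. The helper below is hypothetical (its name na_below and its arguments are not from the original answer) and assumes the target column is numeric and that dplyr 1.0.0 or later is installed, so that the "{target}" := naming syntax is available.

```r
library(dplyr)

# Hypothetical helper: the name and arguments are illustrative only.
# Sets `target` to NA wherever `cond` is below `threshold`.
na_below <- function(data, target, cond, threshold) {
  data %>%
    mutate("{target}" := if_else(.data[[cond]] < threshold,
                                 NA_real_,
                                 .data[[target]]))
}

res <- na_below(as_tibble(mtcars), "mpg", "cyl", 5)
res
```

Here if_else() with NA_real_ is used instead of ifelse() so that the result type is checked; that is a design choice, not something the original answers require.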
I know this question has answers in multiple places, but I am unable to figure out where I am going wrong. Suppose I want to find the sum of hp for each group in cyl:
mtcars%>%
group_by(cyl) %>%
mutate(
sum_hp = sum(hp)
)
sum_hp is giving me 4694 for every value. I want the sum for each value of cyl.
This could be a case of plyr::mutate masking dplyr::mutate, which happens when plyr is loaded after dplyr. Calling the function with an explicit namespace, dplyr::mutate, corrects this:
library(dplyr)
mtcars%>%
group_by(cyl) %>%
dplyr::mutate(sum_hp = sum(hp))
# A tibble: 32 x 12
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb sum_hp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 856
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 856
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 909
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 856
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 2929
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 856
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 2929
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 909
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 909
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 856
# … with 22 more rows
If we use plyr::mutate instead, the OP's output can be reproduced:
mtcars%>%
group_by(cyl) %>%
plyr::mutate(
sum_hp = sum(hp)
)
# A tibble: 32 x 12
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb sum_hp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 4694
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 4694
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4694
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 4694
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 4694
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 4694
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 4694
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 4694
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 4694
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 4694
# … with 22 more rows
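One way to check which package a bare mutate call resolves to in the current session is to look at the environment of the function. The sketch below loads only dplyr, so it should report "dplyr"; with plyr attached afterwards, the same check would be expected to report "plyr".

```r
library(dplyr)

# Which package does a bare `mutate` resolve to right now?
owner <- environmentName(environment(mutate))
owner

# Namespacing the call sidesteps the masking problem entirely
res <- mtcars %>%
  group_by(cyl) %>%
  dplyr::mutate(sum_hp = sum(hp))

# One row per group, to confirm the sums differ by cyl
distinct(res, cyl, sum_hp)
```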
I am trying to write a custom function that uses rlang's non-standard evaluation to group a dataframe by more than one variable.
This is what I have so far:
library(rlang)
# function definition
tryfn <- function(data, groups, ...) {
# preparing data
df <- dplyr::group_by(data, !!!rlang::enquos(groups))
print(head(df))
# applying some function `.f` on df that absorbs `...`
# .f(df, ...)
}
This works with a single grouping variable:
# works
tryfn(mtcars, am)
#> # A tibble: 6 x 11
#> # Groups: am [2]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
But if I try to use more than one grouping variable, it doesn't work:
# doesn't work
tryfn(mtcars, c(am, cyl))
#> Error: Column `c(am, cyl)` must be length 32 (the number of rows) or one, not 64
# doesn't work
tryfn(mtcars, list(am, cyl))
#> Error: Column `list(am, cyl)` must be length 32 (the number of rows) or one, not 2
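We could capture the argument as an expression with enexpr(), convert it to a list, and splice it into group_by() with !!!. When the argument is a call such as c(am, cyl), the first element of the list is the function name c itself, so it is dropped:
tryfn <- function(data, groups, ...) {
  # capture the unevaluated argument, e.g. `am` or `c(am, cyl)`
  groups <- as.list(rlang::enexpr(groups))
  # for a call like c(am, cyl), drop the leading `c`
  groups <- if (length(groups) > 1) groups[-1] else groups
  dplyr::group_by(data, !!!groups)
}
Testing: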
tryfn(mtcars, am)
# A tibble: 32 x 11
# Groups: am [2]
# mpg cyl disp hp drat wt qsec vs am gear carb
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
tryfn(mtcars, c(am, cyl))
# A tibble: 32 x 11
# Groups: am, cyl [6]
# mpg cyl disp hp drat wt qsec vs am gear carb
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
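With dplyr 1.0.0 or later, another option that may be simpler is to embrace the argument inside across(), which accepts both a single column and a c() of columns. This is a sketch; the function name tryfn2 is made up here to keep it distinct from the version above.

```r
library(dplyr)

# Hypothetical variant of the function above, using tidyselect via across()
tryfn2 <- function(data, groups, ...) {
  # {{ groups }} forwards the caller's selection (am, or c(am, cyl)) to across()
  dplyr::group_by(data, dplyr::across({{ groups }}))
}

one_var <- group_vars(tryfn2(mtcars, am))
two_var <- group_vars(tryfn2(mtcars, c(am, cyl)))
one_var
two_var
```

Because the selection goes through tidyselect, helpers such as starts_with() should also work as the groups argument.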
So basically I want to turn a numeric income variable into an ordinal income variable where the cut-off points for the categories are decided so that each category ends up with the same N (or 1 less for one of the categories if it's an odd number N, to begin with).
Does anyone know how I can do this in R?
Here's an example using mtcars.
I'd suggest you use the ntile function that splits your variable into groups with the same number of cases.
Assume that the variable of interest is disp:
library(dplyr)
mtcars %>%
group_by(g = ntile(disp, 3)) %>% # split variable into 3 groups
mutate(g_range = paste0(min(disp), "-", max(disp))) %>% # create the ranges
ungroup() -> df
Your updated data (df) will look like this:
# # A tibble: 32 x 13
# mpg cyl disp hp drat wt qsec vs am gear carb g g_range
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 146.7-301
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2 146.7-301
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1 71.1-145
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 2 146.7-301
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 3 304-472
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 2 146.7-301
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 3 304-472
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 2 146.7-301
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 1 71.1-145
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 2 146.7-301
# # ... with 22 more rows
You can check the number of cases within each group:
df %>% count(g, g_range)
# # A tibble: 3 x 3
# g g_range n
# <int> <chr> <int>
# 1 1 71.1-145 11
# 2 2 146.7-301 11
# 3 3 304-472 10
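Since the question asks for an ordinal variable, the group number from ntile can also be converted to an ordered factor. A minimal sketch, where the labels "low", "mid" and "high" and the column name disp_cat are arbitrary choices for illustration:

```r
library(dplyr)

df_ord <- mtcars %>%
  mutate(
    g = ntile(disp, 3),                       # three groups of (nearly) equal size
    disp_cat = factor(g,
                      levels = 1:3,
                      labels = c("low", "mid", "high"),
                      ordered = TRUE)         # ordinal: low < mid < high
  )

table(df_ord$disp_cat)
```

One thing to keep in mind: ntile assigns groups by rank position, so when the variable has tied values, identical values can end up in different groups.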
I have a number of dataframes and a series of changes I want to make to each of them. For this example, let the desired change be simply making each data frame a tibble using as_tibble. I know there are various ways of doing this, but I'd like to do this using purrr::walk.
For data frames df1 and df2,
df1 <- mtcars
df2 <- mtcars
I'd like to do the equivalent of
df1 %<>% as_tibble
df2 %<>% as_tibble
using walk. My attempt:
library(tidyverse)
walk(c(df1, df2), ~ assign(deparse(substitute(.)), as_tibble(.)))
This runs but does not make the desired change:
is_tibble(df1)
#> [1] FALSE
Here is how you can combine assign with walk (see the comments in the code for more explanation):
library(tidyverse)
# data
df1 <- mtcars
df2 <- mtcars
# creating tibbles
# this creates a list of objects with names ("df1", "df2")
tibble::lst(df1, df2) %>%
purrr::walk2(
.x = names(.), # names to assign
.y = ., # object to be assigned
.f = ~ assign(x = .x,
value = tibble::as_tibble(.y),
envir = .GlobalEnv)
)
# checking the newly created tibbles
df1
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
df2
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Created on 2018-11-13 by the reprex package (v0.2.1)
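If writing back into the global environment is not a hard requirement, keeping the data frames in a named list and using map is an alternative that avoids assign altogether. This is a sketch of that approach rather than a walk-based solution; the list name dfs is arbitrary.

```r
library(purrr)
library(tibble)

df1 <- mtcars
df2 <- mtcars

# list() keeps each data frame as one element; c(df1, df2) would instead
# concatenate their columns into a single list of 22 vectors, which is
# one reason the original walk() attempt did not behave as intended.
dfs <- list(df1 = df1, df2 = df2)
dfs <- map(dfs, as_tibble)

# Optionally copy the converted objects back over the originals
list2env(dfs, envir = globalenv())

is_tibble(df1)
```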