R - dplyr Summarize and Retain Other Columns - r

I am grouping data and then summarizing it, but would also like to retain another column. I do not need to do any evaluations of that column's content as it will always be the same as the group_by column. I can add it to the group_by statement but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks
TDAAtest <- data.frame(State=sample(state.abb,1000,replace=TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State,state.abb)]
TDAA.states <- TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
summarize(n=n()) %>%
ungroup() %>%
arrange(State)

Perhaps we need
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
summarise(State.Full.Name = first(State.Full.Name), n = n())
Or use mutate to create the column and then do the distinct
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
mutate(n= n()) %>%
distinct(State, .keep_all=TRUE)
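If the extra column is constant within each group, as State.Full.Name is here, grouping by both columns gives the same counts. A minimal sketch using dplyr's count(), which groups by its arguments and tallies the rows; the set.seed() call and the result name TDAA.states2 are only there for illustration:

```r
library(dplyr)

set.seed(1)  # illustrative only, makes the random sample reproducible
TDAAtest <- data.frame(State = sample(state.abb, 1000, replace = TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State, state.abb)]

# count() groups by the listed columns, counts the rows in each group,
# and returns an ungrouped data frame with the count in column n
TDAA.states2 <- TDAAtest %>%
  filter(!is.na(State)) %>%
  count(State, State.Full.Name)

head(TDAA.states2)
```

This is equivalent to adding State.Full.Name to group_by(), so it only makes sense when the extra column never varies within a State.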

To retain all columns, you can include across() as a summarize argument, as explained in the documentation for dplyr::do().
by_cyl <- head(mtcars) %>%
group_by(cyl)
by_cyl %>%
summarise(m_mpg = mean(mpg), across())
cyl m_mpg mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 22.8 22.8 108 93 3.85 2.32 18.6 1 1 4 1
2 6 20.4 21 160 110 3.9 2.62 16.5 0 1 4 4
3 6 20.4 21 160 110 3.9 2.88 17.0 0 1 4 4
4 6 20.4 21.4 258 110 3.08 3.22 19.4 1 0 3 1
5 6 20.4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
6 8 18.7 18.7 360 175 3.15 3.44 17.0 0 0 3 2
To retain only a subset of unaltered columns, you can select them within across using tidyselect semantics.
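As a hedged illustration of that subset selection: the sketch below assumes dplyr 1.1.0 or later, where summarise() results with more than one row per group are deprecated in favour of reframe(), and pick() is the suggested replacement for across() without a function. On older versions the equivalent call would be summarise(m_mpg = mean(mpg), across(c(disp, hp))).

```r
library(dplyr)

by_cyl <- head(mtcars) %>%
  group_by(cyl)

# Keep disp and hp unaltered next to the per-group mean of mpg.
# reframe() allows several rows per group and returns an ungrouped tibble.
kept <- by_cyl %>%
  reframe(m_mpg = mean(mpg), pick(disp, hp))

kept
```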

I believe there are more general answers than the accepted one, especially when the other columns are not constant within each group (e.g. when you want the max, the min, or the top n rows based on one particular column).
The accepted answer works for this question, but suppose, for instance, that you want to find the county with the maximum population in each state. (You need to have county and population columns.)
We have the following options:
1. dplyr version
Following a linked answer, you need three extra operations (mutate, ungroup and filter) to achieve that:
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
mutate(maxPopulation = max(Population)) %>%
ungroup() %>%
filter(maxPopulation == Population)
2. Function version
This one gives you as much flexibility as you want and you can apply any kind of operation to each group:
maxFUN = function(x) {
# order population in a descending order
x = x[with(x, order(-Population)), ]
x[1, ]
}
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
do(maxFUN(.))
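This one is highly recommended for more complex operations. For instance, you can return the top n (topN) counties per state by returning x[1:topN, ] from maxFUN (note the comma: x[1:topN] would select columns, not rows). head(x, topN) does the same and also copes with groups that have fewer than topN rows.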
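For the top-n case, dplyr 1.0.0 and later also provide slice_max(), which works per group. A minimal sketch on a small made-up table (the counties data frame and its values are hypothetical, standing in for the State, County and Population columns assumed above):

```r
library(dplyr)

# Hypothetical data: two states with three counties each
counties <- data.frame(
  State      = c("AA", "AA", "AA", "BB", "BB", "BB"),
  County     = c("a1", "a2", "a3", "b1", "b2", "b3"),
  Population = c(100, 300, 200, 50, 10, 70)
)

topN <- 2

top_counties <- counties %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  # keep the topN rows with the largest Population in each group;
  # with_ties = FALSE guarantees at most topN rows per group
  slice_max(Population, n = topN, with_ties = FALSE) %>%
  ungroup()

top_counties
```

All other columns of the selected rows are retained, so no separate join or mutate step is needed.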

Related

Take the latest year value for nested object

I have a nested object in which the names of individual vehicles sit in the inner nest. This is not my dataset, but I can reproduce the error with mtcars. Essentially, I am trying to grab the manufacturing_size for the latest year in which it is anything but Not Provided, and use only this value for manufacturing_size. However, for whatever reason, the map/function does not enter all nests.
dataset:
mtcars <- mtcars %>% rownames_to_column()
emp <- c("Not Provided","Less than 250","250 to 499","500 to 999","1000 to 4999","5000 to 19,999")
mtcars$manufacturing_size <- c(rep(emp, 5) , "Not Provided", "Less than 250")
mtcars$year <- rep(2018:2021, 8)
mtcars1 <- mtcars
mtcars2 <- mtcars
mtcars3 <- mtcars
mtcars1$year <- rep(c(2019:2021, 2018), 8)
mtcars2$year <- rep(c(2020:2021, 2018, 2019), 8)
mtcars3$year <- rep(c(2021:2018), 8)
mtcarsAll <- rbind(mtcars, mtcars1, mtcars2, mtcars3)
Here is what I have tried:
mtcars %>% nest_by(gear) %>% ungroup %>% mutate(data = map(data, ~ .x %>% nest(data=rowname) %>%
mutate(data = map(data, function(x){
someSize <- x[x$year == x[which.max(x$year),]$year,]$manufacturing_size
if(someSize != 'Not Provided'){
x$manufacturing_size = someSize
return(x)
}else {
for(i in 1:nrow(x)){
if(x$year[i] != 2018){
someSize <- x[x$year == x[which.max(x$year)-i,]$year,]$manufacturing_size
if(someSize != 'Not Provided'){
x$manufacturing_size = someSize
return(x)
}
} else{
someSize <- x[x$year == x[which.max(x$year)+i,]$year,]$manufacturing_size
if(someSize != 'Not Provided'){
x$manufacturing_size = someSize
return(x)
}
}
}
}
}
))))
Which produces the following error:
Error in `mutate()`:
! Problem while computing `data = map(...)`.
Caused by error in `mutate()`:
! Problem while computing `data = map(...)`.
Caused by error in `vectbl_as_row_location()`:
! Must subset rows with a valid subscript vector.
ℹ Logical subscripts must match the size of the indexed input.
✖ Input has size 1 but subscript `x$year == x[which.max(x$year)]$year` has size 0.
If I remove most of the function and just print out someSize, it enters the first outer nest but not the others. What is an easier alternative?
Using the answer below, the following works:
mtr <- mtcarsAll %>% group_by(rowname) %>%
mutate(
man_size = case_when(
manufacturing_size != "Not Provided" & max(year) == year~ manufacturing_size
)
)
mtr %>% ungroup %>%
fill(man_size, .direction = "updown")
Does this do what you want? There is a lot of nesting in your example which, unless I am mistaken, isn't necessary.
I've altered your setup a little because I don't think what you wanted was going to work:
used mtcars2 so as not to overwrite mtcars,
replaced rep(emp, 5) with random draws from a standard normal distribution (rnorm(30)) because you didn't define emp,
added a new grouping variable, group, so that each year appears only once per group. (Grouping by gear, as you had it, didn't work because there were multiple values for the most recent year.)
mtcars2 <- mtcars %>% rownames_to_column("make")
mtcars2$manufacturing_size <- c(rnorm(30),"Not Provided", "Less than 250")
mtcars2$group <- rep(LETTERS[1:8], each = 4)
mtcars2$year <- rep(2018:2021, 8)
Then, rather than all the complex nesting, you can just use an if_else statement or, as I prefer, case_when to get the values you are interested in for the new variable man_size.
mtcars2 %>%
group_by(group) %>%
mutate(
man_size = case_when(
manufacturing_size != "Not Provided" & max(year) == year ~ manufacturing_size,
TRUE ~ NA_character_
)
)
# A tibble: 32 × 16
# Groups: group [8]
make mpg cyl disp hp drat wt qsec vs am gear carb manufacturing_size group year man_size
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <int> <chr>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 -0.10777645987017 A 2018 NA
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4 0.685034939673918 A 2019 NA
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0.0216291773402855 A 2020 NA
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 0.227610843395319 A 2021 0.2276108433953…
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0.342964251360947 B 2018 NA
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 1.20792448510301 B 2019 NA
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 0.395983818669596 B 2020 NA
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 -0.42502805147035 B 2021 -0.425028051470…
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 0.961054295375392 C 2018 NA
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 -1.32030765978216 C 2019 NA
# … with 22 more rows
If you then want to fill in those NAs with what you need you can just use tidyr::fill
Hope this helps.
EDIT after change from OP in comments.
OK, I see what you want now. Thanks for providing emp. I still made one more tiny change to your setup, to ensure there is a case where Not Provided is the value of manufacturing_size for the maximum year in one of the groups (group H).
mtcars2 <- mtcars %>% rownames_to_column()
emp <- c("Not Provided","Less than 250","250 to 499","500 to 999","1000 to 4999","5000 to 19,999")
mtcars2$manufacturing_size <- c(rep(emp, 5) , "Less than 250", "Not Provided")
mtcars2$group <- rep(LETTERS[1:8], each = 4)
mtcars2$year <- rep(2018:2021, 8)
We can then use the following:
mtcars3 <- mtcars2 %>%
group_by(group) %>%
mutate(
man_size = case_when(
max(year[manufacturing_size != "Not Provided"]) == year ~ manufacturing_size,
TRUE ~ NA_character_
)
)
Then if you want to fill in all the values, you can do:
mtcars3 %>%
fill(man_size, .direction = "updown")
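One caveat with the max(year[...]) expression: if every row of a group is Not Provided, max() is called on an empty vector, which returns -Inf with a warning. A sketch of a small helper that returns NA in that case and otherwise assigns the value directly, without a separate fill() step; latest_known() and the toy data are illustrative names and values, not part of the original answer:

```r
library(dplyr)

# Return the size recorded for the most recent year that has a real value,
# or NA when the whole group is "Not Provided"
latest_known <- function(size, year, missing = "Not Provided") {
  ok <- size != missing
  if (!any(ok)) return(NA_character_)
  size[ok][which.max(year[ok])]
}

toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "C"),
  year  = c(2019, 2020, 2021, 2020, 2021, 2021),
  manufacturing_size = c("250 to 499", "500 to 999", "Not Provided",
                         "Less than 250", "1000 to 4999", "Not Provided"),
  stringsAsFactors = FALSE
)

toy_filled <- toy %>%
  group_by(group) %>%
  # the helper returns one value, which mutate() recycles across the group
  mutate(man_size = latest_known(manufacturing_size, year)) %>%
  ungroup()

toy_filled
```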

Programming with `{data.table}`: how to name a new column?

The following question seems very basic in programming with data.table, so my apologies if it's a duplicate. I spent time researching but could not find an answer.
I want to create a "user-defined function" that wraps around a data.table wrangling procedure. In this procedure, a new column is created, and I want to let the user set the name of that new column.
Example
Consider the following code that works as-is. I want to wrap it inside a function.
library(data.table)
library(magrittr)
library(tibble)
mtcars %>%
as.data.table() %>%
.[, .(max_mpg = max(mpg)), by = cyl] %>%
as_tibble()
#> # A tibble: 3 x 2
#> cyl max_mpg
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
Created on 2021-10-13 by the reprex package (v0.3.0)
All I want my function to do is let the user set the name of new_colname_of_choice:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
#> # A tibble: 3 x 2
#> cyl new_colname_of_choice <---------- why this isn't called "my_lovely_colname"?
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
I've tried using curly braces which didn't work either (actually threw an error):
my_wrapper_2 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .({new_colname_of_choice} = max(mpg)), by = cyl] %>%
as_tibble()
}
Error: unexpected '=' in:
" as.data.table() %>%
.[, .({new_colname_of_choice} ="
Which is surprising because curly braces do promote the desired naming ability, but in a different (yet similar) kind of code:
my_wrapper_3 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, {new_colname_of_choice} := max(mpg), by = cyl] %>%
as_tibble()
}
my_wrapper_3(new_colname_of_choice = "my_lovely_colname")
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb my_lovely_colname <---- SUCCESS!
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21.4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21.4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 33.9
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 19.2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 21.4
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 19.2
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 33.9
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 33.9
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 21.4
## # ... with 22 more rows
Bottom line
My conclusion is that the = operator is sensitive to {...} on the LHS. How can I otherwise pass a name (from argument) to the LHS in the initial my_wrapper() example?
EDIT
I'd like to add the dplyr solution for the same problem, taken from the programming with dplyr vignette:
library(dplyr)
my_wrapper_dplyr <- function(new_colname_of_choice) {
mtcars %>%
group_by(cyl) %>%
summarise("{new_colname_of_choice}" := max(mpg))
}
my_wrapper_dplyr("another_lovely_colname")
Which is pretty robust and works in all naming situations I've encountered. Is there a built-in/canonical practice in data.table similar to {dplyr}'s?
With the upcoming data.table version 1.14.3, you'll be able to use the new env parameter:
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.
# install dev version
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repos = NULL, type = "source")
library(tibble)
library(data.table)
my_wrapper_new <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl,
env=list(new_colname_of_choice = new_colname_of_choice)] %>%
as_tibble()
}
my_wrapper_new('test')
# A tibble: 3 x 2
cyl test
<dbl> <dbl>
1 6 21.4
2 4 33.9
3 8 19.2
One thing you can do is separate the creation of the column and the naming of the column like so:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(tempcol = max(mpg)), by = cyl] %>%
setnames(., "tempcol", new_colname_of_choice) %>%
as_tibble()
}
my_wrapper("my_lovely_colname")
Using this method you can use either .(tempcol = max(mpg)) or tempcol := max(mpg)
Using setNames from stats:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, setNames(list(max(mpg)), new_colname_of_choice), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
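On why the attempts in the question behave differently: inside .( ), new_colname_of_choice = ... is ordinary argument naming, so R takes the name literally and its parser rejects an expression such as {new_colname_of_choice} on the left of =. With :=, data.table evaluates the left-hand side when it is not a bare symbol, and wrapping the variable in parentheses is the commonly documented form. A minimal sketch of that form; the function name add_group_max() is made up for illustration:

```r
library(data.table)

add_group_max <- function(new_colname_of_choice) {
  dt <- as.data.table(mtcars)   # a copy, so mtcars itself is not modified
  # (name) := value assigns to the column whose name is stored in `name`
  dt[, (new_colname_of_choice) := max(mpg), by = cyl]
  dt[]
}

res <- add_group_max("my_lovely_colname")
head(res)
```

Like my_wrapper_3 in the question, this adds a column to all 32 rows rather than returning one row per group; for the aggregated form, the setnames() and setNames() answers above apply.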

R - keep random rows per group, but different numbers per group

The function sample_n() from the dplyr package lets you randomly keep a specific number of rows. Combined with group_by(), you can for instance keep 2 observations per group:
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(2)
# A tibble: 4 x 2
# Groups: vs [2]
vs drat
<dbl> <dbl>
1 0 3.07
2 0 3.9
3 1 4.22
4 1 3.08
Question: is there an easy way to select a different number of observations per group? For instance, if I want to keep 2 observations for the first group, and 3 for the second one. If I give a vector to the function sample_n(), it only uses the first value (result is the same as above).
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(c(2,3))
Thanks in advance.
Create a list-column holding each group with group_nest(), add a column with the number of samples you want from each group, then map these two columns to the sample_n() function:
library(tidyverse)
mtcars %>%
select(vs, drat) %>%
group_nest(vs, keep= TRUE) %>%
add_column(mysamples = c(2,3)) %>%
mutate(sampled = map2(data , mysamples, ~ sample_n(.x, .y))) %>%
.$sampled %>%
bind_rows()
# A tibble: 5 x 2
vs drat
<dbl> <dbl>
1 0 3.15
2 0 4.22
3 1 3.7
4 1 4.93
5 1 3.08
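A variation on the same idea, sketched under the assumption of dplyr 1.0.0 or later (where slice_sample() supersedes sample_n()): keep the per-group sizes in a named vector and look them up inside group_modify(). The vector n_per_group is a hypothetical name, keyed by the values of vs:

```r
library(dplyr)

# Hypothetical lookup: 2 rows for vs == 0, 3 rows for vs == 1
n_per_group <- c("0" = 2, "1" = 3)

sampled <- mtcars %>%
  select(vs, drat) %>%
  group_by(vs) %>%
  # .x holds the rows of the current group, .y is a one-row tibble with its key
  group_modify(~ slice_sample(.x, n = n_per_group[[as.character(.y$vs)]])) %>%
  ungroup()

table(sampled$vs)
```

Keying the sizes by group value avoids relying on the order in which the groups happen to appear.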

complex column selection in dplyr group_by

I would like to use, within a group_by call, dplyr's column selectors like starts_with(), ends_with(), matches(), ..., or even the syntax -colName.
(Silly) example of the syntax I am after:
library("dplyr")
# I would like to do something like this
mtcars %>%
group_by(matches("a")) %>%
summarise(mpg=mean(mpg))
# but I get a "wrong result size" error
I was hoping it would work, by analogy with:
mtcars %>% select(matches("a"))
which here would select columns drat, am, gear, carb
To be crystal clear: I want to use matches("a") (or equivalent) to achieve the same output as:
mtcars %>%
group_by(drat, am, gear, carb) %>%
summarise(mpg=mean(mpg))
I am only interested in answers using dplyr. Thanks!
The current answer, while good, only allows selecting columns with a regex.
I am still looking for a more global answer that would allow the use of the full range of dplyr's selection syntax. Of course I can massage any regex to select what I want, but I wish I had something which integrates nicer with dplyr (especially to use the -colName syntax). I am going to leave this opened for a while.
Here is one way to build your own group_at(), which I don't think exists, using matches() together with the standard-evaluation group_by_() function:
mtcars %>%
group_by_(.dots = names(mtcars)[matches("a", vars = names(mtcars))]) %>%
summarise(mpg = mean(mpg))
#Source: local data frame [26 x 5]
#Groups: drat, am, gear [?]
# drat am gear carb mpg
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2.76 0 3 1 18.10
#2 2.76 0 3 2 15.50
#3 2.93 0 3 4 10.40
#4 3.00 0 3 4 10.40
#5 3.07 0 3 3 16.30
#6 3.08 0 3 1 21.40
#7 3.08 0 3 2 19.20
#8 3.15 0 3 2 16.95
#9 3.21 0 3 4 14.30
#10 3.23 0 3 4 14.70
# ... with 16 more rows
Or equivalently, just use grep:
mtcars %>%
group_by_(.dots = grep('a', names(mtcars), value = TRUE)) %>%
summarise(mpg=mean(mpg))
group_by_at was added to dplyr some time in 2017 and does just that.
mtcars %>%
group_by_at(vars(matches("a"))) %>%
summarise(mpg=mean(mpg))
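In dplyr 1.0.0 and later, the scoped variants such as group_by_at() are superseded, and tidyselect helpers can be used inside group_by() through across(). A sketch, assuming such a version is installed; it checks the result against the explicit grouping the question asked for:

```r
library(dplyr)

# Group by every column whose name matches "a" (drat, am, gear, carb)
by_helper <- mtcars %>%
  group_by(across(matches("a"))) %>%
  summarise(mpg = mean(mpg), .groups = "drop")

# The explicit version from the question, for comparison
by_name <- mtcars %>%
  group_by(drat, am, gear, carb) %>%
  summarise(mpg = mean(mpg), .groups = "drop")

all.equal(by_helper, by_name)
```

The same pattern accepts the rest of the selection syntax, for example group_by(across(-mpg)) to group by every column except mpg.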

Using a data_frame as an argument into a mutate and group_by routine

I have this data_frame (db) here with lots of columns:
A B C D ... ZZ
1 .23 .21 ... .23
2 .45 .12 ... .23
1 .47 ... .53
2 .49 ... .27
I want to employ group_by and mutate with a function which gets a complete data_frame and returns a vector.
function1 <- function(data_frame) {
...
return(vector)
}
db %>%
group_by(A) %>%
mutate(results = function1(.))
This is not working. It returns the results of using the function with the whole data_frame, not with the groups.
I know I could solve it using for, but I'm looking for a dplyr solution. The function necessarily gets a data_frame, I'm not passing columns separately as arguments.
dplyr
My trick has been to use bind_cols. By itself it won't honor any groups, so you need to nest it within a do block, such as:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(bind_cols(., {
# "insert complex stuff here"
data_frame(results = apply(., 1, mean))
}))
# Source: local data frame [32 x 12]
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb results
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 23.59818
# 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 24.63455
# 3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 27.23364
# # ... with 29 more rows
One benefit of this approach is that the code in the block can return one or more columns without further complication.
So, using your code, it would look something like:
db %>%
group_by(A) %>%
do(bind_cols(., data_frame(results = function1(.))))
tidyr
Another option is to use tidyr (there is an RStudio blog post on this; though a little out of date, it is still useful).
library(tidyr) # nest, unnest
library(purrr) # map
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(results = map(data, ~ apply(., 1, mean))) %>%
unnest(cols = c(data, results))
Your code might be something like (untested):
db %>%
group_by(A) %>%
nest() %>%
mutate(results = purrr::map(data, ~ function1(.))) %>%
unnest(cols = c(data, results))
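With a dplyr version that provides group_modify(), the same thing can be written without do() or nesting. A minimal sketch: function1() here is a hypothetical stand-in (row means) for the real function, which is assumed to take a data frame and return one value per row:

```r
library(dplyr)

# Hypothetical stand-in for function1(): data frame in, one value per row out
function1 <- function(df) rowMeans(df)

res <- mtcars %>%
  group_by(cyl) %>%
  # .x is the data frame for one group, without the grouping column
  group_modify(~ mutate(.x, results = function1(.x))) %>%
  ungroup()

head(res)
```

Note that the function sees each group's data without the grouping column; if function1() needs that column too, it would have to be added back inside the lambda.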
