Can I group in a loop in the tidyverse?
The bigger task is to replace a grouping variable with NA if there are few observations in the group. I want to consolidate small groups into an NA group.
However, the code below won't let me group_by(x) where x is the looping variable.
library(tidyverse)
for (x in c("cyl", "gear")) {
mtcars %>%
add_count(x) %>%
mutate(x = ifelse(n() < 10, NA, x))
}
I receive the following error.
Error in grouped_df_impl(data, unname(vars), drop) :
Column `x` is unknown
Do you mean something like this?
library(dplyr)
for (x in c("cyl", "gear")) {
col <- sym(x)
mtcars <- mtcars %>%
add_count(!!col) %>%
mutate(!!col := ifelse(n < 10, NA, !!col)) %>%
select(-n)
}
mtcars
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 NA 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 NA 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 NA 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 NA 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 NA 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Created on 2018-12-08 by the reprex package (v0.2.1)
(Not the easiest syntax, I know....)
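With dplyr 1.0 or later, the same loop can probably be written without sym() and !!, using across(all_of()) for the grouping and the .data pronoun for column access. The sketch below is only an illustration: the helper name na_small_groups, its min_n argument, and the temporary is_small column are made up here and are not part of any package.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): replace values of the
# column named `col` with NA when that value occurs fewer than `min_n` times.
na_small_groups <- function(df, col, min_n = 10) {
  df %>%
    group_by(across(all_of(col))) %>%
    mutate(is_small = n() < min_n) %>%   # flag rows that sit in a small group
    ungroup() %>%                        # grouping columns cannot be modified
    mutate("{col}" := ifelse(is_small, NA, .data[[col]])) %>%
    select(-is_small)
}

result <- mtcars
for (x in c("cyl", "gear")) {
  result <- na_small_groups(result, x)
}
```

Wrapping the step in a function keeps the loop body short and leaves the original mtcars untouched.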
You could also use mutate_at with table
library(tidyverse)
mtcars %>%
mutate_at(vars(cyl, gear), ~ {
t <- table(.)
ifelse(. %in% names(t[t < 10]), NA, .)})
The function can be simplified to one line with purrr::keep
mtcars %>%
mutate_at(vars(cyl, gear),
~ ifelse(. %in% names(keep(table(.), `<`, 10)), NA, .))
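Since mutate_at() is superseded as of dplyr 1.0, here is a sketch of the same idea with across(). The intermediate names counts and rare are only illustrative.

```r
library(dplyr)

# Same logic as the mutate_at() version, written with across().
result <- mtcars %>%
  mutate(across(c(cyl, gear), ~ {
    counts <- table(.x)                   # occurrences of each value
    rare <- names(counts[counts < 10])    # values seen fewer than 10 times
    ifelse(.x %in% rare, NA, .x)
  }))
```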
Or, if you are working with a data.table, you can use an "update join": join against the groups with low counts, then assign NA to the matching rows.
library(data.table)
dt <- as.data.table(mtcars)
for(x in c('cyl', 'gear'))
dt[dt[, .N, x][N < 10], on = x, (x) := NA]
This will achieve the same result
all.equal(
dt,
mtcars %>%
mutate_at(vars(cyl, gear),
~ ifelse(. %in% names(keep(table(.), `<`, 10)), NA, .)) %>%
setDT
)
# [1] TRUE
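To make the update join easier to follow, here is a sketch that splits it into two steps for a single column. The intermediate table name small is an assumption introduced for illustration.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Step 1: count rows per cyl value and keep only the small groups.
small <- dt[, .N, by = cyl][N < 10]

# Step 2: join dt to `small` on cyl and overwrite cyl in the matching rows.
dt[small, on = "cyl", cyl := NA]
```

The loop in the answer does the same thing for each column name in turn, with (x) := NA standing in for the hard-coded column.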
Here is a reprex:
library(crypto2)
library(dplyr)
coins = crypto_list(only_active = TRUE)
coins = coins[(coins$symbol %in% c("BTC","ETH")),]
thirteen.months.data = crypto_history(coins, start_date=Sys.Date() - (13 * 30))
mydf <- thirteen.months.data[substr(thirteen.months.data$timestamp, 1, 10) %in% as.character(Sys.Date() - c(1, 31, 366)), ] %>%
  select(timestamp, name, close, market_cap) %>%
  arrange(name, timestamp) %>%
  as.data.frame()
# Present
df1 <- mydf %>% group_by(name) %>% slice(3) %>% select(-1)
# M-o-M growth
df2 <- mydf %>% group_by(name) %>% summarise(m.o.m = (close[3]-close[2])/close[2]*100)
# Y-o-Y growth
df3 <- mydf %>% group_by(name) %>% summarise(y.o.y = (close[3]-close[1])/close[1]*100)
I have 2 queries regarding the above program.
Will the group_by which is done after the arrange mess up the ordering which has been done using arrange?
Will the m.o.m ( month on month ) / y.o.y ( year on year ) work as expected? In other words if I do close[2] after group by, will it use the second element in each group? Is this way of indexing allowed?
No. Calling group_by() does not change the order of the rows. Looking at how the grouping is stored shows that the group index is built from row positions in the frame as it stands.
X <- data.frame(id=1:3, grp=c(4,6,4))
group_by(X, grp) %>%
attr("groups") %>%
str()
# tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
# $ grp : num [1:2] 4 6
# $ .rows: list<int> [1:2]
# ..$ : int [1:2] 1 3
# ..$ : int 2
# ..# ptype: int(0)
# - attr(*, ".drop")= logi TRUE
The groups attribute of the grouped frame is not normally shown raw, though its content informs the printing of it, # Groups: grp [2]. In this example, the first element of .rows is c(1, 3), indicating that the first group consists of rows 1 and 3.
From this, one can understand that the grouping is handled by an internal structure that keeps track of rows in whatever order they may have been. (With some more effort, one can see that if you reorder the rows, the groups/.rows attribute adjusts.)
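A small sketch of that last point, using dplyr's group_rows() to read the row indices directly. The variable names are only illustrative.

```r
library(dplyr)

X <- data.frame(id = 1:3, grp = c(4, 6, 4))

# Grouping the frame as-is: group 4 owns rows 1 and 3, group 6 owns row 2.
rows_original <- group_rows(group_by(X, grp))

# Sorting by grp first moves both grp == 4 rows to the top, and the row
# indices stored for each group follow the new order.
sorted <- X %>% arrange(grp) %>% group_by(grp)
rows_sorted <- group_rows(sorted)
```

In both cases group_by() leaves the rows where they were; only the stored indices differ.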
Yes, [-indexing works as expected. Using another example,
mtcars %>%
mutate(disp2a = disp[2]) %>%
group_by(cyl) %>%
mutate(disp2b = disp[2]) %>%
ungroup()
# # A tibble: 32 × 13
# mpg cyl disp hp drat wt qsec vs am gear carb disp2a disp2b
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 160 160
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 160 160
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 160 147.
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 160 160
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 160 360
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 160 160
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 160 360
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 160 147.
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 160 147.
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 160 160
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
Note that disp2a (the second element of disp with no grouping) is 160 for all rows, and disp2b (the second element of disp within each group) shows variability between groups (invariability within each group).
As @MartinGal suggested, though, the nth helper function can be useful here as well:
mtcars %>%
mutate(disp2a = nth(disp, 2)) %>%
group_by(cyl) %>%
mutate(disp2b = nth(disp, 2)) %>%
ungroup()
Its arguments give effectively the same functionality we get with [: n= is the index; order_by=mpg can be mimicked with disp[order(mpg)][2] (with n=2); and default= lets one change what is returned when the index is out of range (R's default behavior is to return NA):
(1:3)[4]
# [1] NA
nth(1:3, 4)
# [1] NA
nth(1:3, 4, default = Inf)
# [1] Inf
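Applied to the growth calculation in the question, order_by= removes the dependence on a prior arrange(). The sketch below uses made-up prices (the names, days, and closing values are invented for illustration) that are deliberately not sorted within each group.

```r
library(dplyr)

# Made-up prices, deliberately not sorted by day within each name.
prices <- data.frame(
  name  = c("A", "A", "A", "B", "B", "B"),
  day   = c(3, 1, 2, 2, 3, 1),
  close = c(120, 100, 110, 55, 60, 50)
)

growth <- prices %>%
  group_by(name) %>%
  summarise(
    latest   = nth(close, 3, order_by = day),  # close on the last day
    previous = nth(close, 2, order_by = day),  # close on the day before
    m.o.m    = (latest - previous) / previous * 100
  )
```

With close[3] and close[2] instead, the result would depend on the row order of prices.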
R noob here, working in tidyverse / RStudio.
I have a categorical / factor variable that I'd like to retain in a group_by/summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.
Is there a summary function I can use for this?
mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.
Edit: example using subset of mtcars dataset:
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 1 with carb=1; similarly among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.
So if I do:
data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))
where FUNC is the function I'm looking for, I should get:
cyl carb
<dbl> <fct>
4 2
6 4
8 2 # there are multiple potential ways of handling multi-modal situations, but that's secondary here
Hope that makes sense!
You could use fmode() from the collapse package to calculate the mode. Here is a reproducible example using the mtcars dataset, where cyl is the factor variable to group on:
library(dplyr)
library(collapse)
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
group_by(cyl) %>%
summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#> cyl mode
#> <fct> <dbl>
#> 1 4 1
#> 2 6 0
#> 3 8 0
Created on 2022-11-24 with reprex v2.0.2
We could use which.max after count:
library(dplyr)
# fake dataset
x <- mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(cyl)
x %>%
count(cyl) %>%
slice(which.max(n))
cyl n
<fct> <int>
1 8 14
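You can use table to count and which.max to find the most frequent value. Note that which.max(table(carb)) returns a position in the table, not in carb, so take the name of that element rather than indexing carb with it. For a factor column, drop as.numeric() or rebuild the factor from the returned name.
library(tidyverse)
mtcars |>
  group_by(cyl) |>
  summarise(modalcarb = as.numeric(names(which.max(table(carb)))))
#> # A tibble: 3 x 2
#> cyl modalcarb
#> <dbl> <dbl>
#> 1 4 2
#> 2 6 4
#> 3 8 4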
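If the result should stay a factor, one option is a small helper built on unique() and tabulate(). This is a sketch: stat_mode is a hypothetical name rather than a function from any package, and ties go to whichever value appears first.

```r
library(dplyr)

# Hypothetical helper: most frequent value of x; ties go to the value that
# appears first in x. Indexing unique(x) keeps the type, including factors.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

result <- mtcars %>%
  mutate(carb = factor(carb)) %>%
  group_by(cyl) %>%
  summarise(modalcarb = stat_mode(carb))
```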
library(dplyr)
data(mtcars)
mtcars$FACTORA = sample(c("A", "b"), r=T)
mtcars$FACTORB=sample("c","e")
DATA = mtcars %>%
group_by(FACTORA, FACTORB) %>%
slice(which.min(wt)) &
group_by(FACTORA) %>%
slice(which.min(wt))
I wish to keep rows that MINIMIZE wt by qsec and gear and also keep rows that minimize wt just by qsec all in one data.
or do i have to do this
DATA = mtcars %>%
group_by(FACTORA,FACTORB) %>%
slice(which.min(wt))
DATADATA = mtcars %>%
group_by(FACTORA) %>%
slice(which.min(wt))
and then do merge?
I think this is what you mean (replacing qsec with cyl, which is categorical). Note that with this set of groupings keep2 is redundant, since any row that minimizes wt within a cyl group also minimizes wt within its cyl/gear group.
Also, this returns only one minimum per group and drops ties, though since you use which.min above I assume that isn't important.
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
arrange(wt) %>%
mutate(keep1 = row_number() == 1L) %>%
group_by(cyl) %>%
arrange(wt) %>%
mutate(keep2 = row_number() == 1L) %>%
filter(keep1 | keep2)
#> # A tibble: 8 × 13
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb keep1 keep2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 TRUE TRUE
#> 2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 TRUE FALSE
#> 3 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 TRUE FALSE
#> 4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 TRUE TRUE
#> 5 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 TRUE FALSE
#> 6 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 TRUE TRUE
#> 7 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 TRUE FALSE
#> 8 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 TRUE FALSE
Created on 2022-04-29 by the reprex package (v2.0.1)
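The "two results, then combine" route from the question can also be sketched with slice_min() and bind_rows(). This assumes dplyr 1.0 or later; the names mt, min_by_cyl_gear, and min_by_cyl are made up for illustration.

```r
library(dplyr)

# Keep car names as a column so rows stay identifiable after grouping.
mt <- tibble::rownames_to_column(mtcars, "model")

min_by_cyl_gear <- mt %>%
  group_by(cyl, gear) %>%
  slice_min(wt, n = 1, with_ties = FALSE) %>%
  ungroup()

min_by_cyl <- mt %>%
  group_by(cyl) %>%
  slice_min(wt, n = 1, with_ties = FALSE) %>%
  ungroup()

# Stack both results and drop rows that were selected twice.
result <- bind_rows(min_by_cyl_gear, min_by_cyl) %>% distinct()
```

For groupings that are not nested inside one another, the second slice would contribute rows that the first one misses.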
Using mtcars dataset, as an example. I would like to:
group table based on the number of cylinders
within each group test whether any car has miles per gallon higher than 25 ( mpg > 25)
for only those groups that have at least one car with mpg > 25, I would like to remove the cars that have mpg < 20
The expected output is cars that belong to a cylinder group with at least one other car having mpg > 25, and that themselves have mpg < 20 are removed from dataset
PS: I can think of several ways to address this problem, but I wanted to see if someone can come up with straightforward and elegant solution, e.g.
library(dplyr)
xx <- split(mtcars, f = mtcars$cyl)
for (i in seq_along(xx)) {
  # any() must wrap the whole comparison; filter() needs the data and a condition
  if (any(xx[[i]]$mpg > 25)) xx[[i]] <- filter(xx[[i]], mpg > 20)
}
xx <- bind_rows(xx)
Maybe this ?
library(dplyr)
mtcars %>%
group_by(cyl) %>%
filter(if(any(mpg > 25)) mpg > 20 else TRUE) %>%
ungroup
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
From the groups that have at least one mpg value greater than 25, we keep only the rows with mpg greater than 20. If a group has no value greater than 25, all of its rows are kept.
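Because every 4-cylinder car in mtcars already has mpg above 20, the mtcars output keeps all 32 rows. A sketch on made-up data (the toy frame is invented for illustration) shows rows actually being dropped:

```r
library(dplyr)

# Made-up data: group "a" contains a value above 25, group "b" does not.
toy <- data.frame(
  grp = c("a", "a", "a", "b", "b"),
  mpg = c(30, 22, 15, 24, 12)
)

out <- toy %>%
  group_by(grp) %>%
  filter(if (any(mpg > 25)) mpg > 20 else TRUE) %>%
  ungroup()
```

Only the mpg = 15 row of group "a" is removed; group "b" is left untouched even though it contains a value below 20.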
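We can also use a single logical condition: keep a row if its group has no mpg above 25, or if its own mpg is above 20.
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  filter(!any(mpg > 25) | mpg > 20) %>%
  ungroup()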
I wonder if it is possible to mutate variables inside my recipe, taking a list of variables and imputing a fixed value (-12345) wherever NA is found.
No success so far.
my_list <- c("impute1", "impute2", "impute3")
recipe <-
recipes::recipe(target ~ ., data = data_train) %>%
recipes::step_naomit(everything(), skip = TRUE) %>%
recipes::step_rm(c(v1, v2, id, id2 )) %>%
recipes::step_mutate_at(my_list, if_else(is.na(.), -12345, . ))
Error in step_mutate_at_new(terms = ellipse_check(...), fn = fn, trained = trained, :
argument "fn" is missing, with no default
You were on the right track. A couple of notes: to make recipes::step_mutate_at() work you need two things, a selection of variables to transform and one or more functions to apply to that selection. The functions are passed to the fn argument, either as a single function (named or anonymous) or as a named list of functions.
Setting fn = ~if_else(is.na(.), -12345, . ) in step_mutate_at(), using the ~fun(.) lambda style, should fix your problem. Furthermore, I used all_of(my_list) instead of my_list so that the external character vector is unambiguously treated as a set of column names.
Lastly, step_naomit() drops the observations with missing values before they can be imputed, which is probably undesirable since you are imputing the missing values.
library(recipes)
mtcars1 <- mtcars
mtcars1[1, 1:3] <- NA
my_list <- c("mpg", "cyl", "disp")
recipe <-
recipe(drat ~ ., data = mtcars1) %>%
step_mutate_at(all_of(my_list), fn = ~if_else(is.na(.), -12345, . ))
recipe %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 32 x 11
#> mpg cyl disp hp wt qsec vs am gear carb drat
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -12345 -12345 -12345 110 2.62 16.5 0 1 4 4 3.9
#> 2 21 6 160 110 2.88 17.0 0 1 4 4 3.9
#> 3 22.8 4 108 93 2.32 18.6 1 1 4 1 3.85
#> 4 21.4 6 258 110 3.22 19.4 1 0 3 1 3.08
#> 5 18.7 8 360 175 3.44 17.0 0 0 3 2 3.15
#> 6 18.1 6 225 105 3.46 20.2 1 0 3 1 2.76
#> 7 14.3 8 360 245 3.57 15.8 0 0 3 4 3.21
#> 8 24.4 4 147. 62 3.19 20 1 0 4 2 3.69
#> 9 22.8 4 141. 95 3.15 22.9 1 0 4 2 3.92
#> 10 19.2 6 168. 123 3.44 18.3 1 0 4 4 3.92
#> # … with 22 more rows
Created on 2021-06-21 by the reprex package (v2.0.0)
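Since fn also accepts an ordinary function, the lambda can be swapped for a named helper. This is a sketch under that assumption: fill_sentinel is a hypothetical name, and it uses base replace() instead of if_else().

```r
library(recipes)

# Hypothetical helper name; the sentinel value is the one from the question.
fill_sentinel <- function(x) replace(x, is.na(x), -12345)

mtcars1 <- mtcars
mtcars1[1, 1:3] <- NA
my_list <- c("mpg", "cyl", "disp")

baked <- recipe(drat ~ ., data = mtcars1) %>%
  step_mutate_at(all_of(my_list), fn = fill_sentinel) %>%
  prep() %>%
  bake(new_data = NULL)
```

A named function can be reused and tested on its own, outside the recipe.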