Passing column names to a user-defined function inside mutate_at - r

I am struggling to pass column names to my custom function while using dplyr's mutate_at.
I have a dataset "dt" with thousands of columns, and I would like to mutate some of these columns in a way that depends on the column name.
I have this piece of code:
Option 1:
relevantcols <- c("A", "B", "C")
myfunc <- function(colname, x) {
  # write different logic per column name
}
dt %>%
  mutate_at(relevantcols, funs(myfunc(<what should I give?>, .)))
I tried approaching the problem another way, i.e. by iterating over relevantcols and applying mutate_at to each element of the vector:
Option 2:
for (i in seq_along(relevantcols)) {
  dt <- dt %>%
    mutate_at(relevantcols[i], funs(myfunc(relevantcols[i], .)))
}
I get the column names in Option 2, but it is 10 times slower than Option 1. Can I somehow get the column names in Option 1?
Adding an example for more clarity:
df <- data.frame(employee = 1:5,
                 Mon_channelA = runif(5, 1, 10), Mon_channelB = runif(5, 1, 10),
                 Tue_channelA = runif(5, 1, 10), Tue_channelB = runif(5, 1, 10))
df
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1 1 5.234383 6.857227 4.480943 7.233947
2 2 7.441399 3.777524 2.134075 6.310293
3 3 7.686558 8.598688 9.814882 9.192952
4 4 6.033345 5.658716 5.167388 3.018563
5 5 5.595006 7.582548 9.302917 6.071108
relevantcols <- c("Mon_channelA", "Mon_channelB")
myfunc <- function(colname, x) {
  # Based on the channel and weekday encoded in the column name, compare the data
  # with the column for the same channel on the other weekday; return TRUE if this
  # column's value is higher, FALSE otherwise.
}
# required output
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1 1 T F 4.480943 7.233947
2 2 T F 2.134075 6.310293
3 3 F F 9.814882 9.192952
4 4 T T 5.167388 3.018563
5 5 F T 9.302917 6.071108

I left a comment about data types, but assuming this is what you're looking for, here's the approach I take to these sorts of problems. It's a seemingly convoluted process of reshaping a few times, but it lets you set up the variables you're trying to compare without hard-coding much. I'll break it into pieces.
library(tidyverse)
set.seed(928)
df <- data.frame(employee = 1:5,
                 Mon_channelA = runif(5, 1, 10), Mon_channelB = runif(5, 1, 10),
                 Tue_channelA = runif(5, 1, 10), Tue_channelB = runif(5, 1, 10))
First, I'd reshape it into a long shape and break "Mon_channelA" etc. apart into a day and a channel. This lets you use the channel designation to match values for comparison.
df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  head()
#> employee day channel value
#> 1 1 Mon channelA 2.039619
#> 2 2 Mon channelA 8.153684
#> 3 3 Mon channelA 9.027932
#> 4 4 Mon channelA 1.161967
#> 5 5 Mon channelA 3.583353
#> 6 1 Mon channelB 7.102797
Then, bring it back into a wide format based on the days. Now you have a column for each day for each combination of employee and channel.
df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  head()
#> employee channel Mon Tue
#> 1 1 channelA 2.039619 9.826677
#> 2 1 channelB 7.102797 7.388568
#> 3 2 channelA 8.153684 5.848375
#> 4 2 channelB 6.299178 9.452274
#> 5 3 channelA 9.027932 5.458906
#> 6 3 channelB 7.029408 7.087011
Then do your comparison, and take the data long again. Note that because the value column now has to hold both the numeric Tue values and the logical comparison results, everything is coerced to numeric and the logicals become 1 or 0.
df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  head()
#> employee channel day value
#> 1 1 channelA Mon 0
#> 2 1 channelB Mon 0
#> 3 2 channelA Mon 1
#> 4 2 channelB Mon 0
#> 5 3 channelA Mon 1
#> 6 3 channelB Mon 0
The last few steps are to stick the day and channel back together to recreate the original labels, spread back to a wide format, and turn all the columns starting with "Mon" back into logicals.
df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  unite("variable", day, channel) %>%
  spread(key = variable, value = value) %>%
  mutate_at(vars(starts_with("Mon")), as.logical)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 FALSE FALSE 9.826677 7.388568
#> 2 2 TRUE FALSE 5.848375 9.452274
#> 3 3 TRUE FALSE 5.458906 7.087011
#> 4 4 FALSE FALSE 8.854263 8.946458
#> 5 5 FALSE FALSE 6.933054 8.450741
Created on 2018-09-28 by the reprex package (v0.2.1)
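As an aside for readers on current tidyr: gather()/spread() have since been superseded by pivot_longer()/pivot_wider(). A hedged, untested sketch of the same pipeline (assumes tidyr >= 1.0, and dplyr >= 1.0 for across()):
library(tidyverse)
df %>%
  # go long and split "Mon_channelA" into day and channel in one step
  pivot_longer(-employee, names_to = c("day", "channel"), names_sep = "_") %>%
  # one column per day for each employee/channel combination
  pivot_wider(names_from = day, values_from = value) %>%
  mutate(Mon = Mon > Tue) %>%
  # logicals are coerced back to numeric when recombined with the Tue values
  pivot_longer(c(Mon, Tue), names_to = "day") %>%
  # glue day and channel back into "Mon_channelA"-style names
  pivot_wider(names_from = c(day, channel), values_from = value) %>%
  mutate(across(starts_with("Mon"), as.logical))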

You can do things like:
L <- c("A","B")
df <- data.frame(A=rep(1:3,2),B=1:6,C=7:12)
df
# A B C
#1 1 1 7
#2 2 2 8
#3 3 3 9
#4 1 4 10
#5 2 5 11
#6 3 6 12
f <- function(x, y) x^y
df %>% mutate_at(L, funs(f(., 2)))
# A B C
#1 1 1 7
#2 4 4 8
#3 9 9 9
#4 1 16 10
#5 4 25 11
#6 9 36 12
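With dplyr >= 1.0 there is also a direct way to get at the current column name, which is essentially what Option 1 in the question was missing: across() replaces mutate_at(), and cur_column() exposes the name inside the function. A minimal sketch, assuming myfunc is vectorized over its second argument:
library(dplyr)
dt %>%
  mutate(across(all_of(relevantcols),
                ~ myfunc(cur_column(), .x)))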

This is an old question, but I just stumbled over one possible way to solve it, using a custom mutate/case_when function in combination with purrr::reduce.
It's important to use non-standard evaluation (NSE) inside the mutate/case_when statement to match the variable names you need for your custom function.
I do not know a way to do something similar with mutate_at.
Below I provide two examples: the most basic form (using your original data), and a more advanced version, which contains three weekdays and two channels and creates more than two variables. The latter requires an initial set-up using, for example, switch.
Basic example
library(tidyverse)
# your data
df <- data.frame(employee = 1:5,
                 Mon_channelA = runif(5, 1, 10),
                 Mon_channelB = runif(5, 1, 10),
                 Tue_channelA = runif(5, 1, 10),
                 Tue_channelB = runif(5, 1, 10))
# custom function which takes two arguments: df and a string variable name
myfunc <- function(df, x) {
  mutate(df,
         # overwrites each "Mon_channel" variable ...
         !!paste0("Mon_", x) := case_when(
           # ... with TRUE when Mon_channel is smaller than Tue_channel, FALSE otherwise
           !!sym(paste0("Mon_", x)) < !!sym(paste0("Tue_", x)) ~ TRUE,
           TRUE ~ FALSE
         ))
}
# define the variables you want to loop over
var_ls <- c("channelA", "channelB")
# use var_ls and myfunc with reduce on your data
df %>%
  reduce(var_ls, myfunc, .init = .)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 FALSE FALSE 3.437975 2.458389
#> 2 2 FALSE TRUE 3.686903 4.772390
#> 3 3 TRUE TRUE 5.158234 5.378021
#> 4 4 TRUE TRUE 5.338950 3.109760
#> 5 5 TRUE FALSE 6.365173 3.450495
Created on 2020-02-03 by the reprex package (v0.3.0)
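A small, hedged simplification of the basic example: a case_when() whose two branches are just TRUE and FALSE reduces to the bare comparison. The one behavioural difference is NA handling: the TRUE ~ FALSE fallback maps NA comparisons to FALSE, while the bare comparison keeps NA.
# equivalent body, modulo the NA caveat above
myfunc <- function(df, x) {
  mutate(df,
         !!paste0("Mon_", x) := !!sym(paste0("Mon_", x)) < !!sym(paste0("Tue_", x)))
}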
More advanced example
library(tidyverse)
# your data plus one weekday with two channels
df <- data.frame(employee = 1:5,
                 Mon_channelA = runif(5, 1, 10),
                 Mon_channelB = runif(5, 1, 10),
                 Tue_channelA = runif(5, 1, 10),
                 Tue_channelB = runif(5, 1, 10),
                 Wed_channelA = runif(5, 1, 10),
                 Wed_channelB = runif(5, 1, 10))
# custom function which takes two arguments: df and a string variable name
myfunc <- function(df, x) {
  # an initial set-up is needed
  # id gets the original day
  id <- str_extract(x, "^\\w{3}")
  # based on id, the day of comparison is mapped with switch
  y <- switch(id,
              "Mon" = "Tue",
              "Tue" = "Wed")
  # j extracts the channel name including the underscore
  j <- str_extract(x, "_channel[A-Z]{1}")
  # this makes the function definition rather easy:
  mutate(df,
         !!x := case_when(
           !!sym(x) < !!sym(paste0(y, j)) ~ TRUE,
           TRUE ~ FALSE
         ))
}
# define the variables you want to loop over
var_ls <- c("Mon_channelA",
"Mon_channelB",
"Tue_channelA",
"Tue_channelB")
# use var_ls and myfunc with reduce on your data
df %>%
  reduce(var_ls, myfunc, .init = .)
#> employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1 1 TRUE TRUE TRUE FALSE
#> 2 2 FALSE TRUE TRUE FALSE
#> 3 3 FALSE TRUE FALSE TRUE
#> 4 4 FALSE TRUE TRUE FALSE
#> 5 5 TRUE FALSE FALSE FALSE
#> Wed_channelA Wed_channelB
#> 1 9.952454 5.634686
#> 2 9.356577 4.514683
#> 3 2.721330 7.107316
#> 4 4.410240 2.740289
#> 5 5.394057 4.772162
Created on 2020-02-03 by the reprex package (v0.3.0)
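One hedged variation on the advanced example: the hard-coded switch() can be replaced by a named lookup vector, so supporting more weekdays only means extending the vector (next_day is an illustrative name, not from the original answer):
# inside myfunc, replacing the switch():
next_day <- c(Mon = "Tue", Tue = "Wed")
y <- unname(next_day[id])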

Related

R improve loop efficiency: Operating on columns that correspond to rows in a second dataframe

I have two data frames:
dat <- data.frame(Digits_Lower = 1:5,
                  Digits_Upper = 6:10,
                  random = 20:24)
dat
#> Digits_Lower Digits_Upper random
#> 1 1 6 20
#> 2 2 7 21
#> 3 3 8 22
#> 4 4 9 23
#> 5 5 10 24
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2,
                 y = 3:4)
cb
#> Digits x y
#> 1 Digits_Lower 1 3
#> 2 Digits_Upper 2 4
I am trying to perform some operation on multiple columns in dat, similar to these examples: In data.table: iterating over the rows of another data.table and R multiply columns by values in second dataframe. However, I am hoping to operate on these columns with an extended expression for every value in its corresponding row in cb. The solution should be applicable to a large dataset. I have created this for-loop so far.
dat.loop <- dat
for (i in seq_len(nrow(cb))) {
  # create new columns from the Digits column of `cb`
  dat.loop[paste0("disp.", cb$Digits[i])] <-
    # some operation using every value in a column in `dat` with its corresponding row in `cb`
    (dat.loop[, cb$Digits[i]] - cb$y[i]) * cb$x[i]
}
dat.loop
#> Digits_Lower Digits_Upper random disp.Digits_Lower disp.Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
I will then perform operations on the data that I appended to dat in dat.loop using a similar for-loop, and then perform yet another operation on those values. My dataset is very large, and I imagine my use of for-loops will become cumbersome. I am wondering:
Would another method, such as data.table or the tidyverse, improve efficiency?
How would I go about using another method, or improving my for-loop? My main confusion is how to write concise code to perform operations on columns in dat with corresponding rows in cb. Ideally, I would split my for-loop into multiple functions that would, for example, avoid indexing into cb for the same values over and over again, or appending unnecessary data to my dataframe, but I'm not really sure how to do this.
Any help is appreciated!
EDIT:
I've modified the code @Desmond provided to allow for more generic code, since dat and cb will come from user-inputted files, and dat can have a varying number of columns/column names that I will be operating on (columns in dat will always start with "Digits_" and will be specified in the "Digits" column of cb).
library(tidytable)
results <- dat %>%
  crossing.(cb) %>%
  mutate_rowwise.(disp = (get(Digits) - y) * x) %>%
  pivot_wider.(names_from = Digits,
               values_from = disp,
               names_prefix = "disp_")
results2 <- results %>%
  fill.(starts_with("disp"), .direction = "downup", .by = "random") %>%
  select.(-c(x, y)) %>%
  distinct.()
results2
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Here's a tidyverse solution:
crossing() generates combinations from both datasets
case_when() applies your logic
pivot_wider(), filter() and bind_cols() clean up the output
To scale this to a large dataset, I suggest using the tidytable package. After loading it, simply replace crossing() with crossing.(), pivot_wider() with pivot_wider.(), etc.
library(tidyverse)
dat <- data.frame(
  Digits_Lower = 1:5,
  Digits_Upper = 6:10,
  random = 20:24
)
cb <- data.frame(
  Digits = c("Digits_Lower", "Digits_Upper"),
  x = 1:2,
  y = 3:4
)
results <- dat |>
  crossing(cb) |>
  mutate(disp = case_when(
    Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
    Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
  )) |>
  pivot_wider(names_from = Digits,
              values_from = disp,
              names_prefix = "disp_")
results |>
  filter(!is.na(disp_Digits_Lower)) |>
  select(-c(x, y, disp_Digits_Upper)) |>
  bind_cols(results |>
              filter(!is.na(disp_Digits_Upper)) |>
              select(disp_Digits_Upper))
#> # A tibble: 5 × 5
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> <int> <int> <int> <int> <int>
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Created on 2022-08-20 by the reprex package (v2.0.1)
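Since the question also asks about data.table, here is a hedged sketch of the same computation (assuming every name in cb$Digits is a column of dat). data.table::set() updates columns by reference, so looping over the rows of cb stays cheap even with many columns:
library(data.table)
dt <- as.data.table(dat)
for (i in seq_len(nrow(cb))) {
  col <- cb$Digits[i]
  # same arithmetic as the for-loop: (dat[[col]] - y) * x
  set(dt, j = paste0("disp.", col), value = (dt[[col]] - cb$y[i]) * cb$x[i])
}
dt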

Filter rows in a group based on the value for another group

I have a table of data which includes, among others, an ID, a (somehow sorted) grouping column and a date. For each ID, based on the minimum value of the date for a given group, I would like to filter out the rows of another given group that occurred after that date.
I thought about using pivot_wider and pivot_longer, but I was not able to operate on columns containing list values and single values simultaneously.
How can I do it efficiently (using any tidyverse method, if possible)?
For instance, given
library(dplyr)
tbl <- tibble(id = c(rep(1, 5), rep(2, 5)),
              type = c("A", "A", "A", "B", "C", "A", "A", "B", "B", "C"),
              dat = as.Date("2021-12-07") - c(3, 0, 1, 2, 0, 3, 6, 2, 4, 3))
# A tibble: 10 × 3
# id type dat
# <int> <chr> <date>
# 1 1 A 2021-12-04
# 2 1 A 2021-12-07
# 3 1 A 2021-12-06
# 4 1 B 2021-12-05
# 5 1 C 2021-12-07
# 6 2 A 2021-12-04
# 7 2 A 2021-12-01
# 8 2 B 2021-12-05
# 9 2 B 2021-12-03
# 10 2 C 2021-12-04
I would like the following result, where I discarded A-typed elements that occurred after the first of the B-typed ones, but none of the C-typed ones:
# A tibble: 7 × 3
# id type dat
# <int> <chr> <date>
# 1 1 A 2021-12-04
# 2 1 B 2021-12-05
# 3 1 C 2021-12-07
# 4 2 A 2021-12-01
# 5 2 B 2021-12-05
# 6 2 B 2021-12-03
# 7 2 C 2021-12-04
I like to use pivot_wider and pivot_longer in this case. It does the trick, but maybe you are looking for something shorter.
tbl <- tibble(id = 1:5, type = c("A", "A", "A", "B", "C"),
              dat = as.Date("2021-12-07") - c(3, 4, 1, 2, 0)) %>%
  pivot_wider(names_from = type, values_from = dat) %>%
  filter(A < min(B, na.rm = TRUE) | is.na(A)) %>%
  pivot_longer(2:4, names_to = "type", values_to = "dat") %>%
  na.omit()
# A tibble: 4 × 3
id type dat
<int> <chr> <date>
1 1 A 2021-12-04
2 2 A 2021-12-03
3 4 B 2021-12-05
4 5 C 2021-12-07
An easy way using SQL-like logic:
tbl_to_delete <- tbl %>% dplyr::filter(type == "A" & dat > min(tbl$dat[tbl$type == "B"]))
tbl2 <- tbl %>% dplyr::anti_join(tbl_to_delete, by = c("type", "dat"))
First you isolate the rows you want to delete, then you discard them from your original data.
You can of course merge the two lines above into one for better code management:
tbl %>% anti_join(tbl %>% filter(type == "A" & dat > min(tbl$dat[tbl$type == "B"])), by = c("type", "dat"))
Or if you really hate base R:
tbl %>% anti_join(tbl %>% filter(type == "A" & dat > tbl %>% filter(type == "B") %>% pull(dat) %>% min()), by = c("type", "dat"))
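One caveat on both answers above: they compare against a single global minimum, while the question's expected output needs the earliest B date per id. A hedged grouped sketch (it assumes every id has at least one B row; otherwise min() warns and returns Inf, which here would simply keep all A rows):
library(dplyr)
tbl %>%
  group_by(id) %>%
  # drop A rows dated after this id's earliest B date; keep B and C untouched
  filter(!(type == "A" & dat > min(dat[type == "B"]))) %>%
  ungroup()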

cumulative grouping

I have the following data frame:
df = data.frame(a = c(1,1,3,2,2), b=6:10)
## a b
## 1 6
## 1 7
## 3 3
## 2 9
## 2 10
I want to analyze the data by groups (a is the grouping parameter), but instead of the usual disjoint groups (where each value defines one group of rows) I need "cumulative groups". That is, for the value a = i, the group should contain all the rows in which a <= i. These are not disjoint groups, but I still want to summarize each group separately.
So for example, if for each group I want the mean of b, the result would be:
## a mean_b
## 1 6.5
## 2 8
## 3 7
Note that in the real scenario behind this simplified example, I cannot analyze disjoint groups separately and then aggregate the relevant groups; the summarize function must be "aware" of all the rows in that group to perform the computation.
So of course I can use some apply functions, compute things the good old way, and make a new df out of it, but I am looking for dplyr/tidyverse-like functions to do that.
Any suggestions?
How about something like this?
library(dplyr)
df %>%
  arrange(a) %>%
  group_by(a) %>%
  summarise(sum_b = sum(b)) %>%
  ungroup() %>%
  mutate(sum_b = cumsum(sum_b))
# a sum_b
# <dbl> <int>
#1 1. 13
#2 2. 32
#3 3. 40
We take the sum by group (a) and then take the cumulative sum, adding each group's value to those of the previous groups.
I had a look and I don't see how it is possible with dplyr itself. However, we can hack the group_by function to make it cumulative. I'll quickly walk you through it:
First, I make your df. It doesn't really fit your output above, so I slightly changed it.
df = data.frame(a = c(1,1,3,2,2), b=6:10)
df$b[3] <- 3
Now I use the normal group_by to check out what it actually does to the data.frame.
library(dplyr)
df_grouped <- df %>%
  arrange(a) %>%
  group_by(a)
> attributes(df_grouped)
$class
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5
$names
[1] "a" "b"
$vars
[1] "a"
$drop
[1] TRUE
$indices
$indices[[1]]
[1] 0 1
$indices[[2]]
[1] 2 3
$indices[[3]]
[1] 4
$group_sizes
[1] 2 2 1
$biggest_group_size
[1] 2
$labels
a
1 1
2 2
3 3
So besides other things, there is a new attribute called indices, which records the row positions belonging to each group. We can actually just change that to make the groups cumulative.
for (i in seq_along(attributes(df_grouped)[["indices"]])[-1]) {
  attributes(df_grouped)[["indices"]][[i]] <- c(
    attributes(df_grouped)[["indices"]][[i - 1]],
    attributes(df_grouped)[["indices"]][[i]]
  )
}
It looks a bit weird but is straightforward: the elements of each group are added to the next group. E.g. all elements from group 1 are added to group 2.
> attributes(df_grouped)$indices
[[1]]
[1] 0 1
[[2]]
[1] 0 1 3 4
[[3]]
[1] 0 1 3 4 2
We can use the changed groups in the normal dplyr way.
> df_grouped %>%
+ summarise(sum_b = mean(b))
# A tibble: 3 x 2
a sum_b
<dbl> <dbl>
1 1 6.5
2 2 8
3 3 7
Now of course this is pretty ugly and looks very hacky. But inside a function that doesn't really matter, as long as it is still efficient (which it is). So let's make a custom group_by.
group_by_cuml <- function(.data, ...) {
  .data_grouped <- group_by(.data, ...)
  for (i in seq_along(attributes(.data_grouped)[["indices"]])[-1]) {
    attributes(.data_grouped)[["indices"]][[i]] <- c(
      attributes(.data_grouped)[["indices"]][[i - 1]],
      attributes(.data_grouped)[["indices"]][[i]]
    )
  }
  return(.data_grouped)
}
Now you can use the custom function in a clean dplyr pipe.
> df %>%
+ group_by_cuml(a) %>%
+ summarise(sum_b = mean(b))
# A tibble: 3 x 2
a sum_b
<dbl> <dbl>
1 1 6.5
2 2 8
3 3 7
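A version note to hedge this hack: the "indices" attribute belongs to dplyr's pre-0.8 internals. From dplyr 0.8 on, grouping metadata lives in a "groups" tibble whose .rows list-column holds the 1-based row indices per group, so the same idea would have to target that structure instead. An untested sketch:
# dplyr >= 0.8: cumulate the per-group row indices in the "groups" attribute
g <- attr(df_grouped, "groups")
g$.rows <- Reduce(c, g$.rows, accumulate = TRUE)
attr(df_grouped, "groups") <- g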
I would do it this way:
df %>%
  arrange(a) %>%
  map_dfr(seq_along(as <- unique(.$a)),
          ~ filter(.y, a %in% as[1:.]), .y = ., .id = "a") %>%
  group_by(a) %>%
  summarise(b = mean(b))
# # A tibble: 3 x 2
# a b
# <chr> <dbl>
# 1 1 6.5
# 2 2 7.0
# 3 3 8.0
If you want a separate function you can do:
summarize2 <- function(.data, ..., .by) {
  grps <- select_at(.data, .by) %>% pull() %>% unique()
  .data %>%
    arrange_at(.by) %>%
    map_dfr(seq_along(grps),
            ~ filter_at(.y, .by, all_vars(. %in% grps[1:.x])),
            .y = .,
            .id = "meta_group") %>%
    group_by(meta_group) %>%
    summarise(...)
}
df %>%
  summarize2(b = mean(b), .by = "a")
# # A tibble: 3 x 2
# meta_group b
# <chr> <dbl>
# 1 1 6.5
# 2 2 7.0
# 3 3 8.0
df %>%
  summarize2(b = mean(b), .by = vars(a))
# # A tibble: 3 x 2
# meta_group b
# <chr> <dbl>
# 1 1 6.5
# 2 2 7.0
# 3 3 8.0
One way is to use the base function Reduce with the argument accumulate = TRUE. Once you have the concatenated (cumulative) groups, you can apply any function to each of them:
Reduce(c, split(df$b,df$a), accumulate = TRUE)
#[[1]]
#[1] 6 7
#[[2]]
#[1] 6 7 9 10
#[[3]]
#[1] 6 7 9 10 3
and then for the mean,
sapply(Reduce(c, split(df$b,df$a), accumulate = TRUE), mean)
[1] 6.5 8.0 7.0
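For completeness, a compact sketch in the same spirit that avoids both grouped-data internals and Reduce (assumes dplyr and purrr are loaded): iterate over the thresholds with purrr::map_dfr() so each summarise() sees every row of its cumulative group. With the data as displayed in the question, the means come out 6.5, 8 and 7.
library(dplyr)
library(purrr)
map_dfr(sort(unique(df$a)), function(i) {
  df %>%
    filter(a <= i) %>%                 # cumulative group: all rows with a <= i
    summarise(a = i, mean_b = mean(b))
})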

How to create a column that is a group label for unique collections of other columns data table [duplicate]

I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).
EDIT: this was subsequently resolved by adding the (now-deprecated) group_indices() back in dplyr 0.4.0
a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3...
e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on.
How can I do this with one mutate(), without a three-step summarize-and-self-join?
dplyr has a neat function n(), but that gives the number of elements within its group, not the overall number of the group. In data.table this would simply be called .GRP.
b) Actually what I really want is to assign a string/character label ('A','B',...).
But numbering groups by integers is good enough, because I can then use integer_to_label(i) as below. Unless there's a clever way to merge these two? But don't sweat this part.
set.seed(1234)
# Helper fn for mapping integers 1..26 to character labels
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", i, i) }
df <- tibble::as_tibble(data.frame(u = sample.int(3, 10, replace = TRUE),
                                   v = sample.int(4, 10, replace = TRUE)))
# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u, v) %>% mutate(label = n())  # WRONG: n() is the number of elements within its group, not the overall group number
u v
1 2 3
2 1 3
3 1 2
4 2 3
5 1 2
6 3 3
7 1 3
8 1 2
9 3 1
10 3 4
KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()), then self-join.
dplyr has a group_indices() function that you can use like this:
df %>%
  mutate(label = group_indices(., u, v)) %>%
  group_by(label) ...
Another approach using data.table would be
require(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]
which results in:
u v label
1: 2 1 1
2: 1 3 2
3: 2 1 1
4: 3 4 3
5: 3 1 4
6: 1 1 5
7: 3 2 6
8: 2 3 7
9: 3 2 6
10: 3 4 3
As of dplyr version 1.0.4, the function cur_group_id() has replaced the older function group_indices.
Call it on the grouped data.frame:
df %>%
  group_by(u, v) %>%
  mutate(label = cur_group_id())
# A tibble: 10 x 3
# Groups: u, v [6]
u v label
<int> <int> <int>
1 2 2 4
2 2 2 4
3 1 3 2
4 3 2 6
5 1 4 3
6 1 2 1
7 2 2 4
8 2 4 5
9 3 2 6
10 2 4 5
Updated answer
get_group_number <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}
group_number <- get_group_number()
df %>% group_by(u, v) %>% mutate(label = group_number())
You can also consider the following slightly unreadable version:
group_number <- (function() { i <- 0; function() i <<- i + 1 })()
df %>% group_by(u, v) %>% mutate(label = group_number())
Using the iterators package:
library(iterators)
counter <- icount()
df %>% group_by(u, v) %>% mutate(label = nextElem(counter))
Updating my answer with three different ways:
A) A neat non-dplyr solution using interaction(u,v):
> df$label <- factor(interaction(df$u,df$v, drop=T))
[1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4
> match(df$label, levels(df$label)[ rank(unique(df$label)) ] )
[1] 1 2 3 4 5 4 6 6 7 7
B) Making Randy's neat fast-and-dirty generator-function answer more compact:
get_next_integer <- function() {
  i <- 0
  function(u, v) { i <<- i + 1 }
}
get_integer <- get_next_integer()
df %>% group_by(u, v) %>% mutate(label = get_integer())
C) Also here is a one-liner using a generator function abusing a global variable assignment from this:
i <- 0
generate_integer <- function() { return(assign('i', i+1, envir = .GlobalEnv)) }
df %>% group_by(u,v) %>% mutate(label = generate_integer())
rm(i)
I don't have enough reputation for a comment, so I'm posting an answer instead.
The solution using factor() is a good one, but it has the disadvantage that group numbers are assigned after factor() alphabetizes its levels. The same behaviour happens with dplyr's group_indices(). Perhaps you would like the group numbers to be assigned from 1 to n based on the current group order. In that case, you can use:
my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
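A hedged base-R alternative to the factor-levels trick above: match() against the unique keys also numbers groups 1..n in order of first appearance, without any level juggling:
# first-appearance group numbers without factor()
key <- paste(df$u, df$v, sep = "_")
df$label <- match(key, unique(key))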

