Create unique random group id in R [duplicate] - r

I am trying to create a unique, randomly assigned (without replacement) group id without using a for loop. This is as far as I got:
library(datasets)
library(dplyr)
data(iris)
iris <- iris %>% group_by(Species) %>% mutate(id = cur_group_id())
This gives me a group id for each iris$Species. However, I would like the group id to be randomly assigned from c(1,2,3) rather than assigned based on the order of the dataset.
Any help creating this would be very helpful! I am sure there is a way to do this with dplyr but I am stumped...

Maybe you can play a trick on group_by by shuffling the factor levels with sample, e.g.,
iris <- iris %>%
group_by(factor(Species, levels = sample(levels(Species)))) %>%
mutate(id = cur_group_id())
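For comparison, the same shuffle-the-levels idea can be sketched in base R without group_by. This is only an illustrative sketch, not part of the answer above: the names df and shuffled are assumptions, and set.seed() is there only to make the run repeatable.

```r
data(iris)
set.seed(42)                             # only for reproducibility
df <- iris
shuffled <- sample(levels(df$Species))   # random permutation of the three levels
# The position of each species in the shuffled order becomes its id
df$id <- match(as.character(df$Species), shuffled)
unique(df[, c("Species", "id")])
```

Because sample() returns a permutation, every species gets a distinct id from 1 to 3, which is the "without replacement" requirement from the question.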

Here's an answer that draws one random number per group and then ranks those numbers.
library(datasets)
library(dplyr)
data(iris)
df <- iris %>%
group_by(Species) %>%
mutate(id = runif(1,0,1)) %>%
ungroup() %>%
mutate(id = dense_rank(id))
df %>% sample_n(10)
#> # A tibble: 10 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species       id
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>      <int>
#>  1          4.4         3            1.3         0.2 setosa         3
#>  2          6.5         3            5.5         1.8 virginica      2
#>  3          6.3         2.7          4.9         1.8 virginica      2
#>  4          5           3.6          1.4         0.2 setosa         3
#>  5          6.3         2.3          4.4         1.3 versicolor     1
#>  6          7.9         3.8          6.4         2   virginica      2
#>  7          5.4         3.9          1.7         0.4 setosa         3
#>  8          5.7         4.4          1.5         0.4 setosa         3
#>  9          6.4         2.8          5.6         2.2 virginica      2
#> 10          5.2         3.4          1.4         0.2 setosa         3
Created on 2020-07-29 by the reprex package (v0.3.0)

Here's an approach with sample and recode:
Use seq_along(unique(id)) to create a vector of integer values to recode to.
Use sample to sample the appropriate number of random values.
Use setNames to name the ids with their new random values.
Use !!! to force that vector of named id into a list of expressions.
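Use recode to change the values.
with_id <- iris %>%
group_by(Species) %>%
mutate(id = cur_group_id()) %>%
ungroup()
# Build the lookup outside mutate(): !!! is evaluated when the call is captured,
# before the id column is in scope.
key <- setNames(unique(with_id$id), sample(seq_along(unique(with_id$id))))
with_id %>%
mutate(id = recode(id, !!!key))
I think the other answers are better approaches, but having recode with !!! in your toolkit is helpful in other situations.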
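To see what !!! does with recode in isolation, here is a minimal sketch on a toy vector. The names x, lookup and recoded are hypothetical; the pattern itself (splicing a named vector into recode) follows the usage shown in the dplyr documentation.

```r
library(dplyr)

x <- c("a", "b", "a", "c")
# Names are the old values, elements are the replacements
lookup <- c(a = "apple", b = "banana", c = "cherry")

# !!! splices the named vector into separate arguments, so this is
# equivalent to recode(x, a = "apple", b = "banana", c = "cherry")
recoded <- recode(x, !!!lookup)
recoded
```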

Randomise the rows and then assign id based on the order in which each Species first occurs:
library(dplyr)
iris %>%
slice_sample(n = nrow(.)) %>%
#sample_n for dplyr < 1.0.0
#sample_n(n()) %>%
mutate(id = match(Species, unique(Species)))

Related

dplyr summarise (collapse) dataset by different functions for multiple columns

I'm trying to dplyr::summarise a dataset (collapse) by different summarise_at/summarise_if functions so that I have the same named variables in my output dataset. Example:
library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1          5.1         3.5          1.4         0.2  setosa 2000   red      A
2          4.9         3.0          1.4         0.2  setosa 2000   red      A
3          4.7         3.2          1.3         0.2  setosa 2000   red      A
The resulting dataset should look like this:
full
  Species     year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
  <fct>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl> <fct>  <chr>
1 setosa      2000        87           6.2          5.8          1.9 A      red
2 setosa      3000        84.4         6.1          5.5          1.9 A      red
3 versicolor  2000        69.4        33.6          7            4.9 B      green
4 versicolor  3000        69.1        32.7          6.8          5.1 B      green
5 virginica   2000        73.2        51.1          7.7          6.9 C      blue
6 virginica   3000        75.5        50.2          7.9          6.4 C      blue
I can achieve this by doing the following which is a bit repetitive:
sums <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Width")), list(sum))
max <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Length")), list(max))
last <- iris %>%
group_by(Species, year) %>%
summarise_if(is.factor, list(last))
first <- iris %>%
group_by(Species, year) %>%
summarise_if(is.character, list(first))
full <- full_join(sums, max) %>% full_join(last) %>% full_join(first)
I have found similar approaches below but can't figure out the approach I've tried here. I would prefer not to make my own function as I think something like this is cleaner by passing everything through a pipe and joining:
test <- iris %>%
#group_by(.vars = vars(Species, year)) %>% # why doesn't this work?
group_by_at(.vars = vars(Species, year)) %>% # doesn't work
{left_join(
summarise_at(., vars(matches("Width")), list(sum)),
summarise_at(., vars(matches("Length")), list(max)),
summarise_if(., is.factor, list(last)),
summarise_if(., is.character, list(first))
)
} # doesn't work
This doesn't work. Any suggestions or other approaches?
Helpful:
How can I use summarise_at to apply different functions to different columns?
Summarize different Columns with different Functions
Using dplyr summarize with different operations for multiple columns
By default, the dplyr::left_join() function only accepts two data frames. If you want to use this function with more than two data frames, you can iterate it with the Reduce function (base R function):
iris %>%
group_by(Species, year) %>%
{
Reduce(
function(x, y) left_join(x, y),
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
))
}
#   Species     year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
#   <fct>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl> <fct>  <chr>
# 1 setosa      2000        87           6.2          5.8          1.9 A      red
# 2 setosa      3000        84.4         6.1          5.5          1.9 A      red
# 3 versicolor  2000        69.4        33.6          7            4.9 B      green
# 4 versicolor  3000        69.1        32.7          6.8          5.1 B      green
# 5 virginica   2000        73.2        51.1          7.7          6.9 C      blue
# 6 virginica   3000        75.5        50.2          7.9          6.4 C      blue
Furthermore, notice that I had to call the functions with their package prefix (::) to avoid clashes with the names of the previously created data frames (max, last and first).
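With dplyr 1.0.0 or later, the joins can probably be avoided altogether by putting several across() calls in a single summarise(). The following is a sketch under that version assumption, not the method from the answer above; dat and full are illustrative names, and the setup lines repeat the question's example data.

```r
library(dplyr)

dat <- iris
dat$year   <- rep(c(2000, 3000), each = 25)                # for grouping
dat$color  <- rep(c("red", "green", "blue"), each = 50)    # character column
dat$letter <- as.factor(rep(c("A", "B", "C"), each = 50))  # factor column

full <- dat %>%
  group_by(Species, year) %>%
  summarise(
    across(matches("Width"), sum),       # sum the two width columns
    across(matches("Length"), max),      # max of the two length columns
    across(where(is.factor), last),      # last value of non-grouping factors
    across(where(is.character), first),  # first value of character columns
    .groups = "drop"
  )
full
```

across() skips the grouping columns, so Species is not touched by the where(is.factor) selection.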
Borrowing @Ulises's idea, an alternative is to use purrr::reduce instead of Reduce:
iris %>%
group_by(Species, year) %>%
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
) %>%
.[c(2:5)] %>%
reduce(left_join)
Or a solution with curly braces, which stop the pipe from inserting the data as the first argument:
iris %>%
group_by(Species, year) %>%
{
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
)
} %>%
reduce(left_join)

Sum of sequence of columns inside pipeline [duplicate]

I'm trying to create a new variable with mutate from a row-wise calculation,
say rowSums, as below:
iris %>%
mutate_(sumVar =
iris %>%
select(Sepal.Length:Petal.Width) %>%
rowSums)
The result is that "sumVar" is truncated to its first value (10.2):
Source: local data frame [150 x 6]
Groups: <by row>
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species sumVar
1          5.1         3.5          1.4         0.2  setosa   10.2
2          4.9         3.0          1.4         0.2  setosa   10.2
3          4.7         3.2          1.3         0.2  setosa   10.2
4          4.6         3.1          1.5         0.2  setosa   10.2
5          5.0         3.6          1.4         0.2  setosa   10.2
6          5.4         3.9          1.7         0.4  setosa   10.2
..
Warning message:
Truncating vector to length 1
Should it be applied rowwise? Or what's the right verb to use in this kind of calculation?
Edit:
More specifically, is there any way to use an inline custom function with dplyr?
I'm wondering if it is possible to do something like:
iris %>%
mutate(sumVar = colsum_function(Sepal.Length:Petal.Width))
This is more of a workaround, but it could be used:
iris %>% mutate(sumVar = rowSums(.[1:4]))
As written in comments, you can also use a select inside of mutate to get the columns you want to sum up, for example
iris %>%
mutate(sumVar = rowSums(select(., contains("Sepal")))) %>%
head
or
iris %>%
mutate(sumVar = select(., contains("Sepal")) %>% rowSums()) %>%
head
You can use the rowwise() function:
iris %>%
rowwise() %>%
mutate(sumVar = sum(c_across(Sepal.Length:Petal.Width)))
#> # A tibble: 150 x 6
#> # Rowwise:
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sumVar
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>    <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa    10.2
#>  2          4.9         3            1.4         0.2 setosa     9.5
#>  3          4.7         3.2          1.3         0.2 setosa     9.4
#>  4          4.6         3.1          1.5         0.2 setosa     9.4
#>  5          5           3.6          1.4         0.2 setosa    10.2
#>  6          5.4         3.9          1.7         0.4 setosa    11.4
#>  7          4.6         3.4          1.4         0.3 setosa     9.7
#>  8          5           3.4          1.5         0.2 setosa    10.1
#>  9          4.4         2.9          1.4         0.2 setosa     8.9
#> 10          4.9         3.1          1.5         0.1 setosa     9.6
#> # ... with 140 more rows
As the dplyr documentation notes, c_across() uses tidy selection syntax, so you can succinctly select many variables.
Finally, if you want, you can add %>% ungroup() at the end to exit rowwise mode.
A more complicated way would be:
iris %>% select(Sepal.Length:Petal.Width) %>%
mutate(sumVar = rowSums(.)) %>% left_join(iris)
Adding @docendodiscimus's comment as an answer. +1 to him!
iris %>% mutate(sumVar = rowSums(select(., contains("Sepal"))))
I am using this simple solution, which is a more robust modification of the answer by Davide Passaretti:
iris %>% select(Sepal.Length:Petal.Width) %>%
transmute(sumVar = rowSums(.)) %>% bind_cols(iris, .)
(It does require a defined row order, which should be fine unless you work with remote datasets.)
You can also use grep in place of contains or matches, in case you need to get fancy with regular expressions (in my experience, matches doesn't seem to like negative lookaheads and similar constructs much).
iris %>% mutate(sumVar = rowSums(select(., grep("Sepal", names(.)))))
As requested, transforming my comment into an answer:
For operations like sum that already have an efficient vectorised row-wise alternative, the proper way is currently:
df %>% mutate(total = rowSums(across(where(is.numeric))))
across can take anything that select can (e.g. rowSums(across(Sepal.Length:Petal.Width)) also works).
Scroll down the row-wise vignette to find this, and have a look at the documentation for across.
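As a self-contained sketch of the across approach on iris, together with one possible answer to the "inline custom function" part of the question: row_total below is a hypothetical helper invented for illustration (it is not a dplyr function), and across requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper: binds its arguments as columns and sums each row
row_total <- function(...) rowSums(cbind(...))

out <- iris %>%
  mutate(
    sumVar  = rowSums(across(Sepal.Length:Petal.Width)),
    sumVar2 = row_total(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
  )
head(out$sumVar)
```

Both columns hold the same row totals; the first rows agree with the sumVar values in the rowwise() output shown earlier (10.2, 9.5, 9.4, ...).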

Grouped times series lag on selected variables using dplyr

I am trying to use dplyr to lag some variables (all of which have a common naming convention) for each group in my data set.
I thought mutate_if would work, but I get an error (below). mutate_each works, but for all columns rather than the select few.
For example, I am looking to lag only the Sepal measurements:
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
# mutate_each(funs(lag(.)))
mutate_if(contains("Sepal"), funs(lag(.)))
#> Error in get(as.character(FUN), mode = "function", envir = envir) : object 'p' of mode 'function' was not found
to get a final data set like:
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#          <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
# 1           NA          NA          1.4         0.2     setosa
# 2          5.1         3.5          1.4         0.2     setosa
# 3          4.9         3.0          1.3         0.2     setosa
# 4           NA          NA          4.7         1.4 versicolor
# 5          7.0         3.2          4.5         1.5 versicolor
# 6          6.4         3.2          4.9         1.5 versicolor
# 7           NA          NA          6.0         2.5  virginica
# 8          6.3         3.3          5.1         1.9  virginica
# 9          5.8         2.7          5.9         2.1  virginica
This seems to work,
library(dplyr)
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
mutate_if(grepl('Sepal', names(.)), funs(lag(.)))
As @aosmith explains, contains returns an index of the columns that match the string, whereas mutate_if relies on predicate functions that return logical vectors, which is why the grepl option works.
In addition, as @StevenBeaupre mentions, mutate_at with vars() also works:
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
mutate_at(vars(contains('Sepal')), lag)
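In dplyr 1.0.0 and later, the scoped verbs (mutate_if, mutate_at) and funs() are superseded or deprecated, and the same grouped lag can be sketched with across. This assumes a recent dplyr and is not taken from the answers above; lagged is an illustrative name, and dplyr::lag is spelled out to avoid picking up stats::lag.

```r
library(dplyr)

lagged <- iris %>%
  as_tibble() %>%
  group_by(Species) %>%
  slice(1:3) %>%                                  # first three rows per species
  mutate(across(contains("Sepal"), dplyr::lag)) %>%  # lag within each group
  ungroup()
lagged
```

The first row of each species gets NA in the Sepal columns, while the Petal columns are left unchanged, which matches the desired output in the question.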
