Take this simple dataset and function (representative of more complex problems):
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) (a + b) * n
Using base R's Map I could do this to add 2 new columns in a vectorised fashion:
ns <- 1:2
x[paste0("new",seq_along(ns))] <- Map(mult, x["a"], x["b"], n=ns)
x
# a b new1 new2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
purrr attempt via pmap gets close with a list output:
library(purrr)
library(dplyr)
x %>% select(a,b) %>% pmap(mult, n=1:2)
#[[1]]
#[1] 3 6
#
#[[2]]
#[1] 5 10
#
#[[3]]
#[1] 7 14
My attempts from here with pmap_dfr etc all seem to error out in trying to map this back to new columns.
How do I end up making 2 further variables which match my current "new1"/"new2"? I'm sure there is a simple incantation, but I'm clearly overlooking it or using the wrong *map* function.
There is some useful discussion here - How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs - but it seems overly hacky and inflexible for what I imagined was a simple problem.
The best approach I've found (which is still not terribly elegant) is to pipe into bind_cols. To get pmap_dfr to work correctly, the function should return a named list (which may or may not be a data frame):
library(tidyverse)
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) as.list(set_names((a + b) * n, paste0('new', n)))
x %>% bind_cols(pmap_dfr(., mult, n = 1:2))
#> a b new1 new2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
To avoid changing the definition of mult, you can wrap it in an anonymous function:
mult <- function(a,b,n) (a + b) * n
x %>% bind_cols(pmap_dfr(
.,
~as.list(set_names(
mult(...),
paste0('new', 1:2)
)),
n = 1:2
))
#> a b new1 new2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
In this particular case, it's not actually necessary to iterate over rows, though, because you can vectorize the inputs from x and instead iterate over n. The advantage is that there are usually far more rows than values of n, so the number of iterations will be [potentially much] lower. To be clear, whether such an approach is possible depends on which of the function's parameters can accept vector arguments.
mult still needs to be called on the variables of x. The simplest way to do this is to pass them explicitly:
x %>% bind_cols(map_dfc(1:2, ~mult(x$a, x$b, .x)))
#> a b V1 V2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
...but this loses the benefit of pmap that named variables will automatically get passed to the correct parameter. You can get that back by using purrr::lift, which is an adverb that changes the domain of a function so it accepts a list by wrapping it in do.call. The returned function can be called on x and the value of n for that iteration:
x %>% bind_cols(map_dfc(1:2, ~lift(mult)(x, n = .x)))
This is equivalent to
x %>% bind_cols(map_dfc(1:2, ~invoke(mult, x, n = .x)))
but the advantage of the former is that it returns a function which can be partially applied on x so it only has an n parameter left, and thus requires no explicit references to x and so pipes better:
x %>% bind_cols(map_dfc(1:2, partial(lift(mult), .)))
All return the same thing. Names can be fixed after the fact with %>% set_names(~sub('^V(\\d+)$', 'new\\1', .x)), if you like.
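For reference, here is a minimal sketch of the iterate-over-n idea that sets the column names up front rather than fixing them afterwards. It assumes the same x and mult as in the question; new_cols and result are names made up for this illustration, not anything from purrr.

```r
library(purrr)

x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a, b, n) (a + b) * n
ns <- 1:2

# Iterate over n only; a and b are passed in as whole column vectors.
new_cols <- map(ns, function(n) mult(x$a, x$b, n))

# Name the list elements so they become the new column names.
names(new_cols) <- paste0("new", ns)

# Bind the named columns onto the original data frame.
result <- cbind(x, as.data.frame(new_cols))
result
```

This only works because mult is vectorised over a and b; a function that needs scalar inputs would still have to go row by row with pmap.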
Here is one possibility.
library(purrr)
library(dplyr)
library(tidyr) # for unnest() and spread()
n <- 1:2
x %>%
mutate(val = pmap(., mult, n = n)) %>%
unnest(val) %>%
mutate(var = rep(paste0("new", n), nrow(.) / length(n))) %>%
spread(var, val)
# a b new1 new2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
Not pretty, so I'm also curious to see alternatives. A lot of excess comes about from unnesting the list column and spreading into new columns.
Here is another possibility using pmap_dfc plus an ugly as.data.frame(t(...)) call
bind_cols(x, as.data.frame(t(pmap_dfc(x, mult, n = n))))
# a b V1 V2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
Sample data
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) (a + b) * n
To mimic the input format for Map, we could call pmap from purrr in this way:
x[paste0("new",seq_along(ns))] <- pmap(list(x['a'], x['b'], ns), mult)
To fit this in a pipe:
x %>%
{list(.['a'], .['b'], ns)} %>%
pmap(mult) %>%
setNames(paste0('new', seq_along(ns))) %>%
cbind(x)
# new1 new2 a b
# 1 3 6 1 2
# 2 5 10 2 3
# 3 7 14 3 4
Admittedly, this looks ugly compared to the concise base R code, but I could not think of a better way.
In a dplyr mutate context, I would like to use the value of one column to select which column a function is applied to via purrr::map.
Let's take a test data frame
test <- data.frame(a = c(1,2), b = c(3,4), selector = c("a","b"))
I want to apply the following function:
calc <- function(col) {
  res <- col ^ 2
  return(res)
}
I am trying something like this:
test_2 <- test %>% mutate(quad = map(.data[[selector]], ~ calc(.x)))
My expected result would be:
a b selector quad
1 1 3 a 1
2 2 4 b 16
but I get
Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?
I know .data[[var]] is meant to be used only in the special context of programming with functions, but even when I wrap this in a function I cannot get it to work. Trying tidy-selection instead gives an error saying that selection helpers can only be used in special dplyr verbs, not in functions like purrr::map.
The question "how to use dynamic variable in purrr map within dplyr" pointed me towards get() and anonymous functions, but that did not work in this context either.
Here's one way:
test %>%
mutate(quad = map(seq_along(selector), ~ calc(test[[selector[.x]]])[.x]))
# a b selector quad
# 1 1 3 a 1
# 2 2 4 b 16
Instead of .data, you can also use cur_data() (which accounts for grouping):
test %>%
mutate(quad = map(seq_along(selector), ~ calc(cur_data()[[selector[.x]]])[.x]))
Or, with diag:
test %>%
mutate(quad = diag(as.matrix(calc(cur_data()[selector]))))
# a b selector quad
#1 1 3 a 1
#2 2 4 b 16
You can use rowwise() and get() the selector variable:
library(dplyr)
test %>%
rowwise() %>%
mutate(quad = calc(get(selector))) %>%
ungroup()
# A tibble: 2 × 4
a b selector quad
<dbl> <dbl> <chr> <dbl>
1 1 3 a 1
2 2 4 b 16
Or if the selector repeats, group_by() will be more efficient:
test <- data.frame(a = c(1,2,5), b = c(3,4,6), selector = c("a","b","a"))
test %>%
group_by(selector) %>%
mutate(quad = calc(get(selector[1]))) %>%
ungroup()
# A tibble: 3 × 4
a b selector quad
<dbl> <dbl> <chr> <dbl>
1 1 3 a 1
2 2 4 b 16
3 5 6 a 25
You could also change the function to return a single number and use purrr:
calc <- function(col, id) {test[[col]][[id]]^2}
test %>%
mutate(
quad = purrr::map2_dbl(selector, row_number(), calc)
)
a b selector quad
1 1 3 a 1
2 2 4 b 16
Using base R:
test$quad <- calc(test[,test$selector][cbind(seq_len(nrow(test)), test$selector)])
(R version 3.5.3 where strings are converted to factors in data.frame)
Not quite what you asked for but an alternative might be to restructure the data so that the calculation is easier:
test %>%
pivot_longer(
cols = c(a, b)
) %>%
filter(name == selector) %>%
mutate(quad = value**2)
# A tibble: 2 × 4
selector name value quad
<chr> <chr> <dbl> <dbl>
1 a a 1 1
2 b b 4 16
You can join the results back onto the original data using an id column.
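A rough sketch of that join, assuming a row-number column called id is acceptable as the key (id, test_id, quad and result are names invented here for illustration):

```r
library(dplyr)
library(tidyr)

test <- data.frame(a = c(1, 2), b = c(3, 4), selector = c("a", "b"))

# Hypothetical id column, used only as the join key.
test_id <- test %>% mutate(id = row_number())

# Long format: one row per (id, column name); keep the row whose name matches selector.
quad <- test_id %>%
  pivot_longer(cols = c(a, b)) %>%
  filter(name == selector) %>%
  mutate(quad = value^2) %>%
  select(id, quad)

# Join the computed column back onto the wide data and drop the helper key.
result <- test_id %>%
  left_join(quad, by = "id") %>%
  select(-id)
result
```

The id column matters because selector alone is not unique once the same selector value appears in more than one row.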
I have 2 data frames: one with experimental data and one with values of some constants. Experimental data and constants are separated by categories (a and b). I would like to include a new column in the experimental data frame that is the result of the following calculation:
z = k*y
To do this, I'm using the dplyr package and the mutate() function, but I'm not getting the expected result. Does anyone have any tips or suggestions, even if it is necessary to use another package?
library(dplyr)
Category <- c("a", "b")
k <- c(1, 2)
# Data frame with the constants for each category
Constant <- data.frame(Category, k)
x <- seq(0,5,1)
df <- expand.grid(x = x,
Category = Category)
# Data frame with the experimental results
df$y <- seq(1,12,1)
# Failed attempt to calculate z separated by categories
df %>%
group_by(Category) %>%
mutate(z = Constant*y)
With dplyr you can do the following:
library(dplyr)
left_join(df, Constant, by = c("Category")) %>%
mutate(z = k * y) %>%
select(-k)
I did this:
a = c()
for(i in unique(df$Category)){
a = c(a,df[df$Category==i,"y"]*Constant[Constant$Category==i,'k'])
}
df$z=a
result:
x Category y z
1 0 a 1 1
2 1 a 2 2
3 2 a 3 3
4 3 a 4 4
5 4 a 5 5
6 5 a 6 6
7 0 b 7 14
8 1 b 8 16
9 2 b 9 18
10 3 b 10 20
11 4 b 11 22
12 5 b 12 24
I don't know if this is what you're looking for. Just keep in mind that it only works if your df is sorted by the category column.
If you don't like the for loop, here is an lapply version:
df$z =unlist( lapply(unique(df$Category), function(i){return(df[df$Category==i,"y"]*Constant[Constant$Category==i,'k'])}))
If the data isn't sorted by category:
df$z=unlist(lapply(1:nrow(df),function(i){ return(df[i,"y"]*Constant[Constant$Category==df[i,"Category"],'k'])}))
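The row-by-row lookup can probably also be written as a single vectorised step with base R's match(). This is only a sketch: it assumes each category appears exactly once in Constant, and idx is a helper name made up for the example.

```r
Category <- c("a", "b")
Constant <- data.frame(Category, k = c(1, 2))

df <- expand.grid(x = 0:5, Category = Category)
df$y <- 1:12

# For each row of df, find the row of Constant holding the same category.
# This does not depend on df being sorted by Category.
idx <- match(df$Category, Constant$Category)

# Look up k per row and multiply.
df$z <- df$y * Constant$k[idx]
df
```

A category that is missing from Constant would give NA in idx, and therefore NA in z, rather than an error.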
I have a question about using dplyr inside a for loop to create multiple data frames. The code without the loop works fine, but the version with the for loop fails with an error instead of giving the expected result.
The error message was:
"Error in UseMethod ("select_") : no applicable method for 'select_'
applied to an object of class "character"
Can anyone point me in the right direction?
The code below worked
B <- data %>% select (column1) %>% group_by (column1) %>% arrange (column1) %>% summarise (n = n ())
The code below did not work
column_list <- c ('column1', 'column2', 'column3')
for (b in column_list) {
a <- data %>% select (b) %>% group_by (b) %>% arrange (b) %>% summarise (n = n () )
assign (paste0(b), a)
}
Don't use assign; use lists instead.
We can use the _at variants in dplyr, which work with character variables.
library(dplyr)
split_fun <- function(df, col) {
df %>% group_by_at(col) %>% summarise(n = n()) %>% arrange_at(col)
}
and then use lapply/map to apply it to different columns
purrr::map(column_list, ~split_fun(data, .))
This will return a list of data frames, which can be accessed individually using [[ if needed.
An example using mtcars:
df <- mtcars
column_list <- c ('cyl', 'gear', 'carb')
purrr::map(column_list, ~split_fun(df, .))
#[[1]]
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
#[[2]]
# A tibble: 3 x 2
# gear n
# <dbl> <int>
#1 3 15
#2 4 12
#3 5 5
#[[3]]
# A tibble: 6 x 2
# carb n
# <dbl> <int>
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1
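If you want to pull results out by column name rather than by position, one option is to name the list as you build it. The sketch below also swaps group_by_at for across(all_of()), which assumes dplyr 1.0 or later; count_by and counts are names invented for this example.

```r
library(dplyr)
library(purrr)

column_list <- c("cyl", "gear", "carb")

# Count rows per distinct value of the column named by `col` (a string).
count_by <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(n = n(), .groups = "drop")
}

# set_names() names each element after itself, so the result is a named list.
counts <- map(set_names(column_list), function(col) count_by(mtcars, col))

counts$cyl
```

This gives the same by-name access that assign() was being used for, without creating loose objects in the global environment.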
I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).
EDIT: this was subsequently resolved by adding the (now-deprecated) group_indices() back in dplyr 0.4.0
a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3...
e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on.
How to do this with one mutate(), without a three-step summarize-and-self-join?
dplyr has a neat function n(), but that gives the number of elements within its group, not the overall number of the group. In data.table this would simply be called .GRP.
b) Actually, what I really want is to assign a string/character label ('A','B',...).
But numbering groups by integers is good-enough, because I can then use integer_to_label(i) as below. Unless there's a clever way to merge these two? But don't sweat this part.
set.seed(1234)
# Helper fn for mapping integer 1..26 to character label
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",i,i) }
df <- tibble::as_tibble(data.frame(u=sample.int(3,10,replace=T), v=sample.int(4,10,replace=T)))
# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u,v) %>% mutate(label = n()) # WRONG: n() is number of element within its group, not overall number of group
u v
1 2 3
2 1 3
3 1 2
4 2 3
5 1 2
6 3 3
7 1 3
8 1 2
9 3 1
10 3 4
KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()) , then self-join
dplyr has a group_indices() function that you can use like this:
df %>%
mutate(label = group_indices(., u, v)) %>%
group_by(label) ...
Another approach using data.table would be
require(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]
which results in:
u v label
1: 2 1 1
2: 1 3 2
3: 2 1 1
4: 3 4 3
5: 3 1 4
6: 1 1 5
7: 3 2 6
8: 2 3 7
9: 3 2 6
10: 3 4 3
In dplyr 1.0.0 and later, cur_group_id() replaces the older group_indices() for numbering groups inside mutate().
Call it on the grouped data.frame:
df %>%
group_by(u, v) %>%
mutate(label = cur_group_id())
# A tibble: 10 x 3
# Groups: u, v [6]
u v label
<int> <int> <int>
1 2 2 4
2 2 2 4
3 1 3 2
4 3 2 6
5 1 4 3
6 1 2 1
7 2 2 4
8 2 4 5
9 3 2 6
10 2 4 5
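For part b) of the question, the integer id can be turned into a letter in the same mutate() by indexing R's built-in LETTERS vector. A small sketch on made-up data (not the seeded sample from the question), which only works for up to 26 groups:

```r
library(dplyr)

# Small illustrative data set.
df <- data.frame(u = c(2, 1, 1, 2, 3), v = c(3, 3, 2, 3, 1))

labelled <- df %>%
  group_by(u, v) %>%
  mutate(
    id = cur_group_id(),  # integer id, following the sorted order of (u, v)
    label = LETTERS[id]   # 1 -> "A", 2 -> "B", ...
  ) %>%
  ungroup()
labelled
```

Note that the ids follow the sorted group keys, not the order in which groups first appear, so (1,2) gets "A" here even though it is the third row.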
Updated answer
get_group_number = function(){
i = 0
function(){
i <<- i+1
i
}
}
group_number = get_group_number()
df %>% group_by(u,v) %>% mutate(label = group_number())
You can also consider the following slightly unreadable version
group_number = (function(){i = 0; function() i <<- i+1 })()
df %>% group_by(u,v) %>% mutate(label = group_number())
Using the iterators package:
library(iterators)
counter = icount()
df %>% group_by(u,v) %>% mutate(label = nextElem(counter))
Updating my answer with three different ways:
A) A neat non-dplyr solution using interaction(u,v):
> df$label <- factor(interaction(df$u,df$v, drop=T))
[1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4
> match(df$label, levels(df$label)[ rank(unique(df$label)) ] )
[1] 1 2 3 4 5 4 6 6 7 7
B) Making Randy's neat fast-and-dirty generator-function answer more compact:
get_next_integer = function(){
i = 0
function(u,v){ i <<- i+1 }
}
get_integer = get_next_integer()
df %>% group_by(u,v) %>% mutate(label = get_integer())
C) Also here is a one-liner using a generator function abusing a global variable assignment from this:
i <- 0
generate_integer <- function() { return(assign('i', i+1, envir = .GlobalEnv)) }
df %>% group_by(u,v) %>% mutate(label = generate_integer())
rm(i)
I don't have enough reputation for a comment, so I'm posting an answer instead.
The solution using factor() is a good one, but it has the disadvantage that group numbers are assigned after factor() alphabetizes its levels. The same behaviour happens with dplyr's group_indices(). Perhaps you would like the group numbers to be assigned from 1 to n based on the current group order. In which case, you can use:
my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
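The same first-appearance numbering can be sketched for a two-column grouping like (u, v) with base R's match(). The data and the pasted key are made up for illustration, and the "_" separator assumes the values themselves contain no underscore.

```r
# Small illustrative data set.
df <- data.frame(u = c(2, 1, 1, 2, 3), v = c(3, 3, 2, 3, 1))

# Build one key per row from the two grouping columns.
key <- paste(df$u, df$v, sep = "_")

# unique() keeps keys in order of first appearance, so match() numbers
# the groups 1..n in the order they occur in the data.
df$label <- match(key, unique(key))
df
```

Here (2,3) gets label 1 because it appears first, whereas sorted-key numbering would have given it 3.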