Let's say I have a tibble that looks like this:
library(tidyverse)
tib <- tibble (a = 1:3, b = 4:6, d = -1:1)
I want to add a column to this tibble where each entry is a function with parameters a, b and d (like f(x) = ax^2 + bx + d).
This would mean that, for example, in the first row I would like to add the function f(x) = 1x^2 + 4x - 1, and so on.
I tried the following:
tib2 <- tib %>%
mutate(fun = list(function(x) {a*x^2+b*x+d}))
This does not work since the functions do not know what a, b and d are.
I managed to build a workaround using the function mapply:
lf <- mapply(function(a,b,d){return(function(x){a*x^2 + b*x + d})}, tib$a, tib$b, tib$d)
tib3 <- tib %>%
add_column(lf)
I was wondering if anyone knows a more elegant way of doing this within the tidyverse. It feels like there is a way using the map function from the purrr package, but I did not manage to get it working.
Thank you
When you used mutate in your example, you gave it a list containing a single function, so that one function was recycled for every row. Also, inside the function definition, a, b and d are not visible as the values of the current row.
You can instead use pmap so that each row has its own function.
tib2 <- tib %>%
mutate(
fun = pmap(
list(a, b, d),
~function(x) ..1 * x^2 + ..2 * x + ..3))
tib2
#> # A tibble: 3 x 4
#> a b d fun
#> <int> <int> <int> <list>
#> 1 1 4 -1 <fun>
#> 2 2 5 0 <fun>
#> 3 3 6 1 <fun>
tib2$fun[[1]](1)
#> [1] 4
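The same idea can be sketched with a named function factory instead of the ..1/..2/..3 shorthand. The helper name make_quadratic is hypothetical (it is not part of purrr or the original answer), and the force() calls are a defensive assumption so that each closure captures its own row's values:

```r
library(dplyr)
library(purrr)

tib <- tibble(a = 1:3, b = 4:6, d = -1:1)

# Hypothetical helper: returns f(x) = a*x^2 + b*x + d for fixed coefficients.
make_quadratic <- function(a, b, d) {
  # Evaluate the arguments now so the closure keeps this row's values.
  force(a); force(b); force(d)
  function(x) a * x^2 + b * x + d
}

# pmap() passes the i-th elements of a, b and d positionally to make_quadratic().
tib_fun <- tib %>%
  mutate(fun = pmap(list(a, b, d), make_quadratic))

# Evaluate every stored function at x = 1.
values_at_1 <- map_dbl(tib_fun$fun, ~ .x(1))
values_at_1
#> [1]  4  7 10
```

Naming the arguments makes the function body read like the formula, which may be easier to maintain when there are more than three parameters.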
I am relatively new to R and I have been facing issues using dplyr inside functions. I have scrounged the forum, looked at all similar issues but I am unable to resolve my issue. I have tried to simplify my issue with the following example
df <- tibble(
g1 = c(1, 2, 3, 4, 5),
a = sample(5),
b = sample(5)
)
I want to write a function to calculate the sum of a and b as follows:
sum <- function(df, group_var, a, b) {
group_var <- enquo(group_var)
a <- enquo(a)
b <- enquo(b)
df.temp<- df %>%
group_by(g1) %>%
mutate(
sum = !!a + !!b
)
return(df.temp)
}
and I can call the function with this line:
df2 <- sum(df, g1, a, b)
My issue is that I do not want to hard-code the column names in the function call, since the column names "g1", "a" and "b" are likely to change. Hence, I have the column names assigned from a config file (config.yml) to variables.
But when I use the variables, I run into multiple issues. Can someone guide me here please? For all column name references, I would ideally like to use variables. For example, I run into issues with this code:
A.Key <- "a"
B.Key <- "b"
df2 <- sum(df, g1, A.Key, B.Key)
Thanks in advance and sorry if it has been answered before; I could not find it.
sum1 <- function(df, group_var, x, y) {
  group_var <- enquo(group_var)
  # x and y arrive as strings; convert them to symbols so they can be unquoted
  x <- as.name(x)
  y <- as.name(y)
  df.temp <- df %>%
    group_by(!!group_var) %>%
    mutate(
      sum = !!x + !!y
    )
  return(df.temp)
}
sum1(df, g1, A.Key, B.Key)
# A tibble: 5 x 4
# Groups: g1 [5]
g1 a b sum
<dbl> <int> <int> <int>
1 1. 3 2 5
2 2. 2 1 3
3 3. 1 3 4
4 4. 4 4 8
5 5. 5 5 10
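As an alternative to converting strings with as.name(), a minimal sketch using the .data pronoun and across(all_of()) is shown below, assuming every column name (including the grouping column) arrives as a string, for example from the config file. The function name sum_cols and the fixed sample values are illustrative assumptions, not part of the original answer:

```r
library(dplyr)

# Fixed values instead of sample(5) so the result is reproducible.
df <- tibble(
  g1 = c(1, 2, 3, 4, 5),
  a  = 1:5,
  b  = 5:1
)

# Hypothetical helper: all column names are passed as character strings.
sum_cols <- function(df, group_col, x_col, y_col) {
  df %>%
    group_by(across(all_of(group_col))) %>%            # group by a column named in a string
    mutate(sum = .data[[x_col]] + .data[[y_col]]) %>%  # look up columns by name
    ungroup()
}

G.Key <- "g1"
A.Key <- "a"
B.Key <- "b"
res <- sum_cols(df, G.Key, A.Key, B.Key)
res$sum
#> [1] 6 6 6 6 6
```

With this approach no quoting or unquoting is needed inside the function, at the cost of no longer accepting bare column names.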
I have a data.frame that looks like
df <- data.frame(P1 = c("ATG","GTA","GGG","GGG"), P2 = c("TGG","GAT","GGG","GCG"))
I want to convert each DNA codon to an amino-acid using the below function (but any translate option is viable), and output an identical data.frame but with single letter amino-acids rather than codons:
library(Biostrings)
library(seqinr)
translate_R <- function(x)
{
translate(s2c(as.character(x)))
}
It works for individual elements of the data.frame
> translate_R(df[1,1])
[1] "M"
But trying to apply this to the whole data.frame isn't working. What am I missing? I don't understand why there is an error, as googling how to do this suggests it should work. Missing something fundamental I guess.
> df[] <- lapply(df, translate_R)
Error in seq.default(from = frame + 1, to = frame + l, by = 3) :
wrong sign in 'by' argument
In addition: Warning message:
In s2c(as.character(x)) :
Error in seq.default(from = frame + 1, to = frame + l, by = 3) :
wrong sign in 'by' argument
Your translate_R function is expecting a single value, but it's getting a vector. You can fix this by passing in individual values.
In other words, iterate over columns of df with an outer apply and then over values in each column with an inner apply.
Here's how to do it with base R:
data.frame(lapply(df, function(x) sapply(x, translate_R)))
And here's a tidyverse version with map:
library(tidyverse)
df %>% mutate(across(everything(), ~map_chr(.x, translate_R)))
In both cases, the output is:
P1 P2
1 M W
2 V D
3 G G
4 G A
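The outer/inner iteration can be tried without Biostrings or seqinr by swapping in a stand-in for translate_R. In this sketch, translate_one and codon_table are hypothetical, and the lookup table deliberately covers only the six codons in the example rather than the full genetic code:

```r
# Partial codon table, for illustration only (just the codons used above).
codon_table <- c(ATG = "M", GTA = "V", GGG = "G",
                 TGG = "W", GAT = "D", GCG = "A")

# Stand-in for translate_R(): like the original, it expects exactly one codon.
translate_one <- function(x) {
  x <- as.character(x)
  stopifnot(length(x) == 1)
  unname(codon_table[x])
}

df <- data.frame(P1 = c("ATG", "GTA", "GGG", "GGG"),
                 P2 = c("TGG", "GAT", "GGG", "GCG"))

aa <- df
# Outer lapply() walks the columns; inner vapply() walks the values in each column.
aa[] <- lapply(df, function(col) {
  vapply(col, translate_one, character(1), USE.NAMES = FALSE)
})
aa
#>   P1 P2
#> 1  M  W
#> 2  V  D
#> 3  G  G
#> 4  G  A
```

Using vapply() with character(1) rather than sapply() makes the inner step fail loudly if a translation ever returns something other than a single string.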
Another potential tidyverse solution is to use dplyr's rowwise() function:
library(tidyverse)
library(Biostrings)
library(seqinr)
translate_R <- function(x) {
translate(s2c(as.character(x)))
}
df <- data.frame(P1 = c("ATG","GTA","GGG","GGG"), P2 = c("TGG","GAT","GGG","GCG"))
df %>%
rowwise() %>%
mutate(across(everything(), ~ translate_R(.x)))
#> # A tibble: 4 x 2
#> # Rowwise:
#> P1 P2
#> <chr> <chr>
#> 1 M W
#> 2 V D
#> 3 G G
#> 4 G A
Created on 2021-07-21 by the reprex package (v2.0.0)
Inside of dplyr::summarise, how can I apply filters based on different rows than the one I'm summarising?
Example:
t = data.frame(
x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
y = c(1,2,3,4,5,6,7,8,9,10,11,12),
z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)
t %>%
dplyr::group_by(x) %>%
dplyr::summarise(
mall = mean(y), # this should include all rows in each group
ma = mean(y), # this should only include rows where z == 1
mb = mean(y) # this should only include rows where z == 2
)
So, the problem here is to apply a summary function to one column, while filtering based on another, all within summarise.
One idea was double-grouping, so applying group_by on both x and z, but I don't want all summary columns to be based on double-grouping, some (like mall in the example above) should be based on single-grouping only.
One quick option would be to use ifelse to keep only the rows you need, set the rest to missing, and use the na.rm = TRUE argument to ignore the missing values, as in the example below.
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(
    mall = mean(y), # this should include all rows in each group
    ma = mean(ifelse(z == 1, y, NA), na.rm = TRUE), # this should only include rows where z == 1
    mb = mean(ifelse(z == 2, y, NA), na.rm = TRUE) # this should only include rows where z == 2
  )
# A tibble: 3 x 4
x mall ma mb
<dbl> <dbl> <dbl> <dbl>
1 1 2.5 2 3
2 2 6.5 6 7
3 3 10.5 10 11
While the answer by @Colin H is certainly the way to go for this specific example, a more flexible way to approach this could be to work within the subsets of the first grouping operation. This could be implemented with dplyr::group_split plus a subsequent purrr::map_dfr, but there is also dplyr::group_modify to do this in one step.
Note this relevant sentence from the documentation of dplyr::group_modify:
Use group_modify() when summarize() is too limited, in terms of what you need to do and return for each group.
So here is a solution for the example provided above:
t = data.frame(
x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
y = c(1,2,3,4,5,6,7,8,9,10,11,12),
z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)
t %>%
dplyr::group_by(x) %>%
dplyr::group_modify(function(x, ...) {
x %>% dplyr::mutate(
mall = mean(y)
) %>%
dplyr::group_by(z, mall) %>%
dplyr::summarise(
m = mean(y),
.groups = "drop"
)
}) %>%
dplyr::ungroup()
# A tibble: 6 x 4
x z mall m
<dbl> <dbl> <dbl> <dbl>
1 1 1 2.5 2
2 1 2 2.5 3
3 2 1 6.5 6
4 2 2 6.5 7
5 3 1 10.5 10
6 3 2 10.5 11
group_modify applies a function to each subset tibble after grouping by x. This function has two arguments:
The subset of the data for the group, exposed as .x.
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
Within our function here we use mutate to cover the requested mall-case first. We do not need any further grouping for that, because that is already covered by the wrapping group_modify. Then we apply another group_by + summarise to cover the different iterations of z. Note that this solution is independent of the number of cases in z we want to consider. While the two cases in this example can be easily handled manually, this might change if there are more.
If the wide output format with individual columns for the cases in z is required, then you can further modify the output of my code with tidyr::pivot_wider.
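A minimal sketch of that reshaping step, starting from the long result printed above (typed in by hand here so the example is self-contained). The names_prefix value "m_z" is an arbitrary choice for the new column names:

```r
library(dplyr)
library(tidyr)

# The long result from the group_modify() answer, copied from its printed output.
long <- tibble(
  x    = c(1, 1, 2, 2, 3, 3),
  z    = c(1, 2, 1, 2, 1, 2),
  mall = c(2.5, 2.5, 6.5, 6.5, 10.5, 10.5),
  m    = c(2, 3, 6, 7, 10, 11)
)

# One column per value of z; x and mall identify the rows.
wide <- long %>%
  pivot_wider(names_from = z, values_from = m, names_prefix = "m_z")

names(wide)
#> [1] "x"    "mall" "m_z1" "m_z2"
```

Because the new column names are derived from the values of z, this scales to any number of cases without editing the code.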
Another option and perhaps a little more concise is via subsetting:
t %>%
group_by(x) %>%
summarise(mall = mean(y),
ma = mean(y[z == 1]),
mb = mean(y[z == 2]))
# A tibble: 3 x 4
x mall ma mb
* <dbl> <dbl> <dbl> <dbl>
1 1 2.5 2 3
2 2 6.5 6 7
3 3 10.5 10 11
Here is another generic way (just like group_modify) to perform custom filtering on a group's data while summarising. This uses one of dplyr's context-dependent expressions, cur_data(), which makes the current group's data available inside dplyr verbs like mutate and summarise. (Note that cur_data() has been deprecated since dplyr 1.1.0 in favour of pick().)
t %>%
dplyr::group_by(x) %>%
dplyr::summarize(
mall = mean(y),
ma = mean(cur_data() %>% as.data.frame() %>% filter(z == 1) %>% pull(y)),
mb = mean(cur_data() %>% as.data.frame() %>% filter(z == 2) %>% pull(y))
)
The benefit of using cur_data() is that you can perform any complex filtering or munging before returning the final summary. For more information refer to: https://dplyr.tidyverse.org/reference/context.html
I would like to use a custom function within dplyr's function summarise(), as follows:
library(dplyr)
# Define custom function for calculating standard error
se <- function(x) sd(x) / sqrt(length(x))
# Create a dummy data table with two groups
d <- tibble(gp = sample(c("A", "B"), 20, replace = T),
x = ifelse(gp == "A", rnorm(20), rnorm(20) + 1))
# Summarise data
d %>%
group_by(gp) %>%
summarise(x = mean(x),
se = se(x))
Why do I get NA values in the output rather than the correct values of standard error?
# A tibble: 2 × 3
gp x se
<chr> <dbl> <lgl>
1 A -0.4060173 NA
2 B 0.2999004 NA
I'm aware of some possible alternatives. For example, using the base package:
tapply(d$x, d$gp, se)
But I don't understand why the first version gives the result that it does.
summarize evaluates each expression in turn, so when your first line does
x = mean(x)
The x column (within each group) is replaced by a single value, mean(x). Your next line calls sd on that constant x, and the sd of a single value is NA.
As @joran says in the comments, if you just choose a different name for your mean column, everything will work.
d %>%
group_by(gp) %>%
summarise(avg = mean(x),
se = se(x))
# # A tibble: 2 × 3
# gp avg se
# <chr> <dbl> <dbl>
# 1 A -0.2879016 0.2264810
# 2 B 0.8804859 0.2625018
Note that this sequential evaluation is a well-considered feature of dplyr. The practical difference between dplyr::mutate and base::transform is exactly that.
dd = data.frame(x = 1:3)
base::transform(dd, x = 0, y = x * 2)
# x y
# 1 0 2
# 2 0 4
# 3 0 6
dplyr::mutate(dd, x = 0, y = x * 2)
# x y
# 1 0 0
# 2 0 0
# 3 0 0
This is called out in the Introduction to dplyr vignette:
dplyr::mutate() works the same way as plyr::mutate() and similarly to base::transform(). The key difference between mutate() and transform() is that mutate allows you to refer to columns that you’ve just created.
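The sequential evaluation described above can be checked on a small deterministic data set. This is only a sketch: the data values are made up for illustration, and it also shows that computing se before x is overwritten is another way to avoid the NA, besides renaming the mean column:

```r
library(dplyr)

se <- function(x) sd(x) / sqrt(length(x))

# Fixed data instead of rnorm() so the results are reproducible.
d <- tibble(gp = rep(c("A", "B"), each = 4),
            x  = c(1, 2, 3, 4, 2, 4, 6, 8))

# x is overwritten by its mean first, so se() sees a single value per group.
bad <- d %>%
  group_by(gp) %>%
  summarise(x = mean(x), se = se(x))

# Computing se first uses the original column; x is replaced afterwards.
good <- d %>%
  group_by(gp) %>%
  summarise(se = se(x), x = mean(x))

bad$se   # NA for both groups
good$se  # non-missing for both groups
```

Renaming the mean column, as in the answer above, is usually the clearer fix, because the result then does not depend on the order of the arguments.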