Is there a helper function to make this code cleaner on a tibble? - r

I need to sum sequences generated by one of the columns. I have done it this way:
library(dplyr) # provides the %>% pipe

test <- tibble::tibble(
  x = c(1, 2, 3)
)
test %>% dplyr::mutate(., s = plyr::aaply(x, .margins = 1, .fun = function(x_i) {sum(seq(x_i))}))
Is there a cleaner way to do this? Are there helper functions or constructs which would allow me to reduce this:
plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))})
I am looking for a generic solution; sum and seq here are only an example. Maybe the real problem is that I want to execute a function on each element rather than on the whole vector.
This is my real case:
test <- tibble::tibble(
  x = c(1, 2, 3),
  y = c(0.5, 1, 1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>% mutate(., s = (function(x, y) {
  dn <- dnorm(x = d, mean = x, sd = y)
  s <- sum(dn)
  s
})(x, y))

test %>% plyr::ddply(., c("x", "y"), .fun = function(row) {
  dn <- dnorm(x = d, mean = row$x, sd = row$y)
  s <- sum(dn)
  s
})
I would like to do that with the mutate function in a row-wise way, not a vectorized way.

For the specific example, it is a direct application of cumsum:
test %>%
  mutate(s = cumsum(x))
For generic cases, to loop through the sequence of rows, we can use map_dbl from purrr:
test %>%
  mutate(s = map_dbl(row_number(), ~ sum(seq(.x))))
# A tibble: 3 x 2
#       x     s
#   <dbl> <dbl>
# 1     1     1
# 2     2     3
# 3     3     6
Update
For the updated dataset, use map2_dbl, as we are passing corresponding elements of the 'x' and 'y' columns of the dataset as arguments to dnorm:
test %>%
  mutate(V1 = map2_dbl(x, y, ~ dnorm(d, mean = .x, sd = .y) %>%
                               sum))
# A tibble: 3 x 3
#       x     y    V1
#   <dbl> <dbl> <dbl>
# 1     1   0.5 1.56
# 2     2   1   0.929
# 3     3   1.5 0.470
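As a further option (not part of the original answer), dplyr::rowwise() makes mutate() evaluate its expressions once per row, which matches the per-element intent of the question; a minimal sketch for the dnorm case, assuming dplyr is attached:
test %>%
  rowwise() %>%
  # within each row, x and y are scalars; d is the length-3 vector defined above
  mutate(s = sum(dnorm(d, mean = x, sd = y))) %>%
  ungroup()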

Related

Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate the mean/SE per Time for each of var1, var2, ..., varN, and I want to do this programmatically for all variables, rather than one at a time, which would involve a lot of copy-pasting.
Section 8.2.3 here (https://tidyeval.tidyverse.org/dplyr.html) is close to what I want, but my code below:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1, 1] = 1
x[1, 2] = 2
x[1, 3] = 3
x[2, 1] = 4
x[2, 2] = 5
x[2, 3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"

grouped_mean3 <- function(.data, ...) {
  print(.data)
  summary_vars <- enquos(...)
  print(summary_vars)
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })
  print(summary_vars)
  .data %>%
    group_by(time)
    summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
The original cause was "Must group by variables found in .data.", and it finds a column that isn't in the dummy x I generated for testing purposes. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this:
library(dplyr)
grouped_mean3 <- function(.data, ...) {
  vars <- c(...)
  .data %>%
    group_by(time) %>%
    summarise(across(all_of(vars), mean, .names = "{.col}mean"))
}
grouped_mean3(x, 'var1')
#    time var1mean
#   <dbl>    <dbl>
# 1     1        2
# 2     4        5

grouped_mean3(x, 'var1', 'var2')
#    time var1mean var2mean
#   <dbl>    <dbl>    <dbl>
# 1     1        2        3
# 2     4        5        6
Perhaps this is what you are looking for?
x %>%
  group_by(time) %>%
  summarise_at(vars(starts_with('var')), ~ mean(., na.rm = TRUE)) %>%
  rename_at(vars(starts_with('var')), ~ paste0(., "mean")) %>%
  merge(x)
With your data (from your question), the following is the output:
  time var1mean var2mean var1 var2
1    1        2        3    2    3
2    4        5        6    5    6
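For completeness, here is a sketch of how the tidyeval approach from the question could be made to work; this is not part of the original answers, and it assumes the two culprits were the missing %>% before summarise() and passing var("var1") instead of bare column names:
library(dplyr)
library(purrr)
library(rlang)
grouped_mean3 <- function(.data, ...) {
  summary_vars <- enquos(..., .named = TRUE)
  summary_vars <- map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })
  .data %>%
    group_by(time) %>%           # the pipe here was missing in the question
    summarise(!!!summary_vars)   # unquote-splice the list of mean() calls
}
grouped_mean3(x, var1, var2)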

use pipe operator in mutate

I am wondering why the following code does not work. Is it because the pipe is not compatible with mutate?
tibble(x = c(1, 2), y = c(3, 4)) %>%
  mutate(z = {. %>% (function(tb) {tb$x + tb$y})})
I know a workaround is
tibble(x = c(1, 2), y = c(3, 4)) %>%
  mutate(z = map_depth(., .depth = 0, function(tb) {tb$x + tb$y}))
or
tibble(x = c(1, 2), y = c(3, 4)) %>%
  mutate(z = exec(function(tb) {tb$x + tb$y}, .))
This works as you are expecting:
tibble(x = c(1, 2), y = c(3, 4)) %>%
  mutate(z = {(.) %>% (function(tb) {tb$x + tb$y})})
# # A tibble: 2 x 3
#       x     y     z
#   <dbl> <dbl> <dbl>
# 1     1     3     4
# 2     2     4     6
The problem isn't the pipe, but rather that . seems to be interpreted as a function (which throws off the pipe).
Edit:
@Aramis7d provided a link to the documentation for magrittr in a comment. The relevant passage is:
Using the dot-place holder as lhs
When the dot is used as lhs, the result will be a functional sequence, i.e. a function which applies the entire chain of right-hand sides in turn to its input. See the examples.
So in your example, you were trying to assign an entire function to z within the mutate. You can see this based on the error message returned. By using (.), we force evaluation of the . and get results as expected.
tibble(x = c(1, 2), y = c(3, 4)) %>%
  mutate(z = {. %>% (function(tb) {tb$x + tb$y})})
# Error: Column `z` is of unsupported type function
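To make the "functional sequence" point concrete, here is a small illustration (the name add_cols is just for demonstration):
library(magrittr)
library(tibble)
# A pipe whose left-hand side is the dot does not evaluate anything;
# it builds a reusable function (a magrittr functional sequence).
add_cols <- . %>% (function(tb) {tb$x + tb$y})
class(add_cols)
# [1] "fseq"     "function"
add_cols(tibble(x = c(1, 2), y = c(3, 4)))
# [1] 4 6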
Interesting scenario indeed.
Without any more specific use cases, it seems like the %>% operator is not required at all, even if you want to use anonymous functions within mutate().
tibble(x = c(1, 2), y = c(3, 4)) %>%
  mutate(z = {(function(tb) {tb$x + tb$y})(.)})
returns:
# A tibble: 2 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     3     4
2     2     4     6

How to use group_by() with an empty argument, in R?

I am writing a function that computes the mean of a variable according to some grouping (g1 and g2). I would like the function to take care of the case when the user just wants to compute the mean across the groups, so the group argument will be empty.
I want a solution using the tidyverse.
Suppose the following:
y = 1:4
g1 = c('a', 'a', 'b', 'b')
g2 = c(1,2,1,2)
MyData = data.frame(g1, g2, y)
MyFun = function(group){
group_sym = syms(group)
MyData %>%
group_by(!!!group_sym) %>%
summarise(mean = mean(y))
}
# this works well
MyFun(group = c('g1', 'g2'))
Now suppose I want the mean of y across all groups. I would like the function to be able to handle something like
MyFun(group = '')
or
MyFun(group = NULL)
So ideally I would like the group argument to be empty / NULL, in which case MyData would not be grouped. One solution could be to add a condition at the beginning of the function checking whether the argument is empty and, if TRUE, calling summarise without group_by. But this is not elegant, and my real code is much longer than just a few lines.
Any idea?
1) Use {{...}} and use g1 in place of 'g1':
MyFun = function(group) {
  MyData %>%
    group_by({{group}}) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun(g1)
## # A tibble: 2 x 2
##   g1     mean
##   <fct> <dbl>
## 1 a       1.5
## 2 b       3.5

MyFun()
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1   2.5
2) This approach uses 'g1' as in the question.
MyFun = function(group) {
  group <- if (missing(group)) 'All' else sym(group)
  MyData %>%
    group_by(!!group) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun('g1')
## # A tibble: 2 x 2
##   g1     mean
##   <fct> <dbl>
## 1 a       1.5
## 2 b       3.5

MyFun()
## # A tibble: 1 x 2
##   `"All"`  mean
##   <chr>   <dbl>
## 1 All       2.5
3) This also works and gives the same output as (2).
MyFun = function(...) {
  group <- if (...length()) syms(c(...)) else 'All'
  MyData %>%
    group_by(!!!group) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun('g1')
MyFun()
A different approach consists of creating a fake group (named 'across_group') in the data for the case where group is missing.
MyFun = function(group) {
  if (missing(group)) MyData$across_group = 1
  group <- if (missing(group)) syms('across_group') else syms(group)
  MyData %>%
    group_by(!!!group) %>%
    summarise(mean = mean(y)) %>%
    ungroup
}
MyFun()
# A tibble: 1 x 2
  across_group  mean
         <dbl> <dbl>
1            1   2.5
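Another option, not from the original answers: in more recent dplyr versions (assuming dplyr >= 1.0) you can drive the grouping with a character vector of any length, including an empty one, via across(all_of(...)), so no special-casing is needed:
MyFun = function(group = character(0)) {
  MyData %>%
    group_by(across(all_of(group))) %>%   # an empty selection means no grouping
    summarise(mean = mean(y), .groups = "drop")
}
MyFun(c('g1', 'g2'))  # grouped means per g1/g2 combination
MyFun()               # single overall mean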

Passing multiple arguments to function in dplyr::summarise_if

I am trying to make a function that uses summarise_if (or summarise_at) to calculate the correlation between one column and many others in the data set.
data_set <- data.frame(grp = rep(c("a", "b", "c"), each = 3),
                       x = rnorm(9), y = rnorm(9), z = rnorm(9))
multiple_cor <- function(d, vars){
  d %>%
    dplyr::group_by(grp) %>%
    dplyr::summarise_at(vars, cor, x) %>%
    return()
}
multiple_cor(data_set, vars = c("y", "z"))
This gives the error:
Error in dots_list(...) : object 'x' not found
Called from: dots_list(...)
I am fairly sure this is from the cor function not evaluating x within the right environment, but I am not sure how to get around this issue.
summarise_at has a funs argument, so it can handle anonymous functions. I created a function called cors inside your function and passed it on to summarise_at inside the funs argument to handle the x.
multiple_cor <- function(d, vars){
  cors <- function(x, a = NULL) {
    stats::cor(x, a)
  }
  d %>%
    dplyr::group_by(grp) %>%
    dplyr::summarise_at(vars, funs(cors(x, .))) %>%
    return()
}
multiple_cor(data_set, vars = c("y", "z"))
# A tibble: 3 x 3
  grp        y      z
  <fct>  <dbl>  <dbl>
1 a      0.803  0.894
2 b     -0.284 -0.949
3 c      0.805 -0.571
The outcome of the function is identical to that of the following lines of code:
data_set %>%
  group_by(grp) %>%
  summarise(cxy = cor(x, y),
            cxz = cor(x, z))
# A tibble: 3 x 3
  grp      cxy    cxz
  <fct>  <dbl>  <dbl>
1 a      0.803  0.894
2 b     -0.284 -0.949
3 c      0.805 -0.571
Read this dplyr documentation.
And this google groups discussion.
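Note that funs() has since been deprecated in dplyr; an equivalent sketch using across() with a lambda, assuming a recent dplyr version, would be:
multiple_cor <- function(d, vars){
  d %>%
    dplyr::group_by(grp) %>%
    # .x is each selected column in turn; x refers to the data's own x column
    dplyr::summarise(dplyr::across(dplyr::all_of(vars), ~ cor(x, .x)),
                     .groups = "drop")
}
multiple_cor(data_set, vars = c("y", "z"))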

roll-up record, max of each column, group_by R

This seems fairly simple, and I have a solution, but it's kinda time consuming since I have a lot of columns. I have looked at other solutions, but they were always for something slightly different (aggregate one column, mutate all columns, etc.). In SQL I would do SELECT PAT_ID, MAX(X), MAX(Y), MAX(Z) FROM table_name GROUP BY PAT_ID.
I have a data set that looks like this (but with more columns):
dt <- data.frame(
  PAT_ID = c('P', 'P', 'P', 'A', 'A', 'A'),
  X = c(1, NA, NA, 1, NA, NA),
  Y = c(NA, 2, NA, NA, 1, NA),
  Z = c(NA, NA, 1, NA, NA, 0)
)
So I summarize and then combine the results:
results_X <- dt %>%
  group_by(PAT_ID) %>%
  summarise(X = max(X, na.rm = TRUE))
results_Y <- dt %>%
  group_by(PAT_ID) %>%
  summarise(Y = max(Y, na.rm = TRUE))
results_Z <- dt %>%
  group_by(PAT_ID) %>%
  summarise(Z = max(Z, na.rm = TRUE))
resulted <- left_join(results_X, results_Y)
resulted <- left_join(resulted, results_Z)
My output is the "roll-up" record that is the max value for each column per PAT_ID:
myresult <- data.frame(
  PAT_ID = c('P', 'A'),
  X = c(1, 1),
  Y = c(2, 1),
  Z = c(1, 0)
)
I'm sure there's a better way to do this, but how?
This can be done with summarize_all in dplyr. Here you go:
library(dplyr)
dt %>% group_by(PAT_ID) %>% summarize_all(max, na.rm = TRUE)
#   PAT_ID     X     Y     Z
#   <fctr> <dbl> <dbl> <dbl>
# 1      A     1     1     0
# 2      P     1     2     1
This can also be accomplished with base R using aggregate.
aggregate(dt[c("X", "Y", "Z")], dt["PAT_ID"], FUN = max, na.rm = TRUE)
  PAT_ID X Y Z
1      A 1 1 0
2      P 1 2 1
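As a side note, summarize_all() is superseded in current dplyr; an equivalent sketch with across(), assuming dplyr >= 1.0, would be:
library(dplyr)
dt %>%
  group_by(PAT_ID) %>%
  # take the column-wise max within each PAT_ID, ignoring NAs
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)), .groups = "drop")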
