I am relatively new to R and I have been facing issues using dplyr inside functions. I have searched the forum and looked at similar questions, but I am unable to resolve my issue. I have tried to simplify it with the following example:
df <- tibble(
g1 = c(1, 2, 3, 4, 5),
a = sample(5),
b = sample(5)
)
I want to write a function to calculate the sum of a and b as follows:
sum <- function(df, group_var, a, b) {
group_var <- enquo(group_var)
a <- enquo(a)
b <- enquo(b)
df.temp<- df %>%
group_by(g1) %>%
mutate(
sum = !!a + !!b
)
return(df.temp)
}
I can call the function with this line:
df2 <- sum(df, g1, a, b)
My issue is that I do not want to hard-code the column names in the function call, since the column names "g1", "a" and "b" are likely to change. Hence, I have the column names assigned from a config file (config.yml) to variables.
But when I use the variables, I run into multiple issues. Can someone guide me here, please? Ideally, I would like to use variables for all column name references. For example, I run into issues with this code:
A.Key <- "a"
B.Key <- "b"
df2 <- sum(df, g1, A.Key, B.Key)
Thanks in advance and sorry if it has been answered before; I could not find it.
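One option is to pass the column names as strings and convert them to symbols with as.name():
sum1 <- function(df, group_var, x, y) {
# group_var is passed unquoted, so capture it as a quosure
group_var <- enquo(group_var)
# x and y hold column names as strings; convert them to symbols
x <- as.name(x)
y <- as.name(y)
df.temp <- df %>%
group_by(!!group_var) %>%
mutate(
# The symbols can be unquoted directly; no enquo() is needed
sum = !!x + !!y
)
return(df.temp)
}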
sum1(df, g1, A.Key, B.Key)
# A tibble: 5 x 4
# Groups: g1 [5]
g1 a b sum
<dbl> <int> <int> <int>
1 1. 3 2 5
2 2. 2 1 3
3 3. 1 3 4
4 4. 4 4 8
5 5. 5 5 10
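If every column name, including the grouping column, comes from the config file as a string, the .data pronoun and all_of() avoid the symbol conversion altogether. The following is a minimal sketch, not the answerer's method; the function name add_sum and the fixed values in the example tibble are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical helper: all three column names arrive as strings
add_sum <- function(df, group_col, x_col, y_col) {
  df %>%
    # all_of() selects the grouping column by its string name
    group_by(across(all_of(group_col))) %>%
    # .data[[name]] looks a column up by its string name
    mutate(sum = .data[[x_col]] + .data[[y_col]])
}

# Example data with fixed values so the result is reproducible
df <- tibble(g1 = c(1, 2, 3, 4, 5), a = 1:5, b = 5:1)

G.Key <- "g1"
A.Key <- "a"
B.Key <- "b"
res <- add_sum(df, G.Key, A.Key, B.Key)
```

Because nothing is captured with enquo(), the caller can pass plain character variables such as A.Key directly.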
I want to be able to take data that's in the same format but from different sources and concatenate the rows, but to keep track of the source of the data I'd like to also introduce a source column.
This seems routine enough that I thought I'd create a utility function to do it, but I'm having trouble getting it to work.
Here's what I tried:
library(tidyverse)
tibble1 = tribble(
~a, ~b,
1,2,
3,4
)
tibble2 = tribble(
~a, ~b,
5,6
)
bind_rows_with_source <- function(...){
out = tibble()
for (newtibb in list(...)){
out <- bind_rows(out, newtibb %>% mutate(source = deparse(substitute(newtibb))))
}
return(out)
}
bind_rows_with_source(tibble1,tibble2)
#source column just contains the string 'newtibb' on all rows
#I want it to contain tibble1 for the first two rows and tibble2 for the third:
#~a, ~b, ~source
# 1, 2, tibble1
# 3, 4, tibble1
# 5, 6, tibble2
Is there already a function that could achieve this?
Is there a better approach than the utility function I tried to create?
Is there a way to correct my approach?
Sincere thanks for reading my question
This could be done as:
bind_rows(list(tibble1=tibble1, tibble2=tibble2), .id='source')
# A tibble: 3 x 3
source a b
<chr> <dbl> <dbl>
1 tibble1 1 2
2 tibble1 3 4
3 tibble2 5 6
If you prefer not to input the names:
bind_rows_with_source <- function(..., .id = 'source'){
bind_rows(setNames(list(...), as.character(substitute(...()))), .id = .id)
}
bind_rows_with_source(tibble1,tibble2)
# A tibble: 3 x 3
source a b
<chr> <dbl> <dbl>
1 tibble1 1 2
2 tibble1 3 4
3 tibble2 5 6
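A shorter variant of the same idea, sketched here under the assumption that the tibble package is available (it is loaded with the tidyverse): tibble::lst() names each list element after the expression that produced it, so the names do not have to be typed twice.

```r
library(dplyr)
library(tibble)

tibble1 <- tribble(
  ~a, ~b,
  1, 2,
  3, 4
)
tibble2 <- tribble(
  ~a, ~b,
  5, 6
)

# lst() auto-names its elements "tibble1" and "tibble2";
# bind_rows() then writes those names into the .id column
out <- bind_rows(lst(tibble1, tibble2), .id = "source")
```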
We could use the lazyeval package, an alternative approach to non-standard evaluation using formulas. It provides a full implementation of LISP-style 'quasiquotation', making it easier to generate code with other code.
https://cran.r-project.org/web/packages/lazyeval/lazyeval.pdf
library(lazyeval)
my_function <- function(df) {
df <- df %>% mutate(ref = expr_label(df))
return(df)
}
a <- my_function(tibble1)
b <- my_function(tibble2)
bind_rows(a, b)
Output:
a b ref
<dbl> <dbl> <chr>
1 1 2 `tibble1`
2 3 4 `tibble1`
3 5 6 `tibble2`
Another option is rbindlist
library(data.table)
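# Name the list elements; with an unnamed list, idcol holds integer ids (1, 2) instead of the tibble names
rbindlist(list(tibble1 = tibble1, tibble2 = tibble2), idcol = 'source')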
If you really want your function-signature to be with ..., you can use
bind_rows_with_source <- function(...){
tibbleNames <- as.character(unlist(as.list(match.call())[-1]))
tibbleList <- setNames(lapply(tibbleNames,get),tibbleNames)
sourceCol <- rep(tibbleNames,times=sapply(tibbleList,NROW))
out <- do.call("rbind",tibbleList)
out$source <- sourceCol
return(out)
}
or if you can use dplyr
bind_rows_with_source <- function(...){
tibbleNames <- as.character(unlist(as.list(match.call())[-1]))
tibbleList <- setNames(lapply(tibbleNames,get),tibbleNames)
dplyr::bind_rows(tibbleList, .id='source')
}
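The two versions above look the tibbles up by name with get(), which may not find them when the function is called from inside another function. A possible alternative, sketched here with rlang (the function name bind_rows_named is hypothetical), captures each argument together with its environment and evaluates it there.

```r
library(dplyr)
library(rlang)

# Hypothetical variant: capture each argument as a quosure
bind_rows_named <- function(..., .id = "source") {
  dots <- enquos(...)
  # as_label() turns each captured expression into a string, e.g. "tibble1"
  tbl_names <- unname(vapply(dots, as_label, character(1)))
  # eval_tidy() evaluates each quosure in the environment it came from
  tbls <- lapply(dots, eval_tidy)
  bind_rows(setNames(tbls, tbl_names), .id = .id)
}

# Works even when the tibbles only exist inside another function
wrapper <- function() {
  t1 <- tibble(a = 1)
  t2 <- tibble(a = c(2, 3))
  bind_rows_named(t1, t2)
}
out <- wrapper()
```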
I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
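Try this:
library(dplyr)
grouped_mean3 <- function(.data, ...) {
# Column names are passed as strings, e.g. 'var1', 'var2'
vars <- c(...)
.data %>%
group_by(time) %>%
# .names appends "mean" so the output columns match the names shown below
summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE), .names = "{.col}mean"))
}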
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
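The question also asks for the standard error per time point. across() accepts a named list of functions, so both statistics can be computed in one pass. This is a sketch only: the helper se and its definition (sd divided by the square root of the number of non-missing values) are assumptions, as is the example data, which has two rows per time point so that the standard error is defined.

```r
library(dplyr)

# Assumed definition: standard error of the mean = sd / sqrt(n), ignoring NAs
se <- function(v) sd(v, na.rm = TRUE) / sqrt(sum(!is.na(v)))

# Example data with two observations per time point
x <- data.frame(
  time = c(1, 1, 2, 2),
  var1 = c(1, 3, 5, 7),
  var2 = c(2, 2, 4, 8)
)

summary_tbl <- x %>%
  group_by(time) %>%
  summarise(across(
    starts_with("var"),
    list(mean = ~ mean(.x, na.rm = TRUE), se = se),
    .names = "{.col}_{.fn}"   # yields var1_mean, var1_se, ...
  ))
```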
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(., na.rm = TRUE)) %>%
rename_at(vars(starts_with('var')), ~paste0(., "mean")) %>%
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6
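As for the error in the question itself: the function there appears to be missing a %>% after group_by(time), so summarise(!!!summary_vars) runs on its own and R evaluates !!! as three ordinary negations, which would explain "Error in !summary_vars". A sketch of the tidy-eval version with the pipe restored follows; the name grouped_mean_quo is hypothetical, and the columns are passed as bare names rather than wrapped in var().

```r
library(dplyr)
library(rlang)

grouped_mean_quo <- function(data, ...) {
  # Capture the bare column names; .named = TRUE labels each with its own name
  summary_vars <- enquos(..., .named = TRUE)
  # Wrap each captured column in a mean() call
  summary_exprs <- lapply(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })
  names(summary_exprs) <- paste0(names(summary_vars), "mean")
  data %>%
    group_by(time) %>%            # this pipe is absent in the question's code
    summarise(!!!summary_exprs)   # unquote-splice the named expressions
}

x <- data.frame(time = c(1, 4), var1 = c(2, 5), var2 = c(3, 6))
result <- grouped_mean_quo(x, var1, var2)
```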
I have just started learning programming and I have a question that is probably easy for you.
I have a dataset that looks something like this
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y = rnorm(9), x1 = LETTERS[seq( from = 1, to = 9 )], x2 = c(0,0,0,0,1,0,1,1,1),c2 = rnorm(9))
df
# id time y x1 x2 c2
# 1 1 1 0.6364831 A 0 -0.066480473
# 2 1 2 0.4476390 B 0 0.161372575
# 3 1 3 1.5113458 C 0 0.343956178
# 4 2 1 0.3532957 D 0 0.279987147
# 5 2 2 0.3401402 E 1 -0.462635393
# 6 2 3 -0.3160222 F 0 0.338454940
# 7 3 1 -1.3797158 G 1 -0.621169576
# 8 3 2 1.4026640 H 1 -0.005690801
# 9 3 3 0.2958363 I 1 -0.176488132
I am writing a function with multiple steps. I would like to feed the function two elements: the dataset and the variable of interest.
However, the function breaks down when I try to dcast it, as it fails to identify the variable. The crucial step of the function looks something like this.
testfun<-function(df,var)
{
newdf <- dcast(dataset,id+time~ x1, value.var = var) %>% # note this should be the variable of interest that i feed into the function
distinct()
return(newdf)
}
df2<-testfun(df,y)
Can anyone help me and explain how I can create a function that takes both a dataset and a variable as arguments?
Thank you in advance for your help
If you pass the column name as a string, the function works almost as it is (note that dataset inside the function has to be df):
library(tidyverse)
library(data.table)
testfun1<-function(df,var) {
newdf <- dcast(df,id+time~ x1, value.var = var) %>% distinct()
return(newdf)
}
testfun1(df, "y")
However, if you want to pass an unquoted variable as input, you can use
testfun2<-function(df,var) {
var1 <- deparse(substitute(var))
newdf <- dcast(df,id+time~ x1, value.var = var1) %>% distinct()
return(newdf)
}
testfun2(df, y)
The equivalent tidyr function mentioned by @Konrad Rudolph is pivot_wider, which works with both types of input.
testfun3 <- function(df, var) {
# {{ var }} forwards the caller's column, whether given as y or "y"
new_df <- pivot_wider(df, names_from = x1, values_from = {{ var }})
return(new_df)
}
testfun3(df, y)
testfun3(df, "y")
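If you would rather stay with dcast but still accept both y and "y", one possibility is rlang::ensym(), which accepts either a bare name or a string. The sketch below assumes data.table and rlang are installed; the function name cast_by_x1 and the small example data frame are hypothetical.

```r
library(data.table)
library(rlang)

# Hypothetical helper: var may be a bare column name or a string
cast_by_x1 <- function(df, var) {
  # ensym() captures y or "y" as a symbol; as_name() turns it back into "y"
  value_col <- as_name(ensym(var))
  dcast(as.data.table(df), id + time ~ x1, value.var = value_col)
}

df <- data.frame(
  id   = c(1, 1, 2, 2),
  time = c(1, 2, 1, 2),
  y    = c(10, 20, 30, 40),
  x1   = c("A", "B", "C", "D")
)

out_bare   <- cast_by_x1(df, y)
out_string <- cast_by_x1(df, "y")
```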
I am building a tidy-compatible function for use inside dplyr's mutate where I'd like to pass a variable and also the data set I'm working with, and use information from both to build a vector.
As a basic example, imagine I want to return a string containing the mean of the variable and the number of rows in the data set (I know I could just take the length of var, ignore that, it's an example).
library(tidyverse)
library(rlang)
info <- function(var,df = get(".",envir = parent.frame())) {
paste(mean(var),nrow(df),sep=', ')
}
dat <- data.frame(a = 1:10, i = c(rep(1,5),rep(2,5)))
#Works fine, 'types' contains '5.5, 10'
dat %>% mutate(types = info(a))
Ok, great so far. But now maybe I want it to work with grouped data. var will be from just one group, but . would be the full data set. So instead I'll use rlang's .data pronoun, which is just the data being worked with.
However, .data is not like the dot (.): the dot is the data set, whereas .data is just a pronoun from which I can pull variables with .data[[varname]].
info2 <- function(var,df = get(".data",envir = parent.frame())) {
paste(mean(var),nrow(.data),sep=', ')
}
#Doesn't work. nrow(.data) gives blank strings
dat %>% group_by(i) %>% mutate(types = info2(a))
How can I get the full thing from .data? I know I didn't include it in the example but specifically I both need some stuff from attr(dat) AND some stuff from the variables in dat that is properly subsetted for the grouping, so neither reverting to . nor just pulling out variables and getting stuff from there would work.
As Alexis mentioned in the above comment, this is not possible, as it's not the intended use of .data. However, now that I've given up on doing this directly, I've worked up a kludge using a combination of . and .data.
info <- function(var,df = get(".",envir = parent.frame())) {
#First, get any information you need from .
fulldatasize <- nrow(df)
#Then, check if you actually need .data,
#i.e. the data is grouped and you need a subsample
if (length(var) < nrow(df)) {
#If you are, get the list of variables you want from .data, maybe all of them
namesiwant <- names(df)
#Get .data
datapronoun <- get('.data',envir=parent.frame())
#And remake df using just the subsample
df <- data.frame(lapply(namesiwant, function(x) datapronoun[[x]]))
names(df) <- namesiwant
}
#Now do whatever you want with the .data data
groupsize <- nrow(df)
paste(mean(var),groupsize,fulldatasize,sep=', ')
}
dat <- data.frame(a = 1:10, i = c(rep(1,5),rep(2,5)))
#types contains the within-group mean, then 5, then 10
dat %>% group_by(i) %>% mutate(types = info(a))
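A less fragile route than reaching into the parent frame may be to hand both data frames to the function explicitly. The sketch below assumes dplyr 1.1.0 or later, where pick(everything()) returns the current group's rows as a data frame inside mutate(); the function name info_explicit is hypothetical.

```r
library(dplyr)

# Hypothetical helper: the group slice and the full data are explicit arguments
info_explicit <- function(var, group_df, full_df) {
  paste(mean(var), nrow(group_df), nrow(full_df), sep = ", ")
}

dat <- data.frame(a = 1:10, i = c(rep(1, 5), rep(2, 5)))

out <- dat %>%
  group_by(i) %>%
  # pick(everything()) is the current group's data; dat is the full data set,
  # so attributes of dat remain reachable through full_df
  mutate(types = info_explicit(a, pick(everything()), dat))
```

The cost is a longer call inside mutate(), but the function no longer depends on what happens to be defined in the calling frame.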
Why not use length() instead of nrow() here?
dat <- data.frame(a = 1:10, i = c(rep(1,5),rep(2,5)))
info <- function(var) {
paste(mean(var),length(var),sep=', ')
}
dat %>% group_by(i) %>% mutate(types = info(a))
#> # A tibble: 10 x 3
#> # Groups: i [2]
#> a i types
#> <int> <dbl> <chr>
#> 1 1 1 3, 5
#> 2 2 1 3, 5
#> 3 3 1 3, 5
#> 4 4 1 3, 5
#> 5 5 1 3, 5
#> 6 6 2 8, 5
#> 7 7 2 8, 5
#> 8 8 2 8, 5
#> 9 9 2 8, 5
#> 10 10 2 8, 5