ggplot2 visualization looks off when code is called via a function [duplicate] - r

I am relatively new to R and I have been facing issues using dplyr inside functions. I have scrounged the forum, looked at all similar issues but I am unable to resolve my issue. I have tried to simplify my issue with the following example
df <- tibble(
g1 = c(1, 2, 3, 4, 5),
a = sample(5),
b = sample(5)
)
I want to write a function to calculate the sum of a and b as follows:
sum <- function(df, group_var, a, b) {
group_var <- enquo(group_var)
a <- enquo(a)
b <- enquo(b)
df.temp<- df %>%
group_by(g1) %>%
mutate(
sum = !!a + !!b
)
return(df.temp)
}
and I can call the function thru this line:
df2 <- sum(df, g1, a, b)
My issue is that I do not want to hard code the columns names in function call since the columns names "g1", "a" and "b" are likely to change. and hence, I have the columns names assigned from a config file (config.yml) to a variable.
But when I use the variables, I run into multiple issues. Can someone guide me here please? For all column name references, I would ideally like to use variables. for e.g. I run into issues here in this code:
A.Key <- "a"
B.Key <- "b"
df2 <- sum(df, g1, A.Key, B.Key)
Thanks in advance and sorry if it has been answered before; I could not find it.

sum1 <- function(df, group_var,x,y) {
group_var <- enquo(group_var)
x = as.name(x)
y = as.name(y)
df.temp<- df %>%
group_by(!!group_var) %>%
mutate(
sum = !!enquo(x)+!!enquo(y)
)
return(df.temp)
}
sum1(df, g1, A.Key, B.Key)
# A tibble: 5 x 4
# Groups: g1 [5]
g1 a b sum
<dbl> <int> <int> <int>
1 1. 3 2 5
2 2. 2 1 3
3 3. 1 3 4
4 4. 4 4 8
5 5. 5 5 10

Related

Function to concatenate rows of data with an additional source column in R (and tidyverse)

I want to be able to take data that's in the same format but from different sources and concatenate the rows, but to keep track of the source of the data I'd like to also introduce a source column.
This seems routine enough that I thought I'd create a utility function to do it, but I'm having trouble getting it to work.
Here's what I tried:
library(tidyverse)
tibble1 = tribble(
~a, ~b,
1,2,
3,4
)
tibble2 = tribble(
~a, ~b,
5,6
)
bind_rows_with_source <- function(...){
out = tibble()
for (newtibb in list(...)){
out <- bind_rows(out, newtibb %>% mutate(source = deparse(substitute(newtibb))))
}
return(out)
}
bind_rows_with_source(tibble1,tibble2)
#source column just contains the string 'newtibb' on all rows
#I want it to contain tibble1 for the first two rows and tibble2 for the third:
#~a, ~b, ~source
# 1, 2, tibble1
# 3, 4, tibble1
# 5, 6, tibble2
Is there already a function that could achieve this?
Is there a better approach than the utility function I tried to create?
Is there a way to correct my approach?
Sincere thanks for reading my question
This could be done as:
bind_rows(list(tibble1=tibble1, tibble2=tibble2), .id='source')
# A tibble: 3 x 3
source a b
<chr> <dbl> <dbl>
1 tibble1 1 2
2 tibble1 3 4
3 tibble2 5 6
If you refer not inputing names:
bind_rows_with_source <- function(..., .id = 'source'){
bind_rows(setNames(list(...), as.character(substitute(...()))), .id = .id)
}
bind_rows_with_source(tibble1,tibble2)
# A tibble: 3 x 3
source a b
<chr> <dbl> <dbl>
1 tibble1 1 2
2 tibble1 3 4
3 tibble2 5 6
We could use lazyeval package: An alternative approach to non-standard evaluation using formulas. Provides a full implementation of LISP style 'quasiquotation',making it easier to generate code with other code.
https://cran.r-project.org/web/packages/lazyeval/lazyeval.pdf
library(lazyeval)
my_function <- function(df) {
df <- df %>% mutate(ref = expr_label(df))
return(df)
}
a <- my_function(tibble1)
b <- my_function(tibble2)
bind_rows(a, b)
Output:
a b ref
<dbl> <dbl> <chr>
1 1 2 `tibble1`
2 3 4 `tibble1`
3 5 6 `tibble2`
Another option is rbindlist
library(data.table)
rbindlist(list(tibble1, tibble2), idcol = 'source')
If you really want your function-signature to be with ..., you can use
bind_rows_with_source <- function(...){
tibbleNames <- as.character(unlist(as.list(match.call())[-1]))
tibbleList <- setNames(lapply(tibbleNames,get),tibbleNames)
sourceCol <- rep(tibbleNames,times=sapply(tibbleList,NROW))
out <- do.call("rbind",tibbleList)
out$source <- sourceCol
return(out)
}
or if you can use dplyr
bind_rows_with_source <- function(...){
tibbleNames <- as.character(unlist(as.list(match.call())[-1]))
tibbleList <- setNames(lapply(tibbleNames,get),tibbleNames)
dplyr::bind_rows(tibbleList, .id='source')
}

Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this :
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean))
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(.,na.rm=T)) %>%
rename_at(vars(starts_with('var')),funs(paste(.,"mean"))) %>%
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6

Function indexing dataset and variable of interest

I am just started learning programming and I have a question that is probably easy for you.
I have a dataset that looks something like this
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y = rnorm(9), x1 = LETTERS[seq( from = 1, to = 9 )], x2 = c(0,0,0,0,1,0,1,1,1),c2 = rnorm(9))
df
# id time y x1 x2 c2
# 1 1 1 0.6364831 A 0 -0.066480473
# 2 1 2 0.4476390 B 0 0.161372575
# 3 1 3 1.5113458 C 0 0.343956178
# 4 2 1 0.3532957 D 0 0.279987147
# 5 2 2 0.3401402 E 1 -0.462635393
# 6 2 3 -0.3160222 F 0 0.338454940
# 7 3 1 -1.3797158 G 1 -0.621169576
# 8 3 2 1.4026640 H 1 -0.005690801
# 9 3 3 0.2958363 I 1 -0.176488132
I am writing a function with multiple steps. I would like the feed the function with two elements the dataset and the variable of interest.
However, the function breaks down when I try to dcast it as it fails to individuate the variable. The crucial step of the function looks something like this.
testfun<-function(df,var)
{
newdf <- dcast(dataset,id+time~ x1, value.var = var) %>% # note this should be the variable of interest that i feed into the function
distinct()
return(newdf)
}
df2<-testfun(df,y)
Can anyone help me and explain how can I create a function where I index both a dataset and a function?
Thank you in advance for your help
If you pass column name as a string the function would work as it is
library(tidyverse)
library(data.table)
testfun1<-function(df,var) {
newdf <- dcast(df,id+time~ x1, value.var = var) %>% distinct()
return(newdf)
}
testfun1(df, "y")
However, if you want to pass unquoted variable as input you can use
testfun2<-function(df,var) {
var1 <- deparse(substitute(var))
newdf <- dcast(df,id+time~ x1, value.var = var1) %>% distinct()
return(newdf)
}
testfun2(df, y)
The equivalent tidyr function mentioned by #Konrad Rudolph is pivot_wider which would work with both types of inputs.
testfun3 <-function(df,var) {
new_df <- pivot_wider(df, names_from = x1, values_from = y)
return(new_df)
}
testfun3(df, y)
testfun3(df, "y")

Using Variable names for dplyr inside function

I am relatively new to R and I have been facing issues using dplyr inside functions. I have scrounged the forum, looked at all similar issues but I am unable to resolve my issue. I have tried to simplify my issue with the following example
df <- tibble(
g1 = c(1, 2, 3, 4, 5),
a = sample(5),
b = sample(5)
)
I want to write a function to calculate the sum of a and b as follows:
sum <- function(df, group_var, a, b) {
group_var <- enquo(group_var)
a <- enquo(a)
b <- enquo(b)
df.temp<- df %>%
group_by(g1) %>%
mutate(
sum = !!a + !!b
)
return(df.temp)
}
and I can call the function thru this line:
df2 <- sum(df, g1, a, b)
My issue is that I do not want to hard code the columns names in function call since the columns names "g1", "a" and "b" are likely to change. and hence, I have the columns names assigned from a config file (config.yml) to a variable.
But when I use the variables, I run into multiple issues. Can someone guide me here please? For all column name references, I would ideally like to use variables. for e.g. I run into issues here in this code:
A.Key <- "a"
B.Key <- "b"
df2 <- sum(df, g1, A.Key, B.Key)
Thanks in advance and sorry if it has been answered before; I could not find it.
sum1 <- function(df, group_var,x,y) {
group_var <- enquo(group_var)
x = as.name(x)
y = as.name(y)
df.temp<- df %>%
group_by(!!group_var) %>%
mutate(
sum = !!enquo(x)+!!enquo(y)
)
return(df.temp)
}
sum1(df, g1, A.Key, B.Key)
# A tibble: 5 x 4
# Groups: g1 [5]
g1 a b sum
<dbl> <int> <int> <int>
1 1. 3 2 5
2 2. 2 1 3
3 3. 1 3 4
4 4. 4 4 8
5 5. 5 5 10

Compute variable according to factor levels

I am kind of new to R and programming in general. I am currently strugling with a piece of code for data transformation and hope someone can take a little bit of time to help me.
Below a reproducible exemple :
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
Goal : Compute all values (a,b) using a reference value. Calculation should be : a/a_ref with a_ref = a when f2=0 depending on the family (f1 can be X,Y or Z).
I tried to solve this by using this code :
test <- filter(dt, f2!=0) %>% group_by(f1) %>%
mutate("a/a_ref"=a/(filter(dt, f2==0) %>% group_by(f1) %>% distinct(a) %>% pull))
I get :
test results
as you can see a is divided by a_ref. But my script seems to recycle the use of reference values (a_ref) regardless of the family f1.
Do you have any suggestion so A is computed with regard of the family (f1) ?
Thank you for reading !
EDIT
I found a way to do it 'manualy'
filter(dt, f1=="X") %>% mutate("a/a_ref"=a/(filter(dt, f1=="X" & f2==0) %>% distinct(a) %>% pull()))
f1 f2 a b a/a_ref
1 X 0 21.77605 24.53115 1.0000000
2 X 1 20.17327 24.02512 0.9263973
3 X 50 19.81482 25.58103 0.9099366
4 X 100 19.90205 24.66322 0.9139422
the problem is that I'd have to update the code for each variable and family and thus is not a clean way to do it.
# use this to reproduce the same dataset and results
set.seed(5)
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
dt %>%
group_by(f1) %>% # for each f1 value
mutate(a_ref = a[f2 == 0], # get the a_ref and add it in each row
"a/a_ref" = a/a_ref) %>% # divide a and a_ref
ungroup() %>% # forget the grouping
filter(f2 != 0) # remove rows where f2 == 0
# # A tibble: 9 x 6
# f1 f2 a b a_ref `a/a_ref`
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 X 1 21.38436 24.84247 19.15914 1.1161437
# 2 X 50 18.74451 23.92824 19.15914 0.9783583
# 3 X 100 20.07014 24.86101 19.15914 1.0475490
# 4 Y 1 19.39709 22.81603 21.71144 0.8934042
# 5 Y 50 19.52783 25.24082 21.71144 0.8994260
# 6 Y 100 19.36463 24.74064 21.71144 0.8919090
# 7 Z 1 20.13811 25.94187 19.71423 1.0215013
# 8 Z 50 21.22763 26.46796 19.71423 1.0767671
# 9 Z 100 19.19822 25.70676 19.71423 0.9738257
You can do this for more than one variable using:
dt %>%
group_by(f1) %>%
mutate_at(vars(a:b), funs(./.[f2 == 0])) %>%
ungroup()
Or generally use vars(a:z) to use all variables between a and z as long as they are one after the other in your dataset.
Another solution could be using mutate_if like:
dt %>%
group_by(f1) %>%
mutate_if(is.numeric, funs(./.[f2 == 0])) %>%
ungroup()
Where the function will be applied to all numeric variables you have. The variables f1 and f2 will be factor variables, so it just excludes those ones.

Resources