I have just started learning programming and I have a question that is probably easy for you.
I have a dataset that looks something like this
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y = rnorm(9), x1 = LETTERS[seq( from = 1, to = 9 )], x2 = c(0,0,0,0,1,0,1,1,1),c2 = rnorm(9))
df
# id time y x1 x2 c2
# 1 1 1 0.6364831 A 0 -0.066480473
# 2 1 2 0.4476390 B 0 0.161372575
# 3 1 3 1.5113458 C 0 0.343956178
# 4 2 1 0.3532957 D 0 0.279987147
# 5 2 2 0.3401402 E 1 -0.462635393
# 6 2 3 -0.3160222 F 0 0.338454940
# 7 3 1 -1.3797158 G 1 -0.621169576
# 8 3 2 1.4026640 H 1 -0.005690801
# 9 3 3 0.2958363 I 1 -0.176488132
I am writing a function with multiple steps. I would like to feed the function two elements: the dataset and the variable of interest.
However, the function breaks when I try to dcast the data, as it fails to identify the variable. The crucial step of the function looks something like this.
testfun<-function(df,var)
{
newdf <- dcast(dataset,id+time~ x1, value.var = var) %>% # note this should be the variable of interest that i feed into the function
distinct()
return(newdf)
}
df2<-testfun(df,y)
Can anyone help me and explain how I can create a function that takes both a dataset and a variable as arguments?
Thank you in advance for your help
If you pass the column name as a string, the function works almost as written (it only needs to refer to df, the argument, rather than dataset):
library(tidyverse)
library(data.table)
testfun1<-function(df,var) {
newdf <- dcast(df,id+time~ x1, value.var = var) %>% distinct()
return(newdf)
}
testfun1(df, "y")
However, if you want to pass an unquoted variable as input, you can use deparse(substitute()):
testfun2<-function(df,var) {
var1 <- deparse(substitute(var))
newdf <- dcast(df,id+time~ x1, value.var = var1) %>% distinct()
return(newdf)
}
testfun2(df, y)
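To see what deparse(substitute(var)) does on its own, here is a minimal base R sketch. The helper arg_name is hypothetical and exists only for illustration; it returns the text of whatever expression the caller typed, which is also why a name that is already quoted comes back with its quote marks included.

```r
# Hypothetical helper: returns the caller's expression as text.
arg_name <- function(var) {
  deparse(substitute(var))
}

bare   <- arg_name(y)    # the symbol y becomes the string "y"
quoted <- arg_name("y")  # a string literal keeps its quote marks

print(bare)
print(quoted)
```

A function built this way, such as testfun2, therefore expects the bare name; for a quoted name, the string version testfun1 is the better fit.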
The equivalent tidyr function mentioned by @Konrad Rudolph is pivot_wider, which works with both types of input.
testfun3 <- function(df, var) {
  # {{ var }} forwards the caller's column (bare name or string) to pivot_wider
  new_df <- pivot_wider(df, names_from = x1, values_from = {{ var }})
  return(new_df)
}
testfun3(df, y)
testfun3(df, "y")
I am relatively new to R and I have been facing issues using dplyr inside functions. I have scrounged the forum, looked at all similar issues but I am unable to resolve my issue. I have tried to simplify my issue with the following example
df <- tibble(
g1 = c(1, 2, 3, 4, 5),
a = sample(5),
b = sample(5)
)
I want to write a function to calculate the sum of a and b as follows:
sum <- function(df, group_var, a, b) {
group_var <- enquo(group_var)
a <- enquo(a)
b <- enquo(b)
df.temp<- df %>%
group_by(g1) %>%
mutate(
sum = !!a + !!b
)
return(df.temp)
}
and I can call the function thru this line:
df2 <- sum(df, g1, a, b)
My issue is that I do not want to hard-code the column names in the function call, since the column names "g1", "a" and "b" are likely to change. Hence, I have the column names assigned from a config file (config.yml) to variables.
But when I use the variables, I run into multiple issues. Can someone guide me here, please? Ideally, I would like to use variables for all column name references. For example, I run into issues with this code:
A.Key <- "a"
B.Key <- "b"
df2 <- sum(df, g1, A.Key, B.Key)
Thanks in advance and sorry if it has been answered before; I could not find it.
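sum1 <- function(df, group_var, x, y) {
  group_var <- enquo(group_var)  # capture the unquoted grouping column
  x <- as.name(x)                # turn the string held in x into a symbol
  y <- as.name(y)
  df.temp <- df %>%
    group_by(!!group_var) %>%
    mutate(
      sum = !!x + !!y            # unquote the symbols so they refer to columns
    )
  return(df.temp)
}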
sum1(df, g1, A.Key, B.Key)
# # A tibble: 5 x 4
# # Groups: g1 [5]
#      g1     a     b   sum
#   <dbl> <int> <int> <int>
# 1    1.     3     2     5
# 2    2.     2     1     3
# 3    3.     1     3     4
# 4    4.     4     4     8
# 5    5.     5     5    10
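If every column name arrives as a string (for example, read from config.yml), another option is dplyr's .data pronoun together with all_of(). The following is only a sketch under that assumption; add_sum and its argument names are made up for illustration, and the small tibble uses fixed values rather than sample() so the result is predictable.

```r
library(dplyr)

# Hypothetical helper: every column name is passed as a character string.
add_sum <- function(data, group_col, x_col, y_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%           # group by the named column
    mutate(sum = .data[[x_col]] + .data[[y_col]]) %>% # look columns up by name
    ungroup()
}

df <- tibble(g1 = c(1, 2, 3), a = c(3, 2, 1), b = c(2, 1, 3))
A.Key <- "a"
B.Key <- "b"

res <- add_sum(df, "g1", A.Key, B.Key)
print(res)
```

Because nothing is captured with enquo() here, the same call works whether the names are typed inline or held in variables such as A.Key and B.Key.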
I have a simple function in R to modify a dataframe
monthly_fun <- function(x){
x %>%
mutate(obstime = convert_dates(obstime)) %>%
select(obstime, x = obsvalue)
}
When applying the function to a dataframe df, i.e. monthly_fun(df), I would like df to be the new name of obsvalue. In my current code, the name is obviously "x". How can I modify the select part to use the name of the supplied dataframe as the variable name instead?
Thanks a lot
EDIT: I want to apply this function to several dataframes using
result <- list( df1, df2, df3) %>%
lapply( monthly_fun )
You could extract the name of the input with deparse(substitute(x)) and use !!y := obsvalue in mutate().
monthly_fun <- function(x) {
  y <- deparse(substitute(x))  # name of the object passed in, as a string
  x %>%
    mutate(obstime = convert_dates(obstime),
           !!y := obsvalue) %>%
    select(obstime, all_of(y)) # all_of() marks y as holding a column name
}
A simplified example:
fun <- function(x) {
  y <- deparse(substitute(x))
  x %>%
    mutate(!!y := 1) %>%
    select(all_of(y))
}
fun(df)
# df
# 1 1
# 2 1
# 3 1
# 4 1
# 5 1
Update
If you want to apply it to several data frames stored in a list, you should design a 2-argument function, one argument for data and the other for new column names. Then use Map() to apply this function over each pair of data and names.
fun <- function(x, y) {
  x %>%
    mutate(!!y := 1) %>%
    select(all_of(y))
}
Map(fun, list(df1, df2), c("name1", "name2"))
# [[1]]
# name1
# 1 1
# 2 1
# 3 1
# 4 1
# 5 1
#
# [[2]]
# name2
# 1 1
# 2 1
# 3 1
# 4 1
# 5 1
If you're familiar with purrr, Map() can be replaced with map2() or imap(). (Note the difference in the inputs to the two functions.)
library(purrr)
# (1) map2(): Input data and names separately
map2(list(df1, df2), c("name1", "name2"), fun)
# (2) imap(): Input a named list
imap(list(name1 = df1, name2 = df2), fun)
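The reason a second argument is needed can be seen in a small base R sketch. The names get_name and rename_obs are hypothetical helpers for illustration only. Inside lapply(), the function is called on an internal expression rather than on df1 or df2, so deparse(substitute(x)) no longer sees the original object names.

```r
# Hypothetical helper: reports the expression it was called with.
get_name <- function(x) deparse(substitute(x))

df1 <- data.frame(obsvalue = 1:2)
df2 <- data.frame(obsvalue = 3:4)

direct   <- get_name(df1)                          # called directly: "df1"
via_list <- unlist(lapply(list(df1, df2), get_name))  # names are lost here

# Passing the names explicitly avoids the problem.
rename_obs <- function(d, nm) {
  names(d)[names(d) == "obsvalue"] <- nm
  d
}
dfs <- list(df1 = df1, df2 = df2)
renamed <- Map(rename_obs, dfs, names(dfs))

print(direct)
print(via_list)
print(lapply(renamed, names))
```

Keeping the data frames in a named list means the names travel with the data, which is the same idea imap() relies on.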
Following the suggestion by Julien, you can create a variable with deparse(substitute(x)) and rename the column using it.
monthly_fun <- function(x) {
y = deparse(substitute(x))
x <- x %>%
mutate(obstime = obstime*5) %>%
select(obstime, obsvalue)
names(x)[names(x) == "obsvalue"] <- y
return(x)
}
see this site for more naming methods.
I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
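Try this:
library(dplyr)
grouped_mean3 <- function(.data, ...) {
  vars <- c(...)  # column names passed as strings
  .data %>%
    group_by(time) %>%
    # .names appends "mean" to the name of each summarised column
    summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE), .names = "{.col}mean"))
}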
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
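The question also asks for the standard error. One way to get both statistics in a single pass is to give across() a named list of functions. This is a sketch, not the answer's own code: the helper names se_fun and grouped_mean_se are hypothetical, the example data are made up with three rows per time point so that a standard deviation exists, and the standard error is assumed to mean sd / sqrt(n) over the non-missing values.

```r
library(dplyr)

# Made-up data: three observations per time point.
x <- data.frame(
  time = c(1, 1, 1, 2, 2, 2),
  var1 = c(1, 2, 3, 4, 5, 6),
  var2 = c(2, 4, 6, 1, 1, 1)
)

# Assumed definition of the standard error: sd / sqrt(n), ignoring NAs.
se_fun <- function(v) sd(v, na.rm = TRUE) / sqrt(sum(!is.na(v)))

# Hypothetical helper: vars is a character vector of column names.
grouped_mean_se <- function(data, vars) {
  data %>%
    group_by(time) %>%
    summarise(across(all_of(vars),
                     list(mean = ~ mean(.x, na.rm = TRUE), se = se_fun),
                     .names = "{.col}{.fn}"))
}

res <- grouped_mean_se(x, c("var1", "var2"))
print(res)
```

The .names pattern controls the output column names, so a different naming scheme only requires changing that one string.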
Perhaps this is what you are looking for?
x %>%
  group_by(time) %>%
  summarise_at(vars(starts_with('var')), ~ mean(., na.rm = TRUE)) %>%
  rename_at(vars(starts_with('var')), ~ paste0(., "mean")) %>%  # paste0 avoids a space in the new names
  merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6
I am kind of new to R and programming in general. I am currently struggling with a piece of code for data transformation and hope someone can take a little bit of time to help me.
Below is a reproducible example:
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
Goal: compute all values (a, b) relative to a reference value. The calculation should be a/a_ref, where a_ref is the value of a when f2 = 0 within each family (f1 can be X, Y or Z).
I tried to solve this with the following code:
test <- filter(dt, f2!=0) %>% group_by(f1) %>%
mutate("a/a_ref"=a/(filter(dt, f2==0) %>% group_by(f1) %>% distinct(a) %>% pull))
I get:
[image: test results]
As you can see, a is divided by a_ref. But my script seems to recycle the reference values (a_ref) regardless of the family f1.
Do you have any suggestions so that a is computed with regard to the family (f1)?
Thank you for reading !
EDIT
I found a way to do it 'manually':
filter(dt, f1=="X") %>% mutate("a/a_ref"=a/(filter(dt, f1=="X" & f2==0) %>% distinct(a) %>% pull()))
f1 f2 a b a/a_ref
1 X 0 21.77605 24.53115 1.0000000
2 X 1 20.17327 24.02512 0.9263973
3 X 50 19.81482 25.58103 0.9099366
4 X 100 19.90205 24.66322 0.9139422
The problem is that I'd have to update the code for each variable and family, so it is not a clean way to do it.
# use this to reproduce the same dataset and results
set.seed(5)
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
dt %>%
group_by(f1) %>% # for each f1 value
mutate(a_ref = a[f2 == 0], # get the a_ref and add it in each row
"a/a_ref" = a/a_ref) %>% # divide a and a_ref
ungroup() %>% # forget the grouping
filter(f2 != 0) # remove rows where f2 == 0
# # A tibble: 9 x 6
# f1 f2 a b a_ref `a/a_ref`
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 X 1 21.38436 24.84247 19.15914 1.1161437
# 2 X 50 18.74451 23.92824 19.15914 0.9783583
# 3 X 100 20.07014 24.86101 19.15914 1.0475490
# 4 Y 1 19.39709 22.81603 21.71144 0.8934042
# 5 Y 50 19.52783 25.24082 21.71144 0.8994260
# 6 Y 100 19.36463 24.74064 21.71144 0.8919090
# 7 Z 1 20.13811 25.94187 19.71423 1.0215013
# 8 Z 50 21.22763 26.46796 19.71423 1.0767671
# 9 Z 100 19.19822 25.70676 19.71423 0.9738257
You can do this for more than one variable using:
dt %>%
  group_by(f1) %>%
  mutate_at(vars(a:b), ~ . / .[f2 == 0]) %>%
  ungroup()
More generally, use vars(a:z) to select all variables between a and z, as long as they are adjacent in your dataset.
Another solution is mutate_if:
dt %>%
  group_by(f1) %>%
  mutate_if(is.numeric, ~ . / .[f2 == 0]) %>%
  ungroup()
Here the function is applied to all numeric variables. The variables f1 and f2 are factors, so they are excluded.
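In current dplyr the scoped verbs (mutate_at, mutate_if) are superseded by across(), so the same idea can also be written as below. This is a sketch on a small made-up dataset with fixed values, chosen so the ratios are easy to check by hand; it assumes, as in the question, that each family has exactly one row with f2 == 0.

```r
library(dplyr)

# Made-up data: two families, one reference row (f2 == 0) in each.
dt <- data.frame(
  f1 = factor(rep(c("X", "Y"), each = 3)),
  f2 = factor(rep(c(0, 1, 50), 2)),
  a  = c(2, 4, 6, 10, 5, 20),
  b  = c(1, 2, 3, 4, 8, 2)
)

ratios <- dt %>%
  group_by(f1) %>%
  # divide every numeric column by its value in the group's reference row
  mutate(across(where(is.numeric), ~ .x / .x[f2 == 0])) %>%
  ungroup() %>%
  filter(f2 != 0)  # drop the reference rows, which are all 1 by construction

print(ratios)
```

Unlike the first version of the answer, this overwrites a and b with their ratios instead of adding new columns; adding a .names argument to across() would keep the originals.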