Function with dplyr and multiple statements


I am having trouble combining several dplyr functions inside one function, despite using the underscore (standard-evaluation) variants.
Example
library(dplyr)
# Data:
mydf <- data.frame(
var1 = factor(rep(1:24, each = 100)),
var2 = runif(2400, min = -10, max = 125),
var3 = runif(2400, min = 0, max = 2500),
var4 = runif(2400, min = - 10, max = 25)
)
# The function I want to build:
fx.average <- function(df, varlist) {
# select some variables from a data frame
df <- df %>% dplyr::select_(.dots = varlist)
# Group by a variable and then just calculate the mean
df <-df %>% dplyr::group_by_(var1) %>% # added df here
dplyr::summarise_each_(funs_(mean(., na.rm = TRUE)))
}
Now I test the function as follows:
# Test function, Setup var-list
varlist0 <- c("var1", "var2", "var3")
fx.average(mydf, varlist0)
# Error in dplyr::group_by_(var1) : object 'var1' not found
# object 'var1' not found
# Manual example
mydf %>% dplyr::select(var1, var2, var3) %>%
group_by(var1) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
I am not sure what goes wrong. From other questions, it seems this should be solved by adding an underscore to the function names, since those variants are built for use inside functions?

The OP's code has a few mistakes: the data is not specified in the group_by step, group_by_ is given a bare name (var1) instead of a quoted string, and funs_ is combined with summarise_each_ where plain summarise_each and funs work.
fx.average <- function(df, varlist) {
df %>%
dplyr::select_(.dots = varlist) %>%
dplyr::group_by_(.dots = "var1") %>%
dplyr::summarise_each(funs(mean(., na.rm = TRUE)))
}
fx.average(mydf, varlist0)
# A tibble: 24 × 3
# var1 var2 var3
# <fctr> <dbl> <dbl>
#1 1 55.13601 1141.021
#2 2 59.16508 1155.226
#3 3 59.64524 1245.043
#4 4 60.12310 1284.808
#5 5 57.65874 1221.771
#6 6 58.86611 1266.026
#7 7 66.13987 1303.927
#8 8 54.21595 1303.638
#9 9 63.84230 1280.380
#10 10 49.15238 1236.456
# ... with 14 more rows
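The underscore verbs and summarise_each() used above have since been deprecated. As a hedged sketch of how the same function might look with current dplyr (assuming dplyr 1.0.0 or later), across() and all_of() accept the column names as strings. The name fx_average and the group_var argument are illustrative choices, not part of the original answer.

```r
library(dplyr)

set.seed(1)
mydf <- data.frame(
  var1 = factor(rep(1:24, each = 100)),
  var2 = runif(2400, min = -10, max = 125),
  var3 = runif(2400, min = 0, max = 2500),
  var4 = runif(2400, min = -10, max = 25)
)

# Hypothetical rewrite: columns and grouping variable are passed as strings
fx_average <- function(df, varlist, group_var = "var1") {
  df %>%
    select(all_of(varlist)) %>%                # keep only the requested columns
    group_by(across(all_of(group_var))) %>%    # group by a column given as a string
    summarise(across(everything(), ~ mean(.x, na.rm = TRUE)),
              .groups = "drop")                # everything() skips grouping columns
}

varlist0 <- c("var1", "var2", "var3")
res <- fx_average(mydf, varlist0)
res
```

The grouping column must be part of varlist, otherwise select() drops it before group_by() runs.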

Related

Using group_by() function on multiple data frames?

I have data that were collected from a year but are broken up by months. For my code, I labeled them df1-df12 for each corresponding month. I am trying to group these data using the group_by function to group all the dataframes similarly. When I do the following code- it works fine alone:
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
However, I would like to streamline this code so that I can use this function for all 12 dataframes without having to copy/paste 12 times, since there is a lot of data to go through. Here is what I have tried to do to that end:
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
}
yr19<-c(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
map(yr19, func1)
However, I get the following error message:
Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "character"
As stated above, I don't get this error if I process each data frame individually, but there are many months and many years to analyze, and doing this manually is not feasible. Thanks for your help.

There are two ways you can approach this. The first uses the approach suggested by @ktiu:
## Create example data
library(dplyr) # for pipe and group_by()
set.seed(914)
df1 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
df2 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
Modify your function so that it returns the result explicitly:
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
df
}
## And using list rather than c to combine data frames.
yr19 <- list(df1, df2)
yr19_data <- lapply(yr19, func1)
# This will return a list of data frames you can access with `yr19_data[[1]]`
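Why list() works where c() fails can be checked directly. The small data frames below are made up for illustration (the date column is assumed to be character, which would match the error message in the question): c() concatenates data frames column by column into one flat list, so each element handed to the function is a single column rather than a data frame.

```r
# Two tiny, made-up data frames with a character date column
df1 <- data.frame(date = c("2019-01-01", "2019-01-02"), id = 1:2,
                  stringsAsFactors = FALSE)
df2 <- data.frame(date = c("2019-02-01", "2019-02-02"), id = 3:4,
                  stringsAsFactors = FALSE)

flattened <- c(df1, df2)   # flat list of columns: date, id, date, id
length(flattened)          # 4
class(flattened[[1]])      # "character"

kept <- list(df1, df2)     # list of two data frames
length(kept)               # 2
class(kept[[1]])           # "data.frame"
```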
An alternative approach is to add a variable identifying the source data frame, then combine everything into a single data frame and manipulate it from there. Which approach makes more sense will depend on what else you want to do later.
func2 <- function(df.name){
mutate(get(df.name), source = df.name)
}
# This is set up to get objects given their names, so we'll use a character vector
# of names to iterate off of.
yr19 = c("df1", "df2")
df.list <- lapply(yr19, func2)
df.long <- do.call(bind_rows, df.list)
df.long
# # A tibble: 100 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 27 9 9.31 df1
# 2 5 3 16.5 df1
# 3 28 3 2.67 df1
# 4 24 4 8.94 df1
# 5 13 3 1.68 df1
At this point you can manipulate one data frame in your original pipe:
df <- df.long %>%
group_by(source, date,id) %>%
slice(n()) %>%
ungroup()
df
# # A tibble: 93 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 1 8 9.89 df1
# 2 2 4 10.9 df1
# 3 4 3 8.45 df1
# 4 5 3 16.5 df1
# 5 5 7 10.6 df1
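A possible shortcut for the second approach, sketched here under the assumption that the data frames are collected in a named list: bind_rows() can add the source column itself through its .id argument, which avoids get(). The names df_long and last_rows are illustrative.

```r
library(dplyr)

set.seed(914)
df1 <- tibble(
  date = sample(1:30, 50, replace = TRUE),
  id = sample(1:10, 50, replace = TRUE),
  var1 = rnorm(50, mean = 10, sd = 3)
)
df2 <- tibble(
  date = sample(1:30, 50, replace = TRUE),
  id = sample(1:10, 50, replace = TRUE),
  var1 = rnorm(50, mean = 10, sd = 3)
)

# The list names become the values of the new "source" column
yr19 <- list(df1 = df1, df2 = df2)
df_long <- bind_rows(yr19, .id = "source")

# Keep the last row of every source/date/id combination
last_rows <- df_long %>%
  group_by(source, date, id) %>%
  slice(n()) %>%
  ungroup()
```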

Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]

Try this :
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean, .names = "{.col}mean"))
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
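For comparison, here is a hedged sketch of the tidy-eval version the question was attempting. It assumes the original error came from the missing %>% after group_by(time) and from wrapping the columns in var(); with bare column names and the pipe restored, the unquote-splice works. The .named = TRUE argument and the "mean" suffix are choices made for this illustration.

```r
library(dplyr)

x <- data.frame(time = c(1, 4), var1 = c(2, 5), var2 = c(3, 6))

grouped_mean3 <- function(.data, ...) {
  # Capture the bare column names; .named = TRUE labels each one by its name
  summary_vars <- rlang::enquos(..., .named = TRUE)
  # Wrap each captured column in a call to mean()
  summary_exprs <- lapply(summary_vars, function(v) {
    rlang::expr(mean(!!v, na.rm = TRUE))
  })
  names(summary_exprs) <- paste0(names(summary_exprs), "mean")
  .data %>%
    group_by(time) %>%            # note the pipe after group_by()
    summarise(!!!summary_exprs)   # unquote-splice the named expressions
}

res <- grouped_mean3(x, var1, var2)
res
```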

Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(.,na.rm=T)) %>%
rename_at(vars(starts_with('var')), ~ paste0(., "mean")) %>%
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6

Refer a column by variable name

Sample data
dat <-
data.frame(Sim.Y1 = rnorm(10), Sim.Y2 = rnorm(10),
Sim.Y3 = rnorm(10), obsY = rnorm(10),
ID = sample(1:10, 10), ID_s = rep(1:2, each = 5))
For the following vector, I want to calculate the mean across ID_s
simVec <- c('Sim.Y1.cor','Sim.Y2.cor')
for(s in simVec){
simRef <- simVec[s]
simID <- unlist(strsplit(simRef, split = '.cor',fixed = T))[1]
# this works
dat %>% dplyr::group_by(ID_s) %>%
dplyr::summarise(meanMod = mean(Sim.Y1))
# this doesn't work
dat %>% dplyr::group_by(ID_s) %>%
dplyr::summarise(meanMod = mean(!!(simID)))
}
How do I refer a column in dplyr not by its explicit name?

Note that your particular task can be performed without any non-standard evaluation by using summarize_at(), which works directly with strings:
simIDs <- stringr::str_split(simVec, ".cor") %>% purrr::map_chr(1)
# [1] "Sim.Y1" "Sim.Y2"
dat %>% dplyr::group_by(ID_s) %>% dplyr::summarise_at(simIDs, mean)
# # A tibble: 2 x 3
# ID_s Sim.Y1 Sim.Y2
# <int> <dbl> <dbl>
# 1 1 0.494 -0.0522
# 2 2 -0.104 -0.370
A custom suffix can also be supplied through the named list:
dat %>% dplyr::group_by(ID_s) %>% dplyr::summarise_at(simIDs, list(m=mean))
# # A tibble: 2 x 3
# ID_s Sim.Y1_m Sim.Y2_m <--- Note the _m suffix
# <int> <dbl> <dbl>
# 1 1 0.494 -0.0522
# 2 2 -0.104 -0.370

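First, you have to use seq_along() if you want to index your vector with s; iterating over simVec directly makes s a string, so simVec[s] returns NA.
Second, you are missing sym(), which converts the string into a symbol that !! can unquote.
This should work: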
simVec <- c('Sim.Y1.cor','Sim.Y3.cor')
for(s in seq_along(simVec)){
simRef <- simVec[s]
simID <- unlist(strsplit(simRef, split = '.cor',fixed = T))[1]
# this works
dat %>% dplyr::group_by(ID_s) %>%
dplyr::summarise(meanMod = mean(Sim.Y1))
# this now works, because sym() turns the string into a column symbol
dat %>% dplyr::group_by(ID_s) %>%
dplyr::summarise(meanMod = mean(!!sym(simID)))
}
edit: no Typo
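Another option, sketched here as an alternative rather than as the answerer's method, is the .data pronoun: .data[[simID]] looks up a column by the string stored in simID, so no sym() or !! is needed. The data is trimmed to the columns used, and the results list is an illustrative addition so the summaries are kept instead of being discarded inside the loop.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(Sim.Y1 = rnorm(10), Sim.Y2 = rnorm(10),
                  ID_s = rep(1:2, each = 5))

simVec <- c("Sim.Y1.cor", "Sim.Y2.cor")

# One summary table per entry of simVec
results <- lapply(simVec, function(simRef) {
  simID <- sub(".cor", "", simRef, fixed = TRUE)   # "Sim.Y1.cor" -> "Sim.Y1"
  dat %>%
    group_by(ID_s) %>%
    summarise(meanMod = mean(.data[[simID]]))      # column looked up by string
})
names(results) <- simVec
results
```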

Try this
library(dplyr)
dat %>% group_by(ID_s) %>%
summarise(mean_y1 =mean(Sim.Y1),
mean_y2 =mean(Sim.Y2),
mean_y3 =mean(Sim.Y3),
mean_obsY = mean(obsY))

I understand the question to be, how do you get a column without referencing the column name, i.e. using the index instead.
Let me know if my understanding is incorrect.
If not, I believe the easiest way would be as per below.
> df1 <- data.frame(ID_s=c('a','b','c'),Val=c('a1','b1','c1'))
> df1
ID_s Val
1 a a1
2 b b1
3 c c1
> df1[,1]
[1] a b c
Levels: a b c
If you want to save that as a dataframe, can be extended as per below:
cc <- data.frame(ID_s=df1[,1])
Hope this helps!

Is there a helper function to make this code cleaner on a tibble?

I need to sum sequences generated from the values of one column. I have done it this way:
test <- tibble::tibble(
x = c(1,2,3)
)
test %>% dplyr::mutate(., s = plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))}))
Is there a cleaner way to do this? Is there a helper function or construct that would let me reduce this:
plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))})
I am looking for a generic solution; sum and seq are only an example here. Maybe the real problem is that I want to apply the function to each element, not to the whole vector.
This is my real case:
test <- tibble::tibble(
x = c(1,2,3),
y = c(0.5,1,1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>% mutate(., s = (function(x, y) {
dn <- dnorm(x = d, mean = x, sd = y)
s <- sum(dn)
s
})(x,y))
test %>% plyr::ddply(., c("x","y"), .fun = function(row) {
dn <- dnorm(x = d, mean = row$x, sd = row$y)
s <- sum(dn)
s
})
I would like to do that with mutate, row by row rather than in a vectorized way.

For this specific example, where x is 1, 2, 3, the result coincides with a direct application of cumsum
test %>%
mutate(s = cumsum(x))
For the general case, apply the function to each element of the column with map_dbl from purrr
test %>%
mutate(s = purrr::map_dbl(x, ~ sum(seq(.x))))
# A tibble: 3 x 2
# x s
# <dbl> <dbl>
#1 1 1
#2 2 3
#3 3 6
Update
For the updated dataset, use map2_dbl (from purrr, which must be loaded), because dnorm takes corresponding arguments from the 'x' and 'y' columns of the dataset
test %>%
mutate(V1 = map2_dbl(x, y, ~ dnorm(d, mean = .x, sd = .y) %>%
sum))
# A tibble: 3 x 3
# x y V1
# <dbl> <dbl> <dbl>
#1 1 0.5 1.56
#2 2 1 0.929
#3 3 1.5 0.470
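If loading purrr is not desired, dplyr's rowwise() is another way to make mutate() work one row at a time. This is a sketch of that alternative, not the answerer's method; inside a rowwise data frame, x and y refer to the single values of the current row.

```r
library(dplyr)

test <- tibble(
  x = c(1, 2, 3),
  y = c(0.5, 1, 1.5)
)
d <- c(1.23, 0.99, 2.18)

res <- test %>%
  rowwise() %>%                                   # evaluate mutate() per row
  mutate(
    s  = sum(seq(x)),                             # x is a single value here
    V1 = sum(dnorm(d, mean = x, sd = y))          # row-specific mean and sd
  ) %>%
  ungroup()                                       # drop the rowwise grouping
res
```

rowwise() tends to be slower than map2_dbl() on large data, but it keeps the expression close to how it would be written for one row.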

R - Passing column name through ellipsis in R

I have a dataframe that looks like this
df = data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5), race = rep(1:2, 5))
I'm trying to write a function that takes on a dataframe as a first argument together with any number of arguments that represent column names in that dataframe and use these column names to perform operations on the dataframe. My function would look like this:
library(dplyr)
myFunction <- function(df, ...){
columns <- list(...)
for (i in 1:length(columns)){
var <- enquo(columns[[i]])
df <- df %>% group_by(!!var)
}
df2 = summarise(df, mean = mean(wt))
return(df2)
}
I call the function as the following
myFunction(df, race, gender)
However, I get the following error message:
Error in myFunction(df, race, gender) : object 'race' not found

We can capture the elements in ... as quosures and then unquote-splice them (!!!) into group_by
myFunction <- function(dat, ...){
columns <- quos(...) # convert to quosures
dat %>%
group_by(!!! columns) %>% # unquote-splice the quosures
summarise(mean = mean(wt))
}
myFunction(df, race, gender)
# A tibble: 2 x 3
# Groups: race [?]
# race gender mean
# <int> <int> <dbl>
#1 1 1 75
#2 2 2 76
myFunction(df, race)
# A tibble: 2 x 2
# race mean
# <int> <dbl>
#1 1 75
#2 2 76
NOTE: In the OP's example, 'race' and 'gender' are identical columns.
If we change one of them, we will see the difference:
df <- data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5),
race = rep(1:2, each = 5))
myFunction(df, race, gender)
myFunction(df, race)
myFunction(df, gender)
If we decide to pass the arguments as quoted strings, then we can make use of group_by_at
myFunction2 <- function(df, ...) {
columns <- c(...)
df %>%
group_by_at(columns) %>%
summarise(mean= mean(wt))
}
myFunction2(df, "race", "gender")
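With more recent dplyr releases (assumed here: 1.0.0 or later), both variants can be written more briefly: bare names in ... can be forwarded straight to group_by(), and group_by_at() can be replaced by across(all_of()). The function names myFunction3 and myFunction4 are made up for this sketch.

```r
library(dplyr)

df <- data.frame(id = 1:10, wt = 71:80, gender = rep(1:2, 5),
                 race = rep(1:2, each = 5))

# Bare column names: the dots can be passed on unchanged
myFunction3 <- function(df, ...) {
  df %>%
    group_by(...) %>%
    summarise(mean = mean(wt), .groups = "drop")
}

# Column names as a character vector
myFunction4 <- function(df, columns) {
  df %>%
    group_by(across(all_of(columns))) %>%
    summarise(mean = mean(wt), .groups = "drop")
}

by_name   <- myFunction3(df, race, gender)
by_string <- myFunction4(df, c("race", "gender"))
by_race   <- myFunction3(df, race)
```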
