How to use ddply + summarise in custom function - r

I'm trying to use ddply + summarise (e.g. with mean()) within a custom function. However, instead of returning the mean for each group, it returns a data frame showing the mean of all observations.
Many thanks already in advance for your help!
library(plyr)
library(dplyr)
df <- data.frame(Titanic)
colnames(df)
# ddply-summarise - Outside of function
df.OutsideOfFunction <- ddply(df, c("Class", "Sex"), summarise,
                              Mean = mean(Freq))

# new function
newFunction <- function(data, GroupVariables, ColA){
  mean(data[[ColA]])
  plyr::ddply(data, GroupVariables, summarise,
              Mean = mean(data[[ColA]]))
}

# ddply-summarise - inside of function
df.InsideOfFunction <- newFunction(data = df,
                                   GroupVariables = c("Class", "Sex"),
                                   ColA = "Freq")

It should work this way: convert the ColA input to a symbol first and then unquote it inside summarise:
# new function
newFunction <- function(data, GroupVariables, ColA){
#mean(data[[ColA]])
plyr::ddply(data, GroupVariables, summarise, Mean=mean(UQ(sym(ColA))))
}
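Calling the function with the same arguments as in the question should now give per-group means (a quick check, reusing the objects defined above):

df.InsideOfFunction <- newFunction(data = df,
                                   GroupVariables = c("Class", "Sex"),
                                   ColA = "Freq")
# should now match the per-group result computed outside the function
all.equal(df.InsideOfFunction, df.OutsideOfFunction)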
Please also take a look at this post for an explanation of why this happens. It's the first time I've seen it myself, so I'm not the best person to explain it, but it seems to depend on how summarise and other plyr or dplyr functions accept their arguments (quoted or unquoted) and how those arguments are then evaluated.
Also, since you are loading dplyr as well, you can stick to one package if you like and write your function like this:
newFunction <- function(data, GroupVariables, ColA){
  data %>%
    group_by(.dots = GroupVariables) %>%
    summarise(Mean = mean(UQ(sym(ColA))))
}
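On newer dplyr/rlang releases the .dots idiom and UQ() are considered superseded; a minimal sketch of the same function using across() and the .data pronoun (an assumption on my part, requiring dplyr >= 1.0) would be:

newFunction <- function(data, GroupVariables, ColA){
  data %>%
    group_by(across(all_of(GroupVariables))) %>%
    summarise(Mean = mean(.data[[ColA]]), .groups = "drop")
}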
Hope this helps

Related

Repeatedly mutate variable using dplyr and purrr

I'm self-taught in R and this is my first StackOverflow question. I apologize if this is an obvious issue; please be kind.
Short Version of my Question
I wrote a custom function to calculate the percent change in a variable year over year. I would like to use purrr's map_at function to apply my custom function to a vector of variable names. My custom function works when applied to a single variable, but fails when I chain it using map_at.
My custom function
calculate_delta <- function(df, col) {
  # generate the new variable name
  newcolname <- paste("d", col, sep = "")
  # get the formula for the first difference
  calculate_diff <- lazyeval::interp(~(a + lag(a))/a, a = as.name(col))
  # pass the formula to mutate, naming the new variable with the name generated above
  df %>%
    mutate_(.dots = setNames(list(calculate_diff), newcolname))
}
When I apply this function to a single variable in the mtcars dataset, the output is as expected (although obviously the meaning of the result is nonsensical).
calculate_delta(mtcars, "wt")
Attempt to Apply the Function to a Character Vector Using Purrr
I think that I'm having trouble conceptualizing how map_at passes arguments to the function. All of the example snippets I can find online use map_at with functions like is.character, which don't require additional arguments. Here are my attempts at applying the function using purrr.
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta)
This gives me this error message
Error in paste("d", col, sep = "") :
argument "col" is missing, with no default
I assume this is because map_at is passing vars as the df, and not passing an argument for col. To get around that issue, I tried the following:
vars <- c("wt", "mpg")
mtcars %>% map_at(vars, calculate_delta, df = .)
That throws me this error:
Error: unrecognised index type
I've monkeyed around with a bunch of different versions, including removing the df argument from the calculate_delta function, but I have had no luck.
Other potential solutions
1) A version of this using sapply, rather than purrr. I've tried solving the problem that way and had similar trouble. And my goal is to figure out a way to do this using purrr, if that is possible. Based on my understanding of purrr, this seems like a typical use case.
2) I can obviously think of how I would implement this using a for loop, but I'm trying to avoid that if possible for similar reasons.
Clearly I'm thinking about this wrong. Please help!
EDIT 1
To clarify, I am curious if there is a method of repeatedly transforming variables that accomplishes the following:
1) Generates new variables within the original tbl_df without replacing the columns being mutated (as is the case when using dplyr's mutate_at).
2) Automatically generates new variable labels.
3) If possible, accomplishes what I've described by applying a single function using map_at.
It may be that this is not possible, but I feel like there should be an elegant way to accomplish what I am describing.
Try simplifying the process:
delta <- function(x) (x + dplyr::lag(x)) / x
cols <- c("wt", "mpg")

# This
library(dplyr)
mtcars %>% mutate_at(cols, delta)

# Or
library(purrr)
mtcars %>% map_at(cols, delta)

# If necessary, in a function
f <- function(df, cols) {
  df %>% mutate_at(cols, delta)
}
f(iris, c("Sepal.Width", "Petal.Length"))
f(mtcars, c("wt", "mpg"))
Edit
If you would like to assign new names afterwards, we can write a custom pipe-ready function:
Rename <- function(object, old, new) {
  names(object)[names(object) %in% old] <- new
  object
}

mtcars %>%
  mutate_at(cols, delta) %>%
  Rename(cols, paste0("lagged", cols))
If you want to rename the resulting lagged variables:
mtcars %>% mutate_at(cols, funs(lagged = delta))
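Note that funs() has since been deprecated; a sketch of the same idea with across() (assuming dplyr >= 1.0) would be:

mtcars %>% mutate(across(all_of(cols), delta, .names = "{.col}_lagged"))

This should likewise keep the original columns and add wt_lagged and mpg_lagged alongside them.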

How to Split-Apply-Combine for several variables / columns in R

I'd like to perform a function on several variables, by group.
Fake data:
df <- data.frame(rnorm(100, mean = 10),
                 rnorm(100, mean = 15),
                 rnorm(100, mean = 20),
                 rep(letters[1:10], each = 10))
colnames(df) <- c("var1", "var2", "var3", "group1")
In this particular case, I'd like to mean-center each variable by group. I want to return a dataframe with the original and centered variables.
Normally I use the plyr package for this:
library(plyr)
ddply(df, "group1", transform, centered_var1= scale(var1, scale=FALSE))
However, I haven't been able to successfully loop this function, or think of another minimal-code way to do this.
I'm open to non-plyr solutions... My main criterion is keeping code to a minimum.
The colwise function may be what you're looking for.
library("plyr")
ddply(df, .(group1), colwise(scale, scale = FALSE))
Using dplyr
library(dplyr)
df %>%
  group_by(group1) %>%
  mutate_each(funs(scale(., scale = FALSE))) -> res
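mutate_each() and funs() have since been superseded in dplyr; a rough equivalent with across() (a sketch, assuming dplyr >= 1.0 and the df defined above) is:

df %>%
  group_by(group1) %>%
  mutate(across(everything(), ~ as.numeric(scale(.x, scale = FALSE)),
                .names = "centered_{.col}")) -> res

Here the .names argument keeps the original columns and adds centered_* versions, which is closer to what the question asked for.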
Is this what you want?
ddply(df, "group1", transform, centered_var1= scale(var1, scale=FALSE),
centered_var2 = scale(var2, scale=FALSE),
centered_var3 = scale(var3, scale=FALSE))

get lhs object name when piping with dplyr

I'd like to have a function that can use the pipe operator as exported from dplyr. I am not using magrittr.
df %>% my_function
How can I get df name? If I try
my_function <- function(tbl){print(deparse(substitute(tbl)))}
it returns
[1] "."
while I'd like to have
[1] "df"
Any suggestion?
Thank you in advance,
Nicola
The SO answer that JBGruber links to in the comments mostly solves the problem. It works by moving upwards through execution environments until a certain variable is found, then returns the lhs from that environment. The only thing missing is the requirement that the function outputs both the name of the original data frame and the manipulated data – I gleaned the latter requirement from one of the OP's comments. For that we just need to output a list containing these things, which we can do by modifying MrFlick's answer:
get_orig_name <- function(df){
  i <- 1
  while(!("chain_parts" %in% ls(envir = parent.frame(i))) && i < sys.nframe()) {
    i <- i + 1
  }
  list(name = deparse(parent.frame(i)$lhs), output = df)
}
Now we can add get_orig_name to the end of any pipeline to get the manipulated data and the original data frame's name in a list. We access both using $:
mtcars %>% summarize_all(mean) %>% get_orig_name
#### OUTPUT ####
$name
[1] "mtcars"
$output
mpg cyl disp hp drat wt qsec vs am gear carb
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
I should also mention that, although I think the details of this strategy are interesting, I also think it is needlessly complicated. It sounds like the OP's goal is to manipulate the data and then write it to a file with the same name as the original, unmanipulated, data frame, which can easily be done using more straightforward methods. For example, if we are dealing with multiple data frames we can just do something like the following:
df_list <- list(mtcars = mtcars, iris = iris)

for(name in names(df_list)){
  df_list[[name]] %>%
    group_by_if(is.factor) %>%
    summarise_all(mean) %>%
    write.csv(paste0(name, ".csv"))
}
Here's a hacky way of doing it, which I'm sure breaks in a ton of edge cases:
library(data.table) # for the address function
# or parse .Internal(inspect) if you feel masochistic
fn = function(tbl) {
  objs = ls(parent.env(environment()))
  objs[sapply(objs,
              function(x) address(get(x, env = parent.env(environment()))) == address(tbl))]
}
df = data.frame(a = 1:10)
df %>% fn
#[1] "df"
Although the question is an old one and the bounty has already been awarded, I would like to expand on gersht's excellent answer, which works perfectly fine for getting the left-most object name. However, integrating this functionality into a dplyr workflow is not yet solved, apart from using the approach in the very last step of a pipe.
Since I'm using dplyr a lot, I have created a group of custom wrapper functions around the common dplyr verbs which I call metadplyr (I'm still playing around with the functionality, which is why I haven't uploaded it on github yet).
In essence, those functions create a new class called meta_tbl on top of a tibble and write certain things into the attributes of that object. Applied to the OP's problem, I provide a simple example with filter, but the procedure works with any other dplyr verb as well.
In my original function family I use slightly different names than dplyr, but the approach also works when 'overwriting' the original dplyr verbs.
Below is a new filter function which turns a data frame or tibble into a meta_tbl and writes the original name of the lhs object into the attribute .name. Here I am using a short version of gersht's approach.
library(dplyr)

filter <- function(.data, ...) {
  # pass through unchanged if .data is already a meta_tbl
  .data2 <- .data
  if(!("meta_tbl" %in% class(.data))) {
    .data2 <- as_tibble(.data)
    # add the new class 'meta_tbl' to the data frame
    attr(.data2, "class") <- c(attr(.data2, "class"), "meta_tbl")
    # write the lhs's original name into the attributes
    i <- 1
    while(!("chain_parts" %in% ls(envir = parent.frame(i)))) {
      i <- i + 1
    }
    attr(.data2, ".name") <- deparse(parent.frame(i)$lhs)
  }
  dplyr::filter(.data2, ...)
}
For convenience, it is good to have a helper function that extracts the original name from the attributes easily.
.name <- function(.data) {
  if("meta_tbl" %in% class(.data)) {
    attr(.data, ".name")
  } else stop("this function only works on objects of class 'meta_tbl'")
}
Both functions can be used in a workflow in the following way:
mtcars %>%
  filter(gear == 4) %>%
  write.csv(paste0(.name(.), ".csv"))
This might be a bad example, since the pipe doesn't continue, but in theory we could keep piping, with the original name still available to further function calls.
Inspired by the link mentioned by gersht
You can go back 5 generations to get the name
df %>% {parent.frame(5)$lhs}
An example:
library(dplyr)
a <- 1
df1 <- data.frame(a = 1:10)
df2 <- data.frame(a = 1:10)
a %>% {parent.frame(5)$lhs}
df1 %>% {parent.frame(5)$lhs}
df2 %>% {parent.frame(5)$lhs}
I don't believe this is possible without adding an extra argument to my_function. When chaining functions with dplyr it automatically converts the df to a tbl_df object, hence the new name "." within the dplyr scope to make the piping simpler.
The following is a very hacky way with dplyr which just adds an additional argument to return the name of the original data.frame:
my_function <- function(tbl, orig.df){print(deparse(substitute(orig.df)))}
df %>% my_function(df)
[1] "df"
Note you couldn't just pass the df with your original function because the tbl_df object is automatically passed to all subsequent functions.

pass grouped dataframe to own function in dplyr

I am trying to move from plyr to dplyr. However, I still can't seem to figure out how to call my own functions in a chained dplyr pipeline.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr functions looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I though this should look something like this
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where the groups are annotated). Thus, when I cbind the Experience variable, it appends a counter from 0 to the length of the entire table instead of one counter per group.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
  group_by(ID_variable) %>%
  arrange(ID_variable, order_variable) %>%
  mutate(Experience = 0:(n() - 1))
However, I would still be keen to learn how to pass the grouped data, split into separate tables, to my own functions in dplyr.
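One newer option (a sketch, assuming dplyr >= 0.8.1 and the same f as above) is group_modify(), which hands each group to your function as a plain data frame and row-binds the results:

f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x) - 1))

data %>%
  group_by(ID_variable) %>%
  group_modify(~ f(.x)) %>%
  ungroup()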
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As it was asked here
df %>%
  group_by(b) %>%
  printFunction(.)

prints the entire data frame. To have dplyr pass one table per group to your function, you should use do:

df %>%
  group_by(b) %>%
  do(printFunction(.))
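do() has since been superseded in dplyr; a sketch of the same per-group side effect with group_walk() (assuming dplyr >= 0.8.1) would be:

df %>%
  group_by(b) %>%
  group_walk(~ print(.x))

Here .x is each group's data without the grouping column b.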

ddply + summarise function column name input

I am trying to use ddply and summarise together from the plyr package, but am having difficulty handling column names that keep changing... In my example I would like something that passes in X1 programmatically rather than hard-coding X1 into the ddply function.
Setting up an example:
require(xts)
require(plyr)
require(reshape2)
require(lubridate)

t <- xts(matrix(rnorm(10000), ncol = 10), Sys.Date() - 1000:1)
t.df <- data.frame(coredata(t))
t.df <- cbind(day = wday(index(t), label = TRUE, abbr = TRUE), t.df)
t.df.l <- melt(t.df, id.vars = c("day", colnames(t.df)[2]),
               measure.vars = colnames(t.df)[3:ncol(t.df)])
This is the bit I am struggling with:
cor.vars <- ddply(t.df.l, c("day","variable"), summarise, cor(X1, value))
I do not want to use the term X1 and would like to use something like:
cor.vars <- ddply(t.df.l, c("day","variable"), summarise, cor(colnames(t.df)[2], value))
but that comes up with the error: Error in cor(colnames(t.df)[2], value) : 'x' must be numeric
I also tried various other combinations that pass in the vector values for the x argument of cor... but for some reason none of them seem to work...
Any ideas?
Although this is probably not the intended usage for summarize and there must be much better approaches to your problem, the direct answer to your question is to use get:
ddply(t.df.l, c("day","variable"), summarise, cor(get(colnames(t.df)[2]), value))
Edit: here is for example one approach that is in my opinion better suited to your problem:
ddply(t.df.l, c("day", "variable"), function(x)cor(x["X1"], x["value"]))
Above, "X1" can be also replaced by 2 or the name of a variable holding "X1", etc. It depends how you want to programmatically access the column.
