How to Split-Apply-Combine for several variables / columns in R

I'd like to perform a function on several variables, by group.
Fake data:
df <- data.frame(rnorm(100, mean = 10),
                 rnorm(100, mean = 15),
                 rnorm(100, mean = 20),
                 rep(letters[1:10], each = 10))
colnames(df) <- c("var1", "var2", "var3", "group1")
In this particular case, I'd like to mean-center each variable by group. I want to return a dataframe with the original and centered variables.
Normally I use the plyr package for this:
library(plyr)
ddply(df, "group1", transform, centered_var1 = scale(var1, scale = FALSE))
However, I haven't been able to successfully loop this function, or think of another minimal-code way to do it.
I'm open to non-plyr solutions... my main criterion is keeping the code to a minimum.

The colwise function may be what you're looking for.
library("plyr")
ddply(df, .(group1), colwise(scale, scale = FALSE))
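If some columns are not numeric, a hedged variant using plyr's numcolwise() applies the function to the numeric columns only:
library(plyr)
# numcolwise(scale, scale = FALSE) centres each numeric column within each group
ddply(df, .(group1), numcolwise(scale, scale = FALSE))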

Using dplyr
library(dplyr)
df %>%
  group_by(group1) %>%
  mutate_each(funs(scale(., scale = FALSE))) -> res
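mutate_each() and funs() have since been deprecated in dplyr. A sketch of the same idea on dplyr >= 1.0 using across(), which also keeps the original columns next to the centred copies (the centered_ prefix in .names is an assumption about the desired output names):
library(dplyr)
res <- df %>%
  group_by(group1) %>%
  mutate(across(var1:var3,
                ~ as.numeric(scale(.x, scale = FALSE)),  # as.numeric() drops the matrix that scale() returns
                .names = "centered_{.col}")) %>%
  ungroup()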

Is this what you want?
ddply(df, "group1", transform, centered_var1= scale(var1, scale=FALSE),
centered_var2 = scale(var2, scale=FALSE),
centered_var3 = scale(var3, scale=FALSE))

Related

How to use ddply + summarise in custom function

I'm trying to use the ddply-summarise combination (e.g. with mean()) within a custom function. However, instead of giving the means for each group, it returns a data frame showing the mean of all observations.
Many thanks already in advance for your help!
library(plyr)
library(dplyr)
df <- data.frame(Titanic)
colnames(df)
# ddply-summarise - Outside of function
df.OutsideOfFunction <- ddply(df, c("Class","Sex"), summarise,
                              Mean = mean(Freq))
# new function
newFunction <- function(data, GroupVariables, ColA){
  mean(data[[ColA]])
  plyr::ddply(data, GroupVariables, summarise,
              Mean = mean(data[[ColA]]))
}
# ddply-summarise - Inside of function
df.InsideOfFunction <- newFunction(data = df,
                                   GroupVariables = c("Class","Sex"),
                                   ColA = "Freq")
It should work this way, by first converting the ColA input to a symbol and then unquoting (evaluating) it:
# new function
newFunction <- function(data, GroupVariables, ColA){
  #mean(data[[ColA]])
  plyr::ddply(data, GroupVariables, summarise, Mean = mean(UQ(sym(ColA))))
}
Please also take a look at this post for why this happens. It's the first time I've seen it myself, so I'm not the best person to explain it, but it looks like it depends on how summarise and other plyr/dplyr functions accept their arguments (quoted or unquoted) and how those arguments are then evaluated.
Also, since you are loading dplyr as well, you can stick to one package if you like and write your function like this:
newFunction <- function(data, GroupVariables, ColA){
  data %>%
    group_by(.dots = GroupVariables) %>%
    summarise(Mean = mean(UQ(sym(ColA))))
}
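On current dplyr/rlang versions, group_by(.dots = ...) and UQ() are deprecated. A hedged sketch of the same function using across(all_of()) and the .data pronoun (assuming dplyr >= 1.0):
library(dplyr)
newFunction <- function(data, GroupVariables, ColA){
  data %>%
    group_by(across(all_of(GroupVariables))) %>%             # group by the columns named in GroupVariables
    summarise(Mean = mean(.data[[ColA]]), .groups = "drop")  # look the summarised column up by name
}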
Hope this helps

Use a function of the data in dplyr::summarise

Assume I have a function of a data.frame that returns a single number. Now I would like to use summarise in dplyr, where the new variable should be this function applied to the data.frame within each group defined by another variable.
Here is a deliberately simple example:
df <- data.frame(id = rep(c("A","B"), each = 5), diff = rnorm(10))
func <- function(data){
  mean(data$diff)
}
I know this example is easily done using summarise(Mean = mean(diff)), but the point is not to solve this particular example but, in general, to use summarise with a function of a data.frame.
My try so far has been
df %>% group_by(id) %>% summarise(New = func(.))
but it gives the same value for every group, namely the function applied to the whole data.frame rather than to each group.
Hope everything is clear.
I'm not sure I understand what you are trying to do, and I'm not familiar with the differences between the plyr and dplyr packages. The most straightforward way to do what I think you're trying to do is with daply:
> daply(df, .(id), func)
A B
-0.0301488 0.2088815
As akrun pointed out in the comments, you can do this using do in dplyr:
df %>% group_by(id) %>% do(data.frame(New=func(.)))
You can also add other variables, though you have to use .$:
df %>% group_by(id) %>% do(data.frame(New=func(.), SmthElse = sd(.$diff)))
# id New SmthElse
#1 A 0.1934552 1.0932424
#2 B -0.4161216 0.4841031
That said, a simpler and faster-performing solution uses data.table:
library(data.table)
dt = as.data.table(df) # or convert in place using setDT
dt[, .(New = func(.SD), SmthElse = sd(diff)), by = id]
# id New SmthElse
#1: A 0.1934552 1.0932424
#2: B -0.4161216 0.4841031
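On more recent dplyr versions, do() is superseded; a rough sketch of the same per-group call with group_modify(), reusing func and df from above, could look like this:
library(dplyr)
df %>%
  group_by(id) %>%
  group_modify(~ data.frame(New = func(.x), SmthElse = sd(.x$diff)))  # .x is one group's rows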

pass grouped dataframe to own function in dplyr

I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr pipeline.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr functions looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr passes a number of separate tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as a dplyr object in which the groups are merely annotated). When I cbind the Experience variable, it therefore appends a counter from 0 to the length of the entire table instead of restarting within each group.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
  group_by(ID_variable) %>%
  arrange(ID_variable, order_variable) %>%
  mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables, split into different tables, to my own functions in dplyr.
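For what it's worth, newer dplyr versions (>= 0.8) offer group_modify(), which hands each group's rows to a function as a separate data frame (without the grouping column) and stitches the results back together; a hedged sketch with the f defined above:
library(dplyr)
data <- data %>%
  group_by(ID_variable) %>%
  group_modify(~ f(.x)) %>%  # f sees one group at a time, much like with plyr's ddply
  ungroup()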
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As asked here,
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data frame. To get dplyr to print one table per group, you should use do:
df %>%
group_by(b) %>%
do(printFunction(.))
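On newer dplyr versions, where do() is superseded, a similar per-group side effect can be sketched with group_walk(), reusing printFunction and df from above:
library(dplyr)
df %>%
  group_by(b) %>%
  group_walk(~ printFunction(.x))  # calls printFunction once per group, purely for its side effect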

dplyr: colSums on sub-grouped (group_by) data frames: elegantly

I have a very large data frame (265,874 x 30) with three sensible grouping variables: an age category (1-6), dates (5,479 distinct values), and geographic locality (4 total). Each record consists of a choice from each of these, plus 27 count variables. I want to group by each of the grouping variables, then take colSums on the resulting sub-grouped 27 count variables. I've been trying to use dplyr (v0.2) to do it, because doing it manually ends up setting up a lot of redundant things (or resorting to a loop for iterating across the grouping options, for lack of an elegant solution).
Example code:
countData <- sample(0:10, 2000, replace = TRUE)
dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
locality <- sample(1:2, 200, replace = TRUE)
ageCat <- sample(1:2, 200, replace = TRUE)
sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))
Then what I'd like to do is ...
library("dplyr")
sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)]))
but this doesn't quite work, as the results from colSums() aren't data frames. If I cast it, it works:
sampleDF %.% group_by(locality, ageCat, dates) %.% do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
but the final do(...) bit seems very clunky.
Any thoughts on how to do this more elegantly or effectively? I guess the question comes down to: how best to use the do() function and the . operator to summarize a data frame via colSums.
Note: the do(.) operator only applies to dplyr 0.2, so you need to grab it from GitHub (link), not from CRAN.
Edit: results from suggestions
Three solutions:
My suggestion in the post: 146.765 seconds elapsed.
@joran's suggestion below: 6.902 seconds.
@eddi's suggestion in the comments, using data.table: 6.715 seconds.
I didn't bother to replicate, just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform approximately the same on my data set, and both are significantly faster when used properly than the hack solution I came up with yesterday.
Unless I'm missing something, this seems like a job for summarise_each (a sort of colwise analogue from plyr):
sampleDF %.% group_by(locality, ageCat, dates) %.% summarise_each(funs(sum))
The grouping columns are not included in the summarising function by default, and you can select only a subset of columns to apply the functions to, using the same technique as with select.
(summarise_each is in version 0.2 of dplyr but not in 0.1.3, as far as I know.)
The method summarise_each mentioned in joran's answer from 2014 has been deprecated.
Instead, please use summarize_all() or summarize_at().
The methods summarize_all and summarize_at mentioned in Hack-R's answer from 2018 have been superseded.
Instead, please use summarize()/summarise() combined with across().
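A sketch of that current idiom applied to the sample data above (assuming dplyr >= 1.0):
library(dplyr)
sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise(across(everything(), sum), .groups = "drop")  # column sums of the count variables per group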

ddply aggregated column names

I am using ddply to aggregate my data but haven't found an elegant way to assign column names to the output data frame.
At the moment I am doing this:
agg_data <- ddply(raw_data, .(id, date, classification), nrow)
names(agg_data)[4] <- "no_entries"
and this:
agg_data <- ddply(agg_data, .(classification, date), colwise(mean, .(no_entries)) )
names(agg_data)[3] <- "avg_no_entries"
Is there a better, more elegant way to do this?
The generic form I use a lot is:
ddply(raw_data, .(id, date, classification), function(x) data.frame(no_entries = nrow(x)))
I use anonymous functions in my ddply statements almost all the time, so the above idiom fits naturally with that style. It is not the most concise way to express a function like nrow(), but for functions that take multiple arguments I like it a lot.
You can use summarise:
agg_data <- ddply(raw_data, .(id, date, classification), summarise, "no_entries" = nrow(piece))
or you can use length(<column_name>) if nrow(piece) doesn't work. For instance, here's an example that should be runnable by anyone:
ddply(baseball, .(year), summarise, newColumn = nrow(piece))
or
ddply(baseball, .(year), summarise, newColumn = length(year))
EDIT
Or, as Joshua comments, the all-caps version NROW() does the checking for you.
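For comparison, a dplyr sketch of the same count aggregation, where the new column name can be given directly (assuming the raw_data columns used above):
library(dplyr)
agg_data <- raw_data %>%
  count(id, date, classification, name = "no_entries")  # one row per group with its row count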
