ddply aggregated column names - r

I am using ddply to aggregate my data but haven't found an elegant way to assign column names to the output data frame.
At the moment I am doing this:
agg_data <- ddply(raw_data, .(id, date, classification), nrow)
names(agg_data)[4] <- "no_entries"
and this
agg_data <- ddply(agg_data, .(classification, date), colwise(mean, .(no_entries)) )
names(agg_data)[3] <- "avg_no_entries"
Is there a better, more elegant way to do this?

The generic form I use a lot is:
ddply(raw_data, .(id, date, classification), function(x) data.frame( no_entries=nrow(x) ))
I use anonymous functions in my ddply calls almost all the time, so this idiom fits in naturally. It is not the most concise way to express a function like nrow(), but for functions where I pass multiple arguments, I like it a lot.
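A minimal sketch of this idiom with more than one named output column, assuming the plyr package is installed. The toy raw_data and its value column are made up for illustration and are not from the question.

```r
library(plyr)

# Hypothetical toy data standing in for the asker's raw_data
raw_data <- data.frame(
  id = c(1, 1, 2, 2, 2),
  date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
  classification = c("a", "a", "b", "b", "b"),
  value = c(10, 20, 30, 40, 50)
)

# The anonymous function returns a one-row data frame per group,
# so every output column gets its name where it is computed
agg_data <- ddply(raw_data, .(id, date, classification), function(x) {
  data.frame(no_entries = nrow(x), total_value = sum(x$value))
})
print(agg_data)
```

Because the names are set inside data.frame(), no names(agg_data)[4] <- ... step is needed afterwards.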

You can use summarise:
agg_data <- ddply(raw_data, .(id, date, classification), summarise, "no_entries" = nrow(piece))
or you can use length(<column_name>) if nrow(piece) doesn't work. For instance, here's an example that should be runnable by anyone:
ddply(baseball, .(year), summarise, newColumn = nrow(piece))
or
ddply(baseball, .(year), summarise, newColumn = length(year))
EDIT
Or, as Joshua comments, the all-caps version, NROW, does the checking for you.
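As a rough sketch, the two steps from the question can both be written with named summarise arguments, so neither result needs renaming. This assumes plyr is installed and that plyr's summarise is not masked by dplyr; the toy raw_data is hypothetical.

```r
library(plyr)

# Hypothetical toy data standing in for the asker's raw_data
raw_data <- data.frame(
  id = c(1, 1, 2, 2, 2),
  date = c("d1", "d1", "d1", "d2", "d2"),
  classification = c("a", "a", "b", "b", "b")
)

# Step 1: count rows per group; length() of any column gives the group size
counts <- ddply(raw_data, .(id, date, classification), plyr::summarise,
                no_entries = length(id))

# Step 2: average those counts per classification and date
avg <- ddply(counts, .(classification, date), plyr::summarise,
             avg_no_entries = mean(no_entries))
print(avg)
```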

Related

How to use ddply + summarise in custom function

I'm trying to use the ddply-summarise function (e.g. mean()) within a custom function. However, instead of resulting in the means for each group, it results in a dataframe showing the mean of all observations.
Many thanks already in advance for your help!
library(plyr)
library(dplyr)
df <- data.frame(Titanic)
colnames(df)
# ddply-summarise - Outside of function
df.OutsideOfFunction <- ddply(df, c("Class","Sex"), summarise,
Mean=mean(Freq))
# new function
newFunction <- function(data, GroupVariables, ColA){
mean(data[[ColA]])
plyr::ddply(data, GroupVariables, summarise,
Mean=mean(data[[ColA]]))
}
#ddply-summarise - InsideOfFunction
df.InsideOfFunction <- newFunction(data=df,
GroupVariables=c("Class","Sex"),
ColA ="Freq")
It should work this way, by converting ColA input first to symbol and then evaluating it:
# new function
newFunction <- function(data, GroupVariables, ColA){
#mean(data[[ColA]])
plyr::ddply(data, GroupVariables, summarise, Mean=mean(UQ(sym(ColA))))
}
Please also take a look at this post to see why this happens. It's the first time I've seen it myself, so I am not the best person to explain it. It looks like it depends on how summarise and other plyr or dplyr functions accept their arguments (quoted or unquoted) and how these are evaluated.
Also since you are loading dplyr as well, you can stick to one package if you like and write your function like this:
newFunction <- function(data, GroupVariables, ColA){
data %>% group_by(.dots=GroupVariables) %>% summarise(Mean=mean(UQ(sym(ColA))))
}
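An alternative sketch that avoids unquoting altogether, assuming plyr and a dplyr version of at least 1.0 are installed. The helper names mean_by_plyr and mean_by_dplyr are hypothetical. The plyr variant indexes each group's piece by column name; the dplyr variant uses the .data pronoun and across(all_of()).

```r
df <- data.frame(Titanic)

# plyr: the anonymous function receives one group at a time,
# so piece[[col]] refers to that group's rows only
mean_by_plyr <- function(data, group_vars, col) {
  plyr::ddply(data, group_vars, function(piece) {
    data.frame(Mean = mean(piece[[col]]))
  })
}

# dplyr: .data[[col]] looks the column up by its string name within each group
mean_by_dplyr <- function(data, group_vars, col) {
  grouped <- dplyr::group_by(data, dplyr::across(dplyr::all_of(group_vars)))
  dplyr::summarise(grouped, Mean = mean(.data[[col]]), .groups = "drop")
}

res_plyr <- mean_by_plyr(df, c("Class", "Sex"), "Freq")
res_dplyr <- mean_by_dplyr(df, c("Class", "Sex"), "Freq")
print(res_plyr)
```

Calling functions through their namespace (plyr::, dplyr::) also sidesteps the masking of summarise that occurs when both packages are attached.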
Hope this helps

Use a function of the data in dplyr::summarise

Assume I have a function that takes a data.frame and returns a single number. I would like to use summarise in dplyr so that the new variable is this function applied to the data.frame for each group defined by another variable.
This is a stupid example
df <- data.frame(id=rep(c("A","B"),each=5),diff=rnorm(10))
func<-function(data){
mean(data$diff)
}
I know this example is easily done using summarise(Mean = mean(diff)), but the point is not to solve this example; it is to use summarise in general with a function of a data.frame.
My try so far has been
df %>% group_by(id) %>% summarise(New = func(.))
but it gives the same value for every group, namely the function applied to the whole data.frame.
Hope everything is clear.
I'm not sure I understand what you are trying to do, and I'm not familiar with the differences between the plyr and dplyr packages. The most straightforward way to do what I think you're trying to do is with daply:
> daply(df, .(id), func)
A B
-0.0301488 0.2088815
As akrun pointed out in the comments, you can do this using do in dplyr:
df %>% group_by(id) %>% do(data.frame(New=func(.)))
You can also add other variables, though you have to use .$:
df %>% group_by(id) %>% do(data.frame(New=func(.), SmthElse = sd(.$diff)))
# id New SmthElse
#1 A 0.1934552 1.0932424
#2 B -0.4161216 0.4841031
That said, a simpler and faster solution is to use data.table:
library(data.table)
dt = as.data.table(df) # or convert in place using setDT
dt[, .(New = func(.SD), SmthElse = sd(diff)), by = id]
# id New SmthElse
#1: A 0.1934552 1.0932424
#2: B -0.4161216 0.4841031
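Two further sketches of the same idea, offered as assumptions rather than as part of the answers above: base R's split() with sapply(), and dplyr's group_modify() (available in dplyr 0.8.1 and later). Both hand the function one sub-data-frame per group.

```r
set.seed(1)
df <- data.frame(id = rep(c("A", "B"), each = 5), diff = rnorm(10))

func <- function(data) {
  mean(data$diff)
}

# Base R: split the data frame by group, then apply func to each piece
by_group <- sapply(split(df, df$id), func)

# dplyr: .x is the current group's rows (without the grouping column)
res <- dplyr::group_modify(
  dplyr::group_by(df, id),
  ~ data.frame(New = func(.x))
)
print(by_group)
print(res)
```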

Summarize observations in r [duplicate]

I have been working on the following csv file from http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv just for my own practice.
I am however not sure of how to do the following:
Summarize the observations from the same team (teamID) in the same year by adding up the component values. That is, you should end up with only one record per team per year, and this record should have the year, team name, total runs, total hits, total X2B ,…. Total HBP.
Here is the code I have so far, but it gives me only one team per year, whereas I need all the teams for each year with their totals (e.g., for 1980, all the teams with totalRuns, totalHits, ...; for 1981, all the teams with totalRuns, totalHits, ...; and so on).
newdat1 <- read.csv("http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv")
id <- split(1:nrow(newdat1), newdat1$yearID)
a2 <- data.frame(yearID=sapply(id, function(i) newdat1$yearID[i[1]]),
teamID=sapply(id,function(i) newdat1$teamID[i[1]]),
totalRuns=sapply(id, function(i) sum(newdat1$R[i],na.rm=TRUE)),
totalHits=sapply(id, function(i) sum(newdat1$H[i],na.rm=TRUE)),
totalX2B=sapply(id, function(i) sum(newdat1$X2B[i],na.rm=TRUE)),
totalX3B=sapply(id, function(i) sum(newdat1$X3B[i],na.rm=TRUE)),
totalHR=sapply(id, function(i) sum(newdat1$HR[i],na.rm=TRUE)),
totalBB=sapply(id, function(i) sum(newdat1$BB[i],na.rm=TRUE)),
totalSB=sapply(id, function(i) sum(newdat1$SB[i],na.rm=TRUE)),
totalGIDP=sapply(id, function(i) sum(newdat1$GIDP[i],na.rm=TRUE)),
totalIBB=sapply(id, function(i) sum(newdat1$IBB[i],na.rm=TRUE)),
totalHBP=sapply(id, function(i) sum(newdat1$HBP[i],na.rm=TRUE)))
a2
Perhaps try something like:
library("dplyr")
newdat1 %>%
group_by(yearID, teamID) %>%
summarize_each(funs(sum(., na.rm = TRUE)), R, H, X2B,
X3B, HR, BB, SB, GIDP, IBB, HBP)
Naturally this is most useful if you're comfortable with the dplyr library. This is a guess without looking at the data too closely.
Also, instead of listing each column that you wish to sum over, you can alternatively do
summarize_each(funs(sum(., na.rm = TRUE)), -column_to_exclude1, -column_to_exclude2)
And so forth.
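In dplyr 1.0 and later, across() covers the same ground as summarize_each(funs(...)). The following is a sketch on a small made-up batting table (not the real Batting.csv), assuming dplyr is installed.

```r
# Hypothetical miniature of the batting data, with some missing values
batting <- data.frame(
  yearID = c(1980, 1980, 1980, 1981),
  teamID = c("BOS", "BOS", "NYA", "BOS"),
  R = c(1, NA, 3, 4),
  H = c(2, 5, NA, 1)
)

# One row per year and team; each listed column is summed, ignoring NAs
totals <- dplyr::summarise(
  dplyr::group_by(batting, yearID, teamID),
  dplyr::across(c(R, H), ~ sum(.x, na.rm = TRUE)),
  .groups = "drop"
)
print(totals)
```

Grouping by both yearID and teamID is what yields one record per team per year.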
I'd suggest looking at ddply in the plyr package. See here for a good explanation of what I think you're trying to do.
For this example try the following code:
# ddply function in the plyr package
library(plyr)
# summarize the dataframe newdat1, using yearID and teamID as grouping variables
outputdat <-ddply(newdat1, c("yearID", "teamID"), summarize,
totalRuns= sum(R), # add all summary variables you need here...
totalHits= sum(H), # other summary functions (mean, sd etc) also work
totalX2B = sum(X2B))
Hope that helps?
library(plyr)
ddply(newdat1, ~ teamID + yearID, summarize, sum(R), sum(X2B), sum(SO), sum(IBB), sum(HBP))
If the columns contain missing values, use sum(..., na.rm=TRUE).
data.table can also do that:
library(data.table)
DT <- as.data.table(newdat1[,-c(1,5)])
setkey(DT, teamID, yearID)
DT[, lapply(.SD, sum, na.rm=TRUE), .(teamID, yearID)]
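For comparison, a base R sketch with aggregate(), using a small made-up table rather than the real file. The code in the question splits on yearID alone, which is why it returns a single row per year; grouping on both keys gives one row per team per year.

```r
# Hypothetical miniature of the batting data, with some missing values
batting <- data.frame(
  yearID = c(1980, 1980, 1980, 1981),
  teamID = c("BOS", "BOS", "NYA", "BOS"),
  R = c(1, NA, 3, 4),
  H = c(2, 5, NA, 1)
)

# Sum the value columns within each (yearID, teamID) combination;
# na.rm = TRUE is passed through to sum()
totals <- aggregate(
  batting[c("R", "H")],
  by = batting[c("yearID", "teamID")],
  FUN = sum,
  na.rm = TRUE
)
print(totals)
```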

How to Split-Apply-Combine for several variables / columns in R

I'd like to perform a function on several variables, by group.
Fake data;
df<-data.frame(rnorm(100,mean=10),
rnorm(100,mean=15),
rnorm(100,mean=20),
rep(letters[1:10],each=10)
)
colnames(df)<-c("var1","var2","var3","group1")
In this particular case, I'd like to mean-center each variable by group. I want to return a dataframe with the original and centered variables.
Normally I use the plyr package for this:
library(plyr)
ddply(df, "group1", transform, centered_var1= scale(var1, scale=FALSE))
However, I haven't been able to successfully loop this function, or think of another minimal-code way to do this.
I'm open to non-PLYR solutions...My main criteria is keeping code to a minimum.
The colwise function may be what you're looking for.
library("plyr")
ddply(df, .(group1), colwise(scale, scale = FALSE))
Using dplyr
library(dplyr)
df %>% group_by(group1) %>%
mutate_each(funs(scale(., scale=F))) -> res
Is this what you want?
ddply(df, "group1", transform, centered_var1= scale(var1, scale=FALSE),
centered_var2 = scale(var2, scale=FALSE),
centered_var3 = scale(var3, scale=FALSE))
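To keep the original columns and add a centered copy of each without listing them one by one, something like the following sketch may work. It assumes dplyr 1.0 or later; the .names pattern produces the centered_ prefix.

```r
set.seed(42)
df <- data.frame(
  var1 = rnorm(100, mean = 10),
  var2 = rnorm(100, mean = 15),
  var3 = rnorm(100, mean = 20),
  group1 = rep(letters[1:10], each = 10)
)

vars <- c("var1", "var2", "var3")

# Subtract the group mean from every listed column and
# store the result in a new column named centered_<original name>
res <- dplyr::ungroup(
  dplyr::mutate(
    dplyr::group_by(df, group1),
    dplyr::across(dplyr::all_of(vars), ~ .x - mean(.x), .names = "centered_{.col}")
  )
)
print(head(res))
```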

Recovering tapply results into the original data-frame in R

I have a data frame with annual exports of firms to different countries in different years. My problem is i need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a "tapply" command, like
incumbents <- tapply(id, destination-year, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) -there are many firms each year serving each destination-, to use it in a subsequent regression (of course, in a way that matches the year and the destination). A "for" loop can do this, but it is very time-consuming since the database is kind of huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, destination-year, FUN=function(x) length(unique(x)))
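A runnable sketch of the ave() approach on simulated data (the data frame dat below is made up for illustration). ave() accepts several grouping vectors, so destination and year can be passed separately instead of being pasted into one variable, and the result has one value per original row.

```r
set.seed(1)
n <- 1000
dat <- data.frame(
  id = sample(1:10, n, replace = TRUE),
  year = sample(2000:2011, n, replace = TRUE),
  destination = sample(LETTERS[1:6], n, replace = TRUE)
)

# For every row, the number of distinct firms sharing its destination and year
dat$incumbents <- ave(
  dat$id, dat$destination, dat$year,
  FUN = function(x) length(unique(x))
)
print(head(dat))
```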
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, as it seems you've done, tapply can handle both variables directly as a list (melt here comes from the reshape2 package):
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
#require(plyr)
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[,transform(.SD,newCol = length(unique(id))),by = destinationYear]

Resources