I have been working on the following csv file from http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv just for my own practice.
I am however not sure of how to do the following:
Summarize the observations from the same team (teamID) in the same year by adding up the component values. That is, you should end up with only one record per team per year, and this record should have the year, team name, total runs, total hits, total X2B, ..., total HBP.
Here is the code I have so far, but it gives me only one team per year. I need all the teams for each year with their totals (e.g., for 1980, all the teams with totalRuns, totalHits, ...; for 1981, all the teams with totalRuns, totalHits, ...; and so on).
newdat1 <- read.csv("http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv")
id <- split(1:nrow(newdat1), newdat1$yearID)
a2 <- data.frame(yearID=sapply(id, function(i) newdat1$yearID[i[1]]),
                 teamID=sapply(id, function(i) newdat1$teamID[i[1]]),
                 totalRuns=sapply(id, function(i) sum(newdat1$R[i], na.rm=TRUE)),
                 totalHits=sapply(id, function(i) sum(newdat1$H[i], na.rm=TRUE)),
                 totalX2B=sapply(id, function(i) sum(newdat1$X2B[i], na.rm=TRUE)),
                 totalX3B=sapply(id, function(i) sum(newdat1$X3B[i], na.rm=TRUE)),
                 totalHR=sapply(id, function(i) sum(newdat1$HR[i], na.rm=TRUE)),
                 totalBB=sapply(id, function(i) sum(newdat1$BB[i], na.rm=TRUE)),
                 totalSB=sapply(id, function(i) sum(newdat1$SB[i], na.rm=TRUE)),
                 totalGIDP=sapply(id, function(i) sum(newdat1$GIDP[i], na.rm=TRUE)),
                 totalIBB=sapply(id, function(i) sum(newdat1$IBB[i], na.rm=TRUE)),
                 totalHBP=sapply(id, function(i) sum(newdat1$HBP[i], na.rm=TRUE)))
a2
Perhaps try something like:
library("dplyr")
newdat1 %>%
  group_by(yearID, teamID) %>%
  # across() needs dplyr >= 1.0.0; it replaces the deprecated summarize_each()/funs()
  summarize(across(c(R, H, X2B, X3B, HR, BB, SB, GIDP, IBB, HBP),
                   ~ sum(.x, na.rm = TRUE)))
This is most useful if you're comfortable with the dplyr library. It is a guess, made without looking at the data too closely.
Also, instead of listing each column that you wish to sum over, you can alternatively exclude the ones you don't want:
summarize(across(!c(column_to_exclude1, column_to_exclude2), ~ sum(.x, na.rm = TRUE)))
And so forth.
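For comparison, here is a minimal base R sketch of the same two-key grouping using aggregate(). The data frame `batting` and its values are made up for illustration (only the column names mirror Batting.csv), so treat it as a sketch rather than the asker's actual result.

```r
# Hypothetical miniature of the batting data; the values are invented.
batting <- data.frame(
  yearID = c(1980, 1980, 1980, 1981, 1981),
  teamID = c("BOS", "BOS", "NYA", "BOS", "NYA"),
  R      = c(10, 5, 7, NA, 3),
  H      = c(20, 8, 15, 12, 6)
)

# Sum every listed column within each (yearID, teamID) combination.
# na.rm = TRUE is passed through to sum(), so missing values are skipped.
totals <- aggregate(
  batting[c("R", "H")],
  by  = batting[c("yearID", "teamID")],
  FUN = sum,
  na.rm = TRUE
)

print(totals)
```

The result has one row per team per year, which is the shape the question asks for. Grouping on both keys is what the original split() on yearID alone was missing.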
I'd suggest looking at ddply in the plyr package. See here for a good explanation of what I think you're trying to do.
For this example try the following code:
# ddply function in the plyr package
library(plyr)
# summarize the dataframe newdat1, using yearID and teamID as grouping variables
outputdat <- ddply(newdat1, c("yearID", "teamID"), summarize,
                   totalRuns = sum(R),   # add all summary variables you need here...
                   totalHits = sum(H),   # other summary functions (mean, sd etc) also work
                   totalX2B  = sum(X2B))
Hope that helps?
library(plyr)
ddply(newdat1, ~ teamID + yearID, summarize, sum(R), sum(X2B), sum(SO), sum(IBB), sum(HBP))
If the columns contain missing values, use sum(..., na.rm=TRUE) instead.
data.table can also do this:
library(data.table)
DT <- as.data.table(newdat1[,-c(1,5)])
setkey(DT, teamID, yearID)
DT[, lapply(.SD, sum, na.rm=TRUE), .(teamID, yearID)]
I need to use ddply to apply multiple functions on multiple columns of my data frame. When I use the column name (RV in the example below), my split variables (Group and Round below) work (I get a mean value for each combination of Round and Group).
I need to do this on 20 columns and I was thinking of creating a for loop and pass column indexes.
When I use the column index (for example df[[1]] which is "RV" in my data frame), Group and Round are ignored and the grand mean is returned for all combinations of Round and Group.
I tried to pass the column name, in new.df3 but Round and Group are ignored again.
df <- data.frame("RV" = 1:5, "Group" = c("a","b","b","b","a"), "Round" = c("2","1","1","2","1"))
# this works and a separate mean for each combination of "Group" and "Round" is calculated
new.df <- ddply(df, c("Group", "Round"), summarise,
mean= mean(RV))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df2 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[[1]]))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df3 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[,colnames(df[1])]))
I tried "lapply" and the same issue exists. Any suggestion why this happens and how I can fix it?
As great a package as plyr is, you would do well here to update to its newest iteration, dplyr. There, the code would be
library(dplyr)
v <- vars(RV) # add all your variables here
new.df <- df %>%
  group_by(Group, Round) %>%
  summarize_at(v, mean)
So using this method, you plug in all your variables into v, and you'll get a mean for all of them, for each combination of Group and Round. The pipe operator (%>%) looks weird when you first see it, but it helps streamline your code. It takes the output of the previous function and sets it to be the first argument of the next function. It makes it easy to see that we're taking df, grouping by Group and Round, then summarizing them.
If you really want to stick with plyr, we can get a solution there too:
new.df <- ddply(df, c("Group", "Round"), summarise,
RV_mean = mean(RV),
var2_mean = mean(var2) # add more variables just like this
)
We can also work from your list approaches:
new.df2 <- ddply(df, .(Group, Round), function(data_subset) { # note alternative way to reference Group and Round
as.data.frame(llply(data_subset[,c("RV"), drop = FALSE], mean)) # add your variables here
})
Note that within ddply, I always refer to the subset of the data frame inside my function calls and never to df. df always refers to the original data frame, not the subset you are trying to work with, which is why mean(df[[1]]) returns the grand mean for every group.
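The point about the subset versus the original data frame can be illustrated in base R without plyr. This is only a sketch: `RV2` and the vector `cols` are hypothetical additions standing in for the asker's 20 columns.

```r
# RV, Group and Round come from the question; RV2 is a hypothetical second column.
df <- data.frame(
  RV    = 1:5,
  RV2   = c(10, 20, 30, 40, 50),
  Group = c("a", "b", "b", "b", "a"),
  Round = c("2", "1", "1", "2", "1")
)
cols <- c("RV", "RV2")  # columns to summarise, selected by name

# One data frame per Group/Round combination.
pieces <- split(df, list(df$Group, df$Round), drop = TRUE)

# Correct: index into the piece, so each mean uses only that group's rows.
group_means <- t(sapply(pieces, function(piece) {
  colMeans(piece[, cols, drop = FALSE])
}))

# Incorrect: df is the full data frame, so every group gets the grand mean.
grand_means <- sapply(pieces, function(piece) mean(df[["RV"]]))

print(group_means)
print(grand_means)
```

Selecting columns by name from the piece is what makes a loop over many columns work. Referring to `df` inside the function bypasses the grouping entirely.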
From a data frame of many columns, I would like to aggregate (i.e. sum) hundreds of columns by a single column, without specifying each of the column names.
Some sample data:
names <- floor(runif(20, 1, 5))
sample <- cbind(names)
for(i in 1:20){
col <- rnorm(20,2,4)
sample <- cbind(sample, col)
}
What I have until now is the following code, but it gives me that arguments must be the same length.
aggregated <- aggregate.data.frame(sample[,c(2:20)], by = as.list(names), FUN = 'sum')
Original dataset is a lot bigger, so I can't specify the name of each of the columns to be aggregated and I can't use the list function.
You don't actually need to list them at all:
aggregate(. ~ names, sample, sum) # . represents all other columns
Of course base R is my favorite but in case someone wants dplyr:
library(dplyr)
data.frame(sample) %>%
  group_by(names) %>%
  summarise(across(everything(), sum)) # across() needs dplyr >= 1.0.0
Just alter your code slightly:
aggregated <- aggregate(sample[,c(2:20)], by = list(names), FUN = 'sum')
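A small sketch of why swapping as.list() for list() matters, using made-up data (`grp` and `vals` are hypothetical names). aggregate() expects `by` to be a list of grouping vectors, each as long as the data has rows.

```r
# Hypothetical grouping vector and values, four rows each.
grp  <- c(1, 2, 1, 2)
vals <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))

# as.list() splits the vector into one list element per value, so aggregate()
# would see four grouping variables of length 1 and complain about lengths.
length(as.list(grp))  # 4

# list() wraps the whole vector as a single grouping variable of length 4.
length(list(grp))     # 1

aggregated <- aggregate(vals, by = list(grp = grp), FUN = sum)
print(aggregated)
```

Naming the list element (`grp = grp`) also gives the grouping column a readable name in the output instead of the default `Group.1`.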
I would like to calculate relative response values by dividing each response/column by its group mean.
I have managed to produce an exhaustive (and thus unsatisfying) method. My data set is very large and contains multiple groups and responses.
###############
# example
# used packages
require(plyr)
# sample data
group <- c(rep("alpha", 3), rep("beta", 3), rep("gamma", 3))
a <- rnorm(9, 10,1) #some random data as response
b <- rnorm(9, 10,1)
df <- data.frame(group, a, b)
# my approach
# means for each group and response
df.means <- ddply(df, "group", colwise(mean))
# clunky method
df$rel.a[df$group=="alpha"] <-
df$a[df$group=="alpha"]/df.means$a[df.means$group=="alpha"]
df$rel.a[df$group=="beta"] <-
df$a[df$group=="beta"]/df.means$a[df.means$group=="beta"]
# ... etc
df$rel.b[df$group=="gamma"] <-
df$b[df$group=="gamma"]/df.means$b[df.means$group=="gamma"]
#desired outcome (well, perhaps with no missing values)
df
###############
I have been using R for a while now, but I still struggle with trivial data handling procedures. I believe I must be missing something. How can I better address these group(s)?
It's quite easy to follow with the dplyr package, the successor to plyr for data frames:
library(dplyr)
df %>% group_by(group) %>% mutate(across(everything(), ~ .x / mean(.x)))
Inside the formula, .x represents the data in each column (by group). across(everything(), ...) applies the function to every column except the grouping variables, and mutate() overwrites those columns with the result. across() needs dplyr >= 1.0.0; it replaces the older mutate_each()/funs().
With data.table package you can do this whole thing fast and easy in one line (without creating the df.means at all), simply
library(data.table)
setDT(df)[, paste0("rel.", names(df)[-1]) :=
            lapply(.SD, function(x) x/mean(x)),
          group]
This will run over all the columns within df (except group) by group and divide each value by the group mean.
Edit: If you want to overwrite the original columns (as in the dplyr answer), you can do this with a small modification (remove the paste0 part):
setDT(df)[, names(df)[-1] := lapply(.SD, function(x) x/mean(x)), group]
If I understand you correctly, you can also do this easily in dplyr. Given the above data
library(dplyr)
df %>% group_by(group) %>% mutate(aresp = a/ mean(a), bresp= b/mean(b))
returns:
  group         a         b     aresp     bresp
1 alpha 10.052847  8.076405 1.0132828 0.8288214
2 alpha 10.002243 11.447665 1.0081822 1.1747888
3 alpha  9.708111  9.709265 0.9785350 0.9963898
4  beta 10.732693  7.483065 0.9751125 0.8202278
5  beta 11.719656 11.270522 1.0647824 1.2353754
6  beta 10.567513  8.615878 0.9601051 0.9443968
7 gamma 10.221040 11.181763 1.0035630 0.9723315
8 gamma 10.302611 11.286443 1.0115721 0.9814341
9 gamma 10.030605 12.031643 0.9848649 1.0462344
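The same relative-response calculation can also be sketched in base R with ave(), which returns the group mean aligned to each row. The data values below are made up so the result is easy to check by hand, and `resp_cols` is a hypothetical helper name.

```r
# Made-up data with round numbers; group is the only non-response column.
df <- data.frame(
  group = rep(c("alpha", "beta"), each = 3),
  a     = c(1, 2, 3, 10, 20, 30),
  b     = c(2, 2, 2, 5, 10, 15)
)

# Every column except the grouping column is treated as a response.
resp_cols <- setdiff(names(df), "group")

# ave(x, g) uses mean() by default and returns one group mean per row,
# so the division is element-wise and needs no lookup table of means.
rel <- lapply(df[resp_cols], function(x) x / ave(x, df$group))

# Store the results next to the originals as rel.a, rel.b, ...
df[paste0("rel.", resp_cols)] <- rel

print(df)
```

Because the response columns are picked up by name, the sketch extends to many responses without writing one assignment per group.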
I have a data frame with annual exports of firms to different countries in different years. My problem is that I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a "tapply" command, like
incumbents <- tapply(id, `destination-year`, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has one entry per distinct `destination-year`, and I need it to have length length(id) (there are many firms each year serving each destination) so that I can use it in a subsequent regression, matched on year and destination. A "for" loop can do this, but it is very time-consuming since the database is rather large.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, `destination-year`, FUN=function(x) length(unique(x)))
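A small runnable sketch of the ave() approach, on made-up data (the values and the name `destinationYear` are assumptions for illustration). Note that ave() writes its result back into a copy of its first argument, so `id` should be numeric for the counts to come back as numbers.

```r
# Hypothetical firm ids with their destination and year, one row per export record.
id          <- c(1, 1, 2, 3, 3)
destination <- c("A", "A", "A", "B", "B")
year        <- c(2000, 2000, 2000, 2000, 2001)

# A single grouping key per row, without the awkward '-' in a variable name.
destinationYear <- paste(destination, year, sep = "-")

# Count distinct firms within each destination-year and repeat that count
# on every row of the group, so the result has the same length as id.
incumbents <- ave(id, destinationYear, FUN = function(x) length(unique(x)))

print(data.frame(id, destinationYear, incumbents))
```

The result lines up row by row with the original data, so it can be added as a column and used directly in a regression.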
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=TRUE)
year = sample(2000:2011, n, replace=TRUE)
destination = sample(LETTERS[1:6], n, replace=TRUE)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, like it seems you've done, tapply can handle both variables directly as a list:
library(reshape2) # melt() is assumed here to come from reshape2
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr) # needed for ddply
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[,transform(.SD,newCol = length(unique(id))),by = destinationYear]
I am using ddply to aggregate my data but haven't found an elegant way to assign column names to the output data frame.
At the moment I am doing this:
agg_data <- ddply(raw_data, .(id, date, classification), nrow)
names(agg_data)[4] <- "no_entries"
and this
agg_data <- ddply(agg_data, .(classification, date), colwise(mean, .(no_entries)) )
names(agg_data)[3] <- "avg_no_entries"
Is there a better, more elegant way to do this?
The generic form I use a lot is:
ddply(raw_data, .(id, date, classification), function(x) data.frame( no_entries=nrow(x) ))
I use anonymous functions in my ddply statements almost all the time so the above idiom meshes well with anonymous functions. This is not the most concise way to express a function like nrow() but with functions where I pass multiple arguments, I like it a lot.
You can use summarise:
agg_data <- ddply(raw_data, .(id, date, classification), summarise, "no_entries" = nrow(piece))
or you can use length(<column_name>) if nrow(piece) doesn't work. For instance, here's an example that should be runnable by anyone:
ddply(baseball, .(year), summarise, newColumn = nrow(piece))
or
ddply(baseball, .(year), summarise, newColumn = length(year))
EDIT
Or as Joshua comments, the all caps version, NROW does the checking for you.