I have a data frame with annual exports of firms to different countries in different years. My problem is that I need to create a variable that gives, for each year, the number of firms in each country. I can do this perfectly well with a "tapply" command, like
incumbents <- tapply(id, `destination-year`, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) (there are many firms serving each destination each year) so that I can use it in a subsequent regression, matched by year and destination of course. A "for" loop can do this, but it is very time-consuming because the data set is huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, `destination-year`, FUN=function(x) length(unique(x)))
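As a minimal sketch of why ave fits here: it applies the function within each group but returns a vector aligned with the original rows, so the result can go straight into the data frame. The toy data below is made up for illustration, and the combined key is built with paste as an assumption about how destination-year was constructed.

```r
# Toy data (hypothetical): firm ids exporting to destinations by year
d <- data.frame(
  id          = c(1, 2, 2, 3, 1, 4),
  destination = c("A", "A", "A", "B", "B", "B"),
  year        = c(2000, 2000, 2000, 2000, 2001, 2001)
)

# Combined grouping key, one value per row
key <- paste(d$destination, d$year, sep = "-")

# ave() computes the number of distinct firms per key and repeats it
# for every row of that group, so the result has nrow(d) elements
d$incumbents <- ave(d$id, key, FUN = function(x) length(unique(x)))
d
```

Unlike tapply, which returns one value per group, the result here has one value per row of d.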
Just merge the tapply summary back into the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, like it seems you've done, tapply can handle both variables directly as a list:
library(reshape2)  # provides melt()
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
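If you would rather stay in base R, the same two-key summary and merge can be sketched with aggregate. This is only an illustration on simulated data; the seed, sizes and the column name incumbents are arbitrary choices, not part of the original question.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
n <- 200
dat <- data.frame(
  id          = sample(1:10, n, replace = TRUE),
  year        = sample(2000:2002, n, replace = TRUE),
  destination = sample(c("A", "B"), n, replace = TRUE)
)

# One row per destination-year with the number of distinct firms
counts <- aggregate(id ~ destination + year, data = dat,
                    FUN = function(x) length(unique(x)))
names(counts)[names(counts) == "id"] <- "incumbents"

# Merge on both keys; every original row picks up its group's count
merged <- merge(dat, counts, by = c("destination", "year"))
head(merged)
```

Note that merge sorts the result by the key columns, so the row order differs from dat.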
Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr)  # needed for ddply()
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[,transform(.SD,newCol = length(unique(id))),by = destinationYear]
So I have a large data set of students at a school that looks like this:
library(data.table)
set.seed(1)
school <- data.table("id" = rep(1:10, each = 10), "year" = rep(2000:2009, each = 10),
"grade" = sample(c(9:11, rep(NA, 5)), 100, replace = T))
What I want to do is create a column that indicates if a student has previously been in the same grade as he is now.
The desired output for this example can be found here (I created a link to save space).
This may sound simple, but it is not, since students can go back in grades or be absent in prior years.
I would like a way to do this using data.table, as the dataset is very large. So far I've tried the following:
library(dplyr)
library(scales)
school[, repetition := any(school[censor((.I - 10):(.I + 10),
range = c(0, NROW(school))) %>% na.omit
][school[.I, id] == id] == grade)]
However, this doesn't work as I don't know how to distinguish "upper level" (from the first school[...] call) operators like .I and id from inside the second school[...] call.
P.S.: I'll accept suggestions for a better title. Thanks!
We can use duplicated to get a logical value flagging grades that repeat within each id and year.
library(data.table)
school[, repetition := duplicated(grade, incomparables = NA), .(id, year)]
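To see what duplicated with incomparables = NA does, here is a small base R sketch on made-up data. It groups by student only and assumes the rows are already sorted by year within each id; both are assumptions for illustration, not the data.table call above.

```r
# Hypothetical records, sorted by year within each student
students <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  year  = c(2000, 2001, 2002, 2003, 2000, 2001, 2002),
  grade = c(9, NA, 9, 10, 10, NA, NA)
)

# TRUE when the grade already appeared earlier in the vector;
# NA grades are treated as incomparable, so they are never flagged
flag_repeats <- function(g) duplicated(g, incomparables = NA)

# Apply per student and put the flags back in the original row order
students$repetition <- unsplit(
  lapply(split(students$grade, students$id), flag_repeats),
  students$id
)
students
```

Student 1 is flagged in 2002 because grade 9 was already seen in 2000, while the repeated NA values for student 2 are not flagged.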
I want to aggregate data in R in a generic way, with the right-hand side (the grouping columns) stored in an object as a string. Below is the example expression
aggregate(PATTERN_ID ~ Year + Week, data, length)
In my case, the right-hand side, "Year + Week", will change as required, and I want to pass it as a string stored in a variable. I tried evaluating the string, but it does not give the required output. Below is what I have tried:
exp_aggregate_by = 'Year + Week'
aggregate(PATTERN_ID ~ eval.quoted(parse(text = exp_aggregate_by)), data, length)
Any input would be much appreciated. A data.table solution is also fine. Thanks
Create a formula with paste and it should work
data(mtcars)
grp <- 'cyl + gear'
aggregate(formula(paste('mpg ~', grp)), mtcars, length)
For the OP's dataset,
aggregate(formula(paste('PATTERN_ID ~', exp_aggregate_by)), data, length)
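A closely related base R option is reformulate, which builds the formula from a character vector of term names instead of pasting a string. A minimal sketch on mtcars (the variable names grp_vars, f and counts are just for illustration):

```r
# Grouping columns kept as a character vector
grp_vars <- c("cyl", "gear")

# Builds the formula mpg ~ cyl + gear
f <- reformulate(grp_vars, response = "mpg")

# Count rows per cyl/gear combination
counts <- aggregate(f, data = mtcars, FUN = length)
counts
```

Changing grp_vars is then enough to change the grouping, with no string parsing involved.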
An answer with data.table. I'm also including the answer using formula in aggregate for completeness.
vars <- c('Year', 'Week')
# with aggregate
form <- formula(paste('PATTERN_ID', paste(vars, collapse = '+'), sep = '~'))
aggregate(form, data, length)
# with data.table
setDT(data)
data[, length(PATTERN_ID), by = vars]
I have been working on the following csv file from http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv just for my own practice.
I am however not sure of how to do the following:
Summarize the observations from the same team (teamID) in the same year by adding up the component values. That is, you should end up with only one record per team per year, and this record should have the year, team name, total runs, total hits, total X2B, ..., total HBP.
Here is the code I have so far, but it gives me only one team per year, and I need all the teams for each year with their totals (e.g., for 1980 I need all the teams with totalRuns, totalHits, ...; for 1981, all the teams with totalRuns, totalHits, ...; and so on).
newdat1 <- read.csv("http://www3.amherst.edu/~nhorton/r2/datasets/Batting.csv")
id <- split(1:nrow(newdat1), newdat1$yearID)
a2 <- data.frame(yearID=sapply(id, function(i) newdat1$yearID[i[1]]),
teamID=sapply(id,function(i) newdat1$teamID[i[1]]),
totalRuns=sapply(id, function(i) sum(newdat1$R[i],na.rm=TRUE)),
totalHits=sapply(id, function(i) sum(newdat1$H[i],na.rm=TRUE)),
totalX2B=sapply(id, function(i) sum(newdat1$X2B[i],na.rm=TRUE)),
totalX3B=sapply(id, function(i) sum(newdat1$X3B[i],na.rm=TRUE)),
totalHR=sapply(id, function(i) sum(newdat1$HR[i],na.rm=TRUE)),
totalBB=sapply(id, function(i) sum(newdat1$BB[i],na.rm=TRUE)),
totalSB=sapply(id, function(i) sum(newdat1$SB[i],na.rm=TRUE)),
totalGIDP=sapply(id, function(i) sum(newdat1$GIDP[i],na.rm=TRUE)),
totalIBB=sapply(id, function(i) sum(newdat1$IBB[i],na.rm=TRUE)),
totalHBP=sapply(id, function(i) sum(newdat1$HBP[i],na.rm=TRUE)))
a2
Perhaps try something like:
library("dplyr")
newdat1 %>%
group_by(yearID, teamID) %>%
summarize_each(funs(sum(., na.rm = T)), R, H, X2B,
X3B, HR, BB, SB, GIDP, IBB, HBP)
Naturally this is most useful if you're comfortable with the dplyr library. This is a guess without looking at the data too closely.
Also, instead of listing each column that you wish to sum over, you can alternatively do
summarize_each(funs(sum(., na.rm = T)), -column_to_exclude1, -column_to_exclude2)
And so forth.
I'd suggest looking at ddply in the plyr package. See here for a good explanation of what I think you're trying to do.
For this example try the following code:
# ddply function in the plyr package
library(plyr)
# summarize the dataframe newdat1, using yearID and teamID as grouping variables
outputdat <-ddply(newdat1, c("yearID", "teamID"), summarize,
totalRuns= sum(R), # add all summary variables you need here...
totalHits= sum(H), # other summary functions (mean, sd etc) also work
totalX2B = sum(X2B))
Hope that helps?
library(plyr)
ddply(newdat1, ~ teamID + yearID, summarize, sum(R), sum(X2B), sum(SO), sum(IBB), sum(HBP))
If the columns contain NA values, use sum(..., na.rm=TRUE) instead.
data.table can also do this:
library(data.table)
DT <- as.data.table(newdat1[,-c(1,5)])
setkey(DT, teamID, yearID)
DT[, lapply(.SD, sum, na.rm=TRUE), .(teamID, yearID)]
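For completeness, the same one-row-per-team-per-year summary can be sketched in base R with aggregate and its by= interface. The tiny data frame below is invented for illustration and only mimics a few columns of Batting.csv.

```r
# Hypothetical miniature of the batting data
batting <- data.frame(
  yearID = c(1980, 1980, 1980, 1981, 1981),
  teamID = c("BOS", "BOS", "NYA", "BOS", "NYA"),
  R      = c(10, 5, 7, NA, 3),
  H      = c(20, 11, 15, 9, 8)
)

# Sum each stat column within every yearID/teamID combination;
# na.rm = TRUE is passed through to sum()
totals <- aggregate(batting[c("R", "H")],
                    by  = batting[c("yearID", "teamID")],
                    FUN = sum, na.rm = TRUE)
totals
```

Grouping by both yearID and teamID is what yields every team for every year; splitting on yearID alone, as in the question, collapses each year to a single row.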
I have a dataframe in the form of:
df:
RepName, Discount
Bob,Smith , 5383.24
Johh,Doe , 30349.21
...
The names are repeated. In the df, RepName is a factor and Discount is numeric. I want to calculate the mean per RepName. I can't seem to get the aggregate statement right.
I've tried:
#This doesn't work
repAggDiscount <- aggregate(repdf, by = repdf$RepName, FUN = mean)
#Not what I want:
repAggDiscount <- aggregate(repdf, by = list(repdf$RepName), FUN = mean)
I've also tried the following:
repnames <- lapply(repdf$RepName, toString)
repAggDiscount <- aggregate(repdf, by = repnames, FUN = mean)
But that gives me a length mismatch...
I've read the help but an example of how this should work for my data would go a long way... thanks!
I'm posting @AnandaMahto's answer here to close out the question. You can use the formula syntax
aggregate(Discount ~ RepName, repdf, mean)
Or you can use the by= syntax
repAggDiscount <- aggregate(repdf$Discount, by = list(repdf$RepName), FUN = mean)
The problem with your syntax was that you were trying to aggregate the whole data.frame, which included the RepName column, where taking the mean doesn't make sense.
You could also do
repAggDiscount <- aggregate(repdf[,-1], by = repdf[,1,drop=F], FUN = mean)
which is closer to the matrix style syntax.
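Here is a small self-contained sketch of the formula version. The rows are made up, since the question only shows a fragment of the data.

```r
# Hypothetical data in the same shape as the question's repdf
repdf <- data.frame(
  RepName  = factor(c("Bob,Smith", "John,Doe", "Bob,Smith", "John,Doe")),
  Discount = c(100, 50, 300, 150)
)

# Mean discount per rep: one row per level of RepName
rep_means <- aggregate(Discount ~ RepName, data = repdf, FUN = mean)
rep_means
```

Only Discount is averaged, and RepName serves purely as the grouping column.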
I consistently need to take transaction data and aggregate it by Day, Week, Month, Quarter, Year - essentially, it's time-series data. I started to apply zoo/xts to my data in hopes I could aggregate the data faster, but I either don't fully understand the packages' purpose or I'm trying to apply it incorrectly.
In general, I would like to calculate the number of orders and the number of products ordered by category, by time period (day, week, month, etc).
#Create the data
clients <- 1:10
dates <- seq(as.Date("2012/1/1"), as.Date("2012/9/1"), "days")
categories <- LETTERS[1:5]
products <- data.frame(numProducts = 1:10,
category = sample(categories, 1000, replace = TRUE),
clientID = sample(clients, 1000, replace = TRUE),
OrderDate = sample(dates, 1000, replace = TRUE))
I could do this with plyr and reshape, but I think this is a round-about way to do so.
#Aggregate by date and category
products.day <- ddply(products, .(OrderDate, category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Aggregate by Month and category
products.month <- ddply(products, .(Month = months(OrderDate), Category = category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Make a wide-version of the data frame
products.month.wide <- cast(products.month, Month~Category, sum)
I tried to apply zoo to the data like so:
products.TS <- aggregate(products$numProducts, yearmon, mean)
It returned this error:
Error in aggregate.data.frame(as.data.frame(x), ...) :
'by' must be a list
I've read the zoo vignettes and documentation, but every example that I've found only shows 1 record/row/entry per time entry.
Do I have to pre-aggregate the data I want to time-series on? I was hoping that I could simply group by the fields I want, then have the months or quarters get added to the data frame incrementally to the X-axis.
Is there a better approach to aggregating this or a more appropriate package?
products$numProducts is a vector, not a zoo object. You'd need to create a zoo object before you can use method dispatch to call aggregate.zoo.
pz <- with(products, zoo(numProducts, OrderDate))
products.TS <- aggregate(pz, as.yearmon, mean)
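If you also need the category split and don't want to pre-aggregate into a zoo object, one alternative is to derive a month key and group in base R. This is a sketch on made-up orders; the column name month and the "%Y-%m" key format are assumptions chosen for illustration.

```r
# Hypothetical orders, several per date and category
orders <- data.frame(
  OrderDate   = as.Date(c("2012-01-03", "2012-01-20", "2012-02-05",
                          "2012-02-05", "2012-02-28")),
  category    = c("A", "A", "A", "B", "B"),
  numProducts = c(2, 3, 1, 4, 6)
)

# Year-month key such as "2012-01"
orders$month <- format(orders$OrderDate, "%Y-%m")

# Products per month and category, then the order count alongside;
# both calls use the same formula, so their rows line up
monthly <- aggregate(numProducts ~ month + category, data = orders, FUN = sum)
monthly$numOrders <- aggregate(numProducts ~ month + category,
                               data = orders, FUN = length)$numProducts

# Wide layout: months as rows, categories as columns
wide <- xtabs(numProducts ~ month + category, data = orders)
monthly
wide
```

Swapping the format string (for example "%Y" for years) changes the period without touching the rest.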