Data Table R - Aggregation - r

I want to aggregate data in R in a generic way, with the right-hand side of the formula (the grouping columns) stored in an object as a string. Here is an example expression:
aggregate(PATTERN_ID ~ Year + Week, data, length)
In my case the right-hand side, "Year + Week", will change as required, so I want to pass it as a string stored in a variable. I tried evaluating the parsed string, but that does not give the required output. This is what I tried:
exp_aggregate_by = 'Year + Week'
aggregate(PATTERN_ID ~ eval.quoted(parse(text = exp_aggregate_by)), data, length)
Any input would be much appreciated. A data.table solution is also fine. Thanks.

Create a formula with paste and it should work
data(mtcars)
grp <- 'cyl + gear'
aggregate(formula(paste('mpg ~', grp)), mtcars, length)
For the OP's dataset,
aggregate(formula(paste('PATTERN_ID ~', exp_aggregate_by)), data, length)
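If the grouping columns are kept as a character vector rather than a single string, base R's reformulate() builds the same formula without any pasting. A minimal sketch on mtcars; the vector name vars is only illustrative:

```r
# Grouping columns as a character vector; change this to regroup.
vars <- c("cyl", "gear")

# reformulate() builds response ~ term1 + term2 from strings.
form <- reformulate(vars, response = "mpg")

# Count the rows in each cyl/gear combination.
res <- aggregate(form, data = mtcars, FUN = length)
print(form)
print(res)
```

For the OP's data the equivalent call would be reformulate(c("Year", "Week"), response = "PATTERN_ID").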

An answer with data.table. I'm also including the answer using formula in aggregate for completeness.
vars <- c('Year', 'Week')
# with aggregate
form <- formula(paste('PATTERN_ID', paste(vars, collapse = '+'), sep = '~'))
aggregate(form, data, length)
# with data.table
library(data.table)
setDT(data)
data[, length(PATTERN_ID), by = vars]

Related

Get factor variable name without using apostrophes

How can I refer to a variable inside a function without writing its name as a string or indexing it beforehand? I want to write a function in which I can easily swap the grouping variable.
For example:
final_table %>% data.table::dcast(Lipids ~ combi_new)
At another time this factor variable could have a different name, e.g. group_2.
It doesn't work with final_table[,2] - how could I solve this?
Thanks,
Nadine
I'm not sure I understand the question. Are you trying to build the formula for a dcast from the column indices?
Maybe this solves your problem:
library(data.table)
setDT(final_table)
colnames(final_table)
measurevar <- colnames(final_table)[2]
groupvar <- colnames(final_table)[1]
valvar <- colnames(final_table)[3]
formula <- as.formula(paste(measurevar, groupvar, sep=" ~ "))
test <- dcast(final_table, formula, value.var = valvar)
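The formula-building step can be wrapped in a small helper so that only column positions change between calls. This is a sketch in base R only; the helper name cast_formula and the toy final_table are made up for illustration, and the result would still have to be passed to dcast as above:

```r
# Toy stand-in for final_table; the real column names will differ.
final_table <- data.frame(
  Lipids    = c("PC", "PE", "PC", "PE"),
  combi_new = c("g1", "g1", "g2", "g2"),
  value     = c(1.2, 3.4, 5.6, 7.8)
)

# Hypothetical helper: build "row ~ column" from column positions.
cast_formula <- function(df, row_col, col_col) {
  as.formula(paste(names(df)[row_col], names(df)[col_col], sep = " ~ "))
}

f <- cast_formula(final_table, 1, 2)
print(f)
```

If the grouping column is later renamed to group_2, the same call still works because only the position is used.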

Getting object not found error when calling Aggregate Function with FUN=count

When I call the aggregate function I get the error: Error in match.fun(FUN) : object 'count' not found
I have tried updating R as well as using the plyr package, but the latter does not give me the result I want.
aggregate(soybean.table, by=list(soybean$seed.tmt, soybean$germination), FUN=count)
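count is a function in both dplyr and plyr; it is not part of base R, which is why match.fun cannot find it when neither package is attached. The dplyr/plyr count also requires a data.frame/data.table/tbl_df as input. With aggregate, the way to get the number of rows per group is length: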
aggregate(soybean.table, by=list(soybean$seed.tmt, soybean$germination), FUN= length)
As a reproducible example
aggregate(mtcars, list(mtcars$vs, mtcars$am), FUN = length)
It may be better to select one column and group by the columns of interest, since the length is the same for every column as long as the grouping variables are the same, e.g.
aggregate(mpg ~ vs + am, mtcars, FUN = length)
Or to have a different column name
aggregate(cbind(Count = seq_along(mpg))~ vs + am, mtcars, FUN = length)
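As a cross-check on the counts, the same numbers can be obtained with table(). A small sketch on mtcars, renaming the aggregate output column to Count so the two results line up:

```r
# Group counts via aggregate(), then rename the counted column.
counts <- aggregate(mpg ~ vs + am, data = mtcars, FUN = length)
names(counts)[names(counts) == "mpg"] <- "Count"

# Same counts via a contingency table turned into a data frame.
tab <- as.data.frame(table(vs = mtcars$vs, am = mtcars$am),
                     responseName = "Count")

print(counts)
print(tab)
```

Note that table() also reports combinations with zero rows, which aggregate() drops; in mtcars every vs/am combination occurs, so the two agree here.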

Using R aggregate function over a dataframe with names

I have a dataframe in the form of:
df:
RepName, Discount
Bob,Smith , 5383.24
Johh,Doe , 30349.21
...
The names are repeated. In the df, RepName is a factor and Discount is numeric. I want to calculate the mean per RepName. I can't seem to get the aggregate statement right.
I've tried:
#This doesn't work
repAggDiscount <- aggregate(repdf, by = repdf$RepName, FUN = mean)
#Not what I want:
repAggDiscount <- aggregate(repdf, by = list(repdf$RepName), FUN = mean)
I've also tried the following:
repnames <- lapply(repdf$RepName, toString)
repAggDiscount <- aggregate(repdf, by = repnames, FUN = mean)
But that gives me a length mismatch...
I've read the help but an example of how this should work for my data would go a long way... thanks!
I'm posting @AnandaMahto's answer here to close out the question. You can use the formula syntax
aggregate(Discount ~ RepName, repdf, mean)
Or you can use the by= syntax
repAggDiscount <- aggregate(repdf$Discount, by = list(repdf$RepName), FUN = mean)
The problem with your syntax was that you were trying to aggregate the whole data.frame, which includes the RepName column, where taking the mean doesn't make sense.
You could also do
repAggDiscount <- aggregate(repdf[, -1], by = repdf[, 1, drop = FALSE], FUN = mean)
which is closer to the matrix style syntax.
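To make the two working forms concrete, here is a sketch on a tiny made-up repdf (the names and discount values are invented for illustration, not the asker's data):

```r
# Made-up data: two reps, two discounts each.
repdf <- data.frame(
  RepName  = factor(c("Bob,Smith", "John,Doe", "Bob,Smith", "John,Doe")),
  Discount = c(100, 200, 300, 400)
)

# Formula interface: mean Discount per RepName.
by_formula <- aggregate(Discount ~ RepName, data = repdf, FUN = mean)

# by= interface: pass only the numeric column, and name the group.
by_list <- aggregate(repdf$Discount, by = list(RepName = repdf$RepName),
                     FUN = mean)
names(by_list)[2] <- "Discount"

print(by_formula)
print(by_list)
```

Naming the element inside list() avoids the default Group.1 column name in the by= form.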

Recovering tapply results into the original data-frame in R

I have a data frame with annual exports of firms to different countries in different years. I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a "tapply" command, like
incumbents <- tapply(id, destination-year, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has one entry per destination-year group, and I need it to have length length(id) (there are many firms serving each destination each year) so that I can use it in a subsequent regression, matched by year and destination. A "for" loop can do this, but it is very time-consuming because the database is huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, destination-year, FUN=function(x) length(unique(x)))
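A small worked sketch of the ave approach, using made-up data and passing destination and year as separate grouping vectors instead of a combined column:

```r
# Made-up data; id is numeric so ave() returns numeric counts.
dat <- data.frame(
  id          = c(1, 2, 2, 3, 1),
  destination = c("A", "A", "A", "B", "B"),
  year        = c(2000, 2000, 2000, 2000, 2001)
)

# One value per row: number of distinct firms in that row's
# destination/year group.
dat$incumbents <- ave(dat$id, dat$destination, dat$year,
                      FUN = function(x) length(unique(x)))
print(dat)
```

The result has the same length as id, so it can go straight into the regression data without a merge.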
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, as it seems you've done, tapply can handle both variables directly as a list (melt here comes from the reshape2 package, or the older reshape):
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr)
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[,transform(.SD,newCol = length(unique(id))),by = destinationYear]

Melt and dcast based on the name of the original data frame column

I'm having a hard time reshaping a dataframe for use with error bar plots, combining all the columns with central-tendency data and, separately, all the columns with error data.
I start with a data frame with a column for the independent variable, and then two columns for each measured parameter: one for the average value, and one for the error, as you'd typically format a spreadsheet with this kind of data. The initial data frame looks like this:
df<-data.frame(
indep=1:3,
Amean=runif(3),
Aerr=rnorm(3),
Bmean=runif(3),
Berr=rnorm(3)
)
I'd like to use melt and dcast to get it into a form that looks like this:
df.cast<-data.frame(
indep=rep(1:3, 2),
series=c(rep("A", 3),
rep("B", 3)),
means=runif(6),
errs=rnorm(6)
)
So that I can then feed it to ggplot like this:
qplot(data=df.cast, x=indep, y=means, ymin=means-errs, ymax=means+errs,
col=series, geom="errorbar")
I've been trying to melt and then recast using expressions like this:
df.melt<-melt(df, id.vars="indep")
dcast(df.melt,
indep~(variable=="Amean"|variable=="Bmean") + (variable=="Aerr"|variable=="Berr")
)
but these return a dataframe with funny boolean columns.
I could manually make two dataframes (one for the mean values, one for the errors), melt them separately, and recombine, but surely there must be a more elegant way?
This is how I would do it:
# Melt the data
mdf <- melt(df, id.vars="indep")
# Separate the series from the statistic with regular expressions
mdf$series <- gsub("([A-Z]).*", "\\1", mdf$variable)
mdf$stat <- gsub("[A-Z](.*)", "\\1", mdf$variable)
# Cast the data (after dropping the original melt variable)
cdf <- dcast(mdf[, -2], indep+series ~ stat)
# Plot
qplot(data=cdf, x=indep, y=mean, ymin=mean-err, ymax=mean+err,
colour=series, geom="errorbar")
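The regular-expression step is the part that does the real work, so here it is in isolation. A sketch assuming every column name is a single uppercase series letter followed by the statistic name, as in the question:

```r
# Column names as melt() would put them in the variable column.
variable <- c("Amean", "Aerr", "Bmean", "Berr")

# Keep only the leading uppercase letter: the series.
series <- sub("^([A-Z]).*$", "\\1", variable)

# Drop the leading uppercase letter: the statistic.
stat <- sub("^[A-Z](.*)$", "\\1", variable)

print(data.frame(variable, series, stat))
```

Series names longer than one letter would need a different pattern, for example one that splits on a separator character.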
You can accomplish it using reshape in base R
df.cast <- reshape(df, varying = 2:5, direction = 'long', timevar = 'series',
v.names = c('mean', 'err'), times = c('A', 'B'))
qplot(data = df.cast, x = indep, y = mean, ymin = mean - err, ymax = mean + err,
colour = series, geom = "errorbar")
