Using the R aggregate function over a data frame with names

I have a data frame in the form of:
df:
RepName, Discount
Bob,Smith, 5383.24
John,Doe, 30349.21
...
The names are repeated. In the df, RepName is a factor and Discount is numeric. I want to calculate the mean Discount per RepName, but I can't seem to get the aggregate statement right.
I've tried:
#This doesn't work
repAggDiscount <- aggregate(repdf, by = repdf$RepName, FUN = mean)
#Not what I want:
repAggDiscount <- aggregate(repdf, by = list(repdf$RepName), FUN = mean)
I've also tried the following:
repnames <- lapply(repdf$RepName, toString)
repAggDiscount <- aggregate(repdf, by = repnames, FUN = mean)
But that gives me a length mismatch...
I've read the help, but an example of how this should work for my data would go a long way... thanks!

I'm posting @AnandaMahto's answer here to close out the question. You can use the formula syntax:
aggregate(Discount ~ RepName, repdf, mean)
Or you can use the by= syntax:
repAggDiscount <- aggregate(repdf$Discount, by = list(repdf$RepName), FUN = mean)
The problem with your syntax was that you were trying to aggregate the whole data.frame, which includes the RepName column, where taking a mean doesn't make sense.
You could also do
repAggDiscount <- aggregate(repdf[, -1], by = repdf[, 1, drop = FALSE], FUN = mean)
which is closer to the matrix style syntax.
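As a quick reproducible check, here is a minimal sketch; the repdf below is made-up data standing in for the OP's, so the names and values are hypothetical:
# Hypothetical stand-in for the OP's data
repdf <- data.frame(
  RepName  = factor(c("Bob,Smith", "John,Doe", "Bob,Smith", "John,Doe")),
  Discount = c(5383.24, 30349.21, 1200, 800)
)
# Formula interface: mean Discount per RepName
aggregate(Discount ~ RepName, repdf, mean)
# Equivalent by= form; note that by= must be a list (or a data frame)
aggregate(repdf$Discount, by = list(RepName = repdf$RepName), FUN = mean)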

Related

Trouble constructing a function properly in R

In the code below, I'm trying to find the mean correct score for each item in the "category" column of the "regular season" dataset I'm working with.
rs_category <- list2env(split(regular_season, regular_season$category),
.GlobalEnv)
unique_categories <- unique(regular_season$category)
for (i in unique_categories)
Mean_[i] <- mean(regular_season$correct[regular_season$category == i], na.rm = TRUE, .groups = 'drop')
eapply(rs_category, Mean_[i])
print(i)
I'm having trouble getting this to work, though. I have created a list of the items in the category as sub-datasets, and separately I think I have created a vector of the unique items in the category in order to run the for loop. I have a feeling the problem may be with how I defined the mean function, because an error occurs at the eapply() line and tells me "Mean_[i]" is not a function, but I can't think of how else to define the function. If someone could help, I would greatly appreciate it.
The issue is that 'Mean_' was never created before the loop, so Mean_[i] has nothing to index into (and Mean_[i] is a value, not a function, which is why eapply() complains). In the code below, we initialize the object 'Mean_' as a numeric vector with the same length as 'unique_categories', then loop over the sequence of 'unique_categories', take the subset of 'correct', apply the mean function and store the result as the ith value of 'Mean_':
Mean_ <- numeric(length(unique_categories))
for(i in seq_along(unique_categories)) {
  Mean_[i] <- mean(regular_season$correct[regular_season$category == unique_categories[i]],
                   na.rm = TRUE)
}
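A small optional addition, not part of the original answer: naming the result makes it easier to see which mean belongs to which category.
names(Mean_) <- unique_categories
Mean_   # named numeric vector, one mean per category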
If we need faster execution, use data.table:
library(data.table)
setDT(regular_season)[, .(Mean_ = mean(correct, na.rm = TRUE)), by = category]
Or using collapse:
library(collapse)
fmean(slt(regular_season, correct), g = regular_season$category)
Instead of splitting the dataset and using a for loop, R has functions for such grouping operations which can be used here. You can apply a function to each unique group (value):
library(dplyr)
regular_season %>%
  group_by(category) %>%
  summarise(Mean_ = mean(correct, na.rm = TRUE)) -> result
This gives you average value of correct for each category, where result$Mean_ is the vector that you are looking for.
In base R, this can be solved with aggregate.
result <- aggregate(correct~category, regular_season, mean, na.rm = TRUE)

Getting object not found error when calling Aggregate Function with FUN=count

When I call the aggregate function I get the error: Error in match.fun(FUN) : object 'count' not found
I have tried updating R as well as using the plyr package, but the latter does not give me the result I want.
aggregate(soybean.table, by=list(soybean$seed.tmt, soybean$germination), FUN=count)
count is a function in both dplyr and plyr. The dplyr/plyr count requires a data.frame/data.table/tbl_df as input. With aggregate, the option to find the number of rows is length:
aggregate(soybean.table, by=list(soybean$seed.tmt, soybean$germination), FUN= length)
As a reproducible example
aggregate(mtcars, list(mtcars$vs, mtcars$am), FUN = length)
It may be better to select one column and group by the columns of interest, since the length is the same for any column as long as the grouping variables are the same, e.g.
aggregate(mpg ~ vs + am, mtcars, FUN = length)
Or to have a different column name:
aggregate(cbind(Count = seq_along(mpg)) ~ vs + am, mtcars, FUN = length)
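For completeness, a minimal sketch of how the dplyr count is meant to be called, i.e. on the data frame itself rather than as FUN=; mtcars stands in here for the OP's soybean data:
library(dplyr)
count(mtcars, vs, am)   # one row per vs/am combination, with the count in a column named n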

Data Table R - Aggregation

I want to aggregate data in R in a very generic way, with the right-hand side (the grouping columns) stored in an object as a string. Below is the example expression:
aggregate(PATTERN_ID ~ Year + Week, data, length)
So in my case the right-hand side, which is "Year + Week", is going to change as required, and I want to pass it as a string stored in a variable. I tried an evaluation strategy but it does not give the required output. Below is what I have tried:
exp_aggregate_by = 'Year + Week'
aggregate(PATTERN_ID ~ eval.quoted(parse(text = exp_aggregate_by)), data, length)
Any input will be much appreciated. Doing it through data.table is also fine. Thanks!
Create a formula with paste and it should work
data(mtcars)
grp <- 'cyl + gear'
aggregate(formula(paste('mpg ~', grp)), mtcars, length)
For the OP's dataset,
aggregate(formula(paste('PATTERN_ID ~', exp_aggregate_by)), data, length)
An answer with data.table. I'm also including the answer using formula in aggregate for completeness.
vars <- c('Year', 'Week')
# with aggregate
form <- formula(paste('PATTERN_ID', paste(vars, collapse = '+'), sep = '~'))
aggregate(form, data, length)
# with data.table
setDT(data)
data[, length(PATTERN_ID), by = vars]
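As a side note not taken from the answers above, base R's reformulate() builds the same formula without manual pasting, and data.table's .N is the idiomatic row counter; 'form2' is just an illustrative name:
# base R: build PATTERN_ID ~ Year + Week from the character vector
form2 <- reformulate(termlabels = vars, response = 'PATTERN_ID')
aggregate(form2, data, length)
# data.table: .N counts rows per group ('data' is already a data.table after setDT above)
data[, .N, by = vars]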

Passing column names in R function containing is.na() and median

I have data with Income, spending, population and state columns. Income, spending and population have missing values.
I have created a for loop to replace the missing values with the median, calculated state-wise. However, I have to run the for loop separately for Income, spending and population. I tried to create a function so I could pass just the column name, but it gives me an error with is.na(). Here is the for loop:
for (i in (unique(data$State))) {
  data$Income[is.na(data$Income) & data$State==i] <-
    median(data$Income[data$State==i], na.rm = TRUE)
}
In place of Income I tried making a function and passing x, but it is not working. Can someone help me write this function? I tried a few things but it gave me an error with is.na.
Med_sub <- function(x){
  for (i in (unique(data$State))) {
    data$x[is.na(data$x) & data$State==i] <- median(data$x[data$State==i], na.rm = TRUE)
  }
}
Med_sub(Income)
Med_sub(Population)
I am new to R. Any help would be greatly appreciated.
Consider a base R two-liner with ave (the inline aggregate function that slices numeric columns by factors) and ifelse all wrapped in a sapply loop:
median_fill <- function(x) ifelse(is.na(x), median(x, na.rm=TRUE), x)
data[c("Income","spending","population")] <- sapply(data[c("Income","spending","population")],
function(i) ave(i, data$state, FUN=median_fill))
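A tiny reproducible check of the ave approach, using a made-up toy data frame rather than the OP's data:
# The NA in state "A" is replaced by that state's median (10 here)
toy <- data.frame(state = c("A", "A", "B"), Income = c(10, NA, 30))
toy$Income <- ave(toy$Income, toy$state, FUN = median_fill)
toy$Income   # 10 10 30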
A tidyverse three-liner:
library(dplyr)
data %>%
  group_by(State) %>%
  mutate_all(.funs = funs(coalesce(., median(., na.rm=TRUE))))
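To answer the original question of passing a column name into a function, one minimal sketch (not from the answers above; the string-argument convention is an assumption) is to pass the name as a string and index with [[ instead of $:
Med_sub <- function(data, col) {
  # 'col' is the column name as a string, e.g. "Income"
  for (i in unique(data$State)) {
    idx <- is.na(data[[col]]) & data$State == i
    data[[col]][idx] <- median(data[[col]][data$State == i], na.rm = TRUE)
  }
  data   # return the modified copy; R does not change 'data' in place
}
data <- Med_sub(data, "Income")
data <- Med_sub(data, "Population")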

Recovering tapply results into the original data-frame in R

I have a data frame with annual exports of firms to different countries in different years. My problem is I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a tapply command, like:
incumbents <- tapply(id, destination-year, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) (there are many firms each year serving each destination) so I can use it in a subsequent regression (of course, in a way that matches the year and the destination). A for loop can do this, but it is very time-consuming since the database is kind of huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, destination-year, FUN=function(x) length(unique(x)))
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, like it seems you've done, tapply can handle both variables directly as a list:
library(reshape2)   # melt() comes from reshape2
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr)
dat <- ddply(dat, .(destinationYear), transform, newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[, transform(.SD, newCol = length(unique(id))), by = destinationYear]
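A more idiomatic data.table form, not in the original answer, adds the column by reference with := instead of rebuilding the table:
datTable[, newCol := length(unique(id)), by = destinationYear]   # modifies datTable in place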
