I use the following data.frame as an example:
d <- data.frame(x=c(1,NA), y=c(2,3))
I'd like to sum up the values of y by the variable x. Since there is no common value of x, I would expect aggregation to just give me the original data.frame back, where NA is treated as a group. But aggregation gives me the following results.
> aggregate(y ~ x, data=d, FUN=sum)
x y
1 1 2
I've read the documentation about changing the default na.action, but it doesn't seem to give me anything meaningful.
> aggregate(y ~ x, data=d, FUN=sum, na.action=na.pass)
x y
1 1 2
What is going on? I don't seem to understand what na.pass is doing in this case. Is there an option to accomplish what I want in R? Any help would be greatly appreciated.
aggregate makes use of tapply, which in turn makes use of factor on its grouping variable.
But, look at what happens with NA values in factor:
factor(c(1, 2, NA))
# [1] 1 2 <NA>
# Levels: 1 2
Note the levels. You can make use of addNA to keep the NA:
addNA(factor(c(1, 2, NA)))
# [1] 1 2 <NA>
# Levels: 1 2 <NA>
Thus, you would probably need to do something like:
aggregate(y ~ addNA(x), d, sum)
# addNA(x) y
# 1 1 2
# 2 <NA> 3
Or something like:
d$x <- addNA(factor(d$x))
str(d)
# 'data.frame': 2 obs. of 2 variables:
# $ x: Factor w/ 2 levels "1",NA: 1 2
# $ y: num 2 3
aggregate(y ~ x, d, sum)
# x y
# 1 1 2
# 2 <NA> 3
(Alternatively, consider upgrading to something like "data.table", which is not just faster than aggregate but also gives you more consistent behavior with NA values, with no need to worry about whether or not you're using the formula method of aggregate.)
library(data.table)
as.data.table(d)[, sum(y), by = x]
# x V1
# 1: 1 2
# 2: NA 3
I am an R beginner and I am stuck and can't find a solution. Any remarks are highly appreciated. Here is the problem:
I have a data frame df.
The columns are converted to character (the attributes) and numeric.
I want to reduce the data frame by using the aggregate function (dplyr is not an option).
When I am aggregating using
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1")], sum)
I get correct results. But I want to group by more attributes. When adding more attributes, for example
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
then at some point the aggregate result changes: the sum of AMOUNT is no longer equal to the result of the first aggregation (or of the original data frame).
Does anyone have an idea what causes this behavior?
My best guess is that you have missing values in some of your grouping columns. Demonstrating on the built-in mtcars data, which has no missing values, everything is fine:
sum(mtcars$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am")], sum)$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am", "cyl")], sum)$mpg)
# [1] 642.9
But if we introduce a missing value in a grouping variable, it is not included in the aggregation:
mt = mtcars
mt$cyl[1] = NA
sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 621.9
The easiest fix would be to fill in the missing values with something other than NA, perhaps the string "missing".
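A minimal sketch, continuing the mtcars example above (the "missing" label is arbitrary):
mt$cyl[is.na(mt$cyl)] <- "missing"
sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 642.9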
I think #Gregor has correctly pointed out that the problem could be a grouping variable containing NA. dplyr handles NA in grouping variables differently than aggregate does.
We have an alternative solution with aggregate. Please note that the documentation says:
`by` a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Here is the clue. You can convert your grouping variables to factors using exclude = "", which will ensure that NA is kept as a factor level.
set.seed(1)
df <- data.frame(ATTRIBUTE1 = sample(LETTERS[1:3], 10, replace = TRUE),
ATTRIBUTE2 = sample(letters[1:3], 10, replace = TRUE),
AMOUNT = 1:10)
df$ATTRIBUTE2[5] <- NA
aggregate(df["AMOUNT"], by = list(factor(df$ATTRIBUTE1,exclude = ""),
factor(df$ATTRIBUTE2, exclude="")), sum)
# Group.1 Group.2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7
# 8 A <NA> 5
For comparison, the result when the grouping variables are not explicitly converted to factors (so the NA group is dropped) is:
aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
# ATTRIBUTE1 ATTRIBUTE2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7
I use the aggregate function to get counts by group. The aggregate function only returns counts for groups where the count > 0. This is what I have:
dt <- data.frame(
n = c(1,2,3,4,5,6),
id = c('A','A','A','B','B','B'),
group = c("x","x","y","x","x","x"))
applying the aggregate function
my.count <- aggregate(n ~ id+group, dt, length)
now see the results
my.count[order(my.count$id),]
I get the following:
id group n
1 A x 2
3 A y 1
2 B x 3
I need the following (the last row has the zero that I need):
id group n
1 A x 2
3 A y 1
2 B x 3
4 B y 0
Thanks for your help in advance.
We can create another column 'ind' and then use dcast to reshape from 'long' to 'wide', specifying the fun.aggregate as length and drop=FALSE.
library(reshape2)
dcast(transform(dt, ind='n'), id+group~ind,
value.var='n', length, drop=FALSE)
# id group n
#1 A x 2
#2 A y 1
#3 B x 3
#4 B y 0
Or a base R option is
as.data.frame(table(dt[-1]))
You can merge your "my.count" object with the complete set of "id" and "group" columns:
merge(my.count, expand.grid(lapply(dt[c("id", "group")], unique)), all = TRUE)
## id group n
## 1 A x 2
## 2 A y 1
## 3 B x 3
## 4 B y NA
There are several questions on SO that show you how to replace NA with 0 if that is required.
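For example, continuing from the merge() above (a minimal sketch):
out <- merge(my.count, expand.grid(lapply(dt[c("id", "group")], unique)), all = TRUE)
out$n[is.na(out$n)] <- 0
out
## id group n
## 1 A x 2
## 2 A y 1
## 3 B x 3
## 4 B y 0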
aggregate with drop=FALSE worked for me.
my.count <- aggregate(n ~ id+group, dt, length, drop=FALSE)
my.count[is.na(my.count)] <- 0
my.count
# id group n
# 1 A x 2
# 2 B x 3
# 3 A y 1
# 4 B y 0
If you are interested in frequencies only, you can create a frequency table with your formula and turn it into a data frame:
as.data.frame(xtabs(formula = ~ id + group, dt))
Obviously this won't work for other aggregate functions. I'm still waiting for dplyr's summarise function to let the user decide whether zero-groups are kept or not. Maybe you can vote for this improvement here: https://github.com/hadley/dplyr/issues/341
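For what it's worth, newer versions of dplyr (0.8 and later) added a .drop argument to group_by(); here is a sketch assuming such a version, with the grouping columns converted to factors so the empty combination is kept:
library(dplyr)
dt %>%
  mutate(id = factor(id), group = factor(group)) %>%
  group_by(id, group, .drop = FALSE) %>%
  summarise(n = n())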
Does anybody know how to aggregate by NA in R?
If you take the example below:
a <- matrix(1,5,2)
a[1:2,2] <- NA
a[3:5,2] <- 2
aggregate(a[,1], by=list(a[,2]), sum)
The output is:
Group.1 x
2 3
But is there a way to get the output to include NA as a group, like this:
Group.1 x
2 3
NA 2
Thanks
Instead of aggregate(), you may want to consider rowsum(). It is actually designed for this exact operation on matrices and is known to be much faster than aggregate(). We can add NA to the factor levels of a[, 2] with addNA(). This will assure that NA shows up as a grouping variable.
rowsum(a[, 1], addNA(a[, 2]))
# [,1]
# 2 3
# <NA> 2
If you still want to use aggregate(), you can incorporate addNA() as well.
aggregate(a[, 1], list(Group = addNA(a[, 2])), sum)
# Group x
# 1 2 3
# 2 <NA> 2
And one more option with data.table:
library(data.table)
as.data.table(a)[, .(x = sum(V1)), by = .(Group = V2)]
# Group x
# 1: NA 2
# 2: 2 3
Use summarize from dplyr
library(dplyr)
a %>%
as.data.frame %>%
group_by(V2) %>%
summarize(V1_sum = sum(V1))
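Note that dplyr keeps NA as its own group, so the NA rows get their own sum here.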
Using sqldf:
a <- as.data.frame(a)
sqldf("SELECT V2 [Group], SUM(V1) x
FROM a
GROUP BY V2")
Output:
Group x
1 NA 2
2 2 3
stats package
A variation of AdamO's proposal:
data.frame(xtabs(V1 ~ V2, data = as.data.frame(a), na.action = na.pass, exclude = NULL))
Output:
V2 Freq
1 2 3
2 <NA> 2
You can also try aggregating by is.na(a[,2]) instead.
aggregate(a[,1], by=list(is.na(a[,2])), sum)
# Group.1 x
# 1 FALSE 3
# 2 TRUE 2
If you want a finer distinction than just NA or not, then you may want to define a new variable that uses a previously unused value to denote NA (a factor would be more elegant, but a numeric vector is the simplest):
b <- a[,2]
b[is.na(b)] <- 999
aggregate(a[,1], by=list(b), sum)
# Group.1 x
# 1 2 3
# 2 999 2
The addNA solution of Rich doesn't require any substantial change to the aggregate syntax, so I think it's the best solution. I'll point out that another option, which produces output similar to table (and thus can be coerced into a data.frame structure similar to that of aggregate) is xtabs.
xtabs(a[, 1] ~ a[, 2], addNA=T)
Gives:
Group.1 x
1 2 3
2 <NA> 2
Another "trick" I see is assigning a missing code to these data. We all like the NA output of R, but assigning a missing code to a grouping variable is a good coding exercise. We take it so that it has one more digit than the largest value in the dataset and is of the form -999...99.
codemiss <- function(x) -10^(floor(log(max(abs(x), na.rm = TRUE), base = 10)) + 2) + 1
works in general.
Then you get
a[, 2][is.na(a[, 2])] <- codemiss(a[, 2])
And:
aggregate(a[, 1], list(a[, 2]), sum)
Gives you:
Group.1 x
1 -99 2
2 2 3
I have a data frame similar to the dummy example here:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
In the original data frame, there are many more groups, each with 10 values. For each group (a, b or c) I would like to extract the first line where the value != NA, but only the first such line. Since within a group there can be several values that differ from NA and from each other, I can't simply subset.
I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:
ddply(df, .(Group), function(sub_data) {
  for (i in 1:length(sub_data$Value)) {
    if (sub_data$Value[i] != 'NA') {
      # take the value, but only for the first non-NA
      return(first line that satisfies)
    }
  }
})
Maybe this is easy with other strategies that I don't know of.
Any suggestion is very much appreciated!
I know this has been answered but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
> df[ Value != "NA", .SD[1], by=Group ]
Group Value
1: a 10
2: b 4
3: c 2
Do yourself a favor and learn data.table.
Some other notes:
You can easily convert data.frames to data.tables (a conversion sketch follows these notes)
I think that you don't want "NA" but simply NA in your example; in that case the syntax is:
df[ ! is.na(Value), .SD[1], by=Group ]
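For example, a quick conversion sketch, recreating df with real NAs as suggested above (as.data.table() copies, while setDT() converts in place):
library(data.table)
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c(NA, NA, 10, NA, 4, 8, NA, NA, 2))
setDT(df)                               # df is now a data.table
df[!is.na(Value), .SD[1], by = Group]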
Since you suggested plyr in the first place:
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)
That assumes you have NAs and not 'NA's. If the latter (not recommended), then:
ddply(subset(df, Value != 'NA'), .(Group), head, 1L)
Note how concise this is. I would agree with using plyr.
If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:
df <- data.frame(Group=rep(letters[1:3], each=3),
                 Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
print(df)
## Group Value
## 1 a <NA>
## 2 a <NA>
## 3 a 10
## 4 b <NA>
## 5 b 4
## 6 b 8
## 7 c <NA>
## 8 c <NA>
## 9 c 2
df.1 <- by(df, df$Group, function(x) {
head(x[complete.cases(x),], 1)
})
print(df.1)
## df$Group: a
## Group Value
## 3 a 10
## ------------------------------------------------------------------------
## df$Group: b
## Group Value
## 5 b 4
## ------------------------------------------------------------------------
## df$Group: c
## Group Value
## 9 c 2
First you should take care of NA's:
options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
And then maybe something like this would do the trick:
for(i in unique(df$Group)) {
for(j in df$Value[df$Group==i]) {
if(!is.na(j)) {
print(paste(i,j))
break
}
}
}
Assuming that Value is actually numeric, not character.
> df <- data.frame(Group=rep(letters[1:3],each=3),
                   Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2))
> do.call(rbind, lapply(split(df, df$Group), function(x){
x[ is.na(x[,2]) == FALSE, ][1,]
}))
## Group Value
## a a 10
## b b 4
## c c 2
I don't see any solutions using aggregate(...), which would be the simplest:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
# Group Value
# 1 a 10
# 2 b 4
# 3 c 2
If your df contains actual NA, and not "NA" as in your example, then use this:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
Group Value
1 a 10
2 b 4
3 c 2
Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.
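Before that, one way to turn the 'NA' strings into real NAs might be (a sketch; it assumes Value was read as character or factor):
df$Value[df$Value == "NA"] <- NA
df$Value <- as.numeric(as.character(df$Value))  # also make Value numeric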
Similar in spirit to #hrbrmstr's by() but to my eyes aggregate() gives nicer output:
> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
Group Value
1 a 10
2 b 4
3 c 2
> aggregate(df$Value, list(Group = df$Group), foo)
Group x
1 a 10
2 b 4
3 c 2
I have a big data frame with state names in one column and different indexes in the other columns.
I want to subset by state and create an object suitable for minimizing the index, or a data frame with the calculation already done.
Here's one simple (short) example of what I have
m
x y
1 A 1.0
2 A 2.0
3 A 1.5
4 B 3.0
5 B 3.5
6 C 7.0
I want to get this
m
x y
1 A 1.0
2 B 3.0
3 C 7.0
I don't know if a function with a for loop is necessary, something like:
minimize <- function(x, ...) {
  for (i in m$x) {
    # do something with the data by factor value
    # apply the min function to that something in every column
    return(y)
  }
}
so when you call
minimize(A)
[1] 1
I tried to use %in%, but it didn't work (I got this error):
A %in% m
Error in match(x, table, nomatch = 0L) : object 'A' not found
When I define it first, it goes like this:
A <- c("A")
"A" %in% m
[1] FALSE
Thank you in advance
Use aggregate
> aggregate(. ~ x, data = m, FUN = min)
x y
1 A 1
2 B 3
3 C 7
See this post to get some other alternatives.
Try aggregate:
aggregate(y ~ x, m, min)
x y
1 A 1
2 B 3
3 C 7
Using data.table
require(data.table)
m <- data.table(m)
m[, j=min(y), by=x]
# x V1
# 1: A 1
# 2: B 3
# 3: C 7