I have the following data frame:
a <- c(1,1,4)
b <- c(1,0,2)
c <- data.frame(a=a,b=b)
c
# a b
#1 1 1
#2 1 0
#3 4 2
I would like to aggregate the data frame c in the following way:
aggregate(b~a,FUN=mean,data=c)
# a b
#1 1 0.5
#2 4 2.0
However, my main problem is that the name of the grouping column will be held in a variable:
d <- 'a'
If I try to aggregate using this variable d that contains the name of the column, I will obviously get an error:
aggregate(b~d,FUN=mean,data=c)
#Error in model.frame.default(formula = b ~ d, data = c) : variable lengths differ (found for 'd')
This works but I then get silly column names. I would like to avoid the extra step of renaming columns:
aggregate(c[,'b']~c[,d],FUN=mean,data=c)
# c[, d] c[, "b"]
#1 1 0.5
#2 4 2.0
How do I aggregate and also get the right column names on the first try?
(Maybe there is no way to do this)
You could try
aggregate(c['b'], c[d], FUN=mean)
# a b
# 1 1 0.5
# 2 4 2.0
Another option if you are using the formula method would be to use setNames
setNames(aggregate(b~get(d), FUN=mean, data=c), colnames(c))
# a b
#1 1 0.5
#2 4 2.0
If you're not wedded to aggregate(...) in base R, here is a data.table solution.
library(data.table)
setDT(c)[,list(b=mean(b)),by=d,with=TRUE]
# a b
# 1: 1 0.5
# 2: 4 2.0
You can use cbind to set the names in aggregate. This method also shows that you can leave out the data argument. So if we use your original plan, you can do
aggregate(cbind(b = c[, "b"]) ~ cbind(a = c[, "a"]), FUN = mean)
# a b
# 1 1 0.5
# 2 4 2.0
The way I solved this was to construct the formula parameter in paste:
aggregate(formula(paste0("b ~ ", d)), data = c, FUN = mean)
This way you can easily pass in as many variables for colnames to as complex a formula as desired.
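A closely related base R route is reformulate(), which builds the formula from character input without pasting strings together. This is only a minimal sketch: the data frame is renamed dat here (my own name, to avoid masking c()), everything else follows the question.

```r
# Example data from the question; 'dat' is a name chosen for this sketch
dat <- data.frame(a = c(1, 1, 4), b = c(1, 0, 2))
d <- "a"  # name of the grouping column, held in a variable

# reformulate() turns the character inputs into the formula b ~ a
f <- reformulate(d, response = "b")

# The formula uses the real column names, so the result keeps them
res <- aggregate(f, data = dat, FUN = mean)
res
```

Passing a longer character vector as the first argument, e.g. reformulate(c("a", "g"), response = "b") for a hypothetical second column g, should group by several columns in the same way.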
Does anybody know how to aggregate by NA in R?
If you take the example below
a <- matrix(1,5,2)
a[1:2,2] <- NA
a[3:5,2] <- 2
aggregate(a[,1], by=list(a[,2]), sum)
The output is:
Group.1 x
2 3
But is there a way to get the output to include NAs in the output like this:
Group.1 x
2 3
NA 2
Thanks
Instead of aggregate(), you may want to consider rowsum(). It is designed for exactly this operation on matrices and is typically much faster than aggregate(). We can add NA to the factor levels of a[, 2] with addNA(). This ensures that NA shows up as a grouping value.
rowsum(a[, 1], addNA(a[, 2]))
# [,1]
# 2 3
# <NA> 2
If you still want to use aggregate(), you can incorporate addNA() as well.
aggregate(a[, 1], list(Group = addNA(a[, 2])), sum)
# Group x
# 1 2 3
# 2 <NA> 2
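To see why addNA() makes the difference, it can help to look at the factor it builds. A small sketch, using a stand-alone vector g (my own name) with the same values as a[, 2]:

```r
# Same grouping values as the second column of the matrix in the question
g <- c(NA, NA, 2, 2, 2)

f_default <- factor(g)  # NA is not a level, so those rows fall out of any grouping
f_with_na <- addNA(g)   # NA becomes an explicit level of the factor

levels(f_default)       # only "2"
levels(f_with_na)       # "2" and NA

# Every element now has a group code, including the former missing values
codes <- as.integer(f_with_na)
codes
```

Because every row gets a valid group code, grouping functions such as aggregate() and rowsum() no longer drop those rows.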
And one more option with data.table -
library(data.table)
as.data.table(a)[, .(x = sum(V1)), by = .(Group = V2)]
# Group x
# 1: NA 2
# 2: 2 3
Use summarize from dplyr
library(dplyr)
a %>%
as.data.frame %>%
group_by(V2) %>%
summarize(V1_sum = sum(V1))
Using sqldf:
library(sqldf)
a <- as.data.frame(a)
sqldf("SELECT V2 [Group], SUM(V1) x
FROM a
GROUP BY V2")
Output:
Group x
1 NA 2
2 2 3
stats package
A variation of AdamO's proposal:
data.frame(xtabs( V1 ~ V2 , data = a,na.action = na.pass, exclude = NULL))
Output:
V2 Freq
1 2 3
2 <NA> 2
You can also try aggregating by is.na(a[,2]) instead.
aggregate(a[,1], by=list(is.na(a[,2])), sum)
# Group.1 x
# 1 FALSE 3
# 2 TRUE 2
If you want a finer distinction than just NA or not, then you may want to define a new variable that uses a previously unused value to denote NA (a factor would be more elegant, but a numeric vector is the simplest):
b <- a[,2]
b[is.na(b)] <- 999
aggregate(a[,1], by=list(b), sum)
# Group.1 x
# 1 2 3
# 2 999 2
The addNA solution of Rich doesn't require any substantial change to the aggregate syntax, so I think it's the best solution. I'll point out that another option, which produces output similar to table (and thus can be coerced into a data.frame structure similar to that of aggregate) is xtabs.
xtabs(a[, 1] ~ a[, 2], addNA=T)
Gives:
Group.1 x
1 2 3
2 <NA> 2
Another "trick" I see is assigning a missing code to these data. We all like the NA output of R, but assigning a missing code to a grouping variable is a good coding exercise. We take it so that it has one more digit than the largest value in the dataset and is of the form -999...99.
codemiss <- function(x) -(10^(floor(log(max(abs(x), na.rm=TRUE), base=10))+2)-1)
works in general.
Then you get
a[, 2][is.na(a[, 2])] <- codemiss(a[, 2])
And:
aggregate(a[, 1], list(a[, 2]), sum)
Gives you:
Group.1 x
1 -99 2
2 2 3
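The arithmetic behind such a missing code can be spelled out step by step. This is a sketch only: missing_code is a hypothetical helper name, and it assumes the data contain at least one non-missing value with magnitude of 1 or more.

```r
# Hypothetical helper: an all-nines negative code with one more digit
# than the largest magnitude in x
missing_code <- function(x) {
  largest <- max(abs(x), na.rm = TRUE)
  digits  <- floor(log10(largest)) + 1  # number of digits in the largest value
  -(10^(digits + 1) - 1)                # e.g. 1 digit -> -99, 3 digits -> -9999
}

code_small <- missing_code(c(NA, 2))       # largest value 2 has 1 digit
code_large <- missing_code(c(5, 120, NA))  # largest value 120 has 3 digits
code_small
code_large
```

The outer parentheses matter: in R, -10^2 - 1 evaluates to -101, not -99, because the unary minus is applied after the exponentiation.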
Pass character vectors and column names to data.table as a list of columns?
I want to be able to produce a subset of columns in R using data.table in a way that I can determine some of them earlier on and pass the predetermined list on as a character vector, then combine with a static list of columns.
That is, given this:
a <- 1:4
b <- 5:8
c <- c('aa','bb','cc','dd')
e <- 1:4
z <- data.table(a,b,c,e)
I want to do this:
z[, list(a,b)]
Which produces this output:
a b
1: 1 5
2: 2 6
3: 3 7
4: 4 8
But I want to do it in some way similar to this (which works, almost):
cols <- "b"
z[, list(get(cols), a)]
Results:
Note that it doesn't return the name of the column stored in cols
V1 a
1: 5 1
2: 6 2
3: 7 3
4: 8 4
but I need to do it with more than one element of cols (which does not work):
cols <- c('a', 'b')
z[, list(mget(cols), c)]
The above produces the following error:
Error: value for ‘a’ not found
I think my problem lies with scoping and which environments mget is looking in, but I can't figure out what exactly I am doing wrong. Also, how do I preserve the column titles?
Here are two (pretty much equivalent) options. One using lapply:
z[, c(lapply(cols, get), list(c))]
# V1 V2 V3
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
And one using mget:
z[, c(mget(cols, inherits = TRUE), c = list(c))]
# a b c
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
Note that get returns a vector which loses the information about column name (and there isn't much you can do about it besides manually adding it back in), while mget returns a named list.
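The difference in naming can be checked outside data.table as well. A minimal base R sketch, where env is a stand-in environment I create for illustration (inside z[...] the lookup happens among the table's columns instead):

```r
# Stand-in environment holding two "columns"
env <- list2env(list(a = 1:4, b = 5:8))
cols <- c("a", "b")

unnamed <- lapply(cols, get, envir = env)  # list of values, names are lost
named   <- mget(cols, envir = env)         # list of values, named "a" and "b"

names(unnamed)  # NULL
names(named)

# Adding the names back manually after lapply/get
renamed <- setNames(unnamed, cols)
```

This is why the lapply version above ends up with V1, V2, ... while the mget version keeps a and b.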
Attempting to mix standard and non-standard evaluation within a single call will probably end in tears / frustration / obfuscated code.
There are a number of options in data.table:
Use .. notation to "look up one level" to find the vector of column names
cols <- c('a','b')
z[, ..cols]
Use .SDcols
z[, .SD, .SDcols = cols]
But if you really want to combine the two ways of referencing, then you can use something like (introducing another option, with=FALSE, which allows more general expressions for column names than a simple vector)
ll <- function(char=NULL,uneval=NULL){
Call <- match.call()
cols <- lapply(Call$uneval,as.character)
unlist(c(char,cols))}
z[, ll(cols,c), with=FALSE]
# a b c
# 1: 1 5 aa
# 2: 2 6 bb
# 3: 3 7 cc
# 4: 4 8 dd
z[, ll(char=cols), with=FALSE]
# a b
# 1: 1 5
# 2: 2 6
# 3: 3 7
# 4: 4 8
z[, ll(uneval=c), with=FALSE]
# c
# 1: aa
# 2: bb
# 3: cc
# 4: dd
Combining a variable with column names with hard-coded column names in data.table
Given z and cols from the example above:
To combine a list of column names in a variable cols with another hard-coded column name c, we combine them in a new character vector c(cols, 'c') in the call to data.table. We can refer to cols from within j (the second argument within []) by using the "up-one-level" notation ..:
z[, c(..cols, 'c')]
Thank you to @thelatemail for providing the base to the solution above.
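If you would rather keep everything as plain character vectors, the combined vector can also be built first and then passed with with = FALSE or .SDcols. A sketch, assuming the data.table package is installed; wanted is a name I made up for the combined vector:

```r
library(data.table)

# Same table as in the question, built inline so no global vector named c is needed
z <- data.table(a = 1:4, b = 5:8, c = c("aa", "bb", "cc", "dd"), e = 1:4)

cols   <- c("a", "b")    # column names determined earlier
wanted <- c(cols, "c")   # combined with a hard-coded column name

res_with   <- z[, wanted, with = FALSE]     # j is treated as a vector of column names
res_sdcols <- z[, .SD, .SDcols = wanted]    # same selection through .SDcols

names(res_with)
names(res_sdcols)
```

Both forms keep the original column names, since the columns are selected by name rather than fetched with get().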
I have a data frame similar to the dummy example here:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
In the original data frame, there are many more groups, each with 10 values. For each group (a, b or c) I would like to extract the first row where Value is not NA, and only that first row. Because a group can contain several non-NA values that differ from each other, I can't simply subset.
I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:
ddply(df,.(Group),function(sub_data){
for(i in 1:length(sub_data$Value)){
if(sub_data$Value[i]!='NA'){
# take value, but only for the first non-NA
# return(first line that satisfies)
}
}
})
Maybe this is easy with other strategies that I don't know of
Any suggestion is very much appreciated!
I know this has been answered but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:
library(data.table)
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
> df[ Value != "NA", .SD[1], by=Group ]
Group Value
1: a 10
2: b 4
3: c 2
Do yourself a favor and learn data.table.
Some other notes:
You can easily convert data.frames to data.tables
I think that you don't want "NA" but simply NA in your example, in that case the syntax is:
df[ ! is.na(Value), .SD[1], by=Group ]
Since you suggested plyr in the first place:
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)
That assumes you have NAs and not 'NA's. If the latter (not recommended), then:
ddply(subset(df, Value != 'NA'), .(Group), head, 1L)
Note how concise this is. I would agree with using plyr.
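The same filter-then-take-first idea can also be sketched in base R without any packages. This assumes real NA values and a numeric Value column; non_missing and first_rows are names I chose for the sketch.

```r
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c(NA, NA, 10, NA, 4, 8, NA, NA, 2))

# Step 1: drop the rows with a missing Value
non_missing <- df[!is.na(df$Value), ]

# Step 2: keep only the first remaining row of each group;
# duplicated() is FALSE the first time a Group value is seen
first_rows <- non_missing[!duplicated(non_missing$Group), ]
first_rows
```

The original row names (3, 5 and 9 here) are kept, which makes it easy to check which line of each group was picked.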
If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:
df <- data.frame(Group=rep(letters[1:3], each=3),
Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
print(df)
## Group Value
## 1 a <NA>
## 2 a <NA>
## 3 a 10
## 4 b <NA>
## 5 b 4
## 6 b 8
## 7 c <NA>
## 8 c <NA>
## 9 c 2
df.1 <- by(df, df$Group, function(x) {
head(x[complete.cases(x),], 1)
})
print(df.1)
## df$Group: a
## Group Value
## 3 a 10
## ------------------------------------------------------------------------
## df$Group: b
## Group Value
## 5 b 4
## ------------------------------------------------------------------------
## df$Group: c
## Group Value
## 9 c 2
First you should take care of NA's:
options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
And then maybe something like this would do the trick:
for(i in unique(df$Group)) {
for(j in df$Value[df$Group==i]) {
if(!is.na(j)) {
print(paste(i,j))
break
}
}
}
Assuming that Value is actually numeric, not character.
> df <- data.frame(Group=rep(letters[1:3],each=3),
Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2))
> do.call(rbind, lapply(split(df, df$Group), function(x){
x[ is.na(x[,2]) == FALSE, ][1,]
}))
## Group Value
## a a 10
## b b 4
## c c 2
I don't see any solutions using aggregate(...), which would be the simplest:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
# Group Value
# 1 a 10
# 2 b 4
# 3 c 2
If your df contains actual NA, and not "NA" as in your example, then use this:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
Group Value
1 a 10
2 b 4
3 c 2
Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.
Similar in spirit to @hrbrmstr's by() but to my eyes aggregate() gives nicer output:
> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
Group Value
1 a 10
2 b 4
3 c 2
> aggregate(df$Value, list(Group = df$Group), foo)
Group x
1 a 10
2 b 4
3 c 2
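The point about 'NA' strings versus real NA is easy to verify, and the strings can be converted before aggregating. A small sketch using the Value column from the question; value_chr and value_num are names made up for illustration.

```r
# The Value column as given in the question: "NA" here is ordinary text
value_chr <- c("NA", "NA", "10", "NA", "4", "8", "NA", "NA", "2")
n_missing_before <- sum(is.na(value_chr))  # R sees no missing values at all

# Replace the text marker with a real missing value, then convert to numeric
value_chr[value_chr == "NA"] <- NA
value_num <- as.numeric(value_chr)

n_missing_after <- sum(is.na(value_num))   # now R recognises the missing entries
value_num
```

After this conversion, is.na(), complete.cases() and the na.action handling of aggregate() all behave as the answers above assume.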
I have a big data frame with state names in one column and different indexes in the other columns.
I want to subset by state and create an object suitable for minimization of the index or a data frame with the calculation already given.
Here's one simple (short) example of what I have
m
x y
1 A 1.0
2 A 2.0
3 A 1.5
4 B 3.0
5 B 3.5
6 C 7.0
I want to get this
m
x y
1 A 1.0
2 B 3.0
3 C 7.0
I don't know if a function with a for loop is necessary. Like
minimize<-function(x,...){
for (i in m$x){
# do something with data by factor value
# apply to that something the min function in every column
}
return(y)
}
so when you call
minimize(A)
[1] 1
I tried to use %in% but it didn't work (I got this error).
A%in%m
Error in match(x, table, nomatch = 0L) : object 'A' not found
When I define it it goes like this.
A<-c("A")
"A"%in%m
[1] FALSE
Thank you in advance
Use aggregate
> aggregate(.~x, FUN=min, m)
x y
1 A 1
2 B 3
3 C 7
See this post to get some other alternatives.
Try aggregate:
aggregate(y ~ x, m, min)
x y
1 A 1
2 B 3
3 C 7
Using data.table
require(data.table)
m <- data.table(m)
m[, j=min(y), by=x]
# x V1
# 1: A 1
# 2: B 3
# 3: C 7
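If a lookup object rather than a data frame is what you are after, tapply() in base R gives a named vector of minima. A sketch that rebuilds m from the printed example in the question (the construction of m is my assumption, since the question only shows its output):

```r
# Rebuild the example data frame shown in the question
m <- data.frame(x = c("A", "A", "A", "B", "B", "C"),
                y = c(1, 2, 1.5, 3, 3.5, 7))

# One minimum per state, named by state
mins <- tapply(m$y, m$x, min)
mins

# Look up a single state by name, roughly what minimize(A) was meant to do
min_a <- mins[["A"]]

# The same value computed directly by subsetting
min_a_direct <- min(m$y[m$x == "A"])
```

With more index columns, aggregate(. ~ x, m, min) as shown above stays the more convenient choice, since it handles every column in one call.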