Related
Does anybody know how to aggregate by NA in R.
If you take the example below
a <- matrix(1,5,2)
a[1:2,2] <- NA
a[3:5,2] <- 2
aggregate(a[,1], by=list(a[,2]), sum)
The output is:
Group.1 x
2 3
But is there a way to get the output to include NAs in the output like this:
Group.1 x
2 3
NA 2
Thanks
Instead of aggregate(), you may want to consider rowsum(). It is actually designed for this exact operation on matrices and is known to be much faster than aggregate(). We can add NA to the factor levels of a[, 2] with addNA(). This will assure that NA shows up as a grouping variable.
rowsum(a[, 1], addNA(a[, 2]))
# [,1]
# 2 3
# <NA> 2
If you still want to use aggregate(), you can incorporate addNA() as well.
aggregate(a[, 1], list(Group = addNA(a[, 2])), sum)
# Group x
# 1 2 3
# 2 <NA> 2
And one more option with data.table -
library(data.table)
as.data.table(a)[, .(x = sum(V1)), by = .(Group = V2)]
# Group x
# 1: NA 2
# 2: 2 3
Use summarize from dplyr
library(dplyr)
a %>%
as.data.frame %>%
group_by(V2) %>%
summarize(V1_sum = sum(V1))
Using sqldf:
a <- as.data.frame(a)
sqldf("SELECT V2 [Group], SUM(V1) x
FROM a
GROUP BY V2")
Output:
Group x
1 NA 2
2 2 3
stats package
A variation of AdamO's proposal:
data.frame(xtabs( V1 ~ V2 , data = a,na.action = na.pass, exclude = NULL))
Output:
V2 Freq
1 2 3
2 <NA> 2
You can also try aggregating by is.na(a[,2]) instead.
aggregate(a[,1], by=list(is.na(a[,2])), sum)
# Group.1 x
# 1 FALSE 3
# 2 TRUE 2
If you want a finer distinction than just NA or not, then you may want to define a new variable that uses an previously unused value to denote NA (a factor would be more elegant, but a numeric vector is the simplest):
b <- a[,2]
b[is.na(b)] <- 999
aggregate(a[,1], by=list(b), sum)
# Group.1 x
# 1 2 3
# 2 999 2
The addNA solution of Rich doesn't require any substantial change to the aggregate syntax, so I think it's the best solution. I'll point out that another option, which produces output similar to table (and thus can be coerced into a data.frame structure similar to that of aggregate) is xtabs.
xtabs(a[, 1] ~ a[, 2], addNA=T)
Gives:
Group.1 x
1 2 3
2 <NA> 2
Another "trick" I see is assigning a missing code to these data. We all like the NA output of R, but assigning a missing code to a grouping variable is a good coding exercise. We take it so that it has one more digit than the largest value in the dataset and is of the form -999...99.
codemiss <- function(x) -10^(floor(log(max(abs(x), na.rm=T), base=10))+2)-1
works in general.
Then you get
a[, 2][is.na(a[, 2])] <- codemiss(a[, 2])
And:
aggregate(a[, 1], list(a[, 2]), sum)
Gives you:
Group.1 x
1 -99 2
2 2 3
I have a data frame similar to the dummy example here:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
In the original data frame, there are many more groups, each with 10 values. For each group (a,b or c) I would like to extract the first line where value!=NA, but only the first line where this is true. As in a group there could be several values different from NA and from each other I can't simply subset.
I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:
ddply<-(df,.(Group),function(sub_data){
for(i in 1:length(sub_data$value)){
if(sub_data$Value!='NA'){'take value but only for the first non NA')
return(first line that satisfies)
})
Maybe this is easy with other strategies that I don't know of
Any suggestion is very much appreciated!
I know this has been answered but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
> df[ Value != "NA", .SD[1], by=Group ]
Group Value
1: a 10
2: b 4
3: c 2
Do youself a favor and learn data.table
Some other notes:
You can easily convert data.frames to data.tables
I think that you don't want "NA" but simply NA in your example, in that case the syntax is:
df[ ! is.na(Value), .SD[1], by=Group ]
Since you suggested plyr in the first place:
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)
That assumes you have NAs and not 'NA's. If the latter (not recommended), then:
ddply(subset(df, Value != 'NA'), .(Group), head, 1L)
Note how concise this is. I would agree with using plyr.
If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:
df <- (Group=rep(letters[1:3], each=3),
Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
print(df)
## Group Value
## 1 a <NA>
## 2 a <NA>
## 3 a 10
## 4 b <NA>
## 5 b 4
## 6 b 8
## 7 c <NA>
## 8 c <NA>
## 9 c 2
df.1 <- by(df, df$Group, function(x) {
head(x[complete.cases(x),], 1)
})
print(df.1)
## df$Group: a
## Group Value
## 3 a 10
## ------------------------------------------------------------------------
## df$Group: b
## Group Value
## 5 b 4
## ------------------------------------------------------------------------
## df$Group: c
## Group Value
## 9 c 2
First you should take care of NA's:
options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
And then maybe something like this would do the trick:
for(i in unique(df$Group)) {
for(j in df$Value[df$Group==i]) {
if(!is.na(j)) {
print(paste(i,j))
break
}
}
}
Assuming that Value is actually numeric, not character.
> df <- data.frame(Group=rep(letters[1:3],each=3),
Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2)
> do.call(rbind, lapply(split(df, df$Group), function(x){
x[ is.na(x[,2]) == FALSE, ][1,]
}))
## Group Value
## a a 10
## b b 4
## c c 2
I don't see any solutions using aggregate(...), which would be the simplest:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
# Group Value
# 1 a 10
# 2 b 4
# 3 c 2
If your df contains actual NA, and not "NA" as in your example, then use this:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
Group Value
1 a 10
2 b 4
3 c 2
Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.
Similar in spirit to #hrbrmstr's by() but to my eyes aggregate() gives nicer output:
> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
Group Value
1 a 10
2 b 4
3 c 2
> aggregate(df$Value, list(Group = df$Group), foo)
Group x
1 a 10
2 b 4
3 c 2
I have looked for an answer high and low. It seems so simple but I am struggling with getting anything to work.
Using R 3.0 in Win 7.
I am looking for a way to find the max value (row by row) for: the row of interest, the row before, and row after.
an example would look something like this:
x max
1 1 NA
2 7 7
3 3 7
4 4 5
5 5 5
I could do this with a loop but I would like to avoid that if possible. I have explored things similar to rowSums and rollmean but they do not quite fit the bill since I want a max for a row after it as well.
Any thoughts are greatly appreciated!!
You could use embed and pmax in base R for this.
d <- data.frame(x=c(1,7,3,4,5))
transform(d, max=c(NA, do.call(pmax, as.data.frame(embed(d$x, 3))), NA))
# x max
# 1 1 NA
# 2 7 7
# 3 3 7
# 4 4 5
# 5 5 NA
Here's an approach using dplyr's lead() and lag() functions:
library(dplyr)
d <- data.frame(x = c(1,7,3,4,5))
mutate(d, max = pmax(lead(x), x, lag(x)))
#> x max
#> 1 1 NA
#> 2 7 7
#> 3 3 7
#> 4 4 5
#> 5 5 NA
Assuming that you want to do this for a matrix, not for a vector(use rollapply for vectors), this is straightforward solution, probably not the best in terms of the speed:
library(Hmisc)
x <- matrix(runif(10), ncol=2)
rowMaxs <- apply(x, 1, max)
row3Maxs <- apply(cbind(rowMaxs,
Lag(rowMaxs, 1),
Lag(rowMaxs, -1)), 1, max)
cbind(x, row3Maxs)
however from the performance standpoint the following might be better:
row3Maxsc <- c(NA,
sapply(2:(length(rowMaxs)-1),
function(i)
max(rowMaxs[i], rowMaxs[i-1], rowMaxs[i+1])
),
NA)
cbind(x, row3Maxs)
I have a big data frame with state names in one colum and different indexes in the other columns.
I want to subset by state and create an object suitable for minimization of the index or a data frame with the calculation already given.
Here's one simple (short) example of what I have
m
x y
1 A 1.0
2 A 2.0
3 A 1.5
4 B 3.0
5 B 3.5
6 C 7.0
I want to get this
m
x y
1 A 1.0
2 B 3.0
3 C 7.0
I don't know if a function with a for loop is necessary. Like
minimize<-function(x,...)
for (i in m$x){
do something with data by factor value
apply to that something the min function in every column
return(y)
}
so when you call
minimize(A)
[1] 1
I tried to use %in% but didn't work (I got this error).
A%in%m
Error in match(x, table, nomatch = 0L) : object 'A' not found
When I define it it goes like this.
A<-c("A")
"A"%in%m
[1] FALSE
Thank you in advance
Use aggregate
> aggregate(.~x, FUN=min, dat)
x y
1 A 1
2 B 3
3 C 7
See this post to get some other alternatives.
Try aggregate:
aggregate(y ~ x, m, min)
x y
1 A 1
2 B 3
3 C 7
Using data.table
require(data.table)
m <- data.table(m)
m[, j=min(y), by=x]
# x V1
# 1: A 1
# 2: B 3
# 3: C 7
Looking through the ave function, I found a remarkable line:
split(x, g) <- lapply(split(x, g), FUN) # From ave
Interestingly, this line changes the value of x, which I found unexpected. I expected that split(x,g) would result in a list, which could be assigned to, but discarded afterward. My question is, why does the value of x change?
Another example may explain better:
a <- data.frame(id=c(1,1,2,2), value=c(4,5,7,6))
# id value
# 1 1 4
# 2 1 5
# 3 2 7
# 4 2 6
split(a,a$id) # Split a row-wise by id into a list of size 2
# $`1`
# id value
# 1 1 4
# 2 1 5
# $`2`
# id value
# 3 2 7
# 4 2 6
# Find the row with highest value for each id
lapply(split(a,a$id),function(x) x[which.max(x$value),])
# $`1`
# id value
# 2 1 5
# $`2`
# id value
# 3 2 7
# Assigning to the split changes the data.frame a!
split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),])
a
# id value
# 1 1 5
# 2 1 5
# 3 2 7
# 4 2 7
Not only has a changed, but it changed to a value that does not look like the right hand side of the assignment! Even if assigning to split(a,a$id) somehow changes a (which I don't understand), why does it result in a data.frame instead of a list?
Note that I understand that there are better ways to accomplish this task. My question is why does split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),]) change a?
The help page for split says in its header: "The replacement forms replace values corresponding to such a division." So it really should not be unexpected, although I admit it is not widely used. I do not understand how your example illustrates that the assigned values "do not look like the RHS of the assignment!". The max values are assigned to the 'value' lists within categories defined by the second argument factor.
(I do thank you for the question. I had not realized that split<- was at the core of ave. I guess it is more widely used than I realized, since I think ave is a wonderfully useful function.)
Just after definition of a, perform split(a, a$id)=1, the result would be:
> a
id value
1 1 1
2 1 1
3 1 1
4 1 1
The key here is that split<- actually modified the LHS with RHS values.
Here's an example:
> x <- c(1,2,3);
> split(x,x==2)
$`FALSE`
[1] 1 3
$`TRUE`
[1] 2
> split(x,x==2) <- split(c(10,20,30),c(10,20,30)==20)
> x
[1] 10 20 30
Note the line where I re-assign split(x,x==2) <- . This actually reassigns x.
As the comments below have stated, you can look up the definition of split<- like so
> `split<-.default`
function (x, f, drop = FALSE, ..., value)
{
ix <- split(seq_along(x), f, drop = drop, ...)
n <- length(value)
j <- 0
for (i in ix) {
j <- j%%n + 1
x[i] <- value[[j]]
}
x
}
<bytecode: 0x1e18ef8>
<environment: namespace:base>