I use the aggregate function to get counts by group. The problem is that aggregate only returns a count for a group if the count is greater than zero. This is what I have:
dt <- data.frame(
  n = c(1, 2, 3, 4, 5, 6),
  id = c('A', 'A', 'A', 'B', 'B', 'B'),
  group = c("x", "x", "y", "x", "x", "x"))
Applying the aggregate function:
my.count <- aggregate(n ~ id+group, dt, length)
Now look at the results:
my.count[order(my.count$id),]
I get the following:
id group n
1 A x 2
3 A y 1
2 B x 3
I need the following (the zero in the last row is what I am missing):
id group n
1 A x 2
3 A y 1
2 B x 3
4 B y 0
Thanks for your help in advance.
We can create another column 'ind' and then use dcast to reshape from 'long' to 'wide', specifying the fun.aggregate as length and drop=FALSE.
library(reshape2)
dcast(transform(dt, ind = 'n'), id + group ~ ind,
      value.var = 'n', fun.aggregate = length, drop = FALSE)
# id group n
#1 A x 2
#2 A y 1
#3 B x 3
#4 B y 0
Or a base R option is
as.data.frame(table(dt[-1]))
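For reference, the table() route should give something like this (the count ends up in a column named Freq, which can be renamed if needed):
as.data.frame(table(dt[-1]))
#   id group Freq
# 1  A     x    2
# 2  B     x    3
# 3  A     y    1
# 4  B     y    0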
You can merge your "my.count" object with the complete set of "id" and "group" columns:
merge(my.count, expand.grid(lapply(dt[c("id", "group")], unique)), all = TRUE)
## id group n
## 1 A x 2
## 2 A y 1
## 3 B x 3
## 4 B y NA
There are several questions on SO that show you how to replace NA with 0 if that is required.
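For example, a small sketch of that follow-up step, reusing the merge call above (res is just a throwaway name):
res <- merge(my.count, expand.grid(lapply(dt[c("id", "group")], unique)), all = TRUE)
res$n[is.na(res$n)] <- 0   # turn the missing combinations into explicit zero counts
res
#   id group n
# 1  A     x 2
# 2  A     y 1
# 3  B     x 3
# 4  B     y 0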
aggregate with drop=FALSE worked for me.
my.count <- aggregate(n ~ id+group, dt, length, drop=FALSE)
my.count[is.na(my.count)] <- 0
my.count
# id group n
# 1 A x 2
# 2 B x 3
# 3 A y 1
# 4 B y 0
If you are interested in frequencies only, you can use your formula to create a frequency table and turn it into a data frame:
as.data.frame(xtabs(formula = ~ id + group, dt))
Obviously this won't work for other aggregate functions. I'm still waiting for dplyr's summarise function to let the user decide whether zero-groups are kept or not. Maybe you can vote for this improvement here: https://github.com/hadley/dplyr/issues/341
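For what it's worth, with reasonably recent dplyr and tidyr one pattern is to count the observed combinations and then fill in the missing ones (a sketch, not part of the original answer):
library(dplyr)
library(tidyr)
dt %>%
  count(id, group, name = "n") %>%           # name set explicitly because dt already has an n column
  complete(id, group, fill = list(n = 0))    # add the missing id/group pairs with n = 0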
I have this data:
df = data.frame(x = c(1,2,3), y = c(5,1,4))
  x y
1 1 5
2 2 1
3 3 4
But I want a new column containing the name of the column with the maximum value in each row, like this:
  x y max.col
1 1 5       y
2 2 1       x
3 3 4       y
I've tried a lot of code, but without success. Extra points if I can use the solution with %>%.
Edit 1: I have a lot of NAs and I want to skip them.
Edit 2: I have 30 different columns in the real df.
We can use max.col to return the index of the max value in each row and use that to subset the column names. If there are NAs, replace them with a negative value first. If a row is all NA, we can identify it with rowSums on the logical matrix and set its result to NA:
# rows that are entirely NA (rowSums of the non-NA indicator is 0)
i1 <- !rowSums(!is.na(df))
# replace NAs with a large negative value so they never win, then take the
# name of the column holding the row maximum ('first' breaks any ties)
df$max.col <- names(df)[max.col(replace(df, is.na(df), -999), 'first')]
# rows that were entirely NA get NA instead of a column name
df$max.col[i1] <- NA
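As a quick check, on a hypothetical df containing NAs and one all-NA row, the same steps should give roughly:
# hypothetical data with NAs and one all-NA row, just to illustrate
df <- data.frame(x = c(1, 2, NA, NA), y = c(5, NA, 4, NA))
i1 <- !rowSums(!is.na(df))
df$max.col <- names(df)[max.col(replace(df, is.na(df), -999), 'first')]
df$max.col[i1] <- NA
df
#    x  y max.col
# 1  1  5       y
# 2  2 NA       x
# 3 NA  4       y
# 4 NA NA    <NA>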
Here is a dplyr solution for the two-column case in your question:
library(dplyr)
df2 <- df %>%
  mutate(max.col = ifelse(x > y, "x", "y"))
df2
# x y max.col
# 1 1 5 y
# 2 2 1 x
# 3 3 4 y
I have the following dataframe:
> df=data.frame(from = c("x","y","x","z"), to=c("w","x","w","y"),weight=c(1,1,3,4))
> df
from to weight
1 x w 1
2 y x 1
3 x w 3
4 z y 4
If I want to calculate how many times each element of column from appears in the dataframe, I need to use:
> table(df$from)
x y z
2 1 1
This is not a weighted sum, though. How could I also take the weight column into account? E.g. in my example, the correct answer should be:
x y z
4 1 4
You can use tapply and calculate the sum for each unique value in from:
tapply(df$weight, df$from, sum)
#x y z
#4 1 4
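tapply returns a named vector; if a data frame is preferred instead, one quick way to reshape it (out is just a throwaway name) is:
out <- tapply(df$weight, df$from, sum)
data.frame(from = names(out), weight = as.vector(out))
#   from weight
# 1    x      4
# 2    y      1
# 3    z      4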
We can use count from dplyr
library(dplyr)
df %>%
count(from, wt = weight)
# from n
#1 x 4
#2 y 1
#3 z 4
In base R, we can use xtabs
xtabs(weight ~ from, df)
#from
#x y z
#4 1 4
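For completeness, aggregate (as used in the first question above) should give the same totals directly as a data frame:
aggregate(weight ~ from, df, sum)
#   from weight
# 1    x      4
# 2    y      1
# 3    z      4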
I would like to create a numeric indicator for a data frame such that, for each unique element in one variable, it creates a sequence whose length is given by the value in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y, x, FUN = seq_along))
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to #mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. Although, as #mnel points out, this is inefficient compared to the other method as the entire .SD needs to be loaded into memory for each group to run this calculation.
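To see what .I refers to, a tiny illustration (DT is just a copy of frame converted to a data.table, so the original isn't modified):
library(data.table)
DT <- as.data.table(frame)
# within a grouped j, .I gives the row numbers of the group in the full table
DT[, .(row_in_table = .I), by = x]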
Another approach:
frame$z <- unlist(lapply(rle(as.numeric(frame[, "x"]))$lengths, seq_len))
library(dplyr)
frame %>%              # %>% is the pipe operator; very early dplyr versions wrote this as %.%
  group_by(x) %>%
  mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this, where x is the column by which grouping is to be done and y is any numeric column. If there are no numeric columns, use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))
I have two datasets that I'd like to merge via two identifying variables (up and ver_u):
df1 looks like this:
up ver_u
257001 1
1010 1
101010 1
100316 1
df2 looks like this:
up ver_u code_uc quantity
500116 1 395884 1
100116 1 36761 2
160116 1 81308 3
100116 1 76146 1
113216 1 6338 1
101116 1 33887 1
What I would like to do is take the subset of df2 whose up and ver_u values match those in df1. I did this in two different ways and got different answers.
First method:
pur <- merge(df2, df1,by=c("up","ver_u"))
Second method:
test <- df2[(df2$up %in% df1$up) & (df2$ver_u %in% df1$ver_u),]
They give me different numbers of observations, and I don't see why.
When I used merge on the dataframe test with the following code, I got the same number of observations, but the two resulting dataframes are still different.
pur1 = merge(test, df1,by=c("up","ver_u"))
Is there some systematic difference between using merge and %in%?
Would greatly appreciate any insight on this.
Because merge compares both columns row by row, as pairs, while %in% compares each column against all values of the other column independently. Example:
#dummy data
df1 <- data.frame(x = c(1,2,3),
y = c(2,3,4))
df1
#   x y
# 1 1 2
# 2 2 3
# 3 3 4
df2 <- data.frame(x = c(2,3,1,3),
y = c(3,1,4,1))
df2
# x y
# 1 2 3
# 2 3 1
# 3 1 4
# 4 3 1
# using merge
merge(df1, df2, by = c("x", "y"))
# x y
# 1 2 3
# using %in%
df1[(df1$x %in% df2$x) & (df1$y %in% df2$y), ]
# x y
# 2 2 3
# 3 3 4
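If the goal is the pairwise match without a full merge, one common idiom (a sketch using the dummy data above) is to compare the two columns as a single combined key:
key1 <- do.call(paste, df1[c("x", "y")])   # e.g. "1 2", "2 3", "3 4"
key2 <- do.call(paste, df2[c("x", "y")])
df1[key1 %in% key2, ]
#   x y
# 2 2 3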