I have data like this:
library(data.table)
group <- c("a","a","a","b","b","b")
cond <- c("N","Y","N","Y","Y","N")
value <- c(2,1,3,4,2,5)
dt <- data.table(group, cond, value)
group cond value
a N 2
a Y 1
a N 3
b Y 4
b Y 2
b N 5
I would like to return max value when the cond is Y for the entire group. Something like this:
group cond value max
a N 2 1
a Y 1 1
a N 3 1
b Y 4 4
b Y 2 4
b N 5 4
I've tried adding an ifelse condition to a grouped max, however, I end up just returning the no condition of NA when the row doesn't meet the condition:
dt[, max := ifelse(cond=="Y", max(value), NA), by = group]
Assuming that for each 'group' we need to get the max of 'value' where the 'cond' is "Y", after grouping by 'group', subset the 'value' with the logical condition (cond == 'Y') and get the max value
dt[, max := max(value[cond == 'Y']), by = group]
dt
# group cond value max
#1: a N 2 1
#2: a Y 1 1
#3: a N 3 1
#4: b Y 4 4
#5: b Y 2 4
#6: b N 5 4
You could do...
dt[CJ(group = group, cond = "Y", unique=TRUE), on=.(group, cond),
.(mv = max(value))
, by=.EACHI]
# group cond mv
# 1: a Y 1
# 2: b Y 4
Using a join like this will eventually have optimization of the max calculation.
Another way (originally included in #akrun's answer):
dt[cond == "Y", mv := max(value), by=group]
From the prior link, we can see that this way is already optimized, except for the := part.
Related
I am trying to count multiple .N for data.table with conditions
I have a data.table
SD = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b","a", "a",
"b", "b"), z= c("x", "x", "y","y", "x","z","x","w"))
I want to count two things:
counts of each combination by x and y
I did it this way
SD[,.N, .(x,y)]
and output is what I want
x y N
1: 1 a 2
2: 1 b 2
3: 2 a 2
4: 2 b 2
counts of each combination by x and y where z == "x"
As a quick solution I did
SD[z == "x", .N, .(x,y)]
merge.data.table(SD[,.N, .(x,y)],
SD[z == "x", .N, .(x,y)],
by = c('x','y'), all.x =TRUE)`
Is there any way to do both in one line of code without joins?
It works but takes a long time due to the big size of data.table
Any help is appreciated
How about this one:
SD[, .(.N, count2 = sum(z == 'x')), .(x,y)]
# x y N count2
# 1: 1 a 2 2
# 2: 1 b 2 0
# 3: 2 a 2 1
# 4: 2 b 2 1
In addition to counting the number of rows by unique x-y combination, we also count the number of rows where z == 'x' in each x-y combination.
I have a data frame below. I need to find the the row min and max except few column that are characters.
df
x y z
1 1 1 a
2 2 5 b
3 7 4 c
I need
df
x y z Min Max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
Another dplyr possibility could be:
df %>%
mutate(Max = do.call(pmax, select_if(., is.numeric)),
Min = do.call(pmin, select_if(., is.numeric)))
x y z Max Min
1 1 1 a 1 1
2 2 5 b 5 2
3 7 4 c 7 4
Or a variation proposed be #G. Grothendieck:
df %>%
mutate(Min = pmin(!!!select_if(., is.numeric)),
Max = pmax(!!!select_if(., is.numeric)))
Another base R solution. Subset only the columns with numbers and then use apply in each row to get the minimum and maximum value with range.
cbind(df, t(apply(df[sapply(df, is.numeric)], 1, function(x)
setNames(range(x, na.rm = TRUE), c("min", "max")))))
# x y z min max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
1) This one-liner uses no packages:
transform(df, min = pmin(x, y), max = pmax(x, y))
giving:
x y z min max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
2) If you have many columns and don't want to list them all or determine yourself which are numeric then this also uses no packages.
ix <- sapply(df, is.numeric)
transform(df, min = apply(df[ix], 1, min), max = apply(df[ix], 1, max))
If your actual data has NAs and if you want to ignore them when taking the min or max then min, max, pmin and pmax all take an optional na.rm = TRUE argument.
Note
Lines <- "x y z
1 1 1 a
2 2 5 b
3 7 4 c"
df <- read.table(text = Lines)
1) We can use select_if. Here, we can use select_if to select the columns that are numeric, then with pmin, pmax get the rowwise min and max and bind it with the original dataset
library(dplyr)
library(purrr)
df %>%
select_if(is.numeric) %>%
transmute(Min = reduce(., pmin, na.rm = TRUE),
Max = reduce(., pmax, na.rm = TRUE)) %>%
bind_cols(df, .)
# x y z Min Max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
NOTE: Here, we use only a single expression of select_if
2) The same can be done in base R (no packages used)
i1 <- names(which(sapply(df, is.numeric)))
df['Min'] <- do.call(pmin, c(df[i1], na.rm = TRUE))
df['Max'] <- do.call(pmax, c(df[i1], na.rm = TRUE))
Also, as stated in the comments, this is generalized option. If it is only for two columns, just doing pmin(x, y) or pmax(x,y) is possible and that wouldn't check if the columns are numeric or not and it is not a general solution
NOTE: All of the solutions mentioned here are either answered first or from the comments with the OP
data
df <- structure(list(x = c(1L, 2L, 7L), y = c(1L, 5L, 4L), z = c("a",
"b", "c")), class = "data.frame", row.names = c("1", "2", "3"
))
What is efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating NA positions (for other purposes)
dt = data.table(id=rep(1:2,7), category=c("x","y",NA))
print(dt)
In this toy example, ignoring NA, x is common category for id==1 and y for id==2.
If you want to ignore NA's, you have to exclude them first with !is.na(category), group by id and category (by = .(id, category)) and create a frequency variable with .N:
dt[!is.na(category), .N, by = .(id, category)]
which gives:
id category N
1: 1 x 3
2: 2 y 3
3: 2 x 2
4: 1 y 2
Ordering this by id will give you a clearer picture:
dt[!is.na(category), .N, by = .(id, category)][order(id)]
which results in:
id category N
1: 1 x 3
2: 1 y 2
3: 2 y 3
4: 2 x 2
If you just want the rows which indicate the top results:
dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id]
or:
dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]
which both give:
id category N
1: 1 x 3
2: 2 y 3
I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know the sum of identical individuals for 2 groups.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanantion
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or as mentioned in the comments by #MichaelChirico and #Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group', 'i.Group' and get the nrow (.N)
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(dt$Group),2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
yet another solution (base R):
tmp <- split(DT, DT[, 'Group'])
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
out <- length(intersect(tmp[[ind[1]]][, 1], tmp[[ind[2]]][, 1]))
c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
first split data into list of groups, then for each unique pairwise combinations of two groups see how many subjects in common they have, using length(intersect(....
I would like to append a columns to my data.frame in R that contain row sums and products
Consider following data frame
x y z
1 2 3
2 3 4
5 1 2
I want to get the following
x y z sum prod
1 2 3 6 6
2 3 4 9 24
5 1 2 8 10
I have tried
sum = apply(ages,1,add)
but it gives me a row vector. Can some one please show me an efficient command to sum and product and append them to original data frame as shown above?
Try
transform(df, sum=rowSums(df), prod=x*y*z)
# x y z sum prod
#1 1 2 3 6 6
#2 2 3 4 9 24
#3 5 1 2 8 10
Or
transform(df, sum=rowSums(df), prod=Reduce(`*`, df))
# x y z sum prod
#1 1 2 3 6 6
#2 2 3 4 9 24
#3 5 1 2 8 10
Another option would be to use rowProds from matrixStats
library(matrixStats)
transform(df, sum=rowSums(df), prod=rowProds(as.matrix(df)))
If you are using apply
df[,c('sum', 'prod')] <- t(apply(df, 1, FUN=function(x) c(sum(x), prod(x))))
df
# x y z sum prod
#1 1 2 3 6 6
#2 2 3 4 9 24
#3 5 1 2 8 10
Another approach.
require(data.table)
# Create data
dt <- data.table(x = c(1,2,5), y = c(2,3,1), z = c(3,4,2))
# Create index
dt[, i := .I]
# Compute sum and prod
dt[, sum := sum(x, y, z), by = i]
dt[, prod := prod(x, y, z), by = i]
dt
# Compute sum and prod using .SD
dt[, c("sum", "prod") := NULL]
dt
dt[, sum := sum(.SD), by = i, .SDcols = c("x", "y", "z")]
dt[, prod := prod(.SD), by = i, .SDcols = c("x", "y", "z")]
dt
# Compute sum and prod using .SD and list
dt[, c("sum", "prod") := NULL]
dt
dt[, c("sum", "prod") := list(sum(.SD), prod(.SD)), by = i,
.SDcols = c("x", "y", "z")]
dt
# Compute sum and prod using .SD and lapply
dt[, c("sum", "prod") := NULL]
dt
dt[, c("sum", "prod") := lapply(list(sum, prod), do.call, .SD), by = i,
.SDcols = c("x", "y", "z")]
dt
Following can also be done but column names need to be entered:
ddf$sum = with(ddf, x+y+z)
ddf$prod = with(ddf, x*y*z)
ddf
x y z sum prod
1 1 2 3 6 6
2 2 3 4 9 24
3 5 1 2 8 10
With data.table, another form can be:
library(data.table)
cbind(dt, dt[,list(sum=x+y+z, product=x*y*z),])
x y z sum product
1: 1 2 3 6 6
2: 2 3 4 9 24
3: 5 1 2 8 10
A simpler version is suggested by #David Arenberg in comments:
dt[, ":="(sum = x+y+z, product = x*y*z)]
Only a partial answer, but if all values are greater than or equal to 0, rowSums/rowsum can be used to calculate products:
df <- data.frame(x = c(1, 2, 5), y = c(2, 3, 1), z = c(3, 4, 2))
# custom row-product-function
my_rowprod <- function(x) exp(rowSums(log(x)))
df$prod <- my_rowprod(df)
df
The generic version is (including negatives):
my_rowprod_2 <- function(x) {
sign <- ifelse((rowSums(x < 0) %% 2) == 1, -1, 1)
prod <- exp(rowSums(log(abs(x)))) * sign
prod
}
df$prod <- my_rowprod_2(df)
df