I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10),
e = sample(c("x","y"),10,replace = T),
f=sample(c("t","s"),10,replace = T)
)
i need (for example) a count of negative values in columns 1:4 for each value of e, f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) { #these are the *by* columns
for (i in 1:4) {#these are the columns whose negative values i'm counting
n=paste("neg",names(dt[,i,with=F]),"count","by",names(dt[,k,with=F]),sep="_")
dt[dt[[i]]<0, (n):=.N, by=names(dt[,k,with=F])]
}
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
and need to be rbind to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient, there should be a better way, especially since my dataset has several e and f-like columns.
If there is only two grouping columns, we could do an rbindlist after grouping by them separately
rbindlist(list(dt[,lapply(.SD, function(x) sum(x < 0)) , .(e), .SDcols = a:d],
dt[,lapply(.SD, function(x) sum(x < 0)) , .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD,
function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
Related
I would like to apply one function to multiple columns, using a vector of different values for one parameter.
I have some data:
library(data.table)
df1 <- as.data.table(data.frame(a = c(1,2,3),
b = c(4,5,6)))
cols <- c('a', 'b')
n <- 1:2
And I want to create columns that add n to a and b. Output would look like this:
a b a+1 b+1 a+2 b+2
1: 1 4 2 5 3 6
2: 2 5 3 6 4 7
3: 3 6 4 7 5 8
This post details how to apply one function to multiple columns which I understand how to do.
df1[,paste0(cols,'+1'):= lapply(.SD, function(x) x + 1), .SDcols = cols]
What I don't know is how to apply the same function to multiple columns, substituting n for 1.
We can also use outer
cbind(df1, df1[, lapply(.SD, function(x) outer(x, n, `+`))])
Or another option is
nm1 <- paste0(cols, "+", rep(n, each = length(cols)))
df1[, (nm1) := lapply(n, `+`, .SD)]
Here is a base R solution
df1out <- cbind(df1,do.call(cbind,lapply(c(1,2), function(k) setNames(df1+k,paste0(names(df1),"+",k)))))
such that
> df1out
a b a+1 b+1 a+2 b+2
1: 1 4 2 5 3 6
2: 2 5 3 6 4 7
3: 3 6 4 7 5 8
I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to #mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. Although, as #mnel points out, this is inefficient compared to the other method as the entire .SD needs to be loaded into memory for each group to run this calculation.
Another approach:
frame$z <- unlist(lapply(rle(as.numeric(frame[, "x"]))$lengths, seq_len))
library(dplyr)
frame %.%
group_by(x) %.%
mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this where x is the column by which grouping is to be done and y is any numeric column. if there are no numeric columns use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))
I have a really big problem and looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to group first all the rows having "a" in column k and calculate sum(j) and then all rows having "b" and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint how to do it efficiently? I cant melt the column K by repeating the rows since the size of the data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j) ,k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating rows of cols i:j to match the unlisted k. The data should be kept in this format instead of using a list column, probably. From there, as in #MikeyMike's answer, we can dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
How do I use variable column names on the RHS of := operations? For example, given this data.table "dt", I'd like to create two new columns, "first_y" and "first_z" that contains the first observation of the given column for the values of "x".
dt <- data.table(x = c("one","one","two","two","three"),
y = c("a", "b", "c", "d", "e"),
z = c(1, 2, 3, 4, 5))
dt
x y z
1: one a 1
2: one b 2
3: two c 3
4: two d 4
5: three e 5
Here's how you would do it without variable column names.
dt[, c("first_y", "first_z") := .(first(y), first(z)), by = x]
dt
x y z first_y first_z
1: one a 1 a 1
2: one b 2 a 1
3: two c 3 c 3
4: two d 4 c 3
5: three e 5 e 5
But how would I do this if the "y" and "z" column names are dynamically stored in a variable?
cols <- c("y", "z")
# This doesn't work
dt[, (paste0("first_", cols)) := .(first(cols)), by = x]
# Nor does this
q <- quote(first(as.name(cols[1])))
p <- quote(first(as.name(cols[2])))
dt[, (paste0("first_", cols)) := .(eval(q), eval(p)), by = x]
I've tried numerous other combinations of quote() and eval() and as.name() without success. The LHS of the operation appears to be working as intended and is documented in many places, but I can't find anything about using a variable column name on the RHS. Thanks in advance.
I'm not familiar with the first function (although it looks like something Hadley would define).
dt[, paste0("first_", cols) := lapply(.SD, head, n = 1L),
by = x, .SDcols = cols]
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
The .SDcols answer is fine for this case, but you can also just use get:
dt[, paste0("first_", cols) := lapply(cols, function(x) get(x)[1]), by = x]
dt
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
Another alternative is the vectorized version - mget:
dt[, paste0("first_", cols) := setDT(mget(cols))[1], by = x]
I can never get "get" or "eval" to work on the RHS when trying to do mathematical operations. Try this if you need to.
Thing_dt[, c(new_col) := Thing_dt[[oldcol1]] * Thing_dt[[oldcol2]]]
I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to #mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. Although, as #mnel points out, this is inefficient compared to the other method as the entire .SD needs to be loaded into memory for each group to run this calculation.
Another approach:
frame$z <- unlist(lapply(rle(as.numeric(frame[, "x"]))$lengths, seq_len))
library(dplyr)
frame %.%
group_by(x) %.%
mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this where x is the column by which grouping is to be done and y is any numeric column. if there are no numeric columns use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))