Recently I saw a question (can't find the link) that was something like this
I want to add a column on a data.frame that computes the variance of a different column while removing the current observation.
library(data.table)
dt = data.table(
id = 1:13,
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
)
So, with a for() loop:
res = NULL
for(i in 1:13){
res[i] = var(dt[-i,v])
}
I tried doing this in data.table, using negative indexing with .I, but to my surprise none of the following works:
#1
dt[,var := var(dt[,v][-.I])]
#2
dt[,var := var(dt$v[-.I])]
#3
fun = function(x){
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
var(v[-x])
}
dt[,var := fun(.I)]
#4
fun = function(x){
var(dt[-x,v])
}
dt[,var := fun(.I)]
All of those give the same output:
id v var
1: 1 9 NA
2: 2 5 NA
3: 3 8 NA
4: 4 1 NA
5: 5 25 NA
6: 6 14 NA
7: 7 7 NA
8: 8 87 NA
9: 9 98 NA
10: 10 63 NA
11: 11 32 NA
12: 12 12 NA
13: 13 15 NA
What am I missing? I thought it was a problem with .I being passed to functions, but a dummy example:
fun = function(x,c){
x*c
}
dt[,dummy := fun(.I,2)]
id v var
1: 1 9 2
2: 2 5 4
3: 3 8 6
4: 4 1 8
5: 5 25 10
6: 6 14 12
7: 7 7 14
8: 8 87 16
9: 9 98 18
10: 10 63 20
11: 11 32 22
12: 12 12 24
13: 13 15 26
works fine.
Why can't I use .I in this specific scenario?
You may use .BY:
a list containing a length 1 vector for each item in by
dt[ , var_v := dt[id != .BY$id, var(v)], by = id]
Variance is calculated once per row (by = id). In each calculation, the current row is excluded using id != .BY$id in the 'inner' i.
all.equal(dt$var_v, res)
# [1] TRUE
Why doesn't your code work? Because...
.I is an integer vector equal to seq_len(nrow(x)),
...your -.I not only removes current observation, it removes all rows in one go from 'v'.
A small illustration which starts with your attempt (just without the assignment :=) and simplifies it step by step:
# your attempt
dt[ , var(dt[, v][-.I])]
# [1] NA
# without the `var`, indexing only
dt[ , dt[ , v][-.I]]
# numeric(0)
# an empty vector
# same indexing written in a simpler way
dt[ , v[-.I]]
# numeric(0)
# even more simplified, with a vector of values
# and its corresponding indexes (equivalent to .I)
v <- as.numeric(11:14)
i <- 1:4
v[i]
# [1] 11 12 13 14
v[-i]
# numeric(0)
Here's a brute-force thought:
exvar <- function(x, na.rm = FALSE) sapply(seq_along(x), function(i) var(x[-i], na.rm = na.rm))
dt[,var := exvar(v)]
dt
# id v var
# 1: 1 9 1115.538
# 2: 2 5 1098.265
# 3: 3 8 1111.515
# 4: 4 1 1077.841
# 5: 5 25 1153.114
# 6: 6 14 1132.697
# 7: 7 7 1107.295
# 8: 8 87 822.447
# 9: 9 98 684.697
# 10: 10 63 1040.265
# 11: 11 32 1153.697
# 12: 12 12 1126.424
# 13: 13 15 1135.538
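As a side note (a sketch of my own, not taken from the answers above): the leave-one-out variance can also be computed fully vectorised from running sums, with no per-row subsetting at all:

```r
library(data.table)

# Sketch: leave-one-out sample variance via the sum / sum-of-squares identity.
# For each i, the remaining n-1 values have mean m_i = (S - x_i)/(n-1) and
# sum of squared deviations (SS - x_i^2) - (n-1)*m_i^2, divided by n-2.
loo_var <- function(x) {
  n  <- length(x)
  s  <- sum(x)     # total sum
  ss <- sum(x^2)   # total sum of squares
  m  <- (s - x) / (n - 1)                  # mean with each element left out
  ((ss - x^2) - (n - 1) * m^2) / (n - 2)   # corresponding sample variance
}

v <- c(9,5,8,1,25,14,7,87,98,63,32,12,15)
all.equal(loo_var(v), sapply(seq_along(v), function(i) var(v[-i])))
```

This is O(n) instead of O(n^2), which matters once the vector gets large.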
I am trying to call different columns of a data.table inside a loop, to get unique values of each column.
Consider the simple data.table below.
> df <- data.table(var_a = rep(1:10, 2),
+ var_b = 1:20)
> df
var_a var_b
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 8 8
9: 9 9
10: 10 10
11: 1 11
12: 2 12
13: 3 13
14: 4 14
15: 5 15
16: 6 16
17: 7 17
18: 8 18
19: 9 19
20: 10 20
My code works when I call for a specific column outside a loop,
> unique(df$var_a)
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, var_a])
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, "var_a"])
var_a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
but not when I do so within a loop that goes through different columns of the data.table.
> for(v in c("var_a","var_b")){
+ print(v)
+ df$v
+ unique(df[, .v])
+ unique(df[, "v"])
+ }
[1] "var_a"
Error in `[.data.table`(df, , .v) :
j (the 2nd argument inside [...]) is a single symbol but column name '.v' is not found. Perhaps you intended DT[, ...v]. This difference to data.frame is deliberate and explained in FAQ 1.1.
>
> unique(df[, ..var_a])
Error in `[.data.table`(df, , ..var_a) :
Variable 'var_a' is not found in calling scope. Looking in calling scope because you used the .. prefix.
For the first problem, when you're referencing a column name indirectly, you can either use the double-dot ..v syntax, or add with=FALSE in the data.table::[ construct. (Note also that df$v looks for a column literally named "v", which is why it prints NULL in the loop below.)
for (v in c("var_a", "var_b")) {
print(v)
print(df$v)
### either one of these will work:
print(unique(df[, ..v]))
# print(unique(df[, v, with = FALSE]))
}
# [1] "var_a"
# NULL
# var_a
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# [1] "var_b"
# NULL
# var_b
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# 11: 11
# 12: 12
# 13: 13
# 14: 14
# 15: 15
# 16: 16
# 17: 17
# 18: 18
# 19: 19
# 20: 20
# var_b
But this just prints it without changing anything. If all you want to do is look at unique values within each column (and not change the underlying frame), then I'd likely go with
lapply(df[,.(var_a, var_b)], unique)
# $var_a
# [1] 1 2 3 4 5 6 7 8 9 10
# $var_b
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
which shows the name and unique values. The use of lapply (whether on df as a whole or a subset of columns) is also preferable to another recommendation to use apply(df, 2, unique), though in this case it returns the same results.
Use .subset2 to refer to a column by its name:
for(v in c("var_a","var_b")) {
print(unique(.subset2(df, v)))
}
Following the information in the first error message, this would be the correct way to call the columns in a loop:
for(v in c("var_a","var_b")){
print(unique(df[, ..v]))
}
# won't print all the lines
As for the second error: you have not declared a variable called "var_a", and it looks like you want to select by name.
# works as you have shown
unique(df[, "var_a"])
# works once the variable is declared
var_a <- "var_a"
unique(df[, ..var_a])
You may also be interested in the env param of data.table (see development version); here is an illustration below, but you could use this in a loop too.
v="var_a"
df[, v, env=list(v=v)]
Output:
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
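Since the answer mentions using this in a loop, here is a short sketch of env= inside the loop (assuming a data.table version that supports env=, which was introduced in the development line around 1.14.3):

```r
library(data.table)

df <- data.table(var_a = rep(1:10, 2), var_b = 1:20)

# env= substitutes the character value of 'col' as a column symbol,
# so each iteration reads a different column by name
for (v in c("var_a", "var_b")) {
  print(unique(df[, col, env = list(col = v)]))
}
```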
I'm trying to operate on a data.table column using a different data.table, and assign the result to a new column in the first data.table. But I keep having this issue:
Warning messages:
1: In from:(from + len) :
numerical expression has 10 elements: only the first used
Here is the data:
tstamps = c(1504306173, NA, NA, NA, NA, 1504393006, NA, NA, 1504459211, NA)
set.seed(0.1)
dt1 = data.table(utc_tstamp = sample(rep(tstamps, 100), 100))
dt2 = data.table(from = sample((1:90), 10), len = sample(1:10, 10))
> dt2
from len
1: 55 6
2: 59 9
3: 32 10
4: 24 3
5: 86 7
6: 54 1
7: 18 5
8: 11 8
9: 40 4
10: 75 2
I'm trying to count the number of NA's in dt1[from:(from+len), ] and assign the result to a new column, count, in dt2.
What I currently have for that is this
dt2[, count := dt1[from:(from+len), ][is.na(utc_tstamp), .N]]
but this is only using dt2[1,]$from and dt2[1,]$len, all the counts are just the number of NA's in dt1[dt2[1,]$from:(dt2[1,]$from + dt2[1,]$len), ], and I receive the following warning
Warning messages:
1: In from:(from + len) :
numerical expression has 10 elements: only the first used
2: In from:(from + len) :
numerical expression has 10 elements: only the first used
and the result is this:
> dt2
from len count
1: 55 6 5
2: 59 9 5
3: 32 10 5
4: 24 3 5
5: 86 7 5
6: 54 1 5
7: 18 5 5
8: 11 8 5
9: 40 4 5
10: 75 2 5
while it should be this:
> dt2
from len count
1: 55 6 5
2: 59 9 5
3: 32 10 8
4: 24 3 3
5: 86 7 5
6: 54 1 2
7: 18 5 4
8: 11 8 5
9: 40 4 4
10: 75 2 2
I'd appreciate it if someone explains why this is happening and how can I get what I want.
Based on the description, we build the sequence between 'from' and 'from + len', use that position index to extract the corresponding elements of the 'utc_tstamp' column from 'dt1', convert them to logical with is.na(), and take the sum, i.e. the number of NA elements. Assign (:=) it to create a new column 'count' in 'dt2':
dt2[, count := unlist(Map(function(x, y)
sum(is.na(dt1$utc_tstamp[x:y])), from , from + len))]
dt2
# from len count
# 1: 55 6 5
# 2: 59 9 5
# 3: 32 10 8
# 4: 24 3 3
# 5: 86 7 5
# 6: 54 1 2
# 7: 18 5 4
# 8: 11 8 5
# 9: 40 4 4
#10: 75 2 2
Or another option is to group by the sequence of rows and then build the sequence (:) from the 'from' and 'len' columns to subset the column values from 'dt1', taking the sum of the logical vector:
dt2[, count := sum(is.na(dt1$utc_tstamp[from:(from + len)])), by = 1:nrow(dt2)]
Or define the joining variables explicitly and use a non-equi join:
dt2[, to := from+len]
dt1[, r := .I]
dt2[, ct := dt1[is.na(utc_tstamp)][dt2, on=.(r >= from, r <= to), .N, by=.EACHI]$N]
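One more sketch (an alternative of my own, not given above, assuming the same dt1/dt2 setup): since each count is the number of NAs in a contiguous window, a prefix sum of is.na() turns every window into a difference of two lookups:

```r
library(data.table)

tstamps <- c(1504306173, NA, NA, NA, NA, 1504393006, NA, NA, 1504459211, NA)
set.seed(0.1)
dt1 <- data.table(utc_tstamp = sample(rep(tstamps, 100), 100))
dt2 <- data.table(from = sample(1:90, 10), len = sample(1:10, 10))

# cs[k] = number of NAs in rows 1..k of dt1; the count for rows
# from..(from+len) is then cs[from+len] - cs[from-1], with cs[0] = 0
cs <- cumsum(is.na(dt1$utc_tstamp))
dt2[, count := cs[from + len] - c(0L, cs)[from]]
```

This avoids both the per-row grouping and the join, at the cost of one pass over dt1.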
Finding the last position of a vector that is less than a given value is fairly straightforward (see e.g. this question).
But doing this line by line for a column in a data.frame or data.table is horribly slow. For example, we can do it like this (which is OK on small data, but not good on big data):
library(data.table)
set.seed(123)
x = sort(sample(20,5))
# [1] 6 8 15 16 17
y = data.table(V1 = 1:20)
y[, last.x := tail(which(x <= V1), 1), by = 1:nrow(y)]
# V1 last.x
# 1: 1 NA
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 1
# 7: 7 1
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 2
# 14: 14 2
# 15: 15 3
# 16: 16 4
# 17: 17 5
# 18: 18 5
# 19: 19 5
# 20: 20 5
Is there a fast, vectorised way to get the same thing? Preferably using data.table or base R.
You may use findInterval
y[ , last.x := findInterval(V1, x)]
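One caveat worth noting (my own note, not part of the answer): findInterval() returns 0, not NA, for values below min(x), so to reproduce the loop's output exactly you can recode the zeros:

```r
library(data.table)

x <- c(6L, 8L, 15L, 16L, 17L)
y <- data.table(V1 = 1:20)

y[, last.x := findInterval(V1, x)]
# findInterval() gives 0 below min(x); recode to NA to match the loop output
y[last.x == 0, last.x := NA_integer_]
```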
Slightly more convoluted using cut. But on the other hand, you get the NAs right away:
y[ , last.x := as.numeric(cut(V1, c(x, Inf), right = FALSE))]
Pretty simple in base R
x<-c(6L, 8L, 15L, 16L, 17L)
y<-1:20
cumsum(y %in% x)
[1] 0 0 0 0 0 1 1 2 2 2 2 2 2 2 3 4 5 5 5 5
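A caveat on this trick (my own note, not from the answer): cumsum(y %in% x) only works when every breakpoint in x actually occurs in y; findInterval does not need that:

```r
x  <- c(6L, 8L, 15L, 16L, 17L)
y2 <- c(1, 7, 20)    # none of the breakpoints occur in y2 itself

cumsum(y2 %in% x)    # the crossings at 6 and 8..17 are missed entirely
findInterval(y2, x)  # still correct
```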
I am not a big data.table expert but I am somehow puzzled by some things. Here is my simple example:
test<-data.table(x= 1:10,y= 1:10,z= 1:10, l = 11:20,d= 21:30)
test<-test[,..I:=.I]
vec_of_names = c("z","l","d")
function_test<-function(x,y){
sum(x)+y
}
vec_of_final_names<-c("sum_z","sum_l","sum_d")
When I then attempt do to something like this:
for (i in 1:length(vec_of_names)){
test<-test[,vec_of_final_names[i]:=function_test(x=.SD,y=eval(parse(text=vec_of_names[i]))),.SDcols=c("x","y"),by=..I]
}
I get an error:
Error in eval(expr, envir, enclos) : object 'z' not found
Whereas code below works perfectly fine but is a little bit ugly and also slow. Maybe somebody can suggest better alternatives.
for (i in 1:length(vec_of_names)){
test<-test[,vec_of_final_names[i]:=function_test(x=eval(parse(text=paste("c(",paste(c("x","y"),collapse=","),")",sep=""))),y=eval(parse(text=vec_of_names[i]))),by=..I]
}
After specifying the .SDcols and grouping by = ..I (..I is a strange name for a column, by the way), we unlist the .SD, take the sum, get the values of 'vec_of_names' in a list with mget, add each of them to sum(unlist(.SD)) with Map, and assign (:=) the result to 'vec_of_final_names' to create the new columns. (Your original attempt likely failed because data.table only exposes the columns it can detect being used in j; a column name hidden inside parse() is not detected, so 'z' was not found.)
test[, (vec_of_final_names) := Map(`+`, sum(unlist(.SD)),
mget(vec_of_names)), by = ..I, .SDcols = x:y]
Based on the example, this can also be done without the grouping variable
test[, (vec_of_final_names) := Map(`+`, list(x+y), mget(vec_of_names))]
Or by specifying the .SDcols
test[, (vec_of_final_names) := Map(`+`, list(Reduce(`+`, .SD)),
mget(vec_of_names)), .SDcols = x:y]
Or using the OP's function
test[, (vec_of_final_names) := Map(function_test, list(unlist(.SD)),
mget(vec_of_names)), ..I, .SDcols = x:y]
test
# x y z l d ..I sum_z sum_l sum_d
# 1: 1 1 1 11 21 1 3 13 23
# 2: 2 2 2 12 22 2 6 16 26
# 3: 3 3 3 13 23 3 9 19 29
# 4: 4 4 4 14 24 4 12 22 32
# 5: 5 5 5 15 25 5 15 25 35
# 6: 6 6 6 16 26 6 18 28 38
# 7: 7 7 7 17 27 7 21 31 41
# 8: 8 8 8 18 28 8 24 34 44
# 9: 9 9 9 19 29 9 27 37 47
#10: 10 10 10 20 30 10 30 40 50
Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum columns starting with an 'x', and sum columns starting with an 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns starting with "x" or "y" by their column names, e.g. using grepl().
Could you please advise me how to do so? I think I need to use with=FALSE, but I cannot get it to work in combination with by=desc.
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by #Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is an option: grep("^x", ...) returns the column names starting with x, mget gets the column data, and unlist collapses the result so you can calculate the sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = T)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = T))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
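If your data.table is recent enough (patterns() in .SDcols was added around 1.12.0, if I recall correctly), the regex selection can go straight into .SDcols, e.g. for the x columns:

```r
library(data.table)

dt <- data.table(x1 = 1:10, x2 = 1:10, y1 = 10:1, y2 = 10:1,
                 desc = c("a","a","a","b","b","b","b","b","c","c"))

# patterns() selects the .SD columns by regex, so no mget()/grep() needed
dt[, .(Sumx = sum(unlist(.SD))), by = desc, .SDcols = patterns("^x")]
#    desc Sumx
# 1:    a   12
# 2:    b   60
# 3:    c   38
```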