On my projects I usually do the data prep with a few functions, so my code usually looks like this:
readAndClean("directory") %>%
processing() %>%
readyForModelling()
Here I'm passing a data.table object from one function to the next.
I've gotten into the habit of always starting these functions with:
processing <- function(data_init){
data <- copy(data_init)
}
to avoid making changes to the DT in the global environment, as the following example does:
test <- data.table(cars[1:10,])
processing <- function(data_init){
data_init[, id := 1:.N]
return("done")
}
test
# speed dist
# 1: 4 2
# 2: 4 10
# 3: 7 4
# 4: 7 22
# 5: 8 16
# 6: 9 10
# 7: 10 18
# 8: 10 26
# 9: 10 34
# 10: 11 17
processing(test)
# [1] "done"
test
# speed dist id
# 1: 4 2 1
# 2: 4 10 2
# 3: 7 4 3
# 4: 7 22 4
# 5: 8 16 5
# 6: 9 10 6
# 7: 10 18 7
# 8: 10 26 8
# 9: 10 34 9
# 10: 11 17 10
But this always seems a little ugly to me.
Is it the correct way of handling data.tables inside functions?
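A minimal sketch of the behaviour described above, for comparison. The function names add_id_on_copy and add_id_in_place are hypothetical, made up for this illustration. It only shows that := changes the caller's table unless the function works on a copy(); it is not meant as the one correct pattern.

```r
library(data.table)

test <- data.table(cars[1:10, ])

# Hypothetical helper: works on a deep copy, so the caller's table is untouched
add_id_on_copy <- function(data_init) {
  data <- copy(data_init)
  data[, id := seq_len(.N)]
  data
}

# Hypothetical helper: := modifies the caller's table by reference
add_id_in_place <- function(data_init) {
  data_init[, id := seq_len(.N)]
  invisible(data_init)
}

res <- add_id_on_copy(test)
names_after_copy <- names(test)      # the original has no id column yet

add_id_in_place(test)
names_after_in_place <- names(test)  # the original now has an id column
```

So the copy() is only needed when the function body uses := (or the set* functions) on its argument.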
I am trying to call different columns of a data.table inside a loop, to get unique values of each column.
Consider the simple data.table below.
> df <- data.table(var_a = rep(1:10, 2),
+ var_b = 1:20)
> df
var_a var_b
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 8 8
9: 9 9
10: 10 10
11: 1 11
12: 2 12
13: 3 13
14: 4 14
15: 5 15
16: 6 16
17: 7 17
18: 8 18
19: 9 19
20: 10 20
My code works when I call for a specific column outside a loop,
> unique(df$var_a)
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, var_a])
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, "var_a"])
var_a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
but not when I do so within a loop that goes through different columns of the data.table.
> for(v in c("var_a","var_b")){
+ print(v)
+ df$v
+ unique(df[, .v])
+ unique(df[, "v"])
+ }
[1] "var_a"
Error in `[.data.table`(df, , .v) :
j (the 2nd argument inside [...]) is a single symbol but column name '.v' is not found. Perhaps you intended DT[, ...v]. This difference to data.frame is deliberate and explained in FAQ 1.1.
>
> unique(df[, ..var_a])
Error in `[.data.table`(df, , ..var_a) :
Variable 'var_a' is not found in calling scope. Looking in calling scope because you used the .. prefix.
For the first problem, when you're referencing a column name indirectly, you can either use double-dot ..v syntax, or add with=FALSE in the data.table::[ construct:
for (v in c("var_a", "var_b")) {
print(v)
print(df$v)
### either one of these will work:
print(unique(df[, ..v]))
# print(unique(df[, v, with = FALSE]))
}
# [1] "var_a"
# NULL
# var_a
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# [1] "var_b"
# NULL
# var_b
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# 11: 11
# 12: 12
# 13: 13
# 14: 14
# 15: 15
# 16: 16
# 17: 17
# 18: 18
# 19: 19
# 20: 20
# var_b
But this just prints it without changing anything. If all you want to do is look at unique values within each column (and not change the underlying frame), then I'd likely go with
lapply(df[,.(var_a, var_b)], unique)
# $var_a
# [1] 1 2 3 4 5 6 7 8 9 10
# $var_b
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
which shows the name and unique values. The use of lapply (whether on df as a whole or a subset of columns) is also preferable to another recommendation to use apply(df, 2, unique), though in this case it returns the same results.
Use .subset2 to refer to a column by its name:
for(v in c("var_a","var_b")) {
print(unique(.subset2(df, v)))
}
Following the hint in the first error message, this would be the correct way to select the column in a loop:
for(v in c("var_a","var_b")){
print(unique(df[, ..v]))
}
# won't print all the lines
As for the second error: you have not declared a variable called var_a. It looks like you want to select by name.
# works as you have shown
unique(df[, "var_a"])
# works once the variable is declared
var_a <- "var_a"
unique(df[, ..var_a])
You may also be interested in the env param of data.table (see development version); here is an illustration below, but you could use this in a loop too.
v="var_a"
df[, v, env=list(v=v)]
Output:
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
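To put the options side by side, here is a small sketch (the uniq list is just a name picked for this example). It assumes you want plain vectors of unique values per column: df[, ..v] gives back a one-column data.table, while df[[v]] gives back the column itself.

```r
library(data.table)

df <- data.table(var_a = rep(1:10, 2), var_b = 1:20)

uniq <- list()
for (v in c("var_a", "var_b")) {
  # ..v looks up v in the calling scope and returns a one-column data.table
  one_col <- df[, ..v]
  stopifnot(is.data.table(one_col))

  # [[ extracts the column as a plain vector
  uniq[[v]] <- unique(df[[v]])
}
uniq
```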
I am using frollsum with adaptive = TRUE to calculate the rolling sum over a window of 26 weeks; for weeks < 26, the window is exactly the number of available weeks.
Is there anything similar that, instead of a rolling sum, returns the median? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added "desired" column four)
week product sales desired
1: 1 1 8 8
2: 2 1 8 8
3: 3 1 7 8
4: 4 1 4 8
5: 5 1 7 7.5
6: 6 1 4 7.5
7: 7 1 8 8
8: 8 1 9 and
9: 9 1 4 so
10: 10 1 7 on
11: 11 1 5 ...
12: 12 1 3
13: 13 1 8
14: 14 1 10
Here is some example code:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
data[,desired:=frollapply(sales,26,median,adaptive=TRUE)] #This only starts at week 26
Thank you very much for your help!
Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
fifelse(is.na(x <- roll_medianr(sales, 26L)),
c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
x)]
or using replace instead of fifelse:
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
sapply(1L:25L, function(n) median(sales[1L:n])))]
output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
data:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
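If you'd rather not add a dependency, here is a rough sketch with plain sapply-style code. Two assumptions of mine: the window should restart for every product (hence by = product, which the answer above does not use), and the helper name roll_median_partial is made up. It will be slower than RcppRoll on big data.

```r
library(data.table)

set.seed(0L)
# Same shape as the question's data: 10 products x 100 weeks
data <- CJ(product = 1:10, week = 1:100)
data[, sales := round(runif(.N, 1, 10))]

# Hypothetical helper: median over the last n values, or over all available
# values when fewer than n exist yet
roll_median_partial <- function(x, n) {
  vapply(seq_along(x),
         function(i) median(x[max(1L, i - n + 1L):i]),
         numeric(1))
}

data[, med_sales := roll_median_partial(sales, 26L), by = product]
```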
Recently I saw a question (can't find the link) that was something like this:
I want to add a column on a data.frame that computes the variance of a different column while removing the current observation.
dt = data.table(
id = c(1:13),
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
)
So, with a for() loop:
res = NULL
for(i in 1:13){
res[i] = var(dt[-i,v])
}
I tried doing this in data.table, using negative indexing with .I, but to my surprise none of the following works:
#1
dt[,var := var(dt[,v][-.I])]
#2
dt[,var := var(dt$v[-.I])]
#3
fun = function(x){
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
var(v[-x])
}
dt[,var := fun(.I)]
#4
fun = function(x){
var(dt[-x,v])
}
dt[,var := fun(.I)]
All of those give the same output:
id v var
1: 1 9 NA
2: 2 5 NA
3: 3 8 NA
4: 4 1 NA
5: 5 25 NA
6: 6 14 NA
7: 7 7 NA
8: 8 87 NA
9: 9 98 NA
10: 10 63 NA
11: 11 32 NA
12: 12 12 NA
13: 13 15 NA
What am I missing? I thought it was a problem with .I being passed to functions, but a dummy example:
fun = function(x,c){
x*c
}
dt[,dummy := fun(.I,2)]
id v dummy
1: 1 9 2
2: 2 5 4
3: 3 8 6
4: 4 1 8
5: 5 25 10
6: 6 14 12
7: 7 7 14
8: 8 87 16
9: 9 98 18
10: 10 63 20
11: 11 32 22
12: 12 12 24
13: 13 15 26
works fine.
Why can't I use .I in this specific scenario?
You may use .BY:
a list containing a length 1 vector for each item in by
dt[ , var_v := dt[id != .BY$id, var(v)], by = id]
Variance is calculated once per row (by = id). In each calculation, the current row is excluded using id != .BY$id in the 'inner' i.
all.equal(dt$var_v, res)
# [1] TRUE
Why doesn't your code work? Because...
.I is an integer vector equal to seq_len(nrow(x)),
...your -.I not only removes current observation, it removes all rows in one go from 'v'.
A small illustration which starts with your attempt (just without the assignment :=) and simplifies it step by step:
# your attempt
dt[ , var(dt[, v][-.I])]
# [1] NA
# without the `var`, indexing only
dt[ , dt[ , v][-.I]]
# numeric(0)
# an empty vector
# same indexing written in a simpler way
dt[ , v[-.I]]
# numeric(0)
# even more simplified, with a vector of values
# and its corresponding indexes (equivalent to .I)
v <- as.numeric(11:14)
i <- 1:4
v[i]
# [1] 11 12 13 14
v[-i]
# numeric(0)
Here's a brute-force thought:
exvar <- function(x, na.rm = FALSE) sapply(seq_len(length(x)), function(i) var(x[-i], na.rm = na.rm))
dt[,var := exvar(v)]
dt
# id v var
# 1: 1 9 1115.538
# 2: 2 5 1098.265
# 3: 3 8 1111.515
# 4: 4 1 1077.841
# 5: 5 25 1153.114
# 6: 6 14 1132.697
# 7: 7 7 1107.295
# 8: 8 87 822.447
# 9: 9 98 684.697
# 10: 10 63 1040.265
# 11: 11 32 1153.697
# 12: 12 12 1126.424
# 13: 13 15 1135.538
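If the sapply gets slow on long vectors, a vectorised sketch is possible using running sums: the variance of m values is (sum of squares - sum^2 / m) / (m - 1), here with m = n - 1 for each left-out row. The name loo_var is made up for this example, it assumes no NAs and at least 3 rows, and the subtraction can lose precision when values are huge relative to their spread.

```r
library(data.table)

dt <- data.table(
  id = 1:13,
  v = c(9, 5, 8, 1, 25, 14, 7, 87, 98, 63, 32, 12, 15)
)

# Hypothetical helper: leave-one-out variance without a loop
loo_var <- function(x) {
  n <- length(x)
  s <- sum(x) - x        # sum of the other n - 1 values
  q <- sum(x^2) - x^2    # sum of squares of the other n - 1 values
  (q - s^2 / (n - 1)) / (n - 2)
}

dt[, var_loo := loo_var(v)]

# Brute-force reference, same idea as the for() loop in the question
res <- sapply(seq_len(nrow(dt)), function(i) var(dt$v[-i]))
all.equal(dt$var_loo, res)
```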
Finding the last position of a vector that is less than a given value is fairly straightforward (see e.g. this question).
But, doing this line by line for a column in a data.frame or data.table is horribly slow. For example, we can do it like this (which is ok on small data, but not good on big data)
library(data.table)
set.seed(123)
x = sort(sample(20,5))
# [1] 6 8 15 16 17
y = data.table(V1 = 1:20)
y[, last.x := tail(which(x <= V1), 1), by = 1:nrow(y)]
# V1 last.x
# 1: 1 NA
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 1
# 7: 7 1
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 2
# 14: 14 2
# 15: 15 3
# 16: 16 4
# 17: 17 5
# 18: 18 5
# 19: 19 5
# 20: 20 5
Is there a fast, vectorised way to get the same thing? Preferably using data.table or base R.
You may use findInterval
y[ , last.x := findInterval(V1, x)]
Slightly more convoluted using cut. But on the other hand, you get the NAs right away:
y[ , last.x := as.numeric(cut(V1, c(x, Inf), right = FALSE))]
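A small sketch of the findInterval route with the zeros turned into NA, in case you need the same output as the row-by-row version. Here x is hard-coded to the values printed in the question, so the result does not depend on the random seed.

```r
library(data.table)

x <- c(6L, 8L, 15L, 16L, 17L)  # sorted values, as printed in the question
y <- data.table(V1 = 1:20)

# findInterval returns 0 where V1 is below the smallest element of x
y[, last.x := findInterval(V1, x)]

# Recode those zeros to NA to match the by-row result
y[last.x == 0L, last.x := NA_integer_]
y
```

Note that findInterval needs x to be sorted in non-decreasing order, which it is here because of sort().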
Pretty simple in base R
x<-c(6L, 8L, 15L, 16L, 17L)
y<-1:20
cumsum(y %in% x)
[1] 0 0 0 0 0 1 1 2 2 2 2 2 2 2 3 4 5 5 5 5
I have a dataset which consists of three columns: user, action and time, which is a log of user actions. The data looks like this:
user action time
1: 618663 34 1407160424
2: 617608 33 1407160425
3: 89514 34 1407160425
4: 71160 33 1407160425
5: 443464 32 1407160426
---
996: 146038 8 1407161349
997: 528997 9 1407161350
998: 804302 8 1407161351
999: 308922 8 1407161351
1000: 803763 8 1407161352
I want to separate each user's actions into sessions based on action times. Actions performed within a certain period (for example, one hour) are assumed to belong to one session.
The simple solution is to use a for loop and compare action times for each user, but that's not efficient and my data is very large.
Is there any method I can use to overcome this problem?
I can group by user, but separating each user's actions into different sessions is somewhat difficult for me :-)
Try
library(data.table)
dt <- rbind(
data.table(user=1, action=1:10, time=c(1,5,10,11,15,20,22:25)),
data.table(user=2, action=1:5, time=c(1,3,10,11,12))
)
dt[, session:=cumsum(c(TRUE, !(diff(time)<=2))), by=user][]
# user action time session
# 1: 1 1 1 1
# 2: 1 2 5 2
# 3: 1 3 10 3
# 4: 1 4 11 3
# 5: 1 5 15 4
# 6: 1 6 20 5
# 7: 1 7 22 5
# 8: 1 8 23 5
# 9: 1 9 24 5
# 10: 1 10 25 5
# 11: 2 1 1 1
# 12: 2 2 3 1
# 13: 2 3 10 2
# 14: 2 4 11 2
# 15: 2 5 12 2
I used a difference of <=2 to collect sessions.
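For the data in the question, the same idea might look like the sketch below. Assumptions on my side: time is in seconds (it looks like a Unix timestamp), a new session starts when the gap to the user's previous action is more than one hour, and the rows in actions are made up for the example. The rows have to be sorted by time within user for diff() to make sense.

```r
library(data.table)

# Made-up sample in the same shape as the question's log
actions <- data.table(
  user   = c(1, 1, 1, 2, 2),
  action = c(34, 33, 8, 32, 9),
  time   = c(1407160424, 1407160500, 1407165000, 1407160426, 1407160430)
)

max_gap <- 3600  # one hour in seconds (assumed unit)

# Sort by user and time so consecutive rows are consecutive actions
setorder(actions, user, time)

# A gap larger than max_gap starts a new session; the first row always does
actions[, session := cumsum(c(TRUE, diff(time) > max_gap)), by = user]
actions
```

This is gap-based, so a long run of actions that are each less than an hour apart stays in one session even if the whole run lasts longer than an hour.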