frollsum, frollapply, etc. alternative: frollmedian?

I am using frollsum with adaptive = TRUE to calculate the rolling sum over a window of 26 weeks; for weeks < 26, the window is exactly the number of available weeks.
Is there anything similar, but instead of a rolling sum, a function to identify the most common value? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added the "desired" column as the fourth column):
week product sales desired
1: 1 1 8 8
2: 2 1 8 8
3: 3 1 7 8
4: 4 1 4 8
5: 5 1 7 7.5
6: 6 1 4 7.5
7: 7 1 8 8
8: 8 1 9 and
9: 9 1 4 so
10: 10 1 7 on
11: 11 1 5 ...
12: 12 1 3
13: 13 1 8
14: 14 1 10
Here is some example code:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
data[,desired:=frollapply(sales,26,median,adaptive=TRUE)] #This only starts at week 26
Thank you very much for your help!

Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
       fifelse(is.na(x <- roll_medianr(sales, 26L)),
               c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
               x)]
or using replace instead of fifelse:
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
                            sapply(1L:25L, function(n) median(sales[1L:n])))]
output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
data:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
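For completeness, here is a minimal dependency-free sketch (my own addition, not part of the answer above) that computes the adaptive median per product using only data.table and base R; med_sales2 is just an illustrative column name:
library(data.table)
setorder(data, product, week)                  # make sure weeks are in order within product
data[, med_sales2 := sapply(seq_len(.N), function(i)
         median(sales[max(1L, i - 25L):i])),   # median over the last up-to-26 weeks
     by = product]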

Related

R (data.table): call different columns in a loop

I am trying to call different columns of a data.table inside a loop, to get unique values of each column.
Consider the simple data.table below.
> df <- data.table(var_a = rep(1:10, 2),
+ var_b = 1:20)
> df
var_a var_b
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 8 8
9: 9 9
10: 10 10
11: 1 11
12: 2 12
13: 3 13
14: 4 14
15: 5 15
16: 6 16
17: 7 17
18: 8 18
19: 9 19
20: 10 20
My code works when I call for a specific column outside a loop,
> unique(df$var_a)
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, var_a])
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, "var_a"])
var_a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
but not when I do so within a loop that goes through different columns of the data.table.
> for(v in c("var_a","var_b")){
+ print(v)
+ df$v
+ unique(df[, .v])
+ unique(df[, "v"])
+ }
[1] "var_a"
Error in `[.data.table`(df, , .v) :
j (the 2nd argument inside [...]) is a single symbol but column name '.v' is not found. Perhaps you intended DT[, ...v]. This difference to data.frame is deliberate and explained in FAQ 1.1.
>
> unique(df[, ..var_a])
Error in `[.data.table`(df, , ..var_a) :
Variable 'var_a' is not found in calling scope. Looking in calling scope because you used the .. prefix.
For the first problem, when you're referencing a column name indirectly, you can either use double-dot ..v syntax, or add with=FALSE in the data.table::[ construct:
for (v in c("var_a", "var_b")) {
print(v)
print(df$v)
### either one of these will work:
print(unique(df[, ..v]))
# print(unique(df[, v, with = FALSE]))
}
# [1] "var_a"
# NULL
# var_a
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# [1] "var_b"
# NULL
# var_b
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# 11: 11
# 12: 12
# 13: 13
# 14: 14
# 15: 15
# 16: 16
# 17: 17
# 18: 18
# 19: 19
# 20: 20
# var_b
But this just prints it without changing anything. If all you want to do is look at unique values within each column (and not change the underlying frame), then I'd likely go with
lapply(df[,.(var_a, var_b)], unique)
# $var_a
# [1] 1 2 3 4 5 6 7 8 9 10
# $var_b
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
which shows the name and unique values. The use of lapply (whether on df as a whole or a subset of columns) is also preferable to another recommendation to use apply(df, 2, unique), though in this case it returns the same results.
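As a quick aside on that last point (my own illustration, not part of the original answer): apply() first coerces the data.table to a matrix, so with mixed column types everything would be converted to character before unique() runs, whereas lapply() works column by column and preserves each column's type.
str(apply(df, 2, unique))    # goes through as.matrix(), so types can be coerced
str(lapply(df, unique))      # list of per-column vectors, types preserved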
Use .subset2 to refer to a column by its name:
for(v in c("var_a","var_b")) {
print(unique(.subset2(df, v)))
}
Following the information in the first error message, this would be the correct way to call it in a loop:
for(v in c("var_a","var_b")){
print(unique(df[, ..v]))
}
# won't print all the lines
As for the second error, you have not declared a variable called "var_a"; it looks like you want to select by name.
# works as you have shown
unique(df[, "var_a"])
# works once the variable is declared
var_a <- "var_a"
unique(df[, ..var_a])
You may also be interested in the env parameter of data.table (see the development version); here is an illustration below, and a sketch of the loop version follows the output.
v="var_a"
df[, v, env=list(v=v)]
Output:
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
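As a hedged sketch of the loop version mentioned above (assuming a data.table version that already ships the env argument), the same substitution applied to both columns:
for (v in c("var_a", "var_b")) {
  # env= substitutes the character value of v as a column name in j
  print(unique(df[, col, env = list(col = v)]))
}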

Maximums of multiple data.table subsets

Given is a data.table with base data, a vector of start indices (startIndex) and a vector of durations (duration). Each duration is applied to each start index to define a subset.
base <- data.table(idx=c(1,2,3,4,5,6,7,8,9,10), val=c(11,12,13,14,15,16,17,18,19,20))
startIndex <- c(2, 4, 7, 9)
duration <- c(1,2,3)
Is there some elegant way to get the maximum val per subset defined by startIndex and duration, with the result shown below? For example, the first subset is defined by startIndex=2 and duration=1, which means that the max between index 2 and 3 is 13.
Result:
idxStart idxEnd max
1: 2 3 13
2: 4 5 15
3: 7 8 18
4: 2 4 14
5: 4 6 16
6: 7 9 19
7: 2 5 15
8: 4 7 17
9: 7 10 20
Thanks a lot.
Here is a data.table approach using a non-equi join. First, use expand.grid for combinations of start index and duration. Then, calculate the end index for each row. Then join with your base, where the index idx falls between the start and end, and keep the maximum val.
library(data.table)
dt <- data.table(expand.grid(idxStart = startIndex, Duration = duration))
dt[, idxEnd := idxStart + Duration][
  base, Max := max(val), on = .(idxStart <= idx, idxEnd >= idx), by = .EACHI]
Output
idxStart Duration idxEnd Max
1: 2 1 3 13
2: 4 1 5 15
3: 7 1 8 18
4: 9 1 10 20
5: 2 2 4 14
6: 4 2 6 16
7: 7 2 9 19
8: 9 2 11 20
9: 2 3 5 15
10: 4 3 7 17
11: 7 3 10 20
12: 9 3 12 20
I can't think of a particularly elegant solution here, but a map function should get the job done. It brute-forces each combination, so there may be a more efficient solution, but it should work.
library(data.table)
library(dplyr)   # for mutate(), slice(), select()
library(purrr)   # for map2()
base <- data.table(idx = c(1,2,3,4,5,6,7,8,9,10), val = c(11,12,13,14,15,16,17,18,19,20))
startIndex <- c(2, 4, 7, 9)
duration <- c(1, 2, 3)
combos <- expand.grid(startIndex = startIndex,
                      duration = duration) %>%
  mutate(endIndex = startIndex + duration)
max_slices <- map2(combos$startIndex, combos$endIndex, function(startIndex, endIndex) {
  slice(base, startIndex:endIndex) %>%   # rows from startIndex through endIndex
    select(val) %>%
    max()
}) %>%
  as.numeric()
result <- combos %>%
  cbind(max = max_slices)
Result:
   startIndex duration endIndex max
1           2        1        3  13
2           4        1        5  15
3           7        1        8  18
4           9        1       10  20
5           2        2        4  14
6           4        2        6  16
7           7        2        9  19
8           9        2       11  20
9           2        3        5  15
10          4        3        7  17
11          7        3       10  20
12          9        3       12  20
I have a solution using the map function; however, I don't think the result is kept as a data.table, so this may not be satisfactory. Please let me know if not and I can take another look, or refer to another answer. One option would be to run data.table() on the output.
library(tidyverse)
library(data.table)
library(dtplyr)
base <- data.table(idx=c(1,2,3,4,5,6,7,8,9,10), val=c(11,12,13,14,15,16,17,18,19,20))
startIndex <- c(2, 4, 7, 9)
duration <- c(1,2,3)
crossing(startIndex, duration) %>%
  data.table() %>%
  mutate(max = map2_dbl(startIndex, duration, ~ max(base$val[.x:(.x + .y)])))
#> Source: local data table [12 x 3]
#> Call: copy(`_DT1`)[, `:=`(max = map2_dbl(startIndex, duration, ~max(..base$val[.x:(.x +
#> .y)])))]
#>
#> startIndex duration max
#> <dbl> <dbl> <dbl>
#> 1 2 1 13
#> 2 2 2 14
#> 3 2 3 15
#> 4 4 1 15
#> 5 4 2 16
#> 6 4 3 17
#> # ... with 6 more rows
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
Created on 2021-04-04 by the reprex package (v2.0.0)
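For comparison, here is a compact rowwise sketch in plain data.table (my own addition, not from the answers above); it trades the non-equi join for an explicit per-window lookup, which is fine for a modest number of windows:
library(data.table)
base <- data.table(idx = 1:10, val = 11:20)
windows <- CJ(idxStart = c(2, 4, 7, 9), duration = c(1, 2, 3))
windows[, idxEnd := idxStart + duration]
# For each window, take the max of val over the rows whose idx falls inside it.
windows[, max := mapply(function(s, e) base[idx >= s & idx <= e, max(val)],
                        idxStart, idxEnd)]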

For each value of one column, find which is the last value of another vector that is lower

Finding the last position of a vector that is less than a given value is fairly straightforward (see e.g. this question).
But doing this line by line for a column in a data.frame or data.table is horribly slow. For example, we can do it like this (which is fine on small data, but not good on big data):
library(data.table)
set.seed(123)
x = sort(sample(20,5))
# [1] 6 8 15 16 17
y = data.table(V1 = 1:20)
y[, last.x := tail(which(x <= V1), 1), by = 1:nrow(y)]
# V1 last.x
# 1: 1 NA
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 1
# 7: 7 1
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 2
# 14: 14 2
# 15: 15 3
# 16: 16 4
# 17: 17 5
# 18: 18 5
# 19: 19 5
# 20: 20 5
Is there a fast, vectorised way to get the same thing? Preferably using data.table or base R.
You may use findInterval:
y[ , last.x := findInterval(V1, x)]
Slightly more convoluted using cut. But on the other hand, you get the NAs right away:
y[ , last.x := as.numeric(cut(V1, c(x, Inf), right = FALSE))]
Pretty simple in base R
x<-c(6L, 8L, 15L, 16L, 17L)
y<-1:20
cumsum(y %in% x)
[1] 0 0 0 0 0 1 1 2 2 2 2 2 2 2 3 4 5 5 5 5
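If you also want the NAs from the desired output, note that findInterval() returns 0 when no element of x is less than or equal to V1; a small follow-up sketch (my own addition) replaces those zeros with NA:
# Same findInterval() idea, but with 0 mapped to NA to match the OP's output.
y[, last.x := {
  k <- findInterval(V1, x)
  replace(k, k == 0L, NA_integer_)
}]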

R - indices of matching values of two data.tables

This is my first post on StackOverflow. I am relatively new to programming and am trying to work with data.table in R for its reputation for speed.
I have a very large data.table, named "Actions", with 5 columns and potentially several million rows. The column names are k1, k2, i, l1 and l2. I have another data.table, named "States", containing the unique values of Actions' columns k1, k2 and i.
For every row in Actions, I would like to find the index of the row(s) in States whose first two columns match columns 4 and 5 (l1 and l2) of that row. A reproducible example is as follows:
S.disc <- c(2000,2000)
S.max <- c(6200,2300)
S.min <- c(700,100)
Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- unique(Actions[,list(k1,k2,i)])
If I were using data.frame, the equivalent line would be:
index <- apply(Actions,1,function(x) {which((States[,1]==x[4]) & (States[,2]==x[5]))})
How can I do the same with data.table efficiently ?
This is relatively simple once you get the hang of keys and the special symbols which may be used in the j expression of a data.table. Try this...
# First make an ID for each row for use in the `dcast`
# because you are going to have multiple rows with the
# same key values and you need to know where they came from
Actions[ , ID := 1:.N ]
# Set the keys to join on
setkeyv( Actions , c("l1" , "l2" ) )
setkeyv( States , c("k1" , "k2" ) )
# Join States to Actions, using '.I', which
# is the row locations in States in which the
# key of Actions are found and within each
# group the row number ( 1:.N - a repeating 1,2,3)
New <- States[ J(Actions) , list( ID , Ind = .I , Row = 1:.N ) ]
# k1 k2 ID Ind Row
#1: 700 100 1 1 1
#2: 700 100 1 2 2
#3: 700 100 1 3 3
#4: 700 100 2 1 1
#5: 700 100 2 2 2
#6: 700 100 2 3 3
# reshape using 'dcast.data.table'
dcast.data.table( Row ~ ID , data = New , value.var = "Ind" )
# Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27...
#1: 1 1 1 1 4 4 4 7 7 7 10 10 10 13 13 13 16 16 16 1 1 1 4 4 4 7 7 7...
#2: 2 2 2 2 5 5 5 8 8 8 11 11 11 14 14 14 17 17 17 2 2 2 5 5 5 8 8 8...
#3: 3 3 3 3 6 6 6 9 9 9 12 12 12 15 15 15 18 18 18 3 3 3 6 6 6 9 9 9...
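As a hedged alternative sketch (my own, not the accepted dcast approach above): give States a row id, join on (k1, k2) matching Actions' (l1, l2), and collect every matching States row number per Actions row into a list column; st.row and st.rows are illustrative names.
library(data.table)
States[, st.row := .I]                        # row id so the join can report positions
matches <- States[Actions, on = .(k1 = l1, k2 = l2),
                  .(st.rows = list(st.row)),  # all matching States rows per Actions row
                  by = .EACHI, allow.cartesian = TRUE]
Actions[, st.rows := matches$st.rows]         # one list element per Actions row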

R data.table: Count events since last occurrence (multiple, inclusive/exclusive)

[updated: tried to clarify and simplify, corrected sample code and data.]
I have a set of measurements that are taken over a period of days. The range of numbers that can be captured in any measurement is 1-25 (in real life, given the test set, the range could be as high as 100 or as low as 20).
I'd like a way to tally how many events have passed since a specific number occurred, regardless of the measurement column, and I'd like the count to reset after the number is matched, as shown below.
V1, V2, ..., Vn are the captured values.
Match1, Match2, ..., Matchn are the "count since last encountered" columns.
Note: the Matchn count is incremented regardless of which Vx column the value n appears in.
Any help is much appreciated.
This is somewhat related to my earlier post here.
Sample input
library(data.table)
t <- data.table(
  Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5",
                   "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
  V1 = c(4, 2, 3, 1, 7, 22, 35, 3, 29, 36),
  V2 = c(2, 5, 12, 4, 8, 2, 38, 50, 4, 1)
)
Code for creating the sample output:
t$match1 <- c(1,2,3,4,1,2,3,4,5,1)
t$match2 <- c(1,1,2,3,4,5,1,2,3,4)
t$match3 <- c(1,2,3,1,2,3,4,5,1,2)
> t
Date V1 V2 match1 match2 match3
1: 2013-05-01 4 2 1 1 1
2: 2013-05-02 2 5 2 1 2
3: 2013-05-03 3 12 3 2 3
4: 2013-05-04 1 4 4 3 1
5: 2013-05-05 7 8 1 4 2
6: 2013-05-06 22 2 2 5 3
7: 2013-05-07 35 38 3 1 4
8: 2013-05-08 3 50 4 2 5
9: 2013-05-09 29 4 5 3 1
10: 2013-05-10 36 1 1 4 2
I think the OP has a bunch of typos in it; as far as I understand, you want this:
t <- data.table(
  Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5",
                   "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
  V1 = c(4, 2, 3, 1, 7, 22, 35, 52, 29, 36),
  V2 = c(2, 5, 2, 4, 8, 47, 38, 50, 4, 1)
)
t[, inclusive.match.1 := 1:.N, by = cumsum(V1 == 1 | V2 == 1)]
t[, exclusive.match.1 := 1:.N, by = rev(cumsum(rev(V1 == 1 | V2 == 1)))]
t
# Date V1 V2 inclusive.match.1 exclusive.match.1
# 1: 2013-05-01 4 2 1 1
# 2: 2013-05-02 2 5 2 2
# 3: 2013-05-03 3 2 3 3
# 4: 2013-05-04 1 4 1 4
# 5: 2013-05-05 7 8 2 1
# 6: 2013-05-06 22 47 3 2
# 7: 2013-05-07 35 38 4 3
# 8: 2013-05-08 52 50 5 4
# 9: 2013-05-09 29 4 6 5
#10: 2013-05-10 36 1 1 6
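To make the trick reusable, here is a small hedged sketch of mine (not part of the answer; count_since and the match.<value> column names are made up for illustration) that wraps the same cumsum() grouping in a helper for any target value, using the inclusive variant where the counter restarts at the row containing the match:
library(data.table)
count_since <- function(dt, value, cols = c("V1", "V2")) {
  newcol <- paste0("match.", value)
  # TRUE wherever the target value appears in any of the given columns
  hit <- Reduce(`|`, lapply(cols, function(cl) dt[[cl]] == value))
  # restart the row counter at every hit (inclusive counting)
  dt[, (newcol) := seq_len(.N), by = cumsum(hit)]
}
count_since(t, 1)
count_since(t, 2)
t[]   # t now has match.1 and match.2 columns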
