Maximums of multiple data.table subsets - r

Given a data.table with base data, a vector of subset start indices (startIndex), and a vector of subset durations (duration). Each duration is applied to each start index.
base <- data.table(idx=c(1,2,3,4,5,6,7,8,9,10), val=c(11,12,13,14,15,16,17,18,19,20))
startIndex <- c(2, 4, 7, 9)
duration <- c(1,2,3)
Is there some elegant way to get the maximum val per subset defined by startIndex and duration, with the result shown below? For example, the first subset is defined by startIndex=2 and duration=1, which means that the max between index 2 and 3 is 13.
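For reference, a single subset's maximum can be checked directly with a data.table filter (a quick sanity check, not part of the original question):
base[idx >= 2 & idx <= 2 + 1, max(val)]  # subset starting at idx 2 with duration 1
#> [1] 13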
Result:
idxStart idxEnd max
1: 2 3 13
2: 4 5 15
3: 7 8 18
4: 2 4 14
5: 4 6 16
6: 7 9 19
7: 2 5 15
8: 4 7 17
9: 7 10 20
Thanks a lot.

Here is a data.table approach using a non-equi join. First, use expand.grid to get all combinations of start index and duration, and calculate the end index for each row. Then join with your base where the index idx falls between the start and end index, keeping the maximum val per combination.
library(data.table)
dt <- data.table(expand.grid(idxStart = startIndex, Duration = duration))
dt[, idxEnd := idxStart + Duration][
  base, Max := max(val), on = .(idxStart <= idx, idxEnd >= idx), by = .EACHI]
Output
idxStart Duration idxEnd Max
1: 2 1 3 13
2: 4 1 5 15
3: 7 1 8 18
4: 9 1 10 20
5: 2 2 4 14
6: 4 2 6 16
7: 7 2 9 19
8: 9 2 11 20
9: 2 3 5 15
10: 4 3 7 17
11: 7 3 10 20
12: 9 3 12 20
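Individual rows of this output can be verified against base directly; for example (a manual check, not part of the original answer):
base[idx >= 4 & idx <= 6, max(val)]  # row 6 above: idxStart 4, Duration 2, idxEnd 6
#> [1] 16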

I can't think of a particularly elegant solution here, but a map function should get the job done. This brute-forces every combination, so there may be a more efficient approach, but it works.
library(data.table)
library(dplyr) # for mutate(), slice(), pull()
library(purrr) # for map2()
base <- data.table(idx=c(1,2,3,4,5,6,7,8,9,10), val=c(11,12,13,14,15,16,17,18,19,20))
startIndex <- c(2, 4, 7, 9)
duration <- c(1,2,3)
combos <- expand.grid(startIndex = startIndex,
                      duration = duration) %>%
  mutate(endIndex = startIndex + duration)
max_slices <- map2(combos$startIndex, combos$endIndex, function(startIndex, endIndex){
  slice(base, startIndex:endIndex) %>% # take the full row range, not just the two endpoints
    pull(val) %>%
    max()
}) %>%
  as.numeric()
result <- combos %>%
  cbind(max = max_slices)
Result:
startIndex duration endIndex max
1 2 1 3 13
2 4 1 5 15
3 7 1 8 18
4 9 1 10 20
5 2 2 4 14
6 4 2 6 16
7 7 2 9 19
8 9 2 11 20
9 2 3 5 15
10 4 3 7 17
11 7 3 10 20
12 9 3 12 20

I have a solution using the map function; however, I don't think I have kept the result as a data.table, so this may not be satisfactory. Please let me know if not and I can take another look, or refer to another answer. One option would be to run as.data.table() on the output.
library(tidyverse)
library(data.table)
library(dtplyr)
base <- data.table(idx=c(1,2,3,4,5,6,7,8,9,10), val=c(11,12,13,14,15,16,17,18,19,20))
startIndex <- c(2, 4, 7, 9)
duration <- c(1,2,3)
crossing(startIndex, duration) %>%
  data.table() %>%
  mutate(max = map2_dbl(startIndex, duration, ~ max(base$val[.x:(.x + .y)])))
#> Source: local data table [12 x 3]
#> Call: copy(`_DT1`)[, `:=`(max = map2_dbl(startIndex, duration, ~max(..base$val[.x:(.x +
#> .y)])))]
#>
#> startIndex duration max
#> <dbl> <dbl> <dbl>
#> 1 2 1 13
#> 2 2 2 14
#> 3 2 3 15
#> 4 4 1 15
#> 5 4 2 16
#> 6 4 3 17
#> # ... with 6 more rows
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
Created on 2021-04-04 by the reprex package (v2.0.0)
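As the printed message suggests, the lazy dtplyr result can be collected into an ordinary table when needed; for example, appending one step to the same pipeline:
crossing(startIndex, duration) %>%
  data.table() %>%
  mutate(max = map2_dbl(startIndex, duration, ~ max(base$val[.x:(.x + .y)]))) %>%
  as.data.table() # materialize the lazy result as a plain data.table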

Related

using intervals in a column to populate values for another column

I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
                        Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
  group_by(Condition) %>%
  mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can get the difference of range (range returns the min/max as a vector) and divide by the constant (i.e. 3), passing that as the breaks in cut. Then, use the integer index (labels = FALSE) to get the corresponding letter from the LETTERS built-in R constant.
library(dplyr)
dataframe %>%
  group_by(Condition) %>%
  mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
                             labels = FALSE)])
If the grouping should instead be based on runs of adjacent identical values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above:
library(data.table)
dataframe %>%
  group_by(grp = rleid(Condition)) %>%
  mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
                             labels = FALSE)])
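To see why rleid matters here: it assigns each run of repeated Condition values its own id, so the two separate blocks of Condition == 1 become distinct groups. An illustrative check on the example data:
rleid(dataframe$Condition)
#> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6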
Here's a one-liner using my santoku package. The rleid line is the same as in @akrun's solution.
library(santoku)
library(dplyr)
library(magrittr) # for the %<>% assignment pipe
dataframe %<>%
  group_by(grp = data.table::rleid(Condition)) %>%
  mutate(
    Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
  )
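For intuition: chop_evenly() splits the range of its input into equal-width intervals, and lbl_seq("A") labels them A, B, C, and so on. A standalone toy example using the first Condition's times (my addition, not from the answer):
chop_evenly(c(1, 3, 4, 7, 9), intervals = 3, labels = lbl_seq("A"))
#> [1] A A B C C
#> Levels: A B C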

Count rows satisfying "less than" filter for sequence of values

I have a dataset with a bunch of times. Let's say I want to create a summary table that counts the number of rows satisfying a "less than" filter for a sequence of values, say [number of rows with time < 6, number of rows with time < 7, etc.].
Example dataset:
data.frame(personId = c("2009ZEMD01", "2012PARK03", "2017VILL41", "2010WEYE01", "2016KOLA02", "2012PONC02"),
average = c(553, 559, 598, 606, 612, 613))
This was my solution using sapply (best_3x3_solvers is my full dataset, which is why the counts below are much larger than the six-row example would give):
tibble(time = 6:15,
       count = sapply(time, function(t) best_3x3_solvers %>% filter(average/100 < t) %>% nrow))
The result:
> solvers_under
# A tibble: 10 x 2
time count
<int> <int>
1 6 3
2 7 48
3 8 274
4 9 840
5 10 1952
6 11 3792
7 12 6269
8 13 9459
9 14 13204
10 15 17274
The code is not too long but is there a method using more tidyverse tools without *apply? Maybe summarize with n().
One dplyr and purrr option could be:
map_dfr(.x = 6:15,
        ~ df %>%
          group_by(time = .x) %>%
          summarise(count = sum(average/100 < .x)))
time count
<int> <int>
1 6 3
2 7 6
3 8 6
4 9 6
5 10 6
6 11 6
7 12 6
8 13 6
9 14 6
10 15 6
Here's one way:
library(dplyr)
library(purrr)
map_df(6:15, ~df %>% summarise(time = .x, count = sum(average/100 < .x)))
# time count
# 1 6 3
# 2 7 6
# 3 8 6
# 4 9 6
# 5 10 6
# 6 11 6
# 7 12 6
# 8 13 6
# 9 14 6
#10 15 6
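If you want to avoid map/apply constructs entirely, one possible alternative (my sketch, not from the answers above) is to cross the data with the thresholds and aggregate:
library(dplyr)
library(tidyr)
crossing(df, time = 6:15) %>%
  group_by(time) %>%
  summarise(count = sum(average/100 < time)) # same counts as the map_df version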
You can use summarise, count and filter:
df %>% group_by(time) %>% summarise(count = n()) %>% filter(count < t)

How to create a table with flexible columns controlled by variables in R?

I want to create a table like:
1 1 6 6 10 10 ...
2 2 7 7 11 11 ...
3 3 8 8 12 12 ...
4 4 9 9 13 13 ...
5 5 14 14 ...
15 15 ...
I want to use these variables:
n (the number of repeats), m (the total number of columns), and k (the start of each column, which is the prior column's end number + 1; for example, 6 = 5 + 1 and 10 = 9 + 1), together with a different row length per column,
to create the table.
I know I can use something like:
rep(list(1:5, 6:9, 10:15), each = 2)
but how do I turn these into parameters, using a general expression in place of the hand-written list(1:5, 6:9, 10:15, ...) built from the n, m, k expressions?
I tried to use a loop, for (i in 1:m), etc., but cannot work it out.
Finally, I want a single sequence by using unlist(): 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, ...
Many thanks.
Maybe the code below can help
len <- c(5,4,6)
res <- unlist(unname(rep(split(1:sum(len),
                               findInterval(1:sum(len), cumsum(len) + 1)),
                         each = 2)))
which gives
> res
[1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15
Probably, something like this would be helpful.
# Number of times to repeat
r <- 2
# Length of each sequence
len <- c(5, 4, 6)
# Get the end of each sequence
end <- cumsum(len)
# Calculate the start of each sequence
start <- c(1, end[-length(end)] + 1)
# Create each sequence from start to end and repeat it r times
Map(function(x, y) rep(seq(x, y), r), start, end)
#[[1]]
# [1] 1 2 3 4 5 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 6 7 8 9
#[[3]]
# [1] 10 11 12 13 14 15 10 11 12 13 14 15
You could unlist to get it as one vector.
unlist(Map(function(x, y) rep(seq(x, y), r), start, end))
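To address the parameterization the question asks about, the same idea can be wrapped in a small function of r (the repeats) and len (the row lengths); a sketch (the function name is mine):
make_blocks <- function(r, len) {
  end <- cumsum(len)               # end number of each column
  start <- c(1, head(end, -1) + 1) # start number of each column
  unlist(Map(function(x, y) rep(seq(x, y), r), start, end))
}
make_blocks(2, c(5, 4, 6))
#> [1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15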

frollsum, frollapply, etc... alternative: frollmedian?

I am using frollsum with adaptive = TRUE to calculate the rolling sum over a window of 26 weeks; for weeks < 26, the window is exactly the number of available weeks.
Is there anything similar, but instead of a rolling sum, a rolling median? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added the desired column as column four):
week product sales desired
1: 1 1 8 8
2: 2 1 8 8
3: 3 1 7 8
4: 4 1 4 8
5: 5 1 7 7.5
6: 6 1 4 7.5
7: 7 1 8 8
8: 8 1 9 and
9: 9 1 4 so
10: 10 1 7 on
11: 11 1 5 ...
12: 12 1 3
13: 13 1 8
14: 14 1 10
Here is some example code:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
data[, desired := frollapply(sales, 26, median)] # This only starts at week 26, and frollapply does not accept adaptive = TRUE
Thank you very much for your help!
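Until an adaptive rolling median exists in data.table, one straightforward (if slow) sketch is to compute the growing-then-sliding window median with sapply, assuming the window should restart for each product (my addition, not from the question):
data[, desired := sapply(seq_len(.N), function(i) median(sales[max(1, i - 25):i])),
     by = product]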
Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
       fifelse(is.na(x <- roll_medianr(sales, 26L)),
               c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
               x)]
or using replace instead of fifelse:
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
                            sapply(1L:25L, function(n) median(sales[1L:n])))]
output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
data:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")

How to mimic ROW_NUMBER() OVER(...) in R

To manipulate/summarize data over time, I usually use SQL's ROW_NUMBER() OVER (PARTITION BY ...). I'm new to R, so I'm trying to recreate tables I otherwise would create in SQL. The sqldf package does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include the average of the previous 2 instances for each day after their 2nd instance (day 4 for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 87.5
8 7 A 20 32.5
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to use the fact that id is sequential with the date to form my answer (i.e. original table could be rearranged with random date order and code would still work). Let me know if you need more details, thanks for your help!
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>%
  arrange(Day) %>%     # sort
  group_by(Person) %>% # set grouping
  mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
  c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
  (c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.
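For completeness, filtering out the warm-up rows reproduces the table the question asked for; a quick check using the short dplyr version (my verification, same numbers as above):
library(dplyr)
df %>%
  arrange(Day) %>%
  group_by(Person) %>%
  mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2) %>%
  ungroup() %>%
  tidyr::drop_na(Prev2)
#>   ID Day Person Cost Prev2
#> 1  5   4 A        35  40.0
#> 2  6   4 B       100  50.0
#> 3  7   6 B        65  87.5
#> 4  8   7 A        20  32.5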
