Conditional rolling average - R

library(data.table)
set.seed(123)
d <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5))
For each ID, I want to compute a 7-year rolling average of y starting from 1998 onwards. However, the condition is that in each rolling window I only use the top 5 highest values of y in the average. For example, the first rolling window would be
1998-2004 - only do the average of top 5 highest 'y' values
1999-2005 - only do the average of top 5 highest 'y' values
.
.
2007-2013 - only do the average of top 5 highest 'y' values
2008-2014 - only do the average of top 5 highest 'y' values
I am interested in using data.table to achieve this, but I am also open to other suggestions. Here's what I tried:
d = setDT(d)
d[, avg.Y := frollmean(y, 7), by = ID]
How do I add another argument so that for each rolling 7-year window I only select the top 5 highest y values to calculate the mean?
EDIT
I could also have a case where some IDs do not have a minimum of 7 years of data for a moving average, in which case the above function gives me NAs. For those IDs, is it possible to simply take an arithmetic mean? For example, if an ID has data from 1998-2002, can I simply take the average of y from 1998-2002?

We can use rollapplyr from zoo and apply a custom function to calculate the mean of the top 5 values in each rolling window.
library(data.table)
library(zoo)
setDT(d)
d[, avg.Y := rollapplyr(y, 7, function(x) mean(tail(sort(x), 5)), fill = NA), by = ID]
For cases where there could be fewer observations than the window size, we can do
d[, avg.Y := if (.N > 6)
      rollapplyr(y, 7, function(x) mean(tail(sort(x), 5)), fill = NA)
    else mean(y),
  by = ID]
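As a quick sanity check (a suggestion, not part of the original answer), the first non-NA value for an ID should equal the mean of the top 5 y values in its first window, 1998-2004:
# top-5 mean of the first window for ID 1, computed directly
d[ID == 1 & yearRef %in% 1998:2004, mean(tail(sort(y), 5))]
# should match the rolling value stored at the end of that window
d[ID == 1 & yearRef == 2004, avg.Y]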

First time using frollapply() but this seems to work:
get_mean_top5 <- function(x) mean(-sort(-x, partial = 1:5)[1:5])
d[, test := frollapply(y, 7, FUN = get_mean_top5), by = ID]
The function get_mean_top5() selects the 5 highest values and then takes their mean. A more readable form would be:
get_mean_top5 <- function(x) mean(x[order(x, decreasing = TRUE)[1:5]])
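To also cover the edit about IDs with fewer than 7 years of data, the same conditional pattern as in the zoo answer should work with frollapply() too; a minimal sketch, untested against the original data:
d[, test := if (.N >= 7)
      frollapply(y, 7, FUN = get_mean_top5)   # NA for the first 6 rows of each ID
    else mean(y),                             # plain mean when a full window never exists
  by = ID]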

A base R solution with a few more steps and a little repetition:
df$seven_year_group <- paste0(ave(as.integer(as.factor(df$yearRef)) %% 7,
                                  as.integer(as.factor(df$yearRef)) %% 7,
                                  FUN = seq.int),
                              "_",
                              df$ID)
seven_year_averages <- data.frame(avg_y = do.call("rbind",
                                  lapply(split(df, df$seven_year_group),
                                         function(x) mean(tail(x[order(x$y), "y"], 5)))))
seven_year_averages$seven_year_group <- row.names(seven_year_averages)
df <- merge(df, seven_year_averages, by = "seven_year_group", all.x = TRUE)
Data:
set.seed(2019)
df <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5))

Related

Apply a rolling function to a data.table in R with a custom window size

I want to apply a rolling min function to a data table but with a custom start and end point for the window.
In this data set column x is what I want the min to be applied to and y is the desired output.
library(data.table) #version 1.13.6
dt <- data.table(x = seq(1:50))
dt[, y := c(0,0,0,0,0,0,0,0,0, seq(1:41))]
In my current data I want to get the rolling min over a right-aligned window of 5, but shifted so that the window ends 5 rows before the current row. So in row 10 it is looking for the min in rows 1:5, in row 11 it is looking at rows 2:6, and so on.
Thanks
dt[, z := shift(frollapply(x, n = 5, FUN = min, fill = 0), n = 5, fill = 0)]
dt[, all.equal(y, z)] # TRUE
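To see why the shift lines up, a quick spot check (a suggestion, not part of the original answer): row 10 holds the minimum of rows 1 to 5, i.e. the 5-row window ending 5 rows earlier.
dt[10, z] == min(dt[1:5, x])  # expected TRUE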

R: Replace values between two numbers with the number

Here is the dataframe
sampledf = data.frame(timeinterval = c(1:120), hour = c(rep(NA, times = 85), 1, rep(NA, times = 5), 1, rep(NA, times = 4),1, rep(NA, times = 4), 1, rep(NA, times = 18)))
I want to replace the NAs in the hour column so that the values between rows 86 and 92 (inclusive) and between rows 97 and 102 (inclusive) are all 1.
Here is what I've tried so far:
1. Getting the indices of rows with value 1 in the hour column
2. Looping through (This is what is not working!)
ones = which(sampledf$hour == 1)
n = (length(ones)+1)/2
chunk <- function(ones,n) split(ones, cut(seq_along(ones), n, labels = FALSE))
y = chunk(ones,n)
for (i in y) {
sampledf$Hour[c(y$i[1]:y$i[2])] == 1
}
Help me out, I'm new to R.
In Python we have the ffill method for this; what's the equivalent here?
Thanks!
library(data.table)  # for between()
sampledf$hour[between(sampledf$timeinterval, 86, 92) | between(sampledf$timeinterval, 97, 102)] <- 1
Basically you subset sampledf's hour column to those cases where timeinterval is between 86-92 or (|) 97-102, and assign 1 to all those cases.
If you want to assign 1 to all timeintervals in the given ranges:
sampledf$hour[sampledf$timeinterval %in% c(86:92,97:102)] <- 1
If you want to assign 1 to cases based on the rownumbers of your data:
sampledf$hour[c(86:92,97:102)] <- 1
If you want to add a cumulative sum to your values as in your comment, you can just use the cumsum() function and do:
sampledf$hour[which(sampledf$hour == 1)] <- cumsum(sampledf$hour[which(sampledf$hour == 1)])
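On the ffill question: the closest equivalents in R are data.table::nafill() with type = "locf" or zoo::na.locf(). Note that a plain forward fill carries each 1 through every following NA, which is broader than the 86-92 and 97-102 ranges wanted here, so this is just a sketch of the ffill idiom (hour_ffill is a hypothetical new column):
library(data.table)
# forward-fill: last observation carried forward, roughly pandas' ffill
sampledf$hour_ffill <- nafill(sampledf$hour, type = "locf")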

Join 2 data frames using data.table with conditions

I have these two data frames:
set.seed(42)
A <- data.table(station = sample(1:10, 1000, replace = TRUE),
                hash = sample(letters[1:5], 1000, replace = TRUE),
                point = sample(1:24, 1000, replace = TRUE))
B <- data.table(station = sample(1:10, 100, replace = TRUE),
                card = sample(letters[6:10], 100, replace = TRUE),
                point = sample(1:24, 100, replace = TRUE))
Dataframe A contains more than 1M rows.
I am trying to find the hash (from A) for each card (from B). There are some conditions: the station and point in A must lie within a range (for station +/- 1, and for point just + 2).
I group B by card and, for each group, run a function that binds the rows matching these conditions and then takes the hash with the maximum frequency.
detect <- function(x){
  am0 <- data.frame(station = 0,
                    hash = 0,
                    point = 0)
  for (i in 1:nrow(x)) {
    am1 <- A %>%
      filter(station %in% (B$station[i] - 1):(B$station[i] + 1) &
               point > B$point[i] & point < B$point[i] + 2)
    am0 <- rbind(am0, am1)
  }
  t <- as.data.frame(table(am0$hash))
  t <- t %>%
    arrange(-Freq) %>%
    filter(row_number() == 1)
  return(t)
}
And then just:
library(dplyr)
B %>%
  group_by(card) %>%
  do(detect(.)) %>%
  ungroup()
But I don't know how to make the function use the indices [i] within each group, so I actually get a wrong result.
# A tibble: 5 x 3
card Var1 Freq
<chr> <fctr> <int>
1 f c 46
2 g c 75
3 h c 41
4 i c 64
5 j c 62
I'm a beginner, but I know the best solution for big datasets is to use the data.table library to join two datasets like these. Can you help me find a solution for this?
I think what you want to do is:
#### Prepare join limits
B[, point_limit := as.integer(point + 2)]
B[, station_lower := as.integer(station - 1)]
B[, station_upper := as.integer(station + 1)]
## Join A on B; creates all combinations of rows in A and B fulfilling the conditions
joined_table <- B[A,
                  on = .(point_limit >= point, point <= point,
                         station_lower <= station, station_upper >= station),
                  nomatch = 0,
                  allow.cartesian = TRUE]
## Count the occurrences of the combinations
counted_table <- joined_table[, .N, by = .(card, hash)][order(card, -N)]
## Select the top for each group.
counted_table[, head(.SD, 1), by = .(card)][order(card)]
This will create a full table with all the information in it and then do the counting on that. It relies purely on data.table to fully take advantage of the speed gains from that package. The data.table vignette is good if you are unfamiliar with the syntax. The nomatch = 0 condition ensures that we are doing an inner join.
This will probably be fine if A is only 1M rows and B is kept the same size, depending on your data's distribution. We can, however, also split B in a similar way to your do statement using the purrr package. I'm not sure how this interacts with R's garbage collection, though.
frame_list <- purrr::map(unique(B$card),
                         ~ B[card == .x][A,
                                         on = .(point_limit >= point,
                                                point <= point,
                                                station_lower <= station,
                                                station_upper >= station),
                                         nomatch = 0,
                                         allow.cartesian = TRUE][, .N, by = .(card, hash)])
counted_table_mem <- rbindlist(frame_list)
Something to note here is that I use rbindlist instead of multiple rbind calls. Repeatedly calling rbind would be very slow, since new memory has to be allocated each time.
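A small standalone illustration of that point, with toy data rather than the tables above:
library(data.table)
pieces <- lapply(1:100, function(i) data.table(id = i, value = rnorm(10)))
slow <- Reduce(rbind, pieces)   # repeated rbind: copies the accumulated table at every step
fast <- rbindlist(pieces)       # binds all pieces with a single allocation
nrow(slow) == nrow(fast)        # both are 1000 rows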

Getting rep(0, x) in combination with data.table in R in order to create leading zeros (invalid 'times' argument)

I created the following data.table as an example:
dt <- data.table(x = c(1, 12, 200, 1600))
dt[, y := " "]
My goal is to fill the y column with the x values extended by leading zeros such that each entry in y consists of four digits (i.e. 0001, 0012, 0200, 1600).
My idea is as follows:
dt[, y := x] # fill column with original values
dt[nchar(as.integer(x)) < 4, y := paste(paste(rep(0, 4-nchar(as.integer(x))), collapse=""), x, sep="")]
This command is supposed to check whether x consists of fewer than 4 digits and, if so, generate the required number of zeros and paste them at the beginning of the string. Executing the statement, however, yields the message "Error in rep(0, 4 - nchar(as.integer(x))) : invalid 'times' argument".
I know that my basic idea is correct since the following command works properly:
dt[nchar(as.integer(x)) < 4, y := paste(paste(rep(0, 4), collapse=""), x, sep="")]
Here, I simply replaced the second argument in rep() by a random number (4 in this case).
Therefore, rep() obviously has some problems understanding the column reference made by x. Other functions (e.g. as.numeric() and many many more) don't have problems with this.
Thanks for any help!
Just use formatC:
library(data.table)
dt <- data.table(x = c(1, 12, 200, 1600))
dt[, y := formatC(x, width = 4, format = "d", flag = "0")]
dt
x y
1: 1 0001
2: 12 0012
3: 200 0200
4: 1600 1600
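A base R alternative, if you prefer (a suggestion beyond the answer above), is sprintf() with a zero-padded integer format; width 4 matches the 0001/0012/0200/1600 example in the question:
dt[, y := sprintf("%04d", as.integer(x))]  # %d needs an integer, hence as.integer()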
I think the problem is that you're feeding rep() a vector of length>1.
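That diagnosis can be reproduced directly outside data.table (an illustration, not part of the original post): when more than one row matches the filter, 4 - nchar(as.integer(x)) is a vector, and rep() rejects a 'times' vector whose length differs from the length of its first argument.
4 - nchar(c(1, 12))      # c(3, 2)
rep(0, times = c(3, 2))  # Error: invalid 'times' argument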
There are probably format functions you could use; below is a step-by-step workaround.
dt <- data.frame(x = c(1, 12, 200, 1600))
dt$times_to_rep <- 4 - nchar(dt$x)
dt$power_of_ten <- 10^dt$times_to_rep
dt$zeros <- substring(dt$power_of_ten, 2, nchar(dt$power_of_ten))
dt$y <- paste0(dt$zeros, dt$x)

Removing outliers from groups using data.table in R

I have a data.table object that contains a group column. I am trying to remove outliers from each of the groups, but I cannot come up with a nice solution for that. My data.table can be built using this simple script:
col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)
dt <- data.table(id, group, col1, col2)
I've been trying to split the data.frame by the group variable; however, that approach is too messy. How would I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?
Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile:
dt_filt <- dt[,
  .SD[(col1 < quantile(col1, probs = 0.95)) &
      (col2 < quantile(col2, probs = 0.95))],
  by = group
]
which basically splits the data based on the group column, calculates the thresholds, and then subsets the data to keep only rows where col1 and col2 are lower than the thresholds.
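A quick way to see the effect (a suggested check using the dt built above) is to compare the group sizes before and after filtering:
dt[, .N, by = group]        # rows per group before filtering
dt_filt[, .N, by = group]   # rows remaining after dropping the top-5% rows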
