I have a data frame in which each row represents a community. The first column is the group the community falls into (6 groups in total), and the remaining 8 columns are the IDs of the community's members.
I would like to pick one community (row) from each of groups 1, 3, and 5 such that the three picked communities share no members. Once I have that, I would like to pick one community from each of groups 2, 4, and 6 such that there is no more than 25% overlap among the 6 selected communities.
Here is an example dataset:
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
Based on the criteria I mentioned above, the following could be pulled out:
Group 1: 125, 8, 40, 127, 19, 33, 29, 3
Group 3: 11, 25, 126, 22, 56, 4, 6, 52
Group 5: 5, 63, 18, 48, 37, 32, 43, 1
Group 2: 25, 37, 8, 38, 40, 124, 32, 56
Group 4: 125, 15, 29, 4, 48, 5, 128, 11
Group 6: 34, 23, 33, 32, 63, 22, 19, 56
I believe this might be helpful (please let me know if not!).
The first step is to subset your data to groups 1, 3, and 5. Then transpose from purrr turns the rows into a list, split divides that list by Group, and cross returns every combination that takes one row from each group. (Note that cross was deprecated in purrr 1.0.0, so newer versions will emit a warning.)
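library(purrr)
library(dplyr)  # bind_rows() comes from dplyr
df <- as.data.frame(df)  # cbind() returns a matrix, which does not support `$`
grp_135 <- df[df$Group %in% c(1, 3, 5), ]
all_combn_135 <- lapply(cross(split(transpose(grp_135), grp_135$Group)), bind_rows)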
Checking the first element to see what we have:
R> all_combn_135[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 29 4 3 63 125 40 32 38
3 5 5 63 18 48 37 32 43 1
Next, we can check for overlap by counting duplicates. Here I unlist the three rows (dropping the Group column), use table to get the frequency of each ID, and sum the counts after subtracting 1 from each, so that only the extra occurrences are counted.
combn_ovlp_135 <- lapply(all_combn_135, function(x) {
  sum(table(unlist(x[-1])) - 1)
})
The ones without overlap can be obtained by:
no_ovlp <- all_combn_135[combn_ovlp_135 == 0]
no_ovlp
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 11 25 126 22 56 4 6 52
3 5 5 63 18 48 37 32 43 1
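The same enumeration can be sketched in base R, which may be useful because cross is deprecated in recent purrr releases. This is only a minimal sketch: the names `members`, `rows_by_group`, `combos` and `n_dup` are made up for illustration, and the data are the example from the question stored as a data frame.

```r
# Example data from the question, stored as a data frame
df <- data.frame(
  Group  = rep(1:6, each = 4),
  Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1),
  Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4),
  Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11),
  Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5),
  Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52),
  Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23),
  Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18),
  Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
)

# Member IDs as a matrix, one row per community
members <- as.matrix(df[, -1])

# Row numbers of the communities in groups 1, 3 and 5, split by group
in_135 <- df$Group %in% c(1, 3, 5)
rows_by_group <- split(which(in_135), df$Group[in_135])
names(rows_by_group) <- paste0("g", names(rows_by_group))

# Every way of taking one row from each group (4 * 4 * 4 = 64 combinations)
combos <- expand.grid(rows_by_group)

# Number of repeated IDs when the three chosen rows are pooled
n_dup <- apply(combos, 1, function(r) {
  ids <- as.vector(members[r, ])
  length(ids) - length(unique(ids))
})

# Combinations whose communities share no member
no_overlap_rows <- combos[n_dup == 0, , drop = FALSE]
no_overlap_rows
```

Each row of `no_overlap_rows` holds the row numbers in `df` of one valid selection. On the example data it includes rows 1, 11 and 17, which is the combination shown above.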
For the next part, do something similar (this could be broken out into a general function), except that when checking for overlap you pool each combination with the first element of no_ovlp from the previous step:
grp_246 <- df[df$Group %in% c(2, 4, 6), ]
all_combn_246 <- lapply(cross(split(transpose(grp_246), grp_246$Group)), bind_rows)
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
  sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) - 1) / ((ncol(df) - 1) * 6)
})
It is not entirely clear how you want to calculate overlap for this part and compare it with 25%. I counted duplicates and divided by the total number of IDs in the six rows: the number of member columns (8, not counting Group) multiplied by 6 rows, i.e. 48. To see which combinations of groups 2, 4, and 6 could be combined with no_ovlp, you can try the following:
all_combn_246[combn_ovlp_246 < .25]
In my case, I believe none of the combinations met this criterion, although the first with 37.5% overlap was the minimum:
R> all_combn_246[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 25 37 8 38 40 124 32 56
2 4 125 15 29 4 48 5 128 11
3 6 34 23 33 32 63 22 19 56
What was unclear is how to count duplicates. For example, how much overlap is there in x <- c(1, 2, 3, 3, 3)?
This could be two duplicates (two extra 3's):
R> sum(table(x) - 1)
[1] 2
Or you could count number of values that have any duplicates (just the number 3 is duplicated):
R> sum(table(x) > 1)
[1] 1
If it is the latter, you could try:
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
  sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) > 1) / ((ncol(df) - 1) * 6)
})
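The two counting conventions can be put side by side in one small helper. This is a sketch only: `overlap_share` and its arguments `denom` and `distinct` are hypothetical names, and the two ID vectors are the no-overlap selection and the first group 2/4/6 combination from the example above.

```r
# Hypothetical helper: overlap among pooled IDs, expressed as a share of `denom`
#   distinct = FALSE counts every extra copy of an ID
#   distinct = TRUE  counts each duplicated ID once
overlap_share <- function(ids, denom = length(ids), distinct = FALSE) {
  counts <- table(ids)
  n_dup <- if (distinct) sum(counts > 1) else sum(counts - 1)
  n_dup / denom
}

x <- c(1, 2, 3, 3, 3)
overlap_share(x, denom = 1)                   # 2: two extra 3's
overlap_share(x, denom = 1, distinct = TRUE)  # 1: only the value 3 is duplicated

# IDs of the selected communities from groups 1, 3, 5 ...
selected <- c(125, 8, 40, 127, 19, 33, 29, 3,
              11, 25, 126, 22, 56, 4, 6, 52,
              5, 63, 18, 48, 37, 32, 43, 1)
# ... and of the first combination from groups 2, 4, 6
candidate <- c(25, 37, 8, 38, 40, 124, 32, 56,
               125, 15, 29, 4, 48, 5, 128, 11,
               34, 23, 33, 32, 63, 22, 19, 56)

overlap_share(c(selected, candidate))  # 18 / 48 = 0.375, the 37.5% quoted above
```

Which convention and which denominator are appropriate depends on what "25% overlap" is meant to measure, so both are left as arguments here rather than fixed.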
By shamelessly stealing Ben's use of cross(), I have this approach that I personally find easier to read:
library(dplyr)
library(tidyr)
library(purrr)
# Returns the number of overlapping elements
overlap <- function(xx){
  length(unlist(xx)) - length(unique(unlist(xx)))
}
df_135 <- df %>%
  as_tibble() %>%
  filter(Group %in% c(1,3,5)) %>%
  group_by(Group) %>%
  # label each community within its group, e.g. "g1_1"
  mutate(Community = paste0("g", Group, "_", row_number())) %>%
  nest(Members = starts_with("Isol_")) %>%
  mutate(Members = map(Members, as.integer))
df_135
# A tibble: 12 x 3
# Groups: Group [3]
# Group Community Members
# <dbl> <chr> <list>
# 1 1 g1_1 <int [8]>
# 2 1 g1_2 <int [8]>
# 3 1 g1_3 <int [8]>
# 4 1 g1_4 <int [8]>
# 5 3 g3_1 <int [8]>
# 6 3 g3_2 <int [8]>
# 7 3 g3_3 <int [8]>
# 8 3 g3_4 <int [8]>
# 9 5 g5_1 <int [8]>
#10 5 g5_2 <int [8]>
#11 5 g5_3 <int [8]>
#12 5 g5_4 <int [8]>
# Compute all combinations across groups
all_combns <- cross(split(df_135$Members, df_135$Group))
# select the combinations with the desired overlap
all_combns[map_int(all_combns, overlap) == 0]
# [[1]]
# [[1]]$`1`
# [1] 125 8 40 127 19 33 29 3
#
# [[1]]$`3`
# [1] 11 25 126 22 56 4 6 52
#
# [[1]]$`5`
# [1] 5 63 18 48 37 32 43 1
Here's a plain R solution. It's not the most efficient one, but it's very straightforward and therefore easy to follow.
The code below collects all the values in the first set of groups (1, 3, 5) and in the second set (2, 4, 6), and samples n isolates from each pool. It then tests whether the overlap exceeds the threshold and resamples from the second set if necessary. In the case of your request, it only needs to resample once or twice, but if your threshold is lower (e.g. 0.05), it may resample up to 50 times before it gets it right. If your threshold is too low and your number of samples too large (i.e. it is impossible to draw such a sample), it will report that it failed.
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
df = as.data.frame(df)
subset1 <- df[df$Group %in% c(1,3,5),]
subset2 <- df[df$Group %in% c(2,4,6),]
values_in_subset1 <- subset1[2:ncol(subset1)] # Drop group column
values_in_subset1 <- as.vector(t(values_in_subset1)) # Convert to single vector
values_in_subset2 <- subset2[2:ncol(subset2)] # Drop group column
values_in_subset2 <- as.vector(t(values_in_subset2)) # Convert to single vector
n_sampled <- 8
sample1 <- sample(values_in_subset1, n_sampled, replace=F) #Replace=F is default, added here for readability
sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
min_percentage_overlap <- 0.25
retries <- 1
# Retry until it gets it right
while(percentage_overlap > min_percentage_overlap && retries < 1000)
{
  retries <- retries + 1
  sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
  percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
}
# Report on number of attempts
cat(paste("Sampled", retries, "times to make sure there was less than", min_percentage_overlap*100,"% overlap."))
# Finally, check if it worked.
if(percentage_overlap <= min_percentage_overlap){
  cat("It's super effective! (not really though)")
} else {
  cat("But it failed!")
}
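The retry loop above draws individual isolates. If whole communities (rows) should be drawn instead, as the question describes, the same retry idea could be sketched as follows. `pick_communities` and its arguments are hypothetical names, the overlap measure (extra copies divided by the number of pooled IDs) is an assumption, and the threshold of 0.5 is chosen only so that the demonstration succeeds.

```r
# Example data from the question, stored as a data frame
df <- data.frame(
  Group  = rep(1:6, each = 4),
  Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1),
  Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4),
  Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11),
  Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5),
  Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52),
  Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23),
  Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18),
  Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
)

# Hypothetical helper: draw one row per group until the pooled overlap with
# `selected_ids` is at most `max_overlap`; returns NULL after `max_tries` failures
pick_communities <- function(df, groups, selected_ids, max_overlap, max_tries = 1000) {
  members <- as.matrix(df[, -1])
  for (try in seq_len(max_tries)) {
    # one random row number from each requested group
    rows <- vapply(groups, function(g) {
      idx <- which(df$Group == g)
      idx[sample.int(length(idx), 1)]
    }, integer(1))
    ids <- c(selected_ids, as.vector(members[rows, ]))
    share <- (length(ids) - length(unique(ids))) / length(ids)
    if (share <= max_overlap) {
      return(list(rows = rows, overlap = share, tries = try))
    }
  }
  NULL
}

# Rows 1, 11 and 17 are a no-overlap selection from groups 1, 3 and 5
selected_ids <- unlist(df[c(1, 11, 17), -1])

set.seed(1)
res <- pick_communities(df, c(2, 4, 6), selected_ids, max_overlap = 0.5)
res
```

With only 64 possible combinations, enumerating them all is cheaper than sampling; the retry approach mainly pays off when the number of groups or communities makes full enumeration impractical.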
For instance I have a vector:
x <- c(6, 22, 18, 5, 19, 14, 17, 88,30, 0, -1, 2, 3)
How do I check for values less than 17 and change those values to value + some constant?
Something like this:
x[x < 17] <- lesser value + 18
expected output:
c(24, 22, 18, 23, 19, 32, 17, 88,30, 18, 17, 20, 21)
Here is an example:
x[x<17] <- x[x<17]+18
x <- ifelse(x < 17, x + 18, x)
This is another solution: if x is less than 17, add 18; otherwise take x as it is.
Another option is case_when
library(dplyr)
case_when(x < 17 ~ x + 18, TRUE ~ x)
#[1] 24 22 18 23 19 32 17 88 30 18 17 20 21
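As a quick check, the base R variants can be compared on the vector from the question. This is a small sketch; the names `below`, `y`, `z` and `w` are illustrative, and the last variant (multiplying by the logical mask) is an extra option not mentioned above.

```r
x <- c(6, 22, 18, 5, 19, 14, 17, 88, 30, 0, -1, 2, 3)
expected <- c(24, 22, 18, 23, 19, 32, 17, 88, 30, 18, 17, 20, 21)

below <- x < 17  # evaluate the condition once, before anything is modified

# 1. Subset assignment: only the selected elements are shifted
y <- x
y[below] <- y[below] + 18

# 2. ifelse(): element-wise choice between x + 18 and x
z <- ifelse(below, x + 18, x)

# 3. Arithmetic on the logical mask (TRUE counts as 1, FALSE as 0)
w <- x + 18 * below

identical(y, expected)
identical(z, expected)
identical(w, expected)
```

Note that 17 itself is left unchanged, because the condition is strictly less than 17.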
For example I have a vector like this :
a <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)
and I want to do this:
Step 1:
Partition into equal-frequency (equi-depth)
Bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step2:
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Output :
9,9,9,9,23,23,23,23,29,29,29,29
We can create groups by splitting a into consecutive bins of equal size and use ave to calculate the mean in each group, which we then round.
bin_size <- 4  # number of values per bin
round(ave(a, rep(seq_along(a), each = bin_size, length.out = length(a))))
#[1] 9 9 9 9 23 23 23 23 29 29 29 29
PS -
ave uses mean as its default function, so it does not need to be passed explicitly.
Try this (it takes the built-in Orange$age variable as input; 10 is the bin size):
v=split(Orange$age, ceiling(seq_along(Orange$age)/10))
lapply(v, function(item){rep(mean(item), length(item))})
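Both answers can be folded into one small function. This is a sketch under the assumption that the input should be sorted first (equal-frequency binning is defined on ordered values); `smooth_by_bin_means` is a hypothetical name.

```r
# Hypothetical helper: equal-frequency binning followed by smoothing by bin means
smooth_by_bin_means <- function(v, bin_size) {
  v <- sort(v)                                # equi-depth bins need ordered values
  bins <- ceiling(seq_along(v) / bin_size)    # 1,1,1,1,2,2,2,2,... for bin_size = 4
  ave(v, bins)                                # ave() applies mean() within each bin
}

a <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)
smoothed <- smooth_by_bin_means(a, bin_size = 4)
smoothed         # bin means 9, 22.75 and 29.25, each repeated 4 times
round(smoothed)  # 9 9 9 9 23 23 23 23 29 29 29 29
```

If the length of the vector is not a multiple of `bin_size`, the last bin is simply shorter.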
I have a data.table xSet with multiple columns. I need a new table with a moving 4 row average for each column individually.
We could use rollapplyr from zoo
library(zoo)
library(dplyr)
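# funs() is deprecated since dplyr 0.8.0; across() is the current equivalent
df1 %>%
  mutate(across(everything(), ~ rollapplyr(.x, width = 4, FUN = mean, partial = TRUE),
                .names = "{.col}_New"))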
Or similar option with data.table
library(data.table)
setDT(df1)[, paste0("New", names(df1)) := lapply(.SD,
function(x) rollapplyr(x, FUN = mean, width = 4, partial = TRUE))]
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:9, 3 * 15, replace = TRUE),
ncol = 3, dimnames = list(NULL, paste0("Col", 1:3))))
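For readers who prefer not to depend on zoo, a right-aligned moving average with partial windows can be sketched in base R. `roll_mean_right` and the small `toy` data frame are made up for illustration and are not part of the answers above.

```r
# Hypothetical helper: mean of the current value and up to (width - 1) preceding
# values; the first rows use a shorter (partial) window
roll_mean_right <- function(v, width = 4) {
  vapply(seq_along(v), function(i) {
    mean(v[max(1, i - width + 1):i])
  }, numeric(1))
}

# Small illustrative data set
toy <- data.frame(Col1 = 1:6, Col2 = c(2, 4, 6, 8, 10, 12))

# Apply the helper to every column and keep the originals alongside
rolled <- as.data.frame(lapply(toy, roll_mean_right))
names(rolled) <- paste0(names(toy), "_New")
cbind(toy, rolled)
```

This loop is fine for small tables; for long series the zoo or data.table versions are likely to be considerably faster.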
The answers by akrun and G. Grothendieck call the rollapplyr() function, which uses a right-aligned window.
However, this contrasts with the definition the OP has shown in the image.
This can be visualised by creating some suitable input data and using toString() instead of mean() as the aggregation function:
library(data.table)
# create suitable input data
DT <- data.table(col1 = 1:15, col2 = 21:35, col3 = 41:55)
DT[, cbind(.SD, New = zoo::rollapplyr(.SD, 4, toString, partial = TRUE))]
col1 col2 col3 New.col1 New.col2 New.col3
1: 1 21 41 1 21 41
2: 2 22 42 1, 2 21, 22 41, 42
3: 3 23 43 1, 2, 3 21, 22, 23 41, 42, 43
4: 4 24 44 1, 2, 3, 4 21, 22, 23, 24 41, 42, 43, 44
5: 5 25 45 2, 3, 4, 5 22, 23, 24, 25 42, 43, 44, 45
6: 6 26 46 3, 4, 5, 6 23, 24, 25, 26 43, 44, 45, 46
7: 7 27 47 4, 5, 6, 7 24, 25, 26, 27 44, 45, 46, 47
8: 8 28 48 5, 6, 7, 8 25, 26, 27, 28 45, 46, 47, 48
9: 9 29 49 6, 7, 8, 9 26, 27, 28, 29 46, 47, 48, 49
10: 10 30 50 7, 8, 9, 10 27, 28, 29, 30 47, 48, 49, 50
11: 11 31 51 8, 9, 10, 11 28, 29, 30, 31 48, 49, 50, 51
12: 12 32 52 9, 10, 11, 12 29, 30, 31, 32 49, 50, 51, 52
13: 13 33 53 10, 11, 12, 13 30, 31, 32, 33 50, 51, 52, 53
14: 14 34 54 11, 12, 13, 14 31, 32, 33, 34 51, 52, 53, 54
15: 15 35 55 12, 13, 14, 15 32, 33, 34, 35 52, 53, 54, 55
col1 is equal to the row number, so New.col1 shows the indices of the rows involved in each rollapplyr() window.
Compared to OP's image, only rows 1 and 2 match. Apparently, a right-aligned window does not meet OP's definition.
We can compare OP's requirement with the other alignment options for rolling windows:
DT <- data.table(col1 = 1:15, col2 = 21:35, col3 = 41:55)
align_window <- c("center", "left", "right")
DT[, (align_window) := lapply(align_window,
function(x) zoo::rollapply(
col1, 4, toString, partial = TRUE, align = x))]
# add OP's definition from image
DT[1:2, OP := right][3, OP := toString(2:4)][4:15, OP := center][]
col1 col2 col3 center left right OP
1: 1 21 41 1, 2, 3 1, 2, 3, 4 1 1
2: 2 22 42 1, 2, 3, 4 2, 3, 4, 5 1, 2 1, 2
3: 3 23 43 2, 3, 4, 5 3, 4, 5, 6 1, 2, 3 2, 3, 4
4: 4 24 44 3, 4, 5, 6 4, 5, 6, 7 1, 2, 3, 4 3, 4, 5, 6
5: 5 25 45 4, 5, 6, 7 5, 6, 7, 8 2, 3, 4, 5 4, 5, 6, 7
6: 6 26 46 5, 6, 7, 8 6, 7, 8, 9 3, 4, 5, 6 5, 6, 7, 8
7: 7 27 47 6, 7, 8, 9 7, 8, 9, 10 4, 5, 6, 7 6, 7, 8, 9
8: 8 28 48 7, 8, 9, 10 8, 9, 10, 11 5, 6, 7, 8 7, 8, 9, 10
9: 9 29 49 8, 9, 10, 11 9, 10, 11, 12 6, 7, 8, 9 8, 9, 10, 11
10: 10 30 50 9, 10, 11, 12 10, 11, 12, 13 7, 8, 9, 10 9, 10, 11, 12
11: 11 31 51 10, 11, 12, 13 11, 12, 13, 14 8, 9, 10, 11 10, 11, 12, 13
12: 12 32 52 11, 12, 13, 14 12, 13, 14, 15 9, 10, 11, 12 11, 12, 13, 14
13: 13 33 53 12, 13, 14, 15 13, 14, 15 10, 11, 12, 13 12, 13, 14, 15
14: 14 34 54 13, 14, 15 14, 15 11, 12, 13, 14 13, 14, 15
15: 15 35 55 14, 15 15 12, 13, 14, 15 14, 15
None of the alignment options completely meets OP's definition. "center" is the best match, differing only in the first 3 rows.
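To make the three alignments concrete, the row indices of each window can be reproduced in base R. This is a sketch: `window_idx` is a hypothetical helper, and the offsets for "center" are chosen to reproduce the table above for a width of 4 rather than taken from zoo's documentation.

```r
# Hypothetical helper: row indices covered by the window at row i of n rows
window_idx <- function(i, n, width = 4, align = c("right", "center", "left")) {
  align <- match.arg(align)
  offsets <- switch(align,
    right  = seq(-(width - 1), 0),   # current row and the rows before it
    left   = seq(0, width - 1),      # current row and the rows after it
    center = seq(-floor((width - 1) / 2), ceiling((width - 1) / 2))
  )
  idx <- i + offsets
  idx[idx >= 1 & idx <= n]           # partial windows at the edges
}

toString(window_idx(3, 15, align = "right"))    # "1, 2, 3"
toString(window_idx(3, 15, align = "center"))   # "2, 3, 4, 5"
toString(window_idx(13, 15, align = "left"))    # "13, 14, 15"

# All windows for one alignment
vapply(1:15, function(i) toString(window_idx(i, 15, align = "center")), character(1))
```

A custom rule, such as the OP's, could then be written as a function that returns the desired indices per row and passed to any loop or apply call.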
Summary (tldr)
I need to perform a rolling regression on an irregular time series (i.e. the interval may not be constant and can go from 0, 1, 2, 3... to ...7, 20, 24, 28...). The series is simply numeric and does not necessarily require date/time, but the rolling window needs to be defined by time. So if I have a time series that is irregularly sampled for 600 seconds and the window is 30, the regression is performed over every 30 seconds, not every 30 samples.
I've read examples, and while I could replicate doing rolling sums and medians by time, I can't seem to figure it out for regression.
The problem
First of all, I have read some of the other questions with regards to performing rolling functions on irregular time series data, such as this: optimized rolling functions on irregular time series with time-based window, and this: Rolling window over irregular time series.
The issue is that the examples provided so far are for simple functions like sum or median, and I have not yet figured out how to perform a simple rolling regression, i.e. using lm, under the same constraint that the window is based on an irregular time series. Also, my time series is much, much simpler; no date is necessary, it's simply time "elapsed".
Anyway, getting this right is important to me because irregular time (for example, a skip in the time interval) may give an over- or underestimate of the coefficients in the rolling regression, as the sample window will include additional time.
So I was wondering if anyone can help me with creating a function that does this in the simplest way? The dataset is based on measuring a variable over time i.e. 2 variables: time, and response. Time is measured every x time elapsed units (seconds, minutes, so not date/time formatted), but once in a while it becomes irregular.
For every row, the function should perform a linear regression based on a width of n time units. The width should never exceed n units, but may be floored (i.e. reduced) to accommodate irregular time sampling. So for example, if the width is specified as 20 seconds but time is sampled every 6 seconds, then the window will be rounded to 18, not 24 seconds.
I have looked at the question here: How to calculate the average slope within a moving window in R, and I tested that code on an irregular time series, but it looks like it's based on regular time series.
Sample data:
sample <-
structure(list(x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 47, 48,
49), y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42,
41, 40, 41, 40, 39, 38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32,
31, 30, 29, 28, 29, 28, 27, 26, 25, 26, 25, 24, 23, 22, 21, 20,
19)), .Names = c("x", "y"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -46L))
My current code (based on a previous question I referred to). I know it's not subsetting by time:
library(zoo)
clm <- function(z) coef(lm(y ~ x, as.data.frame(z)))
rollme <- rollapplyr(zoo(sample), 10, clm, by.column = F, fill = NA)
The expected output (manually calculated) is below. The output is different from a regular rolling regression -- the numbers are different as soon as the time interval skips at 29 (secs):
NA
NA
NA
NA
NA
NA
NA
NA
NA
-0.696969697
-0.6
-0.551515152
-0.551515152
-0.6
-0.696969697
-0.6
-0.551515152
-0.551515152
-0.6
-0.696969697
-0.6
-0.551515152
-0.551515152
-0.6
-0.696969697
-0.6
-0.551515152
-0.551515152
-0.6
-0.696969697
-0.605042017
-0.638888889
-0.716981132
-0.597560976
-0.528301887
-0.5
-0.521008403
-0.642857143
-0.566666667
-0.551515152
-0.551515152
-0.6
-0.696969697
-0.605042017
-0.638888889
-0.716981132
I hope I'm providing enough information, but let me know if not (or point me to a good example somewhere that I could try).
Other things I have tried:
I've tried converting the time to POSIXct format but I don't know how to perform lm on that:
require(lubridate)
x <- as.POSIXct(strptime(sample$x, format = "%S"))
Update : Added tldr section.
Try this:
# time interval is 1
sz=10
pl2=list()
for ( i in 1:nrow(sample)){
  if (i<sz) period=sz else
    period=length(sample$x[sample$x>(sample$x[i]-sz) & sample$x<=sample$x[i]])-1
  pl2[[i]]=seq(-period,0)
}
#update for time interval > 1
sz=10
tint=1
pl2=list()
for ( i in 1:nrow(sample)){
  if (i<sz) period=sz else
    period=length(sample$x[sample$x>(sample$x[i]-sz*tint) & sample$x<=sample$x[i]])-1
  pl2[[i]]=seq(-period,0)
}
rollme3 <- rollapplyr(zoo(sample), pl2, clm, by.column = F, fill = NA)
> tail(rollme3)
(Intercept) x
41 47.38182 -0.5515152
42 49.20000 -0.6000000
43 53.03030 -0.6969697
44 49.26050 -0.6050420
45 50.72222 -0.6388889
46 54.22642 -0.7169811
For the sake of completeness, here is an answer which uses data.table to aggregate in a non-equi join.
Although there are many similar questions, e.g., r calculating rolling average with window based on value (not number of rows or date/time variable), this question deserves an answer of its own, as the OP is looking for the coefficients of a rolling regression.
library(data.table)
ws <- 10 # size of sliding window in time units
setDT(sample)[.(start = x - ws, end = x), on = .(x > start, x <= end),
as.list(coef(lm(y ~ x.x))), by = .EACHI]
x x (Intercept) x.x
1: -10 0 50.00000 NA
2: -9 1 50.00000 -1.0000000
3: -8 2 50.00000 -1.0000000
4: -7 3 50.00000 -1.0000000
5: -6 4 50.00000 -1.0000000
6: -5 5 49.61905 -0.7142857
7: -4 6 49.50000 -0.6428571
8: -3 7 49.50000 -0.6428571
9: -2 8 49.55556 -0.6666667
10: -1 9 49.63636 -0.6969697
11: 0 10 49.20000 -0.6000000
12: 1 11 48.88485 -0.5515152
13: 2 12 48.83636 -0.5515152
14: 3 13 49.20000 -0.6000000
15: 4 14 50.12121 -0.6969697
16: 5 15 49.20000 -0.6000000
17: 6 16 48.64242 -0.5515152
18: 7 17 48.59394 -0.5515152
19: 8 18 49.20000 -0.6000000
20: 9 19 50.60606 -0.6969697
21: 10 20 49.20000 -0.6000000
22: 11 21 48.40000 -0.5515152
23: 12 22 48.35152 -0.5515152
24: 13 23 49.20000 -0.6000000
25: 14 24 51.09091 -0.6969697
26: 15 25 49.20000 -0.6000000
27: 16 26 48.15758 -0.5515152
28: 17 27 48.10909 -0.5515152
29: 18 28 49.20000 -0.6000000
30: 19 29 51.57576 -0.6969697
31: 22 32 49.18487 -0.6050420
32: 23 33 50.13889 -0.6388889
33: 24 34 52.47170 -0.7169811
34: 25 35 48.97561 -0.5975610
35: 26 36 46.77358 -0.5283019
36: 27 37 45.75000 -0.5000000
37: 28 38 46.34454 -0.5210084
38: 29 39 50.57143 -0.6428571
39: 30 40 47.95556 -0.5666667
40: 31 41 47.43030 -0.5515152
41: 32 42 47.38182 -0.5515152
42: 33 43 49.20000 -0.6000000
43: 34 44 53.03030 -0.6969697
44: 37 47 49.26050 -0.6050420
45: 38 48 50.72222 -0.6388889
46: 39 49 54.22642 -0.7169811
x x (Intercept) x.x
Please note that rows 10 to 30 where the time series is regularly spaced are identical to OP's rollme.
The call to as.list() forces the result of coef(lm(...)) to appear in separate columns.
The code above uses a right aligned rolling window. However, the code can be easily adapted to support a left aligned window as well:
# left aligned window
setDT(sample)[.(start = x, end = x + ws), on = .(x >= start, x < end),
as.list(coef(lm(y ~ x.x))), by = .EACHI]
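The same time-based, right-aligned window can also be sketched in base R without a join. `roll_slope_by_time`, `sample_ts` and the `min_obs` argument are hypothetical names; the data are the sample from the question.

```r
# Sample data from the question
sample_ts <- data.frame(
  x = c(0:29, 32:44, 47:49),
  y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 41, 40, 41, 40, 39,
        38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 31, 30, 29, 28, 29, 28, 27,
        26, 25, 26, 25, 24, 23, 22, 21, 20, 19)
)

# Hypothetical helper: slope of y ~ x over the time window (x[i] - ws, x[i]]
# for every observation i; NA if the window holds fewer than `min_obs` points
roll_slope_by_time <- function(x, y, ws, min_obs = 2) {
  vapply(seq_along(x), function(i) {
    in_win <- x > x[i] - ws & x <= x[i]   # select by elapsed time, not by row count
    if (sum(in_win) < min_obs) return(NA_real_)
    unname(coef(lm(y[in_win] ~ x[in_win]))[2])
  }, numeric(1))
}

slopes <- roll_slope_by_time(sample_ts$x, sample_ts$y, ws = 10)
tail(slopes, 3)
```

The last three values should agree with the x.x column of the data.table result above. Setting `min_obs = 10` is one possible way to return NA while the window is still filling up, although it would also blank out windows that contain a gap.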
With runner one can apply any R function to an irregular time series. The user passes the data to the x argument and a vector of dates to the idx argument (to make the windows time dependent). The window width k can be an integer (k = 30) or a character string as in seq.POSIXt (k = "30 secs").
The first example shows how to obtain both parameters from the lm function; the output will be a matrix:
library(runner)
runner(
  x = sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))
  }
)
Or one can execute runner separately for each parameter
library(runner)
sample$intercept <- runner(
  sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))[1]
  }
)
sample$slope <- runner(
  sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))[2]
  }
)
head(sample, 15)
# datetime x y intercept slope
# 1 2020-04-13 09:27:20 0 50 50.00000 NA
# 2 2020-04-13 09:27:21 1 49 50.00000 -1.0000000
# 3 2020-04-13 09:27:25 2 48 50.00000 -1.0000000
# 4 2020-04-13 09:27:29 3 47 50.00000 -1.0000000
# 5 2020-04-13 09:27:29 4 46 50.00000 -1.0000000
# 6 2020-04-13 09:27:32 5 47 49.61905 -0.7142857
# 7 2020-04-13 09:27:34 6 46 49.50000 -0.6428571
# 8 2020-04-13 09:27:38 7 45 49.50000 -0.6428571
# 9 2020-04-13 09:27:38 8 44 49.55556 -0.6666667
# 10 2020-04-13 09:27:41 9 43 49.63636 -0.6969697
# 11 2020-04-13 09:27:44 10 44 49.45455 -0.6363636
# 12 2020-04-13 09:27:47 11 43 49.38462 -0.6153846
# 13 2020-04-13 09:27:48 12 42 49.38462 -0.6153846
# 14 2020-04-13 09:27:49 13 41 49.42857 -0.6263736
# 15 2020-04-13 09:27:50 14 40 49.34066 -0.6263736
Data with datetime column
sample <- structure(
list(
# cumsum() turns the gaps into increasing timestamps, as in the output above
datetime = cumsum(c(3, 1, 4, 4, 0, 3, 2, 4, 0, 3, 3, 3, 1, 1, 1, 3, 0, 2, 4, 2, 2,
3, 0, 1, 2, 4, 0, 1, 4, 4, 1, 2, 1, 3, 0, 4, 4, 1, 3, 0, 0, 2,
1, 0, 2, 0)) + Sys.time(),
x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 47, 48, 49),
y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 41, 40, 41, 40, 39,
38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 31, 30, 29, 28, 29, 28, 27,
26, 25, 26, 25, 24, 23, 22, 21, 20,19)
),
.Names = c("datetime", "x", "y"),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -46L)
)