Creating a One-Day-Ahead Roll-Forward Forecast in R

I am attempting to compute the predictive measures for roll-forward one-day-ahead forecasts and plot the results.
I have a few questions about my example code below:
1. What can I do to resolve the "replacement has length zero" error shown below?
2. How can I output an array of the prediction values with their respective dates?
3. How can I plot the data and the predictions on one graph?
This is the error received upon execution of the script:
Error in error[j - fixed.nTrain + 1] <- valid.ts - naive.pred$mean[stepsAhead] :
replacement has length zero
The reproducible example is as follows:
d <- structure(list(Date = structure(c(17349, 17350, 17351, 17352,
17353, 17354, 17355, 17356, 17357, 17358, 17359, 17360, 17361,
17362, 17363, 17364, 17365, 17366, 17367, 17368, 17369, 17370,
17371, 17372, 17373, 17374, 17375, 17376, 17377, 17378, 17379,
17380, 17381, 17382, 17383), class = "Date"), Ratio = c(67, 50,
67, 50, 100, 50, 33, 67, 0, 0, 0, 0, 100, 75, 0, 0, 75, 100,
67, 33, 33, 33, 50, 50, 67, 100, 67, 50, 25, 25, 33, 33, 100,
33, 0)), .Names = c("Date", "Ratio"), row.names = 183:217, class = "data.frame")
library(xts)
dates = as.Date(d$Date,"%Y-%m-%d")
xs = xts(d$Ratio,dates)
library("forecast")
fixed.nValid <- 6
fixed.nTrain <- length(xs) - fixed.nValid
stepsAhead <- 2
error <- rep(0, fixed.nValid - stepsAhead + 1)
percent.error <- rep(0, fixed.nValid - stepsAhead + 1)
predictions <- rep(0, fixed.nValid - stepsAhead + 1)
for (j in fixed.nTrain:(fixed.nTrain + fixed.nValid - stepsAhead)) {
  train.ts <- window(xs, start = as.Date("2017-07-02"), end = as.Date("2017-07-02") + j)
  valid.ts <- window(xs, start = as.Date("2017-07-02") + j + stepsAhead, end = as.Date("2017-07-02") + j + stepsAhead)
  naive.pred <- naive(train.ts, h = stepsAhead)
  error[j - fixed.nTrain + 1] <- valid.ts - naive.pred$mean[stepsAhead]
  percent.error[j - fixed.nTrain + 1] <- error[j - fixed.nTrain + 1] / valid.ts
}
mean(abs(error))
sqrt(mean(error^2))
mean(abs(percent.error))
Thank you!

The problem is that, when j = 33 in your for-loop, the value of
as.Date("2017-07-02") + j + stepsAhead
is "2017-08-06", which is later than the latest date in xs. This results in valid.ts having zero length, which is causing the error you're seeing.


Labelling function for y-axis ggplot (How to hide labels for negative breaks)

I have this data:
df <- structure(list(date = structure(c(18474, 18832, 18861, 18666,
18597, 18517, 18938, 18611, 18384, 18768, 18633, 18545, 18550,
18577, 18354, 18584, 18821, 18338, 18815, 18591, 18799, 18721,
18945, 18448, 18445, 18765, 18784, 18683, 19012, 18505, 18991,
18346, 18899, 18470, 18514, 18774, 18943, 18834, 18424, 18506,
18844, 18988, 18418, 18951, 18678, 18454, 18552, 19021, 18414,
18958), class = "Date"), value = c(10, 28, 87, 105, 345, 40,
592, 220, 5, 51, 160, 78, 91, 529, 51, 552, 13, 13, 7, 435, 11,
250, 848, 7, 5, 60, 25, 145, 1493, 22, 186, 44, 138, 9, 29, 40,
779, 29, 2, 22, 38, 164, 2, 1058, 126, 7, 98, 2502, 2, 1010)), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
A simple
ggplot(df, aes(date, value)) + geom_col()
gives a basic bar chart of value over date.
However, I'd like to have breaks from -2000 to 2000 every 1000. For the negative values I want to draw the horizontal grid lines, but I do not want to label these breaks on the y-axis. So I do the following:
ggplot(df, aes(date, value)) +
  geom_col() +
  scale_y_continuous(
    limits = c(-3000, 3000),
    breaks = c(-2000, -1000, 0, 1000, 2000),
    labels = function(x) {
      if (x < 0) {
        return("")
      } else {
        return(x)
      }
    }
  )
But this does not work, giving:
Error: Breaks and labels are different lengths
Run rlang::last_error() to see where the error occurred.
In addition: Warning message:
In if (x < 0) { : the condition has length > 1 and only the first element will be used
I thought it would somehow check each break and I could return the label from this function.
I can do this:
ggplot(df, aes(date, value)) +
  geom_col() +
  scale_y_continuous(
    limits = c(-3000, 3000),
    breaks = c(-2000, -1000, 0, 1000, 2000),
    labels = c("", "", 0, 1000, 2000)
  )
This gives the desired output: grid lines at the negative breaks, with no labels on them.
But I do not think hard-coding the labels is the best option.
It's because if isn't vectorized; look at the warning message produced when running the code above:
Warning message:
In if (x < 0) { :
the condition has length > 1 and only the first element will be used
You should use ifelse, which returns a vector the same length as x.
ggplot(df, aes(date, value)) +
  geom_col() +
  scale_y_continuous(
    limits = c(-3000, 3000),
    breaks = c(-2000, -1000, 0, 1000, 2000),
    labels = function(x) ifelse(x < 0, "", x)
  )
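To see why the vectorized version works, a quick standalone check (no ggplot needed):

f <- function(x) ifelse(x < 0, "", x)
f(c(-2000, -1000, 0, 1000, 2000))
# [1] ""     ""     "0"    "1000" "2000"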
Alternatively, you can use replace:
ggplot(df, aes(date, value)) +
  geom_col() +
  scale_y_continuous(
    limits = c(-3000, 3000),
    breaks = seq(-2000, 2000, 1000),
    labels = function(x) replace(x, which(x < 0), "")
  )

Summary code that counts the number of values less than specified numbers

I have a simple dataset (that I've titled 'summary') that includes a numeric column of values. I want to create code to summarize the number of rows less than specific values, such as 5, 10, 20, 30, etc.
Here is some of the data:
dput(summary[1:50,])
structure(list(S2S_Mins = c(NA, 101.15, 107.43, 205.5, 48.07,
34.9, 195.05, 17.58, 41.63, 74.27, 21.05, 32.27, 51.18, 17.88,
32.52, 26.98, 32.03, 40.03, 50.73, 54.38, 33.17, 19.97, 23.57,
41.82, 17.7, 20.9, 24.65, 16.48, 27.97, 94.47, 23.13, 22.63,
25.5, 43.8, 46.47, 33.98, 17.28, 27.57, 45.58, 34.52, 32.75,
35.92, 28.62, 17.48, 40.55, 38.8, 34.97, 41.95, 36.88, 21.58)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -50L))
I can go through and count the number of rows like this:
sum(summary$S2S_Mins < 5, na.rm = TRUE)
sum(summary$S2S_Mins < 10, na.rm = TRUE)
sum(summary$S2S_Mins < 20, na.rm = TRUE)
sum(summary$S2S_Mins < 30, na.rm = TRUE)
sum(summary$S2S_Mins < 60, na.rm = TRUE)
But I would like a summary function (or something similar) that will put this in a table for me, like the following:
TimeCategory Count
Less5 0
Less10 1
Less20 9
Less30 17
Less60 36
I have tried using dplyr with the summarize/summarise function, but I get errors:
#first try - gives a (1 x 0) tibble
summary %>% summarize(Less5 = nrow(S2S_Mins < 5), Less10 = nrow(S2S_Mins < 10))
#second try - gives error saying "unused argument (S2S_Mins < 5)"
summary %>% summarize(Less5 = n(S2S_Mins < 5), Less10 = n(S2S_Mins < 10))
Any pointers would be greatly appreciated. Thanks.
We can use sapply:
v1 <- c(5, 10, 20, 30, 60)
out <- sapply(v1, function(x) sum(summary$S2S_Mins < x, na.rm = TRUE))
names(out) <- paste0("Less", v1)
stack(out)[2:1]
Output:
ind values
1 Less5 0
2 Less10 0
3 Less20 7
4 Less30 19
5 Less60 43
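For reference, the dplyr attempts in the question fail because nrow() expects a data frame (a logical vector has no rows) and n() takes no arguments. A sketch of a working summarise() call uses sum() over each logical comparison instead:

library(dplyr)
summary %>%
  summarise(Less5  = sum(S2S_Mins < 5,  na.rm = TRUE),
            Less10 = sum(S2S_Mins < 10, na.rm = TRUE),
            Less20 = sum(S2S_Mins < 20, na.rm = TRUE),
            Less30 = sum(S2S_Mins < 30, na.rm = TRUE),
            Less60 = sum(S2S_Mins < 60, na.rm = TRUE))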

Subsetting and plotting data by TimeStamp

I have a data.frame P1 (5000 rows x 4 cols) and would like to save the subset of data in columns 2, 3 and 4 when the time-stamp in column 1 falls into a set range determined by a vector TimeStamp (in seconds).
E.g. put all values in columns 2, 3, and 4 into a new data.frame and call each section of data Condition.1.P1, Condition.2.P1, etc.
The reason I'd like to label each section separately is that I have 35 versions of P1 (P2, P3, P33, etc.) and need to be able to melt them together to plot them.
dput(TimeStamp)
c(18, 138, 438, 678, 798, 1278, 1578, 1878, 2178)
dput(head(P1))
structure(list(Time = c(0, 5, 100, 200, 500, 1200), SkinTemp = c(27.781,
27.78, 27.779, 27.779, 27.778, 27.777), HeartRate = c(70, 70,
70, 70, 70, 70), RespirationRate = c(10, 10, 10, 10, 10, 10)), .Names = c("Time",
"SkinTemp", "HeartRate", "RespirationRate"), row.names = c(NA,
6L), class = "data.frame")
Do you want to separate the data by the timestamp range and put it in a list? Then this might be what you are looking for:
TimeStamp <- c(18, 138, 438, 678, 798, 1278, 1578, 1878, 2178)
dat <- structure(list(Time = c(0, 5, 100, 200, 500, 1200), SkinTemp = c(27.781,
27.78, 27.779, 27.779, 27.778, 27.777), HeartRate = c(70, 70,
70, 70, 70, 70), RespirationRate = c(10, 10, 10, 10, 10, 10)), .Names = c("Time",
"SkinTemp", "HeartRate", "RespirationRate"), row.names = c(NA,
6L), class = "data.frame")
dat$Segment <- cut(dat$Time,c(-Inf,TimeStamp))
split(dat,dat$Segment)
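To get the Condition.1.P1-style names the question asks for, a possible follow-up (the naming scheme is my assumption, based on the question) is to keep only columns 2 to 4 and name the list elements:

segs <- split(dat[, c("SkinTemp", "HeartRate", "RespirationRate")], dat$Segment)
names(segs) <- paste0("Condition.", seq_along(segs), ".P1")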
Alternatively, you can build one labelled data.frame with a loop:
P2 = data.frame(NA, NA, NA, NA) # Create empty data.frame
for (i in 1:length(TimeStamp)) {
  P3 = data.frame() # Create empty changing data.frame
  if (i == 1) {ts1 = 0} else {ts1 = TimeStamp[i-1]} # First time stamp starts at 0
  ts2 = TimeStamp[i]
  P3 = subset(P1, P1$Time > ts1 & P1$Time < ts2)[, c(2, 3, 4)] # Subset the columns and assign to P3
  if (nrow(P3) == 0) {P3 = data.frame(NA, NA, NA)} # If the subset is empty, assign NA
  P3$TimeStamp = paste(ts1, ts2, sep = "-") # Append the TimeStamp range to P3
  colnames(P3) = colnames(P2) # Make sure column names are the same to allow rbind
  P2 = rbind(P2, P3) # Append P3 to P2
}
P2 = P2[c(2:nrow(P2)), ] # Remove the first row (that has NA)
colnames(P2) = c("SkinTemp", "HeartRate", "RespirationRate", "TimeStamp") # Provide column names
rm(P3); rm(i); rm(ts1); rm(ts2) # Cleanup

Compute and save the r-squared value of bootstrap objects in a new dataframe in R

I have a dataframe df
dput(df)
structure(list(x = c(49, 50, 51, 52, 53, 54, 55, 56, 1, 2, 3,
4, 5, 14, 15, 16, 17, 2, 3, 4, 5, 6, 10, 11, 3, 30, 64, 66, 67,
68, 69, 34, 35, 37, 39, 2, 17, 18, 99, 100, 102, 103, 67, 70,
72), y = c(2268.14043972082, 2147.62290922552, 2269.1387550775,
2247.31983098201, 1903.39138268307, 2174.78291538358, 2359.51909126411,
2488.39004804939, 212.851575751527, 461.398994384333, 567.150629704352,
781.775113821961, 918.303706148872, 1107.37695799186, 1160.80594193377,
1412.61328924168, 1689.48879626486, 260.737164468854, 306.72700499362,
283.410379620422, 366.813913489692, 387.570173754128, 388.602676983443,
477.858510450125, 128.198042456082, 535.519377609133, 1028.8780498564,
1098.54431357711, 1265.26965941035, 1129.58344809909, 820.922447928053,
749.343583476846, 779.678206156474, 646.575242339517, 733.953282899613,
461.156280127354, 906.813018662913, 798.186995701282, 831.365377249207,
764.519073183124, 672.076289062505, 669.879217186302, 1341.47673353751,
1401.44881976186, 1640.27575962036)), .Names = c("x", "y"), row.names = c(NA,
-45L), class = "data.frame")
I have created two non-linear regressions (nls1 and nls2) based on my dataset.
library(stats)
nls1 <- nls(y~A*(x^B)*(exp(k*x)),
data = df,
start = list(A = 1000, B = 0.170, k = -0.00295))
nls2<-nls(y~A*x^3+B*x^2+C*x+D, data=df,
start = list(A=0.02, B=-0.6, C= 50, D=200))
I then computed bootstrap objects for these two functions to get multiple sets of parameters (A, B and k for nls1; A, B, C and D for nls2).
library(nlstools)
Boo1 <- nlsBoot(nls1, niter = 200)
Boo2 <- nlsBoot(nls2, niter = 200)
Based on these bootstrap objects, I would like to compute the r-squared of each combination of parameters and save the min, max and median of the r-squared values for each bootstrap object into one new dataframe. The dataframe could look like new.df:
structure(list(Median = c(NA, NA), Max = c(NA, NA), Min = c(NA,
NA)), .Names = c("Median", "Max", "Min"), row.names = c("nls1",
"nls2"), class = "data.frame")
The idea is then to do some box plots with the median, min and max values for each non-linear model based on bootstrapping to compare them. Can someone help me out with that? Thanks in advance.
Answer from @bunk (note it uses AIC rather than r-squared as the fit statistic, and needs library(boot)):
stat <- function(dat, inds) {
  fit <- try(nls(y ~ A*(x^B)*(exp(k*x)), data = dat[inds, ],
                 start = list(A = 1000, B = 0.170, k = -0.00295)), silent = TRUE)
  f1 <- if (inherits(fit, "nls")) AIC(fit) else NA
  fit2 <- try(nls(y ~ A*x^3 + B*x^2 + C*x + D, data = dat[inds, ],
                  start = list(A = 0.02, B = -0.6, C = 50, D = 200)), silent = TRUE)
  f2 <- if (inherits(fit2, "nls")) AIC(fit2) else NA
  c(f1, f2)
}
library(boot)
res <- boot(df, stat, R = 200)
Then, to get the medians, for example: apply(res$t, 2, median, na.rm = TRUE)
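The question itself asks for r-squared rather than AIC. A sketch of one way to get it, assuming nlsBoot stores one row of bootstrap parameter estimates per iteration in its coefboot component:

# Pseudo-R^2 for each bootstrap draw of nls1: predict with the drawn
# parameters, then compute 1 - SSres/SStot against the observed y.
r2.nls1 <- apply(Boo1$coefboot, 1, function(p) {
  pred <- p["A"] * df$x^p["B"] * exp(p["k"] * df$x)
  1 - sum((df$y - pred)^2) / sum((df$y - mean(df$y))^2)
})
c(Median = median(r2.nls1), Max = max(r2.nls1), Min = min(r2.nls1))

The same pattern with a cubic prediction gives the nls2 row of new.df.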

Data Smoothing in R

This question is related to this one that I asked before. But referring to that question is not necessary to answer this one.
Data
I have a data set containing velocities of 2169 vehicles recorded at intervals of 0.1 seconds. So, there are many rows for an individual vehicle. Here I am reproducing the data only for the vehicle # 2:
> dput(uma)
structure(list(Frame.ID = 13:445, Vehicle.velocity = c(40, 40,
40, 40, 40, 40, 40, 40.02, 40.03, 39.93, 39.61, 39.14, 38.61,
38.28, 38.42, 38.78, 38.92, 38.54, 37.51, 36.34, 35.5, 35.08,
34.96, 34.98, 35, 34.99, 34.98, 35.1, 35.49, 36.2, 37.15, 38.12,
38.76, 38.95, 38.95, 38.99, 39.18, 39.34, 39.2, 38.89, 38.73,
38.88, 39.28, 39.68, 39.94, 40.02, 40, 39.99, 39.99, 39.65, 38.92,
38.52, 38.8, 39.72, 40.76, 41.07, 40.8, 40.59, 40.75, 41.38,
42.37, 43.37, 44.06, 44.29, 44.13, 43.9, 43.92, 44.21, 44.59,
44.87, 44.99, 45.01, 45.01, 45, 45, 45, 44.79, 44.32, 43.98,
43.97, 44.29, 44.76, 45.06, 45.36, 45.92, 46.6, 47.05, 47.05,
46.6, 45.92, 45.36, 45.06, 44.96, 44.97, 44.99, 44.99, 44.99,
44.99, 45.01, 45.02, 44.9, 44.46, 43.62, 42.47, 41.41, 40.72,
40.49, 40.6, 40.76, 40.72, 40.5, 40.38, 40.43, 40.38, 39.83,
38.59, 37.02, 35.73, 35.04, 34.85, 34.91, 34.99, 34.99, 34.97,
34.96, 34.98, 35.07, 35.29, 35.54, 35.67, 35.63, 35.53, 35.53,
35.63, 35.68, 35.55, 35.28, 35.06, 35.09, 35.49, 36.22, 37.08,
37.8, 38.3, 38.73, 39.18, 39.62, 39.83, 39.73, 39.58, 39.57,
39.71, 39.91, 40, 39.98, 39.97, 40.08, 40.38, 40.81, 41.27, 41.69,
42.2, 42.92, 43.77, 44.49, 44.9, 45.03, 45.01, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 44.99, 45.03, 45.26, 45.83, 46.83,
48.2, 49.68, 50.95, 51.83, 52.19, 52, 51.35, 50.38, 49.38, 48.63,
48.15, 47.87, 47.78, 48.01, 48.63, 49.52, 50.39, 50.9, 50.96,
50.68, 50.3, 50.05, 49.94, 49.87, 49.82, 49.82, 49.88, 49.96,
50, 50, 49.98, 49.98, 50.16, 50.64, 51.43, 52.33, 53.01, 53.27,
53.22, 53.25, 53.75, 54.86, 56.36, 57.64, 58.28, 58.29, 57.94,
57.51, 57.07, 56.64, 56.43, 56.73, 57.5, 58.27, 58.55, 58.32,
57.99, 57.89, 57.92, 57.74, 57.12, 56.24, 55.51, 55.1, 54.97,
54.98, 55.02, 55.03, 54.86, 54.3, 53.25, 51.8, 50.36, 49.41,
49.06, 49.17, 49.4, 49.51, 49.52, 49.51, 49.45, 49.24, 48.84,
48.29, 47.74, 47.33, 47.12, 47.06, 47.07, 47.08, 47.05, 47.04,
47.25, 47.68, 47.93, 47.56, 46.31, 44.43, 42.7, 41.56, 41.03,
40.92, 40.92, 40.98, 41.19, 41.45, 41.54, 41.32, 40.85, 40.37,
40.09, 39.99, 39.99, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
39.98, 39.97, 40.1, 40.53, 41.36, 42.52, 43.71, 44.57, 45.01,
45.1, 45.04, 45, 45, 45, 45, 45, 45, 44.98, 44.97, 45.08, 45.39,
45.85, 46.2, 46.28, 46.21, 46.29, 46.74, 47.49, 48.35, 49.11,
49.63, 49.89, 49.94, 49.97, 50.14, 50.44, 50.78, 51.03, 51.12,
51.05, 50.85, 50.56, 50.26, 50.06, 50.1, 50.52, 51.36, 52.5,
53.63, 54.46, 54.9, 55.03, 55.09, 55.23, 55.35, 55.35, 55.23,
55.07, 54.99, 54.98, 54.97, 55.06, 55.37, 55.91, 56.66, 57.42,
58.07, 58.7, 59.24, 59.67, 59.95, 60.02, 60, 60, 60, 60, 60,
60.01, 60.06, 60.23, 60.65, 61.34, 62.17, 62.93, 63.53, 64, 64.41,
64.75, 65.04, 65.3, 65.57, 65.75, 65.74, 65.66, 65.62, 65.71,
65.91, 66.1, 66.26, 66.44, 66.61, 66.78, 66.91, 66.99, 66.91,
66.7, 66.56, 66.6, 66.83, 67.17, 67.45, 67.75, 68.15, 68.64,
69.15, 69.57, 69.79, 69.79, 69.72, 69.72, 69.81, 69.94, 70, 70.01,
70.02, 70.03)), .Names = c("Frame.ID", "Vehicle.velocity"), class = "data.frame", row.names = c(NA,
433L))
Frame.ID is the time frame in which the Vehicle.velocity was observed. There is some noise in the velocity variable and I want to smooth it.
Methodology
To smooth the velocity I am using the following equation:
$$\hat{x}_\alpha(i) = \frac{\sum_{k=i-D}^{i+D} x_\alpha(k)\, e^{-|i-k|/\Delta}}{\sum_{k=i-D}^{i+D} e^{-|i-k|/\Delta}}$$
where,
Delta = 10
Nalpha = number of data points (rows)
i = 1, ..., Nalpha (i.e. the row number)
D = minimum of {i-1, Nalpha - i, 3*Delta = 30}
xalpha = velocity
Question
I have gone through the documentation of filter and convolution in R. It seems that I have to know about convolution to do this. However, I have tried my best and can't understand how convolution works! The linked question has an answer which helped me in understanding some of the inner workings in the function but I am still not sure.
Could anyone here on SO please explain how this thing works? Or guide me to an alternative methodology to achieve the same purpose i.e. apply the equation?
My current code which works but is lengthy
Here is what uma looks like:
> head(uma)
Frame.ID Vehicle.velocity
1 13 40
2 14 40
3 15 40
4 16 40
5 17 40
6 18 40
library(plyr) # for ddply, used below
uma$i <- 1:nrow(uma) # this is i
uma$im1 <- uma$i - 1
uma$Nai <- nrow(uma) - uma$i # this is Nalpha
uma$delta3 <- 30 # this is 3 times delta
uma$D <- pmin(uma$im1, uma$Nai, uma$delta3) # selecting the minimum of {i-1, Nalpha - i, 3*Delta = 30}
uma$imD <- uma$i - uma$D # i-D
uma$ipD <- uma$i + uma$D # i+D
uma <- ddply(uma, .(Frame.ID), transform, k = imD:ipD) # to include all k in the data frame
umai <- uma
umai$imk <- umai$i - umai$k # i-k
umai$aimk <- (-1) * abs(umai$imk) # -|i-k|
umai$delta <- 10
umai$kernel <- exp(umai$aimk/umai$delta) # The kernel in the equation i.e. EXP^-|i-k|/delta
umai$p <- umai$Vehicle.velocity[match(umai$k,umai$i)] #observed velocity in kth row as described in equation as t(k)
umai$kernelp <- umai$p * umai$kernel # the product of kernel and observed velocity in kth row as described in equation as t(k)
umair <- ddply(umai, .(Frame.ID), summarize, Z = sum(kernel), prod = sum(kernelp)) # summing the kernel to get Z and summing the product to get the numerator of the equation
umair$new.Y <- umair$prod/umair$Z # the final step to get the smoothed velocity
Plot
Just for reference, if I plot the observed and smoothed velocities against time frames we can see the result of smoothing:
library(ggplot2)
ggplot() +
  geom_point(data = uma, aes(y = Vehicle.velocity, x = Frame.ID)) +
  geom_point(data = umair, aes(y = new.Y, x = Frame.ID), color = "red")
Please help me make my code shorter and applicable to all vehicles (represented by Vehicle.ID in the data set) by guiding me on the use of convolution.
dplyr
Alright, so I used the following code and it works, but it takes 3 hours on 32 GB RAM. Can anyone suggest improvements to speed it up? (umal, umav and umaa take about 1 hour each.)
library(dplyr)
uma <- tbl_df(uma)
uma <- uma %>% # take data frame
group_by(Vehicle.ID) %>% # group by Vehicle ID
mutate(i = 1:length(Frame.ID), im1 = i-1, Nai = length(Frame.ID) - i,
Dv = pmin(im1, Nai, 30),
Da = pmin(im1, Nai, 120),
Dl = pmin(im1, Nai, 15),
imDv = i - Dv,
ipDv = i + Dv,
imDa = i - Da,
ipDa = i + Da,
imDl = i - Dl,
ipDl = i + Dl) %>% # finding i, i-1 and Nalpha-i, D, i-D and i+D for location, velocity and acceleration
ungroup()
umav <- uma %>%
group_by(Vehicle.ID, Frame.ID) %>%
do(data.frame(kv = .$imDv:.$ipDv)) %>%
left_join(x=., y=uma) %>%
mutate(imk = i - kv, aimk = (-1) * abs(imk), delta = 10, kernel = exp(aimk/delta)) %>%
ungroup() %>%
group_by(Vehicle.ID) %>%
mutate(p = Vehicle.velocity2[match(kv,i)], kernelp = p * kernel) %>%
ungroup() %>%
group_by(Vehicle.ID, Frame.ID) %>%
summarise(Z = sum(kernel), prod = sum(kernelp)) %>%
mutate(svel = prod/Z) %>%
ungroup()
umaa <- uma %>%
group_by(Vehicle.ID, Frame.ID) %>%
do(data.frame(ka = .$imDa:.$ipDa)) %>%
left_join(x=., y=uma) %>%
mutate(imk = i - ka, aimk = (-1) * abs(imk), delta = 10, kernel = exp(aimk/delta)) %>%
ungroup() %>%
group_by(Vehicle.ID) %>%
mutate(p = Vehicle.acceleration2[match(ka,i)], kernelp = p * kernel) %>%
ungroup() %>%
group_by(Vehicle.ID, Frame.ID) %>%
summarise(Z = sum(kernel), prod = sum(kernelp)) %>%
mutate(sacc = prod/Z) %>%
ungroup()
umal <- uma %>%
group_by(Vehicle.ID, Frame.ID) %>%
do(data.frame(kl = .$imDl:.$ipDl)) %>%
left_join(x=., y=uma) %>%
mutate(imk = i - kl, aimk = (-1) * abs(imk), delta = 10, kernel = exp(aimk/delta)) %>%
ungroup() %>%
group_by(Vehicle.ID) %>%
mutate(p = Local.Y[match(kl,i)], kernelp = p * kernel) %>%
ungroup() %>%
group_by(Vehicle.ID, Frame.ID) %>%
summarise(Z = sum(kernel), prod = sum(kernelp)) %>%
mutate(ycoord = prod/Z) %>%
ungroup()
umal <- select(umal,c("Vehicle.ID", "Frame.ID", "ycoord"))
umav <- select(umav, c("Vehicle.ID", "Frame.ID", "svel"))
umaa <- select(umaa, c("Vehicle.ID", "Frame.ID", "sacc"))
umair <- left_join(uma, umal) %>% left_join(x=., y=umav) %>% left_join(x=., y=umaa)
A good first step would be to take a for loop (which I'll hide with sapply) and perform the exponential smoothing for each index:
josilber1 <- function(uma) {
  delta <- 10
  sapply(1:nrow(uma), function(i) {
    D <- min(i-1, nrow(uma)-i, 30)
    rng <- (i-D):(i+D)
    rng <- rng[rng >= 1 & rng <= nrow(uma)]
    expabs <- exp(-abs(i-rng)/delta)
    return(sum(uma$Vehicle.velocity[rng] * expabs) / sum(expabs))
  })
}
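As for the convolution question: away from the edges (where D is the full 30), the smoother is an ordinary convolution with a fixed kernel, so stats::filter can reproduce the interior values. A minimal sketch, assuming delta = 10 and the full +/- 30-point window:

# stats::filter computes the centred weighted moving sum; dividing by
# sum(w) normalises it (the Z term). Rows within 30 points of either end
# come out NA and still need the shrinking-window treatment above.
w <- exp(-abs(-30:30) / 10)
num <- stats::filter(uma$Vehicle.velocity, w, sides = 2)
smoothed_interior <- as.numeric(num) / sum(w)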
A more involved approach would be to only compute the incremental change in the exponential smoothing function for each index (as opposed to re-summing at each index). The exponential smoothing function has a lower part (data before the current index; I include the current index in low in the code below) and an upper part (data after the current index; high in the code below). As we loop through the vector, all the data in the lower part gets weighted less (we divide by mult) and all the data in the upper part gets weighted more (we multiply by mult). The leftmost element is dropped from low, the leftmost element in high moves to low, and one element is added to the right side of high.
The actual code is a bit messier to deal with the beginning and ending of the vector and to deal with numerical stability issues (errors in high are multiplied by mult each iteration):
josilber2 <- function(uma) {
  delta <- 10
  x <- uma$Vehicle.velocity
  ret <- c(x[1], rep(NA, nrow(uma)-1))
  low <- x[1]
  high <- 0
  norm <- 1
  old.D <- 0
  mult <- exp(1/delta)
  for (i in 2:nrow(uma)) {
    D <- min(i-1, nrow(uma)-i, 30)
    if (D == old.D + 1) {
      low <- low / mult + x[i]
      high <- high * mult - x[i] + x[i+D-1]/mult^(D-1) + x[i+D]/mult^D
      norm <- norm + 2 / mult^D
    } else if (D == old.D) {
      low <- low / mult - x[i-(D+1)]/mult^(D+1) + x[i]
      high <- high * mult - x[i] + x[i+D]/mult^D
    } else {
      low <- low / mult - x[i-(D+2)]/mult^(D+2) - x[i-(D+1)]/mult^(D+1) + x[i]
      high <- high * mult - x[i]
      norm <- norm - 2 / mult^(D+1)
    }
    # For numerical stability, recompute high every so often
    if (i %% 50 == 0) {
      rng <- (i+1):(i+D)
      expabs <- exp(-abs(i-rng)/delta)
      high <- sum(x[rng] * expabs)
    }
    ret[i] <- (low+high)/norm
    old.D <- D
  }
  return(ret)
}
R code like josilber2 can often be sped up considerably using the Rcpp package:
library(Rcpp)
josilber3 <- cppFunction("
NumericVector josilber3(NumericVector x) {
  double delta = 10.0;
  NumericVector ret(x.size(), 0.0);
  ret[0] = x[0];
  double low = x[0];
  double high = 0.0;
  double norm = 1.0;
  int oldD = 0;
  double mult = exp(1/delta);
  for (int i=1; i < x.size(); ++i) {
    int D = i;
    if (x.size()-i-1 < D) D = x.size()-i-1;
    if (30 < D) D = 30;
    if (D == oldD + 1) {
      low = low / mult + x[i];
      high = high * mult - x[i] + x[i+D-1]/pow(mult, D-1) + x[i+D]/pow(mult, D);
      norm = norm + 2 / pow(mult, D);
    } else if (D == oldD) {
      low = low / mult - x[i-(D+1)]/pow(mult, D+1) + x[i];
      high = high * mult - x[i] + x[i+D]/pow(mult, D);
    } else {
      low = low / mult - x[i-(D+2)]/pow(mult, D+2) - x[i-(D+1)]/pow(mult, D+1) + x[i];
      high = high * mult - x[i];
      norm = norm - 2 / pow(mult, D+1);
    }
    // For numerical stability, recompute high every so often
    if (i % 50 == 0) {
      high = 0.0;
      for (int j=i+1; j <= i+D; ++j) {
        high += x[j] * exp((i-j)/delta);
      }
    }
    ret[i] = (low+high)/norm;
    oldD = D;
  }
  return ret;
}")
We can now benchmark the improvements from these three new approaches:
all.equal(umair.fxn(uma), josilber1(uma))
# [1] TRUE
all.equal(umair.fxn(uma), josilber2(uma))
# [1] TRUE
all.equal(umair.fxn(uma), josilber3(uma$Vehicle.velocity))
# [1] TRUE
library(microbenchmark)
microbenchmark(umair.fxn(uma), josilber1(uma), josilber2(uma), josilber3(uma$Vehicle.velocity))
# Unit: microseconds
# expr min lq mean median uq max neval
# umair.fxn(uma) 370006.728 382327.4115 398554.71080 393495.052 404186.153 572801.355 100
# josilber1(uma) 12879.268 13640.1310 15981.82099 14265.610 14805.419 28959.230 100
# josilber2(uma) 4324.724 4502.8125 5753.47088 4918.835 5244.309 17328.797 100
# josilber3(uma$Vehicle.velocity) 41.582 54.5235 57.76919 57.435 60.099 90.998 100
We got a lot of improvement (25x) with the simpler josilber1 and a 70x total speedup with josilber2 (the advantage would be more with a larger delta value). With josilber3 we achieve a 6800x speedup, getting the runtime all the way down to 54 microseconds to process a single vehicle!
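To make any of these apply to all vehicles, as the question asks, one option is ave(), which runs a function per group and returns results in the original row order. A sketch, assuming the full data set (here called traj, a hypothetical name) has Vehicle.ID and Vehicle.velocity columns:

# Smooth every vehicle's trace; josilber3 returns one value per input row.
traj$svel <- ave(traj$Vehicle.velocity, traj$Vehicle.ID, FUN = josilber3)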
