Ramp up/down missing time-series data in R

I have a set of time-series data (GPS speed data, specifically), which includes gaps of missing values where the signal was lost. For short gaps I am able to fill the values simply using na.spline; however, this is inappropriate for longer gaps. I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.
#create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"
#set rate of acceleration for ramp
ramp <- 6
#set sampling rate of receiver
Hz <- 1/10
So for missing data, the ramp would use the previous value and the rate of acceleration to get the next data point until the speed reaches zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:
3.59
2.99
2.39
1.79
1.19
0.59
0
Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again.
Hope this is clear.
Cheers

It's not really elegant, but you can do it in a loop.
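na.pos <- which(is.na(test$speed))
acc <- FALSE  # switches to TRUE once the ramp down has reached zero
for (i in na.pos) {
  if (acc) {
    # ramp up from the previous (already filled) value
    speed <- test$speed[i-1]+(Hz*ramp)
  } else {
    # ramp down from the previous value
    speed <- test$speed[i-1]-(Hz*ramp)
    if (round(speed,1) < 0) {
      # went below zero: start ramping up instead
      acc <- TRUE
      speed <- test$speed[i-1]+(Hz*ramp)
    }
  }
  test[i,] <- speed
}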
The result is:
speed
1 6.00
2 5.70
3 5.40
4 5.14
5 4.89
6 4.64
7 4.41
8 4.19
9 3.59
10 2.99
11 2.39
12 1.79
13 1.19
14 0.59
15 -0.01
16 0.59
17 1.19
18 1.79
19 2.39
20 2.99
21 3.59
22 4.19
23 4.79
24 5.00
25 5.10
26 5.30
27 5.40
28 5.50
Note the '-0.01' in row 15: it arises because 0.59 - (6 * 0.1) is -0.01, not 0. You can round it later; I decided not to.

When the question says "ramp the values from the last true value down to zero" in each run of NAs I assume that that means that any remaining NAs in the run after reaching zero are also to be replaced by zero.
Now use rleid from data.table to create a grouping vector, the same length as test$speed, that identifies each run in is.na(test$speed), and use ave to create sequence numbers within those groups, seqno. Then calculate the declining sequences, ramp_down, by combining na.locf(test$speed) and seqno. Finally, replace the NAs.
library(data.table)
library(zoo)
test_speed <- test$speed
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
result <- ifelse(is.na(test_speed), ramp_down, test_speed)
giving:
> result
[1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50
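The question also asks for a ramp back up from zero when the signal returns, which the result above does not cover. Here is a minimal base-R sketch of one way to get both, assuming every gap is interior (there is an observed value before and after it). The choice to take the larger of the down-ramp and the up-ramp at each point is my assumption, not something stated in the question, and all the helper names are made up for illustration.

```r
# Sample data from the question: 8 readings, a 15-sample gap, 5 readings
speed <- c(6, 5.7, 5.4, 5.14, 4.89, 4.64, 4.41, 4.19,
           rep(NA, 15), 5, 5.1, 5.3, 5.4, 5.5)
ramp <- 6      # acceleration limit, as in the question
Hz   <- 1/10   # time step per sample, as in the question
step <- ramp * Hz

is_na  <- is.na(speed)
idx    <- seq_along(speed)
runs   <- rle(is_na)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Position inside each run, counted from its start and from its end
from_start <- ave(idx, run_id, FUN = seq_along)
from_end   <- ave(idx, run_id, FUN = function(i) rev(seq_along(i)))

# Index of the last observed value before, and the first observed value after
prev_idx <- cummax(ifelse(is_na, 0L, idx))
next_idx <- rev(cummin(rev(ifelse(is_na, length(speed) + 1L, idx))))

# Ramp down from the last value and ramp up towards the next one, floored at 0
down <- pmax(speed[prev_idx] - from_start * step, 0)
up   <- pmax(speed[next_idx] - from_end * step, 0)

# Assumption: inside a gap, use whichever ramp is higher
filled <- ifelse(is_na, pmax(down, up), speed)
```

With the sample data the gap starts at 3.59, sits at zero in the middle, and climbs back to 4.4 just before the first real reading of 5. If a gap is too short for both ramps to reach zero, the two ramps simply meet above zero.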


Time-series average of cross-sectional correlations

I have a panel dataset looking like this:
head(panel_data)
date symbol close rv rv_plus rv_minus rskew rkurt Mkt.RF SMB HML
1 1999-11-19 a 25.4 19.3 6.76 12.6 -0.791 4.36 -0.11 0.35 -0.5
2 1999-11-22 a 26.8 10.1 6.44 3.69 0.675 5.38 0.02 0.22 -0.92
3 1999-11-23 a 25.2 8.97 2.56 6.41 -1.04 4.00 -1.29 0.08 0.3
4 1999-11-24 a 25.6 5.81 2.86 2.96 -0.505 5.45 0.87 0.08 -0.89
5 1999-11-26 a 25.6 2.78 1.53 1.25 0.617 5.60 0.23 0.92 -0.2
6 1999-11-29 a 26.1 5.07 2.76 2.30 -0.236 7.27 -0.6 0.570 -0.14
where the variable symbol depicts different stocks. I want to calculate the time-series average of the cross-sectional correlation between the variables rskew and rkurt. This means I need to compute the correlation between rskew and rkurt over all different stocks at each point in time and then calculate the time-series average afterwards.
I tried to do it with the rollapply function from the zoo package, but since the number of different stocks is not the same for all dates, I cannot simply define width as an integer. Here is what I tried for a sample width of 20:
panel_data <- panel_data %>%
group_by(date) %>%
mutate(cor_skew_kurt = rollapply(data = panel_data[7:8],
width=20,
FUN=cor,
align="right",
na.rm=TRUE,
fill=NA)) %>%
ungroup
Is there a way to do this without having to define a fixed width for each date group?
Or should I maybe use a different approach to do this?
[Edited] Can you try running the code below? I have recreated an example emulating your issue. If I understood your problem correctly, this code should at least put you on the path to the right solution, as it solves the issue of unequal window lengths.
###################
#Recreating an example dataset with unequal dates across stocks
set.seed(1)
date6 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26','1999-11-29')
date5 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26')
date4 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24')
cor_skew_kurt <- c(rep(NaN,21))
symbol <- c(rep('a',6),rep('b',5),rep('c',4),rep('d',6))
rskew <- rnorm(21,mean=1, sd =1)
rkurt <- rnorm(21, mean=5, sd = 1)
panel_data <- cbind.data.frame(date = c(date6,date5,date4,date6), symbol = symbol, rskew = rskew, rkurt = rkurt, cor_skew_kurt = cor_skew_kurt )
panel_data$date <- as.Date(panel_data$date, '%Y-%m-%d')
# Computing the cor_skew_kurt and filling the table <- ANSWER TO YOUR QUESTION
for (date in unique(panel_data$date))
{
panel_data[panel_data$date == date,"cor_skew_kurt"] <- as.double(cor(panel_data[panel_data$date == date,'rskew'],panel_data[panel_data$date == date,'rkurt']))
}
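The loop above fills in the per-date correlation but stops short of the time-series average the question asks for. A minimal sketch of that last step in base R, assuming a panel with the same column names as in the question (the toy data and the names by_date, cs_cor and ts_avg are made up for illustration):

```r
# Toy panel: the number of stocks differs across dates
set.seed(1)
panel_data <- data.frame(
  date   = as.Date(c(rep("1999-11-19", 4), rep("1999-11-22", 4),
                     rep("1999-11-23", 3))),
  symbol = c("a", "b", "c", "d", "a", "b", "c", "d", "a", "b", "d"),
  rskew  = rnorm(11, mean = 1),
  rkurt  = rnorm(11, mean = 5)
)

# One cross-sectional correlation per date, whatever the group size
by_date <- split(panel_data, panel_data$date)
cs_cor  <- sapply(by_date, function(d) cor(d$rskew, d$rkurt,
                                           use = "complete.obs"))

# Time-series average of those cross-sectional correlations
ts_avg <- mean(cs_cor, na.rm = TRUE)
```

Because the grouping is done by date, no window width is needed at all. Dates with fewer than two stocks would give NA and are skipped by na.rm = TRUE.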

Is there a way to resolve this error in cardinality_threshold problem?

I tried to use ggpairs to visualise my dataset, but I don't understand the error message I am getting. Can someone please help me?
> describe(Mydata)
vars n mean sd median trimmed mad min max range skew
Time 1 192008 4257.07 2589.28 4156.44 4210.33 3507.03 0 8869.91 8869.91 0.09
Source* 2 192008 9.32 5.95 8.00 8.53 2.97 1 51.00 50.00 3.39
Destination* 3 192008 8.22 6.49 7.00 7.31 2.97 1 51.00 50.00 3.07
Protocol* 4 192008 16.14 4.29 19.00 16.77 0.00 1 20.00 19.00 -1.26
Length 5 192008 166.12 464.07 74.00 96.25 11.86 60 21786.00 21726.00 14.40
Info* 6 192008 63731.70 46463.90 60732.50 62899.62 69904.59 1 131625.00 131624.00 0.14
kurtosis se
Time -1.28 5.91
Source* 15.94 0.01
Destination* 13.21 0.01
Protocol* 0.66 0.01
Length 349.17 1.06
Info* -1.47 106.04
> Mydata[,1][Mydata[,1] ==0]<-NA
> ggpairs(Mydata)
Error in stop_if_high_cardinality(data, columns, cardinality_threshold) :
Column 'Source' has more levels (51) than the threshold (15) allowed.
Please remove the column or increase the 'cardinality_threshold' parameter. Increasing the
cardinality_threshold may produce long processing times
As the error suggests, you can get rid of it by setting cardinality_threshold=NULL (which disables the check) or cardinality_threshold=51, since Source and Destination are both factor variables with 51 levels. With a threshold of 51, Info (which, judging by the describe output, has far more levels) would probably still trigger the error unless it is dropped.
However, it will likely be hard to see any detail in the plots, if they render at all, because some panels would be trying to fit 51 bar plots with 51 columns each. You may want to consider whether grouping your factor levels makes sense for the analysis you're interested in, or exclude the factors (although that only leaves two continuous variables).
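If grouping levels is an option, here is a rough base-R sketch of one way to do it: keep the most frequent levels and lump everything else into a single "Other" level so the factor stays under the default threshold of 15. The helper lump_levels and the toy src vector are hypothetical and not part of GGally.

```r
# Hypothetical helper: keep the n most frequent levels, lump the rest
lump_levels <- function(f, n = 14, other = "Other") {
  f <- droplevels(as.factor(f))
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  out <- ifelse(f %in% keep, as.character(f), other)
  out[is.na(f)] <- NA                      # leave missing values missing
  droplevels(factor(out, levels = unique(c(keep, other))))
}

# Toy stand-in for a column like Source with 51 distinct values
set.seed(1)
src <- factor(sample(sprintf("host%02d", 1:51), 2000, replace = TRUE))
lumped <- lump_levels(src)
nlevels(lumped)   # 14 kept levels plus "Other"
```

You would apply this to Source, Destination and Protocol before calling ggpairs. Whether lumping is sensible depends on what the rare levels mean in your data.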

avoid nested for-loops for tricky operation R

I have to do an operation that involves two matrices: matrix #1 with data and matrix #2 with coefficients that multiply the columns of matrix #1.
matrix #1 is:
dim(dat)
[1] 612 2068
dat[1:6,1:8]
X0005 X0010 X0011 X0013 X0015 X0016 X0017 X0018
1 1.96 1.82 8.80 1.75 2.95 1.10 0.46 0.96
2 1.17 0.94 2.74 0.59 0.86 0.63 0.16 0.31
3 2.17 2.53 10.40 4.19 4.79 2.22 0.31 3.32
4 3.62 1.93 6.25 2.38 2.25 0.69 0.16 1.01
5 2.32 1.93 3.74 1.97 1.31 0.44 0.28 0.98
6 1.30 2.04 1.47 1.80 0.43 0.33 0.18 0.46
and matrix #2 is:
dim(lo)
[1] 2068 8
head(lo)
i1 i2 i3 i4 i5 i6
X0005 -0.11858852 0.10336788 0.62618771 0.08706041 -0.02733101 0.006287923
X0010 0.06405406 0.13692216 0.64813610 0.15750302 -0.13503956 0.139280709
X0011 -0.06789727 0.30473549 0.07727417 0.24907723 -0.05345123 0.141591330
X0013 0.20909664 0.01275553 0.21067894 0.12666704 -0.02836527 0.464548147
X0015 -0.07690560 0.18788859 -0.03551084 0.19120773 -0.10196578 0.234037820
X0016 -0.06442454 0.34993481 -0.04057001 0.20258195 -0.09318325 0.130669546
i7 i8
X0005 0.08571777 0.031531478
X0010 0.31170850 -0.003127279
X0011 0.52527759 -0.065002026
X0013 0.27858049 -0.032178156
X0015 0.50693977 -0.058003429
X0016 0.53162596 -0.052091767
I want to multiply each column of matrix #1 by its corresponding coefficient in the first column of matrix #2, and sum up all the resulting columns. Then repeat the operation with the coefficients in the second column of matrix #2, then the third column, and so on.
The result is then a matrix with 8 columns, which are linear combinations of the data in matrix #1.
My attempt uses nested for-loops. It works, but takes about 30 minutes to execute. Is there any way to avoid these loops and reduce the computational effort?
Here is my attempt:
r=nrow(dat)
n=ncol(dat)
m=ncol(lo)
eme<-matrix(NA,r,m)
for (i in(1:m)){
  SC<-matrix(NA,r,n)
  for (j in(1:n)){
    nom<-rownames(lo)
    x<-dat[ , colnames(dat) == nom[j]]
    SC[,j]<-x*lo[j,i]
    SC1<-rowSums(SC)
  }
  eme[,i]<-SC1
}
Thanks for your help
It looks like you are just doing matrix-vector multiplication. In R, use the %*% operator, so all the looping is delegated to a Fortran routine. I think it equates to the following, assuming dat is a numeric matrix whose columns are in the same order as the rows of lo:
apply(lo, 2, function(x) dat %*% x)
Your code could be improved by moving the nom <- assignment outside the loops since it recalculates the same thing every iteration. Also, what is the point of SC1 being computed during each iteration?
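Since the loop in the question matches the columns of dat to the rows of lo by name, the whole thing can probably be written as a single matrix product. A small sketch with made-up toy data (the real dat is 612 x 2068 and lo is 2068 x 8); it assumes every row name of lo appears among the column names of dat:

```r
# Toy stand-ins for dat and lo, with the rows of lo deliberately shuffled
set.seed(1)
dat <- as.data.frame(matrix(runif(5 * 4), nrow = 5,
                            dimnames = list(NULL, c("X0005", "X0010",
                                                    "X0011", "X0013"))))
lo <- matrix(rnorm(4 * 3), nrow = 4,
             dimnames = list(c("X0010", "X0005", "X0013", "X0011"),
                             c("i1", "i2", "i3")))

# Reorder the columns of dat to match the rows of lo, then multiply once
eme <- as.matrix(dat[, rownames(lo)]) %*% lo
dim(eme)   # one row per observation, one column per coefficient set
```

Indexing by rownames(lo) does the name matching that the inner loop did with colnames(dat) == nom[j], so the result should agree with the loop version.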

Biomass model with density

I am trying to develop a common biomass model for several species that includes wood density.
Here is my data set:
Species_Name DBH_cm Wood_Density Leaf_Biomass_kg
Aam 10.9 0.55 4.495175666
Aam 8.3 0.55 3.003987585
Aam 18.3 0.55 7.0453234
Akashmoni 26.6 0.68 8.68327883
Akashmoni 18 0.68 5.514198965
Akashmoni 20.6 0.68 7.140993296
Amloki 13.7 0.64 0.418757191
Amloki 14.6 0.64 0.348964326
Amra 19 0.29 0
Arjun 13.3 0.82 0
Bajna 13 0.70 0
Bel 19.6 0.83 0.458638794
Sal 14.40 0.82 0.996750392
Sal 12.20 0.82 0.644956136
Sal 10.00 0.82 0.947928706
Sal 14.20 0.82 0.767434214
Sal 11.50 0.82 0.636970398
Sal 13.20 0.82 0.445111844
Sal 13.30 0.82 0.706039477
Sal 10.70 0.82 0.475809213
I tried to set the missing (zero) values to NA by using
tree[which(tree$Leaf_Biomass_kg == 0),]$Leaf_Biomass_kg <- NA
my model code is
library(nlme)
start <- coef(lm(log(Leaf_Biomass_kg)~log(DBH_cm)+log(Wood_Density), data=tree))
start[1] <- exp(start[1])
names(start) <- c("a","b1", "b2")
m <- nlme(Leaf_Biomass_kg~a*DBH_cm^b1*Wood_Density^b2,
data=cbind(tree,g="a"),
fixed=a+b1+b2~1,
start=start,
groups=~g,
weights=varPower(form=~DBH_cm))
it gives
Error in finiteDiffGrad(model, data, pars) :
NAs are not allowed in subscripted assignments
Can anyone help me with this?
I added na.action=na.exclude to my model, but the problem still exists:
m <- nlme(Leaf_Biomass_kg~a*DBH_cm^b1*Wood_Density^b2,
data=cbind(tree,g="a"),
fixed=a+b1+b2~1,
start=start,
groups=~g,
weights=varPower(form=~DBH_cm),
na.action=na.exclude)
You have missing values (NA) in your data set. Try setting na.action=na.exclude, or remove the records with NA values before fitting. You may also want to impute the missing values. This question has been addressed here:
https://stats.stackexchange.com/questions/11000/how-does-r-handle-missing-values-in-lm
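A minimal sketch of the "remove records with NA values" option, using a few rows copied from the question's data. The name tree_complete is made up, and I have not checked that this alone makes the nlme call converge:

```r
# A few rows from the question's data set
tree <- data.frame(
  Species_Name    = c("Aam", "Aam", "Amra", "Arjun", "Sal"),
  DBH_cm          = c(10.9, 8.3, 19, 13.3, 14.4),
  Wood_Density    = c(0.55, 0.55, 0.29, 0.82, 0.82),
  Leaf_Biomass_kg = c(4.495175666, 3.003987585, 0, 0, 0.996750392)
)

# Zero biomass is treated as missing, as in the question
tree$Leaf_Biomass_kg[tree$Leaf_Biomass_kg == 0] <- NA

# Keep only the rows where every model variable is observed
model_vars    <- c("DBH_cm", "Wood_Density", "Leaf_Biomass_kg")
tree_complete <- tree[complete.cases(tree[, model_vars]), ]
```

You would then pass data=cbind(tree_complete, g="a") to both the lm call that produces the starting values and the nlme call, so neither ever sees an NA.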

how can I do vector integral in R? [duplicate]

This question already has answers here:
Calculate the Area under a Curve
(7 answers)
Closed 7 years ago.
I want to integrate a one-dimensional vector in R. How should I do that?
Let's say I have:
d=hist(p, breaks=100, plot=FALSE)$density
where p is a sample like:
p=rnorm(1e5)
How can I calculate an integral over d?
If we assume that the values in d correspond to the y values of a function, then we can calculate the integral by using a discrete approximation. We can, for example, use the trapezium rule or Simpson's rule for this purpose. We then also need to supply the stepsize that corresponds to the discrete interval on the x-axis in order to "approximate the area under the curve".
Discrete integration functions defined below:
p=rnorm(1e5)
d=hist(p,breaks=100,plot=FALSE)$density
discreteIntegrationTrapeziumRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
if(upper > length(v))
upper=length(v)
if(lower < 1)
lower=1
integrand <- v[lower:upper]
l <- length(integrand)
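# take the last term from integrand; v[l] is only right when lower = 1 and inflates partial-range results
stepsize*(0.5*integrand[1]+sum(integrand[2:(l-1)])+0.5*integrand[l])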
}
discreteIntegrationSimpsonRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
if(upper > length(v))
upper=length(v)
if(lower < 1)
lower=1
integrand <- v[lower:upper]
l <- length(integrand)
a = seq(from=2,to=l-1,by=2);
b = seq(from=3,to=l-1,by=2)
(stepsize/3)*(integrand[1]+4*sum(integrand[a])+2*sum(integrand[b])+integrand[l])
}
As an example, let's approximate the complete area under the curve while assuming discrete x steps of size 1 and then do the same for the second half of d while we assume x-steps of size 0.2.
> plot(1:length(d),d) # stepsize one on x-axis
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultTrapeziumRule
[1] 9.9999
> resultSimpsonRule
[1] 10.00247
> plot(seq(from=-10,to=(-10+(length(d)*0.2)-0.2),by=0.2),d) # stepsize 0.2 on x-axis
> resultTrapziumRule <- discreteIntegrationTrapeziumRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultTrapziumRule
[1] 1.15478
> resultSimpsonRule
[1] 1.11678
For smooth integrands, Simpson's rule generally gives better approximations of the integral than the trapezium rule. The more y-values you have (and the smaller the x-axis stepsize), the better your approximations will become.
Small EDIT for clarity:
In this particular case the stepsize should be 0.1, which is the spacing between the histogram bin midpoints shown below. The complete area under the density curve is then (approximately) equal to 1, as expected.
> d=hist(p,breaks=100,plot=FALSE)$density
> hist(p,breaks=100,plot=FALSE)$mids # stepsize = 0.1
[1] -4.75 -4.65 -4.55 -4.45 -4.35 -4.25 -4.15 -4.05 -3.95 -3.85 -3.75 -3.65 -3.55 -3.45 -3.35 -3.25 -3.15 -3.05 -2.95 -2.85 -2.75 -2.65 -2.55
[24] -2.45 -2.35 -2.25 -2.15 -2.05 -1.95 -1.85 -1.75 -1.65 -1.55 -1.45 -1.35 -1.25 -1.15 -1.05 -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25
[47] -0.15 -0.05 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1.05 1.15 1.25 1.35 1.45 1.55 1.65 1.75 1.85 1.95 2.05
[70] 2.15 2.25 2.35 2.45 2.55 2.65 2.75 2.85 2.95 3.05 3.15 3.25 3.35 3.45 3.55 3.65 3.75 3.85 3.95 4.05 4.15
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d,stepsize=0.1)
> resultTrapeziumRule
[1] 0.999985
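For a histogram specifically, the bin widths are already stored in the object, so you can skip guessing the stepsize. A short sketch (the names h and area are mine): the density in each bin times the bin width, summed over all bins, gives the total area, which is 1 by construction.

```r
set.seed(42)
p <- rnorm(1e5)
h <- hist(p, breaks = 100, plot = FALSE)

# Rectangle rule with the actual bin widths taken from the breaks
area <- sum(h$density * diff(h$breaks))
area
```

The same idea works for part of the range: subset h$density and diff(h$breaks) with the same index before summing.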
