Standard deviation in every quarter-hour in R - r

I have raw data of power system frequency. 86 400 numbers.
frequency=a$Ist_Frq
plot.ts(frequency, main="System frequency [Hz]", xlab="Time [s]")
See example:
Raw data
Now, i have to determine quarter-hour time interval.
frequency=ts(a$Ist_Frq, start=1, frequency=900)
[quarter-hour time interval][2]
My question is:
Is there any way how to determine standart deviation in every quarter-hour?
Thanks for your answers.

There are probably several solutions to this problem: here is one
#some data
x <- rnorm(10000)
#identify quarter hour segments
y <- rep(1:ceiling(length(x)/(15 * 60)), each = 15 * 60)[1:length(x)]
#use tapply to find sd of x for every value of y
tapply(x, y, sd)
nb the last value might be based on fewer than 900 values

Related

R: Calculate standard deviation for specific time interval

I have a dataset with daily bond returns for some unique RIC codes (in total approx. 200.000 observations).
Now I want to calculate the standard deviation of those returns for the combined period t-30 to t-6 and t+6 to t+30. This means for every observation i,t, I need the 24 returns before t in the window t-30 to t-6 and 24 returns in the window t+6 to t+30 and calculate the standard deviation based on those 48 observations.
Here is a small snippet of my dataset:
#My data:
date <- c("2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17","2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17")
ric <- c("AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=")
return <- c(0.001009681, 0.003925873, 0.000354606, -0.000472641, -0.002935700, 0.003750854, 0.012317347, -0.001314047, 0.001014453, -0.007234452)
df <- data.frame(ric, date, return)
I have tried to use the slider package to generate two lists with the returns of the specific time frame. However, I feel that there is some more efficient way to solve this problem. I hope to find some help here.
This is what I tried before:
x <- slide(df$return, ~.x, .before=30, .after = -6)
y <- slide(df$return, ~.x, .before=-6, .after = 30)
z <- mapply(c, x, y, SIMPLIFY=TRUE)
for (i in 1:length(z))
{
df$sd[i] <- sd(z[[i]])
}

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistic course using R and we're studying using R for paired t-tests. I have come across questions where we're given two sets of data and we're asked to find whether the difference in mean is equal to 0 or greater than 0 so on so forth. The function we use for two samples x and y with an unknown variance is similar to the one below;
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we to do this if we wanted to test the difference in mean is more than or equal to a specified number against the alternative that its less than a specific number and not 0.
For example, say we're given two datas for before and after weights of 10 people. How do we test that the mean difference in weight is more than or equal to say 3kg against the alternative where the mean difference in weight is less than 3kg. Is there a way to do this? Would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you're in need of more theoretical proof. Is it ok to add/subtract the 3kg from either x or y and then use the t-test to check for similarity? I think this would tell you at least which outcome is more likely, if that's the end goal. It would be good to get feedback on this
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3 assuming your x variables are in the units you describe above.
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange [here].(https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317)

How to generate a population of random numbers within a certain exponentially increasing range

I have 16068 datapoints with values that range between 150 and 54850 (mean = 3034.22). What would the R code be to generate a set of random numbers that grow in frequency exponentially between 54850 and 150?
I've tried using the rexp() function in R, but can't figure out how to set the range to between 150 and 54850. In my actual data population, the lambda value is 25.
set.seed(123)
myrange <- c(54850, 150)
rexp(16068, 1/25, myrange)
The call produces an error.
Error in rexp(16068, 1/25, myrange) : unused argument (myrange)
The hypothesized population should increase exponentially the closer the data values are to 150. I have 25 data points with a value of 150 and only one with a value of 54850. The simulated population should fall in this range.
This is really more of a question for math.stackexchange, but out of curiosity I provide this solution. Maybe it is sufficient for your needs.
First, ?rexp tells us that it has only two arguments, so we generate a random exponential distribution with the desired length.
set.seed(42) # for sake of reproducibility
n <- 16068
mr <- c(54850, 150) # your 'myrange' with less typing
y0 <- rexp(n, 1/25) # simulate exp. dist.
y <- y0[order(-y0)] # sort
Now we need a mathematical approach to rescale the distribution.
# f(x) = (b-a)(x - min(x))/(max(x)-min(x)) + a
y.scaled <- (mr[1] - mr[2]) * (y - min(y)) / (max(y) - min(y)) + mr[2]
Proof:
> range(y.scaled)
[1] 150.312 54850.312
That's not too bad.
Plot:
plot(y.scaled, type="l")
Note: There might be some mathematical issues, see therefore e.g. this answer.

How to run a regression row by row

I just started using R for statistical purposes and I appreciate any kind of help.
As a first step, I ran a time series regression over my columns. Y values are dependent and the X is explanatory.
# example
Y1 <- runif(100, 5.0, 17.5)
Y2 <- runif(100, 4.0, 27.5)
Y3 <- runif(100, 3.0, 14.5)
Y4 <- runif(100, 2.0, 12.5)
Y5 <- runif(100, 5.0, 17.5)
X <- runif(100, 5.0, 7.5)
df1 <- data.frame(X, Y1, Y2, Y3, Y4, Y5)
# calculating log returns to provide data for the first regression
n <- nrow(df1)
X_logret <- log(X[2:n])-log(X[1:(n-1)])
Y1_logret <- log(Y1[2:n])-log(Y1[1:(n-1)])
Y2_logret <- log(Y2[2:n])-log(Y2[1:(n-1)])
Y3_logret <- log(Y3[2:n])-log(Y3[1:(n-1)])
Y4_logret <- log(Y4[2:n])-log(Y4[1:(n-1)])
Y5_logret <- log(Y5[2:n])-log(Y5[1:(n-1)])
# bringing the calculated log returns together in one data frame
df2 <- data.frame(X_logret, Y1_logret, Y2_logret, Y3_logret, Y4_logret, Y5_logret)
# running the time series regression
Regression <- lm(as.matrix(df2[c('Y1_logret', 'Y2_logret', 'Y3_logret', 'Y4_logret', 'Y5_logret')]) ~ df2$X)
# extracting the coefficients for further calculation
Regression$coefficients[2,(1:5)]
As a second step I want to run a regression row by row, which is day by day, since the data contains daily observed values. I also have a column "DATE" but I didn't know how to bring it in here in the example. The format of the DATE column is POSIXct, maybe someone has an idea how to refer to a certain period in it on which the regression should be done.
In the row by row regression I would like to use the 5 calculated coefficients (from the first regression) as an explanatory variable. The 5 Y_logret values, I would like to use as dependent variable.
Y_logret(1 to 5) = Beta * Regression$coefficients[2,(1:5)] + error value. The intercept is not needed, so I would set it to zero by adding +0 in the lm function.
My goal is to run this regression over a period of time, for example over 20 days. Day by day, this would provide a total of 20 Beta estimates (for one regression per day), but I would also need all errors for further calculation. So I have to extract 5 errors per day, that is a total of 20*5 error values.
This is just an example, in the original dataset I have 20 of the Y values and over 4000 rows. I would like to run the regression over certain intervals with 900-1000 day. Since I am completely new to R, I have no idea how to proceed. Especially how to code this in a few lines.
I really appreciate any kind of help.

calculate lag from phase arrows with biwavelet in r

I'm trying to understand the cross wavelet function in R, but can't figure out how to convert the phase lag arrows to a time lag with the biwavelet package. For example:
require(gamair)
data(cairo)
data_1 <- within(cairo, Date <- as.Date(paste(year, month, day.of.month, sep = "-")))
data_1 <- data_1[,c('Date','temp')]
data_2 <- data_1
# add a lag
n <- nrow(data_1)
nn <- n - 49
data_1 <- data_1[1:nn,]
data_2 <- data_2[50:nrow(data_2),]
data_2[,1] <- data_1[,1]
require(biwavelet)
d1 <- data_1[,c('Date','temp')]
d2 <- data_2[,c('Date','temp')]
xt1 <- xwt(d1,d2)
plot(xt1, plot.phase = TRUE)
These are my two time series. Both are identical but one is lagging the other. The arrows suggest a phase angle of 45 degrees - apparently pointing down or up means 90 degrees (in or out of phase) so my interpretation is that I'm looking at a lag of 45 degrees.
How would I now convert this to a time lag i.e. how would I calculate the time lag between these signals?
I've read online that this can only be done for a specific wavelength (which I presume means for a certain period?). So, given that we're interested in a period of 365, and the time step between the signals is one day, how would one alculate the time lag?
So I believe you're asking how you can determine what the lag time is given two time series (in this case you artificially added in a lag of 49 days).
I'm not aware of any packages that make this a one-step process, but since we are essentially dealing with sin waves, one option would be to "zero out" the waves and then find the zero crossing points. You could then calculate the average distance between zero crossing points of wave 1 and wave 2. If you know the time step between measurements, you can easy calculate the lag time (in this case the time between measurement steps is one day).
Here is the code I used to accomplish this:
#smooth the data to get rid of the noise that would introduce excess zero crossings)
#subtracted 70 from the temp to introduce a "zero" approximately in the middle of the wave
spline1 <- smooth.spline(data_1$Date, y = (data_1$temp - 70), df = 30)
plot(spline1)
#add the smoothed y back into the original data just in case you need it
data_1$temp_smoothed <- spline1$y
#do the same for wave 2
spline2 <- smooth.spline(data_2$Date, y = (data_2$temp - 70), df = 30)
plot(spline2)
data_2$temp_smoothed <- spline2$y
#function for finding zero crossing points, borrowed from the msProcess package
zeroCross <- function(x, slope="positive")
{
checkVectorType(x,"numeric")
checkScalarType(slope,"character")
slope <- match.arg(slope,c("positive","negative"))
slope <- match.arg(lowerCase(slope), c("positive","negative"))
ipost <- ifelse1(slope == "negative", sort(which(c(x, 0) < 0 & c(0, x) > 0)),
sort(which(c(x, 0) > 0 & c(0, x) < 0)))
offset <- apply(matrix(abs(x[c(ipost-1, ipost)]), nrow=2, byrow=TRUE), MARGIN=2, order)[1,] - 2
ipost + offset
}
#find zero crossing points for the two waves
zcross1 <- zeroCross(data_1$temp_smoothed, slope = 'positive')
length(zcross1)
[1] 10
zcross2 <- zeroCross(data_2$temp_smoothed, slope = 'positive')
length(zcross2)
[1] 11
#join the two vectors as a data.frame (using only the first 10 crossing points for wave2 to avoid any issues of mismatched lengths)
zcrossings <- as.data.frame(cbind(zcross1, zcross2[1:10]))
#calculate the mean of the crossing point differences
mean(zcrossings$zcross1 - zcrossings$V2)
[1] 49
I'm sure there are more eloquent ways of going about this, but it should get you the information that you need.
In my case, for the tidal wave in semidiurnal, 90 degree equal to 3 hours (90*12.5 hours/360 = 3.125 hours). 12.5 hours is the period of semidiurnal. So, for 45 degree equal to -> 45*12.5/360 = 1.56 hours.
Thus in your case:
90 degree -> 90*365/360 = 91.25 hours.
45 degree -> 45*365/360= 45.625 hours.
My understanding is as follows:
For there to be a simple cause-and-effect relationship between the phenomena recorded in the time series, we would expect that the oscillations are phase-locked (Grinsted 2004); so, the period where you find the "in phase" arrow (--->) indicates the lag between the signals.
See the simulated examples with different distances between cause-and-effect phenomena; observe that greater the distance, greater is the period of occurrence of the "in phase arrow" in the Cross wavelet transform.
Nonlinear Processes in Geophysics (2004) 11: 561–566 SRef-ID: 1607-7946/npg/2004-11-561
See the example here

Resources