For-loop error and min 2.5% and max 97.5% percentile in R

I have a data set with 41 rows and 21 columns. In the data frame, each row represents energy data for a 15-minute interval of the day (from 10am to 8pm), and each column represents selected days within a month.
I need to figure out load variability (standard deviation / mean) over expanding groups of rows in each column,
i.e., between the 1st and 2nd; the 1st, 2nd, and 3rd; the 1st through 4th; the 1st through 5th; etc. element of each column.
I keep getting NA values in "lv" and wonder why. The end result, lv, should be a 41x21 data frame, the same shape as df, but holding the load variability.
Also, how do I get the 2.5 and 97.5 percentiles within the loop, in addition to the load variability?
x <- df[1:41, 1:21]
# calculate load variability
count = 0
i = 1
for (i in 1:41){
  count = count + 1
  mean = sum(x[1:i, ]) / count
  diff = (x - mean)^2
  lv = sqrt((diff / (count + 1) - 1) / mean)
  i = i + 1
}
lv
lv ends up full of missing values (NA).

If you want to calculate sd/mean for each row, try:
apply(x, 1, sd) / rowMeans(x)
If you want the 2.5% and 97.5% percentiles for each row, try:
apply(x, 1, quantile, c(0.025, 0.975))
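Note that the quantile call above returns a 2 x nrow(x) matrix (one column per row of x). If you want the percentiles laid out one row per 15-minute interval, transpose the result; a minimal sketch (the na.rm argument is added here as a defensive assumption):
ci <- t(apply(x, 1, quantile, c(0.025, 0.975), na.rm = TRUE))
head(ci)  # one row per interval, columns "2.5%" and "97.5%"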

Ok, after several tries (and some help from this question), I finally have:
cumul_loading <- function(x, leave.nan = FALSE){
  ind_na <- !is.na(x)
  nn <- cumsum(ind_na)
  x[!ind_na] <- 0
  cumul_mean <- cumsum(x) / nn
  cumul_sd <- sqrt(cumsum(x^2) / (nn - 1) - (cumsum(x))^2 / (nn - 1) / nn)
  if(leave.nan) return(cumul_sd / cumul_mean) else
    return((cumul_sd / cumul_mean)[-1])
}
It may still have a few rough edges (such as what to do with NAs), but it should now work with an apply function. The leave.nan argument optionally keeps the NaN produced when nn - 1 = 0 (the first element):
apply(x, 2, cumul_loading)
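If the 2.5% and 97.5% percentiles should likewise be computed cumulatively (over the 1st..k-th elements of each column, matching cumul_loading), a minimal sketch along the same lines — cumul_quantiles is a made-up name, and the expanding quantile here is a plain loop rather than a closed-form cumsum:
cumul_quantiles <- function(v, probs = c(0.025, 0.975)) {
  # row k holds quantile(v[1:k], probs): an expanding-window quantile
  t(sapply(seq_along(v), function(k) quantile(v[1:k], probs, na.rm = TRUE)))
}
q_by_col <- lapply(as.data.frame(x), cumul_quantiles)  # one 41 x 2 matrix per column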

Related

R: Calculate standard deviation for specific time interval

I have a dataset with daily bond returns for a set of unique RIC codes (in total approx. 200,000 observations).
Now I want to calculate the standard deviation of those returns over the combined period t-30 to t-6 and t+6 to t+30. This means for every observation (i, t), I need the 24 returns before t in the window t-30 to t-6 and the 24 returns after t in the window t+6 to t+30, and I calculate the standard deviation based on those 48 observations.
Here is a small snippet of my dataset:
#My data:
date <- c("2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17","2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17")
ric <- c("AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=")
return <- c(0.001009681, 0.003925873, 0.000354606, -0.000472641, -0.002935700, 0.003750854, 0.012317347, -0.001314047, 0.001014453, -0.007234452)
df <- data.frame(ric, date, return)
I have tried to use the slider package to generate two lists with the returns of the specific time frame. However, I feel that there is some more efficient way to solve this problem. I hope to find some help here.
This is what I tried before:
library(slider)
x <- slide(df$return, ~.x, .before = 30, .after = -6)
y <- slide(df$return, ~.x, .before = -6, .after = 30)
z <- mapply(c, x, y, SIMPLIFY = TRUE)
for (i in 1:length(z)) {
  df$sd[i] <- sd(z[[i]])
}
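No answer is recorded for this one, but a possible sketch, assuming rows are already sorted by date within each ric and that the offsets are counted in rows rather than calendar days, is to keep the slider windows and group them per RIC with ave() (edge rows with too few neighbours will produce NA):
library(slider)
# for each observation, pool the returns from the two windows around t
# and take a single sd, computed separately within each RIC group
df$sd <- ave(df$return, df$ric, FUN = function(r) {
  pre  <- slide(r, ~.x, .before = 30, .after = -6)  # returns in t-30 .. t-6
  post <- slide(r, ~.x, .before = -6, .after = 30)  # returns in t+6 .. t+30
  mapply(function(a, b) sd(c(a, b)), pre, post)
})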

R Winsorizing with specific cut-off values does not work

I want to winsorize my data using the mean plus (/minus) 2 standard deviations of the variable as cut-off points. I thus want to winsorize every variable one by one.
The variable I want to winsorize in the example code I provided below has 5 outliers.
I have created two benchmarks (high and low) and have inserted them in minval and maxval.
Just to prevent misunderstandings: I have several timepoints and groups in my data frame, the grepl-part is to just get one group at one measurement point for the winsorizing.
My code so far:
library(DescTools)
benchhigh <- mean(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm = TRUE) +
  2 * sd(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm = TRUE)
benchlow <- mean(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm = TRUE) -
  2 * sd(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm = TRUE)
ds$RRSout <- Winsorize(ds$RRS[grepl('^34.*', ds$QUESTNNR)],
                       minval = benchlow, maxval = benchhigh, na.rm = TRUE)
The error I get is:
"Error in $<-.data.frame(*tmp*, RRSout, value = c(1, 1.33333333333333, :
replacement has 38 rows, data has 510"
My replacement only has 38 rows because the ^34.* group has only 38 participants. I have to winsorize the outliers per group and measurement point though...
How can I replace/winsorize the outliers of the specific participant group in the RRS variable?
Thank you very much in advance!
Your input to Winsorize() is restricted to certain observations (grepl('^34.*', ds$QUESTNNR)). You can only assign the result back to the same number of (and ideally exactly the same) rows:
ds$RRSout[grepl('^34.*', ds$QUESTNNR)] <-
  Winsorize(ds$RRS[grepl('^34.*', ds$QUESTNNR)],
            minval = benchlow, maxval = benchhigh, na.rm = TRUE)
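Since the winsorizing has to happen per group and measurement point, one way to generalize this (a sketch, assuming ds$QUESTNNR encodes the group/measurement-point combination and the same Winsorize(minval =, maxval =, na.rm =) signature used above; the helper name wins2sd is made up) is to wrap the benchmarks in a function and apply it group-wise with ave():
library(DescTools)
# winsorize one subset at its own mean +/- 2 sd (helper name is illustrative)
wins2sd <- function(v) {
  m <- mean(v, na.rm = TRUE)
  s <- sd(v, na.rm = TRUE)
  Winsorize(v, minval = m - 2 * s, maxval = m + 2 * s, na.rm = TRUE)
}
ds$RRSout <- ave(ds$RRS, ds$QUESTNNR, FUN = wins2sd)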

R: select a subset based on probability

I'm new to R. I have a normal distribution.
n <- rnorm(1000, mean=10, sd=2)
As an exercise I'd like to create a subset based on a probability curve derived from the values. E.g. for values < 5 I'd like to keep a random 25% of entries, for values > 15 I'd like to keep a random 75% of entries, and for values between 5 and 15 I'd like to linearly interpolate the probability of selection between 25% and 75%. It seems like what I want is the "sample" command and its "prob" option, but I'm not clear on the syntax.
For the first two subsets we may use
idx1 <- n < 5
ss1 <- n[idx1][sample(sum(idx1), sum(idx1) * 0.25)]
idx2 <- n > 15
ss2 <- n[idx2][sample(sum(idx2), sum(idx2) * 0.75)]
while for the third one,
idx3 <- !idx1 & !idx2
probs <- (n[idx3] - 5) / 10 * (0.75 - 0.25) + 0.25
ss3 <- n[idx3][sapply(probs, function(p) sample(c(TRUE, FALSE), 1, prob = c(p, 1 - p)))]
where probs are linearly interpolated probabilities for each element of n[idx3]. Then using sapply we draw TRUE (take) or FALSE (don't take) for each of those elements.
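The per-element sapply draw can also be written as one vectorized comparison against uniform draws; a small equivalent sketch, not part of the original answer:
# keep each element of n[idx3] with its own interpolated probability
ss3 <- n[idx3][runif(sum(idx3)) < probs]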
The prob option in sample() gives probability weights for the elements of the vector being sampled:
https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample
So if I understood the question right, what you want is to sample only 25% of the values < 5, 75% of the values > 15, and so on.
Then you have to use the n parameter. As the documentation says:
n
a positive number, the number of items to choose from. See 'Details.'
There you could input the % of the sample you want multiplied by the length of the sample vector.
For your last subset you could draw a uniform variable running from .25 to .75 with runif().
Hope this helps!
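A minimal sketch of the fraction-times-length idea this answer describes, reusing n from the question:
# take a random 25% of the values below 5 via sample()'s size argument
low <- n[n < 5]
ss_low <- sample(low, size = round(0.25 * length(low)))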

Change certain values of a vector based on mean and standard deviation of its subsets

I am trying to inject anomalies into a dataset, essentially changing certain values based on a condition. The dataset has 10 subsets. The condition is that anomalies should sit 2.8-3 times the standard deviation of their segment away from that subset's mean. For that, I divide the dataset into 10 equal parts, calculate the mean and standard deviation of each subset, and change certain values by placing them roughly 3 subset standard deviations away from the subset mean. The code looks like the following:
set.seed(1)
x <- rnorm(sample(1:35000, 32000, replace = F), 0, 1)  # create dataset
y <- cumsum(x)  # cumulative sum of dataset
j = 1
for (i in c(1:10)) {
  seg = y[j:j+3000]              # name each subset seg
  m = mean(seg)                  # mean of subset
  print(m)
  s = sd(seg)                    # standard deviation of subset
  print(s)
  o_data = sample(j:j+3000, 10)  # draw random numbers from j to j + 3000
  print(o_data)
  y[o_data] = m + runif(10, min = 2.8, max = 3) * s  # values = mean + 2.8-3 * sd
  print(y[o_data])
  j = j + 3000                   # increment j
  print(j)
}
The error I get is that the standard deviation is NA, so I am not able to set the values.
What other approach could accomplish the task? Essentially, I have to inject anomalies that are 2.8-3 standard deviations away from the rolling mean.
You have a simple error in your code. When you wrote
seg = y[j:j+3000], I believe that you meant seg = y[j:(j+3000)].
Similarly, o_data = sample(j:j+3000, 10) should be o_data = sample(j:(j+3000), 10).
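The reason is operator precedence: : binds tighter than +, so j:j+3000 evaluates as (j:j) + 3000, a single number. A quick illustration:
j <- 1
j:j + 3000    # (1:1) + 3000 -> 3001, a single index
j:(j + 3000)  # 1 2 3 ... 3001, the intended range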

Find adjacent rows that match condition

I have a financial time series in R (currently an xts object, but I'm also looking into tibble right now).
How do I find the probability of 2 adjacent rows matching a condition?
For example, I want to know the probability of 2 consecutive days having a higher-than-mean/median value. I know I can lag the previous day's value into the next row, which would allow me to get this statistic, but that seems very cumbersome and inflexible.
Is there a better way to get this done?
xts sample data:
library(xts)
foo <- xts(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days"))
What's the probability of 2 consecutive days having a higher than median value?
You can create a new column that calls out which values are higher than the median, and then take only those that are consecutive and higher:
> library(tibble); library(data.table)
> foo <- as_tibble(data.table(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days")))
Step 1
Create column to find those that are higher than median
> foo$higher_than_median <- foo$x > median(foo$x)
Step 2
Compare that column using diff:
Take it only when both days are consecutively higher or lower: c(0, diff(foo$higher_than_median)) == 0
Then add the condition that they must both be higher: foo$higher_than_median == TRUE
Full expression:
foo$both_higher <- c(0, diff(foo$higher_than_median)) == 0 & foo$higher_than_median == TRUE
Step 3
To find the probability, take the mean of foo$both_higher:
mean(foo$both_higher)
[1] 0.1428571
Here is a pure xts solution.
How do you define the median? There are several ways.
In an online time-series use case, like computing a moving average, you can compute the median over a fixed lookback window (shown below), or from the origin up to now (an anchored window calculation). Either way, you won't know future values in the median computation beyond the current time step (avoid look-ahead bias):
library(xts)
library(TTR)
x <- rep(c(1, 1, 5, 1, 5, 5, 1, 5, 5, 5), 10)
y <- xts(x = x, seq(as.Date("2016-01-01"), length = length(x), by = "days"),
         dimnames = list(NULL, "x"))
# Avoid look-ahead bias in an online time series application by computing
# the median over a rolling fixed time window:
nMedLookback <- 5
y$med <- runPercentRank(y[, "x"], n = nMedLookback)
y$isAboveMed <- y$med > 0.5
nSum <- 2
y$runSum2 <- runSum(y$isAboveMed, n = nSum)
z <- na.omit(y)
prob <- sum(z[, "runSum2"] >= nSum) / NROW(z)
The case where your median is over the entire data set is obviously a much easier modification of this.
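For completeness, a sketch of that whole-sample-median variant (it looks ahead, so it is only appropriate for offline analysis):
xv <- as.numeric(y$x)
above <- as.numeric(xv > median(xv))  # 1 if the day is above the full-sample median
r <- runSum(above, n = 2)             # leading values are NA until the window fills
prob2 <- sum(r >= 2, na.rm = TRUE) / sum(!is.na(r))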
