Find adjacent rows that match condition - r

I have a financial time series in R (currently an xts object, but I'm also looking into tibble right now).
How do I find the probability of 2 adjacent rows matching a condition?
For example, I want to know the probability of 2 consecutive days having a higher than mean/median value. I know I can lag the previous day's value into the next row, which would allow me to get this statistic, but that seems very cumbersome and inflexible.
Is there a better way to get this done?
xts sample data:
foo <- xts(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days"))
What's the probability of 2 consecutive days having a higher than median value?

You can create a new column that calls out which are higher than the median, and then take only those that are consecutive and higher
> foo <- as_tibble(data.table(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days")))
Step 1
Create column to find those that are higher than median
> foo$higher_than_median <- foo$x > median(foo$x)
Step 2
Compare that column with itself using diff.
Take it only when both are consecutively higher or lower: c(0, diff(foo$higher_than_median)) == 0
Then add the condition that they must both be higher: foo$higher_than_median == TRUE
Full Expression:
foo$both_higher <- c(0, diff(foo$higher_than_median)) == 0 & foo$higher_than_median == TRUE
Step 3
To find probability take the mean of foo$both_higher
mean(foo$both_higher)
[1] 0.1428571
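As a cross-check on the steps above, here is a minimal base R sketch that counts adjacent pairs directly instead of rows. The helper name prob_consecutive is made up for illustration. Note that it divides by the number of windows (6 pairs for 7 days), so it gives 1/6 rather than the 1/7 obtained above by averaging over all 7 rows; which denominator you want depends on how you define the probability.

```r
# Sample data from the question
x <- c(1, 1, 5, 1, 5, 5, 1)
above <- x > median(x)

# Hypothetical helper: share of length-k windows in which every value meets the condition
prob_consecutive <- function(flag, k = 2) {
  windows <- embed(as.numeric(flag), k)  # one row per window of k adjacent values
  mean(rowSums(windows) == k)            # a window counts only if all k flags are TRUE
}

prob_consecutive(above, 2)  # 1 of 6 adjacent pairs
prob_consecutive(above, 3)  # no run of 3 in this sample
```

Changing k extends the same idea to longer runs without adding more lagged columns.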

Here is a pure xts solution.
How do you define the median? There are several ways.
In an online time series setting, such as computing a moving average, you can compute the median over a fixed lookback window (shown below) or from the origin up to now (an anchored window calculation). Either way, the median at each time step must not use values beyond the current time step, which avoids look-ahead bias:
library(xts)
library(TTR)
x <- rep(c(1,1,5,1,5,5,1, 5, 5, 5), 10)
y <- xts(x = x, seq(as.Date("2016-01-01"), length = length(x), by = "days"), dimnames = list(NULL, "x"))
# Avoid look ahead bias in an online time series application by computing the median over a rolling fixed time window:
nMedLookback <- 5
y$med <- runPercentRank(y[, "x"], n = nMedLookback)
y$isAboveMed <- y$med > 0.5
nSum <- 2
y$runSum2 <- runSum(y$isAboveMed, n = nSum)
z <- na.omit(y)
prob <- sum(z[,"runSum2"] >= nSum) / NROW(z)
The case where your median is over the entire data set is obviously a much easier modification of this.

Related

Speed optimization - calculating weighted column in data.table with distance matrix

I am trying to apply weights to a numeric vector in a data.table. The weights come from the Euclidean distances between each point and all the other points. The closer two points are, the higher the weight assigned to them; if the distance between two points is greater than a threshold, the weight is 0; and the weight assigned to the distance between a point and itself is 10000.
I can illustrate with the code below:
library(data.table)
library(dplyr)
library(tictoc)
set.seed(42)
df <- data.table(
LAT = rnorm(500, 42),
LONG = rnorm(500, -72),
points = rnorm(500)
)
df2 <- copy(df) # for new solution
d <- as.matrix(dist(df[, .(LAT, LONG)])) # compute distance matrix
# function to calculate the weights
func <- function(j, cols, threshold) {
N <- which(d[j, ] <= threshold) # find points whose distances are below threshold
K <- (1 / (d[j, N] ^ 2)) # calculate weights, which are inversely proportional to distance, lower distance, higher the weight
K[which(d[j, N] == 0)] <- 10000 # weight to itself is 10000
return((K%*% as.matrix(df[N, ..cols])) / sum(K)) # compute weighted point for 1 row
}
tic('Old way')
# compute the weighted point calculation for every row
result <- tapply(1:nrow(df), 1:nrow(df), function(i) func(i, 'points', 0.5))
df[, 'weighted_points' := result] # assign the results back to data.table
toc()
The current function works well for a small number of points, but it takes much longer to compute weighted points for about 220K rows.
I have come up with another solution that cuts down the time in half, but I think it can still be improved.
d <- as.matrix(dist(df[, .(LAT, LONG)]))
df2[, 'weighted_points' := points]
dt <- as.data.table(d)
cols <- names(dt)
tic('New way')
# compute the weights
dt[, (cols) := lapply(.SD, function(x) case_when(
x == 0 ~ 10000,
x <= 0.5 ~ 1 / (x^2),
TRUE ~ 0)), .SDcols = cols]
# compute the weighted point for each row
for (i in 1L:nrow(dt)) {
set(df2, i, 'weighted_points', value = sum(df2[['points']] * dt[[i]]) / sum(dt[[i]]))
}
toc()
round(sum(df$weighted_points - df2$weighted_points), 0)
The time differences may be small for this small data set, but I have tested the time using the real data set and the new way is quite a bit faster.
My question is, how can I make the new approach to be even faster? I know I am using case_when from dplyr which could make things slower in exchange for readability, but are there other things that I am not doing correctly in data.table that could help make it faster?
From a data analyst's point of view, I think you could speed this up by approximating which points count as close.
I once worked with NCDC station locations and tried to find the closest stations to each station; because there were so many stations, it was time-consuming. My idea was to compute the distances between the coordinates, rank them, and set a threshold on how many stations to use in the actual weight calculation.
For example, after ranking, take the 50 closest points and give them their respective weights; all other points just get a weight of 0.
Hope this helps
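A minimal base R sketch of that ranking idea, under stated assumptions: the data are made up to resemble the question's, the helper name knn_weighted_point is hypothetical, and the full distance matrix is still computed first, so this only trims the weighting step rather than the distance step.

```r
set.seed(42)
n <- 200
lat <- rnorm(n, 42)
long <- rnorm(n, -72)
points <- rnorm(n)
d <- as.matrix(dist(cbind(lat, long)))  # full pairwise distance matrix

# Hypothetical helper: weighted value for row i using only its k closest points
knn_weighted_point <- function(i, k, self_weight = 10000) {
  idx <- order(d[i, ])[seq_len(k)]  # k smallest distances; the point itself comes first
  w <- ifelse(d[i, idx] == 0, self_weight, 1 / d[i, idx]^2)
  sum(w * points[idx]) / sum(w)     # points outside the k closest implicitly get weight 0
}

weighted_points <- vapply(seq_len(n), knn_weighted_point, numeric(1), k = 50)
```

Unlike the distance threshold in the question, a rank cutoff gives every row the same number of contributing neighbours, so the results will not match the thresholded version exactly.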

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
data.frame(
id = 1:nPerson,
BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
)
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 observations but hours for thousands of observations). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > error. I use xb = xbsmall+4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know to solve the problem of computation speed).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
# Determine the range of monthly ages to loop over for this person
AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
# Loop over the monthly ages for this person and determine the outcome
for(t in (AgeMonthMin+1):AgeMonthMax){
# Indicator for whether Outcome was 1 last month
panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
# xb = xbsmall + 4.47 if Outcome was 1 last month
# Otherwise, xb = xbsmall
panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
# Outcome == 1 if xb > -error
panel$Outcome[panel$id==i & panel$AgeMonths==t] =
ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
}
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computer time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::group_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
}
x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
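A further variation on the same split-apply-combine idea, sketched under assumptions: the recursion runs on plain numeric vectors rather than on data frame columns, which avoids repeated data frame indexing inside the loop. The helper name simulate_outcome and the small made-up panel are for illustration only, and the rows are assumed to be sorted by monthly age within each person.

```r
# Hypothetical helper: run the recursion for one person on plain vectors.
# Assumes the inputs are already ordered by AgeMonths.
simulate_outcome <- function(xbsmall, error, bump = 4.47) {
  n <- length(xbsmall)
  outcome <- numeric(n)  # the first month stays 0, as in the question
  for (t in seq_len(n)[-1]) {
    # xb = xbsmall + 4.47 if last month's outcome was 1, otherwise xbsmall
    outcome[t] <- as.numeric(xbsmall[t] + bump * outcome[t - 1] > -error[t])
  }
  outcome
}

# Small made-up unbalanced panel: 3 people with 5, 8 and 6 months of data
set.seed(1)
panel <- data.frame(id = rep(1:3, times = c(5, 8, 6)))
panel$xbsmall <- rnorm(nrow(panel), mean = -0.3, sd = 0.45)
panel$error <- rnorm(nrow(panel))

# Row positions per person, then fill Outcome one person at a time
rows_by_id <- split(seq_len(nrow(panel)), panel$id)
panel$Outcome <- 0
for (rows in rows_by_id) {
  panel$Outcome[rows] <- simulate_outcome(panel$xbsmall[rows], panel$error[rows])
}
```

The loop over months remains because each outcome depends on the previous one, but each iteration now touches only numeric vectors.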

calculate lag from phase arrows with biwavelet in r

I'm trying to understand the cross wavelet function in R, but can't figure out how to convert the phase lag arrows to a time lag with the biwavelet package. For example:
require(gamair)
data(cairo)
data_1 <- within(cairo, Date <- as.Date(paste(year, month, day.of.month, sep = "-")))
data_1 <- data_1[,c('Date','temp')]
data_2 <- data_1
# add a lag
n <- nrow(data_1)
nn <- n - 49
data_1 <- data_1[1:nn,]
data_2 <- data_2[50:nrow(data_2),]
data_2[,1] <- data_1[,1]
require(biwavelet)
d1 <- data_1[,c('Date','temp')]
d2 <- data_2[,c('Date','temp')]
xt1 <- xwt(d1,d2)
plot(xt1, plot.phase = TRUE)
These are my two time series. Both are identical but one is lagging the other. The arrows suggest a phase angle of 45 degrees - apparently pointing down or up means 90 degrees (in or out of phase) so my interpretation is that I'm looking at a lag of 45 degrees.
How would I now convert this to a time lag i.e. how would I calculate the time lag between these signals?
I've read online that this can only be done for a specific wavelength (which I presume means for a certain period?). So, given that we're interested in a period of 365, and the time step between the signals is one day, how would one calculate the time lag?
So I believe you're asking how to determine the lag time between two time series (in this case you artificially added a lag of 49 days).
I'm not aware of any packages that make this a one-step process, but since we are essentially dealing with sine waves, one option would be to "zero out" the waves and then find the zero crossing points. You could then calculate the average distance between the zero crossing points of wave 1 and wave 2. If you know the time step between measurements, you can easily calculate the lag time (in this case the time between measurement steps is one day).
Here is the code I used to accomplish this:
#smooth the data to get rid of the noise that would introduce excess zero crossings
#subtracted 70 from the temp to introduce a "zero" approximately in the middle of the wave
spline1 <- smooth.spline(data_1$Date, y = (data_1$temp - 70), df = 30)
plot(spline1)
#add the smoothed y back into the original data just in case you need it
data_1$temp_smoothed <- spline1$y
#do the same for wave 2
spline2 <- smooth.spline(data_2$Date, y = (data_2$temp - 70), df = 30)
plot(spline2)
data_2$temp_smoothed <- spline2$y
#function for finding zero crossing points, adapted from the msProcess package
#(the package's helper calls are replaced with base R equivalents so it runs on its own)
zeroCross <- function(x, slope="positive")
{
stopifnot(is.numeric(x), is.character(slope), length(slope) == 1)
slope <- match.arg(tolower(slope), c("positive","negative"))
#positions just after a sign change in the requested direction
ipost <- if (slope == "negative") sort(which(c(x, 0) < 0 & c(0, x) > 0)) else
sort(which(c(x, 0) > 0 & c(0, x) < 0))
#pick whichever of the two neighbouring samples is closer to zero
offset <- apply(matrix(abs(x[c(ipost-1, ipost)]), nrow=2, byrow=TRUE), MARGIN=2, order)[1,] - 2
ipost + offset
}
#find zero crossing points for the two waves
zcross1 <- zeroCross(data_1$temp_smoothed, slope = 'positive')
length(zcross1)
[1] 10
zcross2 <- zeroCross(data_2$temp_smoothed, slope = 'positive')
length(zcross2)
[1] 11
#join the two vectors as a data.frame (using only the first 10 crossing points for wave2 to avoid any issues of mismatched lengths)
zcrossings <- as.data.frame(cbind(zcross1, zcross2[1:10]))
#calculate the mean of the crossing point differences
mean(zcrossings$zcross1 - zcrossings$V2)
[1] 49
I'm sure there are more elegant ways of going about this, but it should get you the information that you need.
In my case, for a semidiurnal tidal wave, 90 degrees equals about 3 hours (90*12.5 hours/360 = 3.125 hours), where 12.5 hours is the semidiurnal period. So 45 degrees equals 45*12.5/360 = 1.56 hours.
Thus in your case, with a period of 365 days:
90 degrees -> 90*365/360 = 91.25 days.
45 degrees -> 45*365/360 = 45.625 days.
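The arithmetic above can be wrapped in two small helpers. This is only a sketch: the function names are made up, and the result is in whatever unit the period is given in (days for a 365-day period sampled daily).

```r
# Hypothetical helpers: convert between a phase angle (degrees) and a time lag
phase_to_lag <- function(phase_deg, period) phase_deg / 360 * period
lag_to_phase <- function(lag, period) lag / period * 360

phase_to_lag(90, 12.5)  # 3.125 (hours, semidiurnal tide)
phase_to_lag(45, 365)   # 45.625 (days, annual cycle)
lag_to_phase(49, 365)   # about 48.3 degrees
```

The last line runs the conversion the other way: the 49-day lag built into the question's data corresponds to roughly 48 degrees at the annual period, which is consistent with arrows that look like about 45 degrees.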
My understanding is as follows:
For there to be a simple cause-and-effect relationship between the phenomena recorded in the time series, we would expect the oscillations to be phase-locked (Grinsted 2004); so the period at which you find the "in phase" arrow (--->) indicates the lag between the signals.
See the simulated examples with different distances between the cause-and-effect phenomena; observe that the greater the distance, the greater the period at which the "in phase" arrow occurs in the cross wavelet transform.
Nonlinear Processes in Geophysics (2004) 11: 561–566 SRef-ID: 1607-7946/npg/2004-11-561
See the example here

Select the most dissimilar individual using cluster analysis [duplicate]

I want to cluster my data to say 5 clusters, then we need to select 50 individuals with most dissimilar relationship from all the data. That means if cluster one contains 100, two contains 200, three contains 400, four contains 200, and five 100, I have to select 5 from the first cluster + 10 from the second cluster + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done so far is cluster the data and rank the individuals within each cluster, then export the result to Excel and go from there …
That has become a problem since my data has become really big.
I would appreciate any help or suggestions on how to do this in R.
I'm not sure if this is exactly what you are looking for, but maybe it helps:
mydata<-matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100) # some id for identification
# cluster objects and calculate dissimilarity matrix
cl <- cutree(hclust(
sim <- dist(mydata, diag = TRUE, upper=TRUE)), 5)
# combine results, take sum to aggregate dissimilarity
res <- data.frame(id=rownames(mydata),
cluster=cl, dis_sim=rowSums(as.matrix(sim)))
# order, lowest overall dissimilarity will be first
res <- res[order(res$dis_sim), ]
# split object
reslist <- split(res, f=res$cluster)
## takes the three items with the highest overall dissimilarity in each cluster
lapply(reslist, tail, n=3)
## returns the ids with the highest overall dissimilarity, top 20% of each cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x)*p)), p=0.2)
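To get a fixed total (such as the 50 individuals in the question) split across clusters in proportion to cluster size, something like the following might work. It is a sketch under assumptions: allocate_quota is a made-up helper, dissimilarity is scored as the row sum of the distance matrix as above, and because of rounding the quotas are not guaranteed to add up to exactly the requested total for arbitrary cluster sizes.

```r
# Hypothetical helper: number of individuals to take from each cluster
allocate_quota <- function(sizes, total) round(total * sizes / sum(sizes))

# Cluster sizes from the question: gives 5, 10, 20, 10, 5
allocate_quota(c(100, 200, 400, 200, 100), 50)

set.seed(1)
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

dis <- dist(mydata)
cl <- cutree(hclust(dis), k = 5)    # named vector: id -> cluster number
score <- rowSums(as.matrix(dis))    # overall dissimilarity per individual

total <- 20                         # total to select from this small example
sizes <- table(cl)
quota <- setNames(allocate_quota(as.vector(sizes), total), names(sizes))

# Within each cluster, keep the ids with the highest overall dissimilarity
picked <- unlist(lapply(names(quota), function(k) {
  members <- names(cl)[cl == as.integer(k)]
  head(members[order(score[members], decreasing = TRUE)], quota[[k]])
}))
```

Here picked holds the selected ids, so mydata[picked, ] returns the chosen rows without leaving R.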
Regarding your comment, find the code below.
Please note that the code can be improved in terms of elegance and efficiency.
I also posted this as a second answer, because otherwise it would have been too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper=TRUE))
cent <- tapply(mydata,
list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])
# add up cluster number and data and split by cluster
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5))
newdfl <- split(newdf, f=newdf$cluster)
# add centroids and drop cluster info
totaldf <- lapply(1:5,
function(i, li, cen) rbind(cen[i, ], li[[i]][ , -11]),
li=newdfl, cen=cent)
# calculate the distance of each member to its centroid and sort them
dist_to_cent <- lapply(totaldf, function(x)
sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ]))
dist_to_cent
for calculation of centroids out of hclust see R-Mailinglist

For-loop error and min 2.5% and max 97.5% percentile in R

I have a data set with 41 rows and 21 columns. In df, each row represents energy data for a 15-minute interval of the day (from 10am-8pm). Each column represents a selected day within a month.
I need to figure out load variability (standard deviation/ mean) b/w two lines in each column using the following equation.
I.e, between the 1st and 2nd; 1st, 2nd and 3rd; 1st-4th; 1st-5th; etc. element of each column.
I keep getting NA values in "lv" and wonder why. The end result, lv should have a dataframe of 41x21, same as df but with load variability.
Also, how do I also get 2.5 and 97.5 percentiles within the loop other than load variability?
x <- df[1:41,1:21]
#calculate load variability
count = 0
i=1{
for (i in 1:41){
count = count+1
mean = sum (x[1:l,])/count
diff = ((x-mean)^2)
lv= sqrt((diff/(count+1)-1)/mean)
i = i+1
}
}
lv
lv ends up with null values (NA).
If you want to calculate sd/mean for each row, try:
apply(x, 1, sd)/rowMeans(x)
If you want the 2.5% and 97.5% percentiles for each row, try:
apply(x, 1, quantile, c(.025, 0.975))
Ok, after several tries (and some help from this question), I finally have:
cumul_loading <- function(x, leave.nan=FALSE){
ind_na <- !is.na(x)
nn <- cumsum(ind_na)
x[!ind_na] <- 0
cumul_mean <- cumsum(x) / nn
cumul_sd <- sqrt(cumsum(x^2) / (nn-1) - (cumsum(x))^2/(nn-1)/nn)
if(leave.nan) return(cumul_sd / cumul_mean) else
return((cumul_sd / cumul_mean)[-1])
}
It may still have a few bugs (such as what to do with NAs), but it should now work with an apply function. The leave.nan argument optionally keeps the NaN produced when nn - 1 = 0, i.e. for the first element.
apply(x, 2, cumul_loading)
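For checking the cumulative-sum version, and for the percentile part of the question, a slower but more literal sketch recomputes each statistic on the first k elements of a column. The function names and the made-up 41 x 21 matrix are assumptions for illustration; the first element has no standard deviation, so the variability result has 40 rows rather than 41.

```r
# Hypothetical helper: sd/mean of elements 1..k, for k = 2..length(x)
cumul_cv <- function(x) {
  vapply(seq_along(x)[-1], function(k) sd(x[1:k]) / mean(x[1:k]), numeric(1))
}

# Hypothetical helper: running percentiles of elements 1..k, one row per k
cumul_quantiles <- function(x, probs = c(0.025, 0.975)) {
  q <- vapply(seq_along(x),
              function(k) quantile(x[1:k], probs, names = FALSE),
              numeric(length(probs)))
  t(q)
}

set.seed(1)
x <- matrix(runif(41 * 21, 50, 100), nrow = 41)  # made-up stand-in for the energy data

lv <- apply(x, 2, cumul_cv)          # 40 x 21 matrix of load variability
q_day1 <- cumul_quantiles(x[, 1])    # 41 x 2 matrix: 2.5% and 97.5% for the first day
```

This recomputes from scratch at every step, so it is quadratic in the column length; for 41 rows that is negligible, and it makes a convenient reference for the faster version.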
