Plotting frequency against sequential durations in R - time stamp transformation

I can't figure out how to plot frequency against sequential durations in R.
I have several processes (X, Y, Z, ...) that consist of multiple steps (a, b, c, d, ...). Each process has a different sequence of steps, so X may consist of aabaacabcd, Y may consist of acdad, etc.
There are time stamps for the beginning and end of every step, which I used to calculate each step's duration.
The data frame looks something like this:
ProcessID Step Start End Seconds
X a 30.09.2022 14:08 30.09.2022 14:11 165
X d 30.09.2022 14:11 30.09.2022 14:24 756
Y a 29.09.2022 11:55 29.09.2022 13:16 4876
Y c 29.09.2022 13:16 29.09.2022 14:26 4199
Y d 29.09.2022 14:26 30.09.2022 17:17 96654
There are around 1000 processes in the data frame. Each process begins with step 'a', which may also appear again later in the sequence. Step 'd' occurs only once per sequence and closes the process.
I plotted frequency for 'seconds' for each step in a ridgeline plot, which looks like this:
[ridgeline plot]
This, however, treats 'Seconds' as a single data point rather than a duration, and it also doesn't take into account that steps b, c, d, etc. usually (though not always) follow step a.
I would like to plot frequency against a timeline that considers the frequency of steps a/b/c/d at a certain time point. For example, at 20 seconds on the timeline, 32 processes are in step a, 49 processes are in step c, etc.
I struggle with defining the durations. 'Seconds' only gives me one time point, not the sequence of steps. 'Start' and 'End' do define sequence and duration, but they are not normalized so that all processes start at 0, so I don't know how to compare them.
Can anyone help?
Thanks in advance
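One possible way to approach the normalization (a minimal sketch, assuming a data frame df with the columns shown above and the dplyr package; the names elapsed_start/elapsed_end and the example time point t0 are just illustrative): cumulate Seconds within each process so that every process starts at elapsed time 0, then count which step each process is in at a chosen elapsed time.
library(dplyr)
df_norm <- df %>%
  mutate(Start = as.POSIXct(Start, format = "%d.%m.%Y %H:%M")) %>%   # parse the time stamps
  group_by(ProcessID) %>%
  arrange(Start, .by_group = TRUE) %>%                               # steps in chronological order per process
  mutate(elapsed_end = cumsum(Seconds),                              # seconds since process start, at step end
         elapsed_start = elapsed_end - Seconds) %>%                  # ... and at step begin
  ungroup()
# how many processes are in which step at, say, 20 seconds into their own timeline
t0 <- 20
df_norm %>% filter(elapsed_start <= t0, t0 < elapsed_end) %>% count(Step)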

Related

calculating date specific correlation in r (leading to a potential time series)

I have a dataset that looks somewhat like this (the actual dataset is ~150000 lines with additional columns of fluff information such as company name, etc.):
Date return1 return2 rank
01/31/2008 0.05434 0.23413 3
01/31/2008 0.03423 0.43423 4
01/31/2008 0.65277 0.23423 1
01/31/2008 0.02342 0.47234 4
02/31/2008 0.01463 0.01231 4
02/31/2008 0.13456 0.52552 2
02/31/2008 0.34534 0.36663 1
02/31/2008 0.00324 0.56463 3
...
12/31/2015 0.21234 0.02333 2
12/31/2015 0.07245 0.87234 1
12/31/2015 0.47282 0.12998 1
12/31/2015 0.99022 0.03445 2
Basically I need to calculate the date-specific correlation between return1 and rank (so the correlation on 01/31/2008, 02/31/2008, and so on). I know I can split the data using the split() function, but I am unsure how to get the date-specific correlation. The real data has about 260 entries per date and around 68 dates, so manually subsetting the original table and performing calculations is time consuming and, more importantly, susceptible to error.
My ultimate goal is to create a time series of the correlations on different dates.
Thank you in advance!
I had this same problem earlier, except I wasn't calculating correlation. What I would do is:
library(dplyr)
a %>% group_by(Date) %>% summarise(Correlation = cor(return1, rank))
And this will provide, for each date, a correlation value between return1 and rank. Don't forget that you can specify what kind of correlation you would like (e.g. Spearman).
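For example, here is a minimal sketch of the Spearman variant, and of plotting the per-date correlations as a simple time series afterwards (assuming the data frame is called a, as above, and that Date parses with the %m/%d/%Y format):
library(dplyr)
cors <- a %>%
  group_by(Date) %>%
  summarise(Correlation = cor(return1, rank, method = "spearman"))
# plot the correlations over time
plot(as.Date(cors$Date, format = "%m/%d/%Y"), cors$Correlation,
     type = "l", xlab = "Date", ylab = "Spearman correlation of return1 and rank")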

Mismatching drawdown calculations

I would like to ask you to help clarify the following question, which is extremely important to me, since a major part of my master's thesis relies on implementing the calculations in the following example properly.
I have a list of financial time series, which look like this (AUDUSD example):
Open High Low Last
1992-05-18 0.7571 0.7600 0.7565 0.7598
1992-05-19 0.7594 0.7595 0.7570 0.7573
1992-05-20 0.7569 0.7570 0.7548 0.7562
1992-05-21 0.7558 0.7590 0.7540 0.7570
1992-05-22 0.7574 0.7585 0.7555 0.7576
1992-05-25 0.7575 0.7598 0.7568 0.7582
From this data I calculate log returns for the column Last to obtain something like this
Last
1992-05-19 -0.0032957646
1992-05-20 -0.0014535847
1992-05-21 0.0010573620
1992-05-22 0.0007922884
Now I want to calculate the drawdowns in the time series presented above, which I achieve by using (from the package PerformanceAnalytics)
ddStats <- drawdownsStats(timeSeries(AUDUSDLgRetLast[,1], rownames(AUDUSDLgRetLast)))
which results in the following output (here are just the first 5 lines, but it returns every single drawdown, including one-day-long ones):
From Trough To Depth Length ToTrough Recovery
1 1996-12-03 2001-04-02 2007-07-13 -0.4298531511 2766 1127 1639
2 2008-07-16 2008-10-27 2011-04-08 -0.4003839141 713 74 639
3 2011-07-28 2014-01-24 2014-05-13 -0.2254426369 730 652 NA
4 1992-06-09 1993-10-04 1994-12-06 -0.1609854215 650 344 306
5 2007-07-26 2007-08-16 2007-09-28 -0.1037999707 47 16 31
Now, the problem is the following: the depth of the worst drawdown (according to the output above) is -0.4298, whereas if I do the following calculation "by hand" I obtain
(AUDUSD[as.character(ddStats[1,1]),4]-AUDUSD[as.character(ddStats[1,2]),4])/(AUDUSD[as.character(ddStats[1,1]),4])
[1] 0.399373
To make things clearer, these are the two lines from the AUDUSD data frame for the From and Trough dates:
AUDUSD[as.character(ddStats[1,1]),]
Open High Low Last
1996-12-03 0.8161 0.8167 0.7845 0.7975
AUDUSD[as.character(ddStats[1,2]),]
Open High Low Last
2001-04-02 0.4858 0.4887 0.4773 0.479
Also, the other drawdown depths do not agree with the calculations "by hand". What am I missing? How come these two numbers, which should be the same, differ by a substantial amount?
I have tried replicating the drawdown via:
cumsum(rets) - cummax(cumsum(rets))
where rets is the vector of your log returns.
For some reason, when I calculate drawdowns of, say, less than 20%, I get the same results as table.Drawdowns() and drawdownsStats(), but when there is a large drawdown, say over 35%, the max drawdown figures begin to diverge between the calculations. More specifically, table.Drawdowns() and drawdownsStats() are overstated (at least from what I have noticed). I do not know why this is so.
Perhaps what might help is to use a confidence interval for large drawdowns (those over 35%) based on the standard error of the drawdown. I would use 0.4298531511/sqrt(1127), i.e. the max drawdown divided by sqrt(days to trough). This yields a +/- of 0.01280437, or a drawdown of 0.4169956 to 0.4426044 respectively, and the lower bound of 0.4169956 is much closer to your "by hand" calculation of 0.399373. Hope it helps.
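As an additional hedged note (an observation about units, not a statement about what drawdownsStats() does internally): a drawdown computed on cumulative log returns lives in log space, and converting it back with 1 - exp(.) gives a price-based drawdown, which is exactly what the "by hand" calculation above produces.
# minimal sketch, assuming 'rets' is the vector of AUDUSD daily log returns
cumRet <- cumsum(rets)
ddLog <- cumRet - cummax(cumRet)      # drawdowns in cumulative log-return space
1 - exp(min(ddLog))                   # deepest drawdown converted back to a price-based figure
# since the deepest log drawdown equals log(P_trough/P_peak), this reproduces
# 1 - 0.479/0.7975 = 0.399373 for the From/Trough prices shown above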

Using signal spikes to partition data set in R

I have an example data set that looks like this:
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
This data set has both positive and negative spikes in it that I would like to use as markers for calculating means within the data. I would define the start of a spike as any number that is 40% greater or less than the number preceding it. A spike ends when the value jumps back by more than 40%. So ideally I would like to locate each spike in the data set and take the mean of the 5 data points immediately following the last number of the spike.
As can be seen, a spike can last for up to 5 data points. The rule for averaging I would like to follow is:
Start averaging after the last recorded spike data point, not after the first spike data point. So if a spike lasts for three data points, begin averaging after the third spiked data point.
So the ideal output would look something like this:
1= 12.2
2= 11.8
3= 12.4
4= 12.2
5= 12.6
The first spike is Ho[4], followed by the five numbers (12, 11, 12, 12, 14) for a mean of 12.2.
The next spike in the data is the points Ho[13:14] (25, 25), followed by the set of five numbers (12, 11, 13, 12, 11) for an average of 11.8.
And so on for the rest of the sequence.
It kind of seems like you're actually defining a spike to mean differing from the "medium" values in the dataset, as opposed to differing from the previous value. I've operationalized this by defining a spike as being any data more than 40% above or below the median value (which is 12 for the sample data posted). Then you can use the nifty rle function to get at your averages:
r <- rle(Ho < median(Ho)*0.6 | Ho > median(Ho)*1.4)   # TRUE runs are the spikes (>40% away from the median)
run.begin <- cumsum(r$lengths)[r$values] + 1           # first data point after each spike run
run.end <- pmin(run.begin + 4, length(Ho))             # ...up to 5 points, without running past the end
apply(cbind(run.begin, run.end), 1, function(x) mean(Ho[x[1]:x[2]]))
# [1] 12.2 11.8 12.4 12.2 12.6
So here is some code that seems to get the same result as yours.
#Data
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
#plot(seq_along(Ho), Ho)
#find changes
diffs <- tail(Ho, -1) / head(Ho, -1)
idxs <- which(diffs > 1.4 | diffs < .6) + 1
starts <- idxs[seq(2, length(idxs), by = 2)]
ends <- ifelse(starts + 4 <= length(Ho), starts + 4, length(Ho))
#find means
mapply(function(a, b) mean(Ho[a:b]), starts, ends)

Sample exactly four maintaining almost equal sample distances

I am trying to generate appointment times for yearly scheduled visits. The available days are days = 1:365 and the first appointment should be chosen randomly: first = sample(days, 1, replace = F)
Now, given the first appointment, I want to generate 3 more appointments in the space 1:365, so that there are exactly 4 appointments in total, spaced as equally as possible.
I have tried
point<-sort(c(first-1:5*364/4,first+1:5*364/4 ));point<-point[point>0 & point<365]
but it does not always give me 4 appointments. I eventually ran this many times and picked only the samples with 4 appointments, but I wanted to ask if there is a more elegant way to get exactly 4 points as equally distanced as possible.
I was thinking of equal spacing (around 91 days between appointments) in a year starting at the first appointment... Essentially one appointment per quarter of the year.
# Find how many days in a quarter of the year
quarter = floor(365/4)
days = 1:365
first = sample(days, 1)
all = c(first, first + (1:3)*quarter)
all[all > 365] = all[all > 365] - 365   # wrap appointments that fall past day 365 back into the year
all
sort(all)
Is this what you're looking for?
set.seed(1) # for reproducible example ONLY - you need to take this out.
first <- sample(1:365,1)
points <- c(first+(0:3)*(365-first)/4)
points
# [1] 97 164 231 298
Another way uses
points <- c(first+(0:3)*(365-first)/3)
This creates 4 points equally spaced on [first, 365], but the last point will always be 365.
The reason your code gives unexpected results is that you use first-1:5*364/4. This creates points prior to first, some of which can be < 0; you then exclude those with point[point>0 ...].
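As a further hedged sketch along the lines of the first answer (one appointment per quarter, wrapping around the year), the same idea can be written with modular arithmetic; this always returns exactly 4 days in 1:365, spaced 91 or 92 days apart (91 = floor(365/4) is the assumed quarter length):
days <- 1:365
first <- sample(days, 1)
sort((first + (0:3) * 91 - 1) %% 365 + 1)   # exactly 4 appointments, ~91 days apart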

R: Very basic example for averaging a Time Series

I am quite new to R and have studied several posts and websites about time series and moving averages, but I simply cannot find a useful hint for averaging over a specific period of time.
My data is a table read via read.csv with a date and time in one column and several other columns with values. The time steps in the data are not constant: sometimes 5 minutes, sometimes 2 hours, e.g.
2014-01-25 14:50:00, 4, 8
2014-01-25 14:55:00, 3, 7
2014-01-25 15:00:00, 1, 4
2014-01-25 15:20:24, 12, 34
2014-01-25 17:19:00, 150, 225
2014-01-25 19:00:00, 300, 400
2014-01-25 21:00:00, NA, NA
2014-01-25 23:19:00, 312, 405
So I am looking for an averaging/plotting approach that
1. calculates the data average over arbitrary intervals like 30 minutes, 1 hour, 1 day, etc. Finer steps should be aggregated and coarser steps should be disaggregated.
2. (removed, since it is trivial to get the value per hour from a time series D averaged over X hours via D/X.)
3. does not take data flagged as NA into account. The function should not interpolate/smooth through NA gaps, and a line plot should not connect the points across an NA gap.
I already tried
aggregate(list(value1=data$value1,value2=data$value2), list(time=cut(data$time, "1 hour")), sum)
but this does not fulfill needs 1 and 3 and is not able to disaggregate 2-hourly data steps.
Answering point 3: plot automatically skips NA values and breaks the line.
Try this example:
plot(c(1:5,NA,NA,6:10),t='l')
Now, if you want to 'smooth' or average over time intervals purely for graphical purposes, it's probably easiest to start by separating your data at each line with an NA and then doing a spline or other smoothing operation on each subsection separately.
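For the aggregation part (needs 1 and 3, not the disaggregation of coarser steps), here is a minimal sketch building on the aggregate() attempt above; it assumes the data frame is called data with columns time (already parsed as POSIXct), value1 and value2:
# mean per 30-minute bin; bins containing only NA come out as NaN, which plot() skips
binned <- aggregate(
  list(value1 = data$value1, value2 = data$value2),
  list(time = cut(data$time, "30 mins")),
  FUN = mean, na.rm = TRUE
)
plot(as.POSIXct(as.character(binned$time)), binned$value1, type = "l")   # line breaks at NaN/NA gaps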
