Mismatching drawdown calculations in R

I would like to ask you to clarify the following question, which is extremely important to me, since a major part of my master's thesis relies on properly implementing the calculation shown in the following example.
I have a list of financial time series, which look like this (AUDUSD example):
Open High Low Last
1992-05-18 0.7571 0.7600 0.7565 0.7598
1992-05-19 0.7594 0.7595 0.7570 0.7573
1992-05-20 0.7569 0.7570 0.7548 0.7562
1992-05-21 0.7558 0.7590 0.7540 0.7570
1992-05-22 0.7574 0.7585 0.7555 0.7576
1992-05-25 0.7575 0.7598 0.7568 0.7582
From this data I calculate log returns on the column Last, to obtain something like this:
Last
1992-05-19 -0.0032957646
1992-05-20 -0.0014535847
1992-05-21 0.0010573620
1992-05-22 0.0007922884
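For reference, a minimal sketch of that computation (assuming AUDUSD is a data.frame with dates as row names and a Last column):
AUDUSDLgRetLast <- data.frame(
  Last = diff(log(AUDUSD$Last)),      # log return: log(P_t / P_{t-1})
  row.names = rownames(AUDUSD)[-1]    # stamp each return with day t
)
head(AUDUSDLgRetLast)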
Now I want to calculate the drawdowns in the time series presented above, which I achieve by using drawdownsStats() (from the fPortfolio package):
ddStats <- drawdownsStats(timeSeries(AUDUSDLgRetLast[,1], rownames(AUDUSDLgRetLast)))
which results in the following output (here are just the first 5 lines, but it returns every single drawdown, including one-day-long ones):
From Trough To Depth Length ToTrough Recovery
1 1996-12-03 2001-04-02 2007-07-13 -0.4298531511 2766 1127 1639
2 2008-07-16 2008-10-27 2011-04-08 -0.4003839141 713 74 639
3 2011-07-28 2014-01-24 2014-05-13 -0.2254426369 730 652 NA
4 1992-06-09 1993-10-04 1994-12-06 -0.1609854215 650 344 306
5 2007-07-26 2007-08-16 2007-09-28 -0.1037999707 47 16 31
Now, the problem is the following: the depth of the worst drawdown (according to the output above) is -0.4298, whereas if I do the following calculation "by hand" I obtain
(AUDUSD[as.character(ddStats[1,1]),4]-AUDUSD[as.character(ddStats[1,2]),4])/(AUDUSD[as.character(ddStats[1,1]),4])
[1] 0.399373
To make things clearer, these are the two rows from the AUDUSD data frame for the From and Trough dates:
AUDUSD[as.character(ddStats[1,1]),]
Open High Low Last
1996-12-03 0.8161 0.8167 0.7845 0.7975
AUDUSD[as.character(ddStats[1,2]),]
Open High Low Last
2001-04-02 0.4858 0.4887 0.4773 0.479
Also, the other drawdown depths do not agree with the calculations "by hand". What am I missing? How come these two numbers, which should be the same, differ by such a substantial amount?
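For reference, here is the same peak-to-trough move expressed both ways (a plain arithmetic check on the two Last prices shown above); note that neither reading of the reported depth reproduces the by-hand figure:
(0.7975 - 0.4790) / 0.7975   # simple drawdown:      0.399373
log(0.4790 / 0.7975)         # log-return drawdown: -0.5098
1 - exp(-0.4298531511)       # reported depth read as a log move: 0.3494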

I have tried replicating the drawdown via:
cumsum(rets) - cummax(cumsum(rets))
where rets is the vector of your log returns.
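Here is a runnable version of that replication (my sketch; rets is assumed to be a plain numeric vector of log returns):
equity <- cumsum(rets)             # cumulative log-return path
dd <- equity - cummax(equity)      # drawdown at each point, in log units
min(dd)                            # maximum drawdown (most negative value)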
For some reason, when I calculate drawdowns that are, say, less than 20%, I get the same results as table.Drawdowns() and drawdownsStats(), but when there is a large drawdown, say over 35%, the maximum drawdown begins to diverge between the calculations. More specifically, table.Drawdowns() and drawdownsStats() are overstated (at least from what I noticed). I do not know why this is so, but perhaps it would help to put a confidence interval around large drawdowns (those over 35%) using a standard error of the drawdown. I would use 0.4298531511/sqrt(1127), i.e. the max drawdown divided by the square root of the days to trough. This yields a +/- of 0.01280437, or a drawdown of 0.4170488 to 0.4426575 respectively, and the lower end of the interval, 0.4170488, is much closer to your "by-hand" calculation of 0.399373. Hope it helps.

Related

What is the appropriate frequency and start/end date in R for ts()?

I have the following dataset, which I want to turn into a time-series object for auto.arima forecasting:
head(df)
total_score return
1539 121.77
1074 422.18
901 -229.79
843 96.30
1101 -55.25
961 -48.28
This data set consists of 13,104 rows, each row representing the sentiment score of tweets and the BTC return on an hourly basis; i.e., the first row is 2021-01-01 00:00, the second row is 2021-01-01 01:00, and so on up to 2022-06-30 23:00. I have looked up how many hours fit in this range and that is 13,103 (the number of hour steps between the first and last timestamps, which means 13,104 hourly observations inclusive). How can I set up my ts() call so that I can use the result for forecasting with auto.arima in R?
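A minimal sketch of one reasonable setup (assuming df$return holds the hourly returns): with hourly data a natural seasonal period for ts() is 24, i.e. one day, and the start can be encoded as (day 1, hour 1).
library(forecast)
y <- ts(df$return, frequency = 24, start = c(1, 1))
fit <- auto.arima(y)
forecast(fit, h = 24)   # e.g. forecast the next day, hour by hour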
Moreover, I understand that auto.arima assumes homoscedastic errors, whereas I need it to work with heteroscedastic errors. I have also read that I might use a GARCH model for this. However, if my auto.arima fit results in an order of (2,0,0), does this mean that my GARCH model should be a (0,0,2)?
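On the GARCH question, the usual division of labour (a sketch assuming the rugarch package) is that the ARMA order found by auto.arima(), here (2,0), goes into the mean equation, while the GARCH order of the variance equation is chosen separately (GARCH(1,1) is a common default), rather than by reversing the ARIMA order:
library(rugarch)
spec <- ugarchspec(
  mean.model     = list(armaOrder = c(2, 0), include.mean = TRUE),
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1))
)
fit <- ugarchfit(spec, data = df$return)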
PS: I am still confused about why my data seems to be stationary; I was under the impression that cryptocurrencies are most likely NOT stationary, and the returns as well. But that is something for another time.

Anomaly Detection - Correlated Variables

I am working on an 'Anomaly' detection assignment in R. My dataset has around 30,000 records, of which around 200 are anomalous. It has around 30 columns, all quantitative. Some of the variables are highly correlated (~0.9). By anomaly I mean that some records have unusually high or low values in some column(s), while in others the correlated variables do not behave as expected. The example below will give some idea.
Suppose vehicle speed and heart rate are highly positively correlated. Usually vehicle speed varies between 40 and 60, while heart rate stays between 55 and 70.
time_s steering vehicle.speed running.distance heart_rate
0 -0.011734953 40 0.251867414 58
0.01 -0.011734953 50 0.251936555 61
0.02 -0.011734953 60 0.252005577 62
0.03 -0.011734953 60 0.252074778 90
0.04 -0.011734953 40 0.252074778 65
Here we have two types of anomalies. The 4th record has an exceptionally high value for heart_rate, while the 5th record seems okay if we look at individual columns. But since heart_rate increases with speed, we would expect a lower heart rate for the 5th record, whereas we actually have a higher value.
I could identify the column-level anomalies using box plots etc., but find it hard to identify the second type. Somewhere I read about PCA-based anomaly detection, but I couldn't find an implementation of it in R.
Will you please help me with PCA-based anomaly detection in R for this scenario? My Google search was mainly turning up time-series-related material, which is not what I am looking for.
Note: There is a similar implementation in Microsoft Azure Machine Learning - 'PCA Based Anomaly Detection for Credit Risk' - which does the job, but I want to know the logic behind it and replicate the same in R.
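For what it's worth, here is a sketch of the reconstruction-error logic such modules typically use (my interpretation, not the Azure module's actual code; X stands for the numeric matrix of the ~30 columns):
pca <- prcomp(X, center = TRUE, scale. = TRUE)
k <- 5                                     # components to keep; tune via a scree plot
scores <- pca$x[, 1:k, drop = FALSE]
recon <- scores %*% t(pca$rotation[, 1:k, drop = FALSE])   # back-projection
Xs <- scale(X, center = pca$center, scale = pca$scale)
# records whose correlated variables disagree are poorly captured by the top
# components, so their reconstruction error is large
err <- rowSums((Xs - recon)^2)
head(order(err, decreasing = TRUE), 200)   # indices of the most anomalous records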

Performing a test on counts

In R I have a dataset data1 that contains game and times. There are 6 games, and times simply tells us how many times a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similarly, for data2 we get
game times
1 744
2 989
...
6 711
And sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2, but I do not think that information is relevant.
I want to compare the two datasets and see whether there is a statistically significant difference, and which game "causes" that difference.
What test should I use to compare these? I don't think Pearson's chisq.test is the right choice in this case; maybe wilcox.test is the right one to choose?
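For count data like this, a chi-squared test of homogeneity on the 2 x 6 table is in fact the standard tool; wilcox.test compares samples of measurements, not count tables. A minimal sketch, assuming the two data frames above:
counts <- rbind(data1$times, data2$times)   # 2 x 6 contingency table
test <- chisq.test(counts)
test$p.value   # do the two datasets distribute plays over the games differently?
test$stdres    # cells with |standardized residual| > 2 point to the games
               # driving the difference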

Building a model in R based on historic data

I'm working with daily data frames covering one month. The main variables in each data frame look like this:
Date_Heure Fonction Presence
2015-09-02 08:01:28 Acce 1
2015-09-02 08:15:56 check-out 0
2015-09-02 08:16:23 Alarme 0
The idea is to learn, over 15 days, the habits of the owner in his home: the rate of his presence in each time slot, and when he activates the alarm of the home. After building this history, we want to predict, for the next day (the 16th), when he will activate his alarm, based on the information we calculated. So basically the history should be turned into a MODEL, but I cannot figure out how to do this.
Well, what I have in hand are my inputs (I suppose): the percentage of presence in the two half-hour slots before and after activating the alarm. And my output should normally be the time at which the alarm is activated. So what I have is like this:
Presence 1st Time slot Presence 2nd Time slot Date_Heure
0.87 0 2015-09-02 08:16:23
0.91 0 2015-09-03 08:19:02
0.85 0 2015-09-04 08:18:11
I have the mean of the activation hour and of the percentage of presence in the two time slots.
Every new day will be added to the history (to the model), so the history grows by one day each day and the parameters change of course: the mean, the max and the min of my indicators. It's like we are doing "statistical learning".
So if you have any ideas, any clue to help me get started, it would be helpful, because when I searched, everything seemed very vague to me, and I just need the right key to begin.
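One minimal way to get started (a sketch under assumptions: the column names and the minutes-past-midnight encoding of the activation time are mine, and the second presence slot is dropped because it is constant in the sample shown) is to regress the activation time on the presence rate and refit each day as the history grows by one row:
history <- data.frame(
  presence_slot1 = c(0.87, 0.91, 0.85),   # presence rate, 1st half-hour slot
  alarm_minutes  = c(496, 499, 498)       # 08:16, 08:19, 08:18 past midnight
)
fit <- lm(alarm_minutes ~ presence_slot1, data = history)
predict(fit, newdata = data.frame(presence_slot1 = 0.89))   # 16th-day forecast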

Using signal spikes to partition data set in R

I have an example data set that looks like this:
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
This data set has both positive and negative spikes in it that I would like to use as markers for calculating means within the data. I would define the start of a spike as any number that is 40% greater or less than the number preceding it. A spike ends when the value jumps back by more than 40%. So ideally I would like to locate each spike in the data set and take the mean of the 5 data points immediately following the last number of the spike.
As can be seen, a spike can last for up to 5 data points. The rule for averaging I would like to follow is:
Start averaging after the last recorded spike data point, not after the first spike data point. So if a spike lasts for three data points, begin averaging after the third spiked data point.
So the ideal output would look something like this:
1= 12.2
2= 11.8
3= 12.4
4= 12.2
5= 12.6
With the first spike being Ho(4), followed by the next 5 numbers (12,11,12,12,14) for a mean of 12.2.
The next spike in the data is at data points Ho(13,14) (25,25), followed by the set of 5 numbers (12,11,13,12,11) for an average of 11.8.
And so on for the rest of the sequence.
It kind of seems like you're actually defining a spike to mean differing from the "middle" values in the dataset, as opposed to differing from the previous value. I've operationalized this by defining a spike as any data point more than 40% above or below the median value (which is 12 for the sample data posted). Then you can use the nifty rle function to get at your averages:
r <- rle(Ho >= median(Ho)*0.6 & Ho <= median(Ho)*1.4)
# start and end index of each run of non-spike values (use at most 5 points)
run.begin <- cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
run.end <- run.begin + pmin(4, r$lengths[r$values] - 1)
# drop a run that starts at index 1, since it is not preceded by a spike
keep <- run.begin > 1
apply(cbind(run.begin, run.end)[keep, ], 1, function(x) mean(Ho[x[1]:x[2]]))
# [1] 12.2 11.8 12.4 12.2 12.6
So here is some other code that seems to get the same result as yours.
#Data
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
#plot(seq_along(Ho), Ho)
# find changes of more than 40% relative to the previous point
diffs <- tail(Ho, -1) / head(Ho, -1)
chg <- which(diffs > 1.4 | diffs < 0.6) + 1
# a change only ends a spike if it lands back in the "normal" band around the
# median; otherwise fluctuations inside a long spike (e.g. 2 -> 3 -> 2) are
# mistaken for recoveries and desynchronize the pairing of starts and ends
normal <- abs(Ho - median(Ho)) / median(Ho) <= 0.4
starts <- chg[normal[chg] & !normal[chg - 1]]
ends <- pmin(starts + 4, length(Ho))
# find the means of the 5 points following each spike
mapply(function(a, b) mean(Ho[a:b]), starts, ends)
# [1] 12.2 11.8 12.4 12.2 12.6
