Backtesting in R for time series - r

I am new to backtesting, i.e. using an algorithm to assess whether something works based on historical data. Since I am new to it, I am trying to keep things simple in order to understand it. So far I have understood the following. Say I have a time series data set:
date = seq(as.Date("2000/1/1"),as.Date("2001/1/31"), by = "day")
n = length(date);n
class(date)
y = rnorm(n)
data = data.frame(date,y)
I will keep the first 366 days (2000 is a leap year) as the in-sample period, do something with them, and then extend the sample by one observation at a time over the following month. Am I correct here?
If so, I define the in-sample and out-of-sample periods:
T = dim(data)[1];T
outofsampleperiod = 31
initialsample = T-outofsampleperiod
For example, I want to find the empirical quantile of the data at alpha = 0.01.
pre = data[1:initialsample,]
ypre = pre$y
quantile(ypre,0.01)
1%
-2.50478
Now the difficult part for me is updating this in a for loop in R.
I want to add one observation at a time and recompute the empirical quantile at alpha = 0.01, print all of them, and check whether each one is greater than the in-sample quantile obtained above.
for (i in 1:outofsampleperiod){
qnew = quantile(1:(initialsample+i-1),0.01)
print(qnew)
}

You can create a little function that gets the quantile of column y, over rows 1 to i of a frame df like this:
func <- function(i,df) quantile(df[1:i,"y"],.01)
Then apply this function to each row of data
data$qnew = sapply(1:nrow(data),func,df=data)
Output (last six rows)
> tail(data)
date y qnew
392 2001-01-26 1.3505147 -2.253655
393 2001-01-27 -0.5096840 -2.253337
394 2001-01-28 -0.6865489 -2.253019
395 2001-01-29 1.0881961 -2.252701
396 2001-01-30 0.1754646 -2.252383
397 2001-01-31 0.5929567 -2.252065
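If you only need the out-of-sample month and the comparison against the in-sample quantile, the loop from the question can be sketched as below. This is a minimal sketch, not the only way to do it: the seed and the names q_in, result and above_in_sample are assumptions introduced here for illustration, and the window is assumed to expand (all past observations are kept).

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
date <- seq(as.Date("2000/1/1"), as.Date("2001/1/31"), by = "day")
y <- rnorm(length(date))
data <- data.frame(date, y)

outofsampleperiod <- 31
initialsample <- nrow(data) - outofsampleperiod

# 1% empirical quantile of the in-sample period
q_in <- quantile(data$y[1:initialsample], 0.01)

# Expanding window: at step i, use the first initialsample + i observations
qnew <- vapply(
  seq_len(outofsampleperiod),
  function(i) unname(quantile(data$y[1:(initialsample + i)], 0.01)),
  numeric(1)
)

# One row per out-of-sample day, with the check against the in-sample quantile
result <- data.frame(
  date = data$date[initialsample + seq_len(outofsampleperiod)],
  qnew = qnew,
  above_in_sample = qnew > unname(q_in)
)
print(result)
```

Note that the quantile is taken over the y values, not over the row indices as in the loop from the question.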

Related

Argument is not numeric

I would like to visualize the number of people infected with COVID-19, but I am unable to compute the mortality rate per 100,000 population for each prefecture because the number of deaths is not treated as a numeric value in the division.
What I want to achieve
I want to compute "covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100" by setting the data type of "covid19j_20200613$deaths" to num.
Error message.
Error in covid19j_20200613$deaths/covid19j_20200613$POP2019:
Argument of binary operator is not numeric
Source code in question.
library(spdep)
library(sf)
library(spatstat)
library(tidyverse)
library(ggplot2)
needs::prioritize(magrittr)
covid19j <- read.csv("https://raw.githubusercontent.com/kaz-ogiwara/covid19/master/data/prefectures.csv",
header=TRUE)
# Below is an example for June 13, 2020.
# Month and date may be changed
covid19j_20200613 <- dplyr::filter(covid19j,
year==2020,
month==6,
date==13)
covid19j_20200613$CODE <- 1:47
covid19j_20200613[is.na(covid19j_20200613)] <- 0
pop19 <- read.csv("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/pop2019.csv", header=TRUE)
covid19j_20200613 <- dplyr::inner_join(covid19j_20200613, pop19,
by = c("CODE" = "CODE"))
# Load Japan prefecture administrative boundary data
jpn_pref <- sf::st_read("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/jpn_pref.shp")
# Data and concatenation
jpn_pref_cov19 <- dplyr::inner_join(jpn_pref, covid19j_20200613, by=c("PREF_CODE"="CODE"))
ggplot2::ggplot(data = jpn_pref_cov19) +
geom_sf(aes(fill=testedPositive)) +
scale_fill_distiller(palette="RdYlGn") +
theme_bw() +
labs(title = "Tested Positive for COVID-19 (2020/06/13)")
# Mortality rate per 100,000 population
# Population number in units of 1000
as.numeric(covid19j_20200613$deaths)
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
Data files.
prefectures.csv
https://docs.google.com/spreadsheets/d/11C2vVo-jdRJoFEP4vAGxgy_AEq7pUrlre-i-zQVYDd4/edit?usp=sharing
pop2019.csv
https://docs.google.com/spreadsheets/d/1CbEX7BADutUPUQijM0wuKUZFq2UUt-jlWVQ1ipzs348/edit?usp=sharing
What we tried
I tried putting "as.numeric(covid19j_20200613$deaths)" before the calculation to set the number of deaths to type num, but I got the same error message during the calculation.
Additional information (FW/tool versions, etc.)
iMac M1 2021, R 4.2.0
as.numeric() does not modify the column in place; it returns a converted copy and leaves the original untouched.
So when you run as.numeric(covid19j_20200613$deaths), R prints the deaths column converted to numeric, but the column stored in the data frame stays character.
So if you want to coerce the data type, you also need to reassign the result:
covid19j_20200613$deaths <- as.numeric(covid19j_20200613$deaths)
covid19j_20200613$POP2019 <- as.numeric(covid19j_20200613$POP2019)
# Now you can do calculations
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
It's easier to read if you use mutate from dplyr:
covid19j_20200613 <- covid19j_20200613 |>
mutate(
deaths = as.numeric(deaths),
POP2019 = as.numeric(POP2019),
deaths_rate = deaths / POP2019 * 100
)
Result
deaths POP2019 deaths_rate
1 91 5250 1.73333333
2 1 1246 0.08025682
3 0 1227 0.00000000
4 1 2306 0.04336513
5 0 966 0.00000000
PS: your question is really difficult to follow! There is a lot of stuff that we don't actually need to answer it, so that makes it harder for us to identify where the issue is. For example, all the data import, the join, the ggplot...
When writing a question, please only include the minimal elements that lead to a problem. In your case, we only needed a sample dataset with the deaths and POP2019 columns, and the two lines of code that you tried to fix at the end.
If you look at str(covid19j) you'll see that the deaths column is a character column containing a lot of blanks. You need to figure out the structure of that column to read it properly.
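A minimal sketch of what the blanks do during conversion, using a hypothetical toy data frame df in place of the real data (the values are made up for illustration). Treating blanks as zero deaths is an assumption here, mirroring the is.na replacement in the question; whether that is appropriate depends on what a blank means in the source file.

```r
# Hypothetical toy data standing in for the real columns
df <- data.frame(
  deaths  = c("91", "1", "", "1", "0"),   # character column with a blank
  POP2019 = c(5250, 1246, 1227, 2306, 966),
  stringsAsFactors = FALSE
)
str(df$deaths)                       # chr: arithmetic on this column fails

df$deaths <- as.numeric(df$deaths)   # reassign; the blank becomes NA
df$deaths[is.na(df$deaths)] <- 0     # assumption: blank means no deaths

# POP2019 is in thousands, so * 100 gives deaths per 100,000 population
df$deaths_rate <- df$deaths / df$POP2019 * 100
print(df)
```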

Creating synthetic user data in R; issues with generating user identifier variable

I am trying to generate synthetic user event log data for demonstration purposes. It's going to be very basic feature-wise (about 4 variables altogether). Here is what I have so far:-
require(wakefield)#for generating the Status variable
require(dplyr)
require(stringi)
set.seed(1)
#data<-data.frame()
eventDate<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
eventDate<-sample(rep(eventDate,each=1000),replace = T)
u <- runif(length(eventDate), 0, 60*60*12) # "noise" to add or subtract from some timepoint
eventDateTime<-as.POSIXlt(u, origin = paste(eventDate,"00:00:00"))
eventDateTime
eventOutcome<-r_sample_factor(x = c("Passed", "Failed", "Ongoing","Unknown"), n=length(eventDate))
eventOutcome
data<-data.frame(eventDate,eventDateTime,eventOutcome)
head(data)
# eventDate eventDateTime eventOutcome
#1 2015-01-25 2015-01-25 04:48:47 Unknown
#2 2015-05-05 2015-05-05 09:35:22 Unknown
#3 2015-11-28 2015-11-28 08:56:16 Failed
#4 2015-05-23 2015-05-23 02:24:52 Ongoing
#5 2015-01-26 2015-01-26 07:43:52 Failed
#6 2015-10-22 2015-10-22 03:07:14 Passed
There are about 365000 rows of data here. All that is left to do is add a user identifier variable. I would like some users to have only a handful of interactions recorded in the data set, whereas others may have dozens/hundreds/thousands of interactions (I would like this dataset to have that kind of variability).
I can create a user identifier variable no problem:-
UserId<-stri_rand_strings(1300,6)
But if I add this to the data, it doesn't work:-
data$UserId<-stri_rand_strings(1300,6)
Error in `$<-.data.frame`(`*tmp*`, UserId, value = c("k3QlXs", "gK3eBa", :
replacement has 1300 rows, data has 365000
So my request is two-fold: how can I assign a user identifier variable to this kind of data, and how can I make it vary so that some users have one or a few interactions whilst others appear frequently (i.e. dozens, hundreds, thousands of times)?
Thank you in advance for any help, always appreciated :)
One option might be to generate a UUID for each user. A UUID looks like this:
c7f2dde5-dfeb-45cb-9720-87b23effd45d
If you use a good UUID generator, then it is almost impossible to generate the same UUID more than once. R has a uuid package which can be used:
library(uuid)
user_uuid <- UUIDgenerate()
It sounds like the distribution of user IDs you are looking for is something akin to a power law or Pareto distribution. This is a way to generate a vector of relative probabilities of sampling each user ID that follows a power law, then sample the user IDs following those relative probabilities.
Here I've used the function Pareto::rPareto to generate the relative probabilities. You can vary the parameters, especially alpha, to get different distributions. As alpha increases the distribution will become more even. I also supplied a truncation parameter so that you will not get too many users with unrealistically high numbers of purchases (In this example the most frequent ID has ~3700 cases).
Note you are not guaranteed to sample each of the 1300 user IDs at least once with this particular way of doing it.
Next I use the function uuid::UUIDgenerate to generate 1300 unique strings. Finally I use sample to sample the unique IDs with replacement as many times as you have rows in your data frame. I plot the frequencies of the different IDs in the sample. Again, modify the parameters if this distribution is not what you are looking for.
library(Pareto)
library(uuid)
library(ggplot2)
set.seed(1)
n_users <- 1300
n_rows <- 365000
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
id_sample <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
# Check the distribution of IDs
ggplot(as.data.frame(table(id_sample)), aes(x = Freq)) +
geom_histogram() +
scale_x_log10()
You are generating 1300 strings, but your data has 365000 rows. You can use sample to draw from those 1300 strings at random, with replacement:
library(stringi)
data$UserId <- sample(stri_rand_strings(1300,6), nrow(data), replace = TRUE)
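If every user must appear at least once and the counts should still be heavy-tailed, one possible base R sketch is to place each ID once and fill the remaining rows by weighted sampling. Everything named here (user_ids, weights, alpha and its value) is an assumption for illustration; the weights come from inverse-transform sampling of a Pareto-type distribution rather than from the Pareto package.

```r
set.seed(1)
n_users <- 1300
n_rows  <- 365000

# Hypothetical IDs; any vector of unique strings would do
user_ids <- sprintf("user%04d", seq_len(n_users))

# Heavy-tailed sampling weights via inverse transform (assumed shape alpha)
alpha <- 1
weights <- runif(n_users)^(-1 / alpha)

# Each user once, then fill the remaining rows according to the weights
extra  <- sample(user_ids, size = n_rows - n_users, replace = TRUE, prob = weights)
UserId <- sample(c(user_ids, extra))   # shuffle so the guaranteed rows are not all first

counts <- table(UserId)
summary(as.vector(counts))
```

The result can then be attached with data$UserId <- UserId, since its length matches the number of rows.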

Apply a weighted value to a dataset in R

I have a dataset with an outcome flag which indicates whether a row has 'Good' or 'Bad' performance. My dataset has 83,337 records, of these there are only 150 'Bads' while the remainder are 'Goods'.
I'm using this dataset to get the value of information for the variables compared to the outcome flag, I use the code below for this:
all_data_IV <- smbinning.sumiv(all_data,y="OUTCOME_FLAG")
This gives a list of all variables and their IV for the outcome flag.
My problem is, I don't have enough 'Bads' to make this reliable and would therefore need to reduce the number of 'Goods' in the dataset.
I'd like to reduce the number of 'Goods' to 10,000 rows and keep 150 'Bads', but then weight up the 'Goods' so that it's a fair representation. So that the weight would be as follows:
        No. Records   Sample   Weight
Goods   83773         10000    8.37
Bads    150           150      1
In SAS if you want to apply a weight to a dataset you do something like the following:
/*CODING WEIGHT VALUE*/
DATA DEV_SAMPLE; SET DEV_SAMPLE;
IF FLAG = "DV" THEN WEIGHT_VALUE = 1;
ELSE WEIGHT_VALUE = 8.37;
/*APPLYING WEIGHT VALUE*/
PROC FREQ DATA = DEV_SAMPLE;
TABLE ARREARS;
WEIGHT WEIGHT_VALUE;
Does anyone know how to replicate this in R so I can apply a weight to the new sample?
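One way this could look in base R is sketched below, under stated assumptions: all_data here is a small hypothetical stand-in, the ARREARS levels are made up, and n_good_sample is scaled down from 10,000 so the example runs quickly. The sketch only covers the down-sampling, the weight column and a weighted frequency table via xtabs; it does not assume that smbinning.sumiv accepts a weight.

```r
set.seed(123)
# Hypothetical stand-in for all_data
all_data <- data.frame(
  OUTCOME_FLAG = c(rep("Good", 2000), rep("Bad", 15)),
  ARREARS = sample(c("0", "1-2", "3+"), 2015, replace = TRUE)
)
n_good_sample <- 200   # would be 10000 on the real data

goods <- all_data[all_data$OUTCOME_FLAG == "Good", ]
bads  <- all_data[all_data$OUTCOME_FLAG == "Bad", ]

# Down-sample the Goods and weight them back up to their original count
goods_sample <- goods[sample(nrow(goods), n_good_sample), ]
goods_sample$WEIGHT_VALUE <- nrow(goods) / n_good_sample
bads$WEIGHT_VALUE <- 1

dev_sample <- rbind(goods_sample, bads)

# Weighted frequency table: sums WEIGHT_VALUE within each ARREARS level
weighted_freq <- xtabs(WEIGHT_VALUE ~ ARREARS, data = dev_sample)
print(weighted_freq)
```

The weighted counts add up to the size of the original data set, which is a quick check that the weights were assigned correctly.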

Breaking a continuous variable into categories using dplyr and/or cut

I have a dataset that is a record of price changes, among other variables. I would like to mutate the price column into a categorical variable. I understand that the two tools of importance here in R seem to be the dplyr package and/or the cut function.
> head(btc_data)
time btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663
>dput(btc_data)
("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
), class = "data.frame", row.names = c(NA, -10023L))
The difficulty is in the categories I want to create. The categories -1,0,1 should be based upon the % change over the previous time-lag.
So for example, a 20% increase in price over the past 60 minutes would be labeled 1, otherwise 0. A 20% decrease in price over the past 60 minutes should be -1, otherwise 0.
Is this possible in R? What is the most efficient way to implement the change?
There is a similar question here and also here but these do not answer my question for two reasons-
a) I am trying to calculate % change, not simply the difference between 2 rows.
b) This calculation should be based on the max/min values for the rolling past time frame (i.e. 20% decrease in the past hour = -1, 20% increase in the past hour = 1).
Here's an easy way to do this without having to rely on the data.table package. If you want this for only 60 minute intervals, you would first need to filter btc_data for the relevant 60 minute intervals.
# make sure time is a date that can be sorted properly
btc_data$time = as.POSIXct(btc_data$time)
# sort data frame
btc_data = btc_data[order(btc_data$time),]
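# convert price to numeric (it is stored as a factor with thousands separators)
btc_data$btc_price = as.numeric(gsub(",", "", as.character(btc_data$btc_price)))
# calculate percentage change for 1 minute lag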
btc_data$perc_change = NA
btc_data$perc_change[2:nrow(btc_data)] = (btc_data$btc_price[2:nrow(btc_data)] - btc_data$btc_price[1:(nrow(btc_data)-1)])/btc_data$btc_price[1:(nrow(btc_data)-1)]
# create category column
# NOTE: first category entry will be NA
btc_data$category = ifelse(btc_data$perc_change > 0.20, 1, ifelse(btc_data$perc_change < -0.20, -1, 0))
Using the data.table package and converting btc_data to a data.table would be a much more efficient and faster way to do this. There is a learning curve to using the package, but there are great vignettes and tutorials for this package.
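For the rolling max/min variant described in the question, a minimal base R sketch could look like the following. The toy price vector, window and threshold are assumptions for illustration, and the window is counted in rows, so it only corresponds to 60 minutes if there is exactly one row per minute with no gaps (the sample data does have gaps).

```r
# Hypothetical minute-level prices: flat, then a 25% drop, then a rally
price <- c(rep(100, 60), 75, rep(75, 59), 95, rep(95, 30))
window    <- 60     # look-back in rows (assumed: one row per minute)
threshold <- 0.20

n <- length(price)
category <- rep(NA_integer_, n)   # first `window` rows have no full history
for (i in (window + 1):n) {
  past <- price[(i - window):(i - 1)]
  up   <- price[i] / min(past) - 1   # rise relative to the lowest past price
  down <- price[i] / max(past) - 1   # fall relative to the highest past price
  # If both thresholds were hit, the rise takes priority (an arbitrary choice)
  category[i] <- if (up >= threshold) 1L else if (down <= -threshold) -1L else 0L
}
table(category, useNA = "ifany")
```

With real timestamps that have gaps, the window would need to be selected by time rather than by row position.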
It's always difficult to work with percentages. You need to be aware that everything is flexible: whether you choose a difference, a running mean, a max or something else as the reference, you have at least two variables on the reference side that you must choose carefully. The same applies to the value you want to relate to that reference. Together this gives you almost infinitely many ways to calculate your percentage. This is the key to your question.
# create the data
dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075")
dat <- as.numeric(gsub(",","",dat))
# calculate the difference to the last minute
dd <- diff(dat)
# calculate the running ratio to difference of the last minutes
interval = 20
out <- NULL
for(z in interval:length(dd)){
out <- c(out, (dd[z] / mean(dd[(z-interval):z])))
}
# calculate the running ratio to price of the last minutes
out2 <- NULL
for(z in interval:length(dd)){
out2 <- c(out2, (dat[z] / mean(dat[(z-interval):z])))
}
# build categories for difference-ratio
catego <- as.vector(cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)))
catego <- c(rep(NA,interval+1), as.numeric(catego))
# plot
plot(dat, type="b", main="original price")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")
I think what you are looking for is the calculation behind the last plot (price last minute, relative to "mean" of t...). The values in this example vary between 1.0010 and 1.0025, so they are far away from the 0.8 and 1.2 you expect. You can make the difference bigger by choosing a time interval longer than 20 min; a week (10080 min) might be good, but even with such a long interval it will be difficult to get a value above 1.2. The problem is the high price level: at around 5000, a change of 10 is very small.
You also have to take into account that the prices you gave are continuously rising, so it is impossible to get a value under 1.
In this calculation I use mean() over the running window of the last minutes. I'm not sure, but I speculate that on stock markets both min() and max() are used as references over different time intervals: min() as the reference when the price is rising and max() when it is falling. All of this is possible in R.
I can't completely reproduce your example, but if I had to guess you would want to do something like this:
# Convert the factor column to numeric, stripping the thousands separators
btc_data$btc_price <- as.numeric(gsub(",", "", as.character(btc_data$btc_price)))
n <- nrow(btc_data)
# Percentage change relative to the price 60 rows earlier
pct_change <- (btc_data$btc_price[61:n] - btc_data$btc_price[1:(n - 60)]) /
btc_data$btc_price[1:(n - 60)]
# -Inf/Inf keep the breaks valid even when no change reaches +/-20%
new_category <- cut(pct_change, breaks = c(-Inf, -.2, .2, Inf),
labels = c(-1,0,1))
# The first 60 rows have no 60-row look-back, so drop them
btc_data.new <- data.frame(btc_data[61:n,], new_category)

unused arguments error using apply() in R

I get an error message when I attempt to use apply() conditional on a column of dates to return a set of coefficients.
I have a dataset (herein modified for simplicity, but reproducible):
Adataset <- data.table(Epoch = c("2007-11-15", "2007-11-16", "2007-11-17",
"2007-11-18", "2007-11-19", "2007-11-20", "2007-11-21"),
Distance = c("92336.22", "92336.23", "92336.22", "92336.20",
"92336.19", "92336.21", "92336.18"))
Adataset
Epoch Distance
1: 2007-11-15 92336.22
2: 2007-11-16 92336.23
3: 2007-11-17 92336.22
4: 2007-11-18 92336.20
5: 2007-11-19 92336.19
6: 2007-11-20 92336.21
7: 2007-11-21 92336.18
The analysis begins with establishing start and end dates:
############## Establish dates for analysis
#4.Set date for center of duration
StartDate <- "2007-11-18"
as.numeric(as.Date(StartDate)); StartDate
EndDate <- as.Date(tail(Adataset$Epoch,1)); EndDate
Then I establish time durations for analysis:
#5.Quantify duration of time window
STDuration <- 1
LTDuration <- 3
Then I write functions to regress over both durations and return the slopes:
# Write STS and LTS functions, each with following steps
#6.Define time window- from StartDate less ShortTermDuration to StartDate plus ShortTermDuration
#7.Define Short Term & Long Term datasets
#8. Run regression over dataset
my_STS_Function <- function (StartDate) {
STAhead <- as.Date(StartDate) + STDuration; STAhead
STBehind <- as.Date(StartDate) - STDuration; STBehind
STDataset <- subset(Adataset, as.Date(Epoch) >= STBehind & as.Date(Epoch)<STAhead)
STResults <- rlm( Distance ~ Epoch, data=STDataset); STResults
STSummary <- summary( STResults ); STSummary
# Return coefficient (Slope of regression)
STNum <- STResults$coefficients[2];STNum
}
my_LTS_Function <- function (StartDate) {
LTAhead <- as.Date(StartDate) + LTDuration; LTAhead
LTBehind <- as.Date(StartDate) - LTDuration; LTBehind
LTDataset <- subset(Adataset, as.Date(Epoch) >= LTBehind & as.Date(Epoch)<LTAhead)
LTResults <- rlm( Distance ~ Epoch, data=LTDataset); LTResults
LTSummary <- summary( LTResults ); LTSummary
# Return coefficient (Slope of regression)
LTNum <- LTResults$coefficients[2];LTNum
}
Then I test the function to make sure it works for a single date:
myTestResult <- my_STS_Function("2007-11-18")
It works, so I move on to apply the function over the range of dates in the dataset:
mySTSResult <- apply(Adataset, 1, my_STS_Function, seq(StartDate : EndDate))
...in which my desired result is a list or array or vector of mySTSResult (slopes) (and, subsequently, a separate list/array/vector of myLTSResults so then I can create a STSlope:LTSlope ratio over the duration), something like (mySTSResults fabricated)...
> Adataset
Epoch Distance mySTSResults
1: 2007-11-15 92336.22 3
2: 2007-11-16 92336.23 4
3: 2007-11-17 92336.22 5
4: 2007-11-18 92336.20 6
5: 2007-11-19 92336.19 7
6: 2007-11-20 92336.21 8
7: 2007-11-21 92336.18 9
Only I get this error:
Error in FUN(newX[, i], ...) : unused argument(s) (1:1185)
What is this telling me, and how do I correct it? I've done some looking and cannot find the fix.
Hopefully I've explained this sufficiently. Please let me know if you need further details.
Ok, it seems the problem is in the additional arguments to my_STS_Function as stated in your apply function call (as you have defined it with only one parameter). The date range is being passed as an additional parameter to that function, and R is complaining that it is unused (a vector of 1185 elements it seems). Are you rather trying to pull a subset of the rows restricted by date range first, then wishing to apply the my_STS_Function? I'd have to think a bit on an exact solution to that.
Sorry - I did my working out in the comments there. A possible solution is this:
dates <- seq(as.Date(StartDate), as.Date(EndDate), by = "day")
subSet <- Adataset[as.Date(Epoch) %in% dates][order(as.Date(Epoch))]
Adapted from the answer in this question:
R select rows in matrix from another vector (match, %in)
Adding this as a new answer as the previous one was getting confused. A previous commenter was correct, there are bugs in your code, but they aren't a sticking point.
My updated approach was to use seq.Date to generate the date sequence (only works if you have a data point for each day between the start and end - though you could use na.exclude as above):
dates = seq.Date(as.Date(StartDate),as.Date(EndDate),"days")
You then use this as the input to apply, with some munging of types to get things working correctly (I've done this with an anonymous (lambda) function):
mySTSResult <- apply(as.matrix(dates), 1, function(x) {class(x) <- "Date"; my_STS_Function(x)})
Then hopefully you should have a vector of the results, and you should be able to do something similar for LTS, and then manipulate that into another column in your original data frame/matrix.
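As a self-contained illustration of the same idea, here is a sketch that iterates over positions in the date vector so that each element keeps its Date class. Several things are swapped in and should be read as assumptions: a plain data.frame instead of data.table, lm instead of rlm (to avoid the MASS dependency), Epoch and Distance stored as Date and numeric, a window that includes both end points, and a hypothetical helper window_slope that takes the duration and data as arguments.

```r
# Data from the question, stored with proper Date and numeric types
Adataset <- data.frame(
  Epoch = as.Date("2007-11-15") + 0:6,
  Distance = c(92336.22, 92336.23, 92336.22, 92336.20,
               92336.19, 92336.21, 92336.18)
)
STDuration <- 1

# Slope of Distance against time (in days) within start +/- duration
window_slope <- function(start, duration, data) {
  in_window <- data$Epoch >= start - duration & data$Epoch <= start + duration
  w <- data[in_window, ]
  w$days <- as.numeric(w$Epoch - start)   # days relative to the window centre
  fit <- lm(Distance ~ days, data = w)
  unname(coef(fit)[2])
}

dates <- seq(as.Date("2007-11-18"), max(Adataset$Epoch), by = "day")

# Index into `dates` so that each element is still a Date inside the function
mySTSResult <- vapply(
  seq_along(dates),
  function(i) window_slope(dates[i], STDuration, Adataset),
  numeric(1)
)
data.frame(Epoch = dates, STSlope = mySTSResult)
```

The same helper can be called with LTDuration to get the long-term slopes, after which the ratio of the two vectors is an element-wise division.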

Resources