Creating classification features from wavelet-transformed time series - r

I'm interested in using a wavelet transform, Haar for example, to create classification variables from time series data to use in logistic regression.
A simple example: let's say I'm trying to predict payment defaults from a person's monthly expense data, and someone with consistent expenses over the most recent 4 months is a better risk than someone with increasing expenses.
If I have two sample borrowers:
Borrower A - Good - expensesA = c(100,110,95,105), default = 0
Borrower B - Bad - expensesB = c(75,100,150,200), default = 1
If I am using logistic regression, glm() in R, to create a classification model, and the R wavelets package dwt() function for a "haar" transform of the time series, what are the appropriate features to extract from the dwt() object to use in glm()?
The truncated output for Borrower A is:
tr = dwt(expensesA, filter = "haar")
tr
An object of class "dwt"
Slot "W":
$W1
[,1]
[1,] 7.071068
[2,] 7.071068
$W2
[,1]
[1,] -5
Slot "V":
$V1
[,1]
[1,] 148.4924
[2,] 141.4214
$V2
[,1]
[1,] 205
Slot "filter":
Filter Class: Daubechies
Name: HAAR
Length: 2
Level: 1
Wavelet Coefficients: 7.0711e-01 -7.0711e-01
Scaling Coefficients: 7.0711e-01 7.0711e-01
I know the Ws are the wavelet coefficients and the Vs the scaling coefficients.
Do I need to use all four W1 and V1 values as variables to properly model this, or is it okay to try just the W1s without the V1s (or vice versa)?
Is it worthwhile to try only the single W2 and V2 values as variables?
Or is it better to use a clustering algorithm and label the borrowers based on clusters?
I know it of course also depends on the data, but I'm looking for a starting point regarding best practices.
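For reference, this is roughly how I'm pulling the coefficients out and feeding them to glm() at the moment; the helper function is my own sketch, not anything from the wavelets package.
library(wavelets)

# Sketch: run a Haar DWT and flatten every W (detail) and V (smooth)
# coefficient into one named feature vector.
wavelet_features <- function(x) {
  tr <- dwt(x, filter = "haar")
  feats <- c(unlist(tr@W), unlist(tr@V))
  names(feats) <- make.names(names(feats), unique = TRUE)
  feats
}

expensesA <- c(100, 110, 95, 105)   # Borrower A, default = 0
expensesB <- c(75, 100, 150, 200)   # Borrower B, default = 1

dat <- data.frame(rbind(wavelet_features(expensesA),
                        wavelet_features(expensesB)),
                  default = c(0, 1))

# Logistic regression on the flattened coefficients; with only two borrowers
# this cannot be fit meaningfully, it just shows how the pieces connect.
fit <- glm(default ~ ., data = dat, family = binomial)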

Related

Calculate AUC for test set (keras model in R)

Is there a way (function) to calculate the AUC value for a keras model in R on a test set?
I have searched on Google but nothing showed up.
From a Keras model, we can extract the predicted values as either classes or probabilities, as follows:
Probability:
[1,] 9.913518e-01 1.087829e-02
[2,] 9.990101e-01 1.216531e-03
[3,] 9.445553e-01 6.256607e-02
[4,] 9.928864e-01 6.808311e-03
[5,] 9.993126e-01 1.028240e-03
[6,] 6.075442e-01 3.926141e-01
Class:
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Many thanks,
Ho
Generally it does not really matter which classifier (keras or not) did the prediction. All you need to estimate the AUC are two things: the predicted probabilities from some classifier and the actual category (for example, dead "yes" vs. "no"). With these you can calculate both the true positive rate and the false positive rate, so you can also make a ROC plot and estimate the AUC. You can use
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
See here for some more explanation.
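For keras output specifically, a hedged sketch of the same idea, assuming you already have a fitted model, test inputs x_test and true 0/1 labels y_test, and taking the second probability column as the positive class (as in the matrix above):
library(pROC)

# predict() on a keras model returns one probability column per class.
prob_matrix <- predict(model, x_test)
pos_prob <- prob_matrix[, 2]      # probability of the positive class

roc_obj <- roc(response = y_test, predictor = pos_prob)
auc(roc_obj)                      # area under the ROC curve
plot(roc_obj)                     # optional: draw the curve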
I'm not sure this will answer your needs as it depends on your data structure and keras output format, but have a look at the dismo package's evaluate() function. You need to set up something like this:
library(dismo)
predictors <- ...  # a stack of explanatory variables
pres_test <- ...   # a held-out subset of presence data not used to fit the model, for testing
backg_test <- ...  # true or random (background) absence data
model <- ...       # the fitted model object
AUC <- evaluate(pres_test, backg_test, model, predictors) ## you may bootstrap this step x times by randomly re-selecting 'pres_test' and 'backg_test' each time.

Why is the Trend component of this Christiano-Fitzgerald filter (mFilter's cffilter) so overfitted?

The goal: I am attempting to extract the seasonal and trend component from a time series using a band pass filter, due to issues with loess-based methods, which you can read more about here.
The data: The data is daily rainfall measurements from a 10-year span, which is highly stochastic and exhibits a clear annual seasonality. The data can be found here.
The problem: When I execute the filter, the Cycle component manifests as expected (capturing the annual seasonality) but the Trend component appears to be extremely overfitted, such that the Residuals become minuscule values and the resulting model is not useful for out-of-sample forecasting.
US1ORLA0076 <- read_csv("US1ORLA0076_cf.csv")
head(US1ORLA0076)
water_date PRCP prcp_log
<date> <dbl> <dbl>
1 2006-12-22 0.09 0.0899
2 2006-12-23 0.75 0.693
3 2006-12-24 1.63 1.26
4 2006-12-25 0.06 0.0600
5 2006-12-26 0.36 0.353
6 2006-12-27 0.63 0.594
I then apply a Christiano-Fitzgerald band pass filter (designed to pass wavelengths between half-year and full-year in size, i.e. single annual waves) using the following command from the mFilter package.
library(mFilter)
US1ORLA0076_cffilter <- cffilter(US1ORLA0076$prcp_log,pl=180,pu=365,root=FALSE,drift=FALSE,
type=c("asymmetric"),
nfix=NULL,theta=1)
This creates an S3 object containing, among other things, a vector of "trend" values and a vector of "cycle" values, like so:
head(US1ORLA0076_cffilter$trend)
[,1]
[1,] 0.1482724
[2,] 0.7501137
[3,] 1.3202868
[4,] 0.1139883
[5,] 0.4051551
[6,] 0.6453462
head(US1ORLA0076_cffilter$cycle)
[,1]
[1,] -0.05839342
[2,] -0.05696651
[3,] -0.05550995
[4,] -0.05402422
[5,] -0.05250982
[6,] -0.05096727
Plotted:
plot(US1ORLA0076_cffilter)
I am confused by this output. The cycle looks pretty much as I expected. The trend does not. Rather than being a gradually changing line representing the overall trend of the data after the seasonality has been extracted, it appears to trace the original data closely, i.e. it is very overfit.
Question: Is mfilter even defining the "trend" the same way that a function like decompose() or stl() is? If not, how should I then think about it?
Question: Have I calibrated the cffilter() incorrectly, and what can I change to improve the definition of the trend component?
The answer is "no": mFilter is not defining "trend" the same way that certain decomposition functions such as stl() do. It defines it, more generally, as "the thing from which the cycle deviates". By setting a pass band of 180-365 for the filter, I have isolated the annual cyclical component, which is subtracted from the data, leaving behind everything else; that remainder is defined here as the "trend" and can be thought of as a kind of residual.
To identify the "trend" as it is manifest in a decomposition package like stl() or decomp() using the same method, one could apply a band pass filter similar to that above, but with a period of oscillation defined between (for this data set) 366-3652, which would capture a frequency range reflecting the entire 10-year period, excluding intra-annual ones such as annual seasonality.
#Overall trend captured with similar code (and slightly different data):
US1ORLA0076_cffilter_trend <- cffilter(US1ORLA0076$prcp_log,pl=366,pu=3652,root=FALSE,drift=FALSE,
type=c("asymmetric"),
nfix=1,theta=1)
plot(US1ORLA0076_cffilter_trend)

TPR & FPR Curve for different classifiers - kNN, NaiveBayes, Decision Trees in R

I'm trying to understand and plot TPR/FPR for different types of classifiers. I'm using kNN, NaiveBayes and Decision Trees in R. With kNN I'm doing the following:
clnum <- as.vector(diabetes.trainingLabels[,1], mode = "numeric")
dpknn <- knn(train = diabetes.training, test = diabetes.testing, cl = clnum, k=11, prob = TRUE)
prob <- attr(dpknn, "prob")
tstnum <- as.vector(diabetes.testingLabels[,1], mode = "numeric")
pred_knn <- prediction(prob, tstnum)
pred_knn <- performance(pred_knn, "tpr", "fpr")
plot(pred_knn, avg= "threshold", colorize=TRUE, lwd=3, main="ROC curve for Knn=11")
where diabetes.trainingLabels[,1] is a vector of labels (class) I want to predict, diabetes.training is the training data and diabetes.testing is the testing data.
Plot looks like the following:
The values stored in the prob attribute are a numeric vector (decimals between 0 and 1). I convert the class-label factor into numbers and then I can use it with the prediction/performance functions from the ROCR library. Not 100% sure I'm doing it correctly, but at least it works.
For NaiveBayes and decision trees, though, with the prob/raw parameter specified in the predict function I don't get a single numeric vector but a matrix in which the probability for each class is given (I guess), e.g.:
diabetes.model <- naiveBayes(class ~ ., data = diabetesTrainset)
diabetes.predicted <- predict(diabetes.model, diabetesTestset, type="raw")
and diabetes.predicted is:
tested_negative tested_positive
[1,] 5.787252e-03 0.9942127
[2,] 8.433584e-01 0.1566416
[3,] 7.880800e-09 1.0000000
[4,] 7.568920e-01 0.2431080
[5,] 4.663958e-01 0.5336042
The question is how to use this to plot a ROC curve, and why for kNN I get one vector while for the other classifiers I get separate probabilities for both classes?
ROC curve
The ROC curve you provided for the knn11 classifier looks off - it is below the diagonal, indicating that your classifier assigns class labels correctly less than 50% of the time. Most likely what happened is that you provided the wrong class labels or the wrong probabilities. If in training you used class labels of 0 and 1, those same class labels should be passed to the ROC curve in the same order (without flipping 0 and 1).
Another less likely possibility is that you have a very weird dataset.
Probabilities for other classifiers
The ROC curve was developed for calling events from radar readings. Technically it is closely related to predicting an event - the probability that you correctly call the event of a plane approaching on the radar. So it uses one probability. This can be confusing when someone does classification on two classes where the "hit" probabilities are not obvious, as in your case with cases and controls.
However, any two-class classification can be framed in terms of "hits" and "misses" - you just have to select the class which you will call an "event". In your case, having diabetes might be called the event.
So from this table:
tested_negative tested_positive
[1,] 5.787252e-03 0.9942127
[2,] 8.433584e-01 0.1566416
[3,] 7.880800e-09 1.0000000
[4,] 7.568920e-01 0.2431080
[5,] 4.663958e-01 0.5336042
You would only have to select one probability - that of the event - probably "tested_positive". The other one, "tested_negative", is just 1 - tested_positive, because when the classifier thinks that a particular person has diabetes with a 79% chance, it at the same time "thinks" there is a 21% chance of that person not having diabetes. But you only need one number to express this idea, so knn returns only one, while other classifiers can return two.
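Putting that together with the same ROCR calls you used for kNN - a sketch only, assuming the raw-probability matrix above is diabetes.predicted and the true labels live in diabetesTestset$class:
library(ROCR)

# Keep only the probability of the chosen "event": tested_positive.
pos_prob <- diabetes.predicted[, "tested_positive"]

# True labels as 0/1, with tested_positive as the positive class.
truth <- as.numeric(diabetesTestset$class == "tested_positive")

pred_nb <- prediction(pos_prob, truth)
perf_nb <- performance(pred_nb, "tpr", "fpr")
plot(perf_nb, colorize = TRUE, lwd = 3, main = "ROC curve for naive Bayes")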
I don't know which library you used for the decision trees, so I cannot help with the output of that classifier.
Looks like you are doing something fundamentally wrong.
Ideally a kNN ROC curve looks like the one above. Here are a few points you can check.
Calculate the distance in your code.
Use the code below for prediction in Python:
# Predicted class
print(model_name.predict(test))
# 3 nearest neighbors
print(model_name.kneighbors(test)[1])

Sequential Neural Network

I am trying to construct a neural network as a generative model, to predict the next vector following a sequence of vectors (each vector is a distribution of real numbers of length n).
My thought was to take k previous sequences and concatenate them to have a kxn input vector. To train the model, I would have the next vector in the sequence as the output. As I am looking for non-deterministic output, I was going to use a sigmoid activation function with low gradient.
Does this procedure seem reasonable?
In the hope that it does, I tried implementing it in R using both the nnet and neuralnet libraries, but in the documentation and examples I came across, it seems the input and output vectors must be of the same length. What is the syntax to train on input/output vectors of differing lengths in either of those packages?
A sample of my input vector is:
[,1]
[1,] 0
[2,] 0
[3,] 0.6
[4,] 0.4
[5,] 0
[6,] 0
[7,] 0.06666667
[8,] 0.6666667
[9,] 0
[10,] 0.2666667
[11,] 0
[12,] 0.4
[13,] 0
[14,] 0
[15,] 0.6
And output vector:
[,1]
[1,] 0
[2,] 0
[3,] 0.8571429
[4,] 0
[5,] 0.1428571
N.B. The above sample has n=5, k=3, although my actual dataset has n~200. In both cases, the individual vectors are normalized to 1.
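For reference, this is roughly how I'm constructing the k×n inputs and next-vector targets; the helper is only a sketch and the names are my own.
# Sketch: turn a matrix of vectors (one row per time step, n columns) into
# inputs of k concatenated rows and the following row as the target.
make_windows <- function(seq_mat, k) {
  n_steps <- nrow(seq_mat)
  X <- t(sapply(k:(n_steps - 1),
                function(t) as.vector(t(seq_mat[(t - k + 1):t, ]))))
  Y <- seq_mat[(k + 1):n_steps, , drop = FALSE]
  list(inputs = X, targets = Y)   # inputs: (T-k) x (k*n), targets: (T-k) x n
}

# Toy example with n = 5, k = 3: ten time steps, each row normalized to 1.
set.seed(1)
seq_mat <- t(apply(matrix(runif(50), nrow = 10), 1, function(r) r / sum(r)))
win <- make_windows(seq_mat, k = 3)
dim(win$inputs)    # 7 x 15
dim(win$targets)   # 7 x 5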
Any help is much appreciated!
In general this is a very simple and naive approach, which probably won't yield good results. You are trying to perform a regression from a set of time series onto a time series while treating everything as simple attributes in a simple model. There have been thousands of papers on time series prediction, on representing time dependence, etc. You are facing a hard type of prediction problem here; finding a good solution will require a lot of work, and the proposed model has very little chance of working well.
From your text I deduce that you actually have a sequence of time series, and for the "time window" [t-k, t-k+1, ..., t-1] you want to predict the value (time series) at t. If this is true, then this is actually a time series prediction problem, where each attribute is a time series in its own right, and all time-series-related techniques can be used here, for example recurrent neural networks (if you really like NNs) or conditional RBMs (if you really want a non-deterministic, generative model - they have been successfully applied to time series prediction in recent years).
Now few other doubts:
As I am looking for non-deterministic output, I was going to use a sigmoid activation function
The sigmoid activation function is not non-deterministic. If you are looking for non-deterministic models you should think about architectures like RBMs, but as @Ben Allison mentioned in the comment, traditional neural networks can also be used in a probabilistic fashion with some simple modifications.
with low gradient.
What do you mean by a low gradient? That your activation function has a small slope? This will make learning problematic with simple training procedures (like the backpropagation algorithm).
Data
Your data looks like you normalized each time series so it sums to 1, which is not a popular approach to data normalization in neural networks (you usually normalize the data column-wise, so that each dimension is normalized, not each sample).
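As a rough illustration of what I mean by column-wise normalization (variable names assumed):
# Column-wise scaling: each dimension (column) gets mean 0 and sd 1,
# rather than forcing every sample (row) to sum to 1.
X_scaled <- scale(X)                 # X is the raw input matrix

# Keep the training-set centers/scales to apply the same transform to new data.
centers <- attr(X_scaled, "scaled:center")
scales <- attr(X_scaled, "scaled:scale")
X_new_scaled <- scale(X_new, center = centers, scale = scales)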
Title
Your question and model are not "sequential" and do not involve "varying vector lengths"; looking for papers about such phenomena won't lead you to an answer to your question.

Time series modelling with irregular data

I'm currently working on a pet project to forecast future base oil prices from historical base oil prices. The data is weekly but there are some periods in between where prices are missing.
I'm somewhat okay with modelling time series with complete data, but when it comes to irregular ones, the models that I've learnt may not be applicable. Do I use the xts class and proceed with ARIMA models in R the usual way?
After building a model to predict future prices, I'd like to factor in crude oil price fluctuations, diesel profit margins, car sales, economic growth and so on (multivariate?) to improve accuracy. Can someone shed some light on how I go about doing this efficiently? In my mind, it looks like a maze.
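From what I've read, one route might be a regression with ARIMA errors via arima()'s xreg argument, but I'm not sure this is the efficient way; the regressor names below are placeholders for series aligned to the same weekly index.
# Y: weekly base oil prices; the regressors must share Y's time index.
xreg_mat <- cbind(crude = crude_price, diesel = diesel_margin, cars = car_sales)
fit_x <- arima(Y, order = c(3, 2, 6), xreg = xreg_mat, method = "ML")
# Forecasting then needs future values (or scenarios) of the regressors:
# predict(fit_x, n.ahead = 12, newxreg = future_xreg)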
EDIT: Trimmed Data here: https://docs.google.com/document/d/18pt4ulTpaVWQhVKn9XJHhQjvKwNI9uQystLL4WYinrY/edit
Coding:
Mod.fit<-arima(Y,order =c(3,2,6), method ="ML")
Result:
Warning message:
In log(s2) : NaNs produced
Will this warning affect my model accuracy?
With missing data, I can't use the ACF and PACF. Is there a better way to select models? I used the AIC (Akaike's information criterion) to compare different ARIMA models with the code below. ARIMA(3,2,6) gave the smallest AIC.
Coding:
AIC <- matrix(0, 6, 6)
for (p in 0:5)
  for (q in 0:5) {
    mod.fit <- arima(Y, order = c(p, 2, q))
    AIC[p + 1, q + 1] <- mod.fit$aic
  }
AIC
AIC
Result:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1396.913 1328.481 1327.896 1328.350 1326.057 1325.063
[2,] 1343.925 1326.862 1328.321 1328.644 1325.239 1318.282
[3,] 1334.642 1328.013 1330.005 1327.304 1326.882 1314.239
[4,] 1336.393 1329.954 1324.114 1322.136 1323.567 1316.150
[5,] 1319.137 1321.030 1320.575 1321.287 1323.750 1316.815
[6,] 1321.135 1322.634 1320.115 1323.670 1325.649 1318.015
No, in general you can't just use xts and then do an ARIMA - there is an extra step required. Missing values recorded as NA are handled by arima(), and if you use method = "ML" they will be handled exactly; other methods may not get the innovations for the missing data. This works because arima() fits the ARIMA model in a state-space representation.
If the data is regular but has missing data then the above should be fine.
The reason I say don't use xts directly is just that arima() requires a univariate time series object (see ?ts) as its input. However, xts extends and inherits from zoo objects, and the zoo package provides an as.ts() method for objects of class "zoo". So if you get your data into a zoo() or xts() object, you can then coerce it to class "ts", which should include the NAs in the appropriate places, and arima() will then handle them if it can (i.e. there aren't too many missing values).
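A minimal sketch of that coercion path, assuming the weekly prices sit in a data frame df with a Date column date and a numeric column price:
library(zoo)

# Build a zoo series indexed by date, then coerce to a regular "ts";
# weeks with no observation should come through as NA.
z <- zoo(df$price, order.by = df$date)
y <- as.ts(z)

# arima() handles the NAs exactly when method = "ML".
fit <- arima(y, order = c(3, 2, 6), method = "ML")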
