I am trying to construct a neural network as a generative model, to predict the next vector following a sequence of vectors (each vector is a distribution of real numbers of length n).
My thought was to take the k previous vectors in the sequence and concatenate them to form a k*n input vector. To train the model, I would use the next vector in the sequence as the output. As I am looking for non-deterministic output, I was going to use a sigmoid activation function with low gradient.
Does this procedure seem reasonable?
In the hope it does, I tried implementing it in R using both the nnet and neuralnet libraries, but in the documentation and examples I came across, it seems the input and output vectors must be of the same length. What is the syntax to train on input/output vectors of different lengths in either of those packages?
A sample of my input vector is:
[,1]
[1,] 0
[2,] 0
[3,] 0.6
[4,] 0.4
[5,] 0
[6,] 0
[7,] 0.06666667
[8,] 0.6666667
[9,] 0
[10,] 0.2666667
[11,] 0
[12,] 0.4
[13,] 0
[14,] 0
[15,] 0.6
And output vector:
[,1]
[1,] 0
[2,] 0
[3,] 0.8571429
[4,] 0
[5,] 0.1428571
N.B. The above sample has n=5, k=3, although my actual dataset has n~200. In both cases, the individual vectors are normalized to sum to 1.
Any help is much appreciated!
In general this is a very simple and naive approach that is unlikely to yield good results. You are trying to perform regression from a set of time series onto a time series while treating everything as plain attributes in a simple model. There have been thousands of papers on time series prediction, representing time dependence, etc. You are facing a hard type of prediction problem here; finding a good solution will require a lot of work, and the proposed model has very little chance of working well.
From your text I deduce that you actually have a sequence of time series, and for the time window [t-k, t-k+1, ..., t-1] you want to predict the value (a time series) at t. If this is true, then this is actually a time series prediction problem, where each attribute is a time series of its own, and all time-series-related techniques can be used here, for example recurrent neural networks (if you really like NNs) or conditional RBMs (if you really want a non-deterministic, generative model; they have been successfully applied to time series prediction in recent years).
Now, a few other doubts:
As I am looking for non-deterministic output, I was going to use a sigmoid activation function
A sigmoid activation function is not non-deterministic. If you are looking for non-deterministic models you should think about architectures like RBMs, but as @Ben Allison mentioned in the comments, traditional neural networks can also be used in a probabilistic fashion with some simple modifications.
with low gradient.
What do you mean by low gradient? That your activation function has a small slope? This will make learning problematic with simple training procedures (such as the backpropagation algorithm), because the error gradient passed back through each unit is scaled by that slope, so the weight updates shrink accordingly.
Data
Your data looks like you normalized each time series so it sums to 1, which is not a popular approach to data normalization in neural networks (you would usually normalize column-wise, so that each dimension is normalized, not each sample).
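For example, column-wise standardization of a numeric data matrix is a one-liner in R (X here is a hypothetical samples-by-dimensions matrix):
X <- matrix(runif(20), nrow = 5)  # hypothetical: 5 samples x 4 dimensions
X_scaled <- scale(X)  # each column (dimension) centered to mean 0, scaled to sd 1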
Title
Your question and model are not "sequential" and do not involve "varying vector lengths"; looking for papers about such phenomena won't lead you to an answer to your question.
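That said, on the narrow syntax point: nnet() accepts a matrix response, so the input and output widths need not match. A minimal sketch on simulated data (all names and sizes here are invented for illustration):
library(nnet)
set.seed(1)
n <- 5; k <- 3
# Hypothetical data: 100 time steps, each a length-n vector normalized to sum to 1
series <- t(apply(matrix(runif(100 * n), ncol = n), 1, function(r) r / sum(r)))
# Each input row is k consecutive vectors flattened to length k*n;
# the target is the vector at the following time step (length n)
idx <- 1:(nrow(series) - k)
X <- t(sapply(idx, function(i) as.vector(t(series[i:(i + k - 1), ]))))
Y <- series[idx + k, ]
fit <- nnet(X, Y, size = 10, maxit = 500, trace = FALSE)
pred <- predict(fit, X[1, , drop = FALSE])  # length-n prediction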
Is there a way (a function) to calculate the AUC value for a Keras model in R on a test set?
I have searched on Google but nothing has come up.
From a Keras model, we can extract the predicted values as either classes or probabilities, as follows:
Probability:
[1,] 9.913518e-01 1.087829e-02
[2,] 9.990101e-01 1.216531e-03
[3,] 9.445553e-01 6.256607e-02
[4,] 9.928864e-01 6.808311e-03
[5,] 9.993126e-01 1.028240e-03
[6,] 6.075442e-01 3.926141e-01
Class:
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Many thanks,
Ho
Generally it does not really matter which classifier (Keras or not) did the prediction. All you need to estimate the AUC are two things: the predicted probabilities from some classifier and the actual category (for example, dead "yes" vs. "no"). With these data you can calculate both the true positive rate and the false positive rate, and thus you can make a ROC plot and estimate the AUC. You can use
library(pROC)
# 'category' = the actual classes, 'prediction' = the predicted probabilities
roc_obj <- roc(category, prediction)
auc(roc_obj)
See here for some more explanation.
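For instance, with made-up labels and predicted probabilities standing in for your test data (in the Keras output above, the second column would be the probability of the positive class):
library(pROC)
set.seed(42)
truth <- rbinom(100, 1, 0.5)  # hypothetical true classes on the test set
# hypothetical predicted probabilities, skewed toward the true class
probs <- ifelse(truth == 1, rbeta(100, 4, 2), rbeta(100, 2, 4))
roc_obj <- roc(truth, probs)
auc(roc_obj)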
I'm not sure this will answer your needs, as it depends on your data structure and Keras output format, but have a look at the dismo package's evaluate function. You need to set up something like this:
library(dismo)
# predictors: a RasterStack of the explanatory variables
# pres_test: a held-out subset of presence data, not used to fit the model
# backg_test: true or randomly sampled (background) absence data
# model: your fitted model object
e <- evaluate(pres_test, backg_test, model, predictors)
e@auc  # you may bootstrap this step x times by randomly re-selecting 'pres_test' and 'backg_test'
The goal: I am attempting to extract the seasonal and trend components from a time series using a band pass filter, due to issues with loess-based methods, which you can read more about here.
The data: The data is daily rainfall measurements from a 10-year span, which is highly stochastic and exhibits a clear annual seasonality. The data can be found here.
The problem: When I execute the filter, the Cycle component manifests as expected (capturing the annual seasonality), but the Trend component appears to be extremely over-fitted, such that the residuals become minuscule and the resulting model is not useful for out-of-sample forecasting.
library(readr)
US1ORLA0076 <- read_csv("US1ORLA0076_cf.csv")
head(US1ORLA0076)
water_date PRCP prcp_log
<date> <dbl> <dbl>
1 2006-12-22 0.09 0.0899
2 2006-12-23 0.75 0.693
3 2006-12-24 1.63 1.26
4 2006-12-25 0.06 0.0600
5 2006-12-26 0.36 0.353
6 2006-12-27 0.63 0.594
I then apply a Christiano-Fitzgerald band pass filter (designed to pass wavelengths between half-year and full-year in size, i.e. single annual waves) using the following command from the mFilter package.
library(mFilter)
US1ORLA0076_cffilter <- cffilter(US1ORLA0076$prcp_log,pl=180,pu=365,root=FALSE,drift=FALSE,
type=c("asymmetric"),
nfix=NULL,theta=1)
This creates an S3 object containing, among other things, a vector of "trend" values and a vector of "cycle" values, like so:
head(US1ORLA0076_cffilter$trend)
[,1]
[1,] 0.1482724
[2,] 0.7501137
[3,] 1.3202868
[4,] 0.1139883
[5,] 0.4051551
[6,] 0.6453462
head(US1ORLA0076_cffilter$cycle)
[,1]
[1,] -0.05839342
[2,] -0.05696651
[3,] -0.05550995
[4,] -0.05402422
[5,] -0.05250982
[6,] -0.05096727
Plotted:
plot(US1ORLA0076_cffilter)
I am confused by this output. The cycle looks pretty much as I expected. The trend does not. Rather than being a gradually changing line representing the overall trend of the data after the seasonality has been extracted, it appears to trace the original data closely, i.e. it looks badly overfit.
Question: Is mFilter even defining the "trend" the same way that a function like decompose() or stl() does? If not, how should I think about it?
Question: Have I calibrated the cffilter() incorrectly, and what can I change to improve the definition of the trend component?
The answer is, "no" mfilter() is not defining "trend" the the same way that certain decomposition functions such as stl() do. It is defining it, more generally, as "the thing from which the cycle deviates". Setting a bandwidth of 180-365 for the pass filter, I have isolated the annual-cyclical component, which has been subtracted from the data, leaving behind everything else, which is defined here as the "trend" and can be thought of as a kind of residual.
To identify the "trend" as it is manifest in a decomposition package like stl() or decomp() using the same method, one could apply a band pass filter similar to that above, but with a period of oscillation defined between (for this data set) 366-3652, which would capture a frequency range reflecting the entire 10-year period, excluding intra-annual ones such as annual seasonality.
#Overall trend captured with similar code (and slightly different data):
US1ORLA0076_cffilter_trend <- cffilter(US1ORLA0076$prcp_log,pl=366,pu=3652,root=FALSE,drift=FALSE,
type=c("asymmetric"),
nfix=1,theta=1)
plot(US1ORLA0076_cffilter_trend)
I've used mclust to find clusters in a dataset. Now I want to implement these findings in external non-R software (predict.Mclust is thus not an option, as has been suggested in previous similar questions) to classify new observations. I need to know how mclust classifies observations.
Since mclust outputs a center and a covariance matrix for each cluster, it felt reasonable to calculate the Mahalanobis distance from every observation to every cluster and classify each observation to the Mahalanobis-nearest cluster. It does not seem to work fully, however.
Example code with simulated data (in this example I use only one dataset, d, and try to obtain the same classification as mclust does via the Mahalanobis approach outlined above):
library(MASS)    # for mvrnorm
library(mclust)  # for Mclust
set.seed(123)
c1<-mvrnorm(100,mu=c(0,0),Sigma=matrix(c(2,0,0,2),ncol=2))
c2<-mvrnorm(200,mu=c(3,3),Sigma=matrix(c(3,0,0,3),ncol=2))
d<-rbind(c1,c2)
m<-Mclust(d)
int_class<-m$classification
clust1_cov<-m$parameters$variance$sigma[,,1]
clust1_center<-m$parameters$mean[,1]
clust2_cov<-m$parameters$variance$sigma[,,2]
clust2_center<-m$parameters$mean[,2]
mahal_clust1<-mahalanobis(d,cov=clust1_cov,center=clust1_center)
mahal_clust2<-mahalanobis(d,cov=clust2_cov,center=clust2_center)
mahal_clust_dist<-cbind(mahal_clust1,mahal_clust2)
mahal_classification<-apply(mahal_clust_dist,1,function(x){
match(min(x),x)
})
table(int_class,mahal_classification)
#List Mahalanobis distances for misclassified observations:
mahal_clust_dist[mahal_classification!=int_class,]
plot(m,what="classification")
#Indicate misclassified observations:
points(d[mahal_classification!=int_class,],pch="X")
#Results:
> table(int_class,mahal_classification)
mahal_classification
int_class 1 2
1 124 0
2 5 171
> mahal_clust_dist[mahal_classification!=int_class,]
mahal_clust1 mahal_clust2
[1,] 1.340450 1.978224
[2,] 1.607045 1.717490
[3,] 3.545037 3.938316
[4,] 4.647557 5.081306
[5,] 1.570491 2.193004
Five observations are classified differently by the Mahalanobis approach and by mclust. In the plot they are intermediate points between the two clusters. Could someone tell me why it does not work, and how I can mimic the internal classification of mclust and predict.Mclust?
After formulating the above question I did some additional research (thanks LoBu) and found that the key is to calculate the posterior probability (pp) of an observation belonging to each cluster and classify according to the maximal pp. The following works:
library(mvtnorm)  # for dmvnorm
denom<-rep(0,nrow(d))
pp_matrix<-matrix(rep(NA,nrow(d)*2),nrow=nrow(d))
for(i in 1:2){
denom<-denom+m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i])
}
for(i in 1:2){
pp_matrix[,i]<-m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i]) / denom
}
pp_class<-apply(pp_matrix,1,function(x){
match(max(x),x)
})
table(pp_class,m$classification)
#Result:
pp_class 1 2
1 124 0
2 0 176
But if someone could explain in layman's terms the difference between the Mahalanobis and pp approaches, I would be grateful. What do the "mixing probabilities" (m$parameters$pro) signify?
In addition to the Mahalanobis distance, you also need to take the cluster weights into account.
These weights set the relative importance of the clusters where they overlap.
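Concretely, for Gaussian clusters mclust compares (up to a constant) the log posterior of each cluster, which adds a log-weight term and a covariance-volume term to the squared Mahalanobis distance. A sketch of that quantity:
# Unnormalized log posterior of observation(s) x under one Gaussian cluster;
# 'pro' is the mixing probability: the prior share of points the cluster explains
log_post <- function(x, pro, mu, sigma) {
  log(pro) - 0.5 * log(det(sigma)) - 0.5 * mahalanobis(x, center = mu, cov = sigma)
}
# Pure Mahalanobis classification drops the log(pro) and log(det(sigma)) terms,
# which is why borderline points between overlapping clusters can flip.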
I did some digging around, but I'm still very new to the concept of Latin hypercube sampling. I found this example, which uses the lhs package:
library(lhs)
set.seed(1)
randomLHS(5,2)
[,1] [,2]
[1,] 0.84119491 0.89953985
[2,] 0.03531135 0.74352370
[3,] 0.33740457 0.59838122
[4,] 0.47682074 0.07600704
[5,] 0.75396828 0.35548904
From my understanding, the entries in the resulting matrix are the coordinates of 5 points that will be used to determine combinations of two continuous variables.
I'm trying to do a simulation with 5 categorical variables. The number of levels per variable ranges from 2 to 5, which results in 2 x 3 x 4 x 2 x 5 = 240 scenarios. I'd like to cut that down as much as possible, so I was thinking of using a Latin hypercube, but I'm confused about how to proceed. Any ideas would be much appreciated!
Also, do you know of any good resources that explain how to analyze the results from Latin hypercube sampling?
I'd recommend sticking with the full factorial with 240 design points, for the following reasons.
Heck, this is what computers are for: automating tedious computational tasks. 240 design points is nothing when you're doing this on a computer! You can easily automate the process with nested loops iterating through the levels, one loop per factor (see the sketch below); don't forget an innermost loop for replications. If each simulation takes more than a minute or two, break it across multiple cores or multiple machines. One of my students recently did this for his MS thesis work and was able to run more than a million simulated experiments over a weekend.
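A minimal sketch of that automation, with run_simulation() as a hypothetical stand-in for your actual simulation:
# Hypothetical stand-in for the real simulation: returns one response value
run_simulation <- function(scenario) rnorm(1)
# Full factorial over the stated levels: 2 x 3 x 4 x 2 x 5 = 240 scenarios
design <- expand.grid(f1 = 1:2, f2 = 1:3, f3 = 1:4, f4 = 1:2, f5 = 1:5)
n_reps <- 10  # innermost loop: replications at each design point
results <- lapply(seq_len(nrow(design)), function(i) {
  replicate(n_reps, run_simulation(design[i, ]))
})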
With continuous factors, you generally assume some degree of smoothness in the response surface and infer/project the response between adjacent design points based on regression. With categorical data, that inference isn't valid for excluded factor combinations, and interactions may very well be the dominant effects. Unless you run the full factorial, the combinations you omit may or may not be the most important ones; the point is that you'll never know if you didn't sample there.
In general, you use the same analysis tools you would use for any other kind of sampling: regression, logistic regression, ANOVA, partition trees, and so on. For categorical factors, I'm a fan of partition trees.
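For example, a regression tree on the simulation output might look like the following (continuing the sketch above; results_df and its response column are hypothetical):
library(rpart)
# One row per replication: the factor levels plus the simulated response
results_df <- design[rep(seq_len(nrow(design)), each = n_reps), ]
results_df$response <- with(results_df, f1 + 2 * (f3 > 2) + rnorm(nrow(results_df)))
fit <- rpart(response ~ ., data = results_df, method = "anova")
plot(fit); text(fit)  # which factors and level combinations drive the response?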
I'm interested in using a wavelet transform, Haar for example, to create classification variables from time series data to use in logistic regression.
A simple example: let's say I'm trying to predict payment defaults and I have a person's monthly expense data; someone with consistent expenses is a better risk than someone with increasing expenses in the most recent 4 months.
If I have two sample borrowers:
Borrower A - Good - expensesA = c(100,110,95,105), default = 0
Borrower B - Bad - expensesB = c(75,100,150,200), default = 1
If I am using logistic regression, glm() in R, to create a classification model, and the R wavelets package's dwt() function for a "haar" transform of the time series, what are the appropriate features to extract from the dwt object to use in glm()?
The truncated output for Borrower A is:
library(wavelets)
expensesA <- c(100, 110, 95, 105)
tr <- dwt(expensesA, filter = "haar")
tr
An object of class "dwt"
Slot "W":
$W1
[,1]
[1,] 7.071068
[2,] 7.071068
$W2
[,1]
[1,] -5
Slot "V":
$V1
[,1]
[1,] 148.4924
[2,] 141.4214
$V2
[,1]
[1,] 205
Slot "filter":
Filter Class: Daubechies
Name: HAAR
Length: 2
Level: 1
Wavelet Coefficients: 7.0711e-01 -7.0711e-01
Scaling Coefficients: 7.0711e-01 7.0711e-01
I know Ws are wavelet coefficients and the Vs the scaling coefficients.
Do I need to use all four W1 and V1 values as variables to properly model this or is it okay to try just the W1s without V1s (or vice versa)?
Is it worthwhile to try only the single W2 and V2s as variables?
Or is it better to try a clustering algorithm and label borrowers based on clusters?
I know it of course also depends on the data, but I'm looking for a starting point regarding best practices.
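As one hedged starting point (a sketch under assumptions, not a best practice): flatten all the W and V coefficients into one feature row per borrower and let the regression weigh them, then prune based on fit. With only the two borrowers above the model is degenerate, so treat this purely as scaffolding:
library(wavelets)
# All wavelet (W) and scaling (V) coefficients as one flat feature vector
make_features <- function(expenses) {
  tr <- dwt(expenses, filter = "haar")
  c(unlist(tr@W), unlist(tr@V))
}
X <- rbind(A = make_features(c(100, 110, 95, 105)),  # Borrower A, default = 0
           B = make_features(c(75, 100, 150, 200)))  # Borrower B, default = 1
df <- data.frame(X, default = c(0, 1))
fit <- glm(default ~ ., data = df, family = binomial)  # needs many more borrowers in practice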