Using RNN without historical data

I have a numeric dataset containing 10 features and 1 dependent variable split into 36 timesteps.
What I am trying to do is predict the 36 timesteps using the 10 features. However, from what I have read about RNNs, it seems I need historical values of the dependent variable to train the network, and that is not possible when making a prediction: I can only use the 10 features to predict the 36 values.
So the first question is: can I use an RNN to predict a time series of 36 timesteps using 10 features as input, noting that the features consist of a single timestep?
I've used ANNs and many other ML models, but I want to see if I can improve the results with a recurrent approach.
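One common workaround, sketched below with the keras R package (untested; the layer sizes and the encoder/decoder layout are my assumptions, not from the question), is to encode the 10 static features once, repeat the encoding 36 times, and let an LSTM decode one value per timestep, so no historical values of the dependent variable are needed as input:
library(keras)
# map a single 10-feature vector to a 36-step sequence
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10)) %>%  # encode the features
  layer_repeat_vector(36) %>%                          # feed the encoding to all 36 timesteps
  layer_lstm(units = 32, return_sequences = TRUE) %>%  # decode a sequence
  time_distributed(layer_dense(units = 1))             # one output per timestep
model %>% compile(optimizer = "adam", loss = "mse")
# x: n x 10 matrix of features; y: n x 36 x 1 array of targets
# model %>% fit(x, y, epochs = 100)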

Related

Predict with only one observation with randomForest in R

I am studying loan default prediction, currently using R's "randomForest" package. My first model had an accuracy of 98%, with a sensitivity of 0.98 and a specificity of 0.97, on the test data using the "predict" command.
The training and testing data had n = 2865 and n = 319 observations, respectively.
In a real situation, where I would like to predict the probability of loan default for just one company, i.e. only 1 observation in the test data, would I have a problem?
The dataset I used contains only 8 predictor variables and 1 predicted variable. According to the literature, there are many more variables to consider. Why did I get good results with such a small set of predictors? It seems "weird" to me.
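For what it's worth, predicting a single observation is mechanically no different from predicting a test set. A minimal sketch (the object names train and one_company and the response column default are hypothetical):
library(randomForest)
rf <- randomForest(default ~ ., data = train)       # default as a factor => classification
predict(rf, newdata = one_company, type = "prob")   # class probabilities for 1 observation
The only requirement is that the single-row data frame has the same predictor columns (and factor levels) as the training data.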

Is deming regression using MCR package in R appropriate for my data?

I have two independent variables. I would like to find out the strength and direction of the association between them. Since both variables are independent and I need to account for error on both axes, should I use Deming regression?
Also, I expect that the bottom 10-20% of the data is noise, which can reduce the strength of the association. Please suggest code to automatically remove 10 or 20% of the data from the statistical analysis.
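A minimal sketch of the mechanics with the mcr package (taking the "bottom 10%" as the lowest x values is my assumption; adjust the criterion for your data):
library(mcr)
keep <- x >= quantile(x, 0.10)                  # drop the lowest 10% of x values
fit  <- mcreg(x[keep], y[keep], method.reg = "Deming")
getCoefficients(fit)                            # slope and intercept with confidence intervals
Note that discarding the bottom of the data because it weakens the association is hard to justify statistically; it is safer to define the noise criterion independently of the association you are testing.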

Multivariate time series model using MARSS package (or maybe dlm)

I have two temporal processes. I would like to see if one temporal process (X_{t,2}) can be used to perform better forecast of the other process (X_{t,1}). I have multiple sources providing temporal data on X_{t,2}, (e.g. 3 time series measuring X_{t,2}). All time series require a seasonal component.
I found MARSS' notation to be pretty natural to fit this type of model and the code looks like this:
library(MARSS)
TT <- ncol(data)                          # number of timesteps (data is an n x TT matrix)
Z <- factor(c("R","S","S","S"))           # maps the 4 observed series onto the 2 states
B <- matrix(list(1, 0, "beta", 1), 2, 2)  # evolution matrix; "beta" lets X_{t,2} feed X_{t,1}
A <- "zero"                               # data are demeaned
R <- matrix(list(0), 4, 4); diag(R) <- c("r","s","s","s")  # observation error variances
Q <- "diagonal and unequal"
U <- "zero"
period <- 12                              # monthly seasonality
per.1st <- 1                              # season of the first observation
c.in <- diag(period)                      # build seasonal dummy covariates
for (i in 2:ceiling(TT/period)) c.in <- cbind(c.in, diag(period))
c.in <- c.in[, (1:TT) + (per.1st - 1)]
rownames(c.in) <- month.abb
C <- "unconstrained"                      # 2 x 12 matrix of seasonal effects
dlmfit <- MARSS(data, model = list(Z=Z, B=B, Q=Q, C=C, c=c.in, R=R, A=A, U=U))
I got a beta estimate implying that the second temporal process is useful in forecasting the first, but to my dismay MARSS gives me an error when I use MARSSsimulate to forecast, because one of the matrices (the one related to seasonality) is time-varying.
Does anyone know a way around this issue with the MARSS package? If not, any tips on fitting an analogous model using, say, the dlm package?
I was able to represent my state-space model in a form suitable for the dlm package, but I ran into problems with dlm too. First, the ML estimates are very unstable; I bypassed this by constructing the dlm model from the MARSS estimates. However, dlmFilter does not work properly; I think it is not designed for models with multiple sources for one time series plus additional seasonal components. dlmForecast, on the other hand, gives me the forecasts I need.
In summary, for my multivariate time series model (with multiple sources providing data on one of the temporal processes), MARSS gave me reasonable parameter estimates and the filtered and smoothed values of the states, but no forecasts. dlm gave fishy estimates for my model, and dlmFilter didn't work, but I was able to use dlmForecast to produce forecasts from the model fitted in MARSS and re-expressed in dlm form.
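To illustrate the dlm route, here is a univariate sketch of the mechanics (the variance values are placeholders standing in for estimates taken from MARSS; the actual multivariate model with multiple sources would need its observation matrix built by hand):
library(dlm)
mod  <- dlmModPoly(order = 1, dV = 0.5, dW = 0.1) +  # local level, variances fixed (no MLE)
        dlmModSeas(frequency = 12)                   # monthly seasonal component
filt <- dlmFilter(y, mod)                            # y: a univariate monthly series
fc   <- dlmForecast(filt, nAhead = 12)               # point forecasts are in fc$f
Fixing the variances rather than estimating them with dlmMLE sidesteps the instability described above.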

Imbalanced training dataset and regression model

I have a large dataset (>300,000 observations) representing the distance (RMSD) between proteins. I'm building a regression model (random forest) that is supposed to predict the distance between any two proteins.
My problem is that I'm more interested in close matches (short distances), but my data distribution is highly skewed: the majority of the distances are large. I don't really care how well the model predicts large distances; I want to make sure it predicts the distances of close matches as accurately as possible. However, when I train the model on the full data its performance isn't good. So I wonder what sampling strategy would let the model predict close-match distances accurately, while not stratifying the data too much, since unfortunately this skewed distribution reflects the real-world distribution on which I am going to validate and test the model.
The following is my data distribution, where the first column is the distance range and the second the number of observations in that range:
Distance Observations
0 330
1 1903
2 12210
3 35486
4 54640
5 62193
6 60728
7 47874
8 33666
9 21640
10 12535
11 6592
12 3159
13 1157
14 349
15 86
16 12
The first thing I would try here is building a regression model of the log of the distance, since this compresses the range of larger distances. If you're using a generalised linear model, this is the log link function; for other methods you can do it manually by estimating a regression function f of your inputs x and exponentiating the result:
y = exp( f(x) )
Remember to train on the log of the distance for each pair.
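A minimal sketch with a random forest (the objects train and test and the column name distance are hypothetical; 1 is added before taking logs because the table includes distances of 0):
library(randomForest)
predictors <- setdiff(names(train), "distance")
rf   <- randomForest(x = train[, predictors], y = log(train$distance + 1))
pred <- exp(predict(rf, newdata = test[, predictors])) - 1   # back on the distance scale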
Popular techniques for dealing with an imbalanced distribution in regression include:
Random over/under-sampling (a minimal sketch follows this list).
The Synthetic Minority Oversampling Technique for Regression (SMOTER), which has an R package implementing it.
The Weighted Relevance-based Combination Strategy (WERCS), which has a GitHub repository of R code implementing it.
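As a minimal sketch of random over-sampling of the rare short-distance cases (the threshold of 3 and the 4x replication factor are illustrative only):
rare  <- which(train$distance < 3)                       # the under-represented cases
extra <- sample(rare, 4 * length(rare), replace = TRUE)  # resample them with replacement
train_oversampled <- rbind(train, train[extra, ])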
PS: the table you show suggests a classification problem rather than a regression problem.
As previously mentioned, I think what might help you, given your problem, is the Synthetic Minority Over-Sampling Technique for Regression (SMOTER).
If you're a Python user, I'm currently working on improving my implementation of the SMOGN algorithm, a variant of SMOTER: https://github.com/nickkunz/smogn
There are also a few examples on Kaggle where SMOGN was applied to improve prediction results: https://www.kaggle.com/aleksandradeis/regression-addressing-extreme-rare-cases

Using self start logistic and gompertz with only initial weight up to 40 days old

I have used the self-starting Gompertz and logistic functions on my growth data for broiler chickens.
My dilemma is that I only have data for the commercial lifespan of the chickens: up to 35 days for females (pullets) and up to 41 days for males (cocks). I used nlsList:
out.nls <- nlsList(Scales.Weight ~ SSgompertz(Scales.Age, a0, b0, b1) | SexOfBirds, data = f_79.grp)
This gives me separate model coefficients for each level of SexOfBirds, i.e. two different models.
What are the best plots to use to describe the models? My training data is quite substantial.
How do I compare the models with test data and, in general, show how good the fit is?
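One way to do both (a sketch; the held-out data frame test is assumed to have the same columns as f_79.grp): each element of out.nls is an ordinary nls fit, one per sex, so you can predict group by group, plot observed against predicted, and compute an RMSE per sex.
library(nlme)
test$pred <- NA
for (s in names(out.nls)) {
  i <- test$SexOfBirds == s
  test$pred[i] <- predict(out.nls[[s]], newdata = test[i, ])
}
plot(test$pred, test$Scales.Weight, col = test$SexOfBirds,
     xlab = "Predicted weight", ylab = "Observed weight")
abline(0, 1)   # points near this line indicate a good fit
sqrt(tapply((test$Scales.Weight - test$pred)^2, test$SexOfBirds, mean))  # RMSE per sex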
