I am generating data via API calls, one data point at a time. I want to feed each point to a Stan model, save the updated model, and discard the data point.
Is this possible with Stan?
If so, how do you deal with group-level parameters? For example, if my model has J group-level parameters, but I'm only inputting one data point at a time, will this not generate an error?
I think your problem can be conceptualized as Bayesian updating. In other words, your beliefs about the parameters are currently represented by some joint distribution; then you get one more data point, and you want to update your beliefs in light of this data point, and then repeat that process.
If so, then you can write a Stan program that has only one data point, but you need some way of representing your current beliefs with a probability distribution to use as the prior. This would typically be done with a multivariate normal distribution on the parameters in the unconstrained space. You can use the unconstrain_pars function in the rstan package to obtain a matrix of unconstrained posterior draws and then find the multivariate normal distribution it is close to. You probably want to use a shrunken covariance estimator for that multivariate normal if you have a lot of parameters. Then, in your Stan program, put a multivariate normal prior on the parameters and do whatever transformations you need to get transformed parameters in the constrained space (many such transformations are documented in the Stan user manual).
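A minimal sketch of the rstan side, assuming fit is the stanfit object from the previous update and the model's parameters are named mu and sigma (placeholders for your own parameter names):

```r
library(rstan)

# Map each constrained posterior draw to the unconstrained space
# ('mu' and 'sigma' are placeholder parameter names)
draws   <- extract(fit, permuted = TRUE)
n_draws <- length(draws$mu)
upars   <- t(sapply(seq_len(n_draws), function(i) {
  unconstrain_pars(fit, list(mu = draws$mu[i], sigma = draws$sigma[i]))
}))

# Summarize the unconstrained draws as a multivariate normal to use as the
# prior in the next update; with many parameters, consider a shrinkage
# covariance estimator (e.g. corpcor::cov.shrink) instead of cov()
prior_mean <- colMeans(upars)
prior_cov  <- cov(upars)
```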
It is true that when you estimate a hierarchical model with only one data point, that data point has essentially no information about the groups it does not belong to. However, in that case the marginal posterior distributions for the parameters of the omitted groups will be essentially the same as the prior distribution. That is fine.
I am currently in the midst of fitting curves to the sales volumes of two successive product generations over an observation period. I am looking at two different markets. Through some research, I have identified a nonlinear model that fits the data well in both markets. To estimate the parameters of the model, I am using R and the nls2 package.
I have obtained regression output for the two markets individually.
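For context, the two fits look roughly like the sketch below; the model formula, data frames, and starting values are placeholders rather than the actual model:

```r
library(nls2)

# Placeholder model and data: substitute your own formula, data frames
# and starting values
fit_m1 <- nls2(volume ~ m * (1 - exp(-p * t)), data = market1,
               start = list(m = 1000, p = 0.1))
fit_m2 <- nls2(volume ~ m * (1 - exp(-p * t)), data = market2,
               start = list(m = 1000, p = 0.1))

summary(fit_m1)$coefficients  # estimates and standard errors, market 1
summary(fit_m2)$coefficients  # estimates and standard errors, market 2
```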
Now I would like to test whether the estimated parameters differ between the two markets. That is, I would like to test each parameter estimate from one market against the corresponding estimate from the other market.
Is there any function in R, or in the nls2 package, that will allow me to do that? Or is there a smarter method?
Thank you in advance!
You can compare the parameters using MANOVA.
Our professor wanted us to run 10-fold cross-validation on a data set to get the lowest RMSE and use the coefficients of that model to build a function that takes in parameters, makes a prediction, and returns a "Fitness Factor" score ranging between 25 and 75.
He encouraged us to try transforming the data, so I did. I used scale() on the entire data set to standardize it and then ran my regression and 10-fold cross-validation. I then found the model I wanted and copied the coefficients over. The problem is that my function's predictions are WAY off when I put unstandardized parameters into it to predict a y.
Did I completely screw this up by standardizing the data to a mean of 0 and sd of 1? Is there any way I can undo this mess if I did screw up?
My coefficients are extremely small numbers and I feel like I did something wrong here.
Build a proper pipeline, not just a hack with some R functions.
The problem is that you treat scaling as part of loading the data, not as part of the prediction process.
The proper protocol is as follows:
"Learn" the transformation parameters
Transform the training data
Train the model
Transform the new data
Predict the value
Inverse-transform the predicted value
During cross-validation, these steps need to be run separately for each fold, or you may overestimate (overfit) your model's quality.
Standardization is a linear transform, so the inverse is trivial to find.
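A minimal sketch of the protocol in R, assuming a data frame dat with a numeric response column y and numeric predictors (the data names and the lm() model are placeholders):

```r
set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))
rmse  <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]

  # 1. "Learn" the transformation parameters on the training fold only
  mu  <- colMeans(train)
  sig <- apply(train, 2, sd)

  # 2. Transform the training data
  train_s <- as.data.frame(scale(train, center = mu, scale = sig))

  # 3. Train the model (placeholder: plain linear regression)
  fit <- lm(y ~ ., data = train_s)

  # 4. Transform the new data with the *training* parameters
  test_s <- as.data.frame(scale(test, center = mu, scale = sig))

  # 5. Predict (predictions are on the standardized scale of y)
  pred_s <- predict(fit, newdata = test_s)

  # 6. Inverse-transform the predictions back to the original scale
  pred <- pred_s * sig["y"] + mu["y"]

  rmse[i] <- sqrt(mean((test$y - pred)^2))
}
mean(rmse)
```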
I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I performed the augmented Dickey-Fuller (ADF) test and the KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took first differences, the ACF plots were completely insignificant (apart from the value of 1 at lag zero, of course), which told me that the differenced series behave like white noise.
Nevertheless, I tried to estimate a VAR model, which you can see attached.
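Roughly, the workflow looks like the sketch below; the object names, deterministic terms, and lag choices are placeholders for what I actually used:

```r
library(urca)   # ur.df(), ur.kpss()
library(vars)   # VARselect(), VAR(), serial.test()

# Unit-root tests on the levels (placeholder series 'stocks' and 'gtrends')
summary(ur.df(stocks,  type = "drift", lags = 8, selectlags = "AIC"))
summary(ur.kpss(stocks))
summary(ur.df(gtrends, type = "drift", lags = 8, selectlags = "AIC"))
summary(ur.kpss(gtrends))

# First differences (both series I(1)) and their ACFs
d_stocks  <- diff(stocks)
d_gtrends <- diff(gtrends)
acf(d_stocks); acf(d_gtrends)

# VAR on the differenced series
y   <- cbind(Stocks = d_stocks, GoogleTrends = d_gtrends)
VARselect(y, lag.max = 8, type = "const")
fit <- VAR(y, p = 1, type = "const")
summary(fit)
serial.test(fit)      # residual autocorrelation
normality.test(fit)   # residual normality
arch.test(fit)        # heteroscedasticity
```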
As you can see, only one constant is significant. I have already read that, because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there are no dynamics between the two time series, and each can also be modelled independently of the other with an AR(p) model.
I checked the residuals of the model. They fulfill the assumptions (the residuals are not perfectly normally distributed, but acceptable; there is homoscedasticity; the model is stable; and there is no autocorrelation).
But what does it mean if no coefficient is significant, as in the case of the Stocks.ts equation? Is the model simply inappropriate for the data, because the data don't follow an AR process? Or is the model so bad that a constant would describe the data better than the model? Or a combination of the two? Any suggestions on how I could proceed with my analysis?
Thanks in advance
Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross-validation and specify the cost function via its cost argument. By specifying the cost function, we can assign different unit costs to the different types of errors: predicted Yes when the reference is No, or predicted No when the reference is Yes.
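For example, a minimal sketch of such a cost function with boot::cv.glm (the 5:1 unit costs, the 0.5 threshold, and the data frame dat are placeholders):

```r
library(boot)

# Placeholder data frame 'dat' with a binary response Y coded 0/1 ("No"/"Yes")
fit <- glm(Y ~ ., family = binomial, data = dat)

# cost(observed, predicted probability): 5 units for predicting 0 when the
# observed class is 1, and 1 unit for predicting 1 when the observed class is 0
asym_cost <- function(obs, pred) {
  yhat <- as.numeric(pred > 0.5)
  mean(5 * (obs == 1 & yhat == 0) + 1 * (obs == 0 & yhat == 1))
}

cv.glm(dat, fit, cost = asym_cost, K = 10)$delta
```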
I am wondering if I could achieve the same with an SVM. In other words, is there a way for me to specify a cost (loss) function instead of using the built-in loss function?
Besides the answer by Yueguoguo, there are three more solutions: the standard wrapper approach, hyperplane (threshold) tuning, and the one in e1071.
The wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model, if trained to optimise accuracy, is then optimal under those costs.
The second idea is frequently used in text mining. In an SVM, the classification is derived from the distance to the hyperplane. For linearly separable problems this distance is {1, -1} for the support vectors. The classification of a new example is then basically whether the distance is positive or negative. However, one can also shift this threshold: instead of making the decision at 0, move it, for example, towards 0.8. That way the classifications are shifted in one or the other direction, while the general shape of the data is not altered.
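A minimal sketch of this threshold shift with e1071 in R (the data objects and the 0.8 offset are placeholders):

```r
library(e1071)

fit  <- svm(Y ~ ., data = train, kernel = "linear")

# decision.values = TRUE returns the signed distances to the hyperplane;
# which class the positive sign refers to is given by the column name of
# the decision.values attribute
pred <- predict(fit, newdata = test, decision.values = TRUE)
dv   <- attr(pred, "decision.values")[, 1]

# The default rule decides at 0; the shifted rule decides at 0.8 instead
table(default = dv > 0, shifted = dv > 0.8)
```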
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, such as class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken there (it is the regularization parameter C).
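For instance (the 5:1 weighting is a placeholder for your actual misclassification costs):

```r
library(e1071)

# Penalize errors on the "Yes" class five times as heavily as on "No"
fit <- svm(Y ~ ., data = train, kernel = "radial",
           class.weights = c(Yes = 5, No = 1))
```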
The loss function for the SVM hyperplane parameters is tuned automatically thanks to the beautiful theoretical foundation of the algorithm; cross-validation is used to tune the hyperparameters. Say an RBF kernel is used: cross-validation then selects the optimal combination of C (cost) and gamma (the kernel parameter) for the best performance, measured by a chosen metric (e.g., mean squared error). In e1071, this can be done with the tune method, where the ranges of the hyperparameters as well as the attributes of the cross-validation (i.e., 5-, 10-, or more-fold cross-validation) can be specified.
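For example (the formula, data, and parameter ranges are placeholders):

```r
library(e1071)

# 10-fold cross-validation over a grid of cost and gamma values
tuned <- tune(svm, Y ~ ., data = train,
              ranges = list(cost = 10^(-1:2), gamma = 10^(-3:0)),
              tunecontrol = tune.control(cross = 10))
summary(tuned)
best <- tuned$best.model
```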
To obtain comparable cross-validation results using an area-under-the-curve (AUC) type of error measure, one can train different models with different hyperparameter configurations and then validate each model against sets of pre-labelled data.
Hope the answer helps.
I'm training a Hidden Markov Model using EM and want some estimate of how "certain" I can be about the learned parameters (i.e., the estimated transition, emission, and prior probabilities). In general, different initial conditions result in different parameters, but in many cases the different parameter sets have similar likelihoods.
I'm looking for some way to probe the likelihood surface to get an estimate of the number of local maxima, in order to get a better idea of the different results I might get. (Running the algorithm takes quite a long time, so I can't run it enough times to do a "naive" sensitivity analysis.)
Any standard methods to do so?