How to fit a restricted VAR model in R? - r

I am trying to understand how to fit a specific (restricted) VAR model rather than a general one.
I understand that a general model such as a VAR(1) can be fitted with the "vars" package from CRAN.
For example, suppose y is a 10 by 2 matrix. After loading the vars package, I did this:
library(vars)
y=df[,1:2] # df is a data frame with a lot of columns (only the first two matter here)
VARselect(y, lag.max=10, type="const")
summary(fitHilda <- VAR(y, p=1, type="const"))
This works fine when no restrictions are placed on the coefficients. However, I would like to fit this restricted VAR model in R.
How can I do so?
Please point me to a page if you know of one. If anything is unclear from your perspective, please do not downvote; let me know what it is and I will try to make it as clear as I can.
Thank you very much in advance

I was not able to find out how to impose restrictions exactly the way I wanted. However, I found a workaround, as follows.
First, find the number of lags using an information criterion, for example:
VARselect(y, lag.max=10, type="const")
This lets you find the lag length, which was one in my case. Then fit a VAR(1) model to your data, which in my case is y.
t=VAR(y, p=1, type="const")
When I view the summary, I find that some of the coefficients may be statistically insignificant.
summary(t)
Then run the built-in restrict() function from the 'vars' package:
t1=restrict(t, method = "ser", thresh = 2.0, resmat = NULL)
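This function re-estimates the VAR while imposing zero restrictions by significance: with method = "ser", each equation is re-estimated for as long as it contains coefficients whose t-values are below thresh in absolute value.
To see the result, write: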
summary(t1)
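If you already know which coefficients should be zero (rather than letting the t-values decide), restrict() also accepts method = "manual" together with a 0/1 matrix passed as resmat. Here is a minimal sketch, assuming a two-variable VAR(1) with a constant. The simulated series and the particular zero pattern are made up for illustration and are not the model from the question.

```r
# Simulated bivariate series standing in for the first two columns of df
# (hypothetical data, for illustration only).
set.seed(1)
n <- 200
y <- matrix(0, n, 2, dimnames = list(NULL, c("y1", "y2")))
for (i in 2:n) {
  y[i, 1] <- 0.5 * y[i - 1, 1] + rnorm(1)                       # y1 ignores lagged y2
  y[i, 2] <- 0.3 * y[i - 1, 1] + 0.4 * y[i - 1, 2] + rnorm(1)
}

# Restriction matrix: one row per equation, one column per regressor,
# in the order y1.l1, y2.l1, const. 1 = estimate, 0 = fix at zero.
resmat <- matrix(c(1, 0, 1,    # y1 equation: drop lagged y2
                   1, 1, 1),   # y2 equation: unrestricted
                 nrow = 2, byrow = TRUE)

# Only runs if the vars package is installed.
if (requireNamespace("vars", quietly = TRUE)) {
  library(vars)
  fit   <- VAR(y, p = 1, type = "const")
  fit_r <- restrict(fit, method = "manual", resmat = resmat)
  print(summary(fit_r))
}
```

For a VAR(p) with K variables and a constant, the matrix needs K rows and K*p + 1 columns, so it grows quickly with the lag order.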

Related

binomial()$linkinv(fixef()) and binomial_pred_ci(): what exactly are these functions for when applied to generalized linear mixed models?

I was working on a dataset, trying to re-run an already performed statistical analysis, and I came across the following function:
binomial()$linkinv(fixef(m))
after running the following model
summary((m = glmer(T1.ACC ~ COND + (COND | ID), d9only, family = binomial)))
My first question is: what exactly is this function for? I ask because, in other lines of the script, its complement and a slightly modified version based on it also appear:
1) 1- binomial()$linkinv(fixef())
2) d9only$fit = binomial()$linkinv(model.matrix(m) %*% fixef(m)) # the meaning of the %*% operator is also quite mysterious to me
Moreover, another function that appears is the following:
binomial_pred_ci()
To be honest, I have searched through the whole script and found neither a custom function definition nor the package it was loaded from. Does anyone know where it may come from? Maybe the package 'runjags'? If so, any hints on how to download it?
Thanks for your answers
I agree with most of @Oliver's answer. I will add a few comments (since I had an answer partly composed already).
I would be very wary of the script you are following: some parts look wrong (I could obviously be mistaken since these bits are taken completely out of context ...)
binomial()$linkinv refers to the inverse link function for the model used. By default (which applies in this case since no optional link= argument has been specified), this is the inverse-logit or logistic function. A nearly equivalent function is available via plogis(), but using $linkinv could be better in some cases since it would generalize to binomial analyses done with other link functions [e.g. probit or cloglog].
As @Oliver mentions, applying the inverse link function to the coefficients is at least weird; I would even say wrong. Researchers often exponentiate coefficients estimated on the logit/log-odds scale to obtain odds ratios, but applying the inverse link (usually the logistic function) to them is rarely correct.
binomial()$linkinv(model.matrix(m) %*% fixef(m)) is indeed computing the predicted values on the link scale and converting them back to the data (= probability) scale. You can get the same results more reliably (handling missing values, etc.) by using predict(m, type = "response", re.form = ~0) (this extends @Oliver's answer to a case that also applies the inverse-link function for you).
I don't know what binomial_pred_ci is either, but I would suggest you look at predictInterval() from the merTools package ...
PS: none of these answers has much to do with runjags, which uses an entirely different model structure. Presumably glmer models are being fitted for comparison ...
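To make the points about the inverse link concrete, here is a small sketch in base R. It uses a plain glm() on simulated data instead of glmer(), so coef() plays the role of fixef(). The data and variable names are made up for illustration.

```r
# The inverse link for the default binomial family is the logistic function.
inv_link <- binomial()$linkinv
eta <- c(-2, 0, 2)                      # values on the log-odds (link) scale
inv_link(eta)                           # probabilities
all.equal(inv_link(eta), plogis(eta))   # TRUE: same as plogis() for the logit link

# With another link, $linkinv follows the link; plogis() would not.
all.equal(binomial(link = "probit")$linkinv(eta), pnorm(eta))

# Coefficients are usually exponentiated, not passed through the inverse link.
set.seed(42)
x  <- rnorm(200)
yb <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))   # simulated binary outcome
fit <- glm(yb ~ x, family = binomial)
exp(coef(fit))   # intercept: baseline odds; slope: odds ratio per unit of x
```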
help(binomial) describes the link function and inverse link function and their uses. binomial()$linkinv is the binomial inverse-link function (sigmoid function) prob(y|eta) = 1 / (1 + exp(-eta)), where eta is the linear predictor. Using this with the coefficients (or fixed effects) is a bit odd, but it is not unusual as a way to get an idea of how large the effect of each coefficient is. I would not encourage it, however.
%*% is the matrix multiplication operator, while model.matrix(m) (for lme4) extracts the fixed-effects model matrix. So model.matrix(m) %*% fixef(m) is the linear predictor using only the fixed effects. It would be the same as predict(m, re.form = ~ 0). This is often used when you want the fixed-effects-only model, either because you want to correct for between-group variation or because you are predicting new data.
binomial_pred_ci: no idea. My guess is that it is a function for computing confidence intervals for predictions.
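The matrix product described above can be checked by hand. This is a sketch with base R's glm() on simulated data (an assumption made so that it runs without lme4). With a glmer fit you would use fixef(m) in place of coef(m) and add re.form = ~0 to predict().

```r
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(0.3 + 0.8 * d$x))   # simulated binary outcome
m <- glm(y ~ x, family = binomial, data = d)

# Design matrix (n x 2: intercept and x) times coefficients (length 2)
# gives one linear-predictor value per row.
eta <- model.matrix(m) %*% coef(m)
p   <- binomial()$linkinv(eta)       # back to the probability scale

# Same numbers as predict() on the link and response scales.
all.equal(as.vector(eta), unname(predict(m, type = "link")))
all.equal(as.vector(p),   unname(predict(m, type = "response")))
```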

Which methods can I use to calculate correlation among words in quanteda?

My question is a continuation of this.
After cleaning my text data and visualizing it using a wordcloud, I want to see which words are correlated to each other. Here comes the problem:
quanteda has the function textstat_simil, but it says similarity. So, are "similarity" and "correlation" the same thing in this case? (Is distance also related?)
Moreover, my dfm looks like a binary matrix. In this case, is the phi correlation (from the chi-squared statistic) more appropriate? Can I calculate this via quanteda?
Do you have any material, other than the source code on GitHub, that explains in more detail the methods used to calculate similarity or distance measures? (I couldn't understand it from this code, sorry.)
Thanks for your patience!
To compute Pearson’s product-moment correlations among features, you would use:
textstat_simil(x, method = "correlation", margin = "features")
The documentation makes this pretty clear, and the correlation method is the default.
Pearson’s correlation would not be the most appropriate for binary data, and we currently do not implement Spearman’s or other correlation methods more appropriate for categorical or ordinal data. However you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use the stats::cor() methods, which include Spearman’s.
As for the last question, we use the standard implementation of these measures. If you want more clarity on what they mean, I suggest asking on Cross-Validated.
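As a rough illustration of the as.matrix() plus stats::cor() route, here is a sketch in base R. The small 0/1 matrix is a made-up stand-in for as.matrix(your_dfm), with documents in rows and words in columns. For two binary variables, Pearson's r is the phi coefficient, which the last lines cross-check against the chi-squared statistic.

```r
# Hypothetical stand-in for as.matrix(your_dfm): rows = documents, columns = words.
m <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 0,
              1, 0, 1,
              0, 1, 1,
              1, 0, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:6), c("apple", "pie", "tart")))

# Word-by-word correlations (cor() works column-wise).
cor(m, method = "spearman")
phi <- cor(m)            # Pearson on 0/1 data = phi coefficient

# Cross-check one pair: |phi| = sqrt(chi-squared / n), without continuity correction.
tab  <- table(m[, "apple"], m[, "pie"])
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
all.equal(unname(sqrt(chi2 / nrow(m))), abs(phi["apple", "pie"]))
```

Note that a large dfm can use a lot of memory once coerced to a dense matrix, so it may be worth trimming rare features first.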

100-fold-cross-validation for Ridge Regression in R

I have a huge dataset and I am quite new to R, so the only way I can think of to implement 100-fold CV myself is with many for loops and if statements, which is extremely inefficient for a dataset this size and might take several hours to run. I started looking for packages that do this instead and found quite a few CV-related topics on Stack Overflow. I have tried the ones I found, but none of them works for me, and I would like to know what I am doing wrong.
For instance, this code from DAAG package:
cv.lm(data=Training_Points, form.lm=formula(t(alpha_cofficient_values)
%*% Training_Points), m=100, plotit=TRUE)
..gives me the following error:
Error in formula.default(t(alpha_cofficient_values)
%*% Training_Points) : invalid formula
I am trying to do Kernel Ridge Regression, therefore I have alpha coefficient values already computed. So for getting predictions, I only need to do either t(alpha_cofficient_values)%*% Test_Points or simply crossprod(alpha_cofficient_values,Test_Points) and this will give me all the predictions for unknown values. So I am assuming that in order to test my model, I should do the same thing but for KNOWN values, therefore I need to use my Training_Points dataset.
My Training_Points data set has 9000 columns and 9000 rows. I can write for loops and if statements to do 100-fold CV, each time taking 100 rows as test_data and leaving 8900 rows for training, repeating this until the whole data set is covered, then averaging and comparing with my known values. But isn't there a package that does the same? (And ideally also compares the predicted values with the known values and plots them, if possible.)
Please do excuse me for my elementary question, I am very new to both R and cross-validation, so I might be missing some basic points.
The CVST package implements fast cross-validation via sequential testing. This method significantly speeds up the computations while preserving full cross-validation capability. Additionally, the package developers also added standard cross-validation functionality.
I haven't used the package before but it seems pretty flexible and straightforward to use. Additionally, KRR is readily available as a CVST.learner object through the constructKRRLearner() function.
To use the crossval functionality, you must first convert your data to a CVST.data object by using the constructData(x, y) function, with x the feature data and y the labels. Next, you can use one of the cross validation functions to optimize over a defined parameter space. You can tweak the settings of both the cv or fastcv methods to your liking.
After the cross validation spits out the optimal parameters you can create the model by using the learn function and subsequently predict new labels.
I puzzled together an example from the package documentation on CRAN.
library(CVST)
# Construct a CVST.data object from your own data with constructData(x, y)
# Here we load some example data instead
ns = noisySinc(1000)
# Kernel ridge regression
krr = constructKRRLearner()
# Create parameter Space
params=constructParams(kernel="rbfdot", sigma=10^(-3:3),
lambda=c(0.05, 0.1, 0.2, 0.3)/getN(ns))
# Run Crossval
opt = fastCV(ns, krr, params, constructCVSTModel())
# OR.. much slower!
opt = CV(ns, krr, params, fold=100)
# p = list(kernel=opt[[1]]$kernel, sigma=opt[[1]]$sigma, lambda=opt[[1]]$lambda)
p = opt[[1]]
# Create model
m = krr$learn(ns, p)
# Predict with model
nsTest = noisySinc(10000)
pred = krr$predict(m, nsTest)
# Evaluate..
sum((pred - nsTest$y)^2) / getN(nsTest)
If further speedup is required, you can run the cross-validations in parallel. View this post for an example using the doParallel package.
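If you would rather see what the fold bookkeeping looks like without any package, here is a rough base-R sketch. It uses plain linear ridge regression on simulated data, not kernel ridge regression and not the CVST API, and all names (ridge_fit, cv_mse, the lambda grid) are made up for illustration. The point is only that assigning a random fold label to each row replaces most of the nested for and if logic.

```r
set.seed(123)
n <- 300; p <- 5
X <- matrix(rnorm(n * p), n, p)                      # simulated features
y <- drop(X %*% c(1, -2, 0, 0.5, 0)) + rnorm(n)      # simulated response

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))      # random fold label per row

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y (no intercept).
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Mean squared prediction error, averaged over the k held-out folds.
cv_mse <- function(lambda) {
  errs <- vapply(seq_len(k), function(j) {
    test <- fold == j
    beta <- ridge_fit(X[!test, , drop = FALSE], y[!test], lambda)
    mean((y[test] - X[test, , drop = FALSE] %*% beta)^2)
  }, numeric(1))
  mean(errs)
}

lambdas <- 10^(-2:2)
mse <- vapply(lambdas, cv_mse, numeric(1))
data.frame(lambda = lambdas, cv_mse = mse)
lambdas[which.min(mse)]                              # lambda with the lowest CV error
```

Setting k to 100 gives 100-fold CV. With 9000 rows each fold would then hold 90 test rows.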

HMM training with multiple observations and the mhsmm package in R

I want to train a new HMM using Poisson observations, which are the only thing I know.
I'm using the mhsmm package for R.
The first thing that bugs me is the initialization of the model. In the examples it is:
J<-3
initial <- rep(1/J,J)
P <- matrix(1/J, nrow = J, ncol = J)
b <- list(lambda=c(1,3,6))
model = hmmspec(init=initial, trans=P, parms.emission=b,dens.emission=dpois.hsmm)
In my case I don't have initial values for the emission distribution parameters; that's what I want to estimate. How?
Secondly: if I only have observations, how do I pass them to
h1 = hmmfit(list_of_observations, model ,mstep=mstep.pois)
in order to obtain the trained model?
list_of_observations, in the examples, contains a vector of states, one of observations and one of observation sequence lengths, and is usually obtained by simulating from the model:
list_of_observations = simulate(model, N, rand.emis = rpois.hsmm)
EDIT: Found this old question with an answer that partially solved my problem:
MHSMM package in R-Input Format?
These two lines did the trick:
train <- list(x = data.df$sequences, N = N)
class(train) <- "hsmm.data"
where data.df$sequences is the array containing all observation sequences and N is the array containing the number of observations in each sequence.
Still, the initial model is totally random, but I guess this is how it is meant to be since it will be re-estimated. Am I right?
The problem of initialization is critical not only for HMMs and HSMMs, but for all learning methods based on a form of the Expectation-Maximization algorithm. EM converges to a local optimum of the likelihood of the data under the model, but nothing guarantees that this local optimum is the global one.
Goal: find estimates of the emission distribution parameters (it also works for the initial probabilities and the transition matrix)
Algorithm: needs initial estimate to start the optimisation from
You: have to provide an initial "guess" of the parameters
This may seem confusing at first, but the EM algorithm needs a point to start the optimisation. Then it makes some computations and it gives you a better estimate of your own initial guess (re-estimation, as you said). It is not able to just find the best parameters on its own, without being initialised.
From my experience, there is no general way to initialise the parameters that guarantees convergence to a global optimum; it depends on the case at hand. That's why initialisation plays a critical role (mostly for the emission distribution).
What I used to do in such a case is to separate the training data into different groups (e.g. percentiles of a certain parameter in the set), estimate the parameters on these groups, and then use them as initial parameter estimates for the EM algorithm. Basically, you have to try different methods and see which one works best.
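The grouping idea can be sketched in a few lines of base R. This is only one possible heuristic, assuming J = 3 states and Poisson emissions as in the question. The counts are simulated, and splitting by rank into equal-sized groups is my own choice, not something mhsmm prescribes.

```r
set.seed(7)
# Hypothetical observed counts (simulated here; use your own sequences instead).
obs <- c(rpois(100, 1), rpois(100, 4), rpois(100, 9))

J <- 3
# Split the observations into J equal-sized groups by rank (low, middle, high)
# and use each group's mean as the starting lambda for one state.
grp     <- cut(rank(obs, ties.method = "first"), breaks = J, labels = FALSE)
lambda0 <- as.numeric(tapply(obs, grp, mean))
lambda0

# Uninformative starting values for the remaining parameters, as in the question.
init0 <- rep(1 / J, J)
P0    <- matrix(1 / J, nrow = J, ncol = J)
b0    <- list(lambda = lambda0)

# Only runs if mhsmm is installed; the call mirrors the one in the question.
if (requireNamespace("mhsmm", quietly = TRUE)) {
  library(mhsmm)
  model <- hmmspec(init = init0, trans = P0, parms.emission = b0,
                   dens.emission = dpois.hsmm)
}
```

It is worth running the fit from a few different starting points and keeping the one with the highest likelihood.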
I'd recommend searching the literature to see whether similar problems have been solved with HMMs, and trying their initialisation method.

Fitting a binormal distribution in R

As the title says, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting to the data distribution the sum of two normals with means m1 and m2 and standard deviations s1 and s2. The two Gaussians are scaled by weight factors such that w1+w2 = 1.
I can do this using the vglm function of the VGAM package, like so:
fitRes <- vglm(mydata ~ 1, mix2normal1(equalsd=FALSE,
iphi=w, imu=m1, imu2=m2, isd1=s1, isd2=s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data into a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) How do I speed up the fit process? I tried to use nls or mle, which look much faster, but mostly failed to get a good fit (though I succeeded in getting every possible error these functions could throw at me). It is also not clear to me how to impose limits with those functions (w in [0;1] and w1+w2=1).
2) How do I automagically choose some good starting parameters (I know this is a $1 million question but you never know, maybe someone has the answer)? Right now I have a little interface that allows me to choose the parameters and visually see what the initial distribution would look like, which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x corresponding to the 3rd and 4th quartiles of the y as starting parameters for the two means. Do you think that would be a reasonable thing to do?
First things first:
Did you try to search for "fit mixture model" on RSeek.org?
Did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.
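For what it's worth, a two-component normal mixture is simple enough that a hand-written EM loop in base R is often fast. The following is a sketch, not the VGAM method from the question. The function name fit_mix2 and the median-split starting values are my own assumptions, and the data are simulated. The weight stays in [0, 1] and the two weights sum to 1 by construction, so no explicit constraints are needed.

```r
# EM for a mixture of two normals. Starting values are chosen automatically
# by splitting the data at the median (a heuristic, not the only option).
fit_mix2 <- function(x, max_iter = 500, tol = 1e-8) {
  lo <- x <= median(x)
  w  <- 0.5
  m  <- c(mean(x[lo]), mean(x[!lo]))
  s  <- c(sd(x[lo]), sd(x[!lo]))
  ll_old <- -Inf
  for (iter in seq_len(max_iter)) {
    d1 <- w * dnorm(x, m[1], s[1])
    d2 <- (1 - w) * dnorm(x, m[2], s[2])
    ll <- sum(log(d1 + d2))                 # log-likelihood at current parameters
    if (abs(ll - ll_old) < tol) break       # stop when it no longer improves
    ll_old <- ll
    r <- d1 / (d1 + d2)                     # E-step: P(component 1 | x)
    w <- mean(r)                            # M-step: weight, means, sds
    m <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    s <- c(sqrt(sum(r * (x - m[1])^2) / sum(r)),
           sqrt(sum((1 - r) * (x - m[2])^2) / sum(1 - r)))
  }
  list(w = c(w, 1 - w), mean = m, sd = s, loglik = ll, iterations = iter)
}

set.seed(2024)
x <- c(rnorm(600, 0, 1), rnorm(400, 5, 1.5))   # simulated bimodal data
fit <- fit_mix2(x)
fit
```

To follow the data over time you could call fit_mix2() on each block, for example with lapply(split(x, block), fit_mix2), where block is a hypothetical vector of block labels. Like any EM, it can land in a local optimum when the components overlap heavily, so checking the fit visually remains a good idea.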
