Regress a variable on variables from the date before - R

For an econometrics project, I'm using R to estimate some effects with panel data.
To check whether the strict exogeneity assumption is too restrictive, I'm running the following 2SLS estimation to predict Y_it (sales) from X_it (a set of explanatory variables) using a first-difference model.
I need to regress each component of Delta_X_it (= X_it - X_it-1) on a constant and all components of Delta_X_it-1,
then regress Delta_Y_it on the fitted values of Delta_X_it.
The second step will be easy to implement once the first step is done, but the first step is the problem. I have already first-differenced all variables by group (here, by Store), but I don't know how to tell R that I want to regress one variable at time t on variables at time t-1 while grouping by Store. Any idea how to do so?
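For concreteness, a minimal sketch of that first stage in dplyr, assuming a data frame df with columns Store, date, and already first-differenced variables dY, dX1, dX2 (all names here are illustrative, not taken from the actual data):
library(dplyr)
df <- df %>%
  group_by(Store) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(lag_dX1 = dplyr::lag(dX1),   # Delta_X_{i,t-1}, computed within each Store
         lag_dX2 = dplyr::lag(dX2)) %>%
  ungroup()
# First stage: each component of Delta_X_it on a constant and all lagged components
fs1 <- lm(dX1 ~ lag_dX1 + lag_dX2, data = df)
fs2 <- lm(dX2 ~ lag_dX1 + lag_dX2, data = df)
# Fitted values to use as regressors in the second stage for Delta_Y_it
df$dX1_hat <- predict(fs1, newdata = df)
df$dX2_hat <- predict(fs2, newdata = df)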

Related

How to determine the most significant predictors - multivariate forecasting

I would like to create a forecasting model with time series in R. I have a target time series 'Sales' that I would like to forecast. I also have several time series that represent, for example, GDP or advertising spend. Unfortunately, I have a lot of independent time series and I don't know how to figure out the most significant ones. It would be best to identify the most important ones before even building the model.
I have already worked with classification problems, where I have always used the Pearson correlation value. This is not possible with time series, right? How can I determine the correlation for time series and use it to find suitable series that describe my target time series?
I tried to use the corr.test() function in R, but I think that's not right.
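One common starting point (just a sketch; gdp and sales here are hypothetical ts objects of the same length) is the cross-correlation function on differenced series, which gives correlations at different leads and lags rather than a single Pearson value:
ccf(diff(gdp), diff(sales), lag.max = 12)   # differencing first reduces spurious correlation from shared trends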

Ensemble machine learning model with NNETAR and BRNN

I used the forecast package to forecast the daily time series of a variable Y using its lagged values and a time series of an external parameter X. I found the nnetar model (a NARX model) was the best in terms of overall performance. However, I was not able to predict the peaks of the time series well, despite various attempts at parameter tuning.
I then extracted the peak values (above a threshold) of Y (which, of course, is not a regular time series anymore) and the corresponding X values, and tried to fit a regression model (note: not an autoregression model) using various models in the caret package. I found that the prediction of peak values using a brnn (Bayesian regularized neural network) model with only the X values is better than that of nnetar, which uses both the lagged values and the X values.
Now my question is how do I go from here to create an ensemble of these two models (i.e. whenever the prediction from the brnn regression model, or any other regression model, is better, I want it to replace the nnetar prediction and move forward; I am mostly concerned about the peaks)? Is this a commonly used approach?
Instead of trying to pick the one model that is superior at any given time, it's typically better to average the models, in order to include as many individual views as possible.
In the experiments I've been involved in, where we tried to pick one model that would outperform based on historical performance, a simple average typically turned out to be as good or better. This is in line with the usual findings on forecast combination: https://otexts.com/fpp2/combinations.html
So, before you go more advanced by trying to pick a specific model based on previous performance, or by using a weighted average, consider a simple average of the two models, as in the sketch below.
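For example, a minimal sketch of an equal-weight combination, where fc_nnetar and fc_brnn are hypothetical numeric vectors holding the two models' point forecasts for the same dates:
fc_combined <- (fc_nnetar + fc_brnn) / 2           # simple average of the two forecasts
# a weighted variant, if you later estimate a weight w from past accuracy:
w <- 0.6
fc_weighted <- w * fc_nnetar + (1 - w) * fc_brnn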
If you want to continue with some form of selection/weighted averaging, have a look at the FFORMA package in R: https://github.com/pmontman/fforma
I've not tried that specific package (yet), but I have seen promising results in my tests using the original m4metalearning package.

Create a new dataframe to do piecewise linear regression on percentages after doing serial crosstabs in R

I am working with R. I need to identify the predictors of higher Active trial start percentage over time (StartDateMonthsYrs). I will do linear regression with Percent.Active as the dependent variable.
My original dataframe is attached, and my obtained Active trial start percentage over time (named Percent.Active) is presented here.
So, I need to assess whether federally sponsored trials, industry-sponsored trials, or other-sponsored trials were associated with a higher active trial start percentage over time. I have many other variables that I need to assess, but this is a sample of my data.
I am thinking of doing many crosstabs for each variable (e.g. Federal & Active, then Industry & Active, etc.) in each month (maybe with the help of lapply), then accumulating the obtained percentages in a second sheet and running the analysis based on that.
My code for linear regression is as follows:
q.lm0 <- lm(Percent.Active ~ Time.point + xyz, data = data.percentage)
summary(q.lm0)
I'm a little bit confused. You write 'associated'. If you really want to look for association, then yes, a crosstab might be possible and sufficient, as association is not the same as causation (which is only inferred from correlation if there is a theory behind it). If you are looking for correlation and insights over time, a plain regression with lm() is not very useful.
If you want a regression-type analysis, there are packages in R like the plm package that can deal with panel data, and you clearly have panel data (time points, the trial labels of interest, and repeated time points for these labels). Look at this post for info about the package: https://stackoverflow.com/questions/2804001/panel-data-with-binary-dependent-variable-in-r
I'm writing this because your Percent.Active variable is only a binary 0/1 outcome, and I'm not sure whether that is on purpose. However, even if your outcome is not binary, the plm package might help, and you will find other packages mentioned in that post.
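As a rough sketch of what a plm call could look like (the index and column names TrialID, StartDateMonthsYrs and Sponsor are assumptions standing in for your actual variables):
library(plm)
pdat <- pdata.frame(data.percentage, index = c("TrialID", "StartDateMonthsYrs"))
q.plm <- plm(Percent.Active ~ Sponsor, data = pdat, model = "within")   # fixed-effects panel model
summary(q.plm)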

regression - bounded dependent variable - model choice

I am working on a problem where I want to see if a measure (test) is a good predictor of the outcome variable (performance). Performance is a bounded variable between 0 and 100. I am only thinking about the methodology for now and not working with the data yet.
I am aware that there are different models and methods that deal with bounded dependent variables, but from my understanding these are mainly useful if one is interested in prediction?
I am interested in how much variance of the dependent variable (performance) is explained by my measure (test). I am not interested in predicting specific outcomes.
Is it OK to just use normal regression?
Do I need to account for the bounded dependent variable somehow?
You can rescale your dependent variable to the [0, 1] interval and run a logistic-type regression, whose link function maps every fitted value into that range.
If you can, use a fractional logit model, which is typically used to model continuous outcomes in the [0, 1] interval.
Alternatively, if you are into machine learning, you can implement a neural network regressor with one output node and a sigmoid activation function.
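A minimal sketch of the fractional logit approach in base R, assuming a data frame dat with columns performance (0-100) and test (names are illustrative):
dat$perf01 <- dat$performance / 100                      # rescale to [0, 1]
frac_fit <- glm(perf01 ~ test, data = dat,
                family = quasibinomial(link = "logit"))  # fractional logit via a quasi-binomial GLM
summary(frac_fit)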

Ordered Probit R

I'm trying to create an ordered probit model in R. My independent variable is categorical and my dependent variable is ordinal. I'm using the polr command, and it does run. When I run the command, I get the log odds for the different variables. I have converted them into odds ratios using the exp() function. As far as I understand it, these odds ratios tell me the probability of my dependent variable going up one category every time my independent variable "goes up" one category. Is that correct? I'm somewhat confused because, in the case of the independent variable, it's not really an increase, since these are just categories.
My second question concerns the interpretation of the polr output. All I get are the odds ratios. How would you recommend I obtain additional information on the suitability of the ordered probit model? Thanks!
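For reference, a minimal sketch of the kind of model described, assuming MASS::polr; the data frame dat and the columns outcome (an ordered factor) and group (a factor) are hypothetical:
library(MASS)
fit <- polr(outcome ~ group, data = dat, method = "probit", Hess = TRUE)
summary(fit)      # coefficients and cut-points on the latent (probit) scale
exp(coef(fit))    # exponentiated coefficients are odds ratios only under a logit link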
