Master thesis help: autocorrelation/Lagrange test for a master thesis in political science - R

I'm currently trying to determine how many lags I should include in my linear regression analysis in R.
The study asks whether the presence of commercial military actors (CMA) correlates with, or causes, more military and/or civilian deaths. My supervisor is very keen on me using the Lagrange multiplier test to decide how many lags I need. However, he is not an R user and can't help me implement it. He also wants me to include panel-corrected standard errors (PCSE) as proposed by Beck and Katz.
Short variable description
DV = log_military_cas; a log transformation of yearly military deaths per country
IV = CMA; dummy-coded variable indicating CMA presence in a given country-year (1) or no presence (0)
lag variable = lag_md; log_md lagged one year
DATA = lagr
This is what my supervisor sent me:
Testing for serial correlation. This is what I wrote down in my notes as a grad student:
Using the Lagrange Multiplier test first recommended by Engle (1984) (and also used by Beck and Katz (1996)), this is done in two steps: 1) estimate the model and save the residuals, and 2) regress these residuals on their first lag and the independent variable. If the lag of the residual is statistically significant in the second regression, more lags of the dependent variable are needed.
<-- So just do this, but with a model without any lags of the dependent variable. If you find serial correlation, include a lag of the DV and test again.
My question is twofold: 1) what am I doing wrong in the attached code, and 2) should the baseline regression include PCSE?
# no lag
lagtest_0a <- lm(log_military_cas ~ CMA + as.factor(country) + as.factor(year), data = lagr)
# save residuals
lagr$Risid_0 <- resid(lagtest_0a)
lagtest_0b <- lm(log_military_cas ~ CMA + Risid_0 + as.factor(country) + as.factor(year), data = lagr)
summary(lagtest_0b)
# Risid_0 is significant, so I need at least one lag
# lag 1
lagtest_1a <- lm(log_military_cas ~ CMA + lag_md + as.factor(country) + as.factor(year), data = lagr)
# save new residuals
lagr$Risid1 <- resid(lagtest_1a)
# here the following error appears:
Error in `$<-.data.frame`(`*tmp*`, Risid1, value = c(`2` = 1.84005148256506, :
replacement has 2855 rows, data has 2856
# Then I'm thinking: maybe I shouldn't store the new residuals in the lagr data frame, so I try storing them as a standalone object instead.
# save new residuals in a new way
Rs_lagtest_md1 <- resid(lagtest_1a)
# rerun model
lagtest1 <- lm(log_military_cas ~ CMA + Rs_lagtest_md1 + as.factor(country) + as.factor(year), data = lagr)
# Then the following error appears:
Error in model.frame.default(formula = log_military_cas ~ CMA + Rs_lagtest_md1 + :
variable lengths differ (found for 'Rs_lagtest_md1')
It seems the problem is that when I include lag_md (which has NAs in the first year, since it is lagged), the variables no longer have the same length. As far as I know, R omits NAs by default; I even tried specifying na.action = na.omit, but the same error appears.
I hope someone can help me.
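For what it's worth, here is a minimal sketch of one way to implement the supervisor's recipe while keeping the residuals aligned with lagr: na.exclude pads the dropped rows with NA, so resid() returns a vector of the full length, and a lagged residual can then be built within each country. The helper column lag_resid and the assumption that lagr can be sorted by country and year are illustrative, not part of the original question.

# A sketch, assuming lagr can be sorted by country and year;
# lag_resid is a helper name introduced here for illustration.
lagr <- lagr[order(lagr$country, lagr$year), ]

# step 1: estimate the model with one lag of the DV; na.exclude keeps the
# residual vector the same length as lagr (NA where rows were dropped)
lagtest_1a <- lm(log_military_cas ~ CMA + lag_md + as.factor(country) + as.factor(year),
                 data = lagr, na.action = na.exclude)
lagr$Risid1 <- resid(lagtest_1a)

# step 2: regress the residuals on their own first lag (within country)
# and the independent variable, per the supervisor's notes
lagr$lag_resid <- ave(lagr$Risid1, lagr$country,
                      FUN = function(x) c(NA, head(x, -1)))
lagtest_1b <- lm(Risid1 ~ lag_resid + CMA, data = lagr)
summary(lagtest_1b)  # a significant lag_resid would suggest adding another DV lag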

Related

R and multiple time series and Error in model.frame.default: variable lengths differ

I am new to R and I am using it to analyse time series data (I am also new to this).
I have quarterly data for 15 years and I am interested in exploring the interplay between drinking and smoking rates in young people - treating smoking as the outcome variable. I was advised to use the gls command in the nlme package as this would allow me to include AR and MA terms. I know I could use more complex approaches like ARIMAX but as a first step, I would like to use simpler models.
After loading the data, I specify the time series:
data.ts = ts(data=data$smoke, frequency=4, start=c(data[1, "Year"], data[1, "Quarter"]))
data.ts.dec = decompose(data.ts)
After decomposing the data and running some tests (KPSS and ADF), it is clear that the data are not stationary, so I differenced them:
diff_dv <- diff(data$smoke, differences = 1)
plot.ts(diff_dv, main = "differenced")
data.diff.ts = ts(diff_dv, frequency = 4, start = c(data[1, "Year"], data[1, "Quarter"]))
The ACF and PACF plots suggest AR(2) terms should be included, so I set up the model as:
mod.gls = gls(diff_dv ~ drink+time , data = data,
correlation=corARMA(p=2), method="ML")
However, when I run this command I get the following:
"Error in model.frame.default: variable lengths differ".
I understand from previous posts that this is due to the differencing: diff_dv is now shorter than the other variables. I have attempted to fix this by modifying the code, but neither approach works:
mod.gls = gls(diff_dv ~ drink+time , data = data[1:(length(data)-1), ],
correlation=corARMA(p=2), method="ML")
mod.gls = gls(I(c(diff(smoke), NA)) ~ drink+time+as.factor(quarterly) , data = data,
correlation=corARMA(p=2), method="ML")
Can anyone help with this? Is there a workaround which would allow me to run the -gls- command or is there an alternative approach which would be equivalent to the -gls- command?
As a side question, is it OK to include time as I do - a variable with values 1 to 60? A similar question is for the quarters which I included as dummies to adjust for possible seasonality - is this OK?
Your help is greatly appreciated!
Specify na.action = na.omit or na.action = na.exclude to omit the rows with NAs. Here is an example using the Ovary data set built into nlme. See ?na.fail for info on the differences between these two.
library(nlme)  # provides gls(), corAR1() and the Ovary data
Ovary2 <- transform(Ovary, dfoll = c(NA, diff(follicles)))
gls(dfoll ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary2,
    correlation = corAR1(form = ~ 1 | Mare), na.action = na.exclude)
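A tiny illustration (added here, not part of the original answer) of the practical difference ?na.fail alludes to: with na.omit the residual vector is shortened, while with na.exclude it is padded with NA so it lines up with the original rows.

d <- data.frame(y = c(1, 2, NA, 4), x = c(1, 2, 3, 4))
length(resid(lm(y ~ x, data = d, na.action = na.omit)))     # 3: the NA row is dropped
length(resid(lm(y ~ x, data = d, na.action = na.exclude)))  # 4: an NA is kept in position 3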

Using panel regression on the Hedonic data with the plm package in R

I am trying to run a panel regression for an unbalanced panel in R using the plm package, with the 'Hedonic' data.
I was trying to replicate something similar that is done in the following paper: http://ftp.uni-bayreuth.de/math/statlib/R/CRAN/doc/vignettes/plm/plmEN.pdf (page 14, 3.2.5 Unbalanced Panel).
My code looks something like this:
form = mv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + blacks + lstat
ba = plm(form, data = Hedonic)
However, I am getting the following error on execution:
Error in names(y) <- namesy :
'names' attribute [506] must be the same length as the vector [0]
traceback() yields the following result:
4: pmodel.response.pFormula(formula, data, model = model, effect = effect,
theta = theta)
3: pmodel.response(formula, data, model = model, effect = effect,
theta = theta)
2: plm.fit(formula, data, model, effect, random.method, random.dfcor,
inst.method)
1: plm(form, data = Hedonic)
I am new to panel regression and would be really grateful if someone can help me with this issue.
Thanks.
That paper is ten years old, and I'm not sure plm works like that. The latest docs are here https://cran.r-project.org/web/packages/plm/vignettes/plm.pdf
Your problem arises because, as the docs put it, "the current version of plm is capable of working with a regular data.frame without any further transformation, provided that the individual and time indexes are in the first two columns."
The Hedonic data set does not have individual and time indexes in the first two columns. I'm not sure where the individual and time indexes are in the data, but if I specify townid for the index I at least get something that runs:
> p <- plm(mv~crim,data=Hedonic)
Error in names(y) <- namesy :
'names' attribute [506] must be the same length as the vector [0]
> p <- plm(mv~crim,data=Hedonic, index="townid")
> p
Model Formula: mv ~ crim
Coefficients:
crim
-0.0097455
This is because, when you don't specify id and time indexes, plm tries to use the first two columns, and in Hedonic those give unique numbers for the id, so the whole model falls apart.
If you look at the examples in help(plm) you might notice that the first two columns in all the data sets define the id and the time.
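A sketch of the same idea with the panel structure declared up front via pdata.frame; this assumes a reasonably recent plm version and that townid is the right individual identifier (the time index is then generated from the observation order within each town).

library(plm)
data("Hedonic", package = "plm")

# declare townid as the individual index; plm adds an artificial time index
Hed <- pdata.frame(Hedonic, index = "townid")
p <- plm(mv ~ crim, data = Hed)
summary(p)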

Error with ZIP and ZINB models upon subsetting and factoring data

I am trying to run ZIP and ZINB models to look at some of the factors which might help explain disease (orf) distribution within 8 geographical regions. The models work fine for some regions but not others; however, upon factoring the variables and running the model in R I get an error message.
How can I solve this problem, or is there a model that might work better with the subset? The analysis will only make sense when it is uniform across all regions.
zinb3 = zeroinfl(Cases2012 ~ Precip+ Altitude +factor(Breed)+ factor(Farming.Practise)+factor(Lambing.Management)+ factor(Thistles) ,data=orf3, dist="negbin",link="logit")
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 2.99934e-24
Results after fitting zerotrunc & glm as suggested by @Achim Zeileis. How do I interpret the zerotrunc output, given that there are no p-values? Also, how can I correct the error with glm?
zerotrunc(Cases2012 ~ Flock2012+Stocking.Density2012+ Precip+ Altitude +factor(Breed)+ factor(Farming.Practise)+factor(Lambing.Management)+ factor(Thistles),data=orf1, subset = Cases2012> 0)
Call:
zerotrunc(formula = Cases2012 ~ Flock2012 + Stocking.Density2012 + Precip + Altitude +
factor(Breed) + factor(Farming.Practise) + factor(Lambing.Management) + factor(Thistles),
data = orf1, subset = Cases2012 > 0)
Coefficients (truncated poisson with log link):
(Intercept) Flock2012 Stocking.Density2012
14.1427130 -0.0001318 -0.0871504
Precip Altitude factor(Breed)2
-0.1467075 -0.0115919 -3.2138767
factor(Farming.Practise)2 factor(Lambing.Management)2 factor(Thistles)3
1.3699477 -2.9790725 2.0403543
factor(Thistles)4
0.8685876
glm(factor(Cases2012 ~ 0) ~ Precip+ Altitude +factor(Breed)+ factor(Farming.Practise)+factor(Lambing.Management)+ factor(Thistles) +Flock2012+Stocking.Density2012 ,data=orf1, family = binomial)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
It's hard to say exactly what is going on based on the information provided. However, I suspect that the data in some regions do not allow fitting the specified model. For example, there might be regions where certain factor levels (of Breed, Farming.Practise, Lambing.Management or Thistles) only have zero values (or only non-zero values, though that is less frequent in practice). Then the coefficient estimates often degenerate, so that the associated zero-inflation probability goes to 1 and the count coefficient cannot be estimated.
It's typically easier to separate these effects by using the hurdle model rather than the zero-inflation model. The two parts of the model can then also be fitted separately by glm(factor(y > 0) ~ ..., family = binomial) and zerotrunc(y ~ ..., subset = y > 0). The latter function is essentially the same code that pscl uses, but factored out into a standalone function in the countreg package on R-Forge (not yet on CRAN).
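A sketch of the two-part fit described above, using the question's variable names; countreg comes from R-Forge as noted, and pscl::hurdle is the CRAN alternative that fits both parts in one call.

# zero hurdle: which observations have any cases at all?
zero_part <- glm(factor(Cases2012 > 0) ~ Precip + Altitude + factor(Breed) +
                   factor(Farming.Practise) + factor(Lambing.Management) + factor(Thistles),
                 data = orf1, family = binomial)
summary(zero_part)

# count part: how many cases, given that there are some?
library(countreg)  # install.packages("countreg", repos = "http://R-Forge.R-project.org")
count_part <- zerotrunc(Cases2012 ~ Precip + Altitude + factor(Breed) +
                          factor(Farming.Practise) + factor(Lambing.Management) + factor(Thistles),
                        data = orf1, subset = Cases2012 > 0, dist = "negbin")
summary(count_part)  # unlike the printed coefficients above, summary() reports standard errors and tests

# degenerate factor levels (all zero, or all positive, within a region) will show up
# here as huge coefficients/standard errors or convergence warnings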

I get many predictions after running predict.lm in R for 1 row of new input data

I used the ApacheData data set, with 83,784 rows, to build a linear regression model:
fit <-lm(tomorrow_apache~ as.factor(state_today)
+as.numeric(daily_creat)
+ as.numeric(last1yr_min_hosp_icu_MDRD)
+as.numeric(bun)
+as.numeric(urin)
+as.numeric(category6)
+as.numeric(category7)
+as.numeric(other_fluid)
+ as.factor(daily)
+ as.factor(age)
+ as.numeric(apache3)
+ as.factor(mv)
+ as.factor(icu_loc)
+ as.factor(liver_tr_before_admit)
+ as.numeric(min_GCS)
+ as.numeric(min_PH)
+ as.numeric(previous_day_creat)
+ as.numeric(previous_day_bun) ,ApacheData)
And I want to use this model to predict a new input so I give each predictor variable a value:
predict(fit, data=data.frame(state_today=1, daily_creat=2.3, last1yr_min_hosp_icu_MDRD=3, bun=10, urin=0.01, category6=10, category7=20, other_fluid=0, daily=2 , age=25, apache3=12, mv=1, icu_loc=1, liver_tr_before_admit=0, min_GCS=20, min_PH=3, previous_day_creat=2.1, previous_day_bun=14))
I expect a single value as the prediction for this new input, but I get many, many predictions! I don't know why this is happening. What am I doing wrong?
Thanks a lot for your time!
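One likely explanation, offered here as a side note rather than as part of the answer below: predict.lm takes new observations through its newdata argument; an argument named data is silently swallowed by ..., so the call falls back to the fitted values for all 83,784 training rows. A minimal sketch with the same values as in the question:

# hypothetical one-row data frame; every predictor in the formula must be present,
# and any factors must use levels that occur in ApacheData
new_obs <- data.frame(state_today = 1, daily_creat = 2.3, last1yr_min_hosp_icu_MDRD = 3,
                      bun = 10, urin = 0.01, category6 = 10, category7 = 20,
                      other_fluid = 0, daily = 2, age = 25, apache3 = 12, mv = 1,
                      icu_loc = 1, liver_tr_before_admit = 0, min_GCS = 20,
                      min_PH = 3, previous_day_creat = 2.1, previous_day_bun = 14)
predict(fit, newdata = new_obs)  # returns a single predicted value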
You may also want to try the excellent effects package in R (?effects). It's very useful for graphing the predicted values from your model by setting the inputs on the right-hand side of the equation to particular values. I can't reproduce the example you've given in your question, but to give you an idea of how to quickly extract predicted values in R and then plot them (since this is vital to understanding what they mean), here's a toy example using the built-in data sets in R:
install.packages("effects") # installs the "effects" package in R
library(effects) # loads the "effects" package
data(Prestige) # loads in-built dataset
m <- lm(prestige ~ income + education + type, data=Prestige)
# this last step creates predicted values of the outcome based on a range of values
# on the "income" variable and holding the other inputs constant at their mean values
eff <- effect("income", m, default.levels=10)
plot(eff) # graphs the predicted values

Survival analysis: AFT model, simexaft package in R

We are trying to reproduce the results of an accelerated failure time (AFT) model in R which has been coded in SAS.
The data set we use is here
There you can find the SAS code as well.
formula <- Surv(Duration, Censor) ~ Acq_Expense + Acq_Expense_SQ + Ret_Expense + Ret_Expense_SQ + Crossbuy + Frequency + Frequency_SQ + Industry + Revenue + Employees
out1 <- survreg(formula = formula, data = daten[daten$Acquisition == 1, ], dist = "weibull")
summary(out1)
ind <- c("Duration", "Censor")
err.mat <- ???
out2 <- simexaft(formula = formula, data = daten [daten$Acquisition==1, ], SIMEXvariable = ind, repeated = FALSE, err.mat = err.mat, dist = "weibull")
summary(out2)
Our question is: how should the err.mat term be defined?
err.mat relates to the variables with measurement error. Since our data set is right-censored, I thought the variables with measurement error are probably Duration and/or Censor. But it is not as simple as that: err.mat must be a square, symmetric numeric matrix.
If you read the Journal of Statistical Software article (January 2012, Volume 46) describing the simexaft package, it becomes clear that, in the situation without repeated measurements from which to estimate the measurement errors, you must supply these estimates yourself from domain knowledge. See the example on pages 6-8, and also the cited "Statistics in Medicine" article available on Dr. Yi's website. In that example the variables with measurement error are the first two predictors, systolic blood pressure (SBP) and serum cholesterol (CHOL). If you are using the text from which you are extracting the data, you will need to read the chapter text (which does not appear to be available at that website) to determine what assumptions they make about the measurement errors.
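For concreteness, a sketch of what the call might look like once measurement-error assumptions are made; the choice of Acq_Expense and Ret_Expense as the error-prone covariates and the variances 0.1 and 0.2 are placeholders for illustration, not values taken from the cited chapter.

library(survival)
library(simexaft)

# placeholder assumption: Acq_Expense and Ret_Expense are measured with error,
# with assumed error variances 0.1 and 0.2 and no error covariance
ind <- c("Acq_Expense", "Ret_Expense")
err.mat <- diag(c(0.1, 0.2))  # square, symmetric: one row/column per SIMEXvariable
dimnames(err.mat) <- list(ind, ind)

out2 <- simexaft(formula = formula, data = daten[daten$Acquisition == 1, ],
                 SIMEXvariable = ind, repeated = FALSE, err.mat = err.mat,
                 dist = "weibull")
summary(out2)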
