Time fixed effects in Stata and R

I've looked at many resources but can't seem to find an answer to this.
Basically I have various time series related to weather and would like to perform an OLS estimation. As a simple example:
y = constant + b1*rain + b2*sunshine
The data are hourly and span 5 years. The amount of rain and sunshine in a given hour is related to the amount in the preceding hour, so I address this either with an autoregressive process or by first-differencing the equation above and dropping the constant.
However, with weather there are also hourly patterns at play (for example, more sunshine at certain times of day), monthly patterns, and yearly patterns (some months and years have uncharacteristic amounts of sunshine or rain). For this reason I would like to use time fixed effects, which would essentially be dummy variables for each hour-month-year combination in the sample. With 5 years of data this means 5 years * 12 months * 24 hours = 1,440 fixed-effects dummies.
The question is whether there is any way to create these dummies automatically in the regression command. Alternatively, how would I create the dummies before running the regression and then include all 1,440 of them in the command?
I'm open to doing this in either Stata or R, so if you know how to do it in either of these it would be much appreciated.
TL;DR: how do I create 1,440 time fixed-effects dummies (one for each hour-month-year combination over 5 years) and then use them in the regression command?

In Stata: reghdfe seems perfect for this. I made some example data with the structure you specified:
clear
set obs 100000
gen hour = floor(runiform()*24) + 1    // hour of day, 1-24
gen year = floor(runiform()*5) + 1990  // year, 1990-1994
gen month = floor(runiform()*12) + 1   // month, 1-12
gen outcome = rnormal()
Then using "#" in reghdfe we can include fixed effects for the interactions of the variables in the absorb() option:
reghdfe outcome, absorb(hour#year#month)
In the output, you can verify that the number of categories is what you wanted, 1,440:
HDFE Linear regression Number of obs = 100,000
Absorbing 1 HDFE group F( 0, 98560) = .
Prob > F = .
R-squared = 0.0142
Adj R-squared = -0.0002
Within R-sq. = 0.0000
Root MSE = 1.0010
------------------------------------------------------------------------------
outcome | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_cons | -.0020881 .0031653 -0.66 0.509 -.0082921 .0041158
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-------------------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
---------------------+---------------------------------------|
hour#year#month | 1440 0 1440 |
-------------------------------------------------------------+
areg can also do this, but it needs an egen group step to make a single numeric identifier for the intersection, because its absorb() option takes one variable rather than the hashtag notation. Using areg you would get the same result with
egen hourXyearXmonth = group(hour year month)
areg outcome, absorb(hourXyearXmonth)
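In R, a minimal sketch of the same idea (not part of the Stata answer above; df, y, rain, and sunshine are placeholder names for your own data) is to build one hour-month-year factor and let lm() expand it into the 1,440 dummies:
# one factor level per hour-month-year cell (1,440 levels for 5 years of data)
df$hmy <- interaction(df$hour, df$month, df$year, drop = TRUE)
fit <- lm(y ~ rain + sunshine + hmy, data = df)
# Estimating 1,440 dummies explicitly is slow on large data; fixed-effects packages
# that "absorb" them (e.g. fixest::feols(y ~ rain + sunshine | hour^month^year, data = df))
# play the same role as reghdfe in Stata.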

Related

R won't include an important random term within my glm due to high correlation - but I need to account for it

I have a big data frame including abundance of bats per year, and I would like to model the population trend over those years in R. I need to include year additionally as a random effect, because my data points aren't independent: the bat population one year directly affects the population the next year (if there are 10 bats one year they will likely still be alive the next year). I have a big dataset, but I have used the group_by() function to create a simpler data frame, shown below (an example of the data frame layout). In my bigger dataset I also have month and day.
year    total individuals
2000    39
2001    84
etc.    etc.
Here is the model I wish to use with lme4.
BLE_glm6 <- glm(total_indv ~ year + (year|year), data = BLE_total, family = poisson)
Because year is the predictor variable, R does not like adding year again, because it is highly correlated. So I am wondering: how do I account for the individuals one year directly affecting the number of individuals the next year if I can't include year as a random effect in the model?
There are a few possibilities. The most obvious would be to fit a Poisson model with the number of bats in the previous year as an offset:
## set up lagged variable
BLE_total <- transform(BLE_total,
    total_indv_prev = c(NA, total_indv[-length(total_indv)]))
## or use dplyr::lag() if you like the tidyverse
glm(total_indv ~ year + offset(log(total_indv_prev)), data = BLE_total,
    family = poisson)
This will fit the model
mu = total_indv_prev*exp(beta_0 + beta_1*year)
total_indv ~ Poisson(mu)
i.e. exp(beta_0 + beta_1*year) will be the predicted ratio between the current and previous year. (See here for further explanation of the log-offset in a Poisson model.)
If you want year as a random effect (sorry, read the question too fast), then
library(lme4)
glmer(total_indv ~ offset(log(total_indv_prev)) + (1|year), ...)
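For completeness, a minimal sketch of what a full call might look like, assuming the same lagged variable, data frame, and Poisson family as in the glm() fit above:
library(lme4)
# random intercept for year plus last year's count as an offset; note that a
# year-level intercept needs replication within years (e.g. the month/day data
# mentioned in the question), since the grouped one-row-per-year data would give
# as many grouping levels as observations
BLE_glmm <- glmer(total_indv ~ offset(log(total_indv_prev)) + (1 | year),
                  data = BLE_total, family = poisson)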

Pairwise comparisons after Glm.nb

I am studying some behavioral data based on the scan sampling method (number of occurrences of each behavior recorded among the total number of occurrences of all behaviors) in rabbits. I study two main effects that are the age of the animals (3 levels) and the time they are outside on a pasture (2 levels).
I have used this model for let's say the Grooming behavior:
glm.nb.Grooming = glm.nb(Grooming ~ Age * Time, data = sa)
The Anova showed an effect of age and time on the expression of this behavior (P_value(Time) < 0.05 and P_value(Age) < 0.05, with no effect of the interaction). I want to present my data in a Time x Age table because there is a Time x Age effect on some other behaviors.
When I run pairs() to find out which values differ from the others, I run into a problem (T3 and T8 denote Time outside and the numbers are the ages):
Grooming.em = emmeans(glm.nb.Grooming, ~ Time * Age, type = "response") ; Grooming.em ; pairs(Grooming.em)
None of the pairwise comparisons has a p-value under 5%, despite the effect of Age and Time shown by the Anova.
I suppose it is because some of the SEs are very high... or because of the log(0) needed to make these comparisons, but I have no idea how to fix this. Can you help me? Thanks a lot.

GAMM4 smoothing spline for time variable

I am constructing a GAMM model (for the first time) to compare longitudinal slopes of cognitive performance in a Bipolar Disorder (BD) sample with a control (HC) sample. The study design is referred to as an "accelerated longitudinal study", where participants across a large age span (25-60) are followed for 2 years (HC group) and 4 years (BD group).
Hypothesis (1) The BD group’s yearly rate of change on processing speed will be higher overall than the healthy control group, suggesting a more rapid cognitive decline in BD than seen in HC.
Here is my R code formula, which I think is a bit off:
RUN2 <- gamm4(BACS_SC_R ~ group + s(VISITMONTH, bs = "cc") +
              s(VISITMONTH, bs = "cc", by = group),
              random = ~(1|SUBNUM), data = Df, REML = TRUE)
The visitmonth variable is coded as "months from first visit." Visit 1 would equal 0, and the following visits (3 per year) are coded as months elapsed from visit 1. Is a cyclic smooth correct in this case?
I plan on adding additional variables (i.e peripheral inflammation) to the model to predict individual slopes of cognitive trajectories in BD.
If you have any other suggestions, it would be greatly appreciated. Thank you!
If VISITMONTH runs over the whole study (i.e. for a BD observation we would have VISITMONTH in {0, 1, 2, ..., 48} for the four years), then no, you don't want a cyclic smooth unless there is some 4-year periodicity that would mean the endpoints 0 and 48 should be constrained to be the same.
The default thin plate spline bs = 'tp' should suffice.
I'm also assuming that there are many possible values for VISITMONTH as not everyone was followed up at the same monthly intervals? Otherwise you're not going to have many degrees of freedom available for the temporal smooth.
Is group coded as an ordered factor here? If so that's great; the by smooth will encode the difference between the reference level (be sure to set HC as the reference level) and the other level so you can see directly in the summary a test for a difference of the BD group.
It's not clear how you are dealing with the fact that the HC group are followed up over fewer months than the BD group. It looks like the model has VISITMONTH representing the full time of the study, not just a within-year term. So how do you intend to compare the BD group with the HC group for the 2 years where the HC group is not observed?
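A minimal sketch of the setup described above, assuming group has the two levels "HC" and "BD" (the level names are a guess) and using the default thin plate basis with an ordered factor so the by-smooth is a difference smooth:
library(gamm4)
Df$group  <- factor(Df$group, levels = c("HC", "BD"))    # HC first, i.e. the reference level
Df$ogroup <- ordered(Df$group, levels = c("HC", "BD"))   # ordered factor for a difference smooth
RUN2b <- gamm4(BACS_SC_R ~ ogroup + s(VISITMONTH, bs = "tp") +
                 s(VISITMONTH, bs = "tp", by = ogroup),
               random = ~(1 | SUBNUM), data = Df, REML = TRUE)
summary(RUN2b$gam)   # the ogroup difference smooth gives a direct test of BD vs HC trajectories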

mixed models: r spss difference

I want to do a mixed-model analysis on my data. I used both R and SPSS to verify whether my R results were correct, but the results differ enormously for one variable. I can't figure out why there is such a large difference; your help would be appreciated! I have already run various checks on the dataset.
DV: score on questionnaire (QUES)
IV: time (after intervention, 3 month follow-up, 9 month follow-up)
IV: group (two different interventions)
IV: score on questionnaire before the intervention (QUES_pre)
random intercept for participants
SPSS code:
MIXED QUES BY TIME GROUP WITH QUES_pre
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=TIME GROUP QUES_pre TIME*GROUP | SSTYPE(3)
/METHOD=REML
/PRINT=SOLUTION TESTCOV
/RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(AR1)
/REPEATED=Index1 | SUBJECT(ID) COVTYPE(AR1).
R code:
model1 <- lme(QUES ~ group + time + time:group + QUES_pre, random = ~1|ID,
              correlation = corAR1(0, form = ~1|Onderzoeksnummer),
              data = data, na.action = na.omit, method = "REML")
The biggest difference lies in the effect of group. With the SPSS code the p-value is .045; with the R code the p-value is .28. Is there a mistake in my code, or does anyone have a suggestion of something else that might be going wrong?

R: Regression with a holdout of certain variables

I'm fitting a multiple linear regression model using lm(), where Y is the response variable (e.g., return on interest) and the others are explanatory variables (100+ cases, 30+ variables).
Certain variables are considered key variables (concerning investment). When I ran lm(), R returned a model with an adjusted R-squared of 97%, but some of the key variables are not significant predictors.
Is there a way to run the regression while keeping all of the key variables in the model (as significant predictors)? It doesn't matter if the adjusted R-squared decreases.
If regression can't do this, is there another methodology?
Thank you!
==========================
the data set is uploaded
https://www.dropbox.com/s/gh61obgn2jr043y/df.csv
==========================
Additional questions:
What if some variables have an impact that carries over from the previous period into the current period?
Example: someone takes a pill in the morning with breakfast, and the effect of the pill might last beyond lunch (when he/she takes the 2nd pill).
I suppose I need to consider a data transformation.
* My first choice is to add a carry-over rate: obs.2_trans = obs.2 + c-o rate * obs.1
* Maybe I also need to consider the decay of the pill effect itself, so an s-curve or an exponential transformation may also be necessary.
Taking the variable main1 as an example, I can use trial and error to find an ideal carry-over rate and s-curve parameter, starting from 0.5 and testing in steps of 0.05, up to 1 or down to 0, until I get the best model score - say, the lowest AIC or highest R-squared.
That is already a huge amount to test.
If I need to test more than 3 variables at the same time, how could I manage that in R?
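A minimal sketch of the single-variable version of such a grid search (illustrative only, assuming df is the linked data set with its Y and Main1 columns):
rates <- seq(0, 1, by = 0.05)                            # candidate carry-over rates
aics <- sapply(rates, function(r) {
  d <- df
  d$Main1_co <- d$Main1 + r * c(0, head(d$Main1, -1))    # obs.2_trans = obs.2 + r * obs.1 (0 for the first period)
  AIC(lm(Y ~ Main1_co, data = d))                        # score the single-variable fit by AIC
})
best_rate <- rates[which.min(aics)]
# For several variables at once, expand.grid() over one rate per variable enumerates
# all combinations, although the number of fits grows quickly.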
Thank you!
First, a note on "significance". For each variable included in a model, the linear modeling packages report the likelihood that the coefficient of this variable is different from zero (actually, they report p = 1 - L). We say that, if L is larger (smaller p), then the coefficient is "more significant". So, while it is quite reasonable to talk about one variable being "more significant" than another, there is no absolute standard for asserting "significant" vs. "not significant". In most scientific research, the cutoff is L > 0.95 (p < 0.05). But this is completely arbitrary, and there are many exceptions. Recall that CERN was unwilling to assert the existence of the Higgs boson until it had collected enough data to demonstrate its effect at 6 sigma, which corresponds roughly to p < 1 × 10^-9. At the other extreme, many social science studies assert significance at p < 0.2 (because of the higher inherent variability and usually small number of samples). So excluding a variable from a model because it is "not significant" really has no meaning. On the other hand, you would be hard pressed to include a variable with a high p-value while excluding another variable with a lower one.
Second, if your variables are highly correlated (which they are in your case), then it is quite common that removing one variable from a model changes all the p-values greatly. A retained variable that had a high p-value (less significant) might suddenly have a low p-value (more significant), just because you removed a completely different variable from the model. Consequently, trying to optimize a fit manually is usually a bad idea.
Fortunately, there are many algorithms that do this for you. One popular approach starts with a model that has all the variables. At each step, the least significant variable is removed and the resulting model is compared to the model at the previous step. If removing this variable significantly degrades the model, based on some metric, the process stops. A commonly used metric is the Akaike information criterion (AIC), and in R we can optimize a model based on the AIC criterion using stepAIC(...) in the MASS package.
Third, the validity of regression models depends on certain assumptions, especially these two: the error variance is constant (does not depend on y), and the distribution of error is approximately normal. If these assumptions are not met, the p-values are completely meaningless!! Once we have fitted a model we can check these assumptions using a residual plot and a Q-Q plot. It is essential that you do this for any candidate model!
Finally, the presence of outliers frequently distorts the model significantly (almost by definition!). This problem is amplified if your variables are highly correlated. So in your case it is very important to look for outliers, and see what happens when you remove them.
The code below rolls this all up.
library(MASS)
url <- "https://dl.dropboxusercontent.com/s/gh61obgn2jr043y/df.csv?dl=1&token_hash=AAGy0mFtfBEnXwRctgPHsLIaqk5temyrVx_Kd97cjZjf8w&expiry=1399567161"
df <- read.csv(url)
initial.fit <- lm(Y~.,df[,2:ncol(df)]) # fit with all variables (excluding PeriodID)
final.fit <- stepAIC(initial.fit) # best fit based on AIC
par(mfrow=c(2,2))
plot(initial.fit) # diagnostic plots for base model
plot(final.fit) # same for best model
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 11.38360 18.25028 0.624 0.53452
# Main1 911.38514 125.97018 7.235 2.24e-10 ***
# Main3 0.04424 0.02858 1.548 0.12547
# Main5 4.99797 1.94408 2.571 0.01195 *
# Main6 0.24500 0.10882 2.251 0.02703 *
# Sec1 150.21703 34.02206 4.415 3.05e-05 ***
# Third2 -0.11775 0.01700 -6.926 8.92e-10 ***
# Third3 -0.04718 0.01670 -2.826 0.00593 **
# ... (many other variables included)
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.76 on 82 degrees of freedom
# Multiple R-squared: 0.9824, Adjusted R-squared: 0.9779
# F-statistic: 218 on 21 and 82 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(initial.fit)
title("Base Model",outer=T,line=-2)
plot(final.fit)
title("Best Model (AIC)",outer=T,line=-2)
So you can see from this that the "best model", based on the AIC metric, does in fact include Main 1, 3, 5, and 6, but not Main 2 and 4. The residuals plot shows no dependence on y (which is good), and the Q-Q plot demonstrates approximate normality of the residuals (also good). On the other hand, the Leverage plot shows a couple of points (rows 33 and 85) with exceptionally high leverage, and the Q-Q plot shows these same points and row 47 as having residuals not really consistent with a normal distribution. So we can re-run the fits excluding these rows as follows.
initial.fit <- lm(Y~.,df[c(-33,-47,-85),2:ncol(df)])
final.fit <- stepAIC(initial.fit,trace=0)
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 27.11832 20.28556 1.337 0.185320
# Main1 1028.99836 125.25579 8.215 4.65e-12 ***
# Main2 2.04805 1.11804 1.832 0.070949 .
# Main3 0.03849 0.02615 1.472 0.145165
# Main4 -1.87427 0.94597 -1.981 0.051222 .
# Main5 3.54803 1.99372 1.780 0.079192 .
# Main6 0.20462 0.10360 1.975 0.051938 .
# Sec1 129.62384 35.11290 3.692 0.000420 ***
# Third2 -0.11289 0.01716 -6.579 5.66e-09 ***
# Third3 -0.02909 0.01623 -1.793 0.077060 .
# ... (many other variables included)
So excluding these rows results in a fit that has all the "Main" variables with p < 0.2, and all except Main 3 at p < 0.1 (90%). I'd want to look at these three rows and see if there is a legitimate reason to exclude them.
Finally, just because you have a model that fits your existing data well, does not mean that it will perform well as a predictive model. In particular, if you are trying to make predictions outside of the "model space" (equivalent to extrapolation), then your predictive power is likely to be poor.
Significance is determined by the relationships in your data... not by "I want them to be significant".
If the data say they are insignificant, then they are insignificant.
You are going to have a hard time getting any significance with 30 variables and only 100 observations. With only 100+ observations, you should be using only a few variables. With 30 variables, you'd need thousands of observations to get any significance.
Maybe start with the variables you think should be significant, and see what happens.
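A minimal sketch of that starting point, assuming the key variables are the Main* columns that appear in the fits above:
key.fit <- lm(Y ~ Main1 + Main2 + Main3 + Main4 + Main5 + Main6, data = df)
summary(key.fit)   # see how the key variables do on their own before adding anything else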
