"Simulating" a large number of regressions with different predictor values - r

Let's say I have the following data and I'm interested in examining some counterfactuals. In particular, I want to examine whether there would be changes in predicted income given a change in income. The best way I can think to do this is to write a loop that runs this regression 1:n. However, how do I also make adjustments to the data frame while running through the loop. I'm really hoping that there is a base R function or something in a package that someone can point me to.
df = data.frame(year=c(2000,2001,2002,2003,2004,2005,2006,2007,2009,2010),
income=c(100,50,70,80,50,40,60,100,90,80),
age=c(26,30,35,30,28,29,31,34,20,35),
gpa=c(2.8,3.5,3.9,4.0,2.1,2.65,2.9,3.2,3.3,3.1))
df
mod = lm(income ~ age + gpa, data=df)
summary(mod)
Here are some counter factuals that may be worth considering when looking at the relationship between age, gpa, and income.
# What is everyone in the class had a lower/higher gpa?
df$gpa2 = df$gpa + 0.55
# what if one person had a lower/higher gpa?
df$gpa2[3] = 1.6
# what if the most recent employee/person had a lower/higher gpa?
df[10,4] = 4.0
With or without looping, what would be the best way to "simulate" a large (1000+) number of regression models in order examine various counter factuals, and then save those results in some data structure? Is there a "counter factual" analysis package which could save me a bit of work?

Related

How to create a rolling linear regression in R?

I am trying to create (as the title suggests) a rolling linear regression equation on a set of data (daily returns of two variables, total of 257 observations for each, linked together by date, want to make the rolling window 100 observations). I have searched for rolling regression packages but I have not found one that works on my data. The two data pieces are stored within one data frame.
Also, I am pretty new to programming, so any advice would help.
Some of the code I have used is below.
WeightedIMV_VIX_returns_combined_ID100896 <- left_join(ID100896_WeightedIMV_daily_returns, ID100896_VIX_daily_returns, by=c("Date"))
head(WeightedIMV_VIX_returns_combined_ID100896, n=20)
plot(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896)#the data seems to be correlated enought to run a regression, doesnt matter which one you put first
ID100896.lm <- lm(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896)
summary(ID100896.lm) #so the estimate Intercept is 1.2370, estimate Slope is 5.8266.
termplot(ID100896.lm)
Again, sorry if this code is poor, or if I am missing any information that some of you may need to help. This is my first time on here! Just let me know what I can do better. Thanks!

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival to this purpose, but I am cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various stack-related websites, none of them were able to provide an answer. Also, I have noticed a lot of confusion regarding what a multinomial logit regression with fixed effects is (people use different names) and about the R packages implementing this function.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent take one choice.
Respondents are asked the same question every year. There is no apriori on the extent to which choice at time t is affected by the choice at t-1.
Now imagine to have a panel data recording these choices. The data, would look like this:
set.seed(123)
# number of observations
n <- 100
# number of possible choice
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = F)
# actual choice at time 1
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = T, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
x1 = x1, x2 = x2,
individual_fixed_effect = as.character(rep(1:n, years)),
time_fixed_effect = as.character(rep(1:years, each = n)),
stringsAsFactors = F)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# create wide dataset
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
group.var = "time_fixed_effect",
choice = "choice",
shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the function strata (see for additional details this tutorial). However, I am afraid that it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(formula("choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect)"), as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something that I am missing on the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems to be able to explicitly deal with panel structures using appropriate estimation strategy. For this reason, I have decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = individual_fixed_effect) and the group to which individuals belongs to (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables correlated with a specific choice, and which are randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help of the function mlogit, it is specified that one can use the argument panel to use panel techniques. To set panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables does not exist in my case, I can't use the argument rpar and then set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly interested in using mlogit, any efficient way to perform this task would be fine for me (e.g., I am ok with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

Delayed entry survival model. R vs Stata differences

I have some code in Stata that I'm trying to redo in R. I'm working on a delayed entry survival model and I want to limit the follow-up to 5 years. In Stata this is very easy and can be done as follows for example:
stset end, fail(failure) id(ID) origin(start) enter(entry) exit(time 5)
stcox var1
However, I'm having trouble recreating this in R. I've made a toy example limiting follow-up to 1000 days - here is the setup:
library(survival); library(foreign); library(rstpm2)
data(brcancer)
brcancer$start <- 0
# Make delayed entry time
brcancer$entry <- brcancer$rectime / 2
# Write to dta file for Stata
write.dta(brcancer, "brcancer.dta")
Ok so now we've set up an identical dataset for use in both R and Stata. Here is the Stata bit code and model result:
use "brcancer.dta", clear
stset rectime, fail(censrec) origin(start) enter(entry) exit(time 1000)
stcox hormon
And here is the R code and results:
# Limit follow-up to 1000 days
brcancer$limit <- ifelse(brcancer$rectime <1000, brcancer$rectime, 1000)
# Cox model
mod1 <- coxph(Surv(time=entry, time2= limit, event = censrec) ~ hormon, data=brcancer, ties = "breslow")
summary(mod1)
As you can see the R estimates and State estimates differ slightly, and I cannot figure out why. Have I set up the R model incorrectly to match Stata, or is there another reason the results differ?
Since the methods match on an avaialble dataset after recoding the deaths that occur after to termination date, I'm posting the relevant sections of my comment as an answer.
I also think that you should have changed any of the deaths at time greater than 1000 to be considered censored. (Notice that the numbers of events is quite different in the two sets of results.

lm() saving residuals with group_by with R- confused spss user

This is complete reEdit of my orignal question
Let's assume I'm working on RT data gathered in a repeated measure experiment. As part of my usual routine I always transform RT to natural logarytms and then compute a Z score for each RT within each partipant adjusting for trial number. This is typically done with a simple regression in SPSS syntax:
split file by subject.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rtLN
/METHOD=ENTER trial
/SAVE ZRESID.
split file off.
To reproduce same procedure in R generate data:
#load libraries
library(dplyr); library(magrittr)
#generate data
ob<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
ob<-factor(ob)
trial<-c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
rt<-c(300,305,290,315,320,320,350,355,330,365,370,370,560,565,570,575,560,570)
cond<-c("first","first","first","snd","snd","snd","first","first","first","snd","snd","snd","first","first","first","snd","snd","snd")
#Following variable is what I would get after using SPSS code
ZreSPSS<-c(0.4207,0.44871,-1.7779,0.47787,0.47958,-0.04897,0.45954,0.45487,-1.7962,0.43034,0.41075,0.0407,-0.6037,0.0113,0.61928,1.22038,-1.32533,0.07806)
sym<-data.frame(ob, trial, rt, cond, ZreSPSS)
I could apply a formula (blend of Mark's and Daniel's solution) to compute residuals from a lm(log(rt)~trial) regression but for some reason group_by is not working here
sym %<>%
group_by (ob) %>%
mutate(z=residuals(lm(log(rt)~trial)),
obM=mean(rt), obSd=sd(rt), zRev=z*obSd+obM)
Resulting values clearly show that grouping hasn't kicked in.
Any idea why it didn't work out?
Using dplyr and magrittr, you should be able to calculate z-scores within individual with this code (it breaks things into the groups you tell it to, then calculates within that group).
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN))
You should then be able to do use that in your model. However, one thing that may help your shift to R thinking is that you can likely build your model directly, instead of having to make all of these columns ahead of time. For example, using lme4 to treat subject as a random variable:
withRandVar <-
lmer(log(rt) ~ cond + (1|as.factor(subject))
, data = experiment)
Then, the residuals should already be on the correct scale. Further, if you use the z-scores, you probably should be plotting on that scale. I am not actually sure what running with the z-scores as the response gains you -- it seems like you would lose information about the degree of difference between the groups.
That is, if the groups are tight, but the difference between them varies by subject, a z-score may always show them as a similar number of z-scores away. Imagine, for example, that you have two subjects, one scores (1,1,1) on condition A and (3,3,3) on condition B, and a second subject that scores (1,1,1) and (5,5,5) -- both will give z-scores of (-.9,-.9,-.9) vs (.9,.9,.9) -- losing the information that the difference between A and B is larger in subject 2.
If, however, you really want to convert back, you can probably use this to store the subject means and sds, then multiply the residuals by subjSD and add subjMean.
experiment %<>%
group_by(subject) %>%
mutate(rtLN = log(rt)
, ZRE1 = scale(rtLN)
, subjMean = mean(rtLN)
, subjSD = sd(rtLN))
mylm <- lm(x~y)
rstandard(mylm)
This returns the standardized residuals of the function. To bind these to a variable you can do:
zresid <- rstandard(mylm)
EXAMPLE:
a<-rnorm(1:10,10)
b<-rnorm(1:10,10)
mylm <- lm(a~b)
mylm.zresid<-rstandard(mylm)
See also:
summary(mylm)
and
mylm$coefficients
mylm$fitted.values
mylm$xlevels
mylm$residuals
mylm$assign
mylm$call
mylm$effects
mylm$qr
mylm$terms
mylm$rank
mylm$df.residual
mylm$model

Expansion Regression models In R

I am doing a regression with several categorial variables and continuous variables mixed together. For simplify my question, I want to create a regression model that predicts the driving time given a certain driver in different zones with driving miles. That's say I have 5 different drivers and 2 zones in my training data.
I know I probably need to build 5*2=10 regression models for prediction. What I am using in R is
m <- lm(driving_time ~ factor(driver)+factor(zone)+miles)
But it seems like R doesn't expend the combination. My problem is if there are any smart way to do the expansion automatically in R. Or I have to write the 10 regression models one by one. Thank you.
Please read ?formula. + in a formula means include that variable as a main effect. You seem to be looking for an interaction term between driver and zone. You create an interaction term using the : operator. There is also a short cut to get both main and interaction effect via the * operator.
There is some confusion as to whether you want miles to also interact, but I'll assume not here as you only mention 2 x 5 terms.
foo <- transform(foo, driver = factor(driver), zone = factor(zone))
m <- lm(driving_time ~ driver * zone + miles, data = foo)
Here I assume your data are in data frame foo. The first line separates the data processing from the model specification/fitting by converting the variables of interest to factors before fitting.
The formula then specifies main and interactive effects for driver and zone plus main effect for miles.
If you want interactions between all three then:
m <- lm(driving_time ~ driver * zone * miles, data = foo)
or
m <- lm(driving_time ~ (driver + zone + miles)^3, data = foo)
would do that for you.

Resources