What is the R equivalent to the e(sample) command in Stata?

I'm running conditional logistic regression models in R as part of a discordant sibling pair analysis and I need to isolate the total n for each model. Also, I need to isolate the number and % of cases of the disease in the exposed and unexposed groups.
In Stata, e(sample) == 1 gives this info. Is there an equivalent way of accomplishing this in R?

In R, if you run a regression you create a regression object.
RegOb <- lm(y ~ x1 + x2, data)
Often people call RegOb directly, which uses the internal print method for this class of object. Alternatively, summary(RegOb) is popular (and people often assign its result).
However, RegOb contains a lot of information about the regression. Where in Stata you would use -ereturn list- to see what is saved, in R I recommend str(RegOb) or View(RegOb) to see everything that is stored. I don't remember the exact syntax offhand, but it will be something like:
RegOb$model
And since you have the original data, you can use a logical statement comparing the original data with the data actually used, which will give you the estimation sample.
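To make this concrete, here is a minimal sketch (the data and object names are mine) of how to recover the estimation sample from a fitted lm object using model.frame() and nobs():

```r
# Sketch: recovering the estimation sample from an lm/glm fit
# (names `dat` and `fit` are illustrative)
set.seed(1)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
dat$x1[3] <- NA                          # this row will be dropped by na.omit

fit <- lm(y ~ x1 + x2, data = dat)

nobs(fit)                                # total n actually used (here 9)
mf <- model.frame(fit)                   # the rows that entered the fit
used <- rownames(dat) %in% rownames(mf)  # analogue of Stata's e(sample) == 1
table(used)
```

The logical vector `used` can then be used to tabulate cases by exposure within the estimation sample only.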

Related

How can I get My.stepwise.glm to return the model outside the console?

I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves predicting whether some trees will survive under future climate change scenarios. Against better judgement (like using Maxent) I've decided to pursue this with a GLM, which requires presence and absence data. Every time I generate my absence data (as I was only given presence data) using randomPoints from dismo, the resulting GLM model has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (here: My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis), and this goes through a forward/backward selection process to find the best variables and returns a model ready for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out. I'd like to run it roughly 100 times with different pseudo-absence data and see which variables it returns, then take the most frequent variables and build my model from those. The issue is that the My.stepwise.glm function ends with print(summary(initial.model)), and I would like to access the output the way step() returns a model object, where you can say step$coefficients and get the coefficients back as numerics. Can anyone help me with this?
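For reference, the behaviour described above (step() returning an object whose coefficients can be tallied across repeated runs) can be sketched on toy data; all names and data below are illustrative, not the actual tree data:

```r
# Sketch: step() returns the fitted model, so selected terms can be
# collected across repeated pseudo-absence draws and tallied
set.seed(1)
selected <- character(0)
for (i in 1:10) {
  d <- data.frame(pres = rbinom(100, 1, 0.5),
                  v1 = rnorm(100), v2 = rnorm(100), v3 = rnorm(100))
  full <- glm(pres ~ v1 + v2 + v3, family = binomial, data = d)
  m <- step(full, direction = "both", trace = 0)
  selected <- c(selected, names(coef(m)))
}
sort(table(selected), decreasing = TRUE)  # how often each term was kept
```

Since My.stepwise.glm() only prints its result rather than returning it, using step() (or capturing the model by editing the package function) is one way to get an object you can loop over.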

I want to use a fixed effects model on a regression with one variable being the group variable

I am using felm() and the code runs on the whole model, but I need it to run with fixed effects at the state level only. The problem asks: "Estimate the model using fixed effects (FE) at the state level". Using felm() is not getting me the correct results because I don't know whether I need to include state as a regressor (that doesn't give the correct answers) or how to specify that one variable is the group variable (I'm assuming this is how to get accurate results).
I have tried:
plm(ind ~ depvar + state, data = data, model = 'within')
felm(ind ~ depvar + state, data = data)
FELinMod3 <- felm(DRIVING$totfatrte ~ DRIVING$D81 + DRIVING$state, data = DRIVING)
FELinMod3 <- plm(DRIVING$totfatrte ~ DRIVING$D81 + DRIVING$state, data = DRIVING, model = 'within')
The output gives me coefficients that differ from the ones I know are correct from Stata.
It looks like felm() is for when you have multiple grouping variables, but it sounds like you're using only one grouping variable for fixed effects (i.e., state).
You should get the same (correct) result with
mod3 <- lm(totfatrte ~ D81 + state, data = DRIVING)
Also, if the coefficients or standard errors disagree between Stata and R, that doesn't necessarily mean that R is wrong.
Reading the documentation for felm() indicates that your code should look more like this:
model3<-felm(totfatrte ~ D81 | state, data = DRIVING)
but the code specifications for it are pretty complex based on whether you want to cluster your standard errors and so on.
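The equivalence between lm() with a group dummy and the within (demeaning) fixed-effects estimator can be checked on toy data; everything below is synthetic, with names chosen to mirror the question:

```r
# Sketch: with a single grouping variable, lm() with a factor dummy and the
# manual within transformation give the same slope (synthetic data)
set.seed(42)
d <- data.frame(state = factor(rep(1:5, each = 20)),
                D81   = rnorm(100))
d$totfatrte <- 2 * d$D81 + as.numeric(d$state) + rnorm(100)

m_lm <- lm(totfatrte ~ D81 + state, data = d)   # LSDV: state dummies

# manual within transformation: demean outcome and regressor by state
d$y_dm <- d$totfatrte - ave(d$totfatrte, d$state)
d$x_dm <- d$D81 - ave(d$D81, d$state)
m_within <- lm(y_dm ~ x_dm - 1, data = d)

all.equal(unname(coef(m_lm)["D81"]), unname(coef(m_within)["x_dm"]))  # TRUE
```

This is the same estimate felm(totfatrte ~ D81 | state, ...) targets; differences against Stata are then usually down to standard-error options (clustering, degrees-of-freedom corrections) rather than the point estimates.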
Hope this helps.

mlogit package in R: intercept and alternative specific individual variables

I'm trying to use the mlogit package in R to build a transportation-mode choice model. I've searched for similar problems but haven't found anything.
I have a set of 3 alternatives (walk, auto, transit) in a logit model, with alternative specific variables (same parameters across alternatives) and individual alternative specific variables (e.g. 0 (if no) / 1 (if yes) home-destination trip, just for the walk mode).
I'd like to have an intercept in only one of the alternatives (auto), but I'm not able to do this. Using reflevel, which refers to only one of the alternatives, I get two intercepts.
ml.data <- mlogit(choice ~ t + cost | dhome, mode, reflevel = "transit")
This is not working as I wish.
Moreover, I'd like to set up the alternative specific variables as described above. Inserting them in part 2 of the mlogit formula gives me two parameter values, but I'd like to have just one parameter, for the mentioned alternative.
Could anyone help me?
You cannot do what you want. It's not a question of mlogit in particular, it's a question of how multinomial logistic regression works. If your dependent variable has 3 levels, you will have 2 intercepts. And you have to use the same independent variables for the whole model (that's true for all methods of regression).
However, regarding the second part of the question ("individual alternative specific variables (e.g. 0 (if no) / 1 (if yes) home-destination trip, just for walk mode)"), I tried modifying the dataset by inserting 3 columns (dhome.auto [all zeros], dhome.transit [all zeros] and dhome.walk [0 if no / 1 if yes it's a home-destination trip]) so that the variable is effective only for the walk mode, even though it's now treated as an alternative specific variable. Then
ml.data <- mlogit(choice ~ t + cost + dhome, mode, reflevel = "transit")
It's a bit of a trick, but it seems to work.
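For reference, the column trick described above can be sketched on a toy wide-format table (column names follow mlogit's var.alternative convention for varying variables; the data are made up):

```r
# Sketch: forcing an "individual" variable to act only on one alternative
# by expanding it into per-alternative columns, two of which are all zeros
mode <- data.frame(choice = c("walk", "auto", "transit"),
                   dhome  = c(1, 0, 1))

mode$dhome.walk    <- mode$dhome  # effective only for the walk alternative
mode$dhome.auto    <- 0           # forced to zero for auto
mode$dhome.transit <- 0           # forced to zero for transit
```

Because the auto and transit columns are identically zero, the single estimated coefficient on dhome applies, in effect, only to the walk utility.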

How to fit a multiple linear regression model with 1664 explanatory variables in R

I have one response variable, and I'm trying to fit a multiple linear regression model with 1664 explanatory variables. I'm quite new to R and was taught to do this by writing out the formula with each explanatory variable in it. With 1664 variables, however, that would take far too long. Is there a quicker way?
Thank you!
I think you want to select from the 1664 variables a valid model, i.e. a model that predicts as much of the variability in the data with as few explanatory variables as possible. There are several ways of doing this:
Using expert knowledge to select variables that are known to be relevant. This can be because other studies found this, or because of some underlying process that you know makes that variable relevant.
Using some kind of stepwise regression approach, which selects the variables that are relevant based on how well they explain the data. Do note that this method has some serious downsides. Have a look at stepAIC for a way of doing this using the Akaike Information Criterion.
Correlating 1664 variables with the data will yield around 83 significant correlations purely by chance if you test at the 0.05 level (0.05 × 1664 ≈ 83). So tread carefully with automatic variable selection. Cutting down the number of variables with expert knowledge or a decorrelation technique (e.g. principal component analysis) would help.
For a code example, you first need to include an example of your own (data + code) on which I can build.
I'll answer the programming question, but note that a regression with that many variables often needs some sort of variable selection procedure (e.g. @PaulHiemstra's suggestions).
You can construct a data.frame with only the variables you want to run, then use the formula shortcut: form <- y ~ ., where the dot indicates all variables not yet mentioned.
You could instead construct the formula manually. For instance: form <- as.formula(paste("y ~", paste(myVars, collapse = " + ")))
Then run your regression:
lm( form, data=dat )
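Both shortcuts can be sketched on toy data (all names here are illustrative):

```r
# Sketch of both formula shortcuts for many predictors
dat <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))

# 1) the dot shortcut: every column of `dat` except y becomes a predictor
fit_all <- lm(y ~ ., data = dat)

# 2) building the formula from a character vector of variable names
myVars <- c("x1", "x2", "x3")
form <- as.formula(paste("y ~", paste(myVars, collapse = " + ")))
fit_sel <- lm(form, data = dat)

length(coef(fit_all)) == length(coef(fit_sel))  # TRUE: same three predictors
```

The collapse argument (not sep) is what joins the vector of names into a single "x1 + x2 + x3" string.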

function for removing nonsignificant variables at one step in R

I am trying to automate logistic regression in R.
Basically, my source code generates a new equation every day as the input data is updated (the variables, data format, etc. stay the same) and prints out the significant variables with their corresponding coefficients.
When I use step function, sometimes the resulting coefficients are not significant. Therefore, I want to update my set of coefficients and get rid of all the ones that are not significant enough.
Is there a function or automated way of doing it?
If not, the only way I can think of is writing a script in another language that takes the coefficients and corresponding p-values, checks significance, and reruns R accordingly. But even for that, how can I get only the p-values and coefficients of the variables? I can print the whole summary of the regression result with the summary function, but I can't extract just the p-values.
Thank you very much
It's a bit hard for me without sample code and data, but you can subset based on variable values like this:
newdata <- data[which(data$p.value < 0.05), ]
You can inspect your R object using str() (see ?str) to figure out how to select whatever you want to use in your subset, e.g. $p.value or $residuals.
If this doesn't answer your question try submitting some sample code and data.
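To address the "I can't reach only p-values" part directly: coef(summary(fit)) returns the coefficient table as a matrix, so the p-values and estimates can be pulled out without parsing printed output. A minimal sketch on made-up data:

```r
# Sketch: extracting coefficients and p-values from a fitted glm
# (toy data; names illustrative)
set.seed(7)
d <- data.frame(y = rbinom(50, 1, 0.5), x1 = rnorm(50), x2 = rnorm(50))
fit <- glm(y ~ x1 + x2, family = binomial, data = d)

ctab  <- coef(summary(fit))     # matrix: Estimate, Std. Error, z value, Pr(>|z|)
pvals <- ctab[, "Pr(>|z|)"]
coefs <- ctab[, "Estimate"]

# variables to retain when refitting (intercept kept regardless)
keep <- names(pvals)[pvals < 0.05 & names(pvals) != "(Intercept)"]
keep
```

From `keep` you can rebuild the formula (e.g. with paste and as.formula) and refit, all inside one R script, without round-tripping through another language.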
Best,
Eric
