Shortening the formula syntax of a regression model in R

I was wondering if the syntax of the regression model below could be made more concise (shorter) than it currently is?
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/v/main/bv1.csv')
library(nlme)
model <- lme(achieve ~ 0 + D1 + D2 +
               D1:time + D2:time +
               D1:schcontext + D2:schcontext +
               D1:female + D2:female +
               D1:I(female*time) + D2:I(female*time) +
               D1:I(schcontext*time) + D2:I(schcontext*time),
             correlation = corSymm(),
             random = ~ 0 + D1:time | schcode/id, data = dat,
             weights = varIdent(form = ~ 1 | factor(math)),
             na.action = na.omit,
             control = lmeControl(maxIter = 200, msMaxIter = 200,
                                  niterEM = 50, msMaxEval = 400))
coef(summary(model))

Focusing on the fixed-effect component only.
Original formula:
form1 <- ~ 0 + D1 + D2 +
  D1:time + D2:time +
  D1:schcontext + D2:schcontext +
  D1:female + D2:female +
  D1:I(female*time) + D2:I(female*time) +
  D1:I(schcontext*time) + D2:I(schcontext*time)
X1 <- model.matrix(form1, data = dat)
I think this is equivalent:
form2 <- ~ 0 + D1 + D2 +
  (D1 + D2):(time + schcontext + female + female:time + schcontext:time)
X2 <- model.matrix(form2, data = dat)
(Unfortunately ~ 0 + (D1 + D2):(1 + time + ...) doesn't work as I would have liked/expected.)
For a start, the model matrix has the right dimensions. Staring at the column names of the model matrices and reordering the columns manually:
X2o <- X2[, c(1:3, 6, 4, 7, 5, 8, 9, 11, 10, 12)]
all.equal(c(X1), c(X2o))  ## TRUE
(For numerical predictors, you don't need I(A*B): A:B is equivalent.)
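To see that concretely, here is a minimal sketch with two made-up numeric variables (a and b are not from the question's data):
d <- data.frame(a = rnorm(5), b = rnorm(5))
m1 <- model.matrix(~ 0 + I(a*b), data = d)
m2 <- model.matrix(~ 0 + a:b, data = d)
all.equal(c(m1), c(m2))  ## TRUE: identical column, only the name differs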
Actually you can do a little better using the * operator:
form3 <- ~ 0 + D1 + D2 +
  (D1 + D2):(time*(schcontext + female))
X3 <- model.matrix(form3, data = dat)
X3o <- X3[, c(1:3, 6, 4, 7, 5, 8, 10, 12, 9, 11)]
all.equal(c(X1), c(X3o))  ## TRUE
Compare formula length:
sapply(list(form1, form2, form3),
       function(x) nchar(as.character(x)[[2]]))
## [1] 183 84 54

Related

svyglm - how to code for a logistic regression model across all variables?

In R, to include all variables in a glm you can simply use a ., as shown in How to succinctly write a formula with many variables from a data frame? For example:
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
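As an aside, the dot shorthand also lets you drop terms with -, for example:
mod2 <- lm(y ~ . - x3, data = d)  ## all columns except x3 as predictors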
However, I am struggling to do this with svydesign. I have many explanatory variables plus an ID and a weight variable, so first I create my survey design:
des <- svydesign(ids = ~id, weights = ~wt, data = df)
Then I try creating my binomial model using weights:
binom <- svyglm(y ~ ., design = des, family = "binomial")
But I get the error:
Error in svyglm.survey.design(y ~ ., design = des, family = "binomial") :
all variables must be in design = argument
What am I doing wrong?
You typically wouldn't want to do this, because "all the variables" would include design metadata such as weights, cluster indicators, stratum indicators, etc.
You can use colnames() to extract all the variable names from a design object and then reformulate(), probably after subsetting the names, e.g. with the api example in the package:
> all_the_names <- colnames(dclus1)
> all_the_actual_variables <- all_the_names[c(2, 11:37)]
> reformulate(all_the_actual_variables,"y")
y ~ stype + pcttest + api00 + api99 + target + growth + sch.wide +
comp.imp + both + awards + meals + ell + yr.rnd + mobility +
acs.k3 + acs.46 + acs.core + pct.resp + not.hsg + hsg + some.col +
col.grad + grad.sch + avg.ed + full + emer + enroll + api.stu
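To close the loop, here is a sketch of feeding such a reformulated formula back into svyglm, using the api example data shipped with the survey package (the response sch.wide and the hand-picked predictor subset are placeholders, not from the original question):
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
vars <- c("stype", "meals", "ell", "mobility", "api99")  # subset of colnames(dclus1)
f <- reformulate(vars, response = "sch.wide")
summary(svyglm(f, design = dclus1, family = quasibinomial()))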

put all variables in a regression

I want to reduce this expression in R code:
model1 <- pglm::pglm(formula = lfp ~ lfp_1 + lfp1 + kids + kids2 + kids3 + kids4 + kids5 +
                       lhinc + lhinc2 + lhinc3 + lhinc4 + lhinc5 +
                       educ + black + age + agesq + per2 + per3 + per4 + per5,
                     family = binomial("probit"),
                     data = lfp1,
                     model = "random")
In Stata I could write kids2-kids5 to list the variables kids2 through kids5 in the regression, and likewise lhinc2-lhinc5 and per2-per5.
Try this one:
model1 <- pglm::pglm(formula = lfp ~ .,
family = binomial("probit"),
data = lfp1,
model = "random")

In R, is there a parsimonious or efficient way to get a regression prediction holding all covariates at their means?

I'm wondering if there is essentially a faster way of getting predictions from a regression model for certain values of the covariates without manually specifying the formulation. For example, if I wanted to get a prediction for the dependent variable at the means of the covariates, I can do something like this:
r_3 <- glm(ins ~ retire + age + hstatusg + qhhinc2 + educyear + married + hisp,
           family = binomial, data = dat)
meanRetire <- mean(dat$retire)
meanAge <- mean(dat$age)
meanHStatusG <- mean(dat$hstatusg)
meanQhhinc2 <- mean(dat$qhhinc2)
meanEducyear <- mean(dat$educyear)
meanMarried <- mean(dat$married)
meanHisp <- mean(dat$hisp)
ins_predict <- coef(r_3)[1] + coef(r_3)[2] * meanRetire + coef(r_3)[3] * meanAge +
  coef(r_3)[4] * meanHStatusG + coef(r_3)[5] * meanQhhinc2 +
  coef(r_3)[6] * meanEducyear + coef(r_3)[7] * meanMarried +
  coef(r_3)[8] * meanHisp
Oh... There is a predict function:
fit <- glm(ins ~ retire + age + hstatusg + qhhinc2 + educyear + married + hisp,
family = binomial, data = dat)
newdat <- lapply(dat, mean) ## column means
lppred <- predict(fit, newdata = newdat) ## prediction of linear predictor
To get predicted response, use:
predict(fit, newdata = newdat, type = "response")
or (more efficiently from lppred):
binomial()$linkinv(lppred)
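As a quick sanity check (using fit and newdat from above), the two routes agree, since for the default logit link binomial()$linkinv is just plogis:
all.equal(predict(fit, newdata = newdat, type = "response"),
          binomial()$linkinv(lppred))  ## TRUE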

Holding the coefficients of a linear model constant while exchanging predictors for their sample means?

I've been trying to look at the explanatory power of individual variables in a model by holding other variables constant at their sample mean.
However, I am unable to fit something like:
Temperature = alpha + Beta1*RFGG + Beta2*RFSOx + Beta3*RFSolar
with the constraint Beta1 = Beta2 = Beta3, i.e. something like:
Temperature = alpha + Beta1*(RFGG + RFSolar + RFSOx)
I want to do this so I can compare the difference in explanatory power (R^2/size of residuals) when one independent variable is not held at the sample mean while the rest are.
Temperature = alpha + Beta1*(RFGG + meanRFSolar + meanRFSOx)
or
Temperature = alpha + Beta1*RFGG + Beta1*meanRFSolar + Beta1*meanRFSOx
However, the lm function seems to estimate its own coefficients so I don't know how I can hold anything constant.
Here's some ugly code I tried throwing together that I know reeks of wrongness:
# fix up a new clean matrix for my data
dat <- cbind(dat[, 1:2], dat[, 4:6])  # contains 162 rows of: Date, Temp, RFGG, RFSolar, RFSOx
# make a bunch of sample-mean independent variables to use
meandat <- dat[, 3:5]
meandat$RFGG <- mean(dat$RFGG)
meandat$RFSolar <- mean(dat$RFSolar)
meandat$RFSOx <- mean(dat$RFSOx)
RFTotal <- dat$RFGG + dat$RFSOx + dat$RFSolar
B <- coef(lm(dat$Temp ~ 1 + RFTotal))  # trying to save the coefficients to use them...
B1 <- rep(B[1], length = length(dat[, 1]))
B2 <- rep(B[2], length = length(dat[, 1]))
summary(lm(dat$Temp ~ B1 + B2*dat$RFGG:meandat$RFSOx:meandat$RFSolar))  # failure
summary(lm(dat$Temp ~ B1 + B2*RFTotal))
Thanks for taking a look, whoever sees this, and please ask me any questions.
Thank you both; it was a combination of eliminating the intercept with -1 and the offset() function.
a <- lm(Temp ~ I(RFGG + RFSOx + RFSolar), data = dat)
beta1hat <- rep(coef(a)[1], length = nrow(dat))
beta2hat <- rep(coef(a)[2], length = nrow(dat))
RFGG_bar <- mean(dat$RFGG)
RFSOx_bar <- mean(dat$RFSOx)
RFSolar_bar <- mean(dat$RFSolar)
b <- lm(Temp ~ -1 + offset(beta1hat) + offset(beta2hat*(RFGG + RFSOx_bar + RFSolar_bar)), data = dat)
c <- lm(Temp ~ -1 + offset(beta1hat) + offset(beta2hat*(RFGG_bar + RFSOx + RFSolar_bar)), data = dat)
d <- lm(Temp ~ -1 + offset(beta1hat) + offset(beta2hat*(RFGG_bar + RFSOx_bar + RFSolar)), data = dat)
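From there, the comparison the question was after (size of residuals when each forcing term in turn is left free while the others sit at their means) is just, as a sketch:
sapply(list(full = a, RFGG.free = b, RFSOx.free = c, RFSolar.free = d),
       function(m) sum(resid(m)^2))  ## residual sums of squares to compare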

neural network in R

Hi, I am trying to use the neuralnet function in R to predict an integer outcome (meaning) using the rest of the variables.
Here is the code that I have used:
library("neuralnet")
## use 2/3 of the data for training the neural network and the rest for testing
ind <- sample(1:nrow(Data), 6463, replace = FALSE)
Train <- Data[ind, ]
Test <- Data[-ind, ]
m <- model.matrix(
~meaning +
firstLevelAFFIRM + firstLevelDAT.PRSN + firstLevelMODE +
firstLevelO.DEF + firstLevelO.INDIV + firstLevelS.AGE.INDIV +
secondLevelV.BIN + secondLevelWord1 + secondLevelWord2 +
secondLevelWord3 + secondLevelWord4 + thirdLevelP.TYPE,
data = Train[,-1])  # (the first column is ID; I am not going to use it)
PredictorVariables <- paste0("m[,", 3:ncol(m), "]")
Formula <- formula(paste("meaning ~", paste(PredictorVariables, collapse = " + ")))
net <- neuralnet(Formula,data=m, hidden=3, threshold=0.05)
m.test <- model.matrix(
~meaning +
firstLevelAFFIRM + firstLevelDAT.PRSN + firstLevelMODE +
firstLevelO.DEF + firstLevelO.INDIV + firstLevelS.AGE.INDIV +
secondLevelV.BIN + secondLevelWord1 + secondLevelWord2 +
secondLevelWord3 + secondLevelWord4 + thirdLevelP.TYPE,
data = Test[,-1])
net.results <- compute(net, m.test[,-c(1,2)])  # (the first column is ID and the second is the outcome I am trying to predict)
output <- cbind(round(net.results$net.result), Test$meaning)
mean(round(net.results$net.result) != Test$meaning)
The misclassification rate I got was around 0.01, which is great, but my question is: why is the outcome I got (net.results$net.result) not an integer?
I assume that your output is linear. Try setting linear.output = FALSE.
net <- neuralnet(Formula, data = m, hidden = 3, threshold = 0.05, linear.output = FALSE)
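With linear.output = TRUE the output neuron applies no activation, so predictions are unbounded real numbers; with FALSE the logistic activation squashes them into (0, 1), which you then round or threshold to recover the integer class. A sketch, assuming meaning is binary 0/1 (with more than two levels, a one-hot output layer would be the usual route):
pred <- as.integer(net.results$net.result > 0.5)  # threshold at 0.5
mean(pred != Test$meaning)                        # misclassification rate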
