Scaling a variable that is an interaction of two variables, for GLM - scale

Is there a way to scale all your variables, *including your interaction variables*, in a GLM? For example, the model runs when I scale all the single variables, but not when I try to scale the interactions.
This works:
mod_lm_scale <- glm(pa ~
    scale(WC05) + scale(WC06) + scale(WC08) +
    scale(WC13) + scale(WC15),
    data = sdmdata,
    family = 'binomial')
Trying to add a scaled interaction does not work. See the last term: scale(WC05:WC06)
mod_lm_scale <- glm(pa ~
    scale(WC05) + scale(WC06) + scale(WC08) +
    scale(WC13) + scale(WC15) +
    scale(WC05:WC06),
    data = sdmdata,
    family = 'binomial')
I receive this error when including the scaled interaction, and receive no errors when I don't include it:
Error in model.frame.default(formula = pa ~ scale(WC05_clipped) + scale(WC06_clipped) + :
variable lengths differ (found for '(weights)')
In addition: Warning messages:
1: In WC05_clipped:WC06_clipped :
numerical expression has 2869 elements: only the first used
2: In WC05_clipped:WC06_clipped :
numerical expression has 2869 elements: only the first used
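The warnings hint at the cause: inside a function call within a formula, `:` is evaluated as the ordinary sequence operator (like `seq()`), so `scale(WC05:WC06)` builds a sequence from the first element of WC05 to the first element of WC06 rather than an interaction. A minimal sketch of two working alternatives, using `mtcars` in place of `sdmdata` (the `pa` and `WC` columns aren't available here):

```r
# Interaction of the scaled variables: scale each column, then cross them.
mod_a <- glm(am ~ scale(mpg) + scale(wt) + scale(mpg):scale(wt),
             data = mtcars, family = "binomial")

# Or scale the product itself -- a different but also valid parameterisation.
# Inside a function call, * is plain multiplication, so no I() is needed.
mod_b <- glm(am ~ scale(mpg) + scale(wt) + scale(mpg * wt),
             data = mtcars, family = "binomial")

coef(mod_a)
```

Note the two models are not equivalent: `scale(mpg):scale(wt)` multiplies the standardized columns, while `scale(mpg * wt)` standardizes the product.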

Related

Cannot calculate marginal effects in logit model

I am running the following regression:
Model <- glm(emp ~ industry + nat_status + region + state + age + educ7 + religion + caste,
family=binomial(link="logit"), data=IHDS)
However, when I use the margins command, I get the following error:
There were 50 or more warnings (use warnings() to see the first 50)
Warning messages: 1: In predict.lm(object, newdata, se.fit, scale = 1,
type = if (type == ... : prediction from a rank-deficient fit may
be misleading
Based on this warning, I suspect collinearity exists. However, I do not know how to detect it or how to deal with it.
(I have tried adding each control individually.)
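The "rank-deficient fit" warning means at least one predictor is a linear combination of the others, and glm() drops it (its coefficient becomes NA). A minimal sketch of how to spot this, on a toy data frame standing in for the IHDS data (which isn't available here):

```r
# x2 is an exact copy of x1, so the design matrix is rank-deficient,
# which is exactly the situation predict.lm() warns about.
set.seed(1)
d <- data.frame(y = rbinom(100, 1, 0.5), x1 = rnorm(100))
d$x2 <- d$x1                     # perfectly collinear predictor
fit <- glm(y ~ x1 + x2, data = d, family = binomial)

# Coefficients glm() could not estimate show up as NA:
coef(fit)

# Confirm rank deficiency directly from the design matrix:
X <- model.matrix(fit)
qr(X)$rank < ncol(X)             # TRUE when predictors are collinear
```

With many categorical controls, the usual culprit is a dummy level that is perfectly determined by the other variables; checking `coef()` for NAs after adding each control is a quick way to find which one.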

My panel linear regression with log variables returns error on non-finite values, but there are no logs on zero or negative values

I'm trying to run a fixed effects regression on my panel data (using the plm package). The regression in levels worked well, as did my first regressions using log variables (I take logs only of the dependent variable and some independent variables, which are in monetary terms). However, my regressions with logs have stopped working.
library(AER)
library(plm)
#Indicates the panel and the time and individual columns
dd <- pdata.frame(painel, index = c('Estado', 'Ano'))
#Model 1 - Model within with individual fixed effects
mod_1_within <- plm(PIB ~ txinad + op + desoc + Divliq + Esc_15 + RT + DC + DK + Gini + I(DK*Gini) + I(DC*Gini), data = dd, effect = 'individual')
summary (mod_1_within)
#this worked well
#Model 2 - Model 1 with the monetary variables in log (the others are % or indexes):
mod_1_within_log<- plm(log(PIB) ~ txinad + log(RT) + op + desoc + Divliq + Esc_15 + log(DC) + log(DK) + Gini + I(Gini*log(DC)) + I(Gini*log(DK)), data = dd, effect = 'individual')
summary (mod_1_within_log)
#This returns:
> mod_1_within_log<- plm(log(PIB) ~ txinad + log(RT) + op + desoc + Divliq + Esc_15 + log(DC) + log(DK) + Gini + I(Gini*log(DC)) + I(Gini*log(DK)), data = dd, effect = 'individual')
Error in model.matrix.pdata.frame(data, rhs = 1, model = model, effect = effect, :
model matrix or response contains non-finite values (NA/NaN/Inf/-Inf)
> summary (mod_1_within_log)
Error in summary(mod_1_within_log) : object 'mod_1_within_log' not found
This is occurring even though none of the logged variables has negative or zero values. I will take this opportunity to ask another question: if a variable has a zero value, is there a way I can set that value to missing and then take the log of that variable?
Thanks in advance!
I assume the reason you're getting that error is that you have Inf or -Inf values in your logged predictors or logged outcome.
To check whether that is the case, look at the untransformed variables (before the log) and see whether any observation has a value of zero. If so, that is the problem. Why? R returns -Inf from log(0), so when you run the FE model, plm raises that error because it cannot handle NaN or Inf values.
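A quick sketch of both the check and the second question (making zeros missing before logging), on made-up numbers since the painel data isn't available here:

```r
# Toy stand-in for the monetary variables; PIB contains one zero.
d <- data.frame(PIB = c(10, 0, 25), RT = c(5, 3, 8))

# Count problem observations per variable before logging:
sapply(d, function(x) sum(x <= 0, na.rm = TRUE))

# Option 1: turn zeros into NA so the model drops those rows:
d$PIB_na <- ifelse(d$PIB > 0, d$PIB, NA)
log(d$PIB_na)          # second value is NA instead of -Inf

# Option 2: log1p(x) = log(1 + x) keeps zero-valued observations:
log1p(d$PIB)
```

Whether dropping zeros or using log1p() is appropriate depends on what the zeros mean economically; the two choices change the sample and the interpretation.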

Logistic Regression Error in R - Mostly Binary Variables

I am trying to do a logistic regression in R. I am looking at the relationship between a binary categorical variable (low birthweight) and various other categorical variables (most are binary; some, such as "smoke" and "mat_BMI_cat2", are not).
When I type in this code, I always get an error.
logit_Mod <-glm(low_bw ~ medical_factors__anemia + medical_factors__eclampsia + m_diabetes + m_hypchron + m_hypreg + smoke + mat_BMI_cat2, data= trainingData, family = binomial(link = "logit"))
I get this error:
Error in family$linkfun(mustart) : Argument mu must be a nonempty numeric vector
I am a bit confused what this error means and where I should go from here.
Thank you!
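One common cause of this particular error (an assumption here, since the trainingData set isn't available) is that after glm() silently drops rows containing NAs, no observations remain, so the starting value `mustart` passed to the link function is empty. A minimal sketch of the check, on a toy frame standing in for trainingData:

```r
# Every row has an NA in at least one model variable, so NA removal
# leaves zero complete cases -- and glm() would fail on an empty response.
d <- data.frame(low_bw = c(1, NA, 0), smoke = c(NA, 1, NA))

sum(complete.cases(d))   # 0 rows survive NA removal
```

If `complete.cases()` over the response and all predictors returns 0 (or very few) rows, the fix is to inspect which variables carry the missing values before modelling.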

cv.glm variable lengths differ

I am trying to cv.glm on a linear model however each time I do I get the error
Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv + :
variable lengths differ (found for 'air-force-falcons')
air-force-falcons is the first variable in the dataset lindata. When I run glm I get no errors. All the variables are in a single dataset and there are no missing values.
> linearmod5<- glm(lindata$Y ~ 0 + lindata$HomeAdv + ., data=lindata, na.action="na.exclude")
> set.seed(1)
> cv.err.lin=cv.glm(lindata,linearmod5,K=10)
Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv + :
variable lengths differ (found for 'air-force-falcons')
I do not know what is driving this error or the solution. Any ideas? Thank you!
What is causing this error is a mistake in the way you specify the formula.
This will produce the error:
mod <- glm(mtcars$cyl ~ mtcars$mpg + .,
data = mtcars, na.action = "na.exclude")
cv.glm(mtcars, mod, K=11) #nrow(mtcars) is a multiple of 11
This does not:
mod <- glm(cyl ~ ., data = mtcars)
cv.glm(mtcars, mod, K=11)
Neither does this:
mod <- glm(cyl ~ + mpg + disp, data = mtcars)
cv.glm(mtcars, mod, K=11)
What happens is that when you specify a variable as mtcars$cyl, that variable has the number of rows of the original dataset. cv.glm partitions the data frame into K parts, but when it refits the model on the resampled data, any variable written in the data.frame$var form is evaluated at its original (unpartitioned) length, while the others (those covered by .) have the partitioned length.
So refer to variables by bare name in the formula (without $).
Other advice on formulas:
Avoid mixing explicitly named variables with .: you duplicate variables. The dot stands for all variables in the data frame except the one on the left of the tilde.
The zero suppresses the intercept (it is equivalent to -1). However, suppressing the intercept is bad practice in my opinion.

Day-ahead using GLM model in R

I have the following code to get a day-ahead prediction of load consumption at 15-minute intervals, using outside air temperature and TOD (a categorical time-of-day variable with 96 levels). When I run the code below, I get the following errors.
i = 97:192
formula = as.formula(load[i] ~ load[i-96] + oat[i])
model = glm(formula, data = train.set, family=Gamma(link=vlog()))
I get the following error after the last line using glm(),
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
And the following error shows up after the last line using predict(),
Warning messages:
1: In if (!se.fit) { :
the condition has length > 1 and only the first element will be used
2: 'newdata' had 96 rows but variable(s) found have 1 rows
3: In predict.lm(object, newdata, se.fit, scale = residual.scale, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
4: In if (se.fit) list(fit = predictor, se.fit = se, df = df, residual.scale = sqrt(res.var)) else predictor :
the condition has length > 1 and only the first element will be used
You're doing things in a rather roundabout fashion, and one that doesn't translate well to making out-of-sample predictions. If you want to model on a subset of rows, then either subset the data argument directly, or use the subset argument.
train.set$load_lag <- c(rep(NA, 96), head(train.set$load, -96))
mod <- glm(load ~ load_lag*TOD, data=train.set[97:192, ], ...)
You also need to rethink exactly what you're doing with TOD. If it has 96 levels, then you're fitting (at least) 96 degrees of freedom on 96 observations which won't give you a sensible outcome.
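The lag-then-subset pattern in the answer generalises beyond this dataset. A toy sketch with made-up numbers and a lag of 2 instead of 96 (the train.set data isn't available here):

```r
# A short made-up load series; in the real data k would be 96.
train.set <- data.frame(load = c(10, 12, 11, 13, 14, 15))
k <- 2

# Shift the series forward by k: the first k rows have no lagged value.
train.set$load_lag <- c(rep(NA, k), head(train.set$load, -k))

# Fit only on rows where the lag exists, then predict from a new lag value.
fit <- glm(load ~ load_lag, data = train.set[(k + 1):nrow(train.set), ])
predict(fit, newdata = data.frame(load_lag = 14))
```

Because the lag lives in a proper column, `predict()` with a `newdata` frame works naturally, which is exactly what the indexed-formula approach in the question cannot do.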