Variable Lengths Differ Error on Predictive model - r

I am trying to build a predictive model from survey data. My DVs are questions on NPS and similar data points. My IVs are mainly demographic questions. I keep getting a "variable lengths differ" error using the following lines of code:
Model <- lm(Q6 ~ amount_spent + first_time + gender +
workshop_participation + adults + children +
household_adults + Below..25K. + X.25K.to..49K. +
X.50K.to..74K. + X.75K.to..99K. + X.100K.to..124K. +
X18.24. + X25.34. + X35.44. + X45.64.,
data = diy_festival2)
Here is the error:
Error in model.frame.default(formula = Q6 ~ amount_spent + first_time + :
variable lengths differ (found for 'Below..25K.')
What are some possible causes and what are some potential fixes I can try?

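Your formula references at least one variable that is not in diy_festival2. R then falls back to an object of the same name outside the data frame (here, in the global environment), and that object has a different length. The error message suggests the variable is Below..25K. For example: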
x <- data.frame(x1=rnorm(100))
x2 <- rnorm(10)
model.matrix( ~ x1 + x2, data=x)
gives the error you have.
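To find which formula variables are missing from the data frame, one option is to compare the variable names in the formula with the column names. This is a minimal sketch on made-up data; dat, f and missing_vars are illustrative names, not objects from the question.

```r
# Toy data: x2 exists only in the global environment, not in dat
set.seed(1)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100))
x2 <- rnorm(10)

f <- y ~ x1 + x2

# Variables named in the formula that are not columns of the data frame
missing_vars <- setdiff(all.vars(f), names(dat))
print(missing_vars)
```

Applied to the question, the same comparison would use the model formula and names(diy_festival2). Any name it returns is being looked up outside the data frame.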

Related

how to address non-finite values in regression using R [duplicate]

I'm using the plm package to analyse my panel data, which comprises a set of states over 14 years. While running plm regressions, I've repeatedly hit the error "model matrix or response contain non-finite values", and I've eventually resolved it each time by deleting observations with null or NA values. However, I'm now running the regression:
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) + log (DC) + log(DK) + Gini + I(log(DC)*Gini) + I(log(DK)*Gini), data = dd, effect = 'individual')
summary (mod_3.1_within_log_b)
which returns
Error in model.matrix.pdata.frame(data, rhs=1, model=model, effect=effect,
model matrix or response contains non-finite values (NA/NaN/inf/-inf)
But, as I said, my data contains no more null or NA values. Just to test this, I've run the separate regressions
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) + log (DC) + Gini + I(log(DC)*Gini) + I(log(DK)*Gini), data = dd, effect = 'individual')
and
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) + log(DK) + Gini + I(log(DC)*Gini) + I(log(DK)*Gini), data = dd, effect = 'individual')
summary (mod_3.1_within_log_b)
and both worked, indicating that it is when I run with log(DK) and log(DC) together that I receive the error.
Thanks in advance!
As @StupidWolf suggested in the comments, your model matrix may contain zeros or possibly negative values (log(-1) returns NaN and log(0) returns -Inf).
plm does not handle this by removing incomplete observations automatically, but we can do it manually by checking the model matrix used (or by looking at the original data). Without the complete data, this is just a suggestion for checking some simple problems in the model matrix.
Note that I've shortened the formula to improve readability.
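## na.action = na.pass keeps every row, so row positions still match dd
mf <- model.frame(~ txinad + prod + op + emp + log(RT) +
                    (log(DC) + log(DK)) * Gini,
                  data = dd, na.action = na.pass)
mm <- model.matrix(terms(mf), mf)
## Flag rows containing NA, NaN, Inf or -Inf
if (any(bad <- rowSums(!is.finite(mm)) > 0)) {
  cat('Rows in dd causing trouble:\n')
  print(dd[bad, ])
}
This prints any rows of dd that produce non-finite values in the model matrix.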
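A quicker check on the raw columns can be sketched as follows. The data frame dd_toy is made up for illustration and only borrows the column names DC and DK from the question.

```r
# Toy stand-in for dd: one zero and one negative value
dd_toy <- data.frame(DC = c(2.5, 0, 1.2),
                     DK = c(1.0, 3.1, -0.4))

# log(0) is -Inf and log(<0) is NaN (with a warning, suppressed here)
log_vals <- suppressWarnings(sapply(dd_toy, log))

# Number of non-finite values each logged column would contribute
n_bad <- colSums(!is.finite(log_vals))
print(n_bad)
```

Any column with a non-zero count is a candidate source of the non-finite values.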

How to develop a hierarchical model to see the heterogeneity in mean of a specific variable using glmer?

I am using the following model in R:
glmer(y ~ x + z + (1|id), weights = specification, family = binomial, data = data)
which:
y ~ binomial(y, specification)
Logit(y) = intercept + a*x + b*z
a and b are coefficients for x and z variables.
a|a0+a1 = a0 + a1*I
The coefficient of one of the variables (here x) depends on another variable (here I), so I need a hierarchical model to see the heterogeneity in the mean of x.
I would appreciate it if anyone could help me with this problem.
Sorry if the question does not look professional! This is one of my first experiences.
I'm not perfectly sure I understand the question, but: if Logit(y) = intercept + a*x + b*z and a = a0 + a1*I, then it would seem that
Logit(y) = intercept + (a0+a1*I)*x + b*z
This looks like a straightforward interaction model:
glmer(y ~ 1 + x + x:I + z + (1|id), ...)
to make it more explicit, this could also be written as
glmer(y ~ 1 + x + I(x*I) + z + (1|id), ...)
(although the use of I as a predictor variable and in the I() function is a little bit confusing at first glance ...)
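The equivalence of the two specifications can be checked on the fixed-effects design matrix alone, without fitting anything. This is a sketch on simulated data, and the moderator is named w here (an assumption, to sidestep the clash with I()).

```r
set.seed(42)
d <- data.frame(x = rnorm(6), w = rnorm(6), z = rnorm(6))

# Interaction written with ':' and with an explicit product inside I()
mm1 <- model.matrix(~ 1 + x + x:w + z, data = d)
mm2 <- model.matrix(~ 1 + x + I(x * w) + z, data = d)

# Both columns hold the elementwise product x * w
same <- isTRUE(all.equal(unname(mm1[, "x:w"]),
                         unname(mm2[, "I(x * w)"])))
print(same)
```

Only the column order and names differ between the two matrices, so the fitted coefficients for the interaction term should agree.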

Problems with Fixed effects panel data

I am trying to run a regression on panel data from the Michigan Consumers Survey. It is the first time I am using panel data in R, so I am not very familiar with the "plm" package that is needed. I am setting up my panel data for fixed effects on individuals (CASEID) and time (YYYY):
Michigan_panel <- pdata.frame(Michigan_survey, index = c("CASEID", "YYYY"))
Then I am using the following regression:
mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
However R is showing me the following error:
> mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
Error in plm.fit(data, model, effect, random.method, random.models, random.dfcor, :
empty model
Does anyone know what I am doing wrong?
Could you give a link to this specific survey? I found various datasets with this name.
I suspect (only suspect) that your data isn't panel data; please check the CASEID variable.
Changing the order of formula and data in plm won't solve your problem.
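One plausible way to check the CASEID suspicion is to count observations per individual. If every CASEID appears only once, the within transformation subtracts each individual's own value from itself and every regressor becomes zero. The sketch below uses a made-up data frame (survey_toy) that only borrows column names from the question.

```r
# Toy data in which each respondent is observed exactly once
survey_toy <- data.frame(CASEID = 1:6,
                         YYYY   = c(2019, 2019, 2020, 2020, 2021, 2021),
                         ICE    = c(80, 95, 70, 88, 91, 77))

# Observations per individual: a panel needs repeated CASEIDs
obs_per_id <- table(survey_toy$CASEID)
all_single <- all(obs_per_id == 1)

# Within (demeaning) transformation by individual, done by hand
ice_within <- survey_toy$ICE - ave(survey_toy$ICE, survey_toy$CASEID)

print(all_single)
print(ice_within)
```

On the real data, table(Michigan_survey$CASEID) would show whether respondents are actually re-interviewed.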
I think the error comes from how the model is written. Your current call is this:
mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
In my view, you have to pass the index to plm and follow the argument order of the plm package. I would write your call as follows:
mod_1 <- plm(ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq,
data = Michigan_panel,
index= c("CASEID", "YYYY"),
model = "within")
1. Different Approach
As far as I know, the same model can also be written in a more compact form.
library(plm)
Michigan_panel <- pdata.frame(Michigan_survey, index = c("CASEID", "YYYY"))
attach(Michigan_panel)
y <- cbind(ICS)
X <- cbind(ICE,PX1Q2,RATEX,ZLB,INCOME,AGE,EDUC,MARRY,SEX,AGE_sq)
model1 <- plm(y~X+factor(CASEID)+factor(YYYY), data=Michigan_panel, model="within")
summary(model1)
detach(Michigan_panel)
Adding factor(CASEID) and factor(YYYY) adds dummy variables for each individual and each year to your model.

Error in quantile.default(resid) : missing values and NaN's not allowed

I am trying to predict the time it takes for a model to train (sklearn's linear regression) given a particular number of rows and columns. I have created additional features by taking the log and the square of the number of rows and columns.
I have pasted the data here. As you can see, the dataset has no missing values or NaN's.
I tried to run a linear regression model in R using the lm function using the below code -
library(data.table)
df = fread("linreg_df_edited.csv")
lrmodel <- lm(time ~ rows + columns + volume + rows_log + columns_log + volume_log + row_sq + col_sq, data = df)
But when I request the summary of the model using summary(lrmodel), I get the following error
Error in quantile.default(resid) :
missing values and NaN's not allowed if 'na.rm' is FALSE
My dataset doesn't have any missing values but I still tried and rebuilt the model after setting na.action=na.omit
lrmodel <- lm(time ~ rows + columns + volume + rows_log + columns_log + volume_log + row_sq + col_sq, df, na.action=na.omit)
I still get the same error and can't figure this out. I thought maybe a column was being read as a character variable, but that isn't the case either.
Any idea why this is happening?
Don't try to model on all of your transformations at once. Your call is:
model <- lm(time ~ rows + columns + volume + rows_log + columns_log + volume_log + row_sq + col_sq, data = df)
Instead, do:
model_lin <- lm(time ~ rows + columns + volume, data = df)
model_log <- lm(time ~ rows_log + columns_log + volume_log, data = df)
model_sq <- lm(time ~ row_sq + col_sq, data = df)
Then you'll see the squares are the problem. They're generating the NaN values.
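The same comparison can be run in one pass by fitting each group of features and checking whether its residuals are all finite. This is a sketch on simulated data (df_toy is made up and will not reproduce the original error); only the column names follow the question.

```r
set.seed(7)
df_toy <- data.frame(rows = sample(100:10000, 30),
                     columns = sample(5:200, 30))
df_toy$volume      <- df_toy$rows * df_toy$columns
df_toy$rows_log    <- log(df_toy$rows)
df_toy$columns_log <- log(df_toy$columns)
df_toy$volume_log  <- log(df_toy$volume)
df_toy$row_sq      <- df_toy$rows^2
df_toy$col_sq      <- df_toy$columns^2
df_toy$time        <- 1e-6 * df_toy$volume + rnorm(30, sd = 0.01)

formulas <- list(
  lin = time ~ rows + columns + volume,
  log = time ~ rows_log + columns_log + volume_log,
  sq  = time ~ row_sq + col_sq
)

# TRUE means the sub-model produced only finite residuals
ok <- sapply(formulas, function(f) {
  all(is.finite(residuals(lm(f, data = df_toy))))
})
print(ok)
```

On the real data, a FALSE entry would point to the feature group whose fit produces the NaN residuals that summary() complains about.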

Issue with non-linear least squares regression

I've been experiencing an issue when trying to estimate a model using the 'nls' function. I am trying to estimate the parameters (i.e. the aj's for j=0,1,...,9) in the following equation:
log(y) = log(a0 + a1*x1 + a2*x2 + a3*x3 + a4*(x1)^2 + a5*(x2)^2 +
a6*(x3)^2 + a7*(x1*x2) + a8*(x1*x3) + a9*(x2*x3)) + error
In order to avoid negative values inside the log function, I set the starting values for the parameters in the non-linear least squares model according to the following code:
ols.model <- lm(y ~ x1 + x2 + x3 + (x1)^2 + (x2)^2 + (x3)^2 + x1*x2 +
x1*x3 + x2*x3)
ols.coefficients <- ols.model$coefficients
fitted <- fitted(ols.model)
fitted <- fitted-ols.coefficients[1]
min.fitted <- min(fitted)
b0.start <- -min.fitted + 0.1
The above ensured that none of the starting fitted values will be negative inside the log function. My call of the nls regression looked like this:
nls.model <- nls(log(y) ~ log(b0 + b1*x1 + b2*x2 + b3*x3 + b4*(x1)^2 +
b5*(x2)^2 + b6*(x3)^2 + b7*(x1*x2) + b8*(x1*x3) +
b9*(x2*x3)),
start=list(b0=b0.start, b1=ols.coefficients[2],
b2=ols.coefficients[3], b3=ols.coefficients[4],
b4=ols.coefficients[5],b5=ols.coefficients[6],
b6=ols.coefficients[7], b7=ols.coefficients[8],
b8=ols.coefficients[9], b9=ols.coefficients[10]),
trace=TRUE)
Despite these carefully selected starting parameter values, I keep on getting an error message that reads:
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: Warning message:
In log(b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + :
NaNs produced
Does anyone have any idea how I can resolve this issue and estimate the non-linear model without getting an error message? My dataset does not contain any missing or zero values so that is definitely not the problem.
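One thing worth checking in the code above is the lm formula used for the starting values. Inside an R formula, ^ and * are formula operators, so (x1)^2 collapses to x1 and x1*x2 expands to x1 + x2 + x1:x2. Squared terms need I(x1^2). The sketch below uses simulated data (d is made up) to compare the number of coefficients from each spelling.

```r
set.seed(123)
d <- data.frame(x1 = runif(50), x2 = runif(50), x3 = runif(50))
d$y <- 1 + d$x1 + d$x2 + d$x3 + d$x1^2 + rnorm(50, sd = 0.1)

# Formula as written in the question: no squared terms are created
fit_plain <- lm(y ~ x1 + x2 + x3 + (x1)^2 + (x2)^2 + (x3)^2 +
                  x1*x2 + x1*x3 + x2*x3, data = d)

# Squares protected with I(), interactions written with ':'
fit_I <- lm(y ~ x1 + x2 + x3 + I(x1^2) + I(x2^2) + I(x3^2) +
              x1:x2 + x1:x3 + x2:x3, data = d)

n_plain <- length(coef(fit_plain))
n_I     <- length(coef(fit_I))
tenth   <- coef(fit_plain)[10]   # indexing past the end returns NA

print(c(n_plain = n_plain, n_I = n_I))
print(tenth)
```

If the real fit behaves the same way, ols.coefficients[8] through ols.coefficients[10] are NA and the earlier positions do not line up with b4 to b7, so nls would start from missing or mismatched values. That would be consistent with the "Missing value or an infinity" error.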
