How to address non-finite values in regression using R [duplicate]

I'm using the plm package to analyse my panel data, which comprises a set of states over 14 years. While running plm regressions I have often encountered the error "model matrix or response contain non-finite values", but so far I have resolved it by deleting observations with null or NA values. However, I'm now running the regression:
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) + log (DC) + log(DK) + Gini + I(log(DC)*Gini) + I(log(DK)*Gini), data = dd, effect = 'individual')
summary (mod_3.1_within_log_b)
which returns
Error in model.matrix.pdata.frame(data, rhs=1, model=model, effect=effect,
model matrix or response contains non-finite values (NA/NaN/inf/-inf)
But, as I said, my data contains no more null or NA values. Just to test this, I've run the separate regressions
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) + log (DC) + Gini + I(log(DC)*Gini) + I(log(DK)*Gini), data = dd, effect = 'individual')
and
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) + log(DK) + Gini + I(log(DC)*Gini) + I(log(DK)*Gini), data = dd, effect = 'individual')
summary (mod_3.1_within_log_b)
and both worked, indicating that it is when I run with log(DK) and log(DC) together that I receive the error.
Thanks in advance!

As @StupidWolf suggested in the comments, your model matrix may contain zeros or possibly negative values (log(-1) returns NaN and log(0) returns -Inf).
plm does not remove such observations automatically, but we can find them manually by checking the model matrix used (or by looking at the original data). Without the complete data, this is just a suggestion for checking the model matrix for some simple problems.
Note that I've shortened the formula to improve readability.
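f <- ~ txinad + prod + op + emp + log(RT) + (log(DC) + log(DK)) * Gini
## na.pass keeps every row, so the rows of mm line up with the rows of dd
mm <- model.matrix(f, model.frame(f, data = dd, na.action = na.pass))
## Flag rows with any non-finite entry (NA, NaN, Inf or -Inf)
if (any(icc <- rowSums(!is.finite(mm)) > 0)) {
  cat('Rows in dd causing trouble:\n')
  print(dd[icc, ])
}
This prints any rows in dd that cause problems in the model matrix.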
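As a side note on the check itself, the base R predicates differ in what they flag: is.na() and complete.cases() catch NA and NaN, while is.finite() additionally rules out Inf and -Inf. A minimal sketch on a made-up vector (the values are purely illustrative and not taken from the asker's data):

```r
# Toy values (hypothetical): a zero and a negative number among positives
x  <- c(2, 0, -1, 5)
lx <- suppressWarnings(log(x))    # log(-1) warns and yields NaN, log(0) yields -Inf

flag_na     <- is.na(lx)                         # flags only the NaN
flag_cc     <- !complete.cases(data.frame(lx))   # same result as is.na() here
flag_finite <- !is.finite(lx)                    # flags both the -Inf and the NaN

print(data.frame(x, lx, flag_na, flag_cc, flag_finite))
```

So a zero in a logged variable would pass a check based only on NA, which is why is.finite() is the safer test for this particular error.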

Related

My panel linear regression with log variables returns error on non-finite values, but there are no logs on zero or negative values

I'm trying to run a fixed effects regression on my panel data (using the plm package). The regression in levels worked well, as did the first regressions using log variables (I'm taking logs of only the dependent variable and some independent variables, which are in monetary terms). However, my regressions with logs stopped working.
library(AER)
library(plm)
#Indicates the panel and the time and individual columns
dd <- pdata.frame(painel, index = c ('Estado', 'Ano'))
#Model 1 - Model within with individual fixed effects
mod_1_within <- plm(PIB ~ txinad + op + desoc + Divliq + Esc_15 + RT + DC + DK + Gini + I(DK*Gini) + I(DC*Gini), data = dd, effect = 'individual')
summary (mod_1_within)
#this worked well
#Model 2 - Model 1 with the monetary variables in log (the others are % or indexes):
mod_1_within_log<- plm(log(PIB) ~ txinad + log(RT) + op + desoc + Divliq + Esc_15 + log(DC) + log(DK) + Gini + I(Gini*log(DC)) + I(Gini*log(DK)), data = dd, effect = 'individual')
summary (mod_1_within_log)
#This returns:
> mod_1_within_log<- plm(log(PIB) ~ txinad + log(RT) + op + desoc + Divliq + Esc_15 + log(DC) + log(DK) + Gini + I(Gini*log(DC)) + I(Gini*log(DK)), data = dd, effect = 'individual')
Error in model.matrix.pdata.frame(data, rhs = 1, model = model, effect = effect, :
model matrix or response contains non-finite values (NA/NaN/Inf/-Inf)
> summary (mod_1_within_log)
Error in summary(mod_1_within_log) : object 'mod_1_within_log' not found
This is occurring even though none of the logged variables has negative or zero values. I will take this opportunity to ask another question: if a variable has a zero value, is there a way to make that value null and then take the log of that variable?
Thanks in advance!
I assume you are getting that error because the logged predictors or the logged outcome contain Inf or -Inf values.
To check whether that is the case, look at the untransformed variables (before taking logs) and see whether any observation has a value of zero. If one does, that is the problem. Why? R returns -Inf for log(0). So when you run the FE model, plm gives you that error because it can't deal with NaN or Inf values.
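The check described above, together with the asker's side question about blanking out zeros before taking logs, could be sketched roughly as follows. The data frame painel_demo and its values are hypothetical stand-ins (only the column names mirror the question), and recoding to NA followed by dropping rows is just one possible way to handle such values:

```r
# Hypothetical toy data; column names mirror the question, values are made up
painel_demo <- data.frame(
  PIB = c(10, 0, 25, 40),
  DC  = c(3, 4, -1, 6),
  DK  = c(1, 2, 3, 4)
)
log_vars <- c("PIB", "DC", "DK")   # variables that will be logged

# Count values for which log() would not be finite (zero or negative)
n_nonpos <- sapply(painel_demo[log_vars], function(x) sum(x <= 0, na.rm = TRUE))
print(n_nonpos)

# Recode those values to NA, then drop the affected rows explicitly
clean <- painel_demo
clean[log_vars] <- lapply(clean[log_vars],
                          function(x) replace(x, which(x <= 0), NA))
clean <- clean[complete.cases(clean[log_vars]), ]

# Every logged value is now finite
print(all(sapply(clean[log_vars], function(x) all(is.finite(log(x))))))
```

Whether dropping these observations is appropriate for the analysis is a separate modelling question.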

Error in quantile.default(resid) : missing values and NaN's not allowed

I am trying to predict the time it takes for a model to train (sklearn's linear regression) given a particular number of rows and columns. I have created additional features by taking the log and the square of the number of rows and columns.
I have pasted the data here. As you can see, the dataset has no missing values or NaN's.
I tried to fit a linear regression model in R with the lm function, using the code below:
library(data.table)
df = fread("linreg_df_edited.csv")
lrmodel <- lm(time ~ rows + columns + volume + rows_log + columns_log + volume_log + row_sq + col_sq, data = df)
But when I request the summary of the model using summary(lrmodel), I get the following error
Error in quantile.default(resid) :
missing values and NaN's not allowed if 'na.rm' is FALSE
My dataset doesn't have any missing values but I still tried and rebuilt the model after setting na.action=na.omit
lrmodel <- lm(time ~ rows + columns + volume + rows_log + columns_log + volume_log + row_sq + col_sq, df, na.action=na.omit)
I still get the same error. I can't figure this out. I thought maybe a column was being read as a character variable, but that isn't the case either.
Any idea why this is happening?
Don't try to model on all of your transformations at once. Your call is:
model <- lm(time ~ rows + columns + volume + rows_log + columns_log + volume_log + row_sq + col_sq, data = df)
Instead, do:
model_lin <- lm(time ~ rows + columns + volume, data = df)
model_log <- lm(time ~ rows_log + columns_log + volume_log, data = df)
model_sq <- lm(time ~ row_sq + col_sq, data = df)
Then you'll see the squares are the problem. They're generating the NaN values.

Variable Lengths Differ Error on Predictive model

I am trying to build a predictive model from survey data. My DVs are questions on NPS and similar data points. My IVs are mainly demographic questions. I keep getting a "variable lengths differ" error with the following lines of code:
Model <- lm(Q6 ~ amount_spent + first_time + gender +
workshop_participation + adults + children +
household_adults + Below..25K. + X.25K.to..49K. +
X.50K.to..74K. + X.75K.to..99K. + X.100K.to..124K. +
X18.24. + X25.34. + X35.44. + X45.64.,
data = diy_festival2)
Here is the error:
Error in model.frame.default(formula = Q6 ~ amount_spent + first_time + :
variable lengths differ (found for 'Below..25K.')
What are some possible causes and what are some potential fixes I can try?
Your formula references one or more variables that are not in diy_festival2 and are instead picked up from the global environment; the error message suggests the culprit is Below..25K. For example:
x <- data.frame(x1=rnorm(100))
x2 <- rnorm(10)
model.matrix( ~ x1 + x2, data=x)
gives the error you have.
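One quick way to spot such variables is to compare the names used in the formula with the columns of the data frame. This is a minimal sketch on made-up data; survey_demo and the formula f are hypothetical and only illustrate the idea:

```r
# Hypothetical data frame that lacks one of the variables used in the formula
survey_demo <- data.frame(y = rnorm(5), x1 = rnorm(5))
f <- y ~ x1 + x2

# Variables named in the formula that are not columns of the data frame
missing_vars <- setdiff(all.vars(f), names(survey_demo))
print(missing_vars)
```

Any name returned here would be looked up outside the data frame when the model is fitted, which is where a length mismatch can come from.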

Survival regression with survival package in R

We are trying to reproduce in R the results of a model that was coded in SAS. The model is ln(Duration) = X'B + S*e, where X is the matrix of 10 independent variables, B is a vector of coefficients, S is the scale parameter and e is the error term.
The data set we use is here
There you can find the SAS code as well.
The first try looked as follows:
Dur <- survreg(Surv(Duration, Censor==0) ~ Acq_Expense + Acq_Expense_SQ + Ret_Expense + Ret_Expense_SQ + Crossbuy + Frequency + Frequency_SQ + Industry + Revenue + Employees, dist='weibull', data = daten [daten$Acquisition==1, ])
summary(Dur)
But the coefficients in this model are not correct. On the following picture you see the R output on the left and the correct SAS output on the right:
We detected a problem with the squared terms (Acq_Expense_SQ, Ret_Expense_SQ), because when we exclude those terms all other estimates are much closer to the correct values. Therefore, we tried to scale down the squared terms by the factor 0.001.
# Rescale the squared terms by 0.001 and store them as new columns
daten$Acq_Expense_SQ2 <- 0.001 * daten$Acq_Expense_SQ
daten$Ret_Expense_SQ2 <- 0.001 * daten$Ret_Expense_SQ
# Keep only the acquired customers
date3 <- subset(daten, Acquisition == 1)
Dur <- survreg(Surv(Duration, Censor == 0, type = 'right') ~ Acq_Expense + Acq_Expense_SQ2 + Ret_Expense + Ret_Expense_SQ2 + Crossbuy + Frequency + Frequency_SQ + Industry + Revenue + Employees, dist='weibull', scale = 0, data = date3)
summary(Dur)
Now the coefficients are much closer to the correct ones, but I do not know why.
Is there a possible explanation for this problem?
Or do you see another problem with our code?
