Survival regression with survival package in R

We are trying to reproduce in R the results of a model that was originally coded in SAS. The model is ln(Duration) = X'B + S*e, where X is the matrix of 10 independent variables, B is a vector of coefficients, S is the scale parameter, and e is the error term.
The data set we use is here; there you can find the SAS code as well.
The first try looked as follows:
Dur <- survreg(Surv(Duration, Censor == 0) ~ Acq_Expense + Acq_Expense_SQ +
                 Ret_Expense + Ret_Expense_SQ + Crossbuy + Frequency +
                 Frequency_SQ + Industry + Revenue + Employees,
               dist = 'weibull', data = daten[daten$Acquisition == 1, ])
summary(Dur)
But the coefficients in this model are not correct. A side-by-side comparison (posted as a screenshot, with the R output on the left and the correct SAS output on the right) showed the mismatch.
We detected a problem with the squared terms (Acq_Expense_SQ, Ret_Expense_SQ): when we exclude those terms, all other estimates are much closer to the correct values. Therefore, we tried rescaling the squared terms by a factor of 0.001.
daten$Acq_Expense_SQ2 <- 0.001 * daten$Acq_Expense_SQ
daten$Ret_Expense_SQ2 <- 0.001 * daten$Ret_Expense_SQ
date3 <- subset(daten, Acquisition == 1)
Dur <- survreg(Surv(Duration, Censor == 0, type = 'right') ~ Acq_Expense +
                 Acq_Expense_SQ2 + Ret_Expense + Ret_Expense_SQ2 + Crossbuy +
                 Frequency + Frequency_SQ + Industry + Revenue + Employees,
               dist = 'weibull', scale = 0, data = date3)
summary(Dur)
Now the coefficients are much closer to the correct ones, but I do not know why.
Is there a possible explanation for this?
Or do you see another problem with our code?
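As a side note, the same rescaling can be written inline with I() instead of creating extra columns; a minimal sketch using the same daten data frame and variable names as above:
## equivalent rescaling done inline; I() protects arithmetic inside a formula
Dur2 <- survreg(Surv(Duration, Censor == 0) ~ Acq_Expense + I(0.001 * Acq_Expense_SQ) +
                  Ret_Expense + I(0.001 * Ret_Expense_SQ) + Crossbuy + Frequency +
                  Frequency_SQ + Industry + Revenue + Employees,
                dist = 'weibull', data = daten[daten$Acquisition == 1, ])
summary(Dur2)  # the I() coefficients apply to the rescaled variables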

Related

How to address non-finite values in regression using R

I'm using the plm package to analyse my panel data, which comprises a set of states over 14 years. While running plm regressions I've often encountered the error "model matrix or response contain non-finite values", but so far I've solved it by deleting observations with null or NA values. However, I'm now running the regression:
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) +
                              log(DC) + log(DK) + Gini + I(log(DC)*Gini) +
                              I(log(DK)*Gini), data = dd, effect = 'individual')
summary(mod_3.1_within_log_b)
which returns
Error in model.matrix.pdata.frame(data, rhs=1, model=model, effect=effect,
model matrix or response contains non-finite values (NA/NaN/inf/-inf)
But, as I said, my data no longer contains null or NA values. Just to test this, I ran the separate regressions
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) +
                              log(DC) + Gini + I(log(DC)*Gini) +
                              I(log(DK)*Gini), data = dd, effect = 'individual')
and
mod_3.1_within_log_b <- plm(log(PIB) ~ txinad + prod + op + emp + log(RT) +
                              log(DK) + Gini + I(log(DC)*Gini) +
                              I(log(DK)*Gini), data = dd, effect = 'individual')
summary(mod_3.1_within_log_b)
and both worked, indicating that it is when I run with log(DK) and log(DC) together that I receive the error.
Thanks in advance!
As @StupidWolf suggested in the comments, your model matrix may contain zeros or negative values (log(-1) returns NaN and log(0) returns -Inf).
plm does not remove such observations automatically, but we can check for them manually via the model matrix (or by looking at the original data). Without your data this is just a suggestion for spotting some simple problems in the model matrix.
Note that I've shortened the formula to improve readability.
mm <- model.matrix(~ txinad + prod + op + emp + log(RT) +
                     (log(DC) + log(DK)) * Gini, data = dd)
## Check for rows with NA/NaN/Inf/-Inf (is.finite is FALSE for all of these)
if(any(icc <- rowSums(!is.finite(mm)) > 0)){
  cat('Rows in dd causing trouble:\n')
  print(dd[icc, ])
}
This will print the rows of dd that produce non-finite values in the model matrix.
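Another quick check, assuming the log() terms are the suspects, is to count the values in the raw data that log() cannot handle; a minimal sketch, with the column names taken from the formula above:
## values that would produce NaN or -Inf under log(), per logged column
sapply(dd[c('PIB', 'RT', 'DC', 'DK')], function(v) sum(v <= 0 | is.na(v)))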

How to derive the equation for a non-linear time series regression model built in R?

I've built a non-linear time series regression model in R that I would like to write down as an equation, so that I can back-test it against my data in an Excel spreadsheet. I've created a ts object and fitted a model using the tslm function, as shown below:
model16 <- tslm(production ~ date + I(date^2) + I(date^3) +
I(temp_neg_32^3) +
I(humidity_avg^3) +
I(dew_avg^3) +
below_freezing_min,
data = production_temp_no_outlier.ts)
I find the coefficients for each variable in the model by using the following code:
summary(model16)
The summary output (posted as a screenshot) gives the coefficients used below.
So, my understanding is that the equation of my model should be:
y = -7924000000 + 1268000*date - 67.62*(date^2) + 0.001202*(date^3) +
    0.04395*(temp_neg_32^3) + 0.008658*(humidity_avg^3) - 0.03762*(dew_avg^3) -
    11930*below_freezing_min
However, whenever I plug the data into this equation, the output is completely off; it has nothing in common with the fitted-curve visualization that I built in R from this model. So I am clearly doing something wrong. I would be very grateful if someone could point out my errors!
This use of regression doesn't give you an exact fit; it gives you the line of best fit. What is the coefficient of determination (a.k.a. explained variance, or R^2)?
Take a look at this set of data (somewhat modeled after your example).
library(forecast)
library(tidyverse)
data("us_change", package = "fpp3")
fit <- tslm(Production~Savings + I(Savings^2) + I(Savings^3) + I(Income^3) + Unemployment,
data = ts(us_change))
summary(fit)
Here I extracted the coefficients, so I can show you a bit more of what I mean. Then I created a function that calculates the outcome of the regression equation.
cFit <- coefficients(fit)
#  (Intercept)       Savings  I(Savings^2)  I(Savings^3)   I(Income^3)  Unemployment
# 5.221684e-01  6.321979e-03 -2.472784e-04 -6.376422e-06  7.029079e-03 -3.144743e+00
regFun <- function(cFit, data){
  # evaluate the regression equation; every term, including Unemployment,
  # is multiplied by its coefficient (with() avoids attach()/detach())
  with(data, cFit[[1]] + cFit[[2]] * Savings + cFit[[3]] * Savings^2 +
         cFit[[4]] * Savings^3 + cFit[[5]] * Income^3 + cFit[[6]] * Unemployment)
}
Here are some examples of the predicted outcome versus the actual outcome.
fitOne <- regFun(cFit, us_change[1,])
# [1] 1.455793
us_change[1,]$Production
# [1] -2.452486
fitTwo <- regFun(cFit, us_change[2,])
# [1] 1.066338
us_change[2,]$Production
# [1] -0.5514595
fitThree <- regFun(cFit, us_change[3,])
# [1] 1.08083
us_change[3,]$Production
# [1] -0.3586518
You can tell from the variance here that the production volume is not explained very well by the inputs I provided.
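To put a number on that, the coefficient of determination asked about above can be read straight off the summary, since tslm() returns an lm-style fit; a minimal sketch:
## R^2: share of the variance in Production explained by the model
summary(fit)$r.squared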
Now look at what happens when I graph this:
plt <- ggplot(data = us_change %>%
                mutate(Regression = regFun(cFit, us_change)),
              aes(x = Production)) +
  geom_point(aes(y = Savings, color = "Savings")) +
  geom_point(aes(y = Savings^2, color = "Savings^2")) +
  geom_point(aes(y = Savings^3, color = "Savings^3")) +
  geom_point(aes(y = Unemployment, color = "Unemployment")) +
  geom_line(aes(y = Regression, color = "Regression")) + # regression line
  scale_color_viridis_d(end = .8) + theme_bw()
plotly::ggplotly(plt)
The regression equation output is the black line. It's the best fit, but there are values that are not represented all that well.
If you look closer, it's not a straight line either.
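As a sanity check on a hand-written equation like this, you can compare it against R's own fitted values; if the two disagree, the transcription (for example, rounded coefficients) is the problem rather than the model. A minimal sketch using the fit from above:
## manual equation vs. R's fitted values; these should agree to rounding error
head(regFun(cFit, as.data.frame(us_change)))
head(fitted(fit))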

Not getting a smooth curve using ggplot2

I am trying to fit a mixed effects model using the lme4 package. Unfortunately I cannot share the data I am working with, and I couldn't find a toy data set relevant to my problem. So here are the steps I have followed so far:
First I plotted the overall trend of the data as follows:
p21 <- ggplot(data = sub_data, aes(x = age_cent, y = y))
p21 + geom_point() + geom_smooth()
Based on this, there seems to be some nonlinear trend in the data, so I tried to fit a quadratic model as follows:
library(lme4)
sub_data$age_cent <- sub_data$age - mean(sub_data$age)
sub_data$age_centsqr <- (sub_data$age - mean(sub_data$age))^2
m1 <- lmer(y ~ 1 + age_cent + age_centsqr + (1 | id), data = sub_data, REML = TRUE)
In the above model I only included a random intercept, because I don't have enough data to include both a random slope and intercept. Then I extracted the population-level predictions of this model as follows:
pred1 <- predict(m1, re.form = NA)
Next I plotted these predictions along with a smooth quadratic fit, like this:
p21 + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), col = "red") +
  geom_line(aes(y = pred1, group = id), col = "blue", lwd = 0.5)
In the above plot, the curve corresponding to the predictions is not smooth. Can anyone help me figure out the reason for that?
Am I doing something wrong here?
Update:
As eipi10 pointed out, this may be due to fitting different curves for different people. But when I tried the same thing using a toy data set from the lme4 package, I got the same curve for each person, as follows:
m1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), data = sleepstudy)
pred1new1 <- predict(m1, re.form = NA)
p21 <- ggplot(data = sleepstudy, aes(x = Days, y = Reaction))
p21 + geom_point() + geom_smooth()
p21 + geom_point() + geom_smooth() +
  geom_line(aes(y = pred1new1, group = Subject), col = "red", lwd = 0.5)
What may be the reason for the different results? Is this due to imbalance in the data?
The data I used were collected at 3 time points, and some people didn't have observations for all 3; the toy data set, in contrast, is balanced.
Thank you
tl;dr use expand.grid() or something like it to generate a balanced/evenly spaced sample for every group (if you have a strongly nonlinear curve you may want to generate a larger/more finely spaced set of x values than in the original data)
You could also take a look at the sjPlot package, which does a lot of this stuff automatically ...
You need both an unbalanced data set and a non-linear (e.g. polynomial) model for the fixed effects to see this effect:
- if the model is linear, you don't notice missing values, because the linear interpolation done by geom_line() works perfectly;
- if the data are balanced, there are no gaps to get weirdly filled by linear interpolation.
Generate an example with quadratic effects and an unbalanced data set, and fit the model:
library(lme4)
set.seed(101)
dd <- expand.grid(id = factor(1:10), x = 1:10)
dd$y <- simulate(~ poly(x, 2) + (poly(x, 2) | id),
                 newdata = dd,
                 family = gaussian,
                 newparams = list(beta = c(0, 0, 0.1),
                                  theta = rep(0.1, 6),
                                  sigma = 1))[[1]]
## subsample randomly (missing values)
dd <- dd[sort(sample(nrow(dd), size = round(0.7 * nrow(dd)))), ]
m1 <- lmer(y ~ poly(x, 2) + (poly(x, 2) | id), data = dd)
Naive prediction and plot:
dd$pred1 <- predict(m1,re.form=NA)
library(ggplot2)
p11 <- (ggplot(data = dd, aes(x = x, y = y))
+ geom_point() + geom_smooth(method="lm",formula=y~poly(x,2))
)
p11 + geom_line(aes(y=pred1,group = id) ,col="red", lwd = 0.5)
Now generate a balanced data set. This version generates 51 evenly spaced points between the min and max; this is useful if the original data are unevenly spaced. If you have NA values in your x variable, don't forget na.rm = TRUE ...
pframe <- with(dd, expand.grid(id = levels(id), x = seq(min(x), max(x), length.out = 51)))
Make predictions, and overlay them on the original plot:
pframe$pred1 <- predict(m1, newdata = pframe, re.form = NA)
p11 + geom_line(data = pframe, aes(y = pred1, group = id), col = "red", lwd = 0.5)
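As mentioned at the top, sjPlot can automate this kind of population-level prediction plot; a minimal sketch, assuming the package is installed and using the m1 fit from this example:
library(sjPlot)
## population-level predictions of y over an evenly spaced grid of x
plot_model(m1, type = "pred", terms = "x")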

Significant variables for Logistic regression in R

I am still new to R and still struggling. I am trying to do a logistic regression using categorical and continuous variables, and I am supposed to select the right variables for my model. There are 27 variables and 8,000 observations.
I have gone through a couple of articles online, including stepwise regression by AIC, and all they do is confuse me more. I was also told to select my variables from the correlation matrix, but when I compute the correlations I don't seem to find any, especially with the categorical variable. I also tried to fit the full model, and I get some variables with p-values less than 0.5. This is the code:
d4 <- d3[, c('SW','MOI','YOI','DOI_CMC','RMOB','RYOB','RDOB_CMC',
             'RCA','Region','TPR','DPR','NV','HEL','Has_Radio','Has_TV',
             'Religion','WI','MOFB','YOB','DOB_CMC','DOFB_CMC','AOR','MTFBI',
             'DSOUOM_CMC','RW','RH','RBMI')]
d5 <- cor(d4)
round(d5, 2)
When I select the significant variables and try to apply logistic regression, all the p-values are between 0.9 and 1. See the code:
d3 <- lm(TPR ~ SW + MOI + RMOB + RYOB + RCA + Region + TPR + DPR +
NV + HEL + Has_Radio + Has_TV + Religion + WI + MOFB +
YOB + DOB_CMC + DOFB_CMC + AOR + MTFBI + DSOUOM_CMC +
RW + RH + RBMI,
data = d3, family = "binomial")
summary(d3)
I need help with this please!!
A sample of d3 was included in the original question.
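One thing worth checking in the code above: lm() always fits an ordinary linear model (family is an argument of glm(), not lm()), and TPR appears on both sides of the formula while the result overwrites the data frame d3. A minimal sketch of a logistic fit, assuming TPR is the binary outcome, dropping it from the predictors, and assigning the result to a new name:
## logistic regression needs glm() with a binomial family, not lm()
fit <- glm(TPR ~ SW + MOI + RMOB + RYOB + RCA + Region + DPR + NV + HEL +
             Has_Radio + Has_TV + Religion + WI + MOFB + YOB + DOB_CMC +
             DOFB_CMC + AOR + MTFBI + DSOUOM_CMC + RW + RH + RBMI,
           data = d3, family = binomial)
summary(fit)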
