I have a dataset with 132 observations, 1 response variable (Lopend_gemiddelde), and 9 predictors on which I perform three methods: 1) multiple linear regression (MLR), 2) kriging on the residuals (based on a multiple linear regression), and 3) universal kriging. The results for the methods differ slightly in terms of performance (R2, RMSE, MAE). As an example, the performances for the three methods, based on a 20-fold cross validation are:
Performance criterion   Multiple linear regression   Kriging on residuals   Universal kriging
R2                      0.337                        0.323                  0.333
RMSE                    7.585                        7.718                  7.615
MAE                     6.118                        6.170                  6.084
My two questions related to these results:
Why does adding the kriged residuals to the MLR predictions result in poorer performance (in terms of R2, RMSE, and MAE) than the MLR alone?
What causes the difference in model performance (in terms of R2, RMSE, and MAE) between the kriging on residuals on the one hand, and the universal kriging on the other hand?
My related code for the first method (MLR) is displayed below. test$predicted is used to calculate the R2, RMSE, and MAE.
## == 1) multiple linear regression == ##
#multiple linear regression, input data is trained dataset (75% of original)
model_train = lm(Lopend_gemiddelde ~ 1 + nightlight_450 + nightlight_4950 + population_3000 + road_class_1_5000 + road_class_2_1000 + road_class_2_5000 + road_class_3_100 + road_class_3_300 + trafBuf50 , data=train)
#predict NO2 by the trained model (linear)
test$predicted = predict(model_train, test)
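For reference, the three performance criteria can be computed from observed and predicted values along the following lines. This is only a minimal sketch: the helper name perf_metrics and the toy vectors are illustrative assumptions, and R2 is taken here as the squared Pearson correlation, which is one of several definitions in use.

```r
# Hypothetical helper (not part of the original code): performance criteria
# from observed and predicted values.
# R2 is computed as the squared Pearson correlation; other definitions exist.
perf_metrics <- function(observed, predicted) {
  err <- observed - predicted
  c(R2   = cor(observed, predicted)^2,
    RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)))
}

# Toy values standing in for test$Lopend_gemiddelde and test$predicted
obs  <- c(20, 25, 30, 35, 40)
pred <- c(22, 24, 33, 33, 41)
perf_metrics(obs, pred)
```

Applying the same helper to the predictions of all three methods keeps the comparison on an identical footing.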
The related code for the second method (kriging on residuals) is displayed below (inspired by this post). The kriged residuals are added to the prediction values that derive from the multiple linear regression method (i.e. method 1). I project my kriging on a 100m resolution grid (grid_sp). predicted_model$PredAddedKrigedResi is used to calculate the R2, RMSE, and MAE.
## == 2) kriging on the residuals == ##
train_df = data.frame(train)
data_model_train <- data.frame(x = train_df$coords.x1, y = train_df$coords.x2,resid=resid(model_train))
#fitting the variogram (autofit)
coordinates(data_model_train) = ~x+y
variogram_train = autofitVariogram(resid ~ 1, data_model_train)
#create dataset including variogram parameters
autofit_params <- variogram_train$var_model
#use variogram with autofitted parameters
lz.ok_train_resid <- krige(resid ~ 1, data_model_train, grid_sp, autofit_params)
#convert to raster
raster_train_resid <- raster(lz.ok_train_resid['var1.pred'])
#predict NO2 by the trained model (linear)
test$predicted = predict(model_train, test)
#spatially join predicted values by trained model with corresponding kriged residuals.
predicted_model = raster::extract(raster_train_resid, test, sp=T) #sp = T: keep all data
#add kriged residuals to the predicted values by the trained model
predicted_model$PredAddedKrigedResi <- predicted_model$predicted + predicted_model$var1.pred
The related code for the universal kriging method is displayed below. I project my kriging on a 100m resolution grid (grid100) that, this time, includes information on the 9 predictors to account for trends. predicted_model$predictedUK is used to calculate the R2, RMSE, and MAE.
## == 3) universal kriging == ##
variogram_uk = autofitVariogram(Lopend_gemiddelde ~ 1 + nightlight_450 + nightlight_4950 + population_3000 + road_class_1_5000 + road_class_2_1000 + road_class_2_5000 + road_class_3_100 + road_class_3_300 + trafBuf50, input_data = train)
autofit_params_uk <- variogram_uk$var_model
system.time(parallelX <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula = Lopend_gemiddelde ~ 1 + nightlight_450 + nightlight_4950 + population_3000 + road_class_1_5000 + road_class_2_1000 + road_class_2_5000 + road_class_3_100 + road_class_3_300 + trafBuf50, locations = train, newdata = grid100[parts[[x]],], model = autofit_params_uk)))
# Create SpatialPixelsDataFrame from mergeParallelX
mergeParallelX <- SpatialPixelsDataFrame(points = mergeParallelX, data = mergeParallelX@data)
#convert to raster
raster_uk <- raster(mergeParallelX["var1.pred"])
#spatially join the test locations with the corresponding universal kriging predictions
predicted_model = raster::extract(raster_uk, test, sp=T) #sp = T: keep all data
#rename
predicted_model <- predicted_model %>% rename(predictedUK = "var1.pred")
Related
I am using a linear mixed effects model with a natural spline function for age to describe the trajectory of an outcome y (measured in grams) across time (age in years) in a sample of individuals. My model includes random-effects for splines (with fewer degrees of freedom than the fixed-effects splines).
I would like to estimate the subject-specific (person-specific) age at peak velocity from this model.
A previous thread describes how to estimate the subject-specific age at peak velocity when the model contains a quadratic random-effects age term.
The code below provides the example dataset and shows how the model was fit, how the mean trajectory was plotted and how the mean velocity curve and mean age at peak velocity were calculated.
How can I estimate the age at peak velocity for each individual from this model?
# LOAD DATASET
library(RCurl)
dat <- read.csv(("https://raw.githubusercontent.com/aelhak/data/main/data.csv"))
# FIT MODEL
library(nlme)
library(splines)
model <- lme( # linear mixed-effects model
y ~ ns(age, df = 7), # FE: natural spline function for age with 6 knots
random = ~ ns(age, df = 3) | id, # RE: natural spline function for age with 2 knots
method = "ML", data = dat)
# PLOT MEAN FITTED TRAJECTORY CURVE
pred <- data.frame(age = seq(min(dat$age), max(dat$age), length = 100))
pred$pred <- predict(model, pred, level = 0)
library(tidyverse)
ggplot(data = pred, aes(x = age, y = pred)) + geom_line()
# PLOT MEAN VELOCITY CURVE
library(SplinesUtils)
spl_population <- RegSplineAsPiecePoly(model, "ns(age, df = 7)")
plot(spl_population, deriv = 1)
# ESTIMATE MEAN AGE AT PEAK VELOCITY AND PEAK VELOCITY
(apv_mean <- solve(spl_population, b = 0, deriv = 2))
(pv_mean <- predict(spl_population, apv_mean, deriv = 1))
(apv_pv_mean <- as.data.frame(cbind(apv_mean, pv_mean)))
(apv_pv_mean <- apv_pv_mean %>% top_n(1, pv_mean))
# apv_mean = 14.09897 years
# pv_mean = 317.1267 grams
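One numerical route to a subject-specific estimate is to evaluate each individual's fitted curve on a fine age grid and locate the maximum of its first differences. The sketch below is only an illustration under assumptions: the helper peak_velocity is hypothetical, and the toy logistic curve stands in for subject-level predictions, which could presumably be obtained with predict(model, newdata, level = 1) where newdata contains both age and id.

```r
# Hypothetical helper (not part of the original code): age at peak velocity
# from a fitted curve evaluated on a fine age grid.
# Velocity is approximated by first differences, assigned to interval midpoints.
peak_velocity <- function(age, fitted) {
  velocity <- diff(fitted) / diff(age)
  mid_age  <- head(age, -1) + diff(age) / 2
  i <- which.max(velocity)
  data.frame(apv = mid_age[i], pv = velocity[i])
}

# Toy check on a logistic curve whose velocity peaks at age 14
age_grid <- seq(8, 20, length.out = 1201)
toy_fit  <- 100 / (1 + exp(-(age_grid - 14)))
res <- peak_velocity(age_grid, toy_fit)
res
```

The finite-difference approach avoids having to reconstruct a piecewise polynomial per subject, at the cost of a resolution limited by the grid spacing.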
I am using the Stata dataset ANES.dta, which contains information about the 2000 presidential election in the USA. I built two models on this dataset: one logit and one LPM. I want to compare the two models using the following goodness-of-fit measures: accuracy, sensitivity and specificity.
I am new to R (I have mainly used Stata so far), so I am wondering whether it is normal to get exactly the same values in the confusion matrices for a logit model and an LPM fitted to the same data. Am I doing something wrong?
rm(list=ls())
library(foreign)
dat <- read.dta("ANES.dta", convert.factors = FALSE)
dat_clear <- na.omit(dat)
head(dat_clear)
#Logit model
m1_logit <- glm(vote ~ gender + income + pro_choice ,
data = dat_clear, family = binomial(link = "logit") ,
na.action = na.omit)
summary(m1_logit)
#LPM
m2_lpm <- lm(vote ~ gender + income + pro_choice,
data = dat_clear, na.action = na.omit)
summary(m2_lpm)
#Confusion matrix for logit model
dat_clear$prediction_log <- predict(m1_logit, newdata = dat_clear, type = "response")
dat_clear$vote_pred_log <- as.numeric(dat_clear$prediction_log > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_log)
#Confusion matrix for LPM model
dat_clear$prediction_lpm <- predict(m2_lpm, newdata = dat_clear, type = "response")
dat_clear$vote_pred_lpm <- as.numeric(dat_clear$prediction_lpm > .5)
table(observed = dat_clear$vote, predicted = dat_clear$vote_pred_lpm)
This is what the confusion matrices look like
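For the comparison described above, accuracy, sensitivity and specificity can be derived from the observed and predicted classes. The following is a minimal sketch: the helper class_metrics and the toy vectors are illustrative assumptions, and 1 is treated as the positive class.

```r
# Hypothetical helper (not part of the original code): accuracy, sensitivity
# and specificity from 0/1 vectors, treating 1 as the positive class.
class_metrics <- function(observed, predicted) {
  # Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted
  tab <- table(observed  = factor(observed,  levels = c(0, 1)),
               predicted = factor(predicted, levels = c(0, 1)))
  tp <- tab["1", "1"]; tn <- tab["0", "0"]
  fp <- tab["0", "1"]; fn <- tab["1", "0"]
  c(accuracy    = (tp + tn) / sum(tab),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Toy values standing in for dat_clear$vote and dat_clear$vote_pred_log
obs  <- c(1, 1, 1, 0, 0, 0, 1, 0)
pred <- c(1, 0, 1, 0, 1, 0, 1, 0)
class_metrics(obs, pred)
```

Calling the helper once with vote_pred_log and once with vote_pred_lpm would give the three measures for each model side by side.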
I am running a program where I fit an OLS regression and then subtract the fitted values from the actual observations to obtain the residuals.
model1 = lm(data = final, obs ~ day + poly(temp,2) + prpn + school + lag1) # linear model
predfit = predict(model1, final) # predicted values
residuals = data.frame(final$obs - predfit) # obtain residuals
I want to bootstrap my model and then do the same with the bootstrapped coefficients. I try doing this the following way:
lboot <- lm.boot(model1, R = 1000)
predfit = predict(lboot, final)
residuals = data.frame(final$obs - predfit) # obtain residuals
However, that does not work. I also try:
boot_predict(model1, final, R = 1000, condense = T, comparison = "difference")
and that also does not work.
How can I bootstrap my model and then predict based on that?
If you're trying to fit the best OLS using bootstrap, I'd use the caret package.
library(caret)
#separate indep and dep variables
indepVars = final[, names(final) != "obs"] # drop the response column by name
depVar = final$obs
#train model
ols.train = train(indepVars, depVar, method='lm',
trControl = trainControl(method='boot', number=1000))
#make prediction and get residuals
ols.pred = predict(ols.train, indepVars)
residuals = final$obs - ols.pred # observed minus predicted, as in the question
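A base-R alternative is to resample the rows, refit the model on each resample, and predict on the original data. The sketch below only illustrates that idea: the toy data frame stands in for final, its column names and the number of replicates are assumptions, and averaging the bootstrap predictions is one possible way to summarise them.

```r
set.seed(1)
# Toy data standing in for `final` (column names are illustrative only)
toy <- data.frame(temp = rnorm(50), prpn = runif(50))
toy$obs <- 2 + 1.5 * toy$temp - 0.5 * toy$temp^2 + toy$prpn + rnorm(50, sd = 0.3)

fit <- lm(obs ~ poly(temp, 2) + prpn, data = toy)

R <- 200  # number of bootstrap replicates (assumed; the question uses 1000)
# One column per replicate: resample rows, refit, predict on the original data
boot_pred <- replicate(R, {
  idx   <- sample(nrow(toy), replace = TRUE)
  refit <- lm(formula(fit), data = toy[idx, ])
  predict(refit, newdata = toy)
})

boot_mean <- rowMeans(boot_pred)  # bootstrap-averaged prediction per observation
boot_res  <- toy$obs - boot_mean  # residuals relative to that average
```

Keeping the full boot_pred matrix also allows per-observation percentile intervals, for example with apply(boot_pred, 1, quantile, probs = c(0.025, 0.975)).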
I would like to ask how to calculate a confusion matrix for a fixed effect logit model (bife package)
With the basic logit model (glm) there is no problem, but with fixed effect logit there is.
For some reason the number of predictions is different for logit and fixed effect logit.
Example:
library(bife)
library(tidyverse)
library(caret)
dataset <- psid
logit <- glm(LFP ~ AGE + I(AGE^2) + log(INCH) + KID1 + KID2 + KID3, data = dataset, family = "binomial")
mod <- bife(LFP ~AGE + I(AGE^2) + log(INCH) + KID1 + KID2 + KID3| ID, dataset)
summary(mod)
summary(logit)
predict(logit)
predict(mod)
Y <- factor(dataset$LFP)
PRE <- factor(round(predict(logit, type = "response")))
PRE_FIX <- factor(round(predict(mod, type = "response")))
confusionMatrix(Y, PRE)
# Not working
confusionMatrix(Y, PRE_FIX)
It is possible to compute the confusionMatrix:
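# fitted() is applied to the model (PRE_FIX is a factor); this assumes mod$data holds the observations bife kept
confusionMatrix <- table(true = mod$data$LFP, pred = round(fitted(mod)))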
And then, convert the confusion matrix to a table shape.
    0              1
0   TruePositive   FalseNegative
1   FalsePositive  TrueNegative
I am trying to understand what the "ld" residuals are that are produced by running the residuals function on a survreg model.
For example
library(survival)
mod <- survreg(Surv(time, status -1) ~ age , data = lung)
residuals( mod , "ldcase")
residuals( mod, "ldshape")
residuals( mod , "ldresp")
The documentation for the residual function says the following:
Diagnostics based on these quantities are discussed in an article by
Escobar and Meeker. The main ones are the likelihood displacement
residuals for perturbation of a case weight (ldcase), the response
value (ldresp), and the shape.
References
Escobar, L. A. and Meeker, W. Q. (1992). Assessing influence in
regression analysis with censored data. Biometrics 48, 507-528.
Taking the case weight residuals ("ldcase") in particular, my understanding from the referenced paper is that they represent an estimate of twice the difference in log-likelihood between the original model and the same model fitted with subject i's weight set to 2.
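For reference, the likelihood displacement in the local-influence literature is usually written as below. This is the general textbook form as I understand it, not a quotation from the paper or from the survival package; the symbols are my own notation.

```latex
LD(\omega) = 2\left[\ell(\hat{\theta}) - \ell(\hat{\theta}_{\omega})\right]
```

Here \ell is the log-likelihood of the unperturbed model, \hat{\theta} is its maximiser, and \hat{\theta}_{\omega} is the estimate obtained under the perturbation \omega (for a case weight perturbation, the changed weight of subject i).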
However when I attempt to manually code this myself my derived values seem to have no relationship at all to the values produced by the residual function (fully reproducible example below):
library(survival)
library(ggplot2)
mod <- survreg( Surv(time, status -1) ~ age , data = lung)
get_ld <- function(i, mod){
weight <- rep(1 , nrow(lung))
weight[i] <- 2
modw <- survreg(
Surv(time, status -1) ~ age ,
data = lung ,
weights = weight
)
2 * as.numeric(logLik(mod) - logLik(modw))
}
dat <- data.frame(
ld = sapply( 1:nrow(lung), get_ld , mod = mod),
ld_est = residuals(mod , "ldcase")
)
ggplot( data = dat , aes( x = ld_est , y = ld)) + geom_point()
Additionally, according to the paper these residuals should follow a 2 * chisq(p + 2) distribution, which in this case (p = 1) gives a one-sided 95% cutoff point of 15.62. That would imply that my manually derived residuals are at least on the correct scale, which leaves me very confused as to what the residuals returned by "ldcase" actually are.