Compare two regression models in R

# create a subset with age greater than 25 and fit a model on it
age25 <- subset(juul, juul[,"age"] > 25)
modelgf <- lm(age25[,"igf1"] ~ age25[,"age"])
# same for age less than 20
age20 <- subset(juul, juul[,"age"] < 20)
modelgf2 <- lm(age20[,"igf1"] ~ age20[,"age"])
I tried to compare the modelgf and modelgf2 models using anova(modelgf, modelgf2). However, I get a warning message:
In anova.lmlist(object, ...) :
models with response ‘"age20[, \"igf1\"]"’ removed because response differs from model 1
Are there any other ways to compare these two models?

Here you go:
# Dummy for age > 25
juul[,"ageCat25"] <- juul[,"age"] > 25
# Collinear dummy for age < 20 (TRUE for age < 20 as well as for age > 25)
juul[,"ageCat20"] <- ifelse(!juul[,"ageCat25"] & juul[,"age"] < 20, TRUE, juul[,"ageCat25"])
# Both models now share the same response, so anova() can compare them
m1 <- lm(igf1 ~ ageCat25, juul)
m2 <- lm(igf1 ~ ageCat20, juul)
anova(m1, m2)
Interpretation left to the OP.
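A variant worth sketching (assuming juul is the dataset from the ISwR package, as the variable names suggest; the grp factor below is introduced purely for illustration): fit one model on the combined subsets with an age-group factor, so both candidate models share the same response and anova() can test whether the two groups follow different regression lines.
library(ISwR)  # assumed source of the juul data
dat <- subset(juul, age < 20 | age > 25)
dat$grp <- factor(dat$age > 25, labels = c("under20", "over25"))
m_common <- lm(igf1 ~ age, data = dat)        # one common line for both groups
m_group  <- lm(igf1 ~ age * grp, data = dat)  # group-specific intercept and slope
anova(m_common, m_group)                      # do the two groups differ?
Since both models are fitted to the same rows with the same response, the comparison goes through without the warning above.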

Related

kNN algorithm not working while using caret

I am trying to run LOOCV kNN on this dataset (104x182, where the first 62 samples are B and the following 42 are C). I first conducted a PCA on the standardized version of the dataset (giving me 104 PCs). I then try to perform LOOCV kNN for i = 3:98, where i refers to the number of PCs I use for the kNN model. For each i, I pull out the highest accuracy and the k at which it occurs, and store them in a data frame.
# required packages
library(MASS)
library(class)
library(tidyverse)
library(caret)
# reading in and cleaning data
data <- read.csv("chowdary.csv")
og_data <- data[, -1]
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
# on standardized data
pca_all <- prcomp(og_data[, -1], center=TRUE, scale=TRUE)
# creating data frame to store best k value for each number of PCs
kdf_pca_all_cc <- tibble(i=as.numeric(),            # this is for storing number of PCs used,
                         pca_all_k=as.numeric(),    # k value,
                         pca_all_acc=as.numeric(),  # accuracy value,
                         pca_all_kapp=as.numeric()) # and kappa value
# kNN
k_kNN <- 3:97 # number of PCs to use in each iteration of the model
train_control <- trainControl(method="LOOCV")
kNN_data <- as.data.frame(cbind(as.factor(st_data[, 1]), pca_all$x)) # data used in kNN model below
for (i in k_kNN){
  a111 <- train(V1 ~ .,
                method="knn",
                tuneGrid=expand.grid(k=1:25),
                trControl=train_control,
                metric="Accuracy",
                data=kNN_data[, 1:i])
  b111 <- a111$results[as.integer(a111$bestTune), ] # best accuracy rate, along with its k and kappa value
  kdf_pca_all_cc <- kdf_pca_all_cc %>%
    add_row(i=i-1, # i-1 PCs, since column 1 of kNN_data is the outcome
            pca_all_k=b111[, 1],
            pca_all_acc=b111[, 2],
            pca_all_kapp=b111[, 3])
}
For example, for i = 5, the kNN model would be using the following data:
head(kNN_data[, 1:5])
V1 PC1 PC2 PC3 PC4
1 1 3.299844 0.2587487 -1.00501632 2.0273727
2 1 1.427856 -1.0455044 -1.79970790 2.5244021
3 1 3.087657 1.2563404 1.67591441 -1.4270431
4 1 3.107778 1.5893396 2.65871270 -2.8217264
5 1 3.244306 0.5982652 0.37011029 0.3642425
6 1 3.000098 0.5471276 -0.01178315 1.0857886
However, whenever I try to run the for-loop, I am given the following error:
Error: Metric Accuracy not applicable for regression models
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
I have no idea how to fix this. Any help would be much appreciated.
Also, as a side note, is there a faster way to run this for-loop? It takes quite a while but I have no idea how to make it more efficient. Thank you.
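For what it's worth, the error message itself points at the likely cause: cbind() coerces the factor outcome to its integer codes, so train() sees a numeric response and attempts regression. A minimal sketch of a fix, assuming the objects defined above, is to build the data frame without cbind() so the outcome stays a factor:
# keep the response a factor so caret treats the task as classification
kNN_data <- data.frame(V1 = factor(st_data[, 1]), pca_all$x)
As for the speed question: caret can parallelize the LOOCV resampling if a parallel backend is registered before calling train(), e.g. with doParallel::registerDoParallel().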

R: one regression model for 2 different data sets to prepare for waldtest

I have two different data sets, each representing one of my two portfolios: y(p) is the dependent variable and x1(p), x2(p), x3(p), x4(p) are the independent variables, where (p) indicates a portfolio-specific value. Column 1 of each variable holds portfolio 1 and column 2 holds portfolio 2.
The regression equation is:
y(p) = α(p) + β1(p)·x1(p) + β2(p)·x2(p) + β3(p)·x3(p) + β4(p)·x4(p)
What I have done so far is implement a separate regression model for each portfolio in R:
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
My objective is to compare the two intercepts of the two regression models; within the scope of this comparison I need to test the joint significance of these intercepts. As far as I can tell, a Wald test should be appropriate.
If I use the waldtest function from the lmtest package, it does not work, obviously because the response variable is not the same for both models.
library(lmtest)
waldtest(lm1,lm2)
In waldtest.default(object, ..., test = match.arg(test)) :
models with response "y[, 2]" removed because response differs from model 1
All workarounds I tried so far did not work either, e.g. R: Waldtest: "Error in solve.default(vc[ovar, ovar]) : 'a' is 0-diml"
My guess is that the regression needs to be done in a different way to fix the problems regarding the waldtest.
That leads to my question:
Is there a way to run the regression as one model that still yields portfolio-specific intercepts and coefficients? (I assume this would fix the problems with the waldtest function.)
Any advice or suggestion will be appreciated.
The following data can be used for a reproducible example:
y=matrix(rnorm(10),ncol=2)
x1=matrix(rnorm(10),ncol=2)
x2=matrix(rnorm(10),ncol=2)
x3=matrix(rnorm(10),ncol=2)
x4=matrix(rnorm(10),ncol=2)
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
library(lmtest)
waldtest(lm1,lm2)
Best regards,
Simon
Here are three ways to test the equality of the intercepts. The second is an implementation of the accepted answer to this question, while the other two implement the second answer to the aforementioned question under different assumptions.
Let
n <- 5
y <- matrix(rnorm(10), ncol = 2)
x <- matrix(rnorm(10), ncol = 2)
First, we may indeed perform the test with only a single model. For that purpose we create a new vector Y that concatenates y[, 1] and y[, 2]. As for the independent variables, we create a block-diagonal matrix with the regressors of one model in the upper-left block and those of the other model in the lower-right block. Lastly, we create a group factor G indicating the underlying model. Hence,
library(Matrix)
Y <- c(y)
X <- as.matrix(bdiag(x[, 1], x[, 2]))
G <- factor(rep(0:1, each = n))
Now the unrestricted model is
m1 <- lm(Y ~ G + X - 1)
while the restricted one is
m2 <- lm(Y ~ X)
Testing for equality of the intercepts gives
library(lmtest)
waldtest(m1, m2)
# Wald test
#
# Model 1: Y ~ G + X - 1
# Model 2: Y ~ X
# Res.Df Df F Pr(>F)
# 1 6
# 2 7 -1 0.5473 0.4873
so that, as expected, we cannot reject their equality. A problem with this solution, however, is that it amounts to estimating the two models separately while assuming that the errors have the same variance in both; it also does not allow for cross-correlation between the errors.
Second, we can relax the assumption of identical error variances by estimating two separate models and employing a Z-test, as follows.
M1 <- lm(y[, 1] ~ x[, 1])
M2 <- lm(y[, 2] ~ x[, 2])
Z <- unname((coef(M1)[1] - coef(M2)[1]) /
              sqrt(coef(summary(M1))[1, 2]^2 + coef(summary(M2))[1, 2]^2))
2 * pnorm(-abs(Z))
# [1] 0.5425736
leading to the same conclusion.
Lastly, we can employ SUR (seemingly unrelated regressions), which allows for model-dependent error variances as well as contemporaneous cross-correlation of the errors (the latter may not be needed in your case; it depends on the kind of data you are using). For that we can use the systemfit package as follows:
library(systemfit)
eq1 <- y[, 1] ~ x[, 1]
eq2 <- y[, 2] ~ x[, 2]
m <- systemfit(list(eq1, eq2), method = "SUR")
In this case we are also able to perform the Wald test (linearHypothesis() comes from the car package):
library(car)
R <- matrix(c(1, 0, -1, 0), nrow = 1) # Restriction matrix
linearHypothesis(m, R, test = "Chisq")
# Linear hypothesis test (Chi^2 statistic of a Wald test)
#
# Hypothesis:
# eq1_(Intercept) - eq2_(Intercept) = 0
#
# Model 1: restricted model
# Model 2: m
#
# Res.Df Df Chisq Pr(>Chisq)
# 1 7
# 2 6 1 0.3037 0.5816
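Applied to the original four-regressor portfolio setup, the first (single-model) approach might look like the following sketch. The data are freshly simulated with more observations per portfolio than the five in the toy example, since five observations per portfolio would leave no residual degrees of freedom; all object names here are illustrative.
library(Matrix)
library(lmtest)
set.seed(1)
n  <- 50                                   # observations per portfolio
y  <- matrix(rnorm(2 * n), ncol = 2)
x1 <- matrix(rnorm(2 * n), ncol = 2); x2 <- matrix(rnorm(2 * n), ncol = 2)
x3 <- matrix(rnorm(2 * n), ncol = 2); x4 <- matrix(rnorm(2 * n), ncol = 2)
Y <- c(y)                                  # stack both portfolios
X <- as.matrix(bdiag(cbind(x1[, 1], x2[, 1], x3[, 1], x4[, 1]),
                     cbind(x1[, 2], x2[, 2], x3[, 2], x4[, 2])))
G <- factor(rep(1:2, each = n))            # portfolio indicator
m_u <- lm(Y ~ G + X - 1)                   # portfolio-specific intercepts
m_r <- lm(Y ~ X)                           # one common intercept
waldtest(m_u, m_r)                         # tests equality of the two intercepts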

Comparing nested models with NAs in R

I am trying to compare nested regression models using the anova() function in R, but am running into problems because the level 1 and level 2 models differ in the number of observations due to missing cases. Here is a simple example:
# Create dataframe with multiple predictors with different number of NAs
dep <- c(45,46,45,48,49)
basevar <- c(10,12,10,16,17)
pred1 <- c(NA,20,NA,19,21)
dat <- data.frame(dep,basevar,pred1)
# Define level 1 of the nested models
basemodel <- lm(dep ~ basevar, data = dat)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = dat)
# Compare the models (uh oh!)
anova(basemodel, model1)
I have seen two suggestions for similar problems, but both are problematic.
Suggestion 1: Impute the missing data. The problem with this is that the missing cases in my data were removed because they were outliers, and thus are not "missing at random," and imputing may overfit the data.
Suggestion 2: Make a separate data frame containing only the complete cases for the variable with missing cases, and use that for regressions. This is also problematic if you are creating multiple nested models sharing the same level 1 variable, but in which the level 2 variables differ in the number of missing cases. Here is an example of this:
# Create a new predictor variable with a different number of NAs from pred1
pred2 <- c(23,21,NA,10,11)
dat <- cbind(dat,pred2)
# Create dataframe containing only completed cases of pred1
nonadat1 <- subset(dat, subset = !is.na(pred1))
# Do the same for pred2
nonadat2 <- subset(dat, subset = !is.na(pred2))
# Define level 1 of the nested models within dataframe of pred1 complete cases
basemodel1 <- lm(dep ~ basevar, data = nonadat1)
# Check values of the model
summary(basemodel1)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = nonadat1)
# Compare the models (yay it runs!)
anova(basemodel1, model1)
# Define level 1 of the nested models within dataframe of pred2 complete cases
basemodel2 <- lm(dep ~ basevar, data = nonadat2)
# Values are different from those in basemodel1
summary(basemodel2)
# Add level 2
model2 <- lm(dep ~ basevar + pred2, data = nonadat2)
# Compare the models
anova(basemodel2, model2)
As you can see, creating individual data frames creates differences at level 1 of the nested models, which makes interpretation problematic.
Does anyone know how I can compare these nested models while circumventing these problems?
Could this work? See here for more information. It doesn't exactly deal with the fact that the models are fitted to different datasets, but it does allow for a comparison.
A <- logLik(basemodel)
B <- logLik(model1)
(teststat <- -2 * (as.numeric(A) - as.numeric(B)))
(p.val <- pchisq(teststat, df = 1, lower.tail = FALSE))
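A further workaround, sketched here as an addition rather than taken from that link: refit the level-1 model on exactly the rows used by the level-2 model, so that anova() compares models fitted to identical data. This automates Suggestion 2 on a per-comparison basis, with the caveat the question already raises that the level-1 fits then differ between comparisons.
# refit the base model on the rows actually used by model1
basemodel_cc <- update(basemodel, data = model.frame(model1))
anova(basemodel_cc, model1)
The same pattern works for model2 via model.frame(model2).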

Use Cox model to estimate survival

I first fit a Cox model in R:
test1<- test[1:20,]
model.1 <- coxph(Surv(test1$days,test1$status==1) ~ test1$MTT+test1$ADC,data=test1)
and when I tried to predict the next patient's survival like this:
covs1 <- data.frame(test[21,]$MTT,test[21,]$ADC)
summary(survfit(model.1, newdata= covs1, type ="aalen"))
it gave me too many survival results, and the warning is:
"'newdata' had 1 row but variables found have 20 rows "
For reference, there are 20 events, and the result contains 20 survival curves.
The data frame supplied as the basis for a prediction must have the same column names as those on the RHS of the model formula. I don't think yours will qualify unless you do something like this:
test1<- test[1:20,]
model.1 <- coxph( Surv(days, status==1) ~ MTT + ADC, data=test1)
covs1 <- test[21, c("MTT", "ADC")]
# then do your prediction
You should not use $ to supply arguments to Surv. It is important that the model be constructed in the environment of the dataframe.
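With the model refitted that way, the prediction call from the question should then return a single curve; a sketch reusing the question's type argument:
summary(survfit(model.1, newdata = covs1, type = "aalen")) # now one survival curve, for patient 21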

linear predictor - ordered probit (ordinal, clm)

I have a question regarding the ordinal package in R, specifically regarding the predict.clm() function. I would like to calculate the linear predictor of an ordered probit estimation. With the polr function of the MASS package the linear predictor can be accessed via object$lp. It gives me one value per observation, which is in line with my understanding of the linear predictor, namely x_i′β. If I instead use predict.clm(object, newdata, "linear.predictor") on an ordered probit estimation with clm(), I get a list with the elements eta1 and eta2:
- with one column each, if newdata contains the dependent variable;
- with as many columns per element as there are levels in the dependent variable, if newdata does not contain the dependent variable.
Unfortunately I don't have a clue what this means, and I can't find any information about it in the documentation or in the author's papers. Would one of you be so kind as to enlighten me? That would be great.
Cheers,
AK
UPDATE (after comment):
The basic clm model is defined like this (see the clm tutorial for details): the cumulative probabilities are modelled as P(Y_i ≤ j) = F(θ_j − x_i′β), so the linear predictor is η_ij = θ_j − x_i′β, with thresholds θ_j and regression coefficients β.
Generating data:
library(ordinal)
set.seed(1)
test.data = data.frame(y=gl(4,5),
                       x=matrix(c(sample(1:4,20,T)+rnorm(20), rnorm(20)), ncol=2))
head(test.data) # two independent variables, stored as columns x.1 and x.2
test.data$y     # four levels in y
Constructing models:
fm.polr <- polr(y ~ x.1 + x.2, data=test.data) # using polr
fm.clm  <-  clm(y ~ x.1 + x.2, data=test.data) # using clm
Now we can access thetas and betas (see formula above):
# Thetas
fm.polr$zeta # using polr
fm.clm$alpha # using clm
# Betas
fm.polr$coefficients # using polr
fm.clm$beta # using clm
Obtaining the linear predictors (only the x_i′β part of the formula, without the thresholds):
fm.polr$lp # using polr
apply(test.data[,2:3], 1, function(x) sum(fm.clm$beta*x)) # using clm
New data generation:
# Contains only independent variables
new.data <- data.frame(x=matrix(c(rnorm(10)+sample(1:4,10,T), rnorm(10)), ncol=2))
new.data[1,] <- c(0,0) # set to zeros intentionally, for demonstration below
new.data
There are four types of predictions available for clm models. We are interested in type="linear.predictor", which returns a list with two matrices, eta1 and eta2, containing the linear predictors for each observation in new.data:
lp.clm <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm
Note 1: eta1 and eta2 contain essentially the same values; eta2 is simply eta1 shifted by one position in the j index. They thus leave the left and the right end of the linear-predictor scale open, respectively:
all.equal(lp.clm$eta1[,1:3], lp.clm$eta2[,2:4], check.attributes=FALSE)
# [1] TRUE
Note 2: The prediction for the first row of new.data equals the thetas (since we set that row to zeros):
all.equal(lp.clm$eta1[1,1:3], fm.clm$alpha, check.attributes=FALSE)
# [1] TRUE
Note 3: We can construct such predictions manually. For instance, the prediction for the second row of new.data:
second.line <- fm.clm$alpha - sum(fm.clm$beta*new.data[2,])
all.equal(lp.clm$eta1[2,1:3], second.line, check.attributes=FALSE)
# [1] TRUE
Note 4: If new.data contains the response variable, then predict returns only the linear predictor for the specified level of y. Again we can check this manually:
new.data$y <- gl(4,3,length=10)
lp.clm.y <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm.y
lp.manual <- sapply(1:10, function(i) lp.clm$eta1[i,new.data$y[i]])
all.equal(lp.clm.y$eta1, lp.manual)
# [1] TRUE
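One last usage note, as a sketch assuming the default logit link of the models above (for a probit link, pnorm would take the place of plogis): the fitted class probabilities follow directly from the two linear-predictor matrices, since P(Y = j) = F(eta1_j) − F(eta2_j), where F is the inverse link function.
# class probabilities from the two linear-predictor matrices
prob.manual <- plogis(lp.clm$eta1) - plogis(lp.clm$eta2)
# drop y so that predict() returns one probability column per level
prob.clm <- predict(fm.clm, new.data[, c("x.1", "x.2")], type="prob")$fit
all.equal(prob.manual, prob.clm, check.attributes=FALSE)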
