I've spent days searching for optimal models that satisfy all of the standard OLS assumptions (normally distributed residuals, homoscedasticity, no multicollinearity) in R, but with 12 variables it's impossible to find the optimal variable combination by hand. So I tried to write a script that automates this process.
Here is the sample code for the calculations:
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- as.data.frame(cbind(x1,x2,x3,x4,x5))
library(lmtest)
library(car)
model <- lm(x1~x2+x3+x4+x5, data = df)
# check for normal distribution (Shapiro-Wilk-Test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)
# check for heteroskedasticity (Breusch-Pagan-Test)
bptest(model)
# check for multicollinearity
vif(model)
#-------------------------------------------------------------------------------
# models without outliers
# identify outliers (calculate the Cook's distance; if x > 4/(n-k-1) --> outlier)
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))
model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)
# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)
# check for heteroskedasticity
bptest(model_no_out)
# check for multicollinearity
vif(model_no_out)
What I have in mind is to loop through all of the variable combinations and collect the P-VALUES from shapiro.test() and bptest(), or the VIF values, for every model created, so I can compare the significance values or the multicollinearity, respectively. (In my dataset multicollinearity shouldn't be a problem anyway, and since the VIF check produces one value per variable, which would be more challenging to implement, the p-values from shapiro.test() and bptest() would suffice...)
I've tried to write several scripts to automate the process, but without success (unfortunately I'm not a programmer).
I know there are already some threads dealing with this problem:
How to run lm models using all possible combinations of several variables and a factor
Finding the best combination of variables for high R-squared values
but I haven't found a script that also extracts JUST the P-VALUES.
The tests for the models without outliers are especially important, because after removing the outliers the OLS assumptions are fulfilled in many cases.
I would really appreciate any suggestions or help with this.
You are scratching the surface of what is now referred to as statistical learning. The introductory text is "An Introduction to Statistical Learning with Applications in R" and the graduate-level text is "The Elements of Statistical Learning".
To do what you need, use the regsubsets() function from the "leaps" package. However, if you read at least chapter 6 of the introductory book, you will learn about cross-validation and bootstrapping, which are the modern way of doing model selection.
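As a rough illustration of what regsubsets() reports, here is a minimal sketch on the built-in mtcars data (not your dataset):
library(leaps)
# exhaustive search over all subsets of the candidate regressors
best_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = ncol(mtcars) - 1)
summary(best_subsets)$which   # which regressors enter the best model of each size
summary(best_subsets)$adjr2   # adjusted R-squared of each of those models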
The following automates the model fitting and the tests you ran afterwards.
There is one function that fits all possible models. Then a series of calls to the *apply functions collects the values you want.
library(lmtest)
library(car)
fitAllModels <- function(data, resp, regr){
  # fit one lm per column of a combination matrix M (one regressor combination per column)
  f <- function(M){
    apply(M, 2, function(x){
      fmla <- paste(resp, paste(x, collapse = "+"), sep = "~")
      fmla <- as.formula(fmla)
      lm(fmla, data = data)
    })
  }
  # all combinations of the candidate regressors, of every size
  regr <- names(data)[names(data) %in% regr]
  regr_list <- lapply(seq_along(regr), function(n) combn(regr, n))
  models_list <- lapply(regr_list, f)
  unlist(models_list, recursive = FALSE)
}
Now the data.
# Make up a data.frame to test the function above.
# Don't forget to set the RNG seed to make the
# results reproducible
set.seed(7646)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- data.frame(x1, x2, x3, x4, x5)
First fit all models with "x1" as response and the other variables as possible regressors. The function can be called with one response and any number of possible regressors you want.
fit_list <- fitAllModels(df, "x1", names(df)[-1])
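With "x1" as the response and the remaining four columns as candidates this fits every non-empty subset, so there should be 2^4 - 1 = 15 models:
length(fit_list)   # 15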
And now the sequence of tests.
# Normality test, standardized residuals
rs_sd_list <- lapply(fit_list, rstandard)
sw_list <- lapply(rs_sd_list, shapiro.test)
sw_pvalues <- sapply(sw_list, '[[', 'p.value')
# check for heteroskedasticity (Breusch-Pagan-Test)
bp_list <- lapply(fit_list, bptest)
bp_pvalues <- sapply(bp_list, '[[', 'p.value')
# check for multicollinearity,
# only models with 2 or more regressors
vif_values <- lapply(fit_list, function(fit){
  regr <- attr(terms(fit), "term.labels")
  if(length(regr) < 2) NA else vif(fit)
})
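If it helps with the comparison, the two p-value vectors can be collected into a single data.frame, one row per fitted model (a small sketch reusing the objects created above):
results <- data.frame(
  formula   = sapply(fit_list, function(fit) deparse(formula(fit))),
  shapiro_p = sw_pvalues,
  bp_p      = bp_pvalues
)
head(results)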
A note on the Cook's distance. In your code you are subsetting the original data.frame, producing a new one without the outliers. This duplicates the data, so I have opted for a list of the indices of the df's rows instead. If you prefer the duplicated data.frames, uncomment the line in the anonymous function below and comment out the last one.
# models without outliers
# identify outliers (calculating the
# Cooks distance, if x > 4/(n - k - 1) --> outlier
df_no_out_list <- lapply(fit_list, function(fit){
  cooks <- cooks.distance(fit)
  regr <- attr(terms(fit), "term.labels")
  k <- length(regr)
  inx <- cooks < 4/(nrow(df) - k - 1)
  #df[inx, ]
  which(inx)
})
# This tells how many rows each df without outliers would have
sapply(df_no_out_list, NROW)
# A data.frame without outliers. This one is the one
# for model number 8.
# The two code lines could become a one-liner.
i <- df_no_out_list[[8]]
df[i, ]
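To cover the part of your question about re-running the tests after removing the outliers, here is a sketch that refits every model on its own outlier-free rows (reusing fit_list and df_no_out_list from above) and collects the p-values again:
refit_list <- Map(function(fit, keep) lm(formula(fit), data = df[keep, ]),
                  fit_list, df_no_out_list)
sw_pvalues_no_out <- sapply(refit_list, function(fit) shapiro.test(rstandard(fit))$p.value)
bp_pvalues_no_out <- sapply(refit_list, function(fit) bptest(fit)$p.value)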
I am trying to extract the residuals from a regression run pixel by pixel on a raster stack of NDVI/precipitation. My script works when I run it with a small part of my data, but when I try to run it on the whole of my study area I get: "Error in setValues(out, x) : values must be numeric, integer, logical or factor".
The lm works, since I can extract both slope and intercept; I just can't extract the residuals.
Any idea how this could be fixed?
Here is my script:
setwd("F:/working folder/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
residualfun = function(x) { if (is.na(x[1])){ NA } else { m <- lm(x[1:6] ~ x[7:12], na.action=na.exclude)
r <- residuals.lm(m)
return (r)}}
res <- calc(s,residualfun)
And here is my data: https://1drv.ms/u/s!AhwCgWqhyyDclJRjhh6GtentxFOKwQ
Your function only tests whether the first layer is NA before fitting the model, but there may be NAs in the other layers. You know that, because you added na.action = na.exclude to your lm fit.
The problem is that if the model drops some observations because of NAs, the residuals will only have the length of the non-NA values. This means that your resulting r vector will have a different length depending on the amount of NA values in the layers, and calc is then not able to combine results of different lengths into a stack with a defined number of layers.
To avoid that, you need to fix the length of r in your function and assign residuals only to the non-NA positions.
I propose the following function, which now works on the dataset you provided. I added (1) the possibility to compare more layers of each variable if you want to extend your exploration (set with nlayers), (2) a check that skips the fit when there are two or fewer non-NA pairs (two points would give a perfect fit), and (3) a try() so that, if for any reason the model cannot be fitted, the output is the value -1e32, which is easy to find for further testing.
library(raster)
setwd("/mnt/Data/Stackoverflow/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
# Number of layers of each
nlayers <- 6
residualfun <- function(x) {
  r <- rep(NA, nlayers)
  obs <- x[1:nlayers]
  cov <- x[nlayers + 1:nlayers]
  # Remove NA values before fitting the model
  x.nona <- which(!is.na(obs) & !is.na(cov))
  # If there are more than 2 non-NA pairs, proceed to lm
  if (length(x.nona) > 2) {
    m <- NA
    try(m <- lm(obs[x.nona] ~ cov[x.nona]))
    # If the model worked, store its residuals at the non-NA positions
    if (is(m)[1] == "lm") {
      r[x.nona] <- residuals.lm(m)
    } else {
      # sentinel value to find cells where the model did not work
      r[x.nona] <- -1e32
    }
  }
  return(r)
}
res <- calc(s, residualfun)
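If you then want to locate the cells where the fit failed, a quick sketch using the -1e32 sentinel defined above:
failed <- calc(res, function(x) any(x == -1e32, na.rm = TRUE))
plot(failed)   # TRUE where the model could not be fitted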
I am trying to calculate DFFITS by hand. The value obtained should be equal to the first value returned by the dffits() function; however, there must be something wrong with my own calculation.
attach(cars)
x1 <- lm(speed ~ dist, data = cars) # all observations
x2 <- lm(speed ~ dist, data = cars[-1,]) # without first obs
x <- model.matrix(speed ~ dist) # x matrix
h <- diag(x%*%solve(crossprod(x))%*%t(x)) # hat values
num_dffits <- x1$fitted.values[1] - x2$fitted.values[1] #Numerator
denom_dffits <- sqrt(anova(x2)$`Mean Sq`[2]*h[1]) #Denominator
df_fits <- num_dffits/denom_dffits #DFFITS
dffits(x1)[1] # DFFITS function
Your numerator is wrong. As you have removed the first datum from the second model, the corresponding predicted value is not in fitted(x2). We need to use predict(x2, cars[1, ]) in place of fitted(x2)[1].
Hat values can be efficiently computed by
h <- rowSums(qr.Q(x1$qr) ^ 2)
or using its R wrapper function
h <- hat(x1$qr, FALSE)
R also has a generic function for getting hat values:
h <- lm.influence(x1, FALSE)$hat
or its wrapper function
h <- hatvalues(x1)
You also don't have to call anova to get MSE:
c(crossprod(x2$residuals)) / x2$df.residual
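Putting the corrections together, a sketch of the full by-hand calculation could look like this:
num  <- fitted(x1)[1] - predict(x2, cars[1, ])           # fitted value minus leave-one-out prediction
mse1 <- c(crossprod(residuals(x2))) / x2$df.residual     # MSE of the model without the first observation
h    <- hatvalues(x1)
num / sqrt(mse1 * h[1])   # by-hand DFFITS for observation 1
dffits(x1)[1]             # should agree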
This is a follow-up question to a previous post (How to modify slots lme4 >1.0). I have a similar pairwise data structure and want the random effect to consider both "pops" in the pair. I have a functional random-intercept model using the code previously suggested:
dat <- data.frame(pop1 = c(2,1,1,1,1,3,2,2,2,3,5,3,5,4,6),
pop2 = c(1,3,4,5,6,2,4,5,6,4,3,6,4,6,5),
X = c(20,25,18,40,36,70,68,72,78,76,97,100,115,110,108),
Y = c(18,16,15,40,22,18,18,18,18,45,10,47,67,5,6))
library(lme4)
# build random effects matrix
Zl <- lapply(c("pop1","pop2"), function(nm) Matrix:::fac2sparse(dat[[nm]], "d", drop = FALSE))
ZZ <- Reduce("+", Zl[-1], Zl[[1]])
# specify model structure
mod <- lFormula(Y ~ X + (1|pop1), data = dat, REML = TRUE)
# replace slot
mod$reTrms$Zt <- ZZ
# fit model
dfun <- do.call(mkLmerDevfun, mod)
opt <- optimizeLmer(dfun)
mkMerMod(environment(dfun), opt, mod$reTrms, fr = mod$fr)
However, when attempting to add a random slope variable:
mod2<-lFormula(Y~X+(1+X|pop1),data=dat,REML=TRUE)
mod2$reTrms$Zt <- ZZ
dfun<-do.call(mkLmerDevfun,mod2)
Results in the same error identified in the previous post (where the issue was calling the wrong data frame): "Error in Lambdat %*% Ut :
Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82"
To view the lm fit for each pop:
plot(1, type = "n", xlim = c(0,150), ylim = c(0,75), ylab = "Y", xlab = "X")
for(i in 1:length(unique(c(dat$pop1, dat$pop2)))){
  subdat <- dat[which(dat$pop1 == i | dat$pop2 == i), ]
  out <- summary(lm(subdat$Y ~ subdat$X))
  x <- subdat$X
  y <- x*out$coefficients[2,1] + out$coefficients[1,1]
  lines(x, y, col = i)
}
legend(125, 60, 1:6, col = 1:6, lty = 1, title = "Pop")
dat <- data.frame(pop1 = c(2,1,1,1,1,3,2,2,2,3,5,3,5,4,6),
pop2 = c(1,3,4,5,6,2,4,5,6,4,3,6,4,6,5),
X = c(20,25,18,40,36,70,68,72,78,76,97,100,115,110,108),
Y = c(18,16,15,32,22,29,32,38,44,45,51,47,67,59,61))
It helps to try to understand what the original code is actually doing:
## build random effects matrix
## 1. sparse dummy-variable matrices for each population ID
Zl <- lapply(dat[c("pop1","pop2")],
Matrix::fac2sparse,to="d",drop.unused.levels=FALSE)
## 2. take the sum of all components of the list of dummy-variable matrices ...
ZZ <- Reduce("+",Zl[-1],Zl[[1]])
The Reduce form is convenient in general if we have a long list, but it helps to see that in this case it's just Zl[[1]]+Zl[[2]] ...
all.equal(Zl[[1]]+Zl[[2]],ZZ) ## TRUE
What does this RE structure look like?
library(gridExtra)
grid.arrange(
image(t(Zl[[1]]),main="pop 1",sub="",xlab="pop",ylab="obs"),
image(t(Zl[[2]]),main="pop 2",sub="",xlab="pop",ylab="obs"),
image(t(ZZ),main="combined",sub="",xlab="RE",ylab="obs"),
nrow=1)
For the random slope, I think we want to take each filled element of ZZ and replace it with the X value observed for the corresponding observation/row of dat. The indexing here is a bit obscure: in this case it boils down to there being 2 filled values in each row of Z / column of Zt (the @p slot of the sparse matrix gives a zero-indexed pointer to the first non-zero element in each column ...)
vals <- dat$X[rep(1:(length(ZZ@p)-1), diff(ZZ@p))]
ZZX <- ZZ
ZZX@x <- vals
image(t(ZZX))
library(lme4)
mod <- lFormula(Y~X+(X|pop1),data=dat,REML=TRUE)
## replace slot
mod$reTrms$Zt <- rbind(ZZ,ZZX)
## fit model
dfun <- do.call(mkLmerDevfun,mod)
opt <- optimizeLmer(dfun)
m1 <- mkMerMod(environment(dfun),opt,mod$reTrms,fr=mod$fr)
This seems to work, but you should certainly check it with your own knowledge of what's supposed to be going on here ...
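One quick sanity check (not a substitute for checking against your own understanding of the problem) is to look at the estimated random-effect structure of the resulting fit:
summary(m1)
VarCorr(m1)   # variance components of the combined random intercept and slope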
I am trying to calculate a regression variable based on a range of variables in my data set. I would like the regression variable (e.g. Threshold1) to be calculated from a different variable in each iteration of running the regression.
The aim is to collect the SSR values for each threshold range, and thus identify the ideal threshold based on the data.
Data (df) variables: Yield, Prec, Price, 0C, 1C, 2C, 3C, 4C, 5C, 6C, 7C, 8C, 9C, 10C
Each loop calculates the thresholds by selecting a different "b" each time.
a <- df$0C
b <- df$1C
Threshold1 <- (a-b)
Threshold2 <- (b)
Where "b" would be changing in each loop, ranging from 1C to 9C.
Each individual threshold set (1 and 2) should be used to run a regression, and save the SSR for comparison with the subsequent regression utilizing thresholds based on a new "b" value (ranging from 1C TO 9C)
Regression:
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
In each loop of the regression I vary the components used to calculate the thresholds. My current approach is centered around the following code:
df <- read.csv("Data.csv",header=TRUE)
names(df)
0C-9Cvarlist <- names(df)[9:19]
ssr.vec <- matrix(,21,1)
for(i in 1:length(varlist)){
a <- df$0C
b <- df$[i]
Threshold1 <- (a-b)
Threshold2 <- (b)
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
r2 <- summary(reg)$r.squared
ssr.vec[i,] <- c(varlist,r2)
}
colnames(ssr.vec) <- c("varlist","r2")
I am failing to achieve the desired result with the above approach.
Thank you.
I can spot quite a few mistakes...
You need to add the variables of interest (Threshold1 and Threshold2) to the data frame used in the regression. Also, you need to select varlist[i], not varlist, when filling ssr.vec. ssr.vec needs 2 columns (the variable name and the value), so create it as a matrix with 2 columns. You also cannot use something like df$[i] to extract a column. Why is the matrix of length 21? And change the column names to C0, ..., C9 rather than 0C, ..., 9C, since names starting with a digit are not valid R names.
For future reference, try to solve the simple errors before asking a question, and include error messages in your post!
This should do the job:
df <- read.csv("Data.csv", header = TRUE)
# rename the threshold columns to valid R names (assuming, as in your code,
# that columns 9 to 19 hold the 0C to 10C variables)
names(df)[9:19] <- paste0("C", 0:10)
varlist <- paste0("C", 1:9)          # the candidate "b" variables, 1C to 9C
ssr.vec <- matrix(NA, length(varlist), 2)
for(i in seq_along(varlist)){
  a <- df$C0
  b <- df[[varlist[i]]]
  df$Threshold1 <- a - b
  df$Threshold2 <- b
  reg <- lm(log(Yield) ~ Threshold1 + Threshold2 + log(Price) + prec + I(prec^2), data = df)
  r2 <- summary(reg)$r.squared
  ssr.vec[i, ] <- c(varlist[i], r2)
}
colnames(ssr.vec) <- c("varlist", "r2")
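Since the matrix mixes the variable name (a character) with the R-squared, everything is stored as text; to pick out the threshold variable with the highest R-squared you could follow up with:
ssr.vec[which.max(as.numeric(ssr.vec[, "r2"])), ]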