Multicollinearity with multiple dependent variables in R

Consider the following:
library(tidyverse)
library(car)
a <- c( 2, 3, 4, 5, 6, 7, 8, 9, 100, 11)
b <- c(5, 6, 7, 80, 9, 10, 11, 12, 13, 14)
c <- c(15, 16, 175, 18, 19, 20, 21, 22, 23, 24)
x <- c(17,18,50,15,64,15,3,5,6,9)
y <- c(55,66,99,83,64,51,23,64,89,101)
z <- c(98,78,56,21,45,34,61,98,45,64)
abc <- data.frame(a, b, c)
First, I run a regression of abc on x, y and z as follows (this went according to plan):
dep_vars <- as.matrix(abc)
lm <- lm(dep_vars ~ x + y + z, data = abc)
From here, I want to get the variance inflation factor using the vif() function:
vif(lm)
But then I get an error that says:
Error in if (names(coefficients(mod)[1]) == "(Intercept)") { :
  argument is of length zero
Can anybody help me understand where I went wrong? Or is there an alternative?
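One possibility (a hedged sketch, not from the original thread): vif() fails here because lm() returns a multivariate "mlm" object when the response is a matrix, which car::vif() does not handle. Since the VIF measures collinearity among the predictors only, fitting any single response column should give the same values:
# VIF depends only on x, y and z, so one response column suffices
fit_single <- lm(a ~ x + y + z)
vif(fit_single)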

Is there a way for me to simply plot all of these vectors on a single graph?

These are the vectors that all need to be plotted on the same graph. I would like to plot them all on one set of axes. I've seen methods using matrices, but I can't fathom how I would organize this as a matrix, and I'd rather work with the vectors. Is there a method I can use to get these all on a single graph?
x_axis <- c(0, 1, 2, 3, 4, 7)
mouse_r_veh <- c(6, 7, 5, 2, 3, 7)
mouse_r_cap <- c(27, 22, 21, 25, 21, 25)
mouse_rr_veh <- c(7, 3, 4, 6, 4, 17)
mouse_rr_cap <- c(24, 27, 29, 9, 10, 21)
mouse_l_veh <- c(10, 12, 11, 16, 13, 2)
mouse_l_cap <- c(26, 23, 23, 23, 24, 22)
mouse_ll_veh <- c(0, 2, 1, 3, 0, 0)
If you don't want to use matplot or ggplot, you can make a single plot call and then layer the remaining series on with lines():
# draw the first series against x_axis, then add the others
plot(x_axis, mouse_r_veh, type = "l", col = "green", ylim = c(0, 30))
lines(x_axis, mouse_r_cap, col = "red")
# ... et cetera
If you don't mind using matplot with a matrix, you could do:
# keep x_axis out of the matrix so it serves as the x coordinate
# instead of being drawn as a series itself
mx <- cbind(mouse_r_veh, mouse_r_cap,
            mouse_rr_veh, mouse_rr_cap, mouse_l_veh,
            mouse_l_cap, mouse_ll_veh)
matplot(x_axis, mx, type = "l")
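matplot draws no legend by default. Assuming the mx matrix above, something like this labels the series; it lines up because legend() recycles col and lty over its entries in the same cyclic way that matplot recycles its defaults (col = 1:6, lty = 1:5) over the columns:
legend("topright", legend = colnames(mx), col = 1:6, lty = 1:5, cex = 0.8)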
You could put the data in a data.frame and use pivot_longer to create a new variable with the name of each series:
library(tidyr)
library(dplyr)   # provides the %>% pipe used below
library(ggplot2)
df <- data.frame(x_axis,
                 mouse_r_veh,
                 mouse_r_cap,
                 mouse_rr_veh,
                 mouse_rr_cap,
                 mouse_l_veh,
                 mouse_l_cap,
                 mouse_ll_veh)
data <- df %>% pivot_longer(cols = contains('mouse'))
ggplot(data) + geom_line(aes(x = x_axis, y = value, color = name))

Separate data into two groups

I have this data:
x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22)
y1 = c(1, 6, 2, 5, 4, 7, 9, 6, 8, 4, 5, 6, 5, 5, 6, 7, 5, 8, 9,
5, 4, 7)
plot(x1, y1)
fit <- lm(y1 ~ x1)
fit
abline(fit, col = "black")
prediction <- predict(fit)
I want to separate the data in two groups using the next condition:
if (y1 < prediction) {print("Negative number")}
else if (y1 > prediction) {print("Positive number")}
But I get:
Warning messages:
1: In if (y1 < prediction) { :
  the condition has length > 1 and only the first element will be used
2: In if (y1 > prediction) { :
  the condition has length > 1 and only the first element will be used
Can someone tell me how to fix it?
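A hedged sketch of the standard fix (not part of the original post): if () expects a single TRUE/FALSE value, which is what the warning is complaining about, whereas ifelse() is vectorised and classifies every element at once:
# label each observation by its position relative to the fitted line
group <- ifelse(y1 < prediction, "Negative number", "Positive number")
# split() then separates the data into the two groups
split(data.frame(x1, y1), group)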

How to dynamically indicate groups in an R plot

This is my data
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22)
y = c(1, 6, 2, 5, 4, 7, 9, 6, 8, 4, 5, 6, 5, 5, 6, 7, 5, 8, 9,
5, 4, 7)
plot(x, y)
fit <- lm(y ~ x)
fit
abline(fit, col = "black", lwd = 1)
I would like the plot to split the data into two groups: observations above the regression line and those under it. How can I do this?
You can use predict to get the fitted value at each x, then use a logical comparison between the observed and fitted values to test whether each point lies above or below the line. Then set the plotting colors based on that comparison.
prediction <- predict(fit)
colors <- ifelse(y > prediction, 1, 2)
plot(x, y, col = colors)
abline(fit, col = "black", lwd = 1)
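To make the grouping self-explanatory, a legend can be added; a minimal sketch (the labels are mine, not from the original answer), using the same color codes as the ifelse() call above:
legend("topleft", legend = c("above line", "below line"), col = c(1, 2), pch = 1)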

Why do mean() and mean(aggregate()) return different results?

I want to calculate a mean. Here is the code with sample data:
# sample data
Nr <- c(1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
dph <- c(3.125000, 6.694737, 4.310680, 11.693735, 103.882353, 11.000000, 7.333333, 20.352941, 5.230769, NA, 4.615385, 47.555556, 2.941176, 18.956522, 44.320000, 28.500000, NA, 10.470588, 19.000000, 25.818182, 43.216783, 51.555556, 8.375000, 6.917647, 9.375000, 5.647059, 4.533333, 27.428571, 14.428571, NA, 1.600000, 5.764706, 4.705882, 55.272727, 2.117647, 30.888889, 41.222222, 23.444444, 2.428571, 6.200000, 17.076923, 21.280000, 40.829268, 14.500000, 6.250000, NA, 15.040000, 5.687204, 2.400000, NA, 26.375000, 18.064516, 4.000000, 6.139535, 8.470588, 128.666667, 2.235294, 34.181818, 116.000000, 6.000000, 5.777778, 10.666667, 15.428571, 54.823529, 81.315789, 42.333333)
dat <- data.frame(Nr = Nr, dph = dph)
# calculate mean directly
mean(dat$dph, na.rm = TRUE)
[1] 23.02403
# aggregate first, then calculate mean
mean(aggregate(dph ~ Nr, dat, mean, na.rm = T)$dph)
[1] 22.11743
# 23.02403 != 22.11743
Why do I get two different results?
Background:
I need to perform a Wilcoxon test comparing a pre baseline with a post baseline. Pre is 3 measurements, post is 16. Because a paired Wilcoxon test needs two vectors of equal length, I calculate means for pre and post for each patient with aggregate, creating two vectors of equal length. The data above are the pre measurements.
Edit:
Patient no. 4 was removed from the data. But using Nr <- rep(1:22, 3) returns the same results.
I think this is because of the denominators. In the mean(dat$dph, na.rm = TRUE) version, each removed NA reduces the number of observations by one, so the overall sum is divided by the 61 non-NA values instead of all 66. If you aggregate first, an NA only affects its own group: in your example the NA in row 10 (ID 11) is dropped, so that group's mean is computed from 2 values instead of 3, but the outer mean still divides by the full number of unique IDs. The observations in NA-affected groups therefore carry more weight, which is where the difference comes from.
You can verify this by changing the NA entries to 0 and calculating the mean again with both versions; they will then return the same result.
But note that this only works here because every ID has the same number of observations (3 in this case). With unequal group sizes you would again get different results, because a mean of group means weights every group equally regardless of its size.
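A quick sketch of that check, reusing dat from above:
# replace the NAs with 0 so every group mean uses the same denominator
dat0 <- dat
dat0$dph[is.na(dat0$dph)] <- 0
mean(dat0$dph)                             # overall mean
mean(aggregate(dph ~ Nr, dat0, mean)$dph)  # mean of per-ID means, now identical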

How do I get confidence intervals without inverting a singular Hessian matrix in R?

I'm a student working on an epidemiology model in R, using maximum likelihood methods. I created my negative log likelihood function. It's sort of gross looking, but here it is:
NLLdiff = function(v1, CV1, v2, CV2, st1 = (czI01 - czV01), st2 = (czI02 - czV02),
                   st01 = czI01, st02 = czI02, tt1 = czT01, tt2 = czT02) {
  prob1 = (1 + v1 * CV1 * tt1)^(-1/CV1)
  prob2 = (1 + v2 * CV2 * tt2)^(-1/CV2)
  -(sum(dbinom(st1, st01, prob1, log = TRUE)) + sum(dbinom(st2, st02, prob2, log = TRUE)))
}
The reason the first line looks so awful is because most of the data it takes is input there. czI01, for example, is already declared. I did this simply so that my later calls to the function don't all have to have awful vectors in them.
I then optimized for CV1, CV2, v1 and v2 using mle2 (from the bbmle package). That call is also a bit gross looking:
ml.cz.diff = mle2 (NLLdiff, start=list(v1 = vguess, CV1 = cguess, v2 = vguess, CV2 = cguess), method="L-BFGS-B", lower = 0.0001)
Now, everything works fine up until here. ml.cz.diff gives me values that I can turn into a plot that reasonably fits my data. I also have several different models, and can get AICc values to compare them. However, when I try to get confidence intervals around v1, CV1, v2 and CV2, I run into problems. Basically, I get a negative bound on CV1 (which is impossible, since it actually represents a squared quantity in the biological model), along with some warnings.
Is there a better way to get confidence intervals? Or, really, a way to get confidence intervals that make sense here?
What I see happening is that, by coincidence, my Hessian matrix is singular for some values in the optimization space. But since I'm optimizing over 4 variables and don't have overly extensive programming knowledge, I can't come up with a good optimization method that doesn't rely on the Hessian. I have googled the problem; the answers suggested my model is bad, but I'm reconstructing earlier work, and my results suggest the model really isn't awful (the plots I make using ml.cz.diff look like the plots in the original work). I have also read the relevant parts of the manual as well as Bolker's book Ecological Models and Data in R. I have also tried different optimization methods, which resulted in longer run times but the same errors. The "SANN" method didn't finish running within an hour, so I didn't wait around to see the result.
In a nutshell: my confidence intervals are bad. Is there a relatively straightforward way to fix them in R?
My vectors are:
czT01 = c(5, 5, 5, 5, 5, 5, 5, 25, 25, 25, 25, 25, 25, 25, 50, 50, 50, 50, 50, 50, 50)
czT02 = c(5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 25, 25, 25, 25, 25, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75)
czI01 = c(25, 24, 22, 22, 26, 23, 25, 25, 25, 23, 25, 18, 21, 24, 22, 23, 25, 23, 25, 25, 25)
czI02 = c(13, 16, 5, 18, 16, 13, 17, 22, 13, 15, 15, 22, 12, 12, 13, 13, 11, 19, 21, 13, 21, 18, 16, 15, 11)
czV01 = c(1, 4, 5, 5, 2, 3, 4, 11, 8, 1, 11, 12, 10, 16, 5, 15, 18, 12, 23, 13, 22)
czV02 = c(0, 3, 1, 5, 1, 6, 3, 4, 7, 12, 2, 8, 8, 5, 3, 6, 4, 6, 11, 5, 11, 1, 13, 9, 7)
and I get my guesses by:
v = -log((c(czI01, czI02) - c(czV01, czV02))/c(czI01, czI02))/c(czT01, czT02)
vguess = mean(v)
cguess = var(v)/vguess^2
It's also possible that I'm doing something else completely wrong, but my results seem reasonable so I haven't caught it.
You could change the parameterization so that the constraints are always satisfied. Rewrite the likelihood as a function of log(CV1) and log(CV2); that way you can be sure that CV1 and CV2 remain strictly positive.
NLLdiff_2 = function(v1, lnCV1, v2, lnCV2, st1 = (czI01 - czV01), st2 = (czI02 - czV02),
                     st01 = czI01, st02 = czI02, tt1 = czT01, tt2 = czT02) {
  prob1 = (1 + v1 * exp(lnCV1) * tt1)^(-1/exp(lnCV1))
  prob2 = (1 + v2 * exp(lnCV2) * tt2)^(-1/exp(lnCV2))
  -(sum(dbinom(st1, st01, prob1, log = TRUE)) + sum(dbinom(st2, st02, prob2, log = TRUE)))
}
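A hedged sketch of how the re-fit might look (the object name ml.cz.diff_2 and the log-scale starting values are mine, not from the original answer). Because exp() is monotone, confidence limits obtained for lnCV1 and lnCV2 can simply be exponentiated, giving intervals for CV1 and CV2 that are strictly positive by construction:
ml.cz.diff_2 = mle2(NLLdiff_2,
                    start = list(v1 = vguess, lnCV1 = log(cguess),
                                 v2 = vguess, lnCV2 = log(cguess)),
                    method = "L-BFGS-B",
                    lower = c(v1 = 0.0001, lnCV1 = -Inf,
                              v2 = 0.0001, lnCV2 = -Inf))
# back-transform the limits for the log-scale parameters
exp(confint(ml.cz.diff_2)[c("lnCV1", "lnCV2"), ])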
