R clustering: silhouette with observation labels

I do hierarchical clustering with the cluster package in R. Using the silhouette function, I can get the silhouette plot of my cluster output for any given height (h) cut-off in the dendrogram.
# run hierarchical clustering
if(!require("cluster")) { install.packages("cluster"); require("cluster") }
tmp <- matrix(c( 0, 20, 20, 20, 40, 60, 60, 60, 100, 120, 120, 120,
20, 0, 30, 50, 60, 80, 40, 80, 120, 100, 140, 120,
20, 30, 0, 40, 60, 80, 80, 80, 120, 140, 140, 80,
20, 50, 40, 0, 60, 80, 80, 80, 120, 140, 140, 140,
40, 60, 60, 60, 0, 20, 20, 20, 60, 80, 80, 80,
60, 80, 80, 80, 20, 0, 20, 20, 40, 60, 60, 60,
60, 40, 80, 80, 20, 20, 0, 20, 60, 80, 80, 80,
60, 80, 80, 80, 20, 20, 20, 0, 60, 80, 80, 80,
100, 120, 120, 120, 60, 40, 60, 60, 0, 20, 20, 20,
120, 100, 140, 140, 80, 60, 80, 80, 20, 0, 20, 20,
120, 140, 140, 140, 80, 60, 80, 80, 20, 20, 0, 20,
120, 120, 80, 140, 80, 60, 80, 80, 20, 20, 20, 0),
nr=12, dimnames=list(LETTERS[1:12], LETTERS[1:12]))
cl <- hclust(as.dist(tmp,diag = TRUE, upper = TRUE), method= 'single')
sil_cl <- silhouette(cutree(cl, h=25) ,as.dist(tmp), title=title(main = 'Good'))
plot(sil_cl)
This gives a silhouette plot (figure not shown) whose bars are labelled with numeric indices, which is what frustrates me. How can I use the observation labels rownames(tmp) in the silhouette plot instead of the numeric indices (1 to 12), which mean nothing to me?

I'm not sure why but the silhouette call seems to drop the row names. You can add them back with
cl <- hclust(as.dist(tmp,diag = TRUE, upper = TRUE), method= 'single')
sil_cl <- silhouette(cutree(cl, h=25) ,as.dist(tmp), title=title(main = 'Good'))
rownames(sil_cl) <- rownames(tmp)
plot(sil_cl)

I found that adding the argument cex.names = par("cex.axis") to the plot() function gives you the desired labels:
cl <- hclust(as.dist(tmp,diag = TRUE, upper = TRUE), method= 'single')
sil_cl <- silhouette(cutree(cl, h=25) ,as.dist(tmp), title=title(main = 'Good'))
plot(sil_cl, cex.names = par("cex.axis"))

Related

Behaviour of gtsummary with two variables

I was creating a summary for an article and came across the following behaviour, which I cannot understand. Two columns of the data frame report the min and max pressure, as follows:
a <- c(80, 80, 80, 80, 80, 80, 80, 80, 70, 70, 75, 75, 70, 65, 60, 80, 75, 70, 80, 80, 80, 80, 80, 70, 80, 70, 75, 80, 70, 65, 70, 75, 70, 75, 80, 65, 85, 75, 70, 70, 70, 75, 80, 80, 70, 70, 80, 70, 80, 60, 80, 80, 70, 70, 85, 70, 70, 80, 70, 70, 75, 75, 70, 70, 70,
70, 70, 80, 80, 70)
b <- c(120, 120, 120, 120, 120, 120, 120, 120, 120, 125, 120, 135, 130, 120, 115, 110, 125, 120, 130, 125, 110, 120, 130, 110, 125, 130, 105, 100, 110, 110, 130, 120, 110, 120, 135, 125, 145, 135, 130, 110, 115, 145, 120, 125, 100, 120, 120, 130,
115, 120, 110, 160, 120, 130, 155, 125, 135, 155, 110, 130, 145, 155, 125, 130, 140, 110, 100, 150, 130, 130)
library(gtsummary)  # provides tbl_summary() and re-exports %>%
pressure <- data.frame(a,b)
str(pressure)
pressure %>% tbl_summary()
and the result is the following (output not shown here).
So for b I got the expected behaviour, while a is formatted as categorical, I guess. Nothing I tried (forcing it to double, adding decimals, etc.) made a be formatted like b. If I shorten the vectors, the behaviour is different and both columns are formatted alike.
I've also forced the output with
pressure %>% tbl_summary(statistic = list(all_continuous() ~ "{mean} ({sd})"))
but I keep getting the same results.
Any help appreciated.
It appears to be the default behavior of tbl_summary() to interpret any numeric variables with fewer than 10 unique values as categorical. You can observe that when running the following:
library(tidyverse)
library(gtsummary)
d <- map_dfc(8:12, \(x) rep(1:x, length.out = 100)) |>
set_names(letters[1:5])
d |>
tbl_summary()
This behavior can be overridden by specifying the type of the problematic variables:
d |>
tbl_summary(type = list(c(a,b,c) ~ "continuous"))
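Applied to the question, the rule described above would explain why a and b are treated differently: a takes only a handful of distinct values, while b takes more than ten. The following is a minimal sketch under stated assumptions: a_demo and b_demo are short made-up stand-ins for the asker's columns, the threshold of 10 is the one quoted in the answer, and the tbl_summary() call only runs if gtsummary happens to be installed.

```r
# Hypothetical stand-ins for the two pressure columns (not the asker's data)
a_demo <- rep(c(60, 65, 70, 75, 80, 85), length.out = 60)   # 6 distinct values
b_demo <- rep(seq(100, 160, by = 5), length.out = 60)       # 13 distinct values
demo <- data.frame(a_demo, b_demo)

# Count the distinct values in each column
n_unique <- sapply(demo, function(x) length(unique(x)))
print(n_unique)

# Columns that fall below the threshold quoted in the answer
few_values <- names(n_unique)[n_unique < 10]
print(few_values)

# Force the low-cardinality column to be summarised as continuous
if (requireNamespace("gtsummary", quietly = TRUE)) {
  tbl <- gtsummary::tbl_summary(demo, type = list(a_demo ~ "continuous"))
}
```

Checking the number of distinct values first makes it easy to see which columns need an explicit type before the table is built.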

Factorial - Four Way ANOVA - How to find a statistically effective combination

I have to analyze an experimental data set to find the most effective combination of conditions for a molecular biology reaction.
The experiment has four factors: Temperature, RPM, Time, and Catalytic activity, and I am measuring the efficiency of the reaction (EE). How can I find the combination of the four factors that gives the highest efficiency (EE)?
There are no repeated measurements; all data are independent experimental data.
As I understand it, EE is parametric data and the factors are categorical data (fixed combinations).
Do I have to go for a four-way ANOVA?
If so, is this model correct for the analysis?
library(lsmeans)
lm(EE ~ Temperature + RPM + Time + Catalytic +
Temperature:RPM +
Temperature:Time +
Temperature:Catalytic +
RPM:Time +
RPM:Catalytic +
Time:Catalytic +
Temperature:RPM:Time +
Temperature:RPM:Catalytic +
Temperature:Time:Catalytic +
RPM:Time:Catalytic +
Temperature:RPM:Time:Catalytic, data = data)
And, then how can I get the significant values for each pairwise comparison?
Here is the sample data set for an example.
> dput(df)
structure(list(TEMPERATURE = c(40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 45, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
45, 45, 45), RPM = c(150, 150, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 200,
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,
150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 200, 200), TIME = c(24, 24, 24,
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96,
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72,
96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72,
72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48,
48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24,
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96,
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72,
96, 96, 96, 96, 96), CAT = c(4, 6, 8, 10, 12, 4, 6, 8, 10, 12,
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10,
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6,
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12,
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10,
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6,
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12),
EE = c(50, 53, 54, 57, 59, 53, 56, 59, 61, 64, 57, 58, 60,
62, 63, 56, 54, 52, 55, 55, 44, 48, 50, 50, 54, 49, 52, 56,
57, 56, 52, 56, 57, 58, 66, 46, 48, 48, 52, 49, 53, 57, 59,
62, 64, 54, 58, 60, 64, 66, 55, 59, 61, 63, 65, 54, 59, 64,
65, 67, 49, 51, 53, 54, 59, 50, 54, 63, 64, 64, 52, 56, 56,
59, 57, 52, 55, 58, 60, 63, 52, 56, 58, 61, 63, 54, 55, 58,
63, 63, 56, 58, 62, 62, 65, 57, 59, 62, 63, 66, 42, 42, 51,
54, 56, 46, 50, 52, 56, 58, 48, 51, 54, 55, 57, 48, 53, 56,
57, 61)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -120L), spec = structure(list(cols = list(
TEMPERATURE = structure(list(), class = c("collector_double",
"collector")), RPM = structure(list(), class = c("collector_double",
"collector")), TIME = structure(list(), class = c("collector_double",
"collector")), CAT = structure(list(), class = c("collector_double",
"collector")), EE = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Try something like this:
library(rsm)
mod = rsm(EE ~ SO(Temperature, RPM, Time, Catalytic), data = data)
summary(mod)
This will fit a second-order surface (model equation includes all predictors, two-way interactions, and squares). The summary shows a stationary point and related statistics. If all the eigenvalues are negative, then it is a peak. Otherwise you have some kind of saddle point.
This model differs from the one in the OP in that it has no three-way or four-way interactions, but includes the squares of the predictors, which are really essential for fitting a second-order response surface.
Lots more details
I had to revise this a bit to account for the fact that R is case-sensitive!
> mod = rsm(EE ~ SO(TEMPERATURE, RPM, TIME, CAT), data = df)
There is an issue here in that RPM has only two values, so we can't estimate a pure quadratic effect. So there is one coefficient of NA and that messes up the computation of the stationary point. However, we can still plot the fitted surface (despite a few warning messages)
> par(mfrow = c(2,3))
> contour(mod, ~TEMPERATURE+RPM+TIME+CAT)
It looks like we are best off with large CAT and lower RPM (see that plot), so look again:
> par(mfrow=c(1,1))
> contour(mod, ~ TEMPERATURE + TIME, at = list(CAT = 12, RPM = 150))
So visually, we seem to get the best response at around temperature 43.5, time 65, catalyst 12, and rpm 150.
If you insist on modeling these as factors, it can be done, but you need to convert all the predictors to factors. This is a common error; you can have a designed experiment with only a few distinct values of a quantitative variable, but R does not read your mind and assume it's a factor; you have to convert it to one. In the following I have opted to fit a model with up to 2-way interactions.
> facmod = lm(EE ~ (factor(TEMPERATURE) + factor(TIME) + factor(RPM) + factor(CAT))^2, data = df)
> library(emmeans)
> emmip(facmod, TIME ~ TEMPERATURE | CAT*RPM)
The highest fitted response is at catalyst 12, RPM 150, temperature 42.5 or larger, and time 96. It is clear that 150 RPM is better (left vs. right comparisons) and that high CAT is better (comparing panels vertically). These are different models and somewhat different results. I like the rsm approach better as it is more systematic.
For a screening DOE you collected more data than what was needed.
Here is a starting point, I welcome additional comments.
I would model the linear combination of all of your factors:
model <-lm(EE ~ TEMPERATURE + RPM + TIME +CAT , data=df)
summary(model)
Call:
lm(formula = EE ~ TEMPERATURE + RPM + TIME + CAT, data = df)
Residuals:
     Min       1Q   Median       3Q      Max
-10.1850  -1.5742   0.3383   1.7767   9.7033
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.50833    6.47932   7.950 1.42e-12 ***
TEMPERATURE  0.27000    0.14245   1.895  0.06056 .
RPM         -0.10533    0.01163  -9.056 4.10e-15 ***
TIME         0.03639    0.01084   3.358  0.00107 **
CAT          1.20417    0.10281  11.713  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.185 on 115 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6591
F-statistic: 58.52 on 4 and 115 DF, p-value: < 2.2e-16
Look at the signs of the slope estimates. Although this simple model assumes a linear relationship, the physical meaning of each slope is a comparison of the average response at the low values with the average response at the high values.
For example, the slope of the TEMPERATURE term is positive. This means that when the temperature increases from the low value (40C) to the high value (45C), the efficiency increases.
TEMPERATURE, TIME and CAT have positive slopes, so I would take the largest value available for each.
RPM has a negative slope, so I would choose the lowest value available.
With these settings, the model predicts EE = 66,
while the highest result observed in the experiment was:
df[which.max(df$EE),]
# A tibble: 1 x 5
TEMPERATURE RPM TIME CAT EE
<dbl> <dbl> <dbl> <dbl> <dbl>
1 42.5 150 96 12 67
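The EE = 66 prediction can be checked by hand from the coefficient table. The sketch below simply copies the estimates from the summary output above and plugs in the settings suggested by the signs of the slopes (highest TEMPERATURE, TIME and CAT, lowest RPM). The names coefs, best and predicted_EE are illustrative; with the fitted model at hand, predict(model, newdata = ...) should give the same number up to rounding of the printed coefficients.

```r
# Coefficients copied from the summary(model) output above
coefs <- c(intercept = 51.50833, TEMPERATURE = 0.27000, RPM = -0.10533,
           TIME = 0.03639, CAT = 1.20417)

# Settings chosen from the signs of the slopes
best <- c(TEMPERATURE = 45, RPM = 150, TIME = 96, CAT = 12)

# Linear prediction: intercept plus the sum of slope * setting
predicted_EE <- coefs[["intercept"]] + sum(coefs[names(best)] * best)
print(round(predicted_EE, 1))  # 65.8, i.e. roughly 66
```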
Now you could investigate the non-linear relationships by looking at the results from this model:
Call:
lm(formula = EE ~ TEMPERATURE * RPM * TIME * CAT, data = df)
Here the slopes of the interaction terms are orders of magnitude smaller than those of the linear terms. This could be misleading if the variables are not normalized.
Good luck.

How to label a ternary plot

I am trying to create a triangular (ternary) plot whose three dimensions represent three plant strategies.
One dimension represents strategy "C" (competitive plants), the second "S" (stress-tolerant plants) and the third "R" (ruderal plants); the points on it represent the plant species.
I want to write the species name outside the triangle and connect it to the points inside the triangle with an arrow. How do I draw this ternary plot?
The following is the data structure and my code
require(Ternary)
TernaryPlot()
# Set up a single plot with narrow margins, then draw the ternary axes and plot the data
par(mfrow=c(1, 1), mar=rep(0.3, 4))
TernaryPlot(atip='C%', btip='R%', ctip='S%',
point='UP', lab.cex=0.8, grid.minor.lines=0,
grid.lty='solid', col='#FFFFFF', grid.col='GREY',
axis.col=rgb(0.1, 0.1, 0.1), ticks.col=rgb(0.1, 0.1, 0.1),
padding=0.08)
data_points <- list("Bromus dantonia" = c(47, 59, 149),
"Calamagrosis psoudo phragmatis" = c(90, 102, 63),
"Carex diluta" = c(109, 64, 82),
"Carex divisa" = c(96, 99, 59),
"Carex pseudocyperus" = c(130, 71, 54),
"Carex stenophylla" = c(97, 98, 59),
"Catabrosa aquatica" = c(100, 5, 150),
"Centaurea iberica" = c(124, 85, 46),
"Cirsium hygrophilum" = c(158, 42, 55),
"Cladium mariscus" = c(159, 96, 0),
"cod2" = c(54, 82, 119),
"Cynodon dactylon" = c(121, 54, 80),
"Eleocharis palustri" = c(124, 100, 31),
"Epilobium parviflorum" = c(67, 80, 107),
"Eromopoa persica" = c(83, 15, 157),
"Funaria cf.microstoma" = c(8, 0, 247),
"Glaux maritime" = c(4, 196, 55),
"Hordeum brevisubulatum" = c(76, 70, 109),
"Hordeum glaucum" = c(40, 79, 136),
"Inula britannica" = c(95, 108, 51),
"Juncus articulatus" = c(107, 79, 69),
"Blysmus compressus" = c(81, 127, 47),
"Juncusinflexus"= c(149, 106, 0),
"Medicago polymorpha" = c(60, 86, 109),
"Mentha spicata" = c(150, 23, 82),
"Ononis spinosa" = c(66, 112, 77),
"Phragmites australis" = c(234, 0, 21),
"Plantago amplexicaulis" = c(108, 83, 64),
"Poa trivialis" = c(90, 28, 138),
"Polygonum paronychioides" = c(20, 12, 223),
"Potentila reptans" = c(106, 41, 108),
"Potentilla anserina" = c(105, 58, 91),
"Ranunculus grandiflorus" = c(129, 25, 101),
"Schoenus nigricans" = c(143, 91, 21),
"Setaria viridis" = c(10, 7, 238),
"Sonchus oleraceus" = c(178, 0, 77),
"Taraxacum officinale" = c(117, 28, 110),
"Trifolium repens" = c(94, 4, 157),
"Triglochin martima" = c(63, 96, 95),
"Veronica anagallis-aquatica" = c(55, 37, 163)
)
AddToTernary(points, data_points, pch=21, cex=1.2,
bg=vapply(data_points,
function (x) rgb(x[1], x[2], x[3], 128,
maxColorValue=255),
character(1))
)
AddToTernary(text, data_points, names(data_points), cex=0.8, font=1)

Filling a list with for loop or Lapply

This is my list:
mylist=dput(mylist)
list(list(
c(30, 50, 35, 25, 45),
c(40, 35, 35, 50, 45),
c(40, 20, 40, 50, 25),
),
list(
c(50, 50, 25, 40, 45, 40, 35, 40, 45, 20),
c(40, 35, 40, 40, 45, 30, 20, 50, 35, 25),
c(20, 30, 50, 35, 45, 40, 25, 50, 35, 50),
),
list(
c(45, 50, 25, 25, 30, 25, 35, 35, 35, 30, 50, 50, 30, 30, 20),
c(40, 20, 35, 35, 50, 20, 25, 30, 35, 20, 40, 20, 45, 30, 20),
c(50, 20, 25, 35, 35, 30, 50, 25, 40, 35, 45, 45, 35, 45, 25),
),
list(
c(50, 50, 50, 40, 20, 25, 50, 40, 50, 50, 45, 40, 30, 50, 35, 45, 50, 30,
35, 45),
c(45, 20, 25, 20, 25, 30, 20, 30, 45, 25, 50, 30, 30, 25, 50, 45, 20, 45, 45, 50),
c(20, 40, 50, 25, 40, 45, 25, 30, 20, 20, 35, 45, 20, 40, 50, 45, 40, 40, 45, 35),
),
list(
c(35, 25, 45, 20, 25, 30, 30, 35, 30, 40, 30, 20, 20, 30, 45, 40, 35, 35, 35, 35, 25, 45, 35, 20, 50),
c(50, 35, 30, 30, 35, 45, 45, 50, 25, 25, 40, 25, 50, 45, 25, 30, 30, 25, 45, 45, 30, 20, 50, 30, 30),
c(35, 40, 50, 25, 40, 45, 30, 25, 50, 25, 35, 50, 50, 50, 25, 50, 20, 50, 40, 25, 25, 35, 20, 20, 50),
)
)
)
I am facing two problems:
First: I can't run the for loop below:
resultlist<-vector(mode = "list", 5)
for (i in 1:6) {
for(k in 1:5) {
resultlist[[k]][[i]]<-mean(mylist[[k]][[i]])
}
}
It gives the message: Error in mylist[[k]][[i]] : subscript out of bounds
The second problem is really a request for suggestions: my original loop is much bigger (for example, my i goes up to 4828). For a loop over these two indexes (k and i), are there other functions that would make my code simpler or more efficient, or is it best to keep the for loop in this situation?
You are getting that error because the fourth element of mylist has only 5 sublists.
You could also do:
resultlist <- lapply(mylist, function(x) lapply(x, function(y) rep(mean(y), length(y))))
to get what you want.
In your case, mylist[[4]] has only 5 elements, so the error occurs because you ask for its sixth element, which does not exist.
If all the list elements have equal length, you can use expand.grid to build all the index combinations and use apply to run the loop, which is a little faster than for.
Try this:
library(magrittr)  # provides the %>% pipe used below
list_len=length(mylist)
sub_list_len=6
combination<-expand.grid(1:sub_list_len,1:list_len)
temp_output<-apply(combination,1,function(x) mean( mylist[[x[2]]][[x[1]]] ))
resultlist<-split(temp_output,rep(1:list_len,each=sub_list_len)) %>% lapply(.,function(x) split(x,1:length(x)))
resultlist
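If the goal is just the mean of every sub-vector, a nested lapply avoids hard-coded index ranges altogether, so it also works when the elements have different numbers of sub-lists. A minimal sketch, assuming a small made-up list (demo_list is hypothetical and much shorter than the list in the question):

```r
# Hypothetical ragged list: the second element has fewer sub-vectors than the first
demo_list <- list(
  list(c(30, 50, 35), c(40, 35, 45), c(40, 20, 30)),
  list(c(50, 50), c(40, 30))
)

# The inner lapply visits whatever sub-vectors exist, so no index runs out of bounds
result <- lapply(demo_list, function(sub) lapply(sub, mean))

# Same means, returned as one numeric vector per top-level element
result_vec <- lapply(demo_list, function(sub) vapply(sub, mean, numeric(1)))
str(result_vec)
```

Because the functions iterate over the elements themselves rather than over fixed ranges such as 1:6, the same code should scale to thousands of sub-lists without changes.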

Create new dataset removing variables with high inflation factors

I have a dataset of environmental variables I would like to use for a GLMM. I am using the corvif function from the AED package (http://www.highstat.com/Book2/AED_1.0.zip) to identify and remove variables with high variance inflation factors.
Instead of manually removing one variable at a time with a GVIF value > 3 (highest value removed first), I would like to know how to write a loop that does this automatically, with the result being a new dataset containing only the remaining variables (i.e. those with GVIF values < 3).
Any suggestions for how to approach this problem for a new R user?
Here is my sample data:
WW_Covs <- structure(list(Latitude = c(62.4419, 67.833333, 65.95, 63.72935,
60.966667, 60.266667, 55.660455, 62.216667, 61.3, 61.4, 62.084139,
55.662566, 64.48508, 63.208354, 62.87591, 62.70856, 62.64009,
63.79488, 59.55, 62.84206), BIO_02 = c(87, 82, 75, 70, 77, 70,
59, 84, 84, 79, 85, 60, 91, 87, 74, 74, 76, 70, 76, 74), BIO_03 = c(26,
23, 25, 26, 25, 24, 25, 25, 26, 25, 26, 26, 24, 25, 24, 25, 25,
25, 26, 24), BIO_04 = c(8443, 9219, 7594, 6939, 7928, 7593, 6160,
8317, 8167, 7972, 8323, 6170, 9489, 8578, 7814, 7680, 7904, 7149,
7445, 7803), BIO_05 = c(201, 169, 151, 166, 194, 210, 202, 205,
204, 186, 205, 200, 200, 195, 170, 154, 180, 166, 219, 170),
BIO_06 = c(-131, -183, -144, -102, -107, -75, -26, -119,
-113, -120, -120, -28, -169, -143, -131, -142, -124, -111,
-72, -129), BIO_08 = c(128, 109, 85, 78, 122, 145, 153, 134,
130, 126, 132, 152, 120, 119, 115, 98, 124, 104, 147, 115
), BIO_09 = c(-31, -81, -16, 13, -60, -6, 25, -25, -25, -70,
-25, 23, -56, -39, -47, -60, -39, 8, 0, -46), BIO_12 = c(667,
481, 760, 970, 645, 557, 645, 666, 652, 674, 670, 670, 568,
598, 650, 734, 620, 868, 571, 658), BIO_13 = c(78, 77, 96,
109, 85, 70, 67, 77, 84, 93, 78, 68, 72, 78, 93, 99, 90,
96, 72, 93), BIO_15 = c(23, 40, 25, 21, 36, 30, 21, 24, 28,
34, 24, 22, 28, 29, 34, 32, 36, 22, 30, 34), BIO_19 = c(147,
85, 180, 236, 108, 119, 154, 149, 135, 118, 148, 162, 117,
119, 120, 141, 111, 204, 111, 122)), .Names = c("Latitude",
"BIO_02", "BIO_03", "BIO_04", "BIO_05", "BIO_06", "BIO_08", "BIO_09",
"BIO_12", "BIO_13", "BIO_15", "BIO_19"), row.names = c(1:20), class = "data.frame")
Sample code:
library(AED)
WW_Final <- corvif(WW_Covs)
test <- corvif(WW_Covs)
test[order(-test$GVIF), ]
if(test$GVIF[1,] > 3, # this is where I get stuck...
Here is an algorithm for doing this. I illustrate with the built-in dataset longley, and I also use function vif in package car, rather than using package AED:
It's not pretty, and should be wrapped inside a function, but I leave that as an exercise for the interested reader.
The code:
library(car)
dat <- longley
cutoff <- 2
flag <- TRUE
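while(flag){
# refit on the current set of predictors and compute their VIFs
fit <- lm(Employed ~ ., data=dat)
vfit <- vif(fit)
if(max(vfit) > cutoff){
# drop the worst predictor by name, so the response column position does not matter
dat <- dat[, names(dat) != names(which.max(vfit))]
} else {
flag <- FALSE
}
}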
print(fit)
print(vfit)
The output:
Call:
lm(formula = Employed ~ ., data = dat)
Coefficients:
(Intercept) Unemployed Armed.Forces
50.66281 0.02265 0.02847
Unemployed Armed.Forces
1.032501 1.032501
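The loop above can be wrapped in a function along the following lines. This is only a sketch under stated assumptions: drop_high_vif is a hypothetical helper name, it expects a data frame of numeric predictors only (no response column), and it uses plain VIFs computed in base R as the diagonal of the inverse correlation matrix, which is not the same computation as the GVIF reported by corvif.

```r
# Iteratively drop the predictor with the largest VIF until all are <= cutoff.
# X: data frame containing numeric predictors only (no response column).
drop_high_vif <- function(X, cutoff = 3) {
  while (ncol(X) > 1) {
    # For numeric predictors, the VIFs are the diagonal of the inverse correlation matrix
    v <- diag(solve(cor(X)))
    names(v) <- colnames(X)
    if (max(v) <= cutoff) break
    # Remove the worst predictor by name and recompute on the next pass
    X <- X[, names(X) != names(which.max(v)), drop = FALSE]
  }
  X
}

# Same illustration as above: longley predictors, cutoff of 2
predictors <- longley[, names(longley) != "Employed"]
reduced <- drop_high_vif(predictors, cutoff = 2)
print(names(reduced))
```

For the data in the question, a call such as drop_high_vif(WW_Covs, cutoff = 3) would return the reduced data frame directly, which can then be passed on to the GLMM.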
