I am reading the book by Cohen, Cohen, Aiken and West (2003), "Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences", and have come across a 3D plot of regression surfaces showing interaction and no interaction (p. 259). The graphs look like they may have been created using R. I like the graphs as a teaching tool and would like to reproduce them. The plots look something like this:
The only addition to the Cohen et al. plots would be lines across the planes at the mean, +1 SD, and -1 SD of x2. This would be an excellent addition if possible (generally most things are possible with R).
I have provided a sample data set below with a DV (y), two predictors, and centered versions of the predictors. How would I use R to generate the regression surface (plane) plots showing an interaction model and an additive model for both the centered and uncentered data? (I assume the technique will be the same for both but want to make sure.)
Total of 4 plots:
1. uncentered no interaction
2. uncentered interaction
3. centered no interaction
4. centered interaction
DF<-structure(list(y = c(-1.22, -1.73, -2.64, -2.44, -1.11, 2.24,
3.42, 0.67, 0.59, -0.61, -10.77, 0.93, -8.6, -6.99, -0.12, -2.29,
-5.16, -3.35, -3.35, -2.51, 2.21, -1.18, -5.21, -7.74, -1.34),
x1 = c(39.5, 41, 34, 30.5, 31.5, 30, 41.5, 24, 43, 39, 25.5,
38.5, 33.5, 30, 41, 31, 25, 37, 37.5, 24.5, 38, 37, 41, 37,
36), x2 = c(61L, 53L, 53L, 44L, 49L, 44L, 57L, 47L, 54L,
48L, 46L, 59L, 46L, 61L, 55L, 57L, 59L, 59L, 55L, 50L, 62L,
55L, 55L, 52L, 55L), centered.x1 = c(5.49702380952381, 6.99702380952381,
-0.0029761904761898, -3.50297619047619, -2.50297619047619,
-4.00297619047619, 7.49702380952381, -10.0029761904762, 8.99702380952381,
4.99702380952381, -8.50297619047619, 4.49702380952381, -0.50297619047619,
-4.00297619047619, 6.99702380952381, -3.00297619047619, -9.00297619047619,
2.99702380952381, 3.49702380952381, -9.50297619047619, 3.99702380952381,
2.99702380952381, 6.99702380952381, 2.99702380952381, 1.99702380952381
), centered.x2 = c(9.80357142857143, 1.80357142857143, 1.80357142857143,
-7.19642857142857, -2.19642857142857, -7.19642857142857,
5.80357142857143, -4.19642857142857, 2.80357142857143, -3.19642857142857,
-5.19642857142857, 7.80357142857143, -5.19642857142857, 9.80357142857143,
3.80357142857143, 5.80357142857143, 7.80357142857143, 7.80357142857143,
3.80357142857143, -1.19642857142857, 10.8035714285714, 3.80357142857143,
3.80357142857143, 0.803571428571431, 3.80357142857143)), .Names = c("y",
"x1", "x2", "centered.x1", "centered.x2"), row.names = c(NA,
25L), class = "data.frame")
Thank you in advance.
EDIT: The following code plots the plane but will not work when you have an interaction (which is really what I'm interested in). Additionally, I don't know how to plot the high (+1 SD), low (-1 SD) and mean lines for x2 either.
library(scatterplot3d)
x11(10,5)
s3d <- scatterplot3d(DF[,c(2,3,1)], type="n", highlight.3d=TRUE,
angle=70, scale.y=1, pch=16, main="scatterplot3d")
# Now adding a regression plane to the "scatterplot3d"
my.lm <- with(DF, lm(y ~ x1 + x2))
s3d$plane3d(my.lm, lty.box = "solid")
An attempt to plot an interaction plane (Seen here):
s3d <- scatterplot3d(DF[,c(2,3,1)], type="n", highlight.3d=TRUE,
angle=70, scale.y=1, pch=16, main="scatterplot3d")
my.lm <- with(DF, lm(y ~ x1 + x2 + x1:x2 ))
s3d$plane3d(my.lm, lty.box = "solid")
Yielded the following error:
Error in segments(x, z1, x + y.max * yx.f, z2 + yz.f * y.max, lty = ltya, :
cannot mix zero-length and non-zero-length coordinates
Here's how I would do it (adding a bit of color) with packages 'rms' and 'lattice':
require(rms) # also need to have Hmisc installed
require(lattice)
ddI <- datadist(DF)
options(datadist="ddI")
lininterp <- ols(y ~ x1*x2, data=DF)
bplot(Predict(lininterp, x1=25:40, x2=45:60),
lfun=wireframe, # bplot passes extra arguments to wireframe
screen = list(z = -10, x = -50), drape=TRUE)
And the non-interaction model (note that lin.no.int needs to be fitted first):
lin.no.int <- ols(y ~ x1 + x2, data=DF)
bplot(Predict(lin.no.int, x1=25:40, x2=45:60), lfun=wireframe, col=2:8, drape=TRUE,
screen = list(z = -10, x = -50),
main="Estimated regression surface with no interaction")
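The mean and +/- 1 SD lines for x2 asked about in the question aren't produced by bplot(). One way to get both the interaction surface and those lines is a sketch with base graphics: predict over a grid, draw the surface with persp(), and overlay lines with trans3d(). (This uses the DF and variable names from the question.)

```r
# Fit the interaction model and predict over a grid of x1/x2 values
my.lm <- lm(y ~ x1 * x2, data = DF)

x1.seq <- seq(min(DF$x1), max(DF$x1), length.out = 25)
x2.seq <- seq(min(DF$x2), max(DF$x2), length.out = 25)
z <- outer(x1.seq, x2.seq,
           function(a, b) predict(my.lm, newdata = data.frame(x1 = a, x2 = b)))

# persp() returns the projection matrix needed by trans3d()
res <- persp(x1.seq, x2.seq, z, theta = 70, phi = 20,
             xlab = "x1", ylab = "x2", zlab = "y", ticktype = "detailed")

# Lines across the surface at mean(x2) and mean(x2) +/- 1 SD
for (x2.val in mean(DF$x2) + c(-1, 0, 1) * sd(DF$x2)) {
  y.line <- predict(my.lm, newdata = data.frame(x1 = x1.seq, x2 = x2.val))
  lines(trans3d(x1.seq, rep(x2.val, length(x1.seq)), y.line, pmat = res),
        lwd = 2)
}
```

For the centered versions, the same code applies with centered.x1 and centered.x2 substituted; only the axis ranges change.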
Does anyone know how to fit a quadratic (or higher-order) model on a continuous variable and do quantile regression on it in R? Additionally, how do you tell which value of tau fits the data better?
The values for "den" are fish densities (count/m^3) and salinity = salinity (ppt). The full dataset is 1500 observations and I'd like to predict fish density using salinity. A plot with all the data looks semi-quadratic, but I'd like to compare that fit to others using quantile regression. I just can't figure out how to make the relationship in the model non-linear. Is it den ~ salinity + salinity^2?
df <- structure(list(den = c(0, 12, 8.33, 5, 0, 0, 1, 1.33, 0, 3), salinity = c(37, 35, 36, 39, 36, 37, 35, 38, 36, 37)), row.names = c(86L,
240L, 394L, 548L, 702L, 856L, 1010L, 1164L, 1318L, 1472L), class = "data.frame")
library(quantreg)
quantreg75 <- rq(den ~ salinity, data=df, tau=0.75)
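To make the relationship non-linear, the square must be wrapped in I() so it is treated as arithmetic rather than formula syntax (a bare salinity^2 in a formula does not create a quadratic term). A minimal sketch with the quantreg package, assuming the small df above stands in for the full 1500-row data:

```r
library(quantreg)

# Quadratic term protected with I() inside the formula
fit.lin  <- rq(den ~ salinity, data = df, tau = 0.75)
fit.quad <- rq(den ~ salinity + I(salinity^2), data = df, tau = 0.75)

# For a fixed tau, the two fits can be compared with AIC
AIC(fit.lin)
AIC(fit.quad)
```

Comparing across different values of tau is a different question: each tau estimates a different conditional quantile, so "fit" is not directly comparable between them.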
I am running multiple GAM models and need to view and compare the summary output from these. I'd like a quick and efficient way to extract and compile summary statistics from the models but have not found a way to do so.
An example data set is provided below:
example.data <- structure(list(response = c(1.47, 0.84, 1.99, 2.29, 4.14, 4.47,
2.71, 1.67, 4.12, 1.67, 2.03, 1.74, 0.98, 0.96, 0.56, 2.45, 1.31,
3.06, 2.35, 3.2, 1.16, 2.07, 0.99, 1.35, 1.02, 2.92, 1.8, 2.17,
2.56, 1.56, 2.33, 3.19, 1.53, 2.94, 3.28, 1.53, 2.8, 5.53, 1.26,
2.43, 3.5, 2.22, 3.73, 2.46, 2.16, 1.99, 3.34, 2.63, 2.51, 1.78
), predictor1 = c(17, 14.4, 99.45, 10.8, 54.25, 55.1, 40, 9,
54.25, 14.4, 14.4, 17, 14.4, 17, 10.8, 54.25, 54.25, 15.3, 55.1,
54.25, 14.4, 58, 17, 53.425, 58, 40.45, 14.4, 12.75, 91.05, 6.24,
100.25, 77.25, 43.4, 183.6, 91.05, 9.84, 100.25, 64, 10, 10,
91.05, 8.25, 100.25, 54.25, 89.4, 9.84, 10.8, 54.25, 10.8, 54.25
), predictor2 = c(165.7, 177.3, 594.2, 192.5, 426.2, 270.8, 244,
236.1, 416, 175.8, 258.6, 233.5, 115.8, 141, 153.5, 414.2, 438.9,
203, 261.4, 357.8, 148, 205.5, 137.4, 214.7, 167.8, 371.4, 179.9,
273.7, 567.2, 231.5, 355.3, 270, 319.5, 301.9, 301.9, 215.5,
256.5, 417, 231.8, 284.6, 396.3, 323, 458.4, 290, 203, 198, 350.8,
338, 323.5, 264.7), predictor3 = c(829.8, 841, 903.6, 870.3,
794, 745, 845.2, 906.5, 890.3, 874.2, 805.4, 828.8, 872, 854.7,
912.2, 790.8, 759.2, 855.1, 741.6, 961.8, 839.9, 805.1, 885.2,
887.8, 833.9, 1050.9, 787.5, 837, 731.9, 774.4, 820.8, 995.8,
916.3, 1032.1, 1014.3, 773.7, 846.4, 723.7, 764.2, 708.3, 1009.3,
1053.7, 751.7, 901.1, 848.7, 796.5, 697.1, 733.6, 725.6, 856.6
)), row.names = c(50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L,
60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L,
73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L, 85L,
86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 98L,
99L, 100L), class = "data.frame")
Right now, the unsophisticated and inefficient way I do it is something like this:
library(mgcv)
mod1 = gam(response ~ s(predictor1), data=example.data)
mod2 = gam(response ~ s(predictor2), data=example.data)
mod3 = gam(response ~ s(predictor3), data=example.data)
mod.names <- c("mod1", "mod2", "mod3")
mod.predictors <- c("predictor1", "predictor2", "predictor3")
mod.rsq <- c(summary(mod1)$r.sq, summary(mod2)$r.sq, summary(mod3)$r.sq)
mod.AIC <- c(AIC(mod1), AIC(mod2), AIC(mod3))
summary.data <- data.frame(mod.names,
mod.rsq,
mod.AIC,
mod.predictors)
summary.data
I can then select models accordingly from the summary table.
I have over one hundred potential predictors in the actual data, and it's obviously laborious to manually specify all the models and their output so a more automated alternative would be desirable.
The hard part of this question is choosing which models to run: that's a hard statistical question, and depending on what you choose, a less hard programming problem.
I'll assume that you are only interested in models like the ones in your example. Then this should work:
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
predictors <- setdiff(names(example.data), "response")
result <- data.frame(predictors = predictors, rsq = NA, AIC = NA)
model <- response ~ predictor
for (i in seq_len(nrow(result))) {
pred <- result$predictors[i]
model[[3]] <- bquote(s(.(as.name(pred))))
mod <- gam(model, data = example.data)
result$rsq[i] <- summary(mod)$r.sq
result$AIC[i] <- AIC(mod)
}
result
#> predictors rsq AIC
#> 1 predictor1 0.2011252 138.0875
#> 2 predictor2 0.4666861 118.7270
#> 3 predictor3 0.1959123 139.0365
The tricky part is computing the model formula. I start with a simple model response ~ predictor, then replace the 3rd part (predictor) with code produced by bquote(s(.(as.name(pred)))). That function produces unevaluated code like s(predictor1) when pred holds "predictor1".
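The formula surgery can be seen in isolation. A self-contained illustration of what the bquote() step produces:

```r
# Start from a template formula; element [[3]] is its right-hand side
model <- response ~ predictor

# Replace the RHS with the unevaluated call s(predictor1)
model[[3]] <- bquote(s(.(as.name("predictor1"))))
model
#> response ~ s(predictor1)
```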
I would like to plot both a linear model (LM) and non-linear (GLM) model of the same data.
The fitted values in the 16%-84% response range should line up between the LM and the GLM (citation: section 3.5).
I have included a more complete chunk of the code because I am not sure at which point I should try to cut the linear model, or at which point I have messed up (I think it is with the linear model).
The code below results in the following image:
My Objective (taken from previous citation-link).
Here is my data:
mydata3 <- structure(list(
dose = c(0, 0, 0, 3, 3, 3, 7.5, 7.5, 7.5, 10, 10, 10, 25, 25, 25, 50, 50, 50),
total = c(25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L),
affected = c(1, 0, 1.2, 2.8, 4.8, 9, 2.8, 12.8, 8.6, 4.8, 4.4, 10.2, 6, 20, 14, 12.8, 23.4, 21.6),
probability = c(0.04, 0, 0.048, 0.112, 0.192, 0.36, 0.112, 0.512, 0.344, 0.192, 0.176, 0.408, 0.24, 0.8, 0.56, 0.512, 0.936, 0.864)),
.Names = c("dose", "total", "affected", "probability"),
row.names = c(NA, -18L),
class = "data.frame")
My script:
#load libraries
library(ggplot2)
library(drc) # glm model
library(plyr) # rename function
library(scales) #log plot scale
#Creating linear model
mod_linear <- lm(probability ~ (dose), weights = total, data = mydata3)
#Creating data.frame: note values 3 and 120 refer to 16% and 84% response in sigmoidal plot
line_df <-expand.grid(dose=exp(seq(log(3),log(120),length=200)))
#Extracting values from linear model
p_line_df <- as.data.frame(cbind(dose = line_df,
predict(mod_linear, newdata=data.frame(dose = line_df),
interval="confidence",level=0.95)))
#Renaming linear df columns
p_line_df <-rename(p_line_df, c("fit"="probability"))
p_line_df <-rename(p_line_df, c("lwr"="Lower"))
p_line_df <-rename(p_line_df, c("upr"="Upper"))
p_line_df$model <-"Linear"
#Create sigmoidal dose-response curve using drc package
mod3 <- drm(probability ~ (dose), weights = total, data = mydata3, type ="binomial", fct=LL.2(names=c("Slope:b","ED50:e")))
#data frame for ggplot2
base_DF_3 <-expand.grid(dose=exp(seq(log(1.0000001),log(10000),length=200)))
#extract data from model
p_df3 <- as.data.frame(cbind(dose = base_DF_3,
predict(mod3, newdata=data.frame(dose = base_DF_3),
interval="confidence", level=.95)))
#renaming columns
p_df3 <-rename(p_df3, c("Prediction"="probability"))
p_df3$model <-"Sigmoidal"
#combining Both DataFames
p_df_all <- rbind(p_df3, p_line_df)
#plotting
ggplot(p_df_all, aes(x=dose,y=probability, group=model))+
geom_line(aes(x=dose,y=probability,group=model,linetype=model),show.legend = TRUE)+
scale_x_log10(breaks = c(0.000001, 10^(0:10)),labels = c(0, math_format()(0:10)))
Looking at the reference you provided, what the authors describe is the use of a linear model to approximate the central portion of a (sigmoidal) logistic function. The linear model that achieves this is a straight line that passes through the inflection point of the logistic curve, and has the same slope as the logistic function at that inflection point. We can use some basic algebra and calculus to solve this problem.
From ?LL.2, we see that the form of the logistic function being fitted by drm is
f(x) = 1 / {1 + exp(b(log(x) - log(e)))}
We can get the values of the coefficient in this equation by
b = mod3$coefficients[1]
e = mod3$coefficients[2]
Now, differentiating with respect to log(x) (the relevant scale here, since the curve is plotted against log dose), the slope of the logistic function is given by
dy/d(log x) = -(b * exp((log(x)-log(e))*b)) / (1+exp((log(x)-log(e))*b))^2
At the inflection point, the dose (x) is equal to the coefficient e, thus the slope at the inflection point simplifies (greatly) to
sl50 = -b/4
Since we also know that the inflection point occurs at the point where probability = 0.5 and dose = e, we can construct the straight line (in log-transformed coordinates) like this:
linear_probability = sl50 * (log(p_df3$dose) - log(e)) + 0.5
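As a sanity check on the algebra, the slope at the inflection point can be verified numerically with a central difference in log(x); the b and e values below are made up for illustration:

```r
# Logistic function from ?LL.2, with hypothetical coefficients
b <- 1.5; e <- 20
f <- function(x) 1 / (1 + exp(b * (log(x) - log(e))))

# Central difference of f with respect to log(x), evaluated at x = e
h <- 1e-6
slope.num <- (f(e * exp(h)) - f(e * exp(-h))) / (2 * h)

all.equal(slope.num, -b/4, tolerance = 1e-6)
#> [1] TRUE
```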
Now, to plot the logistic and linear functions together:
p_df3_lin = p_df3
p_df3_lin$model = 'linear'
p_df3_lin$probability = linear_probability
p_df_all <- rbind(p_df3, p_df3_lin)
ggplot(p_df_all, aes(x=dose,y=probability, group=model))+
geom_line(aes(x=dose,y=probability,group=model,linetype=model),show.legend = TRUE)+
scale_x_log10(breaks = c(0.000001, 10^(0:10)),labels = c(0, math_format()(0:10))) +
scale_y_continuous(limits = c(0,1))
I want to call the NbClust() function for a couple of dataframes. I do so by "sending" them all through a for loop that contains the NbClust() function call.
The code looks like this:
#combos of just all columns from df
variations = unlist(lapply(seq_along(df), function(x) combn(df, x, simplify=FALSE)), recursive=FALSE)
for(i in seq_along(variations)){
df.i = data.frame(variations[[i]])  # avoid clobbering the original df
nc = NbClust(scale(df.i), distance="euclidean", min.nc=2, max.nc=10, method="complete")
}
Unfortunately it always generates the error below. Strangely enough, if I apply the same function call without the loop (i.e. to only one data frame) it works perfectly... so what is wrong?
I have had a look at the source code of NbClust, and there is indeed a line containing the code from the error message, but I am unable to change it accordingly. Do you have any idea what the problem might be?
Error in if ((res[ncP - min_nc + 1, 15] <= resCritical[ncP - min_nc +
: missing value where TRUE/FALSE needed
Additionally it produces the following warnings:
In addition: Warning messages:
1: In max(DiffLev[, 5], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
2: In matrix(c(results), nrow = 2, ncol = 26) :
data length [51] is not a sub-multiple or multiple of the number of rows [2]
3: In matrix(c(results), nrow = 2, ncol = 26, dimnames = list(c("Number_clusters", :
data length [51] is not a sub-multiple or multiple of the number of rows [2]
Data looks as follows:
df = structure(list(GDP = c(18.2, 8.5, 54.1, 1.4, 2.1, 83.6, 17, 4.9,
7.9, 2, 14.2, 48.2, 17.1, 10.4, 37.5, 1.6, 49.5, 10.8, 6.2, 7.1,
7.8, 3, 3.7, 4.2, 8.7, 2), Population = c(1.22, 0.06, 0, 0.54,
2.34, 0.74, 1.03, 1.405095932, 0.791124402, 2.746318326, 0.026149254,
11.1252, 0.05183432, 2.992952671, 0.705447655, 0, 0.900246028,
1.15476828, 0, 1.150673397, 1.441975309, 0, 0.713777778, 1.205504587,
1.449230769, 0.820985507), Birth.rate = c(11.56, 146.75, 167.23,
7, 7, 7, 10.07, 47.42900998, 20.42464115, 7.520608751, 7, 7,
15.97633136, 15.1531143, 20.41686405, 7, 22.60379293, 7, 7, 18.55225902,
7, 7.7, 7, 7, 7, 7), Income = c(54L, 94L, 37L, 95L, 98L, 31L,
78L, 74L, 81L, 95L, 16L, 44L, 63L, 95L, 20L, 95L, 83L, 98L, 98L,
84L, 62L, 98L, 98L, 97L, 98L, 57L), Savings = c(56.73, 56.49,
42.81, 70.98, 88.24, 35.16, 46.18, 35.043, 46.521, 58.024, 22.738,
60.244, 77.807, 80.972, 13.08, 40.985, 46.608, 63.32, 51.45,
74.803, 73.211, 50.692, 65.532, 83.898, 60.857, 40.745)), .Names = c("GDP", "Population", "Birth.rate", "Income", "Savings"), class = "data.frame", row.names = c(NA, -26L))
Some of the clustering indices are not suited to every dataset or type of data. You can select the appropriate methods, or use all of them. When using all of them, it often happens that one produces an error (which is not a bug in NbClust). Disabling the error that stops the loop, the code below could be an alternative:
vc.method <- c("kl","ch", "hartigan","ccc", "scott","marriot","trcovw", "tracew","friedman", "rubin", "cindex", "db", "silhouette", "duda", "beale", "ratkowsky", "ball", "ptbiserial", "pseudot2", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", "dindex", "sdbw", "alllong")
val.nb <- c()
for(method in seq_along(vc.method)){
tryCatch({
en.nb <- NbClust(scale(df), distance = "euclidean", min.nc = 2,
max.nc = 10, method = "kmeans",
index = vc.method[method])
val.nb <- c(val.nb, as.numeric(en.nb$Best.nc[1]))
}, error=function(e){cat("ERROR :", conditionMessage(e), "\n")})
}
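The same tryCatch() pattern can be applied directly to the loop over column combinations from the question, so that a failing combination is reported and skipped instead of aborting the whole run. A sketch, assuming `variations` has been built as in the question:

```r
library(NbClust)

results <- list()
for (i in seq_along(variations)) {
  df.i <- data.frame(variations[[i]])
  tryCatch({
    nc <- NbClust(scale(df.i), distance = "euclidean",
                  min.nc = 2, max.nc = 10, method = "complete")
    # Keep the suggested cluster counts for this column combination
    results[[i]] <- nc$Best.nc
  }, error = function(e) {
    cat("ERROR in combination", i, ":", conditionMessage(e), "\n")
  })
}
```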
The source of this data is server performance metrics. The numbers I have are the mean (os_cpu) and standard deviation (os_cpu_sd). The mean clearly doesn't tell the whole story, so I want to add the standard deviation. I started down the path of geom_errorbar; however, I believe that is for standard error. What would be an accepted way to plot these metrics? Below is a reproducible example:
DF_CPU <- structure(list(end = structure(c(1387315140, 1387316340, 1387317540,
1387318740, 1387319940, 1387321140, 1387322340, 1387323540, 1387324740,
1387325940, 1387327140, 1387328340, 1387329540, 1387330740, 1387331940,
1387333140, 1387334340, 1387335540, 1387336740, 1387337940, 1387339140,
1387340340, 1387341540, 1387342740, 1387343940, 1387345140, 1387346340,
1387347540, 1387348740, 1387349940), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), os_cpu = c(14.8, 15.5, 17.4, 15.6, 14.9, 14.6,
15, 15.2, 14.6, 15.2, 15, 14.5, 14.8, 15, 14.6, 14.9, 14.9, 14.4,
14.8, 14.9, 14.5, 15, 14.6, 14.5, 15.3, 14.6, 14.6, 15.2, 14.5,
14.5), os_cpu_sd = c(1.3, 2.1, 3.2, 3.3, 0.9, 0.4, 1.4, 1.5,
0.4, 1.6, 1, 0.4, 1.4, 1.4, 0.4, 1.3, 0.9, 0.4, 1.4, 1.3, 0.4,
1.7, 0.4, 0.4, 1.7, 0.4, 0.4, 1.7, 0.5, 0.4)), .Names = c("end",
"os_cpu", "os_cpu_sd"), class = "data.frame", row.names = c(1L,
5L, 9L, 13L, 17L, 21L, 25L, 29L, 33L, 37L, 41L, 45L, 49L, 53L,
57L, 61L, 65L, 69L, 73L, 77L, 81L, 85L, 89L, 93L, 97L, 101L,
105L, 109L, 113L, 117L))
head(DF_CPU)
end os_cpu os_cpu_sd
1 2013-12-17 21:19:00 14.8 1.3
5 2013-12-17 21:39:00 15.5 2.1
9 2013-12-17 21:59:00 17.4 3.2
13 2013-12-17 22:19:00 15.6 3.3
17 2013-12-17 22:39:00 14.9 0.9
ggplot(data=DF_CPU, aes(x=end, y=os_cpu)) +
geom_line()+
geom_errorbar(aes(ymin=os_cpu-os_cpu_sd,ymax=os_cpu+os_cpu_sd), alpha=0.2,color="red")
Per @ari-b-friedman's suggestion, here's what it looks like with geom_ribbon():
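The code for that version might look like this (a sketch: a shaded +/- 1 SD band around the mean line, using the DF_CPU from above):

```r
library(ggplot2)

ggplot(DF_CPU, aes(x = end, y = os_cpu)) +
  # Shaded band spanning mean +/- 1 SD
  geom_ribbon(aes(ymin = os_cpu - os_cpu_sd, ymax = os_cpu + os_cpu_sd),
              fill = "red", alpha = 0.2) +
  geom_line()
```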
Your question is largely about aesthetics, and so opinions will differ. Having said that, there are some guidelines:
Emphasize what is important.
Provide a frame of reference if at all possible.
Avoid misleading scales or graphics.
Avoid unnecessary graphics.
So this code:
ggplot(data=DF_CPU, aes(x=end, y=os_cpu)) +
geom_point(size=3, shape=1)+
geom_line(linetype=2, colour="grey")+
geom_linerange(aes(ymin=os_cpu-1.96*os_cpu_sd,ymax=os_cpu+1.96*os_cpu_sd), alpha=0.5,color="blue")+
ylim(0,max(DF_CPU$os_cpu+1.96*DF_CPU$os_cpu_sd))+
stat_smooth(formula=y~1,se=TRUE,method="lm",linetype=2,size=1)+
theme_bw()
Produces this:
This graphic emphasizes that CPU utilization (??) over 20-minute intervals did not deviate significantly from the average for the 9-hour period monitored. The reference line is average utilization. The error bars were replaced with geom_linerange(...) because the horizontal caps in geom_errorbar(...) add nothing and are distracting. Also, your original plot makes it seem that the error is very large compared to actual utilization, which it isn't. I changed the range to +/- 1.96*sd because that more closely approximates the 95% CL. Finally, the x- and y-axis labels need to be replaced with something descriptive, but I don't have enough information to do that.
There's a designer's adage that "form follows function", and this should apply to graphics. What are you trying to do with your plots? What's the question you are trying to answer?
If it is "is CPU usage significantly decreasing with time?" then this plot will probably do and gives the answer "no". If it is "is the probability of exceeding 10s changing with time?" then you need to assume a model for your data (e.g. something as simple as Normal(os_cpu, os_cpu_sd)) and then plot exceedance (tail) probabilities.
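Under that Normal(os_cpu, os_cpu_sd) assumption, the exceedance plot is a one-liner with pnorm(). A sketch, where the threshold value is made up for illustration:

```r
library(ggplot2)

# Hypothetical threshold; substitute whatever level matters operationally
threshold <- 18

# Upper-tail probability P(utilization > threshold) at each time point
DF_CPU$p.exceed <- pnorm(threshold, mean = DF_CPU$os_cpu,
                         sd = DF_CPU$os_cpu_sd, lower.tail = FALSE)

ggplot(DF_CPU, aes(x = end, y = p.exceed)) +
  geom_line() +
  labs(y = paste("P(os_cpu >", threshold, ")"))
```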
Anyway, just plotting means and envelopes like you have done is always a fair start, and at least answers the questions "what does my data look like?" and "is anything obviously wrong?"