How do I efficiently summarize summary output from multiple GAM models?

I am running multiple GAM models and need to view and compare the summary output from these. I'd like a quick and efficient way to extract and compile summary statistics from the models but have not found a way to do so.
An example data set is provided below:
example.data <- structure(list(response = c(1.47, 0.84, 1.99, 2.29, 4.14, 4.47,
2.71, 1.67, 4.12, 1.67, 2.03, 1.74, 0.98, 0.96, 0.56, 2.45, 1.31,
3.06, 2.35, 3.2, 1.16, 2.07, 0.99, 1.35, 1.02, 2.92, 1.8, 2.17,
2.56, 1.56, 2.33, 3.19, 1.53, 2.94, 3.28, 1.53, 2.8, 5.53, 1.26,
2.43, 3.5, 2.22, 3.73, 2.46, 2.16, 1.99, 3.34, 2.63, 2.51, 1.78
), predictor1 = c(17, 14.4, 99.45, 10.8, 54.25, 55.1, 40, 9,
54.25, 14.4, 14.4, 17, 14.4, 17, 10.8, 54.25, 54.25, 15.3, 55.1,
54.25, 14.4, 58, 17, 53.425, 58, 40.45, 14.4, 12.75, 91.05, 6.24,
100.25, 77.25, 43.4, 183.6, 91.05, 9.84, 100.25, 64, 10, 10,
91.05, 8.25, 100.25, 54.25, 89.4, 9.84, 10.8, 54.25, 10.8, 54.25
), predictor2 = c(165.7, 177.3, 594.2, 192.5, 426.2, 270.8, 244,
236.1, 416, 175.8, 258.6, 233.5, 115.8, 141, 153.5, 414.2, 438.9,
203, 261.4, 357.8, 148, 205.5, 137.4, 214.7, 167.8, 371.4, 179.9,
273.7, 567.2, 231.5, 355.3, 270, 319.5, 301.9, 301.9, 215.5,
256.5, 417, 231.8, 284.6, 396.3, 323, 458.4, 290, 203, 198, 350.8,
338, 323.5, 264.7), predictor3 = c(829.8, 841, 903.6, 870.3,
794, 745, 845.2, 906.5, 890.3, 874.2, 805.4, 828.8, 872, 854.7,
912.2, 790.8, 759.2, 855.1, 741.6, 961.8, 839.9, 805.1, 885.2,
887.8, 833.9, 1050.9, 787.5, 837, 731.9, 774.4, 820.8, 995.8,
916.3, 1032.1, 1014.3, 773.7, 846.4, 723.7, 764.2, 708.3, 1009.3,
1053.7, 751.7, 901.1, 848.7, 796.5, 697.1, 733.6, 725.6, 856.6
)), row.names = c(50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L,
60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L,
73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L, 85L,
86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 98L,
99L, 100L), class = "data.frame")
Right now, the unsophisticated and inefficient way I do it is something like this:
library(mgcv)
mod1 = gam(response ~ s(predictor1), data=example.data)
mod2 = gam(response ~ s(predictor2), data=example.data)
mod3 = gam(response ~ s(predictor3), data=example.data)
mod.names <- c("mod1", "mod2", "mod3")
mod.predictors <- c("predictor1", "predictor2", "predictor3")
mod.rsq <- c(summary(mod1)$r.sq, summary(mod2)$r.sq, summary(mod3)$r.sq)
mod.AIC <- c(AIC(mod1), AIC(mod2), AIC(mod3))
summary.data <- data.frame(mod.names,
                           mod.rsq,
                           mod.AIC,
                           mod.predictors)
summary.data
I can then select models accordingly from the summary table.
I have over one hundred potential predictors in the actual data, and it is obviously laborious to specify all the models and collect their output by hand, so a more automated alternative would be desirable.

The hard part of this question is choosing which models to run: that is a hard statistical question, and depending on what you choose, a comparatively easy programming problem.
I'll assume that you are only interested in models like the ones in your example. Then this should work:
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
predictors <- setdiff(names(example.data), "response")
result <- data.frame(predictors = predictors, rsq = NA, AIC = NA)
model <- response ~ predictor
for (i in seq_len(nrow(result))) {
  pred <- result$predictors[i]
  model[[3]] <- bquote(s(.(as.name(pred))))
  mod <- gam(model, data = example.data)
  result$rsq[i] <- summary(mod)$r.sq
  result$AIC[i] <- AIC(mod)
}
result
#> predictors rsq AIC
#> 1 predictor1 0.2011252 138.0875
#> 2 predictor2 0.4666861 118.7270
#> 3 predictor3 0.1959123 139.0365
The tricky part is computing the model formula. I start with a simple model response ~ predictor, then replace its third element (the right-hand side, predictor) with the unevaluated code produced by bquote(s(.(as.name(pred)))). When pred holds "predictor1", that call produces s(predictor1).
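If you prefer building the formulas from strings, here is a roughly equivalent sketch using reformulate(); the names fits and result2 and the sprintf() term construction are illustrative choices of mine, not part of the answer above.
library(mgcv)
predictors <- setdiff(names(example.data), "response")
# fit one GAM per predictor, building each formula from a term string
fits <- lapply(predictors, function(pred) {
  f <- reformulate(sprintf("s(%s)", pred), response = "response")
  gam(f, data = example.data)
})
result2 <- data.frame(predictors = predictors,
                      rsq = sapply(fits, function(m) summary(m)$r.sq),
                      AIC = sapply(fits, AIC))
result2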

Related

Line of best fit for a scatterplot with a negative correlation using the function abline() in R

Right now I have a dataset with temperature (independent variable) on the x-axis and the rate of sapflow measured from the xylem in trees during transpiration along the y-axis.
I created a scatterplot from the data below (scatterplot 1). I want to plot the best-fit line using the function abline() or perhaps another function that does the same job. However, my R code (below) appears to have plotted the line of best fit in the bottom left-hand corner of my plot, which does not seem right.
R code
plot(Sapflow$Sapflow, Sapflow$Temperature,
     main = "Scatterplot of Temperature (°C)",
     xlab = "Temperature (°C)",
     ylab = expression(paste("Sapflow Litres day"^{-1})),
     pch = 19,
     col = "red")
## Line of best fit
abline(lm(Sapflow$Sapflow ~ Sapflow$Temperature), col = "blue")
Goal
My scatterplot has an overall negative correlation and I can imagine the line of best fit, in reality, would be slightly curved. Even though I cannot say for sure without a true visual representation of the line of best fit, I envision it would be similar in appearance to scatterplot 2. Can anyone lend a hand?
Scatterplot 1
Scatterplot 2
Reference: 'Quick R'
https://www.statmethods.net/graphs/images/scatterplot2.jpg
Data
structure(list(Date = structure(c(31L, 42L, 53L, 55L, 56L, 57L,
58L, 59L, 60L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 54L, 61L, 72L,
83L, 86L, 87L, 88L, 89L, 90L, 91L, 62L, 63L, 64L, 65L, 66L, 67L,
68L, 69L, 70L, 71L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L,
82L, 84L, 85L, 92L, 103L, 114L, 117L, 118L, 119L, 120L, 121L,
122L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L, 101L, 102L, 104L,
105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 115L, 116L,
123L, 134L, 145L, 147L, 148L, 149L, 150L, 151L, 152L, 124L, 125L,
126L, 127L, 128L, 129L, 130L, 131L, 132L, 133L, 135L, 136L, 137L,
138L, 139L, 140L, 141L, 142L, 143L, 144L, 146L, 1L, 12L, 23L,
25L, 26L, 27L, 28L, 29L, 30L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 24L
), .Label = c("10/1/18", "10/10/18", "10/11/18", "10/12/18",
"10/13/18", "10/14/18", "10/15/18", "10/16/18", "10/17/18", "10/18/18",
"10/19/18", "10/2/18", "10/20/18", "10/21/18", "10/22/18", "10/23/18",
"10/24/18", "10/25/18", "10/26/18", "10/27/18", "10/28/18", "10/29/18",
"10/3/18", "10/30/18", "10/4/18", "10/5/18", "10/6/18", "10/7/18",
"10/8/18", "10/9/18", "6/1/18", "6/10/18", "6/11/18", "6/12/18",
"6/13/18", "6/14/18", "6/15/18", "6/16/18", "6/17/18", "6/18/18",
"6/19/18", "6/2/18", "6/20/18", "6/21/18", "6/22/18", "6/23/18",
"6/24/18", "6/25/18", "6/26/18", "6/27/18", "6/28/18", "6/29/18",
"6/3/18", "6/30/18", "6/4/18", "6/5/18", "6/6/18", "6/7/18",
"6/8/18", "6/9/18", "7/1/18", "7/10/18", "7/11/18", "7/12/18",
"7/13/18", "7/14/18", "7/15/18", "7/16/18", "7/17/18", "7/18/18",
"7/19/18", "7/2/18", "7/20/18", "7/21/18", "7/22/18", "7/23/18",
"7/24/18", "7/25/18", "7/26/18", "7/27/18", "7/28/18", "7/29/18",
"7/3/18", "7/30/18", "7/31/18", "7/4/18", "7/5/18", "7/6/18",
"7/7/18", "7/8/18", "7/9/18", "8/1/18", "8/10/18", "8/11/18",
"8/12/18", "8/13/18", "8/14/18", "8/15/18", "8/16/18", "8/17/18",
"8/18/18", "8/19/18", "8/2/18", "8/20/18", "8/21/18", "8/22/18",
"8/23/18", "8/24/18", "8/25/18", "8/26/18", "8/27/18", "8/28/18",
"8/29/18", "8/3/18", "8/30/18", "8/31/18", "8/4/18", "8/5/18",
"8/6/18", "8/7/18", "8/8/18", "8/9/18", "9/1/18", "9/10/18",
"9/11/18", "9/12/18", "9/13/18", "9/14/18", "9/15/18", "9/16/18",
"9/17/18", "9/18/18", "9/19/18", "9/2/18", "9/20/18", "9/21/18",
"9/22/18", "9/23/18", "9/24/18", "9/25/18", "9/26/18", "9/27/18",
"9/28/18", "9/29/18", "9/3/18", "9/30/18", "9/4/18", "9/5/18",
"9/6/18", "9/7/18", "9/8/18", "9/9/18"), class = "factor"),
Temperature = c(85.07,
79.72, 72.83, 90.1, 83.02, 73.34, 77.11, 74.79, 81.66, 77.71,
66.14, 78.15, 69.33, 68.13, 60.31, 69.47, 81.86, 78.63, 77.69,
77.56, 52.88, 53.32, 53.74, 55.85, 49.56, 55.3, 69.25, 74.96,
69.29, 60.07, 54.31, 48.6, 55.73, 56.74, 47.66, 60.51, 55.64,
58.39, 63.8, 63.16, 73.65, 71.08, 64.34, 60.1, 51.61, 54.87,
58.23, 52.49, 52.56, 59.64, 67.85, 64.42, 60.08, 59.71, 57.12,
58.7, 68.85, 72.44, 89.13, 77.67, 62.17, 61.3, 63.58, 66.26,
60.09, 56.63, 53.11, 59.84, 60.06, 80.76, 79.51, 73.96, 84.58,
78.77, 71.65, 72.59, 77.52, 69.04, 78.26, 77.22, 73.75, 81.95,
82.04, 78.14, 73.41, 72.76, 90.68, 74.24, 71.3, 74.4, 60.26,
66.08, 65.18, 57.17, 66.88, 75.53, 71.52, 74.97, 66.02, 78.06,
73.58, 68.18, 83.55, 80.4, 66.28, 72.32, 72.39, 77.74, 69.81,
74.21, 77.37, 88.28, 65.33, 87.54, 80.49, 69.58, 68.18, 69.25,
60.06, 66.38, 68.51, 71.65, 63.29, 76.63, 80.46, 85.56, 81.25,
94.48, 73.87, 76.8, 72.83, 77.55, 81.5, 77.7, 75.79, 94.38, 99.55,
94.14, 87.29, 84.81, 82.63, 85.27, 84.52, 71.13, 76.28, 78.06,
82.83, 75.18, 83.8, 85.38, 84, 85.33),
Humidity = c(19.67, 18.82,
20.38, 14.94, 12.92, 15.28, 15.12, 16.05, 15.19, 16.67, 18.69,
14.61, 16.71, 17.35, 16.98, 15.44, 15.21, 18.62, 20.11, 18.64,
15.66, 17.2, 18.21, 19.32, 23.02, 21.69, 18.03, 18.46, 18.45,
20.78, 23.04, 22.05, 19.71, 20.59, 24.89, 23.34, 24.7, 24.2,
22.43, 18.21, 17.66, 18.23, 20.36, 22.83, 23.52, 22.88, 19.59,
21.51, 22.25, 21.47, 22.03, 22.51, 25.54, 24.01, 24.28, 26.21,
23.72, 17.63, 17.27, 19.19, 19.97, 19.84, 22.78, 24.46, 23.05,
23.31, 24.75, 23.23, 18.91, 15.56, 13.51, 15.8, 17.67, 19.18,
18.93, 20.05, 17.1, 16.87, 18.77, 20.49, 21.5, 18.04, 18.82,
17.38, 13.05, 13.13, 13.48, 16.32, 16.74, 16.11, 15.77, 15.48,
18.17, 18.16, 18.44, 16.63, 16.64, 14.47, 13.07, 14.14, 17.27,
16.71, 18.22, 12.9, 13.95, 14.7, 15.78, 17.52, 19.66, 18.87,
18.07, 16.4, 12.92, 10.57, 10.04, 9.78, 10.24, 14.25, 15.92,
11.59, 9.25, 10.33, 11.22, 15.03, 13.67, 14.26, 15.42, 8.34,
8.56, 12.37, 14.38, 15.47, 16.4, 17.15, 20.05, 11.08, 10.63,
14.34, 13.27, 9.33, 8.1, 10.95, 12.79, 8.64, 11.42, 12.12, 9.91,
7.86, 3.51, 4.97, 3.63, 5.59),
Radiation = c(197.8, 195.5, 288,
72, 160.5, 337.1, 176.9, 242.3, 189.4, 295.7, 363.2, 158, 290,
251.2, 297.3, 192.6, 163.5, 274.5, 210.7, 243.4, 287.4, 375.7,
290.5, 336.4, 361.6, 369.2, 302.6, 295.2, 348.5, 343.5, 327.6,
358.9, 358.6, 288.9, 325.6, 307.8, 321.3, 321.5, 280.6, 264.9,
253, 279.5, 318.1, 285.1, 330.8, 252, 201, 229.9, 259.3, 230.4,
265.5, 214.1, 307, 311.1, 282.5, 256.9, 227.2, 263.4, 68.2, 130.8,
276.6, 299.2, 276.5, 243.9, 291, 289.3, 290.6, 259.6, 220.5,
72.7, 158.9, 233.8, 105.9, 164.2, 168.1, 188.7, 120.1, 217.7,
111.2, 114.7, 143.6, 55.2, 108.5, 162.2, 185, 197.7, 54.1, 126.3,
111.2, 135.4, 228.3, 214.3, 240.1, 247.6, 173, 172.4, 131.9,
149.4, 203.1, 92.3, 168.5, 146.6, 65.9, 103.6, 200.2, 131.3,
183.5, 128.3, 140.6, 124.1, 125.9, 75.8, 173.2, 47.9, 111.7,
205.8, 188.3, 175.6, 193.7, 170.4, 188.3, 108, 171.1, 59.5, 87.7,
142.2, 111.8, 26.3, 129.9, 103.1, 158.7, 147.9, 109.8, 67.8,
106.6, 12.3, 15.8, 53, 63.4, 86.2, 123.3, 112.9, 128.2, 141.9,
81.6, 102, 86.8, 83.9, 50, 96.8, 100.5, 47),
Sapflow = c(14.97,
16.31, 17.52, 7.45, 12.18, 15.82, 11.79, 14.45, 10.95, 13.62,
16.28, 11.42, 16.13, 15.09, 17.28, 14.43, 11.7, 16.06, 17.66,
16.33, 17.79, 18.58, 19.41, 19.8, 21.63, 21.35, 17.81, 17.56,
19.37, 21.27, 23.26, 23.67, 22.64, 21.85, 24.81, 22.36, 24.72,
23.87, 23.67, 22.01, 19.23, 19.92, 21.99, 23.6, 24.9, 24.46,
22.22, 23.95, 24.81, 23.88, 22.98, 24.47, 26.09, 25.97, 25.82,
26.24, 25.09, 22, 16.91, 21.35, 25.32, 25.76, 26.38, 25.78, 25.77,
25.15, 26.29, 26.22, 24.59, 18.26, 18.91, 21.57, 21.37, 21.29,
23.96, 24.85, 21.02, 23.05, 22.69, 23.9, 25.24, 25.4, 23.19,
22.8, 22.08, 21.86, 13.82, 22.05, 23.21, 20.12, 22.73, 21.88,
23.33, 24.76, 23.5, 22.06, 22.01, 20.65, 21.54, 19.9, 21.67,
21.84, 18.82, 17.99, 21.41, 23.53, 23.39, 25.75, 22.62, 22.25,
21.81, 16.81, 20.42, 12.08, 12.36, 15.31, 14.14, 15.48, 15.18,
14.19, 12.09, 12.39, 12.34, 12.61, 10.79, 10.53, 11.29, 9.92,
9.79, 10.86, 10.98, 10.58, 12.54, 12.52, 12.25, 6.38, 0.91, 5.24,
6.56, 5.72, 4.55, 4.99, 2.88, 0.99, 1.03, 1.57, 2.07, 2.3, 2.22,
2.11, 2.21, 2.29)),
class = "data.frame", row.names = c(NA, -152L))
Your original (linear) fit seems off because you swapped the columns in your call to plot. By default, the first argument supplies the x-axis, not the y-axis, but you can change this by explicitly assigning y = Sapflow$Sapflow and x = Sapflow$Temperature. Accordingly, plotting your data (with sapflow as the dependent variable and temperature as the predictor) with a linear fit using abline looks like this:
plot(y = Sapflow$Sapflow, x = Sapflow$Temperature,
     main = "Scatterplot of Temperature (°C)",
     xlab = "Temperature (°C)",
     ylab = expression(paste("Sapflow Litres day"^{-1})),
     pch = 19,
     col = "red")
abline(lm(Sapflow$Sapflow ~ Sapflow$Temperature))
If you want to fit a smooth curve to your data, you can use scatter.smooth:
scatter.smooth(y = Sapflow$Sapflow, x = Sapflow$Temperature,
               main = "Scatterplot of Temperature (°C)",
               xlab = "Temperature (°C)",
               ylab = expression(paste("Sapflow Litres day"^{-1})),
               pch = 19,
               col = "red")

Plot slope and intercept of a model with log and percentage transformations

This is the structure of my data
> dput(test)
structure(list(MAT = c(4.9, 4.9, 15.5, 14.1, 14.1, 14.1, 11.5,
11.5, 11.5, 17, 6.1, 2.7, 2.2, 2.2, 14.1, 14.1, 14.1, 9.5, 9.5,
9.5, 9.5, 9.3, 8.3, 8.266666651, 8.266666651, 4.3, 4.3, 22.3,
14.1, 14.1, 14.1, 8.5, 8.5, 8.5, 8.5, 21.5, 21.5, 3.8, 3.8, 6,
6, 6, 6, 6), es = c(0.29603085763985, 0.421393627439682, 0.189653473156549,
0.226685054608428, 0.291373762079697, 0.166533544378467, 0.250586529054368,
0.146320008054403, 0.199565119644333, -0.0819047677231083, 0.15963948187092,
-0.154628141843561, 0.201121044198443, 0.0867981239977565, 0.543870310978598,
0.34547921143505, 0.37557241352574, -0.287318919407836, 0.207937483228907, 0.190143660810163, 0.276182673435993, 0.128596803172119, 0.454753165843559,
0.399237234440439, 0.32075358541748, 0.362664873575803, -0.0865925288159671,
0.51290512543514, 0.186308318839249, 0.147936083867325, 0.243792477087184,
0.625169403695832, 0.110317782120045, 0.217836235313289, 0.171468156841181,
0.50548821117127, 0.164418265301427, -0.00246305543239786, 0.325552346507191,
0.381240606108843, 0.19337350462531, 0.0408803528990759, 0.321815078821239,
0.307642815014319), var = c(0.00496277337027962, 0.0130962311273343,
0.0180149624217804, 0.0134568083459063, 0.00139708925143695,
0.000725862546533828, 0.00670831011660164, 0.0190783110089115,
0.0641568910090007, 0.0121596544795352, 0.0653909966557582, 0.0514610437228611,
0.0231592619167496, 0.0108989891148006, 0.0588577146414195, 0.0695760532112402,
0.0744256820906048, 0.00997789089155498, 0.00928124381998638,
0.0145009450673482, 0.00652956018299188, 0.0111886178917916,
0.0265943757419349, 0.142676904340634, 0.110705177803624, 0.0576538348777718,
0.0625171635976251, 0.0131652117394448, 0.00947904166717649,
0.00813569411386797, 0.00444289889858652, 0.0673007030900184,0.00545169559098343, 0.240046081413733, 0.00561125010476281,
0.0185516235174018, 0.0179989506841957, 0.0496806959944248, 0.022478393723115,
0.0521209786580004, 0.282298667080106, 0.0151428845076692, 0.00992945920656693, 0.0145544965304081), MAP = c(810, 810, 1140, 1750, 1750, 1750,
1034, 1034, 1034, 720, 645, 645, 645, 645, 1000, 1000, 1000,
691, 691, 691, 691, 1134, 1750, 1326, 1326, 1140, 1140, 1310,
1750, 1750, 1750, 1003, 1003, 1003, 1003, 1750, 1750, 1750, 1750,
1750, 1750, 1750, 1750, 1750), CO2dif = c(162L, 162L, 190L, 165L,
165L, 165L, 200L, 200L, 200L, 150L, 335L, 335L, 335L, 335L, 348L,
348L, 348L, 200L, 200L, 200L, 200L, 220L, 350L, 350L, 350L, 350L,
350L, 350L, 180L, 180L, 180L, 130L, 130L, 130L, 130L, 320L, 320L,
360L, 360L, 345L, 345L, 350L, 348L, 348L)), row.names = c(NA,
-44L), class = "data.frame", .Names = c("MAT", "es", "var", "MAP",
"CO2dif"))
I ran model selection using meta-analysis, and the best model for predicting the effect size is:
library(metafor)
summary(rma(es, var, data=test ,control=list(stepadj=.5), mods= ~ 1 + log(MAT) + MAP + CO2dif + log(MAT):CO2dif, knha=TRUE))
Model Results:
estimate se tval pval ci.lb ci.ub
intrcpt 1.2556 0.3719 3.3758 0.0017 0.5033 2.0080 **
log(MAT) -0.5740 0.1694 -3.3882 0.0016 -0.9167 -0.2313 **
MAP 0.0001 0.0001 2.5181 0.0160 0.0000 0.0003 *
CO2dif -0.0042 0.0013 -3.2932 0.0021 -0.0067 -0.0016 **
log(MAT):CO2dif 0.0020 0.0005 3.7500 0.0006 0.0009 0.0031 ***
Now I want to plot es vs. MAT using this model, assuming for example that MAP = 1200 mm and CO2dif = 350:
MAPi <- 1200
CO2i <- 350
make_pct <- function(x) (exp(x) - 1) * 100
library(ggplot2)
ggplot(test, aes(x = log(MAT), y = make_pct(es))) +
  geom_abline(aes(intercept = make_pct(1.2556 + 0.0001 * MAPi - 0.0042 * CO2i),
                  slope = make_pct(log(0.0020 * CO2i)) - make_pct(log(0.5740))),
              color = "red", size = 0.8) +
  geom_point() +
  theme_classic()
Effect size (es) is on a log scale, and I want a percentage, so I transform it with the function make_pct. MAT, on the other hand, has to be log-transformed in the plot, as indicated in the model output. Is the slope in the ggplot above correct given the log and percentage transformations? It seems to me that the slope is rather low. I am not very familiar with this type of plot and these transformations, so any tips are welcome. Thanks
The relationship between exp(es)-1 and the explanatory variable log(MAT) is not linear, so geom_abline is not the right tool here.
For a given set of values of MAP and CO2dif, the model gives es = a + b*log(MAT), where a and b collect the intercept, MAP and CO2dif terms. The back-transformed response therefore has the form y = exp(es) - 1 = k1*exp(k2*log(MAT)) - 1 = k1*MAT^k2 - 1, a curve rather than a straight line.
This function can be plotted as follows:
library(metafor)
library(ggplot2)
modfit <- rma(es, var, data = test, control = list(stepadj = .5),
              mods = ~ 1 + MAP + log(MAT)*CO2dif, knha = TRUE)
pars <- coef(modfit)
MAPi <- 1200
CO2i <- 350
make_pct <- function(x) (exp(x) - 1) * 100
mod_fun <- function(MAP, MAT, CO2dif, pars) {
  y <- pars[1] + pars[2]*MAP + pars[3]*log(MAT) +
    pars[4]*CO2dif + pars[5]*log(MAT)*CO2dif
  make_pct(y)
}
test$ESpct <- mod_fun(MAPi, test$MAT, CO2i, coef(modfit))
ggplot(test, aes(x = log(MAT), y = make_pct(es))) +
  geom_line(aes(y = ESpct), color = "red", size = 0.8) +
  geom_point() + theme_classic()
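A possible refinement (my sketch, under the same fixed MAPi and CO2i assumptions): evaluate mod_fun() on a regular grid of MAT values, so the red curve is drawn smoothly rather than only through the observed MAT values.
# grid is an illustrative name; 100 points is an arbitrary resolution
grid <- data.frame(MAT = seq(min(test$MAT), max(test$MAT), length.out = 100))
grid$ESpct <- mod_fun(MAPi, grid$MAT, CO2i, pars)
ggplot(test, aes(x = log(MAT), y = make_pct(es))) +
  geom_line(data = grid, aes(y = ESpct), color = "red", size = 0.8) +
  geom_point() + theme_classic()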

r: error for NbClust() call when deploying it within for() loop - "Error in if ((res[ncP - min_nc + 1, 15] <= resCritical[ncP - min_nc + :"

I want to call the NbClust() function for a couple of dataframes. I do so by "sending" them all through a for loop that contains the NbClust() function call.
The code looks like this:
# combos of all columns from df
variations = unlist(lapply(seq_along(df), function(x) combn(df, x, simplify=FALSE)),
                    recursive=FALSE)
for (i in 1:length(variations)) {
  df = data.frame(variations[i])
  nc = NbClust(scale(df), distance="euclidean", min.nc=2, max.nc=10,
               method="complete")
}
Unfortunately it always generates the error below. Strangely enough, if I apply the same function call without the loop (i.e. to only one data frame), it works perfectly... so what is wrong?
I have had a look at the source code of NbClust and indeed there is a line that contains the code of the error message but I am unable to change the code accordingly. Do you have any idea what the problem might be?
Error in if ((res[ncP - min_nc + 1, 15] <= resCritical[ncP - min_nc +
: missing value where TRUE/FALSE needed
Additionally it produces the following warnings:
In addition: Warning messages:
1: In max(DiffLev[, 5], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
2: In matrix(c(results), nrow = 2, ncol = 26) :
data length [51] is not a sub-multiple or multiple of the number of rows [2]
3: In matrix(c(results), nrow = 2, ncol = 26, dimnames = list(c("Number_clusters", :
data length [51] is not a sub-multiple or multiple of the number of rows [2]
Data looks as follows:
df = structure(list(GDP = c(18.2, 8.5, 54.1, 1.4, 2.1, 83.6, 17, 4.9,
7.9, 2, 14.2, 48.2, 17.1, 10.4, 37.5, 1.6, 49.5, 10.8, 6.2, 7.1,
7.8, 3, 3.7, 4.2, 8.7, 2), Population = c(1.22, 0.06, 0, 0.54,
2.34, 0.74, 1.03, 1.405095932, 0.791124402, 2.746318326, 0.026149254,
11.1252, 0.05183432, 2.992952671, 0.705447655, 0, 0.900246028,
1.15476828, 0, 1.150673397, 1.441975309, 0, 0.713777778, 1.205504587,
1.449230769, 0.820985507), Birth.rate = c(11.56, 146.75, 167.23,
7, 7, 7, 10.07, 47.42900998, 20.42464115, 7.520608751, 7, 7,
15.97633136, 15.1531143, 20.41686405, 7, 22.60379293, 7, 7, 18.55225902,
7, 7.7, 7, 7, 7, 7), Income = c(54L, 94L, 37L, 95L, 98L, 31L,
78L, 74L, 81L, 95L, 16L, 44L, 63L, 95L, 20L, 95L, 83L, 98L, 98L,
84L, 62L, 98L, 98L, 97L, 98L, 57L), Savings = c(56.73, 56.49,
42.81, 70.98, 88.24, 35.16, 46.18, 35.043, 46.521, 58.024, 22.738,
60.244, 77.807, 80.972, 13.08, 40.985, 46.608, 63.32, 51.45,
74.803, 73.211, 50.692, 65.532, 83.898, 60.857, 40.745)), .Names = c("GDP", "Population", "Birth.rate", "Income", "Savings"), class = "data.frame", row.names = c(NA, -26L))
Some of the clustering indices are not suited to every dataset or type of data. You can either select only the indices that work for your data, or use all of them; when using all of them, it often happens that some produce an error (which is not a bug). By catching the error so it does not stop the loop, the code below could be an alternative:
vc.method <- c("kl","ch", "hartigan","ccc", "scott","marriot","trcovw", "tracew","friedman", "rubin", "cindex", "db", "silhouette", "duda", "beale", "ratkowsky", "ball", "ptbiserial", "pseudot2", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", "dindex", "sdbw", "alllong")
val.nb <- c()
for(method in 1:length(vc.method)){
tryCatch({
en.nb <- NbClust(na.omit(sum.sn), distance = "euclidean", min.nc = 2,
max.nc = vc.K.max, method = "kmeans",
index = vc.method[method])
val.nb <- c(val.nb, as.numeric(en.nb$Best.nc[1]))
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
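Applied to the loop from the question, a minimal sketch (assuming df and variations are defined as above) wraps the original NbClust() call the same way, so one failing column combination no longer aborts the rest:
library(NbClust)
results <- vector("list", length(variations))
for (i in seq_along(variations)) {
  sub.df <- data.frame(variations[[i]])
  results[[i]] <- tryCatch(
    NbClust(scale(sub.df), distance = "euclidean",
            min.nc = 2, max.nc = 10, method = "complete"),
    error = function(e) {
      cat("ERROR in combination", i, ":", conditionMessage(e), "\n")
      NULL  # record the failure and keep going
    }
  )
}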

R: Trouble fitting a 4-parameter hockeystick curve with nls

My dataset:
mydata<-structure(list(t = c(0.208333333, 0.208333333, 0.208333333, 0.208333333,
1, 1, 1, 1, 2, 2, 2, 2, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16,
16, 16, 0.208333333, 0.208333333, 0.208333333, 0.208333333, 1,
1, 1, 1, 2, 2, 2, 2), parent = c(1.2, 1.4, 0.53, 1.2, 1, 0.72,
0.93, 1.1, 0.88, 0.38, 0.45, 0.27, 0.057, 0.031, 0.025, 0.051,
0.027, 0.015, 0.034, 0.019, 0.017, 0.025, 0.024, 0.023, 0.29,
0.22, 0.34, 0.19, 0.12, 0.092, 0.41, 0.28, 0.064, 0.05, 0.058,
0.043)), .Names = c("t", "Ct"), row.names = c(325L, 326L,
327L, 328L, 341L, 342L, 343L, 344L, 357L, 358L, 359L, 360L, 373L,
374L, 375L, 376L, 389L, 390L, 391L, 392L, 401L, 402L, 403L, 404L,
805L, 806L, 807L, 808L, 821L, 822L, 823L, 824L, 837L, 838L, 839L,
840L), class = "data.frame")
The function to be fitted is a hockeystick curve; i.e. it flattens off after the bending point tb:
hockeystick <- function(t, C0, k1, k2, tb)
{
  Ct = ifelse(t <= tb, C0 - k1*t, C0 - k1*tb - k2*t)
}
Fitting using nls:
start.hockey<-c(C0=3,k1=1,k2=0.1,tb=3)
nls(log(Ct)~hockeystick(t,C0,k1,k2,tb),start=start.hockey,data=mydata)
No matter what starting values I use, I always get this error:
Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
I tried both the port and the default nls algorithms. I tried both the linearized model (shown here) and the untransformed form, but neither seems to work.
Edit: As per the suggestion of Carl, I tried fitting the model to a dataset where I first averaged the Ct values per value of t, and I still get the error.
Edit: Changed the model somewhat so the k2 value is positive instead of negative. A negative value does not make sense kinetically.
I haven't quite solved the nls() problem, but I have a few suggestions.
First of all, I would suggest revising your hockey stick function slightly to make it continuous at the breakpoint:
hockeystick <- function(t, C0, k1, k2, tb)
{
  # both branches give C0 - k1*tb at t = tb, so the curve is continuous there
  Ct <- ifelse(t <= tb, C0 - k1*t, C0 - k1*t - k2*(t - tb))
}
Eyeballing:
par(las=1,bty="l") ## cosmetic
plot(log(Ct)~t,data=mydata)
curve(hockeystick(x,C0=0,k1=0.8,k2=-0.7, tb=3),add=TRUE)
I've made k2 negative here so the decreasing slope in the second stage is less than in the first stage.
start.hockey <- c(C0=0,k1=0.8,k2=-0.7, tb=3)
nls(log(Ct) ~ hockeystick(t, C0, k1, k2, tb),
    start = start.hockey, data = mydata)
Models with breakpoints are often non-differentiable in the parameters, but
I don't quite see how that's a problem here ...
This does work:
library(bbmle)
m1 <- mle2(log(Ct) ~ dnorm(hockeystick(t, C0, k1, k2, tb),
                           sd = exp(logsd)),
           start = c(as.list(start.hockey), list(logsd = 0)),
           data = mydata)
The parameters are reasonable (and different from the starting values):
coef(summary(m1))
## Estimate Std. Error z value Pr(z)
## C0 -0.4170749 0.2892128 -1.442104 1.492731e-01
## k1 0.6720120 0.2236111 3.005271 2.653439e-03
## k2 -0.5285974 0.2400605 -2.201934 2.766994e-02
## tb 2.0007688 0.1714292 11.671108 1.790751e-31
## logsd -0.2218745 0.1178580 -1.882558 5.976033e-02
Plot predictions:
pframe <- data.frame(t=seq(0,15,length=51))
pframe$pred <- predict(m1,newdata=pframe)
with(pframe,lines(t,pred,col=2))
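As a further workaround (my sketch, not part of the answer above): since the breakpoint is what troubles the gradient in nls(), you can minimize the residual sum of squares directly with the derivative-free Nelder-Mead optimizer in optim(), reusing the continuous hockeystick() and the start values from above.
# least-squares objective on the log scale, matching the nls() formula
sse <- function(p) {
  pred <- hockeystick(mydata$t, p["C0"], p["k1"], p["k2"], p["tb"])
  sum((log(mydata$Ct) - pred)^2)
}
fit <- optim(start.hockey, sse)  # Nelder-Mead by default
fit$par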

Graph mean and standard deviation

The source of this data is server performance metrics. The numbers I have are the mean (os_cpu) and standard deviation (os_cpu_sd). The mean clearly doesn't tell the whole story, so I want to add the standard deviation. I started down the path of geom_errorbar; however, I believe that is for standard error. What would be an accepted way to plot these metrics? Below is a reproducible example:
DF_CPU <- structure(list(end = structure(c(1387315140, 1387316340, 1387317540,
1387318740, 1387319940, 1387321140, 1387322340, 1387323540, 1387324740,
1387325940, 1387327140, 1387328340, 1387329540, 1387330740, 1387331940,
1387333140, 1387334340, 1387335540, 1387336740, 1387337940, 1387339140,
1387340340, 1387341540, 1387342740, 1387343940, 1387345140, 1387346340,
1387347540, 1387348740, 1387349940), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), os_cpu = c(14.8, 15.5, 17.4, 15.6, 14.9, 14.6,
15, 15.2, 14.6, 15.2, 15, 14.5, 14.8, 15, 14.6, 14.9, 14.9, 14.4,
14.8, 14.9, 14.5, 15, 14.6, 14.5, 15.3, 14.6, 14.6, 15.2, 14.5,
14.5), os_cpu_sd = c(1.3, 2.1, 3.2, 3.3, 0.9, 0.4, 1.4, 1.5,
0.4, 1.6, 1, 0.4, 1.4, 1.4, 0.4, 1.3, 0.9, 0.4, 1.4, 1.3, 0.4,
1.7, 0.4, 0.4, 1.7, 0.4, 0.4, 1.7, 0.5, 0.4)), .Names = c("end",
"os_cpu", "os_cpu_sd"), class = "data.frame", row.names = c(1L,
5L, 9L, 13L, 17L, 21L, 25L, 29L, 33L, 37L, 41L, 45L, 49L, 53L,
57L, 61L, 65L, 69L, 73L, 77L, 81L, 85L, 89L, 93L, 97L, 101L,
105L, 109L, 113L, 117L))
head(DF_CPU)
end os_cpu os_cpu_sd
1 2013-12-17 21:19:00 14.8 1.3
5 2013-12-17 21:39:00 15.5 2.1
9 2013-12-17 21:59:00 17.4 3.2
13 2013-12-17 22:19:00 15.6 3.3
17 2013-12-17 22:39:00 14.9 0.9
library(ggplot2)
ggplot(data=DF_CPU, aes(x=end, y=os_cpu)) +
  geom_line() +
  geom_errorbar(aes(ymin=os_cpu-os_cpu_sd, ymax=os_cpu+os_cpu_sd),
                alpha=0.2, color="red")
Per @ari-b-friedman's suggestion, here's what it looks like with geom_ribbon():
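(The original image is not reproduced here; below is my hedged reconstruction of that geom_ribbon() variant, assuming the same DF_CPU columns as above.)
ggplot(DF_CPU, aes(x = end, y = os_cpu)) +
  # shaded band spanning mean +/- one standard deviation
  geom_ribbon(aes(ymin = os_cpu - os_cpu_sd, ymax = os_cpu + os_cpu_sd),
              fill = "red", alpha = 0.2) +
  geom_line()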
Your question is largely about aesthetics, and so opinions will differ. Having said that, there are some guidelines:
Emphasize what is important.
Provide a frame of reference if at all possible.
Avoid misleading scales or graphics.
Avoid unnecessary graphics.
So this code:
ggplot(data=DF_CPU, aes(x=end, y=os_cpu)) +
  geom_point(size=3, shape=1) +
  geom_line(linetype=2, colour="grey") +
  geom_linerange(aes(ymin=os_cpu-1.96*os_cpu_sd, ymax=os_cpu+1.96*os_cpu_sd),
                 alpha=0.5, color="blue") +
  ylim(0, max(DF_CPU$os_cpu + 1.96*DF_CPU$os_cpu_sd)) +
  stat_smooth(formula=y~1, se=TRUE, method="lm", linetype=2, size=1) +
  theme_bw()
Produces this:
This graphic emphasizes that CPU utilization (??) over 20-minute intervals did not deviate significantly from the average for the 9-hour period monitored. The reference line is average utilization. The error bars were replaced with geom_linerange(...) because the horizontal bars in geom_errorbar(...) add nothing and are distracting. Also, your original plot makes it seem that the error is very large compared to actual utilization, which it isn't. I changed the range to +/- 1.96*sd because that more closely approximates a 95% confidence band. Finally, the x- and y-axis labels need to be replaced with something descriptive, but I don't have enough information to do that.
There's a designer's adage that "form follows function", and this should apply to graphics. What are you trying to do with your plots? What's the question you are trying to answer?
If it is "is cpu usage significantly decreasing with time?" then this plot will probably do and gives the answer "no". If it is "is the probability of exceeding 10s changing with time?" then you need to assume a model for your data (eg something as simple as Normal(os_cpu, os_cpu_sd)) and then plot exceedence (tail) probabilities.
Anyway, just plotting means and envelopes like you have done is always a fair start, and at least answers the questions "what does my data look like?" and "is anything obviously wrong?"
