I have data that looks similar to the example below, and I'm looking for a way to fit an equation that I can use on other data with a similar profile but that might sit higher or lower.
structure(list(day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132,
133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145,
146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158,
159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184,
185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197,
198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210,
211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236,
237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249,
250, 251, 252, 253), Count = c(10, 50, 500, 425, 300, 400, 275,
98, 115, 79, 87, 114, 69, 105, 81, 82, 117, 87, 123, 81, 119,
97, 84, 124, 122, 53, 114, 95, 49, 95, 101, 114, 74, 120, 72,
61, 79, 59, 96, 95, 105, 53, 110, 69, 69, 79, 106, 52, 50, 98,
102, 107, 122, 108, 47, 68, 51, 114, 96, 102, 121, 113, 130,
134, 143, 144, 141, 139, 140, 142, 141, 125, 134, 130, 137, 139,
123, 138, 108, 133, 97, 122, 120, 110, 144, 121, 103, 127, 103,
100, 139, 138, 103, 105, 114, 142, 128, 141, 141, 122, 110, 125,
112, 98, 130, 116, 138, 120, 135, 143, 136, 145, 101, 120, 131,
119, 131, 116, 114, 143, 126, 102, 116, 106, 133, 110, 102, 141,
141, 132, 110, 95, 130, 133, 131, 128, 103, 111, 120, 140, 107,
114, 95, 113, 116, 131, 145, 144, 121, 111, 100, 145, 96, 130,
95, 119, 135, 127, 113, 105, 110, 102, 105, 116, 145, 115, 102,
120, 143, 140, 141, 132, 143, 136, 108, 106, 127, 112, 122, 118,
112, 96, 116, 141, 162, 168, 198, 156, 165, 180, 179, 166, 194,
194, 162, 199, 156, 193, 200, 160, 160, 187, 150, 185, 161, 183,
166, 167, 199, 159, 146, 195, 151, 161, 161, 162, 167, 193, 191,
181, 148, 200, 182, 164, 147, 182, 165, 165, 159, 163, 188, 154,
192, 157, 149, 163, 170, 151, 185, 168, 154, 164, 191, 169, 186,
157, 182, 195, 150, 145, 152, 188, 176)), row.names = c(NA, -253L
), class = c("tbl_df", "tbl", "data.frame"))
The red line is an example of what an equation might look like. Very rough drawing.
I think what you may be looking for is a generalized additive model (GAM), which is often used to model nonlinear trends such as change over time. Here I have saved your dput output as data and fit a GAM to it below. First, load the mgcv package for the GAM fit.
#### Load Library ####
library(mgcv)
Then you fit the GAM. This can be a complex topic, and I advise reading up on it, but essentially you fit the regression much as you would any other regression in R. The only difference is the spline terms you add to the formula: nonlinear basis functions that approximate the relationship between x and y. Here I fit a cubic regression spline, using the s() function for the smooth, "day" as the variable, and bs = "cr" for the cubic regression spline basis. I also use method = "REML", recommended by many GAM experts, to select the smoothing parameter automatically. This can be customized a lot, but for simplicity I leave the defaults alone here.
#### Fit GAM ####
fit <- gam(
  Count ~ s(day, bs = "cr"),
  method = "REML",
  data = data
)
The summary of the results can be viewed with:
#### Summary ####
summary(fit)
As seen below. The intercept is listed as in a typical regression summary. You also get an additional "Approximate significance of smooth terms" section, which lists some useful metrics for your smooth term: edf (effective degrees of freedom) indicates how wiggly the smooth is, while Ref.df and F feed the approximate significance test shown at the far right. In this case, the smooth term is significant. The model-fit metrics at the bottom are also worth inspecting:
Family: gaussian
Link function: identity
Formula:
Count ~ s(day, bs = "cr")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 133.538 2.504 53.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(day) 8.037 8.751 17.93 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.382 Deviance explained = 40.2%
-REML = 1298.7 Scale est. = 1586.9 n = 253
Technically we can write an equation based off this fit, but the difference with GAMs is that the spline assigns a separate coefficient to each basis function along the nonlinear trend, so the individual coefficients are not very useful on their own (some reasons are given here). For example, if I want all of the coefficients, I can run coef(fit) and get this fairly long list:
(Intercept) s(day).1 s(day).2 s(day).3 s(day).4 s(day).5
133.537549 -83.413590 -54.926693 -35.398280 -38.849985 -41.564495
s(day).6 s(day).7 s(day).8 s(day).9
-38.991790 9.101440 4.764924 24.764163
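Rather than reconstructing an equation from these coefficients, the usual workflow is to carry the fitted model object over to new data with predict(). A minimal sketch below; it uses a small simulated stand-in for your data so it runs on its own, but with your data you would simply reuse the fit object from above:

```r
library(mgcv)

# Simulated stand-in for the `data` tibble in the question (an assumption
# made so this snippet is self-contained; use your own `fit` in practice)
set.seed(1)
data <- data.frame(day = 1:253)
data$Count <- 100 + 30 * sin(data$day / 40) + rnorm(253, sd = 15)

fit <- gam(Count ~ s(day, bs = "cr"), method = "REML", data = data)

# Predict the smooth at new day values, with standard errors if wanted
newdat <- data.frame(day = c(5, 100, 250))
predict(fit, newdata = newdat)
predict(fit, newdata = newdat, se.fit = TRUE)
```

Note that predict() on a new dataset with a different overall level will not shift the curve up or down for you; for data that sit higher or lower you would typically refit the model to that data.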
Plotting the fit can be done with the function below and gives a much better picture of the estimated relationship:
#### Plot Fit ####
plot(fit)
This shows the fitted spline with its standard error band, along with a rug of tick marks on the x axis showing the data points. The plotting can be customized a great deal too, especially with the gratia package, but I leave it as is here. In any case, the interpretation from the plot is far clearer: counts drop sharply at first, rebound slightly, plateau for some time, then rise again before plateauing once more.
Hope that is helpful. I recommend reading more on this topic; I have included links to some really useful primers on the subject below.
Citations
Simpson, 2018: GAMs article. This covers mostly fixed effect versions, which you are probably more likely to use.
Pedersen et al., 2019: GAMMs article. This covers some random effects parts too, which may be difficult to understand unless you know more about mixed models.
This book is also a canonical reference for GAMs and is far more comprehensive, but I find it a difficult read and not the best source for beginners.
I am not sure if I got it right: are you looking for the code to make this kind of plot?
Install tidyverse (a collection of packages) and then add this piece of code:
library(tidyverse)
# your data first, then the pipe operator '%>%':
data %>%
  ggplot(aes(x = day, y = Count)) +
  geom_line()
I'm working on a PCA problem where I have 3 variables and reduce them to 2 via PCA. I've already plotted all the points in 3D using scatter3D. My question is: how can I plot the plane spanned by two vectors (the first two eigenvectors of the sample covariance matrix) in R?
This is what I have so far
library(plot3D)
X <- matrix(c(55, 75, 110,
47, 69, 108,
42, 71, 110,
48, 74, 114,
47, 75, 114,
52, 73, 104,
49, 72, 106,
44, 67, 107,
52, 73, 108,
45, 73, 111,
50, 80, 117,
50, 71, 110,
48, 75, 114,
51, 73, 106,
44, 66, 102,
42, 71, 112,
50, 68, 107,
48, 70, 108,
51, 72, 108,
52, 73, 109,
49, 72, 112,
49, 73, 108,
46, 70, 105,
39, 66, 100,
50, 76, 108,
52, 71, 108,
56, 75, 108,
53, 70, 112,
53, 72, 110,
49, 74, 113,
51, 72, 109,
55, 74, 110,
56, 75, 110,
62, 79, 118,
58, 77, 115,
50, 71, 105,
52, 67, 104,
52, 73, 107,
56, 73, 106,
55, 78, 118,
53, 68, 103), ncol = 3,nrow = 41,byrow = TRUE)
S <- cov(X)
Gamma <- eigen(S)$vectors
scatter3D(X[,1], X[,2], X[,3], pch = 18, bty = "u", colkey = FALSE,
          main = "bty = 'u'", col.panel = "gray", expand = 0.4,
          col.grid = "white", ticktype = "detailed",
          phi = 25, theta = 45)
pc <- scale(X,center=TRUE,scale=FALSE) %*% Gamma[,c(1,2)]
Now I would like to plot the plane using scatter3D
Perhaps this will do, using the iris data. It uses scatter3d from the car package, which can add a regression surface to a 3D plot:
library(car)
data(iris)
iris.pr <- prcomp(iris[, 1:3], scale.=TRUE)
# Draw 3d plot with surface and color points by species
scatter3d(PC3~PC1+PC2, iris.pr$x, point.col=c(rep(2, 50), rep(3, 50), rep(4, 50)))
This plots a regression surface predicting PC3 from PC1 and PC2. By construction the correlation between any two principal components is zero, so the surface should be PC3 = 0 for all values of PC1 and PC2, but I don't see a way to produce exactly that surface. It is pretty close, though.
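Alternatively, the plane can be constructed directly rather than estimated by regression: parametrize it as mu + s * Gamma[,1] + t * Gamma[,2] around the column means and draw it with plot3D's mesh() and surf3D(). A sketch, using a small simulated stand-in for X so it runs on its own; with the question's data you would keep the X, Gamma, and scatter3D call already defined:

```r
library(plot3D)

# Simulated stand-in for the question's 41 x 3 data matrix X
set.seed(1)
X <- matrix(rnorm(123, mean = c(50, 72, 109), sd = 4),
            ncol = 3, byrow = TRUE)
Gamma <- eigen(cov(X))$vectors
mu <- colMeans(X)

# Parametrize the plane through the mean spanned by the first two
# eigenvectors: point(s, t) = mu + s * Gamma[,1] + t * Gamma[,2]
M  <- mesh(seq(-10, 10, length.out = 10), seq(-5, 5, length.out = 10))
px <- mu[1] + M$x * Gamma[1, 1] + M$y * Gamma[1, 2]
py <- mu[2] + M$x * Gamma[2, 1] + M$y * Gamma[2, 2]
pz <- mu[3] + M$x * Gamma[3, 1] + M$y * Gamma[3, 2]

scatter3D(X[, 1], X[, 2], X[, 3], pch = 18, ticktype = "detailed",
          phi = 25, theta = 45)
surf3D(px, py, pz, add = TRUE, col = "lightblue", alpha = 0.3,
       colkey = FALSE)
```

Every point of this surface lies exactly on the PC1/PC2 plane (it is orthogonal to the third eigenvector by construction), so unlike the regression surface there is no approximation involved.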
Let's consider the bagplot example included in the aplpack package in R. A bagplot is a bivariate generalisation of the boxplot and therefore gives insight into the distribution of data points along both axes.
Example of a bagplot:
Code for the example:
# example of Rousseeuw et al., see R-package rpart
cardata <- structure(as.integer( c(2560,2345,1845,2260,2440,
2285, 2275, 2350, 2295, 1900, 2390, 2075, 2330, 3320, 2885,
3310, 2695, 2170, 2710, 2775, 2840, 2485, 2670, 2640, 2655,
3065, 2750, 2920, 2780, 2745, 3110, 2920, 2645, 2575, 2935,
2920, 2985, 3265, 2880, 2975, 3450, 3145, 3190, 3610, 2885,
3480, 3200, 2765, 3220, 3480, 3325, 3855, 3850, 3195, 3735,
3665, 3735, 3415, 3185, 3690, 97, 114, 81, 91, 113, 97, 97,
98, 109, 73, 97, 89, 109, 305, 153, 302, 133, 97, 125, 146,
107, 109, 121, 151, 133, 181, 141, 132, 133, 122, 181, 146,
151, 116, 135, 122, 141, 163, 151, 153, 202, 180, 182, 232,
143, 180, 180, 151, 189, 180, 231, 305, 302, 151, 202, 182,
181, 143, 146, 146)), .Dim = as.integer(c(60, 2)),
.Dimnames = list(NULL, c("Weight", "Disp.")))
bagplot(cardata,factor=3,show.baghull=TRUE,
show.loophull=TRUE,precision=1,dkmethod=2)
title("car data Chambers/Hastie 1992")
# points of y=x*x
bagplot(x=1:30,y=(1:30)^2,verbose=FALSE,dkmethod=2)
The bagplot function in aplpack seems to support plotting a "bag" for only a single data series. Even more interesting would be to plot two (or three) data series within a single bagplot, where visually comparing the "bags" gives insight into the differences in the distributions of the series. Does anyone know if (and if so, how) this can be done in R?
If we modify some of the aplpack::bagplot code we can make a new geom for ggplot2. Then we can compare groups within a dataset in the usual ggplot2 ways. Here's one example:
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width,
colour = Species, fill = Species)) +
geom_bag() +
theme_minimal()
and we can show the points with the bagplot:
ggplot(iris, aes(Sepal.Length, Sepal.Width,
colour = Species, fill = Species)) +
geom_bag() +
geom_point() +
theme_minimal()
Here's the code for the geom_bag and modified aplpack::bagplot function: https://gist.github.com/benmarwick/00772ccea2dd0b0f1745
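If you prefer to stay in base graphics, a rougher alternative is aplpack's compute.bagplot(), which computes the bagplot geometry without drawing it; as I understand it, the bag hull is returned in its hull.bag component. A sketch overlaying the bags of two iris species:

```r
library(aplpack)

# Split two groups out of iris (hypothetical example data choice)
setosa     <- iris[iris$Species == "setosa",     c("Sepal.Length", "Sepal.Width")]
versicolor <- iris[iris$Species == "versicolor", c("Sepal.Length", "Sepal.Width")]

# Compute bagplot geometry for each group without plotting
b1 <- compute.bagplot(setosa)
b2 <- compute.bagplot(versicolor)

# Draw the points, then overlay each group's bag hull as a polygon
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19, cex = 0.5,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
polygon(b1$hull.bag, border = "red",  col = adjustcolor("red",  0.2))
polygon(b2$hull.bag, border = "blue", col = adjustcolor("blue", 0.2))
```

This only overlays the bags (not the loops or fences), but it gives the side-by-side comparison the question asks about without leaving base graphics.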