ggplot for normal distribution - add data to graph - r

I'm trying to add some extra information to my plot to make it easier for users to read. My distribution graph comes from this code:
require(ggplot2)
my_data<-c(70, 71, 75, 78, 78, 79, 80, 81, 84, 85, 87, 87, 90, 91, 95, 95, 96, 96, 97, 98, 98, 100, 101, 102, 102, 102, 102, 104, 104, 104, 107, 107, 109, 110, 110, 110, 111, 112, 113, 113, 114, 115, 118, 118, 118, 120, 124, 131, 137, 137, 139, 145, 158, 160, 162, 165, 169, 177, 179, 180)
dist <- dnorm(my_data, mean = mean(my_data), sd = sd(my_data))
qplot(my_data, dist, geom = "line") +
  xlab("x values") +
  ylab("Density") +
  ggtitle("cool graph Distribution") +
  geom_line(color = "black")
and the result is:
What I'm aiming to do is add more information to the ggplot2 chart:
the mean
a line for a sample value: say I have a sample of 80, I'd like to draw a line straight up from that x value to where it intersects the curve.
a division of the graph into regions two (or maybe three) sigmas wide, each with a label (the example graph below shows four regions: unusually low price, great price, and so forth).
Thanks for any pointers!
desired result:

You can add various reference lines to the chart; with current ggplot2 it's just a matter of placing geom_vline() or geom_hline() at the values you want to highlight (the mean, a sample value, etc.). Note that the old stat = "hline" / stat = "vline" syntax was removed from ggplot2, so use the dedicated geoms instead:
qplot(my_data, dist, geom = "line") +
  xlab("x values") +
  ylab("Density") +
  ggtitle("cool graph Distribution") +
  geom_line(color = "black") +
  geom_vline(xintercept = mean(my_data), colour = "red") +
  geom_vline(xintercept = 80, colour = "blue", linetype = "dashed")
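To get closer to the desired result with the shaded sigma regions, one option is annotate() rectangles on either side of the mean. This is only a sketch of one way to do it: the band width (one sigma), the colours, and treating 80 as the sample value are my choices; only my_data and the labels come from the question.

```r
library(ggplot2)

my_data <- c(70, 71, 75, 78, 78, 79, 80, 81, 84, 85, 87, 87, 90, 91, 95, 95,
             96, 96, 97, 98, 98, 100, 101, 102, 102, 102, 102, 104, 104, 104,
             107, 107, 109, 110, 110, 110, 111, 112, 113, 113, 114, 115, 118,
             118, 118, 120, 124, 131, 137, 137, 139, 145, 158, 160, 162, 165,
             169, 177, 179, 180)

m <- mean(my_data)
s <- sd(my_data)
df <- data.frame(x = my_data, density = dnorm(my_data, mean = m, sd = s))

p <- ggplot(df, aes(x, density)) +
  # shade the tails beyond one sigma (use 2 * s or 3 * s for wider bands)
  annotate("rect", xmin = -Inf, xmax = m - s, ymin = -Inf, ymax = Inf,
           fill = "green", alpha = 0.1) +
  annotate("rect", xmin = m + s, xmax = Inf, ymin = -Inf, ymax = Inf,
           fill = "red", alpha = 0.1) +
  geom_line() +
  geom_vline(xintercept = m, colour = "blue") +                       # mean
  geom_vline(xintercept = 80, colour = "red", linetype = "dashed") +  # sample
  labs(x = "x values", y = "Density", title = "cool graph Distribution")
p
```

Text labels for each region ("unusually low price", "great price", and so on) can then be placed with annotate("text", ...) at the band midpoints.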

Related

time series equation in R

I have data that looks similar to the following example, and I'm looking for a way to fit an equation that I can then apply to other data with similar profiles that might be higher or lower.
structure(list(day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132,
133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145,
146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158,
159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184,
185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197,
198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210,
211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236,
237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249,
250, 251, 252, 253), Count = c(10, 50, 500, 425, 300, 400, 275,
98, 115, 79, 87, 114, 69, 105, 81, 82, 117, 87, 123, 81, 119,
97, 84, 124, 122, 53, 114, 95, 49, 95, 101, 114, 74, 120, 72,
61, 79, 59, 96, 95, 105, 53, 110, 69, 69, 79, 106, 52, 50, 98,
102, 107, 122, 108, 47, 68, 51, 114, 96, 102, 121, 113, 130,
134, 143, 144, 141, 139, 140, 142, 141, 125, 134, 130, 137, 139,
123, 138, 108, 133, 97, 122, 120, 110, 144, 121, 103, 127, 103,
100, 139, 138, 103, 105, 114, 142, 128, 141, 141, 122, 110, 125,
112, 98, 130, 116, 138, 120, 135, 143, 136, 145, 101, 120, 131,
119, 131, 116, 114, 143, 126, 102, 116, 106, 133, 110, 102, 141,
141, 132, 110, 95, 130, 133, 131, 128, 103, 111, 120, 140, 107,
114, 95, 113, 116, 131, 145, 144, 121, 111, 100, 145, 96, 130,
95, 119, 135, 127, 113, 105, 110, 102, 105, 116, 145, 115, 102,
120, 143, 140, 141, 132, 143, 136, 108, 106, 127, 112, 122, 118,
112, 96, 116, 141, 162, 168, 198, 156, 165, 180, 179, 166, 194,
194, 162, 199, 156, 193, 200, 160, 160, 187, 150, 185, 161, 183,
166, 167, 199, 159, 146, 195, 151, 161, 161, 162, 167, 193, 191,
181, 148, 200, 182, 164, 147, 182, 165, 165, 159, 163, 188, 154,
192, 157, 149, 163, 170, 151, 185, 168, 154, 164, 191, 169, 186,
157, 182, 195, 150, 145, 152, 188, 176)), row.names = c(NA, -253L
), class = c("tbl_df", "tbl", "data.frame"))
The red line is an example of what an equation might look like. Very rough drawing.
I think what you may be looking for is a generalized additive model (GAM), which is often used to model nonlinear data like time. Here I have saved your dput as data and fit it to a GAM below. First, we can load the mgcv package for the GAM fit.
#### Load Library ####
library(mgcv)
Then you fit the GAM. This can be a very complex topic, and I advise reading a lot on it, but essentially you fit the regression much as you would any regression in R. The only difference is the spline terms you add to the regression: the nonlinear functions that approximate the relationship between x and y. Here I have fit a cubic regression spline, using the s() function for the spline, "day" as the variable, and bs = "cr" for the cubic regression spline. I also use REML, recommended by many GAM experts, to automatically adjust the knots and smoothing parameters. This can be customized a lot, but for simplicity I leave it alone here.
#### Fit GAM ####
fit <- gam(
  Count ~ s(day, bs = "cr"),
  method = "REML",
  data = data
)
The summary of the fit can be viewed with:
#### Summary ####
summary(fit)
As seen below, the intercept is listed as in typical regression summaries. There is now an additional "Approximate significance of smooth terms" section, which lists some useful metrics for your smoothing term: edf (effective degrees of freedom) indicates how curvilinear the smooth is, while Ref.df and F are used for the significance test shown at the far right. In this case, the smoothing term is significant. There are also several model metrics at the bottom that are worth noting:
Family: gaussian
Link function: identity
Formula:
Count ~ s(day, bs = "cr")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 133.538 2.504 53.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(day) 8.037 8.751 17.93 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.382 Deviance explained = 40.2%
-REML = 1298.7 Scale est. = 1586.9 n = 253
Technically we can write an equation from this output, but the catch with GAMs is that each spline fit sets separate coefficients for each part of the nonlinear trend, so a single equation is not very useful for nonlinear data (some reasons are given here). For example, if I want all of the coefficients, I can run coef(fit) and get this fairly long list:
(Intercept) s(day).1 s(day).2 s(day).3 s(day).4 s(day).5
133.537549 -83.413590 -54.926693 -35.398280 -38.849985 -41.564495
s(day).6 s(day).7 s(day).8 s(day).9
-38.991790 9.101440 4.764924 24.764163
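Rather than reassembling those coefficients into an equation by hand, a more practical sketch (assuming the fit object from above is in your workspace; the example day values are arbitrary) is to evaluate the fitted smooth at new x values with predict():

```r
# Evaluate the fitted GAM at a few new day values
newdays <- data.frame(day = c(1, 50, 200))
predict(fit, newdata = newdays)

# With standard errors as well, if you want to build a confidence band
predict(fit, newdata = newdays, se.fit = TRUE)
```

This is also how you would apply the model to the "other data with similar profiles" mentioned in the question, as long as they share the same predictor.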
Plotting the fitted smooth can be done with the call below and gives a much better picture of the regression fit:
#### Plot Fit ####
plot(fit)
This shows the fitted spline with its standard error band, along with a rug of tick marks on the x axis showing the data points. The plotting can be customized a lot too, especially with the gratia package, but I leave it as is here. In any case, the interpretation from the plot is far clearer: counts initially drop sharply, rebound slightly, plateau for some time, then rise again before plateauing once more.
Hope that is helpful and I recommend reading a lot on this topic. I have included some links to some really useful primers on the subject below.
Citations
Simpson, 2018: GAMs article. This covers mostly fixed effect versions, which you are probably more likely to use.
Pedersen et al., 2019: GAMMs article. This covers some random effects parts too, which may be difficult to understand unless you know more about mixed models.
This book is also a canonical reference to GAMs that is a lot more comprehensive, but I find it is a difficult read and not the best source for beginners.
I am not sure if I got it right: are you looking for the code to make this kind of plot?
Install tidyverse (a collection of packages), then pipe your data into ggplot() with the %>% operator:
library(tidyverse)

# your data first, then the pipe operator '%>%':
data %>%
  ggplot(aes(x = day, y = Count)) +
  geom_line()

Changing color (or any other aspect) of the lines shown in monthplot() or ggmonthplot()

Does anyone know how to change the color (or width, or type) of the mean line in monthplot() or ggmonthplot()?
Here is a quick example using both functions
myobs <- c( 147, 123, 131, 172, 166, 172, 180, 135, 208, 133, 134, 156,
181, 178, 150, 186, 189, 165, 164, 146, 213, 144, 164, 129,
178, 163, 186, 180, 149, 190, 200, 158, 201, 112, 100, 129,
157, 101, 133, 100, 117, 136, 129, 101, 167, 124, 113, 137,
148, 132, 141, 160, 148, 157, 156, 158, 204, 143, 168)
tsmyobs <- ts(myobs,start=c(2017,7), frequency=12)
monthplot(tsmyobs)
and
require(ggplot2)
require(fpp)
ggmonthplot(tsmyobs)
With monthplot() I get thin normal black lines for the subseries lines and thin normal black lines for the mean lines.
With ggmonthplot() I get thin normal black lines for the subseries lines and thin normal blue lines for the mean lines.
Ultimately I would like some control over the color, weight, and type of both the subseries lines and the mean lines so I can contrast them visually a little better, but at this point I would be happy just changing the color of the mean line. I cannot seem to find any controls for line color, width, or type in either monthplot or ggmonthplot.
There seem to be very limited customization options for ggmonthplot, but it produces a ggplot object, so it is possible to modify it after creation.
For example, if you store your initial plot:
p <- ggmonthplot(tsmyobs)
p
You can change it to whatever size and color you want like this:
p$layers[[1]] <- geom_line(color = "red", size = 2)
p
You could even map seasons to the color aesthetic:
p$layers[[1]] <- geom_line(aes(color = factor(season)), size = 2)
p
You can even add scales and theme elements afterwards:
p +
scale_color_viridis_d() +
theme_bw(base_size = 16) +
theme(legend.position = "none")

Plot a plane in R: PCA

I'm working with a PCA problem where I have 3 variables and I reduce them to 2 by doing PCA. I've already plot all the points in 3D using scatter3D. My question is, how can I plot the plane determined by two vectors (the first two eigenvectors of the sampled covariance matrix) in R?
This is what I have so far
library(plot3D)
X <- matrix(c(55, 75, 110,
47, 69, 108,
42, 71, 110,
48, 74, 114,
47, 75, 114,
52, 73, 104,
49, 72, 106,
44, 67, 107,
52, 73, 108,
45, 73, 111,
50, 80, 117,
50, 71, 110,
48, 75, 114,
51, 73, 106,
44, 66, 102,
42, 71, 112,
50, 68, 107,
48, 70, 108,
51, 72, 108,
52, 73, 109,
49, 72, 112,
49, 73, 108,
46, 70, 105,
39, 66, 100,
50, 76, 108,
52, 71, 108,
56, 75, 108,
53, 70, 112,
53, 72, 110,
49, 74, 113,
51, 72, 109,
55, 74, 110,
56, 75, 110,
62, 79, 118,
58, 77, 115,
50, 71, 105,
52, 67, 104,
52, 73, 107,
56, 73, 106,
55, 78, 118,
53, 68, 103), ncol = 3,nrow = 41,byrow = TRUE)
S <- cov(X)
Gamma <- eigen(S)$vectors
scatter3D(X[,1], X[,2], X[,3], pch = 18, bty = "u", colkey = FALSE,
main ="bty= 'u'", col.panel ="gray", expand =0.4,
col.grid = "white",ticktype = "detailed",
phi = 25,theta = 45)
pc <- scale(X,center=TRUE,scale=FALSE) %*% Gamma[,c(1,2)]
Now I would like to plot the plane using scatter3D
Perhaps this will do, using the iris data. It uses scatter3d() from the car package, which can add a regression surface to a 3d plot:
library(car)
data(iris)
iris.pr <- prcomp(iris[, 1:3], scale.=TRUE)
# Draw 3d plot with surface and color points by species
scatter3d(PC3~PC1+PC2, iris.pr$x, point.col=c(rep(2, 50), rep(3, 50), rep(4, 50)))
This plots a regression surface predicting PC3 from PC1 and PC2. By definition the correlation between any two principal components is zero so the surface should be PC3=0 for any values of PC1 and PC2, but I don't see a way to produce exactly that surface. It is pretty close though.
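If you want exactly the plane spanned by the first two eigenvectors, one sketch (assuming X, S, and Gamma from the question's code, and that the scatter3D() call above has already drawn the points) is to note that this plane passes through the column means of X with the third eigenvector as its normal, then overlay it with plot3D's persp3D(..., add = TRUE):

```r
# Assumes X and S from the question; plot3D already loaded above
mu <- colMeans(X)
n  <- eigen(S)$vectors[, 3]   # normal of the plane spanned by PC1 and PC2

xs <- seq(min(X[, 1]), max(X[, 1]), length.out = 20)
ys <- seq(min(X[, 2]), max(X[, 2]), length.out = 20)

# Solve n . (p - mu) = 0 for z over the x/y grid:
# z = mu3 - (n1*(x - mu1) + n2*(y - mu2)) / n3
zmat <- outer(xs, ys, function(x, y)
  mu[3] - (n[1] * (x - mu[1]) + n[2] * (y - mu[2])) / n[3])

# Overlay the semi-transparent plane on the existing scatter3D plot
persp3D(x = xs, y = ys, z = zmat, add = TRUE, col = "grey", alpha = 0.3)
```

Every point of zmat satisfies the plane equation exactly, so this is the true PC1/PC2 plane rather than a regression approximation.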

Simulate data from a Gompertz curve in R

I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
I have the data, and I know that the growth follows a Gompertz curve with formula:
y = a*exp(-exp(-(x - x0)/b)), with a = 88.8, b = 11.7, and x0 = 15.1.
The data I have is
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165).
Any help would be appreciated thank you
(Will edit when more information is given.)
I am a little confused by your question. I have compiled what you have written into R below; please elaborate so that I can help you:
gompertz <- function(x, x0, a, b){
a*exp(-exp(-(x-x0)/b))
}
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165) # means of 30 samples of grape cluster weights?
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112) # ?
#??
gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)
gompertz(y, x0 = 15.1, a = 88.8, b = 11.7)
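If the goal is simply to simulate a new series of the same length, one common sketch is to evaluate the curve at the observed x values and add random noise. Here I take the stated Gompertz parameters at face value and use the residual spread of the observed data as the noise level; the Gaussian noise model is my assumption, not something given in the question:

```r
gompertz <- function(x, x0, a, b) a * exp(-exp(-(x - x0) / b))

x <- c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y <- c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165)

mu    <- gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)   # curve value at each x
sigma <- sd(y - mu)                                   # residual spread
set.seed(1)                                           # reproducible draws
y_sim <- mu + rnorm(length(x), mean = 0, sd = sigma)  # simulated series
```

For the Bayesian analysis mentioned in the question, this same likelihood (Gompertz mean plus Gaussian noise) is what you would encode in your model.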

Plot multiple series of data into a single bagplot with R

Let's consider the bagplot example included in the aplpack library in R. A bagplot is a bivariate generalisation of a boxplot and therefore gives insight into the distribution of data points along both axes.
Example of a bagplot:
Code for the example:
# example of Rousseeuw et al., see R-package rpart
cardata <- structure(as.integer( c(2560,2345,1845,2260,2440,
2285, 2275, 2350, 2295, 1900, 2390, 2075, 2330, 3320, 2885,
3310, 2695, 2170, 2710, 2775, 2840, 2485, 2670, 2640, 2655,
3065, 2750, 2920, 2780, 2745, 3110, 2920, 2645, 2575, 2935,
2920, 2985, 3265, 2880, 2975, 3450, 3145, 3190, 3610, 2885,
3480, 3200, 2765, 3220, 3480, 3325, 3855, 3850, 3195, 3735,
3665, 3735, 3415, 3185, 3690, 97, 114, 81, 91, 113, 97, 97,
98, 109, 73, 97, 89, 109, 305, 153, 302, 133, 97, 125, 146,
107, 109, 121, 151, 133, 181, 141, 132, 133, 122, 181, 146,
151, 116, 135, 122, 141, 163, 151, 153, 202, 180, 182, 232,
143, 180, 180, 151, 189, 180, 231, 305, 302, 151, 202, 182,
181, 143, 146, 146)), .Dim = as.integer(c(60, 2)),
.Dimnames = list(NULL, c("Weight", "Disp.")))
bagplot(cardata,factor=3,show.baghull=TRUE,
show.loophull=TRUE,precision=1,dkmethod=2)
title("car data Chambers/Hastie 1992")
# points of y=x*x
bagplot(x=1:30,y=(1:30)^2,verbose=FALSE,dkmethod=2)
The bagplot of aplpack seems to only support plotting a "bag" for a single data series. Even more interesting would be to plot two (or three) data series within a single bagplot, where visually comparing the "bags" of the data series gives insight in the differences in the data distributions of the data series. Does anyone know if (and if so, how) this can be done in R?
If we modify some of the aplpack::bagplot code we can make a new geom for ggplot2. Then we can compare groups within a dataset in the usual ggplot2 ways. Here's one example:
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width,
colour = Species, fill = Species)) +
geom_bag() +
theme_minimal()
and we can show the points with the bagplot:
ggplot(iris, aes(Sepal.Length, Sepal.Width,
colour = Species, fill = Species)) +
geom_bag() +
geom_point() +
theme_minimal()
Here's the code for the geom_bag and modified aplpack::bagplot function: https://gist.github.com/benmarwick/00772ccea2dd0b0f1745
