ggplot - altering the height of each overlapping variable on a density plot - r

I'm quite new to R and ggplot2, so apologies if this is an obvious question, but I've searched around and can't find anything about this exact issue.
I have a ggplot density plot for 6 variables on the same plot, overlapping. What I am trying to do is to change the maximum height of each variable to be a certain value without changing the distribution. e.g. :
variable_1 - 1, //on Y axis
variable_2 - 0.5 etc.
This way I can get an idea of the distribution (across the x axis) whilst also showing a second independent parameter through the y axis
Is this possible at all?

Yes, this is possible, although I wouldn't recommend it. You can divide each distribution by its maximum and then multiply by the target height.
# some example data:
x = seq(-5, 5, .1)
y1 = dnorm(x)
y2 = dnorm(x, .5, .2)
Y = cbind(y1, y2)
matplot(x, Y, type = 'l', bty = 'n', lty = 1, las = 1)
# now I want the red line (y2) to have a maximum of 1
# and the black line (y1) to have a maximum of .5
y1 = .5*y1 / max(y1)
y2 = 1*y2 / max(y2)
Y = cbind(y1, y2)
matplot(x, Y, type = 'l', bty = 'n', lty = 1, las = 1)
The important part here is that I used two different transformations for y1 and y2. As a consequence, the heights of the distributions in the second figure can no longer be compared with each other. You can avoid this by applying the same transformation to all distributions.
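The same rescaling can be applied to kernel density estimates of sampled data, which is closer to the density plot in the question. The following is only a sketch in base R: the sample data, the names variable_1 and variable_2, and the target heights are made up for illustration.

```r
set.seed(1)  # hypothetical example data, for reproducibility only
samples <- list(variable_1 = rnorm(200),
                variable_2 = rnorm(200, mean = 2, sd = 0.5))
targets <- c(variable_1 = 1, variable_2 = 0.5)  # assumed target peak heights

# estimate each density, then rescale it so that its peak equals the target
rescaled <- lapply(names(samples), function(nm) {
  d <- density(samples[[nm]])
  d$y <- targets[[nm]] * d$y / max(d$y)
  d
})
names(rescaled) <- names(samples)

# empty frame wide enough for all curves, then add one line per variable
xlim <- range(sapply(rescaled, function(d) range(d$x)))
plot(0, 0, type = "n", xlim = xlim, ylim = c(0, max(targets)),
     xlab = "x", ylab = "rescaled height", bty = "n", las = 1)
for (i in seq_along(rescaled)) lines(rescaled[[i]], col = i)
legend("topright", legend = names(rescaled),
       col = seq_along(rescaled), lty = 1, bty = "n")
```

After rescaling, the y axis no longer shows a density, so it should be labeled accordingly. The rescaled x/y pairs could presumably also be collected in a data frame and drawn with ggplot2's geom_line().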

Related

Plotting a Function With Noise

I always wondered how such pictures are made:
I am working with the R programming language. I would like to plot a parabola with "random noise" added to the parabola. I tried something like this:
x = 1:100
y = x^2
z = y + rnorm(1, 100,100)
plot(x,z)
But this is still producing a parabola without "noise".
Can someone please show me how I can add "noise" to a parabola (or any function) in R?
Thanks!
In this case you need to generate 100 random values; otherwise you add the same amount of noise to every point (and thus no noise at all): z = y + rnorm(100, 100,100)
x = 1:100
y = x^2
z = y + rnorm(length(y), 100,100)
plot(x,z)
In your code you add the same value to all your points, so it just shifts your curve up by that constant. Instead, you need to generate a vector of random noise the same length as your y variable. You probably also want to set mean = 0 for the rnorm() noise, so that it is truly random noise around the true value rather than systematically 100 units larger.
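The difference between one recycled draw and one draw per point can be checked numerically. This is a small sketch; the names z_shift and z_noise and the standard deviation of 500 are chosen only for illustration.

```r
set.seed(42)  # for reproducibility only
x <- 1:100
y <- x^2

z_shift <- y + rnorm(1, mean = 0, sd = 500)          # one draw, recycled for every point
z_noise <- y + rnorm(length(y), mean = 0, sd = 500)  # one independent draw per point

# spread of the residuals around the true curve
sd(z_shift - y)  # essentially zero: the curve is only shifted
sd(z_noise - y)  # clearly positive: real point-to-point noise
```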
To get something very similar to your example, you can overplot the second vector with noise using lines() and add a legend with the code below.
x = 1:100
y1 = x^2
y2 = y1 + rnorm(100, 0, 500)
plot(x, y1, type = "l", ylab = "y")
lines(x,y2,type = "l", col = "red")
legend(
x = "top",
legend = c("y1", "y2"),
col = c("black", "red"),
lwd = 1,
bty = "n",
horiz = TRUE
)
Created on 2022-11-08 with reprex v2.0.2

Plotting single points and their range

I am trying to plot some data points from a matrix together with their standard deviation, but I am having trouble plotting the latter.
My tools are:
a matrix with the data points to plot at a x coordinate within a properly xlim-defined x-axis;
a vector of as many y arbitrary coordinates for the plotting height, just not making them overlap;
a vector of lengths of the standard deviation lines, to be displayed horizontally around the data points.
Yeah, eventually it'll look like a flying saucer invasion.
I can easily plot the points at the given height, one by one - it is the way I want to do it.
Trouble comes in adding the standard deviation horizontal lines for each point.
Does anyone have an idea how to do it?
x<-matrix(c(1:4,NA,NA,10:16), nrow=4, ncol=4)
y<-seq(0.001,0.006, 0.001)
std.dev<-c(runif(7, 0.1, 0.5), NA, NA, runif(7, 0.1, 0.5))
plot(0,0, xlim=c(min = 0, max(x), na.rm=T)+0.001), ylim = c(0,0.016), type = "n", xlab = "My x", yaxt = "n", ylab ="")
points(x = x[1,2], y = y[1], pch = 21, bg = "red", col = "red")
It is surprising to find that base R does not provide built-in support for error bars. You may want to look at other packages for this.
With base R, the workaround is to use the arrows() function and set the arrowhead angle to 90 degrees.
Note: I had to change your data definition because it threw errors. Also have a look at the xlim part of your code.
I plot the error bars vertically; you can easily adapt this to horizontal bars. I did this for presentation reasons, to avoid overlapping error bars.
Using your full data will make it easier to deconflict the bars.
x<-matrix(c(1:7,NA,NA,10:16), nrow=4, ncol=4) # adapted to ensure same length
y<-seq(0.001,0.016, 0.001) # adapted to ensure same length
std.dev<-c(runif(7, 0.1, 0.5), NA, NA, runif(7, 0.1, 0.5))
plot(0, 0
, xlim = c(0, max(x, na.rm = TRUE)) # had to fix xlim definition
, ylim = c(-1, 1) # changed to show the given std.dev
, type = "n", xlab = "My x", yaxt = "n", ylab = "")
points(x = x, y = y, pch = 21, bg = "red", col = "red") # set x and y to show all
# --------------- add arrows with "flat head" --------------------------
arrows(x0 = x, x1 = x
, y0 = y - std.dev, y1 = y + std.dev # center deviation on data point
, code = 3, angle = 90 # set the angle for the head to emulate error bar
, length = 0.1)
This yields:
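For the horizontal bars the question asks for, the same arrows() trick can be applied along the x direction. The sketch below assumes the adapted data from this answer, flattened to a plain vector for simplicity; the ylim and arrowhead length are arbitrary choices.

```r
set.seed(1)  # for reproducibility only
x <- c(1:7, NA, NA, 10:16)          # same values as the adapted matrix, as a vector
y <- seq(0.001, 0.016, by = 0.001)  # arbitrary plotting heights, one per point
std.dev <- c(runif(7, 0.1, 0.5), NA, NA, runif(7, 0.1, 0.5))

plot(0, 0, type = "n",
     xlim = c(0, max(x + std.dev, na.rm = TRUE)),  # leave room for the right-hand bars
     ylim = c(0, 0.017),
     xlab = "My x", yaxt = "n", ylab = "")
# horizontal bars: vary x0/x1, keep y0 == y1
arrows(x0 = x - std.dev, x1 = x + std.dev, y0 = y, y1 = y,
       code = 3, angle = 90, length = 0.05)
points(x, y, pch = 21, bg = "red", col = "red")
```

Because every point sits at its own height, the horizontal bars do not overlap even when neighbouring x values are close.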

Align the primary and secondary y-axis on the common base, set points in the center of bars?

I am trying to display a bar chart overlaid with a line plot on a secondary y-axis. I was following the example here: http://robjhyndman.com/hyndsight/r-graph-with-two-y-axes/. I can display my data; however, the y1 and y2 axes do not start from a common base (a common 0): the y2 axis starts further up.
How can I align the y1 and y2 axes on a common base? Can I extend both axes to the same length? And how can I position the points in the middle of the bars?
My dummy data:
x <- 1:5
y1 <- c(10,53,430,80,214)
y2 <- c(0.2,1.2,3.3, 3.5, 4.2)
# create new window (dev.new() is portable; windows() only exists on Windows)
dev.new()
# set margins
par(mar=c(5,4,4,5)+.1)
# create bar plot with primary axis (y1)
barplot(y1, ylim= c(0,500))
mtext("y1",side=2,line=3)
# add plot with secondary (y2) axis
par(new=TRUE)
plot(x, y2, type="b",col="red",xaxt="n",yaxt="n",xlab="",ylab="", ylim= c(0,10), lwd = 2, lty = 2, pch = 18)
axis(4)
mtext("y2",side=4,line=3)
When you check the documentation for par() you will find the options xaxs and yaxs, which control the interval calculation for both axes. Calling par(yaxs = 'i') prior to your plot() command, or passing yaxs = 'i' directly as an argument to plot(), changes the interval calculation in the following way:
Style "i" (internal) just finds an axis with pretty labels that fits
within the original data range.
Additional information for the asker, in response to their comment:
To center the points of the line on the bars, use lines() instead of a second plot() and reuse the x positions (the bar midpoints) returned by barplot():
par(mar=c(5,4,4,5)+.1)
# create bar plot with primary axis (y1)
par(xpd = F)
ps <- barplot(y1, ylim= c(0,500), xpd = F)
axis(4, at = 0:5 * 100, labels = 0:5 * 2) # transform values
mtext('y1',side = 2, line = 3)
lines(x = ps, y = y2 * 50, type = 'b', col = 'red') # transform values
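The factor 50 and the hand-written axis labels above can also be derived from the two axis maxima. The following is a sketch of that idea using the dummy data from the question; the names y1_max, y2_max, k, mids and ticks2 are introduced here only for illustration.

```r
x  <- 1:5
y1 <- c(10, 53, 430, 80, 214)
y2 <- c(0.2, 1.2, 3.3, 3.5, 4.2)

y1_max <- 500          # top of the primary axis
y2_max <- 10           # top of the secondary axis
k <- y1_max / y2_max   # factor that maps y2 units onto the y1 scale

par(mar = c(5, 4, 4, 5) + .1)
mids <- barplot(y1, ylim = c(0, y1_max))  # bar midpoints on the x axis
lines(x = mids, y = y2 * k, type = "b", col = "red", lwd = 2, lty = 2, pch = 18)

ticks2 <- pretty(c(0, y2_max))            # tick values in y2 units
axis(4, at = ticks2 * k, labels = ticks2) # drawn at the matching y1 positions
mtext("y1", side = 2, line = 3)
mtext("y2", side = 4, line = 3)
```

Since both series are drawn in one coordinate system, the two axes share the same zero by construction.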

How to plot a normal distribution by labeling specific parts of the x-axis?

I am using the following code to create a standard normal distribution in R:
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=2)
I need the x-axis to be labeled at the mean and at points three standard deviations above and below the mean. How can I add these labels?
The easiest (but not general) way is to restrict the limits of the x axis. The +/- 1:3 sigma will be labeled as such, and the mean will be labeled as 0 - indicating 0 deviations from the mean.
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
Another option is to use more specific labels:
plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
Using the code in this answer, you could skip creating x and just use curve() on the dnorm function:
curve(dnorm, -3.5, 3.5, lwd=2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
But this doesn't use the given code anymore.
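For a normal distribution that is not standard, the tick positions can be computed from the mean and standard deviation instead of being fixed at -3:3. This is a minimal sketch; the values mu = 10 and s = 2 are arbitrary examples.

```r
mu <- 10  # example mean (assumption)
s  <- 2   # example standard deviation (assumption)

x <- seq(mu - 4 * s, mu + 4 * s, length.out = 200)
y <- dnorm(x, mean = mu, sd = s)

ticks <- mu + (-3:3) * s  # mean and +/- 1 to 3 standard deviations
plot(x, y, type = "l", lwd = 2, xaxt = "n", xlab = "", ylab = "density")
axis(1, at = ticks,
     labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
```

Passing labels = ticks instead would print the actual x values at those positions.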
If you prefer the hard way, without R's built-in density function, or you want to do this outside R, you can use the formula of the normal density directly.
x<-seq(-4,4,length=200)
s = 1
mu = 0
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, col = "blue", xlim = c(-3.5,3.5))
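As a quick sanity check, the hand-written formula can be compared against dnorm(). The helper name normal_pdf is made up for this sketch.

```r
# hand-written normal density, same formula as above
normal_pdf <- function(x, mu = 0, s = 1) {
  (1 / (s * sqrt(2 * pi))) * exp(-((x - mu)^2) / (2 * s^2))
}

x <- seq(-4, 4, length.out = 200)

# both comparisons should report TRUE (equal up to floating-point tolerance)
isTRUE(all.equal(normal_pdf(x), dnorm(x)))
isTRUE(all.equal(normal_pdf(x, mu = 2, s = 0.5), dnorm(x, mean = 2, sd = 0.5)))
```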
An extremely inefficient and unusual, but beautiful, solution based on the ideas of Monte Carlo simulation is this:
simulate many draws (or samples) from a given distribution (say the normal) using rnorm. The rnorm function takes the arguments (A,B,C) and returns a vector of A samples from a normal distribution centered at B, with standard deviation C.
plot the density of these draws using density.
Thus, to take a sample of size 50,000 from a standard normal (i.e., a normal with mean 0 and standard deviation 1) and plot its density, we do the following:
x = rnorm(50000,0,1)
plot(density(x))
As the number of draws goes to infinity, the estimated density converges to the normal density. To illustrate this, see the image below, which shows, from left to right and top to bottom, 5,000, 50,000, 500,000, and 5 million samples.
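The convergence can also be inspected numerically by comparing the estimated density with dnorm() on a fixed grid. This is only a sketch: the seed, the grid from -3 to 3 and the three sample sizes are arbitrary choices, and the exact error values depend on them.

```r
set.seed(123)  # for reproducibility only
sizes <- c(5e3, 5e4, 5e5)

# largest absolute gap between the kernel estimate and the true density
max_err <- sapply(sizes, function(n) {
  d <- density(rnorm(n, 0, 1), from = -3, to = 3, n = 201)
  max(abs(d$y - dnorm(d$x)))
})

data.frame(n = sizes, max_abs_error = max_err)
```

The error typically shrinks as the sample size grows, which is what the sequence of images illustrates.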
In the general case, for example Normal(2, 1):
f <- function(x) dnorm(x, 2, 1)
plot(f, -1, 5)
This is very general: f can be defined freely, with any given parameters, for example:
f <- function(x) dbeta(x, 0.1, 0.1)
plot(f, 0, 1)
I particularly like lattice for this purpose. It makes it easy to add graphical information such as specific areas under a curve, which you usually need when dealing with probability problems such as finding P(a < X < b).
Please have a look:
library(lattice)
e4a <- seq(-4, 4, length = 10000) # Data to set up our normal
e4b <- dnorm(e4a, 0, 1)
xyplot(e4b ~ e4a, # Lattice xyplot
type = "l",
main = "Plot 2",
panel = function(x,y, ...){
panel.xyplot(x,y, ...)
panel.abline( v = c(0, 1, 1.5), lty = 2) #set z and lines
xx <- c(1, x[x>=1 & x<=1.5], 1.5) #Color area
yy <- c(0, y[x>=1 & x<=1.5], 0)
panel.polygon(xx,yy, ..., col='red')
})
In this example I make the area between z = 1 and z = 1.5 stand out. You can easily change these parameters according to your problem.
Axis labels are automatic.
This is how to write it as a function:
normalCriticalTest <- function(mu, s) {
x <- seq(mu - 4*s, mu + 4*s, length=200) # x extends 4 standard deviations to either side of the mean
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2)) # y follows the formula of the normal distribution
plot(x,y, type="l", lwd=2, xlim = mu + c(-3.5,3.5)*s) # draw the graph
abline(v = mu + c(-1.96, 1.96)*s, col="red") # vertical lines leaving 2.5% of the area in each tail
}
normalCriticalTest(0, 1) # draw a standard normal distribution with vertical lines.
Final result:

Plotting multiple curves same graph and same scale

This is a follow-up of this question.
I wanted to plot multiple curves on the same graph but so that my new curves respect the same y-axis scale generated by the first curve.
Notice the following example:
y1 <- c(100, 200, 300, 400, 500)
y2 <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
# first plot
plot(x, y1)
# second plot
par(new = TRUE)
plot(x, y2, axes = FALSE, xlab = "", ylab = "")
That actually plots both sets of values on the same coordinates of the graph (because I'm hiding the new y-axis that would be created with the second plot).
My question then is how to maintain the same y-axis scale when plotting the second graph.
(The typical method would be to call plot just once to set up the limits, possibly covering the range of all series combined, and then to use points and lines to add the separate series.) To use plot multiple times with par(new=TRUE), you need to make sure that your first plot has a ylim wide enough for all the series (and in other situations you may need the same strategy for xlim):
# first plot
plot(x, y1, ylim=range(c(y1,y2)))
# second plot EDIT: needs to have same ylim
par(new = TRUE)
plot(x, y2, ylim=range(c(y1,y2)), axes = FALSE, xlab = "", ylab = "")
This next code does the task more compactly. By default you get numbers as plotting symbols, but the second call gives you typical R points:
matplot(x, cbind(y1,y2))
matplot(x, cbind(y1,y2), pch=1)
points or lines come in handy if
y2 is generated later, or
the new data does not have the same x but still should go into the same coordinate system.
As your ys share the same x, you can also use matplot:
matplot (x, cbind (y1, y2), pch = 19)
(without pch, matplot will plot the column numbers of the y matrix instead of dots).
You aren't being very clear about what you want here, since I think @DWin's answer is technically correct, given your example code. I think what you really want is this:
y1 <- c(100, 200, 300, 400, 500)
y2 <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
# first plot
plot(x, y1,ylim = range(c(y1,y2)))
# Add points
points(x, y2)
DWin's solution was operating under the implicit assumption (based on your example code) that you wanted to plot the second set of points overlaid on the original scale. That's why his image looks like the points are plotted at 1, 101, etc. Calling plot a second time isn't what you want; you want to add to the plot using points. The above code on my machine produces this:
But DWin's main point about using ylim is correct.
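The same idea extends to any number of series: compute ylim over all of them, set up the frame once, and add each series afterwards. In this sketch the list name series and the third series y3 are hypothetical additions for illustration.

```r
x <- 1:5
series <- list(
  y1 = c(100, 200, 300, 400, 500),
  y2 = c(1, 2, 3, 4, 5),
  y3 = c(250, 240, 230, 220, 210)  # hypothetical extra series
)

ylim <- range(unlist(series))  # one scale that fits every series

# empty frame with the common scale, then one line per series
plot(x, series[[1]], type = "n", ylim = ylim, xlab = "x", ylab = "y")
for (i in seq_along(series)) {
  lines(x, series[[i]], type = "b", col = i)
}
legend("topleft", legend = names(series),
       col = seq_along(series), lty = 1, pch = 1, bty = "n")
```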
My solution is to use ggplot2. It takes care of these types of things automatically. The biggest thing is to arrange the data appropriately.
y1 <- c(100, 200, 300, 400, 500)
y2 <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
df <- data.frame(x=rep(x,2), y=c(y1, y2), class=c(rep("y1", 5), rep("y2", 5)))
Then use ggplot2 to plot it
library(ggplot2)
ggplot(df, aes(x=x, y=y, color=class)) + geom_point()
This is saying plot the data in df, and separate the points by class.
The plot generated is
I'm not sure what you want, but I'll use lattice.
library(lattice)
x = rep(x,2)
y = c(y1,y2)
fac.data = as.factor(rep(1:2,each=5))
df = data.frame(x=x,y=y,z=fac.data)
# this creates a data frame with a factor variable, z, that tells me which data I have (y1 or y2)
Then, just plot
xyplot(y ~x|z, df)
# or maybe
xyplot(x ~y|z, df)
