density functions of multiple columns in a dataframe - ggplot - r

I need some help to produce a graph similar to the one posted here: Density plot for numerous variables using ggplot in R
I tried the code mentioned in that post, but the result doesn't look good.
My data frame looks like this:
head(df)
a b c d e f g
1 0.9999994 0.9999994 0.7924445 0.9998647 0.7300587 0.9249790 0.9816021
2 0.9999885 0.9999885 0.6782044 0.9983770 0.6119326 0.9434158 0.9583668
3 1.0000000 1.0000000 0.8709003 0.9999908 0.8181097 0.8939165 0.9942465
4 1.0000000 1.0000000 0.8587627 0.9999847 0.8035536 0.9034016 0.9998198
5 0.9999996 0.9999996 0.8059187 0.9999075 0.7480368 0.9043720 0.9290576
6 0.9999999 0.9999999 0.8532174 0.9999810 0.7971970 0.9059244 0.9983568
dat <- stack(df)
ggplot(dat, aes(x=values, fill=ind)) + geom_density(alpha=0.5)
The values range from 0.6 to 1.
I've also tried the pivot_longer approach, but it doesn't look great either.
Could anyone help, or provide me with suggestions or alternatives?
Thanks

If you look at your y axis, you will notice it has very high values. The reason is that the density for column d is extremely high, since its values are all concentrated in a tiny spot. A grouped density plot calculates the density for each group separately, and the smoothing kernel is scaled according to the range of the data. Since the density of column d has to fit in a range of about 0.001 on the x axis but still have an area under its curve of 1, that curve is going to be a very tall, sharp spike. Its density therefore "drowns out" the densities of all the other groups. If you use coord_cartesian to set the y range, you can see all the other densities much more clearly. Of course, this cuts off the top of the d density, since it is three orders of magnitude higher, but this seems like a reasonable compromise.
ggplot(dat, aes(x = values, fill = ind)) +
  geom_density(alpha = 0.5, position = "identity") +
  coord_cartesian(ylim = c(0, 30))
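If cutting off the top of the d curve is undesirable, another option (my suggestion, not part of the original answer) is to give each column its own panel and y scale with facet_wrap. A minimal sketch, assuming df is the data frame from the question:
library(ggplot2)

# One panel per column, each with its own y scale, so the tall spike
# for column d no longer hides the other densities.
dat <- stack(df)
ggplot(dat, aes(x = values, fill = ind)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ ind, scales = "free_y") +
  theme(legend.position = "none")
With free y scales each curve is drawn at its natural height, at the cost of the panels no longer being directly comparable.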

Related

Barplot with continuous x axis using base R graphics

I am looking to scale the x axis on my barplot to time, so as to accurately represent when measurements were taken.
I have these data frames:
> Botcv
Date Average SE
1 2014-09-01 4.0 1.711307
2 2014-10-02 5.5 1.500000
> Botc1
Date Average SE
1 2014-10-15 2.125 0.7180703
2 2014-11-12 1.000 0.4629100
3 2014-12-11 0.500 0.2672612
> Botc2
Date Average SE
1 2014-10-15 3.375 1.3354708
2 2014-11-12 1.750 0.4531635
3 2014-12-11 0.625 0.1829813
I use this code to produce a grouped barplot:
covaverage <- c(Botcv$Average,NA,NA,NA)
c1average <- c(NA,NA, Botc1$Average)
c2average <- c(NA,NA, Botc2$Average)
date <- c(Botcv$Date, Botc1$Date)
averagematrix <- matrix(c(covaverage,c1average, c2average), nrow=3, ncol=5, byrow=TRUE)
barplot(averagematrix,date, xlab="Date", ylab="Average", axis.lty=1, space=NULL,width=3,beside=T, ylim=c(0.00,6.00))
R plots the bars equal distances apart by default, and I have been trying to find a workaround for this. I have seen several other solutions that use ggplot2, but I am producing plots for my master's thesis and would like to keep the appearance of my barplots in line with the other graphs I have created using base R graphics. I also want to add error bars to the plot. If anyone could provide a solution I would be very grateful! Thanks!
Perhaps you can use this as a start. It is probably easier to use boxplots, as they can be put at a given x position by using the at argument. For base barplots this cannot be done, but you can use rect() instead to replicate the barplot look. Error bars can be added using arrows() or segments().
bar_w = 1             # width of bars
offset = c(-1, 1)     # offset to avoid overlapping
cols = grey.colors(2) # colors for the two types

# combine into a single data frame
d = data.frame(rbind(Botc1, Botc2), type = c(1, 1, 1, 2, 2, 2))

# set up an empty plot with sensible x and y limits
plot(as.Date(d$Date), d$Average, type = 'n', ylim = c(0, 4))

# draw the data of data frames 1 and 2
for (i in unique(d$type)) {
  dd = d[d$type == i, ]
  x = as.Date(dd$Date)
  y = dd$Average
  # rectangles mimicking bars
  rect(xleft = x - bar_w + offset[i], ybottom = 0,
       xright = x + bar_w + offset[i], ytop = y, col = cols[i])
  # error bars
  arrows(x0 = x + offset[i], y0 = y - 0.5 * dd$SE,
         x1 = x + offset[i], y1 = y + 0.5 * dd$SE,
         col = 1, angle = 90, code = 3, length = 0.1)
}
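To label the two series you can add a legend after the loop; a small sketch (the labels are assumptions, adjust them to your data):
# hypothetical series labels, matching the grey colours used above
legend("topright", legend = c("Botc1", "Botc2"), fill = cols)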
If all you want is a theme that matches the base-graphics look, adding + theme_bw() to your ggplot2 call will achieve this:
data(mtcars)
require(ggplot2)
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  theme_bw()
Result
Alternative
boxplot(mpg~cyl,data=mtcars)
If, as you said, the only thing you want to achieve is a similar look, and you have a working plot in ggplot2, using theme_bw() should produce plots that are nearly indistinguishable from those drawn via the standard plotting mechanism. If you feel so inclined, you can tweak details such as font sizes, the thickness of the graph borders, or the visualisation of outliers.
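For instance, a sketch of such tweaks (the specific values are illustrative, not prescribed by the original answer):
# Illustrative tweaks: larger base font, a thicker panel border,
# and hollow circles for boxplot outliers.
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot(outlier.shape = 1) +
  theme_bw(base_size = 14) +
  theme(panel.border = element_rect(colour = "black", fill = NA, size = 1))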

How to measure area between 2 distribution curves in R / ggplot2

As a specific example: imagine x is some continuous variable between 0 and 10, the red line is the distribution of "goods", and the blue is the distribution of "bads". I'd like to see if there is value in incorporating this variable into checking for 'goodness', but first I'd like to quantify the amount of stuff in the regions where blue > red.
Because this is a distribution chart, the scales look the same, but in reality there are 98 times more goods in my sample, which complicates things: it's not simply measuring the area under the curve, but measuring the bad sample in the regions where its distribution is greater than the red one.
I've been working to learn R, but I'm not even sure how to approach this one. Any help appreciated.
EDIT
sample data:
http://pastebin.com/7L3Xc2KU <- a few million rows of that, essentially.
the graph is created with
graph <- qplot(sample_x, bad_is_1, data=sample_data, geom="density", color=bad_is_1)
The only way I can think of to do this is to calculate the area between the curves using simple trapezoids. First we manually compute the densities:
d0 <- density(sample$sample_x[sample$bad_is_1==0])
d1 <- density(sample$sample_x[sample$bad_is_1==1])
Now we create functions that will interpolate between our observed density points
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
Next we find the x range of the overlap of the densities
ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))
and divide that into 500 sections
i <- seq(min(ovrng), max(ovrng), length.out=500)
Now we calculate the distance between the density curves
h <- f0(i)-f1(i)
and using the formula for the area of a trapezoid we add up the area for the regions where d1>d0
area <- sum((h[-1] + h[-length(h)]) / 2 * diff(i) * (h[-1] >= 0))
# [1] 0.1957627
We can plot the region using
plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
jj <- which(h > 0 & seq_along(h) %% 5 == 0)
j <- i[jj]
segments(j, f1(j), j, f1(j) + h[jj])
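As a cross-check (my addition, not part of the original answer), the same area can be approximated by numerically integrating the positive part of the difference between the interpolated densities:
# Integrate max(f0 - f1, 0) over the overlap range; this should agree
# closely with the trapezoid result above (about 0.196).
integrate(function(z) pmax(f0(z) - f1(z), 0), ovrng[1], ovrng[2])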
Here's a way to shade the area between two density plots and calculate the magnitude of that area.
# Create some fake data
set.seed(10)
dat = data.frame(x = c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)),
                 group = c(rep("Bad", 1000), rep("Good", 2000)))
# Plot densities
# Use y=..count.. to get counts on the vertical axis
p1 = ggplot(dat) +
  geom_density(aes(x = x, y = ..count.., colour = group), lwd = 1)
Some extra calculations to shade the area between the two density plots
(adapted from this SO question):
pp1 = ggplot_build(p1)
# Create a new data frame with densities for the two groups ("Bad" and "Good")
dat2 = data.frame(x = pp1$data[[1]]$x[pp1$data[[1]]$group == 1],
                  ymin = pp1$data[[1]]$y[pp1$data[[1]]$group == 1],
                  ymax = pp1$data[[1]]$y[pp1$data[[1]]$group == 2])
# We want ymax and ymin to differ only when the density of "Good"
# is greater than the density of "Bad"
dat2$ymax[dat2$ymax < dat2$ymin] = dat2$ymin[dat2$ymax < dat2$ymin]
# Shade the area between "Good" and "Bad"
p1a = p1 +
  geom_ribbon(data = dat2, aes(x = x, ymin = ymin, ymax = ymax), fill = 'yellow', alpha = 0.5)
Here are the two plots:
To get the area (number of values) in specific ranges of Good and Bad, use the density function on each group. (You could continue to work with the data pulled from ggplot as above, but this way you get more direct control over how the density distribution is generated.)
## Calculate densities for Bad and Good.
# Use same number of points and same x-range for each group, so that the density
# values will line up. Use a higher value for n to get a finer x-grid for the density
# values. Use a power of 2 for n, because the density function rounds up to the nearest
# power of 2 anyway.
bad = density(dat$x[dat$group == "Bad"],
              n = 1024, from = min(dat$x), to = max(dat$x))
good = density(dat$x[dat$group == "Good"],
               n = 1024, from = min(dat$x), to = max(dat$x))
## Normalize so that densities sum to number of rows in each group
# Number of rows in each group
counts = tapply(dat$x, dat$group, length)
bad$y = counts[1]/sum(bad$y) * bad$y
good$y = counts[2]/sum(good$y) * good$y
## Results
# Number of "Good" in region where "Good" exceeds "Bad"
sum(good$y[good$y > bad$y])
# [1] 1931.495   (out of 2000 total in the data frame)
# Number of "Bad" in region where "Good" exceeds "Bad"
sum(bad$y[good$y > bad$y])
# [1] 317.7315   (out of 1000 total in the data frame)
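As a quick sanity check (my addition), the rescaled densities now sum to the group counts by construction:
# each rescaled density sums to the number of rows in its group
sum(bad$y)   # ~1000
sum(good$y)  # ~2000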

plotCI: how to overlay plots of two variables

I am trying to plot populations of predators and of prey over time, with confidence intervals. I can plot these two separately; how do I plot them on the same graph?
#take mean, number, and create se of prey (d)
d.means = tapply(mydata$prey, mydata$week, mean)
d.n = tapply(mydata$prey, mydata$week, length)
d.se = tapply(mydata$prey, mydata$week, sd) / sqrt(d.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(d.means)), d.means, d.se, ylim = c(0, 400),
       pch = 19, gap = 0, xlab = "Week", ylab = "d, w population")
#take mean, number, and create se of predator (w)
w.means = tapply(mydata$pred, mydata$week, mean)
w.n = tapply(mydata$pred, mydata$week, length)
w.se = tapply(mydata$pred, mydata$week, sd) / sqrt(w.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(w.means)), w.means, w.se, ylim = c(0, 400),
       pch = 19, gap = 0, xlab = "Week", ylab = "d, w population")
After the first plot, use the code below before drawing the next plot:
par(new=T)
Make sure that you set the xlim and ylim to accommodate both plots, and use the options axes=F and ann=F in the second plot so that the axes and annotations are not drawn twice.
These graphical features are discussed in detail in the ebook "R Fundamentals & Graphics". You might want to use it as a desk reference.
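Putting that together with the variables from the question, a minimal sketch (plotrix must be loaded; names() is used for the x positions, which works for the named vector tapply returns, and pch = 17 for the second series is my choice to keep the symbols distinguishable):
library(plotrix)

# first series (prey); ylim is wide enough for both series, and both
# calls share the same x range since the series use the same weeks
plotCI(as.numeric(names(d.means)), d.means, d.se,
       ylim = c(0, 400), pch = 19, gap = 0,
       xlab = "Week", ylab = "d, w population")

# overlay the second series (predators) on the same coordinate system,
# suppressing axes and annotations so they are not drawn twice
par(new = TRUE)
plotCI(as.numeric(names(w.means)), w.means, w.se,
       ylim = c(0, 400), pch = 17, gap = 0,
       axes = FALSE, ann = FALSE)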
#take mean, number, and create se of prey (d)
d.means = tapply(mydata$prey, mydata$week, mean)
d.n = tapply(mydata$prey, mydata$week, length)
d.se = tapply(mydata$prey, mydata$week, sd) / sqrt(d.n)
#take mean, number, and create se of predator (w)
w.means = tapply(mydata$pred, mydata$week, mean)
w.n = tapply(mydata$pred, mydata$week, length)
w.se = tapply(mydata$pred, mydata$week, sd) / sqrt(w.n)
Here you have created all the variables you need, but to plot them with ggplot they need to be in a tall (long) dataset with a variable indicating whether each row is predator or prey. I also added a time variable; I think yours would be week.
x = data.frame(means = c(w.means, d.means),
               n = c(w.n, d.n),
               se = c(w.se, d.se),
               role = c(rep("pred", length(w.n)), rep("prey", length(d.n))),
               time = c(1:length(w.n), 1:length(d.n)))
I don't know exactly what your data look like so here is a fake one I cooked up just to illustrate the format.
means n se role time
1 0.9874234 10 0.16200575 pred 1
2 1.4120207 12 0.08895026 pred 2
3 2.7352516 8 0.07991036 pred 3
4 1.1301248 11 0.05481813 prey 1
5 2.4810040 13 0.28682585 prey 2
6 3.1546947 9 0.22126054 prey 3
Once the data are in this nice format using ggplot is really pretty easy.
ggplot(x, aes(x = time, y = means, colour = role)) +
  geom_errorbar(aes(ymin = means - se, ymax = means + se), width = .1) +
  geom_line()
That gives a single plot with both series overlaid, each with its own error bars.

Set the width of ggplot geom_path based on a variable

I have two functions, a and b, that each take a value of x from 1 to 3 and produce an estimate and an error.
x variable estimate error
1 a 8 4
1 b 10 2
2 a 9 3
2 b 10 1
3 a 8 5
3 b 11 3
I'd like to use geom_path() in ggplot to plot the estimates and errors for each function as x increases.
So if this is the data:
d = data.frame(x=c(1,1,2,2,3,3),variable=rep(c('a','b'),3),estimate=c(8,10,9,10,8,11),error=c(4,2,3,1,5,3))
Then the output that I'd like is something like the output of:
ggplot(d,aes(x,estimate,color=variable)) + geom_path()
but with the thickness of the line at each point equal to the size of the error. I might need to use something like geom_polygon(), but I haven't been able to find a good way to do this without calculating a series of coordinates manually.
If there's a better way to visualize this data (y value with confidence intervals at discrete x values), that would be great. I don't want to use a bar graph because I actually have more than two functions and it's hard to track the changing estimate/error of any specific function with a large group of bars at each x value.
The short answer is that you need to map size to error, so that the size of the geometric object varies with the value of error. There are many ways to do what you want; here is one along the lines you suggested.
library(ggplot2)

df = data.frame(x = c(1, 1, 2, 2, 3, 3),
                variable = rep(c('a', 'b'), 3),
                estimate = c(8, 10, 9, 10, 8, 11),
                error = c(4, 2, 3, 1, 5, 3))

ggplot(df, aes(x, estimate, colour = variable, group = variable, size = error)) +
  geom_point() + theme(legend.position = 'none') + geom_line(size = .5)
I found geom_ribbon(). The answer is something like this:
ggplot(d, aes(x, estimate, ymin = estimate - error, ymax = estimate + error, fill = variable)) +
  geom_ribbon()
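A possible refinement (my suggestion, not part of the original answer) is to make the ribbons translucent and overlay the central estimate as a line, so overlapping bands stay readable:
# translucent ribbons plus a line for the central estimate
ggplot(d, aes(x, estimate, ymin = estimate - error, ymax = estimate + error,
              fill = variable)) +
  geom_ribbon(alpha = 0.4) +
  geom_line(aes(colour = variable))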

ggplot: error bars

I want to plot error bars for two different sets of y values, Y1 and Y2, against x. In other words, I have two series Y1 and Y2 with their corresponding X values. I managed to plot them together after I reshaped the data frame. Now I want to add error bars to the same graph for the Y1 and Y2 points. I understand geom_errorbar() is what I'm looking for; however, I'm doing it the long way and I'm sure there is a shorter one. Currently I calculate "se" for each set and pass aes(ymin=y1-se, ymax=y1+se), then repeat the same for Y2. Because I want to apply these error bars to different plots, I'd rather do it a short way.
Here my data frame after reshape:
M Req Rec load Un L1
1 30.11 9.000000 3.000000 30.02000 A
2 50.31 10.030000 6.045000 39.44000 A
3 60.01 11.290000 7.366667 54.93000 A
4 66.10 12.630000 8.827500 68.44500 A
5 80.18 13.106000 9.462000 71.07600 A
6 87.10 14.421667 15.961667 82.70500 A
7 90.08 15.880000 20.644286 94.20714 A
1 4.000 1.500000 1.000000 1 B
2 8.240 6.240000 4.760000 3.00000 B
3 10.28 12.230000 9.420000 4.05000 B
4 18.570 25.570000 17.930000 6.00000 B
5 22.250 35.250000 27.850000 7.00000 B
6 35.070 55.010000 36.810000 8.06000 B
7 48.480 0.420000 47.020000 9.06000 B
I have used the following command to graph it:
ggplot(df_reshaped, aes(x = M, y = Req, colour = L1, shape = L1)) +
  geom_point(size = 5) +
  geom_line() +
  scale_x_discrete(name = "M") +
  scale_y_continuous(name = "Y1 Y2") +
  ggtitle("A vs B")
In this case I'm graphing Y1=Req1 and Y2=Req2 with respect to x=M.
Is there any short way or suggestion to calculate the error bars? Is there any quick way to calculate the "se"?
In general there are two possibilities to prepare your data for ggplot:
You could aggregate the raw data and plot the results. If you take this route, you have to calculate the standard errors yourself, since that information cannot be recovered from the aggregated data. These standard errors can then be plotted with geom_errorbar.
A second option is to use the raw data and let ggplot do all the calculations for you. This could be done with stat_summary. For example:
stat_summary(fun.data = "mean_cl_normal", fun.args = list(mult = 1), geom = "errorbar")
Obviously, you have chosen the first approach. So, you just need to calculate the standard errors for the points of both variables.
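For the first approach, here is a minimal sketch, assuming a raw data frame raw_df with one row per observation and columns M, value, and L1 (these names are placeholders, not from the question):
library(ggplot2)

# aggregate raw observations into a mean and standard error per (M, L1) cell
agg <- aggregate(value ~ M + L1, data = raw_df,
                 FUN = function(v) c(mean = mean(v), se = sd(v) / sqrt(length(v))))
agg <- do.call(data.frame, agg)  # flatten the matrix column into value.mean and value.se

ggplot(agg, aes(x = M, y = value.mean, colour = L1)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = value.mean - value.se, ymax = value.mean + value.se),
                width = 0.2)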
