How to plot density plots with proportions on the y-axis? - r

I am using the sm package in R to draw a density plot of several variables with different sample sizes, like this:
var1 <- density(vars1[,1])
var2 <- density(vars2[,1])
var3 <- density(vars3[,1])
pdf(file="density.pdf",width=8.5,height=8)
plot(var1,col="BLUE")
par(new=T)
plot(var2,axes=FALSE,col="RED")
par(new=T)
plot(var3,axes=FALSE,col="GREEN")
dev.off()
The problem I'm having is that I want the y-axis to show proportions, so I can compare the different variables with each other in a more meaningful way. The maxima of all three density plots are now exactly the same, and I'm pretty sure that they wouldn't be if the y-axis showed proportions. Any suggestions? Many thanks!
Edit:
I just learned that I should not plot on top of an existing plot, so now the plotting part of the code looks like this:
pdf(file="density.pdf",width=8.5,height=8)
plot(var1,col="BLUE")
lines(var2,col="RED")
lines(var3,col="GREEN")
dev.off()
The maxima of those lines however are now very much in line with the sample size differences. Is there a way to put the proportions on the y-axis for all three variables, so the area under the curve is equal for all three variables? Many thanks!

Don't plot on top of an existing plot, because the axes may be different. Instead, use lines() to plot the second and third densities after plotting the first. If necessary, adjust the ylim parameter in plot() so that they all fit.
An example for how sample size ought not matter:
set.seed(1)
D1 <- density(rnorm(1000))
D2 <- density(rnorm(10000))
D3 <- density(rnorm(100000))
plot(D1$x,D1$y,type='l',col="red",ylim=c(0,.45))
lines(D2$x,D2$y,lty=2,col="blue")
lines(D3$x,D3$y,lty=3,col="green")
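To see why sample size should not matter, it may help to check numerically that each curve returned by density() encloses an area of roughly 1, whatever the sample size. The following is only a sketch: the names samples, dens and area are illustrative, and the trapezoid rule is an approximation of the integral.

```r
set.seed(1)
# Two samples of very different size (illustrative data)
samples <- list(small = rnorm(1000), large = rnorm(100000))
dens <- lapply(samples, density)

# Trapezoid-rule approximation of the area under each density curve
area <- sapply(dens, function(d) {
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
})
print(round(area, 3))
```

Because both areas come out close to 1, the y-axis is already on a comparable scale for all variables, and differences in peak height reflect the spread of the data rather than the number of observations.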

You could make tim's solution a little more flexible by not hard-coding in the limits.
plot(D1$x,D1$y,type='l',col="red",ylim=c(0, max(sapply(list(D1, D2, D3),
function(x) {max(x$y)}))))
This would also cater for Vincent's point that the density functions are not necessarily constrained in their range.
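The same idea can be extended to the x-axis and wrapped in a small function. This is a sketch only: overlay_densities is a hypothetical helper name, not part of base R, and the simulated data are placeholders.

```r
# Hypothetical helper: overlay any number of density objects on shared axes
overlay_densities <- function(dens, cols = seq_along(dens)) {
  # Limits that cover every curve in the list
  xlim <- range(sapply(dens, function(d) range(d$x)))
  ylim <- c(0, max(sapply(dens, function(d) max(d$y))))
  plot(dens[[1]], xlim = xlim, ylim = ylim, col = cols[1], main = "")
  for (i in seq_along(dens)[-1]) lines(dens[[i]], col = cols[i])
  invisible(list(xlim = xlim, ylim = ylim))
}

set.seed(1)
dens <- list(density(rnorm(1000)),
             density(rnorm(10000, mean = 2)),
             density(rnorm(100000, sd = 3)))
lims <- overlay_densities(dens, cols = c("red", "blue", "green"))
```

The limits are returned invisibly so they can be reused, for example to draw a second panel on the same scale.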

Related

Contour plot via Scatter plot

Scatter plots are hard to read when the number of points is large.
So, e.g., using a normal approximation, we can draw a contour plot instead.
My question: is there a package that produces a contour plot from scatter-plot data?
Thank you @G5W! I can do it now!
You don't offer any data, so I will respond with some artificial data,
constructed at the bottom of the post. You also don't say how much data
you have although you say it is a large number of points. I am illustrating
with 20000 points.
You used the group number as the plotting character to indicate the group.
I find that hard to read. But just plotting the points doesn't show the
groups well. Coloring each group a different color is a start, but does
not look very good.
plot(x,y, pch=20, col=rainbow(3)[group])
Two tricks that can make a lot of points more understandable are:
1. Make the points transparent. The dense places will appear darker. AND
2. Reduce the point size.
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
That looks somewhat better, but did not address your actual request.
Your sample picture seems to show confidence ellipses. You can get
those using the function dataEllipse from the car package.
library(car)
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
dataEllipse(x,y,factor(group), levels=c(0.70,0.85,0.95),
plot.points=FALSE, col=rainbow(3), group.labels=NA, center.pch=FALSE)
But if there are really a lot of points, the points can still overlap
so much that they are just confusing. You can also use dataEllipse
to create what is basically a 2D density plot without showing the points
at all. Just plot several ellipses of different sizes over each other filling
them with transparent colors. The center of the distribution will appear darker.
This can give an idea of the distribution for a very large number of points.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
You can get a more continuous look by plotting more ellipses and leaving out the border lines.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=seq(0.11,0.99,0.02),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.05, lty=0)
Please try different combinations of these to get a nice picture of your data.
Additional response to comment: Adding labels
Perhaps the most natural place to add group labels is the centers of the
ellipses. You can get that by simply computing the centroids of the points in each group. So for example,
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
## Now add labels
for(i in unique(group)) {
text(mean(x[group==i]), mean(y[group==i]), labels=i)
}
Note that I just used the number as the group label, but if you have a more elaborate name, you can change labels=i to something like
labels=GroupNames[i].
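As a sketch of that variant, the centroids can also be computed in one step with tapply() and labelled with names instead of numbers. The vector GroupNames and its contents are made up for illustration, and the data are a smaller simulated stand-in, plotted with base graphics only.

```r
set.seed(1)
# Smaller simulated stand-in for the three groups
x <- c(rnorm(200, 0, 1), rnorm(700, 1, 1), rnorm(1100, 5, 1))
y <- c(rnorm(200, 0, 1), rnorm(700, 5, 1), rnorm(1100, 6, 1))
group <- rep(1:3, times = c(200, 700, 1100))
GroupNames <- c("control", "low dose", "high dose")  # hypothetical names

# Centroid of each group
cx <- tapply(x, group, mean)
cy <- tapply(y, group, mean)

plot(x, y, pch = 20, cex = 0.5, col = rainbow(3, alpha = 0.1)[group])
# names(cx) holds the group ids, used here to look up the label
text(cx, cy, labels = GroupNames[as.integer(names(cx))], font = 2)
```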
Data
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
You can use hexbin::hexbin() to show very large datasets.
@G5W gave a nice dataset:
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
If you don't know the group information, then the ellipses are inappropriate; this is what I'd suggest:
library(hexbin)
plot(hexbin(x,y))
which produces a hexagonally binned plot of the point density.
If you really want contours, you'll need a density estimate to plot. The MASS::kde2d() function can produce one; see the examples in its help page for plotting a contour based on the result. This is what it gives for this dataset:
library(MASS)
contour(kde2d(x,y))
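If you want to see the contours in relation to the raw points, one option is to draw the points faintly and add the contour lines on top. This is a sketch under assumptions: the grid size n = 100, the number of levels and the colors are arbitrary choices, and the data are simulated along the lines of the dataset above.

```r
library(MASS)

set.seed(42)
x <- c(rnorm(2000, 0, 1), rnorm(7000, 1, 1), rnorm(11000, 5, 1))
y <- c(rnorm(2000, 0, 1), rnorm(7000, 5, 1), rnorm(11000, 6, 1))

# 2D kernel density estimate on a 100 x 100 grid (grid size is an arbitrary choice)
dens2d <- kde2d(x, y, n = 100)

# Faint, nearly transparent points with the contour lines drawn over them
plot(x, y, pch = 20, cex = 0.5, col = rgb(0, 0, 0, alpha = 0.05))
contour(dens2d, add = TRUE, col = "red", nlevels = 8)
```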

filled.contour in R: how to make the same density is same color

I need to plot a 2-dimensional density plot using filled.contour in R.
I have two datasets and plot each with its own filled.contour call. I do not have the 10 reputation needed to post figures, so I am posting my code instead in the hope that it helps to find the problem.
library(MASS)
density <- kde2d(multi_ligand[,21], multi_ligand[,7])
filled.contour(density,
color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')),
xlab=colnames(single_ligand[21]),
ylab=colnames(single_ligand[7])
)
density1 <- kde2d(single_ligand[,21], single_ligand[,7])
filled.contour(density1,
color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')),
xlab=colnames(single_ligand[21]),
ylab=colnames(single_ligand[7])
)
The problem is that the same density is not drawn in the same color in the two plots. For example, density 0.06 is yellow in the first plot but blue in the second, even though I use the same color palette in both.
To make the two plots comparable, I want the same density to have the same color in both.
Could anyone please tell me how I should change my settings to get this right?
By default filled.contour will adjust the blocks of color to evenly cover the range of z, or in this case density, values for each data set. If you want the exact same levels to be used on both plots, you will need to specify them yourself. Here is some code that will specify levels that will cover the range of both data sets.
#sample data
set.seed(15)
ax<-rnorm(50) #like multi_ligand[,21]
ay<-rnorm(50) #like multi_ligand[,7]
bx<-rnorm(75,2, .5) #like single_ligand[,21]
by<-rnorm(75,2, .5) #like single_ligand[,7]
#calculate both densities
density <- kde2d(ax, ay)
density1 <- kde2d(bx, by)
#make levels that cover both ranges of z values
lvls <- pretty(range(density$z, density1$z),20)
#draw both plots using the same levels
filled.contour(density,
color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')),
levels=lvls
)
filled.contour(density1,
color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')),
levels=lvls
)
This produces two plots in which a given density value is drawn in the same color.
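To make the effect of shared levels concrete, the mapping from a density value to a color can be reproduced by hand. This is only an illustration of the idea: colour_for is a hypothetical helper, and findInterval() is used here to mimic the binning, not to show how filled.contour is implemented internally.

```r
library(MASS)

set.seed(15)
density  <- kde2d(rnorm(50), rnorm(50))
density1 <- kde2d(rnorm(75, 2, .5), rnorm(75, 2, .5))

# One set of levels covering both data sets
lvls <- pretty(range(density$z, density1$z), 20)

# One color per interval between consecutive levels
pal <- colorRampPalette(c('white','blue','yellow','red','darkred'))(length(lvls) - 1)

# Hypothetical helper: color assigned to a density value under the shared levels
colour_for <- function(z) pal[findInterval(z, lvls, all.inside = TRUE)]
colour_for(0.06)
```

Since lvls and pal no longer depend on which data set is being drawn, the value 0.06 falls in the same interval, and therefore gets the same color, in both plots.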

Plotting multiple variables

I have four variables y1, y2, y3, y4. I want to make a plot that will show how y2, y3 and y4 behave in relation to y1. I have tried using a scatterplot but I do not get much information from that.
matplot might be useful here as well:
dat <- data.frame(y1=1:3,y2=1:3,y3=2:4,y4=3:5)
matplot(dat[1],dat[-1],type="l",lty=1)
Alternatively, draw one scatterplot per variable, side by side:
par(mfrow=c(1,3)) #make a plot area with space for three plots
y1=rnorm(100)
y2=rnorm(100)
y3=rnorm(100)
y4=rnorm(100)
plot(y1,y2)
plot(y1,y3)
plot(y1,y4)

Plotting distribution of differences in R

I have a dataset with numbers indicating daily difference in some measure.
https://dl.dropbox.com/u/22681355/diff.csv
I would like to create a plot of the distribution of the differences with special emphasis on the rare large changes.
I tried plotting each column using the hist() function but it doesn't really provide a detailed picture of the data.
For example plotting the first column of the dataset produces the following plot:
https://dl.dropbox.com/u/22681355/Rplot.pdf
My problem is that this gives very little detail to the infrequent large deviations.
What is the easiest way to do this?
Also any suggestions on how to summarize this data in a table? For example besides showing the min, max and mean values, would you look at quantiles? Any other ideas?
You could use boxplots to visualize the distribution of the data:
sdiff <- read.csv("https://dl.dropbox.com/u/22681355/diff.csv")
boxplot(sdiff[,-1])
Outliers are printed as circles.
I back @Sven's suggestion for identifying outliers, but you can get more refinement in your histograms by specifying a denser set of breakpoints than what hist chooses by default.
d <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv', header=TRUE, row.names=1)
with(d, hist(a, breaks=seq(min(a), max(a), length.out=100)))
Violin plots could be useful:
df <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv')
library(vioplot)
with(df,vioplot(a,b,c,d,e,f,g,h,i,j))
I would use a boxplot on transformed data, e.g.:
# signed square root: same transformation as x/sqrt(abs(x)), but without NaN at x = 0
boxplot(sign(df[,-1])*sqrt(abs(df[,-1])))
Obviously a histogram would also look better after transformation.
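For the tabular summary the question asks about, a table of quantiles per column is one option, since the extreme quantiles describe the rare large changes directly. The sketch below uses simulated heavy-tailed data as a stand-in for diff.csv; the column names a and b and the chosen probabilities are assumptions.

```r
set.seed(7)
# Simulated stand-in for the real file: heavy-tailed daily differences (assumption)
d <- data.frame(a = rt(500, df = 3), b = rt(500, df = 3))

# Probabilities chosen to emphasise the tails
probs <- c(0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1)

# One row per column of d, one column per quantile
summary_tab <- t(sapply(d, quantile, probs = probs))
print(round(summary_tab, 2))
```

With the real data, d would be replaced by the data frame read from the CSV file.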

R - logistic curve plot with aggregate points

Let's say I have the following dataset
bodysize=rnorm(20,30,2)
bodysize=sort(bodysize)
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat=as.data.frame(cbind(bodysize,survive))
I'm aware that the glm plot function has several nice plots to show you the fit,
but I'd nevertheless like to create an initial plot with:
1) raw data points
2) the logistic curve
3) predicted points
4) aggregate points for a number of predictor levels
library(Hmisc)
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
All fine up to here.
Now I want to plot the observed survival rates for given levels of bodysize
dat$bd<-cut2(dat$bodysize,g=5,levels.mean=T)
AggBd<-aggregate(dat$survive,by=list(dat$bd),data=dat,FUN=mean)
plot(AggBd,add=TRUE)
#Doesn't work
I've tried to match AggBd to the dataset used for the model and all sort of other things but I simply can't plot the two together. Is there a way around this?
I basically want to superimpose the last plot along the same axes.
Besides this specific task, I often wonder how to superimpose different plots that show different variables but have a similar scale/range on two-dimensional plots. I would really appreciate your help.
The first column of AggBd is a factor, so you need to convert the levels to numeric before you can add the points to the plot.
AggBd$size <- as.numeric (levels (AggBd$Group.1))[AggBd$Group.1]
To add the points to the existing plot, use points():
points (AggBd$size, AggBd$x, pch = 3)
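Put together, the whole figure can be sketched in base R as follows. Note the assumptions: cut() with quantile breaks stands in for Hmisc::cut2, so the bins may differ slightly from the question's, and the names brks, bin and agg are illustrative.

```r
set.seed(3)
bodysize <- sort(rnorm(20, 30, 2))
survive <- c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat <- data.frame(bodysize, survive)

g <- glm(survive ~ bodysize, family = binomial, data = dat)

# Base-R stand-in for Hmisc::cut2: five quantile-based bins
brks <- quantile(dat$bodysize, probs = seq(0, 1, 0.2))
dat$bin <- cut(dat$bodysize, breaks = brks, include.lowest = TRUE)

# Mean body size and observed survival rate within each bin
agg <- data.frame(size = as.vector(tapply(dat$bodysize, dat$bin, mean)),
                  rate = as.vector(tapply(dat$survive, dat$bin, mean)))

plot(dat$bodysize, dat$survive, xlab = "Body size", ylab = "Probability of survival")
curve(predict(g, data.frame(bodysize = x), type = "response"), add = TRUE)
points(dat$bodysize, fitted(g), pch = 20)        # predicted points
points(agg$size, agg$rate, pch = 3, col = "red") # aggregate points
```

Because points() draws into the existing coordinate system, the aggregate rates share the axes of the raw data and the fitted curve.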
It is best to specify your y-axis explicitly. You could also use par(new=TRUE):
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
#then
par(new=TRUE)
#
plot(AggBd$Group.1,AggBd$x,pch=30)
Remove or change the axis ticks and labels to prevent overlap, e.g.:
plot(AggBd$Group.1,AggBd$x,pch=30,xaxt="n",yaxt="n",xlab="",ylab="")
