I want to fit my vectors x, y to some kind of curve, but they're both about 10k long with x-values very closely packed, so a scatter plot just ends up as a huge mess. What I'd like to do is to plot the AVERAGE of the y-values corresponding to each x-value.
For example:
y=rnorm(1000)
x=c(rep(1,500),rep(2,500))
plot(x,y)
I'd like this plot to only have two single points, one for x=1 and one for x=2. Any ideas?
plot(sort(unique(x)),tapply(y,x,mean))
or maybe even
plot(tapply(x,x,unique),tapply(y,x,mean))
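The same idea can also be written with aggregate(), which keeps each x value next to its mean so the pairing cannot get out of order even when x is unsorted. This is only a minimal sketch on made-up data; the names d and avg are illustrative and not from the question.

```r
set.seed(42)
x <- sample(c(3, 1, 2), 1000, replace = TRUE)   # deliberately unsorted x
y <- rnorm(1000, mean = x)                      # y depends on x

d   <- data.frame(x = x, y = y)
avg <- aggregate(y ~ x, data = d, FUN = mean)   # one row per distinct x, sorted by x

# One point per distinct x value, at the mean of its y values
plot(avg$x, avg$y, type = "b", xlab = "x", ylab = "mean of y")
```

With 10k closely packed x-values it may be worth binning x first (for example with cut()) and averaging within bins instead of per exact value.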
I have some scatter plots that look like this:
[image: scatter plot]
The X values are discrete. For every X there are several Y values. I want to plot the density distribution of Y for each X, like this:
[image: density plot]
Sorry that I don't know how to describe it, so I drew it by hand. Each red line is a density curve of Y corresponding to an X value. The blue line is the trend line.
With my limited experience of R and ggplot2, I have no idea how to do this. I have been searching on Google for a long time without success. Please help, or give me some ideas on how to achieve this. Thanks a lot!
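One possible way to approach this in base R, as a rough sketch rather than a polished answer: compute density() for the Y values at each X and draw each curve sideways, starting from its X position. The data here is simulated, and the width scaling factor is an arbitrary choice made for illustration.

```r
set.seed(1)
x <- rep(1:5, each = 200)                      # discrete X values (simulated)
y <- rnorm(length(x), mean = 2 * x, sd = 1)    # several Y values per X

plot(x, y, pch = ".", col = "grey60", xlim = c(0.5, 6), xlab = "X", ylab = "Y")

width <- 0.8   # maximum horizontal extent of each curve (arbitrary assumption)
for (x0 in sort(unique(x))) {
  d <- density(y[x == x0])                     # kernel density of Y at this X
  # Draw the curve sideways: density height goes along X, Y values along Y
  lines(x0 + width * d$y / max(d$y), d$x, col = "red")
}

# Trend line through the per-X means
means <- tapply(y, x, mean)
lines(as.numeric(names(means)), means, col = "blue", lwd = 2)
```

Each curve is scaled by its own maximum here, so the heights are not comparable between X values; dividing by a common maximum instead would keep them comparable.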
Sorry for the newbie R question...
I have a data.frame that contains measurements of a single variable. These measurements will be distributed differently depending on whether the thing being measured is of type A or type B; that is, you can imagine that my column names are: measurement, type label (A or B). I want to plot the histograms of the measurements for A and B separately, and put the two histograms in the same plot, with each histogram normalised to unit area (this is because I expect the proportions of A and B to differ significantly). By unit area, I mean that A and B each have unit area, not that A+B have unit area. Basically, I want something like geom_density, but I don't want a smoothed distribution for each; I want the histogram bars. Not interleaved, but plotted one on top of the other. Not stacked, although it would be interesting to know how to do that too. (The purpose of this plot is to explore differences in the shapes of the distributions that would indicate quantitative differences between A and B that could be used to distinguish between them.) That's all. Two or more histograms -- not smoothed density plots -- in the same plot, with each normalised to unit area. Thanks!
Something like this?
# generate example
set.seed(1)
df <- data.frame(Type=c(rep("A",1000),rep("B",4000)),
Value=c(rnorm(1000,mean=25,sd=10),rchisq(4000,15)))
# you start here...
library(ggplot2)
ggplot(df, aes(x=Value))+
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y=after_stat(density),fill=Type),color="grey80")+
  facet_grid(Type~.)
Note that there are 4 times as many samples of type B.
You can also set the y-axis scales to float using: scales="free_y" in the call to facet_grid(...).
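If the two histograms should sit in the same panel, one on top of the other, rather than in facets, a base-R sketch along these lines might do. It reuses the example data from above; the choice of 40 shared bins and the semi-transparent colours are assumptions made for illustration.

```r
set.seed(1)
df <- data.frame(Type  = c(rep("A", 1000), rep("B", 4000)),
                 Value = c(rnorm(1000, mean = 25, sd = 10), rchisq(4000, 15)))

# Shared bin edges so the bars of both groups line up (40 bins, arbitrary)
breaks <- seq(floor(min(df$Value)), ceiling(max(df$Value)), length.out = 41)

hA <- hist(df$Value[df$Type == "A"], breaks = breaks, plot = FALSE)
hB <- hist(df$Value[df$Type == "B"], breaks = breaks, plot = FALSE)

colA <- rgb(1, 0, 0, 0.4)   # semi-transparent red
colB <- rgb(0, 0, 1, 0.4)   # semi-transparent blue

# freq = FALSE plots densities, so each histogram has unit area on its own
plot(hA, freq = FALSE, col = colA, ylim = c(0, max(hA$density, hB$density)),
     main = "", xlab = "Value")
plot(hB, freq = FALSE, col = colB, add = TRUE)
legend("topright", legend = c("A", "B"), fill = c(colA, colB))
```

In ggplot2, the analogous overlay would presumably be to drop the facet and pass position="identity" together with an alpha value to geom_histogram().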
I am trying to plot a set of data in R
x <- c(1,4,5,3,2,25)
My Y scale is fixed at 20, so the last data point is effectively not visible on the plot if I execute the following code:
plot(x, ylim=c(0,20), type='l')
I wanted to show the range of the outlying data point by drawing a smaller box above the plot, with an independent Y scale, representing only this last data point.
Is there any package or way to approach this problem?
You may try axis.break (plotrix package) http://rss.acs.unt.edu/Rdoc/library/plotrix/html/axis.break.html, with which you can define the axis to break, the style, size and color of the break marker.
The potential disadvantage of this approach is that the trend perception might be fooled. Good luck!
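Without any extra package, the "smaller box above the plot" could also be approximated with layout() in base R. This is only a sketch: the upper panel's range c(22, 28) and the panel heights are arbitrary choices, not something taken from the question.

```r
x <- c(1, 4, 5, 3, 2, 25)

# Small panel on top, main panel below (heights 1:3, arbitrary)
layout(matrix(1:2, ncol = 1), heights = c(1, 3))

# Top panel: independent Y scale covering only the outlying point
op <- par(mar = c(0.5, 4, 2, 2))
plot(x, ylim = c(22, 28), type = "l", xaxt = "n", xlab = "", ylab = "")

# Bottom panel: the fixed 0-20 scale from the question
par(mar = c(4, 4, 0.5, 2))
plot(x, ylim = c(0, 20), type = "l", ylab = "x")

# Restore the previous margins and the single-panel layout
par(op)
layout(1)
```

As with axis.break, the two panels use different scales, so the slope of the line into the outlier should not be read literally.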
Let's say I have the following dataset
bodysize=rnorm(20,30,2)
bodysize=sort(bodysize)
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat=as.data.frame(cbind(bodysize,survive))
I'm aware that the glm plot function has several nice plots to show you the fit,
but I'd nevertheless like to create an initial plot with:
1) raw data points
2) the logistic curve, and both
3) predicted points
4) and aggregate points for a number of predictor levels
library(Hmisc)
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
All fine up to here.
Now I want to plot the observed survival rates for given levels of x1 (here, bodysize):
dat$bd<-cut2(dat$bodysize,g=5,levels.mean=T)
AggBd<-aggregate(dat$survive,by=list(dat$bd),data=dat,FUN=mean)
plot(AggBd,add=TRUE)
#Doesn't work
I've tried to match AggBd to the dataset used for the model and all sort of other things but I simply can't plot the two together. Is there a way around this?
I basically want to superimpose the last plot on the same axes.
Besides this specific task, I often wonder how to superimpose different plots that show different variables but have a similar scale/range on two-dimensional plots. I would really appreciate your help.
The first column of AggBd is a factor, so you need to convert the levels to numeric before you can add the points to the plot.
AggBd$size <- as.numeric (levels (AggBd$Group.1))[AggBd$Group.1]
To add the points to the existing plot, use points():
points (AggBd$size, AggBd$x, pch = 3)
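For reference, here is a self-contained sketch of the whole sequence that avoids the factor conversion altogether. It is an alternative under stated assumptions, not the method above: cut() with quantile breaks stands in for Hmisc's cut2(g=5), and the group positions are taken as the mean body size within each group. The names brks, grp_x and grp_y are illustrative.

```r
set.seed(1)
bodysize <- sort(rnorm(20, 30, 2))
survive  <- c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat <- data.frame(bodysize, survive)

g <- glm(survive ~ bodysize, family = binomial, data = dat)

# Raw data, fitted logistic curve and predicted points
plot(dat$bodysize, dat$survive, xlab = "Body size", ylab = "Probability of survival")
curve(predict(g, data.frame(bodysize = x), type = "response"), add = TRUE)
points(dat$bodysize, fitted(g), pch = 20)

# Five roughly equal-sized groups by quantile (stand-in for cut2(g = 5))
brks   <- quantile(dat$bodysize, probs = seq(0, 1, by = 0.2))
dat$bd <- cut(dat$bodysize, breaks = brks, include.lowest = TRUE)

grp_x <- tapply(dat$bodysize, dat$bd, mean)   # mean body size per group
grp_y <- tapply(dat$survive,  dat$bd, mean)   # observed survival rate per group

# points() adds to the existing axes, so nothing is redrawn
points(grp_x, grp_y, pch = 3, col = "red", cex = 1.5)
```

Because points() and lines() draw into the current coordinate system, they are usually the simplest way to superimpose anything that shares the scale of the existing plot.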
You are best off specifying your y-axis explicitly. You could also try par(new=TRUE):
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
#then
par(new=TRUE)
#
plot(AggBd$Group.1,AggBd$x,pch=30)
Obviously, remove or change the axis ticks to prevent overlap, e.g.
plot(AggBd$Group.1,AggBd$x,pch=30,xaxt="n",yaxt="n",xlab="",ylab="")
giving:
I am using the sm package in R to draw a density plot of several variables with different sample sizes, like this:
var1 <- density(vars1[,1])
var2 <- density(vars2[,1])
var3 <- density(vars3[,1])
pdf(file="density.pdf",width=8.5,height=8)
plot(var1,col="BLUE")
par(new=T)
plot(var2,axes=FALSE,col="RED")
par(new=T)
plot(var3,axes=FALSE,col="GREEN")
dev.off()
The problem I'm having is that I want the y-axis to show the proportions, so I can compare the different variables with each other in a more meaningful way. The maxima of all three density plots are now exactly the same, and I'm pretty sure that they wouldn't be if the y-axis showed proportions. Any suggestions? Many thanks!
Edit:
I just learned that I should not plot on top of an existing plot, so now the plotting part of the code looks like this:
pdf(file="density.pdf",width=8.5,height=8)
plot(var1,col="BLUE")
lines(var2,col="RED")
lines(var3,col="GREEN")
dev.off()
The maxima of those lines however are now very much in line with the sample size differences. Is there a way to put the proportions on the y-axis for all three variables, so the area under the curve is equal for all three variables? Many thanks!
Don't plot on top of an existing plot, because the axes may be different. Instead, use lines() to plot the second and third densities after plotting the first. If necessary, adjust the ylim parameter in plot() so that they all fit.
An example of how sample size ought not to matter:
set.seed(1)
D1 <- density(rnorm(1000))
D2 <- density(rnorm(10000))
D3 <- density(rnorm(100000))
plot(D1$x,D1$y,type='l',col="red",ylim=c(0,.45))
lines(D2$x,D2$y,lty=2,col="blue")
lines(D3$x,D3$y,lty=3,col="green")
You could make tim's solution a little more flexible by not hard-coding in the limits.
plot(D1$x,D1$y,type='l',col="red",ylim=c(0, max(sapply(list(D1, D2, D3),
function(x) {max(x$y)}))))
This would also cater for Vincent's point that the density functions are not necessarily constrained in their range.
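To check the equal-area point numerically, the area under each density() result can be approximated on its evaluation grid. The sketch below uses simulated data again; dens, areas, xlim and ylim are illustrative names, and the Riemann sum is only an approximation of the integral.

```r
set.seed(1)
dens <- list(density(rnorm(1000)),
             density(rnorm(10000)),
             density(rnorm(100000)))

# Approximate area under each curve: sum of heights times the grid spacing
areas <- sapply(dens, function(d) sum(d$y) * diff(d$x[1:2]))
print(round(areas, 3))   # each value should be close to 1, whatever the sample size

# Axis limits derived from the densities instead of being hard-coded
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

plot(dens[[1]], col = "red", xlim = xlim, ylim = ylim, main = "")
lines(dens[[2]], lty = 2, col = "blue")
lines(dens[[3]], lty = 3, col = "green")
```

If the curves in the real data differ a lot in height, that reflects how spread out each variable is, not how many observations it has.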