Scatter plots are useless when number of plots is large.
So, e.g., using normal approximation, we can get the contour plot.
My question: Is there any package to implement the contour plot from scatter plot.
Thank you #G5W !! I can do it !!
You don't offer any data, so I will respond with some artificial data,
constructed at the bottom of the post. You also don't say how much data
you have although you say it is a large number of points. I am illustrating
with 20000 points.
You used the group number as the plotting character to indicate the group.
I find that hard to read. But just plotting the points doesn't show the
groups well. Coloring each group a different color is a start, but does
not look very good.
plot(x,y, pch=20, col=rainbow(3)[group])
Two tricks that can make a lot of points more understandable are:
1. Make the points transparent. The dense places will appear darker. AND
2. Reduce the point size.
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
That looks somewhat better, but did not address your actual request.
Your sample picture seems to show confidence ellipses. You can get
those using the function dataEllipse from the car package.
library(car)
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
dataEllipse(x,y,factor(group), levels=c(0.70,0.85,0.95),
plot.points=FALSE, col=rainbow(3), group.labels=NA, center.pch=FALSE)
But if there are really a lot of points, the points can still overlap
so much that they are just confusing. You can also use dataEllipse
to create what is basically a 2D density plot without showing the points
at all. Just plot several ellipses of different sizes over each other filling
them with transparent colors. The center of the distribution will appear darker.
This can give an idea of the distribution for a very large number of points.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
You can get a more continuous look by plotting more ellipses and leaving out the border lines.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=seq(0.11,0.99,0.02),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.05, lty=0)
Please try different combinations of these to get a nice picture of your data.
Additional response to comment: Adding labels
Perhaps the most natural place to add group labels is the centers of the
ellipses. You can get that by simply computing the centroids of the points in each group. So for example,
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
## Now add labels
for(i in unique(group)) {
text(mean(x[group==i]), mean(y[group==i]), labels=i)
}
Note that I just used the number as the group label, but if you have a more elaborate name, you can change labels=i to something like
labels=GroupNames[i].
Data
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
You can use hexbin::hexbin() to show very large datasets.
#G5W gave a nice dataset:
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
If you don't know the group information, then the ellipses are inappropriate; this is what I'd suggest:
library(hexbin)
plot(hexbin(x,y))
which produces
If you really want contours, you'll need a density estimate to plot. The MASS::kde2d() function can produce one; see the examples in its help page for plotting a contour based on the result. This is what it gives for this dataset:
library(MASS)
contour(kde2d(x,y))
Related
What I'm looking for is best explained by a picture: A line that "contours" the maxima of my points (like giving the "skyline" of the plot). I have a plot of scattered points with dense, (mostly) unique x coordinates (not equally distributed in either axis). I want a red line surfacing this plot:
What I've tried/thought of so far is, that a simple "draw as line" approach fails due to the dense nature of the data with unique x values and a lot of local maxima and minima (basically at every point). The same fact makes a mere "get maximum"-approach impossible.
Therefore I'm asking: Is there some kind of smoothing option for a plot? Or any existing "skyline" operator for a plot?
I am specifically NOT looking for a "contour plot" or a "skyline plot" (as in Bayesian skylineplot) - the terms would actually describe what I want, but unfortunately are already used for other things.
Here is a minimal version of what I'm working with so far, a negative example of lines not giving the desired results. I uploaded sample data here.
load("xy_lidarProfiles.RData")
plot(x, y,
xlab="x", ylab="y", # axis
pch = 20, # point marker style (1 - 20)
asp = 1 # aspect of x and y ratio
)
lines(x, y, type="l", col = "red") # makes a mess
You will get close to your desired result if you order() by x values. What you want then is a running maximum, which TTR::runMax() provides.
plot(x[order(x)], y[order(x)], pch=20)
lines(x[order(x)], TTR::runMax(y[order(x)], n=10), col="red", lwd=2)
You may adjust the window with the n= parameter.
I have overlayed violin plots comparing group A and group B scores for a particular section of a survey, facet wrapped by section. The scores are discrete 1-7 values. In some of these violin plots, the smoothing works as expected. In others, one group or the other looks very "wavy" between discrete scores (shown below).
I thought the problem may be a difference in the group sizes, but then surely the "waviness" would appear in all the section plots.
Also, this doesn't explain to me why the plots "dip in" despite being discrete 1-7 values.
When I add the adjust parameter it over-smooths the already smooth sections, so it's not quite ideal.
I use this code to create the plots
create_violin_across_groups_by_section <- function(data, test_group="first") {
g <- ggplot(data) +
aes(x=factor(nrow(data)),y=score,fill=group) +
geom_violin(alpha=0.5,position="identity") +
facet_wrap("section") +
labs(
title = paste("Comparison across groups for ", test_group)
)
return(g)
}
which results in something like this
in this case, "openness," is oddly wavy while the others all appear to be smoothed as normal.
I've thought perhaps it has something to do with the x=factor(nrow(data)) but again, surely the waviness would appear in all the section plots.
I would expect either all of the plots to be wavy (though I still wouldn't understand why) or all of them to have the same smoothness.
How can I make all of the facet-wrapped plots have the same smoothness, and why are they different in the first place?
Thanks all
The shape of the violin plot is calculated with a kernel density estimation. Kernel density estimations are designed for continuous data and not for discrete data, like your scores. While you can feed discrete data to the kernel estimator, the result may not always be beautiful or even meaningful. You can try to use different kernel and bw argument values in the geom_violin or you might consider something designed for discrete data, such as geom_dotplot.
+ geom_dotplot(binaxis = "y", stackdir = "center", position = "dodge")
Check out the corresponding example of geom_dotplot https://ggplot2.tidyverse.org/reference/geom_dotplot.html for a preview of how it can look like.
Check out the kernel and bw description of the violin plot https://ggplot2.tidyverse.org/reference/geom_violin.html that points to the density function https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/density for further information on how kernel density estimations are calculated.
I am trying to simulate a minefield by plotting two Poisson distributed samples in the same plot, one with a higher intensity and smaller area than the other. This is the minefield and the other is just noise (stones, holes, metal) seen as points. I cannot get R to plot the points with the same units in the axis. Whatever I do, the points span the entire plot, even though I only want the X points to cover a quarter of the plot. My R-code is just the following:
library(spatstat)
Y = rpoispp(c(5),win=owin(c(0,10),c(0,10)))
X = rpoispp(c(10),win=owin(c(0,5),c(0,5)))
Please let me know if you can help me.
My guess is that you are doing something like:
> plot(Y)
> plot(X)
to plot the points.
The problem with this is that the default behavior of the plot function for the class ppp (which is what the rpoispp function returns) is to create a new plot with just its points. So the second plot call essentially erases the first plot, and plots its own points in a differently scaled window. You can override this behavior by setting the option add=TRUE for the second plot. So the code
> plot(Y)
> plot(X, add=TRUE, cols="red")
should get you something like:
Check out the docs (help(plot.ppp)) for more explanation and other options to prettify the plot.
Let's say I have the following dataset
bodysize=rnorm(20,30,2)
bodysize=sort(bodysize)
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat=as.data.frame(cbind(bodysize,survive))
I'm aware that the glm plot function has several nice plots to show you the fit,
but I'd nevertheless like to create an initial plot with:
1)raw data points
2)the loigistic curve and both
3)Predicted points
4)and aggregate points for a number of predictor levels
library(Hmisc)
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
All fine up to here.
Now I want to plot the real data survival rates for a given levels of x1
dat$bd<-cut2(dat$bodysize,g=5,levels.mean=T)
AggBd<-aggregate(dat$survive,by=list(dat$bd),data=dat,FUN=mean)
plot(AggBd,add=TRUE)
#Doesn't work
I've tried to match AggBd to the dataset used for the model and all sort of other things but I simply can't plot the two together. Is there a way around this?
I basically want to overimpose the last plot along the same axes.
Besides this specific task I often wonder how to overimpose different plots that plot different variables but have similar scale/range on two-dimensional plots. I would really appreciate your help.
The first column of AggBd is a factor, you need to convert the levels to numeric before you can add the points to the plot.
AggBd$size <- as.numeric (levels (AggBd$Group.1))[AggBd$Group.1]
to add the points to the exisiting plot, use points
points (AggBd$size, AggBd$x, pch = 3)
You are best specifying your y-axis. Also maybe using par(new=TRUE)
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
#then
par(new=TRUE)
#
plot(AggBd$Group.1,AggBd$x,pch=30)
obviously remove or change the axis ticks to prevent overlap e.g.
plot(AggBd$Group.1,AggBd$x,pch=30,xaxt="n",yaxt="n",xlab="",ylab="")
giving:
I am using the sm package in R to draw a density plot of several variables with different sample sizes, like this:
var1 <- density(vars1[,1])
var2 <- density(vars2[,1])
var3 <- density(vars3[,1])
pdf(file="density.pdf",width=8.5,height=8)
plot(var1,col="BLUE")
par(new=T)
plot(var2,axes=FALSE,col="RED")
par(new=T)
plot(var3,axes=FALSE,col="GREEN")
dev.off()
The problem I'm having, is that I want the y-axis to show the proportions so I can compare the different variables with each other in a more meaningful way. The maxima of all three density plots are now exactly the same, and I'm pretty sure that they wouldn't be if the y-axis showed proportions. Any suggestions? Many thanks!
Edit:
I just learned that I should not plot on top of an existing plot, so now the plotting part of the code looks like this:
pdf(file="density.pdf",width=8.5,height=8)
plot(var1,col="BLUE")
lines(var2,col="RED")
lines(var3,col="GREEN")
dev.off()
The maxima of those lines however are now very much in line with the sample size differences. Is there a way to put the proportions on the y-axis for all three variables, so the area under the curve is equal for all three variables? Many thanks!
Don't plot on top of an existing plot, because they axes may be different. Instead, use lines() to plot the second and third densities after plotting the first. If necessary, adjust the ylim parameter in plot() so that they all fit.
An example for how sample size ought not matter:
set.seed(1)
D1 <- density(rnorm(1000))
D2 <- density(rnorm(10000))
D3 <- density(rnorm(100000))
plot(D1$x,D1$y,type='l',col="red",ylim=c(0,.45))
lines(D2$x,D2$y,lty=2,col="blue")
lines(D3$x,D3$y,lty=3,col="green")
You could make tim's solution a little more flexible by not hard-coding in the limits.
plot(D1$x,D1$y,type='l',col="red",ylim=c(0, max(sapply(list(D1, D2, D3),
function(x) {max(x$y)}))))
This would also cater for Vincent's point that the density functions are not necessarily constrained in their range.