Scatter plots become unreadable when the number of points is large.
So, for example, using a normal approximation we could draw a contour plot instead.
My question: is there any package that can produce such a contour plot from a scatter plot?
Thank you @G5W!! I can do it!!
You don't offer any data, so I will respond with some artificial data,
constructed at the bottom of the post. You also don't say how much data
you have, although you say it is a large number of points. I am illustrating
with 20000 points.
You used the group number as the plotting character to indicate the group.
I find that hard to read. But just plotting the points doesn't show the
groups well. Coloring each group a different color is a start, but does
not look very good.
plot(x,y, pch=20, col=rainbow(3)[group])
Two tricks that can make a large number of points more understandable are:
1. Make the points transparent, so that the dense places appear darker.
2. Reduce the point size.
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
That looks somewhat better, but did not address your actual request.
Your sample picture seems to show confidence ellipses. You can get
those using the function dataEllipse from the car package.
library(car)
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
dataEllipse(x,y,factor(group), levels=c(0.70,0.85,0.95),
plot.points=FALSE, col=rainbow(3), group.labels=NA, center.pch=FALSE)
But if there are really a lot of points, the points can still overlap
so much that they are just confusing. You can also use dataEllipse
to create what is basically a 2D density plot without showing the points
at all. Just plot several ellipses of different sizes over each other filling
them with transparent colors. The center of the distribution will appear darker.
This can give an idea of the distribution for a very large number of points.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
You can get a more continuous look by plotting more ellipses and leaving out the border lines.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=seq(0.11,0.99,0.02),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.05, lty=0)
Please try different combinations of these to get a nice picture of your data.
Additional response to comment: Adding labels
Perhaps the most natural place to add group labels is the centers of the
ellipses. You can get that by simply computing the centroids of the points in each group. So for example,
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
plot.points=FALSE, col=rainbow(3), group.labels=NA,
center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
## Now add labels
for(i in unique(group)) {
text(mean(x[group==i]), mean(y[group==i]), labels=i)
}
Note that I just used the number as the group label, but if you have a more elaborate name, you can change labels=i to something like
labels=GroupNames[i].
Data
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
You can use hexbin::hexbin() to show very large datasets.
@G5W gave a nice dataset:
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
If you don't know the group information, then the ellipses are inappropriate; this is what I'd suggest:
library(hexbin)
plot(hexbin(x,y))
which produces a hexagonal-binning plot of the point counts.
If you really want contours, you'll need a density estimate to plot. The MASS::kde2d() function can produce one; see the examples in its help page for plotting a contour based on the result. This is what it gives for this dataset:
library(MASS)
contour(kde2d(x,y))
Code:
require(ggplot2)
set.seed(0)
xvar <- rnorm(100)
ggplot(data.frame(xvar), aes(xvar)) + geom_density(fill="lightblue") + scale_y_log10()
The graph is something like this:
How can I make the graph shade on the right side of (viz. below) the density estimate?
The problem is that stat_density by default fills between the density and the y=0 line of the transformed data. So transformations that alter the y=0 line will fall victim to problems of this sort. I personally think this is a bug in ggplot2, although since graphical grammar experts probably argue that y-transformed densities are meaningless, the bug may not get a lot of attention.
A very kludgy workaround is to manually add an offset to ..density.., which you will have to explicitly invoke, and then change the breaks to make it look like you didn't do anything weird.
require(ggplot2)
require(scales)
set.seed(0)
xvar <- rnorm(100000)
quartz(height=4,width=6)
ggplot(data.frame(xvar), aes(x=xvar, y=log10(..density..)+4)) +
geom_density(fill='lightblue') +
scale_y_continuous(breaks=c(0,1,2,3,4),
labels=c('0.0001', '0.001', '0.01', '0.1','1'), limits=c(0,4),
name='density')
quartz.save('StackOverflow_29111741_v2.png')
That code produces this graph:
This isn't a ggplot2 or even an R issue but is simply an issue with the tails of a probability distribution being undersampled for your sample sizes. The log axis can go down forever, taking infinitely long to "reach" zero, but no finite sample size can ever hope to cover the increasingly improbable regions of the distribution.
So, to make the plot pretty, you need to both (a) increase the number of points from 100 to 10,000 or higher, while (b) keeping the plot ylims the same. (Otherwise the extra data you draw in your rnorm call will sparsely populate the tails of the gaussian even farther away from the mean, convincing ggplot2 to make automatic y axis limits even lower, in the range of the poorly-sampled tails, and the noisiness that you don't like will return.)
require(ggplot2)
require(scales)
set.seed(0)
xvar <- rnorm(100000)
ggplot(data.frame(xvar), aes(xvar)) +
geom_density(fill="lightblue") +
scale_y_continuous(trans=log10_trans(), limits = c(0.01, 1))
This generates this plot, which I think is what you want.
I am trying to plot a set of data in R
x <- c(1,4,5,3,2,25)
My Y scale is fixed at 20, so the last data point is effectively not visible on the plot if I execute the following code:
plot(x, ylim=c(0,20), type='l')
I want to show the range of the outlying data point in a smaller box above the plot, with an independent Y scale, representing only this last data point.
Is there any package for, or way to approach, this problem?
You may try axis.break from the plotrix package (http://rss.acs.unt.edu/Rdoc/library/plotrix/html/axis.break.html), which lets you define the axis to break and the style, size and color of the break marker; a sketch follows below.
The potential disadvantage of this approach is that it can distort the perception of the trend. Good luck!
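A minimal sketch of how axis.break might be combined with your plot call; the break position (18) and the slash style are arbitrary choices for illustration, not from the original answer:
library(plotrix)
x <- c(1, 4, 5, 3, 2, 25)
plot(x, ylim=c(0,20), type='l')                 # the outlying value 25 is clipped by ylim
axis.break(axis=2, breakpos=18, style="slash")  # mark a break near the top of the y axis
plotrix also provides gap.plot(), which draws the data with a gap in the axis and may handle the clipped outlier more directly.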
I have two related problems.
Problem 1: I'm currently using the code below to generate a histogram overlayed with a density plot:
hist(x,prob=T,col="gray")
axis(side=1, at=seq(0,100, 20), labels=seq(0,100,20))
lines(density(x))
I've pasted the data (i.e. x above) here.
I have two issues with the code as it stands:
1. The last tick and label (100) of the x-axis do not appear on the histogram/plot. How can I put them on?
2. I'd like the y-axis to show counts (frequency) rather than density, but I'd like to retain the density curve as an overlay on the histogram. How can I do this?
Problem 2: using a similar solution to problem 1, I now want to overlay three density plots (not histograms), again with frequency on the y-axis instead of density. The three data sets are at:
http://pastebin.com/z5X7yTLS
http://pastebin.com/Qg8mHg6D
http://pastebin.com/aqfC42fL
Here's an answer to your first two questions:
myhist <- hist(x,prob=FALSE,col="gray",xlim=c(0,100))
dens <- density(x)
axis(side=1, at=seq(0,100, 20), labels=seq(0,100,20))
lines(dens$x,dens$y*(1/sum(myhist$density))*length(x))
The histogram has a bin width of 5, which is also equal to 1/sum(myhist$density). The density(x)$x values, by contrast, advance in small steps, around 0.2 in your case (512 evenly spaced points). sum(density(x)$y) is some strange number, definitely not 1, but that is only because of those small steps: divided by one over the step size it is approximately 1, i.e. sum(density(x)$y)/(1/diff(density(x)$x)[1]) is roughly 1. You don't need to adjust for this, because the density curve is already matched up with its own x values. So scale the density by 1) the bin width of hist() and 2) the number of observations, length(x), as DWin says. The last axis tick became visible after setting the xlim argument.
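If you want to verify that scaling yourself, a quick check (reusing dens and myhist from above):
sum(dens$y) * diff(dens$x)[1]    # the density curve integrates to approximately 1
1/sum(myhist$density)            # the histogram bin width, 5 here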
For your problem 2, set up a plot with the correct dimensions (xlim and ylim) and type = "n", then draw 3 lines for the densities, scaled using something similar to the density line above. Think, however, about whether you want those semi-continuous lines to reflect the heights of imaginary bars with bin width 5: you can see how that might make the density lines exaggerate the counts at any particular point.
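A sketch of that setup, assuming the three pastebin data sets have already been read into vectors x1, x2 and x3 (hypothetical names) and keeping 5 as the reference bin width:
d1 <- density(x1); d2 <- density(x2); d3 <- density(x3)
binwidth <- 5
# empty plot sized to hold all three frequency-scaled curves
plot(NA, type="n", xlab="x", ylab="Frequency",
     xlim=range(d1$x, d2$x, d3$x),
     ylim=c(0, binwidth * max(d1$y*length(x1), d2$y*length(x2), d3$y*length(x3))))
lines(d1$x, d1$y * binwidth * length(x1), col="red")
lines(d2$x, d2$y * binwidth * length(x2), col="blue")
lines(d3$x, d3$y * binwidth * length(x3), col="darkgreen")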
Although this is an aged thread, in case anyone comes across it: whether it is a good idea to forego translating the y density to a count scale depends on what the user is attempting to do.
There are perfectly good reasons for using frequency as the y value. One idea in particular that comes to mind is that using counts for the y scale value can give an analyst a good idea about where to begin the 'data hunt' for stratifying heterogenous data, if a mixed distribution model cannot soundly or intuitively be applied.
In practice, overlaying a density estimate on the observed histogram can be very useful for data quality checks. For example, if I were looking at the graphic above as a single source of data, assuming that it describes "one thing" that I wish to model as "one thing", I have an issue: the data are heterogeneous and may require some level of stratification. The density overlay then becomes a simple visual tool for detecting that heterogeneity (apart from using log transformations to smooth between-interval variation) and for locating the mixed distributions when stratifying the data.
I am plotting a boxplot without the outliers and I would like to create a new plot in the same Cartesian space that the boxplot used. Is there a way of extracting the plotting values of a plot?
I first thought about saving the result in an object, but it seems to contain no plotting-related parameters.
my_plot <- boxplot(a ~ b, outline=F)
But the values inside my_plot only concern statistical information, not plotting.
How can I get the final range (ylim) of the boxplot?
UPDATE: Nick Sabbe's suggestion (par("yaxp")[1:2]) partially works: it correctly returns the values of the outermost tick labels on the Y axis. The correct way is to use par('usr'), which returns the extremes of the plotting area in the form (x1, x2, y1, y2). Thanks, Nick, for pointing me in the right direction.
I haven't tested this for boxplots, but for normal scatterplots, par("yaxp") gives you interesting information wrt the y axis. So you can use, IIRC, par("yaxp")[1:2] to get the current outer limits of the y axis. This doesn't always do exactly what you want, but typically it does. Let us know if it works for your boxplot...
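A minimal sketch of both suggestions, using made-up data for a and b (the original question doesn't show its data):
b <- rep(1:3, each=30)
a <- rnorm(90) + b
boxplot(a ~ b, outline=FALSE)
par("yaxp")[1:2]   # positions of the outermost y-axis tick labels
par("usr")         # plot-region extremes as c(x1, x2, y1, y2)
par("usr")[3:4]    # the y range (ylim) actually used by the boxplot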