R: Density plot vs Density plot in ggplot2

I am trying to make some density plots in R. I originally used base R's plot(density()) but switched to ggplot2 because I prefer how it looks.
I drew the same data with both (see below), but the plots are not identical. It looks like some of the y-values have been lost or dropped in the ggplot2 version (right plot). Is there any particular reason for this? How can I make the ggplot2 plot identical to the base density plot (left plot)?
Code:
library(ggplot2)
library(grid)
par(mfrow=c(1,2))
# define function to create multi-plot setup (nrow, ncol)
vp.setup <- function(x,y){
grid.newpage()
pushViewport(viewport(layout = grid.layout(x,y)))
}
# define function to easily access layout (row, col)
vp.layout <- function(x,y){
viewport(layout.pos.row=x, layout.pos.col=y)
}
vp.setup(1,2)
dat <- read.table(textConnection("
low high
10611.0 14195.0
10759.0 14437.0
10807.0 14574.0
10714.0 14380.0
10768.0 14448.0
10601.0 14239.0
10579.0 14218.0
10806.0 14510.0
"), header=TRUE, sep="\t")
plot(density(dat$low))
dat.low = data.frame(low2 = c(dat$low), lines = rep(c("low")))
low_plot_gg = (ggplot(dat.low, aes(x = low2, fill = lines)) +
stat_density(aes(y = ..density..)) +
coord_cartesian(xlim = c(10300, 11000))
)
print(low_plot_gg, vp=vp.layout(1,2))

Based on some trial and error, it looks like you want
+ xlim(c(10300,11000))
rather than
+ coord_cartesian(xlim = c(10300, 11000))
coord_cartesian extends the limits of the plots but doesn't change what's drawn inside them at all ...

It's not a problem of lost values. plot(density()) extends the smoothed curve beyond the extreme values, but that estimate is not very accurate for such a small dataset. For a bigger dataset the two plots will be the same.
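To see where the base R curve gets its extra tails, here is a minimal base R sketch using the low column from the question. By default density() evaluates the estimate on a grid that runs cut = 3 bandwidths past the smallest and largest observation; restricting the grid to the data range with from and to cuts those tails off. Whether ggplot2 restricts the grid in exactly this way when no scale limits are set is an assumption here, not something checked against its source.

```r
# 'low' values copied from the question
low <- c(10611, 10759, 10807, 10714, 10768, 10601, 10579, 10806)

# Default: grid extends cut * bw (cut = 3) beyond the data extremes
d_full <- density(low)

# Grid restricted to the observed range (assumed to resemble the ggplot2 plot)
d_trim <- density(low, from = min(low), to = max(low))

range(low)
range(d_full$x)  # wider than range(low)
range(d_trim$x)  # equal to range(low)

plot(d_full, main = "Full grid (solid) vs. data range only (dashed)")
lines(d_trim, lty = 2)
```

With only eight observations the bandwidth is large relative to the spread of the data, so the tails make up a visible share of the curve.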

Related

PCA plotting - What is the difference between ggfortify and ggplot?

I'm new to PCA. I'm plotting the scores using autoplot from ggfortify and using ggplot. Both plots have the same shape but different values on the x and y axes. E.g. autoplot goes from -0.2 to 0.2 on the y-axis, and ggplot goes from -0.6 to 0.6. The points on the graphs look exactly the same; only the axis values changed. Why is that?
Edit:
I can't really give the full data here as it's very long. I tried these two:
library(ggfortify)
pca.data <- prcomp(my_data)
autoplot(pca.data)
and
my_dataframe <- data.frame(Sample = rownames(pca.data$x),
X = pca.data$x[,1],
Y = pca.data$x[,2])
ggplot(data = my_dataframe, aes(x=X, y=Y, label=Sample)) +
geom_point() +
xlab("PC1") +
ylab("PC2") +
ggtitle("PCA Graph")
According to the vignette, autoplot scales in the same way as the biplot() function. If you don't want it to, you can instead use:
autoplot(pca.data, scale=0)
which (except for axis labels) gives the same result as the ggplot command that you used.
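For reference, the scaling that biplot() applies to prcomp scores can be reproduced by hand. The sketch below uses the built-in USArrests data as a stand-in for my_data (an assumption, since the original data was not posted) and follows the formula in stats:::biplot.prcomp with its default scale = 1: each score column is divided by its standard deviation times sqrt(n). That autoplot() applies the same factor is taken from the answer above, not verified here.

```r
# Stand-in data; replace with your own matrix or data frame
pca <- prcomp(USArrests)

n      <- nrow(pca$x)
scores <- pca$x[, 1:2]

# biplot.prcomp with scale = 1 divides each score column by sdev * sqrt(n)
lam    <- pca$sdev[1:2] * sqrt(n)
scaled <- sweep(scores, 2, lam, "/")

range(scores[, "PC2"])  # raw scores, as in the hand-built ggplot
range(scaled[, "PC2"])  # same point pattern, much smaller axis values
```

Because every column is divided by a single constant, the relative positions of the points are unchanged; only the axis values shrink.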

How to plot histograms of raw data on the margins of a plot of interpolated data

I would like to show in the same plot interpolated data and a histogram of the raw data of each predictor. I have seen in other threads like this one, people explain how to do marginal histograms of the same data shown in a scatter plot, in this case, the histogram is however based on other data (the raw data).
Suppose we see how price is related to carat and table in the diamonds dataset:
library(ggplot2)
p = ggplot(diamonds, aes(x = carat, y = table, color = price)) + geom_point()
We can add a marginal frequency plot e.g. with ggMarginal
library(ggExtra)
ggMarginal(p)
How do we add something similar to a tile plot of predicted diamond prices?
library(mgcv)
model = gam(price ~ s(table, carat), data = diamonds)
newdat = expand.grid(seq(55,75, 5), c(1:4))
names(newdat) = c("table", "carat")
newdat$predicted_price = predict(model, newdat)
ggplot(newdat,aes(x = carat, y = table, fill = predicted_price)) +
geom_tile()
Ideally, the histograms go even beyond the margins of the tileplot, as these data points also influence the predictions. I would, however, be already very happy to know how to plot a histogram for the range that is shown in the tileplot. (Maybe the values that are outside the range could just be added to the extreme values in different color.)
PS. I managed to more or less align histograms to the margins of the sides of a tile plot, using the method of the accepted answer in the linked thread, but only if I removed all kind of labels. It would be particularly good to keep the color legend, if possible.
EDIT:
eipi10 provided an excellent solution. I tried to modify it slightly to add the sample size in numbers and to graphically show values outside the plotted range since they also affect the interpolated values.
I intended to include them in a different color in the histograms at the side. I hereby attempted to count them towards the lower and upper end of the plotted range. I also attempted to plot the sample size in numbers somewhere on the plot. However, I failed with both.
This was my attempt to graphically illustrate the sample size beyond the plotted area:
plot_data = diamonds
plot_data <- transform(plot_data, carat_range = ifelse(carat < 1 | carat > 4, "outside", "within"))
plot_data <- within(plot_data, carat[carat < 1] <- 1)
plot_data <- within(plot_data, carat[carat > 4] <- 4)
plot_data$carat_range = as.factor(plot_data$carat_range)
p2 = ggplot(plot_data, aes(carat, fill = carat_range)) +
geom_histogram() +
thm +
coord_cartesian(xlim=xrng)
I tried to add the sample size in numbers with geom_text. I tried fitting it in the far right panel but it was difficult (/impossible for me) to adjust. I tried to put it on the main graph (which would anyway probably not be the best solution), but it didn’t work either (it removed the histogram and legend, on the right side and it did not plot all geom_texts). I also tried to add a third row of plots and writing it there. My attempt:
n_table_above = nrow(subset(diamonds, table > 75))
n_table_below = nrow(subset(diamonds, table < 55))
n_table_within = nrow(subset(diamonds, table >= 55 & table <= 75))
text_p = ggplot()+
geom_text(aes(x = 0.9, y = 2, label = paste0("N(>75) = ", n_table_above)))+
geom_text(aes(x = 1, y = 2, label = paste0("N = ", n_table_within)))+
geom_text(aes(x = 1.1, y = 2, label = paste0("N(<55) = ", n_table_below)))+
thm
library(egg)
pobj = ggarrange(p2, ggplot(), p1, p3,
ncol=2, widths=c(4,1), heights=c(1,4))
grid.arrange(pobj, leg, text_p, ggplot(), widths=c(6,1), heights =c(6,1))
I would be very happy to receive help on either or both tasks (adding sample size as text & adding values outside plotted range in a different color).
Based on your comment, maybe the best approach is to roll your own layout. Below is an example. We create the marginal plots as separate ggplot objects and lay them out with the main plot. We also extract the legend and put it outside the marginal plots.
Set-up
library(ggplot2)
library(cowplot)
# Function to extract legend
#https://github.com/hadley/ggplot2/wiki/Share-a-legend-between-two-ggplot2-graphs
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend) }
thm = list(theme_void(),
guides(fill="none"),
theme(plot.margin=unit(rep(0,4), "lines")))
xrng = c(0.6,4.4)
yrng = c(53,77)
Plots
p1 = ggplot(newdat, aes(x = carat, y = table, fill = predicted_price)) +
geom_tile() +
theme_classic() +
coord_cartesian(xlim=xrng, ylim=yrng)
leg = g_legend(p1)
p1 = p1 + thm[-1]
p2 = ggplot(diamonds, aes(carat)) +
geom_line(stat="density") +
thm +
coord_cartesian(xlim=xrng)
p3 = ggplot(diamonds, aes(table)) +
geom_line(stat="density") +
thm +
coord_flip(xlim=yrng)
plot_grid(
plot_grid(plotlist=list(p2, ggplot(), p1, p3), ncol=2,
rel_widths=c(4,1), rel_heights=c(1,4), align="hv", scale=1.1),
leg, rel_widths=c(5,1))
UPDATE: Regarding your comment about the space between the plots: This is an Achilles heel of plot_grid and I don't know if there's a way to fix it. Another option is ggarrange from the experimental egg package, which doesn't add so much space between plots. Also, you need to save the output of ggarrange first and then lay out the saved object with the legend. If you run ggarrange inside grid.arrange you get two overlapping copies of the plot:
# devtools::install_github('baptiste/egg')
library(egg)
pobj = ggarrange(p2, ggplot(), p1, p3,
ncol=2, widths=c(4,1), heights=c(1,4))
grid.arrange(pobj, leg, widths=c(6,1))
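On the follow-up about showing observations that fall outside the plotted range: one possible approach (a sketch on toy numbers, not part of the original answer) is to flag each value first and only then clamp it to the range, so the out-of-range observations pile up in the end bins and can be filled with a different color. The vector carat and the limits rng below are made up for illustration.

```r
carat <- c(0.3, 0.9, 1.2, 2.5, 4.0, 4.7)  # toy values
rng   <- c(1, 4)                          # plotted range

# Flag before clamping, otherwise the information is lost
carat_range <- ifelse(carat < rng[1] | carat > rng[2], "outside", "within")

# Clamp to the plotted range
carat_clamped <- pmin(pmax(carat, rng[1]), rng[2])

plot_data <- data.frame(carat = carat_clamped, carat_range = factor(carat_range))
table(plot_data$carat_range)  # counts usable as sample-size labels
```

The resulting data frame can be used in place of diamonds for the marginal histogram, with fill = carat_range in aes().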

R: Density plot with colors by group?

I have data from 2 populations.
I'd like to get the histogram and density plot of both on the same graphic.
With one color for one population and another color for the other one.
I've tried this (example):
library(ggplot2)
AA <- rnorm(100000, 70,20)
BB <- rnorm(100000,120,20)
valores <- c(AA,BB)
grupo <- c(rep("AA", 100000),c(rep("BB", 100000)))
todo <- data.frame(valores, grupo)
ggplot(todo, aes(x=valores, fill=grupo, color=grupo)) +
geom_histogram(aes(y=..density..), binwidth=3)+ geom_density(aes(color=grupo))
But I'm just getting a graphic with a single line and a single color.
I would like to have different colors for the two density lines. And if possible the histograms as well.
I've done it with ggplot2 but base R would also be OK.
or I don't know what I've changed and now I get this:
ggplot(todo, aes(x=valores, fill=grupo, color=grupo)) +
geom_histogram( position="identity", binwidth=3, alpha=0.5)+
geom_density(aes(color=grupo))
but the density lines were not plotted.
or even strange things like
I suggest this ggplot2 solution:
ggplot(todo, aes(valores, color=grupo)) +
geom_histogram(position="identity", binwidth=3, aes(y=..density.., fill=grupo), alpha=0.5) +
geom_density()
@skan: Your attempt was close, but you plotted the frequencies instead of density values in the histogram.
A base R solution could be:
hist(AA, probability = T, col = rgb(1,0,0,0.5), border = rgb(1,0,0,1),
xlim=range(AA,BB), breaks= 50, ylim=c(0,0.025), main="AA and BB", xlab = "")
hist(BB, probability = T, col = rgb(0,0,1,0.5), border = rgb(0,0,1,1), add=T)
lines(density(AA))
lines(density(BB), lty=2)
For alpha I used rgb. But there are more ways to get it in. See alpha() in the scales package for instance. I added also the breaks parameter for the plot of the AAs to increase the binwidth compared to the BB group.
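The difference between frequencies and density values mentioned above can be checked with base R alone. In this sketch (toy data, seed chosen arbitrarily), the density heights of a histogram with equal-width bins are the counts divided by n times the bin width, so the bars integrate to 1 and sit on the same scale as density().

```r
set.seed(1)
x <- rnorm(1000, mean = 70, sd = 20)

h  <- hist(x, breaks = 30, plot = FALSE)
bw <- diff(h$breaks)            # bin widths (all equal here)

# Density height = count / (n * bin width)
manual_density <- h$counts / (sum(h$counts) * bw)

all.equal(manual_density, h$density)
sum(h$density * bw)             # total area under the bars
```

Raw counts, by contrast, grow with the sample size and the bin width, which is why a density line drawn on top of a count histogram ends up flattened against the x-axis.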

How to plot points on hexbin graph in R?

I have two sets of data that I need to plot on the same graph. One set is very large (~ 10⁶) and I want to plot it with hexbin; the other is very small (~ 10) and I want to plot it as individual points. How do I plot points on top of the hexbin plot?
The closest I got was this:
bin = hexbin(x, y)
plot(bin)
pushViewport(dataViewport(x, y))
grid.points(x, y)
I appreciate any help :)
Assuming you are using the hexbin package...
library(hexbin)
library(grid)
# some data from the ?hexbin help
set.seed(101)
x <- rnorm(10000)
y <- rnorm(10000)
z <- w <- -3:3
# hexbin
bin <- hexbin(x, y)
# plot - look at str(p)
p <- plot(bin)
# push plot viewport
pushHexport(p$plot.vp)
# add points
grid.points(z, w, pch=16, gp=gpar(col="red"))
upViewport()
You can use the ggplot2 package for that task; see the code below. Just replace the data frame passed to the data argument of geom_point with the one holding the points you want to plot.
library(ggplot2)
library(hexbin)
ggplot(diamonds, aes(carat, price)) + stat_binhex() + geom_point(data = diamonds[c(1,10,100,1000), ], aes(carat, price), size=10, color = 'red' )
Try this... it should work fine.
Just create a panel.function within your hexbinplot function:
hexbinplot(d.frame$X ~ d.frame$Y
,aspect=...,cex.title=...
,panel=function(x, y, ...){
panel.hexbinplot(x,y,...)
# panel.curve(...) # optional stuff
# panel.text(...) # optional stuff
panel.points(x=c(25,50),y=c(100,150),pch=20,cex=3.2)
}
)
take a look for instance at: How to add points to multi-panel Lattice graphics bwplot?

How to draw a clipped density plot in ggplot2 without missing sections

I would like to use ggplot2 to draw a lattice plot of densities produced from different methods, in which the same yaxis scale is used throughout.
I would like to set the upper limit of the y axis to a value below the highest density value for any one method. However ggplot by default removes sections of the geom that are outside of the plotted region.
For example:
# Toy example of problem
xval <- rnorm(10000)
#Base1
plot(density(xval))
#Base2
plot(density(xval), ylim=c(0, 0.3)) # densities > 0.3 not removed from plot
xval <- as.data.frame(xval)
ggplot(xval, aes(x=xval)) + geom_density() #gg1 - looks like Base1
ggplot(xval, aes(x=xval)) + geom_density() + ylim(0, 0.3)
#gg2: does not look like Base2 due to removal of density values > 0.3
These produce the images below:
How can I make the ggplot image not have the missing section?
Using xlim() or ylim() directly will drop all data points that are not within the specified range. This yields the discontinuity of the density plot. Use coord_cartesian() to zoom in without losing the data points.
ggplot(xval, aes(x=xval)) +
geom_density() +
coord_cartesian(ylim = c(0, 0.3))
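The gap can be reproduced in base R by mimicking what the answer describes. This is only an emulation: values above the limit are replaced by NA before drawing (the "drop" behaviour attributed to ylim() above), whereas passing ylim to plot() merely changes the visible window, like coord_cartesian(). It is not ggplot2's internal code.

```r
set.seed(42)
d <- density(rnorm(10000))
upper <- 0.3

# Emulate dropping: out-of-range heights become NA, so the line breaks
y_dropped <- ifelse(d$y > upper, NA, d$y)

op <- par(mfrow = c(1, 2))
plot(d$x, y_dropped, type = "l", ylim = c(0, upper), main = "values dropped")
plot(d$x, d$y,       type = "l", ylim = c(0, upper), main = "window zoomed")
par(op)
```

In the left panel the curve stops where it crosses the limit; in the right panel it runs up to the edge of the plotting region and is clipped there.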
