I need to look for correlations in the publicly available flights package. I managed to make a scatter plot using ggplot.
With the code:
library(nycflights13)
library(ggplot2)
ggplot(flights, aes(x = arr_delay, y = dep_delay)) +
geom_point(size = 2) +
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
As shown in the image, most of the points are concentrated in the bottom left. Is there any way to make this graph more visually appealing by spreading the plotted values better?
You can plot your points using the alpha parameter, which sets their degree of transparency (between 0 and 1, with 1 being fully opaque). This makes overlapping points easier to distinguish, and regions of the plot with a higher concentration of points look darker. The plot will look better, too.
Start with a value of alpha = 0.7 then experiment with it until you get the best results.
ggplot(flights, aes(x = arr_delay, y = dep_delay)) +
geom_point(size = 2, alpha = 0.7) +
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95)
I used facets to separate the flights that arrived early or on time (arr_delay <= 0) from those that arrived late (arr_delay > 0). The relationship seems different.
library(nycflights13)
library(dplyr)
library(ggplot2)
ff <- flights %>%
filter(!is.na(arr_delay), origin=="LGA") %>% # Filtered to reduce waiting time!
mutate(`Arrival time`=ifelse(arr_delay<=0, "Early", "Delayed"))
ggplot(ff, aes(x = arr_delay, y = dep_delay)) +
geom_point(size = 2, alpha = 0.3) +
geom_smooth(method="auto", fullrange=FALSE, level=0.95) +
facet_wrap(~`Arrival time`, scales="free", labeller=label_both) +
labs(x="Arrival delay (minutes)", y="Departure delay (minutes)")
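You could use aggregated data for the points and the raw data for the smoother.
library("ggplot2")
library("nycflights13")
# Bin flights by departure delay (10-minute bins) and average within each bin
flights <- within(flights, {
bin <- floor(dep_delay / 10)
av_arr <- ave(arr_delay, bin, FUN = function(v) mean(v, na.rm = TRUE))
av_dep <- ave(dep_delay, bin, FUN = function(v) mean(v, na.rm = TRUE))
})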
ggplot(flights) +
geom_point(aes(x=av_arr, y=av_dep), size=2) +
geom_smooth(aes(x=arr_delay, y=dep_delay), method="auto", se=TRUE,
fullrange=FALSE, level=0.95)
Given an example boxplot like this in ggplot2:
ggplot(diamonds, aes(carat, price)) +
geom_boxplot(aes(group = cut_width(carat, 0.25)), outlier.alpha = 0.1) +
stat_smooth( method="lm", formula = y ~ poly(x,2), n= 40, se=TRUE, color="red", aes(group=1), size=1.5)
I get an image that looks like this:
However, the stat_smooth line is greatly influenced by the number of points in each of the carat categories. I would prefer to treat each of the categories equally, which would mean, to my mind, weighting each point with a particular carat value by the inverse of the total number of points with that value. (So, at 5, the point would have a weight of 1, and at 1, the point would have a weight of 1/aBigNumber.) I've tried adding the weight aesthetic to the plot, but it breaks the boxplot. I've tried adding the weight to the smooth, but I get an error:
Error: ggplot2 doesn't know how to deal with data of class uneval
So, how do I weight a smoothing function so that the categories are treated equally (that is inverse to the number of points in the category), and still keep the boxplot in the output?
You could do something like this...
library(dplyr)
diamonds2 <- diamonds %>% mutate(cutcarat=cut_width(carat, 0.25)) %>%
group_by(cutcarat) %>%
summarise(carat=mean(carat), price=mean(price))
ggplot() +
geom_boxplot(data=diamonds,
aes(x=carat, y=price, group = cut_width(carat, 0.25)),
outlier.alpha = 0.1) +
geom_smooth(data=diamonds2,
aes(x=carat, y=price), method="lm",
formula = y ~ poly(x,2), n= 40, se=TRUE, color="red", size=1.5)
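The answer above sidesteps the weighting by smoothing over per-bin means. For comparison, here is a minimal base R sketch of the inverse-count weighting the question describes. The data are synthetic stand-ins (not diamonds), the column names x, y, bin and w are made up for illustration, and the fixed 0.25-wide bins only approximate what cut_width(carat, 0.25) does.

```r
set.seed(1)
# Synthetic stand-in for diamonds: many small x values, few large ones
x <- c(runif(2000, 0, 1), runif(200, 1, 3), runif(20, 3, 5))
y <- 100 * x^2 + rnorm(length(x), sd = 50)
d <- data.frame(x, y)

# Bin x into fixed-width intervals (0.25 wide)
d$bin <- cut(d$x, breaks = seq(0, 5, by = 0.25), include.lowest = TRUE)

# Weight each point by 1 / (number of points in its bin),
# so every non-empty bin contributes a total weight of 1
counts <- table(d$bin)
d$w <- 1 / as.vector(counts[d$bin])

fit_unweighted <- lm(y ~ poly(x, 2), data = d)
fit_weighted   <- lm(y ~ poly(x, 2), data = d, weights = w)

# Total weight per non-empty bin (each should be 1)
bin_weight <- tapply(d$w, d$bin, sum)
print(bin_weight[!is.na(bin_weight)])
```

In ggplot2 the same column could probably be supplied only to the smoothing layer, e.g. geom_smooth(aes(weight = w), method = "lm", formula = y ~ poly(x, 2)), which would leave the boxplot layer untouched.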
I am trying to plot two variables where N=700K. The problem is that there is too much overlap, so that the plot becomes mostly a solid block of black. Is there any way of having a grayscale "cloud" where the darkness of the plot is a function of the number of points in a region? In other words, instead of showing individual points, I want the plot to be a "cloud": the more points in a region, the darker that region.
One way to deal with this is alpha blending, which makes each point slightly transparent, so regions with more points plotted on them appear darker.
This is easy to do in ggplot2:
df <- data.frame(x = rnorm(5000),y=rnorm(5000))
ggplot(df,aes(x=x,y=y)) + geom_point(alpha = 0.3)
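To get a feel for which alpha value to pick, it can help to compute how quickly overlapping points saturate. This is a rough sketch that assumes the graphics device composites with the usual "over" rule, in which n stacked points of opacity alpha give a combined opacity of 1 - (1 - alpha)^n; the helper name stacked_opacity is made up for illustration.

```r
# Effective opacity where n points with the same alpha overlap,
# assuming standard "over" compositing: 1 - (1 - alpha)^n
stacked_opacity <- function(alpha, n) 1 - (1 - alpha)^n

overlaps <- c(1, 2, 5, 10, 20)
opacity_table <- data.frame(
  n          = overlaps,
  alpha_0.30 = round(stacked_opacity(0.30, overlaps), 3),
  alpha_0.05 = round(stacked_opacity(0.05, overlaps), 3)
)
print(opacity_table)
```

Under that assumption, alpha = 0.3 is already almost fully opaque after about ten overlapping points, so very large data sets usually need a much smaller alpha to keep dense regions distinguishable.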
Another convenient way to deal with this (and probably more appropriate for the number of points you have) is hexagonal binning:
ggplot(df,aes(x=x,y=y)) + stat_binhex()
And there is also regular old rectangular binning (image omitted), which is more like your traditional heatmap:
ggplot(df,aes(x=x,y=y)) + geom_bin2d()
An overview of several good options in ggplot2:
library(ggplot2)
x <- rnorm(n = 10000)
y <- rnorm(n = 10000, sd=2) + x
df <- data.frame(x, y)
Option A: transparent points
o1 <- ggplot(df, aes(x, y)) +
geom_point(alpha = 0.05)
Option B: add density contours
o2 <- ggplot(df, aes(x, y)) +
geom_point(alpha = 0.05) +
geom_density_2d()
Option C: add filled density contours
(Note that the points distort the perception of the colors underneath; this may look better without points.)
o3 <- ggplot(df, aes(x, y)) +
stat_density_2d(aes(fill = stat(level)), geom = 'polygon') +
scale_fill_viridis_c(name = "density") +
geom_point(shape = '.')
Option D: density heatmap
(Same note as C.)
o4 <- ggplot(df, aes(x, y)) +
stat_density_2d(aes(fill = stat(density)), geom = 'raster', contour = FALSE) +
scale_fill_viridis_c() +
coord_cartesian(expand = FALSE) +
geom_point(shape = '.', col = 'white')
Option E: hexbins
(Same note as C.)
o5 <- ggplot(df, aes(x, y)) +
geom_hex() +
scale_fill_viridis_c() +
geom_point(shape = '.', col = 'white')
Option F: rugs
Possibly my favorite option. Not quite as flashy, but visually simple and simple to understand. Very effective in many cases.
o6 <- ggplot(df, aes(x, y)) +
geom_point(alpha = 0.1) +
geom_rug(alpha = 0.01)
Combine in one figure:
cowplot::plot_grid(
o1, o2, o3, o4, o5, o6,
ncol = 2, labels = 'AUTO', align = 'v', axis = 'lr'
)
You can also have a look at the ggsubplot package. This package implements features which were presented by Hadley Wickham back in 2011 (http://blog.revolutionanalytics.com/2011/10/ggplot2-for-big-data.html).
(In the following, I include the "points"-layer for illustration purposes.)
library(ggplot2)
library(ggsubplot)
# Make up some data
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each=5000),
xvar = c(rep(1:20,250) + rnorm(5000,sd=5),rep(16:35,250) + rnorm(5000,sd=5)),
yvar = c(rep(1:20,250) + rnorm(5000,sd=5),rep(16:35,250) + rnorm(5000,sd=5)))
# Scatterplot with subplots (simple)
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) +
geom_subplot2d(aes(xvar, yvar,
subplot = geom_bar(aes(rep("dummy", length(xvar)), ..count..))), bins = c(15,15), ref = NULL, width = rel(0.8), ply.aes = FALSE)
However, this feature really shines if you have a third variable to control for.
# Scatterplot with subplots (including a third variable)
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1, aes(color = factor(cond))) +
geom_subplot2d(aes(xvar, yvar,
subplot = geom_bar(aes(cond, ..count.., fill = cond))),
bins = c(15,15), ref = NULL, width = rel(0.8), ply.aes = FALSE)
Or another approach would be to use smoothScatter():
smoothScatter(dat[2:3])
Alpha blending is easy to do with base graphics as well.
df <- data.frame(x = rnorm(5000),y=rnorm(5000))
with(df, plot(x, y, col="#00000033"))
The first six digits after the # are the color in RGB hex and the last two are the opacity, again in hex, so 33 (51 out of 255) is about 20% opaque.
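The arithmetic behind that alpha suffix can be checked directly in R. This small sketch uses only base functions (strtoi and rgb); the variable names are illustrative.

```r
# Alpha channel "33" (hex) as a fraction of full opacity (FF = 255)
alpha_hex  <- "33"
alpha_int  <- strtoi(alpha_hex, base = 16L)
alpha_frac <- alpha_int / 255

# Build the same colour string from numeric components
col_black_20 <- rgb(0, 0, 0, alpha = alpha_int, maxColorValue = 255)

print(c(alpha_int = alpha_int, alpha_frac = alpha_frac))
print(col_black_20)
```

Going the other way, rgb() with an alpha argument is a convenient way to build a semi-transparent colour for plot() without writing the hex string by hand.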
You can also use density contour lines (ggplot2):
df <- data.frame(x = rnorm(15000),y=rnorm(15000))
ggplot(df,aes(x=x,y=y)) + geom_point() + geom_density2d()
Or combine density contours with alpha blending:
ggplot(df,aes(x=x,y=y)) +
geom_point(colour="blue", alpha=0.2) +
geom_density2d(colour="black")
You may find the hexbin package useful. From the help page of hexbinplot:
library(hexbin)
mixdata <- data.frame(x = c(rnorm(5000),rnorm(5000,4,1.5)),
y = c(rnorm(5000),rnorm(5000,2,3)),
a = gl(2, 5000))
hexbinplot(y ~ x | a, mixdata)
geom_pointdensity from the ggpointdensity package (recently developed by Lukas Kremer and Simon Anders (2019)) allows you to visualize density and individual data points at the same time:
library(ggplot2)
# install.packages("ggpointdensity")
library(ggpointdensity)
df <- data.frame(x = rnorm(5000), y = rnorm(5000))
ggplot(df, aes(x=x, y=y)) + geom_pointdensity() + scale_color_viridis_c()
My favorite method for plotting this type of data is the one described in this question - a scatter-density plot. The idea is to do a scatter-plot but to colour the points by their density (roughly speaking, the amount of overlap in that area).
It simultaneously:
clearly shows the location of outliers, and
reveals any structure in the dense area of the plot.
Here is the result from the top answer to the linked question:
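The linked answer is not reproduced here. As a rough sketch of the same idea in base R (not necessarily the method used in that answer), densCols() from grDevices assigns each point a colour based on a 2D kernel density estimate. It relies on the recommended KernSmooth package being installed, and the mock data and palette below are made up for illustration.

```r
set.seed(42)
# Mock data: a dense core plus a sparse, wide background
x <- c(rnorm(4500), rnorm(500, sd = 4))
y <- c(rnorm(4500), rnorm(500, sd = 4))

# densCols() returns one colour per point, taken from the palette
# according to the estimated local density around that point
pal <- colorRampPalette(c("grey80", "steelblue", "black"))
point_cols <- densCols(x, y, colramp = pal)

plot(x, y, col = point_cols, pch = 20, cex = 0.6,
     main = "Points coloured by local density")
```

Because every point is still drawn, outliers stay visible, while the colour gradient shows structure inside the dense core.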
I am plotting the results of 50 - 100 experiments.
Each experiment results in a time series.
I can plot a spaghetti plot of all time series, but
what I'd like to have is sort of a density map for the time series plume.
(something similar to the gray shading in the lower panel
in this figure: http://www.ipcc.ch/graphics/ar4-wg1/jpg/fig-6-14.jpg)
I can 'sort of' do this with 2d binning or binhex but the result could be prettier (see example below).
Here is a code that reproduces a plume plot for mock data (uses ggplot2 and reshape2).
library(ggplot2)
library(reshape2)
# mock data: random walk plus a sine curve.
# two envelopes for added contrast.
tt=10*sin(c(1:100)/(3*pi))
rr=apply(matrix(rnorm(5000),100,50),2,cumsum) +tt
rr2=apply(matrix(rnorm(5000),100,50),2,cumsum)/1.5 +tt
# stuff data into a dataframe and melt it.
df=data.frame(c(1:100),cbind(rr,rr2) )
names(df)=c("step",paste("ser",c(1:100),sep=""))
dfm=melt(df,id.vars = 1)
# ensemble average
ensemble_av=data.frame(step=df[,1],ensav=apply(df[,-1],1,mean))
ensemble_av$variable=as.factor("Mean")
ggplot(dfm,aes(step,value,group=variable))+
stat_binhex(alpha=0.2) + geom_line(alpha=0.2) +
geom_line(data=ensemble_av,aes(step,ensav,size=2))+
theme(legend.position="none")
Does anyone know of a nice way to get a shaded envelope with gradients? I have also tried geom_ribbon, but that did not give any indication of density changes along the plume. binhex does that, but not with aesthetically pleasing results.
Compute quantiles:
qs = data.frame(
do.call(
rbind,
tapply(
dfm$value, dfm$step, function(i){quantile(i)})),
t=1:100)
head(qs)
X0. X25. X50. X75. X100. t
1 -0.8514179 0.4197579 0.7681517 1.396382 2.883903 1
2 -0.6506662 1.2019163 1.6889073 2.480807 5.614209 2
3 -0.3182652 2.0480082 2.6206045 4.205954 6.485394 3
4 -0.1357976 2.8956990 4.2082762 5.138747 8.860838 4
5 0.8988975 3.5289219 5.0621513 6.075937 10.253379 5
6 2.0027973 4.5398120 5.9713921 7.015491 11.494183 6
Plot ribbons:
ggplot() +
geom_ribbon(data=qs, aes(x=t, ymin=X0., ymax=X100.),fill="gray30", alpha=0.2) +
geom_ribbon(data=qs, aes(x=t, ymin=X25., ymax=X75.),fill="gray30", alpha=0.2)
This is for two quantile intervals, (0-100) and (25-75). You'll need more args to quantile and more ribbon layers for more quantiles, and need to adjust the colours too.
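As a sketch of what "more quantiles and more ribbon layers" could look like, the following computes several symmetric quantile pairs in one long data frame and draws them as stacked semi-transparent ribbons. It uses its own mock ensemble (walks) rather than dfm, the column names are illustrative, and the plotting part only runs if ggplot2 is installed.

```r
set.seed(1)
# Mock ensemble: 50 random walks of 100 steps (columns = series)
walks <- apply(matrix(rnorm(5000), 100, 50), 2, cumsum)

# Symmetric quantile pairs: (0, 1), (0.1, 0.9), ..., (0.4, 0.6)
lower <- seq(0, 0.4, by = 0.1)
bands <- do.call(rbind, lapply(lower, function(p) {
  data.frame(
    t    = seq_len(nrow(walks)),
    band = p,
    ymin = apply(walks, 1, quantile, probs = p),
    ymax = apply(walks, 1, quantile, probs = 1 - p)
  )
}))

# One semi-transparent ribbon per band; overlapping bands darken the centre
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(bands, aes(x = t, ymin = ymin, ymax = ymax, group = band)) +
    geom_ribbon(fill = "gray30", alpha = 0.2)
  print(p)
}
```

Since the narrower bands sit inside the wider ones, the constant alpha accumulates towards the median, which gives a gradient without choosing colours per band.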
Based on the idea of Spacedman, I found a way to add more intervals in an automatic way: I first compute the quantiles for each step, group them by pairs of symmetric values and then use geom_ribbon in the right order...
library(tidyr)
library(dplyr)
condquant <- dfm %>% group_by(step) %>%
do(quant = quantile(.$value, probs = seq(0,1,.05)), probs = seq(0,1,.05)) %>%
unnest() %>%
mutate(delta = 2*round(abs(.5-probs)*100)) %>%
group_by(step, delta) %>%
summarize(quantmin = min(quant), quantmax= max(quant))
ggplot() +
geom_ribbon(data = condquant, aes(x = step, ymin = quantmin, ymax = quantmax,
group = reorder(delta, -delta), fill = as.numeric(delta)),
alpha = .5) +
scale_fill_gradient(low = "grey10", high = "grey95") +
geom_line(data = dfm, aes(x = step, y = value, group=variable), alpha=0.2) +
geom_line(data=ensemble_av,aes(step,ensav),size=2)+
theme(legend.position="none")
Thanks Erwan and Spacedman.
Avoiding 'tidyr', 'dplyr' and 'magrittr' (this uses melt() from reshape2 and ddply() from plyr instead), my version of Erwan's answer becomes
probs=c(0:10)/10 # use fewer quantiles than Erwan
arr=t(apply(df[,-1],1,quantile,prob=probs))
dfq=data.frame(step=df[,1],arr)
names(dfq)=c("step",colnames(arr))
dfqm=melt(dfq,id.vars=c(1))
# add inter-quantile (per) range as delta
dfqm$delta=dfqm$variable
levels(dfqm$delta)=abs(probs-rev(probs))*100
dfplot=ddply(dfqm,.(step,delta),summarize,
quantmin=min(value),
quantmax=max(value) )
ggplot() +
geom_ribbon(data = dfplot, aes(x = step, ymin = quantmin,
ymax =quantmax,group=rev(delta),
fill = as.numeric(delta)),
alpha = .5) +
scale_fill_gradient(low = "grey25", high = "grey75") +
geom_line(data=ensemble_av,aes(step,ensav),size=2) +
theme(legend.position="none")
Hi I really have googled this a lot without any joy. Would be happy to get a reference to a website if it exists. I'm struggling to understand the Hadley documentation on polar coordinates and I know that pie/donut charts are considered inherently evil.
That said, what I'm trying to do is
Create a donut/ring chart (so a pie with an empty middle) like the tikz ring chart shown here
Add a second layer circle on top (with alpha=0.5 or so) that shows a second (comparable) variable.
Why? I'm looking to show financial information. The first ring is costs (broken down) and the second is total income. The idea is then to add + facet=period for each review period to show the trend in both revenues and expenses and the growth in both.
Any thoughts would be most appreciated
Note: If an MWE is needed, suppose (completely arbitrarily) this was tried with
donut_data=iris[,2:4]
revenue_data=iris[,1]
facet=iris$Species
That would be similar to what I'm trying to do.. Thanks
I don't have a full answer to your question, but I can offer some code that may help get you started making ring plots using ggplot2.
library(ggplot2)
# Create test data.
dat = data.frame(count=c(10, 60, 30), category=c("A", "B", "C"))
# Add additional columns, needed for drawing with geom_rect.
dat$fraction = dat$count / sum(dat$count)
dat = dat[order(dat$fraction), ]
dat$ymax = cumsum(dat$fraction)
dat$ymin = c(0, head(dat$ymax, n=-1))
p1 = ggplot(dat, aes(fill=category, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(0, 4)) +
labs(title="Basic ring plot")
p2 = ggplot(dat, aes(fill=category, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect(colour="grey30") +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme_bw() +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
labs(title="Customized ring plot")
library(gridExtra)
png("ring_plots_1.png", height=4, width=8, units="in", res=120)
grid.arrange(p1, p2, nrow=1)
dev.off()
Thoughts:
You may get more useful answers if you post some well-structured sample data. You have mentioned using some columns from the iris dataset (a good start), but I am unable to see how to use that data to make a ring plot. For example, the ring plot you have linked to shows proportions of several categories, but neither iris[, 2:4] nor iris[, 1] are categorical.
You want to "Add a second layer circle on top": Do you mean to superimpose the second ring directly on top of the first? Or do you want the second ring to be inside or outside of the first? You could add a second internal ring with something like geom_rect(data=dat2, xmax=3, xmin=2, aes(ymax=ymax, ymin=ymin))
If your data.frame has a column named period, you can use facet_wrap(~ period) for facetting.
To use ggplot2 most easily, you will want your data in 'long-form'; melt() from the reshape2 package may be useful for converting the data.
Make some barplots for comparison, even if you decide not to use them. For example, try:
ggplot(dat, aes(x=category, y=count, fill=category)) +
geom_bar(stat="identity")
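The suggestion of a second, inner ring can be sketched as follows. This is only an illustration under assumptions: ring_extents is a hypothetical helper, the counts and category labels are made up, and the inner ring simply occupies x from 2 to 3 while the outer ring occupies 3 to 4. The plot is drawn only if ggplot2 is installed.

```r
# Hypothetical helper: cumulative angular extents for one ring
ring_extents <- function(count, category, xmin, xmax) {
  fraction <- count / sum(count)
  ymax <- cumsum(fraction)
  data.frame(category, fraction,
             ymin = c(0, head(ymax, -1)), ymax = ymax,
             xmin = xmin, xmax = xmax)
}

# Made-up numbers: outer ring = cost breakdown, inner ring = income
outer <- ring_extents(c(10, 60, 30), c("A", "B", "C"), xmin = 3, xmax = 4)
inner <- ring_extents(c(70, 30), c("Income 1", "Income 2"), xmin = 2, xmax = 3)
rings <- rbind(outer, inner)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(rings, aes(fill = category, ymin = ymin, ymax = ymax,
                         xmin = xmin, xmax = xmax)) +
    geom_rect(colour = "grey30") +
    coord_polar(theta = "y") +
    xlim(c(0, 4))
  print(p)
}
```

Keeping both rings in one long data frame also makes it straightforward to add a period column later and facet on it.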
Just trying to solve question 2 with the same approach from bdemarest's answer. Also using his code as a scaffold. I added some tests to make it more complete but feel free to remove them.
library(broom)
library(tidyverse)
# Create test data.
dat = data.frame(count=c(10,60,20,50),
ring=c("A", "A","B","B"),
category=c("C","D","C","D"))
# compute pvalue
cs.pvalue <- dat %>% spread(value = count,key=category) %>%
ungroup() %>% select(-ring) %>%
chisq.test() %>% tidy()
cs.pvalue <- dat %>% spread(value = count,key=category) %>%
select(-ring) %>%
fisher.test() %>% tidy() %>% full_join(cs.pvalue)
# compute fractions
#dat = dat[order(dat$count), ]
dat <- dat %>% group_by(ring) %>% mutate(fraction = count / sum(count),
ymax = cumsum(fraction),
ymin = c(0, head(ymax, -1)))
# Add x limits
baseNum <- 4
#numCat <- length(unique(dat$ring))
# ring is a character column, so convert via factor to get 1, 2, ...
dat$xmax <- as.numeric(factor(dat$ring)) + baseNum
dat$xmin = dat$xmax -1
# plot
p2 = ggplot(dat, aes(fill=category,
alpha = ring,
ymax=ymax,
ymin=ymin,
xmax=xmax,
xmin=xmin)) +
geom_rect(colour="grey30") +
coord_polar(theta="y") +
geom_text(inherit.aes = F,
x=c(-1,1),
y=0,
data = cs.pvalue,aes(label = paste(method,
"\n",
format(p.value,
scientific = T,
digits = 2))))+
xlim(c(0, 6)) +
theme_bw() +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank(),
panel.border = element_blank()) +
labs(title="Customized ring plot") +
scale_fill_brewer(palette = "Set1") +
scale_alpha_discrete(range = c(0.5,0.9))
p2
And the result:
I'd like to plot a mirrored 95% density curve and map alpha to the density:
foo <- function(mw, sd, lower, upper) {
x <- seq(lower, upper, length=500)
dens <- dnorm(x, mean=mw, sd=sd, log=TRUE)
dens0 <- dens -min(dens)
return(data.frame(dens0, x))
}
df.rain <- foo(0,1,-1,1)
library(ggplot2)
drf <- ggplot(df.rain, aes(x=x, y=dens0))+
geom_line(aes(alpha=..y..))+
geom_line(aes(x=x, y=-dens0, alpha=-..y..))+
stat_identity(geom="segment", aes(xend=x, yend=0, alpha=..y..))+
stat_identity(geom="segment", aes(x=x, y=-dens0, xend=x, yend=0, alpha=-..y..))
drf
This works fine, but I'd like to make the contrast between the edges and the middle more prominent, i.e., I want the edges to be nearly white and only the middle part to be black. I've been tampering with scale_alpha() but without luck. Any ideas?
Edit: Ultimately, I'd like to plot several raindrops, i.e., the individual drops will be small but the shading should still be clearly visible.
Instead of mapping dens0 to the alpha, I'd map it to color:
drf <- ggplot(df.rain, aes(x=x, y=dens0))+
geom_line(aes(color=..y..))+
geom_line(aes(x=x, y=-dens0, color=-..y..))+
stat_identity(geom="segment", aes(xend=x, yend=0, color=..y..))+
stat_identity(geom="segment", aes(x=x, y=-dens0, xend=x, yend=0, color=-..y..))
Now the contrast in color is still mainly present in the tails. Using two colors helps a bit (note that the switch in color is at 0.25):
drf + scale_color_gradient2(midpoint = 0.25)
Finally, to include the distribution of the dens0 values, I base the midpoint of the color scale on the median value in the data:
drf + scale_color_gradient2(midpoint = median(df.rain$dens0))
Note: however you tweak your data, most of the contrast is in the more extreme values of your dataset. Trying to mask this with a non-linear scale, or by tweaking a color scale as I did, could present a false picture of the real data.
Here is a solution using geom_ribbon() instead of geom_line()
df.rain$group <- seq_along(df.rain$x)
tmp <- tail(df.rain, -1)
tmp$group <- tmp$group - 1
tmp$dens0 <- head(df.rain$dens0, -1)
dataset <- rbind(head(df.rain, -1), tmp)
ggplot(dataset, aes(x = x, ymin = -dens0, ymax = dens0, group = group,
alpha = dens0)) + geom_ribbon() + scale_alpha(range = c(0, 1))
ggplot(dataset, aes(x = x, ymin = -dens0, ymax = dens0, group = group,
fill = dens0)) + geom_ribbon() +
scale_fill_gradient(low = "white", high = "black")
See Paul's answer for changing the colours.
dataset9 <- merge(dataset, data.frame(study = 1:9))
ggplot(dataset9, aes(x = x, ymin = -dens0, ymax = dens0, group = group,
alpha = dens0)) + geom_ribbon() + scale_alpha(range = c(0, 0.5)) +
facet_wrap(~study)
While pondering both your answers I actually found exactly what I was looking for. The easiest way is to simply use scale_colour_gradientn with a vector of greys.
library(RColorBrewer)
grey <- brewer.pal(9,"Greys")
drf <- ggplot(df.rain, aes(x=x, y=dens0, col=dens0))+
stat_identity(geom="segment", aes(xend=x, yend=0))+
stat_identity(geom="segment", aes(x=x, y=-dens0, xend=x, yend=0))+
scale_colour_gradientn(colours=grey)
drf