I have a boxplot which summarizes ~60000 turbidity data points into quartiles, median, whiskers and sometimes outliers. Often a few outliers are so high that the whole plot is compressed at the bottom, so I choose to omit the outliers. However, I have also added the averages to the plots as points, and I want these to be plotted always. The problem is that the y-axis of the boxplot does not adjust to the added average points, so when an average is far above the box it is simply plotted outside the chart window (see the X-point for 2020, but none for 2021 or 2022). With this parameter, the average normally lies between the whisker end and the most extreme outliers; this is expected in the data.
I have tried to capture the boxplot y-axis range to compare it with the average, and then set ylim if needed, but I just don't know how to retrieve these axis ranges.
My code is just
boxplot(...)
points(...)
and works as far as plotting the points. Just not adjusting the y-axis.
Question 1: is it not possible to get the boxplot to redraw with the new points data? I thought this was standard in R plots.
Question 2: if not, how can I dynamically adjust the y-axis range?
Let's try to show a concrete example of the problem with some simulated data:
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))
Here, group "B" has a single outlier at 150, even though most values are a couple of orders of magnitude lower. That means that if we try to draw a boxplot, the boxes get squished at the bottom of the plot:
boxplot(y ~ x, data = df, col = "lightblue")
If we remove outliers, the boxes plot nicely:
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
The problem comes when we want to add a point indicating the mean value for each boxplot, since the mean of "B" lies outside the plot limits. Let's calculate and plot the means:
mean_vals <- sapply(split(df$y, df$x), mean)
mean_vals
#> A B
#> 0.9840417 4.0703334
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
The mean for "B" is missing because it lies above the upper range of the plot.
The secret here is to use boxplot.stats to get the limits of the whiskers. By concatenating our vector of means to this vector of stats and getting its range, we can set our plot limits exactly where they need to be:
y_limits <- range(c(boxplot.stats(df$y)$stats, mean_vals))
Now we apply these limits to a new boxplot and draw it with the points:
boxplot(y ~ x, data = df, outline = FALSE, ylim = y_limits, col = "lightblue")
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
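One caveat with the limits above: boxplot.stats(df$y) is computed on the pooled data, so its whiskers need not match the per-group whiskers that boxplot() actually draws. A minimal sketch of a per-group variant, assuming the same simulated df as above, asks boxplot() for its statistics without drawing anything (plot = FALSE) and takes the range of those together with the means:

```r
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))

# Per-group means, in the same order as the boxes
mean_vals <- sapply(split(df$y, df$x), mean)

# Compute the boxplot statistics without drawing.
# bp$stats has one column per group: lower whisker, lower hinge,
# median, upper hinge, upper whisker.
bp <- boxplot(y ~ x, data = df, plot = FALSE)

# Limits that cover every group's whiskers and every mean
y_limits <- range(bp$stats, mean_vals)

boxplot(y ~ x, data = df, outline = FALSE, ylim = y_limits, col = "lightblue")
points(seq_along(mean_vals), mean_vals, cex = 2, pch = 16, col = "red")
```

If the axis range of an already drawn plot is needed instead, par("usr")[3:4] returns the current y-axis limits of the active base graphics device.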
For comparison, you could do the whole thing in ggplot like this:
library(ggplot2)
ggplot(df, aes(x, y)) +
geom_boxplot(fill = "lightblue", outlier.shape = NA) +
geom_point(size = 3, color = "red", stat = "summary", fun = mean) +
coord_cartesian(ylim = range(c(boxplot.stats(df$y)$stats, mean_vals))) +
theme_classic(base_size = 16)
Created on 2023-02-05 with reprex v2.0.2
I would like to ask for some advice on a cartography project in R with raster / spplot that I am currently working on. I am a novice, so I apologize in advance if the methods I used are not at all optimal!
=> So:
I have a raster object and almost got what I wanted, but I have trouble with the legend and the result looks kind of childish. I'd like to get something a bit more "professional".
I'd like to 1) improve the overall aesthetics and 2) add legends on my plot such as this concentric bubble size legend proposed in this other post: create a concentric circle legend.
Here is what I have right now: death rate and exposure in France
What I think might improve the map:
Use a concentric circles bubble legend for hospital volume and put it on the top right corner
Add transparency to my points. Here I have 13 bubbles, but the real map has about 600 with many overlapping (especially in Paris area).
Add a legend to my colour gradient
If you have any tips / comments do not hesitate! I'm a beginner but eager to learn :)
I've enclosed simplified but complete code (13 hospitals instead of 600, data completely edited, variable names changed... so no need to interpret it!). I've edited it so that you can just copy / paste easily.
####################################################################
####################################################################
# 1) DATA PREPARATION
# Packages
library(raster)
library(rgeos)
library(latticeExtra)
library(sf)
# Mortality dataset
french_regions=c("IDF", "NE", "NO", "SE", "SO")
death_rates_reg=c(0.032,0.014,0.019,0.018,0.021)
region_mortality=data.frame(french_regions,death_rates_reg)
# Hospital dataset
hospital_id=1:13
expo=c(0.11,0.20,0.17,0.25,0.18,0.05,0.07,0.25,0.40,0.70,0.45,0.14,0.80)
volume=sample(1:200, 13, replace=TRUE)
lat=c(44.8236,48.8197,45.7599,45.2785,48.9183,50.61,43.6356,47.9877,48.8303,48.8302,48.8991,43.2915,48.7232)
long=c(-0.57979,7.78697,4.79666,6.3421,2.52365,3.03763,3.8914,-4.095,2.34038,2.31117,2.33083,5.56335,2.45025)
french_hospitals=data.frame(hospital_id,expo,volume,lat,long)
# French regions map object - merge of departments according to phone codes
formes <- getData(name="GADM", country="FRA", level=2)
formes$NAME_3=0 # NAME_3 = new mega-regions IDF, NE, NO, SE, SO
formes$NAME_3[formes$NAME_1=="Auvergne-Rhône-Alpes"]="SE"
formes$NAME_3[formes$NAME_1=="Bourgogne-Franche-Comté"]="NE"
formes$NAME_3[formes$NAME_1=="Bretagne"]="NO"
formes$NAME_3[formes$NAME_1=="Centre-Val de Loire"]="NO"
formes$NAME_3[formes$NAME_1=="Corse"]="SE"
formes$NAME_3[formes$NAME_1=="Grand Est"]="NE"
formes$NAME_3[formes$NAME_1=="Hauts-de-France"]="NE"
formes$NAME_3[formes$NAME_1=="Île-de-France"]="IDF"
formes$NAME_3[formes$NAME_1=="Normandie"]="NO"
formes$NAME_3[formes$NAME_1=="Nouvelle-Aquitaine"]="SO"
formes$NAME_3[formes$NAME_1=="Occitanie"]="SO"
formes$NAME_3[formes$NAME_1=="Pays de la Loire"]="NO"
formes$NAME_3[formes$NAME_1=="Provence-Alpes-Côte d'Azur"]="SE"
formes$NAME_3[formes$NAME_2=="Aude"]="SE"
formes$NAME_3[formes$NAME_2=="Gard"]="SE"
formes$NAME_3[formes$NAME_2=="Hérault"]="SE"
formes$NAME_3[formes$NAME_2=="Lozère"]="SE"
formes$NAME_3[formes$NAME_2=="Pyrénées-Orientales"]="SE"
groups = aggregate(formes, by = "NAME_3")
# Colour palettes
couleurs_death=colorRampPalette(c('gray100','gray50'))
couleurs_expo=colorRampPalette(c('green','gold','red','darkred'))
# Hospitals bubble sizes and colours
my_colours=couleurs_expo(401)
french_hospitals$bubble_color="Initialisation"
french_hospitals$indice=round(french_hospitals$expo*400,digits=0)+1
french_hospitals$bubble_size=french_hospitals$volume*(1.5/50)
# Look up each hospital's colour by its index (vectorised, no loop needed)
french_hospitals$bubble_color=my_colours[french_hospitals$indice]
####################################################################
####################################################################
# 2) MAP
# Assignation of death rates to regions
idx <- match(groups$NAME_3, region_mortality$french_regions)
concordance <- region_mortality[idx, "death_rates_reg"]
groups$outcome_char <- concordance
# First map: region colours = death rates
graphA=spplot(groups, "outcome_char", col.regions=couleurs_death(500),
par.settings = list(fontsize = list(text = 12)),
main=list(label=" ",cex=1),colorkey = list(space = "bottom", height = 0.85))
# Second map: hospital bubbles = exposure
GraphB=graphA + layer(panel.points(french_hospitals[,c(5,4)],col=french_hospitals$bubble_color,pch=20, cex=french_hospitals$bubble_size))
# Addition of the legend
Bubble_location=matrix(data=c(-4.0,-2.0,0.0,-4.0,-2.0,0.0,42.3,42.3,42.3,41.55,41.55,41.55),nrow=6,ncol=2)
GraphC1=GraphB + layer(panel.points(Bubble_location, col=c(my_colours[5],my_colours[125],my_colours[245],"black","black","black"), pch=19,cex=c(2.5,2.5,2.5,5.0,2.0,1.0)))
Bubble_location2=matrix(data=c(-3.4,-1.27,0.55, -3.65, -3.3 , -3.4,-1.52,0.48,42.31,42.31,42.31,42.55,41.9, 41.56,41.56,41.56),nrow=8,ncol=2)
GraphC2=GraphC1+layer(panel.text(Bubble_location2, label=c("0%","30%","60%", "Exposure:", "Hospital volume:", "125","50","25"), col="black", cex=1.0))
# Final map
GraphC2
Thank you in advance for your help! (I know this is a lot, do not feel forced to dive in the code)
It isn't pretty, but I think this can get you started, barring a more complete answer from someone else. I'd suggest using ggplot instead of spplot. The only thing you need to do is convert your sp object to sf to integrate with ggplot. The bubble legend needs a lot of guess and check, so I'll leave that up to you...
Map layout design is still better in GIS software, in my opinion.
library(sf)
library(ggplot2)
# Convert sp to sf
groups_sf <- st_as_sf(groups)
# Make reference dataframe for concentric bubble legend
bubble_legend <- data.frame(x = c(8.5, 8.5, 8.5), y = c(50, 50, 50), size = c(3, 6, 9))
ggplot() +
geom_sf(data = groups_sf) +
geom_point(data = french_hospitals, aes(x = long, y = lat, color = indice, size = bubble_size), alpha = 0.7) +
geom_point(data = bubble_legend, aes(x = x, y = y + size/50), size = bubble_legend$size, shape = 21, color = "black", fill = NA) +
geom_text(data = bubble_legend, aes(x = x + 0.5, y = y + size/50, label = size), size = 3) +
scale_color_gradient(low = "green", high = "red") +
guides(size="none")
Let me know what you think. I can help troubleshoot more if there are any issues.
Thank you for your answer Skaqqs, very appreciated. This is in my opinion a good step forward!! I tried it quickly on the real data and it already looks way better, especially with the transparency.
I can't really show more since that's sensitive data on a trendy topic and we want to keep it confidential as much as possible until article submission.
I'll move on from this good starting base and update you.
Thank you :)
I am trying to recreate an image found in a textbook in R, the original of which was built in MATLAB:
I have generated each of the graphs separately, but what would be the best practice for combining them into an image like this in ggplot2?
Edit: Provided code used. This is just a transformation of normally distributed data.
library(ggplot2)
mean <- 6
sd <- 1
X <- rnorm(100000, mean = mean, sd = sd)
Y <- dnorm(X, mean = mean, sd = sd)
Y_p <- pnorm(X, mean = mean, sd = sd)
# Logistic transformation; vectorised, so no explicit loop is needed
ch_vars <- function(X){
  1/(1 + exp(-X + 5))
}
nu_X <- ch_vars(X)
nu_Y <- ch_vars(Y)
# Give every column a unique name that matches the aesthetics used below
data <- data.frame(X = X, Y = Y, Y_p = Y_p, nu_X = nu_X, nu_Y = nu_Y)
# Cumulative distribution
ggplot(data = data) +
geom_line(aes(x = X, y = Y_p))
# Distribution of initial data
ggplot(data = data, aes(x = X)) +
geom_histogram(aes(y = ..density..), bins = 25, fill = "red", color = "black")
# Distribution of transformed data
ggplot(data = data, aes(x = nu_X)) +
geom_histogram(aes(y = ..density..), bins = 25, fill = "green", color = "black")
In short, you can't, or rather, you shouldn't.
ggplot is a high-level plotting package. More than a system for drawing shapes and lines, it's fairly "opinionated" about how data should be represented, and one of its opinions is that a plot should express a clear relationship between its axes and marks (points, bars, lines, etc.). The axes essentially define a coordinate space, and the marks are then plotted onto the space in a straightforward and easily interpretable manner.
The plot you show breaks that relationship -- it's a set of essentially arbitrary histograms all drawn onto the same box, where the axis values become ambiguous. The x-axis represents the values of 1 histogram and the y-axis represents another (and thus neither axis represents the histograms' heights).
It is of course technically possible to force ggplot to render something like your example, but it would require pre-computing the histograms, normalizing their values and bin heights to a common coordinate space, converting these into suitable coordinates for use with geom_rect, and then re-labeling the plot axes. It would be a very large amount of manual effort and ultimately defeats the point of using a high-level plotting grammar like ggplot.
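For illustration only, the manual route described above might be sketched roughly as follows. The sample size, the scaling shares (0.3 of the y range and 2 x-units) and the data frame names x_bars, y_bars and curve_df are assumptions made for this sketch, not part of the original figure.

```r
library(ggplot2)

set.seed(42)
X <- rnorm(10000, mean = 6, sd = 1)
nu_X <- 1 / (1 + exp(-X + 5))   # logistic transform used in the question

# Pre-compute both histograms without plotting them
hx <- hist(X, breaks = 25, plot = FALSE)
hy <- hist(nu_X, breaks = 25, plot = FALSE)

# Histogram of X: bars rise from y = 0, tallest bar scaled to 0.3 (arbitrary)
x_bars <- data.frame(xmin = head(hx$breaks, -1), xmax = tail(hx$breaks, -1),
                     ymin = 0, ymax = 0.3 * hx$density / max(hx$density))

# Histogram of the transformed values: bars grow rightwards from the left
# edge of the panel, tallest bar scaled to 2 x-units (arbitrary)
y_bars <- data.frame(ymin = head(hy$breaks, -1), ymax = tail(hy$breaks, -1),
                     xmin = min(hx$breaks))
y_bars$xmax <- y_bars$xmin + 2 * hy$density / max(hy$density)

# The transformation curve itself
curve_df <- data.frame(x = seq(min(hx$breaks), max(hx$breaks), length.out = 200))
curve_df$y <- 1 / (1 + exp(-curve_df$x + 5))

p <- ggplot() +
  geom_rect(data = x_bars, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "red", colour = "black") +
  geom_rect(data = y_bars, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "green", colour = "black") +
  geom_line(data = curve_df, aes(x = x, y = y)) +
  labs(x = "x", y = "transformed x")
print(p)
```

Note that the bar heights here no longer correspond to any axis value, which is exactly the ambiguity discussed above.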
I am making a density map in R using ggmap and stat_density2d. The code looks like this:
riverside <- get_map('Riverside, IL', zoom = 14 , color = 'bw' )
RiversideMap <- ggmap(riverside, extent = 'device', legend = 'topleft')
# make the map:
RiversideMap +
stat_density2d(aes(x = lon, y = lat,
fill = ..level.. , alpha = ..level..),size = .01, bins = 16,
data = myData, geom = 'polygon') +
scale_fill_gradient(low = "yellow", high = "blue") +
scale_alpha(range = c(.0, 0.3), guide = FALSE)
The density shown in the map's color legend is normalized in stat_density2d by requiring the integral of the density over area equals 1.
In the map, the units of the x and y axes are decimal degrees. (For example, a point is specified by the coordinates lat = 41.81888 and lon = -87.84147).
For ease of interpretation, I would like to make two changes to the values of the density as displayed in the map legend.
First, I'd like the integral of the density to be N (the number of data points - or addresses - in the data set) rather than 1. So the values displayed in the legend need to be multiplied by N = nrow(myData).
Second, I'd like the unit of distance to be kilometers rather than decimal degrees. For the latitudes and longitudes that I am plotting, this requires dividing the values displayed in the legend by 9203.
With the default normalization of density in stat_density2d, I get these numbers in the legend: c(2000,1500,1000,500).
Taking N = 1600 and performing the above re-scalings, this becomes c(348, 261, 174, 87) (= 1600/9203 * 2000 etc). Obviously, these are not nice round numbers, so it would be even better if the legend numbers were say c(400,300,200,100) with their locations in the legend color bar adjusted accordingly.
The advantage of making these re-scalings is that the density in the map becomes easy to interpret: it is just the number of people per square km (rather than the probability density of people per square degree).
Is there an easy way to do this? I am new to ggmap and ggplot2. Thanks in advance.
In brief, use:
scale_fill_continuous(labels = scales::unit_format(unit = "k", scale = 1e-3))
This link is great help for managing scales, axes and labels: https://ggplot2-book.org/scales.html
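To get the specific rescaling asked for in the question (multiply by N, divide by 9203) with round legend values, one option is to place the breaks at the raw density values that correspond to the desired round numbers and label them with those numbers. This is only a sketch on simulated coordinates: myData, nice_levels and the helper to_per_km2 are stand-ins, the factor 9203 is taken from the question, and after_stat() assumes ggplot2 3.3.0 or later (older code uses ..level.. instead).

```r
library(ggplot2)

set.seed(1)
N <- 1600
# Simulated stand-in for the real addresses
myData <- data.frame(lon = rnorm(N, -87.84, 0.01),
                     lat = rnorm(N, 41.82, 0.01))

# Convert a density per square degree (integrating to 1)
# into a count per square km, using the factor from the question
to_per_km2 <- function(d) d * N / 9203

# Round values wanted in the legend, and the raw densities they correspond to
nice_levels <- c(100, 200, 300, 400)
raw_breaks  <- nice_levels * 9203 / N

p <- ggplot(myData, aes(x = lon, y = lat)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.3) +
  scale_fill_gradient(low = "yellow", high = "blue",
                      breaks = raw_breaks, labels = nice_levels,
                      name = "per sq. km")
print(p)

# The rescaling reproduces the numbers quoted in the question
round(to_per_km2(c(2000, 1500, 1000, 500)))
```

Breaks that fall outside the range of the computed density levels are simply not shown in the legend.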
I have been trying to create a map of membership locations from postcodes across the UK as a project in learning R. I have achieved nearly the result I wanted, but it's proving very frustrating getting the glitches sorted. This image is my current best effort:
I still want to change:
get rid of the extraneous legend (the "0.16", "0.5" squares), which are coming from the size arg to geom_point. If I remove the size=0.16 arg the guide/legend disappears, but the geom size returns to the default too. This also happens for the "black" guide -- coming from a colour obviously -- but why?
properly clip the stat_density2d polygons, which are exhibiting undesirable behaviour when clipped (see bottom-right plot near the top)
have control over the line-width of the geom_path that includes the county boundaries: it's currently too thick (would like about 1/2 thickness shown) but all I can achieve by including 'size' values is to make the lines stupidly thick - so thick that they obscure the whole map.
The R code uses revgeocode() to find the placename closest to the centre point but I don't know how to include the annotation on the map. I would like to include it in a text-box over the North Sea (top right of UK maps), maybe with a line/arrow to the point itself. A simpler option could just be some text beneath the UK map, below the x-axis ... but I don't know how to do that. geom_rect/geom_text seem fraught in this context.
Finally, I wanted to export the map to a high-res image, but when I do that everything changes again, see:
which shows the high-res (~1700x1800px) image on the left and the Rstudio version (~660x720px) on the right. The proportions of the maps have changed and the geom_text and geom_point for the centre point are now tiny. I would be happy if the gap between the two map rows was always fairly small, too (rather than just small at high res).
Code
The basics: read list of members postcodes, join with mySociety table of postcode<>OSGB locations, convert locations to Lat/long with spTransform, calculate binhex and density layers, plot with ggmap.
The code for all this is somewhat lengthy so I have uploaded it as a Gist:
https://gist.github.com/rivimey/ee4ab39a6940c0092c35
but for reference the 'guts' of the mapping code is here:
# Get a stylised base map for the whole-of-uk maps.
map.bbox = c(left = -6.5, bottom = 49.5, right = 2, top = 58)
basemap.uk <- get_stamenmap(bb = map.bbox, zoom=calc_zoom(map.bbox), maptype="watercolor")
# Calculate the density plot - a continuous approximation.
smap.den <- stat_density2d(aes(x = lat, y = lon, fill = ..level.., alpha = ..level..),
data = membs.wgs84.df, geom = "polygon",
breaks=2/(1.5^seq(0,12,by=1)), na.rm = TRUE)
# Create a point on the map representing the centroid, and label it.
cmap.p <- geom_point(aes(x = clat, y = clon), show_guide = FALSE, data = centroid.df, alpha = 1)
cmap.t1 <- geom_text(aes(x = clat, y = clon+0.22, label = "Centre", size=0.16), data = centroid.df)
cmap.t2 <- geom_text(aes(x = clat, y = clon+0.1, label = "Centre", size=0.25), data = centroid.df)
# Create an alternative presentation, as binned hexagons, which is more true to the data.
smap.bin <- geom_hex(aes(x = lat, y = lon),
data = membs.wgs84.df, binwidth = c(0.15, 0.1), alpha = 0.7, na.rm = TRUE)
# Create a path for the county and country boundaries, to help identify map regions.
bounds <- geom_path(aes(x = long, y = lat, group = group, colour = "black"), show_guide = FALSE,
data = boundaries.subset, na.rm = TRUE)
# Create the first two actual maps: a whole-uk binned map, and a whole-uk density map.
map.bin <- ggmap(basemap.uk) + smap.bin + grad + cmap.p + cmap.t1
map.den <- ggmap(basemap.uk) + smap.den + alpha + cmap.p + cmap.t1
# Create a zoomed-in map for the south-east, to show greater detail. I would like to use this
# bbox but google maps don't respect it :(
map.lon.bbox = c(left = -1, bottom = 51, right = 1, top = 52)
# Get a google terrain map for the south-east, bbox roughly (-1.7,1.7, 50.1, 53)
basemap.lon <- get_map(location = c(0,51.8), zoom = 8, maptype="terrain", color = "bw")
# Create a new hexbin with more detail than earlier.
smap.lon.bin <- geom_hex(aes(x = lat, y = lon),
data = membs.wgs84.df, bins=26, alpha = 0.7, na.rm = TRUE)
# Now create the last two maps: binned and density maps for London and the SE.
lonmap.bin <- ggmap(basemap.lon) + bounds + smap.lon.bin + grad + cmap.p + cmap.t2
lonmap.den <- ggmap(basemap.lon) + bounds + smap.den + alpha + cmap.p + cmap.t2
# Arrange the maps in 2x2 grid, and tell the grid code to let the first row be taller than the second.
multiplot(map.bin, lonmap.bin, map.den, lonmap.den, heights = unit( c(10,7), "null"), cols=2 )