This should be fairly easy, but I can't find my way through.
tri_fill <- structure(
list(x= c(0.75, 0.75, 2.25, 3.25),
y = c(40, 43, 43, 40)),
.Names = c("x", "y"),
row.names = c(NA, -4L), class = "data.frame",Integrated=NA, Related=NA)
# install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
ggplot(data=tri_fill,aes(x=x, y=y))+
geom_polygon() +
scale_fill_gradient(limits=c(1, 4), low = "lightgrey", high = "red")
What I want is a gradient along the x-axis, but with the above I only get a legend with a gradient and the polygon with solid fill.
Here is a possible solution for when you have a relatively simple polygon. Instead of a polygon, we create lots of line-segments and color them by a gradient. The result will thus look like a polygon with a gradient.
#create data for 'n'segments
n_segs <- 1000
#x and xend are sequences spanning the entire range of 'x' present in the data
newpolydata <- data.frame(xstart=seq(min(tri_fill$x),max(tri_fill$x),length.out=n_segs))
newpolydata$xend <- newpolydata$xstart
#y's are a little more complicated: when x is below changepoint, y equals max(y)
#but when x is above the changepoint, the border of the polygon
#follow a line according to the formula y= intercept + x*slope.
#identify changepoint (very data/shape dependent)
change_point <- max(tri_fill$x[which(tri_fill$y==max(tri_fill$y))])
#calculate slope and intercept
slope <- (max(tri_fill$y)-min(tri_fill$y))/ (change_point - max(tri_fill$x))
intercept <- max(tri_fill$y)
#all lines start at same y
newpolydata$ystart <- min(tri_fill$y)
#calculate y-end
newpolydata$yend <- with(newpolydata, ifelse (xstart <= change_point,
max(tri_fill$y),intercept+ (xstart-change_point)*slope))
p2 <- ggplot(newpolydata) +
geom_segment(aes(x=xstart,xend=xend,y=ystart,yend=yend,color=xstart)) +
scale_color_gradient(limits=c(0.75, 4), low = "lightgrey", high = "red")
p2 #note that I've changed the lower border of the gradient.
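For polygons whose upper edge has more than one bend, computing the change point, slope and intercept by hand gets awkward. Here is a minimal sketch of an alternative, assuming you can list the vertices of the upper boundary from left to right: let base R's approx() do the linear interpolation. This is a swapped-in technique, not part of the solution above, and upper_x / upper_y are names made up for illustration.

```r
# Vertices of the upper boundary of tri_fill, left to right (assumed known)
upper_x <- c(0.75, 2.25, 3.25)
upper_y <- c(43, 43, 40)

# x positions of the segments, as in newpolydata$xstart
xstart <- seq(min(upper_x), max(upper_x), length.out = 1000)

# Linear interpolation between the vertices gives the top of each segment
yend <- approx(x = upper_x, y = upper_y, xout = xstart)$y

# Spot check: halfway along the sloped edge
mid_y <- approx(x = upper_x, y = upper_y, xout = 2.75)$y
mid_y
```

The resulting yend vector could be assigned to newpolydata$yend in place of the ifelse() call.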
EDIT: The solution above works if all you want is a polygon with a gradient. However, as was pointed out in the comments, it causes problems if you plan to map one variable to fill and another to color, because each aesthetic can only be used once. I have therefore modified the solution to plot very thin polygons instead of lines, so that the gradient can use the fill aesthetic.
#each 'id'/polygon needs four x-values and four y-values:
#we start at the lower left corner, then go to upper left, upper right and lower right.
n_polys <- 1000
#identify changepoint (very data/shape dependent)
change_point <- max(tri_fill$x[which(tri_fill$y==max(tri_fill$y))])
#calculate slope and intercept
slope <- (max(tri_fill$y)-min(tri_fill$y))/ (change_point - max(tri_fill$x))
intercept <- max(tri_fill$y)
#calculate sequence of borders: x, and accompanying lower and upper y coordinates
x_seq <- seq(min(tri_fill$x),max(tri_fill$x),length.out=n_polys+1)
y_max_seq <- ifelse(x_seq<=change_point, max(tri_fill$y), intercept + (x_seq - change_point)*slope)
y_min_seq <- rep(min(tri_fill$y), n_polys+1)
#create polygons/rectangles
poly_list <- lapply(1:n_polys, function(p){
res <- data.frame(x=rep(c(x_seq[p],x_seq[p+1]),each=2),
y = c(y_min_seq[p], y_max_seq[p:(p+1)], y_min_seq[p+1]))
res$fill_id <- x_seq[p]
res
}
)
poly_data <- do.call(rbind, poly_list)
#plot, allowing for both fill and color-aes
p3 <- ggplot(tri_fill, aes(x=x,y=y))+
geom_polygon(data=poly_data, aes(x=x,y=y, group=fill_id,fill=fill_id)) +
scale_fill_gradient(limits=c(0.75, 4), low = "lightgrey", high = "red") +
geom_point(aes(color=factor(y)),size=5)
p3
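As a sanity check on the strip construction, here is a small sketch with only 5 strips, chosen so that the change point at x = 2.25 falls exactly on a strip border. Each strip is a trapezoid, so the strip areas should add up to the area of the original polygon (a 1.5 x 3 rectangle plus a triangle with base 1 and height 3). The helper strip_area and the vectors x_verts / y_verts are hypothetical names for illustration.

```r
# Polygon vertices, copied from tri_fill
x_verts <- c(0.75, 0.75, 2.25, 3.25)
y_verts <- c(40, 43, 43, 40)
n_polys <- 5

# Same change point and slope logic as in the answer
change_point <- max(x_verts[y_verts == max(y_verts)])
slope <- (max(y_verts) - min(y_verts)) / (change_point - max(x_verts))

# Strip borders and the upper boundary at each border
x_seq <- seq(min(x_verts), max(x_verts), length.out = n_polys + 1)
y_max_seq <- ifelse(x_seq <= change_point, max(y_verts),
                    max(y_verts) + (x_seq - change_point) * slope)

# Area of strip p: width times mean height above the common base
strip_area <- function(p) {
  (x_seq[p + 1] - x_seq[p]) * (mean(y_max_seq[p:(p + 1)]) - min(y_verts))
}
total_area <- sum(sapply(seq_len(n_polys), strip_area))
total_area
```

If the change point does not land on a strip border, the strip that contains it cuts the corner slightly, which is one reason to use many thin strips.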
Hi, how can I calculate the overlapping area between 2 columns (or 2 subsets of a column) in R?
Please see the example data below:
set.seed(1234)
df <- data.frame(
Data=factor(rep(c("D1", "D2"), each=200)),
weight=round(c(rnorm(200, mean=55, sd=5),
rnorm(200, mean=65, sd=5)))
)
library(ggplot2)
plot <- ggplot(df, aes(weight,fill = Data))+
geom_density()
plot
This results in a plot of two overlapping density curves. How can I color the overlapping area and calculate the overlapping coefficient (OVL), similar to what is done in "Using Monte Carlo Integration"?
Please note that the linked example (and the example above) uses parametric distributions, whereas I am asking about a dataset of observed values.
I normally find it easier to work directly with the densities and plot them as geom_area. If you get the x-axis sampling points to match on the two distributions you can find the overlap area using pmin, and the sum of its values divided by the sum of the values for the two curves should give you the proportion of the total area that is overlapped.
d1dens <- with(df, density(weight[Data == "D1"],
from = min(weight),
to = max(weight)))
d2dens <- with(df, density(weight[Data == "D2"],
from = min(weight),
to = max(weight)))
joint <- pmin(d1dens$y, d2dens$y)
df2 <- data.frame(x = rep(d1dens$x, 3),
y = c(d1dens$y, d2dens$y, joint),
Data = rep(c("D1", "D2", "overlap"), each = length(d1dens$x)))
ggplot(df2, aes(x, y, fill = Data)) +
geom_area(position = position_identity(), color = "black") +
scale_fill_brewer(palette = "Pastel2") +
theme_bw()
sum(joint) / sum(d1dens$y, d2dens$y)
#> [1] 0.1480701
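Note that the ratio above divides by the combined area of both curves. If you want the overlapping coefficient as the integral of the pointwise minimum instead, a minimal sketch is to multiply the summed minimum by the grid spacing (a plain Riemann sum). This assumes both density() calls share the same x grid, which the matching from/to values and the default n should ensure. The names dx, area1, area2 and ovl are mine.

```r
set.seed(1234)
df <- data.frame(
  Data = factor(rep(c("D1", "D2"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5),
                   rnorm(200, mean = 65, sd = 5)))
)

# Estimate both densities on the same x grid
rng <- range(df$weight)
d1 <- density(df$weight[df$Data == "D1"], from = rng[1], to = rng[2])
d2 <- density(df$weight[df$Data == "D2"], from = rng[1], to = rng[2])

dx <- diff(d1$x[1:2])                  # spacing of the shared grid
area1 <- sum(d1$y) * dx                # close to 1, minus truncated tails
area2 <- sum(d2$y) * dx
ovl <- sum(pmin(d1$y, d2$y)) * dx      # Riemann-sum estimate of OVL

ovl
ovl / (area1 + area2)                  # the ratio computed in the answer
```

Since each curve integrates to roughly 1, the ratio from the answer comes out at roughly half of this OVL estimate.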
Problem:
1.) I have a shapefile that looks like this:
Extreme values for coordinates are: xmin = 300,000, xmax = 620,000, ymin = 31,000 and ymax = 190,000.
2.) I have a dataset of approx. 2 million points (every point lies inside the given polygon); each point belongs to one of 5 categories.
Now, for every point inside the border (the distance between points has to be 10, which would give us 580,800,000 points), I want to determine a color based on the category of the nearest point in the dataset.
In the end I would like to draw a ggplot in which the color of every point depends on its category (so I'll use 5 different colors).
What I have so far:
My attempts so far are not optimized, and it takes R forever to determine the category of every point inside the polygon.
1.) I created a new dataset of points covering the rectangle spanned by the extreme coordinate values, with 10 units between points. From this new dataset I selected the points that fall inside the polygon border (with the function pnt.in.poly from the package SDMTools). Then I wanted to find, for every point in the polygon, the nearest point in the dataset and take its category, but I never managed to get the subset out of the 580,800,000 points (obviously).
2.) I tried to take the 2 million points and color an area around each of them according to its category, but that did not work properly.
I know that it is not possible to plot so many points, or to see the difference between a plot with 200,000,000 points and a plot with 1,000,000 points, but I would like accurate coloring when zooming in on (drawing) only one small spot in the polygon (an area of 100 x 100, for example).
Question: Is there a better way of coloring so many points in a polygon (for example by creating a new shapefile or by grouping points)?
Thank you for your ideas!
It's really helpful if you include some data with your question, even (especially) if it's a toy data set. As you didn't, I've made a toy example. First, I define a simple shape data frame and a data frame of synthetic data that includes x, y, and grp (i.e., a categorical variable with 5 levels). I crop the latter to the former and plot the results.
# Dummy shape function
df_shape <- data.frame(x = c(0, 0.5, 1, 0.5, 0),
y = c(0, 0.2, 1, 0.8, 0))
# Load library
library(ggplot2)
library(sgeostat) # For in.polygon function
# Data frame of synthetic data: random [x, y] and category (grp)
df_synth <- data.frame(x = runif(500),
y = runif(500),
grp = factor(sample(1:5, 500, replace = TRUE)))
# Remove points outside polygon
df_synth <- df_synth[in.polygon(df_synth$x, df_synth$y, df_shape$x, df_shape$y), ]
# Plot shape and synthetic data
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_synth, aes(x = x, y = y, colour = grp))
g
Next, I create a regular grid and crop that using the polygon.
# Create a grid
df_grid <- expand.grid(x = seq(0, 1, length.out = 50),
y = seq(0, 1, length.out = 50))
# Check if grid points are in polygon
df_grid <- df_grid[in.polygon(df_grid$x, df_grid$y, df_shape$x, df_shape$y), ]
# Plot shape and show points are inside
g <- ggplot(df_shape, aes(x = x, y = y)) + geom_path(colour = "#FF3300", size = 1.5)
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y))
g
To classify each point on this grid by the nearest point in the synthetic data set, I use knn or k-nearest-neighbours with k = 1. That gives something like this.
# Classify grid points according to synthetic data set using k-nearest neighbour
df_grid$grp <- class::knn(df_synth[, 1:2], df_grid, df_synth[, 3])
# Show categorised points
g <- ggplot()
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_grid, aes(x = x, y = y, colour = grp))
g
So, that's how I'd address that part of your question about classifying points on a grid.
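To make the nearest-neighbour step concrete, here is a tiny sketch with two labelled training points and two query points. It uses class::knn() with k = 1 exactly as above; the data frames train and grid and the labels "a" and "b" are made up for illustration.

```r
library(class)  # recommended package, shipped with standard R installations

# Two labelled points in opposite corners of the unit square
train <- data.frame(x = c(0, 1), y = c(0, 1))
labels <- factor(c("a", "b"))

# Two query points, each clearly closer to one of the corners
grid <- data.frame(x = c(0.1, 0.9), y = c(0.2, 0.8))

# k = 1: every query point takes the label of its single nearest neighbour
pred <- knn(train, grid, cl = labels, k = 1)
as.character(pred)
```

The first query point is assigned the label of (0, 0) and the second that of (1, 1), which is the behaviour the grid colouring relies on.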
The other part of your question seems to be about resolution. If I understand correctly, you want the same resolution even when you're zoomed in. Also, you don't want to plot so many points when zoomed out, as you can't even see them. Here, I create a plotting function that lets you specify the resolution. First, I plot all the points in the shape with 50 points in each direction. Then, I plot a subregion (i.e., zoom), but keep the number of points in each direction the same, so that it looks much like the previous plot in terms of the number of dots.
res_plot <- function(xlim, xn, ylim, yn, df_data, df_sh){
# Create a grid
df_gr <- expand.grid(x = seq(xlim[1], xlim[2], length.out = xn),
y = seq(ylim[1], ylim[2], length.out = yn))
# Check if grid points are in polygon
df_gr <- df_gr[in.polygon(df_gr$x, df_gr$y, df_sh$x, df_sh$y), ]
# Classify grid points according to synthetic data set using k-nearest neighbour
df_gr$grp <- class::knn(df_data[, 1:2], df_gr, df_data[, 3])
g <- ggplot()
g <- g + ggthemes::theme_clean()
g <- g + geom_point(data = df_gr, aes(x = x, y = y, colour = grp))
g <- g + xlim(xlim) + ylim(ylim)
g
}
# Example plot
res_plot(c(0, 1), 50, c(0, 1), 50, df_synth, df_shape)
# Same resolution, but different limits
res_plot(c(0.25, 0.75), 50, c(0, 1), 50, df_synth, df_shape)
Created on 2019-05-31 by the reprex package (v0.3.0)
Hopefully, that addresses your question.
I would like to add a marginal space between groups of box plots drawn with stat_summary.
Here is a small example of my problem
library(ggplot2)
library(reshape2)
data1 <- (lapply(letters[1:5], function(l1) return(matrix(rt(5*3, 1), nrow = 5, ncol = 3, dimnames = list(cat2=letters[6:10], cat3=letters[11:13])))))
names(data1) <- letters[1:5]
data2 <- melt(data1)
customstats <- function(x) {
xs <- sort(x)
return(c(ymin=min(x), lower= mean(xs[xs < mean(x)]), middle = mean(x) , upper = mean(xs[xs > mean(x)]), ymax=max(x)))
}
ggplot(data2, aes(x=cat2, y=value, fill=cat3), width=2) +
stat_summary(fun.data = customstats, geom = "boxplot",
alpha = 0.5, position = position_dodge(1), mapping = aes(fill=cat3))
The result is the following picture.
I would like to achieve a visual separation for each "cat2" by adding a "space" between the groups of boxplots (I'm restricted to using stat_summary since I have a custom statistic). How can I do it?
I have fixed a similar problem in an ugly (but, for me, effective) way: I create a data frame with the same plotting variables as my original data, but with x (or y) values chosen so that they sort between the two groups I want to separate, and with missing values for y (or x). For your problem, I added the following code and got a plot with spatial separation between the clusters.
library(plyr)
empties <- data.frame(cat2_orig=unique(data2$cat2)[-length(unique(data2$cat2))])
#no extra space needed between last cluster and edge of plot
empties$cat2 <- paste0(empties$cat2_orig,empties$cat2_orig)
empties$value <- NA
data2_space <- rbind.fill(data2,empties)
ggplot(data2_space, aes(x=cat2, y=value, fill=cat3), width=2) +
stat_summary(fun.data = customstats, geom = "boxplot",
alpha = 0.5, position = position_dodge(1), mapping = aes(fill=cat3)) +
#remove tickmarks for non-interesting points on x-axis
scale_x_discrete(breaks=unique(data2$cat2))
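The reason the doubled labels end up in the right place is, as far as I can tell, just alphabetical ordering: a spacer such as "ff" sorts after "f" and before "g". A minimal sketch of that ordering, assuming the x values are turned into a factor with default (sorted) levels; cat2_levels and spacers are names I introduce here.

```r
# The original categories on the x-axis
cat2_levels <- letters[6:10]                      # "f" "g" "h" "i" "j"

# One spacer per category except the last, built by doubling the label
spacers <- paste0(head(cat2_levels, -1), head(cat2_levels, -1))

# Default factor levels are sorted, so each spacer follows its category
axis_order <- levels(factor(c(cat2_levels, spacers)))
axis_order
```

If your category names do not sort this conveniently, you would have to set the factor levels explicitly instead of relying on the default order.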
Before & after
I am trying to find a way to plot data frames of different size using the same function. The data is quite similar to the dfs below. Order of xs is not important.
GetDf <- function(n)
data.frame(x = seq(1, n), y = rnorm(n, 3.5, 0.5), group = runif(n) > 0.5)
PlotIt <- function(df) {
p <- ggplot(df) + geom_point(aes(x = x, y = y, colour = group)) +
expand_limits(y = 1) + expand_limits(y = 5) +
geom_hline(yintercept = c(2.5, 4.5), linetype = "dotdash")
print(p)
}
df1 <- GetDf(1000)
df2 <- GetDf(10000)
df3 <- GetDf(100000)
df4 <- GetDf(1000000)
PlotIt(df1) looks ok, but PlotIt(df2) is already bad. Points overlap. I could set the point size smaller when n is large, but then the plots of df1 - df4 would look radically different. If the size is fixed, then the plot of df3 needs something like size = 0.75, and PlotIt(df1) is bad.
I know there is the library hexbin and geom_hex(), but it doesn't seem to produce what I want. I would like to have groups shown in different colors, hexbin is not good for plotting df1, etc.
What would be the best way to plot at least df1 - df3, preferably also df4, so that the plots would "feel" the same and look good? (I'm sorry about vagueness, but I don't know how to be more specific.)
I followed krlmlr's answer and wrote a function that calculates alpha from the row count of df. Choosing a better shape also made the plots nicer. override.aes is needed so that the legend keys stay readable at low alpha values.
PlotIt <- function(df) {
Alpha <- function(x) pmax(0.1, pmin(1, 2.05 - 0.152 * log(x)))
p <- ggplot(df) +
geom_point(aes(x = x, y = y, colour = group), size = 1.5,
shape = 1, alpha = Alpha(nrow(df))) +
expand_limits(y = 1) + expand_limits(y = 5) +
geom_hline(yintercept = c(2.5, 4.5), linetype = "dotdash") +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
print(p)
}
Plots of df1 - df3 look ok to me (full screen). The question is somewhat similar to Scatterplot with too many points. Differences: same function should apply to big and small data frames, and the order of x's is not important.
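To see what the Alpha helper does across the four data frame sizes, here is a quick sketch that evaluates it at the row counts of df1 to df4. The function is copied from PlotIt; only the vector n is added for illustration.

```r
# Alpha helper from PlotIt: log-linear in the row count, clamped to [0.1, 1]
Alpha <- function(x) pmax(0.1, pmin(1, 2.05 - 0.152 * log(x)))

# Row counts of df1, df2, df3 and df4
n <- c(1e3, 1e4, 1e5, 1e6)
alpha_values <- round(Alpha(n), 2)
alpha_values
```

This gives 1, 0.65, 0.3 and 0.1: alpha drops by about 0.35 for every tenfold increase in rows until it hits the lower clamp.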
I suspect you don't want to trace individual points in a scatter plot of 1000 or more points. Why don't you use a sample?
PlotIt <- function(df) {
df <- sample.rows(df, 1000, replace = FALSE)
...
}
(sample.rows is in my kimisc package).
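If you would rather not add a dependency, row sampling can be sketched in base R. The helper sample_rows below is hypothetical (it is not the kimisc function); it also guards against data frames that have fewer rows than the requested sample size.

```r
# Hypothetical base R helper: draw at most `size` rows without replacement
sample_rows <- function(df, size = 1000) {
  keep <- sample(nrow(df), min(size, nrow(df)))
  df[keep, , drop = FALSE]
}

set.seed(42)
big <- data.frame(x = seq_len(10000), y = rnorm(10000))
small <- data.frame(x = seq_len(200), y = rnorm(200))

n_big <- nrow(sample_rows(big))      # capped at 1000 rows
n_small <- nrow(sample_rows(small))  # all 200 rows, in random order
c(n_big, n_small)
```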
If you really want to show all points, use an alpha value in geom_point. Be sure to export your plot as raster and not as vector image, it will take ages to render otherwise:
geom_point(aes(...), alpha=get_reasonable_alpha_value(df))
You'll have to do some experimentation for implementing get_reasonable_alpha_value. It should return a value between 0 (fully transparent) and 1 (opaque).
Perhaps a two-dimensional density estimation will suit you better:
geom_density2d(...)
My initial goal was to plot a population of individual points and then draw a convex hull enclosing 80% of that population centered on the mass of the population.
After trying a number of ideas, the best solution I came up with was to use ggplot's stat_density2d. While this works great for a qualitative analysis, I still need to indicate an 80% boundary. I started out looking for a way to outline the 80th percentile population boundary, but I can work with an 80% probability density boundary instead.
Here's where I'm looking for help. The bin parameter for kde2d (used by stat_density2d) is not clearly documented. If I set bin = 4 in the example below, am I correct in interpreting the central (green) region as containing a 25% probability mass and the combined yellow, red, and green areas as representing a 75% probability mass? If so, by changing bin to 5, would the inscribed area then equal an 80% probability mass?
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
TestData <- ggplot (data = df) +
stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
bins=4, geom = "polygon") +
geom_point(aes(x = x, y = y)) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "black"))
TestData
I repeated a number of test cases and manually counted the excluded points [would love to find a way to count them based on what ..level.. they were contained within] but given the random nature of the data (both my real data and the test data) the number of points outside of the stat_density2d area varied enough to warrant asking for help.
Summarizing, is there a practical means of drawing a polygon around the central 80% of the population of points in the data frame? Or, barring that, am I safe to use stat_density2d and set bin equal to 5 to produce an 80% probability mass?
Excellent answer from Bryan Hanson dispelling the fuzzy notion that I could pass an undocumented bin parameter in stat_density2d. The results looked close at values for bin around 4 to 6, but as he stated, the actual function is unknown and therefore not usable.
I used the HDRegionplot as provided in the accepted answer by DWin to solve my problem. To that, I added a center of gravity (COGravity) and point in polygon (pnt.in.poly) from the SDMTools package to complete the analysis.
library(MASS)
library(coda)
library(SDMTools)
library(emdbook)
library(ggplot2)
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
ContourLines <- as.data.frame(HPDregionplot(mcmc(data.matrix(df)), prob=0.8))
df$inpoly <- pnt.in.poly(df, ContourLines[, c("x", "y")])$pip
dp <- df[df$inpoly == 1,]
COG100 <- as.data.frame(t(COGravity(df$x, df$y)))
COG80 <- as.data.frame(t(COGravity(dp$x, dp$y)))
TestData <- ggplot (data = df) +
stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
bins=5, geom = "polygon") +
geom_point(aes(x = x, y = y, colour = as.factor(inpoly)), alpha = 1) +
geom_point(data=COG100, aes(COGx, COGy),colour="white",size=2, shape = 4) +
geom_point(data=COG80, aes(COGx, COGy),colour="green",size=4, shape = 3) +
geom_polygon(data = ContourLines, aes(x = x, y = y), color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
TestData
nrow(dp)/nrow(df) # actual fraction of population members inscribed within the 80% probability polygon
Alright, let me start by saying I'm not entirely sure of this answer, and it's only a partial answer! There is no bin parameter for MASS::kde2d, which is the function used by stat_density2d. Looking at the help page for kde2d and its code (seen simply by typing the function name in the console), I think the bin parameter is h (how these functions know to pass bin to h is not clear, however). Following the help page, we see that if h is not provided, it is computed by MASS::bandwidth.nrd. The help page for that function says this:
# The function is currently defined as
function(x)
{
r <- quantile(x, c(0.25, 0.75))
h <- (r[2] - r[1])/1.34
4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
}
Based on this, I think the answer to your last question ("Am I safe...") is definitely no. r in the above function is what you need for your assumption to be safe, but it is clearly modified, so you are not safe. HTH.
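To check that reading of the help page, here is a small sketch that recomputes the default bandwidth by hand and compares it with MASS::bandwidth.nrd(). The variable manual is my own name. The point is that the result is a smoothing width in data units that depends on spread and sample size; it does not encode a probability mass.

```r
library(MASS)  # recommended package, shipped with standard R installations

set.seed(1)
x <- rnorm(100)

# Formula quoted from the bandwidth.nrd help page
r <- quantile(x, c(0.25, 0.75))
h <- (r[2] - r[1]) / 1.34
manual <- 4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)

manual
bandwidth.nrd(x)
```

As far as I can tell from the ggplot2 documentation, bins in stat_density2d sets the number of contour levels rather than h, but treat that as an assumption to verify for your version.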
Additional thought: Do you have any evidence that your code is using your bins argument? I'm wondering if it is being ignored. If so, try passing h in place of bins and see if it listens.
HPDregionplot in package:emdbook is supposed to do that. It does use MASS::kde2d but it normalizes the result. It has the disadvantage to my mind that it requires an mcmc object.
library(MASS)
library(coda)
library(emdbook)
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
Building on the answer by 42, I've simplified HPDregionplot() to reduce dependencies and remove the requirement to work with mcmc objects. The function works on a two-column data.frame and creates no intermediate plots. Note, however, that this approach breaks as soon as grDevices::contourLines() returns multiple contours.
hpd_contour <- function (x, n = 50, prob = 0.95, ...) {
post1 <- MASS::kde2d(x[[1]], x[[2]], n = n, ...)
dx <- diff(post1$x[1:2])
dy <- diff(post1$y[1:2])
sz <- sort(post1$z)
c1 <- cumsum(sz) * dx * dy
levels <- sapply(prob, function(x) {
approx(c1, sz, xout = 1 - x)$y
})
as.data.frame(grDevices::contourLines(post1$x, post1$y, post1$z, levels = levels))
}
library(ggplot2)
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
ContourLines <- hpd_contour(df, prob=0.8)
ggplot(df, aes(x = x, y = y)) +
stat_density2d(aes(fill = as.factor(..level..)), bins=5, geom = "polygon") +
geom_point() +
geom_polygon(data = ContourLines, color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
Moreover, the workflow now easily extends to grouped data.
library(dplyr) # for %>%, group_by() and do()
ContourLines <- iris[, c("Species", "Sepal.Length", "Sepal.Width")] %>%
group_by(Species) %>%
do(hpd_contour(.[, c("Sepal.Length", "Sepal.Width")], prob=0.8))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.6) +
geom_polygon(data = ContourLines, fill = NA) +
guides(color = "none") +
theme(plot.margin = margin())
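For readers wondering how hpd_contour() picks its contour level, here is a minimal sketch of just that step, under the same assumptions as the function (a regular kde2d grid, density treated as constant per grid cell). The level is chosen so that the cells with density below it carry about 1 - prob of the mass. The names cell, cum_mass and mass_below are mine.

```r
library(MASS)

set.seed(1)
xy <- data.frame(x = rnorm(100), y = rnorm(100))
prob <- 0.8

# Density estimate on a regular 50 x 50 grid
dens <- kde2d(xy$x, xy$y, n = 50)
cell <- diff(dens$x[1:2]) * diff(dens$y[1:2])   # area of one grid cell

# Accumulate mass from the lowest density cells upwards
sz <- sort(dens$z)
cum_mass <- cumsum(sz) * cell

# Density value at which the accumulated mass reaches 1 - prob
level <- approx(cum_mass, sz, xout = 1 - prob)$y

# Check: mass of all cells below that level should be close to 0.2
mass_below <- sum(dens$z[dens$z < level]) * cell
mass_below
```

Keep in mind that the grid only spans the range of the data, so the total mass on the grid can be somewhat less than 1 and the region inside the contour then holds slightly less than prob of the full density.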