DBSCAN clustering plotting through ggplot2 - r

I am trying to plot a DBSCAN clustering result with ggplot2. If I understand correctly, dbscan currently plots noise points in black with the base plot function. Some code first:
library(dbscan)
n <- 100
x <- cbind(
x = runif(5, 0, 10) + rnorm(n, sd = 0.2),
y = runif(5, 0, 10) + rnorm(n, sd = 0.2)
)
plot(x)
kNNdistplot(x, k = 5)
abline(h=.25, col = "red", lty=2)
res <- dbscan::dbscan(x, eps = .25, minPts = 4)
plot(res, x, main = "DBSCAN")
x <- data.frame(x)
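library(ggplot2)
library(fpc)  # provides clusym
# a line must end with `+`, not start with it, for the layers to chain
ggplot(x, aes(x = x, y = y)) +
  geom_point(color = res$cluster + 1, pch = clusym[res$cluster + 1]) +
  theme_grey() + ggtitle("(c)") + labs(x = "x", y = "y")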
I want to do two things differently here. First, I want to plot the clustering output through ggplot(). The difficulty is that if I use res$cluster to colour the points, plot() ignores points labelled 0 (the noise points), while ggplot() throws an error because the length of res$cluster is smaller than the data to plot; and if I use res$cluster+1, the noise points get label 1, which I don't want. Second, if possible, I would like to do what clusym[] in package fpc does: it plots clusters with labels 1, 2, 3, ... and ignores 0 labels. It would be fine if the labels for noise points stayed 0 and they were given a specific symbol, say "*", in a specific colour, say grey. I have seen a Stack Overflow post that tries to do something similar for convex hull plotting, but I still couldn't figure out how to do this when I don't want to draw the hull and want a cluster number for each cluster.
One possibility I thought of was to first plot the points without noise and then add the noise points to the original plot with the desired colour and symbol.
But since the length of res$cluster is not equal to that of x, it throws an error.
ggplot(x, aes(x = x, y = y)) +
  geom_point(color = res$cluster + 1, pch = clusym[res$cluster + 1]) +
  theme_grey() + ggtitle("(c)") + labs(x = "x", y = "y")  # + adding noise points
Error: Aesthetics must be either length 1 or the same as the data (100): shape, colour

You should first extract the cluster assignments from the DBSCAN output (res$cluster in your code), tack them onto your original data as a new column (i.e. as cluster), and convert that column to a factor.
When you make the ggplot, you can assign color or shape to cluster. As for ignoring the noise points, I would do it as follows.
# x is the data frame from the question; res is the dbscan result
data <- cbind(x, cluster = res$cluster)    # cluster column still numeric
data2 <- dplyr::filter(data, cluster > 0)  # drop noise points (label 0)
data2$cluster <- as.factor(data2$cluster)
ggplot(data2, aes(x = x, y = y)) +
  geom_point(aes(color = cluster))
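If the noise points should stay on the plot rather than be filtered out, one possible approach is to keep label 0, draw every point as text, and give noise its own symbol and colour. The following is only a sketch under assumptions: `pts` is made-up stand-in data (in the question it would be built from x and res$cluster), and the `label` column and the colour choices are hypothetical.

```r
library(ggplot2)

# Stand-in data (assumption): replace with the question's x and res$cluster
set.seed(1)
pts <- data.frame(x = runif(12), y = runif(12),
                  cluster = rep(c(0, 1, 2), times = 4))  # 0 = noise

# Noise points get "*", clustered points get their cluster number
pts$label <- ifelse(pts$cluster == 0, "*", as.character(pts$cluster))
pts$cluster <- factor(pts$cluster)

# geom_text draws the label itself; the manual scale pins noise to grey
p <- ggplot(pts, aes(x = x, y = y, colour = cluster, label = label)) +
  geom_text() +
  scale_colour_manual(values = c("0" = "grey50", "1" = "red", "2" = "blue"))
```

Printing `p` draws the plot. With more than two clusters, the `values` vector would need one named entry per cluster level.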

Related

Overlay multiple geom_raster plots with different gradients

I would like to use ggplot's geom_raster to draw a 2D plot with 2 different gradients, but I do not know if there is a fast and elegant solution for this and I am stuck.
The effect that I would like to see is the overlay of multiple geom_raster, essentially. Also, I would need a solution that scales to N different gradients; let me give an example with N=2 gradients which is easier to follow.
I first create a grid of positions X and Y (101 x 101 points, since both endpoints are included)
# the domain is 101 points (0 to 100) on each axis
domain = seq(0, 100, 1)
# the grid with the data
grid = expand.grid(domain, domain, stringsAsFactors = FALSE)
colnames(grid) = c('x', 'y')
Then I compute one value per grid point; imagine something stupid like this
grid$val = apply(grid, 1, function(w) { w['x'] * w['y'] })
I know how to plot this with a custom white to red gradient
ggplot(grid, aes(x = x, y = y)) +
geom_raster(aes(fill = val), interpolate = TRUE) +
scale_fill_gradient(
low = "white",
high = "red", aesthetics = 'fill')
But now imagine I have another value per grid point
grid$second_val = apply(grid, 1, function(w) { w['x'] * w['y'] + runif(1) })
Now, how do I plot a grid where each position "(x,y)" is coloured with an overlay of:
1 "white to red" gradient with value given by val
1 "white to blue" gradient with value given by second_val
Essentially, in most applications val and second_val will be two 2D density functions and I would like each gradient to represent the density value. I need two different colours to see the different distribution of the values.
I have seen this similar question but don't know how to use that answer in my case.
@Axeman's answer to my question, which you linked to, applies directly to your question as well.
Note that scales::colour_ramp() uses values between 0 and 1, so normalize val and second_val to the range [0, 1] before plotting
grid$val_norm <- (grid$val-min(grid$val))/diff(range(grid$val))
grid$second_val_norm <- (grid$second_val-min(grid$second_val))/diff(range(grid$second_val))
Now plot using @Axeman's answer. You can plot one layer as a raster and overlay the second with annotate(). I have added transparency (alpha=.5); otherwise you would only be able to see the second layer:
ggplot(grid, aes(x = x, y = y)) +
geom_raster(aes(fill=val)) +
scale_fill_gradient(low = "white", high = "red", aesthetics = 'fill') +
annotate(geom="raster", x=grid$x, y=grid$y, alpha=.5,
fill = scales::colour_ramp(c("transparent","blue"))(grid$second_val_norm))
Or, you can plot both layers using annotate().
# plot using annotate()
ggplot(grid, aes(x = x, y = y)) +
annotate(geom="raster", x=grid$x, y=grid$y, alpha=.5,
fill = scales::colour_ramp(c("transparent","red"))(grid$val_norm)) +
annotate(geom="raster", x=grid$x, y=grid$y, alpha=.5,
fill = scales::colour_ramp(c("transparent","blue"))(grid$second_val_norm))
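The question also asks for something that scales to N gradients. One way to sketch that, assuming each gradient is described by a column name and an end colour, is to build the annotate() layers in a loop and add the resulting list to the plot. The helper `rescale01`, the `gradients` vector, and the smaller 21 x 21 grid are assumptions made for illustration, not part of the original answer.

```r
library(ggplot2)

# Hypothetical helper: min-max rescale a numeric vector to [0, 1]
rescale01 <- function(v) (v - min(v)) / diff(range(v))

# Small stand-in grid with two value columns
grid <- expand.grid(x = 0:20, y = 0:20)
grid$val <- grid$x * grid$y
grid$second_val <- (20 - grid$x) * grid$y

# One entry per gradient: column name -> end colour (extend for N gradients)
gradients <- c(val = "red", second_val = "blue")

# Build one semi-transparent raster layer per gradient
layers <- lapply(names(gradients), function(col) {
  ramp <- scales::colour_ramp(c("transparent", gradients[[col]]))
  annotate(geom = "raster", x = grid$x, y = grid$y, alpha = 0.5,
           fill = ramp(rescale01(grid[[col]])))
})

# Adding a list of layers to a ggplot adds each layer in order
p <- ggplot(grid, aes(x = x, y = y)) + layers
```

Layers drawn later sit on top, so with many gradients the alpha value probably needs lowering to keep the earlier ones visible.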

Contour plot or heatmap from three continuous variables

I have a model which has told me there is an interaction between two variables, a and b, which significantly influences my response variable, c. All three are continuous numeric variables. For detail, c is the rate of change in my response variable, b is the rate of change in my predictor and a is mean annual rainfall. The unit of analysis is pixels in a raster. So my model is telling me that mean annual rainfall modifies how my predictor affects my response.
To visualise this interaction I would like to use a contour plot/heat map/level plot with a and b on the x and y axes and c providing the colour, to show me how my response variable changes within the space described by a and b. I can do this with a scatter plot, but it's not very pretty or easy to interpret:
qplot(b, a, colour = c) +
scale_colour_gradient(low="green", high="red")
When I try to plot a contour plot/heat map/level plot though all I get is errors, blank plots or ugly plots.
geom_contour gives me an error:
ggplot(data = Mod, aes(x = Rain, y = Bomas, z = Fire)) +
geom_contour()
Warning message:
Not possible to generate contour data
geom_raster initially gives me Error: cannot allocate vector of size 81567.2 Gb but when I round my data it produces:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_raster(aes(fill = c))
Adding interpolate = TRUE to the geom_raster code just makes the lines a little blurry.
geom_tile produces a blank graph but with a scale bar for c:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_tile(aes(color = c))
I've also tried using stat_density2d and setting the fill and/or the colour to c, but just got an error, and I've tried using levelplot in the lattice package as well but that produces this:
levelplot(c ~ a * b, data = df,
aspect = "asp", contour = TRUE,
xlab = "a",
ylab = "b")
I suspect the problems I'm encountering arise because the functions are not set up to deal with continuous x and y variables; all the examples seem to use factors. I would have thought I could compensate for that by changing bin widths, but that doesn't seem to work either. Is there a function that allows you to make a heat map with 3 continuous variables? Or do I need to treat my a and b variables as factors and manually make a dataframe with bins appropriate for my data?
If you want to experiment for yourself then you get similar problems to what I'm having with:
df<- as.data.frame(rnorm(1:1068))
df[,2] <- rnorm(1:1068)
df[,3] <- rnorm(1:1068)
names(df) <- c("a", "b", "c")
You can get automatic bins, and for example calculate the means by using stat_summary_2d:
ggplot(df, aes(a, b, z = c)) +
stat_summary_2d() +
geom_point(shape = 1, col = 'white') +
viridis::scale_fill_viridis()
Another good option is to slice your data by the third variable, and plot small multiples. This doesn't really show very well for random data though:
library(ggplot2)
ggplot(df, aes(a, b)) +
geom_point() +
facet_wrap(~cut_number(c, 4))
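For comparison, the manual binning route that the question asks about can be sketched with base R's cut() and aggregate(). This is only an illustration of what automatic binning spares you; the choice of 10 bins per axis and the column names `a_bin` and `b_bin` are arbitrary assumptions.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(a = rnorm(1068), b = rnorm(1068), c = rnorm(1068))

# Cut a and b into 10 equal-width bins each (10 is an arbitrary choice)
df$a_bin <- cut(df$a, breaks = 10)
df$b_bin <- cut(df$b, breaks = 10)

# Mean of c in every occupied (a_bin, b_bin) cell
binned <- aggregate(c ~ a_bin + b_bin, data = df, FUN = mean)

# One tile per occupied cell, filled by the mean of c
p <- ggplot(binned, aes(a_bin, b_bin, fill = c)) +
  geom_tile()
```

Empty cells are simply absent from `binned`, so they show up as gaps in the heat map.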

how to combine in ggplot line / points with special values?

I'm quite new to ggplot but I like the systematic way you build your plots. Still, I'm struggling to achieve the desired results. I can replicate plots where you have categorical data. However, for my use I often need to fit a model to certain observations and then highlight them in a combined plot. With the usual plot function I would do:
library(splines)
set.seed(10)
x <- seq(-1,1,0.01)
y <- x^2
s <- interpSpline(x,y)
y <- y+rnorm(length(y),mean=0,sd=0.1)
plot(x,predict(s,x)$y,type="l",col="black",xlab="x",ylab="y")
points(x,y,col="red",pch=4)
points(0,0,col="blue",pch=1)
legend("top",legend=c("True Values","Model values","Special Value"),text.col=c("red","black","blue"),lty=c(NA,1,NA),pch=c(4,NA,1),col=c("red","black","blue"),cex = 0.7)
My biggest problem is how to build the data frame for ggplot which automatically then draws the legend? In this example, how would I translate this into ggplot to get a similar plot? Or is ggplot not made for this kind of plots?
Note this is just a toy example. Usually the model values are derived from a more complex model, just in case you want to use a stat in ggplot.
The key part here is that you can map colors in aes by giving a string, which will produce a legend. In this case, there is no need to include the special value in the data.frame.
df <- data.frame(x = x, y = y, fit = predict(s, x)$y)
ggplot(df, aes(x, y)) +
geom_line(aes(y = fit, col = 'Model values')) +
geom_point(aes(col = 'True values')) +
geom_point(aes(col = 'Special value'), x = 0, y = 0) +
scale_color_manual(values = c('True values' = "red",
'Special value' = "blue",
'Model values' = "black"))
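A possible variation, sketched here under the assumption that the highlighted point should be drawn exactly once and that the shapes should match the base plot: put the special value in its own one-row data frame and pass it through the `data` argument of its layer. The name `special` is hypothetical.

```r
library(splines)
library(ggplot2)

set.seed(10)
x <- seq(-1, 1, 0.01)
y <- x^2
s <- interpSpline(x, y)
y <- y + rnorm(length(y), mean = 0, sd = 0.1)

df <- data.frame(x = x, y = y, fit = predict(s, x)$y)
special <- data.frame(x = 0, y = 0)  # hypothetical one-row frame for the highlighted point

p <- ggplot(df, aes(x, y)) +
  geom_line(aes(y = fit, col = "Model values")) +
  geom_point(aes(col = "True values"), shape = 4) +
  # this layer uses its own data, so the point is drawn once
  geom_point(data = special, aes(col = "Special value"), shape = 1) +
  scale_color_manual(values = c("True values" = "red",
                                "Special value" = "blue",
                                "Model values" = "black"))
```

The constant strings inside aes() still create the legend entries; only the source of the special point changes.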

Varying gradient using ggplot2 in R

I am trying to create a plot where the color gradient changes by both the x and y axis. More specifically I am trying set up the gradients so that the hue range changes along the x axis and the value changes along the y axis.
For an example I am working with a sine curve with some noise along -pi to pi.
set.seed(5678)
x <- seq(-1*pi, 1*pi, 0.01)
y <- sin(x) + rnorm(length(x))
df <- cbind.data.frame(x, y)
ggplot(df, aes(x=x, y=y)) + geom_line()
Now I want to colorize the line so that the hue progresses from red-orange to orange-yellow to yellow-green, etc. along the x axis and then will take on different values in that range depending on its y value. So at x=-pi, y=2 might be red and y=-2 might be yellow while at x=0, y=2 might be green and y=-2 might be blue.
Has anyone tried to create a graph like this?
Here's an option for doing it using a hue calculated from x and y:
df$hue <- pmax(pmin((df$x + pi)/pi/3 + (2 - df$y) / 12, 1), 0)
ggplot(df, aes(x=x, y=y, group = 1, colour = hsv(hue, 1, 1))) + geom_path() +
scale_colour_identity()
Note that because the line segments are quite long vertically, the effect isn't fully visible. Here's a version using approx to interpolate:
adf <- as.data.frame(approx(df, xout = seq(-pi, max(df$x), 0.001)))
adf$hue <- pmax(pmin((adf$x + pi)/pi/3 + (2 - adf$y) / 12, 1), 0)
ggplot(adf, aes(x=x, y=y, group = 1, colour = hsv(hue, 1, 1))) + geom_path() +
scale_colour_identity()
In both cases, it's the hue that's dependent on both x and y, with value held constant. That fits your proposed example, if not your original description. Clearly it could be tailored to vary hue and value separately. It's also worth noting that there needs to be a group set. Otherwise ggplot2 tries to join together all the points of the same colour.
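As a rough sketch of varying hue and value separately, the hue can be taken from x alone and the value from y alone. The specific mappings below (the full hue circle across the x range, and value clamped to [0.3, 1] so the line never goes fully black) are assumptions chosen for illustration.

```r
library(ggplot2)

set.seed(5678)
x <- seq(-pi, pi, 0.01)
y <- sin(x) + rnorm(length(x))
df <- data.frame(x, y)

# Hue depends only on x, mapped linearly from [-pi, pi] onto [0, 1]
df$hue <- (df$x + pi) / (2 * pi)
# Value depends only on y, mapped from roughly [-3, 3] and clamped to [0.3, 1]
df$val <- pmax(pmin((df$y + 3) / 6, 1), 0.3)
df$col <- hsv(df$hue, 1, df$val)

# group = 1 keeps the path connected despite the per-point colours
p <- ggplot(df, aes(x = x, y = y, group = 1, colour = col)) +
  geom_path() +
  scale_colour_identity()
```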

Interpolating correctly between points in R using ggplot2 and axis scaling

I have some data I want to graph on a semi-log scale, however I get some artifacts when there is a large jump between points. On linear scale, a straight line is drawn between subsequent points, which is a fine approximation for visualization. However, the exact same thing is done when using the log scale (either by using scale_x_log10 or scale_x_continuous with a log transformation). A line between two points on the semi-log scale should show up curved. In other words, this:
df <- data.frame(x = c(0, 1), y = c(0, 1))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
produces this:
when I would expect something more like this:
generated by this code:
df <- data.frame(x = seq(0, 1, 0.01), y = seq(0, 1, 0.01))
ggplot(data = df, aes(x, y)) + geom_line() + scale_x_log10(limits = c(10^-3, 10^0))
It's clear what's happening, but I'm not sure what the best way to fix the interpolation is. In the actual data I'm plotting there are a few jumps at various points, which makes the plots very misleading when trying to compare two lines. (They're ROC curves in this instance.)
One thought is I can search the data for jumps and fill in some interpolated points myself, but I'm hoping for a cleaner way that doesn't involve me adding in a bunch of fake data points.
What you describe is a transformation of the coordinate system, not a transformation of the scales. The distinction is that scale transformations take place before any statistical transformations, and coordinate transformations take place afterward. In this case, the "statistical transformation" is "draw a straight line between the points". With a transformed scale, the line is straight in the transformed (log) space; with a transformed coordinate, it is straight in the original (linear) space and therefore curved in log space.
# don't include 0 in the data because log 0 is -Inf
DF <- data.frame(x = c(0.1, 1), y = c(0.1, 1))
ggplot(data = DF, aes(x = x, y = y)) +
geom_line() +
coord_trans(x="log10")
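The difference between the two kinds of transformation can be checked numerically for the segment from (0.1, 0.1) to (1, 1). The sketch below evaluates the drawn line at the point halfway along the log axis; all variable names are illustrative.

```r
x0 <- 0.1; y0 <- 0.1
x1 <- 1;   y1 <- 1

# x position halfway between the endpoints on a log10 axis
x_mid <- 10^((log10(x0) + log10(x1)) / 2)

# Scale transformation: the segment is straight in log space,
# so halfway along the axis it is halfway between y0 and y1
y_scale <- y0 + (y1 - y0) * 0.5

# Coordinate transformation: the segment is straight in linear space,
# so y follows linear interpolation in the untransformed x
y_coord <- y0 + (y1 - y0) * (x_mid - x0) / (x1 - x0)
```

Here `y_coord` equals `x_mid` (about 0.316), which lies on the true line y = x, while `y_scale` is 0.55. That gap is the artifact described in the question.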
