I have some imaging data with very faint contrast and quite a bit of noise, and it doesn't display well with a linear colour scale. Imaging software such as ImageJ or Photoshop has a tonal curve that one can tune to boost the contrast in a nonlinear fashion, effectively stretching the scale over a region of interest to reveal more detail.
As the simplest case of such a nonlinear tuning parameter, @BrianDiggs pointed out the bias argument to colorRamp, which still requires the data to be transformed to [0, 1] beforehand.
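For reference, a minimal sketch of that bias approach (the demo data here is arbitrary; rescale comes from the scales package and blues9 from grDevices):
library(scales)
xx <- runif(100, min = -20, max = 120)             # arbitrary demo data
pal <- colorRamp(rev(blues9), bias = 2)            # bias != 1 makes the ramp nonlinear
cols <- rgb(pal(rescale(xx)), maxColorValue = 255) # data must first be rescaled to [0, 1]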
I'd like to generalise the nonlinear scale to functions other than x^gamma, so the function below doesn't actually use bias in colorRamp but instead applies the transformation on the data side.
I feel like I'm reinventing the wheel; is there already such a tool for continuous colour scales in R?
Here is a possible solution:
set.seed(123)
x <- sort(runif(1e4, min = -20, max = 120))

library(scales) # for the rescale function

curve_pal <- function(x, colours = rev(blues9),
                      fun = function(x) x^gamma,
                      n = 10, gamma = 1) {
  # function that maps [0, 1] -> colours
  palfun <- colorRamp(colors = colours)
  # divide the data into n equi-spaced regions, mapped linearly to [0, 1]
  xcuts <- cut(x, breaks = seq(min(x), max(x), length = n))
  xnum <- as.numeric(xcuts)
  # work around NA values that make colorRamp/rgb choke
  testNA <- is.na(xnum)
  xsanitised <- ifelse(testNA, 0, fun(rescale(xnum)))
  # non-NA values in [0, 1] get assigned their colour
  ifelse(testNA, NA, rgb(palfun(xsanitised), maxColorValue = 255))
}
library(gridExtra)
library(grid) # for grid.newpage() and rasterGrob()
grid.newpage()
grid.arrange(rasterGrob(curve_pal(x, gamma = 0.5), width = 1, height = 1, interpolate = FALSE),
             rasterGrob(curve_pal(x, gamma = 1),   width = 1, height = 1, interpolate = FALSE),
             rasterGrob(curve_pal(x, gamma = 2),   width = 1, height = 1, interpolate = FALSE),
             nrow = 1)
I know that there are lots of discussions out there that treat this subject matter... but every time I encounter this, I never find a consistent, satisfying answer.
I'm trying to create a very basic graphical depiction of a principal components analysis model. I aim not to use packages that automatically generate plots for me, because I want control over the output.
Every time I try to make a PCA plot with loadings, I am stumped by how the canned functions relate the site-specific scores to the model's loading vectors. This is despite the myriad posts out there treating this matter, most of which just use the canned functions without explaining how the numbers get from a basic PCA model to the biplot.
For the example code below, I'll use autoplot. If I make a PCA model and call autoplot, I get a very nice graph. But I want to know how it gets these numbers: the scores get rescaled, and I have no idea how the loading vectors are relativized the way they are on the plot. Can anyone walk me through how I would get this relativized data into data frames of my own (both scores and vectors) so I can make the aesthetic changes I want without using autoplot?
d <- iris
m1 <- prcomp(d[, 1:4], scale. = TRUE)
scores <- data.frame(m1$x[, 1:2])

library(ggplot2)
# Scores range from about -2.5 to +3
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point()

library(ggfortify) # provides the autoplot method for prcomp objects
# Scores range from about -0.15 to 0.22; no clue where the relativized loadings come from
autoplot(m1, loadings = TRUE)
I'll attempt to walk you through and simplify the steps that autoplot uses to draw a PCA plot, so you can do this yourself quite easily in ggplot.
autoplot is actually an S3 generic function, so it's more accurate to talk about the method ggfortify:::autoplot.prcomp uses, since this is the function that is dispatched when you call autoplot on a prcomp object.
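If you want to inspect that method's source for yourself, something along these lines should work (getS3method looks up registered S3 methods):
library(ggfortify)
getS3method("autoplot", "prcomp") # prints the body of ggfortify:::autoplot.prcomp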
Let's start with your own example:
library(ggfortify)
library(ggplot2)
d <- iris
m1 <- prcomp(d[, 1:4], scale = TRUE)
scores <- data.frame(m1$x[, 1:2])
The scores are normalized by dividing each column by the square root of its sum of squared deviations from the mean (i.e. the Euclidean norm of the centered column):
scores[] <- lapply(scores, function(x) x / sqrt(sum((x - mean(x))^2)))
The loadings are simply obtained from the rotation member of the prcomp object:
loadings <- as.data.frame(m1$rotation)[1:2]
There is some internal scaling to ensure that the loadings appear on the same scale as the adjusted PC scores, but as far as I can tell this is simply for visual effect. The scaling amounts to about 0.2 here, and is calculated as follows:
scale <- min(max(abs(scores$PC1)) / max(abs(loadings$PC1)),
             max(abs(scores$PC2)) / max(abs(loadings$PC2))) * 0.8
scale
#> [1] 0.1987812
We now have enough to recreate the autoplot using vanilla ggplot code.
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_segment(data = loadings * scale,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               color = "red", arrow = arrow(angle = 25, length = unit(4, "mm")))
Aside from the axis titles, this is identical to the autoplot:
autoplot(m1, loadings = TRUE)
I have a perspective plot of a locfit model and I wish to add two things to it:
Predictor variables as points in the 3D space
Color the surface according to the Z axis value
For the first, I have tried to use the trans3d function. But I get the following error even though my variables are in vector format:
Error in cbind(x, y, z, 1) %*% pmat : requires numeric/complex matrix/vector arguments
Here is a snippet of my code
library(locfit)
X <- as.matrix(loc1[, 1:2])
Y <- as.matrix(loc1[, 3])
zz <- locfit(Y ~ X, kern = "bisq")
pmat <- plot(zz, type = "persp", zlab = "Amount", xlab = "", ylab = "", main = "Plains",
             phi = 30, theta = 30, ticktype = "detailed")
x1 <- as.vector(X[, 1])
x2 <- as.vector(X[, 2])
Y <- as.vector(Y)
points(trans3d(x1, x2, Y, pmat))
My "loc1" data can be found here - https://www.dropbox.com/s/0kdpd5hxsywnvu2/loc1_amountfreq.txt?dl=0
TL;DR: not really possible within plot.locfit, but you can reconstruct it yourself.
I don't think plot.locfit has good support for this sort of customisation. Supposedly get.data=TRUE in your plot call will plot the original data points (point 1), and it does seem to do so, except if type="persp". So no luck there. Alternatively, you can call points(trans3d(...)) as you have done, except that you need the perspective matrix returned by persp, and plot.locfit.3d does not return it. So again, no luck.
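As a quick check of that behaviour (reusing the zz fit from your snippet):
plot(zz, get.data = TRUE)                  # raw points are overlaid on the default plot
plot(zz, type = "persp", get.data = TRUE)  # but not on the perspective plot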
For colouring, typically you make a colour scale (http://r.789695.n4.nabble.com/colour-by-z-value-persp-in-raster-package-td4428254.html) and assign each z facet the colour that goes with it. However, you need the z-values of the surface (not the z-values of your original data) for this, and plot.locfit does not appear to return this either.
So to do what you want, you'll essentially be re-coding plot.locfit yourself (not hard, just a bit kludgy).
You could put this into a function so you can reuse it.
We:
make a uniform grid of x-y points
calculate the value of the fit at each point
use these to draw the surface (with a colour scale), saving the perspective matrix so that we can
plot your original data
so:
# make a grid of x and y coords, and calculate the fit at those points
n.x <- 20 # number of x points in the x-y grid
n.y <- 30 # number of y points in the x-y grid
zz <- locfit(Total ~ Mex_Freq + Cal_Freq, data = loc1, kern = "bisq")
xs <- with(loc1, seq(min(Mex_Freq), max(Mex_Freq), length.out = n.x))
ys <- with(loc1, seq(min(Cal_Freq), max(Cal_Freq), length.out = n.y))
xys <- expand.grid(Mex_Freq = xs, Cal_Freq = ys)
zs <- matrix(predict(zz, xys), nrow = length(xs))

# generate a colour scale
n.cols <- 100 # number of colours
palette <- colorRampPalette(c('blue', 'green'))(n.cols) # from blue to green
# palette <- colorRampPalette(c(rgb(0,0,1,.8), rgb(0,1,0,.8)), alpha = TRUE)(n.cols) # with transparency, for example

# work out which colour each facet should get: average the z-values at the four
# corners of each facet, then split those averages into n.cols bins
zfacet <- (zs[-1, -1] + zs[-1, -n.y] + zs[-n.x, -1] + zs[-n.x, -n.y]) / 4
facetcol <- cut(zfacet, n.cols)

# draw the surface, with colours (col = ...)
pmat <- persp(x = xs, y = ys, z = zs, theta = 30, phi = 30, ticktype = 'detailed',
              main = "plains", xlab = "", ylab = "", zlab = "Amount", col = palette[facetcol])

# draw your original data
with(loc1, points(trans3d(Mex_Freq, Cal_Freq, Total, pmat), pch = 20))
Note: it doesn't look that pretty! You might want to adjust, say, the colours of your colour scale, or the transparency of the facets, etc. Regarding adding a legend, there are some other questions that deal with that, but a rough sketch is given below.
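For example, a minimal sketch using base legend() and the palette, n.cols and zs objects created above (the number of breaks is arbitrary):
# rough colour-scale legend for the persp plot
breaks <- round(seq(1, n.cols, length.out = 5))
legend("topright",
       legend = signif(seq(min(zs), max(zs), length.out = 5), 3),
       fill = palette[breaks], title = "Amount", bty = "n")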
(PS: what a shame ggplot doesn't do 3D scatter plots.)
I am using the glmnet package to produce the following graph from the mtcars dataset (regressing mpg on the other variables):
library(glmnet)
fit <- glmnet(as.matrix(mtcars[-1]), mtcars[, 1])
plot(fit, xvar = "lambda")
How can I add the name of each variable to its curve, either at the beginning of the curve or at its maximal y point (furthest from the x-axis)? I can add a legend as usual, but not labels on each curve or at its start. Thanks for your help.
An alternative is the plot_glmnet function in the plotmo package. It automatically positions the variable names and has a few other bells and whistles.
For example, the following code
library(glmnet)
mod <- glmnet(as.matrix(mtcars[-1]), mtcars[,1])
library(plotmo) # for plot_glmnet
plot_glmnet(mod)
gives a plot in which the variable names are spread out to prevent overplotting, but we can still make out which curve is associated with which variable.
Further examples may be found in Chapter 6 of the plotres vignette, which is included in the plotmo package.
As the labels are hard coded, it is perhaps easier to write a quick function. This is just a quick shot, so it can be made more thorough. I would also note that when using the lasso there are normally a lot of variables, so there will be a lot of overlap of the labels (as seen in your small example).
lbs_fun <- function(fit, ...) {
  L <- length(fit$lambda)
  x <- log(fit$lambda[L]) # x position: the last (smallest) lambda
  y <- fit$beta[, L]      # y positions: the coefficients at that lambda
  labs <- names(y)
  text(x, y, labels = labs, ...)
}
# plot
plot(fit, xvar="lambda")
# label
lbs_fun(fit)
Here is a modification of the best answer, using line segments instead of text labels directly overlying the curves. This is especially useful when there are lots of variables and you only want to print those whose coefficients are nonzero at the chosen lambda:
# note: the argument 'lra' is a cv.glmnet object
lbs_fun <- function(lra, ...) {
  fit <- lra$glmnet.fit
  L <- which(fit$lambda == lra$lambda.min)
  # coefficients that are nonzero at lambda.min, sorted for labelling
  ystart <- sort(fit$beta[abs(fit$beta[, L]) > 0, L])
  labs <- names(ystart)
  # spread the label endpoints over the range of the coefficients at the smallest lambda
  r <- range(fit$beta[, ncol(fit$beta)])
  yfin <- seq(r[1], r[2], length = length(ystart))
  xstart <- log(lra$lambda.min)
  xfin <- xstart + 1
  text(xfin + 0.3, yfin, labels = labs, ...)
  segments(xstart, ystart, xfin, yfin)
}
plot(lra$glmnet.fit, label = FALSE, xvar = "lambda", xlim = c(-5.2, 0), lwd = 2) # xlim and lwd are optional
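For completeness, a rough sketch of how lra could be created for the mtcars example from the question (cv.glmnet chooses lambda.min by cross-validation):
library(glmnet)
lra <- cv.glmnet(as.matrix(mtcars[-1]), mtcars[, 1]) # cross-validated lasso fit
plot(lra$glmnet.fit, label = FALSE, xvar = "lambda", lwd = 2)
lbs_fun(lra)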
I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs = c(25, 50, 75) / 100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1, 11) +
  ggtitle("September 2010") + # opts(title = ...) in older versions of ggplot2
  geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density separately and plot it to start with. Then you can use basic arithmetic to get the estimate. The integral is approximated by adding together the areas of a set of little rectangles: the width is the difference between two successive x-values, and the height is the mean of the y-values at the beginning and end of the interval (i.e. the trapezoidal rule). I use the rollmean function from the zoo package, but this can be done with the base package too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x, Y$y, type = "l") # can use ggplot for this too
# set an Avg.Position value
Avg.pos <- 1
# construct widths and heights
xt <- diff(Y$x[Y$x < Avg.pos])
yt <- rollmean(Y$y[Y$x < Avg.pos], 2)
# this gives you the area under the curve up to Avg.pos
sum(xt * yt)
This gives you a good approximation, up to about 3 digits after the decimal point. If you know the density function itself, take a look at ?integrate.
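For instance, with a known density the area comes straight from integrate():
# exact lower-tail area for a standard normal; compare with pnorm(1)
integrate(dnorm, lower = -Inf, upper = 1)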
Three possibilities:
The logspline package provides a different method of estimating density curves, but it does include pnorm-style functions for the result (see the sketch after this list).
You could also approximate the area by feeding the x and y values returned by the density function to approxfun and using the result with integrate (also sketched below). Unless you are interested in precise estimates of small tail areas (or very small intervals), this will probably give a reasonable approximation.
Density estimates are just sums of kernels centered at the data points; one such kernel is the normal distribution. You could average the areas from pnorm (or another kernel's CDF), with the sd given by the bandwidth and centered at your data points.
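A rough sketch of the first two options (the logspline package from CRAN is assumed for the first):
X <- rnorm(100)
# option 2: interpolate the kernel density estimate and integrate it numerically
d <- density(X)
f <- approxfun(d$x, d$y)
integrate(f, lower = min(d$x), upper = 1) # estimated P(X < 1)
# option 1: logspline provides pnorm-style functions for its fit
library(logspline)
fit <- logspline(X)
plogspline(1, fit) # estimated P(X < 1)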
I have come across a number of situations where I want to plot more points than I really ought to -- the main holdup is that when I share my plots with people or embed them in papers, the files occupy too much space. It's very straightforward to randomly sample rows in a data frame.
If I want a truly random sample for a point plot, it's easy to say:
library(ggplot2)
qplot(x, y, data = myDf[sample(1:nrow(myDf), 1000), ])
However, I was wondering if there were more effective (ideally canned) ways to specify the number of plot points such that your actual data is accurately reflected in the plot. So here is an example.
Suppose I am plotting something like the CCDF of a heavy tailed distribution, e.g.
ccdf <- function(myList,density=FALSE)
{
# generates the CCDF of a list or vector
freqs = table(myList)
X = rev(as.numeric(names(freqs)))
Y =cumsum(rev(as.list(freqs)));
data.frame(x=X,count=Y)
}
qplot(x,count,data=ccdf(rlnorm(10000,3,2.4)),log='xy')
This will produce a plot where the points become increasingly dense along the x and y axes. Here it would be ideal to have fewer points plotted for large x or y values.
Does anybody have any tips or suggestions for dealing with similar issues?
Thanks,
-e
I tend to use png files rather than vector based graphics such as pdf or eps for this situation. The files are much smaller, although you lose resolution.
If it's a more conventional scatterplot, then using semi-transparent colours also helps, as well as solving the over-plotting problem. For example,
library(ggplot2)
library(scales) # for alpha()
x <- rnorm(10000); y <- rnorm(10000)
qplot(x, y, colour = I(alpha("blue", 1/25)))
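Combining the two suggestions, a minimal sketch of writing such a plot to a png device (file name and dimensions are arbitrary):
png("scatter.png", width = 1600, height = 1200, res = 200)
print(qplot(x, y, colour = I(alpha("blue", 1/25))))
dev.off()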
Beyond Rob's suggestions, one plot function I like, as it does the 'thinning' for you, is hexbin; an example is at the R Graph Gallery.
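A minimal hexbin sketch (assuming the hexbin package from CRAN):
library(hexbin)
x <- rnorm(10000); y <- rnorm(10000)
plot(hexbin(x, y, xbins = 50)) # hexagonal bins replace the individual points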
Here is one possible solution for downsampling the plot with respect to the x-axis, if it is log transformed. It log-transforms the x values, rounds that quantity to form bins, and picks the row with the median x value in each bin:
downsampled_qplot <- function(x, y, data, rounding = 0, ...) {
  # assumes we are doing log='xy' or log='x', and that `data` has columns x and count
  group <- factor(round(log(data$x), rounding))
  # within each bin, keep the row with the median x value
  d <- do.call(rbind, by(data, group,
                         function(X) X[order(X$x)[ceiling(nrow(X) / 2)], ]))
  qplot(x, count, data = d, ...)
}
Using the definition of ccdf() from above, we can then compare the original plot of the CCDF of the distribution with the downsampled version:
myccdf <- ccdf(rlnorm(10000, 3, 2.4))
qplot(x, count, data = myccdf, log = 'xy', main = 'original')
downsampled_qplot(x, count, data = myccdf, log = 'xy', rounding = 1, main = 'rounding = 1')
downsampled_qplot(x, count, data = myccdf, log = 'xy', rounding = 0, main = 'rounding = 0')
In PDF format, the original plot takes up 640K, and the downsampled versions occupy 20K and 8K, respectively.
I'd either make image files (png or jpeg devices), as Rob already mentioned, or I'd make a 2D histogram. An alternative to the 2D histogram is a smoothed scatterplot; it produces a similar graphic but has a smoother cutoff between dense and sparse regions of space.
If you've never seen addictedtor before, it's worth a look. It has some very nice graphics generated in R with images and sample code.
Here's the sample code from the addictedtor site:
2-d histogram:
require(gplots)
# example data, bivariate normal, no correlation
x <- rnorm(2000, sd=4)
y <- rnorm(2000, sd=1)
# separate scales for each axis, this looks circular
hist2d(x,y, nbins=50, col = c("white",heat.colors(16)))
rug(x,side=1)
rug(y,side=2)
box()
smoothScatter:
library("geneplotter") ## from BioConductor
require("RColorBrewer") ## from CRAN
x1 <- matrix(rnorm(1e4), ncol=2)
x2 <- matrix(rnorm(1e4, mean=3, sd=1.5), ncol=2)
x <- rbind(x1,x2)
layout(matrix(1:4, ncol=2, byrow=TRUE))
op <- par(mar=rep(2,4))
smoothScatter(x, nrpoints=0)
smoothScatter(x)
smoothScatter(x, nrpoints = Inf,
              colramp = colorRampPalette(brewer.pal(9, "YlOrRd")),
              bandwidth = 40)
colors <- densCols(x)
plot(x, col=colors, pch=20)
par(op)