Adding labels on curves in glmnet plot in R

I am using the glmnet package to get the following graph from the mtcars dataset (regression of mpg on the other variables):
library(glmnet)
fit = glmnet(as.matrix(mtcars[-1]), mtcars[,1])
plot(fit, xvar='lambda')
How can I add the variable names to each curve, either at the beginning of each curve or at its maximal y point (the point farthest from the x-axis)? I tried, and I can add a legend as usual, but not labels on each curve or at its start. Thanks for your help.

An alternative is the plot_glmnet function in the plotmo package. It automatically positions the variable names and has a few other bells and whistles.
For example, the following code
library(glmnet)
mod <- glmnet(as.matrix(mtcars[-1]), mtcars[,1])
library(plotmo) # for plot_glmnet
plot_glmnet(mod)
gives a plot with the variable names drawn beside the curves.
The variable names are spread out to prevent overplotting, but we can still make out which curve is associated with which variable.
Further examples may be found in Chapter 6 of the plotres vignette, which is included in the plotmo package.

As the labels are hard-coded, it is perhaps easier to write a quick function. This is just a quick shot, so it can be changed to be more thorough. I would also note that with the lasso there are normally a lot of variables, so there will be a lot of overlap between the labels (as seen in your small example).
lbs_fun <- function(fit, ...) {
  L <- length(fit$lambda)
  x <- log(fit$lambda[L])   # x position: the smallest lambda on the path
  y <- fit$beta[, L]        # y positions: coefficients at that lambda
  labs <- names(y)
  text(x, y, labels=labs, ...)
}
# plot
plot(fit, xvar="lambda")
# label
lbs_fun(fit)
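The overlap mentioned above can be reduced by nudging the label positions apart before calling text(). The following is only a sketch in base R: spread_positions and its min_gap argument are hypothetical names, not part of glmnet, and the greedy upward shift is one simple choice among many.

```r
# Push sorted label positions upwards until neighbours are at least
# `min_gap` apart; the result is returned in the original order.
spread_positions <- function(y, min_gap) {
  ord <- order(y)
  ys <- y[ord]
  if (length(ys) > 1) {
    for (k in 2:length(ys)) {
      if (ys[k] - ys[k - 1] < min_gap) ys[k] <- ys[k - 1] + min_gap
    }
  }
  out <- numeric(length(y))
  out[ord] <- ys
  names(out) <- names(y)
  out
}

# Made-up coefficient values, two pairs of which nearly coincide
y <- c(a = 0.00, b = 0.01, c = 0.50, d = 0.52)
adj <- spread_positions(y, min_gap = 0.1)
adj
```

Inside a labelling function, the adjusted values would replace y in the text() call, for example text(x, spread_positions(y, 0.1), labels = names(y)). The labels then no longer sit exactly on the curve ends, which is the price of keeping them legible.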

Here is a modification of the best answer, using line segments instead of text labels directly overlying the curves. This is especially useful when there are lots of variables and you only want to print those that had absolute coefficient values greater than zero:
# note: the argument 'lra' is a cv.glmnet object
lbs_fun <- function(lra, ...) {
  fit <- lra$glmnet.fit
  L <- which(fit$lambda == lra$lambda.min)
  # nonzero coefficients at lambda.min, sorted so the segments do not cross
  ystart <- sort(fit$beta[abs(fit$beta[, L]) > 0, L])
  labs <- names(ystart)
  # range of the coefficients at the smallest lambda, i.e. the last column
  # of beta (the original hard-coded column 100)
  r <- range(fit$beta[, ncol(fit$beta)])
  yfin <- seq(r[1], r[2], length.out = length(ystart))
  xstart <- log(lra$lambda.min)
  xfin <- xstart + 1
  text(xfin + 0.3, yfin, labels = labs, ...)
  segments(xstart, ystart, xfin, yfin)
}
plot(lra$glmnet.fit, label = FALSE, xvar = "lambda", xlim = c(-5.2, 0), lwd = 2) # xlim and lwd are optional
lbs_fun(lra)

Related

How to plot, in R, a correlogram on top of a correlation matrix?

I've followed the instructions on this website from STHDA to plot correlation matrices and correlograms in R. The website and examples are really good. However, I'd like to plot the upper part of the correlogram over the upper part of the correlation matrix.
Here's the code:
library(PerformanceAnalytics)
chart.Correlation(mtcars, histogram=TRUE, pch=19)
This should give me the correlation matrix using scatter plots, together with the histogram, which I'd like to maintain. But for the upper part of the plot, I'd like to have the correlogram obtained from this code:
library(corrplot)
corrplot(cor(mtcars), type="upper", order="hclust", tl.col="black", tl.srt=45)
The obvious way of doing it is exporting all the graphs as PDF and then working with Inkscape, but it would be nicer if I could get this directly from R. Is there a way to do this?
Thanks.
The trick to using the panel functions within pairs is found in help(pairs):
A panel function should not attempt to start a new plot, but just plot within a given coordinate system: thus 'plot' and 'boxplot' are not panel functions.
So, you should use graphic-adding functions, such as points, lines, polygon, or perhaps (when available) plot(..., add=TRUE), but not a straight-up plot. What you were suggesting in your comment (with SpatialPolygons) might have worked with some prodding if you had actually tried to plot it on a device instead of just returning it from your plotting function.
In my example below, I actually do "create a new plot", but I cheat (based on this SO post) by adding a second plot on top of the one already there. I do this to shortcut an otherwise necessary scale/shift, which would still not be perfect since you appear to want a "perfect circle", something that can really only be guaranteed with asp=1 (aspect ratio fixed at 1:1).
colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
## colorRamp() returns a function which takes as an argument a number
## on [0,1] and returns a color in the gradient in colorRange
myColorRampFunc <- colorRamp(colorRange)
panel.cor <- function(w, z, ...) {
  correlation <- cor(w, z)
  ## because the func needs [0,1] and cor gives [-1,1], we need to
  ## shift and scale it
  col <- rgb( myColorRampFunc( (1+correlation)/2 )/255 )
  ## use the square root as the radius so that the circle's area (not
  ## its diameter) scales with |correlation|, avoiding visual bias
  radius <- sqrt(abs(correlation))
  radians <- seq(0, 2*pi, len=50) # 50 is arbitrary
  x <- radius * cos(radians)
  y <- radius * sin(radians)
  ## make them full loops
  x <- c(x, tail(x,n=1))
  y <- c(y, tail(y,n=1))
  ## I trick the "don't create a new plot" thing by following the
  ## advice here: http://www.r-bloggers.com/multiple-y-axis-in-a-r-plot/
  ## par(new=TRUE) lets the next plot() draw over the current panel
  par(new=TRUE)
  plot(0, type='n', xlim=c(-1,1), ylim=c(-1,1), axes=FALSE, asp=1)
  polygon(x, y, border=col, col=col)
}
pairs(mtcars, upper.panel=panel.cor)
You can manipulate the size of the circles -- at the expense of unbiased visualization -- by playing with the radius. The colors I took directly from the page you linked to originally.
Similar functions can be used for your lower and diagonal panels.
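To make the two mappings inside panel.cor easier to check, here is a small sketch that isolates them from the plotting. The helper names cor_to_col and cor_to_radius are made up for illustration; the colours are the ones used above.

```r
colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
ramp <- colorRamp(colorRange)   # function from [0, 1] to an RGB matrix (0..255)

# Correlation in [-1, 1] -> hex colour: shift and scale to [0, 1] first
cor_to_col <- function(r) rgb(ramp((1 + r) / 2), maxColorValue = 255)

# Radius chosen so that the circle's area is proportional to |r|
cor_to_radius <- function(r) sqrt(abs(r))

r <- c(-1, -0.5, 0, 0.5, 1)
data.frame(r = r, colour = cor_to_col(r), radius = cor_to_radius(r))
```

With five colours in the ramp, the values -1, -0.5, 0, 0.5 and 1 land exactly on the five anchor colours, so a correlation of zero is drawn in white and has radius zero.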

Calibration (inverse prediction) from LOESS object in R

I have fit a LOESS local regression to some data and I want to be able to find the X value associated with a given Y value.
plot(cars, main = "Stopping Distance versus Speed")
car_loess <- loess(cars$dist~cars$speed,span=.5)
lines(1:50, predict(car_loess,data.frame(speed=1:50)))
I was hoping that I could use the inverse.predict function from the chemCal package, but that does not work for LOESS objects.
Does anyone have any idea how I might be able to do this calibration in a better way than predicting Y values from a long vector of X values, looking through the resulting fitted Y for the Y value of interest, and taking its corresponding X value?
Practically speaking in the above example, let's say I wanted to find the speed at which the stopping distance is 15.
Thanks!
The predicted line that you added to the plot is not quite right. Use code like this instead:
# plot the loess line
lines(cars$speed, car_loess$fitted, col="red")
You can use the approx() function to get a linear approximation from the loess line at a given y value. It works just fine for the example that you give:
# define a given y value at which you wish to approximate x from the loess line
givenY <- 15
estX <- approx(x=car_loess$fitted, y=car_loess$x, xout=givenY)$y
# add corresponding lines to the plot
abline(h=givenY, lty=2)
abline(v=estX, lty=2)
But, with a loess fit, there may be more than one x for a given y. The approach I am suggesting does not provide you with ALL of the x values for the given y. For example ...
# example with non-monotonic x-y relation
y <- c(1:20, 19:1, 2:20)
x <- seq(y)
plot(x, y)
fit <- loess(y ~ x)
# plot the loess line
lines(x, fit$fitted, col="red")
# define a given y value at which you wish to approximate x from the loess line
givenY <- 15
estX <- approx(x=fit$fitted, y=fit$x, xout=givenY)$y
# add corresponding lines to the plot
abline(h=givenY, lty=2)
abline(v=estX, lty=2)
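If all of the x values for a given y are needed, one possible approach is to look for sign changes of (fitted - givenY) between neighbouring points and interpolate linearly within each such interval. The sketch below assumes the x values are sorted; find_crossings is a hypothetical helper, not a function from stats or chemCal.

```r
# Return every x at which the piecewise-linear curve (x, yfit) equals y0.
# Assumes x is sorted in increasing order.
find_crossings <- function(x, yfit, y0) {
  d <- yfit - y0
  n <- length(d)
  # intervals where the curve passes strictly from one side of y0 to the other
  idx <- which(d[-n] * d[-1] < 0)
  x_cross <- x[idx] + (y0 - yfit[idx]) * (x[idx + 1] - x[idx]) /
    (yfit[idx + 1] - yfit[idx])
  # also keep points that hit y0 exactly
  sort(c(x[d == 0], x_cross))
}

# The non-monotonic example from above
y <- c(1:20, 19:1, 2:20)
x <- seq_along(y)

# On the raw zigzag the crossings can be checked by hand
find_crossings(x, y, 15.5)   # 15.5 24.5 53.5

# On a loess fit (the span here is an arbitrary choice)
fit <- loess(y ~ x, span = 0.3)
estX <- find_crossings(x, fitted(fit), 15)
estX
```

How many crossings the loess version returns depends on how strongly the chosen span smooths the curve, so the result should be checked against the plot.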

Add points to pairs plot?

Is there any way for me to add some points to a pairs plot?
For example, I can plot the Iris dataset with pairs(iris[1:4]), but I wanted to execute a clustering method (for example, kmeans) over this dataset and plot its resulting centroids on the plot I already had.
It would help too if there's a way to plot the whole data and the centroids together in a single pairs plot in such a way that the centroids can be plotted in a different way. The idea is, I plot pairs(rbind(iris[1:4],centers)) (where centers are the three centroids' data) but plotting the three last elements of this matrix in a different way, like changing cex or pch. Is it possible?
You give the solution yourself in the last paragraph of your question. Yes, you can use pch and col in the pairs function.
pairs(rbind(iris[1:4], kmeans(iris[1:4],3)$centers),
pch=rep(c(1,2), c(nrow(iris), 3)),
col=rep(c(1,2), c(nrow(iris), 3)))
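The same idea can be written out step by step, which makes it easier to vary cex as well. This is a sketch only; the seed, the plotting symbols and the name is_centre are arbitrary choices for illustration.

```r
set.seed(42)                          # kmeans starts from random centres
cl <- kmeans(iris[1:4], centers = 3)

# Stack the data and the centroids, and remember which rows are centroids
both <- rbind(iris[1:4], as.data.frame(cl$centers))
is_centre <- rep(c(FALSE, TRUE), c(nrow(iris), nrow(cl$centers)))

# One value per row: pairs() applies these vectors point by point
pairs(both,
      pch = ifelse(is_centre, 17, 1),
      col = ifelse(is_centre, "red", "grey40"),
      cex = ifelse(is_centre, 2, 1))
```

Because the centroids are the last rows, they are drawn on top of the data points in every panel.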
Another option is to use a panel function:
cl <- kmeans(iris[1:4],3)
idx <- subset(expand.grid(x=1:4,y=1:4),x!=y)
i <- 1
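pairs(iris[1:4],bg=cl$cluster,pch=21,
      panel=function(x, y,bg, ...) {
        points(x, y, pch=21,bg=bg)
        # idx[i,] holds the (column, row) variables of the panel being drawn
        points(cl$centers[,idx[i,'x']],cl$centers[,idx[i,'y']],
               cex=4,pch=10,col='blue')
        i <<- i +1   # advance to the next panel; reset i to 1 before replotting
      })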
But I think it is safer and easier to use the splom function from lattice. The legend is also automatically generated.
cl <- kmeans(iris[1:4],3)
library(lattice)
splom(iris[1:4],groups=cl$cluster,pch=21,
      panel=function(x, y,i,j,groups, ...) {
        panel.points(x, y, pch=21,col=groups)
        panel.points(cl$centers[,j],cl$centers[,i],
                     pch=10,col='blue')
      },auto.key=TRUE)

gradient colour scale with gamma parameter?

I have some imaging data with very faint contrast and quite a bit of noise, and when I display it with a linear colour scale it doesn't show well. In imaging software such as imageJ or photoshop, there's a tonal curve that one can tune to bump the contrast in a nonlinear fashion and effectively stretch the scale on some region of interest to see more details.
As the simplest case of such a nonlinear tuning parameter, @BrianDiggs pointed out the bias argument to colorRamp, which still requires a prior transformation of the data to [0, 1].
I'd like to generalise the non-linear scale to other functionals than x^gamma, therefore the function below doesn't actually use bias in colorRamp but does the transformation on the data side.
I feel like I'm reinventing the wheel; is there already such a tool for continuous colour scales in R?
Here is a possible solution:
set.seed(123)
x <- sort(runif(1e4, min=-20 , max=120))
library(scales) # rescale function
curve_pal <- function (x, colours = rev(blues9),
                       fun = function(x) x^gamma,
                       n=10, gamma=1)
{
  # function that maps [0,1] -> colours
  palfun <- colorRamp(colors=colours)
  # now divide the data in n equi-spaced regions, mapped linearly to [0,1]
  xcuts <- cut(x, breaks=seq(min(x), max(x), length=n))
  xnum <- as.numeric(xcuts)
  # need to work around NA values that make colorRamp/rgb choke
  testNA <- is.na(xnum)
  xsanitised <- ifelse(testNA, 0, fun(rescale(xnum)))
  # non-NA values in [0,1] get assigned their colour
  ifelse(testNA, NA, rgb(palfun(xsanitised), maxColorValue=255))
}
library(grid)      # grid.newpage(), rasterGrob()
library(gridExtra) # grid.arrange()
grid.newpage()
grid.arrange(rasterGrob(curve_pal(x, gamma=0.5), width=1, height=1, interpolate=FALSE),
             rasterGrob(curve_pal(x, gamma=1), width=1, height=1, interpolate=FALSE),
             rasterGrob(curve_pal(x, gamma=2), width=1, height=1, interpolate=FALSE),
             nrow=1)
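For comparison, the core of the idea (rescale to [0, 1], apply x^gamma, then look up the colour) can be sketched in base R without the binning step. gamma_pal is a hypothetical name, and the sketch assumes the data are not all equal, since the rescaling divides by the range.

```r
# Map numeric data to colours through a gamma curve.
gamma_pal <- function(x, colours = rev(blues9), gamma = 1) {
  rng <- range(x, na.rm = TRUE)
  u <- (x - rng[1]) / (rng[2] - rng[1])   # linear rescale to [0, 1]
  u <- u^gamma                            # nonlinear stretch of the scale
  out <- rep(NA_character_, length(x))    # NA inputs stay NA
  ok <- !is.na(u)
  out[ok] <- rgb(colorRamp(colours)(u[ok]), maxColorValue = 255)
  out
}

x <- c(0, 25, 50, 100, NA)
gamma_pal(x, gamma = 1)
gamma_pal(x, gamma = 0.5)   # expands the low end of the scale
```

With gamma = 0.5 the value 25 (a quarter of the range) is mapped to the middle of the palette, which is where 50 sits on the linear scale. Replacing u^gamma with another monotone function on [0, 1] gives other tonal curves.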

maximum plot points in R?

I have come across a number of situations where I want to plot more points than I really ought to -- the main holdup is that when I share my plots with people or embed them in papers, they occupy too much space. It's very straightforward to randomly sample rows in a dataframe.
If I want a truly random sample for a point plot, it's easy to say:
qplot(x, y, data=myDf[sample(nrow(myDf), 1000), ])
However, I was wondering if there were more effective (ideally canned) ways to specify the number of plot points such that your actual data is accurately reflected in the plot. So here is an example.
Suppose I am plotting something like the CCDF of a heavy tailed distribution, e.g.
ccdf <- function(myList, density=FALSE)
{
  # generates the CCDF of a list or vector
  # (the 'density' argument is currently unused)
  freqs <- table(myList)
  X <- rev(as.numeric(names(freqs)))
  Y <- cumsum(rev(as.numeric(freqs)))
  data.frame(x=X, count=Y)
}
qplot(x,count,data=ccdf(rlnorm(10000,3,2.4)),log='xy')
This will produce a plot where the points become increasingly dense along the x and y axes. Here it would be ideal to have fewer samples plotted for large x or y values.
Does anybody have any tips or suggestions for dealing with similar issues?
Thanks,
-e
I tend to use png files rather than vector based graphics such as pdf or eps for this situation. The files are much smaller, although you lose resolution.
If it's a more conventional scatterplot, then using semi-transparent colours also helps, as well as solving the over-plotting problem. For example,
x <- rnorm(10000); y <- rnorm(10000)
qplot(x, y, colour=I(alpha("blue",1/25)))
Beyond Rob's suggestions, one plot function I like as it does the 'thinning' for you is hexbin; an example is at the R Graph Gallery.
Here is one possible solution for downsampling a plot with respect to the x-axis, if it is log transformed. It log transforms the x values, rounds that quantity, and picks the median x value in each bin:
downsampled_qplot <- function(x, y, data, rounding=0, ...) {
  # assumes we are doing log='xy' or log='x'
  # note: the x and y arguments are ignored; the columns 'x' and 'count' are plotted
  group <- factor(round(log(data$x), rounding))
  # within each bin keep the row with the median x
  # (nrow() counts rows; length() of a data frame counts its columns)
  d <- do.call(rbind, by(data, group,
         function(X) X[order(X$x)[ceiling(nrow(X)/2)], ]))
  qplot(x, count, data=d, ...)
}
Using the definition of ccdf() from above, we can then compare the original plot of the CCDF of the distribution with the downsampled version:
myccdf=ccdf(rlnorm(10000,3,2.4))
qplot(x,count,data=myccdf,log='xy',main='original')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=1,main='rounding = 1')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=0,main='rounding = 0')
In PDF format, the original plot takes up 640K, and the downsampled versions occupy 20K and 8K, respectively.
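The binning step can also be sketched on its own in base R, which makes it easy to check how many points survive. downsample_logx and its digits argument are hypothetical names, and the count column below is only a rough stand-in for the CCDF counts used above.

```r
# Keep one row per log-x bin: the row whose x is the median of the bin.
downsample_logx <- function(data, digits = 0) {
  bin <- round(log(data$x), digits)
  pick <- tapply(seq_len(nrow(data)), bin, function(rows) {
    rows <- rows[order(data$x[rows])]   # sort the bin's rows by x
    rows[ceiling(length(rows) / 2)]     # median row of the bin
  })
  data[sort(as.integer(pick)), , drop = FALSE]
}

set.seed(1)
d <- data.frame(x = rlnorm(5000, 3, 2.4))
d$count <- rank(-d$x)                   # number of points with x at least this large

small <- downsample_logx(d, digits = 1)
c(original = nrow(d), downsampled = nrow(small))
```

Larger values of digits give narrower bins and therefore more retained points, which mirrors the rounding argument of downsampled_qplot.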
I'd either make image files (png or jpeg devices) as Rob already mentioned, or I'd make a 2D histogram. An alternative to the 2D histogram is a smoothed scatterplot: it makes a similar graphic but has a smoother cutoff from dense to sparse regions of space.
If you've never seen addictedtor before, it's worth a look. It has some very nice graphics generated in R with images and sample code.
Here's the sample code from the addictedtor site:
2-d histogram:
require(gplots)
# example data, bivariate normal, no correlation
x <- rnorm(2000, sd=4)
y <- rnorm(2000, sd=1)
# separate scales for each axis, this looks circular
hist2d(x,y, nbins=50, col = c("white",heat.colors(16)))
rug(x,side=1)
rug(y,side=2)
box()
smoothscatter:
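## smoothScatter() and densCols() ship with current R (graphics, grDevices);
## on old versions they came from geneplotter: library("geneplotter")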
require("RColorBrewer") ## from CRAN
x1 <- matrix(rnorm(1e4), ncol=2)
x2 <- matrix(rnorm(1e4, mean=3, sd=1.5), ncol=2)
x <- rbind(x1,x2)
layout(matrix(1:4, ncol=2, byrow=TRUE))
op <- par(mar=rep(2,4))
smoothScatter(x, nrpoints=0)
smoothScatter(x)
smoothScatter(x, nrpoints=Inf,
colramp=colorRampPalette(brewer.pal(9,"YlOrRd")),
bandwidth=40)
colors <- densCols(x)
plot(x, col=colors, pch=20)
par(op)
