How autoplot (ggplot) gets scores and loadings from prcomp

I know that there are lots of discussions out there that treat this subject matter... but every time I encounter this, I never find a consistent, satisfying answer.
I'm trying to create a very basic graphical depiction of a principal components analysis model. I always aim to avoid packages that automatically generate plots for me, because I want the control.
Every time I try to make a PCA plot with loadings, I am stumped by how the canned functions relate their site-specific scores to the model's loading vectors. This is despite the myriad posts out there treating this matter, most of which just use the canned functions without explaining how the numbers got from a basic PCA model to the biplot.
For the example code below, I'll use autoplot. If I make a PCA model and use autoplot, I get a very cool graph. But I want to know how they get these numbers: the scores get rescaled, and I have no idea how the vectors are relativized the way they are on the plot. Can anyone walk me through how I would get this relativized data into data frames of my own (both scores and vectors) so I can make the aesthetic changes I want without using autoplot?
d <- iris
m1 <- prcomp(d[, 1:4], scale. = TRUE)
scores <- data.frame(m1$x[, 1:2])
library(ggplot2)
library(ggfortify) # provides the autoplot() method for prcomp objects
# Scores range from about -2.5 to +3
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point()
# Scores range from about -0.15 to 0.22; no clue where the relativized loadings come from
autoplot(m1, loadings = TRUE)

I'll attempt to walk you through and simplify the steps that autoplot uses to draw a PCA plot, so you can do this yourself quite easily in ggplot.
autoplot is actually an S3 generic function, so it's more accurate to talk about the method ggfortify:::autoplot.prcomp uses, since this is the function that is dispatched when you call autoplot on a prcomp object.
Let's start with your own example:
library(ggfortify)
library(ggplot2)
d <- iris
m1 <- prcomp(d[, 1:4], scale. = TRUE)
scores <- data.frame(m1$x[, 1:2])
The scores are normalized by dividing each column by the square root of its centered sum of squares (i.e. its Euclidean norm after centering):
scores[] <- lapply(scores, function(x) x / sqrt(sum((x - mean(x))^2)))
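As a quick sanity check, both columns now sit on the small scale seen on the autoplot axes (roughly -0.15 to 0.22, as noted in the question):
# the rescaled scores now match the axis range of the autoplot
sapply(scores, range)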
The loadings are simply obtained from the rotation member of the prcomp object:
loadings <- as.data.frame(m1$rotation)[1:2]
There is some internal scaling to ensure that the loadings appear on the same scale as the adjusted PC scores, but as far as I can tell this is simply for visual effect. The scaling amounts to about 0.2 here, and is calculated as follows:
scale <- min(max(abs(scores$PC1)) / max(abs(loadings$PC1)),
             max(abs(scores$PC2)) / max(abs(loadings$PC2))) * 0.8
scale
#> [1] 0.1987812
We now have enough to recreate the autoplot using vanilla ggplot code.
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_segment(data = loadings * scale,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               color = "red", arrow = arrow(angle = 25, length = unit(4, "mm")))
Aside from the axis titles, this is identical to the autoplot:
autoplot(m1, loadings = TRUE)
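If you also want autoplot's axis titles, which report the percentage of variance explained by each component, they can be rebuilt from the sdev element of the prcomp object (a small sketch; the exact label format is an approximation of what autoplot prints):
# percentage of variance explained by each principal component
pct <- 100 * m1$sdev^2 / sum(m1$sdev^2)
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_segment(data = loadings * scale,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               color = "red", arrow = arrow(angle = 25, length = unit(4, "mm"))) +
  labs(x = sprintf("PC1 (%.2f%%)", pct[1]),
       y = sprintf("PC2 (%.2f%%)", pct[2]))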

Related

R: Best fit for data (Exponential or Power), with curve predicted beyond final data point

So, I am stuck and would appreciate a little guidance.
I have used the rriskDistributions package to evaluate some CDFs for industrial sector injury data with the get.lnorm.par() function. It fits the data well; unfortunately, the axes require swapping, because my response variable is currently on the x-axis and needs to be on the y-axis. Worse, the get.lnorm.par() function requires that the probabilities be on the y-axis, and I cannot figure out how to create the same curve with the axes swapped.
An example of the code that I have worked through in ggplot follows:
library(ggplot2)
x <- c(0.0416988, 0.0656371, 0.1015444, 0.1270270, 0.1536680, 0.1694981, 0.2509653)
y <- c(3170221, 6810103, 14999840, 26623982, 48903587, 74177290, 266181110)
prob <- c(x)  # there are 389 different x values, but keeping it simple!
quant <- c(y) # same as x
df1 <- data.frame(prob, quant)
plot2 <- ggplot(df1, aes(x = prob, y = quant)) +
  geom_point() +
  geom_smooth(method = "lm", formula = log(y) ~ x, se = FALSE) +
  labs(y = "quantiles", x = "probabilities", title = "Probs vs Quants")
plot2
I have created lines that fit this data, but everything ends at the last data point.
When I used get.lnorm.par(), the fit was great, but as stated previously, the axes require flipping. When I tried that, I kept getting errors about infinite output and could not define the bounds of the function to be plotted.
So, here is the code using the rriskDistributions package:
library(rriskDistributions)
pct <- c(0.0416988, 0.0656371, 0.1015444, 0.1270270, 0.1536680, 0.1694981, 0.2509653)
my.lnorm <- get.lnorm.par(p = pct,
                          q = c(3170221, 6810103, 14999840, 26623982, 48903587, 74177290, 266181110),
                          tol = 0.001, scaleX = c(0, 0.0809))
Essentially, I am trying to create a fit curve for the data (either exponential or power) that expands, or predicts beyond the final data point. This I cannot figure out for the life of me, and changing any of the parameters in the rriskDistributions functions is quite challenging.
Any thoughts?
Thanks.
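One way to get a curve that extends past the final data point, assuming the exponential (log-linear) form already implied by the lm fit above, is to fit the model outside ggplot and predict over a grid that runs beyond the largest observed probability; the 0.35 upper limit below is an arbitrary choice for illustration:
# exponential fit: log(quant) is linear in prob
fit <- lm(log(quant) ~ prob, data = df1)
# prediction grid extending beyond the last observation (prob ~ 0.25)
new <- data.frame(prob = seq(min(df1$prob), 0.35, length.out = 200))
new$quant <- exp(predict(fit, newdata = new))
ggplot(df1, aes(x = prob, y = quant)) +
  geom_point() +
  geom_line(data = new, colour = "blue") +
  labs(x = "probabilities", y = "quantiles", title = "Probs vs Quants")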

color discrete groups of parallel coordinate plot in GGally package

To create a parallel coordinate plot I wanted to use the ggparcoord() function in the GGally package. The following code shows a reproducible example.
set.seed(3674)
k <- rep(1:3, each=30)
x <- k + rnorm(mean=10, sd=.2,n=90)
y <- -2*k + rnorm(mean=10, sd=.4,n=90)
z <- 3*k + rnorm(mean=10, sd=.6,n=90)
dat <- data.frame(group=factor(k),x,y,z)
library(GGally)
ggparcoord(dat,columns=1:4,groupColumn = 1)
Notice in the picture that the color for group was continuous even though I have the group variable as a factor. Is there any way I can display the plot with three discrete colors instead?
I have looked at some other posts that discuss various other ways of doing parallel coordinate plots, but I really wanted to do this with the ggparcoord() function of the GGally package. I appreciate your time in thinking about this problem.
Your code was almost correct: columns = 1:4 is the problem here. You need to drop the groupColumn from columns:
ggparcoord(dat, columns = 2:4, groupColumn = 1)
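Since ggparcoord() returns an ordinary ggplot object, you can then pick the three discrete colors yourself; the palette below is just an example:
# choose explicit colors for the three groups
ggparcoord(dat, columns = 2:4, groupColumn = 1) +
  scale_colour_manual(values = c("#1b9e77", "#d95f02", "#7570b3"))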

gradient colour scale with gamma parameter?

I have some imaging data with very faint contrast and quite a bit of noise, and when I display it with a linear colour scale it doesn't show well. In imaging software such as ImageJ or Photoshop, there's a tonal curve that one can tune to bump the contrast in a nonlinear fashion and effectively stretch the scale over some region of interest to see more details.
As the simplest case of such a nonlinear tuning parameter, @BrianDiggs pointed out the bias argument to colorRamp, which still requires the data to first be transformed to lie in [0, 1].
I'd like to generalise the nonlinear scale to functional forms other than x^gamma, so the function below doesn't actually use bias in colorRamp but does the transformation on the data side.
I feel like I'm reinventing the wheel; is there already such a tool for continuous colour scales in R?
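For comparison, the bias route mentioned above might look like the sketch below; bias is colorRamp's built-in single-parameter nonlinearity (per its documentation, higher values give more widely spaced colours at the high end), so it is less general than an arbitrary function:
library(scales) # rescale()
x <- sort(runif(1e4, min = -20, max = 120))
# colorRamp's own nonlinearity, applied on the palette side
palfun <- colorRamp(c("white", "steelblue"), bias = 2)
cols <- rgb(palfun(rescale(x)), maxColorValue = 255)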
Here is a possible solution:
set.seed(123)
x <- sort(runif(1e4, min = -20, max = 120))
library(scales) # rescale()
curve_pal <- function(x, colours = rev(blues9),
                      fun = function(x) x^gamma,
                      n = 10, gamma = 1) {
  # function that maps [0, 1] -> colours
  palfun <- colorRamp(colors = colours)
  # divide the data into n equi-spaced regions, mapped linearly to [0, 1]
  xcuts <- cut(x, breaks = seq(min(x), max(x), length = n))
  xnum <- as.numeric(xcuts)
  # work around NA values that make colorRamp/rgb choke
  testNA <- is.na(xnum)
  xsanitised <- ifelse(testNA, 0, fun(rescale(xnum)))
  # non-NA values in [0, 1] get assigned their colour
  ifelse(testNA, NA, rgb(palfun(xsanitised), maxColorValue = 255))
}
library(grid)      # rasterGrob()
library(gridExtra) # grid.arrange()
grid.newpage()
grid.arrange(rasterGrob(curve_pal(x, gamma = 0.5), width = 1, height = 1, interpolate = FALSE),
             rasterGrob(curve_pal(x, gamma = 1), width = 1, height = 1, interpolate = FALSE),
             rasterGrob(curve_pal(x, gamma = 2), width = 1, height = 1, interpolate = FALSE),
             nrow = 1)
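And since fun is a free parameter, any monotone map of [0, 1] to itself can stand in for the power curve; for example, a sine-based tone curve:
# a sine-based tone curve instead of x^gamma
grid.newpage()
grid.draw(rasterGrob(curve_pal(x, fun = function(x) sin(x * pi / 2)),
                     width = 1, height = 1, interpolate = FALSE))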

ggplot2 2d Density Weights

I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_points and geom_density2d. I want geom_density2d to be weighted based on the organisation's size (the OrgSize variable). However, when I add OrgSize as a weighting variable, nothing changes in the plot:
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering if I'm using density2d inappropriately: should I instead be using contour and supplying OrgSize as the 'height'? If so, why does geom_density2d accept a weighting factor at all?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight parameter, but then not passing it on to MASS::kde2d, since that function has no notion of weights. As a consequence, you will need to use a different 2d-density method.
(I realize my answer does not address why the help page says that geom_density2d "understands" the weight argument; when I have tried to calculate weighted 2D KDEs, I have needed to use packages other than MASS. Maybe this is a TODO that @hadley put in the help page that then got overlooked?)
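As one example of such a different method, the ks package's kde() accepts observation weights through its w argument; the sketch below (weights rescaled to sum to the sample size, default bandwidth left alone) evaluates a weighted 2D KDE on a grid and draws it with geom_contour():
library(ks) # kde() supports observation weights via `w`
library(ggplot2)
# weighted 2D kernel density estimate on a grid
est <- kde(x = data[, c("Distance", "Rate")],
           w = data$OrgSize / sum(data$OrgSize) * nrow(data))
# flatten the grid so ggplot can draw contour lines
dens_grid <- expand.grid(Distance = est$eval.points[[1]],
                         Rate = est$eval.points[[2]])
dens_grid$density <- as.vector(est$estimate)
ggplot(data, aes(x = Distance, y = Rate)) +
  geom_point() +
  geom_contour(data = dens_grid, aes(z = density))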

Calculating an area under a continuous density plot

I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs = c(25, 50, 75) / 100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1, 11) +
  labs(title = "September 2010") +
  geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density separately and plot that first. Then you can use basic arithmetic to get the estimate: an integral is approximated by adding together the areas of a set of thin rectangles. I use the mean method (effectively the trapezoidal rule): the width is the difference between two consecutive x-values, and the height is the mean of the y-values at the beginning and end of the interval. I use the rollmean function from the zoo package, but this can be done with base R too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x, Y$y, type = "l") # can use ggplot for this too
# set an Avg.Position value
Avg.pos <- 1
# construct widths and heights
xt <- diff(Y$x[Y$x < Avg.pos])
yt <- rollmean(Y$y[Y$x < Avg.pos], 2)
# this gives you the area
sum(xt * yt)
This gives you a good approximation, accurate to about three decimal places. If you know the density function itself, take a look at ?integrate.
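For reference, the same trapezoidal sum can be written in base R without zoo, as noted above:
# same trapezoidal sum without the zoo dependency
idx <- Y$x < Avg.pos
xs <- Y$x[idx]
ys <- Y$y[idx]
sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)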
Three possibilities:
1. The logspline package provides a different method of estimating density curves, and it does include pnorm-style functions for the result.
2. You could also approximate the area by feeding the x and y values returned by the density function to approxfun, and using the result with integrate. Unless you are interested in precise estimates of small tail areas (or very small intervals), this will probably give a reasonable approximation.
3. Density estimates are just sums of kernels centered at the data points; one such kernel is the normal distribution. You could average the areas from pnorm (or other kernels), with the sd defined by the bandwidth and each kernel centered at a data point.
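A minimal sketch of the second option, reusing the Y density object from the previous answer:
# turn the density grid into a function, then integrate numerically
f <- approxfun(Y$x, Y$y, yleft = 0, yright = 0)
integrate(f, lower = min(Y$x), upper = Avg.pos)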
