PCA: scores vs loadings in biplot - r

I was investigating the interpretation of a biplot and meaning of loadings/scores in PCA in this question: What are the principal components scores?
According to the author of the first answer the scores are:
         x      y
John -44.6  33.2
Mike -51.9  48.8
Kate -21.1  44.35
According to the second answer, regarding "The interpretation of the four axes in a biplot":
The left and bottom axes are showing [normalized] principal component
scores; the top and right axes are showing the loadings.
So, theoretically, after plotting the biplot from "What are principal components scores" I should get the scores on the left and bottom axes:
         x      y
John -44.6  33.2
Mike -51.9  48.8
Kate -21.1  44.35
and on the right and top the loadings.
I entered the data he provided in R:
DF<-data.frame(Maths=c(80, 90, 95), Science=c(85, 85, 80), English=c(60, 70, 40), Music=c(55, 45, 50))
pca = prcomp(DF, scale = FALSE)
biplot(pca)
This is the plot I got:
Firstly, the left and bottom axes represent the loadings of the principal components, and the top and right axes represent the scores. BUT the scores do not correspond to the ones the author of the post provided: point 3 (i.e. Kate) has positive scores on the plot, yet a negative score on PC1 according to Tony Breyal's first answer to the linked question.
If I am doing or understanding something wrong, where is my mistake?

There are a few nuances you missed:
biplot.princomp function
For some reason biplot.princomp scales the loading and score axes differently. So the scores you see are transformed. To get the actual values you can invoke the biplot function like this:
biplot(pca, scale=0)
see help(biplot.princomp) for more.
Now the values are actual scores. You can confirm this by comparing the plot with pca$x.
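For instance, a quick check using only what prcomp already returns:
pca$x   # the score coordinates drawn by biplot(pca, scale=0)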
Centering.
However, the result is still not the same as in the answer you found on Cross Validated. This is because Tony Breyal calculated the scores manually, and he used non-centered data to do so. The prcomp function centers the data by default and then uses the centered data to compute the scores.
So you can center the data first:
> scale(DF, scale=FALSE)
         Maths   Science    English Music
[1,] -8.333333  1.666667   3.333333     5
[2,]  1.666667  1.666667  13.333333    -5
[3,]  6.666667 -3.333333 -16.666667     0
And now use these numbers to get the scores as per the answer:
      x                                                      y
John  0.28*(-8.3) + -0.17*1.6   + -0.94*3     + 0.07*5      0.77*(-8.3) + -0.08*1.6   + 0.19*3     + -0.60*5
Mike  0.28*1.6    + -0.17*1.6   + -0.94*13    + 0.07*(-5)   0.77*1.6    + -0.08*1.6   + 0.19*13    + -0.60*(-5)
Kate  0.28*6.6    + -0.17*(-3.3) + -0.94*(-16) + 0.07*0     0.77*6.6    + -0.08*(-3.3) + 0.19*(-16) + -0.60*0
After doing this you should get the same scores as plotted by biplot(pca, scale=0).
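Equivalently, you can let R do the arithmetic: multiplying the centered data by the loading matrix reproduces prcomp's scores (a quick check using only objects defined above):
centered <- scale(DF, scale=FALSE)   # center the data as above
centered %*% pca$rotation            # matches pca$x up to rounding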
Hope this helps.

Related

Guidance on producing multivariate Radar plots in ggplot

I am currently attempting to make two different radar plots (see attached drawn representations).
Both plots use the same nested experimental design with two independent variables, location and lithology, covering 36 experimental sites. Location has three levels (W, C, E), each containing three lithologies (C, M, P).
Plot one represents the distribution of altitude and orientation across sites, with orientation in degrees (0-360) and altitude in meters.
Plot two will be used to represent the differing mineralogy of groups of sites utilizing three minerals (A, B, C).
Drawn depiction of radar plot 1: four cardinal directions, with colours representing climate location (independent variable 1) and shapes representing lithology (independent variable 2); circles represent altitude, with greater distance from the center indicating higher altitude.
The second radar plot uses the same independent variables, location (W,C,E) and lithology (C,M,P), with the red highlighted area representing the quantity of mineral A and yellow indicating mineral B.
If anyone has any pointers, packages or guides which could help I would greatly appreciate it.
Edit: The second plot doesn't seem too difficult to make, but the first is still causing issues. I have partially solved it using polar plots but am now having difficulty adjusting the aesthetics (see the sketch after the code below).
Data head:
head(radar_plot_data)
   climate  lithology     plot pos_lit  code altitude_m orientation_ao
1: Central Calcareous C.C.Pp.1      CC C.C.1       1150          150.8
2: Central Calcareous C.C.Pp.2      CC C.C.2        860           24.0
3: Central Calcareous C.C.Pp.3      CC C.C.3       1026           90.0
4: Central Calcareous C.C.Pp.4      CC C.C.4       1326           86.3
5: Central Calcareous C.C.Pp.5      CC C.C.5        966           87.5
6: Central Metapelite C.M.Pp.1      CM C.M.1        951           28.3
Current code:
ggplot(radar_plot_data, aes(x = orientation_ao)) +
  geom_point(aes(x = orientation_ao, y = altitude_m,
                 shape = climate, color = lithology, stroke = 2.5)) +
  coord_polar() +
  scale_x_continuous(limits = c(0, 360),
                     breaks = seq(0, 360, by = 45),
                     minor_breaks = seq(0, 360, by = 15))
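For what it's worth, a minimal polar-plot sketch along those lines, assuming radar_plot_data is structured as in the head() output above (untested against the full dataset; note that a constant stroke arguably belongs outside aes(), otherwise ggplot creates a spurious legend entry for it):
library(ggplot2)
# Sketch only: map the two independent variables to shape and colour,
# use orientation as the angle and altitude as the radius.
ggplot(radar_plot_data,
       aes(x = orientation_ao, y = altitude_m,
           shape = climate, colour = lithology)) +
  geom_point(stroke = 2.5, size = 3) +
  coord_polar() +
  scale_x_continuous(limits = c(0, 360),
                     breaks = seq(0, 360, by = 45),
                     minor_breaks = seq(0, 360, by = 15)) +
  labs(x = "Orientation (degrees)", y = "Altitude (m)")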

Why doesn't ggtern plot some points?

I'm trying to make a plot from a data.frame that contains positive and negative values, and I cannot plot all the points. Does anyone know if it is possible to adapt the code so that all points are plotted?
example = data.frame(X1=c(-1,-0.5,1),X2=c(-1,0.7,1),X3=c(1,-0.5,2))
ggtern(data=example, aes(x=X1,y=X2,z=X3)) + geom_point()
Well, actually your points are getting plotted, but they lie outside the plot region.
Firstly, to understand why: each row of your data must sum to unity, and if it doesn't, it will be coerced that way. Therefore what you will actually be plotting is the following:
example = example / matrix(rep(rowSums(example),ncol(example)),nrow(example),ncol(example))
example
        X1        X2        X3
1 1.000000  1.000000 -1.000000
2 1.666667 -2.333333  1.666667
3 0.250000  0.250000  0.500000
Now the rows sum to unity:
print(rowSums(example))
[1] 1 1 1
You have negative values, which are nonsensical in terms of 'composition'; however, negative concentrations can still be plotted, since they numerically represent points outside the ternary area. Let's expand the limits and render to see where they lie:
ggtern(data=example, aes(x=X1, y=X2, z=X3)) +
  geom_mask() +
  geom_point() +
  limit_tern(5, 5, 5)
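(For what it's worth: geom_mask() inserts the ternary clipping mask at that point in the layer order, so the points drawn after it remain visible above it, and limit_tern(5, 5, 5) widens each apex limit so the out-of-simplex points fall inside the drawing region.)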

Identifying points in a curve

I feel like this is an easy question...
How do you identify coordinates in a figure? I plotted some data, used unireg (from the uniReg package) to fit a spline curve, and want to pull out the data from a point.
library(uniReg)
P0mM <- read.table(text="
Time FeuM
0.04 138.8181818
7 1258.636364
14 1320.545455
21 2110.37037
28 13730.37037
35 1550.909091",header=TRUE)
z=seq(min(P0mM$Time),max(P0mM$Time),length=201)
uf=with(P0mM,unireg(Time,FeuM,g=5,sigma=1))
plot(FeuM~Time,P0mM,ylim=c(0,16000),ylab="Fe2+ uM", xlab="Time", main="0mM P")
lines(z,uf$unimod.func(z))
I was able to find the max y value of the curve (which is about 14444):
max(uf$unimod.func(z))
I want to identify where on the x axis this happens. (Should be around 30, but I want to be exact).
How do you do this?
Thanks!
Seems like a case for optimise or optimize (depending on your affinity with British or American English):
optimise(uf$unimod.func, maximum=TRUE, interval=range(P0mM$Time))
#$maximum
#[1] 29.27168
#
#$objective
#         [,1]
#[1,] 14444.85
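As a side note, optimise() performs a one-dimensional search (a combination of golden-section search and successive parabolic interpolation), which suits this problem well: unireg fits a unimodal curve by construction, so the interval contains a single peak.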

Identifying data points amongst background noise for binned data R

Not sure whether this should go on Cross Validated or not, but we'll see. Basically, I recently obtained data from an instrument (masses of compounds from 0 to 630), which I binned into 0.025-wide bins before plotting a histogram, as seen below:
I want to identify the bins that are of high frequency and stand out against the background noise (the background noise increases as you move from right to left on the x-axis). Imagine drawing a curved line on top of the points that have almost blurred together into a black lump, and then selecting the bins that exist above that curve to investigate further; that's what I'm trying to do. I plotted a kernel density estimate to see if I could overlay it on top of my histogram and use it to identify points that lie above it. However, the density plot makes no headway with this, as the density values are too low (see the second plot). Does anyone have any recommendations on how I can go about solving this problem? The blue line represents the overlaid density function, and the red line represents the ideal solution (I need a way of somehow automating this in R).
The data below is only part of my dataset, so it's not really a good representation of my plot (which contains about 300,000 points), and since my bin sizes are quite small (0.025) there's a huge spread of data (in total there are 25,000 or so bins).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
look at a window around each bin, say ±50 bins:
current.bin <- 1
window.size <- 50
# note the parentheses: ":" binds tighter than "+" and "-" in R
window <- bin.densities[(current.bin - window.size):(current.bin + window.size)]
find the 95% upper and lower quantile value (or really any value you think works)
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant)
this.is.too.low <- (bin.densities[current.bin] < lower.quant)
#final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but this is the general approach I would take. You can play around with window size and the quantile percentages until the results look reasonable. Again, not exactly super complex math but hopefully it helps.
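If it helps, here is a self-contained version of the same sketch, looping over all bins and clamping the window at the edges of the vector (still an untested heuristic in the spirit of the above; bin.densities, the window size and the quantile cutoffs are all assumptions to tune):
# Flag bins whose density falls outside the 5%/95% quantiles
# of a +-50-bin window centered on them.
find.outlier.bins <- function(bin.densities, window.size = 50,
                              lower.p = 0.05, upper.p = 0.95) {
  n <- length(bin.densities)
  is.outlier <- logical(n)
  for (current.bin in seq_len(n)) {
    # clamp the window to the ends of the vector
    lo <- max(1, current.bin - window.size)
    hi <- min(n, current.bin + window.size)
    window <- bin.densities[lo:hi]
    is.outlier[current.bin] <-
      bin.densities[current.bin] < quantile(window, lower.p) ||
      bin.densities[current.bin] > quantile(window, upper.p)
  }
  is.outlier
}
# Example usage with the histogram built above:
# h <- hist(df$values, breaks = bin, plot = FALSE)
# which(find.outlier.bins(h$density))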

rgl 2D surface plot of matrix not enough detail

I am trying to use rgl to surface plot this matrix "r":
library(rgl)
             Banana     Apple BlueBerry      Kiwi    Raisin Strawberry
Chicago   1.0000000 0.9972928 0.9947779 1.0623767 0.9976347  0.9993892
Wilmette  1.0016507 1.0000000 0.9976524 0.9863927 0.9985248  1.0016828
Winnetka  1.0040722 1.0025362 1.0000000 0.9886008 1.0016501  0.9955785
Glenview  0.9961316 1.0105463 1.0167024 1.0000000 1.0129399  1.0123440
Deerfield 1.0023308 1.0026052 0.9979093 0.9870921 1.0000000  1.0025606
Wheeling  1.0073697 0.9985745 1.0045129 0.9870925 1.0008054  1.0000000
rgl.surface(1:6, 1:6, r, color="red", back="lines")
Since the z-values are so close together in magnitude, the surface plot looks almost flat, even though there are subtle bumps in the data.
1) How do I make it so that I have a zoomed-in version where I can see as much detail as possible?
2) Is there a way to show in different colors the faces that have the biggest "slope", while preserving the labels of the columns and rows of the matrix on the 3D surface (maybe just using the first three letters of the labels)? In other words, so I can see that the Kiwi in Chicago and the Kiwi in Wilmette cause the greatest min/max variation?
Something like this?
library(rgl)
library(colorRamps) # for matlab.like(...)
palette <- matlab.like(10) # palette of 10 colors
rlim <- range(r[!is.na(r)])
colors <- palette[9*(r-rlim[1])/diff(rlim) + 1]
open3d(scale=c(1/6,1/6,1/diff(range(r))))
surface3d(1:6, 1:6, r, color=colors, back="lines")
Part of your problem is that you were using rgl.surface(...) incorrectly. The second argument is the matrix of z-values. With surface3d(...) the arguments are x, y, z, in that order.
EDIT: Response to OP's comment.
Using your ex post facto dataset...
open3d(scale=c(1/6,1/6,1/diff(range(r))))
bbox3d(color="white")
surface3d(1:6, 1:6, r, color=colors, back="lines")
axis3d('x--',labels=rownames(r),tick=TRUE)
axis3d('y-+',labels=colnames(r),tick=TRUE)
axis3d('z--',tick=TRUE)
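(The strings 'x--', 'y-+' and 'z--' are axis3d's edge codes; they select which edge of the bounding box each axis is drawn along. See ?axis3d.)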
