R: prcomp, Order of columns in data table matter?

I ran the prcomp function on a data table containing 91 columns and 2030 rows and obtained a PCA plot. However, when I re-ordered the columns of the same data table to make it easier to color-code the data points, I got an entirely different-looking PCA plot.
Does the order of the columns matter in prcomp()?
Just a note, the code included was provided for me by someone previously in my lab, who is no longer here to ask. I have a moderate understanding of what it is doing.
Thanks for the help!
pcaPlotter3d <- function(fileName, startColumn, endColumn) {
  x <- read.table(fileName, sep = '\t', header = TRUE, stringsAsFactors = FALSE)
  pcaData <- prcomp(~., x[, startColumn:endColumn], na.action = na.exclude, scale = TRUE)
  library(scatterplot3d)
  colorList <- c(rep("magenta", 2), rep("blue", 12), rep("red", 33), rep("purple", 2), rep("green", 6), rep("black", 36))
  shapeList <- c(rep(19, 91)) #, rep(15, 24))
  with(pcaData, {
    pointsForPlot <- scatterplot3d(pcaData$rotation[, 1:3], color = colorList,
      pch = shapeList, main = "TAP Proteins PCA", mar = c(3, 3, 3, 5), xlab = "PC1 (16.5%)", ylab = "PC2 (3.67%)", zlab = "PC3 (2.79%)",
      col.grid = NULL)
    pointsForPlot.coords <- pointsForPlot$xyz.convert(pcaData$rotation[, 1:3])
    legend(8, 5, bty = "n", xpd = TRUE, cex = 0.75, inset = .1,
      title = "Groups", c("Bio", "EF", "IF", "RF", "Rib", "Unk"),
      col = c("magenta", "blue", "red", "purple", "green", "black"), pch = c(19, 19, 19, 19, 19, 19))
  })
  print(summary(pcaData))
}

It's hard to see in the perspective plots, but it seems that all that has happened is that the signs of PC2 and PC3 have been flipped. (Eigenvectors/PCA directions are only defined up to a change in sign, and a trivial change like changing the order of the columns can indeed cause them to flip.) Given that the inertias/proportions of variance are the same and the ranges of the axes are inverted (e.g. PC2 goes from -0.1 to 0.5 in plot 1 and -0.5 to 0.1 in plot 2), this is the most likely explanation. You can simply multiply the PC2 and PC3 coordinates by -1 in the appropriate places if you want to recover the original plot ...
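The sign ambiguity is easy to check on synthetic data. The following is a minimal sketch, assuming a random matrix `X` and an arbitrary column permutation `perm` (both made up for illustration, not the asker's data). It compares two `prcomp()` fits and then flips the second one so that each component points the same way as in the first.

```r
set.seed(42)

# Synthetic data (an assumption for illustration, not the asker's file)
X <- matrix(rnorm(200 * 5), ncol = 5,
            dimnames = list(NULL, paste0("v", 1:5)))
X[, 2] <- X[, 1] + 0.5 * X[, 2]   # introduce some correlation

perm <- c(3, 5, 1, 4, 2)          # arbitrary re-ordering of the columns

p1 <- prcomp(X, scale. = TRUE)
p2 <- prcomp(X[, perm], scale. = TRUE)

# Standard deviations of the components do not depend on column order
same_sdev <- isTRUE(all.equal(p1$sdev, p2$sdev))

# Scores and loadings agree once the sign is ignored
same_scores   <- isTRUE(all.equal(abs(p1$x), abs(p2$x)))
same_loadings <- isTRUE(all.equal(abs(p1$rotation),
                                  abs(p2$rotation[rownames(p1$rotation), ])))

# Flip each component of the second fit so it points the same way as the first
flip <- sign(colSums(p1$x * p2$x))
p2_scores_aligned   <- sweep(p2$x, 2, flip, `*`)
p2_loadings_aligned <- sweep(p2$rotation, 2, flip, `*`)
```

Since the function in the question plots `pcaData$rotation`, the loadings are the part that would need flipping there. Note also that `colorList` is assigned by column position, so it has to follow the new column order.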

Related

How can I change the colour of my points on my db-RDA triplot in R?

QUESTION: I am building a triplot for the results of my distance-based RDA in R, library(vegan). I can get a triplot to build, but can't figure out how to make the colours of my sites different based on their location. Code below.
#running the db-RDA
spe.rda.signif=capscale(species~canopy+gmpatch+site+year+Condition(pair), data=env, dist="bray")
#extract % explained by first 2 axes
perc <- round(100*(summary(spe.rda.signif)$cont$importance[2, 1:2]), 2)
#extract scores (coordinates in RDA space)
sc_si <- scores(spe.rda.signif, display="sites", choices=c(1,2), scaling=1)
sc_sp <- scores(spe.rda.signif, display="species", choices=c(1,2), scaling=1)
sc_bp <- scores(spe.rda.signif, display="bp", choices=c(1, 2), scaling=1)
#These are my location or site names that I want to use to define the colours of my points
site_names <-env$site
site_names
#set up blank plot with scaling, axes, and labels
plot(spe.rda.signif,
scaling = 1,
type = "none",
frame = FALSE,
xlim = c(-1,1),
ylim = c(-1,1),
main = "Triplot db-RDA - scaling 1",
xlab = paste0("db-RDA1 (", perc[1], "%)"),
ylab = paste0("db-RDA2 (", perc[2], "%)")
)
#add points for site scores - these are the ones that I want to be two different colours based on the labels in the original data, i.e., env$site or site_names defined above. I have copied the current state of the graph
points(sc_si,
pch = 21, # set shape (here, circle with a fill colour)
col = "black", # outline colour
bg = "steelblue", # fill colour
cex = 1.2) # size
[Image: current graph]
I am able to add species names and arrows for environmental predictors, but am just stuck on how to change the colour of the site points to reflect their location (I have two locations defined in my original data). I can get them labelled with text, but that is messy.
Any help appreciated!
I have tried separating shape or colour of point by site_name, but no luck.
If you only have a few groups (in your case, two), you could make the group a factor (within the plot call). In R, factors are stored as integers "behind the scenes", and base R maps the integers 1 to 8 to the colours of the default palette(), so a factor can be passed directly as a colour:
set.seed(123)
df <- data.frame(xvals = runif(100),
yvals = runif(100),
group = sample(c("A", "B"), 100, replace = TRUE))
plot(df[1:2], pch = 21, bg = as.factor(df$group),
bty = "n", xlim = c(-1, 2), ylim = c(-1, 2))
legend("topright", unique(df$group), pch = 21,
pt.bg = unique(as.factor(df$group)), bty = "n")
If you have more than 8 groups, or if you would like to define your own colors, you can simply create a vector of colors the length of your groups and still use the same factor method, though with a few slight tweaks:
# data with 10 groups
set.seed(123)
df <- data.frame(xvals = runif(100),
yvals = runif(100),
group = sample(LETTERS[1:10], 100, replace = TRUE))
# 10 group colors
ccols <- c("red", "orange", "blue", "steelblue", "maroon",
"purple", "green", "lightgreen", "salmon", "yellow")
plot(df[1:2], pch = 21, bg = ccols[as.factor(df$group)],
bty = "n", xlim = c(-1, 2), ylim = c(-1, 2))
legend("topright", unique(df$group), pch = 21,
pt.bg = ccols[unique(as.factor(df$group))], bty = "n")
For pch, the only tweak is to wrap the factor in as.numeric before indexing:
pchh <- c(21, 22)
ccols <- c("slateblue", "maroon")
plot(df[1:2], pch = pchh[as.numeric(as.factor(df$group))], bg = ccols[as.factor(df$group)],
bty = "n", xlim = c(-1, 2), ylim = c(-1, 2))
legend("topright", unique(df$group),
pch = pchh[unique(as.numeric(as.factor(df$group)))],
pt.bg = ccols[unique(as.factor(df$group))], bty = "n")
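Applied to the site scores from the question, the same idea might look like the sketch below. It is only an illustration: `sc_si` and `site_names` are simulated stand-ins for the asker's objects, the level names "north" and "south" are hypothetical, and the legend uses `levels()` rather than `unique()` so that labels and colours are taken from the same named vector.

```r
set.seed(1)

# Hypothetical stand-ins for sc_si (site scores) and env$site
sc_si <- matrix(rnorm(40), ncol = 2,
                dimnames = list(NULL, c("CAP1", "CAP2")))
site_names <- rep(c("north", "south"), each = 10)

site_f <- factor(site_names)

# One fill colour per level, named so the mapping is explicit
site_cols <- setNames(c("steelblue", "tomato"), levels(site_f))
pt_bg <- unname(site_cols[as.character(site_f)])

# Empty plot, then points filled by site, then a matching legend
plot(sc_si, type = "n", xlab = "db-RDA1", ylab = "db-RDA2")
points(sc_si, pch = 21, col = "black", bg = pt_bg, cex = 1.2)
legend("topright", legend = names(site_cols), pch = 21,
       pt.bg = site_cols, bty = "n")
```

With the real objects, only the `points()` call should need changing: `bg = pt_bg` replaces `bg = "steelblue"`.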

Scaling in vegan RDA plots, how is it working?

I don't understand how the scaling works in vegan when plotting ordinations.
I found this question, which will help clarify my point. From what I can read in the book "Numerical Ecology with R", there are differences between scaling = 1 and scaling = 2. In particular, with scaling 1 "The angles among descriptor vectors do not reflect their correlations", while with scaling 2 "The angles between descriptors in the biplot reflect their correlations".
So I ran this code (partially copy-pasted from the cited question) and I get two different plots (the axis span is different, so the scaling parameter is presumably doing something), but I don't see much difference between the angles of the descriptor vectors, so I am trying to understand what, if anything, is wrong.
What am I missing here?
library("vegan")
data(varespec)
data(varechem)
ord <- rda(varespec)
set.seed(1)
(fit <- envfit(ord, varechem, perm = 999))
## make up a fake `status`
status <- factor(rep(c("Class1","Class2"), times = nrow(varespec) / 2))
## manual version with extra things
colvec <- c("red","green")
scl <- 1
plot(ord, type = "n", scaling = scl, main="Scaling 1")
points(ord, display = "sites", col = colvec[status], pch = (1:2)[status])
points(ord, display = "species", pch = "+")
plot(fit, add = TRUE, col = "black")
dev.new()
scl <- 2
plot(ord, type = "n", scaling = scl, main="Scaling 2")
points(ord, display = "sites", col = colvec[status], pch = (1:2)[status])
points(ord, display = "species", pch = "+")
plot(fit, add = TRUE, col = "black")
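The two quoted statements can be illustrated without vegan. The sketch below is a base R illustration under stated assumptions: the data `Y` are simulated, "scaling 1" is taken to mean unit-length eigenvectors as descriptor scores, and "scaling 2" to mean eigenvectors multiplied by the square root of their eigenvalues. vegan applies an additional constant to its scores, so this shows the geometry only, not vegan's exact numbers.

```r
set.seed(7)

# Simulated "species" matrix (an assumption for illustration)
Y <- matrix(rnorm(60 * 4), ncol = 4,
            dimnames = list(NULL, paste0("sp", 1:4)))
Y[, 2] <- Y[, 1] + 0.3 * Y[, 2]   # make sp1 and sp2 strongly correlated

pc <- prcomp(Y)                   # centred, unscaled PCA

V <- pc$rotation                  # scaling-1 style: unit eigenvectors
B <- V %*% diag(pc$sdev)          # scaling-2 style: eigenvectors * sqrt(eigenvalue)

# Cosine of the angle between every pair of descriptor vectors (rows)
cosines <- function(M) {
  M <- M / sqrt(rowSums(M^2))     # normalise each row to unit length
  M %*% t(M)
}

cos1 <- cosines(V)   # full-dimensional angles under scaling 1
cos2 <- cosines(B)   # full-dimensional angles under scaling 2
```

Under these assumptions `cos2` reproduces `cor(Y)`, while `cos1` is the identity matrix, so the scaling-1 angles carry no correlation information. This holds when all axes are used. A two-dimensional plot shows only a projection, so the angles seen on screen approximate these values.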

Problem with my code- Univariate regression plot not showing lines

This will sound very basic, but I cannot find the solution to this problem with my code. I ran a univariate regression (regr1) between the two variables immigrate_policy and lrgen. When plotting, the lines for the confidence interval do not show up.
Could the problem be the sequence? The range of lrgen should actually be between 1 and 9, but I had to put 1:8 in manually because every other sequence I tried gives me an error. With this sequence, however, the lines in the plot look weird and are definitely not right.
Following is my code:
regr1 <- lm(formula = ITA$immigrate_policy ~ ITA$lrgen, data = ITA)
summary(regr1)
install.packages("stargazer")
library(stargazer)
help(stargazer)
stargazer(regr1, type = "html", out = "project.html")
stargazer(regr1, type = "text", out = "project/regression.html")
plot(ITA$lrgen, ITA$immigrate_policy,
xlab = "Political Stance of the party", ylab = "Position towards Immigration policies")
abline(regr1, col = "red", lwd = 2)
range(ITA$lrgen)
ci <- data.frame(lrgen = seq(1:8))
sim <- predict(regr1, newdata = ci, interval = "confidence", level = 0.99)
lines(c(1:8), sim[,2], lt = "dashed", lwd = 1, col = "yellow")
lines(c(1:8), sim[,3], lt = "dashed", lwd = 1, col = "yellow")
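A likely cause is the formula. Because it is written with `ITA$lrgen`, `predict()` cannot match the `lrgen` column in `newdata`, so it returns predictions for the original rows (with a warning about the row counts) instead of for the requested grid. The sketch below shows one way around this. It assumes simulated data in place of the asker's `ITA` data frame and uses bare column names in the formula. The grid step of 0.5 and the line colour are arbitrary choices.

```r
set.seed(3)

# Simulated stand-in for the asker's ITA data frame (an assumption)
ITA <- data.frame(lrgen = runif(40, min = 1, max = 9))
ITA$immigrate_policy <- 2 + 0.7 * ITA$lrgen + rnorm(40)

# Bare column names, so predict() can find lrgen in newdata
regr1 <- lm(immigrate_policy ~ lrgen, data = ITA)

# Prediction grid spanning the full 1 to 9 range
ci <- data.frame(lrgen = seq(1, 9, by = 0.5))
sim <- predict(regr1, newdata = ci, interval = "confidence", level = 0.99)

plot(ITA$lrgen, ITA$immigrate_policy,
     xlab = "Political Stance of the party",
     ylab = "Position towards Immigration policies")
abline(regr1, col = "red", lwd = 2)

# Lower and upper confidence bounds, drawn against the same grid
lines(ci$lrgen, sim[, "lwr"], lty = "dashed", col = "blue")
lines(ci$lrgen, sim[, "upr"], lty = "dashed", col = "blue")
```

Here `sim` has one row per grid value, so any sequence within the range of the data can be used for `ci`.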

Calculate intersection point of two density curves in R

I have two vectors of 1000 values (a and b), from which I created density plots and histograms. I would like to retrieve the coordinates (or just the y value) where the two plots cross (it does not matter if it detects several crossings, I can discriminate them afterwards). Please find the data in the following link. Sample Data
xlim = c(min(c(a,b)), max(c(a,b)))
hist(a, breaks = 100,
freq = F,
xlim = xlim,
xlab = 'Test Subject',
main = 'Difference plots',
col = rgb(0.443137, 0.776471, 0.443137, 0.5),
border = rgb(0.443137, 0.776471, 0.443137, 0.5))
lines(density(a))
hist(b, breaks = 100,
freq = F,
col = rgb(0.529412, 0.807843, 0.921569, 0.5),
border = rgb(0.529412, 0.807843, 0.921569, 0.5),
add = T)
lines(density(b))
Using locator() is not optimal, since I need to retrieve this from several plots (but I will use that approach if nothing else is viable). Thanks for your help.
We calculate the density curves for both series, taking care to use the same range. Then, we compare whether the y-value for a is greater than b at each x-value. When the outcome of this comparison flips, we know the lines have crossed.
df <- merge(
as.data.frame(density(a, from = xlim[1], to = xlim[2])[c("x", "y")]),
as.data.frame(density(b, from = xlim[1], to = xlim[2])[c("x", "y")]),
by = "x", suffixes = c(".a", ".b")
)
df$comp <- as.numeric(df$y.a > df$y.b)
df$cross <- c(NA, diff(df$comp))
points(df[which(df$cross != 0), c("x", "y.a")])
This marks each crossing point on the existing plot.
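As a self-contained variant of the same sign-change idea, the sketch below uses simulated vectors `a` and `b` (assumptions, since the linked sample data are not available here) and adds a linear interpolation step, so that the crossing is located between grid points rather than at the nearest one.

```r
set.seed(10)

# Simulated stand-ins for the asker's two vectors
a <- rnorm(1000, mean = 0, sd = 1)
b <- rnorm(1000, mean = 2, sd = 1)

# Evaluate both densities on the same grid (same from, to and default n)
rng <- range(c(a, b))
da <- density(a, from = rng[1], to = rng[2])
db <- density(b, from = rng[1], to = rng[2])

# The curves cross wherever the sign of the difference changes
d <- da$y - db$y
idx <- which(diff(sign(d)) != 0)

# Linear interpolation of the zero of d between grid points idx and idx + 1
x0 <- da$x[idx]
x1 <- da$x[idx + 1]
x_cross <- x0 - d[idx] * (x1 - x0) / (d[idx + 1] - d[idx])
y_cross <- approx(da$x, da$y, xout = x_cross)$y

crossings <- data.frame(x = x_cross, y = y_cross)
```

`crossings` may contain more than one row if the curves also cross in the tails, which matches the question's remark that several crossings can be filtered afterwards.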

Customising vegan ordination plot

I have a dataset including 100 species, which makes it very hard to read when plotted. So I want to pick out a subset of these species and plot them in an RDA plot. I have been following this guideline.
The code looks like this:
## load vegan
require("vegan")
## load the Dune data
data(dune, dune.env)
## PCA of the Dune data
mod <- rda(dune, scale = TRUE)
## plot the PCA
plot(mod, scaling = 3)
## build the plot up via vegan methods
scl <- 3 ## scaling == 3
colvec <- c("red2", "green4", "mediumblue")
plot(mod, type = "n", scaling = scl)
with(dune.env, points(mod, display = "sites", col = colvec[Use],
scaling = scl, pch = 21, bg = colvec[Use]))
text(mod, display = "species", scaling = scl, cex = 0.8, col = "darkcyan")
with(dune.env, legend("topright", legend = levels(Use), bty = "n",
col = colvec, pch = 21, pt.bg = colvec))
This is the plot you end up with. Now I would really like to remove some of the species from the plot, but not from the analysis, so that the plot only shows, for example, Salrep, Viclat, Aloge and Poatri.
Help is appreciated.
The functions that do the actual plotting have a select argument (at least text.cca() and points.cca() do). select takes either a logical vector with one element per item, indicating whether that item should be plotted, or the (numeric) indices of the items to plot. The example would then become:
## Load vegan
library("vegan")
## load the Dune data
data(dune, dune.env)
## PCA of the Dune data
mod <- rda(dune, scale = TRUE)
## plot the PCA
plot(mod, scaling = 3)
## build the plot up via vegan methods
scl <- 3 ## scaling == 3
colvec <- c("red2", "green4", "mediumblue")
## Show only these spp
sppwant <- c("Salirepe", "Vicilath", "Alopgeni", "Poatriv")
sel <- names(dune) %in% sppwant
## continue plotting
plot(mod, type = "n", scaling = scl)
with(dune.env, points(mod, display = "sites", col = colvec[Use],
scaling = scl, pch = 21, bg = colvec[Use]))
text(mod, display = "species", scaling = scl, cex = 0.8, col = "darkcyan",
select = sel)
with(dune.env, legend("topright", legend = levels(Use), bty = "n",
col = colvec, pch = 21, pt.bg = colvec))
This labels only the selected species on the plot.
You may also use the ordiselect() function from the goeveg package:
https://CRAN.R-project.org/package=goeveg
It offers selection of species for ordination plots based on abundances and/or species fit to axes.
## Select spp. with filter: 50% most abundant and 50% best fitting
library(goeveg)
sel <- ordiselect(dune, mod, ablim = 0.5, fitlim = 0.5)
sel # 12 species selected
The result object of the function (containing the names of selected species) can be put into the select argument (as described above).
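Whichever way the selection is built, it helps to check that the requested names actually match the column names, since abbreviations such as "Salrep" will silently select nothing. The sketch below is a base R illustration: `spp_names` is a hand-typed stand-in for `names(dune)`, and the misspelt entry in `sppwant` is deliberate.

```r
# Stand-in for names(dune): a few column names typed by hand for illustration
spp_names <- c("Achimill", "Alopgeni", "Bellpere", "Poatriv",
               "Salirepe", "Vicilath")

# Requested species, including one abbreviation that does not match
sppwant <- c("Salirepe", "Vicilath", "Alopgeni", "Poatriv", "Salrep")

# Names that were requested but are not present in the data
not_found <- setdiff(sppwant, spp_names)

# The two equivalent forms described above for select
sel_logical <- spp_names %in% sppwant   # one TRUE/FALSE per species
sel_index   <- which(sel_logical)       # numeric positions of the TRUEs
```

Either `sel_logical` or `sel_index` describes the same four species. `not_found` lists anything that needs its spelling corrected before plotting.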
