Plotting principal components 3,4 using autoplot (ggfortify) in R

I have been trying to use autoplot (in the ggfortify R package) to plot data points in PCA coordinates. For data matrix D2,
autoplot(prcomp(D2),colour=color_codes)
works fine as far as generating a scatterplot of points in the space of principal components 1+2. However, PCA components 1+2 only explain about 30% of the variance, and I would like to do the same for PCA 1+3, 2+3, and 3+4, etc. Is there a simple argument in autoplot that will let me do this, and if not, what's the simplest function I can use to do so?
Additionally, is there some way to calculate and add centroids using autoplot?

From ?autoplot.prcomp:
autoplot(object, data = NULL, scale = 1, x = 1, y = 2, ...)
where:
x: principal component number used on the x axis
y: principal component number used on the y axis
Hence, if you need to plot PC2 vs PC3 and add the centroid:
library(ggfortify)
set.seed(1)
D2 <- matrix(rnorm(1000),ncol=10)
prcmp <- prcomp(D2)
pc.x <- 2
pc.y <- 3
cnt.x <- mean(prcmp$x[,pc.x])
cnt.y <- mean(prcmp$x[,pc.y])
autoplot(prcmp, x = pc.x, y = pc.y) +
  geom_point(x = cnt.x, y = cnt.y, colour = "red", size = 5)
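The same arguments answer the question in the title: a minimal sketch for principal components 3 and 4, reusing the prcmp object from above:
autoplot(prcmp, x = 3, y = 4)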

Related

3D plot from model in R plotly?

Is it possible to generate a 3D plot from models using plotly? I tried to search over the internet, but many examples are based on the infamous volcano dataset that generates a plot from a matrix of points.
My two models are:
y = 0.49867x - 4.78577
y = 76.13084x + 4.81945
If not possible, how can I transform my data into a matrix format like that of the volcano dataset? For more details, I have hosted the data file here. I have never used plotly before and I'm unfamiliar with the grammar, but I think I can manage if I can at least format my data like the volcano dataset.
Thank you.
To plot a surface with plotly, you need to construct a numeric matrix.
Taking Himmelblau's function as a test:
f <- function(x, y) { (x^2+y-11)^2 + (x+y^2-7)^2 }
Create x and y values:
x <- seq(-6, 6, length = 100)
y <- x
Then, create z with the outer function; it returns a matrix.
z <- outer(x, y, f)
We can now create a surface plot:
library(plotly)
plot_ly(x = x, y = y, z = ~z) %>% add_surface()
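If the surface comes from a fitted model rather than a known function, the same pattern applies: evaluate the model over a grid of the two predictors and let outer reshape the predictions into a matrix. A minimal sketch, using a hypothetical lm fit on mtcars rather than the asker's data:
library(plotly)
# hypothetical model: predict mpg from two predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)
wt.grid <- seq(min(mtcars$wt), max(mtcars$wt), length = 50)
hp.grid <- seq(min(mtcars$hp), max(mtcars$hp), length = 50)
# evaluate the model on every grid combination; outer returns a 50 x 50 matrix
z <- outer(wt.grid, hp.grid,
           function(w, h) predict(fit, newdata = data.frame(wt = w, hp = h)))
# if the axes appear swapped, pass t(z) instead of z
plot_ly(x = wt.grid, y = hp.grid, z = z) %>% add_surface()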

How to plot numerous polygons in each data category?

I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused about how to organize the data so that multiple ellipses are plotted and the guides (color, fill, linetype, etc.) are applied per group.
In traditional base R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data in such a way that it works in the long format preferred by ggplot while preserving the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
test.data <- read.csv("ellipse_test_data.csv")
ggplot(test.data) +
  geom_point(aes(x, y, color = group)) +
  theme_classic()
I am using a package called SIBER, which fits Bayesian ellipses to the data for comparing groups by ellipse area, etc. The output of the following creates a list with one element per group of data; each element is an n x 6 matrix (n = number of posterior draws) for the fitted ellipse: the first four columns are the covariance matrix Sigma in vector form and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random number of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert each row (a Sigma and mu pair) into a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do it all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
Here is some code to do this on the bundled demo dataset from SIBER.
In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
                 xlab = c("Community | Group"),
                 ylab = expression("Standard Ellipse Area " ('\u2030' ^2)),
                 bty = "L",
                 las = 1,
                 main = "SIBER ellipses on each group")
Now we want to create some plots of some sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this example we simply take the first 10 posterior draws, assuming them to be independent of one another, but you could take a random sample if you prefer (see the note after the loop below).
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)) {
  # a dummy variable to build in the loop
  ell <- NULL
  post.id <- NULL
  for (j in 1:n.posts) {
    # covariance matrix
    Sigma <- matrix(ellipses.posterior[[i]][j, 1:4], 2, 2)
    # mean
    mu <- ellipses.posterior[[i]][j, 5:6]
    # ellipse points
    out <- ellipse::ellipse(Sigma, centre = mu, level = p.ell)
    ell <- rbind(ell, out)
    post.id <- c(post.id, rep(j, nrow(out)))
  }
  ell <- as.data.frame(ell)
  ell$rep <- post.id
  all_ellipses[[i]] <- ell
}
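# Note (assumption, not in the original answer): to use a random sample of the
# posterior draws rather than the first n.posts, pick the row indices with
# sample() and loop over those, e.g.
#   draw.ids <- sample(nrow(ellipses.posterior[[i]]), n.posts)
#   for (j in draw.ids) { ...same body as the inner loop above... }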
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and convert to a matrix, NB byrow = TRUE
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
                           nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
  geom_point(aes(color = factor(group):factor(community)), size = 2) +
  ylab(expression(paste(delta^{15}, "N (\u2030)"))) +
  xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
  theme(text = element_text(size = 15))
print(first.plot)
Now we can add the posterior ellipses on top and facet by group.
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# add the posterior ellipses as polygons, one per draw (grouped by rep)
third.plot <- second.plot +
  geom_polygon(data = ellipse_df,
               mapping = aes(iso1, iso2,
                             group = rep,
                             color = factor(group):factor(community),
                             fill = NULL),
               fill = NA,
               alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group

Place different QQ plot (with different datasets) in the same coordinate system

I can only get the QQ plots one at a time for the different datasets.
library(fitdistrplus)
x1<-c(1300,541,441,35,278,167,276,159,126,60.8,160,5000,264.6,379,170,251.3,155.84,187.01,850)
x2<-c(25,500,42,100,10,8.2,76,2.2,7.86,50)
y1<-log10(x1)
y2<-log10(x2)
x1.logis <- fitdist(y1, "logis", method="mle")
x2.logis <- fitdist(y2, "logis", method="mle")
ppcomp(x1.logis, addlegend=FALSE)
ppcomp(x2.logis, addlegend=FALSE)
How can I place the two QQ plots in the same coordinate system?
Use ggplot2. You need to extract your fitted values from the fitdist objects and make a new data frame. Use ggplot2 layers to add the fitted values from the two datasets and then add an abline.
library(ggplot2)
fittedx1 <- data.frame(x = sort(plogis(x1.logis$data,
                                       location = x1.logis$estimate[1],
                                       scale = x1.logis$estimate[2])),
                       p = (1:length(x1.logis$data))/length(x1.logis$data))
fittedx2 <- data.frame(x = sort(plogis(x2.logis$data,
                                       location = x2.logis$estimate[1],
                                       scale = x2.logis$estimate[2])),
                       p = (1:length(x2.logis$data))/length(x2.logis$data))
fitted <- rbind(fittedx1,fittedx2) #You need to combine the two datasets
#Add a variable that identifies which dataset the values belong to
#Then you can use the col option in ggplot to give each data set its own color!
fitted$set <- c(rep("1", nrow(fittedx1)), rep("2", nrow(fittedx2)))
#Now plot
ggplot(fitted) +
  geom_point(aes(p, x, col = set), shape = 1, size = 3) +
  geom_abline(intercept = 0, slope = 1)

Circular density plot using ggplot2

I'm working with circular data and I wanted to reproduce this kind of plot using ggplot2:
library(circular)
data1 <- rvonmises(1000, circular(0), 10, control.circular=list(units="radians")) ## sample
quantile.circular(data1,c(0.05,.95)) ## for interval
data2 <- mean(data1)
dens <- density(data1, bw=27)
p<-plot(dens, points.plot=TRUE, xlim=c(-1,2.1),ylim=c(-1.0,1.2),
main="Circular Density", ylab="", xlab="")
points(circular(0), plot.info=p, col="blue",type="o")
arrows.circular(c(5.7683795,0.5151433 )) ## confidence interval
arrows.circular(data2, lwd=3) ## circular mean
The thinnest arrows are the extremes of my interval.
I suppose the blue point is the forecast.
The third arrow is the circular mean.
I need the circular density.
I've been looking for something similar but have not found anything.
Any suggestion?
Thanks
To avoid running in the wrong direction, could you quickly check whether this code goes the way you intend? The arrows can be added with arrow() (from the grid package) attached to a segment layer; see the sketch after the plot below.
EDIT: One remark on the somewhat roundabout way of attaching density values: ggplot's geom_density does not seem to work well with coord_polar (at least the way I tried it).
#create some dummy radial data and wrap it in a dataframe
d1 <- runif(100, min = 0, max = 120)
df <- data.frame(d1 = d1)
#estimate kernel density and then derive an approximate function to attach density values to the radial values in the dataframe
data_density <- density(d1)
density_function <- with(data_density, approxfun(x, y, rule=1))
df$density <- density_function(df$d1)
#order dataframe to facilitate geom_line in polar coordinates
df <- df[order(df$density,df$d1),]
#ggplot object
require(ggplot2)
g = ggplot(df,aes(x=d1,y=density))
#Radial observations on unit circle
g = g + geom_point(aes(x=d1,y=min(df$density)))
#Density function
g = g + geom_line()
g = g + ylim(0,max(df$density))
g = g + xlim(0,360)
#polar coordinates
g = g + coord_polar()
g
Uniform random variables sampled from (0,120):
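A minimal sketch of adding such an arrow (e.g. for the mean direction), using annotate("segment", ...) with arrow() from the grid package; mean.dir is a placeholder for your mean direction in the same 0-360 degree units as d1:
library(grid)  # for arrow() and unit()
mean.dir <- mean(df$d1)  # placeholder; substitute your circular mean in degrees
g + annotate("segment", x = mean.dir, y = 0, xend = mean.dir, yend = max(df$density),
             arrow = arrow(length = unit(0.2, "cm")))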

Plot decision boundaries with ggplot2?

How do I plot the equivalent of contour (base R) with ggplot2? Below is an example with linear discriminant function analysis:
require(MASS)
iris.lda<-lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
datPred<-data.frame(Species=predict(iris.lda)$class,predict(iris.lda)$x) #create data.frame
#Base R plot
eqscplot(datPred[,2],datPred[,3],pch=as.double(datPred[,1]),col=as.double(datPred[,1])+1)
#Create decision boundaries
iris.lda2 <- lda(datPred[,2:3], datPred[,1])
x <- seq(min(datPred[,2]), max(datPred[,2]), length.out=30)
y <- seq(min(datPred[,3]), max(datPred[,3]), length.out=30)
Xcon <- matrix(c(rep(x, length(y)),
                 rep(y, rep(length(x), length(y)))), , 2) # all possible pairs of x and y on a grid
iris.pr1 <- predict(iris.lda2, Xcon)$post[, c("setosa", "versicolor")] %*% c(1, 1) # summed posterior probability of belonging to setosa or versicolor
contour(x, y, matrix(iris.pr1, length(x), length(y)),
        levels = 0.5, add = TRUE, lty = 3, method = "simple") # plot contour lines in the base R plot
iris.pr2 <- predict(iris.lda2, Xcon)$post[, c("virginica", "setosa")] %*% c(1, 1)
contour(x, y, matrix(iris.pr2, length(x), length(y)),
        levels = 0.5, add = TRUE, lty = 3, method = "simple")
# Equivalent plot with ggplot2 but without decision boundaries
ggplot(datPred, aes(x = LD1, y = LD2, col = Species)) +
  geom_point(size = 3, aes(pch = Species))
It is not possible to use a matrix directly when plotting contour lines with ggplot. The matrix can be rearranged into a data frame using melt (from the reshape2 package). In the data frame below, the probability values from iris.pr1 are in the first column, with the x and y coordinates in the following two columns. The x and y coordinates form a grid of 30 x 30 points.
df <- transform(melt(matrix(iris.pr1, length(x), length(y))), x=x[X1], y=y[X2])[,-c(1,2)]
I would like to plot the coordinates (preferably connected by a smoothed curve) where the posterior probabilities are 0.5 (i.e. the decision boundaries).
You can use geom_contour in ggplot to achieve a similar effect. As you correctly assumed, you do have to transform your data. I ended up just doing
pr <- data.frame(x = rep(x, length(y)), y = rep(y, each = length(x)),
                 z1 = as.vector(iris.pr1), z2 = as.vector(iris.pr2))
Then you can pass that data.frame to geom_contour and specify that you want the breaks at 0.5:
ggplot(datPred, aes(x = LD1, y = LD2)) +
  geom_point(size = 3, aes(pch = Species, col = Species)) +
  geom_contour(data = pr, aes(x = x, y = y, z = z1), breaks = c(0, .5)) +
  geom_contour(data = pr, aes(x = x, y = y, z = z2), breaks = c(0, .5))
and that gives
The partimat function in the klaR package does what you want for the observed predictors. If you want the same for the LDA projections, build a data frame augmenting the original with the LD1...LDk projections, then call partimat with the formula Group ~ LD1 + ... + LDk and method = "lda"; you then see the "LD plane" you intended, nicely partitioned for you. This seemed easier to me, at least to explain to students newer to R, since I'm just reusing a function already provided, albeit in a way in which it wasn't quite intended.
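A minimal sketch of that approach, reusing the datPred data frame built in the question (partimat is in the klaR package):
library(klaR)
# partition plot in the LD1/LD2 plane; partimat refits an LDA on the projections
partimat(Species ~ LD1 + LD2, data = datPred, method = "lda")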
