I would like to add to a clusplot plot the variables used for pca as arrows. I am not sure that a way has been implemented (I can't find anything in the documentation).
I have produced a clusplot that looks like this:
With the princomp function I can independently plot the observations in an analogous representation space, with the variables (columns) shown as arrows:
Is there a way to do the two things at the same time, by showing the clusters and the variables of pca on the same diagram?
I wanted to do the same thing as the OP today and ended up putting pieces of clusplot and biplot together. Here is the result, which may be useful if you want to do the same thing:
clusplot2 <- function(dat, clustering, ...) {
clusplot(dat, clustering, ...)
## this is from clusplot.default
pca <- princomp(dat, scores = TRUE, cor = (ncol(dat) != 2))
## this is (adapted) from biplot.princomp
directions <- t(t(pca$loadings[, 1:2]) * pca$sdev[1:2]) * sqrt(pca$n.obs)
## all below is (adapted) from biplot.default
unsigned.range <- function(x) c(-abs(min(x, na.rm = TRUE)),
abs(max(x, na.rm = TRUE)))
x <- predict(pca)[, 1:2]
y <- directions
rangx1 <- unsigned.range(x[, 1L])
rangx2 <- unsigned.range(x[, 2L])
rangy1 <- unsigned.range(y[, 1L])
rangy2 <- unsigned.range(y[, 2L])
xlim <- ylim <- rangx1 <- rangx2 <- range(rangx1, rangx2)
ratio <- max(rangy1/rangx1, rangy2/rangx2)
par(new = TRUE)
col <- par("col")
if (!is.numeric(col))
col <- match(col, palette(), nomatch = 1L)
col <- c(col, col + 1L)
cex <- rep(par("cex"), 2)
plot(y, axes = FALSE, type = "n", xlim = xlim * ratio, ylim = ylim *
ratio, xlab = "", ylab = "", col = col[1L])
axis(3, col = col[2L])
axis(4, col = col[2L])
box(col = col[1L])
text(y, labels = names(dat), cex = cex[2L], col = col[2L])
arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
length = 0.1)
}
############################################################
library(cluster)
dat <- iris[, 1:4]
clus <- pam(dat, k = 3)
clusplot2(dat, clus$clustering, main = "Test")
Of course there is much room for improvement (as this is just copied together) but I think anyone can easily adapt it if needed.
If you wonder why the arrows (loadings * sdev) are scaled with 0.8 * sqrt(n): I have absolutely no idea. I would have plotted loadings * sdev which should resemble the correlation between the principal components and the variables but this is how biplot does it.
Anyway, this should produce the same arrows as biplot.princomp and use the same pca as clusplot which was the primary goal for me.
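On the remark above that loadings * sdev should resemble the correlation between the variables and the principal components: here is a minimal sketch that checks this numerically on iris. It assumes the PCA is run on the correlation matrix (cor = TRUE, which is what the call above uses when there are more than two columns). The names scaled_loadings and var_pc_cor are only illustrative.

```r
# Sketch: compare loadings * sdev with the variable/component correlations.
dat <- iris[, 1:4]
pca <- princomp(dat, scores = TRUE, cor = TRUE)

# Scale each loading column by the standard deviation of its component.
scaled_loadings <- t(t(pca$loadings[, 1:2]) * pca$sdev[1:2])

# Correlation of every original variable with the first two component scores.
var_pc_cor <- cor(dat, pca$scores[, 1:2])

# TRUE when the two matrices agree up to numerical tolerance.
all.equal(scaled_loadings, var_pc_cor, check.attributes = FALSE)
```

If the comparison holds, the arrows drawn by clusplot2 are these correlations multiplied by a constant factor, so their directions and relative lengths are unchanged.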
The type of plot I am trying to achieve in R seems to be known as a moving distribution plot, a joy plot, or a ridgeline plot:
There is already a question in Stackoverflow whose recorded answer explains how to do it using ggplot: How to reproduce this moving distribution plot with R?
However, for learning purposes, I am trying to achieve the same using only base R plots (no lattice, no ggplot, no any plotting package).
In order to get started, I generated the following fake data to play with:
set.seed(2020)
shapes <- c(0.1, 0.5, 1, 2, 4, 5, 6)
dat <- lapply(shapes, function(x) rbeta(1000, x, x))
names(dat) <- letters[1:length(shapes)]
Then using mfrow I can achieve this:
par(mfrow=c(length(shapes), 1))
par(mar=c(1, 5, 1, 1))
for(i in 1:length(shapes))
{
values <- density(dat[[names(dat)[i]]])
plot(NA,
xlim=c(min(values$x), max(values$x)),
ylim=c(min(values$y), max(values$y)),
axes=FALSE,
main="",
xlab="",
ylab=letters[i])
polygon(values, col="light blue")
}
The result I get is:
Clearly, using mfrow (or even layout) here is not flexible enough, and it also does not allow for overlaps between the distributions.
Then, the question: how can I reproduce that type of plot using only base R plotting functions?
Here's a base R solution. First, we calculate all the density values and then manually offset them along the y axis:
vals <- Map(function(x, g, i) {
with(density(x), data.frame(x,y=y+(i-1), g))
}, dat, names(dat), seq_along(dat))
Then, to plot, we calculate the overall range, draw an empty plot, and then draw the densities (in reverse order so they stack):
xrange <- range(unlist(lapply(vals, function(d) range(d$x))))
yrange <- range(unlist(lapply(vals, function(d) range(d$y))))
plot(0,0, type="n", xlim=xrange, ylim=yrange, yaxt="n", ylab="", xlab="Value")
for(d in rev(vals)) {
with(d, polygon(x, y, col="light blue"))
}
axis(2, at=seq_along(dat)-1, names(dat))
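If the amount of overlap needs to be adjustable, the same idea can be wrapped in a small function. This is only a sketch: the function name ridgeline and its scale argument are hypothetical. It assumes that rescaling every density by the tallest peak is acceptable, so that scale = 1 means the tallest ridge just reaches the next baseline and larger values make the ridges overlap.

```r
# Fake data as in the question.
set.seed(2020)
shapes <- c(0.1, 0.5, 1, 2, 4, 5, 6)
dat <- lapply(shapes, function(x) rbeta(1000, x, x))
names(dat) <- letters[seq_along(shapes)]

# Hypothetical helper: one ridge per list element, baselines at 0, 1, 2, ...
ridgeline <- function(samples, scale = 1.5, col = "light blue") {
  dens <- lapply(samples, density)
  # Tallest peak over all groups, used to put the ridges on a common height scale.
  peak <- max(vapply(dens, function(d) max(d$y), numeric(1)))
  n <- length(dens)
  xrange <- range(unlist(lapply(dens, function(d) range(d$x))))
  plot(0, 0, type = "n", xlim = xrange, ylim = c(0, n - 1 + scale),
       yaxt = "n", xlab = "Value", ylab = "")
  # Draw from top to bottom so that lower ridges are painted over upper ones.
  for (i in rev(seq_len(n))) {
    d <- dens[[i]]
    polygon(d$x, (i - 1) + scale * d$y / peak, col = col)
  }
  axis(2, at = seq_len(n) - 1, labels = names(samples))
  invisible(dens)
}

res <- ridgeline(dat, scale = 2)
```

The densities are returned invisibly so they can be reused, for example to add rug marks or group means on top of the ridges.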
d = lapply(dat, function(x){
tmp = density(x)
data.frame(x = tmp$x, y = tmp$y)
})
d = lapply(seq_along(d), function(i){
tmp = d[[i]]
tmp$grp = names(d)[i]
tmp
})
d = do.call(rbind, d)
grp = unique(d$grp)
n = length(grp)
spcx = 5
spcy = 3
rx = range(d$x)
ry = range(d$y)
rx[2] = rx[2] + n/spcx
ry[2] = ry[2] + n/spcy
graphics.off()
plot(1, type = "n", xlim = rx, ylim = ry, axes = FALSE, ann = FALSE)
lapply(seq_along(grp), function(i){
x = grp[i]
abline(h = (n - i)/spcy, col = "grey")
axis(2, at = (n - i)/spcy, labels = grp[i])
polygon(d$x[d$grp == x] + (n - i)/spcx,
d$y[d$grp == x] + (n - i)/spcy,
col = rgb(0.5, 0.5, 0.5, 0.5))
})
I'm new to R and working on some code that outputs a scatter plot matrix. The data frame is in the following format:
A B C D
2 3 0 5
8 9 5 4
0 0 5 3
7 0 0 0
My data sets can run into the 100-1000s of rows and 10-100s of columns, with a wide scale of values (hence log transforming my data).
This bit of code gives me some partial success in enhancing the basic plot (see embedded image):
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1), xlog = FALSE, ylog = FALSE)
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
# Add regression line to plots.
my_line <- function(x,y,...){
points(x,y,...)
LR <- lm(log(x) ~ log(y), data = SP)
abline(LR, col = "red", untf = TRUE)
}
# Plot scatter plot matrices.
pairs(mydataframe, pch = 20, main = "test",
cex = 0.125, cex.labels = 1,
xlim = c(100, 1e9),
ylim = c(100, 1e9),
upper.panel = panel.cor,
lower.panel = my_line,
log = "xy")
Problem 1 - instead of getting R^2 values in the upper panel, I get NAs instead. How can I correct this?
Problem 2 - I'd like to remove the function for adjusting text size of R^2 value in proportion to correlation. I know it's in panel.cor but not sure which part will need removal or adjustment.
Many thanks in advance
EDIT: 08/06/2016
I have found a work around which also simplifies the code:
panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
# correlation coefficient
r <- cor(x, y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste("r= ", txt, sep = "")
text(0.5, 0.6, txt)
}
# add regression line to plots.
my_line <- function(x,y,...)
{
points(x,y,...)
LR <- lm(x ~ y, data = SP)
abline(LR, col = "red", untf = TRUE)
}
# Plot scatterplot matrices.
pairs(SP, pch = 20, main = "test",
cex = 0.125, cex.labels = 1,
upper.panel = panel.cor,
lower.panel = my_line)
The issue appears to be missing values i.e. 0's. I change these to NA's initially so I can use a log scale. This in combination with log transformation leads to missing R^2 values in the upper panel.
Ideally I'd like to have a log scale. Is there a way I can do this without introducing the aforementioned issue?
Clarification - I'd like a log (xy) scale in the scatter plots (lower panel) and for the x-axis in the histograms (diagonal panel). I've been playing about with it today but can't quite get it as I want. Perhaps I'm asking too much from pairs. Any help would be appreciated.
Edit: 10/06/2016
Success!....well approximately 99% happy.
I have made changes - added histograms to diagonal panel and p-value to upper panel (the base code in "pairs()" for adding the histogram needed adjustment due to the log scale used on the x-axis). Please feel free to correct my descriptions if they're not accurate or correct:
library(lattice)
DF <- read.csv("File location", header = TRUE)
DF.1 <- DF+1 # Added small epsilon to data frame otherwise plot errors arise due to missing values.
# Function to calculate R^2 & p-value for upper panels in pairs() - scatterplot matrices.
panel.cor <- function(x, y, digits = 3, cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1), xlog = FALSE, ylog = FALSE) # xlog/ylog: ensures that R^2 and p-values display in upper panel.
# Calculate correlation coefficient and add to upper panel.
r <- cor(x, y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste("r= ", txt, sep = "")
text(0.5, 0.7, txt, cex = 1.25) # First 2 arguments determine position of r value in upper panel cells.
# Calculate p-value and add to upper panel.
p <- cor.test(x, y)$p.value
txt2 <- format(c(p, 0.123456789), digits = digits)[1]
txt2 <- paste("p= ", txt2, sep = "")
if(p<0.01) txt2 <- paste("p= ", "<0.01", sep = "")
text(0.5, 0.3, txt2, cex = 1.25) # First 2 arguments determine position of p-value in upper panel cells.
}
# Function to calculate frequency distribution and plot histogram in diagonal plot.
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0.5, 1.5, 0, 1.75), xlog = TRUE, ylog = FALSE) # xlog argument allows log x-axis when called in pairs.
h <- hist(log(x), plot = FALSE, breaks = 20)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan")
}
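# Add regression line to plots.
my_line <- function(x, y, ...)
{
points(x, y, ...)
# Regress y on x in log10 space (the panel plots y against x on log axes),
# so that abline() draws a straight line in the log-log coordinate system.
LR <- lm(log10(y) ~ log10(x))
abline(LR, col = "red")
}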
# Plot scatterplot matrices.
pairs(DF.1, pch = 20, main = "Chart Title",
cex = 0.75, cex.labels = 1.5, label.pos = 0.0001,
upper.panel = panel.cor,
lower.panel = my_line,
diag.panel = panel.hist,
log = ("xy"),
xlim = c(5, 1e9),
ylim = c(5, 1e9))
The fly in the ointment:
1 - the text labels in the diagonal panel only partially appear. I used a decreasing value for the "label.pos" argument in "pairs()" which moved the label down until they appeared. However, they won't move anymore no matter how much I decrease that value. I've tried to coerce the position from the histogram function, but that doesn't work. I hope someone can see what I'm missing. Thanks in advance...I've not had any responses yet:(
PS: I tried to link 3rd image with my successful plot but I was foiled by my lack of reputation...groan.
EDIT: 13/06/2016
Solved! I feel a bit foolish. The fix for the positioning of the main title in the diagonal panel was super simple and I spent a long time trying much more complex ways to do this. The "label.pos" argument in pairs should be negative! I used a small value of -0.0675 which placed it near the top of the cell containing the histogram.
I hope someone else finds this useful. I'll mark as solved but I'd appreciate any comments regarding my code commenting or if someone sees a way of making the code more efficient. Thanks Alex
Sometimes I feel totally dense. Answering my own question... who would have thought... slaps head. Please see the edits in my post for the fixes I found.
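The NA values in the upper panel described above can be reproduced outside pairs(): cor() returns NA as soon as either input contains a missing value. Below is a minimal sketch of an alternative to adding 1 to the whole data frame. It assumes it is acceptable to drop missing or non-positive pairs before taking logs. cor_log10 is a hypothetical helper, not part of the original code, and the numbers are made up.

```r
# Hypothetical helper: correlation on the log10 scale, using only pairs
# where both values are present and strictly positive.
cor_log10 <- function(x, y) {
  ok <- is.finite(x) & is.finite(y) & x > 0 & y > 0
  cor(log10(x[ok]), log10(y[ok]))
}

# Made-up values with gaps, standing in for two columns of the data frame.
x <- c(2, 8, NA, 7, 50, 120)
y <- c(3, 9, 5, NA, 40, 300)

cor(x, y)                        # NA: missing values propagate by default
cor(x, y, use = "complete.obs")  # drops incomplete pairs first
cor_log10(x, y)                  # same idea, but on the log10 scale
```

Inside panel.cor, the line r <- cor(x, y) could be replaced by a call to such a helper if the correlation should match the log-scaled scatter plots.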
Attempting to code a function that returns a bubble chart from aggregated data.
I'm passing it a column of a data.frame in "agg".
aggs2 <- function(agg, deporur=0, all=TRUE){
##create aggregate from library data
agg1 <- aggregate(agg, by=list(NoNA$IMD_NATIONAL_QUINTILE, NoNA$UR),
FUN=function(x) c(mn=mean(x), n=length(x)))
##bind into a dataframe
agg1 <- cbind(agg1[,1:2], agg1[,3])
##add column holding values of Deprivation Quantile and Urban/Rural status
agg1$NewCol <- do.call(paste, c("Deprivation Quantile", agg1[c("Group.1", "Group.2")],
sep = " "))
##set column names
colnames(agg1) <- c("Deprivation", "Urban and Rural", "Mean", "Count", "DepUR")
##remove categories with low counts
if(all==FALSE){
agg1 <- subset(agg1, agg1$Count > 9)
}
##order data.frame by mean
agg1 <- agg1[order(agg1$Mean, decreasing=TRUE),]
##create bubble chart
if(deporur==1){
radius3 <- sqrt(agg1$Count/pi)
symbols(factor(agg1$DepUR), agg1$Mean, circles=radius3, inches=0.35,
xlim=c(0,10.0), ylim=c(min(agg1$Mean-0.25),10.0), fg="white", bg="purple",
xlab="Deprivation Quantile and Urban/Rural Status", ylab="Mean Response")
text(factor(agg1$DepUR), agg1$Mean-.1, agg1$DepUR, cex=0.7)
}
#return ordered dataframe
agg1
}
This returns a sorted data.frame by mean, and the following chart:
Because this function will need to create graphs from a variety of different documents and columns, I would like to code it so that the labels do not overlap the bubbles, or other labels.
I have looked at the directlabels library, but I have been unable to work out how to code it properly.
Would greatly appreciate any assistance.
I'm not aware of any solution for non-overlapping labels with regards to other labels AND other circles. Nevertheless, wordcloud::textplot might be a starting point:
library(wordcloud)
set.seed(8)
df <- data.frame(x = runif(10), y = runif(10), size = sample(10:20, 10), lab = paste0("label", 1:10))
par(mfrow = c(1,2))
with(df, {
plot(x, y, cex = size, pch = 19, col = adjustcolor("violet", alpha.f = .4), main = "non-overlapping")
textplot(x, y, lab, new = FALSE, show.lines = FALSE, cex = 2)
plot(x, y, cex = size, pch = 19, col = adjustcolor("violet", alpha.f = .4), main = "overlapping")
text(x, y, lab, cex = 2)
})
I want to compare two datasets with same x and y variables. However, not all X variable points are present on both. As a toy example say this is what I have:
position.x <- c(1,2,3)
score.x <- c(450,220,330)
x <- data.frame(position.x, score.x)
position.y <- c(2,3,5)
score.y <- c(333,423,988)
y<- data.frame(position.y,score.y)
par(mfrow = c(2,1))
plot(x, pch = 19)
plot(y, pch = 19)
The x axes are not comparable. I found some posts explaining how to do this in ggplot using facet_wrap, but I would like to do it using base graphics.
Thank you in advance.
You can specify the range of the x and y axes with xlim and ylim:
position.x <- c(1,2,3)
score.x <- c(450,220,330)
x <- data.frame(position.x, score.x)
position.y <- c(2,3,5)
score.y <- c(333,423,988)
y<- data.frame(position.y,score.y)
par(mfrow = c(2,1))
plot(x, pch = 19, xlim=c(1,5))
plot(y, pch = 19, xlim=c(1,5))
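Rather than typing the limits by hand, the shared range can be computed from both datasets. This is a small sketch using the toy data from the question. The name shared_xlim is only illustrative.

```r
# Toy data from the question, one data frame per dataset.
x <- data.frame(position = c(1, 2, 3), score = c(450, 220, 330))
y <- data.frame(position = c(2, 3, 5), score = c(333, 423, 988))

# One x range that covers the positions of both datasets.
shared_xlim <- range(x$position, y$position)

op <- par(mfrow = c(2, 1))
plot(x, pch = 19, xlim = shared_xlim)
plot(y, pch = 19, xlim = shared_xlim)
par(op)  # restore the previous layout
```

The same approach works for ylim if the score axes should be comparable as well.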
If you are going to repeat this, you might as well write some kind of function (which is one of the benefits of ggplot--it takes care of all the set-up for you):
## data needs to be in a long format
dat <- data.frame(position = c(1,2,3,2,3,5),
score = c(450,220,330,333,423,988),
z = c('x','x','x','y','y','y'))
facet_wrap <- function(data, x, y, z, horiz = TRUE, ...) {
## save current par settings and return after finished
op <- par(no.readonly = TRUE)
on.exit(par(op))
zz <- unique(data[, z])
## sets up the layout to cascade horizontally or vertically
## and sets xlim and ylim appropriately
if (horiz) {
par(mfrow = c(1, length(zz)), ...)
ylim <- range(data[, y])
xlim <- NULL
} else {
par(mfrow = c(length(zz), 1), ...)
xlim <- range(data[, x])
ylim <- NULL
}
## make a subset of data for each unique by variable
## and draw a basic plot for each one
for (ii in zz) {
tmp <- data[data[, z] %in% ii, ]
plot(tmp[, x], tmp[, y], xlim = xlim, ylim = ylim)
}
}
facet_wrap(dat, 'position', 'score', 'z', mar = c(5,4,2,2))
facet_wrap(dat, 'position', 'score', 'z', mar = c(5,4,1,2), horiz = FALSE)
I am trying to visualize a curve for pollination distribution. I am very new to R so please don't be upset by my stupidity.
llim <- 0
ulim <- 6.29
f <- function(x,y) {(.156812/((2*pi)*(.000005^2)*(gamma(2/.156812)))*exp(-((sqrt(x^2+y^2))/.000005)^.156812))}
integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y), llim, ulim)$value
})
}, llim, ulim)
fv <- Vectorize(f)
curve(fv, from=0, to=1000)
And I get:
Error in y^2 : 'y' is missing
I'm not quite sure what you're asking to plot. But I know you want to visualise your scalar function of two arguments.
Here are some approaches. First we define your function.
llim <- 0
ulim <- 6.29
f <- function(x,y) {
(.156812/((2*pi)*(.000005^2)*(gamma(2/.156812)))*exp(-((sqrt(x^2+y^2))/.000005)^.156812))
}
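As for the error in the question: curve() evaluates the function it is given with a single argument, so a two-argument function such as f is called without y. A minimal sketch of this, together with one possible workaround (fixing y at a chosen value), is shown here. The wrapper name f_at_y0 and the choice y = 0 are assumptions made only for illustration.

```r
f <- function(x, y) {
  (.156812/((2*pi)*(.000005^2)*(gamma(2/.156812)))*exp(-((sqrt(x^2+y^2))/.000005)^.156812))
}

# Calling f with one argument fails because y has no default value;
# the error message is captured here instead of stopping the script.
msg <- tryCatch(f(1), error = function(e) conditionMessage(e))
msg

# A one-argument wrapper with y held fixed can be passed to curve().
f_at_y0 <- function(x) f(x, y = 0)
curve(f_at_y0, from = 0, to = 6.29)
```

This only shows a single slice of the surface; the approaches described in this answer look at the whole function.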
From your title I thought of the following. The function intf defined below integrates your function over the square [0,ul] x [0,ul] and returns the value. We then vectorise it and plot the integral over the square as a function of the side length of the square.
intf <- function(ul) {
integrate(function(y) {
sapply(y, function(y) {
integrate(function(x) f(x,y), 0, ul)$value
})
}, 0, ul)$value
}
fv <- Vectorize(intf)
curve(fv, from=0, to=1000)
If f is a distribution, I guess you can make your (somewhat) nice probability interpretation of this curve. (I.e. ~20 % probability of pollination(?) in the 200 by 200 meter square.)
However, you can also draw a contour plot (of the log-transformed values), which illustrates the function we are integrating above:
logf <- function(x, y) log(f(x, y))
x <- y <- seq(llim, ulim, length.out = 100)
contour(x, y, outer(x, y, logf), lwd = 2, drawlabels = FALSE)
You can also plot some profiles of the surface:
plot(1, xlim = c(llim, ulim), ylim = c(0, 0.005), xlab = "x", ylab = "f")
y <- seq(llim, ulim, length.out = 6)
for (i in seq_along(y)) {
tmp <- function(x) f(x, y = y[i])
curve(tmp, llim, ulim, add = TRUE, col = i)
}
legend("topright", lty = 1, col = seq_along(y),
legend = as.expression(paste("y = ",y)))
They need to be modified a bit to make them publication worthy, but you get the idea. Lastly, you can do some 3d plots as others have suggested.
EDIT
As per your comments, you can also do something like this:
# Define the function times radius (this time with general a and b)
# The default of a and b is as before
g <- function(z, a = 5e-6, b = .156812) {
z * (b/(2*pi*a^2*gamma(2/b)))*exp(-(z/a)^b)
}
# A function that integrates g from 0 to Z and rotates
# As g is not dependent on the angle we just multiply by 2pi
intg <- function(Z, ...) {
2*pi*integrate(g, 0, Z, ...)$value
}
# Vectorize the Z argument of intg
gv <- Vectorize(intg, "Z")
# Plot
Z <- seq(0, 1000, length.out = 100)
plot(Z, gv(Z), type = "l", lwd = 2)
lines(Z, gv(Z, a = 5e-5), col = "blue", lwd = 2)
lines(Z, gv(Z, b = .150), col = "red", lwd = 2)
lines(Z, gv(Z, a = 1e-4, b = .2), col = "orange", lwd = 2)
You can then plot the curves for the a and b you want. If either is not specified, the default is used.
Disclaimer: my calculus is rusty and I just did this off the top of my head. You should verify that I've done the rotation of the function around the axis properly.
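One way to check the rotation, assuming the function really is the radially symmetric kernel written above: substituting t = (z/a)^b turns 2*pi times the integral of g from 0 to Z into a regularized incomplete gamma function with shape 2/b, which R exposes as pgamma(). The sketch below compares the two on easy parameters (a = 1, b = 1), where integrate() is well behaved. closed_form is a hypothetical name.

```r
# g and intg as defined above.
g <- function(z, a = 5e-6, b = .156812) {
  z * (b/(2*pi*a^2*gamma(2/b)))*exp(-(z/a)^b)
}
intg <- function(Z, ...) {
  2*pi*integrate(g, 0, Z, ...)$value
}

# Hypothetical closed form: the substitution t = (z/a)^b gives the
# regularized lower incomplete gamma function with shape 2/b.
closed_form <- function(Z, a = 5e-6, b = .156812) {
  pgamma((Z / a)^b, shape = 2 / b)
}

# Compare the numerical and closed-form values on easy parameters.
num <- intg(3, a = 1, b = 1)
cf <- closed_form(3, a = 1, b = 1)
all.equal(num, cf, tolerance = 1e-6)

# With the default a and b, the closed form gives the mass inside radius Z.
closed_form(c(10, 200, 1000))
```

Since pgamma() tends to 1 as its argument grows, this also suggests that the kernel integrates to 1 over the whole plane, which is what one would expect if it is meant to be a probability density.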
The lattice package provides wireframe() for drawing 3-dimensional plots, and base graphics provides persp(). If you prefer not to use a 3d plot, you can create a contour plot using contour().
Note: I don't know if this is intentional, but your data produces a very large spike in one corner of the plot. This produces a plot that is for all intents and purposes flat, with a barely noticeable spike in one corner. This is particularly problematic with the contour plot below.
library(lattice)
x <- seq(0, 1000, length.out = 50)
y <- seq(0, 1000, length.out = 50)
First the wire frame plot:
df <- expand.grid(x=x, y=y)
df$z <- with(df, f(x, y))
wireframe(z ~ x * y, data = df)
Next the perspective plot:
dm <- outer(x, y, FUN=f)
persp(x, y, dm)
The contour plot:
contour(x, y, dm)