Scatter plot matrices using pairs() in R - r

I'm new to R and working on some code that outputs a scatter plot matrix. The data frame is in the following format:
A B C D
2 3 0 5
8 9 5 4
0 0 5 3
7 0 0 0
My data sets can run into the 100-1000s of rows and 10-100s of columns, with a wide scale of values (hence log transforming my data).
This bit of code gives me some partial success in enhancing the basic plot (see embedded image):
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1), xlog = FALSE, ylog = FALSE)
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
# Add regression line to plots.
my_line <- function(x,y,...){
points(x,y,...)
LR <- lm(log(x) ~ log(y), data = SP)
abline(LR, col = "red", untf = TRUE)
}
# Plot scatter plot matrices.
pairs(mydataframe, pch = 20, main = "test",
cex = 0.125, cex.labels = 1,
xlim = c(100, 1e9),
ylim = c(100, 1e9),
upper.panel = panel.cor,
lower.panel = my_line,
log = "xy")
example
Problem 1 - instead of R^2 values in the upper panel, I get NAs. How can I correct this?
Problem 2 - I'd like to stop the text size of the R^2 value from being scaled in proportion to the correlation. I know the scaling happens in panel.cor, but I'm not sure which part needs to be removed or adjusted.
Many thanks in advance
EDIT: 08/06/2016
I have found a workaround which also simplifies the code:
panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
# correlation coefficient
r <- cor(x, y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste("r= ", txt, sep = "")
text(0.5, 0.6, txt)
}
# add regression line to plots.
my_line <- function(x,y,...)
{
points(x,y,...)
LR <- lm(y ~ x) # Regress y on x so that abline() draws the fit for this panel.
abline(LR, col = "red", untf = TRUE)
}
# Plot scatterplot matrices.
pairs(SP, pch = 20, main = "test",
cex = 0.125, cex.labels = 1,
upper.panel = panel.cor,
lower.panel = my_line)
example 2
The issue appears to be the missing values, i.e. the 0s. I initially change these to NAs so that I can use a log scale; combined with the log transformation, this leads to missing R^2 values in the upper panel.
Ideally I'd like to have a log scale. Is there a way I can do this without introducing the aforementioned issue?
Clarification - I'd like a log (xy) scale in the scatter plots (lower panel) and a log x-axis in the histograms (diagonal panel). I've been playing about with it today but can't quite get it as I want. Perhaps I'm asking too much from pairs. Any help would be appreciated.
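One way to keep the log axes and still get numbers in the upper panel, sketched under some assumptions: drop the non-finite pairs before calling cor(), and print the text at a constant size instead of scaling it by r. The names cor_label, panel.cor.fixed and toy are made up for illustration. The text is positioned with grconvertX()/grconvertY() rather than by resetting par("usr"), which is a different technique from the one in the code above.

```r
# Hypothetical helper: correlation label for one cell, ignoring NA/NaN/Inf pairs.
cor_label <- function(x, y, digits = 2) {
  ok <- is.finite(x) & is.finite(y)
  r <- cor(x[ok], y[ok])
  paste0("r = ", format(r, digits = digits))
}

# Hypothetical upper panel: fixed text size, centred in the cell.
panel.cor.fixed <- function(x, y, digits = 2, cex.cor = 1.25, ...) {
  # "npc" coordinates run from 0 to 1 across the plot region;
  # converting to "user" also works when the axes are logarithmic.
  cx <- grconvertX(0.5, from = "npc", to = "user")
  cy <- grconvertY(0.5, from = "npc", to = "user")
  text(cx, cy, cor_label(x, y, digits), cex = cex.cor)
}

# Toy data (an assumption): positive values, zeros already recoded as NA.
toy <- data.frame(A = c(2, 8, NA, 7, 5, 3),
                  B = c(3, 9, 4, NA, 6, 2),
                  C = c(1, 5, 5, 2, NA, 9))
pairs(toy, log = "xy", upper.panel = panel.cor.fixed)
```

Calling cor(x, y, use = "complete.obs") would handle the NAs as well; the explicit is.finite() filter is used here so that infinite values are dropped too.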
Edit: 10/06/2016
Success!....well approximately 99% happy.
I have made changes - added histograms to diagonal panel and p-value to upper panel (the base code in "pairs()" for adding the histogram needed adjustment due to the log scale used on the x-axis). Please feel free to correct my descriptions if they're not accurate or correct:
library(lattice)
DF <- read.csv("File location", header = TRUE)
DF.1 <- DF+1 # Add 1 to every value so that zeros do not break the log-scale plots (log(0) is -Inf).
# Function to calculate the correlation coefficient (r) & p-value for upper panels in pairs() - scatterplot matrices.
panel.cor <- function(x, y, digits = 3, cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1), xlog = FALSE, ylog = FALSE) # xlog/ylog: ensures that r and p-values display in upper panel.
# Calculate correlation coefficient and add to upper panel.
r <- cor(x, y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste("r= ", txt, sep = "")
text(0.5, 0.7, txt, cex = 1.25) # First 2 arguments determine position of r value in upper panel cells.
# Calculate p-value and add to upper panel.
p <- cor.test(x, y)$p.value
txt2 <- format(c(p, 0.123456789), digits = digits)[1]
txt2 <- paste("p= ", txt2, sep = "")
if(p<0.01) txt2 <- paste("p= ", "<0.01", sep = "")
text(0.5, 0.3, txt2, cex = 1.25) # First 2 arguments determine position of p-value in upper panel cells.
}
# Function to calculate frequency distribution and plot histogram in diagonal plot.
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0.5, 1.5, 0, 1.75), xlog = TRUE, ylog = FALSE) # xlog argument allows log x-axis when called in pairs.
h <- hist(log(x), plot = FALSE, breaks = 20)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan")
}
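# Add regression line to plots.
my_line <- function(x,y, ...)
{
points(x,y,...)
# Fit y on x in log10 space; the user coordinates of log axes are in log10 units.
LR <- lm(log10(y) ~ log10(x))
abline(LR, col = "red", untf = FALSE) # untf = FALSE draws a straight line on the log-log axes.
}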
# Plot scatterplot matrices.
pairs(DF.1, pch = 20, main = "Chart Title",
cex = 0.75, cex.labels = 1.5, label.pos = 0.0001,
upper.panel = panel.cor,
lower.panel = my_line,
diag.panel = panel.hist,
log = ("xy"),
xlim = c(5, 1e9),
ylim = c(5, 1e9))
The fly in the ointment:
1 - the text labels in the diagonal panel only partially appear. I used a decreasing value for the "label.pos" argument in "pairs()", which moved the labels down until they appeared. However, they won't move any further no matter how much I decrease that value. I've tried to coerce the position from the histogram function, but that doesn't work. I hope someone can see what I'm missing. Thanks in advance...I've not had any responses yet:(
PS: I tried to link 3rd image with my successful plot but I was foiled by my lack of reputation...groan.
EDIT: 13/06/2016
Solved! I feel a bit foolish. The fix for the positioning of the main title in the diagonal panel was super simple and I spent a long time trying much more complex ways to do this. The "label.pos" argument in pairs should be negative! I used a small value of -0.0675 which placed it near the top of the cell containing the histogram.
I hope someone else finds this useful. I'll mark as solved but I'd appreciate any comments regarding my code commenting or if someone sees a way of making the code more efficient. Thanks Alex

Sometimes I feel totally dense. Answer my own question...who would have thought...slaps head. Please see edits in my post for the fixes I found.

Related

Simple way for adding multiple R^2-values in a plotmatrix

So I have a simple multi-plot/plotmatrix of the following form:
DATA_SE <- read.table("DEWRATES_SE_15-17.txt", sep = "\t", dec = ".", header = T)
multiplot_SE <- pairs(~SE_21+SE_25+SE_26, data = DATA_SE, main = "Tauraten_Selhausen")
multiplot_SE
Is there any way to add r-squared values (for a simple lm model) to each one of my plots?
Thanks!
Update:
Is there a way to set a fixed limit for the x- and y-axis of my plot panels?
I just need to set them all to the same value (for both x and y)!
You can do something like this (since you don't provide sample data I'm using the iris dataset to demonstrate):
panel.rsquared <- function(x, y) {
fit <- lm(y ~ x)
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
text(0.5, 0.5, sprintf(
"R squared = %4.3f\n Adj. R squared = %4.3f",
summary(fit)$r.squared,
summary(fit)$adj.r.squared))
}
pairs(iris[, -ncol(iris)], upper.panel = panel.rsquared)
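For a simple linear regression with a single predictor, the R squared reported by summary(lm()) equals the squared Pearson correlation, so cor(x, y)^2 would give the same number without fitting a model. A quick check on iris (the choice of columns is arbitrary, and the variable names are only illustrative):

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

r_squared_lm  <- summary(lm(y ~ x))$r.squared  # from the fitted model
r_squared_cor <- cor(x, y)^2                   # squared Pearson correlation

all.equal(r_squared_lm, r_squared_cor)
```

The adjusted R squared has no such shortcut through cor() alone, so the lm() route is still needed if you want both values.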
Update
In response to your comment, you can define any upper/lower panel function to meet your needs.
For example, you could do something like I'm showing below. Mind you, this is not very useful, as it will be difficult (impossible) to avoid overlapping text and points. That's the whole idea (and strength) of pairs when configuring the upper panel to show annotation/text and the lower panel to show the plots. That way you avoid redundancies (in your original post plots are repeated and are therefore redundant).
Anyway, for what it's worth:
panel.plot_withrsquared <- function(x, y) {
points(x, y)
fit <- lm(y ~ x)
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
text(0.1, 0.8,
sprintf("R squared = %4.3f",summary(fit)$r.squared),
adj = 0, cex = 0.8)
}
pairs(
iris[, -ncol(iris)],
upper.panel = panel.rsquared,
lower.panel = panel.plot_withrsquared)
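Regarding the update about fixed axis limits: pairs() passes xlim and ylim on to every panel, so one shared range can be supplied for all of them. A minimal sketch, again on iris since no sample data was given (num and lims are illustrative names):

```r
num  <- iris[, 1:4]
lims <- range(as.matrix(num))  # one common range over all numeric columns

# Every panel now uses the same limits on both axes.
pairs(num, xlim = lims, ylim = lims, main = "Shared axis limits")
```

Any fixed pair of numbers, such as c(0, 10), could be used in place of the computed range.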

How to apply big data on this p-value corrgram?

I am studying Didzis' p-value corrgram with different input data examples. With his data, a significant p-value (p < 0.05) goes together with an almost perfect curve fit, which I find strange; see Fig. 1-3.
Fig. 1 Output of the "extreme" input data #1,
Fig. 2 Output with minimum input data #2,
Fig. 3 Output with Didzis' input data #3,
Statistical inspection.
Fig. 1: p-values are very high when r is small,
Fig. 2: p-values are very high, but the confidence intervals must be wide; I am not sure whether drawing a graph there is appropriate,
Fig. 3: very low p-values when the curve fit is almost perfect - this observation can be confusing
Input data test cases
Real-life data example #1 as an "extreme" example; its output is in Fig. 1
## 1 To make a list of lists
set.seed(24)
A=541650
m1 <- matrix(1:A, ncol=4, nrow=A)
str(m1)
a=360; b=1505; c=4;
m2 <- array(`length<-`(m1, a*b*c), dim = c(a,b,c))
res <- lapply(seq(dim(m2)[3]), function(i) cor(m2[,,i]))
str(res)
res <- lapply(res, function(x) eigen(replace(x, is.na(x), 0))$vectors[,1:1])
str(res)
Minimum example #2 and its application output in Fig. 2
A <- 1505
res <- list(rnorm(A), rnorm(rnorm(A)), rnorm(rnorm(rnorm(A))), rnorm(rnorm(rnorm(rnorm(A)))))
str(res)
The standard input example #3 is the USJudgeRatings data that Didzis used; its output is in Fig. 3
res <- USJudgeRatings[,c(2:3,6,1,7)]
To make the p-value corrgram
## 2 Didzis https://stackoverflow.com/a/15271627/54964
panel.cor <- function(x, y, digits=2, cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
test <- cor.test(x,y)
Signif <- ifelse(round(test$p.value,3)<0.001,"p<0.001",paste("p=",round(test$p.value,3)))
text(0.5, 0.25, paste("r=",txt))
text(.5, .75, Signif)
}
panel.smooth<-function (x, y, col = "blue", bg = NA, pch = 18,
cex = 0.8, col.smooth = "red", span = 2/3, iter = 3, ...)
{
points(x, y, pch = pch, col = col, bg = bg, cex = cex)
ok <- is.finite(x) & is.finite(y)
if (any(ok))
lines(stats::lowess(x[ok], y[ok], f = span, iter = iter),
col = col.smooth, ...)
}
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
data <- res
str(data)
pairs(data,
lower.panel=panel.smooth, upper.panel=panel.cor,diag.panel=panel.hist)
About significant upperbound
The source says that a study which is not statistically significant with 15K points may become significant with 2-3M points.
My observation is that it becomes significant with 6-7M points with my data sample and study (data 541650 541650 6925867).
So I think that, in theory, there is no problem in plotting such big data sets in Didzis' p-value corrgram.
His algorithms are possibly making some simplifications, or causing the points to cluster, such that many panels look like an increasing diagonal or a y=0 line.
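The dependence of significance on sample size can be seen directly from the test statistic that cor.test() uses for Pearson's r: t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sketch below keeps r fixed and only changes n; p_from_r is a made-up helper name and r = 0.002 is an arbitrary illustrative value, not something taken from the data above.

```r
# Two-sided p-value of the Pearson correlation test, from r and n alone.
p_from_r <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

sizes <- c(100, 15000, 2.5e6)
data.frame(n = sizes, p = p_from_r(0.002, sizes))
```

With the same tiny correlation, the p-value stays far above 0.05 for the two smaller sample sizes and falls below it for the largest one, which is why a low p-value on a very large data set says little about how strong the relationship is.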
OS: Debian 8.5
R: 3.3.1

clusplot - showing variables

I would like to add to a clusplot plot the variables used for pca as arrows. I am not sure that a way has been implemented (I can't find anything in the documentation).
I have produced a clusplot that looks like this:
With princomp() I can independently plot the observations in an analogous space of representation, with the variables (columns) as arrows:
Is there a way to do the two things at the same time, by showing the clusters and the variables of pca on the same diagram?
I wanted to do the same thing as OP today and ended up putting pieces from clusplot and biplot together. This is the result, which may be useful if you want to do the same thing:
clusplot2 <- function(dat, clustering, ...) {
clusplot(dat, clustering, ...)
## this is from clusplot.default
pca <- princomp(dat, scores = TRUE, cor = (ncol(dat) != 2))
## this is (adapted) from biplot.princomp
directions <- t(t(pca$loadings[, 1:2]) * pca$sdev[1:2]) * sqrt(pca$n.obs)
## all below is (adapted) from biplot.default
unsigned.range <- function(x) c(-abs(min(x, na.rm = TRUE)),
abs(max(x, na.rm = TRUE)))
x <- predict(pca)[, 1:2]
y <- directions
rangx1 <- unsigned.range(x[, 1L])
rangx2 <- unsigned.range(x[, 2L])
rangy1 <- unsigned.range(y[, 1L])
rangy2 <- unsigned.range(y[, 2L])
xlim <- ylim <- rangx1 <- rangx2 <- range(rangx1, rangx2)
ratio <- max(rangy1/rangx1, rangy2/rangx2)
par(new = T)
col <- par("col")
if (!is.numeric(col))
col <- match(col, palette(), nomatch = 1L)
col <- c(col, col + 1L)
cex <- rep(par("cex"), 2)
plot(y, axes = FALSE, type = "n", xlim = xlim * ratio, ylim = ylim *
ratio, xlab = "", ylab = "", col = col[1L])
axis(3, col = col[2L])
axis(4, col = col[2L])
box(col = col[1L])
text(y, labels = names(dat), cex = cex[2L], col = col[2L])
arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
length = 0.1)
}
############################################################
library(cluster)
dat <- iris[, 1:4]
clus <- pam(dat, k = 3)
clusplot2(dat, clus$clustering, main = "Test")
Of course there is much room for improvement (as this is just copied together) but I think anyone can easily adapt it if needed.
If you wonder why the arrows (loadings * sdev) are scaled with 0.8 * sqrt(n): I have absolutely no idea. I would have plotted loadings * sdev which should resemble the correlation between the principal components and the variables but this is how biplot does it.
Anyway, this should produce the same arrows as biplot.princomp and use the same pca as clusplot which was the primary goal for me.
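As a side note on the unscaled alternative mentioned above: when the PCA is run on the correlation matrix (cor = TRUE, which is what the ncol(dat) != 2 condition gives for the four iris columns), loadings * sdev is exactly the correlation between each variable and each component. A small check, with arrows_cor as an illustrative name:

```r
dat <- iris[, 1:4]
pca <- princomp(dat, scores = TRUE, cor = TRUE)

# Multiply each column of the loadings by the matching component sdev.
arrows_cor <- sweep(unclass(pca$loadings), 2, pca$sdev, `*`)

# Correlations between the original variables and the component scores.
var_pc_cor <- cor(dat, pca$scores)

all.equal(arrows_cor, var_pc_cor, check.attributes = FALSE)
```

Arrows drawn from arrows_cor[, 1:2] would therefore all fit inside the unit circle, whereas the biplot-style scaling in clusplot2 stretches them by an extra factor.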

Adding line of identity to correlation plots using pairs() command in R

Similar to a previous post, I'd like to modify the following code (from the example in the R documentation for the pairs() command):
## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(USJudgeRatings, lower.panel = panel.smooth, upper.panel = panel.cor)
Instead of a loess line, I want a line of identity for each plot. The secret lies in the panel.smooth function, but I don't know how to modify it.
I think you just mean something like this:
my_line <- function(x,y,...){
points(x,y,...)
abline(a = 0,b = 1,...)
}
pairs(USJudgeRatings, lower.panel = my_line, upper.panel = panel.cor)
Or, if you want to plot a fitted, linear line, then you could modify joran's answer:
my_line <- function(x,y,...){
points(x,y,...)
abline(lm(y ~ x), ...) # abline() takes the intercept and slope from the fitted model
}
If you are indeed using pairs, however, loess seems more appropriate: you are likely still exploring the dataset, and fitting a straight line would be premature at that point.

Prevent cex from scaling with the correlation coefficient in chart.Correlation()

I'm trying to plot a huge matrix of correlation coefficients, and currently, my plot looks like this:
Notice that some cells are missing correlation coefficients (ignore for now the lack of symmetry of the plot, unless you happen to know why that's the case, too). I believe that the values are not, in fact, missing, but simply too small to appear, because they are scaled by the value of their correlation coefficient.
Looking at the documentation for chart.Correlation(), I was able to find a function from which much of the content of chart.Correlation() was modeled:
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
test <- cor.test(x,y)
# borrowed from printCoefmat
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(.8, .8, Signif, cex=cex, col=2)
}
pairs(USJudgeRatings[,c(2:3,6,1,7)], lower.panel=panel.smooth, upper.panel=panel.cor)
If I change:
text(0.5, 0.5, txt, cex = cex * r)
To:
text(0.5, 0.5, txt, cex = 0.8)
I get roughly the effect I'm looking for. The problem is that I have no idea how to change this parameter using chart.Correlation() itself. Does this issue make sense?
Here's a hack that modifies the function to allow what you want. This adds a user-settable exponent argument to the function that allows you to change the cex value to cex*[correlation]^cex.cor.scale.
It spits out warnings because of the way ... is handled; they're annoying but harmless.
It would probably be best to contact the maintainer and ask them if they'd be willing to enhance the function, or start creating your own modified version of the package.
edit: slightly more robust changes to the relevant lines
library("PerformanceAnalytics")
## turn the function into a character string
tmpstr <- deparse(chart.Correlation)
## modify the relevant lines
panelcorline <- grep("^ *panel.cor",tmpstr)
tmpstr[panelcorline] <- paste(tmpstr[panelcorline],"cex.cor.scale=1,")
rscaleline <- grep("^ *text\\(0.5",tmpstr)
tmpstr[rscaleline] <- gsub("cex \\* r","cex*r^cex.cor.scale",tmpstr[rscaleline])
## convert back to a function (don't mask the original function)
my.chart.Correlation <- eval(parse(text=tmpstr))
Test it out:
data(managers)
chart.Correlation(managers[,1:8], histogram=TRUE, pch="+")
## no scaling
my.chart.Correlation(managers[,1:8], histogram=TRUE, pch="+",cex.cor.scale=0)
## enhanced scaling
my.chart.Correlation(managers[,1:8], histogram=TRUE, pch="+",cex.cor.scale=2)
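To see what the exponent does to the text size, here is the arithmetic on its own, outside of any plotting function. The base size of 2 and the three correlations are assumed values chosen only for illustration:

```r
base_cex <- 2                 # assumed base text size
r <- c(0.05, 0.5, 0.95)       # example absolute correlations

# One column per exponent: 0 = no scaling, 1 = original behaviour, 2 = enhanced.
scaled <- sapply(c(none = 0, default = 1, enhanced = 2),
                 function(s) base_cex * r^s)
rownames(scaled) <- paste0("r=", r)
scaled
```

With an exponent of 0 every label gets the base size, because r^0 is 1. Larger exponents shrink the labels of weak correlations even more than the default does.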

Resources