How do I interpret the output of corrplot?

The corrplot package provides some neat plots, and its documentation includes examples.
But I don't understand the output. I can see that if you have a matrix A_ij, you can plot it as an arrangement of n by n square tiles, where the color of tile ij corresponds to the value of A_ij. But some examples appear to have more dimensions:
Here we can guess that color shows the correlation coefficient, and that the orientation of the ellipse indicates whether the correlation is negative or positive. But what does the eccentricity encode?
The documentation for method says:
the visualization method of correlation matrix to be used. Currently, it supports seven methods, named "circle" (default), "square", "ellipse", "number", "pie", "shade" and "color". See examples for details.
The areas of circles or squares show the absolute value of corresponding correlation coefficients. Method "pie" and "shade" came from Michael Friendly’s job (with some adjustment about the shade added on), and "ellipse" came from D.J. Murdoch and E.D. Chow’s job, see in section References.
So we know that, for circles and squares, the area shows the absolute value of the coefficient. What about the other dimensions, and the other methods?

There is only one dimension shown by the plot.
Michael Friendly, in Corrgrams: Exploratory displays for correlation matrices (the corrplot documentation confusingly refers to this as his "job"), says:
In the shaded row, each cell is shaded blue or red depending on the sign of the correlation, and with the intensity of color scaled 0–100% in proportion to the magnitude of the correlation. (Such scaled colors are easily computed using RGB coding from red, (1, 0, 0), through white (1, 1, 1), to blue (0, 0, 1). For simplicity, we ignore the non-linearities of color reproduction and perception, but note that these are easily accommodated in the color mapping function.) White diagonal lines are added so that the direction of the correlation may still be discerned in black and white. This bipolar scale of color was chosen to leave correlations near 0 empty (white), and to make positive and negative values of equal magnitude approximately equally intensely shaded. Gray scale and other color schemes are implemented in our software (Section 6), but not illustrated here.
The bar and circular symbols also use the same scaled colors, but fill an area proportional to the absolute value of the correlation. For the bars, negative values are filled from the bottom, positive values from the top. The circles are filled clockwise for positive values, anti-clockwise for negative values. *The ellipses have their eccentricity parametrically scaled to the correlation value* (Murdoch and Chow, 1996). Perceptually, they have the property of becoming visually less prominent as the magnitude of the correlation increases, in contrast to the other glyphs.
(emphasis mine)
"Murdoch and Chow, 1996" is a publication describing the equation for drawing the ellipses (A Graphical Display of Large Correlation Matrices). The ellipses are apparently meant to be caricatures of bivariate normal distributions:
So in conclusion, the only dimension shown is always the correlation coefficient (or the value of A_ij, to use the question's terminology) itself. The multiple apparent dimensions are redundant.

I think the plot is quite self-explanatory. On the right-hand side you have the scale, colored from red (negative correlation) to blue (positive correlation); the color follows a gradient according to the strength of the correlation.
If the ellipse leans to the right, the correlation is positive; if it leans to the left, it is negative.
The diffusion around a line (the line denoting perfect correlation, e.g. mpg ~ mpg) creates an ellipse: the weaker the correlation, the more diffuse the ellipse. This is how a weakly correlated relationship typically looks in a scatterplot; the ellipses are, I think, caricatures of such scatterplots.
Here is some code from the corrplot function responsible for drawing the ellipses. I won't attempt to explain it here (it is part of a larger system), but it shows that the logic is all there if you'd like to dig into it:
if (method == "ellipse" & plotCI == "n") {
    # Parametric curve for an ellipse whose shape is controlled by rho
    ell.dat <- function(rho, length = 99) {
        k <- seq(0, 2 * pi, length = length)
        x <- cos(k + acos(rho)/2)/2
        y <- cos(k - acos(rho)/2)/2
        return(cbind(rbind(x, y), c(NA, NA)))
    }
    ELL.dat <- lapply(DAT, ell.dat)
    ELL.dat2 <- 0.85 * matrix(unlist(ELL.dat), ncol = 2, byrow = TRUE)
    ELL.dat2 <- ELL.dat2 + Pos[rep(1:length(DAT), each = 100), ]
    polygon(ELL.dat2, border = col.border, col = col.fill)
}
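To see the eccentricity scaling in isolation, here is a minimal standalone sketch (my own, not part of corrplot) that draws the Murdoch and Chow ellipse for several values of rho. rho = 0 gives a circle, and the ellipse collapses toward a line as |rho| approaches 1:
# Minimal sketch using the same parametrization as ell.dat above
draw_rho_ellipse <- function(rho) {
  k <- seq(0, 2 * pi, length.out = 100)
  lines(cos(k + acos(rho) / 2) / 2,
        cos(k - acos(rho) / 2) / 2)
}
plot(NULL, xlim = c(-0.6, 0.6), ylim = c(-0.6, 0.6), asp = 1,
     xlab = "", ylab = "", main = "rho = 0, 0.5, 0.9, 0.99")
for (rho in c(0, 0.5, 0.9, 0.99)) draw_rho_ellipse(rho)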

Related

Barplot with distinct colors for significant positive and negative p-values

My data set contains 12 months and 4 seasons (pre-monsoon, monsoon, post-monsoon, dry season), with their r and p-values in relation to ring width index.
Here is the code I used:
k <- read.csv("macro_r_p.csv", header = TRUE, sep = ",")
k
cols <- c("azure3", "red")[(k$p < 0.05) + 1]
barplot(k$r, names.arg = k$parameter, ylab = "Correlation coefficient",
        col = cols, main = expression("T"[max]), las = 2)
However, this only gives me a single color for significant values (p < 0.05), whether the relation is positive or negative. I want one color for significant positive relations and another for significant negative relations. I would also appreciate code for a nicer (somewhat larger) text font; my graph height was not adjusted properly either.
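A minimal sketch of one approach (assuming k has the columns r, p, and parameter used above; the color names and cex values are arbitrary choices):
# Grey for non-significant; color significant bars by the sign of r
cols <- ifelse(k$p >= 0.05, "azure3",
               ifelse(k$r > 0, "blue", "red"))
barplot(k$r, names.arg = k$parameter, ylab = "Correlation coefficient",
        col = cols, main = expression("T"[max]), las = 2,
        cex.names = 1.2, cex.lab = 1.2)  # cex.* arguments enlarge the text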

Calculate area on 3D surface mesh enclosed by four arbitrary points from coordinate data

I have human facial data as below:
library(Rvcg)
library(rgl)
data(humface)
lm <- matrix(c(1.0456182e+001, -3.5877686e+001, 5.0972912e+001, 2.2514189e+001,
               8.4171227e+001, 6.6850304e+001, 8.3239525e+001, 9.8277359e+000,
               6.5489395e+001, 4.2590347e+001, 4.0016006e+001, 5.9176712e+001),
             4)
shade3d(humface, col = "#add9ec", specular = "#202020", alpha = 0.8)
plot3d(lm, type = "s", col = "red", xlab = "x", ylab = "y", zlab = "z",
       size = 1, aspect = FALSE, add = TRUE)
For lm, four landmarks are placed on the surface of the mesh, in the following order:
The yellow lines are drawn by hand for illustration purposes. I wish to calculate the surface area of the quadrilateral enclosed by the four red dots, i.e., the surface area inside the yellow edges.
If the surface area cannot be calculated, I also welcome methods to calculate the area of the flat quadrilateral (not the area of the surface of the face). I know one could calculate the sum of the areas of triangle 123 and triangle 234. However, in my real application I have no idea of the ordering and relative spatial position of the four points. Since I have thousands of quadrilateral areas to calculate, it is impossible to plot each quadrilateral and determine how to decompose it into two triangles; for example, I might accidentally pick triangle 123 and triangle 124, and the sum of those two triangle areas is not what I want.
Therefore, I am interested in either the surface area or the area of the quadrilateral; a solution to either is welcome. I just do not want to plot each quadrilateral, and I want an area value computed directly from the coordinates.
The rgl::shadow3d function can compute a projection of the quad onto the face; you would then compute the area by summing the areas of the triangles and quads in the result (@DiegoQueiroz gives some pointers for doing that below). In addition, the Rvcg package contains vcgArea:
quad <- mesh3d(lm, triangles = cbind(c(1,2,4), c(1,4,3)))
projection <- shadow3d(humface, quad, plot = FALSE)
Here's what that looks like:
shade3d(projection, col = "yellow", polygon_offset = -1)
The projection ends up containing 3604 triangles; the area is
vcgArea(projection)
# [1] 5141.33
There are a few ambiguities in the problem: the quadrilateral isn't planar, so you'd get a different one if you split it into triangles along the other diagonal. And the projection of the quad onto the face is different depending on which direction you choose. I used the default of projecting along the z axis, but in fact the face isn't perfectly aligned that way.
EDITED TO ADD:
If you don't know how to decompose the 4 points into a single quadrilateral, then project all 4 triangles (which form a tetrahedron in 3-space):
triangles <- mesh3d(lm, triangles = cbind(c(1,2,3), c(1,2,4), c(1,3,4), c(2,3,4)))
projection <- shadow3d(humface, triangles, plot = FALSE)
This gives a slightly different region than projecting the quad:
vcgArea(projection)
# [1] 5217.224
I think the reason for this is related to what I referred to in the comment above: the area depends on the "thickness" of the object being projected, since the quad is not planar.
I believe your question is more appropriate for math.stackexchange.com because I think it's more a question about the math behind the code than the code itself.
If you are concerned about precision, you may want to use techniques for smoothing the calculated area of a mesh, like the one presented in this paper.
However, if you don't really need the area to follow the surface, you can ignore the face and compute the area of the convex quadrilateral using one of the many available formulas. The simplest one requires the vectors corresponding to the quadrilateral's diagonals (which you can find by checking this question).
If you decide to find the diagonals and use the simplest vectorial formula (half the magnitude of the cross product of the diagonals), you should use the cross() and Norm() functions from the pracma package: base R's crossprod() computes the matrix crossproduct t(x) %*% y, not the vector cross product you need.
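A minimal sketch of that formula (my own illustration; quad_area is a hypothetical helper, and p1..p4 must be ordered around the boundary so that p1-p3 and p2-p4 are the diagonals):
library(pracma)

# Area of a planar quadrilateral: half the magnitude of the cross product
# of its two diagonals
quad_area <- function(p1, p2, p3, p4) {
  Norm(cross(p3 - p1, p4 - p2)) / 2
}

# Unit square in the xy-plane as a sanity check: area should be 1
quad_area(c(0, 0, 0), c(1, 0, 0), c(1, 1, 0), c(0, 1, 0))
# [1] 1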

Is there a way to manually adjust the boundaries of a color gradient on a phylogeny in ape/phytools?

I am trying to visualize the results of a phylogenetic least squares regression using ape and phytools. Specifically, I have been trying to create a regression equation for predictive purposes, and I am looking at how much phylogenetic signal influences the residuals (and hence the accuracy of the equation). I have been using code somewhat similar to the following to plot the results (retooled here for dummy data).
library("ape")
library("phytools")
orig_tree <- rtree(n = 20)
plot(orig_tree)
values <- data.frame("residuals" = runif(20, min = -1, max = 1),
                     row.names = orig_tree$tip.label)
values <- setNames(values$residuals, rownames(values))
residualsignalfit <- fastAnc(orig_tree, values, vars = TRUE, CI = TRUE)
obj <- contMap(orig_tree, values, plot = FALSE)
plot(obj, fsize = .25)
However, the problem is that a few species exhibit extremely high residuals relative to the rest of the dataset. Because the minimum and maximum of the color gradient are set to the minimum and maximum of the actual values, the contrast among 90% of the dataset is washed out in order to accommodate the few extreme outliers. Below is code that reproduces an example of what I mean, compared to obj above.
values2 <- values
values2[6] <- -2
values2[7] <- 2
residualsignalfit2 <- fastAnc(orig_tree, values2, vars = TRUE, CI = TRUE)
obj2 <- contMap(orig_tree, values2, plot = FALSE)
plot(obj2, fsize = .25)
This makes the figure seem as though there is much less phylogenetic signal than there really is, because all but the most extreme outlier points are colored similarly.
I am trying to figure out a way to set the minimum and maximum of the color gradient so that any value ≤ -1 gets the most extreme red and any value ≥ 1 gets the most extreme blue, thereby allowing greater contrast among the rest of the residuals. I tried using the command
plot(obj2, fsize = .25, lims = c(-1, 1))
but as you can see from running this code, it does nothing. I know ggplot2 has an option to rescale a color gradient based on user-supplied values, but I can't figure out how to plot phylogeny objects from ape or phytools with ggplot2.
Given this, is there any way to manipulate the color gradient pattern in ape/phytools such that one can arbitrarily set the maximum and minimum boundaries for the color gradient?
You could manipulate the color gradient by "squeezing" the values between arbitrary boundaries (in this example, the 5% and 95% quantiles) before passing them to phytools::contMap:
## The vector of values with two outliers
values_outliers <- values
values_outliers[6] <- -10
values_outliers[7] <- 10
## The original heat plot object
contMap_with_outliers <- contMap(orig_tree, values_outliers, plot = FALSE)
plot(contMap_with_outliers, fsize = .25)
## Removing the outliers (clamping them to the central 90% of the values)
values_no_outliers <- values_outliers
## Find the 5% and 95% quantile boundaries
boundaries <- quantile(values_no_outliers, probs = c(0.05, 0.95))
## Changing the values below the lowest boundary
values_no_outliers <- ifelse(values_no_outliers < boundaries[1], boundaries[1], values_no_outliers)
## Changing the values above the highest boundary
values_no_outliers <- ifelse(values_no_outliers > boundaries[2], boundaries[2], values_no_outliers)
## The heat plot object without outliers
contMap_without_outliers <- contMap(orig_tree, values_no_outliers, plot = FALSE)
plot(contMap_without_outliers, fsize = .25)
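The two ifelse() calls can also be written as a single clamp, which may read more directly (an equivalent alternative I am adding, not part of the original answer):
## Equivalent clamp with pmin/pmax instead of the two ifelse() calls
values_no_outliers <- pmax(pmin(values_outliers, boundaries[2]), boundaries[1])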

Plot gigantic correlation matrix as colours

I have a correlation matrix $P_{i,j}$ which is $1000 \times 1000$. Given the structure of the data, the matrix will have rectangular patches of very high correlations: if you draw a $20 \times 20$ square anywhere in this matrix, you will be looking either at a patch of highly correlated variables ($\rho_{i,j} > 0.8$) or at medium-to-uncorrelated ones ($\rho_{i,j} \in [-0.1, 0.5]$).
How do I represent this graphically? I know of one way to visualize a matrix like this but it only works for small dimensions:
install.packages("plotrix")
library(plotrix)
rhoMat <- array(rnorm(1000 * 1000), dim = c(1000, 1000))
color2D.matplot(rhoMat[1:10, 1:10], cs1 = c(0, 0.01), cs2 = c(0, 0), cs3 = c(0, 0)) # nice!
color2D.matplot(rhoMat, cs1 = c(0, 0.01), cs2 = c(0, 0), cs3 = c(0, 0)) # broken!
What function or algorithm would plot a red area where correlations in that vicinity of the matrix $P_{i,j}$ "tend to" be high, versus another colour where they "tend to" be low (better yet, one that switches colour as we move from positive to negative correlation patches)? I want to see how many patches of high correlation there are and whether one patch is correlated with another patch at a different place in the dataset.
I only want to do it in R.
I think you can use image with the argument breaks to get exactly what you want:
dat <- matrix(runif(10000), ncol = 100)
image(dat, breaks = c(0.0, 0.8, 1.0), col = c("yellow", "red"))
I always fail to think of image for this kind of problem - the name is sort of non-obvious. I started with heatmap and then it led me to image.
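The same idea extends to signed correlations with three bands roughly matching the ranges in the question (a sketch; the cut points and colors are arbitrary choices):
# Blue = negative patch, white = medium/uncorrelated, red = high positive
rho <- matrix(runif(10000, -1, 1), ncol = 100)  # stand-in for the real matrix
image(rho, breaks = c(-1, -0.1, 0.8, 1), col = c("blue", "white", "red"))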
Look at the corrplot package. It has various tools for visualizing correlations, one option that it has is to use hierarchical clustering to draw rectangles around groups of high or low correlation.
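For instance, something along these lines (a sketch using the built-in mtcars data; the number of rectangles is an arbitrary choice):
library(corrplot)
M <- cor(mtcars)
# order = "hclust" reorders variables by hierarchical clustering;
# addrect draws rectangles around the chosen number of clusters
corrplot(M, order = "hclust", addrect = 3)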
I've done this in Excel fairly easily. You can change the colour of cells based on the range of values within them, and you can even create a gradient from, let's say, 0 to 1. 1000 x 1000 would be big for Excel, but I think it would work; you would just have to zoom out.

Calculating the volume under a surface

I have created a 3D surface plot using the wireframe function. I wonder if there are any functions with which I can calculate the volume under the surface in a 3D plot?
Here is a sample of my data plus the wireframe syntax I used to create my 3D surface plot:
x1<-c(13,27,41,55,69,83,97,111,125,139)
x2<-c(27,55,83,111,139,166,194,222,250,278)
x3<-c(41,83,125,166,208,250,292,333,375,417)
x4<-c(55,111,166,222,278,333,389,445,500,556)
x5<-c(69,139,208,278,347,417,487,556,626,695)
x6<-c(83,166,250,333,417,500,584,667,751,834)
x7<-c(97,194,292,389,487,584,681,779,876,974)
x8<-c(111,222,333,445,556,667,779,890,1001,1113)
x9<-c(125,250,375,500,626,751,876,1001,1127,1252)
x10<-c(139,278,417,556,695,834,974,1113,1252,1391)
df<-data.frame(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10)
df.matrix<-as.matrix(df)
wireframe(df.matrix,
          aspect = c(61/87, 0.4),
          scales = list(arrows = FALSE, cex = .5, tick.number = "10",
                        z = list(arrows = TRUE)),
          ylim = c(1:10),
          xlab = expression(phi1), ylab = "Percentile", zlab = "Loss",
          main = "Random Classifier",
          light.source = c(10, 10, 10), drape = TRUE,
          col.regions = rainbow(100, s = 1, v = 1, start = 0,
                                end = max(1, 100 - 1)/100, alpha = 1),
          screen = list(z = -60, x = -60))
Note: my real data is a 100 x 100 matrix.
The data you are feeding to wireframe is a grid of values. Hence one estimate of the volume of whatever underlying surface this is approximating is the sum of the grid values multiplied by the grid cell areas. This is just like adding up the heights of histogram bars to get the number of values in your histogram.
The problem I see with doing this on your data is that the cell areas are going to be in odd units: percentiles on one axis, phi (with unknown units) on the other. Your volume will therefore have units of loss times percentile times phi.
This isn't a problem if you want to compare volumes of similar things on exactly the same grid, but if you have surfaces on different grids (different values of phi, or different percentiles) then you need to be careful.
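As a concrete sketch of the grid-sum estimate (dx, dy, and volume_estimate are names I'm introducing; unit spacing on both axes is assumed, as in the 10 x 10 example above):
dx <- 1  # grid spacing along the phi axis (assumed)
dy <- 1  # grid spacing along the percentile axis (assumed)
volume_estimate <- sum(df.matrix) * dx * dy
volume_estimate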
Now, noting that wireframe doesn't draw like a 3D histogram would (looking like square tower blocks), this gives us another way to estimate the volume. Your 10x10 matrix is plotted as 9x9 squares. Divide each of those squares into two triangles and then compute the volume of the 162 truncated triangular prisms (right prisms with one sloping end, I think). The volume of each is the base area times the height at the triangle's centroid, which for a planar top equals the mean of the three corner heights.
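A sketch of that prism sum (my own implementation of the idea, again assuming unit grid spacing; each triangle has area 1/2, so each prism contributes the mean of its three corner heights divided by 2):
z <- df.matrix
n <- nrow(z); m <- ncol(z)
vol <- 0
for (i in 1:(n - 1)) {
  for (j in 1:(m - 1)) {
    # split cell (i, j) into two triangles along the diagonal
    vol <- vol + mean(c(z[i, j], z[i + 1, j], z[i, j + 1])) / 2
    vol <- vol + mean(c(z[i + 1, j], z[i, j + 1], z[i + 1, j + 1])) / 2
  }
}
vol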
I thought maybe this would be in the raster package, but it isn't. There's code for computing the surface area but not the volume! I'm sure the raster maintainer would be happy to have some code for this!
If the points are arbitrary (i.e., don't follow a smooth function), it seems like you're looking for the volume of the convex hull (minimum surface) surrounding these points. One package to help you calculate this is alphashape3d.
You'll need a 3-column matrix of coordinates to form the right type of object for the calculation, but it seems rather straightforward.
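A sketch of that calculation (assuming pts stands in for your n x 3 coordinate matrix; the alpha value is an arbitrary choice, with a large alpha approximating the convex hull):
library(alphashape3d)
pts <- matrix(rnorm(300), ncol = 3)   # stand-in for your coordinates
ashape <- ashape3d(pts, alpha = 10)   # large alpha ~ convex hull
volume_ashape3d(ashape)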
