Label outliers using mvOutlier from MVN in R

I'm trying to label outliers on a chi-square Q-Q plot using the mvOutlier() function from the MVN package in R.
I have managed to identify the outliers by their labels and get their x-coordinates. I tried placing the labels on the plot using text(), but the x- and y-coordinates seem to be flipped.
Building on an example from the documentation:
library(MVN)
data(iris)
versicolor <- iris[51:100, 1:3]
# Mahalanobis distance
result <- mvOutlier(versicolor, qqplot = TRUE, method = "quan")
labelsO <- rownames(result$outlier)[result$outlier[, 2] == TRUE]
xcoord <- result$outlier[result$outlier[, 2] == TRUE, 1]
text(xcoord, label = labelsO)
This produces a plot with the labels placed at the wrong positions.
I also tried text(x = xcoord, y = xcoord, label = labelsO), which is fine when the points are near the y = x line, but may fail when normality is not satisfied (and the points deviate from that line).
Can someone suggest how to access the chi-square quantiles, or explain why the x-coordinate passed to text() doesn't seem to obey the input parameters?

Looking inside the mvOutlier function, it appears that it doesn't save the chi-square quantiles. Right now your text() call is treating xcoord as a vector of y-values, with the x-values defaulting to the indices 1:2. Thankfully, the chi-square quantile is a fairly simple calculation, since it is rank-based in this case.
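To see why the coordinates appear flipped: when text() receives a single positional vector, xy.coords() treats it as the y-values and substitutes the indices 1:n for x. A tiny illustration (made-up plot):
plot(1:10)
text(c(2.5, 7.5), label = c("a", "b")) # drawn at (1, 2.5) and (2, 7.5), not at x = 2.5 and 7.5
With that in mind, the fix is to recompute the chi-square quantiles for the y-coordinates: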
result <- mvOutlier(versicolor, qqplot = TRUE, method = "quan")
labelsO <- rownames(result$outlier)[result$outlier[, 2] == TRUE]
xcoord <- result$outlier[result$outlier[, 2] == TRUE, 1]
# recalculate the chi-square quantiles for ranks 50 and 49,
# i.e. p = ((size:(size - n.outliers + 1)) - 0.5)/size, with df = n.variables = 3
chis <- qchisq(((50:49) - 0.5)/50, 3)
text(xcoord, chis, label = labelsO)
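The same rank-based calculation generalizes to any sample size and number of flagged outliers; a sketch building on the objects created above:
n <- nrow(versicolor)    # sample size (50 here)
p <- ncol(versicolor)    # degrees of freedom = number of variables (3 here)
n.out <- length(labelsO) # number of flagged outliers
# the outliers have the largest Mahalanobis distances, i.e. ranks n, n-1, ...
chis <- qchisq(((n:(n - n.out + 1)) - 0.5)/n, p)
text(xcoord, chis, label = labelsO)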

As mentioned in the previous reply, the MVN package does not support labeling outliers. Although this is not strictly necessary, since it can be done manually, we might still consider adding a "label outliers" option to the mvOutlier(...) function. Thanks for your interest; we may include it in a future update of the package.

The web-based version of the MVN package now has the ability to label outliers (Advanced options under the Outlier detection tab). You can access this web tool at http://www.biosoft.hacettepe.edu.tr/MVN/

Related

Is there a way to add species to an ISOMAP plot in R?

I am using the isomap function from the vegan package in R to analyse community data of epiphytic mosses and lichens. I started analysing the data using NMDS, but due to the structure of the data I ran into problems, which is why I switched to ISOMAP; it works perfectly well and returns very nice results. So far so good. However, the output of the function does not support plotting species within the ISOMAP plot, as species scores are not available. I would really like to add species information to enhance the interpretability of the output.
Does anyone have a solution or a hint for this problem? Is there a way to add species to the plot post hoc, as can be done with environmental data?
I would greatly appreciate any help on this topic!
Thank you and best regards,
Inga
No, there is no function to add species scores to isomap. It would look like this:
`sppscores<-.isomap` <- function(object, value)
{
    # centre the community data, then project it onto the ordination axes:
    # species scores as vectors of linear increase
    value <- scale(value, center = TRUE, scale = FALSE)
    v <- crossprod(value, object$points)
    attr(v, "data") <- deparse(substitute(value))
    object$species <- v
    object
}
Or alternatively:
`sppscores<-.isomap` <- function(object, value)
{
    # species scores as weighted averages in ordination space,
    # expanded to counteract the shrinkage of averaging
    wa <- vegan::wascores(object$points, value, expand = TRUE)
    attr(wa, "data") <- deparse(substitute(value))
    object$species <- wa
    object
}
If ord is your isomap result and comm are your community data, you can use these as:
sppscores(ord) <- comm # either alternative
I have no idea (yet) which of these alternatives is more correct. The first adds species scores as vectors of their linear increase; the second adds them as their weighted averages in ordination space, but expanded so that we allow some species to be more extreme than the site units where they occur.
These will add a new element, species, to the result object ord. Using these scores elsewhere in vegan would need more coding, but you can extract them with vegan::scores. Note that their scaling is based on the original scale of the community data, so they may be badly scaled with respect to the site-unit points; fixing that would require more work. However, you can plot them separately, or multiply them by a constant to give a scaling similar to the site-unit scores.
sp <- scores(ord, display = "species", choices = 1:2)
plot(sp, type = "n", asp = 1) # sets up the plot region without drawing anything
text(sp, labels = rownames(sp)) # so we add the species names as text
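Putting it together, a minimal end-to-end sketch, assuming your vegan version provides the sppscores<- generic that the methods above plug into (the varespec data set stands in for your moss/lichen community matrix):
library(vegan)
data(varespec)
dis <- vegdist(varespec)            # Bray-Curtis dissimilarities
ord <- isomap(dis, ndim = 2, k = 3) # ISOMAP with k nearest neighbours
sppscores(ord) <- varespec          # attach species scores via either method above
sp <- scores(ord, display = "species", choices = 1:2)
plot(ord)                           # site units and network
text(sp, labels = rownames(sp), col = "red") # overlay species names
As noted above, the species scores may need multiplying by a constant to sit nicely among the site units.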

How can I make a graph like this in R?

I need to make a probability plot for a geochemical analysis (cumulative probability vs. element concentration, on log scales), like in the picture.
I have tried ppPlot from the qualityTools package with the log-normal argument, but for some elements it does not work. It says I need positive values for a log-normal distribution, but they are all positive; I've checked. I think the command uses the density function in base R, and its density model inadvertently produces negative concentration values.
The qqnorm command in base R produces a different kind of plot.
How can I get around this?
Edit: Here's part of my magnesium data (I need to generate a similar graph with them):
mg <- c(51.400, 149.000, 276.000, 135.000, 179.000, 81.000, 116.000, 8.150, 7.770, 7.870, 8.840, 15.600, 13.400,
57.400, 7.440, 14.800, 40.800, 15.100, 21.400, 5.550, 3.390, 18.800, 20.100, 19.600, 11.600, 11.700,
12.200, 12.500, 11.700, 12.100, 13.000, 12.300, 13.300, 13.200, 12.600, 29.700, 25.400, 21.000, 11.100,
11.500, 11.000, 32.600, 17.500, 16.500, 18.100, 27.200, 21.200, 26.400, 18.800, 19.900, 32.000, 28.600,
29.400, 30.700, 2.370, 2.070, 1.850, 1.970, 24.900, 19.100, 17.400, 23.100, 50.100, 48.800, 18.000,
15.800, 27.100, 43.500, 4.820, 13.400, 14.600, 24.100, 22.700, 22.500, 43.500, 41.300, 43.700, 41.100,
40.800, 63.700, 7.700, 8.360, 60.000, 58.400, 63.100, 65.100, 219.000, 25.800, 4.940, 3.670, 13.800,
5.190, 14.700, 15.000, 13.100, 12.300, 10.700, 10.700, 11.100, 10.100, 10.600, 63.200, 19.800, 22.200,
17.600, 11.500, 10.600, 9.380, 3.190, 9.180, 10.800, 189.000, 190.000, 152.000, 119.000, 194.000, 56.100)
This is my solution, but I had to create a whole new data frame with random numbers:
ggplot() +
  geom_point(data = N, aes(x = Prob, y = AS)) +
  scale_x_log10(breaks = c(0.1, 1, 5, 20, 50, 80, 95, 99, 99.9)) +
  scale_y_log10(expand = c(0, 0),
                breaks = c(seq(0.01, 0.1, 0.01), seq(0.2, 1, 0.1), seq(2, 10, 1))) +
  ylab(paste("As (", "\U00B5", "g/l)")) +
  xlab("Cumulative probability (%)") +
  theme_bw() +
  annotation_logticks()
For the lines, you could add segments and text in another layer, for example.
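As a sketch of how the plotting data frame might be built from real data rather than random numbers, using the mg values above (the plotting-position formula (rank - 0.5)/n is one common choice; the column names are illustrative):
library(ggplot2)
mg.sorted <- sort(mg)
n <- length(mg.sorted)
N <- data.frame(Prob = 100 * (seq_len(n) - 0.5)/n, # empirical cumulative %
                Mg = mg.sorted)
ggplot(N, aes(x = Prob, y = Mg)) +
  geom_point() +
  scale_x_log10(breaks = c(0.1, 1, 5, 20, 50, 80, 95, 99, 99.9)) +
  scale_y_log10() +
  xlab("Cumulative probability (%)") +
  ylab("Mg") +
  theme_bw() +
  annotation_logticks()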

How to control plot layout for lmerTest output results?

I am using lme4 and lmerTest to run a mixed model and then backward variable elimination (step) on that model. This seems to work well. After running the step function from lmerTest, I plot the final model. The plot output appears to be ggplot2-based.
I would like to change the layout of the plot. The obvious answer is to do it manually, creating the plots myself with ggplot2. If possible, though, I would like to simply change the layout of the output so that each plot (i.e. each plotted dependent variable in the final model) is in its own row.
See the code and plot below for my results. Note that the plot has three columns and I would like three rows. Also, I have not provided sample data (let me know if I need to!).
library(lme4)
library(lmerTest)
# Full model
Female.Survival.model.1 <- lmer(Survival.Female ~ Location + Substrate +
                                Location:Substrate + (1|Replicate),
                                data = Transplant.Survival, REML = TRUE)
# lmerTest - backward stepwise elimination of fixed effects
Female.Survival.model.ST <- step(Female.Survival.model.1, reduce.fixed = TRUE,
                                 reduce.random = FALSE, ddf = "Kenward-Roger")
Female.Survival.model.ST
plot(Female.Survival.model.ST)
The function that creates these plots is called plotLSMEANS. You can look at its code via lmerTest:::plotLSMEANS. The reason to look at the code is 1) to verify that the plots are indeed based on ggplot2 code and 2) to see if you can figure out what needs to be changed to get what you want.
In this case, it sounds like you'd want facet_wrap to use one column instead of three. I tested with the example from the help page of lmerTest's step function, and it looks like you can simply add a new facet_wrap layer to the plot.
library(ggplot2)
plot(Female.Survival.model.ST) +
  facet_wrap(~namesforplots, scales = "free", ncol = 1)
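For a self-contained test, here is a sketch using the ham data set that ships with lmerTest, modeled on the step help-page example from the lmerTest version of that era (newer lmerTest releases changed this API, so treat this as illustrative):
library(lmerTest)
library(ggplot2)
m <- lmer(Informed.liking ~ Product*Information + (1|Consumer), data = ham)
st <- step(m, reduce.random = FALSE)
plot(st) +
  facet_wrap(~namesforplots, scales = "free", ncol = 1)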
Try this: plot(difflsmeans(Female.Survival.model.ST$model, test.effs = "Location"))

New gplots update in R cannot find function distfun in heatmap.2

I have a bit of R code to make a heatmap from a correlation matrix, which worked the last time I used it (prior to the 17 Oct 2013 update of gplots; after updating to R version 3.0.2). This makes me think that something changed in the most recent gplots update, but I cannot figure out what.
What used to produce a nice plot now gives me this error:
Error in hclustfun(distfun(x)) : could not find function "distfun"
and won't plot anything. Below is the code to reproduce the plot (heavily commented, as I was using it to teach an undergraduate how to use heatmaps for a project). I tried adding the last line to explicitly set the functions, but it didn't resolve the problem.
EDIT: I changed the last line of the code to read:
, distfun = function(c) {as.dist(1 - c, upper = FALSE)}, hclustfun = hclust)
and it worked. When I used just distfun = as.dist, I got a plot, but it wasn't sorted right, and several of the dendrogram branches didn't connect to the tree. I'm not sure what happened, or why this works, but it appears to.
Any help would be greatly appreciated.
Thanks in advance,
library(gplots)
set.seed(12345)
randData <- as.data.frame(matrix(rnorm(600), ncol = 6))
randDataCorrs <- randData + rnorm(600)
names(randDataCorrs) <- paste(names(randDataCorrs), "_c", sep = "")
randDataExtra <- cbind(randData, randDataCorrs)
randDataExtraMatrix <- cor(randDataExtra)
heatmap.2(randDataExtraMatrix,      # the correlation matrix to plot
          symm = TRUE,              # the matrix is symmetric
          main = "Correlation matrix\nof Random Data Cor", # plot title
          xlab = "x groups", ylab = "", # x and y labels
          scale = "none",           # do not rescale the data
          col = redblue(256),       # colours (can be manual, see below)
          trace = "none",           # no trace lines
          symkey = TRUE, symbreaks = TRUE, # keep the colour key symmetric around 0
          density.info = "none",    # can be "histogram" for a histogram of correlations
          # distfun = dist, hclustfun = hclust) # old last line
          distfun = function(c) {as.dist(1 - c, upper = FALSE)}, hclustfun = hclust) # new last line
I had the same error. Then I noticed that I had created a variable called dist, which is what the default distfun = dist refers to. I renamed the variable and everything ran fine. You likely made the same mistake; your new code works because you altered the default call of distfun.
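If you suspect the same problem, a generic way to check whether something in your workspace is masking a base function (a sketch, not specific to gplots):
find("dist")             # every environment providing 'dist'; ".GlobalEnv" first means your own object wins
conflicts(detail = TRUE) # all masked objects across the search path
rm(dist)                 # if you created one, remove it and rerun heatmap.2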

differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?

I'm comparing two ways of creating heatmaps with dendrograms in R: made4's heatplot and gplots' heatmap.2. The appropriate results depend on the analysis, but I'm trying to understand why the defaults are so different, and how to get both functions to give the same (or a highly similar) result, so that I understand all the 'black box' parameters that go into this.
This is the example data and packages:
require(gplots)
# made4 from bioconductor
require(made4)
data(khan)
data <- as.matrix(khan$train[1:30,])
Clustering the data with heatmap.2 gives:
heatmap.2(data, trace="none")
Using heatplot gives:
heatplot(data)
The two give very different results and scalings initially. The heatplot results look more reasonable in this case, so I'd like to understand what parameters to feed heatmap.2 to get it to do the same, since heatmap.2 has other advantages/features I'd like to use and because I want to understand the missing ingredients.
heatplot uses average linkage with correlation distance, so we can feed that into heatmap.2 to ensure similar clusterings are used (based on https://stat.ethz.ch/pipermail/bioconductor/2010-August/034757.html):
dist.pear <- function(x) as.dist(1 - cor(t(x)))
hclust.ave <- function(x) hclust(x, method = "average")
heatmap.2(data, trace = "none", distfun = dist.pear, hclustfun = hclust.ave)
This makes the row-side dendrograms look more similar, but the columns are still different and so are the scales. It appears that heatplot applies some scaling by default that heatmap.2 does not. If I add row scaling to heatmap.2, I get:
heatmap.2(data, trace = "none", distfun = dist.pear, hclustfun = hclust.ave, scale = "row")
which still isn't identical, but is closer. How can I reproduce heatplot's results with heatmap.2? What are the differences?
Edit 2: It seems like a key difference is that heatplot rescales the data with both rows and columns, using:
if (dualScale) {
    print(paste("Data (original) range: ", round(range(data), 2)[1],
                round(range(data), 2)[2]), sep = "")
    data <- t(scale(t(data)))
    print(paste("Data (scale) range: ", round(range(data), 2)[1],
                round(range(data), 2)[2]), sep = "")
    data <- pmin(pmax(data, zlim[1]), zlim[2])
    print(paste("Data scaled to range: ", round(range(data), 2)[1],
                round(range(data), 2)[2]), sep = "")
}
This is what I'm trying to bring into my call to heatmap.2. The reason I like it is that it makes the contrast between the low and high values larger, whereas just passing zlim to heatmap.2 is simply ignored. How can I use this 'dual scaling' while preserving the clustering along the columns? All I want is the increased contrast you get with:
heatplot(..., dualScale = TRUE, scale = "none")
compared with the low contrast you get with:
heatplot(..., dualScale = FALSE, scale = "row")
Any ideas?
The main differences between the heatmap.2 and heatplot functions are the following:
1. heatmap.2, by default, uses the Euclidean measure to obtain the distance matrix and the complete agglomeration method for clustering, while heatplot uses correlation distance and average agglomeration, respectively.
2. heatmap.2 computes the distance matrix and runs the clustering algorithm before scaling, whereas heatplot (when dualScale = TRUE) clusters the already-scaled data.
3. heatmap.2 reorders the dendrogram based on the row and column mean values, as described here.
The default settings (point 1) can simply be changed within heatmap.2 by supplying custom distfun and hclustfun arguments. However, points 2 and 3 cannot easily be addressed without changing the source code. Therefore the heatplot function acts as a wrapper for heatmap.2: first it applies the necessary transformation to the data, calculates the distance matrix, and clusters the data, and then it uses heatmap.2 functionality only to plot the heatmap with the above parameters.
The dualScale = TRUE argument in the heatplot function applies only row-based centering and scaling. Then it reassigns the extremes of the scaled data to the zlim values:
z <- t(scale(t(data)))
zlim <- c(-3, 3)
z <- pmin(pmax(z, zlim[1]), zlim[2])
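A tiny illustration of what this clipping does (made-up values):
x <- c(-5, -1, 0, 2, 7)
pmin(pmax(x, zlim[1]), zlim[2]) # -3 -1 0 2 3: values beyond c(-3, 3) are clamped to the limits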
In order to match the output from the heatplot function, I would like to propose two solutions:
I - add new functionality to the source code -> heatmap.3
The code can be found here. Feel free to browse through revisions to see the changes made to heatmap.2 function. In summary, I introduced the following options:
z-score transformation is performed prior to the clustering: scale=c("row","column")
the extreme values can be reassigned within the scaled data: zlim=c(-3,3)
option to switch off dendrogram reordering: reorder=FALSE
An example:
# require(gtools) # may be needed by the heatmap.3 source
library(RColorBrewer) # for brewer.pal
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
distCor <- function(x) as.dist(1 - cor(t(x)))
hclustAvg <- function(x) hclust(x, method = "average")
heatmap.3(data, trace = "none", scale = "row", zlim = c(-3, 3), reorder = FALSE,
          distfun = distCor, hclustfun = hclustAvg, col = rev(cols), symbreak = FALSE)
II - define a function that provides all the required arguments to the heatmap.2
If you prefer to use the original heatmap.2, the zClust function (below) reproduces all the steps performed by heatplot. It provides, in a list, the scaled data matrix and the row and column dendrograms. These can be used as input to the heatmap.2 function:
# depending on the analysis, the data can be centered and scaled by row or column;
# the default parameters correspond to the ones in the heatplot function
distCor <- function(x) as.dist(1 - cor(x))
zClust <- function(x, scale = "row", zlim = c(-3, 3), method = "average") {
    if (scale == "row") z <- t(scale(t(x)))
    if (scale == "col") z <- scale(x)
    z <- pmin(pmax(z, zlim[1]), zlim[2])
    hcl_row <- hclust(distCor(t(z)), method = method)
    hcl_col <- hclust(distCor(z), method = method)
    return(list(data = z, Rowv = as.dendrogram(hcl_row), Colv = as.dendrogram(hcl_col)))
}
z <- zClust(data)
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
heatmap.2(z$data, trace = "none", col = rev(cols), Rowv = z$Rowv, Colv = z$Colv)
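To compare this output side by side with the high-contrast original from the question (assuming made4 is loaded):
heatplot(data, dualScale = TRUE, scale = "none")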
A few additional comments regarding heatmap.2 (and heatmap.3) functionality:
symbreak = TRUE is recommended when scaling is applied. It adjusts the colour scale so that it breaks around 0. In the current example, negative values are blue and positive values red.
col = bluered(256) may provide an alternative colouring solution, and it doesn't require the RColorBrewer library.
