New outliers appear after I remove existing ones using QQ plot results (R)

I'm working on the PCA section of Julian Faraway's Linear Models with R (chapter 11, page 164).
PCA is sensitive to outliers, and the Mahalanobis distance helps us identify them.
The author checks for outliers by plotting the Mahalanobis distances against the quantiles of a chi-squared distribution.
if (!require("faraway")) install.packages("faraway"); library(faraway)
library(MASS)  # for cov.rob()
data(fat, package='faraway')
cfat <- fat[,9:18]
robfat <- cov.rob(cfat)  # robust centre and covariance (needed for md below)
md <- mahalanobis(cfat, center=robfat$center, cov=robfat$cov)
n <- nrow(cfat); p <- ncol(cfat)
plot(qchisq(1:n/(n+1), p), sort(md),
     xlab = expression(paste(chi^2, " quantiles")),
     ylab = "Sorted Mahalanobis distances")
abline(0, 1)
I identify the points:
identify(qchisq(1:n/(n+1),p), sort(md))
It appears that the outliers are in rows 242:252. I remove these outliers and re-create the QQ Plot:
cfat.mod <- cfat[-c(242:252),] #remove outliers
robfat <- cov.rob(cfat.mod)
md <- mahalanobis(cfat.mod, center=robfat$center, cov=robfat$cov)
n <- nrow(cfat.mod); p <- ncol(cfat.mod)
plot(qchisq(1:n/(n+1), p), sort(md),
     xlab = expression(paste(chi^2, " quantiles")),
     ylab = "Sorted Mahalanobis distances")
abline(0,1)
identify(qchisq(1:n/(n+1),p), sort(md))
Alas, it now appears that a new set of points (rows 234:241) are outliers. This keeps happening every time I remove additional outliers.
I look forward to understanding what I'm doing wrong.

The problem is the labelling: sort() reorders the distances, so the indices that identify() reports are positions in the sorted vector, not row numbers of cfat. To identify the points correctly, make sure the labels correspond to the positions of the points in the data. The functions order(), or sort() with index.return=TRUE, give the sorting indices. Here is an example, arbitrarily removing the points with md greater than a threshold.
## Your data
data(fat, package='faraway')
cfat <- fat[, 9:18]
n <- nrow(cfat)
p <- ncol(cfat)
md <- sort(mahalanobis(cfat, colMeans(cfat), cov(cfat)), index.return=TRUE)
xs <- qchisq(1:n/(n+1), p)
plot(xs, md$x, xlab=expression(paste(chi^2, 'quantiles')))
## Use indices in data as labels for interactive identify
identify(xs, md$x, labels=md$ix)
## remove those with md>25, for example
inds <- md$x > 25
cfat.mod <- cfat[-md$ix[inds], ]
nn <- nrow(cfat.mod)
md1 <- mahalanobis(cfat.mod, colMeans(cfat.mod), cov(cfat.mod))
## Plot the new data
par(mfrow=c(1, 2))
plot(qchisq(1:nn/(nn+1), p), sort(md1), xlab='chisq quantiles', ylab='')
abline(0, 1, col='red')
car::qqPlot(md1, distribution='chisq', df=p, line='robust', main='With car::qqPlot')
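Equivalently, order() (the other function mentioned above) can be used; a small sketch: order() returns the permutation that sorts the distances, so it maps sorted positions back to original rows and can serve directly as the labels.
md0 <- mahalanobis(cfat, colMeans(cfat), cov(cfat))
ix <- order(md0)                    # permutation that sorts md0
plot(xs, md0[ix], xlab = expression(paste(chi^2, ' quantiles')))
identify(xs, md0[ix], labels = ix)  # labels are original row numbers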

Related

Understanding Sample Sizes in qqplots with R

I'm trying to plot QQ graphs with the MASS Boston data set and compare how the plots change with more random data points. I'm looking at the R documentation for qqnorm(), but it doesn't seem to let me set an n value as a parameter. I'd like to plot the QQ plots of “random” samples of sizes 10, 100, and 1000 from a normal distribution, for the same variable, all in a 3x1 matrix of plots.
Example would be if I wanted to look at the QQplot for Boston Crime, how would I get
qqnorm(Boston$crim) #find how to set n = 10
qqnorm(Boston$crim) #find how to set n = 100
qqnorm(Boston$crim) #find how to set n = 1000
Also if someone could elaborate when to use qqplot() vs qqnorm(), I'd appreciate it.
I'm inclined to believe that I should use qqplot() as such, as it does seem to give me the output I want, but I want to make sure that using rnorm(n) and then using that variable as a second argument is okay to do:
x <- rnorm(10)
y <- rnorm(100)
z <- rnorm(1000)
par(mfrow = c(1,3))
qqplot(Boston$crim, x)
qqplot(Boston$crim, y)
qqplot(Boston$crim, z)
The question is not entirely clear, but to plot QQ plots of samples of a vector, define a vector N of sample sizes and loop through it. The lapply() loop below draws a sample of each size, plots its qqnorm() with the Q-Q line, and returns the qqnorm() results.
data(Boston, package = "MASS")
set.seed(2021)
N <- c(10, 100, nrow(Boston))
par(mfrow = c(1, 3))  # three plots side by side, as requested
qq_list <- lapply(N, function(n){
  subtitle <- paste("Sample size:", n)
  i <- sample(nrow(Boston), n, replace = FALSE)
  qq <- qqnorm(Boston$crim[i], sub = subtitle)
  qqline(Boston$crim[i])
  qq
})
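As for qqplot() vs qqnorm() (a short sketch of my own, not from the original answer): qqnorm() plots one sample against theoretical normal quantiles, while qqplot() plots the empirical quantiles of two samples against each other. Using rnorm(n) as the second sample is legitimate, but every draw gives a slightly different plot; feeding qqplot() theoretical quantiles instead removes that randomness.
## qqnorm(): one sample vs. theoretical normal quantiles (deterministic)
qqnorm(Boston$crim); qqline(Boston$crim)
## qqplot() with theoretical quantiles reproduces qqnorm() without the
## Monte Carlo noise of an rnorm() comparison sample
qqplot(qnorm(ppoints(length(Boston$crim))), Boston$crim,
       xlab = "Theoretical quantiles", ylab = "Boston$crim")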

How to add labels to original data given clustering result using hclust

Say I have some unlabelled data which I know should be clustered into six categories, like this dataset, for example:
library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)
If I create an hclust object with a sample of 60 from the original dataset like so:
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw)#Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method= 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')
I know that I can then add the labels to the dendrogram for viewing like this:
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")
However, I would like to add the correct labels to the initial data frame that was clustered. So for ts.samp I would like to add an extra column with the label of the cluster each observation was assigned to.
It would seem that ts.samp$cluster <- hc$labels should add the cluster to the data frame; however, hc$labels returns NULL.
Can anyone help with extracting this information?
You need to choose the level at which to cut your dendrogram; that cut forms the groups.
Use:
labels <- cutree(hc, k = 3) # set k to whatever number of clusters is appropriate; see below on how to read the dendrogram
ts.samp$grouping <- labels
Let's look at the dendrogram in order to find the best number for k:
plot(hc, main="")
abline(h=500, col = "red") # cut at height 500 forms 2 groups
abline(h=300, col = "blue") # cut at height 300 forms 3/4 groups
It looks like either 2 or 3 might be good. Look for the largest jump in the vertical distances (Height): draw a horizontal line at that height and count the clusters formed, as in the sketch below.
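If you want to locate that jump programmatically, here is a small heuristic sketch (my addition, not part of the original answer); it assumes the merge heights in hc$height are increasing, which holds for average linkage:
gaps <- diff(hc$height)         # jumps between successive merge heights
i <- which.max(gaps)            # cut in the largest gap, above merge i
k <- length(hc$height) + 1 - i  # cutting there leaves n - i clusters
ts.samp$grouping <- cutree(hc, k = k)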

R map a k-means clustering of a self organising map back to the data

I've used k-means clustering to classify a self-organising map (SOM) and would now like to back-code the data with the SOM clusters.
Example script below.
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4], center = TRUE)
# Prepare SOM
set.seed(590507)
som1 <- som(dt,
            somgrid(6, 6, "hexagonal"),
            rlen = 500,
            keep.data = TRUE)
# Plot codes map
myPal1 <- colorRampPalette(c("black", "orange", "red", "green"))
plot(som1,
     type = "codes",
     palette.name = myPal1,
     main = "Codes",
     shape = "straight",
     border = "gray")
# Extract the codebooks from SOM
cds <- as.data.frame(som1$codes)
# Compute WSS for up to 6 clusters for codebook vectors
wss <- (nrow(cds) - 1) * sum(apply(cds, 2, var))
for (i in 2:6) {
  wss[i] <- sum(kmeans(cds, centers = i)$withinss)
}
# Plot the scree plot
par(mar = c(8, 5, 8, 2))
plot(1:6, wss,
     type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares",
     main = "Within cluster sum of squares (WCSS)",
     col = "blue",
     lwd = 2)
# Scree plot - 3 clusters look like a sensible choice
nCls <- 3
som1.km <- kmeans(cds, nCls, nstart = 20)
# Plot the SOM codes map with the 3 clusters as background
MyPal3 <- c("grey80", "aquamarine", "burlywood1")
par(mar = c(0, 5, 0, 2))
plot(som1,
     type = "codes",
     palette.name = myPal1,
     bgcol = MyPal3[som1.km$cluster],
     main = "k-means clusters",
     shape = "straight",
     border = "gray")
legend(x = 7, y = 4,  # give coordinates only; "right" plus x/y would clash
       cex = 1.5,
       title = "Cluster",
       legend = 1:nCls,
       fill = MyPal3[1:nCls])
# Get the SOM cell number associated with each of the 150 data points
SOM.clss <- as.data.frame(som1$unit.classif)
names(SOM.clss) <- "Cell.Nmbr"
unique(SOM.clss)
# Get the k-means 3-class classification of the 36 SOM cells
kMns.clst <- as.data.frame(som1.km$cluster)
names(kMns.clst) <- "Clstr"
# Add a SOM cell reference for a lookup table
kMns.clst$Cell.Nmbr <- 1:nrow(kMns.clst)
# Use the lookup table to map the cluster number to each datum
dt.clst <- merge(SOM.clss, kMns.clst, by = "Cell.Nmbr")
# Add the cluster column to the original data
iris.clst <- cbind(iris, dt.clst)
# Compute means as a reality check
aggregate(iris.clst[, 1:4],
          by = list(iris.clst$Clstr),
          FUN = mean)
The result seems to make sense, but I'm not sure the approach is correct. Is it, and if so, is there a more efficient way of doing this back-coding exercise?
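For what it's worth, the lookup table can be collapsed into a single indexing step (a sketch of my own, not from the original post): som1$unit.classif already records each observation's winning SOM cell, so indexing the per-cell k-means assignment by it back-codes the data directly. Unlike merge(), which sorts its output by Cell.Nmbr and can therefore misalign the cbind() above, this preserves the original row order.
## One k-means label per SOM cell, indexed by each observation's cell
iris.clst2 <- cbind(iris, Clstr = som1.km$cluster[som1$unit.classif])
## Same reality check as above
aggregate(iris.clst2[, 1:4], by = list(iris.clst2$Clstr), FUN = mean)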

Edit betadisper permutest plot

I have used the script below to generate a betadisper plot comparing 2 communities.
In my "df", the first column holds the station names (x13).
I have 2 questions:
There is a point hidden behind the "ABC" label, so how do I make the label transparent? Preferably also adding a different colour for each community?
How do I add the station names next to each point so I can visually compare which stations are most similar?
Script:
library(vegan)
df <- read.csv("NMDS matrix_csv_NEW.csv", header = TRUE, row.names = 1, sep = ",")
df
Label<-rownames(df)
Label
dis <- vegdist(df)
groups <- factor(c(rep(1,8), rep(2,5)), labels = c("ABC","DEF"))
groups
mod <- betadisper(dis, groups)
mod
anova(mod)
permutest(mod, pairwise = TRUE)
plot(mod)
plot(mod, ellipse = TRUE, hull = FALSE, main= "MultiVariate Permutation")
To answer 2), here's how to plot the station names on top of the points:
text(mod$vectors[, 1:2], labels = Label)
Here is a possible solution to your problem.
Download the myplotbetadisp.r file from this link and place it in the working directory (warning: do not save the file as myplotbetadisp.r.txt!).
Some additional options are available in the myplotbetadisper function:
fillrect, filling colour of the box where centroid labels are printed;
coltextrect, vector of colours for the centroid labels;
alphaPoints, alpha transparency for the centroid points;
labPoints, vector of labels plotted close to the points;
poslabPoints, position specifier for the text in labPoints.
library(vegan)
# A dummy data generation process
set.seed(1)
n <- 100
df <- matrix(runif(13*n),nrow=13)
# Compute dissimilarity indices
dis <- vegdist(df)
groups <- factor(c(rep(1,8), rep(2,5)), labels = c("ABC","DEF"))
# Analysis of multivariate homogeneity of group dispersions
mod <- betadisper(dis, groups)
source("myplotbetadisp.r")
labPts <- LETTERS[1:13]
col.fill.rect <- addAlpha(col2rgb("gray65"), alpha=0.5)
col.text.rect <- apply(col2rgb(c("blue","darkgreen")), 2, addAlpha, alpha=0.5)
transp.centroids <- 0.7
myplotbetadisper(mod, ellipse = TRUE, hull = FALSE,
                 fillrect = col.fill.rect, coltextrect = col.text.rect,
                 alphaPoints = transp.centroids, labPoints = labPts,
                 main = "MultiVariate Permutation")
Hope this helps.

Utilise Surv object in ggplot or lattice

Does anyone know how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to produce trellis- or facet-like survival graphs.
So in the end I played around and found a sort of solution for a Kaplan-Meier plot. I apologise for the messy code turning the list elements into a data frame, but I couldn't figure out another way.
Note: it only works with two levels of strata. If anyone knows how I can use x <- length(stratum) to generalise this, please let me know (in Stata I could append to a macro; I'm unsure how this works in R).
ggkm <- function(time, event, stratum) {
  # requires survival (Surv, survfit) and ggplot2
  m2s <- Surv(time, as.numeric(event))
  fit <- survfit(m2s ~ stratum)
  # build a data frame from the survfit list elements
  f <- data.frame(
    time   = fit$time,
    surv   = fit$surv,
    strata = c(rep(names(fit$strata)[1], fit$strata[1]),
               rep(names(fit$strata)[2], fit$strata[2])),
    upper  = fit$upper,
    lower  = fit$lower
  )
  # note: each "+" must end a line, not begin one, or the chain breaks
  r <- ggplot(f, aes(x = time, y = surv, fill = strata, group = strata)) +
    geom_line() +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3)
  return(r)
}
I have been using the following code in lattice. The first function draws the KM curve for one group and would typically be used as the panel.groups function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x, y, type, mark.time = TRUE, ...) {
  na.part <- is.na(x) | is.na(y)
  x <- x[!na.part]
  y <- y[!na.part]
  if (length(x) == 0) return()
  fit <- survfit(Surv(x, y) ~ 1)
  if (mark.time) {
    cens <- which(fit$time %in% x[y == 0])
    panel.xyplot(fit$time[cens], fit$surv[cens], type = "p", ...)
  }
  panel.xyplot(c(0, fit$time), c(1, fit$surv), type = "s", ...)
}
logrank.panel <- function(x, y, subscripts, groups, ...) {
  lr <- survdiff(Surv(x, y) ~ groups[subscripts])
  otmp <- lr$obs
  etmp <- lr$exp
  df <- sum(etmp > 0) - 1
  p <- 1 - pchisq(lr$chisq, df)
  p.text <- paste("p=", signif(p, 2))
  grid.text(p.text, 0.95, 0.05, just = c("right", "bottom"))
  panel.superpose(x = x, y = y, subscripts = subscripts, groups = groups, ...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the irritating thing about survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for (i in seq_along(x$strata)) {
  groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
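A side note (my addition): ggplot2's geom_step() draws the flat-then-drop steps directly, which sidesteps the sloping-line problem described above. A minimal sketch, assuming f holds the time, surv and strata columns built as in the answer:
library(ggplot2)
## Step curve: flat between event times, dropping straight down at each one
ggplot(f, aes(x = time, y = surv, colour = strata)) +
  geom_step()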
For what it's worth, this is how I ended up doing it - but it isn't really a KM plot either, because I'm not calculating the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
  # Must be able to coerce surv.time and group to vectors
  if (!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {
    stop("surv.time and group must be coercible to vectors.")
  }
  # Make sure that surv.time is numeric
  if (!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
  # Group can be just about anything, but must be the same length as surv.time
  if (length(surv.time) != length(group)) {
    stop("The vectors passed to the surv.time and group arguments must be of equal length.")
  }
  # What is the maximum number of ticks recorded?
  max.time <- max(surv.time)
  # What is the number of groups in the data?
  n.groups <- length(unique(group))
  # Use the number of ticks (plus one for t = 0) times the number of groups to
  # create an empty skeleton of the results.
  curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
  # Label each block of ticks with one group name (rep with each=, not plain
  # recycling: recycling would interleave the groups against the ticks and can
  # duplicate some tick/group pairs while skipping others).
  curves$group <- rep(unique(group), each = max.time + 1)
  # For each row, calculate the proportion of group[i] surviving past tick[i]
  for (i in seq_len(nrow(curves))) {
    curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
      length(surv.time[group %in% curves$group[i]])
  }
  # Return the results, ordered by group and tick - easier for humans to read.
  return(curves[order(curves$group, curves$tick), ])
}
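A quick usage sketch with made-up data (my addition; the toy values are purely hypothetical):
## Toy example: integer survival "ticks" for two groups
set.seed(1)
toy.time <- rpois(40, lambda = 5)
toy.group <- rep(c("a", "b"), each = 20)
curves <- survcurv(toy.time, toy.group)
library(ggplot2)
ggplot(curves, aes(x = tick, y = surv.prop, colour = group)) +
  geom_step()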
