Neighborhood calculations for outlier detection - r

I'm using the R programming language, and I'm trying to understand the details of the following function used for outlier detection: https://rdrr.io/cran/dbscan/src/R/LOF.R
This function (from the "dbscan" library) uses the Local Outlier Factor (LOF) algorithm to identify outliers: https://en.wikipedia.org/wiki/Local_outlier_factor.
The LOF algorithm is an unsupervised, distance-based algorithm that defines outliers in a dataset relative to the "reachability and neighborhood" of an observation. In general, observations that are not "very reachable" with respect to other observations in their neighborhood are considered outliers. Based on properties specified by the user (e.g. the neighborhood size, denoted by "k", could be 3), the algorithm assigns a LOF "score" to each point in the dataset. The larger the LOF score for a given observation, the more of an outlier that observation is considered to be.
Now, I am trying to better understand some of the calculations taking place in the dbscan::lof() function.
1) The basic LOF algorithm can be run on some artificially created data like this:
```
#load the library
library(dbscan)
par(mfrow = c(1,2))
#generate data
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
### calculate LOF score
lof <- lof(x, k=3)
### distribution of outlier factors
summary(lof)
hist(lof, breaks=10)
### point size is proportional to LOF
plot(x, pch = ".", main = "LOF (k=3)")
points(x, cex = (lof-1)*3, pch = 1, col="red")
```
My question is: do larger values of "k" result in fewer outliers being identified (the histogram is left-skewed), but those outliers that are identified being more "extreme" (i.e. bigger LOF scores)?
I observed this general pattern, but I am not sure if this trend is reflected in the LOF algorithm's code. For example:
#plot LOF results for different values of k
par(mfrow = c(2,2))
### calculate LOF score
lof <- lof(x, k=3)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 3",breaks=10)
### calculate LOF score
lof <- lof(x, k=10)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 10", breaks=10)
### calculate LOF score
lof <- lof(x, k=20)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 20", breaks=10)
### calculate LOF score
lof <- lof(x, k=40)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 40", breaks=10)
In the above plots, you can see that as the value of "k" increases, fewer outliers are identified. Is this correct?
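One way I can check this empirically, rather than just eyeballing the histograms, is to count how many points exceed a fixed LOF cutoff for each value of k. The cutoff of 1.5 below is an arbitrary choice on my part, not something prescribed by the algorithm:
```
# count points whose LOF score exceeds an (arbitrary) cutoff, for several k values
sapply(c(3, 10, 20, 40), function(k) sum(dbscan::lof(x, k = k) > 1.5))
```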
2) Is there an "optimal" way to select a value of "k" for the LOF algorithm? Looking at how the LOF algorithm works, it does not seem to me that there is an "optimal" way to select a value of "k". It seems that you must fall back on the logic described in 1):
Bigger values of "k" result in fewer outliers being identified, but
the outliers identified are more "extreme"
Smaller values of "k" result in more outliers being identified, but
the outliers identified are less "extreme"
Is this correct?

Regarding 2), I found this Cross Validated post: https://stats.stackexchange.com/questions/138675/choosing-a-k-value-for-local-outlier-factor-lof-detection-analysis which says:
" The authors of the paper recommend choosing a minimum k and a maximum k, and for each point, taking the maximum LOF value over each k in that range. They offer several guidelines for choosing the bounds."
This is my attempt at implementing the above in R code:
library(dbscan)
#generate data
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
x = data.frame(x)
### calculate LOF score for a range of different "k" values:
lof_10 <- lof(x, k=10)
lof_15 <- lof(x, k=15)
lof_20 <- lof(x, k=20)
#append these lof calculations to the original data set:
x$lof_10 = lof_10
x$lof_15 = lof_15
x$lof_20 = lof_20
#as the previous stackoverflow post suggests: for each row, choose the highest LOF value
x$max_lof = pmax(x$lof_10, x$lof_15, x$lof_20)
#view results:
head(x)
x y lof_10 lof_15 lof_20 max_lof
1 2.443382 4.2611753 0.9803894 0.9866732 0.9841705 0.9866732
2 2.397454 -0.3732838 1.0527592 1.4638348 1.6008284 1.6008284
3 2.617348 3.0435179 0.9952212 0.9945580 0.9715819 0.9952212
4 3.731156 4.1668976 1.0339001 1.0802826 1.0921033 1.0921033
5 1.103123 1.6642337 1.1260092 1.0773444 1.0650159 1.1260092
6 2.735938 4.3737450 0.9939896 0.9573139 0.9700123 0.9939896
Therefore, the LOF score for each row is the value of the "max_lof" column.
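A more compact sketch of the same idea is to take the row-wise maximum over a whole range of k values (the range 10:20 below is just an arbitrary choice on my part, and I pass only the coordinate columns to lof()):
```
# compute LOF for every k in a range and take the row-wise maximum (sketch)
k_range <- 10:20
lof_mat <- sapply(k_range, function(k) dbscan::lof(as.matrix(x[, c("x", "y")]), k = k))
x$max_lof_range <- apply(lof_mat, 1, max)
head(x$max_lof_range)
```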
Can someone please tell me if I interpreted the previous post correctly? And have I written the R code correctly?
Thanks
Note: I am still not sure about 1) from my initial question,
i.e. Do larger values of "k" result in fewer outliers being identified (histogram is left-skewed), but those outliers that are identified are more "extreme"?

Related

Creating a histogram from iterations of a binomial distribution in R

Here are the instructions:
Create 10,000 iterations (N = 10,000) of rbinom(50, 1, 0.5) with n = 50 and your guess of p0 = 0.50 (hint: you will need to construct a for loop). Plot a histogram of the results of the sample. Then plot your pstar on the histogram. If pstar is not in the extreme region of the histogram, you would assume your guess is correct and vice versa. Finally, calculate the probability that p0 < pstar (this is a p value).
I know how to create the for loop and use the rbinom function, but I am unsure how to transfer this information into a histogram plot, and how to add a custom point (my guess value) to it.
I'm not doing your homework for you, but this should get you started. You don't say what pstar is supposed to be, so I am assuming you are interested in the (distribution of the) maximum likelihood estimates for p.
You create 10,000 N=50 binomial samples (there is no need for a for loop):
sample <- lapply(seq(10^4), function(x) rbinom(50, 1, 0.5))
The ML estimates for p are then
phat <- sapply(sample, function(x) sum(x == 1) / length(x))
Inspect the distribution
require(ggplot2)
ggplot(data.frame(phat = phat), aes(phat)) + geom_histogram(bins = 30)
and calculate the probability that p0 < phat.
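For example, with p0 = 0.5, that probability is just the proportion of simulated estimates that exceed p0 (a one-line sketch):
```
# estimated P(p0 < phat): the proportion of simulated sample proportions above p0
p0 <- 0.5
mean(phat > p0)
```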
Edit 1
If you insist, you can also use a for loop to generate your samples.
sample <- list();
for (i in 1:10^4) {
sample[[i]] <- rbinom(50, 1, 0.5);
}
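To plot your pstar on the histogram (the custom point part of the question), a vertical reference line can be added; the value 0.6 below is just a placeholder for whatever your estimate actually is:
```
# mark a hypothetical pstar on the histogram with a vertical reference line
pstar <- 0.6  # placeholder; use your actual estimate here
ggplot(data.frame(phat = phat), aes(phat)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = pstar, colour = "red")
```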

Plot a ROC curve in R without using any packages

Hi, I am pretty new to programming in R and I am having trouble plotting a ROC curve without using any packages.
I generated my data using:
d=rpearsonIII(100,0.5,360,20)
nd=rnorm(100,450,25)
I need a vector with values <400 for d and >400 for nd, so I did:
spec = (cumsum(nd[nd>400])/sum(nd))*100
sens = (cumsum(d[d<400])/sum(nd))*100
and then I plotted it like this:
plot(1-spec,sens)
but the plot was nothing like I expected it to be.
Edit:
Thanks to the advice given my code looks like this now:
sc2 = c(rnorm(50,450,25),rpearsonIII(50,0.5,360,20))
scF = sc2 < 395
thresholds <- sort(sc2)
pos <- sum(scF);pos
neg <- sum(!scF);neg
tn <- cumsum(!scF);tn
spec <- tn/neg;spec
tp <- pos - cumsum(scF);tp
sens <- tp/pos;sens
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
The plotted ROC curve looks like this: [ROC curve plot]
My problem now is that if I change the order of the generated data (rnorm and rpearsonIII), the curve is reversed.
I don't know what rpearsonIII is, so I am just going to make some sample random data with the actual classes, actuals, as well as the prediction scores, scores.
set.seed(100)
actuals <- sample(c(TRUE,FALSE), 100, replace = TRUE)
scores <- runif(100,-1,1)
The long version with explanation
If in your data the actuals are strings or factors rather than logicals, you will need to convert them to logicals using:
actuals <- actuals == "positiveClass"
Next we want to order the instances based on their scores. We can do this using:
actuals <- actuals[order(scores)]
If you want to keep track of the thresholds for the sensitivities and specificity, you can keep them aligned using:
thresholds <- sort(scores)
Now we need to get our sensitivities and specificities. Sensitivity is TP/P and specificity is TN/N. Getting the total number of positives P is easy: since our actuals are logical, we can just use sum(actuals). Similarly, we can get our negatives N using sum(!actuals).
pos <- sum(actuals)
neg <- sum(!actuals)
First, let's get our true negatives at each threshold. That is pretty easy: it is just the number of FALSE values at or below each threshold. Since our data are in order by threshold, we can calculate that (and the specificity) using:
tn <- cumsum(!actuals)
spec <- tn/neg
The number of true positives is slightly harder, because we are looking for the number of positives greater than the threshold, so cumsum alone won't work. However, since the number above the threshold is equal to the total minus the number at or below the threshold, we can get our true positives using:
tp <- pos - cumsum(actuals)
sens <- tp/pos
Now all we need to do is plot the two.
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
To get the AUC of the curve, we simply need to calculate the height of the curve (the sensitivity) multiplied by the width (the difference in 1 - specificity) at each value of actuals. We already have the sensitivity and specificity; the diff function gives us the difference between adjacent values, but we need to put a 0 at the beginning to get the width of the first column.
width <- diff(c(0, 1 - sens))
auc <- sum(spec*width)
The minimal code version
actuals <- actuals[order(scores)]
sens <- (sum(actuals) - cumsum(actuals))/sum(actuals)
spec <- cumsum(!actuals)/sum(!actuals)
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
(auc <- sum(spec*diff(c(0, 1 - sens))))
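Regarding the reversal mentioned in the edit: as far as I can tell, it happens because scF is never reordered by the sorted scores, and because in those data the positive class corresponds to low values of sc2 rather than high ones. Applying the same ordering step as above, but to the negated score, gives a stable orientation regardless of how sc2 was concatenated. A sketch reusing sc2 and the 395 cutoff from the question (rpearsonIII comes from the PearsonDS package):
```
# order the class labels by the negated score, because in these data the
# positive class (the rpearsonIII values) has LOW scores, not high ones
scF_o <- (sc2 < 395)[order(-sc2)]
sens2 <- (sum(scF_o) - cumsum(scF_o)) / sum(scF_o)
spec2 <- cumsum(!scF_o) / sum(!scF_o)
plot(1 - spec2, sens2, type = "l", col = "red",
     ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0, 0), c(1, 1))
```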

Central Limit Theorem in R

I wish to simulate the central limit theorem in order to demonstrate it, and I am not sure how to do it in R. I want to create 10,000 samples of size n (which can be a number or a parameter) from a distribution of my choosing (uniform, exponential, etc.). Then I want to graph in one plot (using the par and mfrow commands) the original distribution (a histogram), the distribution of the means of all samples, and a Q-Q plot of the means; for the 4th graph (there are four, 2x2) I am not sure yet what to plot. Can you please assist me in getting started with programming this in R? I think once I have the simulated data I should be fine. Thank you.
My initial attempt is below; it is very basic and I am not even sure it is correct.
r = 10000;
n = 20;
M = matrix(0,n,r);
Xbar = rep(0,r);
for (i in 1:r)
{
M[,i] = runif(n,0,1);
}
for (i in 1:r)
{
Xbar[i] = mean(M[,i]);
}
hist(Xbar);
The CLT states that, given i.i.d. samples from a distribution with finite mean and variance, the distribution of the sample mean (as a random variable) converges to a Gaussian as the number of samples n increases. Here, I will assume that you want to generate r sample sets containing n samples each, to create r samples of the sample mean. Some code to do that is as follows:
set.seed(123) ## set the seed for reproducibility
r <- 10000
n <- 200 ## I use 200 instead of 20 to enhance convergence to Gaussian
## this function computes the r samples of the sample mean from the
## r*n original samples
sample.means <- function(samps, r, n) {
rowMeans(matrix(samps,nrow=r,ncol=n))
}
For generating the plots, we use ggplot2 and Aaron's qqplot.data function from here. We also use gridExtra to plot multiple plots in one frame.
library(ggplot2)
library(gridExtra)
qqplot.data <- function (vec) {
# following four lines from base R's qqline()
y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
d <- data.frame(resids = vec)
ggplot(d, aes(sample = resids)) + stat_qq() + geom_abline(slope = slope, intercept = int, colour="red") + ggtitle("Q-Q plot")
}
generate.plots <- function(samps, samp.means) {
p1 <- qplot(samps, geom="histogram", bins=30, main="Sample Histogram")
p2 <- qplot(samp.means, geom="histogram", bins=30, main="Sample Mean Histogram")
p3 <- qqplot.data(samp.means)
grid.arrange(p1,p2,p3,ncol=2)
}
Then we can use these functions with the uniform distribution:
samps <- runif(r*n) ## uniform distribution [0,1]
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get:
Or, with the Poisson distribution with mean = 3:
samps <- rpois(r*n,lambda=3)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get:
Or, with the exponential distribution with rate = 1 (mean = 1):
samps <- rexp(r*n,rate=1)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get:
Note that the histograms of the sample means all look like Gaussians whose mean is very close to the mean of the original generating distribution, whether that is uniform, Poisson, or exponential, as predicted by the CLT (and their variance is roughly 1/n = 1/200 times the variance of the original distribution).
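This claim can be checked numerically. For the uniform case, for instance, the mean of the sample means should be close to 0.5 and their variance close to var(samps)/n; a quick sketch reusing r, n and the sample.means() function defined above:
```
# quick numerical check of the CLT claim for the uniform case
samps <- runif(r * n)
samp.means <- sample.means(samps, r, n)
c(mean(samp.means), mean(samps))     # both should be close to 0.5
c(var(samp.means), var(samps) / n)   # both should be close to (1/12)/200
```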
Maybe this can help you get started. I have hard-coded the normal distribution and only shown two of your suggested plots: the histogram of a randomly selected sample, and a histogram of all the sample means.
I guess my main suggestion is using a list to store the samples instead of a matrix.
r <- 10000
my.n <- 20
simulation <- list()
for (i in 1:r) {
simulation[[i]] <- rnorm(my.n)
}
sample.means <- sapply(simulation, mean)
selected.sample <- sample(r, 1)  # pick one of the r samples at random
dev.off()
par(mfrow = c(1, 2))
hist(simulation[[selected.sample]])
hist(sample.means)
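To get closer to the 2x2 layout the question asks about, a base-R normal Q-Q panel of the sample means can be added alongside the two histograms; the fourth panel below is just one arbitrary suggestion (a boxplot), reusing the objects defined above:
```
# 2x2 layout: one raw sample, the sample means, a normal Q-Q plot, and a boxplot
par(mfrow = c(2, 2))
hist(simulation[[selected.sample]], main = "One random sample")
hist(sample.means, main = "Sample means")
qqnorm(sample.means); qqline(sample.means, col = "red")
boxplot(sample.means, main = "Sample means (boxplot)")
```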

Calculating divergence between joint posterior distributions

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored in two 30,000x3 matrices.
So far I have been successful in calculating Total Variation distance between two 2-dimensional posteriors (two 30,000x2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (the column-wise entries); rather, I want to obtain an overall value by comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is an R question.
set.seed(123)
comparison.2D <- matrix(rnorm(40000*2,0,1),ncol=2)
ground.truth.2D <- matrix(rnorm(40000*2,0,2),ncol=2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D<-function(true,
comparison,
burnin,
window.size){
# Bandwidth for theta.1.
my_bw_x<-window.size
# Bandwidth for theta.2.
my_bw_y<-window.size
range_x<-range(c(true[-c(1:burnin),1],comparison[-c(1:burnin),1]))
range_y<-range(c(true[-c(1:burnin),2],comparison[-c(1:burnin),2]))
xx <- seq(range_x[1],range_x[2],by=my_bw_x)
yy <- seq(range_y[1],range_y[2],by=my_bw_y)
true.pointidxs <- matrix( c( findInterval(true[-c(1:burnin),1], xx),
findInterval(true[-c(1:burnin),2], yy) ), ncol=2)
comparison.pointidxs <- matrix( c( findInterval(comparison[-c(1:burnin),1], xx),
findInterval(comparison[-c(1:burnin),2], yy) ), ncol=2)
# Count the frequencies in the corresponding cells:
square.mat.dims <- max(length(xx), length(yy))
frequencies.true <- frequencies.comparison <- matrix(0, ncol=square.mat.dims, nrow=square.mat.dims)
for (i in 1:dim(true.pointidxs)[1]){
frequencies.true[true.pointidxs[i,1], true.pointidxs[i,2]] <- frequencies.true[true.pointidxs[i,1],
true.pointidxs[i,2]] + 1
frequencies.comparison[comparison.pointidxs[i,1], comparison.pointidxs[i,2]] <- frequencies.comparison[comparison.pointidxs[i,1],
comparison.pointidxs[i,2]] + 1
}# End for
# Normalize frequencies matrix:
frequencies.true <- frequencies.true/dim(true.pointidxs)[1]
frequencies.comparison <- frequencies.comparison/dim(comparison.pointidxs)[1]
TVD <-0.5*sum(abs(frequencies.comparison-frequencies.true))
return(TVD)
}# End function
TVD.2D <- Total.Variation.Distance.2D(true=ground.truth.2D, comparison=comparison.2D,burnin=10000,window.size=0.05)
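For reference, here is one way the same binning idea might extend to three columns, though I am not sure this is the right approach (the function name and the way cells are labelled are my own improvisation, not tested code):
```
# One possible 3-D generalization of the binning approach above (a sketch).
Total.Variation.Distance.3D <- function(true, comparison, burnin, window.size){
  true       <- true[-(1:burnin), ]
  comparison <- comparison[-(1:burnin), ]
  # bin each of the three columns on a grid shared by both samples
  bin_ids <- function(m) {
    sapply(1:3, function(j) {
      rng  <- range(c(true[, j], comparison[, j]))
      grid <- seq(rng[1], rng[2], by = window.size)
      findInterval(m[, j], grid)
    })
  }
  # collapse each draw's three bin indices into a single cell label
  key_true <- apply(bin_ids(true), 1, paste, collapse = "_")
  key_comp <- apply(bin_ids(comparison), 1, paste, collapse = "_")
  cells <- union(key_true, key_comp)
  p <- table(factor(key_true, levels = cells)) / length(key_true)
  q <- table(factor(key_comp, levels = cells)) / length(key_comp)
  0.5 * sum(abs(p - q))
}
# hypothetical usage with two 30,000 x 3 matrices of posterior draws:
# Total.Variation.Distance.3D(ground.truth.3D, comparison.3D,
#                             burnin = 10000, window.size = 0.05)
```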

R- using power.prop.test & prop.test

Disclaimer: I have a similar thread open in Cross Validated, but it hasn't gotten any answers. I've decided to ask a simpler question here instead:
How can I use power.prop.test and prop.test together to determine an adequate sample size before an experiment and determine whether or not a conclusion is statistically significant afterwards?
Actually, any and all knowledge regarding these two functions (and related functions) would be very much appreciated.
Context: I'm trying to develop a testing methodology for simple A/B tests, ranging from experiment set-up to analysis.
You can write a function to draw a power curve to determine the sample size (per group) to achieve a desired power level, given your guess of population proportions.
Below is my crude attempt. This function gives you a data frame containing sample sizes and corresponding power levels, along with an optional plot. It takes the following arguments:
## n = vector of sample sizes;
## desired_power = the power level you want to achieve (typically 0.8);
## p1 = population proportion of group 1;
## p2 = population proportion of group 2;
## plot = whether you want a power curve or not (by default yes).
DrawPowerCurve <- function(n, desired_power, p1, p2, plot=TRUE){
powers <- sapply(n, function(x) power.prop.test(x, p1=p1, p2=p2)$power)
n_power <- min(n[powers>desired_power])
print(data.frame(n, powers))
if(plot){
plot(n, powers, type="l")
segments(y0=desired_power, x0=0, x1=n_power, col="red")
segments(y0=0, y1=desired_power, x0=n_power, col="red")
text(paste("n =", n_power, " \nper group"), x=n_power, y=desired_power/2, pos=4)
title(paste("Sample Size (n) per Group to Achieve Power of", desired_power))
}
}
Say you want to determine the sample size per group to achieve desired power of 0.8, given population proportions of 0.5 and 0.6. Then the plot shows that you'll need 390 participants per group.
n <- seq(10, 1000, 5)
DrawPowerCurve(n, desired_power=0.8, p1=0.5, p2=0.6, plot=TRUE)
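For the analysis step the question also asks about, prop.test can then be applied to the counts actually observed once the experiment has run; the counts below are made up purely for illustration:
```
# hypothetical results: 200 of 400 conversions in group A, 240 of 400 in group B
result <- prop.test(x = c(200, 240), n = c(400, 400))
result$p.value  # compare against your chosen significance level (e.g. 0.05)
```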
