I have a two-dimensional Gaussian distribution, and I am trying to identify outliers. This is not in the sense of outlier removal, but rather to identify the samples that are most dissimilar to the bulk.
http://imgur.com/hlOqjig
Do you have a suggestion for how this is best done for this data? I have tried fitting a normal distribution to each dimension, calculating p-values for all data points, and then identifying the outliers as the data points with the lowest p-values. However, I get the following result:
http://imgur.com/a/w6SAz
This is the code for calculating p-values:
library(fitdistrplus)

norm_pvalue <- function(input_dist, input_values) {
  # Fitting normal distribution
  fit <- fitdist(input_dist, "norm")
  # Calculating values for each point
  # (note: dnorm() returns the density at x, not a tail-area p-value)
  p_values <- unlist(lapply(input_values, function(x) {
    dnorm(x = x, mean = fit$estimate[["mean"]], sd = fit$estimate[["sd"]])
  }))
  return(p_values)
}
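For context, this is roughly how I apply the function per dimension (a sketch; pc_df as the data frame holding the two score columns, and the product combination of the two marginal values, are just illustrative assumptions):
# hypothetical usage: score each dimension against its own fitted normal
pv1 <- norm_pvalue(pc_df$PC1, pc_df$PC1)
pv2 <- norm_pvalue(pc_df$PC2, pc_df$PC2)
# treat the points with the lowest combined values as outliers
outlier_idx <- order(pv1 * pv2)[1:10]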
I would like the solution to be generalisable.
Without the data, it is hard to respond in any detail. However, you might want to check out the latest version of the package assertr, noted here: http://www.onthelambda.com/2017/03/20/data-validation-with-the-assertr-package/.
I really like its workflow, which is very generalisable.
For example, if you're looking to inspect data from a column (col) within a dataframe (df), you'd use something like:
library(assertr)
library(magrittr)
df %>% insist(within_n_sds(2), col)
This final function would then notify you of all outliers (i.e. those points that are more than two standard deviations from the mean). The package also includes plenty of different measures for assessing outliers.
In your case, the column in question would probably be based on the residuals from the best-fit line of PC1 and PC2:
PCA.lm <- lm(PC2 ~ PC1, data = df)
PCA.res <- resid(PCA.lm)
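For instance, tying the two together (a sketch, assuming df contains PC1 and PC2 columns):
df$PCA.res <- resid(lm(PC2 ~ PC1, data = df))
df %>% insist(within_n_sds(2), PCA.res)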
I hope that helps you.
I ended up using stat_ellipse() from ggplot2 to identify the outliers, with a confidence level of 0.999.
The function below takes a ggplot object and the number of the layer in which the ellipse is drawn, and extracts the points that fall outside that ellipse.
library(sp)  # for point.in.polygon()

# Function for identifying points outside an ellipse
outside_ellipse <- function(plot, ellipsoid_layer_number) {
  # Extracting the plot components
  build <- ggplot_build(plot)$data
  points <- build[[1]]
  ell <- build[[ellipsoid_layer_number]]

  # Finding which points are inside the ellipse, and adding this to the data
  df <- data.frame(points[1:2],
                   in.ell = as.logical(point.in.polygon(points$x, points$y,
                                                        ell$x, ell$y)))

  # Plotting the result (print() is needed inside a function)
  print(ggplot(df, aes(x, y)) +
          geom_point(aes(col = in.ell)) +
          stat_ellipse())

  # Returning the indices of the outliers
  return(which(df$in.ell == FALSE))
}
Here I plot my data with the confidence ellipse, extract the points that fall outside it, and add that information to the data frame.
# Saving plot with confidence ellipse
plotData <- ggplot(pc_df, aes(PC1, PC2)) +
  geom_point() +
  stat_ellipse(level = 0.999)

# Identifying points outside the ellipse (here, layer 2 holds the ellipse)
outside <- outside_ellipse(plotData, 2)
pc_df$in_ellipsoid <- rep(TRUE, dim(pc_df)[1])
pc_df$in_ellipsoid[outside] <- FALSE
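The flag can then drive the plot aesthetics directly, e.g.:
ggplot(pc_df, aes(PC1, PC2, color = in_ellipsoid)) +
  geom_point() +
  stat_ellipse(level = 0.999)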
I am trying to export the underlying data from a LOESS plot (the blue line).
I found this post on the subject and was able to get it to export like the post says:
Can I export the result from a loess regression out of R?
However, as the last comment from the poster in that post says, I am not getting the results for my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
# loess object
CL111_loess <- loess(dur_cleaned ~ TS_LightOn, data = CL111)

# get SE
CL111_predict <- predict(CL111_loess, se = TRUE)

CL111_output <- data.frame("fitted" = CL111_predict$fit,
                           "SE" = CL111_predict$se.fit)
write.csv(CL111_output, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
# individual plot
ggplot(data = CL111) +
  geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned),
              method = "lm", se = FALSE, colour = "Green") +
  labs(x = "TS Light On (Seconds)", y = "TS Response Time (Seconds)",
       title = "Layout 1, Condition AO, INS High") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned),
              method = "loess", se = TRUE) +
  xlim(0, 400) + ylim(0, 1.0)

# find coefficients for the best-fit line
lm(dur_cleaned ~ TS_LightOn, data = CL111_LM)
You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then examine the data object (a list of the data for the different layers) to figure out which layer you need (in this case, I looked for the data layer that included a colour column matching the smooth line ...)
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird (or that I don't understand) about the way ggplot2 sets up the loess fit. You do have to call predict() with the right newdata (e.g. a data frame with a single column TS_LightOn ranging from 0 to 400); otherwise you get predictions at the points in your data set, which may not be properly spaced or in the right order. But that doesn't resolve the difference for me.
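For what it's worth, a sketch of that predict() call (assuming the fitted x range covers 0 to 400; outside the fitted range, predict.loess returns NA):
newd <- data.frame(TS_LightOn = seq(0, 400, length.out = 80))
CL111_predict <- predict(CL111_loess, newdata = newd, se = TRUE)
out <- data.frame(TS_LightOn = newd$TS_LightOn,
                  fitted = CL111_predict$fit,
                  SE = CL111_predict$se.fit)
write.csv(out, "CL111_output.csv", row.names = FALSE)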
To complement Ben Bolker's answer, I have written a small function that may be useful for retrieving the internal dataset created by ggplot for a geom_smooth call. It takes the resulting ggplot as input and returns the smoothed data. The problem it solves is that, as Ben observed, ggplot internally creates a smoothed fit with predicted data on arbitrary intervals, different from the interval used for the input data. This function gets you back the ggplot fit data on an interval of integer, equally spaced values. It does so by running a loess fit on the already smoothed data with a small span (0.1), which is adjusted upward on the fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method other than 'loess', or with method = NULL, and you cannot easily build a model that replicates what geom_smooth is doing internally.
The function separates different series on the same plot as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that this function uses an interval based on integer values of x. If you need an interval based on equally spaced but non-integer values of x, pass your x interval of choice in the xInterval parameter, or tweak this line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
get_geom_smooth_dataFromPlot <- function(a_ggplot, xInterval = NULL) {
  # internal ggplot values read into ggTable
  ggTable <- ggplot_build(a_ggplot)$data[[1]]

  # facet panels
  panels <- as.numeric(names(table(ggTable$PANEL)))
  nPanel <- length(panels)
  onePanel <- (nPanel == 1)

  # number of series in each plot
  groups <- as.numeric(names(table(ggTable$group)))
  nGroup <- length(groups)
  oneGroup <- (nGroup == 1)

  out <- data.frame()

  # are there 'ymin' and 'ymax' values?
  SE_data <- "ymin" %in% colnames(ggTable)

  for (pan in 1:nPanel) {
    for (grp in 1:nGroup) {
      sub <- subset(ggTable, (PANEL == panels[pan]) & (group == groups[grp]))

      # no group series for this facet panel?
      if (dim(sub)[1] == 0) next

      if (is.null(xInterval)) {
        outOne <- data.frame(x = c(min(trunc(sub$x)):max(trunc(sub$x))))
      } else {
        outOne <- data.frame(x = xInterval)
      }
      nObs <- dim(outOne)[1]

      # hack to avoid problems with a small range for the x interval:
      # when there are more than 90 x values we use a span of 0.1,
      # but we adjust on-the-fly up to a span of 0.5
      # for 10 values of the x interval
      cSpan <- max(0.1, 0.5 * 10 / (nObs - (nObs - 10) / 2))

      if (!onePanel) outOne$panel <- pan
      if (!oneGroup) outOne$group <- grp

      mod <- loess(y ~ x, data = sub, span = cSpan)
      outOne$y <- predict(mod, outOne$x, se = FALSE)

      if (SE_data) {
        mod <- loess(ymin ~ x, data = sub, span = cSpan)
        outOne$ymin <- predict(mod, outOne$x, se = FALSE)
        mod <- loess(ymax ~ x, data = sub, span = cSpan)
        outOne$ymax <- predict(mod, outOne$x, se = FALSE)
      }
      out <- rbind(out, outOne)
    }
  }
  return(out)
}
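A usage sketch (gg1 stands for whatever ggplot object holds your geom_smooth layer, and the 0.5 spacing is just an illustrative choice):
smoothed <- get_geom_smooth_dataFromPlot(gg1, xInterval = seq(0, 400, by = 0.5))
head(smoothed)
write.csv(smoothed, "smoothed_fit.csv", row.names = FALSE)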
I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused about how to organize the data so that multiple ellipses are plotted and guides (colour, fill, linetype, etc.) are applied per group.
In the traditional R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data in such a way as to work in the long format preferred by ggplot while preserving the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
test.data <- read.csv("ellipse_test_data.csv")

ggplot(test.data) +
  geom_point(aes(x, y, color = group)) +
  theme_classic()
I am using a package called SIBER, which will fit Bayesian ellipses to the data for comparing groups by ellipse area, etc. The output of the following creates a list with one element per group of data; each element is an n x 6 matrix (n = number of posterior draws) for the fitted ellipse, where the first four columns are the covariance matrix Sigma in vector format and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random number of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert the n x 6 entries to a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do it all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
Here is some code to do this on the bundled demo dataset from SIBER.
In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
xlab = c("Community | Group"),
ylab = expression("Standard Ellipse Area " ('\u2030' ^2) ),
bty = "L",
las = 1,
main = "SIBER ellipses on each group"
)
Now we want to create some plots of some sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this example we simply take the first 10 posterior draws, assuming them to be independent of one another, but you could take a random sample if you prefer (see the sketch after this code block).
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)) {
  # a dummy variable to build in the loop
  ell <- NULL
  post.id <- NULL

  for (j in 1:n.posts) {
    # covariance matrix
    Sigma <- matrix(ellipses.posterior[[i]][j, 1:4], 2, 2)
    # mean
    mu <- ellipses.posterior[[i]][j, 5:6]
    # ellipse points
    out <- ellipse::ellipse(Sigma, centre = mu, level = p.ell)
    ell <- rbind(ell, out)
    post.id <- c(post.id, rep(j, nrow(out)))
  }

  ell <- as.data.frame(ell)
  ell$rep <- post.id
  all_ellipses[[i]] <- ell
}
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and convert to a matrix, NB byrow = T
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
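As mentioned above, if you would rather take a random sample of posterior draws than the first n.posts, a minimal tweak of the inner loop (a sketch, not run here) would be:
# sample draw indices at random instead of taking the first n.posts
draw.ids <- sample(nrow(ellipses.posterior[[i]]), n.posts)
for (j in draw.ids) {
  Sigma <- matrix(ellipses.posterior[[i]][j, 1:4], 2, 2)
  mu <- ellipses.posterior[[i]][j, 5:6]
  out <- ellipse::ellipse(Sigma, centre = mu, level = p.ell)
  ell <- rbind(ell, out)
  post.id <- c(post.id, rep(j, nrow(out)))
}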
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
  geom_point(aes(color = factor(group):factor(community)), size = 2) +
  ylab(expression(paste(delta^{15}, "N (\u2030)"))) +
  xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
  theme(text = element_text(size = 15))
print(first.plot)
Now we can try to add the posterior ellipses on top and facet by group
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# ellipse_df columns were renamed above to match the plot's aesthetics
third.plot <- second.plot +
  geom_polygon(data = ellipse_df,
               mapping = aes(iso1, iso2,
                             group = rep,
                             color = factor(group):factor(community),
                             fill = NULL),
               fill = NA,
               alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group
I've done the rounds here and via google without a solution, so please help if you can.
I'm looking to create something like this pain-sensitivity heat map using ggplot2.
I can create something kind of similar using geom_tile, but without the smoothing between data points ... the only solution I have found requires a lot of code and data interpolation. Not very elegant, methinks.
So I'm thinking, I could coerce the density2d plots to my purposes instead by having the plot use fixed values rather than a calculated data-point density -- much in the same way that stat='identity' can be used in histograms to make them represent data values, rather than data counts.
So a minimal working example:
df <- expand.grid(letters[1:5], LETTERS[1:5])
df$value <- sample(1:4, 25, replace=TRUE)
# A not so pretty, non-smooth tile plot
ggplot(df, aes(x=Var1, y=Var2, fill=value)) + geom_tile()
# A potentially beautiful density2d plot, except it fails :-(
ggplot(df, aes(x=Var1, y=Var2)) + geom_density2d(aes(color=..value..))
This took me a little while, but here is a solution for future reference. It uses idw() from the gstat package and spsample() from the sp package.
I've written a function which takes a dataframe, number of blocks (tiles) and a low and upper anchor for the colour scale.
The function creates a polygon (a simple quadrant of 5x5) and from that creates a grid of that shape.
In my data, the location variables are ordered factors -- I therefore unclass them into numbers (1 to 5, corresponding to the polygon grid) and convert them to coordinates, thus converting tmpDF from a data frame to a spatial data frame. Note: there are no overlapping/duplicate locations, i.e. 25 observations corresponding to the 5x5 grid.
The idw function fills in the polygon-grid (newdata) with inverse-distance weighted values ... in other words, it interpolates my data to the full polygon grid of a given number of tiles ('blocks').
Finally I create a ggplot based on a color gradient from the colorRamps package
library(sp)         # Polygon(), spsample(), coordinates()
library(gstat)      # idw()
library(dplyr)      # mutate(), filter(), %>%
library(ggplot2)
library(colorRamps) # matlab.like2()

painMapLumbar <- function(tmpDF, blocks = 2500,
                          lowLimit = min(tmpDF$value),
                          highLimit = max(tmpDF$value)) {
  # Create polygon to represent the lower back (lumbar)
  poly <- Polygon(matrix(c(0.5, 0.5,
                           0.5, 5.5,
                           5.5, 5.5,
                           5.5, 0.5,
                           0.5, 0.5), ncol = 2, byrow = TRUE))

  # Create a grid of data points from the polygon
  polyGrid <- spsample(poly, n = blocks, type = "regular")

  # Filter out the data for the figure we want
  tmpDF <- tmpDF %>% mutate(x = unclass(x)) %>% mutate(y = unclass(y))
  tmpDF <- tmpDF %>% filter(y < 6)  # Lumbar region only
  coordinates(tmpDF) <- ~x+y

  # Interpolate the data as inverse-distance weighted values
  invDistanceWeigthed <- as.data.frame(
    idw(formula = value ~ 1, locations = tmpDF, newdata = polyGrid))

  p <- ggplot(invDistanceWeigthed, aes(x = x1, y = x2, fill = var1.pred)) +
    geom_tile() +
    scale_fill_gradientn(colours = matlab.like2(100),
                         limits = c(lowLimit, highLimit))
  return(p)
}
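A usage sketch (painDF is a hypothetical data frame with ordered-factor location columns x and y and a numeric value column, as described above):
p <- painMapLumbar(painDF, blocks = 2500)
print(p)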
I hope this is useful to someone ... thanks for the replies above ... they helped me move on.
I'd like to plot a weighted CDF using ggplot. Some old non-SO discussions (e.g. this from 2012) suggest this is not possible, but thought I'd reraise.
For example, consider this data:
df <- data.frame(x=sort(runif(100)), w=1:100)
I can show an unweighted CDF with
ggplot(df, aes(x)) + stat_ecdf()
How would I weight this by w? For this example, I'd expect an x^2-looking function, since the larger numbers have higher weight.
There is a mistake in your answer.
This is the right code to compute the weighted ECDF:
df <- df[order(df$x), ] # Won't change anything since it was created sorted
df$cum.pct <- with(df, cumsum(w) / sum(w))
ggplot(df, aes(x, cum.pct)) + geom_line()
The ECDF is a function F(a) equal to the sum of the weights (probabilities) of the observations with x <= a, divided by the total sum of weights; in other words, F(a) = sum(w[x <= a]) / sum(w).
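As a quick sanity check on the example data (with w = 1:100 on sorted uniforms, the weighted ECDF should indeed look roughly like x^2):
set.seed(1)
df <- data.frame(x = sort(runif(100)), w = 1:100)
df$cum.pct <- cumsum(df$w) / sum(df$w)
plot(cum.pct ~ x, data = df, type = "s")
curve(x^2, add = TRUE, col = "red")  # expected shape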
But here is a more satisfying option that simply modifies the original code of the ggplot2 stat_ecdf:
https://github.com/NicolasWoloszko/stat_ecdf_weighted
I have a list of linear and non-linear models derived from different data sets measuring the same two variables x and y that I would like to plot on the same plot using stat_smooth. This is to be able to easily compare the shape of the relationship between x and y across datasets.
I'm trying to figure out the most effective way to do this. Right now I am considering creating an empty ggplot object and then using some kind of loop or lapply to add sequentially to that object, but this is proving more difficult than I thought. Of course it would be easiest to simply supply the models to ggplot but as far as I know, this is not possible. Any thoughts?
Here is a simple example data set to play with using just two models, one linear and one exponential:
df1 <- data.frame(x = rnorm(10), y = rnorm(10))
df2 <- data.frame(x = rnorm(15), y = rnorm(15))
df.list <- list(lm(y ~ x, df1),
                nls(y ~ exp(a + b * x), start = list(a = 1, b = 1), data = df2))
And two separate example plots:
ggplot(df1, aes(x, y)) + stat_smooth(method = lm, se = FALSE)
ggplot(df2, aes(x, y)) + stat_smooth(method = nls, formula = y ~ exp(a + b * x),
                                     start = list(a = 1, b = 1), se = FALSE)
EDIT: Note that the OP changed the question after this answer was posted
Combine the data into a single data frame, with a new column indicating the model, then use ggplot to distinguish between the models:
df1 <- data.frame(x = rnorm(10), y = rnorm(10))
df2 <- data.frame(x = rnorm(10), y = rnorm(10))
df1$model <- "A"
df2$model <- "B"
dfc <- rbind(df1, df2)

library(ggplot2)
ggplot(dfc, aes(x, y, group = model)) + geom_point() + stat_smooth(aes(col = model))
This produces:
I think the answer here is to get a common range of X and Y you want to run this over, and go from there. You can pull out a curve from each model using predict, and add on layers to a ggplot using l_ply.
library(ggplot2)
library(plyr)  # for l_ply() and ldply()

df1 <- data.frame(x = rnorm(10), y = rnorm(10))
df2 <- data.frame(x = rnorm(15), y = rnorm(15))
df.list <- list(lm(y ~ x, df1),
                nls(y ~ exp(a + b * x), start = list(a = 1, b = 1), data = df2))

a <- ggplot()

# get the range of x you want to look at
x <- seq(min(c(df1$x, df2$x)), max(c(df1$x, df2$x)), .01)

# use l_ply to keep adding layers
l_ply(df.list, function(amod) {
  # a data frame for predictors and response
  ndf <- data.frame(x = x)

  # get the response using predict - you can even get a CI here
  ndf$y <- predict(amod, ndf)

  # now add this new layer to the plot
  a <<- a + geom_line(data = ndf, mapping = aes(x = x, y = y))
})

a
OR, if you want to have a nice color key with model number or something:
names(df.list) <- 1:length(df.list)

modFits <- ldply(df.list, function(amod) {
  ndf <- data.frame(x = x)
  # get the response using predict - you can even get a CI here
  ndf$y <- predict(amod, ndf)
  ndf
})

qplot(x, y, geom = "line", colour = .id, data = modFits)