R ggplot: Weighted CDF

I'd like to plot a weighted CDF using ggplot. Some old non-SO discussions (e.g. this one from 2012) suggest this is not possible, but I thought I'd raise it again.
For example, consider this data:
df <- data.frame(x=sort(runif(100)), w=1:100)
I can show an unweighted CDF with
ggplot(df, aes(x)) + stat_ecdf()
How would I weight this by w? For this example, I'd expect an x^2-looking function, since the larger numbers have higher weight.

There is a mistake in your answer.
This is the right code to compute the weighted ECDF:
df <- df[order(df$x), ] # Won't change anything since it was created sorted
df$cum.pct <- with(df, cumsum(w) / sum(w))
ggplot(df, aes(x, cum.pct)) + geom_line()
The ECDF is a function F(a) equal to the sum of the weights (probabilities) of the observations with x ≤ a, divided by the total sum of weights.
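In symbols, with weight $w_i$ attached to observation $x_i$:

$$F(a) = \frac{\sum_{i \,:\, x_i \le a} w_i}{\sum_i w_i}$$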
But here is a more satisfying option that simply modifies the original code of the ggplot2 stat_ecdf:
https://github.com/NicolasWoloszko/stat_ecdf_weighted
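If you'd rather not pull in that GitHub code, here is a minimal self-contained sketch of the same idea (the helper name weighted_ecdf is my own, not a ggplot2 function); geom_step gives the characteristic staircase shape of an ECDF:
library(ggplot2)

# Compute the weighted ECDF by hand and draw it as a step function
weighted_ecdf <- function(x, w) {
  ord <- order(x)
  data.frame(x = x[ord], y = cumsum(w[ord]) / sum(w))
}

df <- data.frame(x = sort(runif(100)), w = 1:100)
ggplot(weighted_ecdf(df$x, df$w), aes(x, y)) + geom_step()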

Related

R: Identifying outliers in a 2D Gaussian distribution

I have a two dimensional Gaussian distribution, and I am trying to identify outliers. This is not in the sense of outlier removals, but rather to identify samples that are the most dissimilar to the bulk.
(Plot of the data: http://imgur.com/hlOqjig)
Do you have a suggestion how this is best done for this data? I have tried to fit a normal distribution on both dimensions and to calculate p-values for all data points, and then to identify the outliers as the data points with the lowest p-values. I, however, get the following result:
(Resulting plot: http://imgur.com/a/w6SAz)
This is the code for calculating the p-values:
library(fitdistrplus)

norm_pvalue <- function(input_dist, input_values) {
  # Fitting normal distribution
  fit <- fitdist(input_dist, "norm")
  # Calculating p-values
  p_values <- unlist(lapply(input_values, function(x) {
    dnorm(x, mean = fit$estimate[['mean']], sd = fit$estimate[['sd']])
  }))
  return(p_values)
}
I would like the solution to be generalisable.
Without the data, it is hard to respond in any detail. However, you might want to check out the latest version of the package assertr, noted here: http://www.onthelambda.com/2017/03/20/data-validation-with-the-assertr-package/.
I really like its workflow, which is very generalisable.
For example, if you're looking to inspect data from a column (col) within a dataframe (df), you'd use something like:
library(assertr)
library(magrittr)
df %>% insist(within_n_sds(2), col)
This final function would then notify you of all outliers (i.e. those points that are more than two standard deviations from the mean). The package also includes plenty of different measures for assessing outliers.
In your case, the column in question would probably be based on the residuals from the best-fit line of PC1 and PC2:
PCA.lm <- lm(PC2 ~ PC1, data = df)
PCA.res <- resid(PCA.lm)
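Putting the two together, a minimal sketch might look like this (the column name PCA.res added to df is my own choice; insist() will error and list the offending rows when the assertion fails):
library(assertr)
library(magrittr)

# Flag rows whose residual from the PC2 ~ PC1 fit lies more than
# two standard deviations from the mean residual
df$PCA.res <- resid(lm(PC2 ~ PC1, data = df))
df %>% insist(within_n_sds(2), PCA.res)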
I hope that helps you.
I just ended up using stat_ellipse from ggplot2 to identify the outliers, with a confidence level of 0.999.
The function below takes a ggplot object and the number of the layer in which the ellipse is plotted, and returns the indices of the points outside the ellipse.
library(sp) # for point.in.polygon()

# Function for identifying points outside an ellipse layer
outside_ellipse <- function(p, ellipsoid_layer_number) {
  # Extracting the built plot data (the argument is called p, not "ggplot",
  # so that it does not mask the ggplot() function below)
  build <- ggplot_build(p)$data
  points <- build[[1]]
  ell <- build[[ellipsoid_layer_number]]
  # Finding which points are inside the ellipse, and adding this to the data
  df <- data.frame(points[1:2],
                   in.ell = as.logical(point.in.polygon(points$x, points$y,
                                                        ell$x, ell$y)))
  # Plotting the result
  print(ggplot(df, aes(x, y)) +
          geom_point(aes(col = in.ell)) +
          stat_ellipse())
  # Returning the indices of the outliers
  which(!df$in.ell)
}
Here I plot my data with the ellipse layer, extract the points outside it, and add this information to the data frame.
# Saving plot with a 99.9% confidence ellipse
plotData <- ggplot(pc_df, aes(PC1, PC2)) + geom_point() + stat_ellipse(level = 0.999)
# Identifying points outside the ellipse (the ellipse is layer 2)
outside <- outside_ellipse(plotData, 2)
pc_df$in_ellipsoid <- TRUE
pc_df$in_ellipsoid[outside] <- FALSE

Add exponential function given mean and intercept to CDF plot

Considering the following random data:
set.seed(123456)
# generate random normal data
x <- rnorm(100, mean = 20, sd = 5)
weights <- 1:100
df1 <- data.frame(x, weights)
#
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
We can create a general cumulative distribution plot.
But I want to compare my curve to one from data used 20 years ago. From the paper, I only know that those data are "best modeled by a shifted exponential distribution with an x intercept of 1.1 and a mean of 18".
How can I add such a function to my plot? I tried
+ stat_function(fun=dexp, geom = "line", size=2, col="red", args = (mean=18.1))
but I am not sure how to deal with the shift (x intercept).
I think scenarios like this are best handled by defining your function outside of the ggplot call.
dexp doesn't take a mean parameter; it uses rate instead, which is the same as lambda. Since the mean of an exponential distribution is 1/rate, you want rate = 1/18.1. Also, I don't think dexp makes much sense here, since it gives the density; I think you really want the cumulative probability, which is pexp.
Your code could look something like this:
library(ggplot2)

test <- function(x) pexp(x, rate = 1/18.1)

ggplot(df1, aes(x)) + stat_ecdf() +
  stat_function(fun = test, size = 2, col = "red")
You could shift your pexp distribution like this:
test <- function(x) pexp(x - 10, rate = 1/18.1)

ggplot(df1, aes(x)) + stat_ecdf() +
  stat_function(fun = test, size = 2, col = "red") +
  xlim(10, 45)
Just for fun, this is what using dexp instead produces (plot not shown).
I am not entirely sure I understand the concept of the mean for the exponential function. However, generally, when you pass a function as an argument (fun=dexp in your case), you can pass your own modified function instead, for example fun = function(x) dexp(x - 1.1) to shift it along x.
Maybe experimenting with this feature will get you to the solution.
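For completeness, here is a sketch combining both answers with the parameters from the question (x intercept 1.1; I am assuming the stated mean of 18 refers to the exponential part, i.e. rate = 1/18; if it is the mean of the shifted distribution, use rate = 1/(18 - 1.1) instead):
library(ggplot2)

set.seed(123456)
df1 <- data.frame(x = rnorm(100, mean = 20, sd = 5))

# Shifted exponential CDF: zero below the x intercept of 1.1, then an
# exponential with mean 18 (rate = 1/18)
shifted_pexp <- function(x) pexp(x - 1.1, rate = 1/18)

ggplot(df1, aes(x)) + stat_ecdf() +
  stat_function(fun = shifted_pexp, size = 2, col = "red")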

Creating barplot with standard errors plotted in R

I am trying to find the best way to create barplots in R with standard errors displayed. I have seen other articles, but I cannot figure out the code to use with my own data: I have not used ggplot before (though it seems to be the most common approach), and barplot does not cooperate with data frames. I need this for two cases, for which I have created two example data frames:
Plot df1 so that the x-axis has sites a-c and the y-axis displays the mean value of V1 with the standard errors highlighted, similar to this example, with a grey colour. Here, plant biomass should be the mean V1 value and the treatments should be each of my sites.
Plot df2 in the same way, but with before and after located next to each other, similar to this, so that pre-test and post-test equate to before and after in my example.
x <- factor(LETTERS[1:3])
site <- rep(x, each = 8)
values <- as.data.frame(matrix(sample(0:10, 3*8, replace=TRUE), ncol=1))
df1 <- cbind(site,values)
z <- factor(c("Before","After"))
when <- rep(z, each = 4)
df2 <- data.frame(when,df1)
Apologies for the simplicity for more experienced R users, particularly those who use ggplot, but I cannot apply the snippets of code I have found elsewhere to my data. I cannot even get enough code together to produce the start of a graph, so I hope my descriptions are sufficient. Thank you in advance.
Something like this?
library(ggplot2)

get.se <- function(y) {
  se <- sd(y) / sqrt(length(y))
  mu <- mean(y)
  c(ymin = mu - se, ymax = mu + se)
}

ggplot(df1, aes(x = site, y = V1)) +
  stat_summary(fun.y = mean, geom = "bar", fill = "lightgreen", color = "grey70") +
  stat_summary(fun.data = get.se, geom = "errorbar", width = 0.1)

ggplot(df2, aes(x = site, y = V1, fill = when)) +
  stat_summary(fun.y = mean, geom = "bar", position = "dodge", color = "grey70") +
  stat_summary(fun.data = get.se, geom = "errorbar", width = 0.1,
               position = position_dodge(width = 0.9))
So this takes advantage of stat_summary(...) in ggplot to, first, summarize y for each x using mean(...) (for the bars), and then to summarize y for each x using the get.se(...) function (for the error bars). Another option would be to summarize your data prior to calling ggplot, and then use geom_bar(...) and geom_errorbar(...).
Also, plotting +/- 1 SE is not a great practice (although it is used often enough). You would be better served plotting legitimate confidence limits, which you could do, for instance, using the built-in mean_cl_normal function instead of the contrived get.se(...). mean_cl_normal returns the 95% confidence limits based on the assumption that the data are normally distributed (or you can set the confidence level to something else; read the documentation).
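For example, swapping the contrived get.se(...) for mean_cl_normal in the first plot gives 95% confidence limits (this assumes the Hmisc package is installed, since mean_cl_normal wraps it):
# Same bar plot, but the error bars now show 95% confidence limits
ggplot(df1, aes(x = site, y = V1)) +
  stat_summary(fun.y = mean, geom = "bar", fill = "lightgreen", color = "grey70") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1)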
I used the group_by and summarise_each functions from dplyr for this, and the std.error function from the plotrix package.
library(plotrix) # for std error function
library(dplyr) # for group_by and summarise_each function
library(ggplot2) # for creating ggplot
For the df1 plot:
# Group data by site
grouped_df1 <- group_by(df1, site)
# Summarise the grouped data, calculating the mean and standard error
# with mean() and std.error() (from plotrix)
summarised_df1 <- summarise_each(grouped_df1, funs(mean = mean, std_error = std.error))
# Define the top and bottom of the error bars
limits <- aes(ymax = mean + std_error, ymin = mean - std_error)
# Begin the ggplot, plotting site vs mean
g <- ggplot(summarised_df1, aes(site, mean))
# Create the bars
g <- g + geom_bar(stat = "identity", position = position_dodge())
# Create the error bars
g <- g + geom_errorbar(limits, width = 0.25, position = position_dodge(width = 0.9))
# Print the graph
g
For the df2 plot:
# Group data by when and site
grouped_df2 <- group_by(df2, when, site)
# Summarise the grouped data, calculating the mean and standard error
summarised_df2 <- summarise_each(grouped_df2, funs(mean = mean, std_error = std.error))
# Define the top and bottom of the error bars
limits <- aes(ymax = mean + std_error, ymin = mean - std_error)
# Begin the ggplot, plotting site vs mean and filling by the factor variable when
g <- ggplot(summarised_df2, aes(site, mean, fill = when))
# Create the bars; position_dodge() places the factor bars side by side
g <- g + geom_bar(stat = "identity", position = position_dodge())
# Create the error bars
g <- g + geom_errorbar(limits, width = 0.25, position = position_dodge(width = 0.9))
# Print the graph
g

ggplot - what is the unit of stat_density2d plots?

stat_density2d is a really nice display for dense scatter plots, but I could not find any explanation of what the density actually means. I have a plot with densities ranging from 0 to 400. What is the unit of this scale?
Thanks!
The density values will depend on the range of x and y in your dataset.
stat_density2d(...) uses kde2d(...) from the MASS package to calculate the two-dimensional kernel density estimate, based on bivariate normal kernels. The density at a point is scaled so that the integral of the density over all x and y equals 1; its units are therefore the reciprocal of the units of x times the units of y. So if your data are highly localized, or if the range of x and y is small, you can get large numbers for the density.
You can see this in the following simple example:
library(ggplot2)

set.seed(1)
df1 <- data.frame(x = c(rnorm(50, 0, 5), rnorm(50, 20, 5)),
                  y = c(rnorm(50, 0, 5), rnorm(50, 20, 5)))
ggplot(df1, aes(x, y)) + geom_point() +
  stat_density2d(geom = "path", aes(color = ..level..))

set.seed(1)
df2 <- data.frame(x = c(rnorm(50, 0, 5), rnorm(50, 20, 5)) / 100,
                  y = c(rnorm(50, 0, 5), rnorm(50, 20, 5)) / 100)
ggplot(df2, aes(x, y)) + geom_point() +
  stat_density2d(geom = "path", aes(color = ..level..))
These two data frames are identical, except that in df2 the scale is 1/100 of that in df1 (in each direction), and the density levels are therefore 10,000 times greater in the plot of df2.
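You can sanity-check this scaling directly with kde2d (a sketch; the grid size n = 100 is arbitrary): the density integrates to roughly 1 over the grid for both data frames, even though the density values themselves differ by a factor of about 10,000.
library(MASS)

# Approximate the integral of the density estimate: the sum of the density
# values times the area of one grid cell. It should be close to 1.
k <- kde2d(df1$x, df1$y, n = 100)
dx <- diff(k$x[1:2])
dy <- diff(k$y[1:2])
sum(k$z) * dx * dy  # approximately 1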

Manually specifying bins with stat_summary2d

I have a large set of data that consists of coordinates (x,y) and a numeric z value that is similar to density. I'm interested in binning the data, performing summary statistics (median, length, etc.) and plotting the binned values as points with the statistics mapped to ggplot aesthetics.
I've tried using stat_summary2d and extracting the results manually (based on this answer: https://stackoverflow.com/a/22013347/2832911). However, the problem I'm running into is that the bin placements are based on the range of the data, which in my case varies by data set. Thus between two plots the bins are not covering the same area.
My question is how to either manually set bins using stat_summary2d, or at least set them to be consistent regardless of the data.
Here is a basic example which demonstrates the approach and how the bins don't line up:
library(ggplot2)

set.seed(2)
df1 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))
df2 <- data.frame(x = runif(100, -1, 1), y = runif(100, -1, 1), z = rnorm(100))

g1 <- ggplot(df1, aes(x, y)) + stat_summary2d(fun = mean, bins = 10, aes(z = z)) + geom_point()
df1.binned <- data.frame(with(ggplot_build(g1)$data[[1]],
                              cbind(x = (xmax + xmin) / 2, y = (ymax + ymin) / 2,
                                    z = value, df = 1)))
g2 <- ggplot(df2, aes(x, y)) + stat_summary2d(fun = mean, bins = 10, aes(z = z)) + geom_point()
df2.binned <- data.frame(with(ggplot_build(g2)$data[[1]],
                              cbind(x = (xmax + xmin) / 2, y = (ymax + ymin) / 2,
                                    z = value, df = 2)))

df.binned <- rbind(df1.binned, df2.binned)
ggplot(df.binned, aes(x, y, size = z, color = factor(df))) + geom_point(alpha = .5)
This generates a plot in which the two sets of bin centres do not line up (image not shown).
In reality I will use stat_summary2d several times to get, for instance, the number of points in the bin, and the median and then use aes(size=bin.length, colour=bin.median).
Any tips on how to accomplish this using my proposed approach, or an alternative approach would be welcome.
You can manually set the breaks with stat_summary2d. If you want 10 bins from -1 to 1, you can do:
bb <- seq(-1, 1, length.out = 10 + 1)
breaks <- list(x = bb, y = bb)
And then use the breaks variable when you call your plots:
g1 <- ggplot(df1, aes(x, y)) +
  stat_summary2d(fun = mean, breaks = breaks, aes(z = z)) +
  geom_point()
It's a shame you can't change the geom of stat_summary2d to "point", so you could make this in one go, but it doesn't look as though stat_summary2d calculates the proper x and y values for that.
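Putting this together with the extraction approach from the question, a sketch (extract_binned is a hypothetical helper, not part of ggplot2):
# Shared breaks so that the bins from both data sets line up
bb <- seq(-1, 1, length.out = 10 + 1)
breaks <- list(x = bb, y = bb)

# Pull the bin centres and summarised values out of a built plot
extract_binned <- function(g, id) {
  d <- ggplot_build(g)$data[[1]]
  data.frame(x = (d$xmax + d$xmin) / 2,
             y = (d$ymax + d$ymin) / 2,
             z = d$value, df = id)
}

g1 <- ggplot(df1, aes(x, y)) + stat_summary2d(fun = mean, breaks = breaks, aes(z = z))
g2 <- ggplot(df2, aes(x, y)) + stat_summary2d(fun = mean, breaks = breaks, aes(z = z))
df.binned <- rbind(extract_binned(g1, 1), extract_binned(g2, 2))
ggplot(df.binned, aes(x, y, size = z, color = factor(df))) + geom_point(alpha = .5)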
