Create a histogram for weighted values in R

If I have a vector (e.g., v <- runif(1000)), I can plot its histogram (which will look, more or less, like a horizontal line, because v is a sample from the uniform distribution).
However, suppose I have a vector and its associated weights (e.g., w <- seq(1,1000) in addition to v <- sort(runif(1000))). For example, this could be the result of table() on a much larger data set.
How do I plot the new histogram? (It should look more or less like the y = x line in this example.)
I guess I could reverse the effect of table() by using rep() (hist(rep(v, w))), but this "solution" seems ugly and resource-heavy (it creates an intermediate vector of length sum(w)), and it only supports integer weights.

library(ggplot2)
w <- seq(1, 1000)
v <- sort(runif(1000))
foo <- data.frame(v, w)
# the weight aesthetic makes geom_histogram sum the weights in each bin
ggplot(foo, aes(v, weight = w)) + geom_histogram()

Package plotrix has a function weighted.hist() which does what you want:
library(plotrix)
w <- seq(1, 1000)
v <- sort(runif(1000))
weighted.hist(v, w)

An alternative from the weights package is wtd.hist():
library(weights)
w <- seq(1, 1000)
v <- sort(runif(1000))
wtd.hist(x = v, weight = w)
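If you'd rather avoid extra packages, here is a minimal base-R sketch of the same idea: bin v, sum the weights per bin, and draw the totals (the breaks are chosen for the uniform example above).
w <- seq(1, 1000)
v <- sort(runif(1000))
breaks <- seq(0, 1, by = 0.1)
bins <- cut(v, breaks, include.lowest = TRUE)
wsums <- tapply(w, bins, sum)  # total weight per bin
barplot(wsums)                 # bar heights grow roughly linearly, like y = x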

Set ylim() automatically

Here is some data to work with.
df <- data.frame(x1 = c(234, 543, 342, 634, 123, 453, 456, 542, 765, 141, 636, 3000),
                 x2 = c(645, 123, 246, 864, 134, 975, 341, 573, 145, 468, 413, 636))
If I plot these data, it will produce a simple scatter plot with an obvious outlier:
plot(df$x2,df$x1)
Then I can always write the code below to remove the y-axis outlier(s).
plot(df$x2,df$x1,ylim=c(0,800))
So my question is: Is there a way to exclude obvious outliers in scatterplots automatically, like outline=FALSE does for boxplots? To my knowledge, outline=FALSE doesn't work with scatterplots.
This is relevant because I have hundreds of scatterplots and I want to exclude all obvious outlying data points without setting ylim(...) for each individual scatterplot.
You could write a function that returns the index of what you define as an obvious outlier. Then use that function to subset your data before plotting.
Here, all observations where a exceeds 5 * median(a) are excluded.
df <- data.frame(a = c(1, 3, 4, 2, 100), b = c(1, 3, 2, 4, 2))
f <- function(x) {
  which(x$a > 5 * median(x$a))
}
with(df[-f(df), ], plot(b, a))
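A variant of the same idea, sketched with the boxplot rule the question alludes to: boxplot.stats() flags points beyond 1.5 IQR from the quartiles, so its $out component can drive the subsetting automatically.
out <- df$a %in% boxplot.stats(df$a)$out  # TRUE for boxplot-rule outliers
with(df[!out, ], plot(b, a))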
There is no easy yes/no option to do what you are looking for (defining what an "obvious outlier" is for a generic scatterplot is potentially quite problematic).
That said, it should not be too difficult to write a reasonable function to derive y-axis limits from a set of data points. If we take "obvious outlier" to mean a point with a y value far above or below the bulk of the sample (which could be justified assuming a sufficient spread of x values), then you could use something like:
ybounds <- function(y){ # y is the response variable in the dataframe
  bounds <- quantile(y, probs = c(0.05, 0.95), type = 3, names = FALSE)
  return(bounds + c(-1, 1) * 0.1 * (bounds[2] - bounds[1]))
}
Then plot each dataframe with plot(df$x, df$y, ylim = ybounds(df$y)).
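For instance, with the df from the question above, the 3000 outlier in x1 should fall outside the computed limits:
plot(df$x2, df$x1, ylim = ybounds(df$x1))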

Color discrete groups of a parallel coordinate plot in the GGally package

To create a parallel coordinate plot I wanted to use the ggparcoord() function in the GGally package. The following code shows a reproducible example.
set.seed(3674)
k <- rep(1:3, each = 30)
x <- k + rnorm(n = 90, mean = 10, sd = 0.2)
y <- -2 * k + rnorm(n = 90, mean = 10, sd = 0.4)
z <- 3 * k + rnorm(n = 90, mean = 10, sd = 0.6)
dat <- data.frame(group = factor(k), x, y, z)
library(GGally)
ggparcoord(dat, columns = 1:4, groupColumn = 1)
Notice in the picture that the color scale for group is continuous even though group is a factor. Is there any way I can display the plot with three discrete colors instead?
I have looked at some other posts that discuss various other ways of doing parallel coordinate plots here, but I really wanted to do this with the ggparcoord() function of the GGally package. I appreciate your time in thinking about this problem.
Your code was almost correct. I spotted that columns=1:4 was not right in this case: you need to drop the groupColumn column from columns.
ggparcoord(dat,columns=2:4,groupColumn = 1)
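Since ggparcoord() returns a regular ggplot object, you can also layer a discrete palette on top once the grouping is a factor; for example (a sketch, with an arbitrary palette choice):
library(ggplot2)
ggparcoord(dat, columns = 2:4, groupColumn = 1) +
  scale_colour_brewer(palette = "Set1")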

Find local minimum in bimodal distribution with R

My data are pre-processed image data and I want to separate two classes. In theory (and hopefully in practice) the best threshold is the local minimum between the two peaks of the bimodally distributed data.
My test data is: http://www.file-upload.net/download-9365389/data.txt.html
I tried to follow this thread:
I plotted the histogram and calculated the kernel density function:
datafile <- read.table("....txt")
data <- datafile$V1
hist(data)
d <- density(data) # returns the density data with defaults
hist(data, prob = TRUE)
lines(d) # plots the results
But how to continue?
I would calculate the first and second derivatives of the density function to find the local extrema, specifically the local minimum. However, I have no idea how to do this in R, and density(data) does not seem to be an ordinary function. So please help me: how can I calculate the derivatives and find the local minimum of the pit between the two peaks in density(data)?
There are a few ways to do this.
First, using d for the density as in your question, d$x and d$y contain the x and y values of the density curve. The minimum occurs where the derivative dy/dx = 0. Since the x-values are equally spaced, we can estimate dy with diff(d$y) and look for the d$x where abs(diff(d$y)) is smallest:
d$x[which.min(abs(diff(d$y)))]
# [1] 2.415785
The problem is that dy/dx = 0 at maxima of the density curve as well. In this case the minimum is shallow but the maxima are peaked, so it works here, but you can't count on that.
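One way to tell the two apart, as a minimal sketch: classify the critical points by the sign change of the discrete derivative, since a flip from negative to positive marks a local minimum.
dy <- diff(d$y)
# sign(dy) flips from -1 to +1 exactly at local minima of the density
minima <- which(diff(sign(dy)) > 0) + 1
d$x[minima]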
So a second way uses optimize(...), which seeks a local minimum in a given interval. optimize(...) needs a function as its argument, so we use approxfun(d$x, d$y) to create an interpolating function:
optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
# [1] 2.415791
Finally, we show that this is indeed the minimum:
hist(data,prob=TRUE)
lines(d, col="red", lty=2)
v <- optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
abline(v=v, col="blue")
Another approach, which is actually preferable, uses k-means clustering.
df <- read.csv("data.txt", header = FALSE)
colnames(df) <- "X"
# bimodal
km <- kmeans(df, centers = 2)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x = X)) +
  geom_histogram(aes(fill = clust, y = ..count.. / sum(..count..)),
                 binwidth = 0.5, color = "grey50") +
  stat_density(geom = "line", color = "red")
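If the end goal is a single threshold separating the two classes, one simple sketch derives it from the cluster assignments as the midpoint between the clusters' boundary points:
lo <- which.min(km$centers)            # index of the lower cluster
lo_max <- max(df$X[km$cluster == lo])  # largest value in the lower cluster
hi_min <- min(df$X[km$cluster != lo])  # smallest value in the upper cluster
threshold <- (lo_max + hi_min) / 2
threshold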
The data actually looks more trimodal than bimodal.
# trimodal
km <- kmeans(df, centers = 3)
df$clust <- as.factor(km$cluster)
ggplot(df, aes(x = X)) +
  geom_histogram(aes(fill = clust, y = ..count.. / sum(..count..)),
                 binwidth = 0.5, color = "grey50") +
  stat_density(geom = "line", color = "red")

ggplot2 2d Density Weights

I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_point and geom_density2d. I want geom_density2d to be weighted by the organisation's size (the OrgSize variable). However, when I add OrgSize as a weighting variable, nothing changes in the plot.
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering if I'm using density2d inappropriately. Should I instead be using contour and supplying OrgSize as the 'height'? If so, why does geom_density2d accept a weighting factor?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight parameter, but then not passing it on to MASS::kde2d, since that function has no weights argument. As a consequence, you will need to use a different 2d-density method.
(I realize my answer does not address why the help page says that geom_density2d "understands" the weight argument, but when I have tried to calculate weighted 2D KDEs I have needed to use packages other than MASS. Maybe this is a TODO that @hadley put in the help page and that then got overlooked?)
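As a crude workaround with the example data above, assuming the weights can be scaled down to small integers, you can expand the data frame so each point appears in proportion to its weight; this mirrors the rep() trick for 1d histograms and is just as resource-heavy for large weight sums (the /50 scaling here is an arbitrary choice):
mult <- pmax(1, round(data$OrgSize / 50))                  # integer multiplicities
data_exp <- data[rep(seq_len(nrow(data)), times = mult), ]
ggplot(data_exp, aes(x = Distance, y = Rate)) +
  geom_point(data = data) +  # draw only the original points
  geom_density2d()           # contours now reflect the weights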

Calculating an area under a continuous density plot

I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs=c(25,50,75)/100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1, 11) + labs(title = "September 2010") + geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density separately and plot that first. Then you can use basic arithmetic to get the estimate: an integral can be approximated by adding up the areas of a set of small rectangles. I use the mean method for that; the width is the difference between two x-values, and the height is the mean of the y-values at the beginning and the end of the interval. I use the rollmean() function from the zoo package, but this can be done with the base package too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x, Y$y, type = "l") # can use ggplot for this too
# set an Avg.Position value
Avg.pos <- 1
# construct widths and heights
xt <- diff(Y$x[Y$x < Avg.pos])
yt <- rollmean(Y$y[Y$x < Avg.pos], 2)
# this gives you the area
sum(xt * yt)
This gives you a good approximation to about three decimal places. If you know the density function itself, take a look at ?integrate.
Three possibilities:
The logspline package provides a different method of estimating density curves, but it does include pnorm style functions for the result.
You could also approximate the area by feeding the x and y variables returned by the density function to approxfun and passing the result to integrate. Unless you are interested in precise estimates of small tail areas (or very small intervals), this will probably give a reasonable approximation (see the sketch after this list).
Density estimates are just sums of kernels centered at the data points; one such kernel is the normal distribution. You could average the areas from pnorm (or another kernel's CDF), with the sd set to the bandwidth and centered at your data points (also sketched below).
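A minimal sketch of the last two possibilities, reusing X, Y, and Avg.pos from the zoo example above:
# possibility 2: interpolate the density curve and integrate it numerically
f <- approxfun(Y$x, Y$y)
integrate(f, lower = min(Y$x), upper = Avg.pos)$value
# possibility 3: for a Gaussian kernel, the KDE's CDF at a point q is the
# average of normal CDFs centered at the data, with sd equal to the bandwidth
mean(pnorm(Avg.pos, mean = X, sd = Y$bw))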
