How to get quantile from category count in R?

For example, I have a sample data of human height in a DataFrame:
df <- data_frame(height = c(1.5, 1.6, 1.7, 1.8, 1.9), number = c(20, 30, 50, 30, 20))
How can I calculate the 90% quantile of this sample?
I know ggplot2 has a function that can plot the ECDF of the sample:
ggplot(df, aes(x = height, y = number)) + stat_ecdf()
but I only need a specific quantile, not the plot.
I could repeat each height number times to make a vector and use the quantile function on that vector, but as the counts get larger, this method becomes very inefficient.
EDIT:
It seems stat_ecdf is not supposed to be used in this way, and when the data distribution is skewed:
df <- data_frame(height = c(1.5, 1.6, 1.7, 1.8, 1.9), number = c(100, 2, 3, 4, 5))
only quantile of the repeated vector gives the desired result:
quantile(c(rep(1.5,100), rep(1.6,2), rep(1.7,3), rep(1.8,4), rep(1.9,5)))
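One way to avoid building the repeated vector is to work with the cumulative counts directly. The sketch below assumes the inverse-ECDF definition of a quantile (what quantile() calls type = 1). The name count_quantile is a hypothetical helper, not a function from base R or ggplot2. The default quantile() (type 7) interpolates between values, so its results can differ slightly from this.

```r
# Hypothetical helper: quantiles from (value, count) pairs without rep().
# Assumes the inverse-ECDF definition, i.e. quantile(..., type = 1).
count_quantile <- function(values, counts, probs) {
  ord <- order(values)
  values <- values[ord]
  # cumulative share of observations at or below each value
  cum_prop <- cumsum(counts[ord]) / sum(counts)
  # for each probability, take the first value whose cumulative share reaches it
  idx <- vapply(probs, function(p) which(cum_prop >= p)[1], integer(1))
  values[idx]
}

height <- c(1.5, 1.6, 1.7, 1.8, 1.9)
q_even   <- count_quantile(height, c(20, 30, 50, 30, 20), 0.9)
q_skewed <- count_quantile(height, c(100, 2, 3, 4, 5), 0.9)
```

For the two data sets in the question this gives 1.9 and 1.7 respectively, and the cost depends only on the number of distinct heights, not on the size of the counts.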

Related

Plotting unequal error bars as bubbles on a scatterplot in ggplot2

I have a set of 10 density estimates, obtained from 5 sites using two different methods (REM and DS). Each density estimate has its own confidence interval, and the intervals are asymmetric around the estimate.
I want a scatter plot with the x-axis showing the density estimate from REM and the y-axis showing the density estimate from DS. I then want to draw a bubble around each point, representing the confidence intervals.
At the moment I can only seem to set a single height and width for each bubble, which would be fine if the intervals were symmetric. Since they are not, the bubbles should not be circles but more of an egg-shaped ellipse, off-centre from the point estimate.
This is the code I've used, in which you can see the respective confidence intervals. The plot shows what this produces, as if the confidence intervals were symmetric. How would I adapt this to make the confidence intervals asymmetric?
Thank you.
library(ggplot2)
library(ggforce)  # provides geom_ellipse()
# sample data
df <- data.frame(site = c(1, 2, 3, 4, 5),
                 rem = c(17.7, 14.1, 10.6, 13.2, 1.0),
                 rem_lower = c(8.2, 6.6, 4.2, 3.2, 0.2),
                 rem_upper = c(27.1, 21.5, 17.0, 23.1, 1.7),
                 ds = c(16.6, 18.5, 5.2, 21.8, 2.4),
                 ds_lower = c(6.3, 5.1, 2.7, 4.5, 0.5),
                 ds_upper = c(40.4, 39.9, 10.9, 44.7, 8.3))
# calculate the width and height of each ellipse
width <- df$rem_upper - df$rem_lower
height <- df$ds_upper - df$ds_lower
# plot the data with ellipses
ggplot(df, aes(x = rem, y = ds, color = factor(site))) +
  geom_point(size = 5) +
  geom_ellipse(aes(x0 = rem, y0 = ds, a = width, b = height, fill = factor(site),
                   angle = 45), alpha = 0.3) +
  scale_fill_manual(values = c("#1f78b4", "#33a02c", "#e31a1c", "#ff7f00", "#6a3d9a")) +
  labs(x = "rem", y = "DS") +
  theme_classic()
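One possible way to get an off-centre, egg-shaped bubble is to build its outline by hand from four quarter-ellipses, each using the distance from the point estimate to the corresponding interval bound as its semi-axis, and draw it with geom_polygon. This is only a sketch under that assumption: egg_outline and its arguments are hypothetical names, not part of ggplot2 or ggforce, and the shape is one reasonable reading of "egg-shaped", not a standard statistical region.

```r
# Hypothetical helper: outline of an off-centre "egg" around a point estimate.
# left/right/down/up are the distances from the estimate to each interval bound.
egg_outline <- function(x0, y0, left, right, down, up, n = 101) {
  theta <- seq(0, 2 * pi, length.out = n)
  # use a different semi-axis on each side of the point estimate
  rx <- ifelse(cos(theta) >= 0, right, left)
  ry <- ifelse(sin(theta) >= 0, up, down)
  data.frame(x = x0 + rx * cos(theta), y = y0 + ry * sin(theta))
}

df <- data.frame(site = 1:5,
                 rem = c(17.7, 14.1, 10.6, 13.2, 1.0),
                 rem_lower = c(8.2, 6.6, 4.2, 3.2, 0.2),
                 rem_upper = c(27.1, 21.5, 17.0, 23.1, 1.7),
                 ds = c(16.6, 18.5, 5.2, 21.8, 2.4),
                 ds_lower = c(6.3, 5.1, 2.7, 4.5, 0.5),
                 ds_upper = c(40.4, 39.9, 10.9, 44.7, 8.3))

# one outline per site, stacked into a single data frame
outlines <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  r <- df[i, ]
  out <- egg_outline(r$rem, r$ds,
                     left = r$rem - r$rem_lower, right = r$rem_upper - r$rem,
                     down = r$ds - r$ds_lower,   up = r$ds_upper - r$ds)
  out$site <- r$site
  out
}))

# draw the outlines only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = rem, y = ds)) +
    geom_polygon(data = outlines,
                 aes(x = x, y = y, group = site, fill = factor(site)),
                 alpha = 0.3) +
    geom_point(aes(colour = factor(site)), size = 3) +
    labs(x = "REM", y = "DS", fill = "site", colour = "site") +
    theme_classic()
  print(p)
}
```

Each outline touches the lower and upper bound of both intervals, so the bubble extends exactly as far as the confidence limits in each direction.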

How to put labels between columns in a bar plot in R?

I'm a beginner with R and looking for help with plotting.
I would like to make a distribution plot in R that looks like a histogram of continuous data bucketed into columns with x-axis labels between each column to denote the range captured in each column.
Instead of continuous data though, I only have the bucketed counts. I can create a plot with barplot, however I can't find a way to label BETWEEN the columns to denote the range captured in each bar.
I've tried barplot but cannot get the labels to fall between columns instead of being treated as column labels and falling directly beneath each column.
dat <- list()
dat$freq = c(5,15,20,10)
dat$mid = c(-1.5,-.5,.5,1.5) #midpoint in each bucketed range
dat$perc = dat$freq/sum(dat$freq)
barplot(dat$perc, names.arg = dat$mid)
Each column is labeled with the midpoint. I would instead like the labels to be -2,-1,0,1,2 BETWEEN the columns.
Thank you!
edit: dput(dat) outputs:
list(freq = c(5, 15, 20, 10), mid = c(-1.5, -0.5, 0.5, 1.5), perc =
c(0.1, 0.3, 0.4, 0.2))
Is this what you're after?
df <- data.frame(freq = c(5, 15, 20, 10), mid = c(-1.5, -0.5, 0.5, 1.5), perc = c(0.1, 0.3, 0.4, 0.2))
I'm using the awesome and highly customisable library ggplot2 to plot this, which renders the plot as I think you want it. You can install this with install.packages('ggplot2'):
# install.packages('ggplot2')
library(ggplot2)
p <- ggplot(df)
p <- p + geom_bar(aes(mid, perc), stat='identity')
p
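If staying with base barplot is preferred, one option is to draw the bars with no gaps and then add the axis by hand at the bar edges. The sketch below assumes every bucket has width 1, so the edge labels can be derived from the midpoints. With space = 0 and the default bar width of 1, the bar edges fall at 0, 1, 2, ... on the plot's own x scale.

```r
dat <- data.frame(freq = c(5, 15, 20, 10), mid = c(-1.5, -0.5, 0.5, 1.5))
dat$perc <- dat$freq / sum(dat$freq)

# space = 0 makes the bars touch; barplot() returns the bar midpoints
bar_mids <- barplot(dat$perc, space = 0)

# positions of the bar edges on the plot's x scale: 0, 1, ..., number of bars
edges <- 0:nrow(dat)

# edge labels derived from the midpoints (assumes a bucket width of 1)
edge_labels <- c(dat$mid - 0.5, dat$mid[nrow(dat)] + 0.5)

# label the edges (between the columns) instead of the bar centres
axis(side = 1, at = edges, labels = edge_labels)
```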

Multiple Layers in ggplot2

I want to overlay a plot of an empirical cdf with a cdf of a normal distribution. I can only get the code to work without using ggplot.
rnd_nv1 <- rnorm(1000, 1.5, 0.5)
plot(ecdf(rnd_nv1))
lines(seq(0, 3, by=.1), pnorm(seq(0, 3, by=.1), 1.5, 0.5), col=2)
For ggplot to work I would need a single data frame, for example one joining rnd_nv1 and pnorm(seq(0, 3, by=.1), 1.5, 0.5). This is a problem, because the function rnorm gives me just the sampled values without corresponding values on the domain. I don't even know how ecdf creates the curve from these; if I view the table I just see the sampled values. But then again, magically, the plot of rnd_nv1 works.
The following plots the two lines but they overlap, since they are almost equal.
set.seed(1856)
x <- seq(0, 3, by = 0.1)
rnd_nv1 <- rnorm(1000, 1.5, 0.5)
dat <- data.frame(x = x, ecdf = ecdf(rnd_nv1)(x), norm = pnorm(x, 1.5, 0.5))
library(ggplot2)
long <- reshape2::melt(dat, id.vars = "x")
ggplot(long, aes(x = x, y = value, colour = variable)) +
  geom_line()
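An alternative sketch that avoids building the grid by hand: ggplot2 can compute the empirical CDF from the raw draws with stat_ecdf and overlay the theoretical curve with stat_function. The data frame name sample_df and the legend labels are illustrative choices only.

```r
library(ggplot2)

set.seed(1856)
sample_df <- data.frame(value = rnorm(1000, mean = 1.5, sd = 0.5))

p <- ggplot(sample_df, aes(x = value)) +
  # empirical CDF computed by ggplot2 from the raw draws
  stat_ecdf(aes(colour = "ecdf")) +
  # theoretical normal CDF evaluated over the same x range
  stat_function(aes(colour = "norm"), fun = pnorm,
                args = list(mean = 1.5, sd = 0.5)) +
  labs(y = "cumulative probability", colour = NULL)
print(p)
```

Here each layer is a separate stat on the same data frame, so no reshaping to long format is needed.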

R: Bar plot on a continuous x-axis (time-scaled)

I'm fairly new to R so please comment on anything you see.
I have data taken at different timepoints, under two conditions (for one timepoint), and I want to plot this as a bar plot with error bars and with the bars at the appropriate timepoint.
I currently have this (stolen from another question on this site):
library(ggplot2)
example <- data.frame(tp = factor(c(0, "14a", "14b", 24, 48, 72)), means = c(1, 2.1, 1.9, 1.8, 1.7, 1.2), std = c(0.3, 0.4, 0.2, 0.6, 0.2, 0.3))
ggplot(example, aes(x = tp, y = means)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_errorbar(aes(ymin=means-std, ymax=means+std))
Now my timepoints are a factor, but the unequal spacing of the measurements across time makes the plot less nice.
This is how I imagine the graph:
I find the ggplot2 package can give you very nice graphs, but I have a lot more difficulty understanding it than I have with other R stuff.
Before we get into R, you have to realize that even in a bar plot the x axis needs a numeric value. If you treat them as factors then the software assumes equal spacing between the bars by default. What would be the x-values for each of the bars in this case? It can be (0, 14, 14, 24, 48, 72) but then it will plot two bars at point 14 which you don't seem to want. So you have to come up with the x-values.
Joran provides an elegant solution by modifying the width of the bars at position 14. Modifying the code given by joran to make the bars fall at the right position in the x-axis, the final solution is:
library(ggplot2)
example <- data.frame(tp = factor(c(0, "14a", "14b", 24, 48, 72)), means = c(1, 2.1, 1.9, 1.8, 1.7, 1.2), std = c(0.3, 0.4, 0.2, 0.6, 0.2, 0.3))
example$tp1 <- gsub("a|b","",example$tp)
example$grp <- c('a','a','b','a','a','a')
example$tp2 <- as.numeric(example$tp1)
ggplot(example, aes(x = tp2, y = means,fill = grp)) +
  geom_bar(position = "dodge",stat = "identity") +
  geom_errorbar(aes(ymin=means-std, ymax=means+std),position = "dodge")

Put data into unequal bin sizes

I'm new to R and want to utilize it to directly work with my data. My ultimate goal is to make a histogram / bar plot.
Depth: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Percent: .4, .1, .5, .2, .1, .3, .9, .3, .2, .2, .8
I want to take the Depth vector and bin it into unequal chunks (0, 1-5, 6-8, 9-10), and take the Percent values and somehow sum them together for the matching chunks.
For example:
0 -> .4
1-5 -> 1.2
6-8 -> 1.4
9-10 -> 1.0
The actual data set goes into the thousands, and I feel R might be more suited for this than using C++ to group my data into a smaller table before letting R plot it.
I looked up how to use SPLIT and CUT, but I'm not quite sure how to utilize the data after I do cut it into ranges. If I do "breaks" for a CUT, I don't know how to include the Zero initial value (corresponding to .4 in the example).
Any suggestions or approaches would be appreciated.
You're on the right track with cut:
dat <- data.frame(Depth = 0:10,
                  Percent = c(0.4, 0.1, 0.5, 0.2, 0.1, 0.3, 0.9, 0.3, 0.2, 0.2, 0.8))
cuts <- cut(dat$Depth, breaks=c(0, 1, 6, 9, 11), right=FALSE)
Then you can use aggregate:
aggregate(dat$Percent, list(cuts), sum)
Or as a one-liner:
aggregate(dat$Percent,
          list(cut(dat$Depth,
                   breaks=c(0, 1, 6, 9, 11),
                   right=FALSE)),
          sum)
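As a small variation on the same idea, the bins can be given readable labels and summed with tapply, which returns a named vector that barplot accepts directly. The label strings and the name binned are illustrative choices.

```r
dat <- data.frame(Depth = 0:10,
                  Percent = c(0.4, 0.1, 0.5, 0.2, 0.1, 0.3, 0.9, 0.3, 0.2, 0.2, 0.8))

# right = FALSE makes the intervals [0,1), [1,6), [6,9), [9,11),
# so Depth 0 gets its own bin
bins <- cut(dat$Depth, breaks = c(0, 1, 6, 9, 11), right = FALSE,
            labels = c("0", "1-5", "6-8", "9-10"))

# sum Percent within each bin; the result is named by bin label
binned <- tapply(dat$Percent, bins, sum)

barplot(binned, xlab = "Depth", ylab = "Summed percent")
```

For the example data this reproduces the sums listed in the question: 0.4, 1.2, 1.4 and 1.0.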
