I know how to draw histograms and other frequency/percentage-related tables.
But now I want to know how I can get those frequency values into a table to use after the fact.
I have a massive dataset; I draw a histogram with a set binwidth, and I want to extract the frequency value (i.e. the value on the y-axis) that corresponds to each bin and save it somewhere.
Can someone please help me with this?
Thank you!
The hist function has a return value (an object of class histogram):
R> res <- hist(rnorm(100))
R> res
$breaks
[1] -4 -3 -2 -1 0 1 2 3 4
$counts
[1] 1 2 17 27 34 16 2 1
$intensities
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01
$density
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01
$mids
[1] -3.5 -2.5 -1.5 -0.5 0.5 1.5 2.5 3.5
$xname
[1] "rnorm(100)"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
From ?hist:
Value
an object of class "histogram" which is a list with components:
breaks the n+1 cell boundaries (= breaks if that was a vector).
These are the nominal breaks, not with the boundary fuzz.
counts n integers; for each cell, the number of x[] inside.
density values f^(x[i]), as estimated density values. If
all(diff(breaks) == 1), they are the relative frequencies counts/n
and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i]
= breaks[i].
intensities same as density. Deprecated, but retained for
compatibility.
mids the n cell midpoints.
xname a character string with the actual x argument name.
equidist logical, indicating if the distances between breaks are all
the same.
breaks, counts and density provide just about all you need:
histrv <- hist(x)
histrv$breaks
histrv$counts
histrv$density
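So, for the original question, the per-bin frequencies can be pulled straight from that return value and saved. A small sketch, assuming x is your data vector (the output file name is just a placeholder):
## capture the histogram object instead of only drawing it
h <- hist(x, plot = FALSE)   # or pass your own breaks / binwidth
## one row per bin: lower break, upper break, midpoint, and frequency
freq_table <- data.frame(lower = head(h$breaks, -1),
                         upper = tail(h$breaks, -1),
                         mid   = h$mids,
                         count = h$counts)
write.csv(freq_table, "histogram_counts.csv", row.names = FALSE)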
Just in case someone hits this question with ggplot's geom_histogram in mind, note that there is a way to extract the data from a ggplot object.
The following convenience function outputs a dataframe with the lower limit of each bin (xmin), the upper limit of each bin (xmax), the mid-point of each bin (x), as well as the frequency value (y).
## Convenience function (ggplot2 must be loaded for ggplot_build)
library(ggplot2)
get_hist <- function(p) {
  d <- ggplot_build(p)$data[[1]]
  data.frame(x = d$x, xmin = d$xmin, xmax = d$xmax, y = d$y)
}
# make a dataframe for ggplot
set.seed(1)
x = runif(100, 0, 10)
y = cumsum(x)
df <- data.frame(x = sort(x), y = y)
# make geom_histogram
p <- ggplot(data = df, aes(x = x)) +
geom_histogram(aes(y = cumsum(..count..)), binwidth = 1, boundary = 0,
color = "black", fill = "white")
Illustration:
hist = get_hist(p)
head(hist$x)
## [1] 0.5 1.5 2.5 3.5 4.5 5.5
head(hist$y)
## [1] 7 13 24 38 52 57
head(hist$xmax)
## [1] 1 2 3 4 5 6
head(hist$xmin)
## [1] 0 1 2 3 4 5
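A side note, in case the cumulative mapping is not wanted: the data frame returned by ggplot_build() for a plain geom_histogram also carries count and density columns, so a variant of the helper can return per-bin frequencies directly. A sketch (get_hist2 is just an illustrative name):
get_hist2 <- function(p) {
  d <- ggplot_build(p)$data[[1]]
  data.frame(xmin = d$xmin, xmax = d$xmax, mid = d$x,
             count = d$count, density = d$density)
}
## for the cumulative example above, the per-bin counts are simply diff(c(0, hist$y)),
## i.e. 7 6 11 14 14 5 for the first six bins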
A related question I answered here (Cumulative histogram with ggplot2).
Context
I asked this question recently:
Comparing partitions from split() using a nested for loop containing an if statement
where I needed to compare partitions generated by split() from a distance matrix, using the code fix provided by @robertdj:
set.seed(1234) # set random seed for reproducibility
# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix
splt <- split(d, 1:5) # split data with 5 values in each partition
for (i in 1:length(splt)) {
  for (j in 1:length(splt)) {
    if (i != j) {
      a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
      b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
    }
  }
}
I generated a MWE where each split contained the same number of elements. I did this just for illustrative purposes, fully knowing that this would not necessarily hold for real data.
As per @Robert Hacken's comment, if I instead do
d <- na.omit(d[lower.tri(d)])
I get partitions of unequal length.
Real Data
However, my real data does not have the "same size" property. My real data also contains many more partitions than just the 5 in my MWE.
Here is my code:
splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix)))
The distance matrix dist_matrix contains FASTA headers from which I extract the species names.
I then use splt above in the doubly nested loop.
For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.
splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667
splt[[10]]
[1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000
Expected Output
For my real problem, each partition corresponds to the distances from a single species to all other unique species. So, if Species X has two DNA sequences representing it and there are 10 species in total, the partition for Species X should contain 20 distances. However, I don't want the partition to include the distance between the two sequences for Species X itself.
splt would thus contain 10 partitions (each not necessarily of the same length) for all species
The expected output of a and b is a number between 0 and 1 inclusive. I think these numbers should be small in my real example, but they are large when I try to run my code, which I think is a consequence of the warning.
What I've Done
I've read on SO that %in% is typically used to resolve the warning
In splt[[i]] == splt[[j]] :
longer object length is not a multiple of shorter object length
except in my case, I believe I would need `%notin%` <- Negate(`%in%`).
However, %notin% gives the error from my original post:
the condition has length > 1
Question
How can my nested loop be altered to remove the warning?
I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark conversation about which of my interpretations are correct (and which are incorrect).
Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,
set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)
We have:
splt <- list("1" = c(1.48449499149608, 2.62312694474001), "2" = c(2.29150692606848, 0.15169544670039), "3" = c(1.13863195324393, 3.43013887931241), "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693), "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388, -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057, -0.709440037512506, -0.501258060594761))
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389 0.06445882
# $`5`
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean "what fraction of splt[[i]] is at or above the extreme (min or max) value of splt[[j]]?". Since we're comparing (for example) splt[[1]] with all of splt[[2]] through splt[[5]] here, and likewise for the others, we're going to have a square matrix where the diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).
Some quick math so we know what we should end up with:
splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069
Since 1 of the 2 values in [[1]] is greater than [[2]]'s max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below [[2]]'s min of 0.15, so we expect a 0 there.
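A quick check of those two expectations in code (mirroring the [[5]]-vs-[[4]] check below):
sum(splt[[1]] >= max(splt[[2]])) / length(splt[[1]])
# [1] 0.5
sum(splt[[1]] <= min(splt[[2]])) / length(splt[[1]])
# [1] 0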
Similarly, [[5]] over [[4]]:
splt[[5]]
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644 0.06445882
### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2
### 2 of 10 are at or below the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2
We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anonymous function passed to it. I'll adapt your double-for-loop premise into nested sapply calls.
Greater than the other's max
sapply(splt, function(y) sapply(setNames(splt, paste0("max", seq_along(splt))), function(z) sum(y >= max(z)) / length(y)))
# 1 2 3 4 5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1
Interpretation and subset validation:
1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).
5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.
Less than the other's min
sapply(splt, function(y) sapply(setNames(splt, paste0("min", seq_along(splt))), function(z) sum(y <= min(z)) / length(y)))
# 1 2 3 4 5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1
Same two pairs:
1 with min of 2 (row 2, column 1) is 0, as expected
5 with min of 4 (row 4, column 5) is 0.2, as expected
Edit: @compbiostats pointed out that while sum(..) should produce the same results as length(which(..)), the latter may be more robust to missing data (e.g., NA values; cf. Difference between sum(), length(which()), and nrow() in R). For sum(..) to share that resilience, we should add na.rm=TRUE to both sum(.) and min(.) in the above calls. Thanks @compbiostats!
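For completeness, a sketch of the same calls with only that na.rm tweak applied (otherwise identical to the calls above):
## greater than the other's max, ignoring NAs
sapply(splt, function(y) sapply(setNames(splt, paste0("max", seq_along(splt))),
                                function(z) sum(y >= max(z, na.rm = TRUE), na.rm = TRUE) / length(y)))
## less than the other's min, ignoring NAs
sapply(splt, function(y) sapply(setNames(splt, paste0("min", seq_along(splt))),
                                function(z) sum(y <= min(z, na.rm = TRUE), na.rm = TRUE) / length(y)))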
First: This is my first question here and I'm relatively new to R, too. So, I'm sorry if this is a stupid question or wrong way to ask.
I have a data frame like this:
df <- data.frame(Website = c("A", "A", "A", "B", "B", "B"),
seconds = c(1,12,40,3,5,14),
visitors = c(200000,100000,12000,250000,180000,90000))
> df
Website seconds visitors
A 1 200000
A 12 100000
A 40 12000
B 3 250000
B 5 180000
B 14 90000
How to interpret the data: Website A has 200000 visitors who have been on the website for only 1 second, 100000 visitors for 12 seconds and so on.
In reality, the data has about a hundred different websites, each with seconds ranging from 0 to about 900 (and correspondingly large visitor counts).
Now, I want to calculate percentiles or at least quartiles for the visiting duration (for each website).
I already found and tried this solution here: https://stackoverflow.com/a/53882909
However, this solution is very inefficient as it results in a data frame with several million rows (and a very long processing time).
My question now: Is there a faster (more efficient way) to calculate percentiles from such pre-aggregated data?
I believe this will be faster. First make a function to compute the quantiles you specify. Then split the data into a list and use sapply:
quant <- function(x, p=c(.25, .50, .75)) {
  ## cumulative proportion of visitors, starting from 0
  v <- c(0, cumsum(x$visitors)/sum(x$visitors))
  s <- c(0, x$seconds)
  ## interpolate seconds along the cumulative-proportion curve
  approx(v, s, p)$y
}
df.split <- split(df, df$Website)
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# A 0.2 0.3 0.5 0.6 0.8 0.9 3.0 6.5 9.9
# B 0.6 1.2 1.9 2.5 3.1 3.7 4.3 4.8 8.8
To see better what is going on, here is a plot showing the data for Website A:
test1 <- df[1:3, ]
test1$cumvis <- cumsum(test1$visitors)
barplot(test1$seconds, test1$visitors, space=0, xlim=c(0, 325000))
axis(1, seq(0, 300000, 50000), c("0", "50K", "100K", "150K", "200K",
"250K", "300K"), xpd=NA)
axis(3, seq(0, sum(test1$visitors), by=31200), seq(0, 1, by=.1), lty=1)
lines(c(0, test1$cumvis), c(0, test1$seconds), col="red", lwd=2)
lines(c(0, test1$cumvis-.5*test1$visitors, tail(test1$cumvis, 1)),
c(0, test1$seconds, tail(test1$seconds, 1)), col="blue", lwd=2)
The plot shows the data as grey rectangles. The bottom x-axis shows the cumulative number of visits and the top x-axis shows the cumulative proportion. We can treat the rectangles as the distribution or we can assume that the rectangles are a sample that approximates the underlying distribution. My suggested solution took the red line and used the approx function to use linear interpolation between the data points to estimate the number of seconds along that curve.
The same approach can be used with a different definition of the curve in which the data points are placed in the middle of each rectangle, the blue curve. I'll provide code for that approach as well. It is also possible to estimate the quantiles from the original data without replicating it.
First a function to estimate the quantiles along the blue line:
quant2 <- function(x, p=c(.25, .50, .75)) {
  ## mid-rectangle cumulative proportions, plus the two endpoints
  v <- c(0, (cumsum(x$visitors) - .5*x$visitors)/sum(x$visitors), 1)
  s <- c(0, x$seconds, tail(x$seconds, 1))
  approx(v, s, p)$y
}
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant2, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# A 0.3 0.6 0.9 2.8 5.1 7.4 9.7 12.0 27.4
# B 1.2 2.5 3.3 3.8 4.3 4.7 6.6 10.1 13.5
The estimates are higher because the blue line is above the red line.
Finally, we can simply use the rectangles without any interpolation. Basically we set breaks at the boundaries of the data points and use those to identify which proportions fall in which groups of observations (seconds).
quant3 <- function(x, p=c(.25, .50, .75)) {
  v <- c(0, cumsum(x$visitors)/sum(x$visitors))
  limits <- cut(p, breaks=v, include.lowest=TRUE, labels=x$seconds)
  as.numeric(as.character(limits))
}
p <- 0:10/10
stats <- t(sapply(df.split, quant3, p=p))
colnames(stats) <- as.character(p)
stats
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
# A 1 1 1 1 1 1 1 12 12 12 40
# B 3 3 3 3 3 5 5 5 5 14 14
So for website A, 1 second is the value for quantiles 0 through 0.6.
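If you want to sanity-check any of these against the expanded-data approach from the linked answer, a brute-force sketch (memory-hungry on real data, so only for spot checks) is:
## replicate each seconds value once per visitor, then use quantile() directly
check <- t(sapply(df.split, function(x) quantile(rep(x$seconds, x$visitors), probs = p)))
check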
I found a plot in a stats book, which I want to reproduce with the base package.
The plot looks like this:
So far I have the plot, but I have problems to add a centred labels to each part of the bar.
My code looks like this:
data <- sample( 5, 10 , replace = TRUE )
colors <- c('yellow','violet','green','pink','red')
relative.frequencies <- as.matrix( prop.table( table( data ) ) )
bc <- barplot( relative.frequencies, horiz = TRUE, axes = FALSE, col = colors )
For your given example, we can do (all readers can skip this part and jump to the next):
set.seed(0) ## `set.seed` for reproducibility
dat <- sample( 5, 10 , replace = TRUE )
colors <- c('yellow','violet','green','pink')
h <- as.matrix( prop.table( table( dat ) ) )
## compute x-location of the centre of each bar
H <- apply(h, 2L, cumsum) - h / 2
## add text to barplot
bc <- barplot(h, horiz = TRUE, axes = FALSE, col = colors )
text(H, bc, labels = paste0(100 * h, "%"))
For all readers
I will now construct a comprehensive example for you to digest the idea.
Step 1: generate a toy matrix of percentage for experiment
## a function to generate an `n * p` matrix `h`, with `h > 0` and `colSums(h) = 1`
sim <- function (n, p) {
  set.seed(0)
  ## a positive random matrix of `n` rows and `p` columns
  h <- matrix(runif(n * p), nrow = n)
  ## rescale columns of `h` so that `colSums(h)` is 1
  h <- h / rep(colSums(h), each = n)
  ## for neatness we round `h` to 2 decimals
  h <- round(h, 2L)
  ## but then `colSums(h)` is no longer exactly 1;
  ## no worry, we simply reset the last row:
  h[n, ] <- 1 - colSums(h[-n, ])
  ## now return this good toy matrix
  h
}
h <- sim(4, 3)
# [,1] [,2] [,3]
#[1,] 0.43 0.31 0.42
#[2,] 0.13 0.07 0.40
#[3,] 0.18 0.30 0.04
#[4,] 0.26 0.32 0.14
Step 2: understand a stacked bar-chart and get "mid-height" of each stacked bar
For stacked bar-chart, the height of the bar is the cumulative sum of each column of h:
H <- apply(h, 2L, cumsum)
# [,1] [,2] [,3]
#[1,] 0.43 0.31 0.42
#[2,] 0.56 0.38 0.82
#[3,] 0.74 0.68 0.86
#[4,] 1.00 1.00 1.00
We now shift back by h / 2 to get the mid / centre of each stacked bar:
H <- H - h / 2
# [,1] [,2] [,3]
#[1,] 0.215 0.155 0.21
#[2,] 0.495 0.345 0.62
#[3,] 0.650 0.530 0.84
#[4,] 0.870 0.840 0.93
Step 3: producing a bar-chart with filled numbers
For a vertical bar-chart, H above gives the y coordinates of the centres of the stacked bars. The x coordinates are returned by barplot (invisibly). Be aware that we need to replicate each element of x nrow(H) times when using text:
x <- barplot(h, col = 1 + 1:nrow(h), yaxt = "n")
text(rep(x, each = nrow(H)), H, labels = paste0(100 * h, "%"))
For a horizontal bar-chart, H above gives the x coordinates of the centres of the stacked bars. The y coordinates are returned by barplot (invisibly). Be aware that we need to replicate each element of y nrow(H) times when using text:
y <- barplot(h, col = 1 + 1:nrow(h), xaxt = "n", horiz = TRUE)
text(H, rep(y, each = nrow(H)), labels = paste0(100 * h, "%"))
Here is another solution using mapply:
invisible(mapply(function(k, l) text(x = (k - l/2), y = bc,
                                     labels = paste0(l*100, "%"), cex = 1.5),
                 cumsum(relative.frequencies), relative.frequencies))
mapply is a multivariate version of sapply. In this case, it takes two inputs, cumsum(relative.frequencies) and relative.frequencies, and applies the text() function to those two vectors element-wise. x = gives the coordinates of the labels: each cumulative sum minus half of the corresponding relative frequency. relative.frequencies is then used again as the labels to be plotted.
The invisible() function suppresses the printing of the output to the console.
The question has 2 parts.
What data structure in R allows storing the paired data:
0:0
0.5:10
1:20
(like the Python dictionary {0: 0, 0.5: 10, 1: 20}),
and how can it be initialized with a one-liner? I.e. how to couple seq(0,1,by=0.5)
with seq(0,10,by=5) in this data structure.
Assume I then add 0.25 to the list; I want the weighted average of the neighbouring nodes to appear (automatically) in the data set, i.e. the element 0.25:5, so the paired set would be
0:0
0.25:5
0.5:10
1:20
If I add the element 0.3, then it must be paired with 5 + (10-5)*(0.3-0.25)/(0.5-0.25) = 6, i.e. the element 0.3:6 is to be added.
How can I create a class, with the S4 or Reference Class model, where I could put this functionality?
Not really sure what you are getting at, but maybe the hash package has what you want:
library(hash)
h <- hash(keys = seq(0, 1, by = 0.5), values = seq(0, 10, by = 5))
h[['0.25']] <- 2.5
That probably deals with the first part of your question; http://cran.r-project.org/web/packages/hash/hash.pdf may allude to help on the second.
A similar construct with lists:
lst <- list()
lst <- seq(0, 10, 5)
names(lst) <- seq(0, 1, 0.5)
> lst['0.5']
0.5
5
lst['0.25'] <- 2.5
For the second part of your question, you could construct a simple function to update your hash/list with a new value; a rough sketch follows.
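The sketch below (not part of the original answer) uses the named-vector version and base approx() for the interpolation; add_key is just an illustrative name.
## insert a new key with a linearly interpolated value, keeping keys sorted
add_key <- function(lst, key) {
  x <- as.numeric(names(lst))
  val <- approx(x, unname(lst), xout = key)$y   # linear interpolation between neighbours
  out <- c(lst, setNames(val, as.character(key)))
  out[order(as.numeric(names(out)))]
}
lst <- setNames(seq(0, 10, 5), seq(0, 1, 0.5))
lst <- add_key(lst, 0.25)
lst
#    0 0.25  0.5    1
#  0.0  2.5  5.0 10.0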
A two-column data.frame seems appropriate:
xy <- data.frame(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
xy
# x y
# 1 0.0 0
# 2 0.5 10
# 3 1.0 20
Then, what you are trying to do is linear interpolation, which you can achieve using the approx function. For example:
approx(xy$x, xy$y, xout = 0.3)
# $x
# [1] 0.3
#
# $y
# [1] 6
If you want to add that result to the data.frame, you can do something like:
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, 0.3))))
xy
# x y
# 1 0.0 0
# 2 0.3 6
# 3 0.5 10
# 4 1.0 20
which is a bit expensive, especially if you plan to add points one at a time. You could instead add all your points at once since the result is independent of the order in which you add them:
add.points <- c(0.25, 0.3)
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, add.points))))
xy
# x y
# 1 0.00 0
# 2 0.25 5
# 3 0.30 6
# 4 0.50 10
# 5 1.00 20
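For the class part of the question, a minimal Reference Class sketch wrapping the same data.frame + approx idea might look as follows (InterpMap and its methods are made-up names for illustration):
InterpMap <- setRefClass("InterpMap",
  fields = list(x = "numeric", y = "numeric"),
  methods = list(
    lookup = function(xout) {
      ## linear interpolation at the requested key(s)
      approx(x, y, xout = xout)$y
    },
    add = function(newx) {
      ## insert newx with its interpolated value, keeping keys sorted
      newy <- approx(x, y, xout = newx)$y
      ord <- order(c(x, newx))
      x <<- c(x, newx)[ord]
      y <<- c(y, newy)[ord]
      invisible(.self)
    }
  )
)
m <- InterpMap$new(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
m$add(0.3)
m$lookup(0.3)
# [1] 6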
I'm using the cut function to split my data into equal bins. It does the job, but I'm not happy with the way it returns the values: what I need is the center of each bin, not the upper and lower ends.
I've also tried cut2 from Hmisc; this gives me the center of each bin, but it divides the range of the data into bins that contain the same number of observations, rather than bins of the same length.
Does anyone have a solution to this?
It's not too hard to make the breaks and labels yourself, with something like this. Since the midpoint is a single number, I don't actually return a factor with labels but instead a numeric vector.
cut2 <- function(x, breaks) {
  r <- range(x)
  ## 2*breaks+1 equally spaced points: the odd ones are the bin breaks,
  ## the even ones are the bin midpoints
  b <- seq(r[1], r[2], length=2*breaks+1)
  brk <- b[0:breaks*2+1]
  mid <- b[1:breaks*2]
  brk[1] <- brk[1]-0.01   # nudge the lowest break so min(x) falls inside the first bin
  k <- cut(x, breaks=brk, labels=FALSE)
  mid[k]
}
There's probably a better way to get the bin breaks and midpoints; I didn't think about it very hard.
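For instance, one somewhat tidier way to get the breaks and midpoints is a findInterval-based variant (a sketch, with cut3 as an illustrative name; results should match cut2 except possibly for values falling exactly on a bin boundary):
cut3 <- function(x, breaks) {
  brk <- seq(min(x), max(x), length.out = breaks + 1)
  mid <- head(brk, -1) + diff(brk) / 2
  ## rightmost.closed/all.inside keep the boundary values inside the outer bins
  k <- findInterval(x, brk, rightmost.closed = TRUE, all.inside = TRUE)
  mid[k]
}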
Note that this answer is different from Joshua's; his gives the median of the data in each bin, while this gives the center of each bin.
> head(cut2(x,3))
[1] 16.666667 3.333333 16.666667 3.333333 16.666667 16.666667
> head(ave(x, cut(x,3), FUN=median))
[1] 18 2 18 2 18 18
Use ave like so:
set.seed(21)
x <- sample(0:20, 100, replace=TRUE)
xCenter <- ave(x, cut(x,3), FUN=median)
We can use smart_cut from package cutr:
devtools::install_github("moodymudskipper/cutr")
library(cutr)
Using #Joshua's sample data:
median by interval (same output as @Joshua's, except it's an ordered factor):
smart_cut(x,3, "n_intervals", labels= ~ median(.))
# [1] 18 2 18 2 18 18 ...
# Levels: 2 < 11 < 18
center of each interval (same output as @Aaron's, except it's an ordered factor):
smart_cut(x,3, "n_intervals", labels= ~ mean(.y))
# [1] 16.67 3.333 16.67 3.333 16.67 16.67 ...
# Levels: 3.333 < 10 < 16.67
mean of values by interval :
smart_cut(x,3, "n_intervals", labels= ~ mean(.))
# [1] 17.48 2.571 17.48 2.571 17.48 17.48 ...
# Levels: 2.571 < 11.06 < 17.48
labels can be a character vector, just like in base::cut.default, but it can also be, as it is here, a function of 2 parameters: the first is the values contained in the bin, and the second the cut points of the bin.
more on cutr and smart_cut