Given a dataframe containing time delays: how to tell by what point 90% of delays have occurred - r

I have a set of observations that measure time delay from an initial event, such as the elapsed time from when an email is sent to when it is opened.
Given a set of 100 observations, how can I tell at what point in time 90 percent of the opens took place? I want to be able to say "90% of the opens took place within 4 hours of send time."
I can generate a histogram of delays, which shows that most opens happen early, but I do not know how to get a cumulative measure for all counts in the bins. (I'm not explaining myself very well; I'm not a stats wonk.)
So with this sample data I have 10 observations with a delay of 1 hour, 5 with a delay of 2 hours, 3 with a delay of 3 hours and 2 with a delay of 4 hours. This means that 90% of the opens came within less than 4 hours. How do I determine that 90% limit for a real set of observations?
Edited with more compact sample data creation and added first cut at plot of cumulative percentage. Would welcome better solutions.
library(tidyverse)
library(ggplot2)
all_delays <- tibble(delay = rep(1:4, c(10, 5, 3, 2)))
all_delays
#> # A tibble: 20 x 1
#> delay
#> <int>
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 1
#> 7 1
#> 8 1
#> 9 1
#> 10 1
#> 11 2
#> 12 2
#> 13 2
#> 14 2
#> 15 2
#> 16 3
#> 17 3
#> 18 3
#> 19 4
#> 20 4
# histogram of data
ggplot(all_delays) + aes(delay) +
  geom_histogram() +
  scale_y_continuous(breaks = seq(0, 10, 1))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# newbie incorrect way to get plot of cumulative percentage.
# would welcome better way to do this.
all_delays <- all_delays %>%
  mutate(cnt = 1) %>%
  arrange(delay) %>%
  mutate(cs = cumsum(cnt))
ggplot(all_delays) + aes(cs / nrow(all_delays), delay) +
  geom_line() +
  scale_x_continuous(breaks = c(0, .25, .50, .75, .90, 1),
                     labels = c("0", "25%", "50%", "75%", "90%", "100%")) +
  geom_vline(xintercept = .9) +
  xlab("Cumulative percentage of opens") +
  ylab("Hours since send")
Created on 2019-04-27 by the reprex package (v0.2.1)
I guess my expected results are something that would say "90% limit = 3", or some kind of cumulative curve that would start at the shortest open delay and then increase in value until 100 % was reached with a tick at 90 %.
Thanks for the quantile() answer!
Email open rates typically have a long tail: most activity happens within a day or two of the email send, followed by a very long tail as people browse their email inboxes weeks or even months after the email was sent.

What you describe is called a quantile. The code below removes all delays above the 90th percentile; i.e., the remaining delays give you the points by which 90% of the events have occurred.
> all_delays %>% filter(delay <= quantile(delay, 0.9))
# A tibble: 18 x 1
delay
<dbl>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 3
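If you just want the number itself, quantile() returns it directly, and ggplot2's stat_ecdf() draws the cumulative curve described in the question without the manual cumsum() bookkeeping. A minimal sketch (my own addition, reusing the all_delays tibble from above):
library(tidyverse)

# the single number: the delay (in hours) by which 90% of opens occurred
quantile(all_delays$delay, 0.9)

# the cumulative curve, with a dashed line at the 90% mark
ggplot(all_delays, aes(delay)) +
  stat_ecdf() +
  geom_hline(yintercept = 0.9, linetype = "dashed") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Hours since send", y = "Cumulative percentage of opens")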

Related

Acoustic complexity index time series output

I have a wav file and I would like to calculate the Acoustic Complexity Index at each second and receive a time series output.
I understand how to modify other settings within a function like seewave::ACI() but I am unable to find out how to output a time series data frame where each row is one second of time with the corresponding ACI value.
For a reproducible example, this audio file is 20 seconds, so I'd like the output to have 20 rows, with each row printing the ACI for that 1-second of time.
library(soundecology)
data(tropicalsound)
acoustic_complexity(tropicalsound)
In fact, I'd like to achieve this for a few other indices as well, for example:
soundecology::ndsi(tropicalsound)
soundecology::acoustic_evenness(tropicalsound)
You can subset your wav file according to the samples it contains. Since the sampling frequency can be obtained from the wav object, we can get one-second subsets of the file and perform our calculations on each. Note that you have to set the cluster size to 1 second, since the default is 5 seconds.
library(soundecology)
data(tropicalsound)
f <- tropicalsound@samp.rate
starts <- head(seq(0, length(tropicalsound), f), -1)
aci <- sapply(starts, function(i) {
  aci <- acoustic_complexity(tropicalsound[i + seq(f)], j = 1)
  aci$AciTotAll_left
})
nds <- sapply(starts, function(i) {
  nds <- ndsi(tropicalsound[i + seq(f)])
  nds$ndsi_left
})
aei <- sapply(starts, function(i) {
  aei <- acoustic_evenness(tropicalsound[i + seq(f)])
  aei$aei_left
})
This allows us to create a second-by-second data frame representing a time series of each measure:
data.frame(time = 0:19, aci, nds, aei)
#> time aci nds aei
#> 1 0 152.0586 0.7752307 0.438022
#> 2 1 168.2281 0.4171902 0.459380
#> 3 2 149.2796 0.9366220 0.516602
#> 4 3 176.8324 0.8856127 0.485036
#> 5 4 162.4237 0.8848515 0.483414
#> 6 5 161.1535 0.8327568 0.511922
#> 7 6 163.8071 0.7532586 0.549262
#> 8 7 156.4818 0.7706808 0.436910
#> 9 8 156.1037 0.7520663 0.489253
#> 10 9 160.5316 0.7077717 0.491418
#> 11 10 157.4274 0.8320380 0.457856
#> 12 11 169.8831 0.8396483 0.456514
#> 13 12 165.4426 0.6871337 0.456985
#> 14 13 165.1630 0.7655454 0.497621
#> 15 14 154.9258 0.8083035 0.489896
#> 16 15 162.8614 0.7745876 0.458035
#> 17 16 148.6004 0.1393345 0.443370
#> 18 17 144.6733 0.8189469 0.458309
#> 19 18 156.3466 0.6067827 0.455578
#> 20 19 158.3413 0.7175293 0.477261
Note that this is simply a demonstration of how to achieve the desired output; you would need to check the literature to determine whether it is appropriate to use these measures over such short time periods.
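If you need to run this for several recordings, the three loops above can be folded into one helper that builds the whole data frame in a single pass. This is only a sketch (my addition), reusing the same calls and fields as above and assuming the input is a tuneR Wave object like tropicalsound:
library(soundecology)

per_second_indices <- function(wave) {
  f <- wave@samp.rate
  starts <- head(seq(0, length(wave), f), -1)
  do.call(rbind, lapply(seq_along(starts), function(k) {
    clip <- wave[starts[k] + seq(f)]  # one-second slice of samples
    data.frame(
      time = k - 1,
      aci  = acoustic_complexity(clip, j = 1)$AciTotAll_left,
      nds  = ndsi(clip)$ndsi_left,
      aei  = acoustic_evenness(clip)$aei_left
    )
  }))
}

data(tropicalsound)
head(per_second_indices(tropicalsound))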

Ridge plot: sort by value / rank

I have a data set which I uploaded here as a gist in CSV format.
It is the extracted form of the PDFs provided in the YouGov article "How good is 'good'?". People were asked to rate words (e.g. "perfect", "bad") with a score between 0 (very negative) and 10 (very positive). The gist contains exactly that data, i.e. for every word (column: Word) it stores, for every rating from 0 to 10 (column: Category), the number of votes (column: Total).
I would usually try to visualize the data with matplotlib and Python since I lack knowledge in R, but it seems that ggridges can create way nicer plots than I see myself doing with Python.
Using:
library(readr)
library(ggplot2)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov, aes(x = Category, y = Word, height = Total, group = Word, fill = Word)) +
  geom_density_ridges(stat = "identity", scale = 3)
I was able to create this plot (which is still far from perfect).
Ignoring the fact that I have to tweak the aesthetics, there are three things I struggle to do:
Sort the words by their average rank.
Color the ridge by the average rank.
Or color the ridge by the category value, i.e. with varying color.
I tried to adapt the suggestions from this source, but ultimately failed because my data seems to be in the wrong format: Instead of having single instances of votes, I already have the aggregated vote count for each category.
I hope to end up with a result closer to this plot, which satisfies criterion 3 (source).
It took me a little while to get there myself. The key for me was understanding the data and how to order Word based on the average Category score. So let's look at the data first:
> YouGov
# A tibble: 440 x 17
ID Word Category Total Male Female `18 to 35` `35 to 54` `55+`
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 Incr~ 0 0 0 0 0 0 0
2 1 Incr~ 1 1 1 1 1 1 0
3 2 Incr~ 2 0 0 0 0 0 0
4 3 Incr~ 3 1 1 1 1 1 1
5 4 Incr~ 4 1 1 1 1 1 1
6 5 Incr~ 5 5 6 5 6 5 5
7 6 Incr~ 6 6 7 5 5 8 5
8 7 Incr~ 7 9 10 8 10 7 10
9 8 Incr~ 8 15 16 14 13 15 16
10 9 Incr~ 9 20 20 20 22 18 19
# ... with 430 more rows, and 8 more variables: Northeast <dbl>,
# Midwest <dbl>, South <dbl>, West <dbl>, White <dbl>, Black <dbl>,
# Hispanic <dbl>, `Other (NET)` <dbl>
Every Word has a row for every Category (or score, 0-10). The Total provides the number of responses for that Word/Category combination. So although there were no responses where the word "Incredible" scored zero, there is still a row for it.
Before we calculate the average score for each Word, we calculate the product of Category and Total for each Word-Category combination; let's call it total_score. From there, we can treat Word as a factor and reorder it based on the average total_score using forcats. After that, you can plot your data just as you did.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
YouGov %>%
  mutate(total_score = Category * Total) %>%
  mutate(Word = fct_reorder(.f = Word, .x = total_score, .fun = mean)) %>%
  ggplot(aes(x = Category, y = Word, height = Total, group = Word, fill = Word)) +
  geom_density_ridges(stat = "identity", scale = 3)
By treating Word as a factor we reordered the words based on their mean total_score. ggplot also orders the fill colors accordingly, so we don't have to adjust them ourselves unless you'd prefer a different color palette.
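If you would rather order by the vote-weighted average score (which can differ from the mean of total_score when words received different numbers of votes), one option is to compute the average explicitly and set the factor levels from it. A sketch of that variation (my addition, building on the code above):
library(tidyverse)
library(ggridges)

avg_scores <- YouGov %>%
  group_by(Word) %>%
  summarise(avg_score = sum(Category * Total) / sum(Total))  # weighted mean rating

YouGov %>%
  mutate(Word = factor(Word, levels = avg_scores$Word[order(avg_scores$avg_score)])) %>%
  ggplot(aes(x = Category, y = Word, height = Total, group = Word, fill = Word)) +
  geom_density_ridges(stat = "identity", scale = 3)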
The other solution is exactly correct. I just wanted to point out that you can call fct_reorder() from within aes() for an even more compact solution. However, you need to do it twice if you want to change fill color by position along the y axis.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
       aes(
         x = Category,
         y = fct_reorder(Word, Category * Total, .fun = sum),
         height = Total,
         fill = fct_reorder(Word, Category * Total, .fun = sum)
       )) +
  geom_density_ridges(stat = "identity", scale = 3) +
  theme(legend.position = "none")
Created on 2020-01-19 by the reprex package (v0.3.0)
If instead you want to color by x position, you can do something like the following. It just doesn't look as nice as the temperature example because the x values come in discrete steps.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
       aes(
         x = Category,
         y = fct_reorder(Word, Category * Total, .fun = sum),
         height = Total,
         fill = stat(x)
       )) +
  geom_density_ridges_gradient(stat = "identity", scale = 3) +
  theme(legend.position = "none") +
  scale_fill_viridis_c(option = "C")
Created on 2020-01-19 by the reprex package (v0.3.0)

Summing depth data (consecutive rows) in R

How is it possible to sum up consecutive depth data with R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")),
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The given output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First the removal of negative values is done with cummax(), because depth can only increase in this special case. Hence:
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now, taking the maximum minus the minimum of the depth within each consecutive run, you would get the following (the question is how to do this step):
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
And finally summing up those max-min values the output is the one presented above.
Steps tried to achieve the output:
The first obvious solution would be for instance for Air:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is the output also takes into account the big steps in between each consecutive subset. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance < 2, for the example of Air once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives you almost the right result but does not work as expected here. I'm pretty sure there is already a function for what I'm looking for, because it is not an uncommon problem for many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct it with rle if you really like. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"),
                depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))
a2 <- a %>%
  # filter to rows where previous value is lower, equal, or NA
  filter(depth >= lag(depth) | is.na(lag(depth))) %>%
  # group by label and its run
  group_by(label, run = data.table::rleid(label)) %>%
  summarise(depth = max(depth) - min(depth))  # aggregate

a2 %>% arrange(run)  # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
A base R method using aggregate is
aggregate(cbind(val = cummax(a$depth)),
          list(label = a$label, ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))
The first argument to aggregate calculates the cumulative maximum of the input vector, as the OP does above; wrapping it in cbind names the calculated column in the final output. The second argument is the grouping argument: instead of rle, it identifies runs by taking the cumulative sum of the points where the label changes. Finally, the third argument provides the function that calculates the desired output by taking the difference of the range within each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
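This stops at the per-run gains; to reach the totals requested in the question, sum them by label with one more aggregation. A short sketch of that final step (my addition), assuming a as defined in the question, where label is a factor:
res <- aggregate(cbind(val = cummax(a$depth)),
                 list(label = a$label,
                      ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
                 function(x) diff(range(x)))
aggregate(val ~ label, res, sum)
# expected: Air 7, Stone 1, Wood 3, matching the desired output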
The data.table way (borrowing in part from @alistaire):
library(data.table)

setDT(a)
a[, depth := cummax(depth)]
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1],  # only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]

Normalize/scale data set

I have the following data set:
dat<-as.data.frame(rbind(10,8,2,7,10,10,1,10,14,9,2,6,10,8,10,8,10,10,7,11,10))
colnames(dat)<-"Score"
print(dat)
Score
10
8
2
7
10
10
1
10
14
9
2
6
10
8
10
8
10
10
7
11
10
These are the test scores which students obtained. A student could get a maximum of 15 or a minimum of 0 in this test, but nobody got the max or the min: the lowest score obtained was 1 and the highest was 14.
Now, I want to normalize/scale this data to the scale of 0 to 20.
How do I achieve this in Excel or in R?
My final goal is to normalize the scores in this test to the above scale and to compare them with another set of data for which the max and min is 5 and 0 respectively.
How do I compare these two differently scaled data sets correctly against each other?
What I tried:
I went through a lot of material on the internet and came up with the min-max feature scaling formula, x' = (x - min(x)) / (max(x) - min(x)), which I got from Wikipedia.
Is this method reliable?
In your case I would use the feature scaling formula you posted in your question. The (x - min(x)) / (max(x) - min(x)) part will essentially convert your test marks to the range 0-1.
Since the edges of the scale are 0 and 15 (not the observed 1 and 14), your min(x) = 0 and your max(x) = 15. Once you have your marks between 0 and 1, you just multiply by 20.
i.e.
tests <- read.table(header=T, file='clipboard')
tests2 <- (tests - 0) / (15 - 0) #or equally tests / 15
And multiply by 20 to get marks between 0-20:
> tests2 * 20
Score
1 13.333333
2 10.666667
3 2.666667
4 9.333333
5 13.333333
6 13.333333
7 1.333333
8 13.333333
9 18.666667
10 12.000000
11 2.666667
12 8.000000
13 13.333333
14 10.666667
15 13.333333
16 10.666667
17 13.333333
18 13.333333
19 9.333333
20 14.666667
21 13.333333
The results are intuitive and the function is reliable. For example the person who scored 14/15 should get the highest mark (and very close to 20) which is the case here (after the transformation they scored 18.6666).
In Excel, if you want the normalized data to have a min of 0 and a max of 20, then we need to solve:
y = A * x + b
for two points.
Put the max of the raw data in C1:
=MAX(A:A)
Put the min of the raw data in C2:
=MIN(A:A)
Put the desired max in D1 and the desired min in D2. Put the formula for the A-coefficient in C3:
=($D$1-$D$2)/($C$1-$C$2)
and the formula for the B-coefficient in C4:
=$D$1-$C$3*$C$1
Finally put the scaling formula in B1:
=A1*$C$3+$C$4
and copy down.
Naturally, if you want the scaling to be independent of the raw max or min, you would use 15 in C1 and 0 in C2.
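For reference, the same two-point linear map can be written in R. This is just a sketch (my addition, not part of the Excel answer); swap in min(x) and max(x) for the raw range if you want to stretch the observed scores instead:
raw_min <- 0;  raw_max <- 15   # the test's possible range
new_min <- 0;  new_max <- 20   # the target range

A <- (new_max - new_min) / (raw_max - raw_min)  # slope
b <- new_max - A * raw_max                      # intercept

scores <- c(10, 8, 2, 7, 10, 10, 1, 10, 14, 9, 2, 6, 10, 8, 10, 8, 10, 10, 7, 11, 10)
round(A * scores + b, 2)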
You can scale between 0 to 20 with this command in R:
newvalue <- 20/(max(score)-min(score))*(score-min(score))
The math way is fairly straightforward if the floor for all scales is 0.
new_value = new_ceiling * old_value / old_ceiling
The next formula will account for different floors on each scale:
new_value = new_floor + (new_ceiling - new_floor) * ((old_value - old_floor) / (old_ceiling - old_floor)), which is actually the formula you posted from Wikipedia. ;)
Hope this helps!
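As a quick worked check of that general formula (my own example, using the second data set's 0-5 range from the question): a score of 3 on the 0-5 test lands at 12 on the common 0-20 scale.
new_floor <- 0; new_ceiling <- 20
old_floor <- 0; old_ceiling <- 5   # range of the second data set
old_value <- 3

new_floor + (new_ceiling - new_floor) *
  (old_value - old_floor) / (old_ceiling - old_floor)
#> [1] 12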
That is very simple. Since both scales are linear and share a floor of 0, a simple ratio will do the work. In other words, each grade in your set needs to be multiplied by 20/15.
Here's a little R function that helps if you need to repeat the operation and gives you some flexibility in what you rescale to. One must also be careful with NA values, because min() and max() do not drop them by default and will then return NA, so I provided an option to handle NA values (they are dropped by default).
# function rescales data from 0 to 1 and optionally multiplies by new max
rescale <- function(x, new_max = 1, na.rm = TRUE) {
  as.vector(new_max * scale(x,
                            center = min(x, na.rm = na.rm),
                            scale = max(x, na.rm = na.rm) - min(x, na.rm = na.rm)))
}
# old scores
scores <- c(10,8,2,7,10,10,1,10,14,9,2,6,10,8,10,8,10,10,7,11,10)
# new scores
data.frame(old = scores,
           new = rescale(scores, new_max = 20))
#> old new
#> 1 10 13.846154
#> 2 8 10.769231
#> 3 2 1.538462
#> 4 7 9.230769
#> 5 10 13.846154
#> 6 10 13.846154
#> 7 1 0.000000
#> 8 10 13.846154
#> 9 14 20.000000
#> 10 9 12.307692
#> 11 2 1.538462
#> 12 6 7.692308
#> 13 10 13.846154
#> 14 8 10.769231
#> 15 10 13.846154
#> 16 8 10.769231
#> 17 10 13.846154
#> 18 10 13.846154
#> 19 7 9.230769
#> 20 11 15.384615
#> 21 10 13.846154
Created on 2022-03-10 by the reprex package (v2.0.1)

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on the MLTobs from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info dataframe. Year contains the period that the life table was taken, in intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals. Thankfully none of them overlap. I want to simplify the intervals so intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to get the modulo of each interval and sort them into their new intervals that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or bar plot to illustrate this nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
  arrange(Year) %>%
  mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
  mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",
                           bottom[findInterval(year2, bottom) + 1] - 1))
View(new_df)
So what this does is create five-year bins and output a new column (new_year) labelled by the bin that each interval's end year falls into. Everything from 1750-1754 will therefore correspond to a new value of "1750-1754" (in string form; the original is an integer type, and I'm not sure how to fix that). Does this do what you want? Double-check the results, but it looks right to me.
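To get the bar plot mentioned at the end of the question, once new_year exists it is a one-liner in ggplot2. A small sketch (my addition, assuming new_df from the code above):
library(ggplot2)

# number of life tables per simplified five-year interval
ggplot(new_df, aes(new_year)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))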
