I have a set of observations that measure time delay from an initial event, such as the elapsed time from when an email is sent to when it is opened.
Given a set of 100 observations, how can I tell at what point in time 90 percent of the opens took place? I want to be able to say "90% of the opens took place within 4 hours of send time."
I can generate a histogram of delays, which shows that most opens happen early, but I do not know how to get a cumulative measure for all counts in the bins. (I'm not explaining myself very well, not a stats wonk)
So with this sample data I have 10 observations with a delay of 1 hour, 5 with a delay of 2 hours, 3 with a delay of 3 hours, and 2 with a delay of 4 hours. This means that 90% of the opens (18 of 20) came within 3 hours, i.e. in less than 4 hours. How do I determine that 90% limit for a real set of observations?
Edited with more compact sample data creation and added first cut at plot of cumulative percentage. Would welcome better solutions.
library(tidyverse)
library(ggplot2)
all_delays <- tibble(delay = rep(1:4, c(10, 5, 3, 2)))
all_delays
#> # A tibble: 20 x 1
#> delay
#> <int>
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 1
#> 7 1
#> 8 1
#> 9 1
#> 10 1
#> 11 2
#> 12 2
#> 13 2
#> 14 2
#> 15 2
#> 16 3
#> 17 3
#> 18 3
#> 19 4
#> 20 4
# histogram of data
ggplot(all_delays) + aes(delay) +
  geom_histogram() +
  scale_y_continuous(breaks = seq(0, 10, 1))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# newbie incorrect way to get plot of cumulative percentage.
# would welcome better way to do this.
all_delays <- all_delays %>% mutate(cnt = 1) %>%
  arrange(delay) %>%
  mutate(cs = cumsum(cnt))
ggplot(all_delays) + aes(cs / nrow(all_delays), delay) +
  geom_line() +
  scale_x_continuous(breaks = c(0, .25, .50, .75, .90, 1),
                     labels = c("0", "25%", "50%", "75%", "90%", "100%")) +
  geom_vline(xintercept = .9) +
  xlab("Cumulative percentage of opens") +
  ylab("hours from send to open")
Created on 2019-04-27 by the reprex package (v0.2.1)
I guess my expected results are something that would say "90% limit = 3", or some kind of cumulative curve that starts at the shortest open delay and increases until 100% is reached, with a tick at 90%.
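One possible sketch of such a curve, using ggplot2's stat_ecdf() so that no manual cumsum() column is needed (this assumes the same all_delays tibble as above; the axis choices are only illustrative):
ggplot(all_delays) + aes(delay) +
  stat_ecdf() +
  geom_hline(yintercept = 0.9, linetype = "dashed") +
  scale_y_continuous(breaks = c(0, .25, .50, .75, .90, 1),
                     labels = c("0", "25%", "50%", "75%", "90%", "100%")) +
  xlab("hours from send to open") +
  ylab("Cumulative percentage of opens")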
Thanks for the quantile() answer!
Email open delays typically have a long-tailed distribution: most activity happens within a day or two of the email send, followed by a very long tail as people browse their email inboxes weeks or even months after the email was sent.
What you describe is called a quantile. The code below keeps only the delays at or below the 90th percentile; the largest remaining delay is the time by which 90% of the opens had occurred.
> all_delays %>% filter(delay <= quantile(delay, 0.9))
# A tibble: 18 x 1
delay
<dbl>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 3
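For the headline number itself, a small sketch calling quantile() directly; note that the default method (type = 7) interpolates between observed delays, while type = 1 returns an observed value, which matches the "90% limit = 3" reading of the sample data:
quantile(all_delays$delay, 0.9)            # interpolated 90th percentile
quantile(all_delays$delay, 0.9, type = 1)  # nearest observed delay (3 here)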
I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a <- c(1, 5, 3, 4, 5)
b <- c(6, 6, 5, 6, 8)
c <- c(7, 8, 8, 10, 9)
d <- c(8, 10, 9, 11, 12)
df <- as.data.frame(cbind(a, b, c, d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price <- rbind(df$a, df$b, df$c, df$d)
price <- unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot the four cumulative curves. So far, I accomplished this plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meaning, and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price-range buckets instead of specific values; hence the desire to switch to exact values and the need to generate the unique price variable.
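If it helps, here is one hedged sketch of how those cumulative shares could be computed at exact price points instead of buckets, using the mock df above and base R's ecdf(); the object names are illustrative, and whether each curve needs the 1 - ... inversion should be checked against the bucketed version:
prices <- sort(unique(unlist(df)))           # exact observed price points
psm <- data.frame(
  price         = prices,
  too.cheap     = ecdf(df$a)(prices),        # cumulative share of "too cheap" prices at or below each point
  not.bargain   = 1 - ecdf(df$b)(prices),    # inverted cumulative share for "bargain"
  not.expensive = 1 - ecdf(df$c)(prices),    # inverted cumulative share for "expensive"
  too.expensive = ecdf(df$d)(prices)         # cumulative share of "too expensive" prices at or below each point
)
The intersections can then be located on the exact price grid (or by interpolating between adjacent prices) rather than on $0.50 buckets.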
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter), which has known practical limitations but is relevant in the context of my research, which will explore consumer perceptions of value under different treatments rather than define an actual real-world price. I mention this for two reasons: 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than on whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
When we use rbind, it creates a matrix, and calling unique on a matrix dispatches to the unique.matrix method
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which looks for unique rows (the default MARGIN is 1) rather than unique elements. Instead, if we convert 'price' to a vector, with either as.vector() or c(price), we get the unique elements:
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
If we use unique.default
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
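And to get the requested single-column data frame rather than a bare vector, a minimal sketch (the name price follows the question):
price <- data.frame(price = sort(unique(unlist(df))))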
How is it possible to sum up consecutive depth data with R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")),
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The desired output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First, decreasing values (including the negative one) are flattened out with cummax(), because depth can only increase in this special case. Hence:
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now, taking max minus min of the depth within each consecutive run of rows, you would get the following (this is the step I don't know how to do):
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
And finally summing up those max-min values the output is the one presented above.
Steps tried to achieve the output:
The first obvious solution, for instance for Air, would be:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is the output also takes into account the big steps in between each consecutive subset. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance < 2, for the Air example once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives you almost the right result, but it does not work as expected in general. I'm pretty sure there is already a function for what I'm looking for, because it is not an uncommon problem for many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct it with rle if you really like. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"),
                depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))
a2 <- a %>%
  # filter to rows where the previous value is lower, equal, or NA
  filter(depth >= lag(depth) | is.na(lag(depth))) %>%
  # group by label and its run
  group_by(label, run = data.table::rleid(label)) %>%
  summarise(depth = max(depth) - min(depth))  # aggregate within each run
a2 %>% arrange(run)  # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
A base R method using aggregate is
aggregate(cbind(val = cummax(a$depth)),
          list(label = a$label, ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))
The first argument to aggregate calculates the cumulative maximum of the input vector, as the OP does above; wrapping it in cbind names the calculated column in the output. The second argument is the grouping list: rather than using rle, it builds a run ID by taking the cumulative sum of the points where the label changes. Finally, the third argument is the function that calculates the desired output by taking the difference of the range (max minus min) within each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
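To finish with the per-label totals asked for in the question (Air 7, Wood 3, Stone 1), a possible follow-up is to store the result above and aggregate once more; this sketch assumes label is a factor, as in the OP's data:
res <- aggregate(cbind(val = cummax(a$depth)),
                 list(label = a$label, ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
                 function(x) diff(range(x)))
aggregate(val ~ label, res, sum)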
The data.table way (borrowing in part from @alistaire):
library(data.table)

setDT(a)
a[, depth := cummax(depth)]
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1],  # Only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]
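result should then contain the per-label totals, matching the desired output in the question (rows ordered by first appearance of each label):
result
#    label depth
# 1:   Air     7
# 2:  Wood     3
# 3: Stone     1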
So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of that event, as well as data regarding the user specifically. Multiple users can take part in the same event.
For clarity, here is a simplified example of such a table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 3 2 2
1 1/1/2017 $15 150 2 2 1 2
2 2/1/2017 $50 60 1 1 5 1
2 2/1/2017 $45 100 4 3 5 2
3 3/1/2017 $25 75 1 2 3 1
3 3/1/2017 $20 210 2 5 5 1
3 3/1/2017 $25 120 3 1 0 4
3 3/1/2017 $15 100 4 3 1 1
4 4/1/2017 $75 25 4 0 2 1
My goal is to build a model that, given a specific user's performance history (in the example, attributes X, Y and Z), predicts the revenue and time for an event.
What I am after now is a way to format my data in order to train and test such a model. More specifically, I want to transform the table so that each row keeps the event-specific information while presenting the moving average of each user's attributes up until the previous event. An example of the thought process could be: up until an event, a user presents averages of 2, 3.5, and 1.5 in attributes X, Y and Z respectively, and the revenue and time outcomes of that event were $25 and 75; I will now use this as an input for my training.
Once again for clarity, here is an example of the output I would expect applying such logic on the original table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 0 0 0
1 1/1/2017 $15 150 2 0 0 0
2 2/1/2017 $50 60 1 3 2 2
2 2/1/2017 $45 100 4 0 0 0
3 3/1/2017 $25 75 1 2 3.5 1.5
3 3/1/2017 $20 210 2 2 1 2
3 3/1/2017 $25 120 3 0 0 0
3 3/1/2017 $15 100 4 3 5 2
4 4/1/2017 $75 25 4 3 3 1.5
Notice that in each user's first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of their first appearance. In rows 5 and 9, the third appearances of users 1 and 4 start to show the rolling mean of their previous performances.
If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of his attributes, and then shifting only the data in the attribute columns down one row. My questions are:
Is there a way to perform such shift filtered by UserID, when dealing with a table with multiple users?
Or is there a better way in R to calculate the rolling mean directly from the original table by always placing a result in each user's next appearance?
It can be assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.
Also, it wasn't obvious how to summarize my question in a one-line title, so I'm open to suggestions from any R experts who might think of a better way to describe it.
We can achieve your desired output using the dplyr package.
library(dplyr)
tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate_at(c("X", "Y", "Z"), cummean) %>%
  mutate_at(c("X", "Y", "Z"), lag) %>%
  mutate_at(c("X", "Y", "Z"), funs(ifelse(is.na(.), 0, .))) %>%
  arrange(EventID, UserID) %>%
  ungroup()
We arrange the data, group it, and then apply the desired transformations (the dplyr functions cummean, lag, and replacing NA with 0 using an ifelse).
Once this is done, we rearrange the data to its original state, and ungroup it.
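In case it is useful on a newer dplyr where funs() has been retired, here is a hedged sketch of the same pipeline written with across() (still using the OP's table name tablinka):
library(dplyr)

tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate(across(c(X, Y, Z), cummean)) %>%
  mutate(across(c(X, Y, Z), lag)) %>%
  mutate(across(c(X, Y, Z), ~ ifelse(is.na(.x), 0, .x))) %>%
  arrange(EventID, UserID) %>%
  ungroup()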
I have a data frame with around 25,000 records and 10 columns. I am using code that builds each value of a column (NewVal) from the previous value in that same column, applying a percent change that is already stored in another column (y).
x <- 1:25000
y <- rpois(25000, 2)
z <- data.frame(x, y)
z[1, 'NewVal'] <- z[1, 'x']
So I ran this:
for (i in 2:nrow(z)) { z$NewVal[i] <- z$NewVal[i - 1] + (z$NewVal[i - 1] * (z$y[i] / 100)) }
This takes considerably longer than I expected it to. Granted I may be an impatient person - as a scathing letter drafted to me once said - but I am trying to escape the world of Excel (after I read http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html, which is causing me more problems as I have begun to mistrust data - that letter also mentioned my trust issues).
I would like to do this without using any of the functions from packages as I would like to know what the formula for creating the values is - or if you will, I am a demanding control freak according to that friendly missive.
I would also like to know how to get a moving average just like rollmean in caTools. Either that or how do I figure out what their formula is? I tried entering rollmean and I think it refers to another function (I am new to R). This should probably be another question - but as that letter said, I don't ever make the right decisions in my life.
The secret in R is to vectorise. In your example you can use cumprod to do the heavy lifting:
z$NewVal2 <- x[1] * cumprod(with(z, 1 + c(0, y[-1] / 100)))
all.equal(z$NewVal, z$NewVal2)
[1] TRUE
head(z, 10)
x y NewVal NewVal2
1 25 4 25.00000 25.00000
2 24 3 25.75000 25.75000
3 23 0 25.75000 25.75000
4 22 1 26.00750 26.00750
5 21 3 26.78773 26.78773
6 20 2 27.32348 27.32348
7 19 2 27.86995 27.86995
8 18 3 28.70605 28.70605
9 17 4 29.85429 29.85429
10 16 2 30.45138 30.45138
On my machine, the loop takes just less than 3 minutes to run, while the cumprod statement is virtually instantaneous.
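For anyone curious why this works, a short sketch: the loop computes the recurrence NewVal[i] = NewVal[i-1] * (1 + y[i]/100) with NewVal[1] = x[1]; unrolling it gives NewVal[i] = x[1] * (1 + y[2]/100) * ... * (1 + y[i]/100), i.e. x[1] times the cumulative product of the per-row growth factors (with a factor of 1 for the first row):
growth  <- 1 + c(0, z$y[-1] / 100)   # per-row growth factors; 1 for the first row
NewVal3 <- z$x[1] * cumprod(growth)  # same values as the loop, fully vectorised
all.equal(z$NewVal, NewVal3)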
I got about an 800-fold improvement with Reduce:
system.time(z[, "NewVal"] <- Reduce("*", c(1, 1 + z$y[-1] / 100), accumulate = TRUE))
user system elapsed
0.139 0.008 0.148
> head(z)
x y NewVal
1 1 1 1.000
2 2 1 1.010
3 3 1 1.020
4 4 5 1.071
5 5 1 1.082
6 6 2 1.103
7 7 2 1.126
8 8 3 1.159
9 9 0 1.159
10 10 1 1.171
> system.time(for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+
(z$NewVal[i-1]*(z$y[i]/100))})
user system elapsed
37.29 106.38 143.16
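One hedged footnote on the Reduce version: Reduce("*", ..., accumulate = TRUE) gives only the cumulative product of the growth factors, which equals NewVal here because the starting value x[1] happens to be 1 in this data; for a general starting value the result would still need that scaling, for example:
z$NewVal4 <- z$x[1] * Reduce("*", c(1, 1 + z$y[-1] / 100), accumulate = TRUE)  # NewVal4 is an illustrative name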