r - dplyr: counting the frequency of unique values in one variable for each unique value of another variable in the same data frame - r

So here's a sample of some of the rows from my dataframe:
> data[1:25, c("TR_DATE", "TR_TYPE...")]
TR_DATE TR_TYPE...
1 2016-03-01 4
2 2016-03-01 4
3 2016-03-01 5
4 2016-03-01 4
5 2016-03-01 1
6 2016-03-01 7
7 2016-03-01 4
8 2016-03-01 4
9 2016-03-01 24
10 2016-03-01 23
11 2016-03-01 4
12 2016-03-02 4
13 2016-03-02 1
14 2016-03-02 1
15 2016-03-02 4
16 2016-03-02 4
17 2016-03-02 14
18 2016-03-02 4
19 2016-03-02 4
20 2016-03-03 4
21 2016-03-03 1
22 2016-03-03 4
23 2016-03-03 23
24 2016-03-03 1
25 2016-03-03 4
What I'd like to do exactly is rearrange in such a way that for every unique day, I get the number of unique transaction types and the frequency of each transaction type
Here's the code that I tried:
data %>%
group_by(TR_DATE) %>%
summarise(trancount = n(), trantype = n_distinct(TR_TYPE...))
which gave me part of the result that I wanted:
# A tibble: 68 x 3
TR_DATE trancount trantype
<date> <int> <int>
1 2016-03-01 5816 6
2 2016-03-02 5637 3
3 2016-03-03 4818 3
4 2016-03-04 5070 8
5 2016-03-05 4 2
6 2016-03-08 6707 5
7 2016-03-09 5228 5
8 2016-03-10 4722 6
9 2016-03-11 4469 8
10 2016-03-12 1 1
# ... with 58 more rows
so trantype tells me the number of unique transaction types that happened on a particular day, but I'd like to know the frequency of each of these unique transaction types. What would be the best way to go around doing this?
I tried looking around and found similar questions but was unable to modify the solutions to my requirement.
I'm fairly new to R and would really appreciate some help. Thanks.

You should group by both variables:
data %>%
group_by(TR_DATE, TR_TYPE...) %>%
summarise(trancount = n(), trantype = n_distinct(TR_TYPE...))

Related

Cumulative sums in R with multiple conditions?

I am trying to figure out how to create a cumulative or rolling sum in R based on a few conditions.
The data set in question is a few million observations of library loans, and the question is to determine how many copies of a given book/title would be necessary to meet demand.
So for each Title.ID, begin with 1 copy for the first instance (ID.Index). Then for each instance after, determine whether another copy is needed based on whether the REQUEST.DATE is within 16 weeks (112 days) of the previous request.
# A tibble: 15 x 3
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index
<date> <int> <int>
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5
The tricky part is that determining whether a new copy is needed is based not only on the number of request (ID.Index) and the REQUEST.DATE of some previous loan, but also on the preceding accumulating sum.
For instance, for the third request for title 2 (Title.ID 2, ID.Index 3), there are now two copies, so to determine whether a new copy is needed, you have to see whether the REQUEST.DATE is within 112 days of the first (not second) request (ID.Index 1). By contrast, for the third request for title 6 (Title.ID 6, ID.Index 3), there is only one copy available (since request 2 was not within 112 days), so determining whether a new copy is needed is based on looking back to the REQUEST.DATE of ID.Index 2.
The desired output ("Copies") would take each new request (ID.Index), then look back to the relevant REQUEST.DATE based on the number of available copies, and doing that would mean looking at the accumulating sum for the preceding calculation. (Note: The max number of copies would be 10.)
I've provided the desired output for the sample below ("Copies").
# A tibble: 15 x 4
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index Copies
<date> <int> <int> <dbl>
1 2013-07-09 2 1 1
2 2013-08-07 2 2 2
3 2013-08-20 2 3 3
4 2013-09-08 2 4 4
5 2013-09-28 2 5 5
6 2013-12-27 2 6 5
7 2014-02-10 2 7 5
8 2014-03-12 2 8 5
9 2014-03-14 2 9 5
10 2014-08-27 2 10 5
11 2014-04-27 6 1 1
12 2014-08-01 6 2 2
13 2014-11-13 6 3 2
14 2015-02-14 6 4 2
15 2015-05-14 6 5 2
>
I recognize that the solution will be way beyond my abilities, so I will be extremely grateful for any solution or advice about how to solve this type of problem in the future.
Thanks a million!
*4/19 update: new examples where new copy may be added after delay, i.e., not in sequence. I've also added columns showing days since a given previous request, which helps checking whether a new copy should be added, based on how many copies there are.
Sample 2: new copy should be added with third request, since it has only been 96 days since last request (and there is only one copy)
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10680332 2013-10-17 6 1 NA days NA days NA days NA days NA days 1
2 PEN-10835735 2014-04-27 6 2 192 days NA days NA days NA days NA days 1
3 PEN-10873506 2014-08-01 6 3 96 days 288 days NA days NA days NA days 1
4 PEN-10951264 2014-11-13 6 4 104 days 200 days 392 days NA days NA days 1
5 PEN-11029526 2015-02-14 6 5 93 days 197 days 293 days 485 days NA days 1
6 PEN-11106581 2015-05-14 6 6 89 days 182 days 286 days 382 days 574 days 1
Sample 3: new copy should be added with last request, since there are two copies, and the oldest request is 45 days.
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10999392 2015-01-20 76 1 NA days NA days NA days NA days NA days 1
2 YAL-11004302 2015-01-22 76 2 2 days NA days NA days NA days NA days 2
3 COR-11108471 2015-05-18 76 3 116 days 118 days NA days NA days NA days 2
4 HVD-11136632 2015-07-27 76 4 70 days 186 days 188 days NA days NA days 2
5 MIT-11164843 2015-09-09 76 5 44 days 114 days 230 days 232 days NA days 2
6 HVD-11166239 2015-09-10 76 6 1 days 45 days 115 days 231 days 233 days 2
You can use runner package to apply any R function on cumulative window.
This time we execute function f using x = REQUEST.DATE. We just count number of observations which are within min(x) + 112.
library(dplyr)
library(runner)
data %>%
group_by(Title.ID) %>%
mutate(
Copies = runner(
x = REQUEST.DATE,
f = function(x) {
length(x[x <= (min(x + 112))])
}
)
)
# # A tibble: 15 x 4
# # Groups: Title.ID [2]
# REQUEST.DATE Title.ID ID.Index Copies
# <date> <int> <int> <int>
# 1 2013-07-09 2 1 1
# 2 2013-08-07 2 2 2
# 3 2013-08-20 2 3 3
# 4 2013-09-08 2 4 4
# 5 2013-09-28 2 5 5
# 6 2013-12-27 2 6 5
# 7 2014-02-10 2 7 5
# 8 2014-03-12 2 8 5
# 9 2014-03-14 2 9 5
# 10 2014-08-27 2 10 5
# 11 2014-04-27 6 1 1
# 12 2014-08-01 6 2 2
# 13 2014-11-13 6 3 2
# 14 2015-02-14 6 4 2
# 15 2015-05-14 6 5 2
data
data <- read.table(
text = " REQUEST.DATE Title.ID ID.Index
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5",
header = TRUE)
data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))
I was able to find a workable solution based on finding the max number of other requests within 112 days of a request (after creating return date), for each title.
data$RETURN.DATE <- as.Date(data$REQUEST.DATE + 112)
data <- data %>%
group_by(Title.ID) %>%
mutate(
Copies = sapply(REQUEST.DATE, function(x)
sum(as.Date(REQUEST.DATE) <= as.Date(x) &
as.Date(RETURN.DATE) >= as.Date(x)
))
)
Then I de-duplicated the list of titles, using the max number for each title, and added it back to the original data.
I still think there's a solution to the original problem, where I could go back and see at which point new copies needed to be added (for analysis based on when a title is published), but this works for now.

Group records with time interval overlap

I have a data frame (with N=16) contains ID (character), w_from (date), and w_to (date). Each record represent a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number by ID for the records that their time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 overlap.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, record#3 are all overlap.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(w_from) %>%
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2

R code to get max count of time series data by group

I'd like to get a summary of time series data where group is "Flare" and the max value of the FlareLength is the data of interest for that group.
If I have a dataframe, like this:
Date Flare FlareLength
1 2015-12-01 0 1
2 2015-12-02 0 2
3 2015-12-03 0 3
4 2015-12-04 0 4
5 2015-12-05 0 5
6 2015-12-06 0 6
7 2015-12-07 1 1
8 2015-12-08 1 2
9 2015-12-09 1 3
10 2015-12-10 1 4
11 2015-12-11 0 1
12 2015-12-12 0 2
13 2015-12-13 0 3
14 2015-12-14 0 4
15 2015-12-15 0 5
16 2015-12-16 0 6
17 2015-12-17 0 7
18 2015-12-18 0 8
19 2015-12-19 0 9
20 2015-12-20 0 10
21 2015-12-21 0 11
22 2016-01-11 1 1
23 2016-01-12 1 2
24 2016-01-13 1 3
25 2016-01-14 1 4
26 2016-01-15 1 5
27 2016-01-16 1 6
28 2016-01-17 1 7
29 2016-01-18 1 8
I'd like output like:
Date Flare FlareLength
1 2015-12-06 0 6
2 2015-12-10 1 4
3 2015-12-21 0 11
4 2016-01-18 1 8
I have tried various aggregate forms but I'm not very familiar with the time series wrinkle.
Using dplyr, we can create a grouping variable by comparing the FlareLength with the previous FlareLength value and select the row with maximum FlareLength in the group.
library(dplyr)
df %>%
group_by(gr = cumsum(FlareLength < lag(FlareLength,
default = first(FlareLength)))) %>%
slice(which.max(FlareLength)) %>%
ungroup() %>%
select(-gr)
# A tibble: 4 x 3
# Date Flare FlareLength
# <fct> <int> <int>
#1 2015-12-06 0 6
#2 2015-12-10 1 4
#3 2015-12-21 0 11
#4 2016-01-18 1 8
In base R with ave we can do the same as
subset(df, FlareLength == ave(FlareLength, cumsum(c(TRUE, diff(FlareLength) < 0)),
FUN = max))

Conditional aggregation by regarding missing data in R

I am trying to aggregate hourly data to daily data in R. The problem is missing values. I want to consider a threshold for the number of missing values before the aggregation. If the number of missing values is more than two in the given day, DO NOT compute the daily average and fill that day with NA.
My dummy data are daily data for the first day of 2005.
day hour amount amount2
1 2005-01-01 0 1 1
2 2005-01-01 1 2 NA
3 2005-01-01 2 4 4
4 2005-01-01 3 5 5
5 2005-01-01 4 11 NA
6 2005-01-01 5 4 NA
7 2005-01-01 6 NA NA
8 2005-01-01 7 2 2
9 2005-01-01 8 4 4
10 2005-01-01 9 2 2
11 2005-01-01 10 4 20
12 2005-01-01 11 12 12
13 2005-01-01 12 13 13
14 2005-01-01 13 7 7
15 2005-01-01 14 4 4
16 2005-01-01 15 12 12
17 2005-01-01 16 4 4
18 2005-01-01 17 12 12
19 2005-01-01 18 5 5
20 2005-01-01 19 11 11
21 2005-01-01 20 4 4
22 2005-01-01 21 12 12
23 2005-01-01 22 13 13
24 2005-01-01 23 7 7
What I already got
agg
day amount amount2
1 2005-01-01 6.9 7.7
what I want to have
agg
day amount amount2
1 2005-01-01 6.9 NA
Because the number of missing values in the column amount2 is more than two, I want its daily average filled by NA (not 7.7) while it is calculated for the column amount (6.9).
I have used the function "aggregate" from "stats" library.
library(stats)
amount=(c(1,2,4,5,11,4,NA,2,4,2,4,12,13,7,4,12,4,12,5,11,4,12,13,7))
amount2=(c(1,NA,4,5,NA,NA,NA,2,4,2,20,12,13,7,4,12,4,12,5,11,4,12,13,7))
day=c("2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01")
hour=seq(0,23)
date=data.frame(day,hour)
dummy=cbind(date,amount,amount2)
agg <- aggregate(cbind(amount,amount2) ~ day, dummy, mean)

Creating a monthly growth rate column for different factors in R

Suppose I have the following data
set.seed(123)
Company <- c(rep("Company 1",5),rep("Company 2",10))
Dates <- c(seq(as.Date("2014-09-01"), as.Date("2015-01-01"), by="months"),
seq(as.Date("2011-09-01"), as.Date("2012-06-01"), by="months"))
X.1 <- sample(c(0,0,5,10,15,20,25,30,40,50),size=15,replace=TRUE)
X.2 <- sample(c(0,0,5,10,15,20,25,30,40,50),size=15,replace=TRUE)
df <- data.frame(Dates,Company,X.1,X.2)
Dates Company X.1 X.2
1 2014-09-01 Company 1 50 0
2 2014-10-01 Company 1 50 5
3 2014-11-01 Company 1 25 15
4 2014-12-01 Company 1 30 5
5 2015-01-01 Company 1 0 40
6 2011-09-01 Company 2 15 0
7 2011-10-01 Company 2 30 15
8 2011-11-01 Company 2 5 30
9 2011-12-01 Company 2 10 0
10 2012-01-01 Company 2 5 20
11 2012-02-01 Company 2 0 5
12 2012-03-01 Company 2 15 0
13 2012-04-01 Company 2 15 30
14 2012-05-01 Company 2 10 40
15 2012-06-01 Company 2 0 10
What I am trying to do is find monthly growth rates for variables X.1 and X.2
within each company and bind those columns to the right side of the dataframe. The problem here is that the date ranges for each Company are different, which is why I am having trouble with this. Also, since I have 0s in the data Inf and NAs are okay.
Thanks for your help.
#I computed the growth using log: growth=log(X(t)/X(t-1)). If you want to compute using (X(t)-X(t-1))/X(t-1), you can just use that. Also, for the first period, it will be NA.
#Assumption: the data are equally spaced for each company. You get Inf, if your last period value is 0 and -Inf if your current period value is 0 (because we used log). The growth will be 0 (when current period is zero) if you don't use log (see second method)
library(dplyr)
df%>%
group_by(Company)%>%
mutate(gx_1=log(X.1/lag(X.1,1)),gx_2=log(X.1/lag(X.2,1))
)
Source: local data frame [15 x 6]
Groups: Company
Dates Company X.1 X.2 gx_1 gx_2
1 2014-09-01 Company 1 5 40 NA NA
2 2014-10-01 Company 1 30 5 1.7917595 -0.2876821
3 2014-11-01 Company 1 15 0 -0.6931472 1.0986123
4 2014-12-01 Company 1 40 10 0.9808293 Inf
5 2015-01-01 Company 1 50 50 0.2231436 1.6094379
6 2011-09-01 Company 2 0 40 NA NA
7 2011-10-01 Company 2 20 25 Inf -0.6931472
8 2011-11-01 Company 2 40 25 0.6931472 0.4700036
9 2011-12-01 Company 2 20 50 -0.6931472 -0.2231436
10 2012-01-01 Company 2 15 25 -0.2876821 -1.2039728
11 2012-02-01 Company 2 50 30 1.2039728 0.6931472
12 2012-03-01 Company 2 15 20 -1.2039728 -0.6931472
13 2012-04-01 Company 2 25 20 0.5108256 0.2231436
14 2012-05-01 Company 2 20 5 -0.2231436 0.0000000
15 2012-06-01 Company 2 0 0 -Inf -Inf
#without using log , i.e.
df%>%
group_by(Company)%>%
mutate(gx_1=((X.1-lag(X.1,1))/lag(X.1,1)),gx_2=((X.2-lag(X.2,1))/lag(X.2,1)))

Resources