Summarize variable for different time periods and by group using ddply - r

I am trying to summarize a sales report by client and get the total sales for different periods of time:
Client Q Sales Date
A 2 30 01/01/2014
A 3 24 02/01/2014
A 1 10 03/01/2014
B 4 10 01/01/2014
B 1 20 02/01/2014
B 3 30 03/01/2014
I am able to summarize by client using ddply:
rapport <- ddply(df, .(Client), summarise,
                 Q = sum(Q),
                 Sales = sum(Sales))
Client Q Sales
A 6 64
B 8 60
I would like to add an extra column with sales only for the date 03/01/2014
Client Q Sales Sales03/01/2014
A 6 64 10
B 8 60 30

DF <- read.table(text=" Client Q Sales Date
A 2 30 01/01/2014
A 3 24 02/01/2014
A 1 10 03/01/2014
B 4 10 01/01/2014
B 1 20 02/01/2014
B 3 30 03/01/2014", header=TRUE)
library(plyr)
ddply(DF, .(Client), summarise,
Q = sum(Q),
`Sales03/01/2014` = Sales[Date=="03/01/2014"],
Sales = sum(Sales))
# Client Q Sales03/01/2014 Sales
#1 A 6 10 64
#2 B 8 30 60
Note that the order of evaluation matters here if you want the output column to reuse the input name Sales: `Sales03/01/2014` must be computed before Sales is reassigned, because within summarise each new value is visible to the expressions that follow it. Also, it is best to avoid names that are not syntactically valid.

You can also achieve the same result using dplyr:
library(dplyr)
DF %>%
group_by(Client) %>%
summarise(SumOfQ = sum(Q),
`Sales03/01/2014` = Sales[Date=="03/01/2014"],
SumOfSales = sum(Sales))
dplyr is slower for the example case, but much faster for large data frames.

Related

How to add the value of a row to other rows based on some criteria in R?

I have panel data for costs, sampled monthly for various product types. I also have "Generic" costs that don't belong to any product type. A super simple representative df looks like this:
type <- c("A","A","B","B","C","C","Generic","Generic")
year <- c(2020,2020,2020,2020,2020,2020,2020,2020)
month <- c(1,2,1,2,1,2,1,2)
cost <- c(1,2,3,4,5,6,600,630)
volume <- c(10,11,20,21,30,31,60,63)
df <- data.frame(type,year,month,cost,volume)
type year month cost volume
A 2020 1 1 10
A 2020 2 2 11
B 2020 1 3 20
B 2020 2 4 21
C 2020 1 5 30
C 2020 2 6 31
Generic 2020 1 600 60
Generic 2020 2 630 63
I need to distribute the "Generic" costs to product types according to their "Volume".
For example,
For 2020-1, the volume ratio of
product type A: 10 / (10 + 20 + 30) = 1/6
product type B: 20 / (10 + 20 + 30) = 2/6
product type C: 30 / (10 + 20 + 30) = 3/6
For 2020-2, the volume ratio of
product type A: 11 / (11 + 21 + 31) = 11/63
product type B: 21 / (11 + 21 + 31) = 21/63
product type C: 31 / (11 + 21 + 31) = 31/63
So, I would like to distribute "Generic" costs for 2020-1 to product types like this:
1/6 * 600 = 100 for product type A
2/6 * 600 = 200 for product type B
3/6 * 600 = 300 for product type C
Similarly for 2020-2, I would like to distribute "Generic" costs like:
11/63 * 630 = 110 for product type A
21/63 * 630 = 210 for product type B
31/63 * 630 = 310 for product type C
In the end, I would like to end up with the following data frame:
type year month new_cost volume
A 2020 1 101 10
A 2020 2 112 11
B 2020 1 203 20
B 2020 2 214 21
C 2020 1 305 30
C 2020 2 316 31
I already have the total volume in the original data frame within the "Generic" type, so there is no need to calculate that separately.
I was trying to do these calculations via the dplyr package's group_by() and mutate() functions, but I couldn't figure out how.
Any help is appreciated.
We can do this using data.table, by first merging in the generic costs separately and spreading them according to the percentage of volume made up by each type in each month/year:
library(data.table)
setDT(df)
generic <- df[type == "Generic"]
setnames(generic, "cost", "generic_cost")
df <- df[type != "Generic"]
df[, volume_ratio := volume / sum(volume), by = c("year", "month")]
df <- merge(df, generic[, c("year", "month", "generic_cost")], by = c("year", "month"))
df[, new_cost := cost + (generic_cost * volume_ratio)]
Which gives us:
df
year month type cost volume volume_ratio generic_cost new_cost
1: 2020 1 A 1 10 0.1666667 600 101
2: 2020 1 B 3 20 0.3333333 600 203
3: 2020 1 C 5 30 0.5000000 600 305
4: 2020 2 A 2 11 0.1746032 630 112
5: 2020 2 B 4 21 0.3333333 630 214
6: 2020 2 C 6 31 0.4920635 630 316
This leaves a few extra columns, but new_cost is the column of interest.

Unable to Group and Sum Properly

I have data similar to this Sample Data:
Cities Country Date Cases
1 BE A 2/12/20 12
2 BD A 2/12/20 244
3 BF A 2/12/20 1
4 V 2/12/20 13
5 Q 2/13/20 2
6 D 2/14/20 4
7 GH N 2/15/20 6
8 DA N 2/15/20 624
9 AG J 2/15/20 204
10 FS U 2/16/20 433
11 FR U 2/16/20 38
I want to organize the data by date and country and then sum each country's daily cases. However, when I try something like the following, it returns only the overall total:
my_data %>%
group_by(Country, Date)%>%
summarize(Cases=sum(Cases))
Your summarize function is likely being called from another package (plyr?). Try calling dplyr::summarize explicitly, like this:
my_data %>%
group_by(Country, Date)%>%
dplyr::summarize(Cases=sum(Cases))
# A tibble: 7 x 3
# Groups: Country [7]
Country Date Cases
<fct> <fct> <int>
1 A 2/12/20 257
2 D 2/14/20 4
3 J 2/15/20 204
4 N 2/15/20 630
5 Q 2/13/20 2
6 U 2/16/20 471
7 V 2/12/20 13
I sympathize with you; this can be very frustrating. I have gotten into the habit of always using dplyr::select, dplyr::filter, and dplyr::summarize. Otherwise you spend needless time wondering why your code isn't working.
We can also use aggregate
aggregate(Cases ~ Country + Date, my_data, sum)

Aggregating by subsets in dplyr

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case the sample size would be rather large, but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month there are many users who appear repeatedly, so first we subset by month and make a frequency table of the users and the number of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set users have taken multiple trips in the month of July (month = 7). Now the important part: subset only the top 10% of these users (the top 10% in terms of Freq):
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe - topten - can be summed and we get the amount of trips taken by the top ten percent of users
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr, specifically the subsetting by the top ten percent? I have tried
output <- full_data %>%
  group_by(month) %>%
  summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate the following part into the dplyr query?
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>% group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look at the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28, 29, 26) comprise 171 rows (60 + 57 + 54). Since there are 30 different values of user_id, the top three users represent the top 10% of users:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
arrange(desc(n)) %>% as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id values, based on the value of n (which is the number of rows for each user_id). percent_rank is one of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.

How to quote the grouped data frame itself in the function in ddply()

With ddply() it is possible to apply a function to a data frame grouped by certain variables, but how do I refer to the grouped data frame itself as the argument of that function?
Take min() as an EXAMPLE:
What I have:
> BodyWeight
Treatment day1 day2 day3
1 a 32 33 36
2 a 35 35 26
3 a 33 38 46
4 b 23 24 25
5 b 22 16 34
6 b 36 35 37
7 c 45 45 39
8 c 29 26 12
9 c 43 27 36
What I want:
Treatment min
1 a 26
2 b 16
3 c 12
What I did and what I got:
> ddply(BodyWeight, .(Treatment), summarize, min= min(BodyWeight[,-1]))
Treatment min
1 a 12
2 b 12
3 c 12
The min() is just an example; solutions not specific to min() are desired.
What you want is the minimum across all the day columns for each Treatment. The issue is that you have days spread over multiple columns. You need to convert your data from the wide format it's in (multiple columns) into a long format (key-value pairs).
library(tidyr)
library(plyr)
bw_long <- gather(BodyWeight, day, value, day1:day3)
ddply(bw_long, .(Treatment), summarize, min = min(value))
p.s. Check out the successor to plyr, dplyr
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(BodyWeight)), then, grouped by 'Treatment', unlist the Subset of Data.table (.SD) and get the min value.
library(data.table)
setDT(BodyWeight)[, .(min = min(unlist(.SD))), by = Treatment]
# Treatment min
#1: a 26
#2: b 16
#3: c 12

tapply based on multiple indexes in R

I have a data frame, much like this one:
ref=rep(c("A","B"),each=240)
year=rep(rep(2014:2015,each=120),2)
month=rep(rep(1:12,each=10),4)
values=c(rep(NA,200),rnorm(100,2,1),rep(NA,50),rnorm(40,4,2),rep(NA,90))
DF=data.frame(ref,year,month,values)
I would like to compute the maximum number of consecutive NAs per reference, per year.
I have created a function, which works out the maximum number of consecutive NAs, but can only be based on one variable.
For example,
func <- function(x) {
  r <- rle(is.na(x))
  max(r$lengths[r$values])  # keep only the runs of NAs, not runs of values
}
with(DF, tapply(values,ref, func))
# A B
# 200 90
with(DF, tapply(values,year, func))
# 2014 2015
# 120 90
So there is a maximum of 200 consecutive NAs in ref A in total, and a maximum of 90 in ref B, which is correct. There are also at most 120 consecutive NAs in 2014, and 90 in 2015.
What I'd like is a result per ref and year, such as:
A 2015 80
A 2014 120
B 2015 90
B 2014 50
There are multiple ways of doing this, one is with the plyr library:
library(plyr)
ddply(DF, c('ref','year'), summarise,
      NAs = with(rle(is.na(values)), max(lengths[values])))
ref year NAs
1 A 2014 120
2 A 2015 80
3 B 2014 50
4 B 2015 90
Using your function, you could also try:
with(DF, tapply(values, list(ref, year), func))
which gives a slightly different output:
2014 2015
A 120 80
B 50 90
By using melt() you can, however, reshape this into the same data frame.
Very similar to the tapply solution above, but I find aggregate gives a better output than tapply.
with(DF, aggregate(list(Value = values),list(Year = year,ref = ref), func))
Year ref Value
1 2014 A 120
2 2015 A 80
3 2014 B 50
4 2015 B 90
I like the recipe format. (Note that this counts the total number of NAs per group, which matches the longest consecutive run here only because each ref/year group contains a single block of NAs.)
library(dplyr)
DF$values[is.na(DF$values)] <- 1
DF %>%
filter(values==1) %>%
group_by(ref,year) %>%
mutate(csum=cumsum(values)) %>%
group_by(ref,year) %>%
summarise(max(csum))
Source: local data frame [4 x 3]
Groups: ref [?]
ref year max(csum)
(fctr) (int) (dbl)
1 A 2014 120
2 A 2015 80
3 B 2014 50
4 B 2015 90
