Aggregating by subsets in dplyr - r

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month, many of the same users appear repeatedly, so first we should subset by month and make a frequency table of the users and the number of trips each has taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set, in the month of July (month = 7), users have taken multiple trips. Now the important part, which is to subset only the top 10% of these users (the top 10% in terms of Freq):
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe, topten, can be summed to get the number of trips taken by the top ten percent of users:
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr? I mean specifically the subsetting by the top ten percent. I have tried
output <- full_data %>%
group_by(month) %>%
summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the query in dplyr?
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)

The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>% group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look at the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28,29,26) comprise 171 rows (60+57+54). Since there are 30 different values of user_id, the top three users represent the top 10% of users:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
arrange(desc(n)) %>% as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id, based on the value of n (which is the number of rows for each user_id). percent_rank is one of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.
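To make the point about ties concrete, here is a minimal sketch on hypothetical toy data (the `trips` table, its user ids and the `n_trips` column name are made up for illustration). It compares the percent_rank() filter with a row_number() filter that keeps exactly round(n() / 10) users per month, which is what head(users, n = tenPercent) does in the question:

```r
library(dplyr)

# Hypothetical toy data: one month, ten users, two users tied for most trips
trips <- tibble(
  month   = 7L,
  user_id = rep(101:110, times = c(5, 5, 3, 2, 2, 1, 1, 1, 1, 1))
)

# One row per user per month with the number of trips
counts <- trips %>% count(month, user_id, name = "n_trips")

# percent_rank() gives both tied leaders (9 - 1) / (10 - 1) = 0.889,
# so neither reaches the 0.9 cutoff and the month drops out entirely
by_pct <- counts %>%
  group_by(month) %>%
  filter(percent_rank(n_trips) >= 0.9) %>%
  summarise(trips = sum(n_trips))

# row_number() breaks ties by position, so exactly round(n() / 10) users
# are kept per month, mirroring head(users, n = tenPercent)
by_rank <- counts %>%
  group_by(month) %>%
  filter(row_number(desc(n_trips)) <= round(n() / 10)) %>%
  summarise(trips = sum(n_trips))
```

If the real data has many users tied near the cutoff, the two filters can give noticeably different totals, so it is worth deciding up front whether tied users should all be kept, all be dropped, or be cut off at a fixed count.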

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured from permanent plots. These plots are remeasured every couple of years. The data looks roughly like the table at the bottom. I used the following code to slice out the initial measurement (t1) for each plot. Now I want to slice t2, which is the remeasurement one step after the minimum Cycle or minimum Measured_year. This is particularly a problem for plots that have more than two remeasurements (num_obs > 2), where the Measured_year intervals and Cycle intervals differ.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
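Create a t variable for yourself based on the Cycle order within each plot. Start from the full df (df_Time1 keeps only one row per plot) and use row_number(Cycle), which gives each row its rank by Cycle within the group:
df %>%
group_by(State, County, Plot) %>%
mutate(t = row_number(Cycle))
You can then filter on t == 1 or t == 2, etc.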
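As a self-contained sketch of that idea, the sample rows from the question can be typed in by hand (basal_area, tph and num_obs are left out for brevity, and `df_Time2` is just an illustrative name) and the second measurement of each plot extracted:

```r
library(dplyr)

# Sample rows copied from the question's table (subset of columns)
df <- tibble(
  State         = c(1, 2, 1, 2, 2, 2, 2, 2),
  County        = rep(1, 8),
  Plot          = c(1, 2, 1, 1, 1, 2, 2, 1),
  Measured_year = c(2006, 2002, 2009, 2005, 2010, 2013, 2021, 2019),
  Cycle         = c(8, 7, 9, 6, 8, 10, 12, 13)
)

# Rank the cycles within each plot, then keep the second measurement
df_Time2 <- df %>%
  group_by(State, County, Plot) %>%
  mutate(t = row_number(Cycle)) %>%
  filter(t == 2) %>%
  ungroup()
```

Because the rank is computed per plot, it does not matter that the Cycle and Measured_year intervals differ between plots.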

Testing for a significant increase in the frequency of periods of time where wind is blowing easterly for 5 days or more in a row

I am trying to figure out whether the number of instances in which the wind direction (U) stays below -3.54 for 5 or more days in a row is increasing significantly over time at the 10% level. Below is the code I have used to produce the number of instances per year in Austral Summer. I have already checked for a significant increase using Excel at the 5% level, but I would like to learn how to do it using R and at the 10% level. Any help would be hugely appreciated.
####WIND PERIOD CALCULATIONS###########################################################################
#### U <(-3.54)
###Austral summer#######################################################################################
library(dplyr)
AusSum_Wind <- data.frame(year=character(), instances=integer(), stringsAsFactors=FALSE)
RowNum <- 1
for (i in 1993:2015) {
AusSum_Wind[RowNum,1] <- paste(as.character(i), as.character(i+1), sep = "-")
wind %>%
filter((Month >= 10 & Year == i) | (Month <= 3 & Year == (i+1))) %>%
mutate(threshold = U < (-3.54),
group = cumsum(threshold != lag(threshold, default = FALSE))) %>%
group_by(group) %>%
mutate(n_days = n()) %>%
summarise_all(first) %>%
filter(threshold, n_days >= 5) %>%
select(-group, -threshold) -> instances
AusSum_Wind[RowNum,2] <- nrow(instances)
RowNum <- RowNum + 1
}
AusSum_Wind
> AusSum_Wind
year instances
1 1993-1994 1
2 1994-1995 3
3 1995-1996 3
4 1996-1997 1
5 1997-1998 5
6 1998-1999 3
7 1999-2000 4
8 2000-2001 2
9 2001-2002 1
10 2002-2003 0
11 2003-2004 3
12 2004-2005 3
13 2005-2006 1
14 2006-2007 1
15 2007-2008 1
16 2008-2009 0
17 2009-2010 3
18 2010-2011 5
19 2011-2012 1
20 2012-2013 5
21 2013-2014 4
22 2014-2015 3
23 2015-2016 2
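One possible way to approach this in R, offered as a sketch rather than a definitive analysis: treat the season's starting year as the time variable and test for an upward trend with a one-sided test, comparing the p-value against 0.10. The counts below are copied from the output above; the `start_year` column and the choice of tests (Kendall rank correlation and a Poisson regression) are assumptions made for illustration, not something the original analysis specifies.

```r
# Season counts copied from the AusSum_Wind output above;
# start_year is a hypothetical numeric column (first year of each season)
AusSum_Wind <- data.frame(
  start_year = 1993:2015,
  instances  = c(1, 3, 3, 1, 5, 3, 4, 2, 1, 0, 3, 3,
                 1, 1, 1, 0, 3, 5, 1, 5, 4, 3, 2)
)

# Option 1: Kendall rank correlation between year and count,
# one-sided for an increase (exact = FALSE because the counts contain ties)
kendall <- cor.test(AusSum_Wind$start_year, AusSum_Wind$instances,
                    method = "kendall", alternative = "greater",
                    exact = FALSE)

# Option 2: Poisson regression of the count on year,
# with a one-sided p-value for a positive slope
fit <- glm(instances ~ start_year, family = poisson, data = AusSum_Wind)
z <- coef(summary(fit))["start_year", "z value"]
p_one_sided <- pnorm(z, lower.tail = FALSE)

# Compare both p-values against the 10% level
alpha <- 0.10
kendall$p.value < alpha
p_one_sided < alpha
```

The rank-based test makes fewer assumptions about the distribution of the counts, while the Poisson model also gives an estimated rate of change per year.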

Is there a way to filter that does not include duplicates/repeated entries by particular groups?

Some context first:
I'm working with a data set which includes health-related data. It includes questionnaire scores pre and post treatment. However, some clients reappear within the data for further treatment. I've provided a mock example of the data in the code section.
I have tried to come up with a solution in dplyr, as this is the package I'm most familiar with, but I didn't achieve what I wanted.
#Example/mock data
ClientNumber<-c("4355", "2231", "8894", "9002", "4355", "2231", "8894", "9002", "4355", "2231")
Pre_Post<-c(1,1,1,1,2,2,2,2,1,1)
QuestionnaireScore<-c(62,76,88,56,22,30, 35,40,70,71)
df<-data.frame(ClientNumber, Pre_Post, QuestionnaireScore)
df$ClientNumber<-as.character(df$ClientNumber)
df$Pre_Post<-as.factor(df$Pre_Post)
View(df)
#tried solution
df2<-df%>%
group_by(ClientNumber)%>%
filter( Pre_Post==1|Pre_Post==2)
#this doesn't work, or needs more code to it
As you can see, the first four client numbers each have a pre and a post treatment score. This is good. However, client numbers 4355 and 2231 appear again at the end (you could say they have relapsed and started new treatment). These two clients do not have a post treatment score for that second round.
I only want to analyse clients that have a pre and post score, therefore I need to filter clients which have completed treatment, while excluding ones that do not have a post treatment score if they have appeared in the data again. In relation to the example I've provided, I want to include the first 8 for analysis while excluding the last two, as they do not have a post treatment score.
If these cases are to be kept in order, you could try:
library(dplyr)
df %>%
group_by(ClientNumber) %>%
filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 4355 1 62
2 2231 1 76
3 8894 1 88
4 9002 1 56
5 4355 2 22
6 2231 2 30
7 8894 2 35
8 9002 2 40
I don't know if you actually need to use n_distinct() but it won't hurt to keep it. This will remove cases who have a pre score but no post score if they exist in the data.
First arrange ClientNumbers then group_by and finally filter using dplyr::lead and dplyr::lag
library(dplyr)
df %>% arrange(ClientNumber) %>% group_by(ClientNumber) %>%
filter(Pre_Post==1 & lead(Pre_Post)==2 | Pre_Post==2 & lag(Pre_Post)==1)
# A tibble: 8 x 3
# Groups: ClientNumber [4]
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 2231 1 76
2 2231 2 30
3 4355 1 62
4 4355 2 22
5 8894 1 88
6 8894 2 35
7 9002 1 56
8 9002 2 40
Another option is to create groups of 2 for every ClientNumber and select only those groups which have 2 rows in them.
library(dplyr)
df %>%
arrange(ClientNumber) %>%
group_by(ClientNumber, group = cumsum(Pre_Post == 1)) %>%
filter(n() == 2) %>%
ungroup() %>%
select(-group)
# ClientNumber Pre_Post QuestionnaireScore
# <chr> <fct> <dbl>
#1 2231 1 76
#2 2231 2 30
#3 4355 1 62
#4 4355 2 22
#5 8894 1 88
#6 8894 2 35
#7 9002 1 56
#8 9002 2 40
The same can be translated to base R using ave:
new_df <- df[order(df$ClientNumber), ]
subset(new_df, ave(seq_along(Pre_Post), ClientNumber, cumsum(Pre_Post == 1), FUN = length) == 2)

Create a Table with Alternating Total Rows Followed by Sub-Rows Using Dplyr and Tidyverse

library(dplyr)
library(forcats)
Using the simple dataframe and code below, I want to create a table with total rows and sub-rows. For example, the first row would be "Region1" from the NEW column and 70 from the TotNumber column, then below that would be three rows for "Town1", "Town2", and "Town3", and their associated numbers from the Number column, and the same for "Region2" and "Region3". I attached a pic of the desired table...
I'm also looking for a solution using dplyr and Tidyverse.
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number)
DF<-DF%>%mutate_at(vars(Town),funs(as.factor))
To create Region variable...
DF<-DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")))%>%
group_by(NEW)%>%
summarise(TotNumber=sum(Number))
Modifying your last pipes and adding some additional steps:
library(dplyr)
library(forcats)
DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")),
NEW = as.character(NEW)) %>%
group_by(NEW) %>%
mutate(TotNumber=sum(Number)) %>%
ungroup() %>%
split(.$NEW) %>%
lapply(function(x) rbind(setNames(x[1,3:4], names(x)[1:2]), x[1:2])) %>%
do.call(rbind, .)
Results:
# A tibble: 13 × 2
Town Number
* <chr> <dbl>
1 Region1 70
2 Town1 10
3 Town2 30
4 Town3 30
5 Region2 96
6 Town4 10
7 Town5 56
8 Town6 30
9 Region3 133
10 Town7 40
11 Town8 50
12 Town9 33
13 Town10 10
Data:
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number) %>%
mutate_at(vars(Town),funs(as.factor))
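For comparison, here is a sketch of the same table built with dplyr verbs only. It swaps the split/lapply/rbind step for bind_rows() plus a helper sort column; `towns`, `totals`, `result`, `Region` and `idx` are names invented for this illustration, and case_when() stands in for fct_collapse() so the example does not need forcats:

```r
library(dplyr)

# Same numbers as in the question, with a Region label and the original row order
towns <- tibble(
  Town   = paste0("Town", 1:10),
  Number = c(10, 30, 30, 10, 56, 30, 40, 50, 33, 10)
) %>%
  mutate(
    Region = case_when(
      Town %in% c("Town1", "Town2", "Town3") ~ "Region1",
      Town %in% c("Town4", "Town5", "Town6") ~ "Region2",
      TRUE                                   ~ "Region3"
    ),
    idx = row_number()
  )

# One total row per region; idx = 0 makes it sort ahead of its towns
totals <- towns %>%
  group_by(Region) %>%
  summarise(Number = sum(Number)) %>%
  mutate(Town = Region, idx = 0L)

# Stack totals and towns, then order by region and position
result <- bind_rows(totals, towns) %>%
  arrange(Region, idx) %>%
  select(Town, Number)
```

Keeping an explicit idx column avoids relying on how ties are ordered when sorting, and it preserves the original town order (Town10 stays last instead of sorting after Town1).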

How to calculate top rows from a large data set

I have a dataset with the following columns: flavor, flavorid and unitSoled.
Flavor Flavorid unitsoled
beans 350 6
creamy 460 2
.
.
.
I want to find the top ten flavors and then calculate the market share for each flavor. My logic is: market share for each flavor = units sold for that flavor divided by total units sold.
How do I implement this? For output I just want two columns: Flavorid and the corresponding market share. Do I need to save the top ten flavors in some table first?
One way is with the dplyr package:
An example data set:
flavor <- rep(letters[1:15],each=5)
flavorid <- rep(1:15,each=5)
unitsold <- 1:75
df <- data.frame(flavor,flavorid,unitsold)
> df
flavor flavorid unitsold
1 a 1 1
2 a 1 2
3 a 1 3
4 a 1 4
5 a 1 5
6 b 2 6
7 b 2 7
8 b 2 8
9 b 2 9
...
...
Solution:
library(dplyr)
df %>%
select(flavorid,unitsold) %>% #select the columns you want
group_by(flavorid) %>% #group by flavorid
summarise(total=sum(unitsold)) %>% #sum the total units sold per id
mutate(marketshare=total/sum(total)) %>% #calculate the market share per id
arrange( desc(marketshare)) %>% #order by marketshare descending
head(10) #pick the 10 first
#and you can add another select(flavorid,marketshare) if you only want those two
Output:
Source: local data frame [10 x 3]
flavorid total marketshare
1 15 365 0.12807018
2 14 340 0.11929825
3 13 315 0.11052632
4 12 290 0.10175439
5 11 265 0.09298246
6 10 240 0.08421053
7 9 215 0.07543860
8 8 190 0.06666667
9 7 165 0.05789474
10 6 140 0.04912281
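On the question of whether the top ten need to be saved in a separate table first: they do not, as long as the share is computed before the rows are cut down. A minimal sketch of the same pipeline, assuming dplyr 1.0.0 or later for slice_max() and keeping only the two requested columns (`top10` is an illustrative name):

```r
library(dplyr)

# Example data from the answer above
flavor   <- rep(letters[1:15], each = 5)
flavorid <- rep(1:15, each = 5)
unitsold <- 1:75
df <- data.frame(flavor, flavorid, unitsold)

top10 <- df %>%
  group_by(flavorid) %>%
  summarise(total = sum(unitsold)) %>%          # units sold per flavor
  mutate(marketshare = total / sum(total)) %>%  # share of ALL units sold
  slice_max(marketshare, n = 10) %>%            # ten largest shares, descending
  select(flavorid, marketshare)
```

Note that slice_max() keeps tied rows by default, so it can return more than ten rows if several flavors share the tenth-largest value; with_ties = FALSE forces exactly ten.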

Resources