How to summarize events prior to a specific event (that can happen multiple times) across multiple observations in R?

I'm trying to collect data on what events have happened prior to a specific event (i.e. bDragons), which can recur within the full observation. This is just an excerpt of one observation where a dragon is taken more than once, and I want to be able to pull insights on each and every one over many observations. So in the data set below, I would want to know that only 1 outer turret was taken prior to the first dragon at Time == 12.891. The next dragon is taken at 20.215, with 4 towers and a drake before it.
ID TeamObj Time Type Lane League Year Season bResult rResult gamelength Gold
1 1 bTowers 9.397 OUTER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
2 1 bDragons 12.891 AIR_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
3 1 bTowers 16.215 OUTER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
4 1 bTowers 16.591 INNER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
5 1 bTowers 19.830 OUTER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
6 1 bDragons 20.215 EARTH_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
7 1 bBarons 22.512 BARON_NASHOR <NA> CBLoL 2017 Summer 1 0 34 NA
8 1 bTowers 23.962 INNER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
9 1 bTowers 24.707 INNER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
10 1 bTowers 24.962 BASE_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
I'd want this for every TeamObj of that type, but the issue comes up when I try to group_by and filter by Time <= which(Team == bDragons): the wrong rows get filtered out, or I can't summarize based on count(Type) afterwards. I'm looking for help writing some kind of recurring function, or a better way to record and summarize this. I'm looking to fit the observations into a linear model later on, but I can't even get to square one, which is the issue.
Am I thinking about my filter incorrectly? My summarize?
tst3 %>% group_by(ID) %>% filter(Time <= which(Team == "bDragons")) %>% summarize(count(Type))
Something like:
ID dragonID dragonType Time Baron_Nashor Base_Turret Inner_Turret Nexus_Turret Outer_Turret
1 1 AIR_DRAGON 12.891 N/A N/A N/A N/A 1
2 2 EARTH_DRAGON 20.215 N/A N/A 1 N/A 3
and so on, if that is clear. I want to be able to use each dragon as an observation.

How about the following:
library(dplyr)
library(tidyr)
library(stringr)

tst3 %>%
  group_by(ID) %>%
  # arrange(Time) %>% # uncomment if needed
  mutate(
    Type = factor(Type),
    # TRUE (rather than 1) keeps the lagged vector logical and starts the first segment at dragonID 1
    dragonID = cumsum(dplyr::lag(TeamObj == 'bDragons', default = TRUE))) %>%
  group_by(ID, dragonID) %>%
  summarize(
    dragonType = last(Type),
    Time = last(Time),
    tmp = list(as.data.frame(table(Type)))) %>%
  unnest(tmp) %>%
  spread(Type, Freq, fill = 0) %>%
  # select(-ends_with("DRAGON")) %>%
  group_by(ID) %>%
  mutate_at(vars(BARON_NASHOR:OUTER_TURRET), cumsum) %>%
  filter(str_detect(dragonType, "DRAGON"))
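The asker mentions wanting to feed these observations into a linear model later. A minimal sketch of that step, assuming the result of the pipeline above has been assigned to dragon_obs (the outcome and predictors below are placeholders, not from the question):
# Hypothetical model: time of each dragon take as a function of prior objectives
fit <- lm(Time ~ OUTER_TURRET + INNER_TURRET + BARON_NASHOR, data = dragon_obs)
summary(fit)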

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured from permanent plots. These plots are continuously remeasured every couple of years. The data looks something like the table at the bottom. I used the following code to slice the initial measurement (t1) out of the dataset. Now I want to slice t2, the remeasurement that is one step greater than the minimum Cycle (or minimum Measured_year). This is particularly a problem for plots that have more than two remeasurements (num_obs > 2), where the Measured_year intervals and Cycle intervals differ.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
Create a t variable for yourself based on the Cycle order:
df %>%                                  # rank on the full data, not the t1-only slice
  group_by(State, County, Plot) %>%
  mutate(t = dense_rank(Cycle))         # dense_rank() gives the measurement order; order() would return sort indices, not ranks
You can then filter on t == 1 or t == 2, etc.
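For example, a minimal sketch that pulls the second measurement, assuming the same grouping columns as above:
library(dplyr)

df_Time2 <- df %>%
  group_by(State, County, Plot) %>%
  mutate(t = dense_rank(Cycle)) %>%   # 1 = initial measurement, 2 = first remeasurement, ...
  filter(t == 2) %>%
  ungroup()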

Testing for a significant increase in the frequency of periods of time where wind is blowing easterly for 5 days or more in a row

I am trying to figure out whether the number of instances where the wind direction (U) is below -3.54 for 5 or more days in a row is increasing significantly over time at the 10% level. Below is the code I have used to produce the number of instances per year in austral summer. I have already checked for a significant increase using Excel at the 5% level, but I would like to learn how to do it using R and at the 10% level. Any help would be hugely appreciated.
####WIND PERIOD CALCULATIONS###########################################################################
#### U <(-3.54)
###Austral summer#######################################################################################
library(dplyr)

AusSum_Wind <- data.frame(year = character(), instances = integer(), stringsAsFactors = FALSE)
RowNum <- 1
for (i in 1993:2015) {
  AusSum_Wind[RowNum, 1] <- paste(as.character(i), as.character(i + 1), sep = "-")
  wind %>%
    filter((Month >= 10 & Year == i) | (Month <= 3 & Year == (i + 1))) %>%
    mutate(threshold = U < (-3.54),
           group = cumsum(threshold != lag(threshold, default = FALSE))) %>%
    group_by(group) %>%
    mutate(n_days = n()) %>%
    summarise_all(first) %>%
    filter(threshold, n_days >= 5) %>%
    select(-group, -threshold) -> instances
  AusSum_Wind[RowNum, 2] <- nrow(instances)
  RowNum <- RowNum + 1
}
AusSum_Wind
> AusSum_Wind
year instances
1 1993-1994 1
2 1994-1995 3
3 1995-1996 3
4 1996-1997 1
5 1997-1998 5
6 1998-1999 3
7 1999-2000 4
8 2000-2001 2
9 2001-2002 1
10 2002-2003 0
11 2003-2004 3
12 2004-2005 3
13 2005-2006 1
14 2006-2007 1
15 2007-2008 1
16 2008-2009 0
17 2009-2010 3
18 2010-2011 5
19 2011-2012 1
20 2012-2013 5
21 2013-2014 4
22 2014-2015 3
23 2015-2016 2
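A minimal sketch of the trend test, not from the original post: since instances is a yearly count, one option is a Poisson GLM of the counts against a season index, checking the slope's p-value against 0.10.
AusSum_Wind$season_index <- seq_len(nrow(AusSum_Wind))   # 1, 2, ..., one per austral summer

fit <- glm(instances ~ season_index, data = AusSum_Wind, family = poisson)
summary(fit)   # the increase is significant at the 10% level if Pr(>|z|) for season_index < 0.10

# For comparison, an ordinary linear regression on the counts:
summary(lm(instances ~ season_index, data = AusSum_Wind))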

How can I filter out Duplicated Rows per Group

So this is the data I'm working with:
ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000
What I'm trying to do is create a table that shows the amount of value lost, grouped by Year, State and Grade. That part I have done, but the issue is that, as you can see, there is a duplicated row for ID = 1. I need to add a component to my code that removes any duplicated rows like it from my data once I have grouped by Year, State and Grade.
The reason I want to remove the duplicates after I have grouped the data is that the ID number may duplicate for a different year but that is OK as that is a new observation. I just want to remove any duplicates if the Year, State and Grade match. Basically if the whole row is a duplicate, it should be removed.
I can't tell if I should use unique() or distinct(), but here is what I have so far:
Answer <- data %>%
  group_by(Year, State, Grade) %>%
  filter(row_number(ID) == 1) %>%  # This is the part to replace
  summarise(x = sum(Loss) / sum(Total)) %>%
  spread(State, x)
The output should look like this:
Year State Grade x
2016 AZ A 0.05
2016 AZ B 0
2016 AZ C 0
2017 AZ A 0
2017 AZ B 0
2017 AZ C 0.1
A few things. Below, I use distinct to remove duplicate rows. Also, in your expected results you have an entry for grade C for 2016, which isn't in your original data. So, I used complete to add this (and any other missing cases) as a zero. Finally, as #akrun notes above: where does 0.00833 come from? Typo or have I misunderstood the calculation?
library(dplyr)
library(tidyr)

df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)

Answer <- df %>%
  distinct() %>%
  group_by(Year, State, Grade) %>%
  summarise(x = sum(Loss) / sum(Total)) %>%
  complete(Year, State, Grade, fill = list(x = 0))
# # A tibble: 6 x 4
# # Groups: Year, State [2]
# Year State Grade x
# <int> <fct> <fct> <dbl>
# 1 2016 AZ A 0.05
# 2 2016 AZ B 0
# 3 2016 AZ C 0
# 4 2017 AZ A 0
# 5 2017 AZ B 0
# 6 2017 AZ C 0.1
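If the wide-by-State layout from the end of the original pipeline is still wanted, it can be appended afterwards; a small sketch:
Answer %>%
  ungroup() %>%
  spread(State, x)   # or pivot_wider(names_from = State, values_from = x) in newer tidyr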

Grouping the data in a data frame based on conditions from more than one column

Problem description:
I am trying to calculate recency: for each Salesman_ID, find the most recent value in the Year column where the target-achieved indicator equals 1; if the indicator is 0 for every Year for that Salesman_ID, choose the minimum Year instead.
Data:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0
Using the tidyverse package, I suggest the following code:
library(tidyverse)

Prashant_df <- data.frame(
  c("AA-5468", "AA-5468", "AA-5468", "AA-5468", "AA-5468",
    "AL-3791", "AL-3791", "AL-3791", "AL-3893", "AL-3893"),
  c(2012, 2013, 2014, 2015, 2016, 2012, 2013, 2014, 2015, 2016),
  c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)
names(Prashant_df) <- c("Salesman_ID", "Year", "Yearly_Targets_Achieved_Indicator")

Prashant_df <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  mutate(Year_target = case_when(
    # use the latest year in which the target was actually hit, not the latest year overall
    Yearly_Targets_Achieved_Indicator == 1 ~ max(Year[Yearly_Targets_Achieved_Indicator == 1]),
    Yearly_Targets_Achieved_Indicator == 0 ~ min(Year)
  ))

Prashant_df_collapsed <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  summarise(Year = max(Year_target),
            Yearly_Targets_Achieved_Indicator = max(Yearly_Targets_Achieved_Indicator))
You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf <- df %>%
  group_by(Salesman_ID) %>%
  summarise(maximum = max(Year),
            minimum = min(Year),
            maxInd = max(Yearly_Targets_Achieved_Indicator))
From this you can pretty much construct your resulting variable.
Using Base R:
c(by(dat, dat[1], function(x) if (all(x[, 3] == 0)) x[1, 2] else max(x[which(x[, 3] == 1), 2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is a bit messy but produces the desired output. Here is the explanation:
First group by Salesman_ID; then, for each group, check whether all the indicators are zero. If they are, return the first year; otherwise, return the latest (maximum) year among the rows where the indicator is 1.
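For comparison, the same rule written as a dplyr sketch (an alternative, not code from the original answers):
library(dplyr)

Prashant_df %>%
  group_by(Salesman_ID) %>%
  summarise(
    Year = if (all(Yearly_Targets_Achieved_Indicator == 0)) min(Year)
           else max(Year[Yearly_Targets_Achieved_Indicator == 1]),
    Yearly_Targets_Achieved_Indicator = max(Yearly_Targets_Achieved_Indicator)
  )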

Aggregating by subsets in dplyr

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month, there are many users who appear more than once, so first we should subset by month and make a frequency table of the users and the number of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set, in the month of July (month = 7), users have taken multiple trips. Now the important part: subset only the top 10% of these users (the top 10% in terms of Freq).
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new data frame, topten, can be summed to get the number of trips taken by the top ten percent of users:
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr, specifically the subsetting by the top ten percent? I have tried
output <- full_data %>%
  group_by(month) %>%
  summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the dplyr query?
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)

full_data %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9) %>%
  summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look at the number of rows for each user_id for month == 1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28, 29, 26) comprise 171 rows (60 + 57 + 54). Since there are 30 different values of user_id, the top three users represent the top 10% of users:
full_data %>%
  filter(month == 1) %>%
  group_by(month, user_id) %>%
  tally %>%
  arrange(desc(n)) %>%
  as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id values, based on the value of n (which is the number of rows for each user_id). percent_rank is one of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>%
  filter(month == 1) %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>%
  filter(month == 1) %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9) %>%
  summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.
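If exact control over ties matters, one alternative (a sketch using newer dplyr, not part of the original answer) is slice_max() with prop, which keeps exactly the top 10% of users per month after counting:
library(dplyr)

full_data %>%
  count(month, user_id, name = "n") %>%
  group_by(month) %>%
  slice_max(n, prop = 0.1, with_ties = FALSE) %>%  # top 10% of users per month, ties broken by row order
  summarise(n_trips = sum(n))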
