I am trying to create a flag for unique people (defined by id) that have a flight duration that is over 14 hours and they have another flight greater than or equal to 25 days after the 14-hour flight.
To tackle this, I decided to use an if-else statement where the max date grouped by id was subtracted by row date, but the flagging only seems to work for cases where the first flight is above 14 hours.
#Setup Data Frame
id <- c(1,1,2,2,3,3,4,4,4,4,5,5)
flght_dur <- c(27,13,13,17,19,12,7,9,27,14,13,45)
flght_dt <- as.Date(c("2016-03-29","2016-09-01","2015-07-23","2016-06-16","2015-11-12","2016-03-25","2015-12-23","2016-05-19","2016-08-18","2016-09-27","2016-08-18","2016-09-27"))
df <- data.frame(id, flght_dur, flght_dt)
df2 <- df %>% group_by(id) %>% mutate(flag = ifelse(flght_dur >= 14 && (max(as.Date(flght_dt)) - as.Date(flght_dt)) >= 25, 1,0))
df2
Any suggestions on next steps would be appreciated,
You are using the scalar and condition && with vectors, which will only look at the first element of the vector; To look at all possible conditions and return a scalar per group, you can use & on vectors and then use any to reduce the boolean result:
df2 <- df %>%
group_by(id) %>%
mutate(flag = +any(flght_dur >= 14 & max(as.Date(flght_dt)) - as.Date(flght_dt) >= 25))
# ^ used + here to convert boolean to 1 and 0 instead of if/else for short
df2
# A tibble: 12 x 4
# Groups: id [5]
# id flght_dur flght_dt flag
# <dbl> <dbl> <date> <int>
# 1 1. 27. 2016-03-29 1
# 2 1. 13. 2016-09-01 1
# 3 2. 13. 2015-07-23 0
# 4 2. 17. 2016-06-16 0
# 5 3. 19. 2015-11-12 1
# 6 3. 12. 2016-03-25 1
# 7 4. 7. 2015-12-23 1
# 8 4. 9. 2016-05-19 1
# 9 4. 27. 2016-08-18 1
#10 4. 14. 2016-09-27 1
#11 5. 13. 2016-08-18 0
#12 5. 45. 2016-09-27 0
Try using chaining with data.table as follows:
DF[, longHaul := ifelse(flght_dur > 14, TRUE, FALSE)][, maxFlight_DATE := max(flght_dt), by = "id"][longHaul == TRUE & (maxFlight_DATE - flght_dt > 25),]
This is after converting your data.frame to data.table with DF = data.table(df)
It gives me the following output, which appears to follow the logic you want.
id flght_dur flght_dt longHaul maxFlight_DATE
1: 1 27 2016-03-29 TRUE 2016-09-01
2: 3 19 2015-11-12 TRUE 2016-03-25
3: 4 27 2016-08-18 TRUE 2016-09-27
You can do this avoiding loops using rollapply as below.
df$sameid <- c(rollapply(df$id, width = 2, by = 1, FUN = function(x) x[1]==x[2] , align = "right"),NA)
df$nextdurcondition <- c(diff(df$flght_dt)>25 ,NA)
df$flag <- df$sameid &df$nextdurcondition
df
However, for these rolling functions, I personally always use loops
Related
I want to create a table with two columns. The first one represents the working weeks, named time_axis in my exemple below.
The second column, is also a sequence of Dates which represents particular events in a year, called bank_holidays. Each of the date get a one value to signalise its presence.
What I need, is to create a table where the first columns time axis remains unchanged and the second column will be a vector of ones and zeros. Zeros anywhere outside the weeks which contain the events in bank_holiday and with ones for the weeks which includes those dates in bank_holiday occurs. Every week starts with the date in time_axis
library(xts)
time_axis <- seq(as.Date("2017-01-21"), length = 10, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18"))
bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
The desired outcome:
df <- data.frame ( time_axis = c("2017-01-20", "2017-01-27", "2017-02-03", "2017-02-10", "2017-02-17", "2017-02-24", "2017-03-03", "2017-03-10", "2017-03-17", "2017-03-24"), bank_holidays = c(0, 1, 0,1,1,0,0,0,1,0))
df
Any idee on how to make it?
Thank you.
Something which needs to bear in mind and is not obviously from the data: the weeks on time_axis start on Saturday. Therefore, 2017-01-21 is not the end of the 3rd week (as if it would be in case the week starts on Monday) but it is already the 4th week.
Using strftime, "%V" gives the ISO 8601 week numbers where you may match on.
res <- data.frame(time_axis,
bank_holidays =+(strftime(time_axis, "%V") %in%
strftime(index(bank_holidays), "%V")))
res
# time_axis bank_holidays
# 1 2017-01-20 0
# 2 2017-01-27 0
# 3 2017-02-03 1
# 4 2017-02-10 0
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 1
# 10 2017-03-24 0
Edit
To use the custom working weeks whose starts are defined in time_axis variable, the simplest thing would probably be to compare if bank_holidays are greater or equal than that. Then counting the TRUEs with colSums gives the index where to set to 1.
res <- data.frame(time_axis, bank_holidays=0) ## init. column with `0`
res$bank_holidays[colSums(sapply(index(bank_holidays), `>=`, time_axis))] <- 1 ## set matches to 1
res
# time_axis bank_holidays
# 1 2017-01-21 0
# 2 2017-01-28 1
# 3 2017-02-04 0
# 4 2017-02-11 1
# 5 2017-02-18 1
# 6 2017-02-25 0
# 7 2017-03-04 0
# 8 2017-03-11 0
# 9 2017-03-18 1
# 10 2017-03-25 0
This works for me, a little bit longer than the previous answer but you can see what's happening here
I need the start date and end date so I can fill the rest of the dates in between so I'm selecting 11 weeks instead of 10 in your example. And also I can match the vector for bank holidays instead of xts object
library(xts)
library(tidyverse)
time_axis <- seq(as.Date("2017-01-20"), length = 11, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18")) # I'll work with the vector
#bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
df <- tibble() # I'm creating an empty tibble
for(i in 1:(length(time_axis)-1)) { # running a for loop for 10 weeks
df <- seq(as.Date(time_axis[i]), as.Date(time_axis[i+1]), "days") %>% # filling the dates between the two dates
enframe(name = NULL) %>% #converting it into a data frame (tibble)
mutate(week = as.Date(time_axis[i])) %>% # creating a new column indicating which week the given date belong to
bind_rows(df) # binding the rows to previous dataframe
}
Now I'm taking the df and checking the given holidays matching with our generated dates or not. if present 1 or it will be 0.
Then I'm group_by based on week column which is our given weeks above and summarising to find the sum
df %>%
mutate(bank_holidays_presence = if_else(value %in% bank_holidays, 1, 0)) %>%
group_by(week) %>%
summarise(sum = bank_holidays_presence %>% sum())
# A tibble: 10 x 2
# week sum
# <date> <dbl>
# 1 2017-01-20 0
# 2 2017-01-27 1
# 3 2017-02-03 0
# 4 2017-02-10 1
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 0
#10 2017-03-24 0
The advantage of this method is that even if you have more than one holiday for a particular week it'll give the count rather than mere presence or absence
I am still new to R and learning methods for conducting analysis. I have a df which I want to count the consecutive wins/losses based on column "x9". This shows the gain/loss (positive value or negative value) for the trade entered. I did find some help on code that helped with assigning a sign, sign lag and change, however, I am looking for counter to count the consecutive wins until a loss is achieved then reset, and then count the consecutive losses until a win is achieved. Overall am looking for assistance to adjust the counter to reset when consecutive wins/losses are interrupted. I have some sample code below and a attached .png to explain my thoughts
#Read in df
df=vroom::vroom(file = "analysis.csv")
#Filter df for specfic order types
df1 = filter(df, (x3=="s/l") |(x3=="t/p"))
#Create additional column to tag wins/losses in df1
index <- c("s/l","t/p")
values <- c("Loss", "Win")
df1$col2 <- values[match(df1$x3, index)]
df1
#Mutate df to review changes, attempt to review consecutive wins and losses & reset when a
#positive / negative value is encountered
df2=df1 %>%
mutate(sign = ifelse(x9 > 0, "pos", ifelse(x9 < 0, "neg", "zero")), # get the sign of the value
sign_lag = lag(sign, default = sign[9]), # get previous value (exception in the first place)
change = ifelse(sign == sign_lag, 1 , 0), # check if there's a change
series_id = cumsum(change)+1) %>% # create the series id
print() -> dt2
I think you can use rle for this. By itself, it doesn't immediately provide a grouping-like functionality, but we can either use data.table::rleid or construct our own function:
# borrowed from https://stackoverflow.com/a/62007567/3358272
myrleid <- function(x) {
rl <- rle(x)$lengths
rep(seq_along(rl), times = rl)
}
x9 <- c(-40.57,-40.57,-40.08,-40.08,-40.09,-40.08,-40.09,-40.09,-39.6,-39.6,-49.6,-39.6,-39.61,-39.12,-39.12-39.13,782.58,-41.04)
tibble(x9) %>%
mutate(grp = myrleid(x9 > 0)) %>%
group_by(grp) %>%
mutate(row = row_number()) %>%
ungroup()
# # A tibble: 17 x 3
# x9 grp row
# <dbl> <int> <int>
# 1 -40.6 1 1
# 2 -40.6 1 2
# 3 -40.1 1 3
# 4 -40.1 1 4
# 5 -40.1 1 5
# 6 -40.1 1 6
# 7 -40.1 1 7
# 8 -40.1 1 8
# 9 -39.6 1 9
# 10 -39.6 1 10
# 11 -49.6 1 11
# 12 -39.6 1 12
# 13 -39.6 1 13
# 14 -39.1 1 14
# 15 -78.2 1 15
# 16 783. 2 1
# 17 -41.0 3 1
I would like to add a column that counts the number of consecutive values. Most of what I am seeing on here is how to count duplicate values (1,1,1,1,1) and I would like to count a when the number goes up by 1 ( 5,6,7,8,9). The ID column is what I have and the counter column is what I would like to create. Thanks!
ID Counter
5 1
6 2
7 3
8 4
10 1
11 2
13 1
14 2
15 3
16 4
A solution using the dplyr package. The idea is to calculate the difference between each number to create a grouping column, and then assign counter to each group.
library(dplyr)
dat2 <- dat %>%
mutate(Diff = ID - lag(ID, default = 0),
Group = cumsum(Diff != 1)) %>%
group_by(Group) %>%
mutate(Counter = row_number()) %>%
ungroup() %>%
select(-Diff, -Group)
dat2
# # A tibble: 10 x 2
# ID Counter
# <int> <int>
# 1 5 1
# 2 6 2
# 3 7 3
# 4 8 4
# 5 10 1
# 6 11 2
# 7 13 1
# 8 14 2
# 9 15 3
# 10 16 4
DATA
dat <- read.table(text = "ID
5
6
7
8
10
11
13
14
15
16",
header = TRUE, stringsAsFactors = FALSE)
A loop version is simple:
for (i in 2:length(ID))
if (diff(ID)[i-1] == 1)
counter[i] <- counter[i-1] +1
else
counter[i] <- 1
But this loop will perform very bad for n > 10^4! I'll try to think of a vector-solution!
You can using
s=df$ID-shift(df$ID)
s[is.na(s)]=1
ave(s,cumsum(s!=1),FUN=seq_along)
[1] 1 2 3 4 1 2 1 2 3 4
This one makes use solely of highly efficient vector-arithmetic. Idea goes as follows:
1.take the cumulative sum of the differences of ID
2.subtract the value if jump is bigger than one
cum <- c(0, cumsum(diff(ID))) # take the cumulative difference of ID
ccm <- cum * c(1, (diff(ID) > 1)) # those with jump > 1 will remain its value
# subtract value with jump > 1 for all following numbers (see Link for reference)
# note: rep(0, n) is because ccm[...] starts at first non null value
counter <- cum - c(rep(0, which(diff(dat) != 1)[1]),
ccm[which(ccm != 0)][cumsum(ccm != 0)]) + 1
enter code here
Notes:
Reference for highliy efficient fill-function by nacnudus: Fill in data frame with values from rows above
Restriction: Id must be monotonically increasing
That should deal with your millions of data efficiently!
Another solution:
breaks <- c(which(diff(ID)!=1), length(ID))
x <- c(breaks[1], diff(breaks))
unlist(sapply(x, seq_len))
I have a data table that looks like
|userId|36|37|38|39|40|
|1|1|0|3|0|0|
|2|3|0|0|0|1|
Where each numbered column (36-40) represent week numbers. I want to calculate the number of weeks before the 1st occurrence of a non-zero value, and the last.
For instance, for userId 1 in my dataset, the first value appears at week 36, and the last one appears at week 38, so the value I want is 2. For userId 2 it's 40-36 which is 4.
I would like to store the data like:
|userId|lifespan|
|1|2|
|2|4|
I'm struggling to do this, can someone please help?
General method I would take is to melt it, convert the character column names to numeric, and take the delta by each userID. Here is an example using data.table.
library(data.table)
dt <- fread("userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1",
header = TRUE)
dt <- melt(dt, id.vars = "userId")
dt[, variable := as.numeric(as.character(variable))]
dt
# userId variable value
# 1: 1 36 1
# 2: 2 36 3
# 3: 1 37 0
# 4: 2 37 0
# 5: 1 38 3
# 6: 2 38 0
# 7: 1 39 0
# 8: 2 39 0
# 9: 1 40 0
# 10: 2 40 1
dt[!value == 0, .(lifespan = max(variable) - min(variable)), by = .(userId)]
# userId lifespan
# 1: 1 2
# 2: 2 4
Here's a dplyr method:
df %>%
gather(var, value, -userId) %>%
mutate(var = as.numeric(sub("X", "", var))) %>%
group_by(userId) %>%
slice(c(which.max(value!=0), max(which(value!=0)))) %>%
summarize(lifespan = var[2]-var[1])
Result:
# A tibble: 2 x 2
userId lifespan
<int> <dbl>
1 1 2
2 2 4
Data:
df = read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|")
I have a Data set consisting of dates when a person left the network. A person can leave a network multiple times as they may join the network again after leaving it. Following code replicates the scenario.
library(data.table)
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05")))
(ids repeat multiple times in this table as a person can leave a network multiple times given they joined it again)
> Leaving_Date
Id Date
1: 1 2017-01-01
2: 2 2017-02-03
3: 3 2017-01-01
4: 4 2017-03-10
5: 3 2017-02-09
6: 5 2017-02-05
I have another dataset giving the dates whenever a particular person was followed up which can be before or after they left the network. Following code replicates the scenario.
FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5),
Date =as.Date(c("2016-10-01","2017-02-04",
"2017-01-17","2017-02-23", "2017-03-03",
"2017-02-10","2017-02-11","2017-01-01",
"2017-01-15","2017-01-01")))
> FOLLOWUPs
Id Date
1: 1 2016-10-01
2: 2 2017-02-04
3: 3 2017-01-17
4: 2 2017-02-23
5: 2 2017-03-03
6: 3 2017-02-10
7: 3 2017-02-11
8: 4 2017-01-01
9: 1 2017-01-15
10: 5 2017-01-01
Now I want to lookup each case in Leaving_Date and find dates when they were followed up and create three columns(SevenDay, FourteenDay,ThirtyDay) indicating time period of followup(incase if there was any) in 0s and 1s. I am using following code :
SEVENDAY_FOLLOWUP <- vector()
FOURTEEN_FOLLOWUP <- vector()
THIRTYDAY_FOLLOWUP <- vector()
for(i in 1:nrow(Leaving_Date)){
sub_data <- FOLLOWUPs[Id== Leaving_Date[i,1]]
if(nrow(sub_data[Date > Leaving_Date[i,Date] &
Date < (Leaving_Date[i,Date]+7)])== 0){
SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,0)
}
else{
SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,1)
}
if(nrow(sub_data[Date > Leaving_Date[i,Date] &
Date < (Leaving_Date[i,Date]+14)])== 0){
FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,0)
}
else{
FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,1)
}
if(nrow(sub_data[Date > Leaving_Date[i,Date] &
Date < (Leaving_Date[i,Date]+30)])== 0){
THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,0)
}
else{
THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,1)
}
}
Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)
Final Data
> Leaving_Date
Id Date SEVENDAY FOURTEENDAY THIRTYDAY
1: 1 2017-01-01 0 0 1
2: 2 2017-02-03 1 1 1
3: 3 2017-01-01 0 0 1
4: 4 2017-03-10 0 0 0
5: 3 2017-02-09 1 1 1
6: 5 2017-02-05 0 0 0
This code is very inefficient as I have to run it for 100k observations and it takes a lot of time. Is there any efficient way to do this.
Using a non-equi join:
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n :=
FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]
Id Date n
1: 1 2017-01-01 14 days
2: 2 2017-02-03 1 days
3: 3 2017-01-01 16 days
4: 4 2017-03-10 NA days
5: 3 2017-02-09 1 days
6: 5 2017-02-05 NA days
Switching from Date to IDate will probably make this about twice as fast. See ?IDate.
I think it's best to stop here, but n can be compared against 7, 14, 30 if necessary, like
Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]]
Id Date n bin
1: 1 2017-01-01 14 days 30
2: 2 2017-02-03 1 days 7
3: 3 2017-01-01 16 days 30
4: 4 2017-03-10 NA days NA
5: 3 2017-02-09 1 days 7
6: 5 2017-02-05 NA days NA
Side note: Please don't give tables names like this.
I think this does what you are looking for using dplyr.
It does an 'inner join' by Id - generating all combinations of dates in the two data frames for a given Id - then calculates the date differences, groups by Id, then checks whether there are values falling in the ranges for your three categories.
library(dplyr)
Leaving_Date2 <- Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date=Date)) %>%
mutate(datediff=as.numeric(FU_Date-Date)) %>% group_by(Id,Date) %>%
summarise(SEVENDAY=as.numeric(any(datediff %in% 0:6)),
FOURTEENDAY=as.numeric(any(datediff %in% 0:13)),
THIRTYDAY=as.numeric(any(datediff %in% 0:29)))
We can do this as a query instead of a loop. First, I cleaned your data.tables a bit because I was getting confused by the variable names.
To make things easier for the comparison step, we first pre-compute the follow up date limit for the 7, 14 and 30 day thresholds.
library(dplyr)
dt_leaving_neat = Leaving_Date %>%
mutate(.id = 1:n()) %>%
mutate(limit_07 = Date + 7) %>%
mutate(limit_14 = Date + 14) %>%
mutate(limit_30 = Date + 30) %>%
rename(id = .id, id_person = Id, leaving_date = Date)
dt_follow_neat = FOLLOWUPs %>%
select(id_person = Id, followed_up_date = Date)
The actual operation is just a query. It's written out in dplyr for readability, but if speed is a main concern of yours, you could translate it to data.table. I'd recommend running each step in the pipeline to make sure you understand what's going on.
dt_followed_up = dt_leaving_neat %>%
tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
left_join(dt_follow_neat, by = "id_person") %>%
mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>%
select(id, id_person, leaving_date, follow_up, followed_up) %>%
filter(followed_up == TRUE) %>%
unique() %>%
tidyr::spread(follow_up, followed_up, fill = 0) %>%
select(id, id_person, leaving_date, limit_07, limit_14, limit_30)
The idea is to join the leaving dates to the follow up dates and check whether the follow up date is within the threshold (and also after the leaving date, as presumably you can't follow up before leaving).
Then some final cleaning to return your desired format. You can use select or rename to change the column names back too.
dt_result = dt_leaving_neat %>%
select(id, id_person, leaving_date) %>%
left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))
dt_result[is.na(dt_result)] = 0
Result
> dt_result
id id_person leaving_date limit_07 limit_14 limit_30
1 1 1 2017-01-01 0 0 1
2 2 2 2017-02-03 1 1 1
3 3 3 2017-01-01 0 0 1
4 4 4 2017-03-10 0 0 0
5 5 3 2017-02-09 1 1 1
6 6 5 2017-02-05 0 0 0
And following Andrew's answer, an equivalent 1 line data.table soln is
FOLLOWUPs[Leaving_Date, on = "Id", .(Id, follow_date = Date, leaving_date = i.Date)][, diff := follow_date - leaving_date][, .(seven = any(diff %in% 0:6), fourteen = any(diff %in% 0:13), thirty = any(diff %in% 0:29)), .(Id, leaving_date)]