Find index of first and last occurrence in data table - r

I have a data table that looks like
|userId|36|37|38|39|40|
|1|1|0|3|0|0|
|2|3|0|0|0|1|
Where each numbered column (36-40) represent week numbers. I want to calculate the number of weeks before the 1st occurrence of a non-zero value, and the last.
For instance, for userId 1 in my dataset, the first value appears at week 36, and the last one appears at week 38, so the value I want is 2. For userId 2 it's 40-36 which is 4.
I would like to store the data like:
|userId|lifespan|
|1|2|
|2|4|
I'm struggling to do this, can someone please help?

General method I would take is to melt it, convert the character column names to numeric, and take the delta by each userID. Here is an example using data.table.
library(data.table)
dt <- fread("userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1",
header = TRUE)
dt <- melt(dt, id.vars = "userId")
dt[, variable := as.numeric(as.character(variable))]
dt
# userId variable value
# 1: 1 36 1
# 2: 2 36 3
# 3: 1 37 0
# 4: 2 37 0
# 5: 1 38 3
# 6: 2 38 0
# 7: 1 39 0
# 8: 2 39 0
# 9: 1 40 0
# 10: 2 40 1
dt[!value == 0, .(lifespan = max(variable) - min(variable)), by = .(userId)]
# userId lifespan
# 1: 1 2
# 2: 2 4

Here's a dplyr method:
df %>%
gather(var, value, -userId) %>%
mutate(var = as.numeric(sub("X", "", var))) %>%
group_by(userId) %>%
slice(c(which.max(value!=0), max(which(value!=0)))) %>%
summarize(lifespan = var[2]-var[1])
Result:
# A tibble: 2 x 2
userId lifespan
<int> <dbl>
1 1 2
2 2 4
Data:
df = read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|")

Related

Sum of elements in a forward looking rolling window by month

I have the following data.frame with columns: Id, Month, have
library(dplyr)
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-2018 1.000000000000000 1.234567901220000
1 01-Feb-2018 0.200000000000000 0.234567901233000
1 01-Mar-2018 0.030000000000000 0.034567901234400
1 01-Apr-2018 0.004000000000000 0.004567901234550
1 01-May-2018 0.000500000000000 0.000567901234566
1 01-Jun-2018 0.000060000000000 0.000067901234566
1 01-Jul-2018 0.000007000000000 0.000007901234566
1 01-Aug-2018 0.000000800000000 0.000000901234566
1 01-Sep-2018 0.000000090000000 0.000000101234566
1 01-Oct-2018 0.000000010000000 0.000000011234566
1 01-Nov-2018 0.000000001100000 0.000000001234566
1 01-Dec-2018 0.000000000120000 0.000000000134566
1 01-Jan-2019 0.000000000013000 0.000000000014566
1 01-Feb-2019 0.000000000001400 0.000000000001566
1 01-Mar-2019 0.000000000000150 0.000000000000166
1 01-Apr-2019 0.000000000000016 0.000000000000016
2 01-Jan-2018 1337.00 1338.00
2 01-Feb-2018 1.00 1.00
3 01-Jan-2018 5.000000000000000000 5.000000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%Y')
I would like to programmatically calculate sum of elements in a 12 month forward looking rolling window by Month and grouped by Id as demonstrated in column want. If the rolling observation window is less than 12 months, the missing elements should be ignored.
For bonus points would the solution would also allow for missing months, such as in:
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-18 1.000000000000000 1.200000000000000
1 01-Dec-18 0.200000000000000 0.230000000000000
1 01-Jan-19 0.030000000000000 0.030000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%Y')
I have tried different solutions, e.g. rollapplyr() of the zoo package and some functions in the runner package, but it doesn't seem to give me what I need.
You can use zoo's rollaply with partial = TRUE
library(dplyr)
dt %>%
group_by(Id) %>%
tidyr::complete(Month = seq(min(Month), max(Month), "month")) %>%
mutate(result = zoo::rollapply(have, 12, sum, na.rm = TRUE,
align = 'left', partial = TRUE)) -> result
result
If you have data for every month for each Id like in the example shared you can remove the complete step.
I suggest to use runner package in this case. runner function let you to calculate rolling window having a full control in time. k is a window length, lag is a lag of the window and in idx you specify index column which window depends on.
library(runner)
dt %>%
group_by(Id) %>%
mutate(want2 = runner(
.,
f = function(x) sum(x$have),
k = 12, # or "12 months"
lag = -11, # or "-11 months"
idx = Month)
)
# # A tibble: 19 x 5
# # Groups: Id [3]
# Id Month have want want2
# <int> <date> <dbl> <dbl> <dbl>
# 1 1 2018-01-01 1.00e+ 0 1.23e+ 0 1.00e+ 0
# 2 1 2018-02-01 2.00e- 1 2.35e- 1 2.00e- 1
# 3 1 2018-03-01 3.00e- 2 3.46e- 2 3.00e- 2
# 4 1 2018-04-01 4.00e- 3 4.57e- 3 4.00e- 3
# 5 1 2018-05-01 5.00e- 4 5.68e- 4 5.00e- 4
# 6 1 2018-06-01 6.00e- 5 6.79e- 5 6.00e- 5

Consecutive wins/losses R

I am still new to R and learning methods for conducting analysis. I have a df which I want to count the consecutive wins/losses based on column "x9". This shows the gain/loss (positive value or negative value) for the trade entered. I did find some help on code that helped with assigning a sign, sign lag and change, however, I am looking for counter to count the consecutive wins until a loss is achieved then reset, and then count the consecutive losses until a win is achieved. Overall am looking for assistance to adjust the counter to reset when consecutive wins/losses are interrupted. I have some sample code below and a attached .png to explain my thoughts
#Read in df
df=vroom::vroom(file = "analysis.csv")
#Filter df for specfic order types
df1 = filter(df, (x3=="s/l") |(x3=="t/p"))
#Create additional column to tag wins/losses in df1
index <- c("s/l","t/p")
values <- c("Loss", "Win")
df1$col2 <- values[match(df1$x3, index)]
df1
#Mutate df to review changes, attempt to review consecutive wins and losses & reset when a
#positive / negative value is encountered
df2=df1 %>%
mutate(sign = ifelse(x9 > 0, "pos", ifelse(x9 < 0, "neg", "zero")), # get the sign of the value
sign_lag = lag(sign, default = sign[9]), # get previous value (exception in the first place)
change = ifelse(sign == sign_lag, 1 , 0), # check if there's a change
series_id = cumsum(change)+1) %>% # create the series id
print() -> dt2
I think you can use rle for this. By itself, it doesn't immediately provide a grouping-like functionality, but we can either use data.table::rleid or construct our own function:
# borrowed from https://stackoverflow.com/a/62007567/3358272
myrleid <- function(x) {
rl <- rle(x)$lengths
rep(seq_along(rl), times = rl)
}
x9 <- c(-40.57,-40.57,-40.08,-40.08,-40.09,-40.08,-40.09,-40.09,-39.6,-39.6,-49.6,-39.6,-39.61,-39.12,-39.12-39.13,782.58,-41.04)
tibble(x9) %>%
mutate(grp = myrleid(x9 > 0)) %>%
group_by(grp) %>%
mutate(row = row_number()) %>%
ungroup()
# # A tibble: 17 x 3
# x9 grp row
# <dbl> <int> <int>
# 1 -40.6 1 1
# 2 -40.6 1 2
# 3 -40.1 1 3
# 4 -40.1 1 4
# 5 -40.1 1 5
# 6 -40.1 1 6
# 7 -40.1 1 7
# 8 -40.1 1 8
# 9 -39.6 1 9
# 10 -39.6 1 10
# 11 -49.6 1 11
# 12 -39.6 1 12
# 13 -39.6 1 13
# 14 -39.1 1 14
# 15 -78.2 1 15
# 16 783. 2 1
# 17 -41.0 3 1

Adding a column that counts sequential numbers

I would like to add a column that counts the number of consecutive values. Most of what I am seeing on here is how to count duplicate values (1,1,1,1,1) and I would like to count a when the number goes up by 1 ( 5,6,7,8,9). The ID column is what I have and the counter column is what I would like to create. Thanks!
ID Counter
5 1
6 2
7 3
8 4
10 1
11 2
13 1
14 2
15 3
16 4
A solution using the dplyr package. The idea is to calculate the difference between each number to create a grouping column, and then assign counter to each group.
library(dplyr)
dat2 <- dat %>%
mutate(Diff = ID - lag(ID, default = 0),
Group = cumsum(Diff != 1)) %>%
group_by(Group) %>%
mutate(Counter = row_number()) %>%
ungroup() %>%
select(-Diff, -Group)
dat2
# # A tibble: 10 x 2
# ID Counter
# <int> <int>
# 1 5 1
# 2 6 2
# 3 7 3
# 4 8 4
# 5 10 1
# 6 11 2
# 7 13 1
# 8 14 2
# 9 15 3
# 10 16 4
DATA
dat <- read.table(text = "ID
5
6
7
8
10
11
13
14
15
16",
header = TRUE, stringsAsFactors = FALSE)
A loop version is simple:
for (i in 2:length(ID))
if (diff(ID)[i-1] == 1)
counter[i] <- counter[i-1] +1
else
counter[i] <- 1
But this loop will perform very bad for n > 10^4! I'll try to think of a vector-solution!
You can using
s=df$ID-shift(df$ID)
s[is.na(s)]=1
ave(s,cumsum(s!=1),FUN=seq_along)
[1] 1 2 3 4 1 2 1 2 3 4
This one makes use solely of highly efficient vector-arithmetic. Idea goes as follows:
1.take the cumulative sum of the differences of ID
2.subtract the value if jump is bigger than one
cum <- c(0, cumsum(diff(ID))) # take the cumulative difference of ID
ccm <- cum * c(1, (diff(ID) > 1)) # those with jump > 1 will remain its value
# subtract value with jump > 1 for all following numbers (see Link for reference)
# note: rep(0, n) is because ccm[...] starts at first non null value
counter <- cum - c(rep(0, which(diff(dat) != 1)[1]),
ccm[which(ccm != 0)][cumsum(ccm != 0)]) + 1
enter code here
Notes:
Reference for highliy efficient fill-function by nacnudus: Fill in data frame with values from rows above
Restriction: Id must be monotonically increasing
That should deal with your millions of data efficiently!
Another solution:
breaks <- c(which(diff(ID)!=1), length(ID))
x <- c(breaks[1], diff(breaks))
unlist(sapply(x, seq_len))

Check whether the data in column1 lies in a range And add data in column 2

I have a data frame in R containing 2 columns. I want to check whether the data in column one lies in the following range: x>80,70
count1
Var1 Freq
1 0.00000 7
2 10.00000 1
3 16.66667 1
4 30.95238 1
5 33.33333 2
Data frame contains 32 rows in total with values in column 1 ranging from 0 to 100.
output should be something like this :
Var1 Freq
1 x<60 12
2 60<x<70 *something*
3 70<x<80 *something*
4 x>80 *something*
With the datatable library
df is your dataframe :
breaks <- c(0,60,70,80,Inf)
setDT(df)
df[,list(SUM = sum(freq)),by = list(VAR=cut(var1,breaks = breaks))][order(VAR)]
With dplyr library :
df %>%
group_by(VAR = cut(var1, breaks = breaks)) %>%
summarise(SUM = sum(freq)) %>%
arrange(VAR)

Flag dates based on multiple columns

I have a df, this provides information about the create_date and delete_date(if any) for a given ID.
Structure:
ID create_date1 create_date2 delete_date1 delete_date2
1 01-01-2014 NA NA NA
2 01-04-2014 01-08-2014 01-05-2014 NA
the create_date and delete_date extends till 10, i.e. create_date10
and delete_date10 columns are present
Rules/Logic:
We charge a user on monthly basis, if a user was created on 30th of a month, even then it's treated as if the user was active for a month(very low cost)
If a user has a delete date (irrespective on which date) in this month, then from next month the user is not charged
If a user has only create_date and no delete_date then all dates including the create_month is charged
Output expected:
ID 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08
1 1 1 1 1 1 1 1 1
2 0 0 0 1 1 0 0 1
so on till current date
1 indicates the user is charged/active for that month
Problem:
I have been struggling to do this, but can't even understand how to do this. My earlier method is a bit too slow
Previous Solution:
Make the dataset into tall
Insert sequence of dates for each ID as a new column
Use a for loop to check the status
for each ID, status is equal to 1,
if create_date is equal to sequence, and it's 0 if the lag(delete_date) is equal to sequence
else is same as lag(status)
ID create_date delete_date sequence status?
1 01-01-2014 NA 2014-01 1
1 01-01-2014 NA 2014-02 1
1 01-01-2014 NA 2014-03 1
may not be that efficient : assuming this is just for a single year(could be extended easily)
# convert all dates to Date format
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) as.Date(df[[x]], format = "%d-%m-%Y"))
# extract the month
library(lubridate)
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) month(df[[x]]))
# df
# ID create_date1 create_date2 delete_date1 delete_date2
#1 1 1 NA NA NA
#2 2 4 8 5 NA
# get the current month
current.month <- month(Sys.Date())
# assume for now current month is 9
current.month <- 9
flags <- rep(FALSE, current.month)
func <- function(x){
x[is.na(x)] <- current.month # replacing all NA with current month(9)
create.columns.indices <- x[grepl("create_date", colnames(df[-1]))] # extract the create_months
delete.columns.indices <- x[grepl("delete_date", colnames(df[-1]))] # extract the delete_months
flags <- pmin(1,colSums(t(sapply(seq_along(create.columns.indices),
function(x){
flags[create.columns.indices[x]:delete.columns.indices[x]] = TRUE;
flags
}))))
flags
}
df1 = cbind(df$ID , t(apply(df[-1], 1, func)))
colnames(df1) = c("ID", paste0("month",1:current.month))
# df1
# ID month1 month2 month3 month4 month5 month6 month7 month8 month9
#[1,] 1 1 1 1 1 1 1 1 1 1
#[2,] 2 0 0 0 1 1 0 0 1 1
Here's a still-pretty-long tidyverse approach:
library(tidyverse)
df %>% gather(var, date, -ID) %>% # reshape to long form
# separate date type from column set number
separate(var, c('action', 'number'), sep = '_date', convert = TRUE) %>%
mutate(date = as.Date(date, '%d-%m-%Y')) %>% # parse dates
spread(action, date) %>% # spread create and delete to two columns
mutate(min_date = min(create, delete, na.rm = TRUE), # add helper columns; use outside
max_date = max(create, delete, na.rm = TRUE)) %>% # variable to save memory if an issue
group_by(ID, number) %>%
mutate(month = list(seq(min_date, max_date, by = 'month')), # add month sequence list column
# boolean vector of whether range of months in whole range
active = ifelse(is.na(create),
list(rep(FALSE, length(month[[1]]))),
lapply(month, `%in%`,
seq.Date(create,
min(delete, max_date, na.rm = TRUE),
by = 'month')))) %>%
unnest() %>% # unnest list columns to long form
group_by(ID, month = format(month, '%Y-%m')) %>%
summarise(active = any(active) * 1L) %>% # combine muliple rows for one ID
spread(month, active) # reshape to wide form
## Source: local data frame [2 x 9]
## Groups: ID [2]
##
## ID `2014-01` `2014-02` `2014-03` `2014-04` `2014-05` `2014-06` `2014-07` `2014-08`
## * <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 1 1 1 1 1 1 1 1
## 2 2 0 0 0 1 1 0 0 1

Resources