How do I find the number of continuous weeks by group, counted back from the max date in the dataset?
Say I have this dataframe:
id Week
1 A 2/06/2019
2 A 26/05/2019
3 A 19/05/2019
4 A 12/05/2019
5 A 5/05/2019
6 B 2/06/2019
7 B 26/05/2019
8 B 12/05/2019
9 B 5/05/2019
10 C 26/05/2019
11 C 19/05/2019
12 C 12/05/2019
13 D 2/06/2019
14 D 26/05/2019
15 D 19/05/2019
16 E 2/06/2019
17 E 19/05/2019
18 E 12/05/2019
19 E 5/05/2019
My desired output is:
id count
1: A 5
2: B 2
3: D 3
4: E 1
I am currently converting the dates to a factor to get an ordered number, and checking it against a reference number created from the number of rows in each group.
library(data.table)
df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 5L),
.Label = c("A", "B", "C", "D", "E"), class = "factor"),
Week = structure(c(3L, 4L, 2L, 1L, 5L, 3L, 4L, 1L, 5L, 4L, 2L, 1L, 3L, 4L, 2L, 3L, 2L, 1L, 5L),
.Label = c("12/05/2019", "19/05/2019", "2/06/2019", "26/05/2019", "5/05/2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -19L))
dt <- data.table(df)
dt[, Week_no := as.factor(as.Date(Week, format = "%d/%m/%Y"))]
dt[, Week_no := factor(Week_no)]
dt[, Week_no := as.numeric(Week_no)]
max_no <- max(dt$Week_no)
dt[, Week_ref := max_no:(max_no - .N + 1), by = "id"]
dt[, Week_diff := Week_no - Week_ref]
dt[Week_diff == 0, list(count = .N), by = "id"]
Here's one way to do this:
dt <- dt[, Week := as.Date(Week, format = "%d/%m/%Y")]
ids_having_max <- dt[.(max(Week)), id, on = "Week"]
dt <- dt[.(ids_having_max), on = "id"
][order(-Week), .(count = sum(rleid(c(-7L, diff(Week))) == 1)), by = "id"]
Breaking it into steps:
We keep Week as a Date (rather than a factor) because dates can be compared directly,
and subtracting dates gives time differences.
We then get all the ids that contain the maximum date in the whole table.
This is using secondary indices.
We use secondary indices again to filter out those ids that were not part of the previous result
(the dt[.(ids_having_max), on = "id"] part).
The last frame is tricky.
We group by id and make sure that rows are ordered by Week in descending order.
Then the logic is as follows.
When you have contiguous weeks,
diff(Week) is always -7 with the chosen sorting.
diff returns a vector one element shorter than its input (each result is the difference between consecutive elements),
so we prepend a -7 to make sure it is the first element in the input to rleid.
With rleid we assign a 1 to the first -7 and keep the 1 until we see something different from -7.
Something different means weeks stopped being contiguous.
The sum(rleid(c(-7L, diff(Week))) == 1) will simply return how many rows had a rleid equal to 1.
Example of the last part for B:
Differences: -7, -14, -7
After prepending -7: -7, -7, -14, -7
After rleid: 1, 1, 2, 3
From the previous, two had a rleid == 1
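The same walkthrough for B as a small runnable sketch (my illustration; wk_diff is just the differences listed above):
wk_diff <- c(-7L, -14L, -7L)        # diff(Week) for B with the descending sort
rleid(c(-7L, wk_diff))              # prepend -7, run ids become 1 1 2 3
sum(rleid(c(-7L, wk_diff)) == 1)    # 2 rows in the first run, i.e. B's count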
Apologies for the dplyr solution, but I presume a similar approach can be achieved more concisely with data.table.
library(dplyr)
df$Week = lubridate::dmy(df$Week)
df %>%
group_by(id) %>%
arrange(id, Week) %>%
# Assign group to each new streak
mutate(new_streak = cumsum(Week != lag(Week, default = first(Week) - 14) + 7)) %>% # default must be a Date in current dplyr
add_count(id, new_streak) %>%
slice(n()) # Only keep last week
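This keeps one row per id (including C). If you also want to drop ids whose streak does not reach the overall maximum week, as in the desired output, a possible follow-up is the sketch below (my addition, reusing the columns created above); on the sample data it should give A 5, B 2, D 3, E 1:
df %>%
  group_by(id) %>%
  arrange(id, Week) %>%
  mutate(new_streak = cumsum(Week != lag(Week, default = first(Week) - 14) + 7)) %>%
  add_count(id, new_streak) %>%
  slice(n()) %>%                    # keep the latest week of each id
  ungroup() %>%
  filter(Week == max(Week)) %>%     # only ids whose latest streak reaches the overall max week
  select(id, count = n)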
So I would suggest converting the format of the date column to show the week number "%W" as follows
dt[, Week_no := format(as.Date(Week, format = "%d/%m/%Y"),"%W")]
Then count the number of unique week numbers for each id value
dt[, .(count = length(unique(Week_no))), by = "id"]
FULL DISCLOSURE
I realise that when I run this I get a different table from the one you present, as R counts weeks by their week number within the given year.
If this doesn't answer your question, just let me know and I can try to update.
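For reference (my addition), after the conversion above you can inspect which year-based week numbers the counts are built from:
sort(unique(dt$Week_no))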
I'm currently working in data.table in R with the following data set:
id age_start age_end cases
1 2 2 1000
1 3 3 500
1 4 4 300
1 2 4 1800
2 2 2 8000
2 3 3 200
2 4 4 100
In the given data set I only want values of cases where age_start == 2 and age_end == 4.
For each ID that has no row with age_start == 2 and age_end == 4, I need to sum or aggregate the rows to create such a group: that is, sum up the cases of the age_start==2 & age_end==2, age_start==3 & age_end==3, and age_start==4 & age_end==4 rows into one new row with age_start==2 and age_end==4.
After these are summed up into one row, I want to drop the rows that I used to make the new age_start==2 and age_end==4 row (i.e. the age values 2-2, 3-3, and 4-4), as they are no longer needed.
Ideally the data set would look like this when I finish these steps:
id age_start age_end cases
1 2 4 1800
2 2 4 8300
Any suggestions on how to accomplish this in data.table are greatly appreciated!
You can use an equi-join for the first part (ids that already have a 2-4 row) and a non-equi join for the second (ids that need aggregating). Here x is the question's data as a data.table, e.g. x <- as.data.table(df1) with df1 as in the data block further down:
m_equi = x[.(id = unique(id), age_dn = 2, age_up = 4),
on=.(id, age_start = age_dn, age_end = age_up),
nomatch=0
]
m_nonequi = x[!m_equi, on=.(id)][.(id = unique(id), age_dn = 2, age_up = 4),
on=.(id, age_start >= age_dn, age_end <= age_up),
.(cases = sum(cases)), by=.EACHI
]
res = rbind(m_equi, m_nonequi)
id age_start age_end cases
1: 1 2 4 1800
2: 2 2 4 8300
How it works:
x[i] uses values in i to look up rows and columns in x according to rules specified in on=.
nomatch=0 means unmatched rows of i in x[i] are dropped, so m_equi only ends up with id=1.
x[!m_equi, on=.(id)] is an anti-join that skips id=1 since we already matched it in the equi join.
by=.EACHI groups by each row of i in x[i] for the purpose of doing the aggregation.
An alternative would be to anti-join on rows with start 2 and end 4 so that all groups need to be aggregated (similar to #akrun's answer), though I guess that would be less efficient.
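A rough sketch of what that alternative could look like (my illustration, reusing x from above):
rows_24 = x[age_start == 2 & age_end == 4]        # rows that already span 2-4
x[!rows_24, on = .(id, age_start, age_end)        # drop exactly those rows
  ][, .(age_start = 2L, age_end = 4L, cases = sum(cases)), by = id]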
We can specify the logical condition in i, group by 'id', and get the sum of 'cases' while setting 'age_start' and 'age_end' to 2 and 4
library(data.table)
as.data.table(df1)[age_start != 2|age_end != 4,
.(age_start = 2, age_end = 4, cases = sum(cases)), id]
# id age_start age_end cases
#1: 1 2 4 1800
#2: 2 2 4 8300
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), age_start = c(2L,
3L, 4L, 2L, 2L, 3L, 4L), age_end = c(2L, 3L, 4L, 4L, 2L, 3L,
4L), cases = c(1000L, 500L, 300L, 1800L, 8000L, 200L, 100L)),
class = "data.frame", row.names = c(NA,
-7L))
I am trying to create new variables in my data set that are cumulative totals which restart based on other variables (using group by)… I want these to be new columns in the data set and this is the part I am struggling with...
Using the data below, I want to create cumulative Sale and Profit columns that will restart for every Product and Product_Cat grouping.
The code below partly gives me what I need, but the variables are not new variables; instead it overwrites the existing Sale/Profit columns... what am I getting wrong? I imagine this is simple, but I haven't found anything.
Note: I'm using lapply as my real data set has 40+ variables that I need to create calculations for.
DT <- setDT(Data)[,lapply(.SD, cumsum), by = .(Product,Product_Cat) ]
Data for example:
Product <- c('A','A','A','B','B','B','C','C','C')
Product_Cat <- c('S1','S1','S2','C1','C1','C1','D1','E1','F1')
Sale <- c(10,15,5,20,15,10,5,5,5)
Profit <- c(2,4,2,6,8,2,4,6,8)
Sale_Cum <- c(10,25,5,20,35,45,5,5,5)
Profit_Cum <- c(2,6,2,6,14,16,4,6,8)
Data <- data.frame(Product,Product_Cat,Sale,Profit)
Desired_Data <- data.frame(Product,Product_Cat,Sale,Profit,Sale_Cum,Profit_Cum)
This doesn't use the group by per se but I think it achieves what you're looking for in that it is easily extensible to many columns:
D2 <- data.frame(lapply(Data[,c(3,4)], cumsum))
names(D2) <- gsub("$", "_cum", names(Data[,c(3,4)]))
Data <- cbind(Data, D2)
If you have 40+ columns just change the c(3,4) to include all the columns you're after.
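For example (my sketch, not part of the original answer), to take every column from the third one onward:
D2 <- data.frame(lapply(Data[, 3:ncol(Data)], cumsum))
names(D2) <- gsub("$", "_cum", names(Data)[3:ncol(Data)])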
EDIT:
I forgot that the OP wanted it to reset for each category. In that case, you can modify your original code:
DT <- setDT(Data)[, lapply(.SD, cumsum), by = .(Product, Product_Cat)]
names(DT)[-(1:2)] <- gsub("$", "_cum", names(Data)[-(1:2)])
cbind(Data, DT[, -(1:2), with = FALSE])
library(data.table)
setDT(Data)
cols <- names(Data)[3:4]
Data[, paste0(cols, '_cumsum') := lapply(.SD, cumsum)
, by = .(Product, Product_Cat)
, .SDcols = cols]
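With 40+ value columns you could also build cols from everything except the grouping columns before the := step (my suggestion, not part of the original answer):
cols <- setdiff(names(Data), c("Product", "Product_Cat"))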
Data:
structure(list(Product = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Product_Cat = structure(c(5L,
5L, 6L, 1L, 1L, 1L, 2L, 3L, 4L), .Label = c("C1", "D1", "E1",
"F1", "S1", "S2"), class = "factor"), Sale = c(10L, 15L, 5L,
20L, 15L, 10L, 5L, 5L, 5L), Profit = c(2L, 4L, 2L, 6L, 8L, 2L,
4L, 6L, 8L), Sale_Cum = c(10, 25, 5, 20, 35, 45, 5, 5, 5), Profit_Cum = c(2,
6, 2, 6, 14, 16, 4, 6, 8)), .Names = c("Product", "Product_Cat",
"Sale", "Profit", "Sale_Cum", "Profit_Cum"), row.names = c(NA,
-9L), class = "data.frame")
We can iteratively slice the data frame based on Product and Product_Cat and, for each slice, assign the output produced by cumsum() to Sale_Cum and Profit_Cum:
x <- Data  # the sample data, under the name used in the loop below
cols <- c('Sale', 'Profit')
for (column in cols){
x[, paste0(column, '_Cum')] <- 0
for(p in unique(x$Product)){
for (pc in unique(x$Product_Cat)){
x[x$Product == p & x$Product_Cat == pc, paste0(column, '_Cum')] <- cumsum(x[x$Product == p & x$Product_Cat == pc, column])
}
}
}
print(x)
# Product Product_Cat Sale Profit Sale_Cum Profit_Cum
# 1 A S1 10 2 10 2
# 2 A S1 15 4 25 6
# 3 A S2 5 2 5 2
# 4 B C1 20 6 20 6
# 5 B C1 15 8 35 14
# 6 B C1 10 2 45 16
# 7 C D1 5 4 5 4
# 8 C E1 5 6 5 6
# 9 C F1 5 8 5 8
Here is some pretty poor code that does everything step by step
#sample data
d<-sample(1:10)
f<-sample(1:10)
p<-c("f","f","f","f","q","q","q","w","w","w")
pc<-c("c","c","d","d","d","v","v","v","b","b")
cc<-data.table(p,pc,d,f)
#storing the values that are overwritten first.
three<-cc[,3]
four<- cc[,4]
#applying your function
dt<-setDT(cc)[,lapply(.SD,cumsum), by=.(p,pc)]
#binding the stored values to your function and renaming everything.
x<-cbind(dt,three,four)
colnames(x)[5]<-"sale"
colnames(x)[6]<-"profit"
colnames(x)[3]<-"CumSale"
colnames(x)[4]<-"CumProfit"
#reordering the columns
xx<-x[,c("p","pc","profit","sale","CumSale","CumProfit")]
xx
I have data similar to this:
dt <- structure(list(fct = structure(c(1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), X = c(2L, 4L, 3L, 2L, 5L, 4L, 7L, 2L, 9L, 1L, 4L, 2L, 5L, 4L, 2L)), .Names = c("fct", "X"), class = "data.frame", row.names = c(NA, -15L))
I want to select rows from this data frame based on the values in the fct variable. For example, if I wish to select rows containing either "a" or "c" I can do this:
dt[dt$fct == 'a' | dt$fct == 'c', ]
which yields
fct X
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
as expected. But my actual data is more complex and I actually want to select rows based on the values in a vector such as
vc <- c('a', 'c')
So I tried
dt[dt$fct == vc, ]
but of course that doesn't work. I know I could code something to loop through the vector and pull out the rows needed and append them to a new dataframe, but I was hoping there was a more elegant way.
So how can I filter/subset my data based on the contents of the vector vc?
Have a look at ?"%in%".
dt[dt$fct %in% vc,]
fct X
1 a 2
3 c 3
5 c 5
7 a 7
9 c 9
10 a 1
12 c 2
14 c 4
You could also use ?is.element:
dt[is.element(dt$fct, vc),]
Similar to the above, using filter from dplyr:
filter(dt, fct %in% vc)
Another option would be to use a keyed data.table:
library(data.table)
setDT(dt, key = 'fct')[J(vc)] # or: setDT(dt, key = 'fct')[.(vc)]
which results in:
fct X
1: a 2
2: a 7
3: a 1
4: c 3
5: c 5
6: c 9
7: c 2
8: c 4
What this does:
setDT(dt, key = 'fct') transforms the data.frame to a data.table (which is an enhanced form of a data.frame) with the fct column set as key.
Next you can just subset with the vc vector with [J(vc)].
NOTE: when the key is a factor/character variable, you can also use setDT(dt, key = 'fct')[vc], but that won't work when vc is a numeric vector. When vc is a numeric vector and is not wrapped in J() or .(), it is treated as a row index.
A more detailed explanation of the concept of keys and subsetting can be found in the vignette Keys and fast binary search based subset.
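A tiny made-up example of that numeric-key pitfall (ndt and its columns are hypothetical):
ndt <- data.table(x = c(10, 20, 30), y = 1:3, key = 'x')
ndt[c(20, 30)]     # treated as row numbers 20 and 30, so you get NA rows
ndt[J(c(20, 30))]  # treated as key values, returns the rows where x is 20 or 30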
An alternative as suggested by #Frank in the comments:
setDT(dt)[J(vc), on=.(fct)]
When vc contains values that are not present in dt, you'll need to add nomatch = 0:
setDT(dt, key = 'fct')[J(vc), nomatch = 0]
or:
setDT(dt)[J(vc), on=.(fct), nomatch = 0]
I have a data set in R of 1.5 million rows and 23 columns, which looks like:
ID Week col1 col2 col3 ...
A 1 2 3 1
A 2 3 4 1
...
A 69 15 2 11
B 1 5 1 2
B 2 6 10 3
...
B 69 2 1 1
Z 1 1 12 2
Z 2 4 5 3
...
Z 69 1 20 2
I want to alter each ID, but only in Week 69, using one third of the max value within each ID group.
For example:
Take the max value of col1 for ID = A, divide it by 3, and use it to update the Week 69 row in the original data set.
My current logic, which does not seem to be working:
index<-unique(data$ID)
dat<-filter(data, id== index[1])
b<-sapply(dat[,3:23],max)
b<-b/3
dat[69,4:23]<-dat[69,4:23]+b
data.alt<-dat
for (i in 2:19477)
{
dat<-filter(data, id== index[i])
b<-sapply(dat[,4:23],max)
b<-b/3
dat[69,4:23]<-dat[69,4:23]+b
data.alt<-rbind(data.alt,dat)
}
We can use data.table methods. Create a vector of the column names from the original dataset that contain 'col' ('nm1'), then paste 'i.' onto them to create a second vector ('nm2', used for assigning values while joining). Summarise the dataset with the max of the 'col' columns grouped by 'ID' (specifying .SDcols as 'nm1'), create a 'Week' column set to 69, join the two datasets on 'ID' and 'Week', and assign (:=) the values of the 'nm2' columns to the 'nm1' columns
library(data.table)
nm1 <- grep("col", names(df1), value = TRUE)
nm2 <- paste0("i.", nm1)
df2 <- setDT(df1)[, lapply(.SD, max) , ID, .SDcols = nm1][, Week := factor(69)][]
df1[df2, (nm1) := mget(nm2), on = .(ID, Week)]
df1
Update
If we want to replace the max value divided by 3 for 'nm1' columns where 'Week' is 69,
setDT(df1)[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1]
df2 <- df1[, lapply(.SD, function(x) max(x)/3) , ID, .SDcols = nm1][, Week := factor(69)][]
df1[df2, (nm1) := mget(nm2), on = .(ID, Week)]
Update2
If we need to add to the original values, change the last line of code to
df1[df2, (nm1) := Map(`+`, mget(nm1), mget(nm2)), on = .(ID, Week)]
data
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "B", "Z", "Z",
"Z"), Week = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "69"), class = "factor"), col1 = c(2L, 3L, 15L, 5L, 6L,
2L, 1L, 4L, 1L), col2 = c(3L, 4L, 2L, 1L, 10L, 1L, 12L, 5L, 20L
), col3 = c(1L, 1L, 11L, 2L, 3L, 1L, 2L, 3L, 2L)), .Names = c("ID",
"Week", "col1", "col2", "col3"), row.names = c(NA, -9L),
class = "data.frame")
I have data that looks like this:
ID CLASS START END
100 GA 3-Jan-15 1-Feb-15
100 G 1-Feb-15 22-Feb-15
100 GA 28-Feb-15 17-Mar-15
100 G 1-Apr-15 8-Apr-15
100 G 10-Apr-15 18-Apr-15
200 FA 3-Jan-14 1-Feb-14
200 FA 1-Feb-14 22-Feb-14
200 G 28-Feb-14 15-Mar-14
200 F 1-Apr-14 20-Apr-14
Here is the data:
df <- structure(list(ID = c(100L, 100L, 100L, 100L, 100L, 200L, 200L,
200L, 200L), CLASS = structure(c(4L, 3L, 4L, 3L, 3L, 2L, 2L,
3L, 1L), .Label = c("F", "FA", "G", "GA"), class = "factor"),
START = structure(c(9L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 1L), .Label = c("1-Apr-14",
"1-Apr-15", "1-Feb-14", "1-Feb-15", "10-Apr-15", "28-Feb-14",
"28-Feb-15", "3-Jan-14", "3-Jan-15"), class = "factor"),
END = structure(c(2L, 8L, 4L, 9L, 5L, 1L, 7L, 3L, 6L), .Label = c("1-Feb-14",
"1-Feb-15", "15-Mar-14", "17-Mar-15", "18-Apr-15", "20-Apr-14",
"22-Feb-14", "22-Feb-15", "8-Apr-15"), class = "factor")), .Names = c("ID",
"CLASS", "START", "END"), class = "data.frame", row.names = c(NA,
-9L))
I would like to group the data by the ID column and then consolidate any consecutive occurrences of the same value in the CLASS column (sorted by the START date), while selecting the minimum start date and the maximum end date. So for ID number 100, there is only one instance where the "G" class is consecutive, so I would like to consolidate those two rows into a single row with the min(START) and max(END) dates. This is a simple example but in the real data sometimes there are several consecutive rows that need to be consolidated.
I have tried group_by followed by using some kind of ranking but this doesn't seem to do the trick. Any suggestions on how to solve this? Also this is the first time I am posting on SO, so I hope this question makes sense.
Result should look like this:
ID CLASS START END
100 GA 3-Jan-15 1-Feb-15
100 G 1-Feb-15 22-Feb-15
100 GA 28-Feb-15 17-Mar-15
100 G 1-Apr-15 18-Apr-15
200 FA 3-Jan-14 22-Feb-14
200 G 28-Feb-14 15-Mar-14
200 F 1-Apr-14 20-Apr-14
Here's an option, using data.table::rleid to make an id for runs of the same ID and CLASS:
library(dplyr)
# make START and END Date class for easier manipulation
df <- df %>% mutate(START = as.Date(START, '%d-%b-%y'),
END = as.Date(END, '%d-%b-%y'))
# More concise alternative:
# df <- df %>% mutate_each(funs(as.Date(., '%d-%b-%y')), START, END)
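# (Note: mutate_each()/funs() are deprecated in current dplyr; an equivalent would be
#  df <- df %>% mutate(across(c(START, END), ~ as.Date(.x, '%d-%b-%y'))) )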
# group and make rleid as mentioned above
df %>% group_by(ID, CLASS, rleid = data.table::rleid(ID, CLASS)) %>%
# collapse with summarise, replacing START and END with their min and max for each group
summarise(START = min(START), END = max(END)) %>%
# clean up arrangement and get rid of added rleid column
ungroup() %>% arrange(rleid) %>% select(-rleid)
# Source: local data frame [7 x 4]
#
# ID CLASS START END
# (int) (fctr) (date) (date)
# 1 100 GA 2015-01-03 2015-02-01
# 2 100 G 2015-02-01 2015-02-22
# 3 100 GA 2015-02-28 2015-03-17
# 4 100 G 2015-04-01 2015-04-18
# 5 200 FA 2014-01-03 2014-02-22
# 6 200 G 2014-02-28 2014-03-15
# 7 200 F 2014-04-01 2014-04-20
Here's the pure data.table analogue:
library(data.table)
setDT(df)
datecols = c("START","END")
df[, (datecols) := lapply(.SD, as.IDate, format = '%d-%b-%y'), .SDcols = datecols]
df[, .(START = START[1L], END = END[.N]), by=.(ID, CLASS, r = rleid(ID, CLASS))][, r := NULL][]
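If the rows within a run are not guaranteed to be ordered by START, a safer variant (my tweak, mirroring the min/max used in the dplyr version) would be:
df[, .(START = min(START), END = max(END)), by = .(ID, CLASS, r = rleid(ID, CLASS))][, r := NULL][]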