Repeating sets of rows according to the number of rows by column in R with data.table - r

In R, I am trying to do the following with a data.table.
Suppose my data looks like:
Class Person ID Index
A 1 3
A 2 3
A 5 3
B 7 2
B 12 2
C 18 1
D 25 2
D 44 2
Here, the class refers to the class a person belongs to. The Person ID variable represents a unique identifier of a person. Finally, the Index tells us how many people are in each class.
From this, I would like to create a new data table as so:
Class Person ID Index
A 1 3
A 2 3
A 5 3
A 1 3
A 2 3
A 5 3
A 1 3
A 2 3
A 5 3
B 7 2
B 12 2
B 7 2
B 12 2
C 18 1
D 25 2
D 44 2
D 25 2
D 44 2
where each set of persons is repeated by class according to the Index variable. For example, the set of rows for class A is repeated 3 times because its Index is 3.
So far, my code looks like:
setDT(data)[, list(Class = rep(Person ID, seq_len(.N)), Person ID = sequence(seq_len(.N)), by = Index]
However, I am not getting the correct result and I feel like there is a simpler way to do this. Would anyone have any ideas? Thank you!

If that particular order is important to you, then perhaps something like this should work:
setDT(data)[, list(PersonID, sequence(rep(.N, Index))), by = list(Class, Index)]
# Class Index PersonID V2
# 1: A 3 1 1
# 2: A 3 2 2
# 3: A 3 5 3
# 4: A 3 1 1
# 5: A 3 2 2
# 6: A 3 5 3
# 7: A 3 1 1
# 8: A 3 2 2
# 9: A 3 5 3
# 10: B 2 7 1
# 11: B 2 12 2
# 12: B 2 7 1
# 13: B 2 12 2
# 14: C 1 18 1
# 15: D 2 25 1
# 16: D 2 44 2
# 17: D 2 25 1
# 18: D 2 44 2
If the order is not important, perhaps:
setDT(data)[rep(1:nrow(data), Index)]
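Recent data.table releases are stricter about recycling columns of different lengths in j, so the grouped call above may raise an error there. A minimal sketch that makes the repetition explicit with rep() is shown below; the column name PersonID (without a space) and the object name result are assumptions made for illustration.

```r
library(data.table)

# Example data from the question (PersonID written without a space)
data <- data.table(
  Class    = c("A", "A", "A", "B", "B", "C", "D", "D"),
  PersonID = c(1, 2, 5, 7, 12, 18, 25, 44),
  Index    = c(3, 3, 3, 2, 2, 1, 2, 2)
)

# Within each Class, repeat the whole block of row positions Index[1] times
result <- data[, .SD[rep(seq_len(.N), times = Index[1])], by = Class]
print(result)
```

Because the row positions are repeated as a block, the order within each class is kept (1, 2, 5, 1, 2, 5, ... for class A).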

Here is a way using dplyr, in case you want to try it:
library(dplyr)
data %>%
group_by(Class) %>%
do(data.frame(.[sequence(.$Index[row(.)[,1]]),]))
which gives the output
# Class Person.ID Index
#1 A 1 3
#2 A 2 3
#3 A 5 3
#4 A 1 3
#5 A 2 3
#6 A 5 3
#7 A 1 3
#8 A 2 3
#9 A 5 3
#10 B 7 2
#11 B 12 2
#12 B 7 2
#13 B 12 2
#14 C 18 1
#15 D 25 2
#16 D 44 2
#17 D 25 2
#18 D 44 2
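Since do() is superseded in current dplyr, the same idea can probably be written with slice(). This is only a sketch, assuming the column is named PersonID and that Index is constant within each Class; result is an illustrative name.

```r
library(dplyr)

# Example data from the question (PersonID written without a space)
data <- data.frame(
  Class    = c("A", "A", "A", "B", "B", "C", "D", "D"),
  PersonID = c(1, 2, 5, 7, 12, 18, 25, 44),
  Index    = c(3, 3, 3, 2, 2, 1, 2, 2)
)

result <- data %>%
  group_by(Class) %>%
  # repeat the group's row positions as a block, Index[1] times
  slice(rep(seq_len(n()), times = Index[1])) %>%
  ungroup()

print(result, n = 18)
```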

Related

How to add a column with repeating but changing sequence?

I'm trying to add a column with repeating sequence but one that changes for each group. In the example data, the group is the id column.
data <- tibble::expand_grid(id = 1:12, condition = c("a", "b", "c"))
data
id condition
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
... and so on
I'd like to add a column called order to repeat various combinations like 1 2 3 2 3 1 3 1 2 1 3 2 2 1 3 3 2 1 for each id.
In the end, the desired output will look like this
id condition order
1 a 1
1 b 2
1 c 3
2 a 2
2 b 3
2 c 1
3 a 3
3 b 1
3 c 2
... and so on
I'm looking for a simple mutate solution or base R solution. I tried generating a list of combinations but I'm not sure how to create a variable from that.
You can use perms from package pracma to generate all permutations, e.g.,
data %>%
cbind(order = c(t(pracma::perms(1:3))))
which gives
id condition order
1 1 a 3
2 1 b 2
3 1 c 1
4 2 a 3
5 2 b 1
6 2 c 2
7 3 a 2
8 3 b 3
9 3 c 1
10 4 a 2
11 4 b 1
12 4 c 3
13 5 a 1
14 5 b 2
15 5 c 3
16 6 a 1
17 6 b 3
18 6 c 2
19 7 a 3
20 7 b 2
21 7 c 1
22 8 a 3
23 8 b 1
24 8 c 2
25 9 a 2
26 9 b 3
27 9 c 1
28 10 a 2
29 10 b 1
30 10 c 3
31 11 a 1
32 11 b 2
33 11 c 3
34 12 a 1
35 12 b 3
36 12 c 2
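If the exact order of permutations listed in the question matters, a base R sketch can spell the sequence out and recycle it over the rows. It assumes the data is sorted by id with three conditions per id; expand.grid is used here only to rebuild the example data without extra packages.

```r
# Rebuild the example data in base R: condition varies fastest within id
data <- expand.grid(condition = c("a", "b", "c"), id = 1:12,
                    stringsAsFactors = FALSE)[, c("id", "condition")]

# The six orderings from the question, written out one after another
orders <- c(1, 2, 3,  2, 3, 1,  3, 1, 2,  1, 3, 2,  2, 1, 3,  3, 2, 1)

# Recycle the 18 values over all rows, so ids 7 to 12 reuse the same cycle
data$order <- rep(orders, length.out = nrow(data))
head(data, 9)
```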

How to group consecutive days based on another category in R

I would like to use the following data frame
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
I want to group the rows into runs of consecutive days within each member. If the same days appear for members 1 and 2, I don't want them grouped together as consecutive.
Ultimately, I want a new column that identifies each group.
This is what I tried, but it didn't work:
y = sort(trainall$time)
trainall$g = cumsum(c(1, abs(y[-length(y)] - y[-1]) > 1))
This is the outcome I want:
trainall
time member g
1 01/01/1951 1 1
2 02/01/1951 1 1
3 03/01/1951 1 1
4 04/01/1951 1 1
5 03/03/1953 1 2
6 04/03/1953 1 2
7 05/03/1953 1 2
8 06/03/1953 1 2
9 02/01/1951 2 3
10 03/01/1951 2 3
11 04/01/1951 2 3
12 05/01/1951 2 3
13 13/03/1953 2 4
14 14/03/1953 2 4
15 15/03/1953 2 4
16 16/03/1953 2 4
17 01/05/1951 3 5
18 02/05/1951 3 5
19 03/05/1951 3 5
20 04/05/1951 3 5
21 04/03/1953 3 6
22 05/03/1953 3 6
23 06/03/1953 3 6
24 07/03/1953 3 6
Ultimately, this is the outcome I want. However, here I did it manually, and my actual data frame is much larger (16 members).
Does anyone know how to do this easily?
The use of logical values as integers 0 and 1 and your friend diff can do the trick. Something like this should do it, provided that your data is sorted by member and time.
# Your data
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
# Creating column g
trainall$g <- cumsum(c(1, (abs(diff(trainall$time)) + diff(trainall$member))!=1))
print(trainall)
# time member g
#1 1951-01-01 1 1
#2 1951-01-02 1 1
#3 1951-01-03 1 1
#4 1951-01-04 1 1
#5 1953-03-03 1 2
#6 1953-03-04 1 2
#7 1953-03-05 1 2
#8 1953-03-06 1 2
#9 1951-01-02 2 3
#10 1951-01-03 2 3
#11 1951-01-04 2 3
#12 1951-01-05 2 3
#13 1953-03-13 2 4
#14 1953-03-14 2 4
#15 1953-03-15 2 4
#16 1953-03-16 2 4
#17 1951-05-01 3 5
#18 1951-05-02 3 5
#19 1951-05-03 3 5
#20 1951-05-04 3 5
#21 1953-03-04 3 6
#22 1953-03-05 3 6
#23 1953-03-06 3 6
#24 1953-03-07 3 6
Edit: Added abs() around the time difference. I guess the abs() cannot strictly be omitted, as you could have a time difference of -2 days when the member changes, which causes the sum to be 1.
Edit 2: Re. your extra comment, try
trainall$G <- sequence(table(trainall$g))
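The arithmetic trick above folds two conditions into one sum. A sketch that keeps them as separate logical tests may be easier to check; it assumes, like the answer, that the data is sorted by member and time. The names new_member and gap are illustrative.

```r
# Rebuild the example data: six runs of four consecutive days
starts <- as.Date(c("1951-01-01", "1953-03-03", "1951-01-02",
                    "1953-03-13", "1951-05-01", "1953-03-04"))
trainall <- data.frame(time   = rep(starts, each = 4) + 0:3,
                       member = rep(1:3, each = 8))

# A new group starts when the member changes ...
new_member <- c(TRUE, diff(trainall$member) != 0)
# ... or when the step from the previous row is not exactly one day
gap <- c(TRUE, as.numeric(diff(trainall$time)) != 1)

trainall$g <- cumsum(new_member | gap)
print(trainall)
```

With the two tests separated, a change of member always starts a new group, whatever the date difference happens to be.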
Here is one option with .GRP from data.table
library(data.table)
setDT(trainall)[, g := .GRP, .(member, grp = cumsum(c(FALSE, diff(time) != 1)))]
trainall
# time member g
# 1: 1951-01-01 1 1
# 2: 1951-01-02 1 1
# 3: 1951-01-03 1 1
# 4: 1951-01-04 1 1
# 5: 1953-03-03 1 2
# 6: 1953-03-04 1 2
# 7: 1953-03-05 1 2
# 8: 1953-03-06 1 2
# 9: 1951-01-02 2 3
#10: 1951-01-03 2 3
#11: 1951-01-04 2 3
#12: 1951-01-05 2 3
#13: 1953-03-13 2 4
#14: 1953-03-14 2 4
#15: 1953-03-15 2 4
#16: 1953-03-16 2 4
#17: 1951-05-01 3 5
#18: 1951-05-02 3 5
#19: 1951-05-03 3 5
#20: 1951-05-04 3 5
#21: 1953-03-04 3 6
#22: 1953-03-05 3 6
#23: 1953-03-06 3 6
#24: 1953-03-07 3 6

How to compute a new variable based on the number of days since a particular type of record

I'm trying to create a variable that shows the number of days since a particular event occurred. This is a follow up to this previous question, using the same data.
The data looks like this (note dates are in DD-MM-YYYY format):
ID date drug score
A 28/08/2016 2 3
A 29/08/2016 1 4
A 30/08/2016 2 4
A 2/09/2016 2 4
A 3/09/2016 1 4
A 4/09/2016 2 4
B 8/08/2016 1 3
B 9/08/2016 2 4
B 10/08/2016 2 3
B 11/08/2016 1 3
C 30/11/2016 2 4
C 2/12/2016 1 5
C 3/12/2016 2 1
C 5/12/2016 1 4
C 6/12/2016 2 4
C 8/12/2016 1 2
C 9/12/2016 1 2
For 'drug': 1=drug taken, 2=no drug taken.
Each time the value of drug is 1, if that ID has a previous record that is also drug==1, then I need to generate a new value 'lagtime' that shows the number of days (not the number of rows!) since the previous time the drug was taken.
So the output I am looking for is:
ID date drug score lagtime
A 28/08/2016 2 3
A 29/08/2016 1 4
A 30/08/2016 2 4
A 2/09/2016 2 4
A 3/09/2016 1 4 5
A 4/09/2016 2 4
B 8/08/2016 1 3
B 9/08/2016 2 4
B 10/08/2016 2 3
B 11/08/2016 1 3 3
C 30/11/2016 2 4
C 2/12/2016 1 5
C 3/12/2016 2 1
C 5/12/2016 1 4 3
C 6/12/2016 2 4
C 8/12/2016 1 2 3
C 9/12/2016 1 2 1
So I need a way to generate (mutate?) this lagtime score that is calculated as the date for each drug==1 record, minus the date of the previous drug==1 record, grouped by ID.
This has me completely bamboozled.
Here's code for the example data:
data<-data.frame(ID=c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
date=as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016","4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016","30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016","8/12/2016","9/12/2016"),format= "%d/%m/%Y"),
drug=c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
score=c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2))
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)) and, grouped by 'ID', restrict to the rows selected in i (drug == 1). Take the difference of 'date' (diff(date)) and prepend NA, because the diff output is one element shorter than the original vector. Then convert to integer and assign (:=) to create 'lagtime'. All rows not selected in i are left as NA by default.
library(data.table)
setDT(data)[drug==1, lagtime := as.integer(c(NA, diff(date))), ID]
data
# ID date drug score lagtime
# 1: A 2016-08-28 2 3 NA
# 2: A 2016-08-29 1 4 NA
# 3: A 2016-08-30 2 4 NA
# 4: A 2016-09-02 2 4 NA
# 5: A 2016-09-03 1 4 5
# 6: A 2016-09-04 2 4 NA
# 7: B 2016-08-08 1 3 NA
# 8: B 2016-08-09 2 4 NA
# 9: B 2016-08-10 2 3 NA
#10: B 2016-08-11 1 3 3
#11: C 2016-11-30 2 4 NA
#12: C 2016-12-02 1 5 NA
#13: C 2016-12-03 2 1 NA
#14: C 2016-12-05 1 4 3
#15: C 2016-12-06 2 4 NA
#16: C 2016-12-08 1 2 3
#17: C 2016-12-09 1 2 1
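For readers not using data.table, roughly the same computation can be sketched in base R with ave(). This assumes the rows are already sorted by date within each ID; the helper vector taken is an illustrative name.

```r
data <- data.frame(
  ID   = c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
  date = as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016",
                   "4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016",
                   "30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016",
                   "8/12/2016","9/12/2016"), format = "%d/%m/%Y"),
  drug  = c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
  score = c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2)
)

taken <- data$drug == 1          # rows where the drug was taken
data$lagtime <- NA_real_         # every other row stays NA

# Within each ID, days since the previous drug == 1 record (NA for the first)
data$lagtime[taken] <- ave(as.numeric(data$date[taken]), data$ID[taken],
                           FUN = function(d) c(NA, diff(d)))
print(data)
```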

Give unique identifier to consecutive groupings

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
I would like to add a column that identifies groups of consecutive numbers. For example, 1 to 7 form the first consecutive run, so they get 1; the second consecutive run gets 2, and so on.
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
None of the existing code I found helped me solve this issue.
Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
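To see why this works, it may help to look at the intermediate steps separately. The sketch below assumes UID is sorted in ascending order; gaps, starts and group are illustrative names.

```r
UID <- c(1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 15, 17, 20, 21, 22)

# Step 1: distance from each value to the previous one (length n - 1)
gaps <- diff(UID)

# Step 2: a run starts at the first value and wherever the gap exceeds 1
starts <- c(TRUE, gaps > 1)

# Step 3: counting the starts so far gives the run number
group <- cumsum(starts)

data.frame(UID, starts, group)
```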
Or you can also use dplyr as follows :
library(dplyr)
df %>% mutate(ID = cumsum(c(1, diff(UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5
We can also get the difference between the current row and the previous row using the shift function from data.table, take the cumulative sum of the resulting logical vector, and assign it to create the 'Group' column. This may be faster on large data.
library(data.table)
setDT(df1)[, Group := cumsum(UID- shift(UID, fill = UID[1])>1)+1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5

Create a block column based on id and the value of another column in R

Given the following first two columns (id and time_diff), I want to generate the 'block' column:
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. The time_diff was computed based on the difference of the previous time and the time value for the row, given the same id. I want to create a block id which is an auto-increment value and increases when a new ID or a time_diff of >10 with the same id is encountered.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can do a one-liner like this to get occurrences satisfying your two conditions:
new_col = as.vector(cumsum(
  na.exclude(
    c(FALSE, diff(as.numeric(as.factor(df$id)))) | # change of id OR
    df$time_diff > 10 # time_diff greater than 10
  )
))
new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your dataframe with cbind:
cbind(df, block = c(0, new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your wanted block variable and mine. Correcting it is easy and can be done at several different steps; I will leave it to you :)
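One possible way to remove the offset is to treat the first row of every id as a block start and to handle the NA values explicitly instead of dropping them. This is a sketch under the question's assumption that the data is sorted by id; new_id and long_gap are illustrative names.

```r
df <- data.frame(
  id        = rep(c("a", "b", "c"), times = c(6, 7, 3)),
  time_diff = c(NA, 1, 1, 1, 3, 3, NA, 11, 1, 1, 1, 12, 1, NA, 4, 7)
)

# TRUE on the first row of each id (the data is sorted by id)
new_id <- !duplicated(df$id)

# TRUE where the gap exceeds 10; an NA gap counts as "no long gap"
long_gap <- !is.na(df$time_diff) & df$time_diff > 10

# Each TRUE opens a new block, so the counter starts at 1 on the first row
df$block <- cumsum(new_id | long_gap)
print(df)
```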
Another variation of @Jealie's method would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
After learning from Jealie and akrun, I came up with this idea.
mydf %>%
mutate(group = cumsum(time_diff > 10 |!duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
Here is an approach using dplyr:
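library(dplyr)
set.seed(999)
test <- data.frame(
  id = rep(letters[1:4], each = 3),
  time_diff = sample(4:15)
)
test %>%
  mutate(
    # convert via factor(): since R 4.0, data.frame() keeps id as character
    id_num = as.integer(factor(id)),
    b = id_num - lag(id_num),   # 1 where the id changes, NA on the first row
    more10 = time_diff > 10,    # TRUE where the gap exceeds 10
    increment = pmax(b, more10, na.rm = TRUE),
    increment = ifelse(row_number() == 1, 1, increment),  # start counting at 1
    block = cumsum(increment)
  ) %>%
  select(id, time_diff, block)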
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block= c(1)
for(i in 2:nrow(df))
block[i] = ifelse(df$time_diff[i]>10 || df$id[i]!=df$id[i-1],
block[i-1]+1,
block[i-1])
df$block = block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
