How to group consecutive days based on another category in R

I would like to use the following data frame
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
to order it into groups of consecutive days within each member. So if the same days appear in both member 1 and member 2, I don't want them grouped together as consecutive!
Ultimately I want a new column holding this group number.
This is what I tried, but it didn't work:
y = sort(trainall$time)
trainall$g = cumsum(c(1, abs(y[-length(y)] - y[-1]) > 1))
this is the outcome I want.
trainall
time member g
1 01/01/1951 1 1
2 02/01/1951 1 1
3 03/01/1951 1 1
4 04/01/1951 1 1
5 03/03/1953 1 2
6 04/03/1953 1 2
7 05/03/1953 1 2
8 06/03/1953 1 2
9 02/01/1951 2 3
10 03/01/1951 2 3
11 04/01/1951 2 3
12 05/01/1951 2 3
13 13/03/1953 2 4
14 14/03/1953 2 4
15 15/03/1953 2 4
16 16/03/1953 2 4
17 01/05/1951 3 5
18 02/05/1951 3 5
19 03/05/1951 3 5
20 04/05/1951 3 5
21 04/03/1953 3 6
22 05/03/1953 3 6
23 06/03/1953 3 6
24 07/03/1953 3 6
Ultimately this is the outcome I want. However, here I did it manually, and my actual data frame is much larger (16 members).
Does anyone know how to do this easily?

The use of logical values as integers 0 and 1 and your friend diff can do the trick. Something like this should do it, provided that your data is sorted by member and time.
# Your data
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
# Creating column g
trainall$g <- cumsum(c(1, (abs(diff(trainall$time)) + diff(trainall$member))!=1))
print(trainall)
# time member g
#1 1951-01-01 1 1
#2 1951-01-02 1 1
#3 1951-01-03 1 1
#4 1951-01-04 1 1
#5 1953-03-03 1 2
#6 1953-03-04 1 2
#7 1953-03-05 1 2
#8 1953-03-06 1 2
#9 1951-01-02 2 3
#10 1951-01-03 2 3
#11 1951-01-04 2 3
#12 1951-01-05 2 3
#13 1953-03-13 2 4
#14 1953-03-14 2 4
#15 1953-03-15 2 4
#16 1953-03-16 2 4
#17 1951-05-01 3 5
#18 1951-05-02 3 5
#19 1951-05-03 3 5
#20 1951-05-04 3 5
#21 1953-03-04 3 6
#22 1953-03-05 3 6
#23 1953-03-06 3 6
#24 1953-03-07 3 6
Edit: Added abs() around the time difference. I guess abs() cannot safely be omitted: when the member changes, a negative time difference can combine with the member difference so that the sum is exactly 1 (for example, -1 day with a member jump of 2).
Edit 2: Re. your extra comment, try
trainall$G <- sequence(table(trainall$g))
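One caveat with the sum-based test: it cannot tell a one-day step within a member from a member change that falls on the same date (time difference 0, member difference 1), since both sum to 1. Here is a minimal sketch that checks the two conditions separately, on a made-up toy data frame (`toy`, `new_run` and `g_sum` are illustrative names, not from the question):

```r
# Toy data (hypothetical): member 2 starts on the same day member 1 ends
toy <- data.frame(
  time   = as.Date(c("1951-01-01", "1951-01-02", "1951-01-02",
                     "1951-01-03", "1951-01-10")),
  member = c(1, 1, 2, 2, 2)
)

# A new run starts at row 1, when the member changes,
# or when the step between dates is not exactly one day
new_run <- c(TRUE, diff(toy$member) != 0 | as.numeric(diff(toy$time)) != 1)
toy$g <- cumsum(new_run)
toy$g
# [1] 1 1 2 2 3

# For comparison, the sum-based test merges the two members here
g_sum <- cumsum(c(1, (abs(as.numeric(diff(toy$time))) + diff(toy$member)) != 1))
g_sum
# [1] 1 1 1 1 2
```

Like the original one-liner, this assumes the data is already sorted by member and then by time.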

Here is one option with .GRP from data.table
library(data.table)
setDT(trainall)[, g := .GRP, .(member, grp = cumsum(c(FALSE, diff(time) != 1)))]
trainall
# time member g
# 1: 1951-01-01 1 1
# 2: 1951-01-02 1 1
# 3: 1951-01-03 1 1
# 4: 1951-01-04 1 1
# 5: 1953-03-03 1 2
# 6: 1953-03-04 1 2
# 7: 1953-03-05 1 2
# 8: 1953-03-06 1 2
# 9: 1951-01-02 2 3
#10: 1951-01-03 2 3
#11: 1951-01-04 2 3
#12: 1951-01-05 2 3
#13: 1953-03-13 2 4
#14: 1953-03-14 2 4
#15: 1953-03-15 2 4
#16: 1953-03-16 2 4
#17: 1951-05-01 3 5
#18: 1951-05-02 3 5
#19: 1951-05-03 3 5
#20: 1951-05-04 3 5
#21: 1953-03-04 3 6
#22: 1953-03-05 3 6
#23: 1953-03-06 3 6
#24: 1953-03-07 3 6
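If you would rather stay in base R, the same two-step idea (number the runs within each member, then number the member/run pairs in order of appearance) can be sketched as follows. The toy data and the names `run_in_member` and `key` are assumptions for illustration only:

```r
# Toy data (hypothetical): two members, sorted by member and time
df <- data.frame(
  time   = as.Date("1951-01-01") + c(0, 1, 2, 10, 11, 1, 2, 3),
  member = c(1, 1, 1, 1, 1, 2, 2, 2)
)

# Step 1: run counter within each member (restarts at 1 for every member)
run_in_member <- ave(
  as.numeric(df$time), df$member,
  FUN = function(d) cumsum(c(TRUE, diff(d) != 1))
)

# Step 2: number each (member, run) pair by first appearance
key  <- paste(df$member, run_in_member)
df$g <- match(key, unique(key))
df$g
# [1] 1 1 1 2 2 3 3 3
```

Because the date differences are computed separately for each member, a date shared by two members never links their runs.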

Related

Looping through Columns replicating each column fetched six times

I have this data frame where the column names are from v1 to v292. There are 17 observations. I need to iterate over the columns and replicate each column fetched 6 times.
For example:
v1 v2 v3 v4
1 3 4 6
3 4 3 1
What the output should be
x
1
3
1
3
1
3
1
3
1
3
1
3
3
4
3
4
3
4
3
4
3
4
3
4 .. and so on.
Please help. Thank you in advance.
You could use rep
data.frame(x = unlist(rep(df, each = 6)))
Checking output with each = 2
data.frame(x = unlist(rep(df, each = 2)))
# x
#1 1
#2 3
#3 1
#4 3
#5 3
#6 4
#7 3
#8 4
#9 4
#10 3
#11 4
#12 3
#13 6
#14 1
#15 6
#16 1
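For a self-contained check, here is the same call on the four-column example from the question. The data frame is rebuilt by hand, and `use.names = FALSE` is my addition to keep the result a plain unnamed vector:

```r
# The example data from the question
df <- data.frame(v1 = c(1, 3), v2 = c(3, 4), v3 = c(4, 3), v4 = c(6, 1))

# rep() treats the data frame as a list of columns, so each = 2
# repeats every whole column twice before moving on to the next one
x <- unlist(rep(df, each = 2), use.names = FALSE)
x
# [1] 1 3 1 3 3 4 3 4 4 3 4 3 6 1 6 1
```

With `each = 6` and 17 rows, every column would contribute 6 * 17 = 102 values.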

merge/join two long df in R

I have two dataframes a and b which I would like to combine
a <- data.frame(g=c("1","2","2","3","3","3","4","4","4","4"),h=c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g=c("1","2","3","3","3","4","4","4","4","4"),i=c("1","2","3","2","1","2","3","4","5","6"))
g represents a grouping variable and h and i the columns I want to merge/join
> a
g h
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
9 4 3
10 4 4
> b
g i
1 1 1
2 2 2
3 3 3
4 3 2
5 3 1
6 4 2
7 4 3
8 4 4
9 4 5
10 4 6
a and b should be merged at the level of the grouping variable g, where identical values of h and i should be put together (independent of the order in which they appear in h/i) and non-identical values should appear only once (not in all possible combinations).
a final df would look like:
g h i
1 1 1 1
2 2 1 <NA>
3 2 2 2
4 3 1 1
5 3 2 2
6 3 3 3
7 4 1 <NA>
8 4 2 2
9 4 3 3
10 4 4 4
11 4 <NA> 5
12 4 <NA> 6
I need that df to perform a correlation analysis.
Sounds like a merge on h==i, while retaining i, so create a new variable x to join on, and keep join results from both sides (all=TRUE). With a large hat-tip to @Moody_Mudskipper:
merge(transform(a,x=h), transform(b,x=i), all=TRUE)
# g x h i
#1 1 1 1 1
#2 2 1 1 <NA>
#3 2 2 2 2
#4 3 1 1 1
#5 3 2 2 2
#6 3 3 3 3
#7 4 1 1 <NA>
#8 4 2 2 2
#9 4 3 3 3
#10 4 4 4 4
#11 4 5 <NA> 5
#12 4 6 <NA> 6
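merge() joins on all column names the two inputs share, which here are g and the helper x. A small sketch that spells the keys out and drops the helper afterwards, on a cut-down version of the data (the three-row `a` and `b` below are made up for illustration):

```r
# Cut-down toy versions of a and b (hypothetical values)
a <- data.frame(g = c("1", "2", "2"), h = c("1", "1", "2"),
                stringsAsFactors = FALSE)
b <- data.frame(g = c("1", "2", "2"), i = c("1", "2", "3"),
                stringsAsFactors = FALSE)

# Copy h and i into a shared key x, then full outer join on (g, x)
res <- merge(transform(a, x = h), transform(b, x = i),
             by = c("g", "x"), all = TRUE)

# The helper key is no longer needed once the rows are aligned
res$x <- NULL
res
```

Rows that exist on only one side get NA in the other side's column, which is what the desired output in the question shows.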
We can also do this with dplyr
library(dplyr)
a %>%
  mutate(x = h) %>%
  full_join(mutate(b, x = i), by = c("g", "x")) %>%
  select(-x)

Give unique identifier to consecutive groupings

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
And I would like to add a column that identifies groups of consecutive numbers. For example, 1 to 7 are the first consecutive set, so they get 1; the second consecutive set gets 2, and so on.
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
None of the existing code I found helped me solve this issue.
Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
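To see why this works, it can help to look at the intermediate steps one at a time. A minimal sketch on the UID values from the question (`gaps` and `starts` are names I introduced for illustration):

```r
UID <- c(1:7, 11:13, 15, 17, 20:22)

# Difference between each value and the one before it (length n - 1)
gaps <- diff(UID)

# A new group starts at the first element and wherever the gap exceeds 1
starts <- c(TRUE, gaps > 1)

# Each TRUE bumps the running total by 1, so cumsum() yields the group id
Group <- cumsum(starts)
Group
# [1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
```

This assumes UID is sorted in increasing order; an unsorted vector would need sorting first.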
Or you can also use dplyr as follows :
library(dplyr)
df %>% mutate(ID = cumsum(c(1, diff(UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5
We can also get the difference between the current row and the previous row using the shift function from data.table, take the cumulative sum of the resulting logical vector, and assign it to create the 'Group' column. This may be faster on large data.
library(data.table)
setDT(df1)[, Group := cumsum(UID- shift(UID, fill = UID[1])>1)+1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5

Create a block column based on id and the value of another column in R

Given the following first two columns (id and time_diff), I want to generate the 'block' column
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. The time_diff was computed based on the difference of the previous time and the time value for the row, given the same id. I want to create a block id which is an auto-increment value and increases when a new ID or a time_diff of >10 with the same id is encountered.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can do a one-liner like this to get occurrences satisfying your two conditions:
> new_col = as.vector(cumsum(
na.exclude(
c(F,diff(as.numeric(as.factor(df$id)))) | # change of id OR
df$time_diff > 10 # time_diff greater than 10
)
))
> new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your dataframe with cbind:
> cbind(df, block = c(0,new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your wanted block variable and mine: correcting it is easy and can be done at several different steps, I will leave it to you :)
Another variation of @Jealie's method would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
After learning from Jealie and akrun, I came up with this idea.
mydf %>%
  mutate(group = cumsum(time_diff > 10 | !duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
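The one-liners above lean on the fact that `NA | TRUE` is `TRUE`, so the NA in time_diff at the start of each id is absorbed by the "new id" condition. A base R sketch that handles the NA explicitly instead (`new_id` and `big_gap` are illustrative names, and the data frame is rebuilt from the question's table):

```r
# The id and time_diff columns from the question
test <- data.frame(
  id        = rep(c("a", "b", "c"), times = c(6, 7, 3)),
  time_diff = c(NA, 1, 1, 1, 3, 3, NA, 11, 1, 1, 1, 12, 1, NA, 4, 7)
)

# First row of each id (assumes rows of the same id are contiguous)
new_id <- !duplicated(test$id)

# Gap larger than 10, treating a missing time_diff as "no gap"
big_gap <- !is.na(test$time_diff) & test$time_diff > 10

test$block <- cumsum(new_id | big_gap)
test$block
# [1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
```

Spelling out the NA handling means a stray NA in the middle of an id would not turn the rest of the column into NA.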
Here is an approach using dplyr:
library(dplyr)
set.seed(999)
test <- data.frame(
  id = rep(letters[1:4], each = 3),
  time_diff = sample(4:15)
)
test %>%
  mutate(
    # factor() first: since R 4.0 id is a character column,
    # and as.integer() on it would return NA
    id_num = as.integer(factor(id)),
    b = id_num - lag(id_num),
    more10 = time_diff > 10,
    increment = pmax(b, more10, na.rm = TRUE),
    increment = ifelse(row_number() == 1, 1, increment),
    block = cumsum(increment)
  ) %>%
  select(id, time_diff, block)
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block = c(1)
for (i in 2:nrow(df))
  block[i] = ifelse(df$time_diff[i] > 10 || df$id[i] != df$id[i-1],
                    block[i-1] + 1,
                    block[i-1])
df$block = block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5

Repeating sets of rows according to the number of rows by column in R with data.table

Currently in R, I am trying to do the following with a data.table:
Suppose my data looks like:
Class Person ID Index
A 1 3
A 2 3
A 5 3
B 7 2
B 12 2
C 18 1
D 25 2
D 44 2
Here, the class refers to the class a person belongs to. The Person ID variable represents a unique identifier of a person. Finally, the Index tells us how many people are in each class.
From this, I would like to create a new data table like so:
Class Person ID Index
A 1 3
A 2 3
A 5 3
A 1 3
A 2 3
A 5 3
A 1 3
A 2 3
A 5 3
B 7 2
B 12 2
B 7 2
B 12 2
C 18 1
D 25 2
D 44 2
D 25 2
D 44 2
where we repeated each set of persons by class based on the index variable. Hence, we would repeat the class A by 3 times because the index says 3.
So far, my code looks like:
setDT(data)[, list(Class = rep(Person ID, seq_len(.N)), Person ID = sequence(seq_len(.N)), by = Index]
However, I am not getting the correct result and I feel like there is a simpler way to do this. Would anyone have any ideas? Thank you!
If that particular order is important to you, then perhaps something like this should work:
setDT(data)[, list(PersonID, sequence(rep(.N, Index))), by = list(Class, Index)]
# Class Index PersonID V2
# 1: A 3 1 1
# 2: A 3 2 2
# 3: A 3 5 3
# 4: A 3 1 1
# 5: A 3 2 2
# 6: A 3 5 3
# 7: A 3 1 1
# 8: A 3 2 2
# 9: A 3 5 3
# 10: B 2 7 1
# 11: B 2 12 2
# 12: B 2 7 1
# 13: B 2 12 2
# 14: C 1 18 1
# 15: D 2 25 1
# 16: D 2 44 2
# 17: D 2 25 1
# 18: D 2 44 2
If the order is not important, perhaps:
setDT(data)[rep(1:nrow(data), Index)]
Here is a way using dplyr in case you wanted to try
library(dplyr)
data %>%
group_by(Class) %>%
do(data.frame(.[sequence(.$Index[row(.)[,1]]),]))
which gives the output
# Class Person.ID Index
#1 A 1 3
#2 A 2 3
#3 A 5 3
#4 A 1 3
#5 A 2 3
#6 A 5 3
#7 A 1 3
#8 A 2 3
#9 A 5 3
#10 B 7 2
#11 B 12 2
#12 B 7 2
#13 B 12 2
#14 C 18 1
#15 D 25 2
#16 D 44 2
#17 D 25 2
#18 D 44 2
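For comparison, here is a base R sketch that keeps the set-by-set order from the question without any packages. It assumes Index is constant within each Class; `rows_by_class` and `idx` are names I made up, and the Person ID column is written as PersonID:

```r
data <- data.frame(
  Class    = c("A", "A", "A", "B", "B", "C", "D", "D"),
  PersonID = c(1, 2, 5, 7, 12, 18, 25, 44),
  Index    = c(3, 3, 3, 2, 2, 1, 2, 2)
)

# Row numbers belonging to each class
rows_by_class <- split(seq_len(nrow(data)), data$Class)

# Repeat each class's block of rows as a whole, Index times
idx <- unlist(
  lapply(rows_by_class, function(rows) rep(rows, times = data$Index[rows[1]])),
  use.names = FALSE
)

result <- data[idx, ]
rownames(result) <- NULL
result
```

Using `times` on the whole block gives 1, 2, 5, 1, 2, 5, ... for class A, whereas `rep(1:nrow(data), Index)` repeats row by row and gives 1, 1, 1, 2, 2, 2, ...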
