Give unique identifier to consecutive groupings - r

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
And I would like to add a column that identifies when there are groupings of consecutive numbers, for example, 1 to 7 are first consecutive , then they get 1 , the second consecutive set will get 2 etc .
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
none of the existed code helped me to solved this issue

Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
Or you can also use dplyr as follows :
library(dplyr)
df %>% mutate(ID=cumsum(c(1, diff(df$UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5

We can also get the difference between the current row and the previous row using the shift function from data.table, get the cumulative sum of the logical vector and assign it to create the 'Group' column. This will be faster.
library(data.table)
setDT(df1)[, Group := cumsum(UID- shift(UID, fill = UID[1])>1)+1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5

Related

How to add a column with repeating but changing sequence?

I'm trying to add a column with repeating sequence but one that changes for each group. In the example data, the group is the id column.
data <- tibble::expand_grid(id = 1:12, condition = c("a", "b", "c"))
data
id condition
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
... and so on
I'd like to add a column called order to repeat various combinations like 1 2 3 2 3 1 3 1 2 1 3 2 2 1 3 3 2 1 for each id.
In the end, the desired output will look like this
id condition order
1 a 1
1 b 2
1 c 3
2 a 2
2 b 3
2 c 1
3 a 3
3 b 1
3 c 2
... and so on
I'm looking for a simple mutate solution or base R solution. I tried generating a list of combinations but I'm not sure how to create a variable from that.
You can use perms from package pracma to generate all permutations, e.g.,
data %>%
cbind(order = c(t(pracma::perms(1:3))))
which gives
id condition order
1 1 a 3
2 1 b 2
3 1 c 1
4 2 a 3
5 2 b 1
6 2 c 2
7 3 a 2
8 3 b 3
9 3 c 1
10 4 a 2
11 4 b 1
12 4 c 3
13 5 a 1
14 5 b 2
15 5 c 3
16 6 a 1
17 6 b 3
18 6 c 2
19 7 a 3
20 7 b 2
21 7 c 1
22 8 a 3
23 8 b 1
24 8 c 2
25 9 a 2
26 9 b 3
27 9 c 1
28 10 a 2
29 10 b 1
30 10 c 3
31 11 a 1
32 11 b 2
33 11 c 3
34 12 a 1
35 12 b 3
36 12 c 2

Split columns to few variables and move corresponding value to new column

I have a data frame like this (with many more rows):
id act_l_n pas_l_n act_q_p pas_q_p act_l_p pas_l_p act_q_n pas_q_n
1 14 8 14 10 21 11 21 11
2 19 9 11 17 22 11 20 11
Every column name contains information about 3 variables separated by '_' (each has 2 levels named act/pas, l/q, n/p). Values are scores corresponding to each combination of variables (i.e. 1 of 8 conditions).
I need to move 3 variables to 3 separate columns, mark their levels by digits, and move corresponding value to separate column called "score". So from 1st row of current data frame I'd get something like this:
id score actpas lq pn
1 14 1 1 1
1 8 2 1 1
1 14 1 2 2
1 10 2 2 2
1 21 1 1 2
1 11 2 1 2
1 21 1 2 1
1 11 2 2 1
I've tried wrangling this with dplyr using gather and separate functions, but I can't really get what I need. Help with dplyr would be most appriciated!
If I understand well:
df<-read.table(textConnection(
"id,act_l_n,pas_l_n,act_q_p,pas_q_p,act_l_p,pas_l_p,act_q_n,pas_q_n
1,14,8,14,10,21,11,21,11
2,19,9,11,17,22,11,20,11"),
header=TRUE,sep=",")
library(tidyr)
library(dplyr)
gather(df,k,score,-id) %>% mutate(v1=1+as.integer(substr(k,1,3)=="pas")
,v2=1+as.integer(substr(k,5,5)=="q")
,v3=1+as.integer(substr(k,7,7)=="p")) %>%
select(-2) %>% arrange(id)
# id score v1 v2 v3
#1 1 14 1 1 1
#2 1 8 2 1 1
#3 1 14 1 2 2
#4 1 10 2 2 2
#5 1 21 1 1 2
#6 1 11 2 1 2
#7 1 21 1 2 1
#8 1 11 2 2 1
#9 2 19 1 1 1
#10 2 9 2 1 1
#11 2 11 1 2 2
#12 2 17 2 2 2
#13 2 22 1 1 2
#14 2 11 2 1 2
#15 2 20 1 2 1
#16 2 11 2 2 1

how to group consecutive days based on another category in R

I would like to use the following data frame
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
to order it by group of consecutive days based on the members. therefore if the same days are in member 2 and 1 I dont want them grouped together as consecutive!
ultimately I want a new column making this group
this is what I tried but it didnt work
y = sort(trainall$time)
trainall$g = cumsum(c(1, abs(y[-length(y)] - y[-1]) > 1))
this is the outcome I want.
trainall
time member g
1 01/01/1951 1 1
2 02/01/1951 1 1
3 03/01/1951 1 1
4 04/01/1951 1 1
5 03/03/1953 1 2
6 04/03/1953 1 2
7 05/03/1953 1 2
8 06/03/1953 1 2
9 02/01/1951 2 3
10 03/01/1951 2 3
11 04/01/1951 2 3
12 05/01/1951 2 3
13 13/03/1953 2 4
14 14/03/1953 2 4
15 15/03/1953 2 4
16 16/03/1953 2 4
17 01/05/1951 3 5
18 02/05/1951 3 5
19 03/05/1951 3 5
20 04/05/1951 3 5
21 04/03/1953 3 6
22 05/03/1953 3 6
23 06/03/1953 3 6
24 07/03/1953 3 6
ultimately this is the outcome I want. however, here I did it manually and my actual data frame is much much larger (16 members)
anyone know how to easily do this?
The use of logical values as integers 0 and 1 and your friend diff can do the trick. Something like this should do it, provided that your data is sorted by member and time.
# Your data
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
# Creating column g
trainall$g <- cumsum(c(1, (abs(diff(trainall$time)) + diff(trainall$member))!=1))
print(trainall)
# time member g
#1 1951-01-01 1 1
#2 1951-01-02 1 1
#3 1951-01-03 1 1
#4 1951-01-04 1 1
#5 1953-03-03 1 2
#6 1953-03-04 1 2
#7 1953-03-05 1 2
#8 1953-03-06 1 2
#9 1951-01-02 2 3
#10 1951-01-03 2 3
#11 1951-01-04 2 3
#12 1951-01-05 2 3
#13 1953-03-13 2 4
#14 1953-03-14 2 4
#15 1953-03-15 2 4
#16 1953-03-16 2 4
#17 1951-05-01 3 5
#18 1951-05-02 3 5
#19 1951-05-03 3 5
#20 1951-05-04 3 5
#21 1953-03-04 3 6
#22 1953-03-05 3 6
#23 1953-03-06 3 6
#24 1953-03-07 3 6
Edit: Added abs() around the time difference. I guess the abs cannot strictly be omitted as you could have a time difference of -2 days when the member changes, which cause the sum to be 1.
Edit 2: Re. your extra comment, try
trainall$G <- sequence(table(trainall$g))
Here is one option with .GRP from data.table
library(data.table)
setDT(trainall)[, g := .GRP, .(member, grp = cumsum(c(FALSE, diff(time) != 1)))]
trainall
# time member g
# 1: 1951-01-01 1 1
# 2: 1951-01-02 1 1
# 3: 1951-01-03 1 1
# 4: 1951-01-04 1 1
# 5: 1953-03-03 1 2
# 6: 1953-03-04 1 2
# 7: 1953-03-05 1 2
# 8: 1953-03-06 1 2
# 9: 1951-01-02 2 3
#10: 1951-01-03 2 3
#11: 1951-01-04 2 3
#12: 1951-01-05 2 3
#13: 1953-03-13 2 4
#14: 1953-03-14 2 4
#15: 1953-03-15 2 4
#16: 1953-03-16 2 4
#17: 1951-05-01 3 5
#18: 1951-05-02 3 5
#19: 1951-05-03 3 5
#20: 1951-05-04 3 5
#21: 1953-03-04 3 6
#22: 1953-03-05 3 6
#23: 1953-03-06 3 6
#24: 1953-03-07 3 6

R: Separate data into combinations of two columns

I have some data where each id is measured by different types which can be have different values type_val. The measured value is val. A small dummy data is like this:
df <- data.frame(id=rep(letters[1:2],6),
type=c(rep('t1',6), rep('t2',6)),
type_val=rep(c(1,1,2,2,3,3),2),
val=1:12)
Then df is:
id type type_val val
1 a t1 1 1
2 b t1 1 2
3 a t1 2 3
4 b t1 2 4
5 a t1 3 5
6 b t1 3 6
7 a t2 1 7
8 b t2 1 8
9 a t2 2 9
10 b t2 2 10
11 a t2 3 11
12 b t2 3 12
I need to spread/cast data so that all combinations of type and type_val for each id are row-wise. I think this must be a job for pkgs reshape2 or tidyr but I have completely failed to generate anything other than errors.
The outcome data structure - somewhat redundant - would be something like this (hope I got it right!) where pairs of type (as given by combinations of the type_val) are columns type_t1 and type_t2 , and their associated values (val in df) are val_t1 and val_t2 - columns names are of cause arbitrary :
id type_t1 type_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
UPDATE
Note that (#Sotos)
> spread(df, type, val)
id type_val t1 t2
1 a 1 1 7
2 a 2 3 9
3 a 3 5 11
4 b 1 2 8
5 b 2 4 10
6 b 3 6 12
is not the desired output - it fails to deliver the wide format defined by combinations of type and type_val in df.
how about this:
df1=df[df$type=="t1",]
df2=df[df$type=="t2",]
DF=merge(df1,df2,by="id")
DF=DF[,-c(2,5)]
colnames(DF)<-c("id", "type_t1", "val_t1","type_t2", "val_t2")
Here is something more generic that will work with an arbitrary number of unique type:
library(dplyr)
# This function takes a list of dataframes (.data) and merges them by ID
reduce_merge <- function(.data, ID) {
return(Reduce(function(x, y) merge(x, y, by = ID), .data))
}
# This function renames the cols columns in .data by appending _identifier
batch_rename <- function(.data, cols, identifier, sep = '_') {
return(plyr::rename(.data, sapply(cols, function(x){
x = paste(x, .data[1, identifier], sep = sep)
})))
}
# This function creates a list of subsetted dataframes
# (subsetted by values of key),
# uses batch_rename() to give each dataframe more informative column names,
# merges them together, and returns the columns you'd like in a sensible order
multi_spread <- function(.data, grp, key, vals) {
.data %>%
plyr::dlply(key, subset) %>%
lapply(batch_rename, vals, key) %>%
reduce_merge(grp) %>%
select(-starts_with(paste0(key, '.'))) %>%
select(id, sort(setdiff(colnames(.), c(grp, key, vals))))
}
# Your example
df <- data.frame(id=rep(letters[1:2],6),
type=c(rep('t1',6), rep('t2',6)),
type_val=rep(c(1,1,2,2,3,3),2),
val=1:12)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
# An example with three unique values of 'type'
df <- data.frame(id = rep(letters[1:2], 9),
type = c(rep('t1', 6), rep('t2', 6), rep('t3', 6)),
type_val = rep(c(1, 1, 2, 2, 3, 3), 3),
val = 1:18)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 type_val_t3 val_t1 val_t2 val_t3
1 a 1 1 1 1 7 13
2 a 1 1 2 1 7 15
3 a 1 1 3 1 7 17
4 a 1 2 1 1 9 13
5 a 1 2 2 1 9 15
6 a 1 2 3 1 9 17
7 a 1 3 1 1 11 13
8 a 1 3 2 1 11 15
9 a 1 3 3 1 11 17
10 a 2 1 1 3 7 13
11 a 2 1 2 3 7 15
12 a 2 1 3 3 7 17
13 a 2 2 1 3 9 13
14 a 2 2 2 3 9 15
15 a 2 2 3 3 9 17
16 a 2 3 1 3 11 13
17 a 2 3 2 3 11 15
18 a 2 3 3 3 11 17
19 a 3 1 1 5 7 13
20 a 3 1 2 5 7 15
21 a 3 1 3 5 7 17
22 a 3 2 1 5 9 13
23 a 3 2 2 5 9 15
24 a 3 2 3 5 9 17
25 a 3 3 1 5 11 13
26 a 3 3 2 5 11 15
27 a 3 3 3 5 11 17
28 b 1 1 1 2 8 14
29 b 1 1 2 2 8 16
30 b 1 1 3 2 8 18
31 b 1 2 1 2 10 14
32 b 1 2 2 2 10 16
33 b 1 2 3 2 10 18
34 b 1 3 1 2 12 14
35 b 1 3 2 2 12 16
36 b 1 3 3 2 12 18
37 b 2 1 1 4 8 14
38 b 2 1 2 4 8 16
39 b 2 1 3 4 8 18
40 b 2 2 1 4 10 14
41 b 2 2 2 4 10 16
42 b 2 2 3 4 10 18
43 b 2 3 1 4 12 14
44 b 2 3 2 4 12 16
45 b 2 3 3 4 12 18
46 b 3 1 1 6 8 14
47 b 3 1 2 6 8 16
48 b 3 1 3 6 8 18
49 b 3 2 1 6 10 14
50 b 3 2 2 6 10 16
51 b 3 2 3 6 10 18
52 b 3 3 1 6 12 14
53 b 3 3 2 6 12 16
54 b 3 3 3 6 12 18

How do I select rows in a data frame before and after a condition is met?

I'm searching the web for a few a days now and I can't find a solution to my (probably easy to solve) problem.
I have huge data frames with 4 variables and over a million observations each. Now I want to select 100 rows before, all rows while and 1000 rows after a specific condition is met and fill the rest with NA's. I tried it with a for loop and if/ifelse but it doesn't work so far. I think it shouldn't be a big thing, but in the moment I just don't get the hang of it.
I create the data using:
foo<-data.frame(t = 1:15, a = sample(1:15), b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1), c = sample(1:15))
My Data looks like this:
ID t a b c
1 1 4 1 7
2 2 7 1 10
3 3 10 1 6
4 4 2 1 4
5 5 13 1 9
6 6 15 4 3
7 7 8 4 15
8 8 3 4 1
9 9 9 4 2
10 10 14 1 8
11 11 5 1 11
12 12 11 1 13
13 13 12 1 5
14 14 6 1 14
15 15 1 1 12
What I want is to pick the value of a (in this example) 2 rows before, all rows while and 3 rows after the value of b is >1 and fill the rest with NA's. [Because this is just an example I guess you can imagine that after these 15 rows there are more rows with the value for b changing from 1 to 4 several times (I did not post it, so I won't spam the question with unnecessary data).]
So I want to get something like:
ID t a b c d
1 1 4 1 7 NA
2 2 7 1 10 NA
3 3 10 1 6 NA
4 4 2 1 4 2
5 5 13 1 9 13
6 6 15 4 3 15
7 7 8 4 15 8
8 8 3 4 1 3
9 9 9 4 2 9
10 10 14 1 8 14
11 11 5 1 11 5
12 12 11 1 13 11
13 13 12 1 5 NA
14 14 6 1 14 NA
15 15 1 1 12 NA
I'm thankful for any help.
Thank you.
Best regards,
Chris
here is the same attempt as missuse, but with data.table:
library(data.table)
foo<-data.frame(t = 1:11, a = sample(1:11), b = c(1,1,1,4,4,4,4,1,1,1,1), c = sample(1:11))
DT <- setDT(foo)
DT[ unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ])), d := a]
t a b c d
1: 1 10 1 2 NA
2: 2 6 1 10 6
3: 3 5 1 7 5
4: 4 11 4 4 11
5: 5 4 4 9 4
6: 6 8 4 5 8
7: 7 2 4 8 2
8: 8 3 1 3 3
9: 9 7 1 6 7
10: 10 9 1 1 9
11: 11 1 1 11 NA
Here
unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ]))
gives you your desired indixes : the unique indices of the line for your condition, the same indices+3 and -2.
Here is an attempt.
Get indexes that satisfy the condition b > 1
z <- which(foo$b > 1)
get indexes for (z - 2) : (z + 3)
ind <- unique(unlist(lapply(z, function(x){
g <- pmax(x - 2, 1) #if x - 2 is negative
g : (x + 3)
})))
create d column filled with NA
foo$d <- NA
replace elements with appropriate indexes with foo$a
foo$d[ind] <- foo$a[ind]
library(dplyr)
library(purrr)
# example dataset
foo<-data.frame(t = 1:15,
a = sample(1:15),
b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1),
c = sample(1:15))
# function to get indices of interest
# for a given index x go 2 positions back and 3 forward
# keep only positive indices
GetIDsBeforeAfter = function(x) {
v = (x-2) : (x+3)
v[v > 0]
}
foo %>% # from your dataset
filter(b > 1) %>% # keep rows where b > 1
pull(t) %>% # get the positions
map(GetIDsBeforeAfter) %>% # for each position apply the function
unlist() %>% # unlist all sets indices
unique() -> ids_to_remain # keep unique ones and save them in a vector
foo$d = foo$c # copy column c as d
foo$d[-ids_to_remain] = NA # put NA to all positions not in our vector
foo
# t a b c d
# 1 1 5 1 8 NA
# 2 2 6 1 14 NA
# 3 3 4 1 10 NA
# 4 4 1 1 7 7
# 5 5 10 1 5 5
# 6 6 8 4 9 9
# 7 7 9 4 15 15
# 8 8 3 4 6 6
# 9 9 7 4 2 2
# 10 10 12 1 3 3
# 11 11 11 1 1 1
# 12 12 15 1 4 4
# 13 13 14 1 11 NA
# 14 14 13 1 13 NA
# 15 15 2 1 12 NA

Resources