Creating a "run ID" for values in sequence - r

I have a vector which contains an ordered sequence of repeated integers:
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)
I want to create a "run ID" (I assume using data.table::rleid()) for numbers that are in sequence. That is, numbers which are either equal or +1 the previous value.
So, the expected output would be:
x
#> [1] 1 1 1 2 2 2 2 3 3 5 5 5 5 6 6 9 9 9 9
data.table::rleid(???)
#> [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
My first thought was to simply check if each value is the same or +1 the previous, but that doesn't work since the first change is considered a run of its own, obviously (a FALSE surrounded by TRUEs):
x
#> [1] 1 1 1 2 2 2 2 3 3 5 5 5 5 6 6 9 9 9 9
data.table::rleid((x - lag(x, default = 1)) %in% 0:1)
#> [1] 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 4 5 5 5
I obviously need something which allows me to compare each value to the last different value, but I can't think of how to do that effectively. Any pointers?

How about using lag from dplyr with cumsum?
library(dplyr)
cumsum(x - lag(x,default = 0) > 1)+1
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
Or the data.table way with shift:
library(data.table)
cumsum(x - shift(x,1,fill = 0) > 1) + 1
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3

Base R option using diff and cumsum :
cumsum(c(TRUE, diff(x) > 1))
#[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
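The diff/cumsum idea above can be wrapped in a small helper. This is only a sketch: the function name run_id and the max_step argument are made up for illustration, and it assumes x is sorted in non-decreasing order, as in the question.

```r
# Hypothetical helper (not from any package): start a new run whenever the
# jump from the previous value exceeds max_step.
run_id <- function(x, max_step = 1) {
  if (length(x) == 0) return(integer(0))
  # The first element always opens run 1; every later TRUE opens a new run.
  cumsum(c(TRUE, diff(x) > max_step))
}

x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)
run_id(x)
#> [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
```

Raising max_step would let larger gaps stay inside the same run, if that is ever needed.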

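library(dplyr)
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)
tibble(X = x) %>%
  mutate(PREV.X = lag(X, default = 0),
         # IS.SEQ is TRUE where X is neither equal to nor one more than the
         # previous value, i.e. where a new run starts
         IS.SEQ = X != PREV.X & X != PREV.X + 1,
         RZLT = 1 + cumsum(IS.SEQ))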
# A tibble: 19 x 4
X PREV.X IS.SEQ RZLT
<dbl> <dbl> <lgl> <dbl>
1 1 0 FALSE 1
2 1 1 FALSE 1
3 1 1 FALSE 1
4 2 1 FALSE 1
5 2 2 FALSE 1
6 2 2 FALSE 1
7 2 2 FALSE 1
8 3 2 FALSE 1
9 3 3 FALSE 1
10 5 3 TRUE 2
11 5 5 FALSE 2
12 5 5 FALSE 2
13 5 5 FALSE 2
14 6 5 FALSE 2
15 6 6 FALSE 2
16 9 6 TRUE 3
17 9 9 FALSE 3
18 9 9 FALSE 3
19 9 9 FALSE 3

Related

How to collapse dataframe columns by character in the middle of column name?

I have a dataset with the following layout:
ABC1a_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC1b_1 <- c(4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1a_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1b_2 <- c(2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2a_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2b_1 <- c(1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2a_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
ABC2b_2 <- c(2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df <- data.frame(ABC1a_1, ABC1b_1, ABC1a_2, ABC1b_2, ABC2a_1, ABC2b_1, ABC2a_2, ABC2b_2)
I want to collapse all of the ABC[N][x]_[n] variables into a single ABC[N]_[n] variable like this:
ABC1_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1, 4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4, 2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3, 1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4, 2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df2 <- data.frame(ABC1_1, ABC1_2, ABC2_1, ABC2_2)
What's the best way to achieve this, ideally with a tidyverse solution?
You could also use pivot_longer:
library(tidyverse)
df %>%
rename_with(~str_replace(.x, "(.)(_\\d)", "\\2:\\1")) %>%
pivot_longer(everything(), names_sep = ':', names_to = c(".value", "group")) %>%
arrange(group)
# A tibble: 20 x 5
group ABC1_1 ABC1_2 ABC2_1 ABC2_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 4 2 2
2 a 5 5 5 5
3 a 3 5 3 5
4 a 4 4 5 1
5 a 3 2 3 2
6 a 4 5 4 1
7 a 5 5 5 5
8 a 2 1 3 1
9 a 2 2 2 3
10 a 1 4 3 4
11 b 4 2 1 2
12 b 2 3 2 3
13 b 1 3 2 3
14 b 1 2 4 2
15 b 5 2 5 1
16 b 3 3 3 3
17 b 2 2 2 1
18 b 1 1 4 1
19 b 1 4 1 2
20 b 5 2 4 2
If you desire to go the Base R way, you could do:
reshape(df, split(names(df), sub("._", "_", names(df))), dir="long")
time ABC1a_1 ABC1a_2 ABC2a_1 ABC2a_2 id
1.1 1 1 4 2 2 1
2.1 1 5 5 5 5 2
3.1 1 3 5 3 5 3
4.1 1 4 4 5 1 4
5.1 1 3 2 3 2 5
6.1 1 4 5 4 1 6
7.1 1 5 5 5 5 7
8.1 1 2 1 3 1 8
9.1 1 2 2 2 3 9
10.1 1 1 4 3 4 10
1.2 2 4 2 1 2 1
2.2 2 2 3 2 3 2
3.2 2 1 3 2 3 3
4.2 2 1 2 4 2 4
5.2 2 5 2 5 1 5
6.2 2 3 3 3 3 6
7.2 2 2 2 2 1 7
8.2 2 1 1 4 1 8
9.2 2 1 4 1 2 9
10.2 2 5 2 4 2 10
Then you can change the names.
If you care about the names from the very beginning:
df1 <- setNames(df, gsub("(.)(_\\d)", "\\2.\\1", names(df)))
reshape(df1, names(df1), dir = "long")
time ABC1_1 ABC1_2 ABC2_1 ABC2_2 id
1 a 1 4 2 2 1
2 a 5 5 5 5 2
3 a 3 5 3 5 3
4 a 4 4 5 1 4
5 a 3 2 3 2 5
6 a 4 5 4 1 6
7 a 5 5 5 5 7
8 a 2 1 3 1 8
9 a 2 2 2 3 9
10 a 1 4 3 4 10
11 b 4 2 1 2 1
12 b 2 3 2 3 2
13 b 1 3 2 3 3
14 b 1 2 4 2 4
15 b 5 2 5 1 5
16 b 3 3 3 3 6
17 b 2 2 2 1 7
18 b 1 1 4 1 8
19 b 1 4 1 2 9
20 b 5 2 4 2 10
A base R solution to collapse it:
res <- as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), unlist))
rownames(res) <- NULL
# >res
# ABC1_1 ABC1_2 ABC2_1 ABC2_2
# 1 1 4 2 2
# 2 5 5 5 5
# 3 3 5 3 5
# 4 4 4 5 1
# 5 3 2 3 2
# 6 4 5 4 1
# 7 5 5 5 5
# 8 2 1 3 1
# 9 2 2 2 3
# 10 1 4 3 4
# 11 4 2 1 2
# 12 2 3 2 3
# 13 1 3 2 3
# 14 1 2 4 2
# 15 5 2 5 1
# 16 3 3 3 3
# 17 2 2 2 1
# 18 1 1 4 1
# 19 1 4 1 2
# 20 5 2 4 2
identical(df2, res)
# [1] TRUE
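To see why the split.default call above groups the right columns, it may help to look at the grouping key on its own. The sketch below uses a tiny made-up data frame (the values 1:8 are placeholders, not the question's data) with the same naming scheme.

```r
# Toy data with the same column-name pattern as in the question
df_small <- data.frame(ABC1a_1 = 1:2, ABC1b_1 = 3:4,
                       ABC1a_2 = 5:6, ABC1b_2 = 7:8)

# "._" matches the single character before the underscore (the a/b letter)
# together with the underscore; replacing it with "_" drops the letter.
key <- sub("._", "_", names(df_small))
key
#> [1] "ABC1_1" "ABC1_1" "ABC1_2" "ABC1_2"

# split.default() treats the data frame as a list of columns and groups
# the columns that share a key.
groups <- split.default(df_small, key)
lapply(groups, names)

# unlist() then stacks the columns of each group into one long vector.
unlist(groups$ABC1_1, use.names = FALSE)
#> [1] 1 2 3 4
```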
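If you would rather combine the a and b columns by adding them row-wise instead of stacking them, rowSums can be used as the combining function (note that this gives 10 rows of sums, not the 20-row df2 from the question):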
> as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), rowSums))
ABC1_1 ABC1_2 ABC2_1 ABC2_2
1 5 6 3 4
2 7 8 7 8
3 4 8 5 8
4 5 6 9 3
5 8 4 8 3
6 7 8 7 4
7 7 7 7 6
8 3 2 7 2
9 3 6 3 5
10 6 6 7 6
You could do:
library(tidyverse)
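# funs() is deprecated in dplyr, so rename_with() and a named function are used instead
map_dfr(c("a", "b"), function(letter) {
  df %>%
    select(contains(letter, ignore.case = FALSE)) %>%
    rename_with(~ str_remove_all(.x, letter))
})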
#ABC1_1 ABC1_2 ABC2_1 ABC2_2
#1 1 4 2 2
#2 5 5 5 5
#3 3 5 3 5
#4 4 4 5 1
# ..
Depending on your actual data, you could replace c("a", "b") with letters[1:2] or unique(str_extract(colnames(df), "[a-z]")).
library(tidyr)
library(dplyr)
df %>%
pivot_longer(everything()) %>%
arrange(name) %>%
mutate(name = gsub("[a-z]_", "_", name)) %>%
pivot_wider(values_fn = list) %>%
unchop(everything())
pivot_longer puts all the column names into a single name column, which you can then edit by removing the lowercase letter preceding the underscore.
When you pivot back to a wide format, columns that now share a name are grouped automatically. The output of pivot_wider consists of list-columns; unchop converts these lists into a longer data frame.
Output
ABC1_1 ABC1_2 ABC2_1 ABC2_2
<dbl> <dbl> <dbl> <dbl>
1 1 4 2 2
2 5 5 5 5
3 3 5 3 5
4 4 4 5 1
5 3 2 3 2
6 4 5 4 1
7 5 5 5 5
8 2 1 3 1
9 2 2 2 3
10 1 4 3 4
11 4 2 1 2
12 2 3 2 3
13 1 3 2 3
14 1 2 4 2
15 5 2 5 1
16 3 3 3 3
17 2 2 2 1
18 1 1 4 1
19 1 4 1 2
20 5 2 4 2

Add a count column and count twice if a certain condition is met

I am wondering if there is a way to make a conditional column-count by a group, adding 1 to a row_number or rowid if a certain value is met (in this case 0). For example:
df<-data.frame(group=c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,3,3,3),
condition=c(1,0,1,1,1,0,0,1,1,0,1,1,0, 1),
want=c(1, 3, 4,5,1,3,5,6,7,2,3,4,6,7))
group condition want
1 1 1 1
2 1 0 3
3 1 1 4
4 1 1 5
5 2 1 1
6 2 0 3
7 2 0 5
8 2 1 6
9 2 1 7
10 3 0 2
11 3 1 3
12 3 1 4
13 3 0 6
14 3 1 7
I think this might involve making a row_number per group and then making a customized row_number but I am open to suggestions. It is kind of a work-around method to "break up" my data when a 0 appears.
Using dplyr, for each group of data (group_by(group)) we can add a column that counts from 1 to the size of the group (i.e. n()). Adding the cumulative sum of condition == 0 makes that counter jump by one extra step whenever your condition is met.
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(desired = (1:n()) + cumsum(condition == 0))
Output:
#> # A tibble: 14 x 3
#> # Groups: group [3]
#> group condition desired
#> <dbl> <dbl> <int>
#> 1 1 1 1
#> 2 1 0 3
#> 3 1 1 4
#> 4 1 1 5
#> 5 2 1 1
#> 6 2 0 3
#> 7 2 0 5
#> 8 2 1 6
#> 9 2 1 7
#> 10 3 0 2
#> 11 3 1 3
#> 12 3 1 4
#> 13 3 0 6
#> 14 3 1 7
Data:
df1 <- data.frame(group=c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,3,3,3),
condition=c(1,0,1,1,1,0,0,1,1,0,1,1,0, 1))
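To make the counter-plus-cumsum logic concrete, here is a small sketch that works through group 2 of the example by hand, without dplyr. The variable names position and extra are illustrative only.

```r
# condition values of group 2 from the question
condition <- c(1, 0, 0, 1, 1)

position <- seq_along(condition)   # plain row counter: 1 2 3 4 5
extra <- cumsum(condition == 0)    # one extra step per zero seen so far: 0 1 2 2 2

position + extra
#> [1] 1 3 5 6 7
```

The result matches the want column for group 2 in the question.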
You can do:
transform(df, want = ave(condition, group, FUN = function(x) cumsum(x + (x == 0) * 2 )))
group condition want
1 1 1 1
2 1 0 3
3 1 1 4
4 1 1 5
5 2 1 1
6 2 0 3
7 2 0 5
8 2 1 6
9 2 1 7
10 3 0 2
11 3 1 3
12 3 1 4
13 3 0 6
14 3 1 7

Convert matrix to three defined columns in R

Given m:
m <- structure(c(5, 1, 3, 2, 1, 4, 5, 2, 5, 1, 1, 5, 1, 4, 0, 4, 5,
5, 3, 2, 0, 0, 3, 0, 3, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0), .Dim = c(7L,
5L))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 5 2 0 0 0
# [2,] 1 5 4 3 0
# [3,] 3 1 5 0 0
# [4,] 2 1 5 3 0
# [5,] 1 5 3 2 0
# [6,] 4 1 2 3 0
# [7,] 5 4 0 0 0
Consider the element 1: it appears in 5 rows (2, 3, 4, 5, 6) and the respective column indices are (1, 2, 2, 1, 2). I would like to have the following:
1 2 1
1 3 2
1 4 2
1 5 1
1 6 2
As another example, consider the element 2, it appears in 4 rows (1, 4, 5, 6) and the respective column-wise indices are (2, 1, 4, 3) and we have:
1 2 1
1 3 2
1 4 2
1 5 1
1 6 2
2 1 2
2 4 1
2 5 4
2 6 3
What I want is an n*3 matrix covering all values 1 to 5, preferably in base R.
A convenient way to transform it is to convert it to a sparse matrix with the Matrix package, since your desired output is very close to the triplet representation of a sparse matrix:
library(Matrix)
summary(Matrix(m, sparse = TRUE))
# 7 x 5 sparse Matrix of class "dgCMatrix", with 23 entries
# i j x
# 1 1 1 5
# 2 2 1 1
# 3 3 1 3
# 4 4 1 2
# 5 5 1 1
# 6 6 1 4
# 7 7 1 5
# 8 1 2 2
# 9 2 2 5
# 10 3 2 1
# 11 4 2 1
# 12 5 2 5
# 13 6 2 1
# 14 7 2 4
# 15 2 3 4
# 16 3 3 5
# 17 4 3 5
# 18 5 3 3
# 19 6 3 2
# 20 2 4 3
# 21 4 4 3
# 22 5 4 2
# 23 6 4 3
Sorted by value, it is easier to read:
dplyr::arrange(summary(Matrix(m, sparse = TRUE)), x)
# i j x
# 1 2 1 1
# 2 5 1 1
# 3 3 2 1
# 4 4 2 1
# 5 6 2 1
# 6 4 1 2
# 7 1 2 2
# 8 6 3 2
# 9 5 4 2
# 10 3 1 3
# 11 5 3 3
# 12 2 4 3
# 13 4 4 3
# 14 6 4 3
# 15 6 1 4
# 16 7 2 4
# 17 2 3 4
# 18 1 1 5
# 19 7 1 5
# 20 2 2 5
# 21 5 2 5
# 22 3 3 5
# 23 4 3 5
We can use which with arr.ind=TRUE
cbind(val= 1, which(m==1, arr.ind=TRUE))
# val row col
#[1,] 1 2 1
#[2,] 1 5 1
#[3,] 1 3 2
#[4,] 1 4 2
#[5,] 1 6 2
For multiple cases, as @RHertel mentioned:
for(i in 1:5) print(cbind(i,which(m==i, arr.ind=TRUE)))
Or with lapply
do.call(rbind, lapply(1:2, function(i) {
m1 <-cbind(val=i,which(m==i, arr.ind=TRUE))
m1[order(m1[,2]),]}))
# val row col
#[1,] 1 2 1
#[2,] 1 3 2
#[3,] 1 4 2
#[4,] 1 5 1
#[5,] 1 6 2
#[6,] 2 1 2
#[7,] 2 4 1
#[8,] 2 5 4
#[9,] 2 6 3
Since the OP asked for base R, the solutions above should do. If somebody wants a more compact solution, reshape2 offers:
library(reshape2)
melt(m)
and then subset the values of interest.
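For all non-zero values at once, the which(..., arr.ind = TRUE) idea can also be applied a single time instead of once per value. This is a base R sketch under the assumption that 0 marks an empty cell, as in the example matrix; the names idx, out and res are illustrative.

```r
m <- structure(c(5, 1, 3, 2, 1, 4, 5, 2, 5, 1, 1, 5, 1, 4, 0, 4, 5,
                 5, 3, 2, 0, 0, 3, 0, 3, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0),
               .Dim = c(7L, 5L))

# Row/column positions of every non-zero cell
idx <- which(m != 0, arr.ind = TRUE)

# Look up the value at each position and put it in front of the indices
out <- cbind(val = m[idx], idx)

# Order by value first, then by row within each value
res <- out[order(out[, "val"], out[, "row"]), ]

dim(res)
#> [1] 23  3
head(res, 9)   # same nine rows as the lapply() result for values 1 and 2
```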
Just use row and col.
> out <- data.frame(m=as.vector(m), row=as.vector(row(m)), col=as.vector(col(m)))
> out
m row col
1 5 1 1
2 1 2 1
3 3 3 1
4 2 4 1
5 1 5 1
...
Subset, sort, and print as desired.
> tmp <- out[order(out$m, out$row), ]
> print(subset(tmp, m==1), row.names=FALSE)
m row col
1 2 1
1 3 2
1 4 2
1 5 1
1 6 2

Create new variable based on size of value in other column

I am attempting to create a df with a new variable called 'epi' (short for episode), which is based on the 'days.since.last' variable. When the value of 'days.since.last' is greater than 90, I want the episode variable to increase by 1.
Here is the original df
deid session.number days.since.last
1 1 1 0
2 1 2 7
3 1 3 12
4 5 1 0
5 5 2 7
6 5 3 14
7 5 4 93
8 5 5 5
9 5 6 102
10 12 1 0
11 12 2 21
12 12 3 104
13 12 4 4
Created from
help <- data.frame(deid = c(1, 1, 1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
session.number = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
days.since.last = c(0, 7, 12, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4))
This is the output I am hoping to achieve
deid session.number days.since.last epi
1 1 1 0 1
2 1 2 7 1
3 1 3 12 1
4 5 1 0 1
5 5 2 7 1
6 5 3 14 1
7 5 4 93 2
8 5 5 5 2
9 5 6 102 3
10 12 1 0 1
11 12 2 21 1
12 12 3 104 2
13 12 4 4 2
My best attempt is the below code, however, it does not change the first value of each new episode (they remain at 0)...
library(nlme)  # provides gapply()
help$epi <- as.numeric(0)
tmp <- gapply(help, form = ~ deid, FUN = function(x)
{
spanSeq <- rle(x$days.since.last <= 90)$lengths[rle(x$days.since.last <= 90)$values == TRUE]
x$epi[x$days.since.last <= 90] <- rep(seq_along(spanSeq), times = spanSeq)
rm(spanSeq)
x
})
help2 <- do.call("rbind", tmp)
rownames(help2)<-c(1:length(help2$deid))
Any assistance is greatly appreciated!
You could do this with dplyr like this:
library(dplyr)
help %>% group_by(deid) %>% mutate(epi = cumsum(ifelse(days.since.last>90,1,0))+1)
deid session.number days.since.last epi
1 1 1 0 1
2 1 2 7 1
3 1 3 12 1
4 5 1 0 1
5 5 2 7 1
6 5 3 14 1
7 5 4 93 2
8 5 5 5 2
9 5 6 102 3
10 12 1 0 1
11 12 2 21 1
12 12 3 104 2
13 12 4 4 2
Essentially, the group_by does everything by group for your 'deid' variable. We assign a 1 or a 0 for each 'days.since.last' that is over 90. Then we create a new variable that is the cumulative sum of these 1's and 0's. By adding one to it we get your desired result.
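The same grouped cumulative sum can be sketched in base R with ave(), in case dplyr is not available. The data frame is called sessions here (an illustrative name, chosen to avoid masking the help() function); the data are those from the question.

```r
sessions <- data.frame(
  deid = c(1, 1, 1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
  session.number = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
  days.since.last = c(0, 7, 12, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4)
)

# 1 where a gap of more than 90 days starts a new episode, 0 otherwise
new_episode <- as.integer(sessions$days.since.last > 90)

# Running total of episode starts within each deid, plus 1 so counting starts at 1
sessions$epi <- ave(new_episode, sessions$deid, FUN = cumsum) + 1

sessions$epi
#> [1] 1 1 1 1 1 1 2 2 3 1 1 2 2
```

Since a logical comparison already sums as 0/1, the ifelse() in the dplyr version could likewise be shortened to cumsum(days.since.last > 90) + 1.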

lag not working as expected

lag is not working as I expected
a<-c(0,1,2,3,4,5,6,7,8)
a
## [1] 0 1 2 3 4 5 6 7 8
lag(a,k=1)
## [1] 0 1 2 3 4 5 6 7 8
## attr(,"tsp")
## [1] 0 8 1
I thought I would get:
0 0 1 2 3 4 5 6 7
or
1 2 3 4 5 6 7 8 0
What am I doing wrong?
You should use Lag from the Hmisc package:
library(Hmisc)
Lag(c(0,1,2,3,4,5,6,7,8), shift = 1)
# [1] NA 0 1 2 3 4 5 6 7
Actually, I went for this in the end:
c(a[-1],0)
It does what I wanted.
data.table::shift is another one:
library(data.table)
a <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
shift(a)
#[1] NA 0 1 2 3 4 5 6 7
Another way, using lag with the zoo library (lag comes from the stats package, and zoo supplies a method for it):
library(zoo)
a <- zoo(c(0, 1, 2, 3, 4, 5, 6, 7, 8)) #convert to zoo
So:
lag(a, 1, na.pad = TRUE)
# 1 2 3 4 5 6 7 8 9
# 1 2 3 4 5 6 7 8 NA
Or:
lag(a, -1, na.pad = TRUE)
# 1 2 3 4 5 6 7 8 9
#NA 0 1 2 3 4 5 6 7
You can wrap an as.vector around it to get rid of the indices.
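As for why the original call appears to do nothing: stats::lag shifts only the time base (the tsp attribute) and leaves the values untouched, which is what the output in the question shows. A base R sketch of both shifts the asker expected, with no extra packages; the names lagged and led are illustrative.

```r
a <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)

# stats::lag() keeps the values and only changes the "tsp" attribute
shifted <- stats::lag(a, k = 1)
attr(shifted, "tsp")
#> [1] 0 8 1
identical(as.vector(shifted), a)
#> [1] TRUE

# Shift values one position to the right, padding the front with NA
lagged <- c(NA, head(a, -1))
lagged
#> [1] NA  0  1  2  3  4  5  6  7

# Shift values one position to the left, padding the end with NA
led <- c(tail(a, -1), NA)
led
#> [1]  1  2  3  4  5  6  7  8 NA
```

Note that once dplyr is attached, a bare lag() call refers to dplyr::lag, which does shift the values, so the result of lag(a) depends on which packages are loaded.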
