lag not working as expected - r

lag is not working as I expected
a<-c(0,1,2,3,4,5,6,7,8)
a
## [1] 0 1 2 3 4 5 6 7 8
lag(a,k=1)
## [1] 0 1 2 3 4 5 6 7 8
## attr(,"tsp")
## [1] 0 8 1
I thought I would get:
0 0 1 2 3 4 5 6 7
or
1 2 3 4 5 6 7 8 0
What am I doing wrong?

You should use Lag from the Hmisc package:
library(Hmisc)
Lag(c(0,1,2,3,4,5,6,7,8), shift = 1)
# [1] NA 0 1 2 3 4 5 6 7

Actually, I went with this in the end:
c(a[-1],0)
It does what I wanted.
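For context on why the original call looked like a no-op: stats::lag() leaves the values untouched and only shifts the "tsp" time attribute, which is what the output in the question shows. Below is a minimal base R sketch of a lag and a lead that shift the values themselves. The names lag_vec and lead_vec are hypothetical helpers, not functions from any package, and the sketch assumes 1 <= k < length(x).

```r
a <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)

# stats::lag() keeps the values and only moves the time base
shifted <- stats::lag(a, k = 1)
attr(shifted, "tsp")
# [1] 0 8 1

# Hypothetical helpers that shift the values and pad with `fill`
lag_vec  <- function(x, k = 1, fill = NA) c(rep(fill, k), head(x, -k))
lead_vec <- function(x, k = 1, fill = NA) c(tail(x, -k), rep(fill, k))

lag_vec(a, fill = 0)
# [1] 0 0 1 2 3 4 5 6 7
lead_vec(a, fill = 0)
# [1] 1 2 3 4 5 6 7 8 0
```

With fill = 0 these reproduce the two results the question expected; leaving fill at NA matches what Hmisc::Lag and data.table::shift return by default.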

data.table::shift is another one:
library(data.table)
a <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
shift(a)
#[1] NA 0 1 2 3 4 5 6 7
Another way is to use lag on a zoo object (lag lives in stats, not base, and the zoo package provides its own lag method):
library(zoo)
a <- zoo(c(0, 1, 2, 3, 4, 5, 6, 7, 8)) #convert to zoo
So:
lag(a, 1, na.pad = TRUE)
# 1 2 3 4 5 6 7 8 9
# 1 2 3 4 5 6 7 8 NA
Or:
lag(a, -1, na.pad = TRUE)
# 1 2 3 4 5 6 7 8 9
#NA 0 1 2 3 4 5 6 7
You can wrap an as.vector around it to get rid of the indices.

How to collapse dataframe columns by character in the middle of column name?

I have a dataset with the following layout:
ABC1a_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC1b_1 <- c(4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1a_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1b_2 <- c(2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2a_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2b_1 <- c(1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2a_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
ABC2b_2 <- c(2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df <- data.frame(ABC1a_1, ABC1b_1, ABC1a_2, ABC1b_2, ABC2a_1, ABC2b_1, ABC2a_2, ABC2b_2)
I want to collapse all of the ABC[N][x]_[n] variables into a single ABC[N]_[n] variable like this:
ABC1_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1, 4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4, 2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3, 1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4, 2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df2 <- data.frame(ABC1_1, ABC1_2, ABC2_1, ABC2_2)
What's the best way to achieve this, ideally with a tidyverse solution?
You could also use pivot_longer:
library(tidyverse)
# move the letter behind a ":" so that names_sep can split it off
df %>%
  rename_with(~str_replace(.x, "(.)(_\\d)", "\\2:\\1")) %>%
  pivot_longer(everything(), names_sep = ':', names_to = c(".value", "group")) %>%
  arrange(group)
# A tibble: 20 x 5
group ABC1_1 ABC1_2 ABC2_1 ABC2_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 4 2 2
2 a 5 5 5 5
3 a 3 5 3 5
4 a 4 4 5 1
5 a 3 2 3 2
6 a 4 5 4 1
7 a 5 5 5 5
8 a 2 1 3 1
9 a 2 2 2 3
10 a 1 4 3 4
11 b 4 2 1 2
12 b 2 3 2 3
13 b 1 3 2 3
14 b 1 2 4 2
15 b 5 2 5 1
16 b 3 3 3 3
17 b 2 2 2 1
18 b 1 1 4 1
19 b 1 4 1 2
20 b 5 2 4 2
If you prefer base R, you could do:
reshape(df, varying = split(names(df), sub("._", "_", names(df))), direction = "long")
time ABC1a_1 ABC1a_2 ABC2a_1 ABC2a_2 id
1.1 1 1 4 2 2 1
2.1 1 5 5 5 5 2
3.1 1 3 5 3 5 3
4.1 1 4 4 5 1 4
5.1 1 3 2 3 2 5
6.1 1 4 5 4 1 6
7.1 1 5 5 5 5 7
8.1 1 2 1 3 1 8
9.1 1 2 2 2 3 9
10.1 1 1 4 3 4 10
1.2 2 4 2 1 2 1
2.2 2 2 3 2 3 2
3.2 2 1 3 2 3 3
4.2 2 1 2 4 2 4
5.2 2 5 2 5 1 5
6.2 2 3 3 3 3 6
7.2 2 2 2 2 1 7
8.2 2 1 1 4 1 8
9.2 2 1 4 1 2 9
10.2 2 5 2 4 2 10
Then you can change the names.
If you care about the names from the very beginning:
df1 <- setNames(df, gsub("(.)(_\\d)", "\\2.\\1", names(df)))
reshape(df1, varying = names(df1), direction = "long")
time ABC1_1 ABC1_2 ABC2_1 ABC2_2 id
1 a 1 4 2 2 1
2 a 5 5 5 5 2
3 a 3 5 3 5 3
4 a 4 4 5 1 4
5 a 3 2 3 2 5
6 a 4 5 4 1 6
7 a 5 5 5 5 7
8 a 2 1 3 1 8
9 a 2 2 2 3 9
10 a 1 4 3 4 10
11 b 4 2 1 2 1
12 b 2 3 2 3 2
13 b 1 3 2 3 3
14 b 1 2 4 2 4
15 b 5 2 5 1 5
16 b 3 3 3 3 6
17 b 2 2 2 1 7
18 b 1 1 4 1 8
19 b 1 4 1 2 9
20 b 5 2 4 2 10
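The renaming step carries most of the logic in the reshape and pivot_longer answers, so it may help to look at it on its own. The sketch below uses four of the column names from the question; the variable names nms, renamed and parts are only for illustration. It shows that the regular expression moves the letter behind a "." so that the part before the dot is the target column name and the part after it is the group.

```r
# Four of the column names from the question
nms <- c("ABC1a_1", "ABC1b_1", "ABC1a_2", "ABC1b_2")

# (.) captures the letter and (_\\d) captures "_1" or "_2";
# the replacement swaps them and puts a "." in between
renamed <- gsub("(.)(_\\d)", "\\2.\\1", nms)
renamed
# [1] "ABC1_1.a" "ABC1_1.b" "ABC1_2.a" "ABC1_2.b"

# Splitting at the "." gives the target column name and the group letter
parts <- do.call(rbind, strsplit(renamed, ".", fixed = TRUE))
parts[, 1]
# [1] "ABC1_1" "ABC1_1" "ABC1_2" "ABC1_2"
parts[, 2]
# [1] "a" "b" "a" "b"
```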
A base R solution to collapse it:
res <- as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), unlist))
rownames(res) <- NULL
# >res
# ABC1_1 ABC1_2 ABC2_1 ABC2_2
# 1 1 4 2 2
# 2 5 5 5 5
# 3 3 5 3 5
# 4 4 4 5 1
# 5 3 2 3 2
# 6 4 5 4 1
# 7 5 5 5 5
# 8 2 1 3 1
# 9 2 2 2 3
# 10 1 4 3 4
# 11 4 2 1 2
# 12 2 3 2 3
# 13 1 3 2 3
# 14 1 2 4 2
# 15 5 2 5 1
# 16 3 3 3 3
# 17 2 2 2 1
# 18 1 1 4 1
# 19 1 4 1 2
# 20 5 2 4 2
identical(df2, res)
# [1] TRUE
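To see what the split.default call does, here is a small sketch on a toy data frame with the same naming scheme. The values in toy are made up for illustration, and key, groups and stacked are just illustrative names.

```r
# Toy data with the same naming scheme as the question (values are made up)
toy <- data.frame(ABC1a_1 = 1:2, ABC1b_1 = 3:4, ABC1a_2 = 5:6, ABC1b_2 = 7:8)

# Dropping the character before the underscore gives one key per target column
key <- sub("._", "_", names(toy))
key
# [1] "ABC1_1" "ABC1_1" "ABC1_2" "ABC1_2"

# split.default() groups the columns (not the rows) by that key
groups <- split.default(toy, key)
names(groups[["ABC1_1"]])
# [1] "ABC1a_1" "ABC1b_1"

# unlist() stacks the columns of each group on top of each other
stacked <- as.data.frame(lapply(groups, unlist, use.names = FALSE))
stacked
#   ABC1_1 ABC1_2
# 1      1      5
# 2      2      6
# 3      3      7
# 4      4      8
```

Passing use.names = FALSE to unlist avoids the generated row names, so the separate rownames(res) <- NULL step is not needed in this variant.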
If you would rather sum the a and b columns row by row than stack them, use rowSums as the combining function (this keeps the original 10 rows):
as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), rowSums))
ABC1_1 ABC1_2 ABC2_1 ABC2_2
1 5 6 3 4
2 7 8 7 8
3 4 8 5 8
4 5 6 9 3
5 8 4 8 3
6 7 8 7 4
7 7 7 7 6
8 3 2 7 2
9 3 6 3 5
10 6 6 7 6
You could do:
library(tidyverse)
# rename_with() replaces the deprecated rename_all(funs(...)) idiom
map_dfr(c("a", "b"), function(letter) {
  df %>%
    select(contains(letter, ignore.case = FALSE)) %>%
    rename_with(~ str_remove_all(.x, letter))
})
#ABC1_1 ABC1_2 ABC2_1 ABC2_2
#1 1 4 2 2
#2 5 5 5 5
#3 3 5 3 5
#4 4 4 5 1
# ..
Depending on your actual data, you could replace c("a", "b") with letters[1:2] or unique(str_extract(colnames(df), "[a-z]")).
library(tidyr)
library(dplyr)
df %>%
pivot_longer(everything()) %>%
arrange(name) %>%
mutate(name = gsub("[a-z]_", "_", name)) %>%
pivot_wider(values_fn = list) %>%
unchop(everything())
pivot_longer puts all the column names into a single column (name), which you can then edit by removing the lowercase letter preceding the underscore.
When you pivot back to a wide format, the columns group automatically. The output of pivot_wider is a set of list-columns; unchop expands these lists into a longer data frame.
Output
ABC1_1 ABC1_2 ABC2_1 ABC2_2
<dbl> <dbl> <dbl> <dbl>
1 1 4 2 2
2 5 5 5 5
3 3 5 3 5
4 4 4 5 1
5 3 2 3 2
6 4 5 4 1
7 5 5 5 5
8 2 1 3 1
9 2 2 2 3
10 1 4 3 4
11 4 2 1 2
12 2 3 2 3
13 1 3 2 3
14 1 2 4 2
15 5 2 5 1
16 3 3 3 3
17 2 2 2 1
18 1 1 4 1
19 1 4 1 2
20 5 2 4 2

Extract cumulative unique values in a rolling basis (reset and resume) using data.table R

Given a data.table, I would like to extract the cumulative unique elements until they reach three unique values, then reset and resume:
y <- data.table(a=c(1, 2, 2, 3, 3, 4, 3, 2, 2, 5, 6, 7, 9, 8))
The desired output unique_acc_roll_3 is:
a unique_acc_roll_3
1 1
2 1 2
2 1 2
3 1 2 3
3 1 2 3
4 4 #4 is the fourth element, so it resets and starts again
3 3 4
2 2 3 4
2 2 3 4
5 5 #5 is the fourth element, so it resets and starts again
6 5 6
7 5 6 7
9 9 #9 is the fourth element, so it resets and starts again
8 8 9
Because each row refers back to the previous result, I got really stuck. The real data is large, so data.table solutions would be great.
I can't think of any way to avoid a for loop essentially, except to hide it behind a Reduce call. My logic is to keep union-ing each new value at each row, until the set grows to length == n, at which point the new value is used as the starting point to the next iteration of the loop.
unionlim <- function(x, y, n=4) {
  u <- union(x,y)
  if(length(u) == n) y else u
}
y[, out := sapply(Reduce(unionlim, a, accumulate=TRUE), paste, collapse=" ")]
# a out
# 1: 1 1
# 2: 2 1 2
# 3: 2 1 2
# 4: 3 1 2 3
# 5: 3 1 2 3
# 6: 4 4
# 7: 3 4 3
# 8: 2 4 3 2
# 9: 2 4 3 2
#10: 5 5
#11: 6 5 6
#12: 7 5 6 7
#13: 9 9
#14: 8 9 8
This is far from the fastest code on the planet, but a quick test suggests it will chew through about 1M cases in ~15 seconds on my decent machine.
bigy <- y[rep(1:nrow(y), 75e3)]
system.time({
bigy[, out := sapply(Reduce(unionlim, a, accumulate=TRUE), paste, collapse=" ")]
})
# user system elapsed
# 14.27 0.09 15.06
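For readers who find the Reduce call opaque, the same reset-and-resume logic can be written as a plain loop. This is only a sketch in base R: roll_unique and its limit argument are hypothetical names, and it makes no claim to be faster than the Reduce version.

```r
# Hypothetical helper: accumulate unique values, restarting from the current
# value whenever it would become distinct value number limit + 1
roll_unique <- function(values, limit = 3) {
  out <- character(length(values))
  seen <- values[0]                      # empty vector of the same type
  for (i in seq_along(values)) {
    candidate <- union(seen, values[i])
    seen <- if (length(candidate) > limit) values[i] else candidate
    out[i] <- paste(seen, collapse = " ")
  }
  out
}

a <- c(1, 2, 2, 3, 3, 4, 3, 2, 2, 5, 6, 7, 9, 8)
roll_unique(a)
#  [1] "1"     "1 2"   "1 2"   "1 2 3" "1 2 3" "4"     "4 3"   "4 3 2" "4 3 2"
# [10] "5"     "5 6"   "5 6 7" "9"     "9 8"
```

As in the Reduce answer, values appear in order of first occurrence rather than sorted, so row 7 reads "4 3" where the question shows "3 4".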
purrr::accumulate also does the job here:
library(purrr)
y$b <- accumulate(y$a, ~if(length(union(.x, .y)) == 4) .y else union(.x, .y))
y
a b
1 1 1
2 2 1, 2
3 2 1, 2
4 3 1, 2, 3
5 3 1, 2, 3
6 4 4
7 3 4, 3
8 2 4, 3, 2
9 2 4, 3, 2
10 5 5
11 6 5, 6
12 7 5, 6, 7
13 9 9
14 8 9, 8

Creating a "run ID" for values in sequence

I have a vector which contains an ordered sequence of repeated integers:
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)
I want to create a "run ID" (I assume using data.table::rleid()) for numbers that are in sequence. That is, numbers which are either equal or +1 the previous value.
So, the expected output would be:
x
#> [1] 1 1 1 2 2 2 2 3 3 5 5 5 5 6 6 9 9 9 9
data.table::rleid(???)
#> [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
My first thought was to simply check if each value is the same or +1 the previous, but that doesn't work since the first change is considered a run of its own, obviously (a FALSE surrounded by TRUEs):
x
#> [1] 1 1 1 2 2 2 2 3 3 5 5 5 5 6 6 9 9 9 9
data.table::rleid((x - lag(x, default = 1)) %in% 0:1)
#> [1] 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 4 5 5 5
I obviously need something which allows me to compare each value to the last different value, but I can't think of how to do that effectively. Any pointers?
How about using lag from dplyr with cumsum?
library(dplyr)
cumsum(x - lag(x,default = 0) > 1)+1
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
Or the data.table way with shift:
library(data.table)
cumsum(x - shift(x,1,fill = 0) > 1) + 1
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
Base R option using diff and cumsum:
cumsum(c(TRUE, diff(x) > 1))
#[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
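The one-liner works because a new run starts wherever the gap to the previous value is larger than 1. Broken into steps (the intermediate names jumps and run_id are only for illustration), a sketch looks like this:

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)

# TRUE wherever the value jumps by more than 1 relative to its predecessor
jumps <- diff(x) > 1

# diff() is one element shorter than x, so add 1 to get positions in x
which(jumps) + 1
# [1] 10 16

# The leading TRUE starts run 1; every later TRUE bumps the run counter
run_id <- cumsum(c(TRUE, jumps))
run_id
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3
```

Calling split(x, run_id) afterwards is a quick way to inspect which values ended up in each run.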
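A dplyr version that shows the intermediate columns:
library(dplyr)
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 6, 6, 9, 9, 9, 9)
# note: despite its name, IS.SEQ is TRUE where a new run starts, i.e. where
# X is neither equal to the previous value nor one more than it
tibble(X = x) %>%
  mutate(PREV.X = lag(X, default = 0),
         IS.SEQ = X != PREV.X & X != PREV.X + 1,
         RZLT = 1 + cumsum(IS.SEQ))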
# A tibble: 19 x 4
X PREV.X IS.SEQ RZLT
<dbl> <dbl> <lgl> <dbl>
1 1 0 FALSE 1
2 1 1 FALSE 1
3 1 1 FALSE 1
4 2 1 FALSE 1
5 2 2 FALSE 1
6 2 2 FALSE 1
7 2 2 FALSE 1
8 3 2 FALSE 1
9 3 3 FALSE 1
10 5 3 TRUE 2
11 5 5 FALSE 2
12 5 5 FALSE 2
13 5 5 FALSE 2
14 6 5 FALSE 2
15 6 6 FALSE 2
16 9 6 TRUE 3
17 9 9 FALSE 3
18 9 9 FALSE 3
19 9 9 FALSE 3

Convert matrix to three defined columns in R

Given m:
m <- structure(c(5, 1, 3, 2, 1, 4, 5, 2, 5, 1, 1, 5, 1, 4, 0, 4, 5,
5, 3, 2, 0, 0, 3, 0, 3, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0), .Dim = c(7L,
5L))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 5 2 0 0 0
# [2,] 1 5 4 3 0
# [3,] 3 1 5 0 0
# [4,] 2 1 5 3 0
# [5,] 1 5 3 2 0
# [6,] 4 1 2 3 0
# [7,] 5 4 0 0 0
Consider the element 1: it appears in 5 rows (2, 3, 4, 5, 6) and the respective column-wise indices are (1, 2, 2, 1, 2). I would like to have the following:
1 2 1
1 3 2
1 4 2
1 5 1
1 6 2
As another example, consider the element 2, it appears in 4 rows (1, 4, 5, 6) and the respective column-wise indices are (2, 1, 4, 3) and we have:
1 2 1
1 3 2
1 4 2
1 5 1
1 6 2
2 1 2
2 4 1
2 5 4
2 6 3
What I want is an n x 3 matrix covering all values 1 to 5, preferably in base R.
A convenient way to transform it is to use a sparse matrix from the Matrix package, since your desired output is very close to the triplet representation of a sparse matrix:
library(Matrix)
summary(Matrix(m, sparse = TRUE))
# 7 x 5 sparse Matrix of class "dgCMatrix", with 23 entries
# i j x
# 1 1 1 5
# 2 2 1 1
# 3 3 1 3
# 4 4 1 2
# 5 5 1 1
# 6 6 1 4
# 7 7 1 5
# 8 1 2 2
# 9 2 2 5
# 10 3 2 1
# 11 4 2 1
# 12 5 2 5
# 13 6 2 1
# 14 7 2 4
# 15 2 3 4
# 16 3 3 5
# 17 4 3 5
# 18 5 3 3
# 19 6 3 2
# 20 2 4 3
# 21 4 4 3
# 22 5 4 2
# 23 6 4 3
To see it better, sort by value:
dplyr::arrange(summary(Matrix(m, sparse = TRUE)), x)
# i j x
# 1 2 1 1
# 2 5 1 1
# 3 3 2 1
# 4 4 2 1
# 5 6 2 1
# 6 4 1 2
# 7 1 2 2
# 8 6 3 2
# 9 5 4 2
# 10 3 1 3
# 11 5 3 3
# 12 2 4 3
# 13 4 4 3
# 14 6 4 3
# 15 6 1 4
# 16 7 2 4
# 17 2 3 4
# 18 1 1 5
# 19 7 1 5
# 20 2 2 5
# 21 5 2 5
# 22 3 3 5
# 23 4 3 5
We can use which with arr.ind=TRUE
cbind(val= 1, which(m==1, arr.ind=TRUE))
# val row col
#[1,] 1 2 1
#[2,] 1 5 1
#[3,] 1 3 2
#[4,] 1 4 2
#[5,] 1 6 2
For multiple cases, as @RHertel mentioned
for(i in 1:5) print(cbind(i,which(m==i, arr.ind=TRUE)))
Or with lapply
do.call(rbind, lapply(1:2, function(i) {
m1 <-cbind(val=i,which(m==i, arr.ind=TRUE))
m1[order(m1[,2]),]}))
# val row col
#[1,] 1 2 1
#[2,] 1 3 2
#[3,] 1 4 2
#[4,] 1 5 1
#[5,] 1 6 2
#[6,] 2 1 2
#[7,] 2 4 1
#[8,] 2 5 4
#[9,] 2 6 3
Since the OP asked for base R solutions, the above should help. If somebody wants a more compact solution:
library(reshape2)
melt(m)
and then subset the values of interest.
Just use row and col.
> out <- data.frame(m=as.vector(m), row=as.vector(row(m)), col=as.vector(col(m)))
> out
m row col
1 5 1 1
2 1 2 1
3 3 3 1
4 2 4 1
5 1 5 1
...
Subset, sort, and print as desired.
> tmp <- out[order(out$m, out$row), ]
> print(subset(tmp, m==1), row.names=FALSE)
m row col
1 2 1
1 3 2
1 4 2
1 5 1
1 6 2
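The which(..., arr.ind = TRUE) idea also extends to all values at once without a loop. The following is a sketch in base R that assumes, as in the question, that 0 marks an empty cell and that every non-zero value is wanted; idx and out are illustrative names.

```r
m <- matrix(c(5, 1, 3, 2, 1, 4, 5, 2, 5, 1, 1, 5, 1, 4, 0, 4, 5, 5, 3, 2, 0,
              0, 3, 0, 3, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0), nrow = 7)

# Row and column index of every non-zero cell
idx <- which(m != 0, arr.ind = TRUE)

# Indexing m with the two-column matrix returns the value of each cell
out <- cbind(val = m[idx], idx)

# Sort by value, then by row, to match the layout asked for
out <- out[order(out[, "val"], out[, "row"]), ]

out[out[, "val"] == 1, ]
#      val row col
# [1,]   1   2   1
# [2,]   1   3   2
# [3,]   1   4   2
# [4,]   1   5   1
# [5,]   1   6   2
```

The full out matrix has 23 rows, one per non-zero entry, which agrees with the entry count reported by the sparse-matrix summary.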

Create new variable based on size of value in other column

I am attempting to create a df with a new variable called 'epi' (short for episode), which is based on the 'days.since.last' variable. When the value of 'days.since.last' is greater than 90, I want the episode variable to increase by 1.
Here is the original df
deid session.number days.since.last
1 1 1 0
2 1 2 7
3 1 3 12
4 5 1 0
5 5 2 7
6 5 3 14
7 5 4 93
8 5 5 5
9 5 6 102
10 12 1 0
11 12 2 21
12 12 3 104
13 12 4 4
Created from
help <- data.frame(deid = c(1, 1, 1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
session.number = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
days.since.last = c(0, 7, 12, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4))
This is the output I am hoping to achieve
deid session.number days.since.last epi
1 1 1 0 1
2 1 2 7 1
3 1 3 12 1
4 5 1 0 1
5 5 2 7 1
6 5 3 14 1
7 5 4 93 2
8 5 5 5 2
9 5 6 102 3
10 12 1 0 1
11 12 2 21 1
12 12 3 104 2
13 12 4 4 2
My best attempt is the below code, however, it does not change the first value of each new episode (they remain at 0)...
help$epi <- as.numeric(0)
tmp <- gapply(help, form = ~ deid, FUN = function(x)
{
spanSeq <- rle(x$days.since.last <= 90)$lengths[rle(x$days.since.last <= 90)$values == TRUE]
x$epi[x$days.since.last <= 90] <- rep(seq_along(spanSeq), times = spanSeq)
rm(spanSeq)
x
})
help2 <- do.call("rbind", tmp)
rownames(help2)<-c(1:length(help2$deid))
Any assistance is greatly appreciated!
You could do this with dplyr like this:
library(dplyr)
help %>% group_by(deid) %>% mutate(epi = cumsum(ifelse(days.since.last>90,1,0))+1)
deid session.number days.since.last epi
1 1 1 0 1
2 1 2 7 1
3 1 3 12 1
4 5 1 0 1
5 5 2 7 1
6 5 3 14 1
7 5 4 93 2
8 5 5 5 2
9 5 6 102 3
10 12 1 0 1
11 12 2 21 1
12 12 3 104 2
13 12 4 4 2
Essentially, group_by makes every subsequent operation run separately for each 'deid'. Each 'days.since.last' value over 90 is assigned a 1 and every other value a 0. The new variable is the cumulative sum of these 1s and 0s; adding one to it gives your desired result.
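The same grouped cumulative sum can be sketched in base R with ave(), in case dplyr is not available. This swaps in a different function from the answer above but applies the same logic; the data frame is the one from the question.

```r
help <- data.frame(deid = c(1, 1, 1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
                   session.number = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
                   days.since.last = c(0, 7, 12, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4))

# 1 where a gap of more than 90 days starts a new episode, 0 otherwise
new_episode <- as.numeric(help$days.since.last > 90)

# ave() applies cumsum within each deid; adding 1 makes episodes start at 1
help$epi <- ave(new_episode, help$deid, FUN = cumsum) + 1
help$epi
# [1] 1 1 1 1 1 1 2 2 3 1 1 2 2
```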
