Doing rolling custom computations using data.table for multiple columns - r

I am doing rolling computations on a column, using the code below:
dt <- data.table(x1=1:8,x2=2:10,x3=4:11,x4=6:12)
N = 3L
dt[, y1 := (2*dt$x1[.I] -dt$x1[(.I+N-1L)]), by=1:nrow(dt)]
dt
x1 x2 x3 x4 y1
1: 1 2 4 6 -1
2: 2 3 5 7 0
3: 3 4 6 8 1
4: 4 5 7 9 2
5: 5 6 8 10 3
6: 6 7 9 11 4
7: 7 8 10 12 13
8: 8 9 11 6 NA
9: 1 10 4 7 NA
sdcols=paste0("x",1:4)
How does one use sdcols to achieve the same result for columns x1 through x4, creating new columns y1 to y4?

Perhaps we don't need a group-by operation:
nm1 <- paste0("x", 1:4) # only the x columns; names(dt) would also pick up y1 if it already exists
dt[, paste0('y', seq_along(nm1)) := lapply(.SD,
function(x) c((2*shift(x)- shift(x, type = 'lead'))[-1], NA)), .SDcols = nm1]
dt
# x1 x2 x3 x4 y1 y2 y3 y4
#1: 1 2 4 6 -1 0 2 4
#2: 2 3 5 7 0 1 3 5
#3: 3 4 6 8 1 2 4 6
#4: 4 5 7 9 2 3 5 7
#5: 5 6 8 10 3 4 6 8
#6: 6 7 9 11 4 5 7 16
#7: 7 8 10 12 13 6 16 17
#8: 8 9 11 6 NA NA NA NA
#9: 1 10 4 7 NA NA NA NA
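The same computation can also be written with a single lead of N - 1, which keeps N as a parameter. This is a minimal sketch rather than the answer's own method; it assumes all columns have the same length (the table below is made up for illustration), and the names x_cols and y_cols are hypothetical.

```r
library(data.table)

# Illustrative data with equal-length columns (not the table from the question)
dt <- data.table(x1 = 1:8, x2 = 2:9, x3 = 4:11, x4 = 6:13)
N <- 3L

x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# y[i] = 2 * x[i] - x[i + N - 1]; the last N - 1 rows have no lead value and become NA
dt[, (y_cols) := lapply(.SD, function(x) 2L * x - shift(x, n = N - 1L, type = "lead")),
   .SDcols = x_cols]
dt
```

Changing N only changes the lead distance, so no slicing of the result is needed.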

Related

Create lagged variables for several columns group by two conditions in r

I would like to create lagged variables for several columns that are grouped by two conditions.
Here is the dataset:
df <- data.frame(id = c(rep(1,4),rep(2,4)), tp = rep(1:4,2), x1 = 1:8, x2 = 2:9, x3 = 3:10, x4 = 4:11)
> df
id tp x1 x2 x3 x4
1 1 1 1 2 3 4
2 1 2 2 3 4 5
3 1 3 3 4 5 6
4 1 4 4 5 6 7
5 2 1 5 6 7 8
6 2 2 6 7 8 9
7 2 3 7 8 9 10
8 2 4 8 9 10 11
I want to lag x1, x2, x3, x4 that are grouped by id and tp and create new variables x1_lag1, x2_lag1, x3_lag1, x4_lag1, like this:
> df
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
How to achieve that?
Your result doesn't seem to be grouped by tp at all. It is grouped by id and ordered by tp within the id grouping.
Generally a "lag" is a variable that takes the value from the previous row. The columns you want labeled as "lag" columns take the value from the next row, so we use the lead function.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with("x"), lead, .names = "{.col}_lag1")) %>%
ungroup()
# A tibble: 8 × 10
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
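For comparison, a data.table sketch of the same idea: order by id and tp, then take the lead of each x column within id. The names dt, x_cols and new_cols are illustrative and not part of the original answer.

```r
library(data.table)

df <- data.frame(id = c(rep(1, 4), rep(2, 4)), tp = rep(1:4, 2),
                 x1 = 1:8, x2 = 2:9, x3 = 3:10, x4 = 4:11)

dt <- as.data.table(df)
setorder(dt, id, tp)  # make sure rows are ordered by tp within each id

x_cols <- grep("^x", names(dt), value = TRUE)
new_cols <- paste0(x_cols, "_lag1")

# shift() with type = "lead" takes the value from the next row within each id
dt[, (new_cols) := shift(.SD, n = 1L, type = "lead"), by = id, .SDcols = x_cols]
dt
```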

Finding equal rows between dataframes in R

I have the following data set as example:
df1 <- data.frame(V1 = 1:10, V2 = 1:10, V3 = 1:10)
df2 <- data.frame(V1 = 5:1, V2 = 5:1, v3 = c(1, 4, 5, 2, 3))
If a row in df1 is present in df2, I would like to create a column in df1 that gives the index of the corresponding row in df2; for all other rows it can show FALSE, NULL, NA, 0, or similar.
Expected output:
V1 V2 V3 rows_matched
1 1 1 1 FALSE
2 2 2 2 4
3 3 3 3 FALSE
4 4 4 4 2
5 5 5 5 FALSE
6 6 6 6 FALSE
7 7 7 7 FALSE
8 8 8 8 FALSE
9 9 9 9 FALSE
10 10 10 10 FALSE
In base R:
cbind(df1, matched = match(interaction(df1), interaction(df2)))
V1 V2 V3 matched
1 1 1 1 NA
2 2 2 2 4
3 3 3 3 NA
4 4 4 4 2
5 5 5 5 NA
6 6 6 6 NA
7 7 7 7 NA
8 8 8 8 NA
9 9 9 9 NA
10 10 10 10 NA
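The match/interaction approach works by collapsing each row into a single key and then matching keys. A sketch that spells this out with paste, as an illustration only: the names key1 and key2 are hypothetical, columns are paired by position (so the v3/V3 name mismatch does not matter here), and values are compared through their printed form, which is an assumption that holds for whole numbers like these.

```r
df1 <- data.frame(V1 = 1:10, V2 = 1:10, V3 = 1:10)
df2 <- data.frame(V1 = 5:1, V2 = 5:1, v3 = c(1, 4, 5, 2, 3))

# Build one string key per row, pairing columns by position
key1 <- do.call(paste, c(unname(as.list(df1)), sep = "|"))
key2 <- do.call(paste, c(unname(as.list(df2)), sep = "|"))

# match() returns the index of the first matching row in df2, or NA if there is none
df1$rows_matched <- match(key1, key2)
df1
```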
You can do a simple left join. Note: I renamed the column v3 in df2 to V3 to match the names in df1.
library(dplyr)
left_join(
df1,
df2 %>% mutate(rows_matched=row_number())
)
Output:
V1 V2 V3 rows_matched
1 1 1 1 NA
2 2 2 2 4
3 3 3 3 NA
4 4 4 4 2
5 5 5 5 NA
6 6 6 6 NA
7 7 7 7 NA
8 8 8 8 NA
9 9 9 9 NA
10 10 10 10 NA
Here is another way of solving your problem, using data.table:
library(data.table)
setDT(df1)
setDT(df2)
setnames(df2, "v3", "V3", skip_absent=TRUE) # the join below needs matching column names
df1[, rows_matched := df2[df1, on=.(V1,V2,V3), which=TRUE]]
#
# V1 V2 V3 rows_matched
# 1: 1 1 1 NA
# 2: 2 2 2 4
# 3: 3 3 3 NA
# 4: 4 4 4 2
# 5: 5 5 5 NA
# 6: 6 6 6 NA
# 7: 7 7 7 NA
# 8: 8 8 8 NA
# 9: 9 9 9 NA
# 10: 10 10 10 NA
Another possible solution, based on dplyr::left_join (we first have to rename v3 to V3 in df2):
library(dplyr)
df1 %>%
left_join(df2 %>% mutate(new = row_number()))
#> Joining, by = c("V1", "V2", "V3")
#> V1 V2 V3 new
#> 1 1 1 1 NA
#> 2 2 2 2 4
#> 3 3 3 3 NA
#> 4 4 4 4 2
#> 5 5 5 5 NA
#> 6 6 6 6 NA
#> 7 7 7 7 NA
#> 8 8 8 8 NA
#> 9 9 9 9 NA
#> 10 10 10 10 NA

Create run-length ID for subset of values

In this type of dataframe:
df <- data.frame(
x = c(3,3,1,12,2,2,10,10,10,1,5,5,2,2,17,17)
)
how can I create a new column recording the run-length ID of only a subset of x values, say, 3-20?
My own attempt only succeeds at inserting NA where the run-length count should be interrupted; but internally it seems the count is uninterrupted:
library(dplyr)
library(data.table)
df %>%
mutate(rle = ifelse(x %in% 3:20, rleid(x), NA))
x rle
1 3 1
2 3 1
3 1 NA
4 12 3
5 2 NA
6 2 NA
7 10 5
8 10 5
9 10 5
10 1 NA
11 5 7
12 5 7
13 2 NA
14 2 NA
15 17 9
16 17 9
The expected result:
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
In base R, with rleid taken from data.table:
df[df$x %in% 3:20, "rle"] <- data.table::rleid(df[df$x %in% 3:20, ])
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
With left_join:
library(dplyr)
left_join(df, df %>%
filter(x %in% 3:20) %>%
distinct() %>%
mutate(rle = row_number()))
Joining, by = "x"
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
With data.table:
library(data.table)
setDT(df)
df[x %between% c(3,20),rle:=rleid(x)][]
x rle
<num> <int>
1: 3 1
2: 3 1
3: 1 NA
4: 12 2
5: 2 NA
6: 2 NA
7: 10 3
8: 10 3
9: 10 3
10: 1 NA
11: 5 4
12: 5 4
13: 2 NA
14: 2 NA
15: 17 5
16: 17 5
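The subset-based answers above give the expected result for this data, but they rest on an assumption: taking rleid on the filtered values merges two runs of the same value when only excluded values lie between them (for example 3, 3, 1, 3), and the distinct()-based join numbers values rather than runs. A base R sketch that avoids both, with hypothetical names in_range, run and rle_id:

```r
x <- c(3, 3, 1, 12, 2, 2, 10, 10, 10, 1, 5, 5, 2, 2, 17, 17)

in_range <- x >= 3 & x <= 20

# Run id over the full vector: a new run starts wherever the value changes
run <- cumsum(c(TRUE, x[-1] != x[-length(x)]))

# Renumber only the runs that fall in range; every other row becomes NA
rle_id <- match(run, unique(run[in_range]))

data.frame(x = x, rle = rle_id)
```

Because the runs are identified before filtering, an interruption by an excluded value always starts a new id.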

How to lag multiple specific columns of a data frame in R

I would like to lag multiple specific columns of a data frame in R.
Let's take this generic example. Let's assume I have defined which columns of my dataframe I need to lag:
Lag <- c(0, 1, 0, 1)
Lag.Index <- is.element(Lag, 1)
df <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)
My initial dataframe:
x1 x2 x3 x4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
I would like to compute the following dataframe:
x1 x2 x3 x4
1 1 NA 1 NA
2 2 1 2 1
3 3 2 3 2
4 4 3 4 3
5 5 4 5 4
6 6 5 6 5
7 7 6 7 6
8 8 7 8 7
I know how to do this for a single lagged column, but I have not found an elegant way to do it for multiple lagged columns. Any help is very much appreciated.
You can use purrr's map2_dfc to lag different values by column.
purrr::map2_dfc(df, Lag, dplyr::lag)
# x1 x2 x3 x4
# <int> <int> <int> <int>
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or with data.table:
library(data.table)
setDT(df)[, names(df) := Map(shift, .SD, Lag)]
A data.table option using shift along with Vectorize:
> setDT(df)[, Vectorize(shift)(.SD, Lag)]
x1 x2 x3 x4
[1,] 1 NA 1 NA
[2,] 2 1 2 1
[3,] 3 2 3 2
[4,] 4 3 4 3
[5,] 5 4 5 4
[6,] 6 5 6 5
[7,] 7 6 7 6
[8,] 8 7 8 7
Not sure whether this is elegant enough, but I would use dplyr's mutate_at function to modify the chosen columns:
df %>% dplyr::mutate_at(.vars = vars(x2,x4),.funs = ~lag(., default = NA))
We convert Lag to a logical vector, get the corresponding column names, and use across from dplyr:
library(dplyr)
df %>%
mutate(across(names(.)[as.logical(Lag)], lag))
# x1 x2 x3 x4
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or we can do this in base R
df[as.logical(Lag)] <- rbind(NA, df[-nrow(df), as.logical(Lag)])
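If the lag amounts can differ per column, the same per-column pairing can be sketched in base R with Map. The helper lag_vec is hypothetical (not from any package) and assumes each lag is a non-negative whole number smaller than the number of rows.

```r
# Hypothetical helper: shift x down by n positions, padding the top with NA
lag_vec <- function(x, n) {
  if (n == 0) return(x)
  c(rep(NA, n), head(x, -n))
}

Lag <- c(0, 1, 0, 1)
df <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)

# Pair each column with its own lag; df[] keeps the data.frame structure
df[] <- Map(lag_vec, df, Lag)
df
```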

Subset of columns to dummy variables using last row in case of ties

I have a subset of columns in my dataframe that I would like to convert to dummy variables.
The column with the max value gets a 1 and all other dummy variables get a zero.
If there is a tie, then use the previous row's dummy variable encoding.
library(dplyr)
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*7, TRUE), ncol=7))
df <- DF %>% rename(D1 = X2, D2 = X3, D3 = X4)
head(df)
X1 D1 D2 D3 X5 X6 X7
2 2 2 10 10 1 2
8 4 1 1 8 9 10
3 4 2 3 9 2 9
3 8 1 9 2 4 1
3 1 8 1 5 4 8
1 4 3 7 3 2 6
Below is the desired result, where the D columns (the subset of columns I would like to convert) have been replaced by dummy variables.
X1 D1 D2 D3 X5 X6 X7
2 0 0 1 10 1 2
8 1 0 0 8 9 10
3 1 0 0 9 2 9
3 0 0 1 2 4 1
3 0 1 0 5 4 8
1 0 0 1 3 2 6
I would prefer a data.table solution, but I would be happy with a base R or dplyr solution also.
Here is an option:
library(zoo)
cols <- paste0("V", 2L:4L)
DT[, (cols) := {
#set largest to 1 with first for ties
m <- matrix(0L, .N, length(cols))
m[cbind(seq_len(.N), max.col(.SD, "first"))] <- 1L
#identify rows where the max value is tied and set to NA
idx <- rowSums(as.matrix(.SD) == do.call(pmax, .SD)) > 1L
m[idx, seq_along(cols)] <- NA_integer_
#fill NAs with previous row (if using data.table 1.12.3, there is a nafill function)
#na.rm = FALSE keeps any leading NA rows so the number of rows is unchanged
as.data.table(zoo::na.locf(m, na.rm = FALSE))
}, .SDcols=cols]
tail of output:
V1 V2 V3 V4 V5 V6 V7
1: 5 0 0 1 6 3 2
2: 8 0 0 1 4 6 3
3: 4 0 1 0 7 8 1
4: 8 1 0 0 5 3 2
5: 6 1 0 0 4 10 7
6: 9 1 0 0 5 7 3
7: 3 0 1 0 10 3 2
8: 5 1 0 0 9 8 2
9: 6 0 1 0 10 3 9
10: 6 0 1 0 8 3 9
data:
library(data.table)
set.seed(0L)
DT <- as.data.table(matrix(sample(10, 26746*7, TRUE), ncol=7))
tail of data:
V1 V2 V3 V4 V5 V6 V7
1: 5 1 4 7 6 3 2
2: 8 5 5 3 4 6 3
3: 4 6 8 4 7 8 1
4: 8 7 4 5 5 3 2
5: 6 10 10 9 4 10 7
6: 9 2 4 4 5 7 3
7: 3 2 4 3 10 3 2
8: 5 10 2 1 9 8 2
9: 6 6 8 5 10 3 9
10: 6 5 8 7 8 3 9
Your initial seed didn't work for me - my first row had a tie so my solution didn't work.
library(data.table)
set.seed(46) #changed
dt <- data.table(matrix(sample(10, 23746*7, TRUE), ncol = 7))
setnames(dt, c('V2', 'V3', 'V4'), c('D1', 'D2', 'D3'))
cols <- c('D1', 'D2', 'D3')
max_dt <- dt[, .(ID = 1:.N
,max_col = apply(.SD, 1
, function (x) {
maxes <- which(x == max(x))
ifelse((length(maxes) != 1), NA_integer_, maxes)
}
)
)
, .SDcols = cols]
#The first value could still be NA (if the first row has a tie) - you'd have to determine what you want.
max_dt[, max_col := zoo::na.locf(max_col)]
max_dt
dcast(max_dt, ID ~ max_col, fun.aggregate = length, value.var = 'max_col', fill = 0)
dt[, (cols) := dcast(max_dt, ID ~ max_col, fun.aggregate = length, value.var = 'max_col', fill = 0 )[, -1]]
dt
Before:
V1 D1 D2 D3 V5 V6 V7
1: 2 3 4 6 9 8 2
2: 3 6 2 9 6 3 3
3: 6 4 3 7 6 9 10
4: 4 8 7 10 7 7 10
5: 3 7 10 8 6 1 5
---
23742: 10 6 7 10 7 3 3
23743: 9 4 10 3 4 7 9
23744: 1 9 1 8 10 8 2
23745: 8 6 6 8 7 4 8
23746: 7 3 7 8 8 10 4
After:
V1 D1 D2 D3 V5 V6 V7
1: 2 0 0 1 9 8 2
2: 3 0 0 1 6 3 3
3: 6 0 0 1 6 9 10
4: 4 0 0 1 7 7 10
5: 3 0 1 0 6 1 5
---
23742: 10 0 0 1 7 3 3
23743: 9 0 1 0 4 7 9
23744: 1 1 0 0 10 8 2
23745: 8 0 0 1 7 4 8
23746: 7 0 0 1 8 10 4
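The three steps shared by both answers (find the winning column, blank out rows whose max is tied, carry the previous row's winner forward) can be sketched in base R on a small made-up matrix. This is an illustration under stated assumptions, not either answer's code; the names winner, tied, last_ok and filled are hypothetical, and a tie in the very first row is left as an all-zero row.

```r
# Small illustrative input (not the data from the question)
m <- matrix(c(2, 2, 10,
              4, 1, 1,
              5, 5, 3,
              1, 8, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("D1", "D2", "D3")))

# Step 1: index of the max column per row
winner <- max.col(m, ties.method = "first")

# Step 2: rows where the max value occurs more than once are ties
tied <- rowSums(m == apply(m, 1, max)) > 1
winner[tied] <- NA

# Step 3: carry the last non-NA winner forward (position 0 means "none yet")
last_ok <- cummax(ifelse(is.na(winner), 0L, seq_along(winner)))
filled <- c(NA, winner)[last_ok + 1L]

# Build the dummy matrix: 1 in the winning column, 0 elsewhere
dummies <- matrix(0L, nrow(m), ncol(m), dimnames = dimnames(m))
ok <- !is.na(filled)
dummies[cbind(which(ok), filled[ok])] <- 1L
dummies
```

Row 2 has duplicated values (1, 1) that are not the max, so it is not treated as a tie; row 3 ties on the max and inherits row 2's encoding.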

Resources