R data.table: keep columns after a given column name

set.seed(1)
data <- data.frame(a = sample(-5:5, 20, replace = TRUE),
                   b = sample(-5:5, 20, replace = TRUE),
                   c = sample(-5:5, 20, replace = TRUE),
                   d = sample(-5:5, 20, replace = TRUE))
library(data.table)
setDT(data)
I want to create a vector V containing the names of all columns of 'data' that come after b.
So I want V = c("c", "d"), using a data.table solution.

One way is to use match to find the position of b and select every column after it:
data[, (match('b', names(data)) + 1):ncol(data)]
# c d
# 1: 3 0
# 2: 2 1
# 3: 3 0
# 4: 1 2
# 5: 2 1
# 6: 0 5
# 7: 4 -5
#...

We can use cumsum to create a logical vector and subset the columns. The first cumsum is 1 from b onwards; the second exceeds 1 only for the columns after b:
data[, .SD[, cumsum(cumsum(names(.SD)== 'b'))> 1, with = FALSE]]
# c d
# 1: 3 0
# 2: 2 1
# 3: 3 0
# 4: 1 2
# 5: 2 1
# 6: 0 5
# 7: 4 -5
# 8: 1 -2
# 9: -3 2
#10: 4 3
#11: 0 3
#12: 2 1
#13: -4 -2
#14: -4 1
#15: 0 0
#16: 0 -5
#17: -5 -1
#18: -3 0
#19: -3 -5
#20: 2 3
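Both answers above return the columns themselves, while the question asks for the vector of names. A minimal sketch of one way to get the names, assuming the anchor column exists exactly once (the small table and the helper variable pos are made up for illustration):

```r
library(data.table)

# Small stand-in table with the same column names as in the question.
data <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Position of the anchor column "b" among the column names.
pos <- match("b", names(data))

# Drop the first `pos` names; this gives character(0) if "b" is the last column.
V <- tail(names(data), -pos)
V

# The names can then be used to select the columns themselves.
data[, ..V]
```

Using tail() rather than (pos + 1):ncol(data) avoids the descending sequence that the colon operator produces when b happens to be the last column. If b is missing, match() returns NA and tail() fails, so a check on pos may be worth adding.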

Related

Extract and collapse non-missing elements by row in the data.table

I would like to extract all unique non missing elements in a row and then collapse them using &&&&. Here comes a small example:
#Load needed libraries:
library(data.table)
#Generate the data:
set.seed(1)
n_rows<-10
#Define function to apply to rows:
function_non_missing<-function(x){
x<-x[!is.na(x)]
x<-x[x!="NA"]
x<-unique(x[order(x)])
paste(x,collapse="&&&&")
}
data<-data.table(
a=sample(c(1,2,NA,NA),n_rows,replace=TRUE),
b=sample(c(1,2,NA,NA),n_rows,replace=TRUE),
c=sample(c(1,2,NA,NA),n_rows,replace=TRUE)
)
> data
a b c
1: 1 NA 1
2: NA NA NA
3: NA 1 1
4: 1 1 1
5: 2 1 1
6: 1 2 1
7: NA 2 2
8: NA 2 1
9: 2 2 1
10: 2 NA 2
#Obtain results
data[,paste(.SD),by=1:nrow(data)][,function_non_missing(V1),by=nrow]
nrow V1
1: 1 1
2: 2
3: 3 1
4: 4 1
5: 5 1&&&&2
6: 6 1&&&&2
7: 7 2
8: 8 1&&&&2
9: 9 1&&&&2
10: 10 2
The above code looks very convoluted and I believe there might be better solutions.
Using melt():
data[, row := .I
][, melt(.SD, id.vars = "row")
][order(row, value), paste0(unique(value[!is.na(value)]), collapse = "&&&"), by = row]
row V1
1: 1 1
2: 2
3: 3 1
4: 4 1
5: 5 1&&&2
6: 6 1&&&2
7: 7 2
8: 8 1&&&2
9: 9 1&&&2
10: 10 2
Alternatively, using your original function:
data[, function_non_missing(unlist(.SD)), by = 1:nrow(data)]
nrow V1
1: 1 1
2: 2
3: 3 2
4: 4 1&&&&2
5: 5 1&&&&2
6: 6 1&&&&2
7: 7 1
8: 8 2
9: 9 1&&&&2
10: 10 1&&&&2
Probably using apply?
library(data.table)
data[, col := apply(.SD, 1, function(x)
paste(sort(unique(na.omit(x))), collapse = "&&&"))]
data
# a b c col
# 1: 1 NA 1 1
# 2: NA NA NA
# 3: NA 1 1 1
# 4: 1 1 1 1
# 5: 2 1 1 1&&&2
# 6: 1 2 1 1&&&2
# 7: NA 2 2 2
# 8: NA 2 1 1&&&2
# 9: 2 2 1 1&&&2
#10: 2 NA 2 2
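The apply answer above uses a three-ampersand separator and works on every column in .SD, which would include the row helper column if the melt() answer had been run first. A hedged sketch that keeps the question's "&&&&" separator and restricts the work to the value columns; the data are the ten rows printed in the question, typed in by hand, and collapse_row is a made-up helper name:

```r
library(data.table)

# The ten rows printed in the question, entered manually.
data <- data.table(
  a = c(1, NA, NA, 1, 2, 1, NA, NA, 2, 2),
  b = c(NA, NA, 1, 1, 1, 2, 2, 2, 2, NA),
  c = c(1, NA, 1, 1, 1, 1, 2, 1, 1, 2)
)

# Collapse the sorted, unique values of one row; sort() drops NA by default.
collapse_row <- function(x) paste(sort(unique(x)), collapse = "&&&&")

# .SDcols limits the row-wise work to the value columns only.
data[, col := apply(.SD, 1, collapse_row), .SDcols = c("a", "b", "c")]
data$col
```

Note that apply() converts .SD to a matrix first, so this is convenient rather than fast on very wide or very long tables.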

Tracking the first incidence of each episode

I am currently using R to process a data set that looks like the following:
age ep
1 0
2 0
3 1
4 1
5 1
6 1
7 0
8 0
9 1
10 1
11 0
I want to create a variable that will keep track of the first occurrence of ep=1 per series of ep=1. These series will have ep=0 prior to the first ep=1 and ep=0 following the last ep=1 of each series.
I would like the data set to look like this after processing:
age ep first
1 0 NA
2 0 NA
3 1 1
4 1 NA
5 1 NA
6 1 NA
7 0 NA
8 0 NA
9 1 1
10 1 NA
11 0 NA
I am working in data.table because the data set is rather large, so I'd prefer a data.table solution. If that isn't possible, I can convert to a data frame and use other code. Any assistance would be greatly appreciated.
A fast data.table method ...
library(data.table)
dt <- fread("age ep
1 0
2 0
3 1
4 1
5 1
6 1
7 0
8 0
9 1
10 1
11 0")
dt[!shift(ep) & ep, first := 1]
# or more explicit:
dt[shift(ep) != 1 & ep == 1, first := 1]
dt
# age ep first
# 1: 1 0 NA
# 2: 2 0 NA
# 3: 3 1 1
# 4: 4 1 NA
# 5: 5 1 NA
# 6: 6 1 NA
# 7: 7 0 NA
# 8: 8 0 NA
# 9: 9 1 1
# 10: 10 1 NA
# 11: 11 0 NA
Note: if your object is not already a data.table, you can convert it by reference with:
setDT(dt)
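One edge case the two shift() conditions above do not cover: shift() returns NA for the first row, and an NA in i is treated as FALSE, so an episode that starts in row 1 is not flagged. A small sketch of one possible fix using the fill argument of shift(); the six-row series is hypothetical and not from the question:

```r
library(data.table)

# Hypothetical series that starts inside an episode (ep == 1 in row 1).
dt <- data.table(age = 1:6, ep = c(1L, 1L, 0L, 1L, 1L, 0L))

# fill = 0L makes the "previous" value of the first row 0 instead of NA,
# so an episode that begins in row 1 is still detected.
dt[shift(ep, fill = 0L) == 0L & ep == 1L, first := 1L]
dt$first
```

Whether a series that begins with ep == 1 should count as a first occurrence depends on the data; the question's definition requires a preceding ep == 0, in which case the original conditions are already correct.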
Another option using an update join
dt[, first := dt[dt[, .I[1], by=rleid(ep)]$V1][ep == 1][dt, on=.(age), ep]]
dt
# age ep first
# 1: 1 0 NA
# 2: 2 0 NA
# 3: 3 1 1
# 4: 4 1 NA
# 5: 5 1 NA
# 6: 6 1 NA
# 7: 7 0 NA
# 8: 8 0 NA
# 9: 9 1 1
#10: 10 1 NA
#11: 11 0 NA
Using data provided by @Khaynes
An approach using fifelse
dt[, first := fifelse( ep == 1 & shift( ep , type = "lag" ) == 0L, 1L, NA_integer_) ]
dt
# age ep first
# 1: 1 0 NA
# 2: 2 0 NA
# 3: 3 1 1
# 4: 4 1 NA
# 5: 5 1 NA
# 6: 6 1 NA
# 7: 7 0 NA
# 8: 8 0 NA
# 9: 9 1 1
# 10: 10 1 NA
# 11: 11 0 NA
Another update join version, using mult="first" to only overwrite the first matching row in the group:
dt[, rid := rleid(ep)][dt[ep==1], on=.(rid), mult="first", first := 1]
dt
# age ep rid first
# 1: 1 0 1 NA
# 2: 2 0 1 NA
# 3: 3 1 2 1
# 4: 4 1 2 NA
# 5: 5 1 2 NA
# 6: 6 1 2 NA
# 7: 7 0 3 NA
# 8: 8 0 3 NA
# 9: 9 1 4 1
#10: 10 1 4 NA
#11: 11 0 5 NA

Assign id to a new variable ref from each odd-numbered zero in x up to the next zero

x<-c(0,0,1,1,0,1,1,1,0,1,1,0,1,1)
aaa<-data.frame(x)
aaa$id<-1:nrow(aaa)
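Counting the zeros in x from the top, whenever the count is odd (the 1st, 3rd, 5th, ... zero), how can I copy id into a new variable ref for every row from that zero up to, but not including, the next zero?
In base R, the result can be produced by testing whether the running count of zeros is odd: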
aaa$ref <- with(aaa, ifelse(cumsum(x == 0) %% 2, id, NA))
aaa
# x id ref
# 1 0 1 1
# 2 0 2 NA
# 3 1 3 NA
# 4 1 4 NA
# 5 0 5 5
# 6 1 6 6
# 7 1 7 7
# 8 1 8 8
# 9 0 9 NA
# 10 1 10 NA
# 11 1 11 NA
# 12 0 12 12
# 13 1 13 13
# 14 1 14 14
An option using data.table
library(data.table)
i1 <- setDT(aaa)[, grp := rleid(x)][, .I[seq_len(.N) == .N & x==0], grp]$V1
i2 <- unlist(lapply(split(i1, as.integer(gl(length(i1), 2,
length(i1)))), function(x) head(x[1]:x[2],-1)))
aaa[!i2, ref := id][, grp := NULL][]
# x id ref
# 1: 0 1 1
# 2: 0 2 NA
# 3: 1 3 NA
# 4: 1 4 NA
# 5: 0 5 5
# 6: 1 6 6
# 7: 1 7 7
# 8: 1 8 8
# 9: 0 9 NA
#10: 1 10 NA
#11: 1 11 NA
#12: 0 12 12
#13: 1 13 13
#14: 1 14 14
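The running-count idea from the base R answer carries over to data.table directly. A minimal sketch, assuming x contains only 0s and 1s as in the question:

```r
library(data.table)

aaa <- data.table(x = c(0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1))
aaa[, id := .I]

# Running count of zeros; rows where the count is odd keep their id.
aaa[, ref := fifelse(cumsum(x == 0) %% 2 == 1, id, NA_integer_)]
aaa$ref
```

fifelse() requires both branches to have the same type, which is why NA_integer_ is used alongside the integer id.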

How to efficiently find last preceding row with nonzero value in R data.table

Introduction
I am trying to find the most efficient way to locate the last preceding row with a nonzero value in a given column and return the value of a different column on that row. I want to do it in R data.table, and I am looking for maximum efficiency.
Example
Let's have a data.table like so:
library(data.table)
set.seed(123)
DT = data.table(x = rep(c("b","a","c"), each = 6),
                y = rep(1:6, 3),
                z = rbinom(18, 1, 0.3))
That gives us the following data table:
x y z
1: b 1 0
2: b 2 1
3: b 3 0
4: b 4 1
5: b 5 1
6: b 6 0
7: a 1 0
8: a 2 1
9: a 3 0
10: a 4 0
11: a 5 1
12: a 6 0
13: c 1 0
14: c 2 0
15: c 3 0
16: c 4 1
17: c 5 0
18: c 6 0
Within each value of x, the table is ordered by y. For each group defined by x, I would like to create a column that gives, for each row, the value of y from the most recent row (at or before the current one) with a nonzero z.
Right now I am using lapply over y, grouped by x, which gives the desired result:
DT[, list(y,
z,
output = lapply(y, function(x) max(y[z != 0 & y <= x]))
),
by = 'x']
The question
Can I make the code from the example more efficient?
You might try using nafill:
# create a dummy column that is only populated for nonzero z (and hence NA elsewhere)
DT[z != 0, y_copy := y]
# nafill on this column using LOCF strategy by group:
DT[ , ans := nafill(y_copy, type = 'locf'), by = x][]
# x y z y_copy ans
# 1: b 1 0 NA NA
# 2: b 2 1 2 2
# 3: b 3 0 NA 2
# 4: b 4 1 4 4
# 5: b 5 1 5 5
# 6: b 6 0 NA 5
# 7: a 1 0 NA NA
# 8: a 2 1 2 2
# 9: a 3 0 NA 2
# 10: a 4 0 NA 2
# 11: a 5 1 5 5
# 12: a 6 0 NA 5
# 13: c 1 0 NA NA
# 14: c 2 0 NA NA
# 15: c 3 0 NA NA
# 16: c 4 1 4 4
# 17: c 5 0 NA 4
# 18: c 6 0 NA 4
nafill requires data.table 1.12.4 or later. When this answer was written it was available only in the development version (1.12.3+), which could be installed with install.packages('data.table', type = 'source', repos = 'http://Rdatatable.github.io/data.table')
If you don't want to create y_copy, you could do this inline with is.na<-:
DT[ , ans := nafill(`is.na<-`(y, z == 0), type = 'locf'), by = x]
This is less efficient because z == 0 is computed once per group instead of once for the whole vector. You could compute it up front instead:
DT[ , z_zero := z == 0]
But this means another dummy column (one that takes less storage than y_copy if y is numeric, character, or complex).
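A possible middle ground, sketched here rather than taken from the answer: blank out y in one ungrouped pass and write the result straight into the answer column, then fill forward by group, so no extra helper column remains. The z values are those printed in the question, typed in by hand:

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 6),
                 y = rep(1:6, 3),
                 z = c(0, 1, 0, 1, 1, 0,  0, 1, 0, 0, 1, 0,  0, 0, 0, 1, 0, 0))

# Blank out y where z is zero, in one vectorised pass over the whole table ...
DT[, ans := fifelse(z != 0, y, NA_integer_)]

# ... then carry the last non-missing value forward within each group.
DT[, ans := nafill(ans, type = "locf"), by = x]
DT$ans
```

The comparison z != 0 is evaluated once for the whole column, and the only per-group work left is the nafill() call.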
Another option using rolling join:
DT[, output:= DT[z==1][.SD, on=.(x, y), roll=Inf, x.y]]
output:
x y z output
1: b 1 0 NA
2: b 2 1 2
3: b 3 0 2
4: b 4 1 4
5: b 5 1 5
6: b 6 0 5
7: a 1 0 NA
8: a 2 1 2
9: a 3 0 2
10: a 4 0 2
11: a 5 1 5
12: a 6 0 5
13: c 1 0 NA
14: c 2 0 NA
15: c 3 0 NA
16: c 4 1 4
17: c 5 0 4
18: c 6 0 4
An option with non-equi join
library(data.table)
library(zoo)
DT[DT[z!=0, .(y1 = y, x)], output := y1, on = .(x, y <= y1),
mult = 'last'][, output := na.locf0(output), x]
DT
# x y z output
# 1: b 1 0 NA
# 2: b 2 1 2
# 3: b 3 0 2
# 4: b 4 1 4
# 5: b 5 1 5
# 6: b 6 0 5
# 7: a 1 0 NA
# 8: a 2 1 2
# 9: a 3 0 2
#10: a 4 0 2
#11: a 5 1 5
#12: a 6 0 5
#13: c 1 0 NA
#14: c 2 0 NA
#15: c 3 0 NA
#16: c 4 1 4
#17: c 5 0 4
#18: c 6 0 4

data.table: Select n specific rows before & after other rows meeting a condition

Given the following example data table:
library(data.table)
DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")
I want to select
1. by group grp
2. all rows that have y==5
3. and up to two rows before and after each row from 2. within the grouping,
4. but for 3., only those rows that have exclude==0.
Assuming each group has max one row with y==5, this would yield the desired result for 1.-3.:
idx <- -2:2 # 2 rows before match, the matching row itself, and two rows after match
(row_numbers <- DT[,.I[{
x <- rep(which(y==5),each=length(idx))+idx
x[x>0 & x<=.N]
}], by=grp]$V1)
# [1] 3 4 5 6 7 12 13 14 15 16 20
DT[row_numbers]
# grp y exclude
# 1: a 3 0
# 2: a 4 1
# 3: a 5 0 # y==5 + two rows before and two rows after
# 4: a 7 1
# 5: a 8 0
# 6: b 3 0
# 7: b 4 1
# 8: b 5 0 # y==5 + two rows before and two rows after
# 9: b 6 1
# 10: b 7 1
# 11: c 5 1 # y==5 + nothing, because the group has only 1 element
However, how do I incorporate 4. so that I get
# grp y exclude
# 1: a 2 0
# 2: a 3 0
# 3: a 5 0
# 4: a 8 0
# 5: a 9 0
# 6: b 2 0
# 7: b 3 0
# 8: b 5 0
# 9: b 8 0
# 10: b 9 0
# 11: c 5 1
? Feels like I'm close, but I guess I looked too long at heads and whiches, now, so I'd be thankful for some fresh ideas.
A bit more simplified:
DT[DT[, rn := .I][exclude==0 | y==5][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1]
# grp y exclude rn
#1: a 2 0 2
#2: a 3 0 3
#3: a 5 0 5
#4: a 8 0 7
#5: a 9 0 8
#6: b 2 0 11
#7: b 3 0 12
#8: b 5 0 14
#9: b 8 0 17
#10: b 9 0 18
#11: c 5 1 20
You are very close. This should do it:
row_numbers <- DT[exclude==0 | y==5, .I[{
x <- rep(which(y==5), each=length(idx)) + idx
x[x>0 & x<=.N]
}], by=grp]$V1
DT[row_numbers]
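Both answers rely on the stated assumption of at most one y==5 row per group. A hedged sketch of a variant that also tolerates several matches in a group, by carrying an explicit row-number column (rn, as in the first answer) through the filter and testing each position against every match; the variable names pos and near are made up for illustration:

```r
library(data.table)

DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")

DT[, rn := .I]  # remember the original row numbers before filtering

row_numbers <- DT[exclude == 0 | y == 5, {
  pos  <- which(y == 5)  # positions of the matches inside the filtered group
  # keep a position if it lies within two rows of any match
  near <- vapply(seq_len(.N), function(i) any(abs(i - pos) <= 2), logical(1))
  .(rn = rn[near])
}, by = grp]$rn

DT[row_numbers]
```

On the example data this selects the same rows as the answers above. The vapply() loop is linear in group size times the number of matches, so it is meant as a readable sketch rather than the fastest option.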
