Using data.table to select rows by distance from another row

Let's say I have the following data.table.
library(data.table)
DT <- data.table(x=1:6, y=c(0,0,1,0,0,0))
Could I write some command DT[...] that selects all the rows within 2 rows of the one in which y=1? That is, using proximity to row three, I want to select rows 1-5.

Here is one option: loop over the position indices (which(y == 1)) with sapply, create a sequence around each index by adding/subtracting 2, take the unique elements (in case of overlapping windows), and use that as the i to subset the rows:
library(data.table)
DT[unique(sapply(which(y==1), function(i) (i-2):(i + 2)))]
Output:
# x y
#1: 1 0
#2: 2 0
#3: 3 1
#4: 4 0
#5: 5 0
If there are negative (or zero) indices, we can drop them before subsetting; using unlist instead of sapply also keeps this correct when there are multiple matches:
i1 <- DT[, unique(unlist(lapply(which(y == 1), function(i) (i - 2):(i + 2))))]
DT[i1[i1 > 0]]
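The same guard applies at the top end. A small sketch (my addition, reusing i1 from above) that also drops indices running past the last row:
DT[i1[i1 >= 1 & i1 <= nrow(DT)]]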

We can use rolling operations with different alignments to check, for each row, whether y equals 1 anywhere within a window of size 3:
library(data.table)
library(zoo)
DT[rollapplyr(y == 1, 3, any, fill = FALSE) |
   rollapply(y == 1, 3, any, fill = FALSE, align = 'left')]
# x y
#1: 1 0
#2: 2 0
#3: 3 1
#4: 4 0
#5: 5 0
rollapplyr is the same as rollapply(..., align = 'right').
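A zoo-free sketch of the same idea using data.table's shift() (my addition, not from the answers above): build lagged and led copies of y == 1 and flag rows where any of them is TRUE.
library(data.table)
DT <- data.table(x = 1:6, y = c(0, 0, 1, 0, 0, 0))
hit <- DT$y == 1
w <- c(list(hit),                                    # the row itself
       shift(hit, 1:2, fill = FALSE, type = "lag"),  # TRUE if a 1 sits 1-2 rows above
       shift(hit, 1:2, fill = FALSE, type = "lead")) # TRUE if a 1 sits 1-2 rows below
DT[Reduce(`|`, w)]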

Related

R First Row By Group When Condition Is Met

dataHAVE = data.frame(STUDENT = c(1,1,1,2,2,2,3,3,3),
                      SCORE = c(0,1,1,5,1,2,1,1,1),
                      CAT = c(3,10,7,4,5,0,4,5,1),
                      FOX = c(5,0,10,8,9,1,8,9,0))
dataWANT = data.frame(STUDENT = c(1,2,3),
                      SCORE = c(1,1,1),
                      CAT = c(10,5,4),
                      FOX = c(0,9,8))
I have 'dataHAVE' and want 'dataWANT', which takes the first row for every 'STUDENT' where 'SCORE' equals 1. I am seeking a data.table solution because the data are large. I tried this, but do not know how to set the criterion for 'SCORE':
dataWANT[,.SD[1],by = key(STUDENT)]
Convert the 'data.frame' to 'data.table' (setDT); then, grouped by 'STUDENT', specify the logical condition in i, get the index of the first row (.I[1]), extract that column ($V1) and subset the rows:
library(data.table)
setDT(dataHAVE)[dataHAVE[SCORE == 1, .I[1], STUDENT]$V1]
.I returns the row index. If we don't have a grouping column, it returns a vector, i.e.
setDT(dataHAVE)[SCORE == 1, .I]
#[1] 1 2 3 4 5 6
When we provide a grouping column, .I is by default returned in a column named V1 (we can override this by naming it):
setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]
# STUDENT colindex
#1: 1 2
#2: 2 5
#3: 3 7
Now we have two columns, 'STUDENT' and 'colindex'. We are specifically interested in 'colindex', so extract it with standard tools ($ or [[) and use it as the row index in i:
i1 <- setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]$colindex
i1
#[1] 2 5 7
We then use this for subsetting:
dataHAVE[i1]
Here is a base R option using subset + ave
subset(
  dataHAVE,
  ave(SCORE == 1, STUDENT, FUN = function(x) seq_along(x) == min(which(x)))
)
which gives
STUDENT SCORE CAT FOX
2 1 1 10 0
5 2 1 5 9
7 3 1 4 8
Solution 1. There is a straightforward solution in two lines:
dataWANT <- dataHAVE[dataHAVE$SCORE == 1,] #Filter score equals to 1
dataWANT <- dataWANT[!duplicated(dataWANT$STUDENT), ] #Remove duplicated students
Solution 2. However, if you prefer to solve in one line:
dataWANT <- dataHAVE[!duplicated(paste0(dataHAVE$STUDENT, dataHAVE$SCORE)) & dataHAVE$SCORE ==1, ]
This creates a logical vector marking the 'STUDENT'/'SCORE' combinations that have not appeared earlier, and combines it with a test for 'SCORE' equal to 1.
You could use match to get the first row where SCORE == 1 for each STUDENT.
library(data.table)
setDT(dataHAVE)
dataHAVE[, .SD[match(1, SCORE)], STUDENT]
# STUDENT SCORE CAT FOX
#1: 1 1 10 0
#2: 2 1 5 9
#3: 3 1 4 8
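For completeness, a terser variant of the same idea (my sketch, not one of the answers above): filter in i first, then keep the first matching row per group with head(.SD, 1).
library(data.table)
setDT(dataHAVE)
# rows with SCORE == 1 only, then the first one per STUDENT
dataHAVE[SCORE == 1, head(.SD, 1), by = STUDENT]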

Take first non-0 value or last 0 value if that's all there is

Hi,
Here is my reproducible example.
HAVE <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                   ABSENCE = c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
                   TIME = c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
WANT <- data.frame(ID = c(1,2,3,4,5,6),
                   ABSENCE = c(NA,0,1,0,1,0),
                   TIME = c(NA,3,3,2,2,3))
The tall data file HAVE is the one I need to convert to WANT. Essentially, for each ID I need to identify the first non-zero value of ABSENCE, and that value goes into WANT. If all values of ABSENCE are NA, then TIME is NA. If all values of ABSENCE are 0, then I report the last row (as reflected in the TIME variable).
This is my attempt:
WANT <- group_by(HAVE,ID) %>% slice(seq_len(min(which(ABSENCE > 0), n())))
but I do not know how to take the last of the 0 rows if there are only 0s.
library(data.table)
setDT(HAVE)
res = unique(HAVE[, .(ID)])
# look up first ABSENCE > 0
res[, c("ABSENCE", "TIME") := unique(HAVE[ABSENCE > 0], by="ID")[.SD, on=.(ID), .(ABSENCE, TIME)]]
# if nothing was found, look up last ABSENCE == 0
res[is.na(ABSENCE), c("ABSENCE", "TIME") := unique(HAVE[ABSENCE == 0], by="ID", fromLast=TRUE)[.SD, on=.(ID), .(ABSENCE, TIME)]]
# check
all.equal(as.data.frame(res), WANT)
# [1] TRUE
res
   ID ABSENCE TIME
1: 1 NA NA
2: 2 0 3
3: 3 1 3
4: 4 0 2
5: 5 1 2
6: 6 0 3
I'm using data.table since the tidyverse does not and never will support sub-assignment / modifying only rows selected by a condition (like the is.na(ABSENCE) here).
If the two rules could be made more consistent with each other, this should be doable in a left join or in a single group_by + slice as the OP attempted. Okay, here's one way, though it looks impossible to debug:
HAVE %>%
  arrange(ID, -(ABSENCE > 0), TIME*(ABSENCE > 0), -TIME) %>%
  distinct(ID, .keep_all = TRUE)
ID ABSENCE TIME
1 1 NA 3
2 2 0 3
3 3 1 3
4 4 0 2
5 5 1 2
6 6 0 3
Using data.table as well, based on subsetting the .I row counter:
WANT <- HAVE[
  HAVE[,
    # all NA -> keep the first row (TIME is blanked afterwards)
    if (all(is.na(ABSENCE))) .I[1] else
    # no positive value -> keep the last row where ABSENCE == 0
    if (!any(ABSENCE > 0, na.rm = TRUE)) max(.I[ABSENCE == 0], na.rm = TRUE) else
    # otherwise -> keep the first row where ABSENCE > 0
    min(.I[ABSENCE > 0], na.rm = TRUE),
    by = ID
  ]$V1,
]
WANT[is.na(ABSENCE), TIME := NA_real_]
# ID ABSENCE TIME
#1: 1 NA NA
#2: 2 0 3
#3: 3 1 3
#4: 4 0 2
#5: 5 1 2
#6: 6 0 3
Here are two approaches using dplyr and custom functions. Both rely on the data being sorted by TIME.
Filter Approach
# We'll use this function inside filter() to keep only the desired rows
flag_wanted <- function(absence){
  flags <- rep(FALSE, length(absence))
  if (any(absence > 0, na.rm = TRUE)) {
    # There's a nonzero value somewhere in absence; we want the first one.
    flags[which.max(absence > 0)] <- TRUE
  } else if (any(absence == 0, na.rm = TRUE)) {
    # There's a zero value somewhere in absence; we want the last one.
    flags[max(which(absence == 0))] <- TRUE
  } else {
    # All values are NA; we want the last row.
    flags[length(absence)] <- TRUE
  }
  return(flags)
}
# After filtering, we have to flip TIME to NA if ABSENCE is NA
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  filter(flag_wanted(ABSENCE)) %>%
  mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME)) %>%
  ungroup()
# A tibble: 6 x 3
ID ABSENCE TIME
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 0. 3.
3 3. 1. 3.
4 4. 0. 2.
5 5. 1. 2.
6 6. 0. 3.
The filter() step reduces the dataframe to the rows you need. Since it doesn't modify the TIME values, we need to mutate() as well.
Summarize Approach
# This function captures the general logic of getting the value of one variable
# based on the value of another
get_wanted <- function(of_this, by_this){
  # If there are any positive values of `by_this`, use the first
  if (any(by_this > 0, na.rm = TRUE)) {
    return(of_this[which.max(by_this > 0)])
  }
  # If there are any zero values of `by_this`, use the last
  if (any(by_this == 0, na.rm = TRUE)) {
    return(of_this[max(which(by_this == 0))])
  }
  # Otherwise, use NA
  return(NA)
}
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  summarize(TIME = get_wanted(of_this = TIME, by_this = ABSENCE),
            ABSENCE = get_wanted(of_this = ABSENCE, by_this = ABSENCE))
# A tibble: 6 x 3
ID TIME ABSENCE
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 3. 0.
3 3. 3. 1.
4 4. 2. 0.
5 5. 2. 1.
6 6. 3. 0.
The order of summarization matters because we're overwriting variables, so this approach is risky. It only produces the output WANT if you summarize TIME and then ABSENCE.
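One way to sidestep that risk (a sketch of my own, reusing the get_wanted() helper above): summarize into temporary names so the order no longer matters, then rename.
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  summarize(TIME_out = get_wanted(of_this = TIME, by_this = ABSENCE),
            ABSENCE_out = get_wanted(of_this = ABSENCE, by_this = ABSENCE)) %>%
  rename(TIME = TIME_out, ABSENCE = ABSENCE_out)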

calculated columns in new datatable without altering the original

I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10),
  e = sample(c("x","y"), 10, replace = TRUE),
  f = sample(c("t","s"), 10, replace = TRUE)
)
I need (for example) a count of negative values in columns 1:4 for each value of e and f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) { # these are the *by* columns
  for (i in 1:4) { # these are the columns whose negative values I'm counting
    n = paste("neg", names(dt[, i, with = F]), "count", "by", names(dt[, k, with = F]), sep = "_")
    dt[dt[[i]] < 0, (n) := .N, by = names(dt[, k, with = F])]
  }
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
and these then need to be combined with rbind to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient, there should be a better way, especially since my dataset has several e and f-like columns.
If there are only two grouping columns, we can do an rbindlist after grouping by each of them separately:
rbindlist(list(dt[, lapply(.SD, function(x) sum(x < 0)), .(e), .SDcols = a:d],
               dt[, lapply(.SD, function(x) sum(x < 0)), .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x)
  dt[, lapply(.SD, function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars = cols)[,
  lapply(.SD, function(x) sum(x < 0)), by = value, .SDcols = cols]
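Another sketch (my addition; assumes a data.table version that provides groupingsets()): compute both groupings in a single call, with the non-grouping column returned as NA in each block.
groupingsets(dt,
             j = lapply(.SD, function(x) sum(x < 0)),
             by = c("e", "f"),
             sets = list("e", "f"),
             .SDcols = c("a", "b", "c", "d"))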

R, conditional summing of every second cell in each row

I have a data frame and want, for each row, the sum of every second cell (beginning with the second cell) whose left neighbor is greater than zero. Here's an example:
a <- c(-2,1,1,-2)
b <- c(1,2,3,4)
c <- c(-2,1,-1,2)
d <- c(5,6,7,8)
df <- data.frame(a,b,c,d)
This gives:
> df
a b c d
1 -2 1 -2 5
2 1 2 1 6
3 1 3 -1 7
4 -2 4 2 8
For the first row the correct sum is 0 (the left neighbor of 1 is -2 and the left neighbor of 5 is also -2); for the second it's 8; for the third it's 3; for the fourth it's again 8.
I want to do it without loops, so I tried it with sum() and which() like in Conditional Sum in R, but could not find a way through.
We subset the dataset to alternating columns using the recycling vector c(TRUE, FALSE), which picks the 1st, 3rd, etc. columns, and convert that to a logical matrix by checking whether the values are greater than 0 (> 0). We then multiply this by the other alternating subset, i.e. the 2nd, 4th, etc. columns, selected with c(FALSE, TRUE). The idea is that wherever the left column is not greater than 0, the logical matrix is FALSE, which is coerced to 0 by the multiplication. Finally, rowSums gives the expected output:
rowSums((df[c(TRUE, FALSE)]>0)*df[c(FALSE, TRUE)])
#[1] 0 8 3 8
The recycling vectors can also be replaced with seq:
rowSums((df[seq(1, ncol(df), by = 2)]>0)*df[seq(2, ncol(df), by = 2)])
#[1] 0 8 3 8
Or another option is Reduce with Map
Reduce(`+`, Map(`*`, lapply(df[c(TRUE, FALSE)], `>`, 0), df[c(FALSE, TRUE)]))
#[1] 0 8 3 8

calculate median from data.table columns in R

I am trying to calculate a median value across a number of columns; however, my data is a bit funky. It looks like the following example.
library(data.table)
dt <- data.table("ID" = c(1,2,3,4), "none" = c(0,5,5,3),
                 "ten" = c(3,2,5,4), "twenty" = c(0,2,3,1))
ID none ten twenty
1: 1 0 3 0
2: 2 5 2 2
3: 3 5 5 3
4: 4 3 4 1
In this table, each column name stands for a value (none = 0, ten = 10, twenty = 20) and each cell holds the number of occurrences of that value. I want to calculate the median occurrence.
For example for ID = 1
median(c(10, 10, 10))
is the calculation I am wanting to create.
for ID = 2
median(c(0, 0, 0, 0, 0, 10, 10, 20, 20))
I have tried using rep() and lapply() with very limited success, and am after some clear guidance on how this might be achieved. I understand that with rep() I would have to hard-code the value to be repeated (e.g. rep(0, 2) or rep(10, 2)), and this is what I expect. I am just struggling to create a list or vector with the repetitions from each column.
Here's another data.table way (assuming unique ID):
dt[, median(rep(c(0, 10, 20), c(none, ten, twenty))), by=ID]
# ID V1
# 1: 1 10
# 2: 2 0
# 3: 3 10
# 4: 4 10
This is just an attempt to get #eddi's answer without reshaping (which I tend to use as a last resort).
You need a dictionary to translate column names to corresponding numbers, and then it's fairly straightforward:
dict = data.table(name = c('none', 'ten', 'twenty'), number = c(0, 10, 20))
melt(dt, id.var = 'ID')[
  dict, on = c(variable = 'name')][, median(rep(number, value)), by = ID]
# ID V1
#1: 1 10
#2: 2 0
#3: 3 10
#4: 4 10
Here's a way that avoids by-row operations and reshaping:
dt[, m := {
  # cumulative counts across the columns, left to right
  cSD = Reduce(`+`, .SD, accumulate = TRUE)
  # position of the median within each row's total count
  k = floor(cSD[[length(.SD)]]/2)
  m = integer(.N)
  # the first column whose cumulative count reaches the median position wins
  for(i in seq_along(cSD)) {
    left = m == 0L
    if(!any(left)) break
    m[left] = i * (cSD[[i]][left] >= k[left])
  }
  names(.SD)[m]
}, .SDcols = none:twenty]
which gives
ID none ten twenty m
1: 1 0 3 0 ten
2: 2 5 2 2 none
3: 3 5 5 3 ten
4: 4 3 4 1 ten
For the loop, I'm borrowing #alexis_laz' style, e.g. https://stackoverflow.com/a/30513197/
I've skipped translation of the column names, but that's pretty straightforward. You could use c(0,10,20) instead of names(.SD) at the end.
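For instance, the translation could be done afterwards along these lines (my sketch):
# map the winning column name in m back to the value it represents
dt[, med := c(0, 10, 20)[match(m, c("none", "ten", "twenty"))]]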
Here is a rowwise dplyr way:
dt %>% rowwise %>%
  do(med = median(c(rep(0, .$none), rep(10, .$ten), rep(20, .$twenty)))) %>%
  as.data.frame
med
1 10
2 0
3 10
4 10
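do() is superseded in current dplyr; a rough modern equivalent (my sketch, same logic via rowwise() + mutate()):
dt %>%
  rowwise() %>%
  mutate(med = median(rep(c(0, 10, 20), c(none, ten, twenty)))) %>%
  ungroup()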
Inspired by #Arun's answer, this also works:
dt %>% group_by(ID) %>%
  summarise(med = median(rep(c(0, 10, 20), c(none, ten, twenty))))
Source: local data table [4 x 2]
ID med
(dbl) (dbl)
1 1 10
2 2 0
3 3 10
4 4 10
