Replace row values in data.table using 'by' and conditions - r

I am trying to replace certain row values in a column according to conditions in another column, within a grouping.
EDIT: edited to highlight the recursive nature of the problem.
E.g.
DT = data.table(y = rep(c(1, 3), each = 3),
                v = as.numeric(c(1, 2, 4, 4, 5, 8)),
                x = as.numeric(rep(9:11, each = 2)),
                key = c("y", "v"))
DT
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 10
4: 3 4 10
5: 3 5 11
6: 3 8 11
Within each 'y', I then want to replace the value of 'x' with 2222 (or, in reality, the result of a function) on rows whose 'v' lies t above another observation in the group (e.g. t = 3), giving the following result:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 10
5: 3 5 11
6: 3 8 2222
I have tried the following, but to no avail.
DT[which((v-3) %in% v), x:= 2222, y][]
And it mysteriously (?) results in:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 2222
5: 3 5 2222
6: 3 8 2222
Running:
DT[,print(which((v-3) %in% v)), by =y]
indicates that the indexing within the groups is correct, but I don't understand what happens (or fails to happen) from there.
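For what it's worth, the reason is that the i expression is evaluated against the whole table before by comes into play, so the match crosses group boundaries. A sketch, rebuilding DT from above:

```r
library(data.table)

# Rebuild DT from the question
DT <- data.table(y = rep(c(1, 3), each = 3),
                 v = as.numeric(c(1, 2, 4, 4, 5, 8)),
                 x = as.numeric(rep(9:11, each = 2)),
                 key = c("y", "v"))

# Evaluated over the full column, v - 3 is c(-2, -1, 1, 1, 2, 5), and the
# %in% test matches v values from *both* groups, flagging rows 3 to 6.
which((DT$v - 3) %in% DT$v)
#> [1] 3 4 5 6
```

Those four rows are exactly the ones that end up as 2222 in the "mysterious" output above.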

You could try using replace (which may have some overhead, because it copies the whole of x):
DT[, x:=replace(x, which(v %in% (v+3)), 2222), by=y]
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Alternatively, you could create a logical index column and then do the assignment in the next step
DT[,indx:=v %in% (v+3), by=y][(indx), x:=2222, by=y][, indx:=NULL]
DT
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Or, slightly modifying your own approach, use .I to create an index:
indx <- DT[, .I[which((v-3) %in% v)], by = y]$V1
DT[indx, x := 2222]
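As a quick sanity check (a sketch, rebuilding DT from the question), the grouped .I index flags only rows 3 and 6:

```r
library(data.table)
DT <- data.table(y = rep(c(1, 3), each = 3),
                 v = as.numeric(c(1, 2, 4, 4, 5, 8)),
                 x = as.numeric(rep(9:11, each = 2)),
                 key = c("y", "v"))

# .I holds the global row numbers; which(...) is now evaluated per group.
indx <- DT[, .I[which((v - 3) %in% v)], by = y]$V1
indx
#> [1] 3 6
DT[indx, x := 2222]
DT$x
#> [1] 9 9 2222 10 11 2222
```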

Related

R data table: Assign a value to column based on reference column

I would like to assign a value into a column from a larger table, using another column as a reference.
E.g. data:
require(data.table)
dt <- data.table(N=c(1:5),GPa1=c(sample(0:5,5)),GPa2=c(sample(5:15,5)),
GPb1=c(sample(0:20,5)),GPb2=c(sample(0:10,5)),id=c("b","a","b","b","a"))
N GPa1 GPa2 GPb1 GPb2 id
1: 1 4 10 7 0 b
2: 2 5 15 19 7 a
3: 3 1 5 20 5 b
4: 4 0 13 3 4 b
5: 5 3 7 8 1 a
The idea is to get new columns Val1 and Val2. Any GP column ending in 1 is eligible for Val1, and any ending in 2 is eligible for Val2. The value to be inserted into each new column is determined, per row, by the id column.
So you can see for Val1, you'd draw on the GPb1 column, then GPa1, GPb1, GPb1 again and finally GPa1.
The final result would be;
N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
1: 1 4 10 7 0 b 7 0
2: 2 5 15 19 7 a 5 15
3: 3 1 5 20 5 b 20 5
4: 4 0 13 3 4 b 3 4
5: 5 3 7 8 1 a 3 7
I did get the answer, but it took quite a few lines after melting etc., and I'm sure there must be a more elegant way to do this in data.table. I was initially frustrated by the fact that paste0 doesn't work directly inside data.table:
dt[1,paste0("GP",id,"1")]
but;
# The following gives a vector that is correct for Val1 (and works for 2)
diag(as.matrix(dt[,.SD,.SDcols=dt[,paste0("GP",id,"1")]]))
# I think the answer lies in `set`, but i've not had any luck.
for (i in 1:nrow(dt)) set(dt, i=dt[i,.SD,.SDcols=dt[,paste0("GP",id,"2")]], j=i, value=0)
The data is quite ugly this way so perhaps it's better to just use the melt method.
dt[id == "a", c("Val1", "Val2") := .(GPa1, GPa2)]
dt[id == "b", c("Val1", "Val2") := .(GPb1, GPb2)]
# N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
#1: 1 2 13 5 8 b 5 8
#2: 2 3 8 7 2 a 3 8
#3: 3 5 11 19 1 b 19 1
#4: 4 4 5 6 9 b 6 9
#5: 5 1 15 1 10 a 1 15
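The same idea can be written generically for any set of ids by building the column name per group. A sketch, on a small deterministic table instead of sample() (whose results vary across R versions); it assumes every id value has matching GP<id>1 / GP<id>2 columns:

```r
library(data.table)

# Small deterministic example instead of sample()
dt <- data.table(N = 1:3,
                 GPa1 = c(1, 2, 3), GPa2 = c(4, 5, 6),
                 GPb1 = c(7, 8, 9), GPb2 = c(10, 11, 12),
                 id = c("b", "a", "b"))

# Within each id group, id[1] is a scalar, so paste0() yields a single
# column name that .SD can look up; the parentheses around the LHS make
# data.table evaluate it as a column name.
for (k in 1:2) {
  dt[, (paste0("Val", k)) := .SD[[paste0("GP", id[1], k)]], by = id]
}
dt[, .(N, id, Val1, Val2)]
#>    N id Val1 Val2
#> 1: 1  b    7   10
#> 2: 2  a    2    5
#> 3: 3  b    9   12
```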

Differences to earlier instance within group or external variable

I have data
dat1 <- data.table(id=1:9,
group=c(1,1,2,2,2,3,3,3,3),
t=c(14,17,20,21,26,89,90,95,99),
index=c(1,2,1,2,3,1,2,3,4)
)
and I would like to compute the difference of t to the previous value within each group, ordered by index. For the first instance of each group, I would like to compute the difference to an external variable
dat2 <- data.table(group=c(1,2,3),
start=c(10,15,80)
)
such that the following result should be obtained:
> res
id group t index dif
1: 1 1 14 1 4
2: 2 1 17 2 3
3: 3 2 20 1 5
4: 4 2 21 2 1
5: 5 2 26 3 5
6: 6 3 89 1 9
7: 7 3 90 2 1
8: 8 3 95 3 5
9: 9 3 99 4 4
I have tried using
dat1[ , ifelse(index == min(index), dif := t - dat2$start, dif := t - t[-1]), by = group]
but I was unsure about referencing other elements of the same group and external elements in one step. Is this at all possible using data.table?
A possible solution:
dat1[, dif := ifelse(index == min(index),
t - dat2$start[match(.BY, dat2$group)],
t - shift(t))
, by = group][]
which gives:
id group t index dif
1: 1 1 14 1 4
2: 2 1 17 2 3
3: 3 2 20 1 5
4: 4 2 21 2 1
5: 5 2 26 3 5
6: 6 3 89 1 9
7: 7 3 90 2 1
8: 8 3 95 3 5
9: 9 3 99 4 4
Or a variant as proposed by @jogo in the comments, which avoids the ifelse:
dat1[, dif := t - shift(t), by = group
][index == 1, dif := t - dat2[group==.BY, start], by = group][]
I would try to avoid ifelse and use data.table's efficient join capabilities:
dat1[dat2, on = "group", # join on group
start := i.start][, # add start value
diff := diff(c(start[1L], t)), by = group][, # compute difference
start := NULL] # remove start value
The resulting table is:
# id group t index diff
#1: 1 1 14 1 4
#2: 2 1 17 2 3
#3: 3 2 20 1 5
#4: 4 2 21 2 1
#5: 5 2 26 3 5
#6: 6 3 89 1 9
#7: 7 3 90 2 1
#8: 8 3 95 3 5
#9: 9 3 99 4 4
You may use shift with a dynamic fill argument: index 'dat2' with .BY to get the 'start' value for each 'group':
dat1[ , dif := t - shift(t, fill = dat2[group == .BY, start]), by = group]
# id group t index dif
# 1: 1 1 14 1 4
# 2: 2 1 17 2 3
# 3: 3 2 20 1 5
# 4: 4 2 21 2 1
# 5: 5 2 26 3 5
# 6: 6 3 89 1 9
# 7: 7 3 90 2 1
# 8: 8 3 95 3 5
# 9: 9 3 99 4 4
Alternatively, you can do this in steps. Probably a matter of taste, but I find it more transparent than the ifelse way.
First the 'normal' shift. Then add an 'index' variable to 'dat2' and do an update join.
dat1[ , dif := t - shift(t), by = group]
dat2[ , index := 1]
dat1[dat2, on = .(group, index), dif := t - start]
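Rebuilding the data from the question, the two steps can be checked end to end (a sketch):

```r
library(data.table)
dat1 <- data.table(id = 1:9,
                   group = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
                   t = c(14, 17, 20, 21, 26, 89, 90, 95, 99),
                   index = c(1, 2, 1, 2, 3, 1, 2, 3, 4))
dat2 <- data.table(group = c(1, 2, 3), start = c(10, 15, 80))

dat1[, dif := t - shift(t), by = group]   # NA for each group's first row
dat2[, index := 1]                        # key the starts to index == 1
dat1[dat2, on = .(group, index), dif := t - start]
dat1$dif
#> [1] 4 3 5 1 5 9 1 5 4
```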

between vs inrange in data.table

In R's data.table, when should one choose between %between% and %inrange% for subsetting operations? I've read the help page for ?between and I'm still scratching my head as to the differences.
library(data.table)
X = data.table(a=1:5, b=6:10, c=c(5:1))
> X[b %between% c(7,9)]
a b c
1: 2 7 4
2: 3 8 3
3: 4 9 2
> X[b %inrange% c(7,9)]
a b c
1: 2 7 4
2: 3 8 3
3: 4 9 2
They look the same to me. Could someone please explain why there exist both operations?
> X
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
4: 4 9 2
5: 5 10 1
Using the example in the comments:
> X[a %between% list(c, b)]
a b c
1: 3 8 3
2: 4 9 2
3: 5 10 1
> X[a %inrange% list(c, b)]
a b c
1: 1 6 5
2: 2 7 4
3: 3 8 3
4: 4 9 2
5: 5 10 1
It seems %between% looks at each row individually and checks whether the value in a satisfies c <= a <= b for that row.
%inrange%, by contrast, treats c and b as a set of intervals [c[i], b[i]] and checks, for each value in a, whether it falls inside any of those intervals. In this example the intervals are nested, so their union is the single range [1, 10], and every value of a qualifies.
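Per the documentation, %inrange% tests each value against any of the row intervals, which coincides with a single collapsed range here only because the intervals are nested. A sketch, including a hypothetical disjoint-interval case where the two operators visibly differ:

```r
library(data.table)
X <- data.table(a = 1:5, b = 6:10, c = 5:1)

# Row-wise test: c[i] <= a[i] <= b[i]
X$a %between% list(X$c, X$b)
#> [1] FALSE FALSE  TRUE  TRUE  TRUE

# Interval test: is each a[j] inside *some* interval [c[i], b[i]]?
# The intervals (5,6), (4,7), ..., (1,10) are nested, union = [1, 10],
# so every a qualifies.
X$a %inrange% list(X$c, X$b)
#> [1] TRUE TRUE TRUE TRUE TRUE

# With disjoint intervals (hypothetical data) the semantics differ:
# 3 lies between min(lower) and max(upper), but inside neither
# [1, 2] nor [6, 8].
inrange(c(3L, 7L), lower = c(1L, 6L), upper = c(2L, 8L))
#> [1] FALSE  TRUE
```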

Removing certain rows and replacing values based on a condition

I have the following data:
set.seed(2)
d <- data.frame(iteration=c(1,1,2,2,2,3,4,5,6,6,6),
value=sample(11),
var3=sample(11))
iteration value var3
1 1 3 7
2 1 8 4
3 2 6 8
4 2 2 3
5 2 7 9
6 3 9 11
7 4 1 10
8 5 4 1
9 6 10 2
10 6 11 6
11 6 5 5
Now, I want the following:
1. If an iteration has more than one row, remove its last row AND replace the value of the row before it with the removed row's value.
So in the example above here is the output that I want:
d<-data.frame(iteration=c(1,2,2,3,4,5,6,6),
value=c(8,6,7,9,1,4,10,5))
iteration value var3
1 1 8 7
2 2 6 8
3 2 7 3
4 3 9 11
5 4 1 10
6 5 4 1
7 6 10 2
8 6 5 6
We can use data.table
library(data.table)
setDT(d)[, .(value = if(.N>1) c(value[seq_len(.N-2)], value[.N]) else value), iteration]
# iteration value
#1: 1 8
#2: 2 6
#3: 2 7
#4: 3 9
#5: 4 1
#6: 5 4
#7: 6 10
#8: 6 5
Update
Based on the update in the OP's post, we can first create a new column with the lead values of 'value', assign 'value1' to 'value' only for the rows that meet the condition in 'i1', and then subset the rows:
setDT(d)[, value1 := shift(value, type = "lead"), iteration]
i1 <- d[, if(.N >1) .I[.N-1], iteration]$V1
d[i1, value := value1]
d[d[, if(.N > 1) .I[-.N] else .I, iteration]$V1][, value1 := NULL][]
# iteration value var3
#1: 1 8 7
#2: 2 6 8
#3: 2 7 3
#4: 3 9 11
#5: 4 1 10
#6: 5 4 1
#7: 6 10 2
#8: 6 5 6
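Since sample() output differs across R versions, here is the same update run on the data exactly as printed above (a sketch):

```r
library(data.table)

# The data as printed in the question
d <- data.table(iteration = c(1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6),
                value = c(3, 8, 6, 2, 7, 9, 1, 4, 10, 11, 5),
                var3  = c(7, 4, 8, 3, 9, 11, 10, 1, 2, 6, 5))

d[, value1 := shift(value, type = "lead"), iteration]   # next value per group
i1 <- d[, if (.N > 1) .I[.N - 1], iteration]$V1         # penultimate rows
d[i1, value := value1]                                  # promote last value
res <- d[d[, if (.N > 1) .I[-.N] else .I, iteration]$V1][, value1 := NULL][]
res$value
#> [1] 8 6 7 9 1 4 10 5
res$var3
#> [1] 7 8 3 11 10 1 2 6
```

Note that var3 comes from the kept (second-to-last) rows, matching the desired output.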
This base R solution using the split-apply-combine methodology returns the same values as @akrun's data.table version, although the logic appears to be different.
do.call(rbind, lapply(split(d, d$iteration),
function(i)
if(nrow(i) >= 3) i[-(nrow(i)-1),] else tail(i, 1)))
iteration value
1 1 8
2.3 2 6
2.5 2 7
3 3 9
4 4 1
5 5 4
6.9 6 10
6.11 6 5
The idea is to split the data.frame into a list of data.frames along iteration; then, for each data.frame, if it has at least 3 rows, drop the penultimate row, otherwise return only the final row. do.call with rbind then compiles these observations into a single data.frame.
Note that this will not work in the presence of other variables such as var3: it keeps the final row with all of its values, rather than replacing the value in the previous row.

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows belonging to any id that occurs fewer than 5 times. The variable "id" is the grouping variable, and a group should be deleted when it has fewer than 5 rows. In DT, that means determining which groups have fewer than 5 members (groups "1" and "4") and removing their rows, to give:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data.table answer.
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
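For completeness, the grouped filter can also be written directly in data.table syntax, keeping groups whose size is at least 5 (a sketch; note this variant returns id as the first column, unlike the .I-based answer above, which preserves the original column order):

```r
library(data.table)
DT <- data.table(x = rep(c("a", "b", "c"), each = 6), y = c(1, 3, 6), v = 1:9,
                 id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4))

# .N is the group size; returning .SD keeps all non-grouping columns
res <- DT[, if (.N >= 5L) .SD, by = id]
res
```

This keeps the 11 rows belonging to ids 2 and 3, the same rows as the other answers.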
