Adding columns dynamically to a data.table depending on a column value - r

What is the best way to add a sequence column to a data.table, running from a value defined by a column up to a fixed upper limit? The following example illustrates the input and desired output:
library(data.table)
# Input
# Goal: add a sequence column counting up to 7, starting from the value in column V2.
dt <- data.table(c("A","D","H"), c(2, 4, 5))
# Desired output
dt <- data.table(c(rep("A", 6), rep("D", 4), rep("H", 3)), new_column = c(2:7, 4:7, 5:7))
dt

Does this achieve what you're after?
dt[ , .( new_column = V2:7 ), by = V1 ]
output:
V1 new_column
1: A 2
2: A 3
3: A 4
4: A 5
5: A 6
6: A 7
7: D 4
8: D 5
9: D 6
10: D 7
11: H 5
12: H 6
13: H 7
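One caveat: by = V1 assumes each V1 value carries a single V2. If V1 could repeat with different V2 values, grouping by row number is a safer variant (a sketch, not from the original answer):
library(data.table)
dt <- data.table(V1 = c("A", "D", "H"), V2 = c(2, 4, 5))
# Group by row number so each row expands independently,
# even when V1 contains duplicates
dt[, .(V1 = V1, new_column = V2:7), by = .(row = seq_len(nrow(dt)))][, row := NULL][]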

Related

Replacing some values of a column based on some match in data.table

Let's say I have the data.table below:
library(data.table)
DT = data.table(Col1 = LETTERS[1:10], Col2 = c(1,4,2,3,6,NA,4,2, 5, 4))
DT
Col1 Col2
1: A 1
2: B 4
3: C 2
4: D 3
5: E 6
6: F NA
7: G 4
8: H 2
9: I 5
10: J 4
Now I want to replace the 4 and NA values in Col2 with 999.
In the actual scenario I have a very large DT, so I am looking for the most efficient way to achieve this.
Any insight will be highly appreciated.
An option with na_if()/replace_na() (note that na_if() comes from dplyr and replace_na() from tidyr, so both packages are needed):
library(dplyr)
library(tidyr)
library(data.table)
DT[, Col2 := replace_na(na_if(Col2, 4), 999)]
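Since the question asks for the most efficient approach on a large table, a pure data.table update by reference is worth comparing; it avoids the extra packages and intermediate copies (a sketch):
library(data.table)
DT = data.table(Col1 = LETTERS[1:10], Col2 = c(1, 4, 2, 3, 6, NA, 4, 2, 5, 4))
# %in% treats NA as a matchable value, so one condition covers 4 and NA;
# := modifies Col2 in place without copying the table
DT[Col2 %in% c(4, NA), Col2 := 999]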

How to merge two data.tables with complementary column data in one go?

I have two data.tables; their v2 columns are complementary:
set.seed(1234)
v1 <- sample(1:20, 5)
v2a <- c(1:2,NA,NA,NA)
v2b <- c(NA,NA,3:5)
id <- c(letters[1:5])
library(data.table)
dt1 <- data.table(id = id, v1=v1,v2=v2a)
dt2 <- data.table(id = id, v2=v2b)
dt1
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 NA
4: d 15 NA
5: e 9 NA
dt2
id v2
1: a NA
2: b NA
3: c 3
4: d 4
5: e 5
The goal is to merge the two data.tables and have column v2 with the proper values without NA.
I got the correct result either with:
dt <- rbindlist(list(dt1,dt2), use.names = T, fill = T)
dt <- dt[,v2:= sum(v2, na.rm = T), by = id]
dt <- dt[!is.na(v1)]
or:
dt <- merge(dt1, dt2, by = "id", all = T)
dt[, v2:=sum(v2.x, v2.y, na.rm = T), by = id][, v2.x := NULL][,v2.y := NULL]
both giving the correct desired result:
dt
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
Is there an easier, one-step way to do it?
The code below updates the values of dt1$v2 where is.na(dt1$v2) == TRUE with the values of dt2$v2, matched on id.
dt1[is.na(v2), v2 := dt2[ dt1[is.na(v2),], v2, on = .(id)] ][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
There is another, less convoluted approach using the fcoalesce() function, which was introduced in data.table v1.12.4 (on CRAN 3 Oct 2019):
dt1[dt2, on = .(id), v2 := fcoalesce(x.v2, i.v2)][]
id v1 v2
1: a 16 1
2: b 5 2
3: c 12 3
4: d 15 4
5: e 9 5
dt1[dt2, on = .(id), v2 := fcoalesce(v2, i.v2)][]
works as well because
dt1[dt2, on = .(id)]
returns
id v1 v2 i.v2
1: a 16 1 NA
2: b 5 2 NA
3: c 12 NA 3
4: d 15 NA 4
5: e 9 NA 5
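For completeness, the same coalesce idea works after an ordinary merge as well, at the cost of an intermediate copy (a sketch):
m <- merge(dt1, dt2, by = "id", all = TRUE)
# merge() suffixes the clashing columns as v2.x / v2.y;
# fcoalesce() picks the first non-NA of the two
m[, v2 := fcoalesce(v2.x, v2.y)][, c("v2.x", "v2.y") := NULL][]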

Filling in missing values in a data.table by reference

Suppose I have a data.table with missing values and a reference data.table:
dt <- data.table(id = 1:5, value = c(1, 2, NA, NA, 5))
id value
1: 1 1
2: 2 2
3: 3 NA
4: 4 NA
5: 5 5
ref <- data.table(id = 1:4, value = c(1, 2, 98, 99))
id value
1: 1 1
2: 2 2
3: 3 98
4: 4 99
How would I fill the column value of dt by using the matching id in the two data.tables,
so that I get the following data.table?
id value
1: 1 1
2: 2 2
3: 3 98
4: 4 99
5: 5 5
We can use a join on 'id' and assign (:=) the value column from 'ref' (i.value) to the one in 'dt':
library(data.table)
dt[ref, value := i.value, on = .(id)]
dt
# id value
#1: 1 1
#2: 2 2
#3: 3 98
#4: 4 99
#5: 5 5
If we don't want to replace the original non-NA elements in the 'value' column:
dt[ref, value := fcoalesce(value, i.value), on = .(id)]
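Another way to guarantee the non-NA rows stay untouched is to restrict the update to the NA rows before joining, mirroring the nested-join pattern shown in the previous question (a sketch):
library(data.table)
dt  <- data.table(id = 1:5, value = c(1, 2, NA, NA, 5))
ref <- data.table(id = 1:4, value = c(1, 2, 98, 99))
# Only rows where value is NA take part in the join,
# so existing values cannot be overwritten
dt[is.na(value), value := ref[dt[is.na(value)], value, on = .(id)]]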

Removing all duplicates in data.table, adding column with list of identifiers

Building on my previous question, I have the following data.table:
> dt = data.table(V1 = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4),
V2 = c(4, 4, 4, 5, 5, 6, 6, 7, 7, 4, 5),
ID = c(1, 2, 3, 1, 2, 2, 2, 1, 3, 1, 1))
V1 V2 ID
1: 1 4 1
2: 1 4 2
3: 1 4 3
4: 2 5 1
5: 2 5 2
6: 2 6 2
7: 2 6 2
8: 2 7 1
9: 2 7 3
10: 3 4 1
11: 4 5 1
with many duplicate rows (when looking at V1 and V2).
I want to remove all the "entirely duplicate" rows (those where V2 is always the same for a given V1), including the ones whose V1 occurs only once (a single V2 is trivially always equal to itself).
I was given many solutions in the previous question linked above, including @akrun's (modified for this new table):
> newDT = unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1], by = c("V1", "V2"))
V1 V2 ID
1: 2 5 1
2: 2 6 2
3: 2 7 3
However, I actually need to store a list of IDs which had a given result. I can collect the list of IDs for a given pair of (V1, V2) trivially with
> unique(dt[V1 == 2 & V2 == 5, ID])
[1] 1 2
(unique is necessary because a given (V1,V2) pair may also have duplicate ID, see dt rows 6:7) and add it with
newDT[, ID := .(.(unique(dt[V1 == 2 & V2 == 5, ID])))]
but I can't figure out how to "iterate" over each (V1,V2) pair.
In summary, I need the output to be
> newDT
V1 V2 ID
1: 2 5 (1, 2)
2: 2 6 (2)
3: 2 7 (1, 3)
After subsetting the rows of the data, group by 'V1' and 'V2' and create a list of unique 'ID' values:
dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1][, .(ID = list(unique(ID))), .(V1, V2)]
# V1 V2 ID
#1: 2 5 1,2
#2: 2 6 2
#3: 2 7 1,3
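Note that ID in the result is a list column: each cell holds a vector of IDs. Individual vectors can be pulled out with [[ (a usage sketch; res is just a name for the result above):
res <- dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1][, .(ID = list(unique(ID))), .(V1, V2)]
# Extract the ID vector for the pair (V1 = 2, V2 = 5)
res[V2 == 5, ID][[1]]
# [1] 1 2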
Another option aggregates the unique IDs per (V1, V2) pair and then joins to the pairs of interest:
dt[, .(ID = .(unique(ID))), by = .(V1, V2)
][dt[, .(V2 = if (uniqueN(V2) > 1) unique(V2) else numeric(0)), by = V1], on = .(V1, V2)]
Broken down into steps:
# Extract our pairs of interests first
our_pairs <- dt[, .(V2 = if (uniqueN(V2) > 1) unique(V2) else numeric(0)), by = V1]
our_pairs
# V1 V2
# 1: 2 5
# 2: 2 6
# 3: 2 7
# Aggregate all ID's for each pair on original data
ids_forpairs <- dt[, .(ID = .(unique(ID))), by = .(V1, V2)]
ids_forpairs
# V1 V2 ID
# 1: 1 4 1,2,3
# 2: 2 5 1,2
# 3: 2 6 2
# 4: 2 7 1,3
# 5: 3 4 1
# 6: 4 5 1
# Then filter by our pairs of interest
ids_forpairs[our_pairs, on = .(V1, V2)]
# V1 V2 ID
# 1: 2 5 1,2
# 2: 2 6 2
# 3: 2 7 1,3
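A third variant (a sketch) filters whole groups with .SD before aggregating, which reads closer to the problem statement:
# Keep only the V1 groups with more than one distinct V2,
# then collect the unique IDs per (V1, V2) pair
dt[, if (uniqueN(V2) > 1) .SD, by = V1][, .(ID = list(unique(ID))), by = .(V1, V2)]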

Carry / use value from previous group

data:
structure(list(id = c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5),
ax = c("a", "a", "b", "b", "b", "b", "b", "b", "c", "c",
"d", "d", "e"), time = c(1, 3, 0, 2, 4, 5, 6, 8, 7, 9, 10,
11, 12)), .Names = c("id", "ax", "time"), class = c("data.table",
"data.frame"), row.names = c(NA, -13L))
looks like:
id ax time
1: 1 a 1
2: 1 a 3
3: 2 b 0
4: 2 b 2
5: 2 b 4
6: 2 b 5
7: 2 b 6
8: 2 b 8
9: 3 c 7
10: 3 c 9
11: 4 d 10
12: 4 d 11
13: 5 e 12
I want to have the max of the previous group next to the current group:
desired output:
id ax time newCol
1: 1 a 1 NA
2: 1 a 3 NA
3: 2 b 0 3
4: 2 b 2 3
5: 2 b 4 3
6: 2 b 5 3
7: 2 b 6 3
8: 2 b 8 3
9: 3 c 7 8
10: 3 c 9 8
11: 4 d 10 9
12: 4 d 11 9
13: 5 e 12 11
Is it also possible to have the value of the group before the previous one (the "previous-previous" group)?
Interested in base R, data.table and tidyverse solutions.
note:
The data can be grouped by either id or ax; the example is a little redundant here.
A data.table solution (with the example data stored as dtt):
dtt.max <- dtt[, .(max = max(time)), by = ax]
dtt.max[, max.prev := shift(max)]
dtt[dtt.max, newCol := i.max.prev, on = 'ax']
# > dtt
# id ax time newCol
# 1: 1 a 1 NA
# 2: 1 a 3 NA
# 3: 2 b 0 3
# 4: 2 b 2 3
# 5: 2 b 4 3
# 6: 2 b 5 3
# 7: 2 b 6 3
# 8: 2 b 8 3
# 9: 3 c 7 8
# 10: 3 c 9 8
# 11: 4 d 10 9
# 12: 4 d 11 9
# 13: 5 e 12 11
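To address the follow-up about the "previous-previous" group: shift() accepts an n argument, so the same approach reaches back two groups (a sketch reusing dtt.max from above; max.prev2 and newCol2 are illustrative names):
# n = 2 shifts the group maxima by two positions instead of one
dtt.max[, max.prev2 := shift(max, n = 2)]
dtt[dtt.max, newCol2 := i.max.prev2, on = 'ax']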
A data.table solution using id + 1 (note that this relies on id being consecutive integers, since each group's max is attached to the rows with id + 1):
library(data.table)
merge(d, setDT(d)[, max(time), id + 1], all.x = TRUE)
Here is a dplyr approach. The key is to group and ungroup when necessary; time[n()] takes the last value per group, which equals the group max here because time is sorted within each group:
library(dplyr)
df %>%
  group_by(ax) %>%
  mutate(new = time[n()]) %>%
  ungroup() %>%
  mutate(new = lag(new)) %>%
  group_by(ax) %>%
  mutate(new = new[1])
# A tibble: 13 x 4
# Groups: ax [5]
id ax time new
<dbl> <chr> <dbl> <dbl>
1 1. a 1. NA
2 1. a 3. NA
3 2. b 0. 3.
4 2. b 2. 3.
5 2. b 4. 3.
6 2. b 5. 3.
7 2. b 6. 3.
8 2. b 8. 3.
9 3. c 7. 8.
10 3. c 9. 8.
11 4. d 10. 9.
12 4. d 11. 9.
13 5. e 12. 11.
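A variant that avoids the regroup/ungroup dance summarises the group maxima once, lags them, and joins back (a sketch; it assumes, as here, that the ax groups sort in order of appearance):
library(dplyr)
df %>%
  group_by(ax) %>%
  summarise(new = max(time)) %>%   # one row per group with its max
  mutate(new = lag(new)) %>%       # shift to the previous group's max
  right_join(df, by = "ax")        # expand back to one row per original row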
Assuming id is the same as group:
dfr <- dfr %>% group_by(id) %>% mutate(groupmax = max(time))
dfr$old_group_max <- dfr$groupmax[match(dfr$id - 1, dfr$id)]
The antepenultimate group is left as an exercise :-)
1) This uses no packages. It computes the maximum for each group, giving Ag, then lags it, giving LagMax. Finally it merges that back into the original data frame DF with a left join:
Ag <- aggregate(time ~ id, DF, max)
LagMax <- transform(Ag, lagmax = c(NA, head(time, -1)), time = NULL)
merge(DF, LagMax, by = "id", all.x = TRUE)
giving:
id ax time lagmax
1 1 a 1 NA
2 1 a 3 NA
3 2 b 0 3
4 2 b 2 3
5 2 b 4 3
6 2 b 5 3
7 2 b 6 3
8 2 b 8 3
9 3 c 7 8
10 3 c 9 8
11 4 d 10 9
12 4 d 11 9
13 5 e 12 11
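The head() trick extends to the "previous-previous" group by dropping two values and padding with two NAs (a sketch reusing Ag from above; lagmax2 is an illustrative name):
LagMax2 <- transform(Ag, lagmax2 = c(NA, NA, head(time, -2)), time = NULL)
merge(DF, LagMax2, by = "id", all.x = TRUE)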
2) This sorts time within id so that we know that the maximum is the last value in each id group.
o <- order(factor(DF$id, levels = unique(DF$id)), DF$time)
Time <- DF$time[o]
lagmax <- function(r) if (r[1] == 1) NA else Time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
In the question the time values are already sorted within id and if that is known to be the case the above could be shortened to:
lagmax <- function(r) if (r[1] == 1) NA else DF$time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
3) This one-liner is a data.table translation of (2); the if () guard returns NA for the first group, whose lagged index would otherwise be zero-length:
library(data.table)
DT <- copy(DF) # don't overwrite DF
setDT(DT)[, g := rleid(id)][, lagmax := if (.I[1] == 1) NA_real_ else DT$time[.I[1] - 1], keyby = c("g", "id")]
In the sample data in the question, time is already sorted within id; if that is known to be the case, the last line above can be shortened to:
setDT(DT)[, lagmax := if (.I[1] == 1) NA_real_ else DT$time[.I[1] - 1], by = id]
