Backfill data frame in R based on maximum number - r

I have a data frame with a number of id's and max values.
df <- data.frame(id = c("a", "b", "c"), max.number = c(0, 6, 4))
id max.number
1 a 0
2 b 6
3 c 4
I want to create a new data frame that backfills down to zero for each id. It would look like:
id number
1 a 0
2 b 0
3 b 1
4 b 2
5 b 3
6 b 4
7 b 5
8 b 6
9 b 0
10 c 1
11 c 2
12 c 3
13 c 4
I've tried to do this through a nested for loop but can't get it to work correctly. I would love some guidance in doing this with apply if possible.

This would be very easy with a data.table table:
# install.packages("data.table")
library(data.table)
df <- data.frame(id = c("a", "b", "c"), max.number = c(0, 6, 4))
dt <- as.data.table(df)
# or directly:
# dt <- data.table(id = c("a", "b", "c"), max.number = c(0, 6, 4))
dt[, .(number=seq(0, max.number)), by=id]
Output:
id number
1: a 0
2: b 0
3: b 1
4: b 2
5: b 3
6: b 4
7: b 5
8: b 6
9: c 0
10: c 1
11: c 2
12: c 3
13: c 4

Related

How to convert the values in one column into new columns, the values in another column into rows, indexing values in a third column?

Suppose I have the following data.table in R:
> data.table(Group = c(rep(1, 5), rep(2,5), rep(3,5)), Type = c("A","B","C","D","E"), Value = c(1:15))
Group Type Value
1: 1 A 1
2: 1 B 2
3: 1 C 3
4: 1 D 4
5: 1 E 5
6: 2 A 6
7: 2 B 7
8: 2 C 8
9: 2 D 9
10: 2 E 10
11: 3 A 11
12: 3 B 12
13: 3 C 13
14: 3 D 14
15: 3 E 15
I would like to create a new data table where I have:
> dat <- data.table(A = c(1,6,11), B = c(2,7,12), C = c(3,8,13), D = c(4,9,14), E = c(5,10,15))
> rownames(dat) <- c("1","2","3")
> dat
A B C D E
1: 1 2 3 4 5
2: 6 7 8 9 10
3: 11 12 13 14 15
where the rownames are now the Group values, the Type the column names, with the entries being the corresponding values from Values. Is there a way to do this using a function in data.table?
Using data.table rather than tidyr functions:
dt <- data.table(Group = c(rep(1, 5), rep(2,5), rep(3,5)), Type = c("A","B","C","D","E"), Value = c(1:15))
data.table::dcast(dt, Group ~ Type, value.var = "Value")
# Group A B C D E
# 1: 1 1 2 3 4 5
# 2: 2 6 7 8 9 10
# 3: 3 11 12 13 14 15
Edit: I have made the data.table:: explicit because there is also reshape2::dcast().

Subset groups in a data.table using conditions on two columns

I have a data.table with a high number of groups. I would like to subset whole groups (not just rows) based on the conditions on multiple columns. Consider the following data.table:
DT <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L),
group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
y = c(14, 19, 16, 10, 6, 8, 14, 19, 10, 9, 6, 8),
x = c(3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3)),
row.names = c(NA, -12L),
class = c("data.table", "data.frame"))
>DT
id group y x
1: 1 A 14 3
2: 2 A 19 3
3: 3 A 16 2
4: 4 A 10 3
5: 5 B 6 3
6: 6 B 8 3
7: 7 B 14 3
8: 8 B 19 2
9: 9 C 10 2
10: 10 C 9 3
11: 11 C 6 3
12: 12 C 8 3
I would like to keep groups that have y=6 and x=3 in the same row. So that I would have only class B and C (preferably using data.table package in R):
id group y x
1: 5 B 6 3
2: 6 B 8 3
3: 7 B 14 3
4: 8 B 19 2
5: 9 C 10 2
6: 10 C 9 3
7: 11 C 6 3
8: 12 C 8 3
All my attempts gave me only those rows containing y=6 and x=3, which I do not want:
id group y x
1: 5 B 6 3
2: 11 C 6 3
With data.table:
DT[,.SD[any(x == 3 & y == 6)], by=group]
group id y x
<char> <int> <num> <num>
1: B 5 6 3
2: B 6 8 3
3: B 7 14 3
4: B 8 19 2
5: C 9 10 2
6: C 10 9 3
7: C 11 6 3
8: C 12 8 3
Another possibly faster option:
DT[, if (any(x == 3 & y == 6)) .SD, by=group]
Try dplyr package
#select groups containing y and x
groups = DT %>% filter(y == 6, x == 3) %>% select(group) %>% unique() %>% unlist() %>% as.vector()
# filter for selected groups
DT %>% filter(group %in% groups)
A data.table option
> DT[group %in% DT[.(3, 6), group, on = .(x, y)]]
id group y x
1: 5 B 6 3
2: 6 B 8 3
3: 7 B 14 3
4: 8 B 19 2
5: 9 C 10 2
6: 10 C 9 3
7: 11 C 6 3
8: 12 C 8 3

Find overlaps in time intervals by group and return subsetted data.frame

Say I have this dataframe, which has two IDs (1/2) with their start and end times in three different zones (A/B/C):
df <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), zone = c("A",
"B", "A", "C", "B", "A", "B", "A", "B", "C"), start = c(0, 6,
7, 8, 10, 0, 3, 5, 6, 7), end = c(6, 7, 8, 10, 11, 3, 5, 6, 7,
11)), row.names = c(NA, -10L), class = "data.frame")
df
id zone start end
1 1 A 0 6
2 1 B 6 7
3 1 A 7 8
4 1 C 8 10
5 1 B 10 11
6 2 A 0 3
7 2 B 3 5
8 2 A 5 6
9 2 B 6 7
10 2 C 7 11
If we look at each zone, we can visually inspect the times when IDs are in the same zone and when they are not:
split(df,df$zone)
$A
id zone start end
1 1 A 0 6
3 1 A 7 8
6 2 A 0 3
8 2 A 5 6
$B
id zone start end
2 1 B 6 7
5 1 B 10 11
7 2 B 3 5
9 2 B 6 7
$C
id zone start end
4 1 C 8 10
10 2 C 7 11
e.g. Both 1 and 2 are together in zone A from 0-3, and from 5-6, but not at other times.
Desired Output
I want to extract three dataframes.
A dataframe showing the times and zones they are together:
zone start end id
1 A 0 3 1-2
2 A 5 6 1-2
3 B 6 7 1-2
4 C 8 10 1-2
2 & 3: Dataframes for times when they are not together:
#id=1
zone start end
1 A 3 5
2 A 7 8
3 B 10 11
#id=2
zone start end
1 B 3 5
2 C 7 8
3 C 10 11
I have been trying to work with foverlaps from data.table and the intervals package, but can't seem to work out the correct method.
e.g. Subsetting each zone/id, I can sort of get an output that includes overlaps, but it doesn't seem to be quite the right direction:
A <- split(df,df$zone)$A
Asp <- split(A,A$id)
x <- setDT(Asp[[1]])
y <- setDT(Asp[[2]])
setkey(y, start, end)
foverlaps(x, y, type="any")
id zone start end i.id i.zone i.start i.end
1: 2 A 0 3 1 A 0 6
2: 2 A 5 6 1 A 0 6
3: NA <NA> NA NA 1 A 7 8
Any help greatly appreciated.
EDIT: Extra example dataset that seemed to bring up some issues with current suggested solutions:
df2 <- structure(list(start = c(0, 5, 6, 8, 10, 13, 15, 20, 22, 26,
29, 37, 40, 42, 0, 3, 6, 9, 15, 20, 25, 33, 35, 40), id = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2), zone = c("A", "B", "A", "D", "C", "B", "C", "B", "A",
"B", "A", "D", "C", "D", "A", "B", "C", "D", "A", "B", "C", "B",
"A", "D"), end = c(5, 6, 8, 10, 13, 15, 20, 22, 26, 29, 37, 40,
42, 45, 3, 6, 9, 15, 20, 25, 33, 35, 40, 45)), class = c("data.table", "data.frame"), row.names = c(NA, -24L))
df2
start id zone end
1: 0 1 A 5
2: 5 1 B 6
3: 6 1 A 8
4: 8 1 D 10
5: 10 1 C 13
6: 13 1 B 15
7: 15 1 C 20
8: 20 1 B 22
9: 22 1 A 26
10: 26 1 B 29
11: 29 1 A 37
12: 37 1 D 40
13: 40 1 C 42
14: 42 1 D 45
15: 0 2 A 3
16: 3 2 B 6
17: 6 2 C 9
18: 9 2 D 15
19: 15 2 A 20
20: 20 2 B 25
21: 25 2 C 33
22: 33 2 B 35
23: 35 2 A 40
24: 40 2 D 45
start id zone end
This seems to work, filtering the foverlaps output:
DT = data.table(df)
setkey(DT, start, end)
oDT0 = foverlaps(DT[id==1], DT[id==2])
oDT0[, `:=`(
ostart = pmax(start, i.start),
oend = pmin(end, i.end)
)]
oDT = oDT0[ostart < oend]
# together
oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
# ids zone ostart oend
# 1: 1-2 A 0 3
# 2: 1-2 A 5 6
# 3: 1-2 B 6 7
# 4: 1-2 C 8 10
# apart
oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
# id zone i.id i.zone ostart oend
# 1: 2 B 1 A 3 5
# 2: 2 C 1 A 7 8
# 3: 2 C 1 B 10 11
Repeating for new input... not sure if it's correct since no expected output was provided:
> DT = data.table(df2)
> ...
> oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
ids zone ostart oend
1: 1-2 A 0 3
2: 1-2 B 5 6
3: 1-2 D 9 10
4: 1-2 B 20 22
5: 1-2 A 35 37
6: 1-2 D 42 45
> oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
id zone i.id i.zone ostart oend
1: 2 B 1 A 3 5
2: 2 C 1 A 6 8
3: 2 C 1 D 8 9
4: 2 D 1 C 10 13
5: 2 D 1 B 13 15
6: 2 A 1 C 15 20
7: 2 B 1 A 22 25
8: 2 C 1 A 25 26
9: 2 C 1 B 26 29
10: 2 C 1 A 29 33
11: 2 B 1 A 33 35
12: 2 A 1 D 37 40
13: 2 D 1 C 40 42
I suspect there is a way to pass arguments to foverlaps to avoid needing to define and filter by ostart and oend. As of the latest CRAN version of the package, the doc indicates that minoverlap is not yet implemented, so maybe it is necessary for now.
I think you are almost there. You can try the code below by defining a function f
f <- function(A) {
Asp <- split(A, by = "id")
u <- na.omit(foverlaps(Asp[[1]], setkey(Asp[[2]], start, end)))
r <- c()
for (k in 1:nrow(u)) {
if (u[k, end - start < i.end - i.start]) {
p <- u[k, .(start, end)]
} else {
p <- u[k, .(start = i.start, end = i.end)]
}
r[[k]] <- p
}
cbind(
zone = u[, zone],
rbindlist(r),
id = paste0(unique(A[, id]), collapse = "-")
)
}
and then run
rbindlist(Map(f, split(setDT(df), by = "zone")))
which gives
> rbindlist(Map(f, split(setDT(df), by = "zone")))
zone start end id
1: A 0 3 1-2
2: A 5 6 1-2
3: B 6 7 1-2
4: C 8 10 1-2
For the first data.frame, you can also use a non-equi join:
ovlap <- df[df, on=.(zone, id<id, start<end, end>start), nomatch=0L,
.(zone, id2=i.id, i.start, i.end, id1=x.id, x.start, x.end)][,
.(start=max(x.start, i.start), end=min(x.end, i.end)),
.(zone, id1, id2, i.start)][,
i.start := NULL][]
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 5 6
#3: B 1 2 6 7
#4: C 1 2 8 10
For the other output data.frames, you can perform a non-equi join first with the previous result and then for each interval find the sub-intervals where the other partner is not around:
rangeDiff <- function(DT) {
DT[,
if (is.na(x.start[1L])) {
.(start=i.start, end=i.end)
} else {
.(start=c(i.start, x.end+1L),
end=c(x.start-1L, i.end))
},
.(zone, id, i.start, i.end)][
start<=end][,
c("i.start","i.end") := NULL][]
} #rangeDiff
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: A 1 4 4
#2: A 1 7 8
#3: B 1 10 11
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: B 2 3 5
#2: C 2 7 7
#3: C 2 11 11
There is some inconsistency in OP where the bounds of intervals are inclusive or exclusive. I have used inclusive when both ids are around in the same zone (i.e. in the first output data.frame).
Edit: show output for df2
ovlap
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 35 37
#3: B 1 2 5 6
#4: B 1 2 20 22
#5: D 1 2 9 10
#6: D 1 2 42 45
other required data.frames:
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 1 4 5
# 2: A 1 6 8
# 3: A 1 22 26
# 4: A 1 29 34
# 5: B 1 13 15
# 6: B 1 26 29
# 7: C 1 10 13
# 8: C 1 15 20
# 9: C 1 40 42
# 10: D 1 8 8
# 11: D 1 37 40
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 2 15 20
# 2: A 2 38 40
# 3: B 2 3 4
# 4: B 2 23 25
# 5: B 2 33 35
# 6: C 2 6 9
# 7: C 2 25 33
# 8: D 2 11 15
# 9: D 2 40 41
df2 sorted by zone for easier checking:
start id zone end
1: 0 1 A 5
2: 6 1 A 8
3: 22 1 A 26
4: 29 1 A 37
5: 0 2 A 3
6: 15 2 A 20
7: 35 2 A 40
8: 5 1 B 6
9: 13 1 B 15
10: 20 1 B 22
11: 26 1 B 29
12: 3 2 B 6
13: 20 2 B 25
14: 33 2 B 35
15: 10 1 C 13
16: 15 1 C 20
17: 40 1 C 42
18: 6 2 C 9
19: 25 2 C 33
20: 8 1 D 10
21: 37 1 D 40
22: 42 1 D 45
23: 9 2 D 15
24: 40 2 D 45
Updated Solution
I have made some modifications to the previous solution so that it works with the newly presented data set df2:
I tried to create all combinations of id == 1 and id == 2 in every zone to try and find their intersects
Then I created a custom function to take a subset of our data set plus a pair of ids to extract their start to end values so that we have two vectors and we can find their intersects easily
` In the end I applied this function to every subset of our data set
library(dplyr)
library(tidyr)
library(purrr)
fn <- function(data, x, y) {
base::intersect(data %>%
filter(row_number() == x) %>%
select(start, end) %>%
{map2(.$start, .$end, ~ .x:.y)} %>%
unlist(),
data %>%
filter(row_number() == y) %>%
select(start, end) %>%
{map2(.$start, .$end, ~ .x:.y)} %>%
unlist())
}
Then we apply it on our data set:
split(df2, df2$zone) %>%
map(~ .x %>%
mutate(grp = row_number()) %>%
{expand.grid(.$grp[.$id == 1], .$grp[.$id == 2])} %>%
rowwise() %>%
mutate(insec = list(fn(.x, Var1, Var2))) %>%
filter(length(insec) != 0) %>%
unnest(cols = c(insec)) %>%
group_by(Var1, Var2) %>%
filter(row_number() == 1 | row_number() == n()) %>%
filter(n() > 1) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = insec) %>%
ungroup()) %>%
keep(~ nrow(.x) != 0) %>%
imap_dfr(~ .x %>%
mutate(zone
= .y) %>%
select(!starts_with("Var"))) %>%
relocate(zone) %>%
rename(start = `1`, end = `2`)
# A tibble: 6 x 3
zone start end
<chr> <int> <int>
1 A 0 3
2 A 35 37
3 B 5 6
4 B 20 22
5 D 9 10
6 D 42 45

Conditional statement within group

I have a dataframe in which I want to make a new column with values based on condition within groups. So for the dataframe below, I want to make a new column n_actions which gives
Cond1. for the whole group GROUP the number 2 if a 6 appears in column STEP
Cond 2. for the whole group GROUP the number 3 if a 9 appears in column STEP
Cond 3. if not a 6 or 9 appears within column STEP for the GROUP, then 1
#dataframe start
dataframe <- data.frame(group = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D"),
step = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9))
# dataframe desired
dataframe$n_actions <- c(rep(1, 3), rep(2, 6,), rep(1, 3), rep(3, 9))
Try out:
library(dplyr)
dataframe %>%
group_by(group) %>%
mutate(n_actions = ifelse(9 %in% step, 3,
ifelse(6 %in% step, 2, 1)))
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fctr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
# ... with 11 more rows
Another way with dplyr's case_when:
library(dplyr)
dataframe %>%
group_by(group) %>%
mutate(
n_actions1 = case_when(
9 %in% step ~ 3,
6 %in% step ~ 2,
TRUE ~ 1
)
)
Output:
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fct> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
11 C 2 1
12 C 3 1
13 D 1 3
14 D 2 3
15 D 3 3
16 D 4 3
17 D 5 3
18 D 6 3
19 D 7 3
20 D 8 3
21 D 9 3
You could divide the maximum value per group by %/% 3, it seems.
dataframe <- transform(dataframe,
n_actions2 = ave(step, group, FUN = function(x) max(x) %/% 3))
dataframe
# group step n_actions n_actions2
#1 A 1 1 1
#2 A 2 1 1
#3 A 3 1 1
#4 B 1 2 2
#5 B 2 2 2
#6 B 3 2 2
#7 B 4 2 2
#8 B 5 2 2
#9 B 6 2 2
#10 C 1 1 1
#11 C 2 1 1
#12 C 3 1 1
#13 D 1 3 3
#14 D 2 3 3
#15 D 3 3 3
#16 D 4 3 3
#17 D 5 3 3
#18 D 6 3 3
#19 D 7 3 3
#20 D 8 3 3
#21 D 9 3 3

Efficient way of adding row with difference from specific group in data table

I have the following data table:
require(data.table)
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3))
ind cat counts
1: 1 A 10
2: 2 A 9
3: 3 A 8
4: 4 B 7
5: 5 B 6
6: 6 C 5
7: 7 C 4
8: 8 D 3
What I would like to achieve is to add a row for each cat which in the counts has the difference between the sum(counts) of the cat and the sum(counts) of cat A. For these rows the ind should be 0.
Essentially I would like to rbind the following information:
added_info <- cbind(ind =0, dt1[, .(counts = dt1[cat == "A", sum(counts)] - sum(counts)), by = cat])
> added_info
ind cat counts
1: 0 A 0
2: 0 B 14
3: 0 C 18
4: 0 D 24
And the end result would be:
dt1 <- rbind(dt1, added_info)[order(cat)]
> dt1
ind cat counts
1: 1 A 10
2: 2 A 9
3: 3 A 8
4: 0 A 0
5: 4 B 7
6: 5 B 6
7: 0 B 14
8: 6 C 5
9: 7 C 4
10: 0 C 18
11: 8 D 3
12: 0 D 24
My question is if there is a better (shorter) way of achieving this using datatable (perhaps by using .I or .N ??)
You could do
require(data.table)
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3))
dt1[,c:=sum(counts[cat=="A"])][,.(ind=c(ind,0), counts=c(counts,c[.N]-sum(counts))),cat][]
# cat ind counts
# 1: A 1 10
# 2: A 2 9
# 3: A 3 8
# 4: A 0 0
# 5: B 4 7
# 6: B 5 6
# 7: B 0 14
# 8: C 6 5
# 9: C 7 4
# 10: C 0 18
# 11: D 8 3
# 12: D 0 24
This may be a solution within one data.table call:
dt1[, rbind(.SD,
data.table(ind = 0,
counts = dt1[cat == 'A', sum(counts)] - sum(.SD$counts))),
by = cat]
Out:
cat ind counts
1: A 1 10
2: A 2 9
3: A 3 8
4: A 0 0
5: B 4 7
6: B 5 6
7: B 0 14
8: C 6 5
9: C 7 4
10: C 0 18
11: D 8 3
12: D 0 24
You said efficient, so... This has two by's; the unique is likely vectorized and the data.table by for sum should compile to a c for loop.
> dt1[, .SD
][, ca := sum(.SD[cat == 'A', counts])
][, cc := sum(counts), cat
][, cd := ca - cc
][, rbind(.SD, unique(.SD, by=c('cat'))[, `:=`(ind=0)])
][ind == 0, counts := cd
][, .(cat, ind, counts)
][order(cat, ind)
]
cat ind counts
1: A 0 0
2: A 1 10
3: A 2 9
4: A 3 8
5: B 0 14
6: B 4 7
7: B 5 6
8: C 0 18
9: C 6 5
10: C 7 4
11: D 0 24
12: D 8 3
>

Resources