dplyr: add numbers based on matching rows - r

Let's say I have
> fig
hands imp_spe n
1 A 0 39
2 A 1 32
3 B 0 3
4 B 1 2
5 C 0 115
6 C 1 24
7 D 0 11
8 D 1 3
I want to add a new column fig$new, that adds numbers in fig$n, but only when rows in fig$hands are matching.
I need to keep the dataframe as it is.
Expected output
> fig
hands imp_spe n new
1 A 0 39 71
2 A 1 32 71
3 B 0 3 5
4 B 1 2 5
5 C 0 115 139
6 C 1 24 139
7 D 0 11 14
8 D 1 3 14
I am looking for a solution in dplyr
fig <- structure(list(hands = c("A", "A", "B", "B", "C", "C", "D", "D"
), imp_spe = c(0, 1, 0, 1, 0, 1, 0, 1), n = c(39L, 32L, 3L, 2L,
115L, 24L, 11L, 3L)), row.names = c(NA, -8L), class = "data.frame")

here you go
library(dplyr)
fig %>%
group_by(hands) %>%
mutate(new = sum(n)) %>%
ungroup

dplyr solution:
dplyr::add_count(fig, hands, wt = n, name = 'new')
# hands imp_spe n new
# 1 A 0 39 71
# 2 A 1 32 71
# 3 B 0 3 5
# 4 B 1 2 5
# 5 C 0 115 139
# 6 C 1 24 139
# 7 D 0 11 14
# 8 D 1 3 14
base solution:
transform(
fig,
new = ave(x = n, hands, FUN = sum)
)
# hands imp_spe n new
# 1 A 0 39 71
# 2 A 1 32 71
# 3 B 0 3 5
# 4 B 1 2 5
# 5 C 0 115 139
# 6 C 1 24 139
# 7 D 0 11 14
# 8 D 1 3 14

Related

Conditional replacement of values in a column

I have the following:
ID Value1 Value2 Code
0001 3.3 432 A
0001 0 654 A
0001 0 63 A
0002 0 78 B
0002 1 98 B
0003 0 22 C
0003 0 65 C
0003 0 91 C
I need the following:
ID Value1 Value2 Code
0001 3.3 432 A
0001 0 0 A
0001 0 0 A
0002 0 0 B
0002 1 98 B
0003 0 22 C
0003 0 65 C
0003 0 91 C
i.e., for the same "Code" if there is at least one row with Value1 !=0 then all the other rows referred to the same Code will be set to 0 (meaning that 654 and 63 for 0001 relative to Value2 will be set to 0). If this is not the case (like for 0003 nothing will be done).
Can anyone help me please?
Thank you in advance
dplyr
library(dplyr)
quux %>%
group_by(Code) %>%
mutate(Value2 = if_else(abs(Value1) > 0 | !any(abs(Value1) > 0),
Value2, 0L)) %>%
ungroup()
# # A tibble: 8 x 4
# ID Value1 Value2 Code
# <int> <dbl> <int> <chr>
# 1 1 3.3 432 A
# 2 1 0 0 A
# 3 1 0 0 A
# 4 2 0 0 B
# 5 2 1 98 B
# 6 3 0 22 C
# 7 3 0 65 C
# 8 3 0 91 C
base R
quux |>
transform(Value2 = ifelse(ave(abs(Value1), Code, FUN = function(v) abs(v) > 0 | !any(abs(v) > 0)),
Value2, 0L))
# ID Value1 Value2 Code
# 1 1 3.3 432 A
# 2 1 0.0 0 A
# 3 1 0.0 0 A
# 4 2 0.0 0 B
# 5 2 1.0 98 B
# 6 3 0.0 22 C
# 7 3 0.0 65 C
# 8 3 0.0 91 C
data.table
library(data.table)
as.data.table(quux)[, Value2 := fifelse(abs(Value1) > 0 | !any(abs(Value1) > 0), Value2, 0L), by = Code][]
# ID Value1 Value2 Code
# <int> <num> <int> <char>
# 1: 1 3.3 432 A
# 2: 1 0.0 0 A
# 3: 1 0.0 0 A
# 4: 2 0.0 0 B
# 5: 2 1.0 98 B
# 6: 3 0.0 22 C
# 7: 3 0.0 65 C
# 8: 3 0.0 91 C
Data
quux <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), Value1 = c(3.3, 0, 0, 0, 1, 0, 0, 0), Value2 = c(432L, 654L, 63L, 78L, 98L, 22L, 65L, 91L), Code = c("A", "A", "A", "B", "B", "C", "C", "C")), class = "data.frame", row.names = c(NA, -8L))
This should do it:
df %>% group_by(Code) %>%
mutate(Value2 = if_else(row_number() == 1 & any(Value1 != 0), Value2, 0))
# A tibble: 8 × 4
# Groups: Code [3]
# ID Value1 Value2 Code
# <int> <dbl> <dbl> <fct>
# 1 1 3.3 432 A
# 2 1 0 0 A
# 3 1 0 0 A
# 4 2 0 78 B
# 5 2 1 0 B
# 6 3 0 0 C
# 7 3 0 0 C
# 8 3 0 0 C
We can use an if_else here. For example
library(dplyr)
dd %>%
group_by(ID) %>%
mutate(Value2=if_else(any(Value1!=0) & Value1==0, 0L, Value2))
Basically we use any() to check for non-zero values and then replace with 0s if one is found.

Group repeated values with random breaks in R

I have the following data frame called df (dput below):
group value
1 A 0
2 A 24
3 A 0
4 A 24
5 A 0
6 A 0
7 B 0
8 B 24
9 B 0
10 B 0
11 B 24
12 B 0
I would like to group the repeated values per group when the order is 0->24. Sometimes there is a random 0 with no 24 after. The desired output should look like this:
group value subgroup
1 A 0 1
2 A 24 1
3 A 0 2
4 A 24 2
5 A 0 3
6 A 0 4
7 B 0 1
8 B 24 1
9 B 0 2
10 B 0 3
11 B 24 3
12 B 0 4
As you can see for rows 5 and 9 there is no 24 after it, that's why they have grouped alone. So I was wondering if anyone knows how to group repeated values with some random breaks in R?
dput df:
df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), value = c(0, 24, 0, 24, 0, 0, 0, 24, 0,
0, 24, 0)), class = "data.frame", row.names = c(NA, -12L))
Looks like the subgroup increments whenever there is a 0 value:
df %>%
group_by(group) %>%
mutate(subgroup = cumsum(value == 0)) %>%
ungroup()
# # A tibble: 12 × 3
# group value subgroup
# <chr> <dbl> <int>
# 1 A 0 1
# 2 A 24 1
# 3 A 0 2
# 4 A 24 2
# 5 A 0 3
# 6 A 0 4
# 7 B 0 1
# 8 B 24 1
# 9 B 0 2
# 10 B 0 3
# 11 B 24 3
# 12 B 0 4

Find overlaps in time intervals by group and return subsetted data.frame

Say I have this dataframe, which has two IDs (1/2) with their start and end times in three different zones (A/B/C):
df <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), zone = c("A",
"B", "A", "C", "B", "A", "B", "A", "B", "C"), start = c(0, 6,
7, 8, 10, 0, 3, 5, 6, 7), end = c(6, 7, 8, 10, 11, 3, 5, 6, 7,
11)), row.names = c(NA, -10L), class = "data.frame")
df
id zone start end
1 1 A 0 6
2 1 B 6 7
3 1 A 7 8
4 1 C 8 10
5 1 B 10 11
6 2 A 0 3
7 2 B 3 5
8 2 A 5 6
9 2 B 6 7
10 2 C 7 11
If we look at each zone, we can visually inspect the times when IDs are in the same zone and when they are not:
split(df,df$zone)
$A
id zone start end
1 1 A 0 6
3 1 A 7 8
6 2 A 0 3
8 2 A 5 6
$B
id zone start end
2 1 B 6 7
5 1 B 10 11
7 2 B 3 5
9 2 B 6 7
$C
id zone start end
4 1 C 8 10
10 2 C 7 11
e.g. Both 1 and 2 are together in zone A from 0-3, and from 5-6, but not at other times.
Desired Output
I want to extract three dataframes.
A dataframe showing the times and zones they are together:
zone start end id
1 A 0 3 1-2
2 A 5 6 1-2
3 B 6 7 1-2
4 C 8 10 1-2
2 & 3: Dataframes for times when they are not together:
#id=1
zone start end
1 A 3 5
2 A 7 8
3 B 10 11
#id=2
zone start end
1 B 3 5
2 C 7 8
3 C 10 11
I have been trying to work with foverlaps from data.table and the intervals package, but can't seem to work out the correct method.
e.g. Subsetting each zone/id, I can sort of get an output that includes overlaps, but it doesn't seem to be quite the right direction:
A <- split(df,df$zone)$A
Asp <- split(A,A$id)
x <- setDT(Asp[[1]])
y <- setDT(Asp[[2]])
setkey(y, start, end)
foverlaps(x, y, type="any")
id zone start end i.id i.zone i.start i.end
1: 2 A 0 3 1 A 0 6
2: 2 A 5 6 1 A 0 6
3: NA <NA> NA NA 1 A 7 8
Any help greatly appreciated.
EDIT: Extra example dataset that seemed to bring up some issues with current suggested solutions:
df2 <- structure(list(start = c(0, 5, 6, 8, 10, 13, 15, 20, 22, 26,
29, 37, 40, 42, 0, 3, 6, 9, 15, 20, 25, 33, 35, 40), id = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2), zone = c("A", "B", "A", "D", "C", "B", "C", "B", "A",
"B", "A", "D", "C", "D", "A", "B", "C", "D", "A", "B", "C", "B",
"A", "D"), end = c(5, 6, 8, 10, 13, 15, 20, 22, 26, 29, 37, 40,
42, 45, 3, 6, 9, 15, 20, 25, 33, 35, 40, 45)), class = c("data.table", "data.frame"), row.names = c(NA, -24L))
df2
start id zone end
1: 0 1 A 5
2: 5 1 B 6
3: 6 1 A 8
4: 8 1 D 10
5: 10 1 C 13
6: 13 1 B 15
7: 15 1 C 20
8: 20 1 B 22
9: 22 1 A 26
10: 26 1 B 29
11: 29 1 A 37
12: 37 1 D 40
13: 40 1 C 42
14: 42 1 D 45
15: 0 2 A 3
16: 3 2 B 6
17: 6 2 C 9
18: 9 2 D 15
19: 15 2 A 20
20: 20 2 B 25
21: 25 2 C 33
22: 33 2 B 35
23: 35 2 A 40
24: 40 2 D 45
start id zone end
This seems to work, filtering the foverlaps output:
DT = data.table(df)
setkey(DT, start, end)
oDT0 = foverlaps(DT[id==1], DT[id==2])
oDT0[, `:=`(
ostart = pmax(start, i.start),
oend = pmin(end, i.end)
)]
oDT = oDT0[ostart < oend]
# together
oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
# ids zone ostart oend
# 1: 1-2 A 0 3
# 2: 1-2 A 5 6
# 3: 1-2 B 6 7
# 4: 1-2 C 8 10
# apart
oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
# id zone i.id i.zone ostart oend
# 1: 2 B 1 A 3 5
# 2: 2 C 1 A 7 8
# 3: 2 C 1 B 10 11
Repeating for new input... not sure if it's correct since no expected output was provided:
> DT = data.table(df2)
> ...
> oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
ids zone ostart oend
1: 1-2 A 0 3
2: 1-2 B 5 6
3: 1-2 D 9 10
4: 1-2 B 20 22
5: 1-2 A 35 37
6: 1-2 D 42 45
> oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
id zone i.id i.zone ostart oend
1: 2 B 1 A 3 5
2: 2 C 1 A 6 8
3: 2 C 1 D 8 9
4: 2 D 1 C 10 13
5: 2 D 1 B 13 15
6: 2 A 1 C 15 20
7: 2 B 1 A 22 25
8: 2 C 1 A 25 26
9: 2 C 1 B 26 29
10: 2 C 1 A 29 33
11: 2 B 1 A 33 35
12: 2 A 1 D 37 40
13: 2 D 1 C 40 42
I suspect there is a way to pass arguments to foverlaps to avoid needing to define and filter by ostart and oend. As of the latest CRAN version of the package, the doc indicates that minoverlap is not yet implemented, so maybe it is necessary for now.
I think you are almost there. You can try the code below by defining a function f
f <- function(A) {
Asp <- split(A, by = "id")
u <- na.omit(foverlaps(Asp[[1]], setkey(Asp[[2]], start, end)))
r <- c()
for (k in 1:nrow(u)) {
if (u[k, end - start < i.end - i.start]) {
p <- u[k, .(start, end)]
} else {
p <- u[k, .(start = i.start, end = i.end)]
}
r[[k]] <- p
}
cbind(
zone = u[, zone],
rbindlist(r),
id = paste0(unique(A[, id]), collapse = "-")
)
}
and then run
rbindlist(Map(f, split(setDT(df), by = "zone")))
which gives
> rbindlist(Map(f, split(setDT(df), by = "zone")))
zone start end id
1: A 0 3 1-2
2: A 5 6 1-2
3: B 6 7 1-2
4: C 8 10 1-2
For the first data.frame, you can also use a non-equi join:
ovlap <- df[df, on=.(zone, id<id, start<end, end>start), nomatch=0L,
.(zone, id2=i.id, i.start, i.end, id1=x.id, x.start, x.end)][,
.(start=max(x.start, i.start), end=min(x.end, i.end)),
.(zone, id1, id2, i.start)][,
i.start := NULL][]
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 5 6
#3: B 1 2 6 7
#4: C 1 2 8 10
For the other output data.frames, you can perform a non-equi join first with the previous result and then for each interval find the sub-intervals where the other partner is not around:
rangeDiff <- function(DT) {
DT[,
if (is.na(x.start[1L])) {
.(start=i.start, end=i.end)
} else {
.(start=c(i.start, x.end+1L),
end=c(x.start-1L, i.end))
},
.(zone, id, i.start, i.end)][
start<=end][,
c("i.start","i.end") := NULL][]
} #rangeDiff
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: A 1 4 4
#2: A 1 7 8
#3: B 1 10 11
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: B 2 3 5
#2: C 2 7 7
#3: C 2 11 11
There is some inconsistency in OP where the bounds of intervals are inclusive or exclusive. I have used inclusive when both ids are around in the same zone (i.e. in the first output data.frame).
Edit: show output for df2
ovlap
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 35 37
#3: B 1 2 5 6
#4: B 1 2 20 22
#5: D 1 2 9 10
#6: D 1 2 42 45
other required data.frames:
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 1 4 5
# 2: A 1 6 8
# 3: A 1 22 26
# 4: A 1 29 34
# 5: B 1 13 15
# 6: B 1 26 29
# 7: C 1 10 13
# 8: C 1 15 20
# 9: C 1 40 42
# 10: D 1 8 8
# 11: D 1 37 40
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 2 15 20
# 2: A 2 38 40
# 3: B 2 3 4
# 4: B 2 23 25
# 5: B 2 33 35
# 6: C 2 6 9
# 7: C 2 25 33
# 8: D 2 11 15
# 9: D 2 40 41
df2 sorted by zone for easier checking:
start id zone end
1: 0 1 A 5
2: 6 1 A 8
3: 22 1 A 26
4: 29 1 A 37
5: 0 2 A 3
6: 15 2 A 20
7: 35 2 A 40
8: 5 1 B 6
9: 13 1 B 15
10: 20 1 B 22
11: 26 1 B 29
12: 3 2 B 6
13: 20 2 B 25
14: 33 2 B 35
15: 10 1 C 13
16: 15 1 C 20
17: 40 1 C 42
18: 6 2 C 9
19: 25 2 C 33
20: 8 1 D 10
21: 37 1 D 40
22: 42 1 D 45
23: 9 2 D 15
24: 40 2 D 45
Updated Solution
I have made some modifications to the previous solution so that it works with the newly presented data set df2:
I tried to create all combinations of id == 1 and id == 2 in every zone to try and find their intersects
Then I created a custom function to take a subset of our data set plus a pair of ids to extract their start to end values so that we have two vectors and we can find their intersects easily
` In the end I applied this function to every subset of our data set
library(dplyr)
library(tidyr)
library(purrr)
fn <- function(data, x, y) {
base::intersect(data %>%
filter(row_number() == x) %>%
select(start, end) %>%
{map2(.$start, .$end, ~ .x:.y)} %>%
unlist(),
data %>%
filter(row_number() == y) %>%
select(start, end) %>%
{map2(.$start, .$end, ~ .x:.y)} %>%
unlist())
}
Then we apply it on our data set:
split(df2, df2$zone) %>%
map(~ .x %>%
mutate(grp = row_number()) %>%
{expand.grid(.$grp[.$id == 1], .$grp[.$id == 2])} %>%
rowwise() %>%
mutate(insec = list(fn(.x, Var1, Var2))) %>%
filter(length(insec) != 0) %>%
unnest(cols = c(insec)) %>%
group_by(Var1, Var2) %>%
filter(row_number() == 1 | row_number() == n()) %>%
filter(n() > 1) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = insec) %>%
ungroup()) %>%
keep(~ nrow(.x) != 0) %>%
imap_dfr(~ .x %>%
mutate(zone
= .y) %>%
select(!starts_with("Var"))) %>%
relocate(zone) %>%
rename(start = `1`, end = `2`)
# A tibble: 6 x 3
zone start end
<chr> <int> <int>
1 A 0 3
2 A 35 37
3 B 5 6
4 B 20 22
5 D 9 10
6 D 42 45

How to collapse Table while ignore a specific value in R?

I have a data frame like the following:
> example
name X1.8 X1.8.1 X1.8.2
1 a 1 1 7
2 b 33 0 2
3 c 3 10 -1
4 a -1 -1 4
5 d 5 8 5
6 e 7 6 12
7 a -1 7 7
8 c 5 20 9
and I want to collapse(sum) the row with the same name (column 1) but ignore the value -1 while collapsing (summing). For example, the example above would become:
> example # the goal
name X1.8 X1.8.1 X1.8.2
1 a 1 8 18
2 b 33 0 2
3 c 8 30 9
4 d 5 8 5
5 e 7 6 12
> dput(example)
structure(list(name = structure(c(1L, 2L, 3L, 1L, 4L, 5L, 1L,
3L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
X1.8 = c(1, 33, 3, -1, 5, 7, -1, 5), X1.8.1 = c(1, 0, 10,
-1, 8, 6, 7, 20), X1.8.2 = c(7, 2, -1, 4, 5, 12, 7, 9)), row.names = c(NA,
8L), class = "data.frame")
Edit for question:
will this work if there are some rows with -1? For example,
> example
name X1.8 X1.8.1 X1.8.2
1 a 1 1 7
2 b 33 0 2
3 c 3 10 -1
4 a -1 -1 4
5 d 5 8 5
6 e 7 6 12
7 a -1 7 7
8 c 5 20 9
9 f -1 -1 -1
You can remove -1 and sum rest of the values.
Using base R :
aggregate(.~name, example, function(x) sum(x[x!=-1]))
# name X1.8 X1.8.1 X1.8.2
#1 a 1 8 18
#2 b 33 0 2
#3 c 8 30 9
#4 d 5 8 5
#5 e 7 6 12
In dplyr :
library(dplyr)
example %>%
group_by(name) %>%
summarise(across(everything(), ~sum(.[. != -1])))
and data.table :
library(data.table)
setDT(example)[, lapply(.SD, function(x) sum(x[x!=-1])), name]
As you are calculating sums you can set the -1 you want to ignore to 0 and use rowsum to get the sum per group.
x[x==-1] <- 0
rowsum(x[-1], x[,1])
# X1.8 X1.8.1 X1.8.2
#a 1 8 18
#b 33 0 2
#c 8 30 9
#d 5 8 5
#e 7 6 12
another option is to set -1 to NA
x[x==-1] <- NA
rowsum(x[-1], x[,1], na.rm = TRUE)

take the last value in data frame by factor category

I have a data frame like this:
a=c("A","A","A","A","B","B","C","C","C","D","D")
b=c(1,2,3,4,1,2,1,2,3,1,2)
c=c(1345,645,75,8,95,678,598,95,75,4,53)
mydf <- data.frame(a,b,c) # edit note: do _not_ use cbind inside data.frame
My aim is to get add an extra column on the new data frame which will take the last value of column "c" taking into account the factor in column "a".
More specifically, in this examlpe the end result is like this:
a b c d
1 A 1 1345 0
2 A 2 645 0
3 A 3 75 0
4 A 4 8 8
5 B 1 95 0
6 B 2 678 678
7 C 1 598 0
8 C 2 95 0
9 C 3 75 75
10 D 1 4 0
11 D 2 53 53
If you don't need your variables to be all fators, there's a nice solution with dplyr:
df <- data.frame(a = c("A","A","A","A","B","B","C","C","C","D","D"),
b=c(1,2,3,4,1,2,1,2,3,1,2),
c=c(1345,645,75,8,95,678,598,95,75,4,53),stringsAsFactors = F)
library(dplyr)
df <- tbl_df(df)
df %>% group_by(a)%>%
mutate(d = ifelse(b == max(b),c[which(b == max(b))],0))
# A tibble: 11 x 4
# Groups: a [4]
a b c d
<chr> <dbl> <dbl> <dbl>
1 A 1 1345 0
2 A 2 645 0
3 A 3 75 0
4 A 4 8 8
5 B 1 95 0
6 B 2 678 678
7 C 1 598 0
8 C 2 95 0
9 C 3 75 75
10 D 1 4 0
11 D 2 53 53
Using data.table:
library(data.table)
df <- data.frame(a,b,c)
setDT(df)
df[, idx := .N, by = a]
df[, id := 1:.N, by = a]
df <- df[id == idx, d := c]
df[, c("id", "idx") := NULL]
df[is.na(df)] <- 0
a b c d
1: A 1 1345 0
2: A 2 645 0
3: A 3 75 0
4: A 4 8 8
5: B 1 95 0
6: B 2 678 678
7: C 1 598 0
8: C 2 95 0
9: C 3 75 75
10: D 1 4 0
11: D 2 53 53

Resources