How to sum a df while ignoring a value in R?

This is an extension of my previous question. I reviewed the linked duplicate, but I am still having trouble.
I have a data frame like the following:
> example
  name X1.8 X1.8.1 X1.8.2
1    a   -1      1      7
2    b   33      0      2
3    c    3     10     -1
4    a   -1     -1      4
5    d    5      8      5
6    e    7      6     12
7    a   -1      7      7
8    c    5     20      9
9    f   -1     -1     -1
and I want to collapse (sum) the rows with the same name (column 1) but ignore the value -1 while summing (-1 plays the role of NA here). For example, the example above would become:
> example # the goal
  name X1.8 X1.8.1 X1.8.2
1    a   -1      8     18   # X1.8 stays -1 because all of a's values there are -1
2    b   33      0      2
3    c    8     30      9
4    d    5      8      5
5    e    7      6     12
6    f   -1     -1     -1
> dput(example)
structure(list(name = structure(c(1L, 2L, 3L, 1L, 4L, 5L, 1L,
3L, 6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
X1.8 = c(-1, 33, 3, -1, 5, 7, -1, 5, -1), X1.8.1 = c(1, 0,
10, -1, 8, 6, 7, 20, -1), X1.8.2 = c(7, 2, -1, 4, 5, 12,
7, 9, -1)), row.names = c(NA, 9L), class = "data.frame")

We can use an if/else after doing the group_by, i.e. after grouping by 'name', summarise across all the other columns (dplyr 1.0.0): if all values in a group are -1, return -1, or else get the sum of the values excluding -1.
library(dplyr) # 1.0.0
example %>%
  group_by(name) %>%
  summarise(across(everything(),
                   ~ if (all(. == -1)) -1 else sum(.[. != -1], na.rm = TRUE)))
# A tibble: 6 x 4
# name X1.8 X1.8.1 X1.8.2
# <fct> <dbl> <dbl> <dbl>
#1 a -1 8 18
#2 b 33 0 2
#3 c 8 30 9
#4 d 5 8 5
#5 e 7 6 12
#6 f -1 -1 -1
An option is also to use na_if to replace the -1 with NA and then make use of na.rm = TRUE in sum. But we have avoided that route in case there are actual NAs in the dataset for a particular group; keeping -1 distinct helps identify it as such.
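For completeness, a minimal sketch of that na_if route (not used above, because a group like 'f' that is entirely -1 would then sum to 0 rather than stay -1):
library(dplyr)
example %>%
  group_by(name) %>%
  summarise(across(everything(), ~ sum(na_if(., -1), na.rm = TRUE)))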
Or with summarise_at:
example %>%
  group_by(name) %>%
  summarise_at(vars(-group_cols()),
               ~ if (all(. == -1)) -1 else sum(.[. != -1], na.rm = TRUE))
# A tibble: 6 x 4
# name X1.8 X1.8.1 X1.8.2
# <fct> <dbl> <dbl> <dbl>
#1 a -1 8 18
#2 b 33 0 2
#3 c 8 30 9
#4 d 5 8 5
#5 e 7 6 12
#6 f -1 -1 -1

This solution could help you:
library(dplyr)
#Format
example[example == -1] <- NA
#Aggregate
example %>% group_by(name) %>% summarise_all(sum, na.rm = TRUE)
# A tibble: 6 x 4
name X1.8 X1.8.1 X1.8.2
<fct> <dbl> <dbl> <dbl>
1 a 0 8 18
2 b 33 0 2
3 c 8 30 9
4 d 5 8 5
5 e 7 6 12
6 f 0 0 0
Note that with this approach, groups where every value was -1 (column X1.8 for a, and all of f) come out as 0 rather than -1 as in the goal.

base R
aggregate(x = example[, 2:4],
          by = list(name = example$name),
          FUN = function(x) ifelse(all(x == -1), -1, sum(x[x != -1])))
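No output was shown for this one; it should reproduce the goal above, i.e.:
#  name X1.8 X1.8.1 X1.8.2
#1    a   -1      8     18
#2    b   33      0      2
#3    c    8     30      9
#4    d    5      8      5
#5    e    7      6     12
#6    f   -1     -1     -1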

Conditional replacing of a numeric value in dplyr

Dear all, I have a data frame that looks like this:
df <- data.frame(time = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
                 type = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
                 count = c(10, 0, 0, 1, 8, 0, 1, 0, 1))
df
time type count
1 1 A 10
2 2 A 0
3 3 A 0
4 4 A 1
5 1 B 8
6 2 B 0
7 3 B 1
8 4 B 0
9 5 B 1
I want to examine each group of types and, if I see that one count is 0, replace every later count in time with 0; I do not want the count to be resurrected after the zero.
I want my data to look like this:
time type count
1 1 A 10
2 2 A 0
3 3 A 0
4 4 A 0
5 1 B 8
6 2 B 0
7 3 B 0
8 4 B 0
9 5 B 0
If I understood correctly:
library(tidyverse)
df <- data.frame(
  time = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  type = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  count = c(10, 0, 0, 1, 8, 0, 1, 0, 1)
)
df %>%
  group_by(type) %>%
  mutate(count = if_else(lag(count, default = first(count)) == 0, 0, count))
#> # A tibble: 9 x 3
#> # Groups: type [2]
#> time type count
#> <dbl> <chr> <dbl>
#> 1 1 A 10
#> 2 2 A 0
#> 3 3 A 0
#> 4 4 A 0
#> 5 1 B 8
#> 6 2 B 0
#> 7 3 B 0
#> 8 4 B 0
#> 9 5 B 0
Created on 2021-09-10 by the reprex package (v2.0.1)
You may use the cummin (cumulative minimum) function.
library(dplyr)
df %>% group_by(type) %>% mutate(count = cummin(count))
# time type count
# <dbl> <chr> <dbl>
#1 1 A 10
#2 2 A 0
#3 3 A 0
#4 4 A 0
#5 1 B 8
#6 2 B 0
#7 3 B 0
#8 4 B 0
#9 5 B 0
Since cummin is a base R function, you may also implement this in base R:
transform(df, count = ave(count, type, FUN = cummin))
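To see why cummin fits here: the counts are non-negative, so once a 0 appears in a group the running minimum is pinned at 0 for every later time step. A quick illustration on a plain vector:
cummin(c(10, 0, 0, 1, 5))
# [1] 10  0  0  0  0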

Find overlaps in time intervals by group and return subsetted data.frame

Say I have this dataframe, which has two IDs (1/2) with their start and end times in three different zones (A/B/C):
df <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), zone = c("A",
"B", "A", "C", "B", "A", "B", "A", "B", "C"), start = c(0, 6,
7, 8, 10, 0, 3, 5, 6, 7), end = c(6, 7, 8, 10, 11, 3, 5, 6, 7,
11)), row.names = c(NA, -10L), class = "data.frame")
df
id zone start end
1 1 A 0 6
2 1 B 6 7
3 1 A 7 8
4 1 C 8 10
5 1 B 10 11
6 2 A 0 3
7 2 B 3 5
8 2 A 5 6
9 2 B 6 7
10 2 C 7 11
If we look at each zone, we can visually inspect the times when IDs are in the same zone and when they are not:
split(df,df$zone)
$A
id zone start end
1 1 A 0 6
3 1 A 7 8
6 2 A 0 3
8 2 A 5 6
$B
id zone start end
2 1 B 6 7
5 1 B 10 11
7 2 B 3 5
9 2 B 6 7
$C
id zone start end
4 1 C 8 10
10 2 C 7 11
e.g. Both 1 and 2 are together in zone A from 0-3, and from 5-6, but not at other times.
Desired Output
I want to extract three dataframes.
A dataframe showing the times and zones they are together:
zone start end id
1 A 0 3 1-2
2 A 5 6 1-2
3 B 6 7 1-2
4 C 8 10 1-2
2 & 3: Dataframes for times when they are not together:
#id=1
zone start end
1 A 3 5
2 A 7 8
3 B 10 11
#id=2
zone start end
1 B 3 5
2 C 7 8
3 C 10 11
I have been trying to work with foverlaps from data.table and the intervals package, but can't seem to work out the correct method.
e.g. Subsetting each zone/id, I can sort of get an output that includes overlaps, but it doesn't seem to be quite the right direction:
A <- split(df,df$zone)$A
Asp <- split(A,A$id)
x <- setDT(Asp[[1]])
y <- setDT(Asp[[2]])
setkey(y, start, end)
foverlaps(x, y, type="any")
id zone start end i.id i.zone i.start i.end
1: 2 A 0 3 1 A 0 6
2: 2 A 5 6 1 A 0 6
3: NA <NA> NA NA 1 A 7 8
Any help greatly appreciated.
EDIT: Extra example dataset that seemed to bring up some issues with current suggested solutions:
df2 <- structure(list(start = c(0, 5, 6, 8, 10, 13, 15, 20, 22, 26,
29, 37, 40, 42, 0, 3, 6, 9, 15, 20, 25, 33, 35, 40), id = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2), zone = c("A", "B", "A", "D", "C", "B", "C", "B", "A",
"B", "A", "D", "C", "D", "A", "B", "C", "D", "A", "B", "C", "B",
"A", "D"), end = c(5, 6, 8, 10, 13, 15, 20, 22, 26, 29, 37, 40,
42, 45, 3, 6, 9, 15, 20, 25, 33, 35, 40, 45)), class = c("data.table", "data.frame"), row.names = c(NA, -24L))
df2
start id zone end
1: 0 1 A 5
2: 5 1 B 6
3: 6 1 A 8
4: 8 1 D 10
5: 10 1 C 13
6: 13 1 B 15
7: 15 1 C 20
8: 20 1 B 22
9: 22 1 A 26
10: 26 1 B 29
11: 29 1 A 37
12: 37 1 D 40
13: 40 1 C 42
14: 42 1 D 45
15: 0 2 A 3
16: 3 2 B 6
17: 6 2 C 9
18: 9 2 D 15
19: 15 2 A 20
20: 20 2 B 25
21: 25 2 C 33
22: 33 2 B 35
23: 35 2 A 40
24: 40 2 D 45
start id zone end
This seems to work, filtering the foverlaps output:
DT = data.table(df)
setkey(DT, start, end)
oDT0 = foverlaps(DT[id==1], DT[id==2])
oDT0[, `:=`(
  ostart = pmax(start, i.start),
  oend = pmin(end, i.end)
)]
oDT = oDT0[ostart < oend]
# together
oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
# ids zone ostart oend
# 1: 1-2 A 0 3
# 2: 1-2 A 5 6
# 3: 1-2 B 6 7
# 4: 1-2 C 8 10
# apart
oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
# id zone i.id i.zone ostart oend
# 1: 2 B 1 A 3 5
# 2: 2 C 1 A 7 8
# 3: 2 C 1 B 10 11
Repeating for the new input... not sure if it's correct, since no expected output was provided:
> DT = data.table(df2)
> ...
> oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
ids zone ostart oend
1: 1-2 A 0 3
2: 1-2 B 5 6
3: 1-2 D 9 10
4: 1-2 B 20 22
5: 1-2 A 35 37
6: 1-2 D 42 45
> oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
id zone i.id i.zone ostart oend
1: 2 B 1 A 3 5
2: 2 C 1 A 6 8
3: 2 C 1 D 8 9
4: 2 D 1 C 10 13
5: 2 D 1 B 13 15
6: 2 A 1 C 15 20
7: 2 B 1 A 22 25
8: 2 C 1 A 25 26
9: 2 C 1 B 26 29
10: 2 C 1 A 29 33
11: 2 B 1 A 33 35
12: 2 A 1 D 37 40
13: 2 D 1 C 40 42
I suspect there is a way to pass arguments to foverlaps to avoid having to define and filter by ostart and oend. As of the latest CRAN version of the package, the docs indicate that minoverlap is not yet implemented, so for now the manual filter may be necessary (the non-equi join in a later answer is another way around it).
I think you are almost there. You can try the code below, which defines a function f that, for a single zone, pairs the two ids' intervals with foverlaps and keeps the shorter interval of each overlapping pair as the shared span.
f <- function(A) {
  Asp <- split(A, by = "id")
  u <- na.omit(foverlaps(Asp[[1]], setkey(Asp[[2]], start, end)))
  r <- c()
  for (k in 1:nrow(u)) {
    if (u[k, end - start < i.end - i.start]) {
      p <- u[k, .(start, end)]
    } else {
      p <- u[k, .(start = i.start, end = i.end)]
    }
    r[[k]] <- p
  }
  cbind(
    zone = u[, zone],
    rbindlist(r),
    id = paste0(unique(A[, id]), collapse = "-")
  )
}
and then run
rbindlist(Map(f, split(setDT(df), by = "zone")))
which gives
> rbindlist(Map(f, split(setDT(df), by = "zone")))
zone start end id
1: A 0 3 1-2
2: A 5 6 1-2
3: B 6 7 1-2
4: C 8 10 1-2
For the first data.frame, you can also use a non-equi join:
ovlap <- df[df, on=.(zone, id<id, start<end, end>start), nomatch=0L,
            .(zone, id2=i.id, i.start, i.end, id1=x.id, x.start, x.end)][,
            .(start=max(x.start, i.start), end=min(x.end, i.end)),
            .(zone, id1, id2, i.start)][,
            i.start := NULL][]
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 5 6
#3: B 1 2 6 7
#4: C 1 2 8 10
For the other output data.frames, you can perform a non-equi join first with the previous result and then for each interval find the sub-intervals where the other partner is not around:
rangeDiff <- function(DT) {
  DT[,
     if (is.na(x.start[1L])) {
       .(start = i.start, end = i.end)
     } else {
       .(start = c(i.start, x.end + 1L),
         end = c(x.start - 1L, i.end))
     },
     .(zone, id, i.start, i.end)][
       start <= end][,
       c("i.start", "i.end") := NULL][]
} #rangeDiff
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: A 1 4 4
#2: A 1 7 8
#3: B 1 10 11
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: B 2 3 5
#2: C 2 7 7
#3: C 2 11 11
There is some inconsistency in the OP as to whether interval bounds are inclusive or exclusive. I have used inclusive bounds when both ids are around in the same zone (i.e. in the first output data.frame).
Edit: show output for df2
ovlap
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 35 37
#3: B 1 2 5 6
#4: B 1 2 20 22
#5: D 1 2 9 10
#6: D 1 2 42 45
other required data.frames:
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 1 4 5
# 2: A 1 6 8
# 3: A 1 22 26
# 4: A 1 29 34
# 5: B 1 13 15
# 6: B 1 26 29
# 7: C 1 10 13
# 8: C 1 15 20
# 9: C 1 40 42
# 10: D 1 8 8
# 11: D 1 37 40
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 2 15 20
# 2: A 2 38 40
# 3: B 2 3 4
# 4: B 2 23 25
# 5: B 2 33 35
# 6: C 2 6 9
# 7: C 2 25 33
# 8: D 2 11 15
# 9: D 2 40 41
df2 sorted by zone for easier checking:
start id zone end
1: 0 1 A 5
2: 6 1 A 8
3: 22 1 A 26
4: 29 1 A 37
5: 0 2 A 3
6: 15 2 A 20
7: 35 2 A 40
8: 5 1 B 6
9: 13 1 B 15
10: 20 1 B 22
11: 26 1 B 29
12: 3 2 B 6
13: 20 2 B 25
14: 33 2 B 35
15: 10 1 C 13
16: 15 1 C 20
17: 40 1 C 42
18: 6 2 C 9
19: 25 2 C 33
20: 8 1 D 10
21: 37 1 D 40
22: 42 1 D 45
23: 9 2 D 15
24: 40 2 D 45
Updated Solution
I have made some modifications to the previous solution so that it works with the newly presented data set df2:
- I tried to create all combinations of id == 1 and id == 2 in every zone to find their intersects
- Then I created a custom function that takes a subset of our data set plus a pair of row ids and expands their start to end values into two vectors, so that we can find their intersect easily
- In the end I applied this function to every zone subset of our data set
library(dplyr)
library(tidyr)
library(purrr)
fn <- function(data, x, y) {
  base::intersect(data %>%
                    filter(row_number() == x) %>%
                    select(start, end) %>%
                    {map2(.$start, .$end, ~ .x:.y)} %>%
                    unlist(),
                  data %>%
                    filter(row_number() == y) %>%
                    select(start, end) %>%
                    {map2(.$start, .$end, ~ .x:.y)} %>%
                    unlist())
}
Then we apply it on our data set:
split(df2, df2$zone) %>%
  map(~ .x %>%
        mutate(grp = row_number()) %>%
        {expand.grid(.$grp[.$id == 1], .$grp[.$id == 2])} %>%
        rowwise() %>%
        mutate(insec = list(fn(.x, Var1, Var2))) %>%
        filter(length(insec) != 0) %>%
        unnest(cols = c(insec)) %>%
        group_by(Var1, Var2) %>%
        filter(row_number() == 1 | row_number() == n()) %>%
        filter(n() > 1) %>%
        mutate(id = row_number()) %>%
        pivot_wider(names_from = id, values_from = insec) %>%
        ungroup()) %>%
  keep(~ nrow(.x) != 0) %>%
  imap_dfr(~ .x %>%
             mutate(zone = .y) %>%
             select(!starts_with("Var"))) %>%
  relocate(zone) %>%
  rename(start = `1`, end = `2`)
# A tibble: 6 x 3
zone start end
<chr> <int> <int>
1 A 0 3
2 A 35 37
3 B 5 6
4 B 20 22
5 D 9 10
6 D 42 45
(One caveat: fn expands each interval with .x:.y, so this intersect-based approach assumes start/end times that fall on whole-number steps.)

How to add row sums and column sums at the same time by subject, and then arrange subjects by total

I have a dataset of subjects, dates, and counts for columns A to D (built below). If I want to get the subtotal by subject as well as for each date, and also arrange the subjects by their totals, what should I do?
The final output should look like the following: a sub-total row per subject and a Total column are added, and since total ELA (23) < total Math (47), ELA comes before Math.
The sample table can be built using this code:
df <- structure(list(Subject = c("Math", "Math", "Math", "Math", "ELA",
"ELA", "ELA"), date = c(1, 7, 14, 21, 1, 7, 21), A = c(1, 2,
0, 9, 2, 6, 0), B = c(3, 5, 5, 1, 0, 5, 0), C = c(2, 1, 0, 8,
0, 0, 0), D = c(0, 0, 2, 8, 0, 8, 2)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Use the package janitor to do it quickly:
library(dplyr)
library(purrr)
library(janitor)
df %>%
  mutate(date = as.character(date)) %>%
  group_split(Subject) %>%
  map_df(janitor::adorn_totals, fill = "All Dates", name = "Sub-Total") %>%
  adorn_totals(where = "col")
Subject date A B C D Total
ELA 1 2 0 0 0 2
ELA 7 6 5 0 8 19
ELA 21 0 0 0 2 2
Sub-Total All Dates 8 5 0 10 23
Math 1 1 3 2 0 6
Math 7 2 5 1 0 8
Math 14 0 5 0 2 7
Math 21 9 1 8 8 26
Sub-Total All Dates 12 14 11 10 47
If you don't change the date column to character, it will be totalled as well.
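For instance, a sketch of what happens without the conversion (not run in the original answer): date stays numeric, so adorn_totals sums it into the sub-total rows too.
df %>%
  group_split(Subject) %>%
  map_df(janitor::adorn_totals, name = "Sub-Total")
# the ELA sub-total row would then show date = 1 + 7 + 21 = 29 instead of "All Dates"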
Here is a base R solution. The main functions are by and addmargins, to compute the totals per group of Subject and the row totals; a second loop (lapply) then moves the column totals into the first row of each group.
The rest of the code puts everything together.
res <- by(df[-1], df[1], FUN = function(x){
  x <- as.matrix(x)
  rownames(x) <- x[, 1]
  addmargins(x[, -1], margin = 1:2)
})
res <- lapply(seq_along(res), function(i){
  x <- as.data.frame(res[[i]])
  row.names(x)[row.names(x) == "Sum"] <- "All dates"
  y <- cbind.data.frame(Subject = names(res)[i], date = row.names(x), x)
  names(y)[ncol(y)] <- "Total"
  y[order(y[["Total"]], decreasing = TRUE), ]
})
i <- sapply(res, '[', 1, "Total")
res <- do.call(rbind.data.frame, res[order(i, decreasing = TRUE)])
row.names(res) <- NULL
res
res
# Subject date A B C D Total
#1 Math All dates 12 14 11 10 47
#2 Math 21 9 1 8 8 26
#3 Math 7 2 5 1 0 8
#4 Math 14 0 5 0 2 7
#5 Math 1 1 3 2 0 6
#6 ELA All dates 8 5 0 10 23
#7 ELA 7 6 5 0 8 19
#8 ELA 1 2 0 0 0 2
#9 ELA 21 0 0 0 2 2
Here's a dplyr way:
library(dplyr)
c_order <- c('All dates', 1, 7, 14, 21)
df %>%
  group_by(Subject) %>%
  summarise(across(A:D, sum)) %>%
  mutate(date = 'All dates', .after = 'Subject') %>%
  bind_rows(df %>% mutate(date = as.character(date))) %>%
  arrange(Subject, match(date, c_order)) %>%
  mutate(Total = rowSums(select(., A:D)))
# Subject date A B C D Total
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 ELA All dates 8 5 0 10 23
#2 ELA 1 2 0 0 0 2
#3 ELA 7 6 5 0 8 19
#4 ELA 21 0 0 0 2 2
#5 Math All dates 12 14 11 10 47
#6 Math 1 1 3 2 0 6
#7 Math 7 2 5 1 0 8
#8 Math 14 0 5 0 2 7
#9 Math 21 9 1 8 8 26
First, for each Subject, sum columns A:D and add a 'date' column with the value 'All dates'. Bind this to the original dataframe, arrange the rows in the required order, and perform a row-wise sum.
Does this work:
library(dplyr)
library(tidyr)
df %>%
  rowwise() %>%
  mutate(Total = sum(c_across(A:D))) %>%
  bind_rows(df %>%
              rowwise() %>%
              mutate(Total = sum(c_across(A:D))) %>%
              group_by(Subject) %>%
              summarise_at(vars(A:Total), sum)) %>%
  mutate(date = replace_na(date, 'All Dates')) %>%
  arrange(Subject, desc(Total))
# A tibble: 9 x 7
# Rowwise:
Subject date A B C D Total
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ELA All Dates 8 5 0 10 23
2 ELA 7 6 5 0 8 19
3 ELA 1 2 0 0 0 2
4 ELA 21 0 0 0 2 2
5 Math All Dates 12 14 11 10 47
6 Math 21 9 1 8 8 26
7 Math 7 2 5 1 0 8
8 Math 14 0 5 0 2 7
9 Math 1 1 3 2 0 6

How to collapse a table while ignoring a specific value in R?

I have a data frame like the following:
> example
name X1.8 X1.8.1 X1.8.2
1 a 1 1 7
2 b 33 0 2
3 c 3 10 -1
4 a -1 -1 4
5 d 5 8 5
6 e 7 6 12
7 a -1 7 7
8 c 5 20 9
and I want to collapse (sum) the rows with the same name (column 1) but ignore the value -1 while summing. For example, the example above would become:
> example # the goal
name X1.8 X1.8.1 X1.8.2
1 a 1 8 18
2 b 33 0 2
3 c 8 30 9
4 d 5 8 5
5 e 7 6 12
> dput(example)
structure(list(name = structure(c(1L, 2L, 3L, 1L, 4L, 5L, 1L,
3L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
X1.8 = c(1, 33, 3, -1, 5, 7, -1, 5), X1.8.1 = c(1, 0, 10,
-1, 8, 6, 7, 20), X1.8.2 = c(7, 2, -1, 4, 5, 12, 7, 9)), row.names = c(NA,
8L), class = "data.frame")
Edit for question:
Will this work if there are some rows that are entirely -1? For example,
> example
name X1.8 X1.8.1 X1.8.2
1 a 1 1 7
2 b 33 0 2
3 c 3 10 -1
4 a -1 -1 4
5 d 5 8 5
6 e 7 6 12
7 a -1 7 7
8 c 5 20 9
9 f -1 -1 -1
You can remove the -1 values and sum the rest.
Using base R:
aggregate(. ~ name, example, function(x) sum(x[x != -1]))
# name X1.8 X1.8.1 X1.8.2
#1 a 1 8 18
#2 b 33 0 2
#3 c 8 30 9
#4 d 5 8 5
#5 e 7 6 12
In dplyr:
library(dplyr)
example %>%
  group_by(name) %>%
  summarise(across(everything(), ~ sum(.[. != -1])))
and in data.table:
library(data.table)
setDT(example)[, lapply(.SD, function(x) sum(x[x!=-1])), name]
As you are calculating sums, you can set the -1 values you want to ignore to 0 and use rowsum to get the sum per group (x here stands for the example data frame):
x[x == -1] <- 0
rowsum(x[-1], x[, 1])
# X1.8 X1.8.1 X1.8.2
#a 1 8 18
#b 33 0 2
#c 8 30 9
#d 5 8 5
#e 7 6 12
Another option is to set -1 to NA:
x[x == -1] <- NA
rowsum(x[-1], x[, 1], na.rm = TRUE)
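A quick check against the edited data, as a sketch (assuming example has been rebuilt with the extra all--1 row f):
x <- example # the edited 9-row example including row f
x[x == -1] <- NA
rowsum(x[-1], x[, 1], na.rm = TRUE)
# row f comes out as 0 0 0 (the sum of an empty set is 0), not -1 -1 -1;
# if you need -1 preserved there, see the if/else approach in the first question above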

Conditional statement within group

I have a dataframe in which I want to make a new column with values based on a condition within groups. So for the dataframe below, I want to make a new column n_actions which gives:
Cond 1. for the whole group, the number 2 if a 6 appears in column step
Cond 2. for the whole group, the number 3 if a 9 appears in column step
Cond 3. if neither a 6 nor a 9 appears in column step for the group, then 1
#dataframe start
dataframe <- data.frame(
  group = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C",
            "D", "D", "D", "D", "D", "D", "D", "D", "D"),
  step = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9))
# dataframe desired
dataframe$n_actions <- c(rep(1, 3), rep(2, 6), rep(1, 3), rep(3, 9))
Try it out:
library(dplyr)
dataframe %>%
  group_by(group) %>%
  mutate(n_actions = ifelse(9 %in% step, 3,
                            ifelse(6 %in% step, 2, 1)))
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fctr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
# ... with 11 more rows
Another way with dplyr's case_when:
library(dplyr)
dataframe %>%
  group_by(group) %>%
  mutate(
    n_actions = case_when(
      9 %in% step ~ 3,
      6 %in% step ~ 2,
      TRUE ~ 1
    )
  )
Output:
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fct> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
11 C 2 1
12 C 3 1
13 D 1 3
14 D 2 3
15 D 3 3
16 D 4 3
17 D 5 3
18 D 6 3
19 D 7 3
20 D 8 3
21 D 9 3
You could also divide the maximum step value per group by 3 with integer division (%/%), it seems: the group maxima 3, 6, 3, 9 give 1, 2, 1, 3.
dataframe <- transform(dataframe,
                       n_actions2 = ave(step, group, FUN = function(x) max(x) %/% 3))
dataframe
# group step n_actions n_actions2
#1 A 1 1 1
#2 A 2 1 1
#3 A 3 1 1
#4 B 1 2 2
#5 B 2 2 2
#6 B 3 2 2
#7 B 4 2 2
#8 B 5 2 2
#9 B 6 2 2
#10 C 1 1 1
#11 C 2 1 1
#12 C 3 1 1
#13 D 1 3 3
#14 D 2 3 3
#15 D 3 3 3
#16 D 4 3 3
#17 D 5 3 3
#18 D 6 3 3
#19 D 7 3 3
#20 D 8 3 3
#21 D 9 3 3
