taking sum of rows in R based on conditions - r

I have data in this format:
ColA ColB ColC
A 2 1
A 1 1
B 3 2
B 5 2
C 2 3
C 5 3
A 1 1
A 3 1
B 7 2
B 1 2
I want to get a new column with the sum of the rows of ColB, something like this:
ColA ColB ColC ColD
A 2 1 3
A 1 1 3
B 3 2 8
B 5 2 8
C 2 3 7
C 5 3 7
A 1 1 4
A 3 1 4
B 7 2 8
B 1 2 8
Thanks much for your help!
I tried
df$ColD <- with(df, sum(ColB[ColC == 1]))

It seems that you want ColD to hold the sum of ColB within each run of consecutive identical values in ColA. In that case, we can do:
library(dplyr)
df %>%
  mutate(group = data.table::rleid(ColA)) %>%
  group_by(group) %>%
  mutate(ColD = sum(ColB)) %>%
  ungroup() %>%
  select(-group)
#> # A tibble: 10 x 4
#> ColA ColB ColC ColD
#> <chr> <int> <int> <int>
#> 1 A 2 1 3
#> 2 A 1 1 3
#> 3 B 3 2 8
#> 4 B 5 2 8
#> 5 C 2 3 7
#> 6 C 5 3 7
#> 7 A 1 1 4
#> 8 A 3 1 4
#> 9 B 7 2 8
#> 10 B 1 2 8
This, at any rate, is the same as the expected output.
Created on 2023-01-16 with reprex v2.0.2
Data from question in reproducible format
df <- structure(list(ColA = c("A", "A", "B", "B", "C", "C", "A", "A",
"B", "B"), ColB = c(2L, 1L, 3L, 5L, 2L, 5L, 1L, 3L, 7L, 1L),
ColC = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L)),
class = "data.frame", row.names = c(NA, -10L))
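
To see why a run id is needed rather than grouping on ColA directly, here is a minimal base R sketch. It assumes the same ColA and ColB values as the question; the names run_id, sum_by_run and sum_by_value are only illustrative and do not come from the answer.

```r
# Same values as the question's ColA and ColB
ColA <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B")
ColB <- c(2L, 1L, 3L, 5L, 2L, 5L, 1L, 3L, 7L, 1L)

# Run id: one integer per block of consecutive identical values
runs <- rle(ColA)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Sum within each run (what the question asks for)
sum_by_run <- ave(ColB, run_id, FUN = sum)

# Sum within each distinct value (what plain grouping on ColA would give)
sum_by_value <- ave(ColB, ColA, FUN = sum)

print(data.frame(ColA, ColB, run_id, sum_by_run, sum_by_value))
```

The two separate A blocks get 3 and 4 when summed by run, but both get 7 when summed by value, which is why the answer builds a run id first.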

Base R (this relies on match() against LETTERS, so it assumes ColA contains single uppercase letters):
df$ColD=ave(
df$ColB,
cumsum(c(1,abs(diff(match(df$ColA,LETTERS))))),
FUN=sum
)
ColA ColB ColC ColD
1 A 2 1 3
2 A 1 1 3
3 B 3 2 8
4 B 5 2 8
5 C 2 3 7
6 C 5 3 7
7 A 1 1 4
8 A 3 1 4
9 B 7 2 8
10 B 1 2 8

A base solution:
df |>
transform(ColD = ave(ColB, with(rle(ColA), rep(seq_along(values), lengths)), FUN = sum))
ColA ColB ColC ColD
1 A 2 1 3
2 A 1 1 3
3 B 3 2 8
4 B 5 2 8
5 C 2 3 7
6 C 5 3 7
7 A 1 1 4
8 A 3 1 4
9 B 7 2 8
10 B 1 2 8

Another base solution using ave.
df$ColD <- ave(df$ColB, c(0, cumsum(diff(df$ColC) != 0)), FUN=sum) #Using ColC
#df$ColD <- ave(df$ColB, c(0, cumsum(df$ColA[-nrow(df)] != df$ColA[-1])), FUN=sum) #Using ColA
df
# ColA ColB ColC ColD
#1 A 2 1 3
#2 A 1 1 3
#3 B 3 2 8
#4 B 5 2 8
#5 C 2 3 7
#6 C 5 3 7
#7 A 1 1 4
#8 A 3 1 4
#9 B 7 2 8
#10 B 1 2 8
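
The grouping vector in this answer can be unpacked step by step. The sketch below assumes the question's ColC values; changed and grp are illustrative names only.

```r
# Same values as the question's ColC
ColC <- c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L)

# TRUE wherever a row differs from the row before it (length n - 1)
changed <- diff(ColC) != 0

# Running count of changes, with 0 prepended for the first row,
# so every block of consecutive equal values shares one id
grp <- c(0, cumsum(changed))

print(changed)
print(grp)
```

Because diff() needs numeric input, the commented-out ColA variant in the answer compares neighbouring elements with != instead.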

Related

R Create multiple rows from 1 row based on presence of values in certain columns

I have a data frame that looks like the following:
ID Date Participant_1 Participant_2 Participant_3 Covariate_1 Covariate_2 Covariate_3
1 9/1 A B NA 16 2 1
2 5/4 B NA NA 4 2 2
3 6/3 C A B 8 3 6
4 2/8 A NA NA 7 8 4
5 9/3 C A NA 7 1 3
I need to expand this data frame so that a row is present for all of the participants present at each event "ID", with the date and all other variables in all the created rows. The multiple participant columns would now only be one column for participant. The output would therefore be:
ID Date Participant Covariate_1 Covariate_2 Covariate_3
1 9/1 A 16 2 1
1 9/1 B 16 2 1
2 5/4 B 4 2 2
3 6/3 C 8 3 6
3 6/3 A 8 3 6
3 6/3 B 8 3 6
4 2/8 A 7 8 4
5 9/3 C 7 1 3
5 9/3 A 7 1 3
Is there a way to do this efficiently? Perhaps with a pivot function?
We can use pivot_longer, followed by some tidying:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(starts_with("Participant"), values_to = "Participant") %>%
  select(-name) %>%
  relocate(Participant, .before = Covariate_1) %>%
  drop_na()
# A tibble: 9 × 6
ID Date Participant Covariate_1 Covariate_2 Covariate_3
<int> <chr> <chr> <int> <int> <int>
1 1 9/1 A 16 2 1
2 1 9/1 B 16 2 1
3 2 5/4 B 4 2 2
4 3 6/3 C 8 3 6
5 3 6/3 A 8 3 6
6 3 6/3 B 8 3 6
7 4 2/8 A 7 8 4
8 5 9/3 C 7 1 3
9 5 9/3 A 7 1 3
Here's the example data used:
df <- structure(list(ID = 1:5, Date = c("9/1", "5/4", "6/3", "2/8",
"9/3"), Participant_1 = c("A", "B", "C", "A", "C"), Participant_2 = c("B",
NA, "A", NA, "A"), Participant_3 = c(NA, NA, "B", NA, NA), Covariate_1 = c(16L,
4L, 8L, 7L, 7L), Covariate_2 = c(2L, 2L, 3L, 8L, 1L), Covariate_3 = c(1L,
2L, 6L, 4L, 3L)), class = "data.frame", row.names = c(NA, -5L
))
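
For comparison, the same reshaping can be sketched in base R without tidyr. This is only one possible approach, using the example data above; part_cols, keep_cols and long are illustrative names.

```r
df <- data.frame(
  ID = 1:5,
  Date = c("9/1", "5/4", "6/3", "2/8", "9/3"),
  Participant_1 = c("A", "B", "C", "A", "C"),
  Participant_2 = c("B", NA, "A", NA, "A"),
  Participant_3 = c(NA, NA, "B", NA, NA),
  Covariate_1 = c(16L, 4L, 8L, 7L, 7L),
  Covariate_2 = c(2L, 2L, 3L, 8L, 1L),
  Covariate_3 = c(1L, 2L, 6L, 4L, 3L)
)

part_cols <- grep("^Participant_", names(df), value = TRUE)
keep_cols <- setdiff(names(df), part_cols)

# One copy of the non-participant columns per participant column, stacked
long <- do.call(rbind, lapply(part_cols, function(p) {
  data.frame(df[keep_cols], Participant = df[[p]])
}))

# Drop the empty participant slots, then restore the event order
long <- long[!is.na(long$Participant), ]
long <- long[order(long$ID), ]
rownames(long) <- NULL
print(long)
```

Here the Participant column ends up last; reorder the columns afterwards if the layout from the question matters.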

By group relative order

I have a data set that looks like this
ID Week
1 3
1 5
1 5
1 8
1 11
1 16
2 2
2 2
2 3
2 3
2 9
Now I would like to add another column to the data frame that marks, for every ID, each week's relative position. More precisely, I would like to mark an ID's earliest week (smallest number) as 1, the next week for that ID as 2, and so forth; two observations of the same week should get the same number.
So, in the above example I should get:
ID Week Order
1 3 1
1 5 2
1 5 2
1 8 3
1 11 4
1 16 5
2 2 1
2 2 1
2 3 2
2 3 2
2 9 3
How could I achieve this?
Thank you very much!
A base R option using ave + match
transform(
df,
Order = ave(Week,
ID,
FUN = function(x) match(x, sort(unique(x)))
)
)
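or ave + order (thanks to @IRTFM for the comment). Note that order() returns a sorting permutation and does not give tied weeks the same number, so this variant reproduces the desired output only in special cases, for example when Week is already sorted with no duplicates within each ID:
transform(
  df,
  Order = ave(Week, ID, FUN = order)
)
The ave + match version gives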
ID Week Order
1 1 3 1
2 1 5 2
3 1 5 2
4 1 8 3
5 1 11 4
6 1 16 5
7 2 2 1
8 2 2 1
9 2 3 2
10 2 3 2
11 2 9 3
A data.table option with frank
> library(data.table)
> setDT(df)[, Order := frank(Week, ties.method = "dense"), ID][]
ID Week Order
1: 1 3 1
2: 1 5 2
3: 1 5 2
4: 1 8 3
5: 1 11 4
6: 1 16 5
7: 2 2 1
8: 2 2 1
9: 2 3 2
10: 2 3 2
11: 2 9 3
Data
> dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), Week = c(3L, 5L, 5L, 8L, 11L, 16L, 2L, 2L, 3L, 3L, 9L)), class = "data.frame", row.names =
c(NA,
-11L))
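
As a rough illustration of what "dense" ranking means here, the sketch below compares three base R calls on the weeks of ID 1. The variable names are illustrative only.

```r
# Weeks for ID 1 in the question
week <- c(3L, 5L, 5L, 8L, 11L, 16L)

# Dense rank: position among the sorted distinct values, no gaps after ties
dense <- match(week, sort(unique(week)))

# Minimum rank: ties share a rank, but the next rank skips ahead
min_rank <- rank(week, ties.method = "min")

# order() is a sorting permutation, not a rank: ties get different numbers
perm <- order(week)

print(rbind(week, dense, min_rank, perm))
```

Only the dense variant gives 1 2 2 3 4 5, which is the numbering the question asks for.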
You can use dense_rank in dplyr :
library(dplyr)
df %>% group_by(ID) %>% mutate(Order = dense_rank(Week)) %>% ungroup
# ID Week Order
# <int> <int> <int>
# 1 1 3 1
# 2 1 5 2
# 3 1 5 2
# 4 1 8 3
# 5 1 11 4
# 6 1 16 5
# 7 2 2 1
# 8 2 2 1
# 9 2 3 2
#10 2 3 2
#11 2 9 3

Expand a dataframe n times and add a column numbering replicates 1 to n

Probably a simple question, but I couldn't figure it out. I want to join two tables by replicating the first one. I tried the dplyr join functions, but they don't seem to add the Category column in the way shown below. Any help is appreciated.
> # I have two tables
>
> table1
Place Round Value
1 A 1 12397
2 A 2 18413
3 A 3 7351
4 A 4 5820
5 B 1 3874
6 B 2 10140
7 B 3 10073
8 B 4 7379
>
> table2
Place Category
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B 3
>
> # I want to add the category column from table2 and expand table1 as follows
>
> final_table
Place Round Value Category
1 A 1 12397 1
2 A 2 18413 1
3 A 3 7351 1
4 A 4 5820 1
5 B 1 3874 1
6 B 2 10140 1
7 B 3 10073 1
8 B 4 7379 1
9 A 1 12397 2
10 A 2 18413 2
11 A 3 7351 2
12 A 4 5820 2
13 B 1 3874 2
14 B 2 10140 2
15 B 3 10073 2
16 B 4 7379 2
17 A 1 12397 3
18 A 2 18413 3
19 A 3 7351 3
20 A 4 5820 3
21 B 1 3874 3
22 B 2 10140 3
23 B 3 10073 3
24 B 4 7379 3
We could use crossing
library(tidyr)
library(dplyr)
crossing(table1, table2[2]) %>%
arrange(Category)
# A tibble: 24 x 4
# Place Round Value Category
# <chr> <int> <int> <int>
# 1 A 1 12397 1
# 2 A 2 18413 1
# 3 A 3 7351 1
# 4 A 4 5820 1
# 5 B 1 3874 1
# 6 B 2 10140 1
# 7 B 3 10073 1
# 8 B 4 7379 1
# 9 A 1 12397 2
#10 A 2 18413 2
# … with 14 more rows
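
If tidyr is not available, the same expansion can be sketched in base R by repeating the row indices. This assumes the three categories 1 to 3 from table2; categories and idx are illustrative names.

```r
table1 <- data.frame(
  Place = rep(c("A", "B"), each = 4),
  Round = rep(1:4, times = 2),
  Value = c(12397L, 18413L, 7351L, 5820L, 3874L, 10140L, 10073L, 7379L)
)
categories <- 1:3  # distinct Category values taken from table2

# Repeat the whole table once per category ...
idx <- rep(seq_len(nrow(table1)), times = length(categories))

# ... and label each copy with its category
final_table <- cbind(
  table1[idx, ],
  Category = rep(categories, each = nrow(table1))
)
rownames(final_table) <- NULL
print(final_table)
```

The times/each pairing keeps the rows in the order shown in the question: all of table1 for category 1, then for category 2, and so on.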
data
table1 <- structure(list(Place = c("A", "A", "A", "A", "B", "B", "B", "B"
), Round = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), Value = c(12397L,
18413L, 7351L, 5820L, 3874L, 10140L, 10073L, 7379L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
table2 <- structure(list(Place = c("A", "A", "A", "B", "B", "B"),
Category = c(1L,
2L, 3L, 1L, 2L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Order values within column according to values within different column by group in R

I have the following panel data set:
group i f r d
1 4 8 3 3
1 9 4 5 1
1 2 2 2 2
2 5 5 3 2
2 3 9 3 3
2 9 1 3 1
I want to reorder column i in this data frame according to values in column d for each group. So the highest value for group 1 in column i should correspond to the highest value in column d. In the end my data.frame should look like this:
group i f r d
1 9 8 3 3
1 2 4 5 1
1 4 2 2 2
2 5 5 3 2
2 9 9 3 3
2 3 1 3 1
Here is a dplyr solution.
First, group by group. Then store the sorting permutation of column d (from order()) in a temporary column, ord, and use it to reorder i.
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(ord = order(d),
i = i[ord]) %>%
ungroup() %>%
select(-ord)
## A tibble: 6 x 5
# group i f r d
# <int> <int> <int> <int> <int>
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 9 5 3 2
#5 2 5 9 3 3
#6 2 3 1 3 1
original (wrong)
You can achieve this using dplyr and rank:
library(dplyr)
df1 %>% group_by(group) %>%
mutate(i = i[rev(rank(d))])
Edit
This question is actually trickier than it first seems and the original answer I posted is incorrect. The correct solution orders by i before subsetting by the rank of d. This gives OP's desired output which my previous answer did not (not paying attention!)
df1 %>% group_by(group) %>%
mutate(i = i[order(i)][rank(d)])
# A tibble: 6 x 5
# Groups: group [2]
# group i f r d
# <int> <int> <int> <int> <int>
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 5 5 3 2
#5 2 9 9 3 3
#6 2 3 1 3 1
There is some confusion regarding the expected output. Here I am showing a way to get both the versions of the output.
A base R approach using split and mapply
df$i <- c(mapply(function(x, y) sort(y)[x],
split(df$d, df$group), split(df$i, df$group)))
df
# group i f r d
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 5 5 3 2
#5 2 9 9 3 3
#6 2 3 1 3 1
Or another version
df$i <- c(mapply(function(x, y) y[order(x)],
split(df$d, df$group), split(df$i, df$group)))
df
# group i f r d
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 9 5 3 2
#5 2 5 9 3 3
#6 2 3 1 3 1
We can also use dplyr for this :
For 1st version
library(dplyr)
df %>%
group_by(group) %>%
mutate(i = sort(i)[d])
The 2nd version is already shown by @Rui using order:
df %>%
group_by(group) %>%
mutate(i = i[order(d)])
An option with data.table
library(data.table)
setDT(df1)[, i := i[order(d)], group]
df1
# group i f r d
#1: 1 9 8 3 3
#2: 1 2 4 5 1
#3: 1 4 2 2 2
#4: 2 9 5 3 2
#5: 2 5 9 3 3
#6: 2 3 1 3 1
If we need the second version
setDT(df1)[, i := sort(i)[d], group]
data
df1 <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L), i = c(4L, 9L,
2L, 5L, 3L, 9L), f = c(8L, 4L, 2L, 5L, 9L, 1L), r = c(3L, 5L,
2L, 3L, 3L, 3L), d = c(3L, 1L, 2L, 2L, 3L, 1L)), class = "data.frame",
row.names = c(NA,
-6L))
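
The two versions discussed in this thread agree on group 1 and differ only on group 2. A small sketch on group 2 alone shows where they diverge; by_rank and by_order are illustrative names.

```r
# Group 2 from the question
i <- c(5L, 3L, 9L)
d <- c(2L, 3L, 1L)

# Version matching the question's expected output:
# the k-th smallest i goes to the row holding the k-th smallest d
by_rank <- sort(i)[rank(d)]

# Other version: take the i values in the row order that would sort d
by_order <- i[order(d)]

print(by_rank)
print(by_order)
```

Indexing with sort(i)[d], as some answers do, gives the same result as sort(i)[rank(d)] here only because d already holds the ranks 1 to 3 within each group.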

sum for each ID depending on another variable

I would like to sum a column (by ID) depending on another variable (group). If we take for instance:
ID t group
1 12 1
1 14 1
1 2 6
2 0.5 7
2 12 1
3 3 1
4 2 4
I'd like to sum values of column t separately for each ID only if group==1, and obtain:
ID t group sum
1 12 1 26
1 14 1 26
1 2 6 NA
2 0.5 7 NA
2 12 1 12
3 3 1 3
4 2 4 NA
Using dplyr,
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(new = sum(t[group == 1]),
         new = replace(new, group != 1, NA))
which gives,
# A tibble: 7 x 4
# Groups: ID [4]
ID t group new
<int> <dbl> <int> <dbl>
1 1 12 1 26
2 1 14 1 26
3 1 2 6 NA
4 2 0.5 7 NA
5 2 12 1 12
6 3 3 1 3
7 4 2 4 NA
Consider base R with ifelse and ave() for conditional inline aggregation.
df$sum <- with(df, ifelse(group == 1, ave(t, ID, group, FUN=sum), NA))
df
# ID t group sum
# 1 1 12.0 1 26
# 2 1 14.0 1 26
# 3 1 2.0 6 NA
# 4 2 0.5 7 NA
# 5 2 12.0 1 12
# 6 3 3.0 1 3
# 7 4 2.0 4 NA
We can use data.table methods. Convert the 'data.frame' to a 'data.table' (setDT(df)), specify the logical expression group == 1 in i, group by 'ID', take the sum of 't', and assign (:=) it to 'new'. Rows that do not match the condition are left as NA by default.
library(data.table)
setDT(df)[group == 1, new := sum(t), ID]
df
# ID t group new
#1: 1 12.0 1 26
#2: 1 14.0 1 26
#3: 1 2.0 6 NA
#4: 2 0.5 7 NA
#5: 2 12.0 1 12
#6: 3 3.0 1 3
#7: 4 2.0 4 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 4L), t = c(12,
14, 2, 0.5, 12, 3, 2), group = c(1L, 1L, 6L, 7L, 1L, 1L, 4L)),
class = "data.frame", row.names = c(NA,
-7L))
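
The logic shared by these answers (sum only the group == 1 rows within each ID, then blank out the other rows) can also be written in two explicit base R steps. This is a sketch on the data above; is_target and per_id are illustrative names.

```r
df <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 4L),
  t = c(12, 14, 2, 0.5, 12, 3, 2),
  group = c(1L, 1L, 6L, 7L, 1L, 1L, 4L)
)

# Rows that should contribute to, and receive, the sum
is_target <- df$group == 1

# Per-ID sum where non-target rows contribute 0
per_id <- ave(ifelse(is_target, df$t, 0), df$ID, FUN = sum)

# Keep the sum only on the target rows, NA elsewhere
df$sum <- ifelse(is_target, per_id, NA)
print(df)
```

Splitting the steps makes it easy to change the condition or to keep the per-ID total on every row instead of setting it to NA.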
