R: Calculate difference between values in rows with group reference - r

This is my df:
group value
1 10
1 20
1 25
2 5
2 10
2 15
I now want to compute differences between each value of a group and a reference value, which is the first row of a group. More precisely:
group value diff
1 10 NA # because this is the reference for group 1
1 20 10 # value[2] - value[1]
1 25 15 # value[3] - value[1]
2 5 NA # because this is the reference for group 2
2 10 5 # value[5] - value[4]
2 15 10 # value[6] - value[4]
I found good answers for difference scores relative to the previous row (e.g., the lag function in dplyr or the shift function in data.table). However, I am looking for a fixed reference point and I couldn't make it work.

Try the code below, which subtracts the first value of each group from the remaining values of that group
transform(
  df,
  Diff = ave(value, group, FUN = function(x) c(NA, x[-1] - x[1]))
)
which gives
group value Diff
1 1 10 NA
2 1 20 10
3 1 25 15
4 2 5 NA
5 2 10 5
6 2 15 10
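A lagged difference such as `diff(x)` subtracts the previous row, whereas the question asks for the difference to the first row of each group. The following is a minimal base R sketch of that distinction on the question's data; the helper name `diff_from_first` is made up for illustration and is not part of any package.

```r
df <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L),
  value = c(10L, 20L, 25L, 5L, 10L, 15L)
)

# Hypothetical helper: difference to the first element, NA for the reference row
diff_from_first <- function(x) c(NA, x[-1] - x[1])

# Each row minus the previous row of its group
lagged <- ave(df$value, df$group, FUN = function(x) c(NA, diff(x)))
# Each row minus the first row of its group
fixed  <- ave(df$value, df$group, FUN = diff_from_first)

lagged  # NA 10  5 NA  5  5
fixed   # NA 10 15 NA  5 10
```

The two results agree only on the second row of each group, which is why lag-style solutions do not give the requested column.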

I think you can also use this:
library(dplyr)
df %>%
group_by(group) %>%
mutate(diff = value - value[1],
diff = replace(diff, row_number() == 1, NA))
# A tibble: 6 x 3
# Groups: group [2]
group value diff
<int> <int> <int>
1 1 10 NA
2 1 20 10
3 1 25 15
4 2 5 NA
5 2 10 5
6 2 15 10

df <-
structure(list(
group = c(1L, 1L, 1L, 2L, 2L, 2L),
value = c(10L,
20L, 25L, 5L, 10L, 15L)
),
class = "data.frame",
row.names = c(NA,
-6L))
library(tidyverse)
df %>%
group_by(group) %>%
mutate(DIFF = ifelse(row_number() == 1, NA, value - first(value))) %>%
ungroup()
#> # A tibble: 6 x 3
#> group value DIFF
#> <int> <int> <int>
#> 1 1 10 NA
#> 2 1 20 10
#> 3 1 25 15
#> 4 2 5 NA
#> 5 2 10 5
#> 6 2 15 10
Created on 2021-06-18 by the reprex package (v2.0.0)
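For comparison, the same result can probably be obtained without any grouping function. This is only a sketch in base R, assuming the reference is always the first row in which a group value appears; the variable name `first_idx` is introduced here for illustration.

```r
df <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L),
  value = c(10L, 20L, 25L, 5L, 10L, 15L)
)

# match(group, group) gives, for every row, the index of the first row of its group
first_idx <- match(df$group, df$group)   # 1 1 1 4 4 4

# Subtract the reference value of the row's own group
df$diff <- df$value - df$value[first_idx]

# Blank out the reference rows themselves
df$diff[!duplicated(df$group)] <- NA

df$diff  # NA 10 15 NA  5 10
```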

Related

Grouped filter common value in a column

Sample data:
# A tibble: 10 × 2
id value
<int> <dbl>
1 1 1
2 1 2
3 1 3
4 1 5
5 1 6
6 2 6
7 2 3
8 2 2
9 2 0
10 2 10
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
value = c(1, 2, 3, 5, 6, 6, 3, 2, 0, 10)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
How do I perform a group filter for common values in the column value with dplyr? Such that the expected output would be:
# A tibble: 6 × 2
# Groups: id [2]
id value
<int> <dbl>
1 1 2
2 1 3
3 1 6
4 2 6
5 2 3
6 2 2
We could use n_distinct for filtering after grouping by 'value'
library(dplyr)
df1 %>%
group_by(value) %>%
filter(n_distinct(id) == n_distinct(df1$id)) %>%
ungroup
-output
# A tibble: 6 × 2
id value
<int> <dbl>
1 1 2
2 1 3
3 1 6
4 2 6
5 2 3
6 2 2
Or use split/reduce/intersect
library(purrr)
df1 %>%
filter(value %in% (split(value, id) %>% reduce(intersect)))
-output
# A tibble: 6 × 2
id value
<int> <dbl>
1 1 2
2 1 3
3 1 6
4 2 6
5 2 3
6 2 2
In base R, it would be
subset(df1, value %in% Reduce(intersect, split(value, id)))
-output
# A tibble: 6 × 2
id value
<int> <dbl>
1 1 2
2 1 3
3 1 6
4 2 6
5 2 3
6 2 2
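To see what the split/intersect idea does, it may help to run the steps one at a time. The sketch below uses a plain data.frame with the same values as the sample data; the intermediate names `by_id`, `common` and `res` are assumptions added for illustration.

```r
df1 <- data.frame(
  id    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  value = c(1, 2, 3, 5, 6, 6, 3, 2, 0, 10)
)

# One vector of values per id
by_id <- split(df1$value, df1$id)

# Values present in every id: fold intersect() over the list
common <- Reduce(intersect, by_id)

# Keep only the rows whose value is shared by all ids
res <- df1[df1$value %in% common, ]
res$value  # 2 3 6 6 3 2
```

Because `Reduce()` folds over the whole list, the same code should also work when there are more than two ids.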

How to create a new column in R where arithmetic is applied on opposite values given a group

Suppose the following data frame:
group value1 value2
1 16 2
1 20 4
2 12 2
2 8 6
3 10 7
3 14 5
I want to create a column that divides value2 by value1. However, the value1 used as the divisor should come from the other row in the same group. I've attached an image to demonstrate the process.
here is an image of what I'm trying to achieve
When that is done, the output should look something like this:
group value1 value2 perc
1 16 2 2/20 10
1 20 4 4/16 25
2 12 2 2/8 25
2 8 6 6/12 50
3 10 7 7/14 50
3 14 5 5/10 50
(I've added the fractions in the perc column so it follows my image, I'd just like the value at the end of each row)
At the moment, I'm having a hard time with this problem, I realise it may have something to do with setdiff and selecting the only other unique value in that group (there's only two rows per group) but I'm not sure how. Any help is much appreciated. Thank you!
We can reverse the order of value1 then calculate the perc column.
library(dplyr)
df %>%
group_by(group) %>%
mutate(value3 = rev(value1),
perc = (value2/value3)*100) %>%
select(-value3)
# A tibble: 6 × 4
# Groups: group [3]
group value1 value2 perc
<int> <int> <int> <dbl>
1 1 16 2 10
2 1 20 4 25
3 2 12 2 25
4 2 8 6 50
5 3 10 7 50
6 3 14 5 50
data
df <- read.table(header = T, text = "
group value1 value2
1 16 2
1 20 4
2 12 2
2 8 6
3 10 7
3 14 5")
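The reversal trick relies on every group having exactly two rows. A minimal base R sketch of the same idea, with that assumption checked explicitly, could look like this (the name `partner` is made up for illustration):

```r
df <- data.frame(
  group  = c(1L, 1L, 2L, 2L, 3L, 3L),
  value1 = c(16L, 20L, 12L, 8L, 10L, 14L),
  value2 = c(2L, 4L, 2L, 6L, 7L, 5L)
)

# The swap only makes sense when every group has exactly two rows
stopifnot(all(table(df$group) == 2))

# Reverse value1 within each group so each row sees its partner's value1
partner <- ave(df$value1, df$group, FUN = rev)   # 20 16 8 12 14 10

df$perc <- df$value2 / partner * 100
df$perc  # 10 25 25 50 50 50
```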
You can use lead and lag to get the cell above or below the current row. The two results can be joined together:
library(tidyverse)
data <- tribble(
~group, ~value1, ~value2,
1L, 16L, 2L,
1L, 20L, 4L,
2L, 12L, 2L,
2L, 8L, 6L,
3L, 10L, 7L,
3L, 14L, 5L
)
full_join(
data %>%
group_by(group) %>%
mutate(
frac = value2 / lead(value1),
perc_text = str_glue("{value2}/{lead(value1)} {frac * 100}")
) %>%
filter(!is.na(frac)),
data %>%
group_by(group) %>%
mutate(
frac = value2 / lag(value1),
perc_text = str_glue("{value2}/{lag(value1)} {frac * 100}")
) %>%
filter(!is.na(frac))
) %>%
arrange(group)
#> Joining, by = c("group", "value1", "value2", "frac", "perc_text")
#> # A tibble: 6 × 5
#> # Groups: group [3]
#> group value1 value2 frac perc_text
#> <int> <int> <int> <dbl> <glue>
#> 1 1 16 2 0.1 2/20 10
#> 2 1 20 4 0.25 4/16 25
#> 3 2 12 2 0.25 2/8 25
#> 4 2 8 6 0.5 6/12 50
#> 5 3 10 7 0.5 7/14 50
#> 6 3 14 5 0.5 5/10 50
Created on 2022-04-07 by the reprex package (v2.0.0)

Conditionally take value from column1 if the column1 name == first(value) from column2 BY GROUP

I have this fake dataframe:
df <- structure(list(Group = c(1L, 1L, 2L, 2L), A = 1:4, B = 5:8, C = 9:12,
X = c("A", "A", "B", "B")), class = "data.frame", row.names = c(NA, -4L))
Group A B C X
1 1 1 5 9 A
2 1 2 6 10 A
3 2 3 7 11 B
4 2 4 8 12 B
I am trying to mutate a new column that, for each group, takes the first value of the column whose name is stored in column X:
Desired output:
Group A B C X new_col
1 1 5 9 A 1
1 2 6 10 A 1
2 3 7 11 B 7
2 4 8 12 B 7
My try so far:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(across(c(A,B,C), ~ifelse(first(X) %in% colnames(.), first(.), .), .names = "new_{.col}"))
Group A B C X new_A new_B new_C
<int> <int> <int> <int> <chr> <int> <int> <int>
1 1 1 5 9 A 1 5 9
2 1 2 6 10 A 1 5 9
3 2 3 7 11 B 3 7 11
4 2 4 8 12 B 3 7 11
One option might be:
df %>%
rowwise() %>%
mutate(new_col = get(X)) %>%
group_by(Group, X) %>%
mutate(new_col = first(new_col))
Group A B C X new_col
<int> <int> <int> <int> <chr> <int>
1 1 1 5 9 A 1
2 1 2 6 10 A 1
3 2 3 7 11 B 7
4 2 4 8 12 B 7
Using by(), adding 1 to the group number to select the column. This assumes the value columns follow the "Group" column in the same order as in the example.
transform(df, new_col=do.call(rbind, by(df, df$Group, \(x)
cbind(paste(x$X, x[1, x$Group[1] + 1])))))
# Group A B C X new_col
# 1 1 1 5 9 A A 1
# 2 1 2 6 10 A A 1
# 3 2 3 7 11 B B 7
# 4 2 4 8 12 B B 7
Note: R version 4.1.2 (2021-11-01).
Data:
df <- structure(list(Group = c(1L, 1L, 2L, 2L), A = 1:4, B = 5:8, C = 9:12,
X = c("A", "A", "B", "B")), class = "data.frame", row.names = c(NA,
-4L))
In base R, we may use row/column indexing
df$new_col <- df[2:4][cbind(match(unique(df$Group), df$Group)[df$Group],
match(df$X, names(df)[2:4]))]
df$new_col
[1] 1 1 7 7
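The row/column indexing above works because a two-column matrix used as an index selects one cell per (row, column) pair. Here is a step-by-step sketch of that mechanism; it uses `match(df$Group, df$Group)` for the row index, which is a small variation on the answer's expression and does not require the group ids to be 1, 2, and so on. The names `vals`, `row_idx` and `col_idx` are assumptions for illustration.

```r
df <- data.frame(
  Group = c(1L, 1L, 2L, 2L), A = 1:4, B = 5:8, C = 9:12,
  X = c("A", "A", "B", "B")
)

# Candidate value columns as a matrix
vals <- as.matrix(df[c("A", "B", "C")])

# Row index: first row of each row's group
row_idx <- match(df$Group, df$Group)      # 1 1 3 3
# Column index: position of the column named in X
col_idx <- match(df$X, colnames(vals))    # 1 1 2 2

# A two-column index matrix picks one cell per (row, column) pair
df$new_col <- vals[cbind(row_idx, col_idx)]
df$new_col  # 1 1 7 7
```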

Order values within column according to values within different column by group in R

I have the following panel data set:
group i f r d
1 4 8 3 3
1 9 4 5 1
1 2 2 2 2
2 5 5 3 2
2 3 9 3 3
2 9 1 3 1
I want to reorder column i in this data frame according to values in column d for each group. So the highest value for group 1 in column i should correspond to the highest value in column d. In the end my data.frame should look like this:
group i f r d
1 9 8 3 3
1 2 4 5 1
1 4 2 2 2
2 5 5 3 2
2 9 9 3 3
2 3 1 3 1
Here is a dplyr solution.
First, group by group. Then get the permutation rearrangement of column d in a temporary new column, ord and use it to reorder i.
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(ord = order(d),
i = i[ord]) %>%
ungroup() %>%
select(-ord)
## A tibble: 6 x 5
# group i f r d
# <int> <int> <int> <int> <int>
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 9 5 3 2
#5 2 5 9 3 3
#6 2 3 1 3 1
original (wrong)
You can achieve this using dplyr and rank:
library(dplyr)
df1 %>% group_by(group) %>%
mutate(i = i[rev(rank(d))])
Edit
This question is actually trickier than it first seems and the original answer I posted is incorrect. The correct solution orders by i before subsetting by the rank of d. This gives OP's desired output which my previous answer did not (not paying attention!)
df1 %>% group_by(group) %>%
mutate(i = i[order(i)][rank(d)])
# A tibble: 6 x 5
# Groups: group [2]
# group i f r d
# <int> <int> <int> <int> <int>
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 5 5 3 2
#5 2 9 9 3 3
#6 2 3 1 3 1
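The difference between indexing by `order(d)` and by `rank(d)` is easiest to see on a single group. The sketch below uses the values of group 2 from the data and assumes there are no ties in d (with ties, `rank()` returns fractional ranks by default, which would not be usable as indices).

```r
# Group 2 of the example data
i <- c(5L, 3L, 9L)
d <- c(2L, 3L, 1L)

# Version A: rearrange i by the ordering permutation of d
a <- i[order(d)]        # 9 5 3

# Version B: give the k-th smallest i to the row holding the k-th smallest d
b <- sort(i)[rank(d)]   # 5 9 3
```

Version B is the one where the largest i ends up next to the largest d, which matches the expected output in the question. For group 1 both versions happen to give the same result, which is probably why the two were easy to confuse.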
There is some confusion regarding the expected output. Here I am showing a way to get both the versions of the output.
A base R approach using split and mapply
df$i <- c(mapply(function(x, y) sort(y)[x],
split(df$d, df$group), split(df$i, df$group)))
df
# group i f r d
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 5 5 3 2
#5 2 9 9 3 3
#6 2 3 1 3 1
Or another version
df$i <- c(mapply(function(x, y) y[order(x)],
split(df$d, df$group), split(df$i, df$group)))
df
# group i f r d
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 9 5 3 2
#5 2 5 9 3 3
#6 2 3 1 3 1
We can also use dplyr for this :
For 1st version
library(dplyr)
df %>%
group_by(group) %>%
mutate(i = sort(i)[d])
The 2nd version is already shown by @Rui using order
df %>%
group_by(group) %>%
mutate(i = i[order(d)])
An option with data.table
library(data.table)
setDT(df1)[, i := i[order(d)], group]
df1
# group i f r d
#1: 1 9 8 3 3
#2: 1 2 4 5 1
#3: 1 4 2 2 2
#4: 2 9 5 3 2
#5: 2 5 9 3 3
#6: 2 3 1 3 1
If we need the second version
setDT(df1)[, i := sort(i)[d], group]
data
df1 <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L), i = c(4L, 9L,
2L, 5L, 3L, 9L), f = c(8L, 4L, 2L, 5L, 9L, 1L), r = c(3L, 5L,
2L, 3L, 3L, 3L), d = c(3L, 1L, 2L, 2L, 3L, 1L)), class = "data.frame",
row.names = c(NA,
-6L))

sum for each ID depending on another variable

I would like to sum a column (by ID) depending on another variable (group). If we take for instance:
ID t group
1 12 1
1 14 1
1 2 6
2 0.5 7
2 12 1
3 3 1
4 2 4
I'd like to sum values of column t separately for each ID only if group==1, and obtain:
ID t group sum
1 12 1 26
1 14 1 26
1 2 6 NA
2 0.5 7 NA
2 12 1 12
3 3 1 3
4 2 4 NA
Using dplyr,
df %>%
group_by(ID) %>%
mutate(new = sum(t[group == 1]),
new = replace(new, group != 1, NA))
which gives,
# A tibble: 7 x 4
# Groups: ID [4]
ID t group new
<int> <dbl> <int> <dbl>
1 1 12 1 26
2 1 14 1 26
3 1 2 6 NA
4 2 0.5 7 NA
5 2 12 1 12
6 3 3 1 3
7 4 2 4 NA
Consider base R with ifelse and ave() for conditional inline aggregation.
df$sum <- with(df, ifelse(group == 1, ave(t, ID, group, FUN=sum), NA))
df
# ID t group sum
# 1 1 12.0 1 26
# 2 1 14.0 1 26
# 3 1 2.0 6 NA
# 4 2 0.5 7 NA
# 5 2 12.0 1 12
# 6 3 3.0 1 3
# 7 4 2.0 4 NA
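The one-liner combines two steps that can be inspected separately. This is a sketch of the same approach split in two, with the intermediate name `per_pair` introduced here for illustration.

```r
df <- data.frame(
  ID    = c(1L, 1L, 1L, 2L, 2L, 3L, 4L),
  t     = c(12, 14, 2, 0.5, 12, 3, 2),
  group = c(1L, 1L, 6L, 7L, 1L, 1L, 4L)
)

# Step 1: sum of t within every (ID, group) combination, repeated on each row
per_pair <- ave(df$t, df$ID, df$group, FUN = sum)   # 26 26 2 0.5 12 3 2

# Step 2: keep the sum only where group == 1, NA elsewhere
df$sum <- ifelse(df$group == 1, per_pair, NA)
df$sum  # 26 26 NA NA 12  3 NA
```

Grouping by both ID and group in step 1 is what keeps rows with another group value (for example ID 1, group 6) out of the sum for group 1.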
We can use data.table methods. Convert the 'data.frame' to a 'data.table' (setDT(df)), specify i with the logical expression group == 1, group by 'ID', get the sum of 't' and assign (:=) it to 'new'. Rows not selected by i are left as NA by default.
library(data.table)
setDT(df)[group == 1, new := sum(t), ID]
df
# ID t group new
#1: 1 12.0 1 26
#2: 1 14.0 1 26
#3: 1 2.0 6 NA
#4: 2 0.5 7 NA
#5: 2 12.0 1 12
#6: 3 3.0 1 3
#7: 4 2.0 4 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 4L), t = c(12,
14, 2, 0.5, 12, 3, 2), group = c(1L, 1L, 6L, 7L, 1L, 1L, 4L)),
class = "data.frame", row.names = c(NA,
-7L))
