sum for each ID depending on another variable - r

I would like to sum a column (by ID) depending on another variable (group). If we take for instance:
ID t group
1 12 1
1 14 1
1 2 6
2 0.5 7
2 12 1
3 3 1
4 2 4
I'd like to sum values of column t separately for each ID only if group==1, and obtain:
ID t group sum
1 12 1 26
1 14 1 26
1 2 6 NA
2 0.5 7 NA
2 12 1 12
3 3 1 3
4 2 4 NA

Using dplyr,
df %>%
group_by(ID) %>%
mutate(new = sum(t[group == 1]),
new = replace(new, group != 1, NA))
which gives,
# A tibble: 7 x 4
# Groups: ID [4]
ID t group new
<int> <dbl> <int> <dbl>
1 1 12 1 26
2 1 14 1 26
3 1 2 6 NA
4 2 0.5 7 NA
5 2 12 1 12
6 3 3 1 3
7 4 2 4 NA

Consider base R with ifelse and ave() for conditional inline aggregation.
df$sum <- with(df, ifelse(group == 1, ave(t, ID, group, FUN=sum), NA))
df
# ID t group sum
# 1 1 12.0 1 26
# 2 1 14.0 1 26
# 3 1 2.0 6 NA
# 4 2 0.5 7 NA
# 5 2 12.0 1 12
# 6 3 3.0 1 3
# 7 4 2.0 4 NA

We can use data.table methods. Convert the 'data.frame' to a 'data.table' with setDT(df), specify the logical expression group == 1 in i, group by 'ID', take the sum of 't' and assign (:=) it to 'new'. Rows that do not satisfy the condition in i are left as NA.
library(data.table)
setDT(df)[group == 1, new := sum(t), ID]
df
# ID t group new
#1: 1 12.0 1 26
#2: 1 14.0 1 26
#3: 1 2.0 6 NA
#4: 2 0.5 7 NA
#5: 2 12.0 1 12
#6: 3 3.0 1 3
#7: 4 2.0 4 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 4L), t = c(12,
14, 2, 0.5, 12, 3, 2), group = c(1L, 1L, 6L, 7L, 1L, 1L, 4L)),
class = "data.frame", row.names = c(NA,
-7L))

Related

R: Calculate difference between values in rows with group reference

This is my df:
group value
1 10
1 20
1 25
2 5
2 10
2 15
I now want to compute differences between each value of a group and a reference value, which is the first row of a group. More precisely:
group value diff
1 10 NA # because this is the reference for group 1
1 20 10 # value[2] - value[1]
1 25 15 # value[3] - value[1]
2 5 NA # because this is the reference for group 2
2 10 5 # value[5] - value[4]
2 15 10 # value[6] - value[4]
I found good answers for difference scores relative to the previous row (e.g., the lag function in dplyr, the shift function in data.table). However, I am looking for a fixed reference point and I couldn't make it work.
Try the code below
transform(
df,
Diff = ave(value, group, FUN = function(x) c(NA, x[-1] - x[1]))
)
which gives
group value Diff
1 1 10 NA
2 1 20 10
3 1 25 15
4 2 5 NA
5 2 10 5
6 2 15 10
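To make the distinction between consecutive and fixed-reference differences concrete, here is a minimal base R sketch on the question's data. The column names `consecutive` and `fixed` are illustrative only and do not come from the question.

```r
# Data from the question
df <- data.frame(group = c(1L, 1L, 1L, 2L, 2L, 2L),
                 value = c(10L, 20L, 25L, 5L, 10L, 15L))

# Consecutive differences: each value minus the previous one in its group
consecutive <- ave(df$value, df$group,
                   FUN = function(x) c(NA, diff(x)))

# Fixed-reference differences: each value minus the first one in its group
fixed <- ave(df$value, df$group,
             FUN = function(x) c(NA, x[-1] - x[1]))

cbind(df, consecutive = consecutive, fixed = fixed)
```

The two columns agree on the second row of each group and diverge from the third row on, which is where the choice of reference starts to matter.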
I think you can also use this:
library(dplyr)
df %>%
group_by(group) %>%
mutate(diff = value - value[1],
diff = replace(diff, row_number() == 1, NA))
# A tibble: 6 x 3
# Groups: group [2]
group value diff
<int> <int> <int>
1 1 10 NA
2 1 20 10
3 1 25 15
4 2 5 NA
5 2 10 5
6 2 15 10
df <-
structure(list(
group = c(1L, 1L, 1L, 2L, 2L, 2L),
value = c(10L,
20L, 25L, 5L, 10L, 15L)
),
class = "data.frame",
row.names = c(NA,
-6L))
library(tidyverse)
df %>%
group_by(group) %>%
mutate(DIFF = ifelse(row_number() == 1, NA, value - first(value))) %>%
ungroup()
#> # A tibble: 6 x 3
#> group value DIFF
#> <int> <int> <int>
#> 1 1 10 NA
#> 2 1 20 10
#> 3 1 25 15
#> 4 2 5 NA
#> 5 2 10 5
#> 6 2 15 10
Created on 2021-06-18 by the reprex package (v2.0.0)

By group relative order

I have a data set that looks like this
ID Week
1 3
1 5
1 5
1 8
1 11
1 16
2 2
2 2
2 3
2 3
2 9
Now, what I would like to do is add another column to the DataFrame so that, for every ID, I mark the week's relative position. More precisely, I would like to mark each ID's earliest week (smallest number) as 1, the next week for that ID as 2, and so forth; if there are two observations of the same week, they get the same number.
So, in the above example I should get:
ID Week Order
1 3 1
1 5 2
1 5 2
1 8 3
1 11 4
1 16 5
2 2 1
2 2 1
2 3 2
2 3 2
2 9 3
How could I achieve this?
Thank you very much!
A base R option using ave + match
transform(
df,
Order = ave(Week,
ID,
FUN = function(x) match(x, sort(unique(x)))
)
)
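or ave + order (thanks @IRTFM for the comments). Note that order returns a sorting permutation rather than ranks, so it reproduces the output below only when Week is already sorted and has no ties within each ID; with tied weeks, as in this data, use the match version above.
transform(
df,
Order = ave(Week,
ID,
FUN = order
)
)
The match version gives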
ID Week Order
1 1 3 1
2 1 5 2
3 1 5 2
4 1 8 3
5 1 11 4
6 1 16 5
7 2 2 1
8 2 2 1
9 2 3 2
10 2 3 2
11 2 9 3
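As a self-contained check of the two base R options above, the sketch below uses a small hypothetical data frame (`toy`, not taken from the question) whose weeks are unsorted and tied within an ID. Under that assumption it suggests that `match(x, sort(unique(x)))` yields dense ranks, whereas `order` returns the permutation that would sort the values.

```r
# Hypothetical toy data: weeks are unsorted and contain a tie within ID 1
toy <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L),
                  Week = c(8L, 3L, 8L, 5L, 2L))

# Dense rank per ID: position of each week among the sorted unique weeks
dense <- ave(toy$Week, toy$ID,
             FUN = function(x) match(x, sort(unique(x))))

# order() gives the permutation that would sort x, which is not a rank
perm <- ave(toy$Week, toy$ID, FUN = order)

cbind(toy, dense = dense, perm = perm)
```

In `toy`, the two rows with week 8 share a dense rank, while `order` assigns them different positions.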
A data.table option with frank
> setDT(df)[, Order := frank(Week, ties.method = "dense"), ID][]
ID Week Order
1: 1 3 1
2: 1 5 2
3: 1 5 2
4: 1 8 3
5: 1 11 4
6: 1 16 5
7: 2 2 1
8: 2 2 1
9: 2 3 2
10: 2 3 2
11: 2 9 3
Data
> dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), Week = c(3L, 5L, 5L, 8L, 11L, 16L, 2L, 2L, 3L, 3L, 9L)), class = "data.frame", row.names =
c(NA,
-11L))
You can use dense_rank in dplyr :
library(dplyr)
df %>% group_by(ID) %>% mutate(Order = dense_rank(Week)) %>% ungroup
# ID Week Order
# <int> <int> <int>
# 1 1 3 1
# 2 1 5 2
# 3 1 5 2
# 4 1 8 3
# 5 1 11 4
# 6 1 16 5
# 7 2 2 1
# 8 2 2 1
# 9 2 3 2
#10 2 3 2
#11 2 9 3

Sort, calculate and then mutate in a dataframe

I'm a beginner in R and I'm facing an issue.
Problem: I need to sort a dataframe by 2 columns (ID, i'th column) and then take lagged difference of the i'th column and record it. Then resort the data with the ID and the i+1 column and so on and so forth.
What I have written up till now:
for (val in (4:length(colnames(df)))){
df <- df[with(df, order(ID, df[val])), ]
d2_df <- df %>%
mutate_at(c(df[val]), list(lagged = ~ . - lag(.)))
}
The above code fails because mutate_at throws the error below:
Error: `.vars` must be a character/numeric vector or a `vars()` object, not a list.
Original dataset:
ID S1 S2
1 1 3 1
2 1 5 2
3 1 1 3
4 2 2 7
5 3 4 9
6 3 2 11
After Sort on ID and S1
ID S1 S2
1 1 1 3
2 1 3 1
3 1 5 2
4 2 2 7
5 3 2 11
6 3 4 9
What I need now is S1.1, the lagged difference of the sorted dataframe within each ID:
ID S1 S2 S1.1
1 1 1 3 NA
2 1 3 1 2
3 1 5 2 2
4 2 2 7 NA
5 3 2 11 NA
6 3 4 9 2
Similar logic applies for S2 where a new S2.2 will be generated.
Any help would be immensely appreciated.
Additionally what is required (below); where sum.S1 is the sum of the lagged differences and count.S1 is the count of observations at S1 for respective ID:
ID sum.S1 sum.S2 count.S1 count.S2
1 1 4 2 3 3
2 2 NA NA 1 1
3 3 2 2 2 2
Here's a way using non-standard evaluation (NSE) :
library(dplyr)
library(purrr)
library(rlang)
cols <- c('S1', 'S2')
bind_cols(df, map_dfc(cols, ~{
col <- sym(.x)
df %>%
arrange(ID, !!col) %>%
group_by(ID) %>%
transmute(!!paste0(.x, '.1') := !!col - lag(!!col)) %>%
ungroup %>%
select(-ID)
}))
# ID S1 S2 S1.1 S2.1
#1 1 3 1 NA NA
#2 1 5 2 2 1
#3 1 1 3 2 1
#4 2 2 7 NA NA
#5 3 4 9 NA NA
#6 3 2 11 2 2
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), S1 = c(3L, 5L,
1L, 2L, 4L, 2L), S2 = c(1L, 2L, 3L, 7L, 9L, 11L)),
class = "data.frame", row.names = c(NA, -6L))
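The summary table requested at the end of the question (sum.S1, sum.S2, count.S1, count.S2) is not produced by the answer above. Here is a minimal base R sketch of one way to get it, assuming the sum of lagged differences should be NA for an ID with a single observation, as in the expected output. The helper `sum_lagged` and the result name `res` are hypothetical names introduced for illustration.

```r
# Data from the answer above
df <- data.frame(ID = c(1L, 1L, 1L, 2L, 3L, 3L),
                 S1 = c(3L, 5L, 1L, 2L, 4L, 2L),
                 S2 = c(1L, 2L, 3L, 7L, 9L, 11L))

# Hypothetical helper: sum of lagged differences of the sorted values,
# NA when an ID has only one row (assumption taken from the expected output)
sum_lagged <- function(x) {
  if (length(x) > 1) sum(diff(sort(x))) else NA_integer_
}

cols <- c("S1", "S2")
ids <- sort(unique(df$ID))

# One column per variable, one row per ID
sums <- sapply(cols, function(cl) tapply(df[[cl]], df$ID, sum_lagged))
counts <- sapply(cols, function(cl) tapply(df[[cl]], df$ID, length))

# Matrix arguments expand to columns named sum.S1, sum.S2, count.S1, count.S2
res <- data.frame(ID = ids, sum = sums, count = counts)
res
```

Because each column is sorted inside `sum_lagged`, no re-sorting of the whole data frame is needed for the summary.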

Order values within column according to values within different column by group in R

I have the following panel data set:
group i f r d
1 4 8 3 3
1 9 4 5 1
1 2 2 2 2
2 5 5 3 2
2 3 9 3 3
2 9 1 3 1
I want to reorder column i in this data frame according to values in column d for each group. So the highest value for group 1 in column i should correspond to the highest value in column d. In the end my data.frame should look like this:
group i f r d
1 9 8 3 3
1 2 4 5 1
1 4 2 2 2
2 5 5 3 2
2 9 9 3 3
2 3 1 3 1
Here is a dplyr solution.
First, group by group. Then get the permutation rearrangement of column d in a temporary new column, ord and use it to reorder i.
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(ord = order(d),
i = i[ord]) %>%
ungroup() %>%
select(-ord)
## A tibble: 6 x 5
# group i f r d
# <int> <int> <int> <int> <int>
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 9 5 3 2
#5 2 5 9 3 3
#6 2 3 1 3 1
original (wrong)
You can achieve this using dplyr and rank:
library(dplyr)
df1 %>% group_by(group) %>%
mutate(i = i[rev(rank(d))])
Edit
This question is actually trickier than it first seems and the original answer I posted is incorrect. The correct solution orders by i before subsetting by the rank of d. This gives OP's desired output which my previous answer did not (not paying attention!)
df1 %>% group_by(group) %>%
mutate(i = i[order(i)][rank(d)])
# A tibble: 6 x 5
# Groups: group [2]
# group i f r d
# <int> <int> <int> <int> <int>
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 5 5 3 2
#5 2 9 9 3 3
#6 2 3 1 3 1
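The difference between matching by rank and permuting by order can be seen on group 2 of the question's data alone. This is a minimal base R sketch; the names `by_rank` and `by_order` are illustrative only.

```r
# Group 2 from the question
i <- c(5L, 3L, 9L)
d <- c(2L, 3L, 1L)

# Rank matching: the k-th smallest i goes to the row holding the k-th smallest d
by_rank <- sort(i)[rank(d)]

# Permutation: i is rearranged by the order that would sort d
by_order <- i[order(d)]

by_rank   # 5 9 3, the output the question asks for
by_order  # 9 5 3
```

Note that `rank` averages tied values by default, so if `d` can contain ties within a group, `rank(d, ties.method = "first")` may be a safer choice for indexing.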
There is some confusion regarding the expected output. Here I am showing a way to get both the versions of the output.
A base R option using split and mapply
df$i <- c(mapply(function(x, y) sort(y)[x],
split(df$d, df$group), split(df$i, df$group)))
df
# group i f r d
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 5 5 3 2
#5 2 9 9 3 3
#6 2 3 1 3 1
Or another version
df$i <- c(mapply(function(x, y) y[order(x)],
split(df$d, df$group), split(df$i, df$group)))
df
# group i f r d
#1 1 9 8 3 3
#2 1 2 4 5 1
#3 1 4 2 2 2
#4 2 9 5 3 2
#5 2 5 9 3 3
#6 2 3 1 3 1
We can also use dplyr for this :
For 1st version
library(dplyr)
df %>%
group_by(group) %>%
mutate(i = sort(i)[d])
The 2nd version is already shown by @Rui using order
df %>%
group_by(group) %>%
mutate(i = i[order(d)])
An option with data.table
library(data.table)
setDT(df1)[, i := i[order(d)], group]
df1
# group i f r d
#1: 1 9 8 3 3
#2: 1 2 4 5 1
#3: 1 4 2 2 2
#4: 2 9 5 3 2
#5: 2 5 9 3 3
#6: 2 3 1 3 1
If we need the second version
setDT(df1)[, i := sort(i)[d], group]
data
df1 <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L), i = c(4L, 9L,
2L, 5L, 3L, 9L), f = c(8L, 4L, 2L, 5L, 9L, 1L), r = c(3L, 5L,
2L, 3L, 3L, 3L), d = c(3L, 1L, 2L, 2L, 3L, 1L)), class = "data.frame",
row.names = c(NA,
-6L))

Expand an R Column Values To Column Headers with Another Column's values

I'm trying to expand an R data table that looks like this:
a step_num duration
1 1 5
1 2 4
1 3 1
2 1 7
2 2 2
2 3 9
3 1 1
3 2 1
3 3 3
Into something that looks like this:
a | step_num | duration | 1_duration | 2_duration | 3_duration |
----------------------------------------------------------------
1 1 5 5 - -
1 2 4 - 4 -
1 3 1 - - 1
2 1 7 7 - -
2 2 2 - 2 -
2 3 9 - - 9
3 1 1 1 - -
3 2 1 - 1 -
3 3 3 - - 3
I'm wondering if there's an 'expand' function, so to speak, that would do this.
Thanks!
We can do this in base R.
cbind(df,
reshape(df, idvar = c("a","step_num"), timevar = "step_num", direction = "wide")[,-1])
#> a step_num duration duration.1 duration.2 duration.3
#> 1 1 1 5 5 NA NA
#> 2 1 2 4 NA 4 NA
#> 3 1 3 1 NA NA 1
#> 4 2 1 7 7 NA NA
#> 5 2 2 2 NA 2 NA
#> 6 2 3 9 NA NA 9
#> 7 3 1 1 1 NA NA
#> 8 3 2 1 NA 1 NA
#> 9 3 3 3 NA NA 3
Created on 2019-05-21 by the reprex package (v0.2.1)
Simple tidyverse solution:
library(tidyverse)
df %>%
mutate(step = step_num) %>%
spread(step, duration, fill = '-') %>%
rename_all( ~ gsub('(\\d+)', 'duration_\\1', .))
# a step_num duration_1 duration_2 duration_3
# 1 1 1 5 - -
# 2 1 2 - 4 -
# 3 1 3 - - 1
# 4 2 1 7 - -
# 5 2 2 - 2 -
# 6 2 3 - - 9
# 7 3 1 1 - -
# 8 3 2 - 1 -
# 9 3 3 - - 3
Or an option with dcast from data.table
library(data.table)
dcast(setDT(df), a + step_num ~
paste0("duration_", step_num), value.var = 'duration')
# a step_num duration_1 duration_2 duration_3
#1: 1 1 5 NA NA
#2: 1 2 NA 4 NA
#3: 1 3 NA NA 1
#4: 2 1 7 NA NA
#5: 2 2 NA 2 NA
#6: 2 3 NA NA 9
#7: 3 1 1 NA NA
#8: 3 2 NA 1 NA
#9: 3 3 NA NA 3
NOTE: It is better to have NA instead of - as NA is easily removable with is.na/complete.cases/na.omit and it wouldn't change the class of the column to character
data
df <- structure(list(a = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), step_num = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), duration = c(5L, 4L, 1L, 7L,
2L, 9L, 1L, 1L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
Here's an approach using dplyr and tidyr.
We take the original data and add on some columns by first adding a new column col which holds the column header we want, based on the step_num. Then we use tidyr::spread to put the durations into different columns depending on which col they go with. fill = "-" fills all the empty columns with dashes. Finally, we drop the a and step_num columns since they're already there in the original data and we don't want to have copies of them.
(Note, we needed step_num to still exist at the spread step, because we wanted to keep each row aligned with the original rows. Without step_num, the data would get spread into a wider, shorter format that would have misaligned rows.)
library(dplyr); library(tidyr)
df %>%
mutate(col = paste0(step_num, "_duration")) %>%
spread(col, duration, fill = "-") %>%
select(-a, -step_num) %>%
bind_cols(df, .) # Edit, per excellent suggestion from M-M
a step_num duration 1_duration 2_duration 3_duration
1 1 1 5 5 - -
2 1 2 4 - 4 -
3 1 3 1 - - 1
4 2 1 7 7 - -
5 2 2 2 - 2 -
6 2 3 9 - - 9
7 3 1 1 1 - -
8 3 2 1 - 1 -
9 3 3 3 - - 3
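For comparison, here is a minimal base R sketch that builds the same columns with a loop and keeps NA instead of dashes, so the new columns stay numeric. The name `wide` and the `duration_` prefix are choices made for this illustration.

```r
# Data from the question
df <- data.frame(a = rep(1:3, each = 3),
                 step_num = rep(1:3, times = 3),
                 duration = c(5L, 4L, 1L, 7L, 2L, 9L, 1L, 1L, 3L))

wide <- df
for (s in sort(unique(df$step_num))) {
  # Keep the duration where the row's step matches, NA elsewhere
  wide[[paste0("duration_", s)]] <- ifelse(df$step_num == s, df$duration, NA)
}
wide
```

Since rows are never reordered, the new columns line up with the original rows without any join or bind step.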

Resources