dplyr equivalent of SUM over PARTITION BY

I'm sure this question has been asked before, but I can't find the answer.
Here's my data:
df <- data.frame(group=c("a","a","a","b","b","c"), value=c(1,2,3,4,5,7))
df
#> group value
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 c 7
I'd like a 3rd column which has the sum of "value" for each "group", like so:
#> group value group_sum
#> 1 a 1 6
#> 2 a 2 6
#> 3 a 3 6
#> 4 b 4 9
#> 5 b 5 9
#> 6 c 7 7
How can I do this with dplyr?

Using dplyr:
library(dplyr)
df %>%
group_by(group) %>%
mutate(group_sum = sum(value))
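For anyone coming from SQL, here is a minimal sketch of why the grouped mutate() behaves like SUM(value) OVER (PARTITION BY group): the summary is computed once per group and recycled to every row of that group. The second variant assumes dplyr 1.1.0 or later, where the `.by` argument was added; the names res_grouped and res_by are only illustrative.

```r
library(dplyr)

df <- data.frame(group = c("a", "a", "a", "b", "b", "c"),
                 value = c(1, 2, 3, 4, 5, 7))

# Persistent grouping: sum(value) is evaluated per group and recycled
# to each row; ungroup() drops the grouping afterwards.
res_grouped <- df %>%
  group_by(group) %>%
  mutate(group_sum = sum(value)) %>%
  ungroup()

# Per-operation grouping (assumes dplyr >= 1.1.0): no ungroup() needed.
res_by <- df %>%
  mutate(group_sum = sum(value), .by = group)
```

Both variants keep all six rows and the original row order, which is the difference from summarise().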

Nobody mentioned data.table yet:
library(data.table)
dat <- data.table(df)
dat[, `:=`(sums = sum(value)), group]
Which transforms dat into:
group value sums
1: a 1 6
2: a 2 6
3: a 3 6
4: b 4 9
5: b 5 9
6: c 7 7

left_join(
df,
df %>% group_by(group) %>% summarise(group_sum = sum(value)),
by = c("group")
)

I don't know how to do it in one step, but
df_avg <- df %>% group_by(group) %>% summarize(group_sum=sum(value))
df %>% full_join(df_avg,by="group")
works. (This is basically equivalent to @KeqiangLi's answer.)
ave(), from base R, is useful here too:
df %>% mutate(group_sum=ave(value,group,FUN=sum))
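If you want to avoid dplyr entirely, the same ave() call can be used on its own. This is just a sketch on the question's data; ave() returns a vector as long as its input, with FUN applied within each level of the grouping variable.

```r
df <- data.frame(group = c("a", "a", "a", "b", "b", "c"),
                 value = c(1, 2, 3, 4, 5, 7))

# ave() applies FUN within each group and returns one value per input row
df$group_sum <- ave(df$value, df$group, FUN = sum)
df
```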

Related

fill NA values per group based on first value of a group

I am trying to fill NA values of my dataframe. However, I would like to fill them based on the first value of each group.
df = data.frame(
group = c(rep("A", 4), rep("B", 4)),
val = c(1, 2, NA, NA, 4, 3, NA, NA)
)
df
group val
1 A 1
2 A 2
3 A NA
4 A NA
5 B 4
6 B 3
7 B NA
8 B NA
fill(df, val, .direction = "down")
group val
1 A 1
2 A 2
3 A 2 # -> should be 1
4 A 2 # -> should be 1
5 B 4
6 B 3
7 B 3 # -> should be 4
8 B 3 # -> should be 4
Can I do this with tidyr::fill()? Or is there another (more or less elegant) way how to do this? I need to use this in a longer chain (%>%) operation.
Thank you very much!
Use tidyr::replace_na() and dplyr::first() (or val[[1]]) inside a grouped mutate():
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val))) %>%
ungroup()
#> # A tibble: 8 × 2
#> group val
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 1
#> 4 A 1
#> 5 B 4
#> 6 B 3
#> 7 B 4
#> 8 B 4
PS - @richarddmorey points out the case where the first value for a group is NA. The above code would keep all NA values as NA. If you'd like to instead replace with the first non-missing value per group, you could subset the vector using !is.na():
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val[!is.na(val)]))) %>%
ungroup()
Created on 2022-11-17 with reprex v2.0.2
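To make the difference between the two variants concrete, here is a small sketch on made-up data (df_na_first is hypothetical, not from the question) where group B starts with NA. The column names first_only and first_non_na are just for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: group B begins with a missing value
df_na_first <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  val   = c(1, NA, 2, NA, 4, NA)
)

filled <- df_na_first %>%
  group_by(group) %>%
  mutate(
    # uses the literal first value, which is NA for group B
    first_only   = replace_na(val, first(val)),
    # uses the first non-missing value of each group
    first_non_na = replace_na(val, first(val[!is.na(val)]))
  ) %>%
  ungroup()
```

In group B, first_only still contains NA because the replacement value itself is NA, while first_non_na is filled with 4.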
This should also work; it uses dplyr's case_when:
library(dplyr)
df %>%
group_by(group) %>%
mutate(val = case_when(
is.na(val) ~ val[1],
TRUE ~ val
))
Output:
group val
<chr> <dbl>
1 A 1
2 A 2
3 A 1
4 A 1
5 B 4
6 B 3
7 B 4
8 B 4

How can I add a counter column that records the number of occurrences of a value in a data frame in R? [duplicate]

I have the following data frame:
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6))
and I want to record the number of occurrences of each category so the output would be:
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6),
category_count=c(2,4,4,4,4,2,1))
Is there a simple way to do this?
# load package
library(data.table)
# set as data.table
setDT(df)
# count by category
df[, category_count := .N, category]
With dplyr:
library(dplyr)
df %>%
group_by(category) %>%
mutate(category_count = n()) %>%
ungroup
# A tibble: 7 × 3
category value category_count
<chr> <dbl> <int>
1 a 1 2
2 b 5 4
3 b 3 4
4 b 6 4
5 b 7 4
6 a 4 2
7 c 6 1
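As a shorter sketch of the same idea, dplyr also has add_count(), which adds the per-group row count as a new column without an explicit group_by()/ungroup() pair. The column is assumed here to be spelled category, and counted is just an illustrative name.

```r
library(dplyr)

df <- data.frame(category = c("a", "b", "b", "b", "b", "a", "c"),
                 value = c(1, 5, 3, 6, 7, 4, 6))

# add_count() keeps every row and appends the group size;
# `name` sets the name of the new column (the default is "n")
counted <- df %>%
  add_count(category, name = "category_count")
```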
base
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6),
category_count=c(2,4,4,4,4,2,1))
df$res <- with(df, ave(x = seq_len(nrow(df)), list(category), FUN = length))
df
#> category value category_count res
#> 1 a 1 2 2
#> 2 b 5 4 4
#> 3 b 3 4 4
#> 4 b 6 4 4
#> 5 b 7 4 4
#> 6 a 4 2 2
#> 7 c 6 1 1
Created on 2022-02-08 by the reprex package (v2.0.1)

Address a column with the variable value of another column in the same row using the pipe operator

An example data.frame:
library(tidyverse)
example <- data.frame(matrix(sample.int(15),5,3),
sample(c("A","B","C"),5,replace=TRUE) ) %>%
`colnames<-`( c("A","B","C","choose") ) %>% print()
Output:
A B C choose
1 9 12 4 A
2 7 8 13 C
3 5 1 2 A
4 15 3 11 C
5 14 6 10 B
The column "choose" indicates which value should be selected from the columns A,B,C
My humble solution for the column "result" :
cols <- c(A=1,B=2,C=3)
col_index <- cols[example$choose]
xy <- cbind(1:nrow(example),col_index)
example %>% mutate(result = example[xy])
Output:
A B C choose result
1 9 12 4 A 9
2 7 8 13 C 13
3 5 1 2 A 5
4 15 3 11 C 11
5 14 6 10 B 6
I'm sure there is a more elegant solution with dplyr,
but my attempts with "rowwise" or "across" failed.
Is it possible to get here a one-line-solution?
The efficient option is to make use of row/column indexing
example$result <- example[1:3][cbind(seq_len(nrow(example)),
match(example$choose, names(example)))]
With dplyr, we may use get with rowwise
library(dplyr)
example %>%
rowwise %>%
mutate(result = get(choose)) %>%
ungroup
Or instead of get use cur_data() (deprecated as of dplyr 1.1.0 in favour of pick())
example %>%
rowwise %>%
mutate(result = cur_data()[[choose]]) %>%
ungroup
Or the vectorized option with row/column indexing
example %>%
mutate(result = select(., where(is.numeric))[cbind(row_number(),
match(choose, names(example)))])
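The row/column indexing trick may be easier to follow on fixed data. Below is a base R sketch using the values printed in the question: indexing with a two-column matrix picks one cell per row, where column 1 of the matrix is the row number and column 2 the column number. The names value_cols and idx are just illustrative.

```r
# Fixed copy of the data printed in the question
example <- data.frame(A = c(9, 7, 5, 15, 14),
                      B = c(12, 8, 1, 3, 6),
                      C = c(4, 13, 2, 11, 10),
                      choose = c("A", "C", "A", "C", "B"))

value_cols <- c("A", "B", "C")

# One (row, column) pair per row of the data frame
idx <- cbind(seq_len(nrow(example)),
             match(example$choose, value_cols))

# Matrix indexing extracts exactly one cell for each pair
example$result <- as.matrix(example[value_cols])[idx]
example
```

Because this is a single vectorized lookup, it avoids the per-row overhead of rowwise().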
Here is an alternative way:
library(dplyr)
library(tidyr)
example %>%
pivot_longer(
-choose,
) %>%
filter(choose == name) %>%
select(result=value) %>%
bind_cols(example)
result A B C choose
<int> <int> <int> <int> <chr>
1 9 6 9 1 B
2 14 5 2 14 C
3 7 8 7 3 B
4 15 15 4 12 A
5 11 13 10 11 C

How to lag a specific column of a data frame in R

Input
(Say d is the data frame below.)
a b c
1 5 7
2 6 8
3 7 9
I want to shift the contents of column b one position down and put an arbitrary number in the first position in b. How do I do this? I would appreciate any help in this regard. Thank you.
I tried c(6,tail(d["b"],-1)) but it does not produce (6,5,6).
Output
a b c
1 6 7
2 5 8
3 6 9
Use head instead
df$b <- c(6, head(df$b, -1))
# a b c
#1 1 6 7
#2 2 5 8
#3 3 6 9
You could also use lag in dplyr
library(dplyr)
df %>% mutate(b = lag(b, default = 6))
Or shift in data.table
library(data.table)
setDT(df)[, b:= shift(b, fill = 6)]
A dplyr solution uses lag with an explicit default argument, if you prefer:
library(dplyr)
d <- tibble(a = 1:3, b = 5:7, c = 7:9)
d %>% mutate(b = lag(b, default = 6))
#> # A tibble: 3 x 3
#> a b c
#> <int> <dbl> <int>
#> 1 1 6 7
#> 2 2 5 8
#> 3 3 6 9
Created on 2019-12-05 by the reprex package (v0.3.0)
Here is a solution similar to the head approach by @Ronak Shah
df <- within(df, b <- c(runif(1), head(b, -1)))
where a uniform random number is placed in the first position of column b and the remaining values shift down by one:
> df
a b c
1 1 0.6644704 7
2 2 5.0000000 8
3 3 6.0000000 9
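The approach below extends to any lag or lead offset through the n argument:
library(dplyr)
d <- data.frame(a=c(1,2,3),b=c(5,6,7),c=c(7,8,9))
# no group_by(b) here: grouping by the lagged column leaves one row per group, so b1 would be all NA
d1 <- d %>% arrange(b) %>%
mutate(b1 = dplyr::lag(b, n = 1, default = NA))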
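As a quick illustration of different offsets, here is a sketch using dplyr's lag() and lead() on the same data. The new column names (b_lag1, b_lag2, b_lead1) are made up for the example.

```r
library(dplyr)

d <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7), c = c(7, 8, 9))

shifted <- d %>%
  mutate(
    b_lag1  = lag(b, n = 1, default = 6),  # shift down by one, fill with 6
    b_lag2  = lag(b, n = 2),               # shift down by two, fill with NA
    b_lead1 = lead(b, n = 1)               # shift up by one, fill with NA
  )
shifted
```

Here b_lag1 is 6, 5, 6 (the output asked for in the question), b_lag2 is NA, NA, 5 and b_lead1 is 6, 7, NA.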

Dense Rank by Multiple Columns in R

How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(). But I'm struggling with this.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=TRUE)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank().
library(data.table)
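frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can use df$r <- frank(df, x, y, ties.method = 'dense') to add the ranks as a new column. Note that ties.method = 'min' would leave gaps after ties (1 2 3 4 5 5 5 8 9 10), which is not a dense rank.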
tidyr/dplyr
Another option (though clunkier) is to use tidyr::unite to collapse your columns to one plus dplyr::dense_rank.
library(tidyverse)
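df %>%
# add a single helper column with all the info, keeping x and y
unite(xy, x, y, remove = FALSE) %>%
# dense rank on that; xy is character, so it ranks in string order ("10_1" sorts before "2_1")
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)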
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
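For comparison, here is a base R sketch of a dense rank over several columns. The helper dense_rank_cols() is hypothetical (not part of any package): it sorts the rows by the key columns, counts each first occurrence of a key combination along the sorted order, and maps the counts back to the original row order.

```r
df <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y = c(1, 2, 3, 4, 2, 2, 2, 1, 2, 3))

# Hypothetical helper: dense rank of the rows of `data` by the columns `cols`
dense_rank_cols <- function(data, cols) {
  # row order when sorting by all key columns, left to right
  ord <- do.call(order, unname(as.list(data[cols])))
  sorted <- data[ord, cols, drop = FALSE]
  # each first occurrence of a key combination starts a new rank
  ranks <- integer(nrow(data))
  ranks[ord] <- cumsum(!duplicated(sorted))
  ranks
}

df$r <- dense_rank_cols(df, c("x", "y"))
df
```

Unlike pasting the columns into a string, this sorts numeric columns numerically, and it does not require the input rows to be sorted beforehand.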

Resources