How to summarize value not matching the group using dplyr - r

I want to sum values of rows which belongs to group other than the row's group. For example using this sample data
> df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
> df
id group val
1 1 A 9
2 2 A 7
3 3 B 5
4 4 B 3
5 5 A 1
Summarizing with dplyr by group
> df %>% group_by(group) %>% summarize(sumval = sum(val))
Source: local data frame [2 x 2]
group sumval
(fctr) (dbl)
1 A 17
2 B 8
What I want is the value for rows belonging to group A to use sumval of not group A. i.e. the final result is
id group val notval
1 1 A 9 8
2 2 A 7 8
3 3 B 5 17
4 4 B 3 17
5 5 A 1 8
Is there a way to do this in dplyr? Preferrably in a single chain?

We can do this with base R
s1 <- sapply(unique(df$group), function(x) sum(df$val[df$group !=x]))
s1[with(df, match(group, unique(group)))]
#[1] 8 8 17 17 8
Or using data.table
library(data.table)
setDT(df)[,notval := sum(df$val[df$group!=group]) ,group]

#akrun answers are best. But if you want to do in dplyr, this is a round about way.
df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
df %>% mutate(TotalSum = sum(val)) %>% group_by(group) %>%
mutate(valsumval = TotalSum - sum(val))
Source: local data frame [5 x 5]
Groups: group [2]
id group val TotalSum valsumval
(int) (fctr) (dbl) (dbl) (dbl)
1 1 A 9 25 8
2 2 A 7 25 8
3 3 B 5 25 17
4 4 B 3 25 17
5 5 A 1 25 8
This also works even if there are more than two groups.
Also Just this works
df %>% group_by(group) %>% mutate(notval = sum(df$val)- sum(val))

Related

Filter groups based on difference two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second highest value. The difference should be smaller equal than 2 (<=2), this means that group B should be removed because the highest value is 8 and the second highest value is 3 which is a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
group_by(group) %>%
filter(abs(diff(sort(value, decreasing=T)[1:2])) <= 2) %>%
ungroup()
# A tibble: 8 × 2
group value
<chr> <int>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
abs(diff(sort(x, decreasing=T)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
list(df[df$group == g & v,]), grp$group, grp$value), make.row.names=F))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
I possibility would be to first create a vector with the groups that achieve your condition and then filter in the original data.frame. Here how I thought:
library(dplyr)
group_to_keep <-
df %>%
group_by(group) %>%
slice_max(n = 2,value) %>%
filter(abs(diff(value)) <= 2) %>%
pull(group) %>%
unique()
df %>%
filter(group %in% group_to_keep)
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
In case you can sure that you have all the time at least two values you can use.
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]

Re-ordering a factor (in a table) with hierarchical groups in R

Suppose the following table with two factor variabels and one numerical variable:
df <- tibble(
x = as_factor(c("a", "a", "a", "b", "b", "b")),
y = as_factor(1:6),
val = c(10, 3, 8, 2, 6, 1)
)
> df
# A tibble: 6 x 3
x y val
<fct> <fct> <dbl>
1 a 1 10
2 a 2 3
3 a 3 8
4 b 4 2
5 b 5 6
6 b 6 1
I would like to re-order y such that the sum of val, when grouped by x, takes precedent, but y is still ordered by val. To illustrate the goal:
# A tibble: 6 x 4
# Groups: x [2]
x y val sum
<fct> <fct> <dbl> <dbl>
1 a 1 10 21 # all y for which x=="a" come first, because
2 a 3 8 21 # the sum of val for x=="a" is greater than
3 a 2 3 21 # for x=="b"
4 b 5 6 9 # within each group, y is ordered by val
5 b 4 2 9
6 b 6 1 9
But how do I get there? Within tidyverse, I tried to solve it with forcats::fct_reorder(), thinking that grouping might help (df |> group_by(x) |> mutate(y = fct_reorder(y, val))), but it doesn't.
Can fct_reorder() do that at all? What other approaches could work?
Edit: I have found a solution, but it feels rather hacky:
df |>
group_by(x) |>
mutate(sum = sum(val)) |>
arrange(desc(sum), desc(val)) |> ungroup() |>
tibble::rowid_to_column() |>
mutate(across(c(x, y), \(x) fct_reorder(x, rowid)))
Perhaps, we need to arrange
library(dplyr)
library(forcats)
df %>%
arrange(desc(ave(val, x, FUN = sum)), desc(val)) %>%
mutate(across(where(is.factor), fct_inorder))
-output
# A tibble: 6 × 3
x y val
<fct> <fct> <dbl>
1 a 1 10
2 a 3 8
3 a 2 3
4 b 5 6
5 b 4 2
6 b 6 1
Or use fct_reorder/reorder in arrange
df %>%
arrange(desc(fct_reorder(x, val, .fun = sum)), desc(val)) %>%
mutate(across(where(is.factor), fct_inorder)
Probably we can use the following data.table option along with fct_inorder
setorder(
setDT(df)[
,
sum := sum(val), x
],
-sum, -val
)[
,
lapply(
.SD,
function(x) ifelse(is.factor(x), fct_inorder, c)(x)
)
]
and you will obtain
x y val sum
1: a 1 10 21
2: a 3 8 21
3: a 2 3 21
4: b 5 6 9
5: b 4 2 9
6: b 6 1 9

Filling in non-existing rows in R + dplyr [duplicate]

This question already has answers here:
Proper idiom for adding zero count rows in tidyr/dplyr
(6 answers)
Closed 2 years ago.
Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.
My data look basically like this:
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests that it is an integer-like data, but your use of factor suggests ordinal or categorical. Because of that, you need to define authoritatively what the complete set of factors can be. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust your all_groups_weeks accordingly:
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Using tidyr::complete won't work because you don't have all possible values in the data (47 missing).
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0)) %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
You can try this. I hope this helps.
library(dplyr)
#Define range
df %>% mutate(FiscalWeek=as.numeric(as.character(FiscalWeek))) -> df
range <- length(seq(min(df$FiscalWeek),max(df$FiscalWeek),by=1))
#Aggregation
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = sum(Amount)/range)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know number of weeks:
df %>%
group_by(Group) %>%
summarise(Avgs = sum(Amount) / length(45:48))

Sum multiple variables by group and create new column with their sum

I have a data frame with grouped variable and I want to sum them by group. It's easy with dplyr.
library(dplyr)
library(magrittr)
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum)
# A tibble: 3 x 3
group n1 n2
<fctr> <int> <int>
1 a 3 5
2 b 3 4
3 c 9 11
But now I want a new column total with the sum of n1 and n2 by group. Like this:
# A tibble: 3 x 3
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
How can I do that with dplyr?
EDIT:
Actually, it's just an example, I have a lot of variables.
I tried these two codes but it's not in the right dimension...
data %>% group_by(group) %>%
summarise_all(sum) %>%
summarise_if(is.numeric, sum)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate_if(is.numeric, .funs = sum)
You can use mutate after summarize:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = n1 + n2)
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <int>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
If need to sum all numeric columns, you can use rowSums with select_if (to select numeric columns) to sum columns up:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = rowSums(select_if(., is.numeric)))
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <dbl>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
We can use apply together with the dplyr functions.
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = apply(.[, 2:ncol(.)], 1, sum))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Or rowSums with the same strategy. The key is to use . to specify the data frame and [] with x:ncol(.) to keep the columns you want.
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = rowSums(.[, 2:ncol(.)]))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <dbl>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Base R
cbind(aggregate(.~group, data, sum), ttl = sapply(split(data[,-1], data$group), sum))
# group n1 n2 ttl
#a a 3 5 8
#b b 3 4 7
#c c 9 11 20
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'group', get the sum of each columns in the Subset of data.table, and then with Reduce, get the sum of the rows of the columns of interest
library(data.table)
setDT(data)[, lapply(.SD, sum) , group][, tt1 := Reduce(`+`, .SD),
.SDcols = names(data)[-1]][]
# group n1 n2 tt1
#1: a 3 5 8
#2: b 3 4 7
#3: c 9 11 20
Or with base R
addmargins(as.matrix(rowsum(data[-1], data$group)), 2)
# n1 n2 Sum
#a 3 5 8
#b 3 4 7
#c 9 11 20
Or with dplyr
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt = rowSums(.[-1]))

How do I work with continuous variables in frequency tables? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 6 years ago.
I have a thousand-row table of the type below and need to calculate the sum and mean of a continuous "count" variable for every categorical "df" varible.
I have attempted to solve this through table() function, but since I am a using continuous variable, I can't work myself towards a solution.
df count
1 a 5
2 f 3
3 g 8
4 l 2
5 a 10
6 s 4
7 l 6
8 s 8
9 a 2
10 g 1
if I am not mistaken, you are looking for the following code
library(dplyr)
daf %>%
group_by(df) %>%
summarise(Sum = sum(count), Count = n()) %>%
ungroup() %>%
arrange(df)
"daf" is the data set that I am working on.
Enjoy R programming!!!
Maybe this would help you out ,
> df3 <- aggregate(count ~ df , df, mean)
> df3
df count
1 a 5.666667
2 f 3.000000
3 g 4.500000
4 l 4.000000
5 s 6.000000
> df2 <- aggregate(count ~ df , df, sum)
> df2
df count
1 a 17
2 f 3
3 g 9
4 l 8
5 s 12
Simple aggregate functions can do it . Count in df3 is the mean and count in df2 is the sum .
This isn't an especially unique question, but the suggested duplicated questions only ask for a single summary statistic. As this is a simple problem to solve in dplyr I thought I'd throw this in.
dframe <- data.frame(df = c("a", "f", "g", "l", "a", "s", "l", "s", "a", "g"), count = c(5, 3, 8, 2, 10, 4, 6, 8, 2, 1))
dframe
df count
1 a 5
2 f 3
3 g 8
4 l 2
5 a 10
6 s 4
7 l 6
8 s 8
9 a 2
10 g 1
library(dplyr)
dframe %>% group_by(df) %>% summarise(sum = sum(count), mean = mean(count))
Source: local data frame [5 x 3]
df sum mean
(fctr) (dbl) (dbl)
1 a 17 5.666667
2 f 3 3.000000
3 g 9 4.500000
4 l 8 4.000000
5 s 12 6.000000
You can see that summarise() allows you to calculate whatever, and however many, summary statistics for each group that you like.

Resources