Weighted mean by group in long-data - r

My question is similar to this question and builds on this answer. The only difference is that my data is in long format, not wide, and I would like to keep it that way.
Is there a smart way to calculate the weighted.mean() shown in that answer with long data?
Say my data looks like this:
library(tidyverse)
dft_w <- structure(list(obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), education = c("A",
"A", "B", "B", "B", "B", "A", "A"), Item = c("income", "weight",
"income", "weight", "income", "weight", "income", "weight"),
Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame")); dft_w
# A tibble: 8 x 4
obs education Item Amount
<int> <chr> <chr> <int>
1 1 A income 1000
2 1 A weight 10
3 2 B income 2000
4 2 B weight 1
5 3 B income 1500
6 3 B weight 5
7 4 A income 2000
8 4 A weight 2
and I would like to get to something like this
# A tibble: 12 x 4
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 1 A weighted_income 1167.
4 2 B income 2000
5 2 B weight 1
6 2 B weighted_income 1583.
7 3 B income 1500
8 3 B weight 5
9 3 B weighted_income 1583.
10 4 A income 2000
11 4 A weight 2
12 4 A weighted_income 1167.

dft_w %>%
group_by(education) %>%
summarize(
Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))),
obs = unique(obs),
Item = "weighted_income"
) %>%
bind_rows(dft_w, .) %>%
arrange(obs, education, Item)
# # A tibble: 12 x 4
# obs education Item Amount
# <int> <chr> <chr> <dbl>
# 1 1 A income 1000
# 2 1 A weight 10
# 3 1 A weighted_income 1167.
# 4 2 B income 2000
# 5 2 B weight 1
# 6 2 B weighted_income 1583.
# 7 3 B income 1500
# 8 3 B weight 5
# 9 3 B weighted_income 1583.
# 10 4 A income 2000
# 11 4 A weight 2
# 12 4 A weighted_income 1167.
Note that this fails if a group does not contain equal numbers of "income" and "weight" rows; weighted.mean() then stops with 'x' and 'w' must have the same length.
This can be preempted by filtering first, perhaps like this:
dft_w %>%
slice(-1) %>% # just to trigger the fail, test the filter
group_by(obs, education) %>%
filter(all(c("income", "weight") %in% Item)) %>%
group_by(education) %>%
summarize(
Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))),
obs = unique(obs),
Item = "weighted_income"
) %>%
bind_rows(slice(dft_w, -1), .) %>% # slice() only to keep the output consistent
arrange(obs, education, Item)
# # A tibble: 10 x 4
# obs education Item Amount
# <int> <chr> <chr> <dbl>
# 1 1 A weight 10
# 2 2 B income 2000
# 3 2 B weight 1
# 4 2 B weighted_income 1583.
# 5 3 B income 1500
# 6 3 B weight 5
# 7 3 B weighted_income 1583.
# 8 4 A income 2000
# 9 4 A weight 2
# 10 4 A weighted_income 2000
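Note that an obs/education pair lacking either "income" or "weight" does not gain a "weighted_income" row, and its remaining row is left out of the group's weighted mean (which is why group A is now 2000).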
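As of dplyr 1.1.0, returning more than one row per group from summarise() is deprecated, so the code above may emit a warning on newer versions. A minimal sketch of one way around that, assuming the same dft_w data: compute one weighted mean per education group, then join it onto the distinct obs/education pairs. The names group_means, new_rows and result are only illustrative.

```r
library(dplyr)

dft_w <- data.frame(
  obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  education = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Item = rep(c("income", "weight"), times = 4),
  Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L),
  stringsAsFactors = FALSE
)

# One weighted mean per education group (exactly one row per group).
group_means <- dft_w %>%
  group_by(education) %>%
  summarise(
    Amount = weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]),
    .groups = "drop"
  )

# One new row per obs/education pair, carrying its group's weighted mean.
new_rows <- dft_w %>%
  distinct(obs, education) %>%
  left_join(group_means, by = "education") %>%
  mutate(Item = "weighted_income")

# Append the new rows to the long data and restore the per-obs ordering.
result <- bind_rows(dft_w, new_rows) %>%
  arrange(obs, education, Item)
```

Like the original, this assumes the "income" and "weight" rows appear in matching order within each group.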

Another way is to use tidyr's pivot_wider and pivot_longer in the same pipe chain, so that you work with wide data in the middle and return to long format at the end. It may not be the most efficient way, but it lets you keep your "wide-format" tips and tricks.
library(dplyr)
dft_w %>%
tidyr::pivot_wider(names_from = Item, values_from = Amount) %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight)) %>%
tidyr::pivot_longer(3:last_col(), names_to = "Item", values_to = "Amount")
Output:
# A tibble: 12 x 4
# Groups: education [2]
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 1 A weighted_income 1167.
4 2 B income 2000
5 2 B weight 1
6 2 B weighted_income 1583.
7 3 B income 1500
8 3 B weight 5
9 3 B weighted_income 1583.
10 4 A income 2000
11 4 A weight 2
12 4 A weighted_income 1167.
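As a sanity check on the figures in the output above, the weighted means can be recomputed by hand in base R. This is only a sketch; the vector names are illustrative, and the values are copied from dft_w.

```r
# Group A holds obs 1 and 4; group B holds obs 2 and 3.
income_a <- c(1000, 2000)
weight_a <- c(10, 2)
income_b <- c(2000, 1500)
weight_b <- c(1, 5)

# weighted.mean(x, w) computes sum(x * w) / sum(w).
manual_a <- sum(income_a * weight_a) / sum(weight_a)  # 14000 / 12
manual_b <- sum(income_b * weight_b) / sum(weight_b)  # 9500 / 6

round(manual_a)  # 1167
round(manual_b)  # 1583

# The manual calculation agrees with weighted.mean().
all.equal(manual_a, weighted.mean(income_a, weight_a))
all.equal(manual_b, weighted.mean(income_b, weight_b))
```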

Here is another way of doing this, using tibble::add_row. I opted for only one summary row per grouping variable:
library(dplyr)
library(purrr)
dft_w %>%
group_split(education) %>%
map_dfr(~ .x %>%
add_row(obs = .x$obs[1], education = .x$education[1],
Item = "weighted.mean", Amount = weighted.mean(.x$Amount[.x$Item == "income"],
.x$Amount[.x$Item == "weight"])))
# A tibble: 10 x 4
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 4 A income 2000
4 4 A weight 2
5 1 A weighted.mean 1167.
6 2 B income 2000
7 2 B weight 1
8 3 B income 1500
9 3 B weight 5
10 2 B weighted.mean 1583.

Related

Filter within dplyr::summarise to make calculation

I need to use information from a subset of my data within a dplyr::summarise function.
My example data is grouped by unit. Each unit has a number of parts of different type with a number of dates.
library(dplyr)
library(lubridate)
q = data.frame(unit = c(rep(1,4), rep(2,3), rep(3,2)) ,
type = c("a", "b", "a", "a", "a", "b", "a", "a", "a"),
create = dmy(c("01/01/2001", "02/02/2002", "10/03/2003", "04/04/2004", "01/01/2001", "02/02/2002", "03/03/2003", "01/01/2001", "02/01/2001")),
fail = dmy(c("05/05/2001", "10/10/2003", "30/03/2004", NA, "01/01/2002", "01/03/2003", "01/06/2003", "01/01/2001", NA)),
last = dmy(c(rep("11/03/2008", 4), rep("01/01/2009", 3), rep("01/03/2001",2) )) )%>%
group_by(unit)%>%
mutate(last_for_unit = case_when(row_number() == n() ~T,
T~F),
atleast_6m = case_when(interval(create,last)/months(1) >=6 | !is.na(fail)~T,
T~F))
q
# A tibble: 9 x 7
# Groups: unit [3]
unit type create fail last last_for_unit atleast_6m
<dbl> <chr> <date> <date> <date> <lgl> <lgl>
1 1 a 2001-01-01 2001-05-05 2008-03-11 FALSE TRUE
2 1 b 2002-02-02 2003-10-10 2008-03-11 FALSE TRUE
3 1 a 2003-03-10 2004-03-30 2008-03-11 FALSE TRUE
4 1 a 2004-04-04 NA 2008-03-11 TRUE TRUE
5 2 a 2001-01-01 2002-01-01 2009-01-01 FALSE TRUE
6 2 b 2002-02-02 2003-03-01 2009-01-01 FALSE TRUE
7 2 a 2003-03-03 2003-06-01 2009-01-01 TRUE TRUE
8 3 a 2001-01-01 2001-01-01 2001-03-01 FALSE TRUE
9 3 a 2001-01-02 NA 2001-03-01 TRUE FALSE
I group by type and, for each type, count the parts that meet a rule. This count is working_at_6m.
I now want to calculate working_at_6m as a proportion of the rows where atleast_6m == T.
The output should be:
type Total working_at_6m `working_at_6m%`
a 7 4 0.667 #i.e 4/6
b 2 2 1
This is what I have tried:
q_sum = q%>%
ungroup()%>%
group_by(type)%>%
summarise(Total = n(),
working_at_6m = sum(case_when(!is.na(fail) & interval(create,fail)/months(1) >= 6 ~T,
last_for_unit ==T & interval(create,last)/months(1) >= 6 ~T,
T~F)),
`working_at_6m%` = working_at_6m/Total[atleast_6m ==T])
Which produces
q_sum
# A tibble: 8 x 4
# Groups: type [2]
type Total working_at_6m `working_at_6m%`
<chr> <int> <int> <dbl>
1 a 7 4 0.571
2 a 7 4 NA
3 a 7 4 NA
4 a 7 4 NA
5 a 7 4 NA
6 a 7 4 NA
7 b 2 2 1
8 b 2 2 NA
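Total is a single value per group, so Total[atleast_6m == T] indexes a length-one vector with a longer logical vector and returns NA for every position after the first. Count the qualifying rows with sum(atleast_6m) instead. You can try: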
library(dplyr)
q%>%
ungroup()%>%
group_by(type)%>%
summarise(Total = n(),
working_at_6m = sum(case_when(!is.na(fail) & interval(create,fail)/months(1) >= 6 ~T,
last_for_unit ==T & interval(create,last)/months(1) >= 6 ~T,
T~F)),
`work_at_6m%` = working_at_6m/sum(atleast_6m))
# A tibble: 2 x 4
type Total working_at_6m `work_at_6m%`
<chr> <int> <int> <dbl>
1 a 7 4 0.667
2 b 2 2 1
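The extra NA rows in the question's output come from base R indexing rather than from dplyr. A small sketch using the type "a" values from the example (the variable names bad and good are only illustrative):

```r
# Per-group values for type "a", taken from the example data.
Total <- 7L
working_at_6m <- 4L
atleast_6m <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

# Indexing a length-one vector with a longer logical vector:
# position 1 returns 7, positions 2 to 6 do not exist and return NA.
Total[atleast_6m]

# So the division yields six values (0.571 followed by five NAs),
# which is why summarise() produced six rows for type "a".
bad <- working_at_6m / Total[atleast_6m]

# Counting the TRUE values gives the intended denominator, 6.
good <- working_at_6m / sum(atleast_6m)
```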

R dplyr::Filter dataframe by group and numeric vector?

I have dataframe df1 containing data and groups, and df2 which stores the same groups, and one value per group.
I want to keep the rows of df1 where the lagged difference within each group is higher than that group's value in df2.
Dummy example:
# identify the first year of disturbance by lag by group
df1 <- data.frame(year = c(1:4, 1:4),
mort = c(5,16,40,4,5,6,10,108),
distance = rep(c("a", "b"), each = 4))
df2 = data.frame(distance = c("a", "b"),
my.median = c(12,1))
Now calculate the lag between values (creates new column) and filter df1 based on column values of df2:
# calculate lag between years
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > df2$my.median) ##
This however does not produce expected results:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 2 16 a 11
2 3 40 a 24
3 4 108 b 98
Instead, I expect to get:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 3 40 a 24
2 1 5 b 5
3 3 10 b 4
The filter works well with a single value, but how can I adapt it to a vector of per-group values, especially since the order of the groups can change?
Is this what you're trying to do?
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
left_join(df2) %>%
filter(yearLag > my.median)
Result:
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance yearLag my.median
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 24 12
2 1 5 b 5 1
3 3 10 b 4 1
4 4 108 b 98 1
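The reason the original filter(yearLag > df2$my.median) misbehaves is vector recycling: the length-2 vector is reused along the rows of each group instead of being matched by group. A base R sketch for group "a", with a match() lookup as a join-free alternative (variable names are illustrative):

```r
year_lag_a <- c(5, 11, 24, -36)  # yearLag values for distance "a"
my_median <- c(12, 1)            # df2$my.median, in the order a, b

# The length-2 vector is recycled, so the comparison is against 12, 1, 12, 1.
recycled <- year_lag_a > my_median
which(recycled)  # rows 2 and 3, as in the unexpected output

# Looking up each row's own group value avoids the recycling.
distance <- rep("a", 4)
per_row_median <- my_median[match(distance, c("a", "b"))]
intended <- year_lag_a > per_row_median
which(intended)  # row 3 only
```

The left_join above does the same per-row lookup, and it stays correct if the groups in df2 are reordered.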
Here is a data.table approach:
library( data.table )
# create data.tables
setDT(df1);setDT(df2)
#create yearLag variable
df1[, yearLag := mort - shift( mort, type = "lag", fill = 0 ), by = .(distance) ]
#update join and filter wanted rows
df1[ df2, median.value := i.my.median, on = .(distance)][ yearLag > median.value, ][]
# year mort distance yearLag median.value
# 1: 3 40 a 24 12
# 2: 1 5 b 5 1
# 3: 3 10 b 4 1
# 4: 4 108 b 98 1
Came to the same conclusion. You should left_join the data frames.
df1 %>% left_join(df2, by="distance") %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > my.median)
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance my.median yearLag
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 12 24
2 1 5 b 1 5
3 3 10 b 1 4
4 4 108 b 1 98

Sum up two variables in a long-format dataframe with tidyverse

I have a simple data frame in a tidy format:
group variable value
<fct> <chr> <dbl>
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
2 fishers_here 140
2 money_per_fisher 8000
2 unnecessary_variable 304
3 fishers_here 10
3 money_per_fisher 9000
....
for each group I'd like to have the variable "total money in group" which is just fishers_here * money_per_fisher; basically I'd like it to look like this:
group variable value
<fct> <chr> <dbl>
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
1 TOTAL_MONEY 200000
....
Is there a simple way to get this done with tidyverse?
By simple I mean without having to filter, summarise, add the variable column back in and then join the two now separate dataframes.
You can spread, do the multiplication and then gather back up. Note I'm assuming that there is a typo in the group number in row 6 as I commented, where it should be group 2 instead of group 1. If that's not the case, then some additional cleaning steps are needed. You can also sort your resulting rows however you want (e.g. to put the rows for each group back together)
library(tidyverse)
tbl <- read_table2(
"group variable value
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
2 fishers_here 140
2 money_per_fisher 8000
2 unnecessary_variable 304
3 fishers_here 10
3 money_per_fisher 9000"
)
tbl %>%
spread(variable, value) %>%
mutate(total_money_in_group = money_per_fisher * fishers_here) %>%
gather(variable, value, -group)
#> # A tibble: 12 x 3
#> group variable value
#> <dbl> <chr> <dbl>
#> 1 1 fishers_here 100
#> 2 2 fishers_here 140
#> 3 3 fishers_here 10
#> 4 1 money_per_fisher 2000
#> 5 2 money_per_fisher 8000
#> 6 3 money_per_fisher 9000
#> 7 1 unnecessary_variable 10
#> 8 2 unnecessary_variable 304
#> 9 3 unnecessary_variable NA
#> 10 1 total_money_in_group 200000
#> 11 2 total_money_in_group 1120000
#> 12 3 total_money_in_group 90000
Created on 2019-02-04 by the reprex package (v0.2.1)
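spread() and gather() have since been superseded in tidyr (1.0.0 and later) by pivot_wider() and pivot_longer(). A minimal sketch of the same idea with the newer functions, assuming the data above; values_drop_na = TRUE is an addition here that drops the NA row created for group 3's missing unnecessary_variable.

```r
library(dplyr)
library(tidyr)

tbl <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 3, 3),
  variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher"),
  value = c(100, 2000, 10, 140, 8000, 304, 10, 9000),
  stringsAsFactors = FALSE
)

result <- tbl %>%
  # go wide: one column per variable
  pivot_wider(names_from = variable, values_from = value) %>%
  mutate(total_money_in_group = money_per_fisher * fishers_here) %>%
  # go back to long, dropping cells that are NA in the wide table
  pivot_longer(-group, names_to = "variable", values_to = "value",
               values_drop_na = TRUE)
```

Unlike gather(), pivot_longer() returns the rows grouped by the original wide row, so each group's variables stay together.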
Another option is to filter for 'money_per_fisher' and 'fishers_here', group by 'group', summarise to get the product of 'value', bind the result to the original data, and arrange by 'group':
library(tidyverse)
df1 %>%
filter(variable %in% c('fishers_here', 'money_per_fisher')) %>%
group_by(group) %>%
summarise(variable = "total_money_in_group", value = prod(value)) %>%
bind_rows(df1, .) %>%
arrange(group)
# A tibble: 11 x 3
# group variable value
# <int> <chr> <dbl>
# 1 1 fishers_here 100
# 2 1 money_per_fisher 2000
# 3 1 unnecessary_variable 10
# 4 1 total_money_in_group 200000
# 5 2 fishers_here 140
# 6 2 money_per_fisher 8000
# 7 2 unnecessary_variable 304
# 8 2 total_money_in_group 1120000
# 9 3 fishers_here 10
#10 3 money_per_fisher 9000
#11 3 total_money_in_group 90000
data
df1 <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
variable = c("fishers_here",
"money_per_fisher", "unnecessary_variable", "fishers_here", "money_per_fisher",
"unnecessary_variable", "fishers_here", "money_per_fisher"),
value = c(100L, 2000L, 10L, 140L, 8000L, 304L, 10L, 9000L
)), class = "data.frame", row.names = c(NA, -8L))
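Based on your output, I think this is a possible solution (keep only the two variables being multiplied, otherwise prod() also includes unnecessary_variable):
df %>%
filter(variable %in% c("fishers_here", "money_per_fisher")) %>%
group_by(group) %>%
summarise(value = prod(value))
Edit: If you want a column on the original dataset, you can use mutate instead of summarise (and move the variable condition inside prod(), e.g. prod(value[variable %in% c("fishers_here", "money_per_fisher")]), so that no rows are dropped).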

dplyr - How to obtain the order of one column within a group?

Example data:
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
> tibbly
# A tibble: 12 x 4
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 10 A X 1
2 30 A X 2
3 50 A X 3
4 10 A Y 4
5 30 A Y 4
6 50 A Y 6
7 10 B X 2
8 30 B X 5
9 50 B X 3
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
Question:
How can I obtain the order of rows for each group in a data frame? I can use dplyr to arrange the data in an appropriate form to visualize what I am interested in:
> tibbly %>%
group_by(grouping1, grouping2) %>%
arrange(grouping1, grouping2, desc(value))
# A tibble: 12 x 4
# Groups: grouping1, grouping2 [4]
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 50 A X 3
2 30 A X 2
3 10 A X 1
4 50 A Y 6
5 10 A Y 4
6 30 A Y 4
7 30 B X 5
8 50 B X 3
9 10 B X 2
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
In the end I am interested in the order of the age column within each group, based on the value column. Is there an elegant way to do this with dplyr? Something like summarise(), but based on the order of rows rather than the actual values.
library(dplyr)
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
tibbly %>%
group_by(grouping1, grouping2) %>% # for each group
arrange(desc(value)) %>% # arrange value descending
summarise(order = paste0(age, collapse = ",")) %>% # get the order of age as a string
ungroup() # forget the grouping
# # A tibble: 4 x 3
# grouping1 grouping2 order
# <chr> <chr> <chr>
# 1 A X 50,30,10
# 2 A Y 50,10,30
# 3 B X 30,50,10
# 4 B Y 10,30,50
With data.table
library(data.table)
setDT(tibbly)[order(-value), .(order = toString(age)),.(grouping1, grouping2)]
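A variation on the dplyr answer, sketched under the same example data: index age by order(-value) inside summarise(), so the result does not depend on a preceding arrange(). Ties keep their original row order because order() is stable here.

```r
library(dplyr)

tibbly <- data.frame(
  age = rep(c(10, 30, 50), times = 4),
  grouping1 = rep(c("A", "B"), each = 6),
  grouping2 = rep(rep(c("X", "Y"), each = 3), times = 2),
  value = c(1, 2, 3, 4, 4, 6, 2, 5, 3, 6, 3, 2),
  stringsAsFactors = FALSE
)

result <- tibbly %>%
  group_by(grouping1, grouping2) %>%
  # order(-value) gives the row positions sorted by descending value
  summarise(order = paste(age[order(-value)], collapse = ","),
            .groups = "drop")
```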

Display weighted.mean of sub-groups with using column index instead of name in group_by_at

Sticking to library dplyr, I need to calculate weighted average of a variable by subgroups of other variables using column indexes instead of column names. Here is the example:
data <- read.table(text = 'obs income education type weight
1 1000 A blue 10
2 2000 B yellow 1
3 1500 B blue 5
4 2000 A yellow 2
5 3000 B yellow 2',
header = TRUE)
Everything goes well using group_by, weighted.mean and mutate when using column names for grouping:
df <-data %>%
group_by(education,type) %>%
mutate(weighted_income = weighted.mean(income, weight))
df
# A tibble: 5 x 6
# Groups: education, type [4]
obs income education type weight weighted_income
<int> <int> <fct> <fct> <int> <dbl>
1 1 1000 A blue 10 1000.
2 2 2000 B yellow 1 2667.
3 3 1500 B blue 5 1500.
4 4 2000 A yellow 2 2000.
5 5 3000 B yellow 2 2667.
But I need to use column indexes instead of column names. I was able to make group_by_at work, but only for one group, like this (column 3 = education):
df %>%
group_by_at(3) %>%
mutate(weighted_income = weighted.mean(income, weight))
df
# A tibble: 5 x 6
# Groups: education [2]
obs income education type weight weighted_income
<int> <int> <fct> <fct> <int> <dbl>
1 1 1000 A blue 10 1167.
2 2 2000 B yellow 1 1938.
3 3 1500 B blue 5 1938.
4 4 2000 A yellow 2 1167.
5 5 3000 B yellow 2 1938.
But I get an error for sub-groups (education = column 3, type= column 4)
df %>%
group_by_at(3,4) %>%
mutate(weighted_income = weighted.mean(income, weight))
Error: Can't create call to non-callable object
How to make this last piece of code work for sub-groups?
My query is related to this topic on grouping using column indexes rather than column names, but the answers there only cover groups, not sub-groups.
We need to concatenate the indexes. Without that, group_by_at takes the 3 as .vars and the 4 as .funs, based on its usage:
group_by_at(.tbl, .vars, .funs = list(), ..., .add = FALSE)
So concatenate the indexes with c(), and both are evaluated as .vars:
data %>%
group_by_at(c(3, 4)) %>%
mutate(weighted_income = weighted.mean(income, weight))
# A tibble: 5 x 6
# Groups: education, type [4]
# obs income education type weight weighted_income
# <int> <int> <fctr> <fctr> <int> <dbl>
#1 1 1000 A blue 10 1000
#2 2 2000 B yellow 1 2667
#3 3 1500 B blue 5 1500
#4 4 2000 A yellow 2 2000
#5 5 3000 B yellow 2 2667
Or we can wrap the indexes in vars() to make explicit that they are the .vars:
data %>%
group_by_at(vars(3, 4)) %>%
mutate(weighted_income = weighted.mean(income, weight))
# A tibble: 5 x 6
# Groups: education, type [4]
# obs income education type weight weighted_income
# <int> <int> <fctr> <fctr> <int> <dbl>
#1 1 1000 A blue 10 1000
#2 2 2000 B yellow 1 2667
#3 3 1500 B blue 5 1500
#4 4 2000 A yellow 2 2000
#5 5 3000 B yellow 2 2667
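In dplyr 1.0.0 and later, the scoped verbs such as group_by_at() are superseded in favour of across(). A minimal sketch of the same grouping by position, assuming the example data; group_cols is an illustrative name, and the positions are translated to names first so the selection is explicit.

```r
library(dplyr)

data <- data.frame(
  obs = 1:5,
  income = c(1000, 2000, 1500, 2000, 3000),
  education = c("A", "B", "B", "A", "B"),
  type = c("blue", "yellow", "blue", "yellow", "yellow"),
  weight = c(10, 1, 5, 2, 2),
  stringsAsFactors = FALSE
)

# Translate column positions 3 and 4 into names: "education", "type".
group_cols <- names(data)[c(3, 4)]

result <- data %>%
  group_by(across(all_of(group_cols))) %>%
  mutate(weighted_income = weighted.mean(income, weight)) %>%
  ungroup()
```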
