group "weighted" mean with multiple grouping variables and excluding own group value - r

I'm trying to compute a group "weighted" mean with multiple grouping variables, excluding each row's own value from its group's calculation. This is related to my earlier post, Get group mean with multiple grouping variables and excluding own group value, but when I applied that approach to my actual problem (getting the weighted mean) I found it is much more complicated than getting the simple mean. Here's what I mean by that.
library(dplyr)

df <- data_frame(  # data_frame() still works but is deprecated; tibble() is the current equivalent
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = c(91, 46, 37, 80, 33, 97, 4, 19, 85, 90, 56, 94),
  wt = c(1, 4, 3, 5, 1, 4, 5, 1, 5, 5, 4, 1)
) %>% arrange(state, year)
For the unweighted case, the following code (from the accepted answer to my earlier post) works.
df %>%
  group_by(state, year) %>%
  mutate(q = (sum(value) - value) / (n() - 1))
The desired variable new_val is the leave-one-out weighted mean within each state-year group. For instance, the first two rows of the new_val column are calculated as (37*3 + 33*1)/(3 + 1) = 36 and (91*1 + 33*1)/(1 + 1) = 62.
# A tibble: 12 x 6
   state county  year value    wt new_val
   <chr> <chr>  <int> <dbl> <dbl>   <dbl>
 1 AL    a       2011    91     1    36
 2 AL    b       2011    37     3    62
 3 AL    c       2011    33     1    50.5
 4 AL    a       2012    46     4    87.6
 5 AL    b       2012    80     5    71.5
 6 AL    c       2012    97     4    64.9
 7 CA    d       2011     4     5    72.1
 8 CA    e       2011    85     5    27.1
 9 CA    f       2011    56     4    44.5
10 CA    d       2012    19     1    90.7
11 CA    e       2012    90     5    56.5
12 CA    f       2012    94     1    78.2
I searched for similar posts with weighted means in mind, but all the ones I could find cover only the simple mean case. Any comments would be greatly appreciated. Thank you!

We can use map_dbl to exclude the current row from the weighted.mean calculation:
library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(new_val = purrr::map_dbl(row_number(),
                                  ~ weighted.mean(value[-.x], wt[-.x])))
#    state county  year value    wt new_val
#    <chr> <chr>  <int> <dbl> <dbl>   <dbl>
#  1 AL    a       2011    91     1    36
#  2 AL    b       2011    37     3    62
#  3 AL    c       2011    33     1    50.5
#  4 AL    a       2012    46     4    87.6
#  5 AL    b       2012    80     5    71.5
#  6 AL    c       2012    97     4    64.9
#  7 CA    d       2011     4     5    72.1
#  8 CA    e       2011    85     5    27.1
#  9 CA    f       2011    56     4    44.5
# 10 CA    d       2012    19     1    90.7
# 11 CA    e       2012    90     5    56.5
# 12 CA    f       2012    94     1    78.2
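A vectorized alternative (my addition, not part of the original answer): the leave-one-out weighted mean is just (sum(value * wt) - value * wt) / (sum(wt) - wt), so the per-row map_dbl can be avoided entirely. A minimal sketch using that identity:
library(dplyr)

df %>%
  group_by(state, year) %>%
  # subtract each row's own weighted contribution from the group totals
  mutate(new_val = (sum(value * wt) - value * wt) / (sum(wt) - wt)) %>%
  ungroup()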

Related

R: Cumulative Mean Excluding Current Value?

I am working with the R programming language.
I have a dataset that looks something like this:
id = c(1, 1, 1, 1, 2, 2, 2)
year = c(2010, 2011, 2012, 2013, 2012, 2013, 2014)
var = rnorm(7, 7, 7)   # random draws, so without a seed the exact values below won't reproduce
my_data = data.frame(id, year, var)
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658
For each "group" within the ID column - at each row, I want to take the CUMULATIVE MEAN of the "var" column but EXCLUDE the value of "var" within that row (i.e. most recent).
As an example:
row 1: NA
row 2: 12.186300/1
row 3: (12.186300 + 19.069836)/2
row 4: (12.186300 + 19.069836 + 7.456078)/3
row 5: NA
row 6: 20.827933/1
row 7: (20.827933 + 5.029625)/2
I found this post here (Cumsum excluding current value) which (I think) shows how to do this for the "cumulative sum" - I tried to apply the logic here to my question:
transform(my_data, cmean = ave(var, id, FUN = cummean) - var)
id year var cmean
1 1 2010 12.186300 0.000000
2 1 2011 19.069836 -3.441768
3 1 2012 7.456078 5.447994
4 1 2013 14.875019 -1.478211
5 2 2012 20.827933 0.000000
6 2 2013 5.029625 7.899154
7 2 2014 -2.260658 10.126291
The code appears to have run - but I don't think I have done this correctly (i.e. the numbers produced don't match up with the numbers I had anticipated).
I then tried an answer provided here (Compute mean excluding current value):
my_data %>%
  group_by(id) %>%
  mutate(avg = (sum(var) - var) / (n() - 1))
# A tibble: 7 x 4
# Groups: id [2]
id year var avg
<dbl> <dbl> <dbl> <dbl>
1 1 2010 12.2 13.8
2 1 2011 19.1 11.5
3 1 2012 7.46 15.4
4 1 2013 14.9 12.9
5 2 2012 20.8 1.38
6 2 2013 5.03 9.28
But it is still not working.
Can someone please show me what I am doing wrong and how I can fix this problem?
Thanks!
lag() the running cummean() so that each row only sees the values that came before it:
my_data %>%
  group_by(id) %>%
  mutate(avg = lag(cummean(var)))
# A tibble: 7 × 4
# Groups:   id [2]
     id  year    var   avg
  <dbl> <dbl>  <dbl> <dbl>
1     1  2010 12.2    NA
2     1  2011 19.1    12.2
3     1  2012  7.46   15.6
4     1  2013 14.9    12.9
5     2  2012 20.8    NA
6     2  2013  5.03   20.8
7     2  2014 -2.26   12.9
With the help of some intermediate variables you can do it like so:
library(dplyr)
df <- read.table(text = "
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658", header=T)
df |>
  group_by(id) |>
  mutate(id_g = row_number()) |>         # position within the group
  mutate(ms = cumsum(var)) |>            # running sum including the current row
  mutate(cm = (ms - var) / (id_g - 1),   # drop the current row from the running mean
         cm = ifelse(id_g == 1, NA, cm)) |>
  select(-id_g, -ms)
#> # A tibble: 7 × 4
#> # Groups:   id [2]
#>      id  year    var    cm
#>   <int> <int>  <dbl> <dbl>
#> 1     1  2010 12.2    NA
#> 2     1  2011 19.1    12.2
#> 3     1  2012  7.46   15.6
#> 4     1  2013 14.9    12.9
#> 5     2  2012 20.8    NA
#> 6     2  2013  5.03   20.8
#> 7     2  2014 -2.26   12.9
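For completeness, here is a base R sketch of the same leave-the-current-row-out cumulative mean (my addition, not one of the original answers), using ave() to apply a shifted running mean within each id:
# shift the running mean down one row so each row only averages the earlier values
my_data$avg <- ave(my_data$var, my_data$id,
                   FUN = function(x) c(NA, head(cumsum(x) / seq_along(x), -1)))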

Sum up with the next line into a new column

I'm having some trouble figuring out how to create a new column that accumulates the values up to and including the next row.
I have :
library(tibble)
df1 <- tibble(Years = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
Now, I want a new column where the first line is the sum 1+2, the second line is the sum 1+2+3, the third line is the sum 1+2+3+4, and so on.
As 1, 2, 3, 4, ... are hypothetical values, I need to measure the absolute growth from one decade to the next in order to later create a new variable measuring the percentage change from one decade to the next.
library(tibble)
df1 <- tibble(Years = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
  mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
  mutate(x = lead(cumsum(Values))) %>%
  fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
The building block here is the cumsum() function. Here are two ways to compute a plain running total; note that this keeps the sum only up to the current row, so to match the desired output you would still shift it forward by one row (with lead() or shift()) as in the other answers.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
  mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
setDT(df1)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
  df1,
  newCol = c(cumsum(Values)[-1], sum(Values))
)
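Since the question mentions computing the percentage change from one decade to the next afterwards, here is a hedged sketch of that follow-up step (the column names x and pct_change are my own, not from the question):
library(dplyr)
library(tidyr)

df1 %>%
  mutate(x = lead(cumsum(Values))) %>%
  fill(x) %>%
  mutate(pct_change = 100 * (x - lag(x)) / lag(x))   # percent change versus the previous decade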

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from python pandas

Consider a data frame in R where I want to drop row 6 because it has missing observations for the variables var1:var3, even though it has valid observations for id and year. See the code below.
In Python, this can be done in two ways:
use df.dropna(subset = ['var1', 'var2', 'var3'], inplace=True)
use df.set_index(['id', 'year']).dropna()
How can I do this in R with the tidyverse?
library(tidyverse)
df <- tibble(id = c(seq(1, 10)), year = c(seq(2001, 2010)),
             var1 = c(sample(1:100, 10, replace = TRUE)),
             var2 = c(sample(1:100, 10, replace = TRUE)),
             var3 = c(sample(1:100, 10, replace = TRUE)))
df[3, 4] = NA
df[6, 3:5] = NA
df[8, 3:4] = NA
df[10, 4:5] = NA
We may use complete.cases
library(dplyr)
df %>%
  filter(if_any(var1:var3, complete.cases))
-output
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 48 55 82
2 2 2002 22 83 67
3 3 2003 89 NA 19
4 4 2004 56 1 38
5 5 2005 17 58 35
6 7 2007 4 30 94
7 8 2008 NA NA 36
8 9 2009 97 100 80
9 10 2010 37 NA NA
We can use pmap for this case also:
library(dplyr)
library(purrr)
df %>%
  filter(!pmap_lgl(., ~ {
    x <- c(...)[-c(1, 2)]
    all(is.na(x))
  }))
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 90 55 77
2 2 2002 77 5 18
3 3 2003 17 NA 70
4 4 2004 72 33 33
5 5 2005 10 55 77
6 7 2007 22 81 17
7 8 2008 NA NA 46
8 9 2009 93 28 100
9 10 2010 50 NA NA
Or we could also use the complete.cases function in pmap, as suggested by @akrun:
df %>%
  filter(pmap_lgl(select(., 3:5), ~ any(complete.cases(c(...)))))
You can use if_any in filter -
library(dplyr)
df %>% filter(if_any(var1:var3, Negate(is.na)))
# id year var1 var2 var3
# <int> <int> <int> <int> <int>
#1 1 2001 14 99 43
#2 2 2002 25 72 76
#3 3 2003 90 NA 15
#4 4 2004 91 7 32
#5 5 2005 69 42 7
#6 7 2007 57 83 41
#7 8 2008 NA NA 74
#8 9 2009 9 78 23
#9 10 2010 93 NA NA
In base R, we can use rowSums to select rows that have at least one non-NA value.
cols <- grep('var', names(df))
df[rowSums(!is.na(df[cols])) > 0, ]
If you are looking for complete cases instead, use the following (the kernel of this is based on the other answers):
library(tidyverse)
df <- tibble(id = c(seq(1, 10)), year = c(seq(2001, 2010)),
             var1 = c(sample(1:100, 10, replace = TRUE)),
             var2 = c(sample(1:100, 10, replace = TRUE)),
             var3 = c(sample(1:100, 10, replace = TRUE)))
df[3, 4] = NA
df[6, 3:5] = NA
df[8, 3:4] = NA
df[10, 4:5] = NA

df %>% filter(!if_any(var1:var3, is.na))
#> # A tibble: 6 x 5
#> id year var1 var2 var3
#> <int> <int> <int> <int> <int>
#> 1 1 2001 13 28 26
#> 2 2 2002 61 77 58
#> 3 4 2004 95 38 58
#> 4 5 2005 38 34 91
#> 5 7 2007 85 46 14
#> 6 9 2009 45 60 40
Created on 2021-06-24 by the reprex package (v2.0.0)
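As a side note (my addition, not from the answers above): tidyr's drop_na() with a column selection is the closest one-liner to pandas' dropna(subset=...) with its default how='any'; it drops every row that has an NA in any of the selected columns, so here it would remove rows 3, 6, 8 and 10 rather than row 6 alone.
library(dplyr)
library(tidyr)

# drop rows with an NA in any of var1:var3 (pandas-style how='any')
df %>% drop_na(var1:var3)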

Average of a variable by collapsing two columns in R

I wish to find the average temperature per season for each year. Each year is observed four times, and there are two seasons, each repeated twice, as shown below.
year=rep(c(1990:1992),each=4)
season=c("W","D","W","D","W","W","D","D","D","W","W","D")
temp=c(28,25,26,21,28,25,20,20,20,35,28,21)
df=data.frame(year,season,temp)
which gives
year season temp
1 1990 W 28
2 1990 D 25
3 1990 W 26
4 1990 D 21
5 1991 W 28
6 1991 W 25
7 1991 D 20
8 1991 D 20
9 1992 D 20
10 1992 W 35
11 1992 W 28
12 1992 D 21
I want to collapse this data to get the average of the two seasons for each year, as below:
year season avgtemp
1 1990 D 23.0
2 1990 W 27.0
3 1991 D 20.0
4 1991 W 26.5
5 1992 D 20.5
6 1992 W 31.5
How can I obtain this?
Try below:
aggregate(df[, 3], df[, 1:2], mean)
library(tidyverse)
df %>%
  group_by(year, season) %>%
  summarise(avgtemp = mean(temp))
# A tibble: 6 x 3
# Groups: year [?]
year season avgtemp
<int> <fct> <dbl>
1 1990 D 23
2 1990 W 27
3 1991 D 20
4 1991 W 26.5
5 1992 D 20.5
6 1992 W 31.5
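A data.table alternative (my addition, not one of the original answers) that performs the same grouped averaging:
library(data.table)

setDT(df)[, .(avgtemp = mean(temp)), by = .(year, season)]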

How to remove subjects with missing yearly observations in R?

num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432
I have data on various subjects observed over 5 years. I need to remove all subjects that are missing any of the years from 2011 to 2015. How can I accomplish this, so that in the given data only subject A is left?
Using data.table:
A data.table solution might look something like this:
library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name ][keep == T, ][,keep := NULL]
# num Name year age X
#1: 1 A 2011 68 116292
#2: 1 A 2012 69 46132
#3: 1 A 2013 70 7042
#4: 1 A 2014 71 -100425
#5: 1 A 2015 72 6493
This is more strict in that it requires the unique years to be exactly equal to 2011:2015. If there were a 2016, for example, that person would be excluded.
A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:
dt[, keep := all(2011:2015 %in% unique(year)), by = Name ][keep == T, ][,keep := NULL]
Thus, if A also had a 2016 and a 2010 entry, all of A would still be kept. But anyone missing a year in 2011:2015 would be excluded.
Using base R & aggregate:
Same option, but using aggregate from base R:
agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == TRUE, 1], ]
Here is a slightly more straightforward tidyverse solution.
First, expand the dataframe to include all combinations of Name + year:
library(dplyr)
library(tidyr)

df %>% complete(Name, year)
# A tibble: 20 x 5
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
6 B 2011 2 20 -8484
7 B 2012 NA NA NA
8 B 2013 NA NA NA
9 B 2014 NA NA NA
10 B 2015 NA NA NA
...
Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:
df %>%
  complete(Name, year) %>%
  group_by(Name) %>%
  filter(sum(is.na(age)) == 0)
# A tibble: 5 x 5
# Groups: Name [1]
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
Just check which names have the right number of entries.
## Reproduce your data
df = read.table(text=" num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header=TRUE)
Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
Here is a somewhat different approach using tidyverse packages:
library(tidyverse)
df <- read.table(text = " num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432")
df2 <- spread(data = df, key = Name, value = year)
# a Name column sums to more than 10000 only when all five years are present (2011 + ... + 2015 = 10065)
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000
df3 <- select(df2, num, age, X, c(4:7)[x])
df4 <- na.omit(df3)
All steps can of course be constructed as one single pipe with the %>% operator.
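For what it's worth, a more direct tidyverse sketch of the same idea (my addition, not one of the original answers): group by Name and keep only the groups that contain every year from 2011 to 2015.
library(dplyr)

df %>%
  group_by(Name) %>%
  filter(all(2011:2015 %in% year)) %>%
  ungroup()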
