How to calculate a ratio with lagged values per group? - r

I have the following dataset:
a<-data_frame(school= c(2,2,2,2,2,3,3,3,3,3,3,3),
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
Firstly, I want to replace every NA with the mean of that variable within its group, so the NA should become 2.43.
Secondly, I want to calculate a fourth variable: the ratio of the lagged value of school to the number of students.
data <-
a %>%
group_by(school) %>%
summarize(lag.value.ratio = lag(school, 1)/numberofstudents) %>% ungroup
Unfortunately, I get the following error: Error: Column lag.value.ratio must be length 1 (a summary value), not 5.
How can I avoid this error and get the group mean in place of NA?

If you want the group mean to replace the NAs, note that I calculate the mean for school 3 to be 2.83 (17/6), not 2.43. You are getting the error because you are using summarize, which wants to collapse each group to a single row (in this case 2 rows, one per school), while lag(school, 1)/numberofstudents returns one value per row. I believe what you want is mutate.
EDIT: I am loading the libraries used below and making sure that the lag function that is used is from the dplyr package.
library(dplyr)
library(tidyr)
a <- tibble(school = c(2,2,2,2,2,3,3,3,3,3,3,3),
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
a %>%
group_by(school) %>%
mutate(numberofstudents = replace_na(numberofstudents, mean(numberofstudents, na.rm = TRUE)),
lag.value.ratio = dplyr::lag(school, 1)/numberofstudents) %>%
ungroup()
gives
# A tibble: 12 x 4
school year numberofstudents lag.value.ratio
<dbl> <dbl> <dbl> <dbl>
1 2 2011 3 NA
2 2 2011 3 0.667
3 2 2011 3 0.667
4 2 2012 2 1
5 2 2012 2 1
6 3 2011 3 NA
7 3 2011 3 1
8 3 2011 3 1
9 3 2012 2 1.5
10 3 2012 2.83 1.06
11 3 2012 2 1.5
12 3 2012 4 0.75
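As a cross-check that needs no packages, the same two steps can be sketched in base R. This is only an illustration, assuming the same school and numberofstudents columns as above; the helper name lag_school is made up for this sketch.

```r
a <- data.frame(
  school = c(2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3),
  numberofstudents = c(3, 3, 3, 2, 2, 3, 3, 3, 2, NA, 2, 4)
)

# Replace each NA with the mean of its own school.
# ave() applies FUN within each group and returns one value per row.
a$numberofstudents <- ave(
  a$numberofstudents, a$school,
  FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
)

# Shift school down by one row within each school (hypothetical helper),
# so the first row of every group has no lagged value.
lag_school <- ave(a$school, a$school, FUN = function(x) c(NA, head(x, -1)))

a$lag.value.ratio <- lag_school / a$numberofstudents
```

Like mutate() on grouped data, ave() keeps all 12 rows, which is why no "must be length 1" error can occur here.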

Related

How to group non-excluding year ranges using a loop with dplyr

I'm new here, so my question may be hard to follow. I have some data with date information, and I need to compute the mean of the data over ranges of years. The ranges overlap: for example, my first range is 2013-2015, then 2014-2016, then 2015-2017, and so on. I think this could be done with a loop and dplyr, but I don't know how. I'd be very thankful if someone could help me.
Thank you,
Alejandro
What I tried was like:
for (i in Year){
Year_3=c(i, i+1, i+2)
db %>% group_by(Year_3)
#....etc
}
As you note, each observation would be used in multiple groups, so one approach could be to make copies of your data accordingly:
df <- data.frame(year = 2013:2020, value = 1:8)
library(dplyr)
df %>%
tidyr::uncount(3, .id = "grp") %>%
mutate(group_start = year - grp + 1,
group_name = paste0(group_start, "-", group_start + 2)) %>%
group_by(group_name) %>%
summarise(value = mean(value),
n = n())
# A tibble: 10 × 3
group_name value n
<chr> <dbl> <int>
1 2011-2013 1 1
2 2012-2014 1.5 2
3 2013-2015 2 3
4 2014-2016 3 3
5 2015-2017 4 3
6 2016-2018 5 3
7 2017-2019 6 3
8 2018-2020 7 3
9 2019-2021 7.5 2
10 2020-2022 8 1
Or we might take a more algebraic approach, noting that the sum over a three-year period is the cumulative sum two years ahead minus the cumulative sum of the prior year. This approach excludes the partial ranges.
df %>%
mutate(cuml = cumsum(value),
value_3yr = (lead(cuml, n = 2) - lag(cuml, default = 0)) / 3)
year value cuml value_3yr
1 2013 1 1 2
2 2014 2 3 3
3 2015 3 6 4
4 2016 4 10 5
5 2017 5 15 6
6 2018 6 21 7
7 2019 7 28 NA
8 2020 8 36 NA
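To see why the cumulative-sum version works, it can be compared with a direct three-year mean. A small base R check, assuming the same df as above (the names ahead2, prior, via_cumsum and direct are illustrative only):

```r
df <- data.frame(year = 2013:2020, value = 1:8)
n <- nrow(df)
cuml <- cumsum(df$value)

# Cumulative sum two rows ahead (NA where no such row exists).
ahead2 <- c(cuml[-(1:2)], NA, NA)
# Cumulative sum of the previous row, with 0 before the first row.
prior <- c(0, cuml[-n])

# Sum of rows i..i+2 equals cuml[i + 2] - cuml[i - 1]; divide by 3 for the mean.
via_cumsum <- (ahead2 - prior) / 3

# Direct computation of the same window mean, for comparison.
direct <- sapply(seq_len(n), function(i) {
  if (i + 2 <= n) mean(df$value[i:(i + 2)]) else NA_real_
})

isTRUE(all.equal(via_cumsum, direct))
```

Both vectors are NA for the last two years, since a full three-year window is not available there.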

Panel data: Calculate group means while omitting first period from calculation

I have an issue with a certain kind of mean() calculation. I use a panel data set with two identifiers, "ID" and "year" (using the plm package).
I want to calculate the groupwise mean of a variable "y" but omit each ID's first year from the calculation, and then fill in the calculated mean only in the years that were used to calculate it. In other words, I want NA in every ID's first entry of this variable.
The panel data is unbalanced, so people come and go at different points in time. Some stay from beginning to end; for others I have data for only three years.
library(tidyverse)
library(plm)
ID <- c("a","a","a","a","a","b","b","b","b","c","c","c")
y <- c(9,2,5,3,3,9,1,2,3,9,2,5)
year<- c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
dt <- data.frame(ID,y,year)
dt <- pdata.frame(dt, index = c("ID","year"))
I first tried a filter over periods like so:
dt <- dt %>% group_by(ID) %>%
filter(year %in% first(year)+1:last(year)) %>%
mutate(mean.y = mean(y))
But that doesn't work, and I am not surprised to be honest but I hope you know what I want to achieve. The final result should look like this:
See how the first entry of variable y = 9 for "a-2001" is left out so that it doesn't affect the mean of individual a's other y entries, (2+5+3+3)/4.
I hope this is understandable. I would massively appreciate any help.
Bye
We could work with an ifelse inside mutate. It's more code, but I think it's quite readable and easy to understand what's going on.
library(tidyverse)
library(plm)
dt %>%
group_by(ID) %>%
mutate(mean.y = ifelse(year == first(year),
NA,
mean(y[year != first(year)], na.rm = TRUE)))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID y year mean.y
#> <fct> <dbl> <fct> <dbl>
#> 1 a 9 2001 NA
#> 2 a 2 2002 3.25
#> 3 a 5 2003 3.25
#> 4 a 3 2004 3.25
#> 5 a 3 2005 3.25
#> 6 b 9 2001 NA
#> 7 b 1 2002 2
#> 8 b 2 2003 2
#> 9 b 3 2004 2
#> 10 c 9 2002 NA
#> 11 c 2 2003 3.5
#> 12 c 5 2004 3.5
Created on 2022-01-23 by the reprex package (v0.3.0)
Here is a dplyr solution. You can calculate the mean of all values except the first one and then use the is.na<- replacement function to set the first element of mean.y to NA.
library(dplyr)
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L]), mean.y = `is.na<-`(mean.y, 1L))
Output
# A tibble: 12 x 4
# Groups: ID [3]
ID y year mean.y
<chr> <dbl> <dbl> <dbl>
1 a 9 2001 NA
2 a 2 2002 3.25
3 a 5 2003 3.25
4 a 3 2004 3.25
5 a 3 2005 3.25
6 b 9 2001 NA
7 b 1 2002 2
8 b 2 2003 2
9 b 3 2004 2
10 c 9 2002 NA
11 c 2 2003 3.5
12 c 5 2004 3.5
More compactly,
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L])[n():1 %/% n() + 1L])
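The compact version relies on two details of R: the : operator binds more tightly than %/%, and indexing a length-1 vector at position 2 yields NA. A small base R sketch of the index trick for a single group, using ID "a" from the example data (the names y_a, idx and result are illustrative):

```r
y_a <- c(9, 2, 5, 3, 3)   # y values of ID "a"
n <- length(y_a)          # plays the role of n() inside mutate()

# (n:1) %/% n is 1 for the first element and 0 for the rest,
# so adding 1L gives index 2 first and index 1 everywhere else.
idx <- n:1 %/% n + 1L

m <- mean(y_a[-1L])       # mean without the first period, a length-1 vector

# Position 2 of a length-1 vector does not exist, so it comes back as NA.
result <- m[idx]
result
```

This keeps the first period out of the mean and marks it NA in one expression, at some cost to readability compared with the ifelse version.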

How to drop NA's out of the summarise(count = n()) function in R?

I have a dataset containing 4 organisation units (org_unit) with different number of participants and 2 Questions (Q1,Q2) on a 2-degree scale (1:2). I want to know how many people per unit answered the respective question with [1] and divide them by the total number of participants / unit.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(-9,-9,-9,-9,-9,-9,-9,-9,-9,-9)
The problem is, my Q2 only consists of [-9] which stands for non-response. I therefore assigned NA to [-9].
DF <- data.frame(Org_unit, Q1, Q2)
DF[DF == -9] <- NA
DF
Org_unit Q1 Q2
1 1 1 NA
2 1 2 NA
3 1 1 NA
4 1 2 NA
5 2 1 NA
6 2 2 NA
7 2 1 NA
8 3 2 NA
9 3 1 NA
10 4 2 NA
Next I calculated the proportion of people who answered Q1 with [1], which works fine.
prop_q1 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q1 == 1))
prop_q1
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 0.5
2 2 3 0.667
3 3 2 0.5
4 4 1 0
When I run the same code for Q2, however, I get the same number of members per unit (count = c(4,3,2,1)), although nobody answered the question. I don't want them to be registered as participants, since they technically didn't participate in the study.
prop_q2 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
prop_q2
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 NA
2 2 3 NA
3 3 2 NA
4 4 1 NA
Is there a way to calculate the right amount of members per unit when facing NA's? [-9]
Thanks!
Would
prop_q2 <- DF %>%
filter(!is.na(Q2)) %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
do the job?
Given that you want to do this across multiple columns, I think that using across() within the dplyr verbs will be better for you. I explain the solution below.
library(dplyr)
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(1,-9,-9,-9,-9,-9,-9,-9,-9,-9) #Note one response
df <- tibble(Org_unit, Q1, Q2)
df %>%
mutate(across(starts_with("Q"), ~na_if(., -9))) %>%
group_by(Org_unit) %>%
summarize(across(starts_with("Q"),
list(
N = ~sum(!is.na(.)),
prop = ~sum(. == 1, na.rm = TRUE)/sum(!is.na(.)))
))
# A tibble: 4 x 5
Org_unit Q1_N Q1_prop Q2_N Q2_prop
* <dbl> <int> <dbl> <int> <dbl>
1 1 4 0.5 1 1
2 2 3 0.667 0 NaN
3 3 2 0.5 0 NaN
4 4 1 0 0 NaN
First, we take the data frame (which I created as a tibble) and substitute NA for all values that equal -9 for all columns that start with a capital "Q". This converts all question columns to have NAs in place of -9s.
Second, we group by the organizational unit and then summarize using two functions. The first counts the responses to the question that are not NA; the string _N is appended to the names of the columns holding these counts. The second calculates the proportion of non-missing responses equal to 1, and _prop is appended to the names of those columns. A unit with no responses gets NaN, because the proportion is 0/0.
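The difference between n() and the NA-aware count can be seen on a single vector. A minimal base R sketch with made-up responses (q and empty are hypothetical, not columns from the data above):

```r
q <- c(1, NA, NA)                  # one real answer, two non-responses

n_rows <- length(q)                # what n() would report: every row
n_answered <- sum(!is.na(q))       # rows with an actual answer

prop_naive <- mean(q == 1)         # NA, because the comparison contains NA
prop_clean <- sum(q == 1, na.rm = TRUE) / n_answered

# A unit where nobody answered: zero answers divided by zero respondents.
empty <- c(NA_real_, NA_real_)
prop_empty <- sum(empty == 1, na.rm = TRUE) / sum(!is.na(empty))
```

Here prop_empty is NaN rather than NA, which makes units without any respondents easy to tell apart from units with a genuine proportion of 0.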

Acceptable practice to use 'group_by' stats in mutate?

In the past, when I've needed to create a new variable in an R data frame that is partly based on a 'group_by' summary statistic, I've always used the following sequence:
(1) calculate 'group stats' from data in the base (ungrouped) data frame using group_by() and summarize()
(2) join the base data frame with the result of the previous step, then calculate the new variable value using mutate.
However, (after years of using dplyr!) I accidentally did the 'summarizing' in a mutate step and everything seemed to work. This is illustrated in Option #2 in the code snippet below. I'm assuming Option #2 is okay because I'm getting identical results using both options, and because I found similar examples searching the web today. However, I wasn't sure.
Is Option #2 acceptable practice, or is Option #1 preferred (and if so why)?
set.seed(123)
df <- tibble(year_ = c(rep(c(2019), 4), rep(c(2020), 4)),
qtr_ = c(rep(c(1,2,3,4), 2)),
foo = sample(seq(1:8)))
# Option 1: calc statistics then rejoin with input data
df_stats <- df %>%
group_by(year_) %>%
summarize(mean_foo = mean(foo))
df_with_stats <- left_join(df, df_stats) %>%
mutate(dfoo = foo - mean_foo)
# Option 2: everything in one go
df_with_stats2 <- df %>%
group_by(year_) %>%
mutate(mean_foo = mean(foo),
dfoo = foo - mean_foo)
df_with_stats
# A tibble: 8 x 5
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
df_with_stats2
# A tibble: 8 x 5
# Groups: year_ [2]
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
Option 2 is fine, if you don't need the intermediate object anyway, and you don't even need to create mean_foo in your mutate statement:
df %>% group_by(year_) %>% mutate(dfoo=foo-mean(foo))
Also, with data.table:
library(data.table)
setDT(df)[, dfoo := foo - mean(foo), by = year_]
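One practical difference between the two options is visible in the printed output above: the Option 2 result is still grouped by year_. A short sketch, assuming dplyr is installed and using the foo values shown in the output rather than sample(), of how to drop the grouping once the calculation is done:

```r
library(dplyr)

df <- tibble(
  year_ = rep(c(2019, 2020), each = 4),
  qtr_  = rep(1:4, times = 2),
  foo   = c(7, 8, 3, 6, 2, 4, 5, 1)   # values copied from the output above
)

# Group statistics computed directly inside mutate(); the result stays grouped.
grouped <- df %>%
  group_by(year_) %>%
  mutate(dfoo = foo - mean(foo))

# ungroup() removes the grouping so later verbs act on the whole table again.
flat <- grouped %>% ungroup()

is_grouped_df(grouped)
is_grouped_df(flat)
```

Forgetting the ungroup() step is harmless for this calculation, but a later mutate() or summarise() on the grouped result would silently work per year.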

Add a calculated column based on same and two other columns in r

I'm trying to add a calculated column whose values depend on another value in the same column, matched through two other columns. There are three columns: year, id, and value. If the id for 2011 matches the id for 2005, subtract the 2005 value from the 2011 value. So the difference is 10-11=-1, 20-5=15, and 30-6=24; the remaining rows can be 0 or NA, it doesn't matter. The following table shows the result with the new difference column.
I know I could split the data into two tables and then create the column via a simple subtraction if the two tables are ordered the same by year and id, but that's not an option for this particular problem. Tried thinking of how I could use case_when or ifelse but it's a mind-bender and can't get my head around it. There are examples I've found but they don't address this - they're mostly based on using a comparison between only two columns, or perhaps three... here, though, one of the values is from the same column. How can I address this?
Your help is appreciated greatly in advance.
Here is the code for the original table:
dat <- data.frame(year=c(2011,2011,2011,2005,2005,2005),
id=c(1,2,3,1,2,3),
value=c(10,20,30,11,5,6))
For situations where there are multiple ids in your comment to Ronak's answer, you can do:
library(tidyr)
library(dplyr)
dat2 |>
pivot_wider(id_cols = id, values_from = value, names_from = year) |>
unnest(c(`2011`, `2005`)) |>
mutate(difference = `2011` - `2005`) |>
pivot_longer(c(`2011`, `2005`), names_to = "year")
# A tibble: 10 x 4
id difference year value
<dbl> <dbl> <chr> <dbl>
1 1 -1 2011 10
2 1 -1 2005 11
3 1 -1 2011 10
4 1 -1 2005 11
5 2 15 2011 20
6 2 15 2005 5
7 2 15 2011 20
8 2 15 2005 5
9 3 24 2011 30
10 3 24 2005 6
Arrange the data based on descending order of year value and for each id subtract the current value with the next one.
library(dplyr)
dat %>%
arrange(desc(year)) %>%
group_by(id) %>%
mutate(difference = value - lead(value)) %>%
#to get 0 instead of NA use the below one
#mutate(difference = value - lead(value, default = last(value))) %>%
ungroup
# year id value difference
# <dbl> <dbl> <dbl> <dbl>
#1 2011 1 10 -1
#2 2011 2 20 15
#3 2011 3 30 24
#4 2005 1 11 NA
#5 2005 2 5 NA
#6 2005 3 6 NA
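The lead() approach relies on each id having exactly one 2011 row followed by one 2005 row after sorting. A base R sketch that instead looks up the 2005 value by id with match(), assuming the dat from the question (the names old and is_new are illustrative):

```r
dat <- data.frame(
  year  = c(2011, 2011, 2011, 2005, 2005, 2005),
  id    = c(1, 2, 3, 1, 2, 3),
  value = c(10, 20, 30, 11, 5, 6)
)

old <- dat[dat$year == 2005, ]   # reference rows
is_new <- dat$year == 2011       # rows that receive a difference

# Start with NA everywhere, then fill only the 2011 rows.
dat$difference <- NA_real_

# match() finds, for each 2011 id, the position of the same id among the 2005 rows.
dat$difference[is_new] <-
  dat$value[is_new] - old$value[match(dat$id[is_new], old$id)]

dat
```

An id that appears in 2011 but not in 2005 gets NA here, because match() returns NA when it finds no partner.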
