I have an issue regarding a certain kind of mean() calculation. I use a panel data set with two identifiers, "ID" and "year" (using the plm package).
I want to calculate the group-wise mean of a variable "y", but omit each ID's first year from the calculation, and then fill in the calculated mean only in the years that were used to calculate it. In other words, I want NA in every ID's first entry of this variable.
The panel data is unbalanced, so people come and go at different points in time. Some stay from beginning to end; for others I only have data for three years.
library(tidyverse)
library(plm)
ID <- c("a","a","a","a","a","b","b","b","b","c","c","c")
y <- c(9,2,5,3,3,9,1,2,3,9,2,5)
year<- c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
dt <- data.frame(ID,y,year)
dt <- pdata.frame(dt, index = c("ID","year"))
I first tried a filter over periods like so:
dt <- dt %>% group_by(ID) %>%
filter(year %in% first(year)+1:last(year)) %>%
mutate(mean.y = mean(y))
But that doesn't work, and I'm not surprised, to be honest, but I hope you can see what I want to achieve. In the final result, the first entry of variable y (= 9) for "a-2001" should be left out so that it doesn't affect the mean of individual a's other y entries, (2+5+3+3)/4.
I hope you can understand it. I would massively appreciate any help.
Bye
We could work with an ifelse() inside mutate(). It's more code, but I think it's quite readable and easy to understand what's going on.
library(tidyverse)
library(plm)
dt %>%
group_by(ID) %>%
mutate(mean.y = ifelse(year == first(year),
NA,
mean(y[year != first(year)], na.rm = TRUE)))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID y year mean.y
#> <fct> <dbl> <fct> <dbl>
#> 1 a 9 2001 NA
#> 2 a 2 2002 3.25
#> 3 a 5 2003 3.25
#> 4 a 3 2004 3.25
#> 5 a 3 2005 3.25
#> 6 b 9 2001 NA
#> 7 b 1 2002 2
#> 8 b 2 2003 2
#> 9 b 3 2004 2
#> 10 c 9 2002 NA
#> 11 c 2 2003 3.5
#> 12 c 5 2004 3.5
Created on 2022-01-23 by the reprex package (v0.3.0)
Here is a dplyr solution. You can calculate the mean of all values except the first one and then use the is.na<- function to set the first element of mean.y to NA.
library(dplyr)
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L]), mean.y = `is.na<-`(mean.y, 1L))
Output
# A tibble: 12 x 4
# Groups: ID [3]
ID y year mean.y
<chr> <dbl> <dbl> <dbl>
1 a 9 2001 NA
2 a 2 2002 3.25
3 a 5 2003 3.25
4 a 3 2004 3.25
5 a 3 2005 3.25
6 b 9 2001 NA
7 b 1 2002 2
8 b 2 2003 2
9 b 3 2004 2
10 c 9 2002 NA
11 c 2 2003 3.5
12 c 5 2004 3.5
More compactly,
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L])[n():1 %/% n() + 1L])
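That compact version leans on R returning NA for an out-of-bounds index: mean(y[-1L]) is a single value, and the index vector n():1 %/% n() + 1L evaluates to 2 for a group's first row and 1 for all the others. A minimal sketch of the indexing for group a (five rows):
# ((5:1) %/% 5) + 1 gives 2 for the first row of the group and 1 for the rest;
# index 2 is out of bounds for the length-one mean, so the first row becomes NA.
(5:1) %/% 5 + 1L                          # 2 1 1 1 1
mean(c(2, 5, 3, 3))[(5:1) %/% 5 + 1L]     # NA 3.25 3.25 3.25 3.25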
Related
I'm new here, so maybe my question is difficult to understand. I have some data with date information, and I need to group the mean of the data into year ranges. But these year ranges are not mutually exclusive: for example, my first range is 2013-2015, then 2014-2016, then 2015-2017, and so on. I think this could be done using a loop and dplyr, but I don't know how to do it. I'll be very thankful if someone can help me.
Thank you,
Alejandro
What I tried was like:
for (i in Year){
Year_3=c(i, i+1, i+2)
db %>% group_by(Year_3)
#....etc
}
As you note, each observation would be used in multiple groups, so one approach could be to make copies of your data accordingly:
df <- data.frame(year = 2013:2020, value = 1:8)
library(dplyr)
df %>%
tidyr::uncount(3, .id = "grp") %>%
mutate(group_start = year - grp + 1,
group_name = paste0(group_start, "-", group_start + 2)) %>%
group_by(group_name) %>%
summarise(value = mean(value),
n = n())
# A tibble: 10 × 3
group_name value n
<chr> <dbl> <int>
1 2011-2013 1 1
2 2012-2014 1.5 2
3 2013-2015 2 3
4 2014-2016 3 3
5 2015-2017 4 3
6 2016-2018 5 3
7 2017-2019 6 3
8 2018-2020 7 3
9 2019-2021 7.5 2
10 2020-2022 8 1
Or we might take a more algebraic approach, noting that the sum of a three-year period is the difference between the cumulative amount two years in the future and the cumulative amount from the prior year. This approach excludes the partial ranges.
df %>%
mutate(cuml = cumsum(value),
value_3yr = (lead(cuml, n = 2) - lag(cuml, default = 0)) / 3)
year value cuml value_3yr
1 2013 1 1 2
2 2014 2 3 3
3 2015 3 6 4
4 2016 4 10 5
5 2017 5 15 6
6 2018 6 21 7
7 2019 7 28 NA
8 2020 8 36 NA
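If the zoo package is available, the same complete three-year windows can also be computed with a rolling mean (a sketch I'm adding, not part of the original answer):
library(dplyr)
library(zoo)

# Left-aligned three-year rolling mean: row i averages the values for years
# i, i+1 and i+2; the two partial windows at the end are padded with NA,
# matching value_3yr above.
df %>%
  mutate(value_3yr = rollmean(value, k = 3, align = "left", fill = NA))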
I have a data frame df as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3),
year=c(2011,2012,2013,2010,2011,2012,2013,2012,2013),
points=c(45,69,79,53,13,12,11,89,91),
result = c(2,3,5,4,6,1,2,3,4))
But I want to make df as below:
df <- data.frame(id = c(1,1,2,2,2,3),
year=c(2011,2012,2010,2011,2012,2012),
points=c(45,69,53,13,12,89),
result = c(3,5,6,1,2,4))
Here, I want to do some regression with the response variable result. Since I want to estimate the next period's result, I have to shift the response variable result by one period and leave the explanatory variable points as it is. So, for my regression setting, result is the response variable and points is the explanatory variable. In summary, I want to apply a time lag between result and points: within each id, the last row should be removed because there is no next result.
I simplified my problem for demonstration purposes. Is there any way to achieve this using R?
Tidyverse solution:
library(tidyverse)
df %>% group_by(id) %>% mutate(lead_result = lead(result)) %>% na.exclude
# A tibble: 6 x 5
# Groups: id [3]
id year points result lead_result
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2011 45 2 3
2 1 2012 69 3 5
3 2 2010 53 4 6
4 2 2011 13 6 1
5 2 2012 12 1 2
6 3 2012 89 3 4
A data.table solution:
library(data.table)
na.omit(setDT(df)[, result := shift(result, type = "lead"), by = "id"], "result")
Output
id year points result
<num> <num> <num> <num>
1: 1 2011 45 3
2: 1 2012 69 5
3: 2 2010 53 6
4: 2 2011 13 1
5: 2 2012 12 2
6: 3 2012 89 4
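For reference, the same shift can also be written in base R (my own sketch, not from either answer), starting from the original df defined in the question:
# Shift result up by one within each id, then drop each id's last row,
# where the shifted value is NA.
df$result <- ave(df$result, df$id, FUN = function(x) c(x[-1], NA))
na.omit(df)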
I have a dataframe that looks like this:
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
ID Type X2019 X2020 X2021
1 1 A 1 2 3
2 2 A 2 3 4
3 3 B 3 4 5
4 4 B 4 5 6
5 5 C 5 6 7
6 6 C 6 7 8
Now, I'm looking for some code that does the following:
1. Create a new data.frame for every row in df
2. Name each new data frame with a combination of "ID" and "Type" (A_1, A_2, ... , C_6)
The resulting new dataframes should look like this (example for A_1, A_2 and C_6):
Year Values
1 2019 1
2 2020 2
3 2021 3
Year Values
1 2019 2
2 2020 3
3 2021 4
Year Values
1 2019 6
2 2020 7
3 2021 8
I have some things that somehow complicate the code:
1. The code should work in the next few years without any changes, meaning next year the data.frame df will no longer contain the years 2019-2021, but rather 2020-2022.
2. As the data.frame df is only a minimal reproducible example, I need some kind of loop. In the "real" data, I have a lot more rows and therefore a lot more dataframes to be created.
Unfortunately, I can't give you any code, as I have absolutely no idea how I could manage that.
While researching, I found the following code that may help address the first problem with the changing years:
year <- as.numeric(format(Sys.Date(), "%Y"))
Further, I read about lists, and that it may help to work with a list in a for loop and then transform the list back into a data frame. Sorry for my limited approach; I hope anyone can give me a hint or even the solution to my problem. If you need any further information, please let me know. Thanks in advance!
A kind of similar question to mine:
Populating a data frame in R in a loop
Try this:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
df %>%
gather(Year, Values, 3:5) %>%
mutate(Year = str_sub(Year, 2)) %>%
select(ID, Year, Values) %>%
group_split(ID) # split(.$ID)
# [[1]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 1 2019 1
# 2 1 2020 2
# 3 1 2021 3
#
# [[2]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 2 2019 2
# 2 2 2020 3
# 3 2 2021 4
#
# [[3]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 3 2019 3
# 2 3 2020 4
# 3 3 2021 5
#
# [[4]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 4 2019 4
# 2 4 2020 5
# 3 4 2021 6
#
# [[5]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 5 2019 5
# 2 5 2020 6
# 3 5 2021 7
#
# [[6]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 6 2019 6
# 2 6 2020 7
# 3 6 2021 8
Data
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)
names(df) <- str_replace_all(names(df), "X", "") #remove X's from year names
df %>%
gather(Year, Values, 3:5) %>%
select(ID, Year, Values) %>%
group_split(ID)
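Neither answer attaches the A_1, ..., C_6 names the question asks for. A small follow-up sketch (my own addition; out is a helper name I'm introducing): name the list returned by group_split() from the Type and ID columns and, only if separate objects are really wanted, release them with list2env().
library(dplyr)
library(tidyr)
library(stringr)

names(df) <- str_replace_all(names(df), "X", "")  # harmless if already cleaned

out <- df %>%
  gather(Year, Values, 3:5) %>%
  select(ID, Year, Values) %>%
  group_split(ID)

# group_split() keeps the groups in sorted ID order, which here matches the row
# order of df, so the Type_ID labels line up one-to-one with the list elements.
names(out) <- paste(df$Type, df$ID, sep = "_")    # "A_1" "A_2" ... "C_6"
# list2env(out, envir = globalenv())              # optional: creates A_1, A_2, ...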
I have the following dataset:
a<-data_frame(school= c(2,2,2,2,2,3,3,3,3,3,3,3),
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
Firstly, I wanted to change all NA values to the average value of that variable for the group. So, instead of NA, it should be 2.43.
Secondly, I wanted to calculate a fourth variable, which is the ratio of the lagged value of school to the number of students.
data <-
a %>%
group_by(school) %>%
summarize(lag.value.ratio = lag(school, 1)/numberofstudents) %>% ungroup
Unfortunately, I have the following error: Error: Column lag.value.ratio must be length 1 (a summary value), not 5.
How to avoid this error and get the average group value instead of NA?
If you want the mean value of the group to replace the NAs, I calculate 2.83 to be the mean for school 3. You are getting the error because you are using summarize, which wants to collapse the result down to the number of groups that you have (in this case 2). I believe what you want is a mutate.
EDIT: I am loading the libraries used below and making sure that the lag function used is from the dplyr package.
library(dplyr)
library(tidyr)
a<-data_frame(school= c(2,2,2,2,2,3,3,3,3,3,3,3),
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
a %>%
group_by(school) %>%
mutate(numberofstudents = replace_na(numberofstudents, mean(numberofstudents, na.rm = TRUE)),
lag.value.ratio = dplyr::lag(school, 1)/numberofstudents) %>%
ungroup()
gives
# A tibble: 12 x 4
school year numberofstudents lag.value.ratio
<dbl> <dbl> <dbl> <dbl>
1 2 2011 3 NA
2 2 2011 3 0.667
3 2 2011 3 0.667
4 2 2012 2 1
5 2 2012 2 1
6 3 2011 3 NA
7 3 2011 3 1
8 3 2011 3 1
9 3 2012 2 1.5
10 3 2012 2.83 1.06
11 3 2012 2 1.5
12 3 2012 4 0.75
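The same two steps can also be written with data.table (a sketch I'm adding, not part of the answer above), assuming the question's a is available; b is just a helper name:
library(data.table)

b <- as.data.table(a)

# Replace NA with the group mean, then take lag(school) / numberofstudents,
# both within each school.
b[, numberofstudents := fifelse(is.na(numberofstudents),
                                mean(numberofstudents, na.rm = TRUE),
                                numberofstudents),
  by = school]
b[, lag.value.ratio := shift(school) / numberofstudents, by = school]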
I am looking for a way to omit the rows which do not form a complete run between two specific values, without using a for loop. All values in the year column are between 1999 and 2002; however, some stretches of rows do not include all years between these two dates. You can see the initial data as follows:
a <- data.frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
year id
1 2000 4
2 2001 6
3 2002 2
4 1999 1
5 2000 3
6 2001 5
7 2002 7
8 1999 4
9 2000 2
10 2001 0
11 2002 -1
12 1999 -3
13 2000 4
14 2001 3
The processed dataset should only include the consecutive runs of rows covering 1999 to 2002. The following data.frame is exactly what I need:
year id
1 1999 1
2 2000 3
3 2001 5
4 2002 7
5 1999 4
6 2000 2
7 2001 0
8 2002 -1
When I execute the following for loop, I get previous data.frame without any problem:
for(i in 1:which(a$year == 2002)[length(which(a$year == 2002))]){
if(a[i,1] == 1999 & a[i+3,1] == 2002){
b <- a[i:(i+3),]
}else{next}
if(!exists("d")){
d <- b
}else{
d <- rbind(d,b)
}
}
However, I have more than 1 million rows and I need to do this without using a for loop. Is there a faster way to do it?
You could try this. First we create groups of consecutive years, then we join with the full date range, then we filter out any group that is not complete. If you already have a grouping variable, this can be cut down a lot.
library(tidyverse)
df <- data_frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
df %>%
mutate(groups = cumsum(c(0,diff(year)!=1))) %>%
nest(-groups) %>%
mutate(data = map(data, .f = ~full_join(.x, data_frame(year = 1999:2002), by = "year")),
drop = map_lgl(data, ~any(is.na(.x$id)))) %>%
filter(drop == FALSE) %>%
unnest() %>%
select(-c(groups, drop))
#> # A tibble: 8 x 2
#> year id
#> <int> <dbl>
#> 1 1999 1
#> 2 2000 3
#> 3 2001 5
#> 4 2002 7
#> 5 1999 4
#> 6 2000 2
#> 7 2001 0
#> 8 2002 -1
Created on 2018-08-31 by the reprex package (v0.2.0)
There is a function that can do this automatically.
First, install the package called dplyr or tidyverse with command install.packages("dplyr") or install.packages("tidyverse").
Then, load the package with library(dplyr).
Then, use the filter function: a_filtered = filter(a, year >= 1999 & year <= 2002).
This should be fast even if there are many rows.
We could also do this by creating a grouping column based on a logical expression checking for year 1999, then filtering on whether the group's first year is 1999, its last year is 2002, and all the years in between are present for that particular grp.
library(dplyr)
a %>%
group_by(grp = cumsum(year == 1999)) %>%
filter(dplyr::first(year) == 1999,
dplyr::last(year) == 2002,
all(1999:2002 %in% year)) %>%
ungroup %>% # in case to remove the 'grp'
select(-grp)
# A tibble: 8 x 2
# year id
# <int> <dbl>
#1 1999 1
#2 2000 3
#3 2001 5
#4 2002 7
#5 1999 4
#6 2000 2
#7 2001 0
#8 2002 -1
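For data of this size, a base R sketch of the same grouping idea (my own alternative, assuming, as the dplyr answer does, that every run starts at 1999) avoids loading any packages:
# Start a new group at every 1999, then keep only the groups that contain
# all four years 1999-2002.
grp  <- cumsum(a$year == 1999)
keep <- ave(a$year, grp, FUN = function(x) all(1999:2002 %in% x))
a[keep == 1, ]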