How to group non-exclusive (overlapping) year ranges using a loop with dplyr - r

I'm new here, so my question may be hard to follow. I have some data with date information, and I need to compute the mean of the data over year ranges. These ranges are not mutually exclusive (they overlap): for example, my first range is 2013-2015, then 2014-2016, then 2015-2017, etc. I think it could be done with a loop and dplyr, but I don't know how to do it. I'd be very thankful if someone could help me.
Thank you,
Alejandro
What I tried was like:
for (i in Year){
Year_3=c(i, i+1, i+2)
db %>% group_by(Year_3)
#....etc
}

As you note, each observation would be used in multiple groups, so one approach could be to make copies of your data accordingly:
df <- data.frame(year = 2013:2020, value = 1:8)
library(dplyr)
df %>%
tidyr::uncount(3, .id = "grp") %>%
mutate(group_start = year - grp + 1,
group_name = paste0(group_start, "-", group_start + 2)) %>%
group_by(group_name) %>%
summarise(value = mean(value),
n = n())
# A tibble: 10 × 3
group_name value n
<chr> <dbl> <int>
1 2011-2013 1 1
2 2012-2014 1.5 2
3 2013-2015 2 3
4 2014-2016 3 3
5 2015-2017 4 3
6 2016-2018 5 3
7 2017-2019 6 3
8 2018-2020 7 3
9 2019-2021 7.5 2
10 2020-2022 8 1
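Since the question asked about a loop, here is a minimal base R sketch of that idea, assuming only complete three-year windows are wanted. The names width, starts, windows and result are illustrative and not from the original post.

```r
df <- data.frame(year = 2013:2020, value = 1:8)

width <- 3
# Start years for which a full window still fits inside the data
starts <- seq(min(df$year), max(df$year) - width + 1)

# One small data frame per window, built in a loop over the start years
windows <- lapply(starts, function(s) {
  in_window <- df$year >= s & df$year <= s + width - 1
  data.frame(group_name = paste0(s, "-", s + width - 1),
             value = mean(df$value[in_window]),
             n = sum(in_window))
})

# Stack the per-window results into one table
result <- do.call(rbind, windows)
result
```

Each year is reused in up to three windows here, which is the same effect the uncount() copies achieve, only without duplicating the rows first.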
Or we might take a more algebraic approach, noting that the sum over a three-year period starting in a given year equals the cumulative sum two years ahead minus the cumulative sum through the prior year. Dividing that difference by 3 gives the mean for the range that starts in each row's year. This approach excludes the partial ranges, which come out as NA.
df %>%
mutate(cuml = cumsum(value),
value_3yr = (lead(cuml, n = 2) - lag(cuml, default = 0)) / 3)
year value cuml value_3yr
1 2013 1 1 2
2 2014 2 3 3
3 2015 3 6 4
4 2016 4 10 5
5 2017 5 15 6
6 2018 6 21 7
7 2019 7 28 NA
8 2020 8 36 NA
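As a sanity check on the cumulative-sum identity, here is a small base R sketch that compares it with a direct window mean. It assumes the same value column as above; the names via_cumsum and direct are illustrative.

```r
value <- 1:8
cuml <- cumsum(value)
n <- length(value)

# Sum of value[i..i+2] = cuml[i+2] - cuml[i-1], treating cuml[0] as 0
via_cumsum <- (cuml[3:n] - c(0, cuml[seq_len(n - 3)])) / 3

# embed() lays out every run of 3 consecutive values as one matrix row,
# so the row means are the direct three-year means
direct <- rowMeans(embed(value, 3))

all.equal(via_cumsum, direct)
```

Both vectors only contain the complete windows, so they are two elements shorter than the input.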

Related

Panel data: Calculate group means while omitting first period from calculation

I have an issue with a certain kind of mean() calculation. I use a panel data set with two identifiers, "ID" and "year" (using the plm pkg).
I want to calculate the groupwise mean of a variable "y", but omit the first year's entry from the calculation, and then fill in the calculated mean only in the years that were used to calculate it. In other words, I want to have NA in every ID's first entry of this variable.
The panel data is unbalanced, so people come and go at different points in time. Some stay from beginning to end; for others I only have data for 3 years.
library(tidyverse)
library(plm)
ID <- c("a","a","a","a","a","b","b","b","b","c","c","c")
y <- c(9,2,5,3,3,9,1,2,3,9,2,5)
year<- c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
dt <- data.frame(ID,y,year)
dt <- pdata.frame(dt, index = c("ID","year"))
I first tried a filter over periods like so:
dt <- dt %>% group_by(ID) %>%
filter(year %in% first(year)+1:last(year)) %>%
mutate(mean.y = mean(y))
But that doesn't work, and I am not surprised to be honest but I hope you know what I want to achieve. The final result should look like this:
See how the first entry of variable y = 9 for "a-2001" is left out so that it doesn't affect the mean of individual a's other y entries: (2+5+3+3)/4 = 3.25.
I hope you can understand it. I would massively appreciate any help.
Bye
We could work with an ifelse inside mutate. It's more code, but I think it's quite readable and easy to understand what's going on.
library(tidyverse)
library(plm)
dt %>%
group_by(ID) %>%
mutate(mean.y = ifelse(year == first(year),
NA,
mean(y[year != first(year)], na.rm = TRUE)))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID y year mean.y
#> <fct> <dbl> <fct> <dbl>
#> 1 a 9 2001 NA
#> 2 a 2 2002 3.25
#> 3 a 5 2003 3.25
#> 4 a 3 2004 3.25
#> 5 a 3 2005 3.25
#> 6 b 9 2001 NA
#> 7 b 1 2002 2
#> 8 b 2 2003 2
#> 9 b 3 2004 2
#> 10 c 9 2002 NA
#> 11 c 2 2003 3.5
#> 12 c 5 2004 3.5
Created on 2022-01-23 by the reprex package (v0.3.0)
Here is a dplyr solution. You can calculate the mean of all values except the first one and then use the is.na<- replacement function to set the first element of mean.y to NA.
library(dplyr)
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L]), mean.y = `is.na<-`(mean.y, 1L))
Output
# A tibble: 12 x 4
# Groups: ID [3]
ID y year mean.y
<chr> <dbl> <dbl> <dbl>
1 a 9 2001 NA
2 a 2 2002 3.25
3 a 5 2003 3.25
4 a 3 2004 3.25
5 a 3 2005 3.25
6 b 9 2001 NA
7 b 1 2002 2
8 b 2 2003 2
9 b 3 2004 2
10 c 9 2002 NA
11 c 2 2003 3.5
12 c 5 2004 3.5
More compactly,
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L])[n():1 %/% n() + 1L])
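For comparison, a base R sketch of the same idea with ave(), assuming a plain data.frame (not a pdata.frame) and sorting by ID and year first so that "first entry" really means the earliest year. The helper name mean_without_first is made up for this example.

```r
dt <- data.frame(
  ID   = c("a","a","a","a","a","b","b","b","b","c","c","c"),
  y    = c(9,2,5,3,3,9,1,2,3,9,2,5),
  year = c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
)

# Make sure each ID's rows are in chronological order
dt <- dt[order(dt$ID, dt$year), ]

# NA for the first entry, mean of the remaining entries everywhere else
mean_without_first <- function(v) {
  c(NA_real_, rep(mean(v[-1]), length(v) - 1))
}

# ave() applies the helper within each ID and keeps the original row layout
dt$mean.y <- ave(dt$y, dt$ID, FUN = mean_without_first)
dt
```

Like the y[-1L] answers above, this relies on row order within each group, so the sort step matters if the data is not already ordered.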

Extracting data from different columns in a data frame based on row values

From each row in the data frame, df, I want to extract values in columns, as explained below and create a new data frame, output.
When Year is equal to 2003, I need values in Y_2001 and Y_2002 columns, in output data frame as Year 1 and Year 2. They are the values corresponding to two years prior to year specified in Year column. Similarly, if year equals to 2006, I need values in Y_2004 and Y_2005 in output data frame. Likewise, for all years in Year column.
> df
ID Year Y_2001 Y_2002 Y_2003 Y_2004 Y_2005
[1,] 1 2003 2 4 6 4 3
[2,] 2 2004 5 9 7 1 2
[3,] 3 2006 4 3 5 7 8
[4,] 4 2004 7 6 4 8 9
> output
ID Year Year1 Year2
[1,] 1 2003 2 4
[2,] 2 2004 9 7
[3,] 3 2006 7 8
[4,] 4 2004 6 4
Can someone please help me to create a code to get above output? Highly appreciate any support.
Here is a tidyverse solution:
The idea is to put the data into long form with pivot_longer. The values of interest are those where the "column" year is 1 or 2 years less than the "row" Year. You can filter on these differences (the filter here is explicit for 1 or 2 year differences).
An additional column is created with mutate for your column names Year1 and Year2 (note that Year1 is a difference of 2 years and Year2 is a difference of 1 year, so the difference is subtracted from 3 for this reversal). Finally, pivot_wider puts the data back in wide form.
library(tidyverse)
df %>%
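# names_transform converts the year suffix to a number; in recent tidyr versions names_ptypes only checks the type and does not convert
pivot_longer(cols = -c(ID, Year), names_to = c(".value", "Year_Sep"), names_sep = "_", names_transform = list(Year_Sep = as.integer)) %>%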
filter(Year - Year_Sep == 1 | Year - Year_Sep == 2) %>%
mutate(YearCol = paste0("Year", 3 - (Year - Year_Sep))) %>%
pivot_wider(id_cols = c(ID, Year), names_from = YearCol, values_from = Y)
Output
# A tibble: 4 x 4
ID Year Year1 Year2
<int> <int> <int> <int>
1 1 2003 2 4
2 2 2004 9 7
3 3 2006 7 8
4 4 2004 6 4
Bit of a clunky solution, but ...
i.col <- function(data, n) { # Returns the column index corresponding to the year
sapply(data$Year-n, function(x) grep(x, names(data)))
}
df$Year1 <- diag(as.matrix(df[, i.col(df, n=2)]))
df$Year2 <- diag(as.matrix(df[, i.col(df, n=1)]))
Edit:
Apparently using diag is very slow. Using cbind to access matrix elements is preferred.
df$Year1 <- df[cbind(seq_len(nrow(df)), i.col(df, n=2))] # one (row, column) index pair per row
df$Year2 <- df[cbind(seq_len(nrow(df)), i.col(df, n=1))]
df
ID Year Y_2001 Y_2002 Y_2003 Y_2004 Y_2005 Year1 Year2
1 1 2003 2 4 6 4 3 2 4
2 2 2004 5 9 7 1 2 9 7
3 3 2006 4 3 5 7 8 7 8
4 4 2004 7 6 4 8 9 6 4
Here is one way with row-wise apply assuming that you can find out the starting year (2001).
cbind(df[1:2], t(apply(df[-1], 1, function(x)
{ vals <- x[1] - 2001; x[c(vals:(vals + 1))]})))
# ID Year 1 2
#1 1 2003 2 4
#2 2 2004 9 7
#3 3 2006 7 8
#4 4 2004 6 4
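A variation on the matrix-indexing idea, sketched in base R: build the wanted column names from Year and look them up with match(), which compares whole names rather than patterns. The data frame is retyped from the question; the names col1, col2 and m are illustrative.

```r
df <- data.frame(ID = 1:4,
                 Year = c(2003, 2004, 2006, 2004),
                 Y_2001 = c(2, 5, 4, 7),
                 Y_2002 = c(4, 9, 3, 6),
                 Y_2003 = c(6, 7, 5, 4),
                 Y_2004 = c(4, 1, 7, 8),
                 Y_2005 = c(3, 2, 8, 9))

rows <- seq_len(nrow(df))
# Column positions of the two years before each row's Year
col1 <- match(paste0("Y_", df$Year - 2), names(df))
col2 <- match(paste0("Y_", df$Year - 1), names(df))

# Index the matrix with (row, column) pairs to pick one cell per row
m <- as.matrix(df)
output <- data.frame(ID = df$ID, Year = df$Year,
                     Year1 = m[cbind(rows, col1)],
                     Year2 = m[cbind(rows, col2)])
output
```

This assumes every needed Y_ column exists for every row, which holds for the example data.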

Filter for first 5 observations per group in tidyverse

I have precipitation data of several different measurement locations and would like to filter for only the first n observations per location and per group of precipitation intensity using tidyverse functions.
So far, I've grouped the data by location and by precipitation intensity.
This is a minimal example (there are several observations of each rainfall intensity per location)
df <- data.frame(location = c(rep(1, 7), rep(2, 7)),
rain = c(1:7, 1:7))
location rain
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 2 1
9 2 2
10 2 3
11 2 4
12 2 5
13 2 6
14 2 7
I thought that it should be quite easy using group_by() and filter(), but so far, I haven't found an expression that would return only the first n observations per rain group per location.
df %>% group_by(rain, location) %>% filter(???)
You can do:
df %>%
group_by(location) %>%
slice(1:5)
location rain
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
library(dplyr)
df %>%
group_by(location) %>%
filter(row_number() %in% 1:5)
Non-dplyr solutions (that also rearrange the rows)
# Base R
df[unlist(lapply(split(row.names(df), df$location), "[", 1:5)), ]
# data.table
library(data.table)
setDT(df)[, .SD[1:5], by = location]
An option in data.table
library(data.table)
setDT(df)[, .SD[seq_len(.N) <=5], location]
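The answers above group by location only. If the first n rows are wanted per location and per rain group, as the question describes, one possible base R sketch numbers the rows within each combination and keeps the small numbers. The example data here is made up so that each rain value repeats within a location; n and idx are illustrative names.

```r
# Hypothetical data: each location has three observations of rain 1 and of rain 2
df <- data.frame(location = rep(1:2, each = 6),
                 rain = rep(c(1, 1, 1, 2, 2, 2), times = 2))

n <- 2
# Running row number within each location/rain combination
idx <- ave(seq_len(nrow(df)), df$location, df$rain, FUN = seq_along)

# Keep only the first n rows of every combination, in the original order
first_n <- df[idx <= n, ]
first_n
```

In dplyr terms this corresponds to grouping by both columns before slicing, rather than by location alone.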

How to calculate a ratio with the lagged values per group?

I have the following dataset:
a<-data_frame(school= c(2,2,2,2,2,3,3,3,3,3,3,3),
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
Firstly, I want to replace every NA value with the average value of that variable for its group. So, instead of NA there should be 2.43.
Secondly, I want to calculate a fourth variable, which is the ratio of the lagged value of school to the number of students.
data <-
a %>%
group_by(school) %>%
summarize(lag.value.ratio = lag(school, 1)/numberofstudents) %>% ungroup
Unfortunately, I have the following error: Error: Column lag.value.ratio must be length 1 (a summary value), not 5.
How to avoid this error and get the average group value instead of NA?
If you want the mean value of the group to replace the NAs, I calculate 2.83 to be the mean for school 3. You are getting the error because you are using summarize, which wants to collapse the result down to the number of groups that you have (in this case 2). I believe what you want is a mutate.
EDIT: I am loading the libraries used below and making sure that the lag function that is used is from the dplyr package.
library(dplyr)
library(tidyr)
a<-tibble(school= c(2,2,2,2,2,3,3,3,3,3,3,3), # tibble() replaces the deprecated data_frame()
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
a %>%
group_by(school) %>%
mutate(numberofstudents = replace_na(numberofstudents, mean(numberofstudents, na.rm = TRUE)),
lag.value.ratio = dplyr::lag(school, 1)/numberofstudents) %>%
ungroup()
gives
# A tibble: 12 x 4
school year numberofstudents lag.value.ratio
<dbl> <dbl> <dbl> <dbl>
1 2 2011 3 NA
2 2 2011 3 0.667
3 2 2011 3 0.667
4 2 2012 2 1
5 2 2012 2 1
6 3 2011 3 NA
7 3 2011 3 1
8 3 2011 3 1
9 3 2012 2 1.5
10 3 2012 2.83 1.06
11 3 2012 2 1.5
12 3 2012 4 0.75
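The same two steps can also be sketched in base R with ave(), which, like mutate, returns one value per row rather than one per group. The helper names fill_with_mean and lag_one are made up for this example.

```r
a <- data.frame(school = c(2,2,2,2,2,3,3,3,3,3,3,3),
                year = c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
                numberofstudents = c(3,3,3,2,2,3,3,3,2,NA,2,4))

# Replace NAs by the mean of the non-missing values in the same group
fill_with_mean <- function(v) replace(v, is.na(v), mean(v, na.rm = TRUE))
# Shift a vector down by one position, padding the front with NA
lag_one <- function(v) c(NA, head(v, -1))

a$numberofstudents <- ave(a$numberofstudents, a$school, FUN = fill_with_mean)
a$lag.value.ratio <- ave(a$school, a$school, FUN = lag_one) / a$numberofstudents
a
```

Because both helpers return a vector as long as their input, the result has 12 rows, which is exactly what summarize refused to produce.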

Keep less recent duplicate row in R

So, I have a dataset with bill numbers, day, month, year, and aggregate value. There are a bunch of bill number duplicates, and I want to keep the first ones. If there is a duplicate with the same day, month, and year, I want to keep the one with the highest amount in aggregate value.
For example, if the dataset now looks like this:
Bill Number Day Month Year Ag. Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4
I want the result to look like this:
Bill Number Day Month Year Ag. Value
1 10 4 1998 10
2 23 11 2001 12
3 11 3 2005 8
I'm not sure if there is a command I can use and just introduce all these arguments or if I should do it in stages, but either way I'm not sure how to begin. I used duplicated() and unique() and then got stuck.
Thanks!
library( data.table )
dt <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
dt[ !duplicated( Bill_Number), ]
# Bill_Number Day Month Year Ag_Value
# 1: 1 10 4 1998 10
# 2: 2 23 11 2001 12
# 3: 3 11 3 2005 8
or
dt[, .SD[1], by = .(Bill_Number) ] #other approach, a bit slower
duplicated() flags the entries that are identical to an earlier one (i.e. one with a smaller subscript). Therefore, sorting your bills by date (earliest at the top) and then removing duplicated bill numbers should do the trick. Combining your day, month and year columns into a single date column might be helpful.
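A minimal base R sketch of that suggestion, assuming the example data from the question. The Date column and the names bills, sorted and result are illustrative; ties on the same date are broken by putting the highest Ag_Value first.

```r
bills <- data.frame(Bill_Number = c(1, 1, 2, 2, 3, 3, 3),
                    Day   = c(10, 11, 23, 23, 11, 12, 13),
                    Month = c(4, 4, 11, 11, 3, 3, 3),
                    Year  = c(1998, 1998, 2001, 2001, 2005, 2005, 2005),
                    Ag_Value = c(10, 14, 12, 9, 8, 9, 4))

# Combine day, month and year into one date column
bills$Date <- as.Date(ISOdate(bills$Year, bills$Month, bills$Day))

# Earliest date first within each bill; highest value first within a date
sorted <- bills[order(bills$Bill_Number, bills$Date, -bills$Ag_Value), ]

# duplicated() marks every repeat of a bill number, so negating it keeps the first row
result <- sorted[!duplicated(sorted$Bill_Number), ]
result
```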
This answer uses dplyr package and satisfies your condition: "If there is a duplicate with the same day, month, and year, I want to keep the one with the highest amount in aggregate value."
library(data.table)
library(dplyr)
myData <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
myData <- as_tibble(myData) #tibble form
sData <- arrange(myData, Bill_Number, Year, Month, Day, desc(Ag_Value)) #sort by bill and date, highest value first within a date
fData <- distinct(sData, Bill_Number, .keep_all = TRUE) #final data: first row per bill
fData
# A tibble: 3 x 5
Bill_Number Day Month Year Ag_Value
<int> <int> <int> <int> <int>
1 1 10 4 1998 10
2 2 23 11 2001 12
3 3 11 3 2005 8
I used some loops and condition checks, and tried with a test set besides the "base" set you mentioned.
library(tidyverse)
#base dataset
billNumber <- c(1,1,2,2,3,3,3)
day <- c(10,11,23,23,11,12,13)
month <- c(4,4,11,11,3,3,3)
year <- c(1998,1998,2001,2001,2005,2005,2005)
agValue <- c(10,14,12,9,8,9,4)
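#test dataset (optional: running these lines overwrites the base vectors above; the outputs shown below are for the base dataset)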
billNumber <- c(1,1,2,2,3,3,3,4,4,4)
day <- c(10,11,23,23,11,12,13,15,15,15)
month <- c(4,4,11,11,3,3,3,6,6,6)
year <- c(1998,1998,2001,2001,2005,2005,2005,2020,2020,2020)
agValue <- c(10,14,9,12,8,9,4,13,15,8)
#build the dataset
df <- data.frame(billNumber,day,month,year,agValue)
#add a couple of working columns
df_full <- df %>%
mutate(
concat = paste(df$billNumber,df$day,df$month,df$year,sep="-"),
flag = ""
)
df_full
billNumber day month year agValue concat flag
1 1 10 4 1998 10 1-10-4-1998
2 1 11 4 1998 14 1-11-4-1998
3 2 23 11 2001 12 2-23-11-2001
4 2 23 11 2001 9 2-23-11-2001
5 3 11 3 2005 8 3-11-3-2005
6 3 12 3 2005 9 3-12-3-2005
7 3 13 3 2005 4 3-13-3-2005
#separate records with one/multiple occurrences as defined in the question
row_single <- df_full %>% count(concat) %>% filter(n == 1)
df_full_single <- df_full[df_full$concat %in% row_single$concat,]
row_multi <- df_full %>% count(concat) %>% filter(n > 1)
df_full_multi <- df_full[df_full$concat %in% row_multi$concat,]
#flag the rows with a single occurrence
df_full_single[1,]$flag = "Y"
for (row in 2:nrow(df_full_single)) {
if (df_full_single[row,]$billNumber == df_full_single[row-1,]$billNumber) {
df_full_single[row,]$flag = "N"
} else
{
df_full_single[row,]$flag = "Y"
}
}
df_full_single
#flag the rows with multiple occurrences
df_full_multi[1,]$flag = "Y"
for (row in 2:nrow(df_full_multi)) {
if (
(df_full_multi[row,]$billNumber == df_full_multi[row-1,]$billNumber) &
(df_full_multi[row,]$agValue > df_full_multi[row-1,]$agValue)
) {
df_full_multi[row,]$flag = "Y"
df_full_multi[row-1,]$flag = "N"
} else
{
df_full_multi[row,]$flag = "N"
}
}
df_full_multi
#rebuild full dataset and retrieve the desired output
df_full_final <- rbind(df_full_single,df_full_multi)
df_full_final <- df_full_final[df_full_final$flag == "Y",c(1,2,3,4,5)]
df_full_final <- df_full_final[order(df_full_final$billNumber),]
df_full_final
billNumber day month year agValue
1 1 10 4 1998 10
3 2 23 11 2001 12
5 3 11 3 2005 8
