Calculate seasonal mean with a n years time series with monthly data - r

I have a dataframe df with 3 columns (months, year, value).
>head(df)
months year value
January 01 23875.00
February 01 15343.25
March 01 9584.25
April 01 19026.33
May 01 26324.00
June 01 31228.00
Every 12 rows (starting from the first January), the year increases to 02, 03, 04, and so on up to 16.
I need to calculate seasonal means i.e.
For Summer mean of (December,January,February); for Autumn mean of (March,April,May), for Winter mean of (June,July,August) and for Spring mean of (September,October,November).
Then make a new dataframe with seasons, year, and the mean value of them to get something like this.
>head(seasdf)
season year value
DJF 01
MAM 01
JJA 01
SON 01
DJF 02
MAM 02
With all the years until 16. I searched for similar questions with this kind of dataframe, but I couldn't find a way to do it.
Sorry for this noob question.

We assume that adjacent months in the same quarter should all have the same quarter name and year and that quarters are named after the year in which the quarter ends. For example, Dec 2001, Jan 2002 and Feb 2002 would all be part of the DJF 2002 quarter.
First convert the year and month to a "yearmon" class variable, ym, and then add 1/12 to push the months forward one. This is based on the fact that yearmon variables are stored as the year + 0 for Jan, 1/12 for Feb, 2/12 for Mar, etc. Then convert that to a "yearqtr" class variable, yq. Now aggregate value by yq noting that yearqtr variables sort correctly so that 2001 Q1 will come before 2001 Q2, etc. Finally reconstitute the aggregated data frame with the columns shown in the question.
library(zoo) # yearmon and yearqtr classes
ym <- as.yearmon(paste(DF$months, DF$year), "%B %y")
yq <- as.yearqtr(ym + 1/12)
Ag <- aggregate(value ~ yq, DF, mean)
season.name <- c("DJF", "MAM", "JJA", "SON")
with(Ag, data.frame(year = as.integer(yq), season = season.name[cycle(yq)], value))
giving:
year season value
1 2001 DJF 19609.12
2 2001 MAM 18311.53
3 2001 JJA 31228.00
If the exact layout shown in the question is not important then we could omit the last two lines of code above and just use Ag
> Ag
yq value
1 2001 Q1 19609.12
2 2001 Q2 18311.53
3 2001 Q3 31228.00
Note: The input DF in reproducible form was assumed to be:
DF <- structure(list(months = c("January", "February", "March", "April",
"May", "June"), year = c("01", "01", "01", "01", "01", "01"),
value = c(23875, 15343.25, 9584.25, 19026.33, 26324, 31228
)), .Names = c("months", "year", "value"), class = "data.frame", row.names = c(NA, -6L))
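The one-month shift described in this answer can also be checked with plain integer arithmetic. The following is only an illustrative sketch and not how zoo implements it; the variable names (k, q_year, quarter) are made up for this example.

```r
# Four consecutive months straddling a year boundary
year  <- c(2001L, 2002L, 2002L, 2002L)
month <- c(12L, 1L, 2L, 3L)          # Dec, Jan, Feb, Mar

# Count months from year 0, then push every month forward by one
k <- year * 12L + (month - 1L) + 1L

q_year  <- k %/% 12L                 # year the shifted month falls in
quarter <- (k %% 12L) %/% 3L + 1L    # 1 = DJF, 2 = MAM, 3 = JJA, 4 = SON

data.frame(year, month, q_year, quarter)
```

December 2001 lands in quarter 1 of 2002 together with January and February 2002, while March 2002 starts quarter 2.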

Your months variable appears to hold standard English month names, so you can match it against R's built-in month.name vector to get the month as a number (January is 1, February is 2, etc.). Taking that number modulo 12 and then integer-dividing by 3 gives a season index from 0 to 3 (December wraps around to 0 and so joins January and February), which serves as a second grouping variable alongside year. Grouping by year and season and taking the mean is then straightforward:
library(dplyr)
df %>% group_by(season = match(months, month.name) %% 12 %/% 3, year) %>%
summarise(value = mean(value)) %>% ungroup() %>%
# optional: convert the season from number to meaningful labels which could also be
# summer, autumn, winter and spring
mutate(season = factor(season, levels = c(0,1,2,3),
labels = c("DJF", "MAM", "JJA", "SON")))
# A tibble: 3 × 3
# season year value
# <fctr> <int> <dbl>
#1 DJF 1 19609.12
#2 MAM 1 18311.53
#3 JJA 1 31228.00
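To see how the grouping expression maps months to seasons, it can help to evaluate it for all twelve months. This is a small base R check; the names m and idx are chosen just for this illustration.

```r
# Month numbers 1..12, in the same order as month.name
m <- seq_along(month.name)

# Same arithmetic as in the group_by() call: wrap December to 0, then bin by 3
idx <- m %% 12 %/% 3

seasons <- c("DJF", "MAM", "JJA", "SON")
lookup <- data.frame(month = month.name, index = idx, season = seasons[idx + 1])
lookup
```

Both operators have the same precedence and are evaluated left to right, so the expression is read as (m %% 12) %/% 3.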
If December needs to be rolled to the next year Summer, you can add one to the year variable when months == "December":
df %>% group_by(season = match(months, month.name) %% 12 %/% 3, year = ifelse(months == "December", year + 1, year)) %>%
summarise(value = mean(value)) %>% ungroup() %>%
# optional: convert the season from number to meaningful labels which could also be
# summer, autumn, winter and spring
mutate(season = factor(season, levels = c(0,1,2,3),
labels = c("DJF", "MAM", "JJA", "SON")))
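Note that year + 1 only works when year is numeric. If the year column is stored as text such as "01" (an assumption about the asker's data, based on how it prints), it has to be converted first. A minimal base R sketch with made-up values:

```r
# Toy data: year stored as text, one December followed by three later months
df <- data.frame(months = c("December", "January", "February", "March"),
                 year   = c("01", "02", "02", "02"),
                 value  = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

m <- match(df$months, month.name)

# Season label from the month number (December wraps around to DJF)
df$season <- factor(m %% 12 %/% 3, levels = 0:3,
                    labels = c("DJF", "MAM", "JJA", "SON"))

# Convert the text year to integer and roll December into the next year
df$year <- as.integer(df$year) + (m == 12L)

seasdf <- aggregate(value ~ season + year, data = df, FUN = mean)
seasdf
```

Here December 01 is averaged with January and February 02, giving a DJF mean of 20 for year 2.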

Related

Separate day of year into month and day of month columns in R

For simplicity, I have data that has two columns. One column is the year (year) and the other is the number of days (yday). So year with a value of 1980 and yday with a value of 1 is January 1, 1980. Year with a value of 1980 and yday with a value of 365 is December 31, 1980. How do I separate the single yday column into two columns; a month column and the day of the month column? For example, 365 would be 12 for the month and 31 for the day. Thanks in advance.
Create a Date from the yday + year columns, then extract the day of month, and month separately:
dat <- data.frame(year=1980, yday=c(1,365))
# year yday
#1 1980 1
#2 1980 365
dat[c("month","day")] <- lapply(c("%m","%d"), \(x) {
d <- as.Date(paste(dat$year, dat$yday), format="%Y %j")
as.integer(format(d, x))
})
# year yday month day
#1 1980 1 1 1
#2 1980 365 12 30
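Day 365 comes out as December 30 here because 1980 is a leap year. The sketch below shows the same conversion done by adding days to January 1 instead of parsing "%j"; the helper name month_day is hypothetical and used only for illustration.

```r
month_day <- function(year, yday) {
  # Day 1 is January 1, so add (yday - 1) days to the start of the year
  d <- as.Date(paste0(year, "-01-01")) + (yday - 1)
  data.frame(year, yday,
             month = as.integer(format(d, "%m")),
             day   = as.integer(format(d, "%d")))
}

# Same day of year in a leap year (1980) and a non-leap year (1981)
res <- month_day(c(1980, 1981), c(365, 365))
res
```

In 1980 day 365 is December 30 (day 366 is December 31), whereas in 1981 day 365 is December 31.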

column value changes in R group find date difference

Suppose I have the following DataFrame
ID Result Date
1 Pos 4th Jan, 2020
1 Pos 20th Jan, 2020
1 Neg 21st Jan, 2020
2 Pos 5th Jan, 2020
2 Neg 7th Jan, 2020
I want to record the delta (between days) by ID when the result changes from positive to negative ONLY.
so I would like an answer for this test case as:
ID Result Date Delta Time_Spent_Pos
1 Pos 4th Jan, 2020 0 17
1 Pos 20th Jan, 2020 16 17
1 Neg 21st Jan, 2020 17 17
2 Pos 5th Jan, 2020 0 2
2 Neg 7th Jan, 2020 2 2
Where I plan to use the time_spent_pos column for further analysis.
Further Testing Case
I also would like to point out the data could look like
ID Result Date
1 Neg 12th Dec, 2019
1 Pos 4th Jan, 2020
1 Pos 20th Jan, 2020
1 Neg 21st Jan, 2020
2 Neg 2nd Jan, 2020
2 Pos 5th Jan, 2020
2 Neg 7th Jan, 2020
In which case I would still like the old output. So it is important to find the first time an ID was positive (Record that forever) -> then find the first time it changed to negative. And push the delta to a column.
Any tips + help is appreciated.
You can write a function to do this calculation. Get the first date where result = 'Pos' and subtract it from the immediate next 'Neg' date.
get_delta <- function(res, date) {
d1 <- date[match('Pos', res)]
as.integer(min(date[res == 'Neg' & date > d1]) - d1)
}
library(dplyr)
df %>%
mutate(Date = lubridate::dmy(Date)) %>%
group_by(ID) %>%
mutate(Time_Spent_Pos = get_delta(Result, Date)) %>%
ungroup
# ID Result Date Time_Spent_Pos
# <int> <chr> <date> <int>
#1 1 Pos 2020-01-04 17
#2 1 Pos 2020-01-20 17
#3 1 Neg 2020-01-21 17
#4 2 Pos 2020-01-05 2
#5 2 Neg 2020-01-07 2
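The same function can be tried on the "further testing case" from the question, where each ID starts with a negative result. The sketch below is a base R variant under two assumptions that are not in the original answer: dates are already in ISO format, and NA is returned when an ID has no positive result or no later negative one.

```r
get_delta <- function(res, date) {
  d1 <- date[match("Pos", res)]                      # first positive date
  later_neg <- date[which(res == "Neg" & date > d1)] # negatives after it
  if (is.na(d1) || length(later_neg) == 0) return(NA_integer_)
  as.integer(min(later_neg) - d1)
}

df <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2),
  Result = c("Neg", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg"),
  Date   = as.Date(c("2019-12-12", "2020-01-04", "2020-01-20", "2020-01-21",
                     "2020-01-02", "2020-01-05", "2020-01-07"))
)

# One delta per ID, then copied back onto every row of that ID
delta_by_id <- sapply(split(df, df$ID), function(g) get_delta(g$Result, g$Date))
df$Time_Spent_Pos <- unname(delta_by_id[as.character(df$ID)])
df
```

The leading negative rows are ignored because only negatives dated after the first positive are considered, so the deltas stay at 17 and 2.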
A simple idea is to create two separate columns, one holding the date whenever the result is positive and another whenever it is negative, then group by ID and take the minimum of the positive dates and the maximum of the negative dates.
Here is how you can do it:
# Reading required libraries
library(dplyr)
library(lubridate)
# Create sample dataframes
df <-
data.frame(ID = c(1,1,1,1,2,2,2),
Result = c("Neg", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg"),
Date = c("12th Dec, 2019", "4th Jan, 2020", "20th Jan, 2020",
"21st Jan, 2020", "2nd Jan, 2020", "5th Jan, 2020",
"7th Jan, 2020"))
df %>%
# Convert date into yyyy-mm-dd to easily manipulate it
mutate(Date = dmy(Date),
# In case positive/negative then create a column with value
POSITIVE = as.Date(ifelse(Result == "Pos", Date, NA), origin = lubridate::origin),
NEGATIVE = as.Date(ifelse(Result == "Neg", Date, NA), origin = lubridate::origin)) %>%
# Grouping by ID
group_by(ID) %>%
# Getting first positive and last negative
mutate(POSITIVE = min(POSITIVE, na.rm = TRUE),
NEGATIVE = max(NEGATIVE, na.rm = TRUE)) %>%
ungroup() %>%
# Calculating difference between positive and negative
mutate(Time_Spent_Pos = NEGATIVE - POSITIVE)

Dataframe format transforming in R: how to with dates to years (each ID new row per year)

I have to transform my dataframe from the current format to the new format (see image or structure below), and I have no idea how to accomplish that. I want a year for each ID, from 2013-2018 (so each ID has 6 rows, one for every year). The dates are the date they started living at that address (entry date) and the date they left that address (end date). So each ID and year gives the zipcode and city where they lived. The place the ID lived (for each year) should be where they lived the longest that year. I've already set the end date to 31-12-2018 if they still live there (here shown with NA). Below are a picture and the first 3 rows. Hopefully you guys can help me out!
Current format:
ID (1, 1, 2)
ZIPCODE (1234AB, 5678CD, 9012EF)
CITY (NEWYORK, LA, MIAMI)
ENTRY_DATE (2-1-2014, 13-3-2017, 10-11-2011)
END_DATE (13-5-2017, 21-12-2018, 6-9-2017)
New format:
ID (1, 1, 1, 1, 1, 1, 2)
YEAR (2013, 2014, 2015, 2016, 2017, 2018, 2013)
ZIPCODE (NA, 1234AB, 1234AB, 1234AB, 5678CD, 5678CD, 9012EF)
CITY (NA, NEWYORK, NEWYORK, NEWYORK, LA, LA, MIAMI)
Here is one approach.
First, create date intervals for each location from start to end dates. Using map2 and unnest you will create additional rows for each year.
Since you wish to keep the location with the greatest number of days in each calendar year, you could look at the overlap between 2 intervals: one interval is the calendar year, and the second is the interval from ENTRY_DATE to END_DATE. For each year, you can filter by max(WEEKS) (or, to ensure a single address per year, arrange in descending order by WEEKS and slice(1); with dplyr 1.0.0 or later, consider slice_max). This keeps the row with the longest overlap between the two intervals.
The final complete will ensure you have rows for all years between 2013-2018.
library(tidyverse)
library(lubridate)
df %>%
mutate(ENTRY_END_INT = interval(ENTRY_DATE, END_DATE),
YEAR = map2(year(ENTRY_DATE), year(END_DATE), seq)) %>%
unnest(YEAR) %>%
mutate(YEAR_INT = interval(as.Date(paste0(YEAR, '-01-01')), as.Date(paste0(YEAR, '-12-31'))),
WEEKS = as.duration(intersect(ENTRY_END_INT, YEAR_INT))) %>%
group_by(ID, YEAR) %>%
arrange(desc(WEEKS)) %>%
slice(1) %>%
group_by(ID) %>%
complete(YEAR = seq(2013, 2018, 1)) %>%
arrange(ID, YEAR) %>%
select(-c(ENTRY_DATE, END_DATE, ENTRY_END_INT, YEAR_INT, WEEKS))
Output
# A tibble: 14 x 4
# Groups: ID [2]
ID YEAR ZIPCODE CITY
<dbl> <dbl> <chr> <chr>
1 1 2013 NA NA
2 1 2014 1234AB NEWYORK
3 1 2015 1234AB NEWYORK
4 1 2016 1234AB NEWYORK
5 1 2017 5678CD LA
6 1 2018 5678CD LA
7 2 2011 9012EF MIAMI
8 2 2012 9012EF MIAMI
9 2 2013 9012EF MIAMI
10 2 2014 9012EF MIAMI
11 2 2015 9012EF MIAMI
12 2 2016 9012EF MIAMI
13 2 2017 9012EF MIAMI
14 2 2018 NA NA
Data
df <- structure(list(ID = c(1, 1, 2), ZIPCODE = c("1234AB", "5678CD",
"9012EF"), CITY = c("NEWYORK", "LA", "MIAMI"), ENTRY_DATE = structure(c(16072,
17238, 15288), class = "Date"), END_DATE = structure(c(17299,
17896, 17415), class = "Date")), class = "data.frame", row.names = c(NA,
-3L))
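The overlap between a stay and a calendar year can also be computed by hand, which makes it easy to check which address wins in a given year. This is a base R sketch that does not use lubridate; the helper overlap_days is hypothetical, and the dates are the ones listed in the question text for ID 1.

```r
overlap_days <- function(entry, end, year) {
  year_start <- as.numeric(as.Date(paste0(year, "-01-01")))
  year_end   <- as.numeric(as.Date(paste0(year, "-12-31")))
  first <- pmax(as.numeric(entry), year_start)  # later of the two starts
  last  <- pmin(as.numeric(end), year_end)      # earlier of the two ends
  pmax(last - first, 0)                         # no overlap counts as 0 days
}

# The two addresses of ID 1
entry <- as.Date(c("2014-01-02", "2017-03-13"))
end   <- as.Date(c("2017-05-13", "2018-12-21"))

overlap_2017 <- overlap_days(entry, end, 2017)
overlap_2013 <- overlap_days(entry, end, 2013)
overlap_2017
```

For 2017 the second address (LA) has the longer overlap, which is why it is the row kept for that year; for 2013 neither address overlaps, so the year is filled with NA by complete.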

Generate which 15-days period a day fall into

I have a data frame with year and day
df <- data.frame(year = rep(1980:2015,each = 365), day = 1:365)
Please note that I only need 365 days a year, i.e. I am assuming each year has 365 days.
I want to generate two data:
1) which month each day falls in
2) which 15-day period each day falls in. A year will have 24 15-day periods, i.e. each month is split into two halves, something like this:
Jan: 1st - 15th: 1st Quarter
Jan: 16th- 31st: 2nd Quarter
Feb: 1st - 15th: 3rd Quarter
Feb: 16th - 28th: 4th Quarter
March: 1st - 15th: 5th Quarter
.
.
December: 16th - 31st: 24th Quarter
My final data should look like this
Year Day Quarter Month
1980 1 1 1
1980 2 1 1
.
.
1980 365 24 12
.
.
2015 1 1 1
2015 2 1 1
.
.
2015 365 24 12
I can generate the month using this:
library(dplyr)
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
df1 <- df %>% group_by(year) %>%
mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))))
But I do not know how to generate the 15-days period?
Since Feb 29th should not be included in leap years, we can generate a complete sequence of dates and then remove every Feb 29th. Take the month from the date. Then calculate the half-month period as 2 times the month number, minus 1 if the day of the month (%d) is <= 15.
# complete sequence of dates
# use two years in this example, with 2012 being a leap year
dates <- data.frame(date = seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "1 day"))
# remove Feb 29th in leap years
d <- dates[format(dates$date, "%m-%d") != "02-29", , drop = FALSE]
# create month (month() comes from the lubridate package)
d$month <- lubridate::month(d$date)
# create two-week number
d$twoweek <- d$month * 2 - (as.numeric(format(d$date, "%d")) <= 15)
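Applied to the data frame from the question, which stores only a year and a day number from 1 to 365, the same idea could look like the sketch below. It assumes every year is treated as a 365-day year, so each day number is mapped through a fixed non-leap reference year (2001 is an arbitrary choice); the column name period is made up for this example.

```r
# Two years are enough to illustrate; the question uses 1980:2015
df <- data.frame(year = rep(1980:1981, each = 365), day = 1:365)

# Map day-of-year onto a non-leap reference year, so Feb 29 never appears
ref <- as.Date(df$day - 1, origin = "2001-01-01")

df$month  <- as.integer(format(ref, "%m"))
# First half of the month gets 2 * month - 1, second half gets 2 * month
df$period <- 2L * df$month - (as.integer(format(ref, "%d")) <= 15L)

head(df)
```

Day 1 falls in period 1, day 16 in period 2, and day 365 in period 24 of month 12.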

Increasing Year count every January

I have a data frame that looks similar to this:
I know the starting year of the first obs (1963). The obs are in exact chronological order, so the next instance of "Jan" (obs 13) indicates that the year is 1964. Is there a way to create a column "Year" that increases the current year every time the next occurrence of "Jan" happens?
In the pic, it would be "1964" and then when "Jan" happens again, 1965 and so on....
There is an answer to a similar problem that was suggested but it doesn't quite do it and here it is:
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10, 15),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec", "Dec"))
## Convert each month-day combo to its corresponding "julian date"
datestring <- paste("1963", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 1963 - cumsum(diff(c(julian[2], julian))>0)
But this won't do it: Because the last two observations have the same month ("Dec" and then another "Dec") the count for year increases:
The last observation should still read "1960" NOT "1959".
The OP has requested to complete the years in ascending order starting in 1963.
The approach below works without date conversion and dummy dates and can be amended to work with fiscal years (see here).
df$year <- 1963 + cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) < 0))
df
day month year
1 24 Jun 1963
2 21 Mar 1964
3 20 Jan 1965
4 10 Dec 1965
5 20 Jun 1966
6 20 Jan 1967
7 10 Dec 1967
8 15 Dec 1967
Note that there is a question which seems to be similar but was asking to complete years in descending order. The solution there needs to be changed in two places to work here.
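If only the month is known, as in the question's description of the data, the same cumulative-sum idea can be applied to the month number alone. This is a sketch with made-up data; it assumes the observations are in chronological order and that no full year is skipped between consecutive rows.

```r
month <- c("Nov", "Dec", "Jan", "Feb", "Dec", "Dec", "Jan")
m <- match(month, month.abb)

# A new year starts wherever the month number drops compared to the previous row
year <- 1963L + cumsum(c(0L, diff(m) < 0))

data.frame(month, year)
```

Two consecutive "Dec" rows give a difference of 0, which is not negative, so the year is not incremented between them.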

Resources