Question for calculating the mean date only with month and day - r

I have the following dataset, and I would like to get the average date (month and day) for each phenology stage (Pheno) and station across years. It seems I can use the mean function directly on Date objects. However, if I convert the month and day to a date with as.Date, a year is added, so the average date is not independent of the year. How can I calculate the mean date based only on month and day?

You cannot compute a "mean month + day" independent of the year, since not every year has the same number of days. So you need to choose a fixed year for your computations.
Then you can:
1. Create "dummy" date objects which have the correct month and day, but the previously selected year.
2. Compute the mean of those dummies.
3. Extract the month and day from the result (drop the year).
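The three steps above can be sketched in base R as follows. The data frame obs, its values, and the reference year 2001 are made up for illustration; any fixed year works, but a leap and a non-leap reference year can give different results for dates after February.

```r
# Hypothetical sample data: month and day of one event observed in four years
obs <- data.frame(Month = c(3, 3, 4, 4), Day = c(28, 30, 2, 2))

fixed_year <- 2001  # assumed non-leap reference year

# 1. Build dummy dates that all share the same year
dummy <- as.Date(sprintf("%d-%02d-%02d", fixed_year, obs$Month, obs$Day))

# 2. Average them (mean() works on Date objects)
avg <- mean(dummy)

# 3. Drop the year, keeping month and day only
avg_month_day <- format(avg, "%m-%d")
avg_month_day
```

For grouped data the same three lines could go inside a group-wise summary, one mean per group.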

You can use the yday function from the lubridate package to convert each date into its day of the year, then average the day of the year for each Pheno. Converting the average day of the year back to a month and day depends on whether you want the date in a leap year or a non-leap year. I report both dates.
The code looks like:
library(tidyverse)
library(lubridate)
#
# calculate average day of year
#
average_doy <- df %>%
  mutate(day_of_year = yday(as.Date(paste(Year, Month, Day, sep = "-")))) %>%
  group_by(Pheno) %>%
  summarize(avg_doy = round(mean(day_of_year), 0))
# set base years
non_leap_year <- 2003
leap_year <- 2004
#
# convert day of year to average day using base years
#
averages <- average_doy %>%
  mutate(
    avg_non_leap_year_mon_day = paste(avg_doy, non_leap_year, sep = "_") %>%
      as.Date(format = "%j_%Y") %>%
      str_remove(paste0(non_leap_year, "-")),
    avg_leap_year_mon_day = paste(avg_doy, leap_year, sep = "_") %>%
      as.Date(format = "%j_%Y") %>%
      str_remove(paste0(leap_year, "-"))
  )
Using the first seven rows of your data, this gives
# A tibble: 3 x 4
  Pheno         avg_doy avg_non_leap_year_mon_day avg_leap_year_mon_day
  <chr>           <dbl> <chr>                     <chr>
1 Dormant           348 12-14                     12-13
2 Tillering         343 12-09                     12-08
3 Turning green      48 02-17                     02-17

Related

How to obtain the average of two different row ranges in a specific column?

I have the following sample dataframe. The first column is the month and the second column is the number of surveys conducted each month.
month = c(1,2,3,4,5,6,7,8,9,10,11,12)
surveys = c(4,5,3,7,3,4,4,4,6,1,1,7)
df = data.frame(month, surveys)
I want to calculate the average number of surveys from May - August, and then, the average number of surveys for the remaining months (Jan - April PLUS September - December).
How do I do this using the dplyr package?
Assuming the integers represent months, in dplyr, you could use group_by with a boolean TRUE/FALSE and find the mean with summarize:
df %>% group_by(MayAug = month %in% 5:8) %>% summarize(mean = mean(surveys))
# MayAug mean
# <lgl> <dbl>
#1 FALSE 4.25
#2 TRUE 3.75
I first create a new factor variable period with labels, then group_by period and summarise using mean:
df %>%
  mutate(period = factor(between(month, 5, 8), labels = c("Other months", "May-Aug"))) %>%
  group_by(period) %>%
  summarise(mean_surveys = mean(surveys))
# A tibble: 2 × 2
  period       mean_surveys
  <fct>               <dbl>
1 Other months         4.25
2 May-Aug              3.75
First, you need to install the dplyr package if you haven't already:
install.packages("dplyr")
Then you can load the package and use the group_by() and summarize() functions to calculate the averages:
library(dplyr)
df <- data.frame(month, surveys)
may_aug_avg <- df %>%
  filter(month >= 5 & month <= 8) %>%
  summarize(average = mean(surveys))
remaining_months_avg <- df %>%
  filter(!(month >= 5 & month <= 8)) %>%
  summarize(average = mean(surveys))
The first pipeline filters the dataframe to only the months May through August and then calculates the average number of surveys for those months. The second pipeline filters out May through August and then calculates the average number of surveys for the remaining months.
Print may_aug_avg and remaining_months_avg to see the two averages.
Hope this helps!

How can I calculate the number of nights PER MONTH between two dates in R even if they are across two months?

I have hotel booking data and there's an arrival and a departure date. I have successfully counted the days in between using difftime but I would now like to know the number of dates per month. If both arrival and departure date are within one month (like arrival on September 1st and departure on September 10th) that's not a problem of course but what do I do with bookings that are across months like arrival on September 25th and departure on October 4th or even years? In this case I would like to calculate how many days fall in September and how many days fall in October.
The overall goal is to calculate booked days per month / year.
Since you included no sample data (may I suggest you do so in future questions), I made some up to replicate what you want:
library(lubridate)
library(tidyverse)
# creating sample data
bookings <- tibble(
  pax = c("Jane", "John"),
  arrival = as.Date(c("2020-12-20", "2021-01-25")),
  departure = as.Date(c("2021-01-04", "2021-02-02"))
)
# creating a column with all booked dates to group_by and summarize
bookings <- bookings %>%
  rowwise() %>%
  mutate(booked_dates = list(seq(arrival, departure, by = "days"))) %>% # list-column holding each guest's booked dates (the departure date is included)
  unnest(cols = booked_dates) %>% # this flattens the list-column into a regular one
  mutate( # extracting the year and month
    year = year(booked_dates),
    month = month(booked_dates, label = TRUE)
  ) %>%
  group_by(year, month) %>% # grouping and summarizing
  summarise(n_days = n())
Then you have the desired output:
bookings
# A tibble: 3 × 3
# Groups:   year [2]
   year month n_days
  <dbl> <ord>  <int>
1  2020 Dec       12
2  2021 Jan       11
3  2021 Feb        2
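Note that the counts above include each departure date, so they are booked days rather than nights. If nights are what you need, one option is to stop the sequence the day before departure. Below is a minimal base R sketch of that variant, reusing the made-up dates from above and assuming every stay lasts at least one night (seq() would fail for a same-day arrival and departure).

```r
# Hypothetical bookings, mirroring the sample data above
arrival   <- as.Date(c("2020-12-20", "2021-01-25"))
departure <- as.Date(c("2021-01-04", "2021-02-02"))

# One entry per night stayed: arrival date up to the day before departure
nights <- do.call(c, Map(function(a, d) seq(a, d - 1, by = "day"),
                         arrival, departure))

# Count nights per year-month
nights_per_month <- table(format(nights, "%Y-%m"))
nights_per_month
```

The total number of entries equals the sum of departure - arrival over all bookings, which is a quick sanity check.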

Lag based on condition

I have historical monthly data and need to perform a rolling calculation. The price in each period should be compared to the price 3 years earlier, i.e. Current Price / Base Price, where the base is the date 3 years in the past. This rolls forward month by month. I am using the lag function to find the past price. It returns NA before Jan-2013, which is correct.
I want to add an additional criterion: if the minimum date of a combination of (Location, Asset, SubType) is after 2010, every price should be compared with the price at that minimum date. For example, if the minimum date is Jan-2014, all prices after Jan-2014 should be compared with Jan-2014 (static base year).
You can read data from the code below -
library(readxl)
library(httr)
GET("https://sites.google.com/site/pocketecoworld/Trend_Sale%20-%20Copy.xlsx", write_disk(tf <- tempfile(fileext = ".xlsx")))
dff <- read_excel(tf)
My code -
dff <- dff %>%
  group_by(Location, Asset, SubType) %>%
  mutate(BasePrice = lag(Price, 36),
         Index = round(100 * (Price / BasePrice), 1)) %>%
  filter(Period >= '2013-01-31')
Do you mean something like this?
library(dplyr)
dff %>%
  group_by(Location, Asset, SubType) %>%
  mutate(BasePrice = if (lubridate::year(min(Period)) > 2010)
           Price[which.min(Period)] else lag(Price, 36),
         Index = round(100 * (Price / BasePrice), 1))
If the minimum date in Period for a group is after 2010, BasePrice is the Price at that minimum Period; otherwise it is the Price from 3 years (36 rows) earlier.
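To see the two branches on small inputs, here is a base R sketch of the same rule. The helper base_price and the toy price series are hypothetical and not part of the answer above; the lag is shortened to 2 periods in the second call just to keep the example small.

```r
# Hypothetical helper: base price for one (Location, Asset, SubType) group
base_price <- function(price, period, n = 36, cutoff_year = 2010) {
  first_year <- as.integer(format(min(period), "%Y"))
  if (first_year > cutoff_year) {
    # Series starts after the cutoff: static base, repeated for every row
    rep(price[which.min(period)], length(price))
  } else {
    # Series starts early enough: rolling base, n periods back
    c(rep(NA_real_, min(n, length(price))), head(price, -n))
  }
}

# Group starting in 2014: every row is compared with the first price
late <- base_price(c(100, 110, 121),
                   as.Date(c("2014-01-31", "2014-02-28", "2014-03-31")))

# Group starting in 2009: rolling base, here lagged by 2 periods
early <- base_price(c(50, 55, 60, 66),
                    as.Date(c("2009-01-31", "2009-02-28", "2009-03-31", "2009-04-30")),
                    n = 2)
```

Like lag(), the rolling branch assumes the rows of each group are already sorted by Period.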

Calculate mean of one column for 14 rows before certain row, as identified by date for each group (year)

I would like to calculate the mean of Mean.Temp.c. over the days before a certain date, such as 1963-03-23, as shown in the date2 column in this example. This is when peak snowmelt runoff occurred in 1963 in my area. I want to know the mean temperature of the 10 days before this date (i.e., 1963-03-23). How can I do it? I have 50 years of data, and the peak snowmelt date is different each year.
example data
You can try:
library(dplyr)
df %>%
  mutate(date2 = as.Date(as.character(date2)),
         ten_day_mean = mean(Mean.Temp.c[between(date2, as.Date("1963-03-14"), as.Date("1963-03-23"))]))
In this case the desired mean would populate the whole column.
Or with data.table:
library(data.table)
setDT(df)[between(as.Date(as.character(date2)), "1963-03-14", "1963-03-23"), ten_day_mean := mean(Mean.Temp.c)]
In the latter case you'd get NA for those days that are not relevant for your date range.
Supposing date2 is a Date field and your data.frame is called x:
start_date <- as.Date("1963-03-23")-10
end_date <- as.Date("1963-03-23")
mean(x$Mean.Temp.c.[x$date2 >= start_date & x$date2 <= end_date])
Now, if you have multiple years of interest, you could wrap this code within a for loop (or [s|l]apply) taking elements from a vector of dates.
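A minimal sketch of the sapply variant follows. The data frame x, its temperatures, and the vector peaks are invented for illustration. The window here is the 10 days ending on and including each peak date (peak - 9 to peak); subtracting 10 with inclusive bounds, as in the snippet above, spans 11 calendar days, so adjust the offsets to whichever definition you need.

```r
# Hypothetical data: daily temperatures for two years
x <- data.frame(date2 = seq(as.Date("1963-01-01"), as.Date("1964-12-31"), by = "day"))
x$Mean.Temp.c. <- seq_len(nrow(x)) / 10  # made-up temperatures

# Hypothetical peak snowmelt dates, one per year
peaks <- as.Date(c("1963-03-23", "1964-04-02"))

# Mean temperature over the 10 days ending on each peak date.
# Iterating over indices keeps the Date class, which sapply() would drop.
ten_day_means <- sapply(seq_along(peaks), function(i) {
  in_window <- x$date2 >= peaks[i] - 9 & x$date2 <= peaks[i]
  mean(x$Mean.Temp.c.[in_window])
})
names(ten_day_means) <- format(peaks, "%Y")
ten_day_means
```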

Convert day of year to date assuming all years are non-leap years

I have a df with year and day of year as columns:
dat <- data.frame(year = rep(1980:2015, each = 365), day = rep(1:365,times = 36))
Please note that I am assuming 365 days in a year even if it is a leap year. I need to generate two things:
1) month
2) date
I did this:
# days of the year that belong to each month (non-leap year)
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
library(dplyr)
# this assigns each day to a month
dat1 <- dat %>% mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))))
I want to produce a third column which is a date in the format year, month, day.
However, since I am assuming all years are non-leap years, I need to ensure that my dates also reflect this, i.e. there should be no 29th Feb.
The reason I need the date is that I want to assign each day to one of the 15-day periods of the year. A year has 24 such periods:
1st Jan - 15th Jan: period 1
16th Jan - 31st Jan: period 2
1st Feb - 15th Feb: period 3
...
16th Dec - 31st Dec: period 24
I need dates to determine whether a day falls in the first half of its month (day <= 15) or the second half (day > 15). I use the following script to do this:
dat2 <- dat1 %>% mutate(twowk = month*2 - (as.numeric(format(date,"%d")) <= 15))
To run the line above I need to generate the date, hence my question.
A possible solution:
dat$dates <- as.Date(paste0(
  dat$year, '-',
  format(strptime(paste0('1981-', dat$day), '%Y-%j'), '%m-%d')
))
What this does:
With strptime(paste0('1981-',dat$day), '%Y-%j') you get the dates of a non-leap year.
By embedding that in format with '%m-%d' you extract the month and the day in the month.
paste that together with the year in the year-column and wrap that in as.Date to get a non-leap-year date.
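Putting this together with the half-month period from the question, a small base R sketch could look like the following. It uses a two-year toy data frame instead of 1980:2015, and derives month from the generated date rather than from the lookup list; both are choices made for this illustration.

```r
# Small hypothetical example: a leap year and a non-leap year, 365 days each
dat <- data.frame(year = rep(c(1980, 1981), each = 365),
                  day  = rep(1:365, times = 2))

# Month-day taken from a non-leap reference year, then pasted onto the real year
md <- format(strptime(paste0("1981-", dat$day), "%Y-%j"), "%m-%d")
dat$date  <- as.Date(paste0(dat$year, "-", md))
dat$month <- as.integer(format(dat$date, "%m"))

# Half-month period 1..24: first half of a month is odd, second half is even
dat$twowk <- dat$month * 2 - (as.integer(format(dat$date, "%d")) <= 15)
```

Because the month and day always come from 1981, day 60 maps to 1st March in every year and 29th Feb never appears.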
