Assign day of the day year to a month - r

Sample data
df <- data.frame(ID1 = rep(1:1000, each= 5*365), year = rep(rep(2000:2004, each = 365), times = 1000),
day = rep(1:365, times = 1000*5),
x= runif(365*1000*5))
This data contains a column day which is the day of the year. I need to produce two columns:
Month column: a column of month (which month does the day belong)
Biweek column: which biweek does a day belong to. There are 24 biweek in a year. All days <= 15 in a month is the first biweek and > 15 is second biweek.
For e.g.
15th Jan is Biweek 1,
16-31 Jan is biweek 2,
1-15 Feb is biweek 3 and
16-28 Feb is biweek 4 and so on.
For sake of simplicity, I am assuming all the years are non-leap years.
Here's the code I have (with help from RS as well) that creates the two columns.
# create a vector of days for each month
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
library(dplyr)
ptm <- proc.time()
df <- df %>% mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # this assigns each day to a month
date = as.Date(paste0(year,'-',format(strptime(paste0('1981-',day), '%Y-%j'), '%m-%d'))), # this creates a vector of dates for a non-leap year
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
dplyr::select(-date)
proc.time() - ptm
user system elapsed
121.71 0.31 122.43
My issue is that of the time it takes to run this script and I am looking for a solution that is relatively faster
EDIT: To be clear, I have assumed all years must have 365 days. In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (last day of Feb is 60 but I want the last day to be 59) and therefore Dec has only 30 days (Dec start with 336 though it should start with 335). I hope this is clear. My solution addresses this issue but takes lot of time to run.

Here is a solution using lubridate extractors and replacement functions as mentioned by Frank in a comment. The key ones are yday<-, mday() and month(), which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. 8 sec running time seems pretty acceptable to me, though I'm sure some optimising could shave that down though there might be a loss of generality.
Note also the use of case_when to ensure the correct numbering of days after Feb 29 on a leap year.
EDIT: Here is a significantly faster solution. You can just get the mapping of DOYs to months and biweeks for a single year, and then left_join to the main table. 0.36s running time, since you no longer have to repetitively create the date. We also bypass having to use case_when, since the join will take care of the missing days. See that Day 59 of year 2000 is February and Day 60 is March, as requested.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
tbl <- tibble(
ID1 = rep(1:1000, each= 5*365),
year = rep(rep(2000:2004, each = 365), times = 1000),
day = rep(1:365, times = 1000*5),
x= runif(365*1000*5)
)
tictoc::tic("")
doys <- tibble(
day = rep(1:365),
date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
month = month(date),
biweek = case_when(
mday(date) <= 15 ~ (month * 2) - 1,
mday(date) > 15 ~ month * 2
)
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
tictoc::toc()
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 1 0.331 1. 1.
#> 2 1 2000 2 0.284 1. 1.
#> 3 1 2000 3 0.627 1. 1.
#> 4 1 2000 4 0.762 1. 1.
#> 5 1 2000 5 0.460 1. 1.
#> 6 1 2000 6 0.500 1. 1.
#> 7 1 2000 7 0.340 1. 1.
#> 8 1 2000 8 0.952 1. 1.
#> 9 1 2000 9 0.663 1. 1.
#> 10 1 2000 10 0.385 1. 1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 55 0.127 2. 4.
#> 2 1 2000 56 0.779 2. 4.
#> 3 1 2000 57 0.625 2. 4.
#> 4 1 2000 58 0.245 2. 4.
#> 5 1 2000 59 0.640 2. 4.
#> 6 1 2000 60 0.423 3. 5.
#> 7 1 2000 61 0.439 3. 5.
#> 8 1 2000 62 0.105 3. 5.
#> 9 1 2000 63 0.218 3. 5.
#> 10 1 2000 64 0.668 3. 5.
#> 11 1 2000 65 0.589 3. 5.
Created on 2018-04-06 by the reprex package (v0.2.0).

You can speed this up almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date.
ptm <- proc.time()
df <- df %>% mutate(
date = as.Date(paste0(year, "-", day), format = "%Y-%j"), # this creates a vector of dates
month = as.numeric(format(date, "%m")), # extract month
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
dplyr::select(-date)
proc.time() - ptm
# user system elapsed
# 18.58 0.13 18.75
Versus original version in the question
# user system elapsed
# 117.67 0.15 118.45

Filtered for one year. I think it solves the leap issue you described, unless I'm not clear on what you're saying. Last day of Feb is 59 in the df in my result below, but only because day is 0 indexed.
df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
day = day - 1, # dates are 0 indexed
date = as.Date(day, origin = "2000-01-01"),
month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
bis = month * 2 - (as.numeric(format(date, "%d")) <= 15)
)
proc.time() - ptm
user system elapsed
0.8 0.0 0.8
One year is 0.2 of the whole df, so times reflect that.

Related

Six-month peak-season running average

I'm trying to implement this:
The recommendation is a peak season ozone AQG level of 60 μg/m3
(the average of daily maximum 8-hour mean ozone concentrations).
The peak season is defined as the six consecutive months of the year
with the highest six-month running-average ozone concentration.
In regions away from the equator, this period will typically be in the
warm season within a single calendar year (northern hemisphere)
or spanning two calendar years (southern hemisphere). Close to
the equator, such clear seasonal patterns may not be obvious, but a
running-average six-month peak season will usually be identifiable
from existing monitoring or modelling data.
I have:
# A tibble: 300 × 2
date value
<dttm> <dbl>
1 1997-01-01 00:00:00 NA
2 1997-02-01 00:00:00 NA
3 1997-03-01 00:00:00 NA
4 1997-04-01 00:00:00 30.2
5 1997-05-01 00:00:00 20.9
6 1997-06-01 00:00:00 10.1
7 1997-07-01 00:00:00 9.40
8 1997-08-01 00:00:00 22.4
9 1997-09-01 00:00:00 26.2
10 1997-10-01 00:00:00 32.9
# … with 290 more rows
Every year is complete (with or without NA). I found the peaks by "findpeaks" from pracma package, and get:
peaks = findpeaks(mda8_omit$value, minpeakdistance = 6,
minpeakheight = mean(mda8_omit$value))
How do i optimize to get the best six month by peak? For northern hemisphere is easier because the peaks is within a yer (summer) but in the southern hemisphere is split in two years and peaks may change depending on latitude. Any ideas on how to continue?
Assuming that
we only use windows with 6 consecutive months of data
the year that a window falls is determined by the last month of the window
we compare all such windows, at most 12, within each calendar year
Calculate the rolling mean and then grouping by year take the row with the largest rolling mean within year. This row is the last month of the 6 month window. The input is shown reproducibly in the Note at the end.
library(dplyr)
library(zoo)
DF %>%
mutate(date = as.yearmon(date),
peakmean = rollapplyr(value, 6, mean, fill = NA)) %>%
group_by(year = as.integer(date)) %>%
slice_max(peakmean) %>%
ungroup %>%
select(-year)
## # A tibble: 1 × 3
## date value peakmean
## <yearmon> <dbl> <dbl>
## 1 Oct 1997 32.9 20.3
Note
Lines <- "date value
1 1997-01-01T00:00:00 NA
2 1997-02-01T00:00:00 NA
3 1997-03-01T00:00:00 NA
4 1997-04-01T00:00:00 30.2
5 1997-05-01T00:00:00 20.9
6 1997-06-01T00:00:00 10.1
7 1997-07-01T00:00:00 9.40
8 1997-08-01T00:00:00 22.4
9 1997-09-01T00:00:00 26.2
10 1997-10-01T00:00:00 32.9"
DF <- read.table(text = Lines)

summing based on conditions from two dataframes and dealing with dates

I have two dataframes, one with climate data for every location and date across 4 years. The other data frame has a date for each day an animal was trapped at a site. I am trying to calculate the mean of each climate variable based on a specific amount of time before the day the animal was trapped (time length depends on variable in question).
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
> climate
site precip humid date
1 1 0.1 1 6/13/2020
2 1 0.2 1 6/12/2020
3 1 0.1 3 6/11/2020
4 1 0.1 1 6/14/2020
5 2 0.5 2 6/13/2020
6 2 0.2 3 6/12/2020
7 2 0.3 3 6/11/2020
8 2 0.1 1 6/14/2020
9 1 0.2 1 2/13/2019
10 1 0.1 3 2/14/2019
11 1 0.1 1 2/15/2019
12 1 0.5 2 2/16/2019
> trap
site date
1 1 7/1/2020
2 2 7/1/2020
3 3 7/2/2020
4 3 7/4/2020
I want to calculate the mean humid 18-20 days before the date written in the trap dataframe. So essentially what is the mean humid between 6/11/2020 and 6/13/2020 according to the climate data.frame for animals trapped on 7/1/2020. So for site 1 that would be: 1.667 and site 2 that would be 2.67.
I also want to calculate the sum of precipitation 497-500 days before the date written in the trap dataframe. So I would need to calculate the sum (total) precip between 2/13/2019 and 2/16/2019 for an animal trapped on 7/1/2020 at each site. So for site 1 precip would be 0.9.
I know how to create new columns in the trap data frame for mean precip and sum humid but I'm not sure where to start in terms of coding so that each value is calculated as described above and the data that corresponds to the correct date is used for the large dataset that contains many different trap dates.
Thank you very much, hopefully I am being clear in my description.
I have a solution using functions from the tidyverse. It is always useful to convert date variables to the class date. With this class, you can make calculations. Note, that I renamed the date column in the trap data to trap_date. See comments for more details:
library(tidyverse)
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), trap_date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
# merge data
data <- merge(climate, trap, by="site")
> head(data)
site precip humid date trap_date
1 1 0.1 1 2020-06-13 2020-07-01
2 1 0.2 1 2020-06-12 2020-07-01
3 1 0.1 3 2020-06-11 2020-07-01
4 1 0.1 1 2020-06-14 2020-07-01
5 1 0.2 1 2019-02-13 2020-07-01
6 1 0.1 3 2019-02-14 2020-07-01
# parse dates to class 'date'; enables calculations
data <- data %>%
mutate(date = parse_date(date, format="%m/%d/%Y"),
trap_date = parse_date(trap_date, format="%m/%d/%Y"))
For means:
# humid means
data %>%
group_by(site) %>%
filter(date >= trap_date-20 & date <= trap_date-18) %>%
summarise(mean = mean(humid))
# A tibble: 2 x 2
site mean
<dbl> <dbl>
1 1 1.67
2 2 2.67
However, it seems that the range of 497 to 500 days before the trap date contains no observations. When I used your specified dates, I got the same result of 0.9:
# precip sums
data %>%
group_by(site) %>%
filter(date >= trap_date-500 & date <= trap_date-497)
# A tibble: 0 x 5
# Groups: site [0]
# ... with 5 variables: site <dbl>, precip <dbl>, humid <dbl>,
# date <date>, trap_date <date>
# using your provided dates
data %>%
group_by(site) %>%
filter(date >= as.Date("2019-02-13") & date <= as.Date("2019-02-16")) %>%
summarise(sum = sum(precip))
# A tibble: 1 x 2
site sum
<dbl> <dbl>
1 1 0.9
Hope I can help.

Sum unique occurrences per night and create a new data frame in R

I have studied prey deliveries in a breeding owl and want to score the number of prey items delivered during the night to the nestlings. I define night as from 21 to 5. How could I make a new data frame with number of prey each night per location ID based upon these 24/7 observation dataset? In the new data frame, I wish to have the following columns: ID (A & B), No_prey_during_night (the sum of prey items), Time (date, e.g. 4/6 to 5/6), there will be a unique row per night per ID.
https://drive.google.com/file/d/1y5VCoNWZCmYbyWCktKfMSBqjOIaLeumQ/view?usp=sharing. I have done it in Excel so far, but very time demanding. I would be happy to get help with a simple script I could use in R.
To take into account the fact that a night begins and ends on different dates, you could first assign all the morning hours to the prior day. The final label (the Time column in your question) then includes the next day. If the year of the data collection has a Feb 29, make sure the year is correct (I used 2022).
library(dplyr)
library(lubridate)
read.csv("Tot_prey_example.csv") %>%
mutate(time = make_datetime(year = 2022, month = Month, day = Day, hour = Hour),
night_time = if_else(between(Hour, 0, 5), time - days(1), time),
night_date = floor_date(night_time, unit = "day"),
night = Hour <= 5 | Hour >= 21) %>%
filter(night) %>%
group_by(ID, night_date) %>%
summarise(No_prey_during_night = sum(n), .groups = "drop") %>%
mutate(next_day = night_date + days(1),
Time = glue::glue("{day(night_date)}/{month(night_date)} to {day(next_day)}/{month(next_day)}")) %>%
select(ID, No_prey_during_night, Time)
#> # A tibble: 88 × 3
#> ID No_prey_during_night Time
#> <chr> <int> <glue>
#> 1 A 12 4/6 to 5/6
#> 2 A 22 5/6 to 6/6
#> 3 A 20 6/6 to 7/6
#> 4 A 14 7/6 to 8/6
#> 5 A 14 8/6 to 9/6
#> 6 A 27 9/6 to 10/6
#> 7 A 22 10/6 to 11/6
#> 8 A 18 11/6 to 12/6
#> 9 A 22 12/6 to 13/6
#> 10 A 25 13/6 to 14/6
#> # … with 78 more rows
Created on 2022-05-18 by the reprex package (v2.0.1)
You can do something like this:
library(dplyr)
library(lubridate)
read.csv("Tot_prey_example.csv") %>%
# create initial datetime variable, `night`
mutate(night = lubridate::make_datetime(2021, Month,Day,Hour)) %>%
# filter to nighttime hours
filter(Hour>=21 | Hour<=5) %>%
# flip datetime variable to the next day if hour is >=21
mutate(night = if_else(Hour>=21,night + 60*60*24, night)) %>%
# now group by the date part of `night`
group_by(ID,Night_No = as.Date(night)) %>%
# summarize the sum of prey
summarize(
No_prey_during_night = sum(n),
No_deliveries_during_night = sum(PreyDelivery)
) %>%
# replace the Night_No with a character variable showing both dates
mutate(Night_No = paste0(Night_No-1, "-", Night_No))
Output:
# A tibble: 88 × 4
# Groups: ID [2]
ID Night_No No_prey_during_night No_deliveries_during_night
<chr> <chr> <int> <int>
1 A 2021-06-04-2021-06-05 12 5
2 A 2021-06-05-2021-06-06 22 6
3 A 2021-06-06-2021-06-07 20 5
4 A 2021-06-07-2021-06-08 14 6
5 A 2021-06-08-2021-06-09 14 5
6 A 2021-06-09-2021-06-10 27 5
7 A 2021-06-10-2021-06-11 22 4
8 A 2021-06-11-2021-06-12 18 6
9 A 2021-06-12-2021-06-13 22 6
10 A 2021-06-13-2021-06-14 25 5
# … with 78 more rows

Time difference calculated from wide data with missing rows

There is a longitudinal data set in the wide format, from which I want to compute time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates, an example is as follows
df1<-data.frame("id"=c(1:4),
"adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
"bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
"cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
"ddate"=c("2014-06-15",NA,"2014-04-11",NA))
Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example
lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")
However, I'm challenged by the fact that the last date is not coming from one column. I'm looking for a way to automate the calculation in R. The expected output would look like
id years days
1 1 2.99 1093
2 2 2.00 731
3 3 3.01 1098
4 4 1.01 369
Years is approximated to 2 decimal places.
Another tidyverse solution can be done by converting the data to long format, removing NA dates, and getting the time difference between last and first date for each id.
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
pivot_longer(-id) %>%
na.omit %>%
group_by(id) %>%
mutate(value = as.Date(value)) %>%
summarise(years = time_length(difftime(last(value), first(value)),"years"),
days = as.numeric(difftime(last(value), first(value))))
#> # A tibble: 4 x 3
#> id years days
#> <int> <dbl> <dbl>
#> 1 1 2.99 1093
#> 2 2 2.00 731
#> 3 3 3.01 1098
#> 4 4 1.01 369
We could use pmap
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
mutate(out = pmap(.[-1], ~ {
dates <- as.Date(na.omit(c(...)))
tibble(years = lubridate::time_length(difftime(last(dates),
first(dates)), "years"),
days = lubridate::time_length(difftime(last(dates), first(dates)), "days"))
})) %>%
unnest_wider(out)
# A tibble: 4 x 7
# id adate bdate cdate ddate years days
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#1 1 2011-06-18 2012-06-15 2013-06-18 2014-06-15 2.99 1093
#2 2 2011-06-18 2012-06-15 2013-06-18 <NA> 2.00 731
#3 3 2011-04-09 <NA> 2013-04-09 2014-04-11 3.01 1098
#4 4 2011-05-20 2012-05-23 <NA> <NA> 1.01 369
Probably most of the functions introduced here might be quite complex. You should try to learn them if possible. Although will provide a Base R approach:
grp <- droplevels(interaction(df[,1],row(df[-1]))) # Create a grouping:
days <- tapply(unlist(df[-1]),grp, function(x)max(x,na.rm = TRUE) - x[1]) #Get the difference
cbind(df[1],days, years = round(days/365,2)) # Create your table
id days years
1.1 1 1093 2.99
2.2 2 731 2.00
3.3 3 1098 3.01
4.4 4 369 1.01
if comfortable with other higher functions then you could do:
dat <- aggregate(adate~id,reshape(df1,list(2:ncol(df1)), dir="long"),function(x)max(x) - x[1])
transform(dat,year = round(adate/365,2))
id adate year
1 1 1093 2.99
2 2 731 2.00
3 3 1098 3.01
4 4 369 1.01
Using base R apply :
df1[-1] <- lapply(df1[-1], as.Date)
df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
x <- na.omit(x)
x1 <- difftime(x[length(x)], x[1], 'days')
c(x1/365, x1)
}))
df1[c('id', 'years', 'days')]
# id years days
#1 1 2.994521 1093
#2 2 2.002740 731
#3 3 3.008219 1098
#4 4 1.010959 369

How do I know, what day of week is a date

I've got the following problem: I have the daily stock exchange rates of a certain share stored in a vector with the belonging date(from 2015 to 2017).
I need to extract the last exchange rate of every week.
This means I need to know what weekday corresponds to every date and store those rates in a vector (or delete the other rows from the existing vector). I did this by using 'wday' (from lubridate) and then did the following:
vector<-stochexchangerate
weekdays<-wday(stockexchangerate) ## length =35; monday=2,
tuesday=3,..
for(i in 1:10){
if(weekdays[i]<6){
vector<-vector[-c(i)]
}
}
But this only has the consequence, that some "random" rows are deleted and if I run this code 6 times, there is only 1 row left although there were some values which were taken on friday. Can anyone help me?
Yes, using lubridate was a good insight. I would extract the day of the week using lubridate::wday and argument label = TRUE and filter that column.
Assuming that you have a dataframe with 2 columns (one for the dates and, one for the value of rates) you can do:
library(tidyverse)
library(lubridate)
# DATA
#> df <- tibble(date = mdy("02/15/1980") + 1:300,
#> value = 1:300)
df %>%
mutate(day = wday(date, label = TRUE)) %>%
filter(day == "Fri")
#> # A tibble: 42 x 3
#> date value day
#> <date> <int> <ord>
#> 1 1980-02-22 7 Fri
#> 2 1980-02-29 14 Fri
#> 3 1980-03-07 21 Fri
#> 4 1980-03-14 28 Fri
#> 5 1980-03-21 35 Fri
#> 6 1980-03-28 42 Fri
#> 7 1980-04-04 49 Fri
#> 8 1980-04-11 56 Fri
#> 9 1980-04-18 63 Fri
#> 10 1980-04-25 70 Fri
#> # … with 32 more rows

Resources