DST calculation using R - r

I want to calculate the date on which daylight saving time begins for each year from 2003 through 2021, and keep only the days within 60 days before and after that date each year.
The date changes each year (it always falls on a Sunday), and it moved from April in 2003-2006 to March in 2007-2021.
I need to create a running variable "days" that measures the distance from the daylight saving time start date for each year, with days = 0 on the first day of daylight saving time.
Here's the dataset:
year month day propertycrimes violentcrimes
2003 1 1 94 34
2004 1 1 60 46
2005 1 1 106 41
2006 1 1 87 40
2007 1 1 72 36
2008 1 1 43 50
2009 1 1 35 32
2010 1 1 32 50
2011 1 1 29 45
2012 1 1 32 45
Here's my code so far
library(readr)
dailycrimedataRD <- read_csv("dailycrimedataRD.csv")
View(dailycrimedataRD)
days <- .POSIXct(month, tz="GMT")

How about this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(readr)
dailycrimedataRD <- read_csv("~/Downloads/dailycrimedataRD.csv")
#> Rows: 6940 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (5): year, month, day, propertycrimes, violentcrimes
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tmp <- dailycrimedataRD %>%
  mutate(date = lubridate::ymd(paste(year, month, day, sep="-"), tz='Canada/Eastern'),
         dst = lubridate::dst(date)) %>%
  arrange(date) %>%
  group_by(year) %>%
  mutate(dst_date = date[which(dst == TRUE & lag(dst) == FALSE)],
         diff = (as.Date(dst_date) - as.Date(date))) %>%
  filter(diff <= 60 & diff >= 0)
tmp
#> # A tibble: 1,159 × 9
#> # Groups: year [19]
#> year month day propertycrimes violentcrimes date dst
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm> <lgl>
#> 1 2003 2 6 68 8 2003-02-06 00:00:00 FALSE
#> 2 2003 2 7 71 8 2003-02-07 00:00:00 FALSE
#> 3 2003 2 8 81 12 2003-02-08 00:00:00 FALSE
#> 4 2003 2 9 68 7 2003-02-09 00:00:00 FALSE
#> 5 2003 2 10 68 9 2003-02-10 00:00:00 FALSE
#> 6 2003 2 11 61 8 2003-02-11 00:00:00 FALSE
#> 7 2003 2 12 73 10 2003-02-12 00:00:00 FALSE
#> 8 2003 2 13 62 14 2003-02-13 00:00:00 FALSE
#> 9 2003 2 14 71 10 2003-02-14 00:00:00 FALSE
#> 10 2003 2 15 90 11 2003-02-15 00:00:00 FALSE
#> # … with 1,149 more rows, and 2 more variables: dst_date <dttm>, diff <drtn>
Created on 2022-04-14 by the reprex package (v2.0.1)
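The question asks for a window on both sides of the changeover, with days = 0 on the start date itself, whereas `filter(diff <= 60 & diff >= 0)` keeps only the days leading up to it. Also, the clocks move at 2 a.m., so a midnight timestamp on the changeover Sunday is still standard time and a midnight-based `dst()` check first turns TRUE on the following Monday. A minimal base R sketch that computes the Sunday from the calendar rule instead, assuming the US/Canada rules described in the question (first Sunday of April through 2006, second Sunday of March from 2007). The toy data, the helper name `dst_start` and the object names are made up for illustration:

```r
# Assumption: US/Canada rules -- first Sunday of April up to 2006,
# second Sunday of March from 2007 onwards.
dst_start <- function(year) {
  month <- if (year <= 2006) 4L else 3L
  nth   <- if (year <= 2006) 1L else 2L
  # The first 14 days of the month always contain the 1st and 2nd Sunday
  days <- seq(as.Date(sprintf("%d-%02d-01", as.integer(year), month)),
              by = "day", length.out = 14)
  sundays <- days[format(days, "%u") == "7"]  # %u: 1 = Monday ... 7 = Sunday
  sundays[nth]
}

# Toy stand-in for the crime data: one row per day with year/month/day columns
dates <- seq(as.Date("2003-01-01"), as.Date("2004-12-31"), by = "day")
crime <- data.frame(year  = as.integer(format(dates, "%Y")),
                    month = as.integer(format(dates, "%m")),
                    day   = as.integer(format(dates, "%d")))

crime$date <- as.Date(paste(crime$year, crime$month, crime$day, sep = "-"))

# One DST start date per year, matched back onto every row
years  <- unique(crime$year)
starts <- do.call(c, lapply(years, dst_start))
crime$dst_date <- starts[match(crime$year, years)]

# Running variable: negative before the changeover, 0 on the day, positive after
crime$days <- as.integer(crime$date - crime$dst_date)

# Keep 60 days either side of the changeover
window <- crime[abs(crime$days) <= 60, ]
```

With real data, `crime` would be replaced by the data frame read from the CSV; the rest should carry over unchanged as long as the `year`, `month` and `day` columns exist.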

Related

How to control the fill_gaps interval in tsibble?

I have two data frames whose gaps get filled at different intervals.
I would like both to be filled at the same interval.
Consider two data frames with the same month-day but two years apart:
library(tidyverse)
library(fpp3)
df_2020 <- tibble(month_day = as_date(c('2020-1-1','2020-2-1','2020-3-1')),
amount = c(5, 2, 1))
df_2022 <- tibble(month_day = as_date(c('2022-1-1','2022-2-1','2022-3-1')),
amount = c(5, 2, 1))
These data frames both have three rows, with the same dates, 2 years apart.
Create tsibbles with a yearweek index:
ts_2020 <- df_2020 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2022 <- df_2022 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2020
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022
#> # A tsibble: 3 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 2022-02-01 2 2022 W05
#> 3 2022-03-01 1 2022 W09
Still three rows in each tsibble
Now fill gaps:
ts_2020_filled <- ts_2020 |> fill_gaps()
ts_2022_filled <- ts_2022 |> fill_gaps()
ts_2020_filled
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022_filled
#> # A tsibble: 10 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 NA NA 2022 W01
#> 3 NA NA 2022 W02
#> 4 NA NA 2022 W03
#> 5 NA NA 2022 W04
#> 6 2022-02-01 2 2022 W05
#> 7 NA NA 2022 W06
#> 8 NA NA 2022 W07
#> 9 NA NA 2022 W08
#> 10 2022-03-01 1 2022 W09
Here is the issue:
ts_2020_filled has 4-weekly steps, and ts_2022_filled has 1-weekly steps.
This is because the two tsibbles have different intervals:
tsibble::interval(ts_2020)
#> <interval[1]>
#> [1] 4W
tsibble::interval(ts_2022)
#> <interval[1]>
#> [1] 1W
This is because the tsibbles have different steps:
ts_2020 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 4 4
ts_2022 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 5 4
Therefore, the greatest common divisors are different (4 and 1). From the manual
for as_tsibble:
regular Regular time interval (TRUE) or irregular (FALSE). The
interval is determined by the greatest common divisor of index column,
if TRUE.
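To see where the 4 and the 1 come from, the greatest common divisor of the step sizes can be checked by hand. A small base R sketch (the `gcd` helper is written here for illustration and is not part of tsibble):

```r
# Euclid's algorithm for two non-negative integers
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

steps_2020 <- c(4, 4)  # week differences in ts_2020
steps_2022 <- c(5, 4)  # week differences in ts_2022

interval_2020 <- Reduce(gcd, steps_2020)  # 4, matching the 4W interval
interval_2022 <- Reduce(gcd, steps_2022)  # 1, matching the 1W interval
```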
Both tsibbles are
regular:
is_regular(ts_2020)
#> [1] TRUE
is_regular(ts_2022)
#> [1] TRUE
So, I would like to set the gap fill interval, so the periods are consistent.
I tried setting .full in fill_gaps and .regular in as_tsibble.
I could not find a way to set the interval of a tsibble.
Is there a way of manually setting the interval used by fill_gaps? Granted, an interval of four weeks won't work for df_2022, but a common interval of one week (the GCD of the two) would work for both.

how to find the growth rate of applicants per year

I have this data set with 20 variables, and I want to find the growth rate of applicants per year. The data provided covers 2020-2022. How would I go about that? I tried subsetting the data, but I'm stuck on how to approach it. Essentially, I want to assign each applicant to the corresponding year and then calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year to extract the year from your year-month-day variable and then count the rows per year with dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- seq_len(100)
date <- as.Date(sample(as.numeric(as.Date('2017-01-01')):as.numeric(as.Date('2023-01-01')), 100,
                       replace = TRUE),
                origin = '1970-01-01')
df <- data.frame(id, date) %>%
  mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
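The yearly counts can then be turned into a growth rate, i.e. the change from the previous year divided by the previous year's count. A minimal base R sketch, reusing the counts printed above (the object name `counts` and the column name `growth_rate` are just illustrations):

```r
# Yearly applicant counts, copied from the summarised output above
counts <- data.frame(year = 2017:2022,
                     n    = c(17, 14, 17, 18, 11, 23))

# (n_t - n_{t-1}) / n_{t-1}; the first year has no predecessor, hence NA
counts$growth_rate <- c(NA, diff(counts$n) / head(counts$n, -1))

counts
```

Multiplying `growth_rate` by 100 gives the change in percent.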

Casewhen ignoring one case in R

I have a problem with my R code. I am trying to use case_when to map an hour attribute to a part of the day.
data_aoi_droped = data_aoi_droped %>%
mutate (Day_Time = case_when(
Hour >= 05 & Hour < 09 ~ "Rano",
Hour >= 09 & Hour < 11 ~ "Doobeda",
Hour >= 11 & Hour < 13 ~ "Obed",
Hour >= 13 & Hour < 16 ~ "Poobede",
Hour >= 16 & Hour < 19 ~ "Podvecer",
Hour >= 19 & Hour < 22 ~ "Vecer",
Hour >= 22 | Hour < 05 ~ "Noc"
)
)
head(data_aoi_droped,20)
In the result, case_when ignores the "Rano" branch, which means morning.
I recommend using cut over case_when because your splits are adjacent to each other. The result does not contain time periods that are not part of the data.
library(tidyverse)
data_aoi_droped <- tibble(Hour = c(0,7,11,15,17,20,21))
data_aoi_droped %>%
mutate(
Day_Time = Hour %>% cut(
breaks = c(5,9,11,13,16,19,22),
labels = c("Rano", "Doobeda", "Obed", "Poobede", "Podvecer", "Vecer")
) %>% as.character() %>% replace_na("Noc")
)
#> # A tibble: 7 × 2
#> Hour Day_Time
#> <dbl> <chr>
#> 1 0 Noc
#> 2 7 Rano
#> 3 11 Doobeda
#> 4 15 Poobede
#> 5 17 Podvecer
#> 6 20 Vecer
#> 7 21 Vecer
data_aoi_droped %>%
complete(Hour = seq(24)) %>%
mutate(
Day_Time = Hour %>% cut(
breaks = c(5,9,11,13,16,19,22),
labels = c("Rano", "Doobeda", "Obed", "Poobede", "Podvecer", "Vecer")
) %>% as.character() %>% replace_na("Noc")
) %>%
print(n=Inf)
#> # A tibble: 25 × 2
#> Hour Day_Time
#> <dbl> <chr>
#> 1 1 Noc
#> 2 2 Noc
#> 3 3 Noc
#> 4 4 Noc
#> 5 5 Noc
#> 6 6 Rano
#> 7 7 Rano
#> 8 8 Rano
#> 9 9 Rano
#> 10 10 Doobeda
#> 11 11 Doobeda
#> 12 12 Obed
#> 13 13 Obed
#> 14 14 Poobede
#> 15 15 Poobede
#> 16 16 Poobede
#> 17 17 Podvecer
#> 18 18 Podvecer
#> 19 19 Podvecer
#> 20 20 Vecer
#> 21 21 Vecer
#> 22 22 Vecer
#> 23 23 Noc
#> 24 24 Noc
#> 25 0 Noc
Created on 2022-04-04 by the reprex package (v2.0.0)
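One detail worth checking: `cut()` builds right-closed intervals by default, so with these breaks hour 5 falls outside the first bin and hour 9 lands in "Rano", while the conditions in the question are left-closed (`Hour >= 5 & Hour < 9`). A minimal base R sketch with `right = FALSE`, which should reproduce the boundaries of the original case_when (the sample hours are chosen here to hit the edges):

```r
hours  <- c(0, 5, 8, 9, 11, 21, 22)
labels <- c("Rano", "Doobeda", "Obed", "Poobede", "Podvecer", "Vecer")

# right = FALSE gives [5,9), [9,11), ..., [19,22): lower bound included
day_time <- as.character(cut(hours,
                             breaks = c(5, 9, 11, 13, 16, 19, 22),
                             labels = labels,
                             right  = FALSE))

# Everything outside 5-22 is night
day_time[is.na(day_time)] <- "Noc"

data.frame(Hour = hours, Day_Time = day_time)
```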

R calculating differences on a pivoted tibble

I'm struggling with some beginner issues with R and tables. I spend most of my data visualisation time in Tableau, but I want to be able to replicate work in R to take advantage of the report generation capacity of RMarkdown and the statcanR library, which lets me pull data in from Statistics Canada's CANSIM/CODR tables. My coding experience is along the lines of C, C++, Java, JavaScript and Python, with all but Python learnt in college around the turn of the millennium.
I am extracting rates of certain types of crimes and have created the following table.
# A tibble: 4 × 11
Violations `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total, all Criminal Code violati… 5780. 5638. 5206. 5061. 5232. 5297. 5375. 5513. 5878. 5301.
2 Total violent Criminal Code viol… 1236. 1199. 1096. 1044. 1070. 1076. 1113. 1152. 1279. 1254.
3 Total property crime violations … 3536. 3438. 3154. 3100. 3231. 3239. 3265. 3348. 3512. 3071.
4 Total drug violations [401] 330. 317. 311. 295. 280. 267. 254. 229. 186. 176.
I have filtered away data that is more than ten years old and only for certain crimes.
# Pivot the data
table_01 <- pivot_wider(table_01 %>%select("REF_DATE","Violations","VALUE"),names_from=REF_DATE, values_from=VALUE)
table01a<-table_01 %>%select(2020,2019,2011)
)
mutate(
ten_year_change = 2020-2011,
one_year_change = 2020-2019
)
I've been messing around with different libraries including tidyverse and dplyr. I want the code to calculate the difference between the two most recent years, and the difference between the most recent year and ten years before it. The idea is to generate a new report when Statistics Canada updates their data.
The code above is absolutely not what I want. I want the years used in the differences not to be hard-coded, so I don't have to edit the code in six months.
My suspicion is that I am not getting my head around the R way of doing things, but if I can get a push in the right direction, I would appreciate it.
Below is the TLDR full RMarkdown script:
---
title: "CJS Statistical Summary"
output: word_document
date: '2021-10-05'
---
` ` `{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#load libraries
#install.packages("tidyverse")
#install.packages("statcanR")
#install.packages("flextable")
#install.packages("dplyr")
library("tidyverse")
library("statcanR")
library("flextable")
library("dplyr")
setwd("~/R_Scripts") # change for a Windows-style path if ran in Windows.
#set language
language <-"eng"
# Load dataset Incident-based crime statistics, by detailed violations
CODR_0177 <- statcan_data('35-10-0177-01', language)
# Code not written for these CODR tables
#CODR_0027 <- statcan_data('35-10-0027-01', language)
#CODR_0038 <- statcan_data('35-10-0038-01', language)
#CODR_0029 <- statcan_data('35-10-0029-01', language)
#CODR_0022 <- statcan_data('35-10-0022-01', language)
#CODR_0006 <- statcan_data('35-10-0006-01', language)
` ` `
## Table 1
` ` `{r table_01, echo=FALSE}
# Develop table 1 - Crime Stats
# =============================
# Find most recent ten years
years <- distinct(CODR_0177 %>% select("REF_DATE"))
years <- arrange(years,desc(REF_DATE))%>% slice(1:10)
# Copying the crime stats table so it isn't altered in case we need to reuse it.
table_01 <- CODR_0177
# Remove unused columns
table_01 <- table_01 %>% select("REF_DATE","GEO","Violations","Statistics","UOM","VALUE") %>% filter(REF_DATE %in% years$REF_DATE)
# Keep only national data
table_01 <- table_01 %>% filter(GEO == "Canada")
# Keep only crime rate
table_01 <- table_01 %>% filter(Statistics == "Rate per 100,000 population")
# Keep only certain Violations
display_violations <- c("Total, all Criminal Code violations (excluding traffic) [50]","Total violent Criminal Code violations [100]","Total property crime violations [200]","Total drug violations [401]" )
table_01 <- table_01 %>% filter(Violations %in% display_violations)
# Pivot the data
table_01 <- pivot_wider(table_01 %>%select("REF_DATE","Violations","VALUE"),names_from=REF_DATE, values_from=VALUE)
#calculating year to year differences
table01a<-table_01 %>%select(2020,2019,2011)
)
mutate(
ten_year_change = 2020-2011,
one_year_change = 2020-2019
)
# Edit look and feel for report using Flextable
flex_table_01<-flextable(table_01)
flex_table_01<-theme_vanilla(flex_table_01)
flex_table_01<-add_header_row(
flex_table_01,
values=c("","Rates per 100,000 population","% change"),
colwidths=c(1,10,2)
)
flex_table_01<-add_header_row(
flex_table_01,
values=c("Incidents Reported to Police (Crime Rate)"),
colwidths=c(13)
)
flex_table_01 <- align(flex_table_01, i = 1, part = "header", align = "center")
flex_table_01 <- fontsize(flex_table_01, i = NULL, j = NULL, size = 8, part = "all")
flex_table_01 <- colformat_double(flex_table_01, big.mark=",", digits = 0, na_str = "N/A")
flex_table_01
#remove temporary files
rm(years)
rm(display_violations)
rm(table_01)
This is much easier with the data in "long" format. Below is an example with fake data. We use the lag function to get changes over different time ranges. Once you've added the changes over various timescales, you can subset and reshape the data as needed to create your final tables.
library(tidyverse)
# Fake data
set.seed(2)
d = tibble(
REF_DATE = rep(2010:2020, each=4),
Violations = rep(LETTERS[1:4], 11),
value = sample(100:200, 44)
)
d
#> # A tibble: 44 × 3
#> REF_DATE Violations value
#> <int> <chr> <int>
#> 1 2010 A 184
#> 2 2010 B 178
#> 3 2010 C 169
#> 4 2010 D 105
#> 5 2011 A 131
#> 6 2011 B 107
#> 7 2011 C 116
#> 8 2011 D 192
#> 9 2012 A 180
#> 10 2012 B 175
#> # … with 34 more rows
d1 = d %>%
arrange(Violations, REF_DATE) %>%
group_by(Violations) %>%
mutate(lag1 = value - lag(value),
lag10 = value - lag(value, n=10))
print(d1, n=23)
#> # A tibble: 44 × 5
#> # Groups: Violations [4]
#> REF_DATE Violations value lag1 lag10
#> <int> <chr> <int> <int> <int>
#> 1 2010 A 184 NA NA
#> 2 2011 A 131 -53 NA
#> 3 2012 A 180 49 NA
#> 4 2013 A 174 -6 NA
#> 5 2014 A 189 15 NA
#> 6 2015 A 132 -57 NA
#> 7 2016 A 139 7 NA
#> 8 2017 A 108 -31 NA
#> 9 2018 A 101 -7 NA
#> 10 2019 A 147 46 NA
#> 11 2020 A 193 46 9
#> 12 2010 B 178 NA NA
#> 13 2011 B 107 -71 NA
#> 14 2012 B 175 68 NA
#> 15 2013 B 164 -11 NA
#> 16 2014 B 154 -10 NA
#> 17 2015 B 153 -1 NA
#> 18 2016 B 115 -38 NA
#> 19 2017 B 171 56 NA
#> 20 2018 B 166 -5 NA
#> 21 2019 B 190 24 NA
#> 22 2020 B 117 -73 -61
#> 23 2010 C 169 NA NA
#> # … with 21 more rows
We can also do multiple lags at once:
d2 = d %>%
arrange(Violations, REF_DATE) %>%
group_by(Violations) %>%
mutate(map_dfc(1:10 %>% set_names(paste0("lag.", .)),
~ value - lag(value, n=.x)))
d2
#> # A tibble: 44 × 13
#> # Groups: Violations [4]
#> REF_DATE Violations value lag.1 lag.2 lag.3 lag.4 lag.5 lag.6 lag.7 lag.8
#> <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2010 A 184 NA NA NA NA NA NA NA NA
#> 2 2011 A 131 -53 NA NA NA NA NA NA NA
#> 3 2012 A 180 49 -4 NA NA NA NA NA NA
#> 4 2013 A 174 -6 43 -10 NA NA NA NA NA
#> 5 2014 A 189 15 9 58 5 NA NA NA NA
#> 6 2015 A 132 -57 -42 -48 1 -52 NA NA NA
#> 7 2016 A 139 7 -50 -35 -41 8 -45 NA NA
#> 8 2017 A 108 -31 -24 -81 -66 -72 -23 -76 NA
#> 9 2018 A 101 -7 -38 -31 -88 -73 -79 -30 -83
#> 10 2019 A 147 46 39 8 15 -42 -27 -33 16
#> # … with 34 more rows, and 2 more variables: lag.9 <int>, lag.10 <int>
Created on 2021-10-05 by the reprex package (v2.0.1)
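To keep the years out of the code entirely, the comparison years can be derived from the data. A minimal base R sketch on long-format data shaped like `d` above; the values are made up, and the helper name `value_in` is hypothetical. `latest - 10` mirrors `lag10` in the answer and can be changed to whatever span the report needs:

```r
# Made-up long-format data: one row per violation and year
years <- 2010:2020
d <- rbind(
  data.frame(REF_DATE = years, Violations = "A",
             value = 100 + 5 * (years - 2010), stringsAsFactors = FALSE),
  data.frame(REF_DATE = years, Violations = "B",
             value = 200 - 3 * (years - 2010), stringsAsFactors = FALSE)
)

# Most recent year is taken from the data, not hard-coded
latest <- max(d$REF_DATE)

# Hypothetical helper: named vector of values for one year
value_in <- function(year) {
  rows <- d[d$REF_DATE == year, ]
  setNames(rows$value, rows$Violations)
}

current <- value_in(latest)
changes <- data.frame(
  Violations      = names(current),
  one_year_change = unname(current - value_in(latest - 1)[names(current)]),
  ten_year_change = unname(current - value_in(latest - 10)[names(current)])
)

changes
```

When new data arrives, `latest` moves forward on its own, so the same script can be re-run without edits.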

Assigning Values in R by Date Range

I am trying to create a "week" variable in my dataset of daily observations that takes a new value (1, 2, 3, et cetera) whenever a new Monday happens. My dataset has observations beginning on April 6th, 2020, and the data are stored in a "YYYY-MM-DD" as.Date() format. In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19th would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately that doesn't work for my purposes because a year is not exactly 53 weeks long, and therefore "week 53" would only be a day or two long. In other words, I would like the days from December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also do
df <- data.frame(date = as.Date("2020-04-06")+ 0:365)
library(dplyr)
library(lubridate)
df %>% group_by(d = isoyear(date), week = isoweek(date)) %>%  # isoyear keeps Dec 28 - Jan 3 in one group
mutate(week = cur_group_id()) %>% ungroup() %>% select(-d)
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the dates with the minimum date, divide the difference by 7 and use floor to get 1 number for each 7 days.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
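This relies on the earliest date being a Monday. If that is not guaranteed, one option is to step back to the Monday on or before the first observation and count from there. A minimal base R sketch (the sample dates and the names `offset`, `first_monday` and `week` are illustrations):

```r
# Sample dates; the earliest one is a Wednesday on purpose
x <- as.Date(c("2020-04-08", "2020-04-12", "2020-04-13",
               "2021-01-03", "2021-01-04"))

# %u is the ISO weekday (1 = Monday ... 7 = Sunday)
offset       <- as.integer(format(min(x), "%u")) - 1L
first_monday <- min(x) - offset

# Whole weeks elapsed since that Monday, numbered from 1
week <- as.integer(x - first_monday) %/% 7L + 1L
week
```

Here December 28th to January 3rd share one week number and the count continues into the new year.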
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
as.Date("2021-01-04"),
by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(week = isoweek(date),
year = isoyear(date)) %>%
group_by(year) %>%
mutate(week2 = 1 + (week - min(week))) %>%
ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1
