case_when ignoring one case in R

I have a problem with my R code. I am trying to use case_when to map an hour attribute to a part of the day.
data_aoi_droped = data_aoi_droped %>%
  mutate(Day_Time = case_when(
    Hour >= 05 & Hour < 09 ~ "Rano",
    Hour >= 09 & Hour < 11 ~ "Doobeda",
    Hour >= 11 & Hour < 13 ~ "Obed",
    Hour >= 13 & Hour < 16 ~ "Poobede",
    Hour >= 16 & Hour < 19 ~ "Podvecer",
    Hour >= 19 & Hour < 22 ~ "Vecer",
    Hour >= 22 | Hour < 05 ~ "Noc"
  )
  )
head(data_aoi_droped,20)
In the result, case_when ignores the "Rano" case, which means morning.

I recommend using cut over case_when because your intervals are contiguous. The result does not contain time periods that are not part of the data.
library(tidyverse)
data_aoi_droped <- tibble(Hour = c(0,7,11,15,17,20,21))
data_aoi_droped %>%
mutate(
Day_Time = Hour %>% cut(
breaks = c(5,9,11,13,16,19,22),
labels = c("Rano", "Doobeda", "Obed", "Poobede", "Podvecer", "Vecer")
) %>% as.character() %>% replace_na("Noc")
)
#> # A tibble: 7 × 2
#> Hour Day_Time
#> <dbl> <chr>
#> 1 0 Noc
#> 2 7 Rano
#> 3 11 Doobeda
#> 4 15 Poobede
#> 5 17 Podvecer
#> 6 20 Vecer
#> 7 21 Vecer
data_aoi_droped %>%
complete(Hour = seq(24)) %>%
mutate(
Day_Time = Hour %>% cut(
breaks = c(5,9,11,13,16,19,22),
labels = c("Rano", "Doobeda", "Obed", "Poobede", "Podvecer", "Vecer")
) %>% as.character() %>% replace_na("Noc")
) %>%
print(n=Inf)
#> # A tibble: 25 × 2
#> Hour Day_Time
#> <dbl> <chr>
#> 1 1 Noc
#> 2 2 Noc
#> 3 3 Noc
#> 4 4 Noc
#> 5 5 Noc
#> 6 6 Rano
#> 7 7 Rano
#> 8 8 Rano
#> 9 9 Rano
#> 10 10 Doobeda
#> 11 11 Doobeda
#> 12 12 Obed
#> 13 13 Obed
#> 14 14 Poobede
#> 15 15 Poobede
#> 16 16 Poobede
#> 17 17 Podvecer
#> 18 18 Podvecer
#> 19 19 Podvecer
#> 20 20 Vecer
#> 21 21 Vecer
#> 22 22 Vecer
#> 23 23 Noc
#> 24 24 Noc
#> 25 0 Noc
Created on 2022-04-04 by the reprex package (v2.0.0)
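One caveat: cut() uses right-closed intervals by default (right = TRUE), so the breaks above produce (5, 9], (9, 11], and so on. That is why hour 5 is labelled Noc and hour 11 Doobeda in the output, whereas the question's conditions are left-closed (Hour >= 5 & Hour < 9). A minimal sketch, assuming the question's boundaries are the intended ones, passes right = FALSE. It uses base R only, and `hours` is an illustrative vector rather than the original data:

```r
# Illustrative input: every hour of the day
hours <- 0:23

breaks <- c(5, 9, 11, 13, 16, 19, 22)
labels <- c("Rano", "Doobeda", "Obed", "Poobede", "Podvecer", "Vecer")

# right = FALSE makes the intervals [5, 9), [9, 11), ..., [19, 22),
# matching the >= / < conditions in the question
day_time <- as.character(cut(hours, breaks = breaks, labels = labels, right = FALSE))

# Hours outside [5, 22) come back as NA; label them as night
day_time[is.na(day_time)] <- "Noc"

data.frame(Hour = hours, Day_Time = day_time)
```

With this variant hour 5 falls into Rano and hour 22 into Noc, as in the original case_when conditions.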

Related

How to control the fill_gaps interval in tsibble?

I have two data frames whose gaps are filled at different intervals.
I would like to fill both at the same interval.
Consider two data frames with the same month-day but two years apart:
library(tidyverse)
library(fpp3)
df_2020 <- tibble(month_day = as_date(c('2020-1-1','2020-2-1','2020-3-1')),
amount = c(5, 2, 1))
df_2022 <- tibble(month_day = as_date(c('2022-1-1','2022-2-1','2022-3-1')),
amount = c(5, 2, 1))
These data frames both have three rows, with the same dates, 2 years apart.
Create tsibbles with a yearweek index:
ts_2020 <- df_2020 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2022 <- df_2022 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2020
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022
#> # A tsibble: 3 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 2022-02-01 2 2022 W05
#> 3 2022-03-01 1 2022 W09
Still three rows in each tsibble
Now fill gaps:
ts_2020_filled <- ts_2020 |> fill_gaps()
ts_2022_filled <- ts_2022 |> fill_gaps()
ts_2020_filled
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022_filled
#> # A tsibble: 10 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 NA NA 2022 W01
#> 3 NA NA 2022 W02
#> 4 NA NA 2022 W03
#> 5 NA NA 2022 W04
#> 6 2022-02-01 2 2022 W05
#> 7 NA NA 2022 W06
#> 8 NA NA 2022 W07
#> 9 NA NA 2022 W08
#> 10 2022-03-01 1 2022 W09
Here is the issue:
ts_2020_filled has 4-weekly steps, and ts_2022_filled has 1-weekly steps.
This is because the two tsibbles have different intervals:
tsibble::interval(ts_2020)
#> <interval[1]>
#> [1] 4W
tsibble::interval(ts_2022)
#> <interval[1]>
#> [1] 1W
The intervals differ because the steps between consecutive observations differ:
ts_2020 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 4 4
ts_2022 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 5 4
Therefore, the greatest common divisors are different (4 and 1). From the manual for as_tsibble:
regular Regular time interval (TRUE) or irregular (FALSE). The
interval is determined by the greatest common divisor of index column,
if TRUE.
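To make the arithmetic concrete, here is a small base R sketch of the greatest-common-divisor calculation on the two sets of steps. `gcd2()` and `gcd_all()` are hypothetical helpers written for illustration, not tsibble functions, and this is not necessarily how tsibble computes the interval internally:

```r
# Euclid's algorithm for two non-negative integers
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Greatest common divisor of a whole vector of steps
gcd_all <- function(x) Reduce(gcd2, x)

steps_2020 <- c(4, 4)  # week differences in ts_2020
steps_2022 <- c(5, 4)  # week differences in ts_2022

gcd_all(steps_2020)  # 4, hence the [4W] interval
gcd_all(steps_2022)  # 1, hence the [1W] interval
```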
Both tsibbles are regular:
is_regular(ts_2020)
#> [1] TRUE
is_regular(ts_2022)
#> [1] TRUE
So, I would like to set the gap fill interval, so the periods are consistent.
I tried setting .full in fill_gaps and .regular in as_tsibble.
I could not find a way to set the interval of a tsibble.
Is there a way of manually setting the interval used by fill_gaps? Granted, an interval of four weeks won't work for df_2022, but a common interval of one week would work for both.

how to find the growth rate of applicants per year

I have a data set with 20 variables, and I want to find the growth rate of applicants per year. The data cover 2020-2022. How would I go about that? I tried subsetting the data, but I'm stuck on how to approach it. Essentially, I want to assign each applicant to its corresponding year, count the applicants per year, and calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year to split your year-month-day variable into years and then dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- seq_len(100)
date <- as.Date(
  sample(as.numeric(as.Date('2017-01-01')):as.numeric(as.Date('2023-01-01')), 100, replace = TRUE),
  origin = '1970-01-01'
)
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
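The summary above gives the count per year but stops short of the growth rate itself. A minimal sketch of that last step, assuming "growth rate" means the year-over-year percentage change in the number of applicants: the counts are copied from the simulated output above, and `counts` and `growth_pct` are illustrative names. Base R is used here; in a dplyr pipeline the same idea could be written with `lag()`.

```r
# Yearly counts taken from the simulated example above
counts <- data.frame(
  year = 2017:2022,
  n    = c(17, 14, 17, 18, 11, 23)
)

# Year-over-year growth in percent: (n_t - n_{t-1}) / n_{t-1} * 100.
# The first year has no predecessor, so its growth is NA.
counts$growth_pct <- c(NA, diff(counts$n) / head(counts$n, -1) * 100)

counts
```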

DST calculation using R

I want to calculate the date on which daylight saving time begins for each year from 2003 through 2021, and keep only the days within 60 days before and after that date in each year.
The date changes each year (it always falls on a Sunday); it moved from April in 2003-2006 to March in 2007-2021.
I need to create a running variable “days” that measures the distance from the daylight saving time start date in each year, with days = 0 on the first day of daylight saving time.
Here's the dataset:
year month day propertycrimes violentcrimes
2003 1 1 94 34
2004 1 1 60 46
2005 1 1 106 41
2006 1 1 87 40
2007 1 1 72 36
2008 1 1 43 50
2009 1 1 35 32
2010 1 1 32 50
2011 1 1 29 45
2012 1 1 32 45
Here's my code so far
library(readr)
dailycrimedataRD <- read_csv("dailycrimedataRD.csv")
View(dailycrimedataRD)
days <- .POSIXct(month, tz="GMT")
How about this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(readr)
dailycrimedataRD <- read_csv("~/Downloads/dailycrimedataRD.csv")
#> Rows: 6940 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (5): year, month, day, propertycrimes, violentcrimes
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tmp <- dailycrimedataRD %>%
mutate(date = lubridate::ymd(paste(year, month, day, sep="-"), tz='Canada/Eastern'),
dst = lubridate::dst(date)) %>%
arrange(date) %>%
group_by(year) %>%
mutate(dst_date = date[which(dst == TRUE & lag(dst) == FALSE)],
diff = (as.Date(dst_date) - as.Date(date))) %>%
filter(diff <= 60 & diff >= 0)
tmp
#> # A tibble: 1,159 × 9
#> # Groups: year [19]
#> year month day propertycrimes violentcrimes date dst
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm> <lgl>
#> 1 2003 2 6 68 8 2003-02-06 00:00:00 FALSE
#> 2 2003 2 7 71 8 2003-02-07 00:00:00 FALSE
#> 3 2003 2 8 81 12 2003-02-08 00:00:00 FALSE
#> 4 2003 2 9 68 7 2003-02-09 00:00:00 FALSE
#> 5 2003 2 10 68 9 2003-02-10 00:00:00 FALSE
#> 6 2003 2 11 61 8 2003-02-11 00:00:00 FALSE
#> 7 2003 2 12 73 10 2003-02-12 00:00:00 FALSE
#> 8 2003 2 13 62 14 2003-02-13 00:00:00 FALSE
#> 9 2003 2 14 71 10 2003-02-14 00:00:00 FALSE
#> 10 2003 2 15 90 11 2003-02-15 00:00:00 FALSE
#> # … with 1,149 more rows, and 2 more variables: dst_date <dttm>, diff <drtn>
Created on 2022-04-14 by the reprex package (v2.0.1)
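The filter above (diff <= 60 & diff >= 0) keeps only the 60 days before the start date. A minimal base R sketch of the full request, assuming the US/Canada rule (first Sunday of April through 2006, second Sunday of March from 2007 on), which matches the April-to-March shift described in the question. `dst_start()`, `crime` and `crime_window` are hypothetical names, and the generated date sequence stands in for the real data:

```r
# Hypothetical helper: first day of daylight saving time for one year,
# assuming the US/Canada rule described above.
dst_start <- function(year) {
  first <- if (year <= 2006) {
    as.Date(sprintf("%d-04-01", year))  # first Sunday lies in 1-7 April
  } else {
    as.Date(sprintf("%d-03-08", year))  # second Sunday lies in 8-14 March
  }
  week <- first + 0:6
  week[format(week, "%w") == "0"]       # "%w" is "0" for Sunday
}

# Stand-in for the real data: one row per calendar day, 2003-2021
crime <- data.frame(
  date = seq(as.Date("2003-01-01"), as.Date("2021-12-31"), by = "day")
)
crime$year <- as.integer(format(crime$date, "%Y"))

# Look up the start date for each row's year
starts <- vapply(2003:2021, function(y) as.numeric(dst_start(y)), numeric(1))
names(starts) <- 2003:2021
crime$dst_date <- as.Date(unname(starts[as.character(crime$year)]), origin = "1970-01-01")

# Running variable: negative before the start date, 0 on it, positive after
crime$days <- as.integer(crime$date - crime$dst_date)

# Keep the 60 days before and the 60 days after the start date
crime_window <- crime[abs(crime$days) <= 60, ]
```

Note that the clocks change at 02:00 local time, so a timestamp at midnight on the start day is still in standard time. A dst() flag computed on midnight timestamps may therefore first become TRUE on the following day, which is worth checking before relying on it for days = 0.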

Pivot_longer error: Can't combine `01/01/2020` <character> and `03/01/2020` <double>

I'm trying to transpose certain columns in my data to make it tidy, but having trouble due to column types. My data looks like this:
Station Polluant Mesure Unité `01/01/2020` `02/01/2020` `03/01/2020` `04/01/2020` `05/01/2020` `06/01/2020`
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Grand Parc dioxyde d'a… Dioxyd… µg/m3 16 23 26 16 30 24
2 Grand Parc ozone (O3) Ozone µg/m3 29 27 24 41 28 10
3 Grand Parc particules … PM10 µg/m3 23 20 20 13 26 21
4 Talence dioxyde d'a… Dioxyd… µg/m3 15 24 27 22 36 22
5 Talence ozone (O3) Ozone µg/m3 26 21 21 33 22 9
6 Talence particules … PM10 µg/m3 24 25 21 14 31 24
My desired output:
Station Polluant Mesure Unité Date Value
<chr> <chr> <chr> <chr> <chr> <chr>
1 Grand Parc dioxyde d'a… Dioxyd… µg/m3 01/01/2020 16
2 Grand Parc dioxyde d'a… Dioxyd… µg/m3 02/01/2020 23
3 Grand Parc dioxyde d'a… Dioxyd… µg/m3 03/01/2020 26
I've tried transposing, but I have difficulty transposing only certain columns. I've also tried pivot_longer, but I receive the error: Error: Can't combine `01/01/2020` <character> and `03/01/2020` <double>.
The problem is that some of your columns are character and some are numeric. You can fix this with the values_transform argument.
library(tidyverse)
test_data <- tibble(id = 1:5,
d1 = c(2,3,6,2,1),
d2 = c("3","6","1","1","9"))
test_data |>
pivot_longer(cols = c(d1,d2),
values_transform = as.numeric)
#> # A tibble: 10 x 3
#> id name value
#> <int> <chr> <dbl>
#> 1 1 d1 2
#> 2 1 d2 3
#> 3 2 d1 3
#> 4 2 d2 6
#> 5 3 d1 6
#> 6 3 d2 1
#> 7 4 d1 2
#> 8 4 d2 1
#> 9 5 d1 1
#> 10 5 d2 9
Update:
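I'm not sure why that is not working for you (one possibility is an older tidyr version, where values_transform had to be a named list such as list(value = as.numeric)), but you could transform the data first and then pivot: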
test_data |>
mutate(across(c(d1:d2), as.numeric)) |>
pivot_longer(cols = c(d1,d2))
For your data, it would be
atmo |>
mutate(across(c(`01/01/2020`:`01/01/2022`), as.numeric)) |>
pivot_longer(cols = c(`01/01/2020`:`01/01/2022`))

How to retain the first observation in every 30 days in the dataset for each ID?

I have a dataset in which I need to keep the first incidence, remove everything that happens in the 30-day window after it, and then repeat the process from the next remaining observation.
In my demonstration, I want to keep all the rows marked with arrows and exclude the others.
Any help would be really appreciated.
Thanks
# Example data
set.seed(2022)
n <- 20
df <- data.frame(
dt = rep(as.Date('2010-06-01'), n) + cumsum(sample(1:20, n, TRUE))
)
df
#> dt
#> 1 2010-06-05
#> 2 2010-06-24
#> 3 2010-07-08
#> 4 2010-07-19
#> 5 2010-07-23
#> 6 2010-07-29
#> 7 2010-08-12
#> 8 2010-08-21
#> 9 2010-09-04
#> 10 2010-09-11
#> 11 2010-09-29
#> 12 2010-10-15
#> 13 2010-10-20
#> 14 2010-10-21
#> 15 2010-11-09
#> 16 2010-11-10
#> 17 2010-11-12
#> 18 2010-11-19
#> 19 2010-12-01
#> 20 2010-12-16
library(dplyr, warn.conflicts = FALSE)
library(purrr)
cutoff <- 25
df %>%
# if date is < cutoff days of first date, maintain the same group
# else create a new group
group_by(g = accumulate(dt, ~ if (.y - .x < cutoff) .x else .y)) %>%
# for each group select the first row
slice_head(n = 1) %>%
# ungroup and remove grouping variable
ungroup() %>%
select(-g)
#> # A tibble: 7 × 1
#> dt
#> <date>
#> 1 2010-06-05
#> 2 2010-07-08
#> 3 2010-08-12
#> 4 2010-09-11
#> 5 2010-10-15
#> 6 2010-11-09
#> 7 2010-12-16
Created on 2022-02-13 by the reprex package (v2.0.1)
Or, using data.table::rowid
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(data.table, warn.conflicts = FALSE)
cutoff <- 25
df %>%
filter(rowid(accumulate(dt, ~ if (.y - .x < cutoff) .x else .y)) == 1)
#> dt
#> 1 2010-06-05
#> 2 2010-07-08
#> 3 2010-08-12
#> 4 2010-09-11
#> 5 2010-10-15
#> 6 2010-11-09
#> 7 2010-12-16
Created on 2022-02-13 by the reprex package (v2.0.1)
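Both variants above work on a single date column with a cutoff of 25 days. A minimal base R sketch of the same accumulate idea applied per ID with a 30-day window, as the question title asks: `keep_first()` is a hypothetical helper, the example data are made up, and the sketch assumes that an observation exactly 30 days after the anchor starts a new window.

```r
# Hypothetical helper: for dates sorted ascending, flag the first
# observation of each window. The anchor stays the same while a date is
# less than `window` days after it; otherwise that date becomes the anchor.
keep_first <- function(dates, window = 30) {
  anchor <- Reduce(
    function(a, d) if (d - a < window) a else d,
    as.numeric(dates),
    accumulate = TRUE
  )
  !duplicated(anchor)
}

# Made-up example data with two IDs
df <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2),
  dt = as.Date(c("2010-06-05", "2010-06-24", "2010-07-08", "2010-08-12",
                 "2010-01-01", "2010-01-30", "2010-02-05"))
)

# Sort within ID, then apply the helper to each ID separately
df <- df[order(df$id, df$dt), ]
df$keep <- ave(as.numeric(df$dt), df$id, FUN = keep_first) == 1

df[df$keep, c("id", "dt")]
```

For ID 1 the second row (19 days after the first) is dropped; for ID 2 the second row (29 days after the first) is dropped and the third (35 days after) starts a new window.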

Resources