I have this data set with 20 variables, and I want to find the growth rate of applicants per year. The data provided is from 2020-2022. How would I go about that? I tried subsetting the data but I'm stuck on how to approach it. So essentially, I want to assign each applicant to its corresponding year, count the applicants per year, and calculate the year-over-year growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year() to extract the year from your year-month-day variable, and then count the rows per year with dplyr::group_by() and dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- 1:100
# draw 100 random dates between 2017-01-01 and 2023-01-01
date <- as.Date(sample(as.numeric(as.Date('2017-01-01')):as.numeric(as.Date('2023-01-01')), 100,
replace = TRUE),
origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
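The summary above gives the count per year but stops short of the growth rate itself. A minimal sketch of that last step, assuming the growth rate you want is the year-over-year change relative to the previous year, (n - previous n) / previous n. The `yearly` data frame and the `growth_rate` column are names made up for this illustration; the counts are the 2020-2022 rows of the simulated summary above, not your real data.

```r
library(dplyr)

# Hypothetical yearly counts, same shape as the summarised df above
yearly <- data.frame(year = 2020:2022, n = c(18, 11, 23))

growth <- yearly %>%
  arrange(year) %>%                               # lag() relies on row order
  mutate(growth_rate = (n - lag(n)) / lag(n))     # NA for the first year: no previous year

growth
```

Multiply `growth_rate` by 100 if you prefer percentages. The first year is NA by construction, since there is nothing to compare it with.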
I have two data frames whose gaps get filled at different intervals.
I would like to fill both at the same interval.
Consider two data frames with the same month-day but two years apart:
library(tidyverse)
library(fpp3)
df_2020 <- tibble(month_day = as_date(c('2020-1-1','2020-2-1','2020-3-1')),
amount = c(5, 2, 1))
df_2022 <- tibble(month_day = as_date(c('2022-1-1','2022-2-1','2022-3-1')),
amount = c(5, 2, 1))
These data frames both have three rows, with the same dates, 2 years apart.
Create tsibbles with a yearweek index:
ts_2020 <- df_2020 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2022 <- df_2022 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2020
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022
#> # A tsibble: 3 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 2022-02-01 2 2022 W05
#> 3 2022-03-01 1 2022 W09
Still three rows in each tsibble
Now fill gaps:
ts_2020_filled <- ts_2020 |> fill_gaps()
ts_2022_filled <- ts_2022 |> fill_gaps()
ts_2020_filled
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022_filled
#> # A tsibble: 10 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 NA NA 2022 W01
#> 3 NA NA 2022 W02
#> 4 NA NA 2022 W03
#> 5 NA NA 2022 W04
#> 6 2022-02-01 2 2022 W05
#> 7 NA NA 2022 W06
#> 8 NA NA 2022 W07
#> 9 NA NA 2022 W08
#> 10 2022-03-01 1 2022 W09
Here is the issue:
ts_2020_filled has 4-weekly steps, and ts_2022_filled has 1-weekly steps.
This is because the two tsibbles have different intervals:
tsibble::interval(ts_2020)
#> <interval[1]>
#> [1] 4W
tsibble::interval(ts_2022)
#> <interval[1]>
#> [1] 1W
This is because the tsibbles have different steps:
ts_2020 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 4 4
ts_2022 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 5 4
Therefore, the greatest common divisors are different (4 and 1). From the manual
for as_tsibble:
regular Regular time interval (TRUE) or irregular (FALSE). The
interval is determined by the greatest common divisor of index column,
if TRUE.
Both tsibbles are
regular:
is_regular(ts_2020)
#> [1] TRUE
is_regular(ts_2022)
#> [1] TRUE
So, I would like to set the gap fill interval, so the periods are consistent.
I tried setting .full in fill_gaps and .regular in as_tsibble.
I could not find a way to set the interval of a tsibble.
Is there a way of manually setting the interval used by fill_gaps? Granted, an interval of four weeks won't work for df_2022, but an interval of one week (a common divisor of both sets of steps) would work for both.
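One possible direction, offered as a sketch rather than a confirmed answer: as far as I can tell from the tsibble documentation, build_tsibble() accepts an `interval` argument that can take an interval made with new_interval(), instead of letting the interval be inferred from the index. This assumes a tsibble version that exports both functions and that fill_gaps() honours the stored interval, so please check it against your installed version. The helper name `as_weekly_tsibble` is made up for this example.

```r
library(dplyr)
library(tsibble)

df_2020 <- tibble(month_day = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
                  amount = c(5, 2, 1))
df_2022 <- tibble(month_day = as.Date(c("2022-01-01", "2022-02-01", "2022-03-01")),
                  amount = c(5, 2, 1))

# Hypothetical helper: build a tsibble whose interval is fixed at one week
# instead of being inferred from the greatest common divisor of the index.
as_weekly_tsibble <- function(df) {
  df |>
    mutate(year_week = yearweek(month_day)) |>
    build_tsibble(index = year_week, interval = new_interval(week = 1))
}

ts_2020_weekly_filled <- df_2020 |> as_weekly_tsibble() |> fill_gaps()
ts_2022_weekly_filled <- df_2022 |> as_weekly_tsibble() |> fill_gaps()

# Both should now report the same interval
interval(ts_2020_weekly_filled)
interval(ts_2022_weekly_filled)
```

If this behaves as assumed, both filled tsibbles step in single weeks, so the 2020 one gains rows for W02 to W04 and W06 to W08.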
I want to complete a df in R when a month is missing from it. For example, I have one year of information with dates by month and day, like this:
df = data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-02","2020-04-01","2020-09-01","2020-10-01",
"2020-11-01","2020-12-01"))
When I use the function complete, I use it like this
df = df%>%
mutate(Date = as.Date(Date)) %>%
complete(Date = seq.Date(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "month"))
The problem is that the final df fills in the missing months like May, June and July, which is fine, but it also adds a row for March, because March has no entry on the first day of the month: its only date is 2020-03-02.
df = data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-01","2020-03-02","2020-04-01","2020-05-01",
"2020-06-01","2020-07-01","2020-08-01","2020-09-01",
"2020-10-01","2020-11-01","2020-12-01"))
Do you know how to complete the df only for months that have no date at all?
In my case I don't want to complete March, because March already has a date.
Thanks a lot.
You can extract year and month value from the Date and use complete on that.
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
mutate(Date = as.Date(Date),
year = year(Date),
month = month(Date)) %>%
complete(year, month = 1:12) %>%
mutate(Date = if_else(is.na(Date),
as.Date(paste(year, month, 1, sep = '-')), Date)) %>%
select(Date)
# Date
# <date>
# 1 2020-01-01
# 2 2020-02-01
# 3 2020-03-02
# 4 2020-04-01
# 5 2020-05-01
# 6 2020-06-01
# 7 2020-07-01
# 8 2020-08-01
# 9 2020-09-01
#10 2020-10-01
#11 2020-11-01
#12 2020-12-01
A possible solution would be completing only by yearmon from the zoo package, so that the actual day of the month is irrelevant.
library(dplyr)
library(zoo) # for as.yearmon
library(tidyr) # for complete
df <- data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-02","2020-04-01",
"2020-09-01","2020-10-01",
"2020-11-01","2020-12-01"),
id = 1:8)
df
#> Date id
#> 1 2020-01-01 1
#> 2 2020-02-01 2
#> 3 2020-03-02 3
#> 4 2020-04-01 4
#> 5 2020-09-01 5
#> 6 2020-10-01 6
#> 7 2020-11-01 7
#> 8 2020-12-01 8
df %>%
mutate(Date = as.Date(Date),
year_mon = as.yearmon(Date)) %>%
complete(
year_mon = seq.Date(as.Date("2020-01-01"),
as.Date("2020-12-31"),
by = "month") %>% as.yearmon()
)
#> # A tibble: 12 x 3
#> year_mon Date id
#> <yearmon> <date> <int>
#> 1 Jan 2020 2020-01-01 1
#> 2 Feb 2020 2020-02-01 2
#> 3 Mar 2020 2020-03-02 3
#> 4 Apr 2020 2020-04-01 4
#> 5 May 2020 NA NA
#> 6 Jun 2020 NA NA
#> 7 Jul 2020 NA NA
#> 8 Aug 2020 NA NA
#> 9 Sep 2020 2020-09-01 5
#> 10 Oct 2020 2020-10-01 6
#> 11 Nov 2020 2020-11-01 7
#> 12 Dec 2020 2020-12-01 8
Created on 2021-06-25 by the reprex package (v2.0.0)
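If you also want the completed rows to carry a date instead of NA, one option is to fall back to the first day of the month. A small sketch under that assumption, using a shortened made-up version of the data (January to April, with February missing); as.Date() on a yearmon value returns the first day of that month by default, and `filled` is just a name chosen for this example.

```r
library(dplyr)
library(tidyr)
library(zoo)

# Shortened, made-up data: February is missing, March starts on the 2nd
df <- data.frame(Date = as.Date(c("2020-01-01", "2020-03-02", "2020-04-01")))

filled <- df %>%
  mutate(year_mon = as.yearmon(Date)) %>%
  complete(year_mon = as.yearmon(seq(as.Date("2020-01-01"),
                                     as.Date("2020-04-01"),
                                     by = "month"))) %>%
  # keep existing dates; use the first of the month only where Date is NA
  mutate(Date = coalesce(Date, as.Date(year_mon)))

filled
```

March keeps 2020-03-02 because coalesce() only touches the rows where Date is missing.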
I am trying to create a "week" variable in my dataset of daily observations that begins with a new value (1, 2, 3, et cetera) whenever a new Monday happens. My dataset has observations beginning on April 6th, 2020, and the data are stored as Date objects in "YYYY-MM-DD" format. In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19th would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately that doesn't work for my purposes because a year does not divide evenly into 7-day weeks, so the last numbered week of the year would be only a day or two long. In other words, I would like the days of December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also do
df <- data.frame(date = as.Date("2020-04-06")+ 0:365)
library(dplyr)
library(lubridate)
df %>% group_by(d = isoyear(date), week = isoweek(date)) %>% # isoyear() keeps an ISO week that spans New Year in one group
mutate(week = cur_group_id()) %>% ungroup() %>% select(-d)
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the minimum date from each date, divide the difference by 7 and use floor to get one number for each block of 7 days. This assumes the minimum date is a Monday, as it is in your data.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
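If the first observation might not fall on a Monday, a variation is to snap every date back to its Monday first and count weeks from the earliest Monday. This is a sketch under that assumption, using lubridate::floor_date() with week_start = 1; the dates and the names `monday` and `week` are made up for the illustration.

```r
library(lubridate)

# Made-up dates; the first one is a Wednesday rather than a Monday
x <- as.Date(c("2020-04-08", "2020-04-12", "2020-04-13", "2020-12-28", "2021-01-03"))

# Round each date down to the Monday that starts its week
monday <- floor_date(x, unit = "week", week_start = 1)

# Whole weeks elapsed since the earliest Monday, starting the count at 1
week <- as.integer(difftime(monday, min(monday), units = "days")) %/% 7L + 1L
week
#[1] 1 1 2 39 39
```

December 28th, 2020 and January 3rd, 2021 land in the same week because both round down to the same Monday.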
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
as.Date("2021-01-04"),
by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(week = isoweek(date),
year = isoyear(date)) %>%
group_by(year) %>%
mutate(week2 = 1 + (week - min(week))) %>%
ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1
I have this df whose observations are monthly:
library(dplyr)
library(lubridate)
Date <- seq(from = as_date("2019-11-01"), to = as_date("2020-10-01"), by = "month")
A <- (10:21)
df <- data.frame(Date, A)
View(df)
Date A
<date> <int>
1 2019-11-01 10
2 2019-12-01 11
3 2020-01-01 12
4 2020-02-01 13
5 2020-03-01 14
6 2020-04-01 15
7 2020-05-01 16
8 2020-06-01 17
9 2020-07-01 18
10 2020-08-01 19
11 2020-09-01 20
12 2020-10-01 21
Using lag() I know how to calculate the % change Month over Month (MoM), but I haven't been able to compare a quarter with the previous quarter, i.e., the sum of 3 months compared with the sum of the previous 3 months. I tried a loop approach but it didn't work, and there should be a more efficient approach.
I appreciate it if someone can help.
We can use as.yearqtr from zoo to convert the 'Date' column to quarters, do a group-by sum, and then take the difference between the next quarter and the current one (lead) or between the current quarter and the previous one (lag).
library(dplyr)
library(zoo)
df %>%
group_by(Quarter = as.yearqtr(Date)) %>%
summarise(A = sum(A), .groups = 'drop') %>%
mutate(Diff = lead(A) - A)
-output
# A tibble: 5 x 3
# Quarter A Diff
# <yearqtr> <int> <int>
#1 2019 Q4 21 18
#2 2020 Q1 39 9
#3 2020 Q2 48 9
#4 2020 Q3 57 -36
#5 2020 Q4 21 NA
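Since the question asks for a % change against the previous quarter, here is a sketch of the lag() variant expressed as a percentage. It assumes the change is measured relative to the previous quarter's sum; `qoq`, `n_months` and `pct_change` are names made up for this example. The month count is included because the first and last quarters in this data are incomplete.

```r
library(dplyr)
library(zoo)

# Same monthly data as in the question
df <- data.frame(Date = seq(as.Date("2019-11-01"), as.Date("2020-10-01"), by = "month"),
                 A = 10:21)

qoq <- df %>%
  group_by(Quarter = as.yearqtr(Date)) %>%
  summarise(A = sum(A), n_months = n(), .groups = "drop") %>%
  # percent change of each quarter's sum against the previous quarter's sum
  mutate(pct_change = 100 * (A - lag(A)) / lag(A))

qoq
```

2019 Q4 and 2020 Q4 contain only two and one months here, so you may want to drop rows where n_months is below 3 before reading anything into their % change.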
I am trying to find a way to create a column in my dataframe that will list out occurrences of each unique combination of personID and fiscal year.
I have a dataframe set up with variables like so:
Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017
So in this example I want to create an additional column in the df above that has something like 'year' which would list year 1 for both occurrences of id 250 and year 2017, but would have year 2 for id 250 and fiscal year 2018. Like so:
Person.Id Reported.Fiscal.Year year
250 2017 1
250 2017 1
250 2018 2
300 2018 1
511 2019 1
300 2018 1
700 2017 1
I've tried the following code:
df1 <- df1 %>% arrange(Person.Id,Reported.Fiscal.Year)
df2<- df1 %>% group_by(Person.Id,Reported.Fiscal.Year) %>% mutate(year=row_number())
But this results in a data frame that looks like this (essentially counting the occurrences of each year by ID):
Person.Id Reported.Fiscal.Year year
250 2017 1
250 2017 2
250 2018 1
300 2018 1
511 2019 1
300 2018 2
700 2017 1
Here's an alternative to @Petr & @Bruno's very nice join-based solutions. This one works by building a cumulative count of unique years for each person.
library(readr)
df <- read_table("Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017")
library(dplyr)
df %>%
arrange(Person.Id, Reported.Fiscal.Year) %>%
group_by(Person.Id) %>%
mutate(year = cumsum(!duplicated(Reported.Fiscal.Year)))
#> # A tibble: 7 x 3
#> # Groups: Person.Id [4]
#> Person.Id Reported.Fiscal.Year year
#> <dbl> <dbl> <int>
#> 1 250 2017 1
#> 2 250 2017 1
#> 3 250 2018 2
#> 4 300 2018 1
#> 5 300 2018 1
#> 6 511 2019 1
#> 7 700 2017 1
Created on 2020-07-06 by the reprex package (v0.3.0)
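A related sketch, in case you would rather not sort the data first: dplyr::dense_rank() numbers the distinct fiscal years within each person directly, so the original row order is kept. This is an alternative technique, not the cumulative-count approach above; `ranked` is a name chosen for this example.

```r
library(dplyr)

# Same example data as in the question, in its original row order
df <- data.frame(
  Person.Id = c(250, 250, 250, 300, 511, 300, 700),
  Reported.Fiscal.Year = c(2017, 2017, 2018, 2018, 2019, 2018, 2017)
)

ranked <- df %>%
  group_by(Person.Id) %>%
  # ties share a rank and ranks have no gaps, so each distinct year gets 1, 2, 3, ...
  mutate(year = dense_rank(Reported.Fiscal.Year)) %>%
  ungroup()

ranked
```

Note that this numbers the years a person actually appears in, so a person with reports in 2017 and 2019 only would get 1 and 2, not 1 and 3.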
Welcome to SO!
Had to summarise your data before, maybe someone can provide a simpler solution
library(tidyverse)
df_example <- read_table("Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017")
df_example_summary <- df_example %>%
group_by(Person.Id,Reported.Fiscal.Year) %>%
summarise(number_reports = n(), .groups = "drop_last") %>%
mutate(Year = row_number()) %>%
ungroup()
df_example %>%
left_join(df_example_summary)
#> Joining, by = c("Person.Id", "Reported.Fiscal.Year")
#> # A tibble: 7 x 4
#> Person.Id Reported.Fiscal.Year number_reports Year
#> <dbl> <dbl> <int> <int>
#> 1 250 2017 2 1
#> 2 250 2017 2 1
#> 3 250 2018 1 2
#> 4 300 2018 2 1
#> 5 511 2019 1 1
#> 6 300 2018 2 1
#> 7 700 2017 1 1
Created on 2020-07-06 by the reprex package (v0.3.0)
If I understand correctly, you want to enumerate the occurrences of IDs across the years?
I have used pieces of your code; you were close. You only need to select the distinct rows before numbering them. The steps are:
arrange() both columns,
group_by() IDs to count fiscal years for each ID,
choose distinct() rows, i.e. unique combinations of ID and fiscal year,
mutate() with row_number() as you did,
and join that to the original dataset.
See comments inside the code:
library(dplyr)
# your example data
df <- read.table(header = TRUE, text = "
Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017
")
# 1. arrange by ids and years (this is what you did)
# 2. group by ids to be able to count different fiscal years
# 3. choose only unique combinations of ids and fiscal years
# 4. use row numbers (as you did)
# 5. merge new column to original data
df %>%
arrange(Person.Id, Reported.Fiscal.Year) %>%
group_by(Person.Id) %>%
distinct() %>%
mutate(year = row_number()) %>%
inner_join(df, .)
#> Joining, by = c("Person.Id", "Reported.Fiscal.Year")
#> Person.Id Reported.Fiscal.Year year
#> 1 250 2017 1
#> 2 250 2017 1
#> 3 250 2018 2
#> 4 300 2018 1
#> 5 511 2019 1
#> 6 300 2018 1
#> 7 700 2017 1
Created on 2020-07-06 by the reprex package (v0.3.0)