I have a data set with 20 variables, and I want to find the growth rate of applicants per year. The data covers 2020-2022. How would I go about that? I tried subsetting the data, but I'm stuck on how to approach it. Essentially, I want to assign each applicant to its corresponding year, count the applicants per year, and calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year() to extract the year from your year-month-day variable, then count the rows per year with dplyr::group_by() and dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- 1:100
# draw 100 random dates between 2017-01-01 and 2023-01-01
date <- as.Date(sample(as.numeric(as.Date('2017-01-01')):as.numeric(as.Date('2023-01-01')), 100,
replace = TRUE),
origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
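To get from the yearly counts to a growth rate, one possibility is to compare each year with the previous one using dplyr::lag(). This is a minimal sketch, assuming growth means the percent change in the number of applicants relative to the previous year. The counts data frame simply retypes the summary shown above, and the names counts, growth and growth_rate are made up for illustration:

```r
library(dplyr)

# Yearly counts retyped from the summary above (illustrative, not the asker's real data)
counts <- data.frame(year = 2017:2022, n = c(17, 14, 17, 18, 11, 23))

growth <- counts %>%
  arrange(year) %>%                                   # lag() relies on row order
  mutate(growth_rate = (n - lag(n)) / lag(n) * 100)   # percent change vs. the previous year

growth
```

The first year has no previous year to compare with, so its growth_rate is NA.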
I want to insert the missing weeks for each household_id and channel combination so that the weeks form a continuous sequence. For the inserted rows, duration should be 0 and the other columns should keep their values.
Below is the dataset.
For example, for household_id 100 and channel A, the missing weeks are 37, 39 and 41. I want these weeks inserted with duration 0.
But for household_id 101 and channel C, two years are involved, 2019 and 2020. The missing weeks are week 52 of 2019 and week 3 of 2020.
What I tried, using the complete function, is below:
library(tidyr)
library(dplyr)
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
For the Household_id 100 and channel A combination it worked fine. All weeks are now in sequence.
But for Household_id 101 and channel C it didn't work. After inserting week 52 of 2019, the sequence should continue with week 1 of 2020.
I tried deriving dates from the week and year columns, thinking that exact dates might work,
but I was not able to get that to work either.
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
Any help is greatly appreciated!
Here is the sample dataset with code:
library(dplyr)
library(tidyr)
data <- data.frame(Household_id = c(100,100,100,100,101,101,101,101,102,102),
channel = c("A","A","A","A","C","C","C","C","D","D"),
duration = c(12,34,567,67,98,23,56,89,73,76),
mala_fide_week = c(42,42,42,42,5,5,5,5,30,30),
mala_fide_year =c(2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
week =c(36,38,40,42,51,1,2,4,38,39),
year = c(2021,2021,2021,2021,2019,2020,2020,2020,2021,2021))
# imputing missing weeks and duration = 0 for each household/channel combination
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
# Getting Date from week/year if it may help
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
You can try building a date from the year and week, completing the weekly date sequence, and converting back to weeks. I used lubridate for ease.
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
group_by(Household_id,channel) %>%
mutate(new = paste0(year, '-01-01'),
new = ymd(new) + 7 * week) %>%
complete(new = seq(min(new),max(new), by = 'week'), fill = list(duration=0)) %>%
mutate(year = replace(year, is.na(year), format(new, '%Y')[is.na(year)]),
week = week(new)) %>%
select(-new)
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 100 A 12 42 2021 37 2021
2 100 A 0 NA NA 38 2021
3 100 A 34 42 2021 39 2021
4 100 A 0 NA NA 40 2021
5 100 A 567 42 2021 41 2021
6 100 A 0 NA NA 42 2021
7 100 A 67 42 2021 43 2021
8 101 C 98 5 2020 52 2019
9 101 C 0 NA NA 53 2019
10 101 C 0 NA NA 1 2020
11 101 C 0 NA NA 2 2020
12 101 C 0 NA NA 3 2020
13 101 C 0 NA NA 4 2020
14 102 D 73 30 2021 39 2021
15 102 D 76 30 2021 40 2021
16 101 C 23 5 2020 2 2020
17 101 C 56 5 2020 3 2020
18 101 C 89 5 2020 5 2020
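In the output above, the week numbers come out one higher than in the input (36 becomes 37), and household 101 ends up with more rows than expected. An alternative sketch follows, assuming the week column uses ISO 8601 numbering (weeks start on Monday, and 4 January always falls in week 1). If the weeks were numbered differently, the conversion would need to change. The helper iso_week_start() is hypothetical, not part of any package, and the mala_fide columns are left out to keep the example short:

```r
library(dplyr)
library(tidyr)
library(lubridate)

data <- data.frame(
  Household_id = c(100, 100, 100, 100, 101, 101, 101, 101, 102, 102),
  channel = c("A", "A", "A", "A", "C", "C", "C", "C", "D", "D"),
  duration = c(12, 34, 567, 67, 98, 23, 56, 89, 73, 76),
  week = c(36, 38, 40, 42, 51, 1, 2, 4, 38, 39),
  year = c(2021, 2021, 2021, 2021, 2019, 2020, 2020, 2020, 2021, 2021)
)

# Hypothetical helper: Monday of ISO week `week` in ISO year `year`.
# 4 January always lies in ISO week 1, so step back to that week's Monday first.
iso_week_start <- function(year, week) {
  jan4 <- as.Date(paste0(year, "-01-04"))
  jan4 - (wday(jan4, week_start = 1) - 1) + 7 * (week - 1)
}

filled <- data %>%
  mutate(week_start = iso_week_start(year, week)) %>%
  group_by(Household_id, channel) %>%
  # fill the gaps in the weekly date sequence per household/channel
  complete(week_start = seq(min(week_start), max(week_start), by = "week"),
           fill = list(duration = 0)) %>%
  # recompute year and week from the date so inserted rows get them too
  mutate(year = isoyear(week_start), week = isoweek(week_start)) %>%
  ungroup() %>%
  select(-week_start)

filled
```

Because the sequence is built on real dates, the jump from week 52 of 2019 to week 1 of 2020 needs no special handling. Columns that should keep their value in the inserted rows could be carried over with tidyr::fill() before ungrouping.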
I am using dplyr for most of my data wrangling in R, yet I am having a hard time achieving this particular effect, and I can't seem to find the answer by googling either.
Assume I have data like this. I want to sort the person-grouped data by each person's cash value from the year 2021, keeping each person's rows together. Below I show the outcome I wish to achieve. If I only had the 2021 values I could simply use ... %>% arrange(desc(cash)), but I am not sure how to proceed from here.
year person cash
0 2020 personone 29
1 2021 personone 40
2 2020 persontwo 17
3 2021 persontwo 13
4 2020 personthree 62
5 2021 personthree 55
And what I want to achieve is to sort this data in descending order based on values from the year 2021. So that the data should look like:
year person cash
0 2020 personthree 62
1 2021 personthree 55
2 2020 personone 29
3 2021 personone 40
4 2020 persontwo 17
5 2021 persontwo 13
One approach using a join:
df %>%
filter(year == 2021) %>%
# group_by(person) %>% slice(2) %>% ungroup() %>% #each person's yr2
arrange(-cash) %>%
select(-cash, -year) %>%
left_join(df, by = "person")
Output:
person year cash
1 personthree 2020 62
2 personthree 2021 55
3 personone 2020 29
4 personone 2021 40
5 persontwo 2020 17
6 persontwo 2021 13
Another option:
library(dplyr)
dat %>%
group_by(person) %>%
mutate(maxcash = max(cash)) %>%
arrange(desc(maxcash)) %>%
ungroup()
# # A tibble: 6 x 4
# year person cash maxcash
# <int> <chr> <int> <int>
# 1 2020 personthree 62 62
# 2 2021 personthree 55 62
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 17
# 6 2021 persontwo 13 17
Or a one-liner, using base R as a helper:
dat %>%
arrange(-ave(cash, person, FUN = max))
# year person cash
# 4 2020 personthree 62
# 5 2021 personthree 55
# 0 2020 personone 29
# 1 2021 personone 40
# 2 2020 persontwo 17
# 3 2021 persontwo 13
Edit:
If instead of max you mean "always 2021's data", then:
dat %>%
group_by(person) %>%
mutate(cash2021 = cash[year == 2021]) %>%
arrange(desc(cash2021)) %>%
ungroup()
# # A tibble: 6 x 4
# year person cash cash2021
# <int> <chr> <int> <int>
# 1 2020 personthree 62 55
# 2 2021 personthree 55 55
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 13
# 6 2021 persontwo 13 13
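One caveat with cash[year == 2021]: it returns a zero-length vector for a person who has no 2021 row, which mutate() cannot recycle. A hedged variant, assuming such persons should simply be sorted last, looks up the 2021 value with match() so that a missing year yields NA instead. The extra person "personfour" is invented here purely to show that case:

```r
library(dplyr)

dat <- data.frame(
  year = c(2020, 2021, 2020, 2021, 2020, 2021, 2020),
  person = c("personone", "personone", "persontwo", "persontwo",
             "personthree", "personthree", "personfour"),  # personfour is hypothetical, no 2021 row
  cash = c(29, 40, 17, 13, 62, 55, 99)
)

sorted <- dat %>%
  group_by(person) %>%
  # match() gives NA when 2021 is absent, so cash2021 becomes NA rather than an error
  mutate(cash2021 = cash[match(2021, year)]) %>%
  ungroup() %>%
  # arrange() always puts NA last; person and year keep each person's rows together
  arrange(desc(cash2021), person, year)

sorted
```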
I am trying to create a "week" variable in my dataset of daily observations that takes a new value (1, 2, 3, et cetera) whenever a new Monday begins. My dataset has observations beginning on April 6th, 2020, and the dates are stored as class Date in "YYYY-MM-DD" format. In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19th would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately that doesn't work for my purposes because a year does not contain a whole number of weeks, so the last numbered week of the year would only be a few days long. In other words, I would like the days from December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also do
df <- data.frame(date = as.Date("2020-04-06")+ 0:365)
library(dplyr)
library(lubridate)
# isoyear() keeps 2021-01-01..03 with ISO week 53 of 2020, so group ids stay in calendar order
df %>% group_by(y = isoyear(date), w = isoweek(date)) %>%
mutate(week = cur_group_id()) %>% ungroup() %>% select(-y, -w)
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the dates with the minimum date, divide the difference by 7 and use floor to get 1 number for each 7 days.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
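The subtraction above counts from the earliest date, so it lines up with Mondays only when that date is itself a Monday (as 2020-04-06 is). A small sketch for the case where the data might start mid-week, assuming week 1 should be the Monday-to-Sunday week containing the first observation; the names first_monday and week_index are made up for illustration:

```r
library(lubridate)

# Example dates that start on a Wednesday rather than a Monday
x <- as.Date(c("2020-04-08", "2020-04-12", "2020-04-13", "2020-12-28", "2021-01-03"))

# Monday on or before the earliest date
first_monday <- floor_date(min(x), unit = "week", week_start = 1)

# Whole weeks elapsed since that Monday, numbered from 1
week_index <- as.integer(difftime(x, first_monday, units = "days")) %/% 7L + 1L
week_index
```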
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
as.Date("2021-01-04"),
by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(week = isoweek(date),
year = isoyear(date)) %>%
group_by(year) %>%
mutate(week2 = 1 + (week - min(week))) %>%
ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1
I have this data frame of monthly observations:
library(dplyr)
library(lubridate)
Date <- seq(from = as_date("2019-11-01"), to = as_date("2020-10-01"), by = "month")
A <- (10:21)
df <- data.frame(Date, A)
df
Date A
<date> <int>
1 2019-11-01 10
2 2019-12-01 11
3 2020-01-01 12
4 2020-02-01 13
5 2020-03-01 14
6 2020-04-01 15
7 2020-05-01 16
8 2020-06-01 17
9 2020-07-01 18
10 2020-08-01 19
11 2020-09-01 20
12 2020-10-01 21
Using lag() I know how to calculate the % change month over month (MoM), but I haven't been able to compare a quarter with the previous quarter, i.e. the sum of 3 months compared with the sum of the previous 3 months. I tried a loop approach but it didn't work, and there should be a more efficient approach.
I would appreciate it if someone could help.
We can use as.yearqtr from zoo to convert the 'Date' column to quarters, sum within each quarter, and then take the difference between the current quarter and the next one (lead) or the previous one (lag).
library(dplyr)
library(zoo)
df %>%
group_by(Quarter = as.yearqtr(Date)) %>%
summarise(A = sum(A), .groups = 'drop') %>%
mutate(Diff = lead(A) - A)
-output
# A tibble: 5 x 3
# Quarter A Diff
# <yearqtr> <int> <int>
#1 2019 Q4 21 18
#2 2020 Q1 39 9
#3 2020 Q2 48 9
#4 2020 Q3 57 -36
#5 2020 Q4 21 NA
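Since the question asks for a comparison with the previous quarter, a lag()-based variant may be closer to what is wanted. This is a sketch on the same data, assuming the percent change is measured relative to the previous quarter's sum; the names qoq and PctChange are made up for illustration:

```r
library(dplyr)
library(zoo)

Date <- seq(from = as.Date("2019-11-01"), to = as.Date("2020-10-01"), by = "month")
df <- data.frame(Date, A = 10:21)

qoq <- df %>%
  group_by(Quarter = as.yearqtr(Date)) %>%
  summarise(A = sum(A), .groups = "drop") %>%
  mutate(Diff = A - lag(A),                 # change vs. the previous quarter
         PctChange = 100 * Diff / lag(A))   # the same change as a percentage

qoq
```

Note that the first and last quarters in this sample cover only two months and one month respectively, so their sums are not directly comparable with the full quarters.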