I have a data set with 20 variables, and I want to find the growth rate of applicants per year. The data covers 2020-2022. How would I go about that? I tried subsetting the data but I'm stuck on how to approach it. Essentially, I want to assign each applicant to its corresponding year, count the applicants per year, and calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year to split your year-month-day variable into years and then dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- seq_len(100)
# Draw 100 random dates between 2017-01-01 and 2023-01-01
date <- as.Date(sample(as.numeric(as.Date('2017-01-01')):as.numeric(as.Date('2023-01-01')), 100,
                       replace = TRUE),
                origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
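The summarized counts still need to be turned into a growth rate. A minimal sketch, assuming "growth rate" means the year-over-year relative change in the count, and reusing the counts printed above (the column name growth_rate is my own choice):

```r
library(dplyr)

# Yearly counts copied from the summarize() output above
counts <- data.frame(
  year = 2017:2022,
  n = c(17, 14, 17, 18, 11, 23)
)

growth <- counts %>%
  arrange(year) %>%                            # make sure the years are in order
  mutate(growth_rate = (n - lag(n)) / lag(n))  # (this year - last year) / last year

growth
```

The first year has no predecessor, so its growth_rate is NA. Multiply by 100 if you want a percentage.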
I have a data frame with a column of date objects. How can I get a new column with the month number?
Example: January -> 1, February -> 2, ...
I also need a new column with the day of the month.
Example: 2022-01-01 -> 1, 2022-01-02 -> 2
You can use the following code:
df = data.frame(date = as.Date(c("2022-01-01", "2022-01-02")))
df[, "month"] <- format(df[,"date"], "%m")
df
Output:
date month
1 2022-01-01 01
2 2022-01-02 01
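Note that format() returns character strings such as "01". If you need plain numbers (1, 2, ...) and the day of the month as well, a small base R sketch could look like this (the third date is an extra example row I added):

```r
df <- data.frame(date = as.Date(c("2022-01-01", "2022-01-02", "2022-02-15")))

# format() gives zero-padded strings; as.integer() turns "01" into 1
df$month <- as.integer(format(df$date, "%m"))
df$day   <- as.integer(format(df$date, "%d"))

df
```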
I'm sure there are date object-specific solutions in R, but I've never used them. A base R solution:
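splitDates <- function(date) {
  # Split "YYYY-MM-DD" into its parts; as.character() lets Date objects work too
  parts <- unlist(strsplit(as.character(date), '-'))
  month <- gsub('^0+', '', parts[2])  # second field is the month
  day <- gsub('^0+', '', parts[3])    # third field is the day
  return(list(day = day,
              month = month))
}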
require(tidyverse)
require(lubridate)
Example data:
# A tibble: 13 × 1
date
<date>
1 2022-04-28
2 2022-05-28
3 2022-06-28
4 2022-07-28
5 2022-08-28
6 2022-09-28
7 2022-10-28
8 2022-11-28
9 2022-12-28
10 2023-01-28
11 2023-02-28
12 2023-03-28
13 2023-04-28
A possible solution:
df %>%
mutate(
month = month(date),
month_name = month(date, label = TRUE, abbr = FALSE),
day = day(date),
week = week(date)
)
The output:
# A tibble: 13 × 5
date month month_name day week
<date> <dbl> <ord> <int> <dbl>
1 2022-04-28 4 April 28 17
2 2022-05-28 5 May 28 22
3 2022-06-28 6 June 28 26
4 2022-07-28 7 July 28 30
5 2022-08-28 8 August 28 35
6 2022-09-28 9 September 28 39
7 2022-10-28 10 October 28 43
8 2022-11-28 11 November 28 48
9 2022-12-28 12 December 28 52
10 2023-01-28 1 January 28 4
11 2023-02-28 2 February 28 9
12 2023-03-28 3 March 28 13
13 2023-04-28 4 April 28 17
I want to insert the missing weeks for each Household_id and channel combination so that the weeks form a continuous sequence. For each inserted row, duration should be 0 and the other columns should keep their values.
Below is the dataset.
For example, for Household_id 100 and channel A the missing weeks are 37, 39 and 41. I want these weeks inserted with duration 0.
For Household_id 101 and channel C, two years are involved, 2019 and 2020. The missing weeks are week 52 of 2019 and week 3 of 2020.
This is what I tried, using the complete function:
library(tidyr)
library(dplyr)
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
For the Household_id 100 and channel A combination it worked fine; all weeks are now in sequence.
But for Household_id 101 and channel C it didn't work. After inserting week 52 of 2019, the sequence should continue with week 1 of 2020.
I also tried deriving dates from the week and year columns, thinking that exact dates might work,
but I couldn't get that to work either.
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
Any help is greatly appreciated!
Here is the sample dataset with code:
library(dplyr)
library(tidyr)
data <- data.frame(Household_id = c(100,100,100,100,101,101,101,101,102,102),
channel = c("A","A","A","A","C","C","C","C","D","D"),
duration = c(12,34,567,67,98,23,56,89,73,76),
mala_fide_week = c(42,42,42,42,5,5,5,5,30,30),
mala_fide_year =c(2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
week =c(36,38,40,42,51,1,2,4,38,39),
year = c(2021,2021,2021,2021,2019,2020,2020,2020,2021,2021))
# imputing missing weeks and duration = 0 for each household channel combination
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
# Getting Date from week/year if it may help
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
You can try defining the dates, making the sequence and converting to weeks. I used lubridate for ease.
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
group_by(Household_id,channel) %>%
mutate(new = paste0(year, '-01-01'),
new = ymd(new) + 7 * week) %>%
complete(new = seq(min(new),max(new), by = 'week'), fill = list(duration=0)) %>%
mutate(year = replace(year, is.na(year), format(new, '%Y')[is.na(year)]),
week = week(new)) %>%
select(-new)
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 100 A 12 42 2021 37 2021
2 100 A 0 NA NA 38 2021
3 100 A 34 42 2021 39 2021
4 100 A 0 NA NA 40 2021
5 100 A 567 42 2021 41 2021
6 100 A 0 NA NA 42 2021
7 100 A 67 42 2021 43 2021
8 101 C 98 5 2020 52 2019
9 101 C 0 NA NA 53 2019
10 101 C 0 NA NA 1 2020
11 101 C 0 NA NA 2 2020
12 101 C 0 NA NA 3 2020
13 101 C 0 NA NA 4 2020
14 102 D 73 30 2021 39 2021
15 102 D 76 30 2021 40 2021
16 101 C 23 5 2020 2 2020
17 101 C 56 5 2020 3 2020
18 101 C 89 5 2020 5 2020
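In the output above the weeks come out shifted by one (37 instead of 36), and the 2020 rows for household 101 are not merged into the completed sequence, apparently because 1 January falls on different weekdays in 2019 and 2020. A minimal sketch that avoids dates altogether, assuming every year has exactly 52 weeks (this is my simplification, and it ignores years with a week 53), is to build a running week index, complete that, and convert back. The name idx is hypothetical and the data is a reduced version of the sample:

```r
library(dplyr)
library(tidyr)

# Reduced version of the sample data (two household/channel combinations)
data <- data.frame(
  Household_id = c(100, 100, 101, 101, 101, 101),
  channel      = c("A", "A", "C", "C", "C", "C"),
  duration     = c(12, 34, 98, 23, 56, 89),
  week         = c(36, 38, 51, 1, 2, 4),
  year         = c(2021, 2021, 2019, 2020, 2020, 2020)
)

filled <- data %>%
  # Running week index; ASSUMES 52 weeks in every year
  mutate(idx = year * 52 + (week - 1)) %>%
  group_by(Household_id, channel) %>%
  # Insert every missing index between the first and last week of each group
  complete(idx = seq(min(idx), max(idx)), fill = list(duration = 0)) %>%
  ungroup() %>%
  # Convert the index back to year and week
  mutate(year = idx %/% 52, week = idx %% 52 + 1) %>%
  select(-idx)

filled
```

With this index, week 52 of 2019 is followed directly by week 1 of 2020. Columns that are constant per group, such as mala_fide_week, could then be filled with tidyr::fill() if needed.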
I am using dplyr for most of my data wrangling in R. Yet I am having a hard time achieving this particular effect, and I can't seem to find the answer by googling either.
Assume I have data like this. I want to sort the person-grouped data by each person's cash value from 2021, keeping each person's rows together. The outcome I wish to achieve is shown further below. If I only had the 2021 values I could simply use ... %>% arrange(desc(cash)), but I am not sure how to proceed from here.
year person cash
0 2020 personone 29
1 2021 personone 40
2 2020 persontwo 17
3 2021 persontwo 13
4 2020 personthree 62
5 2021 personthree 55
The data should be sorted in descending order by the values from the year 2021, so that it looks like:
year person cash
0 2020 personthree 62
1 2021 personthree 55
2 2020 personone 29
3 2021 personone 40
4 2020 persontwo 17
5 2021 persontwo 13
One approach using a join:
df %>%
filter(year == 2021) %>%
# group_by(person) %>% slice(2) %>% ungroup() %>% #each person's yr2
arrange(-cash) %>%
select(-cash, -year) %>%
left_join(df)
Output:
person year cash
1 personthree 2020 62
2 personthree 2021 55
3 personone 2020 29
4 personone 2021 40
5 persontwo 2020 17
6 persontwo 2021 13
Another option:
library(dplyr)
dat %>%
group_by(person) %>%
mutate(maxcash = max(cash)) %>%
arrange(desc(maxcash)) %>%
ungroup()
# # A tibble: 6 x 4
# year person cash maxcash
# <int> <chr> <int> <int>
# 1 2020 personthree 62 62
# 2 2021 personthree 55 62
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 17
# 6 2021 persontwo 13 17
Or a one-liner, using base R as a helper:
dat %>%
arrange(-ave(cash, person, FUN = max))
# year person cash
# 4 2020 personthree 62
# 5 2021 personthree 55
# 0 2020 personone 29
# 1 2021 personone 40
# 2 2020 persontwo 17
# 3 2021 persontwo 13
Edit:
If instead of max you mean "always 2021's data", then:
dat %>%
group_by(person) %>%
mutate(cash2021 = cash[year == 2021]) %>%
arrange(desc(cash2021)) %>%
ungroup()
# # A tibble: 6 x 4
# year person cash cash2021
# <int> <chr> <int> <int>
# 1 2020 personthree 62 55
# 2 2021 personthree 55 55
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 13
# 6 2021 persontwo 13 13
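One caveat with cash[year == 2021]: it would fail for a person who has no 2021 row, because the subset is empty. A hedged variant, assuming at most one 2021 row per person, uses match() so that such a person gets NA and is sorted last. The row for personfour is a hypothetical addition to illustrate this:

```r
library(dplyr)

dat <- data.frame(
  year   = c(2020, 2021, 2020, 2021, 2020),
  person = c("personone", "personone", "persontwo", "persontwo", "personfour"),
  cash   = c(29, 40, 17, 13, 5)   # personfour (hypothetical) has no 2021 row
)

sorted <- dat %>%
  group_by(person) %>%
  # match() returns NA when 2021 is absent, so cash2021 becomes NA instead of erroring
  mutate(cash2021 = cash[match(2021, year)]) %>%
  ungroup() %>%
  arrange(desc(cash2021), person, year)   # NA values are placed last

sorted
```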
I have this df whose observations are monthly:
library(dplyr)
library(lubridate)
Date <- seq(from = as_date("2019-11-01"), to = as_date("2020-10-01"), by = "month")
A <- (10:21)
df <- data.frame(Date, A)
View(df)
Date A
<date> <int>
1 2019-11-01 10
2 2019-12-01 11
3 2020-01-01 12
4 2020-02-01 13
5 2020-03-01 14
6 2020-04-01 15
7 2020-05-01 16
8 2020-06-01 17
9 2020-07-01 18
10 2020-08-01 19
11 2020-09-01 20
12 2020-10-01 21
Using lag() I know how to calculate the % change month over month (MoM), but I haven't been able to compare a quarter with the previous quarter, i.e., the sum of 3 months compared with the sum of the previous 3 months. I tried a loop approach but it didn't work, and there should be a more efficient approach.
I appreciate it if someone can help.
We can use as.yearqtr from zoo to convert the 'Date' column to quarters, sum by quarter, and then take the difference between the next (lead) and the current value, or between the current and the previous (lag) value.
library(dplyr)
library(zoo)
df %>%
group_by(Quarter = as.yearqtr(Date)) %>%
summarise(A = sum(A), .groups = 'drop') %>%
mutate(Diff = lead(A) - A)
-output
# A tibble: 5 x 3
# Quarter A Diff
# <yearqtr> <int> <int>
#1 2019 Q4 21 18
#2 2020 Q1 39 9
#3 2020 Q2 48 9
#4 2020 Q3 57 -36
#5 2020 Q4 21 NA
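Since the question asks for a % change against the previous quarter, here is a minimal sketch of that variant. It assumes base R's quarters() for the quarter label instead of zoo, and the column name PctChange is my own:

```r
library(dplyr)

Date <- seq(from = as.Date("2019-11-01"), to = as.Date("2020-10-01"), by = "month")
df <- data.frame(Date, A = 10:21)

qoq <- df %>%
  # Labels like "2019 Q4" sort chronologically
  group_by(Quarter = paste(format(Date, "%Y"), quarters(Date))) %>%
  summarise(A = sum(A), .groups = "drop") %>%
  # Percent change relative to the previous quarter
  mutate(PctChange = (A - lag(A)) / lag(A) * 100)

qoq
```

Keep in mind that the first and last quarters in this data are incomplete (two months and one month), so their sums, and the changes computed from them, are not directly comparable with the full quarters.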