How to create a column with numbered instance of occurrence by ID and year - r

I am trying to find a way to create a column in my dataframe that will list out occurrences of each unique combination of personID and fiscal year.
I have a dataframe set up with variables like so:
Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017
So in this example I want to create an additional column in the df above that has something like 'year' which would list year 1 for both occurrences of id 250 and year 2017, but would have year 2 for id 250 and fiscal year 2018. Like so:
Person.Id Reported.Fiscal.Year year
250 2017 1
250 2017 1
250 2018 2
300 2018 1
511 2019 1
300 2018 1
700 2017 1
I've tried the following code:
df1 <- df1 %>% arrange(Person.Id,Reported.Fiscal.Year)
df2<- df1 %>% group_by(Person.Id,Reported.Fiscal.Year) %>% mutate(year=row_number())
But this results in a data frame that looks like this (essentially counting the occurrences of each year by ID):
Person.Id Reported.Fiscal.Year year
250 2017 1
250 2017 2
250 2018 1
300 2018 1
511 2019 1
300 2018 2
700 2017 1

Here's an alternative to @Petr & @Bruno's very nice join-based solutions. This one works by building a cumulative count of unique years for each person.
library(readr)
df <- read_table("Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017")
library(dplyr)
df %>%
arrange(Person.Id, Reported.Fiscal.Year) %>%
group_by(Person.Id) %>%
mutate(year = cumsum(!duplicated(Reported.Fiscal.Year)))
#> # A tibble: 7 x 3
#> # Groups: Person.Id [4]
#> Person.Id Reported.Fiscal.Year year
#> <dbl> <dbl> <int>
#> 1 250 2017 1
#> 2 250 2017 1
#> 3 250 2018 2
#> 4 300 2018 1
#> 5 300 2018 1
#> 6 511 2019 1
#> 7 700 2017 1
Created on 2020-07-06 by the reprex package (v0.3.0)
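A further variant, offered only as a sketch: dplyr's dense_rank() ranks the distinct fiscal years within each person, so the rows do not need to be sorted first and the original row order is kept. The data frame below is retyped from the question; the approach assumes that "year 1" should mean the earliest fiscal year for that person.

```r
library(dplyr)

# Example data retyped from the question
df <- data.frame(
  Person.Id = c(250, 250, 250, 300, 511, 300, 700),
  Reported.Fiscal.Year = c(2017, 2017, 2018, 2018, 2019, 2018, 2017)
)

# Rank the distinct fiscal years within each person;
# ties (repeated years) share the same rank and no ranks are skipped
res <- df %>%
  group_by(Person.Id) %>%
  mutate(year = dense_rank(Reported.Fiscal.Year)) %>%
  ungroup()

res$year
#> [1] 1 1 2 1 1 1 1
```

Because nothing is rearranged, the second record for id 300 stays in its original position (row 6).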

Welcome to SO!
I had to summarise your data first; maybe someone can provide a simpler solution.
library(tidyverse)
df_example <- read_table("Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017")
df_example_summary <- df_example %>%
group_by(Person.Id,Reported.Fiscal.Year) %>%
summarise(number_reports = n(), .groups = "drop_last") %>%
mutate(Year = row_number()) %>%
ungroup()
df_example %>%
left_join(df_example_summary)
#> Joining, by = c("Person.Id", "Reported.Fiscal.Year")
#> # A tibble: 7 x 4
#> Person.Id Reported.Fiscal.Year number_reports Year
#> <dbl> <dbl> <int> <int>
#> 1 250 2017 2 1
#> 2 250 2017 2 1
#> 3 250 2018 1 2
#> 4 300 2018 2 1
#> 5 511 2019 1 1
#> 6 300 2018 2 1
#> 7 700 2017 1 1
Created on 2020-07-06 by the reprex package (v0.3.0)

If I understand correctly, you want to enumerate the occurrences of IDs across the years?
I have used pieces of your code; you were close. You only need to choose distinct rows before counting the occurrences:
arrange() both columns,
group_by() IDs to count fiscal years for each ID,
choose distinct() rows, i.e. unique combinations of ID and fiscal year,
mutate() with row_number() as you did,
and join that to the original dataset.
See comments inside the code:
library(dplyr)
# your example data
df <- read.table(header = TRUE, text = "
Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017
")
# 1. arrange by ids and years (this is what you did)
# 2. group by ids to be able to count different fiscal years
# 3. choose only unique combinations of ids and fiscal years
# 4. use row numbers (as you did)
# 5. merge new column to original data
df %>%
arrange(Person.Id, Reported.Fiscal.Year) %>%
group_by(Person.Id) %>%
distinct() %>%
mutate(year = row_number()) %>%
inner_join(df, .)
#> Joining, by = c("Person.Id", "Reported.Fiscal.Year")
#> Person.Id Reported.Fiscal.Year year
#> 1 250 2017 1
#> 2 250 2017 1
#> 3 250 2018 2
#> 4 300 2018 1
#> 5 511 2019 1
#> 6 300 2018 1
#> 7 700 2017 1
Created on 2020-07-06 by the reprex package (v0.3.0)

Related

how to find the growth rate of applicants per year

I have this data set with 20 variables, and I want to find the growth rate of applicants per year. The data provided is from 2020-2022. How would I go about that? I tried subsetting the data but I'm stuck on how to approach it. Essentially, I want to assign each applicant to its corresponding year and calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year() to extract the year from your year-month-day variable and then count the applicants per year with dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- seq(1:100)
date <- as.Date(sample( as.numeric(as.Date('2017-01-01') ): as.numeric(as.Date('2023-01-01') ), 100,
replace = T),
origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
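The counts above stop short of the growth rate itself. A minimal base R sketch of that last step, assuming "growth rate" means the percentage change from the previous year, could look like this. The counts are retyped from the output above, and the column name growth_pct is only an illustrative choice.

```r
# Yearly applicant counts, retyped from the summarised output above
yearly <- data.frame(
  year = 2017:2022,
  n = c(17, 14, 17, 18, 11, 23)
)

# Percentage change relative to the previous year;
# the first year has no predecessor, so its growth is NA
yearly$growth_pct <- c(NA, diff(yearly$n) / head(yearly$n, -1) * 100)

round(yearly$growth_pct, 1)
#> [1]    NA -17.6  21.4   5.9 -38.9 109.1
```

This assumes the rows are sorted by year and that no year is missing; a gap in the years would make the "previous row" a different period than the previous year.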

How to take the mean of two subsequent rows iteratively thereby reducing the number of rows?

I have a tibble like so:
library(dplyr)
set.seed(1)
my_tib <- tibble(identifier = rep(letters[1:3], each = 4),
year = rep(seq(2005, 2020, 5), 3),
value = rnorm(12, mean = 1000, 100) %>% round()
)
my_tib
# A tibble: 12 × 3
identifier year value
<chr> <dbl> <dbl>
1 a 2005 937
2 a 2010 1018
3 a 2015 916
4 a 2020 1160
5 b 2005 1033
6 b 2010 918
7 b 2015 1049
8 b 2020 1074
9 c 2005 1058
10 c 2010 969
11 c 2015 1151
12 c 2020 1039
Now I'd like to shrink down my tibble by taking the mean value for two years each, creating a new column for the year bracket. For example, I'd like to take the mean of 937 and 1018 (977.5) for the new year_bracket 2005-2010.
I'd like to repeat this for all years and all identifiers.
So the first new 5 rows of my tibble look like this:
head(my_new_tib, 5)
# A tibble: 9 × 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005-2010 977.5
2 a 2010-2015 967
3 a 2015-2020 1038
4 b 2005-2010 975.5
5 b 2010-2015 983.5
Ideally, I'm looking for a piped dplyr solution but I'm also curious regarding other solutions.
Using dplyr:
library(dplyr)
my_tib |>
group_by(identifier) |>
mutate(value = (value + lag(value))/2,
year_bracket = paste0(lag(year)," - ",year),
.keep = "unused",
.before = 2) |>
filter(!is.na(value)) |>
ungroup()
Output:
# A tibble: 9 x 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005 - 2010 978.
2 a 2010 - 2015 967
3 a 2015 - 2020 1038
4 b 2005 - 2010 976.
5 b 2010 - 2015 984.
6 b 2015 - 2020 1062.
7 c 2005 - 2010 1014.
8 c 2010 - 2015 1060
9 c 2015 - 2020 1095
Another possible solution:
library(tidyverse)
my_tib %>%
group_by(identifier) %>%
slice(c(1, rep(2:(n()-1), each = 2) , n())) %>%
group_by(identifier, aux = rep(1:n(), each=2, length.out = n())) %>%
summarise(year_bracket = str_c(year, collapse = "_"), value = mean(value),
.groups = "drop") %>% select(-aux)
#> # A tibble: 9 × 3
#> identifier year_bracket value
#> <chr> <chr> <dbl>
#> 1 a 2005_2010 978.
#> 2 a 2010_2015 967
#> 3 a 2015_2020 1038
#> 4 b 2005_2010 976.
#> 5 b 2010_2015 984.
#> 6 b 2015_2020 1062.
#> 7 c 2005_2010 1014.
#> 8 c 2010_2015 1060
#> 9 c 2015_2020 1095
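Since the question also asks about non-dplyr solutions, here is a base R sketch of the same idea: split by identifier and average each value with its successor. The values are retyped from the printed tibble, and pair_means() is a hypothetical helper name, not a function from any package. It assumes the rows are already sorted by year within each identifier.

```r
# Data retyped from the printed tibble in the question
my_tib <- data.frame(
  identifier = rep(letters[1:3], each = 4),
  year = rep(seq(2005, 2020, 5), 3),
  value = c(937, 1018, 916, 1160, 1033, 918, 1049, 1074, 1058, 969, 1151, 1039)
)

# Hypothetical helper: average each row with the next one within one group
pair_means <- function(d) {
  data.frame(
    identifier = d$identifier[-1],
    year_bracket = paste(head(d$year, -1), tail(d$year, -1), sep = "-"),
    value = (head(d$value, -1) + tail(d$value, -1)) / 2
  )
}

# Apply the helper per identifier and stack the pieces
my_new_tib <- do.call(rbind, lapply(split(my_tib, my_tib$identifier), pair_means))
rownames(my_new_tib) <- NULL

head(my_new_tib, 3)
#>   identifier year_bracket  value
#> 1          a    2005-2010  977.5
#> 2          a    2010-2015  967.0
#> 3          a    2015-2020 1038.0
```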

Select first row for each id for each year

Say I have a dataset below where each id can have multiple records per year. I would like to keep only the id's most recent record per year.
id<-c(1,1,1,2,2,2)
year<-c(2020,2020,2019,2020,2018,2018)
month<-c(12,6,4,5,4,1)
have<-as.data.frame(cbind(id,year,month))
have
id year month
1 2020 12
1 2020 6
1 2019 4
2 2020 5
2 2018 4
2 2018 1
This is what I would like the dataset to look like:
want
id year month
1 2020 12
1 2019 4
2 2020 5
2 2018 4
I know that I can get the first instance of each id with this code, however I want the latest record for each year.
want<-have[match(unique(have$id), have$id),]
id year month
1 2020 12
2 2020 5
I modified the code to add in year, but it outputs the same results as the code above:
want<-have[match(unique(have$id,have$year), have$id),]
id year month
1 2020 12
2 2020 5
How would I modify this so I can see one record displayed per year?
You can use dplyr::slice_max to keep the row with the latest month in each group, like this:
library(dplyr)
have %>%
group_by(id,year) %>%
slice_max(order_by = month)
Output:
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
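We could group and then summarise with first(). Note that this relies on the rows already being ordered so that the latest month comes first within each id and year, as in the example data: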
library(dplyr)
have %>%
group_by(id, year) %>%
summarise(month = first(month))
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
You can use group_by in dplyr and take the maximum month per id and year as follows:
have %>% group_by(id, year) %>% summarise(month = max(month), .groups = "drop")
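For comparison, a base R sketch that keeps the whole row of the latest record rather than just the month, which may matter if the real data has more columns than the example. It sorts so the latest month comes first within each id and year, then drops later duplicates of the same id and year. The data is retyped from the question.

```r
# Example data retyped from the question
have <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  year = c(2020, 2020, 2019, 2020, 2018, 2018),
  month = c(12, 6, 4, 5, 4, 1)
)

# Sort by id, then newest year and newest month first
sorted <- have[order(have$id, -have$year, -have$month), ]

# Keep only the first (latest) row for every id/year combination
want <- sorted[!duplicated(sorted[c("id", "year")]), ]
rownames(want) <- NULL

want
#>   id year month
#> 1  1 2020    12
#> 2  1 2019     4
#> 3  2 2020     5
#> 4  2 2018     4
```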

Assigning Values in R by Date Range

I am trying to create a "week" variable in my dataset of daily observations that begins with a new value (1, 2, 3, et cetera) whenever a new Monday happens. My dataset has observations beginning on April 6th, 2020, and the data are stored in "YYYY-MM-DD" format as as.Date() values. In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19th would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately that doesn't work for my purposes because there are not exactly 54 weeks in the year, and therefore "week 54" would only be a few days long. In other words, I would like the days of December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also do
df <- data.frame(date = as.Date("2020-04-06")+ 0:365)
library(dplyr)
library(lubridate)
df %>% group_by(d = isoyear(date), week = isoweek(date)) %>%
mutate(week = cur_group_id()) %>% ungroup() %>% select(-d)
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the minimum date from each date, divide the difference by 7, and use floor to get one number for each block of 7 days.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
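The subtraction above counts 7-day blocks from the earliest date, so it lines up with Mondays only when that earliest date is itself a Monday (April 6th, 2020 is). A hedged base R sketch that drops this assumption first snaps every date back to the Monday of its week and then counts weeks between Mondays. The sample dates are illustrative, and the variable names are my own.

```r
# Illustrative dates; the first one is a Wednesday, not a Monday
x <- as.Date(c("2020-04-08", "2020-04-12", "2020-04-13",
               "2020-12-28", "2021-01-03"))

# ISO weekday as a number: 1 = Monday, ..., 7 = Sunday
wday <- as.integer(format(x, "%u"))

# Move each date back to the Monday that starts its week
monday <- x - (wday - 1)

# Count whole weeks since the earliest Monday, starting at 1
week <- as.numeric(monday - min(monday), units = "days") %/% 7 + 1

week
#> [1]  1  1  2 39 39
```

December 28th, 2020 and January 3rd, 2021 end up in the same week because both map to the same Monday.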
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
as.Date("2021-01-04"),
by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(week = isoweek(date),
year = isoyear(date)) %>%
group_by(year) %>%
mutate(week2 = 1 + (week - min(week))) %>%
ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1

Group By and summaries with condition

I have data frame df. After group_by(id, Year, Month, new_used_ind) and summarise(n = n()) it looks like:
id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2
I want to get the total of n for each id, Year and Month, and also the total for 'N' in new_used_ind in a new column.
Something like this
id Year Month Total_New total
1 2001 apr 3 5
2 2002 mar 5 8
4 2004 july 4 6
library(dplyr)
read.table(text= "id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2", header = T) -> df
df %>%
group_by(id, Year, Month) %>%
mutate(total_New=sum(n*(new_used_ind=="N"))) %>%
mutate(total_n=sum(n)) %>%
summarise_at(c("total_New", "total_n"), mean)
#> # A tibble: 4 x 5
#> # Groups: id, Year [4]
#> id Year Month total_New total_n
#> <int> <int> <fct> <dbl> <dbl>
#> 1 1 2001 apr 3 5
#> 2 2 2002 mar 5 5
#> 3 3 2003 mar 0 3
#> 4 4 2004 july 4 6
Created on 2019-06-11 by the reprex package (v0.3.0)
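The same totals can probably be reached in a single summarise() call, without the intermediate mutate() steps and without summarise_at(), which dplyr has since superseded. This is a sketch under the assumption that only the grouped totals are needed; the data is retyped from the question and the column names follow the answer above.

```r
library(dplyr)

# Example data retyped from the question
df <- data.frame(
  id = c(1, 1, 2, 3, 4, 4),
  Year = c(2001, 2001, 2002, 2003, 2004, 2004),
  Month = c("apr", "apr", "mar", "mar", "july", "july"),
  new_used_ind = c("N", "U", "N", "U", "N", "U"),
  n = c(3, 2, 5, 3, 4, 2)
)

# Sum only the "N" rows for total_New, and every row for total_n
res <- df %>%
  group_by(id, Year, Month) %>%
  summarise(
    total_New = sum(n[new_used_ind == "N"]),
    total_n = sum(n),
    .groups = "drop"
  )

res$total_New
#> [1] 3 5 0 4
res$total_n
#> [1] 5 5 3 6
```

Groups with no "N" rows, such as id 3, get a total_New of 0 because summing an empty selection yields 0.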
