How to find first non-NA leading or lagging value? - r

I have rows grouped by ID and I want to calculate how much time passes until the next event occurs (if it does occur for that ID).
Here is example code:
year <- c(2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018)
id <- c(rep("A", times = 4), rep("B", times = 4), rep("C", times = 4))
event_date <- c(NA, 2016, NA, 2018, NA, NA, NA, NA, 2015, NA, NA, 2018)
df<- as.data.frame(cbind(id, year, event_date))
df
id year event_date
1 A 2015 <NA>
2 A 2016 2016
3 A 2017 <NA>
4 A 2018 2018
5 B 2015 <NA>
6 B 2016 <NA>
7 B 2017 <NA>
8 B 2018 <NA>
9 C 2015 2015
10 C 2016 <NA>
11 C 2017 <NA>
12 C 2018 2018
Here is what I want the output to look like:
id year event_date years_till_next_event
1 A 2015 <NA> 1
2 A 2016 2016 0
3 A 2017 <NA> 1
4 A 2018 2018 0
5 B 2015 <NA> <NA>
6 B 2016 <NA> <NA>
7 B 2017 <NA> <NA>
8 B 2018 <NA> <NA>
9 C 2015 2015 0
10 C 2016 <NA> 2
11 C 2017 <NA> 1
12 C 2018 2018 0
Person B does not have the event, so it is not calculated. For the others, I want to calculate the difference between the leading event_date (ignoring NAs, if it exists) and the year.
I want to calculate years_till_next_event as follows: 1) if a row has an event_date, use event_date - year; 2) otherwise, use the first non-NA leading event_date minus year. I'm having difficulty with the second part, keeping in mind that, for a given ID, the event could occur every year or not at all.

Using zoo with dplyr
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(years_till_next_event = na.locf0(event_date, fromLast = TRUE) - year )

Here is a data.table option
library(data.table)
setDT(df)[, years_till_next_event := nafill(event_date, type = "nocb") - year, by = id]
which gives
id year event_date years_till_next_event
1: A 2015 NA 1
2: A 2016 2016 0
3: A 2017 NA 1
4: A 2018 2018 0
5: B 2015 NA NA
6: B 2016 NA NA
7: B 2017 NA NA
8: B 2018 NA NA
9: C 2015 2015 0
10: C 2016 NA 2
11: C 2017 NA 1
12: C 2018 2018 0

You can create a new column to assign a row number within each id if the value is not NA, fill the NA values from the next values and subtract the current row number from it.
library(dplyr)
df %>%
group_by(id) %>%
mutate(years_till_next_event = replace(row_number(),is.na(event_date), NA)) %>%
tidyr::fill(years_till_next_event, .direction = 'up') %>%
mutate(years_till_next_event = years_till_next_event - row_number()) %>%
ungroup()
# id year event_date years_till_next_event
# <chr> <dbl> <dbl> <int>
# 1 A 2015 NA 1
# 2 A 2016 2016 0
# 3 A 2017 NA 1
# 4 A 2018 2018 0
# 5 B 2015 NA NA
# 6 B 2016 NA NA
# 7 B 2017 NA NA
# 8 B 2018 NA NA
# 9 C 2015 2015 0
#10 C 2016 NA 2
#11 C 2017 NA 1
#12 C 2018 2018 0
data
df <- data.frame(id, year, event_date)
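For readers without zoo, data.table, or tidyr, the same "next observation carried backward" idea can be sketched in base R. This is only an illustrative sketch: the helper name nocb is hypothetical (it does not come from any package), and it assumes df was built with data.frame() so that year and event_date are numeric rather than character.

```r
year <- rep(2015:2018, times = 3)
id <- rep(c("A", "B", "C"), each = 4)
event_date <- c(NA, 2016, NA, 2018, NA, NA, NA, NA, 2015, NA, NA, 2018)
df <- data.frame(id, year, event_date)

# Hypothetical helper: carry the next non-NA value backward over NAs
nocb <- function(x) {
  out <- x
  nxt <- NA
  # Walk from the last element to the first, remembering the latest non-NA seen
  for (i in rev(seq_along(x))) {
    if (!is.na(x[i])) nxt <- x[i]
    out[i] <- nxt
  }
  out
}

# Apply the helper within each id, then subtract the row's own year
df$years_till_next_event <- ave(df$event_date, df$id, FUN = nocb) - df$year
df
```

Groups with no event at all (id B here) stay NA, because nxt is never set inside that group.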

Related

How to compare two or more lines in a long dataset to create a new variable?

I have a long-format dataset like this:
ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_
Each subject has a Classification for 2022, based on their address in 2022. No classification was made for the other years. I would like to extend this classification to the other years: if a subject's address in another year is the same as their 2022 address, the NA in 'Classification' for that year should be replaced with the 'Classification' value they received in 2022.
I have been trying to convert the data to wide format and compare the rows more directly with dplyr, but it is not working properly because of the NA values. It also doesn't look like a smart way to reach the final dataset I want. I would like to get the 'Aim' column shown below:
ID year Address Classification Aim
1 2020 A NA NA
1 2021 A NA NA
1 2022 B B_ B_
2 2020 C NA NA
2 2021 D NA NA
2 2022 F F_ F_
3 2020 G NA G_
3 2021 G NA G_
3 2022 G G_ G_
4 2020 H NA H_
4 2021 I NA NA
4 2022 H H_ H_
I would use tidyr::fill with dplyr::group_by for this. You need to specify the direction: the default, "down", would leave the NAs in place, because the first value in each group is NA.
library(dplyr)
library(tidyr)
df %>%
group_by(ID, Address) %>%
tidyr::fill(Classification, .direction = "up")
Output:
# ID year Address Classification
# <int> <int> <chr> <chr>
# 1 1 2020 A NA
# 2 1 2021 A NA
# 3 1 2022 B B_
# 4 2 2020 C NA
# 5 2 2021 D NA
# 6 2 2022 F F_
# 7 3 2020 G G_
# 8 3 2021 G G_
# 9 3 2022 G G_
#10 4 2020 H H_
#11 4 2021 I NA
#12 4 2022 H H_
Data
df <- read.table(text = "ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_", header = TRUE)
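The fill approach relies on the 2022 row coming last within each ID and Address group. As an alternative that does not depend on row order, here is a hedged base R sketch that looks up each row's (ID, Address) pair among the 2022 pairs. The names ref and idx are illustrative only, and the result is written to a new Aim column, as in the question.

```r
df <- read.table(text = "ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_", header = TRUE, stringsAsFactors = FALSE)

# Reference rows: the address and classification each ID holds in 2022
ref <- df[df$year == 2022, c("ID", "Address", "Classification")]

# Find each row's (ID, Address) pair among the 2022 pairs; no match gives NA
idx <- match(paste(df$ID, df$Address), paste(ref$ID, ref$Address))
df$Aim <- ref$Classification[idx]
df
```

Rows whose address differs from the 2022 address get NA in Aim, which matches the desired output in the question.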

Remove duplicate year rows by groups [duplicate]

I have a data.table of the following form:-
data <- data.table(group = rep(1:3, each = 4),
year = c(2011:2014, rep(2011:2012, each = 2),
2012, 2012, 2013, 2014), value = 1:12)
This is only an extract of my data.
Group 2 has 2 values each for 2011 and 2012, and group 3 has 2 values for the year 2012. I want to keep just the first row for each duplicated year.
So, in effect, my data.table will become the following:-
data <- data.table(group = c(rep(1, 4), rep(2, 2), rep(3, 3)),
year = c(2011:2014, 2011, 2012, 2012, 2013, 2014),
value = c(1:5, 7, 9, 11, 12))
How can I achieve this? Thanks in advance.
Try this data.table option with duplicated
> data[!duplicated(cbind(group, year))]
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
For data.tables you can pass the by argument to unique:
library(data.table)
unique(data, by = c('group', 'year'))
# group year value
#1: 1 2011 1
#2: 1 2012 2
#3: 1 2013 3
#4: 1 2014 4
#5: 2 2011 5
#6: 2 2012 7
#7: 3 2012 9
#8: 3 2013 11
#9: 3 2014 12
Using base R
subset(data, !duplicated(cbind(group, year)))
One solution would be to use distinct from dplyr like so:
library(dplyr)
data %>%
distinct(group, year, .keep_all = TRUE)
Output:
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
This should do the trick:
library(dplyr)
data %>%
group_by(group, year) %>%
# keep only the first row of each (group, year) combination
filter(row_number() == 1) %>%
ungroup()
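For comparison, here is a minimal sketch of the same de-duplication on a plain data.frame, with no extra packages. It assumes the data are held in a data.frame rather than a data.table (an assumption made only for this illustration), and deduped is a hypothetical name.

```r
data <- data.frame(group = rep(1:3, each = 4),
                   year = c(2011:2014, rep(2011:2012, each = 2),
                            2012, 2012, 2013, 2014),
                   value = 1:12)

# duplicated() on the two key columns flags every repeat of a (group, year) pair;
# negating it keeps the first row of each pair
deduped <- data[!duplicated(data[c("group", "year")]), ]
deduped
```

Passing both key columns as one data.frame matters here: duplicated() compares whole rows of that data.frame, so a year repeated in a different group is not treated as a duplicate.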

Substitute values across different data frames in R

I have the following code for 4 dataframes. The last column of each has only 2 values, either zero ("0") or an id, which is the same within every df, but differs between every df.
How can I replace all zeros in each id column with that data frame's id?
As example, change df1 from:
year counts id
1 2015 0 0
2 2016 0 0
3 2017 7 Fg4s5
4 2018 8 Fg4s5
5 2019 5 0
6 2020 12 Fg4s5
to:
year counts id
1 2015 0 Fg4s5
2 2016 0 Fg4s5
3 2017 7 Fg4s5
4 2018 8 Fg4s5
5 2019 5 Fg4s5
6 2020 12 Fg4s5
Same for other dfs with their ids.
Code for dataframes:
df1 <- data.frame(
year = c(2015:2020),
counts = c(0, 0, 7, 8, 5, 12),
id = c(0, 0, "Fg4s5", "Fg4s5", 0, "Fg4s5")
)
df2 <- data.frame(
year = c(2014:2020),
counts = c(1, 5, 9, 2, 2, 19, 3),
id = c(0, 0, 0, 0, 0, "Qd8a2", "Qd8a2")
)
df3 <- data.frame(
year = c(2016:2020),
counts = c(0, 0, 0, 0, 6),
id = c(0, 0, "Wk9l4", "Wk9l4", "Wk9l4")
)
df4 <- data.frame(
year = c(2014:2020),
counts = c(0, 0, 8, 1, 9, 12, 23),
id = c(0, "Rd7q0", 0, 0, "Rd7q0", "Rd7q0", "Rd7q0")
)
Put the dataframes in a list and change the value in id columns using lapply :
list_df <- list(df1, df2, df3, df4)
lapply(list_df, function(x) {
transform(x, id = replace(id, id == 0, id[id != '0'][1]))
}) -> list_df
list_df
#[[1]]
# year counts id
#1 2015 0 Fg4s5
#2 2016 0 Fg4s5
#3 2017 7 Fg4s5
#4 2018 8 Fg4s5
#5 2019 5 Fg4s5
#6 2020 12 Fg4s5
#[[2]]
# year counts id
#1 2014 1 Qd8a2
#2 2015 5 Qd8a2
#3 2016 9 Qd8a2
#4 2017 2 Qd8a2
#5 2018 2 Qd8a2
#6 2019 19 Qd8a2
#7 2020 3 Qd8a2
#[[3]]
# year counts id
#1 2016 0 Wk9l4
#2 2017 0 Wk9l4
#3 2018 0 Wk9l4
#4 2019 0 Wk9l4
#5 2020 6 Wk9l4
#[[4]]
# year counts id
#1 2014 0 Rd7q0
#2 2015 0 Rd7q0
#3 2016 8 Rd7q0
#4 2017 1 Rd7q0
#5 2018 9 Rd7q0
#6 2019 12 Rd7q0
#7 2020 23 Rd7q0
To put them back into separate dataframes:
names(list_df) <- paste0('df', 1:4)
list2env(list_df, .GlobalEnv)
Using purrr::map
library(dplyr)
library(purrr)
map(list(df1, df2, df3, df4), ~ .x %>% mutate(id = first(id[id != "0"])))
[[1]]
year counts id
1 2015 0 Fg4s5
2 2016 0 Fg4s5
3 2017 7 Fg4s5
4 2018 8 Fg4s5
5 2019 5 Fg4s5
6 2020 12 Fg4s5
[[2]]
year counts id
1 2014 1 Qd8a2
2 2015 5 Qd8a2
3 2016 9 Qd8a2
4 2017 2 Qd8a2
5 2018 2 Qd8a2
6 2019 19 Qd8a2
7 2020 3 Qd8a2
[[3]]
year counts id
1 2016 0 Wk9l4
2 2017 0 Wk9l4
3 2018 0 Wk9l4
4 2019 0 Wk9l4
5 2020 6 Wk9l4
[[4]]
year counts id
1 2014 0 Rd7q0
2 2015 0 Rd7q0
3 2016 8 Rd7q0
4 2017 1 Rd7q0
5 2018 9 Rd7q0
6 2019 12 Rd7q0
7 2020 23 Rd7q0
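Both answers assume every data frame contains at least one non-zero id. With the lapply version, for example, id[id != '0'][1] is NA when no such id exists, so the zeros would become NA. The following is a hedged base R sketch of a more defensive variant; fill_id and no_id are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: fill the id column only when exactly one non-zero id exists
fill_id <- function(d) {
  ids <- unique(d$id[d$id != "0"])
  if (length(ids) == 1) {
    d$id <- rep(ids, nrow(d))
  }
  d  # otherwise return the data frame unchanged
}

df1 <- data.frame(
  year = 2015:2020,
  counts = c(0, 0, 7, 8, 5, 12),
  id = c(0, 0, "Fg4s5", "Fg4s5", 0, "Fg4s5"),
  stringsAsFactors = FALSE
)
# Illustrative edge case: a data frame with no id at all
no_id <- data.frame(year = 2015:2016, counts = c(0, 0), id = c("0", "0"),
                    stringsAsFactors = FALSE)

fill_id(df1)$id
fill_id(no_id)$id
```

The helper can be dropped into the lapply call in place of the anonymous function, for example lapply(list_df, fill_id).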

Create incremental column year based on id and year column in R

I have the dataframe below and I want to create the 'create_col' column, I guess with some kind of seq() function, using the 'year' column as the start of the sequence. How could I do that?
id <- c(1,1,2,3,3,3,4)
year <- c(2013, 2013, 2015,2017,2017,2017,2011)
create_col <- c(2013,2014,2015,2017,2018,2019,2011)
Ideal result:
id year create_col
1 1 2013 2013
2 1 2013 2014
3 2 2015 2015
4 3 2017 2017
5 3 2017 2018
6 3 2017 2019
7 4 2011 2011
You can add row_number() to minimum year in each id :
library(dplyr)
df %>%
group_by(id) %>%
mutate(create_col = min(year) + row_number() - 1)
# id year create_col
# <dbl> <dbl> <dbl>
#1 1 2013 2013
#2 1 2013 2014
#3 2 2015 2015
#4 3 2017 2017
#5 3 2017 2018
#6 3 2017 2019
#7 4 2011 2011
data
df <- data.frame(id, year)
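The same per-group sequence can also be sketched in base R with ave(), which applies a function within each id and returns the results in the original row order. This is an illustrative equivalent of the dplyr answer, not a replacement for it.

```r
id <- c(1, 1, 2, 3, 3, 3, 4)
year <- c(2013, 2013, 2015, 2017, 2017, 2017, 2011)
df <- data.frame(id, year)

# Within each id: start at the minimum year and add 0, 1, 2, ... per row
df$create_col <- ave(df$year, df$id,
                     FUN = function(y) min(y) + seq_along(y) - 1)
df
```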

R: Use data frame names for columns after/before applying purrr reduce

I already checked this solution, but unfortunately, it does not fit my more complex data.
Raw Data:
I have a list named Total.Scores with eleven data frames named 2000-2020, each containing annual data from 2000 till 2020. Each data frame has a different number of rows but always 12 columns: ID, Category, Score.1-9, and Year.
Sample Data:
library(dplyr)
library(purrr)
Total.Scores <- list("2020" = data.frame(ID = c("A2_101", "B3_102", "LO_103", "TT_101"),
Category = c("blue", "red", "green", "red"),
Score.1 = c(1,2,3,0),
Score.2 = c(3,4,5,2),
Score.3 = c(0,0,1,1),
Year = c(2020, 2020, 2020, 2020)),
"2019" = data.frame(ID = c("A2_101", "B3_102", "LO_103"),
Category = c("blue", "red", "green"),
Score.1 = c(1,2,3),
Score.2 = c(3,4,5),
Score.3 = c(0,0,1),
Year = c(2019, 2019, 2019)),
"2018" = data.frame(ID = c("A2_101", "B3_102", "LO_103", "TT_201","AA_345"),
Category = c("blue", "red", "green", "yellow", "purple"),
Score.1 = c(1,2,3,3,5),
Score.2 = c(3,4,5,5,3),
Score.3 = c(0,0,1,3,0),
Year = c(2018, 2018, 2018, 2018, 2018)),
"2017" = data.frame(ID = c("A2_101", "B3_102", "LO_103", "TT_101"),
Category = c("blue", "red", "green", "red"),
Score.1 = c(1,2,3,0),
Score.2 = c(3,4,5,2),
Score.3 = c(0,0,1,1),
Year = c(2017, 2017, 2017, 2017)))
Joined Data:
I combine the data frames from the Total.Scores list into the new large data frame Total.Yearly.Scores via a full_join by ID and Category:
Total.Yearly.Scores <- Total.Scores %>% reduce(full_join, by = c("ID", "Category"))
Result:
# Total.Yearly.Scores
ID Category Score.1.x Score.2.x Score.3.x Year.x Score.1.y Score.2.y Score.3.y Year.y Score.1.x.x Score.2.x.x Score.3.x.x Year.x.x
1 A2_101 blue 1 3 0 2020 1 3 0 2019 1 3 0 2018
2 B3_102 red 2 4 0 2020 2 4 0 2019 2 4 0 2018
3 LO_103 green 3 5 1 2020 3 5 1 2019 3 5 1 2018
4 TT_101 red 0 2 1 2020 NA NA NA NA NA NA NA NA
5 TT_201 yellow NA NA NA NA NA NA NA NA 3 5 3 2018
6 AA_345 purple NA NA NA NA NA NA NA NA 5 3 0 2018
Score.1.y.y Score.2.y.y Score.3.y.y Year.y.y
1 1 3 0 2017
2 2 4 0 2017
3 3 5 1 2017
4 0 2 1 2017
5 NA NA NA NA
6 NA NA NA NA
Question:
How can I adjust my code so that the column headers for the Score.1-9 and Year columns incorporate the data frame names of 2000-2020?
For example, changing them from Score.1.x to Score.1 2020:
# Total.Yearly.Scores
ID Category Score.1 2020 Score.2 2020 Score.3 2020 Year 2020 Score.1 2019 Score.2 2019 Score.3 2019 Year 2019 Score.1 2018 Score.2 2018 Score.3 2018 Year 2018
1 A2_101 blue 1 3 0 2020 1 3 0 2019 1 3 0 2018
2 B3_102 red 2 4 0 2020 2 4 0 2019 2 4 0 2018
3 LO_103 green 3 5 1 2020 3 5 1 2019 3 5 1 2018
4 TT_101 red 0 2 1 2020 NA NA NA NA NA NA NA NA
5 TT_201 yellow NA NA NA NA NA NA NA NA 3 5 3 2018
6 AA_345 purple NA NA NA NA NA NA NA NA 5 3 0 2018
Score.1 2017 Score.2 2017 Score.3 2017 Year 2017
1 1 3 0 2017
2 2 4 0 2017
3 3 5 1 2017
4 0 2 1 2017
5 NA NA NA NA
6 NA NA NA NA
Thanks in advance for the help!
Best regards, Thomas.
We can rename before the join
library(dplyr)
library(purrr)
library(stringr)
Total.Scores %>%
imap(~ {nm1 <- .y
rename_at(.x, vars(-c("ID", "Category")), ~ str_c(., nm1, sep= ' '))}) %>%
reduce(full_join, by = c("ID", "Category"))
-output
ID Category Score.1 2020 Score.2 2020 Score.3 2020 Year 2020 Score.1 2019 Score.2 2019 Score.3 2019
1 A2_101 blue 1 3 0 2020 1 3 0
2 B3_102 red 2 4 0 2020 2 4 0
3 LO_103 green 3 5 1 2020 3 5 1
4 TT_101 red 0 2 1 2020 NA NA NA
5 TT_201 yellow NA NA NA NA NA NA NA
6 AA_345 purple NA NA NA NA NA NA NA
Year 2019 Score.1 2018 Score.2 2018 Score.3 2018 Year 2018 Score.1 2017 Score.2 2017 Score.3 2017 Year 2017
1 2019 1 3 0 2018 1 3 0 2017
2 2019 2 4 0 2018 2 4 0 2017
3 2019 3 5 1 2018 3 5 1 2017
4 NA NA NA NA NA 0 2 1 2017
5 NA 3 5 3 2018 NA NA NA NA
6 NA 5 3 0 2018 NA NA NA NA
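The rename-then-join idea can also be sketched in base R with Map(), Reduce(), and merge(). The tiny scores list below is hypothetical data invented for this illustration (it is not the asker's Total.Scores), and keys, renamed, and joined are illustrative names.

```r
# Hypothetical miniature of a named list of yearly data frames
scores <- list(
  "2020" = data.frame(ID = c("a", "b"), Category = c("x", "y"), Score.1 = 1:2),
  "2019" = data.frame(ID = c("a", "c"), Category = c("x", "z"), Score.1 = 3:4)
)
keys <- c("ID", "Category")

# Step 1: append the list name to every non-key column of each data frame
renamed <- Map(function(d, nm) {
  is_key <- names(d) %in% keys
  names(d)[!is_key] <- paste(names(d)[!is_key], nm)
  d
}, scores, names(scores))

# Step 2: full outer join of all data frames on the key columns
joined <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), renamed)
names(joined)
```

Because the non-key columns are already unique after step 1, the join never needs to add .x or .y suffixes. Note that merge() sorts rows by the key columns, so row order may differ from the full_join result.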
