How to do a conditional NA fill in an R data frame

This may be simple, but I could not figure it out.
How can I fill the NA values in the feature column of the data frame dt according to the conditions below?
The conditions for filling NA are:
If the difference in Date is 1, fill the NA with the previous row's value (easily done with the fill function from tidyr, part of the tidyverse):
dt_fl<-dt%>%
fill(feature, .direction = "down")
dt_fl
If the difference in Date is > 1, fill the NA with the previous feature value + 1, and increment the feature values in the following rows by 1 so that the feature values stay continuous.
dt_output shows what I expect from dt after filling the NA values and renumbering the features accordingly.
dt<-structure(list(Date = structure(c(15126, 15127, 15128, 15129,
15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141,
15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"),
feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA,
2, 2, 2, 2, 2, 2, NA)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
dt
dt_output<-structure(list(Date = structure(c(15126, 15127, 15128, 15129,
15130, 15131, 15132, 15133, 15134, 15138, 15139, 15140, 15141,
15142, 15143, 15144, 15145, 15146, 15147, 15148, 15149), class = "Date"),
feature = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA,
2, 2, 2, 2, 2, 2, NA), finaloutput = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), row.names = c(NA,
-21L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), finaloutput = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
dt_output
Also, following Ben's suggestion: if the data frame starts with NA in feature, as in dt2, how can this be handled? The expected output for dt2 is in dt2_output.
dt2<-structure(list(Date = structure(c(13675, 13676, 13677, 13678,
13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"),
feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
dt2_output<-structure(list(Date = structure(c(13675, 13676, 13677, 13678,
13679, 13689, 13690, 13691, 13692, 13693, 13694, 13695), class = "Date"),
feature = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, 2), output_feature = c(1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)), row.names = c(NA, -12L
), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), output_feature = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
The solution Ben provides works for all conditions except one, shown in dt3 below, and I am wondering why. My assumption is that the second solution should give dt3_expected for dt3.
dt3<-structure(list(Date = structure(c(10063, 10064, 10065, 10066,
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083,
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1,
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA)), row.names = c(NA,
-19L), class = c("tbl_df", "tbl", "data.frame"))
dt3
dt3_expected<-structure(list(Date = structure(c(10063, 10064, 10065, 10066,
10067, 10068, 10069, 10070, 10079, 10080, 10081, 10082, 10083,
10084, 10085, 10086, 10087, 10088, 10089), class = "Date"), feature = c(1,
1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA), output_feature = c(1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), row.names = c(NA,
-19L), spec = structure(list(cols = list(Date = structure(list(), class = c("collector_character",
"collector")), feature = structure(list(), class = c("collector_double",
"collector")), output_feature = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Any help is greatly appreciated, thank you.

You could try creating an "offset" that is added whenever you have missing values and a difference in dates greater than 1 day. This cumulative offset can be added to your feature value to determine the finaloutput.
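library(dplyr)
library(tidyr)
# offset increases by 1 at each NA row that follows a Date gap of more than 1 day
dt %>%
  mutate(offset = cumsum(is.na(feature) & Date - lag(Date) > 1)) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)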
Output
# A tibble: 21 x 4
Date feature offset finaloutput
<date> <dbl> <int> <dbl>
1 2011-06-01 1 0 1
2 2011-06-02 1 0 1
3 2011-06-03 1 0 1
4 2011-06-04 1 0 1
5 2011-06-05 1 0 1
6 2011-06-06 1 0 1
7 2011-06-07 1 0 1
8 2011-06-08 1 0 1
9 2011-06-09 1 0 1
10 2011-06-13 1 1 2
11 2011-06-14 1 1 2
12 2011-06-15 1 1 2
13 2011-06-16 1 1 2
14 2011-06-17 1 1 2
15 2011-06-18 2 1 3
16 2011-06-19 2 1 3
17 2011-06-20 2 1 3
18 2011-06-21 2 1 3
19 2011-06-22 2 1 3
20 2011-06-23 2 1 3
21 2011-06-24 2 1 3
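As a quick check of the offset idea, here is a minimal sketch on a made-up six-row tibble. The name toy and its values are illustrative assumptions, not part of the question's data, and the explicit gap column with its !is.na() guard is an addition to the code above rather than part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: a 4-day jump between rows 3 and 4
toy <- tibble(
  Date    = as.Date("2011-06-01") + c(0, 1, 2, 6, 7, 8),
  feature = c(1, 1, NA, NA, NA, 2)
)

res <- toy %>%
  mutate(
    gap    = as.numeric(Date - lag(Date)),  # days since the previous row; NA in row 1
    offset = cumsum(is.na(feature) & !is.na(gap) & gap > 1)
  ) %>%
  fill(feature, .direction = "down") %>%
  mutate(finaloutput = feature + offset)

res$finaloutput
# 1 1 1 2 2 3
```

Only row 4 is both NA and preceded by a gap, so the offset steps from 0 to 1 there and stays at 1 for the rest of the rows.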
Edit: With the second example dt2 that begins with NA, you can try the following.
First, you can add a default for lag. When the first row is NA, the Date difference is evaluated for that row too, and since there is no prior Date to compare with, lag returns NA by default. Using a default that is more than 1 day before the first Date means an offset is added, so these initial NA are treated as the "first" feature.
The second issue is filling the NA when there is nothing to fill down from (no prior feature value when the column starts with NA). You can replace these with 0. Given the offset, this becomes a finaloutput of 0 + 1 = 1.
dt2 %>%
mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
fill(feature, .direction = "down") %>%
replace_na(list(feature = 0)) %>%
mutate(finaloutput = feature + offset)
Output
Date feature offset finaloutput
<date> <dbl> <int> <dbl>
1 2007-06-11 0 1 1
2 2007-06-12 0 1 1
3 2007-06-13 0 1 1
4 2007-06-14 0 1 1
5 2007-06-15 0 1 1
6 2007-06-25 1 1 2
7 2007-06-26 1 1 2
8 2007-06-27 1 1 2
9 2007-06-28 1 1 2
10 2007-06-29 1 1 2
11 2007-06-30 1 1 2
12 2007-07-01 2 1 3
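To see why the default matters, the following small sketch compares the two lag calls on three made-up dates (the vector d is an assumption for illustration only). It also shows that a single NA in the condition would turn the whole cumulative sum into NA.

```r
library(dplyr)

d <- as.Date("2007-06-11") + c(0, 1, 11)  # hypothetical dates

as.numeric(d - lag(d))
# NA 1 10: the first row has nothing to compare against

as.numeric(d - lag(d, default = first(d) - 2))
# 2 1 10: the first row now counts as a gap of more than 1 day

# An NA in the condition propagates through cumsum()
cumsum(c(TRUE & NA, FALSE, TRUE))
# NA NA NA
```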
Edit: Following a further comment, there is one more criterion to consider.
If the difference in Date is > 1 and there are only 2 NA, the first NA should be filled by the previous feature, and the second by the following feature. In particular, the second of 2 NA where there is a gap should be dealt with differently.
One approach to this is to count the number of consecutive NA in a row. Then, feature can be filled in for this particular circumstance, where the second of two NA is identified with a Date gap.
dt3 %>%
mutate(grp = cumsum(c(1, abs(diff(is.na(feature))) == 1))) %>%
add_count(grp) %>%
ungroup() %>%
mutate(feature = ifelse(is.na(feature) & n == 2 & is.na(lag(feature)), lead(feature), feature)) %>%
mutate(offset = cumsum(is.na(feature) & Date - lag(Date, default = first(Date) - 2) > 1)) %>%
fill(feature, .direction = "down") %>%
replace_na(list(feature = 0)) %>%
mutate(finaloutput = feature + offset)
Output
Date feature grp n offset finaloutput
<date> <dbl> <dbl> <int> <int> <dbl>
1 1997-07-21 1 1 7 0 1
2 1997-07-22 1 1 7 0 1
3 1997-07-23 1 1 7 0 1
4 1997-07-24 1 1 7 0 1
5 1997-07-25 1 1 7 0 1
6 1997-07-26 1 1 7 0 1
7 1997-07-27 1 1 7 0 1
8 1997-07-28 1 2 2 0 1
9 1997-08-06 2 2 2 0 2
10 1997-08-07 2 3 9 0 2
11 1997-08-08 2 3 9 0 2
12 1997-08-09 2 3 9 0 2
13 1997-08-10 2 3 9 0 2
14 1997-08-11 2 3 9 0 2
15 1997-08-12 2 3 9 0 2
16 1997-08-13 2 3 9 0 2
17 1997-08-14 2 3 9 0 2
18 1997-08-15 2 3 9 0 2
19 1997-08-16 2 4 1 0 2
Note that this could be simplified, but before doing so I would need to be sure it meets your needs.
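The run counting used for grp and n can also be sketched in base R with rle. This is an alternative way to get the same run ids and run lengths, not what the answer itself uses, and the feature vector below is made up for illustration.

```r
feature <- c(1, 1, NA, NA, 2, 2, NA)  # hypothetical values

# Run ids as in the answer: a new id starts wherever the NA status flips
grp <- cumsum(c(1, abs(diff(is.na(feature))) == 1))

# The same information from run-length encoding
runs    <- rle(is.na(feature))
run_id  <- rep(seq_along(runs$lengths), runs$lengths)
run_len <- rep(runs$lengths, runs$lengths)

grp      # 1 1 2 2 3 3 4
run_len  # 2 2 2 2 2 2 1
```

Here run_len plays the role of the n column produced by add_count(grp).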

Related

Dplyr sequence matching

I am struggling to write simple dplyr code for this problem. If the values per id are equal and timeslot follows a consecutive, increasing sequence within a day, I would like to create a t column that holds the length of that sequence. For example, for id 1 there is an increasing consecutive sequence on day 1 from timeslot 1 to timeslot 4, so t would equal 4. There are 144 timeslots and 7 days. How can I do this?
Desired output:
Sample data:
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
variable = c("ha1_001", "ha1_002", "ha1_003", "ha1_004",
"ha1_125", "ha1_126", "ha1_127", "ha1_128", "ha1_009", "ha1_010",
"ha1_011", "ha1_012", "ha1_013"), value = c(110, 110, 110,
110, 110, 110, 110, 110, 110, 110, 110, 110, 110), day = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), timeslot = c(1, 2, 3,
4, 125, 126, 127, 128, 129, 130, 131, 132, 133), n = c(7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L), spec = structure(list(
cols = list(id = structure(list(), class = c("collector_double",
"collector")), variable = structure(list(), class = c("collector_character",
"collector")), value = structure(list(), class = c("collector_double",
"collector")), day = structure(list(), class = c("collector_double",
"collector")), timeslot = structure(list(), class = c("collector_double",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
One possible solution is:
library(dplyr)
df %>%
group_by(id, new = cumsum(c(1, diff(timeslot)) != 1)) %>%
mutate(t = n())
# A tibble: 13 x 7
# Groups: new [2]
id variable value day timeslot n t
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 1 ha1_001 110 1 1 7 4
2 1 ha1_002 110 1 2 7 4
3 1 ha1_003 110 1 3 7 4
4 1 ha1_004 110 1 4 7 4
5 1 ha1_125 110 1 125 7 9
6 1 ha1_126 110 1 126 7 9
7 1 ha1_127 110 1 127 7 9
8 1 ha1_128 110 1 128 7 9
9 1 ha1_009 110 1 129 7 9
10 1 ha1_010 110 1 130 7 9
11 1 ha1_011 110 1 131 7 9
12 1 ha1_012 110 1 132 7 9
13 1 ha1_013 110 1 133 7 9
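The grouping key is the core of this answer, so here is a minimal base R sketch of what it computes on a short made-up vector (timeslot below is hypothetical, and ave stands in for the grouped mutate).

```r
timeslot <- c(1, 2, 3, 4, 125, 126, 127)  # hypothetical slots

# TRUE wherever the step from the previous slot is not exactly 1
breaks <- c(1, diff(timeslot)) != 1
run    <- cumsum(breaks)

# Length of the run each row belongs to
run_length <- ave(timeslot, run, FUN = length)

run         # 0 0 0 0 1 1 1
run_length  # 4 4 4 4 3 3 3
```

Each break starts a new run id, so counting rows per run id gives the length of every consecutive sequence.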

How can I LAG the previous value that meets a condition in another column (R)?

I would like to return the previous value for each row: not simply the value one row back (n = 1), but the most recent previous value that meets a condition in another column. In this case the condition is Presence = 1.
Table with expected result
Thanks!
You could use dplyr and tidyr:
library(dplyr)
library(tidyr)
data %>%
group_by(person, indicator = cumsum(presence)) %>%
mutate(expected_lag = ifelse(presence == 0, NA, presence * result)) %>%
fill(expected_lag, .direction = "down") %>%
group_by(person) %>%
mutate(expected_lag = lag(expected_lag)) %>%
select(-indicator) %>%
ungroup()
which returns
# A tibble: 9 x 4
person presence result expected_lag
<chr> <dbl> <dbl> <dbl>
1 Ane 1 5 NA
2 Ane 0 6 5
3 Ane 0 4 5
4 Ane 1 8 5
5 Ane 1 7 8
6 John 0 9 NA
7 John 1 2 NA
8 John 0 4 2
9 John 1 3 2
Data
For simplification I removed the date column.
structure(list(person = c("Ane", "Ane", "Ane", "Ane", "Ane",
"John", "John", "John", "John"), presence = c(1, 0, 0, 1, 1,
0, 1, 0, 1), result = c(5, 6, 4, 8, 7, 9, 2, 4, 3)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(person = structure(list(), class = c("collector_character",
"collector")), presence = structure(list(), class = c("collector_double",
"collector")), result = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))

Compare and identify the missing rows

I would like to compare two data frames row by row, based on the serial and day variables, and create a new column called compare that flags the missing rows. How can this be done in R? I tried the inner_join function without success.
Sample structure df1 and df2
Desired output:
Sample data
df1<-structure(list(serial = c(1, 2, 3, 4, 5), day = c(1, 0, 1, 0,
0)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), spec = structure(list(cols = list(serial = structure(list(), class = c("collector_double",
"collector")), day = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2<-structure(list(serial = c(1, 2, 3, 4, 5, 5, 7), day = c(1, 0,
1, 0, 0, 1, 1)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), spec = structure(list(cols = list(
serial = structure(list(), class = c("collector_double",
"collector")), day = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
We can use tidyverse
library(dplyr)
df2 %>%
mutate(compare = TRUE) %>%
left_join(df1 %>%
mutate(compare1 = TRUE), by = c('serial', 'day')) %>%
transmute(serial, day, compare = (!is.na(compare1)))
Output
# A tibble: 7 x 3
serial day compare
<dbl> <dbl> <lgl>
1 1 1 TRUE
2 2 0 TRUE
3 3 1 TRUE
4 4 0 TRUE
5 5 0 TRUE
6 5 1 FALSE
7 7 1 FALSE
Or, faster and more efficiently, with data.table:
library(data.table)
setDT(df2)[, compare := FALSE][setDT(df1), compare := TRUE, on = .(serial, day)]
One way would be to create a unique key combining the two columns and use %in% to find if the key is present in another dataset.
A base R option -
df2$compare <- do.call(paste, df2) %in% do.call(paste, df1)
df2
# A tibble: 7 x 3
# serial day compare
# <dbl> <dbl> <lgl>
#1 1 1 TRUE
#2 2 0 TRUE
#3 3 1 TRUE
#4 4 0 TRUE
#5 5 0 TRUE
#6 5 1 FALSE
#7 7 1 FALSE
If your data has more columns than serial and day, use the code below.
cols <- c('serial', 'day')
df2$compare <- do.call(paste, df2[cols]) %in% do.call(paste, df1[cols])
A base R option
transform(
merge(cbind(df1, compare = TRUE), df2, all = TRUE),
compare = !is.na(compare)
)
gives
serial day compare
1 1 1 TRUE
2 2 0 TRUE
3 3 1 TRUE
4 4 0 TRUE
5 5 0 TRUE
6 5 1 FALSE
7 7 1 FALSE
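If only the missing rows themselves are needed rather than a flag column, a filtering join is another option. The sketch below uses dplyr's anti_join and rebuilds df1 and df2 with tibble() from the values in the question; it is a variation on the answers above, not one of them.

```r
library(dplyr)

df1 <- tibble(serial = c(1, 2, 3, 4, 5), day = c(1, 0, 1, 0, 0))
df2 <- tibble(serial = c(1, 2, 3, 4, 5, 5, 7), day = c(1, 0, 1, 0, 0, 1, 1))

# Rows of df2 that have no match in df1 on serial and day
missing_rows <- anti_join(df2, df1, by = c("serial", "day"))

missing_rows$serial  # 5 7
missing_rows$day     # 1 1
```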

Fill multiple columns in a R dataframe [duplicate]

I have a data frame called flu that is a count of cases (n) by group per week.
flu <- structure(list(isoweek = c(1, 1, 2, 2, 3, 3, 4, 5, 5), group = c("fluA",
"fluB", "fluA", "fluB", "fluA", "fluB", "fluA", "fluA", "fluB"
), n = c(5, 6, 3, 5, 12, 14, 6, 23, 25)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(isoweek = structure(list(), class = c("collector_double",
"collector")), group = structure(list(), class = c("collector_character",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
In the data set, weeks with zero cases for a group are not reported at all, so there are no rows (and no NA values) to work with.
I have identified a fix for this to fill down missing weeks with zeros.
flu %>% complete(isoweek, nesting(group), fill = list(n = 0))
My problem is that this only works for the weeks that appear in the data. For example, if no cases are reported at weeks 6, 7, 8 and so on, I have no rows for those weeks.
How can I extend this fill-down process so that the data frame also gets zeros for isoweeks 6 to 10 (for example), with a corresponding fluA and fluB row and a zero value for each isoweek/group pair?
You can expand multiple columns in complete. For example, if you need data up to week 8, you can do:
tidyr::complete(flu, isoweek = 1:8, group, fill = list(n = 0))
# A tibble: 16 x 3
# isoweek group n
# <dbl> <chr> <dbl>
# 1 1 fluA 5
# 2 1 fluB 6
# 3 2 fluA 3
# 4 2 fluB 5
# 5 3 fluA 12
# 6 3 fluB 14
# 7 4 fluA 6
# 8 4 fluB 0
# 9 5 fluA 23
#10 5 fluB 25
#11 6 fluA 0
#12 6 fluB 0
#13 7 fluA 0
#14 7 fluB 0
#15 8 fluA 0
#16 8 fluB 0
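The same call extends to any upper bound. As a minimal sketch, the code below rebuilds flu as a plain data frame from the question's values and fills up to week 10; last_week is a hypothetical variable introduced here for illustration.

```r
library(tidyr)

flu <- data.frame(
  isoweek = c(1, 1, 2, 2, 3, 3, 4, 5, 5),
  group   = c("fluA", "fluB", "fluA", "fluB", "fluA", "fluB", "fluA", "fluA", "fluB"),
  n       = c(5, 6, 3, 5, 12, 14, 6, 23, 25)
)

last_week <- 10  # hypothetical upper bound for the weeks to report

# Every isoweek/group pair from week 1 to last_week, with 0 where no row existed
flu_full <- complete(flu, isoweek = 1:last_week, group, fill = list(n = 0))

nrow(flu_full)   # 20
sum(flu_full$n)  # 99, unchanged from the original counts
```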

Generate self reference key within the table using R mutate in a dataframe

I have an input table with 3 columns: Person_Id, Visit_Id (a unique ID for each visit and each person) and Purpose, as shown below. I would like to generate a new column that gives the immediately preceding visit of the person. For example, if a person has visited the hospital with Visit_Id = 2, I would like another column called "Preceding_visit_Id" to be 1; if Visit_Id = 5, the preceding visit ID would be 4. Is there a way to do this in an elegant manner using the mutate function?
Input Table
Output Table
As you can see, the 'Preceding_visit_id' column refers to the person's previous visit, which is defined using the visit_id column.
Please note that this is a transformation for one of the columns in a huge program, so anything elegant would be helpful.
The dput output is here:
structure(list(Person_Id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3), Visit_Id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14), Purpose = c("checkup", "checkup", "checkup", "checkup",
"checkup", "checkup", "checkup", "checkup", "checkup", "checkup",
"checkup", "checkup", "checkup", "checkup"), Preceding_visit_id = c(NA,
1, 2, 3, 4, NA, 6, 7, 8, 9, 10, NA, 12, 12)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L), spec =
structure(list(
cols = list(Person_Id = structure(list(), class = c("collector_double",
"collector")), Visit_Id = structure(list(), class = c("collector_double",
"collector")), Purpose = structure(list(), class =
c("collector_character",
"collector")), Preceding_visit_id = structure(list(), class =
c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
The Person_Id fields in your examples don't match.
I'm not sure if this is what you're after, but from your dput() I have created a data frame that drops the last column:
df_input <- df_output %>%
select(-Preceding_visit_id)
Then I did this:
df_input %>%
group_by(Person_Id) %>%
mutate(Preceding_visit_id = lag(Visit_Id))
And the output is this:
# A tibble: 14 x 4
# Groups: Person_Id [3]
Person_Id Visit_Id Purpose Preceding_visit_id
<dbl> <dbl> <chr> <dbl>
1 1 1 checkup NA
2 1 2 checkup 1
3 1 3 checkup 2
4 1 4 checkup 3
5 1 5 checkup 4
6 2 6 checkup NA
7 2 7 checkup 6
8 2 8 checkup 7
9 2 9 checkup 8
10 2 10 checkup 9
11 2 11 checkup 10
12 3 12 checkup NA
13 3 13 checkup 12
14 3 14 checkup 13
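Outside a dplyr pipeline, the same per-person shift can be sketched in base R with ave. The vectors below are shortened, made-up stand-ins for the question's columns, and the sketch assumes rows are already ordered by Visit_Id within each person.

```r
# Hypothetical, shortened versions of the question's columns
Person_Id <- c(1, 1, 1, 2, 2, 3)
Visit_Id  <- c(1, 2, 3, 4, 5, 6)

# Within each person, shift Visit_Id down by one row and pad with NA
Preceding_visit_id <- ave(
  Visit_Id, Person_Id,
  FUN = function(v) c(NA, head(v, -1))
)

Preceding_visit_id
# NA 1 2 NA 4 NA
```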
