How to fix a time series with missing dates across multiple observations? - r

Consider the following time series with numbered days:
test=data.table( day=sample(1:9, 15, TRUE), name=sort(rep(c("a", "b", "c"), 5)), value=sample(1:3, 15, TRUE) )
test[test[, !duplicated(day), by=name][,V1]][order(name, -day)]
day name value
1: 7 a 3
2: 4 a 2
3: 2 a 2
4: 1 a 2
5: 9 b 1
6: 8 b 3
7: 6 b 3
8: 5 b 2
9: 3 b 3
10: 7 c 1
11: 6 c 1
12: 4 c 1
13: 3 c 3
14: 1 c 2
As you can see, we made some measurements on three objects a, b and c over 9 days. We would like to run a day-to-day value comparison between the three objects. Unfortunately some days are randomly missing, which breaks an algorithm that would otherwise be straightforward.
I would like to insert rows into this data.table so that all objects have the same days. Inserted rows would default the value to 0.
All days available across all objects are listed with:
> sort(unique(test[,day]) )
[1] 1 2 3 4 5 6 7 8 9
So, for instance, object a is missing days 3, 5, 6, 8 and 9.
After the row insertion, the data.table for a would look like:
test[name=="a"]
day name value
1: 1 a 2
2: 2 a 1
3: 3 a 0
4: 4 a 3
5: 5 a 0
6: 6 a 0
7: 7 a 3
8: 8 a 0
9: 9 a 0
Any idea how to tackle this problem? Maybe a library such as lubridate already knows how to do this.

Using the data that you posted, which I copied and put into a data.table, you can do this using:
library(data.table)
## create a table with all days and names
all.dates <- setDT(expand.grid(day=sort(unique(test[,day])),name=sort(unique(test[,name]))))
## perform a left-outer-join of all.dates with test
setkey(all.dates)
setkey(test,day,name)
test <- test[all.dates]
## set those NA's to zero
test[is.na(test)] <- 0
## day name value
##1 1 a 2
##2 1 b 0
##3 1 c 2
##4 2 a 2
##5 2 b 0
##6 2 c 0
##7 3 a 0
##8 3 b 3
##9 3 c 3
##10 4 a 2
##11 4 b 0
##12 4 c 1
##13 5 a 0
##14 5 b 2
##15 5 c 0
##16 6 a 0
##17 6 b 3
##18 6 c 1
##19 7 a 3
##20 7 b 0
##21 7 c 1
##22 8 a 0
##23 8 b 3
##24 8 c 0
##25 9 a 0
##26 9 b 1
##27 9 c 0
Data:
## the dput() output carried a session-specific .internal.selfref pointer that cannot be parsed;
## it is dropped here and the data.table is rebuilt with setDT()
test <- setDT(structure(list(day = c(7L, 4L, 2L, 1L, 9L, 8L, 6L, 5L, 3L, 7L,
6L, 4L, 3L, 1L), name = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
value = c(3L, 2L, 2L, 2L, 1L, 3L, 3L, 2L, 3L, 1L, 1L, 1L,
3L, 2L)), .Names = c("day", "name", "value"), class = "data.frame",
row.names = c(NA, -14L)))
## day name value
## 1: 7 a 3
## 2: 4 a 2
## 3: 2 a 2
## 4: 1 a 2
## 5: 9 b 1
## 6: 8 b 3
## 7: 6 b 3
## 8: 5 b 2
## 9: 3 b 3
##10: 7 c 1
##11: 6 c 1
##12: 4 c 1
##13: 3 c 3
##14: 1 c 2
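As a variation on the join above, data.table's own CJ() can build the grid, and the on= argument can do the join without setting keys first. This is only a minimal sketch: it assumes name is stored as character rather than factor, and the object names grid and full are illustrative.

```r
library(data.table)

# Same observations as in the question, with `name` kept as character
test <- data.table(
  day   = c(7L, 4L, 2L, 1L, 9L, 8L, 6L, 5L, 3L, 7L, 6L, 4L, 3L, 1L),
  name  = rep(c("a", "b", "c"), times = c(4, 5, 5)),
  value = c(3L, 2L, 2L, 2L, 1L, 3L, 3L, 2L, 3L, 1L, 1L, 1L, 3L, 2L)
)

# Every (day, name) combination built from the days and names seen anywhere in the data
grid <- CJ(day = unique(test$day), name = unique(test$name))

# Join onto the grid: combinations without a measurement get NA in `value`
full <- test[grid, on = .(day, name)]

# Replace the NAs introduced by the join with 0
full[is.na(value), value := 0L]
```

Filling only the value column, rather than every NA in the table, keeps any genuine NAs in other columns intact.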

In the tidyverse, tidyr provides complete(), which wraps the same expand-then-join idea (in base R terms, expand.grid followed by a left join).
library(tidyverse)
test$day <- factor(test$day, levels = 1:9)
test$name = factor(test$name, levels = c("a", "b", "c"))
test %>%
complete(day, name, fill = list(value = 0))
#> # A tibble: 32 × 3
#> day name value
#> <fctr> <fctr> <dbl>
#> 1 1 a 0
#> 2 1 b 0
#> 3 1 c 0
#> 4 2 a 0
#> 5 2 b 0
#> 6 2 c 1
#> 7 3 a 1
#> 8 3 b 0
#> 9 3 c 0
#> 10 4 a 3
#> # ... with 22 more rows
You can also do it with expand.grid and a left join.
possibilities = expand.grid(levels(test$day), unique(test$name))
possibilities %>%
left_join(test, by = c("Var1" = "day", "Var2" = "name")) %>%
mutate(value = ifelse(is.na(value), 0, value))
#> Var1 Var2 value
#> 1 1 a 0
#> 2 2 a 0
#> 3 3 a 1
#> 4 4 a 3
#> 5 5 a 1

Related

R Fill in missing rows

I have a question similar to this one: Fill in missing rows in R
However, the gaps I need to fill are not only missing months but also missing years in between, for each ID. This is an example:
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 3 1 10
4 A 3 2 9
5 A 3 3 8
6 B 2 1 7
7 B 2 2 6
8 B 2 3 5
9 B 3 3 4
And this is what I want it to look like:
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 1 3 0
4 A 2 1 0
5 A 2 2 0
6 A 2 3 0
7 A 3 1 10
8 A 3 2 9
9 A 3 3 8
10 B 2 1 7
11 B 2 2 6
12 B 2 3 5
13 B 3 1 0
14 B 3 2 0
15 B 3 3 4
Has someone an idea how to solve it? I have already played around with the solutions mentioned above.
library(tidyverse)
df <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
df %>%
complete(ID, A, B, fill = list(Var1 = 0))
#> # A tibble: 18 x 4
#> ID A B Var1
#> <chr> <int> <int> <dbl>
#> 1 A 1 1 12
#> 2 A 1 2 11
#> 3 A 1 3 0
#> 4 A 2 1 0
#> 5 A 2 2 0
#> 6 A 2 3 0
#> 7 A 3 1 10
#> 8 A 3 2 9
#> 9 A 3 3 8
#> 10 B 1 1 0
#> 11 B 1 2 0
#> 12 B 1 3 0
#> 13 B 2 1 7
#> 14 B 2 2 6
#> 15 B 2 3 5
#> 16 B 3 1 0
#> 17 B 3 2 0
#> 18 B 3 3 4
Created on 2021-03-03 by the reprex package (v1.0.0)
You could use the solution described there altering it slightly for your problem.
df
full <- with(df, unique(expand.grid(ID = ID, A = A, B = B)))
complete <- merge(df, full, by = c('ID', 'A', 'B'), all.y = TRUE)
complete$Var1[is.na(complete$Var1)] <- 0
Just in case somebody else has the same question, this is what I came up with, thanks to the answers provided:
library(tidyverse)
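## ID is already the grouping column, so it is left out of complete(); recent tidyr versions reject grouping columns there
df %>% group_by(ID) %>% complete(A = full_seq(A, 1), B, fill = list(Var1 = 0))
Because the completion runs within each ID, only that ID's own range of A is filled in, so no rows are generated for combinations that the ID never spans (such as A = 1 for ID B).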
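The same per-ID completion can also be sketched in base R, for readers who prefer not to load tidyr. The names grid and out are illustrative, and the sketch assumes A should run over every integer between each ID's minimum and maximum:

```r
df <- data.frame(
  ID   = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  A    = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L),
  B    = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 3L),
  Var1 = 12:4,
  stringsAsFactors = FALSE
)

# Build the grid one ID at a time, so each ID only spans its own range of A
grid <- do.call(rbind, lapply(split(df, df$ID), function(g) {
  expand.grid(
    ID = g$ID[1],
    A  = seq(min(g$A), max(g$A)),
    B  = sort(unique(g$B)),
    stringsAsFactors = FALSE
  )
}))

# Keep every grid row; combinations absent from df get NA, then 0
out <- merge(grid, df, by = c("ID", "A", "B"), all.x = TRUE)
out$Var1[is.na(out$Var1)] <- 0
```

merge() sorts on the key columns by default, so out comes back ordered by ID, A and B, with the 15 rows of the desired result.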

R data.table : rolling lag sum for previous 3 days by group

I am currently working in R with data.table and am looking for an easy way to implement a rolling lag sum. I can find posts on lags and posts on various sum functions, but I haven't found one in which sum and lag are combined in the way I want to implement it (rolling back 3 days).
I have a data set that resembles the following-
id agedays diar
1 1 1
1 2 0
1 3 1
1 4 1
1 5 0
1 6 0
1 7 0
1 8 1
1 9 1
1 10 1
3 2 0
3 5 0
3 6 0
3 8 1
3 9 1
4 1 0
4 4 0
4 5 0
4 6 1
4 7 0
I want to create a variable "diar_prev3" that holds the rolling sum of diar over the 3 days prior to the current agedays value. diar_prev3 would be NA for the rows in which agedays < 4. The data set would look like the following:
id agedays diar diar_prev3
1 1 1 NA
1 2 0 NA
1 3 1 NA
1 4 1 2
1 5 0 2
1 6 0 2
1 7 0 1
1 8 1 0
1 9 1 1
1 10 1 2
3 2 0 NA
3 5 0 0
3 6 0 0
3 8 1 0
3 9 1 1
4 1 0 NA
4 4 0 0
4 5 0 0
4 6 1 0
4 7 0 1
I have tried a basic lag function, but am unsure how to implement this with a rolling sum function included. Does anyone have any functions they recommend using to accomplish this?
****Edited to fix an error with ID==2
I don't get the logic; it does not appear to be by id, otherwise the results for id==2 don't make sense - but what is going on with id==3 and 4?
In principle, you could do something like this - either by ID or not:
library(data.table)
library(RcppRoll)
DT <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
agedays = c(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 2L, 5L, 6L, 8L, 9L, 1L, 4L,
5L, 6L, 7L), diar = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L,
0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L)),
class = "data.frame", row.names = c(NA, -20L))
setDT(DT)
## shift() moves the values down one row; with only data.table and RcppRoll loaded, lag() resolves to
## stats::lag(), which changes the time base of a ts object but leaves the values in place
DT[, diar_prev3 := ifelse(agedays < 4, NA, RcppRoll::roll_sum(shift(diar, 1), n=3L, fill=NA, align = "right"))][]
#> id agedays diar diar_prev3
#> 1: 1 1 1 NA
#> 2: 1 2 0 NA
#> 3: 1 3 1 NA
#> 4: 1 4 1 2
#> 5: 1 5 0 2
#> 6: 2 6 0 2
#> 7: 2 7 0 1
#> 8: 2 8 1 0
#> 9: 2 9 1 1
#> 10: 2 10 1 2
#> 11: 3 2 0 NA
#> 12: 3 5 0 2
#> 13: 3 6 0 1
#> 14: 3 8 1 0
#> 15: 3 9 1 1
#> 16: 4 1 0 NA
#> 17: 4 4 0 2
#> 18: 4 5 0 1
#> 19: 4 6 1 0
#> 20: 4 7 0 1
## grouped version; shift() is used instead of lag(), which would resolve to stats::lag() and not move the values
DT[, diar_prev3 := ifelse(agedays < 4, NA, RcppRoll::roll_sum(shift(diar, 1), n=3L, fill=NA, align = "right")), by=id][]
#> id agedays diar diar_prev3
#> 1: 1 1 1 NA
#> 2: 1 2 0 NA
#> 3: 1 3 1 NA
#> 4: 1 4 1 2
#> 5: 1 5 0 2
#> 6: 2 6 0 NA
#> 7: 2 7 0 NA
#> 8: 2 8 1 NA
#> 9: 2 9 1 1
#> 10: 2 10 1 2
#> 11: 3 2 0 NA
#> 12: 3 5 0 NA
#> 13: 3 6 0 NA
#> 14: 3 8 1 0
#> 15: 3 9 1 1
#> 16: 4 1 0 NA
#> 17: 4 4 0 NA
#> 18: 4 5 0 NA
#> 19: 4 6 1 0
#> 20: 4 7 0 1
Created on 2020-07-20 by the reprex package (v0.3.0)
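The rolling functions above count rows, not days, so gaps in agedays (as for id 3 and 4) are not taken into account. A day-based window can be sketched with plain data.table grouping. This is only an illustration, using the data as printed in the question (ids 1, 3 and 4) and assuming that a day with no record contributes nothing to the sum:

```r
library(data.table)

# Data as shown in the question (ids 1, 3 and 4)
DT <- data.table(
  id      = rep(c(1L, 3L, 4L), times = c(10, 5, 5)),
  agedays = c(1:10, 2L, 5L, 6L, 8L, 9L, 1L, 4L, 5L, 6L, 7L),
  diar    = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L,
              0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L)
)

# For each row, sum diar over the records of the same id whose
# agedays fall in the three days before the current one
DT[, diar_prev3 := vapply(
  agedays,
  function(d) sum(diar[agedays >= d - 3L & agedays < d]),
  integer(1)
), by = id]

# The question asks for NA whenever agedays < 4
DT[agedays < 4L, diar_prev3 := NA_integer_]
```

On this data the result matches the diar_prev3 column requested in the question. The inner loop is quadratic in group size, which is fine for short series but may be slow for long ones.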

R - Insert Missing Numbers in A Sequence by Group's Max Value

I'd like to insert missing numbers in the index column following these conditions:
Partitioned by multiple columns
The minimum value is always 1
The maximum value is always the maximum for the group and type
Current Data:
group type index vol
A 1 1 200
A 1 2 244
A 1 5 33
A 2 2 66
A 2 3 2
A 2 4 199
A 2 10 319
B 1 4 290
B 1 5 188
B 1 6 573
B 1 9 122
Desired Data:
group type index vol
A 1 1 200
A 1 2 244
A 1 3 0
A 1 4 0
A 1 5 33
A 2 1 0
A 2 2 66
A 2 3 2
A 2 4 199
A 2 5 0
A 2 6 0
A 2 7 0
A 2 8 0
A 2 9 0
A 2 10 319
B 1 1 0
B 1 2 0
B 1 3 0
B 1 4 290
B 1 5 188
B 1 6 573
B 1 7 0
B 1 8 0
B 1 9 122
I've just added in spaces between the partitions for clarity.
Hope you can help out!
You can do the following
library(dplyr)
library(tidyr)
my_df %>%
group_by(group, type) %>%
complete(index = 1:max(index), fill = list(vol = 0))
# group type index vol
# 1 A 1 1 200
# 2 A 1 2 244
# 3 A 1 3 0
# 4 A 1 4 0
# 5 A 1 5 33
# 6 A 2 1 0
# 7 A 2 2 66
# 8 A 2 3 2
# 9 A 2 4 199
# 10 A 2 5 0
# 11 A 2 6 0
# 12 A 2 7 0
# 13 A 2 8 0
# 14 A 2 9 0
# 15 A 2 10 319
# 16 B 1 1 0
# 17 B 1 2 0
# 18 B 1 3 0
# 19 B 1 4 290
# 20 B 1 5 188
# 21 B 1 6 573
# 22 B 1 7 0
# 23 B 1 8 0
# 24 B 1 9 122
With group_by you specify the groups you indicated with the white space. With complete you specify which columns should be completed and which values should be filled in for the remaining column (the default would be NA).
Data
my_df <-
structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
type = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
index = c(1L, 2L, 5L, 2L, 3L, 4L, 10L, 4L, 5L, 6L, 9L),
vol = c(200L, 244L, 33L, 66L, 2L, 199L, 319L, 290L, 188L, 573L, 122L)),
class = "data.frame", row.names = c(NA, -11L))
One dplyr and tidyr possibility could be:
df %>%
group_by(group, type) %>%
complete(index = full_seq(1:max(index), 1), fill = list(vol = 0))
group type index vol
<fct> <int> <dbl> <dbl>
1 A 1 1 200
2 A 1 2 244
3 A 1 3 0
4 A 1 4 0
5 A 1 5 33
6 A 2 1 0
7 A 2 2 66
8 A 2 3 2
9 A 2 4 199
10 A 2 5 0
11 A 2 6 0
12 A 2 7 0
13 A 2 8 0
14 A 2 9 0
15 A 2 10 319
16 B 1 1 0
17 B 1 2 0
18 B 1 3 0
19 B 1 4 290
20 B 1 5 188
21 B 1 6 573
22 B 1 7 0
23 B 1 8 0
24 B 1 9 122
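For comparison, the same fill can be sketched in data.table by building the 1..max(index) sequence per group and joining the data onto it. The names dt, grid and out are illustrative, and group is stored as character here rather than factor:

```r
library(data.table)

dt <- data.table(
  group = rep(c("A", "B"), times = c(7, 4)),
  type  = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
  index = c(1L, 2L, 5L, 2L, 3L, 4L, 10L, 4L, 5L, 6L, 9L),
  vol   = c(200L, 244L, 33L, 66L, 2L, 199L, 319L, 290L, 188L, 573L, 122L)
)

# One row per index from 1 to the group's own maximum
grid <- dt[, .(index = seq_len(max(index))), by = .(group, type)]

# Join the observed rows onto the grid; missing indices get NA in `vol`
out <- dt[grid, on = .(group, type, index)]

# Fill the gaps with 0
out[is.na(vol), vol := 0L]
```

The result has one row per index for each group and type, in the order of grid.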

Create a new variable with the max or min of another variable, by group [duplicate]

R Community: I am trying to create a new variable based on the value of an existing variable, not on a row-wise basis but on a group-wise basis. I'm trying to create max.var and min.var below based on old.var without collapsing or aggregating the rows, that is, preserving all the id rows:
id old.var min.var max.var
1 1 1 3
1 2 1 3
1 3 1 3
2 5 5 11
2 7 5 11
2 9 5 11
2 11 5 11
3 3 3 4
3 4 3 4
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), old.var =
c(1L,
2L, 3L, 5L, 7L, 9L, 11L, 3L, 4L), min.var = c(1L, 1L, 1L, 5L,
5L, 5L, 5L, 3L, 3L), max.var = c(3L, 3L, 3L, 11L, 11L, 11L, 11L,
4L, 4L)), .Names = c("id", "old.var", "min.var", "max.var"), class = "data.frame", row.names = c(NA,
-9L))
I've tried using the aggregate and by functions, but they of course summarize the data. I haven't had much luck trying an Excel-like MATCH/INDEX approach either. Thanks in advance for your assistance!
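You can use dplyr:
library(dplyr)
## mutate() after group_by() keeps every row and repeats the group summary on each one
df %>%
  group_by(id) %>%
  mutate(min.var = min(old.var), max.var = max(old.var))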
#Source: local data frame [9 x 4]
#Groups: id [3]
# id old.var min.var max.var
# (int) (int) (int) (int)
#1 1 1 1 3
#2 1 2 1 3
#3 1 3 1 3
#4 2 5 5 11
#5 2 7 5 11
#6 2 9 5 11
#7 2 11 5 11
#8 3 3 3 4
#9 3 4 3 4
Using ave as docendo discimus pointed out in the question's comments:
df$min.var <- ave(df$old.var, df$id, FUN = min)
df$max.var <- ave(df$old.var, df$id, FUN = max)
Output:
id old.var min.var max.var
1 1 1 1 3
2 1 2 1 3
3 1 3 1 3
4 2 5 5 11
5 2 7 5 11
6 2 9 5 11
7 2 11 5 11
8 3 3 3 4
9 3 4 3 4
We can use data.table
library(data.table)
setDT(df1)[, c('min.var', 'max.var') := list(min(old.var), max(old.var)) , by = id]
df1
# id old.var min.var max.var
#1: 1 1 1 3
#2: 1 2 1 3
#3: 1 3 1 3
#4: 2 5 5 11
#5: 2 7 5 11
#6: 2 9 5 11
#7: 2 11 5 11
#8: 3 3 3 4
#9: 3 4 3 4

How to find the last occurrence of a certain observation in grouped data in R?

I have data that is grouped using dplyr in R. I would like to find the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'), in terms of the 'day' they occurred. I would like the value of 'day' for each group to be given in a new column.
For example, given the following sample of data, grouped by A (this has been simplified, my data is actually grouped by 3 variables):
A B day
a 2 1
a 2 2
a 1 5
a 0 8
b 3 1
b 3 4
b 3 6
b 0 7
b 0 9
c 1 2
c 1 3
c 1 4
I would like to achieve the following:
A B day last
a 2 1 5
a 2 2 5
a 1 5 5
a 0 8 5
b 3 1 6
b 3 4 6
b 3 6 6
b 0 7 6
b 0 9 6
c 1 2 4
c 1 3 4
c 1 4 4
I hope this makes sense, thank you all very much for your help! I have thoroughly searched for my answer online but couldn't find anything. However, if I have accidentally duplicated a question then I apologise.
We can try
library(data.table)
setDT(df1)[, last := day[tail(which(B>=1),1)] , A]
df1
# A B day last
# 1: a 2 1 5
# 2: a 2 2 5
# 3: a 1 5 5
# 4: a 0 8 5
# 5: b 3 1 6
# 6: b 3 4 6
# 7: b 3 6 6
# 8: b 0 7 6
# 9: b 0 9 6
#10: c 1 2 4
#11: c 1 3 4
#12: c 1 4 4
Or using dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(last = day[max(which(B>=1))])
Or use the last function from dplyr (as @docendo discimus suggested)
df1 %>%
group_by(A) %>%
mutate(last= last(day[B>=1]))
For the second question,
setDT(df1)[, dayafter:= if(all(!!B)) NA_integer_ else
day[max(which(B!=0))+1L] , A]
# A B day dayafter
# 1: a 2 1 8
# 2: a 2 2 8
# 3: a 1 5 8
# 4: a 0 8 8
# 5: b 3 1 7
# 6: b 3 4 7
# 7: b 3 6 7
# 8: b 0 7 7
# 9: b 0 9 7
#10: c 1 2 NA
#11: c 1 3 NA
#12: c 1 4 NA
Here is a solution that does not require loading external packages:
df <- structure(list(A = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L), day = c(1L,
2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)), .Names = c("A",
"B", "day"), class = "data.frame", row.names = c(NA, -12L))
x <- split(df, df$A, drop = TRUE)
tp <- lapply(x, function(k) {
tmp <- k[k$B >0,]
k$last <- tmp$day[length(tmp$day)]
k
})
do.call(rbind, tp)
# A B day last
#a.1 a 2 1 5
#a.2 a 2 2 5
#a.3 a 1 5 5
#a.4 a 0 8 5
#b.5 b 3 1 6
#b.6 b 3 4 6
#b.7 b 3 6 6
#b.8 b 0 7 6
#b.9 b 0 9 6
#c.10 c 1 2 4
#c.11 c 1 3 4
#c.12 c 1 4 4
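The answers above assume that every group has at least one row with B >= 1. A base R sketch using ave() that returns NA for a group with no such row is shown below. The helper last_day and the extra group "d" are hypothetical, added only to show the empty case, and "last" is taken in row order, so the data are assumed to be sorted by day within each group:

```r
df <- data.frame(
  A   = rep(c("a", "b", "c", "d"), times = c(4, 5, 3, 2)),
  B   = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 0L, 0L),
  day = c(1L, 2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L, 1L, 2L)
)

# Hypothetical helper: i holds the row numbers of one group.
# Returns the day of the last row with B >= 1, or NA if there is none.
last_day <- function(i) {
  hit <- i[df$B[i] >= 1]
  if (length(hit) > 0) df$day[max(hit)] else NA_integer_
}

# ave() applies the helper per group and repeats the result on every row
df$last <- ave(seq_len(nrow(df)), df$A, FUN = last_day)
```

Passing row numbers to ave() rather than a single column lets the helper look at both B and day for the group.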
