Filling missing dates in a grouped time series - a tidyverse-way? - r

Given a data.frame that contains a time series and one or more grouping fields, we effectively have several time series, one for each grouping combination.
But some dates are missing.
So, what's the easiest (in the sense of the most "tidyverse") way of adding these dates with the right grouping values?
Normally I would say I generate a data.frame with all dates and do a full_join with my time series. But now we have to do it for each combination of grouping values -- and fill in the grouping values.
Let's look at an example:
First I create a data.frame with missing values:
library(dplyr)
library(lubridate)
set.seed(1234)
# Time series should run from 2017-01-01 to 2017-01-10
date <- data.frame(date = seq.Date(from=ymd("2017-01-01"), to=ymd("2017-01-10"), by="days"), v = 1)
# Two grouping dimensions
d1 <- data.frame(d1 = c("A", "B", "C", "D"), v = 1)
d2 <- data.frame(d2 = c(1, 2, 3, 4, 5), v = 1)
# Generate the data.frame
df <- full_join(date, full_join(d1, d2)) %>%
select(date, d1, d2)
# and add two value columns
df$v1 <- runif(200)
df$v2 <- runif(200)
# group by the dimension columns
df <- df %>%
group_by(d1, d2)
# create missing dates
df.missing <- df %>%
filter(v1 <= 0.8)
# So now 2017-01-01 and 2017-01-10 are missing for A, 5
df.missing %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 8 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
So to add the missing dates I generate a data.frame with all dates:
start <- min(df.missing$date)
end <- max(df.missing$date)
all.dates <- data.frame(date=seq.Date(start, end, by="day"))
Now I want to do something like this (remember: df.missing is group_by(d1, d2)):
df.missing %>%
do(my_join())
So let's define my_join():
my_join <- function(data) {
# get value of both dimensions
d1.set <- data$d1[[1]]
d2.set <- data$d2[[1]]
tmp <- full_join(data, all.dates) %>%
# First we need to ungroup. Otherwise we can't change d1 and d2 because they are grouping variables
ungroup() %>%
mutate(
d1 = d1.set,
d2 = d2.set
) %>%
group_by(d1, d2)
return(tmp)
}
Now we can call my_join() for each combination and have a look at "A/5"
df.missing %>%
do(my_join(.)) %>%
filter(d1 == "A" & d2 == 5)
# A tibble: 10 x 5
# Groups: d1, d2 [1]
date d1 d2 v1 v2
<date> <fctr> <dbl> <dbl> <dbl>
1 2017-01-02 A 5 0.21879954 0.1335497
2 2017-01-03 A 5 0.32977018 0.9802127
3 2017-01-04 A 5 0.23902573 0.1206089
4 2017-01-05 A 5 0.19617465 0.7378315
5 2017-01-06 A 5 0.13373890 0.9493668
6 2017-01-07 A 5 0.48613541 0.3392834
7 2017-01-08 A 5 0.35698708 0.3696965
8 2017-01-09 A 5 0.08498474 0.8354756
9 2017-01-01 A 5 NA NA
10 2017-01-10 A 5 NA NA
Great! That's what we were looking for.
But we need to define d1 and d2 in my_join and it feels a little bit clumsy.
So, is there any tidyverse-way of this solution?
P.S.: I've put the code into a gist: https://gist.github.com/JerryWho/1bf919ef73792569eb38f6462c6d7a8e

tidyr has some great tools for these sorts of problems. Take a look at complete.
library(dplyr)
library(tidyr)
library(lubridate)
want <- df.missing %>%
ungroup() %>%
complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))
want %>% filter(d1 == "A" & d2 == 5)
#> # A tibble: 10 x 5
#> d1 d2 date v1 v2
#> <fctr> <dbl> <date> <dbl> <dbl>
#> 1 A 5 2017-01-01 NA NA
#> 2 A 5 2017-01-02 0.21879954 0.1335497
#> 3 A 5 2017-01-03 0.32977018 0.9802127
#> 4 A 5 2017-01-04 0.23902573 0.1206089
#> 5 A 5 2017-01-05 0.19617465 0.7378315
#> 6 A 5 2017-01-06 0.13373890 0.9493668
#> 7 A 5 2017-01-07 0.48613541 0.3392834
#> 8 A 5 2017-01-08 0.35698708 0.3696965
#> 9 A 5 2017-01-09 0.08498474 0.8354756
#> 10 A 5 2017-01-10 NA NA
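To see what nesting() contributes, here is a minimal sketch on a made-up table (the names toy, g1, g2, v and all_days are illustrative assumptions, not taken from the question). nesting(g1, g2) expands only the group combinations that actually occur in the data, whereas listing g1 and g2 separately also creates combinations that never appear.

```r
library(tidyr)

# Hypothetical toy data: two observed groups, each missing one day
toy <- data.frame(
  g1   = c("A", "A", "B", "B"),
  g2   = c(1, 1, 2, 2),
  date = as.Date(c("2017-01-01", "2017-01-03", "2017-01-02", "2017-01-03")),
  v    = c(0.1, 0.2, 0.3, 0.4)
)

# Every day between the first and last observed date (3 days here)
all_days <- seq(min(toy$date), max(toy$date), by = "day")

# Only the observed (g1, g2) pairs are expanded: 2 pairs x 3 days
nested <- complete(toy, nesting(g1, g2), date = all_days)

# Every g1 x g2 combination is expanded: 2 x 2 x 3 days
crossed <- complete(toy, g1, g2, date = all_days)

nrow(nested)
nrow(crossed)
```

Rows that did not exist before get NA in v; complete() also takes a fill argument if a different placeholder is wanted.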

The fill_gaps function from the tsibble package should do the job easily.
library(tsibble)
df.missing %>%
# tsibble format
as_tsibble(key = c(d1, d2), index = date) %>%
# fill gaps
fill_gaps(.full = TRUE)

Here's a tidyverse way starting with df.missing
library(tidyverse)
ans <- df.missing %>%
nest(date) %>%
mutate(data = map(data, ~seq.Date(start, end, by="day"))) %>%
unnest(data) %>%
rename(date = data) %>%
left_join(., df.missing, by=c("date","d1","d2"))
ans %>% filter(d1 == "A" & d2 == 5)
Output
d1 d2 date v1 v2
<fctr> <dbl> <date> <dbl> <dbl>
1 A 5 2017-01-01 NA NA
2 A 5 2017-01-02 0.21879954 0.1335497
3 A 5 2017-01-03 0.32977018 0.9802127
4 A 5 2017-01-04 0.23902573 0.1206089
5 A 5 2017-01-05 0.19617465 0.7378315
6 A 5 2017-01-06 0.13373890 0.9493668
7 A 5 2017-01-07 0.48613541 0.3392834
8 A 5 2017-01-08 0.35698708 0.3696965
9 A 5 2017-01-09 0.08498474 0.8354756
10 A 5 2017-01-10 NA NA
-------------------------------------------------------------------------------------------------
Here's an alternative approach that uses expand.grid and dplyr verbs
with(df.missing, expand.grid(unique(date), unique(d1), unique(d2))) %>%
setNames(c("date", "d1", "d2")) %>%
left_join(., df.missing, by=c("date","d1","d2"))
output (head)
date d1 d2 v1 v2
1 2017-01-01 A 1 0.113703411 0.660754634
2 2017-01-02 A 1 0.316612455 0.422330675
3 2017-01-03 A 1 0.553333591 0.424109178
4 2017-01-04 A 1 NA NA
5 2017-01-05 A 1 NA NA
6 2017-01-06 A 1 0.035456727 0.352998502

Here read.zoo creates a wide form zoo object and to that we merge the dates. Then we convert that back to a long data frame using fortify.zoo and spread out v1 and v2 using spread.
Note that:
if we can assume that each date appears in at least one combination of the split variables, i.e. sort(unique(df.missing$date)) contains all the dates, then we could omit the merge line and no joins would have to be done at all. The test data df.missing shown in the question does have this property:
all(all.dates$date %in% df.missing$date)
## [1] TRUE
we could stop after the merge (or after read.zoo if each date is present at least once as in prior point) if a wide form zoo object can be used as that already has all the dates.
In the code below the line marked ### can be omitted with the development version of zoo (1.8.1):
library(dplyr)
library(tidyr)
library(zoo)
split.vars <- c("d1", "d2")
df.missing %>%
as.data.frame %>% ###
read.zoo(split = split.vars) %>%
merge(zoo(, seq(start(.), end(.), "day"))) %>%
fortify.zoo(melt = TRUE) %>%
separate(Series, c("v", split.vars)) %>%
spread(v, Value)
Update: Note the simplification in zoo 1.8.1.

Related

How to replace some values of a column in a data frame based on another data frame through a lagged date and ID?

I am trying to replace some values in a column of a data frame with the help of a date and ID in another data frame but I cannot manage to find any solution. It will be more clear with an example.
I have two data frames constructed as followed:
date.1 <- c("01.02.2011","02.02.2011","03.02.2011","04.02.2011","05.02.2011","01.02.2011","02.02.2011","03.02.2011","04.02.2011","05.02.2011")
date.1 <- as.Date(date.1, format="%d.%m.%Y")
values.1 <- c("1","3","5","1","2","6","7","8","9","10")
ID.1 <- c("10","10","10","10","10","11","11","11","11","11")
df.1 <- data.frame(date.1, values.1, ID.1)
names(df.1) <- c("date","values","ID")
date.2 <- c("04.02.2011","04.02.2011")
date.2 <- as.Date(date.2, format="%d.%m.%Y")
values.2 <- c("1", "9")
ID.2 <- c("10","11")
df.2 <- data.frame(date.2, values.2, ID.2)
names(df.2) <- c("date","values","ID")
which looked like:
> df.1
date values ID
1 2011-02-01 1 10
2 2011-02-02 3 10
3 2011-02-03 5 10
4 2011-02-04 1 10
5 2011-02-05 2 10
6 2011-02-01 6 11
7 2011-02-02 7 11
8 2011-02-03 8 11
9 2011-02-04 9 11
10 2011-02-05 10 11
> df.2
date values ID
1 2011-02-04 1 10
2 2011-02-04 9 11
I would like to replace the "values" in df.2 for each ID with the "values" of df.1 on the next date, i.e. with the values on 2011-02-05 but I don't manage to replace them. Thus, I would like to obtain:
> df.2
date values ID
1 2011-02-04 2 10
2 2011-02-04 10 11
Your help would be really appreciated. If any editing of the question is needed, do not hesitate to let me know.
If next date means date + 1 day, then try this:
library(dplyr)
df.2 %>%
mutate(date1 = date + 1) %>%
select(-values) %>%
left_join(df.1, by = c(date1 = "date", ID = "ID")) %>%
select(-date1)
#> date ID values
#> 1 2011-02-04 10 2
#> 2 2011-02-04 11 10
Created on 2020-03-28 by the reprex package (v0.3.0)
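The same shift-by-one-day lookup can also be sketched in base R without a join. This is only an illustration under the assumption that "next date" means date + 1 day; the helper names key.1 and key.2 are made up here, and the data is rebuilt with stringsAsFactors = FALSE so the values stay character.

```r
# Rebuild the question's data
df.1 <- data.frame(
  date   = rep(as.Date("2011-02-01") + 0:4, times = 2),
  values = c("1", "3", "5", "1", "2", "6", "7", "8", "9", "10"),
  ID     = rep(c("10", "11"), each = 5),
  stringsAsFactors = FALSE
)
df.2 <- data.frame(
  date   = as.Date(c("2011-02-04", "2011-02-04")),
  values = c("1", "9"),
  ID     = c("10", "11"),
  stringsAsFactors = FALSE
)

# Build a lookup key from ID and date; for df.2 use the following day
key.2 <- paste(df.2$ID, df.2$date + 1)
key.1 <- paste(df.1$ID, df.1$date)

# match() returns, for each df.2 row, the position of its key in df.1
df.2$values <- df.1$values[match(key.2, key.1)]
df.2
```

If a key has no counterpart in df.1, match() returns NA and the corresponding value becomes NA rather than raising an error.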
Is this what you are looking for?
library(lubridate)
library(dplyr)
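# Note: the comparison recycles df.2$ID and df.2$date along the rows of df.1,
# so it only lines up here because of the row order of the example data.
# pull() returns a plain vector instead of a one-column data frame.
df.2$values <- df.1 %>% filter(ID == df.2$ID & date == (df.2$date + 1)) %>% pull(values)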

Padding around dates in R to add missing/blank months?

The padr R package vignette describes different package functions to pad dates and times around said dates and times.
I am in situations where I'll be tallying events in data frames (ie dplyr::count()) and will need to plot occurrences, over a period of say... 1 year. When I count the events in a low volume data frame I'll often get single line item results, like this:
library(tidyverse)
library(lubridate)
library(padr)
df <- tibble(col1 = as.Date("2018-10-01"), col2 = "g", col3 = 5)
#> # A tibble: 1 x 3
#> col1 col2 col3
#> <date> <chr> <dbl>
#> 1 2018-10-01 g 5
To plot this with ggplot, over a period of a year, on a monthly basis, requires a data frame of 12 rows. It basically needs to look like this:
#> # A tibble: 12 x 3
#> col1 col2 col3
#> <date> <chr> <dbl>
#> 1 2018-01-01 NA 0
#> 2 2018-02-01 NA 0
#> 3 2018-03-01 NA 0
#> 4 2018-04-01 NA 0
#> 5 2018-05-01 NA 0
#> 6 2018-06-01 NA 0
#> 7 2018-07-01 NA 0
#> 8 2018-08-01 NA 0
#> 9 2018-09-01 NA 0
#> 10 2018-10-01 g 5
#> 11 2018-11-01 NA 0
#> 12 2018-12-01 NA 0
Perhaps padr can do this with some combination of the thicken() and pad() functions. My attempts are shown below; neither line 3 nor line 4 constructs the data frame shown directly above.
How do I construct the data frame directly above, using padr, lubridate, tidyverse, data.table, base R, or any way you please? Manual entry of each month is not an option, if that needs to be said. Thank you.
df %>%
thicken("year") %>%
# pad(by = "col1") %>% # line 3
# pad(by = "col1_year") %>% # line 4
print()
library(lubridate)
library(tidyverse)
df <- tibble(col1 = as.Date("2018-10-01"), col2 = "g", col3 = 5)
my_year <- year(df$col1[1])
df2 <- tibble(col1 = seq(ymd(paste0(my_year,'-01-01')),ymd(paste0(my_year,'-12-01')), by = '1 month'))
df3 <- merge(df,df2, by ="col1",all.y=TRUE) %>% mutate(col3 = replace_na(col3,0))
df3
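As an alternative sketch, tidyr::complete() can add the missing months and fill the count in one step. The helper names yr, months_of_year and padded are illustrative assumptions, and a plain data.frame is used instead of a tibble to keep the example self-contained.

```r
library(tidyr)

df <- data.frame(col1 = as.Date("2018-10-01"), col2 = "g", col3 = 5)

# First day of every month in the year of the observed date
yr <- format(df$col1[1], "%Y")
months_of_year <- seq(as.Date(paste0(yr, "-01-01")), by = "month", length.out = 12)

# Add the missing months; col3 is filled with 0, col2 stays NA
padded <- complete(df, col1 = months_of_year, fill = list(col3 = 0))
padded
```

Only columns named in fill get a replacement value, which is why col2 is left as NA for the padded months.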

identifying location of NA values in a data frame by ID (not row number) and column name

I have a survey where some questions were not answered by some participants. Here is a simplified version of my data
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA),
Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
df
I would like to see which ID numbers did not answer which questions. The following code is very close to the output I want but identifies the subject by row number - I would like the subject identified by ID number
table(data.frame(which(is.na(df), arr.ind=TRUE)))
right now the output shows that rows 1,3,5 did not answer at least one question and it identifies the column with the missing value. I would like it show me the same thing but with ID numbers 12,14,16. It would be a bonus if you could have the column names (eg Q1,Q2,Q3) in the output as well instead of column number.
We can get the names of the columns that are NA row-wise using apply, collapse them into a comma-separated string, and attach that to a new data frame along with its ID.
new_df <- data.frame(ID =df$ID, ques = apply(df, 1, function(x)
paste0(names(which(is.na(x))), collapse = ",")))
new_df
# ID ques
#1 12 Q3
#2 13
#3 14 Q2
#4 15
#5 16 Q1,Q2
An equivalent would be
new_df <- data.frame(ID = df$ID, ques = apply(is.na(df), 1, function(x)
paste0(names(which(x)), collapse = ",")))
In base R:
res <- df[!complete.cases(df),]
res[-1] <- as.numeric(is.na(res[-1]))
res
# ID Q1 Q2 Q3
# 12 12 0 0 1
# 14 14 0 1 0
# 16 16 1 1 0
If you wish to avoid apply type operations and continue from which(..., T), you can do something like the following:
tmp <- data.frame(which(is.na(df[, 2:4]), T))
# change to character
tmp[, 2] <- paste0('Q', tmp[, 2])
# gather column numbers together for each row number
tmp_split <- split(tmp[, 2], tmp[, 1])
# preallocate new column in df
df$missing <- vector('list', 5)
df$missing[as.numeric(names(tmp_split))] <- tmp_split
This produces
> df
ID Q1 Q2 Q3 missing
1 12 a a <NA> Q3
2 13 b a a NULL
3 14 a <NA> a Q2
4 15 a b a NULL
5 16 <NA> <NA> b Q1, Q2
You can convert the data to long format using tidyr::gather, filter for the rows where the answer is NA, and finally summarise using toString:
library(tidyverse)
df %>% gather(Question, Ans, -ID) %>%
filter(is.na(Ans)) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question))
# # A tibble: 3 x 2
# ID NotAnswered
# <int> <chr>
# 1 12 Q3
# 2 14 Q2
# 3 16 Q1, Q2
If the OP wants to include all IDs in the result, the solution can be:
df %>% gather(Question, Ans, -ID) %>%
group_by(ID) %>%
summarise(NoAnswered = toString(Question[is.na(Ans)])) %>%
as.data.frame()
# ID NoAnswered
# 1 12 Q3
# 2 13
# 3 14 Q2
# 4 15
# 5 16 Q1, Q2
How's this with tidyverse:
data:
library(tidyverse)
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA), Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
code:
x <- df %>% filter(is.na(Q1) | is.na(Q2) | is.na(Q3)) # keep rows with at least one NA
y <- cbind(x %>% select(ID),
x %>% select(Q1, Q2, Q3) %>% sapply(., function(x) ifelse(is.na(x), 1, 0))
) # in 1/0 format
output:
x:
ID Q1 Q2 Q3
1 12 a a <NA>
2 14 a <NA> a
3 16 <NA> <NA> b
y:
ID Q1 Q2 Q3
1 12 0 0 1
2 14 0 1 0
3 16 1 1 0
My attempt is no better than any already offered, but it's a fun problem, so here's mine. Because why not?:
library( magrittr )
df$ques <- df %>%
is.na() %>%
apply( 1, function(x) {
x %>%
which() %>%
names() %>%
paste0( collapse = "," )
} )
df
# ID Q1 Q2 Q3 ques
# 1 12 a a <NA> Q3
# 2 13 b a a
# 3 14 a <NA> a Q2
# 4 15 a b a
# 5 16 <NA> <NA> b Q1,Q2
Most of the answer comes from your question:
df[which(is.na(df), arr.ind=TRUE)[,1],]
# ID Q1 Q2 Q3
# 5 16 <NA> <NA> b
# 3 14 a <NA> a
# 5.1 16 <NA> <NA> b
# 1 12 a a <NA>
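Staying close to the which(..., arr.ind = TRUE) call from the question, the row and column positions can be translated into ID values and column names directly. This is a minimal base R sketch; the names idx and na_cells are made up for illustration.

```r
df <- data.frame(ID = 12:16,
                 Q1 = c("a", "b", "a", "a", NA),
                 Q2 = c("a", "a", NA, "b", NA),
                 Q3 = c(NA, "a", "a", "a", "b"))

# Row/column positions of every NA (a matrix with columns "row" and "col")
idx <- which(is.na(df), arr.ind = TRUE)

# Translate the positions into ID values and column names
na_cells <- data.frame(ID = df$ID[idx[, "row"]],
                       question = names(df)[idx[, "col"]])
na_cells <- na_cells[order(na_cells$ID), ]

# Cross-tabulate: one row per ID, one column per unanswered question
table(na_cells)
```

Because each NA cell contributes exactly one row to na_cells, an ID with two unanswered questions appears twice there but only once in the table.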

summarize multiple dynamic columns and store results in new columns

I have the following situation.
library(lubridate) # provides months() for the date arithmetic below
df <- rbind(
data.frame(thisDate = rep(seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by="day")) ),
data.frame(thisDate = rep(seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by="day")) ))
df <- cbind(df,lastMonth = as.Date(format(as.Date(df$thisDate - months(1)),"%Y-%m-01")))
df <- cbind(df, prod1Quantity= seq(1:4) )
I have quantities for different days of a month for an unknown number of products. I want one column per product holding that product's total quantity for the whole of the previous month, i.e. grouped by lastMonth, as in the desired output below. I just don't get how to group by, mutate and summarise dynamically, if that is indeed the right approach.
I came across "data.table generate multiple columns and summarize them". It appears to do what I need, but I just don't get how it works!
Desired Output
thisDate lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01 1 NA
2 2018-01-02 2017-12-01 2 NA
3 2018-02-01 2018-01-01 3 3
4 2018-02-02 2018-01-01 4 3
Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
mutate(thisDate_forJoin = as.Date(format(thisDate,"%Y-%m-01")))
final_df <- temp_df %>%
mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
left_join(temp_df %>%
group_by(thisDate_forJoin) %>%
summarise_if(is.numeric, sum),
by="thisDate_forJoin") %>%
select(-thisDate_forJoin)
Output is:
thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01 1 10 NA NA
2 2018-01-02 2 11 NA NA
3 2018-02-01 3 12 3 21
4 2018-02-02 4 13 3 21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13
A solution can be reached by calculating the month-wise production quantity and then joining on month of lastMonth and thisDate.
The lubridate::month function is used to extract the month from a date.
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
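Joining on the month number alone would mix up the same month from different years. A minimal base R sketch that instead keys on the first day of the month and handles every product column at once could look like this; it assumes the product columns all end in "Quantity", and the names monthStart, prod_cols, monthly, prev and result are illustrative.

```r
df <- data.frame(
  thisDate = as.Date(c("2018-01-01", "2018-01-02", "2018-02-01", "2018-02-02")),
  prod1Quantity = 1:4,
  prod2Quantity = 10:13
)

# First day of the month each row belongs to
df$monthStart <- as.Date(format(df$thisDate, "%Y-%m-01"))
# First day of the previous month: step back one day, then truncate again
df$lastMonth <- as.Date(format(df$monthStart - 1, "%Y-%m-01"))

# Monthly totals for all product columns; row names are the month start dates
prod_cols <- grep("Quantity$", names(df), value = TRUE)
monthly <- rowsum(df[prod_cols], group = as.character(df$monthStart))

# Look up the previous month's totals by full date (NA where there is none)
prev <- monthly[match(as.character(df$lastMonth), rownames(monthly)), , drop = FALSE]
names(prev) <- sub("Quantity$", "prevMonth", prod_cols)
rownames(prev) <- NULL

result <- cbind(df[c("thisDate", "lastMonth", prod_cols)], prev)
result
```

Adding a further product column that follows the same naming pattern needs no change to the code, since prod_cols is derived from the column names.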

insert rows between dates by group

I want to insert rows between two dates by group. My current approach is complicated: I insert the missing values by last observation carried forward and then merge. Is there an easier way to achieve this?
# sample data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
dt<-data.frame(user,dummy,date)
user dummy date
1 A 1 2017-01-03
2 A 1 2017-01-06
3 B 1 2016-05-01
4 B 1 2016-05-03
5 B 1 2016-05-05
Desired output
Using dplyr and tidyr (a one-line solution):
library(dplyr)
library(tidyr)
dt %>% group_by(user) %>% complete(date=full_seq(date,1),fill=list(dummy=0))
# A tibble: 9 x 3
# Groups: user [2]
user date dummy
<fctr> <date> <dbl>
1 A 2017-01-03 1
2 A 2017-01-04 0
3 A 2017-01-05 0
4 A 2017-01-06 1
5 B 2016-05-01 1
6 B 2016-05-02 0
7 B 2016-05-03 1
8 B 2016-05-04 0
9 B 2016-05-05 1
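The piece doing the work inside complete() is tidyr::full_seq(), which returns every value between the minimum and maximum of its input at the given step. A small sketch, using made-up inputs (the names nums, d and days are illustrative):

```r
library(tidyr)

# For numbers, full_seq() fills in the gaps between the smallest and largest value
nums <- full_seq(c(1, 2, 5), 1)
nums

# It also works on Date vectors, where a period of 1 means one day
d <- as.Date(c("2016-05-01", "2016-05-03", "2016-05-05"))
days <- full_seq(d, 1)
days
```

Because the call sits after group_by(user) in the answer, the minimum and maximum are taken per user, so each user is padded only within their own date range.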
You can try this:
library(data.table)
setDT(dt)
tmp <- dt[, .(date = seq.Date(min(date), max(date), by = '1 day')), by =
'user']
dt <- merge(tmp, dt, by = c('user', 'date'), all.x = TRUE)
dt[, dummy := ifelse(is.na(dummy), 0, dummy)]
We can use the tidyverse to achieve this task.
library(tidyverse)
dt2 <- dt %>%
group_by(user) %>%
do(date = seq(from = min(.$date), to = max(.$date), by = 1)) %>%
unnest() %>%
left_join(dt, by = c("user", "date")) %>%
replace_na(list(dummy = 0)) %>%
select(colnames(dt))
dt2
# A tibble: 9 x 3
user dummy date
<fctr> <dbl> <date>
1 A 1 2017-01-03
2 A 0 2017-01-04
3 A 0 2017-01-05
4 A 1 2017-01-06
5 B 1 2016-05-01
6 B 0 2016-05-02
7 B 1 2016-05-03
8 B 0 2016-05-04
9 B 1 2016-05-05
The simplest way that I have found to do this is with the padr library.
library(padr)
library(tidyr) # for %>% and replace_na()
dt_padded <- pad(dt, group = "user", by = "date") %>%
replace_na(list(dummy=0))
A Base R (not quite as elegant) solution:
# Data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
df1 <-data.frame(user,dummy,date)
# Solution
do.call(rbind, lapply(split(df1, df1$user), function(df) {
dff <- data.frame(user=df$user[1], dummy=0, date=seq.Date(min(df$date), max(df$date), 'day'))
dff[dff$date %in% df$date, "dummy"] <- df$dummy[1]
dff
}))
# user dummy date
# A 1 2017-01-03
# A 0 2017-01-04
# A 0 2017-01-05
# A 1 2017-01-06
# B 1 2016-05-01
# B 0 2016-05-02
# B 1 2016-05-03
# B 0 2016-05-04
# B 1 2016-05-05
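Assuming your data is called df1 and you want to add the dates between two days, try this:
library(dplyr)
# left_join() needs a data frame, so wrap the date sequence in one
df2 <- data.frame(date = seq.Date(as.Date("2015-01-03"), as.Date("2015-01-06"), by = "day"))
left_join(df2, df1, by = "date")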
If you're simply trying to add a new record, I suggest using rbind.
rbind()
