from to table for missing values - r

In the data frame below there are a number of continuous days with missing values.
I want to create a table that shows the missing days
Expected output
Table of missing values
from to
2012-01-08 2012-01-12
2012-01-18 2012-01-22
2012-01-29 2012-02-01
I tried to do it using this code
library(dplyr)
df$Date <- as.Date(df$Date, format = "%d-%b-%Y")
from_to_table_NA <- df %>%
dplyr::filter(is.na(value)) %>%
dplyr::summarise(from = min(Date),
to = max(Date))
> from_to_table_NA
from to
1 2012-01-08 2012-02-01
As expected, it gave me only the overall minimum and maximum dates of the missing values. I would highly appreciate any suggestion on how to get the desired output.
DATA
df <- read.table(text = c("
Date value
5-Jan-2012 5
6-Jan-2012 2
7-Jan-2012 3
8-Jan-2012 NA
9-Jan-2012 NA
10-Jan-2012 NA
11-Jan-2012 NA
12-Jan-2012 NA
13-Jan-2012 4
14-Jan-2012 5
15-Jan-2012 5
16-Jan-2012 7
17-Jan-2012 5
18-Jan-2012 NA
19-Jan-2012 NA
20-Jan-2012 NA
21-Jan-2012 NA
22-Jan-2012 NA
23-Jan-2012 12
24-Jan-2012 5
25-Jan-2012 7
26-Jan-2012 8
27-Jan-2012 8
28-Jan-2012 10
29-Jan-2012 NA
30-Jan-2012 NA
31-Jan-2012 NA
1-Feb-2012 NA
2-Feb-2012 12"), header = TRUE)

You need to group by runs of consecutive days. This can be done by taking the cumulative sum of a condition that is TRUE whenever the difference between a day and the previous missing day is not exactly 1:
df %>%
filter(is.na(value)) %>%
group_by(g = cumsum(coalesce(as.numeric(Date - lag(Date)), 1) != 1)) %>%
summarise(from = min(Date),
to = max(Date))
Gives:
# A tibble: 3 x 3
g from to
<int> <date> <date>
1 0 2012-01-08 2012-01-12
2 1 2012-01-18 2012-01-22
3 2 2012-01-29 2012-02-01
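The same idea can be sketched in base R without dplyr. This is only a minimal sketch: it assumes Date is already of class Date, the rows are sorted by date, and at least one value is missing. The names na_days, starts, ends and from_to are illustrative, and the data frame is rebuilt here so the block runs on its own.

```r
# Rebuild data with the same dates and values as in the question
df <- data.frame(
  Date  = as.Date("2012-01-05") + 0:28,
  value = c(5, 2, 3, rep(NA, 5), 4, 5, 5, 7, 5, rep(NA, 5),
            12, 5, 7, 8, 8, 10, rep(NA, 4), 12)
)

# Dates on which the value is missing
na_days <- df$Date[is.na(df$value)]

# A run starts at the first missing day and after every gap larger than 1 day;
# a run ends just before every such gap and at the last missing day
gap    <- diff(na_days) != 1
starts <- c(TRUE, gap)
ends   <- c(gap, TRUE)

from_to <- data.frame(from = na_days[starts], to = na_days[ends])
from_to
```

Because starts and ends mark the same runs from opposite sides, the two columns always have the same length.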

Related

Fill missing dates in several time series stored in same database

I'm a complete beginner to R and I just need to do some quick cleaning of my data. But I ran into a problem I can't wrap my head around.
So I have a Postgres db with time series. Columns are ID, DATE and VALUE (temperature). Each ID is a new measuring station, so I have a time series for each ID (around 2000 unique IDs, 4m rows). The dates span from 1915 to 2016; some series overlap, some do not. If a measurement is missing for a week, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(Date.seq) creates NA values for all weeks between 1915 and 2016, and I clearly understand why it happens. How can I make it fill values only between the actual start and end date of each specific time series? I want a moving min and max that depend on the start date and end date of each specific ID, and then fill the missing dates between the start and end date of each ID.
library("RPostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I don't know why this works and your whole code does not, but possibly in your code, the data structure is not what is needed. If so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't drawing from the package you need. Using tidyr::complete works on the sample.
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
filter(ID == 1)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df2 <- df %>%
filter(ID == 2)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df3 <- df %>%
filter(ID == 3)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
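The three filtered copies can probably be avoided by grouping before calling complete(), so that min(DATE) and max(DATE) are evaluated within each ID and the ID column is kept for the inserted rows. This is a sketch on the sample data only, assuming dplyr and tidyr are installed; it has not been tried on the full Postgres table, and the name filled is illustrative.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1", header = TRUE)

filled <- df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%
  # the weekly sequence is built from each ID's own first and last date
  complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%
  ungroup()

filled
```

On the sample this should add one row per missing week (2015-10-22, 1956-01-08 and 1982-01-08) with VALUE set to NA.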

How to create monthly non-cumulative subtotals in R with dplyr?

I would like to calculate monthly non-cumulative subtotals for my data frame (df).
"date" "id" "change"
2010-01-01 1 NA
2010-01-07 2 3
2010-01-15 2 -1
2010-02-01 1 NA
2010-02-04 2 7
2010-02-22 2 -2
2010-02-26 2 4
2010-03-01 1 NA
2010-03-14 2 -4
2010-04-01 1 NA
A new period starts at the first day of a new month. The column "id" serves as a grouping variable for the beginning of a new period (==1) and observations within a period (==2). The goal is to sum up all changes within a month and then restart at 0 for the next period. The output should be stored in an additional column of df.
Here a reproducible example for my data frame:
require(dplyr)
require(tidyr)
require(lubridate)
date <- ymd(c("2010-01-01","2010-01-07","2010-01-15","2010-02-01","2010-02-04","2010-02-22","2010-02-26","2010-03-01","2010-03-14","2010-04-01"))
df <- data.frame(date)
df$id <- as.numeric((c(1,2,2,1,2,2,2,1,2,1)))
df$change <- c(NA,3,-1,NA,7,-2,4,NA,-4,NA)
What I have tried to do:
df <- df %>%
group_by(id) %>%
mutate(total = cumsum(change)) %>%
ungroup() %>%
fill(total, .direction = "down") %>%
filter(id == 1)
Which leads to this output:
"date" "id" "change" "total"
2010-01-01 1 NA NA
2010-02-01 1 NA 2
2010-03-01 1 NA 11
2010-04-01 1 NA 7
The problem lies with the function cumsum, which accumulates all the preceding values from a group and does not restart at 0 for a new period.
The desired output looks like this:
"date" "id" "change" "total"
2010-01-01 1 NA NA
2010-02-01 1 NA 2
2010-03-01 1 NA 9
2010-04-01 1 NA -4
The rows with "id" == 1 show the sum of changes for all preceding rows with "id" == 2, restarting at 0 for every period. Is there a specific command for this type of problem? Could anyone provide a corrected alternative to the code above?
We may need to also use year-month formatted 'date' in the grouping variable to reset for each month
library(dplyr)
library(tidyr) # provides fill()
df %>%
group_by(id, grp = format(date, "%Y-%m")) %>%
mutate(total = cumsum(change)) %>%
ungroup() %>%
fill(total, .direction = "down") %>%
filter(id == 1) %>%
select(-grp)
# A tibble: 4 x 4
# date id change total
# <date> <dbl> <dbl> <dbl>
#1 2010-01-01 1 NA NA
#2 2010-02-01 1 NA 2
#3 2010-03-01 1 NA 9
#4 2010-04-01 1 NA -4
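For comparison, the same totals can be sketched in base R by summing per month and shifting the result by one period. This is a minimal sketch that assumes the rows are sorted by date, the months are consecutive, and each month has exactly one row with id == 1; the names month, monthly and starts are illustrative.

```r
date <- as.Date(c("2010-01-01", "2010-01-07", "2010-01-15", "2010-02-01",
                  "2010-02-04", "2010-02-22", "2010-02-26", "2010-03-01",
                  "2010-03-14", "2010-04-01"))
df <- data.frame(date,
                 id = c(1, 2, 2, 1, 2, 2, 2, 1, 2, 1),
                 change = c(NA, 3, -1, NA, 7, -2, 4, NA, -4, NA))

# Sum the changes within each calendar month; NA rows are ignored
month   <- format(df$date, "%Y-%m")
monthly <- tapply(df$change, month, sum, na.rm = TRUE)

# Each period-start row (id == 1) reports the total of the previous month,
# so the monthly sums are shifted down by one and the first one is NA
starts <- df[df$id == 1, ]
starts$total <- c(NA, head(as.vector(monthly), -1))
starts
```

Since every month is summed independently, nothing accumulates across periods.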

How do I output the max value within a range of rows in a data frame?

Suppose I have the following data and data frame:
sample_data <- c(1:14)
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- as.data.frame(sample_data)
sample_df$sample_data2 <- sample_data2
When I print this data frame, the results are as follows:
sample_data sample_data2
1 1 <NA>
2 2 <NA>
3 3 <NA>
4 4 break
5 5 <NA>
6 6 <NA>
7 7 break
8 8 <NA>
9 9 <NA>
10 10 <NA>
11 11 <NA>
12 12 <NA>
13 13 <NA>
14 14 break
How would I program it so that at every "break", it outputs the max from that row up? For instance, I would want the code to output the set (4, 7, 14). Additionally, I would want it to find the max only within each interval, that is, from just after the previous "break" up to the current one.
I apologize in advance if I used any incorrect nomenclature.
I construct the groups looking for the word "break" and then move the results one row up. Then some dplyr commands to get max of every group.
library(dplyr)
sample_df_new <- sample_df %>%
mutate(group = c(1, cumsum(grepl("break", sample_data2)) + 1)[1:length(sample_data2)]) %>%
group_by(group) %>%
summarise(group_max = max(sample_data))
> sample_df_new
# A tibble: 3 x 2
group group_max
<dbl> <dbl>
1 1 4
2 2 7
3 3 14
I have an answer using data.table:
library(data.table)
sample_df <- setDT(sample_df)
sample_df[,group := (rleid(sample_data2)-0.5)%/%2]
sample_df[,.(maxvalues = max(sample_data)),by = group]
group maxvalues
1: 0 4
2: 1 7
3: 2 14
The tricky part is (rleid(sample_data2)-0.5)%/%2: rleid creates an index that increases by one at each change of value:
sample_data sample_data2 rleid
1: 1 NA 1
2: 2 NA 1
3: 3 NA 1
4: 4 break 2
5: 5 NA 3
6: 6 NA 3
7: 7 break 4
8: 8 NA 5
9: 9 NA 5
10: 10 NA 5
11: 11 NA 5
12: 12 NA 5
13: 13 NA 5
14: 14 break 6
If you subtract 0.5 from that index and take the integer quotient of the division by 2 (%/% 2), you get a constant index for each block of rows ending in a "break", which you can use for the grouping operation:
sample_data sample_data2 group
1: 1 NA 0
2: 2 NA 0
3: 3 NA 0
4: 4 break 0
5: 5 NA 1
6: 6 NA 1
7: 7 break 1
8: 8 NA 2
9: 9 NA 2
10: 10 NA 2
11: 11 NA 2
12: 12 NA 2
13: 13 NA 2
14: 14 break 2
Then it is just a matter of taking the maximum for each group. You can easily translate it into dplyr if that is easier for you.
Here are 2 ways with base R. The trick is to define a grouping variable, grp.
grp <- !is.na(sample_df$sample_data2) & sample_df$sample_data2 == "break"
grp <- rev(cumsum(rev(grp)))
grp <- -1*grp + max(grp)
tapply(sample_df$sample_data, grp, max, na.rm = TRUE)
aggregate(sample_data ~ grp, sample_df, max, na.rm = TRUE)
Data.
This is simplified data creation code.
sample_data <- 1:14
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- data.frame(sample_data, sample_data2)
Looks like there are lots of different ways of doing this. This is how I went about it:
rows <- which(sample_data2 == "break") #Get the row indices for where "break" appears
findmax <- function(maxrow) {
max(sample_data[1:maxrow])
} #Create a function that returns the max "up to" a given row
sapply(rows, findmax) #apply it for each of your rows
### [1] 4 7 14
Note that this works "up to" the given row. To get the maximum value between the two breaks would probably be easier with one of the other solutions, but you could also do it by looking at the j-1 row to jth row from the rows object.
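The j-1 to j idea mentioned above could look roughly like this. It is only a sketch, assuming the data ends with a "break" row as in the example; starts and segment_max are illustrative names.

```r
sample_data  <- 1:14
sample_data2 <- c(NA, NA, NA, "break", NA, NA, "break",
                  NA, NA, NA, NA, NA, NA, "break")

rows <- which(sample_data2 == "break")   # row index of every "break"

# Each segment starts one row after the previous "break" (or at row 1)
starts <- c(1, head(rows, -1) + 1)

# Max of sample_data within each segment, from its start to its "break" row
segment_max <- mapply(function(from, to) max(sample_data[from:to]),
                      starts, rows)
segment_max
### [1] 4 7 14
```

With increasing data like this the result is the same as the "up to" version; the two only differ when an earlier segment contains a larger value than a later one.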
Depending whether you want to assess the maximum "sample_data" number between all "sample_data2" == break including (e.g. row 1 to row 4) or excluding (e.g. row 1 to row 3) the given "sample_data2" == break row, you can do something like this with tidyverse:
Excluding the break rows:
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
filter(is.na(sample_data2)) %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 3.
2 2 6.
3 3 13.
Including the break rows:
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 4.
2 2 7.
3 3 14.
Both snippets create an ID variable called "temp" using gl() for the rows where "sample_data2" == break and then fill up the NA rows with that ID. The first snippet then filters out the "sample_data2" == break rows and computes the maximum "sample_data" value per group, while the second computes the maximum "sample_data" value per group including the "sample_data2" == break rows.

Insert Missing Consecutive Weeks By Group

I have a dataset that contains weekly data. The week starts on a Monday and ends on a Sunday. This dataset is also broken out by group.
I want to detect if there are any missing consecutive dates between the start and finish for each group. Here is an example dataset:
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-05-04', '2015-06-29', '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A', 'A','B','B','B','B')
Value<- c(2,3,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
df
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-27 B 9
2015-08-03 B 8
For group B, there is missing data between 2015-07-06 and 2015-07-27. There is also missing data in group A between 2015-04-20 and 2015-05-04. I want to add a row for that group and have the value be NA. I have many groups and I want my expected output to be below:
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-04-27 A NA
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-13 B NA
2015-07-20 B NA
2015-07-27 B 9
2015-08-03 B 8
Any help would be great, thanks!
You can use complete from tidyr package, i.e.
library(tidyverse)
df %>%
group_by(Group) %>%
complete(Week = seq(min(Week), max(Week), by = 'week'))
which gives,
# A tibble: 10 x 3
# Groups: Group [2]
Group Week Value
<fct> <date> <dbl>
1 A 2015-04-13 2
2 A 2015-04-20 3
3 A 2015-04-27 NA
4 A 2015-05-04 10
5 B 2015-06-29 4
6 B 2015-07-06 11
7 B 2015-07-13 NA
8 B 2015-07-20 NA
9 B 2015-07-27 9
10 B 2015-08-03 8
The only way I've found to do this is using an inequality join in SQL.
library(tidyverse)
library(sqldf)
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04',
'2015-06-29', '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#what are the start and end weeks for each group?
GroupWeeks <- df %>%
group_by(Group) %>%
summarise(start = min(Week),
end = max(Week))
#What are all the possible weeks?
AllWeeks <- data.frame(Week = seq.Date(min(df$Week), max(df$Week), by = "week"))
#use an inequality join to add rows for every week within the group's range
sqldf("Select AllWeeks.Week, GroupWeeks.[Group], Value
From AllWeeks inner join GroupWeeks on AllWeeks.Week >= start AND AllWeeks.Week <= end
left join df on AllWeeks.Week = df.Week and GroupWeeks.[Group] = df.[Group]")
This can be achieved using seq function. Here is the code snippet.
Code:
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04', '2015-06-29','2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#generate all weekly dates between the first and last week of group B
alldates = seq(min(df$Week[df$Group == 'B']), max(df$Week[df$Group == 'B']), 7)
#keep only the dates that are not already present for group B
dates = alldates[!(alldates %in% df$Week[df$Group == 'B'])]
#add these new dates to a new dataframe and rbind with the old dataframe
new_df = data.frame(Week = dates,Group = 'B', Value = NA)
df = rbind(df, new_df)
df = df[order(df$Week),]
Output:
Week Group Value
1 2015-04-13 A 2
2 2015-04-20 A 3
3 2015-04-27 A 2
4 2015-05-04 A 10
5 2015-06-29 B 4
6 2015-07-06 B 11
9 2015-07-13 B NA
10 2015-07-20 B NA
7 2015-07-27 B 9
8 2015-08-03 B 8
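The snippet above handles group B only. One way to extend it to every group, sketched here in base R under the assumption that weeks are always exactly 7 days apart, is to split by group, build each group's full weekly sequence, and left-join the observed values. The names all_weeks and filled are illustrative, and the data is the question's original example.

```r
Week  <- as.Date(c("2015-04-13", "2015-04-20", "2015-05-04",
                   "2015-06-29", "2015-07-06", "2015-07-27", "2015-08-03"))
Group <- c("A", "A", "A", "B", "B", "B", "B")
Value <- c(2, 3, 10, 4, 11, 9, 8)
df <- data.frame(Week, Group, Value)

filled <- do.call(rbind, lapply(split(df, df$Group), function(g) {
  # every week between this group's first and last observed week
  all_weeks <- data.frame(Week  = seq(min(g$Week), max(g$Week), by = "week"),
                          Group = g$Group[1])
  # all.x = TRUE keeps the weeks without a match; their Value becomes NA
  merge(all_weeks, g, by = c("Week", "Group"), all.x = TRUE)
}))
rownames(filled) <- NULL
filled
```

On this data the result should have 10 rows, with NA for 2015-04-27 in group A and for 2015-07-13 and 2015-07-20 in group B.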

R - Collapsing rows to closest date when NA

Let's suppose I have this dataframe:
Date A B
2010-01-01 NA 1
2010-01-02 2 NA
2010-01-05 3 NA
2010-01-07 NA 4
2010-01-20 5 NA
2010-01-25 6 7
I want to collapse rows, removing the NA values to the closest Date. So the result would be:
Date A B
2010-01-02 2 1
2010-01-07 3 4
2010-01-20 5 NA
2010-01-25 6 7
I saw a Stack Overflow question that solves collapsing using a key value, but I could not find a similar case that collapses rows based on close date values.
Obs1: It would be good if there was a way to not collapse the rows if the dates are too far apart (example: more than 15 days apart).
Obs2: It would be good if the collapsing lines kept the latter date rather than the earlier, as shown in the example above.
Using the dplyr package, one option is to group_by a combination of A and B such that the rows in each group together form one complete row.
Considering Obs2, the max of Date should be taken for the combined row.
library(dplyr)
library(lubridate)
df %>% mutate(Date = ymd(Date)) %>%
mutate(GrpA = cumsum(!is.na(A)), GrpB = cumsum(!is.na(B))) %>%
rowwise() %>%
mutate(Grp = max(GrpA, GrpB)) %>%
ungroup() %>%
select(-GrpA, -GrpB) %>%
group_by(Grp) %>%
summarise(Date = max(Date), A = A[!is.na(A)][1], B = B[!is.na(B)][1])
# # A tibble: 4 x 4
# Grp Date A B
# <int> <date> <int> <int>
# 1 1 2010-01-02 2 1
# 2 2 2010-01-07 3 4
# 3 3 2010-01-20 5 NA
# 4 4 2010-01-25 6 7
Data:
df <- read.table(text =
"Date A B
2010-01-01 NA 1
2010-01-02 2 NA
2010-01-05 3 NA
2010-01-07 NA 4
2010-01-20 5 NA
2010-01-25 6 7",
stringsAsFactors = FALSE, header = TRUE)
