How to create monthly non-cumulative subtotals in R with dplyr? - r

I would like to calculate monthly non-cumulative subtotals for my data frame (df).
"date" "id" "change"
2010-01-01 1 NA
2010-01-07 2 3
2010-01-15 2 -1
2010-02-01 1 NA
2010-02-04 2 7
2010-02-22 2 -2
2010-02-26 2 4
2010-03-01 1 NA
2010-03-14 2 -4
2010-04-01 1 NA
A new period starts on the first day of each month. The column "id" serves as a grouping variable, marking the beginning of a new period (id == 1) and observations within a period (id == 2). The goal is to sum all changes within a month and then restart at 0 for the next period. The output should be stored in an additional column of df.
Here is a reproducible example for my data frame:
require(dplyr)
require(tidyr)
require(lubridate)
date <- ymd(c("2010-01-01", "2010-01-07", "2010-01-15", "2010-02-01", "2010-02-04",
              "2010-02-22", "2010-02-26", "2010-03-01", "2010-03-14", "2010-04-01"))
df <- data.frame(date)
df$id <- as.numeric(c(1, 2, 2, 1, 2, 2, 2, 1, 2, 1))
df$change <- c(NA, 3, -1, NA, 7, -2, 4, NA, -4, NA)
What I have tried to do:
df <- df %>%
  group_by(id) %>%
  mutate(total = cumsum(change)) %>%
  ungroup() %>%
  fill(total, .direction = "down") %>%
  filter(id == 1)
Which leads to this output:
"date" "id" "change" "total"
2010-01-01 1 NA NA
2010-02-01 1 NA 2
2010-03-01 1 NA 11
2010-04-01 1 NA 7
The problem lies with the function cumsum, which accumulates all the preceding values from a group and does not restart at 0 for a new period.
The desired output looks like this:
"date" "id" "change" "total"
2010-01-01 1 NA NA
2010-02-01 1 NA 2
2010-03-01 1 NA 9
2010-04-01 1 NA -4
The rows with id == 1 show the sum of changes over all preceding rows with id == 2, restarting at 0 for every period. Is there a specific command for this type of problem? Could anyone provide a corrected alternative to the code above?

We may also need the year-month of 'date' in the grouping variable so that the cumulative sum resets for each month (note that fill() requires tidyr):
library(dplyr)
library(tidyr)

df %>%
  group_by(id, grp = format(date, "%Y-%m")) %>%
  mutate(total = cumsum(change)) %>%
  ungroup() %>%
  fill(total, .direction = "down") %>%
  filter(id == 1) %>%
  select(-grp)
# A tibble: 4 x 4
# date id change total
# <date> <dbl> <dbl> <dbl>
#1 2010-01-01 1 NA NA
#2 2010-02-01 1 NA 2
#3 2010-03-01 1 NA 9
#4 2010-04-01 1 NA -4
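An equivalent way to express the logic, as a sketch (the month and monthly names are illustrative, not from the original answer): summarise the changes per month, then lag the result by one period, since each id == 1 row reports the previous month's total.

library(dplyr)

df %>%
  group_by(month = format(date, "%Y-%m")) %>%          # one group per month
  summarise(monthly = sum(change, na.rm = TRUE)) %>%   # monthly subtotal
  mutate(total = lag(monthly))                         # each month shows the previous month's sum

This returns one row per month rather than the filtered id == 1 rows, but the total column (NA, 2, 9, -4) matches the desired output.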

Related

Grouping and Counting by Dates (R)

I am working with the R programming language. I have a data frame that looks like this:
startdate <- c('2010-01-01','2010-01-01','2010-01-01', '2010-01-02','2010-01-03','2010-01-03')
event <- c(1,1,1,1,1,1)
my_data <- data.frame(startdate, event)
startdate event
1 2010-01-01 1
2 2010-01-01 1
3 2010-01-01 1
4 2010-01-02 1
5 2010-01-03 1
6 2010-01-03 1
Note: The actual value of "startdate" is "POSIXct" and is written as "year-month-date".
I am trying to take a cumulative sum of "event" according to the "startdate" column. The result should look like this
startdate <- c('2010-01-01', '2010-01-02' ,'2010-01-03')
event <- c(3,4,6)
my_data_2 <- data.frame(startdate, event)
#desired file
startdate event
1 2010-01-01 3
2 2010-01-02 4
3 2010-01-03 6
I tried to do this with the "dplyr" library:
library(dplyr)
new_file = my_data %>% group_by(startdate) %>% mutate(cumsum_value = cumsum(event))
But this returns something slightly different from what I intended:
startdate event cumsum_value
<chr> <dbl> <dbl>
1 2010-01-01 1 1
2 2010-01-01 1 2
3 2010-01-01 1 3
4 2010-01-02 1 1
5 2010-01-03 1 1
6 2010-01-03 1 2
Can someone please show me how to fix this?
Thanks
my_data %>%
  mutate(cumsum = cumsum(event)) %>%   # running total over all rows
  group_by(startdate) %>%
  summarise(max(cumsum))               # the last running value per date
# A tibble: 3 × 2
startdate `max(cumsum)`
<chr> <dbl>
1 2010-01-01 3
2 2010-01-02 4
3 2010-01-03 6
Mutate the event column to its cumulative sum, group_by startdate, and summarise with max(event):
library(dplyr)

my_data %>%
  mutate(event = cumsum(event)) %>%
  group_by(startdate) %>%
  summarise(event = max(event))
startdate event
<chr> <dbl>
1 2010-01-01 3
2 2010-01-02 4
3 2010-01-03 6
Another option is to make use of duplicated, thus avoiding the group_by altogether. Also, since the 'event' column is all 1s, instead of cumsum we can use the built-in row_number() to create the sequence:
library(dplyr)

my_data %>%
  mutate(event = row_number()) %>%
  filter(!duplicated(startdate, fromLast = TRUE))
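Since each row in the sample is a single event, another sketch (not from the answers above) is to count the rows per startdate and take the cumulative sum of the counts:

library(dplyr)

my_data %>%
  count(startdate) %>%            # n = events per date
  mutate(event = cumsum(n)) %>%   # running total across dates
  select(-n)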

How to add a column with the most recent recurring observation within a group, but within a certain time period, in R

If I had:
person_ID visit_date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class and, grouped by 'person_ID', create a binary column that returns 1 if the difference between the current and next visit_date is less than 90 days, else 0. Using this column, we get the corresponding next visit_date where the value is 1:
library(dplyr)
library(lubridate)
library(tidyr)

df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  mutate(i1 = replace_na(+(difftime(lead(visit_date),
                                    visit_date, units = "days") < 90), 0),
         date = case_when(as.logical(i1) ~ lead(visit_date)),
         i1 = NULL) %>%
  ungroup()
Output:
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
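The answer assumes an input df1 that the question shows only as a table; a minimal reconstruction (the column names person_ID and visit_date are taken from the answer's code) would be:

df1 <- data.frame(
  person_ID = c(1, 1, 1, 2, 3, 3, 3),
  visit_date = c("2/25/2001", "2/27/2001", "4/2/2001", "3/18/2004",
                 "9/22/2004", "10/27/2004", "5/15/2008")
)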

Fill missing dates in several time series stored in same database

I'm a complete beginner with R and just need to do some quick cleaning of my data, but I ran into a problem I can't wrap my head around.
I have a Postgres database with time series; the columns are ID, DATE and VALUE (temperature). Each ID is a measuring station, so there is one time series per ID (around 2000 unique IDs, 4m rows). The dates span 1915-2016; some series overlap and some do not. If a week's measurement is missing, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(Date.seq) creates NA values for all weeks between 1915 and 2016, and I understand clearly why that happens. How can I make it fill only the weeks between the actual start and end date of each specific time series? In other words, I want the min and max to depend on the start and end dates of each ID, and then to fill the missing dates between them.
library("RpostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit 1:
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I don't know why it works when your whole code does not, but possibly the data structure in your code is not what is needed; if so, something like out <- tibble::as_tibble(out) might help. My other guess is that complete isn't coming from the package you need, so I call tidyr::complete explicitly. It works on the sample:
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
  filter(ID == 1) %>%
  mutate(DATE = as.Date(DATE)) %>%
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by = "week"))
big_df2 <- df %>%
  filter(ID == 2) %>%
  mutate(DATE = as.Date(DATE)) %>%
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by = "week"))
big_df3 <- df %>%
  filter(ID == 3) %>%
  mutate(DATE = as.Date(DATE)) %>%
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by = "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
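A more compact sketch of the same idea, avoiding the copy-paste per ID (this generalizes the answer above and is not from it): tidyr::complete() respects dplyr groups, so grouping by ID first makes min(DATE) and max(DATE) evaluate per station:

library(dplyr)
library(tidyr)

df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%                                              # per-station date range
  complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%   # weekly gaps become NA rows
  ungroup()

A side benefit is that ID, as a grouping column, stays filled in for the inserted rows instead of becoming NA.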

dplyr Time Diff between rows

I have a data frame in the format below, and I'm trying to find the time difference between each 'ASSIGNED' event and the last 'CREATED' event that precedes it.
AccountID TIME EVENT
1 2016-11-08T01:54:15.000Z CREATED
1 2016-11-09T01:54:15.000Z ASSIGNED
1 2016-11-10T01:54:15.000Z CREATED
1 2016-11-11T01:54:15.000Z CALLED
1 2016-11-12T01:54:15.000Z ASSIGNED
1 2016-11-12T01:54:15.000Z SLEEP
Currently my code is as follows; my difficulty is selecting the CREATED row that comes just before the ASSIGNED event:
test <- timetable.filter %>%
  group_by(AccountID) %>%
  mutate(timeToAssign = ifelse(EVENT == 'ASSIGNED',
                               interval(ymd_hms(TIME), max(ymd_hms(TIME[EVENT == 'CREATED']))) %/% hours(1),
                               NA))
I'm looking for the output to be
AccountID TIME EVENT timeToAssign
1 2016-11-08T01:54:15.000Z CREATED NA
1 2016-11-09T01:54:15.000Z ASSIGNED 12
1 2016-11-10T01:54:15.000Z CREATED NA
1 2016-11-11T01:54:15.000Z CALLED NA
1 2016-11-12T01:54:15.000Z ASSIGNED 24
1 2016-11-12T01:54:15.000Z SLEEP NA
With dplyr and tidyr:
library(dplyr); library(tidyr); library(anytime)

df %>%
  group_by(AccountID) %>%
  mutate(CREATED_INDEX = if_else(EVENT == 'CREATED', row_number(), NA_integer_),
         TIME = anytime(TIME)) %>%
  fill(CREATED_INDEX) %>%
  mutate(TimeToAssign = if_else(EVENT == 'ASSIGNED',
                                as.numeric(TIME - TIME[CREATED_INDEX], units = 'hours'),
                                NA_real_)) %>%
  select(-CREATED_INDEX)
# A tibble: 6 x 4
# Groups: AccountID [1]
# AccountID TIME EVENT TimeToAssign
# <int> <dttm> <fctr> <dbl>
#1 1 2016-11-08 01:54:15 CREATED NA
#2 1 2016-11-09 01:54:15 ASSIGNED 24
#3 1 2016-11-10 01:54:15 CREATED NA
#4 1 2016-11-11 01:54:15 CALLED NA
#5 1 2016-11-12 01:54:15 ASSIGNED 48
#6 1 2016-11-12 01:54:15 SLEEP NA
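For reference, a sketch of the input df assumed by the answer, reconstructed from the question's table:

df <- data.frame(
  AccountID = rep(1, 6),
  TIME = c("2016-11-08T01:54:15.000Z", "2016-11-09T01:54:15.000Z",
           "2016-11-10T01:54:15.000Z", "2016-11-11T01:54:15.000Z",
           "2016-11-12T01:54:15.000Z", "2016-11-12T01:54:15.000Z"),
  EVENT = c("CREATED", "ASSIGNED", "CREATED", "CALLED", "ASSIGNED", "SLEEP")
)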

from to table for missing values

In the data frame below there are several runs of consecutive days with missing values.
I want to create a from/to table that shows these runs of missing days.
Expected output
Table of missing values
from to
2012-01-08 2012-01-12
2012-01-18 2012-01-22
2012-01-29 2012-02-01
I tried to do it using this code
library(dplyr)

df$Date <- as.Date(df$Date, format = "%d-%b-%Y")
from_to_table_NA <- df %>%
  dplyr::filter(is.na(value)) %>%
  dplyr::summarise(from = min(Date),
                   to = max(Date))
> from_to_table_NA
from to
1 2012-01-08 2012-02-01
As expected, this gave me only the overall minimum and maximum dates of the missing values. I would highly appreciate any suggestion on how to get the desired output.
DATA
df <- read.table(text = c("
Date value
5-Jan-2012 5
6-Jan-2012 2
7-Jan-2012 3
8-Jan-2012 NA
9-Jan-2012 NA
10-Jan-2012 NA
11-Jan-2012 NA
12-Jan-2012 NA
13-Jan-2012 4
14-Jan-2012 5
15-Jan-2012 5
16-Jan-2012 7
17-Jan-2012 5
18-Jan-2012 NA
19-Jan-2012 NA
20-Jan-2012 NA
21-Jan-2012 NA
22-Jan-2012 NA
23-Jan-2012 12
24-Jan-2012 5
25-Jan-2012 7
26-Jan-2012 8
27-Jan-2012 8
28-Jan-2012 10
29-Jan-2012 NA
30-Jan-2012 NA
31-Jan-2012 NA
1-Feb-2012 NA
2-Feb-2012 12"), header =T)
You need to group runs of consecutive days. This can be done by taking the cumulative sum of a condition that flags where the difference between successive dates is not exactly 1:
df %>%
  filter(is.na(value)) %>%
  group_by(g = cumsum(coalesce(as.numeric(Date - lag(Date)), 1) != 1)) %>%
  summarise(from = min(Date),
            to = max(Date))
Gives:
# A tibble: 3 x 3
g from to
<int> <date> <date>
1 0 2012-01-08 2012-01-12
2 1 2012-01-18 2012-01-22
3 2 2012-01-29 2012-02-01
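An alternative sketch of the same run-grouping idea (not from the original answer, and assuming Date has already been converted with as.Date as in the question's code): cumulatively count the non-missing rows, so every NA in the same run shares one group id, then summarise each run:

library(dplyr)

df %>%
  mutate(g = cumsum(!is.na(value))) %>%   # constant within each NA run
  filter(is.na(value)) %>%
  group_by(g) %>%
  summarise(from = min(Date), to = max(Date))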
