How to get the difference of a lagged variable by date? - r

Consider the following example:
library(tidyverse)
library(lubridate)
df = tibble(client_id = rep(1:3, each = 24),
            date = rep(seq(ymd("2016-01-01"), ymd("2016-12-01") + years(1), by = 'month'), 3),
            expenditure = runif(72))
In df you have stored information on monthly expenditure from a bunch of clients for the past 2 years. Now you want to calculate the monthly difference between this year and the previous year for each client.
Is there any way of doing this maintaining the "long" format of the dataset? Here I show you the way I am doing it nowadays, which implies going wide:
df2 = df %>%
  mutate(date2 = paste0('val_',
                        year(date),
                        formatC(month(date), width = 2, flag = "0"))) %>%
  select(client_id, date2, expenditure) %>%
  pivot_wider(names_from = date2,
              values_from = expenditure)
df3 = (df2[, 2:13] - df2[, 14:25])
However I find this unnecessarily complex, and on large datasets going from long to wide can take quite a lot of time, so I think there must be a better way of doing it.

If you want to keep data in long format, one way would be to group by month and date value for each client_id and calculate the difference using diff.
library(dplyr)
df %>%
  group_by(client_id, month_date = format(date, "%m-%d")) %>%
  summarise(diff = -diff(expenditure))
# client_id month_date diff
# <int> <chr> <dbl>
# 1 1 01-01 0.278
# 2 1 02-01 -0.0421
# 3 1 03-01 0.0117
# 4 1 04-01 -0.0440
# 5 1 05-01 0.855
# 6 1 06-01 0.354
# 7 1 07-01 -0.226
# 8 1 08-01 0.506
# 9 1 09-01 0.119
#10 1 10-01 0.00819
# … with 26 more rows

An option with data.table, again grouping by the month-day so that each group pairs the two values one year apart:
library(data.table)
setDT(df)[, .(diff = -diff(expenditure)), by = .(client_id, month_day = format(date, "%m-%d"))]
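If you would rather stay in long format with dplyr alone, a lag of 12 rows within each client gives the same year-over-year comparison, provided each client's 24 monthly rows are complete and sorted by date. A sketch under those assumptions (note the sign here is current year minus previous year, the opposite of the -diff() answers above):

```r
library(dplyr)
library(lubridate)

# Rebuild the example data from the question.
set.seed(42)
df <- tibble(
  client_id   = rep(1:3, each = 24),
  date        = rep(seq(ymd("2016-01-01"), ymd("2017-12-01"), by = "month"), 3),
  expenditure = runif(72)
)

# With 24 complete, sorted monthly rows per client, lag(n = 12) pairs each
# month with the same month one year earlier; the first 12 rows per client
# have no earlier year, so their difference is NA.
df_yoy <- df %>%
  arrange(client_id, date) %>%
  group_by(client_id) %>%
  mutate(yoy_diff = expenditure - lag(expenditure, n = 12)) %>%
  ungroup()
```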

Related

Number of days spent in each STATE in r

I'm trying to calculate the number of days that a patient spent during a given state in R.
An image of the example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking that if I can create a date column in column 4, holding the first recorded date for each state, then I can subtract it from column 2 to get the days I am looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second run of 1's with the first run of 1's, and likewise for the 2's, which is not what I want.
Use mdy_hm to convert OBS_DTM to POSIXct, then group_by ID and the rleid of STATE so that the first run of 1's is handled separately from the second. Use difftime to calculate the difference in days between each OBS_DTM and the minimum value in the group.
If your data is called data :
library(dplyr)
data %>%
  mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
  group_by(MRN, grp = data.table::rleid(STATE)) %>%
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = 'days'))) %>%
  ungroup %>%
  select(-grp) -> result
result
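The key is rleid(), which numbers consecutive runs rather than distinct values, so a later run of the same STATE gets its own group. A quick illustration:

```r
library(data.table)

# rleid() assigns a new id each time the value changes, so repeated
# values that are not adjacent end up in different groups.
ids <- rleid(c(1, 1, 2, 2, 2, 1))
print(ids)
# The trailing 1 gets id 3, not 1.
```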
You could try the following:
library(dplyr)
df %>%
  group_by(ID, State) %>%
  mutate(priorObsDTM = lag(OBS_DTM)) %>%
  filter(!is.na(priorObsDTM)) %>%
  ungroup() %>%
  mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
  ID = 1,
  OBS_DTM = as.POSIXlt(
    c('2020-07-27 8:44', '2020-07-27 8:56', '2020-08-08 20:12',
      '2020-08-14 10:13', '2020-08-15 13:32')
  ),
  State = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14

Dplyr doesn't respect groups when ranking data

Using the code below with dplyr 0.7.6, I try to calculate the rank of a variable for each day in a dataset, but dplyr doesn't account for the group_by(CREATIONDATE_DAY).
dates <- sample(seq(from = as.POSIXct("2019-03-12", tz = "UTC"),
                    to = as.POSIXct("2019-03-20", tz = "UTC"),
                    by = "day"), size = 100, replace = TRUE)
group <- sample(c("A", "B", "C"), 100, TRUE)
df <- data.frame(CREATIONDATE_DAY = dates, GROUP = group)
# calculate the occurrences for each day and group
dfMod <- df %>% group_by(CREATIONDATE_DAY, GROUP) %>%
  dplyr::summarise(COUNT = n()) %>% ungroup()
# Compute the rank by count for each day
dfMod <- dfMod %>% group_by(CREATIONDATE_DAY) %>%
  mutate(rank = rank(-COUNT, ties.method = "min"))
But the rank values are calculated across the entire dataset instead of within each creation day. As seen in the image, the row with id 24 should be rank 1, since 4 is the highest value for 16.03.2019, and row 23 should be rank 2 for that day. Where is my mistake?
Edit: added desired output:
Edit #2: as MrFlick has pointed out I checked my dplyr version (0.7.6) and upgrade to the most current version fixed the issue for me.
It seems there may be a conflict with another package. If you have lubridate attached, try reversing the order in which you load lubridate and dplyr (I tried your example and it gave the right answer). You can also try:
dfMod <- dfMod %>% group_by(CREATIONDATE_DAY) %>% mutate(rank = row_number(desc(COUNT)))
> head(dfMod)
# A tibble: 6 x 4
# Groups: CREATIONDATE_DAY [2]
CREATIONDATE_DAY GROUP COUNT rank
<dttm> <fct> <int> <int>
1 2019-03-12 00:00:00 A 2 3
2 2019-03-12 00:00:00 B 5 1
3 2019-03-12 00:00:00 C 4 2
4 2019-03-13 00:00:00 A 4 1
5 2019-03-13 00:00:00 B 3 2
6 2019-03-13 00:00:00 C 2 3
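For reference, min_rank() with desc() is dplyr's own window-function equivalent of rank(-COUNT, ties.method = "min"), and qualifying the call as dplyr::min_rank() sidesteps any masking by another package. A sketch on made-up per-day counts (not the OP's data):

```r
library(dplyr)

# Hypothetical per-day counts standing in for the OP's dfMod.
dfMod <- tibble(
  CREATIONDATE_DAY = as.Date(c("2019-03-12", "2019-03-12", "2019-03-13")),
  GROUP = c("A", "B", "A"),
  COUNT = c(2L, 5L, 4L)
)

# min_rank() respects the active grouping, so ranks restart on each day.
ranked <- dfMod %>%
  group_by(CREATIONDATE_DAY) %>%
  mutate(rank = dplyr::min_rank(dplyr::desc(COUNT))) %>%
  ungroup()
```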

Add sequence of week count aligned to a date column with infrequent dates

I'm building a dataset and want to add a week count, starting from the first date and ending on the last. I'll use it to summarize a much larger dataset, which I'd eventually like summarized by week.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/01"), as.Date("1944/09/01"), "days"),
             Week = seq_along(Date) / 7)
# A tibble: 93 x 2
Date Week
<date> <dbl>
1 1944-06-01 0.143
2 1944-06-02 0.286
3 1944-06-03 0.429
4 1944-06-04 0.571
5 1944-06-05 0.714
6 1944-06-06 0.857
7 1944-06-07 1
8 1944-06-08 1.14
9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't structured sequentially, there are many days missing between weeks so a straight sequential count won't work.
An ideal end result is an additional "week" column, based upon the actual dates (rather than hard-coded with a seq_along() type of result)
Similar solution to Ronak's but with lubridate:
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/01"), as.Date("1944/09/01"), "days"),
              week = interval(min(Date), Date) %>%
                as.duration() %>%
                as.numeric("weeks") %>%
                floor() + 1))
You could subtract the first Date from all the Date values, calculate the difference with difftime in "weeks", floor the values, and add 1 to start the counter from 1.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
# Date week
# <date> <dbl>
# 1 1944-06-01 1
# 2 1944-06-02 1
# 3 1944-06-03 1
# 4 1944-06-04 1
# 5 1944-06-05 1
# 6 1944-06-06 1
# 7 1944-06-07 1
# 8 1944-06-08 2
# 9 1944-06-09 2
#10 1944-06-10 2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
  mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
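Since the week index here is just a 7-day bin counted from the first date, plain integer division of the day offset gives the same result without any date-arithmetic helpers, and it still works when intermediate dates are missing. A base R sketch:

```r
# Same example data as above.
df <- data.frame(Date = seq(as.Date("1944-06-01"), as.Date("1944-09-01"), by = "day"))

# Days elapsed since the first date, binned into 7-day weeks, 1-based.
df$week <- as.integer(df$Date - min(df$Date)) %/% 7 + 1
```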

Using lag in mutate() on arranged data

I am working on a data set which is similar to
data <- tribble(
  ~id, ~dates, ~days_prior,
  1, 20190101, NA,
  1, NA, 15,
  1, NA, 20,
  2, 20190103, NA,
  2, NA, 3,
  2, NA, 4)
I have the first date for each ID and I am trying to calculate the next date by adding days_prior to the previous date. I am using the lag function to refer to the previous date.
data <- data %>% mutate(dates = as.Date(ymd(dates)),
                        days_prior = as.integer(days_prior))
data <- data %>% mutate(dates =
  as.Date(ifelse(is.na(days_prior), dates, days_prior + lag(dates)),
          origin = "1970-01-01"))
This works, but only for the next row, as you can see in the attached data.
What am I doing wrong? I would like all the dates to be calculated by mutate(). What different approach should I take?
I don't really see how lag would help here; unless I misunderstood, here is an option using tidyr::fill:
data %>%
  group_by(id) %>%
  mutate(dates = as.Date(ymd(dates))) %>%
  fill(dates) %>%
  mutate(dates = dates + if_else(is.na(days_prior), 0L, as.integer(days_prior))) %>%
  ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 NA
#2 1 2019-01-16 15
#3 1 2019-01-21 20
#4 2 2019-01-03 NA
#5 2 2019-01-06 3
#6 2 2019-01-07 4
Or a slight variation, replacing the NA entries in days_prior with 0
data %>%
  group_by(id) %>%
  mutate(
    dates = as.Date(ymd(dates)),
    days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
  fill(dates) %>%
  mutate(dates = dates + as.integer(days_prior)) %>%
  ungroup()
Update
In response to your clarifications in the comments, here is what you can do
data %>%
  group_by(id) %>%
  mutate(
    dates = as.Date(ymd(dates)),
    days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
  fill(dates) %>%
  mutate(dates = dates + cumsum(days_prior)) %>%
  ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 0
#2 1 2019-01-16 15
#3 1 2019-02-05 20
#4 2 2019-01-03 0
#5 2 2019-01-06 3
#6 2 2019-01-10 4
You can use na.locf from the zoo package to fill in the last observed date before adding the prior days.
library("tidyverse")
library("zoo")
data %>%
  # Fill in NA dates with the previous non-NA date
  # (`locf` stands for "last observation carried forward"),
  # and fill in NA days_prior with 0.
  mutate(dates = zoo::na.locf(dates),
         days_prior = replace_na(days_prior, 0)) %>%
  mutate(dates = lubridate::ymd(dates) + days_prior)
This solution makes two assumptions:
The rows are sorted by id. You can get around this assumption with a group_by(id) followed by an ungroup() statement, as shown in the solution by Maurits Evers.
For each id, the row with the observed date comes first in the group. This needs to be true in any case with either na.locf or fill, because both functions fill in NAs using the previous non-NA entry.
If you don't want to make any assumptions about the ordering, you can sort the rows at the start with data %>% arrange(id, dates).

R transpose including NA

I have data like,
trackingnumer = c(1,1,2,2,3)
date = c("2017-08-01", "2017-08-10", "2017-08-02", "2017-08-05", "2017-08-12")
scan = c("Pickup", "Delivered", "Pickup", "Delivered", "Delivered")
df = data.frame(trackingnumer, date, scan)
I want to transpose this data by trackingnumer:
df2 <- df %>%
  group_by(trackingnumer) %>%
  mutate(n = row_number()) %>%
  {data.table::dcast(data = data.table::setDT(.), trackingnumer ~ n, value.var = c('date', 'scan'))}
I have tried this, but I couldn't get the desired outcome. I want to set date_1 as the pickup date and date_2 as the delivered date. As you can see, tracking number 3 doesn't have a pickup record, so I want date_1 to be NA.
Base R attempt, using relevel to set the appropriate ordering of the scan column:
reshape(
cbind(df, time=as.numeric(relevel(df$scan, "Pickup"))),
idvar="trackingnumer", direction="wide", sep="_"
)
# trackingnumer date_1 scan_1 date_2 scan_2
#1 1 2017-08-01 Pickup 2017-08-10 Delivered
#3 2 2017-08-02 Pickup 2017-08-05 Delivered
#5 3 <NA> <NA> 2017-08-12 Delivered
The problem was that your function in mutate was just counting the rows, it wasn’t paying attention to what was in them. The case_when() function lets you specify specific values for the “n” column based on the value of “scan”
df2 <- df %>%
  group_by(trackingnumer) %>%
  mutate(n = case_when(scan == "Pickup" ~ 1,
                       scan == "Delivered" ~ 2)) %>%
  {data.table::dcast(data = data.table::setDT(.), trackingnumer ~ n, value.var = c('date', 'scan'))}
Or with tidyr
library(tidyr)
library(purrr)
library(tibble)
df %>%
  group_by(trackingnumer, scan2 = scan) %>%
  nest(date, scan) %>%
  spread(scan2, data) %>%
  mutate_at(c("Delivered", "Pickup"),
            ~ ifelse(map_lgl(.x, is_tibble), .x, lst(tibble(date = NA, scan = NA)))) %>%
  unnest %>%
  rename_at(c("date", "scan"), paste0, 2)
# # A tibble: 3 x 5
# trackingnumer date2 scan2 date1 scan1
# <dbl> <fctr> <fctr> <fctr> <fctr>
# 1 1 2017-08-10 Delivered 2017-08-01 Pickup
# 2 2 2017-08-05 Delivered 2017-08-02 Pickup
# 3 3 2017-08-12 Delivered <NA> <NA>
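With newer tidyr (>= 1.0), pivot_wider() produces the same NA-filled wide shape directly, since each (trackingnumer, scan) pair occurs at most once. A sketch that only spreads the date column:

```r
library(tidyr)

# Same example data as in the question.
df <- data.frame(
  trackingnumer = c(1, 1, 2, 2, 3),
  date = c("2017-08-01", "2017-08-10", "2017-08-02", "2017-08-05", "2017-08-12"),
  scan = c("Pickup", "Delivered", "Pickup", "Delivered", "Delivered")
)

# One row per tracking number; missing scan types become NA automatically.
df2 <- pivot_wider(df, names_from = scan, values_from = date,
                   names_prefix = "date_")
```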
