My sample data.frame (date format d/m/y), recording the dates a customer was active:
customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20
I would like to make a column "n_consecutive_days" like so:
customer date n_consecutive_days
1 10/1/20 2
1 9/1/20 1
1 6/1/20 N/A
2 10/1/20 1
2 8/1/20 3
2 7/1/20 2
2 6/1/20 N/A
The new column counts the number of previous consecutive dates per customer. I would like the customer's first date to be N/A as it makes no sense to talk about previous consecutive days if it is the first one.
Any help would be appreciated. I can calculate the difference between dates, but not the number of consecutive days as desired.
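For reference, the sample data can be set up like this (a sketch; the answers below assume it sits in a data.frame called df, with date stored as text in d/m/y format):
df <- data.frame(
  customer = c(1, 1, 1, 2, 2, 2, 2),
  date = c("10/1/20", "9/1/20", "6/1/20", "10/1/20", "8/1/20", "7/1/20", "6/1/20")
)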
One way would be:
library(dplyr)
df %>%
  # group into runs of consecutive days (a difference of exactly -1 day continues the run)
  group_by(customer,
           idx = cumsum(as.integer(c(0, diff(as.Date(date, '%d/%m/%y')))) != -1)) %>%
  # count down within each run so the most recent date carries the run length
  mutate(n_consecutive_days = rev(sequence(n()))) %>%
  ungroup() %>%
  group_by(customer) %>%
  # the customer's earliest date has no previous day, so set it to NA
  mutate(n_consecutive_days = replace(n_consecutive_days, row_number() == n(), NA),
         idx = NULL)
Output:
# A tibble: 7 x 3
# Groups: customer [2]
customer date n_consecutive_days
<int> <fct> <int>
1 1 10/1/20 2
2 1 9/1/20 1
3 1 6/1/20 NA
4 2 10/1/20 1
5 2 8/1/20 3
6 2 7/1/20 2
7 2 6/1/20 NA
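To see what the helper index idx is doing, here is its value for the sample data (an illustrative check; d is just the parsed date column in the row order shown above):
d <- as.Date(c("10/1/20", "9/1/20", "6/1/20", "10/1/20", "8/1/20", "7/1/20", "6/1/20"),
             '%d/%m/%y')
cumsum(as.integer(c(0, diff(d))) != -1)
# [1] 1 1 2 3 4 4 4
Rows that share an index form one run of consecutive days, and n() within that run gives the run length.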
An option using data.table:
#ensure that data is sorted by customer and reverse chronological
setorder(DT, customer, -date)
#group by customer and consecutive dates and then create the sequence
DT[, ncd := .N:1L, .(customer, cumsum(c(0L, diff(date)!=-1L)))]
#set the first date in each customer to NA
DT[DT[, .I[.N], customer]$V1, ncd := NA]
output:
customer date ncd
1: 1 2020-01-10 2
2: 1 2020-01-09 1
3: 1 2020-01-06 NA
4: 2 2020-01-10 1
5: 2 2020-01-08 3
6: 2 2020-01-07 2
7: 2 2020-01-06 NA
data:
library(data.table)
DT <- fread("customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20")
DT[, date := as.IDate(date, format="%d/%m/%y")]
Related
If I had:
person_ID visit date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
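Assuming the data sits in a data.frame called df1 with the date column named visit_date (as the answer below expects), it can be reconstructed like this:
df1 <- data.frame(
  person_ID = c(1, 1, 1, 2, 3, 3, 3),
  visit_date = c("2/25/2001", "2/27/2001", "4/2/2001", "3/18/2004",
                 "9/22/2004", "10/27/2004", "5/15/2008")
)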
We convert 'visit_date' to Date class and, grouped by 'person_ID', create a binary column that returns 1 if the difference between the current and the next visit_date is less than 90 days and 0 otherwise. Using this column, we then pull the corresponding next visit_date wherever the value is 1.
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  # i1 is 1 when the next visit falls within 90 days of the current one, else 0
  mutate(i1 = replace_na(+(difftime(lead(visit_date),
                                    visit_date, units = 'days') < 90), 0),
         # where i1 is 1, take the next visit_date; otherwise leave NA
         date = case_when(as.logical(i1) ~ lead(visit_date)),
         i1 = NULL) %>%
  ungroup
Output:
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
I have a longitudinal dataset that I imported in R from Excel that looks like this:
STUDYID VISIT# VISITDate
1 1 2012-12-19
1 2 2018-09-19
2 1 2013-04-03
2 2 2014-05-14
2 3 2016-05-12
In this dataset, each patient/study ID has a different number of visits to the hospital, and their first visit date is likely to differ from individual to individual. I want to create a new time variable that is essentially time in years since the first visit, so the dataset will look like this:
STUDYID VISIT# VISITDate Time(years)
1 1 2012-12-19 0
1 2 2018-09-19 5
2 1 2013-04-03 0
2 2 2014-05-14 1
2 3 2016-05-12 3
The reason for creating a time variable like this is to assess differential regression effects over time (which is a continuous variable). Is there any way to create a new time variable like this in R so I can use it as an independent variable in my regression analyses?
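Assuming the data is already in a data.frame named df with VISITDate of class Date (a reproducible sketch; the visit number column is called VISIT here to keep the name syntactic):
df <- data.frame(
  STUDYID = c(1, 1, 2, 2, 2),
  VISIT = c(1, 2, 1, 2, 3),
  VISITDate = as.Date(c("2012-12-19", "2018-09-19", "2013-04-03",
                        "2014-05-14", "2016-05-12"))
)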
Consider ave to calculate the minimum of VISITDate by STUDYID group, then take the date difference with conversion to integer years:
df <- within(df, {
  # earliest visit date within each STUDYID
  minVISITDate <- ave(VISITDate, STUDYID, FUN = min)
  # whole years since the first visit (day difference divided by 365, rounded down)
  Time <- floor(as.double(difftime(VISITDate, minVISITDate, units = "days") / 365))
  rm(minVISITDate)
})
df
# STUDYID VISIT# VISITDate Time
# 1 1 1 2012-12-19 0
# 2 1 2 2018-09-19 5
# 3 2 1 2013-04-03 0
# 4 2 2 2014-05-14 1
# 5 2 3 2016-05-12 3
Loading up packages:
library(tibble)
library(dplyr)
library(lubridate)
Setting up the data:
dat <- tribble(~STUDYID , ~VISIT , ~VISITDate ,
1 , 1 , "2012-12-19",
1 , 2 , "2018-09-19",
2 , 1 , "2013-04-03",
2 , 2 , "2014-05-14",
2 , 3 , "2016-05-12") %>%
mutate(VISITDate = as.Date(VISITDate))
Creating the wanted variable:
dat %>%
group_by(STUDYID) %>%
mutate(Time = first(VISITDate) %--% VISITDate,
Time = as.numeric(Time, "years")) %>%
ungroup()
# A tibble: 5 x 4
STUDYID VISIT VISITDate Time
<dbl> <dbl> <date> <dbl>
1 1 1 2012-12-19 0
2 1 2 2018-09-19 5.75
3 2 1 2013-04-03 0
4 2 2 2014-05-14 1.11
5 2 3 2016-05-12 3.11
I'm a complete beginner to R and I just need to do some quick cleaning of my data. But I ran into a problem I can't wrap my head around.
So I have a Postgres db with time series; the columns are ID, DATE and VALUE (temperature). Each ID is a separate measuring station, so I have a time series for each ID (around 2000 unique IDs, 4m rows). The dates span 1915-2016; some series overlap and some do not. If a week's measurement is missing, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(DATE = seq(...)) creates NA values for all weeks between 1915 and 2016, and I understand why that happens. How can I make it fill values only between the actual start and end date of the specific time series? In other words, I want the min and max to depend on the start and end date of each specific ID, and then to fill the missing dates between that ID's start and end date.
library("RpostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I don't know why this works while your whole code does not, but possibly the data structure in your code is not what is needed; if so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't being picked up from the package you need. Using tidyr::complete works on the sample.
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
filter(ID == 1)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df2 <- df %>%
filter(ID == 2)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df3 <- df %>%
filter(ID == 3)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
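As a side note, the three per-ID blocks can be collapsed by grouping before completing, so that min(DATE) and max(DATE) are taken per ID (a sketch of the same idea, not part of the original answer):
big_df <- df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%
  # complete each ID's own weekly sequence between its first and last date
  tidyr::complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%
  ungroup()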
I have a set of patient IDs and a date column. Within each ID, I want to fill a Date1 column with the following row's date minus one day. For example:
ID Date Date1
1 23-10-2017 23-09-2018
1 24-09-2018 28-08-2019
1 29-08-2019 -
2 30-05-2016 11-06-2017
2 12-06-2017 12-07-2018
2 13-07-2018 -
I don't know if I get what you want, but if you just want the date less one day, this is the code.
x <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                Date = as.Date(c("23-10-2017", "24-09-2018", "29-08-2019",
                                 "30-05-2016", "12-06-2017", "13-07-2018"),
                               "%d-%m-%Y"))
x$Date1 <- x$Date - 1
Shift by one row by group, then subtract one day:
library(data.table)
dt1 <- fread("
ID Date
1 23-10-2017
1 24-09-2018
1 29-08-2019
2 30-05-2016
2 12-06-2017
2 13-07-2018")
# convert to Date (note the four-digit year format)
dt1[, Date := as.Date(Date, "%d-%m-%Y")]
# shift per group, then minus 1 day
dt1[, Date1 := shift(Date, type = "lead") - 1, by = ID]
dt1
#    ID       Date      Date1
# 1:  1 2017-10-23 2018-09-23
# 2:  1 2018-09-24 2019-08-28
# 3:  1 2019-08-29       <NA>
# 4:  2 2016-05-30 2017-06-11
# 5:  2 2017-06-12 2018-07-12
# 6:  2 2018-07-13       <NA>
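For the dplyr approach below, assume df holds the same data with Date already parsed (a sketch mirroring the data.table setup above):
df <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L, 2L),
                 Date = as.Date(c("23-10-2017", "24-09-2018", "29-08-2019",
                                  "30-05-2016", "12-06-2017", "13-07-2018"),
                                "%d-%m-%Y"))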
Try using lead:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Date1 = lead(Date) - 1)
# A tibble: 6 x 3
# Groups: ID [2]
ID Date Date1
<int> <date> <date>
1 1 2017-10-23 2018-09-23
2 1 2018-09-24 2019-08-28
3 1 2019-08-29 NA
4 2 2016-05-30 2017-06-11
5 2 2017-06-12 2018-07-12
6 2 2018-07-13 NA
My sample dataset
df <- data.frame(period=rep(1:3,3),
product=c(rep('A',9)),
account= c(rep('1001',3),rep('1002',3),rep('1003',3)),
findme= c(0,0,0,1,0,1,4,2,0))
My Desired output dataset
output <- data.frame(period=rep(1:3,2),
product=c(rep('A',6)),
account= c(rep('1002',3),rep('1003',3)),
findme= c(1,0,1,4,2,0))
Here are my conditions:
I want to eliminate 3 of the 9 records based on the conditions below: records where the "findme" value is equal to zero for all periods (1, 2 and 3), and that happens within the same product and the same account.
Rule 1: The records should cover periods 1, 2 and 3.
Rule 2: The findme value is 0 for all of those periods.
Rule 3: All 3 records (periods 1, 2, 3) should have the same product.
Rule 4: All 3 records (periods 1, 2, 3) should belong to one account.
If I understand correctly, you want to drop all records from a product-account combination where findme == 0, if all periods are included in this combination?
library(dplyr)
df %>%
  # grouping by findme as well means all.periods is only TRUE when a single
  # findme value spans periods 1, 2 and 3 for a product/account
  group_by(product, account, findme) %>%
  mutate(all.periods = all(1:3 %in% period)) %>%
  ungroup() %>%
  filter(!(findme == 0 & all.periods)) %>%
  select(-all.periods)
# A tibble: 6 x 4
period product account findme
<int> <fctr> <fctr> <dbl>
1 1 A 1002 1
2 2 A 1002 0
3 3 A 1002 1
4 1 A 1003 4
5 2 A 1003 2
6 3 A 1003 0
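Equivalently, the condition can be written directly in a grouped filter (a compact sketch of the same idea, not from the original answer):
df %>%
  group_by(product, account) %>%
  # keep a product/account group unless every findme is 0 and periods 1-3 are all present
  filter(!(all(findme == 0) & all(1:3 %in% period))) %>%
  ungroup()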
Here is an option with data.table
library(data.table)
setDT(df)[df[, .I[all(1:3 %in% period) & !all(!findme)], .(product, account)]$V1]
# period product account findme
#1: 1 A 1002 1
#2: 2 A 1002 0
#3: 3 A 1002 1
#4: 1 A 1003 4
#5: 2 A 1003 2
#6: 3 A 1003 0