I have a longitudinal dataset, imported into R from Excel, that looks like this:
STUDYID VISIT# VISITDate
1 1 2012-12-19
1 2 2018-09-19
2 1 2013-04-03
2 2 2014-05-14
2 3 2016-05-12
In this dataset, each patient/study ID has a different number of hospital visits, and the date of the first visit is likely to differ from individual to individual. I want to create a new time variable that is essentially time in years since the first visit, so the dataset will look like this:
STUDYID VISIT# VISITDate Time(years)
1 1 2012-12-19 0
1 2 2018-09-19 5
2 1 2013-04-03 0
2 2 2014-05-14 1
2 3 2016-05-12 3
The reason for creating a time variable like this is to assess differential regression effects over time (treated as a continuous variable). Is there any way to create such a time variable in R so I can use it as an independent variable in my regression analyses?
Consider ave to calculate the minimum of VISITDate by STUDYID group, then take the date difference with conversion to integer years:
df <- within(df, {
  minVISITDate <- ave(VISITDate, STUDYID, FUN = min)  # first visit date per patient
  Time <- floor(as.double(difftime(VISITDate, minVISITDate, units = "days")) / 365)  # whole years since first visit
  rm(minVISITDate)  # drop the helper column
})
df
# STUDYID VISIT# VISITDate Time
# 1 1 1 2012-12-19 0
# 2 1 2 2018-09-19 5
# 3 2 1 2013-04-03 0
# 4 2 2 2014-05-14 1
# 5 2 3 2016-05-12 3
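A side note, not part of the original answer: dividing by 365 ignores leap days, which can nudge the floored year count near visit anniversaries over long follow-up. Dividing by 365.25 is a common approximation; a minimal sketch on the same df:
# same approach, but using 365.25 to allow for leap days
df$Time <- with(df, floor(
  as.double(difftime(VISITDate, ave(VISITDate, STUDYID, FUN = min),
                     units = "days")) / 365.25
))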
Loading up packages:
library(tibble)
library(dplyr)
library(lubridate)
Setting up the data:
dat <- tribble(~STUDYID, ~VISIT, ~VISITDate,
               1, 1, "2012-12-19",
               1, 2, "2018-09-19",
               2, 1, "2013-04-03",
               2, 2, "2014-05-14",
               2, 3, "2016-05-12") %>%
  mutate(VISITDate = as.Date(VISITDate))
Creating the wanted variable:
dat %>%
  group_by(STUDYID) %>%
  mutate(Time = first(VISITDate) %--% VISITDate,
         Time = as.numeric(Time, "years")) %>%
  ungroup()
# A tibble: 5 x 4
STUDYID VISIT VISITDate Time
<dbl> <dbl> <date> <dbl>
1 1 1 2012-12-19 0
2 1 2 2018-09-19 5.75
3 2 1 2013-04-03 0
4 2 2 2014-05-14 1.11
5 2 3 2016-05-12 3.11
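Since the stated goal is regression, here is a minimal sketch of how the new variable might enter a model. The outcome column is hypothetical (not in the original data), and with repeated measures per patient a mixed model is often preferred:
# hypothetical 'outcome' measured at each visit
fit <- lm(outcome ~ Time, data = dat)
summary(fit)

# with repeated measures per patient, a random intercept is common:
# library(lme4)
# fit <- lmer(outcome ~ Time + (1 | STUDYID), data = dat)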
Thank you, experts, for previous answers (How to filter by range of dates in R?). I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on 12/02/2021 (dd/mm/yyyy): (a) if between 12/01/2021 and 12/02/2021 there are already 3 observations, it must be deleted; (b) if there are fewer than 3, it must remain.
My expected result is:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
df <- df %>%
  mutate(day = dmy(date)) %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(diff = day - first(day)) %>%
  mutate(row = row_number()) %>%
  filter(row <= 3 | !diff < 30)
But the result is:
id q date day diff row
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in 30-day periods starting from the last day of the previous 30-day period, not from the first observation day.
Any help? Thanks
Using floor_date it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
  group_by(floor = floor_date(date, '30 days')) %>% # fixed 30-day bins
  slice_head(n = 3) %>%                             # keep at most 3 rows per bin
  ungroup() %>%
  select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
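Note that this groups only on the 30-day bins, which is enough here because the sample contains a single id. If the real data holds several ids, the bins can be combined with the id grouping; a sketch on the same df:
df %>%
  group_by(id, floor = floor_date(date, '30 days')) %>% # bins within each id
  slice_head(n = 3) %>%
  ungroup() %>%
  select(-floor)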
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (patients that have complete data for 0 or only 1 time point should be thrown out). So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable at 2 or 3 of the time points.
We can create a logical expression on the count of non-NA elements, grouped by 'patientid', to filter to patientids having more than one non-NA 'outcome':
library(dplyr)
Data %>%
  group_by(patientid) %>%
  filter(sum(!is.na(outcome)) > 1) %>%
  ungroup
Output:
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
  Data,
  ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks @akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(observation = sum(outcome, na.rm = TRUE)) %>% # sum of non-missing outcomes per patient; equals a count here because outcome is always 1 or NA
  filter(observation >= 2) %>%
  ungroup
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
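A caveat on this last variant, as an aside: sum(outcome, na.rm = TRUE) equals the number of observed time points only because outcome is always 1 or NA in this data. If outcome could take the value 0, counting the non-missing values directly is safer, and it avoids carrying the helper column:
# count observed time points directly; robust to outcome values of 0
Data %>%
  group_by(patientid) %>%
  filter(sum(!is.na(outcome)) >= 2) %>%
  ungroup()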
I'm a complete beginner to R and I just need to do some quick cleaning of my data. But I ran into a problem I can't wrap my head around.
So I have a Postgres db with time series; the columns are ID, DATE and VALUE (temperature). Each ID is a different measuring station, so I have a time series for each ID (around 2000 unique IDs, 4m rows). The dates span 1915-2016, and some series overlap while others do not. If a week's measurement is missing, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(DATE = seq(...)) creates NA values for all weeks between 1915 and 2016, and I clearly understand why that happens. How can I make it fill values only between the actual start and end date of each specific time series? In other words, I want a min and max that depend on the start and end dates of each ID, and then fill the missing dates between them.
library("RpostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres", dbname = "",
                 password = "", host = "localhost", port = "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
  group_by(ID) %>%
  mutate(DATE = as.Date(DATE)) %>%
  complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by = "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit 1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I don't know why this works while your full code does not, but possibly the data structure in your code is not what is needed; if so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't coming from the package you need. Using tidyr::complete explicitly works on the sample.
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
  filter(ID == 1) %>%
  mutate(DATE = as.Date(DATE)) %>%
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by = "week"))
big_df2 <- df %>%
  filter(ID == 2) %>%
  mutate(DATE = as.Date(DATE)) %>%
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by = "week"))
big_df3 <- df %>%
  filter(ID == 3) %>%
  mutate(DATE = as.Date(DATE)) %>%
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by = "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
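A more compact variant, not from the original answer: tidyr::complete respects dplyr groups and evaluates its arguments per group, so the per-ID filtering and rbind can likely be collapsed into one grouped pipeline; a sketch on the same df:
library(dplyr)
library(tidyr)

df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%
  complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>% # per-ID date range
  ungroup()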
My sample data.frame (date format d/m/y), recording the dates a customer was active:
customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20
I would like to make a column "n_consecutive_days" like so:
customer date n_consecutive_days
1 10/1/20 2
1 9/1/20 1
1 6/1/20 N/A
2 10/1/20 1
2 8/1/20 3
2 7/1/20 2
2 6/1/20 N/A
The new column counts the number of previous consecutive dates per customer. I would like the customer's first date to be N/A as it makes no sense to talk about previous consecutive days if it is the first one.
Any help would be appreciated. I can calculate the difference between dates, but not the number of consecutive days as desired.
One way would be:
library(dplyr)
df %>%
  group_by(customer,
           idx = cumsum(as.integer(c(0, diff(as.Date(date, '%d/%m/%y')))) != -1)) %>% # run id for each block of consecutive days
  mutate(n_consecutive_days = rev(sequence(n()))) %>% # count down within each run
  ungroup() %>%
  group_by(customer) %>%
  mutate(n_consecutive_days = replace(n_consecutive_days, row_number() == n(), NA), # earliest date per customer gets NA
         idx = NULL)
Output:
# A tibble: 7 x 3
# Groups: customer [2]
customer date n_consecutive_days
<int> <fct> <int>
1 1 10/1/20 2
2 1 9/1/20 1
3 1 6/1/20 NA
4 2 10/1/20 1
5 2 8/1/20 3
6 2 7/1/20 2
7 2 6/1/20 NA
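The heart of this approach is the idx helper: with dates in reverse chronological order, diff() is exactly -1 wherever two dates are consecutive, so cumsum() over the positions where that fails assigns a run id to each block of consecutive days. A small illustration on customer 1's dates:
d <- as.Date(c("2020-01-10", "2020-01-09", "2020-01-06"))
diff(d)                      # -1, -3: a one-day step, then a three-day gap
cumsum(c(0, diff(d)) != -1)  # 1 1 2: the first two rows share a run, the third starts a new one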
An option using data.table:
#ensure that data is sorted by customer and reverse chronological
setorder(DT, customer, -date)
#group by customer and consecutive dates and then create the sequence
DT[, ncd := .N:1L, .(customer, cumsum(c(0L, diff(date)!=-1L)))]
#set the first date in each customer to NA
DT[DT[, .I[.N], customer]$V1, ncd := NA]
output:
customer date ncd
1: 1 2020-01-10 2
2: 1 2020-01-09 1
3: 1 2020-01-06 NA
4: 2 2020-01-10 1
5: 2 2020-01-08 3
6: 2 2020-01-07 2
7: 2 2020-01-06 NA
data:
library(data.table)
DT <- fread("customer date
1 10/1/20
1 9/1/20
1 6/1/20
2 10/1/20
2 8/1/20
2 7/1/20
2 6/1/20")
DT[, date := as.IDate(date, format="%d/%m/%y")]
I am more used to working with Stata and have been trying to switch to R, and I am having trouble getting this aggregation with dplyr/summarise to work.
I have a dataframe with admission/discharge variables and a series of columns with binary (0/1) indicators of the drugs received on 'DrugDate'.
# ID AdmitDate DCdate DrugDate DrugA DrugB .. DrugZ
# 1 03/01/2017 03/04/2017 03/01/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/02/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/02/2017 0 1 0
# 1 03/01/2017 03/04/2017 03/03/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/04/2017 1 0 0
Each row is essentially a series of indicators of which drugs the patient received that day.
STEP 1.
I would like to first consolidate the dataset like so:
# ID AdmitDate DCdate DrugDate DrugA DrugB .. DrugZ
# 1 03/01/2017 03/04/2017 03/01/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/02/2017 1 1 0
# 1 03/01/2017 03/04/2017 03/03/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/04/2017 1 0 0
So that there is now one row per day (whereas before, duplicate DrugDates existed when more than one drug was given on a certain day).
STEP 2
I would then like to create a new dataset that counts "drug days" i.e.
# ID AdmitDate DCdate TotDays DrugDaysA DrugDaysB .. DrugZ
# 1 03/01/2017 03/04/2017 4 4 1 0
Step 2 I figured out how to do, but I thought the community might have opinions about the fastest way to compute it, as the dataset is quite large. My understanding is that dplyr is usually computationally efficient.
I would prefer not to simply do something like:
DF %>% group_by(id, drugdate) %>% summarise(NewVar = max(DrugA))
Because there are many variables.
It would be ideal for me to define a list of varnames, then use apply/for-loop to automate the process.
You can reshape or melt the different drugs into one column using a package like reshape2 or a tidyverse package.
Then the dplyr call stays the same no matter how many variables (drugs) you have. I provide a simple example below that should illustrate the point; you can extend it as needed.
library(dplyr)
library(reshape2)
# set up for data
set.seed(5)
n <- 9
#create data frame
df <- data.frame(id = as.factor(rep(1:3, n/3)),
                 date = as.character(sample(size=n, 1:10)),
                 drugA = sample(size=n, 1:2, replace=TRUE),
                 drugB = sample(size=n, 1:2, replace=TRUE))
#melt data
dfm <- melt(df, id.vars=c("id", "date"))
#call to dplyr
dfms <- dfm %>% group_by(id, date, variable) %>% summarise(max = max(value))
> head(dfms)
Source: local data frame [6 x 4]
Groups: id, date [3]
id date variable max
<fctr> <fctr> <fctr> <int>
1 1 6 drugA 1
2 1 6 drugB 2
3 1 7 drugA 2
4 1 7 drugB 2
5 1 9 drugA 2
6 1 9 drugB 1
To get back into wide format you can use the cast functions.
> head(dcast(dfms, id + date ~ variable, value.var = "max"))
id date drugA drugB
1 1 6 1 2
2 1 7 2 2
3 1 9 2 1
4 2 10 1 2
5 2 2 2 1
6 2 8 1 1
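As a newer alternative to reshaping, not part of the original answer: dplyr::across (dplyr >= 1.0.0) applies a summary over an arbitrary set of columns, which covers both steps without melting. A sketch assuming the question's layout, with the indicator columns sitting contiguously as DrugA through DrugZ so the range DrugA:DrugZ selects them:
library(dplyr)

# Step 1: one row per patient-day; a drug counts if given at least once that day
daily <- DF %>%
  group_by(ID, AdmitDate, DCdate, DrugDate) %>%
  summarise(across(DrugA:DrugZ, max), .groups = "drop")

# Step 2: drug days per admission; TotDays counts the distinct DrugDate rows,
# which matches the admission length only if every day appears in the data
totals <- daily %>%
  group_by(ID, AdmitDate, DCdate) %>%
  summarise(TotDays = n(),
            across(DrugA:DrugZ, sum),
            .groups = "drop")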