I cleared one hurdle, with some help from SO, and thought the next hurdle would be easier. What I really have is start and end dates in a data frame:
require(lubridate)
demo <- read.table(text = "
start end num
2010-12-31 <NA> 35
2013-04-01 <NA> 34
2015-06-02 <NA> 34
2015-06-15 2012-12-31 34
2015-01-30 2011-12-31 33
2014-04-15 2013-12-31 33
2014-05-28 2013-12-31 33
2014-06-02 <NA> 33
2015-06-17 <NA> 33
2015-06-25 <NA> 33
2015-06-24 <NA> 32
2013-07-31 <NA> 32
2013-08-31 <NA> 32
2015-04-27 <NA> 31
2015-05-07 <NA> 31
2013-12-30 <NA> 31
2014-11-21 <NA> 30
2013-12-20 2013-06-30 30
",header = TRUE, sep = "")
demo$start <- as.Date(demo$start, '%Y-%m-%d')
demo$end <- as.Date(demo$end, '%Y-%m-%d')
I can get a table of start years, or a table of end years, with table(year(demo$end)) or table(year(demo$start)) which is a lovely start. But what I really want to know is something more like: for each year, how many entries that started have not yet ended? So count is.na() for each start year.
I thought I could use aggregate() for that, so I tried this:
aggregate(is.na(end) ~ year(start), demo, FUN = length)
But that seems to be counting every observation, not just the observations for which the end date is NA.
You can use table with multiple arguments to give you 2-way or multi-way tables:
> with(demo, table( year=format(demo$start, "%Y"), Not.missing = !is.na(end) ) )
Not.missing
year FALSE TRUE
2010 1 0
2013 4 1
2014 2 2
2015 6 2
You could also use lubridate::year instead of the format call.
If you need the number of NA values for each 'year', use sum, because is.na(end) is a logical vector and summing it counts the TRUE values. length gives the total number of rows per year, not the number of TRUE values.
aggregate(cbind(end=is.na(end)) ~ cbind(year=year(start)), demo, FUN = sum)
# year end
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
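As a rough sketch of the length-versus-sum difference in base R, here is the same idea with tapply on a small made-up data frame (toy and its four rows are assumptions for illustration, not the asker's data):

```r
# Toy data (hypothetical): two entries per start year, one of each still open
toy <- data.frame(
  start = as.Date(c("2013-04-01", "2013-12-20", "2014-06-02", "2014-04-15")),
  end   = as.Date(c(NA, "2013-06-30", NA, "2013-12-31"))
)
yr <- format(toy$start, "%Y")

# length() counts every row in the group, whether TRUE or FALSE
n_rows <- tapply(is.na(toy$end), yr, length)

# sum() counts only the TRUE values, i.e. the missing end dates
n_open <- tapply(is.na(toy$end), yr, sum)
```

n_rows is what FUN = length reports; n_open is the count the question asks for.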
Or we can use data.table. We convert the 'data.frame' to a 'data.table' with setDT(demo), use is.na(end) in i to keep only the rows with a missing end date, group by the year of the 'start' column, and take .N, the number of rows in each group.
library(data.table)
setDT(demo)[is.na(end), list(end = .N) , list(year=year(start))]
# year end
#1: 2010 1
#2: 2013 4
#3: 2015 6
#4: 2014 2
Here is another option:
library(dplyr)
library(lubridate)
demo %>% subset(is.na(end)) %>% group_by(year(start)) %>% summarise(n=length(end))
#Source: local data frame [4 x 2]
#
# year(start) n
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
This is pretty straightforward. With your original data (demo), subset to only get the NA in your end column. Afterwards (and using year() from the lubridate package), group by each year, and get the summary of the number of NAs present in the end column. This will return a data.frame object.
I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I am using, I can't work out how to count per month rather than per n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
Note: I now realize that my example data only uses dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
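# Test data (dplyr supplies %>% and mutate; lubridate supplies day<- used further down)
library(dplyr)
library(lubridate)
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))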
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
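For anyone who prefers to avoid the extra packages, the same month-expansion idea can be sketched in base R. The two episodes below are made up for illustration, and first_of_month is a hypothetical helper name:

```r
# Two hypothetical episodes spanning several months
start <- as.Date(c("2014-01-15", "2014-02-10"))
end   <- as.Date(c("2014-03-05", "2014-02-20"))

# Snap a date to the first day of its month so seq(..., by = "month")
# visits every calendar month the episode touches
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# One "YYYY-MM" label per month in which each episode is active
months <- unlist(lapply(seq_along(start), function(i) {
  format(seq(first_of_month(start[i]), first_of_month(end[i]), by = "month"),
         "%Y-%m")
}))

# Number of active episodes per month
counts <- table(months)
```

Snapping both ends to the first of the month sidesteps the different month lengths: an episode from 31 January to 1 February still counts in both months.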
I have a daily revenue time series df from 01-01-2014 to 15-06-2017 and I want to aggregate the daily revenue data to weekly revenue data and do weekly predictions. Before I aggregate the revenue, I need to create a continuous week variable, which will NOT start from week 1 again when a new year starts. Since 01-01-2014 was not a Monday, I decided to start my first week from 06-01-2014.
My df now looks like this
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
week = rep(i,7)
print(week)
}
However, the result of this loop is not saved across iterations. When I type week, it just shows 179,179,179,179,179,179,179.
Where is the problem, and how can I add 180, 180, 180, 180 after the repeat loop?
And if I add more data after 2017-06-15, how can I create the weekly variable automatically, based on the date in my last row? (In other words, I would not need to count my daily observations, divide by 7 and then handle the leftover dates to get the week index.)
Thank you!
Does this work?
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
Here is the answer:
week = c()
for (i in 1:184) {
for (j in 1:7) {
week[j+(i-1)*7] = i
}
}
week = as.data.frame(week)
I created a week variable running from week 1 to week 184 (the end of my dataset). Each week number is repeated 7 times because there are 7 days in a week. Later I assigned the week variable to my data frame.
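The loop in the question only keeps the last week because week = rep(i, 7) overwrites week on every pass instead of appending to it. A loop-free sketch, assuming one row per calendar day starting on 2014-01-06 as described in the question, derives the index from the dates themselves, so it extends automatically when new rows are added:

```r
# Daily dates for the period described in the question
dates <- seq(as.Date("2014-01-06"), as.Date("2017-06-15"), by = "day")

# Whole weeks elapsed since the first date, plus one so numbering starts at 1
week_index <- as.integer(dates - min(dates)) %/% 7L + 1L
```

Because the index depends only on the distance from the first date, missing days do not shift later weeks, and the last (possibly incomplete) week gets its own number without any manual counting.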
I have data that includes dates (dd/mm/yyyy) and am wanting to summarise the data by year. I'm sure that there is an easier way to do it but the route that I've taken is to try to create a new categorical variable using the "cut" function.
For example:
# create sample dataframe
dates<-c("01/01/2013", "01/02/2013", "01/01/2014", "01/02/2014", "01/01/2015", "01/02/2015")
cases<-c(3,5,2,6,8,4)
df<-as.data.frame(cbind(dates, cases))
df$dates <- as.Date(df$dates,"%d/%m/%Y")
# categorise by year
df$year <- cut(df$dates, c(2013-01-01, 2013-12-31, 2014-12-31, 2015-12-31))
This gives an error:
invalid specification of 'breaks'
How do I tell R to cut at various "date" intervals? Is my approach to this all wrong? Still new to R (sorry about the basic question).
Greg
What should your output look like?
Your code works when you define your breaks with as.Date:
breaks <- as.Date(c("2013-01-01", "2013-12-31", "2014-12-31", "2015-12-31"))
# categorise by year
df$year <- cut(df$dates, breaks)
dates cases year
1 2013-01-01 3 2013-01-01
2 2013-02-01 5 2013-01-01
3 2014-01-01 2 2013-12-31
4 2014-02-01 6 2013-12-31
5 2015-01-01 8 2014-12-31
6 2015-02-01 4 2014-12-31
I'm guessing you want your variable year to look different, though? You can define labels when using cut:
# categorise by year
df$year <- cut(df$dates, breaks, labels = c(2013, 2014, 2015))
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
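To spell out why the original call failed: unquoted, 2013-01-01 is just arithmetic, so the breaks were ordinary numbers rather than dates. As a small sketch (the sample dates below are made up), cut can also build calendar-year bins itself when given breaks = "year":

```r
# Unquoted, these are subtractions, not dates
not_dates <- c(2013-01-01, 2013-12-31)   # 2011 and 1970

# Hypothetical sample dates in the question's dd/mm/yyyy format
dates <- as.Date(c("01/01/2013", "01/02/2013", "01/01/2014", "31/12/2014"),
                 "%d/%m/%Y")

# breaks = "year" bins each date by the calendar year it falls in;
# each bin is labelled with the first day of that year
yr_bin <- cut(dates, breaks = "year")

# Keep only the year part of the label
yr <- substr(as.character(yr_bin), 1, 4)
```

Unlike the hand-written breaks at 31 December, these bins start on 1 January, so a date of 2014-12-31 stays in 2014.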
If you are just looking for the year, maybe this helps:
df$year <- format(df$dates, format="%Y")
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
I think the solutions based on cut are a bit overkill. You can use the year function from the lubridate package to extract the year from the date:
library(dplyr)
library(lubridate)
df %>% mutate(year = year(dates))
# dates cases year
# 1 2013-01-01 3 2013
# 2 2013-02-01 5 2013
# 3 2014-01-01 2 2014
# 4 2014-02-01 6 2014
# 5 2015-01-01 8 2015
# 6 2015-02-01 4 2015
lubridate is such an awesome package when it comes to dealing with time data.
After the year column is constructed you can apply all kinds of summaries. I use the dplyr style here:
# Note that as.numeric(as.character()) is needed because `cbind` turns `cases` into text (a factor after as.data.frame in R < 4.0)
df %>% mutate(year = year(dates), cases = as.numeric(as.character(cases))) %>%
group_by(year) %>% summarise(tot_cases = sum(cases))
# # A tibble: 3 × 2
# year tot_cases
# <dbl> <dbl>
# 1 2013 8
# 2 2014 8
# 3 2015 12
Note that group_by ensures that all operations after that are done per unique category mentioned there, in this case per year.
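The as.numeric(as.character()) step matters more than it looks. A quick base R sketch with the six case counts from the question, stored as a factor on purpose, shows what goes wrong without it:

```r
# The question's case counts, deliberately stored as a factor
cases_fac <- factor(c("3", "5", "2", "6", "8", "4"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(cases_fac)

# Converting to character first recovers the real numbers
values <- as.numeric(as.character(cases_fac))
```

codes comes out as 2 4 1 5 6 3 while values is 3 5 2 6 8 4, so yearly sums built on the codes are silently wrong.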
A simple solution would be using the dplyr package. Here is a simple example:
library(dplyr)
library(lubridate)  # provides as_date() and year()
df_grouped <- df %>%
  mutate(
    dates = as_date(dates),
    # go through as.character() so a factor yields its labels, not its level codes
    cases = as.numeric(as.character(cases))) %>%
  group_by(year = year(dates)) %>%
  summarise(tot_cases = sum(cases))
In the mutate statement we convert the variables to a more suitable format, in group_by we select which variable does the grouping, and in summarise we create any new variables that we want.
df_grouped looks like this:
# A tibble: 3 × 2
year tot_cases
<dbl> <dbl>
1 2013 8
2 2014 8
3 2015 12
I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the id values in data are included in ns and vice versa.
For each of the values in data$id I am trying to find whether there is an ns$date within the 14 days before the data$date where data$id==ns$id, and to report the number of days difference.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE where ns$date[ns$id==data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is where I got so far
# convert dates (ymd() comes from lubridate)
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: merge automatically renames data$date to date.x and ns$date to date.y
# create variable to say when there is a date difference less than 14 days
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>=0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
The linked question "Find if date is within 14 days if id matches between datasets in R" provides an idea, but the result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested I added what the desired output should look like (same number of rows of data = 10 - no rows for data in ns not in data). Should have thought this might help earlier.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
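For comparison, here is a base R sketch of the same logic on a cut-down version of the two data sets (only four rows of data and three of ns are kept, and min_gap is a hypothetical helper name). It assumes a gap of 0 days counts as received, as in the desired output:

```r
# Cut-down versions of the question's data sets
data <- data.frame(
  id   = c(8, 8, 50, 141),
  date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18", "2010-11-29"))
)
ns <- data.frame(
  id   = c(8, 50, 50),
  date = as.Date(c("2011-02-25", "2010-08-17", "2010-08-15"))
)

# Smallest gap in days to an ns date for the same id that falls
# 0 to 13 days before `date`; NA when there is no such ns date
min_gap <- function(id, date) {
  gaps <- as.numeric(date - ns$date[ns$id == id])
  gaps <- gaps[gaps >= 0 & gaps < 14]
  if (length(gaps) > 0) min(gaps) else NA_real_
}

data$dif <- vapply(seq_len(nrow(data)),
                   function(i) min_gap(data$id[i], data$date[i]),
                   numeric(1))

# TRUE/FALSE when the id occurs in ns at all, NA when it does not
data$received <- ifelse(data$id %in% ns$id, !is.na(data$dif), NA)
```

This keeps one row per row of data by construction, so no de-duplication step is needed after a merge.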
I have a data frame representing 15 years of follow-up data from several hundred patients. I want to create a subset of the data frame including the most recent 12 months of data for each patient.
Here is a representative example of my data (including one missing value, because missing data abound in my actual dataset):
# Create example dataset.
example.dat <- data.frame(
ID = c(1,1,1,1,2,2,2,3,3,3), # patient ID numbers
Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06", # follow-up dates
"2005-06-14", "2002-11-24", "2009-03-05",
"2009-07-20", "2005-09-02", "2006-01-15",
"2006-05-18")),
Cat = c("Yes", "Yes", "No", "Yes", "No", # responses to a categorical variable
"Yes", "Yes", NA, "No", "No")
)
example.dat
Which yields the following output:
ID Date Cat
1 1 2000-02-01 Yes
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
5 2 2002-11-24 No
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
I need to figure out how to subset, for each ID number, the most recent record and all records from the previous 12 months. The desired result is:
ID Date Cat
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
Several questions have already been asked about subsetting by date in R, but they are generally concerned with subsetting data from a specific date or range of dates, not subsetting by ((variable end date) - (time interval)).
For the sake of completeness, here are two data.table approaches using either subsetting by groups or a non-equi join. In addition, lubridate is used to ensure a period of 12 months is picked even in the case of leap years.
Subsetting by groups
This is essentially the data.table version of docendo discimus' dplyr answer. However, lubridate functions are used for date arithmetic because simply subtracting 365 days will not cover a period of 12 months as requested by the OP in case the past year contains a leap day:
library(data.table)
library(lubridate)
setDT(example.dat)[, .SD[Date >= max(Date) %m-% years(1)], by = ID]
ID Date Cat
1: 1 2004-10-21 Yes
2: 1 2005-02-06 No
3: 1 2005-06-14 Yes
4: 2 2009-03-05 Yes
5: 2 2009-07-20 Yes
6: 3 2005-09-02 NA
7: 3 2006-01-15 No
8: 3 2006-05-18 No
Non-equi join
With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins:
library(data.table)
library(lubridate)
mDT <- setDT(example.dat)[, max(Date) %m-% years(1), by = ID]
example.dat[example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]]
ID Date Cat
1: 1 2004-10-21 Yes
2: 1 2005-02-06 No
3: 1 2005-06-14 Yes
4: 2 2009-03-05 Yes
5: 2 2009-07-20 Yes
6: 3 2005-09-02 NA
7: 3 2006-01-15 No
8: 3 2006-05-18 No
mDT contains the start dates of the 12 months period for each ID:
ID V1
1: 1 2004-06-14
2: 2 2008-07-20
3: 3 2005-05-18
The non-equi join returns the indices of the rows which fulfill the conditions
example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]
[1] 2 3 4 6 7 8 9 10
which are then used to finally subset example.dat.
Comparison of date arithmetic methods
The answers posted so far employed three different methods to find a date 12 months earlier:
docendo discimus subtracts 365 days,
G. Grothendieck uses seq.Date(),
this answer uses years() and %m-%.
The three methods differ in case a leap day is included in the period:
library(data.table)
library(lubridate)
mseq <- Vectorize(function(x) seq(x, length = 2L, by = "-1 year")[2L])
data.table(Date = as.Date("2016-02-28") + 0:2)[
, minus_365d := Date -365][
, minus_1yr := Date - years()][
, minus_1yr_m := Date %m-% years()][
, seq.Date := as_date(mseq(Date))][]
Date minus_365d minus_1yr minus_1yr_m seq.Date
1: 2016-02-28 2015-02-28 2015-02-28 2015-02-28 2015-02-28
2: 2016-02-29 2015-03-01 <NA> 2015-02-28 2015-03-01
3: 2016-03-01 2015-03-02 2015-03-01 2015-03-01 2015-03-01
If there is no leap day in the past period, all three methods return the same result (row 1).
If a leap day is included in the past period, subtracting 365 days does not fully cover 12 months (row 3) as a leap year has 366 days.
If the reference date is a leap day, the seq.Date() approach picks the next day, 1 March 2015, as there is no 29 February in 2015. Using lubridate's %m-% rolls the date to the last day of February, 28 Feb 2015, instead.
Here is a base solution. We have ave operate on dates as numbers since if we were to use raw "Date" values ave would try to return "Date" values. Instead, ave returns 0/1 values and !! converts those to FALSE/TRUE.
in_last_yr <- function(x) {
max_date <- as.Date(max(x), "1970-01-01")
x > seq(max_date, length = 2, by = "-1 year")[2]
}
subset(example.dat, !!ave(as.numeric(Date), ID, FUN = in_last_yr))
Update: improved the method of determining which dates fall in the last year.
A possible approach using dplyr
library(dplyr)
example.dat %>% group_by(ID) %>% filter(Date >= max(Date)-365)
#Source: local data frame [8 x 3]
#Groups: ID
#
# ID Date Cat
#1 1 2004-10-21 Yes
#2 1 2005-02-06 No
#3 1 2005-06-14 Yes
#4 2 2009-03-05 Yes
#5 2 2009-07-20 Yes
#6 3 2005-09-02 NA
#7 3 2006-01-15 No
#8 3 2006-05-18 No
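If the leap-day caveat matters and you want to stay in base R, one possible sketch combines per-ID filtering with seq(by = "-1 year") instead of subtracting 365 days. The data below is a five-row subset of the example, and one_year_before is a hypothetical helper name:

```r
# Five rows taken from the question's example data
example.dat <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = as.Date(c("2000-02-01", "2004-10-21", "2005-06-14",
                   "2002-11-24", "2009-07-20"))
)

# Same calendar date one year earlier
one_year_before <- function(d) seq(d, length.out = 2, by = "-1 year")[2]

# For each ID, keep the row numbers dated within a year of that ID's latest date
keep <- unlist(lapply(
  split(seq_len(nrow(example.dat)), example.dat$ID),
  function(idx) {
    d <- example.dat$Date[idx]
    idx[d >= one_year_before(max(d))]
  }
), use.names = FALSE)

recent <- example.dat[sort(keep), ]
```

Here rows 2, 3 and 5 survive: the cutoff is 2004-06-14 for ID 1 and 2008-07-20 for ID 2.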