Calculating dynamic Month over Month in R

I am trying to calculate Month over Month % Revenue change on data rows using R. For example my current data is:
Booking.Date Revenue Month
4/1/2018 3160 April
4/1/2018 12656 April
4/1/2018 5157 April
5/8/2018 12152 May
5/8/2018 2824 May
5/8/2018 4600 May
6/30/2018 6936 June
6/30/2018 17298 June
6/30/2018 9625 June
I want to make a dynamic function in R which calculates the Revenue MoM,
MoM = ((Revenue_month2 - Revenue_month1) / Revenue_month1) * 100,
for any new month.
The output should be similar to:
Month Revenue_MoM
April 3%
May -8%
June 50%
and so on.

I got a data.table solution; only the ordering needs to be fixed, by turning the month into a proper date. But it should give you an idea. Please keep in mind that for the first month there's no way to calculate a growth rate. I used the logarithmic growth rate, which is in my opinion the best way, but you can easily switch to any other growth-rate calculation.
library(data.table)
dt <- structure(list(Booking.Date = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("4/1/2018", "5/8/2018", "6/30/2018"
), class = "factor")
, Revenue = c(3160L, 12656L, 5157L, 12152L, 2824L, 4600L, 6936L, 17298L, 9625L)
, Month = structure(c(1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L), .Label = c("April", "June", "May"), class = "factor"))
, row.names = c(NA, -9L)
, class = c("data.table","data.frame"))
# Change the Month column into a Date one
# Set the locale so that the English month names can be parsed
Sys.setlocale("LC_TIME", "en_US.UTF-8")
dt[, `:=`(Month.Date = as.Date(paste0("2018-", Month, "-01"), tryFormats = "%Y-%B-%d"))]
# Calculation based on the logarithmic growth rate
dt[, .(Sum.Revenue = sum(Revenue)), by = list(Month.Date)][, .(Month.Date
    , Sum.Revenue
    , Change.Revenue = log(Sum.Revenue) - log(shift(Sum.Revenue, n = 1L, type = "lag"))
)]
# Calculation based on the ordinary growth-rate formula
dt[, .(Sum.Revenue = sum(Revenue)), by = list(Month.Date)][, .(Month.Date
    , Sum.Revenue
    , Change.Revenue = (Sum.Revenue - shift(Sum.Revenue, n = 1L, type = "lag")) / shift(Sum.Revenue, n = 1L, type = "lag")
)]
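For the percent output shown in the question, a minimal dplyr sketch of the same aggregation (an addition, assuming the Month.Date column created above):

library(dplyr)

dt %>%
  group_by(Month.Date) %>%
  summarise(Sum.Revenue = sum(Revenue)) %>%
  arrange(Month.Date) %>%
  # ((month2 - month1) / month1) * 100; NA for the first month
  mutate(Revenue_MoM = (Sum.Revenue - lag(Sum.Revenue)) / lag(Sum.Revenue) * 100)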


How to combine multiple text entries for a variable once dplyr has grouped by another variable [duplicate]

For hundreds of matters, my data frame has daily text entries by dozens of timekeepers. Not every timekeeper enters time each day for each matter. Text entries can be any length. Each entry for a matter is for work done on a different day (but for my purposes, figuring out readability measures for the text, dates don't matter). What I would like to do is to combine for each matter all of its text entries.
Here is a toy data set and what it looks like:
> dput(df)
structure(list(Matter = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 4L, 4L), .Label = c("MatterA", "MatterB", "MatterC", "MatterD"
), class = "factor"), Timekeeper = structure(c(1L, 2L, 3L, 4L,
2L, 3L, 1L, 1L, 3L, 4L), .Label = c("Alpha", "Baker", "Charlie",
"Delta"), class = "factor"), Text = structure(c(5L, 8L, 1L, 3L,
7L, 6L, 9L, 2L, 10L, 4L), .Label = c("all", "all we have", "good men to come to",
"in these times that try men's souls", "Now is", "of", "the aid",
"the time for", "their country since", "to fear is fear itself"
), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
Dplyr groups the time records by matter, but I am stumped as to how to combine the text entries for each matter so that the result is along these lines -- all text gathered for a matter:
1 MatterA Now is the time for all good men to come to
5 MatterB the aid of their country since
8 MatterC all we have
9 MatterD to fear is fear itself in these times that try men's souls
dplyr::mutate() does not work with various concatenation functions:
textCombined <- df %>% group_by(Matter) %>% mutate(ComboText = str_c(Text))
textCombined2 <- df %>% group_by(Matter) %>% mutate(ComboText = paste(Text))
textCombined3 <- df %>% group_by(Matter) %>% mutate(ComboText = c(Text)) # creates numbers
Maybe a loop will do the job, as in "while the matter stays the same, combine the text" but I don't know how to write that. Or maybe dplyr has a conditional mutate, as in "mutate(while the matter stays the same, combine the text)."
Thank you for your help.
Hi, you can use group_by and summarise with paste:
> df %>% group_by(Matter) %>% summarise(line = paste(Text, collapse = " "))
# A tibble: 4 x 2
# Matter line
# <fct> <chr>
#1 MatterA Now is the time for all good men to come to
#2 MatterB the aid of their country since
#3 MatterC all we have
#4 MatterD to fear is fear itself in these times that try men's souls
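For completeness, mutate() does work here once you pass collapse; unlike summarise(), it keeps every original row and repeats the combined string within each group. A short sketch (an addition, not part of the answer above):

library(dplyr)
# All rows kept; ComboText holds the full combined text for each Matter
df %>%
  group_by(Matter) %>%
  mutate(ComboText = paste(Text, collapse = " ")) %>%
  ungroup()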

How to concatenate rows based on group as quickly as possible

I have a dataframe as follows
ClientVisitGUID LineNum TextCol
1 1 This was a great
1 2 report I did
2 3 was performed today
2 1 Another great report
2 2 for this person
3 2 good stuff
3 1 I really write very
3 3 when I put my
3 4 mind to it
I'd like to concatenate the rows based on the ClientVisitGUID and the line number so I can get the following output:
ClientVisitGUID TextCol
1 This was a great report I did
2 Another great report for this person was performed today
3 I really write very good stuff when I put my mind to it
I tried dplyr, but it takes a long time and can't cope with the thousands of rows I have:
resultset2<-resultset %>%
group_by(ClientVisitGUID) %>%
arrange(LineNum) %>%
summarize_all(paste, collapse=",")
Is there a faster way? I'm not really familiar with data.table, but is it fast?
A second data.table option, also using stringi for its performance:
library(data.table)
library(stringi)
setDT(df)
setkey(df, ClientVisitGUID, LineNum)
df1 <- df[, .(new = stri_c(TextCol, collapse = " ")), by = ClientVisitGUID]
Result
df1
# ClientVisitGUID new
#1: 1 This was a great report I did
#2: 2 Another great report for this person was performed today
#3: 3 I really write very good stuff when I put my mind to it
Data (thanks to @ThomasIsCoding):
df <- structure(list(ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L), TextCol = c("This was a great",
"report I did", "was performed today", "Another great report",
"for this person", "good stuff", "I really write very", "when I put my",
"mind to it")), class = "data.frame", row.names = c(NA, -9L))
A base R option uses aggregate:
result <- aggregate(TextCol~ClientVisitGUID,
df[order(df$ClientVisitGUID,df$LineNum),],
paste0,
collapse = " ")
which gives
> result
ClientVisitGUID TextCol
1 1 This was a great report I did
2 2 Another great report for this person was performed today
3 3 I really write very good stuff when I put my mind to it
Data: the same df as in the previous answer.
If you want speed, data.table is indeed a great candidate:
library(data.table)
setDT(resultset)
data.table::setkeyv(resultset, "ClientVisitGUID")
resultset <- resultset[order(ClientVisitGUID, LineNum)]
resultset[, lapply(.SD, paste, collapse = ","), by = "ClientVisitGUID"]  # one collapsed string per column
Setting the key takes some time at first, but you end up with faster operations afterwards: setting keys reorders the rows of each group into contiguous memory slots.
Example
data = data.table("a" = c("aaa","ffff","ttt"), "b" = c(1,1,2))
data[, lapply(.SD, paste, collapse = ","), by = "b"]
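To check the speed claims on your own data, a hypothetical timing sketch with microbenchmark (object names follow the examples above; actual timings will vary):

library(data.table)
library(stringi)
library(microbenchmark)

dt <- as.data.table(df)
setkey(dt, ClientVisitGUID, LineNum)

microbenchmark(
  base_aggregate = aggregate(TextCol ~ ClientVisitGUID,
                             df[order(df$ClientVisitGUID, df$LineNum), ],
                             paste0, collapse = " "),
  dt_stringi     = dt[, .(new = stri_c(TextCol, collapse = " ")), by = ClientVisitGUID],
  times = 100
)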

Conditional Seasonal Averaging Time-Series Data

Introduction
Summary:
Trying to average data by season (when necessary) when certain conditions are met.
Hello everyone.
I am currently working with numerous large data sets (>200 sets with >5000 rows each) of long-term time series data collection for multiple variables across different locations. So far, I've extracted data into separate CSV files per site and per station.
For the most part, the data reported per parameter is one instance per season.
Season here is defined ecologically as DJF, MAM, JJA, SON for months corresponding to Winter, Spring, Summer, and Fall respectively.
However, there are some cases where there were multiple readings during a seasonal event. Here, the parameter values and dates have to be averaged; this is before further analysis can take place on these data sets.
To complicate things even further, some of the data is marked with a greater-than or less-than (GTLT) symbol. In these cases, values and dates are not averaged unless the recorded value is the same.
Data Example
Summary:
Code and Tables show requested changes in data-set
So, for a data-driven example...
Here's a few rows from a data set.
Data.Example<-structure(list(
Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402),
End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"),
Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value = c(10, 10, 20, 30, 15, 10)),
.Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")
This is ideally what I would like as output
Data.Example.New<-structure(list(
Station.ID.new = c(13402, 13402, 13402, 13402),
End.Date.new = structure(c(2L, 3L, 2L, 1L), .Label = c("10/13/2016", "7/28/2016", "8/15/2016"), class = "factor"),
Parameter.Name.new = structure(c(2L, 2L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT.new = structure(c(2L, 2L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value.new = c(10, 20, 22.5, 10)),
.Names = c("Station.ID.new", "End.Date.new", "Parameter.Name.new", "GTLT.new", "Value.new"), row.names = c(NA, -4L), class = "data.frame")
Here, the following things are occurring:
For Enterococci measured on July 13 and August 13, there is a GTLT symbol, but the Value for both == 10, so the dates are averaged. The new row has date 7/28/2016 and Value 10.
While the Enterococci reading on August 15 is within the same season as the other values, its GTLT value differs, so it would only be averaged with same-season, same-year values of 20. Since it is the only row where Value == 20, that row does not change and is repeated in the final data frame.
Alkalinity in July and August falls in the same season, so average the dates (7/28/16) and the Values (22.5) into a new row.
Alkalinity in October is a different season, so keep that row.
All other data (such as Station.ID and Parameter.Name) should just be copied, since they shouldn't differ here.
If for some reason you have GTLT and non-GTLT readings for the same parameter:
End.Date GTLT Value Parameter
7/13/2015 < 10 Alk
7/13/2016 < 10 Alk
8/13/2016 10 Alk
8/15/2016 20 Alk
Then final result would be
End.Date GTLT Value Parameter
7/13/2015 < 10 Alk
7/13/2016 < 10 Alk
8/14/2016 15 Alk
Approach
Summary:
Define seasons and then aggregate using package like dplyr?
Create a loop function to read row by row (after sorting by Parameter.Name then Date?)
As one might expect, this is where I'm stuck.
I know seasons can be defined in R from prior Stack Q's:
New vector of seasons based on dates
And I know that averaging/aggregation packages such as dplyr (and possibly zoo?) can chain commands.
My issue is putting this thought process into code that can be repeated for each data set.
I'm not sure if that's the best approach (define seasons and then set conditions for averaging data), or if some sort of loop function would work here by going through row by row of the data set post-sort by Parameter.Name then End.Date.
I quickly sketched my thoughts on what some sort of loop function would have to include:
(Figure: rough sketch of the flow diagram for such a loop.)
Note, you can't just average rows [i] and [i+1], because [i+2], etc. might need to be averaged as well. Hence the loop has to find the row [i+n] that breaks the run, average all prior rows through [i+n-1], and move on to the next new row [i+n].
Further, as clarification, "same season" means within the same annual cycle. So 7/13/2016 == 8/13/2016 for the same season, and 12/12/2015 == 01/01/2016 for the same season, but 4/13/2016 != 4/13/2015 as far as averaging is concerned.
Conclusion and Summary
In short, I need help designing code to average individual parameter time-series values by annual season with specific exceptions for multiple large data sets.
I'm not sure of the best approach in designing code to do this, whether it's a large loop function or a combination of code and specialized chaining-enabled packages.
Thank you for your time in advance.
Cheers,
soccernamlak
Using dplyr and lubridate I was able to come up with a solution. My output matches your example output, except I did not keep the exact dates, which I felt were misleading in the final result.
(This re-uses the Data.Example data frame from the question above.)
library(dplyr)

# Create season key
seasons <- data.frame(month = 1:12,
                      season = c(rep("DJF", 2), rep("MAM", 3), rep("JJA", 3), rep("SON", 3), "DJF"))
# Isolate Month and Year, create Season column
Data.Example$Month <- lubridate::month(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Year <- lubridate::year(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Season <- seasons$season[Data.Example$Month]
# Update 'year' where month = December so that it is grouped with Jan and Feb of following year
Data.Example$Year[Data.Example$Month == 12] <- Data.Example$Year[Data.Example$Month == 12]+1
# Find out which station/year/season/parameter groups have at least one record with a GTLT
GTLT.Test<- Data.Example %>%
group_by(Station.ID, Year, Season, Parameter.Name) %>%
summarize(has_GTLT = max(nchar(as.character(GTLT))))
# First, only calculate averages for groups without any GTLT
Data.Example.New1 <- Data.Example %>%
  anti_join(GTLT.Test[GTLT.Test$has_GTLT == 1, ],
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT) %>%
  summarize(Value.new = mean(Value))
# Now do the same for groups with GTLT, only combining when values and GTLT symbols match
Data.Example.New2 <- Data.Example %>%
  anti_join(GTLT.Test[GTLT.Test$has_GTLT == 0, ],
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT, Value) %>%
  summarize(Value.new = mean(Value)) %>%
  select(-Value)
# Combine both
Data.Example.New <- rbind(Data.Example.New1, Data.Example.New2)
EDIT: I just noticed you linked to another SO question for converting dates to seasons. Mine simply converts by month, not by exact date, and does not use astronomical seasons. I did this because in your example Dec. 12 matches with Jan. 1; December 12 is technically fall, so I assumed you weren't using actual seasons but rather four three-month groupings.
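If you do want the averaged dates from the ideal output (e.g. 7/28/2016), a hedged addition is to parse End.Date and average it alongside Value; mean() works directly on Date objects. A sketch for the no-GTLT branch (the End.Date.parsed column is an assumption, not part of the answer above):

Data.Example$End.Date.parsed <- as.Date(as.character(Data.Example$End.Date), "%m/%d/%Y")

Data.Example %>%
  anti_join(GTLT.Test[GTLT.Test$has_GTLT == 1, ],
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT) %>%
  summarize(Value.new    = mean(Value),
            End.Date.new = mean(End.Date.parsed))  # mean of the Dates, e.g. 2016-07-28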

Filling in missing (blanks) in a data table, per category - backwards and forwards

I am working with a large data set of billing records from my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in, but I do not know how to implement this with data.table in R. I know that there are things such as na.locf in the zoo package and rolling joins in the data.table package, but the examples that I have seen are too simplistic and do not help me.
Here is some fictitious data to orient you (as a dput ASCII text representation)
structure(list(patient.first.name = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("John", "Kathy",
"Timothy"), class = "factor"), patient.last.name = structure(c(3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Jones",
"Martinez", "Squeal"), class = "factor"), medical.record.nr = c(4563455,
4563455, 4563455, 4563455, 4563455, 2663775, 2663775, 2663775,
2663775, 2663775, 3330956, 3330956, 3330956, 3330956), date.of.service = c(39087,
39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360,
39184, 39194, 39198, 39216), procedure.code = c(44750, 38995,
40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750,
44729, 44729, 44729), diagnosis.code.1 = c(456.87, 456.87, 456.87,
456.87, 456.87, 521.37, 521.37, 521.37, 521.37, 356.36, 456.87,
456.87, 456.87, 456.87), diagnosis.code.2 = c(413, 413, 413,
413, 413, 532.23, NA, NA, NA, NA, NA, NA, NA, NA), referring.doctor.first = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Abe",
"Mark"), class = "factor"), referring.doctor.last = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Newstead",
"Wydell"), class = "factor"), referring.docotor.zip = c(15209,
15209, 15209, 15209, 15209, 15222, 15222, 15222, NA, NA, NA,
15209, 15209, NA), some.other.stuff = structure(c(1L, 1L, 1L,
NA, 3L, NA, NA, 4L, NA, 6L, NA, 2L, 5L, NA), .Label = c("alkjkdkdio",
"cheerios", "ddddd", "dddddd", "dogs", "lkjljkkkkk"), class = "factor")), .Names = c("patient.first.name",
"patient.last.name", "medical.record.nr", "date.of.service",
"procedure.code", "diagnosis.code.1", "diagnosis.code.2", "referring.doctor.first",
"referring.doctor.last", "referring.docotor.zip", "some.other.stuff"
), row.names = c(NA, 14L), class = "data.frame")
The obvious solution is to use some sort of last-observation-carried-forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words, the LOCF must only be applied within one patient, who is identified by the combination of patient.first.name, patient.last.name, and medical.record.nr. Also note how some patients are missing the referring doctor on their very first visit, which means that some observations have to be carried backwards. To complicate matters, some patients change primary care physicians, so there may be one referring doctor earlier on and another one later. The algorithm therefore needs to be aware of the date order of the rows with missing values.
In zoo's na.locf I do not see an easy way to group the LOCF per patient. The rolling-join examples that I have seen would not work here because I cannot simply take out the rows with missing referring.doctor information, since I would then lose date.of.service, procedure.code, etcetera. I would love your help in learning how R can fill in my missing data.
A more concise example would have been easier to answer. For example you've included quite a few columns that appear to be redundant. Does it really need to be by first name and last name, or can we use the patient number?
Since you already have NAs in the data that you wish to fill, it's not really a job for roll in data.table. A rolling join is more for when your data has no NAs but you have another time series (for example) that joins to positions in between the data. (One efficiency advantage there is the very fact that you don't create NAs first, which you would then have to fill in a second step.) Or, in other words, in your question you just have one dataset; you aren't joining two.
So you do need na.locf as @Joshua suggested. I'm not aware of a function that fills NA forward and then carries the first value backwards, though.
In data.table, to use na.locf by group it's just:
require(data.table)
require(zoo)
DT[, doctor := na.locf(doctor), by = patient]
which has the efficiency advantages of fast aggregation and update by reference. You would have to write a small new function on top of na.locf to roll the first non-NA backwards.
Ensure the data is sorted by patient then date, first. Then the above will cope with changes in doctor over time, since by maintains the order of rows within each group.
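A tiny runnable illustration of the grouped update (toy data, an addition to the answer):

library(data.table)
library(zoo)
DT <- data.table(patient = c("a", "a", "a", "b", "b"),
                 doctor  = c("X", NA, NA, "Y", NA))
# Carry the last known doctor forward, but only within each patient
DT[, doctor := na.locf(doctor, na.rm = FALSE), by = patient]
DT
#    patient doctor
# 1:       a      X
# 2:       a      X
# 3:       a      X
# 4:       b      Y
# 5:       b      Y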
Hope that gives you some hints.
@MatthewDowle has provided us with a wonderful starting point, and here we will take it to its conclusion.
In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.
library(data.table)
library(zoo)

setDT(bill)
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE, fromLast = TRUE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]
Then do something similar for referring.doctor.first
A few pointers:
The by statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.
One must use the na.rm=FALSE argument. If one does not, then a patient who is missing the referring-physician information on their very first visit will have the leading NA removed, and the vector of new values (existing + carried forward) will be one element shorter than the number of rows. The shortened vector is then recycled: everything gets shifted up, and the last row receives the first element of the vector. In other words, a big mess. Worst of all, you will only see it sometimes.
Use fromLast=TRUE to run through the column again. That fills in the NA that preceded any data. Instead of last observation carried forward (LOCF) zoo uses next observation carried backward (NOCB). Happiness - you have now filled in the missing data in a way that is correct for most circumstances.
You can pass multiple := per line, e.g. DT[,`:=`(new=1L,new2=2L,...)]
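Putting the forward and backward passes together for all of the referring-doctor columns, a small helper keeps the calls tidy. This is a sketch only, assuming rows are already sorted by patient and date.of.service; fill.both is a made-up name:

# Fill forward, then carry the first non-NA value backwards (NOCB)
fill.both <- function(x) {
  na.locf(na.locf(x, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE)
}

bill[, `:=`(referring.doctor.first = fill.both(referring.doctor.first),
            referring.doctor.last  = fill.both(referring.doctor.last),
            referring.docotor.zip  = fill.both(referring.docotor.zip)),  # column name spelled as in the dput
     by = list(patient.last.name, patient.first.name, medical.record.nr)]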

Counting an event only every X days per subject (in an irregular time series)

I've got data where I'm counting episodes of care (like ER visits). The trick is, I can't count every single visit, because sometimes a 2nd or 3rd visit is actually a follow-up for a previous problem. So I've been given direction to count visits using a 30-day "clean period" or "blackout period": I look for the first event (VISIT 1) per patient (the min date), count that event, then apply rules so as NOT to count any visits that occur in the 30 days following the first event. After that 30-day window has elapsed, I can begin looking for the 2nd visit (VISIT 2), count that one, then apply the 30-day blackout again (NOT counting any visits that occur in the 30 days after visit #2)... wash, rinse, repeat...
I have rigged together a very sloppy solution that requires a lot of babysitting and manual checking of steps (see below). I have to believe that there is a better way. HELP!
data1 <- structure(list(ID = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L), .Label = c("", "patient1", "patient2",
"patient3"), class = "factor"), Date = structure(c(14610, 14610,
14627, 14680, 14652, 14660, 14725, 15085, 15086, 14642, 14669,
14732, 14747, 14749), class = "Date"), test = c(1L, 1L, 1L, 2L,
1L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L)), .Names = c("ID", "Date",
"test"), class = "data.frame", row.names = c(NA, 14L))
library(doBy)
## create a table of first events
step1 <- summaryBy(Date~ID, data = data1, FUN=min)
step1$Date30 <- step1$Date.min+30
step2 <- merge(data1, step1, by.x="ID", by.y="ID")
## use an ifelse to essentially remove any events that shouldn't be counted
step2$event <- ifelse(as.numeric(step2$Date) >= step2$Date.min & as.numeric(step2$Date) <= step2$Date30, 0, 1)
## basically repeat steps above until I dont capture any more events
## there just has to be a better way
data3 <- step2[step2$event==1,]
data3<- data3[,1:3]
step3 <- summaryBy(Date~ID, data = data3, FUN=min)
step3$Date30 <- step3$Date.min+30
step4 <- merge(data3, step3, by.x="ID", by.y="ID")
step4$event <- ifelse(as.numeric(step4$Date) >= step4$Date.min & as.numeric(step4$Date) <= step4$Date30, 0, 1)
data4 <- step4[step4$event==1,]
data4<- data4[,1:3]
step5 <- summaryBy(Date~ID, data = data4, FUN=min)
step5$Date30 <- step5$Date.min+30
## then I rbind the "keepers"
## in this case steps 1 and 3 above
final <- rbind(step1,step3, step5)
## then reformat
final <- final[,1:2]
final$Date.min <- as.Date(final$Date.min,origin="1970-01-01")
## again, extremely clumsy, but it works... HELP! :)
This solution is loop-free and uses only base R. It produces a logical vector ok which selects the acceptable rows of data1.
ave runs the indicated anonymous function over each patient separately.
We define a state vector consisting of the current date and the start of the period during which no other dates are counted; each date is represented by as.numeric(x), where x is the date. step takes the state vector and the current date and returns the updated state vector. Reduce runs it over the data, and then we keep only the rows for which the window start and the current date are the same and the current date is not a duplicate.
step <- function(init, curdate) {
  # state: c(current date, start of the active 30-day window)
  c(curdate, if (curdate > init[2] + 30) curdate else init[2])
}
ok <- !!ave(as.numeric(data1$Date), paste(data1$ID), FUN = function(d) {
  x <- do.call("rbind", Reduce(step, d, c(-Inf, 0), accumulate = TRUE))
  # keep the dates that start a new window, once each
  x[-1, 1] == x[-1, 2] & !duplicated(x[-1, 1])
})
data1[ok, ]
Since that kind of manipulation is not straightforward and is error-prone, I would write a separate function to discard events in the blackout period. The function contains a loop, which basically does what you were doing by hand, until there is nothing left to do.
blackout <- function(dates, period = 30) {
  dates <- sort(dates)
  while (TRUE) {
    # TRUE wherever a date falls within 'period' days of its predecessor
    spell <- as.numeric(diff(dates)) <= period
    if (!any(spell)) { return(dates) }
    i <- which(spell)[1] + 1  # first date inside a blackout window
    dates <- dates[-i]
  }
}
# Tests
stopifnot(
length(
blackout( seq.Date(Sys.Date(), Sys.Date()+50, by=1) )
) == 2
)
stopifnot(
length(
blackout( seq.Date(Sys.Date(), by=31, length=5) )
) == 5
)
It can be used as follows.
library(plyr)
ddply(data1, "ID", summarize, Date=blackout(Date))
How about:
do.call("rbind", lapply(split(data1, factor(data1$ID)), function(x) {
  x <- x[order(x$Date), ]
  # keeps a visit only if it is more than 30 days after the immediately
  # preceding visit (note: that compares consecutive gaps, not a strict
  # 30-day window after the last counted visit)
  x[c(TRUE, diff(x$Date) > 30), ]
}))
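For larger data, the blackout() helper defined above also drops straight into a data.table grouping; a sketch assuming the data1 structure from the question:

library(data.table)
setDT(data1)
setorder(data1, ID, Date)
# one row per counted visit, per patient
data1[, .(Date = blackout(Date)), by = ID]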
