Calculating difference with respect to previous day for each ID - r

Stock.Open <- rep(c(102.25,102.87,102.25,100.87,103.44,103.87,103.00),times=3)
Stock.Close <- rep(c(102.12,102.62,100.12,103.00,103.87,103.12,105.12), times=3)
Stock.id<-rep(1:3,each=7)
day<-rep(c(1:7),times=3)
df<-data.frame(day,Stock.Close,Stock.Open,Stock.id)
How to I calculate the %difference for stock.open and stock.close each day w.r.t to the previous day. Ex: I want to calculate %change in Stock.open between day1,day2 then day2,day3, day3,day4, and so on..
To perform the same task for each ID.

Try this:
Stock.Open <- rep(c(102.25,102.87,102.25,100.87,103.44,103.87,103.00),times=3)
Stock.Close <- rep(c(102.12,102.62,100.12,103.00,103.87,103.12,105.12), times=3)
Stock.id<-rep(1:3,each=7)
day<-rep(c(1:7),times=3)
df<-data.frame(day,Stock.Close,Stock.Open,Stock.id)
#Subset stock data per day
temp <- subset(df, select = c(Stock.Open, Stock.Close, day, Stock.id))
#Change day and rename
temp$day <- temp$day + 1
require(plyr)
temp <- plyr::rename(temp, c("Stock.Open" = "Stock.Open.Pre", "Stock.Close" = "Stock.Close.Pre"))
#Merge back
df <- join(df, temp, by = c("day", "Stock.id"), type = "left")
#Compute difference
df$Stock.Open.Diff <- (df$Stock.Open / df$Stock.Open.Pre) - 1
df$Stock.Close.Diff <- (df$Stock.Close / df$Stock.Close.Pre) - 1

Try this:
library(dplyr)
df %>%
group_by(Stock.id) %>%
arrange(day) %>%
mutate(Change.Stock.Open = c(NA, diff(Stock.Open))/Stock.Open,
Change.Stock.Close = c(NA, diff(Stock.Close))/Stock.Close)
# A tibble: 21 x 6
# Groups: Stock.id [3]
day Stock.Close Stock.Open Stock.id Change.Stock.Open Change.Stock.Close
<int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 102.12 102.25 1 NA NA
2 1 102.12 102.25 2 NA NA
3 1 102.12 102.25 3 NA NA
4 2 102.62 102.87 1 0.006027024 0.004872345
5 2 102.62 102.87 2 0.006027024 0.004872345
6 2 102.62 102.87 3 0.006027024 0.004872345
7 3 100.12 102.25 1 -0.006063570 -0.024970036
8 3 100.12 102.25 2 -0.006063570 -0.024970036
9 3 100.12 102.25 3 -0.006063570 -0.024970036
10 4 103.00 100.87 1 -0.013680976 0.027961165
# ... with 11 more rows
(Values associated with day 1 for each stock are NA since there's no previous day for comparison)

Yet another solution, this time using base R only.
df$Stock.Close.Change <- ave(df$Stock.Close, df$Stock.id, FUN = function(x) c(NA, diff(x))/x)
df$Stock.Open.Change <- ave(df$Stock.Open, df$Stock.id, FUN = function(x) c(NA, diff(x))/x)
df <- df[order(df$day), ]
row.names(df) <- NULL
df
day Stock.Close Stock.Open Stock.id Stock.Close.Change Stock.Open.Change
1 1 102.12 102.25 1 NA NA
2 1 102.12 102.25 2 NA NA
3 1 102.12 102.25 3 NA NA
4 2 102.62 102.87 1 0.004872345 0.006027024
5 2 102.62 102.87 2 0.004872345 0.006027024
6 2 102.62 102.87 3 0.004872345 0.006027024
7 3 100.12 102.25 1 -0.024970036 -0.006063570
8 3 100.12 102.25 2 -0.024970036 -0.006063570
9 3 100.12 102.25 3 -0.024970036 -0.006063570
10 4 103.00 100.87 1 0.027961165 -0.013680976
11 4 103.00 100.87 2 0.027961165 -0.013680976
12 4 103.00 100.87 3 0.027961165 -0.013680976
13 5 103.87 103.44 1 0.008375854 0.024845321
14 5 103.87 103.44 2 0.008375854 0.024845321
15 5 103.87 103.44 3 0.008375854 0.024845321
16 6 103.12 103.87 1 -0.007273080 0.004139790
17 6 103.12 103.87 2 -0.007273080 0.004139790
18 6 103.12 103.87 3 -0.007273080 0.004139790
19 7 105.12 103.00 1 0.019025875 -0.008446602
20 7 105.12 103.00 2 0.019025875 -0.008446602
21 7 105.12 103.00 3 0.019025875 -0.008446602

Related

How to print a date when the input is number of days since 01-01-60?

I received a set of dates, but it turns out that time is reported in days since 01-01-1960 in this specific data set.
D_INDDTO
1 20758
2 20856
3 21062
4 19740
5 21222
6 21203
The specific date of interest for Patient 1 is 20758 days since 01-01-60
I want to create a new covariate u$date containing the specific date of interest i d%m%y%. I tried
library(tidyverse)
u %>% mutate(date=as.date(D_INDDTO,origin="1960-01-01")
But that did not solve it.
u <- structure(list(D_INDDTO = c(20758, 20856, 21062, 19740, 21222,
21203, 20976, 20895, 18656, 18746)), row.names = c(NA, 10L), class = "data.frame")
Try this:
#Code 1
u %>% mutate(date=as.Date("1960-01-01")+D_INDDTO)
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 2
u %>% mutate(date=as.Date(D_INDDTO,origin="1960-01-01"))
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 3
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d%m%y'))
Output:
D_INDDTO date
1 20758 311016
2 20856 060217
3 21062 310817
4 19740 170114
5 21222 070218
6 21203 190118
7 20976 060617
8 20895 170317
9 18656 290111
10 18746 290411
If more customization is required:
#Code 4
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d-%m-%Y'))
Output:
D_INDDTO date
1 20758 31-10-2016
2 20856 06-02-2017
3 21062 31-08-2017
4 19740 17-01-2014
5 21222 07-02-2018
6 21203 19-01-2018
7 20976 06-06-2017
8 20895 17-03-2017
9 18656 29-01-2011
10 18746 29-04-2011

Group records with time interval overlap

I have a data frame (with N=16) contains ID (character), w_from (date), and w_to (date). Each record represent a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number by ID for the records that their time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 overlap.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, record#3 are all overlap.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(w_from) %>%
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2

How to add rows and extrapolate the data by multiple variables?

I'm trying to add missing lines for "day" and extrapolate the data for "value". In my data each subject ("id") has 2 periods (period 1 and period 2) and values for consecutive days.
An example of my data looks like this:
df <- data.frame(
id = c(1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3),
period = c(1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2),
day= c(1,2,4,5, 1,3,4,5, 2,3,4,5, 1,2,3,5, 2,3,4,5, 1,2,3,4),
value =c(10,12,15,16, 11,14,15,17, 13,14,15,16, 15,16,18,20, 16,17,19,29, 14,16,18,20))
For each id and period I am missing data for days 3,2,1,4,1,5, respectively. I want to expand the data to let's say 10 days and extrapolate the data on value column (e.g. with linear regression).
My final df should be something like that:
df2 <- data.frame(
id = c(1,1,1,1,1,1,1, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 2,2,2,2,2,2,2, 3,3,3,3,3,3,3, 3,3,3,3,3,3,3),
period = c(1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2),
day= c(1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7),
value =c(10,12,13,15,16,17,18, 11,12,14,15,17,18,19, 12,13,14,15,16,18,22, 15,16,18,19,20,22,23, 15,16,17,19,29,39,49, 14,16,18,20,22,24,26))
The most similar example I found doesn't extrapolate by two variables (ID and period in my case), it extrapolates only by year. I tried to adapt the code but no success :(
Another example extrapolates the data by multiple id but doesn't add rows for missing data.
I couldn't combine both codes with my limited experience in R. Any suggestions?
Thanks in advance...
We can use complete
library(dplyr)
library(tidyr)
library(forecast)
df %>%
group_by(id, period) %>%
complete(day =1:7)%>%
mutate(value = as.numeric(na.interp(value)))
#akrun's answer is good, as long as you don't mind using linear interpolation. However, if you do want to use a linear model, you could try this data.table approach.
library(data.table)
model <- lm(value ~ day + period + id,data=df)
dt <- as.data.table(df)[,.SD[,.(day = 1:7,value = value[match(1:7,day)])],by=.(id,period)]
dt[is.na(value), value := predict(model,.SD),]
dt
id period day value
1: 1 1 1 10.00000
2: 1 1 2 12.00000
3: 1 1 3 12.86714
4: 1 1 4 15.00000
5: 1 1 5 16.00000
6: 1 1 6 18.13725
7: 1 1 7 19.89396
8: 1 2 1 11.00000
9: 1 2 2 12.15545
10: 1 2 3 14.00000
11: 1 2 4 15.00000
12: 1 2 5 17.00000
13: 1 2 6 19.18227
14: 1 2 7 20.93898
15: 2 1 1 11.90102
16: 2 1 2 13.00000
17: 2 1 3 14.00000
18: 2 1 4 15.00000
19: 2 1 5 16.00000
20: 2 1 6 20.68455
21: 2 1 7 22.44125
22: 2 2 1 15.00000
23: 2 2 2 16.00000
24: 2 2 3 18.00000
25: 2 2 4 18.21616
26: 2 2 5 20.00000
27: 2 2 6 21.72957
28: 2 2 7 23.48627
29: 3 1 1 14.44831
30: 3 1 2 16.00000
31: 3 1 3 17.00000
32: 3 1 4 19.00000
33: 3 1 5 29.00000
34: 3 1 6 23.23184
35: 3 1 7 24.98855
36: 3 2 1 14.00000
37: 3 2 2 16.00000
38: 3 2 3 18.00000
39: 3 2 4 20.00000
40: 3 2 5 22.52016
41: 3 2 6 24.27686
42: 3 2 7 26.03357
id period day value

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr and I have also been able to create a column with 0-1, where 0 indicates that assets has a valid entry and 1 highlights missing values of NA.
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq(which(is.na(assets))[1]-1)]
else .SD, by = productreference]
You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference and for every subset we look for the first occurrence of assets==NA, and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers from these two previous posts but have not been able to address my specific problem in R; multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' after grouping by 'event_type'. Here, I am using data.table approach by converting the 'data.frame' to 'data.table' (setDT(df)), grouped by 'event_type', we get the diff of 'date'.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or as #Frank mentioned in the comments, we can also use shift (from version v1.9.5+ onwards) to get the lag (by default, the type='lag') of 'date' and subtract from the 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, #akrun has posted the data.tables approach, the parallel dplyr approach would be straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days

Resources