Stocks Daily Returns with R data.frame - r

(daily return percentage) / 100 = (today's close - yesterday's close) / yesterday's close
I have a data frame like this,
date close
1 2018-09-21 3410.486
2 2018-09-22 3310.126
3 2018-09-23 3312.482
4 2018-09-24 3269.432
5 2018-09-25 3204.922
I'd like to calculate daily returns and make it like this,
date close change
1 2018-09-21 3410.486 3.03%
2 2018-09-22 3310.126 -0.07%
3 2018-09-23 3312.482 1.32%
4 2018-09-24 3269.432 2.01%
5 2018-09-25 3321.825 NA

library(tidyverse)
library(tidyquant)
df %>%
tq_mutate(select = close,
mutate_fun = periodReturn,
period = "daily",
col_rename = "return")
# A tibble: 5 x 3
date close return
<date> <dbl> <dbl>
1 2018-09-21 3410. 0
2 2018-09-22 3310. -0.0294
3 2018-09-23 3312. 0.000712
4 2018-09-24 3269. -0.0130
5 2018-09-25 3205. -0.0197

Just using dplyr.
df1 %>%
mutate(change = (close - lag(close)) / lag(close))
date close change
1 2018-09-21 3410.486 NA
2 2018-09-22 3310.126 -0.0294268911
3 2018-09-23 3312.482 0.0007117554
4 2018-09-24 3269.432 -0.0129962970
5 2018-09-25 3204.922 -0.0197312561
data:
df1 <- structure(list(date = structure(c(17795, 17796, 17797, 17798,
17799), class = "Date"), close = c(3410.486, 3310.126, 3312.482,
3269.432, 3204.922), change = structure(c(0, 0, 0, 0, 0), tsp = c(0,
4, 1))), row.names = c(NA, -5L), class = "data.frame")
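For comparison, the same calculation can be sketched in base R without any packages. This is only an illustration of the formula at the top of the question; the column name change_pct and the two-decimal percent format are assumptions made for display, not part of either answer.

```r
# Same closing prices as in the question
df <- data.frame(
  date  = as.Date("2018-09-21") + 0:4,
  close = c(3410.486, 3310.126, 3312.482, 3269.432, 3204.922)
)

# Simple return: (today's close - yesterday's close) / yesterday's close.
# The first day has no previous close, so its return is NA.
df$change <- c(NA, diff(df$close) / head(df$close, -1))

# Optional display column (hypothetical name), formatted as a percentage string
df$change_pct <- ifelse(is.na(df$change), NA, sprintf("%.2f%%", 100 * df$change))

print(df)
```

Keeping the numeric change column next to the formatted one means later calculations do not have to parse the percent strings back into numbers.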

Related

How to calculate duration in R sequentially

I have a dataset in which patients are intubated (ObservationValue of "Start") and extubated (ObservationValue of "Stop"). Unfortunately, there are occasional misentries where a patient has an extubation before the intubation, as seen in EncounterID #3 below, where an extubation is followed by an intubation. In those instances I have no interest in calculating the ventilation duration. My question therefore is: how can I calculate the duration from each intubation to the first extubation that happened after it?
My apologies if this is a basic question, I am still fairly new to using R for data management.
Here is an example of my dataset:
And this is what I would like ideally:
Here is dput output of the dataset I currently have:
test<-structure(list(EncounterID=structure(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3)),
ObservationDate=structure(c("2018-01-12 15:27:00", "2018-01-12 19:02:00", "2018-03-03 21:09:00", "2018-03-06 07:56:00",
"2019-12-03 15:54:00", "2019-12-03 20:06:00", "2019-12-04 11:40:00", "2019-12-06 08:13:00",
"2019-12-23 18:50:00", "2019-12-23 16:00:00")),
ObservationValue=structure(c("Start", "Stop", "Start", "Stop", "Start", "Stop", "Start", "Stop", "Stop", "Start"))),
row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
test$ObservationDate<-as.POSIXct(test[["ObservationDate"]],tz="", format="%Y-%m-%d %H:%M")
With a bit of data wrangling and reshaping to wide format you could do:
library(dplyr)
library(tidyr)
test |>
arrange(EncounterID, ObservationDate) |>
group_by(EncounterID) |>
mutate(id = cumsum(ObservationValue == "Start")) |>
ungroup() |>
pivot_wider(names_from = ObservationValue, values_from = ObservationDate) |>
rename(Intubation = Start, Extubation = Stop) |>
mutate(duration = difftime(Extubation, Intubation, units = "days"))
#> # A tibble: 5 × 5
#> EncounterID id Intubation Extubation duration
#> <dbl> <int> <dttm> <dttm> <drtn>
#> 1 1 1 2018-01-12 15:27:00 2018-01-12 19:02:00 0.1493056 days
#> 2 1 2 2018-03-03 21:09:00 2018-03-06 07:56:00 2.4493056 days
#> 3 2 1 2019-12-03 15:54:00 2019-12-03 20:06:00 0.1750000 days
#> 4 2 2 2019-12-04 11:40:00 2019-12-06 08:13:00 1.8562500 days
#> 5 3 1 2019-12-23 16:00:00 2019-12-23 18:50:00 0.1180556 days
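One case worth checking separately is a "Stop" that has no earlier "Start" in the same encounter once the rows are sorted by time. With a running count of Starts, such rows get an id of 0 and can be dropped before pairing. The following is a base R sketch on a small hypothetical dataset (the obs data frame and its timestamps are made up for illustration); it pairs each Start with the first Stop that follows it.

```r
# Hypothetical data: encounter 2 begins with a Stop that has no earlier Start
obs <- data.frame(
  EncounterID = c(1, 1, 2, 2, 2),
  ObservationDate = as.POSIXct(
    c("2018-01-12 15:27:00", "2018-01-12 19:02:00",
      "2019-12-03 09:00:00", "2019-12-03 15:54:00", "2019-12-03 20:06:00"),
    tz = "UTC"
  ),
  ObservationValue = c("Start", "Stop", "Stop", "Start", "Stop")
)

obs <- obs[order(obs$EncounterID, obs$ObservationDate), ]

# Running count of Starts within each encounter; 0 means no Start seen yet
obs$id <- ave(as.integer(obs$ObservationValue == "Start"),
              obs$EncounterID, FUN = cumsum)
obs <- obs[obs$id > 0, ]  # drop Stops that precede any Start

cols <- c("EncounterID", "id", "ObservationDate")
starts <- obs[obs$ObservationValue == "Start", cols]
stops  <- obs[obs$ObservationValue == "Stop", cols]
# Keep only the first Stop after each Start
stops <- stops[!duplicated(stops[c("EncounterID", "id")]), ]
names(starts)[3] <- "Intubation"
names(stops)[3]  <- "Extubation"

# Left join, so a Start without a later Stop is kept with an NA duration
paired <- merge(starts, stops, by = c("EncounterID", "id"), all.x = TRUE)
paired$duration <- difftime(paired$Extubation, paired$Intubation, units = "days")
print(paired)
```

Whether the misentries should be detected after sorting by time, as here, or by the original entry order depends on how the data were recorded, so treat the sorting step as an assumption.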

Adding an extra column that gives a value for day of the week

I would like to add a column to my dataset that assigns each date a number based on the week it falls in.
So day 1 through day 7 would get a value of 1 in that column, day 8 through day 14 a value of 2, and so on.
What would be the best way to add that column?
dput(head(sdata0))
structure(list(date = structure(c(18628, 18629, 18630, 18631,
18632, 18633), class = "Date"), launches = c(-0.423325435196192,
-0.95406180171082, -0.95406180171082, -0.95406180171082, 0.107410931318437,
-0.423325435196192), pledged = c(-0.242997575062835, -0.300759417946595,
-0.300759417946595, -0.300759417946595, 0.120035260531115, -0.103075942164302
), backers = c(-0.124417670254619, -0.269239525943361, -0.269239525943361,
-0.269239525943361, 0.0620404689446357, -0.0918327527246523),
total_goal = c(-0.314834573033319, -0.33600837985916, -0.33600837985916,
-0.33600837985916, -0.205436571099805, -0.283073862794557
), mean_goal = c(-0.350195946618206, -0.422316295398803,
-0.422316295398803, -0.422316295398803, -0.199945219991962,
-0.24201542344731), US = c(0.179454667531907, -0.720497098001238,
-0.720497098001238, -0.720497098001238, 0.179454667531907,
-0.720497098001238), `number of success` = c(0.23782061224498,
-0.594551530612449, -0.594551530612449, -0.594551530612449,
1.07019275510241, 0.23782061224498), duration_days = c(-0.0399540270332042,
-1.6958261375219, -1.6958261375219, -1.6958261375219, 0.0152417099830856,
-0.0399540270332042), Twitter = c(-2.35635395414648, -1.37949565613006,
-2.47410026685382, -1.21813959797556, -0.995729896195041,
-1.226861547065), replies = c(-1.11872430995012, -0.454408610464075,
-1.06845177052955, -0.874543404193084, -1.24799655417443,
-0.906861465249162), likes = c(-0.812127568832484, -0.63113030668481,
-1.40968119485432, -1.1127549475184, -1.2106558412922, -1.22498280135666
), retweets = c(-0.606241425199139, -0.766152931679175, -1.64441036779204,
-1.39868247694445, -1.31077301003134, -1.3509601949059),
group_date = c("01", "01", "01", "01", "01", "01")), row.names = c(NA,
6L), class = "data.frame")
You can use the function week from lubridate like this:
library(dplyr)
library(lubridate)
sdata0 %>%
mutate(week_number = week(ymd(date)))
#> date launches pledged backers total_goal mean_goal US
#> 1 2021-01-01 -0.4233254 -0.2429976 -0.12441767 -0.3148346 -0.3501959 0.1794547
#> 2 2021-01-02 -0.9540618 -0.3007594 -0.26923953 -0.3360084 -0.4223163 -0.7204971
#> 3 2021-01-03 -0.9540618 -0.3007594 -0.26923953 -0.3360084 -0.4223163 -0.7204971
#> 4 2021-01-04 -0.9540618 -0.3007594 -0.26923953 -0.3360084 -0.4223163 -0.7204971
#> 5 2021-01-05 0.1074109 0.1200353 0.06204047 -0.2054366 -0.1999452 0.1794547
#> 6 2021-01-06 -0.4233254 -0.1030759 -0.09183275 -0.2830739 -0.2420154 -0.7204971
#> number of success duration_days Twitter replies likes retweets
#> 1 0.2378206 -0.03995403 -2.3563540 -1.1187243 -0.8121276 -0.6062414
#> 2 -0.5945515 -1.69582614 -1.3794957 -0.4544086 -0.6311303 -0.7661529
#> 3 -0.5945515 -1.69582614 -2.4741003 -1.0684518 -1.4096812 -1.6444104
#> 4 -0.5945515 -1.69582614 -1.2181396 -0.8745434 -1.1127549 -1.3986825
#> 5 1.0701928 0.01524171 -0.9957299 -1.2479966 -1.2106558 -1.3107730
#> 6 0.2378206 -0.03995403 -1.2268615 -0.9068615 -1.2249828 -1.3509602
#> group_date week_number
#> 1 01 1
#> 2 01 1
#> 3 01 1
#> 4 01 1
#> 5 01 1
#> 6 01 1
Created on 2022-07-30 by the reprex package (v2.0.1)
Base R approach without any dependencies:
sdata0["week_number"] <- sdata0["date"] |> format("%V")
sdata0["week_number"]
#> week_number
#> 1 53
#> 2 53
#> 3 53
#> 4 01
#> 5 01
#> 6 01
Also have a look at %U and %W in ?strptime if you need week numbers following US/UK conventions instead of ISO 8601.
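Both answers return calendar week numbers. If the goal is literally what the question describes (days 1 to 7 get 1, days 8 to 14 get 2, counted from the first date in the data rather than from the calendar), a minimal sketch is integer division of the day offset by 7. The dates vector below is a hypothetical stand-in for sdata0$date.

```r
# Hypothetical stand-in for sdata0$date: 15 consecutive days
dates <- as.Date("2021-01-01") + 0:14

# Days elapsed since the earliest date, grouped into blocks of 7, starting at 1
week_index <- as.integer(dates - min(dates)) %/% 7 + 1

print(data.frame(date = dates, week_index = week_index))
```

On the real data this would be something like sdata0$week_index <- as.integer(sdata0$date - min(sdata0$date)) %/% 7 + 1, assuming the date column is of class Date.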

Adding dates and times to event durations

As an addition to this question, is it possible to add when an event started and when it finished in another column(s)?
Here is a reproducible example pulled from the OP.
df <- structure(list(Time = structure(c(1463911500, 1463911800, 1463912100,
1463912400, 1463912700, 1463913000), class = c("POSIXct", "POSIXt"
), tzone = ""), Temp = c(20.043, 20.234, 6.329, 20.424, 20.615,
20.805)), row.names = c(NA, -6L), class = "data.frame")
> df
Time Temp
1 2016-05-22 12:05:00 20.043
2 2016-05-22 12:10:00 20.234
3 2016-05-22 12:15:00 6.329
4 2016-05-22 12:20:00 20.424
5 2016-05-22 12:25:00 20.615
6 2016-05-22 12:30:00 20.805
library(dplyr)
df %>%
# add id for different periods/events (rleid() comes from data.table, not dplyr)
mutate(tmp_Temp = Temp > 20, id = data.table::rleid(tmp_Temp)) %>%
# keep only periods with high temperature
filter(tmp_Temp) %>%
# for each period/event, get its duration
group_by(id) %>%
summarise(event_duration = difftime(last(Time), first(Time)))
id event_duration
<int> <time>
1 1 5 mins
2 3 10 mins
The desired output would have two more columns: "start_DateTime" and "end_DateTime".
Thanks!
Sure. Modify the final summarise() like this:
df %>%
# add id for different periods/events (rleid() comes from data.table, not dplyr)
mutate(tmp_Temp = Temp > 20, id = data.table::rleid(tmp_Temp)) %>%
# keep only periods with high temperature
filter(tmp_Temp) %>%
# for each period/event, get its duration
group_by(id) %>%
summarise(event_duration = difftime(last(Time), first(Time)),
start_DateTime = min(Time),
end_DateTime = max(Time))
#> # A tibble: 2 × 4
#> id event_duration start_DateTime end_DateTime
#> <int> <drtn> <dttm> <dttm>
#> 1 1 5 mins 2016-05-22 12:05:00 2016-05-22 12:10:00
#> 2 3 10 mins 2016-05-22 12:20:00 2016-05-22 12:30:00
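The run id is what makes this work: every change in the Temp > 20 flag starts a new run. As an illustration, the same idea can be sketched in base R with cumsum() over the change points. The names hot, run_id, and events are made up for this sketch, the time zone is fixed to UTC for reproducibility, and the rows are assumed to be sorted by Time.

```r
df <- data.frame(
  Time = as.POSIXct("2016-05-22 12:05:00", tz = "UTC") + 300 * 0:5,
  Temp = c(20.043, 20.234, 6.329, 20.424, 20.615, 20.805)
)

hot <- df$Temp > 20
# A new run starts at row 1 and wherever the flag differs from the previous row
run_id <- cumsum(c(TRUE, hot[-1] != hot[-length(hot)]))

# First and last row index of each high-temperature run
first_row <- tapply(which(hot), run_id[hot], min)
last_row  <- tapply(which(hot), run_id[hot], max)

events <- data.frame(
  id             = as.integer(names(first_row)),
  start_DateTime = df$Time[first_row],
  end_DateTime   = df$Time[last_row]
)
events$event_duration <- difftime(events$end_DateTime, events$start_DateTime,
                                  units = "mins")
print(events)
```

The ids come out as 1 and 3 because run 2 is the single low-temperature reading that gets filtered away.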

Creating R loop/apply function to iterate by unique participant ID and calculate midpoint between dates

My dataset looks something like this
dat <- data.frame(id=c(100,100,100,101,101,101,102,102,102,103,103,103),
visit = c(1,2,3,1,2,3,1,2,3,1,2,3),
visit.date = c("9/15/2020","11/29/2020","12/23/2020","9/7/2020","11/16/2020","12/9/2020","9/16/2020","12/6/2020","1/6/2021","10/4/2020","11/30/2020","12/23/2020"),
delivery.date = c(NA,NA,NA,NA,"11/2/2020",NA,NA,"11/21/2020",NA,NA,"11/15/2020",NA),
death = c(0,1,NA,0,0,0,0,0,1,0,0,1))
Essentially, I have three different visits for each participant with a unique ID. What I need to do is create a variable that states the date of death for each participant who reported a death (death=1). The date of death should be the midpoint between the visit date when the baby was reported dead and the last visit date that the baby was reported alive. Then, if the midpoint date is before the delivery date, I need that newly created variable column to list the delivery date as the day of death.
I've tried creating my own function and applying it using lapply as below, but I end up getting a separate report for each row that lists the participant ID, and the value is null. Here is the code I've tried. Ultimately, I will need to calculate person time at risk in days which is why I was trying to use difftime here. The dates are in POSIXct format as well to be compatible with the difftime function. Any help here would be very appreciated!
risktime <- function(id,dat) {
a<- difftime(dat$visit.date[max(dat$visit)],dat$delivery.date,units="days")[dat["id"]=="id"]
a}
risktime1 <- lapply(unique(dat$id),risktime,dat)
risktime1
I'm not sure why you use difftime; IMO you should use mean.Date instead, which already gives the midpoint between two dates. It works with the "Date" class, which we create first.
dat <- transform(dat, visit.date=as.Date(visit.date, '%m/%d/%Y'),
delivery.date=as.Date(delivery.date, '%m/%d/%Y'))
Simply wrap mean.Date in a function. There is some case handling involved: 1. whether there is a death at all; 2.a whether there is a delivery date that is later than the estimated (midpoint) death date, in which case the delivery date is used; 2.b otherwise the midpoint is used.
f <- \(x) {
if (any(x$death == 1)) {
last_alive <- with(x, which.max(cumsum(death == 0)))
first_dead <- with(x, which.max(cumsum(death == 1)))
u <- mean(c(x$visit.date[last_alive], x$visit.date[first_dead]))
dd <- x$delivery.date[!is.na(x$delivery.date)]
if (!length(dd) == 0) {
if (u < dd) {
x$est_death <- dd
} else {
x$est_death <- u
}
} else {
x$est_death <- u
}
} else {
x$est_death <- as.Date(NA_integer_)
}
return(x)
}
Finally, use the function in by().
by(dat, dat$id, f) |> do.call(what=rbind)
# id visit visit.date delivery.date death est_death
# 100.1 100 1 2020-09-15 <NA> 0 2020-10-22
# 100.2 100 2 2020-11-29 <NA> 1 2020-10-22
# 100.3 100 3 2020-12-23 <NA> NA 2020-10-22
# 101.4 101 1 2020-09-07 <NA> 0 <NA>
# 101.5 101 2 2020-11-16 2020-11-02 0 <NA>
# 101.6 101 3 2020-12-09 <NA> 0 <NA>
# 102.7 102 1 2020-09-16 <NA> 0 2020-12-21
# 102.8 102 2 2020-12-06 2020-11-21 0 2020-12-21
# 102.9 102 3 2021-01-06 <NA> 1 2020-12-21
# 103.10 103 1 2020-10-04 <NA> 0 2020-12-11
# 103.11 103 2 2020-11-30 2020-11-15 0 2020-12-11
# 103.12 103 3 2020-12-23 <NA> 1 2020-12-11
Data:
dat <- structure(list(id = c(100, 100, 100, 101, 101, 101, 102, 102,
102, 103, 103, 103), visit = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
2, 3), visit.date = structure(c(18520, 18595, 18619, 18512, 18582,
18605, 18521, 18602, 18633, 18539, 18596, 18619), class = "Date"),
delivery.date = structure(c(NA, NA, NA, NA, 18568, NA, NA,
18587, NA, NA, 18581, NA), class = "Date"), death = c(0,
1, NA, 0, 0, 0, 0, 0, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA,
-12L))
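To see the core rule in isolation, here is a small worked sketch for a single participant, using the dates of id 102 from the data above. The variable names are made up for illustration, and the second delivery date is hypothetical, included only to show the other branch.

```r
last_alive <- as.Date("2020-12-06")  # last visit with the baby reported alive
first_dead <- as.Date("2021-01-06")  # first visit with the baby reported dead
delivery   <- as.Date("2020-11-21")

# mean() on a Date vector gives the midpoint between the two dates
midpoint <- mean(c(last_alive, first_dead))

# The estimated death date must not fall before delivery
est_death <- max(midpoint, delivery)
format(est_death)

# Hypothetical later delivery date: here the delivery date wins
late_delivery  <- as.Date("2020-12-30")
est_death_late <- max(midpoint, late_delivery)
format(est_death_late)
```

The midpoint can carry a fractional day internally (here half a day), which is why the sketch compares formatted dates rather than raw values.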
I personally find dplyr to be more expressive with these kinds of grouped
operations than base R. Here’s how I would write this (using @jay.sf’s data):
library(dplyr, warn.conflicts = FALSE)
dat %>%
group_by(id) %>%
# Create a cumulative indicator for survival status
mutate(
dead = cummax(replace(death, is.na(death), 0))
) %>%
# Estimate date of death
summarise(
delivery.date = first(na.omit(delivery.date)),
last.alive = last(visit.date[dead == 0]),
first.dead = first(visit.date[dead == 1]),
death.date = mean(c(last.alive, first.dead))
) %>%
# Ensure estimated death date is not before delivery
mutate(
death.date = if_else(!is.na(delivery.date) & !is.na(death.date) & delivery.date > death.date, delivery.date, death.date)
)
#> # A tibble: 4 x 5
#> id delivery.date last.alive first.dead death.date
#> <dbl> <date> <date> <date> <date>
#> 1 100 NA 2020-09-15 2020-11-29 2020-10-22
#> 2 101 2020-11-02 2020-12-09 NA NA
#> 3 102 2020-11-21 2020-12-06 2021-01-06 2020-12-21
#> 4 103 2020-11-15 2020-11-30 2020-12-23 2020-12-11

Find if a date overlaps between multiple pairs of vectorised dates

I'm trying to find whether a date exists between multiple pairs of dates which are wide in my dataset - the length I've given here is just an example, the eventual number may be larger or smaller. Not sure if this is the most sensible option but working longwise didn't seem to work, this is also a very common way to work with overlapping dates and date pairs in SPSS, where you can have multiple variables numerised as the dates are here and it works through each numbered 'set' to give you a response.
Here is an example dataset:
person key_date 1_end_date 2_end_date 3_end_date 4_end_date 1_start_date 2_start_date 3_start_date 4_start_date
1 1 2019-09-30 2019-05-23 2019-09-30 2016-07-22 <NA> 2019-05-23 2019-09-30 2016-07-22 <NA>
2 2 2019-06-07 2019-05-16 2019-06-07 <NA> <NA> 2019-05-16 <NA> <NA> <NA>
3 3 2020-03-09 2016-06-02 2019-08-09 2020-05-27 2020-02-12 2016-06-02 2019-08-09 2020-05-27 2020-03-09
test <- structure(list(person = 1:3, key_date = structure(c(18169, 18054,18330), class = "Date"), `1_end_date` = structure(c(18039, 18032,16954), class = "Date"), `2_end_date` = structure(c(18169, 18054,18117), class = "Date"), `3_end_date` = structure(c(17004, NA,18409), class = "Date"), `4_end_date` = structure(c(NA, NA, 18304), class = "Date"), `1_start_date` = structure(c(18039, 18032,16954), class = "Date"), `2_start_date` = structure(c(18169,NA, 18117), class = "Date"), `3_start_date` = structure(c(17004,NA, 18409), class = "Date"), `4_start_date` = structure(c(NA,NA, 18330), class = "Date")), row.names = c(NA, 3L), class = "data.frame")
The expected output would be just a binary flag to indicate that the key_date exists between any pair of start_date and end_date. In the example given, that would mean person 1 and 3. Any ideas how to do this? Is this really inefficient?
tidyverse approach
library(tidyverse)
result <- test %>% mutate(across(ends_with("end_date"), ~
key_date <= . & key_date >= get(str_replace(cur_column(), "end", "start")),
.names = '{.col}_flag')) %>%
rowwise() %>%
mutate(Flag1 = sum(c_across(ends_with("flag")), na.rm = TRUE)) %>%
ungroup() %>%
select(-ends_with("flag"))
> result$Flag1
[1] 1 0 0
Complete output will look like
> result
# A tibble: 3 x 11
person key_date `1_end_date` `2_end_date` `3_end_date` `4_end_date` `1_start_date` `2_start_date` `3_start_date` `4_start_date` Flag1
<int> <date> <date> <date> <date> <date> <date> <date> <date> <date> <dbl>
1 1 2019-09-30 2019-05-23 2019-09-30 2016-07-22 NA 2019-05-23 2019-09-30 2016-07-22 NA 1
2 2 2019-06-07 2019-05-16 2019-06-07 NA NA 2019-05-16 NA NA NA 0
3 3 2020-03-09 2016-06-02 2019-08-09 2020-05-27 2020-02-12 2016-06-02 2019-08-09 2020-05-27 2020-03-09 0
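The same wide-format check can also be sketched in base R by looping over the numeric prefixes and building one logical column per pair. The data below are hypothetical (two pairs per person, with start dates before end dates) and are not the dataset from the question; the names d, pairs, inside, and flag are likewise made up for the sketch.

```r
# Hypothetical wide data: two start/end pairs per person
d <- data.frame(
  key_date       = as.Date(c("2019-09-30", "2019-06-07", "2020-03-09")),
  `1_start_date` = as.Date(c("2019-05-01", "2019-05-16", "2016-06-02")),
  `1_end_date`   = as.Date(c("2019-05-23", "2019-05-20", "2016-07-01")),
  `2_start_date` = as.Date(c("2019-09-01", NA, "2020-03-01")),
  `2_end_date`   = as.Date(c("2019-10-15", "2019-06-07", "2020-03-31")),
  check.names = FALSE
)

# Pair prefixes ("1", "2", ...) taken from the start-date column names
pairs <- sub("_start_date$", "", grep("_start_date$", names(d), value = TRUE))

# One logical column per pair: is key_date inside [start, end]?
inside <- sapply(pairs, function(p) {
  d$key_date >= d[[paste0(p, "_start_date")]] &
    d$key_date <= d[[paste0(p, "_end_date")]]
})

# Flag rows where any pair contains key_date; incomplete pairs (NA) count as no match
d$flag <- as.integer(rowSums(inside, na.rm = TRUE) > 0)
print(d[c("key_date", "flag")])
```

Because the pairs are discovered from the column names, the sketch does not depend on how many numbered pairs the real data contain.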

Resources