I'm trying to reproduce this example which adjusts stock prices for dividends using the tidyquant framework.
Here is the original example:
library(quantmod)
library(tidyquant)
library(timetk)
SPY.Close <- Cl(getSymbols("SPY", auto.assign=FALSE))
SPY.Div <- getDividends("SPY", auto.assign=FALSE)
# Within xts framework
SPY <- merge(SPY.Close, SPY.Div)
# now adjust close for dividends
ratios <- adjRatios(dividends=SPY[,"SPY.div"], close=SPY[,"SPY.Close"])
SPY$SPY.Adjusted <- (ratios$Split * ratios$Div) * SPY$SPY.Close
# only keep dates from the original object
SPY <- SPY[index(SPY.Close), ]
Here is my attempt to do this in tidyquant:
#convert xts to tibble
spy.tbl <- tk_tbl(merge(SPY.Close, SPY.Div), preserve_index = TRUE)
#add a splits placeholder because adjRatios() complains if it's not there.
spy.tbl$SPY.splits <- 0
spy.adj <- spy.tbl %>%
  tq_mutate(
    select = c(index, SPY.Close, SPY.div, SPY.splits),
    mutate_fun = adjRatios,
    splits = SPY.splits,
    dividends = SPY.div,
    close = SPY.Close
  )
but this gives the error:
Error in fun_transmute(., ...) : unused argument (.)
I've tried various combinations of arguments, but I can't seem to make it work.
Just in case anyone searches for this topic: I've solved my own problem with the code below. The benefit is that it stays in the tidyverse framework and is easily expanded to many tickers with group_by(ticker).
data is a data frame with Close and previous close:
Date ticker Close Cl.prev
1 2017-08-14 SPY_US 246.54 244.12
2 2017-08-15 SPY_US 246.51 246.54
3 2017-08-16 SPY_US 246.94 246.51
4 2017-08-17 SPY_US 243.09 246.94
5 2017-08-18 SPY_US 242.71 243.09
6 2017-08-21 SPY_US 242.90 242.71
div.data is a tibble with only the dividend payments; Date here is the ex-dividend date.
ticker Date div
2 SPY_US 2017-09-15 1.234574
3 SPY_US 2017-12-15 1.351333
4 SPY_US 2018-03-16 1.096775
5 SPY_US 2018-06-15 1.245568
This chain merges the price data into div.data in order to get the prices needed to calculate the adjustment ratio:
div.data <- div.data %>%
  left_join(., data[, c("Date", "ticker", "Close", "Cl.prev")], by = c("ticker", "Date"))
This chain calculates the adjRatio:
div.data <- div.data %>%
  mutate(ratio = 1 - div / Cl.prev) %>%
  mutate(adjRatio = rev(cumprod(rev(ratio)))) %>%
  select(-Close, -Cl.prev, -ratio)
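To see what the reversed cumprod is doing, here is a toy check with made-up ratios; each element ends up as the product of its own ratio and every later ratio, i.e. the factor applied to closes before that ex-dividend date:
ratio <- c(0.995, 0.994, 0.996)  # made-up per-dividend adjustment factors
rev(cumprod(rev(ratio)))
[1] 0.9850739 0.9900240 0.9960000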
This chain merges the div.data back into the price series, propagates the adjRatio and calculates the Adjusted Close:
library(zoo)  # for na.locf() and na.fill()
data.adj <- data %>%
  left_join(., div.data, by = c("ticker", "Date")) %>%
  mutate(adjRatio = dplyr::lead(adjRatio, n = 1)) %>%
  mutate(adjRatio = na.locf(adjRatio, fromLast = TRUE, na.rm = FALSE)) %>%
  mutate(adjRatio = na.fill(adjRatio, fill = 1.0)) %>%
  mutate(Cl.adj = Close * adjRatio) %>%
  select(-Cl.prev, -div, -adjRatio)
Here is the final data:
> head(data.adj)
Date ticker Close Cl.adj
1 2017-08-14 SPY_US 246.54 242.0153
2 2017-08-15 SPY_US 246.51 241.9858
3 2017-08-16 SPY_US 246.94 242.4079
4 2017-08-17 SPY_US 243.09 238.6286
5 2017-08-18 SPY_US 242.71 238.2556
6 2017-08-21 SPY_US 242.90 238.4421
At the moment, there are only two forms, tq_mutate() and tq_mutate_xy(). The adjRatios() function has three inputs, which would require an x, y, and z.
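One possible workaround (just a sketch, not an official tidyquant pattern) is to skip tq_mutate() altogether: rebuild an xts from the tibble with tk_xts(), call adjRatios() directly, and bind the ratios back. Column names are as in the example above.
# sketch: bypass tq_mutate() and call adjRatios() directly
library(dplyr)
library(timetk)
spy.xts <- tk_xts(spy.tbl, date_var = index)
ratios <- adjRatios(dividends = spy.xts[, "SPY.div"], close = spy.xts[, "SPY.Close"])
spy.adj <- spy.tbl %>%
  bind_cols(tk_tbl(ratios, preserve_index = FALSE)) %>%  # adds the Split and Div columns
  mutate(SPY.Adjusted = Split * Div * SPY.Close)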
I’m working with a dataset that contains GPS locations for a small group of polar bears. For every bear, there should theoretically be one location every 4 hours, but unfortunately the radio collars don’t always work perfectly and there are gaps in my data.
My goal is to produce a csv that subsets the maximum number of locations between gaps for each bear.
For example, if a bear’s data is composed of 100 locations, then has one gap, and then 50 locations, I only want to subset the first 100 locations in the final csv.
Here is a code to generate the kind of dataset I would use:
bears <- as.character(c(rep("bear1", times = 5), rep("bear2", times = 5)))
time <- c("2007-09-08 13:00:00", "NA", "2007-09-08 21:00:00", "2007-09-09 1:00:00", "NA",
          "NA", "2007-10-09 17:00:00", "2007-10-09 1:00:00", "NA", "2007-10-09 9:00:00")
bear.data <- data.frame(bears, time)
Where:
bears refers to the individual bear.
time refers to the time at which a particular location is transmitted. When the collar fails to transmit a GPS location, this column has a value of NA.
Any help would be appreciated!!
library(dplyr)
library(lubridate)   # ymd_hms(), floor_date()
library(data.table)  # rleid()

bear.data <- data.frame(bears, time) %>%
  mutate(time = ymd_hms(time),
         helper = floor_date(time, unit = "year"),
         seq = rleid(helper)) %>%   # a new run id starts at every NA block
  filter(!is.na(helper)) %>%
  group_by(bears, seq) %>%
  add_tally() %>%
  ungroup() %>%
  group_by(bears) %>%
  slice_max(n)
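For reference, rleid() assigns a new id whenever the value changes, so every block of NAs starts a new run (illustrative):
data.table::rleid(c(2007, 2007, NA, NA, 2007))
[1] 1 1 2 2 3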
The problem can be thought of as finding the maximum length of blocks of boolean values per group:
bear.data$time <- as.Date(bear.data$time)
bear.data$not_na <- !is.na(bear.data$time)
bear.data$gap <- ave(bear.data$not_na, cumsum(!bear.data$not_na), FUN = cumsum)
aggregate(gap ~ bears, FUN = max, data=bear.data)
Output
> aggregate(gap ~ bears, FUN = max, data=bear.data)
bears gap
1 bear1 2
2 bear2 3
Data
bears time
1 bear1 2007-09-08 13:00:00
2 bear1 NA
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
5 bear1 NA
6 bear2 NA
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 17:00:00
9 bear2 2007-10-09 1:00:00
10 bear2 NA
11 bear2 2007-10-09 9:00:00
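To see what the cumsum trick computes, here is the trace for bear1 (rows 1-5 of the data above; they form a prefix of the frame, so the ungrouped cumsum matches). The per-bear maximum of the running count is the length of the longest unbroken block:
not_na <- c(TRUE, FALSE, TRUE, TRUE, FALSE)  # !is.na(time) for bear1
cumsum(!not_na)                              # block id, bumped at each NA
[1] 0 1 1 1 2
ave(not_na, cumsum(!not_na), FUN = cumsum)   # running count within each block
[1] 1 0 1 2 0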
You can create a function that returns the row positions of the longest non-NA sequence for a bear. This function is based on rle() and is.na():
seq_max <- function(x) {
  r <- rle(!is.na(x))                               # runs of non-NA / NA values
  rd <- as.data.frame(unclass(r))
  rd$ends <- cumsum(rd$lengths)                     # last position of each run
  rd$starts <- c(1, rd$ends[-length(rd$ends)] + 1)  # first position of each run
  rd <- rd[rd$values, ]                             # keep the non-NA runs
  rd <- rd[which.max(rd$lengths)[1], ]              # longest one (first on ties)
  seq(rd$starts, rd$ends)                           # its row positions
}
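A quick check on a toy vector, where the longest non-NA run occupies positions 3 to 6:
seq_max(c(1, NA, 2, 3, 4, 5, NA))
[1] 3 4 5 6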
Then you apply it to each bear. This is very convenient with dplyr:
library(dplyr)
bear.data %>%
  group_by(bears) %>%
  slice(seq_max(time))
If you were to do this in base R, first write a Mode function (it returns the most frequently occurring element):
Mode <- function(x){
  y <- unique(x)
  y[which.max(tabulate(match(x, y)))]
}
Now write a logical function that flags the most frequent run id, i.e. the longest run:
max_ids <- function(x){
  id <- with(rle(x), rep(seq_along(values), lengths))
  id == Mode(id)  # uses the Mode function above
}
Use the two functions as follows:
subset(bear.data, ave(is.na(as.Date(time)), bears, FUN = max_ids))
bears time
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 1:00:00
My two data frames are accessible here and here and I have been trying to follow this previous post.
I would like to populate wombat$rainfall_lag_2wk with the sum of the rainfall records for the previous two weeks (14 days); this data is available in rain. I tried to do this in a number of ways before I found the above post. Most recently I have tried to follow the above post, but I get the error below.
Any help would be greatly appreciated. I am happy with any solution, whether it follows the same structure as the above post or not.
Thanks in advance
# Load data
wombat <- read.csv("wombat.csv", header = TRUE)
rain <- read.csv("rain.csv", header = TRUE)
# Define dates
wombat$date <- as.Date(wombat$date, "%Y-%m-%d")
rain$Date <- as.Date(rain$Date, "%Y-%m-%d")
# Calculate rainfall for previous two weeks following above link
wombat$start_date <- rep_len("01/01/1970", nrow(wombat))
wombat$start_date <- as.Date(wombat$start_date, "%m/%d/%Y")
wombat$diff_days <- as.numeric(difftime(wombat$date, wombat$start_date, units = "days"))
rain$start_date <- rep_len("01/01/1970", nrow(rain))
rain$start_date <- as.Date(rain$start_date, "%m/%d/%Y")
rain$diff_days <- as.numeric(difftime(rain$Date, rain$start_date, units = "days"))
for (i in 1:length(wombat$diffdays)) {
  day = wombat$diffdays[i]
  rainday = pmatch(day, rain$diffdays, dup = FALSE)
  wombat$rainfall_lag_2wk[i] = sum(rain$Rainfall.amount..millimetres.[(rainday-14):(rainday-1)]) # 14 days
}
Error after running the above:
Error in (rainday - 14):(rainday - 1) : argument of length 0
I'm not sure what your final data should look like, so I'm assuming that you want to see the cumulative rainfall for the previous 14 days in the wombat data.
Here's a solution using the tidyverse and zoo packages.
library(tidyverse)
library(zoo)
rain <- read_csv("rain.csv") %>%
  select(-X1)

wombat <- read_csv("wombat.csv") %>%
  select(-X1) %>%
  distinct()

rain_wombat <- left_join(rain, wombat, by = c("Date" = "date"))

rain_wombat <- rain_wombat %>%
  mutate(rainfall_lag_2wk = as.numeric(rainfall_lag_2wk)) %>%
  rename(rainfall = `Rainfall.amount..millimetres.`) %>%
  replace(is.na(.), 0) %>%
  mutate(rainfall_lag_2wk = round(rollsumr(rainfall, k = 14, fill = NA), 2),
         rainfall_lag_2wk = lag(rainfall_lag_2wk)) %>%
  filter(Date >= min(wombat$date) & Date <= max(wombat$date))
This gives you data like:
Date rainfall rainfall_lag_2wk
<date> <dbl> <dbl>
1 2008-04-25 0 2.4
2 2008-04-26 0 2.4
3 2008-04-27 4.4 0
4 2008-04-28 0.4 4.4
5 2008-04-29 0 4.8
6 2008-04-30 0 4.8
7 2008-05-01 3.4 4.8
8 2008-05-02 0 8.2
9 2008-05-03 0 8.2
10 2008-05-04 0 8.2
11 2008-05-05 0 8.2
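For intuition, here is how rollsumr() plus lag() produce a trailing sum that excludes the current day, shown with k = 3 on a toy vector:
library(zoo)
x <- c(1, 2, 3, 4, 5)
rollsumr(x, k = 3, fill = NA)              # current day plus the previous two
[1] NA NA  6  9 12
dplyr::lag(rollsumr(x, k = 3, fill = NA))  # shifted so the current day drops out
[1] NA NA NA  6  9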
Thanks Matt for your answer, which helped me reach the solution below.
Below is the solution I have used, partly adapted from here.
# load libraries
library(tidyverse)
library(lubridate)
library(dplyr)
# Load data
wombat <- read.csv("wombat.csv", header = TRUE)
rain <- read.csv("rain.csv", header = TRUE)
# Define dates
wombat$date <- as.Date(wombat$date, "%Y-%m-%d")
rain$Date <- as.Date(rain$Date, "%Y-%m-%d")
# Calculate rainfall for previous two weeks
rain$rainfall_lag_2wk <- rain$Rainfall.amount..millimetres.
rain <- rain %>%
  mutate(rainfall_lag_2wk = map_dbl(1:n(), ~ sum(
    Rainfall.amount..millimetres.[(Date >= (Date[.] - days(14))) & (Date < Date[.])],
    na.rm = TRUE)))
wombat <- inner_join(wombat, rain, by = c("date" = "Date"))
wombat <- dplyr::select(wombat, date, rainfall_lag_2wk.y)
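As a sanity check of the window condition (Date >= Date[.] - days(14)) & (Date < Date[.]), here it is on a tiny made-up series; day one has no history, and each later day sums everything strictly before it (all within 14 days here):
library(lubridate)
library(purrr)
d <- data.frame(Date = as.Date("2020-01-01") + 0:4, rain = 1:5)
map_dbl(1:nrow(d), ~ sum(d$rain[(d$Date >= (d$Date[.] - days(14))) & (d$Date < d$Date[.])], na.rm = TRUE))
[1] 0 1 3 6 10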
I'm fairly new to R and need some help.
I have two data frames with rather similar information. The first data frame has information about misconnections for an airline, whereas the other one is the entire timetable for the same airline. What I need is to make a new column in the misconnection data.frame containing flights from the timetable that can replace the delayed flights at the transit airport.
The flights that I want to replace need to meet a range of conditions (within a certain time horizon, on the same weekday, and flying to the same destination). In addition, I want R to choose the flight that is closest (by time) to the new arrival time at the transit (from the misconnection data.frame).
The misconnection data.frame looks like the following (1620 lines in total):
miscon <- data.frame(flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29"), format="%Y-%m-%d"),
Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),
arr.time = as.POSIXct(c("19:25:00", "20:52:00", "07:33:00", "18:49:00"), format="%H:%M:%S"),
next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"), format="%H:%M:%S"),
weekday = c("4", "7", "7", "3"))
view(miscon)
flight.date Outbound.airport arr.time next.pos.dep Weekday
1 2019-08-05 MXP 19:25:00 19:36:00 4
2 2019-10-03 KRK 20:52:00 21:17:00 7
3 2019-07-21 KLU 07:33:00 07:58:00 7
4 2019-05-29 OTP 18:49:00 19:14:00 3
And the timetable data.frame would look like this:
tt <- data.frame(start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29"), format="%Y-%m-%d"),
end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01"), format="%Y-%m-%d"),
weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),
dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"), format="%H:%M:%S"))
view(tt)
start.date end.date Weekday Outbound.airport dep.time
1 2019-03-25 2019-10-21 1234567 KLU 12:20:00
2 2019-05-02 2019-10-27 1.3..67 KLU 15:55:00
3 2019-07-30 2019-08-26 1.34567 MXP 19:55:00
4 2019-03-30 2019-06-01 ..3.5.. OTP 20:34:00
In Excel, this problem is solved using index matching, which I've managed. However, the problem is slightly too big for Excel to handle, which is why I need to convert this to R. I did try the match and mutate functions in R, but it seems the values I'm matching must be equal, which I do not expect mine to be.
I also found an interesting solution to a similar problem using the DescTools package, which I tried to implement with no success.
get_close2 <- function(xx = tt, yy = miscon) {
  pos <- vector(mode = "numeric")
  for (i in 1:dim(yy)[1]) {
    pos[i] <- DescTools::Closest(xx$dep.time, yy$next.pos.dep[i])
    #print(pos[i])
    yy$new.flight[i] <- pos[i]
  }
  out <- yy
  return(out)
}
get_close2()
For this one, I tried with only one condition. It generated a column, but with NAs only. Obviously, I am far off right now, which is why I'm reaching out for help. I hope the problem is clear. The end result would preferably look something like the following:
miscon
flight.date Outbound.airport arr.time next.pos.dep Weekday new.flight.time
1 2019-12-05 MXP 19:25:00 19:36:00 4 19:55:00
2 2019-10-03 KRK 20:52:00 21:17:00 7 NA
3 2019-07-21 KLU 07:33:00 07:58:00 7 12:20:00
4 2019-05-29 OTP 18:49:00 19:14:00 3 20:34:00
I think you can do it as follows. First, I would rearrange the weekday column so that you have one row for each weekday on which a flight operates:
library(data.table)
library(dplyr)
library(tidyr)
tt <- tt %>%
  separate(weekday, into = as.character(1:7), sep = 1:6) %>%
  gather(key = "key", value = "weekday", -c(start.date, end.date, Outbound.airport, dep.time)) %>%
  filter(weekday %in% 1:7) %>%
  select(-key)
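To see what this reshape does, here is the split on a single toy pattern:
demo <- data.frame(Outbound.airport = "KLU", weekday = "1.3..67")
demo %>%
  separate(weekday, into = as.character(1:7), sep = 1:6) %>%
  gather(key = "key", value = "weekday", -Outbound.airport) %>%
  filter(weekday %in% 1:7) %>%
  select(-key)
  Outbound.airport weekday
1              KLU       1
2              KLU       3
3              KLU       6
4              KLU       7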
Then I would do a left join of miscon and tt on the airport and weekday.
tt <- data.table(tt)
miscon <- data.table(miscon)
setkey(miscon, Outbound.airport, weekday)
setkey(tt, Outbound.airport, weekday)
df <- tt[miscon]
Check that the flight date falls inside the timetable's validity window:
df = df[flight.date>=start.date & flight.date<=end.date]
Now you have a data.frame of all possible connections. The only thing left is to find the minimum time between the flights for each connection.
df[, timediff := dep.time - arr.time, by = .(weekday, Outbound.airport)]
Now you can filter the rows by the minimum time delay (timediff):
df = df[ , .SD[which.min(timediff)], by=.(weekday, Outbound.airport, flight.date, arr.time, next.pos.dep)]
setnames(df, "dep.time", "new.flight.time")
> df
weekday Outbound.airport flight.date arr.time next.pos.dep start.date end.date new.flight.time timediff
1: 7 KLU 2019-07-21 2020-04-27 07:33:00 2020-04-27 07:58:00 2019-03-25 2019-10-21 2020-04-27 12:20:00 17220 secs
2: 4 MXP 2019-08-05 2020-04-27 19:25:00 2020-04-27 19:36:00 2019-07-30 2019-08-26 2020-04-27 19:55:00 1800 secs
3: 3 OTP 2019-05-29 2020-04-27 18:49:00 2020-04-27 19:14:00 2019-05-29 2019-06-01 2020-04-27 20:34:00 6300 secs
The solution is a bit of a mix of dplyr and data.table.
OK, it's not pretty, but you have a fairly complex issue, and it's not fully clear to me whether this gives you what you are looking for. You will need to check it on a larger dataset than the small example you provide to be sure.
# setup
library(data.table)
setDT(tt)
setDT(miscon)
# make tt long format, splitting weekdays out
tt <- melt(
  tt[, paste("V", 1:7, sep = "") := tstrsplit(weekday, "")][, -"weekday"],
  measure.vars = paste("V", 1:7, sep = "")
)[value != "."][, c("weekday", "value", "variable") := .(value, NULL, NULL)]
# join, calculate time difference, convert format of times, rank on new.dep.time within group, and filter
newDT <- miscon[tt, on = c("Outbound.airport", "weekday"), nomatch = 0][
  , new.dep.time := as.numeric(dep.time - arr.time)][
  , c("arr.time", "dep.time", "next.pos.dep") := .(format(arr.time, "%H:%M"), format(dep.time, "%H:%M"), format(next.pos.dep, "%H:%M"))][
  , new.dep.rank := rank(new.dep.time), by = c("Outbound.airport", "weekday")][
  new.dep.rank == 1, -c("new.dep.rank", "new.dep.time")]
I have a data frame with dates and numbers called 'df'. I have another data frame with start and end dates called 'date_ranges'.
My goal is to filter/subset df so that it only shows for the start/end dates in each row of the date_ranges column. Here is my code so far:
df_date <- as.Date((as.Date('2010-01-01'):as.Date('2010-04-30')))
df_numbers <- c(1:120)
df <- data.frame(df_date, df_numbers)
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
date_ranges <- data.frame(start_dates, end_dates)
# Attempting to filter df by start and end dates
for (i in range(date_ranges$start_dates)) {
  for (j in range(date_ranges$end_dates)) {
    print(
      df %>%
        filter(between(df_date, i, j)))
  }
}
The first and third results of the nested for loop are what I want, but not the second. The first and third give me the dates and values for df between their respective rows, but the second result is the range from the earliest date to the latest date. How can I fix this loop to exclude the second result?
A tidyverse approach could be to create a sequence between start_dates and end_dates and join with df to keep only the dates which lie in the range.
library(dplyr)
date_ranges %>%
  mutate(df_date = purrr::map2(start_dates, end_dates, seq, "day")) %>%
  tidyr::unnest(df_date) %>%
  select(-start_dates, -end_dates) %>%
  left_join(df, by = 'df_date')
# A tibble: 39 x 2
# df_date df_numbers
# <date> <int>
# 1 2010-01-06 6
# 2 2010-01-07 7
# 3 2010-01-08 8
# 4 2010-01-09 9
# 5 2010-01-10 10
# 6 2010-01-11 11
# 7 2010-01-12 12
# 8 2010-01-13 13
# 9 2010-01-14 14
#10 2010-01-15 15
# … with 29 more rows
You can try looping through the index:
for (i in seq_along(date_ranges$start_dates)) {
  print(
    df %>%
      filter(between(df_date, date_ranges$start_dates[i], date_ranges$end_dates[i])))
}
Base R solution:
# Your data creation can be simplified:
df <- data.frame(df_date = seq.Date(as.Date('2010-01-01', "%Y-%m-%d"), as.Date('2010-04-30', "%Y-%m-%d"),
by = 1), df_numbers = c(1:120))
# Store start and end date vectors to filter the data.frame:
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
# Subset the data to extract records with matching dates: df => stdout (console)
df[df$df_date %in% c(start_dates, end_dates),]
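Note that this %in% subset keeps only rows whose date exactly equals one of the start or end dates. If the goal is the full ranges in base R, one option (a sketch) is to OR the interval tests together:
in_any_range <- Reduce(`|`, Map(function(s, e) df$df_date >= s & df$df_date <= e, start_dates, end_dates))
df[in_any_range, ]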
I need to check my data for anomalies with the "anomalize" function.
First I loaded some libraries:
library(tidyverse)
library(anomalize)
library(dplyr)
library(zoo)
library(ggplot2)
library(forecast)
library(anytime)
Then I deleted all the columns that I do not need for this task:
trash1 <- ASD[, -2]
trash2 <- trash1[,-2]
trash3 <- trash2[,-2]
trash4 <- trash3[,-2]
trash5 <- trash4[,-2]
trash6 <- trash5[,-2]
trash7 <- trash6[,-4]
trash8 <- trash7[,-4]
view(trash8)
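For reference, the eight steps above drop original columns 2 through 7, 10, and 11, so the same result can be had in one subset:
trash8 <- ASD[, -c(2:7, 10, 11)]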
Change class from Factor to Date:
trash8$DMY <- as.Date(trash8$DMY, format="%d.%m.%y")
Then I tried to anomalize it:
trash_tbl <- as_tibble(trash8)
trash_tbl %>%
  time_decompose(Qp) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
As a result I get this error:
Converting from tbl_df to tbl_time.
Auto-index message: index = DMY
Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
Error: Only year, quarter, month, week, and day periods are allowed for an index of class Date
Please help me with this, or tell me what I can read to solve the problem.
This is my data. DMY is a Date, MCC is a factor, Art is numeric, Qp is numeric, Ql is a factor:
  DMY        MCC       Art   Qp    Ql
1 2016-01-01 UA0000468 1801 3520 440
2 2016-01-01 UA0000468 3102 3024 604,8
3 2016-01-01 UA0000468 4419 270 521,1
4 2016-01-01 UA0000468 5537 1080 2084,4
5 2016-01-03 UA0010557 3528 180 36
6 2016-01-03 UA0010557 3529 198 39,6
...
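For what it's worth, the messages point at the likely causes: the index is not in ascending order, and with several rows per DMY the inferred period drops below one day. A hedged sketch that sorts and aggregates to one row per date before decomposing (assuming summing Qp per day is acceptable):
trash_tbl %>%
  group_by(DMY) %>%
  summarise(Qp = sum(Qp)) %>%  # one row per date
  arrange(DMY) %>%             # tibbletime expects an ascending index
  time_decompose(Qp) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)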