I have the following data:
# A tibble: 7,971 x 10
symbol date open high low close volume adjusted start_date end_date
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <date>
1 AAPL 2009-01-02 12.3 13.0 12.2 13.0 186503800 11.4 2009-07-31 2010-06-30
2 AAPL 2009-01-05 13.3 13.7 13.2 13.5 295402100 11.8 2009-07-31 2010-06-30
3 AAPL 2009-01-06 13.7 13.9 13.2 13.3 322327600 11.6 2009-07-31 2010-06-30
4 AAPL 2009-01-07 13.1 13.2 12.9 13.0 188262200 11.4 2009-07-31 2010-06-30
5 AAPL 2009-01-08 12.9 13.3 12.9 13.2 168375200 11.6 2009-07-31 2010-06-30
6 AAPL 2009-01-09 13.3 13.3 12.9 12.9 136711400 11.3 2009-07-31 2010-06-30
7 AAPL 2009-01-12 12.9 13.0 12.5 12.7 154429100 11.1 2009-07-31 2010-06-30
8 AAPL 2009-01-13 12.6 12.8 12.3 12.5 199599400 11.0 2009-07-31 2010-06-30
9 AAPL 2009-01-14 12.3 12.5 12.1 12.2 255416000 10.7 2009-07-31 2010-06-30
10 AAPL 2009-01-15 11.5 12.0 11.4 11.9 457908500 10.4 2009-07-31 2010-06-30
I am trying to group by symbol, start date and end date and take the difference between the first observation on the start date and the last observation on the end date. I just can't seem to get it working.
That is, take the difference between the "close" on the start date and the "close" on the end date.
Any help would be great, thanks!
syms <- c("AAPL", "MSFT", "GOOG")
library(tidyquant)
data <- tq_get(syms)
data <- data %>%
  mutate(
    # note: this is the start_date from which we calculate the returns - we will
    # have bought this portfolio on the 1st July but we get returns on the 31st
    start_date = paste(year(date %m+% months(6)), "07", "31", sep = "-"),
    end_date   = paste(year(date %m+% months(18)), "06", "30", sep = "-"),
    start_date = as.Date(start_date),
    end_date   = as.Date(end_date)
  )
My attempt...
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = diff(close))
EDIT:
I am trying to group by symbol and then take the difference between the values on the start_date and end_date. So first, I should group by symbol and filter the date column down to the start_date and end_date values, i.e. I am only interested in the "close" price on the start_date and end_date days (which are fixed). Then I just take the difference between the close price on those two days. Most of the stock price data is useless here; I only need the close on the start_date and the end_date, and the difference between those two values.
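Conceptually, something like this (a rough sketch of what I mean):

data %>%
  group_by(symbol, start_date, end_date) %>%
  summarise(diff = close[date == end_date] - close[date == start_date])

though this drops (or errors on) groups whose start or end date is not an actual trading day.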
I think what you are looking for is to subtract the first and last close values for each group:
library(dplyr)
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = first(close) - last(close))
# symbol start_date end_date diff
# <chr> <date> <date> <dbl>
# 1 AAPL 2009-07-31 2010-06-30 -7.38
# 2 AAPL 2010-07-31 2011-06-30 -15.5
# 3 AAPL 2011-07-31 2012-06-30 -12.5
# 4 AAPL 2012-07-31 2013-06-30 -34.4
# 5 AAPL 2013-07-31 2014-06-30 28.0
# 6 AAPL 2014-07-31 2015-06-30 -34.5
# 7 AAPL 2015-07-31 2016-06-30 -31.9
# 8 AAPL 2016-07-31 2017-06-30 31
# 9 AAPL 2017-07-31 2018-06-30 -48.1
#10 AAPL 2018-07-31 2019-06-30 -41.6
# … with 26 more rows
Another way to write it could be
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = close[1L] - close[n()])
Or it can also be done using base R's aggregate:
aggregate(close ~ symbol + start_date + end_date, data, function(x) x[1L] - x[length(x)])
You can take this approach...
# create df of unique symbol, start, and end date combos
# (df here is the original data, called `data` above)
df1 <- df %>% distinct(symbol, start_date, end_date)
# join original data that match the desired start/end dates
df1 <- df %>% select(start_close = close, symbol, start_date = date) %>% left_join(df1, .)
df1 <- df %>% select(end_close = close, symbol, end_date = date) %>% left_join(df1, .)
# find difference in close values
df1 %>% mutate(diff = end_close - start_close)
# A tibble: 36 x 6
symbol start_date end_date start_close end_close diff
<chr> <date> <date> <dbl> <dbl> <dbl>
1 AAPL 2009-07-31 2010-06-30 23.3 35.9 12.6
2 AAPL 2010-07-31 2011-06-30 NA 48.0 NA
3 AAPL 2011-07-31 2012-06-30 NA NA NA
4 AAPL 2012-07-31 2013-06-30 87.3 NA NA
5 AAPL 2013-07-31 2014-06-30 64.6 92.9 28.3
6 AAPL 2014-07-31 2015-06-30 95.6 125. 29.8
7 AAPL 2015-07-31 2016-06-30 121. 95.6 -25.7
8 AAPL 2016-07-31 2017-06-30 NA 144. NA
9 AAPL 2017-07-31 2018-06-30 149. NA NA
10 AAPL 2018-07-31 2019-06-30 190. NA NA
# ... with 26 more rows
There are NAs since not every start/end date is present in the original date column.
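If exact-date matches are too strict, one option (a sketch, assuming dplyr >= 1.1.0 for join_by() and closest()) is a rolling join that snaps each start/end date to the most recent trading day on or before it:

library(dplyr)  # >= 1.1.0 for join_by() and closest()

prices <- df %>% select(symbol, date, close)

df %>%
  distinct(symbol, start_date, end_date) %>%
  # most recent trading day on or before start_date
  left_join(prices, by = join_by(symbol, closest(start_date >= date))) %>%
  rename(start_close = close) %>% select(-date) %>%
  # most recent trading day on or before end_date
  left_join(prices, by = join_by(symbol, closest(end_date >= date))) %>%
  rename(end_close = close) %>% select(-date) %>%
  mutate(diff = end_close - start_close)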
I'm using the heatwaveR package in R to make a plot (event_line()) and visualize the heatwaves over the years. The first step is to run ts2clm(), but this command turns my temp column into NA so I can't plot anything. Does anyone see any errors?
This is my data:
>>> Data
t temp
[Date] [num]
0 2020-05-14 6.9
1 2020-05-06 6.8
2 2020-04-23 5.5
3 2020-04-16 3.6
4 2020-03-31 2.5
5 2020-02-25 2.3
6 2020-01-30 2.8
7 2019-10-02 13.4
8 2022-09-02 19
9 2022-08-15 18.7
...
687 1974-05-06 4.2
This is my code:
# Load packages
library(readxl)
library(heatwaveR)
# Load data
Data <- read_xlsx("seili_raw_temp.xlsx")
# Set t as class Date
Data$t <- as.Date(Data$t, format = "%Y-%m-%d")
# Construct seasonal and threshold climatologies
ts <- ts2clm(Data, climatologyPeriod = c("1974-05-06", "2020-05-14"))
# This is the point where almost all temp values turn into NA, so you can ignore below.
# Detect event
res <- detect_event(ts)
# Draw heatwave plot
event_line(res, min_duration = 3, metric = "int_cum",
           start_date = "1974-05-06", end_date = "2020-05-14")
The data you posted isn't long enough to get the function to work, so I just made some up:
library(heatwaveR)
library(lubridate)
set.seed(1234)
Data <- data.frame(
t = seq(ymd("2015-01-01"), ymd("2023-01-01"), by="7 day"))
Data$temp <- runif(nrow(Data), 0,45)
Then, when I execute the function, I get the result below. The problem is that your data (like the ones I generated) have one observation every 7 days. The ts2clm() function pads out the dataset so that every day has an entry, and if a temperature was not observed on a given day, it fills it in with a missing value.
ts <- ts2clm(Data, climatologyPeriod = c("2015-01-01", "2022-12-29"))
ts
#> # A tibble: 2,920 × 5
#> doy t temp seas thresh
#> <int> <date> <dbl> <dbl> <dbl>
#> 1 1 2015-01-01 5.12 22.5 38.6
#> 2 2 2015-01-02 NA 22.4 38.5
#> 3 3 2015-01-03 NA 22.2 38.2
#> 4 4 2015-01-04 NA 22.1 37.9
#> 5 5 2015-01-05 NA 21.9 37.3
#> 6 6 2015-01-06 NA 21.7 36.8
#> 7 7 2015-01-07 NA 21.5 36.5
#> 8 8 2015-01-08 28.0 21.3 36.1
#> 9 9 2015-01-09 NA 21.2 36.1
#> 10 10 2015-01-10 NA 21.0 35.8
#> # … with 2,910 more rows
Created on 2023-02-10 by the reprex package (v2.0.1)
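If the underlying issue is that the series is roughly weekly rather than daily, one possible workaround (just a sketch using zoo::na.approx(); whether interpolated temperatures are defensible for heatwave detection is a methodological question) is to interpolate onto a daily grid before calling ts2clm():

library(dplyr)
library(zoo)

# pad the series to a daily grid, then linearly interpolate the gaps
Data_daily <- data.frame(t = seq(min(Data$t), max(Data$t), by = "1 day")) %>%
  left_join(Data, by = "t") %>%
  mutate(temp = na.approx(temp, na.rm = FALSE))

ts <- ts2clm(Data_daily, climatologyPeriod = c("1974-05-06", "2020-05-14"))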
I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous 6 days as well as the current day (a rolling weekly average) and print it to an array for the designated water year. I have already created an aggregate function to average the daily means within their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  dates.posix <- as.POSIXlt(dates)
  # Year offset: months at or after the start month roll into the next water year
  offset <- ifelse(dates.posix$mon >= start_month - 1, 1, 0)
  # Water year
  adj.year <- dates.posix$year + 1900 + offset
  adj.year
}
# Aggregating by water year to take the mean
mean.FLOW <- aggregate(data_set$FLOW, list(wtr_yr(data_set$DATE)), mean)
It seems that it can be done much more easily.
But first I need to prepare a bit more data.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
Note: for the week function, the first week starts on January 1st. If you want to number the weeks according to the ISO 8601 standard, use the isoweek function. Alternatively, you can use the epiweek function, which follows the US CDC convention.
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
If you want to better understand how these functions work, please follow the code below
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
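If what is needed is the literal rolling weekly mean from the question (the current day plus the previous six days) rather than calendar-week groups, a sketch using zoo::rollmeanr() (a right-aligned rolling mean; this assumes the zoo package) would be:

library(zoo)

df %>%
  arrange(DATE) %>%
  # mean of the current day and the 6 days before it; NA until 7 days exist
  mutate(roll_week = rollmeanr(FLOW, k = 7, fill = NA))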
I am trying to apply some basic math to daily stock values based on a corresponding yearly value.
reprex
(daily prices)
library(tidyquant)
data(FANG)
# daily prices
FANG %>%
select(c(date, symbol, adjusted)) %>%
group_by(symbol)
# A tibble: 4,032 x 3
# Groups: symbol [4]
date symbol adjusted
<date> <chr> <dbl>
1 2013-01-02 FB 28
2 2013-01-03 FB 27.8
3 2013-01-04 FB 28.8
4 2013-01-07 FB 29.4
5 2013-01-08 FB 29.1
6 2013-01-09 FB 30.6
7 2013-01-10 FB 31.3
8 2013-01-11 FB 31.7
9 2013-01-14 FB 31.0
10 2013-01-15 FB 30.1
# ... with 4,022 more rows
(max price per year)
FANG_yearly_high <-
FANG %>%
group_by(symbol) %>%
summarise_by_time(
.date_var = date,
.by = "year",
price = max(adjusted))
# Groups: symbol [4]
symbol date price
<chr> <date> <dbl>
1 AMZN 2013-01-01 404.
2 AMZN 2014-01-01 407.
3 AMZN 2015-01-01 694.
4 AMZN 2016-01-01 844.
5 FB 2013-01-01 58.0
6 FB 2014-01-01 81.4
7 FB 2015-01-01 109.
8 FB 2016-01-01 133.
9 GOOG 2013-01-01 560.
10 GOOG 2014-01-01 609.
11 GOOG 2015-01-01 777.
12 GOOG 2016-01-01 813.
13 NFLX 2013-01-01 54.4
14 NFLX 2014-01-01 69.2
15 NFLX 2015-01-01 131.
16 NFLX 2016-01-01 128.
I would like to divide each daily price by the corresponding max price for the year.
I tried:
FANG %>%
group_by(symbol) %>%
summarise_by_time(
.date_var = date,
.by = "year",
price = AVERAGE(adjusted) / YEAR(date(MAX(adjusted)))
)
and I get this error:
Error in as.POSIXlt.numeric(x, tz = tz(x)) : 'origin' must be supplied
Any sensible way to accomplish this?
Thank you
summarise_by_time is good if you just want to summarise, but you want to divide a daily price by the max of a period, so you need to use mutate. Below are two examples: the first for daily prices, the second for weekly. You can easily adjust the weekly version to monthly.
library(tidyquant)
library(dplyr)
data(FANG)
# daily prices
FANG %>%
select(c(date, symbol, adjusted)) %>%
group_by(symbol, year = year(date)) %>%
mutate(price_pct = adjusted / max(adjusted))
# A tibble: 4,032 x 5
# Groups: symbol, year [16]
date symbol adjusted year price_pct
<date> <chr> <dbl> <dbl> <dbl>
1 2013-01-02 FB 28 2013 0.483
2 2013-01-03 FB 27.8 2013 0.479
3 2013-01-04 FB 28.8 2013 0.496
4 2013-01-07 FB 29.4 2013 0.508
5 2013-01-08 FB 29.1 2013 0.501
6 2013-01-09 FB 30.6 2013 0.528
7 2013-01-10 FB 31.3 2013 0.540
8 2013-01-11 FB 31.7 2013 0.547
9 2013-01-14 FB 31.0 2013 0.534
10 2013-01-15 FB 30.1 2013 0.519
# ... with 4,022 more rows
Weekly / monthly:
# weekly
FANG %>%
select(c(date, symbol, adjusted)) %>%
group_by(symbol) %>%
tq_transmute(mutate_fun = to.period,
period = "weeks" # change weeks to months for monthly
) %>%
group_by(symbol, year = year(date)) %>%
mutate(price_pct = adjusted / max(adjusted))
# A tibble: 836 x 5
# Groups: symbol, year [16]
symbol date adjusted year price_pct
<chr> <date> <dbl> <dbl> <dbl>
1 FB 2013-01-04 28.8 2013 0.519
2 FB 2013-01-11 31.7 2013 0.572
3 FB 2013-01-18 29.7 2013 0.535
4 FB 2013-01-25 31.5 2013 0.569
5 FB 2013-02-01 29.7 2013 0.536
6 FB 2013-02-08 28.5 2013 0.515
7 FB 2013-02-15 28.3 2013 0.511
8 FB 2013-02-22 27.1 2013 0.489
9 FB 2013-03-01 27.8 2013 0.501
10 FB 2013-03-08 28.0 2013 0.504
# ... with 826 more rows
I have a dataframe with multiple columns, ordered by a time column with a timestamp every second. I want to search the data frame for 1-minute periods that have limited variation in another variable.
For example, I want every minute in the data frame where the TWS (true wind speed) varies by no more than 5 knots. These 1-minute periods should also not overlap.
Once we have the 1-minute sections, create another data frame with each minute of data averaged into rows.
Here is the head of the data
Date Time Lat Lon AWA AWS TWA TWS
1 19/10/2018 2019-02-11 12:06:16 35.8952 14.5 -99.7 8.42 -99.7 8.42
2 19/10/2018 2019-02-11 12:06:17 35.8952 14.5 -99.1 8.24 -99.1 8.24
3 19/10/2018 2019-02-11 12:06:18 35.8952 14.5 -99.2 7.34 -99.2 7.34
4 19/10/2018 2019-02-11 12:06:19 35.8952 14.5 -99.6 6.87 -99.6 6.87
5 19/10/2018 2019-02-11 12:06:20 35.8952 14.5 -101.1 8.85 -101.1 8.85
6 19/10/2018 2019-02-11 12:06:21 35.8952 14.5 -101.6 9.39 -101.6 9.39
library(dplyr)
library(lubridate)

df %>%
  mutate(Date = as.Date(Date), Time = ymd_hms(Time)) %>%
  group_by(gr = minute(Time)) %>%
  mutate(flag = max(TWS, na.rm = TRUE) - min(TWS, na.rm = TRUE)) %>%
  filter(flag < 5) %>%
  mutate_all(., mean, na.rm = TRUE) %>%
  distinct()
# A tibble: 1 x 10
# Groups: gr [1]
Date Time Lat Lon AWA AWS TWA TWS gr flag
<date> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 0019-10-20 2019-02-11 12:06:17 35.9 14.5 -99.3 8. -99.3 8. 6 1.08
For variation between elements in each group, we can use dplyr::lag:
... mutate(flag = TWS - lag(TWS, default = first(TWS))) %>%
  filter(all(abs(flag) < 5)) %>% mutate_all(., mean, na.rm = TRUE) %>% distinct()
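Note that minute(Time) returns the minute within the hour, so rows from the same minute number in different hours would land in one group. A variation keyed on the full timestamp with lubridate::floor_date() (a sketch; the averaged columns are assumed from the head of the data above) avoids that and keeps the 1-minute windows non-overlapping:

df %>%
  mutate(Time = ymd_hms(Time)) %>%
  group_by(gr = floor_date(Time, "1 minute")) %>%  # full-timestamp minute bucket
  filter(max(TWS, na.rm = TRUE) - min(TWS, na.rm = TRUE) < 5) %>%  # keep calm minutes
  summarise(across(c(Lat, Lon, AWA, AWS, TWA, TWS), ~ mean(.x, na.rm = TRUE)))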
Data
df <- read.table(text = "
Date Time Lat Lon AWA AWS TWA TWS
1 '19/10/2018' '2019-02-11 12:06:16' 35.8952 14.5 -99.7 8.42 -99.7 8.42
2 '19/10/2018' '2019-02-11 12:06:17' 35.8952 14.5 -99.1 8.24 -99.1 8.24
3 '19/10/2018' '2019-02-11 12:06:18' 35.8952 14.5 -99.2 7.34 -99.2 7.34
4 '19/10/2018' '2019-02-11 12:07:19' 35.8952 14.5 -99.6 6.87 -99.6 6.87
5 '19/10/2018' '2019-02-11 12:07:20' 35.8952 14.5 -101.1 8.85 -101.1 8.85
6 '19/10/2018' '2019-02-11 12:07:21' 35.8952 14.5 -101.6 9.39 -101.6 16.39
", header=TRUE)
I am using the tidyquant package in R to calculate indicators for every symbol in the SP500.
As a sample of code:
stocks_w_price_indicators <- stocks2 %>%
  group_by(symbol) %>%
  tq_mutate(select = close, mutate_fun = RSI) %>%
  tq_mutate(select = c(high, low, close), mutate_fun = CLV)
This works for price-based indicators, but not for indicators that include volume.
I get "Evaluation error: argument "volume" is missing, with no default."
stocks_w_price_indicators <- stocks2 %>%
  group_by(symbol) %>%
  tq_mutate(select = close, mutate_fun = RSI) %>%
  tq_mutate(select = c(high, low, close, volume), mutate_fun = CMF)
How can I get indicators that include volume to calculate properly?
There are a few functions from the TTR package that cannot be used with tidyquant, because they need three inputs (like adjRatios) or need an HLC object plus a volume column (like the CMF function). Normally you would solve this by using the tq_mutate_xy function, but that one cannot handle the HLC object the CMF function needs. The OBV function from TTR, by contrast, needs only a price and a volume column and works fine with tq_mutate_xy.
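For example, the tq_mutate_xy route for OBV could look like this (a sketch; column names assume the standard tq_get() output used above):

stocks2 %>%
  group_by(symbol) %>%
  # OBV takes one price column (x) and one volume column (y)
  tq_mutate_xy(x = close, y = volume, mutate_fun = OBV)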
Now there are two options: either the CMF function needs to be adjusted to handle an (O)HLCV object, or you create your own function.
The last option is the fastest. Since the internals of the CMF function call the CLV function, you can take the first code block you have and extend it with a normal dplyr::mutate call to calculate the CMF.
# create function to calculate the Chaikin money flow
tq_cmf <- function(clv, volume, n = 20) {
  runSum(clv * volume, n) / runSum(volume, n)
}
stocks_w_price_indicators <- stocks2 %>%
group_by(symbol) %>%
tq_mutate(select = close, mutate_fun = RSI) %>%
tq_mutate(select = c(high, low, close), mutate_fun = CLV) %>%
mutate(cmf = tq_cmf(clv, volume, 20))
# A tibble: 5,452 x 11
# Groups: symbol [2]
symbol date open high low close volume adjusted rsi clv cmf
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MSFT 2008-01-02 35.8 36.0 35 35.2 63004200 27.1 NA -0.542 NA
2 MSFT 2008-01-03 35.2 35.7 34.9 35.4 49599600 27.2 NA 0.291 NA
3 MSFT 2008-01-04 35.2 35.2 34.1 34.4 72090800 26.5 NA -0.477 NA
4 MSFT 2008-01-07 34.5 34.8 34.2 34.6 80164300 26.6 NA 0.309 NA
5 MSFT 2008-01-08 34.7 34.7 33.4 33.5 79148300 25.7 NA -0.924 NA
6 MSFT 2008-01-09 33.4 34.5 33.3 34.4 74305500 26.5 NA 0.832 NA
7 MSFT 2008-01-10 34.3 34.5 33.8 34.3 72446000 26.4 NA 0.528 NA
8 MSFT 2008-01-11 34.1 34.2 33.7 33.9 55187900 26.1 NA -0.269 NA
9 MSFT 2008-01-14 34.5 34.6 34.1 34.4 52792200 26.5 NA 0.265 NA
10 MSFT 2008-01-15 34.0 34.4 34 34 61606200 26.2 NA -1 NA