Firstly, apologies for what is probably a very easy question. I have been following an example to plot STL and have come up with a nice line chart. I would like to extract the data points so I can use them in Tableau in this format:
(sorry, having trouble getting tables to display)
My time series is generated from a count in the same format as the table above, so I assume it is quite simple to stitch it back together, but I am not very experienced with data manipulation in R yet. I am happy with the actual seasonal plot, it's just the matter of tying it all back up into something I can use.
I cannot provide my data, but I can provide the following from a tutorial which does the same thing:
library(xts)
## load co2 data set
load(url("https://userpage.fu-berlin.de/soga/300/30100_data_sets/KeelingCurve.Rdata"))
library(lubridate)
start <- c(year(xts::first(co2)), month(xts::first(co2)))
start
end <- c(year(xts::last(co2)), month(xts::last(co2)))
end
# creation of a ts object
co2 <- ts(data = as.vector(coredata(co2)),
start = start,
end = end, frequency = 12)
# set up stl function
fit <- stl(co2, s.window = "periodic")
I am able to extract the list of y-axis values using:
seasonal_stl <- fit$time.series[,1]
What I would like to do is reconstruct that into a table of Month, Year and the seasonal value. Can anyone suggest how to do that? Many thanks in advance.
You can use the tsibble package to convert the ts object into a data frame in the form you want.
ts(fit$time.series, start=start, frequency=12) |>
tsibble::as_tsibble() |>
tidyr::pivot_wider(names_from = "key", values_from = "value") |>
tibble::as_tibble()
But you might find it easier to use the tsibble and feasts packages from the start, like this.
library(tsibble)
library(feasts)
library(lubridate)
## load co2 data set
load(url("https://userpage.fu-berlin.de/soga/300/30100_data_sets/KeelingCurve.Rdata"))
start <- c(year(xts::first(co2)), month(xts::first(co2)))
# creation of a tsibble object
co2 <- ts(co2, start=start, frequency=12) |>
as_tsibble()
# Fit STL
fit <- co2 |>
model(stl = STL(value ~ season(window = "periodic")))
# Extract components
components(fit)
#> # A dable: 711 x 7 [1M]
#> # Key: .model [1]
#> # : value = trend + season_year + remainder
#> .model index value trend season_year remainder season_adjust
#> <chr> <mth> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 stl 1958 Mar 316. 315. 1.46 -0.551 314.
#> 2 stl 1958 Apr 317. 315. 2.59 -0.0506 315.
#> 3 stl 1958 May 318. 315. 3.00 -0.514 315.
#> 4 stl 1958 Jun 317. 315. 2.28 -0.286 315.
#> 5 stl 1958 Jul 316. 315. 0.668 -0.00184 315.
#> 6 stl 1958 Aug 315. 315. -1.48 1.13 316.
#> 7 stl 1958 Sep 313. 315. -3.16 1.01 316.
#> 8 stl 1958 Oct 313. 315. -3.25 0.468 316.
#> 9 stl 1958 Nov 313. 316. -2.05 -0.148 315.
#> 10 stl 1958 Dec 315. 316. -0.860 -0.0377 316.
#> # … with 701 more rows
Created on 2023-01-26 with reprex v2.0.2
I was trying to get BSE SENSEX data in R and came across this Import Indian stock prices into R which is useful but I am not able to get SENSEX Index data using getSymbols from quantmod.
I have tried many options but none of them worked
quantmod::getSymbols("SENSEX", src="yahoo")
quantmod::getSymbols("SENSEX.BO", src="yahoo")
quantmod::getSymbols("BSE SENSEX", src="yahoo")
quantmod::getSymbols("BSE SENSEX.BO", src="yahoo")
About Sensex:
Packages tidyquant andQuantmod uses Yahoo Finance. Therefore, check what the security is called on their website, hence: ^BSESN
library(tidyquant)
tq_get("^BSESN")
# A tibble: 2,643 × 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ^BSESN 2012-01-03 15641. 15970. 15641. 15939. 16200 15939.
2 ^BSESN 2012-01-04 15967. 16005. 15822. 15883. 17800 15883.
3 ^BSESN 2012-01-05 15893. 15980. 15809. 15857. 21200 15857.
4 ^BSESN 2012-01-06 15789. 16001. 15665. 15868. 17200 15868.
5 ^BSESN 2012-01-07 NA NA NA NA NA NA
6 ^BSESN 2012-01-09 15840. 15872. 15678. 15815. 11200 15815.
7 ^BSESN 2012-01-10 15898. 16181. 15898. 16165. 19600 16165.
8 ^BSESN 2012-01-11 16222. 16245. 16128. 16176. 18600 16176.
9 ^BSESN 2012-01-12 16117. 16179. 15963. 16038. 14400 16038.
10 ^BSESN 2012-01-13 16145. 16257. 16050. 16155. 19600 16155.
# … with 2,633 more rows
# ℹ Use `print(n = ...)` to see more rows
I'm trying to add a column to a Tidyquant tibble. Here's the code:
library(tidyquant)
symbol <- 'AAPL'
start_date <- as.Date('2022-01-01')
end_date <- as.Date('2022-03-31')
prices <- tq_get(symbol,
from = start_date,
to = end_date,
get = 'stock.prices')
head(prices)
# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2022-01-03 178. 183. 178. 182. 104487900 182.
2 AAPL 2022-01-04 183. 183. 179. 180. 99310400 179.
3 AAPL 2022-01-05 180. 180. 175. 175. 94537600 175.
4 AAPL 2022-01-06 173. 175. 172. 172 96904000 172.
5 AAPL 2022-01-07 173. 174. 171. 172. 86709100 172.
6 AAPL 2022-01-10 169. 172. 168. 172. 106765600 172.
Now, I'm attempting to add the change_on_day column (that's just the difference in the 'adjusted' prices between one day and the next) using the following:
prices$change_on_day <- diff(prices$adjusted)
The error message is:
Error: Assigned data `diff(prices$adjusted)` must be compatible with existing data.
x Existing data has 61 rows.
x Assigned data has 60 rows.
i Only vectors of size 1 are recycled.
How would I add this price difference column?
Thanks!
If you are trying to get today's value from the previous date value then you should be able to do that with the lag() function
prices %>%
mutate(change_on_day=adjusted-lag(adjusted,1))
We can use tq_transmute with quantmod::periodReturn setting the period argument to 'daily' in order to calculate daily returns.
library(tidyquant)
symbol <- "AAPL"
start_date <- as.Date("2022-01-01")
end_date <- as.Date("2022-03-31")
prices <- tq_get(symbol,
from = start_date,
to = end_date,
get = "stock.prices"
)
stock_returns_monthly <- prices %>%
tq_transmute(
select = adjusted,
mutate_fun = periodReturn,
period = "daily",
col_rename = "change_on_day"
)
stock_returns_monthly
#> # A tibble: 61 × 2
#> date change_on_day
#> <date> <dbl>
#> 1 2022-01-03 0
#> 2 2022-01-04 -0.0127
#> 3 2022-01-05 -0.0266
#> 4 2022-01-06 -0.0167
#> 5 2022-01-07 0.000988
#> 6 2022-01-10 0.000116
#> 7 2022-01-11 0.0168
#> 8 2022-01-12 0.00257
#> 9 2022-01-13 -0.0190
#> 10 2022-01-14 0.00511
#> # … with 51 more rows
Created on 2022-04-18 by the reprex package (v2.0.1)
For more information check this vignette
I would like to achieve the following:
filter dataframe catalogs based on multiple columns in dataframe orders, for each row in dataframe orders and store the result in a list column in dataframe orders. (succeeded)
calculate the difference between a date in data frame orders and another date in the new listcolumn.
Table s_orders contains order data for different people (account keys). Table s_catalogs contains all catalogs that were sent to each account key
For each order, I want to know:
if and what catalogs were sent from the previous order (or the beginning) until the day before the focal order. More specifically, consumers received a (paper) catalog at s_catalogs$CATDATE. I want to know for each order what catalogs were received between the previous order (s_orders$PREVORDER) and the latest order. Because some consumers do not have a previous order I set the previous order date startdate to date("1999-12-31") which is the beginning of my dataset.
Then I want to do some calculations on the catalog data. (in this example: calculate the difference between date of a catalog and the order date)
For this, I have written a function getCatalogs, which takes the account key and two dates as input, and outputs a dataframe with the results from the other table. Would be much appreciated if someone has a better, more efficient solution? maybe with some sort of join?
I think my main problem is how to use mutate, pmap, pipes, pluck interchangeably for building complex queries on multiple tables.
My actual problem is outlined in sections Desired result and Problem.
# packages needed
library("dplyr")
library("lubridate")
library("purrr")
#library("tidyverse")
Example data
( i sampled some users from my data. s_ stands for 'sample')
# orders
s_orders <- structure(list(ACCNTKEY = c(2806, 2806, 2806, 3729, 3729, 3729,
3729, 4607, 4607, 4607, 4607, 4742, 11040, 11040, 11040, 11040,
11040, 17384), ORDDATE = structure(c(11325, 11703, 11709, 11330,
11375, 11384, 12153, 11332, 11445, 11589, 11713, 11333, 11353,
11429, 11662, 11868, 11960, 11382), class = "Date")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -18L))
# # A tibble: 18 x 2
# ACCNTKEY ORDDATE
# <dbl> <date>
# 1 2806 2001-01-03
# 2 2806 2002-01-16
# 3 2806 2002-01-22
# 4 3729 2001-01-08
# 5 3729 2001-02-22
# 6 3729 2001-03-03
# 7 3729 2003-04-11
# 8 4607 2001-01-10
# 9 4607 2001-05-03
# 10 4607 2001-09-24
# 11 4607 2002-01-26
# 12 4742 2001-01-11
# 13 11040 2001-01-31
# 14 11040 2001-04-17
# 15 11040 2001-12-06
# 16 11040 2002-06-30
# 17 11040 2002-09-30
# 18 17384 2001-03-01
# catalogs
s_catalogs <- structure(list(ACCNTKEY = c("2806", "2806", "4607", "2806", "4607",
"4607", "4607"), CATDATE = structure(c(11480, 11494, 11522, 11858,
11886, 12264, 12250), class = "Date"), CODE = c("2806/07/2001",
"2806/21/2001", "4607/19/2001", "2806/20/2002", "4607/18/2002",
"4607/31/2003", "4607/17/2003")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
# # A tibble: 7 x 3
# ACCNTKEY CATDATE CODE
# <chr> <date> <chr>
# 1 2806 2001-06-07 2806/07/2001
# 2 2806 2001-06-21 2806/21/2001
# 3 4607 2001-07-19 4607/19/2001
# 4 2806 2002-06-20 2806/20/2002
# 5 4607 2002-07-18 4607/18/2002
# 6 4607 2003-07-31 4607/31/2003
# 7 4607 2003-07-17 4607/17/2003
calculate the lagged order date
# calculate previous order date for each order in s_orders
s_orders<-s_orders %>%
group_by(ACCNTKEY) %>%
arrange(ORDDATE) %>%
mutate(PREVORDER=as_date(lag(ORDDATE)))
So now we know the previous order (if any)
Function getCatalogs (improvement appreciated)
So the below function getCatalogs returns a dataframe with the catalogs that were received by that account key before the order (or actually in between the last orders/catalogs that were received between startdate and enddate).
# in case _startdate_ is missing then I set it to some starting value
getCatalogs<-function(key,startdate,enddate){
if(is.na(startdate)){
startdate<-as_date(date("1999-12-31"))
}
tmp <- s_catalogs[s_catalogs$ACCNTKEY==key &
s_catalogs$CATDATE<enddate &
s_catalogs$CATDATE>=startdate,]
if (NROW(tmp)>0){
return(tmp)
}else{return(NA)}
}
Use the function
let's get for each order all catalogs in a listcolumn
# For each row in s_orders search in dataframe s_catalogs all catalogs that were received for that account key before the order date but after the previous order.
s_orders <- s_orders %>% as_tibble() %>%
mutate(catalogs =
pmap(c(list(ACCNTKEY),list(PREVORDER),list(ORDDATE)),.f= function(x,y,z){getCatalogs(x,y,z)}))
This line for example gets the date of the latest catalog, which is what i need:
s_orders %>% pluck("catalogs") %>% pluck(13) %>% pluck("CATDATE") %>% max()
# [1] "2001-06-21"
Desired result:
Now I would like to retrieve the number of days between the above date and the date of the order (ORDDATE). The following code does it exactly but it is only correct in row 13.
# get amount of days since last catalog
s_orders3 <- s_orders %>%
mutate(diff = ORDDATE - s_orders %>%
pluck("catalogs") %>% pluck(13) %>% pluck("CATDATE") %>% max())
# # A tibble: 18 x 5
# ACCNTKEY ORDDATE PREVORDER catalogs diff
# <dbl> <date> <date> <list> <time>
# 1 2806 2001-01-03 NA <lgl [1]> -169 days
# 2 3729 2001-01-08 NA <lgl [1]> -164 days
# 3 4607 2001-01-10 NA <lgl [1]> -162 days
# 4 4742 2001-01-11 NA <lgl [1]> -161 days
# 5 11040 2001-01-31 NA <lgl [1]> -141 days
# 6 3729 2001-02-22 2001-01-08 <lgl [1]> -119 days
# 7 17384 2001-03-01 NA <lgl [1]> -112 days
# 8 3729 2001-03-03 2001-02-22 <lgl [1]> -110 days
# 9 11040 2001-04-17 2001-01-31 <lgl [1]> -65 days
# 10 4607 2001-05-03 2001-01-10 <lgl [1]> -49 days
# 11 4607 2001-09-24 2001-05-03 <tibble [1 × 3]> 95 days
# 12 11040 2001-12-06 2001-04-17 <lgl [1]> 168 days
# 13 2806 2002-01-16 2001-01-03 <tibble [2 × 3]> 209 days
# 14 2806 2002-01-22 2002-01-16 <lgl [1]> 215 days
# 15 4607 2002-01-26 2001-09-24 <lgl [1]> 219 days
# 16 11040 2002-06-30 2001-12-06 <lgl [1]> 374 days
# 17 11040 2002-09-30 2002-06-30 <lgl [1]> 466 days
# 18 3729 2003-04-11 2001-03-03 <lgl [1]> 659 days
Check manually:
date("2002-01-16")-date("2001-06-21")
# Time difference of 209 days
Problem
However, the code subtracts the same date from order date in every row. I want it to use the date that belongs to each particular row.
So the problem is how to replace the %>% pluck(13) %>% by some command that dows this trick to every row and put it in the diff column.
I am really searching for a solution that uses either purrr or dplyr or some other package that is just as efficient and clear.
Hoping that I have understood the question clearly, here is my attempt trying to solve the problem. I changed the getCatalogs function to return only max CATDATE in case if it is present.
library(dplyr)
library(purrr)
getCatalogs<-function(key,startdate,enddate){
if(is.na(startdate)) startdate<- as.Date("1999-12-31")
tmp <- s_catalogs$CATDATE[s_catalogs$ACCNTKEY==key &
s_catalogs$CATDATE<enddate &
s_catalogs$CATDATE>=startdate]
if (length(tmp) > 0) max(tmp) else NA
}
s1_orders<- s_orders %>%
group_by(ACCNTKEY) %>%
arrange(ORDDATE) %>%
mutate(PREVORDER=lag(ORDDATE))
and then use pmap like :
s1_orders %>%
mutate(catalogs = pmap_dbl(list(ACCNTKEY,PREVORDER,ORDDATE), getCatalogs),
catalogs = as.Date(catalogs, origin = "1970-01-01"),
diff = ORDDATE - catalogs)
# ACCNTKEY ORDDATE PREVORDER catalogs diff
# <dbl> <date> <date> <date> <drtn>
# 1 2806 2001-01-03 NA NA NA days
# 2 3729 2001-01-08 NA NA NA days
# 3 4607 2001-01-10 NA NA NA days
# 4 4742 2001-01-11 NA NA NA days
# 5 11040 2001-01-31 NA NA NA days
# 6 3729 2001-02-22 2001-01-08 NA NA days
# 7 17384 2001-03-01 NA NA NA days
# 8 3729 2001-03-03 2001-02-22 NA NA days
# 9 11040 2001-04-17 2001-01-31 NA NA days
#10 4607 2001-05-03 2001-01-10 NA NA days
#11 4607 2001-09-24 2001-05-03 2001-07-19 67 days
#12 11040 2001-12-06 2001-04-17 NA NA days
#13 2806 2002-01-16 2001-01-03 2001-06-21 209 days
#14 2806 2002-01-22 2002-01-16 NA NA days
#15 4607 2002-01-26 2001-09-24 NA NA days
#16 11040 2002-06-30 2001-12-06 NA NA days
#17 11040 2002-09-30 2002-06-30 NA NA days
#18 3729 2003-04-11 2001-03-03 NA NA days
Update
Without changing the current getCatalogs function, we can test the length of catalogs
s1_orders %>%
mutate(catalogs = pmap(list(ACCNTKEY,PREVORDER,ORDDATE), getCatalogs),
temp = map_dbl(catalogs, ~if (length(.x) > 1)
.x %>% pluck("CATDATE") %>% max else NA),
temp = as.Date(temp, origin = "1970-01-01"),
diff = ORDDATE - temp)
I am using the package quantmod to get historical share prices.
I want to create a loop to pull back the prices and as part of the loop I want to create a dataframe for each share. I have been unsuccessful so far with the below code, it gets the share prices as expected but this is returned as a xts object whereas I require the information as a dataframe - the as.data.frame part of the code doesn't do anything...
library(quantmod)
shares<-c("BARC.L", "BP.L", "DLG.L")
for(i in 1:length(shares)){
#gets share prices
getSymbols((paste(shares[i])), from = "2018-01-01")
#put the data into a dataframe (doesn't work).
shares[i]<-as.data.frame(shares[i])
}
The end result that I want is 3 dataframes - 1 for each share.
Can anyone suggest modifications to the code to achieve this please?
Personally I would do it like this:
library(quantmod)
shares<-c("BARC.L", "BP.L", "DLG.L")
my_shares <- lapply(shares, function(x) getSymbols(x, from = "2018-01-01", auto.assign = FALSE))
names(my_shares) <- shares
Or if you need the dates as a column instead of rownames:
my_shares <- lapply(shares, function(x) {
out <- getSymbols(x, from = "2018-01-01", auto.assign = FALSE)
out <- data.frame(dates = index(out), coredata(out))
return(out)
})
names(my_shares) <- shares
Or if you need everything in a tidy dataset:
library(tidyquant)
my_shares <- tq_get(shares)
my_shares
# A tibble: 7,130 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BARC.L 2008-01-02 464. 483. 460. 466. 38104837 344.
2 BARC.L 2008-01-03 466. 472. 458. 470. 33215781 347.
3 BARC.L 2008-01-04 466. 476. 447. 449. 42710244 332.
4 BARC.L 2008-01-07 447. 452. 433. 436. 58213512 322.
5 BARC.L 2008-01-08 439. 447. 421. 437. 105370539 322.
6 BARC.L 2008-01-09 432. 434. 420. 424. 71059078 313.
7 BARC.L 2008-01-10 428. 431. 413. 418. 54763347 309.
8 BARC.L 2008-01-11 416. 437. 416. 430. 72467229 317.
9 BARC.L 2008-01-14 430. 448. 427. 444. 56916500 328.
10 BARC.L 2008-01-15 445. 452. 428. 429. 77094907 317.
# ... with 7,120 more rows
Firstly, I suggest you use the help() function that comes with R packages if you're not already doing so. I noticed in help(getSymbols) that you need to set env=NULL to actually return the data. With that, I've also made a list object so you can store the data as data.frames like you requested:
library(quantmod)
shares<-c("BARC.L", "BP.L", "DLG.L")
# initialize a list to store your data frames
df_list <- as.list(rep(data.frame(), length(shares)))
for (i in 1:length(shares)) {
#gets share prices
df_list[[i]] <- as.data.frame(getSymbols(shares[i], from = "2018-01-01", env=NULL))
}
# so you can access by name, e.g. df_list$DLG.L
names(df_list) <- shares