R: Building URLs based on multiple variables of different lengths

I've been struggling to figure this out on my own, so reaching out for some assistance. I am trying to build urls based on multiple variables (months and years) of different lengths so that I have a url for each combination of month and year from the lists I created.
I've done something similar in Python but need to translate it into R, and I'm running into issues with building the function and for loops. Here's the Python code:
import calendar

# set years and months
oasis_market_yr = ('2020','2019','2018','2017','2016','2015','2014','2013','2012','2011')
oasis_market_mn = ('01','02','03','04','05','06','07','08','09','10','11','12')
# format url string
URL_FORMAT_STRING = 'http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_{year}_M{month}_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime={year}{month}01T07:00-0000&enddatetime={year}{month}{last_day_of_month}T07:00-0000&version=1'
# create function to make urls
def make_url(year, month):
    # monthrange returns (weekday of first day, number of days in month)
    last_day_of_month = calendar.monthrange(int(year), int(month))[1]
    return URL_FORMAT_STRING.format(year=year, month=month, last_day_of_month=last_day_of_month)
# build urls for download
for y in oasis_market_yr:
    for m in oasis_market_mn:
        url = make_url(y, m)
I've tried using sapply and mapply with str_glue and a few other methods but can't seem to replicate the outcome. I keep getting an error that reads: Error: Variables must be length 1 or 5. Or, for instance with mapply, it maps the first value in one list to the first in the other list and so on, then stops when the shorter list runs out of values. What I need is all the combinations from both lists.
Any assistance would be much appreciated.

Your syntax was a little too Pythonic and won't work like that in R.
In R, the same syntax would look like this:
# set years and months
oasis_market_yr = c('2020','2019','2018','2017','2016','2015','2014','2013','2012','2011')
oasis_market_mn = c('01','02','03','04','05','06','07','08','09','10','11','12')
# create function to make urls
make_url = function(year, month){
  # format url string
  URL_FORMAT_STRING = 'http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_{year}_M{month}_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime={year}{month}01T07:00-0000&enddatetime={year}{month}{last_day_of_month}T07:00-0000&version=1'
  # days per month; February gets 29 days in leap years
  lastdays = c(31,28,31,30,31,30,31,31,30,31,30,31)
  y = as.integer(year)
  # leap year: divisible by 4 but not by 100, or divisible by 400
  if((y %% 4 == 0 && y %% 100 != 0) || y %% 400 == 0){lastdays[2] = 29}
  last_day_of_month = as.character(lastdays[as.integer(month)])
  # fixed=TRUE treats the pattern as a literal string, not a regex
  fs = gsub("{month}", month, URL_FORMAT_STRING, fixed=TRUE)
  fs = gsub("{year}", year, fs, fixed=TRUE)
  fs = gsub("{last_day_of_month}", last_day_of_month, fs, fixed=TRUE)
  return(fs)
}
# build urls for download
for(y in oasis_market_yr){
  for(m in oasis_market_mn){
    url = make_url(y, m)
    print(url)
  }
}
As I am not aware of a direct equivalent of Python's string formatting method in R, I changed it to replacements: a = gsub(pattern, replacement, a, fixed=TRUE) corresponds to the Python command a = a.replace(pattern, replacement). It should work beautifully.
Also, you don't really need a calendar package to get the last day of each month. Just supply the days per month as a vector, adjust February for leap years, and Bob's your uncle.
I don't know whether the URLs that are generated are really the ones you need. But you might be able to work from this translation to correct it, if something is wrong.
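As a possible alternative to the gsub replacements, base R's sprintf can play the role of Python's str.format, and the last day of a month can be derived with Date arithmetic instead of a lookup table. The following is only a minimal sketch: make_url_sprintf is a hypothetical name, and the URL template is shortened to the date part for readability.

```r
# Hypothetical helper (not from the answer above): builds the date part of the URL.
# %1$s, %2$s and %3$s are positional placeholders for year, month and last day.
make_url_sprintf <- function(year, month) {
  first_day <- as.Date(paste(year, month, "01", sep = "-"))
  # first_day + 31 always lands in the following month; snapping that date
  # back to day 01 and subtracting one day gives the last day of this month.
  last_day <- format(as.Date(format(first_day + 31, "%Y-%m-01")) - 1, "%d")
  sprintf("startdatetime=%1$s%2$s01T07:00-0000&enddatetime=%1$s%2$s%3$sT07:00-0000",
          year, month, last_day)
}

make_url_sprintf("2020", "02")  # leap year, February has 29 days
make_url_sprintf("2019", "02")  # common year, February has 28 days
```

Because sprintf and the Date functions are vectorised, the same function should also accept whole vectors of years and months of equal length.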

An option using glue and lubridate. Note I added _i to the {month} and {year} variables to avoid confusion with the month and year functions in lubridate.
library(glue)
library(lubridate)
URL_FORMAT_STRING <- 'http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_{year_i}_M{month_i}_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime={year_i}{month_i}01T07:00-0000&enddatetime={year_i}{month_i}{last_day_of_month}T07:00-0000&version=1'
make_url <- function(year_i, month_i){
  last_day_of_month <- day(ceiling_date(my(paste(month_i, year_i)), 'month') - days(1))
  glue(URL_FORMAT_STRING)
}
And then rather than a nested for loop you can use mapply to apply your function to all combinations of oasis_market_yr and oasis_market_mn.
df_vars <- expand.grid(year_i = oasis_market_yr, month_i = oasis_market_mn)
mapply(make_url, df_vars$year_i, df_vars$month_i)
# [1] "http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_2020_M01_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime=20200101T07:00-0000&enddatetime=20200131T07:00-0000&version=1"
# [2] "http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_2019_M01_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime=20190101T07:00-0000&enddatetime=20190131T07:00-0000&version=1"
#....
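To make the "all combinations" step easier to see in isolation, here is a small base R sketch. The label function is a hypothetical stand-in for make_url, and the shortened year and month vectors are assumptions chosen to keep the output small.

```r
years  <- c("2020", "2019", "2018")
months <- c("01", "02")

# Stand-in for make_url(): just labels each combination.
label <- function(year, month) paste0(year, "-", month)

# expand.grid() builds every year/month combination, one row per pair;
# stringsAsFactors = FALSE keeps the columns as character vectors.
grid <- expand.grid(year = years, month = months, stringsAsFactors = FALSE)

# mapply() walks the two columns in parallel, so each row becomes one call.
labels <- mapply(label, grid$year, grid$month, USE.NAMES = FALSE)
labels
```

The first argument of expand.grid varies fastest, which is why the real output above lists 2020 and 2019 for month 01 before moving on to month 02.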

Related

Quantmod in R - How to work on multiple symbols efficiently?

I'm using quantmod to work on multiple symbols in R. My instinct is to combine the symbols into a list of xts objects, then use lapply to do what I need to do. However, some of the things that make quantmod convenient seem (to this neophyte) not to play nicely with lists. An example:
> symbols <- c("SPY","GLD")
> getSymbols(symbols)
> prices.list <- mget(symbols)
> names(prices.list) <- symbols
> returns.list <- lapply(prices.list, monthlyReturn, leading = FALSE)
This works. But it's unclear to me which column of prices it is using. If I try to specify adjusted close, it throws an error:
> returns.list <- lapply(Ad(prices.list), monthlyReturn, leading = FALSE)
Error in Ad(prices.list) :
subscript out of bounds: no column name containing "Adjusted"
The help for Ad() confirms that it works on "a suitable OHLC object," not on a list of OHLC objects. In this particular case, how can I specify that lapply should apply the monthlyReturn function to the Adjusted column?
More generally, what is the best practice for working with multiple symbols in quantmod? Is it to use lists, or is another approach better suited?
Answer monthlyReturn:
All the *Return functions are based on periodReturn. By default, periodReturn first checks that its input is an xts object. If open and close prices are available, it takes the open price as the start value and the close price as the end value and calculates the return from those. If they are not available, it calculates the return from the first and last values of the time series, taking the requested interval (day, month, year, etc.) into account.
Answer for lapply:
You want to do two operations on each element of the list, so you should use an anonymous function inside lapply:
lapply(prices.list, function(x) monthlyReturn(Ad(x), leading = FALSE))
This will get what you want.
Answer for multiple symbols:
Keep doing what you are doing.
Run an lapply when getting the symbols:
stock_prices <- lapply(symbols, getSymbols, auto.assign = FALSE)
Use the tidyquant or BatchGetSymbols packages to get all the data in one big tibble.
... I have probably forgotten a few. There are multiple SO answers about this.

Creating a simple for loop in R

I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the volume of every column relative to column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relatively new to R, so it's not obvious to me. What I tried is something like this, but I can't actually run it:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide any data, I've followed your description to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volume_OD
rel.Volume_unmod <- Volume %>%
  mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you want to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volume_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volume_OD is equivalent to function(x) x / Volume_OD
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in some cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by column 3 (extracted with [[ ]] so the divisor is a plain vector)
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[[3]])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
In this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] at a time and applies a function.
The function is /. You usually see / used like this: A / B, but actually / is a Primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we pass rel.Volume_unmod[[3]] as an additional parameter. Using [[3]] rather than [3] matters here: [3] would return a one-column dataframe, and dividing by it would produce dataframes instead of plain vectors.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we will just edit the columns of the dataframe and, as a result, we will have a dataframe instead of a list. Let me rephrase in a more technical way. When you are assigning rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using this assignment function: [<-. This is a function that allows you to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. So when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
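The points above can be checked on a tiny example. This is only a sketch: the three-row data frame df and its column names are made up for illustration, and the divisor is passed as a plain vector.

```r
df <- data.frame(id = c("a", "b", "c"), x = c(2, 4, 6), y = c(1, 2, 3))

# Operators are ordinary functions and can be called by name.
`/`(6, 3)

# lapply() passes the extra argument on to `/`,
# so each selected column is divided by the vector df[["x"]].
ratios <- lapply(df[2:3], `/`, df[["x"]])
class(ratios)  # a plain list, not a data frame

# Assigning into a slice of the data frame goes through `[<-`,
# which replaces the columns but keeps the data.frame class.
df[2:3] <- ratios
class(df)
df
```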
Without a minimal working example it's hard to guess what the variable Volume actually refers to. Apart from that, there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising the variable rel.Volume with NULL will result in an error when you try to assign into it, since you haven't told R what rel.Volume should be.
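The difference between looping over columns and looping over column indices can be seen in a minimal sketch like the following (the data frame df and its columns are invented for illustration):

```r
df <- data.frame(id = c("a", "b"), v1 = c(2, 4), v2 = c(1, 8))

# Looping over the data frame itself yields the column vectors, not positions.
for (col in df[2:3]) {
  print(col)
}

# Looping over positions yields numbers that can be used as indices.
rel <- df[1]  # start from the id column so the row count is already right
for (i in 2:ncol(df)) {
  rel[[names(df)[i]]] <- df[[i]] / df[[2]]
}
rel
```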
Try this, if you like (thanks @Edo for the example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse packages while you are still learning the basics. From my experience, in the long run you're better off learning base R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse functions later (but then, why would anybody? ;-) ).

R Apply a jsonlite::unbox command to a list

I want to communicate with an API that needs a certain format, see below:
library(jsonlite)
list(limits = list("Overall_Wave3/0" = unbox("14000"),
"Overall_Wave3/1" = unbox("14005")))
which gives (note the indexes of that list are [x]):
$limits
$limits$`Overall_Wave3/0`
[x] "14000"
$limits$`Overall_Wave3/1`
[x] "14005"
Now in my real life use case, I would need to create hundreds of such elements within a list, so I need to somehow automate things. My input will be a data frame (or tibble) that I need to put into that format. I get this working, however, only without successfully doing the unbox. I.e. here's how far I got:
library(tidyverse)
library(jsonlite)
dat <- data.frame(marker = c("Overall_Wave3/0", "Overall_Wave3/0"),
value = c(14000, 14005)) %>%
mutate(value = as.character(value))
args <- as.list(dat$value)
names(args) <- dat$marker
list(limits = args)
which gives (note that the indexes are now [1] instead of [x]):
$limits
$limits$`Overall_Wave3/0`
[1] "14000"
$limits$`Overall_Wave3/0`
[1] "14005"
Now converting both lists to a JSON body with toJSON(...) gives different results:
First command gives: {"limits":{"Overall_Wave3/0":"14000","Overall_Wave3/0.1":"14005"}}
Second command gives: {"limits":{"Overall_Wave3/0":["14000"],"Overall_Wave3/0.1":["14005"]}}
The second command has unnecessary square brackets around the numbers that must not be there. I know I could probably do a hack with a string replace, but would strongly prefer a solution that works right from the start (if it can be done within the tidyverse, I wouldn't be too sad about it).
Thanks.

Can I create new xts columns from a list of names?

My objective: read data files from yahoo then perform calculations on each xts using lists to create the names of xts and the names of columns to assign results to.
Why? I want to perform the same calculations for a large number of xts datasets without having to retype separate lines to perform the same calculations on each dataset.
First, get the datasets for 2 ETFs:
library(quantmod)
# get ETF data sets for example
startDate = as.Date("2013-12-15") #Specify period of time we are interested in
endDate = as.Date("2013-12-31")
etfList <- c("IEF","SPY")
getSymbols(etfList, src = "yahoo", from = startDate, to = endDate)
To simplify coding, remove the ETF prefix (e.g. IEF.) from the yahoo column names:
colnames(SPY) <- gsub("SPY.","", colnames(SPY))
colnames(IEF) <- gsub("IEF.","", colnames(IEF))
head(IEF,2)
#           Open High Low Close Volume Adjusted
#2013-12-16 100.86 100.87 100.52 100.61 572400 98.36
#2013-12-17 100.60 100.93 100.60 100.93 694800 98.67
Creating new columns using the functions in quantmod is straightforward, e.g.,
SPY$logRtn <- periodReturn(Ad(SPY),period='daily',subset=NULL,type='log')
IEF$logRtn <- periodReturn(Ad(IEF),period='daily',subset=NULL,type='log')
head(IEF,2)
# Open High Low Close Volume Adjusted logRtn
#2013-12-16 100.86 100.87 100.52 100.61 572400 98.36 0.0000000
#2013-12-17 100.60 100.93 100.60 100.93 694800 98.67 0.0031467
but rather than writing a new statement to perform the calculation for each ETF, I want to use a list instead. Here's the general idea:
etfList
#[1] "IEF" "SPY"
etfColName = "logRtn"
for (etfName in etfList) {
  newCol <- paste(etfName, etfColName, sep = "$")
  newcol <- periodReturn(Ad(etfName),period='daily',subset=NULL,type='log')
}
Of course, using strings (obviously) doesn't work, because
typeof(newCol) # is [1] "character"
typeof(logRtn) # is [1] "double"
I've tried everything I can think of (at least twice) to coerce the character string etfName$etfColName into an object that I can assign calculations to.
I've looked at many variations that work with data.frames, e.g., mutate() from dplyr, but don't work on xts data files. I could convert datasets back/forth from xts to data.frames, but that's pretty kludgy (to say the least).
So, can anyone suggest an elegant and straightforward solution to this problem (i.e., in somewhat less than 25 lines of code)?
I shall be so grateful that, when I make enough to buy my own NFL team, you will always have a place of honor in the owner's box.
This type of task is a lot easier if you store your data in a new environment. Then you can use eapply to loop over all the objects in the environment and apply a function to them.
library(quantmod)
etfList <- c("IEF","SPY")
# new environment to store data
etfEnv <- new.env()
# use env arg to make getSymbols load the data to the new environment
getSymbols(etfList, from="2013-12-15", to="2013-12-31", env=etfEnv)
# function containing stuff you want to do to each instrument
etfTransform <- function(x, ...) {
# remove instrument name prefix from colnames
colnames(x) <- gsub(".*\\.", "", colnames(x))
# add return column
x$logRtn <- periodReturn(Ad(x), ...)
x
}
# use eapply to apply your function to each instrument
etfData <- eapply(etfEnv, etfTransform, period='daily', type='log')
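The environment-plus-eapply pattern can also be tried without quantmod or a network connection. In the sketch below, plain data frames stand in for the xts objects, the names AAA and BBB are made up, and the log return is computed by hand instead of with periodReturn.

```r
# New environment holding two stand-in "instruments".
env <- new.env()
assign("AAA", data.frame(AAA.Close = c(10, 11, 12)), envir = env)
assign("BBB", data.frame(BBB.Close = c(20, 22, 21)), envir = env)

# Function applied to every object: strip the name prefix, add a log-return column.
transform_one <- function(x) {
  colnames(x) <- gsub(".*\\.", "", colnames(x))
  x$logRtn <- c(0, diff(log(x$Close)))
  x
}

# eapply() returns a named list; the order of its elements is not guaranteed,
# so results are best accessed by name.
out <- eapply(env, transform_one)
out[["AAA"]]
```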
(I didn't realize that you had posted a reproducible example.)
See if this is helpful:
etfColName = "logRtn"
for ( etfName in etfList ) {
newCol <- get(etfName)[ , etfColName]
assign(etfName, cbind( get(etfName),
periodReturn( Ad(get(etfName)),
period='daily',
subset=NULL,type='log')))
}
> names(SPY)
[1] "SPY.Open" "SPY.High" "SPY.Low" "SPY.Close"
[5] "SPY.Volume" "SPY.Adjusted" "logRtn" "daily.returns"
I'm not a quantmod user and it's only from the behavior I see that I believe the Ad function returns a named vector. (So I did not need to do any naming.)
R is not a macro language, which means you cannot just string together character values and expect them to get executed as though you had typed them at the command line. The get and assign functions allow you to 'pull' and 'push' items from the data object environment on the basis of character values, but you should not use the $-function in conjunction with them.
I still do not see a connection between the creation of newCol and the actual new column that your code was attempting to create. They have different spellings so would have been different columns ... if I could have figured out what you were attempting.
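The get/assign round trip can be sketched on stand-in data. Here plain data frames replace the xts objects, AAA and BBB are invented names, and the new column is added with [[ ]], which accepts a character column name on data frames; whether the same assignment form suits xts objects is not checked here.

```r
AAA <- data.frame(Close = c(10, 11, 12))
BBB <- data.frame(Close = c(20, 22, 21))
obj_names <- c("AAA", "BBB")
new_col <- "logRtn"

for (nm in obj_names) {
  obj <- get(nm)                                 # look the object up by its name
  obj[[new_col]] <- c(0, diff(log(obj$Close)))   # column name supplied as a string
  assign(nm, obj)                                # store the modified copy under the same name
}

names(AAA)
names(BBB)
```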

Converting time interval in R

My knowledge and experience of R is limited, so please bear with me.
I have a measurements of duration in the following form:
d+h:m:s.s
e.g. 3+23:12:11.931139, where d=days, h=hours, m=minutes, and s.s=decimal seconds. I would like to create a histogram of these values.
Is there a simple way to convert such string input into a numerical form, such as seconds? All the information I have found seems to be geared towards date-time objects.
Ideally I would like to be able to pipe a list of data to R on the command line and so create the histogram on the fly.
Cheers
Loris
Another solution based on SO:
op <- options(digits.secs=10)
z <- strptime("3+23:12:11.931139", "%d+%H:%M:%OS")
vec_z <- z + rnorm(100000)
hist(vec_z, breaks=20)
Short explanation: First, I set the digits.secs option so that fractional seconds are shown. Then I parse your string into a date-time object with strptime; if you now type z into the console you get "2012-05-03 23:12:11.93113". Finally, I create some more date-times by adding random offsets and plot a histogram. I think the important step for you is the parsing, and strptime should help you with that.
I would do it like this:
str = "3+23:12:11.931139"
result = sum(as.numeric(unlist(strsplit(str, "[:\\+]", perl = TRUE))) * c(24*60*60, 60*60, 60, 1))
> result
[1] 342731.9
Then, you can wrap it into a function and apply over the list or vector.
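One way the wrapping could look is sketched below. The function name duration_to_seconds and the sample inputs are assumptions for illustration; the conversion itself is the same split-and-multiply approach as above.

```r
# Hypothetical helper: converts "d+h:m:s.s" strings to seconds.
duration_to_seconds <- function(x) {
  # split each string on "+" or ":" into days, hours, minutes, seconds
  parts <- strsplit(x, "[:+]")
  vapply(parts,
         function(p) sum(as.numeric(p) * c(24 * 60 * 60, 60 * 60, 60, 1)),
         numeric(1))
}

durations <- c("3+23:12:11.931139", "0+00:01:30.5", "1+00:00:00")
secs <- duration_to_seconds(durations)
secs
# hist(secs) would then draw the histogram
```

For the command-line use case, the input lines could be read inside an Rscript with readLines(file("stdin")) and passed to the same function.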
