Data extraction in a for loop using R

I am trying to find and compare the daily gain/loss percentage of two stocks in R. This is the code:
library(quantmod)
stockData <- new.env() #Make a new environment for quantmod to store data in
tickers <- c("AAPL","GOOG")
#Set start date
start_date=as.Date("2014-01-01")
getSymbols(tickers, src="yahoo", env=stockData,from=start_date)
for (tick in tickers) {
  x <- get(tick, pos=stockData) # get data from stockData environment
  x$gl <- ((Cl(x)-Op(x))/Op(x))*100 # daily gain/loss percentage
}
I am able to calculate the daily gain/loss percentage for individual stocks, but I don't know how to proceed further and extract and compare the gain/loss percentages of multiple stocks separately.
Example: if AAPL's gain/loss percentage is greater than GOOG's, then 1, else -1.

Your code is a good starting point. However, I would suggest that you store the data of the tickers in a list. As the code stands, only the data of the last ticker processed in the loop is kept in x.
This slightly modified version might help:
library(quantmod)
stockData <- new.env() #Make a new environment for quantmod to store data in
tickers <- c("AAPL","GOOG","YHOO","FB")
#Set start date
start_date <- as.Date("2014-01-01")
getSymbols(tickers, src="yahoo", env=stockData, from=start_date)
x <- list()
for (i in seq_along(tickers)) {
  x[[i]] <- get(tickers[i], pos=stockData) # get data from stockData environment
  x[[i]]$gl <- ((Cl(x[[i]])-Op(x[[i]]))/Op(x[[i]]))*100 # daily gain/loss percentage
}
compare_pl <- function(x,y){ifelse(x$gl > y$gl, 1, -1)}
aapl_vs_goog <- compare_pl(x[[1]],x[[2]])
Now the variable aapl_vs_goog indicates, for each day, whether AAPL outperformed GOOG (+1) or the other way around (-1):
> tail(aapl_vs_goog)
# gl
#2015-08-19 -1
#2015-08-20 1
#2015-08-21 1
#2015-08-24 1
#2015-08-25 -1
#2015-08-26 -1
Needless to say, this can be done in the same way for any other pair of tickers.
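For example, a small sketch building on the x list and compare_pl() above, naming the list elements so any pair can be compared by ticker:
names(x) <- tickers # address list elements by ticker name
aapl_vs_fb <- compare_pl(x[["AAPL"]], x[["FB"]])
tail(aapl_vs_fb)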

Related

My for loop operating on each timestamp in high-frequency data is inefficient

I am using R to calculate whole-lake temperatures at every timestamp during the open-water season.
I have loggers at various depths logging temperature every 10 minutes.
Each data frame for each lake has over 100k entries with over 10k different timestamps.
This is how I have solved it using a for loop. However, the code is extremely inefficient and takes a couple of hours per lake, depending on how deep it is (deeper lakes have more loggers).
The example below resembles what my data look like. Running the script on the example is fast, but it takes hours on the real data.
There should be a more efficient way of doing this, with some apply-family function, but I don't know how.
library(rLakeAnalyzer)
library(dplyr) # for bind_rows() below
date <- c("2000-01-01 00:00:00","2000-01-01 00:00:00","2000-01-01 00:00:00",
"2000-01-01 00:10:00","2000-01-01 00:10:00","2000-01-01 00:10:00",
"2000-01-01 00:20:00","2000-01-01 00:20:00","2000-01-01 00:20:00")
depth <- c(1,2,3,1,2,3,1,2,3)
temp <- c(20,12,9,14,12,11,10,7,4)
dt <- as.data.frame(cbind(temp,depth,date)) #example data frame
dptd <- c(0,1,2,3) #example depth
dpta <- c(5000,2500,1250,625) #example area per depth
datelist <- levels(as.factor(dt$date)) #'for each date in the frame...'
ldf <- list() #list to store every row for the new data frame
for(i in 1:length(datelist)){
  print(i) # to check how fast it operates
  lek <- dt[grepl(datelist[i],dt$date),] # take every row for this date in dt
  temp <- whole.lake.temperature(wtr=lek$temp, depths=lek$depth, bthA=dpta, bthD=dptd) # whole-lake temperature for this timestamp
  date <- datelist[i]
  ldf[[i]] <- as.data.frame(cbind(temp,date)) # one-row, two-column data frame stored in the list
}
ldf <- bind_rows(ldf) #convert list of data frames to a complete data frame
ldf$temp <- as.numeric(ldf$temp)
ldf$date <- as.POSIXct(ldf$date)
plot(ldf$date,ldf$temp) # voilà, a data frame with the whole-lake temperature at every timestamp
How about using data.table, grouping by date, and then applying the whole.lake.temperature function:
library(rLakeAnalyzer)
library(data.table)
date <- c("2000-01-01 00:00:00","2000-01-01 00:00:00","2000-01-01 00:00:00",
"2000-01-01 00:10:00","2000-01-01 00:10:00","2000-01-01 00:10:00",
"2000-01-01 00:20:00","2000-01-01 00:20:00","2000-01-01 00:20:00")
depth <- c(1,2,3,1,2,3,1,2,3)
temp <- c(20,12,9,14,12,11,10,7,4)
dt <- data.frame(temp,depth,date) # example data frame; data.frame() keeps temp and depth numeric (cbind would coerce them to character)
dptd <- c(0,1,2,3) #example depth
dpta <- c(5000,2500,1250,625) #example area per depth
results <- setDT(dt)[, by = date,
                     .(temp = whole.lake.temperature(wtr = temp,
                                                     depths = depth,
                                                     bthA = dpta,
                                                     bthD = dptd))]
It's hard to tell if it speeds things up without trying it out on your whole dataset. Let me know if it helps.
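If you would rather stay with base R and the apply family mentioned in the question, a rough sketch along those lines (untested on the full data, and assuming dt was built with data.frame() so that temp and depth stay numeric) would be:
res <- lapply(split(dt, dt$date), function(lek)
  data.frame(date = lek$date[1],
             temp = whole.lake.temperature(wtr = lek$temp, depths = lek$depth,
                                           bthA = dpta, bthD = dptd)))
res <- do.call(rbind, res)
res$date <- as.POSIXct(res$date)
As with the data.table version, whether this is fast enough can only be judged on the full dataset.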

Store XTS objects as data frames in a list in R

I wish to store some XTS objects as data frames within a list in R.
The XTS objects are stock price data collected using the tidyquant package, I need to convert these objects to data frames and store them in a list. I have one additional requirement, I only want to retain the index column and the closing price column for each stock.
I have tried using dplyr syntax to select the columns of interest, but my code fails when selecting column indexes greater than 2:
Error: Can't subset columns that don't exist.
x Locations 3 and 4 don't exist.
i There are only 2 columns.
This is the code I am using, but I am struggling to understand why I can't select the closing price from my 'fortified' data frames:
pacman::p_load(tidyquant,tidyverse,prophet)
tickers = c("AAPL","AMZN")
getSymbols(tickers,
           from = '2015-01-01',
           to = today(),
           warnings = FALSE,
           auto.assign = TRUE)
dfList <- list()
for (i in tickers) {
  dfList[[i]] <- fortify.zoo(i) %>%
    select(c(1,5))
}
When I convert an individual XTS object to a data frame using fortify.zoo I can select the columns of interest but not when I loop through them.
fortify.zoo(AAPL) %>% select(c(1,5)) %>% head(n = 10)
Can anyone help me understand where I am falling down in my understanding on this issue please?
getSymbols can put the stock data into an environment stocks, and Cl will extract the close together with the index; replace Cl with Ad if you want the adjusted close instead. Then iterate through the names in the environment. Finally, leave the result as the environment stocks or optionally convert it to a list L. No packages other than quantmod and the packages it pulls in are used. Also, ask yourself whether you even need to convert the data to data frames; you could just leave it as xts.
library(quantmod)
tickers = c("AAPL","AMZN")
stocks <- new.env()
getSymbols(tickers, env = stocks, from = '2015-01-01')
for(nm in ls(stocks)) stocks[[nm]] <- fortify.zoo(Cl(stocks[[nm]]))
L <- as.list(stocks) # optional
Another possibility if you do want a list is to replace the last two lines with an eapply:
L <- eapply(stocks, function(x) fortify.zoo(Cl(x)))
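Either way, the result is keyed by ticker, so a quick check of one element (the column names come from fortify.zoo applied to the close series) might look like:
head(L[["AAPL"]], 3)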
It is better to initialize a list with a fixed length and name it with the tickers. The OP's code loops over the tickers directly, so each i is the ticker name, which is a string:
dfList <- vector('list', length(tickers))
names(dfList) <- tickers
Because i here is the string name of the object ("AAPL" or "AMZN"), we can use get to return the value of that object from the global environment:
for (i in tickers) {
  dfList[[i]] <- fortify.zoo(get(i)) %>%
    select(c(1,5))
}
Check the dimensions:
sapply(dfList, dim)
# AAPL AMZN
#[1,] 1507 1507
#[2,] 2 2
Another approach is mget, which returns all those objects in a list:
library(purrr)
library(dplyr)
dfList2 <- mget(tickers) %>%
  map(~ fortify.zoo(.x) %>%
        select(1, 5))
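As with the loop version, the result can be checked the same way:
sapply(dfList2, dim)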

Difftime Error using Looping Regressions in R

With the code below I am getting the error Error in Ops.difftime((f - mean(f)), 2) : '^' not defined for "difftime" objects.
This error only occurs when the line r_sq[[counter-lookback]] <- summary(temp_lm)$r.squared, towards the end of the loop, is included. I cannot find solutions to any similar error online. Thank you for your help.
#Import necessary packages
require(quantmod)
require(ggplot2)
#Measure time used in processing
ptm <- proc.time()
#########
#Write in the ticker symbols of the pair
tickers <- c("GS","JPM")
########
#Pull data down for symbols
A <- getSymbols(tickers[1],auto.assign=FALSE)
B <- getSymbols(tickers[2],auto.assign=FALSE)
#Strip data such as high and low prices
A <- A[,4]
B <- B[,4]
#Create data frame of both price series
AB_DF <- data.frame(A,B)
#Create a time series of the spread & rename header
S <- A-B
colnames(S) <- "Spread.Close"
#Separate the index of times from the spread data for regression
TS <- index(S)
SP <- coredata(S)
#Perform regressions of past 'lookback' days of the spread, incrementing by 1, beginning at T = lookback+1
########
# Change below variable to alter length of data in regression
lookback <- 250
#######
#Initialize a counter, and lists to hold data from the spread regressions
counter <- lookback+1
res_store <- list()
spread_coef <- list()
r_sq <- list()
while (counter < length(SP)) {
  temp_lm <- lm(TS[(counter-lookback):counter] ~ SP[(counter-lookback):counter])
  res_store[[counter-lookback]] <- residuals(temp_lm)
  spread_coef[[counter-lookback]] <- coefficients(temp_lm)[[2]]
  r_sq[[counter-lookback]] <- summary(temp_lm)$r.squared
  counter <- counter+1
}
OK all, I have figured it out. The issue is that the regression uses the time index itself as the response, so when summary() computes R² it ends up raising a difftime to a power, which is not defined. I solved this by changing the index from time values to a standard numeric index, and everything ran fine.
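For example, a minimal sketch of that change, reusing TS, SP, lookback and r_sq from the question (only the index conversion and the regression line differ):
TS_num <- as.numeric(TS) # plain numeric index instead of Date/POSIXct values
counter <- lookback+1
while (counter < length(SP)) {
  temp_lm <- lm(TS_num[(counter-lookback):counter] ~ SP[(counter-lookback):counter])
  r_sq[[counter-lookback]] <- summary(temp_lm)$r.squared # no longer errors: the response is numeric
  counter <- counter+1
}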

Prepend xts rows to a subset

Suppose I need to apply an MA(5) to a batch of market data stored in an xts object. I can easily pull the subset of data I want smoothed with xts subsetting:
x['2013-12-05 17:00:01/2013-12-06 17:00:00']
However, I need an additional 5 observations prior to the first one in my subset to "prime" the filter. Is there an easy way to do this?
The only thing I have been able to figure out is really ugly, with explicit row numbers (here using xts sample data):
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
x$rn <- row(x[,1])
frst <- first(x['2007-05-18'])$rn
finl <- last(x['2007-06-09'])$rn
ans <- x[(frst-5):finl,]
Can I just say bleah? Somebody help me.
UPDATE: by popular request, a short example that applies an MA(5) to the daily data in sample_matrix:
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)$Close
calc_weights <- function(x) {
  ## replace rnorm with sophisticated analysis
  wgts <- matrix(rnorm(5,0,0.5), nrow=1)
  xts(wgts, index(last(x)))
}
smooth_days <- function(x, wgts) {
  w <- wgts[index(last(x))]
  out <- filter(x, w, sides=1)
  xts(out, index(x))
}
set.seed(1.23456789)
wgts <- apply.weekly(x, calc_weights)
lapply(split(x, f='weeks'), smooth_days, wgts)
For brevity, only the final week's output:
[[26]]
[,1]
2007-06-25 NA
2007-06-26 NA
2007-06-27 NA
2007-06-28 NA
2007-06-29 -9.581503
2007-06-30 -9.581208
The NAs here are my problem. I want to recalculate my weights for each week of data, and apply those new weights to the upcoming week. Rinse, repeat. In real life, I replace the lapply with some ugly stuff with row indexes, but I'm sure there's a better way.
In an attempt to define the problem clearly: this appears to be a conflict between the desire to run an analysis on non-overlapping time periods (weeks, in this case) and the need for overlapping periods of data (two weeks, in this case) to perform the calculation.
Here's one way to do this using endpoints and a for loop. You could still use the which.i=TRUE suggestion from my comment (a sketch of that follows the loop below), but integer subsetting is faster.
y <- x*NA # pre-allocate result
ep <- endpoints(x,"weeks") # time points where parameters change
set.seed(1.23456789)
for(i in seq_along(ep)[-(1:2)]) {
  rng1 <- ep[i-1]:ep[i] # obs to calc weights
  rng2 <- ep[i-2]:ep[i] # "prime" obs
  wgts <- calc_weights(x[rng1])
  # calc smooth_days on rng2, but only keep rng1 results
  y[rng1] <- smooth_days(x[rng2], wgts)[index(x[rng1])]
}
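For the narrower original question (pulling a few extra observations before a subset without hard-coding row numbers), the which.i=TRUE idea could look like this sketch, using the sample_matrix dates from the question and assuming at least 5 observations exist before the range:
i <- x['2007-05-18/2007-06-09', which.i = TRUE] # integer positions of the subset
ans <- x[(min(i) - 5):max(i), ]                 # prepend the 5 prior observations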

Optimizing search in time series data frame

I have a data frame of 50 columns by 2.5 million rows in R, representing a time series. The time column is of class POSIXct. For analysis, I repeatedly need to find the state of the system for a given class at a particular time.
My current approach is the following (simplified and reproducible):
set.seed(1)
N <- 10000
.time <- sort(sample(1:(100*N),N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
time=.time,
distance1=sort(sample(1:(100*N),N)),
distance2=sort(sample(1:(100*N),N)),
letter=sample(letters,N,replace=TRUE)
)
# state search function
time.state <- function(df,searchtime,searchclass){
  # find all rows in between the searchtime and a while (here 10k seconds) before that
  rows <- which(findInterval(df$time,c(searchtime-10000,searchtime))==1)
  # find the latest state of the given class within the search interval
  return(rev(rows)[match(T,rev(df[rows,"letter"]==searchclass))])
}
# evaluate the function to retrieve the latest known state of the system
# at time 500,000.
df[time.state(df,500000,"a"),]
However, the call to which is very costly. Alternatively, I could first filter by class and then find the time, but that doesn't change the evaluation time much. According to Rprof, it's which and == that cost the majority of the time.
Is there a more efficient solution? The time points are sorted in weakly increasing order.
Because which, == and [ all scale linearly with the size of the data frame, the solution is to generate subset data frames for bulk operations, as follows:
# function that applies time.state to a series of time/class combinations
time.states <- function(df,times,classes,day.length=24){
  result <- vector("list",length(times))
  day.end <- 0
  for(i in 1:length(times)){
    if(times[i] > day.end){
      # create a subset interval from 1h before to 24h after
      day.begin <- times[i]-60*60
      day.end <- times[i]+day.length*60*60
      df.subset <- df[findInterval(df$time,c(day.begin,day.end))==1,]
    }
    # save the resulting row from the data frame
    result[[i]] <- df.subset[time.state(df.subset,times[i],classes[i]),]
  }
  return(do.call("rbind",result))
}
With dT = diff(range(df$time)) and dT/day.length large, this reduces the evaluation time by roughly a factor of dT/(day.length+1).
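A usage sketch, reusing df and time.state from the question (the times and classes here are made up for illustration):
query_times   <- c(300000, 500000, 700000)
query_classes <- c("a", "b", "a")
time.states(df, query_times, query_classes)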
