Previous value in apply function in R Studio - r

I have a problem subtracting two values from a data frame inside an apply function. I took a stock time series and tried to subtract the value at time i-1 from the value at time i:
library(data.table)
do_apply <- data.frame(close = DATA$Zamkniecie)  # 'Zamkniecie' is the closing price column
funkcja <- function(close){
  close_lag_1 <- shift(x = close,
                       n = 1,
                       type = 'lag',
                       fill = 0)
  close <- close - close_lag_1
  return(close)  # was return(zamkniecie), which is undefined
}
sapply(do_apply$close, funkcja)
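A probable fix, sketched under the assumption that DATA$Zamkniecie holds the closing prices: shift() is already vectorized over the whole column, while sapply(do_apply$close, funkcja) passes one element at a time, so close_lag_1 inside the function is always just the fill value. Applying shift() to the full column avoids the apply call entirely:

library(data.table)
# Sketch: difference each close against the previous close, no sapply needed
close <- DATA$Zamkniecie
diff_close <- close - shift(close, n = 1, type = "lag", fill = 0)
# Or, if DATA is a data.table, add the result as a column in place:
# DATA[, diff_close := Zamkniecie - shift(Zamkniecie, n = 1, type = "lag", fill = 0)]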

Confusing non-numeric argument to binary operator error

I cannot seem to even create a reproducible example on this as it works fine when I go through the code one line at a time.
The error message I get is as follows:
"Error in testData[, colCheck][length(testData[, colCheck])] - testData[, :
non-numeric argument to binary operator "
Both colCheck and testData$linearcorrd15N are numeric and, as I said, the calculation works fine when I run that line on its own. The error comes only when I call the function as QTest(df, colCheck).
Here is an example of what some of the code looks like. It will not produce an error, but maybe you can see something that I don't.
library(dplyr)  # for arrange()
QTest <- function(testData, colCheck)
#%#
# testData <- The entire data frame for the std/ref that has too high of an SD,
#             so the data frame can be returned without the outlier
# colCheck <- The column name for values that were flagged for having too high of an SD
# This Q test info provided by: https://www.statisticshowto.com/dixons-q-test/
#%#
{
  # Get the mean of the highest and lowest values
  testData <- arrange(testData, desc(testData[, colCheck]))
  len <- length(testData[, colCheck]) - 1
  high <- sapply(1:len, function(i) testData[, colCheck][i])
  meanhigh <- mean(high)
  testData <- arrange(testData, (testData[, colCheck]))
  low <- sapply(1:len, function(i) testData[, colCheck][i])
  meanlow <- mean(low)
  # If the mean of the lowest numbers is lower than the mean of the highest numbers, do this
  if (meanlow < meanhigh) {
    QexpVal <- abs((testData[, colCheck][2] - testData[, colCheck][1]) /
                     (testData[, colCheck][length(testData[, colCheck])] - testData[, colCheck][1]))
    outlier <- testData[, colCheck][1]
    closest <- testData[, colCheck][2]
  # Else, if the mean of the lowest numbers is higher than the mean of the highest numbers, do this
  } else {
    QexpVal <- abs((testData[, colCheck][length(testData[, colCheck]) - 1] - (testData[, colCheck][length(testData[, colCheck])])) /
                     (testData[, colCheck][length(testData[, colCheck])]) - (testData[, colCheck][1]))
    outlier <- testData[, colCheck][length(testData[, colCheck])]
    closest <- testData[, colCheck][length(testData[, colCheck]) - 1]
  }
  return(QexpVal)
}
df <- data.frame(Row = c(1, 2, 3, 4, 5),
                 Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))  # was colnames(std1); std1 is not defined in this example
QTestCorrVals <- QTest(df, colCheck)
It seems you really overcomplicate this function by pushing the whole table into the function, looping over everything, and reading values back out of the whole table each time...
Just the part that gets meanhigh and meanlow requires no more than this:
v <- df[, colCheck]
v <- v[order(v)]
n <- length(v)
meanhigh <- mean(v[2:n])
meanlow <- mean(v[1:(n-1)])
Or if you use decreasing ordering:
v <- df[, colCheck]
v <- v[order(v, decreasing = TRUE)]
n <- length(v)
meanhigh <- mean(v[1:(n-1)])
meanlow <- mean(v[2:n])
Full function
Here is the full code using this approach. I agree this is not the specific question you asked, but the way you coded it is highly inefficient and error prone: every time, you take the whole data.frame, subset it, and recalculate its length. Also, you only have to order once; if the lowest value is at the top, the highest is by definition at the bottom. Then play around with 1 for the first value, 2 for the second, n for the last and n-1 for the second to last.
df <- data.frame(Row = c(1, 2, 3, 4, 5),
                 Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTest <- function(v) {
  v <- v[order(v)]
  n <- length(v)
  meanhigh <- mean(v[2:n])
  meanlow <- mean(v[1:(n-1)])
  if (meanlow < meanhigh) {
    QexpVal <- abs((v[2] - v[1]) / (v[n] - v[1]))
    outlier <- v[1]
    closest <- v[2]
  } else {
    QexpVal <- abs((v[n-1] - v[n]) / (v[n] - v[1]))
    outlier <- v[n]
    closest <- v[n-1]
  }
  return(QexpVal)
}
QTestCorrVals <- QTest(df[, colCheck])
Side note
Using a column index number works slightly differently depending on whether your data is a data.frame or a data.table:
class(df)          # "data.frame"
df[, colCheck]     # returns the column as a plain vector
dt <- data.table(df)
class(dt)          # "data.table" "data.frame"
dt[, ..colCheck]   # the .. prefix tells data.table to look colCheck up in the calling scope
dt[, colCheck]     # throws an error

How to interpolate missing values in a time series, limited by the number of sequential NAs (R)?

I have missing values in a time series of dates. For example:
set.seed(101)
df <- data.frame(DATE = as.Date(c('2012-01-01', '2012-01-02', '2012-01-03',
                                  '2012-01-05', '2012-01-06', '2012-01-15',
                                  '2012-01-18', '2012-01-19', '2012-01-20',
                                  '2012-01-22')),
                 VALUE = rnorm(10, mean = 5, sd = 2))
How can I write a function that will fill in all the missing dates between the first and last date (i.e. 2012-01-01 and 2012-01-22), then use interpolation (linear and smoothing spline) to fill the missing values, but with no more than 3 sequential missing values interpolated (i.e. no interpolation between 2012-01-06 and 2012-01-15)?
The function will be applied to a very large data frame. I have been able to write a function that uses linear interpolation to fill all missing values between two dates (see below), but I cannot figure out how to stop it from interpolating long stretches of missing values.
interpolate.V <- function(df){
  # sort data by time
  df <- df[order(df$DATE),]
  # linearly interpolate VALUE for all missing DATEs
  temp <- with(df, data.frame(approx(DATE, VALUE,
                                     xout = seq(DATE[1], DATE[nrow(df)], "day"))))
  colnames(temp) <- c("DATE", "VALUE_INTERPOLATED")
  temp$ST_ID <- df$ST_ID[1]
  out <- merge(df, temp, all = T)
  rm(temp)
  return(out)
}
Any help will be greatly appreciated!
Thanks
Function that adds rows for all missing dates:
date.range <- function(sub){
  sub$DATE <- as.Date(sub$DATE)
  DATE <- seq.Date(min(sub$DATE), max(sub$DATE), by = "day")
  all.dates <- data.frame(DATE)
  out <- merge(all.dates, sub, all = T)
  return(out)
}
Use na.approx or na.spline from the zoo package with the maxgap argument:
library(zoo)
interpolate.zoo <- function(df){
  df$VALUE_INT <- na.approx(df$VALUE, maxgap = 3, na.rm = F)
  return(df)
}
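Putting the pieces together, a sketch (the VALUE_LIN and VALUE_SPL column names are mine): first expand to daily rows with the date.range helper above, then interpolate, leaving any run of more than 3 consecutive NAs untouched. na.spline accepts the same maxgap argument for the smoothing-spline variant:

library(zoo)
full <- date.range(df)  # add a row for every missing date
full$VALUE_LIN <- na.approx(full$VALUE, maxgap = 3, na.rm = FALSE)  # linear
full$VALUE_SPL <- na.spline(full$VALUE, maxgap = 3, na.rm = FALSE)  # smoothing spline
# The 8-day gap between 2012-01-06 and 2012-01-15 stays NA in both columns.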

How to store loop output of each iteration to data frame

Ok so I have a loop that works out the annualized / cumulative return of a stock price series.
I wish to do the same thing over many files, so I made a loop to do so.
First some dummy data:
# Create dummy data
# Use lubridate to change timestamp to date format
# Use dplyr to arrange by ascending order
# Use fread from data.table to read .csv to data frame
require(lubridate)
require(data.table)
require(dplyr)
MSFT <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
MSFT$timestamp <- ymd(MSFT$timestamp)
MSFT <- arrange(MSFT,timestamp)
AAPL <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=AAPL&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
AAPL$timestamp <- ymd(AAPL$timestamp)
AAPL <- arrange(AAPL,timestamp)
NFLX <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=NFLX&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
NFLX$timestamp <- ymd(NFLX$timestamp)
NFLX <- arrange(NFLX,timestamp)
TSLA <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=TSLA&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
TSLA$timestamp <- ymd(TSLA$timestamp)
TSLA <- arrange(TSLA,timestamp)
# Place data frames in a list
df.list <- list(MSFT,AAPL,NFLX,TSLA)
# Specify file names
file.names <- c("MSFT","AAPL","NFLX","TSLA")
Now that prepares the data.
Next I want to calculate the cumulative and annualized returns for each series. I place this in a function, then call the function in a loop:
# Create function for performing commands.
genAnnualized <- function(x){
  next.file <- data.frame(df.list[[1]], stringsAsFactors = FALSE)
  next.name <- paste0(file.names[i])
  new.df <- data.frame(next.file)
  # Calculate annualized return
  # Make prices vector
  prices <- new.df[, "close", drop = FALSE]
  # Denote n the number of time periods:
  n <- nrow(prices)
  # Calculate close-to-close returns
  # lead in with rep(NA, 1) to keep the vector length comparable to the data frame
  close_ret <- c(rep(NA, 1), (prices[2:n, 1] - prices[1:(n-1), 1]) / prices[1:(n-1), 1])
  close_ret[1] <- 0
  # Compute continuously compounded returns (log returns)
  close_ccret <- log(prices[2:n, 1]) - log(prices[1:(n-1), 1])
  # Compute gross returns
  close_gret <- 1 + close_ret # use close-to-close returns
  # Compute future values
  close_fv <- cumprod(close_gret)
  # Obtain first and last values
  ret.last <- tail(close_fv, n = 1)
  ret.first <- head(close_fv, n = 1)
  cum.ret <- (ret.last - ret.first) / ret.first
  # Get first and last row to calculate the time between them
  ret.first.row <- head(new.df$timestamp, n = 1)
  ret.last.row <- tail(new.df$timestamp, n = 1)
  # Time diff
  #trading.years.between <- as.numeric(difftime(as.Date(ret.last.row), as.Date(ret.first.row), unit="weeks"))/52.25
  # Find time diff
  ret.time <- ret.last.row - ret.first.row
  ret.trading.years.between <- ret.time / 365 # 252 trading days or 365
  ret.trading.years.between <- as.numeric(ret.trading.years.between, units = "days") # extract the numeric value from 'Time difference of 2837.208 days'
  # Annualized return:
  # (1 + cumulative return) to the power of 1/years between first and last balance, minus 1
  ret.annual.return <- (1 + cum.ret)^(1 / ret.trading.years.between) - 1
  ########## Store annualized and cumulative return in data frame for each iteration #########
  # Store file name as a row name :: next.name variable
  # Store final annualized return :: ret.annual.return
  # Store final cumulative return :: cum.ret
  output.df <- cbind(cum.ret, ret.annual.return)
  rownames(output.df) <- next.name
  ##################################################################
  # Sanity check: use PerformanceAnalytics for the annualized return
  # and TTR for the returns
  # Calculate close-to-close returns
  require(TTR)
  require(PerformanceAnalytics)
  new.df$clret <- ROC(new.df$close, type = c("discrete"))
  new.df$clret[1] <- 0
  # Make a time series object of returns and dates
  require(xts)
  xts1 <- xts(new.df$clret, order.by = as.Date(new.df$timestamp, format = "%m/%d/%Y"))
  Return.annualized(xts1)
  Return.cumulative(xts1, geometric = TRUE)
}
And call the function to loop through each data frame in the data frame list:
for (i in 1:length(df.list)){
  tryCatch({
    genAnnualized(df.list[[i]])
  }, error = function(e) { print(paste("i =", i, "failed:")) })
}
This should make a reproducible example.
On each iteration, I wish to store the cumulative and annualized return of each series along with the name of the data set (so it is identifiable later).
I am attempting this with the below within my function:
output.df <- cbind(cum.ret,ret.annual.return)
rownames(output.df) <- next.name
I am specifying the names with:
file.names <- c("MSFT","AAPL","NFLX","TSLA")
and then calling it inside the function:
next.name <- paste0(file.names[i])
I was hoping to paste the file name so I can tag my final output in the data frame.
I think I might need to repeat the name when naming each row or column, so that it tags both the cumulative return and the annualized return.
I think I have the general idea, but I have been wrestling with this for a few weeks, so I am looking for some assistance.
Essentially, with the output data frame I can then organise the results into quartiles etc. for further analysis.
My simplest solution was to rbind a data frame for each file.names evaluation, and to name each row after the corresponding file.
I've deleted the original comments for clarity (and added some of mine).
genAnnualized <- function(df_list) {
  next.file <- data.frame(df_list, stringsAsFactors = FALSE) # Put the parameter of the function here
  next.name <- paste0(file.names[i])
  new.df <- data.frame(next.file)
  prices <- new.df[, "close", drop = FALSE]
  n <- nrow(prices)
  close_ret <- c(rep(NA, 1), (prices[2:n, 1] - prices[1:(n-1), 1]) / prices[1:(n-1), 1])
  close_ret[1] <- 0
  close_ccret <- log(prices[2:n, 1]) - log(prices[1:(n-1), 1])
  close_gret <- 1 + close_ret
  close_fv <- cumprod(close_gret)
  ret.last <- tail(close_fv, n = 1)
  ret.first <- head(close_fv, n = 1)
  cum.ret <- (ret.last - ret.first) / ret.first
  ret.first.row <- head(new.df$timestamp, n = 1)
  ret.last.row <- tail(new.df$timestamp, n = 1)
  ret.time <- ret.last.row - ret.first.row
  ret.trading.years.between <- ret.time / 365
  ret.trading.years.between <- as.numeric(ret.trading.years.between, units = "days")
  ret.annual.return <- (1 + cum.ret)^(1 / ret.trading.years.between) - 1
  output.df <- cbind(cum.ret, ret.annual.return)
  rownames(output.df) <- next.name
  ##################################################################
  new.df$clret <- TTR::ROC(new.df$close, type = c("discrete"))
  new.df$clret[1] <- 0
  xts1 <- xts::xts(new.df$clret, order.by = as.Date(new.df$timestamp, format = "%m/%d/%Y"))
  # Create the output of the function: a named data.frame
  out_df <- data.frame("Annualized Return" = PerformanceAnalytics::Return.annualized(xts1),
                       "Cumulative Return" = PerformanceAnalytics::Return.cumulative(xts1, geometric = TRUE))
  return(out_df)
}
# Initialize the output data frame to which we will rbind the results
cum_ret <- data.frame()
for (i in 1:length(df.list)){
  temp <- genAnnualized(df.list[[i]])
  rownames(temp) <- file.names[i]
  cum_ret <- rbind.data.frame(cum_ret, temp)
}
This gives a data frame with as many named rows as there are files in df.list and two columns for the annualized and cumulative returns.
> cum_ret
Annualized.Return Cumulative.Return
MSFT -0.02279597 -0.3361359
AAPL 0.02039616 0.4314812
NFLX 0.17454862 10.8991045
TSLA 0.44666765 13.8233571
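A side note, not from the original answer: growing cum_ret with rbind inside the loop copies the whole data frame on every iteration. A sketch of the same collection step with lapply, assuming the leftover next.name/output.df lines (which no longer affect the return value) are removed from genAnnualized:

# Collect all results at once, then bind a single time
results <- lapply(df.list, genAnnualized)
cum_ret <- do.call(rbind, results)
rownames(cum_ret) <- file.names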

R - How to create a vector with propagated dates

I am trying to create a vector (dateVec) which contains the dates in the column Date propagated by the number of days in column Days. I cannot understand why the code that I created is not working. Dates are in Date format.
for (i in mydata[, 1]) {
  dateVec = mydata[, 1] + 0:mydata[, 2]
}
The data has many more rows; here is a sample as an example:

Date (mydata[,1])   Days (mydata[,2])
10/05/2017          3
05/05/2017          2
The result that I would expect for dateVec would be:
(10/05/2017, 11/05/2017, 12/05/2017, 13/05/2017, 05/05/2017, 06/05/2017, 07/05/2017, ...)
There are a few reasons why your code isn't working:
1. For loop: your i needs a series of integers to iterate through. As written, you are looping over the date values "10/05/2017" and "05/05/2017" themselves. A more useful way is to use seq_along to generate a sequence of integers from 1 to the length of the object passed to seq_along.
2. dateVec is not indexed, so you are overwriting dateVec on each iteration of your loop.
3. Variable lengths of days: for the first date you are generating a sequence 3 days long, and for the second date, 2 days. You need a data structure that can handle variable-length elements, such as a list.
To modify your existing code:
mydata <- data.frame(Date = as.Date(c("10/05/2017", "05/05/2017"), format = "%d/%m/%Y"),
                     Days = c(3, 2))
dateVec <- list()
for (i in seq_along(mydata[, 1])) {
  dateVec[[i]] <- mydata[i, 1] + 0:mydata[i, 2]
}
res <- do.call("c", dateVec)
A more idiomatic R approach is to pass the starting dates and sequence lengths in parallel using mapply, which returns a list that is then concatenated into a vector of dates:
res <- do.call("c", mapply(function(x, y) seq(from = x, length.out = y + 1, by = "1 day"),
                           x = mydata[["Date"]], y = mydata[["Days"]]))
# length.out = y + 1 so the starting date itself is included, matching the loop above
Here's a clunky solution:
library("lubridate")
mydata <- data.frame(Date = dmy(c("10/05/2017", "05/05/2017")),
                     Days = c(3, 2))
dateVec <- dmy(character())
for (i in 1:length(mydata$Date)){
  dateVec <- c(dateVec, mydata$Date[i])
  for (j in 1:mydata$Days[i]){
    dateVec <- c(dateVec, mydata$Date[i] + j)
  }
}
Note that this uses the lubridate package and doesn't format the dates quite how you did. I also found it interesting that I had to initialize dateVec as a date object. Initially I tried dateVec = c() but R tried to coerce to numeric.
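For larger data, a fully vectorized sketch (my addition, not from the answers above): repeat each start date Days + 1 times, then add the within-run offsets 0, 1, 2, ... generated by sequence():

reps <- mydata$Days + 1
# rep() repeats each row index Days + 1 times; sequence() counts 1..reps within each run
dateVec <- mydata$Date[rep(seq_len(nrow(mydata)), reps)] + (sequence(reps) - 1)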

How to speed up a loop-like function in R

In trying to avoid using the for loop in R, I wrote a function that returns an average value from one data frame given row-specific values from another data frame. I then pass this function to sapply over the range of row numbers. My function works, but it returns ~2.5 results per second, which is not much better than a for loop. So, I feel I have not fully exploited the vectorized aspects of the apply family of functions. Can anyone help me rethink my approach? Here is a minimal working example. Thanks in advance.
# Creating the first data frame
dates <- seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1)
n <- length(dates)
df1 <- data.frame(date = dates,
                  hour = sample(1:24, n, replace = T),
                  cat = sample(c("a", "b"), n, replace = T),
                  lag = sample(1:24, n, replace = T))
# Creating the second data frame
df2 <- data.frame(date = sort(rep(dates, 24)),
                  hour = rep(1:24, length(dates)),
                  p = runif(length(rep(dates, 24)), min = -20, max = 100))
df2 <- df2[order(df2$date, df2$hour),]
df2$cat <- "a"
temp <- df2
temp$cat <- "b"
df2 <- rbind(df2, temp)
# function
period_mean <- function(x){
  tmp <- df2[df2$cat == df1[x,]$cat,]  # was df$cat, a typo: only df1 and df2 exist
  # This line extracts the row index from tmp at which the two
  # data frames match on date and hour
  he_i <- which(tmp$date == df1[x,]$date & tmp$hour == df1[x,]$hour)
  # My lagged period is given by the variable "lag". I want the average
  # over the period hour - (hour - lag). Since df2 is sorted so that hours
  # are consecutive, this method requires that I subset on only the
  # relevant value for cat (hence the creation of tmp in the first line
  # of the function).
  p <- mean(tmp[(he_i - df1[x,]$lag):he_i,]$p)
  print(x)
  print(p)
  return(p)
}
# Execute function
out <- sapply(1:length(row.names(df1)), period_mean)
EDIT: I have subsequently learned that part of the reason my original problem iterated so slowly is that the data classes in the two data frames were not the same. df1$date was a Date field, while df2$date was a character field. Of course, this wasn't apparent in the example I posted because the data types were the same by construction. Hope this helps.
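A quick check along those lines (my sketch, not part of the original post) before doing any row matching:

# Make sure the join columns have matching classes before comparing them
class(df1$date)  # should be "Date"
class(df2$date)  # if this is "character", every comparison coerces and slows down
# df2$date <- as.Date(df2$date)  # convert if needed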
Here's one suggestion:
getIdx <- function(i) {
  date <- df1$date[i]
  hour <- df1$hour[i]
  cat <- df1$cat[i]
  which(df2$date == date & df2$hour == hour & df2$cat == cat)
}
v_getIdx <- Vectorize(getIdx)
df1$index <- v_getIdx(1:nrow(df1))
b_start <- match("b", df2$cat)
out2 <- apply(df1[, c("cat", "lag", "index")], MAR = 1, function(x) {
  flr <- ifelse(x[1] == "a", 1, b_start)
  x <- as.numeric(x[2:3])
  mean(df2$p[max(flr, (x[2] - x[1])):x[2]])
})
We make a function (getIdx) to retrieve the row from df2 that matches the values of each row in df1, and then Vectorize the function.
We then run the vectorized function to get a vector of row indices. We set b_start to the row where the "b" category starts.
We then iterate through the rows of df1 with apply. In the mean(...) function, we set the "floor" to be either row 1 (if cat=="a") or b_start (if cat=="b"), which eliminates the need to subset (what you were doing with tmp).
Performance:
> system.time(out<-sapply(1:length(row.names(df1)), period_mean))
user system elapsed
11.304 0.393 11.917
> system.time({
+ df1$index <- v_getIdx(1:nrow(df1))
+ b_start <- match("b", df2$cat)
+ out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
+ flr <- ifelse(x[1]=="a", 1, b_start)
+ x <- as.numeric(x[2:3])
+ mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
+ })
+ })
user system elapsed
2.839 0.405 3.274
> all.equal(out, out2)
[1] TRUE
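A further speed-up, offered as an assumption on my part rather than part of the answer above: the per-row which() scans can be replaced by a single vectorized match() on pasted keys, which does one hashed lookup instead of scanning df2 once per row of df1:

# Build one composite key per row; match() then finds each df1 row's
# position in df2 in a single vectorized pass
key1 <- paste(df1$date, df1$hour, df1$cat)
key2 <- paste(df2$date, df2$hour, df2$cat)
df1$index <- match(key1, key2)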
