I am new to R, and trying to get a handle on things for a school project.
The idea is to model a simple, hypothetical electricity generation/storage system that uses only solar panels and battery storage to meet energy demand. The aim is, given a predetermined storage capacity, to select the least amount of solar paneling that ensures demand is satisfied on every day of the year. I have a full year of daily data: solar insolation figures that determine how productive the panels will be, day-time electricity demand, and night-time electricity demand. Surplus generation during the day is stored in batteries, up to the predetermined limit, and then discharged at night to meet demand.
I have written a simple R program, f(x), where x is the amount of solar paneling that is installed. Like the battery-storage parameter, it is invariant over the entire year.
After creating new variables for the total power output per day and the total excess power produced per day, and adding these as columns 4 and 5 of the original data frame, the program creates two new vectors, "batterystartvector" and "batterymidvector", which indicate the battery level at the start of each day and at the midpoint between day and night, respectively.
The program loops over each day (row) in the data frame, and:
(1) Credits the excess power produced that day (column 5) to the storage system, up to the predetermined limit (7500 megawatt-hours in my example), and stores the result in "batterymidvector."
(2) Subtracts the night demand (column 3) from the total just registered in "batterymidvector" to determine how much energy will be in storage at the start of the next day, and stores this figure in "batterystartvector."
(3) Repeats for all 365 days.
Ultimately, my aim is to use an optimization package, such as DEoptimR, to determine the lowest value of x that ensures demand is satisfied on all days - that is, that no value in either "batterymidvector" or "batterystartvector" is ever negative.
Since every entry in the two battery vectors is dependent on prior entries, I cannot figure out how to write a program that does not use a 'for' loop. But surely there must be a simpler and less clunky way.
Here is the code:
library(DEoptimR)
setwd("C:/Users/User/Desktop/Thesis Stuffs/R Programs")
data <- read.csv("optdata1.csv", header = TRUE)
# x is pv installed and y is pumped-storage capacity
# default is that the system starts with a completely full pumped reservoir
f <- function(x) {
  data$output <<- (data$insolation * x) / 1000
  data$daybalance <<- data$output - data$day
  # start-of-day levels need 366 slots (slot 366 holds the level after day 365's night)
  batterystartvector <<- vector(mode = "numeric", length = 366)
  batterystartvector[1] <<- 7500
  batterymidvector <<- vector(mode = "numeric", length = 365)
  for (i in 1:nrow(data)) {
    # charging up, capped at the 7500 MWh storage limit
    batterymidvector[i] <<- min(batterystartvector[i] + data$daybalance[i], 7500)
    # depleting overnight (column 3 is the night demand)
    batterystartvector[i + 1] <<- batterymidvector[i] - data[i, 3]
  }
}
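For reference, since each day's starting level depends only on the previous day's level, the same recurrence can be written without an explicit 'for' loop using Reduce(..., accumulate = TRUE). The sketch below is untested and makes a few assumptions: the night demand is accessed by position as column 3 (as described above), the 7500 MWh cap is kept, and the function returns the most negative battery level over the year so that a penalty term can be handed to JDEoptim from DEoptimR (the bounds in the commented call are placeholders).
simulate_battery <- function(x, capacity = 7500) {
  output <- (data$insolation * x) / 1000      # daily generation, as in f(x)
  daybalance <- output - data$day             # daytime surplus (column 5 above)
  night <- data[[3]]                          # night demand (column 3)
  # start-of-day level for day i+1, given the start-of-day level for day i
  step <- function(start, i) min(start + daybalance[i], capacity) - night[i]
  batterystart <- Reduce(step, seq_len(nrow(data)),
                         init = capacity, accumulate = TRUE)    # length 366
  batterymid <- pmin(head(batterystart, -1) + daybalance, capacity)
  min(c(batterystart, batterymid))            # most negative level in the year
}
# Objective for the optimizer: the paneling itself, plus a heavy penalty
# whenever the battery would ever go negative.
objective <- function(x) x + 1e6 * max(0, -simulate_battery(x))
# result <- JDEoptim(lower = 0, upper = 1e5, fn = objective)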
Related
I'm making a program that compares owning vs. renting apartments. As part of the program I want to create a data frame for an annuity loan, with the interest and principal payments and the remaining balance. The calculations for the loan are easy, but I struggle with storing the results.
I am new to R, and I'm trying to use it for a school project instead of Excel. I tried to just store the results in a data frame, but got traceback errors, and I am kinda slow. I looked up code online, but I don't want to rip it off 100%, and it was on a monthly basis anyway. I also tried a similar mortgage function in the FinancialMath package, but didn't get a data frame with the desired output. The mortgage part isn't really what matters, but it is essential for the calculations in the rest of my program.
Here is a snippet of the mortgage function from my project. It uses a yearly basis, where P is the principal amount, I is the interest rate and N is the number of years:
mortgage <- function(P = 100000, I = 1, N = 10, amort = TRUE) {
  I <- I / 100
  PMT <- (I * P) / (1 - (1 + I)^(-N))   # fixed yearly payment
  if (amort == TRUE) {
    Pt <- P                             # remaining balance
    currP <- NULL
    while (Pt >= 0) {
      H <- Pt * I                       # interest portion of this year's payment
      C <- PMT - H                      # principal portion
      Q <- Pt - C                       # balance after the payment
      Pt <- Q
      currP <- c(currP, Pt)             # collect the remaining balances
    }
  }
}
What I want is a data frame with the variables year, remaining balance, PMT, interest payment and principal payment. I want to differentiate my code somewhat, but get a similar result (minus the plot) to http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R
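A minimal sketch of one way the loop above could be made to build such a data frame; this is illustrative only, not taken from the FinancialMath package, and the column names are just placeholders:
# Yearly annuity schedule returned as a data.frame.
# P = principal, I = interest rate in percent, N = number of years.
amortization_table <- function(P = 100000, I = 1, N = 10) {
  I <- I / 100
  PMT <- (I * P) / (1 - (1 + I)^(-N))         # fixed yearly payment
  balance <- interest <- principal <- numeric(N)
  Pt <- P
  for (year in 1:N) {
    interest[year] <- Pt * I                  # interest part of this year's payment
    principal[year] <- PMT - interest[year]   # principal part
    Pt <- Pt - principal[year]                # remaining balance after the payment
    balance[year] <- Pt
  }
  data.frame(year = 1:N, balance = balance, PMT = PMT,
             interest = interest, principal = principal)
}
head(amortization_table(P = 2000000, I = 2.5, N = 25))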
I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date, and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building an ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want the last 3 days but only 2 records exist, this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                  day = c(1:5, 1:10),
                  device_repaired = sample(0:1, 15, replace = TRUE),
                  device_replaced = sample(0:1, 15, replace = TRUE))
# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  # Subset the dataset
  df = subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make the calculation
  if (nrow(df) < fpreviousdays) {
    calculation = NA
  } else {
    calculation = sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown very quickly, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if or how it's possible to pass these extra arguments to an lapply call.
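One way to pass the fixed arguments is through MoreArgs in mapply, which vectorizes over the per-row arguments; an untested sketch for a single feature column (the new column name and the core count are placeholders):
# Repairs over the last 2 days, computed for every row of `data`
data$device_repairs_last_2days <- mapply(
  getCalculation,
  fday = data$day,            # varies per row
  fdeviceid = data$device_id, # varies per row
  MoreArgs = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2)
)
# The same call parallelizes with parallel::mcmapply (forking, so Unix-alikes only):
# data$device_repairs_last_2days <- parallel::mcmapply(
#   getCalculation,
#   fday = data$day, fdeviceid = data$device_id,
#   MoreArgs = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2),
#   mc.cores = 4
# )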
Scenario (using quantstrat, blotter and PortfolioAnalytics)
I have 10k initial equity.
I have a strategy that I want to backtest over a 3000-symbol universe (stocks).
Let's say the strategy is a simple MA crossover.
Every time I get a buy crossover I buy 10k worth of stock and close the position on the sell crossover.
For backtesting purposes the strategy can trade without any portfolio restriction, so I may be holding 100+ positions at any point in time; therefore the initial equity shouldn't be considered.
I want to know the AVERAGE return of this strategy over all trades.
In reality, if I only had 10k I would only be able to be in one trade at a time, but I would like to know statistically what the average return would be.
I then want to compare this with the stock index benchmark.
Do I SUM or MEAN the return stream of each symbol?
Is it the return of the portfolio, and does that take into account the initial equity? I don't want the return to be expressed as a percentage of the initial equity, or to depend on how many symbols are trading.
I'll add an example strategy when I get time, but the solution to the problem is:
# get the per-instrument portfolio returns
instRets <- PortfReturns(account.st)
# For each column, set the values to NA where there is no return: when the values are
# averaged out, you don't want 0's included in the calculation. If there are no signals
# in the strategy, you would invest the money elsewhere rather than leaving it lying
# around, so you only calculate the returns when the strategy is ACTIVE.
for (i in 1:ncol(instRets)) {
  instRets[, i][instRets[, i] == 0] <- NA
}
# This gives the average return when the strategy is active: if there are 100 trades on,
# you want the average return during that period.
portfRets <- xts(rowMeans(instRets, na.rm = TRUE), order.by = index(instRets))
portfRets <- portfRets[!is.na(portfRets)]
Now you can compare the strategy with a benchmark, SPY for example. If the strategy has alpha, you can use a balancing rule to apply funds to the strategy when signals arise, or stay invested in the index when there are no signals.
As far as I know, the returns analysis built into blotter uses the initial equity to work out returns, so invest the same amount in each trade as you have for initial equity: 10k initial equity, 10k per trade.
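For the benchmark comparison itself, a rough sketch along these lines should work (untested; the dates are placeholders and it assumes quantmod and PerformanceAnalytics are installed):
library(quantmod)
library(PerformanceAnalytics)
# Daily SPY returns over the backtest window (dates are placeholders)
getSymbols("SPY", from = "2015-01-01", to = "2019-12-31")
benchRets <- Return.calculate(Ad(SPY), method = "discrete")
# Align the strategy's "active" average returns with the benchmark and compare
comparison <- na.omit(merge(portfRets, benchRets))
colnames(comparison) <- c("strategy", "SPY")
charts.PerformanceSummary(comparison)
table.AnnualizedReturns(comparison)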
I am using a dataset which has hundreds of events in each sequence. I am trying to identify subsequences and sequential association rules using TraMineR. For example, here is code that I would write:
# Frequent subsequences:
fsubseq <- seqefsub(weaver, minSupport = 0.05, maxK = 4)
fsubseq <- seqentrans(fsubseq)
fsb <- fsubseq[fsubseq$data$nevent > 1]
plot(fsb[1:20], col = "cyan")
# Sequential association rules:
rules <- TraMineR:::seqerules(fsubseq)
rules[order(rules$Lift, decreasing = TRUE)[1:25], 1:4]
This is usually workable as long as I set maxK to 1-3, but as I move above that value the computation takes hours, if not days. Are there any specific parameters I can adjust to speed these computations up?
Computation time is strongly linked to:
The number of events per sequence. The algorithm was designed for a small number of events per sequence (typically fewer than 6) and many sequences. You can try removing some events that are not of primary interest, or analysing groups of events. I guess that the relationship between the number of events and computation time is at least exponential. With more than 10 events per sequence, it can be really slow.
The minimum support. With a low minimum support, the number of possible subsequences gets really big. Try setting it to a higher value.
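For example (the thresholds here are purely illustrative), a more conservative first pass could look like:
# A higher support threshold and a shorter maximum subsequence length
# keep the search space, and hence the run time, much smaller.
fsubseq <- seqefsub(weaver, minSupport = 0.10, maxK = 3)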
Hope this helps.
I am working on the EBS Forex market limit order book (LOB); here is an example of the LOB in a 100-millisecond time slice:
datetime | side (0=Bid, 1=Ask) | distance (1: best price, 2: 2nd best, etc.) | price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they follow two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the bid or ask side, they do not record that side. Look at the last line of the data: the milliseconds were 000 and are now 500, which means there was no change in the LOB on either side at 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
If the last price (and only the last) is removed from a given side of the order book, they record a single line with nothing in the price field. Again, there will be no record for the whole LOB at that time.
Example: 2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk-maxBid (1.6067-1.6066), or the weighted average price (using the sizes at all distances as weights; there is a size column in my real data), for my whole data set. But as you can see, the data has been summarized, so this is not routine. I have written code to reconstruct the full (unsummarized) data. This is fine for a small data set, but for a large one I end up creating a huge file. I was wondering if you have any tips on how to handle the data, and how to fill the gaps efficiently.
You did not give a great reproducible example, so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(For this to work you might have to transform the date/time into a linear measure, e.g. time in seconds since market opening.)
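For example (untested, and assuming the first two comma-separated fields were read into columns named date and time):
# Parse "2008/01/28" + "09:11:28.000" into POSIXct, then into plain numbers
data$time_num <- as.numeric(as.POSIXct(paste(data$date, data$time),
                                       format = "%Y/%m/%d %H:%M:%OS", tz = "UTC"))
# time_num is now seconds (with fractions) since the epoch, which is all
# findInterval needs; build best.bid$time / best.ask$time from it.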
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end-of-day particularity, but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea, but instead of the two best.bid and best.ask data.frames you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
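A sketch of that starting point (untested; it assumes the size column mentioned in the question and the numeric time_num axis from the snippet above), after which the findInterval logic applies unchanged:
bids <- subset(data, side == 0)
asks <- subset(data, side == 1)
# volume-weighted average price per timestamp
wavg <- function(d) sapply(split(d, d$time_num),
                           function(x) weighted.mean(x$price, x$size))
wb <- wavg(bids)
wa <- wavg(asks)
weighted.avg.bid <- data.frame(time = as.numeric(names(wb)), price = wb, row.names = NULL)
weighted.avg.ask <- data.frame(time = as.numeric(names(wa)), price = wa, row.names = NULL)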