Extracting the weights that maximize my function - R code

I am calculating quantile regressions over a sequence of 99 quantile levels, tau = .01, ..., .99. I repeat this process for 1000 different responses y, where y_it = w_i^T * r_t for i = 1, ..., 1000.
For instance, for tau = .01 I will have 1000 quantile values, each corresponding to a different y.
So in total I will have a 1000x99 matrix, where each column corresponds to a specific tau value. My code is as follows:
#this is to calculate the y observations
all_yts1 <- matrix(nrow = nrow(return), ncol = 1000)
for(i in 1:1000){
  all_yts1[,i] <- rowSums(all_wts1[i,] * return)
}
###calculate quantile matrix
library(quantreg)   # for rq()
tau = seq(0.01, .99, by = .01)
quant.matrix.reg3 <- matrix(nrow = np1, ncol = length(tau))
for(i in 1:np1){
  quant.matrix.reg3[i,] <- rq(all_yts1[,i][-1] ~ x.reg[,3], tau)$rho
}
colnames(quant.matrix.reg3) <- tau   # one column per tau; setting this once outside the loop is enough
I am using the rho values in the code because my next step is to locate the row that contains the maximum value in each column, as follows:
opt.w3.alltaus = matrix(ncol = length(tau))
for(j in 1:length(tau)){
  opt.w3.alltaus[,j] = which(quant.matrix.reg3[,j] == max(quant.matrix.reg3[,j]), arr.ind = T)
}
So I am expecting to obtain 99 row numbers. For example, for tau = .01 the row that contains the maximum value is row 577 (this is what the code tells me).
As a final step, since y_it = w_i^T * r_t, I want to extract the w values lying in row 577. Is there any way to do that other than manually? Extracting 99 w vectors by hand would be time consuming, especially when repeating this process for several regressors.
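A hedged sketch of how this could be automated, assuming all_wts1 is the 1000-row weight matrix used above to build all_yts1 (so row i of all_wts1 holds w_i): index its rows with the optimal row numbers found for each tau.
# Sketch only: all_wts1 is assumed to be the 1000 x (number of assets) weight matrix
opt.rows <- as.vector(opt.w3.alltaus)               # 99 row indices, one per tau
opt.weights <- all_wts1[opt.rows, , drop = FALSE]   # row j holds the weights for tau[j]
rownames(opt.weights) <- tau
# the same indices can also be obtained without the loop, e.g. apply(quant.matrix.reg3, 2, which.max)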
Any advice on improving my code would be appreciated.
Thanks

Related

How can I make my for loop in R run faster? Can I vectorize this?

#Start: Initialize values
#For each block length (BlockLengths) I will run 10 estimates (ThetaL). For each estimate, I simulate 50000 observations (Obs). Each estimate is calculated on the basis of the block length.
Index=0 #Initializing Index.
ThetaL=10 #Number of estimations of Theta.
Obs=50000 #Sample size.
Grp=vector(length=7) #Initializing a vector of the number of blocks. It depends on the block lengths (see BlockLengths below).
Theta=matrix(data=0,nrow=ThetaL,ncol=7) #Initializing a matrix of the estimates of Theta. There are 10 for each block length.
BlockLengths<-c(10,25,50,100,125,200,250) #Setting the block lengths
for (r in BlockLengths){
  Index=Index+1
  Grp[Index]=Obs/r
  for (k in 1:ThetaL){
    #Start: Constructing the sample
    Y1<-matrix(data=0,nrow=Obs,ncol=2)
    Y1[1,]<-runif(2,0,1)
    Y1[1,1]<--log(-(Y1[1,1])^2 +1)
    Y1[1,2]<--log(-(Y1[1,2])^2 +1)
    for (i in 2:Obs){
      Y1[i,1]<-Y1[i-1,2]
      Y1[i,2]<-runif(1,0,1)
      Y1[i,2]<--log(-(Y1[i,2])^2 +1)
    }
    X1 <- vector(length=Obs)
    for (i in 1:Obs){
      X1[i]<-max(Y1[i,])
    }
    #End: Constructing the sample
    K=0 #K counts the number of blocks with at least one exceedance
    for (t in 1:Grp[Index]){ #Loop from 1 to the number of groups
      a=0
      for (j in (1+r*(t-1)):(t*r)){ #Loop over the sample within each group
        if (X1[j]>quantile(X1,0.99)){ #If a value exceeds the high threshold, add 1 to the counter a
          a=a+1
        }
      }
      if(a>=1){ #For this group, if a is at least 1, we have had an exceedance.
        K=K+1 #Counts the number of blocks with at least one exceedance.
      }
    }
    N<-sum(X1>=quantile(X1,0.99)) #Total number of exceedances
    Theta[k,Index]<- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs))) #Estimate
    #Theta[k,Index]<-K/N
  }
}
I have been running the above code without errors and it took about 20 minutes, but I want to run it for a larger sample and more repetitions, which makes the run time absurdly large. I tried to keep only the necessary parts inside the loops to optimize it a little. Is it possible to optimize it further, or should I use another programming language, since I've read that R is bad at for loops? Will vectorization help? If so, how can I vectorize the code?
First, you can define BlockLengths before Grp and Theta, as both of them depend on its length:
Index = 0
ThetaL = 2
Obs = 10000
BlockLengths = c(10,25)
Grp = vector(length = length(BlockLengths))
Theta = matrix(data = 0, nrow = ThetaL, ncol = length(BlockLengths))
Note: I decreased the size of the operation so that I could run it faster. With this specification, your original loop took 24.5 seconds.
Now, for the operation itself, there were three points I could improve:
Creation of Y1: the second column can be generated at once, just by creating Obs random numbers with runif(). Then, the first column can be created as a lag of the second column. With only this alteration, the loop ran in 21.5 seconds (12% improvement).
Creation of X1: you can vectorise the max function with apply. This alteration saved a further 1.5 seconds (6% improvement).
Calculation of K: for each t, you can get all the values of X1[(1+r*(t-1)):(t*r)] and apply the condition to all of them at once (instead of using the second loop); any(...) does the same job as your a>=1 check. Furthermore, you can remove the first loop with lapply, then sum the resulting logical vector, yielding the same result as your combination of if(a>=1) and K=K+1. The pipes (|>) are just for better visualization of the order of operations. This is by far the most important alteration, saving a further 18.4 seconds (75% improvement).
for (r in BlockLengths){
  Index = Index + 1
  Grp[Index] = Obs/r
  for (k in 1:ThetaL){
    Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
    Y1[,2] <- -log(-(runif(Obs))^2 + 1)
    Y1[,1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs,2])
    X1 <- apply(Y1, 1, max)
    K <- lapply(1:Grp[Index], function(t){
      any(X1[(1+r*(t-1)):(t*r)] > quantile(X1, 0.99))
    }) |> unlist() |> sum()
    N <- sum(X1 >= quantile(X1, 0.99))
    Theta[k,Index] <- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs)))
  }
}
Using set.seed() I got the same results as your original loop.
A possible further improvement is to substitute the r and k loops with the purrr::map functions, as sketched below.
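For illustration, a hedged sketch of that idea (the helper function name is mine, not from the question). It also computes quantile(X1, 0.99) once per sample instead of once per block, which avoids repeating a relatively expensive call inside lapply:
library(purrr)
# Sketch only: one estimate of Theta for a given block length r
estimate_theta <- function(r, Obs = 10000) {
  Grp <- Obs / r
  Y1 <- matrix(0, nrow = Obs, ncol = 2)
  Y1[, 2] <- -log(-(runif(Obs))^2 + 1)
  Y1[, 1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs, 2])
  X1 <- pmax(Y1[, 1], Y1[, 2])    # row-wise maximum without apply()
  q99 <- quantile(X1, 0.99)       # computed once and reused below
  K <- sum(map_lgl(1:Grp, function(t) any(X1[(1 + r * (t - 1)):(t * r)] > q99)))
  N <- sum(X1 >= q99)
  (1 / r) * (log(1 - K / Grp) / log(1 - N / Obs))
}
BlockLengths <- c(10, 25)
ThetaL <- 2
# one column per block length, one row per repetition, matching the original Theta matrix
Theta <- do.call(cbind, map(BlockLengths, function(r) map_dbl(1:ThetaL, ~ estimate_theta(r))))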

Matrices in R: number of items to replace is not a multiple of replacement length

I'm trying to initialize an empty matrix X with n rows and 4 columns, and then assign a vector of random stock values to each column. Each column represents a different stock.
I know I should eventually apply X[,i]=cumsum(X[,i]) to get actual stock values, but that can only be done after assigning the values in the first place.
#Inputs mean return, volatility, time period and time step
mu=0.25; sigma=2; T=1; n=2^(12); X0=5;
#Generating trajectories for stocks
#NOTE: Seed is fixed. Changing seed will produce
#different trajectories
dt=T/n
t=seq(0,T,by=dt)
set.seed(201)
X <- matrix(nrow = n, ncol = 4)
for(i in 1:4){
  X[,i] <- c(X0, mu*dt + sigma*sqrt(dt)*rnorm(n, mean=0, sd=1))
}
After running the code, I get the error message:
Error in X[, i] <- c(X0, mu * dt + sigma * sqrt(dt) * rnorm(n, mean = 0, :
number of items to replace is not a multiple of replacement length
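The replacement vector c(X0, ...) has length n + 1 while each column of X has length n. A minimal sketch of one possible fix, assuming the intent is to keep X0 as the starting value followed by n increments, is to give X one extra row:
# Sketch: one extra row so each column holds X0 plus n increments
X <- matrix(nrow = n + 1, ncol = 4)
for(i in 1:4){
  X[,i] <- c(X0, mu*dt + sigma*sqrt(dt)*rnorm(n, mean=0, sd=1))
}
# cumsum can then be applied column-wise as planned, e.g. X[,i] <- cumsum(X[,i])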

How dnorm() works with a vector of quantiles in an sapply loop

I am working through Statistical Rethinking by Richard McElreath and am confused by how some code he uses on p.84 works. The code uses Bayesian grid approximation to derive two model parameters, mu and sigma, to estimate the distribution of height in a sample.
Here is the code
First we make a list of candidate mu values
mu.list <- seq(from = 140, to = 160, length.out = 200)
Then a list of candidate sigma values
sigma.list <- seq(from = 4, to = 9, length.out = 200) # grid of candidate sigma values
Then we make a data frame with every possible combination of mu and sigma.
post <- expand.grid(mu = mu.list, sigma = sigma.list) # expand grid so every mu is matched with every sigma
Which is a dataset with 40000 rows.
nrow(post)
[1] 40000
Now say we have a sample of measured heights, containing 5 measurements.
heights <- c(151.76, 139.70, 136.52, 156.84, 145.41)
Now for the part I don't understand: a reasonably complex sapply loop that calculates a log-likelihood for each of the 40000 candidate combinations of mu and sigma, based on the sample of five height measurements.
postVec <- sapply(1:nrow(post), function(i) sum(dnorm(
  heights,             # vector of heights
  mean = post$mu[i],   # candidate mean height value from the corresponding position in the grid
  sd = post$sigma[i],  # candidate sigma value from the corresponding position in the grid
  log = TRUE)          # make values logs
))
What we get from this loop is a vector 40000 values long, one value for each row of the post data frame.
length(postVec)
[1] 40000
What I don't understand is this: if we take dnorm() out of the loop and use single values for the mean and sd, but pass the same 5-value sample vector of heights as the first argument, like so
dnorm( heights, mean = 140, sd = 4, log = TRUE )
We get five values
[1] -6.627033 -2.308045 -2.683683 -11.167283 -3.219861
So my question is: why does the sapply loop passed into the postVec vector above yield 40000 values, not 5 x 40000 = 200000 values?
Why does the dnorm() function return five values outside the sapply() loop but (seemingly) only one value within it?
You are overlooking the sum before dnorm: in each of the 40000 cases it sums those 5 values so as to compute the log-likelihood of the whole heights vector rather than of the individual observations.
For instance, without sum for just two combinations we indeed have
sapply(1:2, function(i) dnorm(
  heights,
  mean = post$mu[i],
  sd = post$sigma[i],
  log = TRUE)
)
# [,1] [,2]
# [1,] -6.627033 -6.553479
# [2,] -2.308045 -2.310245
# [3,] -2.683683 -2.705858
# [4,] -11.167283 -11.061820
# [5,] -3.219861 -3.186194
while with sum we have column sums of the above matrix:
sapply(1:2, function(i) sum(dnorm(
  heights,
  mean = post$mu[i],
  sd = post$sigma[i],
  log = TRUE)
))
# [1] -26.00591 -25.81760

Change certain values of a vector based on mean and standard deviation of its subsets

I am trying to inject anomalies into a dataset, essentially changing certain values based on a condition. The dataset is divided into 10 subsets. The condition is that an anomaly should be 2.8-3 times the standard deviation of its subset away from the mean of that subset. For that, I divide the dataset into 10 equal parts, calculate the mean and standard deviation of each subset, and change certain values by putting them 2.8-3 standard deviations of that subset away from the mean of that subset. The code looks like the following:
set.seed(1)
x <- rnorm(sample(1:35000, 32000, replace=F), 0, 1)   #create dataset
y <- cumsum(x)                                        #cumulative sum of dataset
j = 1
for(i in c(1:10)){
  seg = y[j:j+3000]                 #name each subset seg
  m = mean(seg)                     #mean of subset
  print(m)
  s = sd(seg)                       #standard deviation of subset
  print(s)
  o_data = sample(j:j+3000, 10)     #draw random numbers from j to j + 3000
  print(o_data)
  y[o_data] = m + runif(10, min=2.8, max=3) * s   #values = mean + 2.8-3 * sd
  print(y[o_data])
  j = j + 3000                      #increment j
  print(j)
}
The error I get is that the standard deviation is NA, so I am not able to set the values.
What other approach is there by which I can accomplish the task? Essentially, I have to inject anomalies which are 2.8-3 standard deviations away from the rolling mean.
You have a simple error in your code. When you wrote
seg = y[j:j+3000], I believe you meant seg = y[j:(j+3000)].
Similarly, o_data = sample(j:j+3000,10) should be o_data = sample(j:(j+3000),10).
The reason is operator precedence: the sequence operator : binds more tightly than +, so j:j+3000 evaluates to (j:j)+3000, a single number; sd() of a single value is NA, which is exactly the error you are seeing.
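A minimal sketch of the loop with just those parentheses fixed, otherwise unchanged from the question:
j = 1
for(i in c(1:10)){
  seg = y[j:(j+3000)]               # subset of 3001 values, not a single number
  m = mean(seg)
  s = sd(seg)                       # no longer NA
  o_data = sample(j:(j+3000), 10)   # 10 random positions inside the subset
  y[o_data] = m + runif(10, min=2.8, max=3) * s
  j = j + 3000
}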

Creating a matrix in R

Imagine that I have two observed variables, interest and M2, both with length 1000.
I am trying to apply a function that takes each observation of interest and M2 to generate a matrix of values called PCoCo. The matrix will have dimensions 1000x1000.
The purpose is to generate a 3D surface plot with x = interest, y = M2, z = PCoCo.
I have the following code:
#Creating a matrix from two variables, interest and M2, and a function CoCo.Price
interest = seq(0, 0.1, length = 1000)
M2 = seq(0, 10, length = 1000)
#PCoCo = price of the CoCo; should be a matrix
PCoCo = matrix(nrow = length(interest), ncol = length(M2))
f = function(interest, M2){
  for(i in 1:length(interest)){
    for(j in 1:length(M2)){
      PCoCo[j,i] = CoCo.Price(C.p, c.r, m, N, q, interest[i], S, S.Trigger,
                              sigma, M2[j])
    }
  }
}
z = outer(interest, M2, f)
Before I assigned PCoCo as a matrix prior to running f, I used to get the following error:
Error in PCoCo[j, i] <- matrix(CoCo.Price(C.p, c.r, m, N, q, interest[i], :
incorrect number of subscripts on matrix
Now I do not, but R is taking ages to compute the matrix.
I have tried reducing the dimensions to 100x100, but it is still taking very long.
The issue might be with the CoCo.Price function, as it is about 100 lines long and also has some for loops in it.
Any advice?
Thanks.
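One thing worth checking, offered as a hedged sketch rather than a definitive fix: outer() expects its function argument to accept two equal-length vectors and return a vector of the same length, whereas f above fills PCoCo by side effect and returns nothing useful. Assuming CoCo.Price prices a single (interest, M2) pair (it and its other arguments are taken from the question and assumed to be in scope), wrapping it with Vectorize() satisfies the contract that outer() requires; the pricing function itself will still dominate the run time.
# Sketch: outer() calls the function once with two expanded vectors of length
# length(interest) * length(M2); Vectorize() maps a scalar pricer over them.
price_one <- function(r, m2) {
  CoCo.Price(C.p, c.r, m, N, q, r, S, S.Trigger, sigma, m2)
}
PCoCo <- outer(interest, M2, Vectorize(price_one))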
