Generate some simple dummy data in R

I just want some random data to experiment with different prediction models.
My code:
x <- 0
for (i in 1:200)
{
  num <- runif(1, 0, 500)
  neg <- round(runif(5, -1, 0))
  percent <- ((0.01 * runif(1, 1, 10)) * num)
  x[i] = num + (neg * percent)
}
The idea is that this should generate 200 points.
num is a random number between 0 and 500
neg is either -1 or 1, just to add some flexibility to the random offset (negative or positive offset of a randomly generated point)
percent is just a random percentage between 1% and 10% of the originally generated random number to either be added or subtracted
Very similar code that I've made in my main language, C#, works very well and generates proper plots. I'm more-or-less trying to port that code.
Whenever I run the above, I get the following errors (a lot of them):
number of items to replace is not a multiple of replacement length
It's triggered on the last line of code in the for loop.
I'd love to be able to fix this. Any help is appreciated. Thank you!

Chrisss has already pointed out your problem in his comment. However, you're doing a lot of bad things from an R programming perspective. The following approach is better:
N <- 200
d <- data.frame(x = rep(NA, N))
num <- runif(N, 0, 500)
neg <- sample(c(1, -1), N, replace = TRUE) #jrdnmdhl pointed this out in his post
percent <- ((0.01 * runif(N, 1, 10)) * num)
d$x <- num + (neg * percent)
Why is this better? Two reasons. First, we avoid a for loop: R is a high-level language, so explicit loops are comparatively slow. Second, the original code does not preallocate its memory; x grows on every iteration, so R has to go find more memory each time through the loop, which slows things down as well.
A great resource for both of these points is Hadley Wickham's Advanced R.
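To illustrate the point, here is a minimal sketch of my own (not part of the original answer) comparing growing a vector inside a loop, preallocating it, and a fully vectorised version:
n <- 1e5
system.time({                       # grow the result one element at a time
  x <- numeric(0)
  for (i in 1:n) x[i] <- runif(1, 0, 500)
})
system.time({                       # preallocate, then fill
  x <- numeric(n)
  for (i in 1:n) x[i] <- runif(1, 0, 500)
})
system.time(x <- runif(n, 0, 500))  # fully vectorised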

The commenter mentioned the main problem, but your code would also be much faster if vectorized. In addition, your description of neg is not consistent with what it does: your code doesn't generate either -1 or 1, it generates either -1 or 0. The code below will generate either -1 or 1 for the neg variable.
num <- runif(200, 0, 500)
neg <- sample(c(1, -1), 200, replace = TRUE)
percent <- (0.01 * runif(200, 1, 10)) * num
x <- num + (neg * percent)
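As a quick sanity check (my own addition, not part of either answer), you can plot the simulated values to confirm they look like the C# output you were porting:
plot(x, main = "200 simulated points")   # scatter of the simulated values
hist(x)                                  # their distribution, roughly spread between 0 and about 550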

Related

How can I make my for loop in R run faster? Can I vectorize this?

#Start: Initialize values
#For each block length (BlockLengths) I will run 10 estimates (ThetaL). For each estimate, I simulate 50000 observations (Obs). Each estimate is calculated on the basis of the block length.
Index <- 0                   #Initializing Index.
ThetaL <- 10                 #Number of estimations of Theta.
Obs <- 50000                 #Sample size.
Grp <- vector(length = 7)    #Initializing a vector of the number of blocks. It depends on the block lengths (see L:15)
Theta <- matrix(data = 0, nrow = ThetaL, ncol = 7)  #Initializing a matrix of the estimates of Theta. There are 10 for each block length.
BlockLengths <- c(10, 25, 50, 100, 125, 200, 250)   #Setting the block lengths
for (r in BlockLengths){
  Index <- Index + 1
  Grp[Index] <- Obs/r
  for (k in 1:ThetaL){
    #Start: Constructing the sample
    Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
    Y1[1,] <- runif(2, 0, 1)
    Y1[1,1] <- -log(-(Y1[1,1])^2 + 1)
    Y1[1,2] <- -log(-(Y1[1,2])^2 + 1)
    for (i in 2:Obs){
      Y1[i,1] <- Y1[i-1,2]
      Y1[i,2] <- runif(1, 0, 1)
      Y1[i,2] <- -log(-(Y1[i,2])^2 + 1)
    }
    X1 <- vector(length = Obs)
    for (i in 1:Obs){
      X1[i] <- max(Y1[i,])
    }
    #End: Constructing the sample
    K <- 0                            #K counts the number of blocks with at least one exceedance
    for (t in 1:Grp[Index]){          #Loop from 1 to the number of groups
      a <- 0
      for (j in (1+r*(t-1)):(t*r)){   #Loop over the sample within each group
        if (X1[j] > quantile(X1, 0.99)){  #If a value exceeds the high threshold, add 1 to a
          a <- a + 1
        }
      }
      if (a >= 1){                    #If a is at least 1, the group has had an exceedance.
        K <- K + 1                    #Counts the number of blocks with at least one exceedance.
      }
    }
    N <- sum(X1 >= quantile(X1, 0.99))  #Summing the number of exceedances
    Theta[k,Index] <- (1/r) * ((log(1 - K/Grp[Index])) / (log(1 - N/Obs)))  #Estimate
    #Theta[k,Index] <- K/N
  }
}
I have been running the above code without errors, and it took about 20 minutes, but I want to run it for a larger sample and more repetitions, which makes the run time absurdly large. I tried to keep only the necessary parts inside the loops to optimize it a little. Is it possible to optimize it further, or should I use another programming language, since I've read that R is bad at for loops? Will vectorization help? If so, how can I vectorize the code?
First, you can define BlockLengths before Grp and Theta, since both of them depend on its length:
Index = 0
ThetaL = 2
Obs = 10000
BlockLengths = c(10,25)
Grp = vector(length = length(BlockLengths))
Theta = matrix(data = 0, nrow = ThetaL, ncol = length(BlockLengths))
Note: I decreased the size of the problem so that I could run it faster. With this specification, your original loop took 24.5 seconds.
Now, for the main operation, there were three points where I could improve things:
Creation of Y1: the second column can be generated at once, just by creating Obs random numbers with runif(). Then, the first column can be created as a lag of the second column. With only this alteration, the loop ran in 21.5 seconds (12% improvement).
Creation of X1: you can replace the row-wise max loop with apply. This alteration saved a further 1.5 seconds (6% improvement).
Calculation of K: for each t, you can take all the values of X1[(1+r*(t-1)):(t*r)] and apply the condition to all of them at once (instead of using the inner loop); any(...) does the same job as your a >= 1 check. Furthermore, you can remove the outer loop with lapply and then sum the resulting logical vector, which yields the same result as your combination of if (a >= 1) and K = K + 1. The pipes (|>) are just for readability of the order of operations. This is by far the most important alteration, saving a further 18.4 seconds (75% improvement).
for (r in BlockLengths){
  Index <- Index + 1
  Grp[Index] <- Obs/r
  for (k in 1:ThetaL){
    Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
    Y1[,2] <- -log(-(runif(Obs))^2 + 1)
    Y1[,1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs,2])
    X1 <- apply(Y1, 1, max)
    K <- lapply(1:Grp[Index], function(t){
      any(X1[(1+r*(t-1)):(t*r)] > quantile(X1, 0.99))
    }) |> unlist() |> sum()
    N <- sum(X1 >= quantile(X1, 0.99))
    Theta[k,Index] <- (1/r) * ((log(1 - K/Grp[Index])) / (log(1 - N/Obs)))
  }
}
Using set.seed() I got the same results as your original loop.
A possible way to improve things further is to substitute the r and k loops with purrr's map functions; a rough sketch follows.
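The sketch below is my own (not benchmarked). theta_one() is a hypothetical helper that computes one estimate for a given block length r, reusing the vectorised steps above; note that it also computes the 99% quantile only once per sample, which is a small extra change from the loop above.
library(purrr)

theta_one <- function(r, Obs) {
  Y2  <- -log(-(runif(Obs))^2 + 1)                        # second column, generated at once
  Y1  <- cbind(c(-log(-(runif(1))^2 + 1), Y2[-Obs]), Y2)  # first column is a lag of the second
  X1  <- apply(Y1, 1, max)
  q   <- quantile(X1, 0.99)                               # threshold computed once per sample
  Grp <- Obs / r
  K   <- sum(map_lgl(1:Grp, function(t) any(X1[(1 + r*(t-1)):(t*r)] > q)))
  N   <- sum(X1 >= q)
  (1/r) * (log(1 - K/Grp) / log(1 - N/Obs))
}

# One column of ThetaL estimates per block length
Theta <- map(BlockLengths, function(r) map_dbl(1:ThetaL, function(k) theta_one(r, Obs)))
Theta <- do.call(cbind, Theta)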

Simulating a process n times in R

I've written an R script (sourced from here) that simulates the path of a geometric Brownian motion for a stock price, and I need the simulation to run 1000 times so that I generate 1000 paths of the process U_t = S_t * e^(-mu*t), by discretizing the law of motion derived from U_t, which is the bottom line of the solution to the question posted here.
The process has n = 252 steps with a discretization step of 1/252, volatility sigma = 0.4, and instantaneous drift mu, which I've treated as zero, although I'm not sure about this. I can generate a single path, but I'm struggling to simulate 1000 of them; I'm unsure which variables I need to change, or whether there's an issue in my for loop that prevents me from generating all 1000 paths. Could it also be that the script is simulating each individual point for 252 realizations instead of simulating the full process? If so, would this stop me from generating all 1000 paths? Is it also possible that the array I'm generating, defined as U, hasn't been constructed correctly? U[0] must equal 1, and so must the first realization U(1) = 1. The code is below; I'm pretty stuck trying to figure this out, so any help is appreciated.
#Simulating Geometric Brownian motion (GMB)
tau <- 1 #time to expiry
N <- 253 #number of sub intervals
dt <- tau/N #length of each time sub interval
time <- seq(from=0, to=N, by=dt) #time moments in which we simulate the process
length(time) #it should be N+1
mu <- 0 #GBM parameter 1
sigma <- 0.4 #GBM parameter 2
s0 <- 1 #GBM parameter 3
#simulate Geometric Brownian motion path
dwt <- rnorm(N, mean = 0, sd = 1) #standard normal sample of N elements
dW <- dwt*sqrt(dt) #Brownian motion increments
W <- c(0, cumsum(dW)) #Brownian motion at each time instant N+1 elements
#Define U Array and set initial values of U
U <- array(0, c(N,1)) #array of U
U[0] = 1
U[1] <- s0 #first element of U is s0. with the for loop we find the other N elements
for (i in 2:length(U)){
  U[i] <- (U[1]*exp(mu - 0.5*sigma^2*i*dt + sigma*W[i-1]))*exp(-mu*i)
}
#Plot
plot(ts(U), main = expression(paste("Simulation of Ut")))
This question is quite difficult to answer since there are a lot of unclear things, at least to me.
To begin with, length(time) is equal to 64010, not N + 1, which would be 254.
If I understand correctly, the Brownian motion function returns the position in one dimension at a given time. Hence, to calculate this position for each time, the following can be enough:
s0*exp((mu - 0.5*sigma^2)*time + sigma*rnorm(length(time),0,time))
However, this calculates 64010 points, not 253. If you replicate it 1000 times, it gives 64010000 points, which is quite a lot.
> B <- 1000
> res <- replicate(B, {
+ s0*exp((mu - 0.5*sigma^2)*time + sigma*rnorm(length(time),0,time))
+ })
> length(res)
[1] 64010000
> dim(res)
[1] 64010 1000
I know I'm missing the second part, the one explained here, but I actually don't fully understand what you need there. If you can write out the formula, maybe I can help you.
In general, avoid using for loops in R to iterate over vectors. R is a vectorized language, so there is usually no need for them. If you want to run the same code B times, the replicate(B, { your code }) function is your friend.
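As an illustration of that pattern, here is a sketch of my own (using the question's parameters mu = 0, sigma = 0.4, s0 = 1, and 252 steps of length 1/252) in which replicate() returns a matrix with one simulated path of U_t = S_t * exp(-mu*t) per column:
tau   <- 1
N     <- 252                                   # number of steps
dt    <- tau/N
mu    <- 0
sigma <- 0.4
s0    <- 1
B     <- 1000                                  # number of paths

t <- seq(0, tau, by = dt)                      # time grid, N + 1 points
paths <- replicate(B, {
  dW <- rnorm(N, mean = 0, sd = sqrt(dt))      # Brownian increments
  W  <- c(0, cumsum(dW))                       # Brownian motion, N + 1 points
  S  <- s0 * exp((mu - 0.5*sigma^2)*t + sigma*W)   # GBM path
  S * exp(-mu*t)                               # U_t = S_t * exp(-mu*t)
})
dim(paths)                                     # (N + 1) x B, one path per column
matplot(paths[, 1:20], type = "l", lty = 1)    # plot a few of the simulated paths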

R - In sqrt(1 - h * h) : NaNs produced from within rcorr - full sample data available

I have some code that creates a matrix of survey question responses: rows are the answers from a unique survey instrument, and columns are the individual questions. A final column has been appended with the row means. This is then passed to rcorr for evaluation. I have 15 sets of data, but only this particular set throws NaNs, and I can't see what the difference is.
m.rcorr <- rcorr(matrix, type="pearson")
A CSV of the matrix being passed is linked here. There are legitimate values of NA in some columns, as not every respondent answers each question. Other responses are 0, 25, 50, 75, or 100.
I get two warnings of: In sqrt(1 - h * h) : NaNs produced on this data set.
I have 14 other sets that run without NaNs being produced that have varying occurrences of NA, and I even took a look at whether 0 was the problem, but other sets again handle those fine.
Next I stepped into rcorr, assigning my matrix to x:
type <- "pearson"
{
  type <- match.arg(type)
  if (!missing(y))
    x <- cbind(x, y)
  x[is.na(x)] <- 1e+50
  storage.mode(x) <- "double"
  p <- as.integer(ncol(x))
  if (p < 1)
    stop("must have >1 column")
  n <- as.integer(nrow(x))
  if (n < 5)
    stop("must have >4 observations")
  h <- .Fortran(F_rcorr, x, n, p, itype = as.integer(1 + (type == "spearman")),
                hmatrix = double(p * p), npair = integer(p * p), double(n),
                double(n), double(n), double(n), double(n), integer(n))
The assignment of h is where I get stuck
Error: object 'F_rcorr' not found
The package Hmisc is installed and loaded, as again, this code works 14 out of 15 times.
According to the help, F_rcorr is an internal Hmisc function, undocumented and not meant to be called by the user, so I'm not quite sure where to go next.
I'm looking to answer two questions:
Why is this particular set throwing the NaNs
What impact does that have on the final values of the list $r from the rcorr results.
Addendum: Using the Hmisc::: prefix as suggested in the comment, I was able to get further and found two pairs in my data where, when the value of h was 1, 1 - h * h evaluated to a very small negative number instead of 0. It happened only for these two pairs, and didn't happen on the diagonal, or in other places where that pair was valued at 1, so I'm not sure why those two generated weirdness, since 1 - 1 * 1 should equal 0 all day long.
However, to get around it, I copied the rcorr function into a new function, adding these two lines before the P assignment and then taking the sqrt of the new D, which substitutes 0 for the negative numbers.
D <- 1 - h * h
D[D<0] <- 0
P <- matrix(2 * (1 - pt(q = abs(h) * sqrt(npair - 2)/sqrt(D), df = npair - 2)), ncol = p)
I'd still like to know what may be going on that produces a tiny negative number instead of 0 in that calculation, but I believe I've found a non-harmful way of getting around it.
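For what it's worth, here is a tiny illustration of my own (not from Hmisc) of how floating-point rounding can push a correlation just past 1 in magnitude and make 1 - h * h dip below zero:
h <- 1 + 2 * .Machine$double.eps   # a value that "should" be exactly 1, off by rounding
1 - h * h                          # a tiny negative number instead of 0
sqrt(1 - h * h)                    # NaN, with the "NaNs produced" warning
pmax(0, 1 - h * h)                 # clamping at 0, as in the workaround above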
So I figured what the heck and emailed Dr. Harrell, and he replied that in the next release of Hmisc he's going to replace sqrt(1 - h * h) with max(0, 1 - h^2), which resolves it (more cleanly than I did) by substituting 0 for the tiny negative numbers.
I'll admit I fan-girled a bit with him answering my email.

function to create a vectorized piecewise function in R

I'm pretty new to R, so apologies in advance if this question is poorly constructed. Basically, I have a piecewise function whose value I need to calculate for a large number of rows. My current function looks something like this:
f <- function(x){
(x <= 1000) * x^2 +
(x > 1000 & x <= 2000) * x^3 +
(x > 2000 & x <= 3000) * x^4 +
(x > 4000) * x^5
}
However, I need to be able to create or generalize this function for a variety of different sets of breakpoints (e.g. maybe 1500, 2500, 3500, etc.) and for different numbers of breakpoints. Also, given the large number of rows it will be calculated on, the function has to be vectorized. Any advice?
Edit:
To clarify, I made the function above from a table of breakpoints (1000, 2000, 3000, 4000) and associated powers to raise x to (2, 3, 4, 5). However, I need to be able to take multiple such tables, each with varying breakpoints and numbers of breakpoints (potentially 100 or so), and apply the resulting piecewise function to a large number of rows.
A vectorised version of your function with additional breaks and power arguments can be written this way:
f <- function(x, breaks, power){
  x^power[as.numeric(cut(x, breaks))]
}
as.numeric(cut(...)) gets the position of all x values in the breaks, then the square bracket looks up the power in the power vector and raises the corresponding x to the correct power. Tests:
Some breaks points and powers:
> bp <- c(10,20,30,40)
> po = c(2,3,4)
Note the breakpoints are left-excluded:
> f(9,bp,po)
[1] NA
> f(10,bp,po)
[1] NA
So the first valid x has to be above 10:
> f(11,bp,po)
[1] 121
And gets us 11^2 as expected. So 20 gets squared and 21 gets cubed:
> f(20,bp,po)
[1] 400
> f(21,bp,po)
[1] 9261
Good so far. Vectorised?
> f(19:22, bp, po)
[1] 361 400 9261 10648
Yes - the change from square to cube happens between 20 and 21.
See the help for the right option for the cut function if you want the intervals to be closed on the left or right.
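For example (an illustration of my own, reusing bp and po from above), passing right = FALSE makes the intervals closed on the left instead; f2 is a hypothetical variant of f, not part of the answer above:
cut(c(10, 20), breaks = bp)                 # default right-closed intervals: 10 -> NA, 20 -> (10,20]
cut(c(10, 20), breaks = bp, right = FALSE)  # left-closed intervals: 10 -> [10,20), 20 -> [20,30)

f2 <- function(x, breaks, power){
  x^power[as.numeric(cut(x, breaks, right = FALSE))]
}
f2(10, bp, po)   # 100: 10 is now squared instead of returning NA
f2(20, bp, po)   # 8000: 20 now falls in the second interval, so it is cubed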
From what I understand of your example code, you basically want to minimize the coding and keep it dynamic, so that you can vary the breaks and powers.
Below is sample code that tries to do that.
f <- function(x, breakPoints, powerX) {
  cutX <- cut(x, breaks = breakPoints)
  cutX1 <- factor(cutX, labels = powerX)
  retX <- x ^ as.numeric(as.character(cutX1))
  retX
}
x1 <- sample(1:10000, 1000)
x1 <- x1[order(x1)]
breakPoints1 <- c(min(x1)-1, 1000, 2000, 3000, max(x1))
powerX1 <- c(2, 3, 4, 5)
newX1 <- f(x1, breakPoints1, powerX1)
head(newX1) # manual check whether the values make sense
head(x1)
This code will do that, but my suggestion is to test it as much as possible so that you can use it reliably. I hope this code is useful to you.

Generate random numbers with rbinom but exclude 0s from the range

I need to generate random numbers with rbinom, but I need to exclude 0 from the range.
How can I do it?
I would like something similar to:
k <- seq(1, 6, by = 1)
binom_pdf = dbinom(k, 322, 0.1, log = FALSE)
but I need to get all the relative dataset, because if I do the following:
binom_ran = rbinom(100, 322, 0.1)
I get values from 0 to 100.
Is there any way I can get around this?
Thanks
Let's suppose that we have the fixed parameters:
n: number of generated values
s: the size of the experiment
p: the probability of a success
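For instance (example values of my own choosing, mirroring the question's rbinom(100, 322, 0.1)):
n <- 100   # number of generated values
s <- 322   # size of the experiment
p <- 0.1   # probability of a success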
# Generate initial values
U <- rbinom(n, s, p)
# Number and location of zero values
k <- sum(U == 0)
which.k <- which(U == 0)
# While there is still a zero, generate new numbers
while (k != 0){
  U[which.k] <- rbinom(k, s, p)
  k <- sum(U == 0)
  which.k <- which(U == 0)
  # Print how many zeroes are still there
  print(k)
}
# Print U (without zeroes)
U
In addition to the hit-and-miss approach, if you want to sample from the conditional distribution of a binomial given that the number of successes is at least one, you can compute the conditional distribution and then sample from it directly.
It is easy to work out that if X is binomial with parameters n and p, then for x >= 1
P(X = x | X > 0) = P(X = x)/(1 - (1 - p)^n)
Hence the following function will work:
rcond.binom <- function(k, n, p){
  # sample() normalizes prob internally, so the exact constant only matters for clarity
  probs <- dbinom(1:n, n, p)/(1 - (1 - p)^n)
  sample(1:n, k, replace = TRUE, prob = probs)
}
If you are going to call the above function numerous times with the same n and p then you can just precompute the vector probs and simply use the last line of the function whenever you need it.
I haven't benchmarked it, but I suspect that the hit-and-miss approach is preferable when k is small, p is not too close to 0, and n is large, while for larger k, p closer to 0, and smaller n, the above might be preferable.
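As a quick usage check (my own example, not from the answer), the sampler never returns 0, and the probs vector can be precomputed when n and p are fixed:
set.seed(42)
x <- rcond.binom(k = 10000, n = 322, p = 0.1)
min(x)                                            # always >= 1 by construction

# Precomputing probs once for repeated use, as suggested above
probs <- dbinom(1:322, 322, 0.1) / (1 - (1 - 0.1)^322)
y <- sample(1:322, 10000, replace = TRUE, prob = probs)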
