How to Generate Normal Random Samples within Mean±3Sigma - r

I want to draw normal random numbers in an array of order ((100*8)*5000) with a specific Mean (M) and Standard Deviation (S) but I want them to be only within the range M±3S, so that I don't have any outliers in my array exceeding those limits.
Any Suggestion? I want to write a program in R based on this array for some simulation studies. I am using following R Code to generate my Data Set:
for(i in 1:5000){
for(j in 1:8){
Dat[,j,i]=rnorm(100,mean=muu[j],sd=sigma[j])
}
}
Now we want to get rid of the values that fall outside muu±3sigma in the above data. Of course, the discarded values have to be replaced with fresh ones so that the dimensions of the Dat array stay intact.

First Solution
Here is a start, but I bet there is a more elegant solution.
First generate a sample, then subset it to the desired range. Of course, you have to adjust the values to your needs.
set.seed(123)
rs <- rnorm(10000, mean = 10, sd = 3)
rs1 <- rs[ rs >= 1 & rs <= 19 ]  # keep values within mean ± 3*sd = 10 ± 9
Second (better) solution
I think my first solution doesn't work so well, because subsetting changes the number of values. I have just written some code that might suit your purposes better. Here are the steps:
create an array of NAs with the required dimensions
fill it with random numbers
create a logical array that is TRUE wherever a value falls outside the bounds
use that logical array to replace the out-of-range values with the mean used to generate the samples
data <- array(NA, dim = c(100, 8, 5000))
for(i in 1:5000){
data[ , , i] <- rnorm(800, 3, 1)
}
bound <- 3 + c(-1, 1)*3*1
pr <- data <= bound[1] | data >= bound[2]
data[pr] <- 3
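Replacing out-of-range values with the mean piles them up at a single point rather than drawing fresh values, which is what the question asks for. A minimal sketch of one alternative, assuming inverse-CDF sampling from a truncated normal is acceptable: draw uniforms between pnorm(-3) and pnorm(3) and map them through qnorm, so every value lies within M±3S by construction. The helper name rtnorm_k and the example muu/sigma values are hypothetical, and the third dimension is reduced to 50 to keep the example quick.

```r
# Hypothetical helper: normal draws restricted to mean +/- k*sd (inverse-CDF method)
rtnorm_k <- function(n, mean, sd, k = 3) {
  u <- runif(n, pnorm(-k), pnorm(k))  # uniform on the allowed probability range
  mean + sd * qnorm(u)                # map back to the normal scale
}

set.seed(123)
muu   <- 1:8               # example means (assumed, not from the question)
sigma <- rep(c(1, 2), 4)   # example standard deviations (assumed)
n_rep <- 50                # the question uses 5000

Dat <- array(NA_real_, dim = c(100, 8, n_rep))
for (j in 1:8) {
  # fill all rows and replications of variable j in one call
  Dat[, j, ] <- rtnorm_k(100 * n_rep, muu[j], sigma[j])
}
```

Note that the resulting values follow a truncated normal, so their standard deviation is slightly smaller than sigma.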

Related

How can I make my for loop in R run faster? Can I vectorize this?

#Start: Initialize values
#For each block length (BlockLengths) I will run 10 estimates (ThetaL). For each estimate, I simulate 50000 observations (Obs). Each estimate is calculated on the basis of the block length.
Index=0 #Initializing Index.
ThetaL=10 #Number of estimations of Theta.
Obs=50000 #Sample size.
Grp=vector(length=7) #Initializing a vector of number of blocks. It is dependent on block lengths (see L:15)
Theta=matrix(data=0,nrow=ThetaL,ncol=7) #Initializing a matrix of the estimates of Thetas. There are 10 for each block length.
BlockLengths<-c(10,25,50,100,125,200,250) #Setting the block lengths
for (r in BlockLengths){
Index=Index+1
Grp[Index]=Obs/r
for (k in 1:ThetaL){
#Start: Constructing the sample
Y1<-matrix(data=0,nrow=Obs,ncol=2)
Y1[1,]<-runif(2,0,1)
Y1[1,1]<--log(-(Y1[1,1])^2 +1)
Y1[1,2]<--log(-(Y1[1,2])^2 +1)
for (i in 2:Obs)
{
Y1[i,1]<-Y1[i-1,2]
Y1[i,2]<-runif(1,0,1)
Y1[i,2]<--log(-(Y1[i,2])^2 +1)
}
X1 <- vector(length=Obs)
for (i in 1:Obs){
X1[i]<-max(Y1[i,])
}
#End: Constructing the sample
K=0 #K counts the number of blocks with at least one exceedance
for (t in 1:Grp[Index]){ #For loop from 1 to number of groups
a=0
for (j in (1+r*(t-1)):(t*r)){ #Loop for the sample within each group
if (X1[j]>quantile(X1,0.99)){ #If a value exceeds high threshold, we add 1 to some variable a
a=a+1
}
}
if(a>=1){ #For the group, if a is at least 1, we have had an exceedance.
K=K+1 #Counts number of blocks with at least one exceedance.
}
}
N<-sum(X1>=quantile(X1,0.99)) #Summing number of exceedances
Theta[k,Index]<- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs))) #Estimate
#Theta[k,Index]<-K/N
}
}
I have been running the above code without errors and it took about 20 minutes, but I want to run it for larger samples and more repetitions, which makes the run time absurdly long. I tried to keep only the necessary parts inside the loops to optimize it a little. Is it possible to optimize it even further, or should I use another programming language, as I've read that R is slow with for loops? Will vectorization help? If so, how can I vectorize the code?
First, you can define BlockLengths before Grp and Theta, as both of them depend on its length:
Index = 0
ThetaL = 2
Obs = 10000
BlockLengths = c(10,25)
Grp = vector(length = length(BlockLengths))
Theta = matrix(data = 0, nrow = ThetaL, ncol = length(BlockLengths))
Note: I decreased the size of the operation so that I could run it faster. With this specification, your original loop took 24.5 seconds.
Now, for the operation, there were three points where I could improve:
Creation of Y1: the second column can be generated at once, just by creating Obs random numbers with runif(). Then, the first column can be created as a lag of the second column. With only this alteration, the loop ran in 21.5 seconds (12% improvement).
Creation of X1: you can apply the max function to every row with apply. This alteration saved a further 1.5 seconds (6% improvement).
Calculation of K: for each t, you can get all the values of X1[(1+r*(t-1)):(t*r)] and run the condition on all of them at once (instead of using the second loop). The any(...) does the same as your a>=1. Furthermore, you can remove the first loop using lapply, then sum the resulting boolean vector, which yields the same result as your combination of if(a>=1) and K=K+1. The pipes (|>) are only used to make the order of operations easier to see. This is by far the most important alteration, saving a further 18.4 seconds (75% improvement).
for (r in BlockLengths){
Index = Index + 1
Grp[Index] = Obs/r
for (k in 1:ThetaL){
Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
Y1[,2] <- -log(-(runif(Obs))^2 + 1)
Y1[,1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs,2])
X1 <- apply(Y1, 1, max)
K <- lapply(1:Grp[Index], function(t){any(X1[(1+r*(t-1)):(t*r)] > quantile(X1,0.99))}) |> unlist() |> sum()
N <- sum(X1 >= quantile(X1, 0.99))
Theta[k,Index] <- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs)))
}
}
Using set.seed() I got the same results as your original loop.
A possible way to improve more is substituting the r and k loops with purrr::map function.
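A further saving may come from computing quantile(X1, 0.99) once instead of once per block, and from reshaping the exceedance indicator into a matrix with one column per block. A minimal sketch for a single block length, assuming Obs is a multiple of r (true for the block lengths used above); the names thr, exceed and K_ref are mine:

```r
set.seed(1)
Obs <- 10000
r   <- 25

# Same construction as above: second column first, first column is its lag
Y2 <- -log(1 - runif(Obs)^2)
Y1 <- c(-log(1 - runif(1)^2), Y2[-Obs])
X1 <- pmax(Y1, Y2)                    # row-wise maximum without apply()

thr <- quantile(X1, 0.99)             # threshold computed only once

# Column t of this matrix holds block t, i.e. X1[(1 + r*(t-1)):(t*r)]
exceed <- matrix(X1 > thr, nrow = r)
K <- sum(colSums(exceed) > 0)         # blocks with at least one exceedance
N <- sum(X1 >= thr)                   # total number of exceedances

theta <- (1 / r) * log(1 - K / (Obs / r)) / log(1 - N / Obs)

# Cross-check against the block-by-block version
K_ref <- sum(vapply(seq_len(Obs / r),
                    function(t) any(X1[(1 + r * (t - 1)):(t * r)] > thr),
                    logical(1)))
```

The matrix trick relies on R filling matrices column by column, so consecutive runs of r observations end up in the same column.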

How do I loop different percentages of missing values using MCAR?

Using the Cleveland data from the UCI data repository, I want to generate missing values in the data so that I can apply some imputation techniques.
heart.ds <- read.csv(file.choose())
head(heart.ds)
attach(heart.ds)
sum(is.na(heart.ds))
str(heart.ds)
#Changing Appropriate Variables to Factors
heart.ds$sex<-as.factor(heart.ds$sex)
heart.ds$cp<-as.factor(heart.ds$cp)
heart.ds$fbs<-as.factor(heart.ds$fbs)
heart.ds$exang<-as.factor(heart.ds$exang)
heart.ds$restecg<-as.factor(heart.ds$restecg)
heart.ds$slope<-as.factor(heart.ds$slope)
heart.ds$thal<-as.factor(heart.ds$thal)
heart.ds$target<-as.factor(heart.ds$target)
str(heart.ds)
Now I want to generate missing values using the MCAR mechanism. Below is the loop code:
p = c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
for(i in 1:length(p)){
hd_mcar[i] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
The problem is that, after running the above code, I don't get each data set back in its original form, that is, as a data frame with the same variables and rows.
The output of the above code was shown in a picture (image not available).
But when I use only one missingness percentage I get a correct result. Below is the code for only one missing percentage:
#Missing Completely at Random(MCAR)
hd_mcar <- delete_MCAR(heart.ds, 0.05)
sum(is.na(hd_mcar))
The output was again shown in a picture (image not available).
Please, I need help solving the looping problem. Thank you.
Now I want to apply MICE and other imputation methods like HMISC, Amelia, mi, and missForest inside the loop, but it is giving me an error saying "Error: Data should be a matrix or data frame"
The code below is for MICE only:
#1. Method(MICE)
mice_mcar[[i]] <- mice(hd_mcar, m=ip, method = c("pmm","logreg","polyreg","pmm","pmm","logreg",
"polyreg","pmm","logreg","pmm","polyreg","pmm",
"polyreg","logreg"), maxit = 20)
#Diagnostic check
summary(heart.ds$age)
mice_mcar$imp$age
#Finding the means of the imputations
app1 <- apply(mice_mcar$imp$age, MARGIN = 2, FUN = mean)
min1 <- abs(app1-mean(heart.ds$age))
#Selecting the minimum index
sm1 <- which(min1==min(min1))
#Selecting final imputation
final_clean_hd_mcar =mice::complete(mice_mcar,sm1)
mice.mcar = final_clean_hd_mcar
How do I go about making it fit into the loop and work properly?
Your problem was this line:
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
You are creating a vector here rather than a list. You can't assign a data frame to an element of a vector without coercing it into something that is not a data frame. You want to do this:
p <- c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar <- vector(mode = "list", length = length(p))
for(i in 1:length(p)){
hd_mcar[[i]] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
Note that because it's a list now, hd_mcar[[i]] uses the [[ rather than [ subscript.
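Once the data frames live in a list, each later step can be run per element. A minimal self-contained sketch of that pattern, using the built-in mtcars data and a hypothetical stand-in called delete_mcar_sketch (it is not the delete_MCAR function used above, it only mimics the idea so the example runs without extra packages):

```r
# Hypothetical stand-in: sets a fraction p of the rows to NA in every column
delete_mcar_sketch <- function(df, p) {
  n_miss <- round(nrow(df) * p)
  for (col in names(df)) {
    df[sample(nrow(df), n_miss), col] <- NA
  }
  df
}

set.seed(42)
p <- c(0.05, 0.10, 0.20)

# One data frame per missingness level, stored in a named list
hd_list <- lapply(p, function(frac) delete_mcar_sketch(mtcars, frac))
names(hd_list) <- paste0("p_", p)

# Any per-data-set step is then applied element by element
n_missing <- vapply(hd_list, function(d) sum(is.na(d)), integer(1))
```

The "Data should be a matrix or data frame" error would be consistent with passing the whole list to mice; inside the loop, passing hd_mcar[[i]] (a single data frame) is presumably what is needed, and mice_mcar should likewise be created as a list beforehand.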

Arrange different data set using matrix code

I'm trying to use a repeat loop to generate 100 data sets from a Poisson distribution, each with sample size n=100, and I would like to arrange the results by row and column, but it only shows me the last data set rather than all of them. At the same time, I am also trying to figure out how to get the mean, variance and MSE of the 100 data sets.
set.seed(124)
a <- 1
repeat{
b = rpois(100, lambda = 3)
Storage100 <- matrix(data=b,nrow=100,ncol=1)
a = a+1
print(b)
if (a>100){break
}
}
Storage100
I'm expecting my 100 data sets to be shown with the first data set in the first column, the second data set in the second column, and so on.
Use replicate with simplify = TRUE to get a matrix of dimension 100 x 100 where each column holds one data set.
set.seed(124)
m1 <- replicate(100, matrix(data=rpois(100, lambda = 3),ncol = 1), simplify = TRUE)
To get the mean of each column we can use colMeans (thanks to @jay.sf)
colMeans(m1)
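The question also asks for the variance and MSE. A minimal sketch, assuming "MSE" means the mean squared error of the sample mean as an estimator of lambda across the 100 data sets (that interpretation is my assumption, as are the names col_vars and mse_mean):

```r
set.seed(124)
lambda <- 3

# 100 x 100 matrix: each column is one simulated data set of size 100
m1 <- replicate(100, rpois(100, lambda = lambda))

col_means <- colMeans(m1)          # sample mean of each data set
col_vars  <- apply(m1, 2, var)     # sample variance of each data set

# MSE of the sample mean as an estimator of lambda, over the 100 data sets
mse_mean <- mean((col_means - lambda)^2)
```

For a Poisson sample of size 100 with lambda = 3, the theoretical value is lambda/100 = 0.03, so mse_mean should land in that neighbourhood.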

How can I repeat these two lines of code 100+ times?

I'm still new to the programming world and looking for some guidance on a model I am building for individual animal growths over time.
The goal for the code I'm working with is to
i) Generate random starting sizes of animals from a given distribution
ii) Give each of these individuals a starting growth rate from a given distribution
iii) Calculate new size of individual after 1 year
iv) Assign a new growth rate from above distribution
v) Calculate the new size of individual after another year.
So far I have the code below, and what I want to do is repeat the last two lines of code x times without having to run the code manually over and over.
# Generate starting lengths
lengths <- seq(from=4.4, to=5.4, by =0.1)
# Generate starting ks (growth rate)
ks <- seq(from=0.0358, to=0.0437, by =0.0001)
#Create individuals
create.inds <- function(id = NaN, length0=NaN, k1=NaN){
inds <- data.frame(id=id, length0 = length0, k1=k1)
inds
}
# Generate individuals
inds <- create.inds(id=1:n.initial,
length=sample(lengths,100,replace=TRUE),
k1=sample(ks, 100, replace=TRUE))
# Calculate new lengths based on last and 2nd last columns and insert into next column
inds[,ncol(inds)+1] <- 326*(1-exp(-(inds[,ncol(inds)])))+
(inds[,ncol(inds)-1]*exp(-(inds[,ncol(inds)])))
# Calculate new ks and insert into last column
inds[,ncol(inds)+1] <- sample(ks, 100, replace=TRUE)
Any and all assistance would be appreciated, also if you think there is a better way to write this please let me know.
I think what you are asking for is a simple loop:
for (i in 1:100) { #replace 100 with the number of times you want this to execute
inds[,ncol(inds)+1] <- 326*(1-exp(-(inds[,ncol(inds)])))+
(inds[,ncol(inds)-1]*exp(-(inds[,ncol(inds)])))
# Calculate new ks and insert into last column
inds[,ncol(inds)+1] <- sample(ks, 100, replace=TRUE)
}
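As for a different way to write this: growing a data frame by two columns per iteration works, but it leaves lengths and growth rates interleaved in unnamed columns. A minimal sketch of one alternative, assuming it is acceptable to keep lengths and growth rates in two separate preallocated matrices; n_inds, n_years, len and k are names I made up:

```r
set.seed(1)
n_inds  <- 100   # number of individuals
n_years <- 10    # number of yearly growth steps (replace as needed)

lengths <- seq(from = 4.4, to = 5.4, by = 0.1)
ks      <- seq(from = 0.0358, to = 0.0437, by = 0.0001)

# One row per individual; column y of k is the growth rate used in year y
k   <- matrix(sample(ks, n_inds * n_years, replace = TRUE), n_inds, n_years)
len <- matrix(NA_real_, n_inds, n_years + 1)
len[, 1] <- sample(lengths, n_inds, replace = TRUE)   # starting lengths

for (y in seq_len(n_years)) {
  # same growth equation as in the question, with 326 as the asymptotic size
  len[, y + 1] <- 326 * (1 - exp(-k[, y])) + len[, y] * exp(-k[, y])
}
```

Here len[, y + 1] is the size after y years, which makes it easy to plot or summarise growth trajectories afterwards.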

How to work with binary constraints in linear optimization?

I have two input matrices, dt(10,3) & wt(3,3), that I need to use to find the optimal decision matrix (same dimensions as dt), Par(10,3), so as to maximize an objective function. The R code below gives some direction into the problem (sample inputs are used here):
#Input Matrices
dt <- matrix(runif(300),100,3)
wt <- matrix(c(1,0,0,0,2,0,0,0,1),3,3) #weights
#objective function
Obj <- function(Par) {
P = matrix(Par, nrow = 10, byrow=F) # Reshape
X = t((dt%*%wt)[,1])%*%P[,1]
Y = t((dt%*%wt)[,2])%*%P[,2]
Z = t((dt%*%wt)[,3])%*%P[,3]
as.numeric(X+Y+Z) #maximize
}
Now I am struggling to apply the following constraints to the problem :
1) Matrix, Par can only have binary values (0 or 1)
2) rowSums(Par) = 1 (Basically a row can only have 1 in one of the three columns)
3) colSums(Par[,1]) <= 5, colSums(Par[,2]) <= 6, & colSums(Par[,3]) <= 4
4) X/(X+Y+Z) < 0.35, & Y/(X+Y+Z) < 0.4 (X,Y,Z are defined in the objective function)
I tried coding the constraints in constrOptim, but not sure how to input binary & integer constraints. I am reading up on lpSolve, but not able to figure out. Any help much appreciated. Thanks!
I believe this is indeed a MIP so no issues with convexity. If I am correct the model can look like:
This model can be easily transcribed into R. Note that LP/MIP solvers do not use functions for the objective and constraints (opposed to NLP solvers). In R typically one builds up matrices with the LP coefficients.
Note: I had to make the limits on the column sums much larger (I used 50,60,40).
Based on Erwin's response, I am able to formulate the model using lpSolve in R. However still struggling to add the final constraint to the model (4th constraint in my question above). Here's what I am able to code so far :
#input dimension
r <- 10
c <- 3
#input matrices
dt <- matrix(runif(r*c),r,c)
wt <- matrix(c(1,0,0,0,2,0,0,0,1),3,3) #weights
#column controller
c.limit <- c(60,50,70)
#create structure for lpSolve
ncol <- r*c
lp.create <- make.lp(ncol=ncol)
set.type(lp.create, columns=1:ncol, type = c("binary"))
#create objective values
obj.vals <- as.vector(t(dt%*%wt))
set.objfn(lp.create, obj.vals)
lp.control(lp.create,sense='max')
#Add constraints to ensure sum of parameters for every row (rowSum) <= 1
for (i in 1:r){
add.constraint(lp.create, xt=c(1,1,1),
indices=c(3*i-2,3*i-1,3*i), rhs=1, type="<=")
}
#Add constraints to ensure sum of parameters for every column (colSum) <= column limit (defined above)
for (i in 1:c){
add.constraint(lp.create, xt=rep(1,r),
indices=seq(i,ncol,by=c), rhs=c.limit[i], type="<=")
}
#Add constraints to ensure sum of column objective (t((dt%*%wt)[,i])%*%P[,i]) <= limits defined in the problem
#NOT SURE HOW TO APPLY A CONSTRAINT THAT IS DEPENDENT ON THE OBJECTIVE FUNCTION
solve(lp.create)
get.objective(lp.create) #20
final.par <- matrix(get.variables(lp.create), ncol = c, byrow=T) # Reshape
Any help that can get me to the finish line is much appreciated :)
Thanks
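One possible way to handle the fourth constraint: because X+Y+Z is positive, X/(X+Y+Z) <= 0.35 can be rewritten as X - 0.35*(X+Y+Z) <= 0, which is linear in the binary variables (the strict < has to become <=, since LP/MIP solvers cannot express strict inequalities). A minimal base-R sketch that builds the coefficient rows in the same row-major variable order as obj.vals above and checks the algebra on one random assignment; share_row, row_x and row_y are hypothetical names:

```r
set.seed(1)
r <- 10
c <- 3
dt <- matrix(runif(r * c), r, c)
wt <- matrix(c(1, 0, 0, 0, 2, 0, 0, 0, 1), 3, 3)   # weights
obj.vals <- as.vector(t(dt %*% wt))                # row-major variable order

# Coefficients of (column j contribution) - cap * (total objective) <= 0
share_row <- function(j, cap) {
  coef <- -cap * obj.vals
  idx  <- seq(j, r * c, by = c)        # variables belonging to column j
  coef[idx] <- coef[idx] + obj.vals[idx]
  coef
}
row_x <- share_row(1, 0.35)   # X / (X+Y+Z) <= 0.35
row_y <- share_row(2, 0.40)   # Y / (X+Y+Z) <= 0.40

# Check on a random assignment with exactly one 1 per row
P <- matrix(0, r, c)
P[cbind(seq_len(r), sample(c, r, replace = TRUE))] <- 1
x <- as.vector(t(P))
score <- dt %*% wt
X     <- sum(score[, 1] * P[, 1])
total <- sum(score * P)
lhs   <- sum(row_x * x)       # equals X - 0.35 * total
```

With lpSolveAPI these rows could then presumably be added as add.constraint(lp.create, xt = row_x, type = "<=", rhs = 0) and likewise for row_y, before calling solve.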
