I wrote the following code and need to repeat it 100 times. I know I need to use another for loop, but I don't know how to do it. Here is the code:
mean <- c(5,5,10,10,5,5,5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL
for (i in 1:7 ) {
x[i] <- rexp(1, rate = mean[i])
u[i] <- (1/1.2)*runif(1, min=0, max=1)
y1 <- min(x,u)
if (y1 == min(x)) {
delta1 <- 1
}
else {
delta1 <- 0
}
if (delta1 == 0)
{
w1 <- NULL
}
else {
if(y1== x[[1]])
{
w1 <- "x1"
}
}
}
output <- cbind(delta1,w1)
output
I want the final output to be a matrix with 100 rows and 3 columns, representing the run number, delta1, and w1.
Any thoughts would be truly appreciated.
Here's what I gather you're trying to achieve from your code:
Given two vectors drawn from different distributions (Exponential and Uniform)
Find out which distribution the smallest number comes from
Repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5) # 'mean' is an inbuilt function
# Initialise the output data frame:
output <- data.frame(number=rep(0, 100), delta1=rep(1, 100), w1=rep("x1", 100))
for (i in 1:100) {
# Generating u doesn't require a for loop. Additionally, the (1/1.2) factor
# can be folded into max (1/1.2 = 5/6).
u <- runif(7, min=0, max=5/6)
# Generating x doesn't need a loop either. It's better to use apply functions
# when you can!
x <- sapply(rates, function(x) { rexp(1, rate=x) })
y1 <- min(x, u)
# Now we can store the output
output[i, "number"] <- y1
# Two things here:
# 1) use all.equal instead of == to compare floating point numbers
# 2) We initialised the data frame to assume they always came from x.
# So we only need to overwrite it where it comes from u.
if (isTRUE(all.equal(y1, min(u)))) {
output[i, "delta1"] <- 0
output[i, "w1"] <- NA # Can't use NULL in a character vector.
}
}
output
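The all.equal advice in the code comments above can be checked on its own. This is a minimal sketch using textbook values that are not taken from the question; the variable names are made up for illustration.

```r
a <- 0.1 + 0.2
b <- 0.3

exact_equal    <- a == b                    # FALSE: binary rounding differs
tolerant_equal <- isTRUE(all.equal(a, b))   # TRUE: compared with a tolerance

exact_equal
tolerant_equal
```

Wrapping all.equal in isTRUE matters because all.equal returns a character description rather than FALSE when the values differ.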
Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100 # number of runs
res <- t(replicate(n, {
x <- rexp(n = length(Mean), rate = Mean)
u <- runif(n = length(Mean), min = 0, max = 1/1.2)
mx <- min(x)
delta1 <- mx <= min(u)
w1 <- delta1 & mx == x[1]
c(delta1, w1)
}))
output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
w1 = c(NA, "x1")[res[ , 2] + 1])
The result:
head(output)
# run delta1 w1
# 1 1 1 <NA>
# 2 2 1 <NA>
# 3 3 1 <NA>
# 4 4 1 x1
# 5 5 1 <NA>
# 6 6 0 <NA>
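For comparison, the same simulation can also be sketched without any explicit repetition by drawing all runs at once into matrices. This is only a sketch, assuming each row holds one run and each column one of the seven rates; the names x_mat, u_mat and out, and the seed, are made up for illustration.

```r
rates <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100  # number of runs
k <- length(rates)

set.seed(42)  # arbitrary seed, only for reproducibility
# Column j of x_mat is drawn with rate rates[j] (matrices fill column by column)
x_mat <- matrix(rexp(n * k, rate = rep(rates, each = n)), nrow = n)
u_mat <- matrix(runif(n * k, min = 0, max = 1 / 1.2), nrow = n)

min_x <- apply(x_mat, 1, min)  # smallest exponential draw per run
min_u <- apply(u_mat, 1, min)  # smallest uniform draw per run

delta1 <- as.integer(min_x <= min_u)
# "x1" only when the overall minimum is the first exponential draw
w1 <- ifelse(delta1 == 1 & x_mat[, 1] == min_x, "x1", NA)

out <- data.frame(run = seq_len(n), delta1 = delta1, w1 = w1)
head(out)
```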
In an earlier question (R: Logical Conditions Not Being Respected), I learned how to make the following simulation:
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many pairs of random numbers had to be generated before Step 1 was completed
Step 3: Repeat Step 1 and Step 2 100 times
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- 1
while(a < 12 | b < 12) {
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- i + 1
}
x <- c(a,b,i)
res <- rbind(res, x)
}
head(res)
[,1] [,2] [,3]
x 12.14232 12.08977 399
x 12.27158 12.01319 1695
x 12.57345 12.42135 302
x 12.07494 12.64841 600
x 12.03210 12.07949 82
x 12.34006 12.00365 782
Question: Now, I am trying to make a slight modification to the above code - Instead of "a" and "b" being produced separately, I want them to be produced "together" (in math terms: "a" and "b" were being produced from two independent univariate normal distributions, now I want them to come from a bivariate normal distribution).
I tried to modify this code myself:
library(MASS)
Sigma = matrix(
c(1,0.5, 0.5, 1), # the data elements
nrow=2, # number of rows
ncol=2, # number of columns
byrow = TRUE) # fill matrix by rows
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
e_i = data.frame(mvrnorm(n = 1, c(10,10), Sigma))
e_i$i <- 1
while(e_i$X1 < 12 | e_i$X2 < 12) {
e_i = data.frame(mvrnorm(n = 1, c(10,10), Sigma))
e_i$i <- i + 1
}
x <- c(e_i$X1, e_i$X2 ,i)
res <- rbind(res, x)
}
res = data.frame(res)
But this is producing the following error:
Error in while (e_i$X1 < 12 | e_i$X2 < 12) { : argument is of length zero
If I understand your code correctly you are trying to see how many samples occur before both values are >=12 and doing that for 100 trials? This is the approach I would take:
library(MASS)
for(i in 1:100){
n <- 1
while(any((x <- mvrnorm(1, mu=c(10,10), Sigma=diag(0.5, nrow=2)+0.5))<12)) n <- n+1
if(i==1) res <- data.frame("a"=x[1], "b"=x[2], n)
else res <- rbind(res, data.frame("a"=x[1], "b"=x[2], n))
}
Here I am assigning the result of mvrnorm to x within the while() call. In that same call, any() checks whether either value is less than 12. While that evaluates to TRUE, n (the counter) is increased and the process repeats. Once it evaluates to FALSE, the values are appended to your data.frame and the code goes back to the start of the for loop.
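The assign-inside-the-condition pattern can be isolated in a small helper. This is a sketch only: draws_until is a hypothetical name, the generator uses two independent normals as a stand-in for mvrnorm(), and the threshold is lowered to 11 so the example runs quickly.

```r
# Hypothetical helper: redraw until no component is below `threshold`
draws_until <- function(draw, threshold) {
  n <- 1
  # (x <- draw()) stores the new draw and hands it to any() in one step
  while (any((x <- draw()) < threshold)) n <- n + 1
  c(a = x[1], b = x[2], n = n)
}

set.seed(1)  # arbitrary seed
# Stand-in generator: two independent normals instead of mvrnorm()
res <- t(replicate(100, draws_until(function() rnorm(2, 10, 1), threshold = 11)))
head(res)
```

Collecting the trials with replicate also avoids growing a data.frame with rbind inside the loop.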
Regarding your code, the mvrnorm() function is returning a vector, not a matrix, when n=1 so both values go into a single variable in the data.frame:
data.frame(mvrnorm(n = 1, c(10,10), Sigma))
Returns:
mvrnorm.n...1..c.10..10...Sigma.
1 9.148089
2 10.605546
The matrix() function within your data.frame() calls, along with some tweaks to your use of i, will fix your code:
library(MASS)
Sigma = matrix(
c(1,0.5, 0.5, 1), # the data elements
nrow=2, # number of rows
ncol=2, # number of columns
byrow = TRUE) # fill matrix by rows
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:10){
e_i = data.frame(matrix(mvrnorm(n = 1, c(10,10), Sigma), ncol=2))
i <- 1
while(e_i$X1[1] < 12 | e_i$X2[1] < 12) {
e_i = data.frame(matrix(mvrnorm(n = 1, c(10,10), Sigma), ncol=2))
i <- i + 1
}
x <- c(e_i$X1, e_i$X2 ,i)
res <- rbind(res, x)
}
res = data.frame(res)
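To see why the original while() condition failed, the vector-versus-matrix difference can be reproduced without MASS. This is a minimal sketch; v is a made-up stand-in for what mvrnorm(n = 1, ...) returns, and df_vec, df_mat and cond are illustrative names.

```r
v <- c(9.148089, 10.605546)  # a length-2 vector, like a single mvrnorm draw

df_vec <- data.frame(v)                    # one column, two rows
df_mat <- data.frame(matrix(v, ncol = 2))  # one row, columns X1 and X2

dim(df_vec)
names(df_mat)

# df_vec has no column X1, so df_vec$X1 is NULL and the comparison is empty
cond <- df_vec$X1 < 12
length(cond)  # a zero-length condition is what while() rejects
```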
I am running a Monte Carlo simulation of a multinomial logit. For this I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values, changing both the number of individuals (n.indiv) and the number of answers given by each individual (n.choices).
So far I have managed to solve it, but I ended up with a nested for-loop structure over a grid of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am quite worried about the efficiency of this last bit of code, where the double for loop runs over the combinations of the possible values. There is probably a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I ended up with multiple objects with informative names that contain the models from the combinations of the grid search. It would be great if I could collapse all of them into a list, but with the current structure I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(v1, v2) assign(paste0("m_ind_", v1, "_choices_", v2), lapply(X = 1:nreps, FUN = mlogit_sim_data)))
result will be a list of your function outputs.
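Note that mlogit_sim_data() as posted looks up n.indiv and n.choices outside its own body, so the grid values still have to reach it somehow. One way, sketched here under the assumption that the simulation function is rewritten to take both settings as arguments, is to build the grid with expand.grid and collect everything in one named list with base R's Map. sim_once is a hypothetical placeholder, not the real model code.

```r
# Hypothetical stand-in for a version of mlogit_sim_data() that takes
# its settings as arguments instead of reading global variables
sim_once <- function(n.indiv, n.choices) {
  n.indiv * n.choices  # placeholder result
}

n.indiv_list   <- c(1, 15, 100, 500)
n.choices_list <- c(1, 2, 4, 8, 10)
nreps <- 10

# All combinations; the first column varies fastest
grid <- expand.grid(n.choices = n.choices_list, n.indiv = n.indiv_list)

models <- Map(
  function(ni, nc) lapply(seq_len(nreps), function(r) sim_once(ni, nc)),
  grid$n.indiv, grid$n.choices
)
names(models) <- paste0("m_ind_", grid$n.indiv, "_choices_", grid$n.choices)

length(models)
names(models)[1:3]
```

Each element can then be reached by name, for example models[["m_ind_15_choices_4"]], which keeps the informative names without creating separate objects through assign.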
Update2
The second set.seed(i) should be replaced with set.seed(i+1), or any other seed that starts a different random series. Otherwise, s3 <- sum(data$gene == 0 & data$cancer == 1) will always be 0, because random1 and random2 are then identical and any number smaller than 0.08 is also smaller than 0.39.
I didn't correct the code in my original question because this issue is not related to the post's core question.
Update
set.seed(i) is called twice because there are two random number generations, i.e., random1 and random2. However, the results still change between operations, which is strange.
Background:
The code below is about odds ratios, but the statistics are not the focus here. Instead, I found that the results changed (!) between some operations that should be equivalent (I suppose they are in fact not, but I cannot figure out why).
Code:
gene <- vector(length = 500, mode = "integer")
cancer <- vector(length = 500, mode = "integer")
data <- data.frame(gene, cancer)
odd_withMutate <- vector(length = 20, mode = "numeric")
odd_noMutate <- vector(length = 20, mode = "numeric")
result <- data.frame(odd_withMutate, odd_noMutate)
for (i in 1:20) {
# set.seed(12)
# set.seed(16)
set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
# set.seed(16)
set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
for (j in 1:500) {
if (random1[j] < 0.39){
data[j,1] <- 1
}
if (random2[j] < 0.08){
data[j,2] <- 1
}
}
s1 <- sum(data$gene == 1 & data$cancer == 1) # has the mutated gene & has cancer
s2 <- sum(data$gene == 1 & data$cancer == 0)
s3 <- sum(data$gene == 0 & data$cancer == 1)
s4 <- sum(data$gene == 0 & data$cancer == 0)
result[i,]$odd_withMutate <- s1/s2
result[i,]$odd_noMutate <- s3/s4
}
Different operations:
Operation #1:
If I run the code above, the 12th row of odd_noMutate in result is 0 and the 16th is NaN. To see what happened, I used set.seed(12) or set.seed(16) to check (Operations #2 and #3). But the 0 and NaN disappeared! That is, in Operation #2 the result 0.1638418 0 does not match 1.5075377 0, and in Operation #3 the result 0.2830189 0 does not match 2.4013605 NaN.
Operation #2:
the changed part of code is:
set.seed(12) #odd_noMutate = 0
# set.seed(16) #odd_noMutate = NaN
# set.seed(i)
random1 <- runif(500, min = 0, max = 1)
set.seed(12)
# set.seed(16)
# set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
Operation #3:
# set.seed(12) #odd_noMutate = 0
set.seed(16) #odd_noMutate = NaN
# set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
set.seed(16)
# set.seed(i) # add this instruction
Operation #4:
I found that even changing the range of i in my code makes the results totally different (shouldn't they be a subset of the original results?). This is Operation #4. Specifically, 0.3092105 0 does not match 1.5075377 0, and 0.7562724 0 does not match 2.4013605 NaN.
for (i in 10:20) {
# set.seed(12) #odd_noMutate = 0
# set.seed(16) #odd_noMutate = NaN
set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
# set.seed(16)
set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
The results among these operations are shown below:
The problem is that some previous values in data remain and are reused. Your problem may be solved by recreating data in every iteration of the for loop over i (that is, by putting data <- data.frame(gene, cancer) inside the for loop).
gene <- vector(length = 500, mode = "integer")
cancer <- vector(length = 500, mode = "integer")
# data <- data.frame(gene, cancer)
odd_withMutate <- vector(length = 20, mode = "numeric")
odd_noMutate <- vector(length = 20, mode = "numeric")
result <- data.frame(odd_withMutate, odd_noMutate)
for (i in 1:20) {
data <- data.frame(gene, cancer) # remaking data every time
# set.seed(12)
# set.seed(16)
set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
# set.seed(16)
set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
for (j in 1:500) {
if (random1[j] < 0.39){
data[j,1] <- 1
}
if (random2[j] < 0.08){
data[j,2] <- 1
}
}
s1 <- sum(data$gene == 1 & data$cancer == 1) # has the mutated gene & has cancer
s2 <- sum(data$gene == 1 & data$cancer == 0)
s3 <- sum(data$gene == 0 & data$cancer == 1)
s4 <- sum(data$gene == 0 & data$cancer == 0)
result[i,]$odd_withMutate <- s1/s2
result[i,]$odd_noMutate <- s3/s4
}
[ADDITION]
Unlike a function, a for loop doesn't have its own environment.
So assignments inside a for loop directly affect objects in the global environment, such as your data.
The if statement overwrote only part of data in the global environment, and the leftover values were carried into the next iteration.
Here is a simple example:
data <- data.frame(gene = vector(length = 5, mode = "integer"))
keep_of_process <- list()
for(i in 1:2) {
set.seed(i)
random_val <- runif(5, 0, 1)
for(j in 1:5) {
if(random_val[j] < 0.39) {
data[j, 1] <- 1
}
keep_of_process[[i]] <- data.frame(random = random_val,
gene = data$gene)
}
}
do.call("cbind", keep_of_process) # just to merge process to show
# left is i = 1 and right is i = 2
random gene random gene
1 0.2655087 1 0.1848823 1
2 0.3721239 1 0.7023740 1
3 0.5728534 0 0.5733263 0
4 0.9082078 0 0.1680519 1
5 0.2016819 1 0.9438393 1
Please see row 2. For i = 2, random is 0.7023740 but gene is 1 (the previous result is retained).
So to do what you want (as I understand it), you need to either recreate data (my answer above) or completely overwrite data in the if statement, such as
if(random_val[j] < 0.39) {
data[j, 1] <- 1
} else {
data[j, 1] <- 0
}
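Another way to rule out leftovers, sketched here as an assumption rather than as the asker's method, is to compute both columns from fresh vectors inside a function, so nothing can survive from one seed to the next. odds_for_seed is a hypothetical name, and the second runif() call simply continues the random stream instead of re-seeding.

```r
odds_for_seed <- function(i, n = 500) {
  set.seed(i)
  # Fresh vectors on every call, so nothing carries over between seeds
  gene   <- as.integer(runif(n) < 0.39)
  cancer <- as.integer(runif(n) < 0.08)  # continues the stream, no re-seed
  s1 <- sum(gene == 1 & cancer == 1)
  s2 <- sum(gene == 1 & cancer == 0)
  s3 <- sum(gene == 0 & cancer == 1)
  s4 <- sum(gene == 0 & cancer == 0)
  c(odd_withMutate = s1 / s2, odd_noMutate = s3 / s4)
}

result_all  <- t(sapply(1:20, odds_for_seed))
result_tail <- t(sapply(10:20, odds_for_seed))

# Seeds 10..20 give the same rows whether or not seeds 1..9 ran first
isTRUE(all.equal(result_all[10:20, ], result_tail))
```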
I have a raster stack with 364 layers with a daily rate of change in NDVI values.
I want to scale these values in every cell: from 0 to 1 if positive and from -1 to 0 if negative. So far I have only found a solution that scales values in single layers (see here: Replace specific value in each band of raster brick in R) and not along cells of multilayer objects. Additionally, I have a decent number of cells that are NA for the entire time series, and I'm not quite sure how to deal with that either.
I took the code from the previously mentioned post and tried to get it working for my problem:
norm <- function(x){-1+(x-min)*((1-(-1))/(max-min))}
for(j in 1:ncell(tif)){
if(is.na(sum(tif[j]))){
NULL
} else {
cat(paste("Currently processing layer:", j,"/",ncell(tif), "\n"))
min <- cellStats(tif[j],'min')
max <- cellStats(tif[j],'max')
#initialize cluster
#number of cores to use for clusterR function (max recommended: ncores - 1)
beginCluster(31)
#normalize
tif[j] <- clusterR(tif[j], calc, args=list(fun=norm), export=c('min',"max"))
#end cluster
endCluster()
}
}
I'm not quite certain if this produces the desired output. Any help is very much appreciated!
Some example data
library(raster)
r <- raster(ncol=10, nrow=10)
s <- stack(lapply(1:5, function(i) setValues(r, runif(100, -1, 1))))
# adding NAs
s[[2]][sample(100, 25, TRUE)] <- NA
For scaling (or any other operation) by cell (as requested) you can use calc together with a function that works on a vector. For example:
ff <- function(i) {
p <- which(i >= 0)
n <- which(i <= 0)
# positive values
if (length(p) > 0) {
i[p] <- i[p] - min(i[p], na.rm=TRUE)
i[p] <- i[p] / max(i[p])
}
# negative values
if (length(n) > 0) {
i[n] <- i[n] - max(i[n], na.rm=TRUE)
i[n] <- i[n] / abs(min(i[n]))
}
i
}
Test it
ff(c(-.3, -.1, .1, .4, .8))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000
ff(c(-.3, -.1, .1, .4, .8, NA))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000 NA
ff(c(-2,-1))
#[1] -1 0
ff(c(NA, NA))
#[1] NA NA
And use it
z <- calc(s, ff)
See below for scaling by layer, based on the min and max of all cell values in each layer (I first thought that this is what was asked for). Note that the functions below scale values to the range -1 to 1, but they do not map the lowest positive value or the highest negative value to zero.
minv <- abs(cellStats(s,'min'))
maxv <- cellStats(s,'max')
f1 <- function(i, mn, mx) {
j <- i < 0
j[is.na(j)] <- TRUE
i[j] <- i[j] / abs(mn)
i[!j] <- i[!j] / mx
i
}
ss <- list()
for (i in 1:nlayers(s)) {
ss[[i]] <- calc(s[[i]], fun=function(x) f1(x, minv[i], maxv[i]))
}
ss1 <- stack(ss)
Or without a loop
f2 <- function(x, mn, mx) {
x <- t(x)
i <- which(x > 0)
i[is.na(i)] <- FALSE
mxx <- x / mx
x <- x / mn
x[i] <- mxx[i]
t(x)
}
ss2 <- calc(s, fun=function(x) f2(x, minv, maxv))
For reference, to simply scale between 0 and 1
mnv <- cellStats(s,'min')
mxv <- cellStats(s,'max')
x <- (s - mnv) / (mxv - mnv)
To get values between -1 and 1 you can then do
y <- 2 * x - 1
But that way previously negative values can become positive and vice versa.
See ?raster::scale for other types of scaling.
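The sign-flipping caveat of plain min-max scaling can be seen on an ordinary vector. This is a minimal sketch with made-up values, using base R only rather than raster objects.

```r
v <- c(-0.3, -0.1, 0.1, 0.4, 0.8)  # made-up values

x <- (v - min(v)) / (max(v) - min(v))  # rescaled to [0, 1]
y <- 2 * x - 1                         # rescaled to [-1, 1]

range(y)
sign(v) == sign(y)  # FALSE wherever the sign flipped
```

Here the third value is positive before scaling and negative afterwards, which is why the per-cell function that treats positive and negative values separately is needed for the original request.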
I'm working on a function which will get rid of outliers in a given data set based on 3 sigma rule. My code is presented below. "data" is a data set to be processed.
rm.outlier <- function(data){
apply(data, 2, function(var) {
sigma3.plus <- mean(var) + 3 * sd(var)
sigma3.min <- mean(var) - 3 * sd(var)
sapply(var, function(y) {
if (y > sigma3.plus){
y <- sigma3.plus
} else if (y < sigma3.min){
y <- sigma3.min
} else {y <- y}
})
})
as.data.frame(data)
}
In order to check if the function works I wrote a short test:
set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
b <- a
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of outliers in a
As a result, I get:
[1] 12
So the variable var1 in the data frame a has 12 outliers. Next, I try to apply my function on this object:
a2 <- rm.outlier(a)
sum(b$var1 - a2$var1)
Unfortunately, it gives 0 which clearly indicates that something does not work. I have already worked out that the implementation of sapply is correct so there must be a mistake in my apply. Any help would be appreciated.
If runtime is important to you, you might consider another approach. You could vectorize this filtering, e.g. by using pmin and pmax, which is equally readable and more than 15 times faster. If you don't mind a little more complexity, you could use findInterval and get even more speed:
rm.outlier2 <- function(x) {
## calculate -3/3 * sigma borders
s <- mean(x) + c(-3, 3) * sd(x)
pmin(pmax(x, s[1]), s[2])
}
rm.outlier3 <- function(x) {
## calculate -3/3 * sigma borders
s <- mean(x) + c(-3, 3) * sd(x)
## sorts x into intervals 0 == left of s[1], 2 == right of s[2], 1
## between both s
i <- findInterval(x, s)
## which values are left/right of the interval
j <- which(i != 1L)
## add a value between s to directly use output of findInterval for subsetting
s2 <- c(s[1], 0, s[2])
## replace all values that are left/right of the interval
x[j] <- s2[i[j] + 1L]
x
}
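What the two building blocks do is easiest to see on a tiny vector. This is a minimal sketch with made-up limits in place of the 3-sigma borders; clamped and idx are illustrative names.

```r
x <- c(-5, -0.5, 0, 0.5, 5)
s <- c(-1, 1)  # made-up lower and upper limits

# pmax lifts values below s[1], pmin caps values above s[2]
clamped <- pmin(pmax(x, s[1]), s[2])
clamped  # -1.0 -0.5  0.0  0.5  1.0

# 0 = left of s[1], 1 = between the limits, 2 = right of s[2]
idx <- findInterval(x, s)
idx      # 0 1 1 1 2
```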
Benchmarking the stuff:
## slightly modified OP version
rm.outlier <- function(x) {
sigma3 <- mean(x) + c(-3,3) * sd(x)
sapply(x, function(y) {
if (y > sigma3[2]){
y <- sigma3[2]
} else if (y < sigma3[1]){
y <- sigma3[1]
} else {y <- y}
})
}
set.seed(123)
a <- rnorm(10000, 0, 1)
# check output
all.equal(rm.outlier(a), rm.outlier2(a))
all.equal(rm.outlier2(a), rm.outlier3(a))
library("rbenchmark")
benchmark(rm.outlier(a), rm.outlier2(a), rm.outlier3(a),
order = "relative",
columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
#3 rm.outlier3(a) 100 0.028 1.000
#2 rm.outlier2(a) 100 0.102 3.643
#1 rm.outlier(a) 100 1.825 65.179
It seems like you just forgot to assign your results of the apply function to a new dataframe. (Compare the 3rd line with your code)
rm.outlier <- function(data){
# Assign the result to a new dataframe
data_new <- apply(data, 2, function(var) {
sigma3.plus <- mean(var) + 3 * sd(var)
sigma3.min <- mean(var) - 3 * sd(var)
sapply(var, function(y) {
if (y > sigma3.plus){
y <- sigma3.plus
} else if (y < sigma3.min){
y <- sigma3.min
} else {y <- y}
})
})
# Return the new dataframe
as.data.frame(data_new)
}
set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of too big outliers
# 15
sum(a$var1 < mean(a$var1) - 3 * sd(a$var1)) # number of too small outliers
# 13
# Overall 28 outliers
# Check the function for the number of outliers
a2 <- rm.outlier(a)
sum(a2$var1 == a$var1) - length(a$var1)
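The same fix can be sketched more compactly by assigning the processed columns back into the data frame. This is an illustration under assumptions: clamp_3sigma and rm_outlier_df are hypothetical names, and lapply is swapped in for apply so that the result stays a data frame.

```r
# Hypothetical helper: clamp one numeric vector to mean +/- 3 * sd
clamp_3sigma <- function(v) {
  s <- mean(v) + c(-3, 3) * sd(v)
  pmin(pmax(v, s[1]), s[2])
}

rm_outlier_df <- function(data) {
  # data[] <- ... stores the results and keeps the data frame structure
  data[] <- lapply(data, clamp_3sigma)
  data
}

set.seed(123)
a  <- data.frame(var1 = rnorm(10000, 0, 1))
a2 <- rm_outlier_df(a)

sum(a2$var1 != a$var1)  # number of values that were clamped
```

The point is the same as in the answer above: apply and lapply return new objects, so their result has to be assigned somewhere before it is returned.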