I'm trying to simulate exposure data for a group of people, and then generate a boolean for each person conditional on that data. So say this is my simulated exposure data:
x <- rlnorm(2000)
I then want to generate a 1 or 0 for each person depending on the value of x. I can simply define `max(x)` as P = 1 and lesser values as a proportion thereof:
prob <- x / max(x)
y <- rbinom(n = length(x), size = 1, prob = prob)
> table(y)
y
0 1
1900 100
However, this is not really what I want. I would also like to be able to set an overall population risk for the group, say 30% (so `risk = 0.3`), such that individual risk depends on x but the total group risk is 0.3. At the end I want 30% of the population to have y == 1, but with individual probability dependent on the value of x. I'm at a loss as to how to achieve this; any help appreciated.
Update:
Taking a hint from @B Williams's answer below, I've written a short optimiser function:
df1 <- data.frame(x = rlnorm(2000))
df1$prob <- df1$x / max(df1$x)
risk = 0.3
optimize_prob <- function(prob, risk, delta = 0.01, tol = 0.02, max_iter = 400, mult = 1){
  prob1 <- prob
  for (i in 1:max_iter) {
    y <- rbinom(n = length(prob1), size = 1, prob = prob1)
    meas_risk <- mean(y == 1)
    if (abs(risk - meas_risk) > tol) {
      # step probabilities up if the measured risk is too low, down if too high
      sign <- ifelse(risk - meas_risk >= 0, 1, -1)
      prob1 <- prob1 + sign * delta * (1 + prob1 * mult)
      # prob1's must lie between 0 & 1
      prob1 <- pmin(pmax(prob1, 0), 1)
    } else {
      break
    }
  }
  message("Iterations: ", i)
  out <- cbind(prob1, y)
  return(out)
}
df1 <- data.frame(df1, optimize_prob(df1$prob, risk, mult = 3))
df1$y <- as.factor(df1$y)
table(df1$y)
This more or less achieves the result I want. However, if anyone knows a neater way of doing this I'd much appreciate suggestions. Any efficiency improvements to the above are also appreciated, as I will be running it a lot if all goes to plan.
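A simpler closed-form rescaling I've also considered (a sketch; note that capping at 1 pulls the realised mean slightly below risk when x is very skewed):
prob2 <- pmin(risk * df1$x / mean(df1$x), 1)  # probabilities now average ~risk
y2 <- rbinom(n = length(prob2), size = 1, prob = prob2)
mean(y2)  # close to 0.3 on average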
I may not understand correctly what you are trying to do, but here is my guess.
library(dplyr)
df <- data.frame(x = rlnorm(2000))
Pull out the top 600 (30% of 2000) values and get the minimum value
df %>%
  mutate(prob = x/max(x)) %>%
  top_n(600, prob) %>%
  summarise(min.value = min(prob)) -> out
Set the global probability based upon the minimum value
df %>%
  mutate(prob = x/max(x),
         global = ifelse(prob > out$min.value, 1, 0)) %>%
  summarise(one = sum(global))
Alternatively you could write a function and optimize it to get the "cutoff" value.
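A minimal sketch of that idea (assuming a deterministic assignment is acceptable, the "cutoff" is just the 1 - risk quantile of prob, so no optimiser is strictly needed):
risk <- 0.3
df$prob <- df$x / max(df$x)
cutoff <- quantile(df$prob, probs = 1 - risk)
df$global <- as.integer(df$prob > cutoff)
mean(df$global)  # 0.3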
I am trying to run a while loop for a simulation that updates a data frame created within the loop, and I want the loop to return that data frame, but I am getting the above error.
library(dplyr)   # filter(), mutate(), %>%
library(tibble)  # tibble(), add_row()

cd <- function(workers, beta, s_1_t, s_2_t, alpha, eta, periods) {
  df <- tibble(X1 = numeric(), X2 = numeric(), X3 = numeric(),
               X4 = numeric(), X5 = numeric(), X6 = numeric())
  n <- 0
  while (n < periods) {
    # coding population identities
    s_t <- (beta * s_1_t) + ((1 - beta) * s_2_t)
    h_tilda <- alpha # prop. of high skill jobs when marginal products are equal is alpha
    h_t <- min(s_t, h_tilda)
    l_t <- 1 - h_t
    # production fun.
    y <- h_t^(alpha) * l_t^(1 - alpha)
    # wages
    w_h <- alpha * (y / h_t)       # wage of high skilled worker
    w_l <- (1 - alpha) * (y / l_t) # wage of low skilled worker
    wage_diff <- w_h - w_l         # wage differential
    # mean level of human capital
    sig_1 <- (eta * s_1_t) + ((1 - eta) * s_t)
    sig_2 <- (eta * s_2_t) + ((1 - eta) * s_t)
    # total effort to be put in by an individual to acquire human capital
    x <- tibble(ability = rnorm(workers, mean = 3, sd = 2)) %>%
      mutate(cost1 = ability + sig_1) %>%
      mutate(cost2 = ability + sig_2)
    # proportion of pop. clearing threshold ability level (count rows, i.e. workers)
    s_1_t <- nrow(x %>% filter(cost1 <= wage_diff)) / (beta * workers)
    s_2_t <- nrow(x %>% filter(cost2 <= wage_diff)) / ((1 - beta) * workers)
    df <- df %>% add_row(X1 = y, X2 = w_h, X3 = w_l, X4 = wage_diff,
                         X5 = s_1_t, X6 = s_2_t)
    n <- n + 1
  }
  return(df)
}
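A hypothetical call to check that it runs (all parameter values below are illustrative, not from the original question):
res <- cd(workers = 1000, beta = 0.5, s_1_t = 0.4, s_2_t = 0.6,
          alpha = 0.5, eta = 0.7, periods = 50)
head(res)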
I have to choose 10 elements of a vector to maximize a function. Since the vector is pretty long, there are too many possibilities (~1000 choose 10) to compute them all, so I started to look into the GA package to use a genetic algorithm.
I came up with this MWE:
values <- 1:1000
# Fitness function which I want to maximise
f <- function(x){
# Choose values
y <- values[x]
# From the first 10 sum up the odd values.
y <- ifelse(y %% 2 != 0, y, 0)
y <- y[1:10]
return(sum(y))
}
# Maximum value of f for this example
y <- ifelse(values %% 2 != 0, values, 0)
sum(sort(y, decreasing = TRUE)[1:10])
# [1] 9900
# Genetic algorithm
GA <- ga(type = "permutation", fitness = f, lower = rep(1, 10), upper = rep(1000, 10), maxiter = 100)
summary(GA)
The results are a bit underwhelming. From summary(GA), I get the feeling that the algorithm always permutes all 1000 values (the solution goes from x1 to x1000), which leads to inefficient optimization. How can I tell the algorithm that it should only use 10 values (so the solution is x1 .. x10)?
You should read https://www.jstatsoft.org/article/view/v053i04. You don't have a permutation problem but a selection one, hence you should use the binary type of genetic algorithm. Because you want to select exactly 10 values (10 ones and 990 zeroes), you should probably write your own genetic operators, since that constraint will hardly ever be satisfied by the default operators (even if the fitness function returns -Inf when there are more than 10 ones). One approach:
Population (k tells how many ones you want):
myInit <- function(k){
  function(GA){
    m <- matrix(0, ncol = GA@nBits, nrow = GA@popSize)
    for(i in seq_len(GA@popSize))
      m[i, sample(GA@nBits, k)] <- 1
    m
  }
}
Crossover:
myCrossover <- function(GA, parents){
  parents <- GA@population[parents,] %>%
    apply(1, function(x) which(x == 1)) %>%
    t()
  parents_diff <- vector("list", 2)
  parents_diff[[1]] <- setdiff(parents[2,], parents[1,])
  parents_diff[[2]] <- setdiff(parents[1,], parents[2,])
  children_ind <- vector("list", 2)
  for(i in 1:2){
    k <- length(parents_diff[[i]])
    # identical parents have nothing to swap
    change_k <- if(k > 0) sample(k, sample(ceiling(k/2), 1)) else integer(0)
    children_ind[[i]] <- if(length(change_k) > 0){
      c(parents[i, -change_k], parents_diff[[i]][change_k])
    } else {
      parents[i,]
    }
  }
  children <- matrix(0, nrow = 2, ncol = GA@nBits)
  for(i in 1:2)
    children[i, children_ind[[i]]] <- 1
  list(children = children, fitness = c(NA, NA))
}
Mutation:
myMutation <- function(GA, parent){
  ind <- which(GA@population[parent,] == 1)
  # swap between one and three of the selected positions for unselected ones
  n_change <- sample(3, 1)
  ind[sample(length(ind), n_change)] <- sample(setdiff(seq_len(GA@nBits), ind), n_change)
  parent <- integer(GA@nBits)
  parent[ind] <- 1
  parent
}
Fitness (your function adapted for binary GA):
f <- function(x, values){
  ind <- which(x == 1)
  y <- values[ind]
  y <- ifelse(y %% 2 != 0, y, 0)
  y <- y[1:10]  # the custom operators guarantee exactly ten ones
  return(sum(y))
}
GA:
GA <- ga(
type = "binary",
fitness = f,
values = values,
nBits = length(values),
population = myInit(10),
crossover = myCrossover,
mutation = myMutation,
run = 300,
pmutation = 0.3,
maxiter = 10000,
popSize = 100
)
Chosen values:
values[which(GA@solution[1,] == 1)]
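As a quick sanity check, every solution row should contain exactly ten ones:
table(rowSums(GA@solution))  # every entry should be 10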
library(tidyverse)
gbm_vec <- function(nsim = 1000, t = 5, mu = 0.1, sigma = .3, S0 = 3400, dt = 1/252) {
  # standard normal shocks; drift and volatility enter via the GBM formula below
  epsilon <- matrix(rnorm(t * nsim), ncol = nsim, nrow = t)
  gbm <- exp((mu - sigma^2 / 2) * dt + sigma * epsilon * sqrt(dt))
  gbm <- apply(rbind(rep(S0, nsim), gbm), 2, cumprod)
  return(gbm)
}
nsim <- 1000
t <- 5
mu <- 0.1
sigma <- .3
S0 <- 3477.13
gbm <- gbm_vec(nsim, t, mu, sigma, S0) #function to have the table I'm talking about
View(gbm)
Run the code to see the table.
How can I know how many of the 1000 columns have at least one value bigger than, let's say, 3500?
We can use colSums:
val <- 3500
sum(colSums(gbm > val) > 0)
colSums(gbm > val) counts, for each column, the number of values greater than val. We then use sum to count how many columns have at least one value greater than val.
Here's a tidyverse approach, illustrated using mtcars and checking for any value > 1 in all numeric columns:
mtcars %>%
summarise_if(is.numeric, ~any(. > 1)) %>%
gather() %>%
count(value)
# value n
# 1 FALSE 2
# 2 TRUE 9
The result tells you how many of the numeric columns had at least one value > 1 and how many didn't.
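summarise_if() and gather() still work but are superseded; a sketch of the same check with the newer idiom (assumes dplyr >= 1.0 and tidyr are available):
library(dplyr)
library(tidyr)
mtcars %>%
  summarise(across(where(is.numeric), ~ any(. > 1))) %>%
  pivot_longer(everything()) %>%
  count(value)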
I think the solution by @Ronak Shah is super elegant. Here is another option using crossprod, which multiplies a row of ones into the logical matrix to get the same per-column counts:
sum(crossprod(rep(1, nrow(gbm)), gbm > val) > 0)
We can use rowSums:
val <- 3500
sum(rowSums(t(gbm > val)) > 0)
Or with apply:
sum(apply(gbm, 2, function(x) any(x > val)))
I would like to apply the rejection sampling method to simulate a random vector Y = (Y_1, Y_2) uniformly distributed on the unit disc D = \{(y_1, y_2) \in \mathbb{R}^2 : \sqrt{y_1^2 + y_2^2} \le 1\}, where X = (X_1, X_2) is a random vector uniformly distributed on the square S = [-1, 1]^2 and the target joint density is f(y_1, y_2) = \frac{1}{\pi} \mathbf{1}_D(y_1, y_2).
In the rejection method, we draw x from the proposal g and accept it with probability f(x) / (C g(x)), where C is chosen so that f(x) \le C g(x) for all x; with a uniform proposal on the square, this reduces to accepting exactly the points that fall inside the disc. I am using the following code:
x=runif(100,-1,1)
y=runif(100,-1,1)
d=data.frame(x=x,y=y)
disc_sample=d[(d$x^2+d$y^2)<1,]
plot(disc_sample)
I have two questions:
Using the above code, logically, the size of d should be greater than the size of disc_sample, but when I call both of them I see there are 100 elements in each one. How is this possible? Why are the sizes the same? THIS PART IS SOLVED, thanks to the comment below.
The question now: how could I reformulate my code to give me the total number of samples needed to get 100 samples that satisfy the condition, i.e. the number of samples drawn (rejections included) until I get the 100 needed samples?
Thanks to the answer from r2evans, but I am looking to write something simpler: a while loop that stores all candidate samples in a matrix or data frame instead of a list, and then pulls from it just the samples that satisfy the condition. I modified the code from the answer without using lists or the sapply function, but it is not giving the needed result; it yields only one row.
i=0
samps <- data.frame()
goods <- data.frame()
nr <- 0L
sampsize <- 100L
needs <- 100L
while (i < needs) {
samps <- data.frame(x = runif(1, -1, 1), y = runif(1, -1, 1))
goods <- samps[(samps$x^2+samps$y^2)<1, ]
i = i+1
}
and I also thought about this:
i=0
j=0
samps <- matrix()
goods <- matrix()
needs <- 100
while (j < needs) {
samps[i,1] <- runif(1, -1, 1)
samps[i,2] <- runif(1, -1, 1)
if (( (samps[i,1])**2+(samps[i,2])**2)<1){
goods[j,1] <- samps[i,1]
goods[j,2] <- samps[i,2]
}
else{
i = i+1
}
}
but it is not working.
I would be very grateful for any help to modify the code.
As to your second question ... you cannot reformulate your code to know precisely how many it will take to get (at least) 100 resulting combinations. You can use a while loop and concatenate results until you have at least 100 such rows, and then truncate those over 100. Because using entropy piecewise (at scale) is "expensive", you might prefer to always over-estimate the rows you need and grab all at once.
(Edited to reduce "complexity" based on homework constraints.)
set.seed(42)
samps <- vector(mode = "list")
goods <- vector(mode = "list")
nr <- 0L
iter <- 0L
sampsize <- 100L
needs <- 100L
while (nr < needs && iter < 50) {
iter <- iter + 1L
samps[[iter]] <- data.frame(x = runif(sampsize, -1, 1), y = runif(sampsize, -1, 1))
rows <- (samps[[iter]]$x^2 + samps[[iter]]$y^2) < 1
goods[[iter]] <- samps[[iter]][rows, ]
nr <- nr + sum(rows)
}
iter # number of times we looped
# [1] 2
out <- head(do.call(rbind, goods), n = 100)
NROW(out)
# [1] 100
head(out) ; tail(out)
# x y
# 1 0.8296121 0.2524907
# 3 -0.4277209 -0.5668654
# 4 0.6608953 -0.2221099
# 5 0.2834910 0.8849114
# 6 0.0381919 0.9252160
# 7 0.4731766 0.4797106
# x y
# 221 -0.65673577 -0.2124462
# 231 0.08606199 -0.7161822
# 251 -0.37263236 0.1296444
# 271 -0.38589120 -0.2831997
# 28 -0.62909284 0.6840144
# 301 -0.50865171 0.5014720
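If you do prefer the single-draw while loop from your attempt, here is a minimal sketch (slower than the batched version above, but it directly tracks the total number of draws):
set.seed(42)
needs <- 100L
goods <- matrix(numeric(0), ncol = 2, dimnames = list(NULL, c("x", "y")))
tried <- 0L
while (nrow(goods) < needs) {
  tried <- tried + 1L
  p <- runif(2, -1, 1)
  if (sum(p^2) < 1) goods <- rbind(goods, p)  # keep only points inside the disc
}
tried        # total draws, i.e. accepted + rejected
nrow(goods)  # 100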
Find the MLE of the parameter in the nonlinear model (in R, using a Gauss-Newton method):
y = sin(x * theta) + epsilon, where epsilon ~ N(0, 0.01^2)
To do this, I've been asked to generate some data that is uniformly (and randomly) distributed from 0 <= x <= 10 , with n = 200 and theta = 2 (just for generation).
For instance, starting values that are close to the maximum of the sine function (1, 4, etc.) will converge, but others won't.
EDITED
I now understand what theta.iter means, but I cannot seem to understand why it converges only sometimes and, even then, which values to input to get a useful output. Can someone explain?
theta <- 2
x <- runif(200, 0, 10)
x <- sort(x) # this is just to sort the generated data so that plotting it
             # actually looks like a sine function
y <- sin(x*theta) + rnorm(200, mean = 0, sd = 0.1^2)
GN_sin <- function(theta.iter, x, y, epsilon){
  index <- TRUE
  while (index){
    y.iter <- matrix(y - sin(x*theta.iter), 200, 1)  # residuals
    # Jacobian: d/dtheta of sin(x*theta) is x*cos(x*theta)
    x.iter <- matrix(x*cos(x*theta.iter), 200, 1)
    theta.new <- theta.iter +
      solve(t(x.iter)%*%x.iter)%*%t(x.iter)%*%y.iter  # Gauss-Newton update
    if (abs(theta.new-theta.iter) < epsilon) {index <- FALSE}
    theta.iter <- as.vector(theta.new)
    cat(theta.iter, '\n')
  }
  theta.iter  # return the final estimate
}
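A hypothetical call (the starting value and tolerance below are illustrative). The least-squares surface for sin(x * theta) is highly multimodal, so Gauss-Newton only converges to theta = 2 from starting values already fairly close to 2; far-off starts land in other local optima or diverge:
GN_sin(theta.iter = 1.9, x = x, y = y, epsilon = 1e-6)  # a start this close usually reaches ~2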