Stochastic parameter estimation - r

I have written a stochastic pandemic simulator that takes the probabilities of an infection, a recovery, or neither, and uses a Gillespie algorithm with vectors to determine the number of people in each category at each time. I want to carry out a simulation study and use maximum likelihood estimation to get parameter estimates for my simulations. This worked perfectly for the SI model, but for this model I get the warnings below, which I cannot understand. When I run the function MLE on its own I get scalars, and I can even produce the vector J. But when I use optim, it tells me that the value of PL isn't a scalar, when I know it is. Any help would be greatly appreciated, thanks.
#SIR 100 DAYS WITH 10 INTERVALS A DAY
T<-100 #Setting the number of intervals
dt<-0.01 #Setting the interval lengths
B<-1.5 #Setting Beta
N<-50 #Setting population size
Y<-0.5 #Setting recovery rate
r<-function(i){runif(1,0,1)} #Random number generator
S<-c(1:T)
I<-c(1:T)
R<-c(1:T)
I1<-c(1:T)
I2<-c(1:T)
I3<-c(1:T)
It<-c(1:T)
Time<-c(1:T)
I[1]<-1
S[1]<-N-I[1]
R[1]<-0
It[1]<-I[1]
P1<-function(t){(B)*(I[t])*(S[t])*(dt)*(1/N)} #Creates first event interval(Infection)
P2<-function(t){(Y)*(I[t])*(dt)+(B)*(I[t])*(S[t])*(dt)*(1/N)} #Creates 2nd event interval(Recovery)
P3<-function(t){1} #Creates 3rd event interval (No transition)
PI1<-function(t){(I1[t])/I[t]} #Creates interval for recovery from first group
PI2<-function(t){((I1[t])/I[t])+((I2[t])/I[t])} #Creates interval for recovery from third group
PI3<-function(t){1} #Creates interval for recovery from first group
for(i in 2:T){
x<-r(i)
if(x<P1(i-1)){ #If an infection occurs
S[i]<-S[i-1]-1
I[i]<-I[i-1]+1
R[i]<-R[i-1]
It[i]<-It[i-1]+1
}
else if(x<P2(i-1)){ #If a recovery occurs
S[i]<-S[i-1]
I[i]<-I[i-1]-1
R[i]<-R[i-1]+1
It[i]<-It[i-1]}
else{ #If no transition occurs
S[i]<-S[i-1]
I[i]<-I[i-1]
R[i]<-R[i-1]
It[i]<-It[i-1]}
}
n<-c(1:T)
for(i in 1:T){
n[i]<-S[i]+I[i]+R[i]}
n
S
I
R
Data<-cbind.data.frame(Time,S,I,R,n,It) #Create a dataframe for ease of manipulations
Data$EventInfection<-0
Data$EventRecovery<-0
Data$EventNotransition<-0
for(i in 2:T){if(Data$It[i]>Data$It[i-1]){Data$EventInfection[i]<-1} #Event indicators to make the likelihood easier
else if(Data$R[i]>Data$R[i-1]){Data$EventRecovery[i]<-1}
else{Data$EventNotransition[i]<-1}}
PL<-function(i,b,y){((b*S[i]*I[i]*dt*(1/N))^Data$EventInfection[[i]])*((I[i]*(y)*dt)^Data$EventRecovery[[i]])*((1-(b*S[i]*I[i]*dt*(1/N))-((y)*I[i]*dt))^Data$EventNotransition[[i]])}
MLE<-function(b,y){
J<<-c(1:T)
for(i in 1:T){
J[i]<<-log(PL(i,b,y))}
return(sum(J))}
MLE(1,0.5)
optim(c(1,1), MLE, y=1)
Warning messages:
1: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
2: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
3: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
4: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length
5: In J[i] <- log(PL(i, b, y)) :
number of items to replace is not a multiple of replacement length

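MLE() takes two arguments, b and y, but optim() optimizes over a single parameter vector. It passes c(1, 1) as the first argument, so b arrives as a vector of length two, while y = 1 is passed through unchanged. PL() then returns two values, and assigning them to J[i] triggers the "number of items to replace" warning. If you want to optimize both b and y, have MLE() take one vector and unpack it, for example like this.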
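MLE <- function(b){
  # b is the parameter vector from optim(): b[1] is beta, b[2] is the recovery rate
  J <<- numeric(T)
  for(i in 1:T){
    J[i] <<- log(PL(i, b[1], b[2]))
  }
  return(sum(J))
}
MLE(c(1, 0.5))
# optim() minimizes by default; fnscale = -1 makes it maximize the log-likelihood
optim(c(1, 1), MLE, control = list(fnscale = -1))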
Now b is b[1] and y is b[2]. I'm not sure if that's what you wanted to optimize, though.
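As a self-contained illustration of the calling convention (not the SIR likelihood itself), here is a minimal sketch that fits a toy normal model. The simulated data, the name loglik, and the log-sd parameterization are assumptions made for the example. Keep in mind that optim() minimizes by default, so a log-likelihood needs either a sign flip or control = list(fnscale = -1).

```r
# Toy example: estimate the mean and sd of normal data by maximum likelihood.
set.seed(1)
obs <- rnorm(200, mean = 3, sd = 2)   # simulated data (hypothetical)

# optim() passes ONE parameter vector; unpack it inside the function.
# par[1] is the mean, par[2] is log(sd) so that sd stays positive.
loglik <- function(par) {
  sum(dnorm(obs, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# fnscale = -1 turns optim()'s default minimization into maximization.
fit <- optim(c(0, 0), loglik, control = list(fnscale = -1))

est_mean <- fit$par[1]
est_sd   <- exp(fit$par[2])
```

The function returns a single number for any parameter vector, which is what optim() requires; est_mean and est_sd should land close to the sample mean and the (biased) sample standard deviation.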

Related

How to fill a non-zero coefficient/gamma plot when optimizing with CVXR package?

I'm replicating the article by Kozak, Nagel and Santosh, "Shrinking the Cross-Section". To do so, I'm building a model that selects a few characteristics out of a large set, such that together they give a good representation of an SDF.
The model uses ridge and lasso techniques, and my supervisor advised me to use the CVXR package. I minimize an objective in which two loss terms are each multiplied by their own set of gammas. The main goal of my code is a plot with one of the gammas (the ridge penalty) on the x-axis and the number of non-zero coefficients on the y-axis (so not the lasso parameter).
However, since the number of non-zero coefficients is an outcome of the optimizer, I cannot request an output with n non-zero coefficients.
Does anyone know how to produce my desired outcome? The code I used is below.
# Grid for L1 penalty
cv.gamma_1 <- seq(0.005, 0.02, by = (0.0075/15))
# Grid for L2 penalty
cv.kappa <- 10^seq(-2, 0.5, (2/24))
cv.Tt <- nrow(cv.train)
cv.tau <- sum(diag(cv.Sigma.train))
cv.gamma_2 <- as.numeric(cv.tau/((cv.kappa^2)*cv.Tt))
# Create results matrix: rows = ridge gammas, columns = number of non-zero coefficients
coef_matrix <- matrix(nrow = length(cv.gamma_2), ncol = Nn, data = 0)
for (i in 1:length(cv.gamma_1)) {
  for (j in 1:length(cv.gamma_2)) {
    objective <- loss_1 + cv.gamma_2[j] * loss_2 + cv.gamma_1[i] * loss_3
    prob <- Problem(Minimize(objective))
    result <- solve(prob)
    model_betas <- result$getValue(beta)
    # Compute R-squared of model with these betas
    r_score <- Rsquared(Mu_OOS = cv.Mu.test, Sigma_OOS = cv.Sigma.test, betas = model_betas)
    # Coef matrix: keep the best R-squared seen for this (gamma, non-zero count) cell
    non_zeros <- sum(round(model_betas, 2) != 0.00)
    if (non_zeros != 0) {
      if (coef_matrix[j, non_zeros] < r_score) {
        coef_matrix[j, non_zeros] <- r_score
      }
    }
  }
}
For now, I run the optimizer, count the number of non-zero coefficients, and fill a matrix with the non-zero counts on the y-axis and gamma on the x-axis. As a result, the matrix has no values for some of the non-zero counts.
(Two images followed here: my current plot and the desired plot.)

R code Gaussian mixture -- numerical expression has 2 elements: only the first used

I'm trying to create a Gaussian Mix function according to these parameters:
For each sample, roll a die with k sides
If the j-th side appears from the roll, draw a sample from Normal(muj, sdj) where muj and sdj are the mean and standard deviation for the j-th Normal distribution respectively. This means you should have k different Normal distributions to choose from. Note that muj is the mathematical form of referring to the j-th element in a vector called mus.
The resulting sample from this Normal is then from a Gaussian Mixture.
Where:
n, an integer that represents the number of independent samples you want from this random variable
mus, a numeric vector with length k
sds, a numeric vector with length k
prob, a numeric vector with length k that indicates the probability of choosing the different Gaussians. This should have a default to NULL.
This is what I came up with so far:
n <- c(1)
mus <- c()
sds <- c()
prob <- c()
rgaussmix <- function(n, mus, sds, prob = NULL){
if(length(mus) != length(sds)){
stop("mus and sds have different lengths")
}
for(i in 1:seq_len(n)){
if(is.null(prob)){
rolls <- c(NA, n)
rolls <- sample(c(1:length(mus)), n, replace=TRUE)
avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
}else{
rolls <- c(NA, n)
rolls <- sample(c(1:length(mus), n, replace=TRUE, p=prob))
avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
}
}
return(avg)
}
rgaussmix(2, 1:3, 1:3)
It seems to match most of the requirements, but it keeps giving me the following error:
numerical expression has 2 elements: only the first used
number of items to replace is not a multiple of replacement length
I've tried looking at the lengths of multiple variables, but I can't seem to figure out where the error is coming from!
Could someone please help me?
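seq_len(2) already gives you the whole index vector:
[1] 1 2
So 1:seq_len(n) becomes 1:(1:2), which doesn't make sense: R uses only the first element and issues the first warning you see. Loop over seq_len(n) directly instead.
You can also avoid the loop altogether, because rnorm() is vectorized over its mean and sd arguments. For example, if you do: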
rnorm(3,c(0,10,20),1)
[1] -0.507961 8.568335 20.279245
It gives you 1st sample from the 1st mean, 2nd sample from 2nd mean and so on. So you can simplify your function to:
rgaussmix <- function(n, mus, sds, prob = NULL){
  if(length(mus) != length(sds)){
    stop("mus and sds have different lengths")
  }
  # default: every component is equally likely
  if(is.null(prob)){
    prob <- rep(1/length(mus), length(mus))
  }
  # roll the k-sided die n times, then draw each sample from the chosen component
  rolls <- sample(length(mus), n, replace=TRUE, prob=prob)
  avg <- rnorm(n, mean=mus[rolls], sd=sds[rolls])
  avg
}
You can plot the results:
plot(density(rgaussmix(10000,c(0,5,10),c(1,1,1))),main="mixture of 0,5,10")
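To check that the prob argument is respected, one option is a quick sanity test with unequal weights. This is only a sketch: the function is repeated so the block runs on its own, and the weights, means and sample size are arbitrary choices for the example.

```r
# Same simplified sampler as above, repeated so this block is self-contained.
rgaussmix <- function(n, mus, sds, prob = NULL) {
  if (length(mus) != length(sds)) stop("mus and sds have different lengths")
  if (is.null(prob)) prob <- rep(1 / length(mus), length(mus))
  rolls <- sample(length(mus), n, replace = TRUE, prob = prob)
  rnorm(n, mean = mus[rolls], sd = sds[rolls])
}

set.seed(42)
x <- rgaussmix(1e5, mus = c(0, 10), sds = c(1, 1), prob = c(0.25, 0.75))

mix_mean    <- mean(x)      # should be close to 0.25 * 0 + 0.75 * 10 = 7.5
share_upper <- mean(x > 5)  # share from the second component, close to 0.75
```

With components this far apart, the share of draws above 5 is a direct estimate of the second component's weight.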

Matrices in R: number of items to replace is not a multiple of replacement length

I'm trying to initialize an empty matrix X with n rows and 4 columns, and then assign a vector of random stock values to each column. Each column represents a different stock.
I know I should apply X[,i]=cumsum(X[,i]) eventually, to get actual stock values, but that can only be done after allocating the values in the first place.
#Inputs mean return, volatility, time period and time step
mu=0.25; sigma=2; T=1; n=2^(12); X0=5;
#Generating trajectories for stocks
#NOTE: Seed is fixed. Changing seed will produce
#different trajectories
dt=T/n
t=seq(0,T,by=dt)
set.seed(201)
X <- matrix(nrow = n, ncol = 4)
for(i in 1:4){
X[,i] <- c(X0,mu*dt+sigma*sqrt(dt)*rnorm(n,mean=0,sd=1))
}
After running the code, I get the error message:
Error in X[, i] <- c(X0, mu * dt + sigma * sqrt(dt) * rnorm(n, mean = 0, :
number of items to replace is not a multiple of replacement length
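One thing visible in the code is that c(X0, ...) has n + 1 elements (the starting value plus n increments), while each column of X has only n rows. A minimal sketch of one possible fix, assuming the starting value should be stored in the first row, is to allocate n + 1 rows. The names T_end and increments are introduced here for illustration (T_end replaces T to avoid masking TRUE's shorthand).

```r
# Inputs: mean return, volatility, time period, number of steps, start value
mu <- 0.25; sigma <- 2; T_end <- 1; n <- 2^12; X0 <- 5
dt <- T_end / n
set.seed(201)

# n + 1 rows: one for the starting value X0, then one per time step
X <- matrix(nrow = n + 1, ncol = 4)
for (i in 1:4) {
  increments <- mu * dt + sigma * sqrt(dt) * rnorm(n, mean = 0, sd = 1)
  # cumulative sum turns the start value plus increments into a trajectory
  X[, i] <- cumsum(c(X0, increments))
}
```

The row count now matches the length of t = seq(0, T, by = dt) in the original code, which also has n + 1 points.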

Implementing KNN with different distance metrics using R

I am working on a dataset to compare the effect of different distance metrics, using the KNN algorithm.
The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own. I would like to find the number of correct class label matches between the nearest neighbor and the target.
I prepared the data first (wdbc_n) and chose K=1, using the Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k,method){
n <- nrow(xmat)
if (n <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n, ncol = k)
for(i in 1:n) {
ddist<- distance(xmat, method)
neigh[i, ] <- order(ddist)[2:(k + 1)]
}
return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")
I am hoping to get a result similar to the paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).
My question is
Am I right or wrong with the codes?
Any suggestions or reference that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin)(wdbc) dimension is
569 32
After normalizing and removing the id and target column the dimension is
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with the codes?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my fairly recent PC, so I only ran the first 30 rows with k=3, and I noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist<- distance(xmat, method)
Each iteration feeds the whole xmat matrix to the distance function, which computes all pairwise distances between its rows, and the loop index i is never used in that call. The same matrix is therefore recomputed n times and every row of neigh receives the same values, which is not what you want. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit =
"log", est.prob = NULL)
x a numeric data.frame or matrix (storing probability vectors) or a
numeric data.frame or matrix storing counts (if est.prob is
specified).
(...)
in case nrow(x) = 2 : a single distance value. in case nrow(x) > 2 :
a distance matrix storing distance values for all pairwise probability
vector comparisons.
In your specific case (knn classification), you want to use the 2 row version.
One last thing: you used order, which returns the positions of the distances in the ddist vector in increasing order (smallest first), not the distances themselves. I think what you want is the distances themselves, so you need to use sort instead of order.
Based on your code and the example in Lantz (2013) that your code seemed to be based on, here is a complete working solution. I took the liberty to add a few lines to make a standalone program.
Standalone working solution(s)
library(philentropy)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
knn <- function(train, test, k, method){
  n.test <- nrow(test)
  n.train <- nrow(train)
  if (n.train + n.test <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n.test, ncol = k)
  ddist <- numeric(n.train)
  for(i in 1:n.test) {
    for(j in 1:n.train) {
      # build a 2-row matrix from the current test and train rows
      xmat <- rbind(test[i,], train[j,])
      # with 2 rows, distance() returns a single value; store it in ddist[j]
      ddist[j] <- distance(as.data.frame(xmat), method = method)
    }
    # sorted distances, positions 2 to k + 1
    neigh[i, ] <- sort(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:569,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")
Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we are only feeding 2 rows at a time into the distance function, it makes the execution time manageable.
Now does that work?
The two first test rows using the custom knn function:
[,1] [,2] [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
[,1] [,2] [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
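Conclusion: not too shabby. The custom distances match the FNN ones shifted by one position, because sort(ddist)[2:(k + 1)] skips the closest training row; taking sort(ddist)[1:k] should make the two outputs line up.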
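Since the original goal was to count correct class label matches, it may help to return neighbor positions rather than distances. The following is a base-R sketch under stated assumptions: it handles only the Euclidean metric, knn_index is a name made up for this example, and the built-in iris data stands in for the breast cancer data.

```r
# Return, for each test row, the positions of its k nearest training rows
# (Euclidean distance only, base R, no external packages).
knn_index <- function(train, test, k) {
  train <- as.matrix(train)
  test  <- as.matrix(test)
  idx <- matrix(0L, nrow = nrow(test), ncol = k)
  for (i in seq_len(nrow(test))) {
    # Euclidean distance from test row i to every training row
    d <- sqrt(colSums((t(train) - test[i, ])^2))
    # order() gives positions, smallest distance first
    idx[i, ] <- order(d)[seq_len(k)]
  }
  idx
}

set.seed(1)
rows    <- sample(nrow(iris))   # shuffle, since iris is sorted by species
train_i <- rows[1:100]
test_i  <- rows[101:150]
nn <- knn_index(iris[train_i, 1:4], iris[test_i, 1:4], k = 1)

# Label of the nearest neighbor, compared with the true label of each test row
predicted <- iris$Species[train_i][nn[, 1]]
matches   <- sum(predicted == iris$Species[test_i])
```

Here order() is the right tool, because the positions are needed to look up the neighbors' labels; sort() would be the choice only if the distances themselves are the quantity of interest.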

R problems looping through array

I am trying to loop through an array so that, in the end, I can create a power curve showing the power as a function of the number of animals per treatment and the mean difference between the two treatments.
N=30 # number of maximum simulations per K
K=seq(10,30,1) # maximum number of animals per group
ES=seq(1,2,0.1) # mean difference compared to control
x=array(data=NA,
dim=c(N,length(K),length(ES)),
dimnames =list(paste("Sim",1:N, sep=""),
paste("Total Number of Animals=",min(K):max(K), sep=""),
ES)) # 3-dimensional matrix in which to store the values
for (q in ES){
for (j in K){
for (i in 1:N){
controle<-rnorm(j,popmeansum$V3, 1.490918)
new<-rnorm(j,popmeansum$V3-q, 1.490918)
fit<-t.test(controle, new, alternative ="greater")
x[i,j,q]<-fit$p.value
}
}
}
The error I get is:
Error in `[<-`(`*tmp*`, i, j, q, value = 0.00490665200011608) :
subscript out of bounds
My gut feeling says I am making a simple and stupid mistake. Unfortunately, those mistakes can take hours. I hope someone sees a quick and simple fix.
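There are various problems here. The main one is that j and q are used as array indices but take the values in K (10 to 30) and ES (1 to 2 in steps of 0.1) rather than the positions 1 to length(K) and 1 to length(ES), which is what produces the subscript out of bounds error. Hopefully this will help solve some of them...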
N=30 # number of maximum simulations per K
K=seq(10,30,1) # maximum number of animals per group
ES=seq(1,2,0.1) # mean difference compared to control
x=array(data=NA,
dim=c(N,length(K),length(ES)),
dimnames =list(paste("Sim",1:N, sep=""),
paste("Total Number of Animals=",min(K):max(K), sep=""),
ES)) # 3-dimensional matrix in which to store the values
#all OK to here
#in the next loop, popmeansum$V3 is not defined. If it is a single value, then I suggest...
pms <- popmeansum$V3[1]
for (q in 1:length(ES)){ #keep q as integers so that you can use it for indexing
  for (j in 1:length(K)){ #to be consistent with your dimensions of x
    for (i in 1:N){
      controle<-rnorm(K[j],pms,1.490918)
      new<-rnorm(K[j],pms-ES[q],1.490918)
      fit<-t.test(controle, new, alternative ="greater")
      x[i,j,q]<-fit$p.value
    }
  }
}
It is not the slickest piece of code but it does at least do something!
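Once x is filled, the power for each combination can be estimated as the share of simulated p-values below the significance level. The sketch below assumes a 5% level, a reduced grid so it runs quickly, and a made-up control mean pms = 10 in place of popmeansum$V3, which is not shown in the question.

```r
set.seed(123)
N   <- 200              # simulations per cell (hypothetical value)
K   <- c(10, 20, 30)    # animals per group (reduced grid)
ES  <- c(1, 2)          # mean differences compared to control (reduced grid)
pms <- 10               # assumed control mean; stands in for popmeansum$V3
sdv <- 1.490918

x <- array(NA_real_, dim = c(N, length(K), length(ES)),
           dimnames = list(NULL, paste0("n=", K), paste0("ES=", ES)))

# Index by position (seq_along), look up the actual values via K[j] and ES[q]
for (q in seq_along(ES)) {
  for (j in seq_along(K)) {
    for (i in seq_len(N)) {
      controle <- rnorm(K[j], pms, sdv)
      new      <- rnorm(K[j], pms - ES[q], sdv)
      x[i, j, q] <- t.test(controle, new, alternative = "greater")$p.value
    }
  }
}

# Power = share of simulations with p < 0.05, for each (K, ES) cell
power <- apply(x, c(2, 3), function(p) mean(p < 0.05))
```

The resulting power matrix has one row per group size and one column per effect size, so each column can be drawn as one power curve, for example with matplot(K, power, type = "l").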
