How to perform a Multivariate Polynomial Regression when output has stochastic behavior? - r

I have a experiment being simulated. This experiment has 3 parameters a,b,c (variables?) but the result, r, cannot be "predicted" as it has a stochastic component. In order to minimize the stochastic component I've run this experiment several times(n). So in resume I have n 4-tuples a,b,c,r where a,b,c are the same but r varies. And each batch of experiments is run with different values for a, b, c (k batches) making the complete data-set having k times n sets of 4-tuples.
I would like to find out the best polynomial fit for this data and how to compare them like:
fit1: with
fit2: with
fit3: some 3rd degree polynomial function and corresponding error
fit4: another 3rd degree (simpler) polynomial function and corresponding error
and so on...
This could be done with R or Matlab®. I've searched and found many examples but none handled same input values with different outputs.
I considered doing the multivariate polynomial regression n times adding some small delta to each parameter but I'd rather take a cleaner sollution before that.
Any help would be appreciated.
Thanks in advance,
Jacques

Polynomial regression should be able to handle stochastic simulations just fine. Just simulate r, n times, and perform a multivariate polynomial regression across all points you've simulated (I recommend polyfitn()).
You'll have multiple r values for the same [a,b,c] but a well-fit curve should be able to estimate the true distribution.
In polyfitn it will look something like this
n = 1000;
a = rand(500,1);
b = rand(500,1);
c = rand(500,1);
for n = 1:1000
for i = 1:length(a)
r(n,i) = foo(a,b,c);
end
end
my_functions = {'a^2 b^2 c^2 a b c',...};
for fun_id = 1:length(my_functions)
p{f_id} = polyfitn(repmat([a,b,c],[n,1]),r(:),myfunctions{fun_id})
end
It's not hard to iteratively/recursively generate a set of polynomial equations from a basis function; but for three variables there might not be a need to. Unless you have a specific reason for fitting higher order polynomials (planetary physics, particle physics, etc. physics), you shouldn't have too many functions to fit. It is generally not good practice to use higher-order polynomials to explain data unless you have a specific reason for doing so (risk of overfitting, sparse data inter-variable noise, more accurate non-linear methods).

Related

Random deviations of pert distribution

I'm working with monte carlo using R with the following code:
A)
mc_matrix = 1
for (i in 1:1000000){
mc_sample = rpert(n=1,min=629,max=1049,mode=739)
mc_matrix = rbind(mc_matrix, mc_sample)
}
mean(mc_matrix)
B)
mean_of_matrix = rpert(1000000, min=629, max=1049, mode=739)
Should these two instances of code not be the same? How come I'm not getting the same average having so many samples from the distribution?
as first it would be good to let everybody know the packages you need. In your case it is the package "freedom".
Than, the newest version needs the input with x.min, x.max and x.mode.
In a Monte Carlo simulation you simulate random variables to calculate for example the mean as in your example. The problem is that this is just an asymptotic approximation of the distribution.
If you try this with the implemented rnorm(n) function you get different results for the mean, even if the true mean of every simulated normal distributed random variable is the same.
So if you try
mean(rnorm(10000))
mean(rnorm(10000))
the results will slightly differ.
Every programming language has an implemented pseudo random variable generator. If you need the same random variables again and again you can use the function set.seed(seed) to start the random variable generator at the same point.
Try
set.seed(100)
mean(rnorm(1000))
set.seed(100)
mean(rnorm(1000))
you will get the same results.
You can try this with your example, but the results will still differ because you do some calculations in the first example while you just calculate the mean in the last. But you are right that the results should be the same with the same random variables because it's the same calculation.
Thats a basic principle of Monte Carlo Simulations to simulate n, with n large, random variables to approximate an asymptotic distribution.

Generate multivariate nonnormal random numbers in R

Background
I want to generate multivariate distributed random numbers with a fixed variance matrix. For example, I want to generate a 2 dimensional data with covariance value = 0.5, each dimensional variance = 1. The first maginal of data is a norm distribution with mean = 0, sd = 1, and the next is a exponential distribution with rate = 2.
My attempt
My attempt is that we can generate a correlated multinormal distribution random numbers and then revised them to be any distribution by Inverse transform sampling.
In below, I give an example about transforming 2 dimensional normal distribution random numbers into a norm(0,1)+ exp(2) random number:
# generate a correlated multi-normal distribution, data[,1] and data[,2] are standard norm
data <- mvrnorm(n = 1000,mu = c(0,0), Sigma = matrix(c(1,0.5,0.5,1),2,2))
# calculate the cdf of dimension 2
exp_cdf = ecdf(data[,2])
Fn = exp_cdf(data[,2])
# inverse transform sampling to get Exponetial distribution with rate = 2
x = -log(1-Fn + 10^(-5))/2
mean(x);cor(data[,1],x)
Out:
[1] 0.5035326
[1] 0.436236
From the outputs, the new x is a set of exponential(rate = 2) random numbers. Also, x and data[,1] are correlated with 0.43. The correlated variance is 0.43, not very close to my original setting value 0.5. It maybe a issue. I think covariance of sample generated should stay more closer to initial setting value. In general, I think my method is not quite decent, maybe you guys have some amazing code snippets.
My question
As a statistics graduate, I know there exist 10+ methods to generate multivariate random numbers theoretically. In this post, I want to collect bunch of code snippets to do it automatically using packages or handy . And then, I will compare them from different aspects, like time consuming and quality of data etc. Any ideas is appreciated!
Note
Some users think I am asking for package recommendation. However, I am not looking for any recommendation. I already knew commonly used statistical theroms and R packages. I just wanna know how to generate multivariate distributed random numbers with a fixed variance matrix decently and give a code example about generate norm + exp random numbers. I think there must exist more powerful code snippets to do it in a decent way! So I ask for help right now!
Sources:
generating-correlated-random-variables, math
use copulas to generate multivariate random numbers, stackoverflow
Ross simulation, theoretical book
R CRAN distribution task View

Code syntax in calculating posterior distribution in WinBUGS

Recently I read "The BUGS Book – A Practical Introduction to Bayesian Analysis" to learn WinBUGS. The way WinBUGS describes the derivation of posterior distribution makes me feel confused.
Let's take Example 4.1.1 in this book to illustrae:
Suppose we observe the number of deaths y in a given hospital for a
high-risk operation. Let n denote the total number of such
operations performed and suppose we wish to make inferences regarding
the underlying true mortality rate, $\theta$.
The code of WinBUGS is:
y <- 10 # the number of deaths
n <- 100 # the total number of such operations
#########################
y ~ dbin(theta,n) # likelihood, also a parametric sampling distribution
logit(theta) <- logit.theta # normal prior for the logistic transform of theta
logit.theta ~ dnorm(0,0.368) # precision = 1/2.71
The author said that:
The software knows how to derive the posterior distribution and
subsequently sample from it.
My question is:
Which code reflects the logic structure to tell WinBUGS about "which parameter that I want to calculate its posterior distribution"?
This question seems silly, but if I do not read the background first, I truly cannot find directly in the code above about which parameter is focused on (e.g., theta, or y?).
Below are some of my thoughts (as a beginner of WinBUGS):
I think the following three attributions of the code style in WinBUGS makes me confused:
(1) the code does not follow "a specific sequence". For example, why is logit.theta ~ dnorm(0,0.368) not in front of logit(theta) <- logit.theta?
(2) repeated variable. Foe example, why did the last two lines not be reduced into one line: logit(theta) ~ dnorm(0,0.368)?
(3) variables are defined in more than one place. For example, y is defined two times: y <- 10 and y ~ dbin(theta, n). This one has been explained in Appendix A of the book (i.e., However, a check has been built in so that when finding a logical node that also features as a stochastic node, a stochastic node is created with the calculated values as fixed data), yet I still cannot catch its meaning.
BUGS is a declarative language. For the most part, statements aren't executed in sequence, they define different parts of the model. BUGS works on models that can be represented by directed acyclic graphs, i.e. those where you put a prior on some components, then conditional distributions on other components given the earlier ones.
It's a fairly simple language, so I think logit(theta) ~ dnorm(0, 0.368) is just too complicated for it.
The language lets you define a complicated probability model, and declare observations of certain components in it. Once you declare an observation, the model that BUGS samples from is the the original full model conditioned on that observation. y <- 10 defines observed data. y ~ dbin(theta,n) is part of the model.
The statement n <- 100 could be either: for fixed constants like n, it doesn't really matter which way you think of it. Either the model says that n is always 100, or n has an undeclared prior distribution not depending on any other parameter, and an observed value of 100. These two statements are equivalent.
Finally, your big question: Nothing in the code above says which parameter you want to look at. BUGS will compute the joint posterior distribution of every parameter. n and y will take on their fixed values, theta and logit.theta will both be simulated from the posterior. In another part of your code (or by using the WinBUGS menus) you can decide which of those to look at.

MLE regression that accounts for two constraints

So I am wanting to create a logistic regression that simultaneously satisfies two constraints.
The link here, outlines how to use the Excel solver to maximize the value of Log-Likelihood value of a logistic regression, but I am wanting to implement a similar function in R
What I am trying to create in the end is an injury risk function. These take an S-shape function.
As we see, the risk curves are calculated from the following equation
Lets take some dummy data to begin with
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)
df1 <- data.frame(A,B,C)
Lets assume A indicates if a bone was broken, B is the bone density and C is the force applied.
So we can form a logistic regression model that uses B and C are capable of explaining the outcome variable A. A simple example of the regression may be:
Or
glm(A ~ B + C, data=df1, family=binomial())
Now we want to make the first assumption that at zero force, we should have zero risk. This is further explained as A1. on pg.124 here
Here we set our A1=0.05 and solve the equation
A1 = 1 - (1-P(0))^n
where P(0) is the probability of injury when the injury related parameter is zero and n is the sample size.
We have our sample size and can solve for P(0). We get 3.4E-4. Such that:
The second assumption is that we should maximize the log-likelihood function of the regression
We want to maximize the following equation
Where pi is estimated from the above equation and yi is the observed value for non-break for each interval
My what i understand, I have to use one of the two functions in R to define a function for max'ing LL. I can use mle from base R or the mle2 from bbmle package.
I guess I need to write a function along these lines
log.likelihood.sum <- function(sequence, p) {
log.likelihood <- sum(log(p)*(sequence==1)) + sum(log(1-p)*(sequence==0))
}
But I am not sure where I should account for the first assumption. Ie. am I best to build it into the above code, and if so, how? Or will it be more effiecient to write a secondary finction to combine the two assumptions. Any advice would be great, as I have very limited experience in writing and understanding functions

Trying to do a simulation in R

I'm pretty new to R, so I hope you can help me!
I'm trying to do a simulation for my Bachelor's thesis, where I want to simulate how a stock evolves.
I've done the simulation in Excel, but the problem is that I can't make that large of a simulation, as the program crashes! Therefore I'm trying in R.
The stock evolves as follows (everything except $\epsilon$ consists of constants which are known):
$$W_{t+\Delta t} = W_t exp^{r \Delta t}(1+\pi(exp((\sigma \lambda -0.5\sigma^2) \Delta t+\sigma \epsilon_{t+\Delta t} \sqrt{\Delta t}-1))$$
The only thing here which is stochastic is $\epsilon$, which is represented by a Brownian motion with N(0,1).
What I've done in Excel:
Made 100 samples with a size of 40. All these samples are standard normal distributed: N(0,1).
Then these outcomes are used to calculate how the stock is affected from these (the normal distribution represent the shocks from the economy).
My problem in R:
I've used the sample function:
x <- sample(norm(0,1), 1000, T)
So I have 1000 samples, which are normally distributed. Now I don't know how to put these results into the formula I have for the evolution of my stock. Can anyone help?
Using R for (discrete) simulation
There are two aspects to your question: conceptual and coding.
Let's deal with the conceptual first, starting with the meaning of your equation:
1. Conceptual issues
The first thing to note is that your evolution equation is continuous in time, so running your simulation as described above means accepting a discretisation of the problem. Whether or not that is appropriate depends on your model and how you have obtained the evolution equation.
If you do run a discrete simulation, then the key decision you have to make is what stepsize $\Delta t$ you will use. You can explore different step-sizes to observe the effect of step-size, or you can proceed analytically and attempt to derive an appropriate step-size.
Once you have your step-size, your simulation consists of pulling new shocks (samples of your standard normal distribution), and evolving the equation iteratively until the desired time has elapsed. The final state $W_t$ is then available for you to analyse however you wish. (If you retain all of the $W_t$, you have a distribution of the trajectory of the system as well, which you can analyse.)
So:
your $x$ are a sampled distribution of your shocks, i.e. they are $\epsilon_t=0$.
To simulate the evolution of the $W_t$, you will need some initial condition $W_0$. What this is depends on what you're modelling. If you're modelling the likely values of a single stock starting at an initial price $W_0$, then your initial state is a 1000 element vector with constant value.
Now evaluate your equation, plugging in all your constants, $W_0$, and your initial shocks $\epsilon_0 = x$ to get the distribution of prices $W_1$.
Repeat: sample $x$ again -- this is now $\epsilon_1$. Plugging this in, gives you $W_2$ etc.
2. Coding the simulation (simple example)
One of the useful features of R is that most operators work element-wise over vectors.
So you can pretty much type in your equation more or less as it is.
I've made a few assumptions about the parameters in your equation, and I've ignored the $\pi$ function -- you can add that in later.
So you end up with code that looks something like this:
dt <- 0.5 # step-size
r <- 1 # parameters
lambda <- 1
sigma <- 1 # std deviation
w0 <- rep(1,1000) # presumed initial condition -- prices start at 1
# Show an example iteration -- incorporate into one line for production code...
x <- rnorm(1000,mean=0,sd=1) # random shock
w1 <- w0*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
sigma*x*sqrt(dt) -1)) # evolution
When you're ready to let the simulation run, then merge the last two lines, i.e. include the sampling statement in the evolution statement. You then get one line of code which you can run manually or embed into a loop, along with any other analysis you want to run.
# General simulation step
w <- w*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
sigma*rnorm(1000,mean=0,sd=1)*sqrt(dt) -1))
You can also easily visualise the changes and obtain summary statistics (5-number summary):
hist(w)
summary(w)
Of course, you'll still need to work through the details of what you actually want to model and how you want to go about analysing it --- and you've got the $\pi$ function to deal with --- but this should get you started toward using R for discrete simulation.

Resources