Probability optimization - r

I have the data set
d<-data.frame(id=1:100, pr.a=runif(100,min=0, max=0.40))
d$pr.b=d$pr.a+runif(100,min=0, max=0.1)
d$pr.c=d$pr.b+runif(100,min=0, max=0.1)
pr.a < pr.b < pr.c are the probabilities of success of a binomial trial for tests A, B, C on the individuals (ids).
Additionally,
cost.a<-80; cost.b=200; cost.c=600;
The tests A, B, C can be performed multiple times on each subject. So, for example, if IDx has pr.a = 0.2 and I perform this test 2 times, I would expect a success probability of 1-pbinom(0,2,0.2) = 0.36 at a cost of 2*cost.a = 160.
For each modality A, B, C, in all IDs I would like to find the distribution of the costs required for a given target success rate (let's say target = 0.9).
At first I would like to see the distributions of the costs if only one test type (only A's or only C's) is applied to each subject (albeit possibly multiple times on the same subject).
Additionally, I would like to find whether a combination of test types can minimize the cost for a target success rate.
This seems to me like an optimization problem. I have no experience with optimization. Any ideas, please?

I guess you could characterize this as an optimization problem, but only a very simple one. You are simply trying to maximize the probability per unit cost. To calculate this, simply divide the probability by the cost:
d$pr.a / cost.a # The probability per unit cost for A
d$pr.b / cost.b # The probability per unit cost for B
d$pr.c / cost.c # The probability per unit cost for C
Pick the maximum for each id, and you will get the 'best' test for each id. To calculate the expected cost for a given target probability, just divide the target by the probability and multiply by the cost of that test:
target = 0.9
(target / d$pr.a) * cost.a # expected cost for test A at the target
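Putting this together, here is a minimal, untested sketch applying that rule to the question's data frame d and costs (it follows the rule above, not an exact binomial calculation):
# Probability per unit cost for each test
eff <- data.frame(a = d$pr.a / cost.a, b = d$pr.b / cost.b, c = d$pr.c / cost.c)
idx  <- max.col(eff, ties.method = "first")   # column of the 'best' test per id
best <- names(eff)[idx]
# Expected cost per id for the chosen test, using the target/probability rule
pr.best <- cbind(d$pr.a, d$pr.b, d$pr.c)[cbind(seq_len(nrow(d)), idx)]
cost.per.test <- c(a = cost.a, b = cost.b, c = cost.c)
expected.cost <- (target / pr.best) * cost.per.test[best]
summary(expected.cost)   # distribution of expected costs across ids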


Why do we discard the first 10,000 simulated data points?

The following code comes from this book, Statistics and Data Analysis for Financial Engineering, and describes how to simulate data from an ARCH(1) model.
library(TSA)
library(tseries)
n = 10200                  # simulate 10,200 values; the first 10,000 will be discarded
set.seed("7484")
e = rnorm(n)               # i.i.d. N(0,1) innovations
a = e                      # ARCH(1) errors (initialized from e)
y = e                      # AR(1) series with ARCH(1) errors (initialized from e)
sig2 = e^2                 # conditional variances (initialized from e^2)
omega = 1
alpha = 0.55
phi = 0.8
mu = 0.1
omega/(1-alpha) ; sqrt(omega/(1-alpha))   # unconditional variance and sd of the ARCH(1) errors
for (t in 2:n){
  a[t] = sqrt(sig2[t])*e[t]               # ARCH(1) error at time t
  y[t] = mu + phi*(y[t-1]-mu) + a[t]      # AR(1) recursion around mu
  sig2[t+1] = omega + alpha * a[t]^2      # conditional variance for time t+1
}
plot(e[10001:n],type="l",xlab="t",ylab=expression(epsilon),main="(a) white noise")
My question is: why do we need to discard the first 10,000 simulated values?
Bottom Line Up Front
Truncation is needed to deal with sampling bias introduced by the simulation model's initialization when the simulation output is a time series.
Details
Not all simulations require truncation of initial data. If a simulation produces independent observations, then no truncation is needed. The problem arises when the simulation output is a time series. Time series differ from independent data because their observations are serially correlated (also known as autocorrelated). For positive correlations, the result is similar to having inertia: observations which are near neighbors tend to be similar to each other. This characteristic interacts with the reality that computer simulations are programs, and all state variables need to be initialized to something. The initialization is usually to a convenient state, such as "empty and idle" for a queueing service model where nobody is in line and the server is available to immediately help the first customer. As a result, that first customer experiences zero wait time with probability 1, which is certainly not the case for the wait time of some customer k where k > 1.
Here's where serial correlation kicks us in the pants. If the first customer always has a zero wait time, that affects some unknown number of subsequent customers' experiences. On average they tend to be below the long-term average wait time, but gravitate more towards that long-term average as k, the customer number, increases. How long this "initialization bias" lingers depends on both how atypical the initialization is relative to the long-term behavior, and the magnitude and duration of the serial correlation structure of the time series.
The average of a set of values yields an unbiased estimate of the population mean only if they belong to the same population, i.e., if E[Xi] = μ, a constant, for all i. In the previous paragraph, we argued that this is not the case for time series with serial correlation that are generated starting from a convenient but atypical state. The solution is to remove some (unknown) quantity of observations from the beginning of the data so that the remaining data all have the same expected value. This issue was first identified by Richard Conway in a RAND Corporation memo in 1961, and published in refereed journals in 1963 - [R.W. Conway, "Some tactical problems on digital simulation", Manag. Sci. 10(1963)47–61]. How to determine an optimal truncation amount has been and remains an active area of research in the field of simulation. My personal preference is for a technique called MSER, developed by Prof. Pres White (University of Virginia). It treats the end of the data set as the most reliable in terms of unbiasedness, and works its way towards the front using a fairly simple measure to detect when adding observations closer to the front produces a significant deviation. You can find more details in this 2011 Winter Simulation Conference paper if you're interested. Note that the 10,000 you used may be overkill, or it may be insufficient, depending on the magnitude and duration of serial correlation effects for your particular model.
It turns out that serial correlation causes other problems in addition to the issue of initialization bias. It also has a significant effect on the standard error of estimates, as pointed out at the bottom of page 489 of the WSC2011 paper, so people who calculate the i.i.d. estimator s^2/n can be off by orders of magnitude on the estimated width of confidence intervals for their simulation output.
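As a rough, untested illustration using the simulated series y from the question (not part of the original answer): plot the running mean of y and see how long the effect of the arbitrary initialization lingers before it settles near the long-run mean mu.
# Assumes y, n and mu exist from the simulation code in the question
running.mean <- cumsum(y) / seq_along(y)
plot(running.mean, type = "l", xlab = "t", ylab = "running mean of y",
     main = "Initialization bias in the running mean")
abline(h = mu, lty = 2)            # long-run mean of the AR(1) process
mean(y[10001:n])                   # estimate from the retained portion only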

The best way to calculate classification accuracy?

I know one formula to calculate classification accuracy is X = t / n * 100 (where t is the number of correct classifications and n is the total number of samples).
But let's say we have 100 samples in total: 80 in class A, 10 in class B, 10 in class C.
Scenario 1: All 100 samples were assigned to class A. Using the formula, we get an accuracy of 80%.
Scenario 2: The 10 samples belonging to B were correctly assigned to class B; the 10 samples belonging to C were correctly assigned to class C; 30 samples belonging to A were correctly assigned to class A; the remaining 50 samples belonging to A were incorrectly assigned to C. Using the formula, we get an accuracy of 50%.
My question is:
1: Can we say scenario 1 has a higher accuracy rate than scenario 2?
2: Is there any better way to calculate an accuracy rate for a classification problem?
Many thanks in advance!
Classification accuracy is defined as "percentage of correct predictions". That is the case regardless of the number of classes. Thus, scenario 1 has a higher classification accuracy than scenario 2.
However, it sounds like what you are really asking is for an alternative evaluation metric or process that "rewards" scenario 2 for only making certain types of mistakes. I have two suggestions:
Create a confusion matrix: It describes the performance of a classifier so that you can see what types of errors your classifier is making.
Calculate the precision, recall, and F1 score for each class. The average F1 score might be the single-number metric you are looking for.
The Classification metrics section of the scikit-learn documentation has lots of good information about classifier evaluation, even if you are not a scikit-learn user.
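For example, here is a quick base-R sketch of scenario 2 from the question (labels constructed to match the counts described; just an illustration, not tied to any particular library):
# True and predicted labels matching scenario 2
truth <- factor(rep(c("A", "B", "C"), times = c(80, 10, 10)))
pred  <- factor(c(rep("A", 30), rep("C", 50), rep("B", 10), rep("C", 10)), levels = levels(truth))
cm <- table(truth, pred)           # confusion matrix: rows = truth, columns = prediction
precision <- diag(cm) / colSums(cm)
recall    <- diag(cm) / rowSums(cm)
f1        <- 2 * precision * recall / (precision + recall)
rbind(precision, recall, f1)       # mean(f1) is one possible single-number summary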

Is there any algorithm that can predict multiple variables (response variables) based on one independent variable?

Let me ask the question in detail with an example:
I have a historical data set with columns (a,b,c,d,e,f,g)
Now I have to predict (b,c,d,e,f,g) based on the value of 'a'.
Just replace a,b,c,d,e,f,g with a real world example.
Consider a data set which contains the daily revenue of a bike rental store, based on the number of rentals and the cost per hour of a rental.
Now my goal is to predict the number of rentals per month and the cost per hour needed to reach my revenue goal of $50k.
Can this be done? I just need some direction on how to do it.
You basically want to maximize:
P(B|A)*P(C|A,B)*P(D|A,B,C)*P(E|A,B,C,D)*P(F|A,B,C,D,E)*P(G|A,B,C,D,E,F)
If the data B,C,D,E,F,G are all i.i.d. (but do depend on A), you are basically trying to maximize:
P = P(B|A)*P(C|A)*P(D|A)*P(E|A)*P(F|A)*P(G|A)
One approach to solve it is with Supervised Learning.
Train a set of classifiers (or regressors, depending on the values of B,C,D,E,F,G): A->B, A->C ... A->G with your historical data, and when given a query of some value a, use all classifiers/regressors to predict the values of b,c,d,e,f,g.
The "trick" is to use multiple learners for multiple outputs. Note that in the case of I.I.D dependent variables, there is no loss on doing that, since maximizing every P(Z|A) seperatedly, also maximizes P.
If the data are not i.i.d., the problem of maximizing P(B|A)*P(C|A,B)*P(D|A,B,C)*P(E|A,B,C,D)*P(F|A,B,C,D,E)*P(G|A,B,C,D,E,F) is NP-hard; it is reducible from the Hamiltonian-path problem (set P(X_i|X_{i-1},...,X_1) > 0 iff there is an edge (X_{i-1}, X_i), and look for a path with non-zero probability).
This is a typical classification problem; a trivial example is "if a > 0.5 then b=1 and c=0 ..., while if a <= 0.5 then b=0 and c=1 ...".
You may like to look at nnet or h2o.

How to compute the global variance (squared standard deviation) in a parallel application?

I have a parallel application in which I am computing, on each node, the variance of each partition of data points based on its calculated mean, but how can I compute the global variance (the sum of all the variances)?
I thought that it would be a simple sum of the variances divided by the number of nodes, but it is not giving me a close result...
The global variation (the total sum of squared deviations) is a sum.
You can compute parts of the sum in parallel trivially, and then add them together.
sum(x1...x100) = sum(x1...x50) + sum(x51...x100)
In the same way, you can compute the global average: compute the global sum and the total count of objects, then divide (don't divide by the number of nodes, but by the total number of objects).
mean = sum/count
Once you have the mean, you can compute the sum of squared deviations using the distributed sum formula above (applied to (xi-mean)^2), then divide by count-1 to get the variance.
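A minimal sketch of this two-pass approach (partitions represented here simply as a list of numeric vectors; a real application would do the per-partition parts on the worker nodes):
# Hypothetical partitions, one per node
set.seed(42)
partitions <- split(rnorm(1000, mean = 5, sd = 2), rep(1:4, each = 250))
# Pass 1: per-partition sums and counts -> global mean
total.sum   <- sum(sapply(partitions, sum))
total.count <- sum(sapply(partitions, length))
global.mean <- total.sum / total.count
# Pass 2: per-partition sums of squared deviations from the global mean
ssd <- sum(sapply(partitions, function(x) sum((x - global.mean)^2)))
ssd / (total.count - 1)        # global variance
var(unlist(partitions))        # reference: same result on the pooled data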
Do not use E[X^2] - (E[X])^2
While this formula ("mean of the squares minus the square of the mean") is highly popular, it is numerically unstable when you are using floating-point math. The problem is known as catastrophic cancellation.
Because the two values can be very close, you lose a lot of digits of precision when computing the difference. I've seen people get a negative variance this way...
With "big data", numerical problems get worse...
Two ways to avoid these problems:
Use two passes. Computing the mean is stable, and gets you rid of the subtraction of the squares.
Use an online algorithm such as the one by Knuth and Welford, then use weighted sums to combine the per-partition means and variances (a sketch of this combining step follows below). Details are on Wikipedia. In my experience this is often slower, but it may be beneficial on Hadoop due to startup and IO costs.
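A sketch of that combining step (a Chan/Welford-style pairwise merge of per-partition summaries; again plain illustrative R, not tied to Hadoop or any particular framework):
# Per-partition summaries: count, mean, and sum of squared deviations (M2)
summarize <- function(x) list(n = length(x), mean = mean(x), M2 = sum((x - mean(x))^2))
# Merge two summaries without seeing the raw data again
combine <- function(s1, s2) {
  n     <- s1$n + s2$n
  delta <- s2$mean - s1$mean
  list(n = n,
       mean = s1$mean + delta * s2$n / n,
       M2   = s1$M2 + s2$M2 + delta^2 * s1$n * s2$n / n)
}
partitions <- split(rnorm(1000, mean = 5, sd = 2), rep(1:4, each = 250))  # hypothetical partitions
total <- Reduce(combine, lapply(partitions, summarize))
total$M2 / (total$n - 1)       # global variance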
You need to add the sums and sums of squares of each partition to get the global sum and sum of squares and then use them to calculate the global mean and variance.
UPDATE: E[X^2] - E[X]^2 and cancellation...
To figure out how important the cancellation error is when calculating the standard deviation with
σ = √(E[X^2] - E[X]^2)
let us assume that we have both E[X^2] and E[X]^2 accurate to 12 significant decimal figures. This implies that σ^2 has an error of order 10^-12 × E[X^2] or, if there has been significant cancellation, equivalently 10^-12 × E[X]^2, in which case σ will have an error of approximate order 10^-6 × E[X]: one millionth of the mean.
For many, if not most, statistical analyses this is negligible, in the sense that it falls within other sources of error (like measurement error), and so you can in good conscience simply set negative variances to zero before you take the square root.
If you really do care about deviations of this magnitude (and can show that it's a feature of the thing you are measuring and not, for example, an artifact of the method of measurement) then you can start worrying about cancellation. That said, the most likely explanation is that you have used an inappropriate scale for your data, such as measuring daily temperatures in Kelvin rather than Celsius!

Mathematical model to build a ranking/scoring system

I want to rank a set of sellers. Each seller is defined by parameters var1,var2,var3,var4...var20. I want to score each of the sellers.
Currently I am calculating the score by assigning weights to these parameters (say 10% to var1, 20% to var2, and so on); these weights are determined based on my gut feeling.
My score equation looks like
score = w1* var1 +w2* var2+...+w20*var20
score = 0.1*var1+ 0.5 *var2 + .05*var3+........+0.0001*var20
My score equation could also look like
score = w1^2* var1 +w2* var2+...+w20^5*var20
where var1,var2,..var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine what weights to assign?
I want to optimize these weights to revamp the scoring mechanism using some data-oriented approach to achieve a more relevant score.
example
I have following features for sellers
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches to solve this problem and calculate the score? I.e., I linearly added the various features; I would like to know a better approach to building the ranking system.
How do I come up with the values for the weights?
Apart from the above features, a few more that I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How will these fit into my score equation?
As a disclaimer, I don't think this is a concise answer, but your question is quite broad. This has not been tested, but it is an approach I would most likely take given a similar problem.
As a possible direction to go, consider the multivariate Gaussian density
f(x) = (2*pi)^(-n/2) * |Sigma|^(-1/2) * exp(-(1/2) * (x - Mu)^T * Sigma^(-1) * (x - Mu)).
The idea would be that each parameter is in its own dimension and therefore could be weighted by importance. Example:
Sigma = [1,0,0; 0,2,0; 0,0,3]: for a vector [x1, x2, x3], x1 would then have the greatest importance (its dimension has the smallest variance, so deviations there are penalized most).
The covariance matrix Sigma takes care of the scaling in each dimension. To achieve this, simply place the weights on the diagonal elements of an n x n diagonal matrix. You are not really concerned with the cross terms.
Mu is the average of all the logs in your data for your sellers and is a vector.
x is the mean of every category for a particular seller, written as a vector x = {x1, x2, x3, ..., xn}. This is continuously updated as more data are collected.
The parameters of the function based on the total dataset should evolve as well. That way, biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup, the evaluation of the function f(x) can be played with to give the desired results. This is a probability density function, but its utility is not restricted to statistics.
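A rough, untested sketch of scoring sellers along these lines (hypothetical normalized features; dmvnorm from the mvtnorm package evaluates the density, with the weights placed on the diagonal of Sigma):
library(mvtnorm)
set.seed(1)
# Hypothetical normalized features for 5 sellers and 3 parameters
sellers <- matrix(runif(15), nrow = 5,
                  dimnames = list(paste0("seller", 1:5), c("fulfillment", "cancel.rate", "rating")))
Mu    <- colMeans(sellers)         # average over all sellers
Sigma <- diag(c(1, 2, 3))          # smaller diagonal entry = that dimension counts more
score <- setNames(dmvnorm(sellers, mean = Mu, sigma = Sigma), rownames(sellers))
sort(score, decreasing = TRUE)     # higher density = closer to the weighted "typical" seller
Note that scoring with a density rewards sellers near the overall average; whether that is what you want, or whether Mu should instead be an ideal reference point, is a modeling choice to play with.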
