Probabilities of events grouped by their outcome - r

Say there are $n$ independent events. Event $i$ has a probability $p_i$ and an associated loss $l_i$. My goal is to produce a list of all possible loss amounts and their associated probabilities.
Eventually I would like to extend this to sets of 10-20 events with variable probabilities and loss amounts. This will all be done in R.
The various outcomes are given by the power set, e.g. for three events: (null), (A), (B), (C), (A and B), (A and C), (B and C), (A and B and C). The probability of each of these outcomes can be found by taking the product of the probabilities of the events in the subset, times the product of one minus the probability for each event not in the subset. The total loss is the sum of the losses in the subset.
My problem is how to aggregate by the loss amounts, i.e. to find all unique loss amounts in the power set and produce their probabilities.
I feel like I'm halfway there with the inclusion/exclusion principle, but I can't quite get my head around how to apply it to my particular problem, especially as the number of events goes above 3, or in the case of the sets of intermediate size, e.g. how to group all the 2 element sets above.

For a problem this small--there are at most 2^20 (around a million) possibilities--brute force works fine.
To illustrate, let's generate some data of moderate size:
n <- 15
set.seed(17)
p <- runif(n)
loss <- ceiling(rgamma(n, 3, 1/2))
signif(rbind(Probability=p, Loss=loss), 2)
Here are the input values for this example:
Probability 0.16 0.97 0.47 0.78 0.41 0.54 0.21 0.19 0.78 0.19 0.43 0.0023 0.83 0.83 0.96
Loss 12.00 4.00 10.00 8.00 10.00 6.00 12.00 5.00 4.00 8.00 8.00 8.0000 4.00 4.00 4.00
Generate a binary indicator of the power set with expand.grid and then use array operations for relatively fast calculation of the losses and the probabilities of all the possible outcomes:
powerset <- t(expand.grid(lapply(p, function(x) 0:1)))
probability <- apply(powerset * (2*p - 1) + (1-p), 2, prod)
losses <- colSums(powerset * loss)
(On this aging Xeon workstation, this takes up to 5 seconds when n is 20.)
Summarize by loss using tapply:
x <- tapply(probability, losses, sum)
(This takes another 1 to 2 seconds when n is 20.)
We can check for consistency by (a) verifying the probabilities sum to unity and (b) checking that the expected loss is the sum of the expected losses of the individual events:
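# Compare with a tolerance: exact equality rarely holds for floating-point sums
if(!isTRUE(all.equal(sum(probability), 1))) warning("Unnormalized probability.")
if(!isTRUE(all.equal(sum(probability * losses), sum(p*loss)))) warning("Inconsistent result.")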
Let's plot the resulting loss distribution.
library(ggplot2)
ggplot(data.frame(Loss=as.numeric(names(x)), Probability=x),
aes(Loss, Probability)) +
geom_col(color="White")
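For larger sets of events, where enumerating the power set becomes slow, the same distribution can be built up one event at a time. The following is a sketch of that alternative, not the method used above. It assumes the losses are non-negative integers (as in the example), and `loss_distribution` is a name introduced here for illustration.

```r
# Sketch: convolve the loss distribution one event at a time.
# Assumes every loss is a non-negative integer.
loss_distribution <- function(p, loss) {
  dist <- 1  # dist[k + 1] = P(total loss == k); initially P(0) = 1
  for (i in seq_along(p)) {
    stay  <- c(dist * (1 - p[i]), numeric(loss[i]))  # event i does not occur
    shift <- c(numeric(loss[i]), dist * p[i])        # event i occurs: shift by loss[i]
    dist <- stay + shift
  }
  names(dist) <- seq_along(dist) - 1
  dist
}

# Same example data as above
n <- 15
set.seed(17)
p <- runif(n)
loss <- ceiling(rgamma(n, 3, 1/2))
d <- loss_distribution(p, loss)
sum(d)  # should be 1 up to rounding error
```

Unlike the tapply summary, the result also contains zero-probability entries for totals that cannot occur. The work grows with the number of events times the maximum total loss rather than with 2^n.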


How can I make an asymmetric correlation in r?

I sometimes use the psych::corr.test function with two data frames, like this:
df1 <- tibble(a=c(1,2,4,5,67,21,21,65,1,5), b=c(21,5,2,6,8,4,2,6,2,2))
df2 <- tibble(a=c(1,2,3,4,5,6,7,8,9,8), b=c(1,6,54,8,3,8,9,5,2,1), c=c(1,4,6,8,5,3,9,7,5,4))
corr <- corr.test(df1,df2, adjust = "BH")
And I was getting p-values from corr$p.adj
But sometimes it gives me strange repetitive p.values like:
a b c
a 0.5727443 0.5964993 0.5727443
b 0.2566757 0.5727443 0.2566757
Does anyone know how adequate these p-values are? Can we do this with the corr.test? If not, how can I make an asymmetric correlation?
I should stress that if I try to perform a symmetric correlation like
df <- bind_cols(df1,df2[-3])
corr <- corr.test(df, adjust = "BH")
its p-values are not so repetitive:
Probability values (Entries above the diagonal are adjusted for multiple tests.)
a...1 b...2 a...3 b...4
a...1 0.00 0.97 0.62 0.72
b...2 0.97 0.00 0.38 0.62
a...3 0.39 0.06 0.00 0.62
b...4 0.60 0.40 0.41 0.00
UPD: Okay, I realised that it's as repetitive as the first and I'm a bit stupid.
The BH correction is based on computing the cumulative minimum of n/i * p, where p holds your n = 6 unadjusted p-values in decreasing order and i is 6:1. (You can see the calculation in the source of stats::p.adjust.)
Because it's a cumulative minimum (i.e. the first value, then the min of the first and second, then the min of the first to third, etc.) there are likely to be repetitions.
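To make the cumulative-minimum step concrete, here is a small sketch that redoes the BH adjustment by hand and compares it with p.adjust. The six p-values are made up for illustration and are not the ones from the question.

```r
# Six hypothetical unadjusted p-values (not the asker's data)
p_raw <- c(0.01, 0.04, 0.05, 0.20, 0.21, 0.60)
n <- length(p_raw)

o  <- order(p_raw, decreasing = TRUE)  # positions of the p-values, largest first
ro <- order(o)                         # permutation that restores the original order
i  <- n:1                              # rank of each sorted p-value

# BH by hand: scale by n/i, take the running minimum, cap at 1, restore order
bh_manual <- pmin(1, cummin(n / i * p_raw[o]))[ro]

bh_manual
p.adjust(p_raw, method = "BH")  # should agree with bh_manual
any(duplicated(bh_manual))      # repeated values come from the running minimum
```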

Simulating correlated Bernoulli data

I want to simulate 100 rows of data with 5 columns, with a correlation of 0.5 between the columns. To do this, I first set up the following correlation matrix:
F1 <- matrix( c(1, .5, .5, .5,.5,
.5, 1, .5, .5,.5,
.5, .5, 1, .5,.5,
.5, .5, .5, 1,.5,
.5, .5, .5, .5,1
), 5,5)
To simulate the intended data frame, I have done this, but it does not work properly.
df2 <- as.data.frame (rbinom(100, 1,.5),ncol(5), F1)
I'm surprised this isn't a duplicate (a similar existing question refers specifically to non-binary responses, i.e. binomial with N>1). The bindata package does what you want.
library(bindata)
## set up correlation matrix (compound-symmetric with rho=0.5)
m <- matrix(0.5,5,5)
diag(m) <- 1
Simulate with a mean of 0.5 (as in your example):
set.seed(101)
## this simulates 10 rather than 100 realizations
## (I didn't read your question carefully enough)
## but it's easy to change
r <- rmvbin(n=10, margprob=rep(0.5,5), bincorr=m)
round(cor(r),2)
Results
1.00 0.22 0.80 0.05 0.22
0.22 1.00 0.00 0.65 1.00
0.80 0.00 1.00 -0.09 0.00
0.05 0.65 -0.09 1.00 0.65
0.22 1.00 0.00 0.65 1.00
This looks wrong, because the correlations aren't exactly 0.5, but on average they will be (when I sampled 10,000 vectors rather than 10, the values ranged from about 0.48 to 0.51). Equivalently, if you simulated many samples of 10 and computed the correlation matrix for each, you should find that the expected (average) correlation matrix is correct.
Simulating values with correlation exactly equal to the specified value is much harder (and not necessarily what you want to do anyway, depending on the application).
Note that there are limits on which mean vectors and correlation matrices are feasible. For example, the off-diagonal elements of an n-by-n compound-symmetric (equal-correlation) matrix can't be less than -1/(n-1). Similarly, there may be limits on what correlations are possible for a given set of means (this may be discussed in the technical reference, I haven't checked).
The reference for the method used by bindata is
Leisch, Friedrich and Weingessel, Andreas and Hornik, Kurt (1998) On the generation of correlated artificial binary data. Working Papers SFB "Adaptive Information Systems and Modelling in Economics and Management Science", 13. SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, Vienna. https://epub.wu.ac.at/286/
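The -1/(n-1) bound for an equal-correlation matrix mentioned above can be checked numerically: a valid correlation matrix cannot have a negative eigenvalue. This is a minimal base-R sketch; `min_eig_cs` is a helper name introduced here, not part of bindata.

```r
# Smallest eigenvalue of an n-by-n compound-symmetric correlation matrix
min_eig_cs <- function(n, rho) {
  m <- matrix(rho, n, n)
  diag(m) <- 1
  min(eigen(m, symmetric = TRUE, only.values = TRUE)$values)
}

n <- 5
min_eig_cs(n, 0.5)          # positive: a valid correlation matrix
min_eig_cs(n, -1/(n - 1))   # about zero: the boundary case
min_eig_cs(n, -0.3)         # negative: not a valid correlation matrix
```

Passing this check is only a necessary condition. Binary data impose further restrictions that depend on the marginal probabilities.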

Mixed integer programming R: Least absolute deviation with cost associated with each regressor

I have been presented with a problem regarding the minimization of the absolute error, known as LAD (least absolute deviation). However, since each regressor is the result of an expensive test with an associated cost, one should refrain from using regressors that don't explain variance to a high degree. It takes the following equations:
Where N is the total number of observations, E the deviation associated with observation i, S the number of independent variables, lambda a penalty coefficient for the cost, and C the cost associated with performing the test.
So far, I have set it up in the usual way. To make it linear, I split the absolute value into two errors, $E_i^+$ and $E_i^-$, where $E_i = y_i - (B_0 + \sum_j B_j X_{ij})$, with the following constraints:
$z_j \in \{0,1\}$, a binary value indicating whether regressor $j$ enters my model.
$B_j \le M z_j$; $B_j \ge -M z_j$
$E_i^+, E_i^- \ge 0$
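Putting these pieces together, the model described above can presumably be written as follows. This is a reconstruction from the symbol definitions in the text, so the exact form of the objective is an assumption.

```latex
\begin{aligned}
\min_{B,\,E^{+},\,E^{-},\,z}\quad
  & \sum_{i=1}^{N} \left(E_i^{+} + E_i^{-}\right) + \lambda \sum_{j=1}^{S} C_j z_j \\
\text{s.t.}\quad
  & y_i - \Big(B_0 + \sum_{j=1}^{S} B_j X_{ij}\Big) = E_i^{+} - E_i^{-}, && i = 1,\dots,N \\
  & -M z_j \le B_j \le M z_j, && j = 1,\dots,S \\
  & E_i^{+},\ E_i^{-} \ge 0, \qquad z_j \in \{0,1\}
\end{aligned}
```

At an optimum at most one of $E_i^{+}$ and $E_i^{-}$ is non-zero, so their sum equals $|E_i|$.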
A toy subset of data I'm working has the following structure:
For y
quality
1 5
2 5
3 5
4 6
5 7
6 5
For the regressors
fixed.acidity volatile.acidity citric.acid
1 7.5 0.610 0.26
2 5.6 0.540 0.04
3 7.4 0.965 0.00
4 6.7 0.460 0.24
5 6.1 0.400 0.16
6 9.7 0.690 0.32
And for the cost
fixed.acidity volatile.acidity citric.acid
1 0.26 0.6 0.52
So far, my code looks like this:
# loading the matrixes
y <- read.csv(file="PATH\\y.csv", header = TRUE, sep = ",") #dim=100*11
regresores <- read.csv(file="PATH\\regressors.csv", header = TRUE, sep = ",")#dim=100*1
cost <- read.csv(file="PATH\\cost.csv", header = TRUE, sep = ",")#dim=1*11
for (i in seq(0, 1, by = 0.1)){#so as to have a collection of models with different penalties
obj.fun <- c(1,1,i*coste)
constr <- matrix(
c(y,regresores,-regresores),
c(-y,-regresores,regresores),
sum(regresores),ncol = ,byrow = TRUE)
constr.dir <- c("<=",">=","<=","==")
rhs<-c(regresores,-regresores,1,binary)
sol<- lp("min", obj.fun, constr, constr.tr, rhs)
sol$objval
sol$solution}
I know there is a LAD function in R, but for consistency's sake with my colleagues, as well as a pretty annoying PhD tutor, I have to do this using lpSolve in R. I have just started with R for this project and I don't know exactly why this won't run. Is there something wrong with the syntax or with my formulation of the model? Right now, the main problem I have is:
"Error in matrix(c(y, regressors, -regressors), c(-y, -regressors, regressors), : non-numeric matrix extent".
Mainly, I intended for it to create said weighted LAD model and have it return the different values of lambda, from 0 to 1 in a 0.1 step.
Thanks in advance and sorry for any inconvenience, neither English nor R are my native languages.
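The error message arises because the second argument of matrix() is nrow, which must be a single number, while the code passes it a whole vector of data. One way to avoid this is to assemble the constraint matrix with cbind and rbind. The following is a sketch under stated assumptions, not a verified solution to the original problem: it uses the toy data from the post, splits each free coefficient into positive and negative parts because lpSolve variables are non-negative, and uses an arbitrary big-M value of 100. The names `fit_rows`, `link_pos`, `link_neg` and `lad_objective` are introduced here for illustration.

```r
# Toy data copied from the post
y <- c(5, 5, 5, 6, 7, 5)
X <- cbind(fixed.acidity    = c(7.5, 5.6, 7.4, 6.7, 6.1, 9.7),
           volatile.acidity = c(0.610, 0.540, 0.965, 0.460, 0.400, 0.690),
           citric.acid      = c(0.26, 0.04, 0.00, 0.24, 0.16, 0.32))
cost <- c(0.26, 0.60, 0.52)
N <- nrow(X); S <- ncol(X)
M <- 100  # assumed big-M bound on |B_j|; must exceed any plausible coefficient

# Column layout: B0+, B0-, B+ (S), B- (S), E+ (N), E- (N), z (S)
# Fit rows: B0 + sum_j B_j X_ij + E_i+ - E_i- = y_i
fit_rows <- cbind(1, -1, X, -X, diag(N), -diag(N), matrix(0, N, S))
# Linking rows: B_j+ - M z_j <= 0 and B_j- - M z_j <= 0
link_pos <- cbind(matrix(0, S, 2), diag(S), matrix(0, S, S),
                  matrix(0, S, 2 * N), -M * diag(S))
link_neg <- cbind(matrix(0, S, 2), matrix(0, S, S), diag(S),
                  matrix(0, S, 2 * N), -M * diag(S))
constr     <- rbind(fit_rows, link_pos, link_neg)
constr.dir <- c(rep("==", N), rep("<=", 2 * S))
rhs        <- c(y, rep(0, 2 * S))
z_idx      <- 2 + 2 * S + 2 * N + seq_len(S)  # columns holding the binary z_j

# Objective: sum of E+ and E- plus lambda times the cost of selected regressors
lad_objective <- function(lambda) {
  c(rep(0, 2 + 2 * S), rep(1, 2 * N), lambda * cost)
}

# Solve for a grid of penalties (only runs if lpSolve is installed)
if (requireNamespace("lpSolve", quietly = TRUE)) {
  for (lambda in seq(0, 1, by = 0.1)) {
    sol <- lpSolve::lp("min", lad_objective(lambda), constr, constr.dir, rhs,
                       binary.vec = z_idx)
    cat(sprintf("lambda = %.1f  objective = %.4f  selected: %s\n",
                lambda, sol$objval,
                paste(colnames(X)[sol$solution[z_idx] > 0.5], collapse = ", ")))
  }
}
```

Splitting each coefficient into two non-negative parts adds columns but keeps every variable within lpSolve's default lower bound of zero.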

RandomForest() to return probabilities of positive result as well as classification

I'm building a Random Forest classifier and I would like to return the classification and the associated probabilities. My result variable is either 1 or 0, 1 being the positive class that I want to track.
no_of_trees <- 50
rf.under <- randomForest(as.factor(result) ~ . ,
data=data_balanced_under,
importance=TRUE,
ntree=no_of_trees)
prediction <- predict(rf.under, df.test)
probability <- predict(rf.under, df.test, type="prob")
submit <- data.frame( predicted = prediction, actual = df.test$result)
I wanted probability to return the probability of positive results, however I get:
> probability
0 1
242339 1.00 0.00
3356431 1.00 0.00
138327 1.00 0.00
111327 1.00 0.00
3307151 1.00 0.00
222414 1.00 0.00
1817297 1.00 0.00
3860922 1.00 0.00
1710532 1.00 0.00
in my output. What are the numbers on the left? I thought they were row numbers, but then why aren't they indexed 1, 2, 3, ...?
I tried to get probability[,2], which I assume gives me the probability of the positive result, but that doesn't work either.
Ideally, I would like to include the probabilities in the submit data frame, but I am currently unable to do so.
Also, confusion matrix gives me:
confusionMatrix(data = submit$predicted, reference = df.test$result , positive="1")
#Reference
Prediction 0 1
0 913730 160
1 50872 8219
Is it possible to switch this around? So that it shows positive class "1" first?
probability returns the probability for each class (here you have two classes, so two columns).
It has been built this way to allow multiclass classification.
If you want the probability of result == 1, just take the second column of probability.
Since you have highly unbalanced classes (0.8% of ones), your classifier tends to predict that it is always 0... So your probability of result==1 is close to 0 for most examples. This is why your probabilities don't look like probabilities.
Regarding the index of probability, it is rownames(df.test), the index of df.test. I guess you randomly split df.test from df, so the index doesn't start at 1.
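The steps above can be sketched in base R without fitting a forest. The matrix below is a hypothetical stand-in for the output of predict(rf.under, df.test, type="prob"), and its values are made up. The last step reorders the factor levels so that the positive class is listed first in a cross-tabulation, which is one possible way to address the final question.

```r
# Hypothetical stand-in for predict(rf.under, df.test, type = "prob"):
# one column per class, row names taken from the test set
probability <- matrix(c(1.00, 0.00,
                        0.30, 0.70,
                        0.85, 0.15),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("242339", "3356431", "138327"),
                                      c("0", "1")))
actual    <- factor(c(0, 1, 0), levels = c(0, 1))  # made-up true classes
predicted <- factor(ifelse(probability[, "1"] > 0.5, 1, 0), levels = c(0, 1))

# Select the positive-class column by name rather than by position
submit <- data.frame(predicted = predicted,
                     actual = actual,
                     prob_positive = probability[, "1"])

# Put the positive class first by reordering the factor levels
lv  <- c("1", "0")
tab <- table(Prediction = factor(submit$predicted, levels = lv),
             Reference  = factor(submit$actual, levels = lv))
tab
```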

Extracting/Exporting the Data of the Empirical Cumulative Distribution Function in R (ecdf)

I use R to calculate the ecdf of some data. I want to use the results in another software. I use R just to do the 'work' but not to produce the final diagram for my thesis.
Example Code
# Plotting a built-in sample data set
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
Output
> # Plotting a built-in sample data set
> plot(cars$speed)
> # Assigning the data to a new variable name
> myData = cars$speed
> # Calculating the ecdf
> myResult = ecdf(myData)
> myResult
Empirical CDF
Call: ecdf(myData)
x[1:19] = 4, 7, 8, ..., 24, 25
> # Plotting the ecdf
> plot(myResult)
> plot(cars$speed)
Questions
Question 1
How do I get the raw information in order to plot the ecdf diagram in another software (e. g. Excel, Matlab, LaTeX)? For the histogram function I can just write
res = hist(...)
and I find all the information like
res$breaks
res$counts
res$density
res$mids
res$xname
Question 2
How do I calculate the inverse ecdf? Say I want to know how many cars have a speed below 10 mph (the example data is car speed).
Update
Thanks to the answer of user777 I have more information now. If I use
> myResult(0:25)
[1] 0.00 0.00 0.00 0.00 0.04 0.04 0.04 0.08 0.10 0.12 0.18 0.22 0.30 0.38
[15] 0.46 0.52 0.56 0.62 0.70 0.76 0.86 0.86 0.88 0.90 0.98 1.00
I get the data for 0 to 25 mph. But I do not know where to draw a data point. I do want to reproduce the R plot exactly.
Here I have a data point every 1 mph.
Here I do not have a data point every 1 mph. I only have a data point where there is data available.
Solution
# Plotting a built-in sample data set
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
# Have a look at the probability for 0 to 25 mph
myResult(0:25)
# Have a look at the probability, but only where there are data points
myResult(unique(myData))
# Saving the result to a directory
write.csv(cbind(unique(myData), myResult(unique(myData))), file="D:/myResult.txt")
The file myResult.txt looks like
"","V1","V2"
"1",4,0.04
"2",7,0.08
"3",8,0.1
"4",9,0.12
"5",10,0.18
"6",11,0.22
"7",12,0.3
"8",13,0.38
"9",14,0.46
"10",15,0.52
"11",16,0.56
"12",17,0.62
"13",18,0.7
"14",19,0.76
"15",20,0.86
"16",22,0.88
"17",23,0.9
"18",24,0.98
"19",25,1
Meaning
Attention: I have a German Excel so the decimal symbol is comma instead of the dot.
The output of ecdf is a function, among other object types. You can verify this with class(myResult), which displays the (S3) classes of the object myResult.
If you enter myResult(unique(myData)), R evaluates the ecdf object myResult at all distinct values appearing in myData and prints the results to the console. To save the output, you can enter write.csv(cbind(unique(myData), myResult(unique(myData))), file="C:/Documents/My ecdf.csv"), which writes it to that file path.
The ecdf doesn't tell you how many cars are above/below a specific threshold; rather, it states the probability that a randomly selected car from your data set is above/below the threshold. If you're interested in the number of cars satisfying some criteria, just count them. myData[myData<=10] returns the data elements, and length(myData[myData<=10]) tells you how many of them there are.
Assuming you mean that you want to know the sample probabilities that a randomly-selected car from your data is below 10 mph, that's the value given by myResult(10).
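The points above can be collected into a short base-R sketch. It uses knots() to get the jump locations of the step function, which for this data are the sorted distinct speeds, and a type-1 sample quantile as one common definition of the inverse ECDF. The output file name in the comment is hypothetical.

```r
Fn <- ecdf(cars$speed)

# Jump locations and the ECDF value reached at each jump
jumps <- data.frame(speed = knots(Fn), ecdf = Fn(knots(Fn)))
head(jumps)
# write.csv(jumps, "ecdf_cars.csv", row.names = FALSE)  # hypothetical file name

# Proportion and count of cars at or below 10 mph
Fn(10)
sum(cars$speed <= 10)

# Inverse ECDF: smallest speed whose ECDF value is at least 0.5
quantile(cars$speed, probs = 0.5, type = 1)
```

Exporting the `jumps` table gives one row per step, which is usually what a step plot in another program needs.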
As I see it, your main requirement is to reproduce the jumps at each x value. Try this:
> x <- c(cars$speed, cars$speed, 1, 28)
> y <- c((0:49)/50, (1:50)/50, 0, 1)
> ord <- order(x)
> plot(y[ord] ~ x[ord], type="l")
The first 50 (x,y) pairs are the beginnings of the jumps, the next 50 are the ends, and the last two give you starting and ending values at $(x_1-3,0)$ and $(x_{50}+3,1)$. Then you need to sort the values in increasing order of $x$.
