I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps.
Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days = 10%, one day = 40%, two days = 50%. Let "b" represent the probability distribution function of how long it takes to complete process "B". Zero days = 10%, one day = 20%, etc.
Process "B" can't be started until process "A" is complete, so "B" is dependent upon "A".
a <- c(.1, .4, .5)
b <- c(.1,.2,.3,.3,.1)
How can I calculate the probability density function of the time to complete "A" and "B"?
This is what I'd expect as the output for the following example:
totallength <- 0 # initialize
totallength[1:(length(a) + length(b))] <- 0 # initialize
totallength[1] <- a[1]*b[1]
totallength[2] <- a[1]*b[2] + a[2]*b[1]
totallength[3] <- a[1]*b[3] + a[2]*b[2] + a[3]*b[1]
totallength[4] <- a[1]*b[4] + a[2]*b[3] + a[3]*b[2]
totallength[5] <- a[1]*b[5] + a[2]*b[4] + a[3]*b[3]
totallength[6] <- a[2]*b[5] + a[3]*b[4]
totallength[7] <- a[3]*b[5]
print(totallength)
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
sum(totallength)
[1] 1
I have an approach in Visual Basic that uses three for loops (one for each of the steps, and one for the output), but I hope I don't have to loop in R.
Since this seems to be a pretty standard process flow question, part two of my question is whether any libraries exist to model operations flow so I'm not creating this from scratch.
The efficient way to do this sort of operation is to use a convolution:
convolve(a, rev(b), type="open")
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
This is convenient because it's much less typing than computing each value individually, and efficient because it's implemented using the Fast Fourier Transform (FFT).
You can confirm that each of these values is correct using the formulas you posted:
(expected <- c(a[1]*b[1],
               a[1]*b[2] + a[2]*b[1],
               a[1]*b[3] + a[2]*b[2] + a[3]*b[1],
               a[1]*b[4] + a[2]*b[3] + a[3]*b[2],
               a[1]*b[5] + a[2]*b[4] + a[3]*b[3],
               a[2]*b[5] + a[3]*b[4],
               a[3]*b[5]))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
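If you later need to chain more than two process steps, the same idea extends by folding the convolution over a list of per-step distributions. A quick sketch (the third probability vector is invented purely for illustration):
step_pmfs <- list(a, b, c(.2, .5, .3))  # one probability vector per step, each starting at 0 days
total <- Reduce(function(p, q) convolve(p, rev(q), type = "open"), step_pmfs)
round(total, 4)  # probabilities for total durations of 0, 1, 2, ... days
sum(total)       # should still be (numerically) 1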
See the distr package. Choosing the term "multiply" is unfortunate, since the situation described is not one where independent contributions to a probability are being multiplied (where "multiplication" would be the natural term to use). It is rather a sequential addition of durations, and that is exactly what the distr package provides as its interpretation of what "+" should mean when applied symbolically to two discrete distributions.
library(distr)
A <- DiscreteDistribution(setNames(0:2, c("Zero", "one", "two")), a)
B <- DiscreteDistribution(setNames(0:4, c("Zero2", "one2", "two2",
                                          "three2", "four2")), b)
?'operators-methods' # where operations on 2 DiscreteDistribution are convolution
plot(A+B)
After a bit of nosing around I see that the actual numeric values can be found here:
A.then.B <- A + B
> environment(A.then.B@d)$dx
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
Seems like there should be a method for displaying the probabilities, and since I'm not a regular user of this fascinating package there may well be one. Do read the vignette and the code demos, which I have not yet done. Further noodling around convinces me that the right place to look is the companion package distrDoc, whose vignette is 100+ pages long. It shouldn't have required any effort to find it, either, since that advice is in the messages printed when the package is loaded; in my defense, there were a couple of pages of messages, so it was more tempting to jump straight into coding and the help pages.
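For what it's worth, the d()/p()/q()/r() accessor functions in distr appear to give the same numbers without reaching into the environment; a sketch, not checked against every version of the package:
d(A.then.B)(0:6)  # probability of each total duration 0..6; should match the dx values above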
I'm not familiar with a dedicated package that does exactly what your example describes, but let me suggest a more robust solution for this problem.
You are looking for a way to estimate the distribution of a process composed of n sequential steps (two in your case), which may not always be as easy to compute analytically as your example.
The approach I would use is simulation: draw, say, 10,000 observations from the underlying distributions and then calculate the density of the simulated totals.
Using your example, we can do the following:
x <- runif(10000)
y <- runif(10000)
library(data.table)
z <- as.data.table(cbind(x,y))
z[x>=0 & x<0.1, a_days:=0]
z[x>=0.1 & x<0.5, a_days:=1]
z[x>=0.5 & x<=1, a_days:=2]
z[y>=0 & y<0.1, b_days:=0]
z[y>=0.1 & y<0.3, b_days:=1]
z[y>=0.3 & y<0.6, b_days:=2]
z[y>=0.6 & y<0.9, b_days:=3]
z[y>=0.9 & y<=1, b_days:=4]
z[,total_days:=a_days+b_days]
hist(z[,total_days])
This gives a very good proxy of the density, and the approach would also work if your second process were drawn from, say, an exponential distribution, in which case you'd use the rexp function to generate b_days directly.
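A minimal sketch of that exponential variant, reusing a from above (the rate of 1 is just a placeholder, not something taken from the question):
set.seed(42)
n <- 10000
a_days <- sample(0:2, n, replace = TRUE, prob = a)  # simulate process A from its discrete distribution
b_days <- rexp(n, rate = 1)                         # hypothetical exponential duration for process B
total_days <- a_days + b_days
plot(density(total_days))                           # simulated distribution of the total time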
Related
The createDataPartition function in the caret package can split data while preserving the relative distribution of each outcome rating. I am looking for something similar, but one that can preserve groups and handle ordinal data.
I am trying to specify the target distribution of my outcomes. I want to preserve groups and see which groups I should conduct a follow-up experiment with if I want to reach a target distribution (rather than simply the current distribution). I have made code that attempts this in a very blunt way:
# Load data
library(rethinking)
data(Trolley)
d <- Trolley
# Inspect current distribution of ratings
d$response <- factor(d$response)
round(summary(d$response)/dim(d)[1],2)
# Find 5 cases that roughly have my target distribution
targetdist <- c(0.3,0.1,0.1,0.1,0.1,0.1,0.1) # Arbitrary goal
# Unique cases
uniqcase <- unique(d$case)
# Poor method
runs <- 100
difmatrix <- matrix(NA,runs,2)
for(i in 1:runs){
# Take subset
difmatrix[i,1] <- i
set.seed(i)
casetests<- sample(uniqcase,5)
datasub <- subset(d, case %in% casetests)
# Find ratings of subset
difmatrix[i,2] <- sum(abs(round(summary(datasub$response)/dim(datasub)[1],2)-targetdist))
}
difmatrix[which.min(difmatrix[,2]),]
# Look at best distribution
set.seed(which.min(difmatrix[,2]))
casetests<- sample(uniqcase,5)
datasub <- subset(d, case %in% casetests)
round(summary(datasub$response)/dim(datasub)[1],2) # Current best distribution
In this toy example, the overall distribution in the data is:
0.13 0.09 0.11 0.23 0.15 0.15 0.15
I aim to get a distribution of
0.30 0.10 0.10 0.10 0.10 0.10 0.10
and the best run gets me:
0.21 0.12 0.12 0.22 0.12 0.11 0.10
I cannot help but think there is a better way to do it. For my actual case, I want to select about 200 from a group of 10,000, so it seems unlikely that I can luck into a good choice.
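For reference, the same blunt search can at least be vectorized so that many more candidate subsets are tried; a sketch reusing the toy data above (the 10,000 draws and the subset size of 5 are arbitrary choices):
score_subset <- function(cases) {
  datasub <- subset(d, case %in% cases)
  sum(abs(round(summary(datasub$response) / dim(datasub)[1], 2) - targetdist))
}
set.seed(1)
candidates <- replicate(10000, sample(uniqcase, 5), simplify = FALSE)
scores <- vapply(candidates, score_subset, numeric(1))
casetests <- candidates[[which.min(scores)]]  # best of the random subsets
score_subset(casetests)
This is still brute force, though, which is exactly what I am hoping to avoid.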
Thanks for reading. I hope this makes sense at all. I have been working on it for a while, yet still have issues formulating it concisely.
I have a question on a specific implementation of a Nelder-Mead algorithm (1) that handles box constraints in an unusual way. I cannot find anything about it in any paper (25 papers), textbook (I searched 4 of them) or on the internet.
I have a typical optimisation problem: min f(x) with a box constraint -0.25 <= x_i <= 250
The expected approach would be to use a penalty function and make sure that all values of f(x) are "unattractive" when x is out of bounds.
The algorithm in question works differently: the implementation does not touch f(x) at all. Instead it distorts the parameter space using the inverse hyperbolic tangent, atanh. The simplex algorithm can then operate freely in a space without bounds and pick any point. Before evaluating f(x) to assess the solution at x, the algorithm transforms the point back into the original space.
At first glance I found the idea ingenious: this way we avoid the disadvantages of penalty functions. But now I am having doubts. The distorted space affects termination behaviour. One termination criterion is the size of the simplex, and by inflating the parameter space with atanh we also inflate the simplex size.
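My reading of the transformation, as a sketch (the exact scaling used inside nmkb may differ; these helper functions are my own reconstruction, not the package's code):
to_unbounded <- function(x, lower, upper) atanh(2 * (x - lower) / (upper - lower) - 1)
to_box       <- function(z, lower, upper) lower + (upper - lower) * (tanh(z) + 1) / 2
to_box(to_unbounded(0.5, -2, 2), -2, 2)  # round trip recovers 0.5
The simplex moves around in the unbounded z-space, and f is evaluated at to_box(z).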
Experiments with the algorithm also show that it does not work as intended. I do not yet understand how this happens, but I do get results that are out of bounds. I can say that almost half of the returned local minima are out of bounds.
As an example, take a look at nmkb() optimising the Rosenbrock function when we gradually change the width of the box constraint:
rosbkext <- function(x) {
# Extended Rosenbrock function
n <- length(x)
sum (100*(x[1:(n-1)]^2 - x[2:n])^2 + (x[1:(n-1)] - 1)^2)
}
library(dfoptim)  # provides nmkb() and the unconstrained nmk()
np <- 6 #12
for (box in c(2, 4, 12, 24, 32, 64, 128)) {
set.seed(123)
p0 <- rnorm(np)
p0[p0 > +2] <- +2 - 1E-8
p0[p0 < -2] <- -2 + 1E-8
ctrl <- list(maxfeval = 5E4, tol = 1E-8)
o <- nmkb(fn = rosbkext, par = p0, lower = -box, upper = +box, control = ctrl)
print(o$message)
cat("f(", format(o$par, digits = 2), ") =", format(o$value, digits=3), "\n")
}
The output shows that it claims to converge, but in three cases it does not: for bounds of (-2, 2) and (-12, 12), which I might accept, but then it also fails at (-128, 128). I also tried the same with the unconstrained dfoptim::nmk(). No trouble there; it converges perfectly.
[1] "Successful convergence"
f( -0.99 0.98 0.97 0.95 0.90 0.81 ) = 3.97
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 4.42e-09
[1] "Successful convergence"
f( -0.99 0.98 0.97 0.95 0.90 0.81 ) = 3.97
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 1.3e-08
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 4.22e-09
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 8.22e-09
[1] "Successful convergence"
f( -0.99 0.98 0.97 0.95 0.90 0.81 ) = 3.97
Why does the constrained algorithm have more trouble converging than the unconstrained one?
Footnote (1): I am referring to the Nelder-Mead implementation used in the optimx package in R. This package calls another package dfoptim with the nmkb-function.
(This question has nothing to do with optimx, which is just a wrapper for R packages providing unconstrained optimization.)
The function in question is nmkb() in the dfoptim package for gradient-free optimization routines. The approach to transform bounded regions into unbounded spaces is a common one and can be applied with many different transformation functions, sometimes depending on the kind of the boundary and/or the type of the objective function. It may also be applied, e.g., to transform unbounded integration domains into bounded ones.
The approach is problematic if the optimum lies at the boundary, because the optimal point will be sent to (nearly) infinity and cannot ultimately be reached. The routine will not converge, or the solution will be quite inaccurate.
If you think the algorithm is not working correctly, you should write to the authors of that package and -- that is important -- add one or two examples for what you think are bugs or incorrect solutions. Without explicit code examples no one here is able to help you.
(1) Those transformations define bijective maps between bounded and unbounded regions and the theory behind this approach is obvious. You may read about possible transformations in books on multivariate calculus.
(2) The approach with penalties outside the bounds has its own drawbacks, for instance the target function will not be smooth at the boundaries, and the BFGS method may not be appropriate anymore.
(3) You could try the Hooke-Jeeves algorithm through function hjkb() in the same dfoptim package. It will be slower, but uses a different approach for treating the boundaries, no transformations involved.
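For example, on the rosbkext problem from the question (a sketch; the control settings are just the same placeholders used above):
library(dfoptim)
set.seed(123)
p0 <- rnorm(6)
p0[p0 > +2] <- +2 - 1E-8
p0[p0 < -2] <- -2 + 1E-8
o <- hjkb(par = p0, fn = rosbkext, lower = -2, upper = 2,
          control = list(maxfeval = 5E4, tol = 1E-8))
o$par
o$value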
EDIT (after discussion with Erwin Kalvelagen above)
There appear to be local minima (with some coordinates negative).
If you set the lower bounds to 0, nmkb() will find the global minimum (1,1,1,1,1,1) in any case.
Watch out: starting values have to be feasible, that is, all their coordinates must be greater than 0.
I am doing some projects related to statistics simulation using R, based on "Introduction to Scientific Programming and Simulation Using R". In the student projects section (Chapter 24) I am doing "The pipe spiders of Brunswick" problem, but I am stuck on one part of an evolutionary algorithm, where you need to perform some data perturbation according to the sentence below:
"With probability 0.5 each element of the vector is perturbed, independently
of the others, by an amount normally distributed with mean 0 and standard
deviation 0.1"
What does being "perturbed" really mean here? I don't really know which operation I should be doing to my vector to make this perturbation happen, and I'm not finding any answers to this problem.
Thanks in advance!
# using the most important features, we create a ML model:
m1 <- lm(PREDICTED_VALUE ~ PREDICTER_1 + PREDICTER_2 + PREDICTER_N )
#summary(m1)
#anova(m1)
# after creating the model, we perturb as follows:
#install.packages("perturb") #install the package
library(perturb)
set.seed(1234) # for same results each time you run the code
p1_new <- perturb(m1, pvars=c("PREDICTER_1","PREDICTER_N"), prange=c(1,1), niter=200) # you can change the number of iterations to any value n; the total number of iterations will be n+1
p1_new # check the values of p1_new
summary(p1_new)
Perturbing just means adding a small, noisy shift to a number. Your code might look something like this.
x = sample(10, 10)
ind = rbinom(length(x), 1, 0.5) == 1
x[ind] = x[ind] + rnorm(sum(ind), 0, 0.1)
rbinom gets the elements to be modified with probability 0.5 and rnorm adds the perturbation.
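The same perturbation in one line, under the same assumptions:
x = x + rbinom(length(x), 1, 0.5) * rnorm(length(x), 0, 0.1)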
I use R to calculate the ecdf of some data. I want to use the results in another software. I use R just to do the 'work' but not to produce the final diagram for my thesis.
Example Code
# Plotting a built-in sample data set
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
Output
> # Plotting a built-in sample data set
> plot(cars$speed)
> # Assigning the data to a new variable name
> myData = cars$speed
> # Calculating the ecdf
> myResult = ecdf(myData)
> myResult
Empirical CDF
Call: ecdf(myData)
x[1:19] = 4, 7, 8, ..., 24, 25
> # Plotting the ecdf
> plot(myResult)
> plot(cars$speed)
Questions
Question 1
How do I get the raw information in order to plot the ecdf diagram in other software (e.g. Excel, MATLAB, LaTeX)? For the histogram function I can just write
res = hist(...)
and I find all the information like
res$breaks
res$counts
res$density
res$mids
res$xname
Question 2
How do I calculate the inverse ecdf? Say I want to know how many cars have a speed below 10 mph (the example data is car speed).
Update
Thanks to the answer of user777 I have more information now. If I use
> myResult(0:25)
[1] 0.00 0.00 0.00 0.00 0.04 0.04 0.04 0.08 0.10 0.12 0.18 0.22 0.30 0.38
[15] 0.46 0.52 0.56 0.62 0.70 0.76 0.86 0.86 0.88 0.90 0.98 1.00
I get the data for 0 to 25 mph. But I do not know where to draw a data point. I do want to reproduce the R plot exactly.
With myResult(0:25) I have a data point at every 1 mph.
In the R ecdf plot I do not have a data point at every 1 mph; there is only a point where data are available.
Solution
# Plotting a built-in sample data set
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
# Have a look at the probability for 0 to 25 mph
myResult(0:25)
# Have a look at the probability only where there are data points
myResult(unique(myData))
# Saving the result to a file
write.csv(cbind(unique(myData), myResult(unique(myData))), file="D:/myResult.txt")
The file myResult.txt looks like
"","V1","V2"
"1",4,0.04
"2",7,0.08
"3",8,0.1
"4",9,0.12
"5",10,0.18
"6",11,0.22
"7",12,0.3
"8",13,0.38
"9",14,0.46
"10",15,0.52
"11",16,0.56
"12",17,0.62
"13",18,0.7
"14",19,0.76
"15",20,0.86
"16",22,0.88
"17",23,0.9
"18",24,0.98
"19",25,1
Meaning: the first column is the row number, V1 is the unique speed value, and V2 is the cumulative probability at that speed.
Attention: I have a German Excel so the decimal symbol is comma instead of the dot.
The output of ecdf is a function, among other things. You can verify this with class(myResult), which displays the classes of the object myResult.
If you enter myResult(unique(myData)), R evaluates the ecdf object myResult at all distinct values appearing in myData, and prints it to the console. To save the output you can enter write.csv(cbind(unique(myData), myResult(unique(myData))), file="C:/Documents/My ecdf.csv") to save it to that filepath.
The ecdf doesn't tell you how many cars are above/below a specific threshold; rather, it states the probability that a randomly selected car from your data set is above/below the threshold. If you're interested in the number of cars satisfying some criteria, just count them. myData[myData<=10] returns the data elements, and length(myData[myData<=10]) tells you how many of them there are.
Assuming you mean that you want to know the sample probabilities that a randomly-selected car from your data is below 10 mph, that's the value given by myResult(10).
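If the goal is to export exactly the points where the ecdf jumps, the knots() accessor is a more direct route than unique() (a sketch; the file name is arbitrary):
xs <- knots(myResult)   # the sorted unique speeds where the ecdf jumps
ys <- myResult(xs)      # cumulative probability just after each jump
write.csv(data.frame(speed = xs, ecdf = ys), file = "ecdf_points.csv", row.names = FALSE)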
As I see it, your main requirement is to reproduce the jumps at each x value. Try this:
> x <- c(cars$speed, cars$speed, 1, 28)
> y <- c((0:49)/50, (1:50)/50, 0, 1)
> ord <- order(x)
> plot(y[ord] ~ x[ord], type="l")
The first 50 (x,y) pairs are the beginnings of the jumps, the next 50 are the ends, and the last two give you starting and ending values at $(x_1-3,0)$ and $(x_{50}+3,1)$. Then you need to sort the values in increasing order of $x$.
First off, sorry about the long post. Figured it's better to give context to get good answers (I hope!). Some time ago I wrote an R function that will get all pairwise interactions of variables in a data frame. This worked fine at the time, but now a colleague would like me to do this with a much larger dataset. They don't know how many variables they are going to have in the end, but they are guessing approximately 2,500 - 3,000. My function below is way too slow for this (4 minutes for 100 variables). At the bottom of this post I have included some timings for various numbers of variables and total numbers of interactions. I have the results of calling Rprof() on the 100-variable run of my function, so if anyone wants to take a look at it let me know. I don't want to make a super long post any longer than it needs to be.
What I'd like to know is if there is anything I can do to speed this function up. I tried going directly to glm.fit, but as far as I understood, for that to be useful the computation of the design matrices and all of the other stuff that I frankly don't understand needs to be the same for each model, which is not the case for my analysis, although perhaps I am wrong about this.
Any ideas on how to make this run faster would be greatly appreciated. I am planning on using parallelization to run the analysis in the end but I don't know how many CPU's I am going to have access to but I'd say it won't be more than 8.
Thanks in advance,
Cheers
Davy.
getInteractions2 = function(data, fSNPcol, ccCol)
{
  # fSNPcol is the number of the column that contains the first SNP
  # ccCol is the number of the column that contains the outcome variable
  require(lmtest)
  a = data.frame()
  snps = names(data)[-1:-(fSNPcol-1)]
  names(data)[ccCol] = "PHENOTYPE"
  terms = as.data.frame(t(combn(snps, 2)))
  attach(data)
  fit1 = c()
  fit2 = c()
  pval = c()
  for(i in 1:length(terms$V1))
  {
    fit1 = glm(PHENOTYPE ~ get(as.character(terms$V1[i])) + get(as.character(terms$V2[i])),
               family = "binomial")
    fit2 = glm(PHENOTYPE ~ get(as.character(terms$V1[i])) + get(as.character(terms$V2[i])) +
                 I(get(as.character(terms$V1[i])) * get(as.character(terms$V2[i]))),
               family = "binomial")
    a = lrtest(fit1, fit2)
    pval = c(pval, a[2, "Pr(>Chisq)"])
  }
  detach(data)
  results = cbind(terms, pval)
  return(results)
}
The table below shows the system.time results for increasing numbers of variables passed through the function; n is the number of variables, and Ints is the number of pairwise interactions given by that number of variables.
n Ints user.self sys.self elapsed
time 10 45 1.20 0.00 1.30
time 15 105 3.40 0.00 3.43
time 20 190 6.62 0.00 6.85
...
time 90 4005 178.04 0.07 195.84
time 95 4465 199.97 0.13 218.30
time 100 4950 221.15 0.08 242.18
Some code to reproduce a data frame in case you want to look at timings or the Rprof() results. Please don't run this unless your machine is super fast, or you're prepared to wait for about 15-20 minutes.
df = data.frame(paste("sid",1:2000,sep=""),rbinom(2000,1,.5))
gtypes = matrix(nrow=2000, ncol=3000)
gtypes = apply(gtypes,2,function(x){x=sample(0:2, 2000, replace=T);x})
snps = paste("rs", 1000:3999,sep="")
df = cbind(df,gtypes)
names(df) = c("sid", "status", snps)
times = c()
for(i in seq(10,100, by=5)){
if(i==100){Rprof()}
time = system.time((pvals = getInteractions2(df[,1:i], 3, 2)))
print(time)
times = rbind(times, time)
if(i==100){Rprof(NULL)}
}
numI = function(n){return(((n^2)-n)/2)}
timings = cbind(seq(10,100,by=5), sapply(seq(10,100,by=5), numI),times)
So I have sort of solved this (with help from the R mailing lists) and am posting it up in case it's useful to anyone.
Basically, where the SNPs or variables are independent (i.e. not in LD, not correlated) you can centre each SNP/variable at its mean like so:
rs1cent <- rs1-mean(rs1)
rs2cent <- rs2 - mean(rs2)
You can then test the correlation between the phenotype and the interaction term as a screening step:
rs12interaction <- rs1cent*rs2cent
cor(PHENOTYPE, rs12interaction)
and then fully investigate, using the full glm, any pairs that seem to be correlated. The cut-off choice is, as ever, arbitrary.
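A sketch of that screening step done over all pairs at once; here geno is assumed to be a plain numeric matrix of SNP columns and pheno the 0/1 outcome (both names are mine, not from the original code), and the "top 100" cut-off is arbitrary:
gc_mat <- scale(geno, center = TRUE, scale = FALSE)   # centre every SNP at its mean
pair_idx <- combn(ncol(gc_mat), 2)                    # all pairwise column combinations
screen_cor <- apply(pair_idx, 2, function(ix) cor(pheno, gc_mat[, ix[1]] * gc_mat[, ix[2]]))
top_pairs <- pair_idx[, order(abs(screen_cor), decreasing = TRUE)[1:100]]
# only these pairs would then go through the full glm + lrtest comparison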
Other suggestions were to use a Rao score test, which involves fitting only the null-hypothesis model, thus roughly halving the computation time for this step, but I don't really understand how this works yet (more reading required).
Anyway, there you go. Maybe it will be of use to someone someday.