Best way to perform linear regression - math

I have a set of 5 experimental values, E1, ..., E5, and results from 10000 different simulations, sim_A_B_C.out. From each simulation I get S1, ..., S5.
I want to study the correlation between the experimental and simulated values, so I would like to perform a linear regression for each set in a script that loops over the whole set of 10000 result files.
What is the best way of performing linear regression in bash or Python? I used to do this with SigmaPlot, but it is not well suited to such a big set of data.

I expect that each of your simulations has some input values which differ (for instance, x is 1 for the first and 2 for the second), and some function f(x) which runs the simulation and generates 5 points for each simulation. From your example, I expect x is actually three values: A, B, C.
In that case, what you want to discover is the value of x which generates the best simulation.
To do that, you really need to find the correlation between f(x) and the experimental result, rather than with each simulated result on its own.
The reason for this is that looking for a good correlation between the individual simulations and the experimental result involves too many variables (if you assume the simulations are independent of each other), and you will probably find a fit just by chance.
I think you should also obtain additional experimental values, to increase your confidence.
My favourite language for such things is R, which is free and available for most platforms at a download site near you. I recommend the book "Introduction to Statistics using R", which gives lots of potted examples for you to try and takes you from beginning statistics through to some quite advanced topics.
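As a rough illustration of the mechanics in R (a minimal sketch only: the file-name pattern, the assumption that each .out file holds its five simulated values as plain numbers, and the placeholder experimental values are all guesses about your setup, not taken from your description):

# Placeholder experimental values E1..E5; replace with your real measurements.
expt <- c(1.2, 2.3, 3.1, 4.8, 5.0)

# Assumes each sim_*.out file contains the five simulated values S1..S5
# in a form that scan() can read.
files <- list.files(pattern = "^sim_.*\\.out$")

results <- t(sapply(files, function(f) {
  sim <- scan(f, quiet = TRUE)            # the five simulated values from this run
  fit <- lm(expt ~ sim)                   # regress experiment on simulation
  c(slope     = unname(coef(fit)[2]),
    intercept = unname(coef(fit)[1]),
    r.squared = summary(fit)$r.squared)
}))

head(results)                             # one row of regression summaries per simulation

This keeps one row of summaries per result file, which is easy to sort or plot afterwards.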

In Python, there is scipy.stats.linregress in the SciPy package that you can use.

I'd avoid bash for this and use Python -- actually I'd use MATLAB or Mathematica, but neither is on your list. So install NumPy and possibly SciPy and crack on.

Related

How to quickly solve a moderate scale QP formulation in Julia?

This is a newbie question. I am trying to solve the following QP problem:
minimize x'Qx + b'x + c, subject to A.x >= lb
where:
x is a vector of coordinates,
Q is a sparse, strongly diagonally dominant, symmetric matrix, typically of size 500,000 x 500,000 to 1M x 1M
b is a vector of constants
c is a constant
A is an identity matrix
lb is a vector containing lower bounds on vector x
Following are the packages I have tried:
Optim.jl: They have a primal interior-point algorithm for simple "box" constraints. I have tried playing around with the inner_optimizer, setting it to GradientDescent() or ConjugateGradient(). No matter what I try, this seems to be very slow for my problem set.
IterativeSolvers.jl: They have a conjugate gradient solver, but there is no way to impose constraints on the QP problem.
MathProgBase.jl: Through it I can use Ipopt, a dedicated solver for quadratic programming. It works wonderfully for small data sets, typically around 3K x 3K matrices, but it takes too long for the kind of data sets I am looking at. I am aware that changing the linear system solver from MUMPS to HSL or WSMP may produce a significant improvement, but is there a way to plug third-party linear system solvers into Ipopt through Julia?
OSQP.jl: This again takes too long to converge for the data sets that I am interested in.
Also, for anybody who has worked with large data sets: can you suggest a way to solve a problem of this scale really fast in Julia using the existing packages?
You can try the OSQP solver with different parameters to speed up convergence for your specific problem. In particular:
If you have multiple cores, MKL Pardiso can significantly reduce the execution time. You can find details on how to install it here (it basically consists of running the default MKL installer). After that, you can use it in OSQP as follows:
model = OSQP.Model()
OSQP.setup!(model; P=Q, q=b, A=A, l=lb, u=ub, linsys_solver="mkl pardiso")
results = OSQP.solve!(model)
The number of iterations depends on your step size rho. OSQP automatically updates it, trying to find the best one. If you have a specific problem, you can disable the automatic detection and tune it yourself. Here is an example with a fixed rho:
model = OSQP.Model()
OSQP.setup!(model; P=Q, q=b, A=A, l=lb, u=ub, linsys_solver="mkl pardiso",
            adaptive_rho=false, rho=1e-3)
results = OSQP.solve!(model)
I suggest you try different rho values, maybe log-spaced between 1e-06 and 1e+06.
You can reduce the number of iterations by rescaling the problem data so that the condition number of your matrices is not too high.
I am pretty sure that if you follow these 3 steps you can make OSQP work pretty well. I am happy to try OSQP on your problem if you are willing to share your data (I am one of the developers).
Slightly unrelated: you can also call OSQP using MathProgBase.jl and JuMP.jl. It also supports the latest MathOptInterface.jl package, which will replace MathProgBase.jl in the newest version of JuMP.

1 sample t-test from summarized data in R

I can perform a one-sample t-test in R with the t.test command, but it requires the actual data; it cannot work from summary statistics (sample size, sample mean, standard deviation). I can work around this using the BSDA package, but are there any other ways to accomplish this one-sample t-test in R without the BSDA package?
Many ways. I'll list a few:
directly calculate the p-value by computing the statistic and calling pt with that and the df as arguments, as commenters suggest above (it can be done in a single short line of R; ekstroem shows the two-tailed case, and for the one-tailed case you wouldn't double it); a sketch of this appears after the list
alternatively, if it's something you need a lot, you could turn that into a nice robust function, even adding tests against non-zero mu and confidence intervals if you like. Presumably if you go this route you'll want to take advantage of the functionality built around the htest class
(code and even a reasonably complete function can be found in the answers to this stats.SE question.)
If samples are not huge (smaller than a few million, say), you can simulate data with the exact same mean and standard deviation and call the ordinary t.test function. If m and s and n are the mean, sd and sample size, t.test(scale(rnorm(n))*s+m) should do (it doesn't matter what distribution you use, so runif would suffice). Note the importance of calling scale there. This makes it easy to change your alternative or get a CI without writing more code, but it wouldn't be suitable if you had millions of observations and needed to do it more than a couple of times.
call a function in a different package that will calculate it; there are at least one or two other such packages (you don't make it clear whether using BSDA was a problem or whether you wanted to avoid packages altogether)
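A minimal R sketch of the first and third options above, assuming made-up placeholder summary statistics m, s, n and a null mean mu0 (swap in your own numbers):

m   <- 5.1   # sample mean (placeholder)
s   <- 1.4   # sample standard deviation (placeholder)
n   <- 30    # sample size (placeholder)
mu0 <- 5     # mean under the null hypothesis

# Option 1: compute the statistic and the two-sided p-value directly
t.stat <- (m - mu0) / (s / sqrt(n))
p.two.sided <- 2 * pt(-abs(t.stat), df = n - 1)  # for a one-tailed test, don't double

# Option 3: build a sample with exactly this mean and sd, then use t.test as usual
x <- as.vector(scale(rnorm(n))) * s + m          # scale() forces mean 0 and sd 1 exactly
t.test(x, mu = mu0)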

LASSO coefficients equal to 0 using opt1D

I have a question about LASSO. It is driving me crazy because it is something I cannot solve with my background alone; I'm a biologist.
Briefly, I run LASSO using the R library "penalized". In particular, I used the opt1D function with around 500 simulations on a numerical data.frame with around 30 columns, which are the biomarkers (gene expression) I want to test, and 3000 rows, which are people, of whom around 50 are tumours and all the others are normals.
Unfortunately, with L1 regularization, absolutely all coefficients across the 500 simulations are 0. If I check the L2 matrix of coefficients, they are close to 0. My point is that I cannot believe that none of my biomarkers is able to distinguish between normals and tumours.
I don't know if what I have done is all I can do to check the discriminatory potential of my molecules. Is there something else I can do to understand why they are all 0, and to verify whether they really are unable to stratify my cohort?
Did you consider fitting your data without penalization before using regularization? L1 regularization will naturally result in a significant number of zero coefficients.
As a side note, I would first run PCA/PCoA and see whether or not your samples separate according to your class variable. This could save you some time and allow you to trim your data set to those genes that show the greatest differences across your class variable. Also, if you have relatively little experience with R, I would suggest using a linear modelling package such as Limma, since it has excellent documentation and many examples that are easy to follow.
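A quick hedged sketch of that PCA check (the names expr and status are assumptions: expr would be your 3000 x 30 numeric expression matrix with one row per person, and status the tumour/normal label for each row):

# PCA on the samples-by-genes matrix, colouring samples by class
pc <- prcomp(expr, center = TRUE, scale. = TRUE)
plot(pc$x[, 1], pc$x[, 2],
     col  = ifelse(status == "tumour", "red", "black"),
     xlab = "PC1", ylab = "PC2",
     main = "Samples on the first two principal components")
legend("topright", legend = c("tumour", "normal"), col = c("red", "black"), pch = 1)

If the tumour samples do not separate at all on the leading components, that is at least consistent with LASSO shrinking everything to zero.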

R `optim` returns different results if run in parallel

Is there any possible explanation for multiple optim instances with fixed starting values returning different results when run in parallel versus one after another on a single core?
Basically, I do a rolling forecast with refitting of the model each time, so I can easily parallelize over the rolling windows, but the results are different if I do not parallelize...
Sadly, I don't have a simple reproducible example. I know that linking to a different BLAS changes the results, so is there anything similar, such as a difference in numerical precision or in the set of libraries used, that might cause this?

Fitting a binormal distribution in R

As the title says, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting the sum of two normals with means m1 and m2 and standard deviations s1 and s2 to the data distribution. The two Gaussians are scaled by weight factors such that w1 + w2 = 1.
I can do this using the vglm function of the VGAM package, for example:
fitRes <- vglm(mydata ~ 1,
               mix2normal1(equalsd = FALSE, iphi = w,
                           imu1 = m1, imu2 = m2, isd1 = s1, isd2 = s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data in a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) How do I speed up the fit process? I tried nls and mle, which look much faster, but I mostly failed to get a good fit (though I succeeded in getting every possible error these functions could throw at me). It is also not clear to me how to impose limits with those functions (w in [0,1] and w1 + w2 = 1).
2) How do I automagically choose some good starting parameters (I know this is a million-dollar question, but you never know, maybe someone has the answer)? Right now I have a little interface that allows me to choose the parameters and visually see what the initial distribution would look like, which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x values corresponding to the 3rd and 4th quartiles of y as starting parameters for the two means. Do you think that would be a reasonable thing to do?
First things first:
did you try searching for "fit mixture model" on RSeek.org?
did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.
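For the starting-values part of the question, a hedged sketch of one simple automatic scheme (an assumption about what might work, and deliberately cruder than the density-peak idea in the question): take quantiles of the data for the two means, a fraction of the overall spread for the standard deviations, and equal weights, then feed them to the vglm call already shown.

library(VGAM)

# mydata is the numeric vector being fitted, as in the question
m1 <- unname(quantile(mydata, 0.25))  # lower component: guess near the 1st quartile
m2 <- unname(quantile(mydata, 0.75))  # upper component: guess near the 3rd quartile
s1 <- s2 <- sd(mydata) / 2            # crude spread guess for each component
w  <- 0.5                             # start from equal weights

fitRes <- vglm(mydata ~ 1,
               mix2normal1(equalsd = FALSE, iphi = w,
                           imu1 = m1, imu2 = m2, isd1 = s1, isd2 = s2))

Since each of your 30-50 blocks is a subset of the same data, the fitted parameters from one block may also be a decent warm start for the next, which can help with speed.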
