how to build a null distribution [closed] - r

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
1:I would like to create a synthetic dataset of 14.000 genes (rows) and 250 samples (columns of the matrix).
How this can be done?
2: after this, I would like to infer gene regulation using for ex algorithms of mutual information. I know how and in fact I have a network.
3: I would like to know if the net I had is due by chance or not. To do this, one common approach is to schuffle samples or genes, 1000 times, to create 1000 net and plot a null distribution to validate the net you previously (point 2) obtained. This is called bootstrap.
Is there another method?
Best,
E.

The sample function in R is the basic way to construct random permutations of existing data. It's not clear what you want, and an additional thought was that you might just need to be pointed to the runif function for generating random Uniform sequences. If you had 1000 objects of a particular sort in an object vector, obj:
sample( obj ) # returns a permuted sequence
# Same as ...
obj[ sample(length(obj)) ]
Whether that is a "null distribution" is up to you to decide. (And that request for "all" of the methods to do any particular task in R will be viewed as being excessively demanding. There are often a large number of methods, and even asking for the "best" will increase you chances of getting your question closed.)

Related

Split a data set into a training and test set with runif [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I have a question regarding a command. We used in a class runif to create a training set, that should contain 50% of the data set. (we developed a decision tree based on this training set). But I still can't understand the logic behind this command, could someone explain to me how this works?
I understand the decision trees, and the logic behind splitting up a data set, my question is just explicitly about how this command works.
inTrain <- runif(nrow(USArrests)) < 0.5
You have a dataset named USArrests with length nrow(USArrests), let's say for the sake of simplification 100. So runif(nrow(USArrests)) creates 100 uniform distributed random numbers i.e. for every row in your dataset one number.
Next your expression runif(nrow(USArrests)) < 0.5 checks, if the number is < 0.5 or not returning TRUE or FALSE. This gives you a logical vector of length 100 (or nrow(USArrests)) that indicates, if a row belongs to the training or to the test dataset.
It's not shown but finally you select your training data by
USArrests[inTrain,]
and your test data by
USArrests[-inTrain,]

How to work with numeric probability distribution functions [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I have to calculate the probability distribution function of a random variable that is composed of (sum, division, product, exponentiation, etc...) some other simple random variables. It is pretty complex so I am morte then happy to get a numerical solution
While thought this was a very standard thing to do , I was unable to find a framework to do that. I'd preferably use R, but any major language will do.
What I would like therefore is a library that allowed me to:
i) create numerical random variables from classic distributions
ii) compose them by simple operations (+,-,*,/, exp,min, max,...)
Of course I could work with vectors and use convolutions and the like, but I wanted something more polished.
I am also aware that is possible to use simulation to create the variables, then compose them with the operations and finally getting the PDF using a histogram, but again, I would prefer a non - simulating approach.
Try the rv package. Note that if X is an exponential random variable with mean 1, then -log(X) has a standard Gumbel distribution.

R normalization with all samples, or just the part that i need? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
I am using the edgeR and Limma packages to analyse a RNA-seq count data table.
I only need a subset of the data file, therefore my question is: Do I need to normalize my data within all the samples, or is it better to subset my data first and normalize the data then.
Thank you.
Regards Lisanne
I think it depends on what you want to proof/show. If you also want to take into account your "darkcounts" than you should normalize it at first such that you also take into account the percentage in which your experiment fails. Here your total number of experiments ( good and bad results) sums up to one.
If you want to find out the distribution of your "good events" than you should first produce your subset of good samples and normalize afterwards. In this case your number of good events sums up to 1
So once again, it depends on what you want to proof. As a physicist I would prefer the first method since we do not remove bad data points.
Cheers TL

No stable solution using metaMDS() in Vegan [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I have a species abundance dataset with quite a few zeros in it and even when I set trymax = 1000 for metaMDS() the program is unable to find a stable solution for the stress. I have already tried combining data (collapsing multiple years together to reduce the number of zeros) and I can't do any more. I was just wondering if anyone knows - is it scientifically valid to pick what R gives me at the end (the lowest of the 1000 solutions) or should I not be using NMDS because it cannot find a stable spot? There seems to be very little information about this on the internet.
One explanation for this is that you are trying to use too few dimensions for the mapping. I presume you are using the default k = 2? If so, try k = 3 and compare the stress from the best solution you got from the 1000 tries for the k = 2 solution.
I would be a little concerned to take one solution out of 1000 just because it had the best/lowest stress.
You could also try 1000 more random starts and see if it converges if you run more iterations. When you saved the output from metaMDS(), you can supply that object to another call to metaMDS() via the previous.best argument. It will then do trymax further random starts but compare any lower-stress solutions with the previous best and converge if it finds one similar to it, rather than have to find two similar low-stress solutions in the 1000 starts.

Simulate ARFIMA process with custom initial values [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
My question deals with the fracdiff.sim function in R (in the fracdiff package) for which the help document, just like for arima.sim, is not really clear concerning initial values.
It's ok that stationary processes do not depend on their initial values when time grows, but my aim is to see in my simulations the return of my long memory process (fitted with arfima) to its mean.
Therefore, I need to input at least the p final values of my in-sample process (and eventually q innovations) if it is ARFIMA(p,d,q). In other words, I would like to set the burn-in period's length to 0 and give starting values instead.
Nevertheless, I'm currently not able to do this. I know that fracdiff.sim makes it possible for the user to chose the length of a burning period (which leads to the stationnary behavior) and the mean of the simulated process (it is simulated and then translated to make the means match). There is also a condition: the length of the burn-in period must be >= p+q. What I suppose is that there is something to do with the innov argument but I'm really not sure.
This idea is inspired by the arima.sim function which has a start.innov argument. However, even if my aim was to simulate an ARMA(p,q), I'm not sure of the exact use of this argument (the help is quite poor) : must we input only q innovations ? put with them the p last values of the in-sample process ? In which order ?
To sum up, I want to simulate ARFIMA processes starting from a specific value and having a specific mean in order to see the return to the mean and not only the long term behavior. I fund beginnings of solutions for arima.sim on the internet but nobody clearly answered, and if the solution uses start.innov, how to solve the problem for ARFIMA processes (fracdiff.sim doesn't have the start.innov argument) ?
Hopping I have been clear enough,

Resources