I want to simulate the problem below in R and calculate the average probability based on 1000 simulations -
Scores on a test are normally distributed with mean 70 and std dev 10.
Estimate the probability that among 75 randomly selected students at least 22 score greater than 78
This is what I have done so far
set.seed(1)
scores = rnorm(1000,70,10)
head(scores)
hist(scores)
sm75=sample(scores,75)
length(sm75[sm75>78])/75
#[1] 0.1866667
However, this only gives me only one iteration, I want 1000 iterations and then take the average of those 1000 probabilities. I believe some kind of control structure using for loop can be implemented. Also, is there an easier way through "apply" family of functions?
At the end of the day you are testing whether at least 22 students score higher than 78, which can be compactly computed with:
sum(rnorm(75, 70, 10) > 78) >= 22
Breaking this down a bit, rnorm(75, 70, 10) returns the 75 scores, which are normally distributed with mean 70 and standard deviation 10. rnorm(75, 70, 10) > 78 is a vector of length 75 that indicates whether or not each of these scores is above 78. sum(rnorm(75, 70, 10) > 78) converts each true to a 1 and each false to a 0 and sums these values up, meaning it counts the number of the 75 scores that exceed 78. Lastly we test whether the sum is 22 or higher with the full expression above.
replicate can be used to replicate this any number of times. So to see the breakdown of 1000 simulations, you can use the following 1-liner (after setting your random seed, of course):
set.seed(144)
table(replicate(1000, sum(rnorm(75, 70, 10) > 78) >= 22))
# FALSE TRUE
# 936 64
In 64 of the replicates, at least 22 students scored above a 78, so we estimate the probability to be 6.4%.
Probability is calculated as number of favourable outcomes / the total number of outcomes. So..
> scores <- sample(rnorm(1000,70,10),75)
> probability <- length(subset(scores,scores>78))/length(scores)
> probability
[1] 0.28
However, you want to do this a 1000 times, and then take an average.
> mean(replicate(1000, {scores<-sample(rnorm(1000,70,10),75);length(subset(scores,scores>78))/length(scores)}))
[1] 0.2133333
Related
Given a list of cars and their top speeds, MPG and car cost. I want to rank them. With Speed a 'weight' of 50%, MPG 'weight' of 30% and car cost 20%.
The faster the car, the better.. The higher the MPG, the better... The lower the COST, the better...
What math formula can I use to rank the cars in order, based on this criteria?
So given this list. How can I rank them? Given then range of each are different?
CAR SPEED MPG COST
A 135 20 50,000
B 150 15 60,000
C 170 18 80,000
D 120 30 40,000
A more general term for your problem would be 'Multi-Criteria Decision Analysis' which is a well studied subject and you will be able to find different model for different use cases.
Let's take a simple model for your case, where we will create a score based on weights and calculate it for each car:
import pandas as pd
data = pd.DataFrame({
'CAR': ['A','B','C','D'],
'SPEED': [135, 150, 170, 120],
'MPG': [20, 15, 18, 30],
'COST': [50000, 60000, 80000, 4000]
})
def Score(df):
return 0.5*df['SPEED'] + 0.3*df['MPG'] + 0.2*df['COST']
data['SCORE'] = data.apply(lambda x: Score(x), axis=1)
data = data.sort_values(by=['SCORE'], ascending=False)
print(data)
This would give us:
CAR SPEED MPG COST SCORE
2 C 170 18 80000 16090.4
1 B 150 15 60000 12079.5
0 A 135 20 50000 10073.5
3 D 120 30 40000 8069.0
As you can see in the function "SCORE" we are simply multiplying the value by weight and summing them to get a new value based on which we list the items.
The important consideration here is whether you are happy with the formula we used in Score or not. You can change it however you want and for whatever purpose you are building your model.
I have carried out a Wilcoxon rank sum test to see if there is any significant difference between the expression of 598019 genes between three disease samples vs three control samples. I am in R.
When I see how many genes have a p value < 0.05, I get 41913 altogether. I set the parameters of the Wilcoxon as follows;
wilcox.test(currRow[4:6], currRow[1:3], paired=F, alternative="two.sided", exact=F, correct=F)$p.value
(This is within an apply function, and I can provide my total code if necessary, I was a little unsure as to whether alternative="two.sided" was correct).
However, as I assume correcting for multiple comparisons using the Benjamini Hochberg False Discovery rate would lower this number, I then adjusted the p values via the following code
pvaluesadjust1 <- p.adjust(pvalues_genes, method="BH")
Re-assessing which p values are less than 0.05 via the below code, I get 0!
p_thresh1 <- 0.05
names(pvaluesadjust1) <- rownames(gene_analysis1)
output <- names(pvaluesadjust1)[pvaluesadjust1 < p_thresh1]
length(output)
I would be grateful if anybody could please explain, or direct me to somewhere which can help me understand what is going on!
Thank-you
(As an extra question, would a t-test be fine due to the size of the data, the Anderson-Darling test showed that the underlying data is not normal. I had far less genes which were less than 0.05 using this statistical test rather than Wilcoxon (around 2000).
Wilcoxon is a parametric test based on ranks. If you have only 6 samples, the best result you can get is rank 2,2,2 in disease versus 5,5,5 in control, or vice-versa.
For example, try the parameters you used in your test, on these random values below, and you that you get the same p.value 0.02534732.
wilcox.test(c(100,100,100),c(1,1,1),exact=F, correct=F)$p.value
wilcox.test(c(5,5,5),c(15,15,15),exact=F, correct=F)$p.value
So yes, with 598019 you can get 41913 < 0.05, these p-values are not low enough and with FDR adjustment, none will ever pass.
You are using the wrong test. To answer your second question, a t.test does not work so well because you don't have enough samples to estimate the standard deviation correctly. Below I show you an example using DESeq2 to find differential genes
library(zebrafishRNASeq)
data(zfGenes)
# remove spikeins
zfGenes = zfGenes[-grep("^ERCC", rownames(zfGenes)),]
head(zfGenes)
Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000000001 304 129 339 102 16 617
ENSDARG00000000002 605 637 406 82 230 1245
First three are controls, last three are treatment, like your dataset. To validate what I have said before, you can see that if you do a wilcoxon.test, the minimum value is 0.02534732
all_pvalues = apply(zfGenes,1,function(i){
wilcox.test(i[1:3],i[4:6],exact=F, correct=F)$p.value
})
min(all_pvalues,na.rm=T)
# returns 0.02534732
So we proceed with DESeq2
library(DESeq2)
#create a data.frame to annotate your samples
DF = data.frame(id=colnames(zfGenes),type=rep(c("ctrl","treat"),each=3))
# run DESeq2
dds = DESeqDataSetFromMatrix(zfGenes,DF,~type)
dds = DESeq(dds)
summary(results(dds),alpha=0.05)
out of 25839 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 69, 0.27%
LFC < 0 (down) : 47, 0.18%
outliers [1] : 1270, 4.9%
low counts [2] : 5930, 23%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
So you do get hits which pass the FDR cutoff. Lastly we can pull out list of significant genes
res = results(dds)
res[which(res$padj < 0.05),]
In terms of probability distribution they use? I know that runif gives fractional numbers and sample gives whole numbers, but what I am interested in is if sample also use the 'uniform probability distribution'?
Consider the following code and output:
> set.seed(1)
> round(runif(10,1,100))
[1] 27 38 58 91 21 90 95 66 63 7
> set.seed(1)
> sample(1:100, 10, replace=TRUE)
[1] 27 38 58 91 21 90 95 67 63 7
This strongly suggests that when asked to do the same thing, the 2 functions give pretty much the same output (though interestingly it is round that gives the same output rather than floor or ceiling). The main differences are in the defaults and if you don't change those defaults then both would give something called a uniform (though sample would be considered a discrete uniform and by default without replacement).
Edit
The more correct comparison is:
> ceiling(runif(10,0,100))
[1] 27 38 58 91 21 90 95 67 63 7
instead of using round.
We can even step that up a notch:
> set.seed(1)
> tmp1 <- sample(1:100, 1000, replace=TRUE)
> set.seed(1)
> tmp2 <- ceiling(runif(1000,0,100))
> all.equal(tmp1,tmp2)
[1] TRUE
Of course if the probs argument to sample is used (with not all values equal), then it will no longer be uniform.
sample samples from a fixed set of inputs, and if a length-1 input is passed as the first argument, returns an integer output(s).
On the other hand, runif returns a sample from a real-valued range.
> sample(c(1,2,3), 1)
[1] 2
> runif(1, 1, 3)
[1] 1.448551
sample() runs faster than ceiling(runif())
This is useful to know if doing many simulations or bootstrapping.
Crude time trial script that time tests 4 equivalent scripts:
n<- 100 # sample size
m<- 10000 # simulations
system.time(sample(n, size=n*m, replace =T)) # faster than ceiling/runif
system.time(ceiling(runif(n*m, 0, n)))
system.time(ceiling(n * runif(n*m)))
system.time(floor(runif(n*m, 1, n+1)))
The proportional time advantage increases with n and m but watch you don't fill memory!
BTW Don't use round() to convert uniformly distributed continuous to uniformly distributed integer since terminal values get selected only half the time they should.
Question:
Suppose the numbers in the following random number table correspond to people arriving for work at a large factory. Let 0,1,and 2 be smokers and 3-9 be nonsmokers. After many arrivals, calculate the total relative frequency of smokers .
here is my R code to simulate the total relative frequency of smokers.
simulation<-function(k){
x<-round(runif(k)*10)
return (length(x[x<3])/k)}
> simulation(100)
[1] 0.27
> simulation(1000)
[1] 0.244
> simulation(10000)
[1] 0.2445
> simulation(100000)
[1] 0.24923
Why i can't get the result 0.3?
If all you want to do is get a discrete uniform distribution on the numbers 0, 1, ..., 9 then just use sample
sample(0:9, k, replace = TRUE)
With the code you have right now you'll actually get a probability of .05 each of getting 0 or 10 and a probability of .10 each of getting 1-9.
I wonder if there is a way to calculate the distance between a abline in a plot and a datapoint? For example, what is the distance between concentration == 40 with signal == 643 (element 5) and the abline?
concentration <- c(1,10,20,30,40,50)
signal <- c(4, 22, 44, 244, 643, 1102)
plot(concentration, signal)
res <- lm(signal ~ concentration)
abline(res)
You are basically asking for the residuals.
R> residuals(res)
1 2 3 4 5 6
192.61 12.57 -185.48 -205.52 -26.57 212.39
As an aside, when you fit a linear regression, the sum of the residuals is 0:
R> sum(residuals(res))
[1] 8.882e-15
and if the model is correct, should follow a Normal distribution - qqnorm(res).
I find working with the standardised residuals easier.
> rstandard(res)
1 2 3 4 5 6
1.37707 0.07527 -1.02653 -1.13610 -0.15845 1.54918
These residuals have been scaled to have mean zero, variance (approximately) equal to one and have a Normal distribution. Outlying standardised residuals are those larger that +/- 2.
You can use the function below:
http://paulbourke.net/geometry/pointlineplane/pointline.r
Then just extract the slope and intercept:
> coef(res)
(Intercept) concentration
-210.61098 22.00441
So your final answer would be:
concentration <- c(1,10,20,30,40,50)
signal <- c(4, 22, 44, 244, 643, 1102)
plot(concentration, signal)
res <- lm(signal ~ concentration)
abline(res)
cfs <- coef(res)
distancePointLine(y=signal[5], x=concentration[5], slope=cfs[2], intercept=cfs[1])
If you want a more general solution to finding a particular point, concentration == 40 returns a Boolean vector of length length(concentration). You can use that vector to select points.
pt.sel <- ( concentration == 40 )
> pt.sel
[1] FALSE FALSE FALSE FALSE TRUE FALSE
> distancePointLine(y=signal[pt.sel], x=concentration[pt.sel], slope=cfs["concentration"], intercept=cfs["(Intercept)"])
1.206032
Unfortunately distancePointLine doesn't appear to be vectorized (or it does, but it returns a warning when you pass it a vector). Otherwise you could get answers for all points just by leaving the [] selector off the x and y arguments.