Null distribution related question (edited) - r

Please help with the following question.
The experiment involved mice; feeding them two diets: high-fat diet and normal diet (control group). The data below contains the weights of all female mice (population) that received the normal diet. The data can be downloaded from GitHub running the following command lines in R:
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile = filename)
x <- unlist(read.csv(filename))
Here x represents the weights of the entire population.
So, the question is:
Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 (one thousand) times. Save the averages.
What proportion of these 1,000 averages are more than 1 gram away from the average x?
Below is what I have tried using the ‘sum’ & ‘mean()’ function:
set.seed(1)
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1: n) {
sample1[i] <- mean (sample (x, 5))
}
sum(sample1 > mean(x) / n)
mean(sample1 > mean(x)+1)
So this step is where I need the help…because I am not sure how to deal with ‘1 gram away from average of x’ statement in the question.
Thank you in advance for your help.

Looks like homework, so I'll give some hints:
In your second code block, the last two statements seem off.
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1: n) {
sample1[i] <- mean (sample (x, 5))
}
sum(sample1 > mean(x) / n) #<- why dividing by n here?
mean(sample1 > mean(x)+1) #<- what are you trying to do here?
Why are you dividing the mean of the overall sample by n?
The call to mean does seem to make sense.
I don't think you need the second statement, mean(sample1 > mean(x)+1) to get your answer.
You need an inequality in the sum() statement that will be TRUE for every value that is outside the range of mean(x) - 1 to mean(x) + 1. Or, the number less than mean(x) -1 plus the number greater than mean(x) + 1.
Does that help?

On the loop part you are doing correctly, for the ##What proportion of these 1,000 averages are more than 1 gram away from the average x?##sum(abs(null)-mean(population)>1)/n

Related

How to find the probability of extinction = 1 using Galton-Watson process in R?

I am simulating a basic Galton-Watson process (GWP) using a geometric distribution. I'm using this to find the probability of extinction for each generation. My question is, how do I find the generation at which the probability of extinction is equal to 1?
For example, I can create a function for the GWP like so:
# Galton-Watson Process for geometric distribution
GWP <- function(n, p) {
Sn <- c(1, rep(0, n))
for (i in 2:(n + 1)) {
Sn[i] <- sum(rgeom(Sn[i - 1], p))
}
return(Sn)
}
where, n is the number of generations.
Then, if I set the geometric distribution parameter p = 0.25... then to calculate the probability of extinction for, say, generation 10, I just do this:
N <- 10 # Number of elements in the initial population.
GWn <- replicate(N, GWP(10, 0.25)[10])
probExtinction <- sum(GWn==0)/N
probExtinction
This will give me the probability of extinction for generation 10... to find the probability of extinction for each generation I have to change the index value (to the corresponding generation number) when creating GWn... But what I'm trying to do is find at which generation will the probability of extinction = 1.
Any suggestions as to how I might go about solving this problem?
I can tell you how you would do this problem in principle, but I'm going to suggest that you may run into some difficulties (if you already know everything I'm about to say, just take it as advice to the next reader ...)
theoretically, the Galton-Watson process extinction probability never goes exactly to 1 (unless prob==1, or in the infinite-time limit)
of course, for any given replicate and random-number seed you can compute the first time point (if any) at which all of your lineages have gone extinct. This will be highly variable across runs, depending on the random-number seed ...
the distribution of extinction times is extremely skewed; lineages that don't go extinct immediately will last a loooong time ...
I modified your GWP function in two ways to make it more efficient: (1) stop the simulation when the lineage goes extinct; (2) replace the sum of geometric deviates with a single negative binomial deviate (see here)
GWP <- function(n, p) {
Sn <- c(1, rep(0, n))
for (i in 2:(n + 1)) {
Sn[i] <- rnbinom(1, size=Sn[i - 1], prob=p)
if (Sn[i]==0) break ## extinct, bail out
}
return(Sn)
}
The basic strategy now is: (1) run the simulations for a while, keep the entire trajectory; (2) compute extinction probability in every generation; (3) find the first generation such that p==1.
set.seed(101)
N <- 10 # Number of elements in the initial population.
maxgen <- 100
GWn <- replicate(N, GWP(maxgen, 0.5), simplify="array")
probExtinction <- rowSums(GWn==0)/N
which(probExtinction==1)[1]
(Subtract 1 from the last result if you want to start indexing from generation 0.) In this case the answer is NA, because there's 1/10 lineages that manages to stay alive (and indeed gets very large, so it will probably persist almost forever)
plot(0:maxgen, probExtinction, type="s") ## plot extinction probability
matplot(1+GWn,type="l",lty=1,col=1,log="y") ## plot lineage sizes (log(1+x) scale)
## demonstration that (sum(rgeom(n,...)) is equiv to rnbinom(1,size=n,...)
nmax <- 70
plot(prop.table(table(replicate(10000, sum(rgeom(10, prob=0.3))))),
xlim=c(0,nmax))
points(0:nmax,dnbinom(0:nmax, size=10, prob=0.3), col=2,pch=16)

simulation of binomial distribution and storing value in matrix in r

set.seed(123)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.30)
result[[m]]=u
}
result
for (m in 1:40) if (any(result[[m]] == 1)) break
m
m is the exit time for company, as we change the probability it will give different result. Using this m as exit, I have to find if there was a funding round inbetween, so I created a random binomial distribution with some prob, when you will get a 1 that means there is a funding round(j). if there is a funding round i have to find the limit of round using the random uniform distribution. I am not sure if the code is right for rbinom and is running till m. And imat1<- matrix(0,nrow = 40,ncol = 2) #empty matrix
am gettin the y value for all 40 iteration I Need it when I get rbinom==1 it should go to next loop. I am trying to store the value in matrix but its not getting stored too. Please help me with that.
mat1<- matrix(0,nrow = 40,ncol = 2) #empty matrix
for(j in 1:m) {
k<- if(any(rbinom(1e3,40,0.42)==1)) #funding round
{
y<- runif(j, min = 0, max = 1) #lower and upper bound
mat1[l][0]<-j
mat1[l][1]<-y #matrix storing the value
}
}
resl
mat1
y
The answer to your first question:
result <- vector("list",40)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.05)
print(u)
result[[m]]=u
}
u
The second question is not clear. Could you rephrase it?
To generate 40 vectors of random binomial numbers you don't need a loop at all, use ?replicate.
u <- replicate(40, rbinom(1e3, 40, 0.05))
As for your second question, there are several problems with your code. I will try address them, it will be up to you to say if the proposed corrections are right.
The following does basically nothing
for(k in 1:40)
{
n<- (any(rbinom(1e3,40,0.05)==1)) # n is TRUE/FALSE
}
k # at this point, equal to 40
There are better ways of creating a T/F variable.
#matrix(0, nrow = 40,ncol = 2) # wrong, don't use list()
matrix(0, nrow = 40,ncol = 2) # or maybe NA
Then you set l=0 when indices in R start at 1. Anyway, I don't believe you'll need this variable l.
if(any(rbinom(1e3,40,0.30)==1)) # probably TRUE, left as an exercise
# in probability theory
Then, finally,
mat1[l][0]<-j # index `0` doesn't exist
Please revise your code, and tell us what you want to do, we're glad to help.

MCMC in R Modify Proposal

I've been working with MCMC for population genetics and I have some doubts.
I'm not experienced in statistics and because of that I have difficulty.
I have code to run MCMC, 1000 iterations. I start by creating a matrix with 0's (50 columns = 50 individuals and 1000 lines for 1000 iterations).
Then I create a random vector to substitute the first line of the matrix. This vector has 1's and 2's, representing population 1 or population 2.
I also have genotype frequencies and the genotypes of the 50 individuals.
What I want is to, according to the genotype frequencies and genotypes, determine to what population an individual belongs.
Then, I'll keep changing the population assigned to a random individual and checking if the new value should be accepted.
niter <- 1000
z <- matrix(0,nrow=niter,ncol=ncol(targetinds))
z[1,] <- sample(1:2, size=ncol(z), replace=T)
lhood <- numeric(niter)
lhood[1] <- compute_lhood_K2(targetinds, z[1,], freqPops)
accepted <- 0
priorz <- c(1e-6, 0.999999)
for(i in 2:niter) {
z[i,] <- z[i-1,]
# propose new vector z, by selecting a random individual, proposing a new zi value
selind <- sample(1:nind, size=1)
# proposal probability of selecting individual at random
proposal_ratio_ind <- log(1/nind)-log(1/nind)
# propose a new index for the selected individual
if(z[i,selind]==1) {
z[i,selind] <- 2
} else {
z[i,selind] <- 1
}
# proposal probability of changing the index of individual is 1/2
proposal_ratio_cluster <- log(1/2)-log(1/2)
propratio <- proposal_ratio_ind+proposal_ratio_cluster
# compute f(x_i|z_i*, p)
# the probability of the selected individual given the two clusters
probindcluster <- compute_lhood_ind_K2(targetinds[,selind],freqPops)
# likelihood ratio f(x_i|z_i*,p)/f(x_i|z_i, p)
lhoodratio <- probindcluster[z[i,selind]]-probindcluster[z[i-1,selind]]
# prior ratio pi(z_i*)/pi(z_i)
priorratio <- log(priorz[z[i,selind]])-log(priorz[z[i-1,selind]])
# accept new value according to the MH ratio
mh <- lhoodratio+propratio+priorratio
# reject if the random value is larger than the MH ratio
if(runif(1)>exp(mh)) {
z[i,] <- z[i-1,] # keep the same z
lhood[i] <- lhood[i-1] # keep the same likelihood
} else { # if accepted
lhood[i] <- lhood[i-1]+lhoodratio # update the likelihood
accepted <- accepted+1 # increase the number of accepted
}
}
It is asked that I have to change the proposal probability so that the new proposed values are proportional to the likelihood. This leads to a Gibbs sampling MCMC algorithm, supposedly.
I don't know what to change in the code to do this. I also don't understand very well the concept of proposal probability and how to chose the prior.
Grateful if someone knows how to clarify my doubts.
Your current proposal is done here:
# propose a new index for the selected individual
if(z[i,selind]==1) {
z[i,selind] <- 2
} else {
z[i,selind] <- 1
}
if the individual is assigned to cluster 1, then you propose to switch assignment deterministically by assigning them to cluster 2 (and vice versa).
You didn't show us what freqPops is, but if you want to propose according to freqPops then I believe the above code has to be replaced by
z[i,selind] <- sample(c(1,2),size=1,prob=freqPops)
(at least that is what I understand when you say you want to propose based on the likelihood - however, that statement of yours is unclear).
For this now to be a valid mcmc gibbs sampling algorithm you also need to change the next line of code:
proposal_ratio_cluster <- log(freqPops[z[i-1,selind]])-log(fregPops[z[i,selind]])

Does cattell's profile similarity coefficient (Rp) exist as a function in R?

i'm comparing different measures of distance and similarity for vector profiles (Subtest results) in R, most of them are easy to compute and/or exist in dist().
Unfortunately, one that might be interesting and is to difficult for me to calculate myself is Cattel's Rp. I can not find it in R.
Does anybody know if this exists already?
Or can you help me to write a function?
The formula (Cattell 1994) of Rp is this:
(2k-d^2)/(2k + d^2)
where:
k is the median for chi square on a sample of size n;
d is the sum of the (weighted=m) difference between the two profiles,
sth like: sum(m(x(i)-y(i)));
one thing i don't know is, how to get the chi square median in there
Thank you
What i get without defining the k is:
Rp.Cattell <- function(x,y){z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);return(z)}
Vector examples are:
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
They are measures by the same device, but related to different bodyparts. They don't need to be standartised or weighted, i would say.
This page gives a general formula for k, and then gives a more thorough method using SAS/IML which pretty much gives the same results. So I used the general formula, added calculation of degrees of freedom, which leads to this:
Rp.Cattell <- function(x,y) {
dof <- (2-1) * (length(y)-1)
k <- (1-2/(9*dof))^3
z <- (2*k-sum(sum(x-y))^2)/(2*k+sum(sum(x-y))^2)
return(z)
}
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
Rp.Cattell(x, y)
# [1] -0.9012083
Does this figure appear to make sense?
Trying to verify the function, I found out now that the median of chisquare is the chisquare value for 50% probability - relating to random. So the function should be:
Rp.Cattell <- function(x,y){
dof <- (2-1) * (length(y)-1)
k <- qchisq(.50, df=dof)
z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);
return(z)}
It is necessary though to standardize the Values before, so the results are distributed correctly.
So:
library ("stringr")
# they are centered already
x <- as.vector(scale(c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758),center=F, scale=T))
y <- as.vector(scale(c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925),center=F, scale=T))
Rp.Cattell(x, y) -0.584423
This sounds reasonable now - or not?
I consider calculation of z is incorrect.
You need to calculate the sum of the squared differences. Not the square of the sum of differences. Besides product operator is missing in 2k.
It should be
z <- (2*k-sum((x-y)^2))/(2*k+sum((x-y)^2))
Do you agree?

What is the formula to calculate the gini with sample weight

I need your helps to explain how I can obtain the same result as this function does:
gini(x, weights=rep(1,length=length(x)))
http://cran.r-project.org/web/packages/reldist/reldist.pdf --> page 2. Gini
Let's say, we need to measure the inocme of the population N. To do that, we can divide the population N into K subgroups. And in each subgroup kth, we will take nk individual and ask for their income. As the result, we will get the "individual's income" and each individual will have particular "sample weight" to represent for their contribution to the population N. Here is example that I simply get from previous link and the dataset is from NLS
rm(list=ls())
cat("\014")
library(reldist)
data(nls);data
help(nls)
# Convert the wage growth from (log. dollar) to (dollar)
y <- exp(recent$chpermwage);y
# Compute the unweighted estimate
gini_y <- gini(y)
# Compute the weighted estimate
gini_yw <- gini(y,w=recent$wgt)
> --- Here is the result----
> gini_y = 0.3418394
> gini_yw = 0.3483615
I know how to compute the Gini without WEIGHTS by my own code. Therefore, I would like to keep the command gini(y) in my code, without any doubts. The only thing I concerned is that the way gini(y,w) operate to obtain the result 0.3483615. I tried to do another calculation as follow to see whether I can come up with the same result as gini_yw. Here is another code that I based on CDF, Section 9.5, from this book: ‘‘Relative
Distribution Methods in the Social Sciences’’ by Mark S. Handcock,
#-------------------------
# test how gini computes with the sample weights
z <- exp(recent$chpermwage) * recent$wgt
gini_z <- gini(z)
# Result gini_z = 0.3924161
As you see, my calculation gini_z is different from command gini(y, weights). If someone of you know how to build correct computation to obtain exactly
gini_yw = 0.3483615, please give me your advices.
Thanks a lot friends.
function (x, weights = rep(1, length = length(x)))
{
ox <- order(x)
x <- x[ox]
weights <- weights[ox]/sum(weights)
p <- cumsum(weights)
nu <- cumsum(weights * x)
n <- length(nu)
nu <- nu/nu[n]
sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}
This is the source code for the function gini which can be seen by entering gini into the console. No parentheses or anything else.
EDIT:
This can be done for any function or object really.
This is bit late, but one may be interested in concentration/diversity measures contained in the [SciencesPo][1] package.

Resources