expand.grid very big vectors exceeding memory limit - r

I have a problem in R.
I have 6 vectors of data, and each vector has a weight.
I need to calculate the quantiles of the weighted sum over all possible scenarios.
For example :
v1=c(1,2)
v2=c(0,5)
weights=c(1/3,2/3)
I would normally use :
scenarios=data.matrix(expand.grid(v1,v2))
results=scenarios %*% weights
And finally to get all the quantiles from 1% to 100% :
quantiles=quantile(results,seq(0.01,1,0.01),names=FALSE)
The problem is that my 6 vectors have 51, 236, 234, 71, 7 and 8 observations respectively, which would give me a vector of about 11 billion scenarios...
I get an error from R that a vector of 47 Gb exceeds the memory limit...
Do you see an alternative I could use to avoid building this big matrix? I'm thinking of looping over the values of one vector and writing the results to a file.
But then I don't know how I would calculate the percentiles from these separate files...

Rather than generate the whole population, how about sampling to generate your pdf?
N <- 1e6
scenarios <- unique(matrix(c(sample(1:51, N, replace=T),
sample(1:236, N, replace=T),
sample(1:234, N, replace=T),
sample(1:71, N, replace=T),
sample(1:7, N, replace=T),
sample(1:8, N, replace=T)), nrow=N))
N <- nrow(scenarios)
weights <- matrix(rep(1/6, 6))
quantiles <- quantile(scenarios %*% weights, seq(0.01,1,0.01), names=FALSE)
if OP strictly wants the whole population, I will take this post down
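If the exact population quantiles are needed, one chunked possibility (just a sketch, not tested on the real data; it assumes six vectors v1..v6 and a length-6 weights vector with non-negative weights, as in the follow-up post below) is to precompute the weighted sums of five of the vectors once, then loop over the sixth and accumulate a fine histogram. Approximate quantiles can then be read off the cumulative counts without ever holding the 11-billion-value vector in memory; the partial grid of ~47 million rows still needs a few GB, but nothing like the full grid.
w <- weights                                   # length-6 weight vector
# weighted sums over five of the six vectors (~47M values)
partial <- as.vector(data.matrix(expand.grid(v1, v3, v4, v5, v6)) %*% w[-2])
# bin edges covering all possible weighted sums (assumes w[2] >= 0)
breaks <- seq(min(partial) + w[2] * min(v2),
              max(partial) + w[2] * max(v2), length.out = 1e5 + 1)
counts <- numeric(length(breaks) - 1)
for (val in v2) {
  chunk <- partial + w[2] * val                # full weighted sums for this slice of v2
  counts <- counts + tabulate(findInterval(chunk, breaks, rightmost.closed = TRUE),
                              nbins = length(counts))
}
# approximate quantiles from the cumulative histogram
cum_prop <- cumsum(counts) / sum(counts)
probs <- seq(0.01, 1, 0.01)
# small tolerance so probs = 1 still matches the last bin despite rounding
quantiles <- sapply(probs, function(p) breaks[-1][which(cum_prop >= p - 1e-12)[1]])
The accuracy is limited only by the bin width, so increasing the number of breaks tightens the approximation at the cost of a larger counts vector.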

Alright!! Thanks for your help, guys!
Looks like sampling was the way to go!
Here's the code I used in the end, with chinson12's help!
I did a bootstrap to see whether the sampling converges towards the right values!
N <- 1e6   # sample size per bootstrap replicate
B <- 2     # number of bootstrap replicates
results <- c(1:100)  # first column: percentile index 1..100
for (i in 1:B) {
  scenarios <- unique(matrix(c(sample(v1, N, replace = TRUE), sample(v2, N, replace = TRUE),
                               sample(v3, N, replace = TRUE), sample(v4, N, replace = TRUE),
                               sample(v5, N, replace = TRUE), sample(v6, N, replace = TRUE)),
                             nrow = N))
  weightedSum <- round(scenarios %*% weights, 4)
  results <- cbind(results, quantile(weightedSum, seq(0.01, 1, 0.01), names = FALSE))
}
write(t(results), "output.txt", ncolumns = B + 1)
The output file looks great! To 4 decimal places, all of my percentiles are the same, so at least they converge to a value!
That being said, are those percentiles unbiased estimates of my population percentiles?
Thanks

Related

Null distribution related question (edited)

Please help with the following question.
The experiment involved mice fed two diets: a high-fat diet and a normal diet (control group). The data below contain the weights of all female mice (the population) that received the normal diet. The data can be downloaded from GitHub by running the following commands in R:
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile = filename)
x <- unlist(read.csv(filename))
Here x represents the weights of the entire population.
So, the question is:
Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 (one thousand) times. Save the averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x?
Below is what I have tried using the sum() and mean() functions:
set.seed(1)
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1: n) {
sample1[i] <- mean (sample (x, 5))
}
sum(sample1 > mean(x) / n)
mean(sample1 > mean(x)+1)
This step is where I need help, because I am not sure how to deal with the '1 gram away from the average of x' part of the question.
Thank you in advance for your help.
Looks like homework, so I'll give some hints:
In your second code block, the last two statements seem off.
n <- 1000
sample1 <- vector("numeric", n)
for (i in 1: n) {
sample1[i] <- mean (sample (x, 5))
}
sum(sample1 > mean(x) / n) #<- why dividing by n here?
mean(sample1 > mean(x)+1) #<- what are you trying to do here?
Why are you dividing mean(x) by n inside the comparison?
The inner call to mean(x) does make sense on its own.
I don't think you need the second statement, mean(sample1 > mean(x)+1), to get your answer.
You need an inequality in the sum() statement that is TRUE for every value outside the range mean(x) - 1 to mean(x) + 1. Equivalently, count the values less than mean(x) - 1 plus the values greater than mean(x) + 1.
Does that help?
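For instance, the "outside the range" condition from the hint could be written like this (a sketch using the sample1, x and n objects from your code):
# averages more than 1 gram away from mean(x), as a count and as a proportion
outside <- sum(sample1 < mean(x) - 1 | sample1 > mean(x) + 1)
outside / n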
The loop part you are doing correctly. For "What proportion of these 1,000 averages are more than 1 gram away from the average of x?", use:
sum(abs(sample1 - mean(x)) > 1) / n

How to vectorise sampling from non-identically distributed Bernoulli random variables?

Given a sequence of independent but not identically distributed Bernoulli trials with success probabilities given by a vector, e.g.:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000 # trial probabilities for trials 1 to 501
What is the most efficient way to obtain a random variate from each trial? I am assuming that vectorisation is the way to go.
I know of two functions that give Bernoulli random variates:
rbernoulli from the package purrr, which does not accept a vector of success probabilities as an input. In this case it may be possible to wrap the function in an apply-type operation.
rbinom with argument size = 1 gives Bernoulli random variates. It also accepts a vector of probabilities, so that:
rbinom(n = length(prob), size = 1, prob = prob)
gives an output with the right length. However, I am not entirely sure that this is actually what I want. The bits in the helpfile ?rbinom that seem relevant are:
The length of the result is determined by n for rbinom, and is the
maximum of the lengths of the numerical arguments for the other
functions.
The numerical arguments other than n are recycled to the length of the
result. Only the first elements of the logical arguments are used.
However, n is a parameter with no default, so I am not sure what the first sentence means. I presume the second sentence means that I get what I want, since only size = 1 should be recycled. However this thread seems to suggest that this method does not work.
This blog post gives some other methods as well. One commentator mentions my suggested idea using rbinom.
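As an aside, a base-R sketch that sidesteps rbinom entirely is to draw one uniform number per trial and threshold it against the corresponding probability (this is just an illustration, not taken from the linked posts):
x <- seq(0, 50, 0.1)
prob <- -x*(x - 50)/1000
draws <- as.integer(runif(length(prob)) < prob)  # draws[i] is 1 with probability prob[i]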
Another way to test that rbinom is vectorised for prob is to take advantage of the fact that the sum of N Bernoulli random variables with success probability p is a binomial random variable with size N and probability p:
x <- seq(0, 50, 0.1)
prob <- -x*(x - 50)/1000
n <- rbinom(prob, size=1000, prob)  # when n is a vector, its length is used as the number of draws
par(mfrow=c(1, 2))
plot(prob ~ x)
plot(n ~ x)
If you don't trust random strangers on the internet and do not understand documentation, maybe you can convince yourself by testing. Just set the random seed to get reproducible results:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000
# 501 separate draws of 1 random number
set.seed(42)
res1 <- sapply(prob, rbinom, n = 1, size = 1)
#501 "simultaneous" (vectorized) draws
set.seed(42)
res2 <- rbinom(501, 1, prob)
identical(res1, res2)
#[1] TRUE

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but could not find precisely what I am looking for).
How can I draw one element of a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,1,10))
How do I similarly sample from e.g. the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or from a logistic probability function (if I want a sigmoid increasing probability from 1 through 10)?
The only "solution" I can come up with is to first draw 1e6 numbers from, say, the sigmoid distribution and then rescale their min and max to 1 and 10 - but this looks clumsy.
UPDATE:
This awkward solution (and I don't feel it is very "correct") would go like this:
#Draw enough from a distribution, here exponential
x <- rexp(1e3)
#Scale probs to e.g. 1-10
scaler <- function(vector, min, max){
  (((vector - min(vector)) * (max - min)) / (max(vector) - min(vector))) + min
}
x_scale <- scaler(x,1,10)
#And sample once (and round it)
round(sample(x_scale,1))
Are there not better solutions around ?
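For what it's worth, one direct route (a sketch using arbitrary exponential weights, not anything from the question itself) is to hand sample() a vector of probability weights computed from the distribution you have in mind:
v <- 1:10
w <- dexp(v, rate = 0.5)   # exponential-shaped weights: 1 is much more likely than 10
sample(v, 1, prob = w)     # prob is normalised internally, so no rescaling is needed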
I believe sample() is what you are looking for, as @HubertL mentioned in the comments. You can specify an increasing function (e.g. logit()) and pass the vector v you want to sample from as an input. You can then use the output of that function as a vector of probabilities p. See the code below.
logit <- function(x) {  # note: this is actually the inverse logit (logistic/sigmoid) function
  return(exp(x)/(exp(x) + 1))
}
v <- seq(1, 10, 1)
p <- logit(v)
sample(v, 1, prob = p, replace = TRUE)
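As a side note (not part of the original answer), base R already provides this sigmoid as plogis(), so the helper above can be replaced by the built-in:
p <- plogis(seq(1, 10, 1))                # same values as the hand-rolled logit() above
sample(1:10, 1, prob = p, replace = TRUE)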

Does Cattell's profile similarity coefficient (Rp) exist as a function in R?

I'm comparing different measures of distance and similarity for vector profiles (subtest results) in R. Most of them are easy to compute and/or already exist in dist().
Unfortunately, one that might be interesting and is too difficult for me to calculate myself is Cattell's Rp. I cannot find it in R.
Does anybody know if this exists already?
Or can you help me write a function?
The formula (Cattell 1994) for Rp is this:
(2k - d^2) / (2k + d^2)
where:
k is the median of chi-square for a sample of size n;
d is the sum of the (weighted by m) differences between the two profiles,
something like: sum(m*(x[i] - y[i]));
one thing I don't know is how to get the chi-square median in there.
Thank you
What I get without defining k is:
Rp.Cattell <- function(x,y){z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);return(z)}
Vector examples are:
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
They are measured by the same device, but relate to different body parts. They don't need to be standardised or weighted, I would say.
This page gives a general formula for k, and then gives a more thorough method using SAS/IML which pretty much gives the same results. So I used the general formula, added calculation of degrees of freedom, which leads to this:
Rp.Cattell <- function(x, y) {
  dof <- (2 - 1) * (length(y) - 1)
  k <- (1 - 2/(9 * dof))^3
  z <- (2*k - sum(sum(x - y))^2) / (2*k + sum(sum(x - y))^2)
  return(z)
}
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
Rp.Cattell(x, y)
# [1] -0.9012083
Does this figure appear to make sense?
Trying to verify the function, I have now found out that the median of chi-square is the chi-square value at 50% probability (i.e. relative to chance). So the function should be:
Rp.Cattell <- function(x,y){
dof <- (2-1) * (length(y)-1)
k <- qchisq(.50, df=dof)
z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);
return(z)}
It is necessary though to standardize the values first, so the results are distributed correctly.
So:
# they are already centered, so only scaling is applied
x <- as.vector(scale(c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758), center = FALSE, scale = TRUE))
y <- as.vector(scale(c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925), center = FALSE, scale = TRUE))
Rp.Cattell(x, y)
# [1] -0.584423
This sounds reasonable now - or not?
I think the calculation of z is incorrect.
You need to calculate the sum of the squared differences, not the square of the sum of differences. Besides, the multiplication operator is missing in 2k.
It should be
z <- (2*k-sum((x-y)^2))/(2*k+sum((x-y)^2))
Do you agree?
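Putting the corrections from this thread together, a possible final version could look like the sketch below. It assumes, as the posts above do, that the degrees of freedom are length(x) - 1, that k is the chi-square median at that dof, and that the profiles are scaled beforehand; it is an illustration of the thread's conclusion, not a verified reference implementation.
Rp.Cattell <- function(x, y) {
  dof <- length(x) - 1            # degrees of freedom, as used in the posts above
  k <- qchisq(0.50, df = dof)     # median of the chi-square distribution
  d2 <- sum((x - y)^2)            # sum of squared differences (not the squared sum)
  (2 * k - d2) / (2 * k + d2)
}
x <- as.vector(scale(c(-1.2357, -1.1999, -1.4727, -0.3915, -0.2547, -0.4758), center = FALSE))
y <- as.vector(scale(c(0.7785, 0.9357, 0.7165, -0.6067, -0.4668, -0.5925), center = FALSE))
Rp.Cattell(x, y)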

In R, how do I find the optimal variable to minimise the correlation between two datasets [duplicate]

Possible Duplicate:
In R, how do I find the optimal variable to maximize or minimize correlation between several datasets
This can be done in Excel, but my dataset has gotten too large. In Excel, I would use Solver.
I have 5 variables and I want to create a weighted average of these 5 variables so that it has the lowest correlation with a 6th variable.
Column A,B,C,D,E = random numbers
Column F = random number (to which I want to minimise the correlation)
Column G = A*w1 + B*w2 + C*w3 + D*w4 + E*w5
where w1 to w5 are the coefficients produced by Solver. In a separate cell, I would have CORREL(F,G).
This is all achieved with the following constraints in mind:
1. w1 to w5 have to be between 0 and 1
2. w1+w2+w3+w4+w5 = 1
I'd like to print the results of this so that I can have an efficient frontier type chart.
How can I do this in R? Thanks for the help.
I looked at the other thread mentioned by Vincent and I think I have a better solution. I hope it is correct. As Vincent points out, your biggest problem is that the optimization tools for such non-linear problems do not offer a lot of flexibility for dealing with your constraints. Here, you have two types of constraints: 1) all your weights must be >= 0, and 2) they must sum to 1.
The optim function has a lower option that can take care of your first constraint. For the second constraint, you have to be a bit creative: you can force your weights to sum to one by scaling them inside the function to be minimized, i.e. rewrite your correlation function as function(w) cor(X %*% w / sum(w), Y).
# create random data
n.obs <- 100
n.var <- 6
X <- matrix(runif(n.obs * n.var), nrow = n.obs, ncol = n.var)
Y <- matrix(runif(n.obs), nrow = n.obs, ncol = 1)
# function to minimize
correl <- function(w) cor(X %*% w / sum(w), Y)
# initial guess
w0 <- rep(1 / n.var, n.var)
# optimize
opt <- optim(par = w0, fn = correl, method = "L-BFGS-B", lower = 0)
optim.w <- opt$par / sum(opt$par)
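As a quick check (an illustrative follow-up, not part of the original answer), you can verify that the rescaled weights sum to one and see the correlation they achieve:
sum(optim.w)              # should be 1 after rescaling
cor(X %*% optim.w, Y)     # correlation achieved by the optimised weights (equals opt$value)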
