extreme value function in R

I'm studying an extreme value problem from http://cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf
It follows the exercise that starts: "An interesting extreme value distribution is given by Pevsner (2003, p.103)".
I tried to generate a sample of size 1000 from the standard normal distribution and repeat this 1000 times. Then I want to subtract an from these maxima and divide by bn. The quantities from the book, as I coded them, are
fn <- exp(-n)*exp(-exp(-n))
an <- sqrt(2*log(n)) - 0.5*(log(log(n))+log(4*pi))*(2*log(n))^(-1/2)
bn <- (2*log(n))^(-1/2)
> my.stat <-NULL
> for (i in 1:1000) {
+ n <- rnorm(1000)
+ fn <- exp(-n)*exp(-exp(-n))
+ an <- sqrt(2*log(n)) - 0.5*(log(log(n))+log(4*pi))*(2*log(n))^(-1/2)
+ bn <- (2*log(n))^(-1/2)
+ my.stat <- c(my.stat, sum(sum(fn-an)/bn)))
> par(mfrow=c(2,2))
> hist(my.stat,freq=FALSE,main="histogram of 1000 M",xlab="M")
Error: object 'fn' not found
I get the error above, but I'm more concerned about how to deal with the
my.stat <- c(my.stat, sum(sum(fn-an)/bn))
part of the loop, since I want to store all the sample values in the vector my.stat and compute (fn-an)/bn during each iteration.
I guess my question is: is there a better way to add (fn-an)/bn to my vector each time through the for loop? Essentially, I want my.stat to store all the (fn-an)/bn values. Thanks.
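For what it's worth, here is a minimal sketch of one way this exercise is often set up (an interpretation, not the book's code): n is the sample size (1000), an and bn are constants computed once from that n, the value stored each iteration is the standardized maximum (max(x) - an)/bn, and the book's f(x) is only the Gumbel density used to overlay the histogram.
n <- 1000                                        # sample size, not the sample itself
an <- sqrt(2*log(n)) - 0.5*(log(log(n)) + log(4*pi))*(2*log(n))^(-1/2)
bn <- (2*log(n))^(-1/2)
my.stat <- numeric(1000)                         # preallocate instead of growing with c()
for (i in 1:1000) {
    x <- rnorm(n)                                # one standard normal sample of size n
    my.stat[i] <- (max(x) - an) / bn             # standardized maximum
}
hist(my.stat, freq = FALSE, main = "histogram of 1000 M", xlab = "M")
curve(exp(-x) * exp(-exp(-x)), add = TRUE)       # Gumbel density f(x) for comparison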

sampling random values each iteration

I have some simulated data; on top of the data I add some noise to see how the noise affects my data for further analyses. I created the following function:
create.noise <- function(n, amount_needed, mean, sd){
set.seed(25)
values <- rnorm(n, mean, sd)
returned.values <- sample(values, size=amount_needed)
}
I call this function in the following loop:
dataframe.noises <- as.data.frame(noises) # here I create a data frame (dim 1x45) containing zeros
for(i in 1:100){
noises <- as.matrix(create.noise(100,45,0,1))
dataframe.noises[,i] <- noises
data_w_noise <- df.data_responses+noises
Estimators <- solve(transposed_schema %*% df.data_schema) %*% (transposed_schema %*% data_w_noise)
df.calculated_estimators[,i] <-Estimators
}
The code above always returns the same values. One solution I tried is passing i as a parameter (which I don't think is correct) and calling set.seed(25 + i) in each iteration.
This gives me different values for each iteration, but as mentioned, I don't think this is the correct way to go about it.
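No answer is recorded here, but a minimal sketch of one common fix is to call set.seed() once before the loop rather than inside create.noise(), so every call draws fresh noise while the simulation as a whole stays reproducible. The 45x100 storage shape below is an assumption; df.data_responses and the estimator step are the asker's own objects and are only hinted at in a comment.
create.noise <- function(n, amount_needed, mean, sd) {
    values <- rnorm(n, mean, sd)            # no set.seed() here
    sample(values, size = amount_needed)    # return the subsample explicitly
}

set.seed(25)                                # one seed for the whole simulation
dataframe.noises <- as.data.frame(matrix(0, nrow = 45, ncol = 100))
for (i in 1:100) {
    noises <- create.noise(100, 45, 0, 1)   # different draws on every iteration
    dataframe.noises[, i] <- noises
    # data_w_noise <- df.data_responses + noises   # then proceed as in the original loop
}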

Is there a way that I can store a polynomial or the coefficients of a polynomial in a single element in an R vector?

I want to create an R function that generates the cyclic finite group F[x].
Basically, I need to find a way to store polynomials, or at the least a polynomial's coefficients, in a single element in an R vector.
For example, if I have a set F={0,1,x,1+x}, I would like to save these four polynomials into an R vector such as
F[1] <- 0 + 0x
F[2] <- 1 + 0x
F[3] <- 0 + x
F[4] <- 1 + x
But I keep getting the error: "number of items to replace is not a multiple of replacement length"
Is there a way that I can at least do something like:
F[1] <- (0,0)
F[2] <- (1,0)
F[3] <- (0,1)
F[4] <- (1,1)
For reference in case anyone is interested in the mathematical problem I am trying to work with, my entire R function so far is
gf <- function(q,p){
### First we need to create an irreducible polynomial of degree p
poly <- polynomial(coef=c(floor(runif(p,min=0,max=q)),1)) #This generates the first polynomial of degree p with coefficients ranging between the integer values of 0,1,...,q
for(i in 1:(q^5*p)){ #we generate/check our polynomial a sufficient amount of times to ensure that we get an irreducible polynomial
poly.x <- as.function(poly) #we coerce the generated polynomial into a function
for(j in 0:q){ #we check if the generated polynomial is irreducible
if(poly.x(j) %% q == 0){ #if we find that a polynomial is reducible, then we generate a new polynomial
poly <- polynomial(coef=c(floor(runif(p,min=0,max=q)),1)) #...and go through the loop again
}
}
}
list(poly.x=poly.x,poly=poly)
### Now, we need to construct the cyclic group F[x] given the irreducible polynomial "poly"
F <- c(rep(0,q^p)) #initialize the vector F
for(j in 0:(q^p-1)){
#F[j] <- polynomial(coef = c(rep(j,p)))
F[j] <- c(rep(0,3))
}
F
}
Make sure F is a list and then use [[]] to place the values
F<-list()
F[[1]] <- c(0,0)
F[[2]] <- c(1,0)
F[[3]] <- c(0,1)
F[[4]] <- c(1,1)
Lists can hold heterogeneous data types. If everything will be a constant and a coefficient for x, then you can also use a matrix. Just set each row value with the [row, col] type subsetting. You will need to initialize the size at the time you create it. It will not grow automatically like a list.
F <- matrix(ncol=2, nrow=4)
F[1, ] <- c(0,0)
F[2, ] <- c(1,0)
F[3, ] <- c(0,1)
F[4, ] <- c(1,1)
You will have to store those as strings, since otherwise R will try to interpret the operators. You can have
F[1] <- "0 + 0x"
Or even a matrix, which is more flexible for apply and other operations you might want to do:
mat <- matrix(c(0,1,0,1,0,0,1,1), ncol=2)
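Since the gf() function above already uses polynomial() (presumably from the polynom package), one more option, sketched under that assumption, is to store the polynomial objects themselves in a list, with coefficients given in ascending order of powers:
library(polynom)                      # assumed source of polynomial()
F <- list()
F[[1]] <- polynomial(c(0, 0))         # 0
F[[2]] <- polynomial(c(1, 0))         # 1
F[[3]] <- polynomial(c(0, 1))         # x
F[[4]] <- polynomial(c(1, 1))         # 1 + x
F[[4]]                                # prints: 1 + x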

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
starting R with --max-ppsize=50000
setting options(expressions = 50000)
but the error persists
I am working on Windows and using RStudio.
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model by selecting fewer variables is not the solution I am looking for right now.
The first bottleneck (I can't guarantee there won't be others) is the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2) = nc*(nc-1)/2 (where nc is the number of columns). When I try with 50000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3Gb are required when nc is 25000). The code below runs on my laptop when nc <- 10000.
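Those quoted sizes can be checked directly with a back-of-the-envelope calculation at 8 bytes per double (vec_gb below is just a helper for this note):
vec_gb <- function(nc) nc * (nc - 1) / 2 * 8 / 2^30  # choose(nc, 2) doubles, in GiB
vec_gb(50000)   # ~9.3
vec_gb(25000)   # ~2.3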
I have a few caveats about this approach:
you won't be able to handle a problem with 50000 columns unless you have at least 10 GB of memory, because of the issue described above.
the biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard)
I have no idea if the p>>n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr>nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1,X)) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
                      y = NULL, X = NULL, off = 0) {
    if (!is.null(weights)) {
        if (!inherits(weights, "formula"))
            stop("`weights' must be a formula")
        w <- model.frame(weights, data)[[1]]
    } else w <- NULL
    if (is.null(X)) {
        tt <- terms(formula)
        mf <- model.frame(tt, data)
        if (is.null(off <- model.offset(mf)))
            off <- 0
        mm <- model.matrix(tt, mf)
        y <- model.response(mf) - off
    } else {
        ## model matrix specified directly
        if (is.null(y)) stop("both y and X must be specified")
        mm <- X
        tt <- NULL
    }
    qr <- biglm:::bigqr.init(NCOL(mm))
    qr <- biglm:::update.bigqr(qr, mm, y, w)
    rval <- list(call = sys.call(), qr = qr, assign = attr(mm, "assign"),
                 terms = tt, n = NROW(mm), names = colnames(mm),
                 weights = weights)
    if (sandwich) {
        p <- ncol(mm)
        n <- nrow(mm)
        xyqr <- biglm:::bigqr.init(p * (p + 1))
        xx <- matrix(nrow = n, ncol = p * (p + 1))
        xx[, 1:p] <- mm * y
        for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
        xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)
        rval$sandwich <- list(xy = xyqr)
    }
    rval$df.resid <- rval$n - length(qr$D)
    class(rval) <- "biglm"
    rval
}
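Since the returned object is given class "biglm", the usual accessors should apply to the toy fit above; a hedged usage note (with p >> n most coefficients come back NA, as mentioned):
cf <- coef(rr)      # rr from the example above
sum(!is.na(cf))     # how many coefficients were actually estimated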

Generating two series with a certain correlation and a specific condition in R

I want to generate two data series of size 100 in R: remission times, tr, from an Exp(mean=1) distribution and survival times, t, from an Exp(mean=2.5) distribution. I want them to be negatively correlated (say, with correlation -0.5). But at the same time I want R to avoid values of t[i] that are less than tr[i] for data point i, because survival times should be greater than remission times. I have been able to produce some correlation between the two variables (although the target correlation is not exactly reproduced) using the following code:
rho <- -0.5
mu <- rep(0,2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)
library(MASS)
rawvars <- mvrnorm(100, mu=mu, Sigma=Sigma)
pvars <- pnorm(rawvars)
tr<-rep(0,100)
for(i in 1:100){
tr[i] <- qexp(pvars[,1][i], 1/1)
}
t<-rep(0,100)
for(i in 1:100){
repeat {
t[i] <- qexp(pvars[,2][i], 1/2)
if (t[i]>tr[i]) break
}
}
cor(tr,t)
sum(tr>t) # shows number of invalid cases
But how should I efficiently impose the condition so that R only generates values of t that are greater than the corresponding tr?
Moreover, is there a better (faster) way to do the whole thing in R?
The issue here is that qexp is the quantile function and will return the same value for the same probability pvars[,2][i]. As a result, your code can easily go into an infinite loop when any one of the pvars[i,] is such that t[i] <= tr[i]. To avoid that, you must regenerate your rawvars for each (t[i], tr[i]) pair that fails your condition. In addition, looping over pvars is not necessary, since qexp and the > operator are both vectorized. The following code does what you want:
rho <- -0.5
mu <- rep(0,2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)
library(MASS)
set.seed(1) ## so that results are repeatable
compute.tr.t <- function(n, paccept) {
    n <- round(n / paccept)
    rawvars <- mvrnorm(n, mu = mu, Sigma = Sigma)
    pvars <- pnorm(rawvars)
    tr <- qexp(pvars[, 1], 1/1)
    t <- qexp(pvars[, 2], 1/2)
    keep <- which(t > tr)
    return(data.frame(t = t[keep], tr = tr[keep]))
}
n <- 10000 ## generating 10000 instead of 100, this can now be large
paccept <- 1
res <- data.frame()
while (n > 0) {
    new.res <- compute.tr.t(n, paccept)
    res <- rbind(res, new.res)
    paccept <- nrow(new.res) / n
    n <- n - nrow(new.res)   # how many more samples we still need
}
Notes:
The function compute.tr.t borrows a technique from rejection sampling. Its input arguments are the requested number of samples and the expected probability of acceptance. With this:
It generates round(n / paccept) exponential variates for both tr and t, as you do, to account for the probability of acceptance.
It only keeps those satisfying the condition t > tr.
What compute.tr.t returns may be less than the requested n samples. We can then use this information to compute how many more samples we need and what the updated expected probability of acceptance is.
We generate the samples satisfying our condition in a while loop. In this loop:
We call compute.tr.t with a requested number of samples to generate and the expected acceptance rate. Initially, these will be set to how many total samples we want and 1, respectively.
The result of compute.tr.t are then appended to the result data frame res.
Updating the probability of acceptance is simply the ratio of how many samples were returned over how many were requested.
Updating the requested number of samples is simply how many more we need from the total number we want.
We stop when the next requested number of samples is less than or equal to 0 (i.e., we have enough samples).
The resulting data frame may contain more than the total number of samples we want (see the one-liner below).
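If exactly 100 pairs are needed in the end, a one-line trim after the loop is enough (a usage note added here, not part of the original answer):
res <- head(res, 100)   # keep only the first 100 accepted (tr, t) pairs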
Running this code, we get:
print(cor(res$tr,res$t))
[1] -0.09128498
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 0
We note that the anti-correlation is significantly weaker than expected. This is due to your condition. If we remove this condition by modifying compute.tr.t as:
compute.tr.t <- function(n, paccept) {
    n <- round(n / paccept)
    rawvars <- mvrnorm(n, mu = mu, Sigma = Sigma)
    pvars <- pnorm(rawvars)
    tr <- qexp(pvars[, 1], 1/1)
    t <- qexp(pvars[, 2], 1/2)
    return(data.frame(t = t, tr = tr))
}
Then we get:
print(cor(res$tr,res$t))
##[1] -0.3814602
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 3676
The correlation is now much more reasonable, but the number of invalid cases is significant.

Matrix computation with for loop

I am a newcomer to R, having migrated from GAUSS because of license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given the n×1 vector x and the k-dimensional parameter vectors mu and sig, A is defined by A[i,j] = dnorm(x[i], mu[j], sig[j]). The following code works fine for small sizes (n=40, k=4), but slows down significantly when n is around 10^6 and k is about the same size as n^(1/3).
I am doing a simulation experiment to verify bootstrap validity, so I need to recompute the matrix A (#simulations × #bootstraps) times, and it becomes rather time-consuming because I want to experiment with many different values of n and k. I vectorized the code as much as I could (thanks to the vector arguments of dnorm), but can I get more speed-up?
Preemptive thanks for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
A[,j]=dnorm(x,mu[j],sig[j])
}
Your method can be put into a function like this
A.fill <- function(x, mu, sig) {
    k <- length(mu)
    n <- length(x)
    A <- matrix(NA, n, k)
    for (j in 1:k) A[, j] <- dnorm(x, mu[j], sig[j])
    A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm using suitable repetitions of x, mu, and sig. The vector z will have the columns of the desired matrix stacked, and the matrix to be returned can then be formed from that vector just by specifying the number of rows and columns. See the following function:
B.fill <- function(x, mu, sig) {
    k <- length(mu)
    n <- length(x)
    z <- dnorm(rep(x, times = k), rep(mu, each = n), rep(sig, each = n))
    B <- matrix(z, nrow = n, ncol = k)
    B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments, B.fill is quite slow for large values of n.
The reason lies in the construct rep(..., each=...).
So is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x, mu, sig) {
    k <- length(mu)
    n <- length(x)
    sapply(1:k, function(j) dnorm(x, mu[j], sig[j]), simplify = TRUE)
}
This function is about 20% faster than A.fill.
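A minimal timing sketch along those lines (assuming the microbenchmark package is available; absolute times and the exact speed-up will differ by machine):
library(microbenchmark)
set.seed(1)
n <- 1e5; k <- 50
x <- rnorm(n)
mu <- rnorm(k)
sig <- runif(k, 0.5, 3)
microbenchmark(A.fill(x, mu, sig), C.fill(x, mu, sig), times = 10)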
