Speed up glm in apply function in R

My question is based on the following situation:
I have a matrix with 20 rows and > 100,000 columns. I would like to apply the glm function and extract the likelihood ratio statistic for each of the columns. So far, I have implemented it in the following manner. For example:
X <- gl(5, 4, length = 20); Y <- gl(4, 1, length = 20)
X <- factor(X); Y <- factor(Y)
matrix <- matrix(sample.int(15, size = 20*100000, replace = TRUE), nrow = 20, ncol = 100000)
apply(matrix, 2, function(x) glm(x ~ X+Y, poisson)$deviance)
Is there any way to speed up the computation time? I figured that since each vector used in glm is not big at all (a vector of length 20), speedglm is not helpful here.
I would be glad if anyone could give me advice on this. Thank you very much in advance!

I ran a test of 1000 columns. It only took 2.4 seconds.
system.time(apply(matrix[,1:1000], 2, function(x) glm(x ~ X+Y, poisson)$deviance))
user system elapsed
2.40 0.00 2.46
I also tried 50,000 columns and it seemed to scale very linearly.
Therefore you only need to wait about 4 minutes to compute 100,000 columns, so I don't see the problem. However, the bottleneck is the overhead of calling the glm() function 100,000 times; try to avoid running a high-level function that many times.
To run faster, here are some options, listed in ascending order of effort:
wrap it in a parallel loop (2x-4x speed-up; see the sketch below)
figure out how to perform the calculation as matrix multiplications in R (~50x)
implement it with Rcpp (~100x)
None of these options will take you less than 4 minutes to implement, though.
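For illustration, here is a minimal sketch of the parallel option combined with the advice to avoid per-call overhead; it is not the exact approach from the answer. It builds the model matrix once and calls glm.fit() directly, skipping the formula handling that glm() repeats for every column, and spreads the columns over cores with parallel::mclapply (Unix-alike systems; on Windows, parLapply over a cluster is the equivalent). The core count is an arbitrary placeholder.
library(parallel)

mm <- model.matrix(~ X + Y)   # design matrix shared by every column

col_deviance <- function(x) glm.fit(mm, x, family = poisson())$deviance

deviances <- unlist(
  mclapply(seq_len(ncol(matrix)),
           function(j) col_deviance(matrix[, j]),
           mc.cores = 4)      # placeholder; set to your machine's core count
)
Since glm.fit() bypasses glm()'s formula and input handling, it is worth spot-checking a few columns against the original apply() call.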

Related

How to speed up row-wise computations on a data.frame using alternative functions to apply family?

I have a data frame with 10,000 rows and 40 columns. I am trying to apply a function to each of these rows. For each row, I expect to return a scalar, which is the value of the statistic I am calculating in this function. Below is what I have done so far:
library(dfadjust)
library(MASS)
# Creating example data #
nrows=10000
ncols=40
n1=20
n2=20
df=data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# Function to evaluate on each row of df #
get_est= function(x){
mod = rlm(x~cov$group)
fit = dfadjustSE(mod)
coef = fit$coefficients[2,1]
se = fit$coefficients[2,4]
stats = coef/se
return(stats)
}
# Applying above function to full data #
t1=Sys.time()
estimates=apply(df, 1, function(x) get_est(x))
t2=Sys.time()-t1
# Time taken by apply function
Time difference of 37.10623 secs
Is there a way to significantly decrease the time taken to run get_est() on the full data? The main reason I need to speed up the computation on a single df is that I have 1000 more data frames with the same dimensions, and I have to apply this function to each row of each of these data frames simultaneously. To illustrate, below is the broader situation I am dealing with:
# Creating example data
set.seed(1234)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df.list = list()
for(i in 1:1000){
df.list[[i]] = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
}
# Applying get_est() to each row and to each of data frame in df.list #
pcks = c('MASS','dfadjust')
# parallel backend setup for %dopar% (assumed; not shown in the original post)
library(foreach)
library(doParallel)
registerDoParallel()
all.est = foreach(j = 1:length(df.list), .combine = cbind, .packages = pcks) %dopar% {
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
est = apply(df.list[[j]], 1, function(x) get_est(x))
return(est)
}
Even after parallelizing, it is taking hours to finish. My ultimate objective is to significantly cut down the time needed to obtain "all.est", which will contain 10,000 rows and 1000 columns, where each column has the statistic estimates for the respective data set. Any help is much appreciated!! Thanks in advance!
By doing some preprocessing of the data, we can call the rlm function with less overhead:
# y is assumed to be the group covariate from the question, coded numerically
# (e.g. as.integer(cov$group) - 1L)
x3 <- as.matrix(cbind(1L, y))
colnames(x3) <- c('(Intercept)', 'x')
w = rep(1, nrow(x3))
get_est = function(x){
  # call rlm with the precomputed design matrix instead of a formula
  mod = rlm(x3, x, weights = w, w = w)
  fit = dfadjustSE(mod)
  coef = fit$coefficients[2,1]
  se = fit$coefficients[2,4]
  stats = coef/se
  return(stats)
}
I got 12 seconds instead of 18 seconds for the initial approach, a ~33% improvement.
For larger speedups I would suggest looking into rlm and dfadjustSE functions and try to optimize those for your specific needs (removing unnecessary checks etc., as you are calling those functions millions of times). But that probably will be quite time consuming and better performance is not guaranteed. Maybe there are other packages with similar but faster functions?
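As a hedged usage sketch (not part of the original answer), the precomputed x3 and w above can be combined with row-level parallelism via parallel::mclapply. This assumes a Unix-alike system (on Windows, parApply over a cluster is the equivalent), and the core count is an arbitrary placeholder.
library(MASS)
library(dfadjust)
library(parallel)

rows <- asplit(as.matrix(df), 1)                            # one numeric vector per row of df
estimates <- unlist(mclapply(rows, get_est, mc.cores = 4))  # placeholder core count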

How can I automate creation of a list of vectors containing simulated data from a known distribution, using a "for loop" in R?

First stack exchange post so please bear with me. I'm trying to automate the creation of a list, and the list will be made up of many empty vectors of various, known lengths. The empty vectors will then be filled with simulated data. How can I automate creation of this list using a for loop in R?
In this simplified example, fish have been caught by casting a net 4 times, and their abundance is given in the vector "abundance" (from counting the total number of fish in each net). We don't have individual fish weights, just the mean weight of all fish in each net, so I need to simulate their weights from a lognormal distribution. So, I'm looking to fill those empty vectors for each net, each with a length equal to the number of fish caught in that net, with weight data simulated from a lognormal distribution with a known mean and standard deviation.
A simplified example of my code:
abundance <- c(5, 10, 9, 20)
net1 <- rep(NA, abundance[1])
net2 <- rep(NA, abundance[2])
net3 <- rep(NA, abundance[3])
net4 <- rep(NA, abundance[4])
simulated_weights <- list(net1, net2, net3, net4)
#meanlog vector for each net
weight_per_net
#sdlog vector for each net
sd_per_net
for (i in 1:4) {
simulated_weights[[i]] <- rlnorm(n = abundance[i], meanlog = weight_per_net[i], sdlog = sd_per_net[i])
print(simulated_weights[[i]])
}
Could anyone please help me automate this so that I don't have to write out each net vector (e.g. net1) by hand, and then also write out all the net names in the list() function? There are far more nets than 4, so it would be extremely time consuming and inefficient to do it this way. I've tried several things from other posts, like paste0(), other for loops, and as.list(c()), all to no avail.
Thanks!
HM
Turns out you don't need the net1, net2, etc variables at all. You can just do
abundance <- c(5, 10, 9, 20)
simulated_weights <- lapply(abundance, function(x) rep(NA, x))
The lapply function will return the list you need, calling the anonymous function once for each value of abundance.
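As a follow-up sketch (not part of this answer), the pre-allocation step can be skipped entirely: since rlnorm draws can be generated per net directly, Map() builds the whole list in one call, assuming weight_per_net and sd_per_net are numeric vectors of the same length as abundance.
simulated_weights <- Map(rlnorm, n = abundance,
                         meanlog = weight_per_net, sdlog = sd_per_net)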
We could also create 'simulated_weights' with split and rep:
simulated_weights <- split(rep(rep(NA, length(abundance)), abundance),
rep(seq_along(abundance), abundance))

R: How do I aggregate losses by a Poisson observation?

I'm new to R, but I am trying to use it to aggregate losses that are observed from a severity distribution by an observation from a frequency distribution - essentially what rcompound does. However, I need a more granular approach, as I need to manipulate the severity distribution before 'aggregation'.
Let's take an example. Suppose you have:
rpois(10,lambda=3)
Thereby, giving you something like:
[1] 2 2 3 5 2 5 6 4 3 1
Additionally, suppose we have severity of losses determined by:
rgamma(20,shape=1,scale=10000)
So that we also have the following output:
[1] 233.0257 849.5771 7760.4402 731.5646 8982.7640 24172.2369 30824.8424 22622.8826 27646.5168 1638.2333 6770.9010 2459.3722 782.0580 16956.1417 1145.4368 5029.0473 3485.6412 4668.1921 5637.8359 18672.0568
My question is: what is an efficient way to get R to take each Poisson observation in turn and then aggregate losses from my severity distribution? For example, the first Poisson observation is 2. Therefore, adding two observations (the first two) from my Gamma distribution gives 1082.60.
I say this needs to be 'efficient' (in run time) because:
- The Poisson parameter may become significantly large, i.e. up to 1000 or so.
- The realisations are likely to be up to 1,000,000, i.e. up to a million Poisson and Gamma observations to sort through.
Any help would be greatly appreciated.
Thanks, Dave.
It looks like you want to split the gamma vector at positions indicated by the cumulative sums of the poisson vector.
The following function (from here) does the splitting:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
pois <- c(2, 2, 3, 5, 2, 5, 6, 4, 3, 1)
gam <- c(233.0257, 849.5771, 7760.4402, 731.5646, 8982.7640, 24172.2369, 30824.8424, 22622.8826, 27646.5168, 1638.2333, 6770.9010, 2459.3722, 782.0580, 16956.1417, 1145.4368, 5029.0473, 3485.6412, 4668.1921, 5637.8359, 18672.0568)
posits <- cumsum(pois)
Then do the following:
sapply(splitAt(gam, posits + 1), sum)
[1] 1082.603 8492.005 63979.843 61137.906 17738.200 19966.153 18672.057
According to the post I linked to above, the splitAt() function slows down for large arrays, so you could (if necessary) consider the alternatives proposed in that post. For my part, I generated 1e6 poissons and 1e6 gammas, and the above function ran in 0.78 sec on my machine.
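One such alternative, sketched here rather than taken from the linked post: build a group index with rep() and sum within groups, which stays vectorized and avoids the %in% scan inside splitAt(). The truncation of the group index assumes, as in the example above, that there may be fewer gamma draws than sum(pois).
grp <- rep(seq_along(pois), pois)   # which poisson draw each gamma loss belongs to
grp <- grp[seq_along(gam)]          # truncate in case fewer gamma draws were generated
agg <- tapply(gam, grp, sum)        # aggregate loss per poisson observation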

Filling a matrix with for-loop output

I want to fill a matrix with data simulated by using a for loop containing the rbinom function. This loop executes the rbinom function 100 times, thus generating a different outcome every run. However, I can't find a way to get these outcomes into a matrix for further analysis. When assigning the for loop to an object, this object appears empty in the environment and thus cannot be used to build the matrix ('data' must be of a vector type, was 'NULL').
When not including the rbinom function in a for loop, it can be assigned to an object and I'm able to use the output in the matrix. Every column, however, contains the exact same sequence of numbers. When only running the for loop containing the rbinom function, I do get different sequences, as it runs the rbinom function 100 times instead of once. I just don't know how to integrate the loop into the matrix.
The two pieces of code I have:
n = 100
size = 7
loop_vill <- for (i in 1:100) {
print(rbinom(n=n, size=size, prob=0.75)) #working for-loop
}
vill <- rbinom(n=n, size=size, prob=0.75)
sim_data_vill <- matrix(data=vill, nrow=length(loop_vill), ncol=100)
#creates a matrix in which all columns are exact copies; should be solved
#when able to use the outputs of loop_vill
sim_data_vill
When calling sim_data_vill, it (logically) contains a matrix of 100 rows and 100 columns, with all columns being the same. However, I would like to see a matrix with all columns being different (thus containing the output of a new run of the rbinom-function each time).
Hello, as far as I can see you are having a few problems:
You are currently not running the for loop for each column (only one vector is saved in vill)
You are not looping over the rbinom call when filling the matrix
Now there are a few ways to achieve what you want. (Scroll to the last example for the efficient way.)
method 1: For loop
Using your idea, we can use a for loop. The best approach is to pre-allocate an empty matrix first and fill it in with the for loop.
nsim <- 100 #how many rbinom runs (columns) we want
n <- 100000
size = 7
prob = 0.75
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for(i in seq(nsim)) #iterate from 1 to nsim
sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time
Now this will work, but is a bit slow, and requires a whopping 3 lines of code for the simulation part!
method 2: apply
We can remove the for loop and the pre-allocated matrix by using one of the myriad apply-like functions. One such function is replicate. This reduces the massive 3 lines of code to:
sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
Huh, that was short. But can we do even better? Actually, running functions such as rbinom multiple times can be rather slow and costly.
method 3: using vectorized functions (very fast)
One word you will hear whispered (or shouted) when it comes to programming in R is vectorized. Basically, calling a function induces overhead, and if you are working with a vectorized function, calling it once means you only pay that overhead once, instead of multiple times. All distribution functions in R, such as rbinom, are vectorized. So what if we just do all the simulation in one go?
sim_data_vill_vectorized_functions <- matrix(rbinom(nsim * n, size, prob), ncol = nsim, nrow = n, byrow = FALSE) #perform all simulations in 1 rbinom call, and fill in 1 matrix.
So lets just quickly check how much faster this is compared to using a for loop or apply. This can be done using the microbenchmark package:
library(microbenchmark)
microbenchmark(for_loop = {
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for(i in seq(nsim)) #iterate from 1 to nsim
sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time
},
apply = {
sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
},
vectorized = {
sim_data_vill_vectorized <- matrix(rbinom(nsim * n, size = size, prob = prob), ncol = nsim, nrow = n, byrow = FALSE)
}
)
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 751.6121 792.5585 837.5512 812.7034 848.2479 1058.4144 100
apply 752.4156 781.3419 837.5626 803.7456 901.6601 1154.0365 100
vectorized 696.9429 720.2255 757.7248 737.6323 765.3453 921.3982 100
Looking at the median time, running all the simulations at once is about 60 ms faster than using a for loop. Here that is not a big deal, but in other cases it might be. (Reverse n and nsim, and you will start seeing the overhead become a big part of the calculation.)
Even when it is not a big deal, using vectorized computations where they pop up is always preferred: it makes the code more readable and avoids unnecessary bottlenecks that have already been optimized away in the underlying implementation.

Knapsack case R implementation for multiple persons using genetic algorithm

I am trying to implement a genetic algorithm in R. I found out that R has the 'GA' and 'genalg' packages for genetic algorithm implementations. I came across the example at the link http://www.r-bloggers.com/genetic-algorithms-a-simple-r-example/. They solve the Knapsack problem, which can be briefly explained as:
"You are going to spend a month in the wilderness. You’re taking a backpack with you, however, the maximum weight it can carry is 20 kilograms. You have a number of survival items available, each with its own number of 'survival points'. Your objective is to maximize the number of survival points."
The problem is easily solved using the 'genalg' package for a single person, and the output is a binary string. Now I have a doubt: let's say instead of one person there are 2 or more, i.e. multiple persons, and we need to distribute the survival points. The weight constraint applies to each person. Then how can we solve this problem? Can we use the 'genalg' or 'GA' package? If so, how can we apply them? Are there any examples of this that are solved in R or other software?
Thanks
The R package adagio (https://cran.r-project.org/web/packages/adagio/index.html) comes with two functions (knapsack and mknapsack) which solve this type of problem more efficiently by dynamic programming.
A simple approach could be to have one chromosome containing all individuals in the group and have the evaluation function split this chromosome into multiple parts, one for each individual, and then have these parts evaluated. In the example below (based on the example in the question) I have assumed each individual has the same weight limit and that multiple individuals can bring the same item.
library(genalg)
#Set up the problem parameters
#how many people in the group
individual_count <-3
#The weight limit for one individual
weightlimit <- 20
#The items with their survivalpoints
dataset <- data.frame(item = c("pocketknife", "beans", "potatoes", "unions",
"sleeping bag", "rope", "compass"), survivalpoints = c(10, 20, 15, 2, 30,
10, 30), weight = c(1, 5, 10, 1, 7, 5, 1))
#Next, we choose the number of iterations, design and run the model.
iter <- 100
#Our chromosome has to be large enough to contain a bit for all individuals and for all items in the dataset
chromosomesize <- individual_count * nrow(dataset)
#Function definitions
#A function to split vector X in N equal parts
split_vector <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
#Evaluate an individual (a part of the chromosome)
evalIndividual <- function(x) {
current_solution_survivalpoints <- x %*% dataset$survivalpoints
current_solution_weight <- x %*% dataset$weight
if (current_solution_weight > weightlimit)
return(0) else return(-current_solution_survivalpoints)
}
#Evaluate a chromosome
evalFunc <- function(x) {
#First split the chromosome in a list of individuals, then we can evaluate all individuals
individuals<-split_vector(x,individual_count)
#now we need to sapply the evalIndividual function to each element of individuals
return(sum(sapply(individuals,evalIndividual)))
}
#Run the Genetic Algorithm
GAmodel <- rbga.bin(size = chromosomesize, popSize = 200, iters = iter, mutationChance = 0.01,
elitism = T, evalFunc = evalFunc)
#First show a summary
summary(GAmodel,echo=TRUE)
#Then extract the best solution from the GAmodel, copy/paste from the source code of the summary function
filter = GAmodel$evaluations == min(GAmodel$evaluations)
bestSolution = GAmodel$population[filter, , drop= FALSE][1,]
#Now split the solution in the individuals.
split_vector(bestSolution,individual_count)
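As a small hedged follow-up (not in the original answer), each individual's 0/1 vector can be mapped back to item names from the dataset to make the solution easier to read:
solution_per_person <- split_vector(bestSolution, individual_count)
lapply(solution_per_person, function(bits) dataset$item[bits == 1])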
