FastAPI slow in MLFlow

FastAPI slow in MLFlow - fastapi

I have a use case where I want to deploy ML Models with low latency and high throughput.
In MLFlow, I couldn't achieve it, so I tried FastAPI and it showed quite good results.
Thus, I tried to replace the Flask engine with FastAPI in MLFlow. But I am getting very low throughput per process on using FastAPI in MLFlow as compared to standalone FastAPI serving for the same model.
Vegeta benchmarking results:
MLFlow with FastAPI Test1:
(base) [ec2-user#ip-172-31-43-232 testing]$ cat target_all.txt | ./vegeta attack -duration=1m -rate=100 -max-workers=20 | ./vegeta report
Requests [total, rate, throughput] 1174, 100.09, 100.07
Duration [total, attack, wait] 11.732s, 11.73s, 2.03ms
Latencies [min, mean, 50, 90, 95, 99, max] 1.71ms, 2.23ms, 1.87ms, 3.394ms, 3.911ms, 7.58ms, 9.813ms
Bytes In [total, mean] 111530, 95.00
Bytes Out [total, mean] 224234, 191.00
Success [ratio] 100.00%
Status Codes [code:count] 200:1174
MLFlow with FastAPI Test2:
(base) [ec2-user#ip-172-31-43-232 testing]$ cat target_all.txt | ./vegeta attack -duration=1m -rate=200 -max-workers=20 | ./vegeta report
Requests [total, rate, throughput] 2159, 158.75, 157.30
Duration [total, attack, wait] 13.725s, 13.6s, 125.605ms
Latencies [min, mean, 50, 90, 95, 99, max] 10.256ms, 124.783ms, 125.951ms, 129.294ms, 133.891ms, 148.876ms, 293.218ms
Bytes In [total, mean] 209423, 97.00
Bytes Out [total, mean] 412369, 191.00
Success [ratio] 100.00%
Status Codes [code:count] 200:2159
Standalone FastAPI Test3:
Requests [total, rate, throughput] 36000, 600, 599
Latencies [min, mean, 50, 90, 95, 99, max] 1.515ms, 2.608ms, 2.734ms, 3.386ms, 3.611ms, 4.204ms, 4.795ms
On using FastAPI, I am getting throughput to be more than 600 RPS per Gunicorn worker, while on using FastAPI with MLFlow, I am getting 100 RPS at max without affecting the response time of the model.
I think I am missing something in this PR. Can someone please help?

Related

Survival::Survfit (left, right, and interval censoring)

I'm attempting to estimate survival probabilities using the survfit function from the Survival package. My dataset consists of animals that were captured at various times over the course of ~2 years. Some animals died, some animals were censored after capture and some animals lived beyond the end of the study (I'm guessing this means I have left, right and interval censored data).
I can estimate survival probability using right censors only, but this assumes all animals were captured on the same day and does not account for adding new animals through time. What I would like to do is estimate survival as a function of calendar day and not as a function of time since capture.
Example data:
time1<- c(2, 386, 0, 1, 384, 3, 61, 33, 385, 64)
time2<- c(366, 665, 285, 665, 665, 454, 279, 254, 665, 665)
censor<- c(3,3,3,3,3,3,3,3,3,3)
region <- c(1, 6, 1, 6, 5, 1, 1, 1, 5, 6)
m1<- data.frame(time1, time2, censor, region)
code:
km.2 <- survfit(Surv(m1$time1, m1$time2, m1$censor, type = "interval") ~ m1$region)
Note the above code runs but doesn't estimate what I laid out above. I hope this is an issue of specifying certain arguments in the survfit function but this is where I am lost. Thanks for the help

Not sure if you've figured this out by now since it was nearly a year ago. I'm a bit confused by the experiment you're explaining.
However, one item that pops out immediately is the "time1". I believe you can't have any times start or end at 0. I recommend adding 0.5 or 1 to that specific time observation, and explaining why in your write up. But having a 0 value is a likely culprit for why it's not estimating properly

How to efficiently parallelize brms::brm?

Problem summary
I am fitting a brms::brm_multiple() model to a large dataset where missing data has been imputed using the mice package. The size of the dataset makes the use of parallel processing very desirable. However, it isn't clear to me how to best use the compute resources because I am unclear about how brms divides sampling on the imputed dataset among cores.
How can I choose the following to maximize efficient use of compute resources?
number of imputations (m)
number of chains (chains)
number of cores (cores)
Conceptual example
Let's say that I naively (or deliberately foolishly for sake of example) choose m = 5, chains = 10, cores = 24. There are thus 5 x 10 = 50 chains to be allocated among 24 cores reserved on the HPC. Without parallel processing, this would take ~50 time units (excluding compiling time).
I can imagine three parallelization strategies for brms_multiple(), but there may be others:
Scenario 1: Imputed datasets in parallel, associated chains in serial
Here, each of the 5 imputations is allocated to it's own processor which runs through the 10 chains in serial. The processing time is 10 units (a 5x speed improvement vs. non-parallel processing), but poor planning has wasted 19 cores x 10 time units = 190 core time units (ctu; =80% of the reserved compute resources). The efficient solution would be to set cores = m.
Scenario 2: Imputed datasets in serial, associated chains in parallel
Here, the sampling begins by taking the first imputed dataset and running one of the chains for that dataset on each of 10 different cores. This is then repeated for the remaining four imputed datasets. The processing takes 5 time units (a 10x speed improvement over serial processing & a 2x improvement over Scenario 1). However, here too compute resources are wasted: 14 cores x 5 time units = 70 ctu. The efficient solution would be to set cores = chains
Scenario 3: Free-for-all, wherein each core takes on a pending imputation/chain combination when it becomes available until all are processed.
Here, the sampling begins by allocating all 24 cores, each one to one of the 50 pending chains. After they finish their iterations, a second batch of 24 chains is processed, bringing the total chains processed to 48. But now there are only two chains pending and 22 cores sit idle for 1 time unit. The total processing time is 3 time units, and the wasted compute resource is 22 ctu. The efficient solution would be to set cores to a multiple of m x chains.
Minimal reproducible example
This code compares the compute time using an example modified from a brms vignette. Here we'll set m = 10, chains = 6, and cores = 4. This makes for a total of 60 chains to be processed. Under these conditions, I would expect speed improvement (vs. serial processing) is as follows*:
Scenario 1: 60/(6 chains x ceiling(10 m / 4 cores)) = 3.3x
Scenario 2: 60/(ceiling(6 chains / 4 cores) x 10 m) = 3.0x
Scenario 3: 60/ceiling((6 chains x 10 m) / 4 cores) = 4.0x
*(ceiling/rounding up is used because chains cannot be subdivided among cores)
library(brms)
library(mice)
library(tictoc) # convenience functions for timing
# Load data
data("nhanes", package = "mice")
# There are 10 imputations x 6 chains = 60 total chains to be processed
imp <- mice(nhanes, m = 10, print = FALSE, seed = 234023)
# Fit the model first to get compilation out of the way
fit_base <- brm_multiple(bmi ~ age*chl, data = imp, chains = 6,
iter = 10000, warmup = 2000)
# Use update() function to avoid re-compiling time
# Serial processing (127 sec on my machine)
tic() # start timing
fit_serial <- update(fit_base, .~., cores = 1L)
t_serial <- toc() # stop timing
t_serial <- diff(unlist(t_serial)[1:2]) # calculate seconds elapsed
# Parallel processing with 3 cores (82 sec)
tic()
fit_parallel <- update(fit_base, .~., cores = 4L)
t_parallel <- toc()
t_parallel <- diff(unlist(t_parallel)[1:2]) # calculate seconds elapsed
# Calculate speed up ratio
t_serial/t_parallel # 1.5x
Clearly I am missing something. I can't distinguish between the scenarios with this approach.

Confidence Interval for Non Linear Regression Model

My data consist of two columns: time and cumulative number like below:
time <- c(1:14)
cum.num <- c(20, 45, 99, 195, 301, 407, 501, 582, 679, 753, 790, 861, 1011, 1441)
My non linear function is:
c1*cos(0.6731984259*time)+c2*sin(0.6731984259*time)+c3*(time)^2+c4*time+c5
My objective is to model this function using non linear regression using nls() in R and to compute the confidence interval. I have donr the following:
m1.fit<-nls(cum.vul~c1*cos(0.6731984259*time)+c2*sin(0.6731984259*time)+c3*(time)^2+c4*time+c5,start=list(c1=-50,c2=-60,c3=5,c4=8,c5=100))
I got an error while computing confidence interval, i have tried the following:
confint(m1.fit)
Once i issued this command got the following error:
Waiting for profiling to be done...
Error in prof$getProfile() :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
Can anyone help me in this regard?

Try package nlstools:
> nlstools::confint2(m1.fit)
2.5 % 97.5 %
c1 -48.556270 54.959689
c2 -175.654079 -45.216965
c3 3.285062 9.529072
c4 -49.254627 46.007629
c5 -34.135835 272.864743`

Why does foreach %dopar% get slower with each additional node?

I wrote a simple matrix multiplication to test out multithreading/parallelization capabilities of my network and I noticed that the computation was much slower than expected.
The Test is simple : multiply 2 matrices (4096x4096) and return the computation time. Neither the matrices nor results are stored. The computation time is not trivial (50-90secs depending on your processor).
The Conditions : I repeated this computation 10 times using 1 processor, split these 10 computations to 2 processors (5 each), then 3 processors, ... up to 10 processors (1 computation to each processor). I expected the total computation time to decrease in stages, and i expected 10 processors to complete the computations 10 times as fast as it takes one processor to do the same.
The Results : Instead what i got was only a 2 fold reduction in computation time which is 5 times SLOWER than expected.
When i computed the average computation time per node, i expected each processor to compute the test in the same amount of time (on average) regardless of the number of processors assigned. I was surprised to see that merely sending the same operation to multiple processor was slowing down the average computation time of each processor.
Can anyone explain why this is happening?
Note this is question is NOT a duplicate of these questions:
foreach %dopar% slower than for loop
or
Why is the parallel package slower than just using apply?
Because the test computation is not trivial (ie 50-90secs not 1-2secs), and because there is no communication between processors that i can see (i.e. no results are returned or stored other than the computation time).
I have attached the scripts and functions bellow for replication.
library(foreach); library(doParallel);library(data.table)
# functions adapted from
# http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/BLAS_Testing.html
Matrix.Multiplier <- function(Dimensions=2^12){
# Creates a matrix of dim=Dimensions and runs multiplication
#Dimensions=2^12
m1 <- Dimensions; m2 <- Dimensions; n <- Dimensions;
z1 <- runif(m1*n); dim(z1) = c(m1,n)
z2 <- runif(m2*n); dim(z2) = c(m2,n)
a <- proc.time()[3]
z3 <- z1 %*% t(z2)
b <- proc.time()[3]
c <- b-a
names(c) <- NULL
rm(z1,z2,z3,m1,m2,n,a,b);gc()
return(c)
}
Nodes <- 10
Results <- NULL
for(i in 1:Nodes){
cl <- makeCluster(i)
registerDoParallel(cl)
ptm <- proc.time()[3]
i.Node.times <- foreach(z=1:Nodes,.combine="c",.multicombine=TRUE,
.inorder=FALSE) %dopar% {
t <- Matrix.Multiplier(Dimensions=2^12)
}
etm <- proc.time()[3]
i.TotalTime <- etm-ptm
i.Times <- cbind(Operations=Nodes,Node.No=i,Avr.Node.Time=mean(i.Node.times),
sd.Node.Time=sd(i.Node.times),
Total.Time=i.TotalTime)
Results <- rbind(Results,i.Times)
rm(ptm,etm,i.Node.times,i.TotalTime,i.Times)
stopCluster(cl)
}
library(data.table)
Results <- data.table(Results)
Results[,lower:=Avr.Node.Time-1.96*sd.Node.Time]
Results[,upper:=Avr.Node.Time+1.96*sd.Node.Time]
Exp.Total <- c(Results[Node.No==1][,Avr.Node.Time]*10,
Results[Node.No==1][,Avr.Node.Time]*5,
Results[Node.No==1][,Avr.Node.Time]*4,
Results[Node.No==1][,Avr.Node.Time]*3,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*2,
Results[Node.No==1][,Avr.Node.Time]*1)
Results[,Exp.Total.Time:=Exp.Total]
jpeg("Multithread_Test_TotalTime_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Total.Time], type="o", xlab="", ylab="",ylim=c(80,900),
col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Time to Complete 10 Multiplications", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
ylab="Total Computation Time (secs)")
axis(2, at=seq(80, 900, by=100), tick=TRUE, labels=FALSE)
axis(2, at=seq(80, 900, by=100), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
lines(x=Results[,Node.No],y=Results[,Exp.Total.Time], type="o",col="red")
legend('topright','groups',
legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
col=c("blue","red"))
dev.off()
jpeg("Multithread_Test_PerNode_Results.jpeg")
par(oma=c(0,0,0,0)) # set outer margin to zero
par(mar=c(3.5,3.5,2.5,1.5)) # number of lines per margin (bottom,left,top,right)
plot(x=Results[,Node.No],y=Results[,Avr.Node.Time], type="o", xlab="", ylab="",
ylim=c(50,500),col="blue",xaxt="n", yaxt="n", bty="l")
title(main="Per Node Multiplication Time", line=0,cex.lab=3)
title(xlab="Nodes",line=2,cex.lab=1.2,
ylab="Computation Time (secs) per Node")
axis(2, at=seq(50,500, by=50), tick=TRUE, labels=FALSE)
axis(2, at=seq(50,500, by=50), tick=FALSE, labels=TRUE, line=-0.5)
axis(1, at=Results[,Node.No], tick=TRUE, labels=FALSE)
axis(1, at=Results[,Node.No], tick=FALSE, labels=TRUE, line=-0.5)
abline(h=Results[Node.No==1][,Avr.Node.Time], col="red")
epsilon = 0.2
segments(Results[,Node.No],Results[,lower],Results[,Node.No],Results[,upper])
segments(Results[,Node.No]-epsilon,Results[,upper],
Results[,Node.No]+epsilon,Results[,upper])
segments(Results[,Node.No]-epsilon, Results[,lower],
Results[,Node.No]+epsilon,Results[,lower])
legend('topleft','groups',
legend=c("Measured", "Expected"), bty="n",lty=c(1,1),
col=c("blue","red"))
dev.off()
EDIT : Response #Hong Ooi's comment
I used lscpu in UNIX to get;
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 30
On-line CPU(s) list: 0-29
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 30
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 # 2.40GHz
Stepping: 2
CPU MHz: 2394.455
BogoMIPS: 4788.91
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-29
EDIT : Response to #Steve Weston's comment.
I am using a virtual machine network (but I'm not the admin) with access to up to 30 clusters. I ran the test you suggested. Opened up 5 R sessions and ran the matrix multiplication on 1,2...5 simultaneously (or as quickly as i could tab over and execute). Got very similar results to before (re: each additional process slows down all individual sessions). Note i checked memory usage using top and htop and the usage never exceeded 5% of the network capacity (~2.5/64Gb).
CONCLUSIONS:
The problem seems to be R specific. When i run other multi-threaded commands with other software (e.g. PLINK) i don't run into this problem and parallel process run as expected. I have also tried running the above with Rmpi and doMPI with same (slower) results. The problem appears to be related R sessions/parallelized commands on virtual machine network. What i really need help on is how to pinpoint the problem. Similar problem seems to be pointed out here

I find the per-node multiplication time very interesting because the timings don't include any of the overhead associated with the parallel loop, but only the time to perform the matrix multiplication, and they show that the time increases with the number of matrix multiplications executing in parallel on the same machine.
I can think of two reasons why that might happen:
The memory bandwidth of the machine is saturated by the matrix multiplications before you run out of cores;
The matrix multiplication is multi-threaded.
You can test for the first situation by starting multiple R sessions (I did this in multiple terminals), creating two matrices in each session:
> x <- matrix(rnorm(4096*4096), 4096)
> y <- matrix(rnorm(4096*4096), 4096)
and then executing a matrix multiplication in each of those sessions at about the same time:
> system.time(z <- x %*% t(y))
Ideally, this time will be the same regardless of the number of R sessions you use (up to the number of cores), but since matrix multiplication is a rather memory intensive operation, many machines will run out of memory bandwidth before they run out of cores, causing the times to increase.
If your R installation was built with a multi-threaded math library, such as MKL or ATLAS, then you could be using all of your cores with a single matrix multiplication, so you can't expect better performance by using multiple processes unless you use multiple computers.
You can use a tool such as "top" to see if you're using a multi-threaded math library.
Finally, the output from lscpu suggests that you're using a virtual machine. I've never done any performance testing on multi-core virtual machines, but that could also be a source of problems.
Update
I believe the reason that your parallel matrix multiplications run more slowly than a single matrix multiplication is that your CPU isn't able to read memory fast enough to feed more than about two cores at full speed, which I referred to as saturating your memory bandwidth. If your CPU had large enough caches, you might be able to avoid this problem, but it doesn't really have anything to do with the amount of memory that you have on your motherboard.
I think this is just a limitation of using a single computer for parallel computations. One of the advantages of using a cluster is that your memory bandwidth goes up as well as your total aggregate memory. So if you ran one or two matrix multiplications on each node of a multi-node parallel program, you wouldn't run into this particular problem.
Assuming you don't have access to a cluster, you could try benchmarking a multi-threaded math library such as MKL or ATLAS on your computer. It's very possible that you could get better performance running one multi-threaded matrix multiply than running them in parallel in multiple processes. But be careful when using both a multi-threaded math library and a parallel programming package.
You could also try using a GPU. They're obviously good at performing matrix multiplications.
Update 2
To see if the problem is R specific, I suggest that you benchmark the dgemm function, which is the BLAS function used by R to implement matrix multiplication.
Here's a simple Fortran program to benchmark dgemm. I suggest executing it from multiple terminals in the same way that I described for benchmarking %*% in R:
program main
implicit none
integer n, i, j
integer*8 stime, etime
parameter (n=4096)
double precision a(n,n), b(n,n), c(n,n)
do i = 1, n
do j = 1, n
a(i,j) = (i-1) * n + j
b(i,j) = -((i-1) * n + j)
c(i,j) = 0.0d0
end do
end do
stime = time8()
call dgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
etime = time8()
print *, etime - stime
end
On my Linux machine, one instance runs in 82 seconds, while four instances run in 116 seconds. This is consistent with the results that I see in R and with my guess that this is a memory bandwidth problem.
You can also link this against different BLAS libraries to see which implementation works better on your machine.
You might also get some useful information about the memory bandwidth of your virtual machine network using pmbw - Parallel Memory Bandwidth Benchmark, although I've never used it.

I think the obvious answer here is the correct one. Matrix multiplication is not embarrassingly parallel. And you do not appear to have modified the serial multiplication code to parallelize it.
Instead, you are multiplying two matrices. Since the multiplication of each matrix is likely being handled by only a single core, every core in excess of two is simply idle overhead. The result is that you only see a speed improvement of 2x.
You could test this by running more than 2 matrix multiplications. But I'm not familiar with the foreach, doParallel framework (I use parallel framework) nor do I see where in your code to modify this to test it.
An alternative test is to do a parallelized version of matrix multiplication, which I borrow directly from Matloff's Parallel Computing for Data Science. Draft available here, see page 27
mmulthread <- function(u, v, w) {
require(parallel)
# determine which rows for this thread
myidxs <- splitIndices(nrow(u), myinfo$nwrkrs ) [[ myinfo$id ]]
# compute this thread's portion of the result
w[myidxs, ] <- u [myidxs, ] %*% v [ , ]
0 # dont return result -- expensive
}
# t e s t on snow c l u s t e r c l s
test <- function (cls, n = 2^5) {
# i n i t Rdsm
mgrinit(cls)
# shared variables
mgrmakevar(cls, "a", n, n)
mgrmakevar(cls, "b", n, n)
mgrmakevar(cls, "c", n, n)
# f i l l i n some t e s t data
a [ , ] <- 1:n
b [ , ] <- rep (1 ,n)
# export function
clusterExport(cls , "mmulthread" )
# run function
clusterEvalQ(cls , mmulthread (a ,b ,c ))
#print ( c[ , ] ) # not p ri n t ( c ) !
}
library(parallel)
library(Rdsm)
c1 <- makeCluster(1)
c2 <- makeCluster (2)
c4 <- makeCluster(4)
c8 <- makeCluster(8)
library(microbenchmark)
microbenchmark(node1= test(c1, n= 2^10),
node2= test(c2, n= 2^10),
node4= test(c4, n= 2^10),
node8= test(c8, n= 2^10))
Unit: milliseconds
expr min lq mean median uq max neval cld
node1 715.8722 780.9861 818.0487 817.6826 847.5353 922.9746 100 d
node2 404.9928 422.9330 450.9016 437.5942 458.9213 589.1708 100 c
node4 255.3105 285.8409 309.5924 303.6403 320.8424 481.6833 100 a
node8 304.6386 328.6318 365.5114 343.0939 373.8573 836.2771 100 b
As expected, by parallelizing the matrix multiplication, we do see the spend improvement we wanted, although parallel overhead is clearly extensive.

calculate throughput for 1G link

I want to calculate the max network throughput on 1G Ethernet link. I understand how to estimate max rate in packets/sec units for 64-bytes frame:
IFG 12 bytes
MAC Preamble 8 bytes
MAC DA 6 bytes
MAC SA 6 bytes
MAC type 2 bytes
Payload 46 bytes
FCS 4 bytes
Total Frame size -> 84 bytes
Now for 1G link we get:
1,000,000,000 bits/sec * 8 bits/byte => 1,488,096 fps
As I understand, this is a data link performance, correct?
But how to calculate throughput in megabits per second for different packets size, i.e. 64,128...1518? Also, how to calculate UDP/TCP throughput, since I have to consider headers overhead.
Thanks.

Max throughput over Ethernet = (Payload_size / (Payload_size + 38)) * Link bitrate
I.e. if you send 50 bytes of payload data, max throughput would be (50 / 88) * 1,000,000,000 for a 1G link, or about 568 Mbit/s. If you send 1000 bytes of payload, max throughput is (1000/1038) * 1,000,000,000 = 963 Mbit/s.
IP+UDP adds 28 bytes of headers, so if you're looking for data throughput over UDP, you should use this formula:
Max throughput over UDP = (Payload_size / (Payload_size + 66)) * Link bitrate
And IP+TCP adds 40 bytes of headers, so that would be:
Max throughput over TCP = (Payload_size / (Payload_size + 78)) * Link bitrate
Note that these are optimistic calculations. I.e. in reality, you might have extra options in the header data that increases the size of the headers, lowering payload throughput. You could also have packet loss that causes performance to drop.
Check out the Wikipedia article on the ethernet frame, and particularly the "Maximum throughput" section:
http://en.wikipedia.org/wiki/Ethernet_frame

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex