Spark R 2.0 dapply very slow

I just started testing Spark R 2.0, and find the execution of dapply very slow.
For example, the following code
set.seed(2)
random_DF<-data.frame(matrix(rnorm(1000000),100000,10))
system.time(dummy_res<-random_DF[random_DF[,1]>1,])
user system elapsed
0.005 0.000 0.006
is executed in about 6 ms.
Now, if I create a Spark DataFrame with 4 partitions and run it on 4 cores, I get:
sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF),4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark<-collect(subset_DF_Spark))
user system elapsed
2.003 0.119 62.919
That is about one minute, which seems abnormally slow. Am I missing something?
I also get a warning (TaskSetManager: Stage 64 contains a task of very large size (16411 KB). The maximum recommended task size is 100 KB.). Why is this 100 KB limit so low?
I am using R 3.3.0 on Mac OS 10.10.5
Any insight welcome!
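For context, here is a sketch I am adding (not part of the original question): dapply() ships every partition to an external R worker process and back, so for a simple filter the serialization round trip dominates the runtime. The same subset can be expressed with SparkR's built-in filter(), which stays inside the JVM; this assumes the default X1..X10 column names that data.frame(matrix(...)) produces:
# Hedged sketch: the same subset via SparkR's Column API instead of dapply()
subset_DF_Spark2 <- filter(random_DF_Spark, random_DF_Spark$X1 > 1)
system.time(dummy_res_Spark2 <- collect(subset_DF_Spark2))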

Related

How to get current percent of CPU usage in R?

How can I get the current percent CPU usage in R? Ideally, it would work for both Unix and Windows platforms.
On the Windows platform, I used the following code:
a <- system("wmic cpu get loadpercentage", intern = TRUE)
as.numeric(gsub("\\D", "", a[2]))
Is there a better way (or a function in a package) to get the current CPU usage that works on both Unix and Windows platforms?
According to how to get current cpu and ram usage in python? and the "reticulate" package:
library(reticulate)
aa<-reticulate::import("psutil")
aa$cpu_percent()
The function returns the current percent of CPU usage (6% in use when I ran it; screenshot omitted).
But this approach requires Python to be installed on the platform.
The question
Is there an R function to retrieve CPU and RAM information?
asks for hardware information (not the current percent of CPU usage), so it is not what I am asking. Its answers look like this:
> system("lscpu | grep 'Model name:'")
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
> system("lsmem | grep 'Total online memory'")
Total online memory: 16G
> library(benchmarkme)
> get_cpu()
$vendor_id
[1] "GenuineIntel"
$model_name
[1] "Intel(R) Core(TM) i5-7400 CPU # 3.00GHz"
$no_of_cores
[1] 4
> get_ram()
34.3 GB
So the answer to that question (the two functions get_ram() and get_cpu()) returns the total available RAM and CPU information, not the current percent usage of RAM and CPU. That is, get_ram() returns 32 GB, not the 6 percent currently in use.
I also think the accepted answer to R: how to check how many cores/CPU usage available does not calculate the current percent of CPU usage.
On the Windows platform (from that accepted answer):
a <- system("wmic path Win32_PerfFormattedData_PerfProc_Process get Name,PercentProcessorTime", intern = TRUE)
df <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""];data.frame(process = x[1], cpu = x[2])}))
df[grepl("Rgui|rstudio", df$process),]
# process cpu
# 105 Rgui 0
# 108 rstudio 0
The resulting data.frame df lists one row per running process (screenshot omitted).
I cannot find any way to calculate the current percent of CPU usage from that answer. Perhaps I misunderstood something; if so, please explain in a comment how that answer gives the current percent of CPU usage.
I tried to extract the current percent CPU usage based on R: how to check how many cores/CPU usage available. When I look at the result of
df <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""]; data.frame(process = x[1], cpu = x[2])}))
I find two rows, Idle and _Total, as follows (using dplyr):
library(dplyr)
df1 <- df %>% filter(process %in% c("Idle","_Total"))
df1
So 1 - Idle/_Total should be the current percent of CPU usage. I calculate it as follows:
for (i in 1:1000) {
  a <- system("wmic path Win32_PerfFormattedData_PerfProc_Process get Name,PercentProcessorTime", intern = TRUE)
  df1 <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""]; data.frame(process = x[1], cpu = x[2])}))
  df1 <- df1 %>% filter(process %in% c("Idle","_Total"))
  df1 <- df1 %>% mutate(cpu = as.numeric(cpu))
  Idle  <- df1 %>% filter(process == "Idle")
  Total <- df1 %>% filter(process == "_Total")
  message(1 - Idle$cpu/Total$cpu)
}
and the result (a long stream of values; screenshot omitted) makes no sense.
When I look at the Python approach, which is what an answer should look like, it calculates the current CPU usage easily.
First, install the psutil module from PowerShell:
pip install psutil
and then use it in R as follows:
> library(reticulate)
> aa<-reticulate::import("psutil")
> aa$cpu_percent()
[1] 9.2
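For reference, a minimal pure-R sketch along the lines discussed above (my addition, not from the thread): the Windows branch reuses the wmic call from the question, and the non-Windows branch assumes a Linux-style /proc/stat, so it will not work as-is on macOS.
cpu_percent <- function() {
  if (.Platform$OS.type == "windows") {
    # Same wmic query as above; the second line of output holds the load percentage
    out <- system("wmic cpu get loadpercentage", intern = TRUE)
    as.numeric(gsub("\\D", "", out[2]))
  } else {
    # Linux only: sample the aggregate CPU counters in /proc/stat twice and diff them
    read_stat <- function()
      as.numeric(strsplit(readLines("/proc/stat", n = 1), "\\s+")[[1]][-1])
    s1 <- read_stat(); Sys.sleep(0.5); s2 <- read_stat()
    d <- s2 - s1
    idle <- d[4] + d[5]                  # idle + iowait fields
    round(100 * (1 - idle / sum(d)), 1)
  }
}
cpu_percent()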

How to build a table/tibble/df for parallel execution time for multiple sources?

Suppose I have a set of scripts, each of which prints 'hello world' and its execution time to the console:
t0<- Sys.time()
print('hello world')
t1<- Sys.time()
time.taken<- round(as.numeric(t1-t0), 2)
time.taken
cat(paste("\n[script1] -","Execution time: ", time.taken, "seconds"))
Now I am trying to get the execution time of multiple sources:
source("modules/script1.R")
source("modules/script2.R")
source("modules/script3.R")
source("modules/script4.R")
source("modules/script5.R")
source("modules/script6.R")
However, this prints the execution time of each script separately in the console, not into the environment.
It would be great to get something like this inside the R environment:
#script execution time
#1 2 seconds
#2 2 seconds
#3 2 seconds
#4 2 seconds
#5 2 seconds
#6 2 seconds
Is this even possible?
What resources are available for this kind of processing?
You could use parLapply (on Windows) or mclapply (on other platforms) from the parallel package to run the scripts in parallel.
The last value calculated in a script is returned by the source function.
If you make sure the last value of each script is its execution time, parLapply will return those times as a list:
library(parallel)
l <- 1:6
# Create the script files
createscript <- function(numscript) {
  cat('
t0 <- Sys.time()
print("hello world")
Sys.sleep(runif(1))
t1 <- Sys.time()
time.taken <- round(as.numeric(t1 - t0), 2)
print(paste("[script1] -", "Execution time: ", time.taken, "seconds"))
data.frame(time.taken)'
  , file = paste0("Script", numscript, ".R"), append = FALSE)
  TRUE
}
lapply(l, createscript)
# Scripts list
scripts <- paste0("Script", l, ".R")
# Run scripts in parallel and keep the last value of each (the timing data frame)
cl <- makeCluster(getOption("cl.cores", detectCores() - 1))
result <- parLapply(cl, scripts, function(script) { source(script)$value })
stopCluster(cl)
do.call(rbind, result)
  time.taken
1       0.87
2       0.51
3       0.61
4       0.38
5       0.37
6       0.91
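If the parallelism is not essential, a simpler sequential sketch (my addition, not part of the answer above) is to time each source() call from the outside and collect the results, so the scripts do not have to return their own timings:
# Time each script externally and build a data frame of elapsed times
scripts <- paste0("Script", 1:6, ".R")
elapsed <- vapply(scripts, function(s) unname(system.time(source(s))["elapsed"]), numeric(1))
data.frame(script = seq_along(scripts), elapsed = elapsed)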

Copying files over the network is MUCH slower with `file.copy` than `system(mv ...)`

I have been having some issues with R becoming very sluggish when accessing files over our corporate network. So I dropped back and did some testing, and I was shocked to discover that the R file.copy() command is much slower than the equivalent file copy using system(mv ...). Is this a known issue, or am I doing something wrong here?
Here's my test:
I have three files:
large_random.txt - ~100 MB
medium_random.txt - ~10 MB
small_random.txt - ~1 MB
I created these on my Mac like so:
dd if=/dev/urandom of=small_random.txt bs=1048576 count=1
dd if=/dev/urandom of=medium_random.txt bs=1048576 count=10
dd if=/dev/urandom of=large_random.txt bs=1048576 count=100
But the following R tests were all done using Windows running in a virtual machine. The J: drive is local and the N: drive is 700 miles (1100 km) away.
library(tictoc)
test_copy <- function(source, des){
  tic('r file.copy')
  file.remove(des)
  file.copy(source, des)
  toc()
  tic('system call')
  system(paste('rm', des, sep = ' '))
  system(paste('cp', source, des, sep = ' '))
  toc()
}
source <- 'J:\\tidy_examples\\dummyfiles\\small_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\small_random.txt'
test_copy(source, des)
source <- 'J:\\tidy_examples\\dummyfiles\\medium_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\medium_random.txt'
test_copy(source, des)
source <- 'J:\\tidy_examples\\dummyfiles\\large_random.txt'
des <- 'N:\\JAL\\2018\\_temp\\large_random.txt'
test_copy(source, des)
Which results in the following:
> source <- 'J:\\tidy_examples\\dummyfiles\\small_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\small_random.txt'
> test_copy(source, des)
r file.copy: 6.49 sec elapsed
system call: 2.12 sec elapsed
>
> source <- 'J:\\tidy_examples\\dummyfiles\\medium_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\medium_random.txt'
> test_copy(source, des)
r file.copy: 56.86 sec elapsed
system call: 4.65 sec elapsed
>
> source <- 'J:\\tidy_examples\\dummyfiles\\large_random.txt'
> des <- 'N:\\JAL\\2018\\_temp\\large_random.txt'
> test_copy(source, des)
r file.copy: 562.94 sec elapsed
system call: 31.01 sec elapsed
>
So what's going on that makes the system call so much faster? At the large file size it's more than 18 times slower!
I ran into the same problem with poor file.copy performance over network share drives. My solution was to use fs::file_copy() instead, which performed even slightly better than the direct system call to copy.
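For illustration, a minimal sketch of that approach (my addition; it assumes the fs package is installed and reuses the hypothetical drive paths and tictoc timing from the test above):
library(fs)
library(tictoc)
source <- 'J:\\tidy_examples\\dummyfiles\\large_random.txt'
des    <- 'N:\\JAL\\2018\\_temp\\large_random.txt'
tic('fs::file_copy')
file_copy(source, des, overwrite = TRUE)  # fs-based copy, reported above to be at least as fast as the system call
toc()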

A faster way to generate a vector of UUIDs in R

The code below takes about 15 seconds to generate a vector of 10k UUIDs. I will need to generate 1M or more and I calculate that this will take 15 * 10 * 10 / 60 minutes, or about 25 minutes. Is there a faster way to achieve this?
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), UUIDgenerate )
end_time <- Sys.time()
end_time - start_time
# Time difference of 15.072 secs
Essentially, I'm searching for a method for R that manages to achieve the performance boost described here for Java: Performance of Random UUID generation with Java 7 or Java 6
They should be RFC 4122 compliant but the other requirements are flexible.
Bottom line up front: no, there is currently no way to speed up generation of a lot of UUIDs with the uuid package without compromising the core premise of uniqueness.
In fact, your suggestion to use use.time=FALSE has significantly bad ramifications on Windows. See below.
It is possible to get faster performance at scale, just not with uuid. See below.
uuid on Windows
Performance of uuid::UUIDgenerate should take into account the OS, and more specifically its source of randomness. Looking first at raw performance:
library(microbenchmark)
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 8.675561 9.330877 11.73299 10.14592 11.75467 66.2435 100
# rt 89.446158 90.003196 91.53226 90.94095 91.13806 136.9411 100
# sf 8.570900 9.270524 11.28199 10.22779 12.06993 24.3583 100
# st 89.359366 90.189178 91.73793 90.95426 91.89822 137.4713 100
... so using use.time=FALSE is always faster. (I included the sapply examples for comparison with your answer's code, to show that replicate is never slower. Use replicate here unless you feel you need the numeric argument for some reason.)
However, there is a problem:
R.version[1:3]
# _
# platform x86_64-w64-mingw32
# arch x86_64
# os mingw32
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 20
Given that a UUID is intended to be unique each time it is generated, this is disturbing, and it is a symptom of insufficient randomness on Windows. (Does WSL provide a way out of this? Another research opportunity ...)
uuid on Linux
For comparison, the same results on a non-windows platform:
microbenchmark(
rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rf 20.852227 21.48981 24.90932 22.30334 25.11449 74.20972 100
# rt 9.782106 11.03714 14.15256 12.04848 15.41695 100.83724 100
# sf 20.250873 21.39140 24.67585 22.44717 27.51227 44.43504 100
# st 9.852275 11.15936 13.34731 12.11374 15.03694 27.79595 100
R.version[1:3]
# _
# platform x86_64-pc-linux-gnu
# arch x86_64
# os linux-gnu
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 1000
(I'm slightly intrigued by the fact that use.time=FALSE on Linux takes twice as long as on Windows ...)
UUID generation with a SQL server
If you have access to a SQL server (you almost certainly do ... see SQLite ...), then you can deal with this scale problem by employing the server's implementation of UUID generation, recognizing that there are some slight differences.
(Side note: there are "V4" (completely random), "V1" (time-based), and "V1mc" (time-based and includes the system's mac address) UUIDs. uuid gives V4 if use.time=FALSE and V1 otherwise, encoding the system's mac address.)
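As a quick check I am adding here (not part of the original answer): the UUID version is the first character of the third group, i.e. character 15 of the string, so you can confirm which kind uuid gives you:
substr(uuid::UUIDgenerate(use.time = TRUE), 15, 15)   # "1" (time-based V1)
substr(uuid::UUIDgenerate(use.time = FALSE), 15, 15)  # "4" (random V4)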
Some performance comparisons on Windows (all times in seconds):
# n uuid postgres sqlite sqlserver
# 1 100 0 1.23 1.13 0.84
# 2 1000 0.05 1.13 1.21 1.08
# 3 10000 0.47 1.35 1.45 1.17
# 4 100000 5.39 3.10 3.50 2.68
# 5 1000000 63.48 16.61 17.47 16.31
The use of SQL has some overhead that does not take long to overcome when done at scale.
PostgreSQL needs the uuid-ossp extension, installable with
CREATE EXTENSION "uuid-ossp"
Once installed/available, you can generate n UUIDs with:
n <- 3
pgcon <- DBI::dbConnect(...)
DBI::dbGetQuery(pgcon, sprintf("select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n))
# uuid
# 1 53cd17c6-3c21-11e8-b2bf-7bab2a3c8486
# 2 53cd187a-3c21-11e8-b2bf-dfe12d92673e
# 3 53cd18f2-3c21-11e8-b2bf-d3c64c6ad73f
Other UUID functions exist; see https://www.postgresql.org/docs/9.6/static/uuid-ossp.html.
SQLite includes limited ability to do it, but this hack works well enough for a V4-style UUID (length n):
sqlitecon <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # or your own
DBI::dbGetQuery(sqlitecon, sprintf("
WITH RECURSIVE cnt(x) as (
select 1 union all select x+1 from cnt limit %d
)
select (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) as uuid
from cnt", n))
# uuid
# 1 EE6B08DA-2991-BF82-55DD-78FEA48ABF43
# 2 C195AAA4-67FC-A1C0-6675-E4C5C74E99E2
# 3 EAC159D6-7986-F42C-C5F5-35764544C105
It takes a little extra work to format the output the same way, a nicety at best; you might find small performance improvements by not clinging to this format.
SQL Server requires temporarily creating a table (with newsequentialid()), generating a sequence into it, pulling the automatically-generated IDs, and discarding the table. A bit over-the-top, especially considering the ease of using SQLite for it, but YMMV. (No code offered, it doesn't add much.)
Other considerations
In addition to execution time and sufficient randomness, there are various discussions (uncited for now) about database tables that suggest performance impacts from using non-consecutive UUIDs as keys. This has to do with index pages and such, and is outside the scope of this answer.
However, assuming this is true, and assuming that rows inserted at around the same time (temporally correlated) are often grouped together, it is a good thing to keep same-day data with UUID keys in the same DB index page, so V4 (completely random) UUIDs may decrease DB performance with large groups (and large tables). For this reason, I personally prefer V1 over V4.
Other (still uncited) discussions consider including a directly-traceable MAC address in the UUID to be a slight breach of internal information. For this reason, I personally lean towards V1mc over V1.
(But I don't yet have a way to do this well with RSQLite, so I'm reliant on having postgresql nearby. Fortunately, I use postgresql enough for other things that I keep an instance around with docker on windows.)
Providing the option use.time will significantly speed up the process. It can be set to either TRUE or FALSE, to determine if the UUIDs are time-based or not. In both cases, it will be significantly faster than not specifying this option.
For 10k UUIDs,
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01399994 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01100016 secs
Even scaling up to 100M still gives a faster run time than the original 15 seconds.
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 1.154 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 3.7586 secs
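As a side note I am adding (not part of either answer): recent versions of the uuid package accept an n argument to UUIDgenerate(), which generates the whole vector in one call and avoids the per-call sapply overhead entirely; check ?UUIDgenerate to see whether your installed version supports it.
library(uuid)
# Assumes a uuid version whose UUIDgenerate() accepts `n`; older versions generate one UUID per call
system.time(ids <- UUIDgenerate(use.time = FALSE, n = 1e6))
length(unique(ids))  # should be 1e6 on a platform with a good randomness source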

R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code, and it appears to be slowing down as iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
library(RColorBrewer)
library(fields)

StreamFlux <- function(data, NoR, NTS){
  ### Read in data to display points ###
  WLX = c(8,19,29,20,13,20,21)
  WLY = c(25,28,25,21,17,14,12)
  WLY = 34 - WLY
  WLX = WLX / 44
  WLY = WLY / 33
  timedata = NULL
  mf <- function(i){
    b = (NoR + 8) * (i - 1) + 8
    ### I read in data one section at a time to avoid headers
    mydata = read.table(data, skip = b, nrows = NoR, header = FALSE)
    rows = 34 - mydata[,2]
    cols = 45 - mydata[,3]
    flows = mydata[,7]
    rows = as.numeric(rows)
    cols = as.numeric(cols)
    rm(mydata)
    ### Create flux matrix ###
    flow_mat <- matrix(0, 44, 33)
    ### Populate matrix ###
    flow_mat[(rows - 1) * 44 + (45 - cols)] <- flows + flow_mat[(rows - 1) * 44 + (45 - cols)]
    flow_mat[flow_mat == 0] <- NA
    rm(flows)
    rm(rows)
    rm(cols)
    timestep = i
    ### Specify jpeg info ###
    jpeg(paste("Steamflow", timestep, ".jpg", sep = ''),
         width = 640, height = 441, quality = 75, bg = "grey")
    image.plot(flow_mat, zlim = c(-1,1),
               col = brewer.pal(11, "RdBu"), yaxt = "n",
               xaxt = "n", main = paste("Stress Period ", timestep, sep = ""))
    points(WLX, WLY)
    dev.off()
    rm(flow_mat)
  }
  ST <- function(x){
    functiontime = system.time(mf(x))
    print(functiontime)
  }
  lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up running this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a clumsy solution. I've read about dlply from the plyr package; it seems to have worked here, but would that help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
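To make that concrete, here is a minimal sketch of the read-once idea (my addition; it assumes the same 8-line header before each NoR-row block that the original skip = b arithmetic implies):
# Read the whole file into memory once, then slice out each timestep's block
all_lines <- readLines("stream_out.txt")
NoR <- 687
get_block <- function(i) {
  b <- (NoR + 8) * (i - 1) + 8                     # same offset as the original skip = b
  read.table(text = all_lines[(b + 1):(b + NoR)], header = FALSE)
}
mydata <- get_block(1)                              # replaces read.table(data, skip = b, ...) inside mf()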
