R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code, and it appears to be slowing down as the iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
library(RColorBrewer)  # brewer.pal
library(fields)        # image.plot

StreamFlux <- function(data, NoR, NTS) {
  ### Read in data to display points ###
  WLX <- c(8, 19, 29, 20, 13, 20, 21)
  WLY <- c(25, 28, 25, 21, 17, 14, 12)
  WLY <- 34 - WLY
  WLX <- WLX / 44
  WLY <- WLY / 33
  timedata <- NULL

  mf <- function(i) {
    b <- (NoR + 8) * (i - 1) + 8
    ### I read in data one section at a time to avoid headers
    mydata <- read.table(data, skip = b, nrows = NoR, header = FALSE)
    rows  <- as.numeric(34 - mydata[, 2])
    cols  <- as.numeric(45 - mydata[, 3])
    flows <- mydata[, 7]
    rm(mydata)

    ### Create flux matrix ###
    flow_mat <- matrix(0, 44, 33)
    ### Populate matrix ###
    flow_mat[(rows - 1) * 44 + (45 - cols)] <-
      flows + flow_mat[(rows - 1) * 44 + (45 - cols)]
    flow_mat[flow_mat == 0] <- NA
    rm(flows, rows, cols)

    timestep <- i

    ### Specifying jpeg info ###
    jpeg(paste("Steamflow", timestep, ".jpg", sep = ""),
         width = 640, height = 441, quality = 75, bg = "grey")
    image.plot(flow_mat, zlim = c(-1, 1),
               col = brewer.pal(11, "RdBu"), yaxt = "n", xaxt = "n",
               main = paste("Stress Period ", timestep, sep = ""))
    points(WLX, WLY)
    dev.off()
    rm(flow_mat)
  }

  ST <- function(x) {
    functiontime <- system.time(mf(x))
    print(functiontime)
  }

  lapply(1:NTS, ST)
}
This is how to run the function:
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then run:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up running this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a clumsy solution. I've read about dlply from the plyr package. It seems to have worked here, but would it help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?

I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit: read.table still has to scan past those b lines on every call, and b grows with each iteration, so each call takes longer than the last. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
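For example, here is a minimal, untested sketch of that idea (the name StreamFluxFast is just illustrative, and it assumes the file really does consist of 8 header lines followed by NoR data rows per timestep): read the file once with readLines, then hand each block to read.table via its text argument instead of re-scanning the file with skip.
## Sketch only: read the whole file into memory once, then subset per timestep
StreamFluxFast <- function(data, NoR, NTS) {
  all_lines <- readLines(data)              # single pass over the file
  mf <- function(i) {
    b <- (NoR + 8) * (i - 1) + 8            # same offset as in the question
    block <- all_lines[(b + 1):(b + NoR)]   # the NoR data rows for timestep i
    mydata <- read.table(text = block, header = FALSE)
    ## ... build flow_mat and plot exactly as in mf() above ...
  }
  lapply(1:NTS, mf)
}
Going through readLines also keeps the interleaved header blocks away from read.table, which is why a single read.table over the whole file would not be straightforward here.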

Related

Fastest way to upload data via R to PostgreSQL 12

I am using the following code to connect to a PostgreSQL 12 database:
con <- DBI::dbConnect(odbc::odbc(), driver, server, database, uid, pwd, port)
This connects me to a PostgreSQL 12 database on Google Cloud SQL. The following code is then used to upload data:
DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)
where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.
When uploaded by the above method, it took approximately 9 hours at a rate of 40 write operations/second. Is there a faster way to upload this data into my PostgreSQL database, preferably through R?
I've always found bulk copy, external to R, to be the best approach. The insert itself can be significantly faster; the trade-off is (1) having to write the data to a file first versus (2) the much shorter run-time.
Setup for this test:
win10 (2004)
docker
postgres:11 container, using port 35432 on localhost, simple authentication
a psql binary in the host OS (where R is running); this should be easy on Linux, and on Windows I grabbed the "zip" (not installer) file from https://www.postgresql.org/download/windows/ and extracted what I needed
I'm using data.table::fwrite to save the file because it's fast; in this case write.table and write.csv are still much faster than using DBI::dbWriteTable, but with your size of data you might prefer something quick
DBI::dbCreateTable(con2, "mt", mtcars)
DBI::dbGetQuery(con2, "select count(*) as n from mt")
# n
# 1 0
z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
nrow(z1000)
# [1] 32000
system.time({
  DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
})
# user system elapsed
# 1.56 1.09 30.90
system.time({
  data.table::fwrite(z1000, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z1000), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})
# COPY 32000
# user system elapsed
# 0.05 0.00 0.19
DBI::dbGetQuery(con2, "select count(*) as n from mt")
# n
# 1 64000
While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:
z100 <- data.table::rbindlist(replicate(100, mtcars, simplify = FALSE))
system.time({
  data.table::fwrite(z100, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z100), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})
# COPY 3200
# user system elapsed
# 0.0 0.0 0.1
system.time({
  DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
# user system elapsed
# 0.17 0.04 2.95
system.time({
  DBI::dbAppendTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
# user system elapsed
# 0.74 0.33 23.59
(I don't want to time dbAppendTable with z1000 above ...)
(For kicks, I ran it with replicate(10000, ...) and ran the psql and dbWriteTable tests again; they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...)
I suspect that dbAppendTable results in an INSERT statement per row, which can take a long time for high numbers of rows.
However, you can generate a single INSERT statement for the entire data frame using the sqlAppendTable function and run it by using dbSendQuery explicitly:
res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, df, row.names=FALSE))
DBI::dbClearResult(res)
For me, this was much faster: a 30 second ingest reduced to a 0.5 second ingest.

Timing R code with Sys.time()

I can run a piece of code for 5 or 10 seconds using the following code:
period <- 10 ## minimum time (in seconds) that the loop should run for
tm <- Sys.time() ## starting data & time
while(Sys.time() - tm < period) print(Sys.time())
The code runs just fine for 5 or 10 seconds. But when I replace the period value by 60 for it to run for a minute, the code never stops. What is wrong?
As soon as elapsed time exceeds 1 minute, the default unit changes from seconds to minutes. So you want to control the unit:
while (difftime(Sys.time(), tm, units = "secs")[[1]] < period)
From ?difftime
If ‘units = "auto"’, a suitable set of units is chosen, the
largest possible (excluding ‘"weeks"’) in which all the absolute
differences are greater than one.
Subtraction of date-time objects gives an object of this class, by
calling ‘difftime’ with ‘units = "auto"’.
Alternatively, use proc.time, which measures various times ("user", "system", "elapsed") of the current R session, in seconds. We want the "elapsed" time, i.e., the wall clock time, so we retrieve the 3rd value of proc.time().
period <- 10
tm <- proc.time()[[3]]
while (proc.time()[[3]] - tm < period) print(proc.time())
If you are confused by the use of [[1]] and [[3]], please consult:
How do I extract just the number from a named number (without the name)?
How to get a matrix element without the column name in R?
Let me add some user-friendly reproducible examples. Your original code with print inside a loop is quite annoying, as it prints thousands of lines to the screen, so I will use Sys.sleep instead.
test.Sys.time <- function(sleep_time_in_secs) {
  t1 <- Sys.time()
  Sys.sleep(sleep_time_in_secs)
  t2 <- Sys.time()
  ## units = "auto"
  print(t2 - t1)
  ## units = "secs"
  print(difftime(t2, t1, units = "secs"))
  ## use '[[1]]' for clean output
  print(difftime(t2, t1, units = "secs")[[1]])
}
test.Sys.time(5)
#Time difference of 5.005247 secs
#Time difference of 5.005247 secs
#[1] 5.005247
test.Sys.time(65)
#Time difference of 1.084357 mins
#Time difference of 65.06141 secs
#[1] 65.06141
The "auto" unit selection is clever: with sleep_time_in_secs = 3605 (just over an hour), the default unit changes to "hours".
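As an illustration (this example is mine, not part of the original answer), you can see the unit switch without actually waiting by constructing two time stamps a fixed distance apart:
## Illustrative only: fake a gap of just over an hour
t1 <- as.POSIXct("2024-01-01 00:00:00")
t2 <- t1 + 3605                            # 3605 seconds later
print(t2 - t1)                             # Time difference of 1.001389 hours
print(difftime(t2, t1, units = "secs"))    # Time difference of 3605 secs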
Be careful with time units when using Sys.time, or you may be fooled when benchmarking. Here is a perfect example: Unexpected results in benchmark of read.csv / fread. I had answered it with a now-removed comment:
You have a problem with time units. I see that fread is more than 20 times faster. If fread takes 4 seconds to read a file, read.csv takes 80 seconds = 1.33 minutes. Ignoring the units, read.csv looks "faster".
Now let's test proc.time.
test.proc.time <- function(sleep_time_in_secs) {
  t1 <- proc.time()
  Sys.sleep(sleep_time_in_secs)
  t2 <- proc.time()
  ## print user, system, elapsed time
  print(t2 - t1)
  ## use '[[3]]' for clean output of elapsed time
  print((t2 - t1)[[3]])
}
test.proc.time(5)
# user system elapsed
# 0.000 0.000 5.005
#[1] 5.005
test.proc.time(65)
# user system elapsed
# 0.000 0.000 65.057
#[1] 65.057
"user" time and "system" time are 0, because both CPU and the system kernel are idle.

Spark R 2.0 dapply very slow

I just started testing SparkR 2.0 and find the execution of dapply very slow.
For example, the following code
set.seed(2)
random_DF<-data.frame(matrix(rnorm(1000000),100000,10))
system.time(dummy_res<-random_DF[random_DF[,1]>1,])
user system elapsed
0.005 0.000 0.006
is executed in 6 ms.
Now, if I create a Spark DataFrame with 4 partitions and run on 4 cores, I get:
sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF), 4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark<-collect(subset_DF_Spark))
user system elapsed
2.003 0.119 62.919
That is, about 1 minute, which is abnormally slow. Am I missing something?
I also get a warning (TaskSetManager: Stage 64 contains a task of very large size (16411 KB). The maximum recommended task size is 100 KB.). Why is this 100 KB limit so low?
I am using R 3.3.0 on Mac OS X 10.10.5.
Any insight welcome!

Optimizing file write speed in R

I am wondering whether it is possible to speed up the process of writing to a file.
With my SSD and Core i5 vPro I am getting the following results for a file of 5234 KB:
system.time({
  write(reportData, "aaa.txt")
})
user system elapsed
1.42 3.56 12.28
as well as
system.time({
  fileConn <- file("aaa.txt")
  writeLines(reportData, fileConn)
  close(fileConn)
})
user system elapsed
1.43 3.46 13.61
and
system.time({
  fileConn <- file("aaa.txt", "w")
  cat(reportData, file = fileConn, sep = "")
  close(fileConn)
})
user system elapsed
1.50 4.13 14.12
All of them seem to be implemented in a similar manner, since the execution times are almost identical.
Is it possible to use the Rcpp library, since C++ could definitely do this much faster?
EDIT
Without using Rcpp, writeChar seems to be the fastest.
system.time({
  fileConn <- file("aaa.txt")
  writeChar(reportData, fileConn, nchar(reportData, type = "chars"))
  close(fileConn)
})
user system elapsed
0.01 0.14 1.31
