Fast test if directory is empty - r

What is the fastest way to test if a directory is empty?
Of course I can check the length of
list.files(path, all.files = TRUE, include.dirs = TRUE, no.. = TRUE)
but this requires enumerating the entire contents of the directory which I'd rather avoid.
EDIT: I'm looking for portable solutions.
EDIT^2: Some timings for a huge directory (run this in a directory that's initially empty, it will create 100000 empty files):
system.time(file.create(as.character(0:99999)))
# user system elapsed
# 0.720 12.223 14.948
system.time(length(dir()))
# user system elapsed
# 2.419 0.600 3.167
system.time(system("ls | head -n 1"))
# 0
# user system elapsed
# 0.788 0.495 1.312
system.time(system("ls -f | head -n 3"))
# .
# ..
# 99064
# user system elapsed
# 0.002 0.015 0.019
The -f switch is crucial for ls: it avoids the sorting that would otherwise take place.

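How about if (length(dir(all.files = TRUE, no.. = TRUE)) == 0)? (Without no.. = TRUE the result always contains "." and "..", so its length is never zero.)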
I'm not sure what you qualify as "fast," but if dir takes a long time, someone is abusing your filesystem :-(.
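A minimal sketch of the portable base-R check discussed above, wrapped in a helper. The name is_dir_empty is hypothetical, and the helper still enumerates the directory, so it does not avoid the cost measured in the question; it only makes the check correct for hidden files and for the "." and ".." entries.

```r
# Hypothetical helper: TRUE if `path` contains no entries at all
# (hidden files count; "." and ".." are excluded via no.. = TRUE).
is_dir_empty <- function(path = ".") {
  stopifnot(dir.exists(path))
  length(list.files(path, all.files = TRUE, no.. = TRUE)) == 0L
}

d <- tempfile("emptydir")
dir.create(d)
is_dir_empty(d)                        # directory was just created

file.create(file.path(d, ".hidden"))   # a hidden file makes it non-empty
is_dir_empty(d)
```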

Fastest way to upload data via R to PostgreSQL 12

I am using the following code to connect to a PostgreSQL 12 database:
con <- DBI::dbConnect(odbc::odbc(), driver, server, database, uid, pwd, port)
This connects me to a PostgreSQL 12 database on Google Cloud SQL. The following code is then used to upload data:
DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)
where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.
When uploaded by the above method, it took approximately 9 hours at a rate of 40 write operations/second. Is there a faster way to upload this data into my PostgreSQL database, preferably through R?
I've always found bulk-copy, external to R, to be the best. The insert can be significantly faster, and the only overhead is writing the data to a file first, which the shorter run-time more than makes up for.
Setup for this test:
win10 (2004)
docker
postgres:11 container, using port 35432 on localhost, simple authentication
a psql binary in the host OS (where R is running); this should be easy on Linux; on Windows I grabbed the "zip" (not installer) file from https://www.postgresql.org/download/windows/ and extracted what I needed
I'm using data.table::fwrite to save the file because it's fast; in this case write.table and write.csv are still much faster than using DBI::dbWriteTable, but with your size of data you might prefer something quick
DBI::dbCreateTable(con2, "mt", mtcars)
DBI::dbGetQuery(con2, "select count(*) as n from mt")
# n
# 1 0
z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
nrow(z1000)
# [1] 32000
system.time({
DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
})
# user system elapsed
# 1.56 1.09 30.90
system.time({
data.table::fwrite(z1000, "mt.csv")
URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
system(
sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
"mt", paste(colnames(z1000), collapse = ","),
sQuote("mt.csv", q = FALSE), URI)  # q = FALSE forces plain ASCII quotes
)
})
# COPY 32000
# user system elapsed
# 0.05 0.00 0.19
DBI::dbGetQuery(con2, "select count(*) as n from mt")
# n
# 1 64000
While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
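The command string used above can be assembled by a small helper so that the table, columns, and file are not hard-coded. This is only a sketch: build_copy_cmd is a hypothetical name, it assumes none of its inputs contain quote characters, and it builds the command without running it.

```r
# Hypothetical helper: builds (but does not run) a psql \copy command.
# Assumes `table`, `columns`, `csv_file`, and `uri` contain no quote characters.
build_copy_cmd <- function(table, columns, csv_file, uri, psql = "psql") {
  copy <- sprintf("\\copy %s (%s) from '%s' (FORMAT CSV, HEADER)",
                  table, paste(columns, collapse = ","), csv_file)
  sprintf("%s -c \"%s\" \"%s\"", psql, copy, uri)
}

cmd <- build_copy_cmd(
  table    = "mt",
  columns  = c("mpg", "cyl"),
  csv_file = "mt.csv",
  uri      = "postgresql://postgres:mysecretpassword@127.0.0.1:35432"
)
cat(cmd, "\n")
# system(cmd)   # uncomment to run; requires psql on the PATH and a reachable server
```

Passing the column names explicitly, as the answer does, keeps the load independent of the column order in the target table.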
Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:
z100 <- data.table::rbindlist(replicate(100, mtcars, simplify = FALSE))
system.time({
data.table::fwrite(z100, "mt.csv")
URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
system(
sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
"mt", paste(colnames(z100), collapse = ","),
sQuote("mt.csv", q = FALSE), URI)  # q = FALSE forces plain ASCII quotes
)
})
# COPY 3200
# user system elapsed
# 0.0 0.0 0.1
system.time({
DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
# user system elapsed
# 0.17 0.04 2.95
system.time({
DBI::dbAppendTable(con2, "mt", z100)
})
# user system elapsed
# 0.74 0.33 23.59
(I don't want to time dbAppendTable with z1000 above ...)
For kicks, I ran it with replicate(10000, ...) and ran the psql and dbWriteTable tests again; they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...
I suspect that dbAppendTable results in an INSERT statement per row, which can take a long time for high numbers of rows.
However, you can generate a single INSERT statement for the entire data frame using the sqlAppendTable function and run it by using dbSendQuery explicitly:
res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, df, row.names=FALSE))
DBI::dbClearResult(res)
For me, this was much faster: a 30 second ingest reduced to a 0.5 second ingest.
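To see what the single statement looks like without connecting to a database, the SQL can be generated against DBI's dummy ANSI() connection. This is a sketch for inspection only: it assumes the DBI package is installed, and a real backend may quote identifiers and values differently. Note that every value is embedded as a literal, so a very large data frame yields a very large SQL string.

```r
library(DBI)

# Generate the multi-row INSERT for three rows of mtcars.
# ANSI() is a dummy connection used only to produce generic SQL text.
sql <- sqlAppendTable(ANSI(), "mt", head(mtcars, 3), row.names = FALSE)

length(sql)   # one statement covering all rows
sql           # print the generated INSERT
```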

How to find number of lines of a Large CSV file without reading it - using R? [duplicate]

I have a CSV file of about 1 GB, and as my laptop has a basic configuration, I'm not able to open the file in Excel or R. Out of curiosity, I would like to get the number of rows in the file. How can I do this, if it is possible at all?
For Linux/Unix:
wc -l filename
For Windows:
find /c /v "A String that is extremely unlikely to occur" filename
Option 1:
Through a file connection, count.fields() counts the number of fields per line of the file based on some sep value (that we don't care about here). So if we take the length of that result, theoretically we should end up with the number of lines (and rows) in the file.
length(count.fields(filename))
If you have a header row, you can skip it with skip = 1
length(count.fields(filename, skip = 1))
There are other arguments that you can adjust for your specific needs, like skipping blank lines.
args(count.fields)
# function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE,
# comment.char = "#")
# NULL
See help(count.fields) for more.
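A small self-contained sketch of this approach on a temporary CSV file. The file and its contents are made up for illustration; sep = "," is passed explicitly here, although the separator does not affect the line count for simple files like this one.

```r
# Write a small CSV with a header row and 250 data rows.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:250, value = letters[1:25]), tmp, row.names = FALSE)

# One element per line read, so the length is the number of data rows
# once the header is skipped.
n_rows <- length(count.fields(tmp, sep = ",", skip = 1))
n_rows
```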
It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.
nrow(data.table::fread("Batting.csv"))
# [1] 99846
system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
# user system elapsed
# 0.528 0.000 0.503
l
# [1] 99846
file.info("Batting.csv")$size
# [1] 6153740
(The more efficient) Option 2: Another idea is to use data.table::fread() to read the first column only, then take the number of rows. This would be very fast.
system.time(nrow(fread("Batting.csv", select = 1L)))
# user system elapsed
# 0.063 0.000 0.063
Estimate the number of lines from the size of the first 1000 lines:
size1000 <- sum(nchar(readLines(con = "dgrp2.tgeno", n = 1000), type = "bytes") + 1)  # + 1 byte per newline
sizetotal <- file.size("dgrp2.tgeno")
1000 * sizetotal / size1000
This is usually good enough for most purposes, and is a lot faster for huge files.
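A sketch that compares the estimate with the exact count on a made-up file. It assumes LF line endings (the file is written through a binary connection to guarantee that) and counts one extra byte per line for the newline. Because the early lines in this example are shorter than the later ones, the estimate comes out too high, which shows that the method is only as good as the sample is representative.

```r
# Build a test file whose lines get longer further down.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")              # binary mode: always "\n" line endings
writeLines(sprintf("%d,%s", 1:5000, "abc"), con)
close(con)

# Estimate from the first 1000 lines: bytes per line plus one newline byte each.
sample_lines  <- readLines(tmp, n = 1000)
bytes_sampled <- sum(nchar(sample_lines, type = "bytes") + 1)
estimate      <- length(sample_lines) * file.size(tmp) / bytes_sampled

exact <- length(readLines(tmp))
c(estimate = estimate, exact = exact)
```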
Here is something I used:
testcon <- file("xyzfile.csv",open="r")
readsizeof <- 20000
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
  nooflines <- nooflines + linesread
}
close(testcon)
nooflines
Check out this post for more:
https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/
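The chunked loop above can be wrapped in a function so the connection is always closed, even on error. A minimal sketch; count_lines and its chunk_size argument are hypothetical names, and the test file is made up.

```r
# Hypothetical helper: count lines by reading `chunk_size` lines at a time,
# so only one chunk is held in memory at once.
count_lines <- function(path, chunk_size = 20000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    n_read <- length(readLines(con, n = chunk_size))
    if (n_read == 0L) break
    total <- total + n_read
  }
  total
}

tmp <- tempfile(fileext = ".csv")
writeLines(sprintf("row %d", 1:2500), tmp)
count_lines(tmp, chunk_size = 1000L)
```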
Implementing Tony's answer in R:
file <- "/path/to/file"
cmd <- paste("wc -l <", shQuote(file))  # shQuote protects paths containing spaces
as.numeric(system(cmd, intern = TRUE))
This is about 4x faster than data.table for a file with 100k lines
> microbenchmark::microbenchmark(
+ nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)),
+ as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))
+ )
Unit: milliseconds
                                                               expr       min        lq      mean   median        uq      max neval
                 nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)) 128.06701 131.12878 150.43999 135.1366 142.99937 629.4880   100
 as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))  27.70863  28.42997  34.83877  29.5070  33.32973 270.3104   100

Optimizing file write speed in R

I am wondering whether it is possible to speed up writing to a file.
With my SSD and Core i5 vPro, I am getting the following results for a 5234 KB file:
system.time({
write(reportData, "aaa.txt")
})
user system elapsed
1.42 3.56 12.28
as well as
system.time({
fileConn<-file("aaa.txt")
writeLines(reportData, fileConn)
close(fileConn)
})
user system elapsed
1.43 3.46 13.61
and
system.time({
fileConn <- file("aaa.txt","w")
cat(reportData,file=fileConn,sep="")
close(fileConn)
})
user system elapsed
1.50 4.13 14.12
All of them seem to be implemented in a similar manner, since the execution times are almost identical.
Is it possible to use the Rcpp library? C++ could surely do this much faster.
EDIT
Without using Rcpp, writeChar seems to be the fastest.
system.time({
fileConn<-file("aaa.txt")
writeChar(reportData, fileConn, nchar(reportData, type = "chars"), eos = NULL)  # eos = NULL: no trailing nul byte
close(fileConn)
})
user system elapsed
0.01 0.14 1.31
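A self-contained sketch of the writeChar approach. The type of reportData is not shown in the question, so this assumes a character vector of lines (report_data here is a made-up stand-in). The lines are collapsed into one string and written in a single call. By default writeChar appends a nul terminator after each string, so eos = NULL is passed to keep the file plain text.

```r
# Made-up stand-in for the question's reportData: a character vector of lines.
report_data <- sprintf("line %d", 1:1000)
out <- tempfile(fileext = ".txt")

# Collapse to one string (with a final newline) and write it in one call.
txt <- paste0(paste(report_data, collapse = "\n"), "\n")
con <- file(out, open = "wb")
writeChar(txt, con, eos = NULL)   # eos = NULL: do not append a nul terminator
close(con)

# Round-trip check: the file reads back as the original lines.
identical(readLines(out), report_data)
```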

R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code and it appears to be slowing down as the iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
StreamFlux <- function(data,NoR,NTS){
###Read in data to display points###
WLX = c(8,19,29,20,13,20,21)
WLY = c(25,28,25,21,17,14,12)
WLY = 34 - WLY
WLX = WLX / 44
WLY = WLY / 33
timedata = NULL
mf <- function(i){
b = (NoR+8) * (i-1) + 8
###I read in data one section at a time to avoid headers
mydata = read.table(data,skip=b,nrows=NoR, header=FALSE)
rows = 34-mydata[,2]
cols = 45-mydata[,3]
flows = mydata[,7]
rows = as.numeric(rows)
cols = as.numeric(cols)
rm(mydata)
###Create Flux matrix
flow_mat <- matrix(0,44,33)
###Populate matrix###
flow_mat[(rows - 1) * 44 + (45-cols)] <- flows+flow_mat[(rows - 1) * 44 + (45-cols)]
flow_mat[flow_mat == 0] <- NA
rm(flows)
rm(rows)
rm(cols)
timestep = i
###Specifying jpeg info###
jpeg(paste("Steamflow", timestep, ".jpg",sep = ''),
width = 640, height=441,quality=75,bg="grey")
image.plot(flow_mat, zlim=c(-1,1),
col=brewer.pal(11, "RdBu"),yaxt="n",
xaxt="n", main=paste("Stress Period ",
timestep, sep = ""))
points(WLX,WLY)
dev.off()
rm(flow_mat)
}
ST<- function(x){functiontime=system.time(mf(x))
print(functiontime)}
lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then run:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems a stupid solution. I've read about dlply from the plyr package. It seems to have worked here, but would that help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
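In particular, I think the skip=b argument is the culprit: read.table still has to scan past the first b lines on every call, and b grows with each iteration, so every time step costs more than the one before. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.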
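A minimal sketch of the read-once idea on a made-up miniature of the file layout (a fixed number of header lines followed by a fixed number of data rows per time step). The names n_header, n_rows, n_steps, and read_step are hypothetical, and the real file's columns will differ. The offset is computed the same way as b in the question, but it indexes into lines already held in memory instead of re-scanning the file.

```r
# Made-up miniature of the layout: each time step has n_header header
# lines followed by n_rows data rows.
n_header <- 8
n_rows   <- 3
n_steps  <- 4
tmp <- tempfile(fileext = ".txt")
blocks <- lapply(seq_len(n_steps), function(i) {
  c(rep("header", n_header), sprintf("%d %d %d", i, seq_len(n_rows), i * 10L))
})
writeLines(unlist(blocks), tmp)

# Read the whole file once ...
all_lines <- readLines(tmp)

# ... then slice each time step out of memory.
read_step <- function(i) {
  start <- (n_rows + n_header) * (i - 1) + n_header   # same offset as `b` in the question
  read.table(text = all_lines[start + seq_len(n_rows)], header = FALSE)
}
steps <- lapply(seq_len(n_steps), read_step)
steps[[4]]
```

With this layout each call touches only its own rows, so the per-step cost no longer depends on the step number.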
