Optimizing `data.table::fread` speed advice - r

I am experiencing read speeds that I believe are much slower than expected when reading a fairly large file in R with fread.
The file is ~60m rows x 147 columns, of which I am only selecting 27 columns, directly in the fread call using select; only 23 of the 27 are found in the actual file. (Probably I typed some of the column names incorrectly, but I guess that matters less.)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
                  verbose = TRUE,
                  select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335, and data.table 1.12.0.
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=',' with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555  Quote rule 0
  Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
  Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
Type counts:
124 : drop '0'
3 : int32 '5'
7 : float64 '7'
13 : string 'A'
=============================
0.000s ( 0%) Memory map 48.996GB file
0.035s ( 0%) sep=',' ncol=147 and header detection
0.001s ( 0%) Column type detection using 10045 sample rows
6.000s ( 0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s (  1%) Transpose
   +  144.273s (  8%) Waiting
  24.545s (  1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s        Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out whether this is just normal or whether there is anything I can do to improve these reading speeds. Based on various benchmarks I've seen around, and my own experience and intuition with fread on smaller files, I would have expected this to be read in much quicker.
I was also wondering whether the multi-core capabilities are being fully used, as I have heard that this is not always straightforward under Windows. My knowledge of this topic is unfortunately pretty limited, but it does appear from the verbose output that fread is detecting 16 cores.

Thoughts:
(1) If you are using Windows, use Microsoft R Open; even more so if the cloud is Azure, since there may be coordination between Microsoft R Open and the Azure client. Because of Intel's MKL and Microsoft's built-in enhancements, I find Microsoft R Open faster on Windows.
(2) I suspect select and drop are applied after a full file read. Maybe read the whole file first, then subset or filter afterwards.
(3) I think a restart is overkill. I run gc() three times every so often, like this: gc(); gc(); gc(). I have heard others say this does nothing, but at least it makes me feel better. Actually, I notice it helps me on Windows.
(4) The latest versions of data.table's fread are implementing YAML support. This looks promising.
(5) setDTthreads(0) uses all the cores, but too much parallelization can work against you; try halving your cores. See the sketch below.
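A minimal sketch of point (5) plus a column-type hint, assuming cols2Select is defined as in the question; the colClasses override is only a guess based on the 'ProdType' bump reported in the verbose log, not something I have benchmarked:
library(data.table)

setDTthreads(8)   # try half of the 16 cores
getDTthreads()    # confirm how many threads fread will actually use

DT <- fread("..\\TOI\\TOI_RAW_APextracted.csv",
            select     = cols2Select,
            colClasses = list(character = "ProdType"),  # avoid the out-of-sample type reread
            verbose    = TRUE)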

Related

RStudio read_delim(): intermittently receive error std::bad_alloc upon opening files with unusual delimiter

I received a series of 100+ files from a client. The client received the files as part of litigation, so they didn't have to be transmitted in a convenient fashion; they just all had to be present. Within a single .zip file, the files are tracked with names like Folder1.001, Folder1.002, Folder3.001, etc. When I unpackaged these files using the 7-Zip program, they didn't show up with a .txt, .csv, or any other file extension, and Windows interprets the unzipped files as a ".001 File" or ".002 File." This is not the issue, because I know that the files are delimited by a ~ and are 118 columns wide. Each file has between 2.5M and 4.9M rows, and each is about 1 GB in size when unzipped.
This is my first ever post here, so please excuse any breach of etiquette.
I am working in a .Rmd file on a virtual machine running Windows. I have R 4.2.2 (64-bit) and RStudio 2022.12.0+353. All work is being done within a drive on the virtual machine that has 9+ GB free out of 300 GB total. The size of this virtual drive could be increased if necessary.
My goal here is to examine one variable in each file, to see whether cases fall within a given range for that variable, and to save the rows that do. I have been saving them as .rds files using write_rds().
I have been bringing in the files using a read_delim() statement specifying delim = "~". I created a vector of 120 column names, which I use because the columns are not labeled. These commands on their own are not an issue. A successful import looks like the below.
work1 <- read_delim("Data\\Folder1\\File1.001", delim = "~", col_names = vNames1)
Rows: 2577668 Columns: 120
── Column specification ───────────────────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
It misreads the columns named Skip1 and Skip9 as logical values, but those aren't a necessary part of my analysis.
I then filter and write the file using
work1 <- work1 %>% filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)
write_rds(work1, "Data\\Working\\Folder1_001.rds")
I have also done this with the read_delim() and filter() piped into a single command. This is not the issue. NOTE: Before I read in the next file (File1.002), work1 is down to at most 4,000 cases, from the millions it had when imported.
Since I have over 100 of these files, I have written multiple code chunks to do a few of these at a time. After one to three read_delim() statements in a row, I get the below error.
work2 <- read_delim("Data\\Folder1\\File1.002", delim = "~", col_names = vNames1)
Error std::bad_alloc
This, I understand, has to do with memory allocation. I can close out RStudio and restart, and that will allow me to do one or two more imports, filterings, and writings. Doing that for over 100 files is far too inefficient.
I condensed my code a step further by writing the read_delim() step within the write_rds() step, which looks like the below.
write_rds((read_delim("Data\\Folder1\\File003",
                      delim = "~", col_names = vNames1) %>%
             filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)),
          "Data\\Working\\Folder1_003.rds")
Rows: 2577668 Columns: 120
── Column specification ───────────────────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Yet after 1 or 2 successful runs, I get the same
Error std::bad_alloc message.
Using traceback(), it seems like it is related to vroom::vroom(), but I'm not sure how to check any further.
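One way to consolidate the per-file steps above into a single helper and loop is sketched below; it only rearranges the read_delim()/filter()/write_rds() workflow already shown (vNames1 and the paths are assumed to exist as in the question), and whether the explicit rm() + gc() between files actually avoids the bad_alloc is an open question:
library(readr)
library(dplyr)

# hypothetical helper: read one ~-delimited file, keep the Press_ZIP range,
# write the .rds, then drop the large object and trigger garbage collection
process_one <- function(infile, outfile) {
  work <- read_delim(infile, delim = "~", col_names = vNames1,
                     show_col_types = FALSE) %>%
    filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)
  write_rds(work, outfile)
  rm(work)
  gc()
  invisible(outfile)
}

# e.g. loop over the files in one folder
infiles  <- list.files("Data\\Folder1", full.names = TRUE)
outfiles <- file.path("Data\\Working",
                      paste0("Folder1_", sub("^.*\\.", "", infiles), ".rds"))
for (i in seq_along(infiles)) process_one(infiles[i], outfiles[i])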

How to read single parameter from a file?

Chapter 9, page 163, of the AMPL book gives an example of reading a single parameter from a file:
For example, if you want to read the number of weeks and the
hours available each week for our simple production model (Figure 4-4),
param T > 0;
param avail {1..T} >= 0;
from a file week_data.txt containing
4
40 40 32 40
then you can give the command
read T, avail[1], avail[2], avail[3], avail[4] <week_data.txt;
This command fails in GLPK with the error "colon missing where expected". The Modeling Language GNU MathProg language reference only contains table data IN, which serves for reading tabular data. Can GLPK read a single parameter from a file?
You can use the table statement to read parameters from CSV files or SQL tables.
You can use data files for passing parameters, e.g. see this example of writing data files in AWK and Visual Basic.
AMPL and GMPL are closely related modeling languages. GMPL implements a subset of the AMPL syntax but differs in several areas, such as the table statement.
One way to read a single parameter is to write the data into a file with a certain syntax, e.g. the contents below show a single parameter and a table:
param T := 4;
param avail :=
1 0
2 1
3 1
4 0;
end;
To verify the syntax, consider this code in the file problem.mod:
param T > 0;
param avail {1..T} >= 0;
var use {1..T} >= 0;
maximize usage: sum {t in 1..T} avail[t];
subject to constraint {t in 1..T}: use[t] <= avail[t];
solve;
end;
The result shows that it worked:
> glpsol -m problem.mod -d problem.dat
GLPSOL: GLPK LP/MIP Solver, v4.65
Parameter(s) specified in the command line:
-m problem.mod -d problem.dat
Reading model section from problem.mod...
13 lines were read
Reading data section from problem.dat...
9 lines were read
Generating usage...
Generating constraint...
Model has been successfully generated
glp_mpl_build_prob: row usage; constant term 2 ignored
GLPK Simplex Optimizer, v4.65
5 rows, 4 columns, 4 non-zeros
Preprocessing...
~ 0: obj = 2.000000000e+00 infeas = 0.000e+00
OPTIMAL SOLUTION FOUND BY LP PREPROCESSOR
Time used: 0.0 secs
Memory used: 0.1 Mb (110236 bytes)
Model has been successfully processed

Available stack size is not used by R, returning "Error: node stack overflow"

I have written a recursive code in R.
Before invoking R, I set the stack size to 96 MB at the shell with:
ulimit -s 96000
I invoked R with maximum protection pointer stack size of 500000 with:
R --max-ppsize 500000
And I changed the maximum recursion depth to 500000:
options(expression = 500000)
I both used the binary R package at Arch Linux repositories (without memory profiling) and also a binary compiled by me with memory profiling option. Both are of version 3.4.2
I used two versions of the code with and without gc().
The problem is that R exits the code with a "node stack overflow" error while only 16 MB of the 93 MB of total available stack is used, and the depth is just below one percent of the expressions option of 5e5:
size current direction eval_depth
93388800 16284704 1 4958
Error: node stack overflow
The change in current stack usage between the last two iterations was around 10K. The only object passed and saved is a numeric vector of 19 items.
The recursive portion of the code is below:
network_recursive <- function(called)
{
  print(Cstack_info())
  callers <- list_caller[[called + 1]]    # get the callers of the called
  callers <- callers[!bool[callers + 1]]  # subset for nofriends - new friends
  new_friend_no <- length(callers)        # number of new friends
  print(list(called, callers))
  if (new_friend_no > 0)  # if1: still new friends
  {
    friends <<- friends + new_friend_no   # increment friend count
    print(friends)
    bool[callers + 1] <<- T               # toggle friends
    sapply(callers, network_recursive)    # recurse network control
  } # close if1
  print("end of recursion")
}
What may be the reason for this stack overflow?
Some notes on the R source code, related to the issue.
The portion of the code that triggers the error is lines 5987-5988 from src/main/eval.c:
5975 #ifdef USE_BINDING_CACHE
5976     if (useCache) {
5977         R_len_t n = LENGTH(constants);
5978 # ifdef CACHE_MAX
5979         if (n > CACHE_MAX) {
5980             n = CACHE_MAX;
5981             smallcache = FALSE;
5982         }
5983 # endif
5984 # ifdef CACHE_ON_STACK
5985         /* initialize binding cache on the stack */
5986         vcache = R_BCNodeStackTop;
5987         if (R_BCNodeStackTop + n > R_BCNodeStackEnd)
5988             nodeStackOverflow();
5989         while (n > 0) {
5990             SETSTACK(0, R_NilValue);
5991             R_BCNodeStackTop++;
5992             n--;
5993         }
5994 # else
5995         /* allocate binding cache and protect on stack */
5996         vcache = allocVector(VECSXP, n);
5997         BCNPUSH(vcache);
5998 # endif
5999     }
6000 #endif
Off the top of my head, I see that you used options(expression = 500000), but the field in the list returned by options() is called 'expressions' (with an s). If you typed it the way you described in your question, then the 'expressions' option remained at its default of 5000, not the 500000 you intended to set. This might be why you maxed out while only using what you thought was 1% of the allowed depth.
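A quick way to see the difference (a minimal sketch; getOption() just reads the value back):
options(expression = 500000)    # no option of this name exists, so nothing relevant changes
getOption("expressions")        # still the default, 5000
options(expressions = 500000)   # correct name, with the trailing "s"
getOption("expressions")        # now 500000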
The node stack has its own limit, which is fixed (defined in Defn.h, R_BCNODESTACKSIZE). If you have a real example where the limit is too small, please submit a bug report; we could increase it or add a command-line option for it. The "node stack" is used by the byte-code interpreter, which interprets byte-code produced by the byte-code compiler. Cstack_info() does not display the node stack usage; the node stack is not allocated on the C stack.
Programs based on deep recursion will be very slow in R anyway, as function calls are quite expensive. For practical purposes, when a limit related to recursion depth is hit, it might be better to rewrite the program to avoid recursion rather than to increase the limits.
Just as an experiment, one might disable the just-in-time compiler and thereby reduce the stress on the node stack. The stress won't be completely eliminated, because some packages are already compiled at installation by default, including base and recommended packages, so e.g. sapply is compiled. Also, this might on the other hand increase the stress on the limits for recursively evaluated expressions, and the program will run even slower.
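A minimal sketch of that experiment, using enableJIT() from the base compiler package (code byte-compiled at installation, such as base functions, stays compiled):
old_jit <- compiler::enableJIT(0)  # 0 disables just-in-time compilation; returns the previous level
# ... define and run network_recursive() here ...
compiler::enableJIT(old_jit)       # restore the previous JIT level afterwards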

Long Vector Not Supported Yet Error in R Windows 64bit version

I'm trying to test what the memory limitations in the current R version is.
runtest <- function(size) {
  x <- "testme"
  while (0 < 1) {
    x <- c(x, x)
    size <<- object.size(x)  # size of x when it fails
  }
}
By running runtest(size) in the console on my laptop, I get the following error:
> runtest(size)
Error: cannot allocate vector of size 4.0 Gb
In addition: Warning messages:
1: In structure(.Call(C_objectSize, x), class = "object_size") :
Reached total allocation of 7915Mb: see help(memory.size)
2: In structure(.Call(C_objectSize, x), class = "object_size") :
Reached total allocation of 7915Mb: see help(memory.size)
3: In structure(.Call(C_objectSize, x), class = "object_size") :
Reached total allocation of 7915Mb: see help(memory.size)
4: In structure(.Call(C_objectSize, x), class = "object_size") :
Reached total allocation of 7915Mb: see help(memory.size)
> size
2147483736 bytes
>
This size looks very close to the 2^31-1 limit that people have mentioned before. So I then tried running the same code on our upgraded desktop with 128 GB of RAM, with the maximum memory usage set to 100 GB via the shortcut variable for the 64-bit version. This is the new error I get:
Error in structure(.Call(C_objectSize, x), class = "object_size") :
long vectors not supported yet: unique.c: 1720
> size
8589934680 bytes
>
Does this 8.5GB limit have anything to do with running in Windows O/S (specifically Windows 7 Enterprise edition)? I think the R help file (http://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html) explains this, but I'm having trouble understanding what it's saying (not my area of expertise).
Looking at the source of size.c and unique.c it looks like the hashing used to improve object.size doesn't support long vectors yet:
/* Use hashing to improve object.size. Here we want equal CHARSXPs,
not equal contents. */
and
/* Currently the hash table is implemented as a (signed) integer
array. So there are two 31-bit restrictions, the length of the
array and the values. The values are initially NIL (-1). O-based
indices are inserted by isDuplicated, and invalidated by setting
to NA_INTEGER.
*/
Therefore, it is object.size that is choking. How about calling numeric(2^33) to see whether you can create such a large object? (It should be about 64 GB.)
The maximum vector length in recent versions of R is 2^53-1 because there are 53 bits in the mantissa of the double storage mode. The maximum extent of each dimension of a matrix or array is still 2^31-1 because dimension values are still based on the integer storage mode. I thought I might get more information from news(), but didn't get as much as I expected. There was quite a bit of discussion about this on the r-devel mailing list, and I would use MarkMail's archive search if I needed more.
?double
?integer
db <- news()
str(db)
db$Text[grepl("vector", db$Text) & grepl("length", db$Text)]
# only slightly informative
db$Text[grepl("vector", db$Text) & grepl("long", db$Text)][3]
[1] "There is support for vectors longer than 2^31 - 1 elements. This\napplies to raw, logical, integer, double, complex and character\nvectors, as well as lists. (Elements of character vectors remain\nlimited to 2^31 - 1 bytes.)"

Looking for algorithm to do long pair wise nucleotide alignments

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome (the raw reads are not available). I am using R/Bioconductor and the pairwiseAlignment() function from the Biostrings package.
This was working fine for smaller scaffolds, but failed when I tried to align a 56 kbp scaffold, with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern,
: cannot allocate memory block of size 17179869183.7 Gb
I am not sure whether this is a bug; I was under the impression that the Needleman-Wunsch algorithm used by pairwiseAlignment is O(n*m), which I thought would imply a computational demand on the order of 3.1E9 operations (56k * 56k ~= 3.1E9). It seems the Needleman-Wunsch similarity matrix should likewise take up on the order of 3.1 gigs of memory. I'm not sure whether I'm misremembering big-O notation or whether that is actually the memory overhead needed to build the alignment, given the overhead of the R scripting environment.
Does anybody have suggestions for a better alignment algorithm for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align against. I am not entirely confident in BLAST's reliability for correctly placing indels, and I have not yet found an API as good as the one Biostrings provides for parsing raw BLAST alignments.
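For what it's worth, a rough back-of-envelope for the dynamic-programming matrix (the 8-bytes-per-cell figure is only an assumption for illustration, not Biostrings' actual internal representation):
n <- 56170   # length of scaf_sub
m <- 56168   # length of genomic_sub
n * m                # ~3.16e9 cells in an n x m score matrix
n * m * 8 / 1024^3   # ~23.5 GiB if each cell were stored as an 8-byte double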
By the way, here is a code snippet that replicates the problem:
library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance
genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance
#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance
curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error:
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
#cannot allocate memory block of size 17179869182.9 Gb
The error does not happen with shorter alignments (hundreds of bases); I have not yet found the length cutoff where the error starts happening.
So I use Clustal as an alignment tool. I'm not sure about its specific performance, but it has never given me issues when doing multiple sequence alignments of large quantities. Here is a script that runs over a whole directory of .fasta files and aligns them. You can modify the flags in the system call to suit your input/output needs; just look at the Clustal documentation. This is in Perl; I don't use R much for alignments. You will need to edit the executable path in the script to match where Clustal is on your computer.
#!/usr/bin/perl
use warnings;
print "Please type the list file name of protein fasta files to align (end the directory path with a / or this will fail!): ";
$directory = <STDIN>;
chomp $directory;
opendir (DIR,$directory) or die $!;
my @file = readdir DIR;
closedir DIR;
my $add="_align.fasta";
foreach $file (@file) {
my $infile = "$directory$file";
(my $fileprefix = $infile) =~ s/\.[^.]+$//;
my $outfile="$fileprefix$add";
system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";
}
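For anyone who prefers to stay in R, here is a hypothetical equivalent of the Perl loop above (a sketch only; it assumes the clustalw2 binary is on the PATH and that the input directory holds plain .fasta files, and uses the same flags as the Perl script):
fasta_dir   <- "path/to/fasta/files"   # adjust to your directory
fasta_files <- list.files(fasta_dir, pattern = "\\.fasta$", full.names = TRUE)

for (infile in fasta_files) {
  outfile <- sub("\\.fasta$", "_align.fasta", infile)   # mirror the _align.fasta naming
  system2("clustalw2",
          args = c(paste0("-INFILE=", infile),
                   paste0("-OUTFILE=", outfile),
                   "-OUTPUT=FASTA",
                   "-tree"))
}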
