I'm using fread in data.table (1.8.8, R 3.0.1) in a attempt to read very large files.
The file in questions has 313 rows and ~6.6 million cols of numeric data rows and the file is around around 12gb. This is a Centos 6.4 with 512GB of RAM.
When I attempt to read in the file:
g=fread('final.results',header=T,sep=' ')
'header' changed by user from 'auto' to TRUE
Error: protect(): protection stack overflow
I tried starting R with --max-ppsize 500000 , which is the max, but the same error.
I also tried setting the stack size to unlimited via
ulimit -s unlimited
Virtual memory was already set to unlimited.
Am I being unrealistic with a file of this size? Did I miss something fairly obvious?
Now fixed in v1.8.9 on R-Forge.
An unintended 50,000 column limit has been removed in fread. Thanks to mpmorley for reporting. Test added.
The reason was I got this part wrong in the fread.c source :
// *********************************************************************
// Allocate columns for known nrow
// *********************************************************************
ans=PROTECT(allocVector(VECSXP,ncol));
protecti++;
setAttrib(ans,R_NamesSymbol,names);
for (i=0; i<ncol; i++) {
thistype = TypeSxp[ type[i] ];
thiscol = PROTECT(allocVector(thistype,nrow)); // ** HERE **
protecti++;
if (type[i]==SXP_INT64)
setAttrib(thiscol, R_ClassSymbol, ScalarString(mkChar("integer64")));
SET_TRUELENGTH(thiscol, nrow);
SET_VECTOR_ELT(ans,i,thiscol);
}
According to R-exts section 5.9.1, that PROTECT inside the loop isn't needed :
In some cases it is necessary to keep better track of whether protection is really needed. Be
particularly aware of situations where a large number of objects are generated. The pointer
protection stack has a fixed size (default 10,000) and can become full. It is not a good idea
then to just PROTECT everything in sight and UNPROTECT several thousand objects at the end. It
will almost invariably be possible to either assign the objects as part of another object (which
automatically protects them) or unprotect them immediately after use.
So that PROTECT is now removed and all is well. (It seems that the pointer protection stack limit has been reduced to 50,000 since that text was written; Defn.h contains #define R_PPSSIZE 50000L.) I've checked all other PROTECTs in data.table C source for anything similar and found and fixed one in assign.c too (when adding more than 50,000 columns by reference), no others.
Thanks for reporting!
Related
When I created a partition for an existing table, I got an exception stack below:
Thread pointer: 0x1ce1ecc6ab8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
mysqld.exe!ha_partition::read_par_file()[ha_partition.cc:3021]
mysqld.exe!ha_partition::get_from_handler_file()[ha_partition.cc:3161]
mysqld.exe!ha_partition::initialize_partition()[ha_partition.cc:512]
mysqld.exe!partition_create_handler()[ha_partition.cc:185]
mysqld.exe!get_new_handler()[handler.cc:302]
mysqld.exe!TABLE_SHARE::init_from_binary_frm_image()[table.cc:2025]
mysqld.exe!open_table_def()[table.cc:698]
mysqld.exe!tdc_acquire_share()[table_cache.cc:842]
mysqld.exe!open_table()[sql_base.cc:1905]
mysqld.exe!open_and_process_table()[sql_base.cc:3802]
mysqld.exe!open_tables()[sql_base.cc:4300]
mysqld.exe!mysql_alter_table()[sql_table.cc:9353]
mysqld.exe!Sql_cmd_alter_table::execute()[sql_alter.cc:510]
mysqld.exe!mysql_execute_command()[sql_parse.cc:6087]
mysqld.exe!Prepared_statement::execute()[sql_prepare.cc:4760]
mysqld.exe!Prepared_statement::execute_loop()[sql_prepare.cc:4246]
mysqld.exe!mysql_sql_stmt_execute()[sql_prepare.cc:3364]
mysqld.exe!mysql_execute_command()[sql_parse.cc:3901]
mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:3652]
mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:3335]
mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:3513]
mysqld.exe!sp_head::execute()[sp_head.cc:1346]
mysqld.exe!sp_head::execute_procedure()[sp_head.cc:2288]
mysqld.exe!do_execute_sp()[sql_parse.cc:3005]
mysqld.exe!Sql_cmd_call::execute()[sql_parse.cc:3247]
mysqld.exe!mysql_execute_command()[sql_parse.cc:6087]
mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:3652]
mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:3335]
mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:3513]
mysqld.exe!sp_head::execute()[sp_head.cc:1346]
mysqld.exe!sp_head::execute_procedure()[sp_head.cc:2288]
mysqld.exe!Event_job_data::execute()[event_data_objects.cc:1459]
mysqld.exe!Event_worker_thread::run()[event_scheduler.cc:312]
mysqld.exe!event_worker_thread()[event_scheduler.cc:268]
mysqld.exe!pthread_start()[my_winthread.c:62]
ucrtbase.dll!_configthreadlocale()
KERNEL32.DLL!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x1ce21cf66b8): alter table mytable add PARTITION (PARTITION t20221103 VALUES LESS THAN (TO_DAYS("2022-11-03")+1))
Connection ID (thread ID): 1649
Status: NOT_KILLED
The running platform is windows so I cannot debug the core file.
Therefore, I read the source code but I cannot understand the reason still deeply.
The database version is 10.4.6 so here is the source code link:
https://github.com/MariaDB/server/blob/mariadb-10.4.6/sql/ha_partition.cc#L3021
chksum= 0;
for (i= 0; i < len_words; i++)
chksum ^= uint4korr((file_buffer) + PAR_WORD_SIZE * i);
if (chksum)
goto err2;
m_tot_parts= uint4korr((file_buffer) + PAR_NUM_PARTS_OFFSET);
DBUG_PRINT("info", ("No of parts: %u", m_tot_parts));
tot_partition_words= (m_tot_parts + PAR_WORD_SIZE - 1) / PAR_WORD_SIZE;
tot_name_len_offset= file_buffer + PAR_ENGINES_OFFSET +
PAR_WORD_SIZE * tot_partition_words;
tot_name_words= (uint4korr(tot_name_len_offset) + PAR_WORD_SIZE - 1) /
PAR_WORD_SIZE; // <--- crashed here
There are some macros, so I don't know whether the line number is exact.
It seems a bug for loading the par file. I found that the above code has checked the validation on the par file, but subsequently, MariaDB still raises the exception.
So, I wonder if my analysis is right and how to bypass the exception, thanks a lot.
First is to search existing bug reports for 'read_par_file', and there's no obvious match.
Second is, looking at the line it crashed, we can assume its reading from memory that isn't allocated. The file_buffer was allocated earlier in the file based of a word length at the beginning.
Third, if you look at the git blame of the function, you see nothing has changed in quite a while.
So its likely a new bug. Please report it using the show create table and par file.
To confirm, that the show create table mytable, paste that into a running 10.4 latest version (containers are good for this). Then issue your alter table statement again. Its likely to crash, but its good to confirm.
There does seem to be a lack of checking around some of these offsets with regard to the allocated space.
Given it looks like you are doing daily partitions, have you exceeded the maximum partitions of 8k per table? Either way, an error message should occur rather than a crash.
Due to I constantly reach memory size limit in my R Session (8GB Windows PC) I start to remove big objects loaded in. However once I reach this limit, removing objects seems not to work.
So, I was wondering if there's a way to get the R Session size. I know that it's possible to retrieve objects' size (saw in this thread).I want to know if there's a way to count the complete R Session size though (loaded packages, objects, etc).
Thank you!
I personally use this function to get the available memory:
getAvailMem <- function(format = TRUE) {
gc()
if (Sys.info()[["sysname"]] == "Windows") {
memfree <- 1024^2 * (utils::memory.limit() - utils::memory.size())
} else {
# http://stackoverflow.com/a/6457769/6103040
memfree <- 1024 * as.numeric(
system("awk '/MemFree/ {print $2}' /proc/meminfo", intern = TRUE))
}
`if`(format, format(structure(memfree, class = "object_size"),
units = "auto"), memfree)
}
To get the total memory used by R, you may try mem_used() from pryr package. Unlike memory.size, this one is not OS dependent, because it uses the R function gc() underneath it. Try to look in the function body and also this pryr:::node_size and pryr:::show_bytes
pryr::mem_used()
The help file ?pryr::mem_used describes
R breaks down memory usage into Vcells (memory used by vectors) and
Ncells (memory used by everything else). However, neither this
distinction nor the "gc trigger" and "max used" columns are typically
important. What we're usually most interested in is the the first
column: the total memory used. This function wraps around gc() to
return the total amount of memory (in megabytes) currently used by R.
You can also use pryr::mem_change to track the size of the memory used by the R code. Try the example in its documentation page.
The numbers such as 28L and 56L used to refer node size with pryr:::node_size comes from the help file of ?gc, which describes
gc returns a matrix with rows "Ncells" (cons cells), usually 28 bytes
each on 32-bit systems and 56 bytes on 64-bit systems, and "Vcells"
(vector cells, 8 bytes each),
After removing a large object run gc() to free memory
I am using a 32 bit perl in my openvms system.(So perl can access up till 2gb of virtual address space ).
I am hitting "out of memory!" in a large perl script. I zeroed in on the location of variable causing this . However after my tests with devel:size it turns out the array is using only 13 Mb memory and the hash is using much less than that.
My question is about memory profiling this perl script in VMS.
is there a good way of doing memory profile on VMS?
I used size to get size of array and hash.(Array is local scope and hash is global scope)
DV Z01 A4:[INTRO_DIR]$ perl scanner_SCANDIR.PL
Directory is Z3:[new_dir]
13399796 is total on array
3475702 is total on hash
Directory is Z3:[new_dir.subdir]
2506647 is total on array
4055817 is total on hash
Directory is Z3:[new_dir.subdir.OBJECT]
5704387 is total on array
6040449 is total on hash
Directory is Z3:[new_dir.subdir.XFET]
1585226 is total on array
6390119 is total on hash
Directory is Z3:[new_dir.subdir.1]
3527966 is total on array
7426150 is total on hash
Directory is Z3:[new_dir.subdir.2]
1698678 is total on array
7777489 is total on hash
(edited: Pmis-spelled GFLQUOTA )
Where is that output coming from? To OpenVMS folks it suggests files in directories, which the code might suck in? There would typically be considerable malloc/align overhead per element saved.
Anyway the available ADDRESSABLE memory when strictly using 32 pointers on OpenVMS is 1GB: 0x0 .. 0x3fffffff, not 2GB, for programs and (malloc) data for 'P0' space. There is also room in P1 (0x7fffffff .. 0x4000000) for thread-local stack storages, but perl does not use (much) of that.
From a second session you can look at that with DCL:
$ pid = "xxxxxxxx"
$ write sys$output f$getjpi(pid,"FREP0VA"), " ", f$getjpi(pid,"FREP1VA")
$ write sys$output f$getjpi(pid,"PPGCNT"), " ", f$getjpi(pid,"GPGCNT")
$ write sys$output f$getjpi(pid,"PGFLQUOTA")
However... those are just addresses ranges, NOT how much memory the process is allowed to used. That's governed by the process page-file-quota. Check with $ SHOW PROC/QUOTA before running perl. And its usage can be reported as per above from the outside adding Private pages and Groups-shared pages as per above.
An other nice way to look at memory (and other quota) is SHOW PROC/CONT ... and then hit "q"
So how many elements are stored in each large active array? How large is an average element, rounded up to 16 bytes? How many hash elements? How large are the key + value on average (round up generously)
What is the exact message?
Does the program 'blow' up right away, or after a while (so you can use SHOW PROC/CONT)
Is there a source file data set (size) that does work?
Cheers,
Hein.
I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor and the `pairwiseAlignment function from the Biostrings package.
This was working fine for smaller scaffolds, but failed when I tried to align as 56kbp scaffold with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern,
: cannot allocate memory block of size 17179869183.7 Gb
I am not sure if this is a bug or not ? ; I was under the impression that the Needleman-Wunsch algorithm used by pairwiseAlignment is an O(n*m) which I thought would imply the computational demand to be on the order of 3.1E9 operations (56K * 56k ~= 3.1E9). It seems the Needleman-Wunsch similarity matrix should as well take up on the order of 3.1 gigs of memory as well. Not sure if I'm not remembering big-o notation correctly or that is actually the memory overhead that would be needed to build the alignment given the overhead of the R scripting environment.
Does anybody have suggestions for a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident BLAST's reliability for correctly placing indels and I have not yet been able to find an api as good as that provided by biostrings for parsing the raw BLAST alignments.
By the way, here is a code snippet that replicates the problem:
library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance
genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance
#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance
curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error:
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
#cannot allocate memory block of size 17179869182.9 Gb
The error does not happen with shorter alignments (hundreds of bases).
I have not yet found the length cutoff where the error starts happening
So I use Clustal as an alignment tool. Not sure about the specific performance, but it has never given me issues when doing multiple sequence alignments of large quantity. Here is a script that runs a whole directory of .fasta files and aligns them. You can modify the flags on the system call to suit your input/output needs. Just look at the clustal documentation. This is in Perl, I don't use R too much for alignments. You need to edit the executable path in the script to match where clustal is on your computer.
#!/usr/bin/perl
use warnings;
print "Please type the list file name of protein fasta files to align (end the directory path with a / or this will fail!): ";
$directory = <STDIN>;
chomp $directory;
opendir (DIR,$directory) or die $!;
my #file = readdir DIR;
closedir DIR;
my $add="_align.fasta";
foreach $file (#file) {
my $infile = "$directory$file";
(my $fileprefix = $infile) =~ s/\.[^.]+$//;
my $outfile="$fileprefix$add";
system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";
}
I want to increase the stack size of an ipad application to 16MB .I have done it in xcode build setting "-WI-stack_size 1000000 to the Other Linker Flags field ".
but getting build error of
i686-apple-darwin10-gcc-4.2.1: stack_size: No such file or directory
i686-apple-darwin10-gcc-4.2.1: 1000000: No such file or directory
How can i resolve that ?
Do not know if you got it solved or not, but the correct way to indicate such option to the linker (according to this site) is to add -Wl,-stack_size,1000000 to the Other Linker Flags field of the Build Styles pane. You are missing the commas.
In case you are using clang or gcc the cli would be:
g++ -Wl,-stack_size -Wl,1000000
Hope it helps.
If you think you need to increase the stack size you're almost certainly doing something very wrong in your code, like allocating way too large objects on the stack (ie a 16 MByte C array).
Instead, allocate the memory or use the proper Objective-C data structures (ie NSMutableArray).