How to identify errors when loading data into BigQuery - bigdata

While importing a ~5 GB file with ~41 million rows into BigQuery, I received the following error message:
Errors:
File: 0 / Offset:4026531933 / Line:604836 / Field:39, Value cannot be converted to expected type.
My question: how would I use the Offset / Line information in the error message above to determine the line number of the offending record?

For large files, BigQuery splits them up into large pieces and loads them in parallel. That means BigQuery doesn't know how many lines come before a particular piece, since the file was chunked by byte ranges. The offset mentioned is the start of the chunk from the beginning of the file, in bytes. So the error should come at 604836 lines after the 4026531933th byte.
You can isolate the line with the bad value on Unix with:
tail -c +4026531933 <input file> | head -n $((604836 + 1)) | tail -1
Or with sed:
tail -c + | sed -n $(( + 1))p

Related

Exact limit to avoid "Argument list too long "

I know that there are some tricks to avoid the shell's limit, which leads to "argument list too long", but I want to understand why the limit hits in my case (even though it should not). As far as I know the limit of chars in an argument for a command should be able to be determined by the following steps:
Get the maximum argument size by getconf ARG_MAX
Subtract the content of you environment retrieved by env|wc -c
On my machine with Fedora 30 and zsh 5.7.1 this should allow me argument lists with a length of up to 2085763 chars. But I already hit the limit with only 1501000 chars. What did I miss in my calculation?
Minimal working example for reproduction:
Setting up files:
$ for i in {10000..100000}; do touch Testfile_${i}".txt"; done
$ ls Testfile*
zsh: argument list too long: ls
No I deleted stepwise (1000 files per step) files to check when the argument line was short enough to be handled again
for i in {10000..100000..1000}; do echo $(ls|wc -l); rm Testfile_{$i..$((i + 1000))}.txt; ls Testfile_*|wc -l; done
The message zsh: argument list too long: ls stops between 79000 and 78000 remaining files. Each filename has a length of 18 chars (19, including the separating whitespace), so in total at this moment the argument line should have a total length of 79000*19=1501000 respectively 78000*19=1482000 chars.
This result is the same magnitude in comparison to the expected value of 2085763 chars but still it's slighty off. What could explain the difference of 500000 chars?
ADDENDUM1:
Like suggested in the comments I ran xargs --show-limits and the output fits round about my expectation.
$ xargs --show-limits
Your environment variables take up 4783 bytes
POSIX upper limit on argument length (this system): 2090321
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085538
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647
ADDENDUM2:
Following the comment of #Jens I now added 9 Bytes additional overhead to the words (8 Bytes for the pointer, 1 for the terminating NUL-Byte). Now I get the following results (I do not know, how the whitespace is handled, for the moment I leave it out):
79000*(18+9)= 2133000
78000*(18+9)= 2106000
Both values are much closer to the theoretical limit than before...indeed, they are even a bit above it. So together with some safety margin I'm more confident to preestimate the maximal argument length.
Further reading:
There are more posts about this topic, of which none answers the question in a satisfying way, but still they provide good material:
https://unix.stackexchange.com/a/120842/211477
https://www.in-ulm.de/~mascheck/various/argmax/
If you where looking to count files, this worked for me on Mac OSX Ventura (13.1)
find . -maxdepth 2 -name "*.zip" | wc -l
I had 1039999 zip files and the standard "ls /.zip | wc -l" just died ("zsh: argument list too long: ls")

Bash code behaving vague: huge file processing

This is bit of a hadoop & unix mix issue and I'm really not sure which one's responsible for the error.
I've a bash script that performs validation of file in terms of:
Checks whether the total row count of the file matches with the value mentioned in the footer of the file. (Footer is a metadata that contains total number of actual rows above it, and it's positioned at the end of data file)
Checks whether the name of file matches with what it should be.
The function calculating row count & extracting footer count of file is:
count() {
rowCount=$(echo $(hdfs dfs -cat ${hdfspath}/${fileName} | wc -l) - 2 | bc -l)
footer=$(hdfs dfs -tail ${hdfspath}/${fileName} | tail -1)
footerRecordCount=`echo $footer | sed "s/[^0-9]//g"`
}
The code snippet calling the above function & performing validation test is below:
count()
if [[ ${footerRecordCount} -ne ${rowCount} ]]; then
echo "error=Number of records in the file doesn't match with record count value mentioned in the footer of the file" >&2
exit 1
else
fn_logMessage "Footer Record Count $footerRecordCount matched with rows count $rowCount"
fi
if [ ${fileName} -ne ${actualFileName} ]; then
echo "error=File name mismatch"
exit 1
else
echo "File name matched"
fi
The code looks fairly straighforward & simple; it is indeed and works as well perfectly.
However, the issue comes up when I started this test on huge file size (>400 GB). I receive the error as below:
Footer Record Count 00000003370000000002000082238384885577696960005044939533796567041020102349250692990597110000000000000000002222111111111111110200000003440000100013060089448361739204836173971223 matched with rows count 929901602
error=File name mismatch
Strange!!
The footer record count should actually be the number 929901602 but the number that comes up is some random number which doesn't exist anywhere in the file at all. However, even by the looks of it, it doesn't match, the output is thrown as "matched".
While the error of the next if loop is shown.
Not sure who is the culprit here, unix or hadoop. But I performed this test 3 times in a row. Every time the "Huge number" pops up is completely different from the previous one. So, there isn't even a correlation between these large numbers.
Any idea what on earth is going wrong?
PS: Like I said, the code works perfectly for small files like 20 GB, 50 GB.
Thanks in advance.

Ring buffer log file on unix

I'm trying to come up with a unix pipeline of commands that will allow me to log only the most recent n lines of a program's output to a text file.
The text file should never be more than n lines long. (it may be less when it is first filling the file)
It will be run on a device with limited memory/resources, so keeping the filesize small is a priority.
I've tried stuff like this (n=500):
program_spitting_out_text > output.txt
cat output.txt | tail -500 > recent_output.txt
rm output.txt
or
program_spitting_out_text | tee output.txt | tail -500 > recent_output.txt
Obviously neither works for my purposes...
Anyone have a good way to do this in a one-liner? Or will I have to write a script/utility?
Note: I don't want anything to do with dmesg and must use standard BSD unix commands. The "program_spitting_out_text" prints out about 60 lines/second, every second.
Thanks in advance!
If program_spitting_out_text runs continuously and keeps it's file open, there's not a lot you can do.
Even deleting the file won't help since it will still continue to write to the now "hidden" file (data still exists but there is no directory entry for it) until it closes it, at which point it will be really removed.
If it closes and reopens the log file periodically (every line or every ten seconds or whatever), then you have a relatively easy option.
Simply monitor the file until it reaches a certain size, then roll the file over, something like:
while true; do
sleep 5
lines=$(wc -l <file.log)
if [[ $lines -ge 5000 ]]; then
rm -f file2.log
mv file.log file2.log
touch file.log
fi
done
This script will check the file every five seconds and, if it's 5000 lines or more, will move it to a backup file. The program writing to it will continue to write to that backup file (since it has the open handle to it) until it closes it, then it will re-open the new file.
This means you will always have (roughly) between five and ten thousand lines in the log file set, and you can search them with commands that combine the two:
grep ERROR file2.log file.log
Another possibility is if you can restart the program periodically without affecting its function. By way of example, a program which looks for the existence of a file once a second and reports on that, can probably be restarted without a problem. One calculating PI to a hundred billion significant digits will probably not be restartable without impact.
If it is restartable, then you can basically do the same trick as above. When the log file reaches a certain size, kill of the current program (which you will have started as a background task from your script), do whatever magic you need to in rolling over the log files, then restart the program.
For example, consider the following (restartable) program prog.sh which just continuously outputs the current date and time:
#!/usr/bin/bash
while true; do
date
done
Then, the following script will be responsible for starting and stopping the other script as needed, by checking the log file every five seconds to see if it has exceeded its limits:
#!/usr/bin/bash
exe=./prog.sh
log1=prog.log
maxsz=500
pid=-1
touch ${log1}
log2=${log1}-prev
while true; do
if [[ ${pid} -eq -1 ]]; then
lines=${maxsz}
else
lines=$(wc -l <${log1})
fi
if [[ ${lines} -ge ${maxsz} ]]; then
if [[ $pid -ge 0 ]]; then
kill $pid >/dev/null 2>&1
fi
sleep 1
rm -f ${log2}
mv ${log1} ${log2}
touch ${log1}
${exe} >> ${log1} &
pid=$!
fi
sleep 5
done
And this output (from an every-second wc -l on the two log files) shows what happens at the time of switchover, noting that it's approximate only, due to the delays involved in switching:
474 prog.log 0 prog.log-prev
496 prog.log 0 prog.log-prev
518 prog.log 0 prog.log-prev
539 prog.log 0 prog.log-prev
542 prog.log 0 prog.log-prev
21 prog.log 542 prog.log-prev
Now keep in mind that's a sample script. It's relatively intelligent but probably needs some error handling so that it doesn't leave the executable running if you shut down the monitor.
And, finally, if none of that suffices, there's nothing stopping you from writing your own filter program which takes standard input and continuously outputs that to a real ring buffer file.
Then you would simply do:
program_spitting_out_text | ringbuffer 4096 last4k.log
That program could be a true ring buffer in that it treats the 4k file as a circular character buffer but, of course, you'll need a special marker in the file to indicate the write-point, along with a program that can turn it back into a real stream.
Or, it could do much the same as the scripts above, rewriting the file so that it's always below the size desired.
Since apparently this basic feature (circular file) does not exist on GNU/Linux, and because I needed it to track logs on my Raspberry Pi with limited storage, I just wrote the code as suggest above!
Behold: circFS
Unlike other tools quoted on this post and other similar, the maximum size is arbitrary and only limited by the actual available storage.
It does not rotate with several files, all is kept in the single file, which is rewritten on "release".
You can have as many log files as needed in the virtual directory.
It is a single C file (~600 lines including comments), and it builds with a single compile line after having installed fuse development dependencies.
This first version is very basic (see the README), if you want to improve it with some of the TODOs (see the TODO) be welcome to submit pull requests.
As a joke, this is my first "write only" fuse driver! :-)

Different results from awk and nawk

I just ran these two commands on a file having around 250 million records.
awk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
and
nawk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
The record length is 482. The first command gave the correct number of records in file2.txt
i.e.; 60 million but the nawk command gives only 4.2 million.
I am confused and would like to know if someone has come across issue like this. How exactly this simple command being treated in a different way internally? Is there a buffer which can hold only up to certain number of bytes while using nawk?
would appreciate if someone can throw some light on this.
My OS details are
SunOS <hostname> 5.10 Generic_147148-26 i86pc i386 i86pc
The difference probably lies on the buffer limit of Nawk. One of the records (lines) found in your input file has probably exceeded it.
This crucial line can be found in awk.h:
#define RECSIZE (8 * 1024) /* sets limit on records, fields, etc., etc. */
Your command can be reduced to just this:
awk 'substr($0,472,1)==9'
On Solaris (which you are on) when you run awk by default you are running old, broken awk (/usr/bin/awk) so I suspect that nawk is the one producing the correct result.
Run /usr/xpg4/bin/awk with the same script/arguments and see which of your other results it's output agrees with.
Also, check if your input file was created on Windows by running dos2unix on it and see if it's size changes and, if so, re-run your awk commands on the modified files. If it was created on Windows then it will have some control-Ms in there that could be causing chaos.

grep -f maximum number of patterns?

I'd like to use grep on a text file with -f to match a long list (10,000) of patterns. Turns out that grep doesn't like this (who, knew?). After a day, it didn't produce anything. Smaller lists work almost instantaneously.
I was thinking I might split my long list up and do it a few times. Any idea what a good maximum length for the pattern list might be?
Also, I'm rather new with unix. Alternative approaches are welcome. The list of patterns, or search terms, are in a plaintext file, one per line.
Thank you everyone for your guidance.
From comments, it appears that the patterns you are matching are fixed strings. If that is the case, you should definitely use -F. That will increase the speed of the matching considerably. (Using 479,000 strings to match on an input file with 3 lines using -F takes under 1.5 seconds on a moderately powered machine. Not using -F, that same machine is not yet finished after several minutes.)
i got the same problem with approx. 4 million patterns to search for in a file with 9 million lines. Seems like it is a problem of RAM. so i got this neat little work around which might be slower than splitting and joining but it just need this one line.
while read line; do grep $line fileToSearchIn;done < patternFile
I needed to use the work around since the -F flag is no solution for that large files...
EDIT: This seems to be really slow for large files. After some more research i found 'faSomeRecords' and really other awesome tools from Kent NGS-editing-Tools
I tried it on my own by extracting 2 million fasta-rec from 5.5million records file. Took approx. 30 sec..
cheers
EDIT: direct download link
Here is a bash script you can run on your files (or if you would like, a subset of your files). It will split the key file into increasingly large blocks, and for each block attempt the grep operation. The operations are timed - right now I'm timing each grep operation, as well as the total time to process all the sub-expressions.
Output is in seconds - with some effort you can get ms, but with the problem you are having it's unlikely you need that granularity.
Run the script in a terminal window with a command of the form
./timeScript keyFile textFile 100 > outputFile
This will run the script, using keyFile as the file where the search keys are stored, and textFile as the file where you are looking for keys, and 100 as the initial block size. On each loop the block size will be doubled.
In a second terminal, run the command
tail -f outputFile
which will keep track of the output of your other process into the file outputFile
I recommend that you open a third terminal window, and that you run top in that window. You will be able to see how much memory and CPU your process is taking - again, if you see vast amounts of memory consumed it will give you a hint that things are not going well.
This should allow you to find out when things start to slow down - which is the answer to your question. I don't think there's a "magic number" - it probably depends on your machine, and in particular on the file size and the amount of memory you have.
You could take the output of the script and put it through a grep:
grep entire outputFile
You will end up with just the summaries - block size, and time taken, e.g.
Time for processing entire file with blocksize 800: 4 seconds
If you plot these numbers against each other (or simply inspect the numbers), you will see when the algorithm is optimal, and when it slows down.
Here is the code: I did not do extensive error checking but it seemed to work for me. Obviously in your ultimate solution you need to do something with the outputs of grep (instead of piping it to wc -l which I did just to see how many lines were matched)...
#!/bin/bash
# script to look at difference in timing
# when grepping a file with a large number of expressions
# assume first argument = name of file with list of expressions
# second argument = name of file to check
# optional third argument = initial block size (default 100)
#
# split f1 into chunks of 1, 2, 4, 8... expressions at a time
# and print out how long it took to process all the lines in f2
if (($# < 2 )); then
echo Warning: need at leasttwo parameters.
echo Usage: timeScript keyFile searchFile [initial blocksize]
exit 0
fi
f1_linecount=`cat $1 | wc -l`
echo linecount of file1 is $f1_linecount
f2_linecount=`cat $2 | wc -l`
echo linecount of file2 is $f2_linecount
echo
if (($# < 3 )); then
blockLength=100
else
blockLength=$3
fi
while (($blockLength < f1_linecount))
do
echo Using blocks of $blockLength
#split is a built in command that splits the file
# -l tells it to break after $blockLength lines
# and the block$blockLength parameter is a prefix for the file
split -l $blockLength $1 block$blockLength
Tstart="$(date +%s)"
Tbefore=$Tstart
for fn in block*
do
echo "grep -f $fn $2 | wc -l"
echo number of lines matched: `grep -f $fn $2 | wc -l`
Tnow="$(($(date +%s)))"
echo Time taken: $(($Tnow - $Tbefore)) s
Tbefore=$Tnow
done
echo Time for processing entire file with blocksize $blockLength: $(($Tnow - $Tstart)) seconds
blockLength=$((2*$blockLength))
# remove the split files - no longer needed
rm block*
echo block length is now $blockLength and f1 linecount is $f1_linecount
done
exit 0
You could certainly give sed a try to see whether you get a better result, but it is a lot of work to do either way on a file of any size. You didn't provide any details on your problem, but if you have 10k patterns I would be trying to think about whether there is some way to generalize them into a smaller number of regular expressions.
Here is a perl script "match_many.pl" which addresses a very common subset of the "large number of keys vs. large number of records" problem. Keys are accepted one per line from stdin. The two command line parameters are the name of the file to search and the field (white space delimited) which must match a key. This subset of the original problem can be solved quickly since the location of the match (if any) in the record is known ahead of time and the key always corresponds to an entire field in the record. In one typical case it searched 9400265 records with 42899 keys, matching 42401 of the keys and emitting 1831944 records in 41s. The more general case, where the key may appear as a substring in any part of a record, is a more difficult problem that this script does not address. (If keys never include white space and always correspond to an entire word the script could be modified to handle that case by iterating over all fields per record, instead of just testing the one, at the cost of running M times slower, where M is the average field number where the matches are found.)
#!/usr/bin/perl -w
use strict;
use warnings;
my $kcount;
my ($infile,$test_field) = #ARGV;
if(!defined($infile) || "$infile" eq "" || !defined($test_field) || ($test_field <= 0)){
die "syntax: match_many.pl infile field"
}
my %keys; # hash of keys
$test_field--; # external range (1,N) to internal range (0,N-1)
$kcount=0;
while(<STDIN>) {
my $line = $_;
chomp($line);
$keys {$line} = 1;
$kcount++
}
print STDERR "keys read: $kcount\n";
my $records = 0;
my $emitted = 0;
open(INFILE, $infile ) or die "Could not open $infile";
while(<INFILE>) {
if(substr($_,0,1) =~ /#/){ #skip comment lines
next;
}
my $line = $_;
chomp($line);
$line =~ s/^\s+//;
my #fields = split(/\s+/, $line);
if(exists($keys{$fields[$test_field]})){
print STDOUT "$line\n";
$emitted++;
$keys{$fields[$test_field]}++;
}
$records++;
}
$kcount=0;
while( my( $key, $value ) = each %keys ){
if($value > 1){
$kcount++;
}
}
close(INFILE);
print STDERR "records read: $records, emitted: $emitted; keys matched: $kcount\n";
exit;

Resources