I have a huge (39 GB) text file that I eventually need to read into R as a pipe-delimited file. However, there are lots of NUL characters \0 that do not read in R. I'm trying to replace them in PowerShell beforehand.
PowerShell code:
Get-Content file.txt | foreach { $_ -replace '\\0' } | Out-File -Encoding UTF8 file_NEW.txt
I thought this worked but when I try to read the new file in R, \0 characters appear in the string and I get this error:
Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, :
embedded nul in string: '||MORALES BELINDA F TRUST||\0||\0|0||PT||||33.824049|-118.192053||3655||N|WESTON|PL|||LONG BEACH|CA|908073855|C033||3655||N|WESTON|PL|||LONG BEACH|CA|908073855|C033|20111117|988||||||||20111027|20111110|TR|2|1527575||KINECTA FCU|KINECTA FCU|||MANHATTAN BEACH|CA|90266|047978|LAWYERS TITLE|03003|232000.00|20111027||CNV|TR|Y|10|20211201|||D|BGJT|V||115|21||0|0||0|||||Y|Y||\r\n06037|5054001029|5054-001-029|1|\0||BONILLA ...
Why are there still NULs in the file? ANY help appreciated! Especially because these functions take so long to run. Please, I'm just trying to read this huge file.
Just in case there is an error in the R code, note it is taken directly from this post using the vroom and arrow packages to read then create parquet files.
The reason is that doubling the backslash, \\, escapes the backslash itself. Instead of matching NUL (0x00), the pattern matches the literal text \0, which is two characters.
The correct syntax would be like so,
-replace '\0'
That being said, a large file can be processed in a smarter way. A fast approach is to process, say, 10,000 lines at a time. See an earlier answer about how to process a file in blocks.
vonPryz' helpful answer shows the immediate problem with your -replace operation: in a single-quoted PowerShell string ('...'), \ chars. do not need escaping as \\ in order to be passed verbatim to the .NET regex engine that the -replace operator uses behind the scenes.
Thus, '\0', when passed to the .NET regex engine from PowerShell, is sufficient to represent a NUL character (a Unicode character whose code point is 0); alternatively, you could use "`0", a double-quoted, expandable PowerShell string, in which ` serves as the escape character.
r2evans' helpful answer shows an alternative solution via the Windows ports of standard Unix utilities that come with the optional Rtools download, where piping the input file to
tr -d '\0' may offer the fastest solution, if both the input and the output file use the same character encoding.
In the realm of PowerShell, using Get-Content with its default line-by-line processing with such a large input file would take too long in practice.
While direct use of .NET APIs may offer the ultimately fastest solution, using Get-Content's
-ReadCount parameter offers a simpler, more PowerShell-idiomatic solution:
Get-Content -ReadCount 1e6 file.txt | foreach { $_ -replace '\0' } |
Out-File -Encoding UTF8 file_NEW.txt
-ReadCount 1e6 reads 1 million lines (1e6 uses exponential notation, i.e. 10 to the power of 6) at once and passes them as an array to the ForEach-Object cmdlet (one of whose built-in aliases is foreach); since the -replace operator is capable of operating on an array of values as its LHS, the NUL substitution can be performed on all elements of the array at once.
Depending on how many bytes make up the average line in your input file, you can adjust this number upward, if you have more memory available, or downward, if you have less. The higher the number you can use, the faster the command will complete.
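For comparison, the same block-wise idea can be sketched outside PowerShell. The Python below is a minimal sketch, not a tuned implementation: the function name strip_nuls and the 64 MiB chunk size are illustrative assumptions, and it assumes an encoding in which the byte 0x00 only ever means NUL (ASCII, UTF-8, single-byte code pages), so it is not suitable for UTF-16 input.

```python
def strip_nuls(src_path, dst_path, chunk_size=64 * 1024 * 1024):
    """Copy src_path to dst_path, dropping every NUL (0x00) byte.

    Returns the number of NUL bytes removed. Works on raw bytes, so no
    decoding happens and line endings are left exactly as they were.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            cleaned = chunk.replace(b"\x00", b"")
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Because a NUL is a single byte, a chunk boundary can never split one, so fixed-size reads are safe here; memory use is bounded by chunk_size regardless of the file size.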
I don't know powershell enough to fix that, but you can use sed or tr to replace the nuls in the files. The tr and sed utilities are available by default on most (all?) unix-like OSes including macos. For windows, they are included in Rtools35 and Rtools40.
If you do not find it with Sys.which("tr"), then you may need to include the full path to the respective utility. Assuming Rtools is installed on the root c:/, then something like
Rtools35: c:/Rtools/bin/tr.exe
Rtools40: c:/Rtools40/usr/bin/tr.exe
They are also included in Git-for-Windows as /usr/bin/tr.exe and /usr/bin/sed.exe within git-bash. (On the file-system, they are likely under c:/Program Files/Git/usr/bin/.)
(Same locations for sed.)
I should note that I'm doing this through R's system2 as a convenience only. If you're comfortable enough on the bash command line, then this is just as easy to perform there instead.
data generation
I don't know where the nuls are in your file, so I'll assume that they are record (line) terminators. That is, in most files you'll see each line ending with \n or \r\n, but for this example I'll replace the \n with \0 (nul).
charToRaw("a|b\nhello|world")
# [1] 61 7c 62 0a 68 65 6c 6c 6f 7c 77 6f 72 6c 64
ch <- charToRaw("a|b\nhello|world")
ch[ch == charToRaw("\n")] <- as.raw(0)
ch
# [1] 61 7c 62 00 68 65 6c 6c 6f 7c 77 6f 72 6c 64
writeBin(ch, "raw.txt")
readLines("raw.txt")
# Warning in readLines("raw.txt") :
# line 1 appears to contain an embedded nul
# Warning in readLines("raw.txt") :
# incomplete final line found on 'raw.txt'
# [1] "a|b"
The nul is a problem (as intended), so we don't see anything after the embedded nul.
tr
tr doesn't like doing things in place, so this takes as input the original file and generates a new file. If file-size and disk space is a concern, then perhaps sed would be preferred.
system2("tr", c("\\0", "\\n"), stdin = "raw.txt", stdout = "raw2.txt")
readLines("raw2.txt")
# Warning in readLines("raw2.txt") :
# incomplete final line found on 'raw2.txt'
# [1] "a|b" "hello|world"
(That warning is safe to ignore here.)
sed
sed can optionally work in-place with the -i argument. (Without it, it can operate the same as tr: generate a new file based on the original.)
system2("sed", c("-i", "s/\\x0/\\n/g", "raw.txt"))
readLines("raw.txt")
# Warning in readLines("raw.txt") :
# incomplete final line found on 'raw.txt'
# [1] "a|b" "hello|world"
(That warning is safe to ignore here.)
other than record-terminator
If the nul is not the record terminator (\n-like) character, then you have some options:
Replace the \0 character with something meaningful, such as Z (stupid, but you get the point). This should use the above commands as-is, replacing the \\n with your character of choice. (tr requires a single character; sed can replace it with multiple characters if you like.)
Delete the \0 completely, in which case you can use tr -d '\0' and sed -i -e 's/\x0//g' (translated into R's system2 calls above).
Related
I'm dealing with very big files (~10 GB) containing words with ASCII representations of Unicode characters:
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
I want to transform them into Unicode before inserting them into a database, like this:
Nuray Özdemir
Erol Čolaković Šehić
I've seen how to do it with vim but it's very slow for very large file. I thought copy/paste of the regex would be OK but it's not.
I actually get things like this:
$ echo "Nuray \u00d6zdemir" | sed -E 's/\\\u(.)(.)(.)(.)/\x\1\x\2\x\3\x\4/g'
Nuray x0x0xdx6zdemir
How can I concatenate the \x and the value of \1 \2...?
I don't want to use echo or an external program due to the size of the file, I want something efficient.
Assuming the unicodes in your file are within BMP (16bit), how about:
perl -pe 'BEGIN {binmode(STDOUT, ":utf8")} s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' input_file > output_file
Output:
Nuray Özdemir
Erol Čolaković Šehić
I have generated a 6Gb file to test the speed efficiency.
It took approx. 10 minutes to process the entire file on my 6 year old laptop.
I hope it will be acceptable to you.
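The same substitution can be sketched in Python, for readers who prefer it to Perl. This is a minimal sketch under the same BMP assumption as above; the names decode_u_escapes and convert_file are hypothetical, and surrogate escapes (\ud800 to \udfff) are not handled.

```python
import re

# Exactly four hex digits after a literal backslash-u, as in the Perl regex.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Replace each \\uXXXX escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def convert_file(src_path, dst_path):
    """Stream src_path line by line and write the decoded text as UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="ascii", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(decode_u_escapes(line))
```

Matching exactly four hex digits matters for input such as \u0160ehi: the character after the escape is itself a hex digit, and a greedy parser would consume it as part of the code point.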
I am not a mongoDB expert at all but what I can tell you is the following:
If there is a way to do the conversion at import time, directly within the DB engine, that solution should be preferred. If this feature is not available,
you can either use a naive approach to solve it:
while read -r line; do echo -e "$line"; done < input_file
INPUT:
cat input_file
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
OUTPUT:
Nuray Özdemir
Erol Čolaković Šehić
But as you have spotted yourself the call to echo -e at each line will create a resource intensive change of context (generate a sub-process for echo -> memory allocation, new entry in the processes table, priority management, switching back to the parent process) that is not efficient for 10GB files.
Or go for a smarter approach using tools that should be available in your distro, for example:
whatis ascii2uni
ascii2uni (1) - convert 7-bit ASCII representations to UTF-8 Unicode
Command:
ascii2uni -a U -q input_file
Nuray Özdemir
Erol Čolaković ᘎhić
You can also split the input file into pieces (e.g. with the split command), run the conversion step on each sub-file in parallel, and import each converted piece as soon as it is available to shorten the total execution time.
Take a look at this example please:
$ cat < demo
man
car$
$
$ od -x < demo
0000000 616d 0a6e 6163 0072
0000007
$
$ wc < demo
1 2 7
As you can see, I've got 3 characters there (man: 6d 61 6e) followed by a newline (\n: 0a) and then another three (car: 63 61 72) terminated with a NUL character (00). Clearly, there are two lines in that file, but the wc command reports that the file has got only one. What gives? Or do you think that in order to qualify as a line in Unix you must be terminated with a newline character? NUL doesn't count?
Or do you think that in order to qualify as a line in Unix you must be
terminated with a newline character?
Actually, yes - even POSIX says that:
The wc utility shall read one or more input files and, by default,
write the number of newlines, words, and bytes contained in each
input file to the standard output.
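The rule quoted above can be checked with a few lines of Python. This is only an illustration: the function names are hypothetical, and the sample bytes are an assumption reconstructed from the 7-byte count that wc reports for demo.

```python
def count_like_wc(data):
    """Return (newlines, words, bytes), mirroring the default wc columns."""
    return data.count(b"\n"), len(data.split()), len(data)


def count_records(data):
    """Count lines the way a reader would, including an unterminated last one."""
    return len(data.splitlines())


demo = b"man\ncar"  # 7 bytes; the last line has no trailing newline

print(count_like_wc(demo))   # newline count, word count, byte count
print(count_records(demo))   # number of visible lines
```

The first function counts newline characters, which is why an unterminated final line is not included; the second counts the records themselves.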
better use
awk '{ print }' demo| wc -l
I have a simple and annoying problem, and I apologize for not posting an example. The files are big and I haven't been able to recreate the exact issue using smaller files:
These are tab-delimited files (some entries contain " ; or a single space character). On UNIX, when I access a unique word via: nl file | sed -n '/word/p' I see that my word is on exactly the same line in all my files.
Now I copy the files to my mac. I run the same command on the same exact files, but the line numbers are all different! The total number of lines via wc -l is still identical to the numbers I get in unix, but when I do nl file | tail -n1 I see a different number. Yet, when I enter the number returned from my unix nl, and access the same line via sed '12345p' file I get the correct entry!?
My question: I must have something in some of my lines that is interpreted as linebreaks on my mac but not in unix, and only by nl not sed. Can anyone help me figure out what it is? I already know it's not on every line. I found this issue persists when I load the data into R, and I'm stumped. Thank you!
"Phantom newlines" can be hidden in text in the form of a multi-byte UTF-8 character called an "overlong sequence".
UTF-8 normally represents ASCII characters as themselves: UTF-8 bytes in the range 0 to 127 are just those character values. However, overlong sequences can be used to (incorrectly) encode ASCII characters using multiple UTF-8 bytes (which are in the range 0x80-0xFF). A properly written UTF-8 decoder must detect overlong sequences and somehow flag them as invalid bytes. A naively written UTF-8 decoder will simply extract the implied character.
Thus, it's possible that your data is being treated as UTF-8, and contains some bytes which look like an overlong sequence for a newline, and that this is fooling some of the software you are working with. A two-byte overlong sequence for a newline would look like C0 8A, and a three-byte overlong sequence would be E0 80 8A.
It is hard to come up with an alternative hypothesis not involving character encodings.
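To make the difference between the two kinds of decoder concrete, here is a small Python sketch. The function naive_decode_2byte is a hypothetical stand-in for a carelessly written decoder; Python's built-in UTF-8 codec plays the role of the strict one.

```python
def naive_decode_2byte(b0, b1):
    """Decode a 2-byte UTF-8 sequence WITHOUT checking for overlong forms."""
    # Take the low 5 bits of the lead byte and the low 6 bits of the
    # continuation byte, and glue them together.
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def strict_decode(raw):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong_newline = bytes([0xC0, 0x8A])

print(repr(naive_decode_2byte(0xC0, 0x8A)))  # the naive decoder extracts a character
print(strict_decode(overlong_newline))       # the strict decoder rejects the bytes
```

A tool built like the first function would see a line break where a tool built like the second sees invalid input, which is one way two programs can disagree about line numbers in the same file.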
I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
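If line numbers are wanted rather than the lines themselves, a short Python sketch along the same lines could look like this. The function name lines_with_nuls is hypothetical, and lines are assumed to be terminated by \n.

```python
def lines_with_nuls(path):
    """Yield (line_number, nul_count) for every line containing NUL bytes."""
    # Binary mode: NULs survive intact and no decoding errors can occur.
    with open(path, "rb") as f:
        for number, line in enumerate(f, start=1):
            count = line.count(b"\x00")
            if count:
                yield number, count
```

Usage would be something like: for number, count in lines_with_nuls("file-with-nulls"): print(number, count). The file is read one line at a time, so it does not need to fit in memory.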
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is example how to remove NULL characters using ex (in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursivity, you may use globbing option **/*.txt (if it is supported by your shell).
This is useful for scripting, since sed's -i parameter is a non-standard BSD extension.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
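When the NULs come from UTF-16, re-encoding the whole file to UTF-8 removes them as a side effect. The following is a minimal sketch using only built-in open(); the function name utf16_to_utf8 and the chunk size are assumptions, and the "utf-16" codec relies on a byte-order mark to pick the endianness (without one it falls back to the platform's native order).

```python
def utf16_to_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Re-encode a UTF-16 text file as UTF-8, streaming in chunks."""
    # newline="" disables newline translation so \r\n stays \r\n.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # reads characters, not bytes
            if not chunk:
                break
            dst.write(chunk)
```

Reading in text mode lets the codec handle characters that straddle a chunk boundary, which a byte-level approach would have to manage by hand.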
Remove the trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDF's from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, there was a null value that was sometimes being appended to the PDF file. When our system processed these files, files that had the trailing NULL value caused the system to crash.
Originally we were using sed but sed behaves differently on Macs and Linux machines. We needed a platform independent method to extract the trailing null value. Php was the best option. Also, it was a PHP application so it made sense :)
This script performs the following operation:
Take the binary file, convert it to HEX (binary files don't like exploding by new lines or carriage returns), explode the string using carriage return plus line feed (0d0a) as the delimiter, pop the last member of the array if the value is null, implode the array using the same delimiter, process the file.
//In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We convert the string into hex, then trim any leading and trailing whitespace
$bin2hex = trim(bin2hex($fd));
//We create an array using CR+LF (hex 0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//look at the last element. if the last element is equal to 00 we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
//we implode the array using CR+LF (hex 0d0a) as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//the new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
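For comparison, the same check can be sketched in Python by working on the bytes directly instead of going through a hex string. This is not the script above translated line by line; the function name drop_trailing_nul is hypothetical, and it assumes, as the PHP does, that the unwanted NUL follows a final CR+LF.

```python
def drop_trailing_nul(data):
    """Remove a final CR+LF+NUL from data, if present.

    Mirrors the PHP above: when the segment after the last CR+LF is exactly
    one NUL byte, that segment and the CR+LF before it are both dropped.
    """
    suffix = b"\r\n\x00"
    if data.endswith(suffix):
        return data[:-len(suffix)]
    return data
```

Checking the byte suffix avoids splitting on a hex string, where the pattern 0d0a could also match across two unrelated bytes.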
Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).
In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view if the encoding is 'binary' or 'text', it doesn't really matter. And of course files just store bytes of data so they are all 'binary' and 'text' doesn't mean anything without knowing the encoding. And yet, it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.
However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.
So, the question is, how do you tell if a file is 'text' or 'binary'? And to restrict it further, how do you tell on a Linux-like file-system? I am not aware of any filesystem meta-data that indicates the 'type' of a file, so the question further becomes, by inspecting the content of a file, how do I tell if it is 'text' or 'binary'? And for simplicity, let's restrict 'text' to mean characters which are printable on the user's console. And in particular how would you implement this? (I thought this was implied on this site, but I guess it is helpful, in general, to be pointed at existing code that does this, I should have specified), I'm not really after what existing programs can I use to do this.
You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.
file README
README: ASCII English text, with very long lines
file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
The spreadsheet software my company makes reads a number of binary file formats as well as text files.
We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
You can determine the MIME type of the file with
file --mime FILENAME
The shorthand is file -i on Linux and file -I (capital i) on macOS (see comments).
If it starts with text/, it's text, otherwise binary. The only exception are XML applications. You can match those by looking for +xml at the end of the file type.
To list text file names in current dir/subdirs:
grep -rIl ''
Binaries:
grep -rIL ''
To check for a particular file:
grep -qI '' FILE
An exit status of 0 then means the file is text; 1 means binary.
To check:
echo $?
Key option is this:
-I Process a binary file as if it did not contain matching data;
Other options:
-r, --recursive
Read all files under each directory, recursively;
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed.
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected.
Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T to test for text). Here's a shell one-liner to list text files:
$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'
(Note that those underscores without a preceding dollar are correct (RTFM).)
Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). It gets a little more complicated for Unicode.
To distinguish a unicode text file, MSDN offers some great advice as to what to do.
The gist of it is to first inspect up to the first four bytes:
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually since a single character can be represented by a variable number of bytes. Also, if you're really anal, you'll want to use the locale variant of iswprint if that's available on your platform.
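The byte-order-mark table above translates almost directly into code. Below is a minimal Python sketch; the names BOMS and sniff_bom are hypothetical, and a file without a BOM simply yields None, so a real detector would need further heuristics.

```python
# Longer marks come first: the UTF-32 LE mark begins with the UTF-16 LE mark,
# so checking the 2-byte marks first would misreport UTF-32 files.
BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def sniff_bom(head):
    """Return the encoding implied by a leading BOM in head, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

Only the first four bytes of the file are needed, e.g. sniff_bom(open(path, "rb").read(4)). Note that FF FE 00 00 is inherently ambiguous: it could also be UTF-16 little-endian text whose first character is NUL, so this sketch resolves the tie in favor of UTF-32.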
It's an old topic, but maybe someone will find this useful.
If you have to decide in a script whether something is a text file, then you can simply do this:
if file -i "$1" | grep -q text;
then
.
.
fi
This gets the file type, and with a silent grep you can decide whether it is text.
You can use libmagic which is a library version of the Unix file command line (source).
There are wrappers for many languages:
Python
.NET
Nodejs
Ruby
Go
Rust
Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing if those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII characters). For finer distinction there's always the 'file' command on UNIX-like systems.
One simple check is if it has \0 characters. Text files don't have them.
As previously stated *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.
This file, called magic was historically stored in /etc, although this may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within the file and can then examine these locations to determine the type of the file.
The structure and description of the magic file can be found by consulting the relevant manual page (man magic)
As for an implementation, well that can be found within file.c itself, however the relevant portion of the file command that determines whether it is readable text or not is the following
/* Make sure we are dealing with ascii text before looking for tokens */
for (i = 0; i < nbytes - 1; i++) {
if (!isascii(buf[i]) ||
(iscntrl(buf[i]) && !isspace(buf[i]) &&
buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'
)
)
return 0; /* not all ASCII */
}
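For readers who want to experiment with this check outside C, here is a rough Python port. It is a sketch, not the code file actually runs: the function name is hypothetical, the C-locale definitions of isascii, iscntrl and isspace are assumed, and unlike the loop above (which stops at nbytes - 1) it examines every byte in the buffer.

```python
def looks_like_ascii_text(buf):
    """Return True if buf contains only ASCII text bytes, per the check above."""
    allowed_controls = {0x08, 0x1A, 0x1B}               # '\b', '\032', '\033'
    whitespace = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20}   # C-locale isspace()
    for byte in buf:
        if byte > 0x7F:                                  # !isascii()
            return False
        is_control = byte < 0x20 or byte == 0x7F         # iscntrl()
        if is_control and byte not in whitespace and byte not in allowed_controls:
            return False
    return True
```

A NUL byte is a control character that is neither whitespace nor in the allowed set, so any buffer containing \0 is reported as not text, which agrees with the simpler NUL check mentioned earlier.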