How to handle non-printable ASCII character parameters? - unix

I'm working on a project where we are dealing with importing/exporting data from database tables using ksh scripts and Perl scripts. We have an existing process to export data from a table to a file and it is then imported into another system.
Here's the catch: the export process dumps out pipe-delimited files, while the system doing the import expects files delimited by the ASCII group separator character, which is decimal 29, hexadecimal 1d, or octal 35. It shows up in vi as ^]. Right now, I'm converting the delimiter via a Perl script. What I'd like to do is tell our export process to just use the delimiter we are expecting. Something like:
export_table.ksh -d '\035'
The problem is I can't figure out how to pass this character to the export script.
I've tried all kinds of combinations of single quotes, double quotes, backslashes, and the octal and hex version of this character.
I'm on Solaris 10 using ksh and/or Perl.

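Have you tried ANSI-C quoting? Note that decimal 29 is hex 1d, not hex 29:
$'\x1d'
The ^] that vi shows is a single control character, not a literal ^ followed by a literal ], so passing $'\x5e\x5d' would send the wrong, two-character delimiter.
In a shell that supports $'...' quoting (bash, ksh93), you just need to do:
export_table.ksh -d $'\035'
In older shells such as ksh88, command substitution with printf produces the same byte:
export_table.ksh -d "$(printf '\035')"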
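As a cross-check on the numbers in the question, here is a minimal Python sketch showing that decimal 29, hex 1d, and octal 035 all name the same character, and what the delimiter conversion amounts to. The helper name pipes_to_gs is hypothetical and not part of the original export process.

```python
# Hypothetical helper for illustration; not part of the original export process.
GROUP_SEPARATOR = "\x1d"  # decimal 29, octal 035; vi displays it as ^]


def pipes_to_gs(line: str) -> str:
    """Replace every pipe delimiter with the ASCII group separator."""
    return line.replace("|", GROUP_SEPARATOR)


# The three notations from the question all name the same character.
assert GROUP_SEPARATOR == chr(29) == "\035"

if __name__ == "__main__":
    print(repr(pipes_to_gs("a|b|c")))  # prints 'a\x1db\x1dc'
```

This naive replacement assumes that no field contains a literal pipe.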

In bash(1), one can prefix a character with ^v to enter that character verbatim. Perhaps ksh(1) does the same?

Related

R Shiny: Is there any way to read a single backslash in user input?

I'm making a Shiny app that constructs a bash script to run on a cluster (basically just a txt file). One of the user inputs is a curl command (provided by the database where the files are stored) that they can copy/paste into a textInput field in the app. When run on the cluster, it will download the file for further processing. However, the curl command they provide contains several single backslashes. Example:
curl --cookie jgi_session=/api/sessions/ec32f2d578304a9e62b4646ae2bec6d4 --output download.20210731.211924.zip -d "{\"ids\":[\"5d94dc9fc0d65a87debccfd3\"]}" -H "Content-Type: application/json" https://files.jgi.doe.gov/filedownload/
It works fine if I paste this directly into a script or if I manually add in double backslashes, but I want to keep this as user friendly as possible. Every other post I've seen about this just says to use double backslashes, but I'd rather do this automatically if at all possible. So any ideas? I'm open to alternate solutions, less work for the user the better.
Your code is picking up the curl line as escaped characters. When you write to file, those escaped characters get converted to their actual character (i.e., \" gets converted to a literal ").
To avoid this, replace each special escaped character with the character sequence that literally creates the escape sequence. So to build \" in the final written string, you have to produce \\" as the escaped character sequence (which is what the output of a print command should show).
One way to achieve this for this particular character sequence is
escapedString = gsub('\"', '\\"', curlString, fixed = TRUE)
Note that, in terms of string interpretation, \" is a single character (converting to "), while \\" is a sequence of two characters: an escaped \ and a literal ", converting to \" when written, which is the desired output.

How can I delete the last comma in each record of a comma-delimited csv?

Example Input : A,B,"C,D",E,F,G,
Example Output : A,B,"C,D",E,F,G
The issue I face with using the 'cut' command to accomplish the same is that my data has comma as well.
I wish to do the same in an automated process. So, Linux commands would be helpful.
This should work:
sed 's/,$//g' < input_file.csv > output_file.csv
,$ is a regular expression that matches a comma at the end of each line. This gets replaced with the s command by nothing.
Proof:
$ echo 'A,B,"C,D",E,F,G,' | sed 's/,$//g'
A,B,"C,D",E,F,G
Note that some CSV dialects can also have line endings inside double quotes. If there happens to be a comma right before such a quoted line ending, that comma will also be stripped. If you want to handle this case correctly, you'll need a proper CSV parser.
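For the quoted-line-ending case mentioned above, a minimal sketch using Python's standard csv module might look like the following. The function name strip_trailing_comma is hypothetical, and the sketch assumes that the trailing comma always appears as one empty final field.

```python
import csv
import io


def strip_trailing_comma(text: str) -> str:
    """Drop one trailing empty field from every CSV record.

    Assumption: the unwanted trailing comma shows up as an empty last field.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if row and row[-1] == "":
            row = row[:-1]
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    print(strip_trailing_comma('A,B,"C,D",E,F,G,\n'), end="")
```

Because the parser understands quoting, a comma that sits right before a line break inside a quoted field is left alone. The writer re-quotes fields only where necessary, so quoting in the output may differ from the input.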

How to replace string by an escape character plus string in unix

How can I convert a single line like the one below:
794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|...and so on
to linefeed-delimited records?
Expected Output:
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
and so on.
BTW I'm not a Unix expert and just want steps or simple commands to resolve this. I appreciate your help.
I assume you have your string in a file named "x"; then you can do this.
I use the character ":" as a placeholder that sed inserts before each record. Choose something else if ":" occurs in your string. Then tr changes each ":" to a newline. The output is as you desire, except that there is an extra newline at the beginning.
cat x | sed 's/794170/:794170/g' | tr ':' "\n"
You can use the fold command (this works only because every record in the sample is exactly 32 characters wide):
$ fold -w32 file
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
I don't think you can do it with a simple command. There are several options for creating scripts that can split lines more or less arbitrarily. Any Unix will have the awk utility available. On most systems you will also find Python and Perl. My guess is that a Perl or Python script is the easiest way to split lines like the one you gave.
This would be one way to do it in Python
inline = "794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|"
splits = ['794170' + s for s in inline.split('794170')]
for s in splits[1:]:
    print(s)
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
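If the record key is not always 794170, another option is to regroup the fields by count. The sketch below assumes that every record has exactly eight pipe-terminated fields, as in the sample; the function name split_records and the fields_per_record parameter are hypothetical.

```python
def split_records(line: str, fields_per_record: int = 8) -> list:
    """Split one long pipe-delimited line into fixed-size records.

    Assumption: the line ends with a single trailing pipe and the last
    field of the last record is not empty.
    """
    fields = line.rstrip("|").split("|")
    return [
        "|".join(fields[i:i + fields_per_record]) + "|"
        for i in range(0, len(fields), fields_per_record)
    ]


if __name__ == "__main__":
    sample = (
        "794170|VWSD|AAA|e|h|i|j|STRING1|"
        "794170|VWSD|BBB|q|w|e|r|STRING2|"
    )
    for record in split_records(sample):
        print(record)
```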

Unix sort text file with user-defined newline character

I have a plain text file where the newline character is not "\n" but a special character.
Now I want to sort this file.
Is there a direct way to specify custom new-line character while using unix sort command?
I'd rather not use a script for this, if at all possible.
Please note that the data in the text file contains \n, \r\n, and \t characters (the reason for such data is application-specific, so please don't comment on that).
The sample data is as below:
1111\n1111<Ctrl+A>
2222\t2222<Ctrl+A>
3333333<Ctrl+A>
Here Ctrl+A is the newline character.
Use perl -001e 'print sort <>' to do this:
prompt$ cat -tv /tmp/a
2222^I2222^A3333333^A1111
1111^A
prompt$ perl -001e 'print sort <>' /tmp/a | cat -tv
1111
1111^A2222^I2222^A3333333^Aprompt$
That works because character 001 (octal 1) is control-A ("\cA"), which is your record terminator in this dataset.
You can also use the code point in hex using -0xHHHHH. Note that it must be a single code point, not a string, using this shortcut. There are ways of doing it for strings and even regexes that involve infinitesimally more code.
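The same idea (split on the record terminator, sort, and write the records back) can be sketched in Python for comparison. The function name sort_records is hypothetical. Unlike the Perl one-liner, this sketch always terminates the last record with the separator.

```python
def sort_records(data: bytes, sep: bytes = b"\x01") -> bytes:
    """Sort records that are terminated by `sep` (Ctrl+A by default)."""
    records = data.split(sep)
    # A terminator at the very end leaves one empty trailing piece; drop it.
    if records and records[-1] == b"":
        records.pop()
    return b"".join(record + sep for record in sorted(records))


if __name__ == "__main__":
    raw = b"2222\t2222\x013333333\x011111\n1111\x01"
    print(sort_records(raw))
```

Working on bytes keeps the embedded \n, \r\n, and \t characters untouched.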

Identifying and removing null characters in UNIX

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od -b file-with-nulls | grep ' 000'
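Both steps from the question (finding the affected lines and removing the NULs) can also be sketched in Python. The function names lines_with_nul and strip_nuls are hypothetical, and the sketch assumes lines are separated by \n.

```python
def lines_with_nul(data: bytes) -> list:
    """Return the 1-based numbers of the lines that contain a NUL byte."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"\x00" in line
    ]


def strip_nuls(data: bytes) -> bytes:
    """Remove every NUL byte from the data."""
    return data.replace(b"\x00", b"")


if __name__ == "__main__":
    sample = b"ab\ncd\x00\nef\n"
    print(lines_with_nul(sample))
    print(strip_nuls(sample))
```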
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is an example of how to remove NUL characters using ex (in place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
To recurse into subdirectories, you can use the globbing pattern **/*.txt (if your shell supports it).
This is useful for scripting, since sed's -i option is a non-standard extension that behaves differently in GNU and BSD sed.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
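If the NULs come from UTF-16 data, re-encoding the bytes removes them without deleting anything by hand. The following is a minimal sketch, assuming the input starts with a UTF-16 byte order mark; the function names looks_like_utf16 and utf16_to_utf8 are hypothetical.

```python
import codecs


def looks_like_utf16(raw: bytes) -> bool:
    """Heuristic: does the data start with a UTF-16 byte order mark?"""
    return raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def utf16_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 bytes (byte order taken from the BOM) and re-encode as UTF-8."""
    return raw.decode("utf-16").encode("utf-8")


if __name__ == "__main__":
    raw = "hello".encode("utf-16")  # Python prepends a BOM here
    if looks_like_utf16(raw):
        print(utf16_to_utf8(raw))
```

Without a byte order mark, the "utf-16" codec falls back to the platform's native byte order, so naming "utf-16-le" or "utf-16-be" explicitly is safer in that case.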
Remove a trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDFs from a third party that we needed to upload to our system using a PDF library. In the files being sent to us, there was a null value that was sometimes being appended to the PDF file. When our system processed these files, files that had the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to remove the trailing null value. PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operation:
Take the binary file, convert it to HEX (binary files don't like exploding by new lines or carriage returns), explode the string using carriage return as the delimiter, pop the last member of the array if the value is null, implode the array using carriage return, process the file.
//In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We trim leading and trailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//We create an array using carriage return + line feed (hex 0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//look at the last element. if the last element is equal to 00 we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
//we implode the array using carriage return as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//the new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
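For comparison, the same check can be done directly on the raw bytes, which avoids the hex round trip (matching '0d0a' in a hex string can also hit a position that straddles two bytes). This is a Python sketch rather than PHP, and the function name strip_trailing_nul is hypothetical. Unlike the PHP script above, which drops the final CRLF together with the popped element, this version keeps the CRLF.

```python
def strip_trailing_nul(data: bytes) -> bytes:
    """Remove a single NUL byte that follows the final CRLF, if present."""
    if data.endswith(b"\r\n\x00"):
        return data[:-1]
    return data


if __name__ == "__main__":
    bad = b"%%EOF\r\n\x00"
    print(strip_trailing_nul(bad))
```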
