PowerShell - UNIX ANSI file encoding being changed and generating CRLF - unix

I am using PowerShell in Windows to replace a '£' with a '$' in a file generated on Unix. The problem is that the output file has CRLF at the end of each line rather than the LF it originally had. When I look at the original file in Notepad++ the status bar tells me it is Unix ANSI; I want to keep this format and have LF at the end of each line.
I have tried all the encoding options with no success, and I have also tried Set-Content instead of Out-File. My code is:
$old = '£'
$new = '$'
$encoding = 'UTF8'
(Get-Content $fileInfo.FullName) | % {$_ -replace $old, $new} | Out-File -filepath $fileInfo.FullName -Encoding $encoding
Thanks for any help
Jamie

Keith Hill made a cmdlet for this, ConvertTo-UnixLineEnding; you can find it in the PowerShell Community Extensions.

I realise that this is a very old question now, but I stumbled across it when I encountered a similar issue and thought I would share what worked for me. This may help other coders in future without the need for a third-party cmdlet.
When reading in the Unix-format file, that is, with LF line terminators rather than the CRLF Windows-style line terminators, simply use the -Raw parameter after the filename in your Get-Content command, then output with an encoding type of STRING. UTF8 encoding may have the same result, but STRING worked for my requirements.
The specific command I had an issue with read in a template file, replaced some variables, then output to a new file. The original template is Unix style, but the output was coming out Windows style until I added the -Raw parameter as follows. Note that this is a PowerShell command called from a batch file, hence its formatting.
powershell -Command "get-content master.template -Raw | %%{$_ -replace \"#MASTERIP#\",\"%MASTERIP%\"} | %%{$_ -replace \"#SLAVEIP#\",\"%SLAVEIP%\"} | set-content %MYFILENAME%-%MASTERIP%.cfg -Encoding STRING"
I use Notepad++ to check the formatting of the resulting file and this did the trick in my case.

Related

Transform hexadecimal representation to unicode

I'm dealing with very big files (~10 GB) containing words with ASCII representations of Unicode characters:
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
I want to transform them into Unicode before inserting them into a database, like this:
Nuray Özdemir
Erol Čolaković Šehić
I've seen how to do it with vim, but it's very slow for very large files. I thought copying/pasting the regex would be OK, but it's not.
I actually get things like this:
$ echo "Nuray \u00d6zdemir" | sed -E 's/\\\u(.)(.)(.)(.)/\x\1\x\2\x\3\x\4/g'
Nuray x0x0xdx6zdemir
How can I concatenate the \x and the value of \1 \2...?
I don't want to use echo or an external program due to the size of the file; I want something efficient.
Assuming the Unicode code points in your file are within the BMP (16-bit), how about:
perl -pe 'BEGIN {binmode(STDOUT, ":utf8")} s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' input_file > output_file
Output:
Nuray Özdemir
Erol Čolaković Šehić
I have generated a 6 GB file to test the speed.
It took approx. 10 minutes to process the entire file on my 6-year-old laptop.
I hope it will be acceptable to you.
I am not a MongoDB expert at all, but what I can tell you is the following:
If there is a way to do the conversion at import time directly within the DB engine, that solution should be used. If that feature is not available,
you can either use a naive approach to solve it:
while read -r line; do echo -e "$line"; done < input_file
INPUT:
cat input_file
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
OUTPUT:
Nuray Özdemir
Erol Čolaković Šehić
But as you have spotted yourself, the call to echo -e for each line creates a resource-intensive change of context (a sub-process for echo: memory allocation, a new entry in the process table, priority management, switching back to the parent process) that is not efficient for 10 GB files.
Or go for a smarter approach using tools that should be available in your distro, for example:
whatis ascii2uni
ascii2uni (1) - convert 7-bit ASCII representations to UTF-8 Unicode
Command:
ascii2uni -a U -q input_file
Nuray Özdemir
Erol Čolaković ᘎhić
You can also split the input file into pieces (e.g. with the split command), run the conversion step on each piece in parallel, and import each converted piece as soon as it is available, to shorten the total execution time.
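A minimal sketch of that split-and-parallelise idea, reusing the perl one-liner above; the chunk size, output names and use of shell background jobs are just illustrative assumptions:
# split into ~1M-line chunks, convert each chunk in the background, then wait for all of them
split -l 1000000 input_file part_
for f in part_*; do
  perl -pe 'BEGIN {binmode(STDOUT, ":utf8")} s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' "$f" > "$f.utf8" &
done
wait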

unix diff to file

I'm having a little trouble getting the output of diff to write to a file. I have a new and an old version of a .strings file, and I want to be able to write the diff between these two files to a .strings.diff file.
Here's where I am right now:
diff -u -a -B $PROJECT_DIR/new/Localizable.strings $PROJECT_DIR/old/Localizable.strings >> $PROJECT_DIR/diff/Localizable.strings.diff
fgrep + $PROJECT_DIR/diff/Localizable.strings.diff > $PROJECT_DIR/diff/Localizable.txt
The diff command writes to Localizable.strings.diff without any issues, but Localizable.strings.diff appears to be a binary file. Is there any way to output the diff to a UTF-8 encoded file instead?
Note that I'm trying to just get the additions using fgrep in my second command. If there's an easier way to do this, please let me know.
Thanks,
Sean
First, you probably need to identify the encoding of the Localizable.strings files. This might be done in a manner described by How to find encoding of a file in Unix via script(s), for example.
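For example, on many systems file itself can report the encoding; a sketch assuming GNU file's --mime-encoding option is available (on macOS, file -I gives similar information):
file --mime-encoding $PROJECT_DIR/new/Localizable.strings
# prints something like: Localizable.strings: utf-16le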
Then you probably need to convert the Localizable.strings files to UTF-8 with a tool like iconv, using commands something like:
iconv -f x -t UTF-8 $PROJECT_DIR/new/Localizable.strings >Localizable.strings.new.utf8
iconv -f x -t UTF-8 $PROJECT_DIR/old/Localizable.strings >Localizable.strings.old.utf8
Where x is the actual encoding in a form recognized by iconv. You can use iconv --list to show all the encodings it knows about.
Then you should be able to diff without having to use -a:
diff -u -B Localizable.strings.old.utf8 Localizable.strings.new.utf8 >Localizable.strings.diff.utf8
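If you then only want the added lines from that diff, something like this may be simpler than fgrep +; it keeps the + lines but drops the +++ file header:
grep '^+' Localizable.strings.diff.utf8 | grep -v '^+++' > Localizable.txt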

Checking for DOS files on UNIX

I want to check whether any DOS files exist in any specific directory.
Is there any way to distinguish DOS files from UNIX files, apart from the ^M chars?
I tried using file, but it gives the same output for both.
$ file test_file
test_file: ascii text
And after conversion:
$ unix2dos test_file test_file
$ file test_file.txt
test_file.txt: ascii text
The CRLF (\r\n, ^M) line-ending characters are the only difference between Unix and DOS/Windows ASCII files, so no, there's no other way.
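So any check ultimately comes down to looking for those \r characters; a quick sketch with grep, assuming a shell such as bash that understands the $'\r' notation:
grep -q $'\r' test_file && echo "test_file contains CR characters"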
What you might try, if you have the fromdos command, is to compare its output with the original file:
file=test_file
fromdos < $file | cmp $file -
This fails (non-zero $?) if fromdos stripped any \r away.
dos2unix might be used in a similar way, but I don't know its exact syntax.
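Assuming dos2unix supports the usual stdin/stdout filter mode (worth verifying on your system), a similar sketch would be:
dos2unix < "$file" | cmp -s "$file" - || echo "$file has DOS line endings"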
If you actually put Windows newlines in, you'll see the following output from file:
test_file.txt: ASCII text, with CRLF line terminators
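To check a whole directory for DOS files, you could combine that with find (a sketch; the path is a placeholder):
find /path/to/dir -type f -exec file {} + | grep 'CRLF line terminators'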

Identifying and removing null characters in UNIX

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^# symbols, interleaved with normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
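For example (a sketch; confirm the actual source encoding first, UTF-16 here is only an assumption):
iconv -f UTF-16 -t UTF-8 file-with-nulls > file-without-nulls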
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
If the lines in the file end with \r\n\000, then what works is to delete the \n\000 and then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is an example of how to remove NUL characters using ex (in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursion, you may use the globbing option **/*.txt (if it is supported by your shell).
This is useful for scripting, since sed's -i parameter is a non-standard extension whose behaviour differs between GNU and BSD sed.
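For example, GNU sed accepts an optional suffix attached to -i, while BSD/macOS sed requires an explicit (possibly empty) suffix as a separate argument; the \x0 escape used above is, as far as I know, also GNU-specific. A sketch with placeholder patterns:
sed -i.bak 's/old/new/g' file    # GNU sed: suffix optional, may be attached
sed -i '' 's/old/new/g' file     # BSD/macOS sed: suffix argument required, '' means no backup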
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of the zeroes in the file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16:
f=cd.open(filePath,'r','utf-16')
Remove a trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDFs from a third party that we needed to upload to our system using a PDF library. In the files being sent to us, there was a null value that was sometimes appended to the PDF file. When our system processed these files, the files that had the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to remove the trailing null value. PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operations:
1. Take the binary file and convert it to hex (binary files don't like being exploded by newlines or carriage returns).
2. Explode the string using the carriage return as the delimiter.
3. Pop the last member of the array if its value is null.
4. Implode the array using the carriage return as the glue.
5. Process the file.
//In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We trim leading and trailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//We create an array using the carriage return as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//look at the last element. if the last element is equal to 00 we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
//we implode the array using carriage return as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//the new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);

How do I change a shell script's character encoding?

I am using Gina Trapani's excellent todo.sh to organize my todo list.
However, being a Dane, it would be nice if the script accepted special Danish characters like ø and æ.
I am an absolute UNIX-n00b, so it would be a great help if anybody could tell me how to fix this! :)
Slowly, the Unix world is moving from ASCII and other regional encodings to UTF-8. You need to be running a UTF-8-capable terminal, such as a modern xterm or PuTTY.
In your ~/.bash_profile, set your language to be one of the UTF-8 variants:
export LANG=C.UTF-8
or
export LANG=en_AU.UTF-8
etc.
You should then be able to write UTF-8 characters in the terminal, and include them in bash scripts.
#!/bin/bash
echo "UTF-8 is græat ☺"
See also: https://serverfault.com/questions/11015/utf-8-and-shell-scripts
What does this command show?
locale
It should show something like this for you:
LC_CTYPE="da_DK.UTF-8"
LC_NUMERIC="da_DK.UTF-8"
LC_TIME="da_DK.UTF-8"
LC_COLLATE="da_DK.UTF-8"
LC_MONETARY="da_DK.UTF-8"
LC_MESSAGES="da_DK.UTF-8"
LC_PAPER="da_DK.UTF-8"
LC_NAME="da_DK.UTF-8"
LC_ADDRESS="da_DK.UTF-8"
LC_TELEPHONE="da_DK.UTF-8"
LC_MEASUREMENT="da_DK.UTF-8"
LC_IDENTIFICATION="da_DK.UTF-8"
LC_ALL=
If not, you might try setting this (exported, so your script can see it) before you run your script:
export LANG=da_DK.UTF-8
You don't say what happens when you run the script and it encounters these characters. Are they in the todo file? Are they entered at a prompt? Is there an error message? Is something output in place of the expected output?
Try this and see what you get:
read -p "Enter some characters" string
echo "$string"
