Convert a Unix executable file to plain text

I see many people converting text files to Unix format. Is there a way to do the opposite and convert a Unix executable file to .txt?
(I need to read Unix executable files under macOS. If converting is impossible, is there a way I can read them? When I try, I get: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 468: invalid start byte
)

I have noticed that nano seems to display the file as text automatically, but I want to be able to view it in a text editor, so I am still trying to find another way.

If you are trying to read the file in Python, append 'b' to the mode so the file is opened in binary; that avoids newline translation and, since no decoding is attempted, the UnicodeDecodeError as well.
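A minimal sketch of the idea (the path is just an example):

# Open in binary mode: raw bytes, no newline translation, no decoding.
with open("/usr/bin/true", "rb") as f:
    data = f.read()

# For a text-ish view, decode leniently instead of strictly:
print(data.decode("utf-8", errors="replace")[:200])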

Related

Using com.opencsv.CSVReader on Windows stops reading lines prematurely

I have two files that are identical except for their line endings. The one that uses newline (Linux/Unix) line endings works (reads all 550 rows of data), while the one that uses carriage return plus line feed (Windows) stops returning lines after reading 269. In both cases the data is read correctly up to the point where reading stops.
If I run dos2unix on the failing file, the resulting file works.
I would like to be able to read CSV files regardless of their origin. If I could at least detect that a file is in the wrong format before reading part of the data, that would be helpful.
Even if I could only tell somewhere in the middle of reading the file that it was not going to work, I could output an error.
My current state of reading half the file and terminating with no error is dangerous.
The problem is that under the covers opencsv uses a BufferedReader, which reads a line from the stream until it reaches the system's line.separator.
If you know beforehand what the line separator of the file is, then in your application just do a System.setProperty("line.separator", newLine), where newLine is either "\n" or "\r\n" based on the file you are about to parse. Or you can pass that in as a parameter.
If you want to detect the file's line endings automatically: create a method that takes the file, creates a BufferedReader, and reads a single line. If the last character of the line is '\r', then your system uses "\n" but the file uses "\r\n", so you want to set the separator to "\r\n". Else, if line.contains("\n") returns true, your system uses "\r\n" while the file uses "\n", so you want to set it to "\n". Otherwise the system and the file already have compatible line endings.
Just note that if you do change the system line separator, be sure to set it back after processing the file, in case your program processes multiple files.
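The answer above is Java-specific, but the detection idea is language-independent. A minimal sketch of the same check in Python (the file name is an assumption):

# Look at the first chunk of the file in binary mode to detect line endings.
def detect_line_ending(path, sample_size=64 * 1024):
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\r\n" in sample:
        return "\r\n"  # Windows-style
    if b"\n" in sample:
        return "\n"    # Unix-style
    return None        # empty file or a single unterminated line

print(detect_line_ending("data.csv"))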

Reading text files in Ada: Get_Line "reads" the byte-order mark as well

I'm trying to read a file line by line in Ada; it's an XML text file. I'm following the instructions here:
http://rosettacode.org/wiki/Read_a_file_line_by_line#Ada
However, there's a problem that annoys me: the "Get_Line" function seems to be unaware of byte-order marks and reads them as part of the text itself, which means that when I read the lines, the first one always starts with some extra bytes that should not be there.
While removing the extra bytes manually from the string is no big deal, it seems strange to me that a function dedicated to text input/output is unaware of BOMs. There must be a way to read a text file in Ada without having to worry about this... is there?
Ada.Text_IO is specified to handle ISO-8859-1 encoded text, so ignoring a UTF-8 feature is the proper thing to do.
If Ada.Wide_Text_IO and Ada.Wide_Wide_Text_IO also pass the byte-order mark through when asked to read UTF-8 encoded text, then you should consider reporting it as a bug to GCC. But as there are quite a lot of implementation-defined details in the text I/O packages in Ada, you should be ready for a "won't fix" answer.
One possibility is to use stream attributes and make a UTF_8 file type that handles reading and discarding the BOM.
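For comparison, the read-and-discard idea is only a few lines in a language with BOM-aware codecs; a minimal sketch in Python (the file name is an assumption):

# "utf-8-sig" decodes UTF-8 and silently strips a leading BOM, if any.
with open("data.xml", encoding="utf-8-sig") as f:
    first_line = f.readline()
print(first_line)

# The same thing by hand, on raw bytes:
with open("data.xml", "rb") as f:
    data = f.read()
if data.startswith(b"\xef\xbb\xbf"):
    data = data[3:]  # drop the 3-byte UTF-8 BOM
text = data.decode("utf-8")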

Handle UTF-8 characters in Unix

I was trying to find a solution to my problem, and after searching the forums I couldn't find one, so I'll explain it here.
We receive a CSV file from a client with some special characters, encoded as unknown-8bit. We convert this CSV file to XML using an awk script. With the XML file we make an API call to our system using UTF-8 as the default encoding. The response is an error with the following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as below:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road", because when I interpret the file as UTF-8, the "’" character comes through as a 0x92 byte.
Does anybody know how I can handle this?
Thanks in advance,
Sandra
Find out the actual encoding first; best would be to ask the sender.
If you cannot do so, and also as a sanity check, the Unix command file is very useful for that (the linked page shows more options).
Next step: convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII bytes, or replace them during conversion, if that loss is acceptable.
As an alternative, open the file in the editor of your choice and flip the encoding used for interpreting the data until you get something readable. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step: do what you wanted to do, in the comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current set-up is using) -- I would guess Windows-1252 which uses 0x92 for a curvy right single quote.
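If it does turn out to be Windows-1252, the conversion itself is tiny; a minimal sketch in Python (file names are assumptions):

# Re-encode a Windows-1252 file as UTF-8; byte 0x92 becomes U+2019 (’).
with open("client.csv", "rb") as f:
    data = f.read()
text = data.decode("windows-1252")  # raises UnicodeDecodeError on the few undefined bytes
with open("client.utf8.csv", "w", encoding="utf-8") as f:
    f.write(text)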

diff command: doesn't print lines that are different but still says the two files differ

I'm using the diff command to compare two text files. They need to match exactly.
So I run diff:
diff binary.out binary.expected
(By the way, those files are NOT binary files. They are text files. I call them binary because that's the name of the project.)
and got
Binary files binary.out and binary.expected differ
When I use another diff tool, the smartest of them all (a.k.a. a human), there's really nothing different between the two files.
Does anyone happen to know what's going on here?
Thanks.
diff from diffutils says the following about text/binary:
diff determines whether a file is text or binary by checking the
first few bytes in the file; the exact number of bytes is system
dependent, but it is typically several thousand. If every byte in
that part of the file is non-null, diff considers the file to be
text; otherwise it considers the file to be binary.
hence GNU diff has a quite open definition of what is text, and the --text option to force it to treat a file as text should seldom be needed.
Have you checked whether binary.out or binary.expected contains null characters? Which version of diff are you using?
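A quick way to run that check yourself; a minimal sketch in Python (the file names come from the question):

# diff's heuristic inspects the first few thousand bytes; do the same.
def has_null_bytes(path, sample_size=8192):
    with open(path, "rb") as f:
        return b"\x00" in f.read(sample_size)

for name in ("binary.out", "binary.expected"):
    print(name, "contains NUL bytes:", has_null_bytes(name))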
Make sure to ignore whitespace in the diff options (e.g. diff -w).
It may also be that the files contain Unicode text (UTF-16, for example, contains null bytes) that diff interprets as binary. See if your diff tool has an option to force text mode.

Fixing Unicode Byte Sequences

Sometimes when copying stuff into PostgreSQL I get errors about invalid byte sequences.
Is there an easy way, using vim or other utilities, to detect byte sequences that cause errors such as invalid byte sequence for encoding "UTF8": 0xde70 and whatnot, and possibly an easy way to do a conversion?
Edit:
What my workflow is:
Dumped sqlite3 database (from trac)
Trying to replay it in postgresql
Perhaps there's an easier way?
More Edit:
Also tried these:
Running enca to detect encoding of the file
Told me it was ASCII
Tried iconv to convert from ASCII to UTF8. Got an error
What did work was deleting the couple of erroneous lines it complained about, but that didn't really solve the real problem.
Based on one short sentence, it sounds like you have text in one encoding (e.g. ANSI/ASCII) and you are telling PostgreSQL that it's actually in another (UTF-8). All the different tools involved: PostgreSQL, Bash, one or more programming languages, other data from somewhere else, the text editor, the IDE, etc., have default encodings which may differ, and at some step of the way the proper conversions are not being done. I would check the flow of data wherever it crosses these kinds of boundaries, to ensure that either the encodings line up, or the encodings are properly detected and the text is properly converted.
If you know the encoding of the dump file, you can convert it to utf-8 by using recode. For example, if it is encoded in latin-1:
recode latin-1..utf-8 < dump_file > new_dump_file
If you are not sure about the encoding, you should see how sqlite was configured, or maybe try some trial-and-error.
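Trial-and-error is easy to script; a minimal sketch in Python (the candidate list is an assumption; dump_file is the file from the recode example above):

# Try a few likely encodings until one decodes the dump without errors.
# Note: latin-1 accepts every byte, so it always "succeeds"; keep it last.
candidates = ["utf-8", "windows-1252", "latin-1"]
with open("dump_file", "rb") as f:
    data = f.read()
for enc in candidates:
    try:
        data.decode(enc)
        print("decodes cleanly as", enc)
        break
    except UnicodeDecodeError as err:
        print(enc, "failed:", err)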
I figured it out. It wasn't really an encoding issue.
SQLite's dump output escapes strings differently than Postgres expects. There were some cases where something like 'asdf\xd\foo' was output. I believe the '\x' was causing Postgres to expect the following characters to be a Unicode escape.
The solution is to dump each table individually in CSV mode in sqlite3.
First, recreate the schema:
sqlite3 db/trac.db .schema | psql
Now, this does the trick, for the most part, to copy the data back in:
for table in `sqlite3 db/trac.db .schema | grep TABLE | sed 's/.*TABLE \(.*\) (/\1/'`
do
  # printf instead of echo, so the \n is interpreted in any shell
  printf '.mode csv\nselect * from %s;\n' "$table" | sqlite3 db/trac.db | psql -c "copy $table from stdin with csv"
done
Yeah, kind of a hack, but it works.
