How to view all special characters - unix

I am having a hard time removing special characters from a csv file.
I have done a head -1, so I am comparing only one row.
wc filename shows a byte count of 1396.
If I go to the end of the file, the cursor ends at 1394.
In vi I do :set list (to check for control characters) and I see a $ (nothing after that), so I now know that is the 1395th byte.
Can someone please tell me where the 1396th byte is?
I am trying to compare 2 files using diff and it's giving me a lot of trouble.
Please help.

The last 2 bytes of your line are \r\n - this is a Windows line ending. dos2unix converts this into a Unix line ending, which is \n - hence the line is shortened by 1 byte following conversion.
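To see the extra byte directly, od -c prints every byte, control characters included (cat -A does something similar on GNU systems). A quick sketch, using a hypothetical one-line sample.txt in place of your file:

```shell
# a one-line file with a Windows (CRLF) line ending
printf 'hello\r\n' > sample.txt

# wc -c counts every byte, \r and \n included
wc -c < sample.txt                 # → 7

# od -c shows the \r hiding before the final \n
od -c sample.txt
```

After stripping the \r (for example with dos2unix), wc -c reports one byte fewer, matching the explanation above.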

Related

Using com.opencsv.CSVReader on windows stops reading lines prematurely

I have two files that are identical except for the line endings. The one that uses the newline (Linux/Unix) character works (reads all 550 rows of data), and the one that uses carriage return and line feed (Windows) stops returning lines after reading 269 lines. In both cases the data is read correctly up to the point where they stop.
If I run dos2unix on the file that fails, the resulting file works.
I would like to be able to read CSV files regardless of their origin. If I could at least detect that the file is in the wrong format before reading part of the data, that would be helpful.
Even if I could tell at any time in the middle of reading the file that it was not going to work, I could output an error.
My current state of reading half the file and terminating with no error is dangerous.
The problem is that under the covers opencsv uses a BufferedReader, which reads a line from the stream until it reaches the system's line.separator.
If you know beforehand what the line separator of the file is then in your application just do a System.setProperty("line.separator", newLine) where newLine is either "\n" or "\r\n" based on the file you are about to parse. Or you can pass that in as a parameter.
If you want to detect the file's line ending automatically, create a method that takes the file, creates a BufferedReader, and reads a single line. If the last character is a '\r', then your system uses "\n" but you want to set it to "\r\n". Else, if line.contains("\n") returns true, you are on a system that uses "\r\n" and you want to set it to "\n". Otherwise the system and the file you are reading have compatible line endings.
Just note that if you do change the system line separator, be sure to set it back after processing the file, in case your program processes multiple files.
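Since running dos2unix on the failing file fixes it, one pragmatic option is to detect and normalize the line endings in the shell before the file ever reaches opencsv. A sketch, assuming the input is a hypothetical data.csv (tr stands in for dos2unix where the latter is not installed):

```shell
# stand-in for the real input: a CSV with CRLF endings
printf 'a,b\r\nc,d\r\n' > data.csv

# od -c renders a carriage return as \r, so grep can detect it
if head -c 4096 data.csv | od -c | grep -q '\\r'; then
    # strip every \r, i.e. convert CRLF to LF (what dos2unix does)
    tr -d '\r' < data.csv > data.unix.csv
    mv data.unix.csv data.csv
fi

od -c data.csv      # no \r remains
```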

Handle UTF-8 characters in unix

I was trying to find a solution for my problem, and after looking at the forums I couldn't find one, so I'll explain my problem here.
We receive a csv file from a client with some special characters and encoded as unknown-8bit. We convert this csv file to xml using an awk script. With the xml file we make an API call to our system using utf-8 as default encoding. The response is an error with following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as below:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road", because the "’" character is stored as a 0x92 byte, which is not a valid UTF-8 sequence.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first, best would be asking the sender.
If you cannot do so, and also for sanity-checking, the unix command file is very useful for that (its man page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII characters or replace them during conversion, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't realize that you know: it is whatever your current setup is using). I would guess Windows-1252, which uses 0x92 for a curly right single quote.
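Following the steps above in the shell: file guesses the encoding, and iconv does the actual conversion to UTF-8. A sketch, assuming the data really is Windows-1252 (the \222 octal escape writes the 0x92 byte from the question):

```shell
# recreate the problem field with a raw 0x92 byte (octal 222)
printf '200 Gray\222s Inn Road\n' > sample.csv

# ask file for its guess (it cannot always be certain)
file sample.csv

# convert from Windows-1252 to UTF-8; the 0x92 byte
# becomes the proper curly apostrophe
iconv -f WINDOWS-1252 -t UTF-8 sample.csv     # → 200 Gray’s Inn Road
```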

Newline in output from Unix not appearing in Notepad

OK, I'm new to SAS and have been working on a fixed-format output .txt file. Each variable needs to start at a particular column and have a fixed length and format. I have been using a PUT statement to accomplish this. So far so good.
Where I run into an issue is when I open the output .txt in Notepad: the first line follows the rules defined in the PUT statement until it is time to move down to the next line, but then the data continues writing on the first record instead of starting a new record at the end of the PUT statement.
This only occurs when I open the file in a Windows environment (Notepad). When I view it in a UNIX editor, everything is the way I need it.
data _null_;
  set work.get_driver_data;
  file ".......................dw2092340/driverdata.txt" LRECL = 269;
  DV_Term1 = compress(put(DV_Term, mmddyyn8.), '.');
  put #001 SCAC       $4.
      #005 DV_REF     $10.
      #015 DSP_OFFICE $13.
      #028 UP_RMP     $5.
      #033 DV_HIRE    mmddyyn8.
      #041 DV_TERM1   $8.
      #049 FIRSTNAME  $20.
      #069 MID        $10.
      #079 LASTNAME   $20.
      #099 LICENSE    $20.
      #119 LIC_STATE  $2.
      #121 LIC_CNTRY  $3.
      #124 LIC_EXPIRE mmddyyn8.
      #132 LIC_CDL    $1.
      #257 BNSF_PIN   $10.;
Any help is appreciated.
What is your goal? It sounds to me like you are creating a text file with UNIX style line endings.
If you want this to open in Notepad, you need to give it windows style line endings. Try this on the end of your file statement:
termstr=crlf
This stands for "Carriage Return Line Feed". Alternative values include:
LF (Line Feed) - use TERMSTR=LF to write files suitable for unix systems.
CR (Carriage Return) - use TERMSTR=CR for mac systems (not a common use case).
If you are only trying to look at your results, perhaps you should consider using a fully featured text editor - there are many free ones available that will open files in all of these formats. I use Notepad++ but there are many others.
Also, check out the documentation for the filename statement under unix and windows.
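If you want to verify from the unix side which line endings the output actually carries, od -c makes the difference visible. A sketch with two hypothetical files standing in for the TERMSTR=LF and TERMSTR=CRLF cases:

```shell
# simulate the two kinds of output
printf 'rec one\nrec two\n'     > unix.txt
printf 'rec one\r\nrec two\r\n' > windows.txt

# od -c renders a carriage return as \r; only the CRLF file
# contains it, and only that file lines up in Notepad
od -c unix.txt    | grep -q '\\r' || echo "unix.txt: LF only"
od -c windows.txt | grep -q '\\r' && echo "windows.txt: CRLF endings"
```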

ed - line editor (what does $p do?)

I am using ed, a unix line editor, and the book I'm reading says to type 1,$p
(also works in vim)
After trial and error I figured out that the first value means the line number, but what is the purpose of $p? From what I can tell, the 1 goes to the beginning of the line and $p goes to the EOF and displays to me everything it picked up. Is this true, or am I way off?
The 1,$ part is a range. The comma separates the beginning and end of the range. In this case, 1 (line 1) is the beginning, and $ (the last line) is the end. The p means print, which is the command the range is being given to, and yes, it displays to you what is in that range.
In vim you can look at :help :range and :help :print to find out more about how this works. These types of ranges are also used by sed and other editors.
They probably used the 1,$ terminology in the tutorial to be explicit, but note that you can also use % as its equivalent. Thus, %p will also print all the lines in the file.
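Since sed understands the same 1,$ addressing (as noted above), it is easy to experiment with these ranges outside an editor. A sketch with a throwaway demo.txt:

```shell
# three sample lines
printf 'alpha\nbeta\ngamma\n' > demo.txt

# 1,$ addresses line 1 through the last line ($);
# p prints each addressed line (-n suppresses the default output)
sed -n '1,$p' demo.txt

# $ on its own addresses just the last line
sed -n '$p' demo.txt          # → gamma
```

Inside ed itself, typing 1,$p (or its shorthand ,p) at the prompt prints the same range.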

Printing ASCII value of BB (HEX) in Unix

When I am trying to paste the character » (right double angle quotes) in Unix from my Notepad, it's converting to /273. The corresponding Hex value is BB and the Decimal value is 187.
My actual requirement is to have this character as the file delimiter when I export a .dat file from a database table. So, this character was put in as the delimiter after each column name. But, while copy-pasting, it's getting converted to /273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh
ASCII only defines the character codes up to 127 (0x7F) - everything after that is another encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use - the locale command will report your current locale settings, the locale(5) and environ(5) man pages cover how to set them. A much more in-depth introduction to the whole character encoding concept can be found in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The character code 0xBB is shown as » in the ISO-8859-1 character chart, so that's probably the character set you want; the locale would be something like en_US.ISO8859-1 for that character set with US/English messages/date formats/currency settings/etc.
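Rather than pasting the character (and letting the terminal or clipboard mangle it), you can generate the 0xBB byte directly with printf and its octal escape. A sketch, using a hypothetical three-column record:

```shell
# emit the single byte 0xBB (octal 273)
delim=$(printf '\273')

# confirm the byte value
printf '%s' "$delim" | od -An -tx1        # → bb

# use it as a field separator; LC_ALL=C treats it as a
# plain byte regardless of the terminal's locale
printf 'col1\273col2\273col3\n' | LC_ALL=C awk -F "$delim" '{ print $2 }'   # → col2
```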
