Using com.opencsv.CSVReader on Windows stops reading lines prematurely - opencsv

I have two files that are identical except for their line endings. The one that uses the newline character (Linux/Unix) works and reads all 550 rows of data. The one that uses carriage return plus line feed (Windows) stops returning lines after 269 lines. In both cases the data is read correctly up to the point where reading stops.
If I run dos2unix on the file that fails, the resulting file works.
I would like to be able to read CSV files regardless of their origin. If I could at least detect that the file is in the wrong format before reading part of the data, that would be helpful.
Even if I could tell at any time in the middle of reading the file that it was not going to work, I could output an error.
My current state of reading half the file and terminating with no error is dangerous.

The problem is that under the covers openCSV uses a BufferedReader, which reads a line from the stream until it gets to the system's line.separator.
If you know beforehand what the line separator of the file is, then in your application just call System.setProperty("line.separator", newLine), where newLine is either "\n" or "\r\n" depending on the file you are about to parse. Or you can pass that in as a parameter.
If you want to detect the file's line ending automatically, create a method that takes the file you want, creates a BufferedReader, and reads a single line. If the last character is a '\r', then your system uses "\n" but you want to set it to "\r\n". Else, if line.contains("\n") returns true, then you are on a system that uses "\r\n" and you want to set it to "\n". Otherwise the system and the file you are reading have compatible line endings.
Just note that if you do change the system line separator, be sure to set it back after processing the file in case your program is processing multiple files.
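For the detection step, here is a minimal sketch rather than a definitive implementation. Note that BufferedReader.readLine() returns each line with its terminator already stripped, so this version scans raw characters instead of inspecting the result of readLine(). The class name LineEndingDetector and the method detect are made up for illustration and are not part of opencsv. It also assumes the first line break in the file is a real row terminator and not a break inside a quoted field.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LineEndingDetector {

    /**
     * Returns the first line terminator found in the stream:
     * "\r\n", "\n" or "\r". Returns null if the stream has no line break.
     */
    public static String detect(Reader reader) {
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == '\n') {
                    return "\n";
                }
                if (c == '\r') {
                    // CR followed by LF is the Windows terminator; a lone CR is old Mac style.
                    return reader.read() == '\n' ? "\r\n" : "\r";
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String windows = detect(new StringReader("a,b\r\nc,d\r\n"));
        String unix = detect(new StringReader("a,b\nc,d\n"));
        System.out.println("\r\n".equals(windows)); // true
        System.out.println("\n".equals(unix));      // true
    }
}
```

For a real file you would pass something like Files.newBufferedReader(path) instead of a StringReader, run the check before handing the file to the CSV reader, and report an error if the result is not what you expect.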

Related

R exports a blank last line in text file by "write.table()" [duplicate]

In R: Is it possible to avoid having a blank line at the end of a text file generated by writeLines? If not, is there any other way of generating a text file from within R without having a blank line at the end?
There is no blank line.
R (correctly) ends each line with '\n' (or '\r\n' on Windows). In other words, the file consists of lines, and each line ends with a line break.
Unfortunately, there are many tools (especially on Windows) which treat such files incorrectly and display an extra line at the end. However, that’s a fault with these tools, not with R. Consequently, this shouldn’t be fixed in the R code.
As a hack to appease buggy tools, the only recourse is to set the sep argument of writeLines to the empty string, '', and insert the line breaks between lines manually (using paste).
I had exactly the same concern (different grid, though), and even your comment on the accepted answer (by Konrad) did not work for me.
I found the answer here, and here is the full code:
fileConn <- file("mytext.txt")
writeLines(c("line1", "line2", "line3"), con = fileConn, sep = "\n")
close(fileConn)
# Now connect to the UNIX server and upload your file
# ("user@server.com" is a placeholder for your own login)
library(ssh)
session <- ssh_connect("user@server.com")
scp_upload(session, files = "mytext.txt")
# Here is the trick: convert all the extra Windows characters to Unix
ssh_exec_wait(session, command = "dos2unix mytext.txt")
# Then start your grid job
ssh_exec_wait(session, command = "sbatch mytext.txt")
ssh_disconnect(session)

Remove line feed in CSV using Unix script

I have a CSV file and I want to remove all the line feeds (LF or \n) that occur between double quotes, and only those.
Can you please provide a Unix script to perform this task? I have given the input and expected output below.
Input :
No,Status,Date
1,"Success
Error",1/15/2018
2,"Success
Error
NA",2/15/2018
3,"Success
Error",3/15/2018
Expected output:
No,Status,Date
1,"Success Error",1/15/2018
2,"Success Error NA",2/15/2018
3,"Success Error",3/15/2018
I can't write everything for you, as I don't know your system or which bash version is running on it. But here are a few links that you might want to look at.
https://www.unix.com/shell-programming-and-scripting/31021-removing-line-breaks-shell-variable.html
https://www.unix.com/shell-programming-and-scripting/19484-remove-line-feeds.html
How to remove carriage return from a string in Bash
https://unix.stackexchange.com/questions/57124/remove-newline-from-unix-variable
Remove line breaks in Bourne Shell from variable
https://unix.stackexchange.com/questions/254644/how-do-i-remove-newline-character-at-the-end-of-file
https://serverfault.com/questions/391360/remove-line-break-using-awk
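This is not a shell script, but the logic the question asks for can be sketched in a few lines of Java, under some assumptions: fields are quoted with plain double quotes, an embedded quote is written as "", and the whole file fits in memory. The class name QuotedNewlineJoiner and the method join are made up for illustration.

```java
public class QuotedNewlineJoiner {

    /** Replaces each line break inside a double-quoted field with a single space. */
    public static String join(String csv) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, so the state stays correct.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && (c == '\n' || c == '\r')) {
                // Treat CRLF as one break so that it yields a single space.
                if (c == '\r' && i + 1 < csv.length() && csv.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(' ');
            } else {
                // Everything outside quotes, including row terminators, is kept as-is.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "No,Status,Date\n"
                + "1,\"Success\nError\",1/15/2018\n"
                + "2,\"Success\nError\nNA\",2/15/2018\n"
                + "3,\"Success\nError\",3/15/2018\n";
        System.out.print(join(input));
    }
}
```

Run on the sample input from the question, this prints the expected output shown above: the header row followed by one line per record.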

File contains two EOF characters; what happens?

Will this screw up file size estimation on the file system? Will the filesystem overwrite everything past the first EOF character? How is this handled?
In Unix there is no EOF character. It's simply a concept, a value returned by getc to signal "this is the end (beautiful friend)". EOF is chosen so that getc (and friends) can't return it in any other case.
As for writing past the end of a file, different filesystems do things differently.
Some will leave holes that don't actually occupy any space on the disk.
Some will fill in the gap with zero bytes (0).
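The same idea can be shown outside C. Java's stream read() plays the role of getc: it returns each byte as an int from 0 to 255 and returns -1 only when the data runs out, so no byte stored in the file can be mistaken for the end. A small sketch follows; the class name EndOfStreamDemo and the method readAll are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class EndOfStreamDemo {

    /** Records every value read() returns for the given bytes, including the final -1. */
    public static List<Integer> readAll(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        List<Integer> values = new ArrayList<>();
        int b;
        do {
            b = in.read(); // 0..255 for a real byte, -1 once the data is exhausted
            values.add(b);
        } while (b != -1);
        return values;
    }

    public static void main(String[] args) {
        // 0x1A is Ctrl-Z, which some DOS tools treat as an end-of-file marker;
        // 0xFF is the byte that looks like -1 when stored in a signed 8-bit type.
        byte[] data = {0x41, 0x1A, (byte) 0xFF};
        System.out.println(readAll(data)); // [65, 26, 255, -1]
    }
}
```

Both special-looking bytes come back as ordinary data, and the end-of-stream signal appears only once, after the last byte.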

ASP Readline non-standard Line Endings

I'm using the ASP Classic ReadLine() function of the File System Object.
All has been working great until someone made their import file on a Mac in TextEdit.
The line endings aren't the same, and ReadLine() reads in the entire file, not just 1 line at a time.
Is there a standard way of handling this? Some sort of page directive, or setting on the File System Object?
I guess that I could read in the entire file, and split on vbLF, then for each item, replace vbCR with "", then process the lines, one at a time, but that seems a bit kludgy.
I have searched all over for a solution to this issue, but the solutions are all along the lines of "don't save the file with Mac[sic] line endings."
Anyone have a better way of dealing with this problem?
There is no way to change the behaviour of ReadLine; it will only recognize CRLF as a line terminator. Hence the only simple solution is the one you have already described.
Edit
Actually there is another library that ought to be available out of the box on an ASP server that might offer some help. That is the ADODB library.
The ADODB.Stream object has a LineSeparator property that can be assigned 10 or 13 to override the default CRLF it would normally use. The documentation is patchy because it doesn't describe how this can be used with ReadText. You can get the ReadText method to return the next line from the stream by passing -2 as its parameter.
Take a look at this example:-
Dim sLine
Dim oStreamIn : Set oStreamIn = CreateObject("ADODB.Stream")
oStreamIn.Type = 2 ' Text
oStreamIn.Open
oStreamIn.CharSet = "Windows-1252"
oStreamIn.LoadFromFile "C:\temp\test.txt"
oStreamIn.LineSeparator = 10 ' Linefeed
Do Until oStreamIn.EOS
    sLine = oStreamIn.ReadText(-2)
    ' Do stuff with sLine
Loop
oStreamIn.Close
Note that by default the CharSet is Unicode, so you will need to assign the correct CharSet being used by the file if it's not Unicode. I use the word "Unicode" in the sense that the documentation does, which actually means UTF-16. One advantage here is that the ADODB Stream can handle UTF-8, unlike the Scripting library.
BTW, I thought Macs used a CR for line endings? It's the Unix file format that uses LFs, isn't it?
