File contains two EOF characters; what happens? - unix

Will this screw up file size estimation on the file system? Will the filesystem overwrite everything past the first EOF character? How is this handled?

In Unix there is no EOF character. It's simply a concept, a value returned by getc to signal "this is the end (beautiful friend)". EOF is chosen so that getc (and friends) can't return it in any other case.
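For example, in R the end of a file shows up as a zero-length read, not as any in-band byte value; a minimal sketch:

con <- file("example.txt", "rb")
# readBin() returns a zero-length raw vector at end of file;
# no byte value is reserved to mean "the end"
while (length(byte <- readBin(con, "raw", n = 1)) == 1) {
  # ... process one byte ...
}
close(con)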
As for writing past the end of a file, different filesystems do things differently:
Some will leave holes that don't actually occupy any space on the disk.
Some will fill in the gap with zero bytes (0).
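To see which behaviour you get, seek past the end of a file and write a single byte; a small R experiment (note that R's own documentation discourages seek() on Windows):

con <- file("sparse.bin", "wb")
seek(con, where = 1e6, rw = "write")  # position one million bytes past the start
writeBin(as.raw(1), con)              # writing here extends the file
close(con)
file.size("sparse.bin")               # 1000001 on any filesystem; on one that
                                      # supports holes, du reports far less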


Write lines of text at a given position in a file in R

I would like to write a line in a text file at a given position (i), avoiding reading the file sequentially.
There is the writeLines base function, but I don't know how to insert the text at the position (i) given as a parameter.
Thanks, Dave
This is fundamentally impossible, and not specific to R: most (i.e. all common) filesystems do not support inserting or removing content in the middle of a file. The only supported operations are appending (or truncating) at the end, and R only supports appending, not truncation.
The way virtually all software solves your problem is by reading the file, modifying it, and writing it back to disk. If you want to get fancy because the file is very large (at least in the order of hundreds of MiB), you can stream edit the file: Read a part, edit that part, write it back to a new file. Rinse and repeat.
Technical aside: there is one exception to the above with low-level file operations, since files are stored as non-contiguous "blocks". But even if R supported this, it wouldn't help you, since it doesn't permit byte-level or line-level granularity: blocks are typically at least 4 KiB in size.
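To make the usual approach concrete, here is a minimal sketch of both variants (the function names are mine). For a small file, read everything, replace line i, and write it back:

replace_line <- function(path, i, text) {
  lines <- readLines(path)
  lines[i] <- text
  writeLines(lines, path)
}

For a file too large to hold in memory, stream it: copy line by line to a temporary file, swapping out line i, then rename over the original. Reading one line at a time keeps the sketch simple; a real stream edit would use larger chunks:

replace_line_streaming <- function(path, i, text) {
  tmp <- tempfile(tmpdir = dirname(path))  # same directory, so the rename stays on one filesystem
  src <- file(path, "r")
  dst <- file(tmp, "w")
  n <- 0
  while (length(line <- readLines(src, n = 1)) == 1) {
    n <- n + 1
    writeLines(if (n == i) text else line, dst)
  }
  close(src)
  close(dst)
  file.rename(tmp, path)
}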

How to replace a string pattern with different strings quickly?

For example, I have many HTML tabs to style, they use different classes, and will have different backgrounds. Background images files have names corresponding to class names.
The way I found to do it is to yank:
.tab.home {
background: ...home.jpg...
}
then paste, then :s/home/about.
This has to be repeated a few times. I found that & can be used to repeat the last substitute, but only for the same target string. What is the quickest way to repeat a substitute with a different target string?
Alternatively, probably there are more efficient ways to do such a thing?
I had a quick play with some vim macro magic and came up with the following idea... I apologise for the length; I thought it best to explain the steps.
First, place the text block you want to repeat into a register (I picked register z), so with the cursor at the beginning of the .tab line I pressed "z3Y (select reg z and yank 3 lines).
Then I entered the series of Vim commands I wanted into the buffer: )"zp:.,%s/home/ (just press i and type the commands).
This translates to:
) go to the end of the current '{}' block,
"zp paste a copy of the text in register z,
.,%s/home/ which has two tricks.
The .,% ensures the substitution applies to everything from the start of the .tab to the end of the closing }, and,
The command is incomplete (i.e. it does not have a carriage return at the end), so Vim will prompt me to complete the command.
Note that while %s/// will perform a substitution across every line of the file, it is important to realise that % is an alias for the range 1,$. Using .,% as a range instead causes the % to be used as the 'jump to matching parenthesis' operator, resulting in a range from the current line to the end of the % match (which, in this example, is the closing brace of the block).
Then, after placing the cursor on the ) at the beginning of the line, I typed "qy$ which means yank all characters to the end of the line into register q.
This is important, because simply yanking the line with Y would include the trailing newline in the register, and that would cause the macro to fail.
I then executed the content of register q with @q, and I was prompted to complete the s/home/ on the command line.
After typing the replacement text and pressing enter, the pasted block (from register z) appeared in the buffer with the substitutions already applied.
At this point you can repeat the last @q by simply typing @@. You don't even need to move the cursor down to the end of the block, because the ) at the start of the macro does that for you.
This effectively reduces the process of yanking the original text, inserting it, and executing two manual replace commands to a simple @@.
You can safely delete the macro string from your edit buffer when done.
This is incredibly vim-ish, and it might take a bit of time to get right, but it could save you even more once you do.
Vim macros might be the trick you are looking for.
From the manual, I found :s//new-replacement (an empty pattern reuses the last search pattern). Seemed to be too much typing.
Looking for a better answer.

Using com.opencsv.CSVReader on Windows stops reading lines prematurely

I have two files that are identical except for their line endings. The one that uses the newline (Linux/Unix) character works (reads all 550 rows of data), and the one that uses carriage return plus line feed (Windows) stops returning lines after reading 269 lines. In both cases the data is read correctly up to the point where reading stops.
If I run dos2unix on the file that fails, the resulting file works.
I would like to be able to read CSV files regardless of their origin. If I could at least detect that a file is in the wrong format before reading part of the data, that would be helpful.
Even if I could only tell in the middle of reading the file that it was not going to work, I could output an error.
My current state, reading half the file and terminating with no error, is dangerous.
The problem is that under the covers openCSV uses a BufferedReader, which reads a line from the stream until it gets to the system's line.separator.
If you know beforehand what the line separator of the file is, then in your application just do a System.setProperty("line.separator", newLine), where newLine is either "\n" or "\r\n" based on the file you are about to parse. Or you can pass that in as a parameter.
If you want to detect the file's line ending automatically: create a method that takes the file, creates a BufferedReader, and reads a single line. If the last character is a '\r', then your system uses "\n" but you want to set it to "\r\n". Else, if line.contains("\n") returns true, you are on a system that uses "\r\n" and you want to set it to "\n". Otherwise the system and the file you are reading have compatible line endings.
Just note: if you do change the system line separator, be sure to set it back after processing the file, in case your program processes multiple files.
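The detection idea itself is language-agnostic: read a chunk of the file and look for a carriage return. A minimal sketch, written here in R (the helper name is mine; it assumes the file uses either Unix or Windows endings):

detect_eol <- function(path, n = 4096) {
  bytes <- readBin(path, "raw", n = n)
  if (any(bytes == as.raw(0x0D))) "\r\n" else "\n"  # 0x0D is carriage return
}

You could then feed the result to System.setProperty("line.separator", ...) before parsing, and restore the old value afterwards.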

Native method in R to test if file is ascii

Is there a native method in R to test if a file on disk is an ASCII text file, or a binary file? Similar to the file command in Linux, but a method that will work cross platform?
The file.info() function can distinguish a file from a dir, but it doesn't seem to go beyond that.
If all you care about is whether the file is ASCII or binary...
Well, first up: definitions. All files are binary at some level:
is.binary <- function(file){
  if(system.type() != "quantum computer"){
    return(TRUE)
  }else{
    return(cat=alive&dead)
  }
}
ASCII is just an encoding system for characters. It is therefore impossible to tell if a file is ASCII or binary, because ASCII-ness is a matter of interpretation. If I save a file and decide that binary number 01001101 is Q and 01001110 is Z then you might decode this as ASCII but you'll get the wrong message. Luckily the Americans muscled in and said "Hey, everyone use ASCII to code their text! You get 128 characters and a parity bit! Woo! Go USA!". IBM tried to tell people to use EBCDIC but nobody listened. Which was A Good Thing.
So everyone was packing ASCII-coded text into their 8-bit bytes, and using the eighth bit for parity checking. But then people stopped doing parity checking because TCP/IP handled all that, which was also A Good Thing, and the eighth bit was expected to be zero. If not, there was trouble.
Because people (read "Microsoft") started abusing the eighth bit, and making up their own encoding schemes, and so unless you knew what encoding scheme the file was using, you were stuffed. And the file very rarely told you what encoding scheme it was. And now we have Unicode and even more encoding schemes. And that is a third Good Thing. But I digress.
Nowadays when people ask if a file is binary, what they are normally asking is "does any byte in this file have its highest bit set?". You can test that in R by reading the file as unsigned one-byte integers and checking the values. Something like:
is.binary <- function(filepath, n = 1000){
  f <- file(filepath, "rb", raw = TRUE)
  on.exit(close(f))  # don't leak the connection
  b <- readBin(f, "integer", n, size = 1, signed = FALSE)
  return(any(b > 127))  # any byte with the high (eighth) bit set
}
By default this tests at most the first 1000 bytes. I think the file command does something similar.
You may want to change the test to check for printable character codes, and whitespace, and line feed, carriage return, and other codes you might want to consider plausible in your non-binary files...
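For instance, a version closer to what file(1) does might also accept tabs, line feeds, and carriage returns as text; a sketch (the allowed set is a judgment call, and the name is mine):

is.binary2 <- function(filepath, n = 1000){
  b <- readBin(filepath, "integer", n, size = 1, signed = FALSE)
  allowed <- c(9L, 10L, 13L, 32:126)  # TAB, LF, CR, printable ASCII
  any(!(b %in% allowed))
}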
Well, how would you do that? I guess you can't without reading (parts or all of) the file, which is why file extensions are used to signal content type.
I looked into that years ago, and as I recall, the file(1) app actually reads the first few header bytes of a file and compares them to what is stored in a lookup table. Sounds like a good candidate for an add-on package to me.
The example section of the manual for ?raw uses this:
isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))
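That tests a string already in memory; to apply it to a file, you could, for example, run it over each line (a sketch, assuming the file can be read as text at all):

isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))
all(vapply(readLines("some_file.txt"), isASCII, logical(1)))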

diff command doesn't print the lines that differ but still says the two files are different

I'm using the diff command to compare two text files. They need to match literally.
So I use diff:
diff binary.out binary.expected
(By the way, those files are NOT binary files. They are text files. I call them binary because that's the name of the project.)
and got
Binary files binary.out and binary.expected differ
When I use another diff tool, the smartest of all (AKA a human), there's really nothing different between the two files.
Does anyone happen to know what's going on here?
Thanks.
diff from diffutils says the following about text vs. binary:
diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.
Hence GNU diff has quite an open definition of what is text, and the --text option to force it to treat a file as text should seldom be needed.
Have you checked if binary.out or binary.expected contains null characters? What version is your diff program?
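Checking for stray null bytes is easy enough; a minimal sketch in R (the helper name is mine):

has.nulls <- function(path, n = 10000) {
  any(readBin(path, "raw", n = n) == as.raw(0))
}
has.nulls("binary.out")       # if TRUE, that explains diff's verdict
has.nulls("binary.expected")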
Make sure to ignore whitespace in the diff options (e.g. -w).
It may also be seeing Unicode text (e.g. UTF-16, which encodes ASCII characters with an embedded null byte) and interpreting that as binary. See if your diff tool has an option to force text mode.
