Data deletion with concurrent reading/writing in file system - Unix

Most file systems use locking to handle concurrent reads and writes. But what if, after a read call, a write call is executed that deletes the data preceding the previous read?
Is the file pointer of a file open for reading updated to reflect the new start of the now-smaller file?

The question isn't really valid, because you can't delete data using the write system call. You can overwrite data using the write(2) system call, but you can't delete it. You can, however, truncate the file using the truncate(2) system call. This changes the size of the file (reported via the st_size field of the stat(2) system call). You can also increase the size of the file with truncate by requesting a new size larger than the current one; older versions of POSIX left it unspecified whether this is allowed, but modern POSIX, and most file systems, simply set the size of the file to the requested size, with the newly added bytes reading back as zeros.
OK, a few more concepts. Associated with each open file structure is a file offset pointer. Attempts to read or write a file using the read(2) or write(2) system call advance the offset pointer by the number of bytes read or written. If you open a file twice using the open(2) system call, you get two file descriptors, each referring to a different open file structure; in that case, a read(2) or write(2) using one file descriptor will not change the file offset for the other file descriptor. (If you clone a file descriptor using the dup(2) system call, you get a second file descriptor which points to the same file structure, and changes made to the file structure via one file descriptor, using the read(2), write(2), or lseek(2) system calls, will be reflected via the cloned file descriptor. But that's a side issue, so that's all I will say on this topic for now.)
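To make the offset behavior concrete, here is a minimal C sketch (hypothetical file name, error checking omitted) showing that two independent open(2) calls get independent offsets while a dup(2)'d descriptor shares one:

    /* Assumes a file "example.txt" with at least 4 bytes of content. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[4];
        int fd1 = open("example.txt", O_RDONLY);
        int fd2 = open("example.txt", O_RDONLY); /* separate open file structure */
        int fd3 = dup(fd1);                      /* shares fd1's open file structure */

        read(fd1, buf, sizeof buf); /* advances fd1's (and therefore fd3's) offset to 4 */

        printf("fd2 offset: %ld\n", (long) lseek(fd2, 0, SEEK_CUR)); /* prints 0 */
        printf("fd3 offset: %ld\n", (long) lseek(fd3, 0, SEEK_CUR)); /* prints 4 */

        close(fd1); close(fd2); close(fd3);
        return 0;
    }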
Now, if you truncate the file, this doesn't change the file offset in the open file structure. So the answer is that the file offset pointer won't be updated after the truncate; an attempt to read(2) at an offset at or beyond the truncated size will simply return 0 bytes, i.e. end of file. (Only if the file is later extended again do the bytes between the old and new end of file read back as zeros.)
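A short C sketch of that interaction (hypothetical file name, error checking omitted): the reader's offset survives the truncation, and the next read(2) reports end of file rather than data:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[8];
        int fd = open("example.txt", O_RDONLY); /* assume the file is >= 8 bytes */
        read(fd, buf, sizeof buf);              /* offset is now 8 */

        truncate("example.txt", 4);             /* shrink the file to 4 bytes */

        /* The offset is still 8, beyond the new end of file... */
        printf("offset: %ld\n", (long) lseek(fd, 0, SEEK_CUR)); /* prints 8 */
        /* ...so the next read returns 0 bytes (end of file). */
        printf("read returned: %zd\n", read(fd, buf, sizeof buf)); /* prints 0 */

        close(fd);
        return 0;
    }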

Related

UNIX: Is the i-number the same as the file descriptor?

Dennis Ritchie and Ken Thompson's paper The UNIX Time-Sharing System mentions the following points:
About the i-number: A directory entry contains only a name for the associated file and a pointer to the file itself. This pointer is an integer called the i-number (for index number) of the file.
About the open and create system calls: The returned value (of open and create) is called a file descriptor. It is a small integer used to identify the file in subsequent calls.
Purpose of open/create: The purpose of an open or create system call is to turn the path name given by the user into an i-number by searching the explicitly or implicitly named directories.
Does this mean that the file descriptor is just the i-number of a file? Or am I missing something?
A file descriptor in UNIX is basically just an index into the array of open files for the current process.
An inode number is an index into the inode table for the file system.
So they're basically just integers, indexes into an array, but they are indexes into completely different, unrelated arrays. So there is no connection between them.
To add to Chris Dodd's answer, not only are inode numbers and file descriptor numbers not directly related, it wouldn't be practical for them to be.
Inode numbers are unique to each file system. Imagine if you opened fileA on a file system (say, /mnt) with inode number 100, and in the same process also opened fileB on another filesystem (say, /mnt2) which also happened to have inode number 100. What should the file descriptors be in that case?
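A quick way to see both numbers side by side is fstat(2). This C sketch (hypothetical path) prints the descriptor, a small per-process index, next to the inode number, which comes from the file system's metadata:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        struct stat st;
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd >= 0 && fstat(fd, &st) == 0) {
            /* Typically prints a small fd (e.g. 3) and an unrelated inode number. */
            printf("fd = %d, inode = %llu\n", fd, (unsigned long long) st.st_ino);
            close(fd);
        }
        return 0;
    }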

What is the "buffer" in the Atom editor?

In describing how find (& find and replace) work, the Atom Flight Manual refers to the buffer as one scope of search, with the entire project being another scope. What is the buffer? It seems like it would be the current file, but I expect it is more than that.
From the Atom Flight Manual:
A buffer is the text content of a file in Atom. It's basically the same as a file for most descriptions, but it's the version Atom has in memory. For instance, you can change the text of a buffer and it isn't written to its associated file until you save it.
Also came across this from The Craft of Text Editing by Craig Finseth, Chapter 6:
A buffer is the basic unit of text being edited. It can be any size, from zero characters to the largest item that can be manipulated on the computer system. This limit on size is usually set by such factors as address space, amount of real and/or virtual memory, and mass storage capacity. A buffer can exist by itself, or it can be associated with at most one file. When associated with a file, the buffer is a copy of the contents of the file at a specific time. A file, on the other hand, can be associated with any number of buffers, each one being a copy of that file's contents at the same or at different times.
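As a rough illustration of the idea (not Atom's actual implementation), an editor's buffer behaves like this C sketch: the file is copied into memory, edits touch only the copy, and the disk is updated only on an explicit save:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        FILE *f = fopen("notes.txt", "rb"); /* hypothetical file name */
        if (!f) return 1;
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        char *buffer = malloc(size);        /* the "buffer": an in-memory copy */
        fread(buffer, 1, size, f);
        fclose(f);

        if (size > 0)
            buffer[0] = 'X';                /* edit the buffer; the file on disk
                                               is still unchanged */

        FILE *out = fopen("notes.txt", "wb"); /* "save": write the buffer back */
        fwrite(buffer, 1, size, out);
        fclose(out);
        free(buffer);
        return 0;
    }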

Can ext4 detect corrupted file contents?

Can the ext4 filesystem detect data corruption of file contents? If yes, is it enabled by default and how can I check for corrupted data?
I have read that ext4 maintains checksums for file metadata and its journal, but I was unable to find any information on checksums for the actual file contents.
For clarity: I want to know if a file has changed since the last write operation.
No, ext4 doesn't and can't detect file content corruption.
Well-known file systems that implement silent data corruption detection, and are therefore able to correct it when enough redundancy is available, are ZFS and btrfs.
They do it by computing and storing a CRC for every data block written and by checking the CRC of each data block read. Should the CRC not match the data, the latter is not given to the caller; either RAID allows an alternate block to be used instead, or an I/O error is reported.
The reading process will never receive corrupted data: either the data is correct or the read fails.
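The mechanism can be sketched in a few lines of C. This is only an illustration of the checksum-on-write / verify-on-read idea, not ZFS's or btrfs's actual on-disk format; it uses zlib's crc32 (compile with -lz):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    struct block {
        unsigned char data[512];
        uLong crc;                      /* checksum stored alongside the data */
    };

    void block_write(struct block *b, const unsigned char *src, size_t len) {
        memset(b->data, 0, sizeof b->data);
        memcpy(b->data, src, len);
        b->crc = crc32(crc32(0L, Z_NULL, 0), b->data, sizeof b->data);
    }

    int block_read(const struct block *b, unsigned char *dst) {
        uLong crc = crc32(crc32(0L, Z_NULL, 0), b->data, sizeof b->data);
        if (crc != b->crc)
            return -1;                  /* corruption detected: report an I/O error
                                           (or fetch a redundant copy, as RAID would) */
        memcpy(dst, b->data, sizeof b->data);
        return 0;
    }

    int main(void) {
        struct block b;
        unsigned char out[512];
        block_write(&b, (const unsigned char *) "hello", 5);
        b.data[0] ^= 0xFF;              /* simulate silent bit rot on the medium */
        printf("read: %s\n", block_read(&b, out) == 0 ? "ok" : "I/O error");
        return 0;
    }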
"Can the ext4 filesystem detect data corruption of file contents?"
Not in the sense you are expecting. ext4 performs journaling: it keeps a before/after record of in-flight updates so that each I/O either completes or can be rolled back to a consistent state after a crash. That protects metadata consistency, not the file contents themselves.
A CRC / checksum is a test for modification from a known state, and even when the CRC or checksum no longer matches the original, that does not imply that the file is "corrupt" (i.e., invalid); it only says it has been changed. Strictly speaking, one form of "corruption" would be to alter the 'magic number' at the beginning of a file, like changing %PDF to %xYz; that would make the content unusable to any program.
"... to know if a file has changed since the last write operation".
Systems that track mtime do so uniformly: every write updates mtime, so by definition mtime always reflects the last write, and the check you are asking for is impossible this way.
The only way mtime would not reflect the last write I/O would be media degradation.
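In other words, the best a user-level check can do on ext4 is remember the file's previous mtime (or a content hash) and compare it on the next run, as in this C sketch with a hypothetical path:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <time.h>

    /* Return the current mtime of path, or 0 on error. A real change-detection
       tool would persist this value between runs and compare it. */
    time_t file_mtime(const char *path) {
        struct stat st;
        return stat(path, &st) == 0 ? st.st_mtime : 0;
    }

    int main(void) {
        time_t before = file_mtime("data.bin"); /* hypothetical file */
        /* ... some other process may write to data.bin here ... */
        time_t after = file_mtime("data.bin");
        if (after != before)
            printf("data.bin was written in the meantime\n");
        return 0;
    }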

How to change the chunksize and Retrieve specific chunk in Mongodb Gridfs?

I am pretty new to MongoDB. What I want to do is insert a 3 MB PDF file using the Java driver, change the chunk size from 256 KB to 1 MB, and then retrieve the second chunk, say the 2nd page of the PDF document.
How can I do that?
Thank you.
Generally, once a document has been written into GridFS you will need to re-write it (delete and save again) to modify the chunk size.
Since GridFS does not know anything about the format of the data in the file, it cannot help you get to the "2nd page". The InputStream implementation that is returned from GridFSDBFile does avoid reading blocks when you use the skip(long) method. If you know that the "2nd page" is N bytes into the file, then you can skip that many bytes in the stream and start reading.
HTH, Rob
P.S. Remember that skip(long) returns the number of bytes actually skipped. You should not assume that skip(12) always skips 12 bytes.
P.P.S Starting to read from the middle of a PDF and making sense of what is there is going to be hard unless you have preserved state from the previous page(s).
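The caveat about skip(long) is the same one that applies to partial reads in general: a single call may consume fewer bytes than requested, so you loop until the full count is consumed. Here is a generic C sketch of that pattern (an analogue only, not GridFS-specific):

    #include <unistd.h>

    /* Discard exactly count bytes from fd, looping over short reads. */
    int skip_bytes(int fd, long count) {
        char scratch[4096];
        while (count > 0) {
            long chunk = count < (long) sizeof scratch ? count : (long) sizeof scratch;
            ssize_t n = read(fd, scratch, (size_t) chunk);
            if (n <= 0)
                return -1;   /* EOF or error before the skip completed */
            count -= n;
        }
        return 0;
    }

    int main(void) {
        /* e.g. skip the first 100 bytes of standard input */
        return skip_bytes(STDIN_FILENO, 100) == 0 ? 0 : 1;
    }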

Is there any size/row limit in .txt file?

The question may look like a duplicate, but I am not getting the answer I am looking for.
The problem is that, in Unix, one of the 4GL binaries is fetching data from a table using a cursor and writing the data to a .txt file.
The table contains around 50 million records.
The binary takes a lot of time and never completes; the .txt file stays at 0 bytes.
I want to know the possible reasons why the records are not written to the .txt file.
Note: there is enough disk space available.
Also, for 30 million records, I get the data in the .txt file as expected.
The information you provide is insufficient to tell for sure why the file is not written.
In UNIX, a text file is just like any other file: a collection of bytes. No specific limit (or structure) is enforced on "row size" or "row count", although obviously some programs might have limits on the maximum supported line size and such (depending on their implementation).
When a program starts writing data to a file (i.e. once the internal buffer is flushed for the first time), the file will no longer be zero-sized, so clearly your binary is doing something else all that time (unless it wipes out the file as part of its cleanup).
Try running your executable via strace to see the file I/O activity - that would give some clues as to what is going on.
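For example (assuming your binary is called your_binary; -f and -o are standard strace options):

    strace -f -o trace.log ./your_binary
    grep write trace.log

-f follows child processes and -o writes the trace to trace.log, so you can see whether write(2) calls on the .txt file ever happen.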
Try closing the writer if you are using one to write to the file. This achieves the dual purpose of closing the resource and flushing the remaining contents of the buffer.
Computed output needs to be flushed if you are using any kind of buffered writer. I have encountered such situations a few times, and in almost all cases the issue was flushing the output.
In Java specifically, the usual best practice for writing data involves buffers. When the buffer limit is reached, the buffer gets written to the file, but data that has not yet filled the buffer does not; it is lost when the program exits without flushing the buffered writer.
So, in your case, if the processing time it takes is reasonable and the output is still not in the file, it may mean that the output has been computed and placed in RAM but never written to the file on disk because it was not flushed.
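The same applies outside Java; C's stdio shows the identical failure mode. A minimal sketch (hypothetical file name): output sits in the FILE* buffer until fflush(3) or fclose(3), and if the process dies first, the file can stay at 0 bytes:

    #include <stdio.h>

    int main(void) {
        FILE *out = fopen("records.txt", "w");
        if (!out) return 1;
        for (int i = 0; i < 1000000; i++)
            fprintf(out, "record %d\n", i); /* buffered in memory first */
        fclose(out); /* flushes the final buffer; skip this (or fflush) and
                        trailing output can be lost */
        return 0;
    }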
You can also consider the answers to this question.
