Write lines of text at a given position in a file in R - r

I would like to write a line to a text file at a given position (i) without reading the file sequentially.
There is the base function writeLines, but I don't know how to insert the text at a position (i) given as a parameter.
Thanks
Dave

This is — unrelated to R — fundamentally impossible. Most (all common) filesystems do not support inserting or removing content in the middle of a file. The only supported operations are appending (or truncation) at the end, and R only supports appending, not truncation.
The way virtually all software solves your problem is by reading the file, modifying it, and writing it back to disk. If you want to get fancy because the file is very large (at least on the order of hundreds of MiB), you can stream-edit the file: read a part, edit that part, and write it to a new file. Rinse and repeat, then replace the original file with the new one.
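The streaming variant described above can be sketched as follows. This is a minimal illustration in Python rather than R (in R the same pattern would be built on readLines and writeLines); the function name insert_line and its 1-based position argument are assumptions made for the example, not part of any library.

```python
import os
import tempfile


def insert_line(path, position, text):
    """Insert `text` as line number `position` (1-based) by streaming
    the original file into a temporary file and swapping the two."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", newline="") as dst, open(path, newline="") as src:
            inserted = False
            last = "\n"
            for number, line in enumerate(src, start=1):
                if number == position:
                    dst.write(text + "\n")
                    inserted = True
                dst.write(line)
                last = line
            if not inserted:
                # Position lies past the end of the file: append instead.
                if not last.endswith("\n"):
                    dst.write("\n")
                dst.write(text + "\n")
        # Replace the original only after the copy is complete.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Only one line is held in memory at a time, but the whole file is still read and rewritten once, which is unavoidable for the reasons given above.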
Technical aside: There is one exception to the above with low-level file operations, since files are stored as non-contiguous “blocks”. But even if R supported this, it wouldn’t help you, since block-level operations don’t permit byte-level or line-level granularity: blocks are typically at least 4 KiB in size.

Related

Vroom/fread won't read LARGE .csv file - cannot memory map it

I have a .csv file that is 112 GB in size, but neither vroom nor data.table::fread will open it. Even if I ask to read only 10 rows or just a couple of columns, it fails with a mapping error: Cannot allocate memory.
df<-data.table::fread("FINAL_data_Bus.csv", select = c(1:2),nrows=10)
System errno 22 unmapping file: Invalid argument
Error in data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10) :
Opened 112.3GB (120565605488 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.
read.csv on the other hand will read the ten rows happily.
Why won't vroom or fread read it using the usual altrep, even for 10 rows?
This matter has been discussed by the main creator of the data.table package at https://github.com/Rdatatable/data.table/issues/3526. See the comment by Matt Dowle himself at https://github.com/Rdatatable/data.table/issues/3526#issuecomment-488364641. From what I understand, the gist of the matter is that to read even 10 lines from a huge csv file with fread, the entire file needs to be memory mapped. So fread cannot be used on its own if your csv file is too big for your machine. Please correct me if I'm wrong.
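By contrast, a reader that streams the file from the start only touches the bytes it actually consumes, which is presumably why read.csv copes with the ten rows. A minimal sketch of that idea, in Python for illustration only (head_csv and its parameters are hypothetical names, and UTF-8 encoding is an assumption):

```python
import csv
from itertools import islice


def head_csv(path, n=10, columns=(0, 1)):
    """Return the first `n` records (header included) of a CSV file,
    keeping only the given column indexes, without mapping the file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # islice stops after n records, so the rest of the file is never read.
        return [[row[i] for i in columns] for row in islice(reader, n)]
```

The cost depends on how many rows are requested, not on the size of the file.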
Also, I haven't been able to use vroom with big more-than-RAM csv files. Any pointers towards this end will be appreciated.
For me, the most convenient way to check out a huge (gzipped) csv file is by using a small command line tool csvtk from https://bioinf.shenwei.me/csvtk/
e.g., check dimensions with
csvtk dim BigFile.csv.gz
and, check out head with top 100 rows
csvtk head -n100 BigFile.csv.gz
get a better view of above with
csvtk head -n100 BigFile.csv.gz | csvtk pretty | less -SN
Here I've used less command available with "Gnu On Windows" at https://github.com/bmatzelle/gow
A word of caution - many people suggest using command
wc -l BigFile.csv
to count the number of lines in a big csv file. In most cases, this equals the number of rows. But if the big csv file contains newline characters within a cell, to use a spreadsheet term, the number of lines differs from the number of rows, and the above command will not report the number of rows. So it is advisable to use csvtk dim or csvtk nrow. Other csv command line tools like xsv and miller will also show correct results.
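The difference between lines and rows is easy to reproduce on a tiny example. The following sketch uses Python's standard csv module purely as an illustration (the sample data and the function names are made up for the example):

```python
import csv
import io

# Three records (one header, two data rows); the first data row has a
# quoted cell that contains a newline.
SAMPLE = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'


def count_lines(text):
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")


def count_records(text):
    """Count CSV records, honouring quoted cells that span lines."""
    return sum(1 for _ in csv.reader(io.StringIO(text)))


print(count_lines(SAMPLE), count_records(SAMPLE))
```

Here the line count is 4 while the file holds only 3 records, so a line-based count overstates the number of rows.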
Another word of caution - the short command fread(cmd="head -n 10 BigFile.csv") is not advisable for previewing the top few lines if some columns contain significant leading zeros, such as 0301, 0542, etc. Without a column specification, fread will interpret such values as integers and drop the leading zeros. For example, in some databases that I have to analyse, a leading zero in a particular column means that the entry is a Revenue Receipt. So it is better to preview a big csv file with a command line tool like csvtk, miller, or xsv together with less -SN, which show the file "as is" without any potentially wrong interpretation.
PS1: Even spreadsheets like MS Excel and LibreOffice Calc lose leading zeroes in csv files by default. LibreOffice Calc actually shows leading zeroes in the preview window but loses them when you load the file! I have yet to find a spreadsheet that does not lose leading zeroes in csv files by default.
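The effect is not specific to fread: any reader that converts such a column to numbers loses the zeros, while reading it as text keeps them. A small sketch in Python, for illustration only (the sample data and column names are invented):

```python
import csv
import io

SAMPLE = "code,amount\n0301,10\n0542,20\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Kept as text, the codes are shown exactly as stored in the file.
codes_as_text = [row["code"] for row in rows]

# Converted to integers, the significant leading zero disappears.
codes_as_int = [int(row["code"]) for row in rows]

print(codes_as_text, codes_as_int)
```

In fread itself, the usual remedy is to declare the affected columns as character, for example through its colClasses argument.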
PS2: I've posted my approach to querying very large csv files at https://stackoverflow.com/a/68693819/8079808
EDIT:
vroom does have difficulty dealing with huge files, since it needs to store the index in memory as well as any data you read from the file. See the development thread at https://github.com/r-lib/vroom/issues/203

New line constant

Is there a new line constant that's platform independent in R? I'm used to C#, where Environment.NewLine returns \r\n on Windows and \n otherwise. Searching turned up nothing, but I assume there has to be something somewhere so that scripts can be platform independent.
Related question: Is there a way to detect the platform a script is running on? This could be useful to know for other reasons (which I haven't thought of yet).
EDIT: Here's why I'm asking. I'm downloading files from an FTP server, but want to get a list of files and only download files that are on the server that don't exist locally. Here's how I'm getting the list of files:
filesonserver <- unlist(strsplit(getURL(basePath, ftp.use.epsv=F, dirlistonly=T), "\n"))
On Windows, the files are separated by \r\n. On my Mac (where I'm currently working), they're separated by \n. I was looking for a way to make this platform independent. I haven't tried just separating by \n on Windows, which might work. There might also be a way to get the list of files as a vector without having to split them, which would avoid this entirely...
The package tryCatchLog has a function determine.platform.NewLine():
https://cran.r-project.org/package=tryCatchLog
https://github.com/aryoda/tryCatchLog/blob/master/R/platform_newline.R
If you consistently use this string instead of a hard-coded "\n", your new lines will work platform-independently.
The answer to the initial question appears to be that there isn't a new line constant like the one C# has. But it doesn't matter in my case, as the comments pointed out. It didn't occur to me until after I edited in the details that I probably didn't need to worry about it. Splitting by \n works fine on Windows, even though the entries in the string of file names returned by getURL() are separated by \r\n.
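For the directory-listing case, an alternative to looking up the platform's separator is to split on either form, so the same code handles \r\n and \n. A sketch of that idea in Python, for illustration only (split_listing is a hypothetical helper; in R, strsplit with the same regular expression would play this role):

```python
import re


def split_listing(listing):
    """Split a directory listing on \\r\\n or \\n and drop empty entries."""
    return [name for name in re.split(r"\r?\n", listing) if name]
```

Both split_listing("a.txt\r\nb.txt\r\n") and split_listing("a.txt\nb.txt\n") yield the same two file names, with no stray \r left at the end of an entry.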

Parse Flat file every 410 characters

I have a flat file that I need to take and insert a carriage return every 410 characters. I know this sounds weird, but for whatever reason my work was given several huge flat files from a clearinghouse, and I need to parse it out.
There is nothing that separates what is supposed to be each new line, but each record is exactly 410 characters long. So I can't even search for anything specific and then break the line there.
There are 21 files in total, each about 12-13 MB.
I have asked for a CSV file, and they are unable to provide that.
I am trying to see if Notepad++ will do a Character count and then I can just hit "enter" after every 410th.
Also I am trying to see if I can do this in Java.
Any help you all can provide would be appreciated.
In Notepad++ you can search for the regular expression (.{410}) and replace it with \1\r.
It has happened to me that Notepad++ swallowed some characters when doing regex-based search and replace operations in large files, so I would try this for one file, then remove all the carriage returns again and compare the result size to the original size, just to make sure that nothing got swallowed during the replace operation.
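The same split, together with the size check suggested above, can also be scripted. A minimal sketch in Python (the function names are hypothetical, and it assumes the input contains no line breaks to begin with):

```python
def split_fixed_width(text, width=410):
    """Cut text into consecutive records of `width` characters each."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def add_line_breaks(text, width=410, newline="\r\n"):
    """Insert `newline` after every `width` characters."""
    records = split_fixed_width(text, width)
    # Sanity check: joining the records without separators must
    # reproduce the input exactly, so nothing was swallowed.
    if "".join(records) != text:
        raise ValueError("records do not add up to the original text")
    return newline.join(records)
```

With files of 12-13 MB each, reading a whole file into memory for this is not a concern.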

Reading text files in Ada: Get_Line "reads" the byte-order mark as well

I'm trying to read a file line-by-line in Ada; it's an XML text file. I'm following the instructions here:
http://rosettacode.org/wiki/Read_a_file_line_by_line#Ada
However, there's a problem that annoys me: the "Get_Line" function seems to be unaware of byte-order marks and reads them as part of the text itself, which means that when I read the lines, the first one will always start with some extra bytes that should not be there.
While removing the extra bytes manually from the string is no big deal, it seems strange to me that a function dedicated to text input/output is unaware of BOMs. There must be a way to read a text file in Ada without having to worry about this... is there?
Ada.Text_IO is specified to handle ISO-8859-1 encoded text, so ignoring a UTF-8 feature is the proper thing to do.
If Ada.Wide_Text_IO and Ada.Wide_Wide_Text_IO also return the byte-order mark when asked to read UTF-8 encoded text, then you should consider reporting it as a bug to GCC - but as there are quite a lot of implementation-defined details in the text I/O packages in Ada, you should be ready for a "won't fix" answer.
One possibility is using the stream attributes and making a UTF_8 file-type to handle the BOM reading-and-discarding.
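The read-and-discard step itself is small: look at the first three bytes and skip them if they are the UTF-8 BOM (EF BB BF). A sketch of the logic in Python, for illustration only, since an Ada version would do the same on a stream (the function name is hypothetical):

```python
import codecs


def read_lines_without_bom(path):
    """Read a UTF-8 text file and return its lines, dropping a leading
    byte-order mark if one is present."""
    with open(path, "rb") as handle:
        data = handle.read()
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8").splitlines()
```

Files without a BOM pass through unchanged, so the check is safe to apply unconditionally.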

Diff-command: doesn't print lines that are different but still says the two files are different

I'm using the diff command to compare two text files. They need to be literally matched.
So I use the diff:
diff binary.out binary.expected
(By the way, those files are NOT binary files. They are text file. I call them binary because that's the name of the project)
and got
Binary files binary.out and binary.expected differ
When I use another diff tool, the smartest of all (AKA a human), there's really nothing different between the two files.
Does anyone happen to know what's going on here?
Thanks.
diff from diffutils says the following about text/binary:
diff determines whether a file is text or binary by checking the
first few bytes in the file; the exact number of bytes is system
dependent, but it is typically several thousand. If every byte in
that part of the file is non-null, diff considers the file to be
text; otherwise it considers the file to be binary.
Hence GNU diff has a quite open definition of what is text, and the --text option to force it to treat a file as text should seldom be needed.
Have you checked if binary.out or binary.expected contains null characters? What version is your diff program?
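A quick way to check for null characters is to scan the start of each file for a zero byte, mirroring the heuristic quoted above. A minimal sketch in Python (looks_binary is a hypothetical helper, and the 4096-byte probe size is an assumption, since the documentation only says the number is system dependent):

```python
def looks_binary(path, probe_bytes=4096):
    """Return True if a NUL byte occurs in the first `probe_bytes` bytes,
    which is the condition the quoted documentation describes."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(probe_bytes)
```

If this returns True for binary.out or binary.expected, that would explain the "Binary files ... differ" message even though both files look like text.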
Make sure to ignore white space in the diff options.
It may also see Unicode characters and interpret that as binary. See if your diff tool has an option to force text mode.
