Reading multiple data frames from a single file with R

My problem is that I'm trying to read in data which has been formatted by an archaic piece of Fortran code (and thus is character limited on each line). The data consists of a number of chunks, each with a fixed width format, and the basic structure of each chunk is:
header line (one line, 11 columns)
data (80 lines, 11 columns)
header line (identical to above)
blank (3 lines)
The first column is identical for each chunk, so once the chunks are read in I can join them into a single data frame. But how do I read all of the chunks in? Am I limited to writing a loop with a skip value that increases in increments of 85, or is there a neater way to do this?
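One possibility, hedged because I don't know your real column widths or file name, is to read the whole file once with readLines() and slice it into 85-line chunks, parsing the 80 data lines of each chunk with read.fwf(); the widths (rep(8, 11)) and "fortran_output.dat" below are placeholders:

read_chunks <- function(path, widths = rep(8, 11)) {
  lines  <- readLines(path)
  starts <- seq(1, length(lines), by = 85)        # each chunk spans 85 lines
  lapply(seq_along(starts), function(k) {
    i    <- starts[k]
    body <- lines[(i + 1):(i + 80)]               # the 80 data lines, skipping the header
    df   <- read.fwf(textConnection(body), widths = widths)
    names(df) <- c("key", paste0("chunk", k, "_V", 2:11))   # keep names unique per chunk
    df
  })
}

## chunks   <- read_chunks("fortran_output.dat")
## combined <- Reduce(function(x, y) merge(x, y, by = "key"), chunks)

This reads the file from disk only once and avoids maintaining a skip counter by hand.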

Related

How to handle and normalize a dataframe with billions of rows?

I would need to analyze a dataframe on R (bash or even python if you have suggestions, but I don't know how to use python well). The dataframe has approximately 6 billion rows and 8 columns (control1, treaty1, control2, treaty2, control3, treaty3, control4, treaty4).
Since the file is almost 300 GB and 6 billion lines, I cannot open it in R all at once.
I would need to read the file line by line and remove every line that contains even a single 0.
How could I do this?
If I also needed to divide each value inside a column by a number, and put the result in a new dataframe with the same shape as the starting one, how could I do that?
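A rough sketch of the chunked approach in base R, filtering and rescaling as you go and writing the result to a new file; the file names, the whitespace delimiter, the lack of a header row, and the divisor 10 are all assumptions on my part:

con_in  <- file("big_input.txt", open = "r")
con_out <- file("filtered_output.txt", open = "w")

repeat {
  chunk <- readLines(con_in, n = 1e6)                    # one million lines at a time
  if (length(chunk) == 0) break                          # end of file reached
  dat <- read.table(text = chunk, header = FALSE)
  dat <- dat[rowSums(dat == 0) == 0, , drop = FALSE]     # drop rows containing any 0
  dat[, 1] <- dat[, 1] / 10                              # example: rescale one column
  write.table(dat, con_out, col.names = FALSE, row.names = FALSE)
}

close(con_in)
close(con_out)

Because the output connection stays open, each chunk is written after the previous one, so the whole 300 GB file never has to sit in memory at once.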

R Code: csv file data incorrectly breaking across lines

I have some csv data that I'm trying to read in, where lines are breaking across rows weirdly.
An example of the file (the files are the same but the date varies) is here: http://nemweb.com.au/Reports/Archive/DispatchIS_Reports/PUBLIC_DISPATCHIS_20211118.zip
The csv is non-rectangular because there's 4 different types of data included, each with their own heading rows. I can't skip a certain number of lines because the length of the data varies by date.
The data that I want is the third dataset (sometimes the second), and it has approximately twice as many columns as the data above it. So I use read.csv() without a header, expecting it to pull in all the data and pad the shorter sections above with NAs.
But for some reason read.csv() seems to decide that there are 28 columns of data (corresponding to the data headers on row 2), which splits the data lower down across three rows - so instead of the data headers being on one row, they split across three, and so do all the rows of data below them.
I tried reading it in with the column names explicitly defined, but it still splits the rows in the same way.
I can't figure out what's going on - if I open the csv file in Excel it looks perfectly normal.
If I use readr::read_lines() there's no errant carriage returns or new lines as far as I can tell.
Hoping someone might have some guidance, otherwise I'll have to figure out a kind of nasty read_lines approach.
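One read_lines-style approach that is less nasty than it sounds: pull the raw lines in, keep only the rows that belong to the table you want, and hand just those to read.csv(). I don't know the exact row labels inside your file, so the "DISPATCH,PRICE" pattern and the file name below are placeholders:

raw  <- readr::read_lines("PUBLIC_DISPATCHIS_20211118.CSV")
## keep only the header and data rows of the table you want; adjust the
## pattern to whatever identifies the third dataset in your file
keep <- grepl("DISPATCH,PRICE", raw, fixed = TRUE)
dat  <- read.csv(text = paste(raw[keep], collapse = "\n"), header = TRUE)

Because only one table's rows ever reach read.csv(), it can no longer guess the wrong column count from the shorter header block above.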

Printing out R Dataframe - Single Character Between Columns While Maintaining Alignment (Variable Spacing)

In a previous question, I received output for an R dataframe that had two aligned columns. The answer gave me the following output:
While the post answered my initial question, it seems the program I intend to use requires a text file in which the two columns are both aligned and separated by a single character (e.g. a tab). The previous solution instead produces a large and variable number of spaces between the first and second columns (depending on the length of the string in the first column for that particular row). Inserting a single character, however, misaligns the columns.
Is there any way I can replace that large number of spaces with a single character that has variable spacing, so that it 'reaches' across to the second column?
If it helps, this webpage contains a .txt file that you may download to see the intended output. It does not suffer from the problem of variable name lengths in the first column, but it has a single 'space character' separating the first and second columns. If I copy and paste that specific space character between columns 1 and 2, the program can successfully interpret the .txt file; the copy and paste results in a single character separating the columns and appropriate alignment.
For further example, the first of the following pictures (note the highlight is a single character) properly parses while the second does not:
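The "single character with variable spacing" you are describing is essentially a tab. A minimal sketch, assuming your data frame is called df with columns name and value (placeholders for your real column names):

## one literal tab between the two columns; the program sees a single
## separator character, and tab stops provide the visual alignment
out <- paste(df$name, df$value, sep = "\t")
writeLines(out, "aligned.txt")

If the first column's lengths straddle a tab stop, the display may still drift even though the file parses correctly; in that case you can left-pad the first column to a fixed width with formatC(df$name, width = max(nchar(df$name)), flag = "-") before pasting, which still keeps exactly one tab as the separator.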

Formatting Header to Append to Data Frame in R

I'm attempting to create a specifically formatted header to append to a data frame I have created in R.
The essence of my problem is that it seems increasingly difficult (maybe impossible?) to create a header that breaks away from a typical one row by one column framework, without merging the underlying table, using the dataframe concept in R.
The issue stems from my not being able to figure out a way to import this particular format of header into R through methods such as read.csv or read.xlsx while preserving its format.
Reading a header of this format into R from a .csv or .xlsx is quite ugly and doesn't preserve the original format. The format of the header I'm trying to create and append to an already existing data frame of 17 nameless columns in R could be represented like this:
Here the number series 1 - 17 represents the already existing data frame of 17 nameless columns of data that I have created in R, to which I wish to append this header. Could anyone point me in the right direction?
You are correct that this header will not work within R. The data frame only supports single header values and won't do anything akin to a merged cell in Excel.
However, if you simply want to export your data to a .csv or .xlsx (using write.csv), you could write the data out and then paste your header in by hand; that could work.
OR
You could add in a factor column to your data frame to capture the information contained in the top level of your header.
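A sketch of the export route, with a made-up two-row header and file name standing in for your real ones:

## write the multi-row header first ...
writeLines(c("Group A,,,,,,,,Group B,,,,,,,,",
             paste0("col", 1:17, collapse = ",")), "out.csv")
## ... then append the 17 nameless columns of data underneath it
write.table(df, "out.csv", sep = ",", append = TRUE,
            col.names = FALSE, row.names = FALSE)

The header never has to exist inside the data frame at all; it only exists in the exported file.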

How to delete all rows in R until a certain value

I have several data frames that start with a bit of text. Sometimes the information I need starts at row 11 and sometimes it starts at row 16, for instance - it changes. All the data frames have in common that the useful information starts after a row containing the title "location".
I'd like to make a loop to delete all the rows in the data frame above the useful information (including the row with "location").
I'm guessing that you want something like this:
readfun <- function(fn, n = -1, target = "location", ...) {
  r <- readLines(fn, n = n)            # read the raw lines (all of them when n = -1)
  locline <- grep(target, r)[1]        # first line that matches the target
  read.table(fn, skip = locline, ...)  # re-read, skipping everything up to and including that line
}
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (#MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...
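For example, assuming the file is whitespace-delimited and the line right after "location" holds the column names, the call might look like dat <- readfun("mydata.txt", n = 200, header = TRUE), where n = 200 just caps how far readLines() searches for the target; the file name and values here are placeholders.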
