I have some csv data that I'm trying to read in, where single lines are being broken across multiple rows in a weird way.
An example of the file (the files are the same but the date varies) is here: http://nemweb.com.au/Reports/Archive/DispatchIS_Reports/PUBLIC_DISPATCHIS_20211118.zip
The csv is non-rectangular because there are 4 different types of data included, each with their own heading rows. I can't just skip a fixed number of lines because the length of each dataset varies by date.
The data I want is the third dataset (sometimes the second), and it has roughly twice as many header fields as the dataset above it. So I use read.csv() without a header, and ideally it should pull in all the data, padding the narrower rows above with NAs.
But for some reason read.csv() decides that there are 28 columns of data (matching the data headers on row 2), which splits the wider data lower down - so instead of its headers sitting on one row, they wrap across three, and so do all the rows of data below them.
I tried reading it in with the column names explicitly defined, but it still splits the rows weirdly.
I can't figure out what's going on - if I open the csv file in Excel it looks perfectly normal.
If I use readr::read_lines() there's no errant carriage returns or new lines as far as I can tell.
Hoping someone might have some guidance; otherwise I'll have to fall back on a kind of nasty read_lines approach, roughly sketched below.
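For reference, what I'm picturing is something like this (the file name is from the zip above; the assumptions that the dataset I want is the block with the most fields, and that no field contains a quoted comma, are guesses on my part). If I'm reading ?read.table right, the 28-column behaviour would come from it determining the column count from only the first five lines:

library(readr)

lines <- read_lines("PUBLIC_DISPATCHIS_20211118.CSV")
# crude field count per line (assumes no commas inside quoted fields)
nfields <- lengths(strsplit(lines, ",", fixed = TRUE))
# keep only the widest block - the dataset I'm after - and parse just those lines
wide <- read.csv(text = lines[nfields == max(nfields)], header = FALSE)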
Related
I am trying to read an excel file with multiple tabs. For that, I use the code provided here.
The problem is that each tab has a different number of empty rows before the actual data begins. For example, the first tab has two empty rows, the second tab has three empty rows, and so on.
Normally, I would use the parameter skip in the read_excel function to indicate the number of empty lines to skip. But how do I do that for multiple tabs with different numbers of rows to skip?
Perhaps the easiest solution would be to read it in as-is and then remove rows, i.e. yourdata <- yourdata[!is.na(yourdata$columnname), ]; this works if you don't expect any NAs in a particular column, like an id. If you have data gaps everywhere, you can test for all-NA across multiple columns instead - let me know if that's what you need.
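For example, a minimal sketch of that read-then-filter approach over every tab; the path, and the assumption that real rows never have a missing first column, are placeholders:

library(readxl)

path <- "yourfile.xlsx"                          # placeholder path
tabs <- lapply(excel_sheets(path), function(s) {
    x <- read_excel(path, sheet = s, col_names = FALSE)
    x[!is.na(x[[1]]), ]                          # drop leading junk rows; assumes column 1 is never NA in real rows
})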
I am trying to read a special type of file (the format is called AGS); an example is shown in the image below.
This is basically a TEXT file containing many tables of different dimensions, separated by 2 (but sometimes more) empty rows. As you might guess, the problem is that these tables have different numbers of columns and, obviously, different column names.
The first row in each table (here tables are denoted as GROUP) shows the name of the table, e.g. LOCA, HDPH, etc. The second row shows the column names. The third row shows the units of each column. All the other rows show the observations. In each row, columns are separated by commas and values are inside double quotes.
I am struggling to read this type of file. The ideal output would be each of these tables in a separate data frame. Any help and ideas are much appreciated.
An example file can be downloaded here: example AGS file
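In case it helps, this is as far as I've got - a sketch that splits the raw lines into blocks at blank-line runs, assuming every table is a GROUP row, a column-name row, a units row, then observations (the file name is a placeholder):

lines <- readLines("example.ags")
blank <- !nzchar(trimws(lines))
# give each run of non-blank lines its own block id
block_id <- cumsum(blank & !c(FALSE, head(blank, -1)))
blocks <- split(lines[!blank], block_id[!blank])
# per block: row 1 names the GROUP, row 2 has column names, row 3 has units
tables <- lapply(blocks, function(b) read.csv(text = b[-c(1, 3)], header = TRUE))
names(tables) <- sapply(blocks, function(b) read.csv(text = b[1], header = FALSE)[[2]])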
I'm attempting to create a specifically formatted header to append to a data frame I have created in R.
The essence of my problem is that it seems increasingly difficult (maybe impossible?) to create a header that breaks away from the typical one-row-by-one-column framework, without merging the underlying table, using R's data frame concept.
The issue stems from my not being able to figure out a way to import this particular header format into R through methods such as read.csv or read.xlsx in a way that preserves the header's layout.
Reading a header of this format into R from a .csv or .xlsx is quite ugly and doesn't preserve the original layout. The header I'm trying to create and append to an existing data frame of 17 nameless columns in R could be represented like this:
Where the number series 1 - 17 represents the existing data frame of 17 nameless columns that I have created in R and wish to append to this header. Could anyone point me in the right direction?
You are correct that this header will not work within R. A data frame only supports single header values and won't do anything akin to a merged cell in Excel.
However, if you simply want to export your data to a .csv or .xlsx (use write.csv), you could just paste your header in afterwards; that could work.
OR
You could add a factor column to your data frame to capture the information contained in the top level of your header (see the sketch below).
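A rough sketch of both options (every name below is made up):

# Option 1: write the data with no header at all, then paste the formatted
# header rows in by hand (e.g. in Excel)
write.table(mydata, "out.csv", sep = ",", row.names = FALSE, col.names = FALSE)

# Option 2: keep a single-row header and move the top level into a factor column
mydata$group <- factor("top-level label")      # hypothetical grouping value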
My problem is that I'm trying to read in data which has been formatted by an archaic piece of Fortran code (and is thus character-limited on each line). The data consists of a number of chunks, each in a fixed-width format, and the basic structure of each chunk is:
header line (one line, 11 columns)
data (80 lines, 11 columns)
header line (identical to above)
blank (3 lines)
The first column is identical for each chunk, so once they're read in I can join the data frames into a single one. However, how do I read all of the chunks in? Am I limited to writing a loop with a skip value that goes up in increments of 85 (roughly what I've sketched below), or is there a neater way to do things?
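A minimal sketch of that loop using read.fwf; the widths, file name, and chunk count are all placeholders:

widths <- rep(7, 11)                  # placeholder column widths
nchunks <- 5                          # placeholder; derivable from the file length
chunks <- lapply(seq_len(nchunks), function(k) {
    read.fwf("fortran_output.dat", widths = widths,
             skip = (k - 1) * 85 + 1, # skip earlier chunks plus this chunk's header line
             n = 80)                  # read exactly the 80 data lines
})
# join everything on the shared first column (V1 is read.fwf's default name)
combined <- Reduce(function(a, b) merge(a, b, by = "V1"), chunks)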
I have several data frames which start with a bit of text. Sometimes the information I need starts at row 11 and sometimes at row 16, for instance; it changes. What all the data frames have in common is that the useful information starts after a row with the title "location".
I'd like to make a loop to delete all the rows in the data frame above the useful information (including the row with "location").
I'm guessing that you want something like this:
readfun <- function(fn, n = -1, target = "location", ...) {
    r <- readLines(fn, n = n)             # read raw lines (n = -1 reads the whole file)
    locline <- grep(target, r)[1]         # first line that matches the target
    read.table(fn, skip = locline, ...)   # re-read, skipping everything up to and including it
}
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (#MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...
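A hypothetical call, assuming the real data starts with a header row and "location" never appears more than 200 lines in:

df <- readfun("mydata.txt", n = 200, target = "^location", header = TRUE)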