Using data.table::fread on a column containing a single double quote

I've been googling and reading posts on problems similar but different to the one described below; apologies if this is a duplicate.
I've got a csv file with a field which can contain, among other things, a single instance of a double quote (object descriptions sometimes contain lengths specified in inches).
When I call fread as follows
data_in <- data.table::fread(file_path, stringsAsFactors = FALSE)
the resulting data frame contains two consecutive double quotes in instances where the source file only had one (e.g., the string which appears in the raw csv as
MI|WIRE 9" BGD
appears in the data frame as
MI|WIRE 9"" BGD
).
This character field can also contain commas, semicolons, single quotes in any quantity, and many other characters which I cannot identify.
This is a problem as I need the exact string to match another dataset's values with merge (in fact, the file being read in was originally written from R with fwrite).
I assume that nearly any I/O problem I'm wrestling with can be solved with readLines and some elbow grease, but I quite like fread. Based on what I've read online this seems similar to problems that others have faced, so I'm guessing that some tweaking of fread's parameters will solve it. Any ideas?
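In case it helps anyone hitting the same thing, one workaround worth trying (a sketch only, since it depends on how the file was written; the column name below is a placeholder) is to switch off fread's quote handling entirely, or to undo the doubling after the fact:

# Option 1: tell fread not to treat double quotes specially.
data_in <- data.table::fread(file_path, quote = "", stringsAsFactors = FALSE)

# Option 2: read as before and collapse any doubled quotes afterwards.
# "description" is a placeholder for the affected character column.
data_in <- data.table::fread(file_path, stringsAsFactors = FALSE)
data_in[, description := gsub('""', '"', description)]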

Related

R Code: csv file data incorrectly breaking across lines

I have some csv data that I'm trying to read in, where lines are breaking across rows weirdly.
An example of the file (the files are the same but the date varies) is here: http://nemweb.com.au/Reports/Archive/DispatchIS_Reports/PUBLIC_DISPATCHIS_20211118.zip
The csv is non-rectangular because there are four different types of data included, each with their own heading rows. I can't skip a fixed number of lines because the length of the data varies by date.
The data that I want is the third dataset (sometimes the second), and it has approximately twice as many headers as the data above it. So I use read.csv() without a header, and ideally it should pull in all the data and fill with NAs above.
But for some reason read.csv() seems to decide that there are 28 columns of data (corresponding to the data headers on row 2), which splits the data lower down across three rows - so instead of the data headers being on one row, they split across three, and so do all the rows of data below them.
I tried reading it in with the column names explicitly defined, but it still splits the rows weirdly.
I can't figure out what's going on - if I open the csv file in Excel it looks perfectly normal.
If I use readr::read_lines() there's no errant carriage returns or new lines as far as I can tell.
Hoping someone might have some guidance, otherwise I'll have to figure out a kind of nasty read_lines approach.
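For what it's worth, here is one read_lines-style sketch. The row prefixes below are guesses about the file layout (these NEM-style files tag each block's header and data rows with a common prefix) - check a few lines of your actual file and adjust the patterns before relying on it:

# Sketch, not tested against the linked file. "I,DISPATCH,PRICE" and
# "D,DISPATCH,PRICE" are hypothetical markers for the wanted block's header
# row and data rows; the file name assumes you've extracted the csv from the zip.
lines <- readLines("PUBLIC_DISPATCHIS_20211118.CSV")

hdr_row   <- grep("^I,DISPATCH,PRICE", lines)[1]   # header row of the wanted block
data_rows <- grep("^D,DISPATCH,PRICE", lines)      # its data rows

block <- read.csv(text = c(lines[hdr_row], lines[data_rows]),
                  check.names = FALSE, stringsAsFactors = FALSE)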

How does R deal with tiny decimals? Converting .csv file list to numbers

I've seen a number of threads about similar problems and have tried all the suggestions with no luck so far - I think maybe the issue is the size of the numbers in my file.
I have a .csv file with 5399 rows and one column of numbers with six decimal places and only three or four significant figures (e.g., 0.000615).
I can import the file without any problems and de-select the strings-to-factors option, resulting in mode: list.
I have tried as.numeric(), as.character(as.numeric()), tried copying the data into a data frame, into a matrix, into a vector... nothing works. I can do simple maths on the list but cannot run the functions and loops I need to.
Because as.character(as.numeric()) normally works, I figure it must be an issue with the size of the numbers. Does R have a problem with tiny decimals?
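A small sketch of the usual fix, assuming the import produced a one-column data frame whose column came in as a list or character vector (the file name is a placeholder): flatten first, then coerce; the magnitude of the numbers is not the issue.

dat <- read.csv("values.csv", stringsAsFactors = FALSE)  # placeholder file name

# If the column came in as a list, unlist() it before coercing to numeric.
x <- as.numeric(unlist(dat[[1]]))

str(x)       # should now be a plain numeric vector
summary(x)   # values like 0.000615 are well within double precision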

How to read a data set and remove commas from measured variables

I am having an issue creating a linear model from a data frame I have stored because the independent variable contains comma separators (i.e., 314,567.5 vs 314567.5). How could I use read.csv or readr to read a data set and return a data frame without the commas in that specific column?
The answer to the commas question is here.
However, you first need to read the file into R. Although it can be a bit of a pain, I've found that read.fwf is often the best solution in these situations, unless you have a different delimiter, such as a pipe, |, in which case read.delim would probably be best.
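For completeness, a hedged sketch of the usual gsub() route (the file and column names are placeholders), plus a readr alternative using col_number(), which tolerates grouping commas:

df <- read.csv("data.csv", stringsAsFactors = FALSE)   # placeholder file name
df$amount <- as.numeric(gsub(",", "", df$amount))      # strip thousands separators

# readr alternative: declare the column as a number, which ignores grouping marks.
# df <- readr::read_csv("data.csv",
#                       col_types = readr::cols(amount = readr::col_number()))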

How to delete all rows in R until a certain value

I have several data frames which start with a bit of text. Sometimes the information I need starts at row 11 and sometimes it starts at row 16, for instance; it changes. All the data frames have in common that the useful information starts after a row with the title "location".
I'd like to make a loop to delete all the rows in each data frame above the useful information (including the row with "location").
I'm guessing that you want something like this:
readfun <- function(fn, n = -1, target = "location", ...) {
  # Read up to n lines as raw text (n = -1 reads the whole file)
  r <- readLines(fn, n = n)
  # Find the first line that matches the target string
  locline <- grep(target, r)[1]
  # Re-read the file as a table, skipping everything up to and including that line
  read.table(fn, skip = locline, ...)
}
This is fairly inefficient because it reads the data file twice (once as raw character strings and once as a data frame), but it should work reasonably well if your files are not too big. (@MrFlick points out in the comments that if you have a reasonable upper bound on how far into the file your target will occur, you can set n so that you don't have to read the whole file just to search for the target.)
I don't know any other details of your files, but it might be safer to use "^location" to identify a line that begins with that string, or some other more specific target ...
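A hypothetical call, assuming the data block that follows the "location" line has its own header row and is comma-separated (the file name is a placeholder):

dat <- readfun("site_data.txt", header = TRUE, sep = ",")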

R claims that data is non-numeric, but after writing to file is numeric

I have read in a table in R, and am trying to take log of the data. This gives me an error that the last column contains non-numeric values:
> log(TD_complete)
Error in Math.data.frame(list(X2011.01 = c(187072L, 140815L, 785077L, :
non-numeric variable in data frame: X2013.05
The data "looks" numeric, i.e. when I read it my brain interprets it as numbers. I can't be totally wrong since the following will work:
> write.table(TD_complete,"C:\\tmp\\rubbish.csv", sep = ",")
> newdata = read.csv("C:\\tmp\\rubbish.csv")
> log(newdata)
The last line will happily output numbers.
This doesn't make any sense to me - either the data is numeric when I read it in the first time round, or it is not. Any ideas what might be going on?
EDIT: Unfortunately I can't share the data, it's confidential.
Review the colClasses argument of read.csv(), where you can specify what type each column should be read and stored as. That might not be so helpful if you have a large number of columns, but using it makes sure R doesn't have to guess what type of data you're using.
Just because "the last line will happily output numbers" doesn't mean R is treating the values as numeric.
Also, it would help to see some of your data.
If you provide the actual data or a sample of it, help will be much easier.
In this case I assume R has the column in question stored as a character string and writes the values out to the CSV file as plain digits. On reading the file back in, it sees values containing nothing but digits and interprets them as numbers. In other words, by writing and reading a CSV file you converted a string containing only numbers into a proper integer (or float).
But without the actual data or the rest of the code this is mere conjecture.
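If it's any use, a quick diagnostic along the lines of the answers above (X2013.05 is taken from the error message; adjust for your data):

sapply(TD_complete, class)       # find columns that are factor/character rather than numeric
unique(TD_complete$X2013.05)     # look for stray non-numeric entries ("", "NA ", etc.)

# Convert via character first in case the column is a factor, then retry.
TD_complete$X2013.05 <- as.numeric(as.character(TD_complete$X2013.05))
log(TD_complete)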
