Avoid importing empty line breaks as \n - r

Some of the fields of a csv file I'd like to import contain text followed by one or two empty line breaks. As a result, when using read.csv2 to import the csv file, I obtain fields containing "[text] + \n".
I tried removing '\n' using gsub("[\n]", "", x) but this takes an awful lot of time. I was wondering whether I can simply avoid importing empty line breaks - then there will be no '\n' in my data. Using strip.white=TRUE does not work.
Any idea whether I can avoid importing empty line breaks?
The data saved in csv format, when opened with notepad, looks a bit like:
1;"text - text";1;Good
1;"text - text
";1;Good
2;"text - text";1;Good
2;"text - text";2;Good
3;"text - text";1;Good
My real dataset has many more columns, and many of them have the '\n' problem.
To add some more info, this is how I import my data (in the example above I have no headers, but in reality I have headers):
read.csv2("data.csv", header = TRUE, stringsAsFactors=FALSE, strip.white=TRUE,
          blank.lines.skip = TRUE)
Edit: as an easy/quick R solution might not be at hand, I tackled my problem with an Excel macro (I recorded a macro while applying the 1st procedure described in https://www.ablebits.com/office-addins-blog/2013/12/03/remove-carriage-returns-excel/).
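If a fix inside R is still of interest, one option along the lines of the gsub approach above is to strip the embedded line breaks from every character column right after the import. This is only a sketch (it assumes the imported data frame is called dat, which the question doesn't name) and uses fixed = TRUE so gsub does plain string matching, which is usually noticeably faster than a regex:
dat <- read.csv2("data.csv", header = TRUE, stringsAsFactors = FALSE)
chr_cols <- sapply(dat, is.character)              # only touch character columns
dat[chr_cols] <- lapply(dat[chr_cols], function(x) gsub("\n", "", x, fixed = TRUE))
dat[chr_cols] <- lapply(dat[chr_cols], trimws)     # drop any leftover trailing spaces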

Related

Loading CSV with fread stops because of too large string

This is the command I'm using:
dallData <- fread("data.csv", showProgress = TRUE, colClasses = c(rep("NULL", 2), "character", rep("NULL", 37)))
but I get this error when trying to load it: R character strings are limited to 2^31-1 bytes
Any way to skip those values?
Here's a strategy that may work, or at least narrow down the possible sources of error. It assumes you have enough working memory to hold the data and that your separators are really commas. If you actually have tabs as separators, then you will need to modify accordingly. The plan is to read the file with readLines, which basically ignores the quotes that are probably mismatched, and then figure out which line or lines are at fault using count.fields, table, and which.
input <- readLines("data.csv")   # ignores quotes
counts.def <- count.fields(textConnection(input),
                           sep = ",")   # default quotes are both ' and "
table(counts.def)   # might show a variety of field counts
# Second try with just double quotes
counts.dbl <- count.fields(textConnection(input),
                           sep = ",", quote = "\"")   # just double quotes
table(counts.dbl)   # if all counts are the same, then all you need to do is change the quote argument
Depending on the results, you may need to edit certain lines, which can be identified using which(counts.def < 40), assuming most of them are 40, which your import attempts suggest is the expected number of fields per line.
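As a sketch of that diagnostic step (still assuming 40 is the expected field count, as in the fread call above), the offending lines can be pulled out, inspected, and, once corrected or dropped, re-read:
bad <- which(counts.def != 40)   # lines whose field count is off
input[bad]                       # inspect the problem lines
# after fixing or dropping them, re-read from the cleaned text, e.g.
# dallData <- fread(text = paste(input[-bad], collapse = "\n"), sep = ",")  # needs a reasonably recent data.table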
(If the [ram] tag means you are memory-limited and getting warnings, or are using virtual memory, which slows things down horribly, then you should restart your OS and load only R before trying again. R needs a contiguous block of memory, and Windows isn't very good at memory management.)
Here's a small test case to work with:
input <- readLines(textConnection(
"v1,v2,v3,v4,v5,v6
text, text, text, text, text, text
text, text, O'Malley, text,text,text
junk,junk, more junk, \"text\", tex\"t, nothing
3,4,5,6,7,8"))
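Running the same checks on this test input illustrates the difference between the two quote settings (the exact counts depend on how the stray quotes pair up), and which() then points at the lines that do not have the expected 6 fields:
table(count.fields(textConnection(input), sep = ","))                 # default quote set: ' and "
table(count.fields(textConnection(input), sep = ",", quote = "\""))   # double quotes only
which(count.fields(textConnection(input), sep = ",", quote = "\"") != 6)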

fwrite(..., append = TRUE) appends the wrong way

I am facing a problem with the fwrite function from the data.table package in R. It appends the wrong way, and I end up with something like:
user ref version status Type DataExtraction
user1 2.02E+11 1 Pending 1 No
user2 2.02E+11 1 Saved 2 No"user3" 2.01806E+11 1 Saved NB No
I am using the function as follows:
library(data.table)
fwrite(Save, "~/Downloads/Register.csv", append = TRUE, sep = ",", quote = TRUE)
Reproducible example:
fwrite(data.table(user = "user3",
                  ref = "204094093",
                  version = "2",
                  status = "Pending",
                  Type = "1", DataExtraction = "No"),
       "~/Downloads/test.csv", sep = ",", append = FALSE)
fwrite(data.table(user = "user3",
                  ref = "204094093",
                  version = "2",
                  status = "Pending",
                  Type = "1", DataExtraction = "No"),
       "~/Downloads/test.csv", sep = ",", append = TRUE)
I'm not sure if it isolates the problem, but it seems that if I manually change something in the .csv file (for instance rename DataExtraction to Extraction), the problem of appending in the wrong way occurs.
Does someone know what is going wrong?
Thanks!
When I run your example code I have no problems with the behavior - the file comes out as desired. Based on your comments about manually changing what is in the file, and on what the undesired output looks like, here is what I believe is happening.
When fwrite() (and many other similar IO functions) writes to a file, each line ends with a line break (in R this is generally represented as \n). This is desired, so that subsequent lines of data indeed appear on subsequent lines of the file. It also generally means that when you open the file in a text editor there is a blank line at the very end, since this reflects the line break written after the last line of data (different editors handle this differently, though).
So I suspect that when you go in and manually edit the file in your editor, you somehow lose that last line break. When you then write again with append = TRUE, there is no line break at the end of the file, and therefore you get the undesired behavior of two records of data on a single line of the file.
The solution would be either to find out how to keep your manual editing from deleting that last line break, or, barring that, to write a single line break character to the file from R, e.g. with the cat() function.
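A sketch of that last suggestion, using the file path and the Save object from the question above: check the last byte of the file, write a line break with cat() only when it is missing, then append as usual:
f <- "~/Downloads/Register.csv"
con <- file(f, open = "rb")
seek(con, where = file.size(f) - 1)           # jump to the last byte
last_byte <- readBin(con, what = "raw", n = 1)
close(con)
if (last_byte != as.raw(10L)) {               # 0x0a is the \n line feed
  cat("\n", file = f, append = TRUE)          # restore the missing line break
}
fwrite(Save, f, append = TRUE, sep = ",", quote = TRUE)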

Can I extract portion of a binary file without file.read()-ing it?

I am extracting an ASCII header from binary files. It doesn't seem to have a fixed size, so I cannot rely on an exact byte offset to extract it. From what I could gather, the header starts at the first occurrence of \x0a\x23 and ends at the first \x00\x00 after that. So I am simply reading each file and applying a regex to extract what I want.
Is there a faster way to do this without actually reading the whole file? (Right now, the only improvement I see is to limit the read to a fixed number of bytes large enough to be sure it contains the header, rather than reading the whole file, but that doesn't seem like a huge improvement.)
Note: I will read tens (maybe even hundreds) of thousands of files, so any micro-optimization is welcome to make the code as fast as possible.
Here's my code:
import re

# Header starts at the first \x0a\x23 occurrence and ends at the first following \x00\x00
HEADER_RE = re.compile(b'(\x0a\x23.+?)\x00\x00', re.DOTALL)

def get_header(file_path):
    with open(file_path, 'rb') as file:
        data = file.read()  # TODO: read only the first x bytes: file.read(x)
    try:
        header = HEADER_RE.search(data).group(0)
    except AttributeError:
        header = b''
    return header

CSV file is not recognized

I'm exporting an Excel file to a .csv file (because I want to import it into R), but R doesn't recognize it.
I think this is because when I open it in notepad I get:
Item;Description
1;ja
2;ne
While a file which does not have any import issues is structured like this in notepad:
"Item","Description"
"1","ja"
"2","ne"
Does anybody know how I can either export it from Excel in the right format or import a csv file with ";" separator into R?
It's easy to deal with semicolon-delimited files; you can use read.csv2() instead of read.csv() (although be aware this will also use comma as the decimal separator character!), or specify sep=";".
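For the sample above, either of these does the job (a sketch; the actual file name isn't given in the question, so "items.csv" is just a placeholder):
dat <- read.csv2("items.csv", stringsAsFactors = FALSE)             # sep = ";" and dec = "," by default
dat <- read.csv("items.csv", sep = ";", stringsAsFactors = FALSE)   # or override the separator explicitly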
Sorry to ask, but did you try reading ?read.csv ? The relevant information is in there, although it might admittedly be a little overwhelming/hard to sort out if you're new to R:
sep: the field separator character. Values on each line of the
file are separated by this character. If ‘sep = ""’ (the
default for ‘read.table’) the separator is ‘white space’,
that is one or more spaces, tabs, newlines or carriage
returns.

Copy to without quotes

I have a large dataset in a dbf file and would like to export it to a csv type file.
Thanks to SO I already managed to do it smoothly.
However, when I try to import it into R (the environment I work in), it combines some characters together, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported csv file I get only half of the db.
I think the main problem is with quotes inside the character fields, but specifying quote="" in R didn't help (and it usually does).
I've searched for questions on how to deal with quotes when exporting from Visual FoxPro, but couldn't find an answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck with this problem of exporting from the dbf into R for long enough, have searched everything I could, and am desperately looking for a simple way to import a large dbf into my R environment without any bugs.
(In R: I checked whether the imported file has problems, and indeed most of the columns have far longer nchar values than they should, while the number of rows is halved. Reading the db with read.csv("file.csv", quote="") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
even though, according to verbose=T, this function detects the right number of rows (read.csv imports only about 1.5 million rows):
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows)
When exporting with TYPE DELIMITED you have some control on the VFP side as to how the export formats the output file.
To change the field delimiter (the character that surrounds each field) from quotes to, say, a pipe character you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
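On the R side, a file exported with those two options could then be read with something like this (a sketch, assuming the export above was written to myfile.csv):
dat <- read.table("myfile.csv", sep = "#", quote = "|",
                  stringsAsFactors = FALSE)   # "#" as field separator, "|" as the quoting character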
There are three ways to delimit a string in VFP: the normal single and double quote characters, and square brackets. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
    FOR aa = 1 TO mfld_cnt
        mcurfld = 'thedbf.' + mflds[aa, 1]
        mvalue = &mcurfld
        ** Or you can use:
        mvalue = EVAL(mcurfld)
        ** manipulate the contents of mvalue, possibly based on the field type
        DO CASE
            CASE mflds[aa, 2] = 'D'
                mvalue = DTOC(mvalue)
            CASE mflds[aa, 2] $ 'CM'
                ** Replace characters that are giving you problems in R
                mvalue = STRTRAN(mvalue, ["], '')
            OTHERWISE
                ** Etc.
        ENDCASE
        = FWRITE(fh, mvalue)
        IF aa # mfld_cnt
            = FWRITE(fh, [,])
        ENDIF
    ENDFOR
    = FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* create a comma-delimited file with no quotes around the character fields
copy to myfile.csv type delimited with ""   && the "" is two double quotes, i.e. an empty delimiter
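With the quotes stripped on the VFP side, the import back in R becomes straightforward; a sketch, again assuming the export was written to myfile.csv (COPY TO ... DELIMITED writes no header row):
dat <- read.csv("myfile.csv", header = FALSE, quote = "", stringsAsFactors = FALSE)
# or, for a large file:
# dat <- data.table::fread("myfile.csv", quote = "")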
