Find out if text in CSV is quoted - r

I have two large CSV files which contain the same data. However, their file sizes differ slightly. I'm guessing this is due to a different quote argument being used when generating those files with data.table's fwrite().
How do I determine in R whether text entries in the CSV files are surrounded by quotes? I cannot open them in Notepad++ because of the file size.

You don't have to parse the entire file! Read in the first couple of lines to learn about the structure:
library(data.table)
fread("pathtofile.csv",
      nrows = 10,     ## read the first 10 lines
      header = TRUE,  ## if the csv contains a header
      sep = ",")      ## specify the separator; "," for comma-separated

readLines('file.csv', n = 2) would read the first two lines of a file.
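Either way, a handful of lines is enough to tell. A minimal sketch (the path is a placeholder, and the comma split is deliberately naive, just for a quick look):
first_lines <- readLines("file.csv", n = 5)
any(grepl('"', first_lines, fixed = TRUE))          # TRUE if double quotes appear anywhere
grepl('^"', strsplit(first_lines[2], ",")[[1]])     # which fields of the second line start with a quote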

Related

Convert txt file to csv [only specific contents that match a string pattern]

I have a *.DAT file which can be opened in a text editor. I want to extract some of its contents and convert it to *.csv. The converted CSV file must have a header (column names), a specification portion (lower and upper limits) and a data portion. I need to convert hundreds of these files to *.csv (as separate CSV files or all combined into one big CSV file).
A sample snippet of my *.DAT file looks like this:
[FILEINFO]
VERSION=V4.0
FILENAME=TEST.DAT
CREATIONTIME=2015-07-09 22:05:26
[LOTINFO]
LotNo=A6022142
DUT=BCEK450049
PRODUCTNAME=EX061
Order=
ChipCode=
SACH_NO=B39000-
MAT_NO=B39000-P810
[SPEC1]
TXT=SEN1
Unit=
LSL=-411.400000
USL=-318.700000
[SPEC2]
TXT=SEN2
Unit=
LSL=-11.000000
USL=11.000000
[SPEC3]
TXT=SEN3
Unit=
LSL=-45.000000
USL=10.000000
[DATA]
2,29,-411.232,10.193,-11.530,
3,29,-411.257,10.205,-11.328,
I can extract the contents below [DATA] and save them in a CSV file. I am not sure how to extract the contents above it to create the header, etc. I used the code below to extract the contents below [DATA]:
library(stringr)
library(readr)
my_txt <- read_file("EXAMPLE.DAT")
ExtData <- read.csv(text = sub(".*\\[DATA\\]\\s+", "", my_txt), header = FALSE)
write.csv(ExtData, "dat_2_csv.csv", row.names = FALSE)
To extract the contents above [DATA] I tried the code below, with no success:
con <- file("EXAMPLE.DAT", "r")
lsl_values <- c()
while (TRUE) {
  line <- readLines(con, 1)
  if (length(line) == 0) break          # stop at end of file
  if (grepl("^LSL=", line)) {           # only process the LSL= lines
    RES <- str_split(line, "=", simplify = TRUE)
    lsl_values <- c(lsl_values, RES[1, 2])
  }
}
close(con)
Expected output csv file as below
According to this link, .DAT files are very generic files with very specific information. Therefore, and especially after looking at your sample snippet, I doubt there is a straightforward way to do the conversion (unless there's a package designed specifically for processing similar data).
I can only offer my general strategy for tackling this:
For starters, instead of focusing on the .csv format, you should first focus on turning this text file into a table format.
To do so, save the parameters in separate vectors/columns (each column could be TXT, Unit, LSL, etc.).
That way, each row (SPEC1, SPEC2, SPEC3) represents one data point with all its characteristics.
Even so, the file also contains metadata, so you might want to save the different chunks into separate variables (e.g. file_info <- readr::read_lines(x, n_max = 4)).
Hope it helps a bit.
Edit: As said by @qwe, the format resembles an .ini file. So a good way to start would be to read the file with '=' as the separator:
data = read.table('example.dat', sep = '=', fill = TRUE)
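To make that concrete, here is a minimal sketch. The helper name dat_to_csv and the ID1/ID2 column names are made up, and it assumes every file follows the layout of the sample above: [SPECn] blocks with TXT=/LSL=/USL= lines followed by a [DATA] block of comma-separated rows.
library(stringr)
dat_to_csv <- function(path, out_csv) {
  txt <- readLines(path)
  data_start <- which(txt == "[DATA]")
  spec_lines <- txt[seq_len(data_start - 1)]
  # spec portion: TXT= entries become column names, LSL=/USL= become limit rows
  col_names <- str_remove(spec_lines[str_detect(spec_lines, "^TXT=")], "^TXT=")
  lsl <- str_remove(spec_lines[str_detect(spec_lines, "^LSL=")], "^LSL=")
  usl <- str_remove(spec_lines[str_detect(spec_lines, "^USL=")], "^USL=")
  # data portion: comma-separated rows after [DATA] (the trailing comma gives an empty column)
  dat <- read.csv(text = txt[(data_start + 1):length(txt)], header = FALSE)
  dat <- dat[, colSums(is.na(dat)) < nrow(dat)]   # drop the all-NA trailing column
  names(dat) <- c("ID1", "ID2", col_names)        # ID1/ID2 are assumed names for the first two columns
  # stack the limit rows on top of the data and write one CSV per input file
  out <- rbind(c("LSL", "", lsl), c("USL", "", usl), as.matrix(dat))
  write.csv(out, out_csv, row.names = FALSE)
}
# dat_to_csv("EXAMPLE.DAT", "EXAMPLE.csv")
Looping this over list.files(pattern = "\\.DAT$") would then cover the hundreds of files mentioned in the question.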

Avoid importing empty line breaks as \n

Some of the fields of a CSV file I'd like to import contain text followed by one or two empty line breaks. As a result, when using read.csv2 to import the CSV file I obtain fields containing "[text] + \n".
I tried removing '\n' using gsub("[\n]", "", x) but this takes an awful lot of time. I was wondering whether I can simply avoid importing empty line breaks - then there will be no '\n' in my data. Using strip.white=TRUE does not work.
Any idea whether I can avoid importing empty line breaks?
The data saved in csv format, when opened with notepad, looks a bit like:
1;"text - text";1;Good
1;"text - text
";1;Good
2;"text - text";1;Good
2;"text - text";2;Good
3;"text - text";1;Good
My real dataset has many more columns, and in many of them I have the '\n' problem.
To add some more info, this is how I import my data (the example above has no headers, but my real data does):
read.csv2("data.csv", header = TRUE, stringsAsFactors=FALSE, strip.white=TRUE,
blank.lines.skip = TRUE)
Edit: as an easy/quick R solution might not be at hand, I tackled my problem with an Excel macro (I recorded a macro while applying the first procedure described in https://www.ablebits.com/office-addins-blog/2013/12/03/remove-carriage-returns-excel/).
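For what it's worth, a pure-R route that is usually much faster than a regex gsub() over the whole data is to strip the newlines with fixed (non-regex) matching over just the character columns after import; a minimal sketch using the same read.csv2 call as above:
dat <- read.csv2("data.csv", header = TRUE, stringsAsFactors = FALSE)
# fixed = TRUE skips the regex engine, which makes gsub() considerably quicker
char_cols <- vapply(dat, is.character, logical(1))
dat[char_cols] <- lapply(dat[char_cols], function(x) gsub("\n", "", x, fixed = TRUE))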

Regarding reading files which contain UTF-8 character

I have a CSV file containing Chinese characters, saved as UTF-8.
项目 价格
电视 5000
The first row is the header, the second row is the data. In other words, it is a one-by-two vector.
I read the file as follows:
amatrix<-read.table("test.csv",encoding="UTF-8",sep=",",header=T,row.names=NULL,stringsAsFactors=FALSE)
However, the output includes unknown marks in the header, i.e., X.U.FEFF.
That is the byte order mark sometimes found in Unicode text files. I'm guessing you're on Windows, since that's the only popular OS where files can end up with them.
What you can do is read the file using readLines and remove the first two characters of the first line.
txt <- readLines("test.csv", encoding="UTF-8")
txt[1] <- substr(txt[1], 3, nchar(txt[1]))
amatrix <- read.csv(text=txt, ...)
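Alternatively, R can discard the byte order mark itself if you declare the encoding as UTF-8-BOM when reading the file:
# fileEncoding = "UTF-8-BOM" makes R strip the BOM before parsing
amatrix <- read.csv("test.csv", fileEncoding = "UTF-8-BOM",
                    header = TRUE, stringsAsFactors = FALSE)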

Copy to without quotes

I have a large dataset in a dbf file and would like to export it to a csv-type file.
Thanks to SO I already managed to do that smoothly.
However, when I try to import it into R (the environment I work in) it combines some characters together, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported csv file I get only about half of the db.
I think the main problem is with quotes inside character strings, but specifying quote="" in R didn't help (and it usually does).
I've searched for any question on how to deal with quotes when exporting from Visual FoxPro, but couldn't find the answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck with this problem of exporting from the dbf into R for long enough, have searched everything I could, and am desperately looking for a simple solution on how to import a large dbf into my R environment without any bugs.
(In R: I checked whether the imported file has problems, and indeed most columns have much longer nchar values than they should, while the number of rows is halved. Reading the db with read.csv("file.csv", quote="") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose=TRUE this function detects the right number of rows (read.csv imports only about 1.5 million rows):
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting with TYPE DELIMITED you have some control on the VFP side as to how the export formats the output file.
To change the field delimiter from quotes to, say, a pipe character you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
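On the R side, an export of that second form could then be read by declaring | as the quoting character and # as the separator; a minimal sketch (file name as in the example above):
# fields wrapped in | and separated by #, as produced by the second COPY TO above
dat <- read.table("myfile.csv", sep = "#", quote = "|", stringsAsFactors = FALSE)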
There are three ways to delimit a string in VFP: the normal single and double quote characters, and square brackets. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
    FOR aa = 1 TO mfld_cnt
        mcurfld = 'thedbf.' + mflds[aa, 1]
        mvalue = &mcurfld
        ** Or you can use:
        mvalue = EVAL(mcurfld)
        ** manipulate the contents of mvalue, possibly based on the field type
        DO CASE
            CASE mflds[aa, 2] = 'D'
                mvalue = DTOC(mvalue)
            CASE mflds[aa, 2] $ 'CM'
                ** Replace characters that are giving you problems in R
                mvalue = STRTRAN(mvalue, ["], '')
            OTHERWISE
                ** Etc.
        ENDCASE
        = FWRITE(fh, mvalue)
        IF aa # mfld_cnt
            = FWRITE(fh, [,])
        ENDIF
    ENDFOR
    = FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* create a comma-delimited file with no quotes around the character fields
copy to myfile.csv TYPE DELIMITED WITH ""
(the WITH "" is 2 consecutive double quotes)
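Once the export contains no quote characters at all, the R-side import should stop choking on embedded quotes; a minimal sketch of the read (the file name is the placeholder used above):
library(data.table)
# with no quoting in the exported file, tell fread not to treat any character as a quote
dat <- fread("myfile.csv", sep = ",", quote = "")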

How to make R stop reading rows in a text file at a line containing a specific character?

For example, I want to read lines from the beginning of a text file up to a line containing the ";" symbol, excluding that line.
Thanks a lot.
A very simple approach might be to read the contents of the file using readLines:
content = readLines("data.txt")
And then split the character data on the ;:
split_content = strsplit(content, split = ";")
And then extract the first element, i.e. the text up to the semicolon:
first_element = lapply(split_content, "[[", 1)
The result is a list of all the text in the rows of the data file up to the semicolon.
P.S. I'm not entirely sure about the last line... I can't check it, as I have no access to R right now.
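If the goal is instead to keep only the lines that come before the first line containing a ";" (as the question asks), a minimal sketch along the same lines (assuming the whole file fits in memory and is called data.txt as above) could be:
content <- readLines("data.txt")
cut_at <- which(grepl(";", content, fixed = TRUE))[1]   # first line containing ";"
kept <- if (is.na(cut_at)) content else content[seq_len(cut_at - 1)]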
