Problems reading in a pipe-delimited CSV with special characters into R

I've been trying to read in a pipe-delimited CSV file containing 96 variables about some volunteer water quality data. Randomly within the file there are single and double quotation marks, as well as semicolons, dashes, slashes, and likely other special characters, for example:
Name: Jonathan "Joe" Smith; Jerry; Emily; etc.
From the output of several variables (such as IsNewVolunteer), it seems that R is having issues reading in the data. IsNewVolunteer should always be Y or N, but numbers are appearing, and when I queried those lines it appears that the data is getting shifted: values that are clearly not names end up in the Firstname and Lastname columns.
The original data format makes this a little difficult to see and troubleshoot, especially because of the extra variables. I could find a way to remove them, but the goal of the work in R is to provide code that will run on a dataset that is frequently updated.
I've tried
read.table("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")
But that produces the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 132 did not have 94 elements
However, there's nothing out of the ordinary that I've noticed about line 132. I've had more success with
read.csv("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")
but that still produces offsets and errors as discussed above. Is there something I'm doing incorrectly? Any information would be helpful.
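One quick diagnostic before changing the read call is to count the fields on each line, so rows shifted by stray delimiters or quotes stand out. A minimal sketch, using the file name from the question (the expected field count is taken as the most common one):
n_fields <- count.fields("dnrvisualstream.csv", sep = "|", quote = "")
table(n_fields)                                            # most lines should share one count
expected <- as.integer(names(which.max(table(n_fields))))  # the modal field count
which(n_fields != expected)                                # line numbers of the shifted rows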

I think it's one of two issues:
Encoding is either UTF-8 or UTF-16:
Try this...
read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-8)
or this...
read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-16)
Too many separators:
If this doesn't work, right-click on your .csv file and open it in a text editor. You have multiple |||| separators in rows 2,3,4, and 21,22 that are visible in your screenshot. Press CTRL+H to find and replace:
Find: ||||
Replace: |
Save the new file and try to open in R again.
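Since the goal is code that can run on a frequently updated file, the same find-and-replace can also be scripted in R instead of a text editor. A minimal sketch mirroring the replacement above; the cleaned file name is an assumption:
raw_lines   <- readLines("dnrvisualstream.csv", warn = FALSE)
clean_lines <- gsub("||||", "|", raw_lines, fixed = TRUE)   # same replacement as the CTRL+H step
writeLines(clean_lines, "dnrvisualstream_clean.csv")
dat <- read.csv("dnrvisualstream_clean.csv", sep = "|", stringsAsFactors = FALSE, quote = "")
Note that if the repeated pipes actually encode legitimately empty fields, this replacement changes the data, so inspect the affected lines before committing to it.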

Related

How to handle embedded separator in one or more fields using fread in R?

I am using the fread function to import a ".dat" file [file size 3.5 GB]. The issue is that some fields have an embedded separator in them [I learned this because the same file is being loaded via the SSIS ETL tool].
data <- fread("xyz.dat", sep = '|', encoding = "UTF-8",showProgress = T, select = ord_shp_col, fill = TRUE, sep2 = "|")
I tried the sep2 argument to handle this, as per the R documentation, and even tried selecting only a limited set of columns so that the problem columns could be skipped.
However, I end up with the same error again and again.
Read 0.0% of 1712440 rowsError in fread("xyz.dat", sep = "|", encoding
= "UTF-8", : Expecting 118 cols, but line 2143 contains text after processing all cols. Try again with fill=TRUE. Another reason could be
that fread's logic in distinguishing one or more fields having
embedded sep='|' and/or (unescaped) '\n' characters within unbalanced
unescaped quotes has failed. If quote='' doesn't help, please file an
issue to figure out if the logic could be improved.
Any help is highly appreciated.
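One way to make the fill/skip decisions less blind is to find the rows that carry the embedded separator before calling fread. A rough diagnostic sketch, assuming the file name from the question and the column count from the error message (118 columns, so 117 separators per clean line); the chunk size is arbitrary:
con <- file("xyz.dat", open = "r", encoding = "UTF-8")
expected_seps <- 118 - 1
bad_lines <- integer(0)
offset <- 0L
repeat {
  chunk <- readLines(con, n = 100000L, warn = FALSE)
  if (length(chunk) == 0L) break
  seps <- nchar(chunk) - nchar(gsub("|", "", chunk, fixed = TRUE))   # count of '|' per line
  bad_lines <- c(bad_lines, offset + which(seps != expected_seps))
  offset <- offset + length(chunk)
}
close(con)
head(bad_lines)   # line 2143 from the error message should appear here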

Special Characters in read.table() with unz()

I am importing a series of text files from many zipped folders.
Previously, I was unzipping the files with WinRAR and then applying read.table as:
main <- read.table(file.path(dir, paste0("Current/17/Folder/Main.txt")),
sep = '|',
header = FALSE,
stringsAsFactors = FALSE,
skipNul = TRUE,
comment.char="",
quote = ""
)
This worked, but required unzipping the archives myself (which takes up a prohibitive amount of space). I stumbled on unz(), which lets me avoid that step as follows:
main <- read.table(unz(file.path(dir, paste0("Current/17.zip")), "Folder/Main.txt"),
sep = '|',
header = FALSE,
stringsAsFactors = FALSE,
skipNul = TRUE,
comment.char="",
quote = ""
)
The latter works in most cases, but for some files, the read.table throws an error
number of items read is not a multiple of the number of columns
I discerned that this comes from some odd characters in the file. Comparing the two approaches in one case, I find that the manual unzip-and-read approach converts these special characters to "??" (with a white background, suggesting a foreign character), which is acceptable if they cannot be read, whereas the second approach simply throws an error when it hits them (the "??" are not reported in the output) and does not read the rest of the current line or file. I looked at the line in question in Bash with:
sed -n '10905951 p' Main.txt
And it also reports the same "??" that show up in the read.table in R.
I have tried using fileEncoding with a few different encodings, to no avail. Many other approaches I have seen for dealing with odd characters like this require manually replacing the characters, but I am struggling to even find out what the characters are, and why WinRAR's unzip converts them to something readable while unz does not. I have also considered other read options, but read.table is preferable to me because it allows for the skipNul option, which is relevant for my data.
Any help or ideas would be appreciated.
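One avenue worth trying (a hedged sketch, not a confirmed fix): unz() connections take their own encoding argument, so the zipped member can be opened with an explicit guess such as latin1 before read.table parses it. The encoding choice here is an assumption; on a problem line, iconv(x, from = "", to = "UTF-8", sub = "byte") will print unknown bytes as <xx> escapes and help identify what they actually are.
con <- unz(file.path(dir, "Current/17.zip"), "Folder/Main.txt", encoding = "latin1")
main <- read.table(con,
                   sep = '|',
                   header = FALSE,
                   stringsAsFactors = FALSE,
                   skipNul = TRUE,
                   comment.char = "",
                   quote = "")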

Removing "NUL" characters (within R)

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.
With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> " ") to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).
However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).
Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?
You want to read the file as binary; then you can substitute the NULs, e.g. to replace them with spaces:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place, e.g. if you want to replace two NULs ("\00\00") with one space:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__
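Since the question mentions roughly ten such files, the second variant can be wrapped in a small helper and looped over a directory. A usage sketch; the directory path and file pattern are assumptions:
fix_nuls <- function(infile, outfile) {
  r <- readBin(infile, raw(), file.info(infile)$size)
  r[r == as.raw(0)] <- as.raw(1)                           # mark each NUL as \01
  txt <- gsub("\01\01", " ", rawToChar(r), fixed = TRUE)   # NUL pair -> one space
  cat(txt, file = outfile)                                 # write back, preserving line breaks
  invisible(outfile)
}
dat_files <- list.files("data", pattern = "\\.dat$", full.names = TRUE)
for (f in dat_files) fix_nuls(f, sub("\\.dat$", ".txt", f))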

Why countLines and/or read.delim works in some files but not in others that were generated in the same way?

I have several files that are generated by Perl and C scripts. They are basically large TSV files with a header, with the particular characteristic of having a list of sample names at the end of the file.
If I attempt to read such a file with read.delim() I get the error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed"
This is caused by the list of samples at the end of the file. Originally I was making a copy of the file and deleting the list from the shell, but then I decided to just tell R how to read the file, so I wrote a small function that uses countLines {R.utils} and the number of samples (which I know for each file) to tell read.delim() to ignore the nasty list. It looks something like this:
readmyfile <- function(tsv, nsamps) {
  # tsv:    the path to the tsv file
  # nsamps: the number of samples listed at the end of the file
  # Count lines and subtract the number of samples (countLines is from R.utils)
  maxl <- countLines(tsv) - nsamps
  # Read the file, ignoring the lines of the sample list at the end
  data.cov <- read.delim(tsv, header = TRUE, row.names = 1, nrows = maxl)
}
It worked very well and I used it with several files. Then, some weeks later, I generated other files and tried the function again. But this time I'm getting the error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed"
Here you can download an example of a "good" and "bad" file. If you download them and try:
readmyfile("goodfile.COV", 92)
You will see that it is working, but that:
> readmyfile("badfile.COV", 38)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
I have no idea what is going on. As far as I can tell the files are generated in the same way, and I work on the same computer (MacBook Pro). I can't remember doing anything in particular to the files that work properly; the only thing I could think of was to make sure the line endings were LF, but that did not solve the issue.
Any clues? Oh please.
Removing row.names = 1 from your read.delim call allowed the function to work as expected:
readmyfile <- function(tsv, nsamps) {
  # tsv:    the path to the tsv file
  # nsamps: the number of samples listed at the end of the file
  # Count lines and subtract the number of samples (countLines is from R.utils)
  maxl <- countLines(tsv) - nsamps
  # Read the file, ignoring the lines of the sample list at the end
  data.cov <- read.delim(tsv, header = TRUE, nrows = maxl)
}
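A short usage note: if the first column is still wanted as row names, the duplicates that triggered the error can be checked for explicitly and made unique afterwards. The file name and sample count are the ones from the question:
library(R.utils)                               # for countLines()
cov <- readmyfile("badfile.COV", 38)
sum(duplicated(cov[[1]]))                      # how many duplicated IDs caused the error
rownames(cov) <- make.unique(as.character(cov[[1]]))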

Copy to without quotes

I have a large dataset in a DBF file and would like to export it to a CSV file.
Thanks to SO, I already managed to do that smoothly.
However, when I try to import it into R (the environment I work in), it combines some characters together, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported CSV file I get only half of the database.
I think the main problem is with quotes inside character fields, but specifying quote = "" in R didn't help (and it usually does).
I've searched for any question on how to deal with quotes when exporting from Visual FoxPro, but couldn't find an answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large database).
Any help will be highly appreciated. I've been stuck on this problem of exporting from the DBF into R for long enough; I've searched everything I could and am desperately looking for a simple way to import a large DBF into my R environment without any bugs.
(In R: I checked whether the imported file has problems, and indeed most columns have much longer nchar values than they should, while the number of rows is halved. Reading the file with read.csv("file.csv", quote = "") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose = TRUE this function sees the right number of rows (read.csv imports only about 1.5 million rows):
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting with TYPE DELIMITED, you have some control on the VFP side over how the export formats the output file.
To change the character that wraps the fields from double quotes to, say, a pipe character, you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
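For example, a file exported as above (fields wrapped in | and separated by #) can then be read in R by telling read.table to treat | as the quote character and # as the separator; comment.char must be cleared because # is its default. A minimal sketch; the file name comes from the COPY TO example:
dat <- read.table("myfile.csv", sep = "#", quote = "|", header = FALSE,
                  stringsAsFactors = FALSE, comment.char = "")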
There are three ways to delimit a string in VFP: the normal single and double quote characters, plus square brackets. So to strip quotes out of the character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
    FOR aa = 1 TO mfld_cnt
        mcurfld = 'thedbf.' + mflds[aa, 1]
        mvalue = &mcurfld
        ** Or you can use:
        mvalue = EVAL(mcurfld)
        ** Manipulate the contents of mvalue, possibly based on the field type
        DO CASE
            CASE mflds[aa, 2] = 'D'
                mvalue = DTOC(mvalue)
            CASE mflds[aa, 2] $ 'CM'
                ** Replace characters that are giving you problems in R
                mvalue = STRTRAN(mvalue, ["], '')
            OTHERWISE
                ** Etc.
        ENDCASE
        = FWRITE(fh, mvalue)
        IF aa # mfld_cnt
            = FWRITE(fh, [,])
        ENDIF
    ENDFOR
    = FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* Create a comma-delimited file with no quotes around the character fields
copy to TYPE DELIMITED WITH ""   && WITH "" here is two double quotes
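And correspondingly on the R side for the quote-free export, assuming the output went to myfile.csv: with no quoting in the file, quote = "" keeps stray quote characters inside fields from shifting columns (this only stays safe as long as no character field itself contains a comma).
dat <- read.csv("myfile.csv", header = FALSE, quote = "", stringsAsFactors = FALSE)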
