Special Characters in read.table() with unz()

I am importing a series of text files from many zipped folders. Previously, I unzipped the files with WinRAR and then applied read.table as:
main <- read.table(file.path(dir, paste0("Current/17/Folder/Main.txt")),
                   sep = "|",
                   header = FALSE,
                   stringsAsFactors = FALSE,
                   skipNul = TRUE,
                   comment.char = "",
                   quote = "")
This worked, but required unzipping everything myself, which takes a prohibitive amount of space. I then stumbled on unz(), which lets me avoid that step as follows:
main <- read.table(unz(file.path(dir, paste0("Current/17.zip")), "Folder/Main.txt"),
                   sep = "|",
                   header = FALSE,
                   stringsAsFactors = FALSE,
                   skipNul = TRUE,
                   comment.char = "",
                   quote = "")
The latter works in most cases, but for some files read.table throws an error:
number of items read is not a multiple of the number of columns
I determined that this comes from some odd characters in the file. Comparing the two approaches in one case, I find that the manual unzip-and-read approach converts these special characters to "??" (with a white background, suggesting a foreign character), which is acceptable if they cannot be read. The unz() approach, by contrast, simply throws an error when it hits them (the "??" never appear in the output) and does not read the rest of the current line or file. I looked at the line in question in Bash with:
sed -n '10905951 p' Main.txt
And it also reports the same "??" that show up in the read.table output in R.
I have tried using fileEncoding with a few different encodings, to no avail. Many other approaches I have seen for dealing with odd characters like this require manually replacing the characters, but I am struggling even to find out what the characters are, and why WinRAR's unzip converts them to something readable while unz() does not. I have also considered other read functions, but read.table is preferable to me because it supports the skipNul option, which is relevant for my data.
Any help or ideas would be appreciated.
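
One way to pin down what the offending characters actually are (a sketch, not from the original thread; it reuses the zip path and entry name from the question) is to open the zip entry in binary mode, so no encoding conversion happens, and look at the raw bytes:

con <- unz(file.path(dir, "Current/17.zip"), "Folder/Main.txt", open = "rb")
bytes <- readBin(con, what = raw(), n = 5e6)   # first ~5 MB; adjust as needed
close(con)
# Any byte >= 0x80 is non-ASCII and a candidate for the "??" characters.
which(bytes >= as.raw(0x80))[1:10]

Knowing the actual byte values (for example, 0xEF 0xBF 0xBD is the UTF-8 replacement character) should indicate which encoding to declare, e.g. via unz()'s encoding argument.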

Related

Loading CSV with fread stops because of a too-large string

This is the command I'm using:
dallData <- fread("data.csv", showProgress = TRUE, colClasses = c(rep("NULL", 2), "character", rep("NULL", 37)))
but I get this error when trying to load it:
R character strings are limited to 2^31-1 bytes
Any way to skip those values?
Here's a strategy that may work, or at least narrow down the possible sources of error. It assumes you have enough working memory to hold the data and that your separators really are commas; if you actually have tabs as separators, you will need to modify accordingly. The plan is to read the file with readLines, which ignores the quotes that are probably mismatched, and then figure out which line or lines are at fault using count.fields, table, and which.
input <- readLines("data.csv") # ignores quotes
counts.def <- count.fields(textConnection(input),
sep=",") # defaults quotes are both ' and "
table(counts.def) # might show a variety of line counts.
# Second try with just double-quotes
counts.dbl <- count.fields(textConnection(input),
sep=",", quote="\"") # just dbl-quotes
table(counts.dbl) # if all the same, then all you do is change the quotes argument
Depending on the results, you may need to edit certain lines, which can be identified using which(counts.def < 40), assuming most of them are 40, which your input efforts suggest is the expected number of fields per line.
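For example, assuming 40 fields per line is the expected count:

bad <- which(counts.dbl != 40 | is.na(counts.dbl))  # NA means a quote spanned lines
input[bad]   # inspect the offending lines directly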
(If the [ram] tag means you are memory-limited and getting warnings, or are using virtual memory, which slows things down horribly, then you should restart your OS and load only R before trying again. R needs a contiguous block of memory, and Windoze isn't very good at memory management.)
Here's a small test case to work with:
input <- readLines(textConnection(
"v1,v2,v3,v4,v5,v6
text, text, text, text, text, text
text, text, O'Malley, text,text,text
junk,junk, more junk, \"text\", tex\"t, nothing
3,4,5,6,7,8"))

Problems reading a pipe-delimited CSV with special characters into R

I've been trying to read in a pipe-delimited CSV file containing 96 variables about some volunteer water-quality data. Randomly within the file there are single and double quotation marks, as well as semicolons, dashes, slashes, and likely other special characters, for example:
Name: Jonathan "Joe" Smith; Jerry; Emily; etc.
From the output of several variables (such as IsNewVolunteer), it seems that R is having issues reading in the data. IsNewVolunteer should always be Y or N, but numbers are appearing, and when I queried those lines it appeared that the data had been shifted: values that are clearly not names show up in the Firstname and Lastname columns.
The original data format makes it a little difficult to see and troubleshoot, especially due to the extra variables. I would find a way to remove them, but the goal of this work is to provide R code that can run on a dataset that is frequently updated.
I've tried
read.table("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")
But that produces the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 132 did not have 94 elements
However, there's nothing out of the ordinary that I've noticed about line 132. I've had more success with
read.csv("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")
but that still produces offsets and errors as discussed above. Is there something I'm doing incorrectly? Any information would be helpful.
I think it's one of two issues:
Encoding is either UTF-8 or UTF-16:
Try this...
read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-8)
or this...
read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-16)
Too many separators:
If this doesn't work, right-click on your .csv file and open it in a text editor. You have multiple |||| separators in rows 2,3,4, and 21,22 that are visible in your screenshot. Press CTRL+H to find and replace:
Find: ||||
Replace: |
Save the new file and try to open in R again.
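If you would rather not repeat this by hand every time the data is refreshed, the same replacement can be scripted in R (a sketch; it assumes the file fits in memory and that the pipe runs never represent legitimate empty fields):

txt <- readLines("dnrvisualstream.csv")
txt <- gsub("||||", "|", txt, fixed = TRUE)   # fixed = TRUE matches the pipes literally
writeLines(txt, "dnrvisualstream_clean.csv")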

Is RStudio editing scripts upon saving?

I am reading pdf files with R. Before I analyze them, I perform a cleaning process.
More specifically, I am using this custom function:
library(stringr)

cleanFile = function(x) {
  x = str_replace_all(string = x, pattern = "\\s+", replacement = " ") # not interesting for this post
  x = str_replace_all(string = x, pattern = "–", replacement = "-")
  x
}
The second line replaces a special character that my PDF files contain, which looks just like a hyphen, with the "normal" hyphen character, "-".
This is some code afterwards:
pdfClean = lapply(pdfList, cleanFile) # files cleansed
str_detect(pdfClean[[1]], "–") # checking if the special character appears - this last bit is not part of the script, I type it in the console so it's not affected by save.
When I save and run my script, the last line surprisingly returns TRUE.
I understand that the "–" character in my script has been replaced with something different, because if I put it back in my code and run the same script without saving, the last line returns FALSE, as expected.
Furthermore, this issue has only appeared now that I have changed my computer. In the past, I was using a different one and my code was running without any issue whatsoever.
I tried to experiment a bit with Tools --> Global Options --> Code --> Saving --> Default text encoding, but I didn't manage to fix anything.
Any help would be greatly appreciated.
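
One common way to sidestep this class of problem entirely (a sketch, assuming the character in question is U+2013, the en dash) is to write the pattern as a Unicode escape, so the script stays pure ASCII and no save-time re-encoding can corrupt it:

library(stringr)

cleanFile = function(x) {
  x = str_replace_all(x, pattern = "\\s+", replacement = " ")
  # "\u2013" is the en dash; the escape keeps this source file ASCII-only,
  # so changing the editor's text encoding cannot mangle the pattern.
  x = str_replace_all(x, pattern = "\u2013", replacement = "-")
  x
}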

How to handle embedded separator in one or more fields using fread in R?

I am using the fread function to import a ".dat" file [file size 3.5 GB]. The issue with the file is that some fields have an embedded separator in them [I found this out because the same file is being loaded via the SSIS ETL tool].
data <- fread("xyz.dat", sep = '|', encoding = "UTF-8",showProgress = T, select = ord_shp_col, fill = TRUE, sep2 = "|")
I tried the sep2 argument to handle this, as described in the R documentation, and even tried selecting only a limited set of columns so that the problem columns would be skipped. However, I end up with the same error again and again:
Read 0.0% of 1712440 rows
Error in fread("xyz.dat", sep = "|", encoding = "UTF-8", :
  Expecting 118 cols, but line 2143 contains text after processing all
  cols. Try again with fill=TRUE. Another reason could be that fread's
  logic in distinguishing one or more fields having embedded sep='|'
  and/or (unescaped) '\n' characters within unbalanced unescaped quotes
  has failed. If quote='' doesn't help, please file an issue to figure
  out if the logic could be improved.
Any help is highly appreciated.
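
For what it's worth, the error message itself suggests the next thing to try (a sketch, not a verified fix): disable quote handling with quote = "" so that stray quote characters cannot unbalance the parse.

library(data.table)

data <- fread("xyz.dat", sep = "|", encoding = "UTF-8",
              quote = "",       # treat quote characters as ordinary text
              fill = TRUE,      # pad short lines instead of stopping
              showProgress = TRUE)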

Removing "NUL" characters (within R)

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.
With the aid of this question, I've finally figured out a better-than-ad-hoc way of going into each file and finding-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> " ") to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).
However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).
Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?
You want to read the file as binary; then you can substitute the NULs, e.g., to replace them with spaces:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place; e.g., say you want to replace two NULs ("\00\00") with one space:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__
