I've been searching for similar problems but I can't find anything helpful.
I'm trying to open a portion of a big csv file with
# choosing a certain number of variables from more than 250 available in the file
resources <- c("P13_2_1", "P13_3_1", "P13_2_2", ...)
v <- fread("file.csv", select = resources, header = TRUE, encoding = "UTF-8")
After the file is opened, wherever there should be an NA there is a blank cell. However, when I try to see what's in any of the blank cells, I see this:
v$P13_2_1[2]
[1] "\r"
Similarly, the header of every column looks fine in the RStudio viewer, but when I print them in the console, the same \r is attached.
The problem is present with both read.csv and fread, and I've tried modifying the quote and na.strings arguments.
I would like to get rid of the "\r" and possibly substitute it with NA.
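A possible after-the-fact cleanup, assuming the stray \r is a leftover carriage return from Windows (or mixed) line endings — a sketch in data.table syntax, using resources as defined above:

library(data.table)

v <- fread("file.csv", select = resources, header = TRUE, encoding = "UTF-8")

# strip a trailing carriage return from the column names
setnames(v, sub("\r$", "", names(v)))

# strip a trailing carriage return from every character column, then turn
# the resulting empty strings into NA
char_cols <- names(v)[sapply(v, is.character)]
v[, (char_cols) := lapply(.SD, function(x) {
  x <- sub("\r$", "", x)
  fifelse(x == "", NA_character_, x)
}), .SDcols = char_cols]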
I'm trying to work with a file that is saved as a .csv file but is actually semicolon-delimited. The decimal points are commas.
Example of a row:
SAA1;6,022367813;10,9403136;5,807354922;3,169925001;3,807354922;8,636624621;5,247927513;5,459431619;9,09011242;4,247927513;4,087462841;5,247927513;4,584962501;11,17492568;4,754887502;6,857980995;7,409390936;7,499845887;8,224001674;10,19967234;9,638435914;4,700439718;6,14974712;2,807354922;0;7,348728154;4,700439718;6,820178962;4,700439718;6,044394119;1,584962501;6,044394119;6,375039431;3,807354922;9,087462841;8,74819285;5,614709844;8,330916878;6,62935662;5,169925001;6,442943496;2,321928095;8,312882955;9,240791332;2,807354922;9,06608919;6,539158811;5,64385619;4,584962501;6,700439718;6,108524457;7,539158811;6,658211483;8,982993575;5,285402219;8,744833837
I need to read this data into R and then work with it as numbers where decimal points are "."
Here's what I've tried:
read.csv2("filename.csv", row.names=1, sep=";",dec=",")
This almost worked. Most of the numbers were correctly read in with periods. However, all the numbers in certain columns remained separated by commas. I tried to fix this with:
temp<-sub(",", ".", data)
However, this did not quite work. It truncated several of the numbers and completely corrupted other ones. I have no idea why.
I've also tried opening the file in Sublime Text, where I found and replaced all commas with periods. This again worked for the majority of the data, but several numbers again became corrupted.
I've also tried reading in the file without changing the comma decimals, writing it out with period decimals, and then reading it in again:
temp<-read.csv2("filename.csv", row.names=1, sep=";")
write.csv2(temp, "filename_edited", sep = ";", dec=".", row.names = TRUE, col.names = TRUE)
temp2 <- read.csv2("filename_edited", sep=";", row.names=1)
This also didn't work. (I'm not surprised, I was getting desperate.)
What am I doing wrong? How can I fix it?
A common issue is leading or trailing white space around the numbers (e.g. " 342,5" instead of "342,5"). Have you tried using the strip.white=TRUE parameter, like:
read.csv2("filename.csv", row.names=1, sep=";", strip.white=TRUE)
If you otherwise pre-process the data, trimws() may also be useful in this context.
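If some columns still come through as character with comma decimals even after that, converting them column by column is safer than calling sub() on the whole data frame: sub() coerces a data frame to one deparsed string per column and replaces only the first comma in each, which would explain the truncation and corruption described above. A sketch, assuming the file from the question:

df <- read.csv2("filename.csv", row.names = 1, strip.white = TRUE)
# find the columns that were left as character and convert them to numeric,
# trimming stray white space and swapping the decimal comma for a period
chr_cols <- sapply(df, is.character)
df[chr_cols] <- lapply(df[chr_cols],
                       function(x) as.numeric(sub(",", ".", trimws(x))))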
I have run into some problems while importing a pipe-delimited file. The file is consistently delimited, but something is getting in the way of R reading some of the delimiters while parsing: R reads in 10 columns when there should be 11, even though the appropriate number of pipes is in place.
A very small sample of the data can be found here: https://drive.google.com/file/d/1ek6-H5EWKCaPfDTfB2muqYBjJz1fM3pf/view
dat <- read_delim("~/Desktop/foo.txt", delim = "|", col.names = TRUE)
I've tried playing around with how R treats the quotes: quote = "\"" did nothing to help, and ignoring the quotes with quote = "" made an even bigger mess of the import.
Any thoughts on how to fix the problem?
Feel free to use fread() from the data.table package, as below.
library(data.table)
FOO3<-fread("~/Downloads/foo.txt",sep = "|",fill = T)
Below is the imported dataset I got.
I am trying to read in a file in R, using the following command (in RStudio):
fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE)
file.csv looks something like this:
However, when it's read into R, I get:
As you can see, LOCATION is changed to ï..LOCATION for seemingly no reason.
I tried adding check.names = FALSE but this only made it worse, as LOCATION is now replaced with ï»¿LOCATION. What gives?
How do I fix this? Why is R/RStudio doing this?
There is a UTF-8 BOM at the beginning of the file. Try reading as UTF-8, or remove the BOM from the file.
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence
0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text
as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
Edit: looks like using fileEncoding = "UTF-8-BOM" fixes the problem in RStudio.
Using fileEncoding = "UTF-8-BOM" fixed my problem and read the file with no issues.
Using fileEncoding = "UTF-8"/encoding = "UTF-8" did not resolve the issue.
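For reference, here is the call from the question with that fix applied:

fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE,
                    fileEncoding = "UTF-8-BOM")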
I'm trying to read a giant data frame with cbc.read.table:
my.df <- cbc.read.table("df.csv",sep = ";", header =F)
This is what I get:
Error in cbc.read.table("2012Q2.csv", sep = "|", header = F) :
No rows to read
The working directory is set correctly. In principle it works using read.table, except that it doesn't read in all the lines (there are about two million).
Does anybody have an idea what I can do about this?
SOLUTION:
Hi again, the following thread helped me out:
R: Why does read.table stop reading a file?
The problem was caused by quotation marks, probably because some of them were not closed. I simply used an editor and deleted all double and single quotation marks, as well as all hash marks. It's working now.
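For the record, the same effect can usually be achieved without hand-editing the file by telling read.table to ignore quote and comment characters entirely (a sketch, assuming the file name and separator from the question):

# disable quote and comment handling so unbalanced " or ' characters and
# stray # characters cannot swallow or truncate rows
my.df <- read.table("df.csv", sep = ";", header = FALSE,
                    quote = "", comment.char = "")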
@Anthony: Thanks for your question. I noticed that the problem did not occur in the first three lines, which is how I got the idea that it's an issue with the file. Thanks!
Paul