I am trying to read in a file in R, using the following command (in RStudio):
fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE)
file.csv is a plain CSV whose header row starts with a column named LOCATION.
However, when the file is read into R, that column name comes back as ï..LOCATION, for seemingly no reason.
I tried adding check.names = FALSE, but this only made it worse, as LOCATION is now replaced with ï»¿LOCATION. What gives?
How do I fix this? Why is R/RStudio doing this?
There is a UTF-8 BOM at the beginning of the file. Try reading as UTF-8, or remove the BOM from the file.
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence
0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text
as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
Edit: looks like using fileEncoding = "UTF-8-BOM" fixes the problem in RStudio.
Using fileEncoding = "UTF-8-BOM" fixed my problem and read the file with no issues.
Using fileEncoding = "UTF-8"/encoding = "UTF-8" did not resolve the issue.
Related
Has anyone had experience reading a Korean-language file into R using EUC-KR as the text encoding?
I used the fread function, since it reads this file structure well. Below is the sample code:
test <- fread("KoreanTest.txt", encoding = "EUC-KR")
Then I got error, "Error in fread("KoreanTest.txt", encoding = "EUC-KR") : Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'".
Initially I was using UTF-8 as the text encoding, but the Korean characters were not displayed correctly in the output. I have been looking for another solution, but nothing seems to work at this time.
I would appreciate it if someone could share ideas. Thanks.
Unlike fread, read.table accepts an explicit encoding parameter. This common usage works well:
read.table(filesource, header = TRUE, stringsAsFactors = FALSE, encoding = "EUC-KR")
Or you can try RStudio's import dialog:
File -> Import Dataset -> From Text
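If you prefer a tidyverse route, readr can also convert from EUC-KR while reading via its locale argument. A sketch, assuming the file is tab-separated (swap in read_csv or read_delim to match the actual separator):
library(readr)
# locale() tells readr to convert from EUC-KR to UTF-8 as it reads
test <- read_tsv("KoreanTest.txt", locale = locale(encoding = "EUC-KR"))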
When I use write.csv for my Japanese text, I get gibberish in Excel (which normally handles Japanese fine). I've searched this site for solutions, but am coming up empty-handed. Is there an encoding command to add to write.csv to enable Excel to import the Japanese properly from R? Any help appreciated!
Thanks!
I just ran into this exact same problem - I used what I saw online:
write.csv(merch_df, file = "merch.reduced.csv", fileEncoding = "UTF-8")
and indeed, when opening the resulting file in Excel, I saw <U+30BB><U+30D6><U+30F3>, etc. Odd and disappointing.
A little googling turned up this great blog post by Kevin Ushey, which explains it all: https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
Using the function he proposes:
write_utf8 <- function(text, f = tempfile()) {
  # step 1: ensure our text is UTF-8 encoded
  utf8 <- enc2utf8(text)
  # step 2: create a connection with 'native' encoding
  # this signals to R that translation before writing
  # to the connection should be skipped
  con <- file(f, open = "w+", encoding = "native.enc")
  # step 3: write to the connection with 'useBytes = TRUE',
  # telling R to skip translation to the native encoding
  writeLines(utf8, con = con, useBytes = TRUE)
  # close our connection
  close(con)
  # read back from the file just to confirm
  # everything looks as expected
  readLines(f, encoding = "UTF-8")
}
works magic. Thank you Kevin!
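Note that write_utf8 expects a character vector rather than a data frame, so for a table like merch_df you would first have to render it to text; one possible way (the readr::format_csv step is my own assumption, not part of Kevin's post):
# format_csv renders the data frame as a single CSV-formatted UTF-8 string,
# which write_utf8 then writes out byte-for-byte
write_utf8(readr::format_csv(merch_df), "merch.reduced.csv")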
As a workaround -- and a diagnostic -- have you tried saving as .txt and then both opening the file in Excel and pasting the data into Excel from a text editor?
I ran into the same problem as tchevrier. Japanese text was not displayed correctly in either Excel or a text editor when exporting with write.csv. I found that using:
readr::write_excel_csv(df, "filename.csv")
corrected the issue (write_excel_csv writes UTF-8 with a byte-order mark, which is what Excel needs to detect the encoding).
I was reading in a csv file.
Code is:
mydata = read.csv("mycsv.csv", header = TRUE, sep = ",", quote = "\"")
I get the following warning:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Now some cells in my CSV have missing values that are represented by "".
How do I write this code so that I do not get the above warning?
Your CSV might be encoded in UTF-16. This isn't uncommon when working with some Windows-based tools.
You can try loading a UTF-16 CSV like this:
read.csv("mycsv.csv", ..., fileEncoding="UTF-16LE")
You can try using the skipNul = TRUE option.
mydata = read.csv("mycsv.csv", quote = "\"", skipNul = TRUE)
From ?read.csv
Embedded nuls in the input stream will terminate the field currently being read, with a warning once per call to scan. Setting skipNul = TRUE causes them to be ignored.
It worked for me.
This has nothing to do with the encoding. The problem is the embedded nul characters in the file. To handle them, you need to pass the skipNul = TRUE parameter.
For example:
neg = scan('F:/Natural_Language_Processing/negative-words.txt', what = 'character', comment.char = '', encoding = "UTF-8", skipNul = TRUE)
The file might not have CRLF line endings; it might only have LF. Try checking a hex dump of the file.
If so, try running the file through awk:
awk '{printf "%s\r\n", $0}' file > new_log_file
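If you would rather stay in R, a rough equivalent of the awk step might be the following (a sketch, assuming the file fits in memory; file names mirror the awk command above):
lines <- readLines("file")                        # readLines drops whatever line endings are there
writeLines(lines, "new_log_file", sep = "\r\n")   # write them back out terminated with CRLF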
I had the same error message and figured out that although my files had a .csv extension and opened with no problems in a spreadsheet, they were actually saved as "All Formats" rather than "Text CSV (.csv)".
Another quick solution:
Double check that you are, in fact, reading a .csv file!
I was accidentally reading an .rds file instead of a .csv and got this "embedded null" error.
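A quick way to check what kind of file you actually have is to look at its first few raw bytes (a diagnostic sketch; substitute your own file name):
# plain text starts with readable character codes, a UTF-16 file typically
# starts with the byte-order mark ff fe (or fe ff), and an .rds file is
# usually gzip-compressed so it will not look like text at all
readBin("mycsv.csv", what = "raw", n = 16)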
In these cases, also make sure the data you are importing does not contain "#" characters; if it does, try the option comment.char = "". It worked for me.
I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open it in R:
read.table(myfile, header = TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way: numeric values changed seemingly at random, but the overall layout looked normal, so it took me a while to notice the problem, which I now assume is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header = TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header = FALSE, sep = "\t")
read.table(myfile, header = FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding, and ï»¿ if I don't).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
Update for @Joe:
The SAS that I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8";
proc export data=lib.sastable
outfile=myfile
dbms=tab replace;
putnames=yes;
run;
Update on further weirdness: Using fileEncoding = "UTF-8-BOM" as @Joe suggested in his answer below does seem to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data: the header row is fine, but weirdly the last few digits of the first column of numbers get messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?
Hack solution: use fileEncoding = "UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this fixes the data corruption issue -- that could be the topic of a future question.
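For the record, the combined workaround would look roughly like this (the type-conversion step at the end is my own suggestion for getting numeric columns back, not something from the question):
# read every column as character to sidestep the corrupted numeric parsing,
# then let R re-guess the column types afterwards
dat <- read.table(myfile, fileEncoding = "UTF-8-BOM", header = TRUE,
                  sep = "\t", colClasses = "character")
dat[] <- lapply(dat, type.convert, as.is = TRUE)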
As per your link, it looks like it works for me with:
read.table('c:\\temp\\testfile.txt',fileEncoding='UTF-8-BOM',header=TRUE,sep='\t')
Note the -BOM in the file encoding.
This is in 2.1 "Variations on read.table" in the R documentation. Under 12 "Encoding", see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
Or, on the SAS side, you can use the system option NOBOMFILE to write a UTF-8 file without the BOM.