R: invalid multibyte string [duplicate]

I use read.delim(filename) without any further parameters to read a tab-delimited text file in R:
df = read.delim(file)
This worked as intended. Now I have a weird error message and I can't make any sense of it:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<fd>'
Calls: read.delim -> read.table -> type.convert
Execution halted
Can anybody explain what a multibyte string is? What does the fd mean? Are there other ways to read a tab-delimited file in R? I have column headers, and some lines do not have data for all columns.

I realize this is pretty late, but I had a similar problem and figured I'd post what worked for me. I used the command-line iconv utility (e.g., "iconv file.pcl -f UTF-8 -t ISO-8859-1 -c"). The "-c" option skips characters that can't be translated.
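If you would rather stay inside R, base R's iconv() can do the same byte-dropping via its sub argument. A minimal sketch, reusing the file name from the example above:
# Rough R analogue of `iconv -c`: sub = "" drops any bytes that can't be converted.
lines <- readLines("file.pcl", warn = FALSE)
clean <- iconv(lines, from = "UTF-8", to = "ISO-8859-1", sub = "")
writeLines(clean, "file_clean.pcl")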

If you want an R solution, here's a small convenience function I sometimes use to find where the offending (multibyte) character is lurking. Note that the offending character is the one after the last character that gets printed. This works because print will handle the whole string fine, but substr throws an error as soon as it reaches an invalid multibyte character.
find_offending_character <- function(x, maxStringLength=256){
  print(x)
  for (c in 1:maxStringLength){
    offendingChar <- substr(x,c,c)
    #print(offendingChar) #uncomment if you want the individual characters printed
    #the next character is the offending multibyte character
  }
}
string_vector <- c("test", "Se\x96ora", "works fine")
lapply(string_vector, find_offending_character)
I fix that character and run this again. Hope that helps someone who encounters the invalid multibyte string error.
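If you'd rather not eyeball the printed output, the same idea can be wrapped in tryCatch() so the function returns the failing position directly. A sketch (find_offending_position is my own name, not part of the answer above, and the exact position may be approximate depending on your locale):
find_offending_position <- function(x, maxStringLength = 256) {
  for (c in 1:maxStringLength) {
    bad <- tryCatch({ substr(x, c, c); FALSE }, error = function(e) TRUE)
    if (bad) return(c)  # substr() first failed here, so the bad byte is at or just after c
  }
  NA  # no invalid multibyte character found
}
sapply(string_vector, find_offending_position)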

I had a similarly strange problem with a file from the program E-Prime (edat -> SPSS conversion), but then I discovered that there are many additional encodings you can use. This did the trick for me:
tbl <- read.delim("dir/file.txt", fileEncoding="UCS-2LE")
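If you don't know which encoding a file uses, one option is to let R guess. This assumes the readr package is installed; its guess_encoding() wraps stringi's encoding detector:
library(readr)
guess_encoding("dir/file.txt")  # returns candidate encodings with confidence scores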

This happened to me because I had the copyright symbol in one of my strings! Once it was removed, the problem was solved.
A good rule of thumb: if you are seeing this error, check for (and remove) any characters that do not appear on your keyboard.
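Base R even ships a helper for spotting such characters: tools::showNonASCII() prints only the elements of a character vector that contain non-ASCII bytes (the file name below is just a placeholder):
# Print lines containing non-ASCII characters, with the offending bytes shown as <xx>
tools::showNonASCII(readLines("myfile.txt"))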

I found Leafpad to be an adequate and simple text editor for viewing and saving/converting between character sets, at least in the Linux world.
I used it to re-save a Latin-15 file as UTF-8, and it worked.

Related

How to resolve read.fwf run-time error: invalid multibyte string in R

I'm getting the following error when I try to read in a fixed-width text file using read.fwf.
Here is the output:
invalid multibyte string at 'ETE<52> O 19950207 19031103 537014290 7950 WILLOWS RD
Here are the most relevant lines of code
fieldWidths <- c(10,50,30,40,6,8,8,9,35,30,9,2)
colNames <- c("certNum", "lastN", "firstN", "middleN", "suffix", "daDeath", "daBirth", "namesSSN", "namesResStr", "namesResCity", "namesResZip", "namesStCode")
dmhpNameDF <- read.fwf(fileName, widths = fieldWidths, col.names=colNames, sep="", comment.char="", quote="", fileEncoding="WINDOWS-1258", encoding="WINDOWS-1258")
I'm running R 3.1.1 on Mac OSX 10.9.4
As you can see, I've experimented with specifying alternative encodings: I've tried latin1 and UTF-8, as well as WINDOWS-1250 through WINDOWS-1258.
When I read this file into Excel, Word, or TextEdit, everything looks fine in general. Using the text in the error message, I can identify the offending line as row number 5496, and on inspection I can see that the offending character shows up as an italic-looking letter 'f'. Searching for that character reveals about 4 instances of it in this file. I have many such files to process, so going through them one by one to delete the offending character is not a good solution.
So far, the offending character has always shown up in a name field, which is good for me, as I don't actually want the name data from this file. If a numeric field were corrupted, I'd have to toss out the row.
Since Word and Excel can read the file (apparently substituting an italic 'f' for the offending character), surely there must be a way to read it in with R, but I've not figured out a solution. I have searched through the many questions related to "invalid multibyte string", but have not found anything that resolved my problem.
My goal is to be able to read in the data while either ignoring this "character error" or substituting the offending character with something else.
Unfortunately the file in question contains sensitive information so I can not post a copy of it for people to play with.
Thanks
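One way to implement that substitution in R is to scrub the raw lines before parsing them. A sketch, assuming the file is essentially WINDOWS-1252 and reusing fileName, fieldWidths, and colNames from the question:
# Read the raw lines, re-encode them (sub = "?" replaces any unconvertible bytes),
# then hand the cleaned text to read.fwf via a text connection.
raw   <- readLines(fileName, warn = FALSE)
clean <- iconv(raw, from = "WINDOWS-1252", to = "UTF-8", sub = "?")
dmhpNameDF <- read.fwf(textConnection(clean), widths = fieldWidths,
                       col.names = colNames, sep = "", comment.char = "", quote = "")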

Error while reading csv file in R

I am having some problems reading a csv file with R.
x=read.csv("LorenzoFerrone.csv",header=T)
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<ff><fe>N'
I can read the file using LibreOffice with no problems.
I cannot upload the file because it is full of sensitive information.
What can I do?
Setting the encoding seems to be the solution. The '<ff><fe>' in the error message is the byte-order mark of a little-endian UTF-16/UCS-2 file, which is why UCS-2LE works:
> x=read.csv("LorenzoFerrone.csv",fileEncoding = "UCS-2LE")
> x[2,1]
[1] Adriano Caruso
100 Levels: Ada Adriano Caruso adriano diaz Adriano Diaz alberto ferrone Alexey ... Zia Tina
This will read the column names as-is and won't return any errors:
x = read.csv("LorenzoFerrone.csv", check.names = F)
To remove/replace troublesome characters in the column names, use this:
names(x) <- iconv(names(x), to = "ASCII", sub = "")
The cause is an invalid encoding. I solved it by replacing all of the "è" characters with e.
I found that this problem was caused by the file's encoding. I solved it by opening the file with Windows Notepad and saving it as UTF-8, then reopening it with Excel (it was garbled at first) and re-saving it as UTF-8; then it worked!
You can always use the "Latin1" encoding while reading the csv:
x = read.csv("LorenzoFerrone.csv", fileEncoding = "Latin1", check.names = F)
I am adding check.names = F to keep the spaces in your header from being replaced with dots.
You need to specify the correct delimiter in the sep argument.
Typically an encoding issue. You can try to change the encoding, or else delete the offending character (just use your favorite editor and replace all instances). In some cases R will report the character's location, for example:
invalid multibyte string 1847
which should make your life easier.
Also note that you may need to repeat this process several times (deleting all offending characters, or trying several encodings).
Change the file format to "CSV UTF-8" (for example, in Excel's Save As dialog). It worked for me.
Not sure if this is helpful, but I had a similar problem and figured out that it was because my "csv" file had a .csv suffix but was actually an .xls file!
Not sure if this helps, but I just had a similar issue which I solved by removing the double quotes from the csv I was trying to import. The first row of the file had the column names written as "colname","colname2","etc"; I removed all of the quote characters, and the csv was then read into R just fine.
I solved the problem by removing any diacritics from the text (i.e., accent marks). My headers were written in Spanish and had some accent marks. I replaced them with plain words (México = Mexico) and the problem was solved.
I know this is an old post, but just wanted to say to non-English natives: if you use "," as the decimal separator, use read.csv2() instead, since it expects ";" as the field separator and "," as the decimal mark.
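A minimal sketch of that, reusing the file name from the question:
# read.csv2() defaults to sep = ";" and dec = ",", the usual convention in
# locales that use a comma as the decimal mark.
x <- read.csv2("LorenzoFerrone.csv", fileEncoding = "Latin1")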

Trying to convert the character encoding in a dataset

I would like to convert the character encoding of one of the variables in my dataset, but I don't understand why the iconv command is not working.
My gist file is here : https://gist.github.com/pachevalier/5850660
A few ideas come to mind:
1) You may have a simple problem and be asking the wrong question: the final line in your gist is
Encoding(tab$fiche_communale$Nom)
Did you actually want:
Encoding(tab$fiche_communale$name)
2) readHTMLTable may not be reading in the character encoding correctly, in which case you could set it explicitly with Encoding(tab$fiche_communale$Nom) <- "latin1"
3) Try relying on iconv to detect the local encoding:
iconv(tab$fiche_communale$Nom, from="", to="UTF-8")
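Note the distinction between the last two ideas: Encoding()<- only re-labels the bytes, while iconv() actually converts them. A tiny illustration (my own example, not from the gist):
x <- "caf\xe9"                         # latin1 bytes with no declared encoding
Encoding(x) <- "latin1"                # declare what the bytes already are
x_utf8 <- iconv(x, "latin1", "UTF-8")  # re-encode the bytes themselves
Encoding(x_utf8)                       # "UTF-8"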

Error in tolower() invalid multibyte string

This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least not manually; it's too large).
Error in tolower(m) : invalid multibyte string X
The problem seems to be French company names containing the É character, although I have not investigated all of them (also not possible to do manually).
It's strange, because I thought encoding issues would have been caught during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties, and identified the encoding type.
Then I used the iconv() function:
x <- iconv(x, "WINDOWS-1252", "UTF-8")
To be more specific, I did this for every character column of the data.frame imported from the CSV. Note that I set stringsAsFactors = FALSE in my read.csv() call.
dat[, sapply(dat, is.character)] <- sapply(
  dat[, sapply(dat, is.character)],
  iconv, "WINDOWS-1252", "UTF-8")
I was getting the same error. However, in my case it wasn't when I was reading the file but a bit later, when processing it. I realised I was getting the error because the file hadn't been read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
library(tidyverse)
data_clean = data %>%
  mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
I know this has been answered already, but I thought I'd share my solution since I experienced the same thing.
In my case, I used str_trim() from the stringr package to trim whitespace from the start and end of the string:
com$uppervar <- toupper(str_trim(com$var))
# to avoid the datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are character
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)
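If the data frame has mixed column types, enc2utf8 will fail on the non-character ones; a variant that touches only the character columns (same old_data as above):
chr <- sapply(old_data, is.character)
old_data[chr] <- lapply(old_data[chr], enc2utf8)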
My solution to this issue:
library(dplyr)   # pipes
library(stringi) # for stri_enc_isutf8

# Read in the csv data
old_data <- read.csv("non_utf_data.csv", encoding = "UTF-8")

# despite specifying UTF-8, the column below is not valid UTF-8:
all(stri_enc_isutf8(old_data$problem_column))

# The code below uses a regular expression to cleanse the column. You may need to
# tinker with the last portion, which selects the punctuation to retain.
utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", old_data$problem_column)) %>%
  rename(solved_problem = problem_column)

# this column is now valid UTF-8:
all(stri_enc_isutf8(utf_eight_data$solved_problem))

read.fwf and the number sign

I am trying to read this file (3.8 MB) using its fixed-width structure, as described in the following link.
This command:
a <- read.fwf('~/ccsl.txt',c(2,30,6,2,30,8,10,11,6,8))
Produces an error:
line 37 did not have 10 elements
After replicating the issue with different values of the skip option, I figured out that the lines causing the problem all contain the "#" symbol.
Is there any way to get around it?
As @jverzani already commented, this problem is probably due to the # sign often being used as a character to signal a comment. Setting the comment.char input argument of read.fwf to something other than # could fix the problem. I'll leave my answer below as a more general approach that you can use for any character that causes problems (e.g., the 's in the Dutch city name 's Gravenhage).
I've had this problem occur with other symbols. The approach I took was to simply replace the # with either nothing or a character that does not generate the error. In my case it was no problem to simply replace the character, but this might not be possible in your case.
So my approach would be to delete the symbol that generates the error, or replace it with another character. This can be done using a text editor (find and replace), in an R script, or using the Linux tools grep and sed. If you want to do this in an R script, use scan or readLines to read the lines; once the text is in memory, you can use sub (or gsub) to replace the character.
If you cannot permanently replace the character, I would try the following approach: replace the character with one that does not generate an error, read the file into R using read.fwf, and finally restore the # character.
Following up on the answer above: to get all characters read as literals, use both comment.char = "" and quote = "" in the call to read.fwf (the latter takes care of @PaulHiemstra's problem with single quotes in Dutch proper nouns). This is documented in ?read.table.
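Applied to the call from the question, that would look like this (same file and widths as above):
a <- read.fwf('~/ccsl.txt', c(2,30,6,2,30,8,10,11,6,8),
              comment.char = "", quote = "")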
