Error in tolower() invalid multibyte string - r

This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).
Error in tolower(m) : invalid multibyte string X
The problem seems to be French company names containing the É character, although I have not investigated all of them (that is also not possible to do manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
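(For anyone who wants to locate the bad elements before deciding how to handle them: a minimal sketch, assuming the vector m is meant to be UTF-8. validUTF8() needs base R >= 3.3.0, and the sample strings here are made up.)
# hypothetical vector standing in for the real data; \x96 is an invalid UTF-8 byte
m <- c("Société Générale", "plain ascii", "bad \x96 byte")
which(!validUTF8(m))   # indices of the elements tolower() would choke on
# convert rather than remove: sub = "" silently drops untranslatable bytes
m_clean <- iconv(m, from = "", to = "UTF-8", sub = "")
tolower(m_clean)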

Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties and identified the encoding type.
After which I used the iconv() function.
x <- iconv(x, "WINDOWS-1252", "UTF-8")
To be more specific, I did this for every character column of the data.frame from the imported CSV. Note that I set stringsAsFactors = FALSE in my read.csv() call.
# re-encode every character column from Windows-1252 to UTF-8
dat[, sapply(dat, is.character)] <- sapply(
  dat[, sapply(dat, is.character)],
  iconv, "WINDOWS-1252", "UTF-8")

I was getting the same error. However, in my case it wasn't when I was reading the file, but a bit later, when processing it. I realised I was getting the error because the file hadn't been read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
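For what it's worth, encoding and fileEncoding do different things in read.csv(), and it is easy to grab the wrong one. A small sketch with a placeholder file name:
# declare that the strings in the file are already UTF-8 (they just get marked as such)
dat1 <- read.csv("data.csv", encoding = "UTF-8")
# actively re-encode from Latin-1 while reading
dat2 <- read.csv("data.csv", fileEncoding = "latin1")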

library(tidyverse)
data_clean <- data %>%
  mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
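If several character columns need the same treatment, here is a sketch using across() (assumes dplyr >= 1.0.0; data is whatever your imported data.frame is called):
data_clean <- data %>%
  mutate(across(where(is.character), ~ tolower(enc2utf8(.x))))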

I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim() from package stringr to trim whitespace from start and end of string.
library(stringr)
com$uppervar <- toupper(str_trim(com$var))

# to avoid the datatables warning "error in tolower(x) invalid multibyte string"
# assuming all columns are character
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)
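If not all columns are character, a guarded variant of the same idea (my own sketch, not part of the original answer):
# only touch the character columns, leave numerics etc. alone
is_chr <- vapply(old_data, is.character, logical(1))
old_data[is_chr] <- lapply(old_data[is_chr], enc2utf8)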

My solution to this issue:
library(dplyr)   # pipes
library(stringi) # for stri_enc_isutf8

# Read in the CSV data
old_data <- read.csv("non_utf_data.csv", encoding = "UTF-8")

# Despite specifying UTF-8, the column below is still not valid UTF-8:
all(stri_enc_isutf8(old_data$problem_column))

# The code below uses a regular expression to cleanse the column. You may need
# to tinker with the last portion, which selects the punctuation to retain.
utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", problem_column)) %>%
  rename(solved_problem = problem_column)

# This column is now valid UTF-8:
all(stri_enc_isutf8(utf_eight_data$solved_problem))
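One caveat: the gsub() approach deletes accented letters along with the bad bytes. If you would rather keep them, a gentler sketch using stringi (which this answer already loads) is to force-validate the encoding instead:
# replaces invalid byte sequences with the Unicode replacement character
fixed <- stri_enc_toutf8(old_data$problem_column, validate = TRUE)
all(stri_enc_isutf8(fixed))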

Related

write_csv - Exporting trailing spaces (no elimination)

I am trying to export a table to CSV format, but one of my columns is special - it's like a number string except that the length of the string needs to be the same every time, so I add trailing spaces to shorter numbers to get it to a certain length (in this case I make it length 5).
library(dplyr)
library(readr)
library(stringr)  # str_pad() comes from stringr, not dplyr/readr
df <- read.table(text="ID Something
22 Red
55555 Red
123 Blue
", header=T)
df <- mutate(df, ID = str_pad(ID, 5, "right", " "))
df
     ID Something
1 22          Red
2 55555       Red
3 123        Blue
Unfortunately, when I try to do write_csv somewhere, the trailing spaces disappear which is not good for what I want to use this for. I think it's because I am downloading the csv from the R server and then opening it in Excel, which messes around with the data. Any tips?
str_pad() appears to be a function from the stringr package, which is not currently available for R 3.5.0, which I am using - this may be the cause of your issues as well. If the function actually works for you, please ignore the next step and skip straight to my Excel comments below.
Adding spaces. Here is how I have accomplished this task with base R
# a custom function to add an arbitrary number of trailing spaces
SpaceAdd <- function(x, desiredLength = 5) {
  # max(0, ...) guards against strings already at or above the desired length
  additionalSpaces <- ifelse(nchar(x) < desiredLength,
                             paste(rep(" ", max(0, desiredLength - nchar(x))), collapse = ""),
                             "")
  paste(x, additionalSpaces, sep = "")
}
# use the function on your df
df$ID <- mapply(df$ID, FUN = SpaceAdd)
# write csv normally
write.csv(df, "df.csv")
NOTE When you import to Excel, you should use the 'import from text' wizard rather than just opening the .csv. This is because you need to mark your 'ID' column as text in order to keep the spaces.
NOTE 2 I have learned today that having your first column named 'ID' might actually cause further problems with Excel, since it may misinterpret the nature of the file and treat it as a SYLK file instead. So it may be best to avoid this column name if possible.
Here is a wiki tl;dr:
A commonly encountered (and spurious) 'occurrence' of the SYLK file happens when a comma-separated value (CSV) format is saved with an unquoted first field name of 'ID', that is the first two characters match the first two characters of the SYLK file format. Microsoft Excel (at least to Office 2016) will then emit misleading error messages relating to the format of the file, such as "The file you are trying to open, 'x.csv', is in a different format than specified by the file extension..."
details: https://en.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)

Error while reading csv file in R

I am having some problems in reading a csv file with R.
x=read.csv("LorenzoFerrone.csv",header=T)
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<ff><fe>N'
I can read the file using LibreOffice with no problems.
I cannot upload the file because it is full of sensitive information.
What can I do?
Setting the encoding seems to be the solution to the problem.
> x=read.csv("LorenzoFerrone.csv",fileEncoding = "UCS-2LE")
> x[2,1]
[1] Adriano Caruso
100 Levels: Ada Adriano Caruso adriano diaz Adriano Diaz alberto ferrone Alexey ... Zia Tina
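If you want to see why UCS-2LE is the right guess here (a sketch; the file name comes from the question): the '<ff><fe>' in the error message is the UTF-16/UCS-2 little-endian byte order mark, and you can check for it directly:
# the first two bytes of a UTF-16/UCS-2 little-endian file are ff fe
readBin("LorenzoFerrone.csv", what = "raw", n = 2)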
This will read the column names as-is and won't return any errors:
x = read.csv("LorenzoFerrone.csv", check.names = F)
To remove/replace troublesome characters in column names, use this:
iconv(names(x), to = "ASCII", sub = "")
The cause is an invalid encoding. I solved it by replacing all of the "è" characters with e.
I found that this problem was caused by the encoding of the file, and I solved it by opening the file in Windows Notepad, saving it with UTF-8 encoding, reopening it in Excel (it looked garbled at first), and resaving it with UTF-8; after that it worked!
You can always use the "Latin1" encoding while reading the csv:
x = read.csv("LorenzoFerrone.csv", fileEncoding = "Latin1", check.names = F)
I am adding check.names = F to keep spaces in your header from being replaced with dots.
You need to specify the correct delimiter in the sep argument.
Typically an encoding issue. You can try to change the encoding, or else delete the offending characters (just use your favorite editor and replace all instances). In some cases R will print the character location, for example:
invalid multibyte string 1847
which should make your life easier.
Also note that you may have to repeat this process several times (deleting all offending characters or trying several encodings).
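When R does report a position like 1847, a quick sketch to see what is sitting there (assuming the number is a byte offset into the file; the file name is a placeholder):
# pull the raw bytes around the reported offset and eyeball them
bytes <- readBin("data.csv", what = "raw", n = 1860)
bytes[1840:1860]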
Change the file format to CSV UTF-8. It worked for me.
Not sure if this is helpful, but I had a similar problem and figured out that it was because my "csv" file had a .csv suffix, but was actually a .xls file!
Not sure if this helps, but I had a similar issue which I solved by removing the " characters from the csv I was trying to import. The first row of the file had the column names written as "colname","colname2","etc"; I removed all of the " characters and the csv was then read into R just fine.
I solved the problem by removing any diacritics from the text (i.e. accent marks). My headers were written in Spanish and had some accent marks in them. I replaced them with plain words (México = Mexico) and the problem was solved.
I know this is an old post, but just wanted to say to non-English natives: if you use "," as the decimal separator, try read.csv2(), which expects ";" as the field separator and "," as the decimal mark.

Trying to convert the character encoding in a dataset

I would like to convert the character encoding of one of the variables in my dataset, but I don't understand why the iconv command is not working.
My gist file is here : https://gist.github.com/pachevalier/5850660
A few ideas come to mind:
1) you have a simple problem and are asking the wrong question
-- the final line in your Gist is
Encoding(tab$fiche_communale$Nom)
Are you actually wanting:
Encoding(tab$fiche_communale$name)
2) readHTMLTable may not be reading in the character encoding correctly, in which case you could set it explicitly with Encoding(tab$fiche_communale$Nom) <- "latin1"
3) try relying on iconv to detect the local encoding:
iconv(tab$fiche_communale$Nom, from="", to="UTF-8")
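A quick way to see whether the conversion did anything (my own sketch, reusing the column name from the gist):
# declared encodings before and after
table(Encoding(tab$fiche_communale$Nom))
tab$fiche_communale$Nom <- iconv(tab$fiche_communale$Nom, from = "", to = "UTF-8")
table(Encoding(tab$fiche_communale$Nom))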

R - read.table imports half of the dataset - no errors nor warnings

I have a csv file with ~200 columns and ~170K rows. The data has been extensively groomed and I know that it is well-formed. When read.table completes, I see that approximately half of the rows have been imported. There are no warnings or errors. I set options(warn = 2). I'm using the latest 64-bit version of R and I increased the memory limit to 10 GB. Scratching my head here... no idea how to proceed with debugging this.
Edit
When I said half the file, I don't mean the first half. The last observation read is towards the end of the file... so it's seemingly random.
You may have a comment character (#) in the file (try setting the option comment.char = "" in read.table). Also, check that the quote option is set correctly.
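For concreteness, a sketch of the suggested call (file name and separator are placeholders):
dat <- read.table("big_file.csv", sep = ",", header = TRUE,
                  comment.char = "", quote = "\"")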
I've had this problem before. How I approached it was to read in a set number of lines at a time and then combine them after the fact.
df1 <- read.csv(..., nrows=85000)
# skip the header line plus the 85000 rows already read; read the next chunk without a header
df2 <- read.csv(..., skip=85001, header=FALSE, nrows=85000)
# df2 was read without a header, so give it df1's column names
colnames(df2) <- colnames(df1)
df <- rbind(df1, df2)
rm(df1, df2)
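A diagnostic worth trying before chunking (my suggestion, with a placeholder file name): count.fields() reports how many fields R sees on each line, so the first line with a deviating count is usually the culprit:
n_fields <- count.fields("big_file.csv", sep = ",")
table(n_fields)            # a well-formed file shows a single value (~200 per the question)
which(n_fields != 200)[1]  # first suspicious line, assuming 200 columns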
I had a similar problem when reading in a large txt file which had a "|" separator. Scattered about the txt file were some text blocks that contained a quote (") which caused the read.xxx function to stop at the prior record without throwing an error. Note that the text blocks mentioned were not encased in double quotes; rather, they just contained one double quote character here and there (") which tripped it up.
I did a global search and replace on the txt file, replacing the double quote (") with a single quote ('), solving the problem (all rows were then read in without aborting).
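An alternative that avoids editing the file (a sketch, under the assumption that no field legitimately needs quoting): tell read.table to ignore quote characters entirely:
dat <- read.table("data.txt", sep = "|", header = TRUE, quote = "")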

R: invalid multibyte string [duplicate]

This question already has answers here:
Invalid multibyte string in read.csv
(9 answers)
Closed 8 years ago.
I use read.delim(filename) without any parameters to read a tab delimited text file in R.
df = read.delim(file)
This worked as intended. Now I have a weird error message and I can't make any sense of it:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<fd>'
Calls: read.delim -> read.table -> type.convert
Execution halted
Can anybody explain what a multibyte string is? What does fd mean? Are there other ways to read a tab file in R? I have column headers and lines which do not have data for all columns.
I realize this is pretty late, but I had a similar problem and figured I'd post what worked for me. I used the iconv command-line utility (e.g., iconv file.pcl -f UTF-8 -t ISO-8859-1 -c). The -c option skips characters that cannot be translated.
If you want an R solution, here's a small convenience function I sometimes use to find where the offending (multibyte) character is lurking. Note that the offending character is the one after the last character that gets printed. This works because print() handles the whole string fine, but substr() throws an error when it reaches an invalid multibyte character.
find_offending_character <- function(x, maxStringLength = 256) {
  print(x)
  for (c in 1:maxStringLength) {
    offendingChar <- substr(x, c, c)
    # print(offendingChar)  # uncomment if you want the individual characters printed;
    # the character after the last one printed is the offending multibyte character
  }
}
string_vector <- c("test", "Se\x96ora", "works fine")
lapply(string_vector, find_offending_character)
I fix that character and run this again. Hope that helps someone who encounters the invalid multibyte string error.
I had a similarly strange problem with a file from the program E-Prime (edat -> SPSS conversion), but then I discovered that there are many additional encodings you can use. This did the trick for me:
tbl <- read.delim("dir/file.txt", fileEncoding="UCS-2LE")
This happened to me because I had the copyright symbol in one of my strings! Once it was removed, the problem was solved.
A good rule of thumb: if you are seeing this error, make sure that characters which don't appear on your keyboard are removed.
I found Leafpad to be an adequate and simple text editor for viewing and saving/converting between certain character sets - at least in the Linux world.
I used it to convert from Latin-15 to UTF-8 and it worked.
