Problems replacing the €-symbol in strings in R

I want to replace every "€" in a string with "[euro]". Now this works perfectly fine with
file.col.name <- gsub("€","[euro]", file.col.name, fixed = TRUE)
Now I am looping over column names from a csv-file, and suddenly I have trouble with the string "total€".
It works for other special characters (#, ?), but the € sign doesn't get recognized.
grep("€",file.column.name)
also finds no match, and if I extract the last letter it prints "€", but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why this happens, and how to solve it? I checked the class of file.column.name and it is "character"; I also tried converting it to character again, and the like, but nothing helped.
Thank you!

Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
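For instance, a minimal sketch (the file name is a placeholder; try the encodings above in turn until the euro sign survives):
# "file.csv" is a hypothetical name; swap in your own file
dat <- read.csv("file.csv", fileEncoding = "UTF-8", check.names = FALSE)
names(dat) <- gsub("€", "[euro]", names(dat), fixed = TRUE)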

Related

iconv() returns NA when given a string with a specific special character

I am trying to convert some strings of an input file from UTF-8 to ASCII. For most of the strings I give it, the conversion works perfectly fine with iconv(). However, on some of them it returns NA. While manually fixing the issue in the file seems like the simplest option, that is unfortunately not available to me at the moment.
I have made a reproducible example of my problem, but let's assume that I have to find a way for iconv() to convert the string in s1 without getting NA.
Here is the reproducible example:
s1 <- "Besançon" #as read from an input file I cannot modify
s2 <- "Paris"
s3 <- "Linköping"
s4 <- "Besançon" #Manual input for testing
s1 <- iconv(s1, to='ASCII//TRANSLIT')
s2 <- iconv(s2, to='ASCII//TRANSLIT')
s3 <- iconv(s3, to='ASCII//TRANSLIT')
s4 <- iconv(s4, to='ASCII//TRANSLIT')
I get the following output:
> s1
[1] NA
> s2
[1] "Paris"
> s3
[1] "Link\"oping"
> s4
[1] "Besancon"
After playing around with the code, I figured that something was wrong in the entry "Besançon" as copied from the input file: when I type it in manually myself (s4), the problem disappears. Since I can't modify the input file at all, what do you think is the exact issue, and would you have any idea how to solve it?
Thanks in advance,
Edit:
After closer inspection, there is something odd among the characters of the first line; it seems to be stripped by SO's formatting, so I cannot paste it here. The best I can do is describe it: if I place my cursor just before the # and press delete, which should delete the preceding white space, it deletes the " instead. So there is definitely an invisible character hiding there.
It turns out that using sub='' actually solved the issue although I am quite unsure why.
iconv(s1, to='ASCII//TRANSLIT', sub='')
From the documentation of the sub argument:
character string. If not NA it is used to replace any non-convertible
bytes in the input. (This would normally be a single character, but
can be more.) If "byte", the indication is "<xx>" with the hex code of
the byte. If "Unicode" and converting from UTF-8, the Unicode point in
the form "<U+xxxx>".
So I eventually figured out that there was a character in the string that I could neither convert nor see, and using sub was a way to eliminate it. I am still not sure what that character is, but the problem is solved.
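As a diagnostic, the sub = "byte" option quoted above prints the hex code of the unconvertible byte instead of silently removing it, which can expose the invisible character:
iconv(s1, to = 'ASCII//TRANSLIT', sub = 'byte')
# e.g. "Besan<e7>on" would point to a stray latin1-encoded ç (byte 0xe7)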
There is probably a latin1 (or other encoding) character in your supposedly utf8 file. For example:
> latin=iconv('Besançon','utf8','latin1')
> iconv(latin,to='ascii//translit')
[1] NA
> iconv(latin,'utf8','ascii//translit')
[1] NA
> iconv(latin,'latin1','ascii//translit')
[1] "Besancon"
> iconv(latin,'Windows-1250','ascii//translit')
[1] "Besancon"
You can, for example, make one new vector or data column per candidate encoding and, wherever one is NA, fall back to the next:
utf8 = iconv(x, 'utf8', 'ascii//translit')
latin1 = iconv(x, 'latin1', 'ascii//translit')
win1250 = iconv(x, 'Windows-1250', 'ascii//translit')
result = ifelse(is.na(utf8),
                ifelse(is.na(latin1), win1250, latin1),
                utf8)
If these encodings don't work, make a file with just the problem word, then use the unix/linux file command to detect the encoding, or else try some likely encodings.
I have in the past simply listed all of iconv's supported encodings, tried each one with lapply, and then used whichever results worked on each string. Beware, though: some "from" encodings will return a non-NA but incorrect result, so it is best to test on each unique problem character in your data in order to decide which subset of iconv's encodings to use, and in which order.
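A rough sketch of that brute-force search (x is your character vector; mind the caveat above about plausible-but-wrong results):
# try every encoding iconv knows about; unsupported conversions throw errors
results <- lapply(iconvlist(), function(enc)
  tryCatch(iconv(x, enc, 'ascii//translit'), error = function(e) NA))
names(results) <- iconvlist()
# keep only the encodings that converted every element without NA
ok <- Filter(function(r) !any(is.na(r)), results)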

How to specify encoding while creating file?

I am using an R script to create and append a file. But I need the file to be saved in ANSI encoding, even though some characters are in Unicode format. How do I ensure ANSI encoding?
newfile = '/home/user/abc.ttl'
file.create(newfile)
text3 <- readLines('/home/user/init.ttl')
sprintf('readlines %d', length(text3))
for (k in 1:length(text3)) {
  cat(text3[[k]], file = newfile, sep = "\n", append = TRUE)
}
Encoding can be tricky: you need to detect the encoding on input, and then convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you will probably lose some non-translatable characters, since there may be no mapping from the UTF-8 characters to ASCII outside of the lower 128 code points. (Within that range, UTF-8 and ASCII coincide.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
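Putting the three steps together, a minimal sketch (paths taken from the question):
text3 <- readLines('/home/user/init.ttl', encoding = "UTF-8")  # 1. read as UTF-8
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")              # 2. drop untranslatable characters
cat(text3, file = '/home/user/abc.ttl', sep = "\n")            # 3. the output is now pure ASCII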

how to resolve read.fwf run time error: invalid multibyte string in R

I'm getting the following error when I try to read in a fixed width text file using read.fwf.
Here is the output:
invalid multibyte string at 'ETE<52> O 19950207 19031103 537014290 7950 WILLOWS RD
Here are the most relevant lines of code
fieldWidths <- c(10,50,30,40,6,8,8,9,35,30,9,2)
colNames <- c("certNum", "lastN", "firstN", "middleN", "suffix", "daDeath", "daBirth", "namesSSN", "namesResStr", "namesResCity", "namesResZip", "namesStCode")
dmhpNameDF <- read.fwf(fileName, widths = fieldWidths, col.names=colNames, sep="", comment.char="", quote="", fileEncoding="WINDOWS-1258", encoding="WINDOWS-1258")
I'm running R 3.1.1 on Mac OSX 10.9.4
As you can see, I've experimented with specifying alternative encodings: I've tried latin1 and UTF-8, as well as WINDOWS-1250 through 1258.
When I read this file into Excel, Word, or TextEdit, everything looks good in general. Using the error message text, I can identify the offending line (row) as number 5496, and upon inspection I can see that the offending character shows up as an italic-looking letter 'f'. Searching for that character reveals about 4 instances of it in this file. I have many such files to process, so going through them one by one to delete the offending character is not a good solution.
So far, the offending character always shows up in a name field, which is fine for me, as I don't actually need the name data from this file. If it were a numeric field that was corrupted, I'd have to toss out the whole row.
Since Word and Excel can read the file (apparently substituting the offending character with the italic 'f'), surely there must be a way to read it in with R, but I've not figured out a solution. I have searched through the many questions related to "invalid multibyte string", but have not found anything that resolves my problem.
My goal is to be able to read in the data either ignoring this "character error" or substituting the offending character with something else.
Unfortunately the file in question contains sensitive information so I can not post a copy of it for people to play with.
Thanks

Special characters with read.csv loading as full stops when header = TRUE

I have a tab-delimited (\t) .csv file with the column names in the first row and comma-decimal numbers in the others. I am trying to read it with the read.csv() command like so:
x = read.csv("Export.csv", header = TRUE, sep = "\t", dec = ",")
in the input (file Export.csv) I have for example
"$\{,}_"
45,2
which gives me the header
X....._
and the value
45.2
I had expected it would interpret quoted values as strings and unquoted numbers as numbers.
It correctly interprets 45,2 as a number, but it mangles all the special characters in the header except the underscore.
I thought it was an encoding issue, so I tried a few different encoding options, with the same result.
Moreover, if I change the header parameter to FALSE, everything is displayed correctly; however, all the data are then interpreted as strings and (as expected) the first row is not treated as a header.
How can I load the special characters into the header under these circumstances?
Issue on: RStudio Version 0.98.501, R Version 3.0.2 x64, OS: Win 7 x64
All elements of one column of a data.frame must have the same type, so when R reads a column in, it has to guess which type you want. With header = TRUE, it reads the first line as the header and then guesses that the column is numeric. It then mangles the header name because check.names is TRUE by default and your name isn't a syntactically "valid" name (it might cause problems), so R tries to fix it.
With header = FALSE, it reads the first line as data, guesses that the value is a character (because it isn't a number), and so the whole column becomes character.
If you want to read in this column, with $\{,}_ as the header name, you can do:
read.table(textConnection('"$\\{,}_"\n45,2'),
           header = TRUE, check.names = FALSE, dec = ',')
If you want to read this data in, and convert the elements to a numeric or a character, you'll have to read it in as a character, and then convert it yourself, placing the elements in a list.
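A sketch of that manual route, using the same two-line input:
x <- read.table(textConnection('"$\\{,}_"\n45,2'),
                header = TRUE, check.names = FALSE, colClasses = "character")
# convert the comma-decimal strings to numbers yourself
vals <- as.numeric(sub(",", ".", x[[1]], fixed = TRUE))
result <- list(raw = x[[1]], numeric = vals)  # keep both representations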

read.fwf and the number sign

I am trying to read this file (3.8 MB) using its fixed-width structure, as described in the following link.
This command:
a <- read.fwf('~/ccsl.txt',c(2,30,6,2,30,8,10,11,6,8))
Produces an error:
line 37 did not have 10 elements
After replicating the issue with different values of the skip option, I figured that the lines causing the problem all contain the "#" symbol.
Is there any way to get around it?
As @jverzani already commented, this problem is probably due to the fact that the # sign is often used as a character to signal a comment. Setting the comment.char input argument of read.fwf to something other than # should fix the problem. I'll leave my answer below as a more general approach that you can use for any character that causes problems (e.g. the ' in the Dutch city name 's Gravenhage).
I've had this problem occur with other symbols too. The approach I took was to simply replace the # by either nothing, or by a character that does not generate the error. In my case it was no problem to replace the character permanently, but this might not be possible in your case.
So my approach would be to delete the symbol that generates the error, or replace it with another character. This can be done in a text editor (find and replace), in an R script, or using the Linux tools grep and sed. If you want to do it in an R script, use scan or readLines to read the lines; once the text is in memory, you can use sub or gsub to replace the character.
If you cannot replace the character for good, I would try the following: replace it with a character that does not generate an error, read the data in using read.fwf, and finally restore the # character.
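A sketch of that round trip (the placeholder "@" is an assumption; pick a character that does not occur in your data):
lines <- readLines('~/ccsl.txt')
lines <- gsub("#", "@", lines, fixed = TRUE)   # swap # for a harmless placeholder
tmp <- tempfile()
writeLines(lines, tmp)
a <- read.fwf(tmp, widths = c(2,30,6,2,30,8,10,11,6,8), as.is = TRUE)
a[] <- lapply(a, function(col)                 # restore the # afterwards
  if (is.character(col)) gsub("@", "#", col, fixed = TRUE) else col)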
Following up on the answer above: to get all characters read as literals, use both comment.char = "" and quote = "" in the call to read.fwf (the latter takes care of @PaulHiemstra's problem with single quotes in Dutch proper nouns); this is documented in ?read.table.
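That is, something along these lines (widths copied from the question):
a <- read.fwf('~/ccsl.txt', widths = c(2,30,6,2,30,8,10,11,6,8),
              comment.char = "", quote = "")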
