I have a data set in R in which some variables contain strings with "strange" characters, such as ë, é or €. Unfortunately, the source of the data set itself has errors in displaying these characters, originating from the export. For example: â‚¬ is displayed in the string where it should be €.
I've managed to debug these using the list provided at https://www.i18nqa.com/debug/utf8-debug.html. Despite my code being a little redundant, I've managed to solve the problem: for each strange character I used gsub to replace it with the character it should be. I think (actually, I know) that this could be coded more efficiently, but hey, it worked for me. Here's what I did:
df$var1 <- gsub("â‚¬", "€", df$var1)
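As an aside, a less redundant version of the same idea is to keep all of the replacements in a named lookup vector and loop over it. This is only a sketch; the mappings (taken from the i18nqa table) and the column name var1 are just examples, and writing the characters as Unicode escapes has the nice side effect that the script no longer depends on the encoding it is saved with:
fixes <- c("\u00e2\u201a\u00ac" = "\u20ac",  # "â‚¬" -> "€"
           "\u00c3\u00ab"       = "\u00eb",  # "Ã«" -> "ë"
           "\u00c3\u00a9"       = "\u00e9")  # "Ã©" -> "é"
for (bad in names(fixes)) {
  df$var1 <- gsub(bad, fixes[[bad]], df$var1, fixed = TRUE)
}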
However, after saving my code and reopening it, it is not working anymore. I think I saved the script in the wrong encoding, causing some information to be lost. My question is whether it is possible to get the code working again. R is displaying parts of my original code in red font with a red background. It would be a pain to retype my code, so I hope there's a quick fix for this.
By the way: there's no option to rerun the export with different settings, so I'd like to fix the problem at hand with the data set at hand. I've tried all the encodings R shows when you press "Save with Encoding". For example, UTF-8 correctly works with Ã« and ë, but not with â‚¬.
I have a data frame in R which has one row of UTF-8 encoded special characters and one integer row.
If I display both rows, or go into View(), I do not see the characters displayed correctly.
However, if I only select the row with the special characters, it works. Any ideas?
This is the output (if I paste it, the encoding disappears):
This looks like a bug in R. I've worked around a number of these in the corpus package. Try the following
library(corpus)
print.corpus_frame(WW_mapping[1:3,])
Alternatively, do
library(corpus)
class(WW_mapping) <- c("corpus_frame", "data.frame")
WW_mapping[1:3,]
Adding the "corpus_frame" class to the data frame changes the print and format methods; otherwise, it does not change the behavior of the object.
If that doesn't work, please report your sessionInfo() along with dput(WW_mapping). (Actually, even if this fix does work, please report this information so that we can let the R core developers know about the problem.)
I am reading in data from a web site, with text identifying each row. I simply copied and pasted the data into Excel, and the file is then read by R. One of these rows contains the name of a German city, "Würzburg", which includes a lower-case u with an umlaut. I have no problem seeing the special character on the web or in Excel. The problem is, when this word is passed to ggplot2, it is displayed in the plot as "WÃ¼rzburg", with a tilde over the capital A. RStudio shows both forms depending on the area in which it is displayed. I would assume that ggplot2 uses a different encoding for interpreting the special characters.
Is there a way to tell ggplot how to read, interpret and display the special characters? I do not want to write specialized code just for this city, but to solve the problem in general. I am likely to encounter other characters as the data expands over time.
I encountered a similar error with ggplot2 when I used a hard-coded data.frame (e.g., I would write Großbritannien (Great Britain) and it would get encoded to some gibberish).
My solution was to include
Sys.setlocale("LC_ALL", "German")
options(encoding = "UTF-8")
at the beginning of the script.
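As a minimal sketch of what that looks like (the data frame, column names, and plot below are made up purely for illustration, and "German" assumes a Windows locale name):
Sys.setlocale("LC_ALL", "German")
options(encoding = "UTF-8")

library(ggplot2)
# Hypothetical hard-coded data containing special characters
dat <- data.frame(land = c("Großbritannien", "Würzburg"), wert = c(3, 5))
ggplot(dat, aes(x = land, y = wert)) + geom_col()  # the labels should now display correctly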
Read the file in as follows
library('data.table')
fread('path_to_file', ..., encoding = 'UTF-8')
My solution to this problem is switching to cairo for PDF plotting. All special characters are then shown properly by ggplot2. It is enough to put this line of code among the knitr settings:
knitr::opts_chunk$set(dev='cairo_pdf')
My data frame has a column A with strings in character form:
> df$A
[1] "2-60", "2-61", "2-62", "2-63" etc
I saved the table using write.csv, but when I open it with Excel, column A appears formatted as a date:
Feb-60
Feb-61
Feb-62
Feb-63
etc
Does anyone know what I can do to avoid this?
I tweaked the arguments of write.csv but nothing worked, and I can't seem to find an example on Stack Overflow that helps solve this problem.
As said in the comments, this is Excel behaviour, not R's, and it can't be deactivated:
Microsoft Excel is preprogrammed to make it easier to enter dates. For example, 12/2 changes to 2-Dec. This is very frustrating when you enter something that you don't want changed to a date. Unfortunately there is no way to turn this off. But there are ways to get around it.
Microsoft Office Article
The first suggested way around it, according to the article, is not helpful, because it relies on changing the cell formatting, but that's too late once you open the .csv file in Excel (the value has already been converted to an integer representing the date).
There is, however, a useful tip:
If you only have a few numbers to enter, you can stop Excel from changing them into dates by entering an apostrophe (‘) before you enter a number, such as ’11-53 or ‘1/47. The apostrophe isn’t displayed in the cell after you press Enter.
So you can make the data display as the original by using:
vec <- c("2-60", "2-61", "2-62", "2-63")
vec <- paste0("'", vec)
Just remember the values will still have the apostrophe if you read them again in R, so you might have to use
vec <- sub("'", "", vec)
This might not be ideal but at least it works.
One alternative is enclosing the text in ="", as an Excel formula, but that has the same end result and uses more characters.
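If you do want to try that variant, it is the same kind of one-liner as the apostrophe approach (again just a sketch, using the example vector from above):
vec <- c("2-60", "2-61", "2-62", "2-63")
vec <- paste0('="', vec, '"')  # Excel evaluates ="2-60" as a formula and shows 2-60 as text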
Another solution, a bit tedious: use Import Text File in Excel, click through the dialog boxes, and in Step 3 of 3 of the Text Import Wizard you will have the option of setting the column data format. Use "Text" for the column that has "2-60", "2-61", "2-62", "2-63". If you use General (the default), Excel tries to be smart and converts the values for you.
I solved the problem by saving the file in the .xlsx format, using the function
write.xlsx()
from the package xlsx (https://www.rdocumentation.org/packages/xlsx/versions/0.6.5)
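A minimal sketch of that call (the data frame df and the file name are placeholders):
library(xlsx)
# Writing .xlsx instead of .csv keeps values like "2-60" as text in Excel
write.xlsx(df, file = "mydata.xlsx", row.names = FALSE)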
Using RStudio [Windows 8], when I use the dygraph function to plot a time series, I have a problem when trying to use UTF-8 characters in the main title.
library(dygraphs)
dygraph(AirPassengers, main = "Título")
This results in a title: "T?tulo"
I have tried to convert "Título" to the UTF-8 encoding, but it doesn't work.
You can use enc2utf8.
dygraph(AirPassengers, main = enc2utf8("Título"))
You need to make sure your locale settings support the character that you want to use, and that the file is saved with the right encoding. Saving as UTF-8 worked for me.
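A quick sketch of how to check both things from the console:
Sys.getlocale("LC_CTYPE")     # does the current locale cover the characters you need?
Encoding(enc2utf8("Título"))  # should report "UTF-8" after the conversion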
I was able to replicate your situation in Windows 7 and tried a bunch of things. Embedded in R Markdown, here is a minimal working example.
```{r}
Sys.setlocale("LC_ALL","German")
#Note that Windows locale names are different from Unix & Mac; usually
#the name of the nationality works here.
#It also works with "Faroese", "Hungarian", and other languages that have this letter.
#The locale has to be set in a preceding chunk to take effect.
```
```{r}
Encoding("Título")
library(dygraphs)
dygraph(AirPassengers, main = "Título")
```
You can check the encoding given to the title with Encoding(). Languages like Faroese, Hungarian, and German encode "Título" as latin1 or unknown, both of which seem to cause no problems for dygraph's JavaScript. UTF-8 wrote it as <U+00ED>, which was a problem for the JavaScript, as well as for some, but not all, other functions. With a matching locale, converting to UTF-8 as @Michele recommended has the same result.
Also, if you don't have the title in many places, it is possible to just manually find and replace the title in the HTML/JavaScript file that is produced. The problem occurs on conversion, but if the file is already made, the title variable can be successfully changed. The letter still shows up as a question mark in the RStudio "Viewer" output, but I recommend generating the full HTML/JavaScript file as usual, as I've seen other functions malfunction in the viewer window.
I have a problem that might be a bit unique, but I think that if it is answered it could answer other questions about encoding too.
In order to expand my R skills I tried to write a function with which I could manage the .vcf files from Android phones. Everything went OK until I tried to upload the file to the phone. An error appeared saying that the first line starts with something other than what a normal VCF version 3 file should start with. But when I checked the file on the PC it appeared to be fine, without the characters my phone complained about. So I asked about it, and someone here said that it is the byte order mark (BOM) and that I should use a hex editor to see it. And it was indeed there, even though it couldn't be seen in the text editors of Windows and Linux.
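For reference, the mark can also be checked from R itself instead of a hex editor; a small sketch (the file name is just a placeholder):
con <- file("contacts.vcf", "rb")
readBin(con, what = "raw", n = 3)  # a UTF-8 byte order mark shows up as ef bb bf
close(con)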
Thus, I tried to solve the problem by using the fileEncoding argument in R. The code that I use to write the file is:
write.table(cons2, file = paste(filename, ".vcf", sep = ""), row.names = FALSE, col.names = FALSE, quote = FALSE, fileEncoding = "")
I put ASCII as the argument, UTF-8, etc., but no luck. ASCII seems to delete some of the characters, and UTF-8 makes these characters visible in the text file.
I would appreciate it if someone could provide a solution to this.
PS: I know that if I modify the file in a hex editor it solves the problem, but I want the solution in the R code.