I'm using Leaflet in R to make an interactive map.
The data I'm using was created in QGIS with Windows-1250 encoding (it's in a non-English language). The layers were then exported as GeoJSON, which only allows UTF-8 encoding, so when the features are displayed on the map some characters are, of course, rendered wrong.
Is there a way to display the characters properly? Trying to manually change the data results in:
Error: unexpected symbol in 'x' (where x is the wrongly encoded data point I tried to rewrite).
You don't say which vector format the data was in before it was exported - that matters more than the tool used to create it.
Assuming the data is in shapefiles, you can use the approach from the question at https://gis.stackexchange.com/q/44057, and rely on ogr2ogr to both convert to GeoJSON and handle the character encoding.
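Assuming the layer really is a shapefile whose attribute table is Windows-1250, a rough sketch of that conversion driven from R might look like this (the file names are placeholders, and it assumes GDAL's ogr2ogr is on the PATH):
# SHAPE_ENCODING tells GDAL's shapefile driver the attributes are Windows-1250;
# the GeoJSON driver then writes UTF-8
system('ogr2ogr --config SHAPE_ENCODING CP1250 -f GeoJSON output.geojson input.shp')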
When I load an Excel sheet into R, some strings in the cells of the data frame appear to be bold and in a different format. For example, like so:
𝐇𝐚𝐢𝐝𝐚𝐫𝐚
And when I copy paste this string into the R console, it appears like this:
Does anyone know how to fix this (convert these strings back to the standard format) in R?
Want to avoid going back into Excel to fix it.
Thanks!
These are actually characters from Unicode's Mathematical Alphanumeric Symbols block, and they don't map nicely back onto 'standard' ASCII letters in R unless you have a pre-existing mapping function such as utf8_normalize from the utf8 package:
library(utf8)
utf8_normalize('𝐇𝐚𝐢𝐝𝐚𝐫𝐚', map_compat = TRUE)
#> [1] "Haidara"
However, I would strongly recommend that you fix your Excel file before importing to avoid having to do this; it works with the example you have given us here, but there may be unwelcome surprises in converting some of your other strings.
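If you do end up fixing it in R, the same compatibility normalisation (NFKC) is also available in the stringi package; this is an alternative to the utf8 call above, not something the answer relies on:
# NFKC normalisation maps the mathematical-alphabet letters back to plain ASCII
stringi::stri_trans_nfkc('𝐇𝐚𝐢𝐝𝐚𝐫𝐚')
#> [1] "Haidara"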
I am trying to extract Tweets from json files and save them as RData under Windows 10 and using RStudio version 1.2.5033 and streamR. However, Windows (and RStudio and streamR subsequently) assumes that the input is Windows-1252 although it is UTF-8 which leads to serious encoding issues.
To replicate the problem, please use this json file with two fake Tweets since I could not replicate the structure of the original json files within R. But this structure leads to issues with the only solution I found for the encoding issue (see below).
The code I used is the following:
df <- streamR::parseTweets("test.json")
The output I get with df$text is: '[1] "RT #bkabka:EikÃ¶ tÃ¤mÃ¤" "RT #bkabka:EspaÃ±a"'.
The output should be: '[1] "RT #bkabka:Eikö tämä" "RT #bkabka:España"'.
My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?
Since all this happens because the function wrongly assumes that the text is encoded with Windows-1252, one solution would be to go through the whole corpus and replace all of the wrongly interpreted special characters with the correct ones, for example using the table I found here (sketched below). In my case, however, the corpus is very large, making this a suboptimal solution in the long run. Additionally, I would have no way of checking whether it actually replaced all special characters correctly.
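For illustration only, that lookup-table approach would look roughly like this (just two hand-written entries shown; a real table would need many more):
fix_map <- c("Ã¶" = "ö", "Ã¤" = "ä")    # mis-read byte pairs -> intended characters
txt <- "RT #bkabka:EikÃ¶ tÃ¤mÃ¤"
for (bad in names(fix_map)) txt <- gsub(bad, fix_map[[bad]], txt, fixed = TRUE)
txt
#> [1] "RT #bkabka:Eikö tämä"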
Some additional information:
Using rjson and the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the json files since it only extracts the first line:
lt <- rjson::fromJSON(file="test.json")
I guess it cannot extract the subsequent lines because it does not recognise the line break, which is an actual line break and not \n or any other character combination. Unfortunately, I do not have the possibility to change the json files.
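A sketch of a possible workaround, assuming the file holds one JSON object per line (I have not verified this against the real files):
# read the file line by line, declaring the input as UTF-8,
# then parse each line as a separate JSON object
lines <- readLines("test.json", encoding = "UTF-8")
lt <- lapply(lines, rjson::fromJSON)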
The json files were created by another person under macOS using streamR - if I am not mistaken.
The same problem appears when using plain R instead of RStudio. The problem does not appear on macOS.
The problem is even more serious when using tweet2r, the only other package I am aware of that can extract Tweets from json files in R. tweet2r deletes certain special characters such as "¶", so the wrongly interpreted special characters can no longer be replaced with the correct ones.
Thanks to MrFlick (see his comment), here is a solution using jsonlite that results in a very similar data frame structure and reads the encoding correctly:
df <- jsonlite::stream_in(file("~/../Downloads/test.json"))
Just some further information for those used to the luxury of streamR handling tweets who might encounter a similar issue in the future: there are two main differences between the data frames created by parseTweets and stream_in:
parseTweets does not extract data for broken Tweets. stream_in does. Hence, the data frame has more rows when using stream_in but contains the same Tweets.
stream_in creates fewer variables because some of the columns in its data frame are themselves data frames. This can cause issues if you use the result without further transformation (see the sketch below); parseTweets does that flattening for you.
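If those nested columns get in the way, a minimal sketch of flattening them with jsonlite (the column name below is just the usual Twitter JSON field and may differ in your data):
library(jsonlite)
df <- stream_in(file("test.json"))
flat <- flatten(df)            # nested data-frame columns become "parent.child" columns
head(flat$user.screen_name)    # roughly what parseTweets exposes as screen_name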
I am trying to use multiple sources of German/Swiss data with umlauts in it. When trying to merge, I realized that the umlauts do not display correctly in R and the same names were rendered differently in different files.
map <- readOGR("/path/to/data.gdb", layer = "layer")
map@data$name
# [1] L\303\266rrach
# [2] Karlsruhe
# [3] ...
Along with several other posts, I read Encoding of German umlauts when using readOGR because one of my data sources is a shp file I read in with readOGR.
Adding use_iconv = TRUE, encoding = "UTF-8" to the readOGR call did not help, and the problem exists outside of readOGR as well. I saw that using Sys.setlocale() with a locale that supports UTF-8 worked for that poster, but I don't know what that means even after reading ?Sys.setlocale.
How do I correctly read in German data in R on a Mac using English? Sys.getlocale() reports "C".
Could you somehow include an example .gdb file?
What happens if you try encoding = "latin1"?
Maybe the gdb data was saved in the wrong encoding? Did you create it yourself, or download it from somewhere?
You could also check the information of the gdb-file with this command:
ogrinfo -al "/path/to/data.gdb"
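If that output suggests the attribute table is Latin-1 rather than UTF-8, a hedged sketch of the corresponding readOGR call (same placeholder paths as above) would be:
library(rgdal)
map <- readOGR("/path/to/data.gdb", layer = "layer",
               use_iconv = TRUE, encoding = "latin1")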
Is there a way to read a shapefile with a specific character encoding? I'm trying to read in a Canadian shapefile that has special (French) characters in some of the names. I can convert them manually, but I'd prefer not to do this if there's a setting somewhere that I'm so far blind to.
# manual conversion works
library(maptools)
shp <- file.path("path/to/file.shp")
map <- readShapePoly(shp, proj4string = CRS("+init=epsg:25832"))
map$ERNAME <- iconv(map$ERNAME, "Windows-1252", "UTF-8")
Instead of using maptools and readShapePoly, the rgdal library's readOGR function allows for more options. For example, the syntax with rgdal could be:
pasl <- readOGR(".", "filename", use_iconv = TRUE, encoding = "UTF-8")
Be aware that this is not a universal solution; it depends on the encoding of the file, which in the OP's case was UTF-8. Another common encoding is latin1. In some shapefiles the encoding is named in a .cpg file (open it with a text editor) that sits alongside the .shp file. QGIS automatically generates a .cpg file when a new shapefile is created, but many other GIS programs do not.
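To check that from R, a small sketch for peeking at the .cpg sidecar (the path is a placeholder):
shp <- "path/to/file.shp"
cpg <- sub("\\.shp$", ".cpg", shp)       # sidecar file naming the encoding
if (file.exists(cpg)) readLines(cpg)     # e.g. "UTF-8" or "1252"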
I am reading in data from a web site, with text identifying each row. I simply copied and pasted the data into Excel, and the file is then read by R. One of these rows contains the name of a German city, "Würzburg", which includes a lower-case u with an umlaut. I have no problem seeing the special character on the web or in Excel. The problem is that when this word is passed to ggplot2, it is displayed in the plot as "WÃ¼rzburg", with a tilde over a capital A. RStudio shows both forms depending on the area in which it is displayed. I assume ggplot2 uses a different encoding for interpreting the special characters.
Is there a way to tell ggplot how to read, interpret and display the special characters? I do not want to write specialized code just for this city, but to solve the problem in general. I am likely to encounter other characters as the data expands over time.
I encountered a similar error with ggplot2 when I used a hardcoded data.frame (i.e., I would write Großbritannien (Great Britain) and it would get rendered as gibberish).
My solution was to include
Sys.setlocale("LC_ALL", "German")
options(encoding = "UTF-8")
in the beginning of the script.
Read the file in as follows:
library('data.table')
fread('path_to_file', ..., encoding = 'UTF-8')
My solution to this problem is switching to cairo for PDF plotting. All special characters are then shown properly by ggplot2. It is enough to put this line of code among the knitr settings:
knitr::opts_chunk$set(dev='cairo_pdf')
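Outside of knitr, the same idea can be applied when saving a plot directly; a hedged sketch (the data frame is just a made-up example):
library(ggplot2)
p <- ggplot(data.frame(x = 1, y = 1, city = "Würzburg"),
            aes(x, y, label = city)) +
  geom_text()
ggsave("umlaut.pdf", p, device = cairo_pdf)   # cairo_pdf comes from grDevices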