I want to clean a table of unwanted characters. The table can be downloaded for free here: https://www.aggdata.com/free/germany-postal-codes
It contains all German postal codes, and since German uses special characters like ü, ö, ä and ß, I need to replace them with other characters.
Now let's say I want to replace every "ß" with "ss".
I pieced the following code together from different Stack Overflow posts:
postal <- read.csv("~/Downloads/de_postal_codes.csv")
postal <- as.data.frame(sapply(postal,gsub,pattern="<df>",replacement="ss"))
When I test with other patterns such as pattern = "Cot", the replacement works, but it fails whenever the pattern contains the <> symbols. What is the problem here?
I am using R 3.3.3 in RStudio 1.0.136 on macOS 10.13.4.
This seems to work. If you put encoding = "UTF-8" into the read.table call, you see that <df> comes back as \xdf. I don't know much about this area, but matching that byte while reading with the original encoding seemed to work. Hope this helps.
postal <- read.table("~/Downloads/de_postal_codes.csv", sep = ",", header = TRUE,
                     stringsAsFactors = FALSE)
postal$Place.Name[4]
postal <- as.data.frame(
  sapply(postal, function(x) {
    gsub(pattern = "\xdf", replacement = "ss", x = x)
  })
)
postal$Place.Name[4]
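For what it's worth, here is a hedged alternative sketch: \xdf is the Latin-1 byte for ß, so assuming the CSV is Latin-1 encoded, declaring that encoding up front lets you match the literal character (only character columns are touched here):
postal <- read.csv("~/Downloads/de_postal_codes.csv",
                   fileEncoding = "latin1", stringsAsFactors = FALSE)
postal[] <- lapply(postal, function(x) {
  if (is.character(x)) gsub("ß", "ss", x, fixed = TRUE) else x
})
postal$Place.Name[4]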
edit: Also, I don't think your sapply was doing the trick. The x parameter in gsub is not the first argument; see ?gsub.
edit2: I'm using Windows and R 3.5.0.
Nothing I've tried seems to work. I'm currently running R 4.1.3 and have re-installed the easyPubMed package, but nothing helps. Here's the code I have so far:
new_query <- "(APE1[TI] OR OGG1[TI]) AND (2012[PDAT]:2016[PDAT])"
out.A <- batch_pubmed_download(pubmed_query_string = new_query, dest_file_prefix = "easyPM_example",
                               batch_size = 150, encoding = "UTF-8")
cat(readLines(out.A[1])[1:32], sep = "\n")
For some reason, it returns 1 line of all the collected XML-style text, followed by 31 lines of NA.
I've also looked into using fetch_pubmed_data() for a similar purpose, but whenever I check the class of the result I get character, when it should be XMLInternalDocument and XMLAbstractDocument. Here's my code:
my_query <- '"genetic therapy"[MeSH Terms]'
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_xml <- fetch_pubmed_data(my_entrez_id)
class(my_abstracts_xml)
Any help would be greatly, greatly appreciated!
The issue is with your readLines call; both batch_pubmed_download and fetch_pubmed_data work as expected.
In your batch_pubmed_download example, the downloaded files are XML files with three text lines (you can confirm this with readLines or, in a terminal, with wc -l). So readLines(out.A[1])[1:32] makes no sense: only 3 lines exist, and all the other indices return NA. The entire XML content is in those 3 lines.
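A quick way to see this for yourself (a minimal sketch, reusing out.A from the call above):
xml_lines <- readLines(out.A[1])
length(xml_lines)            # 3: the whole XML document is in these lines
cat(xml_lines, sep = "\n")   # print everything without indexing past the end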
How to handle Cyrillic strings in R?
Sys.setlocale("LC_ALL","Polish")
dataset <- data.frame( ProductName = c('ąęćśżźół','тест') )
#Encoding(dataset) <- "UTF-8" #this line does not change anything
View(dataset)
The code above displays the Cyrillic string in the viewer as a sequence of <U+number> escapes. But I would like to see what I typed (тест) instead. Is there any way to do that?
This works for me; I see the Cyrillic text in my data frame.
I think you should check what your locale is (with sessionInfo) and whether it supports UTF-8.
Also check this link and try changing the encoding of your column:
Encoding(dataset$Cyrillic) <- "UTF-8"
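A minimal end-to-end sketch of that suggestion (ProductName is the column name from the question; whether the text displays correctly still depends on your locale supporting UTF-8):
sessionInfo()   # check the active locale
dataset <- data.frame(ProductName = c('ąęćśżźół', 'тест'), stringsAsFactors = FALSE)
Encoding(dataset$ProductName) <- "UTF-8"   # mark the strings as UTF-8
dataset$ProductName
View(dataset)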
In my question yesterday, "Can R read html-encoded emoji characters?", user rensa noted that:
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
How would this look in R?
(Note: I'm posting this question with the intention of answering it immediately, rather than posting this in the question linked above, since it's tangential to that question, but still possibly of use to others.)
The approach below works for transforming an emoji character or unicode representation into a name.
I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
# Get (MIT-licensed) emojilib data:
emoji_json_file <- "https://raw.githubusercontent.com/muan/emojilib/master/emojis.json"
json_data <- rjson::fromJSON(paste(readLines(emoji_json_file), collapse = ""))
get_name_from_emoji <- function(emoji_unicode, emoji_data = json_data) {
  emoji_evaluated <- stringi::stri_unescape_unicode(emoji_unicode)
  # names(emoji_data)
  vector_of_emoji_names_and_characters <- unlist(
    lapply(emoji_data, function(x) {
      x$char
    })
  )
  name_of_emoji <- attr(
    which(vector_of_emoji_names_and_characters == emoji_evaluated)[1],
    "names"
  )
  name_of_emoji
}
get_name_from_emoji("\\U0001f917")
# [1] "hugs"
get_name_from_emoji("🤗") # An attempt actually pasting the hugs emoji in also works.
# [1] "hugs"
This question is related to the utf8 package for R. I have a weird problem: I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output of utf8_encode as an object in R, the nice escaped representation that my dictionary can match disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
This works great: it gives output like \U0001f595 that can be matched against dictionary entries when it prints in the console.
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis,
                      from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View, I don't see \U0001f595; I see actual emojis. I think this is why FindReplace isn't working: once the result is saved to an object, it is represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can match and replace things, but I don't want to do that for all ~2,000 or so emojis... I've tried reading as much as I can about UTF-8, but I can't understand why this is happening. I'm stumped! :P
It looks like in the above you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like:
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
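For example, a hedged sketch of that search step, reusing the emojimovie$test and emojis$Name columns from the question (and assuming codes lines up with the rows of emojis):
for (i in seq_along(codes)) {
  emojimovie$test <- gsub(codes[i], emojis$Name[i], emojimovie$test, fixed = TRUE)
}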
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
The utf8::utf8_encode function should really only be used if you have UTF-8 text and are trying to print it to the screen.
I am learning how to scrape the web with the XML and RCurl packages. All goes well except for one thing: special characters like ö or č are read into R differently. For instance, í is read in as ÃÂ. I assume the latter is some sort of HTML encoding of the former.
I have been looking for a way to convert these characters but have not found one. I am sure other people have stumbled upon this problem as well, and I suspect there must be some function to convert these characters. Does anyone know the solution? Thanks in advance.
Here is an example of the code; sorry I did not provide it earlier.
library(XML)
url <- 'http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles'
tables <- readHTMLTable(url)
Sec <- tables[[6]]
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
enc2utf8(pl1R1) # does not seem to work
Try parsing the page first while specifying the encoding, then reading the table, as described here: readHTMLTable and UTF-8 encoding.
An example might be:
library(XML)
url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
doc <- htmlParse(url, encoding = "UTF-8") #this will preserve characters
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
Sec <- tables[[6]]
#not sure what you're trying to do here though
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]