Entering and viewing Cyrillic strings in R

How to handle Cyrillic strings in R?
Sys.setlocale("LC_ALL","Polish")
dataset <- data.frame( ProductName = c('ąęćśżźół','тест') )
#Encoding(dataset) <- "UTF-8" #this line does not change anything
View(dataset)
The code above displays the Cyrillic value in the viewer as a sequence of <U+xxxx> escapes.
But I would like to see what I typed (тест) instead of the <U+number> sequence. Is there any way to do that?

This works for me and I see the Cyrillic text in my data frame.
I think you should check what your locale is (with sessionInfo()) and whether it supports UTF-8.
You could also try changing the encoding of the column:
Encoding(dataset$ProductName) <- "UTF-8"
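If the current locale cannot represent Cyrillic, it may also help to switch to one that can and to mark the strings as UTF-8. A minimal sketch, assuming a Windows machine where the "Russian" locale name is available (on R 4.2+ for Windows, which uses UTF-8 natively, this juggling is largely unnecessary):
# check what the current locale is, then switch the character-type category
Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "Russian")

# rebuild the data frame and declare the text column as UTF-8
dataset <- data.frame(ProductName = c('ąęćśżźół', 'тест'), stringsAsFactors = FALSE)
Encoding(dataset$ProductName) <- "UTF-8"
View(dataset)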

Related

Removing tags after reading PDF in R

I am reading a PDF in Hebrew into R using textreadr::read_document and getting tags which I can't remove, such as <U+202B>. Looking at the data in the console, the tags are absent; if I try to remove them using gsub or stringr::str_replace, nothing happens. However, they are clearly there (see image), and worse, if I export to Excel they are exported as part of the data. What can I do?
Could you try something like this? This is code I used to replace non-ASCII characters.
library(textclean)
## CA_videos_df is the data frame; title is the column holding the text
Encoding(CA_videos_df$title) <- "latin1"
## replace titles containing non-ASCII characters with NA
name <- replace_non_ascii(CA_videos_df$title, replacement = NA, remove.nonconverted = TRUE)
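If the goal is only to strip the invisible directional marks rather than every non-ASCII character, note that <U+202B> is how R prints the right-to-left embedding character; the actual data contains the single code point, which is why gsub on the literal text "<U+202B>" finds nothing. A minimal sketch, with txt standing in for the character vector returned by read_document:
# remove the Unicode bidirectional control characters U+202A to U+202E
txt_clean <- gsub("[\u202a-\u202e]", "", txt)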

Read .sql into R with Spanish characters (á, é, í, ó, ú, ñ, etc)

So, I've been struggling with this for a while now and can't seem to google my way out of it. I'm trying to read a .sql file into R; I always do that to avoid putting 100+ lines of SQL in my R scripts. I usually do this:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- read_file("path/to/script.sql")
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)
However, this time my SQL script has some Spanish characters in it. Say something like this:
select tree_id, tree
from forest.trees
where species = 'árbol'
When I read this script into R and run the query it just doesn't return anything, but if I copy and paste the SQL into an R string it works! So it seems the problem is in the line where I read the script into R.
I tried changing the string's encoding in a couple of ways:
# none of these work
query <- read_file("path/to/script.sql")
Encoding(query) <- "latin1"
query <- readLines("path/to/script.sql", encoding = "latin1")
query <- paste0(query, collapse = " ")
Unfortunately I don't have a public database to offer to anyone reading this. I'm connecting to a PostgreSQL 11 database.
--- UPDATE ---
I'm on a Windows 10 machine with a US locale.
When I use the read_file function, the contents of query look fine and the Spanish characters print as they should, but when I pass it to dbGetQuery it doesn't fetch anything.
I tried forcing the "latin1" encoding because I read online that this often fixes Spanish characters in R. The Spanish characters then printed out wrong, so I didn't expect it to work, and it didn't.
The character values in my database have 'utf-8' encoding.
Just to be completely clear, none of my attempts to read the .sql script have worked; however, this does work:
library(tidyverse)
library(DBI)
con <- dbConnect(<CONNECTION ARGUMENTS>)
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
# df actually has results
df <- as_tibble(dbGetQuery(con, query))
dbDisconnect(con)
The encoding argument of readLines() only declares how the resulting strings are marked; it does not re-encode the file's contents. To have R convert the text as it reads it, pass the encoding to the file() connection instead:
filetext <- readLines(file("path/to/script.sql", encoding = "latin1"))
See this answer for more details: R: can't read unicode text files even when specifying the encoding.
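Putting that together with the original workflow, a minimal sketch (the encoding argument should match whatever the file is actually saved in; "latin1" follows the suggestion above):
library(DBI)

con <- dbConnect(<CONNECTION ARGUMENTS>)

# read the script through a connection that declares its encoding,
# then collapse the lines back into a single query string
sql_lines <- readLines(file("path/to/script.sql", encoding = "latin1"))
query <- paste(sql_lines, collapse = "\n")

df <- dbGetQuery(con, query)
dbDisconnect(con)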
So after some time thinking about it, I wondered why the solution proposed by MrFlick didn't work. I checked the encoding of the file created by this chunk:
query <- "select tree_id, tree from forest.trees where species = 'árbol'"
write_lines(query, "test.sql")
After checking what encoding test.sql had, it turned out to be ANSI, which didn't seem right. Still, I manually changed my original script.sql to ANSI encoding, and after that it worked fine.
This solution, however, didn't work when I cloned my repo in an Ubuntu environment; on Ubuntu there was no problem with the original UTF-8 encoding.
Hope this helps anyone dealing with this on Windows.
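An alternative that avoids re-saving the script as ANSI is to keep the file in UTF-8 and convert the query string to the native Windows encoding just before sending it. This is only a sketch; whether it is needed depends on the database driver, and R 4.2+ on Windows uses UTF-8 natively, which sidesteps the problem:
# read the script as UTF-8, then convert the string to the native encoding
query <- readr::read_file("path/to/script.sql")
df <- DBI::dbGetQuery(con, enc2native(query))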

Wrong encoding while loading the JSON data to R

I'm trying to build a word corpus based on my data frame, which was loaded from a JSON file. R doesn't read special characters like 'ř' correctly (in the original JSON data the character is visible and the encoding is UTF-8). I tried setting the encoding in R, both via the source editor and with Encoding(x), but neither works.
I would like to convert such characters to plain Latin letters, e.g. ř --> r, but using the gsub function completely destroys my data frame.
Do you have any ideas how to solve this?
# The JSON file contains a name with "ř"; after loading the data I get <f8> even though I set the encoding of the source file
library(rjson) # fromJSON(file = ...) is the rjson signature
data5 <- fromJSON(file = "Test1801.json")
data6 <- as.data.frame(data5)
data6 <- tolower(data6) # this and the gsub() below change the whole data frame to character values "1"
data6 <- gsub("ř", "r", data6)
Welcome to SO. Please bear in mind that you are expected to provide a reproducible example so we can work on your problem.
I understand you're looking for a way to change the symbols to Latin letters. That can be accomplished with stringi::stri_trans_general:
require(stringi) # load library
a <- "ř" # assign your weird character to variable
newA <- stri_trans_general(a, "latin-ascii") # convert to latin
newA
# [1] "r"
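Note that tolower() and gsub() should be applied to individual character columns rather than to the whole data frame, which is what destroyed data6 in the question. A minimal sketch, using a hypothetical character column data6$name:
library(stringi)

# transliterate and lower-case one column at a time
data6$name <- stri_trans_general(data6$name, "latin-ascii")
data6$name <- tolower(data6$name)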

Get R to keep UTF-8 Codepoint representation

This question is related to the utf8 package for R. I have a weird problem in which I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output of utf8_encode as an object in R, the nice escaped representation that my dictionary can match against disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
The following works great: it gives output like \U0001f595 that can be matched with dictionary entries when it prints in the console.
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View I don't see the \U0001f595, I see actual emojis. I think this is why the FindReplace function isn't working -- when it gets saved to an object it just gets represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis.... I've tried reading as much as I can about utf-8, but I can't understand why this is happening. I'm stumped! :P
It looks like in the code above you are trying to convert the UTF-8 emoji to a text representation. I would recommend going the other direction. Something like:
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
utf8::utf8_encode should really only be used if you have UTF-8 text and are trying to print it to the screen.
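As an alternative to going through the parser, the code points can also be built directly with intToUtf8(); the to_utf8 helper below is hypothetical, not part of the original answer:
# build each emoji string directly from its hex code points
to_utf8 <- function(codepoint) {
  hex <- strsplit(gsub("U\\+", "", codepoint), "[[:space:]]+")[[1]]
  intToUtf8(strtoi(hex, base = 16L))
}
codes <- vapply(emojis$Codepoint, to_utf8, character(1))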

How to convert special symbols in web scraping with R?

I am learning how to scrape the web with the XML and RCurl packages. All goes well except for one thing: special characters like ö or č are read into R incorrectly. For instance, í comes through as a mangled character sequence, which I assumed was some sort of HTML coding for the original character.
I have been looking for a way to convert these characters but I have not found it. I am sure other people have stumbled upon this problem as well, and I suspect there must be some sort of function to convert these characters. Does anyone know the solution? Thanks in advance.
Here is an example of the code, sorry I did not provide it earlier.
library(XML)
url <- 'http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles'
tables <- readHTMLTable(url)
Sec <- tables[[6]]
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
enc2utf8(pl1R1) # does not seem to work
Try parsing it first while specifying the encoding, then reading the table, as here: readHTMLTable and UTF-8 encoding.
An example might be:
library(XML)
url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
doc <- htmlParse(url, encoding = "UTF-8") # this will preserve the characters
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
Sec <- tables[[6]]
#not sure what you're trying to do here though
pl1R1 <- unlist(strsplit(as.character(Sec[,2]), ' '))[seq(2,32, 4)]
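If switching packages is an option, a minimal alternative sketch with rvest, which respects the page's declared encoding when parsing:
library(rvest)

url <- "http://en.wikipedia.org/wiki/2000_Wimbledon_Championships_%E2%80%93_Men%27s_Singles"
page <- read_html(url)      # xml2/rvest pick up the page's UTF-8 declaration
tables <- html_table(page)  # list of data frames, one per <table>
Sec <- tables[[6]]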
