R programming: encoding gibberish ׳‘׳“׳™׳§׳•׳× to Hebrew - r

I have some Hebrew strings in my code.
For some reason all the Hebrew is shown in the following format ׳§׳‘׳•׳¦׳×.
I have this line at the top of every file:
Sys.setlocale("LC_ALL", "Hebrew")
This is the output of the command Sys.getlocale("LC_ALL"):
"LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
Example:
hebrew <- c(
"סוג תנועה",
"מס עוסק נגדי",
"תאריך אסמכתא",
"קבוצת אסמכתא",
"מס אסמכתא",
"סכום מעמ",
"סימן",
"סכום לפני מעמ",
" עתידי",
"הערה",
"בדיקות",
"סכום.מעמ",
"סכום.לפני.מעמ",
"הערות"
)
print(hebrew)
Result:
׳¡׳•׳’ ׳×׳ ׳•׳¢׳”" "[1] "\u009e׳¡ ׳¢׳•׳¡׳§ ׳ ׳’׳“׳™" "׳×׳\u0090׳¨׳™׳\u009a
׳\u0090׳¡׳\u009e׳›׳×׳\u0090" "׳§׳‘׳•׳¦׳× ׳\u0090׳¡׳\u009e׳›׳×׳\u0090" "׳\u009e׳¡
׳\u0090׳¡׳\u009e׳›׳×׳\u0090" "׳¡׳›׳•׳\u009d ׳\u009e׳¢׳\u009e"
[7] "׳¡׳™׳\u009e׳\u009f" "׳¡׳›׳•׳\u009d ׳\u009c׳₪׳ ׳™ ׳\u009e׳¢׳\u009e" "
׳¢׳×׳™׳“׳™" "׳”׳¢׳¨׳”" "׳‘׳“׳™׳§׳•׳× "׳¡׳›׳•׳\u009

Move the function to the main file
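The garbled output is consistent with UTF-8 bytes being interpreted as Windows-1255: every Hebrew letter in UTF-8 starts with the byte 0xD7, which Windows-1255 displays as the geresh mark that precedes each character in the result above. A minimal sketch of one possible workaround, assuming the script files are saved as UTF-8: write the literals as \u escapes, which R marks as UTF-8 regardless of the session locale. Only two of the original strings are shown, and the escapes are a transcription of them.

```r
# Assumption: the source file is UTF-8 but R reads it as CP1255.
# \u escapes do not depend on the file's encoding.
hebrew <- c(
  "\u05e1\u05d5\u05d2 \u05ea\u05e0\u05d5\u05e2\u05d4",  # first string of the example
  "\u05d4\u05e2\u05e8\u05d4"                            # tenth string of the example
)

# Both strings carry an explicit UTF-8 mark
enc <- Encoding(hebrew)
print(enc)

# Character counts are computed on code points, not bytes
n_chars <- nchar(hebrew)
print(n_chars)
```

Another option worth trying is source("file.R", encoding = "UTF-8"), so that R does not assume the file is in the native encoding of the session.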

Related

How to get rid of embedded NUL on a raw vector?

I'm scraping an ASP.NET website.
This will return a raw element (reporte_nacido) which is a csv file (tab as delimiter):
reporte_nacido = postForm('https://xxxxx/WebSiteNDE/BirthsPages/FiltrosExcelNac.aspx',
.params = params,
curl = curl,
.opts = RCurl::curlOptions(ssl.verifypeer=FALSE, verbose=T))
If I open the file in a text viewer, it looks like this:
Now I'm trying to load that raw element in R, but I get the following error. I suspect the file downloaded from the server is somehow corrupted and R is being picky about it:
rawToChar(as.vector(unlist(reporte_nacido)))
Error in rawToChar(as.vector(unlist(reporte_nacido))) :
embedded nul in string: '\xfe\xff\0N\0\xda\0M\0E\0R\0O\0 \0C\0E\0R\0T\0I\0F\0I\0C\0A\0D\0O\0\t\0D\0E\0P\0A\0R\0T\0A\0M\0E\0N\0T\0O\0\t\0M\0U\0N\0I\0C\0I\0P\0I\0O\0\t\0A\0R\0E\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0I\0N\0S\0P\0E\0C\0C\0I\0O\0N\0 \0C\0O\0R\0R\0E\0G\0I\0M\0I\0E\0N\0T\0O\0 \0O\0 \0C\0A\0S\0E\0R\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0S\0I\0T\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0C\0\xd3\0D\0I\0G\0O\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0N\0O\0M\0B\0R\0E\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0S\0E\0X\0O\0\t\0P\0E\0S\0O\0 \0(\0G\0r\0a\0m\0o\0s\0)\0\t\0T\0A\0L\0L\0A\0 \0(\0C\0e\0n\0t\0\xed\0m\0e\0t\0r\0o\0s\0)\0\t\0F\0E\0C\0H\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0H\0O\0R\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0P\0A\0R\0T\0O\0 \0A\0T\0E\0N\0D\0I\0D\0O\0 \0P\0O\0R\0\t\0T\0I\0E\0M\0P\0O\0 \0D\0E\0 \0G\0E\0S\0T\0A\0C\0I\0\xd3\0N\0\t\0N\0\xda\0M\0E\0R\0O\0 \0C\0O\0N\0S\0U\0L\0T\0A\0S\0 \0P\0R\0E\0N\0A\0T\0A\0L\0E\0S\0\t\0T\0I\0P\0O\0 \0P\0A
The raw vector you are getting is text encoded as UTF-16. You can convert it like this:
library(stringi)
raw_vec <- as.vector(unlist(reporte_nacido))
decoded <- stri_encode(raw_vec, "UTF16")
decoded
#> [1] "NÚMERO CERTIFICADO\tDEPARTAMENTO\tMUNICIPIO\tAREA NACIMIENTO\tINSPECCION CORREGIMIENTO O CASERIO NACIMIENTO\tSITIO NACIMIENTO\tCÓDIGO INSTITUCIÓN\tNOMBRE INSTITUCIÓN\tSEXO\tPESO (Gramos)\tTALLA (Centímetros)\tFECHA NACIMIENTO\tHORA NACIMIENTO\tPARTO ATENDIDO POR\tTIEMPO DE GESTACIÓN\tNÚMERO CONSULTAS PRENATALES\tTIPO PA"
It appears to be tab-separated rather than csv format, so you probably want to read it like this:
read.table(text = decoded, sep = "\t", header = TRUE)
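If stringi is not available, base R's iconv() also accepts a list of raw vectors. The following is a minimal sketch on made-up data (sample_text is hypothetical, not the server response). It uses "UTF-16BE" because the \xfe\xff at the start of the error message is a big-endian byte order mark; the sample here has no such mark.

```r
# Hypothetical stand-in for the downloaded payload: tab-separated text as UTF-16BE
sample_text <- "ID\tNAME\n1\tANA\n"
raw_vec <- iconv(sample_text, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]

# Each ASCII character is paired with a 0x00 byte, which rawToChar() rejects.
# iconv() can decode a raw vector directly when it is wrapped in a list.
decoded <- iconv(list(raw_vec), from = "UTF-16BE", to = "UTF-8")

# Parse the decoded text as a tab-separated table
df <- read.delim(text = decoded, stringsAsFactors = FALSE)
print(df)
```

For a real response that begins with a byte order mark, either drop the first two bytes before decoding or try from = "UTF-16"; how the mark is handled depends on the platform's iconv implementation.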

Insert characters when a string changes its case R

I would like to insert characters at the places where a string changes its case. I tried inserting a '\n' after a fixed number of characters and then a ' ', as I can't figure out how to detect the case change:
s <-c("FloridaIslandE7", "FloridaIslandE9", "Meta")
gsub('^(.{7})(.{6})(.*)$', '\\1\\\n\\2 \\3', s )
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
This works because the positions are fixed but I would like to know how to do it for the general case.
Surely there's a less convoluted regex for this, but you could try:
gsub('([A-Z][0-9])', ' \\1', gsub('([a-z])([A-Z])', '\\1\n\\2', s))
Output:
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
Here is an option using str_replace_all() from the stringr package:
str_replace_all(s, "(?<=[a-z])(?=[A-Z])", "\n")
#[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
If you really want to insert \n, try this:
gsub("([a-z])([A-Z])", "\\1\\\n\\2", s)
[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
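The same lookaround idea can be sketched in base R by passing perl = TRUE, which avoids the stringr dependency. This is only an illustration on the vector from the question; it inserts a newline at every lowercase-to-uppercase boundary.

```r
s <- c("FloridaIslandE7", "FloridaIslandE9", "Meta")

# Zero-width match: preceded by a lowercase letter, followed by an uppercase one
out <- gsub("(?<=[a-z])(?=[A-Z])", "\n", s, perl = TRUE)
print(out)
```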

How can I use R Regular Expressions to catch a Hebrew word?

I've been trying to catch the word
עונה
plus the subsequent number after it in a string such as
כל הילדים אוכלים, עונה 2 , פרק 8-לזניית ירקות וסלמון בדבש
Demonstrating it on Regex101.com was straightforward enough, with עונה(\s+\d+|\d+), but with R I came up empty.
str<-"כל הילדים אוכלים, עונה 2 , פרק 8-לזניית ירקות וסלמון בדבש"
exp<-"עונה(\\s+\\d+|\\d+)"
str_extract_all(str,exp)
Output:
[[1]]
character(0)
You can match Hebrew letters with the Unicode range of the Hebrew block:
[\u0590-\u05FF]*
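An empty result like the one in the question can occur when the Hebrew literal in the script and the string being searched end up in different encodings. A minimal sketch, assuming that is the cause: write the word as \u escapes so both sides are UTF-8, and match with base R. The input below is a shortened version of the string from the question, and \\s*\\d+ is an equivalent, simpler form of (\\s+\\d+|\\d+).

```r
# Shortened version of the example sentence, written with \u escapes
x <- "\u05e2\u05d5\u05e0\u05d4 2 , \u05e4\u05e8\u05e7 8"

# The target word followed by optional whitespace and a number
pattern <- "\u05e2\u05d5\u05e0\u05d4\\s*\\d+"

# regexpr() finds the first match; regmatches() extracts it
hit <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(hit)
```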

Replacing a special character does not work with gsub

I have a table with many strings that contain some weird characters that I'd like to replace with the "original" ones. For example, ä became Ã¤ and ö became Ã¶, so I replace each Ã¶ with an ö in the text. That works; however, ß became Ã<U+009F> and I am unable to replace it...
# Works just fine:
gsub('Ã¶', 'REPLACED', "Testing string Ã¶")
# this does not work
gsub("Ã<U+009F>", "REPLACED", "Testing string Ã<U+009F> ")
# this does not work as well...
gsub("â<U+0080><U+0093>", "REPLACED", "Testing string â<U+0080><U+0093> ")
How do I tell R to replace these parts with some letter I want to insert?
The pattern contains a regex metacharacter (+, which means "one or more"), so to match it literally either escape it (as @boski mentioned in their solution) or use fixed = TRUE:
sub("Ã<U+009F>", "REPLACED", "Testing string Ã<U+009F> ", fixed = TRUE)
#[1] "Testing string REPLACED "
You have to escape the + symbol, as it is a regex metacharacter.
> gsub("Ã<U\\+009F>", "REPLACED", "Testing string Ã<U+009F> ")
[1] "Testing string REPLACED "
> gsub("â<U\\+0080><U\\+0093>", "REPLACED", "Testing string â<U+0080><U+0093> ")
[1] "Testing string REPLACED "
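Instead of replacing each garbled sequence one by one, it may be possible to undo the damage in one step. The sketch below assumes the strings are UTF-8 text that was misread as Latin-1, which the pattern "ß became Ã<U+009F>" suggests but does not prove. fix_mojibake is a hypothetical helper name, and the sample words are made up.

```r
# Hypothetical helper: turn the characters back into bytes, then mark them as UTF-8
fix_mojibake <- function(x) {
  bytes <- iconv(x, from = "UTF-8", to = "ISO-8859-1")
  Encoding(bytes) <- "UTF-8"
  bytes
}

# Made-up samples: "Straße" and "schön" after being misread as Latin-1
garbled <- c("Stra\u00c3\u009fe", "sch\u00c3\u00b6n")
fixed <- fix_mojibake(garbled)
print(fixed)
```

If the assumption does not hold for some strings, iconv() returns NA for them, so it is worth checking the result before overwriting the original column.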

How to make gsub() work on entire column?

I am trying to use gsub to replace the hex characters I have with the letters of the Hebrew alphabet,
using the following code:
name<-gsub("\u0080","א",name)
name<-gsub("\u0081","ב",name)
name<-gsub("\u0082","ג",name)
name<-gsub("\u0083","ד",name)
name<-gsub("\u0084","ה",name)
name<-gsub("\u0085","ו",name)
name<-gsub("\u0086","ז",name)
name<-gsub("\u0087","ח",name)
name<-gsub("\u0088","ט",name)
name<-gsub("\u0089","י",name)
name<-gsub("\u008a","ך",name)
name<-gsub("\u008b","כ",name)
name<-gsub("\u008c","ל",name)
name<-gsub("\u008d","ם",name)
name<-gsub("\u008e","מ",name)
name<-gsub("\u008f","ן",name)
name<-gsub("\u0090","נ",name)
name<-gsub("\u0091","ס",name)
name<-gsub("\u0092","ע",name)
name<-gsub("\u0093","ף",name)
name<-gsub("\u0094","פ",name)
name<-gsub("\u0095","ץ",name)
name<-gsub("\u0096","צ",name)
name<-gsub("\u0097","ק",name)
name<-gsub("\u0098","ר",name)
name<-gsub("\u0099","ש",name)
name<-gsub("\u009a","ת",name)
I have a variable called 'name' which contains the hex characters (for example):
[1] "-"
[2] "\u0083 \u0087\u0082\u0080 \u008f\u008c\u0098\u0080 \u0081\u0089\u0081\u0080"
[3] "-"
[4] "\u0084 \u0087\u0082\u0080 \u008f\u008c\u0098\u0080 \u0081\u0089\u0081\u0080"
When I insert the values into a vector manually, like this:
name<- c("-" ,
"\u0083 \u0087\u0082\u0080 \u008f\u008c\u0098\u0080 \u0081\u0089\u0081\u0080",
"-" ,
"\u0084 \u0087\u0082\u0080 \u008f\u008c\u0098\u0080 \u0081\u0089\u0081\u0080")
and run my script, it works. But when I try to run it over the whole database, using the following code to fill the 'name' variable:
cond<-list_kind %in% c("02")
name<-ifelse(cond,substr(data_set$data_from_row,25,39),"-")
(because I need only the names in list kind 2),
it just prints the names as they were, as hex.
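The mapping 0x80 to 0x9A onto the Hebrew alphabet matches code page 862 (DOS Hebrew). One possible explanation, which the question does not confirm, is that the column read from the file holds raw CP862 bytes rather than the Unicode code points U+0080 to U+009A that the gsub patterns look for. A minimal sketch under that assumption, using made-up bytes in place of data_set$data_from_row: iconv() is vectorized, so it converts the whole column at once.

```r
# Hypothetical stand-in for the column: "-" plus a name stored as CP862 bytes
raw_names <- c(
  "-",
  rawToChar(as.raw(c(0x83, 0x20, 0x87, 0x82, 0x80)))
)

# Decode every element from CP862 to UTF-8 in a single call
decoded <- iconv(raw_names, from = "CP862", to = "UTF-8")
print(decoded)
```

The name "CP862" is accepted by common iconv implementations, but iconvlist() shows what a given installation supports.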
