I was wondering how to do encoding and decoding in R. In Python, we can use ord('a') and chr(97) to transform a letter to a number or a number to a letter. Do you know any similar functions in R? Thank you!
For example, in Python:
>>> ord("a")
97
>>> ord("A")
65
>>> chr(97)
'a'
>>> chr(90)
'Z'
FYI:
ord(c) in Python
Given a string of length one, return an integer representing the Unicode code point of the character when the argument is a unicode object, or the value of the byte when the argument is an 8-bit string. For example, ord('a') returns the integer 97, ord(u'\u2020') returns 8224. This is the inverse of chr() for 8-bit strings and of unichr() for unicode objects. If a unicode argument is given and Python was built with UCS2 Unicode, then the character’s code point must be in the range [0..65535] inclusive; otherwise the string length is two, and a TypeError will be raised.
chr(i) in Python
Return a string of one character whose ASCII code is the integer i. For example, chr(97) returns the string 'a'. This is the inverse of ord(). The argument must be in the range [0..255], inclusive; ValueError will be raised if i is outside that range. See also unichr().
You're looking for utf8ToInt and intToUtf8:
utf8ToInt("a")
[1] 97
intToUtf8(97)
[1] "a"
The context
In the context of R, I'm aware that stringi::stri_unescape_unicode() could be used for converting a Unicode code to its corresponding character.
For example, the Unicode code points for á (LATIN SMALL LETTER A WITH ACUTE) and 好 are U+00E1 and U+597D, respectively. This means that I can insert those characters by executing the following.
library(stringi)
stringi::stri_unescape_unicode("\\u00E1")
stringi::stri_unescape_unicode("\\u597D")
[1] "á"
[1] "好"
I'm also aware that characters in the following ranges are for private use. The following quote was retrieved from this glossary (archive) on https://unicode.org.
Private-Use Code Point. Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. (See definition D49 in Section 3.5, Properties.) These code points are designated in the Unicode Standard for private use.
As you can read in the quote, there are three ranges. The following lists those characters that are the limits of those ranges.
First range: (U+E000)
First range: (U+F8FF)
Second range: (U+F0000)
Second range: (U+FFFFD)
Third range: (U+100000)
Third range: (U+10FFFD)
The problem
When I try to print the characters in the list above that belong to the first range (i.e. (U+E000) and (U+F8FF)), there's no problem.
stringi::stri_unescape_unicode("\\ue000")
stringi::stri_unescape_unicode("\\uf8ff")
[1] ""
[1] ""
However, when I try to print the characters shown in the list above that belong to the second range (i.e. (U+F0000) and (U+FFFFD)), R doesn't return those characters.
stringi::stri_unescape_unicode("\\uf0000")
stringi::stri_unescape_unicode("\\uffffd")
[1] "0"
[1] "\uffffd"
Similarly, the following doesn't print the characters shown in the list above that belong to the third range (i.e. (U+100000) and (U+10FFFD)).
stringi::stri_unescape_unicode("\\u100000")
stringi::stri_unescape_unicode("\\u10fffd")
[1] "က00"
[1] "ჿfd"
The question
Why isn't stringi::stri_unescape_unicode() able to display characters that belong to the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD?
Is there any function in R that is able to return those characters?
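One likely explanation, offered as an assumption and not something stated in the post: in the escape syntax that stri_unescape_unicode() appears to follow, \u is read with exactly four hexadecimal digits, and code points above U+FFFF need the eight-digit \U form. Python's unicode_escape codec uses the same convention, so the minimal sketch below (in Python, with a hypothetical helper named unescape) reproduces the pattern in the outputs above.

```python
def unescape(text):
    """Interpret backslash escapes the way Python string literals do."""
    return text.encode("ascii").decode("unicode_escape")

# \u takes exactly four hex digits, so the trailing "00" stays as plain text.
short_form = unescape(r"\u100000")
# \U takes exactly eight hex digits and can reach the supplementary planes.
long_form = unescape(r"\U00100000")

print([f"U+{ord(ch):04X}" for ch in short_form])
print([f"U+{ord(ch):04X}" for ch in long_form])
```

If the same rule holds in stringi, stri_unescape_unicode("\\U00100000") would be the call to try; base R's intToUtf8(0x100000) builds the character directly from its code point.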
I have a simple character string:
y <- "Location 433900E 387200N, Lat 53.381 Lon -1.490, 131 metres amsl"
When I perform regex capture on it:
stringr::str_extract(r'Lat(.*?)\,', y)
I get this error:
>Error: malformed raw string literal at line 1
Why?
With R's raw strings (introduced in version 4.0.0), you need to use either ( or [ or { with the quotes, e.g.,
r'{Lat(.*?)\,}'
This is documented at ?Quotes (and in the release notes):
Raw character constants are also available using a syntax similar to the one used in C++: r"(...)" with ... any character sequence, except that it must not contain the closing sequence )". The delimiter pairs [] and {} can also be used, and R can be used in place of r.
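For comparison only, here is the same capture sketched in Python, where a raw string literal needs no extra delimiter inside the quotes. This is a different language from the question, and the variable names match and captured are illustrative.

```python
import re

y = "Location 433900E 387200N, Lat 53.381 Lon -1.490, 131 metres amsl"

# Lazy capture of everything between "Lat" and the next comma.
match = re.search(r"Lat(.*?),", y)
captured = match.group(1).strip()
print(captured)
```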
How to test whether a single Unicode character is a valid variable name. The manual says:
Variable names must begin with a letter (A-Z or a-z), underscore, or a subset of Unicode code points greater than 00A0; in particular, Unicode character categories Lu/Ll/Lt/Lm/Lo/Nl (letters), Sc/So (currency and other symbols), and a few other letter-like characters (e.g. a subset of the Sm math symbols) are allowed.
Is there a function which tests a character to see if it's a valid variable name? isvalid() looks like it checks to see whether a character is a valid character, which might not be the same?
You can use Base.isidentifier for that:
julia> Base.isidentifier("a")
true
julia> Base.isidentifier("a′")
true
julia> Base.isidentifier("1a′")
false
julia> Base.isidentifier("â")
true
Consider the following code in built-in-library-tests.robot:
*** Test Cases ***
Use "Convert To Hex"
    ${hex_value} =    Convert To Hex    255    base=10    prefix=0x    # Result is 0xFF
    # Question: How does the following statement work step by step?
    Should Be True    ${hex_value}==${0xFF}    # Is ${0xFF} considered by Robot a string or an integer value in base 16?
    # To answer my own question, here is a hypothesis:
    # For Python to execute the expression:
    # Should Be True    a_python_expression_in_a_string_without_quotes    # i.e. 0xFF==255
    # To reach that target, I think of a 2-step solution:
    # Step 1: When a variable is used in the expression using the normal ${hex_value} syntax, its value is replaced before the expression is evaluated.
    # This means that the value used in the expression will be the string representation of the variable value, not the variable value itself.
    Should Be True    0xFF==${0xFF}
    # Step 2: When the hexadecimal value 0xFF is given in ${} decoration, Robot converts the value to its
    # integer representation 255 and puts the string representation of 255 into the expression.
    Should Be True    0xFF==255
The test above passes with all its steps. I want to check with the community: is my 2-step hypothesis correct or not? Does Robot go through exactly these steps before evaluating the final expression 0xFF==255 in Python?
Robot receives the expression as the string ${hex_value}==${0xFF}. It then performs variable substitution, yielding the string 0xFF==255. This string is then passed to Python's eval function.
The reason for the right hand side being 255 is described in the user guide:
It is possible to create integers also from binary, octal, and hexadecimal values using 0b, 0o and 0x prefixes, respectively. The syntax is case insensitive.
${0xFF} gets replaced with 255, and ${hex_value} gets substituted with whatever is in that variable. In this case, that variable contains the four-character string 0xFF.
Thus, ${hex_value}==${0xFF} gets converted to 0xFF==255, and that gets passed to eval as a string.
In other words, it's exactly the same as if you had typed eval("0xFF==255") at a Python interactive prompt.
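The two stages described above can be imitated in plain Python. This is only a rough model under stated assumptions: the dictionary variables and the function substitute are hypothetical stand-ins, not Robot Framework's actual implementation.

```python
# Hypothetical stand-in for Robot's variable store: ${hex_value} holds a string.
variables = {"${hex_value}": "0xFF"}

def substitute(expression):
    """Very rough model of variable substitution, not Robot's real code."""
    for name, value in variables.items():
        expression = expression.replace(name, str(value))
    # ${0xFF} is a number variable: model it as the integer 255, written as text.
    expression = expression.replace("${0xFF}", str(int("0xFF", 16)))
    return expression

expression = substitute("${hex_value}==${0xFF}")
print(expression)

# The substituted string is what finally gets evaluated as a Python expression.
result = eval(expression)
print(result)
```

The point of the sketch is that Python only ever sees the already substituted text, so 0xFF on the left is parsed by Python itself as a hexadecimal integer literal.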
In Python, the len() function gives the exact number of letters in a single word.
But when I have a string with multiple words, it doesn't give the correct number of letters, because it also counts the spaces between the words.
What would be the correct way to use len() to count the letters in a string with multiple words?
Remove all spaces before counting length:
string = string.replace(' ', '')
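Put together, a minimal sketch of that approach (the sample text is made up for illustration):

```python
string = "hello big world"

# Drop the space characters, then count what is left.
letters_only = string.replace(" ", "")
count = len(letters_only)
print(count)
```

Note that this removes only space characters; digits, punctuation, tabs, and newlines would still be counted.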
You can use len([c for c in address if c.isalpha()]). Here I'm assuming that your string is named address. Here is the definition of isalpha from the Python 3.4 docs:
Return true if all characters in the string are alphabetic and there
is at least one character, false otherwise. Alphabetic characters are
those characters defined in the Unicode character database as
“Letter”, i.e., those with general category property being one of
“Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the
“Alphabetic” property defined in the Unicode Standard
We perform this test for each one-character string in the address. Since Python 3 strings are Unicode, this test would also count letters from other alphabets such as Greek, Arabic, or Hebrew. I don't know if that's what you want, but if you only have letters from the English alphabet, it will work fine.
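A small sketch of the difference, using a made-up address that mixes Greek and English letters. The second count, which adds str.isascii() (available from Python 3.7) to keep only ASCII letters, is an addition for illustration and not part of the answer above.

```python
address = "12 αβγ Main St."

# Counts every Unicode letter: three Greek letters plus "Main" and "St".
letter_count = len([c for c in address if c.isalpha()])

# Restricts the count to ASCII letters only.
ascii_count = len([c for c in address if c.isascii() and c.isalpha()])

print(letter_count, ascii_count)
```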
You can use a regular expression:
import re
s = r"14th Street 456 */\&^%$-+##!()[]{};.,:"  # raw string, so \& is not treated as an escape
# Remove anything other than letters
n = re.sub(r'[^a-zA-Z]', "", s)
print(n)
print("length :" , len(n))
output :
thStreet
length : 8