Single (accented) characters with `str.__len__(x) == 2` - python-3.6

To the best of my understanding, str.__len__(x) counts accented characters as two in Python 2 because of their byte representation, but as one in Python 3, although I couldn't find proper documentation on str.__len__ at python.org:
Python documentation on stdtypes
Python documentation on len
However, if I run the following on Google Colab, str.__len__(..) reports a length of 2:
import sys
test = u'ö'
print(type(test), len(test), sys.version)
Where is str.__len__ documented?

There are two ways to represent the symbol "ö" in Unicode. One is as U+00F6 LATIN SMALL LETTER O WITH DIAERESIS. The other is U+006F LATIN SMALL LETTER O followed by U+0308 COMBINING DIAERESIS. If you restrict your source files to ASCII these can be represented as "\u00f6" and "o\u0308" respectively.
In the first case, I get a length of 1. In the second case, I get a length of 2. (Tested with Python 3.7.2). I suspect your code is using the second representation.
This matches the documentation for the string type which notes that "Strings are immutable sequences of Unicode code points" (emphasis mine). A representation that consists of two code points would therefore have a length of 2.
You can use the unicodedata.normalize function to convert between the two forms. Using "NFC" for the form parameter will convert to the composed representation (length 1), using "NFD" will decompose it into a letter and a combining character (length 2).
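For example, a quick sketch showing both forms and the conversion in each direction (Python 3):
import unicodedata
composed = '\u00f6'      # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
decomposed = 'o\u0308'   # U+006F followed by U+0308 COMBINING DIAERESIS
print(len(composed))                                  # 1
print(len(decomposed))                                # 2
print(len(unicodedata.normalize('NFC', decomposed)))  # 1
print(len(unicodedata.normalize('NFD', composed)))    # 2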

Related

Why does comparing two strings with `>` not throw an error?

Why does this work in R? I would think it would throw an error, because you cannot actually compare whether one string is greater than another.
"Test" > "Test"
[1] FALSE
You can compare strings in R. There is a complete section in the help page (?Comparison) explaining how the comparison is performed:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
Character strings can be compared with different marked encodings (see Encoding): they are translated to UTF-8 before comparison.

Is it possible to write an Enigma encryption algorithm that can use all alphanumeric as input but does not output ambiguous characters?

This is about Enigma encryption; I'm guessing the number of rotors doesn't matter, but I'm using 3.
I am working with what's basically a coded version of the old mechanical Enigma-style encryption machines. The concept is rather old, but before I get too far into learning it, I was wondering: is it possible to encrypt using all of the characters 0-9, a-z and A-Z while the encrypted text itself uses only a subset of those characters? I'm trying to exclude a subset of characters (around 10 total) from the encrypted output, while still being able to get back to those characters if they were part of the input.
You can disambiguate by adding a 1-to-2-character mapping for the ambiguous symbols: O -> A1; 0 -> A2; likewise for the other ambiguous symbols; and A -> AA. This is basically just like escaping in strings: we usually can't put a newline inside a string literal, so we represent it as \n, and \ itself is represented as \\.
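A minimal Python sketch of that idea, with 'A' playing the role of the escape character (the helper names are just illustrative):
ENCODE = {'O': 'A1', '0': 'A2', 'A': 'AA'}   # extend with the other ambiguous symbols
DECODE = {v: k for k, v in ENCODE.items()}

def escape(text):
    return ''.join(ENCODE.get(ch, ch) for ch in text)

def unescape(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == 'A':                   # escape character: the next symbol disambiguates
            out.append(DECODE[text[i:i + 2]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

print(escape('B00K'))      # BA2A2K
print(unescape('BA2A2K'))  # B00K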
If you’re working with encrypted data (so the probabilities of all characters are uniformly distributed and characters cannot be predicted) then you can’t compress the ciphertext. If you can compress it, then you’ve noticed some kind of pattern in the text and partially broken the encryption.
If you want to reduce the ciphertext’s alphabet, then you must increase the length of the ciphertext, otherwise you’ve successfully compressed it.
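To see that trade-off concretely, here is a rough Python sketch: re-encoding the same random bytes (standing in for ciphertext) into progressively smaller alphabets produces progressively longer strings:
import os, base64, binascii
data = os.urandom(15)               # 15 random bytes standing in for ciphertext
print(len(base64.b64encode(data)))  # 20 characters, 64-symbol alphabet
print(len(base64.b32encode(data)))  # 24 characters, 32-symbol alphabet
print(len(binascii.hexlify(data)))  # 30 characters, 16-symbol alphabet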

Convert HEX to characters using bitwise operations

Say I've got this value xxx in hex 007800780078
How can I convert back the hex value to characters using bitwise operations?
Can I?
I suppose you could do it using "bitwise" operations, but it'd probably be a horrendous mess of code as well as being totally unnecessary since ILE RPG can do it easily using appropriate built-in functions.
The first thing to note is that you don't exactly have what's usually thought of as a "hex" value. That is, you're showing a hexadecimal representation of a value, but a basic "hex" conversion will not give a useful result. What you're showing seems to be the UCS-2 value for "xxx".
Here's a trivial example that shows a conversion of that hexadecimal string into a standard character value:
d ds
d charField 6 inz( x'007800780078' )
d UCSField1 3c overlay( charField )
d TargetField s 6
d Length s 10i 0
/free
Length = %len( %trim( UCSField1 ));
TargetField = %trim( %char( UCSField1 ));
*inlr = *on;
return;
/end-free
The code has a DS that includes two sub-fields. The first is a simple character field that declares six bytes of memory initialized to x'007800780078'. The second sub-field is declared as data type 'C' to indicate UCS-2, and it overlays the first sub-field. Because it's UCS-2, its size is given as "3" to allow for three characters. (Each character is 16-bits wide.)
The executable statements don't do much, just enough to let you test the converted values. Using debug, you should see that Length comes out to be (3) and TargetField becomes 'xxx'.
The %CHAR() built-in function can be used to convert from UCS-2 to the character encoding used by the program. To go in the opposite direction, use the %UCS2() built-in function.
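For comparison only (not RPG), the same conversion can be sketched in Python by interpreting those six bytes as UCS-2/UTF-16BE:
raw = bytes.fromhex('007800780078')   # the six bytes shown in the question
print(raw.decode('utf-16-be'))        # 'xxx' - each character is 16 bits wide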

How should I encode two integers into lowercase alphanumeric characters?

I have two integers from 0 to infinity (in practice probably less than 1 million, but don't want to have any limitation). I want to encode the two integers into a lowercase alphanumeric string (which may contain a dash, but shouldn't be just numbers). Also I want the strings to be somewhat random (i.e. don't want to always prefix every int with "a" for example). The most important requirement is that I need to be able to easily decode this alphanumeric string.
I would normally just use md5 hashing but it doesn't work for this case as I can't go back from md5 to the original integers. I also considered Base64 but it doesn't work because strings may include uppercase.
Is there a known hashing algorithm that satisfies these requirements?
If you're just looking to change the integer's base
Instead of base64 you can use base 16 (aka hexadecimal):
>>> hex(1234)[2:]
'4d2'
>>> int('4d2', 16)
1234
or base32:
>>> b32_encode(1234)
b'atja===='
>>> b32_decode(b'atja====')
1234
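(b32_encode and b32_decode above are not built-ins; a minimal sketch of such helpers on top of the standard base64 module might look like this.)
import base64

def b32_encode(n):
    # Encode the integer's big-endian bytes as lowercase base32.
    data = n.to_bytes((n.bit_length() + 7) // 8 or 1, 'big')
    return base64.b32encode(data).lower()

def b32_decode(s):
    # casefold=True lets b32decode accept the lowercase alphabet.
    return int.from_bytes(base64.b32decode(s, casefold=True), 'big')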
If you're looking to obscure the integer
The simplest method is to multiply the integer by some number and then XOR with some larger, randomized key:
>>> key = 0xFa907fA06 # The result of punching my keyboard.
>>> number = 15485863
>>> obscured = (1234 * number) ^ key
>>> obscured
50902290680
>>> hex(obscured)
'0xbda0350f8'
>>> (obscured ^ key) // number
1234
More robust obfuscation than that requires a bit more research, in which case this similar question may be a good place to start.

What is the use of hexadecimal values in programming?

This is something I have been wondering about while reading programming books and in computer science class at school, where we learned how to convert decimal values into hexadecimal.
Can someone please tell me what the advantages of using hexadecimal values are and why we use them in programming?
Thank you.
In many cases (e.g. bit masks) you need to work in binary, but binary is hard to read because of its length. Since hexadecimal values can be translated to/from binary much more easily than decimal values, you can look at hex as a kind of shorthand notation for binary.
It certainly depends on what you're doing.
Hexadecimal comes as an extension of base 2, which you are probably familiar with as essential to computing.
Check this out for a good discussion of several applications:
https://softwareengineering.stackexchange.com/questions/170440/why-use-other-number-bases-when-programming/
Each hexadecimal digit corresponds 1:1 to a pattern of 4 bits. With experience, you can map them from memory, e.g. 0x8 = 1000 and 0xF = 1111, so 0x8F = 10001111.
This is a convenient shorthand wherever the bit patterns matter, e.g. in bitmaps or when working with I/O ports. Visualizing the bit pattern of the decimal value 169 is, by comparison, much harder.
A byte consists of 8 binary digits and is the smallest piece of data that computers normally work with. All other variables a computer works with are constructed from bytes. For example, a single character can be stored in a single byte, and a 32-bit integer consists of 4 bytes.
As bytes are so fundamental we want a way to write down their value as neatly and efficiently as possible. One option would be to use binary, but then we would need a lot of digits. This takes up a lot of space and can be confusing when many numbers are written in sequence:
200 201 202 == 11001000 11001001 11001010
Using hexadecimal notation, we can write every byte using just two digits:
200 == C8
Also, as 16 is a power of 2, it is easy to convert between hexadecimal and binary representations in your head. This is useful as sometimes we are only interested in a single bit within the byte. As a simple example, if the first digit of a hexadecimal representation is 0 we know that the first four binary digits are 0.
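As a short interactive illustration (Python here, but the idea is language-independent):
>>> bin(0xC8)        # one byte, two hex digits
'0b11001000'
>>> hex(0b10001111)  # high nibble 1000, low nibble 1111
'0x8f'
>>> 0xC8 & 0x0F      # mask off the low four bits
8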

Resources