UTF-8 hex to Unicode code point (only math) - math

Let's take this table with characters and HEX encodings in Unicode and UTF-8.
Does anyone know how it is possible to convert UTF-8 hex to Unicode code point using only math operations?
E.g. let's take the first row. Given 227, 129, 130, how do we get 12354?
Is there any simple way to do it by using only math operations?
Unicode code point | UTF-8                      | Char
30 42 (12354)      | e3 (227) 81 (129) 82 (130) | あ
30 44 (12356)      | e3 (227) 81 (129) 84 (132) | い
30 46 (12358)      | e3 (227) 81 (129) 86 (134) | う
* Source: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=12288&unicodeinhtml=hex

This video is the perfect source (watch from 6:15), but here is a summary and a code sample in Go. Letters mark the bits taken from the UTF-8 bytes; once you understand the logic, it's easy to apply the bitwise operators:
1-byte (ASCII), char E
UTF-8 bytes:        0xxx xxxx                                   -> 0100 0101 or 0x45
Unicode code point: 0xxx xxxx                                   -> 0100 0101 or U+0045
Explanation: no conversion needed, the same value in UTF-8 and the Unicode code point

2-byte, char Ê
UTF-8 bytes:        1. 110x xxxx  2. 10yy yyyy                  -> 1100 0011 1000 1010 or 0xC38A
Unicode code point: 0xxx xxyy yyyy                              -> 0000 1100 1010 or U+00CA
Explanation: 1. low 5 bits of the 1st byte  2. low 6 bits of the 2nd byte

3-byte, char あ
UTF-8 bytes:        1. 1110 xxxx  2. 10yy yyyy  3. 10zz zzzz    -> 1110 0011 1000 0001 1000 0010 or 0xE38182
Unicode code point: xxxx yyyy yyzz zzzz                         -> 0011 0000 0100 0010 or U+3042
Explanation: 1. low 4 bits of the 1st byte  2. low 6 bits of the 2nd byte  3. low 6 bits of the 3rd byte

4-byte, char 𐄟
UTF-8 bytes:        1. 1111 0xxx  2. 10yy yyyy  3. 10zz zzzz  4. 10ww wwww -> 1111 0000 1001 0000 1000 0100 1001 1111 or 0xF090_849F
Unicode code point: 000x xxyy yyyy zzzz zzww wwww               -> 0000 0001 0000 0001 0001 1111 or U+1011F
Explanation: 1. low 3 bits of the 1st byte  2. low 6 bits of the 2nd byte  3. low 6 bits of the 3rd byte  4. low 6 bits of the 4th byte
2-byte UTF-8
func get(byte1 byte, byte2 byte) rune {
	int1 := uint16(byte1&0b_0001_1111) << 6 // low 5 bits of the 1st byte
	int2 := uint16(byte2 & 0b_0011_1111)    // low 6 bits of the 2nd byte
	return rune(int1 + int2)
}
3-byte UTF-8
func get(byte1 byte, byte2 byte, byte3 byte) rune {
	int1 := uint16(byte1&0b_0000_1111) << 12 // low 4 bits of the 1st byte
	int2 := uint16(byte2&0b_0011_1111) << 6  // low 6 bits of the 2nd byte
	int3 := uint16(byte3 & 0b_0011_1111)     // low 6 bits of the 3rd byte
	return rune(int1 + int2 + int3)
}
4-byte UTF-8
func get(byte1 byte, byte2 byte, byte3 byte, byte4 byte) rune {
	int1 := uint32(byte1&0b_0000_0111) << 18 // low 3 bits of the 1st byte
	int2 := uint32(byte2&0b_0011_1111) << 12 // low 6 bits of the 2nd byte
	int3 := uint32(byte3&0b_0011_1111) << 6  // low 6 bits of the 3rd byte
	int4 := uint32(byte4 & 0b_0011_1111)     // low 6 bits of the 4th byte
	return rune(int1 + int2 + int3 + int4)
}
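If you literally want only math operations (as in the question), the 3-byte case reduces to subtraction, multiplication and addition: masking off the fixed prefixes is the same as subtracting 224 and 128, and each continuation byte carries 6 bits, i.e. a factor of 64. A minimal illustrative sketch (my own, not from the video):

package main

import "fmt"

func main() {
	// 3-byte UTF-8: subtract the fixed prefixes (0b1110_0000 = 224, 0b1000_0000 = 128),
	// then weight the payloads by 64*64, 64 and 1 and add them up.
	b1, b2, b3 := 227, 129, 130
	codePoint := (b1-224)*4096 + (b2-128)*64 + (b3 - 128)
	fmt.Println(codePoint) // 12354 (U+3042, あ)
}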

Related

Decode Epson print (ESC-i) command decoding/encoding

I'm trying to understand the algorithm used for compression value = 1 with the Epson ESCP2 print command, "ESC-i". I have a hex dump of a raw print file which looks, in part, like the hexdump below (note little-endian format issues).
000006a 1b ( U 05 00 08 08 08 40 0b
units; (page1=08), (vt1=08), (hz1=08), (base2=40 0b=0xb40=2880)
...
00000c0 691b 0112 6802 0101 de00
esc i 12 01 02 68 01 01 00
print color1, compress1, bits1, bytes2, lines2, data...
color1 = 0x12 = 18 = light cyan
compress1 = 1
bits1 (bits/pixel) = 0x2 = 2
bytes2 is ??? = 0x0168 = 360
lines2 is # lines to print = 0x0001 = 1
00000c9 de 1200 9a05 6959
00000d0 5999 a565 5999 6566 5996 9695 655a fd56
00000e0 1f66 9a59 6656 6566 5996 9665 9659 6666
00000f0 6559 9999 9565 6695 9965 a665 6666 6969
0000100 5566 95fe 9919 6596 5996 5696 9666 665a
0000110 5956 6669 0456 1044 0041 4110 0040 8140
0000120 9000 0d00
1b0c 1b40 5228 0008 5200 4d45
FF esc # esc ( R 00 REMOTE1
The difficulty I'm having is how to decode the data, starting at 00000c9, given 2 bits/pixel and the count of 360. It's my understanding this is some form of tiff or rle encoding, but I can't decode it in a way that makes sense. The output was produced by gutenprint plugin for GIMP.
Any help would be much appreciated.
The byte count is not a count of the bytes in the input stream; it is a count of the bytes after the stream has been expanded to its uncompressed form. So when expanded, there should be a total of 360 bytes. Each control byte is interpreted as a signed value: if it is positive, the next (value + 1) bytes are copied as-is; if it is negative, the single byte that follows is repeated (-value + 1) times. The 0D at the end is a terminating carriage return for the line as a whole.
The input stream is only considered as a string of whole bytes, despite the fact that the individual pixel/nozzle controls are only 2 bits each. So it is not really possible to use a repeat count for something like a 3-nozzle sequence; a repeat count must always cover a full byte, i.e. a 4-nozzle combination.
The above example then specifies:
0xde00 => repeat 0x00 35 times
0x12 => use the next 19 bytes as is
0xfd66 => repeat 0x66 4 times
0x1f => use the next 32 bytes as is
etc.
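Putting that rule into code, here is a minimal sketch of the expansion step in Go (the function name and the sample input are my own, not from Epson's documentation):

package main

import "fmt"

// expand decodes the PackBits-style run-length stream described above:
// a non-negative control byte n means "copy the next n+1 bytes literally",
// a negative one means "repeat the following byte 1-n times".
func expand(in []byte) []byte {
	var out []byte
	for i := 0; i < len(in); {
		n := int8(in[i]) // control byte, read as a signed value
		i++
		if n >= 0 { // literal run of n+1 bytes
			count := int(n) + 1
			out = append(out, in[i:i+count]...)
			i += count
		} else { // repeat the next byte 1-n times
			count := 1 - int(n)
			for j := 0; j < count; j++ {
				out = append(out, in[i])
			}
			i++
		}
	}
	return out
}

func main() {
	// 0xde 0x00 -> repeat 0x00 35 times; 0x01 0xaa 0xbb -> copy 2 literal bytes
	sample := []byte{0xde, 0x00, 0x01, 0xaa, 0xbb}
	fmt.Println(len(expand(sample))) // 37
}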

Incorrect wav header generated by sox

I was using sox to convert a 2-channel, 48000 Hz, 24-bit wav file (new.wav) to a mono wav file (post.wav).
Here are the related commands and outputs:
[Farmer#Ubuntu recording]$ soxi new.wav
Input File : 'new.wav'
Channels : 2
Sample Rate : 48000
Precision : 24-bit
Duration : 00:00:01.52 = 72901 samples ~ 113.908 CDDA sectors
File Size : 447k
Bit Rate : 2.35M
Sample Encoding: 24-bit Signed Integer PCM
[Farmer#Ubuntu recording]$ sox new.wav -c 1 post.wav
[Farmer#Ubuntu recording]$ soxi post.wav
Input File : 'post.wav'
Channels : 1
Sample Rate : 48000
Precision : 24-bit
Duration : 00:00:01.52 = 72901 samples ~ 113.908 CDDA sectors
File Size : 219k
Bit Rate : 1.15M
Sample Encoding: 24-bit Signed Integer PCM
It looks fine. But let us check the header of post.wav:
[Farmer#Ubuntu recording]$ xxd post.wav | head -10
00000000: 5249 4646 9856 0300 5741 5645 666d 7420 RIFF.V..WAVEfmt
00000010: 2800 0000 feff 0100 80bb 0000 8032 0200 (............2..
00000020: 0300 1800 1600 1800 0400 0000 0100 0000 ................
00000030: 0000 1000 8000 00aa 0038 9b71 6661 6374 .........8.qfact
00000040: 0400 0000 c51c 0100 6461 7461 4f56 0300 ........dataOV..
This is the standard wav file header structure.
The first line is no problem.
The second line's "2800 0000" shows the size of the "fmt " sub-chunk; as this is little-endian, it should be 0x00000028 = 40 bytes. But there are 54 bytes between the "fmt " sub-chunk and the "data" sub-chunk.
The third line shows "ExtraParamSize" is 0x0016 = 22 bytes, but actually there are 36 bytes (from the third line's "1600" to the fifth line's "0100"). The preceding 16 bytes are the standard fields.
So what are the extra 36 bytes?
OK, I found the answer.
Look at the second line: the audio format field is "feff", i.e. the actual value is 0xFFFE, so this is not the standard PCM wave format but the extensible format (WAVE_FORMAT_EXTENSIBLE).
A detailed introduction to the WAV header can be found at this link. The article is well written; thanks to the author.
So, since this is a non-PCM wav, it is fine for the "fmt " chunk to occupy 40 bytes, and it is followed by a "fact" chunk and then the "data" chunk. Everything makes sense.
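If you want to verify this yourself, you can walk the RIFF chunk list and print each chunk ID and size. A minimal Go sketch (my own illustration, assuming a well-formed little-endian RIFF file such as the post.wav above):

package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("post.wav") // file name from the question
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// RIFF header: "RIFF", 4-byte total size, "WAVE".
	var hdr [12]byte
	if _, err := io.ReadFull(f, hdr[:]); err != nil {
		panic(err)
	}
	fmt.Printf("%s ... %s\n", hdr[0:4], hdr[8:12])

	// Each sub-chunk: 4-byte ID, 4-byte little-endian size, then the payload.
	var chunk [8]byte
	for {
		if _, err := io.ReadFull(f, chunk[:]); err != nil {
			break // EOF
		}
		size := binary.LittleEndian.Uint32(chunk[4:8])
		fmt.Printf("chunk %q, %d bytes\n", chunk[0:4], size)
		// Skip the payload (chunks are padded to an even length per the RIFF spec).
		if _, err := f.Seek(int64(size+size%2), io.SeekCurrent); err != nil {
			break
		}
	}
}

On the file above this should list the "fmt " (40 bytes), "fact" (4 bytes) and "data" chunks.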

Plain text to Hexadecimal manually

How do I manually convert plain text to hexadecimal?
E.g. the hexadecimal form of Hello.
P.S. I do not need code, just the manual way to convert.
--Convert the string to its ASCII form
--Convert ASCII(decimal) to Hex
E.g. Hello in ASCII is
H is 72, e is 101, l is 108, o is 111
And the Hex value of
72 is 48
101 is 65
108 is 6c
111 is 6f
So the Hex representation of Hello is 48656c6c6f
For example, take Hello character by character; H = 72 (its int value), and you convert 72 to hexadecimal by repeated division by 16:
72 / 16: quotient = 4, remainder = 8 (72/16 = 4.5; 0.5 * 16 = 8)
4 / 16: quotient = 0, remainder = 4
Repeat until the quotient becomes zero, then read the remainders from last to first.
So H = 48 (hex).
Likewise for all the other characters; finally, Hello = 48656c6c6f.
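A small Go sketch of the same repeated-division procedure (purely illustrative, since the question only asks for the manual method):

package main

import "fmt"

// hexByte converts one byte to its two hex digits by dividing by 16,
// mirroring the manual procedure above (quotient digit, then remainder digit).
func hexByte(b byte) string {
	const digits = "0123456789abcdef"
	return string([]byte{digits[b/16], digits[b%16]})
}

func main() {
	out := ""
	for _, b := range []byte("Hello") {
		out += hexByte(b)
	}
	fmt.Println(out) // 48656c6c6f
}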

Questions about hexadecimal

Write a program to swap odd and even bits in an integer.
For example, bit 0 and bit 1 are swapped, bit 2 and bit 3 are swapped.
The solution uses 0xaaaaaaaa and 0x55555555.
Can I ask what 0xaaaaaaaa and 0x55555555 mean as binary numbers?
Each group of four bits corresponds to one hex digit:
0000 = 0    1000 = 8
0001 = 1    1001 = 9
0010 = 2    1010 = A
0011 = 3    1011 = B
0100 = 4    1100 = C
0101 = 5    1101 = D
0110 = 6    1110 = E
0111 = 7    1111 = F
So, for example, 0x1234 would be 0001 0010 0011 0100 in binary.
For your specific examples:
0xaaaaaaaa = 1010 1010 ... 1010
0x55555555 = 0101 0101 ... 0101
The reason why a solution might use those two values is that, if you AND a value with 0xaaaaaaaa, you'll get only the odd-numbered bits (bits 1, 3, 5, and so on, counting the least significant bit as bit 0), which you can then shift right to move them to the even bit positions.
Similarly, if you AND a value with 0x55555555, you'll get only the even-numbered bits, which you can then shift left to move them to the odd bit positions.
Then you can just OR those two values together and the bits have been swapped.
For example, let's start with the 16-bit value abcdefghijklmnop (each letter being a bit, with masked-out bits shown as . to make it more readable):
abcdefghijklmnop abcdefghijklmnop
AND 1.1.1.1.1.1.1.1. AND .1.1.1.1.1.1.1.1
= a.c.e.g.i.k.m.o. = .b.d.f.h.j.l.n.p
>>1 = .a.c.e.g.i.k.m.o <<1 = b.d.f.h.j.l.n.p.
\___________ ___________/
\ /
.a.c.e.g.i.k.m.o
OR b.d.f.h.j.l.n.p.
= badcfehgjilknmpo
So each group of two bits has been swapped around. In C, that would be something like:
val = ((val & 0xAAAAAAAA) >> 1) | ((val & 0x55555555) << 1);
but, if this is classwork of some description, I'd suggest you work it out yourself by doing individual operations.
For an in-depth explanation of the bitwise operators that allow you to do this, see this excellent answer here.
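For completeness, here is the same one-liner wrapped in a runnable Go snippet so you can check the result on a small value (the sample value is my own):

package main

import "fmt"

// swapBits swaps each adjacent even/odd bit pair, just like the C one-liner above.
func swapBits(val uint32) uint32 {
	return ((val & 0xAAAAAAAA) >> 1) | ((val & 0x55555555) << 1)
}

func main() {
	v := uint32(0b0110_1001)
	fmt.Printf("%08b -> %08b\n", v, swapBits(v)) // 01101001 -> 10010110
}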

calculating subnet mask?

How can I calculate the subnet mask for the IP address 128.2.19.4, which belongs to the subnet 128.2.19.0/25? Please give me the detailed procedure; I want to learn how to calculate it.
Here's the algorithm with your example:
The subnet mask is just a representation of the "/25" part of your subnet address.
In IPv4, addresses and masks are 32 bits long; for /25 the first 25 bits of the mask are ones:
1111 1111 1111 1111 1111 1111 1000 0000
Addresses are written as octets -- 8 bits each:
octet 1 . octet 2 . octet 3 . octet 4
0000 0000 0000 0000 0000 0000 0000 0000
1111 1111 1111 1111 1111 1111 1000 0000
So the decimal representation of each octet is:
255 . 255 . 255 . 128
That means that your subnet mask would be:
255.255.255.128
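If you ever want to double-check the arithmetic by machine, here is a tiny Go sketch of the same /25 computation (purely illustrative; the C program in the next answer does the same thing):

package main

import "fmt"

func main() {
	prefix := 25
	mask := uint32(0xffffffff) << (32 - prefix) // keep the top 25 bits set
	fmt.Printf("%d.%d.%d.%d\n", byte(mask>>24), byte(mask>>16), byte(mask>>8), byte(mask)) // 255.255.255.128
}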
The subnet mask is a bitmask. /25 means that 25 out of the 32 bits (starting from the top) are used for the network, and the rest for the hosts.
In bytes: 128.2.19.0
In binary 10000000 00000010 00010011 00000000
The bitmask: 11111111 11111111 11111111 10000000
Ergo: ------- network ------------ host
The last 7 bits are used for hosts. The bitmask as bytes is 255.255.255.128.
Here's how you can do it in C:
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Build a netmask in network byte order from a CIDR prefix length. */
uint32_t cidr_to_netmask(uint8_t cidr)
{
    uint8_t unset_bits = 32 - cidr;
    return htonl(0xffffffff << unset_bits); /* host to network byte order */
}

int main(void)
{
    uint8_t cidr = 25;
    uint32_t _netmask = cidr_to_netmask(cidr);
    struct in_addr _netmask_addr = { _netmask };
    char netmask[INET_ADDRSTRLEN];

    if (inet_ntop(AF_INET, &_netmask_addr, netmask, sizeof(netmask)) == NULL) {
        fprintf(stderr, "error.\n");
        return 1;
    }

    printf("%d = %s\n", cidr, netmask);
    return 0;
}
Output:
25 = 255.255.255.128
