Does `string` in OCaml support UTF-8? - functional-programming

Does the string type in OCaml support UTF-8?
If not, which library should I use for UTF-8 strings?

In essence, an OCaml string is a sequence of 8-bit bytes. You can store UTF-8 data in a string, and I have often done this. However, there's no built-in support for handling it as Unicode text. A good library for handling Unicode in OCaml (so I've heard) is Camomile.

There is also Uutf if you only need Unicode encoding and decoding.
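
To make the byte-versus-character distinction concrete, here is a small sketch in Python rather than OCaml. The language choice is only for illustration, and Python's `bytes` type is an assumed stand-in for a byte-oriented string. It suggests why storing UTF-8 in such a string works, while length and slicing still operate on bytes rather than characters.

```python
# Python's bytes type is used here as a stand-in for a byte-oriented string.
text = "héllo"                 # 5 Unicode code points
raw = text.encode("utf-8")     # the same text as a sequence of 8-bit bytes

char_count = len(text)         # counts code points
byte_count = len(raw)          # counts bytes; "é" takes two bytes in UTF-8

# Slicing bytes can cut a multi-byte character in half.
broken = raw[:2]               # b"h" plus only the first byte of "é"
try:
    broken.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# Storing and retrieving the bytes unchanged is lossless.
round_trip = raw.decode("utf-8")
```

A Unicode library is what supplies the character-level operations (counting, iterating, slicing on character boundaries) that the raw byte view lacks.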

Related

In Julia is big"123" a macro, function, or something else?

As a newcomer to Julia this month, Sept. 2018, I am just getting used to the initially unfamiliar "@" symbol for macros and "!" symbol for functions that mutate their inputs. Am I right to assume that these are merely stylistic symbols for humans to read, and that they do not really provide any information to the compiler?
I bring this up in the context of the following code that does not seem to match the style of a macro, a function, or anything else in Julia I am aware of. I am specifically asking about big"1234" below:
julia> big"1234" # big seems to be neither a macro nor a function.
1234
julia> typeof(big"1234")
BigInt
julia> typeof(BigInt(1234))
BigInt
My question is: What is big in big"1234"?
Edit: I think I got my answer based on a comment at https://discourse.julialang.org/t/bigfloat-promotion-rules-and-constants-in-functions/14573/4
"Note that because decimal literals are converted to floating point numbers when parsed, BigFloat(2.1) may not yield what you expect. You may instead prefer to initialize constants from strings via parse, or using the big string literal.
julia> BigFloat(2.1)
2.100000000000000088817841970012523233890533447265625
julia> big"2.1"
2.099999999999999999999999999999999999999999999999999999999999999999999999999986"
Thus, based on the above comment, big in big"1234" is a "big string literal."
Edit 2: The above is a start at the answer, but the accepted answer below is much more complete.
These are Non-Standard String Literals. They tell the compiler that xyz"somestring" should be parsed via a macro named @xyz_str.
The difference between BigFloat(2.1) and big"2.1" is that the former converts the standard Float64 representation of the numeric literal 2.1 to BigFloat, whereas the latter parses the string "2.1" directly (without interpreting it as a numeric literal) with the macro @big_str to compute the BigFloat representation.
You can also define your own Non-Standard String Literals. LaTeXStrings.jl, for example, uses this to make it easier to type LaTeX equations.
Please take a look at: https://docs.julialang.org/en/v1/manual/metaprogramming/#Non-Standard-String-Literals-1
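
The same float-versus-string pitfall exists outside Julia. As a rough analogy only, the sketch below uses Python's `decimal.Decimal` (not Julia's BigFloat) to show that converting an already-rounded binary float keeps the rounding error, while parsing the digits from a string does not.

```python
from decimal import Decimal

# 2.1 is first rounded to the nearest binary double, then converted exactly.
from_float = Decimal(2.1)

# The decimal digits are parsed directly, with no binary float in between.
from_string = Decimal("2.1")

print(from_float)    # the exact value of the double nearest to 2.1
print(from_string)   # 2.1
```

The value printed for `from_float` matches the BigFloat(2.1) output quoted above, because both start from the same Float64 value.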

CR/LF generated by PBEWithMD5AndDES encryption?

Can the ciphertext produced by PBEWithMD5AndDES and then Base64-encoded contain CR and/or LF characters?
Base64 output consists only of printable characters. However, when it is used as a MIME content transfer encoding for email, it is split into lines that are separated by CR-LF.
PBEWithMD5AndDES returns binary data. PBE encryption is defined within the PKCS#5 standard, and this standard does not have a dedicated base 64 encoding scheme. So the question becomes for which system you need to Base 64 encode the binary data. Wikipedia has a nice section within the Base 64 article that explains the various forms.
You may encounter a PBE implementation that returns base 64 output without mentioning which of the above schemes is used. In that case you need to figure out the scheme yourself. I would suggest searching for it, asking the community, looking at the source or, if all else fails, creating a set of tests on the output.
Fortunately you are pretty safe if you are decoding base 64 and you ignore all white space. Note that some implementations omit the padding, so add it back before decoding, if applicable.
If you perform the base 64 encoding yourself, I would strongly suggest that you do not output any whitespace, use only the default alphabet (with the '+' and '/' signs) and always perform padding when required. After that you can always split the result, replace any non-standard character (especially the '+' and '/' signs of course), or remove the padding.
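
A tolerant decoder along those lines might look like the following sketch. It uses Python's standard `base64` module; the helper name `decode_lenient` and the sample payload are made up for illustration and are not part of any PBE library.

```python
import base64

def decode_lenient(encoded: str) -> bytes:
    """Decode Base64 text that may contain line breaks or lack padding."""
    compact = "".join(encoded.split())      # drop CR, LF, spaces and tabs
    compact += "=" * (-len(compact) % 4)    # restore any stripped '=' padding
    return base64.b64decode(compact)

payload = b"binary ciphertext"              # placeholder for real encrypted bytes

# Canonical form: default alphabet, padded, no whitespace.
canonical = base64.b64encode(payload).decode("ascii")

# Variants another system might produce from the same data.
wrapped = canonical[:10] + "\r\n" + canonical[10:]
unpadded = canonical.rstrip("=")
```

Both `decode_lenient(wrapped)` and `decode_lenient(unpadded)` should give back `payload`, which is the point of normalizing before decoding.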
I was using Java with the Android SDK. I found that the command:
String s = Base64.encodeToString(enc, Base64.DEFAULT);
did line wrapping. It put LF chars into the output string.
I found that:
String s = Base64.encodeToString(enc, Base64.NO_WRAP);
did not put the LF characters into the output string.
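
The same wrapped-versus-unwrapped split shows up in other Base64 libraries. As a comparison only (this is Python's standard library, not the Android API), `base64.encodebytes` produces MIME-style output with line breaks, while `base64.b64encode` produces a single line. The `enc` bytes below are a placeholder, not real ciphertext.

```python
import base64

enc = bytes(range(100))                # placeholder for the encrypted bytes

# MIME-style output: a newline is inserted after every 76 output characters.
wrapped = base64.encodebytes(enc)

# Plain output: one line, no line breaks.
single_line = base64.b64encode(enc)

# Apart from the newlines, both carry the same encoded data.
same_payload = wrapped.replace(b"\n", b"") == single_line
```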

In Qt how do QTextCodec::codecForName("UTF-16") and codecForName("UTF-32") decide the endianness to use?

In the Qt documentation it states that (among others) the following Unicode string encodings are supported:
UTF-8
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
Due to the three different codecs listed for 2 and 4 octet encoded Unicode, I was wondering: how do the two non-endian codecs ("UTF-16" and "UTF-32") decide which endianness to use?
Based on the source code in src/corelib/codecs/, it seems Qt uses the byte ordering of the host for UTF-16 and UTF-32.
If you use QTextCodec to read an existing Unicode string that has a BOM, and you didn't explicitly ask to ignore the header, the byte ordering detected in the string is used.
In qutfcodec_p.h both QUtf16Codec::e and QUtf32Codec::e are initialized with the value DetectEndianness (an enum).
In qutfcodec.cpp, near the beginning of the functions convertFromUnicode and convertToUnicode from the classes QUtf16 and QUtf32 (used by QUtf16Codec and QUtf32Codec), you can find the line:
endian = (QSysInfo::ByteOrder == QSysInfo::BigEndian)
? BigEndianness : LittleEndianness;
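
For comparison, Python's generic "utf-16" codec follows the same convention: host byte order plus a BOM when encoding, and the BOM decides the byte order when decoding. The sketch below demonstrates that behavior in Python only; it is an analogy and does not exercise Qt's code.

```python
import sys

# Encoding with the generic codec writes a BOM in the host's byte order.
encoded = "A".encode("utf-16")
host_bom = b"\xff\xfe" if sys.byteorder == "little" else b"\xfe\xff"
uses_host_order = encoded[:2] == host_bom

# Decoding honors whichever BOM is present, regardless of the host.
from_big_endian = b"\xfe\xff\x00A".decode("utf-16")
from_little_endian = b"\xff\xfeA\x00".decode("utf-16")

# The explicit variants never write a BOM.
explicit_le = "A".encode("utf-16-le")   # b"A\x00"
explicit_be = "A".encode("utf-16-be")   # b"\x00A"
```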

How do I encode the ugly string?

I have a string that is:
!"#$%&'()*+,-./0123456789:;?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª« ®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅàáâäèçéêëìíîïôö÷òóõùúý
I post that to a service and use HtmlEncode, and then I get this result:
!#$%&'()* ,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~����������� ���������•������������������������������������
That isn't the result I need. How do I get the original string back? Thanks!
Your string is not ASCII, so you are either using a string to represent binary data, or you're not maintaining awareness of multi-byte encoding. In any case, the simplest way to deal with any Internet-based technology (HTTP, SMTP, POP, IMAP) is to encode it as 7-bit clean. One common way is to base64-encode your data, send it across the wire, then base64-decode it before trying to process it.
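
A minimal sketch of that round trip, written in Python for illustration and assuming both ends agree on UTF-8 as the text encoding (the sample string is made up):

```python
import base64

original = "¡¢£ ÀÁÂ àáâ ÷"

# Sender: text -> UTF-8 bytes -> Base64, which is pure 7-bit ASCII.
wire = base64.b64encode(original.encode("utf-8")).decode("ascii")

# Receiver: Base64 -> UTF-8 bytes -> text.
restored = base64.b64decode(wire).decode("utf-8")
```

Here `wire` contains only ASCII characters, and `restored` should equal `original` as long as nothing on the way re-encodes the payload.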
I believe this is what you're looking for:
!"#$%&&apos;()*+,-./0123456789:;?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\\^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅàáâäèçéêëìíîïôö÷òóõùúý
You just need to use a better html entity/encoding library or tool. The one I used to generate this is from Ruby - I used the HTML Entities library. The code I wrote to do this follows. I had to put your text in input.txt to preserve Unicode (there was an EOF character in the string), but it worked great.
require 'rubygems'
require 'htmlentities'
str = File.read('input.txt')
coder = HTMLEntities.new
puts coder.encode(str, :named)
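
For readers without Ruby, a comparable result can be sketched with Python's standard library. This is a swapped-in approach, not the HTML Entities gem: it emits numeric character references (such as `&#233;`) instead of named entities, and the helper name `encode_entities` is hypothetical.

```python
import html

def encode_entities(text: str) -> str:
    """Escape HTML markup characters and replace non-ASCII with numeric references."""
    escaped = html.escape(text, quote=True)     # handles & < > " and '
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

encoded = encode_entities('<é & "ü">')
decoded = html.unescape(encoded)                # reverses both steps

print(encoded)   # &lt;&#233; &amp; &quot;&#252;&quot;&gt;
```

Because the output is plain ASCII, it survives a transport that mangles 8-bit characters, and `html.unescape` recovers the original text.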

Converting the RijndaelManaged() byte[] to a string

I want to convert the RijndaelManaged() encrypted value to a string.
Will ToBase64String() suffice? It says it's only for arrays of 8-bit values, but AES is 128-bit, right?
Update
For the encryption, I am using the code from http://msdn.microsoft.com/en-us/library/system.security.cryptography.rijndaelmanaged.aspx
Base64 is a generally good way to go. It's reasonably efficient, and you usually don't need to worry about encoding issues as the result will be ASCII. However, you should probably be careful if you're going to use the result in a URL - "normal" Base64 isn't url-safe. (There are alternative encodings which use different symbols though.)
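A byte is a byte: 8 bits. The 128 bits in AES refer to the block size, while the encrypted output is still a byte[], so ToBase64String will work. As Jon points out, the result has limitations when used in URLs or filenames.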
You can use this to convert it to a hex string.
We have been successfully using Convert.ToBase64String on the encrypted bytes from managed Rijndael for a number of years.
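
The options mentioned in these answers (standard Base64, a URL-safe variant, and hex) can be compared side by side. The sketch below uses Python instead of C# purely for illustration, and the 16 bytes are a made-up stand-in for one 128-bit ciphertext block.

```python
import base64

# 16 arbitrary bytes standing in for one 128-bit encrypted block.
block = bytes(range(240, 256))

standard = base64.b64encode(block).decode("ascii")          # may contain '+' and '/'
url_safe = base64.urlsafe_b64encode(block).decode("ascii")  # uses '-' and '_' instead
as_hex = block.hex()                                        # two hex digits per byte

# Every representation converts back to the same bytes.
back_from_standard = base64.b64decode(standard)
back_from_url_safe = base64.urlsafe_b64decode(url_safe)
back_from_hex = bytes.fromhex(as_hex)
```

For this particular block the standard form contains a '/', which is why the URL-safe alphabet matters when the string ends up in a URL or filename. Hex avoids the issue entirely at the cost of a longer string (32 characters here versus 24).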
