"AssertionError:You should supply an encoding or a list of encodings to this method - bert-language-model

"You should supply an encoding or a list of encodings to this method"
An "encoding" here is the output of one of the tokenizer's encoding methods, i.e. __call__, encode_plus or batch_encode_plus.

Related

What is this text: =B0=A1=C1=CB ... and how to convert it to normal text?

I have found some text in this form:
=B0=A1=C1=CB,=C4=E3=D2=B2=C3=BB=C1=AA=CF=B5=CE=D2,=D7=EE=BD=FC=CA=C7=B2=BB=CA=C7=
=BA=DC=C3=A6=B0=A1
containing mostly sequences consisting of an equal sign followed by two hexadecimal digits.
I am told it could be converted into this Chinese sentence:
啊了你也没联系我最近是不是很忙啊
What is this =B0=A1=C1 notation, and how can I decode/convert it?
The Chinese sentence has been encoded into an 8-bit Guobiao encoding (GB2312, GBK or GB18030; most likely GB18030, though it apparently decodes correctly as GB2312 too), and then further encoded into the 7-bit MIME quoted-printable encoding.
To decode it into a Unicode string, first undo the quoted-printable encoding, then decode the Guobiao encoding. Here’s an example using Python:
import quopri
# Undo the quoted-printable layer, then decode the resulting bytes as GB18030.
print(quopri.decodestring("""\
=B0=A1=C1=CB,=C4=E3=D2=B2=C3=BB=C1=AA=CF=B5=CE=D2,=D7=EE=BD=FC=CA=C7=B2=BB=CA=C7=
=BA=DC=C3=A6=B0=A1\
""").decode('gb18030'))
This outputs 啊了,你也没联系我,最近是不是很忙啊 on my terminal.
The quoted-printable encoding is usually found in e-mail messages; whether it is actually in use should be determined from message headers. A message encoded in this manner should carry the header Content-Transfer-Encoding: quoted-printable. The text encoding (gb18030 in this case) should be specified in the charset parameter of the Content-Type header, but sometimes can be determined by other means.
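For instance, the headers of such a message would look something like this (the charset value here is an assumption matching this example):
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=gb18030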

May a URL contain arbitrary binary data in a GET-request?

May a URL contain raw binary data in a GET-request?
Is it possible to create a URL, www.example.com/<binary-data>, where www.example.com/ consists of ordinary ASCII characters, and <binary-data> is arbitrary raw byte values, e.g., 0x10?
I don't want to encode the binary data, but just create a string, e.g., a char* in C, that contains both the ASCII characters and the binary data.
Or is a POST request the only way to send raw binary data, as part of the body?
No, but you could percent-escape the non-URI characters.
No. A URL transmitted in an HTTP GET request is percent-encoded, UTF-8 encoded Unicode text (resulting in an "ASCII" string).
Again, it's Unicode text.
Unicode encodings do not produce arbitrary binary data; for arbitrary binary data there is no equivalent text.
Moving away from "raw", the server and request can, of course, agree on the use of a scheme such as Base64 to turn arbitrary binary data into Unicode text. By that point, though, you might as well use an HTTP request with a body, and although as far as HTTP is concerned the bodies are raw binary data, HTTP headers can indicate a standard format. Such requests include POST and PUT.
There are also practical limits to the length of the URL.
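As a minimal sketch in R (the byte values are just an illustration), arbitrary bytes can be carried in a URL only by percent-escaping each one as %XX, and reserved characters in ordinary text can be escaped with base R's URLencode:
bytes <- as.raw(c(0x10, 0x00, 0xFF))
# Percent-escape every byte as %XX so it can travel inside a URL.
paste(sprintf("%%%02X", as.integer(bytes)), collapse = "")
# [1] "%10%00%FF"
# Reserved characters in plain text can be escaped the same way:
URLencode("a path/with spaces", reserved = TRUE)
# [1] "a%20path%2Fwith%20spaces"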

How to specify encoding while creating file?

I am using an R script to create and append a file. But I need the file to be saved in ANSI encoding, even though some characters are in Unicode format. How can I ensure ANSI encoding?
newfile <- '/home/user/abc.ttl'
file.create(newfile)
text3 <- readLines('/home/user/init.ttl')
sprintf('readlines %d', length(text3))
for (k in 1:length(text3)) {
  cat(text3[[k]], file = newfile, sep = "\n", append = TRUE)
}
Encoding can be tricky, since you need to detect the encoding on input, and then convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you will probably lose some untranslatable characters, since there may be no mapping from UTF-8 characters to ASCII outside the lower 128 code points. (Within this range, UTF-8 and ASCII coincide.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
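Putting the steps together (a minimal sketch, using the paths from the question; writeLines replaces the cat() loop):
# Read as UTF-8, drop untranslatable characters, write the ASCII result.
text3 <- readLines('/home/user/init.ttl', encoding = "UTF-8")
text3 <- iconv(text3, from = "UTF-8", to = "ASCII", sub = "")
writeLines(text3, '/home/user/abc.ttl')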

Accentuation in R

I was wondering if there is a short way to recover the accents of a character string in R. For instance, when the accented string is "Université d'Aix-Marseille", with my script I get "Universit%C3A9 d%27Aix-Marseille". Is there any function or algorithm to get the former directly?
I should point out that the file from which I get all my character strings is encoded in UTF-8.
Sincerely yours.
You can get and set the encoding of a character vector like this:
s <- "Université d'Aix-Marseille"
Encoding(s)
# set encoding to utf-8
Encoding(s) <- "UTF-8"
s
If that fixes it, you could change your default encoding to UTF-8.
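As an aside, the %-sequences in your output look like percent-encoding (URL-encoding) rather than a declared-encoding problem; note the correctly encoded form of é would be %C3%A9. If your strings are actually percent-encoded, base R's URLdecode reverses it (a sketch, assuming a UTF-8 locale):
s <- "Universit%C3%A9 d%27Aix-Marseille"
URLdecode(s)
# [1] "Université d'Aix-Marseille"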

What is this "ÿþA"?

When I read CSV files into R, the resulting data frame has very different dimensions than what I see when I open the file in Excel or Notepad, and the column heading is labeled "ÿþA". What does this mean?
thanks,
The file you are reading is using a UTF-16 or UTF-32 encoding (with a BOM), and R's read.csv function has not been told about it.
As Karsten suggests, you should use the fileEncoding parameter to specify the correct encoding, which I suspect should be "UTF-16LE".
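For example (a sketch; the file name is a placeholder, and the encoding may need adjusting to match your file):
# Tell read.csv the file is little-endian UTF-16 so the BOM and wide characters are handled.
df <- read.csv("yourfile.csv", fileEncoding = "UTF-16LE")
dim(df)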
Here is what the R documentation for connections states about encoding:
Encoding
The encoding of the input/output stream of a connection can be specified by name in the same way as it would be given to iconv: see that help page for how to find out what encoding names are recognized on your platform. Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is the internal encoding of the current locale and hence no translation is done.
Re-encoding only works for connections in text mode: reading from a connection with re-encoding specified in binary mode will read the stream of bytes, but mixing text and binary mode reads (e.g. mixing calls to readLines and readChar) is likely to lead to incorrect results.
The encodings "UCS-2LE" and "UTF-16LE" are treated specially, as they are appropriate values for Windows ‘Unicode’ text files. If the first two bytes are the Byte Order Mark 0xFFFE then these are removed as some implementations of iconv do not accept BOMs. Note that whereas most implementations will handle BOMs using encoding "UCS-2" and choose the appropriate byte order, some (including earlier versions of glibc) will not. There is a subtle distinction between "UTF-16" and "UCS-2" (see http://en.wikipedia.org/wiki/UTF-16/UCS-2: the use of surrogate pairs is very rare so "UCS-2LE" is an appropriate first choice.
As from R 3.0.0 the encoding "UTF-8-BOM" is accepted for reading and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications). If it is required (it is not recommended) when writing, it should be written explicitly, e.g. by writeChar("\ufeff", con, eos = NULL) or writeBin(as.raw(c(0xef, 0xbb, 0xbf)), binary_con).
Requesting a conversion that is not supported is an error, reported when the connection is opened. Exactly what happens when the requested translation cannot be done for invalid input is in general undocumented. On output the result is likely to be that up to the error, with a warning. On input, it will most likely be all or some of the input up to the error.
It may be possible to deduce the current native encoding from Sys.getlocale("LC_CTYPE"), but not all OSes record it.
And here is what Wikipedia states about the BOM:
Byte order mark
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF. BOM use is optional and, if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
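The "ÿþA" you saw is exactly this: the UTF-16LE BOM bytes 0xFF 0xFE displayed as Latin-1 characters ("ÿ" is 0xFF, "þ" is 0xFE), followed by the first column name. You can check the leading bytes of the file directly (the file name is a placeholder):
# Peek at the first bytes; ff fe indicates a UTF-16LE BOM.
readBin("yourfile.csv", what = "raw", n = 4)
# e.g. [1] ff fe 41 00   ("A" encoded in UTF-16LE follows the BOM)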
