NodeMCU http.get() and UTF-8 charset - http

I am using the http module of the NodeMCU dev branch to make a GET request to the Google Calendar API. However, when I retrieve the events and parse the answer, I get strange characters because of improper encoding.
I tried to add Accept-Charset: utf-8 in the header of the request, but the request fails (code=-1).
Is there a way to set the charset, or convert it afterwards in lua?
function bdayshttps(curr_date)
    if (string.len(curr_date) == 10) then
        http.get("https://www.googleapis.com/calendar/v3/calendars/"..
            "<CalendarID>/events"..
            "?timeMax="..curr_date.."T23%3A59%3A59-00%3A00"..
            "&timeMin="..curr_date.."T00%3A00%3A00-00%3A00&fields=items%2Fsummary"..
            "&key=<Google Calendar API key>", "Accept-Charset: utf-8", function(code, data)
                if (code < 0) then
                    print("msg:birthdays error")
                else
                    if (code == 200) then
                        output = ""
                        for line in data:gmatch("\"summary\": \"[^\n]*") do
                            output = output..line:sub(13, line:len()-1)..";"
                        end
                        print("bday:"..output)
                    end
                end
            end)
    end
end
For obvious reasons, I erased the calendarID and API key.
EDIT:
The result of this code is msg:birthdays error, meaning the GET request returns code=-1.
When I replace "Accept-Charset: utf-8" with nil in the header, I get:
LoÃ¯c Simonetti instead of Loïc Simonetti.

The API docs say that you need to append \r\n to every header you set. There's an example in the docs for http.post().
Hence, instead of "Accept-Charset: utf-8" you should set "Accept-Charset: utf-8\r\n".

For the second part of your question (now that the response is working), it appears you are receiving back valid UTF-8.
Loïc corresponds to the following UTF-8 code units: 4C 6F C3 AF 63.
LoÃ¯c has those same byte values in a common single-byte encoding (Windows-1252): 4C 6F C3 AF 63.
So my guess is that the terminal or other device you're using to view the bytes is mangling things, as opposed to your request/response being incorrect. Per @Marcel's link above, Lua does not handle Unicode at all, so this is about the best you can do (safely receive, store, and then send it back later).
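To see that the received bytes themselves are fine, here is a small Python sketch (purely illustrative, outside NodeMCU): the same five bytes read as UTF-8 give Loïc, while read as CP1252 they give LoÃ¯c.
data = "Loïc".encode("utf-8")    # b'Lo\xc3\xafc' -> 4C 6F C3 AF 63
print(data.hex().upper())        # 4C6FC3AF63
print(data.decode("utf-8"))      # Loïc  (the intended text)
print(data.decode("cp1252"))     # LoÃ¯c (what a CP1252 viewer shows)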

Related

Wrong result of TIdURI.URLDecode when unit LazUTF8 is used

With Free Pascal 3.0.4, this test program correctly writes ÄÖÜ
program FPCTest;
uses IdURI;
begin
  WriteLn(TIdURI.URLDecode('%C3%84%C3%96%C3%9C'));
  ReadLn;
end.
However, if the unit LazUTF8 (as described here) is used, it writes ???
program FPCTest;
uses IdURI, LazUTF8;
begin
  WriteLn(TIdURI.URLDecode('%C3%84%C3%96%C3%9C'));
  ReadLn;
end.
How can I fix this decoding error for programs which use LazUTF8?
When the String type is an alias for AnsiString [1], much of Indy's functionality exposes extra parameters/properties to let users control which ANSI encodings are used when AnsiString values are passed around in operations that perform AnsiString<->byte conversions.
[1]: Delphi pre-2009, and FreePascal/Lazarus when {$ModeSwitch UnicodeStrings} and {$Mode DelphiUnicode} are not used (FYI, Indy 11 will use them!).
In most cases, Indy's default byte encoding is ASCII (because many of the Internet protocols that Indy implements originally supported only ASCII - individual Indy components upgrade themselves to UTF as appropriate per protocol), though some things use the OS default codepage/charset instead.
Indy's default byte encoding can be changed at runtime by setting the global GIdDefaultTextEncoding variable in the IdGlobal unit, eg:
GIdDefaultTextEncoding := encUTF8;
But, in this particular situation, TIdURI.URLDecode() does not use GIdDefaultTextEncoding; it does, however, have an optional ADestEncoding parameter that you can use to specify a specific byte encoding for the returned AnsiString (in addition to an optional AByteEncoding parameter to specify the byte encoding of the parsed url octets, UTF-8 by default), e.g.:
TIdURI.URLDecode('%C3%84%C3%96%C3%9C'
{$IFNDEF FPC_UNICODESTRINGS}, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8{$ENDIF}
)
The above will parse the url-encoded octets as UTF-8, and then return that data as-is in a UTF-8 encoded AnsiString.
If you do not specify an output encoding for ADestEncoding, URLDecode() defaults to the OS default. If you want it to use GIdDefaultTextEncoding instead, specify IndyTextEncoding_Default in the ADestEncoding parameter:
TIdURI.URLDecode('%C3%84%C3%96%C3%9C'
{$IFNDEF FPC_UNICODESTRINGS}, IndyTextEncoding_UTF8, IndyTextEncoding_Default{$ENDIF}
)
Another option would be to use the IndyTextEncoding(CodePage) function for ADestEncoding, passing it FreePascal's DefaultSystemCodePage variable, which the LazUtils package sets to CP_UTF8 [2]:
TIdURI.URLDecode('%C3%84%C3%96%C3%9C'
{$IFNDEF FPC_UNICODESTRINGS}, IndyTextEncoding_UTF8, IndyTextEncoding(DefaultSystemCodePage){$ENDIF}
)
[2]: I have opened a ticket in Indy's issue tracker to add support for DefaultSystemCodePage when compiling for FreePascal/Lazarus.
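For comparison only, here is a rough Python sketch (not Indy, just standard-library calls) of the same two steps those parameters control: percent-decode to raw octets, interpret them with a byte encoding (the AByteEncoding role), then re-encode for a destination code page (the ADestEncoding role).
from urllib.parse import unquote_to_bytes

octets = unquote_to_bytes('%C3%84%C3%96%C3%9C')   # raw url octets: C3 84 C3 96 C3 9C
text = octets.decode('utf-8')                     # interpret the octets as UTF-8 -> 'ÄÖÜ'
print(text)
print(text.encode('cp1252'))                      # re-encode for a CP1252 destination -> b'\xc4\xd6\xdc'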
With the following change in TIdURI.URLDecode (lines 386ff), LazUTF8 can be used:
{$IFDEF FPC}
Result := string(AByteEncoding.GetString(LBytes));
{$ELSE}
{$IFDEF STRING_IS_ANSI}
EnsureEncoding(ADestEncoding, encOSDefault);
CheckByteEncoding(LBytes, AByteEncoding, ADestEncoding);
SetString(Result, PAnsiChar(LBytes), Length(LBytes));
{$ELSE}
Result := AByteEncoding.GetString(LBytes);
{$ENDIF}
{$ENDIF}
Notes
This change assumes that the LazUTF8 unit is always used, and the Indy source code change needs to be reapplied every time a new Indy version is installed.
Also, I found no way to fix TIdURI.URLDecode in a way that works both with and without LazUTF8.

Python Requests taking a long time

Basically I am working on a Python project where I download and index files from the SEC EDGAR database. The problem, however, is that when using the requests module, it takes a very long time to save the text in a variable (between ~130 and 170 seconds for one file).
The file has roughly 16 million characters, and I wanted to see if there was any way to easily lower the time it takes to retrieve the text. Example:
import requests
url ="https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
What I found is in the code for r.text, specifically when no encoding was given (r.encoding is None). The time spent detecting the encoding was 20 seconds; I was able to skip it by defining the encoding.
...
r.encoding = 'utf-8'
...
Additional details
In my case, my request was not returning an encoding type. The response was 256k in size, and computing r.apparent_encoding was taking 20 seconds.
Looking into the text property function: it tests whether there is an encoding. If it is None, it calls the apparent_encoding function, which scans the text to autodetect the encoding scheme.
On a long string this takes time. By defining the encoding of the response (as described above), you skip the detection.
Validate that this is your issue
In your example above:
from datetime import datetime
import requests

url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)            # None if the server did not declare a charset
print(datetime.now())
enc = r.apparent_encoding    # character-detection scan; this is the slow part
print(enc)
print(datetime.now())
print(r.text)                # with no encoding set, this triggers the same scan
print(datetime.now())
r.encoding = enc             # now the scan is skipped
print(r.text)
print(datetime.now())
Of course, the output may get lost in the printing, so I recommend running the above in an interactive shell; it may become more apparent where you are losing the time, even without printing datetime.now().
From @martijn-pieters:
Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
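Building on that comment, here is a minimal sketch (assuming you only need the document on disk and never need r.text) that streams the raw bytes straight to a file and never decodes them:
import requests

url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"

# stream=True keeps the body out of memory; iter_content yields raw, undecoded chunks
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open("goog10-kq42016.htm", "wb") as f:
        for chunk in r.iter_content(chunk_size=64 * 1024):
            f.write(chunk)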

ESP8266 String max size 247 Bytes?

I am programming an ESP8266 NodeMCU with a Lua script. I was debugging a problem where strings were simply being cut off, and investigated further. I send data from the ESP8266 to an Android phone.
Testing the ESP over the UART interface showed the following problem: the maximum string size when I declare a string variable is 247 characters. As soon as I exceed the 247th character there is an error:
stdin:1: unexpected symbol near '='
The string is obviously too long, but I need to send at least 2048 bytes per string for maximum efficiency. Is it possible to extend the input limit of a string variable?
(I build a 2048-byte packet plus 86 bytes of overhead for the HTTP GET response.)
The TCP Tx buffer of the ESP8266 is 2920 bytes.
str_resp0 = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n";
str_resp1 = "Connection: close\r\n\r\n";
send_buf = "";

uart.on("data", "$",
    function(data)
        t = {send_buf, data}
        send_buf = table.concat(t);
        if data == "quit$" then
            uart.on("data") -- quit the function
        end
    end, 0)

srv = net.createServer(net.TCP)
srv:listen(80, function(conn)
    conn:on("receive", function(conn, payload)
        --print(payload)
        conn:send(str_resp0)
        conn:send(str_resp1)
        conn:send(send_buf)
        send_buf = "";
    end)
    conn:on("sent", function(conn) conn:close() end)
end)
stdin:1: unexpected symbol near '='
sounds very much like a problem with your "IDE" (ESPlorer?).
Furthermore, you should send long payloads in batches. The SDK limits the packet size to some 1500 bytes, which is the standard Ethernet MTU.
http://nodemcu.readthedocs.io/en/latest/en/modules/net/#netsocketsend has some explanations and a nice example.
srv = net.createServer(net.TCP)
srv:listen(80, function(conn)
    conn:on("receive", function(sck, req)
        local response = {}
        -- if you're sending back HTML over HTTP you'll want something like this instead
        -- local response = {"HTTP/1.0 200 OK\r\nServer: NodeMCU on ESP8266\r\nContent-Type: text/html\r\n\r\n"}
        response[#response + 1] = "lots of data"
        response[#response + 1] = "even more data"
        response[#response + 1] = "e.g. content read from a file"

        -- sends and removes the first element from the 'response' table
        local function send(sk)
            if #response > 0 then
                sk:send(table.remove(response, 1))
            else
                sk:close()
                response = nil
            end
        end

        -- triggers the send() function again once the first chunk of data was sent
        sck:on("sent", send)
        send(sck)
    end)
end)
Update 2016-07-18
Results of some experimenting in ESPlorer. The test uses a single initialized variable:
buffer = "A very long string with more than 250 chars..."
Hitting 'Send to ESP' (sending char-by-char over UART) will fail with an error similar to the one reported here.
Saving this into a file works fine.
Hitting 'Save' will save the line to a file on your file system, e.g. paul.lua.
Hitting 'Save to ESP' will send paul.lua to the device.
If it's not done automatically as part of 'Save to ESP' you can send dofile("paul.lua") to the device. This will make the variable buffer available in the global space.
Sending print(buffer) will print the entire string to the terminal window.
The problem, it seems, is unavoidable. Quoting the Espressif forums:
"the delay added by the lower level code is 20ms it is documented!"
so an event frame can't be handled faster than this. The only way to work around it might be to buffer the data in the microcontroller and send it all at once every 20 ms, or to install the so-called "FreeRTOS SDK for ESP8266", whose transfer speed is only limited by the UART speed.

AT+CMGS returns ERROR

I am using a SIM900 GSM module connected to my AVR microcontroller.
I tested it with an FT232 to see the transmitted data.
First the micro sends AT and the module responds with OK:
AT OK
AT+CMGF=1 OK
AT+CMGS="+9893XXXXXX" returns ERROR and doesn't show ">".
Could anybody advise me what to do?
The command AT+CSCS? will tell you which SMS character encoding is used. The proper answer is "GSM"; if it is not, you should set it with the command AT+CSCS="GSM".
And remember to finish the SMS text with "Ctrl+Z" (not "Enter").
You aren't passing all the parameters to the command.
The command format is:
AT+CMGS=<number><CR><message><CTRL-Z>
Where:
<CR> = ASCII character 13
<CTRL-Z> = ASCII character 26
You have passed only the number, and without the <CR> you won't see the > prompt for the message.
Example:
AT+CMGS="+9893XXXXXX"
> This is the message.<CTRL-Z>
The response is:
+CMGS:<mr>
OK
Where <mr> is the message reference.
If the AT+CSCS? command returns UCS2, then many arguments need to be encoded as a hex string of their UTF-16 encoding, so the phone number would become "002B0039003800390033...", and the SMS text would need to be encoded in the same way. If you don't need UCS2 encoding, then the easiest thing to do is to switch to the GSM encoding (or another encoding from the available set, as shown by the AT+CSCS=? command).
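If you do stay in UCS2 mode, the hex string is just the UTF-16 (big-endian) code units of the text, written as four hex digits per character. A small Python sketch of how the (partly elided) number from the question would be encoded:
number = '+9893XXXXXX'   # number as written in the question, X's left as-is
ucs2_hex = number.encode('utf-16-be').hex().upper()
print(ucs2_hex)          # 002B0039003800390033... (each character becomes 4 hex digits)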
Sometimes the issue is the text mode you are in. Enter AT+CMGF? and you should receive +CMGF: 1. If instead you receive +CMGF: 0, enter AT+CMGF=1. This changes the message format from PDU mode to Text mode. I'm not sure what either of those mean exactly, but this fixed my issue.
SIM 800 AT command manual

How to process and save HTTP body as-is in Haskell?

I have tried the following code to download HTML, but it actually transforms non-ASCII characters into series of decoded characters like <U+009B> and 0033200400\0031\0031.
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

download url path = do
    src <- openURL url
    writeFile path src
How can I change this code to write the HTTP response exactly as received? And how should one search and manipulate strings in such content?
String output like "\1234\5678" is actually only two characters long; the data is preserved, but you need to interpret it correctly. Probably the best way to do that is to use Text which, instead of being a list of Chars, is a packed representation of Unicode text.
To do this, you need to use a slightly more general interface in HTTP: mkRequest :: BufferType ty => RequestMethod -> URI -> Request ty. Text does not directly instantiate BufferType, so we'll go through ByteString, which represents binary chunks of data; it has no particular interpretation of the encoding of that data.
We can then use decodeUtf8 to decode the raw UTF-8 bytes into Text:
import Data.Text
import Data.Text.Encoding
import Data.ByteString
\uri -> do
    rawData <- getResponseBody =<< simpleHTTP (mkRequest GET uri) :: IO ByteString
    return (decodeUtf8 rawData)
Note that decodeUtf8 is partial; it may fail in a way that cannot be caught in pure code, mandating a restart or a handler all the way up in your IO stack. If this is undesirable, i.e. if there's a good chance you're downloading text which isn't valid UTF-8, then you can use decodeUtf8', which returns an Either.
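For comparison only, the same strict-versus-tolerant trade-off in a Python sketch: a strict decode raises on invalid input (roughly what the partial decodeUtf8 does), while a checked or lenient decode lets you handle bad bytes explicitly (roughly the role of decodeUtf8' returning an Either).
raw = b"Lo\xc3\xafc\xff"                          # trailing 0xFF is not valid UTF-8
try:
    text = raw.decode("utf-8")                    # strict decode: raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc)                  # handle the failure instead of crashing
    text = raw.decode("utf-8", errors="replace")  # lenient fallback: U+FFFD for bad bytes
print(text)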
