Read binary data response as uint16array using urllib2 in python 3.9 - http

I am trying to read a URL response into a uint16 array buffer, using urllib for HTTP communication in Python 3.9.
I tried np.frombuffer(url.read(), dtype=np.uint16), but it does not give the correct array.
For example, the array at the server is {1000, 2000, 3000, 4000} (a uint16 array), which in hexadecimal is {0x03E8, 0x07D0, 0x0BB8, 0x0FA0}.
It is sent as binary octets directly from memory using the array pointer. The received data is correct when I use the requests library, as below:
np.frombuffer(response.content, dtype=np.uint16)
Can anyone help with the syntax to use?
--Kuldeep
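One thing worth ruling out is byte order. The sketch below uses only the standard library to show how the same four values decode under little-endian and big-endian interpretation. The helper name decode_uint16 is made up for illustration, and the sketch assumes the server sends the array in little-endian order. With NumPy, the byte order can be stated explicitly in the dtype string ('<u2' for little-endian, '>u2' for big-endian) instead of relying on the native-order np.uint16.

```python
import struct


def decode_uint16(payload, byteorder="<"):
    """Decode raw bytes as 16-bit unsigned integers ('<' little-endian, '>' big-endian)."""
    if len(payload) % 2:
        raise ValueError("payload length must be a multiple of 2 bytes")
    count = len(payload) // 2
    return list(struct.unpack(f"{byteorder}{count}H", payload))


# Bytes as a little-endian server would send {1000, 2000, 3000, 4000} (assumption).
payload = struct.pack("<4H", 1000, 2000, 3000, 4000)

little = decode_uint16(payload, "<")  # same byte order as the sender
big = decode_uint16(payload, ">")     # opposite byte order: every value has its bytes swapped

print(little)
print(big)
```

If the received values look like the second list, the payload is being interpreted with the opposite byte order.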

Related

Run HTTP requests with PySpark in parallel and asynchronously

I have a text file containing several million URLs and I have to run a POST request for each of those URLs.
I tried to do it on my machine but it is taking forever so I would like to use my Spark cluster instead.
I wrote this PySpark code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
import requests

spark = SparkSession.builder.getOrCreate()  # already defined in the pyspark shell

url = ["http://myurltoping.com"]
list_urls = url * 1000  # The final code will just import my text file
list_urls_df = spark.createDataFrame(list_urls, StringType())
print('number of partitions: {}'.format(list_urls_df.rdd.getNumPartitions()))

def execute_requests(list_of_url):
    # Runs once per partition; each row exposes its URL as row.value
    final_iterator = []
    for url in list_of_url:
        r = requests.post(url.value)
        final_iterator.append((r.status_code, r.text))
    return iter(final_iterator)

processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)
But it still takes a long time. How can I make the function execute_requests more efficient, for example by launching the requests in each partition asynchronously?
Thanks!
Using the Python package grequests (installable with pip install grequests) might be an easy solution for your problem without using Spark.
The documentation (available at https://github.com/kennethreitz/grequests) gives a simple example:
import grequests
urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://fakedomain/',
    'http://kennethreitz.com'
]
Create a set of unsent Requests:
>>> rs = (grequests.get(u) for u in urls)
Send them all at the same time:
>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
I found out that using gevent within a foreach on a Spark DataFrame results in some weird errors and does not work. It seems as if Spark also relies on gevent, which is used by grequests...
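If gevent does not cooperate with Spark, a thread pool from the standard library is another way to overlap the requests inside each partition. This is a minimal sketch, not the method from the answer above: execute_requests_concurrently and its send parameter are hypothetical names, and send stands for whatever performs one request (for example, a function that calls requests.post and returns the status code). A stand-in is used here so the sketch runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_requests_concurrently(urls, send, max_workers=16):
    """Call send(url) for each URL on a thread pool; return (url, result) pairs in input order."""
    urls = list(urls)  # mapPartitions passes an iterator, so materialise it once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the calls concurrently but yields results in input order
        results = list(pool.map(send, urls))
    return iter(zip(urls, results))


# Stand-in for a real network call (assumption for the demo).
def fake_send(url):
    return len(url)


pairs = list(execute_requests_concurrently(iter(["a", "bb", "ccc"]), fake_send, max_workers=2))
print(pairs)
```

In the Spark job from the question, this could be plugged in as list_urls_df.rdd.mapPartitions(lambda rows: execute_requests_concurrently((row.value for row in rows), send)), where send wraps requests.post.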

Python Requests taking a long time

I am working on a Python project where I download and index files from the SEC EDGAR database. The problem is that, when using the requests module, it takes a very long time to save the text in a variable (roughly 130 to 170 seconds for one file).
The file has around 16 million characters, and I wanted to see if there is an easy way to lower the time it takes to retrieve the text. Example:
import requests
url ="https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
What I found is in the code for r.text, specifically the case when no encoding is given (r.encoding is None). Detecting the encoding took 20 seconds, and I was able to skip that step by setting the encoding explicitly.
...
r.encoding = 'utf-8'
...
Additional details
In my case, the response did not declare an encoding. The response was 256k in size, and r.apparent_encoding took 20 seconds.
Looking into the text property: it checks whether an encoding is set. If it is None, it uses apparent_encoding, which scans the content to autodetect the encoding.
On a long string this takes time. By setting the encoding of the response (as described above), you skip the detection.
Validate that this is your issue
In your example above:
from datetime import datetime
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)
print(datetime.now())
enc = r.apparent_encoding
print(enc)
print(datetime.now())
print(r.text)
print(datetime.now())
r.encoding = enc
print(r.text)
print(datetime.now())
Of course, the timestamps may get lost among the printed output, so I recommend running the above in an interactive shell. It may become apparent where you are losing the time even without printing datetime.now().
From @martijn-pieters:
Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
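Following that advice, here is a sketch of writing the body to disk in chunks instead of decoding and printing it. The helper save_chunks is a hypothetical name. It accepts any iterable of byte chunks, which with requests could be r.iter_content(chunk_size=65536) on a response opened with stream=True. The demo feeds it fake chunks so that it runs without network access.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the number of bytes written."""
    written = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += fh.write(chunk)
    return written


# Fake chunks standing in for r.iter_content(chunk_size=65536) (assumption for the demo).
fake_chunks = [b"<html>", b"", b"10-K", b"</html>"]
target = os.path.join(tempfile.mkdtemp(), "filing.htm")
total = save_chunks(fake_chunks, target)

with open(target, "rb") as fh:
    saved = fh.read()
```

Because the bytes are written as received, no encoding detection or console output is involved.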

python3 imaplib search function encoding

Can someone show me how to search properly using imaplib in Python? The email server is Microsoft Exchange, which seems to have problems, but I would like a solution from the Python/imaplib side.
https://github.com/barbushin/php-imap/issues/128
So far I use:
import imaplib
M = imaplib.IMAP4_SSL(host_name, port_name)
M.login(u, p)
M.select()
s_str = 'hello'
M.search(s_str)
And I get the following error:
>>> M.search(s_str)
('NO', [b'[BADCHARSET (US-ASCII)] The specified charset is not supported.'])
search takes two or more parameters: a charset, followed by the search criteria. You can pass None as the charset to not specify one. hello is not a valid charset.
You also need to specify what you are searching: IMAP has a complex search language, detailed in RFC 3501 §6.4.4, and imaplib does not provide a high-level interface for it.
So, with both of those in mind, you need to do something like:
M.search(None, 'BODY', '"HELLO"')
or
M.search(None, 'FROM', '"HELLO"')
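To avoid quoting mistakes when the search term comes from a variable, the arguments can be built by a small helper. This is a sketch under assumptions: quote_imap and search_args are hypothetical helpers, not part of imaplib, and they only handle ASCII terms (non-ASCII terms need a charset and are not covered here).

```python
def quote_imap(value):
    """Wrap value in double quotes, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def search_args(key, value):
    """Build the arguments for IMAP4.search: no charset, a search key, and a quoted value."""
    return (None, key, quote_imap(value))


args = search_args("BODY", "hello world")
print(args)
```

With a logged-in connection M and a selected mailbox, the call would then be typ, data = M.search(*search_args('BODY', 'hello world')).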

How to process and save HTTP body as-is in Haskell?

I have tried the following code to download HTML, but it transforms non-ASCII characters into series of decoded characters like < U+009B> and 0033200400\0031\0031.
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

download url path = do
    src <- openURL url
    writeFile path src
How can I change the code above to write the HTTP response exactly as received? How should one search and manipulate strings in such content?
The string output like "\1234\5678" is actually only two characters long—the data is preserved, but you need to interpret it correctly. Probably the best way to do that is to use Text which, instead of being a list of Chars, is actually a byte array representing UTF-8 codepoints.
To do this, you need to use a slightly more general interface in HTTP mkRequest :: BufferType ty => RequestMethod -> URI -> Request ty. Text does not directly instantiate BufferType, so we'll go through ByteString, which represents binary chunks of data—it has no particular interpretation of the encoding of that data.
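We can then use decodeUtf8 to decode the raw UTF-8 bytes into Text:
import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)
import Network.HTTP

\uri -> do
    -- Fetch the body as raw bytes; the annotation selects the ByteString instance of BufferType
    rawData <- getResponseBody =<< simpleHTTP (mkRequest GET uri) :: IO ByteString
    return (decodeUtf8 rawData)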
Note that decodeUtf8 is partial—it may fail in a way that cannot be caught in pure code mandating a restart or handler all the way up in your IO stack. If this is undesirable, if there's a good chance that you're downloading text which isn't valid UTF-8, then you can use decodeUtf8' which returns an Either.

Haskell Network.HTTP incorrectly downloading image

I'm trying to download images using the Network.HTTP module and having little success.
import Network.HTTP

main = do
    jpg <- get "http://www.irregularwebcomic.net/comics/irreg2557.jpg"
    writeFile "irreg2557.jpg" jpg
  where
    get url = simpleHTTP (getRequest url) >>= getResponseBody
The output file appears in the current directory, but fails to display in Chromium or Ristretto. Ristretto reports "Error interpreting JPEG image file (Not a JPEG file: starts with 0xc3 0xbf)".
writeFile :: FilePath -> String -> IO ()
String. That's your problem, right there. String is for Unicode text. Attempting to store binary data in it will lead to corruption. It's not clear in this case whether the corruption is being done by simpleHTTP or by writeFile, but it's ultimately unimportant. You're using the wrong type, and something is corrupting the data when confronted with bytes that don't make up a valid Unicode encoding.
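The byte values in the error message are consistent with this explanation. A JPEG file starts with the bytes 0xff 0xd8. If each byte is treated as a character and the resulting text is written out as UTF-8, the first character becomes the two bytes 0xc3 0xbf. The short Python sketch below reproduces that transformation. It assumes the bytes were mapped one-to-one to code points (Latin-1) and then written as UTF-8, which is one plausible path and not a confirmed description of what the Haskell libraries did.

```python
# First four bytes of a typical JPEG (JFIF) file.
jpeg_magic = b"\xff\xd8\xff\xe0"

# Treat each byte as one character, as a text-based API would (assumption).
as_text = jpeg_magic.decode("latin-1")

# Write the characters back out as UTF-8 text.
rewritten = as_text.encode("utf-8")

# The file now starts with different bytes and is twice as long.
print(rewritten[:2].hex())
print(len(jpeg_magic), len(rewritten))
```

Every byte above 0x7f turns into two bytes, so the file no longer starts with the JPEG marker.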
As for fixing this, newer versions of HTTP are polymorphic in their return type, and can handle returning the raw bytes in a ByteString. You just need to change how you're writing the bytes to the file, so that it won't infer that you want a String.
import qualified Data.ByteString as B
import Network.HTTP
import Network.URI (parseURI)

main = do
    jpg <- get "http://www.irregularwebcomic.net/comics/irreg2557.jpg"
    B.writeFile "irreg2557.jpg" jpg
  where
    get url =
        let uri = case parseURI url of
                Nothing -> error $ "Invalid URI: " ++ url
                Just u -> u
        in simpleHTTP (defaultGETRequest_ uri) >>= getResponseBody
The construction to get a polymorphic Request is a bit clumsy. If issue #1 ever gets fixed then using getRequest url will suffice.

Resources