Haskell Network.HTTP incorrectly downloading image - http

I'm trying to download images using the Network.HTTP module and having little success.
import Network.HTTP

main = do
    jpg <- get "http://www.irregularwebcomic.net/comics/irreg2557.jpg"
    writeFile "irreg2557.jpg" jpg
  where
    get url = simpleHTTP (getRequest url) >>= getResponseBody
The output file appears in the current directory, but fails to display in Chromium or Ristretto. Ristretto reports "Error interpreting JPEG image file (Not a JPEG file: starts with 0xc3 0xbf)".

writeFile :: FilePath -> String -> IO ()
String. That's your problem, right there. String is for unicode text. Attempting to store binary data in it will lead to corruption. It's not clear in this case whether the corruption is being done by simpleHTTP or by writeFile, but it's ultimately unimportant. You're using the wrong type, and something is corrupting the data when confronted with bytes that don't make up a valid unicode encoding.
As for fixing this, newer versions of HTTP are polymorphic in their return type, and can handle returning the raw bytes in a ByteString. You just need to change how you're writing the bytes to the file, so that it won't infer that you want a String.
import qualified Data.ByteString as B
import Network.HTTP
import Network.URI (parseURI)

main = do
    jpg <- get "http://www.irregularwebcomic.net/comics/irreg2557.jpg"
    B.writeFile "irreg2557.jpg" jpg
  where
    get url = let uri = case parseURI url of
                          Nothing -> error $ "Invalid URI: " ++ url
                          Just u -> u in
              simpleHTTP (defaultGETRequest_ uri) >>= getResponseBody
The construction to get a polymorphic Request is a bit clumsy. If issue #1 ever gets fixed then using getRequest url will suffice.
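The reported error is consistent with one specific corruption path, sketched here in Python purely as an illustration. The Latin-1 step is an assumption about what happened, not something the answer above confirms: a JPEG file starts with the bytes 0xff 0xd8, and if each byte is treated as a character and the resulting text is written out as UTF-8, the file ends up starting with 0xc3 0xbf.

```python
# A JPEG file begins with the two-byte marker 0xFF 0xD8.
jpeg_magic = bytes([0xFF, 0xD8])

# Assumption: each byte is mapped to the code point with the same number,
# which is what decoding as Latin-1 does.
as_text = jpeg_magic.decode("latin-1")

# Writing that text back out as UTF-8 turns every byte >= 0x80 into two bytes.
corrupted = as_text.encode("utf-8")

print(corrupted.hex(" "))  # c3 bf c3 98
```

Keeping the data as bytes from download to disk avoids both conversions.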

Related

Extract Hyperlink from a spool pdf file in Python

I am getting my form data from frontend and reading it using fast api as shown below:
@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    print("Content = ", pdf.content_type, pdf.filename, pdf.spool_max_size)
    return {"filename": "Success"}
Now what I need to do is extract hyperlinks from these spool Files with the help of pypdfextractor as shown below:
import pdfx
from os.path import exists
from config import availableUris
def getHrefsFromPDF(pdfPath: str) -> dict:
    if not exists(pdfPath):
        raise FileNotFoundError("PDF File not Found")
    pdf = pdfx.PDFx(pdfPath)
    return pdf.get_references_as_dict().get('url', [])
But I am not sure how to convert spool file (Received from FAST API) to pdfx readable file format.
Additionally, I also tried to study the bytes that come out of the file. When I try to do this:
data = await pdf.read()
The type of data is bytes. Converting it with str() gives an escaped string that looks like gibberish to me, and decoding it as "utf-8" raises UnicodeDecodeError.
fastapi gives you a SpooledTemporaryFile. You may be able to use that file object directly if there is some api in pdfx which will work on a File() object rather than a str representing a path (!). Otherwise make a new temporary file on disk and work with that:
from tempfile import TemporaryDirectory
from pathlib import Path
import pdfx

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    with TemporaryDirectory() as d:  # temporary storage so the file can be re-read by path
        tmpf = Path(d) / "pdf.pdf"
        with tmpf.open("wb") as f:
            f.write(pdf.file.read())  # pdf.file is the underlying SpooledTemporaryFile; this read is synchronous
        p = pdfx.PDFx(str(tmpf))
        ...
It may be that pdfx.PDFx will take a Path object. I'll update this answer if so. I've kept the read and write synchronous for simplicity, but you can make them asynchronous if there is a reason to do so.
Note that it would be better to find a way of doing this with the SpooledTemporaryFile.
As to your data showing as bytes: well, pdfs are (basically) binary files: what did you expect?
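The copy-to-disk approach above can be sketched without FastAPI or pdfx. In this sketch `with_temp_copy` and the `process` callback are hypothetical names, and a `SpooledTemporaryFile` stands in for the file object behind the upload; the callback is the place where a path-based API such as `pdfx.PDFx` would be called.

```python
import shutil
from pathlib import Path
from tempfile import SpooledTemporaryFile, TemporaryDirectory


def with_temp_copy(fileobj, process, name="upload.pdf"):
    """Copy a binary file object to a real file on disk and call process(path).

    The temporary directory (and the copy in it) is removed on return.
    """
    with TemporaryDirectory() as d:
        path = Path(d) / name
        fileobj.seek(0)  # the upload may already have been read once
        with path.open("wb") as out:
            shutil.copyfileobj(fileobj, out)  # chunked copy, stays binary
        return process(str(path))


# Stand-in for the uploaded file: a spooled file holding some bytes.
upload = SpooledTemporaryFile(max_size=1024)
upload.write(b"%PDF-1.4 example bytes \xff\xfe")

# Hypothetical processing step: report the size of the file on disk.
size = with_temp_copy(upload, lambda p: Path(p).stat().st_size)
print(size)
```

Because the callback runs inside the `with` block, the path is still valid while it is being processed, and nothing is ever decoded as text.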

Encode image file to base64

I'm having trouble converting an image to base64 and sending it through an XML-RPC client. The XML-RPC server responds with this error:
a bytes-like object is required, not '_io.BufferedReader'
import base64
with open(full_path, 'rb') as imgFile:
    image = base64.b64encode(imgFile)
You passed the file object itself, but b64encode needs the binary data. Read the file's contents first:
import base64
with open(full_path, 'rb') as imgFile:
    image = base64.b64encode(imgFile.read())
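As a quick check of the corrected call, here is a minimal sketch on in-memory bytes. `b64encode` returns `bytes`; decoding the result as ASCII to get a `str` is an assumption about what the receiving side wants. For XML-RPC specifically, the standard library also offers `xmlrpc.client.Binary` for carrying raw bytes.

```python
import base64

# First three bytes of a typical JPEG file, standing in for imgFile.read().
raw = bytes([0xFF, 0xD8, 0xFF])

encoded = base64.b64encode(raw)    # bytes in, bytes out
as_str = encoded.decode("ascii")   # base64 output is plain ASCII

# Decoding restores the original bytes exactly.
assert base64.b64decode(as_str) == raw
print(as_str)  # /9j/
```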

Python Requests taking a long time

I am working on a Python project where I download and index files from the SEC EDGAR database. The problem is that when using the requests module, it takes a very long time to save the text in a variable (roughly 130 to 170 seconds for one file).
The file roughly has around 16 million characters, and I wanted to see if there was any way to easily lower the time it takes to retrieve the text. -- Example:
import requests
url ="https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
What I found is in the code for r.text, specifically for the case where no encoding was given (r.encoding is None). Detecting the encoding took 20 seconds; I was able to skip that step by setting the encoding explicitly.
...
r.encoding = 'utf-8'
...
Additional details
In my case, my request was not returning an encoding type. The response was 256k in size, the r.apparent_encoding was taking 20 seconds.
Looking into the text property: it checks whether an encoding is set. If it is None, it falls back to apparent_encoding, which scans the body to autodetect the encoding.
On a long string this will take time. By defining the encoding of the response ( as described above), you will skip the detection.
Validate that this is your issue
in your above example :
from datetime import datetime
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)
print(datetime.now())
enc = r.apparent_encoding
print(enc)
print(datetime.now())
print(r.text)
print(datetime.now())
r.encoding = enc
print(r.text)
print(datetime.now())
Of course, the timestamps may get lost among the printed output, so I recommend running the above in an interactive shell. It may become apparent where you are losing the time even without printing datetime.now().
From @martijn-pieters
Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
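A sketch of the suggestion in that comment: stream the body to disk in chunks rather than decoding and printing it. `save_body` is a hypothetical helper; it only assumes an object with an `iter_content(chunk_size=...)` method, which is what a requests response obtained with `stream=True` provides. The small stand-in class exists only so that the sketch runs without network access.

```python
import os
import tempfile


def save_body(response, path, chunk_size=64 * 1024):
    """Write a streamed response body to `path` and return the byte count."""
    written = 0
    with open(path, "wb") as out:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                out.write(chunk)
                written += len(chunk)
    return written


class FakeResponse:
    """Stand-in for a streamed response (an assumption made for the demo)."""

    def __init__(self, body):
        self._body = body

    def iter_content(self, chunk_size=1):
        for start in range(0, len(self._body), chunk_size):
            yield self._body[start:start + chunk_size]


target = os.path.join(tempfile.mkdtemp(), "filing.htm")
total = save_body(FakeResponse(b"x" * 200_000), target)
print(total)
```

With the real library, the call would look like `save_body(requests.get(url, stream=True), "filing.htm")`. No encoding detection happens on this path because the body is never turned into text.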

How to overcome Python 3.4 NameError: name 'basestring' is not defined

I've got a file called hello.txt in the local directory along side the test.py, which contains this Python 3.4 code:
import easywebdav
webdav = easywebdav.connect('192.168.1.6', username='myUser', password='myPasswd', protocol='http', port=80)
srcDir = "myDir"
webdav.mkdir(srcDir)
webdav.upload("hello.txt", srcDir)
When I run this I get this:
Traceback (most recent call last):
File "./test.py", line 196, in <module>
webdav.upload("hello.txt", srcDir)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/easywebdav/client.py", line 153, in upload
if isinstance(local_path_or_fileobj, basestring):
NameError: name 'basestring' is not defined
Googling this results in several hits, all of which point to the same fix which, in case the paths moved in future, is to include "right after import types":
try:
    unicode = unicode
except NameError:
    # 'unicode' is undefined, must be Python 3
    str = str
    unicode = str
    bytes = bytes
    basestring = (str,bytes)
else:
    # 'unicode' exists, must be Python 2
    str = str
    unicode = unicode
    bytes = str
    basestring = basestring
I wasn't using import types, but to include it or not doesn't appear to make a difference in PyDev - I get an error either way. The line which causes an error is:
unicode = unicode
saying, 'undefined variable'.
OK my python knowledge falters at this point and I've looked for similar posts on this site and not found one specific enough to basestring that I understand to help. I know I need to specify basestring but I don't know how to. Would anyone be charitable enough to point me in the right direction?
You can change easywebdav's client.py file like the top two changes in this checkin: https://github.com/hhaderer/easywebdav/commit/983ced508751788434c97b43586a68101eaee67b
The changes consist in replacing basestring by str in client.py.
I came up with an elegant pattern that does not require modification of any source files. Please note it might be extended for other modules to keep all 'hacks' in one place:
# py3ports.py
import easywebdav.client
easywebdav.basestring = str
easywebdav.client.basestring = str
# mylib.py
from py3ports import easywebdav
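Why assigning to the module attribute works can be shown with a stand-in module. In this sketch `legacy_client` and `is_path` are hypothetical and merely imitate Python 2 era code that refers to `basestring`; they are not part of easywebdav.

```python
import types

# Stand-in for a Python 2 era module; "legacy_client" is hypothetical.
legacy_client = types.ModuleType("legacy_client")
exec(
    "def is_path(obj):\n"
    "    return isinstance(obj, basestring)\n",
    legacy_client.__dict__,
)

# On Python 3 the name lookup fails when the function is called.
try:
    legacy_client.is_path("hello.txt")
    first_error = None
except NameError as err:
    first_error = str(err)

# The patch: add the missing name to the module's own namespace.
legacy_client.basestring = str

patched_result = legacy_client.is_path("hello.txt")
print(first_error)
print(patched_result)
```

A function resolves global names in the namespace of the module that defines it, which is why the name has to be set on that module (as the pattern above does for `easywebdav.client`) rather than in the calling script.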

How to process and save HTTP body as-is in Haskell?

I tried the following code to download HTML, but it turns non-ASCII characters into escape-like sequences such as <U+009B> and 0033200400\0031\0031.
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

download url path = do src <- openURL url
                       writeFile path src
How to change the following code to write HTTP response exactly as received? How should one search and manipulate with strings in such content?
The string output like "\1234\5678" is actually only two characters long—the data is preserved, but you need to interpret it correctly. Probably the best way to do that is to use Text which, instead of being a list of Chars, is actually a byte array representing UTF-8 codepoints.
To do this, you need to use a slightly more general interface in HTTP mkRequest :: BufferType ty => RequestMethod -> URI -> Request ty. Text does not directly instantiate BufferType, so we'll go through ByteString, which represents binary chunks of data—it has no particular interpretation of the encoding of that data.
We can then use decodeUtf8 to convert the raw bytes to UTF-8 Text
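import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)
import Network.HTTP

\uri -> do
    -- The body is requested as raw bytes (ByteString), not Text.
    rawData <- getResponseBody =<< simpleHTTP (mkRequest GET uri) :: IO ByteString
    return (decodeUtf8 rawData)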
Note that decodeUtf8 is partial—it may fail in a way that cannot be caught in pure code mandating a restart or handler all the way up in your IO stack. If this is undesirable, if there's a good chance that you're downloading text which isn't valid UTF-8, then you can use decodeUtf8' which returns an Either.
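The same trade-off between a decoder that fails loudly and one that reports failure as a value can be illustrated in Python. This is an analogy only: `decode_utf8_either` is a hypothetical helper, not part of any Haskell or Python library.

```python
def decode_utf8_either(raw):
    """Return ("ok", text) or ("error", message) instead of raising."""
    try:
        return ("ok", raw.decode("utf-8"))
    except UnicodeDecodeError as err:
        return ("error", str(err))


good = decode_utf8_either("naïve".encode("utf-8"))
bad = decode_utf8_either(b"\xff\xd8 not text")

print(good)    # ('ok', 'naïve')
print(bad[0])  # error
```

The caller then has to inspect the tag before using the text, which is roughly the role the Either result plays for decodeUtf8'.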

Resources