Run HTTP requests with PySpark in parallel and asynchronously - http

I have a text file containing several million URLs and I have to run a POST request for each of those URLs.
I tried to do it on my machine but it is taking forever so I would like to use my Spark cluster instead.
I wrote this PySpark code:
from pyspark.sql.types import StringType
import requests

url = ["http://myurltoping.com"]
list_urls = url * 1000  # The final code will just import my text file
list_urls_df = spark.createDataFrame(list_urls, StringType())
print('number of partitions: {}'.format(list_urls_df.rdd.getNumPartitions()))

def execute_requests(list_of_url):
    final_iterator = []
    for url in list_of_url:
        r = requests.post(url.value)
        final_iterator.append((r.status_code, r.text))
    return iter(final_iterator)

processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)
but it is still taking a lot of time. How can I make the execute_requests function more efficient, for example by launching the requests in each partition asynchronously?
Thanks!

Using the Python package grequests (installable with pip install grequests) might be an easy solution for your problem without using Spark at all.
The documentation (found at https://github.com/kennethreitz/grequests) gives a simple example:
import grequests

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://fakedomain/',
    'http://kennethreitz.com'
]
Create a set of unsent Requests:
>>> rs = (grequests.get(u) for u in urls)
Send them all at the same time:
>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, None, <Response [200]>]
I found out that using gevent within a foreach on a Spark DataFrame results in some weird errors and does not work. It seems as if Spark also relies on gevent, which is what grequests uses under the hood...
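If gevent-based tools clash with Spark on your cluster, a thread pool inside each partition avoids gevent entirely. Below is a minimal sketch of how execute_requests could be rewritten with concurrent.futures; the worker count and the timeout are assumptions you would tune for your cluster and the target server.

from concurrent.futures import ThreadPoolExecutor
import requests

def execute_requests(rows):
    # Materialize the partition so the URLs can be fanned out to a thread pool.
    urls = [row.value for row in rows]

    def post(url):
        try:
            # The 10-second timeout is an arbitrary starting point.
            r = requests.post(url, timeout=10)
            return (r.status_code, r.text)
        except requests.RequestException as exc:
            return (None, str(exc))

    # 32 concurrent connections per partition is likewise a guess to tune.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return iter(list(pool.map(post, urls)))

processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)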

Related

Extract hyperlinks from a spooled PDF file in Python

I am getting my form data from the frontend and reading it using FastAPI as shown below:
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    print("Content = ", pdf.content_type, pdf.filename, pdf.spool_max_size)
    return {"filename": "Success"}
Now what I need to do is extract hyperlinks from these spooled files with the help of pdfx (pypdfextractor) as shown below:
import pdfx
from os.path import exists
from config import availableUris

def getHrefsFromPDF(pdfPath: str) -> dict:
    if not exists(pdfPath):
        raise FileNotFoundError("PDF File not Found")
    pdf = pdfx.PDFx(pdfPath)
    return pdf.get_references_as_dict().get('url', [])
But I am not sure how to convert the spooled file (received from FastAPI) into a format that pdfx can read.
I also tried to study the bytes that come out of the file. When I do this:
data = await pdf.read()
the data type shows as bytes. When I try to convert it using the str function, it gives an encoded string which is total gibberish to me; I also tried to decode it using "utf-8", which throws a UnicodeDecodeError.
FastAPI gives you a SpooledTemporaryFile. You may be able to use that file object directly if there is some API in pdfx which works on a file object rather than a str representing a path (!). Otherwise, make a new temporary file on disk and work with that:
from tempfile import TemporaryDirectory
from pathlib import Path
import pdfx

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    with TemporaryDirectory() as d:  # temporary storage so the upload can be re-read from disk
        tmpf = Path(d) / "pdf.pdf"
        with tmpf.open("wb") as f:
            f.write(pdf.file.read())  # read synchronously from the underlying SpooledTemporaryFile
        p = pdfx.PDFx(str(tmpf))
        ...
It may be that pdfx.PDFx will take a Path object. I'll update this answer if so. I've kept the read-write loop synchronous for ease, but you can make it asynchronous if there is a reason to do so.
Note that it would be better to find a way of doing this with the SpooledTemporaryFile itself.
As to your data showing as bytes: well, PDFs are (basically) binary files, so what did you expect?
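For completeness, here is a rough sketch of what the asynchronous variant could look like, using await pdf.read() to pull the upload into memory before writing it out; the endpoint path, the pdf.pdf filename and the returned JSON shape are carried over from the examples above and are otherwise arbitrary.

from tempfile import TemporaryDirectory
from pathlib import Path
import pdfx

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    data = await pdf.read()  # UploadFile.read() is a coroutine, so it must be awaited
    with TemporaryDirectory() as d:
        tmpf = Path(d) / "pdf.pdf"
        tmpf.write_bytes(data)
        p = pdfx.PDFx(str(tmpf))
        return {"urls": p.get_references_as_dict().get("url", [])}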

Pytest failing on file open command string assert - what's the best way to test this?

I am constructing a command to pass to the requests library to POST an attachment, as in:
files = attachment = {"attachment": ("image.png", open(r"C:\tmp\sensor.png", "rb"), "image/png")}
The code is working, but I cannot get pytest to test it as-is because of the open call, which is executed when evaluated. Here is simplified code showing the problem:
import pytest

def openfile():
    cmd = {"cmd": open(r"C:\tmp\sensor.png")}
    return cmd

def test_openfile():
    cmd = openfile()
    # assert str(cmd) == str({"cmd": open(r"C:\tmp\sensor.png")})  # this works
    assert cmd == {"cmd": open(r"C:\tmp\sensor.png")}  # this does not
Pytest complains that the two sides are different but then confirms they are the same in the diff panel!
Expected :{'cmd': <_io.TextIOWrapper name='C:\tmp\sensor.png' mode='r' encoding='cp1252'>}
Actual :{'cmd': <_io.TextIOWrapper name='C:\tmp\sensor.png' mode='r' encoding='cp1252'>}
'Click to see difference' - Opening diff panel reports 'Contents are identical'!
I can just stick with comparing the generated string with the expected string, but I am wondering if there is a better way to do this.
Ideas?
You need to test the properties of the actual file buffer that is returned by the open call, instead of the references to that buffer, for example:
def test_openfile():
    cmd = openfile()
    expected_filename = r"C:\tmp\sensor.png"
    assert "cmd" in cmd
    file_cmd = cmd["cmd"]
    assert file_cmd.name == expected_filename
    with open(expected_filename) as f:
        contents = f.read()
    assert file_cmd.read() == contents
Note that in a test you may not have the file contents, or have them in another place like a fixture, so testing the file contents may have to be adapted, or may not be needed, depending on what you want to test.
After talking this through with a friend, I think my original approach is perfectly valid. For anyone who trips over this question, here's why:
I am trying to use pytest to check the building of a parameter that is passed to another library for execution. The execution of the parameter is not relevant; what matters is that it is correctly formatted. The test compares what is generated with the expected parameter (as if I had typed it).
Therefore, casting to string or JSON and comparing is appropriate, since that is what a human does to manually check the code!
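A minimal sketch of that string-comparison approach, reusing the openfile helper from above; the repr of a file object shows its name, mode and encoding but not its identity, so two independently opened handles on the same file compare equal as strings.

def test_openfile_as_string():
    cmd = openfile()
    expected = {"cmd": open(r"C:\tmp\sensor.png")}
    # repr() of a TextIOWrapper includes name, mode and encoding but no object id,
    # so this passes when the command was built with the expected arguments.
    assert str(cmd) == str(expected)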

Reusable traversal components not always working with Gremlin

I'm trying to create reusable components for my traversals in gremlin-python by putting traversal steps into functions, and I'm running into a problem where some of the components aren't working correctly.
As setup, I'm running Gremlin Server using the Docker container, with the configuration file that loads the modern graph from the GitHub repo:
docker run -p 8182:8182 tinkerpop/gremlin-server:3.4.6 conf/gremlin-server-modern.yaml
My test python code looks like the following:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

def connect_gremlin(endpoint='ws://localhost:8182/gremlin'):
    return traversal().withRemote(DriverRemoteConnection(endpoint, 'g'))

def n():
    return __.values('name')

def r():
    return __.range(2, 4)

g = connect_gremlin()

# works as expected
g.V().map(n()).toList()

# returns an empty list
g.V().map(n()).filter(r()).toList()

# but using range step directly works as expected
g.V().map(n()).range(2, 4).toList()
I can successfully move the values step into a function but when I try to do the same thing with the range step it returns an empty list rather than the 2nd through 4th items. Anyone know what I'm doing wrong?
The map step is intended to map the state of each traverser to a new state. In the context of a single traverser a range starting anywhere but zero is not going to do what you expect.
Here are some examples using Python:
>>> g.V().map(__.range(0,1)).limit(5).toList()
[v[1400], v[1401], v[1402], v[1403], v[1404]]
>>> g.V().map(__.range(0,2)).limit(5).toList()
[v[1400], v[1401], v[1402], v[1403], v[1404]]
>>> g.V().map(__.range(1,2)).limit(5).toList()
[]
This is why the values step works inside a map step and range does not.
Rather than injecting code using a map step, why not just incrementally add to the traversal and then iterate it when complete?
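A minimal sketch of that incremental style, reusing connect_gremlin from the question; the helper names are made up, and each helper appends steps to the traversal it receives instead of wrapping an anonymous traversal in map or filter.

def add_names(t):
    # Append the values('name') step to an existing traversal.
    return t.values('name')

def add_page(t, low, high):
    # Append a range step to an existing traversal.
    return t.range(low, high)

g = connect_gremlin()
t = g.V()
t = add_names(t)
t = add_page(t, 2, 4)
print(t.toList())  # same results as g.V().values('name').range(2, 4)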

Build web graph with wget

I'm using wget with the -r (recursive) option to crawl and download all the pages starting from a root.
For debugging purposes I'd like to output which page routed me to another one, for example: https://stackoverflow.com/ -> https://stackoverflow.com/questions
Is there a way to do that?
Please note that I explicitly need to use wget.
The best solution I have found until now is to use the --warc-file option to export a WARC archive of my crawl. This format also stores the Referer header.
Using the warc Python library to read the output, I wrote the following simple script to export a CSV with source/target columns:
import warc

f = warc.open("crawler.warc")
for record in f:
    if record['WARC-Type'] != 'request':
        continue
    for line in record.payload:
        if line.startswith("Referer:"):
            print line.replace("Referer: ", "").strip('\n\r'), ",", record['WARC-Target-URI']
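For reference, the crawl that produces crawler.warc could look roughly like this; --no-warc-compression keeps the archive as a plain .warc file so the script above can read it directly, and the root URL is just a placeholder.

wget -r --warc-file=crawler --no-warc-compression https://example.com/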

Python Requests taking a long time

Basically, I am working on a Python project where I download and index files from the SEC EDGAR database. The problem, however, is that when using the requests module, it takes a very long time to save the text in a variable (between ~130 and 170 seconds for one file).
The file has roughly 16 million characters, and I wanted to see if there was any way to easily lower the time it takes to retrieve the text. Example:
import requests
url ="https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
What I found is that the time is spent in the code for r.text, specifically when no encoding was given (r.encoding is None). The time spent detecting the encoding was 20 seconds; I was able to skip it by defining the encoding:
...
r.encoding = 'utf-8'
...
Additional details
In my case, my request was not returning an encoding type. The response was 256k in size, and r.apparent_encoding was taking 20 seconds.
Looking into the text property function: it tests whether there is an encoding. If there is none, it calls the apparent_encoding function, which scans the text to autodetect the encoding scheme.
On a long string this takes time. By defining the encoding of the response (as described above), you skip the detection.
Validate that this is your issue
In your example above:
from datetime import datetime
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)
print(datetime.now())
enc = r.apparent_encoding
print(enc)
print(datetime.now())
print(r.text)
print(datetime.now())
r.encoding = enc
print(r.text)
print(datetime.now())
Of course, the output may get lost in the printing, so I recommend you run the above in an interactive shell; it may become more apparent where you are losing the time even without printing datetime.now().
From @martijn-pieters:
Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
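Putting both pieces of advice together, a rough sketch might look like this; 'utf-8' is assumed to be correct for this document, as in the answer above, and the output filename is arbitrary.

import requests

url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
r.encoding = 'utf-8'  # skip the slow apparent_encoding detection

# Write straight to a file instead of printing ~16 million characters to the console.
with open("goog10-kq42016.htm", "w", encoding="utf-8") as f:
    f.write(r.text)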
