Sending multiple requests with different values at the same time - asynchronous

requests.post("abc.com", data=payload1)
requests.post("abc.com", data=payload2)
requests.post("abc.com", data=payload3)
requests.post("abc.com", data=payload4)
What is the fastest way to send these four requests at the same time?

One common way to do multiple things simultaneously is to use threads; they run independently of each other.
import requests
import threading

def post_req(url, payload):
    # each thread posts one payload and prints the response
    print(requests.post(url, data=payload))

threading.Thread(target=post_req, args=("abc.com", payload1)).start()
threading.Thread(target=post_req, args=("abc.com", payload2)).start()
threading.Thread(target=post_req, args=("abc.com", payload3)).start()
threading.Thread(target=post_req, args=("abc.com", payload4)).start()
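If you also want to cap concurrency and collect the responses, here is a minimal sketch using concurrent.futures instead of raw threads. It assumes payload1 through payload4 are defined as in the question; the URL is the placeholder from the question.
from concurrent.futures import ThreadPoolExecutor

import requests

# the four payloads from the question; assumed to be defined already
payloads = [payload1, payload2, payload3, payload4]

def post_req(payload):
    return requests.post("abc.com", data=payload)

# run up to four POSTs concurrently and gather the responses in order
with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(post_req, payloads))

for response in responses:
    print(response.status_code)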

Related

Is it possible to create dynamic jobs with Dagster?

Consider this example - you need to load table1 from a source database, do some generic transformations (like converting time zones for timestamped columns), and write the resulting data into Snowflake. This is an easy one and can be implemented using 3 Dagster ops.
Now, imagine you need to do the same thing but with hundreds of tables. How would you do it with Dagster? Do you literally need to create 100 jobs/graphs? Or can you create one job that will be executed 100 times? Can you throttle how many of these jobs run at the same time?
You have two main options for doing this:
Use a single job with Dynamic Outputs:
With this setup, all of your ETLs would happen in a single job. You would have an initial op that would yield a DynamicOutput for each table name that you wanted to do this process for, and feed that into a set of ops (probably organized into a graph) that would be run on each individual DynamicOutput.
Depending on what executor you're using, it's possible to limit the overall step concurrency (for example, the default multiprocess_executor supports this option).
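As a rough sketch of that first option, using the DynamicOut/DynamicOutput API from recent Dagster releases (the table list and op bodies here are placeholders, not from the original answer):
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def table_names():
    # placeholder list; in practice this might come from config or a metadata query
    for name in ["table1", "table2", "table3"]:
        yield DynamicOutput(name, mapping_key=name)

@op
def etl_single_table(table_name: str):
    # extract, transform, and load one table here
    ...

@job
def dynamic_etl():
    # fan out: one etl_single_table step per yielded table name
    table_names().map(etl_single_table)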
Create a configurable job (I think this is more likely what you want)
from dagster import job, op, graph
import pandas as pd

@op(config_schema={"table_name": str})
def extract_table(context) -> pd.DataFrame:
    table_name = context.op_config["table_name"]
    # do some load...
    return pd.DataFrame()

@op
def transform_table(table: pd.DataFrame) -> pd.DataFrame:
    # do some transform...
    return table

@op(config_schema={"table_name": str})
def load_table(context, table: pd.DataFrame):
    table_name = context.op_config["table_name"]
    # load to snowflake...

@job
def configurable_etl():
    load_table(transform_table(extract_table()))

# this is what the configuration would look like to extract from table
# src_foo and load into table dest_foo
configurable_etl.execute_in_process(
    run_config={
        "ops": {
            "extract_table": {"config": {"table_name": "src_foo"}},
            "load_table": {"config": {"table_name": "dest_foo"}},
        }
    }
)
Here, you create a job that can be pointed at a source table and a destination table by giving the relevant ops a config schema. Depending on those config options (which are provided when you create a run through the run config), your job will operate on different source / destination tables.
The example shows explicitly running this job using Python APIs, but if you're running it from Dagit, you'll also be able to input the YAML version of this config there. If you want to simplify the config schema (as it's pretty nested as shown), you can always create a Config Mapping to make the interface nicer :)
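For example, a config mapping for the job above might look roughly like this, reusing the ops defined earlier; the src_/dest_ prefix convention is an assumption for illustration, not from the original answer:
from dagster import config_mapping, job

@config_mapping(config_schema={"table_name": str})
def simplified_config(val):
    # expand one top-level table name into the nested per-op config,
    # assuming a src_/dest_ naming convention for source and target tables
    name = val["table_name"]
    return {
        "ops": {
            "extract_table": {"config": {"table_name": f"src_{name}"}},
            "load_table": {"config": {"table_name": f"dest_{name}"}},
        }
    }

@job(config=simplified_config)
def configurable_etl_simplified():
    load_table(transform_table(extract_table()))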
From here, you can limit run concurrency by supplying a unique tag to your job, and using a QueuedRunCoordinator to limit the maximum number of concurrent runs for that tag.

Airflow Custom Metrics and/or Result Object with custom fields

While running pySpark SQL pipelines via Airflow I am interested in getting out some business stats like:
source read count
target write count
sizes of DFs during processing
error records count
One idea is to push these values directly to the metrics backend, so they get automatically consumed by monitoring tools like Prometheus. Another idea is to obtain these values via some DAG result object, but I wasn't able to find anything about that in the docs.
Please post at least some pseudo code if you have a solution.
I would look to reuse Airflow's statistics and monitoring support in the airflow.stats.Stats class. Maybe something like this:
import logging

from airflow.stats import Stats

PYSPARK_LOG_PREFIX = "airflow_pyspark"

def your_python_operator(**context):
    [...]

    try:
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_read_count", src_read_count)
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_write_count", tgt_write_count)
        # So on and so forth
    except:
        logging.exception("Caught exception during statistics logging")

    [...]
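To round that out, here is a minimal sketch of wiring such a callable into a DAG with a PythonOperator (Airflow 2.x import paths; the DAG id, schedule, and start date are assumptions):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="pyspark_stats_example",   # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    push_stats = PythonOperator(
        task_id="run_pipeline_and_push_stats",
        python_callable=your_python_operator,  # the callable from the snippet above
    )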

Graphite Derivative shows no data

I'm using Graphite/Grafana to record the sizes of all collections in a MongoDB instance. I wrote a simple (WIP) Python script to do so:
#!/usr/bin/python
from pymongo import MongoClient
import socket
import time

statsd_ip = '127.0.0.1'
statsd_port = 8125

# create a udp socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

client = MongoClient(host='12.34.56.78', port=12345)
db = client.my_DB

# get collection list each runtime
collections = db.collection_names()
sizes = {}

# main
while (1):
    # get collection size per name
    for collection in collections:
        sizes[collection] = db.command('collstats', collection)['size']
    # write to statsd
    for size in sizes:
        MESSAGE = "collection_%s:%d|c" % (size, sizes[size])
        sock.sendto(MESSAGE, (statsd_ip, statsd_port))
    time.sleep(60)
This properly shows all of my collection sizes in Grafana. However, I want to get a rate of change on these sizes, so I built the following Graphite query in Grafana:
derivative(statsd.myHost.collection_myCollection)
And the graph shows up totally blank. Any ideas?
FOLLOW-UP: When selecting a time range greater than 24h, all data similarly disappears from the graph. Can't for the life of me figure out that one.
Update: This was due to the fact that my collectd was configured to send samples every second. The statsd plugin for collectd, however, was receiving data every 60 seconds, so I ended up with None for most data points.
I discovered this by checking the raw data in Graphite by appending &format=raw to the end of a graphite-api query in a browser, which gives you the value of each data point as a comma-separated list.
The temporary fix for this was to wrap the Graphite query in keepLastValue(60). This, however, creates a stair-step graph, as each run of None values (60 of them) takes on the last valid value within 60 steps. Graphing a derivative of this then becomes a widely spaced sawtooth graph.
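Put together, that temporary workaround looks roughly like this (assuming the same metric path as above):
derivative(keepLastValue(statsd.myHost.collection_myCollection, 60))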
In order to fix this, I will probably go on to fix the flush interval on collectd or switch to a standalone statsd instance and configure as necessary from there.

Running multiple spiders in the same process, one spider at a time

I have a situation where I have a CrawlSpider that searches for results using postal codes and categories (POST data). I need to get all the results for all the categories in all postal codes. My spider takes a postal code and a category as arguments for the POST data. I want to programmatically start a spider for each postal code/category combo via a script.
The documentation explains that you can run multiple spiders per process with the code example here: http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process This is along the same lines as what I want to do; however, I want to essentially queue up the spiders so that each one runs only after the preceding spider finishes.
Any ideas on how to accomplish this? There seem to be some answers that apply to older versions of Scrapy (~0.13), but the architecture has changed and they no longer work with the latest stable release (0.24.4).
You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is some sample code (not tested) based on this answer and adapted for your use case:
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from twisted.internet import reactor

# for the sake of an example, sample postal codes
postal_codes = ['10801', '10802', '10803']

def configure_crawler(postal_code):
    spider = MySpider(postal_code)

    # configure signals
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # detach spider
    crawler._spider = None

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)

# callback fired when the spider is closed
def callback(spider, reason):
    try:
        postal_code = postal_codes.pop()
        configure_crawler(postal_code)
    except IndexError:
        # stop the reactor if no postal codes left
        reactor.stop()

settings = Settings()
crawler = Crawler(settings)
configure_crawler(postal_codes.pop())
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
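For completeness, here is a minimal sketch of what MySpider itself might look like, assuming it accepts the postal code (and optionally the category) as constructor arguments and issues the POST request itself; the endpoint URL and form field names are hypothetical:
from scrapy import Spider
from scrapy.http import FormRequest

class MySpider(Spider):
    name = "postal_spider"

    def __init__(self, postal_code, category="default", *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.postal_code = postal_code
        self.category = category

    def start_requests(self):
        # POST the postal code / category combo to the search endpoint
        yield FormRequest(
            "http://example.com/search",  # hypothetical endpoint
            formdata={"postal_code": self.postal_code, "category": self.category},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # parse the result page and yield items here
        pass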

How do you identify the request that a QNetworkReply finished signal is emitted in response to when you are making multiple requests in QtNetwork?

I have a project that will load an HTTP page, parse it, and then open other pages based on the data it received from the first page.
Since Qt's QNetworkAccessManager works asynchronously, it seems I should be able to load more than one page at a time by continuing to make HTTP requests; handling the responses would then happen in whatever order the replies come back and would be driven by the event loop.
I'm having a few problems figuring out how to do this though:
First, I read somewhere on Stack Overflow that you should use only one QNetworkAccessManager. I do not know if that is true.
The problem is that I'm connecting to the finished signal on the single QNetworkAccessManager. If I make more than one request at a time, I don't know which request the finished signal is in response to. I don't know if there is a way to inspect the QNetworkReply object that is passed with the signal to tell which request it answers. Or should I actually be using a different QNetworkAccessManager for each request?
Here is an example of how I'm chaining things together right now. But I know this won't work when I'm making more than one request at a time:
from PyQt4 import QtCore, QtGui, QtNetwork

class Example(QtCore.QObject):
    def __init__(self):
        super().__init__()
        self.QNetworkAccessManager_1 = QtNetwork.QNetworkAccessManager()
        self.QNetworkCookieJar_1 = QtNetwork.QNetworkCookieJar()
        self.QNetworkAccessManager_1.setCookieJar(self.QNetworkCookieJar_1)
        self.app = QtGui.QApplication([])

    def start_request(self):
        QUrl_1 = QtCore.QUrl('https://erikbandersen.com/')
        QNetworkRequest_1 = QtNetwork.QNetworkRequest(QUrl_1)
        #
        self.QNetworkAccessManager_1.finished.connect(self.someurl_finshed)
        self.QNetworkAccessManager_1.get(QNetworkRequest_1)

    def someurl_finshed(self, NetworkReply):
        # I do this so that this function won't get called for a different request
        # But it will only work if I'm doing one request at a time
        self.QNetworkAccessManager_1.finished.disconnect(self.someurl_finshed)
        page = bytes(NetworkReply.readAll())
        # Do something with it
        print(page)

        QUrl_1 = QtCore.QUrl('https://erikbandersen.com/ipv6/')
        QNetworkRequest_1 = QtNetwork.QNetworkRequest(QUrl_1)
        #
        self.QNetworkAccessManager_1.finished.connect(self.someurl2_finshed)
        self.QNetworkAccessManager_1.get(QNetworkRequest_1)

    def someurl2_finshed(self, NetworkReply):
        page = bytes(NetworkReply.readAll())
        # Do something with it
        print(page)

kls = Example()
kls.start_request()
I am not familiar with PyQt, but from a general Qt programming point of view:
Using only one QNetworkAccessManager is the right design choice.
The finished signal provides the QNetworkReply*; with that, we can identify the corresponding request using request().
I hope this solves your problem with one manager and multiple requests.
This is a C++ example doing the same.
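Translated into PyQt4 terms, a minimal sketch of that single-manager pattern might look like this; the dispatch on the reply's URL and the handler names are assumptions for illustration, not from the original answer:
from PyQt4 import QtCore, QtNetwork

class Fetcher(QtCore.QObject):
    def __init__(self):
        super().__init__()
        self.manager = QtNetwork.QNetworkAccessManager()
        # one finished handler serves every request made through this manager
        self.manager.finished.connect(self.on_finished)

    def get(self, url):
        self.manager.get(QtNetwork.QNetworkRequest(QtCore.QUrl(url)))

    def on_finished(self, reply):
        # reply.request() returns the QNetworkRequest this reply answers,
        # so the requests can be told apart even when several are in flight
        url = reply.request().url().toString()
        page = bytes(reply.readAll())
        if url.endswith('/ipv6/'):
            self.handle_ipv6_page(page)
        else:
            self.handle_front_page(page)
        reply.deleteLater()

    def handle_front_page(self, page):
        print(page)

    def handle_ipv6_page(self, page):
        print(page)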
