I'm working on using gevent and Tornado inside the same application so that libraries that don't support Tornado's IOLoop can be made to act asynchronously via gevent. I thought I'd need to run two real system threads, one dedicated to Tornado's IOLoop and another dedicated to gevent's loop. However, calling any gevent function inside a system thread raises a NotImplementedError: gevent cannot be used inside threads. Therefore, I tried monkey-patching threading as well, as the following snippet shows:
from gevent import monkey; monkey.patch_all()
from random import choice
import gevent
import requests
import tornado.ioloop
import tornado.web
import threading
import Queue
q = Queue.Queue()
i = 0
def synchronous_get_url(url, callback=None):
    global i
    i += 1
    d = i
    print('bar %d getting %s' % (d, url))
    requests.get(url)
    print('bar %d finished getting %s' % (d, url))
    if callback:
        callback()
class GreenEventLoop(threading.Thread):
    daemon = True

    def run(self):
        while True:
            url, callback = q.get()
            gevent.spawn(synchronous_get_url, url, callback)
class MainHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        print 'Received get request'
        urls = [
            'http://google.com',
            'http://apple.com',
            'http://microsoft.com',
            'http://github.com',
            'http://sourceforge.com',
        ]
        q.put((choice(urls), self._on_fetch), block=False)
        self.write("submitted url to queue")

    def _on_fetch(self):
        print 'Finishing in the handler\n'
        try:
            self.finish()
        except:
            pass
# Start GEvent Loop
green_loop = GreenEventLoop()
green_loop.start()
# Start Tornado Loop
application = tornado.web.Application([
    (r"/", MainHandler),
], debug=True)
application.listen(7001)
tornado.ioloop.IOLoop.instance().start()
In a separate process, on the command line, I run the following.
from gevent import monkey; monkey.patch_all()
import gevent
import requests
count = 0
def get_stuff(i):
    global count
    res = requests.get('http://localhost:7001/')
    count += 1
    print count, res, i
lets = [gevent.spawn(get_stuff, i) for i in range(15)]
gevent.joinall(lets)
This retrieves 15 URLs simultaneously and returns the responses as they are received. What I don't quite understand is why the above code works at all. If threading is patched by gevent and turned into green threads, that means there's only ever a single thread running at a time, which means that while gevent is off fetching responses, Tornado's IOLoop should block and not handle new requests until the old one has returned. Can someone explain how gevent interacts with Tornado's IOLoop?
I suggest you look at the Motor library; it's an async wrapper around the pymongo driver. It uses greenlets to adapt synchronous pymongo code to Tornado's callback style, so I think it should be a good place to find some ideas.
The basic idea is to use gevent to monkey-patch the system threads, then run Tornado under Python "threads" which are really gevent greenlets.
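A minimal sketch of that arrangement might look like the following (the handler name and URL are made up for illustration, and it targets the older Tornado / Python 2 API used in the question): blocking work runs in a spawned greenlet while the IOLoop keeps serving requests.

# Sketch only: Tornado runs on top of monkey-patched threads, and blocking work
# is handed to a gevent greenlet. Handler name and URL are invented for illustration.
from gevent import monkey; monkey.patch_all()

import gevent
import requests
import tornado.ioloop
import tornado.web

def fetch_in_background(url, callback):
    # requests' sockets are green after monkey-patching, so this "blocking"
    # call cooperatively yields to the gevent hub instead of stalling the process.
    body = requests.get(url).text
    # Hand the result back to the handler on the IOLoop.
    tornado.ioloop.IOLoop.instance().add_callback(lambda: callback(body))

class BlockingHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        gevent.spawn(fetch_in_background, 'http://example.com/', self._done)

    def _done(self, body):
        self.write('fetched %d bytes' % len(body))
        self.finish()

application = tornado.web.Application([(r'/', BlockingHandler)])
application.listen(8000)
tornado.ioloop.IOLoop.instance().start()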
How to use gevent and tornado together
I am working on a web application project that lets the user search by keywords and get a set of results scraped from a web search engine. I use Scrapy for the scraping, and I wrote some code to pass the keywords to the Scrapy file and display the Scrapy results on a web page. However, I'm having trouble passing the keywords to Scrapy using FastAPI: when I run my API code, I always get a set of errors from Scrapy. Here is the gist containing the runtime terminal output. I don't understand the problem, yet my Scrapy code worked perfectly before I connected it to the API. I'm a beginner at creating APIs, so I'm asking for your help. Here is the code for my web scraper:
import scrapy
import datetime
from requests_html import HTMLSession


class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'

    def start_requests(self, query):
        #queries = [ 'investissement']
        #for query in queries:
        url = f'https://www.ask.com/web?q={query}'
        s = HTMLSession()
        r = s.get
        print(r.status_code)
        qlist[""]
        yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print('url:', response.url)
        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        items = response.css('div.PartialSearchResults-item')
        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title': result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet': result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link': result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date': dt,
            }
        qlist.append(items)
        # --- after loop ---
        next_page = response.css('.PartialWebPagination-next a')
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})

# --- run without project, and save in file ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'test.json': {'format': 'json'}},
    #'ROBOTSTXT_OBEY': True, # this stop scraping
})
c.crawl(PagesearchSpider)
c.start()
And here is the code that runs my API:
from fastapi import FastAPI
from script import PagesearchSpider
app = FastAPI()
request = PagesearchSpider()
#app.get("/{cat}")
async def read_item(cat):
return request.start_requests('cat')
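For what it's worth, the kind of bridge I have in mind looks roughly like the sketch below; this is an assumption on my part, not my working code. It shells out to the Scrapy CLI per request and reads back the JSON feed, and it assumes the spider is changed to take query as a spider argument (passed with -a) instead of a start_requests parameter, that the CrawlerProcess block in script.py stays under an if __name__ == "__main__" guard, and that the file name results.json is made up.

# Sketch only: run the crawl in a separate process and return its JSON feed.
import json
import subprocess

from fastapi import FastAPI

app = FastAPI()

@app.get("/{cat}")
def read_item(cat: str):  # plain def: FastAPI runs it in a threadpool, so the blocking call is OK
    # -a passes `cat` to the spider as the `query` argument;
    # -O (recent Scrapy) overwrites the feed file on every run.
    subprocess.run(
        ["scrapy", "runspider", "script.py", "-a", f"query={cat}", "-O", "results.json"],
        check=True,
    )
    with open("results.json", encoding="utf-8") as f:
        return json.load(f)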
I changed part of my scraper code, viz:
def start_requests(self, query):
    #queries = [ 'investissement']
    #for query in queries:
    url = f'https://www.ask.com/web?q={query}'
    if __name__ == '__main__':
        s = HTMLSession()
        r = s.get
        print(r.status_code)
        qlist[""]
but I still get the same errors. Maybe I got the crawler code wrong; I'm not very good at Scrapy.
I also rewrote the code below my function...
# --- run without project, and save in file ---
if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess

    c = CrawlerProcess({
        #'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        #'FEEDS': {'test.json': {'format': 'json'}},
        #'ROBOTSTXT_OBEY': True, # this stop scraping
    })
    c.crawl(PagesearchSpider)
    c.start()
Then I executed the command python -m uvicorn main:app --reload to start the server, and I got the following output on the command line:
INFO:     Will watch for changes in these directories: ['C:\\Users\\user\\Documents\\AAprojects\\Whelpsgroups1\\searchApi\\apiFast']
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [10956] using statreload
WARNING:  The --reload flag should not be used in production on Windows.
INFO:     Started server process [6724]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
WARNING:  StatReload detected file change in 'main.py'. Reloading...
WARNING:  The --reload flag should not be used in production on Windows.
INFO:     Started server process [2720]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
But when I click the server address link from the command line, it opens my Windows 10 file explorer, and when I type that link, i.e. http://127.0.0.0:8000/, into my browser's address bar, it says that 127.0.0.0 took too long to respond. Yet I didn't change any of my files, just the command I use on the command line, so I don't know where this error comes from.
I looked at the Stack Overflow questions, but they weren't directly related to the problem of sharing data between an API and a web scraper, and I couldn't find any relevant answers on the internet. So I hope you can help me. I look forward to your answers, thank you!
I have Python code that uses mitmproxy to capture website traffic and generate a JSON file, and I am trying to integrate that code with Robot Framework using its Process library. If I run the Python file by itself and start the Robot tests from a different window, the JSON file is generated with no issues, but if I run the same file as part of my test setup in Robot (using the Process library), no file is generated. What am I doing wrong here?
Here is my Python code
tracker.py
from mitmproxy import http, ctx
import json

match_url = ["https://something.com/"]  # Break point: URL portion to be matched

class Tracker:
    def __init__(self):
        self.flow = http.HTTPFlow

    def request(self, flow):
        for urls in match_url:
            if urls in flow.request.pretty_url:
                with open('out.json', 'a+', encoding='utf-8') as out:
                    json.dump(flow.request.content.decode(), out)

    def done(self):
        print("Bye Bye")
        ctx.master.shutdown()

addons = [
    Tracker()
]
keyword.robot
*** Keywords ***
Start browser proxy process
    ${result} =    Start Process    mitmdump -s my_directory/tracker.py -p 9995 > in.txt    shell=True    alias=mitm

Stop browser proxy process
    Terminate Process    mitm
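One thing I'm also wondering about (sketched below as an assumption, not my actual fix): out.json is opened with a relative path, so it presumably lands in whatever working directory the Process library starts mitmdump from. Anchoring the path to tracker.py itself would make the output location independent of how the process is launched:

# Sketch only: same addon as above, but out.json is written next to tracker.py
# instead of into the process's current working directory.
import json
import os

match_url = ["https://something.com/"]
OUT_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'out.json')

class Tracker:
    def request(self, flow):
        for urls in match_url:
            if urls in flow.request.pretty_url:
                with open(OUT_PATH, 'a+', encoding='utf-8') as out:
                    json.dump(flow.request.content.decode(), out)

addons = [
    Tracker()
]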
We are using Apache Airflow 1.9.0. I have written a Snowflake hook plugin and placed the hook in the $AIRFLOW_HOME/plugins directory.
$AIRFLOW_HOME
+-- plugins
    +-- snowflake_hook2.py
snowflake_hook2.py
# This is the base class for a plugin
from airflow.plugins_manager import AirflowPlugin

# This is necessary to expose the plugin in the Web interface
from flask import Blueprint
from flask_admin import BaseView, expose
from flask_admin.base import MenuLink

# This is the base hook for connecting to a database
from airflow.hooks.dbapi_hook import DbApiHook

# This is the Snowflake provided Connector
import snowflake.connector

# This is the default python logging package
import logging


class SnowflakeHook2(DbApiHook):
    """
    Airflow Hook to communicate with Snowflake
    This is implemented as a Plugin
    """
    def __init__(self, connname_in='snowflake_default', db_in='default', wh_in='default', schema_in='default'):
        logging.info('# Connecting to {0}'.format(connname_in))
        self.conn_name_attr = 'snowflake_conn_id'
        self.connname = connname_in
        self.superconn = super().get_connection(self.connname)  # gets the values from Airflow

        {SNIP - Connection stuff that works}
        self.cur = self.conn.cursor()

    def query(self, q, params=None):
        """From jmoney's db_wrapper allows return of a full list of rows(tuples)"""
        if params == None:  # no params, so no insertion
            self.cur.execute(q)
        else:  # make the parameter substitution
            self.cur.execute(q, params)
        self.results = self.cur.fetchall()
        self.rowcount = self.cur.rowcount
        self.columnnames = [colspec[0] for colspec in self.cur.description]
        return self.results

    {SNIP - Other class functions}


class SnowflakePluginClass(AirflowPlugin):
    name = "SnowflakePluginModule"
    hooks = [SnowflakeHook2]
    operators = []
So I went ahead and put some print statements in Airflow's plugin_manager to try to get a better handle on what is happening. After restarting the webserver and running airflow list_dags, these lines were showing the "new module name" (and no errors):
SnowflakePluginModule [<class '__home__ubuntu__airflow__plugins_snowflake_hook2.SnowflakeHook2'>]
hook_module - airflow.hooks.snowflakepluginmodule
INTEGRATING airflow.hooks.snowflakepluginmodule
snowflakepluginmodule <module 'airflow.hooks.snowflakepluginmodule'>
As this is consistent with what the documentation says, I should be fine using this in my DAG:
from airflow import DAG
from airflow.hooks.snowflakepluginmodule import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
But the web UI throws this error:
Broken DAG: [/home/ubuntu/airflow/dags/test_sf2.py] No module named 'airflow.hooks.snowflakepluginmodule'
So the question is: what am I doing wrong? Or have I uncovered a bug?
You need to import as below:
from airflow import DAG
from airflow.hooks import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
OR
from airflow import DAG
from airflow.hooks.SnowflakePluginModule import SnowflakeHook2
from airflow.operators.python_operator import PythonOperator
I don't think Airflow automatically goes through the folders in your plugins directory and runs everything underneath it. The way I've set it up successfully is to have an __init__.py under the plugins directory which contains each plugin class. Have a look at the Astronomer plugins on GitHub; they provide some really good examples of how to set up your plugins.
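For illustration, the __init__.py I mean has roughly this shape (a sketch that assumes your hook stays in plugins/snowflake_hook2.py as shown above; the import path may need adjusting to your layout, and this is not the Astronomer code itself):

# plugins/__init__.py -- sketch only; the import path below is an assumption
# and may need adjusting to how your plugins folder is actually laid out.
from airflow.plugins_manager import AirflowPlugin
from snowflake_hook2 import SnowflakeHook2  # hook class from plugins/snowflake_hook2.py

class SnowflakePluginClass(AirflowPlugin):
    name = "SnowflakePluginModule"
    hooks = [SnowflakeHook2]
    operators = []
    executors = []
    macros = []
    admin_views = []
    flask_blueprints = []
    menu_links = []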
In particular have a look at how they've set up the mysql plugin
https://github.com/airflow-plugins/mysql_plugin
Also, someone has incorporated a Snowflake hook into one of the later versions of Airflow, which you might want to leverage:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/snowflake_hook.py
I have an application that works in development, but when I try to run it with Gunicorn it gives an error that the "sqlalchemy extension was not registered". From what I've read it seems that I need to call app.app_context() somewhere, but I'm not sure where. How do I fix this error?
# run in development, works
python server.py
# try to run with gunicorn, fails
gunicorn --bind localhost:8000 server:app
AssertionError: The sqlalchemy extension was not registered to the current application. Please make sure to call init_app() first.
server.py:
from flask.ext.security import Security
from database import db
from application import app
from models import Studio, user_datastore
security = Security(app, user_datastore)
if __name__ == '__main__':
    # with app.app_context(): ??
    db.init_app(app)
    app.run()
application.py:
from flask import Flask
app = Flask(__name__)
app.config.from_object('config.ProductionConfig')
database.py:
from flask.ext.sqlalchemy import SQLAlchemy
db = SQLAlchemy()
Only when you start your app with python server.py is the if __name__ == '__main__': block hit, where you're registering your database with your app.
You'll need to move that line, db.init_app(app), outside that block.
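A minimal sketch of the rearranged server.py, using the same modules shown in the question:

# server.py -- sketch of the fix described above
from flask.ext.security import Security

from application import app
from database import db
from models import Studio, user_datastore

# Register the extension at import time so `gunicorn server:app` sees it too.
db.init_app(app)
security = Security(app, user_datastore)

if __name__ == '__main__':
    # Only the development server stays behind the __main__ guard.
    app.run()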
I want to use Werkzeug as a local development server and cannot get the DebuggedApplication middleware to work as documented in Werkzeug Debugging. What's wrong here?
import webapp2
from system import config
from werkzeug.debug import DebuggedApplication
from werkzeug.serving import run_simple
application = webapp2.WSGIApplication(routes=config.routes, debug=False, config=config.options)
debugged_application = DebuggedApplication(application)
def main():
    run_simple('localhost', 4000, debugged_application, use_reloader=True, use_debugger=True, threaded=True)

if __name__ == '__main__':
    main()
I think the DebuggedApplication middleware tries to achieve the same thing as use_debugger=True, so there is no need to use both. The problem is that webapp2.WSGIApplication adds its own error handling before the request goes through the debugger middleware, thus preventing the Werkzeug debugger from seeing the actual exception.
My solution is to extend the base WSGIApplication provided by webapp2 to re-raise the original exception. It works with Python 2.7, and it passes the exception through if and only if the debug flag has been set to True in the Application constructor.
class Application(webapp2.WSGIApplication):
    def _internal_error(self, exception):
        if self.debug:
            raise
        return super(Application, self)._internal_error(exception)
Not sure this is the cleanest possible way to do it, but it works for me.
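Wired into the original setup, the whole thing would look roughly like this (a sketch; config.routes and config.options are the same objects as in the question, and use_debugger is left off since DebuggedApplication already wraps the app):

import webapp2
from system import config
from werkzeug.debug import DebuggedApplication
from werkzeug.serving import run_simple

class Application(webapp2.WSGIApplication):
    def _internal_error(self, exception):
        # Re-raise in debug mode so DebuggedApplication can render the traceback.
        if self.debug:
            raise
        return super(Application, self)._internal_error(exception)

application = Application(routes=config.routes, debug=True, config=config.options)
debugged_application = DebuggedApplication(application)

if __name__ == '__main__':
    run_simple('localhost', 4000, debugged_application, use_reloader=True, threaded=True)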