Generate JSON file via Robot Framework Process library

I have Python code that uses mitmproxy to capture website traffic and generate a JSON file, and I am trying to integrate it with Robot Framework using its Process library. If I run the Python file by itself and start the Robot tests from a different window, the JSON file is generated with no issues, but if I run the same file as part of my test setup in Robot (using the Process library), no file is generated. What am I doing wrong here?
Here is my Python code:
tracker.py
from mitmproxy import http, ctx
import json
match_url = ["https://something.com/"]  # Break Point URL portion to be matched
class Tracker:
    def __init__(self):
        self.flow = http.HTTPFlow
    # mitmproxy calls the hook named `request` (singular) for each client request
    def request(self, flow):
        for urls in match_url:
            if urls in flow.request.pretty_url:
                with open('out.json', 'a+', encoding='utf-8') as out:
                    json.dump(flow.request.content.decode(), out)
    def done(self):
        print("Bye Bye")
        ctx.master.shutdown()
addons = [
    Tracker()  # must match the class name defined above
]
keyword.robot
Start browser proxy process
    ${result} =    start process    mitmdump -s my_directory/tracker.py -p 9995 > in.txt    shell=True    alias=mitm
Stop browser proxy process
    Terminate process    mitm
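One difference between the two ways of launching the script that may matter here (an assumption, the post does not confirm it): `open('out.json', ...)` resolves against the working directory of the mitmdump process, and that directory is whatever Robot was started from when the Process library launches it. A minimal sketch of writing to an absolute path so the result no longer depends on the working directory; `append_record` and `demo` are hypothetical names used only for illustration:

```python
import json
import os
import tempfile
from pathlib import Path


def append_record(out_path: Path, body: str) -> None:
    """Append one JSON-encoded string per line to an absolute path."""
    with open(out_path, "a", encoding="utf-8") as out:
        json.dump(body, out)
        out.write("\n")


def demo() -> list:
    """Write from a different working directory and list what was created."""
    base = Path(tempfile.mkdtemp()).resolve()
    out_path = base / "out.json"      # absolute target, independent of cwd
    elsewhere = base / "elsewhere"    # stands in for the cwd Robot happens to use
    elsewhere.mkdir()
    old_cwd = os.getcwd()
    try:
        os.chdir(elsewhere)
        append_record(out_path, "payload-1")
    finally:
        os.chdir(old_cwd)
    return sorted(p.name for p in base.iterdir())


if __name__ == "__main__":
    print(demo())
```

In an addon, the same idea would be to build the path once, for example from `Path(__file__).resolve().parent`, instead of passing the bare file name to `open`.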

Related

How to create an api with FastAPI and scrapy?

I am working on a web application that lets a user enter keywords and get back a set of results scraped for those keywords. I use Scrapy to scrape results from a web search engine. I wrote some code to pass the keywords to the Scrapy file and display the results on a webpage. However, I'm having trouble passing the keywords to Scrapy through FastAPI: whenever I run my API code, I get a set of errors from Scrapy. Here is the gist containing the runtime terminal output. I don't understand the problem, since my Scrapy code worked perfectly before I connected it to the API. I'm a beginner at creating APIs, so I'm asking for your help. Here is the code for my web scraper:
import scrapy
import datetime
from requests_html import HTMLSession
class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'
    def start_requests(self,query):
        #queries = [ 'investissement']
        #for query in queries:
        url = f'https://www.ask.com/web?q={query}'
        s = HTMLSession()
        r = s.get
        print(r.status_code)
        qlist[""]
        yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})
    def parse(self, response):
        print('url:', response.url)
        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        items = response.css('div.PartialSearchResults-item')
        for pos, result in enumerate(items, start_pos+1):
            yield {
                'title': result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet': result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link': result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date': dt,
            }
            qlist.append(items)
        # --- after loop ---
        next_page = response.css('.PartialWebPagination-next a')
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url) # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos+1})
# --- run without project, and save in file ---
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'test.json': {'format': 'json'}},
    #'ROBOTSTXT_OBEY': True, # this stop scraping
})
c.crawl(PagesearchSpider)
c.start()
Here is the code for my API:
from fastapi import FastAPI
from script import PagesearchSpider
app = FastAPI()
request = PagesearchSpider()
@app.get("/{cat}")
async def read_item(cat):
    return request.start_requests('cat')
I changed part of my scraper code, viz:
def start_requests(self,query):
    #queries = [ 'investissement']
    #for query in queries:
    url = f'https://www.ask.com/web?q={query}'
    if __name__ == '__main__':
        s = HTMLSession()
        r = s.get
        print(r.status_code)
        qlist[""]
but I still get the same errors. Maybe my crawler code is wrong; I'm not very good at Scrapy.
I also rewrote the code below my function...
# --- run without project, and save in file ---
if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    c = CrawlerProcess({
        #'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        #'FEEDS': {'test.json': {'format': 'json'}},
        #'ROBOTSTXT_OBEY': True, # this stop scraping
    })
    c.crawl(PagesearchSpider)
    c.start()
Then I ran the command python -m uvicorn main:app --reload to start the server and got the following output on the command line:
INFO: Will watch for changes in these directories: ['C:\\Users\\user\\Documents\\AAprojects\\Whelpsgroups1\\searchApi\\apiFast']
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started reloader process [10956] using statreload
WARNING: The --reload flag should not be used in production on Windows.
INFO: Started server process [6724]
INFO: Waiting for application startup.
INFO: Application startup complete.
WARNING: StatReload detected file change in 'main.py'. Reloading...
WARNING: The --reload flag should not be used in production on Windows.
INFO: Started server process [2720]
INFO: Waiting for application startup.
INFO: Application startup complete.
But when I click the server address link in the command line, it opens File Explorer on Windows 10, and when I manually type that address, i.e. http://127.0.0.0:8000/, into my browser's address bar, it says 127.0.0.0 took too long to respond. I didn't change any of my files, only the command I was using in the console, so I don't know why I get this error.
I looked through Stack Overflow questions, but none were directly related to sharing data between an API and a web scraper, and I couldn't find any relevant answers elsewhere on the internet. I hope you can help; I look forward to your answers. Thank you!
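One detail of the endpoint above is worth spelling out: `start_requests` contains `yield`, so calling it only creates a generator object and none of its body runs until something iterates over it. The endpoint also passes the literal string 'cat' rather than the `cat` path parameter. A minimal, Scrapy-free sketch of that generator behaviour; the `executed` list is a hypothetical helper added only to make the execution order visible:

```python
import inspect

executed = []  # records when the function body actually runs


def start_requests(query):
    """Stand-in for the spider method: a generator, because it uses yield."""
    executed.append(query)
    yield f"https://www.ask.com/web?q={query}"


gen = start_requests("cat")            # nothing inside the function has run yet
is_gen = inspect.isgenerator(gen)
ran_before_iteration = list(executed)  # still empty at this point
urls = list(gen)                       # iterating is what executes the body

if __name__ == "__main__":
    print(is_gen, ran_before_iteration, urls)
```

Returning such a generator from a FastAPI handler therefore does not start a crawl; in Scrapy it is the crawler (for example `CrawlerProcess`) that iterates over `start_requests`.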

Airflow - Custom XCom backend on Ubuntu

I'm trying to implement a custom XCom backend.
These are the steps I took:
Created an "include" directory in the main Airflow directory (AIRFLOW_HOME).
Created this "custom_xcom_backend.py" file inside it:
from typing import Any
from airflow.models.xcom import BaseXCom
import pandas as pd
class CustomXComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value: Any):
        if isinstance(value, pd.DataFrame):
            value = value.to_json(orient='records')
        return BaseXCom.serialize_value(value)
    @staticmethod
    def deserialize_value(result) -> Any:
        result = BaseXCom.deserialize_value(result)
        result = df = pd.read_json(result)
        return result
Set in the config file:
xcom_backend = include.custom_xcom_backend.CustomXComBackend
When I restarted webserver I got:
airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "xcom_backend" key in "core" section. Current value: "include.cust...
My guess is that it is not recognizing the "include" folder.
But how can I fix it?
*Note: There is no Docker. It is installed on an Ubuntu machine.
Thanks!
So I solved it:
Put custom_xcom_backend.py into the plugins directory
Set in the config file:
xcom_backend = custom_xcom_backend.CustomXComBackend
Restart all airflow related services
*Note: Do not store DataFrames that way (bad practice).
Sources I used:
https://www.youtube.com/watch?v=iI0ymwOij88
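Why moving the file helped can be sketched as follows. A dotted path such as the `xcom_backend` value has to be importable, and Airflow adds its dags, plugins and config folders to `sys.path`, while a new folder directly under AIRFLOW_HOME is not on that list by default. The `import_string` below is a simplified stand-in written for this example, not Airflow's own code, and `demo_xcom_backend` is a hypothetical module name:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_string(dotted_path):
    """Import "pkg.module.ClassName" and return the named attribute."""
    module_path, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


def demo():
    # A temporary folder plays the role of the plugins directory.
    plugins_dir = Path(tempfile.mkdtemp())
    (plugins_dir / "demo_xcom_backend.py").write_text(
        "class CustomXComBackend:\n    pass\n", encoding="utf-8"
    )
    try:
        import_string("demo_xcom_backend.CustomXComBackend")
        found_before = True
    except ModuleNotFoundError:
        found_before = False  # the folder is not on sys.path yet

    sys.path.insert(0, str(plugins_dir))
    importlib.invalidate_caches()
    try:
        cls = import_string("demo_xcom_backend.CustomXComBackend")
    finally:
        sys.path.remove(str(plugins_dir))
    return found_before, cls.__name__


if __name__ == "__main__":
    print(demo())
```

Under that reading, keeping the "include" folder would also need it (or its parent) to be on the Python path, for example through PYTHONPATH; that alternative is an assumption and was not tried in the post.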

How can I read a config file from airflow packaged DAG?

Airflow packaged DAGs seem like a great building block for a sane production airflow deployment.
I have a DAG with dynamic subDAGs, driven by a config file, something like:
config.yaml:
imports:
  - project_foo
  - project_bar
which yields subdag tasks like imports.project_{foo|bar}.step{1|2|3}.
I've normally read in the config file using Python's open function, a la config = open(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
Unfortunately, when using packaged DAGs, this results in an error:
Broken DAG: [/home/airflow/dags/workflows.zip] [Errno 20] Not a directory: '/home/airflow/dags/workflows.zip/config.yaml'
Any thoughts / best practices to recommend here?
It's a bit of a kludge, but I eventually just fell back on reading zip file contents via ZipFile.
import yaml
from zipfile import ZipFile
import logging
import re
def get_config(yaml_filename):
    """Parses and returns the given YAML config file.
    For packaged DAGs, gracefully handles unzipping.
    """
    zip, post_zip = re.search(r'(.*\.zip)?(.*)', yaml_filename).groups()
    if zip:
        contents = ZipFile(zip).read(post_zip.lstrip('/'))
    else:
        contents = open(post_zip).read()
    result = yaml.safe_load(contents)
    logging.info('Parsed config: %s', result)
    return result
which works as you'd expect from the main dag.py (with os imported there):
get_config(os.path.join(os.path.split(__file__)[0], 'config.yaml'))
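The path-splitting step above can be tried without Airflow or PyYAML. The following sketch uses only the standard library and returns the raw text instead of parsing it; `read_maybe_zipped` and `demo` are hypothetical names, and the archive is a throwaway one built in a temporary directory:

```python
import os
import re
import tempfile
from zipfile import ZipFile


def read_maybe_zipped(path):
    """Return the text at `path`, which may point inside a .zip archive."""
    # Group 1 is everything up to and including ".zip" (or None),
    # group 2 is the remainder of the path.
    zip_path, inner = re.search(r'(.*\.zip)?(.*)', path).groups()
    if zip_path:
        with ZipFile(zip_path) as archive:
            return archive.read(inner.lstrip('/')).decode('utf-8')
    with open(inner, encoding='utf-8') as handle:
        return handle.read()


def demo():
    workdir = tempfile.mkdtemp()
    zip_path = os.path.join(workdir, "workflows.zip")
    with ZipFile(zip_path, "w") as archive:
        archive.writestr("config.yaml", "imports:\n  - project_foo\n  - project_bar\n")
    plain_path = os.path.join(workdir, "config.yaml")
    with open(plain_path, "w", encoding="utf-8") as handle:
        handle.write("imports: []\n")
    # The first path mimics what __file__-based joining yields in a packaged DAG.
    return read_maybe_zipped(zip_path + "/config.yaml"), read_maybe_zipped(plain_path)


if __name__ == "__main__":
    zipped_text, plain_text = demo()
    print(zipped_text)
    print(plain_text)
```

The sketch assumes forward slashes after the archive name, as in the error message quoted in the question.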

Http-Conduit frequent connection failures

I am writing an application that downloads some files over HTTP. Up to some point I was using the following code snippet to download the page body:
import Network.HTTP
simpleHTTP (getRequest "http://www.haskell.org/") >>= getResponseBody
It was working fine, but it could not establish connections over HTTPS. To fix this, I switched to HTTP-Conduit and am now using the following code:
simpleHttp' :: Manager -> String -> IO (C.Response LBS.ByteString)
simpleHttp' manager url = do
    request <- parseUrl url
    runResourceT $ httpLbs request manager
It can connect over HTTPS, but a new, frustrating problem appeared. About every fifth connection fails with an exception:
getpics.hs: FailedConnectionException "i.imgur.com" 80
I am convinced that this is an HTTP-Conduit problem, because Network.HTTP was working fine on the same set of pages (excluding HTTPS pages).
Has anybody encountered this problem and found a solution, or does anyone know a better alternative to the Conduit library? It should be simple, because this is a simple task that should not take more than a few lines of code.
One simple alternative would be to use the curl package. It supports HTTP, HTTPS and a bunch of other alternative protocols, as well as many options to customize its behavior. The price is introducing an external dependency on libcurl, required to build the package.
Example:
import Network.Curl
main :: IO ()
main = do
    let addr = "https://google.com/"
    -- Explicit type annotation is required for calls to curlGetResponse_.
    -- Use ByteString instead of String for higher performance:
    r <- curlGetResponse_ addr [] :: IO (CurlResponse_ [(String,String)] String)
    print $ respHeaders r
    putStr $ respBody r
Update: I tried to replicate your problem, but everything works for me. Could you post a Short, Self Contained, Compilable, Example that demonstrates the problem? My code:
import Control.Monad
import qualified Data.Conduit as C
import qualified Data.ByteString.Lazy as LBS
import Network.HTTP.Conduit
simpleHttp'' :: String -> Manager -> C.ResourceT IO (Response LBS.ByteString)
simpleHttp'' url manager = do
    request <- parseUrl url
    httpLbs request manager
main :: IO ()
main = do
    let url = "http://i.imgur.com/"
        count = 100
    rs <- withManager $ \m -> replicateM count (simpleHttp'' url m)
    mapM_ (print . responseStatus) $ rs

Importing ping module in RestrictedPython script in Plone

I would like to check the internet connection from my Plone site. I tried a ping in a Python script:
## Script (Python) "pwreset_action.cpy"
##bind container=container
##bind context=context
##bind namespace=
##bind script=script
##bind subpath=traverse_subpath
##title=Reset a user's password
##parameters=randomstring, userid=None, password=None, password2=None
from Products.CMFCore.utils import getToolByName
from Products.PasswordResetTool.PasswordResetTool import InvalidRequestError, ExpiredRequestError
import ping, socket
status = "success"
pw_tool = getToolByName(context, 'portal_password_reset')
try:
    pw_tool.resetPassword(userid, randomstring, password)
except ExpiredRequestError:
    status = "expired"
except InvalidRequestError:
    status = "invalid"
except RuntimeError:
    status = "invalid"
context.plone_log("TRYING TO PING")
try:
    ping.verbose_ping('www.google.com' , run=3)
    context.plone_log("PING DONE")
except socket.error, e:
    context.plone_log("PING FAILED")
return state.set(status=status)
I got these errors:
2012-07-20T11:37:08 INFO SignalHandler Caught signal SIGTERM
------
2012-07-20T11:37:08 INFO Z2 Shutting down fast
------
2012-07-20T11:37:08 INFO ZServer closing HTTP to new connections
------
2012-07-20T11:37:42 INFO ZServer HTTP server started at Fri Jul 20 11:37:42 2012
Hostname: 0.0.0.0
Port: 8080
------
2012-07-20T11:37:42 WARNING SecurityInfo Conflicting security declarations for "setText"
------
2012-07-20T11:37:42 WARNING SecurityInfo Class "ATTopic" had conflicting security declarations
------
2012-07-20T11:37:46 INFO plone.app.theming Patched Zope Management Interface to disable theming.
------
2012-07-20T11:37:48 INFO PloneFormGen Patching plone.app.portlets ColumnPortletManagerRenderer to not catch Retry exceptions
------
2012-07-20T11:37:48 INFO Zope Ready to handle requests
------
Python Scripts in Zope are sandboxed (via RestrictedPython), which means that any module imports have to be declared safe first. Adding modules to the declared-safe list is generally a Bad Idea unless you know what you are doing.
To declare a module as importable into Python Scripts, you'll need to create a python package, then add the following code to it so it is executed when Zope starts:
from Products.PythonScripts.Utility import allow_module
allow_module('ping')
This'll allow any import from that module (use with caution)!
It's better to allow only specific methods and classes from a module; use a ModuleSecurityInfo declaration for that:
from AccessControl import ModuleSecurityInfo
ModuleSecurityInfo('ping').declarePublic('verbose_ping')
ModuleSecurityInfo('socket').declarePublic('error')
This is documented in the Security chapter of the Zope Developers Guide, specifically the section on module security assertions.
Note that it nearly always is a better idea to do all this work in a tightly constrained method in unrestricted code (e.g. a regular python package), then allow that method to be used from a python script instead.
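A sketch of what such a tightly constrained helper could look like, under some loudly stated assumptions: it lives in a regular package whose dotted name `my.package.netcheck` is hypothetical, it is written for Python 3, and it swaps the ICMP ping from the question for a plain TCP connection attempt, which needs only the standard library. The security declaration is guarded so the module still imports outside Zope:

```python
import socket


def can_connect(host, port=80, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


try:
    # Only available inside a Zope/Plone environment.
    from AccessControl import ModuleSecurityInfo

    # 'my.package.netcheck' is a hypothetical module path for this file.
    ModuleSecurityInfo('my.package.netcheck').declarePublic('can_connect')
except ImportError:
    pass


if __name__ == "__main__":
    print(can_connect("www.google.com"))
```

With this layout the restricted script only ever sees a single function that returns a boolean, rather than the whole `socket` or `ping` module.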
It won't work.
You CANNOT import arbitrary Python modules in RestrictedPython scripts, as you were told in the answer you got yesterday:
https://stackoverflow.com/a/11568316/315168
If you need to use arbitrary Python modules, you need to write your own Plone add-on and use a BrowserView for the purpose. RestrictedPython through-the-web development is not enough:
http://collective-docs.readthedocs.org/en/latest/getstarted/index.html
