OCR using microsoft cognitive - microsoft-cognitive

What if I only want to process an image from disk, read the text in it, and store that text in a text file?
The code below works with both json and data. I want to work with data only. How do I do that?
from __future__ import print_function
import time
import requests
import cv2
import operator
import numpy as np
# Import library to display results
import matplotlib.pyplot as plt
%matplotlib inline
_url = 'https://api.projectoxford.ai/vision/v1/analyses'
_key = 'd784ea882edd4feaa373dc5a80fa87e8'
_maxNumRetries = 10
def processRequest( json, data, headers, params ):
    """
    Helper function to process the request to Project Oxford
    Parameters:
    json: Used when processing images from its URL. See API Documentation
    data: Used when processing image read from disk. See API Documentation
    headers: Used to pass the key information and the data type request
    """
    retries = 0
    result = None
    while True:
        response = requests.request( 'post', _url, json = json, data = data, headers = headers, params = params )
        if response.status_code == 429:
            print( "Message: %s" % ( response.json()['error']['message'] ) )
            if retries <= _maxNumRetries:
                time.sleep(1)
                retries += 1
                continue
            else:
                print( 'Error: failed after retrying!' )
                break
        elif response.status_code == 200 or response.status_code == 201:
            if 'content-length' in response.headers and int(response.headers['content-length']) == 0:
                result = None
            elif 'content-type' in response.headers and isinstance(response.headers['content-type'], str):
                if 'application/json' in response.headers['content-type'].lower():
                    result = response.json() if response.content else None
                elif 'image' in response.headers['content-type'].lower():
                    result = response.content
        else:
            print( "Error code: %d" % ( response.status_code ) )
            print( "Message: %s" % ( response.json()['error']['message'] ) )
        break
    return result
def renderResultOnImage( result, img ):
    """Display the obtained results onto the input image"""
    R = int(result['color']['accentColor'][:2],16)
    G = int(result['color']['accentColor'][2:4],16)
    B = int(result['color']['accentColor'][4:],16)
    cv2.rectangle( img,(0,0), (img.shape[1], img.shape[0]), color = (R,G,B), thickness = 25 )
    if 'categories' in result:
        categoryName = sorted(result['categories'], key=lambda x: x['score'])[0]['name']
        cv2.putText( img, categoryName, (30,70), cv2.FONT_HERSHEY_SIMPLEX, 2, (255,0,0), 3 )
pathToFileInDisk = r'test.jpg'
with open( pathToFileInDisk, 'rb' ) as f:
    data = f.read()
# Computer Vision parameters
params = { 'visualFeatures' : 'Color,Categories'}
headers = dict()
headers['Ocp-Apim-Subscription-Key'] = _key
headers['Content-Type'] = 'application/octet-stream'
json = None
result = processRequest( json, data, headers, params )
if result is not None:
    # Load the original image, fetched from the URL
    data8uint = np.fromstring( data, np.uint8 ) # Convert string to an unsigned int array
    img = cv2.cvtColor( cv2.imdecode( data8uint, cv2.IMREAD_COLOR ), cv2.COLOR_BGR2RGB )
    renderResultOnImage( result, img )
    ig, ax = plt.subplots(figsize=(15, 20))
    ax.imshow( img )
It shows a syntax error at %matplotlib inline.

I gather you copied your Python code from somewhere, and there are a number of issues with it:
Your syntax error stems from the fact that %matplotlib is valid syntax for IPython, not plain Python.
Based on your problem description, IIUC, you've no need for any plotting code, so you might as well drop matplotlib (and cv2 and numpy, for that matter).
Your API URL is wrong: you want https://api.projectoxford.ai/vision/v1.0/ocr.
The code you'll want will be basically like this:
import requests
headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': 'YOUR_KEY_HERE',
}
params = {
    # Request parameters (requests URL-encodes them into the query string)
    'language': 'unk',
    'detectOrientation': 'true',
}
body = {"url":"YOUR_URL_HERE"}
response = requests.post("https://api.projectoxford.ai/vision/v1.0/ocr", params=params, json=body, headers=headers)
result = response.json()
# The response nests the recognised text as regions -> lines -> words
for region in result['regions']:
    for line in region['lines']:
        for word in line['words']:
            print(word['text'])
Get more details about the response JSON on the API page, if you want, for instance, to arrange the text differently.
You forgot to redact your API key, so you'll probably want to generate a new one via the subscriptions page.
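For the "image from disk, text into a file" part of the question: the request in the question already sends the file bytes as data with Content-Type application/octet-stream, so pointing that request at the OCR URL should be enough on the sending side. What remains is turning the response into plain text. The following is a minimal Python 3 sketch of that step, assuming the regions/lines/words layout used in the loop above; the helper names and the sample dictionary are made up for illustration.

```python
def ocr_result_to_text(result):
    """Join the recognised words into text, one OCR line per output line."""
    lines = []
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            words = [word["text"] for word in line.get("words", [])]
            lines.append(" ".join(words))
    return "\n".join(lines)


def save_text(text, path):
    """Write the extracted text to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hand-made sample in the assumed response shape (not real API output).
sample = {
    "regions": [
        {
            "lines": [
                {"words": [{"text": "Hello"}, {"text": "world"}]},
                {"words": [{"text": "OCR"}]},
            ]
        }
    ]
}

text = ocr_result_to_text(sample)
print(text)
# save_text(text, "output.txt")  # hypothetical output path
```

With a real response, result would be the parsed JSON returned by the OCR call instead of sample.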

Related

Adjust save location of Custom XCom Backend on a per task basis

I have posted a discussion question about this here as well https://github.com/apache/airflow/discussions/19868
Is it possible to specify arguments to a custom xcom backend? If I could force a task to return data (pyarrow table/dataset, pandas dataframe) which would save a file in the correct container with a "predictable file location" path, then that would be amazing. A lot of my custom operator code deals with creating the blob_path, saving the blob, and pushing a list of the blob_paths to xcom.
Since I work with many clients, I would prefer to have the data for Client A inside the client-a container, which uses a different SAS.
When I save a file, I consider that a "stage" of the data, so I would prefer to keep it. Ideally, I could provide a blob_path that matches the folder structure I generally use:
class WasbXComBackend(BaseXCom):
    def __init__(
        self,
        container: str = "airflow-xcom-backend",
        path: str = guid(),
        partition_columns: Optional[list[str]] = None,
        existing_data_behavior: Optional[str] = None,
    ) -> None:
        super().__init__()
        self.container = container
        self.path = path
        self.partition_columns = partition_columns
        self.existing_data_behavior = existing_data_behavior
    @staticmethod
    def serialize_value(self, value: Any):
        if isinstance(value, pd.DataFrame):
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            with io.StringIO() as buf:
                value.to_csv(path_or_buf=buf, index=False)
                hook.load_string(
                    container_name=self.container,
                    blob_name=f"{self.path}.csv",
                    string_data=buf.getvalue(),
                )
            value = f"{self.container}/{self.path}.csv"
        elif isinstance(value, pa.Table):
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            write_options = ds.ParquetFileFormat().make_write_options(
                version="2.6", use_dictionary=True, compression="snappy"
            )
            written_files = []
            ds.write_dataset(
                data=value,
                schema=value.schema,
                base_dir=f"{self.container}/{self.path}",
                format="parquet",
                partitioning=self.partition_columns,
                partitioning_flavor="hive",
                existing_data_behavior=self.existing_data_behavior,
                basename_template=f"{self.task_id}-{self.ts_nodash}-{{i}}.parquet",
                filesystem=hook.create_filesystem(),
                file_options=write_options,
                file_visitor=lambda x: written_files.append(x.path),
                use_threads=True,
                max_partitions=2_000,
            )
            value = written_files
        return BaseXCom.serialize_value(value)
    @staticmethod
    def deserialize_value(self, result) -> Any:
        result = BaseXCom.deserialize_value(result)
        if isinstance(result, str) and result.endswith(".csv"):
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            with io.BytesIO() as input_io:
                hook.get_stream(
                    container_name=self.container,
                    blob_name=str(self.path),
                    input_stream=input_io,
                )
                input_io.seek(0)
                return pd.read_csv(input_io)
        elif isinstance(result, list) and ".parquet" in result:
            hook = AzureBlobHook(wasb_conn_id="azure_blob")
            return ds.dataset(
                source=result, partitioning="hive", filesystem=hook.create_filesystem()
            )
        return result
It's not clear exactly what information you want to retrieve for your "predictable file location", but there is a PR to pass basic things like dag_id, task_id, etc. on to serialize_value so that you can use them when naming your stored objects.
Until that is merged, you'll have to override BaseXCom.set.
You need to override BaseXCom.set.
Here is a working example that runs in production:
class MyXComBackend(BaseXCom):
    @classmethod
    @provide_session
    def set(cls, key, value, execution_date, task_id, dag_id, session=None):
        session.expunge_all()
        # logic to use this custom_xcom_backend only with the necessary dag and task
        if cls.is_task_to_custom_xcom(dag_id, task_id):
            value = cls.custom_backend_saving_fn(value, dag_id, execution_date, task_id)
        else:
            value = BaseXCom.serialize_value(value)
        # remove any duplicate XComs
        session.query(cls).filter(
            cls.key == key, cls.execution_date == execution_date, cls.task_id == task_id, cls.dag_id == dag_id
        ).delete()
        session.commit()
        # insert new XCom
        from airflow.models.xcom import XCom # noqa
        session.add(XCom(key=key, value=value, execution_date=execution_date, task_id=task_id, dag_id=dag_id))
        session.commit()
    @staticmethod
    def is_task_to_custom_xcom(dag_id: str, task_id: str) -> bool:
        return True # custom your logic here if necessary
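Inside an overridden set, dag_id, task_id and execution_date are all available at save time, so a predictable blob location can be derived from them. The following is only a sketch of that idea: build_blob_path and the dag/task/timestamp folder layout are hypothetical and not part of Airflow, and a saving function such as custom_backend_saving_fn above would be the place to call something like it.

```python
from datetime import datetime


def build_blob_path(dag_id: str, task_id: str, execution_date: datetime, suffix: str = "csv") -> str:
    """Build a deterministic blob name from the identifiers passed to set().

    The layout (dag_id/task_id/timestamp.suffix) is an assumed convention.
    """
    ts_nodash = execution_date.strftime("%Y%m%dT%H%M%S")
    return f"{dag_id}/{task_id}/{ts_nodash}.{suffix}"


# Example identifiers are invented for illustration.
path = build_blob_path("client_a_etl", "extract", datetime(2021, 11, 29, 6, 0, 0))
print(path)
```

Because the path depends only on values the scheduler already knows, a downstream task can rebuild it without reading the XCom first.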

Import and parse a file to fill the form

Currently, I'm developing a custom app. So far I got the DocType ready to be filled in manually. We got files (SQLite3) that I'd like to upload, parse, extract the necessary fields of it and fill in the form. Basically like the import data tool. In my case, no bulk operation is needed and if possible do the extraction part server-side.
What I tried so far
I added a Server Action to call a whitelisted method of my app. I can get the current doc with:
@frappe.whitelist()
def upload_data_and_extract(doc: str):
    """
    Uploads and processes an existing file and extracts data from it
    """
    doc_dict = json.loads(doc)
    custom_dt = frappe.get_doc('CustomDT', doc_dict['name'])
    # parse data here
    custom_dt.custom_field = "new value from parsed data"
    custom_dt.save()
    return doc # How do I return a JSON back to the website from the updated doc?
With this approach, I can only do the parsing once the document has already been saved. I'd rather update the fields of the form when the attach field gets modified. Thus, I tried the Client Script approach:
frappe.ui.form.on('CustomDT', {
    original_data: function(frm, cdt, cdn) {
        if(original_data) {
            frappe.call({
                method: "customapp.customapp.doctype.customdt.customdt.parse_file",
                args: {
                    "doc": frm.doc
                },
                callback: function(r) {
                    // code snippet
                }
            });
        }
    }
});
Here are my questions:
What's the best approach to upload a file that needs to be parsed to fill the form?
How to access the uploaded file (attachment) the easiest way. (Is there something like frappe.get_attachment()?)
How to refresh the form fields in the callback easily?
I appreciate any help on these topics.
Simon
I developed a similar tool, but for CSV upload. I'll share it here so it can help you get to your result.
JS file:
// Copyright (c) 2020, Bhavesh and contributors
// For license information, please see license.txt
frappe.ui.form.on('Car Upload Tool', {
    upload: function(frm) {
        frm.call({
            doc: frm.doc,
            method:"upload_data",
            freeze:true,
            freeze_message:"Data Uploading ...",
            callback:function(r){
                console.log(r)
            }
        })
    }
});
Python Code
# -*- coding: utf-8 -*-
# Copyright (c) 2020, Bhavesh and contributors
# For license information, please see license.txt
from __future__ import unicode_literals
import frappe
from frappe.model.document import Document
from carrental.carrental.doctype.car_upload_tool.csvtojson import csvtojson
import csv
import json
class CarUploadTool(Document):
    def upload_data(self):
        _file = frappe.get_doc("File", {"file_url": self.attach_file})
        filename = _file.get_full_path()
        csv_json = csv_to_json(filename)
        make_car(csv_json)
def csv_to_json(csvFilePath):
    jsonArray = []
    #read csv file
    with open(csvFilePath, encoding='latin-1') as csvf:
        #load csv file data using csv library's dictionary reader
        csvReader = csv.DictReader(csvf,delimiter=";")
        #convert each csv row into python dict
        for row in csvReader:
            frappe.errprint(row)
            #add this python dict to json array
            jsonArray.append(row)
    #convert python jsonArray to JSON String and write to file
    return jsonArray
def make_car(car_details):
    for row in car_details:
        create_brand(row.get('Marke'))
        create_car_type(row.get('Fahrzeugkategorie'))
        if not frappe.db.exists("Car",row.get('Fahrgestellnr.')):
            car_doc = frappe.get_doc(dict(
                doctype = "Car",
                brand = row.get('Marke'),
                model_and_description = row.get('Bezeichnung'),
                type_of_fuel = row.get('Motorart'),
                color = row.get('Farbe'),
                transmission = row.get('Getriebeart'),
                horsepower = row.get('Leistung (PS)'),
                car_type = row.get('Fahrzeugkategorie'),
                car_vin_id = row.get('Fahrgestellnr.'),
                licence_plate = row.get('Kennzeichen'),
                location_code = row.get('Standort')
            ))
            car_doc.model = car_doc.model_and_description.split(' ')[0] or ''
            car_doc.insert(ignore_permissions = True)
        else:
            car_doc = frappe.get_doc("Car",row.get('Fahrgestellnr.'))
            car_doc.brand = row.get('Marke')
            car_doc.model_and_description = row.get('Bezeichnung')
            car_doc.model = car_doc.model_and_description.split(' ')[0] or ''
            car_doc.type_of_fuel = row.get('Motorart')
            car_doc.color = row.get('Farbe')
            car_doc.transmission = row.get('Getriebeart')
            car_doc.horsepower = row.get('Leistung (PS)')
            car_doc.car_type = row.get('Fahrzeugkategorie')
            car_doc.car_vin_id = row.get('Fahrgestellnr.')
            car_doc.licence_plate = row.get('Kennzeichen')
            car_doc.location_code = row.get('Standort')
            car_doc.save(ignore_permissions = True)
    frappe.msgprint("Car Uploaded Successfully")
def create_brand(brand):
    if not frappe.db.exists("Brand",brand):
        frappe.get_doc(dict(
            doctype = "Brand",
            brand = brand
        )).insert(ignore_permissions = True)
def create_car_type(car_type):
    if not frappe.db.exists("Vehicle Type",car_type):
        frappe.get_doc(dict(
            doctype = "Vehicle Type",
            vehicle_type = car_type
        )).insert(ignore_permissions = True)
So for this upload tool, I created a single doctype with the following fields:
Attach File(Field Type = Attach)
Button (Field Type = Button)

How to stop the Crawler

I am trying to write a crawler that goes to a website and searches for a list of keywords, with a max depth of 2. The scraper is supposed to stop as soon as any of the keywords appears on any page. The problem I am facing is that the crawler does not stop when it first sees one of the keywords,
even after trying an early return, break, CloseSpider, and even Python exit commands.
My crawler class:
class WebsiteSpider(CrawlSpider):
    name = "webcrawler"
    allowed_domains = ["www.roomtoread.org"]
    start_urls = ["https://"+"www.roomtoread.org"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]
    crawl_count = 0
    words_found = 0
    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1
        crawl_count = self.__class__.crawl_count
        wordlist = [
            "sfdc",
            "pardot",
            "Web-to-Lead",
            "salesforce"
        ]
        url = response.url
        contenttype = response.headers.get("content-type", "").decode('utf-8').lower()
        data = response.body.decode('utf-8')
        for word in wordlist:
            substrings = find_all_substrings(data, word)
            for pos in substrings:
                ok = False
                if not ok:
                    if self.__class__.words_found==0:
                        self.__class__.words_found += 1
                        print(word + "," + url + ";")
                        STOP!
        return Item()
    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) != None:
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []
I want it to stop execution when if not ok: is True.
When I want to stop a spider, I usually raise the exception scrapy.exceptions.CloseSpider(reason='cancelled') from the Scrapy docs.
The example there shows how you can use it:
if 'Bandwidth exceeded' in response.body:
    raise CloseSpider('bandwidth_exceeded')
In your case, something like:
if not ok:
    raise CloseSpider('keyword_found')
Or is that what you meant by "CloseSpider Commands", and have you already tried it?
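One way to keep the stop condition easy to reason about is to separate the keyword check from the spider. The sketch below is plain Python and does not import Scrapy; first_keyword_hit is a hypothetical helper, and matching case-insensitively is an assumption about what the question intends.

```python
def first_keyword_hit(text, keywords):
    """Return the first keyword that occurs in text (case-insensitive), or None."""
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            return keyword
    return None


wordlist = ["sfdc", "pardot", "Web-to-Lead", "salesforce"]
hit = first_keyword_hit("We use Salesforce and Pardot for leads", wordlist)
print(hit)
```

In the callback, a non-None result would be the point at which to raise CloseSpider('keyword_found'). Note that Scrapy closes the spider gracefully, so responses for requests that were already in flight may still reach the callback; the words_found guard in the question helps keep the output to a single line in that case.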

Python - BaseHTTPServer , issue with POST and GET

I am making a very simple application that currently has two web pages, at the URLs localhost:8080/restaurants/ and localhost:8080/restaurants/new.
I have a SQLite database, which I manipulate with SQLAlchemy in my Python code.
My first page, localhost:8080/restaurants/, just lists the restaurants available in my database.
My second page, localhost:8080/restaurants/new, has a form for adding a new restaurant so that it is displayed on localhost:8080/restaurants.
However, whenever I enter a new restaurant name in the form at localhost:8080/restaurants/new, it fails to redirect me back to localhost:8080/restaurants/ to show me the new restaurant. Instead, it stays on the same URL, localhost:8080/restaurants/new, with the message "No data received".
Below is my code:
import cgi
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
#import libraries and modules
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from database_setup import Base, Restaurant, MenuItem
#create and connect to database
engine = create_engine('sqlite:///restaurantmenu.db')
Base.metadata.bind=engine
DBSession = sessionmaker(bind=engine)
session = DBSession()
class webServerHandler(BaseHTTPRequestHandler):
    """ class defined in the main method"""
    def do_GET(self):
        try:
            #look for url then ends with '/hello'
            if self.path.endswith("/restaurants"):
                self.send_response(200)
                #indicate reply in form of html to the client
                self.send_header('Content-type', 'text/html')
                #indicates end of https headers in the response
                self.end_headers()
                #obtain all restaurant names from databse
                restaurants = session.query(Restaurant).all()
                output = ""
                output += "<html><body><a href='/restaurants/new'>Add A New Restaurant</a>"
                output += "</br></br>"
                for restaurant in restaurants:
                    output += restaurant.name
                    output += """<div>
                    <a href='#'>Edit</a>
                    <a href='#'>Delete</a>
                    </div>"""
                    output += "</br></br>"
                output += "</body></html>"
                self.wfile.write(output)
                print output
                return
            if self.path.endswith("/restaurants/new"):
                self.send_response(200)
                self.send_header('Content-type', 'text/html')
                self.end_headers()
                output = ""
                output += "<html><body>"
                output += "<h1>Add New Restaurant</h1>"
                output += "<form method='POST' enctype='multipart/form-data action='/restaurants/new'>"
                output += "<input name='newRestaurant' type='text' placeholder='New Restaurant Name'>"
                output += "<input name='Create' type='submit' label='Create'>"
                output += "</form></body></html>"
                self.wfile.write(output)
                return
        except IOError:
            self.send_error(404, "File %s not found" % self.path)
    def do_POST(self):
        try:
            if self.path.endswith("/restaurants/new"):
                ctype, pdict = cgi.parse_header(self.headers.getheader('content-type'))
                #check of content-type is form
                if ctype == 'mulitpart/form-data':
                    #collect all fields from form, fields is a dictionary
                    fields = cgi.parse_multipart(self.rfile, pdict)
                    #extract the name of the restaurant from the form
                    messagecontent = fields.get('newRestaurant')
                    #create the new object
                    newRestaurantName = Restaurant(name = messagecontent[0])
                    session.add(newRestaurantName)
                    session.commit()
                    self.send_response(301)
                    self.send_header('Content-type', 'text/html')
                    self.send_header('Location','/restaurants')
                    self.end_headers()
        except:
            pass
def main():
    """An instance of HTTPServer is created in the main method
    HTTPServer is built off of a TCP server indicating the
    transmission protocol
    """
    try:
        port = 8080
        #server address is tuple & contains host and port number
        #host is an empty string in this case
        server = HTTPServer(('', port), webServerHandler)
        print "Web server running on port %s" % port
        #keep server continually listening until interrupt occurs
        server.serve_forever()
    except KeyboardInterrupt:
        print "^C entered, stopping web server...."
        #shut down server
        server.socket.close()
#run main method
if __name__ == '__main__':
    main()
For reference, here is my database_setup file, where I create the database:
import sys
#importing classes from sqlalchemy module
from sqlalchemy import Column, ForeignKey, Integer, String
#delcaritive_base , used in the configuration
# and class code, used when writing mapper
from sqlalchemy.ext.declarative import declarative_base
#relationship in order to create foreign key relationship
#used when writing the mapper
from sqlalchemy.orm import relationship
#create_engine to used in the configuration code at the
#end of the file
from sqlalchemy import create_engine
#this object will help set up when writing the class code
Base = declarative_base()
class Restaurant(Base):
    """
    class Restaurant corresponds to restaurant table
    in the database to be created.
    table representation for restaurant which
    is in the database
    """
    __tablename__ = 'restaurant'
    #column definitions for the restaurant table
    id = Column(Integer, primary_key=True)
    name = Column(String(250), nullable=False)
class MenuItem(Base):
    """
    class MenuItem corresponds to restaurant table
    table representation for menu_item which
    is in the database
    """
    __tablename__ = 'menu_item'
    #column definitions for the restaurant table
    name = Column(String(80), nullable=False)
    id = Column(Integer, primary_key=True)
    course = Column(String(250))
    description = Column(String(250))
    price = Column(String(8))
    restaurant_id = Column(Integer, ForeignKey('restaurant.id'))
    restaurant = relationship(Restaurant)
#create an instance of create_engine class
#and point to the database to be used
engine = create_engine(
'sqlite:///restaurantmenu.db')
#that will soon be added into the database. makes
#the engine
Base.metadata.create_all(engine)
I can't figure out why I cannot add new restaurants.
I know this was a long time ago, but I figured out your problem.
First, the enctype='multipart/form-data' in your do_GET function, under the if self.path.endswith("/restaurants/new"): portion, is missing its closing single quote. Second, you misspelt 'multipart' as 'mulitpart' in if ctype == 'mulitpart/form-data':. Hope that helps you or others.
As shteeven said, the problem was with the encoding type (enctype) in the form.
Because the closing quote was missing, the 'Content-type' became 'application/x-www-form-urlencoded', so in that case you should parse the body differently, since it is a URL-encoded string.
To handle both enctype values, you can modify your do_POST as follows:
def do_POST(self):
    try:
        if self.path.endswith("/restaurants/new"):
            ctype, pdict = cgi.parse_header(self.headers.getheader('content-type'))
            print ctype
            #check of content-type is form
            if (ctype == 'multipart/form-data') or (ctype == 'application/x-www-form-urlencoded'):
                #collect all fields from form, fields is a dictionary
                if ctype == 'multipart/form-data':
                    fields = cgi.parse_multipart(self.rfile, pdict)
                else:
                    #urlencoded body: needs "import urlparse" at the top of the file (Python 2)
                    content_length = self.headers.getheaders('Content-length')
                    length = int(content_length[0])
                    body = self.rfile.read(length)
                    fields = urlparse.parse_qs(body)
                #extract the name of the restaurant from the form
                messagecontent = fields.get('newRestaurant')
                #create the new object
                newRestaurantName = Restaurant(name = messagecontent[0])
                session.add(newRestaurantName)
                session.commit()
                self.send_response(301)
                self.send_header('Location','/restaurants')
                self.end_headers()
                return
    except:
        #errors are silently ignored here, as in the question's handler
        pass
Hope this extra information is useful for you!
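To see what the URL-encoded branch receives, here is a small standalone sketch. It is written for Python 3, where parse_qs lives in urllib.parse rather than in the Python 2 urlparse module used above; the sample body is invented and only mimics what a browser would send for the form in the question.

```python
from urllib.parse import parse_qs


def parse_urlencoded_form(raw_body):
    """Decode an application/x-www-form-urlencoded body into a dict of lists."""
    return parse_qs(raw_body.decode("utf-8"))


# Sample request body (assumed, not captured from a real browser).
body = b"newRestaurant=Pizza+Palace&Create=Submit"
fields = parse_urlencoded_form(body)
name = fields["newRestaurant"][0]
print(name)
```

Each value is a list because a field name may appear more than once, which is why the handler indexes messagecontent[0].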

Convert encryption/decryption function from Python to PHP

I have this Python script to encrypt/decrypt URLs:
# -*- coding: utf-8 -*-
import base64
from operator import itemgetter
class cryptUrl:
    def __init__(self):
        self.key = 'secret'
    def encode(self, str):
        enc = []
        for i in range(len(str)):
            key_c = self.key[i % len(self.key)]
            enc_c = chr((ord(str[i]) + ord(key_c)) % 256)
            enc.append(enc_c)
        return base64.urlsafe_b64encode("".join(enc))
    def decode(self, str):
        dec = []
        str = base64.urlsafe_b64decode(str)
        for i in range(len(str)):
            key_c = self.key[i % len(self.key)]
            dec_c = chr((256 + ord(str[i]) - ord(key_c)) % 256)
            dec.append(dec_c)
        return "".join(dec)
url = "http://google.com";
print cryptUrl().encode(url.decode('utf-8'))
This works fine. For example, the above URL is converted to 29nX4p-joszS4czg2JPG4dI= and decryption brings it back to the URL.
Now I want to convert this to a PHP function. So far, encryption is working fine, but decryption is not, and I don't know why. This is what I have so far:
function base64_url_encode($input) {
    return strtr(base64_encode($input), '+/=', '-_');
}
function base64_url_decode($input) {
    return strtr(base64_decode($input), '-_', '+/=');
}
function encode ($str)
{
    $key = 'secret';
    $enc = array();
    for ($i;$i<strlen($str);$i++){
        $key_c = $key[$i % strlen($key)];
        $enc_c = chr((ord($str[$i]) + ord($key_c)) % 256);
        $enc[] = $enc_c;
    }
    return base64_url_encode(implode($enc));
}
function decode ($str)
{
    $key = 'secret';
    $dec = array();
    $str = base64_url_decode($str);
    for ($i;$i<strlen($str);$i++){
        $key_c = $key[$i % strlen($key)];
        $dec_c = chr((256 + ord($str[$i]) + ord($key_c)) % 256);
        $dec[] = $dec_c;
    }
    return implode($dec);
}
$str = '29nX4p-joszS4czg2JPG4dI=';
echo decode($str);
The above decoding prints N>:Tý\&™åª—Væ, which is not http://google.com :p
Like I said, the encoding function works, though.
Why isn't the decoding working? What am I missing?
By the way, I can't use any other encoding/decoding function. I have a list of URLs encoded with Python and I want to move the whole system to a PHP-based site, so I need to decode those URLs with a PHP function instead of Python.
(Use this page to execute Python: http://www.compileonline.com/execute_python_online.php)
Double-check the syntax of strtr().
I'd suggest using it in the following way:
strtr(
    base64_encode($input),
    array(
        '+' => '-',
        '/' => '_',
        '=' => YOUR_REPLACE_CHARACTER
    )
)
Make sure you define YOUR_REPLACE_CHARACTER!
Also, I'm sure you can handle the reverse function, where you simply need to flip the keys and values of the replace array.
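When checking a PHP port, it can help to have a reference that produces known-good values. The following is a Python 3 sketch of the scheme from the question, not the original script: it works on bytes, assumes Latin-1/ASCII input and the key 'secret', and the function names are illustrative.

```python
import base64

KEY = b"secret"  # key taken from the question


def encode(text):
    """Add the repeating key byte-wise (mod 256), then URL-safe Base64 encode."""
    data = text.encode("latin-1")
    shifted = bytes((byte + KEY[i % len(KEY)]) % 256 for i, byte in enumerate(data))
    return base64.urlsafe_b64encode(shifted).decode("ascii")


def decode(token):
    """URL-safe Base64 decode first, then subtract the repeating key (mod 256)."""
    shifted = base64.urlsafe_b64decode(token)
    data = bytes((byte - KEY[i % len(KEY)]) % 256 for i, byte in enumerate(shifted))
    return data.decode("latin-1")


token = encode("http://google.com")
print(token)
print(decode(token))
```

Comparing this with the PHP in the question, three points appear to differ in the decoder: the '-' and '_' characters have to be mapped back to '+' and '/' before base64_decode is called rather than after it, the key byte has to be subtracted rather than added, and the loop counter $i is never initialised to 0.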

Resources