Get Analyse Form Result API is returning error code 3003 - microsoft-cognitive

I used the form labelling tool to train my model. I got the modelID and ran the Analyse Form API successfully, but when I called Get Analyse Form Result, I received this error:
3003 "OCR extraction error: [Wrong response code: FailedToDownloadImage. Message: Failed to download image from input URL..]"
I haven't tested the model on any of these 5 pictures that I used for training purposes. Instead, I used 3 completely new documents.
Any idea how I could get this to work?
This is the form I analysed (pdf)

When you submit the 3 new documents to analyze, do you submit them from your Azure blob storage, from your local file system, or from somewhere else via a URL? If it's the last case (URL), the current service has a bug. You could try the first two options and see if they solve your problem.
-xin (Form Recognizer Team)
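For reference, here is a minimal Python sketch of the second option (sending the file bytes directly, so the service never has to download anything from a URL). It assumes the v2.0 custom-model analyze endpoint; the region, model ID, key, and file name below are placeholders you must substitute:

import requests

endpoint = "https://<your-region>.api.cognitive.microsoft.com"   # assumption: your resource endpoint
model_id = "<your-model-id>"
url = endpoint + "/formrecognizer/v2.0/custom/models/" + model_id + "/analyze"

headers = {
    "Ocp-Apim-Subscription-Key": "<your-key>",
    "Content-Type": "application/pdf",
}

# post the PDF bytes themselves rather than a URL
with open("your_document.pdf", "rb") as f:
    resp = requests.post(url, headers=headers, data=f.read())

# a successful call returns 202; poll the Operation-Location header for the result
print(resp.status_code, resp.headers.get("Operation-Location"))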

Check your URL for encoding.
This error can be thrown when you send a URL without URL encoding; for example, spaces need to be replaced by %20.
So this URL:
"https://test.com/Attachments/Recognized 3728_001.pdf"
needs to be changed to
"https://test.com/Attachments/Recognized%203728_001.pdf"
Check this link for other cases:

Related

Postman Collection- Setting Authorization header through code

Every time I test the Postman collection, I need to change the authorization token under the header manually, then export the collection again and run it through newman.
Is there any way to supply the token in the CSV file that is already being used as the test data file, instead of setting it here? That would reduce the effort of changing the collection every time.
Please suggest.
Yes, you can. In your CSV file, add one more column named "AT",
and reference that AT variable in your request.
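As a rough illustration (the file name, column names, and token values below are placeholders): the data file might look like

username,password,AT
alice,secret1,token-for-iteration-1
bob,secret2,token-for-iteration-2

and the request's Authorization header would reference the column as {{AT}} (for example "Bearer {{AT}}", if that is your scheme). Running the collection through newman with the data file, e.g. newman run collection.json -d testdata.csv, then picks up the token for each iteration from its CSV row.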

Writing a function that scrapes dataset that appears only after typing in values and clicking a button

I am trying to write a function that will take a list of dates and retrieve the dataset as found on https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm
I am using PROC IML in SAS to execute R-code (since I am more familiar with R).
My problem is within R, and is due to the website.
First, I am aware that there is an API but this is an exercise I really want to learn because many sites do not have APIs.
Does anyone know how to retrieve the datasets?
Things I've heard:
Use RSelenium to program the clicking. RSelenium was recently taken off the archive, so that isn't an option (even downloading a previous version is causing issues).
Watch how the XML URL changes as I click the "submit" button in Chrome. However, the XML in the Network tab doesn't show anything here, whereas it does on other websites that use different search methods.
I have been looking for a solution all day, but to no avail! Please help
First, you need to read the terms and conditions and make sure that you are not breaking the rules when scraping.
Next, if there is an API, you should use it so that they can better manage their data usage and operations.
In addition, you should also limit the number of requests you make so as not to overload the server; if I am not mistaken, overloading a server like that amounts to a Denial of Service (DoS) attack.
Finally, if those above conditions are satisfied, you can use the inspector on Chrome to see what HTTP requests are being made when you browse these webpages.
In this particular case, you do not need RSelenium; a simple HTTP POST will do:
library(httr)

resp <- POST(
  "https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm",
  body = list(
    priceDate.month = 5,
    priceDate.day = 15,
    priceDate.year = 2018,
    submit = "CSV+Format"
  ),
  encode = "form"
)

read.csv(text = rawToChar(resp$content), header = FALSE)
You can perform the same http processing in a SAS session using Proc HTTP. The CSV data does not contain a header row, so perhaps the XML Format is more appropriate. There are a couple of caveats for the treasurydirect site.
Prior to posting a data download request the connection needs some cookies that are assigned during a GET request. Proc HTTP can do this.
The XML contains an extra tag container <bpd> that the SAS XMLV2 library engine can't handle simply. This extra tag can be removed with some DATA step processing.
Sample code for XML
filename response TEMP;
filename respfilt TEMP;
* Get request sets up fresh session and cookies;
proc http
clear_cache
method = "get"
url ="https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
;
run;
* Post request as performed by XML format button;
* automatically utilizes cookies setup in GET request;
* in= can now directly specify the parameter data to post;
proc http
method = "post"
in = 'priceDate.year=2018&priceDate.month=5&priceDate.day=15&submit=XML+Format'
url ="https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
out = response
;
run;
* remove bpd tag from the response (the downloaded xml);
data _null_;
infile response;
file respfilt;
input;
if _infile_ not in: ('<bpd', '</bpd');
put _infile_;
run;
* copy data collections from xml file to tables in work library;
libname respfilt xmlv2 ;
proc copy in=respfilt out=work;
run;
Reference material
REST at Ease with SAS®: How to Use SAS to Get Your REST
Joseph Henry, SAS Institute Inc., Cary, NC
http://support.sas.com/resources/papers/proceedings16/SAS6363-2016.pdf

Paw app query request

Hi, I am attempting to send a query to my backend on Kinvey, which is backed by MongoDB. They require passing URL parameters like this:
?query={"firstName":"James"}
I have tried every imaginable way of setting up these parameters in PAW but either get a success response with no filtering of the data or an error message of URL not supported when I try using a Raw Query String.
I have run the query using their (Kinvey) backend API interface and it filters the results fine, so the problem definitely lies within Paw. I am currently using version 3.0.9. Any suggestions, or is this just a bug that needs to be fixed?
Thanks!
I've just tried this setup in Paw and I have a few recommendations:
Paw will URL-encode the characters { and ", as you can see if you open the HTTP preview in the bottom panel.
Trying to send a similar query via Chrome (to test with another app and make sure Paw behaves correctly), I see that the query is URL-encoded there too: try the query https://echo.paw.cloud/?query={"firstName":"James"} and you'll see that the browser URL-encodes the characters { and " when sending. So the behavior is the same as Paw's.
I don't think these two characters ({ and ") are valid in a URL if they are not percent-encoded, so I'm sure your server expects them encoded anyway.
Testing this exact query in Paw works for me, so please try these exact steps: go to URL Params, enter query in the first column and {"firstName":"James"} in the second column. Then, using the HTTP preview mentioned above, make sure Paw is sending the request you expect.
Lastly, more of a tip: as your value is JSON, I recommend using the JSON dynamic value to generate it. It will be easier to read and will make sure you send valid JSON. For that, right-click on the value field and select Values > JSON.
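To see what that encoding looks like on the wire, here is a small Python illustration (not Paw itself, just the same percent-encoding rule applied to the query parameter):

from urllib.parse import urlencode

params = {"query": '{"firstName":"James"}'}

# the characters { } " and : are percent-encoded, which is what the server receives
print(urlencode(params))
# query=%7B%22firstName%22%3A%22James%22%7D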

JMeter "forgets" variable value defined via Regular Expressioin Extractor

I created a simple test case in JMeter.
Open a form and all its content (CSS, images, etc.):
GET /
GET /css/site.css
GET /favicon.ico
GET /fonts/specific-fonts.woff
GET /images/banner.png
Wait a little...
Post the values
POST /
Receive the "Thank You" page.
GET /thanks
In the response on the first GET is a hidden input field which contains a token. This token needs to be included in the POST as well.
Now I use the "Regular Expression Extractor" of JMeter to get the token from the response. So far, so good.
Then, after retrieving all the other content, I create the POST message, using the variable name from the RegExp Extractor in the value field of the token parameter.
But... when executing the test case, it fills in the given default value and not the actual value of the token.
So... first step in debugging this issue was to add a dummy-HTTP-GET request directly after I get the token. In this GET request I also add the token parameter with the token variable as value, but now I can easily check the parameter by looking at the access-log on my webserver.
In this case... the URL looks promising. It contains the actual token value in the GET, but it still uses the default value in the POST.
Second step in debugging was to use the "Debug Sampler" and the "View Results Tree".
By moving the Debug Sampler between the different steps I found out the value of the token-variable is back to the default value after I receive the CSS.
So... now the big question is...
How can I make JMeter remember my variable value until the end of my test script?
JMeter doesn't "forget" variables. However variables scope is limited to the current Thread Group. You can convert JMeter variable to JMeter Property which have "global" scope by i.e. using Beanshell Post Processor with the following code:
props.put("myVar", vars.get("myVar"));
Or by using the __setProperty() function. See the How to Use Variables in Different Thread Groups guide for details.
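For instance, the function approach could look like this, using the variable name from this question:

${__setProperty(myVar, ${myVar})}

The property can then be read anywhere, including in another Thread Group, with ${__P(myVar)} or ${__property(myVar)}.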
As you found out, your problem comes from a misunderstanding of the scoping rules in JMeter.
https://jmeter.apache.org/usermanual/test_plan.html#scoping_rules
In your case, just attach the post processor as a child of the request whose response contains the token.
Also, I think you don't need to share this token with other threads, so don't use properties as proposed in the other answer.

how to submit query to .aspx page in python

I need to scrape query results from an .aspx web page.
http://legistar.council.nyc.gov/Legislation.aspx
The url is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.
Somebody out there must know how to do this.
As an overview, you will need to perform four main tasks:
to submit request(s) to the web site,
to retrieve the response(s) from the site
to parse these responses
to have some logic to iterate in the tasks above, with parameters associated with the navigation (to "next" pages in the results list)
The HTTP request and response handling is done with methods and classes from Python's standard library modules urllib and urllib2. The HTML pages can be parsed with the standard library's HTMLParser or with third-party modules such as Beautiful Soup.
The following snippet demonstrates requesting and receiving a search from the site indicated in the question. The site is ASP-driven, and as a result we need to send several form fields, some of them with 'horrible' values, because these are used by the ASP logic to maintain state and to authenticate the request to some extent. The requests have to be sent with the HTTP POST method, as that is what this ASP application expects. The main difficulty is identifying the form fields and associated values that ASP expects (getting pages with Python is the easy part).
This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.
import urllib
import urllib2

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'

# the http headers are useful to simulate a particular browser (some sites deny
# access to non-browsers, i.e. bots etc.); the content type also needs to be passed.
headers = {
    'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}

# we group the form fields and their values in a list (any iterable, actually)
# of name-value tuples. This helps with clarity and also makes them easy to
# encode later.
formFields = (
    # the viewstate is actually 800+ characters in length! I truncated it
    # for this sample code. It can be lifted from the first page
    # obtained from the site. It may be ok to hardcode this value, or
    # it may have to be refreshed each time / each day, by essentially
    # running an extra page request and parse, for this specific value.
    (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),
    # following are more of these ASP form fields
    (r'__VIEWSTATE', r''),
    (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
    (r'ctl00_RadScriptManager1_HiddenField', ''),
    (r'ctl00_tabTop_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
    # but then we come to the fields of interest: the search
    # criteria, the collections to search from, etc.
    # Check boxes
    (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
    (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # legislative text
    (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
    # etc. (not all listed)
    (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),               # search text
    (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),           # years to include
    (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),       # types to include
    (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # the Search button itself
)

# these have to be encoded
encodedFields = urllib.urlencode(formFields)

req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)  # that's the actual call to the http site.

# *** here would normally be the in-memory parsing of f's contents,
# but instead I store this to a file. This is useful during design,
# allowing to have a sample of what is to be parsed in a text editor,
# for analysis.
try:
    fout = open('tmp.htm', 'w')
    fout.writelines(f.readlines())
    fout.close()
except IOError:
    print('Could not open output file\n')
That's about it for getting the initial page. As said above, one would then need to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using HTML parsers, XSLT-type technologies (indeed after parsing the HTML to XML), or even, for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
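As a rough sketch of that parsing step with Beautiful Soup (the table lookup below is a guess based on the gridMain control name in the form fields above, not something verified against the actual page):

from bs4 import BeautifulSoup   # third-party module: Beautiful Soup

with open('tmp.htm') as fin:
    soup = BeautifulSoup(fin.read(), 'html.parser')

# guess: the results grid's id contains 'gridMain'; adjust after inspecting tmp.htm
table = soup.find('table', id=lambda v: v and 'gridMain' in v)
if table is not None:
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:
            print(cells)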
This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches, such as dedicated utilities, scripts for Mozilla's (Firefox) GreaseMonkey plug-in, XSLT, and so on.
Most ASP.NET sites (the one you referenced included) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL is not changing as you noted.
What you will need to do is look at the generated HTML and capture all the form values. Be sure to capture all of them, as some are used for page validation and without them your POST request will be denied.
Other than that validation, an ASPX page, as far as scraping and posting go, is no different from other web technologies.
Selenium is a great tool to use for this kind of task. You can specify the form values that you want to enter and retrieve the html of the response page as a string in a couple of lines of python code.
Using Selenium you might not have to do the manual work of simulating a valid post request and all of its hidden variables, as I found out after much trial and error.
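As a rough sketch of that approach (the field names are taken from the form fields listed in the earlier answer, but the locators and control types should be verified against the live page; this assumes Selenium's Python bindings and a local Chrome driver):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('http://legistar.council.nyc.gov/Legislation.aspx')

# choose "All Years" and "All Types" in the dropdowns, then click Search
Select(driver.find_element(By.NAME, 'ctl00$ContentPlaceHolder1$lstYears')).select_by_visible_text('All Years')
Select(driver.find_element(By.NAME, 'ctl00$ContentPlaceHolder1$lstTypeBasic')).select_by_visible_text('All Types')
driver.find_element(By.NAME, 'ctl00$ContentPlaceHolder1$btnSearch').click()

# the rendered results page is now available as a string
html = driver.page_source
driver.quit()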
The code in the other answers was useful; I never would have been able to write my crawler without it.
One problem I did come across was cookies. The site I was crawling was using cookies to log session id/security stuff, so I had to add code to get my crawler to work:
Add this import:
import cookielib
Init the cookie stuff:
COOKIEFILE = 'cookies.lwp' # the path and filename that you want to use to save your cookies in
cj = cookielib.LWPCookieJar() # This is a subclass of FileCookieJar that has useful load and save methods
Install CookieJar so that it is used as the default CookieProcessor in the default opener handler:
cj.load(COOKIEFILE)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
To see what cookies the site is using:
print 'These are the cookies we have received so far :'
for index, cookie in enumerate(cj):
    print index, ' : ', cookie
This saves the cookies:
cj.save(COOKIEFILE) # save the cookies
"Assume we need to select "all years" and "all types" from the respective dropdown menus."
What do these options do to the URL that is ultimately submitted?
After all, it amounts to an HTTP request sent via urllib2.
To find out how to do '"all years" and "all types" from the respective dropdown menus', do the following.
Select '"all years" and "all types" from the respective dropdown menus'
Note the URL which is actually submitted.
Use this URL in urllib2.
