How to capture HTML tables using the Tcl http package (with an example)?
I have tried an example, but it simply logs me out of the session.
package require http
package require base64
set auth "Basic [base64::encode XXXX:XXXXX]"
set headerl [list Authorization $auth]
set query "http://100.59.262.156/"
set tok [http::geturl $query -headers $headerl]
set res [http::data $tok]
http::status $tok
# After logging into the session, I move from one page to another web page
set goquery [http::formatQuery "http://100.59.262.156/web/sysstatus.html"]
# After this I am logged out of the session. I am unable to find the reason.
set tok [http::geturl $query -query http://100.59.262.156/web/sysstatus.html]
set res [http::data $tok]
# After this I get a table as output; I need to capture this table.
# How do I capture tables?
http::status $tok
Thanks
Malli
There are some misconceptions I see here:
You should clean up the tokens that http::geturl returns. You do that with http::cleanup $token; otherwise you get memory leaks.
The basic auth header has to be sent with every request. It is usually enough to request the desired page directly with the right headers.
http::formatQuery is for POST or GET parameters, not for the URL (you can use it for the query part of the URL, but not for the entire URL). Drop that; see the small example after this list.
The http package does not parse HTML for you. You have to do that yourself. I suggest using tdom.
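As an aside on the third point, http::formatQuery only builds the encoded parameter string. A minimal sketch of its intended use (the parameter names here are invented for illustration) would be:
set params [http::formatQuery page sysstatus lang en]
set tok [http::geturl "http://100.59.262.156/web/sysstatus.html?$params" -headers $headerl]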
Because I don't know what your site returns, I can't tell you how to parse it, but here is a script to get started:
package require http
package require base64
package require tdom
set auth "Basic [base64::encode XXXX:XXXXX]"
set headerl [list Authorization $auth]
set url "http://100.59.262.156/web/sysstatus.html"
set tok [http::geturl $url -headers $headerl]
# TODO: Check the status here. If you get a 401 or 403, the login information was not correct.
set data [http::data $tok]
# Important: cleanup
http::cleanup $tok
# Now parse it
dom parse -html $data doc
# Search for the important stuff, walk over the DOM... to get to the important information
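For instance, if the status page contains an HTML table, a minimal sketch for pulling its rows out (assuming plain <tr>/<td> markup; adjust the XPath to your actual page) might be:
set root [$doc documentElement]
# take the first <table> on the page
set table [lindex [$root selectNodes {//table}] 0]
foreach row [$table selectNodes {.//tr}] {
    set cells {}
    foreach cell [$row selectNodes {td|th}] {
        lappend cells [string trim [$cell asText]]
    }
    puts [join $cells "\t"]
}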
Related
I am trying to write a function that will take a list of dates and retrieve the dataset as found on https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm
I am using PROC IML in SAS to execute R-code (since I am more familiar with R).
My problem is within R, and is due to the website.
First, I am aware that there is an API, but this is a technique I really want to learn because many sites do not have APIs.
Does anyone know how to retrieve the datasets?
Things I've heard:
Use RSelenium to script the clicking. RSelenium was taken off the archive recently, so that isn't an option (even installing a previous version is causing issues).
Watch how the requests change as I click the "Submit" button in Chrome. However, the XHR entries in the Network tab don't show anything here, whereas they do on other websites that use different search mechanisms.
I have been looking for a solution all day, but to no avail! Please help.
First, you need to read the terms and conditions and make sure that you are not breaking the rules when scraping.
Next, if there is an API, you should use it so that they can better manage their data usage and operations.
In addition, you should also limit the number of requests you make so as not to overload the server; if I am not wrong, doing otherwise is related to Denial of Service (DoS) attacks.
Finally, if those conditions are satisfied, you can use Chrome's inspector to see what HTTP requests are being made when you browse these webpages.
In this particular case, you do not need RSelenium; a simple HTTP POST will do:
library(httr)

resp <- POST("https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm",
             body = list(
               priceDate.month = 5,
               priceDate.day = 15,
               priceDate.year = 2018,
               submit = "CSV+Format"
             ),
             encode = "form")

read.csv(text = rawToChar(resp$content), header = FALSE)
You can perform the same http processing in a SAS session using Proc HTTP. The CSV data does not contain a header row, so perhaps the XML Format is more appropriate. There are a couple of caveats for the treasurydirect site.
Prior to posting a data download request, the connection needs some cookies that are assigned during a GET request. Proc HTTP can do this.
The XML contains an extra container tag, <bpd>, that the SAS XMLV2 library engine can't handle on its own. This extra tag can be removed with some DATA step processing.
Sample code for the XML format:
filename response TEMP;
filename respfilt TEMP;
* Get request sets up fresh session and cookies;
proc http
clear_cache
method = "get"
url ="https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
;
run;
* Post request as performed by XML format button;
* automatically utilizes cookies setup in GET request;
* in= can now directly specify the parameter data to post;
proc http
method = "post"
in = 'priceDate.year=2018&priceDate.month=5&priceDate.day=15&submit=XML+Format'
url ="https://www.treasurydirect.gov/GA-FI/FedInvest/selectSecurityPriceDate.htm"
out = response
;
run;
* remove bpd tag from the response (the downloaded xml);
data _null_;
infile response;
file respfilt;
input;
if _infile_ not in: ('<bpd', '</bpd');
put _infile_;
run;
* copy data collections from xml file to tables in work library;
libname respfilt xmlv2 ;
proc copy in=respfilt out=work;
run;
Reference material
REST at Ease with SAS®: How to Use SAS to Get Your REST
Joseph Henry, SAS Institute Inc., Cary, NC
http://support.sas.com/resources/papers/proceedings16/SAS6363-2016.pdf
I need to reuse a value that was generated by my previous request.
For example, in the first request I make a POST to the URL /api/products/{UUID} and get an HTTP response with code 201 (Created) and an empty body.
In the second request I want to get that product with GET /api/products/{UUID}, where the UUID should come from the first request.
So, the question is: how do I store that UUID between requests and reuse it?
You can use the Request Sent dynamic values (https://paw.cloud/extensions?extension_type=dynamic_value&q=request+send); these will give you the value used the last time you sent a given request.
In your case you will want to combine the URLSentValue with the RegExMatch extension (https://paw.cloud/extensions/RegExMatch) to first get the URL as it was last sent for a request and then extract the UUID from that URL.
For example (screenshots of Request A and Request B omitted).
The problem is in your first request's response: just don't return "[...] an empty body."
If you are talking about a REST design, you should return the UUID in the response to the first request, and the client will use it in its second call: GET /api/products/{UUID}.
The basic idea behind REST is that the server doesn't store any information about previous requests and is "stateless".
I would also adjust your first request. In general the server should generate the UUID and return it (maybe you have reasons to break that rule; if so, please excuse me). The server has (at least sometimes) a better random generator, and you can avoid conflicts. So you would usually design it like this:
CLIENT: POST /api/products/ -> Server returns: 201 {product_id: UUID(1234...)}
Client: GET /api/products/{UUID} -> Server returns: 200 {product_detail1: ..., product_detail2: ...}
If your client "loses" the information and you want it to be able to retrieve its products later, you would usually implement an API endpoint like this:
Client: GET /api/products/ -> Server returns: 200 [{id: UUID(1234...), title: ...}, {id: UUID(5678...), title: ...}]
Given something like this, presuming the {UUID} is your replacement "variable":
It is probably so simple it escaped you. All you need to do is create a text file, say UUID.txt:
(with sample data say "12345678U910" as text in the file)
Then all you need to do is replace the {UUID} in the URL with a dynamic token for a file. Delete the {UUID} portion, then right click in the URL line where it was and select
Add Dynamic Value -> File -> File Content :
You will get a drag-n-drop reception widget:
Either press the "Choose File..." or drop the file into the receiver widget:
Don't worry that the dynamic variable token (the blue thing in the URL) doesn't change yet... Then click elsewhere to let the drop receiver go away, and you will have exactly what you want: a variable you can use across URLs or anywhere else for that matter (header fields, form fields, body, etc.):
Paw is a great tool that goes asymptotic to awesome when you explore the dynamic value capability. The most powerful one I have found yet is the regular expression parsing, which can parse a raw HTML reply and capture anything you want for the next request... For example, if your UUID came from some user input, was ingested by the server, and was then returned in an HTML reply, you could capture it from that reply and re-inject it into the URL, any other field, or even the cookies, using the Dynamic Value capabilities of Paw.
#chickahoona's answer touches on the more normal way of doing it, with the first request posting to an endpoint without a UUID and the server returning it. With that in place, you can use the RegExpMatch extension to extract the value from the server's response and use it in subsequent requests.
Alternatively, if you must generate the UUID on the client side, the RegExpMatch extension can again help: simply choose the create request's URL as the source and provide a regexp that strips the UUID off the end of it, such as /([^/]+)$.
A third option I'll throw out to you: put the UUID in an environment variable and have all of your requests reference it from there.
I'm building a custom dynamic value extension for Paw. However, I'm not able to set a header in the evaluate method. See the sample code below:
evaluate(context) {
const request = context.getCurrentRequest();
request.setHeader('Content-Type', this.contentType); // <-- this gives warning
return this.createSignable(request); // This returns a base64 string.
}
I get a warning saying "JavaScript extension error: Cannot perform modifications", and the header is not set. (When I comment out the request.setHeader call, I get no warnings.)
Can anyone please help me resolve this issue?
This is correct: you cannot use setters (set any value) in a dynamic value. In fact, the way Paw evaluates dynamic values is asynchronous, and as multiple evaluations can take place simultaneously, it would be impossible to record the modifications. For this reason, Paw simply denies the changes, and no change is persisted during evaluation.
In the documentation, it's specified that these methods (like setHeaders) are only available for importer extensions. Sorry for the inconvenience!
I think that to achieve what you're trying to do, you would need two dynamic values: one set in the Authorization header and one set in the Content-Type header.
Alternatively, in the future we're going to add request post-processors, so you'll be able to mutate the computed request ready to be sent to the server for additional modifications.
I'm currently creating a test suite for a new API; at the moment I've sent a POST request and it's responding as expected. However, I'm now performing further validation, such as checking the status code, and I also wish to check the Location header. The problem is that, through trial and error, I've been unable to access the Location header value from the response. Below is some cut-down code:
${POST_REQUEST}    Replace String    ${CLAIM_AVAILABLE_BASE_URL}    PLAN_NAME    ${VALID_PLAN}
${file_data}=    Get Binary File    Data/Json/API/GETNaviNetClaimID/valid_aries_claim_local_only.json
${POST_RESPONSE}    Post Request    APIService    ${POST_REQUEST}    data=${file_data}
Should Be Equal As Strings    ${POST_RESPONSE.status_code}    ${HTTP STATUSCODE OK}
I can access the header object using:
${POST_RESPONSE.headers}
But so far I've been unable to pull out just the Location header value. Can anyone offer any assistance? I'm using the Requests Library.
It seems possible using the below; just replace location with the key you're looking for.
${location_header}=    Get From Dictionary    ${POST_RESPONSE.headers}    location
I don't like this solution, though, so I'm open to anything better!
I have created some headers to log in to a server.
After logging in to the server, I move from one page to another using the geturl operation with the headers below, but the problem is that I get logged out of the server and cannot go any further.
I think the cookie information is missing.
set headers(Accept) "text/html\;q=0.9,text/plain\;q=0.8,image/png,*/*"
set headers(Accept-Language) "en-us,en\;q=0.5"
set headers(Accept-Charset) "ISO-8859-1,utf-8\;q=0.7,*\;q=0.7"
set headers(Proxy-Authorization) "Basic [base64::encode $username:$password]"
I don't know how to set the cookie information in the headers; could someone explain?
Thanks
Malli
Cookie support in Tcl is currently exceptionally primitive; I've got 95–99% of the fix in our fossil repository, but that's not much help to you. But for straightforward handling of a session cookie for login purposes, you can "guerrilla hack" it.
Sending the cookie
To send a cookie to the server, you need to send a Cookie: thecookiestring header. That's done by passing the -headers option to http::geturl, which takes a dictionary describing what to pass. We can build that from the array simply enough:
set headers(Cookie) $thecookiestring
set token [http::geturl $theurl -headers [array get headers]]
# ...
Receiving the cookie
That's definitely the easy bit. The rather harder part is that you also need to check for a Set-Cookie header in the response when you do the login action. You get that with http::meta and then iterate through the list with foreach:
set thecookiestring ""
set token [http::geturl $theloginurl ...]
if {[http::ncode $token] >= 400} {error ...}
foreach {name value} [http::meta $token] {
if {$name ne "Set-Cookie"} continue
# Strip the stuff you probably don't care about
if {$thecookiestring ne ""} {append thecookiestring "; "}
append thecookiestring [regsub {;.*} $value ""]
}
Formally, there can be many cookies and they have all sorts of complicated features. Handling them is what I was working on in that fossil branch…
I'm assuming that you don't need to be able to forget cookies, manage persistent storage, or other such complexities. (After all, they're things you probably won't need for normal login sessions.)
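Putting the two halves together, here is a rough sketch of the whole login-then-browse flow (the URLs are taken from your earlier question, and I'm assuming the login page answers with a session cookie; adjust as needed):
package require http
package require base64

# Basic auth header (use Proxy-Authorization instead only if you really go through a proxy)
set headers(Authorization) "Basic [base64::encode $username:$password]"

# Log in and collect any session cookie from the response
set thecookiestring ""
set token [http::geturl "http://100.59.262.156/" -headers [array get headers]]
if {[http::ncode $token] >= 400} {error "login failed: [http::code $token]"}
foreach {name value} [http::meta $token] {
    if {$name ne "Set-Cookie"} continue
    if {$thecookiestring ne ""} {append thecookiestring "; "}
    append thecookiestring [regsub {;.*} $value ""]
}
http::cleanup $token

# Browse further pages, sending the cookie back each time
set headers(Cookie) $thecookiestring
set token [http::geturl "http://100.59.262.156/web/sysstatus.html" -headers [array get headers]]
set page [http::data $token]
http::cleanup $token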
I solved it using this tool: Fiddler.
Thanks All