How can I port this wget stuff to Scala:
wget --keep-session-cookies --save-cookies cookies.txt --post-data 'password=xxxx&username=zzzzz' http://server.com/login.jsp
wget --load-cookies cookies.txt http://server.com/download.something
I want to write a tiny, portable script with no external libraries.
Can that be done easily?
Your two main requirements appear to be:
Authenticate by POSTing some body text
Maintain the session cookies between requests.
Since Scala itself doesn't have much HTTP support in the standard library beyond scala.io.Source, you're pretty much stuck with HttpURLConnection from Java itself. This site already has some examples of using HttpURLConnection in ways like this:
Reusing HttpURLConnection so as to keep session alive
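For what it's worth, here is a minimal sketch of what those two wget calls might look like using HttpURLConnection, reusing the URLs and form fields from your commands and nothing beyond the JDK and the Scala standard library (run it as a Scala script; the CookieManager plays the role of cookies.txt):
import java.net.{CookieHandler, CookieManager, CookiePolicy, HttpURLConnection, URL, URLEncoder}
import scala.io.Source

// JVM-wide cookie jar: every HttpURLConnection opened afterwards stores and replays cookies
// automatically, which stands in for --keep-session-cookies/--save-cookies/--load-cookies.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL))

// 1. POST the credentials (wget --post-data '...' http://server.com/login.jsp)
val login = new URL("http://server.com/login.jsp").openConnection().asInstanceOf[HttpURLConnection]
login.setRequestMethod("POST")
login.setDoOutput(true)
login.setRequestProperty("Content-Type", "application/x-www-form-urlencoded")
val form = "password=" + URLEncoder.encode("xxxx", "UTF-8") +
           "&username=" + URLEncoder.encode("zzzzz", "UTF-8")
login.getOutputStream.write(form.getBytes("UTF-8"))
login.getResponseCode  // forces the request; the CookieManager captures the session cookie

// 2. Fetch the protected resource; the stored cookie is sent automatically (wget --load-cookies)
val dl = new URL("http://server.com/download.something").openConnection().asInstanceOf[HttpURLConnection]
println(Source.fromInputStream(dl.getInputStream).mkString)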
I am working with the InvokeHTTP processor and need to make a POST request. It has three parameters: -H, -d, -F.
The -H values are passed as attribute/value pairs.
-d is passed through the flowfile content in the required form.
How do I pass the -F parameter? I want to use the Rocket.Chat REST API from NiFi.
It sounds like you are discussing curl flags, not HTTP-specific request values. For the record, the -F flag overrides -d in curl commands.
If you are attempting multipart/form-data uploads, you may be interested in the work done in NIFI-7394 to improve that handling in the InvokeHTTP processor.
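For context (this is what curl -F produces on the wire rather than anything NiFi-specific), a multipart/form-data request is just a body made of boundary-delimited parts. A rough JVM-side sketch, with a made-up endpoint, field name and file, looks like this:
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Paths}

// Hypothetical endpoint, field name and file name - only the request shape matters here.
val boundary = "----example-boundary-" + System.currentTimeMillis()
val conn = new URL("https://chat.example.com/api/v1/rooms.upload/ROOM_ID")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + boundary)

val out = conn.getOutputStream
def writeLine(s: String): Unit = out.write((s + "\r\n").getBytes(UTF_8))

// One part per curl -F argument: part headers, a blank line, the payload, then CRLF.
writeLine("--" + boundary)
writeLine("Content-Disposition: form-data; name=\"file\"; filename=\"report.pdf\"")
writeLine("Content-Type: application/pdf")
writeLine("")
out.write(Files.readAllBytes(Paths.get("report.pdf"))); writeLine("")
writeLine("--" + boundary + "--")   // the closing boundary ends the body
out.flush()
println(conn.getResponseCode)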
Motivation
I'm currently an exchange student at Taiwan Tech in Taipei, but the course overview/search engine is not very convenient to use, so I'm trying to scrape it, which has unexpectedly led to a lot of difficulties.
Problem
Opening https://qcourse.ntust.edu.tw works just fine when using Chrome/Firefox; however, I run into trouble when trying to use command line interfaces:
# Trying to use curl:
$ curl https://qcourse.ntust.edu.tw
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to qcourse.ntust.edu.tw:443
# Trying to use wget:
$ wget https://qcourse.ntust.edu.tw
--2019-02-25 12:13:55-- https://qcourse.ntust.edu.tw/
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving qcourse.ntust.edu.tw (qcourse.ntust.edu.tw)... 140.118.242.168
Connecting to qcourse.ntust.edu.tw (qcourse.ntust.edu.tw)|140.118.242.168|:443... connected.
GnuTLS: The TLS connection was non-properly terminated.
Unable to establish SSL connection.
I also run into trouble when trying to use the browser Pale Moon.
What I've considered
Maybe there is a problem with the certificate itself?
Seemingly not:
# This uses the same wildcard certificate (*.ntust.edu.tw) as qcourse.ntust.edu.tw
# (I double checked, and the SHA256 fingerprint is identical)
$ curl https://www.ntust.edu.tw
<html><head><meta http-equiv='refresh' content='0; url=bin/home.php'><title>title</title></head></html>%
Maybe I need specific headers that only Chrome/Firefox sends by default?
It seems like this doesn't solve anything either. By opening the request (Network tab) in Chrome, right clicking, and choosing "Copy" > "Copy as cURL", I get the same error message as earlier.
Additional information
The course overview site is written in ASP.NET, and seems to be running on Microsoft IIS httpd 6.0.
I find this quite mysterious and intriguing. I hope someone might be able to offer an explanation of this behaviour, and if possible: a workaround.
As you can see from the SSLLabs report, this is a server with a terrible setup. It gets a rating of F since it supports the totally broken SSLv2, the mostly broken SSLv3, and many, many totally broken ciphers. The only halfway secure way to access this server is TLS 1.0 with TLS_RSA_WITH_3DES_EDE_CBC_SHA (3DES), a cipher which is not considered insecure like the others, but only weak.
However, since 3DES is considered weak (albeit not insecure), it is disabled by default in most modern TLS stacks; one needs to specifically enable support for it. For curl with the OpenSSL backend this looks like the following, provided that the OpenSSL library you use still supports 3DES in the first place (not the case with a default build of OpenSSL 1.1.1):
$ curl -v --cipher '3DES' https://qcourse.ntust.edu.tw
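The same applies if you reach the server from the JVM instead of curl: 3DES (and, on recent JDKs, TLS 1.0 itself) sits on the jdk.tls.disabledAlgorithms list and has to be re-enabled before the first TLS connection is made. A rough sketch, with the caveat that the exact entry names vary between JDK releases (check your JDK's java.security file):
import java.net.{HttpURLConnection, URL}
import java.security.Security
import scala.io.Source

// WARNING: this weakens TLS for the whole JVM; only do it to reach this one legacy host.
val disabled = Security.getProperty("jdk.tls.disabledAlgorithms")
  .split(",").map(_.trim)
  .filterNot(e => e.contains("3DES") || e == "TLSv1" || e == "TLSv1.1")  // re-allow 3DES and TLS 1.0
Security.setProperty("jdk.tls.disabledAlgorithms", disabled.mkString(", "))
// Offer only the one cipher the server handles halfway reasonably.
System.setProperty("jdk.tls.client.cipherSuites", "TLS_RSA_WITH_3DES_EDE_CBC_SHA")

val conn = new URL("https://qcourse.ntust.edu.tw").openConnection().asInstanceOf[HttpURLConnection]
println(Source.fromInputStream(conn.getInputStream).mkString.take(200))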
I'm beginning to learn to code.
Someone said to me: "cURL is the best http client".
To help me understand this sentence, I have two questions:
what is an HTTP CLIENT; and
what is cURL?
I understand you are asking two things:
What is an HTTP CLIENT?
This is any program or application used to make communications on the web using the Hypertext Transfer Protocol (HTTP). A common example is a browser.
What is cURL?
This is a particular HTTP CLIENT designed to make HTTP communications on the web, but built to be used from the command line of a terminal (command prompt).
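To make that concrete: anything that opens a connection and speaks HTTP counts. For example, this tiny sketch (example URL only) performs a single GET request, which is essentially what a browser or curl does under the hood:
import scala.io.Source

// A minimal HTTP client: request http://example.com with GET and print the response body.
println(Source.fromURL("http://example.com")(scala.io.Codec.UTF8).mkString)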
If you perform a search for these topics, you will easily be able to find more in depth explanations about each.
I have a program already written in gawk that downloads a lot of small bits of info from the internet. (A media scanner and indexer)
At present it launches wget to get the information. This is fine, but I'd like to simply reuse the connection between invocations. It's possible a run of the program might make between 200 and 2000 calls to the same API service.
I've just discovered that gawk can do networking and found geturl.
However, the advice at the bottom of that page is well heeded; I can't find an easy way to read the last line and keep the connection open.
As I'm mostly reading JSON data, I can set RS="}" and exit when the body length reaches the expected content-length. This might break with any trailing whitespace, though. I'd like a more robust approach. Does anyone have a nicer way to implement sporadic HTTP requests in awk that keep the connection open? Currently I have the following structure...
con="/inet/tcp/0/host/80";
send_http_request(con);
RS="\r\n";
read_headers();
# now read the body - but do not close the connection...
RS="}"; # for JSON
while ( con |& getline bytes ) {
body = body bytes RS;
if (length(body) >= content_length) break;
print length(body);
}
# Do not close con here - keep open
It's a shame this one little thing seems to be spoiling all the potential here. Also, in case anyone asks :) ..
awk was originally chosen for historical reasons - there were not many other language options on this embedded platform at the time.
Gathering up all of the URLs in advance and passing to wget will not be easy.
Re-implementing in Perl/Python etc. is not a quick solution.
I've looked at trying to pipe URLs through a named pipe into wget -i -, but that doesn't work. Data gets buffered, and unbuffer is not available - also, I think wget gathers up all the URLs until EOF before processing.
The data is small so lack of compression is not an issue.
The problem with the connection reuse comes from the HTTP 1.0 standard, not gawk. To reuse the connection you must either use HTTP 1.1 or try some other non-standard solutions for HTTP 1.0. Don't forget to add the Host: header in your HTTP/1.1 request, as it is mandatory.
You're right about the lack of robustness when reading the response body. For line-oriented protocols this is not an issue. Moreover, even when using HTTP 1.1, if your script blocks waiting for more data when it shouldn't, the server will again close the connection due to inactivity.
As a last resort, you could write your own HTTP retriever in whatever language you like which reuses connections (all to the same remote host, I presume) and also inserts a special record separator for you. Then you could control it from the awk script.
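For example, a small helper along these lines (a sketch only; the host name and separator byte are placeholders) reads one URL path per line on stdin, fetches each one from the same server over a kept-alive connection, and appends a record separator that the awk script can use as RS:
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Placeholder host and separator - adjust for the real API service.
val host = "api.example.com"
val sep  = "\u0001"   // the awk script can set RS to this byte

// The JVM reuses the underlying TCP connection between requests (HTTP/1.1 keep-alive),
// provided each response body is read to the end and its stream is closed.
for (path <- Source.stdin.getLines()) {
  val conn = new URL("http://" + host + path).openConnection().asInstanceOf[HttpURLConnection]
  val in = conn.getInputStream
  try print(Source.fromInputStream(in).mkString)
  finally in.close()
  print(sep)
  Console.out.flush()   // make each record visible to the awk coprocess immediately
}
From gawk you would then start this helper as a coprocess, print each path to it with |&, read records back with getline, and set RS to the separator.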
When I type wget http://yahoo.com:80 in a Unix shell, can someone explain to me what exactly happens, from entering the command to reaching the Yahoo server? Thank you very much in advance.
The RFCs provide you with all the details you need and are not tied to a particular tool or OS.
In your case wget uses HTTP, which is built on TCP, which in turn uses IP; below that it depends on your setup, but most of the time you will encounter Ethernet frames.
In order to understand what happens, I urge you to install Wireshark and have a look at the dissected frames; you will get an overview of which data belongs to which network layer. That is the easiest way to visualize and learn what happens. Besides this, if you really like (irony) funny documents (/irony), have a look at the corresponding RFCs: RFC 2616 for HTTP, for example; for the others, have a look at the external links at the bottom of the Wikipedia articles.
The program uses DNS to resolve the host name to an IP. The classic API call is gethostbyname although newer programs should use getaddrinfo to be IPv6 compatible.
Since you specify the port, the program can skip looking up the default port for HTTP. But if you hadn't, it would try a getservbyname to look up the default port (then again, wget may just embed port 80).
The program uses the network API to connect to the remote host. This is done with socket and connect.
The program writes an HTTP request to the connection with a call to write.
The program reads the http response with one or more calls to read.
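Those steps map almost one-to-one onto the plain socket API. A minimal sketch (no error handling, redirects or HTTPS) looks like this:
import java.net.{InetAddress, Socket}
import scala.io.Source

// 1. DNS: resolve the host name to an IP address (the gethostbyname/getaddrinfo step).
val addr = InetAddress.getByName("yahoo.com")

// 2. Connect: open a TCP socket to that address on port 80 (socket + connect).
val sock = new Socket(addr, 80)

// 3. Write: send a minimal HTTP request (the write step).
sock.getOutputStream.write(
  "GET / HTTP/1.1\r\nHost: yahoo.com\r\nConnection: close\r\n\r\n".getBytes("US-ASCII"))
sock.getOutputStream.flush()

// 4. Read: consume the response until the server closes the connection (the read step).
println(Source.fromInputStream(sock.getInputStream)(scala.io.Codec.ISO8859).mkString.take(500))
sock.close()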