wget won't download files I can access through browser

I am an amateur historian trying to access newspaper archives. The server where the scans are located "works" using an outdated tif viewer that doesn't seem to actually work at all anymore. I can access the files individually in chrome without logging in, but when I try to use wget or curl, I'm told that viewing the file is unauthorized, even when I use my login info, and even when using my cookies from chrome.
Here is an example of one of the files: https://ulib.aub.edu.lb/nahar/images2/7810W2/78101001.TIF
When I put this into chrome, it automatically downloads the file even though I cannot access the directory itself, but when I use wget, I get the following response: "401 unauthorized Username/Password Authentication Failed."
This is the basic wget command I'm using (if I can get it to work at all, then I'll input a list of the other files):
wget --no-check-certificate https://ulib.aub.edu.lb/nahar/images2/7810W2/78101001.TIF
I've tried variations with and without cookies, with a blank user, and with and without login credentials. As I'm sure you can tell, I'm new to this sort of thing but eager to learn.

From what I can see, authentication on your website is done with HTTP Basic. This kind of authentication does not use HTTP cookies; it uses the HTTP Authorization header. You can pass HTTP Basic credentials to wget with the following arguments.
wget --http-user=YourUsername --http-password=YourPassword https://ulib.aub.edu.lb/nahar/images2/7810W2/78101001.TIF
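To make it clearer why cookies from Chrome do not help here, this is a minimal Python sketch of what a Basic Authorization header contains. The function name basic_auth_header is made up for illustration and the credentials are placeholders; this is not what wget does internally line for line, only the same encoding.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP Basic auth."""
    # Basic auth joins the credentials as "username:password",
    # then base64-encodes the bytes. It is an encoding, not encryption.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"
```

Calling basic_auth_header("YourUsername", "YourPassword") gives the string that a client would send as "Authorization: <value>" on every request to the protected URL.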

Making an HTTP request with a blank user agent

I'm troubleshooting an issue that I think may be related to request filtering. Specifically, it seems every connection to a site made with a blank user agent string is being shown a 403 error. I can generate other 403 errors on the server doing things like trying to browse a directory with no default document while directory browsing is turned off. I can also generate a 403 error by using a tool like Modify Headers for Google Chrome (Google Chrome extension) to set my user agent string to the Baidu spider string which I know has been blocked.
What I can't seem to do is generate a request with a BLANK user agent string to try that. The extensions I've looked at require something in that field. Is there a tool or method I can use to make a GET or POST request to a website with a blank user agent string?
I recommend trying a CLI tool like cURL or a UI tool like Postman. You can carefully craft each header, parameter and value that you place in your HTTP request and trace fully the end to end request-response result.
This example straight from the cURL docs on User Agents shows you how you can play around with setting the user agent via cli.
curl --user-agent "Mozilla/4.73 [en] (X11; U; Linux 2.2.15 i686)" [URL]
In Postman it's just as easy: tinker with the headers and params as needed. You can also click the "code" link on the right-hand side and view as HTTP when you want to see the resulting request.
You can also use a heap of other HTTP tools such as Paw and Insomnia, all of which are quite well suited to your task at hand.
One last tip - in your Chrome debugging tools, you can right-click the specific request in the network tab and copy it as cURL. You can then paste your cURL command and modify it as needed. In Postman you can import a request by pasting raw text, and Postman will interpret the cURL command for you, which is particularly handy.
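If you would rather script it, here is a minimal sketch with Python's standard library that builds a request whose User-Agent header is present but empty. The helper name build_blank_ua_request is hypothetical, and I am assuming an empty header value (rather than a missing header) is what you want to test.

```python
import urllib.request


def build_blank_ua_request(url: str) -> urllib.request.Request:
    """Build a GET request whose User-Agent header is present but empty."""
    # Setting the header explicitly stops urllib from adding its default
    # "Python-urllib/x.y" User-Agent when the request is opened.
    return urllib.request.Request(url, headers={"User-Agent": ""}, method="GET")
```

To send it, pass the result to urllib.request.urlopen with your own site's URL. A 403 from the server should surface as urllib.error.HTTPError, so you can inspect the status code in an except block.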

Python Requests fails to get successful response with (converted) client cert

I’m trying to access an endpoint, which requires a client cert.
I’m starting from a .p12, which I was able to quickly import to Google Chrome, and can successfully access the endpoint. So the client certificate and endpoint are compatible.
However, I’m struggling to get Python Requests module (with Python 2.7) to successfully access the same endpoint.
My steps have been:
openssl pkcs12 -in my.p12 -out certificate.pem -nodes prompts me for a password, then creates certificate.pem
print(requests.get("<https://endpoint>", cert="certificate.pem").content) returns You don't have permission to access "http" on this server. (and a HTTP response of 403)
My PEM file contains three sets of -----BEGIN CERTIFICATE-----, and then -----BEGIN PRIVATE KEY-----.
All 4 BEGINs are preceded by Bag Attributes – removing these lines doesn’t make a difference.
I'm doing the key creation with a Ubuntu VM, but running the Python from a Windows machine - not sure if this makes a difference.
I’d welcome any ideas; particularly to understand if the issue is around the conversion to PEM, or if it’s with the request call.
The error is not indicative of a problem with the client certificate.
If your client certificate were the problem the documentation suggests your error would have been prefixed with "SSLError": http://docs.python-requests.org/en/master/user/advanced/#client-side-certificates
The relevant error is likely in the part you are censoring for privacy reasons. Having achieved authentication, the web server is rejecting your request for some other reason.
Possibly you are calling requests.get('https://website.com', ...
You may need to call requests.get('https://website.com/', ...
Or directly request a file resource within the website. When testing with Chrome, a non-displayed trailing '/' may have been used when Chrome made the request to the web server. Try adding / to the end of the address.
Certainly you shouldn't be using the "<" ">" tags shown in your example.
I found https://gist.github.com/erikbern/756b1d8df2d1487497d29b90e81f8068. With the delete=False param suggested in its comments, and with pyOpenSSL, it now works.
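For anyone wanting to rule out the PEM conversion, here is a small sketch that lists the blocks in a PEM file and drops the Bag Attributes lines between them. The helper names are made up for illustration, and it only inspects the text; it does not validate the certificates or the key.

```python
import re

# Matches one "-----BEGIN X----- ... -----END X-----" block, including its body.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\r?\n.*?-----END \1-----",
    re.DOTALL,
)


def pem_block_labels(pem_text: str) -> list:
    """Return the label of every PEM block, in file order."""
    return [match.group(1) for match in PEM_BLOCK.finditer(pem_text)]


def strip_bag_attributes(pem_text: str) -> str:
    """Keep only the PEM blocks, discarding any text between them."""
    blocks = [match.group(0) for match in PEM_BLOCK.finditer(pem_text)]
    return "\n".join(blocks) + "\n"
```

For the file described in the question, pem_block_labels should report three CERTIFICATE entries followed by one PRIVATE KEY entry.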

arcanist install-certificate fails

I set up my own hosted phabricator, everything is working fine (Diffusion repo etc)
I ran into a problem after I installed arcanist on my dev box and ran 'arc install-certificate'. I got the following exception:
Trying to connect to server...
LOGIN TO PHABRICATOR
Open this page in your browser and login to Phabricator if necessary:
http:///conduit/login/
Then paste the API Token on that page below.
Paste API Token from that page: cli-e644viducdcccrge4i7zo5nfa66d
Usage Exception: The token "cli-e644viducdcccrge4i7zo5nfa66d" is not a valid API Token. The server returned this response when trying to use it as a token: ERR-CONDUIT-CORE: Attempting to access attached data on PhabricatorUser (via getAwayUntil()), but the data is not actually attached. Before accessing attachable data on an object, you must load and attach it.
I am wondering what might have gone wrong. Thank you very much for your insights!
I've seen this problem occur many times with our users. In every case so far, the problem has been that users have set up the phabricator uri incorrectly.
Suggestion:
Check your project .arcconfig or your global .arcrc files (if you're doing this outside a project).
Verify that the URI to your Phabricator site is correct. The typical issue I've seen is accessing using http:// rather than https://
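A quick way to check this without opening the file by hand is sketched below. It assumes the project .arcconfig is JSON and stores the URI under the "phabricator.uri" key; the function name is hypothetical, so adjust both to your setup.

```python
import json
from urllib.parse import urlsplit


def phabricator_uri_scheme(arcconfig_text: str) -> str:
    """Return the URI scheme ("http" or "https") configured for Phabricator."""
    config = json.loads(arcconfig_text)
    # Assumption: the server URI lives under the "phabricator.uri" key.
    uri = config["phabricator.uri"]
    return urlsplit(uri).scheme
```

Read your .arcconfig into a string, pass it in, and compare the result with the scheme you use when you open Phabricator in the browser.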

wp-piwik configuration fails for multiple reasons

I had a working wp-piwik plugin. Its configuration looked like this:
It worked fine. Then I moved the piwik installation to a different location (so I can track more sites).
Now wp-piwik only recognizes the auth key if we're not using a /home... path, but http://analytics... path. And unfortunately it doesn't work - no tracking occurs.
If I give a wrong URL, it sees it's wrong.
If I supply a wrong auth token, it sees it's wrong. So somehow the plugin communicates with piwik.
If I supply piwik location as a /home/piwik/, it doesn't work, and I'm pretty sure it used to work before I updated from 2.3.0 to 2.7.0.
My site works fine with the tracking code. But that is visible in the browser and I don't like that. I don't want the tracking to depend on the user's browser. I don't want the client to know that there's tracking.
If there's a manual PHP code inclusion that I can use, I'd use it, but I'm not very good with PHP.
The option "Piwik path (PHP API, beta):" works only if I remove the auth token, but it also doesn't track. The piwik path can only be /home/piwik....
Here's what Support --> "Run test script" gives me:
*** Test 1/2: SitesManager.getSitesWithAtLeastViewAccess ***
Using: cURL
SSL peer verification: enabled
User Agent:
Call: /home/user/piwik_location/?module=API&method=SitesManager.getSitesWithAtLeastViewAccess&format=XML&token_auth= + TOKEN
Result:
Time: 0s
*** Test 2/2: SitesManager.getSitesIdFromSiteUrl ***
Using: cURL
SSL peer verification: enabled
User Agent:
Call: /home/user/piwik_location/?module=API&method=SitesManager.getSitesIdFromSiteUrl&url=http%3A%2F%2Fsite.com&format=XML&token_auth= + TOKEN
Result:
Time: 0s
Any suggestions?
You also have to update the trusted hostname if you move Piwik:
http://piwik.org/faq/troubleshooting/faq_171/
If this does not help, check WP-Piwik's debug script output (see support tab in settings menu).
UPDATE: As discussed here (https://wordpress.org/support/topic/wp-piwik-recognizes-the-key-but-doesnt-track-with-piwik-url-or-piwik-path?replies=7), the plugin's behavior is as expected. The test script output above shows the wrong value for Piwik URL (it contains the path instead). If entering the Piwik URL and enabling the tracking script, the tracking works.
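To illustrate why the test calls in the question return nothing, here is a rough sketch of how such an API call is put together. The query parameters are the ones shown in the test script output; the function name, the example host, and the scheme check are my own assumptions, not the plugin's code.

```python
from urllib.parse import urlencode, urlsplit


def piwik_api_url(piwik_url: str, method: str, token: str) -> str:
    """Build an HTTP API call URL for a Piwik installation."""
    # A filesystem path such as /home/user/piwik_location/ has no scheme,
    # so it cannot be fetched over HTTP with cURL.
    if urlsplit(piwik_url).scheme not in ("http", "https"):
        raise ValueError("Piwik URL must start with http:// or https://")
    query = urlencode(
        {"module": "API", "method": method, "format": "XML", "token_auth": token}
    )
    return piwik_url.rstrip("/") + "/?" + query
```

With a real URL the call resolves to something cURL can request; with the /home/... path it raises instead, which matches the empty "Result:" lines above.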

HTTP Basic Authentication - what's the expected web browser experience?

When a server allows access via Basic HTTP Authentication, what is the experience expected to be in a web browser?
Ignoring the web browser for a moment, here's how to create a Basic Auth request with curl:
curl -u myusername:mypassword http://somesite.example
But what about in a Web Browser? What I've seen on some websites, is I visit the URL, and then the server returns response code 401. The browser then displays a username/password prompt.
However, on somesite.example, I'm not getting an authorization prompt at all, just a page that says I'm not authorized. Did somesite not implement the Basic Auth workflow correctly, or is there something else I need to do?
To help everyone avoid confusion, I will reformulate the question in two parts.
First: "how can I make an authenticated HTTP request with a browser, using BASIC auth?".
In the browser you can do HTTP basic auth either by waiting for the prompt to appear, or by editing the URL to follow this format: http://myusername:mypassword@somesite.example
NB: the curl command mentioned in the question is perfectly fine, if you have a command line and curl installed. ;)
References:
https://en.wikipedia.org/wiki/Basic_access_authentication#URL_encoding
https://en.wikipedia.org/wiki/Uniform_Resource_Locator#Syntax
https://www.rfc-editor.org/rfc/rfc3986#page-18
Also according to the CURL manual page https://curl.haxx.se/docs/manual.html
HTTP
Curl also supports user and password in HTTP URLs, thus you can pick a file
like:
curl http://name:passwd@machine.domain/full/path/to/file
or specify user and password separately like in
curl -u name:passwd http://machine.domain/full/path/to/file
HTTP offers many different methods of authentication and curl supports
several: Basic, Digest, NTLM and Negotiate (SPNEGO). Without telling which
method to use, curl defaults to Basic. You can also ask curl to pick the
most secure ones out of the ones that the server accepts for the given URL,
by using --anyauth.
NOTE! According to the URL specification, HTTP URLs can not contain a user
and password, so that style will not work when using curl via a proxy, even
though curl allows it at other times. When using a proxy, you _must_ use
the -u style for user and password.
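In the userinfo form of a URL, the credentials sit between the scheme and an "@" sign placed before the host. As a small illustration, Python's standard URL parser can pull them apart again; the helper name split_credentials is made up for this sketch.

```python
from urllib.parse import urlsplit


def split_credentials(url: str) -> tuple:
    """Return (username, password, host) from a URL with a userinfo part."""
    parts = urlsplit(url)
    # username and password are None when the URL has no userinfo part.
    return parts.username, parts.password, parts.hostname
```

This is handy for checking that a URL you typed actually carries the credentials you think it does before blaming the server.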
The second and real question is "However, on somesite.example, I'm not getting an authorization prompt at all, just a page that says I'm not authorized. Did somesite not implement the Basic Auth workflow correctly, or is there something else I need to do?"
The curl documentation says the -u option supports many methods of authentication, Basic being the default.
Have you tried?
curl somesite.example --user username:password
You might have old invalid username/password cached in your browser. Try clearing them and check again.
If you are using IE and somesite.example is in your Intranet security zone, IE may be sending your Windows credentials automatically.
WWW-Authenticate header
You may also get this if the server is sending a 401 response code but not setting the WWW-Authenticate header correctly - I should know, I've just fixed that in our own code because VB apps weren't popping up the authentication prompt.
If there are no credentials provided in the request headers, the following is the minimum response required for IE to prompt the user for credentials and resubmit the request.
Response.Clear();
Response.StatusCode = (Int32)HttpStatusCode.Unauthorized;
Response.AddHeader("WWW-Authenticate", "Basic");
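The same idea as a minimal Python sketch, for anyone who wants to reproduce the prompt locally. The names challenge_for and BasicAuthChallengeHandler are made up, the realm value "example" is my addition (RFC 7617 requires a realm parameter), and the sketch does not verify the credentials at all.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def challenge_for(authorization):
    """Return (status, extra_headers) for a request's Authorization value."""
    if authorization is None:
        # 401 plus WWW-Authenticate is what makes a browser show its prompt.
        return 401, {"WWW-Authenticate": 'Basic realm="example"'}
    # Assumption: any credentials are accepted; a real server must check them.
    return 200, {}


class BasicAuthChallengeHandler(BaseHTTPRequestHandler):
    """Challenge requests that arrive without an Authorization header."""

    def do_GET(self):
        status, extra_headers = challenge_for(self.headers.get("Authorization"))
        body = b"ok\n" if status == 200 else b""
        self.send_response(status)
        for name, value in extra_headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 0) -> HTTPServer:
    """Create (but do not start) a local server using the handler above."""
    return HTTPServer(("127.0.0.1", port), BasicAuthChallengeHandler)
```

Calling make_server(8000).serve_forever() and visiting http://127.0.0.1:8000/ should bring up the browser's username/password dialog; removing the WWW-Authenticate entry from challenge_for should leave you with a bare 401 page instead.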
You can use Postman, a plugin for Chrome.
It gives you the ability to choose the authentication type you need for each request.
In that menu you can configure the user and password.
Postman will automatically translate the config to an authentication header that will be sent with your request.