httr GET works in browser, but fails in RStudio - r

Background:
I'm using the code below to get a zipped file containing some XML documents. A note on reproducibility: the endpoint is IP-restricted as well as password-protected, so you won't be able to request anything from it (it'll just time out).
GET("https://dpsv3.doffin.no/Doffin/notices/Download/2020-07-17",
authenticate(user, passwd),
accept = "application/zip")
Running this from R, I get a 403 Access Denied. If I send the same request in Postman, I get the file. If I go to https://dpsv3.doffin.no/Doffin/notices/Download/2020-07-17 in a browser, I get a user/pass prompt and, once the credentials are entered, I get the file.
Questions
Is there anything wrong with my code? From what I gather it should work, and similar code works with other APIs. Using the same code against the test environment works, and if I enter the wrong password for the test environment I get a 401 error instead. From this I gather that the authenticate part of my call isn't being accepted by the production endpoint, but I can't spot any errors here. Any input?

So, I think I found the problem. Changing my code to the following works:
GET("https://dpsv3.doffin.no/Doffin/notices/Download/2020-07-21",
authenticate(user, passwd),
accept = "application/zip",
user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"))
I'm not sure why, but I guess the endpoint may be rejecting requests that don't send a browser-like user agent. In any case, it works, so I'm happy.

Related

error 403 when scraping Hansard which uses Cloudflare

I am trying to extract a graph from this link. I need to write a loop to extract the info behind graphs like this for a set of specific criteria. Using Developer tools >> Network, I found the URL to the data underlying this graph. The data seems to be stored in XML format.
I have tried different approaches, but I keep getting a 403 error. It doesn't matter whether I want to extract just the plot data or make a GET request for the whole web page. I think the problem is that Cloudflare kicks in. Any idea how I might be able to get around this? Any help is very much appreciated.
import urllib.request
url = 'https://hansard.parliament.uk/timeline/query?searchTerm=immigration&startDate=27%2F04%2F2017&endDate=27%2F04%2F2022&house=0&contributionType=&isDebatesSearch=False&memberId='
headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
webpage = urllib.request.urlopen(req).read()
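One thing worth ruling out before concluding it's the Cloudflare challenge: whether a fuller set of browser-style headers changes the response. The sketch below resends the same request with extra header fields (everything beyond the user agent is my assumption); if Cloudflare is serving a JavaScript challenge, headers alone won't get past it.
import urllib.request

url = ('https://hansard.parliament.uk/timeline/query?searchTerm=immigration'
       '&startDate=27%2F04%2F2017&endDate=27%2F04%2F2022&house=0'
       '&contributionType=&isDebatesSearch=False&memberId=')

# Fuller browser-style headers; the fields beyond 'user-agent' are assumptions,
# and Cloudflare may still return 403 if it requires a JavaScript challenge.
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'accept': 'application/xml,text/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'en-GB,en;q=0.9',
    'referer': 'https://hansard.parliament.uk/',
}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read()[:200])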

How to detect source of Artifactory login attempts?

I'm using Artifactory for everything: Docker images, NuGet packages, PyPI cache, you name it. I don't have to pull things very often, but inevitably when I do, my account is locked. I can request that it be unlocked, and it will be, but then I turn around and it's locked again in no time.
Is there a way to determine where the logon attempts that are locking my account are coming from? I've changed my API key and everything else... it just keeps happening. I've verified that my local machine has been fully updated with the new key, too.
Since you mentioned you rotated the keys, I suspect some automation job/script/cron is trying to hit Artifactory with the wrong (likely stale) credentials.
Nevertheless, as a first check, look through the artifactory-request.log files for any entries that could be coming from your API key.
Example: 2021-10-19T01:13:52.523Z|3b65f083f8d51f74|**127.0.0.1**|**token:XXXXXX**|GET|/api/system/configuration/platform/baseUrl|200|-1|0|16|JFrog Event/7.12.4 (revision: 5060ba45bc, build date: 2020-12-26T18:54:28Z)
If the request is coming from a user, the entry would look like this:
2021-10-19T01:14:31.440Z|1de7b95f92082ff|**10.10.16.322**|**sssso2**|GET|/api/auth/screen/globalState|200|2|0|345|Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36
(note the IP and user above)
In addition to that, you can grab the service id (1de7b95f92082ff in the example above) and grep the access-request.log file to get more information.
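If you'd rather not eyeball the log, a minimal sketch along these lines can summarise which IP and principal the failed attempts come from. The log path and the pipe-delimited column order are assumptions based on the sample lines above; adjust them to your installation.
# Minimal sketch: count 401/403 entries in artifactory-request.log by source.
from collections import Counter

LOG_PATH = "artifactory-request.log"  # assumed path; point this at your log directory

failed = Counter()
with open(LOG_PATH) as log:
    for line in log:
        parts = line.strip().split("|")
        if len(parts) < 7:
            continue  # skip lines that don't match the expected format
        timestamp, trace_id, ip, principal, method, path, status = parts[:7]
        if status in ("401", "403"):
            failed[(ip, principal)] += 1

for (ip, principal), count in failed.most_common():
    print(f"{count:5d} failed requests from {ip} as {principal}")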

What exactly does the lighthouse.userAgent mean?

I am exploring the Google PageSpeed Insights API, and in the response I see a field called:
{
...
lighthouse.userAgent:'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.140 Safari/537.36'
...
}
The docs describe it as userAgent = "The user agent that was used to run this LHR." (https://developers.google.com/speed/docs/insights/rest/v5/pagespeedapi/runpagespeed#LighthouseResultV5). What does that mean? How is this performance aggregated by running on all browsers?
PS: This is for the Desktop version.
What does that mean?
This lets you know what browser was used to run the test.
It is useful if you believe there is an issue with Lighthouse (a bug in your report), because you can then test directly in the same browser that Lighthouse uses.
There is also the "environment" object, which records how Lighthouse presented itself to the website being tested (it sent a header saying "treat me like this browser"): lighthouseResult.environment.networkUserAgent.
This is useful so you can check that your server isn't blocking requests for that user agent, and for checking your server logs to see what requests Lighthouse made.
See the Wikipedia page on user agents for more info.
How is this performance aggregated by running on all browsers?
As for your second question, it doesn't quite make sense as asked: performance isn't aggregated across browsers, the test runs in the single browser shown above. The user agent has no impact on performance unless your server does something different for that particular user agent string, if that is what you mean.
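To see both fields side by side, here is a small sketch against the public v5 runPagespeed endpoint (the example URL is a placeholder, and you would normally add your own API key); it prints the browser that executed the audit next to the user agent it presented to the tested site.
# Minimal sketch: query the PageSpeed Insights v5 API and compare the two
# user agent fields discussed above.
import json
import urllib.request

api = ("https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
       "?url=https://example.com&strategy=desktop")  # append &key=YOUR_API_KEY for regular use

with urllib.request.urlopen(api) as resp:
    result = json.load(resp)

lhr = result["lighthouseResult"]
print("Ran on:", lhr["userAgent"])                              # browser that ran the audit
print("Presented as:", lhr["environment"]["networkUserAgent"])  # UA sent to the tested site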

scrapy: issues fetching starting url to scraping amazon video info

I'm new to web scraping. What I'm trying to do is scrape all the Amazon movies from the Amazon website. I went to the Amazon website, www.amazon.com.
I chose Amazon Video on the left side of the search box, typed in 'video', and searched. I got a list with a whole lot of movies. The web URL is https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo
Next, I went to the scrapy shell and typed:
scrapy shell 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo'
My response status is 400.
I also tried adding a user agent:
scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36' 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo'
I still got response status 400.
Why does that happen?
How can I find the starting URL so that I can start scraping all the movie info?
I have no clue how to deal with it. I would truly appreciate it if anyone can help. Thanks a lot in advance.
First I tried scrapy shell "https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo" and I got a 503. Then I used the command view(response) to see what happened on the page: Amazon gives me a verification code to check whether I'm a robot.
So I entered your second scrapy shell command, with the User-Agent set, and I got a 200 response.
Maybe you could try using view(response) to see what you get there, or try the scrapy shell a few more times?
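If the user agent from the shell works, it can be carried into the project so every request a spider makes uses it. A minimal sketch of the relevant settings.py lines under that assumption (the throttling settings are my addition, since Amazon serves captcha pages to fast or repetitive crawlers):
# settings.py -- reuse the user agent that worked in the scrapy shell
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")

DOWNLOAD_DELAY = 2           # pause between requests
AUTOTHROTTLE_ENABLED = True  # back off automatically when responses slow down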

JSoup error 403 when trying to read the contents of a directory on my website

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=(site)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:465)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at plan.URLReader.main(URLReader.java:21)
Hello all!
I have been looking up a way to read a directory on a website of mine for an application I'm developing.
I can read the files themselves and work with them if I hardcode them, but if I try to grab the list of files from the directory I get this error.
I've tried a few ways, but this is the code I am currently working with.
String url = ""//(removed site for privacy);
print("Fetching %s...", url);
Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36").get();
Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");
Elements imports = doc.select("link[href]");
...
...
...
Now if I use the main site, as in www.google.com/, it reads the links. The problem is that I want a directory, as in www.google.com/something/something/...
When I try that on my site I get this error.
Any idea why I can access my main site, but not directories within it?
I also notice that a trailing '/' is needed.
Am I missing something, or do I need to do this another way?
Thank you for your time.
String mylink = "http://www.imdb.com/search/title?genres=action";
Connection connection = Jsoup.connect(mylink);
connection.userAgent("Mozilla/5.0");
Document doc = connection.get();
//Elements elements = doc.body().select("tr.even detailed");
Elements elements = doc.getElementsByClass("results");
System.out.println(elements.toString());
This is likely a problem with (or deliberate attempt to block access using) the server's configuration, not your application. From the tag wiki excerpt for the http-status-code-403 tag:
The 403 or "Forbidden" error message is a HTTP standard response code indicating that the request was legal and understood but the server refuses to respond to the request.
From the tag wiki itself:
A 403 Forbidden may be returned by a web server due to an authorization issue or other constraint related to the request. File permissions, lack of encryption, and maximum number of users reached (among others) can all be the cause of a 403 response.
If the target site is attempting to block screen-scraping, another possibility is an unrecognized user-agent string, but you're setting the user-agent string to one (I presume) you've obtained from an actual browser, so that shouldn't be the cause.
It's not clear from your question if you expect to fetch a regular (HTML) web page, or a special "directory listing" page generated by the server when an index.html is not present in a directory. If it's the latter, note that many servers have these listings disabled to avoid leaking the names of files in the directory that aren't linked to from the web site itself. Again, this is a server configuration issue, not something your application can work around.
One possible reason is that your Java code doesn't have direct access to external websites; in that case, configure a proxy for the connection:
System.setProperty("http.proxyHost", "<<proxy host>>");
System.setProperty("http.proxyPort", "<<proxy port>>");
