Error 403 when scraping Hansard, which uses Cloudflare - web-scraping

I am trying to extract a graph from this link. I need to write a loop to extract the info for graphs like this for a set of specific criteria. Using Developer Tools >> Network, I found the URL to the data underlying this graph. The data seems to be stored in XML format.
I have tried different approaches, but I keep getting a 403 error. It doesn't matter whether I try to extract just the plot or make a GET request for the whole web page. I think the problem is that Cloudflare kicks in. Any idea how I might be able to get around this? Any help is very much appreciated.
import urllib.request
url = 'https://hansard.parliament.uk/timeline/query?searchTerm=immigration&startDate=27%2F04%2F2017&endDate=27%2F04%2F2022&house=0&contributionType=&isDebatesSearch=False&memberId='
headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
webpage = urllib.request.urlopen(req).read()
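One thing worth trying before heavier tooling is sending a fuller set of browser-like headers. The sketch below uses the requests library (a swapped-in choice, not part of the original snippet) and is only a guess at what the endpoint checks; if Cloudflare is serving a JavaScript challenge, no header combination will get past it and a real browser (Selenium/Playwright) or a challenge-solving client is usually needed instead.

import requests

url = ('https://hansard.parliament.uk/timeline/query'
       '?searchTerm=immigration&startDate=27%2F04%2F2017'
       '&endDate=27%2F04%2F2022&house=0&contributionType='
       '&isDebatesSearch=False&memberId=')

# A fuller set of browser-like headers; some Cloudflare configurations
# reject requests that send only a User-Agent.
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/100.0.4896.127 Safari/537.36'),
    'Accept': 'application/xml,text/html;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.9',
    'Referer': 'https://hansard.parliament.uk/',
}

resp = requests.get(url, headers=headers, timeout=30)
print(resp.status_code)   # still 403 if a Cloudflare JS challenge is served
print(resp.text[:500])    # should be the XML payload on success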

Related

How to detect source of Artifactory login attempts?

I'm using Artifactory for everything: Docker images, NuGet packages, PyPI cache, you name it. I don't have to pull things very often, but inevitably when I do, my account is locked. I can request that it be unlocked, and it will be, but then I turn around and it's locked again in no time.
Is there a way to determine where the login attempts that are locking my account are coming from? I've changed my API key and everything else... it just keeps happening. I've verified that my local machine has been fully updated with the new key, too.
Since you mentioned you rotated the keys, I suspect some automation job/script/cron is hitting Artifactory with the wrong (old) credentials.
Nevertheless, as a first check, look through the artifactory-request.log files for any entries that could be coming from your API key.
Example: 2021-10-19T01:13:52.523Z|3b65f083f8d51f74|**127.0.0.1**|**token:XXXXXX**|GET|/api/system/configuration/platform/baseUrl|200|-1|0|16|JFrog Event/7.12.4 (revision: 5060ba45bc, build date: 2020-12-26T18:54:28Z)
If the request is coming from a user, it would look like this:
2021-10-19T01:14:31.440Z|1de7b95f92082ff|**10.10.16.322**|**sssso2**|GET|/api/auth/screen/globalState|200|2|0|345|Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36
(note the IP and user above)
In addition, you can grab the service id (1de7b95f92082ff in the example above) and grep the access-request.log file to get more information.
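If it helps to automate that search, here is a minimal sketch that tallies source IPs per user/token, assuming the pipe-delimited layout shown in the example lines above; the log path and the idea that failed attempts show up as 401 responses are assumptions, not JFrog documentation.

from collections import Counter

# Count source IPs per user/token in artifactory-request.log, assuming
# fields are laid out as: timestamp|trace id|ip|user|method|path|status|...
LOG_PATH = 'artifactory-request.log'   # adjust to your log directory

failures = Counter()
with open(LOG_PATH) as log:
    for line in log:
        fields = line.rstrip('\n').split('|')
        if len(fields) < 7:
            continue
        ip, user, status = fields[2], fields[3], fields[6]
        if status == '401':            # assumed marker of a failed login
            failures[(user, ip)] += 1

for (user, ip), count in failures.most_common(10):
    print(f'{count:5d}  {user}  from {ip}')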

What exactly does the lighthouse.userAgent mean?

I am exploring the Google PageSpeed Insights API, and in the response I see a field called:
{
...
lighthouse.userAgent:'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.140 Safari/537.36'
...
}
The docs say userAgent = "The user agent that was used to run this LHR."
https://developers.google.com/speed/docs/insights/rest/v5/pagespeedapi/runpagespeed#LighthouseResultV5
What does that mean? How is this performance aggregated by running on all browsers?
PS: This is for the Desktop version.
What does that mean?
This lets you know which browser was used to run the test.
It is useful if you believe there is an issue with Lighthouse (a bug in your report), because you can then reproduce it directly in the same browser that Lighthouse uses.
There is also the "environment" object, which contains how Lighthouse presented itself (it sent a header saying "treat me like this browser") to the website being tested: lighthouseResult.environment.networkUserAgent.
This is useful so you can check that your server isn't blocking requests for that user agent, etc.
It is also useful for checking your server logs to see what requests Lighthouse made.
See the Wikipedia page on user agents for more info.
How is this performance aggregated by running on all browsers?
As for your second question: it doesn't quite make sense, as the test runs in a single browser (headless Chrome). The user agent has no impact on performance unless your server does something different for that particular user agent string, if that is what you mean.
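For illustration, a small sketch of pulling both user-agent fields out of a v5 response with Python's requests library; the field paths mirror the ones named in the question and answer, and the target URL is just a placeholder.

import requests

# Fetch a PageSpeed Insights v5 result and print both user-agent fields.
API = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed'
params = {'url': 'https://example.com', 'strategy': 'desktop'}

lhr = requests.get(API, params=params, timeout=60).json().get('lighthouseResult', {})

# Browser that actually ran the audit (HeadlessChrome/...):
print('ran with:', lhr.get('userAgent'))
# User agent Lighthouse sent to the site while testing it:
print('sent as :', lhr.get('environment', {}).get('networkUserAgent'))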

httr GET works in browser, but fails in RStudio

Background:
I'm using the code below to get a zipped file containing some XML documents. A note on reproducibility: the endpoint is IP-restricted as well, so you won't be able to request anything from it (it'll just time out).
GET("https://dpsv3.doffin.no/Doffin/notices/Download/2020-07-17",
authenticate(user, passwd),
accept = "application/zip")
Using this from R, I get a 403 Access Denied. If I use the same request in Postman, I get the file. If I go to https://dpsv3.doffin.no/Doffin/notices/Download/2020-07-17 in a browser, I get a user/pass prompt and, once entered, I get the file.
Questions
Is there anything wrong with my code? From what I gather, it should work, and similar code works with other APIs. Using the same code against the test environment works, and if I enter the wrong password for the test environment, I get a 401 error instead. From this I gather that the authenticate part of my call isn't being processed at the production endpoint, but I can't spot any errors here. Any input?
So, I think I found the problem. Changing my code to the following works:
GET("https://dpsv3.doffin.no/Doffin/notices/Download/2020-07-21",
    authenticate(user, passwd),
    accept = "application/zip",
    user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"))
I'm not sure why, but I guess there might be restrictions on the endpoint. In either case, it works, so I am happy.

How to deal with open tracking problems when sending emails to Gmail

I've been working on open tracking by attaching a tracking pixel to emails sent from a PHP web application that I've been developing. The first problem I encountered was the email being marked as open just after it was sent. I found help here: False open trackings using SES and gmail, where there is a link to an article describing how to prevent opens being triggered by the Google bot by checking the user agent. I created the following function to check whether I'm dealing with the bot:
function isGoogleBot(string $userAgent): bool
{
    $googleBotUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246 Mozilla/5.0';

    return $userAgent === $googleBotUserAgent;
}
I also read some articles about Gmail caching images, but unfortunately I'm still dealing with some problems.
I did the following test in Chrome, Firefox, Edge and Safari:
I sent an email via the application that I've been working on and then opened it in the Gmail web client.
Then I went to the application to see the open tracking statistics.
I repeated the above test twice to make sure I get the same result every time.
With the last email I checked what happens on subsequent opens of the same email.
Here are my observations:
In every browser except Firefox, an extra open is recorded when I open the email for the first time, so I see 2 opens in the statistics. In Firefox only 1 open is counted.
On subsequent opens, 1 open is added to the total opens statistics every time I open the email. The exception is again Firefox, where the total opens counter does not change: no matter how many times I open the email, the total opens count stays at 1.
Is there any way to make open tracking reliable in every browser when the email is opened in Gmail?

scrapy: issues fetching the starting URL for scraping Amazon video info

I'm new to web scraping. What I'm trying to do is scrape all the Amazon movies from the Amazon website. I went to www.amazon.com.
I chose Amazon Video on the left side of the search box, typed in 'video' and searched. I got a list of a whole lot of movies. The web URL is https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo
Next, I went to the scrapy shell and typed scrapy shell 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo'
My response status is 400.
I also tried adding a user agent: scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36' 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo'
I still got a 400 response status.
Why does that happen?
How can I find the starting URL so that I can start scraping all the movie info?
I have no clue how to deal with it. I truly appreciate it if anyone can help. Thanks a lot in advance.
First I tried scrapy shell "https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Dinstant-video&field-keywords=video&rh=n%3A2858778011%2Ck%3Avideo" and got a 503, then I used view(response) to see what happened on the page. Amazon showed me a verification code (captcha) to check whether I'm a robot.
So I entered your second scrapy shell command with the User-Agent set, and I got a 200 response.
Maybe you could try using view(response) to see what you got there, or retry scrapy shell a few more times.
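If the user agent is what makes the difference, the same setting can be carried over from the shell into a spider. A minimal sketch, assuming the search URL above and a placeholder CSS selector that would need to be adapted to Amazon's actual result markup:

import scrapy

class AmazonVideoSpider(scrapy.Spider):
    name = 'amazon_video'
    start_urls = [
        'https://www.amazon.com/s/ref=nb_sb_noss_1'
        '?url=search-alias%3Dinstant-video&field-keywords=video'
        '&rh=n%3A2858778011%2Ck%3Avideo'
    ]
    # The same browser-like user agent that produced the 200 in the shell.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
    }

    def parse(self, response):
        # Placeholder selector -- inspect the real markup with view(response)
        # and adjust; Amazon's result structure changes often.
        for title in response.css('h2 ::text').getall():
            yield {'title': title.strip()}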
