I am trying to scrape fanfiction.net with rvest and keep getting a 503 server error.
The robots.txt file allows scraping the site with a crawl delay of 5 seconds, and the Terms of Service only forbid scraping for commercial use, whereas I intend to use the data for research purposes (Digital Literary Studies).
The following chunk of code already results in an error:
library(httr)
library(rvest)

url <- "https://www.fanfiction.net"
# browser-style user agent string
userAgent <- "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
# fails with a 503 before any parsing happens
read_html(url, user_agent(userAgent))
Most of the advice regarding that error recommends adding a delay between scraping requests or providing a user agent.
Since I get the error on the very first request, adding a delay doesn't solve the problem.
I tried different user agents:
"Mozilla/5.0"
agents of the "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0" sort, but the server seems to react to the parentheses, and the long version gives me a 403 error.
Finally, I provided my institution, name and email address as a user agent, and once (!!) received a 200 status code.
What would be the next step?
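For reference, a minimal sketch of the polite setup described above, assuming httr's RETRY(): a descriptive user agent (the institution and email below are placeholders) combined with the 5-second crawl delay from robots.txt.

library(httr)
library(rvest)

# descriptive user agent; institution and address are placeholders
ua <- user_agent("research-scraper/0.1 (Example University; jane.doe@example.edu)")
url <- "https://www.fanfiction.net"

# retry up to 5 times, waiting at least 5 seconds between attempts
resp <- RETRY("GET", url, ua, times = 5, pause_min = 5)
stop_for_status(resp)
page <- read_html(resp)
Sys.sleep(5)  # honor the crawl delay before the next request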
Related
When the user selects the 'All' filter on our dashboards, most queries fail with this error: 502 - Bad Gateway in Grafana. If we refresh the page, the errors disappear and the dashboards work. We use nginx as a reverse proxy and suspect that the problem is linked to URI size or headers. Our first attempt was to increase the buffers: large_client_header_buffers 32 1024k. A second attempt was to change the InfluxDB request method from GET to POST. The errors have diminished, but they still happen constantly. Our stack is nginx + Grafana + InfluxDB.
When using "All" nodes as the filter on our dashboards (the maximum amount of information), most of the queries return a failure (502 - Bad Gateway) in Grafana. We have Keycloak for authentication and nginx working as a reverse proxy in front of our Grafana server, and the problem seems linked to the proxy: when accessing the Grafana server directly, through an SSH tunnel for example, we do not experience the failure.
nginx log error example:
<my_ip> - - [22/Dec/2021:14:35:27 -0300] "POST /grafana/api/datasources/proxy/1/query?db=telegraf&epoch=ms HTTP/1.1" 502 3701 "https://<my_domain>/grafana/d/gQzec6oZk/compute-nodes-administrative-dashboard?orgId=1&refresh=1m" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36" "-"
Below are screenshots of the error in Grafana and the configuration variables we use.
[Screenshot: nginx configuration variables]
[Screenshot: the error in Grafana]
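For context, a sketch of the nginx proxy settings we experimented with (the location path, port, and buffer sizes are guesses for illustration, not known-good values):

server {
    # allow very large request headers ('All' filters produce long query strings)
    large_client_header_buffers 32 1024k;

    location /grafana/ {
        proxy_pass http://localhost:3000/;  # default Grafana port; adjust to your setup
        # give nginx more room to buffer large upstream responses
        proxy_buffer_size 128k;
        proxy_buffers 8 256k;
        proxy_busy_buffers_size 256k;
    }
}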
I have sent a lot of requests to https://www.instagram.com/{username}/?__a=1 to check whether a username exists, and now I am getting 429 responses.
Before, I just had to wait a few minutes for the 429 to disappear. Now it is persistent! :( I try once a day, and it no longer works.
Do you know anything about Instagram's request limits?
Is there any workaround? Thanks.
Code:
import requests

# bare request with no headers; this is the call that now returns 429
r = requests.get('https://www.instagram.com/test123/?__a=1')
res = str(r.status_code)
Try adding a user-agent header; otherwise the website thinks you're a bot and blocks you.
import requests

URL = "https://www.instagram.com/bla/?__a=1"
# browser-like user agent so the request is not flagged as a bot
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

response = requests.get(URL, headers=HEADERS)
print(response.status_code)  # <- Output: 200
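If the 429 persists even with a browser-like user agent, the usual client-side mitigation is to back off and honor the Retry-After header when the server sends one. A rough sketch (check_username is a hypothetical helper; a long-lived 429 may be an IP-level block that no amount of client-side waiting is guaranteed to clear):

import time
import requests

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

def check_username(username, max_tries=5):
    # retry politely on 429, honoring Retry-After when the server sends it
    url = f"https://www.instagram.com/{username}/?__a=1"
    delay = 60  # arbitrary fallback wait in seconds
    for _ in range(max_tries):
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 429:
            return response.status_code
        retry_after = response.headers.get("Retry-After")
        wait = int(retry_after) if retry_after and retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2  # exponential backoff between attempts
    return 429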
I am httr::POST()ing a login form to log in to the website. To retrieve the data, I am using httr::GET():
GET("mock-url", user_agent("Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"))
Some URLs return
Error in curl::curl_fetch_memory(url, handle = handle) : Failure when receiving data from the peer
If I make a similar request that also writes the response to disk:
GET("mock-url", user_agent("Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"), write_disk("test.html", overwrite = TRUE))
some URLs still return the same error as above, but the written file is complete and correct.
After some analysis I found that the cutoff file size is ~265KB.
Is there an additional header or other parameter that would solve the problem? A very similar webpage (but not as old) does not show this behavior.
Thank you
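One workaround grounded in the observation above: since the file written by write_disk() is complete, fetch to disk, swallow the transfer error, and parse the file (fetch_page is a hypothetical helper, not an answer to the header question):

library(httr)
library(rvest)

fetch_page <- function(url, path = tempfile(fileext = ".html")) {
  ua <- user_agent("Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko")
  # the transfer may still report 'Failure when receiving data from the peer',
  # but the file on disk is complete, as observed above
  tryCatch(
    GET(url, ua, write_disk(path, overwrite = TRUE)),
    error = function(e) message("transfer error ignored: ", conditionMessage(e))
  )
  read_html(path)
}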
I would like to scrape the contents of this web page using the XML package and htmlParse: http://www.interactivebrokers.com/en/p.php?f=products. However, the link I am passing to htmlParse gives me a Bad Request error. What am I missing?
require(RCurl)
require(XML)

iburl <- 'http://www.interactivebrokers.com/en/p.php?f=products'
# browser-style user agent; the server rejects requests without one
ua <- 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20100101 Firefox/16.0'
ibdata <- getURL(iburl, useragent = ua)
# ibdata is raw HTML text, so parse it as text rather than as a URL
ibdoc <- htmlParse(ibdata, asText = TRUE)
readHTMLTable(ibdoc)
It looks like it is checking the user agent.
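For comparison, the same fetch with httr instead of RCurl (an alternative, not the original code) should also work once a browser-like user agent is supplied:

library(httr)
library(XML)

resp <- GET("http://www.interactivebrokers.com/en/p.php?f=products",
            user_agent("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:16.0) Gecko/20100101 Firefox/16.0"))
# parse the returned HTML text and extract the tables
doc <- htmlParse(content(resp, as = "text"), asText = TRUE)
readHTMLTable(doc)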
I'm using the CSS3 ability to apply multiple background images to an element. Currently, I have this code in my stylesheet:
body {
    background: url("images/emblem.png") top center no-repeat,
                url("images/background.png");
    background-color: #EAE6D9;
}
The code works in all browsers that support it, and those that don't fall back to the background-color.
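If more than the plain color is wanted as a fallback, one common pattern (a sketch, not part of the original stylesheet) is to declare a single-image background first and let browsers that understand multiple backgrounds override it:

body {
    /* single-image fallback for browsers without multiple-background support */
    background: url("images/background.png");
    /* browsers that parse multiple backgrounds use this declaration instead */
    background: url("images/emblem.png") top center no-repeat,
                url("images/background.png");
    /* keep the color last so the shorthand above does not reset it */
    background-color: #EAE6D9;
}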
However, watching the access log files for the site, I'm noticing 404 errors popping up for what looks to be a malformed request based on this CSS feature. The funny thing is, they are coming from someone using Firefox 5. I'm using Firefox 5 myself and cannot get an error to show up in the log for my IP.
Here's the error line from the log:
10.21.7.246 - - [28/Jun/2011:12:02:01 -0500] "GET /templates/images/emblem.png%22),%20url(%22http://ulabs.illinoisstate.edu/templates/images/background.png HTTP/1.1" 404 1005 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"
I have a feeling the problem comes from the fact that the quotation marks and the space are being URL-encoded, but I'm definitely not doing that. And it doesn't happen all the time: looking at requests from my IP address, the request is properly split up.
10.1.8.129 - - [28/Jun/2011:12:29:33 -0500] "GET /templates/images/background.png HTTP/1.1" 304 - "http://ulabs.illinoisstate.edu/templates/style.1308848695.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"
10.1.8.129 - - [28/Jun/2011:12:29:33 -0500] "GET /templates/images/emblem.png HTTP/1.1" 304 - "http://ulabs.illinoisstate.edu/templates/style.1308848695.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"
Has anyone experienced this behavior before? Or have any ideas on what I might try to resolve the issue?
Update: We've discovered it's YSlow causing the error. When running YSlow, the error would appear in the log immediately for that IP address. Since this isn't really a problem, there's luckily nothing we need to fix on our end.