I am trying to scrape Reddit using RedditExtractoR
My query:

df = get_reddit(search_terms = "blockchain", page_threshold = 2)
It shows these errors:

cannot open URL 'http://www.reddit.com/r/CryptoCurrency/comments/7vga1y/i_will_tell_you_exactly_what_is_going_on_here/.json?limit=500': HTTP status was '403 Forbidden'
cannot open URL 'http://www.reddit.com/r/IAmA/comments/blssl3/my_name_is_benjamin_zhang_and_im_a_transportation/.json?limit=500': HTTP status was '403 Forbidden'
How can I resolve it?
The usual cause of a 403 Forbidden is either:
a) issues with the server
b) getting blocked
I wrote an R program that uses RedditExtractoR's get_reddit() function to test whether you can get blocked by using it:
library(RedditExtractoR)

blockchain <- get_reddit(
  search_terms = "blockchain",
  page_threshold = 2
)
And it worked like a charm no matter how I ran it.
Fortunately, RedditExtractoR has built-in rate limiting to prevent issues.
As per the RedditExtractoR package documentation:
Question: All functions in this library appear to be a little slow, why is that?
Answer: The Reddit API allows users to make 60 requests per minute (1 request per second), which is why URL parsers used in this library intentionally limit requests to conform to the API requirements
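To see what that pacing looks like, here is a rough sketch of fetching a couple of thread JSON URLs yourself with jsonlite, sleeping one second between requests (the URLs are shortened from the error messages above; RedditExtractoR builds and fetches these for you, so this is only an illustration):

library(jsonlite)

# Thread JSON URLs, shortened from the error messages above
urls <- c(
  "https://www.reddit.com/r/CryptoCurrency/comments/7vga1y/.json?limit=500",
  "https://www.reddit.com/r/IAmA/comments/blssl3/.json?limit=500"
)

threads <- lapply(urls, function(u) {
  Sys.sleep(1)  # stay at roughly 1 request per second, per the API limit
  tryCatch(fromJSON(u), error = function(e) NULL)  # a 403 would surface here as NULL
})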
But a 403 error due to getting blocked is still a real possibility.
If you start getting 403s, either:
wait a few days to make sure it's not the API that has issues
use a scraping service or a VPN
I am trying to crawl the website https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW but I get a 410 error:
INFO: Ignoring response <410 https://www.rightmove.co.uk/properties/105717104>: HTTP status code is not handled or not allowed
I am just trying to find the properties that have been sold, using the notification on the page: "This property has been removed by the agent."
I know the website has not blocked me, because I am able to use the Scrapy shell to get the data, view(response) works fine, and I can go directly to the same URL in a web browser, so the 410 doesn't make sense. I can also crawl other pages from the same domain,
i.e. the pages without the notification "This property has been removed by the agent."
Any help would be much appreciated.
It seems that when a listing has been marked as removed by an agent on Rightmove, the website returns status code 410 Gone (which is quite weird). To solve this, simply do something like this in your request:
def start_requests(self):
    yield scrapy.Request(
        url='https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW',
        meta={
            # Tell Scrapy to pass 410 responses through to the callback
            'handle_httpstatus_list': [410],
        },
    )
EDIT
Explanation: By default, Scrapy only handles responses whose status code is in the range 200-299, since 2XX means the request was successful. In your case, you got a 4XX status code, which means some error happened. By passing handle_httpstatus_list = [410] in the request meta, we tell Scrapy that we want it to also handle 410 responses, not only 200-299.
Here are the docs: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#std-reqmeta-handle_httpstatus_list
I'd like to scrape team advanced stats from stats.nba.com.
My current code to get the XHR file where the data is stored is:
library(httr)
library(jsonlite)
nba <- GET('https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=11%2F12%2F2019&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=')
I get the URL via these steps in Chrome:
Inspect -> Network -> XHR
The code throws this error:
Error in curl::curl_fetch_memory(url, handle = handle) :
LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60
I also tried it with custom advanced filters on the website, which either results in the same error or in the code running forever. I'm not that great at web scraping, so I would appreciate it if anyone could point out what the issue is here.
I have had a good look at this. It looks like this site goes to some lengths to prevent scraping, and won't give you the json from that url unless you provide it with cookies that are generated by a back-and-forth between your browser's javascript and their own servers. They also monitor request timings with New Relic technology and are therefore likely to block your IP if you scrape multiple pages. It wouldn't be impossible, but very, very hard.
If you are desperate for the data, you could look into using the NBA API, which requires a sign-up but is free to use for 1000 requests per day.
The other option is to automate a browser using RSelenium to get the html of the fully rendered pages.
Of course, if you only want this one page, you can just copy the html from Chrome's inspector, then use rvest::read_html(readClipboard()).
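For that last option, a minimal sketch, assuming you are on Windows (where readClipboard() is available) and that the table you want is present in the HTML you copied:

library(rvest)

# After copying the page's outer HTML from Chrome's inspector to the clipboard:
page <- read_html(readClipboard())

# Pull out whatever tables exist in the rendered markup and inspect them
tables <- html_table(page)
str(tables)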
I was wondering if anybody can solve this problem in R?
I want to read the lines (get its content) from the following web page using R functions such as
readLines(), read_html(), getURL(), etc.:
https://nrt3.modaps.eosdis.nasa.gov/archive/allData/6/MOD09GA/2019/107/
(You may be asked for user: U_of_C_R_MODIS & password: Mas_4033708404 to log in.)
In regard to the login, I am using this code in R, which works well for all other web pages except this one:
setNASAauth(username= "U_of_C_R_MODIS",password= "Mas_4033708404", update=TRUE)
For example:
url_content <- readLines("https://nrt3.modaps.eosdis.nasa.gov/archive/allData/6/MOD09GA/2019/107")
However, all these functions give errors such as:
Unknown SSL protocol error in connection to nrt3.modaps.eosdis.nasa.gov:443
Or this one:
HTTP error 404.
Or this one:
HTTP status was '404 Not Found'
I spent time trying to find a solution, both on my own and by googling, but I failed to solve it.
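For reference, here is roughly the same attempt written with httr so the credentials are passed explicitly; whether this endpoint accepts plain HTTP basic authentication is only an assumption on my part:

library(httr)

url <- "https://nrt3.modaps.eosdis.nasa.gov/archive/allData/6/MOD09GA/2019/107/"

# Assumes the server accepts HTTP basic authentication with these credentials
resp <- GET(url, authenticate("U_of_C_R_MODIS", "Mas_4033708404"))
status_code(resp)
page_lines <- strsplit(content(resp, as = "text"), "\n")[[1]]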
Any comments and suggestions will be highly appreciated.
(This is sort of an abstract philosophical question. But I believe it has objective concrete answers.)
I'm writing an API, my API has a "status" page (like, https://status.github.com/).
If whatever logic I have in place to determine the status says everything is good, my plan would be to return 200 OK, and a JSON response with more information about each service tested by my status page.
But what if my logic says the API is down? Say the database isn't responding or something.
I think I want to return 500 INTERNAL SERVER ERROR (or 503 SERVICE UNAVAILABLE), along with a JSON response with more details.
However, is that breaking the HTTP Status Code spec? Would that confuse end users? My status page itself is working just fine in that case. So maybe it should return 200? But that would mean anyone using it would have to dig into the body looking for a specific parameter to determine the API's status vs. just checking the HTTP Status Code. (Also if my status page itself was broken, I'm fine with the end user taking that to mean the API is down since that's a pretty bad sign...)
Thoughts? Is there official protocol on how a status page should work?
https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
For me, the page should return 200 unless the status page itself has problems. It is true that it is easier to check the status code of a response than to parse the body, but using HTTP status codes to encode application information breaks what people (and spiders) expect. If a spider visits your page and sees a 500 or 503, it will think your site has a page with problems, not that the page is fine and is signaling that the site is down.
Also, as you noticed, it won't be possible to distinguish between "the service is down" and "the status page is down", and only the latter should send a 500. And what if you show more than one service, like the Twitter status page does? Use 200.
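To make the trade-off concrete, this is what a client-side check looks like when the status page always returns 200 and the real state lives in the body (a sketch: the URL and the JSON fields are hypothetical):

library(httr)
library(jsonlite)

# Hypothetical status endpoint and response fields
resp <- GET("https://api.example.com/status")
stopifnot(status_code(resp) == 200)  # the status page itself is working

status <- fromJSON(content(resp, as = "text"))
status$database$ok  # per-service state is read from the body, not from the HTTP code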
Related: https://stackoverflow.com/a/943021/1536382 https://stackoverflow.com/a/34324179/1536382
I have set up a dependency in my Gradle build script, which is hosted on Bitbucket.
Gradle fails to download it, with error message
Could not HEAD 'https://bitbucket.org/....zip'. Received status code 403 from server: Forbidden
I looked into it, and it seems that this is because:
Bitbucket redirects to an Amazon URL
the Amazon URL doesn't accept HEAD requests, only GET requests
I could check that by testing that URL with curl, and I also got a 403 Forbidden when sending a HEAD request with curl.
Otherwise, it could be because Amazon doesn't accept the signature in the HEAD request, which should be different from the GET one, as explained here.
Is there a way around this? Could I tell Gradle to skip the HEAD request, and go straight to the GET request?
I worked around the problem by using the gradle-download-task plugin and manually writing the caching, as explained here.