Scrapy: How to handle if a website is blocked from crawling - web-scraping

I'm using Scrapy to crawl a website but I got a 404 error. I think the website is blocking crawlers. How can I manage to bypass that? This is the website I want to crawl: https://tiki.vn/
And this is the result I got:

My problem was that the request I sent was rejected by the website because it looked like a bot request. I just needed to add a custom header to the request and the problem was solved.
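For reference, here is a minimal Scrapy sketch of that fix, assuming the usual approach of sending a browser-like User-Agent header; the spider name and the exact header string are just examples:

```python
import scrapy

class TikiSpider(scrapy.Spider):
    name = "tiki"

    # Option 1: a spider-wide setting, applied to every request this spider makes.
    custom_settings = {
        "USER_AGENT": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/62.0.3202.94 Safari/537.36"),
    }

    def start_requests(self):
        # Option 2: attach the header to an individual request instead.
        yield scrapy.Request(
            "https://tiki.vn/",
            headers={"User-Agent": self.custom_settings["USER_AGENT"]},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got %s with status %s", response.url, response.status)
```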

Related

Wordpress - 401 Authorization Required

I recently renewed my domain name after it expired. However, when I try to view my website, a login box pops up.
This is new. I tried putting in my WordPress admin username and password, all to no avail. In fact I get a '401 Authorization Required' error.
I have been at this for some days now. I contacted my hosting provider. They said they could view my website and that everything is fine. They did, however, instruct me to clear my browser cache and cookies, which I have done. Still, the problem persists.
I tried viewing the site through an IP proxy site and could indeed see my website without any errors or login pop-up box.
How do I solve this problem?
A 401 error usually means that your client (e.g. your web browser) is not able to authenticate itself with the server and therefore cannot view the resource.
You have cleared your cache and cookies, you're prompted by a login box, and following this a 401 error appears. The site is viewable through a proxy.
Things to check:
Flush your DNS
Are the login details correct? It's possible to get this error from incorrect logins
Check the URL for errors; make sure you're using the intended URL
Try deactivating your WordPress plugins if the problem still persists
Any further information you can provide, including images would help a lot.
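As a quick diagnostic, a small sketch like the one below (not part of WordPress itself, and the URL is a placeholder for your own domain) can show whether the login box comes from server-level HTTP Basic Auth, for example an .htaccess/.htpasswd rule left by the host, rather than from WordPress:

```python
import requests

# Request the site and inspect the 401 response's WWW-Authenticate header,
# which identifies what kind of authentication the server is demanding.
r = requests.get("http://example.com/")
print(r.status_code)                       # e.g. 401
print(r.headers.get("WWW-Authenticate"))   # e.g. 'Basic realm="Restricted"'
```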

HTTP redirects to HTTPS, but it's not supported for the POST request method

I am new to Firebase and have just done Firebase Hosting for the first time, but I'm facing one issue regarding redirection.
Firebase Hosting redirects http:// to https:// directly. I tried it with a POST request and the redirect itself works fine, but the posted JSON data is not there to process.
This is my firebase.json redirection rule
http://prntscr.com/han083
here is the screenshot of it
This is my code
http://prntscr.com/hamzv9
Without https://
http://prntscr.com/hamybm
With https://
http://prntscr.com/hamyxh
This is because a POST body won't survive a redirect in general. This may be a small bug on Firebase Hosting's part (or perhaps on the part of whatever tool you're using to make the request) -- the HTTP response should be a 301 redirect, which must then be retried with the full POST body by the client.
In general, for any request other than GET you'll need to make sure you point at the correct http:// or https:// URL up front, since redirects don't preserve all request information in most user agents.
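A minimal sketch of that behavior using Python's requests library; the URLs are placeholders for a Firebase Hosting site, and the exact status code may differ:

```python
import requests

payload = {"name": "test"}

# POST to the http:// URL with redirects disabled to see the redirect itself.
r = requests.post("http://example-project.web.app/api", json=payload,
                  allow_redirects=False)
print(r.status_code, r.headers.get("Location"))  # e.g. 301 and the https:// URL

# If the client follows the redirect, most user agents (requests included)
# turn the 301/302 into a GET and drop the body, which is why the JSON
# "disappears" on the server side.
r = requests.post("http://example-project.web.app/api", json=payload)
print(r.request.method)  # likely GET after following the redirect

# The fix: point the POST at https:// up front so no redirect happens.
r = requests.post("https://example-project.web.app/api", json=payload)
```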

Blocked Access to geolocation was blocked over secure connection with mixed content

I'm using a plugin in WordPress that uses the Google Maps API but keep getting this error:
[blocked] Access to geolocation was blocked over secure connection with mixed content to...
My site is on SSL, and I've checked that the google API script is not trying to be pulled in via http (it is https as it should be).
I'm not sure what could be causing this issue. Maybe there is something I need to do in my htaccess file? Please help! Thanks!
Check the list below:
Your site has http:// links instead of https:// links, which is why you are facing the mixed content warning (you can find this warning in your browser console). Find those links in your website and change them to https:// links.
Add a Google API key in the configuration.
https://developers.google.com/maps/documentation/javascript/get-api-key
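To help find those http:// links, here is a rough sketch (not the plugin's own code, and the URL is a placeholder for your own site) that fetches a page over HTTPS and lists any sub-resources still referenced over plain http://, which is what triggers the mixed-content block:

```python
import re
import requests

page = requests.get("https://example.com/").text

# Look for src/href attributes that point at http:// resources.
insecure = re.findall(r'(?:src|href)=["\'](http://[^"\']+)["\']', page)
for url in sorted(set(insecure)):
    print("insecure resource:", url)
```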

W3 Total Cache, HTTPS and Google Analytics

I have a website, and I implemented SSL functionality a few days ago.
Now the website is correctly reachable at the HTTPS URL.
However...
When I visit any page of my website, the URL shown in the browser address bar is correct, for example: https://domain.com/post/.
That's fine.
However, Google Analytics is registering those visits as pages with another URL. Google Analytics shows a different URL for where the visit happened, which in this example is: /wp-content/cache/page_enhanced/domain.com/post/_index.html_gzip
How can I resolve that problem? Where is the issue?
Things I've done:
1) 301 redirect from all old http URLs to the new https URLs.
2) The Google Analytics property was configured, and the URL of the website was changed to HTTPS.
Thanks all for reading and hope to find a solution for this problem.
Regards,
Pablo
This depends on how you included Analytics.
1) If the WordPress framework sends the 'click' to Analytics in the back-end code, then there might be a problem with it sending the wrong URL that was visited. (Have a look at PHP's $_SERVER variables; probably SCRIPT_FILENAME is sent instead of REQUEST_URI.)
2) If you manually inserted the embed code from Analytics in your template, the URL detection is all done based on the URL in the browser, so you would never see your own file-path info there, unless it's being sent.
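To illustrate case 1, here is a sketch (not W3 Total Cache's or the plugin's actual code) of a server-side pageview hit using the legacy Universal Analytics Measurement Protocol; the tracking ID is a placeholder. It shows how reporting the filesystem path of the cached file (analogous to SCRIPT_FILENAME) instead of the request path (REQUEST_URI) produces exactly the URLs you are seeing in Analytics:

```python
import requests

MEASUREMENT_URL = "https://www.google-analytics.com/collect"

def send_pageview(document_path):
    requests.post(MEASUREMENT_URL, data={
        "v": "1",                # protocol version
        "tid": "UA-XXXXXXX-1",   # placeholder tracking ID
        "cid": "555",            # anonymous client id
        "t": "pageview",
        "dh": "domain.com",
        "dp": document_path,     # this is what shows up as the page URL
    })

# Wrong: the path of the cached file that actually served the request.
send_pageview("/wp-content/cache/page_enhanced/domain.com/post/_index.html_gzip")

# Right: the URL the visitor actually requested.
send_pageview("/post/")
```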

Do web crawlers read the HTTP headers?

I own a URL shortening service and I want to detect whether a request that I received was from a web crawler or not. In response to the request, I send an HTTP 302 redirect that sends the requester to the original link. I was thinking that I could include an invisible link in the response body, so that a bot would send me a request for that page too but a normal user won't. This is based on the hypothesis that even if bots read the header and redirect, they still scan the page and send requests to the links found in it. Is the hypothesis correct? If it is not, I could also redirect them via JavaScript, but that would not be the standard way of redirecting (I suppose).
Yes, crawlers definitely follow redirects. Their purpose is to find as many pages (or as much content) as possible, and following redirects is a basic requirement for that goal. However, I do not know whether commercial crawlers read the body of a redirect response. I suspect they don't, since information in the body of a redirect page is never shown to a user; the user is always redirected away from that page.
There are other crawlers, like Crawljax, that are built for testing web applications. They will read all the data, but those crawlers aren't (or shouldn't be) used to crawl the public web.
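If you want to test the hypothesis yourself, here is a minimal sketch of the "invisible link" idea using Flask; the routes and in-memory storage are placeholders, not production code. The 302 response carries a tiny HTML body with a hidden trap link: a human is redirected and never sees it, while a client that parses redirect bodies may also request the trap URL.

```python
from flask import Flask, redirect, request

app = Flask(__name__)
LINKS = {"abc": "https://example.com/original-article"}  # short code -> target
suspected_bots = set()

@app.route("/<code>")
def follow(code):
    target = LINKS.get(code)
    if target is None:
        return "not found", 404
    resp = redirect(target, code=302)
    # Body of the redirect response; browsers ignore it, some crawlers may not.
    resp.set_data('<a href="/trap/%s" style="display:none">.</a>' % code)
    return resp

@app.route("/trap/<code>")
def trap(code):
    # Only a client that parsed the redirect body would ever hit this.
    suspected_bots.add(request.remote_addr)
    return "", 204

if __name__ == "__main__":
    app.run()
```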
