Websites that are particularly challenging to crawl and scrape? [closed]

I'm interested in public-facing sites (nothing behind a login or authentication) that have things like:
High use of internal 301 and 302 redirects
Anti-scraping measures (but not banning crawlers via robots.txt)
Non-semantic or invalid markup
Content loaded via AJAX through onclick handlers or infinite scrolling
Lots of parameters used in URLs
Canonical URL problems
Convoluted internal link structure
and anything else that generally makes crawling a website a headache!
I have built a crawler/spider that performs a range of analyses on a website, and I'm on the lookout for sites that will make it struggle.

Here are some examples:
Content loaded via AJAX in the form of onclicks or infinite scrolling
Pinterest
Comments on pages like this Chinese product page: its comments are loaded via AJAX, triggered by scrolling down in the browser (or depending on the browser's viewport height). I had to use PhantomJS and xvfb to trigger those scroll actions.
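Roughly, that scrolling trick looks like this with Selenium (a sketch; the URL is a placeholder, and PhantomJS has since been deprecated in favor of headless Chrome/Firefox, which work the same way):

```python
import time
from selenium import webdriver

driver = webdriver.PhantomJS()  # or a headless Chrome/Firefox driver
driver.get("https://example.com/product-page")  # placeholder URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom so the page fires its AJAX comment loader.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the XHR time to finish
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was appended; all comments are loaded
    last_height = new_height

html = driver.page_source  # now includes the AJAX-loaded comments
driver.quit()
```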
Anti-scraping measures (but not banning crawlers via robots.txt)
Amazon's next-page links
I have crawled Amazon's site in China, and when you try to crawl the next page of results, the site may modify the request so that you can't get the real next page.
Stack Overflow
It limits visit frequency. A few days ago I wanted to get all of the tags on Stack Overflow, so I set my spider's visit frequency to 10, but Stack Overflow warned me. After that I had to use proxies to crawl it.
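A polite workaround is throttling requests and, if you must, rotating proxies. A minimal sketch (the proxy addresses are placeholders, not real endpoints):

```python
import itertools
import time
import requests

PROXIES = itertools.cycle([
    {"https": "http://proxy1.example:8080"},  # placeholder proxies
    {"https": "http://proxy2.example:8080"},
])

def fetch(url, delay=2.0):
    # Route each request through the next proxy in the pool, then wait
    # between requests to stay under the site's visit-frequency limit.
    resp = requests.get(url, proxies=next(PROXIES), timeout=10)
    time.sleep(delay)
    return resp
```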
and anything else that generally makes crawling a website a headache
Yihaodian
This is a Chinese e-commerce site; when you visit it in a browser, it detects your location and offers products based on it.
There are many sites like this that serve different content depending on your location. When you crawl them, what you get is not the same as what you see in a browser, so you often need to set cookies when sending requests from a spider.
Last year I encountered a site that required particular HTTP request headers and cookies on every request, but I don't remember which site it was....
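In practice, that usually means replaying cookies and headers captured from a real browser session. A minimal sketch with requests, where the cookie name provinceId is a made-up example of what you might copy from your browser's devtools:

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",  # mimic the browser the cookies came from
})
# Hypothetical location cookie copied from a real browser session.
session.cookies.set("provinceId", "1", domain="example.com")

resp = session.get("https://example.com/")  # placeholder site
```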

Should I use Cloudflare's Accelerated Mobile Links option if I have already had AMP implemented on my website since 2015? [closed]

I have a WordPress website with Google AMP already implemented and working.
I recently transferred to Cloudflare and saw its AMP feature.
What should I do now? Should I enable it or not?
Any ideas?
Thanks!
That's really up to the use cases you want to cater to and where your traffic comes from.
E.g., is your primary traffic coming from organic search? Then your site's URLs should already be cached and served from the AMP Cache.
E.g., do a lot of users open your site's URLs in a webview? Maybe you can give Cloudflare a shot there.
This article might help:
https://antonyagnel.com/how-to-enable-accelerated-mobile-links-in-cloudflare/
"Cloudflare’s Accelerated Mobile Links is actually powered by the official AMP project. What Cloudflare actually does, in this case, is – it loads AMP enabled external links inside of a viewing window, within the same tab. This in comparison is different from what Google does. Google displays the AMP version of a site only when the search query comes from a mobile device through Google search."

Will web scraping only cause harm to those who have a website? [closed]

Today I scraped a website using beautifulsoup4 and tried to fetch about 16,000 records from it.
Just a few minutes after that, the site went down and couldn't be accessed for a few hours.
So.. my question is:
Will web scraping only cause harm to those who have a website?
First of all, it is advisable to check the robots.txt file of every site before bombarding it with automated requests like you just did. That is bad for the website owner as well as for you. Before starting to write a web scraper, follow these steps:
Check whether the website already has an API available to make your task easy. If not, go to step 2.
Check the robots.txt file, found at www.anywebsite.com/robots.txt. If the owner has provided one (which in most cases they will), you can see whether robots are allowed to access the website. If so, check which pages are disallowed and whether any rate limits apply; a minimal check is sketched below.
If there is no robots.txt file, make sure you are gentle enough not to fire requests at the website at bullet speed. That might harm the owner, and you might get blocked from the site forever.
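For step 2, a minimal sketch using only Python's standard library (the site URL and user-agent string are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.anywebsite.com/robots.txt")
rp.read()

page = "https://www.anywebsite.com/some/page"
if rp.can_fetch("MyScraper", page):
    # Honor an explicit Crawl-delay if one is declared; otherwise
    # fall back to a gentle default pace of one request per second.
    delay = rp.crawl_delay("MyScraper") or 1.0
    print(f"Allowed to fetch; waiting {delay}s between requests")
else:
    print("Disallowed by robots.txt; skip this page")
```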

Yahoo and Bing search results caching an old version of my website [closed]

I just put up a new site for my friend's company.
If you search on Google, the link takes you to the correct home page, www.durangoshadecompany.com.
However, if you search using Yahoo or Bing, the link that comes up goes to a cached version of the page, www.durangoshadecompany.com/index.html. That worked for the old site because it was static.
The new site is dynamically built on WordPress, so the index.html file brings up an error. Can I fix this, or will I just have to wait until Yahoo caches the correct home page?
I've tried searching for a remedy, but can't find anything that solves this problem.
Just install the Redirection plugin and create a rule pointing index.html to /. That should fix the issue immediately, with no coding knowledge required.
Caching is done on the search engine's servers, so there isn't much you can do about it.
If I had to hazard a guess at the methodology, most search engines probably only re-cache a site after a certain amount of change has occurred. Therefore, your best bet might be restructuring your site's code in a way that common utilities like diff would see as large changes.
But that's just a guess.
Best of luck.
As the other answer mentioned, it is outside your direct control since you are at the mercy of specific search engines re-indexing you.
Each search engine is going to have its own rules, but I would suggest that your switching to WordPress will help in getting it updated. If you went with the standard install of WordPress, it hits ping-o-matic when you publish posts. That is a way to notify various services (although not Bing and Yahoo directly, I believe) that you have an update.
You can submit to ping-o-matic directly as well. Just go there and fill out the form.
You can (and should) sign up for Bing and Google webmaster tools (Yahoo is part of Bing's tools). This will give you an opportunity to let them know you've updated and that you would like to be crawled. It will also give you a chance to know when they have crawled you and what errors they may have encountered (so that you can correct them).
To make yourself even more friendly for being crawled, you should have an XML sitemap. You can submit the location of your sitemap through those tool sites for indexing. If you do not already have an XML sitemap, there are plugins for WordPress that will build it for you. Then all you need to worry about is submitting it.
For your index.html issue: if the site gets reindexed, that should remedy itself. However, if you want to be sure, what you want is a 301 redirect. This tells the bot that the page has been moved permanently, and it will take note (i.e., you let them know that mysite.com/index.html has permanently moved to mysite.com, or something like that).
There are different ways to do that. You could create an index.html that delivers a 301 redirect, or you could do it with .htaccess. I would lean toward the .htaccess method.
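For reference, the .htaccess version can be a single line, assuming Apache with mod_alias enabled:

```apache
# Permanently redirect the old static entry page to the WordPress root.
Redirect 301 /index.html http://www.durangoshadecompany.com/
```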

Spam users in Drupal [closed]

I am using Drupal for the content of my website. That is, I use it as a content editor but serve the content with my own custom PHP pages. Anyway,
I'm seeing a lot of users registering and commenting on my Drupal site, with usernames like:
jennipoehmkocmnxqs
traciezlnvafjlasp
frederickajefferson
rowenanskmsqynljyxl
krystle03qgatdzutama
So they are obviously coming from one bot, if not more.
I don't want to implement reCAPTCHA, since I want to encourage my visitors to add content. So I need to find the source of this bot, maybe an IP address, and block it from my domain. Do you have any idea what bot this is, or how I can track it?
I've looked into the Drupal database; apparently it does not save users' IP addresses.
Thanks!
P.S.
There are also spam comments like this:
I'm really enjoying the theme/design of your website. Do you ever run into any web browser compatibility problems? A number of my blog readers have complained about my website not operating correctly in Explorer but looks great in Opera. Do you have any solutions to help fix this issue? Look at my weblog :: _sell my gold_ (link that redirects to www.swiftcashforgold.com/what-we-buy.php)
I had the same problem with fake users on an e-commerce site that didn't even allow comments. Implementing reCAPTCHA on the registration and login screens definitely seemed to cut down the number of fake signups quite a bit, but you are right that it annoys users and puts up a barrier to their activity, and in a lot of cases it just doesn't work because humans are filling it out.
A different approach, which will at least help you deal with the comments, is analyzing the content of each comment to determine whether it is spam. For this, you can use Mollom, Akismet, Defensio, or a similar service. These services are configured by default not to display a CAPTCHA; instead they check the patterns of the many submissions that run through them, and in many cases they can auto-detect spam and quarantine bad comments thanks to the wealth of information they have.
These are all subscription services with free starter plans. If you have a lot of legit comments coming into your site on a daily basis, you will have to pay a monthly fee. All of these solutions have Drupal modules which allow for their integration into Drupal forms.
I know Mollom also supports protecting the user registration form by default, but I don't know for sure whether any of these modules will completely solve the problem of fake users signing up, because I haven't tried it yet. It's possible that one or more of them will flag a user for leaving spam comments. Hopefully this will help with both problems, but it will definitely stop the comments.
You could validate that the email addresses of people who register actually exist.
This can be done using the PHP class linked below, which sends SMTP commands to their email server without sending an actual email.
That way you know they are valid users (and where they came from) without pestering them with actual emails.
http://www.webdigi.co.uk/blog/2009/how-to-check-if-an-email-address-exists-without-sending-an-email/smtpvalidateclassphp/
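For the curious, here is roughly the same check sketched in Python rather than PHP. It needs the third-party dnspython package for the MX lookup, and note that many mail servers either block such probes or accept every recipient, so treat the result as a hint, not proof:

```python
import smtplib
import dns.resolver  # pip install dnspython

def mailbox_may_exist(address: str) -> bool:
    domain = address.rsplit("@", 1)[-1]
    # Find the highest-priority mail exchanger for the domain.
    mx_records = sorted(dns.resolver.resolve(domain, "MX"),
                        key=lambda r: r.preference)
    mx_host = str(mx_records[0].exchange).rstrip(".")
    # Talk SMTP up to RCPT TO, then hang up without sending a message.
    with smtplib.SMTP(mx_host, 25, timeout=10) as smtp:
        smtp.helo("example.com")        # placeholder HELO hostname
        smtp.mail("probe@example.com")  # placeholder sender address
        code, _ = smtp.rcpt(address)
    return code == 250  # 250 = server claims the mailbox exists
```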

Is there a WordPress plugin for creating an API from your site? [closed]

I'd like to use the data from my WordPress site in API form, maybe REST with JSON output. I'm wondering if there are any plugins that automatically make the WordPress data accessible from outside the site it's running on, similar to the way most web APIs work.
WordPress is basically a REST-powered platform to begin with. You enter a URL with a particular query string (sometimes buried in a 'pretty permalink') and the system returns a semi-static resource based on the layout and structure defined in your theme.
To build it into an 'API' you'd need to first identify what information you're returning and how you want to structure it. Do you want people to access your data via a typical link (http://blog.url/?post=xxx&category=xxx&whatever=xxx)? Or do you want to keep running a typical blog but allow API access through another mechanism?
If you want the second route, you could always hook a plug-in into WordPress' built-in XML-RPC service. Then users would make a request to something like http://blog.url/xmlrpc.php?resource=xxx&variable=yyy&somethingelse=zzz and your site would return whatever information you want (though this would be XML-RPC, not REST ... so it's really up to you).
See my answer here for a specific example with WordPress code ...
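To illustrate the XML-RPC route from the client side, here is a short sketch using Python's standard library against WordPress's built-in wp.getPosts method (the blog URL and credentials are placeholders):

```python
import xmlrpc.client

server = xmlrpc.client.ServerProxy("http://blog.url/xmlrpc.php")
# wp.getPosts(blog_id, username, password, filter) is part of
# WordPress's built-in XML-RPC API (WordPress 3.4+).
posts = server.wp.getPosts(0, "username", "password", {"number": 5})
for post in posts:
    print(post["post_id"], post["post_title"])
```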
I have used this REST/JSON API plugin with some success, so while it's not for creating an API, you could possibly pull it apart and adapt it to your needs. The plugin seems to only support output, not input (comments, etc.). There is also discussion about creating similar API plugins for both input and output, and that might be one way to go, too. Have fun!
Yes, there is a way, and I just found it!
You can go here: http://developer.wordpress.com/docs/api/
All you have to do is fill your website into a request like:
https://public-api.wordpress.com/rest/v1/sites/$yourSiteHere/posts/
and you'll get beautiful JSON back.
You can post comments, get data, and add queries pretty easily.
If you want to do more that requires login, you can use OAuth.
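For example, a quick sketch of fetching posts from that endpoint with Python (the site name is a placeholder):

```python
import requests

site = "yourblog.wordpress.com"  # placeholder WordPress.com site
resp = requests.get(
    f"https://public-api.wordpress.com/rest/v1/sites/{site}/posts/",
    params={"number": 5},  # ask for just the five most recent posts
    timeout=10,
)
for post in resp.json().get("posts", []):
    print(post["ID"], post["title"])
```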
The API Endpoints WordPress plugin lets you construct any API out of your WordPress site.
If you are a WordPress plugin developer and need a RESTful API, maybe thermal-api.com can help you: Wordpress plugin to connect to a REST API?
But I think the best way is to use the WP REST API: http://v2.wp-api.org/extending/adding/
