How to protect website data from web scraping? - web-scraping

I have a website like IMDB, where all the data is publicly available. How can I protect that data from web scrapers?

There is only one foolproof method against scrapers: a CAPTCHA. But because it hurts the user experience, most websites avoid it.
Another option is loading data via AJAX. This stops scrapers that can't render JavaScript, but anyone can build one that does using Selenium WebDriver. AJAX-only content can also hurt SEO, if Google rankings matter to you.
A more efficient approach is to track user behaviour (storing it in cookies) and serve a CAPTCHA only when something looks suspicious; this is roughly how Google's reCAPTCHA works on many sites. A rough sketch of the idea follows below.
See also this link: https://blog.hartleybrody.com/prevent-scrapers/
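Here is a minimal sketch of that behaviour-tracking idea, assuming a Flask app. The in-memory counter, the threshold of 120 requests per minute, and the "captcha.html" template are illustrative placeholders, not a finished product; a real deployment would keep the counters in something like Redis so they survive restarts and work across worker processes.

```python
# Behaviour-based throttling sketch: count recent requests per visitor and
# serve a CAPTCHA page once the rate looks suspicious.
import time
from collections import defaultdict, deque
from flask import Flask, request, render_template

app = Flask(__name__)

WINDOW_SECONDS = 60          # length of the sliding window
THRESHOLD = 120              # requests per window treated as "suspicious"
hits = defaultdict(deque)    # visitor key -> timestamps of recent requests

def visitor_key():
    # Prefer a long-lived cookie so users behind one NAT are not lumped
    # together; fall back to the remote address.
    return request.cookies.get("visitor_id") or request.remote_addr

@app.before_request
def maybe_challenge():
    now = time.time()
    window = hits[visitor_key()]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) > THRESHOLD:
        # Serve a CAPTCHA instead of the requested content.
        return render_template("captcha.html"), 429

@app.route("/title/<movie_id>")
def title(movie_id):
    return f"Details for movie {movie_id}"
```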

Related

How can I reduce the POST and GET queue at high-load times?

I have created a WordPress website to collect students' attendance. To do so, I've installed a plugin that sends all the data to a Google Sheet.
The problem is that when all the students try to submit their attendance at once, around a hundred users hit the site at the same moment, which puts a very high load on it; most of them get a 503 error, or sometimes a 500.
To solve this problem, a few ideas have crossed my mind:
I could upgrade my server and hardware resources, but I'm on shared hosting and that would be very costly.
I installed a second plugin and tried to handle the load with two separate plugins on one page. However, as far as I know they both go through the same core WordPress GET/POST handling, so it doesn't matter that I use different plugins simultaneously; the requests still have to wait. Am I right?
I created two mirror pages for the attendance form and direct users to one of them at random, hoping this reduces the page load. For form submission, though, the situation is the same, since the forms on both pages use the same POST and GET handling.
Please give me some advice if there is any other solution. For now I have inserted a Google Form as an alternative, but I suspect there is a way to handle this inside the site without relying on an external form provider.
Here is the site: Attendance website
Can you use a 'disconnected' architecture?
Ideally, every attendance submission would be sent to a high-performance queue, and your app would then read from it at its own pace.
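As a sketch of that "disconnected" idea, the request handler only enqueues the record and returns immediately, while a single background worker drains the queue at its own pace (for example, appending rows to the Google Sheet). All names here are illustrative; on shared WordPress hosting the same idea usually means an external queue service or a cron job draining a database table, since long-running worker processes aren't available.

```python
# Enqueue-now, process-later sketch: the web request is cheap; the slow
# Google Sheet call happens in a background worker.
import queue
import threading
import time
from flask import Flask, request

app = Flask(__name__)
attendance_queue = queue.Queue()

@app.route("/attendance", methods=["POST"])
def submit_attendance():
    # Accept the submission immediately; defer the slow work.
    attendance_queue.put({
        "student_id": request.form.get("student_id"),
        "timestamp": time.time(),
    })
    return "Attendance recorded", 202

def save_to_sheet(record):
    # Stand-in for the real Sheets API call.
    print("would append row:", record)

def worker():
    while True:
        record = attendance_queue.get()
        save_to_sheet(record)
        attendance_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```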

Prevent CMS identification

I have a small question out of curiosity. I often use extensions like Wappalyzer to work out which CMS certain sites are based on. Is there a way to prevent extensions like that from identifying the CMS used?
You can try, but I don't think you can successfully hide the CMS behind your site; the attempt is largely a waste of time. For Drupal, see this page:
https://www.drupal.org/node/766404
Most CMSs can be run headless, decoupling the backend from the frontend. They either render a static set of HTML pages, or emit a JSON file with all the site's data, which is then rendered by some kind of JavaScript app. That way nobody can tell where the data is coming from, or whether a CMS was involved at all.
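If you do want to try, tools like Wappalyzer rely largely on response headers, cookies and generator meta tags, so removing the obvious fingerprints raises the bar a little. Below is an illustrative sketch only, written as a small WSGI middleware; in practice you would do the same thing in whatever fronts the CMS (nginx rules, a reverse proxy, or a plugin), and markup clues such as /wp-content/ paths would still give the game away unless the HTML is rewritten too.

```python
# Strip headers commonly used to fingerprint the backend stack.
FINGERPRINT_HEADERS = {"x-powered-by", "x-generator", "server", "x-drupal-cache"}

class StripFingerprints:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def filtered_start_response(status, headers, exc_info=None):
            headers = [(k, v) for k, v in headers
                       if k.lower() not in FINGERPRINT_HEADERS]
            return start_response(status, headers, exc_info)
        return self.app(environ, filtered_start_response)

# Usage with any WSGI app, e.g. a Flask app object:
# app.wsgi_app = StripFingerprints(app.wsgi_app)
```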

Is it possible to use Google Sign-in without Javascript?

I have an application where I've been asked to support Google sign-in. Something we've tried to do from the beginning is to not require JavaScript for any important function. Is it possible to perform Google Sign-in without requiring JavaScript?
I've read some of the guides, such as https://developers.google.com/identity/sign-in/web/sign-in and https://developers.google.com/identity/sign-in/web/backend-auth, but they all seem to involve a JavaScript component.
For example, can we use only links and redirects, etc. to accomplish a Google-based authentication, along with some server-side verification?
I think what you want to do is described in the OAuth 2.0 for Server-side Web Apps documentation. Several sections in that doc have tabs with language-specific examples, and there is also an HTTP/REST tab that shows how to use Google's OAuth URLs generically.
You will also want to follow Google's sign-in branding guidelines.
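To make the shape of that redirect-only flow concrete, here is a minimal sketch using Google's documented OAuth 2.0 endpoints and no JavaScript at all: a plain link sends the user to Google, Google redirects back with a code, and the server exchanges the code for tokens. The CLIENT_ID, CLIENT_SECRET and REDIRECT_URI values are placeholders you register in the Google console, and the returned ID token should be verified server-side (for example with the google-auth library) before its claims are trusted.

```python
# Redirect-based Google sign-in sketch: links and redirects only.
import secrets
import urllib.parse
import requests
from flask import Flask, redirect, request, session

app = Flask(__name__)
app.secret_key = "replace-me"  # needed for the session-stored state value

CLIENT_ID = "your-client-id.apps.googleusercontent.com"   # placeholder
CLIENT_SECRET = "your-client-secret"                        # placeholder
REDIRECT_URI = "https://example.com/oauth2callback"         # placeholder

AUTH_URL = "https://accounts.google.com/o/oauth2/v2/auth"
TOKEN_URL = "https://oauth2.googleapis.com/token"

@app.route("/login")
def login():
    # The "Sign in with Google" control can be an ordinary <a href="/login">.
    state = secrets.token_urlsafe(16)
    session["oauth_state"] = state
    params = {
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "response_type": "code",
        "scope": "openid email profile",
        "state": state,
    }
    return redirect(AUTH_URL + "?" + urllib.parse.urlencode(params))

@app.route("/oauth2callback")
def callback():
    if request.args.get("state") != session.get("oauth_state"):
        return "state mismatch", 400
    tokens = requests.post(TOKEN_URL, data={
        "code": request.args["code"],
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "redirect_uri": REDIRECT_URI,
        "grant_type": "authorization_code",
    }).json()
    # tokens["id_token"] is a JWT identifying the user; verify it before use.
    return "Received id_token (verify before trusting its claims)."
```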
I've been looking for the same thing. I'm sick of pop-ups and I don't want them on my site. It seems like there should be a way to simply link to a Google page and then have the user redirected back to my site, but there doesn't seem to be any documentation on how to do that.
I also agree that it shouldn't matter what programming language is being used; Google doesn't need to know that. All we need is a URI to send the user to, and some way to indicate where the user should be redirected back to afterwards.

What does it mean when I see some IPs look at hundreds of pages on my website?

What should I do when I see some IPs in my logs scrolling through hundreds of pages on my site? I have a WordPress blog, and it doesn't seem like a real person. This happens almost daily, with different IPs.
UPDATE: Oh, I forgot to mention, I'm pretty sure it's not a search-engine spider. The hostname isn't a search engine's, just some random host in India (it ends in '.in').
What I'm concerned about is: if it is a scraper, is there anything I can do? Or could it possibly be something worse than a scraper, e.g. a hacker?
It's a spider/crawler. Search engines use them to compile their listings, researchers use them to map the structure of the internet, the Internet Archive uses them to download the contents of the web for future generations, spammers use them to harvest e-mail addresses, and so on.
Checking out the user agent string in your logs may give you more information on what they're doing. Well-behaved bots will generally indicate who/what they are - Google's search bots, for example, are called Googlebot.
If you're concerned about script kiddies, I suggest checking your error logs. Their scripts often probe for things you may not have; for example, on one system I run I don't have ASP, yet I can tell when a script kiddie has probed the site because I see lots of attempts to find ASP pages in my error logs.
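As a rough illustration of the "check your logs" advice, this sketch tallies requests per IP and lists the user agents each one sends, assuming an Apache/nginx combined log format; the log path is a placeholder.

```python
# Tally requests per IP and collect the user agents they report.
import re
from collections import Counter, defaultdict

LOG_PATH = "/var/log/nginx/access.log"  # placeholder path
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

hits = Counter()
agents = defaultdict(set)

with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, user_agent = match.groups()
        hits[ip] += 1
        agents[ip].add(user_agent)

# Print the ten busiest clients with the user agents they claimed.
for ip, count in hits.most_common(10):
    print(f"{ip}: {count} requests, agents: {sorted(agents[ip])}")
```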
Probably some script kiddie looking to take advantage of an exploit in your blog (or server). That, or some web crawler.
It's probably a spider bot indexing your site; the "User-Agent" header might give it away. A dynamically generated WordPress site can easily produce hundreds of GET requests per visit once you count not just the blog pages but also the CSS, JS and images.

capture details from external web page

I'm wondering if it's possible to capture details from the web page that a user previously visited, if my page was not linked from it?
What I am trying to achieve is to allow users to my site to find a page they like while browsing the web, and then navigate to a page on my site via a bookmark, which will add the URL (and possibly some other details like the page title) to a form which they can then submit to my site to add the page to a list of favourites there.
I'm not really sure where to start looking for this. I wondered if I could use the HTTP referrer, but I think that only works if there is a link to my page?
Alternatively, I'm open to other suggestions as to how I could capture this data - a Firefox plugin? A page in which users browse other sites inside an iframe, with a thin frame of mine on top?
Thanks in advance for your suggestions.
Features like this are typically not allowed by browsers for security and privacy reasons. The iframe approach would work, but it is also a common attack technique, so it may well break or get flagged in the future.
The Firefox add-on is the best solution, but it requires users to install it manually.
A bookmarklet could also be used: while the user is on the target page, the bookmarklet can send you its URL.
The example bookmarklet below creates a TinyURL for the destination page; you could add it to your database or whatever you like.
javascript:void(window.open('http://tinyurl.com/create.php?url='+encodeURIComponent(document.location.href)));
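Instead of TinyURL, the same bookmarklet pattern could open a page on your own site with the URL and title as query parameters, which matches what the question asks for. Below is a hypothetical sketch of the receiving side in Flask; the /save route and parameter names are made up for illustration, and the handler simply pre-fills the "add to favourites" form.

```python
# Hypothetical receiving endpoint for a bookmarklet that opens
# /save?url=...&title=... on your site.
from flask import Flask, request, render_template_string

app = Flask(__name__)

FORM = """
<form method="post" action="/favourites">
  <input name="url" value="{{ url }}">
  <input name="title" value="{{ title }}">
  <button type="submit">Save favourite</button>
</form>
"""

@app.route("/save")
def save():
    # Pre-fill the favourites form with whatever the bookmarklet passed in.
    return render_template_string(
        FORM,
        url=request.args.get("url", ""),
        title=request.args.get("title", ""),
    )
```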
If another site links to yours and the user clicked that link to reach your site, you can read the referrer from the HTTP headers. How you get hold of the HTTP headers is language/framework specific: in .NET you would use Request.UrlReferrer; other frameworks handle it differently (see the sketch after this answer).
EDIT: After reading your question again, my guess is that what you're looking for is some sort of browser plugin. If I understand correctly, you want to give your visitors the ability to bookmark a site while they are on it, which would somehow notify your site about the page they're viewing. The cleanest way to achieve this is a browser plugin. You can also do FRAME tricks, like the Digg bar.
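For comparison with the .NET Request.UrlReferrer mention above, here is the same idea in a Flask handler. Note that the Referer header is only populated when the visitor actually followed a link to your page, and browsers or referrer policies may omit it, which is why the bookmarklet or plugin approaches are more reliable.

```python
# Reading the Referer header (the framework-specific part of the answer).
from flask import Flask, request

app = Flask(__name__)

@app.route("/bookmark")
def bookmark():
    came_from = request.referrer  # None if no Referer header was sent
    if came_from is None:
        return "No referrer available; use the bookmarklet instead."
    return f"You arrived from {came_from}"
```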

Resources