The data on our website can easily be scraped. How can we detect whether a human is viewing the site or a tool?
One way is to measure how long a user stays on a page. I do not know how to implement that. Can anyone help me detect and prevent automated tools from scraping data from my website?
I used a security image in login section, but even then a human may log in and then use an automated tool. When the recaptcha image appears after a period of time the user may type the security image and again, use an automated tool to continue scraping data.
I developed a tool to scrape another site. So I only want to prevent this from happening to my site!
DON'T do it.
It's the web, you will not be able to stop someone from scraping data if they really want it. I've done it many, many times before and got around every restriction they put in place. In fact having a restriction in place motivates me further to try and get the data.
The more you restrict your system, the worse you'll make user experience for legitimate users. Just a bad idea.
It's the web. You need to assume that anything you put out there can be read by human or machine. Even if you can prevent it today, someone will figure out how to bypass it tomorrow. Captchas have been broken for some time now, and sooner or later, so will the alternatives.
However, here are some ideas for the time being.
And here are a few more.
And now for my favorite. One clever site I've run across has a good one. It asks a question like "On our 'About Us' page, what is the street name of our support office?" or something like that. It takes a human to find the "About Us" page (the link doesn't say "about us"; it says something similar that a person would figure out) and then to find the support office address (which is different from the main corporate office and several others listed on the page), so you have to look through several matches. Current computer technology wouldn't be able to figure it out any more than it can figure out true speech recognition or cognition.
A Google search for "Captcha alternatives" turns up quite a bit.
This can't be done without risking false positives (and annoying users).
How can we detect whether a human is viewing the site or a tool?
You can't. How would you handle tools parsing the page for a human, like screen readers and accessibility tools?
For example one way is by calculating the time up to which a user stays in page from which we can detect whether human intervention is involved. I do not know how to implement that but just thinking about this method. Can anyone help how to detect and prevent automated tools from scraping data from my website?
You won't detect automatic tools, only unusual behavior. And before you can define unusual behavior, you need to find what's usual. People view pages in different order, browser tabs allow them to do parallel tasks, etc.
I should make a note that if there's a will, then there is a way.
That being said, I thought about what you've asked previously and here are some simple things I came up with:
Simple naive checks might be user-agent filtering and checking. You can find a list of common crawler user agents here: http://www.useragentstring.com/pages/Crawlerlist/
You can always display your data in Flash, though I do not recommend it.
Use a captcha.
Other than that, I'm not really sure if there's anything else you can do but I would be interested in seeing the answers as well.
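A minimal sketch of the user-agent check mentioned above, in Python. The marker list and the function name are made up for illustration, and treating a missing header as suspicious is my own assumption. Keep in mind the header is trivially spoofed, so this only catches tools that do not bother to hide.

```python
# Hypothetical list of substrings seen in crawler User-Agent headers.
# A real deployment would maintain its own list.
KNOWN_CRAWLER_MARKERS = ("bot", "crawler", "spider", "curl", "wget", "python-requests")


def looks_like_crawler(user_agent):
    """Return True if the User-Agent is missing or contains a known crawler marker."""
    if not user_agent:
        # Assumption: a request without a User-Agent is treated as automated.
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_CRAWLER_MARKERS)
```

You would call this with the header value of each incoming request and decide what to do with a match (log it, challenge it, or block it).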
EDIT:
Google does something interesting: if you're searching for SSNs, after the 50th page or so it will show a captcha. That raises the question of whether you can intelligently time how long a user spends on your page or, if you introduce pagination into the equation, how long a user spends on one page.
Using the information that we previously assumed, it is possible to put a time limit before another HTTP request is sent. At that point, it might be beneficial to "randomly" generate a captcha. What I mean by this, is that maybe one HTTP request will go through fine, but the next one will require a captcha. You can switch those up as you please.
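Roughly what I mean, as a Python sketch. The two constants and the function name are assumptions picked for illustration, not recommended values. The random source is passed in so the decision can be tested.

```python
import random

MIN_SECONDS_BETWEEN_REQUESTS = 2.0  # assumed threshold, tune for your site
CAPTCHA_PROBABILITY = 0.3           # assumed chance of challenging a fast request


def needs_captcha(last_request_at, now, rng=random.random):
    """Decide whether the current request should be answered with a captcha.

    last_request_at is the timestamp (in seconds) of the client's previous
    request, or None if this is the first request we have seen from it.
    """
    if last_request_at is None:
        return False
    too_fast = (now - last_request_at) < MIN_SECONDS_BETWEEN_REQUESTS
    # Only some of the too-fast requests are challenged, so the pattern is
    # harder for a scraper to predict.
    return too_fast and rng() < CAPTCHA_PROBABILITY
```

The caller is expected to store the last request time per session or per IP and pass `time.time()` as `now`.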
Scrapers steal the data from your website by parsing URLs and reading the source code of your pages. The following steps can be taken to at least make scraping a bit more difficult, if not impossible.
Ajax requests make it difficult to parse the data and require extra effort to find the URLs to be parsed.
Use cookies even for normal pages that do not require any authentication: create a cookie once the user visits the home page, and then require it for all the inner pages. This makes scraping a bit more difficult.
Serve the data encrypted and then decrypt it at load time using JavaScript. I have seen this on a couple of websites.
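For the cookie idea, here is one possible sketch in Python. Signing the cookie is my own addition so that a scraper cannot simply invent a value. The secret, the cookie format, and both function names are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret


def issue_visit_cookie(visitor_id):
    """Build the cookie value handed out when the home page is visited."""
    signature = hmac.new(SECRET_KEY, visitor_id.encode(), hashlib.sha256).hexdigest()
    return f"{visitor_id}.{signature}"


def is_valid_visit_cookie(cookie):
    """Check on inner pages that the cookie was issued by this server."""
    if not cookie or "." not in cookie:
        return False
    visitor_id, signature = cookie.rsplit(".", 1)
    expected = hmac.new(SECRET_KEY, visitor_id.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison of the expected and supplied signatures.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

A scraper that loads the home page first and keeps the cookie will still get through, so this raises the effort rather than stopping anyone.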
I guess the only good solution is to limit the rate that data can be accessed. It may not completely prevent scraping but at least you can limit the speed at which automated scraping tools will work, hopefully below a level that will discourage scraping the data.
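One way to limit the rate is a sliding window per client. This is only a sketch: the class name is made up, the client id could be an IP address or a session id, and the state lives in memory, so it would not survive a restart or work across several servers.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of allowed requests

    def allow(self, client_id, now):
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

For example, `SlidingWindowLimiter(60, 60.0)` would cap each client at 60 pages per minute, and a request for which `allow` returns False could be answered with HTTP 429.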
My department has analytic rule conditions in DTM that trigger events based on particular classes or custom data attributes. I'm concerned that if our dev team makes a change that would break the rule, we wouldn't find out until it's discovered that the metric is no longer tracking.
We're trying to future proof our scripts to allow for changing conditions (eg: using regex for changing class names &/or functions to traverse the DOM to find a condition without it needing to be hardcoded), but I thought someone here might have experience with this type of issue. How was it handled at your company?
**EDIT:** I'm exploring using custom Data Elements within DTM that are created with JavaScript that has multiple conditions for traversing the DOM in ways we've identified. So a sort of data layer that's controllable by my team.
Note: This isn't really an actual coding question; more of [analytics/marketing tag] coding principles/best practices. So I'm not entirely sure this question belongs on SO (maybe one of the other stack exchange sites, perhaps superuser.com?). But I'll answer here anyways.
TL;DR - You need to get site devs involved and have them take on some level of initial and ongoing ownership of it.
Tag managers sell themselves on being able to deploy tags without getting site devs involved, and many times this works out in the short term. But in my experience, this kind of passive deployment just doesn't work out in the long term, especially for websites that have active and regular changes over time.
In my experience, the only effective way to keep site devs from inadvertently breaking the tracking is to include them in the deployment and have them take ownership of it on some level, so that it is something they are aware of within their own system/flow.
Sometimes it is as easy as having designated classes or attributes added to html tags on the page. For example, you can write a spec for the site devs to add a data-analytics='true' to any header, footer, CTA links on a given page, and tell the site devs this is something they need to keep as part of their workflow whenever they make changes to the site.
For more complicated things, you could spec for them to do something like broadcast a custom event for you to listen for. For example, maybe you have a purchase confirmation page and right now you have code in DTM to trigger based off the URL, or scrape the page for details about the purchase to push to tags. Instead, create a spec with instructions for the devs to put that in a data layer object and push to a custom event, and then create an event based rule for it.
The overall theme here is to create a spec document for all the things you want to be able to track on the website, that you know you can't reliably passively track without it breaking sooner or later, and hand the document to the devs and tell them they need to make it a part of their flow when making changes to the site. Bonus points if you can get them to loop you in whenever changes are going to be made and pushed to production, that you may go to your dev/qa version of your site to test to make sure your tracking still looks good.
The overall overall theme here is in order to prevent site devs from breaking your tracking, you need to be more proactive about making them and keeping them aware of your tracking, and in practice, this usually means putting some of the code work on their plate to own, so it's something in their history, in front of their face to see and know about. Because it is a lot easier for a dev to take notice of a data-analytics='true' in the header nav links they are about to restructure, than knowing hey, some piece of code in DTM relies on this current structure.. something that's not more directly in front of their face in their own code editor/environment.
Yes, actually accomplishing the above is often easier said than done. But it is the reality of the situation. Passively tracking things in a tag manager rarely works out for longer term stuff, short of "every page" tags that have little or no customization requirements at all.
I tell you from my own experience of over 10 years of working in the digital marketing and analytics industry, specifically with implementation, I have seen this time and time again. Too many times to count. Clients often want to and actually take the easier route of leaving the site devs out of the loop, all tracking requirements done solely through whatever the tag manager is capable of doing.
I've seen setups with hundreds of rules with trigger conditions based off scraping the page for some id or class, or some complex css selector dependent on 5 levels of html structure not changing. Or some random cookie you just assume means what you think it means. And you're spending more and more of your time playing whack-a-mole trying to re-adjust/fix individual rules/selectors as the next random change happens, and then one day comes a full site redesign and it's a nuclear bomb on all your tracking efforts.
And time and time again, without fail, eventually they wind up asking exactly what you've asked here, kicking themselves in the arse for the time and money they spent already on it for that "quick win", because nobody is confident in the data and they're wondering why they are allocating money/resources on tracking if it's just a bunch of trash, broken, pothole data. And the solution to it has always been "site dev awareness".
If it helps.. one card I sometimes play if I have a hard time convincing the dev team or other powers-that-be to jump on that bandwagon, is to point out that one of the biggest reasons for tracking websites is to help companies determine whether or not it's worth investing money in said website. If they can't determine that, then they may not be so inclined to do so, which means their need to even have a site dev team may also decline. To be more candid: it is something that helps justify their job.
Running an MVC2 site against IIS7 and would like to capture more detail of how users traverse the site - ideally to the point of being able to replay even the duration between mouse clicks - feedback of where people pause and/or backtrack.
I could do this with Flash, but that's no longer an option. Now it's just IIS7 via asp.net f4. IIS7 _should_ be able to provide this via 3rd party extensions - especially for this sort of niche need. I'm willing to consider client-side .net components but this sure seems to be the responsibility of the server.
[oops...does this belong on serverfault?]
thx
justSteve. Here is a solution that we have used:
http://www.seevolution.com/
I don't think that it gives time between clicks, but it does give very detailed tracking considering its price (I don't know if that's an issue). We have really liked it. Fantastic detail.
You could also roll your own solution. Using jQuery and the $(document).click() function, you can log when they click, and the points on the screen. Then every couple of minutes, serialize it and fire it off to the server. You can get extremely fine-grained detail that way. The nice thing with seevolution is that they've done all of the work for you already, but it probably isn't as detailed as you would like.
JMax
Maybe not the "in-house" solution you're after but we are about to implement SessionCam at my company, which seems like a pretty good match for what you're looking for. Not having actually finished implementing it yet, I can't vouch for it in terms of quality at this point - but the description of the product certainly matches.
You aren't going to be able to capture the level of detail you need using a solely server-side solution. There needs to be a degree of client-side work - whether it's in flash or javascript - to capture things such as where the mouse is hovering (for heatmaps etc).
I personally haven't used this product, but a friend of mine spoke highly of it.
Clicktale
I have started developing a webpage and recently hired someone to write code to display a customized feed (powered by API) in the middle panel on http://farmball.com/. Note that this is not the RSS feed tied to the site blog. The feed ties to my account on another site. There is no RSS link for an average user to subscribe to the feed. I've taken the site out of maintenance mode to ask anyone here with scraping/hacking experience how someone would most easily go about 'taking' the feed and displaying it on their own site. More importantly, what can I do to prevent it?
^Updated for re-wording
You can't.
If you are going to expose an RSS feed which you don't want others to be able to display on their site then you are completely missing the point of RSS. The entire reason for Really Simple Syndication (RSS) is to make your content externally consumable- whether that's in an RSS Reader or through someone simply printing its content on their own website.
Why are you including an RSS feed if you do not want someone to be able to consume it?
what can I do to prevent...'taking' the feed and displaying it on their own site?
Nothing. Preventing reuse goes against the basic concept of RSS, which is to make it as easy as possible for anyone to do anything they want with it. It was designed from the ground up to be Really Simple to Syndicate, not Really Hard to Retransmit Without Permission.
You could restrict access to the feed itself to trusted users only by making them provide some credentials or pass in a key to the feed (e.g. yoursite.rss?mykey=abc123). But you cannot control use. Only access.
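A sketch of the key check, in Python. The key table and the function name are hypothetical, and in practice the keys would live in a database rather than in the source. As said above, this controls who can fetch the feed, not what they do with it afterwards.

```python
import hmac

# Hypothetical table of issued feed keys and the subscriber each belongs to.
FEED_KEYS = {"abc123": "alice"}


def subscriber_for_key(key):
    """Return the subscriber that owns the key, or None if the key is unknown."""
    if not key:
        return None
    for known_key, subscriber in FEED_KEYS.items():
        # Constant-time comparison, so response timing does not leak key prefixes.
        if hmac.compare_digest(known_key.encode(), key.encode()):
            return subscriber
    return None
```

The feed handler would read `mykey` from the query string, call this function, and refuse to serve the feed when it returns None.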
Be explicit about your license. As others have mentioned, this isn't a technology solution: the technology is an open technology, and this isn't DRM! But if you ask in each post that people who use the feed not repost it without giving credit, and so on, then some people will respond to the request.
Otherwise, you're better off putting your content behind a password and using a paid subscription model for distributing your content.
This is a DRM problem essentially. If you had some technique that you could put content on the web without having it redistributable, the music industry would love you.
It is possible to try to prevent redistribution. One technique you could try is embedding a signature of some sort into the feed for each user who you require to sign up. If the content is found on the web, you can identify and ban the user who redistributed your content.
This is avoidable too, by getting multiple accounts and normalizing the content to remove fingerprints. For the would-be pirate, this requires more effort than they may be willing to put in. Your signature could be a unique whitespace pattern, tiny variances in the timestamps on posts, misplaced pixels in videos, or any other thing you can vary slightly without end users noticing.
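To make the whitespace idea concrete, here is a toy Python sketch. It hides a small numeric user id in trailing spaces on the first lines of the feed text: a trailing space means a 1 bit, no trailing space means a 0 bit. The bit width and both function names are assumptions for illustration, and as noted above anyone who normalizes whitespace will wipe the mark.

```python
ID_BITS = 8  # assumed width of the user id, one bit per line


def embed_user_id(text, user_id):
    """Mark the first ID_BITS lines with trailing spaces that encode user_id."""
    if not 0 <= user_id < 2 ** ID_BITS:
        raise ValueError("user_id does not fit in ID_BITS bits")
    lines = text.split("\n")
    if len(lines) < ID_BITS:
        raise ValueError("text needs at least ID_BITS lines")
    marked = []
    for i, line in enumerate(lines):
        line = line.rstrip(" ")  # clear any existing trailing spaces
        if i < ID_BITS and (user_id >> i) & 1:
            line += " "  # trailing space encodes a 1 bit
        marked.append(line)
    return "\n".join(marked)


def extract_user_id(text):
    """Recover the id from a marked copy found on the web."""
    user_id = 0
    for i, line in enumerate(text.split("\n")[:ID_BITS]):
        if line.endswith(" "):
            user_id |= 1 << i
    return user_id
```

Each subscriber gets a copy produced by `embed_user_id`, and running `extract_user_id` on a leaked copy tells you which account it came from.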
Use .htpasswd.
Better yet, don't put something private in a public place where it's likely to get picked up by software automatically. Like others have said, it's a pretty odd question; if you're trying to figure something else out, you're better off being explicit about what you want to know.
There are probably thousands of applications out there like 'Google Web Accelerator' and all kinds of popup blockers. Then there's header-blocking personal firewalls, full site blockers, and paranoid cookie monsters.
Fortunately Web Accelerator is now defunct (I suggest you read the above article - it's actually quite funny what issues it caused) but there are so many other plugins and third party apps out there that it's impossible to test them all with your app until it's out in the wild.
What I'm looking for is advice on the most important things to remember when writing a web-app (whatever technology) with respect to ensuring the user's environment isn't going to break it. Kind of like a checklist.
What's the craziest thing you've experienced?
PS. I may have linked to net-nanny above, but I'm not trying to make a porn site
The best advice I can give is to program defensively. For example, don't assume that all of your scripts will be loaded. I've seen cases where Adblock Plus will block 1 of 10 scripts that are included in a page just because it has the word "ad" in its name or path. While you can work around this by renaming the file, it's still good to check that a particular object exists before using it.
The weirdest thing I've seen wasn't so much a browser plugin but a firewall/proxy configuration at a user's workplace. They were using a squid proxy that was trying to remove ads by replacing any image HTTP request that it thought was an ad with a single pixel GIF image. Unfortunately it did this for non-GIF images too so when our iPhone application was expecting a PNG image and got a GIF, it would crash.
Internet Explorer 6. :)
No, but seriously. Firefox plugins like noscript and greasemonkey for one, though those are likely to be a very small minority.
Sometimes the user's environment means a screen reader (or even a braille interface like this). If your layout is in any way critical to the content being delivered as intended, you've got a problem right there.
Web pages break, fact of life; the closer you have been coding and designing up against standards, the less your fault it is.
Something I have checked in the past is loading some of the more popular toolbars that people tend to install (Google, Yahoo, MSN, etc) and seeing how that affects the user's experience.
To a certain extent it is difficult to preempt which of the products you mentioned will be used by your users since there are so many. I would say your best bet is to test for the most frequent products that your user base may employ and roll with the punches for the rest. If you have the time to test other possible scenarios, by all means do.
Making it easy for your users to report possible issues also helps lessen the time it takes to get a fix in place, should it be something you can work around.
Concerning pages that build a web application:
Lately, I have found myself creating web pages that are simpler than the ones I used to create. Before, I would try to jam as much functionality into a single page as I could to avoid having lots of pages.
I am starting to realize that this was just making things way more complex, convoluted, and confusing than it had to be. Why not have more pages? I think the reason that I was doing this was because I didn't want the user to have to browse to other pages; just to have all the functionality they needed on a single page.
Well, these good intentions turned into an overly confusing interface for the user and very unmanageable source code. I am a new developer and I am trying to be very reflective of what I am doing so that I can improve. If it makes a difference, I am developing in ASP.net (though these are probably considerations for any platform).
My questions are:
Am I overthinking these things?
Has anyone else found themselves doing this?
Where is the happy medium?
There is no expert who can give you a rule that works in all places at all times. I have been known in my industry for years for "easy" interfaces and we've won significant amounts of business for it (as well as 5 "Best in Class" awards). I have also had people within my company and outside of it tell me - for years - that they like my work but wish that I would "jazz it up" with more graphics and such. What always amazes me is how little connection people see between the two.
So...a few rules of thumb:
1. A page should do one main thing.
2. A page may well have multiple links related to the main thing.
3. Menuing and link layout should be consistent across pages.
4. Simpler is better than more complex.
5. Pages should be visually appealing and inviting.
6. Rule 4 is more important than rule 5.
For example, my product provides an interface that lets people define classes and events to be displayed in a calendar. I could have one page that lets you Review, Add, Update, Delete and Edit the classes. Indeed, in some simpler areas, I've used the gridview to let people manage everything in a grid. However, classes have too much information to do this and still follow the rules above.
So,
The main idea is: "Here is a list of classes for this location"
The links are "Add New" shown above and to the right of the grid, Change and Delete are links within each row. This is consistent across the app.
Menuing for the system as a whole is always across the right/top. Nothing else appears on the class/event page except for standard elements common to all pages (a logo, a header, a footer).
The grid is nicely styled but there are no spurious graphics (4,5,6)
A few last things about UIs and graphic design.
First, develop your own vision and be consistent across pages and apps.
Second, do not be afraid of simplicity.
Next, when soliciting advice from others keep in mind that you do not want their advice - you want their impressions: you want to understand the way they perceive the interface. Advice is sometimes good but, more often than not, actually harmful. In my experience, everyone thinks that they are a UI expert.
When you do your hallway (or formal) usability testing you should discount almost all advice to the effect that "you should make that stand out more." As you'll see, it will quickly become "and that," "and that," "and the other." If you follow this advice, you'll end up with a mess due to Brittingham's first rule of design: If everything is important then nothing is. (There you go: when explaining why you can't make something stand out more, just tell them that "it violates Brittingham's first rule of design!")
Hope this helps!
You hit the nail on the head. Use the KISS principle. (Keep It Simple Stupid)
I've done this in the past as well, and not only does it make for a hideous UI, it also makes it confusing which operations you can do on the page, because there is too much functionality. I've often found in testing that I did not have enough checks to see if the user could perform a certain operation based on the state of the data.
It's easy enough in ASP.Net to write several pages that do simple tasks and then link them together with Response.Redirect or Server.Transfer. Now all I try to achieve on any given page is what the design specs say. So if my page is just a search page, that's all I give. If the user wants to see the details of an item that was returned in the search, then I send them to an itemDetails.aspx page.
You've broken a wall that most software developers have, the one that was blocking your view on usability before. A lot of developers don't really think about it and try to make it easier for them by stuffing functionality in one window, web page or whatever.
The thing is, once you start designing software from the user's point of view, i.e. making it easier to use, several things start to become clear. One is the issue of code maintenance: code is much more manageable to work on if you don't stuff everything into one giant class or whatever travesty you've been building. The other is usability itself: you start to think about how the user is actually using your application through the graphical interface. Third is avoiding requirements or scope creep: you stop developing functionality that the user doesn't need.
We as users want simplicity partly because we don't want to spend most of our time muddling through a bad UI when we can get our work done faster with a simple and slick UI. That makes it for us software developers the right thing to do, to think through your design on all levels... that and specs always lie.
Definitely agree: most attempts at writing pages/forms that do too much have resulted in
bugs and rewrites. Problems occur with keeping all parts valid/synchronized,
excess managing of users' expectations ("I've entered a bill number here and clicked "find person" there but it gives an error message. Why?") when the two are logically separate. These questions cannot arise if only the valid options are visible,
Formatting/layout issues: In ASP.NET pages, trying to layout independent User Controls turns out to be a nightmare ("But we really want all the buttons vertically aligned!" in separate user controls. Good luck with that.)
I'd consider webpages with more than one functionality only if the target audience consists of domain experts, i.e. people that need lots of functionality on one page for better productivity (think data-entry or financial software with lots of variables).
Even then, most of the time, it's possible to separate pages into single units.
No
Yes - me
I found the happy medium was to use Masterpages, in a way similar to IFrames, so that I could have a lot of functionality combined well together. There is a more interesting way of doing this with WPF/Silverlight called Prism.
The amount of functionality on a page is usually not determined by you but by your customer. If the customer demands a single page to update some VeryComplexObject, you're likely to end up with an aspx page that has a significant number of lines. Main reason is that you simply have a lot of event handlers for all actions on the page.
Whether that page is complex is entirely up to you. You should always attempt to make your code-behind file as simple and clean as possible. Some suggestions in that direction:
Move all business code to another application layer.
Use ObjectDataSource for providing data to data-bound controls such as ListView, GridView, Repeater, ... Delegating loading of data to a dedicated object prevents a lot of overhead in your aspx.cs file.
Another suggestion is to use user controls to implement portions of your page. You would usually only do this when you can reuse the user control, but it can also be of great help reducing page complexity (both of your code-behind file as well as your aspx).
Sometimes I think we are all guilty of forgetting just who it is that we develop our applications for. It isn't always easy as a developer to take a step back and look at your application as a user might. This is why big companies employ hundreds of people to do this for them, and they don't always get it right.
Usability is a massive subject but it is definitely something that all developers need to keep in mind. It has taken me a long time to learn this but when tackling any development task I always try to think about how my users are going to interact with what I am writing. This will make a difference to all levels of your development.
I would suggest reading Don't Make Me Think by Steve Krug. This book won't take you an age to read and it puts across some fantastic ideas that can help you to develop applications that are much easier to use and understand.
I always find that once I have thought about the user experience the decisions about what my web pages are going to do and how they are going to interact are much easier to make.
Maybe you should ask the people who are using your site. Or better yet, just watch people use your site. I think that would tell you if your site is designed well, or if you need to change it.