Scraping user behavior on a specific web page - web-scraping

Let's say I want to go to a specific web page and track its users' activity (for example, their location, how many times they logged in, the links they clicked, etc.). It would be easy to implement this if it were my own website; however, I want to do it for any website.
Is it technologically doable? Do you have any idea how I can start implementing this?

If the website exposes user data publicly (with or without authentication), you can scrape it with ordinary web scraping.
The data you mentioned, however, are website statistics that can be tracked only by the website itself or its web server. Unless you have access to the server logs, you can't get them.
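For the part that is possible, here is a minimal scraping sketch in C#; the URL is a placeholder, and you should check the site's robots.txt and terms of service first.

```csharp
// Minimal scraping sketch: download a public page and list the links it contains.
// The URL is a placeholder; a real scraper should respect robots.txt and rate limits.
using System;
using System.Net;
using System.Text.RegularExpressions;

class Scraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString("https://example.com/public-profile");

            // Naive link extraction; a real scraper would use a proper HTML parser.
            foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\""))
            {
                Console.WriteLine(m.Groups[1].Value);
            }
        }
    }
}
```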

Related

How to protect website data from web scraping?

I have a website like IMDB. All the data are publicly available. How can I protect the data from web scrapers?
There is only one foolproof method against scrapers, and that is a captcha. But because it hurts the user experience, most websites avoid it.
Another option is loading data via AJAX. This defeats scrapers that are not built to render JavaScript, though one can still be built using Selenium WebDriver. AJAX-loaded content is also bad for SEO, in case you care about Google rankings.
A more efficient approach is to track user behaviour, save that information in cookies, and serve a captcha when something looks suspicious, much as Google's captcha works on several sites.
Check this link: https://blog.hartleybrody.com/prevent-scrapers/
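As a rough illustration of the behaviour-tracking idea above, the sketch below counts requests per visitor cookie and flags the visitor for a captcha once an assumed threshold is crossed; the cookie name, threshold, and in-memory store are placeholder assumptions, not a production design.

```csharp
// Sketch: per-visitor request counting keyed on a cookie, with a captcha flag
// once a threshold is crossed. Cookie name, limit and the in-memory store are
// assumptions; a real implementation would reset counts on a sliding window.
using System;
using System.Collections.Concurrent;
using System.Web;

public static class ScraperDetector
{
    private const int RequestLimit = 120;  // assumed threshold before serving a captcha
    private static readonly ConcurrentDictionary<string, int> Counts =
        new ConcurrentDictionary<string, int>();

    // Returns true when the current visitor should be served a captcha.
    public static bool LooksSuspicious(HttpContext context)
    {
        var cookie = context.Request.Cookies["visitor_id"];
        if (cookie == null)
        {
            // First sight of this visitor: hand out an ID and let them through.
            context.Response.Cookies.Add(
                new HttpCookie("visitor_id", Guid.NewGuid().ToString()));
            return false;
        }

        int count = Counts.AddOrUpdate(cookie.Value, 1, (_, c) => c + 1);
        return count > RequestLimit;
    }
}
```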

Google Analytic Tracking for an entire server?

We currently use Google Analytics for our site, ibiblio.org. Setting up the tracking on our landing pages is no big deal. But we have tons of "collections" on our server: as a public service, we allow a large number of users to host their own website installs (usually WordPress sites or wikis). Here is an example of a collection.
These installs live on one server, and each install is a subfolder. Is there a way to track activity in these subfolders (or installs) without injecting tracking code into each of their HTML files? We have a lot of contributors, so injecting code could get messy.
Thanks so much for your time and help.
There are ways, depending on the server's programming language, to send Google Analytics requests on every page load (i.e. whenever a page request is handled), as in the sketch below. Tracking an entire site this way is not possible if the HTML files are served statically, because GA would never learn that a page load happened.
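For example, a rough server-side sketch using the (now legacy) Universal Analytics Measurement Protocol might look like the following; the property ID is a placeholder, and the endpoint and parameters should be verified against Google's documentation.

```csharp
// Sketch: send a server-side GA "pageview" hit. Call this from your request
// pipeline (e.g. an HTTP module) for every page request you want counted.
// "UA-XXXXXXX-1" is a placeholder property ID.
using System;
using System.Collections.Specialized;
using System.Net;

public static class ServerSideAnalytics
{
    public static void TrackPageView(string host, string path, string clientId)
    {
        var payload = new NameValueCollection
        {
            { "v", "1" },                // protocol version
            { "tid", "UA-XXXXXXX-1" },   // tracking/property ID (placeholder)
            { "cid", clientId },         // anonymous client ID, e.g. a GUID stored in a cookie
            { "t", "pageview" },         // hit type
            { "dh", host },              // document host
            { "dp", path }               // document path
        };

        using (var client = new WebClient())
        {
            // POST the hit; in production this should be asynchronous and fault-tolerant.
            client.UploadValues("https://www.google-analytics.com/collect", payload);
        }
    }
}
```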

Easy way to track website user activity

We want to track all visitor activity for logged-in users so we can get better insight into visitor behavior and gather more data about each individual user.
What options are available that do not require us to add code to every single page of the website? Are there existing libraries for this?
Here are a couple of different options you can try:
If your website uses one or more master pages, you can add the tracking code only to those master pages.
Create a custom HTTP module that you later register with IIS; check this article for more details. A bare-bones sketch of such a module follows.
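As a hedged illustration of the second option, here is a minimal IHttpModule that logs each request made by an authenticated user; the UserActivityLog store is hypothetical and would need to be replaced with your own persistence code.

```csharp
// Sketch of a custom IHttpModule that records every request from authenticated
// users. Register it in web.config (or the IIS modules section) to activate it.
// UserActivityLog is a hypothetical class; swap in your own logging/storage.
using System;
using System.Web;

public class UserActivityModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        // Runs after authentication, so the user's identity is available.
        application.PostAuthenticateRequest += OnPostAuthenticateRequest;
    }

    private void OnPostAuthenticateRequest(object sender, EventArgs e)
    {
        var context = ((HttpApplication)sender).Context;
        if (context.User != null && context.User.Identity.IsAuthenticated)
        {
            // Hypothetical call: persist who requested which URL and when.
            UserActivityLog.Record(
                context.User.Identity.Name,
                context.Request.RawUrl,
                DateTime.UtcNow);
        }
    }

    public void Dispose() { }
}
```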

How to know the time spent on another site

I have a .NET web site (e.g. www.mysite.com) and I have some external links on my site that redirect to other sites (like www.google.com). But how can I know how much time a user spends on that second site (i.e. www.google.com)?
Assuming you have no control over the external website, you cannot.
You cannot inject your own client-side scripting into an external website (well, sometimes you can, but that is XSS and it's bad).
Google Analytics can track user activity on a given website only because the website owner included its JavaScript code.
It may be possible for Google to track a user's activity across multiple websites, but only if those websites display Google Ads. Even that is not bullet-proof, since users can block ads.

ASP.Net authentication and Googlebot

I have an ASP.Net 3.5 web site with forms authentication enabled. Is it possible to have Googlebot crawl my web site without getting prompted for a username/password?
Google says it does not want to index pages and present them to users as available when they actually are not, because those pages really do require a username and password.
The only option it offers is letting the AdSense crawler access the protected pages, so that it knows which ads to show on them:
https://www.google.com/adsense/support/bin/answer.py?answer=37081
Other solutions that try to detect whether the visitor is a bot, or is coming from Googlebot's machines, are not safe because users can easily spoof them, and they may also fail to show a preview or a cached copy of the page.
So you need to think about your site's structure, decide what is truly important and what is not, and show part of each page while hiding the rest when the user is not registered; that way Google has something to index even though it is not logged in. A minimal sketch of that approach follows below.
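For illustration, in an ASP.NET code-behind this "teaser" approach could look roughly like the following; TeaserPanel and FullContentPanel are hypothetical controls that would normally be declared in the .aspx markup.

```csharp
// Sketch: show a public teaser to anonymous visitors (including crawlers)
// and the full content only to authenticated users.
using System;
using System.Web.UI;
using System.Web.UI.WebControls;

public partial class ArticlePage : Page
{
    // Normally declared in the .aspx/designer file; shown here for completeness.
    protected Panel TeaserPanel;
    protected Panel FullContentPanel;

    protected void Page_Load(object sender, EventArgs e)
    {
        bool loggedIn = Request.IsAuthenticated;

        // Anonymous users (and Googlebot) get the indexable summary;
        // logged-in users get the protected full content.
        TeaserPanel.Visible = !loggedIn;
        FullContentPanel.Visible = loggedIn;
    }
}
```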
Here is an article:
http://www.guruofsearch.com/google-access-password-protected-site
It would be interesting to see whether a Google Sitemap would result in the pages showing up in Google, but I doubt that would work either, as the pages would likely still need to be crawled.
And some other interesting comments here:
http://forums.searchenginewatch.com/showthread.php?t=8221
