Difference between web crawling and web scraping - web-scraping

I am unable to figure out the difference between web crawling and web scraping.
If I am scraping data from the FedEx website using every tracking number, is it web scraping or web crawling?
Please give a good, short example illustrating the difference.
Thank you.

Short answer: Web crawling just indexes information using bots, whereas web scraping, aka web data extraction, is an automated software technique for extracting information from the web.
Elaborated answer:
Web crawling, aka indexing, is used to index the information on pages using bots, also known as crawlers. Crawlers are used by major search engines such as Google, Bing, and Yahoo; in other words, Google and Bing run some of the biggest web crawlers.
Crawling gives you generic information, whereas scraping gives you specific information.
Web scraping, aka web data extraction, is an automated way of extracting information/content using bots, aka scrapers. The extracted information can be replicated on another website or used for data analysis.
[Information in this context means all varieties of content, including images, text, sensitive information such as contact details, prices, etc.]
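To make that concrete, here is a minimal C# sketch (the URL https://example.com and the crude regexes are placeholders purely for illustration). The "crawler" only discovers links it could go on to index, while the "scraper" pulls one specific value out of a known page; looping over FedEx tracking numbers and extracting the delivery status for each would be scraping.

using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class CrawlVsScrape
{
    static readonly HttpClient Http = new HttpClient();

    // Crawling: discover and enumerate pages by following links,
    // so their contents can be indexed later.
    static async Task CrawlAsync(string startUrl)
    {
        string html = await Http.GetStringAsync(startUrl);
        foreach (Match link in Regex.Matches(html, "href=\"(http[^\"]+)\""))
        {
            Console.WriteLine("Discovered: " + link.Groups[1].Value);
            // A real crawler would queue these URLs, fetch them, and index them too.
        }
    }

    // Scraping: pull one specific piece of data out of a known page
    // (the <title> element here stands in for, say, a tracking status).
    static async Task ScrapeAsync(string url)
    {
        string html = await Http.GetStringAsync(url);
        Match title = Regex.Match(html, "<title>(.*?)</title>", RegexOptions.Singleline);
        Console.WriteLine("Extracted: " + title.Groups[1].Value);
    }

    static async Task Main()
    {
        await CrawlAsync("https://example.com");
        await ScrapeAsync("https://example.com");
    }
}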

Related

Tracking browser metrics inside desktop application

Our desktop application includes a pane which pulls content from a web page upon loading (think: links to what's new, top support topics, etc.). We have the analytics.js on that page.
We're getting demographic information back, like country, which I assume is location-based. We also see language information. When using an embedded web page like this, where is Google Analytics getting the language information from? There isn't a way for customers to change language settings for the web part inside our application. We're trying to understand whether the language information we're seeing is accurate or not.
Thanks in advance!
The language report shows the default language set in the visitor's web browser, i.e. the browser from which they are accessing your application.
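As far as I know, analytics.js reads this from the embedded browser itself (navigator.language), which normally mirrors the operating system / browser language settings of the machine your application runs on. If you want a rough cross-check and the embedded page is served by your own ASP.NET site (an assumption on my part, as are the class and page names below), you could log the language the embedded browser advertises with each request:

using System;
using System.Web.UI;

// Hypothetical code-behind for the embedded page: Request.UserLanguages is
// populated from the Accept-Language header the embedded browser sends.
public partial class EmbeddedPane : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string[] languages = Request.UserLanguages ?? new string[0];
        System.Diagnostics.Trace.WriteLine(
            "Browser languages: " + string.Join(", ", languages));
    }
}

If what gets logged matches what the language report shows, the data is most likely accurate for your users' machines.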
Thanks.

How do I scale my solution to multiple domains on a single web application

I'm a little lost here. I'm starting on a project for a customer who wants a SaaS solution as a small portal.
The idea is that I make a web solution, e.g. an online business card, where each customer has their own domain, like this:
www.carpenter.com
www.painter.com
www.masonry.com
Etc. Each of these domains should point to my web application, and each should have its own administration web site and online business card. This means that if I go to www.carpenter.com I should see the company's online business card, and at the URL www.carpenter.com/admin the carpenter company should be able to log in and edit its information.
I hope this makes sense.
What I'm looking at is how this is done in practice. I would like to have a central database and a central place to update my software (maybe one per country). What do I need to do to point a www.carpenter.com domain/URL to its own specific area in my web app, and how do I need to structure my web application to do this?
I'm using ASP.NET MVC for this, but this should be a general question regardless of language, shouldn't it?
I'm considering using a cloud service such as Azure; is this possible with that setup, or do I need a virtually hosted server that I manage myself?
I guess the main question is "how do I host multiple domains on the same software" while keeping the display of the "business card" and the admin area separate for each customer?
Not sure if this specifically answers your question, and my experience thus far has not been with ASP, but I think the general idea is that you determine the execution environment for your web app early in the bootstrap process and then set constants and configuration options at that point. You can then use those values throughout your application to customise the response based on which site you're working with (i.e. carpenter, masonry, etc.). Since the only piece of differentiating information you have during the bootstrap process is the domain name and URL of the site being requested, the generally accepted method is to switch on the domain name.
So you store different configs for the different sites keyed by their domain names, and load the matching config during the bootstrap process. For example, if you had a different site template for your carpentry site and your masonry site, you could store the path to each template as one of the configuration options. HTH
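As a rough illustration of that approach in ASP.NET MVC (the TenantConfig type, the domain-to-config dictionary and the theme paths below are all made-up names for the sketch; in practice the mapping would live in your central database), you could resolve the tenant from the request's host name once per request and keep its configuration around for the rest of the pipeline:

using System;
using System.Collections.Generic;
using System.Web;

// Hypothetical per-customer configuration record.
public class TenantConfig
{
    public string Name { get; set; }
    public string ThemePath { get; set; }
    public string ConnectionStringName { get; set; }
}

public static class TenantResolver
{
    // In a real application this mapping would come from the central database.
    static readonly Dictionary<string, TenantConfig> Tenants =
        new Dictionary<string, TenantConfig>(StringComparer.OrdinalIgnoreCase)
        {
            { "www.carpenter.com", new TenantConfig { Name = "Carpenter", ThemePath = "~/Views/Themes/Carpenter" } },
            { "www.painter.com",   new TenantConfig { Name = "Painter",   ThemePath = "~/Views/Themes/Painter" } },
        };

    public static TenantConfig Resolve(HttpRequestBase request)
    {
        TenantConfig config;
        return Tenants.TryGetValue(request.Url.Host, out config)
            ? config
            : null; // unknown domain: show an error page or fall back to a default tenant
    }
}

// e.g. in Global.asax:
//   protected void Application_BeginRequest(object sender, EventArgs e)
//   {
//       HttpContext.Current.Items["Tenant"] =
//           TenantResolver.Resolve(new HttpRequestWrapper(HttpContext.Current.Request));
//   }

On the hosting side, both IIS and Azure's web hosting let you map multiple custom domains (www.carpenter.com, www.painter.com, ...) to the same deployment, so every request lands in this one application and the host-name switch does the rest; the /admin area can then be an ordinary controller or MVC area that reads the same resolved tenant.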

Is this considered RESTful?

I am writing a simple web page for our existing web site that will only be used by the web site admin to delete all images from a certain directory on the server. He would browse to this page from his web browser (not to be consumed by any external services as of right now). I was thinking of creating another ASPX page (obviously not linked from or to anywhere) that implemented this. Is this considered a RESTful API? If not, what would be, and would it be a more elegant solution than what I'm proposing?
I realize this is an extremely simplistic example, but I'm trying to understand what RESTful really means and if it would benefit our existing infrastructure in any meaningful way, so that's kind of the purpose of this question.
Our website is written entirely in ASP.NET 2.0 WebForms.
It depends on your URL structure. A classic RESTful call would address the resource itself, say:
/images
You would then send an HTTP DELETE to that URL (or tunnel DELETE through a POST, since plain HTML forms only support GET and POST) to do what you need. That's more the RESTful way: REST isn't so much about what the operation does internally as about exposing it as a resource plus a standard HTTP method. I hope that makes sense :).
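To sketch what that could look like on an ASP.NET 2.0 WebForms site (a hedged example only; the handler name, the ~/content/images path and the wiring via an .ashx file or handler mapping are assumptions, not your actual setup), you might expose the image folder as a resource through a simple HTTP handler:

using System;
using System.IO;
using System.Web;

// Hypothetical handler, e.g. mapped to /images via an .ashx file or a handler mapping.
public class ImagesHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        if (context.Request.HttpMethod == "DELETE")
        {
            // Assumed folder purely for illustration.
            string folder = context.Server.MapPath("~/content/images");
            foreach (string file in Directory.GetFiles(folder))
            {
                File.Delete(file);
            }
            context.Response.StatusCode = 204; // No Content: deletion succeeded
        }
        else
        {
            context.Response.StatusCode = 405; // Method Not Allowed
            context.Response.AppendHeader("Allow", "DELETE");
        }
    }

    public bool IsReusable
    {
        get { return true; }
    }
}

Since a plain link in the admin's browser can only issue a GET, the admin page would typically trigger the DELETE from a small piece of JavaScript (XMLHttpRequest) or fall back to a POST with an override parameter. Either way, the "delete all images" ASPX page you describe is perfectly workable; it just isn't REST in itself.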

ASP.NET: How to add an IIS Indexing Service search feature to my website

I want to implement a search feature for my web application, which has 50+ static files that are rich in meta tag content. I want to add one ASP.NET page to the application that would show the search results when someone performs a search. Can anyone guide me on how to go ahead?
Take a look at an article - https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-5032858.html
SUMMARY: How the Indexing Service provides a powerful search feature for LANs and Internet users.
Indexing with Internet Information Services
SUMMARY: Indexing Service provides search capabilities to Web sites hosted with Internet Information Services (IIS). With Indexing Service, you can provide searchable access to both intranet and Internet Web sites. You can index remote computers and multiple Web servers.
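If you go the Indexing Service route, the classic pattern (sketched below with an assumed catalog name of "Web", the default IIS catalog; adjust the catalog, columns and search term handling for your server) is to query the catalog through the MSIDXS OLE DB provider from your search results page:

using System;
using System.Data.OleDb;

public static class IndexingServiceSearch
{
    public static void Run(string searchTerm)
    {
        // "Web" is the default IIS catalog name; replace it with your own catalog.
        const string connectionString = "Provider=MSIDXS;Data Source=Web;";

        // DocTitle, VPath, Characterization and Rank are standard Indexing Service columns.
        const string query =
            "SELECT DocTitle, VPath, Characterization, Rank " +
            "FROM SCOPE() " +
            "WHERE FREETEXT(Contents, '{0}') " +
            "ORDER BY Rank DESC";

        // Escape single quotes in user input before splicing it into the query text.
        string safeTerm = searchTerm.Replace("'", "''");

        using (OleDbConnection connection = new OleDbConnection(connectionString))
        using (OleDbCommand command = new OleDbCommand(string.Format(query, safeTerm), connection))
        {
            connection.Open();
            using (OleDbDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // VPath is the virtual path you can link to from your results page.
                    Console.WriteLine("{0} ({1})", reader["DocTitle"], reader["VPath"]);
                }
            }
        }
    }
}

On your ASP.NET results page you would bind the reader (or a DataTable filled from it) to a GridView or Repeater; the meta-tag-rich static files you describe should be picked up by the HTML filter the service uses.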
Check out Lucene.net. Plenty of examples are out there.
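If you prefer Lucene.net, the outline is the same everywhere: walk your 50+ static files once to build an index, then query that index from your search page. A minimal sketch against the Lucene.Net 3.0 API (the field names, file patterns and the index folder are assumptions for illustration):

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class StaticFileSearch
{
    public static void BuildIndex(string contentFolder, string indexFolder)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexFolder)))
        using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string file in Directory.GetFiles(contentFolder, "*.htm*"))
            {
                var doc = new Document();
                // Store the file name so results can link back to the page.
                doc.Add(new Field("path", Path.GetFileName(file), Field.Store.YES, Field.Index.NOT_ANALYZED));
                // Index the raw file contents (meta tags included) for full-text search.
                doc.Add(new Field("body", File.ReadAllText(file), Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }
    }

    public static void Search(string indexFolder, string term)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexFolder)))
        using (var searcher = new IndexSearcher(directory, true))
        {
            var parser = new QueryParser(Version.LUCENE_30, "body", analyzer);
            var hits = searcher.Search(parser.Parse(term), 10);
            foreach (var hit in hits.ScoreDocs)
            {
                Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
            }
        }
    }
}

The index build could run from a small admin page or a scheduled task whenever the static files change, and the Search method's results would feed your one ASP.NET results page.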

Coverage report for ASP.NET web site pages

I have taken on an ASP.NET web site where the client is using the web server as a code repository, i.e. removing a page from the site just means not linking to it any more. There are a stupendous number of unused files, and I would like to archive these off and arrive at a lean git repository containing only the files used by the active site.
How can I get usage or coverage data that will tell me, over an agreed upon period, i.e. a month, which pages are being hit? I know there are many ways of doing this in ASP.NET, and even in plain IIS, but I'd like some suggestions on a convenient and simple way of doing this.
I would suggest the IIS logs, but that wouldn't report linked pages that haven't been accessed by users.
You could try running a spider on the site. Here's a free tool. http://www.trellian.com/sitespider/download.htm
You should be careful about which files you delete from the web server if there are cached links to the pages out there. A good strategy would be to use Google: run the search query site:example.com (where example.com is the domain of your site) to see what pages are returned.
Look at the access logs for the agreed period and compare the list of pages visited against the full list of all pages. This seems like more work than necessary, though.
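If you do go down that road, the comparison itself is only a few lines of C#. A rough sketch, assuming default W3C-format IIS logs and the folder paths shown (both are assumptions to adjust for your server):

using System;
using System.Collections.Generic;
using System.IO;

class UnusedPageFinder
{
    static void Main()
    {
        string logFolder = @"C:\inetpub\logs\LogFiles\W3SVC1";
        string siteRoot = @"C:\inetpub\wwwroot\mysite";

        // 1. Collect every URL stem that appears in the logs for the period.
        var hitPages = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string logFile in Directory.GetFiles(logFolder, "*.log"))
        {
            foreach (string line in File.ReadLines(logFile))
            {
                if (line.StartsWith("#")) continue;   // skip header lines
                string[] fields = line.Split(' ');
                // With the default W3C field set, cs-uri-stem is the 5th column;
                // check the "#Fields:" header in your own logs to be sure.
                if (fields.Length > 4) hitPages.Add(fields[4]);
            }
        }

        // 2. Compare against the .aspx files actually deployed.
        foreach (string file in Directory.GetFiles(siteRoot, "*.aspx", SearchOption.AllDirectories))
        {
            string relative = "/" + file.Substring(siteRoot.Length).TrimStart('\\').Replace('\\', '/');
            if (!hitPages.Contains(relative))
                Console.WriteLine("Never requested: " + relative);
        }
    }
}

Anything this prints was never requested during the period and is a candidate for archiving, subject to the earlier caveat that pages which are linked but rarely visited will also show up.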
There is a program called Xenu link checker that already contains the functionality you require. It can spider your site and, if you tell it where the files are, it will identify unused files for you.
