There are a number of different websites that let you monitor specific web pages for any changes, such as watchthatpage.com or page2rss.com.
I'm interested in how those sites work - that is, how they determine whether a web page has been updated. Do they just copy all the text from the page, store it, and later compare it to the page's current content?
Or do they look for specific HTML elements and compare their values?
Please help me to find the answer.
How it works: http://www.watchthatpage.com/information.jsp
I suspect that they store the entire contents, and every time they check, they compare. If different, send alert, otherwise don't.
There are two ways this can be done, just off the top of my head.
The first is to pull the HTML and do a simple string.compare.
The second way would be to do a HEAD request (see section 9.4 of the HTTP/1.1 spec, RFC 2616).
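A rough sketch of both ideas in C#, purely for illustration (the class and method names are made up): the first method downloads the page and hashes it so the caller can compare against the hash stored from the previous check; the second sends a HEAD request and reads the Last-Modified header, which avoids downloading the body but only works when the server provides that header.

    using System;
    using System.Net.Http;
    using System.Security.Cryptography;
    using System.Threading.Tasks;

    class PageMonitor
    {
        static readonly HttpClient Client = new HttpClient();

        // Approach 1: download the body and hash it; the caller stores the hash
        // and compares it with the one from the previous check.
        static async Task<string> GetContentHashAsync(string url)
        {
            byte[] body = await Client.GetByteArrayAsync(url);
            using (var sha = SHA256.Create())
                return Convert.ToBase64String(sha.ComputeHash(body));
        }

        // Approach 2: a HEAD request returns only headers, so comparing
        // Last-Modified is cheap - but not every server sends it.
        static async Task<DateTimeOffset?> GetLastModifiedAsync(string url)
        {
            var response = await Client.SendAsync(new HttpRequestMessage(HttpMethod.Head, url));
            return response.Content.Headers.LastModified;
        }
    }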
Before making a CSS change that might possibly have unintended consequences, what's a good way to find where else on the whole site (not just this page) that id or class is used? (It doesn't have to be exhaustive, and semi-manual processes are ok, too.)
For a bit of context, it's a Joomla-based site with a lot of content, and I'm not yet familiar with most of it. The id in question has a two letter name, and I have no idea where else it might be used. I don't have direct access to the server for any grep-like approaches.
The only technique I can think of is using Stylish to make an obvious change to that one selector, and browsing the site for a bit to see where it pops up.
The easiest way would be a local grep, but since you don't have access to the server, try downloading it locally using wget:
wget -r -l inf --domains=yourdomain.com http://yourdomain.com
That'll recursively retrieve pages from your domain to an infinite depth, but only following links to pages within your domain.
Once it's on disk, do a local grep and you're golden.
I use unused-css.com for this sort of thing. You simply put in your webpage, and it will look through the whole site (including pages behind a login) and give you the CSS you are actually using.
I've found it to be about 95% accurate - the only things it doesn't pick up are some CSS browser hacks and some error cases (i.e. CSS that only displays after an error), so it should work fine for this.
You could also check the original template (assuming the template is a commercial one) to see where the id perhaps should be (they usually lay everything out in their demo template), but unused-css won't tell you exactly where it is used, only if it is or not. For that, I'd start with a view-source -> find on the major pages, and then try other mentioned solutions.
Get the whole site's source tree into an IDE like NetBeans or Eclipse and then do a recursive search for id="theid" on the root folder.
If this is not possible, how are you updating the CSS?
Assuming you don't want to do the grep approach:
Is the ID in question appearing in the actual content area of the page, or in the 'surrounding' areas? If it seems like it's not part of the content, but rather appears in a template, you could search the template files for it. As you're updating the CSS, I'm going to assume you can at least get hold of the template files. Many text editors/IDEs will let you do a 'global search'. I'd load the template files in TextMate (my text editor of choice) and do a "search in project" for the particular ID.
That will at least give you a semblance of an idea of where in the site that ID shows up. No, it won't be every 'page', but you'll know what kind of page it appears on (which, with a CMS, is really what you're after).
If the ID in question appears in the content - that is, it was hand-entered by content creators - you'll have to go another route. Do you have access to the database? If you can get a dump of the database (I think Joomla! is MySQL-based), you can open the SQL in something like Sequel Pro and search the content records for that ID.
This is not actually as hard as it sounds. The first place to look is the index.php file for the template. This file should be pretty small, without a ton of code, unless the template is from a developer that uses a template framework. If the ID is in there, then it will show up on every page in the website, since this is the foundation that every page is built on.
If you don't find it in there, then you need to determine whether it is displaying in a module position or in the component area. You should be able to tell the difference by looking at the index.php file from the template.
If it's in a module position, then the ID should only show up in instances of that particular module.
If it's in the component area, then it should only display in pages being created by the component. That does leave the possibility of it affecting many elements you don't want changed, but there is a solution for that: you can use the page class suffix in a menu item to add a unique id/class to the page you want to change (depends on your template). With that unique suffix you can create a specific selector that will only affect the pages you want to change.
Currently reading Bloch's Effective Java (2nd Edition) and he makes a point to state, in bold, that overusing POSTs in web applications is inherently bad. Unfortunately, he doesn't specify why.
This startled me, because when I do any web development, all I ever use are POSTs! I have always steered clear of GETs for security reasons and because it felt more professional (long, unsightly URLs always bother me for some reason).
Are there performance differences between GET and POST? Can anyone elaborate on whether overusing POSTs is bad, and why? My understanding - and preliminary searches - seems to indicate that these two are handled very similarly by the web server. Thanks in advance!
You should use HTTP as it's supposed to be used.
GET should be used for idempotent read requests (e.g. view an item, search for a product, etc.).
POST should be used for create, update, or delete requests (e.g. delete an item, update a profile, etc.).
GET lets you refresh the page, bookmark it, and send the URL to someone; POST doesn't. A useful pattern is post/redirect/get (AKA redirect after post).
Note that, except for long search forms, GET URLs should be short. They should usually look like http://www.foo.com/app/product/view?productId=1245, or even http://www.foo.com/app/product/view/1245
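As a rough illustration of that post/redirect/get pattern in ASP.NET MVC (the controller, ProductModel, and ProductStore names below are invented for the example, not a real API):

    using System.Web.Mvc;

    public class ProductController : Controller
    {
        // GET: an idempotent read - safe to refresh, bookmark, and share.
        [HttpGet]
        public ActionResult Details(int productId)
        {
            ProductModel product = ProductStore.Find(productId);  // hypothetical data access
            return View(product);
        }

        // POST: performs the update, then redirects so that a browser refresh
        // re-issues a harmless GET instead of resubmitting the form.
        [HttpPost]
        public ActionResult Edit(int productId, ProductModel model)
        {
            ProductStore.Update(productId, model);                 // hypothetical data access
            return RedirectToAction("Details", new { productId });
        }
    }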
You should almost always use GET when requesting content. Only use POST when you are either:
Transmitting sensitive information which should not appear in the URL bar, or
Changing the state on the server (adding/changing/deleting stuff, although recently some web applications use POST to change, PUT to add, and DELETE to delete).
Here's the difference: If you want to give the link to the page to a friend, or save it somewhere, or even only add it to your bookmarks, you need the full URL of the page. Just like your address bar should say http://stackoverflow.com/questions/7810876/abusing-http-post at the moment. You can Ctrl-C that. You can save that. Enter that link again, you're back at this page.
Now when you use any method other than GET, there is simply no URL to copy. It's as if your browser said you were at http://stackoverflow.com/question. You can't copy that. You can't bookmark that. Besides, if you try to reload such a page, your browser will ask whether you want to send the data again, which is rather confusing for the non-tech-savvy users of your page, and annoying for everyone else.
However, you should use POST/PUT when transferring data. URLs can only be so long - you can't transmit an entire blog post in a URL. Also, if you submitted such data via GET and reloaded the page, you'd almost certainly double-post, because the message described above does not appear for GET.
GET and POST are very different. Choose the right one for the job.
Since you mention using POST for security reasons, I'll drop a mention of other security factors here: you need to ensure the data from a form submit is sent over an encrypted connection (HTTPS), even if you are using POST.
As for the difference between GET and POST, it is as simple as this: GET is used to get data - you request data from a page, act upon it, and that is the end of it.
POST, on the other hand, is used to post data to the application - I am talking about transactions here (complete create, update, or delete operations).
If you have a sensitive application that takes, say, an ID to delete a user, you would not want to use GET for it, because a witty user could cause mayhem simply by changing the ID at the end of the URL and deleting random users.
POST allows more data and can also be used to send file streams; GET has a limited size.
Performance-wise, there is hardly any tradeoff between using GET or POST.
What is the most standard or best way to persist data between requests?
Should I use cookies or session variables? I'm interested in keeping data like sort order, sort column, and page number (for pagination).
I'm coming from a webforms background so normally this type of thing was automatically handled for me in the viewstate of the controls I was using.
Update:
I like the querystring idea for searching and for more meaningful URLs; however, I'm working on an "index/list" view, which consists of a view with a header, some "control" options (like DDLs for filtering), and a partial view that renders the table of data.
The DDLs use $.load() to call an ActionResult on the controller, which returns the partial view; the parameters are passed in the query string of that request, but since these are AJAX requests the main page URL in the user's browser does not get updated.
Is there a best-practice for taking querystrings off the main-page URL and using them in ajax requests to other ActionResults?
If you want it to survive only through one request/redirect, TempData is your friend.
However, for things like your pagination, the URL is the best method, if only for the ability to share links.
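For instance, a minimal sketch of TempData carrying a value across exactly one redirect - both actions live inside a controller, and the action and OrderModel names are invented:

    [HttpPost]
    public ActionResult Save(OrderModel model)     // OrderModel is hypothetical
    {
        // ... persist the order ...
        TempData["Message"] = "Order saved.";      // survives the redirect to the next request
        return RedirectToAction("Index");
    }

    public ActionResult Index()
    {
        ViewData["Message"] = TempData["Message"]; // available here once, after the redirect
        return View();
    }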
A standard way is to pass those sort of things via URL Query Parameters. You can modify your routing to expect certain URL variables. That way the pages become more search engine friendly as well.
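For example (a sketch only - the action, parameter, and OrderStore names are invented), the default model binder will map query-string values like ?sortColumn=Date&sortOrder=desc&page=3 onto action parameters:

    // e.g. /Orders/List?sortColumn=Date&sortOrder=desc&page=3
    public ActionResult List(string sortColumn, string sortOrder, int? page)
    {
        var orders = OrderStore.GetPage(
            sortColumn ?? "Date",
            sortOrder ?? "asc",
            page ?? 1);                // OrderStore is a hypothetical data-access class
        return View(orders);
    }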
It depends on how permanent you want the information to be:
Things like the page number should indeed be in the URL (as others have pointed out) - this helps with bookmarking, etc, but remember that if you add more content to the list, then that bookmarked result set will not always be what the user wanted...
If you're happy for these values to be lost when a session times out (by default around 20 minutes), then put them in Session.
If you think that sessions are going to time out before the next request, or you want to save the values across visits, then you should be storing them in either cookies or a profile (potentially allowing "Anonymous" profiles, which work with the user's cookies, so they would lose them across machines).
Personally, I'd think very carefully about putting sort order and columns in the URL - if you do, you could actually end up really confusing search engines:
Lots of pages with very similar content (page 1, sorted by date desc, page 1 sorted by date asc, etc) - search engines don't like duplicate content, and nor should you as Google (for instance) will only show two pages from your site in a default result set, you want them to be valid, not duplicates.
Search engines will spend lots more time crawling your site, and potentially give up - If on every page they find links to "Sort by this column", they will attempt to follow them, resulting in more work on the server, higher bandwidth use, etc.
These can be mitigated through the use of a Robots.txt file denying access to sorted versions of the page, but if this is generated almost dynamically that will be very complex to maintain going forward.
In response to your update, a nice way to achieve that for pages would be to have links to "Previous" and "Next" pages of results (or better yet, a list of all pages in the list), output on the page, with the page numbers, that you then hide with JavaScript.
This way users should see your nice, AJAXy behaviour, and search engines (and users without JavaScript - mobile, or those using older screen readers for instance) will still be able to get access to all your pages - this will help your pages to degrade gracefully, or use "Progressive Enhancement".
Things that were previously in viewstate should probably be put back in the client's hands via either hidden fields or cookies.
Session is "too" easy. In a dev environment it works great, pretty much no matter what you put in it. In production, scalability and persistence become a problem. In-process session is likely to disappear unexpectedly if you have a crashing bug in your site, and it requires server affinity when load balancing. Out-of-process session fixes the durability and affinity issues, but it can still be a performance bottleneck if too much stuff is put in session. A VERY common problem is that each page will put 1 or 2 items into session but never take them out again when it is done. And even if a page removes its session data when it is no longer needed, the data can still get orphaned if a user starts a process and never completes it.
Cookies are a fast and simple way to persist data between requests, and you can also make them live only for a limited time, depending on your needs.
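A rough sketch of that in ASP.NET, written inside a controller action or page code-behind, with an expiry so the preference is only kept for a limited time (the cookie name and default value are arbitrary):

    // Write: remember the user's sort preference for 30 days.
    var cookie = new HttpCookie("sortOrder", "desc") { Expires = DateTime.Now.AddDays(30) };
    Response.Cookies.Add(cookie);

    // Read it back on a later request, falling back to a default if it is absent.
    HttpCookie saved = Request.Cookies["sortOrder"];
    string sortOrder = saved != null ? saved.Value : "asc";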
Session is easiest.
In my ASP.NET 2005 app, I would like to conceal the app structure from the user. Currently, the end user can learn intimate details of my web app as they navigate and watch the URL change. I don't want the end user to know about my application structure. I would like the browser URL to not change, if possible. Please advise.
thanks
E.A.
URL rewriting is the only one that can provide any kind of real concealment.
Just moving the requests to AJAX or to frames means anyone (well, more advanced users) can still see those requests being fired, just not in the address bar.
The simplest solution is to use frames - a single frame that holds your application and is 100% * 100%. The URL will not change, though the underlying URL can still be seen via "View Frame Info"; however, only advanced users will even figure that out.
In your pages, make sure that they are contained inside the holding frame.
A couple of possibilities.
1) Use AJAX to power everything. This will mean that the user never leaves the home page.
2) Use postbacks to power everything. In this approach, you'd have all those pages be user controls which you programmatically hide or show.
3) URL rewriting (especially if this is ASP.NET 3.0 or later).
My site uses url parameters to dynamically load ascx files into a single main aspx. So if I get 'page_id=123' on the query string, I load the corresponding ascx. The url changes, but only the query string - the domain part remains the same.
If you want the url to remain precisely the same at all times, then frames (per Oded) or ajax (per Stephen) are probably the only ways to do it.
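Purely as an outline of the query-string-loads-an-ascx idea from the previous answer (the control paths, lookup table, and PlaceHolder name are all invented; the whitelist is there so the query string is never used to build a file path directly):

    using System;
    using System.Collections.Generic;
    using System.Web.UI;

    public partial class Main : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            string pageId = Request.QueryString["page_id"] ?? "home";

            // Only load controls we explicitly know about.
            var allowed = new Dictionary<string, string>
            {
                { "home", "~/Controls/Home.ascx" },
                { "123",  "~/Controls/Orders.ascx" }
            };

            string path;
            if (allowed.TryGetValue(pageId, out path))
                ContentHolder.Controls.Add(LoadControl(path)); // ContentHolder is an assumed asp:PlaceHolder on the page
        }
    }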
Short answer: use URL encryption
A simple & straight article: http://devcity.net/PrintArticle.aspx?ArticleID=47
and another article: https://web.archive.org/web/20210610035204/http://aspnet.4guysfromrolla.com/articles/083105-1.aspx
HTH
I have this GUI that shows, let's say Customer Orders. When my client nailed down the requirements, he asked me to keep pagination like this,
Show Items Per Page : 10/50/150
For each customer there could be thousands of orders, and each order has at least 50 attributes to show on the screen. So, assume a 50-column HTML table with 2000 or 3000 records associated with it, spanning multiple database tables (anyway, that is a different story).
Things were a breeze until yesterday; now my client has come up with a new change request, in which he specified the Show Items option like this:
Show Items Per Page : 10/50/150/All
Yes, he wants to see 2000 or 3000 records just by selecting the "All" option. Internally, this is not a big change - I would go back and remove the filters I apply on row count, etc. - but when it is loaded in the GUI it really sucks... the view state is huge, and so on.
I know this is a standard problem. How do you guys deal with it? I cannot convince my client to remove this "All" option; he is stuck on it. (The reason is simple: he has a big 42" screen where he can easily see 1000 items on one page.)
I also tried to use JavaScript to prepare the DOM in an AJAX call... but still, inserting 2000 TDs is really slow.
Any help is greatly appreciated.
Some Extra Info
This application is an intranet application, or else accessed through a VPN connection.
This problem is about browser performance.
I suppose you can do two things.
1) You can use <div> instead of <table> (this is possible with CSS), because browsers do not render a table until its closing tag. The page will still take a long time to load, but it will render the first results faster.
2) If you use AJAX+JSON and build every <tr> piece by piece, you can build the whole thing off-DOM and only then put it in the DOM. That will be faster, because the browser will not re-render every time you add another row.
If you want, you can load the data in sort of installments. It's sort of like how pagination works, but it is not quite pagination, to be precise. You can label your installments/pages with a proper ID. Load the pages one after another via AJAX calls. You can even show a progress bar to indicate how much data has actually loaded. Append this data to the table you are displaying the data in. I would not go about using server controls for this... you have to handle this via JavaScript or jQuery.
You might want to append table rows incrementally.
When the client scrolls close to the page bottom, fire an AJAX call, return the next page, and render it.
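On the server side, that could be as simple as a handler that returns one chunk of rows per call, which the client-side script appends to the table as the user scrolls. The sketch below is only illustrative - OrderStore, GetOrders(), and the Number property are invented, and in ASP.NET MVC a partial-view action would play the same role:

    using System.Linq;
    using System.Web;

    // OrderRows.ashx - returns one page of <tr> markup per request.
    public class OrderRows : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            int page = int.Parse(context.Request.QueryString["page"] ?? "1");
            const int pageSize = 150;

            var rows = OrderStore.GetOrders()                    // hypothetical data access
                                 .Skip((page - 1) * pageSize)
                                 .Take(pageSize);

            context.Response.ContentType = "text/html";
            foreach (var order in rows)
                context.Response.Write("<tr><td>" + HttpUtility.HtmlEncode(order.Number) + "</td></tr>");
        }

        public bool IsReusable { get { return true; } }
    }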
But the best solution would be to convince your client - this is not how web applications work. We had a similar situation - a pure nightmare.
Instead of an ASP.NET GridView, you'd be better off using a DataRepeater.
Better yet, if you are not constrained by technology, you can use Microsoft Ajax Preview 4 with WCF REST Services. You would just need to find some hacks to "stream" data from the service and display it.
Also there is JQuery Grid (if you don't want to use Microsoft Ajax Preview 4) that supports JSON serialization.