How to get a site's entire RSS history as XML? - rss

Given a website/blog's RSS feed link, is there any way to get that site's entire RSS history (all its blog posts EVER) in a single XML file?
Is this something that is only possible from the other end (i.e. a site publishes its entire post history as RSS)? In which case, how is this achieved?
Thanks!
S

RSS is just another way of expressing the site's data, so this depends entirely on the site. Even if a particular site provides a way for you to specify how many items you want (which is unlikely), that won't work on other sites.
Technically speaking, formatting the data in RSS is no different than formatting it in HTML. For example, many sites (including this one) need to represent some sequential data (questions, in SO's case) on a page in HTML. To do this, the site will iterate through some data source (like a database) and output HTML so your web browser can render it, until it hits some limit. Knowing that limit is impossible, as it depends on the site. This is exactly what RSS does: it iterates through a data source, spitting out XML as it goes along. Again, knowing the limit is not possible.
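To make that concrete, here is a rough sketch of what a feed generator typically looks like on the server side (the table, columns and query are invented for illustration). The point is that the limit lives entirely in the site's own code, so there is nothing the consumer can do about it:

    // Hypothetical feed generator: walks a data source and emits one <item> per row,
    // stopping at whatever limit the site itself has chosen.
    using System;
    using System.Data.SqlClient;
    using System.Xml;

    public static class FeedWriter
    {
        public static void WriteFeed(XmlWriter writer, string connectionString, int limit)
        {
            writer.WriteStartElement("rss");
            writer.WriteAttributeString("version", "2.0");
            writer.WriteStartElement("channel");

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "SELECT TOP (@limit) Title, Url, PublishedOn FROM Posts ORDER BY PublishedOn DESC",
                connection))
            {
                command.Parameters.AddWithValue("@limit", limit);
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())   // the "iterate until the limit" step
                    {
                        writer.WriteStartElement("item");
                        writer.WriteElementString("title", reader.GetString(0));
                        writer.WriteElementString("link", reader.GetString(1));
                        writer.WriteElementString("pubDate",
                            reader.GetDateTime(2).ToString("r"));
                        writer.WriteEndElement();
                    }
                }
            }

            writer.WriteEndElement(); // channel
            writer.WriteEndElement(); // rss
        }
    }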
Is this something that is only possible from the other end ...? In which case, how is this achieved?
If you can change how your site generates the RSS, simply remove the limit. I know this is vague, but it really depends on the implementation. There are dozens of RSS implementations, all different, and all behaving differently.
So my point is: nothing will work universally; you have to change the site itself to modify that behavior.

You are right there. The site has to publish its entire history, otherwise you can't get it. Doing it on the server side, if you have access to the database, is quite easy: just dump all the rows as XML. It actually takes effort to filter and limit the XML. How can you do it on blogging platforms? You could use plugins that allow you to do this.
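For example, if you do have direct database access, the dump can be as small as this sketch (the table and column names are made up; any row-to-XML approach works):

    // "Just dump all the rows as XML": no TOP/LIMIT clause, so every post comes back.
    using System.Data;
    using System.Data.SqlClient;

    public static class ArchiveExporter
    {
        public static void ExportAllPosts(string connectionString, string outputPath)
        {
            using (var adapter = new SqlDataAdapter(
                "SELECT Title, Body, PublishedOn FROM Posts ORDER BY PublishedOn",
                connectionString))
            {
                var table = new DataTable("post");
                adapter.Fill(table);         // every row, the whole history
                table.WriteXml(outputPath);  // one XML file
            }
        }
    }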

Related

Is the best way to update sections of a dynamic webpage - by rewriting the file every time?

At my company (a custom retail e-commerce website) we have a homepage with various sections on the page for promotions/sales/events. These sections may appear in any order, one after the other. The order of the sections is saved in a database table called mainpage_sections with an order column (int).
The present method we use for updating the homepage when the order of sections is changed is to run a callback method that automatically rewrites the aspx View file itself, in HTML, with the new order of the sections. It does not pull the sections from the database and dynamically render them according to their order.
This struck me as contrary to best practices and very messy. I asked why we didn't use a database read instead, but I was told that since this website is visited thousands of times a day, and the order of the sections rarely changes, it makes more sense to update the file itself instead of running thousands of extra database reads just for people visiting the homepage of the website.
Does this approach make sense? What is the best-principle, recommended approach here? Is something like output caching a better choice?
Overwriting a code file does seem weird. What if you stored the ordering in a separate JSON file, and only overwrote the JSON file?
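As a minimal sketch of that idea (the file name, class and cache key are assumptions), you could read the ordering from a JSON file at render time and cache it with a file dependency, so neither the database nor the file is hit on every request:

    // Read the section order from a small JSON file instead of rewriting the .aspx.
    using System.Collections.Generic;
    using System.IO;
    using System.Web;
    using System.Web.Caching;
    using System.Web.Script.Serialization;

    public static class SectionOrder
    {
        public static List<string> Load(HttpContext context)
        {
            var cached = context.Cache["SectionOrder"] as List<string>;
            if (cached != null)
                return cached;

            string path = context.Server.MapPath("~/App_Data/section-order.json");
            var order = new JavaScriptSerializer()
                .Deserialize<List<string>>(File.ReadAllText(path));

            // Cache until the file changes; ordering changes rarely, so this is nearly free.
            context.Cache.Insert("SectionOrder", order, new CacheDependency(path));
            return order;
        }
    }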

What is the best way of duplicating an entire website?

I've built a complex site for a client, who wants this duplicated, and re-skinned, so it can be used for other means.
What is the best way of doing this? I'm concerned about copying every file, as this means any bugs must be fixed twice and any improvements must be implemented twice.
I'd look to refactor your code.
Move common functions into a library you can reference from both projects. Since you mention that the new site is for a different purpose, you are likely to see divergence, and you don't want to hamper yourself later, so extract the common parts and then modify copies (or, if appropriate, new files) of the remainder to complete your fork.
If you haven't applied good practice already then now is the time to do it and it'll make your work on both sites easier moving forward.
If all the functionality is the same and only the layout is different, you could just create a new CSS file. Two websites could have exactly the same code base but have different stylesheets and look completely different.
I think that using a version control system like subversion or preferably git, is a good way to duplicate your website. You will be able to track the changes that you make and revert to older versions if things do not work out.
You should implement some kind of instantiation, so that look and feel, content and data are shown depending on which instance of the application is accessed.
In other words, each application accesses the code with a different application identifier, and content is served depending on it.
The two application identifiers point to different settings, so the stylesheets and content are completely isolated, and both domains can live in the same IIS application.
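A rough sketch of what that could look like, assuming each deployment's web.config carries a hypothetical ApplicationId appSettings key that selects its own stylesheet and connection string (all names are illustrative):

    // One code base, two IIS sites, each web.config selecting its own settings.
    using System.Configuration;

    public class TenantSettings
    {
        public string ApplicationId { get; private set; }
        public string StylesheetUrl { get; private set; }
        public string ContentConnectionString { get; private set; }

        public static TenantSettings Current
        {
            get
            {
                string appId = ConfigurationManager.AppSettings["ApplicationId"];
                return new TenantSettings
                {
                    ApplicationId = appId,
                    StylesheetUrl = ConfigurationManager.AppSettings[appId + ".Stylesheet"],
                    ContentConnectionString = ConfigurationManager
                        .ConnectionStrings[appId].ConnectionString
                };
            }
        }
    }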
If you want to duplicate a whole site it's probably best to copy the whole thing and amend as necessary. Obviously taking great care not to copy large portions of text or else you may be penalised by the search engines.
There are ways you could put the new site onto the same shared host (say within a subdirectory of the original site) and literally 'share' some files. If a unique change is required, you could instead reference a 'local' version of a particular file.
However that sounds like a recipe for a headache to me. I'd prefer to duplicate the whole site. It would be much easier to replace one or two functions on separate websites than it would to try and work out which website(s) are affected by a particular change to your source.

How to prevent someone from hacking API feed?

I have started developing a webpage and recently hired someone to write code to display a customized feed (powered by API) in the middle panel on http://farmball.com/. Note that this is not the RSS feed tied to the site blog. The feed ties to my account on another site. There is no RSS link for an average user to subscribe to the feed. I've taken the site out of maintenance mode to ask anyone here with scraping/hacking experience how someone would most easily go about 'taking' the feed and displaying it on their own site. More importantly, what can I do to prevent it?
You can't.
If you are going to expose an RSS feed which you don't want others to be able to display on their site, then you are completely missing the point of RSS. The entire reason for Really Simple Syndication (RSS) is to make your content externally consumable, whether that's in an RSS reader or through someone simply displaying its content on their own website.
Why are you including an RSS feed if you do not want someone to be able to consume it?
what can I do to prevent...'taking' the feed and displaying it on their own site?
Nothing. Preventing reuse goes against the basic concept of RSS, which is to make it as easy as possible for anyone to do anything they want with it. It was designed from the ground up to be Really Simple to Syndicate, not Really Hard to Retransmit Without Permission.
You could restrict access to the feed itself to trusted users only by making them provide some credentials or pass in a key to the feed (e.g. yoursite.rss?mykey=abc123). But you cannot control use. Only access.
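A minimal sketch of that key check as an ASP.NET handler; how keys are issued and stored is up to you, and the lookup and feed-building code below are placeholders:

    // Hypothetical /feed.ashx?mykey=abc123 handler: restricts access, not use.
    using System.Web;

    public class FeedHandler : IHttpHandler
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            string key = context.Request.QueryString["mykey"];
            if (string.IsNullOrEmpty(key) || !IsTrustedKey(key))
            {
                context.Response.StatusCode = 403;   // access denied
                return;
            }

            context.Response.ContentType = "application/rss+xml";
            context.Response.Write(BuildFeedXml());  // placeholder for real feed generation
        }

        private static bool IsTrustedKey(string key)
        {
            // Placeholder: check the key against your user store.
            return key == "abc123";
        }

        private static string BuildFeedXml()
        {
            return "<rss version=\"2.0\"><channel></channel></rss>";
        }
    }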
Be explicit about your license. It isn't a technology solution; as others have mentioned, the technology is open, and this isn't DRM! But if you ask in each post that people who use the feed not repost it or fail to give credit, etc., then some people will respond to the request.
Otherwise, you're better off putting your content behind a password and using a paid subscription model for distributing your content.
This is a DRM problem essentially. If you had some technique that you could put content on the web without having it redistributable, the music industry would love you.
It is possible to try to prevent redistribution. One technique you could try is embedding a signature of some sort into the feed for each user who you require to sign up. If the content is found on the web, you can identify and ban the user who redistributed your content.
This is avoidable too, by getting multiple accounts and normalizing the content to remove fingerprints. For the would-be pirate, this requires more effort than they may be willing to put in. Your signature could be a unique whitespace pattern, tiny variances in the timestamps on posts, misplaced pixels in videos, or any other thing you can vary slightly without end users noticing.
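As a toy illustration of the whitespace idea (the encoding scheme here is entirely made up), you could weave a per-subscriber pattern of spaces and tabs into each item, so a leaked copy can be traced back to the account that received it:

    // Append an invisible whitespace pattern encoding the subscriber id in binary:
    // a space for 0, a tab for 1, one character per bit.
    using System.Text;

    public static class FeedWatermark
    {
        public static string Apply(string itemText, int subscriberId)
        {
            var builder = new StringBuilder(itemText);
            for (int bit = 15; bit >= 0; bit--)
            {
                builder.Append(((subscriberId >> bit) & 1) == 1 ? '\t' : ' ');
            }
            return builder.ToString();
        }
    }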
Use .htpasswd.
Better yet, don't put something private in a public place where it's likely to get picked up by software automatically. As others have said, it's a pretty odd question; if you're trying to figure something else out, you're better off being explicit about what you want to know.

How to implement a "news" section in asp.net website?

I'm implementing a "news" section in an asp.net website. There is a list of short versions of articles on one page, and when you click one of the links it redirects you to a page with the full article. The problem is that the article's text on the second page will come from a database, but the articles may vary: some may have links, some may have an image or a set of images, some may be formatted differently, etc.
The obvious solution that my friend has come up with is to keep the article in the database as HTML, including all links, images, formatting, etc. Then it would simply be displayed on the second page. I feel this is not a good solution: if, for example, we decide to change the CSS class of some div inside this HTML (let's say it is used in all articles), we will have to find it and change it in every single record of the articles table in our database. But on the other hand, we have no idea how to do it differently.
My question is: how do you usually handle something like this?
I personally don't like the idea of storing full html in the database. Here's an attempt at solving the problem.
Don't go for a potentially infinite number of layouts. Yes, all articles may be different, but if you stick to a few good layouts then you're going to save yourself a lot of hassle. These layouts can be stored as templates, e.g. ArticleWithImagesAtTheBottom, ArticleWithImagesOnLeft, etc.
This way, your headache is less as you can easily change the templates. I guess you could also argue then that the site has some consistency in layout.
Then for storage you have at least 2 options:
Use the model-per-view approach and have, e.g., an ArticleWithImagesAtTheBottomModel which would have properties like 1stparagraph, 2ndparagraph, MainImage, ExtraImages
Parse the article according to the template you want to use. e.g look for a paragraph break if you need to.
Always keep the images separate and reference them in another column/table in the db. That gives you most freedom.
By the way, option #2 would be slower as you'd have to parse on the fly each time. I like the model-per-view approach.
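A rough sketch of what such a model could look like (the property names are illustrative stand-ins for "1stparagraph", "2ndparagraph" and so on, since those aren't valid identifiers):

    // One class per article template, stored as plain columns rather than a blob of HTML.
    using System.Collections.Generic;

    public class ArticleWithImagesAtTheBottomModel
    {
        public string Title { get; set; }
        public string FirstParagraph { get; set; }
        public string SecondParagraph { get; set; }
        public string MainImageUrl { get; set; }
        public List<string> ExtraImageUrls { get; set; }
    }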
Essentially, I guess I'm trying to say: beware of making things too complicated. An infinite number of layouts means an infinite number of potential problems. You can always add more templates as you go if you really want to expand, but you're probably best off starting with, say, 3 or 4 layouts.
EDITED FROM THIS POINT:
Actually, thinking about it this may not be the best solution. It could work depending on your needs, but I was wondering how the big sites do it. If you really need that much flexibility, you could (as I think was sort of suggested) use a custom markup. Maybe even a simplified or full wiki markup. I'd still tend toward using templates in general, but if you need to insert at least links and images then you can parse for those.
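If you go the simplified-markup route, a toy example might look like this; the [img ...] and [link ...] tokens are invented for illustration, and a real wiki parser would do far more:

    // Store near-plain text and expand only a couple of made-up tokens into HTML at
    // render time, so site-wide styling stays in the templates/CSS.
    using System.Text.RegularExpressions;
    using System.Web;

    public static class SimpleMarkup
    {
        public static string ToHtml(string source)
        {
            string encoded = HttpUtility.HtmlEncode(source);

            encoded = Regex.Replace(encoded, @"\[img (\S+)\]",
                "<img src=\"$1\" alt=\"\" />");

            encoded = Regex.Replace(encoded, @"\[link (\S+) ([^\]]+)\]",
                "<a href=\"$1\">$2</a>");

            return encoded.Replace("\n", "<br />");
        }
    }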
Surely the point of storing HTML with logically placed <div>s is that you DON'T have to go through every bit of HTML you store to make changes to styles?
I presume you're not using inline styles in your stored HTML, and are referencing an external CSS file, right?
The objection you raise to your colleague's proposal does not say anything about the use of a DB. A DB as opposed to what: files? Then it's all the same. If you want to screw around with the HTML, you have to do it on "every single record", which is not any harder than on every single file. Global changes are a bitch unless you plan for them by, say, referencing an external CSS file. But if you're going to have millions of news articles, you had better plan on versioning the CSS as well.
Anyway, the CMSes do what you're thinking of doing. Using a DB is a fine way to go. How to use it would depend on knowing the problem more intimately.
Have you looked into using free content management systems? I can think of a few good ones:
Joomla
Drupal
WordPress
TONS of others... just do some googling.
Check out this Wiki article: http://en.wikipedia.org/wiki/List_of_content_management_systems

Data Access ASP.NET

I built an online news portal a while back which is working fine for me, but some say the home page is a little slow. When I think about it, I can see a reason why.
The home page of the site displays
Headlines
Spot news (sub-headlines)
Spots with pictures
Most read news (as titles)
Most commented news (as titles)
5 news titles from each news category (11 in total, e.g. sports, economy, local, health, etc.)
Now, each of these is a separate query to the DB. I have TableAdapters, DataSets and DataTables (standard data access scenarios), so for headlines, I call the business logic in my news class that returns the DataTable via the TableAdapter. From there on, I either use the DataTable by just binding it to the controls or (most of the time) convert it to a List(Of News), for example, and consume it from there.
Doing this for each of the above seems to work fine, though; at least it does not put a huge load on anything. But it makes me wonder if there is a better way.
For example, the project I describe above is a highly dynamic website: news items are inserted as they arrive from agencies, 24 hours non-stop, so caching in this case might not sound good. But on the other hand, I now have another similar project for a local newspaper. The site will only be updated once a day. In this case:
Can I run just one query that would return a DataTable containing all the news items inserted today, then query that DataTable and place headlines, spots and other items in their respective places on the site? Or is there a better alternative? I just wonder how other people carry out similar tasks in the most efficient way.
I think you should use Firebug to find out which elements are taking time to load. Sometimes large images can ruin the show (and the size of the image on screen isn't always proportional to its download size).
Secondly you could download the Yahoo Firefox plugin YSlow and investigate if you have any slowing scripts.
But Firebug should give you the best review. After loading Firebug click on the 'Net' tab to view the load time of each element in the page.
If you've got poor performance, your first step isn't to start mucking around. Profile your code. Find out exactly why it is slow. Is the slowdown in transmitting the page, rendering it, or actually dynamically generating the page? Is a single query taking too long?
Find out exactly where the bottleneck is and attack the problem at its heart.
Caching is also a very good idea, even in cases where content is updated fairly quickly. As long as your caching mechanism is intelligent, you'll still save a lot of generation time. In the case of a news portal or a blog, as opposed to a forum, you're likely to improve performance greatly with a caching system.
If you find that your delays come from the DB, check your tables, make sure they're properly indexed, clustered, or whatever else you need depending on the amount of data in the table. Also, if you're using dynamic queries, try stored procedures instead.
If you want to get several queries done in one database request, you can. Since initially you wont be showing any data until all the queries are done anyhow, and barring any other issues, you'll at least be saving time on accessing the DB again for every single query.
DataSets hold a collection of tables; they can be populated by several queries in the same request.
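For example, here's a sketch of that idea with a SqlDataAdapter, batching the section queries into one command so there's a single round trip (the queries and table names are illustrative):

    // One command with multiple SELECTs, one DataSet with one table per section.
    using System.Data;
    using System.Data.SqlClient;

    public static class HomePageData
    {
        public static DataSet LoadSections(string connectionString)
        {
            const string sql =
                "SELECT TOP 10 * FROM News ORDER BY PublishedOn DESC; " +
                "SELECT TOP 10 * FROM News ORDER BY ReadCount DESC; " +
                "SELECT TOP 10 * FROM News ORDER BY CommentCount DESC;";

            var dataSet = new DataSet();
            using (var adapter = new SqlDataAdapter(sql, connectionString))
            {
                adapter.Fill(dataSet);   // one round trip, three result sets
            }

            dataSet.Tables[0].TableName = "Headlines";
            dataSet.Tables[1].TableName = "MostRead";
            dataSet.Tables[2].TableName = "MostCommented";
            return dataSet;
        }
    }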
ASP.NET provides you with a pretty nice mechanism already for caching (HttpContext.Cache) that you can wrap around and make it easier for you to use. Since you can set a life span on your cached objects, you don't really have to worry about articles and title not being up to date.
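A small sketch of such a wrapper, assuming a fixed lifespan per cached item (the names and the loader are illustrative):

    // Thin wrapper over the ASP.NET cache: the expensive loader only runs when the
    // cached copy has expired.
    using System;
    using System.Web;
    using System.Web.Caching;

    public static class CacheHelper
    {
        public static T GetOrAdd<T>(string key, TimeSpan lifespan, Func<T> loader) where T : class
        {
            var cached = HttpRuntime.Cache[key] as T;
            if (cached != null)
                return cached;

            T value = loader();
            HttpRuntime.Cache.Insert(key, value, null,
                DateTime.UtcNow.Add(lifespan), Cache.NoSlidingExpiration);
            return value;
        }
    }

    // Usage, e.g. when building the headlines section (LoadHeadlines is your own loader):
    //   var headlines = CacheHelper.GetOrAdd("Headlines",
    //       TimeSpan.FromMinutes(10), () => LoadHeadlines());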
If you're using WebForms for this website, disable ViewState for the controls that don't really need it, just to make the page that little bit faster to load. There are plenty of other tweaks and changes to make a page load faster as well (gzipping, minifying scripts, etc.).
Still, before doing any of that, do as Anthony suggested and profile your code. Find out what the true problem is.
