What's the best "file format" for saving complete web pages (images, etc.) in a single archive? [closed] - standards

I'm working on a project which stores single images and text files in one place, like a time capsule. Now, most every project can be saved as one file, like DOC, PPT, and ODF. But complete web pages can't -- they're saved as a separate HTML file and data folder. I want to save a web page in a single archive, and while there are several solutions, there's no "standard". Which is the best format for HTML archives?
Microsoft has MHTML -- basically a file encoded exactly as a MIME HTML email message. It's already based on an existing standard, and MHTML itself was proposed as RFC 2557. This is a great idea and it's been around forever, except it's been a "proposed standard" since 1999. Plus, implementations other than IE's are just cumbersome. IE and Opera support it natively; Firefox and Safari need a cumbersome extension.
Mozilla has Mozilla Archive Format -- basically a ZIP file with the markup and images, with metadata saved as RDF. It's an awesome idea -- Winamp does this for skins, and ODF and OOXML for their embedded images. I love this, except, 1. Nobody except Mozilla uses it, 2. The only extension supporting it hasn't been updated since Firefox 1.5.
Data URIs are becoming more popular. Instead of referencing an external location a la MHTML or MAF, you encode the file straight into the HTML markup as base64. Depending on your view, it's streamlined since the files are right where the markup is. However, support is still somewhat weak. Firefox, Opera, and Safari support it without gaffes; IE, the market leader, only started supporting it at IE8, and even then with limits.
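To make the data URI idea concrete, here is a minimal sketch in Python. The helper name `to_data_uri` is made up for illustration; it only shows the encoding step and says nothing about how a given browser handles long URIs.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path: str) -> str:
    """Return the file at `path` encoded as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The returned string can go straight into the markup, e.g. as the `src` attribute of an `img` element, in place of a link to an external file.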
Then of course, there's "Save complete webpage" where the HTML markup is saved as "savedpage.html" and the files in a separate "savedpage_files" folder. Afaik, everyone does this. It's well supported. But having to handle two separate elements is not simple and streamlined at all. My project needs to have them in a single archive.
Keeping in mind browser support and ease of editing the page, what do you think's the best way to save web pages in a single archive? What would be best as a "standard"? Or should I just buckle down and deal with the HTML file and separate folder? For the sake of my project, I could support that, but I'd best avoid it.

My favourite is the ZIP format, because:
It is very well suited for the purpose
It is well documented
There are a lot of implementations available for creating or reading them
A user can easily extract single files, change them and put them back in the archive
Almost every major operating system (Windows, Mac and most Linux distributions) has a ZIP program built in
The alternatives all have some flaw:
With MHTML, you cannot easily edit.
With data URIs, I don't know how difficult the implementation would be. (With ZIP, even I could do it in PHP, 3 years ago...)
The option to store things as separate files just has far too many things that could go wrong and mess up your archive.
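As a rough sketch of how little code the ZIP route needs, here is one way to pack a browser-saved page into a single archive using Python's standard `zipfile` module. The function name `bundle_page` and the file names in the usage note are assumptions for illustration, not part of any standard.

```python
import zipfile
from pathlib import Path


def bundle_page(html_file: str, resource_dir: str, archive: str) -> list[str]:
    """Pack a saved page and its resource folder into one ZIP archive."""
    html_path = Path(html_file)
    res_path = Path(resource_dir)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(html_path, arcname=html_path.name)
        for item in sorted(res_path.rglob("*")):
            if item.is_file():
                # Keep the folder name so relative links in the HTML still resolve.
                inner = Path(res_path.name) / item.relative_to(res_path)
                zf.write(item, arcname=inner.as_posix())
        return zf.namelist()
```

Calling `bundle_page("savedpage.html", "savedpage_files", "savedpage.zip")` would produce one file that any ZIP tool can open, edit and repack.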

It is not only a question of file format. Another crucial question is what exactly you want to store. Is it:
to store the whole page as it is, with all referenced resources (images, CSS and JavaScript)?
to capture the page as it was rendered at some point in time, i.e. a static image of some rendered state of the web page DOM?
Most current "save page as" functionality in browsers, be it to MAF, MHTML or file+dir, attempts the first way. This is an ultimately flawed approach.
Don't forget that web pages these days are more like local applications than static documents you can easily store. Potential issues:
One page is in fact several pages built dynamically by JS; user interaction is needed to get it to the desired state.
AJAX applications can communicate with a remote service, rendering the page unusable for offline viewing.
Hidden links in JavaScript code. Such a resource is then not part of the stored page. Even parsing the JS code may not discover them. You need to run the code.
Even the position of basic HTML elements may be computed dynamically by JS, and it is not always possible or easy to recreate it locally.
You would need some sort of JS memory dump, and a way to load it, to get the page to the desired state you hoped to store.
And many, many more issues...
Check the Chrome SingleFile extension. It stores a web page to one HTML file with images inlined using the already mentioned data URIs. I haven't tested it much so I cannot say how well it handles "volatile" AJAX pages.

PDFs are supported on nearly all browsers on nearly all platforms and store content and images in a single file. They can be edited with the right tools. This is almost definitely not ideal, but it's an option to consider.

Use a zip file.
You could always make a program/script that extracts the zip file to a temp directory and loads the index.html file in your browser. You could even use an index.ini/txt file to specify the file that should be loaded when extracting.
Basically, you want something like the Mozilla Archive format, but without the unnecessary rdf crap just to specify what file to load.
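A minimal sketch of such a launcher in Python, assuming the archive keeps its entry page at `index.html` (the function names and the temp-directory prefix are made up for illustration):

```python
import tempfile
import webbrowser
import zipfile
from pathlib import Path


def extract_archive(archive: str, entry: str = "index.html") -> Path:
    """Extract `archive` to a fresh temp directory and return the entry page path."""
    target = Path(tempfile.mkdtemp(prefix="webarchive_"))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    page = target / entry
    if not page.is_file():
        raise FileNotFoundError(f"{entry} not found in {archive}")
    return page


def open_archive(archive: str) -> None:
    """Extract the archive and show its entry page in the default browser."""
    webbrowser.open(extract_archive(archive).as_uri())
```

Reading the entry name from a small index file inside the archive, as suggested above, would only mean replacing the `entry` default with a lookup.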
MHT files are good, but they usually use base64 to embed files, which will make the file size bigger than it should be (data URIs are the same way). You can add attachments as binary, but you'll have to manually do that with a hex editor or create a tool and support for it by clients might not be as good.
Of course, if you want to use what browsers generate, MHT (Opera and IE at least) might be better.
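The size penalty of base64 is easy to check. This small Python sketch uses random bytes as a stand-in for image data; base64 turns every 3 input bytes into 4 output characters, so the payload grows by about a third (MIME line breaks add a little more on top).

```python
import base64
import os

raw = os.urandom(3000)            # stand-in for binary image data
encoded = base64.b64encode(raw)   # what a data URI or base64 MIME part would carry

ratio = len(encoded) / len(raw)
print(len(raw), len(encoded), round(ratio, 2))  # prints: 3000 4000 1.33
```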

I see no excuse to use anything other than a zip file.

Well, if browser support and ease of editing are the biggest concerns I think you are stuck with the file+directory approach unless you are willing to provide an editor for the single file format and live with not very good support in browsers.
You can create a single file by compressing the contents. You can also create a parent directory to ease handling.

The problem is that HTML is bottom-up, not top-down. Look at your file name, which saved on my box as "What's the best "file format" for saving complete web pages (images, etc.) in a single archive? - Stack Overflow.html"
Just add a '|' and one has trouble doing copy-and-paste backups to a spare drive. In the end you end up chopping the file name in order to save it. Dozens, perhaps hundreds, of identical index.html or index.php files are cluttering my drives.
The partial solution is to write your own CMS and use scripts to map all relevant files to a flat file database, then use fileName, size, mtime and md5 to get a unique ID for each file. Create a flat file index permitting 100k or 1000k records. The goal is to write once and use many times. So you need a real CMS, and you need a unique ID based on content (e.g. index8765432.html) that goes in your files_archive. Ditto for the others. Then you can non-destructively symlink from the saved original HTML to the files_archive and just recreate the file using a PHP or alternative script if need be. I don't know if it will work, as I'm at the same point you're at; maybe in a week I will know for sure.
The more useful approach is to have a top-down structure based on your business or personal wants and related tasks. So your files might be organized top-down, but external ones bottom-up to preserve the original content. My interest is in Web 3.0 services, and the closer you get to machine-to-machine interaction, the greater the need to structure the information. Maybe it's time to rethink the idea of bundling everything into a single file. So you have hundreds of main.css files: why bundle, when a top-down solution might let you modify one file instead of hundreds?
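For what it's worth, the content-based ID part can be sketched in a few lines of Python. The naming scheme here (stem, dash, first eight hex digits of the MD5) is just an assumption for illustration; size and mtime could be mixed in the same way.

```python
import hashlib
from pathlib import Path


def content_id(path: str, digest_chars: int = 8) -> str:
    """Build an archive name such as 'index-5d41402a.html' from the file's MD5."""
    p = Path(path)
    digest = hashlib.md5(p.read_bytes()).hexdigest()[:digest_chars]
    return f"{p.stem}-{digest}{p.suffix}"
```

Two identical index.html files from different folders map to the same name, so only one copy needs to live in files_archive.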

Related

Why have all CSS files in one folder? [closed]

I've been adhering to the practice of having all CSS files in one folder. Is it just for the sake of keeping things organized, or does it have any other benefit besides that?
This question isn't just for clarification purposes: I have a whole lot of websites whose load time I'm trying to decrease, and I was wondering if this method will help.
Two reasons come to mind right now (to do with keeping HTML, JS and CSS separate):
Both CSS and JS can be reused in multiple HTML files if they are in external files, but you need to copy the same code into each page if each is a single HTML+CSS+JS page.
If you want, you can develop a new version of the code, CSS or JS, for an online page, and when you finish the only thing you need to do is change the filename in the link or script element in the HTML. This means you're not being repetitive.
Placing them into a single CSS file also means that it's easier to find the styling and bulk edit a lot of the HTML styling in a single place.
I also found:
Easier editing: Suppose you find that distances are calculated wrong (or whatever you are currently working on), then it's definitely easier to just open the file that contains the object responsible for distance calculation than scrolling through your huge HTML file looking for the culprit.
Syntax highlighting / Code completion / other features of your IDE (like refactoring): This might work partially with code inside of HTML files, but not all that well. So, you could work faster and actually see errors before they become bugs.
Cachability: While your HTML code will be different for all the pages of your site, your CSS and JS won't, and it would be silly to reload them for every page (which happens when they are put directly into the HTML).
Page load time: With CSS and JS in the HTML file, they have to be loaded before the browser will see any actual HTML, so the page will show slower. Also, search engines that don't really care about your scripts will have to load them, and there are penalties based on page load time.
Minification: In production, you use a minified (and concatenated) version of your CSS and JS, and of course you don't want to manually create that every time you make a change, so you do it programmatically. Trying to do that without separate files would become very ugly, and you wouldn't be able to cache that minified version, which would be quite the performance hit.
CSS generators: When you start to care about keeping code duplication to a minimum, you will quickly tire of writing CSS, which is full of duplications, and switch, for example, to SASS (like I did quite some time ago). You will definitely need separate files to make that work.
So, in answer to your question,
Yes, for the majority of websites, separating them (like MVC does by default) will likely quicken your load time. (There are very few exceptions to this rule; a site with 1 or 2 style declarations may be slightly faster with them placed within the HTML, but if you ever use MVC you won't look back at not separating them!)
Storing all the CSS files in one folder will not decrease load time. The best way to decrease load time is to combine all CSS files together and minify the result. The bad thing here is when the time comes to change your styles: you would have to keep the original files, or decompress/pretty-print the minified one, then change it, then minify again. Some CMSs have an option to minify and cache all of your styles (and JavaScript files), which is much better. On the server side, make sure all of your styles are gzipped before they are delivered to the client browser. More about how to enable compression on Apache
Why have all CSS files in one folder?
Why have many CSS files to begin with? If you are interested in "decreasing the load time" then you should consider having only one CSS file.
Long story short, you can organize your CSS and JavaScript files any way you want. However, when serving the files you must combine and minify all CSS files into one (likewise for JavaScript).
I personally keep related CSS, images and JavaScript together. I find it much easier to work with files that way. An example would be:
- resources
  - plugin1
    - plugin1.js
    - plugin1.css
    - images
  - plugin2
    - plugin2.js
    - plugin2.css
    - images
I have written a script that combines the CSS files from all folders into one file, minifies it and copies the result to a directory below the wwwroot.
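A script along those lines could look roughly like this in Python. This is not the script mentioned above, only a sketch: `combine_css` is a made-up name, and the regex "minifier" is deliberately naive (it ignores strings and other edge cases that a real minifier handles).

```python
import re
from pathlib import Path


def combine_css(root: str, output: str) -> str:
    """Concatenate every .css file under `root` into one crudely minified file."""
    parts = [p.read_text(encoding="utf-8") for p in sorted(Path(root).rglob("*.css"))]
    css = "\n".join(parts)
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # drop comments
    css = re.sub(r"\s+", " ", css)                        # collapse whitespace
    css = re.sub(r"\s*([{};,])\s*", r"\1", css).strip()   # trim around punctuation
    Path(output).write_text(css, encoding="utf-8")
    return css
```

Write the output somewhere outside `root`, otherwise the combined file gets picked up again on the next run.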
Organisation is very important for a project. You don't put apples into a pear bag; it's a reflex, but in fact you can do it...
For code it's the same: you put CSS files into a CSS folder, but you don't have to.
Imagine that you have 30 CSS files, 40 PHP files, 50 pictures and 10 JS files... You need to organise this!

ASP.NET - how can I recompile the language resources?

I notice that if I have another language in my project and change the HTML content file it will work on my computer but if I upload to the server it keeps the old HTML.
It seems to work if I modify the .resx resource file even if I don't make any changes, just add a space for example, then save, then upload also.
Why is this? I'm guessing that the HTML is cached somewhere. More importantly, how can I upload a large batch of updated translations without having to modify each .resx file? I have a lot of Japanese
Thanks in advance
But .resx files need to be compiled. That is why uploading alone doesn't work.
I just completed a project translating an existing site into several languages.
I did this using webforms with c# codebehind. Hope my answers can assist you...
I used RESX files to store the different language translations (index.aspx.fr-CA.resx for example), and in the C# code-behind you have to use the 'protected override void InitializeCulture()' method to have the RESX files come into play. Do you have this set up?
I found it easy to create the multiple languages this way: I would generate the local resource (after tagging the static content in tags) and then copy that generated RESX file, paste it and rename it for the other language. The RESX file is XML, and Visual Studio gives it an easy-to-edit table view for copying and pasting translated text (or alternatively you can use tools such as RESX Manager, see link below). What I did, to keep it convenient to edit and modify the syntax of different languages, was take my existing content text, run it through Google or Bing translator, and paste the translated text into the RESX files...
If you're looking for something that will automatically translate your site for you and then redisplay it in an iframe, you can look into the Bing Translation API http://www.microsoft.com/web/post/using-the-free-bing-translation-apis but in my case this was not ideal, as the translations would not take into account the syntax of different languages, and you cannot go in and edit them after someone who speaks the language and is familiar with sites in it tells you that it makes no sense! =P
Also, this tool is very helpful if you already have existing RESX files and need to modify a section in one language; you can then edit it in the others easily. http://resxmanager.com/ It's free for non-commercial use. It's easy to set up and can definitely save you time once you get going.

What's the purpose of an asp:hyperlink, and how many strings is too many in a resource file?

I developed a (small) company website in Visual Studio, and I'm addicted to learning more. I really just have two simple questions that I can't google.
1 - Asp:hyperlinks:
What is the purpose of an asp:hyperlink? I know I can't use these in my resource files -- I have to convert 'em all back to html links. At first, asp:hyperlinks looked sophisticated, so I made all my links asp:hyperlinks. Now I'm reverting back. What's the purpose of an asp:hyperlink, if any?
2 - Resource Files and strings:
In localizing my website, I have found that I'm putting the .master resource files in the directory's App_LocalResources folder VS created, because you can't change the top line stuff in a .master file and put a culture/uiculture in there. But all of my regular .aspx pages are going into the root App_GlobalResources folder into 1 of 4 language resource files (de, es-mx, fr, en). I'm making 2 or 3 strings per .aspx page. So when you have 47 pages in your website, that's about 100 strings on a resource page.
I just learned about all of the resources stuff from this forum and MSDN tutorials, so I have to ask, 'cause it's a lot of work. Is this okay? Is it normal? Am I going about this the wrong way?
I've never used resources, so can't comment on that.
Differences between asp:hyperlink and a tag that I know of:
asp:hyperlink is converted to an A tag by the ASP.NET engine when output to the browser.
It is possible asp:hyperlink could make browser specific adjustments, to overcome browser bugs/etc.. which is kind of the point of ASP.NET, or at least one of them. If not already in it, they could be added later, and by using those objects you'll get that when/if added.
Both can be used in code behind (you can set runat="server" for an A tag), but the asp:hyperlink has better compile-time checking in most cases -- strong type-casting for more items vs generic objects.
It is easier to end up with HTML bloat using asp:hyperlinks, but only if used with a poor design. For example, it is easy to set font styles and colors on them, but I wouldn't, since that generates in-line styles that are usually pretty bloated compared to what you would do by hand or in a CSS file.
asp:hyperlinks support the "~/Folder/File.ext" syntax for the NavigateUrl (href), which is nice in some projects if you use a lot of different URLs and sub-folders and want the server to handle mapping in a "smart" way.
The purpose of an asp:hyperlink is to display a link to another webpage.
With the resource files, since you're not a programmer and just developing a small program, use something you're comfortable with. Resource files are easy to use for beginners when you want to localize your web content -- and yes, it's normal to be adding many strings if you need them.
For #1
Using a hyperlink control instead of just a piece of text allows you to access the control at runtime and manipulate its contents if you want to change the link dynamically. If you have static links that will never change, then it's simpler to just use plain markup, i.e. <a href=''>

How to preview an array of bytes which represents a file content in client browser?

We need to interact with a Document Management System which saves files (mostly PDF, DOC and DOCX) as arrays of bytes and also saves their file extensions.
So, we need to build a file viewer which displays files in the client's browser.
We are thinking of converting DOC files into PDF and previewing the converted file in the browser; others suggest converting the array of bytes to HTML (this solution is a big question for us, as we don't know how to do this or whether it is possible at all) and transferring the rendered HTML.
But we don't think that these are the best, cross-browser solutions.
So, is there any way to implement such functionality that works across browsers?
The first question you have to answer is whether the users need to be able to edit the documents. If so, then your best "viewer" is going to be the Word and Adobe client apps. Please note that in this case, you will also need to give the users the ability to upload (and possibly check in) the edited documents.
If the users just need read access, then you can certainly just show them an image or PDF of the file in their browser. If you go the PDF route, you will save money using Adobe reader, but it will be a "clunkier" user experience.
If you want to give your users a read-only view, you will need to "render" .doc files into PDF's or TIFF's or PNG's or whatever. I don't recommend doing this in the browser unless ALL of your documents are VERY simple.
If your users require a single, web-based interface for all of their rendered .doc and .pdf files, then you may want to consider using a Java or ActiveX-based document viewing applet. Daeja is the most popular vendor for this type of viewer, and it even gives your users the ability to annotate documents.
One more note. Rendering .doc files can be a very expensive, cumbersome, and error-prone process. I've worked on numerous systems at multiple companies that have tried this, and no matter what we did or how much we spent, it never worked terribly well.
Good luck!
Tom Purl

Localize Images in ASP.NET

A couple of years ago, we had a graphic designer revamp our website. His results looked great, but he unfortunately introduced a new font that web browsers don't support.
At first I was like, "What!?!"... since most of our content is dynamic and there was no real way to pre-make all of the images. There was also the issue of multiple languages (since we knew Spanish was on the horizon).
Anyway, I decided to create some classes to auto-generate images via GDI+ and programmatically cache them as needed. This solved most of our initial problems. However, now that our load has increased dramatically, there has been a drain on our UI server.
Now to the question... I am looking to replace most of the dynamic GDI+ images with a standard web browser font. I am thinking of keeping some of the rendered GDI+ images and putting them in a resx file, but plan to replace most of them with Tahoma or Arial fonts via asp:Labels.
Which have you found to be a better localized image solution?
Embedding images into the resx
Only adding the image url into the resx
Some other solution
My main concern is to limit the processing on the UI server. If that is the case, would adding the image url to the resx be a better solution compared to actually embedding the image into the resx?
You should only need to generate each image once, and then save it on the hard disk. The load on your site shouldn't increase the amount of processing you have to do. That being said, it almost sounds like you are using images for things you shouldn't be. If there are so many different images that you can't keep up with generating them, it's time to abandon your fancy images for things that shouldn't be images, and go back to straight text. If the user doesn't have the specified font installed, it should just fall back to a similar looking font. CSS has good support for this.
see my response here
This can be done manually or using some sort of automated (CMS) system.
The basic method is to cache your images in a language specific directory structure and then write an HTTP handler that effectively removes the additional directory layer. eg:
/images/
    /en/
        header1.gif
    /es/
        header1.gif
In your markup or CSS you would just reference /images/header1.gif. The HTTP handler then uses session (if language is user specific), or config (if site specific) to choose which directory to serve the image from.
This provides a clean line between code and content, and allows for client-side caching. Resx is great for small strings, but I much prefer a system like this for images and larger content, especially on the web where it is typically easy to switch images around.
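The lookup such a handler performs is small. Here is a language-neutral sketch in Python rather than an actual ASP.NET handler; the function name and the fallback to a default language are assumptions added for illustration.

```python
from pathlib import Path


def resolve_image(url_path: str, language: str, root: str, default: str = "en") -> Path:
    """Map '/images/header1.gif' to '<root>/images/<language>/header1.gif'."""
    rel = Path(url_path.lstrip("/"))
    # Try the requested language first, then the (assumed) default language.
    for lang in (language, default):
        candidate = Path(root) / rel.parent / lang / rel.name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(url_path)
```

The markup keeps pointing at /images/header1.gif; only the handler knows about the /en/ and /es/ layer.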
I had the same problem a few years back and our interface team pointed us to sIFR. http://wiki.novemberborn.net/sifr/
You embed your font into a Flash movie and then use the sIFR JavaScript to dynamically convert your text into your font. Because it's client-side, there is no server-side impact.
If the user doesn't have Flash installed or JavaScript enabled, they get the closest web-friendly font.
As an added bonus: because your content is still text, Google can search and index the content, which is a huge SEO win.
Because of caching, I'd rather add only the image URL to the resx. Caching works much better for static content (i.e. plain files) than for generated content.
I'd be very cautious about putting text in images at all; CSS with an appropriate font-family fallback is probably the correct response on accessibility and good MVC grounds.
Where generation really is required, I think Kiblee and JayArr outline good solutions.
