How to get information from standards?

Is there a reasonable way to search standards for programming and markup languages (specifically, C, C++, Java, JavaScript, (X)HTML)? Standard libraries tend to be well-documented and easy to access, but when looking for information on the basics of a language I always have trouble, and end up getting most of my information second-hand from tutorials. That's not all bad, since tutorials often point out gotchas (such as "even though this is in the standard, it doesn't work in Internet Explorer"), but tutorials are sometimes wrong and often don't cover more obscure areas.

There is really only one way to get information from a standard: read the standard.
If the standard is too hard to read (and a lot of them are), then maybe the standards folks have created (non-normative) introductory or tutorial documents. But they are not the standard. Very occasionally, someone produces an "annotated" version of a specification that offers simplified explanations. These are very useful, but once again they are not the standard.
If a standard is available in machine readable / searchable form, document search on suitably chosen keywords can often point to the relevant part(s) of the documents. But you have to read and understand the text. There are no tools around (that I'm aware of) that can accurately translate the (often abstruse) technical details of a standard into something that "normal people" can read easily.
This is why we label people who are intimately familiar with particular standards "standards lawyers" or "language lawyers". It is analogous to lawyers and judges reading/writing legal documents.

I assume you are looking for a syntax reference, as opposed to a standard. The standard is precise but probably too low-level for what you really want. A syntax reference will show you the language constructs for looping, selection, etc. There are some exceptions, such as (X)HTML, which is a markup language rather than a programming language. Markup language standards documents tend to be more useful from the reference perspective.
For example look at the Visual C++ Language Reference and compare it to a version of the standard.

For (X)HTML and the DOM, the standards are handled by the W3C. But as you know, browsers don't exactly follow the standards. For an exhaustive resource on browser issues there's nothing better than the quirksmode compatibility tables.

Related

NTRU Key Exchange example implementation

Are there any open-source implementations of NTRU-KE (Preferably in Java or C#) out there that I can use as a reference for implementing it in a different language?
The implementations listed on the Wikipedia page for NTRUEncrypt don't have it included, and there's a paper covering the algorithm here but the language is a bit too technical for me to be able to understand it fully.
Future readers, please prove me wrong (and post your own answer).
Given it is pretty new (November 2013) there probably aren't any implementations at all. Even the authors of the paper might not have implemented it themselves (you could ask them, though). But as far as I can tell the protocol only uses operations that would have to be included in NTRUEncrypt implementations anyway, so it shouldn't be too difficult to write one yourself on top of an existing NTRU library. You can ask specific questions on the protocol here or on https://crypto.stackexchange.com. You should probably try to understand the basics of NTRUEncrypt first, though.
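To give a feel for what "on top of an existing NTRU library" could look like, here is a rough Java skeleton. Every name in it is hypothetical (it is not the API of any published NTRU library), and the actual messages, checks, and parameters have to come from the NTRU-KE paper; the point is only that the primitives you need (key generation, encryption, decryption) are exactly the ones an NTRUEncrypt implementation already exposes.

```java
// Hypothetical skeleton only: none of these names come from an existing NTRU
// library, and the real message contents and validity checks must follow the
// NTRU-KE paper.
public final class NtruKeSketch {

    /** Operations any NTRUEncrypt implementation already provides. */
    public interface NtruOps {
        KeyPair generateKeyPair();
        byte[] encrypt(byte[] plaintext, byte[] publicKey);
        byte[] decrypt(byte[] ciphertext, byte[] privateKey);
    }

    /** Minimal key-pair holder (stand-in for the library's own type). */
    public record KeyPair(byte[] publicKey, byte[] privateKey) {}

    /** One party's side of the exchange, built purely from NtruOps. */
    public static final class Party {
        private final NtruOps ntru;
        private final KeyPair ephemeral;

        public Party(NtruOps ntru) {
            this.ntru = ntru;
            this.ephemeral = ntru.generateKeyPair(); // fresh key pair per exchange
        }

        /** Public value sent to the peer at the start of the exchange. */
        public byte[] publicValue() {
            return ephemeral.publicKey();
        }

        /** Encrypt our contribution to the shared secret under the peer's key. */
        public byte[] wrap(byte[] contribution, byte[] peerPublicKey) {
            return ntru.encrypt(contribution, peerPublicKey);
        }

        /** Recover the peer's contribution with our private key. */
        public byte[] unwrap(byte[] wrapped) {
            return ntru.decrypt(wrapped, ephemeral.privateKey());
        }
    }
}
```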

Technical choices in unmarshalling hash-consed data

There seems to be quite a bit of folklore knowledge floating about in restricted circles about the pitfalls of hash-consing combined with marshalling-unmarshalling of data. I am looking for citable references to these tidbits.
For instance, someone once pointed me to the aterm library and mentioned that the authors had clearly thought about this and that the representation on disk was bottom-up (children of a node come before the node itself in the data stream). This is indeed the right way to do things when you need to re-share each node (against a possibly identical node already in memory). This re-sharing pass needs to be done bottom-up, so the unmarshalling itself might as well be too, so that it's possible to do everything in a single pass.
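To make the bottom-up idea concrete, here is a minimal Java sketch. The on-disk format is invented for the example (a children-first, postorder stream of records over a plain binary-node type); the point is that every node can be interned against the hash-cons table the moment it is read, because its children have necessarily been read and interned already.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Single-pass, bottom-up unmarshalling with re-sharing. The stream contains
// recordCount records in postorder: a leaf carries a label; an internal node
// takes the two most recently read values as its children.
final class HashConsReader {

    static final class Node {
        final int label;
        final Node left, right;                 // null for leaves
        Node(int label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
        @Override public boolean equals(Object o) {
            return o instanceof Node n && label == n.label
                    && left == n.left && right == n.right;  // children are already shared
        }
        @Override public int hashCode() {
            return Objects.hash(label,
                    System.identityHashCode(left), System.identityHashCode(right));
        }
    }

    private final Map<Node, Node> table = new HashMap<>();  // hash-cons table

    private Node intern(Node n) {
        return table.computeIfAbsent(n, k -> k);            // reuse an identical node if present
    }

    Node read(DataInputStream in, int recordCount) throws IOException {
        Deque<Node> stack = new ArrayDeque<>();
        for (int i = 0; i < recordCount; i++) {
            boolean leaf = in.readBoolean();
            int label = in.readInt();
            if (leaf) {
                stack.push(intern(new Node(label, null, null)));
            } else {
                Node right = stack.pop();   // children were read first,
                Node left = stack.pop();    // so they are already interned
                stack.push(intern(new Node(label, left, right)));
            }
        }
        return stack.pop();                 // the root is the last record
    }
}
```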
I am in the process of describing difficulties encountered in our own context, and the solutions we found. I would appreciate any citable reference to the kind of aforementioned folklore knowledge. Some people have obviously encountered the problems before (the aterm library is only one example). But I didn't find anything in writing. Even the little piece of information I have about aterm is hearsay. I am not worried that it's unreliable (you can't make this up), but "personal communication" and "look how it's done in the source code" are considered poor form in citations.
I have enough references on hash-consing alone. I am only interested in references where it interferes with other aspects of programming, such as marshalling or distribution.
OK, this may not be much use, but Andrew Kennedy wrote a functional pearl called simply Pickling Combinators, which appeared in the Journal of Functional Programming 14(6):727-739, 2004. There is extensive discussion of structure sharing and how it is handled in pickles, but no direct discussion of how this problem might relate to hash-consing in the implementation of the language. But the article does discuss structure sharing in memory as well as in a pickle, so I hope it is better than nothing.
Martin Elsman had a follow-on paper in 2005 in Trends in Functional Programming; the title is Type-specialized serialization with sharing. The article deals primarily with hash-consing by the unpickler (deserializer), not with hash-consing in the implementation, but again it may be worth something.
The JFP paper is proprietary, but there appears to be a preprint on Andrew's web page.
Elsman's paper appears to be available through Google Scholar at http://tinyurl.com/yd5tw2b.
(In a previous life, I worked on a project to create ASCII pickles that people could read and edit. I stupidly failed to publish it, but I have retained an interest.)
I found one reference on marshalling in functional languages; not sure if it will be useful, but the authors are smart: http://tinyurl.com/yc3hob9
I believe that Matthias Blume and/or Andrew Appel did something on this, but I can't find the paper. I also believe I reviewed something once for the Journal of Functional Programming, but I can't remember if the paper was accepted or who wrote it.
I suggest you ask Matthias Blume, Andrew Appel, and Phil Wadler if they can help.
Coq V5.10 had hash-consing and marshaling/unmarshaling. I didn't find anything in published form, but the unmarshaling step is referred to as "reinterning" in the source code. Coq unmarshaled values and then traversed them in order to re-create sharing, the obvious and only solution when all the language provides is an unmarshal function of type in_channel -> 'a.
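For comparison with the bottom-up scheme sketched in the question, the Coq-style approach amounts to something like the following (an illustration only, not Coq's actual code, reusing the Node type from the earlier sketch): unmarshal without sharing first, then rebuild the value children-first against a hash-cons table.

```java
import java.util.HashMap;
import java.util.Map;

// Post-hoc "reinterning": the value was unmarshalled without sharing, and we
// rebuild it bottom-up, reusing any structurally identical node already seen.
final class Reinterner {
    private final Map<HashConsReader.Node, HashConsReader.Node> table = new HashMap<>();

    HashConsReader.Node reintern(HashConsReader.Node n) {
        if (n == null) return null;
        HashConsReader.Node left = reintern(n.left);      // re-share the children first
        HashConsReader.Node right = reintern(n.right);
        HashConsReader.Node rebuilt = new HashConsReader.Node(n.label, left, right);
        return table.computeIfAbsent(rebuilt, k -> k);
    }
}
```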

What is the difference between AntiXss.HtmlEncode and HttpUtility.HtmlEncode?

I just ran across a question with an answer suggesting the AntiXss library to avoid cross-site scripting. Sounded interesting, but reading the MSDN blog, it appears to just provide an HtmlEncode() method. But I already use HttpUtility.HtmlEncode().
Why would I want to use AntiXss.HtmlEncode over HttpUtility.HtmlEncode?
Indeed, I am not the first to ask this question. And, indeed, Google turns up some answers, mainly:
A white-list instead of black-list approach
A 0.1ms performance improvement
Well, that's nice, but what does it mean for me? I don't care so much about the performance of 0.1ms and I don't really feel like downloading and adding another library dependency for functionality that I already have.
Are there examples of cases where the AntiXss implementation would prevent an attack that the HttpUtility implementation would not?
If I continue to use the HttpUtility implementation, am I at risk? What about this 'bug'?
I don't have an answer specifically to your question, but I would like to point out that the white-list vs. black-list approach is not just "nice". It's important. Very important. When it comes to security, every little thing is important. Remember that with cross-site scripting and cross-site request forgery, even if your site is not showing sensitive data, a hacker could infect your site by injecting JavaScript and use it to get sensitive data from another site. So doing it right is critical.
OWASP guidelines specify using a white-list approach. PCI Compliance guidelines also specify this in coding standards (since they refer to the OWASP guidelines).
Also, the newer version of the AntiXss library has a handy new function, .GetSafeHtmlFragment(), which is useful for those cases where you want to store HTML in the database and have it displayed to the user as HTML.
Also, as for the "bug", if you're coding properly and following all the security guidelines, you're using parameterized stored procedures, so the single quotes will be handled correctly. If you're not coding properly, no off-the-shelf library is going to protect you fully. The AntiXss library is meant to be a tool to be used, not a substitute for knowledge. Relying on the library to do it right for you would be like expecting a really good paintbrush to turn out good paintings without a good artist.
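For the less experienced reader: "parameterized" means the value is never spliced into the SQL text, so a single quote in the input cannot change the structure of the statement. Here is a minimal JDBC illustration of the same idea using a parameterized statement (table and column names are invented for the example):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// The driver handles quoting/escaping of bound parameters, so input such as
// "O'Brien" or "<script>..." is stored as plain data and cannot alter the query.
final class CommentDao {
    void insertComment(Connection conn, int userId, String body) throws SQLException {
        String sql = "INSERT INTO comments (user_id, body) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, userId);
            ps.setString(2, body);   // never concatenated into the SQL string
            ps.executeUpdate();
        }
    }
}
```

Note that this protects the database layer; you still need to HTML-encode the stored body when it is rendered back to the page.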
Edit - Added
As asked in the question, an example of where AntiXss will protect you and HttpUtility will not:
HttpUtility.HtmlEncode and Server.HtmlEncode do not prevent Cross Site Scripting
That's according to the author, though. I haven't tested it personally.
It sounds like you're up on your security guidelines, so this may not be something I need to tell you, but just in case a less experienced developer is out there reading this, the reason I say that the white-list approach is critical is this.
Right now, today, HttpUtility.HtmlEncode may successfully block every attack out there, simply by removing/encoding < and >, plus a few other "known potentially unsafe" characters, but someone is always trying to think of new ways of breaking in. Allowing only known-safe (white-list) content is a lot easier than trying to think of every possible unsafe bit of input an attacker could possibly throw at you (black-list approach).
In terms of why you'd use one over the other, consider that the AntiXSS library gets released more often than the ASP.NET framework - since, as David Stratton says 'someone is always trying to think of new ways of breaking in', when someone does come up with one the AntiXSS library is much more likely to get an updated release to defend against it.
The following are the differences between Microsoft.Security.Application.AntiXss.HtmlEncode and System.Web.HttpUtility.HtmlEncode methods:
Anti-XSS uses the white-listing technique, sometimes referred to as the principle of inclusions, to provide protection against Cross-Site Scripting (XSS) attacks. This approach works by first defining a valid or allowable set of characters, and encodes anything outside this set (invalid characters or potential attacks). System.Web.HttpUtility.HtmlEncode and other encoding methods in that namespace use the principle of exclusions and encode only certain characters designated as potentially dangerous, such as the <, >, & and ' characters. (A toy sketch contrasting the two approaches follows this list.)
The Anti-XSS library's list of white (or safe) characters supports more than a dozen languages (Greek and Coptic, Cyrillic, Cyrillic Supplement, Armenian, Hebrew, Arabic, Syriac, Arabic Supplement, Thaana, NKo and more).
The Anti-XSS library has been designed specifically to mitigate XSS attacks, whereas the HttpUtility encoding methods were created to ensure that ASP.NET output does not break HTML.
Performance - the average delta between AntiXss.HtmlEncode() and HttpUtility.HtmlEncode() is +0.1 milliseconds per transaction.
Anti-XSS Version 3.0 provides a test harness which allows developers to run both XSS validation and performance tests.
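To make the white-list/black-list distinction in the first item concrete, here is a toy illustration (in Java rather than C#, and emphatically not the AntiXSS library's actual algorithm or character set): the black-list version encodes only characters it knows to be dangerous, while the white-list version passes through only characters it knows to be safe and encodes everything else.

```java
// Toy contrast of the two encoding philosophies; the real libraries use much
// larger, carefully chosen character sets.
final class EncodingStyles {

    /** Black-list style: encode only characters known to be dangerous. */
    static String blacklistEncode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '<' -> out.append("&lt;");
                case '>' -> out.append("&gt;");
                case '&' -> out.append("&amp;");
                case '"' -> out.append("&quot;");
                case '\'' -> out.append("&#39;");
                default -> out.append(c);          // everything else passes through untouched
            }
        }
        return out.toString();
    }

    /** White-list style: pass through only characters known to be safe;
     *  encode everything else as a numeric character reference. */
    static String whitelistEncode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean safe = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
                    || (cp >= '0' && cp <= '9') || cp == ' ';
            if (safe) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The practical difference: a character nobody thought to black-list slips straight through blacklistEncode, whereas whitelistEncode has to be explicitly told a character is safe before it will let it through.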
Most XSS vulnerabilities (any type of vulnerability, actually) are based purely on the fact that existing security did not "expect" certain things to happen. Whitelist-only approaches are more apt to handle these scenarios by default.
We use the white-list approach for Microsoft's Windows Live sites. I'm sure that there are any number of security attacks that we haven't thought of yet, so I'm more comfortable with the paranoid approach. I suspect there have been cases where the black-list exposed vulnerabilities that the white-list did not, but I couldn't tell you the details.

Are computer languages copyrighted? Can I make a compiler or IDE or anything for any of them?

Are computer languages copyrighted, or do they have some restrictions imposed on how they can be used? What does that mean in practice? If so, what can and cannot be done? Could I make a compiler or IDE or anything for any of them?
For example, for PL/SQL?
Unfortunately, programming languages may be encumbered by patents. This appears to be the case e.g. with the Aikido language.
Just recently this seems to have become a non-issue for the C# programming language (and the .NET Common Language Infrastructure).
To answer your question regarding what can and what cannot be done: if in your implementation of the language you use an invention that somebody patented, you definitely don't want to try to make a profit with your implementation in any country where the patent applies (unless you license the tech, of course). However, if you can circumvent the patent, i.e., implement for example a compiler for the same language without using that specific trick but something else, then you should not have a problem. Patents need (well, should need) to be very specific, so this might often be possible. (IANAL, though.)
You really need to familiarize yourself with copyright. Copyright applies to works of art: writings, paintings, etc. So the programming language itself cannot be copyrighted. The text describing it usually is, but that only prevents you from copying that text - it doesn't prevent you from reading it, understanding it, and using it.
So for PL/SQL, it's probably the case that its description is copyrighted by Oracle, but that can't stop you from making compilers and IDEs. As Pukku points out: there are other kinds of intellectual property, such as patents and trade marks, which may prevent you from doing these things (or calling them PL/SQL when done), but not copyright.

What are the main issues in designing an interpreter for a functional language?

Suppose I want to implement an interpreter for a functional language. I would like to understand the issues involved in doing so and what suitable literature is available. This is a new language that is in the early design stages, which is why the question is broad in scope.
For the purpose of this discussion we can assume that the purpose of the language is not important and that its functional features can be changed (even drastically) if it makes a significant difference in the ease of writing an interpreter.
The MIT website has an online copy of Structure and Interpretation of Computer Programs as well as videos of the MIT 6.001 lectures using Scheme, recorded at HP in 1986. These form a great introduction to language design.
I would highly recommend Structure and Interpretation of Computer Programs (SICP) as a starting point. This book will introduce the idea of what it means to write an interpreter (and a compiler), and is generally a must-read for anybody designing languages.
Implementing an interpreter for a functional language isn't likely to be too much different from implementing an interpreter for any other general purpose language. There's lexical analysis, parsing, AST construction, semantic analysis, plus execution (for a pure interpreter) or code generation and optimisation (for a compiler, even compiling to bytecode like Java/Perl/Python). SICP will introduce the difference between "applicative order" and "normal order" evaluation, which may be important for you in a pure functional context.
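As a concrete (and deliberately tiny) illustration of the execution part, here is a sketch of an AST-walking evaluator for a minimal lambda-calculus-style core, written in Java with applicative-order (call-by-value) evaluation; the AST shape and all names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal AST-walking evaluator: numbers, variables, single-argument
// lambdas, and application, evaluated in applicative order.
final class TinyEval {

    sealed interface Expr permits Num, Var, Lam, App {}
    record Num(double value) implements Expr {}
    record Var(String name) implements Expr {}
    record Lam(String param, Expr body) implements Expr {}
    record App(Expr fn, Expr arg) implements Expr {}

    sealed interface Value permits NumV, Closure {}
    record NumV(double value) implements Value {}
    record Closure(String param, Expr body, Map<String, Value> env) implements Value {}

    static Value eval(Expr e, Map<String, Value> env) {
        return switch (e) {
            case Num n -> new NumV(n.value());
            case Var v -> env.get(v.name());
            case Lam l -> new Closure(l.param(), l.body(), env);   // capture the environment
            case App a -> {
                Value fn = eval(a.fn(), env);
                Value arg = eval(a.arg(), env);                    // applicative order: argument first
                Closure c = (Closure) fn;
                Map<String, Value> inner = new HashMap<>(c.env());
                inner.put(c.param(), arg);
                yield eval(c.body(), inner);
            }
        };
    }

    public static void main(String[] args) {
        // ((lambda (x) x) 42) evaluates to 42.0
        Expr program = new App(new Lam("x", new Var("x")), new Num(42));
        System.out.println(eval(program, new HashMap<>()));
    }
}
```

A normal-order variant would, roughly, replace the eager evaluation of a.arg() with a thunk that is only forced when the parameter is actually used, which is where the applicative/normal-order distinction mentioned above starts to matter in the implementation.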
For just about any language interpreter or compiler, the main issues are the same, I think.
You need to decide certain basic characteristics of the language (semantics, not syntax), and the bulk of the design of the thing follows from that.
For example, does your language have a type system? If so, what sorts of types does it have? Is it going to be statically typed, dynamically typed, duck-typed?
What sort of expressions are you planning to support? Do you need to define an order of operations? Will you even have operators?
What will you use as the run-time representation of the program? Will you convert the text to a byte-code representation, or an AST, or a tokenized form of the source text?
There are toolkits available to help take some of the tedium out of the actual parsing of text (ANTLR and Bison, to name two), but I don't know of anything that helps with the actual interpretation part of the task. I'm sure somebody will suggest something.
The main issue is having a semantics for the language you're implementing -- with that, the implementation becomes straightforward. Otherwise, this question is incredibly broad and hard to answer.
I'd recommend Essentials of Programming Languages as a good complement to SICP, particularly if you're interested in interpreters: Official EOPL site. You may want to check out the third edition; the site hasn't been updated for it yet.
Edit: spam prevention is making me choose between links, so the official page is no longer linked. It's easily Google-able, though.
