Drupal with Texy and GeSHi

I have my Drupal installation set up to use both Texy for markup (hand-writing HTML soon gets tedious) and GeSHi for syntax highlighting (which is about the only syntax highlighter I found for Drupal at that time).
The problem now is that since the last update to Texy, seemingly nothing really works anymore. I spent a long time a while ago trying to convince the two to work together, but it was a pretty flaky setup. Depending on the order in which the two filters are evaluated, I either get no syntax highlighting at all, escaped HTML output, or disappearing line breaks; sometimes it does indeed work.
I am now at a point where it almost works again, but with exceptions. Ideally GeSHi would take care of the code, while Texy handles the rest, but that's not the case. The nice regular expression
[1-9][0-9]*[WDwd][1-9][0-9]*(\+[1-9][0-9]*)?
gets the part between the first two asterisks italicized, since Texy runs over that part as well. Which is unfortunate, since it changes the meaning of the regex.
Is there anyone out here who has insight into how to properly set up multiple input filters in Drupal and how to handle non-HTML markup and syntax highlighting simultaneously? Since I currently have to go over every post I've written that contains code (nearly all of them), it would probably not be much less work to completely redo every page on the site in search of a better setup. As for syntax highlighting, I mostly need the usual common languages, such as C#, Java, etc., but also more esoteric ones like Windows batch files or PowerShell. Simply dumping unhighlighted code there isn't very pretty.
So, actually I have two questions here:
How can one convince multiple input filters to work without interfering with each other, specifically Texy and GeSHi?
What other options are out there that meet my requirements of easy-to-write non-HTML markup [1] and syntax highlighting even for lesser-used languages [2]?
[1] Often I just need emphasis and strong emphasis, sometimes headings, often images, sometimes also tables. Oh, and usually code :-)
[2] The Stack Overflow-like guessing for syntax highlighting doesn't work very well for most code; it just works well enough to be a little pleasing.

To answer question 2: I have had good results with Markdown and GeSHi. I have no experience with Texy.
What you must pay very close attention to is the combination of filter-format settings, filter-format ordering and filter-format permissions. For your problem, I would suggest the following input formats:
Basic HTML (default). Used for comments and so on.
Markdown (for editors; does what you describe)
Raw HTML (no filtering at all; useful for webmasters etc.)
Then configure them as follows, in this order:
Basic HTML:
URL filter
HTML filter: only allow inline elements such as em, strong and a. Maybe a very few more, but not br, p and such.
Line break filter.
Markdown:
HTML filter: strips all tags except the "code" tags for GeSHi.
Markdown filter.
GeSHi filter.
This implies that Markdown has no limits: people can use Markdown to create H1 tags, for example. If you want to limit what Markdown can do, you must place the HTML filter after the Markdown filter. In that set-up, Markdown will convert to full HTML first, and the HTML filter will then strip the disallowed tags.
Since GeSHi requires non-standard code tags, you will want to let them fall through. And since GeSHi adds a bucketload of spans, divs and color-coded style elements, you will always need to put its filter after the HTML filter, to avoid those spans etc. being removed again.
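To illustrate why that ordering matters, here is a minimal sketch of what the GeSHi library itself produces (this is the library called directly, not a Drupal filter; it assumes geshi.php from the GeSHi package is on the include path):

<?php
// Minimal sketch: highlight a snippet with the GeSHi library directly.
// Assumes geshi.php from the GeSHi package is on the include path.
require_once 'geshi.php';

$source = 'for ($i = 0; $i < 10; $i++) { echo $i; }';

$geshi = new GeSHi($source, 'php');        // source code plus language name
$geshi->set_header_type(GESHI_HEADER_DIV); // wrap the output in a div instead of a pre

// The result is HTML full of span elements with inline styles; if an HTML
// filter runs after this point, that markup (and the highlighting) is stripped.
echo $geshi->parse_code();

If the highlighted output is then run through a strict HTML filter, all of those spans disappear and you are back to unstyled code.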

Related

How do I output HTML form data to PDF?

I need to collect data from a visitor in an HTML form and then have them print a document with the appropriate fields pre-populated. They'll need to have a couple of signatures on the document, so it has to be printed.
The paper form already exists, so one idea was to scan it in, with nothing filled out, as an image. I would then have the HTML form data print out using CSS for positioning and using the blank scanned form as a background image.
A better option, I would think, would be to automatically generate the PDF with this data, but I'm not sure how to accomplish either.
Suggestions and ideas would be greatly appreciated! =)
I would have to respectfully disagree with Osvaldo. Using CSS to align on a printed document would take ages to get working efficiently across browsers. Plus, if Microsoft comes out with a new browser, you're going to have to constantly update it for each new browser that comes into use.
If you know any PHP (and if you know JavaScript and HTML, basic PHP is very simple), here's a good library you can use: FPDF.
Thankfully, PHP doesn't deprecate a whole lot of methods and the total code is less than 10 lines if you have to go in and change things around.
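As a rough sketch of the scanned-form idea from the question, done with FPDF (the file name, coordinates and field names below are placeholders; adjust them to your actual form):

<?php
// Rough sketch: overlay submitted form data on a scanned blank form using FPDF.
// 'fpdf.php', 'blank_form.png' and the $_POST field names are placeholders.
require('fpdf.php');

$pdf = new FPDF('P', 'mm', 'A4');
$pdf->AddPage();

// Use the scanned blank form as a full-page background image (A4 is 210 x 297 mm).
$pdf->Image('blank_form.png', 0, 0, 210, 297);

// Position each field by hand; coordinates are in millimetres.
$pdf->SetFont('Arial', '', 11);
$pdf->SetXY(40, 60);
$pdf->Write(5, $_POST['full_name']);
$pdf->SetXY(40, 75);
$pdf->Write(5, $_POST['address']);

// Send the PDF to the browser so the visitor can print and sign it.
$pdf->Output();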
You can control printed documents acceptably well with CSS, so I would suggest you try that option first, because it's easier.
This is actually a great PHP library for converting HTML to PDF documents: http://code.google.com/p/dompdf/ (there are many demos available on the site).
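A minimal use of dompdf (using the dompdf_config.inc.php bootstrap the Google Code releases shipped with; the HTML string is just a placeholder) looks roughly like this:

<?php
// Minimal sketch of converting an HTML string to PDF with dompdf.
require_once 'dompdf_config.inc.php';

$html = '<h1>Application form</h1><p>Name: John Doe</p>';

$dompdf = new DOMPDF();
$dompdf->load_html($html);
$dompdf->set_paper('letter', 'portrait');
$dompdf->render();
$dompdf->stream('form.pdf'); // prompts the browser to download form.pdf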
XSL-FO is what I would recommend. XSL-FO (along with XSLT and XPath) is part of the XSL family of standards and was designed to be an abstract representation of a formatted document (containing text, graphic elements, fonts, styles, etc.).
XSL-FO documents are valid XML documents, and there exist tools and APIs that allow you to convert an XSL-FO document to MS Word, PDF, RTF, etc. Depending on the technology you use, a quick Google search will tell you what is available.
Here are a few links to help you get started with XSL-FO:
http://en.wikipedia.org/wiki/XSL_Formatting_Objects
http://www.w3schools.com/xslfo/xslfo_intro.asp
http://www.w3.org/TR/xsl11/
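The actual rendering is usually handed off to an XSL-FO processor such as Apache FOP. As a rough sketch, assuming the fop command-line tool is installed and on the PATH (the file names are placeholders), driving it from PHP could look like this:

<?php
// Rough sketch: render an XSL-FO document to PDF with the Apache FOP command-line tool.
$foFile  = 'document.fo';   // placeholder input file
$pdfFile = 'document.pdf';  // placeholder output file

$cmd = sprintf('fop -fo %s -pdf %s',
    escapeshellarg($foFile),
    escapeshellarg($pdfFile));

exec($cmd, $output, $exitCode);

if ($exitCode !== 0) {
    echo "FOP failed:\n" . implode("\n", $output);
}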

Styling footnotes for markdown

I've been using footnotes in markdown pages as mentioned in this post on DaringFireball, but I can't seem to figure out how to get them styled the way I want. Adding .footnotes {} to my style sheet allows me to style the footnote text, but I'm missing two things:
In Gruber's post, his footnote's backlink is given a class that can be styled as "a.footnoteBackLink", but my page simply produces <a href="link" rev="footnote">. I don't know how to target this in CSS, and I have no idea how I'd change it so that my Markdown page outputs anything different. Also, my backlink goes to a separate line, and I'd like to have it on the same line. Perhaps this is an issue with the Markdown engine; I'm using Maruku (I think), and I could probably figure out how to change it if I knew which one I should use instead.
No matter what I put in the footnote brackets, the page outputs numbered footnotes. How can I tell it to use asterisks or other symbols? Most pages with footnotes will have only one or two, and symbols are generally correct when there are fewer than seven footnotes, so I'd like to do things properly.
I probably shouldn't even say this, but I've been teaching myself web development for the past couple of months and I absolutely could not have done it without SO. This is the first time I haven't found my answer here, so it's my first time asking. I love you don't get mad at me please.
There is a great variety of conversion tools out there, and each may handle this differently. I found Remarkable a good choice for your issue: it adds the class "footnote-item" to each footnote.
Check the live demo and inspect the HTML Output.

What are appropriate markup languages for users with disabilities?

Suppose you're developing a web site and blind users will be a significant chunk of your target market. If the web site includes document editing functionality, what would be appropriate WYSIWYM tools? Are languages like Markdown, Textile and Wiki Formatting really accessible or are they inconvenient to blind users?
I'm a blind programmer, and while I haven't used most of the languages you mention, I've found that any markup language is fairly easy to use if you have the desire to learn it. I've had no problem using either HTML or several wiki markup languages. Part of it will depend on how invested the users are in your site. If it's a site that will be visited infrequently or for short periods of time, it's much less likely that a user will take the time to learn the required markup, whether they are blind or not. Unfortunately, I have not found an accessible JavaScript WYSIWYG editor, but I find it easier to manually enter the markup, so I haven't looked very hard.
The first question is: how important is semantic structure? Could you get away with plain text? You could do simple parsing, like treating blank lines as paragraph markers, treating a series of lines which begin with * as a bulleted list, identifying URLs and turning them into links, etc.
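A very rough sketch of that kind of plain-text parsing (the rules are deliberately simplistic and the function name is made up):

<?php
// Very rough sketch: turn plain text into HTML using a few simple conventions.
// Blank lines separate paragraphs, lines starting with "*" become list items,
// and bare URLs become links. Not a complete or robust parser.
function plain_text_to_html($text) {
    $blocks = preg_split('/\n\s*\n/', trim($text));
    $html = '';

    foreach ($blocks as $block) {
        $block = htmlspecialchars($block, ENT_QUOTES, 'UTF-8');

        // Turn bare URLs into links.
        $block = preg_replace('#(https?://\S+)#', '<a href="$1">$1</a>', $block);

        $lines = explode("\n", $block);
        if (preg_match('/^\s*\*/', $lines[0])) {
            // A block whose lines start with "*" becomes a bulleted list.
            $items = array_map(function ($line) {
                return '<li>' . trim(preg_replace('/^\s*\*\s*/', '', $line)) . '</li>';
            }, $lines);
            $html .= '<ul>' . implode('', $items) . '</ul>';
        } else {
            // Everything else becomes a paragraph.
            $html .= '<p>' . implode('<br />', $lines) . '</p>';
        }
    }

    return $html;
}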
As a blind developer myself, I have no problem in understanding languages like Markdown. But if it's a syntax I'm unfamiliar with, I'll only learn it if I expect to use the site very often, or care deeply about the content.
Two final thoughts come to mind: while I certainly experience some accessibility challenges using TinyMCE, you could develop something much simpler - provide fewer than 10 formatting options, like inserting hyperlinks, making lists, centering text, setting the style (such as headings), etc.
And lastly, when I talk to non-technical blind people, they often just write their content in Word and paste into a wiki or blog post. This sounded strange when I first heard it, but it does make sense. So an ideal solution would accept pasted in content.
In closing - it depends how important this is, and how much effort you want to expend. Maybe a Markdown editor with a live preview (like on this site), buttons for inserting simple formatting like URLs, and the ability to paste in rich text would tick all boxes :-)
On a web page, the most accessible embedded text editor for blind users is one that uses standard HTML, such as a <textarea> element, with a corresponding <label> element:
<label for="editor">Enter your text here using wiki markup:</label>
<textarea id="editor"></textarea>
If a WYSIWYM tool is built using standard accessible HTML, then blind users can easily enter text into it, with full confidence that they're entering text in the right place. Then the question becomes: which is the better markup language? They all require memorization, but some may be more intuitive than others. One way to find out which is best would be to do some usability testing with a wide variety of target users. Also be sure to provide easy, accessible access to syntax help.
Picture yourself working on a pure-text 80x4 display (just open a console and resize appropriately), then use vi/emacs/ed and you'll soon realize what markup will get in the way.
Try to do as much work as possible to understand plain text; otherwise use light markup like POD. Finally, things like AsciiDoc are very powerful but need training.
I don't know about WYSIWYG/WYSIWYM tools, but I do know that complying with W3C standards (especially their HTML5 and CSS3 drafts) while writing your own editor code will help a lot.
In CSS you can specify speed and intonation of speech. In HTML you can specify alternative text (alt attribute in many elements) that screen readers are compatible with. Be sure to know when to use the abbr and the acronym elements. Use the former when you want the screen reader to read the meaning of an abbreviation and the latter when the acronym should be read as a word (e.g. ASAP, NATO and OS).
For the editor itself, I recommend creating a WYSIWYG editor that uses divs and spans; screen readers will easily understand the structure of the document. For the current line, use a text box; for every other line that's not being edited, convert the contents immediately to valid HTML.
If you find a good tool, be sure to post it here. I'm looking for one too. :-)

What is the best way to get clean semantic XHTML from MS word documents?

Some days ago I received a rather lengthy and somewhat elaborate MS Word document, which I was asked to convert to HTML for uploading to a 3rd party's website. My first instinct was to save the Word document as HTML and use Dreamweaver's "Clean Up Word HTML" command. But not only did I have to leave it running all night for Dreamweaver to finish "cleaning", the results were far from desirable in my opinion. There were still a lot of leftover inline styles, etc. that Dreamweaver just plain missed.
I approached it differently this morning and just selected the entire document in Word, copied it, and then pasted it into Dreamweaver's Design window. Not only was it much, much faster, but the output code was much, much cleaner! I didn't have to run the "Clean Up Word HTML" command afterwards either.
Now I don't ever convert a Word file straight to HTML, for standards reasons. Instead I cut and paste content between Word and Dreamweaver. Happily, I can do the following:
If a Word heading is in the Heading 1 Style, it will become an H1 in Dreamweaver (following the Dreamweaver stylesheet). Similarly Heading 2 becomes H2, Heading 3 becomes H3 and so forth.
If the Word author wasn't that organized, you can use a shortcut like Control+1 (or Command+1 on a Mac) to convert any line to an H1. Can you guess the shortcut for H2? Yes, it's Control+2 (or Command+2 on a Mac).
Paragraphs now cut and paste as paragraphs (with the P tag). If you don't want an HTML paragraph right there, use Control+0 (or Command+0 on a Mac) to remove it in Dreamweaver.
A new one I discovered is that some embedded images in Word may be transferred to your Dreamweaver site as "clip" images when you copy and paste from Word. So, if you have a Word file with embedded images, you may be able to extract them fairly quickly via Dreamweaver.
I also found this free tool useful: http://www.textfixer.com/html/convert-word-to-html.php (it works much like the Design view of Dreamweaver, which is handy for people who don't have Dreamweaver).
But the code we get depends on how properly formatted the MS Word document is, right?
Does Word 2007 also have styles like HTML: headings, tables, ordered and unordered lists, bold, italic, hyperlinks, etc.?
How should Word 2007 be used semantically
to get the most semantic HTML possible from the "Save as HTML" option,
to get the cleanest possible code when copying into Dreamweaver's Design view, and
to get the cleanest possible code when pasting into the browser-based WYSIWYG HTML editor that comes with every CMS?
Does anyone know any tips, tricks, tutorials, articles or advice on formatting MS Word documents semantically?
Or any better way than mine?
HTML Tidy has options for this: word-2000, bare and clean.
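For instance, one way to reach those options from PHP is the tidy extension (just a sketch; the input file name is a placeholder):

<?php
// Sketch: clean up Word-generated HTML with PHP's tidy extension,
// using the same options the command-line tool exposes.
$config = array(
    'word-2000'    => true,  // strip Word 2000 (and later) cruft
    'bare'         => true,  // strip Microsoft-specific markup
    'clean'        => true,  // replace presentational markup with CSS
    'output-xhtml' => true,
);

$tidy = new tidy();
$tidy->parseFile('word_export.html', $config, 'utf8'); // placeholder file name
$tidy->cleanRepair();

echo tidy_get_output($tidy); // the cleaned-up document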
FCKEditor and similar try to clean up code pasted from Word.
There's also the (now rather old) demoroniser.
However, don't expect miracles. It's unlikely that a Word document will have decent structure (it theoretically could, but no Word user bothers with this). These programs can't add semantic information if it's not there.
As for semantic editing in Word – use styles. It supports headings properly (sadly not much else). You can check that in Outline view.
You don't need – and shouldn't use – spaces or line breaks for indentation or space adjustment. Word has the ability to explicitly control paragraph padding.
I've found that the OpenOffice.org html generator (Open .doc in OO and save as HTML) works better than MS's in Office.
It's still not perfect, but gives MUCH cleaner HTML that's much more sane to look at.
There is no dependable way to clean up Word docs and make them into nice HTML. If the document has any special characters, they are often encoded in a Windows charset instead of UTF-8, so they just "break" when displayed online. The list goes on. You often end up with silliness like:
<strong>hello</strong><strong>th<strong>er</strong>e</strong><i></i>
The only dependable method is to paste it into Notepad and mark it up manually. You can write a few macros to do things like insert <p></p> at paragraph breaks, but that's about it.
If there is a huge volume of material that needs to go online from Word, you may be better off using a PDF.
Have you tried this? Word Cleaner
Try our Doc To HTML Converter software. It was designed specifically to produce the cleanest possible (X)HTML code and has many customizable options. It requires MS Word to be installed on your system. It is not free, but it has a 30-day trial period.

HTMLEncode script tags only

I'm working on StackQL.net, which is just a simple web site that allows you to run ad hoc tsql queries on the StackOverflow public dataset. It's ugly (I'm not a graphic designer), but it works.
One of the choices I made is that I do not want to html encode the entire contents of post bodies. This way, you see some of the formatting from the posts in your queries. It will even load images, and I'm okay with that.
But I am concerned that this will also leave <script> tags active. Someone could plant a malicious script in a stackoverflow answer; they could even immediately delete it, so no one sees it. One of the most common queries people try when they first visit is a simple Select * from posts, so with a little bit of timing a script like this could end up running in several people's browsers. I want to make sure this isn't a concern before I update to the (hopefully soon-to-be-released) October data export.
What is the best, safest way to make sure just script tags end up encoded?
You may want to modify the HTMLSanitize script to fit your purposes. It was written by Jeff Atwood to allow certain kinds of HTML to be shown. Since it was written for Stack Overflow, it should fit your purpose as well.
I don't know whether it's 'up to date' with what Jeff currently has deployed, but it's a good starting point.
Don't forget onclick, onmouseover, etc., or javascript: pseudo-URLs (<img src="javascript:evil!Evil!">), or CSS (style="property: expression(evil!Evil!);"), or…
There are a host of attack vectors beyond simple script elements.
Implement a white list, not a black list.
If the messages are in XHTML format then you could do an XSL transform and encode/strip the tags and attributes that you don't want. It gets a little easier if you use something like TinyMCE or CKEditor to provide a WYSIWYG editor that outputs XHTML.
What about simply breaking the <script> tags? Escaping only < and > for that tag, ending up with &lt;script&gt;, could be one simple and easy way.
Of course links are another vector. You should also disable every instance of href='javascript:', and every attribute starting with on*.
Just to be sure, nuke it from orbit.
But I am concerned that this will also leave <script> tags active.
Oh, that's just the beginning of HTML ‘malicious content’ that can cause cross-site scripting. There's also event handlers; inline, embedded and linked CSS (expressions, behaviors, bindings), Flash and other embeddable plugins, iframes to exploit sites, javascript: and other dangerous schemes (there are more than you think!) in every place that can accept a URL, meta-refresh, UTF-8 overlongs, UTF-7 mis-sniffing, data binding, VML and other non-HTML stuff, broken markup parsed as scripts by permissive browsers...
In short any quick-fix attempt to sanitise HTML with a simple regex will fail badly.
Either escape everything so that any HTML is displayed as plain text, or use a full parser-and-whitelist-based sanitiser. (And keep it up-to-date, because even that's a hard job and there are often newly-discovered holes in them.)
But aren't you using the same Markdown system as SO itself to render posts? That would be the obvious thing to do. I can't guarantee there are no holes in Markdown that would allow cross-site scripting (there certainly have been in the past and there are probably some more obscure ones still in there as it's quite a complicated system). But at least you'd be no more insecure than SO is!
Use a regex to replace the script tags with encoded tags. This will match the tags that contain the word "script" and HTML-encode them. Thus, all script tags such as <script>, </script>, <script type="text/javascript"> etc. will get encoded, while other tags in the string are left alone.
Regex.Replace(text, #"</?(\w+)[^>]*>",
tag => tag.Groups[1].Value.ToLower().Contains("script") ? HttpUtility.HtmlEncode(tag.Value) : tag.Value,
RegexOptions.Singleline);

Resources