Boilerpipe's HtmlHighlighter in .NET does not always return the text - text-extraction

I am using Boilerpipe in my application. When I extract content with ArticleExtractor I get plain text only; all the HTML formatting has been removed, so I am trying HtmlHighlighter instead. However, the process method of HtmlHighlighter fails for certain URLs.
Is there any way to pass an HTML string to this method? Can anybody explain?

You can use IKVM to convert the Boilerpipe jar into a DLL for use in your .NET applications. I am using this approach, and it works fine when passing HTML through the different Boilerpipe methods.
If the page content you are trying to access is loaded by JavaScript, a simple HTTP request can't retrieve it.
You first need to get the resulting HTML after the JavaScript has run, and then give that to Boilerpipe.

Related

How can we save multimedia components using external resource types if the URL doesn’t end with a file extension?

We have a Tridion use case related to curated content where we are creating multimedia components for images associated with our content which are pointing to External resource types instead of uploaded resource types.
One of the issues we have run into with this use case is that, despite explicitly setting the Multimedia Type for the resource, if the URL of the image either has a query string in it: http://cdn.hw.net/UploadService/1c8b7f28-bb12-4e02-b888-388fdff5836e.jpg?w=160&h=120&mode=crop&404=default or uses a ‘friendly url’: http://www.somewhere.com/images/myimage/ then when we save the component, Tridion barfs with error messages similar to: ‘Invalid value for property 'Filename'. Unexpected file extension: jpg?w=160&h=120&mode=crop&404=default. Expecting: jpg,jpeg,jpe.’
So far, the only way we’ve been able to figure out to potentially get around this issue is to do something hacky like appending an extra query string parameter to the very end of the URLs so that they end with the expected file extension: http://cdn.hw.net/UploadService/1c8b7f28-bb12-4e02-b888-388fdff5836e.jpg?w=160&h=120&mode=crop&404=default&ext=.jpg. Obviously, this is not the best solution, and in fact it may not work for some images if the site they are being served from strictly validates the requested URL.
Does anyone have any ideas on how we can work around this issue?
Unfortunately I can't really think of an easy solution to this, since Tridion "detects" the MIME type by checking the file extension.
You could perhaps add the extension while saving and remove it when reading (via the Event System)? It is definitely a worthwhile enhancement request. To my knowledge this behavior has not been changed for the upcoming Tridion 2013... (See the comment below: it has been changed for 2013.)
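As a rough illustration of the add-on-save / remove-on-read idea, here is a minimal sketch of the string handling only. The class name, the method names and the `ext` parameter are hypothetical, and the wiring into the Tridion Event System is not shown, so treat this as a starting point rather than a tested solution.

```csharp
using System;

public static class ExternalUrlExtension
{
    // Hypothetical marker parameter; it mirrors the "&ext=.jpg" workaround from the question.
    private const string Marker = "ext=.";

    // Appends e.g. "&ext=.jpg" (or "?ext=.jpg" when there is no query string yet)
    // so that the URL ends with the extension Tridion expects.
    public static string AddMarker(string url, string extension)
    {
        string separator = url.Contains("?") ? "&" : "?";
        return url + separator + Marker + extension;
    }

    // Removes a marker previously added by AddMarker.
    // URLs without a trailing marker are returned unchanged.
    public static string RemoveMarker(string url)
    {
        int index = url.LastIndexOf(Marker, StringComparison.Ordinal);
        if (index < 1) return url;

        char separator = url[index - 1];
        if (separator != '?' && separator != '&') return url;

        // Only strip the marker when it is the last query string parameter.
        string tail = url.Substring(index + Marker.Length);
        if (tail.IndexOf('&') >= 0) return url;

        return url.Substring(0, index - 1);
    }
}
```

The save handler would call AddMarker before the component is stored, and whatever reads the URL back (a template, for example) would call RemoveMarker, so the site serving the image never sees the extra parameter.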
+1 for Nuno's answer. Recognizing that the title of your question is specific to multimedia components, you may want to consider another approach, which is to use normal Components instead of Multimedia Components. You can create a normal component schema called something like "External Image" that has an External Url field to store your extensionless URL.
Content authors will then include these images via regular component linking mechanisms in the Tridion GUI.
You will then need a custom link resolver TBB that parses the Output item (via Regex) looking for Tridion anchor tags such as <a tridion:href="tcm:x-y-z"> and replaces each one with an <img src=...> tag whose src path comes from the linked component.
For an example of a similar approach, but with videos, and sample code for a custom link resolver TBB have a look at the code in the following post: http://www.tridiondeveloper.com/integration-sdl-tridion-jw-media-player.
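The regex step of such a link resolver could look roughly like the sketch below. It only covers the string replacement: the class name and the lookupExternalUrl callback are hypothetical, and reading the Output item and the linked component through the Tridion templating API is deliberately left out.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ExternalImageLinkResolver
{
    // Matches <a ... tridion:href="tcm:x-y-z" ...>...</a> and captures the TCM URI.
    private static readonly Regex TridionAnchor = new Regex(
        "<a\\s[^>]*?tridion:href=\"(?<uri>tcm:[\\d-]+)\"[^>]*>.*?</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // lookupExternalUrl is a hypothetical callback: given a TCM URI it returns the
    // External Url field of the linked component, or null for any other component.
    public static string Resolve(string output, Func<string, string> lookupExternalUrl)
    {
        return TridionAnchor.Replace(output, match =>
        {
            string url = lookupExternalUrl(match.Groups["uri"].Value);

            // Leave ordinary component links untouched.
            if (url == null) return match.Value;

            return "<img src=\"" + WebUtility.HtmlEncode(url) + "\" />";
        });
    }
}
```

Returning null from the callback keeps regular component links intact, so the default link resolver can still process them afterwards.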

XML and XSLT to generate CSS?

I want to give users the ability to change the CSS.
My first thought is that storing the CSS as XML will make it easier to read and understand.
My second is that with XSLT I will be able to generate the CSS (am I right? Will that be useful?).
Lastly, when the user changes the CSS, the XML file can be updated and then used.
This is still at a very rough level. I am using ASP.NET; can someone please tell me whether my understanding is correct, how I should approach this, and what the pros and cons are?
Would something like the line below work? Is it possible?
<link rel="stylesheet" type="text/css" href="someserverfiletoprocessxmlusingxslt.aspx?user=id" />
That is possible; your ASPX page would need to return CSS with a MIME type of text/css.
However, it would be better to use an ASHX (Generic Handler) rather than an ASPX (Web Form).
Using an ASP.NET generic HTTP handler (ASHX) would be better. This is just a class that gives you access to the output stream (better for non-HTML output).
From there you can process the XML, transform it using XSLT and write/dump it on the output stream.
Might be a good idea to implement some kind of caching to enhance performance...
More info on generic handlers: http://www.brainbell.com/tutorials/ASP/Generic_Handlers_(ASHX_Files).html
Setting the method attribute of the xsl:output element to text will strip the resulting output of all XML tags and return it unencoded.
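To make the xsl:output method="text" suggestion concrete, here is a minimal sketch using XslCompiledTransform. The XML layout (styles / rule / property elements with selector, name and value attributes) is purely an assumption for illustration, as are the class and method names; the question does not say how the CSS would be stored.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class CssFromXml
{
    // Hypothetical stylesheet for an assumed <styles><rule><property/></rule></styles> layout.
    // method='text' makes the transform emit plain, unescaped text instead of XML.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='/styles'>
    <xsl:for-each select='rule'>
      <xsl:value-of select='@selector'/>
      <xsl:text> { </xsl:text>
      <xsl:for-each select='property'>
        <xsl:value-of select='@name'/>
        <xsl:text>: </xsl:text>
        <xsl:value-of select='@value'/>
        <xsl:text>; </xsl:text>
      </xsl:for-each>
      <xsl:text>}&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Transforms the per-user XML into a CSS string.
    public static string Transform(string stylesXml)
    {
        var xslt = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(xsltReader);
        }

        using (var input = XmlReader.Create(new StringReader(stylesXml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

In the ASHX handler you would then presumably set context.Response.ContentType to "text/css" and write the returned string to context.Response.Output, caching the result per user if the transform turns out to be a bottleneck.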

How to display XSL-transformed XML in ASP.NET page?

So far all the XML / XSLT I've worked with takes an XML document and transforms it to a standalone HTML webpage using an XSLT file.
In my web application, I'm using a web service to retrieve the XML document, which I need to render and make human-readable, and then insert that formatted content into a content placeholder in my master page.
The easiest way would be to append the XSLT to the retrieved XML file and link that to the content placeholder, but something tells me I can't just do that.
I took a look at these Stack Overflow pages, but they just want to render the straight XML whereas I want a transformed XML. Also, I need to be able to put it into my master page template.
This article shows how:
http://www.codeproject.com/Articles/37868/Beginners-Introduction-To-XSL-Transform-Rendering-XML-Data-using-XSL-Get-HTML-output.aspx
even if the spelling is as bad as mine...
Added
And here's another link that shows how, perhaps a bit more simply
http://www.aspfree.com/c/a/XML/Applying-XSLT-to-XML-Using-ASP.NET/2/

What is the best practice for using ASP.NET MVC to render lots of html or text files?

I have a lot of HTML pages, but I don't know how to display them through an ASP.NET MVC view.
I built a view as my template, and I use ASP.NET MVC to insert the HTML into the template and then render it.
The problem is that I have to use a FileStream to read the raw HTML files into memory and then put the content into the view template, like ViewData["content"] = ???.
I just want to know whether there are better ways to render static HTML files to the browser.
Did I describe the question clearly?
I guess you could do something like this:
using (var file = new StreamReader(htmlFileName))
{
    return Content(file.ReadToEnd());
}
Note that the MIME type defaults to text/html, but you can optionally specify which MIME type header should be sent by supplying the type as an additional argument to the Content method.
I guess you could also point an iframe element in your HTML directly at the target file URL.
Alternatively, you could write your own ActionResult that writes the contents of the file to Response.Output (this could avoid loading the entire file into memory at once, although that might not be a big issue).
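The part of the custom ActionResult idea that avoids loading the whole file can be sketched as a small helper that copies the file to a TextWriter in chunks. The class and method names are hypothetical; the assumption is that a custom ActionResult would call it from ExecuteResult and pass context.HttpContext.Response.Output as the writer, which is not shown here.

```csharp
using System.IO;

public static class HtmlFileStreamer
{
    // Copies the file to the writer one buffer at a time, so only
    // bufferSize characters are held in memory at once.
    public static void CopyTo(string path, TextWriter output, int bufferSize = 4096)
    {
        using (var reader = new StreamReader(path))
        {
            var buffer = new char[bufferSize];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, read);
            }
        }
    }
}
```

For small pages the simpler Content(file.ReadToEnd()) version is probably fine; chunked copying only starts to matter for large files or many concurrent requests.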

Export ASPX to HTML

We're building a CMS. The site will be built and managed by the users in ASPX pages, but we would like to generate a static site of HTML files from it.
The way we're doing it now is with code I found here that overloads the Render method in the Aspx Page and writes the HTML string to a file. This works fine for a single page, but the thing with our CMS is that we want to automatically create a few HTML pages for a site right from the start, even before the creator has edited anything in the system.
Does anyone know of any way to do this?
I seem to have found the solution to my problem by using the Server.Execute method.
I found an article that demonstrated the use of it:
TextWriter textWriter = new StringWriter();
Server.Execute("myOtherPage.aspx", textWriter);
Then I do a few manipulations on the textWriter output and insert it into an HTML file. Et voila! It works!
Calling the Render method is still pretty simple. Just create an instance of your page, create a stub WebContext along with the WebRequest object, and call the Render method of the page. You are then free to do whatever you want with the results.
Alternatively, write a little curl or wget script to download and store whichever pages you want to make static.
You could use wget (a command line tool) to recursively query each page and save them to html files. It would update all necessary links in the resulting html to reference .html files instead of .aspx. This way, you can code all your site as if you were using server-generated pages (easier to test), and then convert it to static pages.
If you need static HTML for performance reasons only, my preference would be to use ASP.NET output caching.
I recommend you do this a very simple way and don't do it in code. It will allow your CMS code to do what the CMS code should do and will keep it as simple as possible.
Use a product such as HTTrack. It calls itself a "website copier". It crawls a site and creates html output. It is fast and free. You can just have it run at whatever frequency you think is best.
It decouples your HTML output needs from your CMS design and implementation. It reduces complexity and gives you some flexibility in how you output the HTML without introducing failure points in your CMS code.
@ckarras: I would rather not use an external tool, because I want the HTML pages to be created programmatically and not manually.
@jttraino: I don't have a time interval at which the site needs to be output; the output has to happen when a user creates a new site.
@Frank Krueger: I don't really understand how to create an instance of my page using WebContext and WebRequest.
I searched for "wget" in searchdotnet and found a post about a .NET class called WebClient. It seems to do what I want if I use the DownloadString() method, which gets a string from a specific URL. The problem is that our CMS requires a login, so when the method tries to reach the page it is redirected to the login page and therefore returns the login.aspx HTML...
Any thoughts as to how I can continue from here?
