Custom Parser for Nutch (or open source .NET Crawler) - asp.net

I have been using Nutch/Solr/SolrNet for my search solutions and, I must say, it works a treat. On a new site I'm working on I am using master pages; as a result, content in the header and footer gets indexed and distorts the results. For example, I have a link to the Contact Us page in the header, so when I search for 'Contact' the results include every page on the site.
Is there a customizable Nutch parser that I can pass a div id to, so that it only indexes content inside that div?
Or are there .NET-based crawlers that I can customize?

See https://issues.apache.org/jira/browse/NUTCH-585
and https://issues.apache.org/jira/browse/NUTCH-961
BTW, you'd reach a more relevant audience by posting to the Nutch user list.

You can implement a Nutch filter (I like Jericho HTML Parser) to extract only the parts of the page you need to index using DOM manipulation. You can use the TextExtractor class to grab clean text (sans HTML tags) to be used in your index. I usually save that data in custom fields.

Related

WordPress Elementor Page issue with binding REST API

We are trying to integrate a REST API into a page created using Elementor and are having a hard time binding the values, because we cannot assign an id or name to HTML elements the way we would on a normal page when using JavaScript/jQuery for DOM (document object model) manipulation. We have to use syntax like this:
document.querySelectorAll('#advisor-contact > div > div> div > div')[1].querySelector('a > span').innerText
Such code is fragile and needs to be fixed every time the structure of the page changes. We read some articles on the web that say Elementor Pro has a way to attach tags to dynamic content. I am not sure whether that means giving an id to HTML elements on a page created using Elementor Pro or something else. Considering our use case, please suggest how we can bind the response of a REST API to a page created using Elementor. Please provide links to documentation if available.
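One way to avoid positional selectors is to give each target element a stable ID and drive the binding from a small field-to-selector map. The sketch below is only an illustration: the IDs (`#advisor-name`, `#advisor-phone`) and the response field names are hypothetical. It also assumes the page builder lets you assign an ID to a widget. Elementor has a "CSS ID" field in a widget's Advanced tab, but check this in your version. If the ID lands on a wrapper element, extend the selector to the inner text element.

```javascript
// Map API response fields to stable selectors instead of positional paths.
// The IDs and field names here are hypothetical placeholders.
const bindings = {
  name: '#advisor-name',
  phone: '#advisor-phone',
};

// Write each mapped field into the first element matching its selector.
// `root` is anything that implements querySelector (normally `document`).
// Returns the fields that could not be bound, so breakage is visible.
function bindResponse(root, data, bindings) {
  const missing = [];
  for (const [field, selector] of Object.entries(bindings)) {
    const el = root.querySelector(selector);
    if (el && data[field] !== undefined) {
      el.textContent = String(data[field]);
    } else {
      missing.push(field);
    }
  }
  return missing;
}
```

In the browser this would be called from the API callback, for example `bindResponse(document, responseJson, bindings)`. When the layout changes, only the map needs updating.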

Plone - embed the content of an internal link in a page

I have a list of pages that have to appear in different places of my Plone. If I use an internal link, I see an HTML link in the page but instead of that I would like to see the embedded content of the linked page.
I've tried to install some link plugins (Smart Link, vs.alias...) but I'm not able to find the solution.
I'm using Plone 4.3.
I don't know of any Plone plugin that satisfies your requirement.
A long time ago I wrote this small piece of JS to show internal links in a popup using Plone's prepOverlay.
To use it, put a custom CSS class named popup on the internal link with TinyMCE.
It simply shows the content area of the given URL.
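// Open every internal link marked with the "popup" class in an overlay.
$(function(){
    jq('a.popup').prepOverlay({
        subtype: 'ajax',
        // Append a selector to the URL so only the content area is loaded.
        urlmatch: '$',
        urlreplace: ' #content > *'
    });
});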
I guess this is a good starting point for your own implementation.
You could pick a criterion such as location or content type to distinguish which articles should be included (in the worst case, use collective.flag), fetch them with a collection so that you get the links as a result list, and set the collection's view to all_content, a nice feature introduced in the Plone 4 series.

SDL Tridion Schema Field "List of Links" Options

I'm looking to create an SDL Tridion schema with a list of repeatable links while avoiding multiple fields per link.
Hyperlink
In a rich text field I have the following options for creating a hyperlink:*
Component
Anchor
http://
mailto:
Other
When content authors create one of these hyperlinks, they have the option to select linked (visible) text as well as title and target attributes that function like typical HTML hyperlinks.
"Richtext" means a Text field with Height of the Text Area = at least 2 rows with Allow Rich Text Formatting selected.
Single Schema Field Link
When creating a single schema field, I see these options:
External Link (author options will include http://, mailto, Other)
Multimedia Link
Component Link (which can allow Multimedia Values)
Current Ideas
The best out-of-the-box (OOTB) setups I've found for this "list of links" offer either:
a single 2-line RTF with instructions to create a hyperlink (of any type) in that field
separate fields for each link type, plus additional fields for display name, target, and title (the fields are assembled through template code); authors fill in only one of the link fields (component link or external)
Question
Is there a way in the schema form designer, by updating the schema source, or through code to offer the same (RTF) hyperlink drop-down options, but in a single field? I could be missing something, but recognize this scenario isn't supported OOTB.
One question we are missing here is whether those links are going to be used individually somewhere else. If that's the case, multiple components would be my first choice, so that each component can be reused several times.
If you are planning to allow the editor to create a list of links that they are only going to use in a given component (not reusable), well, you have all the options mentioned in the previous answers.
To give you an idea on what's the best approach (in my humble opinion) here are things to consider:
Individual Components per link: use this approach if links are reusable.
Embedded schemas (with the link structure): this approach can be reused across different component types (schemas).
Custom URL / Single Line Text Field: this requires additional development effort, and it is very unlikely that you will keep the hard link references when creating internal links. As you know, SDL Tridion keeps a reference to the TCM ID in order to resolve links, trigger publishing, etc.
Custom URL / 2 Lines RTF: this will do the job, but you need to disable all the other RTF options in the Ribbon Tool Bar within the Schema RTF options, so that editors can only create links. You might also need to add an XSLT filter to check whether the editors entered something more than just links. These links are not reusable.
In general if you implement something custom (GUI extension + Custom URL) keep in mind all the TRIDION CMS concepts, like blueprinting (what happens when the link is inherited down), where used, etc...
My recommendation has always been to use Separated Components, but be careful with the link propagation when publishing...
I have seen this case at customers. If they want less development effort, a multivalue embedded field is a good idea.
You can have it as:
[text] Link Text
[Component Link] Link to anything
You would need an extra Content schema for External Links, like:
[External Link] Url
[text] target
[any extra option you need]
This means the editor would need to create a new External Link Component every time they create an external link. It is extra work, but it can also mean easier maintenance on the use of external urls within their site.
Lastly, the editor would just add multiple Component Links, which can be of schema External Link or any other. The template code then checks the schema of the linked Component and generates the output accordingly.
XML Name        Description     Field Type
[text]          Text            Text
[title]         Title           Text
[static_url]    External URL    Text
[component]     Internal URL    Component Link
In the field description for "External URL" and "Internal URL" you could add a comment so that the editor doesn't get confused: only one of these two fields should be filled in. The component's ID can be used to create the dynamic link in the DWT. This solution needs no development effort and is pretty much as intuitive for the editor as it can get. Of course, this would be a multivalue embedded schema field inside the Links schema.
This use-case might work using a Custom URL field and maybe a GUI extension. The idea is to have a Custom URL that opens a popup (which might be a GUI extension). In that popup, you would select/construct your link (maybe using the same options as a normal RTF link - Component, Anchor, mailto, etc).
The popup would return a specially crafted string. The format could be anything, even an actual anchor tag (but JSON is also fine). Example: {href:'tcm:1-2',type:'component'}.
Your Templates would interpret this string in order to generate something meaningful, like a dynamic link or static HTML anchor.
Also the Custom URL popup should be smart enough to 'decode' such a link (if a value was specified in that field previously) and maybe pre-populate some attributes in the RTF link constructor form.
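As a rough sketch of the "specially crafted string" idea, the popup could serialise the chosen link to strict JSON and decode it again when the field is reopened. Everything here is illustrative: the function names, the descriptor fields and the list of link types are assumptions, not part of any Tridion API.

```javascript
// Hypothetical set of link types, mirroring the RTF hyperlink options.
const LINK_TYPES = ['component', 'anchor', 'http', 'mailto', 'other'];

// Turn the popup's form values into the string stored in the field.
function encodeLink(link) {
  if (!LINK_TYPES.includes(link.type)) {
    throw new Error('Unknown link type: ' + link.type);
  }
  return JSON.stringify({
    href: link.href,
    type: link.type,
    title: link.title || '',
    target: link.target || '',
  });
}

// Decode a previously stored value so the popup can pre-populate its form.
// Returns null for empty, malformed, or unrecognised values.
function decodeLink(value) {
  if (!value) return null;
  try {
    const link = JSON.parse(value);
    const valid = LINK_TYPES.includes(link.type) && typeof link.href === 'string';
    return valid ? link : null;
  } catch (e) {
    return null;
  }
}
```

Strict JSON (double-quoted keys and values) keeps the stored value parseable both in the popup and in template code. The template would then branch on `type` to emit a dynamic link or a static anchor.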

Read rss and show as html

I am using Google Reader for my RSS feeds. I want to export all my shared or starred RSS items to HTML so that I can put that HTML on my website.
Does anyone have an idea how to do this?
One more important thing: can I page through this HTML? I mean, can I export the items as several pages rather than one big HTML page, so that users of my site can page through my starred feeds?
Thanks,
With XSLT you can transform XML to any format you want, including HTML. You can do the transformation on the server, or with modern browsers like IE6+ and Firefox2+ you can do the transformation on the client side. XSLT isn't very pretty as a programming language, but the concept is pretty neat.
I don't know if you can link directly to the RSS feed XML so that it's always up to date. I think Google requires that you authenticate and have permission to access the feed.
You can read from an RSS feed with jQuery rather easily by selecting and iterating through its tags. You can also perform conditional checks on attributes and so on.
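For the paging part of the question, here is a minimal sketch that assumes the feed items have already been read into an array of `{ title, link }` objects (by whichever method above). The function names and the item shape are hypothetical.

```javascript
// Return one page of items plus paging info; page numbers start at 1
// and out-of-range requests are clamped to the nearest valid page.
function paginate(items, pageSize, page) {
  const totalPages = Math.max(1, Math.ceil(items.length / pageSize));
  const current = Math.min(Math.max(1, page), totalPages);
  const start = (current - 1) * pageSize;
  return { page: current, totalPages, items: items.slice(start, start + pageSize) };
}

// Escape text before placing it in HTML, since feed content is untrusted.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render one page of items as a simple HTML list of links.
function renderPage(result) {
  const rows = result.items.map(
    (item) => '<li><a href="' + escapeHtml(item.link) + '">' + escapeHtml(item.title) + '</a></li>'
  );
  return '<ul>' + rows.join('') + '</ul>';
}
```

The page number could come from a query string parameter, with `totalPages` used to draw the previous/next links.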

Parsing PlainText Emails from HTML Content (ASP.NET)

Right, in short we basically already have a system in place where the HTML content for emails is generated. It's not perfect, but it works.
From this, we need to be able to derive a plaintext alternative for the email. I was thinking of instantly jumping on and creating a RegEx to strip the <*> tags from the message - but then I realised this would be no good because we do need some of the formatting information (paragraphs, line breaks, images etc).
NOTE: I am OK with actually sending the mail and setting up alternative views etc, this is only about getting plaintext from HTML.
So, I am pondering some ideas. Will post one as an answer to see what you guys think, but thought I would open it up to the floor. :)
If you need any more clarification then please shout.
Many thanks,
Rob
My Solution
OK, so here it is! I thought up a solution to my problem and it works like a charm!
Now, here are some of the goals I wanted to set out:
All the content for the emails should remain in the ASPX pages (as the HTML content currently does).
I didn't want the client code to do anything more other than say "SendMail("PageX.aspx")".
I didn't want to write too much code.
I wanted to keep the code as semantically correct as possible (no REALLY crazy-ass hacks!).
The Process
So, this is what I ended up doing:
Go to the master page for the email messages. Create an ASP.NET MultiView Control. This control would have two views - HTML and PlainText.
Within each view, I added content placeholders for the actual content.
I then grabbed all the existing ASPX code (such as header and footer) and stuck it in the HTML View. All of it, DocType and everything. This does cause VS to whinge a little bit. Ignore It.
I then of course added new content to the PlainText view to best replicate the HTML view in a PlainText environment.
I then added some code to the master page's Page_Load that checks for the QueryString parameter "type", which can be either "html" or "text". It falls back to "text" if none is present. Depending on the value, it switches the view.
I then go to the content pages and add new placeholders for the PlainText equivalents and add text as required.
To make my life easier, I then overloaded my SendMail method to fetch the response for the required page twice, passing "type=html" and "type=text", and to create the AlternateViews as appropriate.
In Summary
So, in short:
The Views separate the actual "views" of the content (HTML and Text).
A master page auto switches the view based on a QueryString.
Content pages are responsible for how their views look.
Job done!
If any of this is unclear then please shout. I would like to write a more detailed blog post on this at some point.
My Idea
Create a page based on the HTML content and traverse the control tree. You can then pick the text from the controls and handle different controls as required (e.g. use ALT text for images, "_____" for HR etc).
You could ensure the HTML mail is in XHTML format so you can parse it easily using the standard XML tools, then create your own DOM serialiser that outputs plain text. It'd still be a lot of work to cover general XHTML, but for a limited subset you plan to use in e-mail it could work.
Alternatively, if you don't mind shelling out to another program, you could just use the -dump switch to the lynx web browser.
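To illustrate the tree-walking idea from the answers above (alt text for images, "_____" for HR, blank lines after paragraphs), here is a small sketch of a plain-text serialiser. It works on a simplified, hypothetical node shape (`{ tag, attrs, children }` for elements, plain strings for text) rather than on real ASP.NET controls or an XML DOM, and it covers only a handful of tags.

```javascript
// Serialise one node of a simplified HTML tree to plain text.
// Node shape is an assumption: strings are text, objects are elements.
function toPlainText(node) {
  if (typeof node === 'string') {
    return node.replace(/\s+/g, ' '); // collapse whitespace as HTML does
  }
  const inner = (node.children || []).map(toPlainText).join('');
  switch (node.tag) {
    case 'br':
      return '\n';
    case 'hr':
      return '\n_____\n';
    case 'img':
      // Use the alt text in place of the image, if there is any.
      return node.attrs && node.attrs.alt ? '[' + node.attrs.alt + ']' : '';
    case 'p':
    case 'div':
      return inner.trim() + '\n\n';
    case 'a': {
      // Keep the link target visible after the link text.
      const href = node.attrs && node.attrs.href;
      return href ? inner + ' (' + href + ')' : inner;
    }
    default:
      return inner;
  }
}

// Serialise a whole tree and tidy up runs of blank lines.
function htmlTreeToText(root) {
  return toPlainText(root).replace(/\n{3,}/g, '\n\n').trim();
}
```

New tags can be handled by adding cases to the switch; anything unrecognised simply contributes the text of its children.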
