I want to provide a search mechanism on my CMS. What is the preferred approach, and what would be the best indexing technology to allow a site-wide search?
The CMS is written in .Net.
I would recommend that you have a look at Lucene.NET. It's a very nice helper when it comes to searching, and it's easy to use.
A very smooth feature with Lucene is that you can set annotations on your entities. This makes it very easy to customize how different variables should be indexed and searched for. (I have only used Lucene with Java; there might be some differences with .NET.)
You could use Google Site Search for this; the paid version is something like $100 a year. You can customise the search results as much as you want: you call GSS with their API and get the results in XML. There is also an autocomplete included. A lot of Google search features are supported.
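If you go that route, the integration is just an HTTP call plus XML parsing. Here is a minimal sketch in Python; the endpoint, parameter names, and element names below are placeholders (assumptions), since the real ones come from the Google Site Search documentation for your account:

import requests
import xml.etree.ElementTree as ET

# Placeholder endpoint, parameters and element names: substitute the real
# ones from your Google Site Search account documentation.
GSS_ENDPOINT = "https://example.invalid/gss"
API_KEY = "your-api-key"

def site_search(query):
    resp = requests.get(GSS_ENDPOINT, params={"key": API_KEY, "q": query, "output": "xml"})
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    for result in root.iter("result"):  # element names are illustrative only
        yield result.findtext("title"), result.findtext("url")

for title, url in site_search("content management"):
    print(title, url)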
I need to draw a more elaborate mind map to present my test strategy to my client. I have no experience of creating mind maps with any tool.
Can someone suggest a good mind-mapping tool?
For "pure" mind mapping I would suggest Freeplane (free and open source). I know people using Freeplane for professional test case generation. Very helpful in this respect are
extensive scripting support that can be used to support testcase entry and for customized exports
multiple fields per node that can be used for different purposes: attributes (tabular data), notes, detail
If your primary focus is the generation of presentations then you should probably use a different tool.
For more elaborate mind maps I would suggest XMind.
With XMind you can even create test cases inside your mind map using its matrix feature. There are lots more features, like:
Timeline
Gantt view
Filters
Drilldown
Try https://github.com/mindolph/Mindolph. This desktop application lets you create and manage mind maps easily.
You may try the online service MindMup or the desktop ConceptDraw MINDMAP. Though the first is not as professional and intuitive as the ConceptDraw tool, it is free. The second product has a 21-day trial period, a brainstorm mode, multiple hyperlinks, export to MS PowerPoint or web pages, and so on.
I need to implement a central search for multiple Plone sites on different servers/machines. A way to select which sites to search would be a plus, but is not the primary concern. A few ways I came upon to go about this:
-Export the ZCatalog indexes to an XML file and use a crawler periodically to get all the XML files so a search can be done on them, but this way does not allow for live searching.
-There is a way to use a common catalog, but it is not optimal and cannot be implemented on the sites I am working on because of some requirements.
-I read somewhere that they used Solr, but I need help on how to use it.
However, I need a way to use the existing ZCatalog index rather than create another index, which I think is what Solr requires, because of the extra overhead and the extra index that has to be maintained. I will use it if no other solution is possible. I am a beginner at searching, so please give as much detail as possible.
You should really look into collective.solr:
https://pypi.python.org/pypi/collective.solr/4.1.0
Searching multiple sites is a complex use case and you most likely need a solution that scales. In the end it will require far less effort to go with Solr than to come up with your own solution. Solr is built for exactly this kind of requirement.
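Once the content of the different sites is indexed in one Solr instance, a federated search is just a query against Solr's HTTP API. A minimal sketch in Python, assuming a core named plone on localhost and a "site" field in the schema that identifies the source site (both assumptions to adapt to your deployment):

import requests

SOLR_URL = "http://localhost:8983/solr/plone/select"  # assumed host and core name

def solr_search(text, sites=None, rows=20):
    params = {"q": text, "wt": "json", "rows": rows}
    if sites:
        # Restrict to selected sites via a filter query; assumes a "site"
        # field was added to the schema when indexing each Plone site.
        params["fq"] = "site:(%s)" % " OR ".join(sites)
    resp = requests.get(SOLR_URL, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

for doc in solr_search("annual report", sites=["site-a", "site-b"]):
    print(doc)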
As an alternative, you can also use collective.elasticindex, an extension to index Plone content into ElasticSearch, for this.
According to its documentation:
This doesn't replace the Plone catalog with ElasticSearch, nor interact with the Plone catalog at all; it merely indexes content inside ElasticSearch when it is modified or published.
In addition to this, it provides a simple search page called search.html that queries ElasticSearch using JavaScript (so Plone is not involved in searching) and offers the same features as the default Plone search page. A search portlet lets you redirect people to this new search page as well.
That can be an advantage over collective.solr.
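Since the content ends up in ElasticSearch, the central search can likewise be a plain query against its REST API (this is essentially what the bundled search.html does from JavaScript). A minimal Python sketch; the index name and field names are assumptions that depend on how collective.elasticindex is configured:

import requests

ES_URL = "http://localhost:9200/plone/_search"  # assumed host and index name

def es_search(text, size=20):
    query = {
        "size": size,
        # Field names are assumptions; use whatever fields the extension indexes.
        "query": {"multi_match": {"query": text, "fields": ["title", "text"]}},
    }
    resp = requests.post(ES_URL, json=query)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

for doc in es_search("annual report"):
    print(doc)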
I have used Scrapy and Beautiful Soup many times; however, I find the Kimono Labs solution much easier and faster. The only problem is that jobs sometimes need a bit of tweaking, which is not possible (e.g., crawling using a unique pattern).
Is there any other solution which combines that ease with optional complexity? Mainly I want to define a page-scraping template using a WYSIWYG interface, and then programmatically write the crawler.
Use an Import.io extractor.
Download the Import.io browser
Create an extractor (what you call a "scraping template")
From your code use the extractor's REST API
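That last step might look something like this from Python; the endpoint, parameter names, and response shape below are placeholders (assumptions), so take the real ones from the Import.io API documentation for your extractor:

import requests

# Placeholder values: the real endpoint, auth parameter and response format
# are defined by the Import.io API docs for your account and extractor.
ENDPOINT = "https://example.invalid/importio/extractor/query"
API_KEY = "your-api-key"
EXTRACTOR_ID = "your-extractor-id"

def run_extractor(page_url):
    resp = requests.get(
        ENDPOINT,
        params={"extractorId": EXTRACTOR_ID, "url": page_url, "apiKey": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()  # rows extracted according to your "scraping template"

for row in run_extractor("http://example.com/some/listing").get("results", []):
    print(row)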
Full disclosure: I'm one of the founders of ParseHub.
ParseHub tries to solve exactly this problem. It gives you a GUI and powerful tools for defining templates visually, and falls back to a subset of JavaScript if you need more fine-grained control. All of the programming primitives that you're familiar with (if, for, break, recursion, etc.) are available.
You can find it at www.parsehub.com
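Calling it programmatically is a run-then-fetch loop against its REST API. A sketch of that flow in Python (the endpoints are shown as I understand ParseHub's v2 API; double-check them against the current docs):

import time
import requests

API_KEY = "your-parsehub-api-key"
PROJECT_TOKEN = "your-project-token"
BASE = "https://www.parsehub.com/api/v2"  # as I recall from ParseHub's docs

# Kick off a run of the project (the template defined in the GUI).
run = requests.post(BASE + "/projects/" + PROJECT_TOKEN + "/run",
                    data={"api_key": API_KEY}).json()
run_token = run["run_token"]

# Poll until the run finishes, then download the scraped data as JSON.
while True:
    status = requests.get(BASE + "/runs/" + run_token,
                          params={"api_key": API_KEY}).json()
    if status.get("status") == "complete":
        break
    time.sleep(10)

data = requests.get(BASE + "/runs/" + run_token + "/data",
                    params={"api_key": API_KEY, "format": "json"}).json()
print(data)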
Try Agenty
Agenty has exactly the same feature set for scraping websites, plus a Chrome extension to set up the scraping agents. You can just install the extension and create agents to scrape any site.
FYI: we also plan to launch a hosted solution and REST API by April 2016 (update: the API is available now).
You may see more details on the website (www.datascraping.co, now Agenty.com).
Disclosure: I'm one of the founding members.
I want to extract data from various kinds of blogs and was going through various ways to do it:
API which needs user authentication
XML-RPC (I don't know which blogs support it)
RSS (again, not sure which blogs support it, and even if they do, how much one can get from RSS feeds)
Atom
I know that this isn't a strictly programming-related question, but I went ahead and asked it because there is a heck of a lot of confusion as to what to use and which is better supported.
It would be nice not to use an API with authentication, as you not only have to deal with varied implementations of authentication, you also have to deal with varied API limits.
RSS is the oldest format that came into use, and it has limitations. Atom was designed as its replacement, overcoming the limitations of RSS. Atom is just a specialised form of XML-RPC; in other words, there are other uses for XML-RPC, and Atom is the variation of it you want. All of the above are types of API. So ideally what you want to do is support both RSS and Atom. Sadly, Atom and RSS are not backwards compatible. To quote the Wikipedia article on Atom:
In particular, many blog and wiki sites offer their web feeds in the
Atom format.
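If you do go the RSS/Atom route, a library such as feedparser (Python) hides the difference between the two formats, so the same code handles both. A minimal sketch (the feed URLs are placeholders):

import feedparser  # pip install feedparser

FEEDS = [
    "http://example.com/blog/rss",       # placeholder RSS feed
    "http://example.org/blog/atom.xml",  # placeholder Atom feed
]

for url in FEEDS:
    feed = feedparser.parse(url)  # parses RSS and Atom alike
    print(feed.feed.get("title", url))
    for entry in feed.entries[:5]:
        # Common fields are normalised across both formats.
        print(" -", entry.get("title"), entry.get("link"))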
#porneL's solution is not recommended (at the moment). However, in the future HTML markup is set to change to improve the semantic meaning given to blocks, such as with the new <article> tag. This will be yet another way to parse documents. It will be the most versatile, but in my opinion it will be a very long time before it becomes reliable, since many if not most sites suffer from 'tag soup' syndrome.
The most universal "standard" is crawling and parsing HTML.
wget -m http://example.com/
How exactly you do it depends on what you are trying to accomplish and how universal you want to be.
You could use heuristics, similar to what Readability uses, to find articles on a site. You could detect and special-case popular blogging platforms.
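A minimal sketch of that crawl-and-parse approach in Python, using requests and BeautifulSoup: it prefers semantic <article> markup when present and otherwise falls back to a deliberately crude stand-in for a Readability-style heuristic (pick the block with the most paragraph text):

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_articles(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Prefer semantic markup when the site provides it.
    articles = soup.find_all("article")
    if articles:
        return [a.get_text(" ", strip=True) for a in articles]

    # Crude fallback heuristic: the block containing the most paragraph text.
    blocks = soup.find_all(["div", "section"])
    best = max(blocks,
               key=lambda el: sum(len(p.get_text()) for p in el.find_all("p")),
               default=None)
    return [best.get_text(" ", strip=True)] if best else []

for text in extract_articles("http://example.com/"):
    print(text[:200])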
I have a task to optimize the search engine in an ASP.NET e-commerce store based on the nopCommerce template.
I would like to hear what I should pay most attention to in order to improve the search engine and deliver faster results, since the current search engine takes forever to display results.
Full-text search is one of the options to be implemented too.
Thanks in advance, Laziale
Make sure that all search queries go through the database
Make sure that all the search fields have the proper indexes
Return as little info as needed (probably create stored procedures)
Look at your search queries; perhaps they can be rewritten and optimized
Profile your .NET code, find the places where it is slow, and optimize them
Cache your results, or even the SQL queries
For full-text search, look at Lucene.NET
Skip EF and write your own data layer, at least for the purpose of search optimization
I think the best way is to read this document provided by Google, which tells you the most important tweaks you should pay attention to. I used it myself and it was very rewarding indeed:
http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en//webmasters/docs/search-engine-optimization-starter-guide.pdf