How to set the crawl rule in Azure Cognitive Search to index only the files starting with specific letters? - microsoft-cognitive

How can I set a crawl rule in Azure Cognitive Search so that it indexes only files whose names start with specific letters (for example, files with the prefix Invoice_), and other files in the blob storage are not crawled?
Swati

Regarding this query, I got a reply from the Azure Search product team: The indexer only allows you to filter which files are indexed based on their extension or on the container they live in (documentation: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage)... so if you want to filter by a prefix, the only way I can think of is to create a custom skill that takes the metadata_storage_path field and the content, and only outputs the content if the path/filename matches whatever pattern you are looking for. This takes a bit of effort (creating the custom skill), but it is doable.
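
As a rough illustration of the custom-skill idea above, here is a minimal Python sketch of the filtering logic only. It assumes the skill receives the usual custom Web API skill payload (a `values` array of records with `recordId` and `data`). The input names `path` and `content`, the output name `filteredContent`, and the function names are hypothetical choices for this example, not something the product team specified. Hosting (for example in an Azure Function) and the skillset wiring are left out.

```python
from urllib.parse import unquote, urlparse

PREFIX = "Invoice_"  # example prefix taken from the question


def file_name_from_path(storage_path: str) -> str:
    """Return the last path segment of a blob URL, URL-decoded."""
    return unquote(urlparse(storage_path).path.rsplit("/", 1)[-1])


def filter_by_prefix(request_body: dict, prefix: str = PREFIX) -> dict:
    """Build a skill response that passes content through only for matching files.

    Assumed (hypothetical) input names: data["path"] and data["content"].
    """
    results = []
    for record in request_body.get("values", []):
        data = record.get("data", {})
        name = file_name_from_path(data.get("path", ""))
        keep = name.startswith(prefix)
        results.append({
            "recordId": record["recordId"],
            # Non-matching files get no content, so nothing searchable is emitted.
            "data": {"filteredContent": data.get("content") if keep else None},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}
```

Note that a skill like this only blanks out the content of non-matching blobs; it does not stop the indexer from visiting them.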

Related

Tagging images used for search bar

I'm working with a classmate to build a database of politically related memes, using Meteor, where users will be able to tag images with hashtags. The purpose of this, beyond data collection, is to provide a powerful search engine where one can find memes by keywords that match the hashtags (for example, with the keywords "ukraine" and/or "poutine", you'll find memes related to these topics).
We have to build everything from scratch, and I'm wondering if someone here has an idea of where to start. In other words:
What is the easiest way to host images with Meteor? Is it through MongoDB?
Is it possible to change the metadata of the images on the client side? Can we grant this ability using JavaScript only (or is JSON also involved)?
If we can manage the first two parts, is there a way to link the metadata (the hashtags, in this case) with the search engine in order to retrieve the images?
Thank you for your input!
It's not the easiest option, but I would store the images in Google Cloud Storage or Amazon S3.
I would store the image metadata in a MongoDB database. You can update the database from the client side by calling Meteor Methods.
When users search for images by entering keywords, or by following a link that contains keywords, you can query the database and return the related images.
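
To make the tag-to-image lookup concrete, here is a small in-memory sketch in Python. It is not Meteor or MongoDB code; it only illustrates the idea of storing hashtags as metadata and querying them with "and/or" keyword matching. The class and method names are made up for this example.

```python
from collections import defaultdict


class TagIndex:
    """Maps normalized hashtags to the set of image URLs carrying them."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    @staticmethod
    def _normalize(tag: str) -> str:
        # Treat "#Ukraine" and "ukraine" as the same tag.
        return tag.lower().lstrip("#")

    def add(self, image_url: str, tags) -> None:
        for tag in tags:
            self._by_tag[self._normalize(tag)].add(image_url)

    def search(self, keywords, match_all: bool = False) -> set:
        """Return images matching any keyword, or all of them if match_all is set."""
        sets = [self._by_tag.get(self._normalize(k), set()) for k in keywords]
        if not sets:
            return set()
        return set.intersection(*sets) if match_all else set.union(*sets)
```

In a real application the same two operations (add tags to an image, look up images by tag) would be queries against the metadata collection rather than a Python dictionary.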

Using Azure Search index to index blobs in Azure Blob Storage (Images and Videos)

I want to index blobs of type image and video.
From what I have read Azure Search cannot index image and video types.
I was thinking of using the blob's metadata_storage_path. However, that is my key, and it is encoded.
Decoding it is really a performance killer.
Is there any way I can index images and videos using an Azure Search index?
If not, is there any other way?
IIUC, you want to index the metadata attached to the blob but not its content, correct? If so, set dataToExtract parameter to storageMetadata as described in Controlling which parts of the blob are indexed.
The cost of base64-decoding the encoded metadata_storage_path to correlate with the rest of your system is likely to be negligible compared to other work your app is doing, such as calls to the database or Azure Search. However, you can avoid the need for decoding if you fork metadata_storage_path into a new non-key field in your index, which won't need to be encoded. You can use field mappings to fork the field.
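
A sketch of what such an indexer definition might look like, written as a Python dictionary that could be serialized to JSON for the REST API. The property names (`dataToExtract`, `fieldMappings`, `sourceFieldName`, `targetFieldName`, `mappingFunction`, `base64Encode`) follow the Azure Search indexer schema as I understand it; the indexer, data source, and index names and the `storage_path_raw` field are placeholders invented for this example.

```python
import json


def build_indexer_definition(name, data_source, index,
                             key_field="metadata_storage_path",
                             raw_path_field="storage_path_raw"):
    """Return an indexer definition that indexes metadata only and forks the path.

    raw_path_field is a hypothetical non-key index field holding the unencoded path.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        # Index only blob metadata; skip content extraction for images and videos.
        "parameters": {"configuration": {"dataToExtract": "storageMetadata"}},
        "fieldMappings": [
            # The key field stays base64-encoded so it is a valid document key.
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": key_field,
                "mappingFunction": {"name": "base64Encode"},
            },
            # The forked copy is stored as-is, so no decoding is needed later.
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": raw_path_field,
            },
        ],
    }


if __name__ == "__main__":
    definition = build_indexer_definition("media-indexer", "media-blobs", "media-index")
    print(json.dumps(definition, indent=2))
```

The target index would also need a matching field for the forked path (here `storage_path_raw`) before the indexer runs.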

How to update metadata using content indexes in WebCenter Content

I need to create a program that can search a document (e.g. a candidate's resume) and fill in metadata from it, such as user experience, user skills, location, etc.
For this I would like to use Oracle's indexing mechanism (Oracle Text search), because it indexes all the data in the document. When it indexes a document, I would like to first update my metadata fields from the indexed data, and then let the content server update its indexes. Can anyone help me understand how the indexer works, and which event I can trap in order to make modifications that update my metadata?
I need to update the metadata because the requirements are:
Extensive choices for Search Filter criteria (that searches within Resumes and not just form keywords) :
- Boolean search between multiple parameters
- Have search on Skills, Years of experiences, particular company, education qualification, Geo/Location and Submission date of the profile.
- Search on who referred, name, team , BU etc.
- Result window adequate size of results, filters
- Predefined resume filter criteria to assisting screening in case of candidate applying on job portal
You are looking at this problem from the wrong end. The indexer (OracleText Search) is a powerful and complex tool embedded inside the workings of the database. What you are suggesting is to interpret the results of text indexing and use this as metadata for your content - if I am not mistaken? OracleText generates huge amounts of data and literally "chops" up documents word for word. For you to make meaningful metadata from this would be a huge task.
Instead, you should be looking at capturing the metadata as close to the source as possible. If you are using MS Office, this could be done with a Word VBScript that runs when the user saves to the repository or filesystem. I believe you can fully manipulate the metadata in a document at save time.
You will of course need to install the Oracle WebCenter Content Desktop Integration suite.
Look into Oracle WebCenter Capture. WebCenter Capture can scan a document and allows metadata to be automatically tagged on the document. WebCenter Capture integrates with WebCenter Content (WCC) and allows you to check in scanned documents directly to WebCenter Content.
http://www.oracle.com/technetwork/middleware/webcenter/content/index-090596.html

How to read content in scanned content in alfresco?

I have a number of content items which are scanned by a scanner, converted into PDF/image files, and finally stored in the Alfresco repository.
I can search these scanned items using metadata properties, but can anybody help me with how to search them by the content of the scanned documents? E.g. I have scanned a form with filled-in user details, and I want to search Alfresco for that particular user's name.
How is this possible? Is there any way to do this as close as possible to the scanner end?
Use EpheSoft or Kofax for the scanning software. Both products have integrations with Alfresco where they can automatically recognize fields and map those to an Alfresco model.
After this process is done, you can search on these specific fields.
I can integrate and scan the content using Kofax. This integration can automatically capture all details, including the text of the scanned content, and fill them into a custom content model that has mappings for all these fields and is attached to the scanned content. Once that is done, the content comes under the purview of Alfresco indexing, after which users can search for it.
Also, I assume Kofax provides many components, such as Scan, Virtual ReScan (VRS), Recognition (OCR / OMR / ICR), Validation, Verification, Quality Control, PDF Generator, etc., which are available OOTB but which we need to configure for use in our implementation. E.g. by configuring the quality module, we can see errors generated while scanning the content. Further, as I am looking for an Alfresco + Kofax integration, I assume that these features are provided by Kofax OOTB and that I just need to map the scanned content to the Alfresco content repository to store content and metadata according to the defined content model.
There are a number of options that you could explore but they all require that OCR is performed on the scanned content and the text that is extracted from the OCR needs to be stored in the PDF (if you're using PDFs) or it needs to be stored in Alfresco as either metadata or full text.
If you store the OCR text in the PDF, Alfresco will then be able to extract the text using its content transformers so long as the content type being used specifies that you will be indexing the full text of the content.
Now there are a number of options available to accomplish what you're after but to keep the solution close to the scanner, you will want to investigate a capture solution such as Ephesoft, which is used for intelligent document capture and processing. Other solutions are available (such as Kofax) or you can implement your own solution using Tesseract.

Searching through Sheetnode data in Drupal

I'm using Sheetnode -- http://drupal.org/project/sheetnode -- to upload a big spreadsheet full of property data for a real estate tool I'm working on.
I've successfully uploaded all my spreadsheets; what I'm stuck on is creating the view (I'm very unfamiliar with the Views module).
Is there any way I can use Views to search through a particular column of Sheetnode data, i.e., how do I query a particular column and return results meeting a particular condition?
Sheetnode integrates with core Search, so cell content is indexed as part of node indexing. If you use the "Search: Search Terms" exposed filter, you will be able to search through spreadsheet content. But you can't restrict the search to a particular column.
Add the fields you want to query as filters. You can then expose these fields to create a search form for your view.
