How to read the content of scanned documents in Alfresco? - alfresco

I have a number of documents that are scanned, converted to PDF or image files, and then stored in the Alfresco repository.
I can search these scanned items by their metadata properties, but how can I search them by the text inside the scanned documents? For example, I have scanned a form filled in with user details, and I want to search Alfresco for that particular user's name.
How is this possible? Is there a way to do it as close as possible to the scanner end?

Use Ephesoft or Kofax as the scanning software. Both products have integrations with Alfresco in which they automatically recognize fields and map them to an Alfresco content model.
Once this processing is done, you can search on those specific fields.

I can integrate with Kofax and scan the content through it. The integration automatically captures all details, including the text of the scanned content, and fills them into a custom content model whose properties are mapped to these fields; the model is attached to the scanned content. Once that is done, the content is picked up by Alfresco indexing, after which users can search for it.
I also assume Kofax provides many components out of the box (OOTB), such as Scan, Virtual ReScan (VRS), Recognition (OCR / OMR / ICR), Validation, Verification, Quality Control, PDF Generator, etc., but we need to configure them for our implementation. For example, by configuring the Quality Control module, we can see errors generated while scanning the content. Since I am looking for an Alfresco + Kofax integration, I assume Kofax provides these features OOTB and I only need to map the scanned content to the Alfresco repository so that the content and metadata are stored according to the defined content model.

There are a number of options that you could explore, but they all require that OCR is performed on the scanned content and that the extracted text is stored either in the PDF (if you're using PDFs) or in Alfresco as metadata or full text.
If you store the OCR text in the PDF, Alfresco will be able to extract the text using its content transformers, as long as the content type being used specifies that the full text of the content is indexed.
To keep the solution close to the scanner, you will want to investigate a capture solution such as Ephesoft, which is used for intelligent document capture and processing. Other solutions are available (such as Kofax), or you can implement your own solution using Tesseract.
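As a rough sketch of the do-it-yourself Tesseract route, the snippet below builds and runs a Tesseract command that turns a scanned image into a PDF with an embedded text layer, which could then be uploaded to Alfresco for full-text indexing. It assumes the `tesseract` command-line tool is installed and on the PATH; the helper names (`build_ocr_command`, `make_searchable_pdf`) and the file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image_path, output_base, language="eng"):
    """Return the Tesseract CLI arguments for producing a searchable PDF.

    The trailing "pdf" config asks Tesseract to write <output_base>.pdf,
    containing the page image plus an invisible text layer.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def make_searchable_pdf(image_path, output_base, language="eng"):
    """Run Tesseract (assumed to be installed) and return the PDF path."""
    subprocess.run(build_ocr_command(image_path, output_base, language), check=True)
    return Path(f"{output_base}.pdf")


# Example call with hypothetical file names:
#   make_searchable_pdf("scanned_form.png", "scanned_form")
```

Running OCR before upload keeps the work close to the scanner, so Alfresco only has to extract the text that is already embedded in the PDF.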

Related

Tagging images used for search bar

I'm working with a classmate to build a database of politically related memes with Meteor, where users will be able to tag images with hashtags. The purpose, beyond data collection, is to provide a powerful search engine where one can find memes by keywords that match the hashtags (for example, the keywords "ukraine" and/or "poutine" would return memes related to those topics).
We have to build everything from scratch, and I'm wondering if someone here has an idea of where to start. In other words:
- What is the easiest way to host images with Meteor? Is it through MongoDB?
- Is it possible to change the metadata of the images on the client side? Do we need to grant this ability using JavaScript only (or is JSON also involved)?
- If we can manage the first two parts, is there a way to link the metadata (the hashtags, in this case) to the search engine in order to retrieve the images?
Thank you for your input!
It's not the easiest option, but I would store the images in Google Cloud Storage or Amazon S3.
I would store the image metadata in a MongoDB database. You can update the database from the client side by calling Meteor Methods.
When users search for images by entering keywords or by following a link that contains keywords, you can query the database and return the related images.
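To illustrate the lookup step, here is a minimal in-memory sketch of matching search keywords against stored hashtags. It is written in Python rather than Meteor code, purely to show the logic; the record shape (`url`, `tags`) and the function names are assumptions, and in a real app the same any/all matching would be expressed as a database query over a tags field.

```python
def normalize_tag(tag):
    """Lower-case a tag and drop a leading '#', so '#Ukraine' matches 'ukraine'."""
    return tag.strip().lstrip("#").lower()


def search_images(images, keywords, match_all=False):
    """Return URLs of images whose tags match the keywords.

    images: list of dicts like {"url": ..., "tags": [...]} (hypothetical shape).
    match_all=False means "any keyword" (OR); True means "every keyword" (AND).
    """
    wanted = {normalize_tag(k) for k in keywords}
    results = []
    for image in images:
        tags = {normalize_tag(t) for t in image["tags"]}
        hit = wanted <= tags if match_all else bool(wanted & tags)
        if hit:
            results.append(image["url"])
    return results
```

Normalizing tags on both the stored and the searched side avoids misses caused by case or a leading `#`.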

Document Association in Alfresco

Suppose an Alfresco user selects x documents from the current folder and wants a parent document to which all x documents are attached, so that they can be downloaded as a single document. Should I create a custom web script to do this, or how can the association concept be leveraged here? For example, a product requirements document, a testing document and a release document need to be attached together into a single document.
It seems to me you are mixing up the document concept (downloading one combined document) and the collection concept (association).
You could create your own custom document model that supports logically attaching documents to another (master) document by adding an association. You could also define in that model that the attached documents are stored as children of the master, which effectively hides the attached documents from the folders. We implemented this concept for our Alfresco Email and our custom Attachment module.
If you need to download that logical document (which may still be a collection of documents), the easiest way would be to implement a custom action, shown on your master document, that zips the master and all connected documents. If you expect to download a single document such as a PDF, you will have to write custom conversion logic that converts the individual documents into pages and composes them into a single PDF. This could get complicated, since the documents could be in any format. You may also want or need to preserve metadata, process information, decisions, structure and so on.
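The zipping step itself is simple. The sketch below shows the idea in Python with an in-memory archive; it is not Alfresco code (a real custom action would run inside the repository), and the input, a mapping of file names to content bytes, is a hypothetical stand-in for the master node and its associated nodes.

```python
import io
import zipfile


def zip_documents(documents):
    """Bundle documents into one ZIP archive and return its bytes.

    documents: dict mapping file name -> content bytes (hypothetical input;
    in Alfresco the names and content would come from the master document
    and the documents associated with it).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in documents.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Because a ZIP keeps each file in its original format, this route avoids the format-conversion problem that a single merged PDF would raise.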

How to update metadata using content indexes in WebCenter Content

I need to create a program that can search a document (e.g. a candidate's resume) and fill in metadata from it, such as the candidate's experience, skills, location, etc.
For this I would like to use Oracle's indexing mechanism (Oracle Text search), because it indexes all the data in the document. When it indexes the document, I would like to first update my metadata fields from the indexed data, and then have the content server update its indexes. Can anyone help me understand how the indexer works and which event I can trap to make the modifications that update my metadata?
I need to update the metadata because the requirements are:
Extensive choices for search filter criteria (searching within resumes and not just form keywords):
- Boolean search between multiple parameters
- Search on skills, years of experience, a particular company, educational qualification, geography/location and submission date of the profile
- Search on who referred the candidate, name, team, BU, etc.
- Result window with an adequate number of results and filters
- Predefined resume filter criteria to assist screening when candidates apply through the job portal
You are looking at this problem from the wrong end. The indexer (Oracle Text) is a powerful and complex tool embedded inside the workings of the database. If I am not mistaken, what you are suggesting is to interpret the results of text indexing and use them as metadata for your content. Oracle Text generates huge amounts of data and literally "chops" documents up word by word. Making meaningful metadata from this would be a huge task.
Instead, you should look at capturing the metadata as close to the source as possible. If you are using MS Office, this could be done with a Word macro (VBA) that runs when the user saves to the repository or file system. I believe you can fully manipulate the metadata in a document at save time.
You will of course need to install the Oracle WebCenter Content Desktop Integration Suite.
Look into Oracle WebCenter Capture. WebCenter Capture can scan a document and allows metadata to be automatically tagged on the document. WebCenter Capture integrates with WebCenter Content (WCC) and allows you to check scanned documents in to WebCenter Content directly.
http://www.oracle.com/technetwork/middleware/webcenter/content/index-090596.html

How to save documents like PDF, DOCX, XLS in SQL Server 2008

I am developing a web application that lets users upload files such as images and documents. The files fall into two groups:
- binary files
- document files
I want to allow users to search the uploaded documents, specifically using full-text search. What data types should I use for these two file types?
You can store the data in a binary column and use full-text search to interpret the binary data and extract the textual information from formats such as .doc, .txt, .xls, .ppt and .htm. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full-text search has to be enabled. I'm not sure how adding a full-text index will affect your system, e.g. its size. You'll also need to look at the execution plan to ensure the index gets used at query time.
For more information, see:
http://technet.microsoft.com/en-us/library/ms142499(SQL.90).aspx
Pros:
The main advantage of storing data in the database is that it makes the data "self-contained". Since all of the data is contained within the database, backing up the data, moving the data from one database server to another, replicating the database, and so on, is much easier.
You can also enable versioning of files, and it makes things easier for load-balanced web farms.
Cons:
You can read about them here: https://dba.stackexchange.com/questions/3924/sql-server-2005-large-binary-storage. But this is something you have to do in order to search through the files efficiently.
The other thing I could suggest is storing keywords in the database and linking them to the file on the file share.
Here is an article discussing the use of FILESTREAM with a database: http://blogs.msdn.com/b/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx
You first need to convert the PDF to text. There are tools for this sort of thing (e.g. PowerGREP). Then I'd recommend storing the text of the PDF files in a database. If you need full-text searching with logic such as "on the same line", you'll need to store one record per line of text. If you just want to search for text in a file, you can change the structure of your SQL schema to match your needs.
For docx files, I would convert them to RTF and search them that way while stored in SQL.
For images, Microsoft has a program called Microsoft OneNote that does OCR (optical character recognition) so you can search for text within images. It doesn't matter what tool you use, just that it supports OCR.
Essentially, if you don't have a way to directly read the binary file, then you need to convert it to text with some library, then worry about doing your searching.
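The "one record per line" layout can be sketched as follows. This example uses Python's built-in SQLite module as a stand-in for SQL Server, and plain LIKE matching rather than a full-text index; the table name `doc_lines`, its columns, and the sample text are all hypothetical.

```python
import sqlite3


def index_lines(conn, file_name, text):
    """Store one row per non-empty line of extracted text."""
    rows = [
        (file_name, number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]
    conn.executemany(
        "INSERT INTO doc_lines (file_name, line_no, line_text) VALUES (?, ?, ?)",
        rows,
    )


def find_on_same_line(conn, first, second):
    """Return (file_name, line_no) for lines that contain both terms."""
    cursor = conn.execute(
        "SELECT file_name, line_no FROM doc_lines "
        "WHERE line_text LIKE ? AND line_text LIKE ? "
        "ORDER BY file_name, line_no",
        (f"%{first}%", f"%{second}%"),
    )
    return cursor.fetchall()


# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_lines (file_name TEXT, line_no INTEGER, line_text TEXT)"
)
index_lines(conn, "resume.pdf", "Name: Jane Doe\nSkills: SQL, OCR\nCity: Pune")
print(find_on_same_line(conn, "Skills", "OCR"))
```

Keeping the line number alongside the text is what makes "both terms on the same line" a simple row-level condition.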
A full-text index can be created on columns that use any of the following data types: CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT, VARBINARY, VARBINARY(MAX), IMAGE and XML.
In addition, to use full-text search you must create a full-text index on the table against which you want to run full-text queries. For a particular SQL Server table or indexed view, you can create at most one full-text index.
These are two articles about it:
SQL SERVER - 2008 - Creating Full Text Catalog and Full Text Search
Using Full Text Search in SQL Server 2008

How to read a check box in a Word document in ASP.NET

Hi friends, I am currently working as a developer.
I would like code for the following scenario:
The Word document must contain a check box, and this Word document should be read into an ASP.NET page. When the user clicks the check box, the selected value should be stored in the database.
Can anyone help me?
From what I understand, what you're trying to do is read a column inside a Word document and store the values in a database.
First approach - SharePoint
It seems to be a perfect fit for SharePoint. If that is an option, you can do the following:
- set up SharePoint
- set up a document library
- set up a document template
The user will have a form to fill values into, but also available in a word document format.
This technique may be overkill depending on what you ultimately want to do.
Second approach - Office SDK
The Microsoft Office SDK comes with the CheckBox object. You can try opening the document programmatically and interrogating the CheckBox object.
I would not advise running this code on the server, as Microsoft Office isn't meant to be run as a server application, whereas SharePoint is.
If you really want to do this, you may need to write a queueing mechanism so that the Office SDK calls are batched and run one at a time, in sequence.
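A queueing mechanism of this kind can be sketched as a single worker thread that pulls jobs from a queue. The example is in Python to show the pattern only, not ASP.NET or Office code; the `SequentialWorker` class is hypothetical, and the job passed to `submit` stands in for whatever call opens the document and reads its check boxes.

```python
import queue
import threading


class SequentialWorker:
    """Runs submitted jobs one at a time, in submission order, on one thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, job, *args):
        """Queue a callable; returns a one-slot queue that will hold its result."""
        result = queue.Queue(maxsize=1)
        self._jobs.put((job, args, result))
        return result

    def _run(self):
        while True:
            job, args, result = self._jobs.get()
            try:
                result.put(("ok", job(*args)))
            except Exception as error:  # report the failure, keep the worker alive
                result.put(("error", error))
            finally:
                self._jobs.task_done()

    def wait(self):
        """Block until every queued job has finished."""
        self._jobs.join()
```

Web requests would call `submit` and then read the returned result queue, so no two document-processing calls ever run at the same time.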

Resources