I am developing a web application that lets users upload files such as images and documents. These files fall into two categories:
binary files
document files
I want to allow users to search the uploaded documents, especially using full-text search. What data types should I use for these two file types?
You can store the data as binary and let full-text search interpret the binary data and extract the textual information from .doc, .txt, .xls, .ppt, and .htm files. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full-text search has to be enabled. I am not sure how adding a full-text index will affect your system, i.e., its size. You'll also need to look at the execution plan to ensure the index gets used at query time.
For more information look at this:
http://technet.microsoft.com/en-us/library/ms142499(SQL.90).aspx
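For illustration, here is a minimal sketch of setting up and querying such an index from Python via pyodbc; the table, column, and key-index names (Documents, Content, FileExtension, PK_Documents) are assumptions for the example, not anything from your schema:

import pyodbc

# autocommit is required: full-text DDL cannot run inside a user transaction.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=FileStore;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# The TYPE COLUMN holds the file extension so the indexer knows which
# filter (.doc, .xls, .ppt, ...) to use when extracting text from the blob.
cur.execute("CREATE FULLTEXT CATALOG DocCatalog")
cur.execute(
    "CREATE FULLTEXT INDEX ON Documents (Content TYPE COLUMN FileExtension) "
    "KEY INDEX PK_Documents ON DocCatalog"
)

# Once the index has been populated, query it with CONTAINS.
for row in cur.execute(
    "SELECT DocId FROM Documents WHERE CONTAINS(Content, ?)", "invoice"
):
    print(row.DocId)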
Pros:
The main advantage of storing data in the database is that it makes the data "self-contained". Since all of the data is contained within the database, backing up the data, moving the data from one database server to another, replicating the database, and so on, is much easier.
You can also enable versioning of files, and keeping everything in the database makes load-balanced web farms easier to manage.
Cons:
You can read about them here: https://dba.stackexchange.com/questions/3924/sql-server-2005-large-binary-storage. But storing the binary in the database is something you have to do in order to search through the files efficiently.
Alternatively, you could store extracted keywords in the database and link them to the files on a file share.
Here is an article discussing the use of the FILESTREAM data type with a database: http://blogs.msdn.com/b/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx
You first need to convert the PDF to text. There are libraries and tools for this sort of thing (e.g., PowerGREP). I'd then recommend storing the text of the PDF files in a database. If you need to do full-text searching with logic such as "on the same line", then you'll need to store one record per line of text. If you just want to search for text in a file, you can change the structure of your SQL schema to match your needs.
For docx files, I would convert them to RTF and search them that way while stored in SQL.
For images, Microsoft has a program called Microsoft OneNote that does OCR (optical character recognition) so you can search for text within images. It doesn't matter what tool you use, just that it supports OCR.
Essentially, if you don't have a way to directly read the binary file, then you need to convert it to text with some library, then worry about doing your searching.
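As an illustration of the one-record-per-line approach described above, here is a hedged sketch using pdfminer.six and SQLite; the file and table names are made up for the example:

import sqlite3
from pdfminer.high_level import extract_text

# Extract the plain text of the PDF (requires the pdfminer.six package).
text = extract_text("resume.pdf")  # hypothetical input file

conn = sqlite3.connect("documents.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pdf_lines "
    "(doc_id INTEGER, line_no INTEGER, line_text TEXT)"
)
# One record per line of text makes "on the same line" queries possible.
conn.executemany(
    "INSERT INTO pdf_lines (doc_id, line_no, line_text) VALUES (?, ?, ?)",
    [(1, n, line) for n, line in enumerate(text.splitlines(), start=1)],
)
conn.commit()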
A full-text index can be created on columns which use any of the following data types: CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT, VARBINARY, VARBINARY(MAX), IMAGE, and XML.
In addition, to use full-text search you must create a full-text index on the table against which you want to run full-text queries. You can create at most one full-text index per SQL Server table or indexed view.
Here are two articles about it:
SQL SERVER - 2008 - Creating Full Text Catalog and Full Text Search
Using Full Text Search in SQL Server 2008
From a Linux bash script, I want to read the structured data stored by a particular Firefox add-on called FB-Purity.
I have found a folder called .mozilla/firefox/b8eab5j0.default/storage/default/moz-extension+++37a9788c-671d-4cae-ba5c-fbdb8788499a^userContextId=4294967295/ that contains a .metadata file containing the string moz-extension://37a9788c-671d-4cae-ba5c-fbdb8788499a, a URL which, when opened in Firefox, shows the add-on's details, so I am pretty sure this folder belongs to the add-on.
That folder contains an idb directory, whose name suggests the Indexed Database API, a W3C standard apparently used by Firefox since last year to store add-on data.
The idb folder only contains an empty folder and an SQLite file.
The SQLite file, unfortunately, does not contain much readable structured application data, but the object_data table contains a 95 KB blob which probably contains the real structured data:
INSERT INTO `object_data` VALUES (1,'0pmegsjfoetupsf.742612367',NULL,NULL,
X'e08b0d0403000101c0f1ffe5a201000400ffff7b00220032003100380035003000320022003a002
2005300610074006f0072007500200055007205105861006e00690022002c00220036003100350036
[... 95KB ...]
00780022007d00000000000000');
Question: Any clue what this blob's format is? How to extract it (using command line or any library or Linux tool) to JSON or any other readable format?
Well, I had a fun day today figuring this out and ended up creating a Python tool that can read the data from these IndexedDB database files and print it (and maybe more at some point): moz-idb-edit
To answer the technical parts of the question first:
Both the key (name) and the data (value) use a Mozilla-proprietary format whose only documentation appears to be its source code at this time.
The keys use a special just-for-this use-case encoding whose rough description is available in mozilla-central/dom/indexedDB/Key.cpp – the file also contains the only known implementation. Its unique selling point appears to be the fact that it is relatively compact while being compatible with all the possible index types websites may throw at you as well as being in the correct binary sorting order by default.
The values are stored using SpiderMonkey's internal StructuredClone representation that is also used when moving values between processes in the browser. Again there are no docs to speak of but one can read the source code which fortunately is quite easy to understand. Before being added to the database however the generated binary is compressed on-the-fly using Google's Snappy compression which “does not aim for maximum compression [but instead …] aims for very high speeds and reasonable compression” – probably not a bad idea considering that we're dealing with wasteful web content here.
To locate the correct indexedDB file for an extension's local storage data, one needs to resolve the extension's static ID to a so-called “internal UUID” whose value is different in every browser profile instance (to make tracking based on installed add-ons a lot harder). The mapping table for this is stored as a pref (“extensions.webextensions.uuids”) in prefs.js. The IDB path then is ${MOZ_PROFILE}/storage/default/moz-extension+++${EXT_UUID}^userContextId=4294967295/idb/3647222921wleabcEoxlt-eengsairo.sqlite
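To make the value format concrete, here is a rough sketch of the first decoding step only, assuming the python-snappy package is installed; note that large values may be stored as several independently compressed chunks, and the decompressed bytes are still SpiderMonkey StructuredClone data that a tool like moz-idb-edit has to parse further:

import sqlite3
import snappy

# Path as derived above; the column layout of object_data may vary between
# Firefox versions, so treat this as an approximation.
conn = sqlite3.connect("3647222921wleabcEoxlt-eengsairo.sqlite")
blob = conn.execute("SELECT data FROM object_data LIMIT 1").fetchone()[0]

raw = snappy.decompress(bytes(blob))
print(raw[:200])  # StructuredClone bytes; UTF-16 strings are visible inside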
For all practical intents and purposes you can read the value of a single storage key of any extension by downloading the project mentioned above. Basic usage is:
$ ./moz-idb-edit --extension "${EXT_ID}" --profile "${MOZ_PROFILE}" "${STORAGE_KEY}"
Where ${EXT_ID} is the extension's static ID (check its manifest.json file or look in about:support#extensions-tbody if you're unsure), ${MOZ_PROFILE} is the Firefox profile directory (also shown in about:support), and ${STORAGE_KEY} is the name of the key you'd like to query (unfortunately, querying all keys is not supported yet).
Writing data is not currently supported either.
I'll update this answer as I implement more features (or drop me an issue on the project page!).
I have a binary file dumped from a mobile phone. The file stores the SMS messages and the phone-book contacts; I have extracted the messages, but now I have to extract the contacts saved in the phone book. The data is stored in SQLite format, as I found the string 53514C69746520666F726D617420330000 in the binary file. How can I extract the list of contacts saved in the phone book?
You need to first work out the format of the file from which you are extracting information, then write code to extract it. A good starting point would be The SQLite Database File Format.
The first part of that string you give (53514C69746520666F726D6174203300) is ASCII hex for SQLite format 3<nul>, which matches the header shown in that link above, so that may go some way toward helping you figure out how best to process it.
Although, given that it appears to be just a normal SQLite database file, you may get lucky and be able to use it as-is with a normal SQLite instance. That would be the first thing I'd try, since you can then use regular SQL queries to output the data in a more usable form.
For example, if the file is called pax.db, simply run:
sqlite3 pax.db
to open it, then you may find you can use all the regular investigative commands like .databases, .schema, .tables and so on.
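If the database does not sit at the very start of the dump, a hedged sketch like this can carve it out first (the file names here are placeholders):

import sqlite3

HEADER = b"SQLite format 3\x00"

with open("phone.bin", "rb") as f:  # hypothetical dump file
    dump = f.read()

offset = dump.find(HEADER)
if offset < 0:
    raise SystemExit("no SQLite header found")

# Write everything from the header onward; SQLite usually ignores trailing
# garbage after the last database page.
with open("carved.db", "wb") as out:
    out.write(dump[offset:])

conn = sqlite3.connect("carved.db")
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)  # look for a contacts-like table here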
I need to create a program which can search a document and fill in metadata from the document (e.g., a candidate's resume), such as experience, skills, location, etc.
For this I would like to use Oracle's indexing mechanism (Oracle Text), because it indexes all the data in a document. When it indexes a document, I would like to first update my metadata fields from the indexed data and then have the content server update its indexes. Can anyone help me understand how the indexer works, and which event I should trap in order to update my metadata?
I need to update the metadata because the requirements are:
Extensive choices of search filter criteria (that search within resumes and not just form keywords):
- Boolean search across multiple parameters
- Search on skills, years of experience, particular company, education qualification, geo/location, and submission date of the profile
- Search on who referred, name, team, BU, etc.
- A result window of adequate size, with filters
- Predefined resume filter criteria to assist screening when a candidate applies on the job portal
You are looking at this problem from the wrong end. The indexer (Oracle Text) is a powerful and complex tool embedded inside the workings of the database. What you are suggesting, if I am not mistaken, is to interpret the results of text indexing and use them as metadata for your content. Oracle Text generates huge amounts of data and literally chops documents up word by word. Making meaningful metadata from this would be a huge task.
Instead, you should be looking at capturing the metadata as close to the source as possible. This could be done (if you are using MS Office) with a Word VBScript macro when the user saves to the repository or file system; I believe you can fully manipulate a document's metadata at save time.
You will of course need to install the Oracle WebCenter Content Desktop Integration suite.
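As a further illustration of capturing metadata close to the source, here is a hedged sketch that reads a .docx file's built-in properties with the python-docx package at ingestion time, rather than via a Word macro; the file name is made up:

from docx import Document

doc = Document("resume.docx")  # hypothetical file
props = doc.core_properties
# Built-in (core) properties that could seed your metadata fields.
print(props.author, props.title, props.keywords, props.last_modified_by)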
Look into Oracle WebCenter Capture. WebCenter Capture can scan a document and allows metadata to be automatically tagged on the document. WebCenter Capture integrates with WebCenter Content (WCC) and allows you to directly checkin scanned documents to WebCenter Content.
http://www.oracle.com/technetwork/middleware/webcenter/content/index-090596.html
I'm migrating/consolidating multiple FMP6 databases into a single C# application backed by SQL Server 2008. The problem I have is how to export the data to a real database (SQL Server) so I can work on data quality and normalisation, which will be significant: there are a number of repeating fields that need to be normalised into child tables.
As I see it there are a few different options, most of which involve either connecting to FMP over ODBC and using an intermediary to copy the data across (either custom code or MS Access linked tables), or exporting to a flat file format (CSV with no header, or XML) and either using Excel to generate insert statements or writing some custom code to load the file.
I'm leaning towards writing some custom code to do the migration (like this article does, but in C# instead of Perl) over ODBC, but I'm concerned about the overhead of writing a migrator that will only be used once (as soon as the new system is up, the existing DBs will be archived)...
A few little joyful caveats: in this version of FMP there's only one table per file, and a single column may have multi-value attributes, separated by hex 1D, which is the ASCII group separator, of course!
Does anyone have experience with similar migrations?
I have done this in the past, but using MySQL as the backend. The method I use is to export as CSV or merge format and then use the LOAD DATA INFILE statement.
SQL Server has something similar; maybe this link on BULK INSERT would help.
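Since you are leaning towards custom code anyway, here is a hedged sketch of the one-off migration over ODBC, splitting multi-value attributes on the ASCII group separator (0x1D) mentioned above; the DSNs, tables, and columns are all assumptions for the example:

import pyodbc

GS = "\x1d"  # FileMaker repeating-field separator

src = pyodbc.connect("DSN=FMP6_Contacts")    # FileMaker ODBC DSN
dst = pyodbc.connect("DSN=SQLServer_Target")
out = dst.cursor()

for row in src.cursor().execute("SELECT id, name, phones FROM Contacts"):
    out.execute("INSERT INTO Contact (Id, Name) VALUES (?, ?)", row.id, row.name)
    # Normalise each repeating field into a child table.
    for phone in (row.phones or "").split(GS):
        if phone:
            out.execute(
                "INSERT INTO ContactPhone (ContactId, Phone) VALUES (?, ?)",
                row.id, phone,
            )
dst.commit()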
I'm programming a site for a rent-a-car company, and I need to save exactly three pictures for every car. Which is better: to store in the database (a description table) the paths to the images and save the images in some folder, OR to save the pictures in a table (MS SQL)?
The main question is: how big are those pictures on average?
A thorough examination by Microsoft Research (To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem) has shown:
- that it's better to store stuff inside your SQL tables as long as the data is typically smaller than 256 KB
- that it's better to store it on the filesystem if the elements are typically larger than 1 MB
The in-between is a bit of a gray area....
So: are your pictures mostly more than 1 MB? Then store them on disk and keep a reference. Otherwise, I'd recommend storing them inside your database table.
Better to save the path to the picture (and usually even better, just the filename). By storing the picture in the table, you increase the size of the table and therefore increase lookup time, insert time, etc., for no apparent gain.
The other thing is that you say you need to save exactly 3 pictures for each posting, which makes me think you're using fields in your posts table such as pic1, pic2, pic3. You might want to normalize this so that you have a post_pictures table that links each post with a picture.
For instance:
post_pictures:
post_picture_id | post_id | picture_filename (you can even get away with having just two fields).
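For what it's worth, here is a minimal sketch of that normalized schema (SQLite syntax for brevity; the idea carries over to MS SQL, and the names are just the ones suggested above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    post_id INTEGER PRIMARY KEY
);
CREATE TABLE post_pictures (
    post_id          INTEGER REFERENCES posts(post_id),
    picture_filename TEXT,
    PRIMARY KEY (post_id, picture_filename)  -- the two-field variant
);
""")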
Depending on the horsepower of your server farm, save them in SQL. I have set up the fields as varbinary(max).
What advantage do you gain by storing the images in the database? Can you think of any use for the image data being stored in a DB?
I suggest you store the images in the file system rather than in the database, and use the database to store the images' metadata (path, keywords, etc.).