I have close to 10K JSON files (very small), and I would like to provide search functionality over them. Since these JSON files are fixed for a specific release, I am thinking of pre-indexing the files and loading the index during startup of the website. I don't want to use an external search engine.
I am searching for libraries to support this. Lucene.NET is one popular library, but I am not sure whether it supports loading pre-indexed data.
1. Index the JSON documents and store the index output (probably in a single file), then save it to a file storage service like S3 - console app.
2. Load the index file and respond to queries - ASP.NET Core app.
I am not sure whether this is possible. What options are available?
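Roughly what I have in mind for the indexing console app is something like this; just a sketch assuming Lucene.NET 4.8.0-beta and System.Text.Json, with placeholder paths and field names:

// Sketch of the console app that builds the index (assumes the Lucene.Net 4.8.0-beta packages).
using System;
using System.IO;
using System.Text.Json;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

var jsonFolder = @"C:\data\json";    // placeholder: folder containing the ~10K JSON files
var indexFolder = @"C:\data\index";  // placeholder: output folder, zipped and uploaded to S3 afterwards

using var dir = FSDirectory.Open(indexFolder);
using var analyzer = new StandardAnalyzer(version);
using var writer = new IndexWriter(dir, new IndexWriterConfig(version, analyzer));

foreach (var path in System.IO.Directory.EnumerateFiles(jsonFolder, "*.json"))
{
    using var json = JsonDocument.Parse(File.ReadAllText(path));

    var doc = new Document();
    // Store the file name so search results can point back to the original JSON file.
    doc.Add(new StringField("id", Path.GetFileNameWithoutExtension(path), Field.Store.YES));
    // Index the raw JSON text for full-text search (good enough for small files).
    doc.Add(new TextField("content", json.RootElement.GetRawText(), Field.Store.NO));
    writer.AddDocument(doc);
}

writer.Commit();
Console.WriteLine("Index written to " + indexFolder);
// Uploading the index folder to S3 (e.g. with the AWS SDK) is omitted here.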
Since S3 is not a .NET-specific technology and Lucene.NET is a line-by-line port of Lucene, you can expand your search to include Lucene-related questions. There is an answer here that points to an S3 implementation meant for Lucene that could be ported to .NET, but, by the author's own admission, the performance of that implementation is not great.
NOTE: I don't consider this to be a duplicate question due to the fact that the answer most appropriate to you is not the accepted answer, since you explicitly stated you don't want to use an external solution.
There are a couple of implementations for Lucene.NET that use Azure instead of AWS here and here. You may be able to get some ideas that help you to create a more optimal solution for S3, but creating your own Directory implementation is a non-trivial task.
Can IndexReader read an index file from an in-memory string?
It is possible to use a RAMDirectory, which has a copy constructor that loads the entire index from disk into memory. The copy constructor is only useful if your files are on disk, though; you could first download the files from S3 and then load them into a RAMDirectory. This option is fast for small indexes, but it will not scale if your index grows over time, and it is not optimized for high-traffic websites where multiple concurrent threads perform searches.
From the documentation:
Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments. It is recommended to materialize large indexes on disk and use MMapDirectory, which is a high-performance directory implementation working directly on the file system cache of the operating system, so copying data to heap space is not useful.
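A rough sketch of that approach in Lucene.NET 4.8.0-beta syntax (the local path stands in for index files already downloaded from S3, and the field names are assumptions):

// Copy an on-disk index into RAM at startup, then search it like any other Directory.
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

using var onDisk = FSDirectory.Open(@"C:\data\index");  // placeholder path

// The copy constructor reads every file of the on-disk index into memory.
using var inMemory = new RAMDirectory(onDisk, IOContext.DEFAULT);

using var reader = DirectoryReader.Open(inMemory);
var searcher = new IndexSearcher(reader);

// Any query works from here; a simple term query as an example.
var hits = searcher.Search(new TermQuery(new Term("content", "release")), 10);
System.Console.WriteLine(hits.TotalHits + " matching documents");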
When you call the FSDirectory.Open() method, it chooses a Directory implementation that is optimized for the current operating system. In most cases it returns an MMapDirectory, an implementation that uses the System.IO.MemoryMappedFiles.MemoryMappedFile class under the hood with multiple views. This option will scale much better if the index is large or if there are many concurrent users.
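For the website side, here is a hedged sketch of what this could look like in ASP.NET Core (6+), sharing one IndexSearcher across all requests. The route, path, and field names are placeholders, and it assumes the Lucene.Net 4.8.0-beta and QueryParser packages plus the Web SDK's implicit usings:

// Open the index once at startup and register a single, thread-safe IndexSearcher.
using System.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton<IndexSearcher>(_ =>
{
    // FSDirectory.Open returns an MMapDirectory on most 64-bit platforms.
    var dir = FSDirectory.Open(@"C:\data\index");  // placeholder: index downloaded from S3
    var reader = DirectoryReader.Open(dir);        // kept open for the lifetime of the app
    return new IndexSearcher(reader);              // safe for concurrent searches
});

var app = builder.Build();

app.MapGet("/search", (string q, IndexSearcher searcher) =>
{
    using var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
    var query = new QueryParser(LuceneVersion.LUCENE_48, "content", analyzer).Parse(q);

    // Return the stored "id" field of the top 20 hits.
    return searcher.Search(query, 20).ScoreDocs
                   .Select(hit => searcher.Doc(hit.Doc).Get("id"));
});

app.Run();

Because the reader stays open for the lifetime of the app, a new release of the JSON files would mean downloading the new index and reopening the reader (or restarting the site), which matches your "fixed per release" scenario.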
To use Lucene.NET's built-in index file optimizations, you must put the index files in a medium that can be read like a normal file system. Rather than trying to roll a Lucene.NET solution that uses S3's APIs, you might want to look into using S3 as a file system instead, although I am not sure how that would perform compared to a local file system.
Related
I want to apply caching techniques to improve my ASP.NET web application's performance. I am going to use the .NET default cache. I also want to store the data in XML files, so that if the system fails to find the data in the cache, I can fall back to the XML file as a secondary option. Does this workflow seem reasonable or standard? Will the file I/O operations degrade performance instead of improving it, or break system integrity? The data volume will be medium and the number of files will be around 1k~2k.
Using XML files as a data source seems like a rather unorthodox approach. A more common way would be to use a database as the data source and something like the distributed Redis cache for caching.
See the docs for further information.
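A minimal sketch of that setup with the distributed Redis cache in ASP.NET Core, assuming the Microsoft.Extensions.Caching.StackExchangeRedis package and the Web SDK's implicit usings; the connection string, route, and data-access call are placeholders:

// Cache-aside pattern with IDistributedCache backed by Redis.
using Microsoft.Extensions.Caching.Distributed;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = "localhost:6379";  // placeholder Redis endpoint
});

var app = builder.Build();

app.MapGet("/products/{id}", async (string id, IDistributedCache cache) =>
{
    var cached = await cache.GetStringAsync("product:" + id);
    if (cached is not null)
        return cached;  // cache hit

    // Cache miss: load from the primary data source (database, XML file, ...).
    var fresh = await LoadProductJsonAsync(id);
    await cache.SetStringAsync("product:" + id, fresh,
        new DistributedCacheEntryOptions { SlidingExpiration = TimeSpan.FromMinutes(10) });
    return fresh;
});

app.Run();

// Hypothetical stand-in for the real data access code.
static Task<string> LoadProductJsonAsync(string id) =>
    Task.FromResult("{ \"id\": \"" + id + "\" }");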
I am developing a Windows 8.1 Universal Application and I use SQLite and sqlite-net to store the data. Now I need to add the ability to store PDF files (1 MB - 50 MB) in my application.
What is the best solution in this case? Store the files in SQLite, or store them in a separate folder? Which folder would be better to use in this case?
Unless you are going to be storing relational data and need to run queries against it (and it sounds like you aren't), I would suggest using local storage. It isn't really that difficult to use.
Now, as far as performance goes: reading from disk in the app is not going to be fast. That being said, any solution you use is going to be saved to disk in the end, so I don't think you will notice much of a performance difference between the database and local storage.
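For example, something along these lines using local storage (a sketch with placeholder folder and file names; the SQLite row would then only hold the relative path, not the PDF itself):

// Save and load a PDF in the app's local folder; only the path goes into SQLite.
using System;
using System.Threading.Tasks;
using Windows.Storage;
using Windows.Storage.Streams;

public static class PdfStore
{
    public static async Task<string> SaveAsync(string fileName, byte[] pdfBytes)
    {
        var folder = await ApplicationData.Current.LocalFolder
            .CreateFolderAsync("pdfs", CreationCollisionOption.OpenIfExists);
        var file = await folder.CreateFileAsync(fileName, CreationCollisionOption.ReplaceExisting);
        await FileIO.WriteBytesAsync(file, pdfBytes);

        // Store this relative path in the SQLite record instead of the PDF itself.
        return "pdfs\\" + fileName;
    }

    public static async Task<IBuffer> LoadAsync(string fileName)
    {
        var folder = await ApplicationData.Current.LocalFolder.GetFolderAsync("pdfs");
        var file = await folder.GetFileAsync(fileName);
        return await FileIO.ReadBufferAsync(file);
    }
}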
A good option would be to store the files on the SD card/removable storage if available; this way the phone's valuable internal memory will not be used up.
This question is about using Dropbox to sync a SQLite Core Data store between multiple iOS devices. Consider this arrangement:
1. An app utilizes a Core Data store, call it local.sql, saved in the app's own NSDocumentDirectory.
2. The app uses the Dropbox Sync API to observe a certain file in the user's Dropbox, say, user/myapp/synced.sql.
3. The app observes NSManagedObjectContextDidSaveNotification, and on every save it copies local.sql to user/myapp/synced.sql, thereby replacing the latter.
4. When the Dropbox API notifies us that synced.sql changed, we do more or less the opposite of step 3: tear down the Core Data stack, replace local.sql with synced.sql, and recreate the Core Data stack. The user sees "Syncing" or "Loading" in the UI in the meantime.
Questions:
A. Is this arrangement hugely inefficient, to the extent where it should be completely avoided? What if we can guarantee the database is not large in size?
B. Is this arrangement conducive to file corruption? More than syncing via deltas/changelogs? If so, will you please explain in detail why?
A. Is this arrangement hugely inefficient, to the extent where it should be completely avoided? What if we can guarantee the database is not large in size?
Irrelevant, because:
B. Is this arrangement conducive to file corruption? More than syncing via deltas/changelogs? If so, will you please explain in detail why?
Yes, very much so. Virtually guaranteed. I suggest reviewing How to Corrupt An SQLite Database File. Offhand you're likely to commit at least two of the problems described in section 1, including copying the file while a transaction is active and deleting (or failing to copy, or making a useless copy of) the journal file(s). In any serious testing, your scheme is likely to fall apart almost immediately.
If that's not bad enough, consider the scenario where two devices save changes simultaneously. What then? If you're lucky you'll just get one of Dropbox's notorious "conflicted copy" duplicates of the file, which "only" means losing some data. If not, you're into full database corruption again.
And of course, tearing down the Core Data stack to sync is an enormous inconvenience to the user.
If you'd like to consider syncing Core Data via Dropbox I suggest one of the following:
Ensembles, which can sync over Dropbox (while avoiding the problems described above) or iCloud (while avoiding the problems of iOS's built-in Core Data/iCloud sync).
TICoreDataSync, which uses Dropbox file sync but which avoids putting a SQLite file in the file store.
ParcelKit, which uses Dropbox's new data store API. (Note that this is quite new and that the data store API itself is still beta).
We are building a job site application in which we will store the resumes of all candidates, and the plan is to store them on the file system.
We now need to search inside those files and return the results to the user, so we need to determine the best solution for implementing text search.
I have looked into this and found some references such as IFilter (an API/interface) and Lucene.Net (open source), but I am not sure whether either is the right solution.
In the initial phase we expect around 50,000 resumes, and the solution should scale if that number increases.
I would appreciate a case study, some analysis, or your suggestions on the best method to handle this requirement (technology: ASP.NET).
Thanks
You can use Microsoft Search Server. There is a free version, so you can try it before you buy it (or never buy it, if the free version meets your requirements).
If you later want to integrate those documents into a SharePoint portal, Enterprise Search can integrate with that as well.
One possibility would be to use the FILESTREAM feature in SQL Server 2008, combined with database-level full text index / search.
That would allow you to keep the files in the filesystem, while also providing transactional integrity and search.
SQL Server Express supports FILESTREAM, and the 4 GB size limit doesn't apply to the files (although it does apply to the size of a full-text index).
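As a rough illustration of the search side (table, column, and connection-string names are placeholders, and it assumes a full-text index has already been created over the resume content column):

// Query a full-text index over resumes stored in a FILESTREAM column.
using System;
using System.Data.SqlClient;

class ResumeSearch
{
    static void Main(string[] args)
    {
        var term = args.Length > 0 ? args[0] : "asp.net";

        using var connection = new SqlConnection(
            "Server=.;Database=JobSite;Integrated Security=true");  // placeholder connection string
        connection.Open();

        // CONTAINS uses the full-text index; Content is the FILESTREAM/varbinary(max) column.
        using var command = new SqlCommand(
            "SELECT CandidateId, FileName FROM Resumes WHERE CONTAINS(Content, @term)",
            connection);
        command.Parameters.AddWithValue("@term", term);

        using var reader = command.ExecuteReader();
        while (reader.Read())
        {
            Console.WriteLine(reader["CandidateId"] + ": " + reader["FileName"]);
        }
    }
}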
This might be naive since I'm unfamiliar with off-the-shelf search products, but if nothing pre-built fits the bill I would build a simple service that crawls and indexes (or several instances crawling different directories to increase speed) and updates a database. If the files were accessed regularly you could build a layer of isolation to prevent collisions.
I'm building an ASP.NET web solution that will include a lot of pictures and, hopefully, a fair amount of traffic. I really do want to achieve good performance.
Should I save the pictures in the database or on the file system? Regardless of the answer, I'm more interested in why you would choose one way over the other.
Store the pictures on the file system and picture locations in the database.
Why? Because...
You will be able to serve the pictures as static files.
No database access or application code will be required to fetch the pictures.
The images could be served from a different server to improve performance.
It will reduce the database bottleneck.
The database ultimately stores its data on the file system.
Images can be easily cached when stored on the file system.
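A sketch of that pattern in ASP.NET Core; the folder, route, and repository names are placeholders made up for illustration:

// Save the picture under wwwroot and store only its relative path in the database.
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("pictures")]
public class PicturesController : ControllerBase
{
    private readonly IWebHostEnvironment _env;
    private readonly IPictureRepository _repository;  // hypothetical data-access abstraction

    public PicturesController(IWebHostEnvironment env, IPictureRepository repository)
    {
        _env = env;
        _repository = repository;
    }

    [HttpPost]
    public async Task<IActionResult> Upload(IFormFile picture)
    {
        var relativePath = "images/" + Guid.NewGuid() + Path.GetExtension(picture.FileName);
        var fullPath = Path.Combine(_env.WebRootPath, relativePath);
        Directory.CreateDirectory(Path.GetDirectoryName(fullPath)!);

        // Write the bytes to disk; the static-file middleware will serve them afterwards.
        using (var stream = System.IO.File.Create(fullPath))
            await picture.CopyToAsync(stream);

        // Only the location goes into the database.
        await _repository.SavePathAsync(relativePath);

        // Browsers then load the image directly as a static file: /images/<guid>.jpg
        return Created("/" + relativePath, null);
    }
}

// Hypothetical repository; the real implementation would insert the path into the picture table.
public interface IPictureRepository
{
    Task SavePathAsync(string relativePath);
}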
In my recently developed projects, I stored images (and all kinds of binary documents) as image columns in database tables.
The advantage of having files stored in the database is obviously that you do not end up with unreferenced files on the hard disk if a record is deleted, since synchronization between the database (= metadata) and the hard disk (= file storage) is not built-in and has to be programmed manually.
Using today's technology, I suggest you store images in SQL Server 2008 FILESTREAM columns (at least that's what I am going to do with my next project), since they combine the advantage of storing data in the database AND keeping large binaries in separate files (at least according to the advertising ;) )
The adage has always been "Files in the filesystem, file metadata in the database"
Better to store files as files. Different databases handle BLOB data differently, so if you have to migrate your back end you might get into trouble.
When serving the images, an <img src=...> pointing to a file that already exists on the server is likely to be quicker than creating a temporary file from the database field and pointing the <img> tag at that.
I found this answer from googling your question and reading the comments at http://databases.aspfaq.com/database/should-i-store-images-in-the-database-or-the-filesystem.html
I usually like to have binary files in the database because:
data integrity: no unreferenced files, and no path in the DB without an associated file
data consistency: take a database dump and that's all; no "oh, I forgot to tar/gzip that data directory."
Storing images in the database adds DB overhead to serving single images and makes it hard to offload to alternate storage (S3, Akamai) if you grow to that level. On the other hand, it makes it much easier to move your app to a different server, since only the DB needs to move.
Storing images on the disk makes it easy to offload them to alternate storage and makes the images static elements, so you don't have to mess about with HTTP headers in your web app to make them cacheable. The downside is that if you ever move your app to a different server, you need to remember to move the images too; something that's easily forgotten.
For web based applications, you're going to get better performance out of using the file system for storing your images. Doing so will allow you to easily implement caching of the images at multiple levels within your application. There are some advantages to storing images in a database, but most of the time those advantages come with client based applications.
Just to add a bit more to the already good answers so far: you can still get the benefits of caching at both the web level and the database level if you go the route of keeping your images in the database.
On the database side, I think you can achieve this through how you store the images in relation to the textual data associated with them, and by folding the access to the images into a particular query so that the database can cache the query (just theory, though, so feel free to correct me on that part).
On the web side, since your question is tagged with asp.net, I would guess you would go the route of using an HTTP handler to serve up the images. Then you have all the benefits of the framework at your disposal, and you can keep your domain logic cleaner by only having to pass the image's key to the HTTP handler.
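A rough sketch of such a handler in classic ASP.NET (the query-string key, content type, and file lookup are placeholders, and the handler still needs to be registered via an .ashx file or web.config):

// Serve an image by key through an IHttpHandler so pages only emit
// <img src="ImageHandler.ashx?id=123" />.
using System;
using System.Web;

public class ImageHandler : IHttpHandler
{
    public bool IsReusable => true;

    public void ProcessRequest(HttpContext context)
    {
        var id = context.Request.QueryString["id"];
        if (string.IsNullOrEmpty(id))
        {
            context.Response.StatusCode = 404;
            return;
        }

        // In a real app the path would be looked up in the database using the key;
        // here a fixed folder stands in for that lookup.
        var path = context.Server.MapPath("~/App_Data/images/" + id + ".jpg");

        context.Response.ContentType = "image/jpeg";
        // Let browsers and proxies cache the image.
        context.Response.Cache.SetCacheability(HttpCacheability.Public);
        context.Response.Cache.SetExpires(DateTime.UtcNow.AddDays(7));
        context.Response.TransmitFile(path);
    }
}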
Here is a step-by-step example (general approach, Spring implementation, Eclipse) of storing images in the file system and holding their metadata in the DB:
http://www.devmanuals.com/tutorials/java/spring/spring3/mvc/Spring3MVCImageUpload.html
Here is another example: http://www.journaldev.com/2573/spring-mvc-file-upload-example-tutorial-single-and-multiple-files
You can also investigate the codebase of this project: https://github.com/jdmr/fileUpload . Pay attention to this controller.