SolrCloud - Multiple Collections or Shards

I currently use an older version of Solr - 4.7.2. It runs in standalone mode - only one solr node with multiple cores. Each core is protected by ldap groups.
I am looking to be able to search against a single core and also now add searching across multiple cores. Since distributed searching is considered legacy, I believe SolrCloud must be the way to go. I have installed the latest version of solr locally.
I have been reading up on this and I am still not sure how to do this.
There are roughly 100 cores right now. All have the same schema.
Do I convert each core to a collection where each collection is still protected by ldap groups? And then can you search across multiple collections?
Or do you set up one collection with multiple cores? Is each core then a shard, and can I still LDAP-protect each shard? Can users then search within a shard/core or across all of them within the collection?
Then what happens if you search across multiple collections or shards (depending on which of the above scenarios is the way to go) and the user does not have access to a collection or shard? Do you need to know ahead of time where the user can search so there are no errors, or will it bypass the ones you do not have access to?
Thank you for any insight you can provide.

Well, there are a lot of points here. Let's see how I can help you.
I currently use an older version of Solr - 4.7.2. It runs in standalone mode - only one solr node with multiple cores. Each core is protected by ldap groups.
Ok
I am looking to be able to search against a single core and also now add searching across multiple cores. Since distributed searching is considered legacy, I believe SolrCloud must be the way to go. I have installed the latest version of solr locally. I have been reading up on this and I am still not sure how to do this. There are roughly 100 cores right now. All have the same schema.
Do I convert each core to a collection where each collection is still protected by ldap groups? And then can you search across multiple collections?
This is one possible scenario. I'm not sure whether the LDAP auth will still work as you've currently implemented it, because interaction with SolrCloud is different: it involves a third component (ZooKeeper) that is absent in the standalone or (manually) distributed search scenario.
Starting from Solr 5 (maybe I'm wrong about the version), the /admin endpoint also offers an authentication / authorisation API (the underlying AA mechanism, such as LDAP, can be plugged in).
Just one doubt: 100 cores with the same schema means 100 collections with the same schema, and that could mean a significant amount of resources for managing what you can consider 100 distributed Lucene indexes. Assuming that at the moment you are on a single server (which means you don't have a lot of data), why don't you merge everything into a single collection (adding an additional "source" field to discriminate between documents)?
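To illustrate the single-collection idea, here is a minimal sketch of how a client could restrict a search to some sources with a filter query. The collection name "documents", the base URL and the source values are assumptions made up for the example; only the "source" field comes from the suggestion above, and the source values are assumed to be simple tokens that need no escaping.

```python
from urllib.parse import urlencode


def build_source_query(base_url, collection, query, sources):
    """Build a /select URL restricted to documents from the given sources."""
    # fq narrows the result set without affecting scoring; "source" is the
    # discriminator field that every document is assumed to carry.
    source_filter = "source:(" + " OR ".join(sources) + ")"
    params = urlencode({"q": query, "fq": source_filter})
    return f"{base_url}/{collection}/select?{params}"


# Hypothetical host, collection and source names.
url = build_source_query(
    "http://localhost:8983/solr", "documents", "report", ["hr", "finance"]
)
print(url)
```

With this layout, "searching one core" becomes a filter on one source value, and "searching several cores" becomes a filter on several.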
Or do you set up one collection with multiple cores?
Read above, it's basically up to you. You can do both.
Is each core then a shard, and can I still LDAP-protect each shard? Can users then search within a shard/core or across all of them within the collection?
It's not strictly correct to think of core = shard, but considering the step you're taking, yes, I think it can help you understand how things work at the very beginning. However, I would have a look at the reference guide.
And yes, the client can search wherever you want, targeting one or more collections.
Then what happens if you search across multiple collections or shards (depending on which of the above scenarios is the way to go) and the user does not have access to a collection or shard? Do you need to know ahead of time where the user can search so there are no errors, or will it bypass the ones you do not have access to?
I think that the auth protection you're currently using is completely external to Solr, so I guess your assumption is right: you should know in advance where a given user can go, otherwise some requests would return a 403 error (or something like that).
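As a rough sketch of "knowing in advance where a user can go", the client could drop the collections a user is not allowed to read before building the request. The user-to-collection mapping, the host and the collection names below are hypothetical; in practice the mapping would be derived from the LDAP groups. The sketch relies on the SolrCloud "collection" request parameter, which takes a comma-separated list of collections to search.

```python
from urllib.parse import urlencode

# Hypothetical mapping from user to the collections their LDAP groups allow.
ALLOWED = {
    "alice": {"sales", "hr"},
    "bob": {"sales"},
}


def build_multi_collection_query(base_url, user, requested, query):
    """Build a /select URL that only targets collections the user may read."""
    allowed = ALLOWED.get(user, set())
    permitted = [name for name in requested if name in allowed]
    if not permitted:
        raise PermissionError(f"{user} has no access to any requested collection")
    # The request goes to the first permitted collection; the "collection"
    # parameter lists every collection the query should span.
    params = urlencode({"q": query, "collection": ",".join(permitted)})
    return f"{base_url}/{permitted[0]}/select?{params}"


print(build_multi_collection_query(
    "http://localhost:8983/solr", "bob", ["sales", "hr"], "report"
))
```

Filtering on the client side avoids sending a request that the external auth layer would reject, at the cost of keeping the mapping in sync with LDAP.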

Related

some generic questions about neo4j

I'm new to non-PHP web applications and to NoSQL databases. I was looking for a smart solution matching my application's requirements, and I was very surprised to learn that graph-based databases exist. I found Neo4j very nice and very suitable for my application, but as I've already written, I'm new to this and I have some limitations in understanding how it works. I hope you guys can help me learn.
If I embed Neo4j in a servlet program, then the database access I create is shared among the different threads of that servlet, right? So I need to put the database creation in the init() method and the shutdown in destroy(), right? And it will be thread safe. (Every period is a "right?") But what if I want to create a database shared among the whole application?
I heard that graph databases in general rely on a relational low level. Is that true for Neo4j? If it is, then what I see is a high-level interface to the real persistence layer, so what is a Connection in this case? Are there techniques like connection pooling, or are these low-level things all managed by Neo4j?
In my application I need to link some objects to users and to many other classification items. Each of these objects has a unique id (a String). If someone asks to view something about the object with id=QW, then I need to load the vertex associated with object.QW. Is this an easy operation for graph databases?
If I need to manage authentication, I receive the pair (usr,pwd) and need to check whether this pair exists in my graph. Is this the same problem as before, or is there some good variation for managing authentication?
thanks
If you're coming from the PHP world, in most cases you're better off running Neo4j in server mode and accessing it either via REST directly or with a client driver like https://github.com/jadell/neo4jphp. If you still want to embed Neo4j in a servlet environment, the GraphDatabaseService is a shared component, maybe stored within the ServletContext. On a per-request (and therefore per-thread) basis you start and commit transactions.
Neo4j is a native graph database. The bare metal persistence layer is optimized for navigating from one node to its neighbors as fast as possible and written by the Neo4j devteam themselves. There are other graph databases out there reusing other persistence technologies for their underlying persistence.
Best thing is to run the Neo4j online course at http://www.neo4j.org/learn/online_course.
see SecurityRules
As Neo4j is a NoSQL graph database, you have to handle generation of the unique ID yourself using a GUID (with 3.x, an auto-incremented property is also supported for a particular label), because the default ID generated by Neo4j is unique but can be reallocated to another object once the object it was first assigned to is deleted.
I am a .NET developer, and in my project I used the Neo4j REST API. It works well and I suggest you go with that: it is implemented using the async-await programming pattern, so you can pass long-running operations to the DB and use your web server resources more efficiently.
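A minimal sketch of the GUID approach, written in Python purely for illustration: the application generates its own identifier and stores it as an ordinary property, rather than relying on the internal id. The function names and the "id" property name are assumptions for the example, not part of any Neo4j API.

```python
import uuid


def new_object_id():
    """Return a random UUID string to store as a property on the node."""
    return str(uuid.uuid4())


def make_object(name):
    """Build the property map for a new node (hypothetical shape)."""
    return {"id": new_object_id(), "name": name}


first = make_object("QW")
second = make_object("QW")
# Two objects with the same name still get distinct application-level ids.
print(first["id"] != second["id"])
```

Because the id is generated by the application, it stays stable even if the database later reuses an internal id after a deletion.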

Meteor server-side memory usage for thousands of concurrent users

Based on this answer, it looks like the meteor server keeps an in-memory copy of the cache for each connected client. My understanding is that it gets used in order to avoid sending multiple copies of data when dealing with overlapping subscriptions on a client.
The relevant part of the linked answer (emphasis is mine):
The merge box: The job of the merge box is to combine the results (added, changed and removed calls) of all of a client's active publish functions into a single data stream. There is one merge box for each connected client. It holds a complete copy of the client's minimongo cache.
Assuming that answer is still accurate in the current version of meteor, couldn't that create a huge waste of memory on the server as the number of users increases?
As an off-the-cuff calculation, if an app had about a 100kB cache per client, then 10,000 concurrent users would use up 1GB of memory on the server, and 100,000 users a whopping 10GB! This would be true even if each client was looking at almost identical data. It seems plausible for an app to use much more data than that per client, which would further exacerbate the problem.
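The estimate can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 kB = 1,000 bytes) and that every client holds a full, unshared copy of its cache on the server, which is the premise of the question rather than a measured fact.

```python
KB = 1_000
GB = 1_000 ** 3


def merge_box_memory_bytes(cache_bytes_per_client, clients):
    """Total server memory if each client has its own copy of its cache."""
    return cache_bytes_per_client * clients


for clients in (10_000, 100_000):
    total = merge_box_memory_bytes(100 * KB, clients)
    print(f"{clients:,} clients -> {total / GB:.0f} GB")
```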
Does this problem exist in the current version of Meteor? If so, what techniques can be used to limit the amount of memory the server needs to use to manage all the client subscriptions?
Take a look at this post by Arunoda at his meteorhacks.com blog:
http://meteorhacks.com/making-meteor-500-faster-with-smart-collections.html
which talks about his Smart Collections page:
http://meteorhacks.com/introducing-smart-collections.html
He created an alternative Collection stack which has succeeded in its goals for speed, efficiency (memory & CPU) and scalability (you can see a graphed comparison in the post). Admittedly, in his tests RAM usage was negligible with both Collection types, although given the way he's implemented things there should be a very obvious difference with the type of use case you mentioned.
Also, you can see in this post on meteor-core:
https://groups.google.com/d/msg/meteor-core/jG1KLObX1bM/39aP4kxqWZUJ
that the Meteor developers are aware of his work and are cooperating in implementing some of the improvements into Meteor itself (but until then his smart package works great).
Important note! Smart collections relies on access to the Mongo Oplog. This is easy if you're running on your own machine or hosted infrastructure. If you're using a cloud based database, this option might not be available, or if it is, will cost a lot more than the smaller packages.

SignalR and Memcached for persistent Group data

I am using SignalR with my ASP.NET application. What my application needs is to persist the group data that is updated from various servers. According to the SignalR documentation, it's my responsibility to do this. That means I need an external server/service that collects the data from one or more servers, so that I can query it from a single place.
I first thought that Memcached was the best candidate, because it's fast and the data that I need to put there is volatile. The problem is that I need to store collections of user ids: for example, Collection A with 2,000 user ids and Collection B with 40,000 ids. I need to update these collections, removing and inserting ids very quickly. I'm afraid that, because the commands will be initiated from several servers and I might need to read the entire collection and update it on either web server, the data won't be consistent. Web Server A might update the data, but Server B will read the data before Server A has finished updating it. There is a concurrency conflict.
I'm searching for the best way to implement this kind of strategy in my ASP.NET 4.5 application. I think one option might be to use an in-memory database, or something similar, to ensure data integrity.
I want to ask you what is the best solution for my problem.
Here's an example of my problem:
Memcached server - stores the collections (e.g. Collection A, B, C, D). Each collection stores user ids, which can number in the thousands or many more.
Web servers - my Amazon EC2 web servers with SignalR installed. They can be behind a load balancer. These servers need to access the Memcached server and get all the items of a collection by the collection name (e.g. "Collection_23"). They need to be able to remove items (user ids) and add items. All this should be as fast as possible.
I hope that I explained myself right. Thanks.
Alternatively, you can use Redis; as with Memcached, everything is served from memory. Redis has many other capabilities beyond a simple key-value datastore; for your specific case you might use Redis transactions, which help ensure data consistency.
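The consistency concern can be made concrete with a small simulation. The sketch below uses a plain in-memory class as a stand-in for the shared store; it is not a Memcached or Redis client, and the class and method names are invented for the example. It contrasts reading and rewriting a whole collection with updating individual members, which is the style of operation that Redis set commands such as SADD and SREM provide.

```python
class SharedStore:
    """In-memory stand-in for a shared cache; not a real client library."""

    def __init__(self):
        self._groups = {}

    # Whole-value operations, as with a plain key-value cache.
    def get(self, key):
        return set(self._groups.get(key, set()))

    def put(self, key, members):
        self._groups[key] = set(members)

    # Per-member operations, analogous in spirit to Redis SADD / SREM.
    def add_member(self, key, member):
        self._groups.setdefault(key, set()).add(member)

    def remove_member(self, key, member):
        self._groups.get(key, set()).discard(member)


def read_modify_write_race():
    """Two servers read, modify and write back the whole collection."""
    store = SharedStore()
    store.put("Collection_23", {"u1"})
    seen_by_a = store.get("Collection_23")  # server A reads
    seen_by_b = store.get("Collection_23")  # server B reads before A writes
    seen_by_a.add("u2")
    store.put("Collection_23", seen_by_a)
    seen_by_b.add("u3")
    store.put("Collection_23", seen_by_b)   # overwrites A's update
    return store.get("Collection_23")


def per_member_updates():
    """Each server only sends the single change it wants to make."""
    store = SharedStore()
    store.put("Collection_23", {"u1"})
    store.add_member("Collection_23", "u2")  # server A
    store.add_member("Collection_23", "u3")  # server B
    return store.get("Collection_23")


print(sorted(read_modify_write_race()))  # "u2" is lost
print(sorted(per_member_updates()))      # all three ids survive
```

The point of the sketch is the shape of the operations: when each update touches one member instead of replacing the whole value, interleaved writers no longer overwrite each other.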
A comment on another post links to a Redis provider. That link is broken; it seems the provider is now integrated in the main SignalR project: https://github.com/SignalR/SignalR/tree/master/src/Microsoft.AspNet.SignalR.Redis
You have the redis nuget here:
http://www.nuget.org/packages/Microsoft.AspNet.SignalR.Redis
and documentation here:
http://www.asp.net/signalr/overview/signalr-20/performance-and-scaling/scaleout-with-redis

Multi-Database Transactional System & ASP.NET MVC

So I have a challenge to build a site that people online can use to interact with organizations: an ASP.NET MVC customer application.
One of the requirements is financial processing and accounting.
I'm very comfortable using SQL Transactions and stored procedures to do this; i.e. CreateCustomer also creates an entity, and an account record. We have a stored procedure to do this, that does a begin transaction, creates some setup records we need, then does a commit. I'm not seeing a good way to do this with an ORM, and after reading some great blog articles I'm starting to wonder if I'm going down the wrong path.
Part of the complexity here is the data itself:
I'm querying x databases (one per existing customer) to get some of my data, though my app has its own data store as well. I need to query the x databases, run stored procedures on the x databases, and also to my own datastore.
I'm not seeing strong support for things like stored procedures and thereby transactions, though it does seem to be present.
Maybe I'm just trying to make my app a nail here, cause the MVC hammer is sooo shiny. I'm plenty comfortable with raw ADO.NET of course, but I'm in love with the expressive feel to writing Linq code in C# and I'd rather not give up on it.
Down to the question:
Is this a bad idea? Should I try to use Linq / Entity Framework, or something like nHibernate... and stick with the ORM pattern or should I trash it and use raw ADO.NET data access?
Edit: a note on scale; from a queries per second standpoint this app is not "huge". But, from a data complexity perspective, it does need to query against 50+ databases (all identical, or close to it) to read data from an external application and publish data back to that application. ORM feels right when dealing with "my" data store, but feels very wrong for accessing the data from the external application.
From a certain size (number of databases) up, you have to change the paradigm. Are you at that size?
When you deploy what is ultimately a distributed application and yet try to control it as an ordinary local application, you are going to run into a set of fundamental issues around availability, scalability and correctness. If you use concepts like 'distributed transactions', 'linked servers' and 'ORM', you are down the wrong path. True distributed applications use terms like 'message', 'queue' and 'service'. Linq, EF and nHibernate are all fine and good, but none will bring you anything beyond what a simple Transact-SQL SELECT statement brings. In other words, if a SELECT solves your issues, then the various client-side ORMs will work. If not, they won't add any miraculous value.
I recommend you go over the slides on the SQLCAT: High Performance Distributed Applications in Real World Deployments which explain how a site like MySpace manages to read and write into a store of nearly 500 servers and thousands of databases.
Ultimately what you need to internalize is this: one database can have 95% availability (uptime and acceptable service response time). If the system needs all of its databases up at once, their availabilities multiply. A system consisting of 10 databases with 95% availability each has about 60% availability. A system of 100 databases, each with 99.5% availability, also has about 60% availability. 1000 databases with 99.95% availability each (5 min downtime per week) have about 60% availability. And this is for an ideal situation. In reality there is always a snowball effect caused by resource consumption (e.g. threads blocked trying to access an unavailable or slow resource) that makes things far worse.
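The figures above can be checked with a short calculation. This sketch assumes the system is only available when every database is available and that failures are independent, so the combined availability is the per-database availability raised to the number of databases.

```python
def system_availability(per_database, count):
    """Availability of a system that needs all `count` databases up."""
    return per_database ** count


for per_database, count in [(0.95, 10), (0.995, 100), (0.9995, 1000)]:
    combined = system_availability(per_database, count)
    print(f"{count} databases at {per_database:.2%} each: {combined:.1%}")
```

All three cases land close to 60%, which is why adding more databases forces each one to be far more reliable just to stand still.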
This means that one cannot write a large distributed system relying on synchronous, tightly coupled operations and transactions. It is simply impossible. You always rely on asynchronous operations (usually messaging and queues), which is something completely different from your run-of-the-mill database application.
Use the TransactionScope object available in System.Transactions.
What I have chosen is to use Entity Framework to allow access to the application's main data store, and create a custom DAL for access to external application data and for access to stored procedures within the application.
Here's hoping Entity Framework 4.0 fixes the issue. For now, I'm using the concept listed here.
http://social.msdn.microsoft.com/forums/en-US/adodotnetentityframework/thread/44a0a7c2-7c1b-43bc-98e0-4d072b94b2ab/

Best way to keyword search Amazon SimpleDB using EC2 and Asp.Net?

I am wondering if anyone has any thoughts on the best way to perform keyword searches on Amazon SimpleDB from an EC2 Asp.Net application.
A couple options I am considering are:
1) Add keywords to a multi-value attribute and search with a query like:
select id from keywordTable where keyword ='firstword' intersection keyword='secondword' intersection keyword = 'thirdword'
Amazon Query Example
2) Create a webservice frontend to Katta:
Katta on EC2
3) A queued Lucene.Net update service that periodically pushes the Lucene index to the cloud. (to get around the 'locking' issue)
Load balance Lucene(StackOverflow post)
Lucene on S3 (blog post)
If you are looking for a strictly SimpleDB solution (as per the question as stated) Katta and Lucene won't help you. If you are looking for merely an 'Amazon infrastructure' based solution then any of the choices will work.
All three options differ in terms of how much setup and management you'll have to do and deciding which is best depends on your actual requirements.
SimpleDB with a multi-valued attribute named Keyword is your best choice if you need simplicity and minimum administration, and if you don't need to sort by relevance. There is nothing to set up or administer and you'll only be charged for your actual CPU & bandwidth.
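As an illustration of the multi-valued attribute approach, the select expression from the question can be generated for any number of keywords. This is only a string-building sketch in Python: it does not call SimpleDB, the function names are made up, and the domain and attribute names are taken from the example query in the question. Embedded single quotes are doubled, which is how SimpleDB expects quotes inside a quoted value to be escaped.

```python
def quote_value(value):
    """Quote a string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_keyword_query(domain, keywords):
    """Build a select expression matching items that have all keywords."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    predicates = [f"keyword = {quote_value(word)}" for word in keywords]
    return f"select id from {domain} where " + " intersection ".join(predicates)


print(build_keyword_query("keywordTable", ["firstword", "secondword", "thirdword"]))
```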
Lucene is a great choice if you need more than keyword searching, but you'll have to manage updates to the index yourself. You'll also have to manage the load balancing, backups and failover that you would have gotten with SimpleDB. If you don't care about failover and can tolerate downtime while you do a restore in the event of an EC2 crash, then that's one less thing to worry about and one less reason to prefer SimpleDB.
With Katta on EC2 you'd be managing everything yourself. You'd have the most flexibility and the most work to do.
Just to tidy up this question... We wound up using Lightspeed's SimpleDB provider, Solr and SolrNet by writing a custom search provider for Lightspeed.
Info on implementing ISearchEngine interface for Lightspeed:
http://www.mindscape.co.nz/blog/index.php/2009/02/25/lightspeed-writing-a-custom-search-engine/
And this is the Solr Library we are using:
http://code.google.com/p/solrnet/
Since Solr can be easily scaled using EC2 machines, this made the most sense to us.
Simple Savant is an open-source .NET persistence library for SimpleDB which includes integrated support for full-text search using Lucene.NET (I'm the Simple Savant creator).
The full-text indexing approach is described here.
