It's being proposed that we store data about the relationship between two vertices on the edge between them. The idea is that the two vertices are related, and there are user-level pieces of information that need to be stored in the graph. The best example I can think of would be a Book and a Reader, where the Reader can store cliff notes on the edge for retrieval later on.
Is this common practice? It seems to me that we should minimize the amount of data living on edges, and that the vast majority of graph DB data should be derived data, rather than using the graph as an actual data store. Given that it's in memory, what happens when it goes down? (We're using Neptune, so... there are technically backups.)
Sorry if the question is a bit vague, but I'm not sure how else to ask. I've googled around looking for best practices, and it's all pretty generic material about the concepts and theories of graph databases.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?
Without more detail it is hard to provide exact modeling advice, but in general one of the advantages of using a graph database is that edges are first-class citizens and can carry properties. A common use case would be something like Person -purchases-> Product, where you might have a purchase_date property on the purchases edge to represent the date of the purchase, since someone might buy the same thing multiple times.
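To make that concrete, here is a minimal sketch of writing and reading an edge property, assuming the Gremlin.Net client against a Neptune endpoint (the endpoint, labels, and property names are placeholders, not anything from your model):

```csharp
using Gremlin.Net.Driver;
using Gremlin.Net.Driver.Remote;
using Gremlin.Net.Process.Traversal;
using static Gremlin.Net.Process.Traversal.AnonymousTraversalSource;

class EdgePropertySketch
{
    static void Main()
    {
        // Placeholder endpoint; Neptune listens on 8182 over TLS.
        using var client = new GremlinClient(
            new GremlinServer("your-neptune-endpoint", 8182, enableSsl: true));
        var g = Traversal().WithRemote(new DriverRemoteConnection(client));

        // Create a "purchases" edge between an existing person and product,
        // storing purchase_date as a property of the relationship itself.
        g.V().Has("person", "name", "alice")
             .AddE("purchases")
             .To(__.V().Has("product", "sku", "B0001"))
             .Property("purchase_date", "2020-06-01")
             .Iterate();

        // Later, read the property back from the edge.
        var dates = g.V().Has("person", "name", "alice")
                         .OutE("purchases")
                         .Values<string>("purchase_date")
                         .ToList();
    }
}
```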
I am not sure exactly what you mean by "a vast majority of GraphDB data be derived data". You can certainly use graphs to derive and infer data/relationships from the connections, but they fully support storing data as well.
Given that it's in memory, what happens when it goes down? - Amazon Neptune (like most other databases) uses a buffer cache to keep some data in memory, but that data is also persisted to disk, so if the instance goes down there is no problem recovering it from durable storage.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it? - Just as with any database, I would not recommend exposing the Gremlin API directly to consumers, as doing so comes with a whole host of potential security risks. Generally, the underlying data store of any application should be invisible to its users. They should interact with an interface like REST/GraphQL that is designed to answer business-related questions, and never need to know or care that there is a graph database backing those requests.
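As a rough illustration of that boundary, here is a hedged sketch of a business-facing REST endpoint (ASP.NET Core here purely as an example); the route, types, and INoteRepository abstraction are made up for illustration, and only the repository implementation would know that Gremlin exists behind it:

```csharp
using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;

// Callers ask a business question ("the notes this reader made on this book");
// the Gremlin traversal lives behind this interface and is never exposed.
public interface INoteRepository
{
    IEnumerable<string> GetNotes(string readerId, string bookId);
}

[ApiController]
[Route("api/readers/{readerId}/books/{bookId}/notes")]
public class NotesController : ControllerBase
{
    private readonly INoteRepository _notes;

    public NotesController(INoteRepository notes) => _notes = notes;

    [HttpGet]
    public ActionResult<IEnumerable<string>> Get(string readerId, string bookId)
        => Ok(_notes.GetNotes(readerId, bookId));
}
```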
I'm beginning the process of instrumenting a web application, using StatsD to gather as many relevant metrics as possible. Here are a few examples of the high-level metric names I'm currently using:
http.responseTime
http.status.4xx
http.status.5xx
view.renderTime
oauth.begin.facebook
oauth.complete.facebook
oauth.time.facebook
users.active
...and there are many, many more. What I'm grappling with right now is establishing a consistent hierarchy and set of naming conventions for the various metrics, so that the current ones make sense and there are logical buckets within which to add future metrics.
My question is twofold:
What relevant metrics are you gathering that you have found indispensable?
What naming structure are you using to categorize metrics?
This is a question that has no definitive answer but here's how we do it at Datadog (we are a hosted monitoring service so we tend to obsess over these things).
1. Which metrics are indispensable? It depends on the beholder, but at a high level, for each team, the indispensable metrics are the ones closest to their goals (which may not be the easiest to gather).
System metrics (e.g. system load, memory, etc.) are trivial to gather but seldom actionable, because it is too hard to reliably connect them to a probable cause.
On the other hand, the number of completed product tours matters to anyone tasked with making sure new users are happy from the first minute they use the product. StatsD makes this kind of thing trivially easy to collect.
We have also found that the core set of key metrics for any team changes as the product evolves, so there is a continuous editorial process.
This in turn means that anyone in the company needs to be able to pick and choose which metrics matter to them: no permissions asked, no friction to get to the data.
2. Naming structure The highest level of the hierarchy is the product line or the process. Our web frontend is internally called dogweb, so all metrics from that component are prefixed with dogweb. The next level of the hierarchy is the sub-component, e.g. dogweb.db, dogweb.http, etc.
The last level of hierarchy is the thing being measured (e.g. renderTime or responseTime).
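As a concrete illustration of that naming (just a sketch; the metric name, value, and localhost endpoint are illustrative), a timing metric in the raw StatsD wire format can be emitted like this:

```csharp
using System.Net.Sockets;
using System.Text;

class MetricSketch
{
    static void Main()
    {
        // product/process . sub-component . thing measured
        const string metric = "dogweb.http.responseTime";

        using var udp = new UdpClient("localhost", 8125);          // default StatsD port
        // StatsD timing datagram: "<name>:<value>|ms"
        var payload = Encoding.UTF8.GetBytes($"{metric}:142|ms");
        udp.Send(payload, payload.Length);
    }
}
```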
The unresolved issue in Graphite is the encoding of metric metadata in the metric name (and selection using *, e.g. dogweb.http.browser.*.renderTime). It's clever but can get in the way.
We ended up implementing explicit metadata in our data model, but this is not in statsd/graphite so I will leave the details out. If you want to know more, contact me directly.
I have a WCF Data Service on top of EF (old-fashioned edmx, not code-first) that has millions of rows of data. I am restricting the result set to 500 rows. So far, so good!
I want to give the consumer the ability to "drill" into the data. Ideally, consumer code would run something like select distinct Province against my service, the user would choose one, and then the code would run select distinct Zip where Province == p (my business domain is very different, but hopefully you get the picture!).
Obviously, I can't fetch everything and run distinct on the client side. How do I provide this ability in the service? I don't have an EntitySet "Provinces" or "Zips" exposed in the service. Should I extend the MyEntities class and try to simulate those sets there? Or is there a simpler way to expose other collections, besides the ones that are automatically exposed from Entity Framework?
I hope the question makes sense...
UPDATE: I think I caused the confusion when I mentioned that I was using a WCF service. I am actually using WCF Data Services to provide an OData wrapper over Entity Framework 4.1. Hopefully this clarifies the question...
Have you considered dropping your proprietary interface and going with standards?
It is quite easy to expose OData from .NET, and that allows the user to drill down. Then all you need are some transformations to expose elements like Province or Zip in a sane data model, and there you go ;)
The good thing about OData is that it lets you easily pass LINQ through to the backend and do projections (get me all provinces from customers whose name starts with "a") without necessarily dealing with complete entities.
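If you would rather not expose whole entity sets for those lookups, WCF Data Services also lets you add service operations alongside the OData feed. A minimal sketch, where MyEntities, Customers, Province, and Zip are placeholders for your actual model:

```csharp
using System.Collections.Generic;
using System.Data.Services;
using System.Data.Services.Common;
using System.Linq;
using System.ServiceModel.Web;

public class MyDataService : DataService<MyEntities>   // MyEntities = the generated EF context
{
    public static void InitializeService(DataServiceConfiguration config)
    {
        config.SetEntitySetAccessRule("*", EntitySetRights.AllRead);
        config.SetEntitySetPageSize("*", 500);          // the 500-row cap mentioned in the question
        config.SetServiceOperationAccessRule("DistinctProvinces", ServiceOperationRights.All);
        config.SetServiceOperationAccessRule("DistinctZips", ServiceOperationRights.All);
        config.DataServiceBehavior.MaxProtocolVersion = DataServiceProtocolVersion.V2;
    }

    // GET /MyDataService.svc/DistinctProvinces
    [WebGet]
    public IEnumerable<string> DistinctProvinces()
    {
        return CurrentDataSource.Customers.Select(c => c.Province).Distinct().ToList();
    }

    // GET /MyDataService.svc/DistinctZips?province='ON'
    [WebGet]
    public IEnumerable<string> DistinctZips(string province)
    {
        return CurrentDataSource.Customers
            .Where(c => c.Province == province)
            .Select(c => c.Zip)
            .Distinct()
            .ToList();
    }
}
```

The distinct work happens on the database through LINQ to Entities, so the consumer only ever pulls the short lookup lists.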
I read Ayende's post about Futures in NHibernate and started using them. At first only for an isolated scenario: queries that return paged results (to both fetch and count items in a single trip to the database). But then I thought, why not use futures for every single query? In most of my action methods I need to make multiple queries, on average 5 per page (to get the user profile, footer, categories, group information, etc.).
The obvious benefit is that, batched together, these queries should take less time to run overall. But are there any drawbacks to designing a web application this way? The areas that make me curious are:
Caching in NHibernate - will this design in any way prevent me from using NHibernate's caching features?
Maybe such a single batch of queries will lock database tables and make other parallel actions wait? I use the Read Committed isolation level.
Anything else? What are your experiences with futures? Would you recommend applying them this way, to every possible query?
Your concerns won't apply. Caching won't be broken, and it's pretty rare to lock tables just by batching up read statements.
Futures rock and I wish every ORM I use had them.
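For reference, here is a minimal sketch of the paged-results pattern described in the question, using NHibernate's ICriteria API (the Book entity and page size are placeholders; the actual batching into one round trip depends on the database driver supporting it):

```csharp
using System.Collections.Generic;
using NHibernate;
using NHibernate.Criterion;

public class Book
{
    public virtual int Id { get; set; }
    public virtual string Title { get; set; }
}

public static class PagedQueries
{
    // Page + total count issued together, in a single trip to the database.
    public static (IEnumerable<Book> Items, int Total) GetPage(ISession session, int page, int pageSize)
    {
        var items = session.CreateCriteria<Book>()
            .SetFirstResult(page * pageSize)
            .SetMaxResults(pageSize)
            .Future<Book>();                         // deferred: nothing sent yet

        var total = session.CreateCriteria<Book>()
            .SetProjection(Projections.RowCount())
            .FutureValue<int>();                     // still deferred

        // Accessing .Value (or enumerating items) sends the whole batch at once.
        return (items, total.Value);
    }
}
```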
A web application we wrote for one customer is going to be productized and sold to dozens of companies, and we will be doing the hosting.
I could use some guidance about the pros and cons of rolling out a separate instance for each customer versus going with a single (or very small number of) multi-tenant instances.
At first, as we ramp up, I will have to roll out a separate instance of the application for each new customer (they will come online one at a time) because it's the only immediate option. I imagine this won't scale very well as far as maintenance goes - rolling out changes will become very tedious and possibly error-prone once there are more than 4 or 5 instances out there, unless we automate that somehow.
Also, the single-instance philosophy seems like it might lead to a bunch of forks if people need customizations. And it would be nice to avoid that.
So what has your experience been with this?
Bonus question #1: What's the performance difference between 10 SQL Servers with 2m records each versus one huge one with 20m? Let's say they are all in one table and we're mainly doing inserts and selects on single records. Sometimes the selects are on an indexed varchar(12) or date field.
Bonus Question #2: I imagine that to avoid forking, we would have to make the customizations configurable, or build a plug-in architecture. However, that might increase the cost of doing customizations, and I don't want to be one of those shops that takes a week to resize a textbox, and I don't want to over-invest in infrastructure. Any thoughts on that?
Scale Details
Each customer will have a decent amount of data -- up to a few million records.
There will be a very small number of concurrent users, only a few per customer, plus a handful of internal reps on our end.
It's unclear whether each customer will require customizations, but I would say some of them probably will, and maybe some of those changes will be things that other customers will not want to see.
When faced with a similar challenge, here's what we did:
1. We have one code base with multiple SQL Servers. We maintain multiple IIS servers with copies of the same code base, and we are free to move clients around from SQL Server to SQL Server to maximize performance.
2. If a customer has the money for it, we will install them on their own server and maintain a separate IIS server for them. This accommodates the largest customers, who pay much more every month (10-fold more). We do not, however, give them a separate code base. If they need a mod, we make it visible on a per-client basis (see #3).
3. Custom programming usually results in a configurable option. Even the people who pay us to have their own server get the same version of the code. Sometimes it's as simple as a clause in the code that says "if customer == ourbigcustomer then turn on this option". Yes, that's kludgy hard-coding, but if the customer has enough money, that is fine with me.
4. I didn't quite get from your question whether you wanted to mix different customers' data into one big database. Our rule is that we never do that (never ever). It is one of the wisest choices we ever made: it makes data manipulation much less risky and data restores easier.
I don't see a good reason for either of your two options. I think the real answer lies somewhere in the middle: having multiple instances, each hosting multiple clients.
This adds another layer of automation processing, but it means you can keep the hosting cheap (you won't need to go out and buy a Cray any time soon) and (hopefully) this sort of mentality means you could do failover backups fairly easily.
But let's not get ahead of ourselves... We're talking about a web app, right? Get your database(s) and ASP.NET on different machines. Cluster your databases and you'll have a much happier time playing around with various front-end scenarios. You'll also be able to scale up whichever area runs out of puff first.
By the sounds of it, you'll end up with one clustered database spread over half a dozen (if not a full dozen) database machines and only a couple of front-end boxes.
As for customisations, you've nailed it. You either provide a completely database-hosted set of editable templates, or you have to customise whole instances. I'm all for the first. It's a lot of work (without much in return), but it's well worth it, as you should only need to change the core code when you do upgrades (and you will!). Hunting through a hundred customers' custom instances to make sure they upgrade safely will kill a developer! Templates are the answer. At the very least, you could allow custom CSS without much pain (but they'd need somebody who knew their stuff).
Edit: I've seen a couple of posts going for the all-in-one method. Splitting the instances over multiple machines insulates you from a couple of things:
If you introduce a bug not caught in testing, only a few clients are affected at once.
Hardware fails. Having one mega-server fall over will annoy a lot of people at once. Having a failover mega-server is massively expensive. Having a spare failover box per three or four running servers is much cheaper and annoys fewer people.
Performance can be balanced between boxes on a client-by-client basis, so you can put a few light-use clients with a heavy client, or just fill a box with a few medium-use clients, etc.
On the same idea, usage spikes or other slowdowns only affect clients on the same box. Of course this doesn't mean the same for the database, but you can split that up into a cluster of clusters when you get there.
The big advantage of individual instances will be scaling out as each customer's demand increases. For example, if you're running on a single server and one customer suddenly needs more performance, you're stuffed. But if they're all individual, then moving that customer to a shiny new server is relatively easy.
The big disadvantage will be managing the instances all individually (regardless of whether they're all running on the same server or not).
Regardless, you should only ever have one instance of the codebase, and customisation should all be controlled through plugins and configuration. The front end should naturally be separate from content. Although the cost of making a change may be higher, the benefit in terms of features you can offer your other customers (which will just be customisations you've been asked to do) will pay off, I'm sure. Which is to say nothing of how much easier it'll be to manage a single codebase, as opposed to several.
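To illustrate the plugins-and-configuration idea, here is a hedged sketch (the types and names are made up) of per-client options living in data rather than in forked code:

```csharp
using System.Collections.Generic;

// Per-tenant settings, loaded from a shared configuration table or file.
public class TenantSettings
{
    public string TenantId { get; set; }
    public string Theme { get; set; }                 // which skin/template set to render
    public HashSet<string> EnabledFeatures { get; set; } = new HashSet<string>();
}

public interface ITenantSettingsProvider
{
    TenantSettings GetSettings(string tenantId);
}

public static class FeatureGate
{
    // Instead of hard-coding "if customer == X" checks, the code asks whether a
    // named feature is on for this tenant and the configuration data decides.
    public static bool IsEnabled(ITenantSettingsProvider settings, string tenantId, string feature)
        => settings.GetSettings(tenantId).EnabledFeatures.Contains(feature);
}
```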
I would strongly advise going with the single instance hosted by your company. This has the following advantages:
You have physical access to all code and databases to make changes and updates.
You control the quality of the hardware it is running on.
When you fix a bug in common code, you have fixed it once for all customers.
You can refactor the application design to better support customer-specific code and avoid forking.
As the number of customers grows, you can scale up and scale out your servers to meet performance/responsiveness requirements.
Your application code and databases cannot be tampered with by "inquisitive" customers.
I would have to say it is almost more important where your application is running than how many separate instances of it there are.
Sure, maintaining multiple separate instances is not ideal due to the support/maintenance overhead, but if these apps are all on servers you control, life is much easier than needing remote/physical access to different customers' networks and servers.
Joel Spolsky also talks about exactly this on Stack Overflow podcast 67.
One thing Joel has learned from selling Fogbugz: software designed to be installed on a server in-house at a customer's site, under full control of that customer, is almost never worth the hassle.
20 million records is, relatively speaking, not a huge SQL Server database. A single well-provisioned SQL Server could handle this size comfortably. More important, however, is the number of concurrent accesses to the database. You say there will be only a few users per customer, so this is unlikely to hit you until the level of concurrency grows.
All of the above are good points, but you are missing two key questions: what price point is the service offered at, and how many customers (order of magnitude) will you ultimately have to support (i.e., market size)? In 3 years, will you have a maximum of 10 customers each paying you $500,000 per year, or 500 customers each paying you $10,000 per year? For a small set of high-paying premium customers, the advantages of individual deployments are clear, whereas lower prices and larger customer bases mean a shared solution (a la Oli's comment) is the best way to go. Or go with a cloud platform, although I've only read the hype and tinkered rather than deployed that in the field.
Bonus Question 1: table layout, indexing, the number of reads/writes, and the efficiency and complexity of stored procedures (you are using procs, or at least prepared statements, right?) all matter a heck of a lot more than the number of physical records in the database, up to a point. Beyond that, you will likely find yourself needing to provide individual SQL Server instances for each customer or for a pool of customers, once again depending on some of the questions raised above.
Bonus Question 2: Putting the time into your design for templating and a plugin architecture is essential in this situation and you need to do it sooner rather than later. Once you're in the grind of customizing code for paying customers you will likely not have the time to do it right. This point cannot be stressed enough. Templates and admin tools that give you quick and deep access to data-driven changes in your product will save you a lot of time down the road. As your company / group expands you can then add less technical staff that can be "product experts" who can perform 90% of customizations and maintenance, freeing up your core to continue development or move on to other projects. Finally, don't neglect your data tier in this planning process. Having a core data tier of (almost) immutable stored procs and tables is very important, with custom tables and stored procs clearly demarcated using a good naming convention.
Good luck, feel free to provide more details if you'd like more specific suggestions.
Based on some of the advice received here, we did end up implementing a monolithic multi-tenant version of our application.
I'm glad we did. By the time it was done, we had 3 or 4 forks of the code base (mainly custom skins and things we didn't have n-level support for, but also some actual features), and it was only getting crazier.
We got the multi-tenant version up and successfully folded everything in. There ended up being a lot to think about and a lot to keep track of, but our customers never even knew they had been moved to a new system.
I will say that the actual customer migration was a bit of a bear. I thought at first that we would be able to do it by hand in the backend, but I ended up having to write some fairly involved scripts to get the job done. There were just too many identity columns, and it's not like you can just turn off constraints temporarily when you're importing into a live production system.