gRPC Services: Central Proto Repository or Distributed?

We plan to keep a central proto repository that holds all proto definitions and their generated code. Both messages and service definitions would live in this central Git repo, and we plan to drive our API design standards from it.
However, any service that wants to expose a server or generate clients would have to import the generated code (.pb.go) from this repo.
Do you see any issues with this approach? Or would keeping each service's proto files in its own service repo be a better alternative?
PS: I'm just starting out on the gRPC journey of building microservices and still learning the right way to structure and distribute code here.

This question comes up regularly, and I suspect the lack of published guidance is because the answer depends on your needs more than on the technology.
The specific issue of many repos vs. one is not dissimilar to whether you prefer a monorepo, and only you can effectively determine that. One way to decide is to understand how many shared dependencies your services will have, now and in the future. Another is to work out how many repos you'll end up with (how complex would it be to manage tens or hundreds of repos?).
In my experience, it's a good practice to keep the protos distinct (i.e. separate repo) from code that uses them. Not only may you want to version protos independently from implementations (across languages) but the implementations themselves are independent; in one use-case I must clone a repo containing an entire system (written mostly in one language) in order to get its protos to generate bindings in another language. In this case, it would be preferable if the repo were limited to just the protos.
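As a concrete illustration of that last point, a proto-only repo lets any language generate its own bindings without cloning an implementation. Below is a minimal Python sketch using the grpcio-tools package; the checkout location (contracts/) and the orders/v1/orders.proto file are hypothetical names, not something from the question.

```python
# Minimal sketch: generate Python bindings straight from a proto-only repo.
# Assumes the contracts repo is cloned to ./contracts and contains a
# hypothetical orders/v1/orders.proto; requires the grpcio-tools package.
import os

from grpc_tools import protoc

os.makedirs("gen", exist_ok=True)  # protoc expects the output directory to exist

exit_code = protoc.main([
    "protoc",                            # argv[0] placeholder expected by protoc.main
    "--proto_path=contracts",            # import root = the cloned proto-only repo
    "--python_out=gen",                  # *_pb2.py message classes
    "--grpc_python_out=gen",             # *_pb2_grpc.py client/server stubs
    "contracts/orders/v1/orders.proto",  # the contract file to compile
])
if exit_code != 0:
    raise RuntimeError("protoc failed")
```

The same clone can feed protoc plugins for Go, Java, and so on, which is exactly the situation where a repo limited to the protos pays off.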
You could look to examples for guidance. The gRPC repo keeps a good number of protos rooted on the grpc package (in addition to math). Although less broad, Google bundles its well-known types under google.protobuf.

Related

Microservices orchestration choices

I am exploring possible solutions for orchestrating my flows across multiple services via some piece of infrastructure. Searching turns up a few options such as Conductor, Camunda, Airflow, etc.
I am wondering what would fit my use case better:
One of my services is in Java, the other is in Python.
I need to pass info to the Java service, then take the output and pass it to the Python service.
The final output is then published to another queue.
It feels like Conductor is a good choice, but would love to hear your inputs!
All of the options can fulfill the stated requirement. Think about further/future requirements. Is it only a data pipe? Is it about orchestrating a larger end-to-end business process? Do you need support for long-running processes? Is end-to-end transparency in a graphical form a benefit? Is graphical process modelling in the BPMN 2.0 standard going to be a benefit? Are there going to be audit or reporting requirements? Or is it going to be a simple, isolated, technical solution?
This article gives a great overview of tools in the market and what their primary use cases are: https://blog.bernd-ruecker.com/understanding-the-process-automation-landscape-9406fe019d93
All listed tools might technically be able to execute your workflow (I have no experience working with Conductor & Camunda). A few characteristics on which a decision is usually made are:
open vs closed source
how do you define workflows? (e.g. Python code in Airflow, as in the sketch below; others use JSON, XML, or something custom)
does it come with a UI?
can it scale out in case my workloads start growing?
is it agnostic to any technology or limited to running certain technologies? (e.g. Oozie is built for scheduling jobs on Hadoop)
other requirements such as security, logging, monitoring, etc.
There are many orchestration tool comparisons on the internet, e.g. 1 or 2.
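To make the "Python code in Airflow" point concrete, here is a minimal sketch of the described Java-then-Python-then-publish flow as an Airflow DAG. The jar path, script name, and the queue-publishing step are hypothetical placeholders, and this is not an endorsement of Airflow over the other tools.

```python
# Minimal sketch of the described flow as an Airflow DAG.
# The jar path, script name, and queue publisher are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def publish_to_queue(**context):
    # Placeholder: push the Python step's result to your message queue here
    # (e.g. via a Kafka/RabbitMQ client); the real call depends on your broker.
    result = context["ti"].xcom_pull(task_ids="python_step")
    print(f"publishing {result!r} to the output queue")


with DAG(
    dag_id="java_then_python",
    start_date=datetime(2024, 1, 1),
    schedule=None,      # run on demand (Airflow 2.4+ 'schedule' parameter)
    catchup=False,
) as dag:
    # Step 1: run the Java service as a one-shot job (hypothetical jar).
    java_step = BashOperator(
        task_id="java_step",
        bash_command="java -jar /opt/jobs/transform.jar --out /tmp/java_out.json",
    )

    # Step 2: feed the Java output into the Python service (hypothetical script).
    python_step = BashOperator(
        task_id="python_step",
        bash_command="python /opt/jobs/enrich.py --in /tmp/java_out.json",
    )

    # Step 3: publish the final output to another queue.
    publish = PythonOperator(task_id="publish", python_callable=publish_to_queue)

    java_step >> python_step >> publish
```

The equivalent workflow in Conductor or Camunda would be expressed as a JSON task definition or a BPMN model respectively, which is exactly the "how do you define workflows" trade-off listed above.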
Introduction to Container Orchestration
Container orchestration is the practice of automating the administration of container-based microservice applications across multiple clusters. The approach has been gaining popularity within organizations, and a variety of container orchestration tools have become indispensable for deploying microservice-based applications.
Modern software development is no longer monolithic. Instead, it produces component-based applications that run across many containers; these adaptable, scalable containers work together to deliver a specific purpose or microservice.
Depending on the complexity of the application and other requirements such as load balancing, these containers may span many clusters.
Containers encapsulate application code along with its dependencies and draw the resources they need from physical or virtual hosts. When complex systems are built as containers, clustering them for deployment requires proper management and prioritization.
How to Choose a Container Orchestration Tool?
There are a number of orchestration tools you may examine when selecting which is ideal for your business. To choose well, make sure you understand your company's requirements and operations; then you'll be able to weigh the benefits and drawbacks of each option more readily.
Kubernetes
Kubernetes is feature-rich and well suited to container and cluster management at the enterprise level. Managed Kubernetes is offered by a number of platforms, including Google, AWS, Azure, Pivotal, and Docker, so you have plenty of options as your containerized workload grows.
Its biggest disadvantage is that it does not work with Docker Swarm or Compose CLI manifests. It can also be difficult to understand and set up. Despite these flaws, it is one of the most widely used systems for cluster deployment and management.
Docker Swarm
For individuals who are already familiar with Docker Compose, Docker Swarm is a better option. It's easy to use and doesn't require any additional software. Unlike Kubernetes and Amazon ECS, however, Docker Swarm lacks sophisticated features such as built-in logging and monitoring. As a result, it is better suited to small-scale businesses that are just getting started with containers.
Amazon ECS
If you're already familiar with Amazon Web Services, Amazon ECS is a great way to install and configure clusters. It's a quick and easy method to get started, and it scales to match demand. It also connects with a number of other AWS services. It's also excellent for small teams with limited resources for container maintenance.
One of its disadvantages is that it is incompatible with nonstandard deployments. It also relies on ECS-specific configuration files, which complicates debugging.

Are RocksDB and LevelDB just like Riak?

I have a question regarding some NoSQL databases. In Ehcache we have, for example, the JCache API; in MapDB, the Map interface; and in Riak KV, its own process with clusters. How do I find out exactly which database fits which implementation type? For example, for RocksDB (I assume that it runs as a process) and the same for LevelDB.
For reference, RocksDB and LevelDB perform very similar functions and can be interchangeable in some situations.
Given your question, Are RocksDB and LevelDB just like Riak?, I can say that they are not the same: Riak provides a scalable distributed platform that can connect to one or more backend databases simultaneously (the currently supported backends are Bitcask, LevelDB, Leveled and memory). RocksDB and LevelDB are essentially stand-alone, embeddable database libraries that can be used as such or utilised by other software, such as Riak, as a backend. While you could technically implement RocksDB as a backend for Riak KV without needing a mountain of custom code, you probably wouldn't want to, as RocksDB does not scale well.
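To make the distinction concrete: LevelDB and RocksDB are libraries linked into your own process, with no separate server or cluster to connect to. Here is a minimal sketch using the third-party plyvel binding for LevelDB, assuming it is installed; the path, key, and value are made up.

```python
# Minimal sketch: LevelDB as an embedded store inside your own process,
# via the third-party plyvel binding (pip install plyvel). There is no
# separate server or cluster to connect to, unlike Riak.
import plyvel

db = plyvel.DB("/tmp/example-leveldb", create_if_missing=True)
db.put(b"user:42", b'{"name": "Ada"}')   # keys and values are raw bytes
print(db.get(b"user:42"))                # b'{"name": "Ada"}'
db.close()
```

Riak KV, by contrast, is a service you talk to over the network with a client library, and it may itself be running an engine such as LevelDB internally as its storage backend.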
How do I find out exactly which database fits which implementation type? is rather a broad question. You might want to rephrase it as Which databases offer me {my list of desired implementations/functions}? to make it easier for community members to answer. Please note that some NoSQL databases have multiple uses available. For example, in Riak KV we have Maps, Sets, GSets, Flags, Registers, Solr Search, 2i and the standard CRDT options as well, but some of those may be tied to other requirements: 2i only works with a LevelDB/Leveled backend, and Solr Search requires the Yokozuna package for Riak KV 3.0.0 and above but is built in for all Riak 2.x.x versions.
You may also want to download a few different options to a VM or bare-metal rig, have a play, and see how they work out. There are often cases where two competing products do something very similar on paper but, in your specific use case, one significantly outperforms the other.
To get you started, here are links to Riak 2.9.8 (the latest release of the 2.x.x series) and to the Riak 2.2.6 docs (the 2.9.x docs should be out later this month).
I'm not sure if this has directly answered your question but, hopefully, it will give you some pointers as to where to go next.

Is there a way to protect my R code that runs on an AWS account owned by a client?

I just joined a company that needs to build an ETL pipeline inside an AWS account owned by a client.
There's one part of the ETL pipeline that runs code written in R. The problem is, this R code is a very important part of our business and our intellectual property. Our clients can't see this code.
Is there any way to run this in their AWS environment without them having access to our code? R is not compilable, so we can't just deploy an executable file there. And we HAVE to run this in their environment. I suggested creating an API to run this in our AWS environment, but this is not an option.
In my experience, these are the options I've seen used in situations like this, in increasing order of difficulty:
Take the computation off-premises. This sounds like it is not an option for you.
Generate an API (e.g., shiny, opencpu, plumber) that is callable from their premises. This might require some finessing on their end, as I'm inferring (since they want it all done within their environment) that they might prefer a locked-down computation (perhaps disabling network access).
Rewrite the sensitive portions in Rcpp. While this does have the possible benefit of speed improvements, it makes it slightly harder for them to "discover" the underlying intellectual property. Realize that R and Rcpp are both GPL, which means that anything linked to by R must also be GPL, meaning source-code available. (It is feasible that, since you are not making it public, you can argue your case here, but I am not a lawyer and would not want to be the first consultant found on the wrong side of GPL law here. Again, IANAL.)
Rewrite the sensitive portions in a non-R executable (note that I don't say "as a non-R library and link to it via R calls", since the linking action taints the library with R's GPL). This executable can be called by your otherwise releasable R package (via system or processx::run).
For the record, one might infer C or C++ here, but other higher-level languages do allow compiled executables and are not GPL; Python has some such modes. Be sure to obfuscate your variables :-)
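As a rough sketch of option #4 via that Python route: keep the sensitive logic in a small program that talks to the R wrapper only through stdin/stdout, and ship it as an opaque executable built with a packaging tool (something like PyInstaller, for example). Everything below (script name, JSON fields, the toy computation) is a hypothetical placeholder.

```python
# sensitive_calc.py -- hypothetical stand-in for the proprietary logic.
# Intended to be bundled into a single opaque executable and invoked from R
# via system2()/processx::run(), exchanging JSON over stdin/stdout so the
# interaction is process-to-process rather than linking (the reason the
# answer above suggests a separate executable instead of a linked library).
import json
import sys


def score(record: dict) -> dict:
    # Placeholder computation standing in for the protected business logic.
    return {"id": record.get("id"), "score": 0.42 * float(record.get("value", 0))}


def main() -> None:
    payload = json.load(sys.stdin)               # the R side writes JSON to stdin
    results = [score(r) for r in payload.get("records", [])]
    json.dump({"results": results}, sys.stdout)  # the R side reads JSON from stdout


if __name__ == "__main__":
    main()
```

The R wrapper would then call the packaged executable and parse the JSON reply; the exact invocation depends on how the executable is built and deployed.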
I think your "safest" options are #2 and #4.

What is the prescriptive approach to supporting multiple RDBMS's with Flyway?

I have an application that supports multiple RDBMSs. The SQL needed to build the data model differs for each RDBMS I need to support. The differences aren't small either; they stem from the fact that one of the supported systems is intended for light use (development, small installations) while the others are intended for heavy use. Simply standardizing on a single supported RDBMS is not an option.
As it stands, I need to be able to apply migrations to my application on all of the supported RDBMSs. Where possible I'd like to share migration scripts to reduce duplication, but I imagine that isn't entirely possible.
The only approach I can come up with so far is to keep separate directories in source control for each of the supported environments. Then at runtime, pick the appropriate directory for the RDBMS that the system is connected to.
Is having one directory per supported RDBMS the prescriptive approach or is there a better way?
Right from the FAQ: What is the best strategy for handling database-specific sql?
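For what it's worth, the approach the question already proposes (one directory per supported RDBMS, optionally plus a shared directory for portable scripts, chosen at runtime) can be wired up around the Flyway command line's -locations option. Here is a rough sketch; the directory names, vendor list, and connection details are made up, and it assumes the Flyway CLI is on the PATH.

```python
# Rough sketch: pick migration directories per RDBMS at runtime and hand them
# to the Flyway CLI via -locations. Directory names and the URL are made up.
import subprocess

MIGRATION_DIRS = {
    # shared scripts first, then vendor-specific ones
    "postgresql": ["sql/common", "sql/postgresql"],
    "h2":         ["sql/common", "sql/h2"],
}


def migrate(vendor: str, jdbc_url: str, user: str, password: str) -> None:
    locations = ",".join(f"filesystem:{d}" for d in MIGRATION_DIRS[vendor])
    subprocess.run(
        [
            "flyway",
            f"-url={jdbc_url}",
            f"-user={user}",
            f"-password={password}",
            f"-locations={locations}",
            "migrate",
        ],
        check=True,
    )


migrate("h2", "jdbc:h2:file:./target/devdb", "sa", "")
```

If the application happens to run on Spring Boot, its Flyway integration also supports a {vendor} placeholder in spring.flyway.locations, which achieves the same per-database selection declaratively.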

When should one use a project reference opposed to a binary reference?

My company has a common code library which consists of many class library projects along with supporting test projects. Each class library project outputs a single binary, e.g. Company.Common.Serialization.dll. Since we own the compiled, tested binaries as well as the source code, there's debate as to whether our consuming applications should use binary or project references.
Some arguments in favor of project references:
Project references would allow users to debug and view all solution code without the overhead of loading additional projects/solutions.
Project references would assist in keeping up with common component changes committed to the source control system as changes would be easily identifiable without the active solution.
Some arguments in favor of binary references:
Binary references would simplify solutions and make for faster solution loading times.
Binary references would allow developers to focus on new code rather than potentially being distracted by code which is already baked and proven stable.
Binary references would force us to appropriately dogfood our stuff as we would be using the common library just as those outside of our organization would be required to do.
Since a binary reference can't be debugged (stepped into), one would be forced to replicate and fix issues by extending the existing test projects rather than testing and fixing within the context of the consuming application alone.
Binary references will ensure that concurrent development on the class library project has no impact on the consuming application, as a stable version of the binary will be referenced rather than an in-flux version. It would be the decision of the project lead whether or not to incorporate a newer release of the component if necessary.
What is your policy/preference when it comes to using project or binary references?
It sounds to me as though you've covered all the major points. We've had a similar discussion at work recently and we're not quite decided yet.
However, one thing we've looked into is to reference the binary files, to gain all the advantages you note, but have the binaries built by a common build system where the source code is in a common location, accessible from all developer machines (at least if they're sitting on the network at work), so that any debugging can in fact dive into library code, if necessary.
However, on the same note, we've also tagged a lot of the base classes with appropriate attributes in order to make the debugger skip them completely, because any debugging you do in your own classes (at the level you're developing) would only be vastly outsized by code from the base libraries. This way when you hit the Step Into debugging shortcut key on a library class, you resurface into the next piece of code at your current level, instead of having to wade through tons of library code.
Basically, I definitely vote up (in SO terms) your comments about keeping proven library code out of sight for the normal developer.
Also, if I load the global solution file, that contains all the projects and basically, just everything, ReSharper 4 seems to have some kind of coronary problem, as Visual Studio practically comes to a stand-still.
In my opinion the greatest problem with using project references is that it does not provide consumers with a common baseline for their development. I am assuming that the libraries are changing. If that's the case, building them and ensuring that they are versioned will give you an easily reproducible environment.
Not doing this will mean that your code will mysteriously break when the referenced project changes. But only on some machines.
I tend to treat common libraries like this as 3rd-party resources. This allows the library to have its own build processes, QA testing, etc. When QA (or whomever) "blesses" a release of the library, it's copied to a central location available to all developers. It's then up to each project to decide which version of the library to consume by copying the binaries to a project folder and using binary references in the projects.
One thing that is important is to create debug symbol (pdb) files with each build of the library and make those available as well. The other option is to actually create a local symbol store on your network and have each developer add that symbol store to their VS configuration. This would allow you to debug through the code and still have the benefits of using binary references.
As for the benefits you mention for project references, I don't agree with your second point. To me, it's important that the consuming projects explicitly know which version of the common library they are consuming and for them to take a deliberate step to upgrade that version. This is the best way to guarantee that you don't accidentally pick up changes to the library that haven't been completed or tested.
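As a small illustration of that "deliberate step", a consuming project can pin the library version in a file checked into its own repo and run a copy script as part of its build. This is only a sketch; the share path, pin-file name, and extensions are hypothetical.

```python
# Rough sketch of the "deliberate upgrade" step described above: each consuming
# project pins a library version in a small text file and copies that exact
# build (binaries plus pdb symbols) from the blessed central share into its
# local lib folder. All paths and names here are hypothetical.
import shutil
from pathlib import Path

CENTRAL_SHARE = Path(r"\\buildserver\released\Company.Common")
PROJECT_LIB = Path("lib")


def sync_pinned_version(pin_file: str = "common-library.version") -> None:
    version = Path(pin_file).read_text().strip()   # e.g. "2.3.1"
    source = CENTRAL_SHARE / version
    PROJECT_LIB.mkdir(exist_ok=True)
    for artifact in source.glob("*"):
        if artifact.suffix.lower() in (".dll", ".pdb", ".xml"):
            shutil.copy2(artifact, PROJECT_LIB / artifact.name)


if __name__ == "__main__":
    sync_pinned_version()
```

Upgrading to a newer release of the common library then becomes an explicit, reviewable change to the pin file rather than an accidental side effect of someone else's commit.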
When you don't want it in your solution, or there is potential to split your solution, send all library output to a common bin directory and reference it there.
I have done this in order to allow developers to open a tight solution that only has the Domain, test, and Web projects. Our Windows services, Silverlight stuff, and web control libraries are in separate solutions that include the projects you need when looking at those, but NAnt can build it all.
I believe your question is actually about when projects go together in the same solution; the reason being that projects in the same solution should have project references to each other, and projects in different solutions should have binary references to each other.
I tend to think solutions should contain projects that are developed closely together. Such as your API assemblies and your implementations of those APIs.
Closeness is relative, however. A designer for an application is, by definition, closely related to the app; however, you wouldn't want to have the designer and the application within the same solution (if they are at all complex, that is). You'd probably want to develop the designer against a branch of the program that is merged at intervals spaced further apart than the normal daily integration.
I think that if the project is not part of the solution, you shouldn't include it there... but that's just my opinion
In short, I separate it by concept.