In all the articles I've read so far about Mochiweb, I've heard over and over again that it provides very good scalability. My question is, how exactly does Mochiweb get its scalability? Is it from Erlang's inherent scalability properties, or does Mochiweb have additional code that explicitly enables it to scale well? Put another way, if I were to write a simple HTTP server in Erlang myself, with a simple 'loop' (recursive function) to handle requests, would it have the same level of scalability as a simple web server built using the Mochiweb framework?
UPDATE: I'm not planning to implement a full blown web-server supporting every feature possible. My requirements are very specific - to handle POST data from a HTML form with fixed controls.
Probably. :-)
If you were to write a web server that handles each request in a separate process (light weight thread in Erlang) you could reach the same kind of "scalability" easily. Of course the feature set would be different, unless you implement everything Mochiweb has.
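To make the one-handler-per-request idea concrete, here is a minimal sketch in Python rather than Erlang, since nobody in this thread posted code. It assumes the form is posted as application/x-www-form-urlencoded with fixed fields; the names parse_form, FormHandler and serve are made up for illustration. Python's ThreadingHTTPServer starts an OS thread per incoming connection, which is far heavier than an Erlang process, so this only shows the shape of the approach and says nothing about how Mochiweb is implemented or how it performs.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs


def parse_form(body: bytes) -> dict:
    """Decode a urlencoded form body into {field: first value}."""
    fields = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    return {name: values[0] for name, values in fields.items()}


class FormHandler(BaseHTTPRequestHandler):
    """Handles POSTs from a fixed HTML form (hypothetical handler)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse_form(self.rfile.read(length))
        reply = ("received: " + ", ".join(sorted(form))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def serve(port: int = 8080) -> None:
    """Run the server; each connection is handled in its own thread."""
    ThreadingHTTPServer(("127.0.0.1", port), FormHandler).serve_forever()
```

Calling serve() blocks and listens on 127.0.0.1:8080 (the port is an arbitrary choice). Everything beyond this, such as timeouts, limits on body size and error handling, is the part a library has already thought through for you.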
Erlang also has great built-in support for distribution across many machines, which you might be able to use to gain even more scalability.
MochiWeb isn't scalable itself, as far as I understand it. It's a fast, tiny server library that can handle thousands of requests per second. The way in which it does that has nothing to do with "scalability" (aside from adjusting the number of mochiweb_acceptors that are listening at any given time).
What you get with MochiWeb is a solid web server library, and Erlang's scalability features. If you want to run a single MochiWeb server, when a request comes in, you can still offload the work of processing that request to any machine you want, thanks to Erlang's distributed node infrastructure and cheap message passing. If you want to run multiple MochiWeb servers, you can put them behind a load balancer and use mnesia's distributed features to sync session data between machines.
The point is, MochiWeb is small and fast (enough). Erlang is the scalability power tool.
If you roll your own server solution, you could probably meet or beat MochiWeb's efficiency and "scalability" out of the box. But then you'd have to rethink everything they've already thought of, and you'd have to battle test it yourself.
We're looking at splitting up our architecture (and adding new components) using a Service Oriented Architecture (SOA). There will be a number of external APIs used by third parties, which we will expose through a REST HTTP interface. However, I was wondering what would be best to use internally, as all components are within our control and will be on the same network, though potentially built on different technologies (mainly .NET and Ruby on Rails).
Would there be big performance/functionality gains in using a messaging system (Redis, RabbitMQ, EMS, or others I've not heard of...) instead of HTTP (REST, SOAP, etc.)?
I've struggled to find good information on this topic and (as you can probably tell) I'm fairly new to this area, so any advice or good resources would be appreciated!
Thanks
Messaging tends to give you a more loosely coupled architecture. It can potentially be more robust as well, since individual components can fail without killing the entire infrastructure.
The downside is complexity, the paradigm shift to an asynchronous model, and possibly performance (especially if you're persisting messages everywhere).
You also need to ensure that your messaging system is particularly robust. A single aspect of your logic can go down and restart without affecting everything, but if you lose your core message base, then ALL of your logic is down waiting for the messaging to be back up.
Fortunately, a message bus can run for a long time without humans fiddling with it, and human intervention is the largest source of errors and instability in any system.
In addition to what @Will Hartung mentioned, I would also say that it depends on what you are going to do with your system. If you have mostly client-server type applications, where you have few servers/services and they tend to be completely independent, then it will probably be easier to implement service contracts via REST over HTTP.
If, on the other hand, your entire system is doing bi-directional communication, or if there are many inter-process calls (and particularly if every participant in the system is going to be both a client and a server at some point), then messaging is your best bet. Of the messaging options, I find that AMQP/RabbitMQ is the most feature-rich and easy to use of all of these. It offers you a true asynchronous platform to code against.
The key benefit of using messaging is that you can have a queue for each type of message, so as your system expands and changes, the queues/messages can stay the same while the service that handles them changes underneath. It promotes separation of layers.
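Here is a rough in-process sketch of that queue-per-message-type idea, in Python for brevity. It uses the standard library's queue.Queue as a stand-in for a real broker such as RabbitMQ; MessageBus, run_worker and demo are hypothetical names, and a real system would use the broker's client library instead. The point it tries to show is that the publisher and the message stay the same while the handler behind the queue is swapped.

```python
import queue
import threading


class MessageBus:
    """One in-memory queue per message type (stand-in for broker queues)."""

    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def queue_for(self, message_type):
        with self._lock:
            return self._queues.setdefault(message_type, queue.Queue())

    def publish(self, message_type, payload):
        self.queue_for(message_type).put(payload)


def run_worker(bus, message_type, handler, results):
    """Consume one queue until a None sentinel arrives."""
    q = bus.queue_for(message_type)
    while True:
        payload = q.get()
        if payload is None:
            break
        results.append(handler(payload))


def demo():
    bus = MessageBus()
    results = []
    # Same queue and same message; only the handling service changes.
    for handler in (str.upper, str.title):
        bus.publish("greeting", "hello world")
        bus.publish("greeting", None)  # tell the worker to stop
        worker = threading.Thread(
            target=run_worker, args=(bus, "greeting", handler, results)
        )
        worker.start()
        worker.join()
    return results
```

The publisher never learns which handler is attached, which is the separation of layers described above.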
Finally, and this is a huge thing in my opinion, the proper use of messaging promotes small, independent pieces of code. These are both more testable and more maintainable, and in general it simplifies your enterprise architecture. If you attempt to handle too many services from HTTP endpoints, you will eventually (over the course of a year or two) end up with either (1) way too many endpoints to keep track of or (2) an unmaintainable mess of spaghetti code.
My company started out with using a message-based framework, and it has worked very well for us. The RabbitMQ server has easily been the most reliable component. Feel free to ask if you have any more questions about messaging or SOA.
I have been building a real-time notification system. It’s part of a web application, but events have to be seen as soon as they occur. Long polling was not an option because it would be expensive for the web server to hold on to connections when no events are available, so I had to go for short-lived polls.
Each client hits the web server every, say, 2 seconds (this is a fairly high rate). When events are available, they are sent as JSON to the JavaScript client. Now, this requires a server set-up to handle a high number of short-lived connections. I have implemented one such system using the Yaws web server. However, because Yaws starts quite a number of other services, it feels heavy, and connections begin to get either refused or aborted when they go beyond 30,000 (maybe because I am running some ETS tables in the same Erlang VM as Yaws [separating these may require rpc:call/4, which, I fear, will increase latency]). I know that there are operating-system-specific tweaks to do, and those have been done.
This would not be a problem if it were easy to cluster several Yaws instances. In Yaws, I am using a few appmods, and I am doing things RESTfully. I was thinking that the Cowboy web server might improve things a bit here. I have not used Cowboy before, but I have used Misultin. Looking at Cowboy, it is a full-fledged OTP application that seems easy to cluster and, being lightweight, may increase the number of clients the overall system can handle. Storage is on Mnesia, which I can distribute easily to add more nodes (maybe by replication), so that there is a Cowboy instance in front of every Mnesia instance.
My questions are:
Is my speculation correct, that if I switched from Yaws to Cowboy, I might increase the performance significantly?
Yaws has a clean API via Appmods and the #arg{} record. Does Cowboy have an equivalent of these two things (illustrate please)?
Can Cowboy handle file uploads? If so, which server (Yaws or Cowboy), in your opinion would be better to use in the case of frequent file uploads? Illustrate how file uploads are done with Cowboy.
It is possible to run several Yaws instances on the same machine. Do you think that creating many Yaws instances per server (physical box) and having the client-load distributed across these would help? What do I need to know about doing this?
When I set the yaws.conf parameter max_connections = nolimit, how would I specify the same in Cowboy?
Now, I followed the interview with the Cowboy author, in which he discusses the reasons why Cowboy is more lightweight than Yaws. He says:
The biggest difference is the use of binaries instead of lists. The generic acceptor pool is another. I could list a lot of other small differences but I figure these aren’t the most interesting.
That is, because Cowboy uses the listener-pool library Ranch, and binaries rather than lists, it somehow ends up being able to handle more connections.
Another quote from the same interview:
Since we use one process per connection instead of two, and we use binaries instead of lists, we end up using a lot less memory than other projects without user intervention. Cowboy is also lazy, it doesn’t do anything unless required. So we don’t have much in memory until the user starts calling functions.
I wonder how Yaws handles this case. Somehow, my problem domain needs lightweight HTTP handling. It’s actually true that Yaws will lead to more memory consumption compared to, say, Mochiweb, Misultin or Cowboy. My greatest concern is that Yaws has the best/cleanest API: it gives us access to the #arg{} record containing everything we need, so that we can pull values out ourselves, whereas the others have numerous functions for extracting things. The same goes for documentation: the Yaws docs are pretty good and straightforward. Perhaps I need to look at more Cowboy code for things like file uploading and simple GET and POST request handling.
Otherwise, the questions I asked earlier remain pressing concerns. Yaws is pretty good, but seems to be overkill for this fast, lightweight, short-lived, high-rate polling situation. What do you think?
Your 30000 refusal limit sounds an awful lot like a 32k limit somewhere. Either the default process count, which is 32k, or some system limit on file descriptors and so on. You should not rule out the possibility that the limitation is on the kernel side of things. I've seen systems come to their limits quite easily due to kernel configurations which can be really hard to handle.
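One quick way to rule the file descriptor side in or out is to ask the OS what limit the server process actually runs under. The sketch below uses Python's resource module (Unix only); fd_limits and can_hold are hypothetical helper names, and the reserve of 64 descriptors for logs, files and so on is an arbitrary assumption. It says nothing about the Erlang VM's own process or port limits, which have to be checked separately.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_hold(connections, reserve=64):
    """True if the soft limit leaves room for this many sockets.

    `reserve` is a guessed allowance for descriptors used elsewhere.
    """
    soft, _hard = fd_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return connections + reserve <= soft


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("room for 30000 connections:", can_hold(30_000))
```

If the soft limit turns out to be close to 32k, that would fit the symptom described in the question, but it is only one of the candidates mentioned above.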
I'm not very experienced in web programming, and I haven't actually coded anything in Node.js yet, just curious about the event-driven approach. It does seem good.
The article explains some bad things that could happen when we use a thread-based approach to handle requests, and argues that we should opt for an event-driven approach instead.
In thread-based, the cashier/thread is stuck with us until our food/resource is ready. While in event-driven, the cashier sends us somewhere out of the request queue so we don't block other requests while waiting for our food.
To scale the blocking thread-based, you need to increase the number of threads.
To me this seems like a bad excuse for not using threads/threadpools properly.
Couldn't that be properly handled using IHttpAsyncHandler?
ASP.Net receives a request, uses the ThreadPool and runs the handler (BeginProcessRequest), and then inside it we load the file/database with a callback. That Thread should then be free to handle other requests. Once the file-reading is done, the ThreadPool is called into action again and executes the remaining response.
Not so different for me, so why is that not as scalable?
One disadvantage of the thread-based approach that I do know of is that threads need more memory. But only with threads can you enjoy the benefits of multiple cores. I doubt Node.js uses no threads/cores at all.
So, based on just event-driven vs thread-based (don't bring the "because it's JavaScript and every browser..." argument), can someone point out to me what the actual benefit is of using Node.js instead of the existing technology?
That was a long question. Thanks :)
First of all, Node.js is not multi-threaded. This is important. You have to be a very talented programmer to design programs that work perfectly in a threaded environment. Threads are just hard.
You have to be a god to maintain a threaded project where it wasn't designed properly. There are just so many problems that can be hard to avoid in very large projects.
Secondly, the whole platform was designed to be run asynchronously. Have you seen any ASP.NET project where every single IO interaction was asynchronous? Simply put, ASP.NET was not designed to be event-driven.
Then, there's the memory footprint due to the fact that we have one thread per open-connection and the whole scaling issue. Correct me if I'm wrong but I don't know how you would avoid creating a new thread for each connection in ASP.NET.
Another issue is that a Node.js request is idle when it's not being used or when it's waiting for IO. On the other hand, a C# thread sleeps. Now, there is a limit to the number of these threads that can sleep. In Node.js, you can easily handle 10k clients at the same time in parallel on one development machine. You try handling 10k threads in parallel on one development machine.
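To illustrate the "idle request" point without Node.js at hand, here is a small sketch using Python's asyncio, which follows the same single-threaded event loop model. It is an analogy, not Node.js code: handle_client, serve_many and run are made-up names, and asyncio.sleep stands in for a request waiting on real I/O. Each of the 10,000 simulated clients is just a suspended coroutine while it waits, and all of them are driven by one thread.

```python
import asyncio
import threading


async def handle_client(client_id, wait_seconds):
    # Stand-in for waiting on a database or file; the coroutine is
    # suspended here and costs no thread while it waits.
    await asyncio.sleep(wait_seconds)
    return client_id, threading.get_ident()


async def serve_many(n_clients, wait_seconds=0.1):
    pending = [handle_client(i, wait_seconds) for i in range(n_clients)]
    return await asyncio.gather(*pending)


def run(n_clients=10_000):
    """Return (requests completed, distinct threads that ran them)."""
    results = asyncio.run(serve_many(n_clients))
    thread_ids = {thread_id for _, thread_id in results}
    return len(results), len(thread_ids)
```

Calling run() completes all the waits concurrently on a single thread. The equivalent with one blocked OS thread per client would need 10,000 thread stacks, which is the memory cost discussed above.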
JavaScript itself as a language makes asynchronous coding easier. If you're still in C# 2.0, then the asynchronous syntax is a real pain. A lot of developers will simply get confused if you're defining Action<> and Func<> all over the place and using callbacks. An ASP.NET project written in an evented way is just not maintainable by an average ASP.NET developer.
As for threads and cores: Node.js is single-threaded and scales by creating multiple node processes. If you have a 16-core machine, then you run 16 instances of your Node.js server and have a single Node.js load balancer in front of them. (Maybe an nginx load balancer if you want.)
This was all written into the platform at a very low-level right from the beginning. This was not some functionality bolted on later down the line.
Other advantages
Node.js has a lot more to it than the above. The above is only why Node.js' way of handling the event loop is better than doing it with the asynchronous capabilities in ASP.NET.
Performance. It's fast. Real fast.
One big advantage of Node.js is its low-level API. You have a lot of control.
You have the entire HTTP server integrated directly into your code rather than outsourced to IIS.
You have the entire nginx vs Apache comparison.
The entire C10K challenge is handled well by Node.js but not by IIS.
AJAX and JSON communication feels natural and easy.
Real-time communication is one of the great things about Node.js. It was made for it.
Plays nicely with document-based nosql databases.
Can run a TCP server as well. Can do file-writing access, can run any unix console command on the server.
You query your database in javascript using, for example, CouchDB and map/reduce. You write your client in JavaScript. There are no context switches whilst developing on your web stack.
Rich set of community-driven open-source modules. Everything in node.js is open source.
Small footprint and almost no dependencies. You can build the node.js source yourself.
Disadvantages of Node.js
It's hard. It's young. As a skilled JavaScript developer, I face difficulty writing a website with Node.js just because of its low-level nature and the level of control I have. It feels just like C. A lot of flexibility and power either to be used for me or to hang me.
The API is not frozen. It's changing rapidly. I can imagine having to rewrite a large website completely in 5 years because of the amount Node.js will be changed by then. It is do-able, you just have to be aware that maintenance on node.js websites is not cheap.
Further reading
http://blog.mixu.net/2011/02/01/understanding-the-node-js-event-loop/
http://blip.tv/file/2899135
http://nodeguide.com/
There are a lot of misconceptions regarding Node.js vs. ASP.NET and asynchronous programming. You can do non-blocking I/O in ASP.NET. Most people don't know that the .NET Framework uses Windows I/O completion ports underneath when you do web service calls or other I/O-bound operations using the Begin/End pattern in .NET 2.0 and above. I/O completion ports are the way the Windows operating system supports non-blocking I/O, so that the app thread is freed while the I/O operation completes. Interestingly, Node.js uses a less optimal non-blocking I/O implementation on Windows through Cygwin. A new Windows version is on the road map, which with Microsoft's guidance will use I/O completion ports. At that point there is no difference underneath.
It is also possible to do non-blocking database calls in ADO.NET but be aware of ORM tools such as NHibernate and Entity Framework. They are still very much synchronous.
Synchronous IO (blocking) makes the control flow much clearer and it has for this reason become popular. The reason why computer environments are multithreaded has only superficially to do with this. It is more generally related to time sharing and utilization of multiple CPUs.
Having only a single thread can cause starvation during lengthy operations, which can be related to both IO and complex computations. So, even though the rule of thumb is one thread per core when utilizing non-blocking IO, one should still consider a sufficient thread pool size so that simple requests don't get starved by more complex operations, if such exist. Multiple threads also allow complex operations to be split easily among multiple CPUs. A single-threaded environment like Node.js can only utilize multicore processors through more processes and message passing to coordinate action.
I have personally not yet seen any compelling argument to introduce an additional technology such as Node.js. There may be good reasons, but in my opinion they have little to do with servicing a large number of connections through non-blocking IO, since this can also be done with ASP.NET.
BTW tamejs can help make your nodejs code more readable similar to the new upcoming .Net Async CTP.
It is easy to understate the cultural difference between the Node.js and ASP.NET communities. Sure, IHttpAsyncHandler exists and it's been around since .NET 1.0 so it might even be good, but all of the code and discussion around Node.js is about async I/O which is decidedly not the case when it comes to .NET. Want to use LINQ To SQL? You kind of can, kind of. Want to log stuff? Maybe "CSharp DotNet Logger" will work, maybe.
So yes, IHttpAsyncHandler is there and if you're really careful you might be able to write an event driven web-service without tripping over some blocking I/O somewhere, but I don't really get the impression a lot of people are using it (and it certainly isn't the prominent way for writing ASP.NET apps). In contrast, Node.js is all about evented I/O, all the code examples, all the libraries and it's the only way people are using it. So if you were going to bet on which one's evented I/O model actually worked all the way through, Node.js would probably be the one to pick.
Given how the technology has improved, and after reading the links below, I'd say what matters is expertise and choosing the right mix for the particular scenario. Node.js is maturing, and on the ASP.NET side we have ASP.NET MVC, Web API, SignalR, etc. to make things better.
Node.js vs .Net performance
http://www.salmanq.com/blog/net-and-node-js-performance-comparison/2013/03/
and
http://www.hanselman.com/blog/InstallingAndRunningNodejsApplicationsWithinIISOnWindowsAreYouMad.aspx
Thanks.
I have a web application running on IIS that talks to a [remote] service machine.
I am not sure whether to choose TCP or HTTP as the main protocol.
More details:
I will have more than one service/endpoint
some of them will be one-way
the others will be two-way
the web pages will work in front of the services
we are talking about a high-scale website
I know the difference pretty well, but I am looking for a good benchmark that shows how much faster TCP is.
HTTP is a layer built on top of the TCP layer to somewhat standardize data transmission. So naturally using TCP sockets will be less heavy than using HTTP. If performance is the only thing you care about, then plain TCP is the best solution for you.
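To put a rough number on "less heavy", the sketch below compares the bytes added around the same payload by a minimal HTTP/1.1 POST and by a simple 4-byte length prefix over raw TCP. Both framings are assumptions made for illustration: the host name service.internal, the path /rpc and the header set are invented, and real HTTP clients usually send more headers than this. It only counts bytes on the wire; it does not measure parsing cost or connection setup.

```python
import struct


def http_post_bytes(host: str, path: str, body: bytes) -> bytes:
    """Build a bare-bones HTTP/1.1 POST carrying `body`."""
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/octet-stream\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def length_prefixed_bytes(body: bytes) -> bytes:
    """Frame `body` for raw TCP with a 4-byte big-endian length."""
    return struct.pack("!I", len(body)) + body


def overhead(body: bytes):
    """Return (HTTP overhead, raw-TCP framing overhead) in bytes."""
    http_extra = len(http_post_bytes("service.internal", "/rpc", body)) - len(body)
    framed_extra = len(length_prefixed_bytes(body)) - len(body)
    return http_extra, framed_extra
```

For small messages the HTTP headers can outweigh the payload itself, while for large bodies the difference becomes a small fraction of the total, which is one reason a generic answer is hard to give.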
You may want to consider HTTP because of its ease of use and simplicity which ultimately reduces development time. If you are doing something that might be directly consumed by a browser (through an AJAX call) then you should use HTTP. For a non-modern browser to directly consume TCP connections without HTTP you would have to use Flash or Silverlight and this normally happens for rich content such as video and/or audio. However, many modern browsers now (as of 2013) support API's to access network, audio, and video resources directly via JavaScript. The only thing to consider is the usage rate of modern web browsers among your users; see caniuse.com for the latest info regarding browser compatibility.
As for benchmarks, this is the only thing I found. See page 5, it has the performance graph. Note that it doesn't really compare apples to apples since it compares the TCP/Binary data option with the HTTP/XML data option. Which begs the question: what kind of data are your services outputting? binary (video, audio, files) or text (JSON, XML, HTML)?
In general, performance-oriented systems like those in the military or financial sectors will probably use plain TCP connections, whereas general web-focused companies will opt to use HTTP and use IIS or Apache to host their services.
The question you really need an answer for is "will TCP or HTTP be faster for my application". The answer is that it depends on the nature of your application, and on the way that you use TCP and/or HTTP in your application. A generic HTTP vs TCP benchmark won't answer your question, because the chances are that the benchmark won't match your application behaviour.
In theory, an optimally designed / implemented solution using TCP will be faster than one that uses HTTP. But it may also be considerably more work to implement ... depending on the details of your application.
There are other issues that might affect your choice. For example, you are less likely to run into firewall issues if you use HTTP than if you use TCP on some random port. Another is that HTTP would make it easier to implement a load balancer between the IIS server and the backend systems.
Finally, at the end of the day it is probably more important that your system is secure, reliable, maintainable and (maybe) scalable than it is fast. A sensible strategy is to implement the simple version first, but have plans in your head for how to make it faster ... if the simple solution is too slow.
You could always benchmark it.
In general, if what you want to accomplish can be easily done over HTTP (i.e. the only reason you would otherwise think about using raw TCP is for a possible performance boost) you should probably just use HTTP. Sure, you can do socket programming, but why bother? Lots of people have spent a lot of time and effort building HTTP client libraries and servers, and they have spent waaaaaay more time optimizing and testing that code than you will ever be able to possibly spend on your TCP sockets. There are simply so many possible errors that you would have to handle, edge cases, and optimizations that can be done, that it is usually easier and safer to use a library function for HTTP.
Plus, the HTTP specs define all kinds of features (which clients and servers already implement, so you get to use them "for free", i.e. with no extra implementation work), which makes any third-party interoperability that much easier. "Here is my URL, here are the rules for what you send, here are the rules for what I return..."
I have a self-hosted native Windows C++ server application that uses the Casablanca C++ REST SDK. Any client (C#, JavaScript, C++, cURL, basically anything that can send a POST, GET, PUT, or DEL message) can send requests to this self-hosted Windows app. I can also use a plain browser address bar to make GET requests with various parameters. Currently I only run this system on a private intranet, so it is very fast. I haven't benchmarked it against raw TCP, but on a private intranet I doubt there would be even a few microseconds' difference. For convenience, ease of development, and the ability to expand to a full-blown internet app, it's a dream come true. It is a dedicated system with a private protocol using small JSON packets, so I'm not certain whether that fits your application's needs. Another nice thing is that this native C++ code could be ported fairly easily to run on Linux/macOS, as the Casablanca REST SDK is portable to those OSes.
Our app needs instant notification, so obviously I should use some WCF duplex or socket communication. The problem is that the app is a partial-trust XBAP, and thus I'm not allowed to use anything but BasicHttpBinding. Therefore I need to poll for changes.
Now comes the question: my PM says the update interval should be around 2 seconds, and the app will run on an intranet with 500 users on a single web server.
Do any of you have experience with how polling would influence the web server?
The service is fairly simple: it takes a GUID as an argument and returns a list of GUIDs. All data access is cached, so I guess the load on the server is minimal for a single call, but for 500...
Apart from the polling, the web server has little work.
So, based on this little info (assume a standard server HW, whatever that is), is it possible to make a qualified guess?
Is it possible or not to implement this, will it work?
Yes, I know estimating this is difficult, but I'd be really glad if some of you could share some thoughts on this
Regards
Larsi
Don't estimate, benchmark.
Polling.. bad, but if you have no other choice, then it's ideal :)
Bear in mind the fact that you will no doubt have keep-alives on, so you will have 500 permanently-connected users. The memory usage of that will probably be more significant than the processor usage. I can't imagine network access (even in a relatively bloaty web service) would use much network capacity, but your network latency might become an issue - especially as we've all seen web applications 'pause' for a little while.
In the end though, you'll probably be ok, but you'll have to check it yourself. There are plenty of web service stress testers, you can use Microsoft's WAS tool for one, here's a few links to others.
Try using soapui, a web service testing tool, to check the performance of your web service. There is a paid version and an open source version that is free.
cheers
I don't think it will be a particular problem. I'd imagine the response time for each request would be pretty low, unless you're pulling back a hell of a lot of data, so 500 connections spread over 2 seconds shouldn't hit it too hard.
You can use a stress testing tool to verify your webserver can handle the load though, before you commit to this design.
250 qps probably is doable with quite modest hardware and network bandwidth provided you do minimize the data sent back & forth. I assume you're caching these GUID lists on the client so you can just send a small "no updates" response in the normal case.
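The 250 qps figure is just 500 users divided by the 2-second interval. A tiny back-of-the-envelope calculator makes it easy to vary the assumptions; polling_load is a made-up helper, and the 500-byte request and response sizes in the usage note are guesses for a small "no updates" exchange including headers, not measurements.

```python
def polling_load(users, interval_s, request_bytes, response_bytes):
    """Return (requests per second, bandwidth in Mbit/s) for steady polling."""
    requests_per_s = users / interval_s
    bytes_per_s = requests_per_s * (request_bytes + response_bytes)
    return requests_per_s, bytes_per_s * 8 / 1_000_000
```

With those guessed sizes, polling_load(500, 2, 500, 500) gives 250 requests per second and about 2 Mbit/s, which is small for an intranet. The estimate ignores connection setup and server-side processing time, so it does not replace a prototype measurement.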
Should be pretty easy to measure with a simple prototype though to be more confident.