Swimming in the murky, shallow waters of real-time chat and notifications
16 Nov 2016
Implementing real-time chat and notifications that would serve tens of thousands of people at the same time
Recently, while working on a web application for a large client of Imperia Mobile, we were faced with designing and implementing real-time chat and notifications that would serve tens of thousands of people at the same time. The traditional HTTP request-response approach would not be appropriate, since the client has to keep sending requests one after the other to check for new data, which is a massive overhead for both sides.
Instead, we approached the task by implementing WebSocket - a communication protocol providing full-duplex communication channels over a single TCP connection. We tested and compared Socket.io and the Ratchet framework, two very helpful ready-made solutions that facilitate the implementation of real-time communication.
Below we go into further detail to explain the working solution we found and the steps to achieve it. Just scroll down for a dip in the murky waters of real-time communication.
Here is how we navigated the shallow end
First we looked at what the big players are using. When talking about real-time chat applications and notifications, what comes to mind is how it is done in the global social networks. If we take a look at Facebook, we will see they use HTTP long polling. In a nutshell, HTTP long polling is a technique where the client polls the server requesting new information. The server holds the request open until new data is available. Once available, the server responds and sends the new information. When the client receives the new information, it immediately sends another request, and the operation is repeated. This effectively emulates a server push feature.
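To make the cycle described above concrete, here is a minimal in-process sketch in Python. It is not how Facebook or any real HTTP server implements long polling: the names LongPollServer, poll and client_loop are our own illustrative inventions, and a thread-safe queue stands in for the held-open HTTP request.

```python
import queue
import threading
import time


class LongPollServer:
    """Toy in-process stand-in for a long-polling HTTP endpoint (hypothetical)."""

    def __init__(self):
        self._events = queue.Queue()

    def publish(self, message):
        # Called when new data becomes available on the server side.
        self._events.put(message)

    def poll(self, timeout=1.0):
        # Hold the "request" open until data arrives or the timeout expires.
        try:
            return self._events.get(timeout=timeout)
        except queue.Empty:
            return None  # empty response; the client simply polls again


def client_loop(server, expected_count, timeout=1.0):
    """Re-issue a poll as soon as the previous one returns."""
    received = []
    while len(received) < expected_count:
        message = server.poll(timeout=timeout)
        if message is not None:
            received.append(message)
    return received


def demo():
    server = LongPollServer()

    def producer():
        for text in ("hello", "how are you?", "bye"):
            time.sleep(0.05)  # data shows up at unpredictable times
            server.publish(text)

    threading.Thread(target=producer, daemon=True).start()
    return client_loop(server, expected_count=3)
```

Calling demo() returns the three messages in the order they were published, even though the client never knows in advance when the next one will arrive.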
Potential issues to look out for
With HTTP long polling there is still request-response overhead for both client and server, but the traffic is not as intense compared to standard HTTP polling. Another overhead is connection establishment. A common criticism of long polling is that this mechanism frequently opens TCP/IP connections and then closes them. However, polling mechanisms work well with persistent HTTP connections that can be reused for many poll requests.
Another tricky aspect of polling mechanisms is resource allocation. The HTTP long polling mechanism requires that for each client, both a TCP/IP connection and an HTTP request are held open. Thus, it is important to consider the resources related to both of these when sizing an HTTP long polling application. Typically the resources used per TCP/IP connection are minimal and scale reasonably. The resources allocated to HTTP requests, however, are frequently significant, and the total number of outstanding requests can be limited on some gateways, proxies, and servers.
WebSocket saves the day
A different way to design real-time chat is to use WebSocket. WebSocket is a protocol providing full-duplex communication channels over a single TCP connection. A drawback of WebSocket compared to HTTP long polling is that not every web browser supports it. Approximately 87% of web browsers support this feature according to Caniuse.
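For readers curious how a single TCP connection turns into a WebSocket, the sketch below shows the server side of the opening handshake as defined in RFC 6455: the client sends a Sec-WebSocket-Key header with its HTTP upgrade request, and the server answers with a Sec-WebSocket-Accept value derived from it. The function names are ours; Socket.io and Ratchet do this for you, so this is only an illustration of the mechanism, not something you would normally write.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(sec_websocket_key: str) -> str:
    """Derive the Sec-WebSocket-Accept value from the client's key."""
    raw = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    return base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")


def handshake_response(sec_websocket_key: str) -> str:
    """Build the minimal 101 response that switches the connection to WebSocket."""
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept_key(sec_websocket_key)}\r\n"
        "\r\n"
    )
```

After this exchange the same TCP connection stays open and both sides can send frames at any time, which is what makes the channel full-duplex.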
Our field day exercise and results
Here at Imperia Mobile we carried out an evaluation study of two real-time chat implementations: the first using Node.js and Socket.io, and the second in PHP using the Ratchet framework.
For the Socket.io implementation we used their demo chat, and for Ratchet their “Hello world” tutorial. We ran the tests on a fully virtualized machine with 2GB RAM. The processor type was kvm64 with 4 cores on 1 socket. We ran a couple of hundred parallel processes, each one initiating 100 connections to the socket server. Each connection sent 10 messages, which the socket server broadcast to all connections.
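To get a feel for why broadcasting is such a demanding test, here is a rough back-of-the-envelope calculation. It rests on two assumptions that are ours and not measured facts: that "a couple of hundred" means 200 processes, and that all connections were open at the same moment, so every message reached every connection.

```python
def broadcast_load(processes, connections_per_process, messages_per_connection):
    """Estimate the worst-case load of a broadcast test.

    Returns (connections, messages_sent, deliveries).
    """
    connections = processes * connections_per_process
    messages_sent = connections * messages_per_connection
    # Worst case: every message is delivered to every open connection.
    deliveries = messages_sent * connections
    return connections, messages_sent, deliveries


# 200 processes is an assumed figure standing in for "a couple of hundred".
connections, messages_sent, deliveries = broadcast_load(200, 100, 10)
```

Under these assumptions that is 20,000 connections and 200,000 incoming messages, but 4,000,000,000 outgoing deliveries. The cost of broadcasting grows with the square of the number of connections, which is why it stresses the server far more than the incoming traffic suggests.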
The good, the bad and the shallow waters
What is good about Socket.io is the Socket.io-client module, which reduces implementation time on the client side. Another piece of ‘magic’ you can rely on is what happens if you lose the connection to the server: Socket.io-client takes care of re-establishing it.
During the tests, we noticed that Socket.io sometimes disconnected users, and the connection was re-established only after a while. During this downtime, the user did not receive any broadcast messages. This can be a significant drawback, depending on the specifications of the system to be implemented.
On the other hand, Ratchet is slower than Socket.io, but it supports more simultaneous live connections, and we did not experience any users being disconnected by the server during testing.
Where it gets murky
You can get into trouble if you need secure connections over SSL. The problem is that the socket library React, which Ratchet uses, does not support direct SSL connections. The workaround is to use Stunnel or some other proxy. The Ratchet documentation has an example using HAProxy.
The WebSocket implementation is a single process running on the server, which listens for and handles events. If an exception is thrown or a fatal error occurs during event handling, the whole process goes down. When this process is down, all connected users are disconnected and new users cannot establish a connection. To work properly again, the process has to be restarted.
To take care of the above issue, you have to put in additional effort to implement or integrate a monitoring system that can handle this kind of trouble. Neither Ratchet nor Socket.io comes with a built-in tool for monitoring and statistics that shows what is happening with the socket server and whether it is working properly. You can, however, use a tool that monitors at the operating system level whether the process is running.
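As an illustration of process-level monitoring, here is a minimal supervisor sketch in Python. It is not the monitoring tool we built. The supervise function and the CRASHING_SERVER command are hypothetical, and in production an existing process manager such as systemd or supervisord would usually be the more sensible choice.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, delay_seconds=0.1):
    """Run command and restart it whenever it exits with a non-zero status.

    Returns the list of exit codes observed, oldest first.
    """
    exit_codes = []
    while True:
        exit_code = subprocess.call(command)  # blocks until the process exits
        exit_codes.append(exit_code)
        if exit_code == 0:
            break  # clean shutdown, nothing to restart
        if len(exit_codes) > max_restarts:
            break  # give up and let a human (or an alert) take over
        time.sleep(delay_seconds)  # short pause to avoid a tight crash loop
    return exit_codes


# Hypothetical stand-in for a socket server that dies on a fatal error.
CRASHING_SERVER = [sys.executable, "-c", "import sys; sys.exit(1)"]
```

With max_restarts=2 the crashing command is started three times in total (the first run plus two restarts) before the supervisor gives up. A real setup would also raise an alert at that point.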
Another issue, related to security, is Cross-Site WebSocket Hijacking. If there is no extra authentication layer when establishing the connection with the server, the authentication data is by default just the Cookie header. As WebSocket connections are not restrained by the same-origin policy, a malicious web page can easily initiate a WebSocket request from the victim's browser, which attaches the victim's cookies automatically. This can become a very big deal if the socket is used to transfer very sensitive data.
A crucial issue could be the synchronous nature of the provided implementations. Every event that should be handled by the server is placed in a queue. If the queue overflows, some events might never be handled. It would be easy for anyone with ill intentions and some technical background to send a lot of dummy messages to your socket server, keeping the server busy handling dummy messages instead of important ones.
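One common mitigation is to check the Origin header that browsers send with the WebSocket handshake and to reject connections from pages you do not control. The sketch below shows the idea in Python. The allowlist and the is_trusted_origin helper are hypothetical, and the headers are assumed to arrive as a plain dictionary; an Origin check complements a proper authentication token but does not replace it.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the origins that serve your own pages.
ALLOWED_ORIGINS = {"https://chat.example.com"}


def is_trusted_origin(headers, allowed_origins=ALLOWED_ORIGINS):
    """Return True only if the handshake comes from an allowlisted origin."""
    origin = headers.get("Origin")
    if not origin:
        return False  # no Origin header at all: refuse rather than guess
    parts = urlsplit(origin)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # Compare scheme and host (plus port, if any) exactly, ignoring case.
    normalized = f"{parts.scheme}://{parts.netloc}".lower()
    return normalized in allowed_origins
```

Note that the comparison is exact: an origin such as https://chat.example.com.evil.net does not pass just because it starts with a trusted name.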
Our recommended solutions:
A simple way of handling too many dummy messages is to implement a monitoring system. If an unexpected number of messages comes from a specific IP, Ratchet offers a blacklist handler to block specific users.
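The detection part can be as simple as counting messages per IP in a sliding time window. Below is a minimal sketch in Python. The FloodDetector class and its thresholds are illustrative assumptions, not part of Ratchet; in a Ratchet application the addresses it flags would be handed over to the blacklist handler.

```python
import time
from collections import defaultdict, deque


class FloodDetector:
    """Flag IPs that send more than max_messages within window_seconds."""

    def __init__(self, max_messages=20, window_seconds=1.0):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)  # ip -> recent message times
        self.blocked = set()

    def allow(self, ip, now=None):
        """Record one message from ip; return False if it should be dropped."""
        if ip in self.blocked:
            return False
        now = time.monotonic() if now is None else now
        recent = self._timestamps[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_messages:
            self.blocked.add(ip)  # candidate for the server's blacklist
            return False
        return True
```

The now parameter exists only to make the behaviour easy to test with fixed timestamps; in normal use it is left out and the monotonic clock is used.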
In our case, the solution is to use WebSocket only as a push server, without handling users' messages, but this strongly depends on system requirements and specifications. If you need to implement real-time chat, the client could send messages to the server via a simple HTTP request, and the server could notify the WebSocket process to send the message to the recipient. A workaround for the security issue related to transferring sensitive data would be for the WebSocket to notify the receiver to refresh its messages, instead of sending the whole message. The latter is effective if message intensity is low.
The problems related to the number of simultaneous connections can be solved by running a couple of parallel WebSocket processes on the server, listening on different ports. This requires implementing some load balancing logic to distribute users' connections. The drawback is that this is suitable if you use your WebSocket as a push server, rather than for two-way communication.
A tricky part of using Ratchet is advanced configuration. By default, the event loop that PHP uses through the built-in function stream_select is an old, slow polling mechanism. We replaced it with Libevent - an asynchronous, event-driven C library that is meant to replace the default event loop API. With Libevent, PHP will use the faster epoll or kqueue, which drastically improves concurrency (handling many connections quickly).
Another advanced improvement for Ratchet WebSockets is to increase the open file limit. As the Unix philosophy is “everything is a file”, each user connected to your WebSocket is represented by a file descriptor. On many systems there are layers of security that prevent opening too many file descriptors, and the default limit is commonly 1024. With such a configuration, the WebSocket server can handle at most 1024 connections.
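The flow described above can be sketched in a few lines of Python. This is an in-memory illustration under our own assumptions, not our production code: PushHub stands in for the WebSocket process, ChatBackend for the ordinary HTTP application, and the "refresh_messages" event name is made up.

```python
from collections import defaultdict


class PushHub:
    """Stand-in for the WebSocket process: it only pushes, never reads chat input."""

    def __init__(self):
        self._connections = defaultdict(list)  # user id -> send callbacks

    def register(self, user_id, send):
        self._connections[user_id].append(send)

    def push(self, user_id, payload):
        for send in self._connections.get(user_id, []):
            send(payload)


class ChatBackend:
    """Stand-in for the HTTP application that stores and serves messages."""

    def __init__(self, hub):
        self._hub = hub
        self._inbox = defaultdict(list)

    def post_message(self, sender, recipient, text):
        # Would be the handler of a plain HTTP POST from the sender.
        self._inbox[recipient].append({"from": sender, "text": text})
        # Push only a hint; the message body never travels over the socket.
        self._hub.push(recipient, {"event": "refresh_messages"})

    def fetch_messages(self, user_id):
        # Would be an authenticated HTTP GET issued after the hint arrives.
        return list(self._inbox[user_id])
```

The recipient's browser receives only the hint over the socket and then fetches the actual messages through the normal, authenticated HTTP route.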
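A very simple form of such load balancing logic is to derive the port from the user id, so that both the client and the backend can work out which process holds a given user's connection without any shared lookup table. The sketch below assumes four processes on made-up ports; port_for_user is a hypothetical helper, not part of Ratchet or Socket.io.

```python
import hashlib

# Hypothetical ports, one per running WebSocket process.
PORTS = [8081, 8082, 8083, 8084]


def port_for_user(user_id, ports=PORTS):
    """Map a user id to one of the ports in a stable, evenly spread way."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ports)
    return ports[index]
```

This works well for a push server, because the backend only needs to find the one process that holds the recipient's connection. With two-way chat the processes would also have to exchange messages with each other, which is where the approach gets complicated.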
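The same idea exists in most languages. As an illustration only (this is Python, not the PHP configuration we used), the standard selectors module picks the most efficient readiness mechanism the operating system offers, such as epoll on Linux or kqueue on BSD and macOS, and falls back to select where nothing better is available. The wait_for_readable function is our own example name.

```python
import selectors
import socket


def wait_for_readable(timeout=1.0):
    """Register one socket and wait until the OS reports it as readable.

    Returns (selector class name, labels of ready sockets, bytes received).
    """
    selector = selectors.DefaultSelector()  # best mechanism for this platform
    reader, writer = socket.socketpair()
    try:
        reader.setblocking(False)
        selector.register(reader, selectors.EVENT_READ, data="reader")
        writer.send(b"ping")
        ready = selector.select(timeout=timeout)
        labels = [key.data for key, _events in ready]
        payload = reader.recv(4)
        return type(selector).__name__, labels, payload
    finally:
        selector.close()
        reader.close()
        writer.close()
```

The first element of the result tells you which mechanism was chosen on your machine. An event loop built on epoll or kqueue is told which connections are ready, instead of having to scan all of them on every iteration, which is where the gain with many connections comes from.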
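You can inspect the limit from a shell with ulimit -n. As a hedged illustration of doing the same programmatically, the Python sketch below reads the soft and hard limits through the standard resource module (Unix only) and raises the soft limit as far as the hard limit allows. The function names are ours, and raising the hard limit itself still requires system configuration and usually root privileges.

```python
import resource


def descriptor_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft limit to target, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = descriptor_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed this
    if soft != resource.RLIM_INFINITY and target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # the platform refused; keep the current limit
    return descriptor_limits()[0]
```

Keep in mind that a few descriptors are always taken by the process itself (standard input and output, the listening socket, log files), so the number of client connections you can actually serve is slightly below the limit.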
Assessing and making a decision that is right for you
Choosing the right technology is not a simple decision. When picking a framework or technology, we should be aware of our project specifications. For our project we prefer the Ratchet framework over Socket.io, although Socket.io is not a bad solution either, depending on the specifics of the business case. If you need rapid development of a simple real-time feature, Socket.io may be the better choice, because you get more ready-made features and a reasonable broadcast time without any complex configuration.
For our complex business case here at Imperia Mobile we created an application that supports 20 000 users with no problems, along with custom monitoring and management tools. Despite the extra effort, we believe it is worth it.