This was originally posted by Bert Hubert nearly 6 months ago, but it’s saved me a LOT of time, since he’s rehashed, well, common sense:
“…
Additionally, our database interface needed to offer an extra feature: every once in a while a query comes along that we DO need to wait for, and because of coherency issues, such a query can only be executed once all queries ‘in flight’ have finished.
So we spent some time pondering this, and suddenly it dawned on me that many of the features we needed exactly match the semantics of the venerable UNIX ‘pipe’.
A pipe is normally used to communicate between two processes, as exemplified by this sample shell script command, which shows us the largest directories on a disk:
$ du | sort -n
The program ‘du’ generates a list of directories and their sizes, which is then fed to sort, which outputs the list in ascending order. However, nothing prohibits us from using a pipe to communicate with ourselves – and as such it might be a mighty fine conduit to pass database queries through to our database worker thread.
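To make that concrete, here is a minimal sketch (an illustration, not the actual code from the post) of a pipe that never leaves the process: two threads in one program, one writing, one reading.

#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <thread>

int main() {
  int fds[2];
  if (pipe(fds) < 0) { perror("pipe"); return 1; }

  std::thread reader([rfd = fds[0]] {
    char buf[256];
    ssize_t n;
    while ((n = read(rfd, buf, sizeof(buf))) > 0)   // blocks until data arrives
      fwrite(buf, 1, n, stdout);
    close(rfd);                                     // n == 0: the writer closed its end
  });

  const char* msg = "hello from the writing thread\n";
  if (write(fds[1], msg, strlen(msg)) < 0) perror("write");
  close(fds[1]);                                    // signals end-of-file to the reader
  reader.join();
}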
This has some very nice benefits. Pipes are incredibly efficient, since a lot of UNIX performance depends on them. Additionally, they implement sane blocking behaviour: if too much data is stuck in the pipe, because the other process does not take it out again quickly enough, the sending process automatically blocks. The operating system implements high and low water marks to make this (un)blocking happen efficiently.
Furthermore, pipes guarantee that data up to a certain size can either be written as a whole, or not written at all – making sure we don’t have to deal with partial messages.
Finally, pipes automatically detect when the process on the other end of them has gone away, or has closed its end of the pipe.
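The two limits behind the blocking and the all-or-nothing writes are easy to inspect. A quick sketch, assuming Linux (where PIPE_BUF is a compile-time constant in limits.h and the F_GETPIPE_SZ fcntl is available):

#include <fcntl.h>
#include <limits.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fds[2];
  if (pipe(fds) < 0) { perror("pipe"); return 1; }

  // POSIX guarantees PIPE_BUF >= 512; on Linux it is 4096. Writes of at most
  // this many bytes are atomic: all or nothing, never interleaved.
  printf("atomic write limit (PIPE_BUF): %d bytes\n", PIPE_BUF);

#ifdef F_GETPIPE_SZ
  // Linux-specific: the full pipe capacity, i.e. how much data may be
  // 'in flight' before write() starts to block.
  printf("pipe capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
#endif
  return 0;
}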
However, not all is good. In order to transmit something over a pipe, it must be serialised into bytes – we can’t transmit ready to use objects over them. Additionally, because pipes implement ‘stream’ behaviour, we need to delineate one message from the other, because the pipe itself does not say where a message begins and ends – unlike datagram sockets for example.
And this is the clever bit of our idea. As stated above, pipes are usually employed to transmit data from one process to the other. In our case, the pipe goes from one thread of execution to the other – within the same process, and thus within the same memory space. So we don’t need to send serialized objects at all, and can get away with transmitting pointers to objects. And the nice thing is, pointers all have the same (known) length – so we can do away with both delineation and serialisation.
Additionally, pointers are a lot smaller than most messages, which means we can stuff more messages in the same (fixed) size of the pipe buffer.
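In code, the whole trick is small. A sketch of the idea – the Query type, its contents and the ownership rule are placeholders for whatever the real application uses:

#include <unistd.h>
#include <limits.h>
#include <cstdio>
#include <string>
#include <thread>

struct Query { std::string sql; };

// A pointer is far smaller than PIPE_BUF, so every write of one is atomic:
// we never have to reassemble a pointer from two partial reads.
static_assert(sizeof(Query*) <= PIPE_BUF, "pointer writes must be atomic");

int main() {
  int fds[2];
  if (pipe(fds) < 0) { perror("pipe"); return 1; }

  std::thread worker([rfd = fds[0]] {
    Query* q;
    // Fixed-size messages: no delineation, no serialisation.
    while (read(rfd, &q, sizeof(q)) == sizeof(q)) {
      printf("executing: %s\n", q->sql.c_str());
      delete q;                        // the worker owns the object once it arrives
    }
    close(rfd);                        // read() returned 0: the sender closed its end
  });

  for (int i = 0; i < 3; ++i) {
    Query* q = new Query{"SELECT " + std::to_string(i)};
    if (write(fds[1], &q, sizeof(q)) != sizeof(q)) perror("write");
  }
  close(fds[1]);
  worker.join();
}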
So, are we done now? Sadly no – we have the additional need to be able to ‘flush the pipe’ in order to perform synchronous queries that we do need to wait for.
This is where things get complicated, but for those who really want to know, I’ll explain it here. It took almost a day of hacking to get it right however, and I’m explaining it for my own benefit as much as for that of the reader, since I’m bound to forget the details otherwise.
If a synchronous query comes along, we need to flush the pipe, but UNIX offers no such ability. Once we’ve written something to a pipe, all the kernel guarantees us is that it will endeavour to deliver it, but there is no system call that allows us to wait for all data to actually be delivered.
So we need to find a way to signal a ‘write barrier’, and the obvious way to do so is to send a NULL pointer over the pipe, which tells the other end we want to perform a synchronous query. Once the worker thread has seen the NULL pointer, it unlocks the single controlling mutex (which is the return signal that says “got you – the pipe is empty”), and then waits for further pointers to arrive.
Meanwhile, the sending thread tries to lock that same mutex immediately after sending the NULL pointer, which blocks since the receiving thread normally holds the lock. Once the lock succeeds, this tells us the worker thread has indeed exhausted all queries that were in flight.
The sending thread now performs its synchronous database work, knowing the database is fully coherent with all queries it sent out previously, and also knowing the worker thread is not simultaneously accessing the connection – since it is instead waiting for a new pointer to arrive.
If our program now wants to perform further asynchronous queries it can simply transmit further pointers to the worker thread – which oddly enough does not need to retake the mutex. This is what caused us many hours of delay, because intuitively it seems obvious that once the sending thread is done, it must release the mutex so the worker thread can retake it.
As it turns out, doing so opens a whole world of nasty race conditions which allow synchronous queries to ‘jump the queue’ of asynchronous queries that are in flight and have not yet arrived.
So, the sequence is that the worker thread only unlocks the mutex, while the sending thread only locks it.
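Put together, the barrier looks roughly like the sketch below. It substitutes a POSIX semaphore for the mutex – a semaphore may legally be posted by a different thread than the one that waits on it, which makes the ‘worker only signals, sender only waits’ discipline well-defined – and the Query type and query strings are of course made up:

#include <semaphore.h>
#include <unistd.h>
#include <cstdio>
#include <string>
#include <thread>

struct Query { std::string sql; };

static int   g_pipe[2];
static sem_t g_drained;                 // starts at 0: pipe not known to be drained

static void sendQuery(Query* q) {       // q == nullptr acts as the write barrier
  if (write(g_pipe[1], &q, sizeof(q)) != sizeof(q)) perror("write");
}

static void workerThread() {
  Query* q;
  while (read(g_pipe[0], &q, sizeof(q)) == sizeof(q)) {
    if (q == nullptr) {                 // barrier: everything before it has been run
      sem_post(&g_drained);             // "got you - the pipe is empty"
      continue;                         // then simply wait for further pointers
    }
    printf("async: %s\n", q->sql.c_str());
    delete q;
  }
}

int main() {
  if (pipe(g_pipe) < 0) { perror("pipe"); return 1; }
  sem_init(&g_drained, 0, 0);
  std::thread worker(workerThread);

  sendQuery(new Query{"INSERT ..."});   // asynchronous, fire and forget
  sendQuery(new Query{"UPDATE ..."});

  sendQuery(nullptr);                   // synchronous query coming up: flush
  sem_wait(&g_drained);                 // blocks until the worker has caught up
  printf("sync: SELECT ... runs here, on a fully coherent connection\n");

  sendQuery(new Query{"DELETE ..."});   // back to asynchronous work; nothing to hand back
  close(g_pipe[1]);                     // end-of-file ends the worker's read loop
  worker.join();
  sem_destroy(&g_drained);
}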
And this basically is it! So how many lines of code did we save by using the magic of UNIX pipes? The pipe handling code takes all of 90 lines, whereas the Distributor code of PowerDNS takes a round 300, even though it does not offer synchronous queries, does not automatically block if too many queries are outstanding, and most certainly couldn’t implement the sensible wakeup ability that UNIX pipes do offer.
Oh, and you might be wondering by now, did it help? Indeed it did – our program is now at least 20 times faster than it used to be, and there was much rejoicing.”