Originally published on 2017-09-23 by Jonathan Worthington.
Vienna.pm have funded me to work 50 hours on Raku. After some discussion, we decided I would first work on improving the thread pool scheduler, and then move on to continuing the work around non-blocking await, a feature of the upcoming 6.d language release. In this post I’ll discuss the work on the thread pool scheduler, which was merged shortly after the latest Rakudo release so as to maximize testing time, and thus will appear in the next release (2017.10).
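To give a feel for what non-blocking await means in practice, here is a small example of my own (it is not from the original post, and `$gate` and the task bodies are just illustrative): under the 6.d semantics, a task that awaits inside a `start` block is suspended and hands its pool thread back, rather than keeping an OS thread tied up.

```raku
# Many tasks awaiting the same Promise. With non-blocking await, a suspended
# task releases its pool thread instead of occupying it until the Promise is kept.
my $gate = Promise.new;

my @tasks = (^100).map(-> $i {
    start {
        await $gate;   # suspends here; with non-blocking await no OS thread stays pinned
        $i * 2;
    }
});

$gate.keep;            # let all of the waiting tasks resume
my @results = await @tasks;
say @results.sum;      # 9900
```

Without that feature, each of those awaits holds on to an OS-level pool thread until `$gate` is kept, which is exactly the kind of pressure on the pool discussed below.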
My (by now slightly hazy) memory is that I wrote the initial Raku thread pool implementation in the space of an hour or two on a train, probably heading to some Perl event or other. It was one of those “simplest thing that could possibly work” implementations that turned out to work well enough that it survived with only isolated tweaks and fixes until January this year. At that point, I added the initial bits of support for non-blocking await. Even that change was entirely additive, leaving the existing scheduling mechanism completely intact.
When I first implemented the thread pool, there were approximately no people writing concurrent Raku programs. Happily, that’s changed, but with it came a few bug reports that could be traced back to the thread pool. Along with those, there were things that, while not being outright bugs, were far from ideal.
Before I dug into a redesign, I first summarized all of the problems I was aware of with the thread pool, so that as I considered new designs I could “crash test” them against the problems that afflicted the previous one. Here’s my list.
Using Proc::Async would cause some threads to be created. A second use after that would add some more, and a third use more still. Even if these uses were not concurrent, threads would keep being added, up until the maximum pool size was reached. It was a very conservative way to make sure a thread would be available for each piece of work, provided the pool didn’t hit its exhaustion point. But really, many programs don’t need more than a thread or two. RT #131915 was a ticket resulting from this behavior: each thread added to the memory consumption of the program, making memory use a good bit higher than it should have been for programs that really only needed a thread or two.

Another problem was that a program could deadlock once every thread in the pool was tied up and no more would be added. The upcoming non-blocking await will help with this, but there will still be ways to write code that really blocks a bunch of OS threads and can’t make progress without yet more threads being made available to run something. At some point, people just have to write their programs differently, but the default limit of 16 threads was not very generous.

Two further problems, active I/O swamping the queue and delaying timers, and poor thread affinity, are resolved in the new thread pool by recognizing that not all work given to the pool is equal. Timers really want to be dealt with in a timely manner, so in programs that have timers it makes sense to have a queue, and one or more worker threads, exclusively for time-based events. Work items that will need processing sequentially anyway should be assigned to a single thread, implying that each such “affinity worker” gets a queue of its own. Finally, there is a general queue for everything else, and one or more threads eat from it.
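To make the timer problem concrete, here is a small sketch of my own of the kind of program that was affected (none of this code is from the original post, and it assumes a POSIX `yes` command is available): a once-a-second timer competing with a process that produces a flood of output.

```raku
# A heartbeat timer alongside a process that produces a lot of output.
# Previously both kinds of events went through one shared queue, so heavy
# output could delay the timer; now time-based events get their own queue.
my $proc = Proc::Async.new('yes');

react {
    whenever Supply.interval(1) {
        say "tick";                  # should keep firing once per second
    }
    whenever $proc.stdout.lines {
        # consume (and ignore) the flood of output lines
    }
    whenever Promise.in(5) {
        $proc.kill;                  # stop the noisy process
        done;                        # and leave the react block
    }
    $proc.start;
}
```

This is the scenario the dedicated timer queue addresses: the ticks keep arriving on schedule even while the process output keeps the general workers busy.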
The other problems were addressed by the addition of a Sufficiently Smart Supervisor. The supervisor is a thread created by the thread pool upon its first use, and it lives from then until the end of the program. It spends most of its time sleeping, waking up around 100 times a second to check how things are going. It is the supervisor that makes most of the decisions about how many threads to add to the pool, factoring in things such as the number of CPU cores available and whether the work already queued is actually being processed.
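As a rough illustration of the kind of decision the supervisor makes, here is a much-simplified sketch of my own; the function name, parameters, and thresholds are invented for this post, and the real logic inside Rakudo’s ThreadPoolScheduler is considerably more careful.

```raku
# Hypothetical, simplified decision logic in the spirit of the supervisor.
sub should-add-general-worker(
    Int $queued,                             # items waiting on the general queue
    Int $completed-since,                    # items completed since the last wake-up
    Int $workers,                            # current number of general workers
    Int $cores       = $*KERNEL.cpu-cores,
    Int $max-threads = 64
    --> Bool
) {
    return False if $workers >= $max-threads;      # never exceed the hard limit
    return True  if $workers == 0 && $queued > 0;  # always get things started

    # Below the core count, add a thread whenever work is piling up; beyond
    # it, only do so reluctantly, when no progress at all is being made.
    return $queued > 0 if $workers < $cores;
    return $queued > 0 && $completed-since == 0;
}

# The supervisor wakes up roughly 100 times a second and applies logic like
# this, e.g. loop { sleep 0.01; add-worker() if should-add-general-worker(...) }
```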
Having the pool soft-limited to the number of CPU cores, and only reluctantly able to go beyond that, means that we can raise the default maximum pool size quite a lot; it now stands at 64 threads. In reality, many programs don’t even reach the CPU core count, as there’s never enough work to trigger the addition of more threads.
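If you want to check or adjust the limit for a particular program, the scheduler is reachable from Raku code. The snippet below is just an illustration of mine (the `RAKUDO_MAX_THREADS` environment variable offers another way to change the default, and the exact default may differ between Rakudo versions).

```raku
# Inspect the default scheduler, then cue some work on a separate pool with
# a smaller upper limit on its thread count.
say $*SCHEDULER.^name;    # ThreadPoolScheduler (the default pool described above)

my $small-pool = ThreadPoolScheduler.new(max_threads => 8);
my $done       = Promise.new;
$small-pool.cue({ say "ran on the smaller pool"; $done.keep });
await $done;
```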
Affinity worker threads aren’t added by the supervisor. Instead, when an affinity queue is requested, any existing affinity worker threads will have their queue lengths inspected. The one with the shortest queue will be picked, provided that the queue length is below a threshold. If it’s over the threshold (which increases per affinity worker thread added), then another affinity worker thread will be spawned and the work allocated to it.
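In code, that selection works roughly like the following simplified sketch; the sub, the ToyWorker class, and the threshold scheme are inventions of mine for illustration, not the actual Rakudo source.

```raku
# Simplified sketch of picking an affinity worker: reuse the least-loaded
# existing worker if its queue is short enough, otherwise spawn a new one.
sub pick-affinity-worker(@workers, &spawn-worker, Int $base-threshold = 1) {
    # The candidate is the existing worker with the shortest queue...
    my $best = @workers.min(*.queue.elems);

    # ...but it only gets the work if its queue is under a threshold that
    # grows with the number of affinity workers already in existence.
    my $threshold = $base-threshold * (@workers.elems max 1);
    return $best if @workers && $best.queue.elems < $threshold;

    # Otherwise, spawn another affinity worker and hand the work to it.
    my $fresh = spawn-worker();
    @workers.push($fresh);
    return $fresh;
}

# Toy usage with a stand-in worker type:
my class ToyWorker { has @.queue }
my @affinity-workers;
my $worker = pick-affinity-worker(@affinity-workers, { ToyWorker.new });
say @affinity-workers.elems;    # 1 (a first worker was spawned)
```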
For all the cases I’ve tried so far, the new scheduler does at least as well as the one it replaced, often better, and sometimes much better. The deadlock-due-to-insufficient-threads bug is gone, thanks to the supervisor. Programs that do the occasional bit of work needing pool threads end up with just a thread or two, greatly reducing their memory consumption. A Cro web application that is processing just a handful of requests a second will now spawn just a few threads (and so use much less memory and get better locality), while one faced with some hundreds of requests per second will spawn more. And a program doing a lot of CPU-bound work now spawns as many threads as there are cores, which gives a small speedup compared to oversubscribing the CPU under the previous scheduler. Finally, timer events are delivered and handled in a timely way, even when there is a lot of output from a process.
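As a quick way to see the CPU-bound behaviour for yourself, a sketch along these lines (my example, with arbitrary chunk sizes) keeps all of the general workers busy; under the new scheduler the pool settles at roughly one thread per core rather than oversubscribing the machine.

```raku
# Eight chunks of pure computation, each in its own start block. The pool
# grows to about the number of CPU cores and no further.
my @chunk-sums = await (^8).map(-> $chunk {
    start {
        my $from = $chunk * 1_000_000;
        [+] $from ..^ $from + 1_000_000;   # sum one million integers
    }
});
say [+] @chunk-sums;                        # sum of 0 ..^ 8_000_000
```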
As for what comes next: most immediately, I should write some extra regression tests, so I can close up the RT tickets mentioned earlier in the article. Feel free to beat me to doing that, though! :-)
That aside, I’d like to mention that while the new scheduler is a decided improvement, it’s also improvable. The heuristics it uses to decide when to add further threads can surely be tuned some more. The code doing that decision making is written in Raku, which I hope makes it accessible to those who would like to experiment and tweak.
Once again, thanks to Vienna.pm for making this work possible. And, finally, a mention that I’m still looking for funding to help me keep on doing Raku things for a sizable chunk of my work time; a handful of companies each sponsoring 10 hours a month would soon add up!