I received a customer request about a remote-controlled machine that was not responding as it should. I knew from my monitoring graphs that there were indeed some errors, but usually that just means a customer has unplugged the machine, so I use the graphs only to see whether something truly catastrophic has happened.
In any case, since the customer had complained, I had to troubleshoot the root cause of the issue. I went into the systemd logs, only to realise that all logging from this Elixir umbrella application had stopped a couple of days earlier.
Diving back into Wobserver, I saw that the main Logger process had a couple of million messages in its queue and was effectively dead and unresponsive (or, at least, not doing any new IO). It seems that the BEAM VM does not enforce a mailbox limit, but adding messages to an already long mailbox becomes slower and slower.
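The mailbox depth Wobserver shows can also be checked by hand from an attached IEx session. A minimal sketch, assuming a pre-1.15 Elixir where the Logger event manager is registered under the name `Logger`:

```elixir
# Hypothetical one-off check: how deep is the Logger process's mailbox?
case Process.whereis(Logger) do
  nil ->
    IO.puts("Logger process not registered")

  pid ->
    {:message_queue_len, depth} = Process.info(pid, :message_queue_len)
    IO.puts("Logger mailbox depth: #{depth}")
end
```

A healthy system should report a depth near zero; millions, as in my case, means the back-end has fallen hopelessly behind.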
As of Elixir 1.5.3, the Elixir Logger is essentially a GenEvent-based architecture, where each log message is sent to another process for later processing. When load is low, the Logger uses the asynchronous notify function for speed; however, when the message inbox starts to fill up, the Logger switches to a synchronous mode. In that mode, each call to Logger.log will block until "[..] after the event has been handled by all event handlers.". This is supposed to add backpressure, giving the logger back-end time to flush before new messages come in.
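The point at which the Logger flips from async to sync mode is configurable via the documented `:sync_threshold` option. A sketch of the relevant config, using the default value from the Logger documentation:

```elixir
# config/config.exs
config :logger,
  # Once the Logger manager has more than this many messages queued,
  # Logger.log calls switch to synchronous (blocking) mode. 20 is the
  # documented default; raising it delays the slowdown but deepens the queue.
  sync_threshold: 20
```

Note that this only moves the tipping point; it does not change the fundamental behaviour described below.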
I will need to study the Logger code in more detail, but this design seems flawed to me: with any back-end, there is an inherent limit on how fast you can write to it. For the default console back-end, the limit is set by whoever consumes the stdout of the process — in my case, systemd.
In this particular failure mode, systemd could process only X msg/sec, while my code produced Y msg/sec, with Y > X. This resulted in Y – X messages accumulating in the Logger queue every second, with no hope of ever being flushed. What's worse, the Logger would switch to sync mode, slowing down the entire application forever. This kind of backpressure might be useful in other situations, but for logging I would argue that simply dropping the messages (and logging that fact!) would be much better.
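The drop-instead-of-block behaviour I am arguing for can be approximated in user code. A minimal sketch, with a hypothetical `SafeLog` wrapper and an arbitrary `@max_queue` cutoff of my own choosing (not part of the Logger API):

```elixir
defmodule SafeLog do
  # Hypothetical cutoff; tune it to your back-end's real throughput.
  @max_queue 10_000

  # Log via Logger.bare_log/2, but drop the message outright when the
  # Logger process's mailbox is already too deep to ever drain.
  def log(level, msg) do
    case Process.whereis(Logger) do
      nil ->
        # No registered Logger process; fall through to the normal path.
        Logger.bare_log(level, msg)

      pid ->
        case Process.info(pid, :message_queue_len) do
          {:message_queue_len, len} when len < @max_queue ->
            Logger.bare_log(level, msg)

          _ ->
            # Queue too deep (or process gone): drop instead of blocking.
            :dropped
        end
    end
  end
end
```

A real implementation should also count the drops and periodically log that count, so the overload itself stays visible; the `Process.info/2` call per message is the obvious cost of this naive approach.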
In my case, I fixed this by setting the log level to :info (it is :debug by default, another questionable choice) and doing a complete review of all my logging calls. However, I will have to check again in a couple of days to see how the system holds up.
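For reference, the level change is a one-line config entry; it can also be paired with the documented `compile_time_purge_level` option so that the purged calls cost nothing at runtime. A sketch, assuming Elixir 1.5-era option names:

```elixir
# config/config.exs
config :logger,
  # Runtime filter: messages below :info never reach the back-end.
  level: :info,
  # Compile-time filter: Logger.debug calls are removed from the
  # compiled code entirely, so they cannot even enqueue a message.
  compile_time_purge_level: :info
```

The compile-time purge is the stronger guarantee here, since a runtime-filtered debug call still pays the cost of building its message.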
Incidentally, I now understand why Clojure's core.async channels do not, by design, support this unbounded behaviour. You must either specify an explicit buffer size beyond which the channel blocks, or a custom buffer that drops messages once the channel is full.