gen_event redux.

In the previous post I discussed an interesting and underrated feature of the Erlang/OTP gen_event behavior. Because of the way handler failures are managed, the behavior enables an alternative to the standard crash-and-restart supervision model. Unlike that noisy approach, supervised handlers let the event manager keep running when they fail, meaning the failure and restart of a handler can go unnoticed by processes that interact with the event manager (i.e. the pid of the event manager does not change).

I still think all of that is pretty cool, but much of what I talked about regarding my specific use case ended up being abandoned in favor of a simpler approach. Rather than keeping track of the actual PID of the event manager and related processes, I ended up using gproc to register the group of processes associated with each other. I still use an event manager, and in some sense take advantage of the transparent supervision it provides, but in a different way.
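
For context, registering through gproc means the event manager is started with a via tuple rather than a tracked pid. This is only a sketch under assumed names; the actual keys and functions in my application differ:

-define(name(Name), {n, l, {?MODULE, Name}}).

start_link(StationId) ->
    %% {via, gproc, Key} lets OTP behaviours register themselves
    %% through gproc; callers never need the raw pid.
    gen_event:start_link({via, gproc, ?name(StationId)}).

where(StationId) ->
    %% Look up the current pid of the station's event manager,
    %% even across restarts.
    gproc:where(?name(StationId)).

Because lookups go through the registered name, a handler (or even the manager itself) can crash and come back without any other process in the group holding a stale pid.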

Controlled restarts - another reason gen_event is cool

At the very end of the previous post I noted a small race condition that I’d need to deal with when the handler fails and is re-installed. My approach to preventing this race ended up taking advantage of the differences between gen_event handler supervision and standard OTP supervisor supervision.

In my application there are two important processes on the server side: a state machine which communicates with the client, and a handler which tells the state machine how to respond to the client. A race condition arises when the handler crashes: if the state machine is allowed to continue interacting with the client (i.e. it receives an “internal error” message) before the handler is reinstalled, some messages can reach the event manager while no handler is installed, resulting in unnecessary timeouts and degraded operation and performance on the client side. The solution is fairly simple (if not totally straightforward to implement): don’t tell the state machine that the handler experienced an internal error until after the handler has been reinstalled.

Seems simple, right? The process supervising the handler re-installs it after it fails and then sends an internal error message to the state machine. However, this approach spreads the responsibility across too many components for my liking, creating coupling between the process supervising the handler and the state machine that doesn’t need to exist. It would be better to keep the communication limited to the handler and the state machine. To do this, the handler needs to differentiate between starting for the first time and being restarted after a failure. Under standard OTP supervision this isn’t straightforward to implement; however, because supervision of handlers is not done by an OTP supervisor, we have total control over the restart/re-install process for failed handlers.

Taking advantage of this extra control, we capture the exit reason from the handler crash and pass it to a restart_handler function along with the same arguments used to start the handler initially. Now, when the handler restarts it sees the extra error information and uses it to generate an internal error message for the state machine which can safely proceed since the handler has already been reinstalled. The handler module ends up looking like this.

-define(registry(Name), {via, gproc, ?name(Name)}).
-define(name(Name), {n, l, {?MODULE, Name}}).

%% @doc Install the handler for the first time.
-spec add_handler(StationId :: binary(),
                  CallbackModule :: module(),
                  InitArg :: any()) -> gen_event:add_handler_ret().
add_handler(StationId, CallbackModule, InitArg) ->
    gen_event:add_sup_handler(
      ?registry(StationId), ?MODULE, {StationId, CallbackModule, InitArg}).

%% @doc Re-install the handler after a crash.
-spec add_handler(StationId :: binary(),
                  CallbackModule :: module(),
                  InitArg :: any(),
                  Reason :: ocpp_error:error()) -> gen_event:add_handler_ret().
add_handler(StationId, CallbackModule, InitArg, Reason) ->
    gen_event:add_sup_handler(
      ?registry(StationId), ?MODULE,
      {recover, Reason, {StationId, CallbackModule, InitArg}}).

%% ... snip

init({recover, Reason, {StationId, _, _} = InitArg}) ->
    %% We're back up - notify the state machine about the error
    ocpp_station:error(StationId, Reason),
    %% Now proceed with initializing normally
    %% - this could fail, but we don't worry about that because it
    %%   will be escalated up the supervision tree.
    init(InitArg);
init({StationId, CallbackModule, InitArg}) ->
    %% Normal initialization. If this fails the whole set of
    %% processes related to the station is terminated.
    %% ... snip

%% ... snip

%% Internal function that tries to handle the request
do_request(RequestFun, Message,
           #state{handler_state = HState, mod = Mod, stationid = StationId} = State) ->
    try Mod:RequestFun(Message, HState) of
        {reply, Response, NewHState} ->
            %% ... snip
        {error, Reason, NewHState} ->
            %% ... snip
    catch Exception:Reason:Trace when Exception =:= error;
                                      Exception =:= exit ->
            %% ... snip ... log the error

            %% Create an error response that will eventually
            %% be sent to the state machine process
            Error = ocpp_error:new(ocpp_message:id(Message), 'InternalError',
                                   [{details, #{<<"reason">> => Reason}}]),

            %% Finish failing, providing the response message as part of the reason
            error({ocpp_handler_error, Error})
    end.
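
The re-install side lives in the process that owns the handler, which I haven't shown. The following is only a loose sketch of its shape, assuming the handler module above is called ocpp_handler and that the handler was added with gen_event:add_sup_handler/3, so the owning process receives a gen_event_EXIT message when the handler dies:

%% Sketch only - names here are assumptions, not my actual code.
watch_handler(StationId, CallbackModule, InitArg) ->
    receive
        {gen_event_EXIT, _Handler, Reason}
          when Reason =/= normal, Reason =/= shutdown ->
            %% Re-install before anything else; the recover clause of
            %% init/1 notifies the state machine only once the handler
            %% is back in place, closing the race described above.
            ok = ocpp_handler:add_handler(
                   StationId, CallbackModule, InitArg, Reason),
            watch_handler(StationId, CallbackModule, InitArg)
    end.

The key property is the ordering: by the time the state machine hears about the error, gen_event:which_handlers/1 would already show the handler installed again.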

The main downside to this kind of supervision approach that I have identified so far is that you lose the automatic escalation functionality provided by OTP supervisors. In this case I think it is worth that cost to provide a clean solution to the internal-error/restart problem described above. (It would also be a cost you pay no matter what when you use supervised handlers with an OTP event manager.) I’d like to spend some more time exploring the right way to manage crash escalation for gen_event handlers, maybe in a future post.
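
As a rough idea of what escalation might look like (purely a sketch, not something I've settled on): the watching process could cap the number of re-installs and, past the limit, exit with the handler's crash reason so its own OTP supervisor takes over, with restart_handler being the function mentioned earlier:

%% Sketch only - a crude restart-counting escalation strategy.
watch_handler(StationId, CallbackModule, InitArg, Restarts, MaxRestarts) ->
    receive
        {gen_event_EXIT, _Handler, Reason}
          when Restarts >= MaxRestarts,
               Reason =/= normal, Reason =/= shutdown ->
            %% Give up and let this process's own supervisor decide.
            exit({handler_restart_limit, Reason});
        {gen_event_EXIT, _Handler, Reason}
          when Reason =/= normal, Reason =/= shutdown ->
            restart_handler(StationId, CallbackModule, InitArg, Reason),
            watch_handler(StationId, CallbackModule, InitArg,
                          Restarts + 1, MaxRestarts)
    end.

A real version would probably want a time window (like a supervisor's intensity/period) rather than a lifetime counter, but it illustrates where the escalation decision would live.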

Champagne photo by Billy Huynh