In the previous post I discussed an
interesting and underrated feature of the Erlang/OTP gen_event
behavior. Because of the way handler failures are managed, the
behavior enables an alternative to the standard crash-and-restart
supervision model. Unlike that noisy approach, supervised handlers let
the event manager keep running when they fail, meaning that the failure
and restart of a handler can go unnoticed by processes that interact
with the event manager (i.e. the pid of the event manager does not
change).
I still think all of that is pretty cool, but much of what I talked
about regarding my specific use case ended up being abandoned in favor
of a simpler approach. Rather than keeping track of the actual PID of
the event manager and related processes, I just ended up using
gproc
to register the group of
processes associated with each other. I still use an event manager,
and in some sense take advantage of the transparent supervision it
provides, but in a different way.
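To make that concrete, here is roughly the shape of it. This is a
simplified sketch rather than the actual code from my project, and the
gproc key and function names are illustrative: the event manager is
started under a {via, gproc, ...} name derived from the station id, so
other processes look it up by name instead of holding on to a pid that
could change across restarts.
%% Illustrative sketch - start the station's event manager under a
%% gproc-registered name so related processes can find it by station id.
-define(manager(StationId), {via, gproc, {n, l, {station_evt_mgr, StationId}}}).

start_link(StationId) ->
    %% If this process is restarted it re-registers under the same name,
    %% so callers never need to learn its new pid.
    gen_event:start_link(?manager(StationId)).

notify(StationId, Event) ->
    gen_event:notify(?manager(StationId), Event).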
Controlled restarts - another reason gen_event is cool
At the very end of the previous post I noted a small race condition
that I’d need to deal with when the handler fails and is
re-installed. My approach to preventing this race ended up taking
advantage of the differences between gen_event
handler supervision
and standard OTP supervisor
supervision.
In my application there are two important processes on the server side: a state machine that communicates with the client, and a handler that tells the state machine how to respond to the client. A race condition arises when the handler crashes: if the state machine is allowed to continue interacting with the client (i.e. it receives an “internal error” message) before the handler is reinstalled, some messages could reach the event manager while no handler is installed, resulting in unnecessary timeouts and degraded operation and performance on the client side. The solution is fairly simple (if not totally straightforward to implement): don’t tell the state machine that the handler experienced an internal error until after the handler has been reinstalled.
Seems simple, right? The process supervising the handler re-installs
it after it fails and then sends an internal error message to the
state machine. However, this approach spreads the responsibility
across too many components for my liking. It creates coupling between
the process supervising the handler and the state machine that doesn’t
need to exist. Instead, it would be better to limit the communication
to the handler and the state machine. To do this, the handler needs to
differentiate between starting for the first time and being restarted
after a failure. Under standard OTP supervision this isn’t
straightforward to implement; however, because supervision of handlers
is not done by an OTP supervisor, we have total control over the
re-start/re-install process for failed handlers.
Taking advantage of this extra control, we capture the exit reason
from the handler crash and pass it to a restart_handler
function
along with the same arguments used to start the handler
initially. Now, when the handler restarts, it sees the extra error
information and uses it to generate an internal error message for the
state machine, which can safely proceed since the handler has already
been reinstalled. The handler module ends up looking like this:
-define(registry(Name), {via, gproc, ?name(Name)}).
-define(name(Name), {n, l, {?MODULE, Name}}).

%% @doc Install the handler for the first time.
-spec add_handler(StationId :: binary(),
                  CallbackModule :: module(),
                  InitArg :: any()) -> gen_event:add_handler_ret().
add_handler(StationId, CallbackModule, InitArg) ->
    gen_event:add_sup_handler(
      ?registry(StationId), ?MODULE, {StationId, CallbackModule, InitArg}).

%% @doc Re-install the handler after a crash.
-spec add_handler(StationId :: binary(),
                  CallbackModule :: module(),
                  InitArg :: any(),
                  Reason :: ocpp_error:error()) -> gen_event:add_handler_ret().
add_handler(StationId, CallbackModule, InitArg, Reason) ->
    gen_event:add_sup_handler(
      ?registry(StationId), ?MODULE,
      {recover, Reason, {StationId, CallbackModule, InitArg}}).

%% ... snip

init({recover, Reason, {StationId, _, _} = InitArg}) ->
    %% We're back up - notify the state machine about the error
    ocpp_station:error(StationId, Reason),
    %% Now proceed with initializing normally
    %% - this could fail, but we don't worry about that because it
    %%   will be escalated up the supervision tree.
    init(InitArg);
init({StationId, CallbackModule, InitArg}) ->
    %% Normal initialization. If this fails the whole set of
    %% processes related to the station is terminated.
    %% ... snip

%% ... snip

%% Internal function that tries to handle the request
do_request(RequestFun, Message,
           #state{handler_state = HState, mod = Mod, stationid = StationId} = State) ->
    try Mod:RequestFun(Message, HState) of
        {reply, Response, NewHState} ->
            %% ... snip
        {error, Reason, NewHState} ->
            %% ... snip
    catch Exception:Reason:Trace when Exception =:= error;
                                      Exception =:= exit ->
            %% ... snip ... log the error
            %% Create an error response that will eventually
            %% be sent to the state machine process
            Error = ocpp_error:new(ocpp_message:id(Message), 'InternalError',
                                   [{details, #{<<"reason">> => Reason}}]),
            %% Finish failing, providing the response message as part of the reason
            error({ocpp_handler_error, Error})
    end.
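The other half of the picture is the process that owns the handler,
i.e. the one that originally called gen_event:add_sup_handler/3. When a
supervised handler crashes, gen_event removes it and sends the owner a
{gen_event_EXIT, Handler, Reason} message instead of exiting it, and
that message is where we capture the reason and feed it back into
add_handler/4. The sketch below assumes the owner is a gen_server; the
handler module name (ocpp_handler), the record fields, and the exact
shape of the match on the crash reason are illustrative rather than
copied from my project.
%% Illustrative sketch of the owning process. gen_event notifies us when
%% the supervised handler is removed, rather than crashing this process.
handle_info({gen_event_EXIT, ocpp_handler, Reason},
            #state{stationid = StationId,
                   callback_module = CallbackModule,
                   init_arg = InitArg} = State) ->
    %% Pull the error the handler packed into its crash reason back out.
    %% (How deeply gen_event wraps the reason is the part to confirm;
    %% both plausible shapes are handled here.)
    Error = case Reason of
                {'EXIT', {{ocpp_handler_error, E}, _Stacktrace}} -> E;
                {ocpp_handler_error, E} -> E
            end,
    %% Re-install the handler, handing it the error. Only once its
    %% init({recover, ...}) clause has run does the state machine hear
    %% about the failure, closing the race described above. Matching on
    %% ok means that if re-installation itself fails, this process
    %% crashes and the failure escalates up the supervision tree.
    ok = ocpp_handler:add_handler(StationId, CallbackModule, InitArg, Error),
    {noreply, State};
handle_info(_Info, State) ->
    {noreply, State}.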
The main downside to this kind of supervision approach that I have
identified so far is that you lose the automatic escalation
functionality provided by OTP supervisors. In this case I think that
cost is worth it for a clean solution to the internal-error/restart
problem described above. (It is also a cost you pay no matter what
when you use supervised handlers with an OTP event manager.) I’d like
to spend some more time exploring the right way to manage crash
escalation for gen_event handlers, maybe in a future post.
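As a rough idea of the direction that might take (not code from my
project, just a sketch): since the owner already receives every
gen_event_EXIT message, it could track how often the handler is being
re-installed and give up once a threshold is exceeded, which is
roughly the intensity/period behaviour an OTP supervisor would
normally provide. The record field, the thresholds, and the
reinstall_handler/2 helper below are all made up for illustration.
%% Illustrative only - rough "intensity/period" logic in the owner,
%% since the handler is not under a real OTP supervisor.
-define(MAX_RESTARTS, 3).
-define(RESTART_WINDOW_MS, 5000).

handle_info({gen_event_EXIT, _Handler, Reason},
            #state{restart_times = Times} = State) ->
    Now = erlang:monotonic_time(millisecond),
    Recent = [T || T <- [Now | Times], Now - T =< ?RESTART_WINDOW_MS],
    if
        length(Recent) > ?MAX_RESTARTS ->
            %% Too many handler crashes in a short window: give up and
            %% let this process's own supervisor decide what to do.
            {stop, {handler_restart_limit, Reason}, State};
        true ->
            %% Otherwise re-install as before (reinstall_handler/2 is a
            %% stand-in for the add_handler/4 call sketched earlier).
            ok = reinstall_handler(Reason, State),
            {noreply, State#state{restart_times = Recent}}
    end.
Crashing the owner once the limit is hit hands the decision back to
whatever real supervisor sits above it, which restores a form of
escalation.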