Chapter 6 of Designing for Scalability with Erlang/OTP by Francesco
Cesarini and Steve Vinoski ends with an exercise that requires you to
build a gen_fsm state machine implementing a simple phone
controller. The controllers interact with each other and with a
separate process representing the phone controlled by each state
machine. Since the book was published, a new finite state machine
behavior has been added to OTP: gen_statem. Here I’ll implement the
phone controller state machine using the gen_statem behavior.
I’ve learned a number of things about implementing state machines with
the new behavior in the process. When I used the gen_statem behavior
in the past I always reached for the handle_event_function callback
mode. I was tempted to use that again; however, because the exercise
was designed for the original gen_fsm behavior, and I wanted to keep
my solution close to what was intended, I chose to use the
state_functions callback mode.
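For readers who have not used it before, here is a minimal sketch of what the state_functions callback mode looks like in a module unrelated to the phone controller. The module name callback_mode_sketch and its on/off states are made up for illustration; only callback_mode/0, init/1 and the StateName/3 callback shape come from gen_statem itself.

```erlang
-module(callback_mode_sketch).
-behaviour(gen_statem).

%% Illustrative API (hypothetical names, not part of the phone controller).
-export([start_link/0, toggle/1, get_state/1]).
%% gen_statem callbacks plus one function per state.
-export([callback_mode/0, init/1, off/3, on/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

toggle(Pid) ->
    gen_statem:cast(Pid, toggle).

get_state(Pid) ->
    gen_statem:call(Pid, get_state).

%% With state_functions every state is an atom and each state is
%% handled by its own callback function StateName/3.
callback_mode() ->
    state_functions.

%% Start in the off state with an empty map as data.
init([]) ->
    {ok, off, #{}}.

off(cast, toggle, Data) ->
    {next_state, on, Data};
off({call, From}, get_state, _Data) ->
    {keep_state_and_data, [{reply, From, off}]}.

on(cast, toggle, Data) ->
    {next_state, off, Data};
on({call, From}, get_state, _Data) ->
    {keep_state_and_data, [{reply, From, on}]}.
```

Each state function receives the event type, the event itself and the current data, which is the same shape used by the callbacks in the rest of this post.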
The state machine
Here’s the state machine implemented by each controller.
The events tagged with action come from the phone. All other events
are sent from other controllers on the switch. Not shown in the
diagram is a transition from all states to disconnected triggered by
the disconnect event. This is implemented in the phone_controller.erl
module.
The API
The state machine has two APIs. One is used by the phones to interact with their controllers and the other is used for interactions between controllers.
Some types
We define some types for use in our function specs.
-type phone_number() :: string().
-type phone_action() :: accept %% accept an incoming call
| reject %% reject an incoming call
| hangup %% hangup a call (possibly before it is accepted)
| {outbound, PhoneNumber :: phone_number()}. %% initiate a call
The phone API
The API used by phone processes to interact with their controllers is the following.
-spec connect(ControllerPid :: pid()) -> ok.
connect(ControllerPid) ->
gen_statem:call(ControllerPid, {connect, self()}).
-spec disconnect(ControllerPid :: pid()) -> ok.
disconnect(ControllerPid) ->
%% Sent as the bare atom so it matches the disconnect clauses in the state callbacks.
gen_statem:cast(ControllerPid, disconnect).
-spec action(ControllerPid :: pid(), Action :: phone_action()) -> ok.
action(ControllerPid, Action) ->
gen_statem:cast(ControllerPid, {action, Action}).
The controller API
The internal controller API consists of these five functions.
busy(ControllerPid) ->
gen_statem:cast(ControllerPid, busy).
accept(ControllerPid) ->
gen_statem:cast(ControllerPid, accepted).
reject(ControllerPid) ->
gen_statem:cast(ControllerPid, rejected).
hangup(ControllerPid) ->
gen_statem:cast(ControllerPid, hangup).
inbound(ControllerPid) ->
gen_statem:cast(ControllerPid, {inbound, self()}).
The callbacks - first cut
The state machine needs to keep track of the PID of its phone so it can send responses to it. We also need to keep track of the PID of the controller we are calling or connected with so we can send control messages to it.
-record(data, {phone_pid = undefined :: pid() | undefined,
phone_ref = undefined :: reference() | undefined,
other_phone = none :: pid() | none}).
The disconnected state
The controller starts in the disconnected state. Any events coming
from other controllers while in this state are ignored with the
exception of inbound events. To prevent other phones from waiting
forever for a call that cannot be completed, an inbound event
received in the disconnected state triggers the controller to send a
rejected event to the controller that attempted the call.
When a phone is connected, we also set up a monitor so we can end any
ongoing calls and return to the disconnected state if it dies
unexpectedly. Here’s the callback for disconnected.
disconnected({call, From}, {connect, PhonePid}, Data) ->
Ref = erlang:monitor(process, PhonePid),
{next_state, idle, Data#data{phone_pid = PhonePid, phone_ref = Ref},
{reply, From, ok}};
disconnected(cast, {inbound, ControllerPid}, _) ->
%% If no phone is connected then all inbound calls are rejected.
reject(ControllerPid),
keep_state_and_data;
disconnected(cast, accepted, _) -> keep_state_and_data;
disconnected(cast, rejected, _) -> keep_state_and_data;
disconnected(cast, busy, _) -> keep_state_and_data;
disconnected(cast, hangup, _) -> keep_state_and_data;
disconnected(EventType, Event, Data) ->
%% pass any other events to a handler for all-state events
handle_event(EventType, Event, Data).
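The handle_event/3 function that the last clause delegates to is not shown in this post. A minimal sketch of what such an all-state handler could look like follows, assuming it simply answers unexpected calls with an error and drops everything else. The module name all_state_sketch, the ping call and the {error, unexpected} reply are hypothetical and not taken from phone_controller.erl.

```erlang
-module(all_state_sketch).
-behaviour(gen_statem).

-export([start_link/0, ping/1]).
-export([callback_mode/0, init/1, idle/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

%% Hypothetical call, used only to show a state-specific clause.
ping(Pid) ->
    gen_statem:call(Pid, ping).

callback_mode() ->
    state_functions.

init([]) ->
    {ok, idle, undefined}.

%% State-specific clauses come first ...
idle({call, From}, ping, _Data) ->
    {keep_state_and_data, [{reply, From, pong}]};
%% ... and anything else falls through to the shared handler.
idle(EventType, Event, Data) ->
    handle_event(EventType, Event, Data).

%% Assumed behaviour: never leave a caller hanging, ignore the rest.
%% In state_functions mode this is an ordinary local function,
%% not a gen_statem callback.
handle_event({call, From}, _Event, _Data) ->
    {keep_state_and_data, [{reply, From, {error, unexpected}}]};
handle_event(_EventType, _Event, _Data) ->
    keep_state_and_data.
```

Replying to unexpected calls matters because gen_statem:call/2 would otherwise block the caller until it times out.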
The idle state
When idle the controller responds to inbound and outbound
events. When processing an outbound event, we need to look up the
controller for the requested phone number, and reject the call if
there is no controller for that number. All other actions and
controller events are ignored in this state, except for a disconnect
event.
idle(cast, {inbound, ControllerPid}, Data) ->
{ok, Caller} = hlr:lookup_ms(ControllerPid),
phone:reply(Data#data.phone_pid, {inbound, Caller}),
{next_state, connecting, Data#data{other_phone = ControllerPid}};
idle(cast, {action, {outbound, PhoneNumber}}, Data) ->
case hlr:lookup_id(PhoneNumber) of
{ok, Pid} ->
inbound(Pid),
{next_state, calling, Data#data{other_phone = Pid}};
{error, invalid} ->
phone:reply(Data#data.phone_pid, invalid),
keep_state_and_data
end;
idle(cast, {action, _}, _Data) ->
%% Any other actions are ignored.
keep_state_and_data;
idle(cast, accepted, _Data) ->
keep_state_and_data;
idle(cast, rejected, _Data) ->
keep_state_and_data;
idle(cast, hangup, _Data) ->
keep_state_and_data;
idle(cast, disconnect, Data) ->
{next_state, disconnected, disconnect_phone(Data)};
idle(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
%% the phone pid has died.
{next_state, disconnected, disconnect_phone(Data)};
idle(EventType, Event, Data) ->
%% Other events pass through to an all-state event handler.
handle_event(EventType, Event, Data).
We use a helper function disconnect_phone/1 to clean up the data
related to the phone before transitioning back to the disconnected
state. This function will be used in all the other states for the same
purpose.
disconnect_phone(Data) ->
erlang:demonitor(Data#data.phone_ref),
Data#data{phone_pid = undefined, phone_ref = undefined}.
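One detail worth knowing about erlang:demonitor/1: a 'DOWN' message that was delivered before the call stays in the mailbox. The sketch below is separate from the controller and uses a made-up module name. It shows the [flush] option of erlang:demonitor/2 removing such a message. Whether disconnect_phone/1 needs it depends on how stray 'DOWN' messages are handled elsewhere, so treat it as an option rather than a required fix.

```erlang
-module(demonitor_sketch).
-export([pending_down_after/1]).

%% Monitor a short-lived process, demonitor it with the given options,
%% and report whether a 'DOWN' message for it is still in our mailbox.
pending_down_after(Options) ->
    Pid = spawn(fun() -> ok end),
    Ref = erlang:monitor(process, Pid),
    %% Give the process time to exit so that a 'DOWN' message is very
    %% likely to be waiting in the mailbox before we demonitor.
    timer:sleep(50),
    true = erlang:demonitor(Ref, Options),
    receive
        {'DOWN', Ref, process, Pid, _Reason} -> true
    after 0 ->
        false
    end.
```

Calling demonitor_sketch:pending_down_after([flush]) returns false because the flush option removes any 'DOWN' message for that reference. With an empty option list the stale message is normally still there.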
The calling state
The controller enters this state when it initiates a call and is
waiting for a response from the controller for the number it is
calling. Any inbound events result in a busy signal (i.e. a busy
event is sent to the controller that initiated the inbound call). The
controller needs to respond to rejected and accepted events by
transitioning to the idle or connected state respectively. Finally, a
busy event received in this state is handled by reporting the busy
signal to the phone and remaining in the calling state. People should
be allowed to listen to the busy signal, if they want to, until they
hang up the phone (there is a problem with the handling of busy and
hangup events as implemented here). The caller may also hang up the
phone before the call is accepted (or rejected). In this case the
controller still sends a hangup event to the other controller before
cleaning up the call data and returning to the idle state. Note that
in this case, the idle state may receive events from the other
controller because there is a race between the hangup event and the
accept/reject actions on the other phone. This is handled above by
ignoring such events in the idle state.
calling(cast, {inbound, ControllerPid}, _Data) ->
busy(ControllerPid),
keep_state_and_data;
calling(cast, rejected, Data) ->
%% The other phone has rejected the call.
phone:reply(Data#data.phone_pid, reject),
{next_state, idle, cleanup_call(Data)};
calling(cast, accepted, Data) ->
phone:reply(Data#data.phone_pid, accept),
{next_state, connected, Data};
calling(cast, busy, Data) ->
phone:reply(Data#data.phone_pid, busy),
keep_state_and_data;
calling(cast, {action, hangup}, Data) ->
{next_state, idle, hangup_call(Data)};
calling(cast, {action, _}, _Data) ->
%% all other phone actions are ignored.
keep_state_and_data;
calling(cast, disconnect, Data) ->
{next_state, disconnected, disconnect_phone(hangup_call(Data))};
calling(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
{next_state, disconnected, disconnect_phone(hangup_call(Data))};
calling(EventType, Event, Data) ->
handle_event(EventType, Event, Data).
Two more helper functions are introduced here for cleaning up the call
data and for hanging up a call. The cleanup_call/1 function simply
cleans up the data, whereas the hangup_call/1 function sends a
hangup event to the other controller before cleaning up the data.
hangup_call(Data) ->
hangup(Data#data.other_phone),
cleanup_call(Data).
cleanup_call(Data) ->
Data#data{other_phone = none}.
Finally, note that if the phone disconnects from its controller while waiting for a response to an outgoing call the controller politely hangs up the outgoing call.
The connecting state
The controller enters this state when it receives an incoming
call. Here it waits for the phone to send either an accept or a
reject action. It may also receive a hangup event from the calling
controller, in which case the call is cleaned up and the controller
returns to idle. If the call is accepted, an accepted event is sent
to the other controller and the state is changed to connected. As
in the calling state, another inbound event results in a busy
signal.
connecting(cast, {action, accept}, Data) ->
accept(Data#data.other_phone),
{next_state, connected, Data};
connecting(cast, {action, reject}, Data) ->
{next_state, idle, reject_call(Data)};
connecting(cast, hangup, Data) ->
phone:reply(Data#data.phone_pid, hangup),
{next_state, idle, cleanup_call(Data)};
connecting(cast, {inbound, ControllerPid}, _Data) ->
busy(ControllerPid),
keep_state_and_data;
connecting(cast, disconnect, Data) ->
{next_state, disconnected, disconnect_phone(reject_call(Data))};
connecting(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
{next_state, disconnected, disconnect_phone(reject_call(Data))};
connecting(EventType, Event, Data) ->
handle_event(EventType, Event, Data).
One final helper function is introduced here to send a rejected
event to the calling controller and clean up the call state before
transitioning back to the idle state.
reject_call(Data) ->
reject(Data#data.other_phone),
cleanup_call(Data).
The connected state
Once the call has been connected, the only event of interest is
hangup, which causes the call to be cleaned up and the controller
to transition back to idle. Again, any inbound calls receive a busy
signal, and if the phone disconnects the controller politely hangs up
the call.
connected(cast, {action, hangup}, Data) ->
%% this phone has ended the call
{next_state, idle, hangup_call(Data)};
connected(cast, hangup, Data) ->
%% the other phone has ended the call
phone:reply(Data#data.phone_pid, hangup),
{next_state, idle, cleanup_call(Data)};
connected(cast, {inbound, ControllerPid}, _Data) ->
busy(ControllerPid),
keep_state_and_data;
connected(cast, disconnect, Data) ->
{next_state, disconnected, disconnect_phone(hangup_call(Data))};
connected(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
{next_state, disconnected, disconnect_phone(hangup_call(Data))};
connected(EventType, Event, Data) ->
handle_event(EventType, Event, Data).
The phone
The phone module in the book’s GitHub repository did not match the
phone API described in the book, so I implemented a very simple
phone. It is a gen_server that just prints the replies it receives
from its controller, and it can be run on a different node than the
switch.
-module(phone).
-behaviour(gen_server).
-export([start_link/1, start_link/2, reply/2, stop/1, action/2]).
-export([init/1, handle_call/3, handle_cast/2]).
-record(state, {phone_number :: phone_number(),
controller :: pid()}).
-type phone_number() :: string().
-type reply() :: {inbound, MSIDN :: phone_number()}
| accept %% an outbound call has been accepted
| invalid %% an outbound call was attempted to an invalid number
| reject %% an outbound call has been rejected
| busy %% an outbound call was attempted to a busy phone
| hangup. %% an outbound call has hung up
-type action() :: {call, PhoneNumber :: phone_number()}
| accept
| reject
| hangup.
%% Start a phone on the same node as the switch.
-spec start_link(PhoneNumber :: phone_number()) -> {ok, Pid :: pid()}.
start_link(PhoneNumber) ->
start_link(PhoneNumber, node()).
%% Start a phone on a different node than the switch.
-spec start_link(PhoneNumber :: phone_number(),
SwitchNode :: node()) -> {ok, Pid :: pid()}.
start_link(PhoneNumber, SwitchNode) ->
gen_server:start_link(?MODULE, [PhoneNumber, SwitchNode], []).
%% Stop the phone.
-spec stop(Phone :: pid()) -> ok.
stop(Phone) ->
gen_server:call(Phone, stop).
%% Execute a phone action.
-spec action(Phone :: pid(), Action :: action()) -> ok.
action(Phone, Action) ->
gen_server:cast(Phone, {action, Action}).
%% Send a message to the phone.
-spec reply(PhonePid :: pid(), Reply :: reply()) -> ok.
reply(PhonePid, Reply) ->
gen_server:cast(PhonePid, {reply, Reply}).
init([PhoneNumber, SwitchNode]) ->
{ok, Controller} = rpc:call(SwitchNode, hlr, lookup_id, [PhoneNumber]),
phone_controller:connect(Controller),
{ok, #state{phone_number = PhoneNumber, controller = Controller}}.
handle_cast({reply, Reply}, State) ->
io:format("[~p] (~p) : ~p~n",
[State#state.phone_number, self(), reply_to_message(Reply)]),
{noreply, State};
handle_cast({action, {call, PhoneNumber}}, State) ->
phone_controller:action(State#state.controller, {outbound, PhoneNumber}),
{noreply, State};
handle_cast({action, Action}, State) ->
phone_controller:action(State#state.controller, Action),
{noreply, State}.
handle_call(stop, _From, State) ->
phone_controller:disconnect(State#state.controller),
%% Stop with reason normal so the linked caller is not taken down too.
{stop, normal, ok, State}.
reply_to_message(accept) -> "call accepted";
reply_to_message(reject) -> "call rejected";
reply_to_message(invalid) -> "invalid number";
reply_to_message(busy) -> "busy";
reply_to_message(hangup) -> "call ended";
reply_to_message({inbound, PhoneNumber}) -> "incoming call from " ++ PhoneNumber.
Testing it out
I tested this on four nodes: switch, alice, bob, and joe. Here
is the output of the initial tests.
(switch@computer)1> hlr:new().
ok
(switch@computer)2> phone_controller:start_link("111").
{ok,<0.90.0>}
(switch@computer)3> phone_controller:start_link("112").
{ok,<0.92.0>}
(switch@computer)4> phone_controller:start_link("113").
{ok,<0.94.0>}
...
Alice first tries to call Bob, but he rejects the call. Bob then calls Alice back and she accepts.
(alice@computer)1> {ok, Phone} = phone:start_link("111", 'switch@computer').
{ok,<0.89.0>}
(alice@computer)2> phone:action(Phone, {call, "112"}).
ok
["111"] (<0.89.0>) : "call rejected"
["111"] (<0.89.0>) : "incoming call from 112"
(alice@computer)3> phone:action(Phone, accept).
ok
(bob@computer)1> {ok, Phone} = phone:start_link("112", 'switch@computer').
{ok,<0.89.0>}
["112"] (<0.89.0>) : "incoming call from 111"
(bob@computer)2> phone:action(Phone, reject).
ok
(bob@computer)3> phone:action(Phone, {call, "111"}).
ok
["112"] (<0.89.0>) : "call accepted"
(bob@computer)4>
Now Joe tries to call Alice and receives a busy signal, as expected. When he has had enough of the soothing hum of the busy signal, Joe hangs up his phone.
(joe@computer)1> {ok, Phone} = phone:start_link("113", 'switch@computer').
{ok,<0.89.0>}
(joe@computer)2> phone:action(Phone, {call, "111"}).
ok
["113"] (<0.89.0>) : "busy"
(joe@computer)3> phone:action(Phone, hangup).
ok
Everything looks fine for Joe, but Alice and Bob’s call has been
disconnected! Alice received a hangup event generated by
Joe. Furthermore, because Alice’s phone thinks the other side of the
call initiated the hangup, she does not send a hangup event to
Bob. Now Bob’s phone is stuck in an invalid state where it thinks it
is connected to Alice, but Alice’s phone is idle.
Fixing the calling state
The problem of control events coming from the wrong controllers could be solved by tagging each control event with the PID of the controller that sent it. Then any control events from controllers other than the one that is connected or being called can be ignored. Unfortunately this would add a lot of complexity and means that state transitions depend not just on the state of the FSM and the event it receives, but also on the data held by the FSM. Effectively we would be implicitly adding another state.
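To make the trade-off concrete, here is a rough sketch of what a tagged hangup event could look like in the connected state. It is not the approach taken in this post. The {hangup, Sender} event shape, the module name and new_data/1 are assumptions for illustration, and the data record is cut down to the one field that matters here.

```erlang
-module(tagged_event_sketch).
-export([new_data/1, connected/3]).

%% Reduced version of the controller's data record.
-record(data, {other_phone = none :: pid() | none}).

%% Hypothetical helper: data for a controller in a call with OtherPid.
new_data(OtherPid) ->
    #data{other_phone = OtherPid}.

%% A hangup from the controller we are connected to ends the call.
%% Sender is bound in the event and must equal the stored PID.
connected(cast, {hangup, Sender}, #data{other_phone = Sender} = Data) ->
    {next_state, idle, Data#data{other_phone = none}};
%% A hangup from any other controller is ignored.
connected(cast, {hangup, _Stranger}, _Data) ->
    keep_state_and_data.
```

The first clause only matches when the sender equals the PID stored in the data, so the outcome of the event now depends on the data as well as on the state.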
Rather than adding implicit states we can handle the problem by adding
an explicit state. Looking at the problem from Joe’s perspective there
is a simple solution. If Joe gets a busy signal, his subsequent hangup
action should not generate a hangup event for the phone he is trying
to call. This implies that Joe’s phone controller should transition to
a distinct state when it receives the busy event. Let’s call this
state call_failed.
call_failed(cast, {action, hangup}, Data) ->
{next_state, idle, cleanup_call(Data)};
call_failed(cast, {inbound, ControllerPid}, _Data) ->
busy(ControllerPid),
keep_state_and_data;
call_failed(cast, disconnect, Data) ->
{next_state, disconnected, disconnect_phone(cleanup_call(Data))};
call_failed(info, {'DOWN', Ref, process, _Pid, _Info},
#data{phone_ref = Ref} = Data) ->
{next_state, disconnected, disconnect_phone(cleanup_call(Data))};
call_failed(EventType, Event, Data) ->
handle_event(EventType, Event, Data).
The calling state callback just needs to be updated to transition
to the call_failed state when it receives a busy event.
%% ...
calling(cast, busy, Data) ->
phone:reply(Data#data.phone_pid, busy),
{next_state, call_failed, Data};
%% ...
This is what the new state machine looks like (with some transitions elided for clarity).
Test number two
Let’s recompile the controller.
(switch@computer)5> c(phone_controller).
{ok,phone_controller}
Bob can get out of his invalid state by just hanging up the phone. Since her call was disconnected, Alice calls Bob again and he accepts the call. They finish their conversation and Bob ends the call.
(alice@computer)4> phone:action(Phone, {call, "112"}).
ok
["111"] (<0.89.0>) : "call accepted"
["111"] (<0.89.0>) : "call ended"
(bob@computer)4> phone:action(Phone, hangup).
ok
["112"] (<0.89.0>) : "incoming call from 111"
(bob@computer)5> phone:action(Phone, accept).
ok
(bob@computer)6> phone:action(Phone, hangup).
ok
In the meantime Joe has tried to call Alice again and again receives a busy signal. This time, however, when he hangs up Alice and Bob’s call remains connected.
(joe@computer)4> phone:action(Phone, {call, "111"}).
ok
["113"] (<0.89.0>) : "busy"
(joe@computer)5> phone:action(Phone, hangup).
ok
(joe@computer)6>
There are a few more cases to test out. If a phone calls itself it should get a busy signal.
(bob@computer)8> phone:action(Phone, {call, "112"}).
ok
["112"] (<0.89.0>) : "busy"
(bob@computer)9> phone:action(Phone, hangup).
ok
If Bob hangs up before Alice answers, she should receive a hangup event.
(bob@computer)10> phone:action(Phone, {call, "111"}).
ok
(bob@computer)11> phone:action(Phone, hangup).
ok
In Alice’s shell:
["111"] (<0.89.0>) : "incoming call from 112"
["111"] (<0.89.0>) : "call ended"
A much bigger test
To really push the limits we need to do more than just make phone calls in the shell. If we want to experience all the possible race conditions that can occur between phones and see if they are handled correctly we need to build a small simulation of phones making random calls to each other. This test will take a bit more time to implement, and will probably be an interesting project in its own right. To be continued… ?
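As one possible starting point for that simulation, here is a small sketch that only generates a reproducible random script of phone actions. It does not drive any phones yet. The module name sim_sketch and the function random_script/3 are hypothetical; the actions mirror the action() type from the phone module above.

```erlang
-module(sim_sketch).
-export([random_script/3]).

%% Build a list of {PhoneNumber, Action} steps. Numbers is a non-empty
%% list of phone numbers, Steps is the script length and Seed is a
%% three-integer tuple, so the same seed always gives the same script.
random_script(Numbers, Steps, Seed) ->
    State0 = rand:seed_s(exsss, Seed),
    {Script, _State} =
        lists:foldl(
          fun(_Step, {Acc, S0}) ->
                  {Caller, S1} = pick(Numbers, S0),
                  {Action, S2} = random_action(Numbers, S1),
                  {[{Caller, Action} | Acc], S2}
          end,
          {[], State0},
          lists:seq(1, Steps)),
    lists:reverse(Script).

%% Choose one of the four phone actions with equal probability.
random_action(Numbers, S0) ->
    {N, S1} = rand:uniform_s(4, S0),
    case N of
        1 ->
            {Target, S2} = pick(Numbers, S1),
            {{call, Target}, S2};
        2 -> {accept, S1};
        3 -> {reject, S1};
        4 -> {hangup, S1}
    end.

%% Pick a random element of a non-empty list.
pick(List, S0) ->
    {I, S1} = rand:uniform_s(length(List), S0),
    {lists:nth(I, List), S1}.
```

Keeping the generator deterministic for a given seed means a script that exposes a race can be replayed, for example by mapping each step to a phone:action/2 call on the matching phone process.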
Conclusion
From this exercise I learned to pay attention to the indicators that
one state should probably be multiple states: complex control flow
based on the state data, for example. Often my temptation to use the
handle_event_function callback mode stems from a desire to handle one
event from multiple states with the same code. This often leads to
unnecessary complexity and control flow based on data. Using
state_functions forced me to be more thoughtful about what each state
represents and to respond to events based on state rather than data.
Because the interactions between controllers are all asynchronous we need to be careful about which events we respond to. We might get events from controllers other than the one we are connected to, for example, that we should respond to differently (or ignore entirely).