Phone FSM.

Chapter 6 of Designing for Scalability with Erlang/OTP by Francesco Cesarini and Steve Vinoski ends with an exercise that requires you to build a gen_fsm state machine implementing a simple phone controller. The controllers interact with each other and with a separate process representing the phone controlled by each state machine. Since the book was published, a new finite state machine behavior has been added to OTP: gen_statem. Here I’ll implement the phone controller state machine using the gen_statem behavior.

I’ve learned a number of things about implementing state machines with the new behavior in the process. When I used the gen_statem behavior in the past I have always reached for the handle_event callback mode. I was tempted to use that again; however, because the exercise was designed for the original gen_fsm behavior, and I wanted to keep my solution close to what was intended, I chose to use the state_functions callback mode.
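For readers who have not used this callback mode: with state_functions, each state is an ordinary exported function named after the state, and callback_mode/0 tells gen_statem to dispatch that way. The following is a minimal sketch only. The module name mode_sketch, the cut-down connect/disconnect events, and the {error, already_connected} reply are invented for illustration and are not the controller's actual code.

```erlang
-module(mode_sketch).
-behaviour(gen_statem).

-export([start_link/0, connect/1, disconnect/1]).
-export([init/1, callback_mode/0, disconnected/3, idle/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

connect(Pid) ->
    gen_statem:call(Pid, connect).

disconnect(Pid) ->
    gen_statem:cast(Pid, disconnect).

%% The initial state is the atom disconnected; the data is an empty map.
init([]) ->
    {ok, disconnected, #{}}.

%% One callback function per state, named after the state.
callback_mode() ->
    state_functions.

disconnected({call, From}, connect, Data) ->
    {next_state, idle, Data, [{reply, From, ok}]};
disconnected(_EventType, _Event, _Data) ->
    keep_state_and_data.

idle(cast, disconnect, Data) ->
    {next_state, disconnected, Data};
idle({call, From}, connect, _Data) ->
    {keep_state_and_data, [{reply, From, {error, already_connected}}]};
idle(_EventType, _Event, _Data) ->
    keep_state_and_data.
```

In the handle_event_function mode the same machine would instead have a single handle_event/4 callback that receives the state as an argument.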

The state machine

Here’s the state machine implemented by each controller.

stateDiagram
    direction LR
    disconnected
    idle
    calling
    connecting
    connected
    [*] --> disconnected
    disconnected --> idle : {connect, PhoneNumber}
    idle --> calling : {action, outbound}
    idle --> disconnected : disconnect
    calling --> idle : rejected
    calling --> idle : hangup
    idle --> connecting : inbound
    connecting --> idle : hangup
    connecting --> idle : {action, reject}
    connecting --> connected : {action, accept}
    calling --> connected : accepted
    connected --> idle : hangup

The events tagged with action come from the phone. All other events are sent from other controllers on the switch. Except for idle, the diagram does not show the transition that every state makes to disconnected when it receives the disconnect event. This is implemented in the phone_controller.erl module [1].

The API

The state machine has two APIs. One is used by the phones to interact with their controllers and the other is used for interactions between controllers.

Some types

We define some types for use in our function specs.

-type phone_number() :: string().

-type phone_action() :: accept %% accept an incoming call
                      | reject %% reject an incoming call
                      | hangup %% hangup a call (possibly before it is accepted)
                      | {outbound, PhoneNumber :: phone_number()}. %% initiate a call

The phone API

The API used by phone processes to interact with their controllers is the following.

-spec connect(ControllerPid :: pid()) -> ok.
connect(ControllerPid) ->
    gen_statem:call(ControllerPid, {connect, self()}).

-spec disconnect(ControllerPid :: pid()) -> ok.
disconnect(ControllerPid) ->
    gen_statem:cast(ControllerPid, disconnect).

-spec action(ControllerPid :: pid(), Action :: phone_action()) -> ok.
action(ControllerPid, Action) ->
    gen_statem:cast(ControllerPid, {action, Action}).

The controller API

The internal controller API consists of these five functions.

busy(ControllerPid) ->
    gen_statem:cast(ControllerPid, busy).

accept(ControllerPid) ->
    gen_statem:cast(ControllerPid, accepted).

reject(ControllerPid) ->
    gen_statem:cast(ControllerPid, rejected).

hangup(ControllerPid) ->
    gen_statem:cast(ControllerPid, hangup).

inbound(ControllerPid) ->
    gen_statem:cast(ControllerPid, {inbound, self()}).

The callbacks - first cut

The state machine needs to keep track of the PID of its phone so it can send responses to it. We also need to keep track of the PID of the controller we are calling or connected with so we can send control messages to it.

-record(data, {phone_pid = undefined :: pid() | undefined,
               phone_ref = undefined :: reference() | undefined,
               other_phone = none :: pid() | none}).

The disconnected state

The controller starts in the disconnected state. Events coming from other controllers while in this state are ignored, with the exception of inbound events. To prevent other phones from waiting forever for a call that cannot be completed, an inbound event received in the disconnected state causes the controller to send a rejected event to the controller that attempted the call.

When a phone is connected, we also set up a monitor so we can end any ongoing calls and return to the disconnected state if it dies unexpectedly. Here’s the callback for disconnected.

disconnected({call, From}, {connect, PhonePid}, Data) ->
    Ref = erlang:monitor(process, PhonePid),
    {next_state, idle, Data#data{phone_pid = PhonePid, phone_ref = Ref},
     {reply, From, ok}};
disconnected(cast, {inbound, ControllerPid}, _) ->
    %% If no phone is connected then all inbound calls are rejected.
    reject(ControllerPid),
    keep_state_and_data;
disconnected(cast, accepted, _) -> keep_state_and_data;
disconnected(cast, rejected, _) -> keep_state_and_data;
disconnected(cast, busy, _) -> keep_state_and_data;
disconnected(cast, hangup, _) -> keep_state_and_data;
disconnected(EventType, Event, Data) ->
    %% pass any other events to a handler for all-state events
    handle_event(EventType, Event, Data).
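The handle_event/3 function called in the last clause is not shown in this post. In the state_functions callback mode it is an ordinary helper function, not a gen_statem callback. One possible shape is sketched below, assuming that unexpected events should simply be ignored and that unexpected calls should be answered so the caller does not time out. The module name all_state_sketch, the simplified connect/disconnect events, and the {error, unexpected} reply are invented for illustration.

```erlang
-module(all_state_sketch).
-behaviour(gen_statem).

-export([start_link/0]).
-export([init/1, callback_mode/0, disconnected/3, idle/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

init([]) ->
    {ok, disconnected, #{}}.

callback_mode() ->
    state_functions.

disconnected(cast, connect, Data) ->
    {next_state, idle, Data};
disconnected(EventType, Event, Data) ->
    %% Anything this state does not care about goes to the shared handler.
    handle_event(EventType, Event, Data).

idle(cast, disconnect, Data) ->
    {next_state, disconnected, Data};
idle(EventType, Event, Data) ->
    handle_event(EventType, Event, Data).

%% Shared fallback for every state (hypothetical behaviour).
handle_event({call, From}, _Event, _Data) ->
    %% Reply so that a synchronous caller is not left waiting.
    {keep_state_and_data, [{reply, From, {error, unexpected}}]};
handle_event(_EventType, _Event, _Data) ->
    %% Casts and info messages nobody handles are dropped.
    keep_state_and_data.
```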

The idle state

When idle, the controller responds to inbound and outbound events. When processing an outbound event, we need to look up the controller for the requested phone number and reject the call if there is no controller for that number. All other actions and controller events are ignored in this state, except for a disconnect event.

idle(cast, {inbound, ControllerPid}, Data) ->
    {ok, Caller} = hlr:lookup_ms(ControllerPid),
    phone:reply(Data#data.phone_pid, {inbound, Caller}),
    {next_state, connecting, Data#data{other_phone = ControllerPid}};
idle(cast, {action, {outbound, PhoneNumber}}, Data) ->
    case hlr:lookup_id(PhoneNumber) of
        {ok, Pid} ->
            inbound(Pid),
            {next_state, calling, Data#data{other_phone = Pid}};
        {error, invalid} ->
            phone:reply(Data#data.phone_pid, invalid),
            keep_state_and_data
    end;
idle(cast, {action, _}, _Data) ->
    %% Any other actions are ignored.
    keep_state_and_data;
idle(cast, accepted, _Data) ->
    keep_state_and_data;
idle(cast, rejected, _Data) ->
    keep_state_and_data;
idle(cast, hangup, _Data) ->
    keep_state_and_data;
idle(cast, disconnect, Data) ->
    {next_state, disconnected, disconnect_phone(Data)};
idle(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
    %% the phone pid has died.
    {next_state, disconnected, disconnect_phone(Data)};
idle(EventType, Event, Data) ->
    %% Other events pass through to an all-state event handler.
    handle_event(EventType, Event, Data).

We use a helper function disconnect_phone/1 to clean up the data related to the phone before transitioning back to the disconnected state. This function will be used in all the other states for the same purpose.

disconnect_phone(Data) ->
    erlang:demonitor(Data#data.phone_ref),
    Data#data{phone_pid = undefined, phone_ref = undefined}.
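One detail that may be worth considering: if the phone has already died when disconnect_phone/1 runs, a 'DOWN' message can already be sitting in the controller's mailbox, and erlang:demonitor/1 does not remove it. erlang:demonitor/2 with the flush option does. The sketch below shows that variant with a cut-down data record and a hypothetical module name demonitor_sketch. Whether the real controller needs it depends on how its all-state handler treats stray 'DOWN' messages.

```erlang
-module(demonitor_sketch).
-export([disconnect_phone/1, demo/0]).

%% Cut-down version of the data record, for illustration only.
-record(data, {phone_pid, phone_ref}).

disconnect_phone(#data{phone_ref = Ref} = Data) when is_reference(Ref) ->
    %% flush also removes a 'DOWN' message for Ref that was already delivered.
    erlang:demonitor(Ref, [flush]),
    Data#data{phone_pid = undefined, phone_ref = undefined};
disconnect_phone(Data) ->
    %% No phone is being monitored, so there is nothing to clean up.
    Data.

%% Monitor a process that exits at once, then disconnect and check
%% whether a 'DOWN' message for it is still in our mailbox.
demo() ->
    Pid = spawn(fun() -> ok end),
    Ref = erlang:monitor(process, Pid),
    timer:sleep(50),
    _ = disconnect_phone(#data{phone_pid = Pid, phone_ref = Ref}),
    receive
        {'DOWN', Ref, process, _, _} -> stale_down
    after 0 ->
        flushed
    end.
```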

The calling state

The controller enters this state when it initiates a call and is waiting for a response from the controller for the number it is calling. Any inbound event results in a busy signal (i.e. a busy event is sent to the controller that initiated the inbound call). The controller responds to rejected and accepted events by transitioning to the idle or connected state respectively.

A busy event received in this state is handled by reporting the busy signal to the phone and remaining in the calling state. People should be allowed to listen to the busy signal, if they want to, until they hang up the phone (there is a problem with the handling of busy and hangup events as implemented here).

The caller may also hang up the phone before the call is accepted or rejected. In this case the controller still sends a hangup event to the other controller before cleaning up the call data and returning to the idle state. Because the hangup event races with the accept or reject action on the other phone, the idle state may then receive events from the other controller. This is handled above by ignoring such events in the idle state.

calling(cast, {inbound, ControllerPid}, _Data) ->
    busy(ControllerPid),
    keep_state_and_data;
calling(cast, rejected, Data) ->
    %% The other phone has rejected the call.
    phone:reply(Data#data.phone_pid, reject),
    {next_state, idle, cleanup_call(Data)};
calling(cast, accepted, Data) ->
    phone:reply(Data#data.phone_pid, accept),
    {next_state, connected, Data};
calling(cast, busy, Data) ->
    phone:reply(Data#data.phone_pid, busy),
    keep_state_and_data;
calling(cast, {action, hangup}, Data) ->
    {next_state, idle, hangup_call(Data)};
calling(cast, {action, _}, _Data) ->
    %% all other phone actions are ignored.
    keep_state_and_data;
calling(cast, disconnect, Data) ->
    {next_state, disconnected, disconnect_phone(hangup_call(Data))};
calling(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
    {next_state, disconnected, disconnect_phone(hangup_call(Data))};
calling(EventType, Event, Data) ->
    handle_event(EventType, Event, Data).

Two more helper functions are introduced here for cleaning up the call data and for hanging up a call. The cleanup_call/1 function simply cleans up the data, whereas the hangup_call/1 function sends a hangup event to the other controller before cleaning up the data.

hangup_call(Data) ->
    hangup(Data#data.other_phone),
    cleanup_call(Data).

cleanup_call(Data) ->
    Data#data{other_phone = none}.

Finally, note that if the phone disconnects from its controller while waiting for a response to an outgoing call the controller politely hangs up the outgoing call.

The connecting state

The controller enters this state when it receives an incoming call. Here it waits for the phone to send either an accept or a reject action. It may also receive a hangup event from the calling controller, in which case the call is cleaned up and the controller returns to idle. If the call is accepted, an accepted event is sent to the other controller and the state is changed to connected. As in the calling state, another inbound event results in a busy signal.

connecting(cast, {action, accept}, Data) ->
    accept(Data#data.other_phone),
    {next_state, connected, Data};
connecting(cast, {action, reject}, Data) ->
    {next_state, idle, reject_call(Data)};
connecting(cast, hangup, Data) ->
    phone:reply(Data#data.phone_pid, hangup),
    {next_state, idle, cleanup_call(Data)};
connecting(cast, {inbound, ControllerPid}, _Data) ->
    busy(ControllerPid),
    keep_state_and_data;
connecting(cast, disconnect, Data) ->
    {next_state, disconnected, disconnect_phone(reject_call(Data))};
connecting(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
    {next_state, disconnected, disconnect_phone(reject_call(Data))};
connecting(EventType, Event, Data) ->
    handle_event(EventType, Event, Data).

One final helper function is introduced here to send a rejected event to the calling process and clean up the call state before transitioning back to the idle state.

reject_call(Data) ->
    reject(Data#data.other_phone),
    cleanup_call(Data).

The connected state

Once the call has been connected, the only event of interest is hangup, which causes the call to be cleaned up and the controller to transition back to idle. Again, any inbound calls receive a busy signal, and if the phone disconnects the controller politely hangs up the call.

connected(cast, {action, hangup}, Data) ->
    %% this phone has ended the call
    {next_state, idle, hangup_call(Data)};
connected(cast, hangup, Data) ->
    %% the other phone has ended the call
    phone:reply(Data#data.phone_pid, hangup),
    {next_state, idle, cleanup_call(Data)};
connected(cast, {inbound, ControllerPid}, _Data) ->
    busy(ControllerPid),
    keep_state_and_data;
connected(cast, disconnect, Data) ->
    {next_state, disconnected, disconnect_phone(hangup_call(Data))};
connected(info, {'DOWN', Ref, process, _Pid, _Info}, #data{phone_ref = Ref} = Data) ->
    {next_state, disconnected, disconnect_phone(hangup_call(Data))};
connected(EventType, Event, Data) ->
    handle_event(EventType, Event, Data).

The phone

The phone module in the book’s GitHub repository did not match the phone API described in the book, so I implemented a very simple phone. It is a gen_server that just prints the replies it receives from its controller, and it can be run on a different node than the switch.

-module(phone).

-behaviour(gen_server).

-export([start_link/1, start_link/2, reply/2, stop/1, action/2]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {phone_number :: phone_number(),
                controller :: pid()}).

-type phone_number() :: string().

-type reply() :: {inbound, MSIDN :: phone_number()}
               | accept %% an outbound call has been accepted
               | invalid %% an outbound call was attempted to an invalid number
               | reject %% an outbound call has been rejected
               | busy %% an outbound call was attempted to a busy phone
               | hangup. %% an outbound call has hung up

-type action() :: {call, PhoneNumber :: phone_number()}
                | accept
                | reject
                | hangup.

%% Start a phone on the same node as the switch.
-spec start_link(PhoneNumber :: phone_number()) -> {ok, Pid :: pid()}.
start_link(PhoneNumber) ->
    start_link(PhoneNumber, node()).

%% Start a phone on a different node than the switch.
-spec start_link(PhoneNumber :: phone_number(),
                 SwitchNode :: node()) -> {ok, Pid :: pid()}.
start_link(PhoneNumber, SwitchNode) ->
    gen_server:start_link(?MODULE, [PhoneNumber, SwitchNode], []).

%% Stop the phone.
-spec stop(Phone :: pid()) -> ok.
stop(Phone) ->
    gen_server:call(Phone, stop).

%% Execute a phone action.
-spec action(Phone :: pid(), Action :: action()) -> ok.
action(Phone, Action) ->
    gen_server:cast(Phone, {action, Action}).

%% Send a message to the phone.
-spec reply(PhonePid :: pid(), Reply :: reply()) -> ok.
reply(PhonePid, Reply) ->
    gen_server:cast(PhonePid, {reply, Reply}).

init([PhoneNumber, SwitchNode]) ->
    {ok, Controller} = rpc:call(SwitchNode, hlr, lookup_id, [PhoneNumber]),
    phone_controller:connect(Controller),
    {ok, #state{phone_number = PhoneNumber, controller = Controller}}.

handle_cast({reply, Reply}, State) ->
    io:format("[~p] (~p) : ~p~n",
              [State#state.phone_number, self(), reply_to_message(Reply)]),
    {noreply, State};
handle_cast({action, {call, PhoneNumber}}, State) ->
    phone_controller:action(State#state.controller, {outbound, PhoneNumber}),
    {noreply, State};
handle_cast({action, Action}, State) ->
    phone_controller:action(State#state.controller, Action),
    {noreply, State}.

handle_call(stop, _From, State) ->
    phone_controller:disconnect(State#state.controller),
    %% Stop with reason normal so that a deliberate stop is not logged as a crash.
    {stop, normal, ok, State}.

reply_to_message(accept) -> "call accepted";
reply_to_message(reject) -> "call rejected";
reply_to_message(invalid) -> "invalid number";
reply_to_message(busy) -> "busy";
reply_to_message(hangup) -> "call ended";
reply_to_message({inbound, PhoneNumber}) -> "incoming call from " ++ PhoneNumber.

Testing it out

I tested this on four nodes: switch, alice, bob, and joe. Here is the output of the initial tests.

(switch@computer)1> hlr:new().
ok
(switch@computer)2> phone_controller:start_link("111").
{ok,<0.90.0>}
(switch@computer)3> phone_controller:start_link("112").
{ok,<0.92.0>}
(switch@computer)4> phone_controller:start_link("113").
{ok,<0.94.0>}
...

Alice first tries to call Bob, but he rejects the call. Bob then calls Alice back and she accepts.

(alice@computer)1> {ok, Phone} = phone:start_link("111", 'switch@computer').
{ok,<0.89.0>}
(alice@computer)2> phone:action(Phone, {call, "112"}).
ok
["111"] (<0.89.0>) : "call rejected"
["111"] (<0.89.0>) : "incoming call from 112"
(alice@computer)3> phone:action(Phone, accept).
ok

(bob@computer)1> {ok, Phone} = phone:start_link("112", 'switch@computer').
{ok,<0.89.0>}
["112"] (<0.89.0>) : "incoming call from 111"
(bob@computer)2> phone:action(Phone, reject).
ok
(bob@computer)3> phone:action(Phone, {call, "111"}).
ok
["112"] (<0.89.0>) : "call accepted"
(bob@computer)4>

Now Joe tries to call Alice and receives a busy signal, as expected. When he has had enough of the soothing hum of the busy signal, Joe hangs up his phone.

(joe@computer)1> {ok, Phone} = phone:start_link("113", 'switch@computer').
{ok,<0.89.0>}
(joe@computer)2> phone:action(Phone, {call, "111"}).
ok
["113"] (<0.89.0>) : "busy"
(joe@computer)3> phone:action(Phone, hangup).
ok

Everything looks fine for Joe, but Alice and Bob’s call has been disconnected! Alice received a hangup event generated by Joe. Furthermore, because Alice’s controller thinks the other side of the call initiated the hangup, it does not send a hangup event to Bob. Now Bob’s controller is stuck in an invalid state where it thinks it is connected to Alice, but Alice’s controller is idle.

Fixing the calling state

The problem of control events coming from the wrong controllers could be solved by tagging each control event with the PID of the controller that sent it. Then any control events from controllers other than the one that is connected or being called can be ignored. Unfortunately this would add a lot of complexity and means that state transitions depend not just on the state of the FSM and the event it receives, but also on the data held by the FSM. Effectively we would be implicitly adding another state.
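To make the trade-off concrete, here is roughly what the tagging approach could look like for a single event. This is a sketch and not code from the controller: the module tagged_sketch is hypothetical, it starts directly in the connected state for brevity, and only hangup is tagged with the sender's PID.

```erlang
-module(tagged_sketch).
-behaviour(gen_statem).

-export([start_link/1, hangup/1]).
-export([init/1, callback_mode/0, connected/3, idle/3]).

-record(data, {other_phone = none :: pid() | none}).

%% Start a controller that is already connected to Peer (for illustration).
start_link(Peer) ->
    gen_statem:start_link(?MODULE, Peer, []).

%% The event carries the PID of the controller that sent it.
hangup(ControllerPid) ->
    gen_statem:cast(ControllerPid, {hangup, self()}).

init(Peer) ->
    {ok, connected, #data{other_phone = Peer}}.

callback_mode() ->
    state_functions.

%% Peer is bound twice in the head, so this clause only matches a hangup
%% that comes from the controller we are connected to.
connected(cast, {hangup, Peer}, #data{other_phone = Peer} = Data) ->
    {next_state, idle, Data#data{other_phone = none}};
connected(cast, {hangup, _Stranger}, _Data) ->
    %% A hangup from any other controller is ignored.
    keep_state_and_data;
connected(_EventType, _Event, _Data) ->
    keep_state_and_data.

idle(_EventType, _Event, _Data) ->
    keep_state_and_data.
```

The first two clauses show where data starts to drive the transition: the same event in the same state has two outcomes depending on the value of other_phone.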

Rather than adding implicit states we can handle the problem by adding an explicit state. Looking at the problem from Joe’s perspective there is a simple solution. If Joe gets a busy signal his subsequent hangup action should not generate a hangup event for the phone he is trying to call. This implies that Joe’s phone controller should transition to a distinct state when it receives the busy event. Let’s call this state call_failed.

call_failed(cast, {action, hangup}, Data) ->
    {next_state, idle, cleanup_call(Data)};
call_failed(cast, {inbound, ControllerPid}, _Data) ->
    busy(ControllerPid),
    keep_state_and_data;
call_failed(cast, disconnect, Data) ->
    {next_state, disconnected, disconnect_phone(cleanup_call(Data))};
call_failed(info, {'DOWN', Ref, process, _Pid, _Info},
            #data{phone_ref = Ref} = Data) ->
    {next_state, disconnected, disconnect_phone(cleanup_call(Data))};
call_failed(EventType, Event, Data) ->
    handle_event(EventType, Event, Data).

The calling state callback just needs to be updated to transition to the call_failed state when it receives a busy event.

%% ...
calling(cast, busy, Data) ->
    phone:reply(Data#data.phone_pid, busy),
    {next_state, call_failed, Data};
%% ...

This is what the new state machine looks like (with some transitions elided for clarity).

stateDiagram
    direction LR
    disconnected
    idle
    calling
    call_failed
    connecting
    connected
    [*] --> disconnected
    disconnected --> idle : connect
    idle --> calling : outbound
    calling --> idle : rejected
    calling --> call_failed : busy
    call_failed --> idle : hangup
    idle --> connecting : inbound
    connecting --> idle : reject
    connecting --> connected : accept
    calling --> connected : accepted

Test number two

Let’s recompile the controller.

(switch@computer)5> c(phone_controller).
{ok,phone_controller}

Bob can get out of his invalid state by just hanging up the phone. Since her call was disconnected, Alice calls Bob again and he accepts the call. They finish their conversation and Bob ends the call.

(alice@computer)4> phone:action(Phone, {call, "112"}).
ok
["111"] (<0.89.0>) : "call accepted"
["111"] (<0.89.0>) : "call ended"

(bob@computer)4> phone:action(Phone, hangup).
ok
["112"] (<0.89.0>) : "incoming call from 111"
(bob@computer)5> phone:action(Phone, accept).
ok
(bob@computer)6> phone:action(Phone, hangup).
ok

In the meantime Joe has tried to call Alice again and again receives a busy signal. This time, however, when he hangs up Alice and Bob’s call remains connected.

(joe@computer)4> phone:action(Phone, {call, "111"}).
ok
["113"] (<0.89.0>) : "busy"
(joe@computer)5> phone:action(Phone, hangup).
ok
(joe@computer)6>

There are a few more cases to test out. If a phone calls itself it should get a busy signal.

(bob@computer)8> phone:action(Phone, {call, "112"}).
ok
["112"] (<0.89.0>) : "busy"
(bob@computer)9> phone:action(Phone, hangup).
ok

If Bob hangs up before Alice answers, she should receive a hangup event.

(bob@computer)10> phone:action(Phone, {call, "111"}).
ok
(bob@computer)11> phone:action(Phone, hangup).
ok

In Alice’s shell:

["111"] (<0.89.0>) : "incoming call from 112"
["111"] (<0.89.0>) : "call ended"

A much bigger test

To really push the limits we need to do more than just make phone calls in the shell. If we want to experience all the possible race conditions that can occur between phones and see if they are handled correctly we need to build a small simulation of phones making random calls to each other. This test will take a bit more time to implement, and will probably be an interesting project in its own right. To be continued… ?

Conclusion

From this exercise I learned to pay attention to the indicators that one state should probably be multiple states: complex control flow based on the state data, for example. Often my temptation to use the handle_event callback mode stems from a desire to handle one event from multiple states with the same code. This often leads to unnecessary complexity and to control flow based on data. Using state_functions forced me to be more thoughtful about what each state represents and to respond to events based on state rather than data.

Because the interactions between controllers are all asynchronous we need to be careful about which events we respond to. We might get events from controllers other than the one we are connected to, for example, that we should respond to differently (or ignore entirely).


  1. Here are phone.erl and hlr.erl.