Phone Simulator.

This builds on the previous post where I implemented a phone controller using gen_statem

Making calls manually isn’t enough. We need to simulate hundreds of thousands of phones connected to the switch and constantly calling each other. The obvious thing to do is to model each person interacting with the phones as a separate process. Like the phone controller, I will use a gen_statem for this. In fact, the state machine that represents callers is very similar to the state machine for the phone controllers. There are two “intermediate” states that represent either waiting for a call to be answered (calling) or waiting to decide whether to answer or reject an incoming call (receiving).

stateDiagram direction LR idle calling receiving in_call [*] --> idle idle --> calling : initiate_call calling --> idle : rejected calling --> idle : giveup calling --> calling : busy calling --> in_call : answered idle --> receiving : inbound receiving --> idle : reject receiving --> in_call : answer receiving --> idle : hangup in_call --> idle : hangup

Each of these state machines also need to randomly generate events for itself to initiate calls, accept or reject incoming calls, and end ongoing calls. To do this the state machines use the timer module to schedule their next action before changing states. I considered using state_enter callbacks to set up these timers, but chose to configure the timers prior to the state transition to keep things within the realm of my current gen_statem understanding. It would be an interesting exercise to re-write the simulator state machine using state_enter callbacks and see if the resulting implementation is simpler. Ultimately I’m a little frustrated with the complexity of the implementation here; I think there is room for improvement.

One source of complexity that is difficult to avoid, and wouldn’t be resolved with state_enter callbacks, is the inherent race conditions that arise from the use timers to schedule events. Because the events depend on the state of the phone (e.g. can only reject a call in the receiving state) certain events need to be canceled when the other party changes the state of the call. Timers may fire before they can be canceled, meaning we have to handle and ignore some events related to previous calls in other states. On the other side of this, some events from another phone race with our timers meaning, for example, a call one phone initiated could be accepted after the phone has given up waiting for an answer, but while the hangup message is still in flight.

Tweaking the phone module

Before writing the simulator the phone module needs to be improved. It was initially written for use by a human, not a program, so all it does is print to the terminal whenever it gets a reply from its controller. To make it useful for the simulator we will have it send messages to whatever process started it when it gets replies from its controller. A controlling_pid field is added to its state, and the pid of the process that started it is passed to the init/1 callback.

-spec start_link(PhoneNumber :: phone_number(), SwitchNode :: node()) -> {ok, Pid :: pid()}.
start_link(PhoneNumber, SwitchNode) ->
    gen_server:start_link(?MODULE, [PhoneNumber, SwitchNode, self()], []).

init([PhoneNumber, SwitchNode, ControllingPid]) ->
    {ok, Controller} = rpc:call(SwitchNode, hlr, lookup_id, [PhoneNumber]),
    phone_controller:connect(Controller),
    {ok, #state{phone_number = PhoneNumber,
                controller = Controller,
                controlling_pid = ControllingPid}}.

In handle_cast we now just send messages of the form {phone, PhonePid, Reply} to the controlling pid whenever the phone receives a message from its controller. Handling actions remains the same as before.

handle_cast({reply, Reply}, State) ->
    State#state.controlling_pid ! {phone, self(), Reply},
    {noreply, State};
%% ... handle actions

The Phone Simulator state machine

This should really be called something like “caller” or “customer” simulator, but I went with phone_simulator and I don’t really feel like changing it.

The phone_simulator state machine has some basic data, mostly consisting of the timer references for upcoming events.

-record(data, {next_call :: timer:tref(), %% triggers next outgoing call
               hangup :: timer:tref() | undefined, %% triggers hangup of ongoing call
               answer :: timer:tref() | undefined, %% triggers answering an incoming call
               giveup :: timer:tref() | undefined, %% triggers hangup while awaiting an answer
               destination = random_phone() :: string(), %% phone number of outgoing call
               phone :: pid()}).

When it is started it starts a phone process for itself, which connects to the corresponding controller on the switch. It also immediately schedules its next outgoing call and enters into the idle state.

start_link(PhoneNumber) ->
    start_link(PhoneNumber, []).

start_link(PhoneNumber, Options) ->
    gen_statem:start_link(?MODULE, [PhoneNumber|Options], []).

init([PhoneNumber|Options]) ->
    seed(Options),
    {ok, Phone} = phone:start_link(PhoneNumber, 'switch@computer'),
    {ok, NextCall} = next_call(),
    {ok, idle, #data{phone = Phone, next_call = NextCall}}.

The idle state is pretty straightforward. First we include a large number of clauses to handle races between phones and timers.

idle(info, {phone, _, _}, _Data) ->
    %% Any other messages coming from the phone are ignored: e.g. a
    %% hangup after a reject (race between this phone phone and
    %% the calling phone).
    keep_state_and_data;
%% The following three clauses are needed to handle the race between the
%% canceling the timers and external events that end the call.
idle(cast, answer, _Data) -> keep_state_and_data;
idle(cast, hangup, _Data) -> keep_state_and_data;
idle(cast, giveup, _Data) -> keep_state_and_data;
%% ...

The last two clauses are the only two that actually do something interesting. Incoming calls are handled by randomly deciding whether to answer or reject the call. The answer/reject event is scheduled and the simulator transitions to the receiving state. If the next event is an initiate_call event, then the simulator transitions to the calling state. Before transitioning, however, it schedules two future events. First, it schedules an event to give up on the call if no answer is received. Second, it schedules its next outgoing call. This may seem a little strange, but I wanted to make sure that there was always another call scheduled so that when transitioning back to idle it is possible that the simulator does not wait at all before initiating a new call. This could be achieved in different ways, but this gives me the opportunity to experiment with postponing events in a gen_statem (as we will see next).

%% ...
idle(info, {phone, _PhonePid, {inbound, _PhoneNumber}}, Data) ->
    %% decide whether or not to answer the call
    case answer() of
        true ->
            {next_state, receiving, schedule_answer(Data)};
        false ->
            {next_state, receiving, schedule_reject(Data)}
    end;
idle(cast, initiate_call, Data) ->
    PhoneNumber = random_phone(),
    phone:action(Data#data.phone, {call, PhoneNumber}),
    {next_state, calling,
     schedule_next_call(
       schedule_giveup(Data#data{ destination = PhoneNumber }))}.

The receiving state is fairly simple, but if the simulator receives an initiate_call event it immediately rejects the incoming call and returns to the idle state, allowing a new call to be initiated.

receiving(cast, answer, Data) ->
    phone:action(Data#data.phone, accept),
    {next_state, in_call, schedule_hangup(Data)};
receiving(cast, hangup, Data) ->
    phone:action(Data#data.phone, reject),
    {next_state, idle, Data};
receiving(cast, initiate_call, Data) ->
    phone:action(Data#data.phone, reject),
    timer:cancel(Data#data.hangup),
    timer:cancel(Data#data.answer),
    {next_state, idle, Data, postpone};
receiving(info, {phone, _, hangup}, Data) ->
    timer:cancel(Data#data.answer),
    timer:cancel(Data#data.hangup),  % Need to cancel a pending rejection
    {next_state, idle, Data}.

For outgoing calls, the calling state handles a larger variety of messages coming from the phone. It also handles a giveup event that means it will stop waiting for an answer (i.e. hang up) and transition back to idle. If the other phone is busy then we remain in the calling state for one second before giving up. If an initiate_call event arrives it is postponed to be handled when the current call is complete.

calling(info, {phone, _, {inbound, _}}, Data) ->
    phone:action(Data#data.phone, reject),
    %% Retry the call
    phone:action(Data#data.phone, {call, Data#data.destination}),
    keep_state_and_data;
calling(info, {phone, _, reject}, Data) ->
    timer:cancel(Data#data.giveup),
    {next_state, idle, Data};
calling(info, {phone, _, accept}, Data) ->
    timer:cancel(Data#data.giveup),
    {next_state, in_call, schedule_hangup(Data)};
calling(info, {phone, _, busy}, Data) ->
    timer:cancel(Data#data.giveup),
    %% wait one second before giving up.
    timer:apply_after(1000, gen_statem, cast, [self(), giveup]),
    keep_state_and_data;
calling(info, {phone, _, invalid}, Data) ->
    timer:cancel(Data#data.giveup),
    {next_state, idle, Data};
calling(info, {phone, _, hangup}, _Data) ->
    %% This is the result of some previous call. Ignore it.
    keep_state_and_data;
calling(cast, giveup, Data) ->
    phone:action(Data#data.phone, hangup),
    {next_state, idle, Data};
%% since we are racing lots of timers, there may be leftover hangups and answers
calling(cast, hangup, _Data) -> keep_state_and_data;
calling(cast, answer, _Data) -> keep_state_and_data;
calling(cast, initiate_call, _Data) ->
    {keep_state_and_data, postpone}.

Finally, the in_call state is pretty simple. As in calling, we postpone a new initiate_call event to be processed once the current call is finished.

in_call(info, {phone, _, hangup}, Data) ->
    timer:cancel(Data#data.hangup),
    {next_state, idle, Data#data{hangup = undefined}};
in_call(info, {phone, _, _}, _) ->
    %% Any other phone messages are delayed from previous calls -
    %% ignore them.
    keep_state_and_data;
in_call(cast, hangup, Data) ->
    phone:action(Data#data.phone, hangup),
    {next_state, idle, Data#data{hangup = undefined}};
in_call(cast, giveup, _Data) ->
    keep_state_and_data;
in_call(cast, initiate_call, _) ->
    {keep_state_and_data, postpone}.

There are a few internal functions used to schedule events and generate random phone numbers.

random_phone() ->
    integer_to_list(rand:uniform(100000) + 100000).

answer() ->
    rand:uniform() < 0.5.

schedule_answer(Data) ->
    {ok, Answer} = timer:apply_after(rand:uniform(6001) - 1,
                                     gen_statem, cast, [self(), answer]),
    Data#data{answer = Answer}.

schedule_reject(Data) ->
    {ok, Hangup} = timer:apply_after(rand:uniform(6001) - 1,
                                     gen_statem, cast, [self(), hangup]),
    Data#data{
      hangup = Hangup}.

schedule_hangup(Data) ->
    {ok, Hangup} = timer:apply_after(rand:uniform(30001) - 1,
                                     gen_statem, cast, [self(), hangup]),
    Data#data{
      hangup = Hangup}.

schedule_next_call(Data) ->
    %% wait as long as an hour before initiating a call.
    {ok, NextEvent} = next_call(),
    Data#data{next_call = NextEvent}.

schedule_giveup(Data) ->
    {ok, NextEvent} = timer:apply_after(rand:uniform(26000) + 3000,
                                        gen_statem, cast, [self(), giveup]),
    Data#data{giveup = NextEvent}.

next_call() ->
    timer:apply_after(rand:uniform(36000),
                      gen_statem, cast, [self(), initiate_call]).

Complexity

A lot of complexity arises because of the race condition inherent between different phones. Some of this is unavoidable, but I think some of it could be handled in the phone module to hide it from the caller. In particular the phone could ignore events that arrive from calls it has already ended. I’ll leave this as something to mess with in the future though, because I’m kind of bored with phones. My basic idea to handle this is to create a reference in the phone when it initiates a call and make the controllers include the call reference in all their replies to the phone. This would not only simplify some of the simulator, but it would make me feel much more confident that this is correctly implemented. (I’m only at like 80%.)

Testing

I’ll test this on 5 different Erlang nodes: one switch and four nodes for the simulators. There will be a phone created for every number in the range “100000” to “200000”.

On the switch:

(switch@computer)> hlr:new().
ok
(switch@computer)34> lists:foreach(
                       fun phone_controller:start_link/1,
                       [integer_to_list(X) || X <- lists:seq(100000, 200000)]).
ok

And on each caller node I started 25000 phone_simulator processes:

(callers2@computer)1> lists:foreach(
                        fun phone_simulator:start_link/1,
                        [integer_to_list(X) || X <- lists:seq(125001, 150000)]).
ok

And nothing happened (after I fixed the bugs). I can do some traces and see that lots of calls are going through, but basically it will run forever now.

Or will it?

=ERROR REPORT==== 30-Dec-2022::16:19:55.695034 ===
** State machine <0.14899.2142> terminating
** Last event = {info,{phone,<0.14900.2142>,reject}}
** When server state  = {in_call,
                            {data,
                                {once,#Ref<0.1812204834.3329490945.96999>},
                                {once,#Ref<0.1812204834.3329490945.97002>},
                                {once,#Ref<0.1812204834.3325034498.62005>},
                                {once,#Ref<0.1812204834.3329490945.96997>},
                                "177721",<0.14900.2142>}}
** Reason for termination = error:function_clause
** Callback modules = [phone_simulator]
** Callback mode = state_functions
** Stacktrace =
**  [{phone_simulator,in_call,
                      [info,
                       {phone,<0.14900.2142>,reject},
                       {data,{once,#Ref<0.1812204834.3329490945.96999>},
                             {once,#Ref<0.1812204834.3329490945.97002>},
                             {once,#Ref<0.1812204834.3325034498.62005>},
                             {once,#Ref<0.1812204834.3329490945.96997>},
                             "177721",<0.14900.2142>}],
                      [{file,"phone_simulator.erl"},{line,109}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1419}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]

I handled a lot of the previous errors like this by adding clauses to the state callbacks to ignore the offending event because it was the result of an expected race condition. This error is clearly the result of a race condition, but it isn’t so clear that it is expected. In fact, it looks like it is actually caused by entering into an invalid state because we handled a previous race condition incorrectly. I think this error comes from accidentally responding to a {phone, _, accept} message from a previous call that arrives after giving up on that call and initiating a new call. Then the simulator ends up in the in_call state even though its current call has not been accepted. When the current call is rejected by the receiving phone the simulator crashes because it thinks its call was already accepted. Here’s what I think is going on as a diagram.

sequenceDiagram PhoneA->>ControllerA: {call, PhoneB} ControllerA->>ControllerB: {incoming, PhoneA} ControllerB->>PhoneB: {incoming, PhoneA} PhoneB-->>ControllerB: accept activate PhoneB ControllerB-->>ControllerA: accept PhoneA->>ControllerA: hangup PhoneA->>ControllerA: {call, PhoneC} ControllerA-->>PhoneA: accept activate PhoneA Note over PhoneA: PhoneA thinks this is from PhoneC ControllerA->>ControllerB: hangup ControllerB->>PhoneB: hangup deactivate PhoneB ControllerA->>ControllerC: {incoming, PhoneA} ControllerC->>PhoneC: {incoming, PhoneA} PhoneC-->>ControllerC: reject ControllerC-->>ControllerA: reject ControllerA-->>PhoneA: reject Note over PhoneA: PhoneA crashes deactivate PhoneA

Similar behavior occurs with incoming busy events (and potentially accept events, although that seems to be less frequent).

Tagging the phone controller messages as suggested in above should fix this.

Preventing race conditions between controllers

Note: here I propose modifying the controllers to manage race conditions between the controllers. Unfortunately, I didn’t quite analyze the problem above correctly and, while this is a useful change for some reasons, it does not fix the problem. I’m leaving it here to keep a record of my thoughts while solving this problem.

The best way to implement tagged messages to prevent race conditions is in the phone controllers themselves. This way the phones will never see messages that result from race conditions like the one illustrated above. When the controller gets an outbound event in the idle state it generates a reference that is sent to the destination controller to be used in its replies. Now whenever a controller receives a message from another controller it checks the reference and if the reference is not equal to the reference for the currently active call the message is discarded. In this way no replies are sent to the controller’s phone.

In the phone controller data we add a call_ref field and assign a new reference to it whenever a call is initiated. The internal controller API functions are also updated to send a reference along with every inter-controller event.

-record(data, {phone_pid = undefined :: pid() | undefined,
               phone_ref = undefined :: reference() | undefined,
               other_phone = none :: pid() | none,
               call_ref = none :: reference() | none}).

busy(Controller, Ref) ->
    reply(Controller, Ref, busy).

accept(Controller, Ref) ->
    reply(Controller, Ref, accepted).

reject(Controller, Ref) ->
    reply(Controller, Ref, rejected).

hangup(Controller, Ref) ->
    reply(Controller, Ref, hangup).

inbound(Controller, Ref) ->
    reply(Controller, Ref, {inbound, self()}).

reply(Pid, Ref, Message) ->
    gen_server:cast(Pid, {Message, Ref}).

The idle callback contains the most consequential changes, assigning a new call_ref to outbound calls and adopting the call_ref from inbound calls.

idle(cast, {{inbound, ControllerPid}, Ref}, Data) ->
    {ok, Caller} = hlr:lookup_ms(ControllerPid),
    phone:reply(Data#data.phone_pid, {inbound, Caller}),
    {next_state, connecting, Data#data{other_phone = ControllerPid,
                                       call_ref = Ref}};
idle(cast, {action, {outbound, PhoneNumber}}, Data) ->
    case hlr:lookup_id(PhoneNumber) of
        {ok, Pid} ->
            %% A new reference is created for the call
            Ref = erlang:make_ref(),
            inbound(Pid, Ref),
            {next_state, calling, Data#data{other_phone = Pid,
                                            call_ref = Ref}};
        {error, invalid} ->
            phone:reply(Data#data.phone_pid, invalid),
            keep_state_and_data
    end;
%% ... the remainder is unchanged.

The remainder of the state callbacks are modified slightly to add a clause that drops events when the ref associated with the event is not equal to the call_ref.

So far so good with the testing… I’ll let the 200,000 simulated callers run for a while longer just to see if there are any unexpected errors that show up over night.

Not the right fix

Adding call references in the phone controllers alleviates some of the issues with race conditions, but it still misses an important race of the same type as shown above. Specifically races between controllers are fixed, but there is still a race between a phone/phone simulator and its controller. In fact, the scenario shown in the sequence diagram above has not actually been resolved at all. Below is the sequence diagram from above with annotations added to show the states of PhoneA and ControllerA along with the new call references.

sequenceDiagram autonumber PhoneA->>ControllerA: {call, PhoneB} Note left of PhoneA: in 'calling' state ControllerA->>ControllerB: {incoming, PhoneA, Ref1} activate ControllerA Note over ControllerA: Ref1 is active Note right of ControllerA: in 'calling' state ControllerB->>PhoneB: {incoming, PhoneA} activate ControllerB Note over ControllerB: Ref1 is active PhoneB-->>ControllerB: accept ControllerB-->>ControllerA: {accept, Ref1} Note right of ControllerA: in 'connected' state PhoneA->>ControllerA: hangup Note left of PhoneA: in 'idle' state PhoneA->>ControllerA: {call, PhoneC} Note left of PhoneA: in 'calling' state ControllerA-->>PhoneA: accept Note left of PhoneA: in 'in_call' state ControllerA->>ControllerB: {hangup, Ref1} deactivate ControllerA Note right of ControllerA: in 'idle' state ControllerB->>PhoneB: hangup deactivate ControllerB ControllerA->>ControllerC: {incoming, PhoneA, Ref2} activate ControllerA Note over ControllerA: Ref2 is active Note right of ControllerA: in 'calling' state ControllerC->>PhoneC: {incoming, PhoneA} activate ControllerC note over ControllerC: Ref2 is active PhoneC-->>ControllerC: reject ControllerC-->>ControllerA: {reject, Ref2} deactivate ControllerC ControllerA-->>PhoneA: reject deactivate ControllerA

It is pretty clear that the solution above doesn’t work if we think about the actual state of ControllerA and PhoneA. At 5 the controller is in the calling state waiting for a reply from ControllerB. ControllerB sends an accept event, which is queued in ControllerA’s inbox. Meanwhile, PhoneA has sent a hangup event and a call event to its controller. At 8 ControllerA begins processing its pending events. First, it handles the accept message which includes Ref1, the reference generated by ControllerA at 2. The reference matches the active call reference, so the controller enters the connected state and replies to PhoneA that the call has been accepted. Because the hangup message has not been processed yet the call reference is still valid; filtering messages between controllers cannot prevent the accept event from being sent to PhoneA. PhoneA, however, gets no feedback from its controller to indicate whether it’s hangup request has been handled, so at 6 it assumes its previous call has ended and transitions to the idle state. In idle it immediately initiates a new call (7) and transitions back to the calling state.

We could adapt the call reference trick to generate references on the phone when it initiates the call. There is something I don’t like about this approach though, I’m not sure what. Another option would be to add a confirmation reply to the hangup action. Along side this the phone simulator would transition to a call_ending state until the confirmation is received. This may even be possible without adding any new messages to the phone API or states to the simulator. Rather than transitioning out of the calling state when a givup event is received the simulator can simply generate a hangup action and remain in the same state until it receives a hangup message from its phone. Similarly the phone_controller state machine will keep its current state after sending a hangup event to its other phone. The other phone then generates another hangup event to send back to the original controller. The new controller state machine looks like this:

stateDiagram direction LR disconnected idle calling call_failed connecting connected [*] --> disconnected disconnected --> idle : connect idle --> calling : outbound calling --> idle : rejected calling --> call_failed : busy calling --> idle : hangup calling --> calling : {action, hangup} call_failed --> idle : {action, hangup} idle --> connecting : inbound connecting --> idle : reject connected --> idle : hangup connecting --> connected : accept calling --> connected : accepted connected --> connected : {action, hangup}

Note that in the call_failed state we don’t wait for confirmation from the other phone since there is no other phone. In the state machine we will immediatly generate a hangup response to the phone when a hangup action event occurs. The simulator state machine is also slightly modified to require a hangup message from the phone before returning to the idle state.

stateDiagram direction LR idle calling receiving in_call [*] --> idle idle --> calling : initiate_call calling --> idle : {phone, rejected}, {phone, hangup} calling --> calling : {phone, busy}, giveup calling --> in_call : {phone, answered} idle --> receiving : {phone, inbound} receiving --> idle : reject receiving --> in_call : answer receiving --> idle : {phone, hangup} in_call --> in_call : hangup in_call --> idle : {phone, hangup}

Now the sequence diagram for the situation above should looks like this:

sequenceDiagram PhoneA->>ControllerA: {call, PhoneB} Note left of PhoneA: in 'calling' state ControllerA->>ControllerB: {incoming, PhoneA, Ref1} activate ControllerA Note over ControllerA: Ref1 is active Note right of ControllerA: in 'calling' state ControllerB->>PhoneB: {incoming, PhoneA} activate ControllerB Note over ControllerB: Ref1 is active PhoneB-->>ControllerB: accept ControllerB-->>ControllerA: {accept, Ref1} Note right of ControllerA: in 'connected' state PhoneA->>ControllerA: hangup Note left of PhoneA: in 'calling' state ControllerA-->>PhoneA: accept Note left of PhoneA: in 'in_call' state ControllerA->>ControllerB: {hangup, Ref1} ControllerB->>PhoneB: hangup ControllerB-->>ControllerA: {hangup, Ref1} deactivate ControllerB ControllerA-->>PhoneA: hangup deactivate ControllerA Note right of ControllerA: in 'idle' state Note left of PhoneA: in 'idle' state PhoneA->>ControllerA: {call, PhoneC} Note left of PhoneA: in 'calling' state ControllerA->>ControllerC: {incoming, PhoneA, Ref2} activate ControllerA Note over ControllerA: Ref2 is active Note right of ControllerA: in 'calling' state ControllerC->>PhoneC: {incoming, PhoneA} activate ControllerC note over ControllerC: Ref2 is active PhoneC-->>ControllerC: reject ControllerC-->>ControllerA: {reject, Ref2} deactivate ControllerC ControllerA-->>PhoneA: reject deactivate ControllerA

Implementation

In the phone_controller we need to update the calling and connected states (the two states that respond to {action, hangup} events). I also modify the hangup_call/1 function so it does not clean up the call data, it just sends a hangup event to the other controller. Call cleanup is now done independently when the hangup event is received.

connected(cast, {action, hangup}, Data) ->
    %% Send a hangup event to the other phone, but stay in this state
    hangup_call(Data),
    keep_state_and_data;
connected(cast, {hangup, Ref}, #data{call_ref = CallRef} = Data)
  when Ref =:= CallRef ->
    phone:reply(Data#data.phone_pid, hangup),
    %% When a hangup event is received send a hangup event back
    %% to the other phone.
    hangup_call(Data),
    {next_state, idle, cleanup_call(Data)};

This isn’t the cleanest implementation—a duplicate hangup message ends up being sent from the initiating controller after it receives its “acknowledgment” hangup event. A better solution would probably be to be a bit more verbose by introducing an new call_ended message to use for acknowledgments; however, because of the call references implemented previously the duplicate message is just dropped.

In the calling state only one clause changes. Rather than ending the call, we now just generate a hangup event and wait for the other phone to acknowledge with its own hangup event.

calling(cast, {action, hangup}, Data) ->
    hangup_call(Data),
    keep_state_and_data;

Trace showing the “two-step” hangup

First an incoming call is accepted.

*DBG* <0.16740.0> receive cast {{inbound,<0.10932.2>},#Ref<0.2423193073.3712745473.161621>} in state idle
*DBG* <0.16740.0> consume cast {{inbound,<0.10932.2>},#Ref<0.2423193073.3712745473.161621>} in state idle => connecting
*DBG* <0.16740.0> receive cast {action,accept} in state connecting
*DBG* <0.16740.0> consume cast {action,accept} in state connecting => connected

Then the phone requests to hang up. The controller remains in the connected state following the action request. Only when a hangup event is received from the other phone does the controller return to the idle state.

*DBG* <0.16740.0> receive cast {action,hangup} in state connected
*DBG* <0.16740.0> consume cast {action,hangup} in state connected
*DBG* <0.16740.0> receive cast {hangup,#Ref<0.2423193073.3712745473.161621>} in state connected
*DBG* <0.16740.0> consume cast {hangup,#Ref<0.2423193073.3712745473.161621>} in state connected => idle

Conclusion

I could have saved myself a lot of hassle if I had taken a bit more time to analyze the race conditions when initially designing the phone_controller state machine and considered the interaction between the phones and the phone_controllers. The race condition that caused all the trouble is kind of subtle, and was not obvious until I took the time to really understand the interaction between the callers and the controllers (not just the phones and the controllers). To understand it I had to remember how Erlang processes messages—in the order they are received—and walk through the state transition induced by each message individually. In my initial analysis I got caught in a trap of thinking about the interactions in a more atomic sense than they actually happen.

So, what are the lessons learned here? First, make an effort to understand the interactions of different components at a higher level than just the API. For example, considering the caller rather than just the phone module could have helped me understand the need for acknowledging hangups when I first implemented the phone_controller. Second, remember to walk through what is actually happening in the code at the level of individual operations rather than falling into the trap of mentally combining a sequence of operations into one. Finally, for analyzing interacting state machines track the individual state changes of both machines as each event is processed. Remember the order in which Erlang/OTP processes messages and keep track of what state each machine is in when it processes each message. Taking the time to do this detailed analysis could have allowed me to fix the bug above in one attempt, rather than two.