This builds on the previous post where I implemented a phone controller using gen_statem
Making calls manually isn’t enough. We need to simulate hundreds of
thousands of phones connected to the switch and constantly calling
each other. The obvious thing to do is to model each person interacting
with the phones as a separate process. Like the phone controller, I
will use a gen_statem
for this. In fact, the state machine that
represents callers is very similar to the state machine for the phone
controllers. There are two “intermediate” states that represent either
waiting for a call to be answered (calling
) or waiting to decide
whether to answer or reject an incoming call (receiving
).
Each of these state machines also need to randomly generate events for
itself to initiate calls, accept or reject incoming calls, and end
ongoing calls. To do this the state machines use the timer
module to
schedule their next action before changing states. I considered using
state_enter
callbacks to set up these timers, but chose to configure
the timers prior to the state transition to keep things within the
realm of my current gen_statem
understanding. It would be an
interesting exercise to re-write the simulator state machine using
state_enter
callbacks and see if the resulting implementation is
simpler. Ultimately I’m a little frustrated with the complexity of the
implementation here; I think there is room for improvement.
One source of complexity that is difficult to avoid, and wouldn’t be
resolved with state_enter
callbacks, is the inherent race conditions
that arise from the use timers to schedule events. Because the events
depend on the state of the phone (e.g. can only reject a call in the
receiving
state) certain events need to be canceled when the other
party changes the state of the call. Timers may fire before they can
be canceled, meaning we have to handle and ignore some events related
to previous calls in other states. On the other side of this, some
events from another phone race with our timers meaning, for example, a
call one phone initiated could be accepted after the phone has given
up waiting for an answer, but while the hangup
message is still in
flight.
Tweaking the phone
module
Before writing the simulator the phone
module needs to be
improved. It was initially written for use by a human, not a program,
so all it does is print to the terminal whenever it gets a reply from
its controller. To make it useful for the simulator we will have it
send messages to whatever process started it when it gets replies from
its controller. A controlling_pid
field is added to its state, and
the pid of the process that started it is passed to the init/1
callback.
-spec start_link(PhoneNumber :: phone_number(), SwitchNode :: node()) -> {ok, Pid :: pid()}.
start_link(PhoneNumber, SwitchNode) ->
gen_server:start_link(?MODULE, [PhoneNumber, SwitchNode, self()], []).
init([PhoneNumber, SwitchNode, ControllingPid]) ->
{ok, Controller} = rpc:call(SwitchNode, hlr, lookup_id, [PhoneNumber]),
phone_controller:connect(Controller),
{ok, #state{phone_number = PhoneNumber,
controller = Controller,
controlling_pid = ControllingPid}}.
In handle_cast
we now just send messages of the form {phone, PhonePid, Reply}
to the controlling pid whenever the phone receives a
message from its controller. Handling actions remains the same as
before.
handle_cast({reply, Reply}, State) ->
State#state.controlling_pid ! {phone, self(), Reply},
{noreply, State};
%% ... handle actions
The Phone Simulator state machine
This should really be called something like “caller” or “customer”
simulator, but I went with phone_simulator
and I don’t really feel
like changing it.
The phone_simulator
state machine has some basic data, mostly
consisting of the timer references for upcoming events.
-record(data, {next_call :: timer:tref(), %% triggers next outgoing call
hangup :: timer:tref() | undefined, %% triggers hangup of ongoing call
answer :: timer:tref() | undefined, %% triggers answering an incoming call
giveup :: timer:tref() | undefined, %% triggers hangup while awaiting an answer
destination = random_phone() :: string(), %% phone number of outgoing call
phone :: pid()}).
When it is started it starts a phone
process for itself, which
connects to the corresponding controller on the switch. It also
immediately schedules its next outgoing call and enters into the idle
state.
start_link(PhoneNumber) ->
start_link(PhoneNumber, []).
start_link(PhoneNumber, Options) ->
gen_statem:start_link(?MODULE, [PhoneNumber|Options], []).
init([PhoneNumber|Options]) ->
seed(Options),
{ok, Phone} = phone:start_link(PhoneNumber, 'switch@computer'),
{ok, NextCall} = next_call(),
{ok, idle, #data{phone = Phone, next_call = NextCall}}.
The idle state is pretty straightforward. First we include a large number of clauses to handle races between phones and timers.
idle(info, {phone, _, _}, _Data) ->
%% Any other messages coming from the phone are ignored: e.g. a
%% hangup after a reject (race between this phone phone and
%% the calling phone).
keep_state_and_data;
%% The following three clauses are needed to handle the race between the
%% canceling the timers and external events that end the call.
idle(cast, answer, _Data) -> keep_state_and_data;
idle(cast, hangup, _Data) -> keep_state_and_data;
idle(cast, giveup, _Data) -> keep_state_and_data;
%% ...
The last two clauses are the only two that actually do something
interesting. Incoming calls are handled by randomly deciding whether
to answer or reject the call. The answer/reject event is scheduled and
the simulator transitions to the receiving
state. If the next event
is an initiate_call
event, then the simulator transitions to the
calling
state. Before transitioning, however, it schedules two
future events. First, it schedules an event to give up on the call if
no answer is received. Second, it schedules its next outgoing
call. This may seem a little strange, but I wanted to make sure that
there was always another call scheduled so that when transitioning
back to idle
it is possible that the simulator does not wait at all
before initiating a new call. This could be achieved in different
ways, but this gives me the opportunity to experiment with postponing
events in a gen_statem
(as we will see next).
%% ...
idle(info, {phone, _PhonePid, {inbound, _PhoneNumber}}, Data) ->
%% decide whether or not to answer the call
case answer() of
true ->
{next_state, receiving, schedule_answer(Data)};
false ->
{next_state, receiving, schedule_reject(Data)}
end;
idle(cast, initiate_call, Data) ->
PhoneNumber = random_phone(),
phone:action(Data#data.phone, {call, PhoneNumber}),
{next_state, calling,
schedule_next_call(
schedule_giveup(Data#data{ destination = PhoneNumber }))}.
The receiving state is fairly simple, but if the simulator receives an
initiate_call
event it immediately rejects the incoming call and
returns to the idle state, allowing a new call to be initiated.
receiving(cast, answer, Data) ->
phone:action(Data#data.phone, accept),
{next_state, in_call, schedule_hangup(Data)};
receiving(cast, hangup, Data) ->
phone:action(Data#data.phone, reject),
{next_state, idle, Data};
receiving(cast, initiate_call, Data) ->
phone:action(Data#data.phone, reject),
timer:cancel(Data#data.hangup),
timer:cancel(Data#data.answer),
{next_state, idle, Data, postpone};
receiving(info, {phone, _, hangup}, Data) ->
timer:cancel(Data#data.answer),
timer:cancel(Data#data.hangup), % Need to cancel a pending rejection
{next_state, idle, Data}.
For outgoing calls, the calling
state handles a larger variety of
messages coming from the phone. It also handles a giveup
event that
means it will stop waiting for an answer (i.e. hang up) and transition
back to idle. If the other phone is busy then we remain in the calling
state for one second before giving up. If an initiate_call
event
arrives it is postponed to be handled when the current call is
complete.
calling(info, {phone, _, {inbound, _}}, Data) ->
phone:action(Data#data.phone, reject),
%% Retry the call
phone:action(Data#data.phone, {call, Data#data.destination}),
keep_state_and_data;
calling(info, {phone, _, reject}, Data) ->
timer:cancel(Data#data.giveup),
{next_state, idle, Data};
calling(info, {phone, _, accept}, Data) ->
timer:cancel(Data#data.giveup),
{next_state, in_call, schedule_hangup(Data)};
calling(info, {phone, _, busy}, Data) ->
timer:cancel(Data#data.giveup),
%% wait one second before giving up.
timer:apply_after(1000, gen_statem, cast, [self(), giveup]),
keep_state_and_data;
calling(info, {phone, _, invalid}, Data) ->
timer:cancel(Data#data.giveup),
{next_state, idle, Data};
calling(info, {phone, _, hangup}, _Data) ->
%% This is the result of some previous call. Ignore it.
keep_state_and_data;
calling(cast, giveup, Data) ->
phone:action(Data#data.phone, hangup),
{next_state, idle, Data};
%% since we are racing lots of timers, there may be leftover hangups and answers
calling(cast, hangup, _Data) -> keep_state_and_data;
calling(cast, answer, _Data) -> keep_state_and_data;
calling(cast, initiate_call, _Data) ->
{keep_state_and_data, postpone}.
Finally, the in_call
state is pretty simple. As in calling
, we
postpone a new initiate_call
event to be processed once the current
call is finished.
in_call(info, {phone, _, hangup}, Data) ->
timer:cancel(Data#data.hangup),
{next_state, idle, Data#data{hangup = undefined}};
in_call(info, {phone, _, _}, _) ->
%% Any other phone messages are delayed from previous calls -
%% ignore them.
keep_state_and_data;
in_call(cast, hangup, Data) ->
phone:action(Data#data.phone, hangup),
{next_state, idle, Data#data{hangup = undefined}};
in_call(cast, giveup, _Data) ->
keep_state_and_data;
in_call(cast, initiate_call, _) ->
{keep_state_and_data, postpone}.
There are a few internal functions used to schedule events and generate random phone numbers.
random_phone() ->
integer_to_list(rand:uniform(100000) + 100000).
answer() ->
rand:uniform() < 0.5.
schedule_answer(Data) ->
{ok, Answer} = timer:apply_after(rand:uniform(6001) - 1,
gen_statem, cast, [self(), answer]),
Data#data{answer = Answer}.
schedule_reject(Data) ->
{ok, Hangup} = timer:apply_after(rand:uniform(6001) - 1,
gen_statem, cast, [self(), hangup]),
Data#data{
hangup = Hangup}.
schedule_hangup(Data) ->
{ok, Hangup} = timer:apply_after(rand:uniform(30001) - 1,
gen_statem, cast, [self(), hangup]),
Data#data{
hangup = Hangup}.
schedule_next_call(Data) ->
%% wait as long as an hour before initiating a call.
{ok, NextEvent} = next_call(),
Data#data{next_call = NextEvent}.
schedule_giveup(Data) ->
{ok, NextEvent} = timer:apply_after(rand:uniform(26000) + 3000,
gen_statem, cast, [self(), giveup]),
Data#data{giveup = NextEvent}.
next_call() ->
timer:apply_after(rand:uniform(36000),
gen_statem, cast, [self(), initiate_call]).
Complexity
A lot of complexity arises because of the race condition inherent
between different phones. Some of this is unavoidable, but I think
some of it could be handled in the phone
module to hide it from the
caller. In particular the phone could ignore events that arrive from
calls it has already ended. I’ll leave this as something to mess with
in the future though, because I’m kind of bored with phones. My basic
idea to handle this is to create a reference in the phone when it
initiates a call and make the controllers include the call reference
in all their replies to the phone. This would not only simplify some
of the simulator, but it would make me feel much more confident that
this is correctly implemented. (I’m only at like 80%.)
Testing
I’ll test this on 5 different Erlang nodes: one switch
and four
nodes for the simulators. There will be a phone created for every
number in the range “100000” to “200000”.
On the switch:
(switch@computer)> hlr:new().
ok
(switch@computer)34> lists:foreach(
fun phone_controller:start_link/1,
[integer_to_list(X) || X <- lists:seq(100000, 200000)]).
ok
And on each caller node I started 25000 phone_simulator
processes:
(callers2@computer)1> lists:foreach(
fun phone_simulator:start_link/1,
[integer_to_list(X) || X <- lists:seq(125001, 150000)]).
ok
And nothing happened (after I fixed the bugs). I can do some traces and see that lots of calls are going through, but basically it will run forever now.
Or will it?
=ERROR REPORT==== 30-Dec-2022::16:19:55.695034 ===
** State machine <0.14899.2142> terminating
** Last event = {info,{phone,<0.14900.2142>,reject}}
** When server state = {in_call,
{data,
{once,#Ref<0.1812204834.3329490945.96999>},
{once,#Ref<0.1812204834.3329490945.97002>},
{once,#Ref<0.1812204834.3325034498.62005>},
{once,#Ref<0.1812204834.3329490945.96997>},
"177721",<0.14900.2142>}}
** Reason for termination = error:function_clause
** Callback modules = [phone_simulator]
** Callback mode = state_functions
** Stacktrace =
** [{phone_simulator,in_call,
[info,
{phone,<0.14900.2142>,reject},
{data,{once,#Ref<0.1812204834.3329490945.96999>},
{once,#Ref<0.1812204834.3329490945.97002>},
{once,#Ref<0.1812204834.3325034498.62005>},
{once,#Ref<0.1812204834.3329490945.96997>},
"177721",<0.14900.2142>}],
[{file,"phone_simulator.erl"},{line,109}]},
{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1419}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
I handled a lot of the previous errors like this by adding clauses to
the state callbacks to ignore the offending event because it was the
result of an expected race condition. This error is clearly the result
of a race condition, but it isn’t so clear that it is expected. In
fact, it looks like it is actually caused by entering into an invalid
state because we handled a previous race condition incorrectly. I
think this error comes from accidentally responding to a {phone, _, accept}
message from a previous call that arrives after giving up on
that call and initiating a new call. Then the simulator ends up in the
in_call
state even though its current call has not been
accepted. When the current call is rejected by the receiving phone the
simulator crashes because it thinks its call was already
accepted. Here’s what I think is going on as a diagram.
Similar behavior occurs with incoming busy
events (and potentially
accept
events, although that seems to be less frequent).
Tagging the phone controller messages as suggested in above should fix this.
Preventing race conditions between controllers
Note: here I propose modifying the controllers to manage race conditions between the controllers. Unfortunately, I didn’t quite analyze the problem above correctly and, while this is a useful change for some reasons, it does not fix the problem. I’m leaving it here to keep a record of my thoughts while solving this problem.
The best way to implement tagged messages to prevent race conditions
is in the phone controllers themselves. This way the phones will never
see messages that result from race conditions like the one illustrated
above. When the controller gets an outbound
event in the idle
state it generates a reference that is sent to the destination
controller to be used in its replies. Now whenever a controller
receives a message from another controller it checks the reference and
if the reference is not equal to the reference for the currently
active call the message is discarded. In this way no replies are sent
to the controller’s phone.
In the phone controller data we add a call_ref
field and assign a
new reference to it whenever a call is initiated. The internal
controller API functions are also updated to send a reference along
with every inter-controller event.
-record(data, {phone_pid = undefined :: pid() | undefined,
phone_ref = undefined :: reference() | undefined,
other_phone = none :: pid() | none,
call_ref = none :: reference() | none}).
busy(Controller, Ref) ->
reply(Controller, Ref, busy).
accept(Controller, Ref) ->
reply(Controller, Ref, accepted).
reject(Controller, Ref) ->
reply(Controller, Ref, rejected).
hangup(Controller, Ref) ->
reply(Controller, Ref, hangup).
inbound(Controller, Ref) ->
reply(Controller, Ref, {inbound, self()}).
reply(Pid, Ref, Message) ->
gen_server:cast(Pid, {Message, Ref}).
The idle
callback contains the most consequential changes, assigning
a new call_ref
to outbound calls and adopting the call_ref
from
inbound calls.
idle(cast, {{inbound, ControllerPid}, Ref}, Data) ->
{ok, Caller} = hlr:lookup_ms(ControllerPid),
phone:reply(Data#data.phone_pid, {inbound, Caller}),
{next_state, connecting, Data#data{other_phone = ControllerPid,
call_ref = Ref}};
idle(cast, {action, {outbound, PhoneNumber}}, Data) ->
case hlr:lookup_id(PhoneNumber) of
{ok, Pid} ->
%% A new reference is created for the call
Ref = erlang:make_ref(),
inbound(Pid, Ref),
{next_state, calling, Data#data{other_phone = Pid,
call_ref = Ref}};
{error, invalid} ->
phone:reply(Data#data.phone_pid, invalid),
keep_state_and_data
end;
%% ... the remainder is unchanged.
The remainder of the state callbacks are modified slightly to add a
clause that drops events when the ref associated with the event is not
equal to the call_ref
.
So far so good with the testing… I’ll let the 200,000 simulated callers run for a while longer just to see if there are any unexpected errors that show up over night.
Not the right fix
Adding call references in the phone controllers alleviates some of the issues with race conditions, but it still misses an important race of the same type as shown above. Specifically races between controllers are fixed, but there is still a race between a phone/phone simulator and its controller. In fact, the scenario shown in the sequence diagram above has not actually been resolved at all. Below is the sequence diagram from above with annotations added to show the states of PhoneA and ControllerA along with the new call references.
It is pretty clear that the solution above doesn’t work if we think
about the actual state of ControllerA and PhoneA. At 5 the
controller is in the calling
state waiting for a reply from
ControllerB. ControllerB sends an accept event, which is queued in
ControllerA’s inbox. Meanwhile, PhoneA has sent a hangup
event
and a call
event to its controller. At 8 ControllerA begins
processing its pending events. First, it handles the accept
message
which includes Ref1
, the reference generated by ControllerA at
2. The reference matches the active call reference, so the
controller enters the connected
state and replies to PhoneA that
the call has been accepted. Because the hangup
message has not been
processed yet the call reference is still valid; filtering messages
between controllers cannot prevent the accept event from being sent to
PhoneA. PhoneA, however, gets no feedback from its controller to
indicate whether it’s hangup request has been handled, so at 6 it
assumes its previous call has ended and transitions to the idle
state. In idle
it immediately initiates a new call (7) and
transitions back to the calling
state.
We could adapt the call reference trick to generate references on the
phone when it initiates the call. There is something I don’t like
about this approach though, I’m not sure what. Another option would be
to add a confirmation reply to the hangup
action. Along side this
the phone simulator would transition to a call_ending
state until
the confirmation is received. This may even be possible without adding
any new messages to the phone
API or states to the simulator. Rather
than transitioning out of the calling
state when a givup
event is
received the simulator can simply generate a hangup
action and
remain in the same state until it receives a hangup
message from its
phone. Similarly the phone_controller
state machine will keep its
current state after sending a hangup
event to its other phone. The
other phone then generates another hangup event to send back to the
original controller. The new controller state machine looks like this:
Note that in the call_failed
state we don’t wait for confirmation
from the other phone since there is no other phone. In the state
machine we will immediatly generate a hangup
response to the phone
when a hangup action event occurs. The simulator state machine is also
slightly modified to require a hangup message from the phone before
returning to the idle
state.
Now the sequence diagram for the situation above should looks like this:
Implementation
In the phone_controller
we need to update the calling
and
connected
states (the two states that respond to {action, hangup}
events). I also modify the hangup_call/1
function so it does not
clean up the call data, it just sends a hangup
event to the other
controller. Call cleanup is now done independently when the hangup
event is received.
connected(cast, {action, hangup}, Data) ->
%% Send a hangup event to the other phone, but stay in this state
hangup_call(Data),
keep_state_and_data;
connected(cast, {hangup, Ref}, #data{call_ref = CallRef} = Data)
when Ref =:= CallRef ->
phone:reply(Data#data.phone_pid, hangup),
%% When a hangup event is received send a hangup event back
%% to the other phone.
hangup_call(Data),
{next_state, idle, cleanup_call(Data)};
This isn’t the cleanest implementation—a duplicate hangup message ends
up being sent from the initiating controller after it receives its
“acknowledgment” hangup
event. A better solution would probably be
to be a bit more verbose by introducing an new call_ended
message to
use for acknowledgments; however, because of the call references
implemented previously the duplicate message is just dropped.
In the calling
state only one clause changes. Rather than ending the
call, we now just generate a hangup
event and wait for the other
phone to acknowledge with its own hangup
event.
calling(cast, {action, hangup}, Data) ->
hangup_call(Data),
keep_state_and_data;
Trace showing the “two-step” hangup
First an incoming call is accepted.
*DBG* <0.16740.0> receive cast {{inbound,<0.10932.2>},#Ref<0.2423193073.3712745473.161621>} in state idle
*DBG* <0.16740.0> consume cast {{inbound,<0.10932.2>},#Ref<0.2423193073.3712745473.161621>} in state idle => connecting
*DBG* <0.16740.0> receive cast {action,accept} in state connecting
*DBG* <0.16740.0> consume cast {action,accept} in state connecting => connected
Then the phone requests to hang up. The controller remains in the
connected
state following the action request. Only when a hangup
event is received from the other phone does the controller return to
the idle
state.
*DBG* <0.16740.0> receive cast {action,hangup} in state connected
*DBG* <0.16740.0> consume cast {action,hangup} in state connected
*DBG* <0.16740.0> receive cast {hangup,#Ref<0.2423193073.3712745473.161621>} in state connected
*DBG* <0.16740.0> consume cast {hangup,#Ref<0.2423193073.3712745473.161621>} in state connected => idle
Conclusion
I could have saved myself a lot of hassle if I had taken a bit more
time to analyze the race conditions when initially designing the
phone_controller
state machine and considered the interaction
between the phones
and the phone_controllers
. The race condition
that caused all the trouble is kind of subtle, and was not obvious
until I took the time to really understand the interaction between the
callers and the controllers (not just the phones and the
controllers). To understand it I had to remember how Erlang processes
messages—in the order they are received—and walk through the state
transition induced by each message individually. In my initial
analysis I got caught in a trap of thinking about the interactions in
a more atomic sense than they actually happen.
So, what are the lessons learned here? First, make an effort to
understand the interactions of different components at a higher level
than just the API. For example, considering the caller rather than
just the phone
module could have helped me understand the need for
acknowledging hangups when I first implemented the
phone_controller
. Second, remember to walk through what is actually
happening in the code at the level of individual operations rather
than falling into the trap of mentally combining a sequence of
operations into one. Finally, for analyzing interacting state machines
track the individual state changes of both machines as each event is
processed. Remember the order in which Erlang/OTP processes messages
and keep track of what state each machine is in when it processes each
message. Taking the time to do this detailed analysis could have
allowed me to fix the bug above in one attempt, rather than two.