[Quickfixn] Socket Deadlock Issue

Christian Jungers christian.jungers at cm3.com
Thu Jun 7 15:30:03 PDT 2012


All,
I'm new to the mailing list / public side of QuickFIX/n, so I wanted to
raise this here before I go ahead and create a formal GitHub issue for it.
I believe I have identified a bug that really stems from a design problem
in how the socket communication is managed. Certainly, if the behavior I'm
about to describe can be avoided or worked around somehow, please just let
me know. But even if there are ways around this problem, I believe a fairly
significant (though easy to implement) change is needed in the socket
management.

First, let me describe the context in which the problem behavior has been
seen. We are developing a new application within an existing system. This
application will essentially function as a trade client, generating FIX
messages to be sent through a FIX network for execution. We are in the
early stages of development and are simply running the application against
the Executor sample application included with QuickFIX/n. Before getting
too far into the actual application development, we made sure that the
application and the Executor were configured properly to connect and
communicate, and we were able to demonstrate healthy back-and-forth message
flow, with messages sent to and responses received from the Executor as
expected.

Then, while building out the rest of the application and debugging the
code, we had numerous occasions where FIX messages were not completely
processed. Typically this was just because we stepped into the code and
then closed the application to make changes once something was identified.
In any case, both the client and the Executor built up stores of messages
that had never been received by the other side. The sessions are not
configured to reset sequence numbers, so when they connect and the session
is logged on, both sides have fairly high sequence numbers and a large
backlog of messages the other side never received. As a result, the
sequence number each side expects from the other is much lower than the
sequence number the other side is actually using.

This is all perfectly fine, since this is expected and designed behavior of
the FIX protocol: "The resend request is sent by the receiving application
to initiate the retransmission of messages. This function is utilized if a
sequence number gap is detected, if the receiving application lost a
message, or as a function of the initialization process." (
http://fixwiki.fixprotocol.org/fixwiki/ResendRequest)
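
For reference, the ResendRequest itself is tiny; it carries just the
boundaries of the gap. Roughly, with tag meanings in parentheses and '|'
standing in for the SOH field delimiter:

    35=2 (MsgType = ResendRequest) | 7=4 (BeginSeqNo) | 16=0 (EndSeqNo; 0 means "everything from BeginSeqNo onward")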

So, as soon as the two sides exchange their logon request and response,
they should (and do) issue resend requests for the missing messages. This
is where we get into trouble. The way QuickFIX/n is implemented, in both
initiator and acceptor modes, data is received on a single thread, which
then processes the stream, extracts one or more FIX messages, examines each
message, determines that it is a ResendRequest, and then loads and re-sends
all of the requested messages on that same thread. You can see this flow in
the stack trace pictured here (taken from the initiator side, although it
is mirrored on the acceptor side as well):

[image: stack trace of the inline resend path; original screenshot at
http://lists.quickfixn.com/pipermail/quickfixn-quickfixn.com/attachments/20120607/3a49bcf1/attachment-0001.jpeg]
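
To make the shape of that flow concrete, here is a rough sketch of the
pattern in C#. This is my own illustration, not QuickFIX/n's actual code;
the helper delegates (splitMessages, isResendRequest, loadRequestedRange,
dispatch) are placeholders for the framing, parsing, and message-store
lookups the engine really does. The point is that the one thread that owns
the only Receive() call is also the thread that loops over the store and
sends:

    using System;
    using System.Collections.Generic;
    using System.Net.Sockets;

    static class ReaderSketch   // illustrative only, not the library's internals
    {
        public static void ReceiveLoop(
            Socket socket,
            Func<byte[], int, IEnumerable<string>> splitMessages,  // framing / parsing (placeholder)
            Func<string, bool> isResendRequest,                    // checks for 35=2 (placeholder)
            Func<string, IEnumerable<byte[]>> loadRequestedRange,  // reads the message store (placeholder)
            Action<string> dispatch)                               // hands other messages to the app
        {
            var buffer = new byte[8192];
            while (true)
            {
                int n = socket.Receive(buffer);          // the ONLY Receive() on this connection
                foreach (string msg in splitMessages(buffer, n))
                {
                    if (isResendRequest(msg))
                    {
                        // The re-send happens inline, on this same thread. While this
                        // inner loop runs, nothing is draining the other side's traffic.
                        foreach (byte[] old in loadRequestedRange(msg))
                            socket.Send(old);            // will block once the TCP buffers are full
                    }
                    else
                    {
                        dispatch(msg);
                    }
                }
            }
        }
    }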


This too is not necessarily a problem. It can, and does, spin through a
few messages, even up to about 100 messages, without a problem. Let's say
each side had missed 100 messages from the other. They would both receive
the resend requests and re-send all 100 messages, then go about receiving
and processing the 100 messages they received from the other side. No harm,
no foul.

However, there is a subtle deadlock vulnerability here. The single socket
for the connection is being used, as you can infer from the above stack
trace, within a loop inside NextResendRequest that iterates over all of the
requested messages and actually sends them over the wire. While it is doing
this, the thread never checks for, or receives, anything coming the other
way on that socket. And it doesn't have to, right? Because it's just
sending a lot of messages. Aha, but what if the other side is ALSO not
listening on its socket, because it is busy ALSO sending a lot of messages?
Since neither side is doing a receive, the data buffers up until the socket
buffers are full. At that point, without somebody calling receive to drain
the buffer, any subsequent send will BLOCK waiting for buffer space to free
up.
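
You can reproduce the underlying mechanism with nothing but two raw
sockets. This is a standalone toy (no QuickFIX/n involved) where each 1 KB
chunk stands in for one re-sent message; both sides print how much they
have sent and then go silent, stuck inside Send():

    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Threading;

    class SendSendDeadlockDemo
    {
        static void Main()
        {
            // Build a pair of connected TCP sockets on localhost.
            var listener = new TcpListener(IPAddress.Loopback, 0);
            listener.Start();
            int port = ((IPEndPoint)listener.LocalEndpoint).Port;

            var clientSide = new TcpClient();
            clientSide.Connect(IPAddress.Loopback, port);
            var serverSide = listener.AcceptTcpClient();

            byte[] chunk = new byte[1024];  // stands in for one re-sent FIX message

            // Both sides send in a loop and never call Receive -- exactly the
            // situation when both engines are answering a ResendRequest at once.
            var a = new Thread(() => SendForever("A", clientSide.Client, chunk));
            var b = new Thread(() => SendForever("B", serverSide.Client, chunk));
            a.Start();
            b.Start();
            a.Join();   // never returns: both threads end up blocked inside Send()
        }

        static void SendForever(string name, Socket s, byte[] chunk)
        {
            long sent = 0;
            while (true)
            {
                s.Send(chunk);              // blocks once both sides' TCP buffers are full
                sent += chunk.Length;
                Console.WriteLine(name + " has sent " + sent + " bytes");
            }
        }
    }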

In other words, all of this activity is occurring on the one thread that
was created to listen to the socket and initiate all subsequent actions. If
that thread is not actively and regularly calling receive on the socket,
then the other side can wind up with too much data to send and get blocked
inside its socket send call. And that happens on both sides here, since
both sides are using the QuickFIX/n library.

It turns out that I'm seeing about 24 KB worth of messages (about 110
messages in my case) buffer up on each side before both sides get stuck in
the state pictured above, blocked internally on the socket send.
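
That figure presumably reflects roughly how much room the send buffer on
one side plus the receive buffer on the other side give you before a
blocking send has nowhere to put data (the exact numbers depend on the OS
and socket configuration, so take this as a rough explanation). You can
inspect or enlarge the buffers on any System.Net.Sockets.Socket, though
enlarging them only raises the threshold; it does not remove the deadlock:

    // 'socket' here is any connected System.Net.Sockets.Socket.
    Console.WriteLine("SendBufferSize:    " + socket.SendBufferSize);
    Console.WriteLine("ReceiveBufferSize: " + socket.ReceiveBufferSize);

    // Bigger buffers just postpone the point at which Send() blocks.
    socket.SendBufferSize = 64 * 1024;
    socket.ReceiveBufferSize = 64 * 1024;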

So, even though it may be exceptionally rare in production for both sides
of a FIX connection to have "missed" more than 100 messages, with both
sides relying on the QuickFIX/n library (or a similar implementation) such
that they wind up in this socket deadlock, it has happened to me. And
rather than just wipe out my message stores and continue on my merry way, I
wanted to make sure this wasn't a more serious problem (like one in our own
code causing a deadlock at some higher level).

It appears to me that the resend request is probably the ONLY incoming
message that can trigger an internal FIX engine response involving a large
amount of outgoing data; all of the other internal / admin messages are
quite brief and singular. So rather than go the whole nine yards and create
separate threads for receiving off the socket and processing the received
messages, perhaps a good solution would simply be to spin up a worker
thread from the thread pool just to handle incoming resend requests (a
sketch follows below). That would leave the vastly more common case, where
there is no potentially large batch of messages to send, running fast on a
single thread, with no new thread-safety concerns around the message parser
/ processor.
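
A minimal sketch of that idea, continuing the illustrative names from the
earlier snippet (this is not QuickFIX/n's actual API, and a real change
would also need to serialize writes so the worker and the reader never
interleave bytes on the same socket):

    // needs System.Threading for ThreadPool
    if (isResendRequest(msg))
    {
        string request = msg;   // capture for the worker
        ThreadPool.QueueUserWorkItem(_ =>
        {
            // The bulk re-send now runs off the reader thread, so the reader
            // keeps draining the socket and the peer's sends never back up.
            foreach (byte[] old in loadRequestedRange(request))
            {
                lock (sendLock)          // assumed per-session lock guarding socket writes
                    socket.Send(old);
            }
        });
    }
    else
    {
        dispatch(msg);
    }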

However, in the long run, we may need or want distinct receive and process
threads. As it is, the same thread blocks while calling into the actual FIX
application for message processing. Should that processing take a long
time, or involve sending a large number of messages for whatever reason, we
could wind up in a similar problem. Say the FIX application took two
minutes to process some particular message (in a blocking, synchronous
way), and during those two minutes the other side of the FIX connection
sent enough messages to fill up its socket buffer and block. By ensuring
that message processing happens on a separate thread from the socket
reading, we would have a far better guarantee that our socket never winds
up inadvertently blocking the other side. A rough sketch of that shape is
below.
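
Again, this is just an illustration of the general shape, not a patch: a
BlockingCollection (or any thread-safe queue) is one way to do the
hand-off, and splitMessages / handleMessage stand in for the real parsing
and session logic. The reader thread only reads and enqueues; a separate
thread drains the queue and does all the processing and sending:

    // needs System.Collections.Concurrent, System.Net.Sockets, System.Threading
    var inbound = new BlockingCollection<string>();

    // Reader thread: its only job is to keep the socket drained.
    var reader = new Thread(() =>
    {
        var buffer = new byte[8192];
        while (true)
        {
            int n = socket.Receive(buffer);
            foreach (string msg in splitMessages(buffer, n))
                inbound.Add(msg);
        }
    });

    // Processing thread: may block in application code or send large batches
    // without ever stopping the reader from receiving.
    var processor = new Thread(() =>
    {
        foreach (string msg in inbound.GetConsumingEnumerable())
            handleMessage(msg);    // parse, dispatch, answer resend requests, etc.
    });

    reader.Start();
    processor.Start();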

Thoughts?

                       - Christian Jungers

Christian.Jungers at CM3.com - Chief Technology Officer - Tel 877.263.1669 x705 - Fax 877.263.1669

