[Quickfixn] Socket Deadlock Issue

Matt Wood mjwood7 at gmail.com
Fri Jun 8 07:31:18 PDT 2012


Christian,

Hi, I'm another user and a recent contributor to the quickfix/n library. I
have encountered this problem in a related fashion (triggered differently)
and agree with your final paragraph:

"However, in the long run, we may need / want to have distinct receive and
process threads. As it is, the same thread will block while calling into
the actual FIX application for message processing. Should that processing
take a long time or involve, for whatever reason, sending a large number of
messages ... then we could wind up in a similar problem. Say the FIX
application took 2 minutes to process some particular received message (in
a blocking synchronous way) ... and during that 2 minutes, the other side
of the FIX connection had sent enough messages to fill up its socket
buffer and block. By ensuring that message processing is in a separate
thread from the socket reading then we will guarantee (in a far better way)
that our socket should never wind up inadvertently blocking the other side.
 "

Some of my message processing (mainly database updates) causes a pile-up
when an exceptionally large number of messages arrives in a short time
period. Specifically, I see this at the end of the day (around 5pm), when
the system we're connected to propagates a series of IOI messages to mark
market offerings as unavailable (because traders are forcibly logged out of
the system for the day). The pile-up manifests as "TCP ZeroWindow" messages
sent from our side, telling the foreign system to delay sending further
messages. Eventually the delayed TCP segments are sent and received as our
system catches up, but our FIX engine rejects them because the messages
exceed the configured latency threshold (2 minutes!). Of course there is
much I can do to alleviate this particular problem, such as creating some
sort of queue and threading at a higher place in the app (see the sketch
below). However, I do feel that a distinct processing thread makes sense
and would benefit the overall robustness of the application.
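
For what it's worth, here is a minimal sketch of the "queue at a higher
place in the app" idea, assuming the standard QuickFix.IApplication
interface. QueuedApplication and SaveToDatabase are hypothetical names I
made up, not part of QuickFIX/n; a single consumer thread preserves
message arrival order:

    using System.Collections.Concurrent;
    using System.Threading;
    using QuickFix;

    // Hypothetical application: FromApp only enqueues, so the engine's
    // receive thread returns immediately; one background consumer does
    // the slow database work in arrival order.
    public class QueuedApplication : IApplication
    {
        private readonly BlockingCollection<Message> _inbound =
            new BlockingCollection<Message>();

        public QueuedApplication()
        {
            var worker = new Thread(() =>
            {
                foreach (Message msg in _inbound.GetConsumingEnumerable())
                    SaveToDatabase(msg);  // the slow part, off the receive thread
            });
            worker.IsBackground = true;
            worker.Start();
        }

        public void FromApp(Message msg, SessionID sessionId)
        {
            _inbound.Add(msg);  // cheap; the engine thread moves on at once
        }

        private void SaveToDatabase(Message msg)
        {
            // placeholder for the real database update
        }

        // Remaining IApplication members left empty for brevity.
        public void ToAdmin(Message msg, SessionID sessionId) { }
        public void FromAdmin(Message msg, SessionID sessionId) { }
        public void ToApp(Message msg, SessionID sessionId) { }
        public void OnCreate(SessionID sessionId) { }
        public void OnLogon(SessionID sessionId) { }
        public void OnLogout(SessionID sessionId) { }
    }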

-Matt Wood

On Thu, Jun 7, 2012 at 6:30 PM, Christian Jungers <christian.jungers at cm3.com> wrote:

> All,
> I’m new to the mailing list / public side of QuickFIX/n … so I wanted to
> send this around to the mailing list before I go ahead and create a formal
> GitHub issue for it. I believe that I have identified a bug which really stems
> from a design problem in how the socket communication is managed.
> Certainly, if the behavior I’m about to describe can be avoided or worked
> around somehow, please just let me know. But even if there are ways around
> this problem, I believe that a fairly significant (though easy to
> implement) change is needed in the socket management.
>
> First let me describe the context in which the problem behavior has been
> seen. We are developing a new application within an existing system. This
> application will essentially function as a trade client, generating FIX
> messages to be sent through a FIX network to be executed. We are in the
> early stages of development and are simply running the application against
> the Executor sample application included with QuickFIX/n. Before getting
> into too much of the actual application development, we made sure that the
> application and executor were configured properly to connect and
> communicate. We were also able to demonstrate a healthy back-and-forth of
> messages as they would normally be sent to and expected back from the Executor.
>
> Then, while building out the rest of the application and debugging the
> code, we had numerous occasions where FIX messages were not completely
> processed. Typically this was just because we stepped into the code and
> then simply closed the application to make code changes once something was
> identified. In any case, both the client and the executor built up stores
> of messages which had never been received by the other. In this case, the
> sequence numbers are never configured to reset. This means that when they
> connect and the session is logged on, they both have fairly high sequence
> numbers ... and they both have a large number of messages that weren't
> received ... meaning that the sequence numbers that they each expect of the
> other are much lower than the actual sequence numbers in use by each side.
>
> This is all perfectly fine, since this is expected and designed behavior
> of the FIX protocol: "The resend request is sent by the receiving
> application to initiate the retransmission of messages. This function is
> utilized if a sequence number gap is detected, if the receiving application
> lost a message, or as a function of the initialization process." (
> http://fixwiki.fixprotocol.org/fixwiki/ResendRequest)
>
> So, as soon as the two sides exchange their logon request and response,
> they should (and do) issue the resend requests for the missing messages.
> This is where we get into trouble. The way the QuickFIX.NET library is
> implemented, for both the initiator and acceptor modes, the message is
> received on a thread which then processes the stream ... which looks for
> one or more fix messages on the stream ... which then examines the message
> ... which determines that it is a ResendRequest ... which then loads and
> re-sends all requested messages. You can see this flow in the stack
> pictured here (from the initiator side although it's mirrored on the
> acceptor side as well):
>
> [image: stack trace showing the receive thread looping inside
> NextResendRequest to re-send the requested messages]
>
>
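> In pseudo-C#, that single-threaded loop has roughly the following shape
> (ISessionSketch and the framer delegate are illustrative stand-ins I
> made up, not the actual QuickFIX/n types):
>
>     using System;
>     using System.Collections.Generic;
>     using System.IO;
>
>     // Everything happens on the one thread that owns the socket stream.
>     public interface ISessionSketch { void Next(string fixMsg); }
>
>     public static class ReceiveLoop
>     {
>         public static void Run(Stream socketStream, ISessionSketch session,
>                                Func<byte[], int, IEnumerable<string>> frame)
>         {
>             var buffer = new byte[4096];
>             int n;
>             while ((n = socketStream.Read(buffer, 0, buffer.Length)) > 0)
>             {
>                 foreach (string msg in frame(buffer, n))
>                 {
>                     // If msg is a ResendRequest, Next() loads and re-sends
>                     // every requested message on this same thread, so no
>                     // further Read happens until all those Sends finish.
>                     session.Next(msg);
>                 }
>             }
>         }
>     }
>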
> This too is not necessarily a problem. It could, and would, spin through a
> few messages ... even up to about 100 messages without a problem. Let's say
> that both sides had 100 messages missed from the other side. They would
> both receive the resend requests and re-send all 100 messages. They would
> then go about receiving and processing the 100 messages that they received
> from the other side. No harm, no foul.
>
> However, there is a subtle deadlock vulnerability here. The socket (single
> socket) for the connection is being used, as you can infer from the above
> stack trace, within a loop inside NextResendRequest to iterate over all
> requested messages and actually send them over the wire. While it's doing
> this, this thread is never checking for and receiving anything coming the
> other way off the socket / connection. And it doesn't have to, right?
> Because it's just sending a lot of messages. Aha, but what if the other
> side is ALSO not listening on its socket? Because it's busy ALSO sending a
> lot of messages on the socket. Since nobody (neither side) is doing a
> receive on the socket ... the data buffers up on the socket until the
> buffer is full. At that point, without somebody calling a receive to clear
> out the buffer, any subsequent send requests will BLOCK waiting for the
> buffer to get emptied.
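>
> The deadlock is easy to reproduce outside of FIX entirely. A minimal
> sketch (loopback stands in for the FIX connection; the exact counts
> depend on the OS socket buffer sizes):
>
>     using System;
>     using System.Net;
>     using System.Net.Sockets;
>     using System.Threading;
>
>     // Two connected peers that only Send and never Receive. Once the
>     // send and receive buffers on both sides fill up, both threads
>     // block in Send forever: the send/send deadlock described above.
>     class SendSendDeadlock
>     {
>         static void Main()
>         {
>             var listener = new TcpListener(IPAddress.Loopback, 5001);
>             listener.Start();
>             var client = new TcpClient();
>             client.Connect(IPAddress.Loopback, 5001);
>             TcpClient server = listener.AcceptTcpClient();
>
>             var t1 = new Thread(() => SendForever(client.Client, "client"));
>             var t2 = new Thread(() => SendForever(server.Client, "server"));
>             t1.Start(); t2.Start();
>             t1.Join(); t2.Join();   // never returns
>         }
>
>         static void SendForever(Socket s, string name)
>         {
>             var payload = new byte[220];   // roughly one small FIX message
>             for (int i = 1; ; i++)
>             {
>                 s.Send(payload);           // blocks once the buffers fill
>                 Console.WriteLine(name + " sent message " + i);
>             }
>         }
>     }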
>
> In other words, this activity is all occurring on the thread explicitly
> created to listen to the socket and initiate all subsequent actions. So if
> it's not actively / regularly calling receive on the socket ... then the
> other side could wind up stuck with too much data to send and then get
> blocked within the socket send call. And this happens on both sides here
> since they're both using the QuickFIX/n library.
>
> It turns out that I'm seeing about 24 KB worth of messages (about 110
> messages in my case) buffer up on the send side of both peers before they
> both get stuck in the state pictured above ... blocking internally on the
> socket send.
>
> So, even though it may be exceptionally rare / improbable to have a
> situation in production where both sides of the FIX connection have
> "missed" more than 100 messages ... and they both have a similar
> implementation / rely on the QuickFIX/n library ... such that they would
> wind up in a socket deadlock ... It has happened to me. And rather than
> just wipe out my message stores and continue on my merry way, I wanted to
> make sure this wasn't a more serious problem (like one in our own code
> causing a deadlock at some higher level).
>
> It appears to me that the resend request is probably the ONLY possible
> incoming message that could trigger an internal FIX engine response that
> could wind up sending large amounts of data. All other internal / admin
> messages are quite brief and singular. So rather than go the whole nine
> yards and create separate threads for receiving off the socket and processing
> the received messages ... perhaps a good solution would simply be to spin
> up a new worker thread from the thread pool just to handle any incoming
> resend requests? That would leave the vastly more common case (NOT having
> potentially large batches of messages to send) running fast on a single
> thread, with no thread-safety concerns in the message parser / processor.
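>
> A sketch of what I mean, with the dispatch and handler names below
> (Dispatch, HandleResendRequest, HandleOther) invented for illustration
> rather than taken from the library:
>
>     using System.Threading;
>     using QuickFix;
>     using QuickFix.Fields;
>
>     public class DispatchSketch
>     {
>         public void Dispatch(Message msg)
>         {
>             string msgType = msg.Header.GetString(Tags.MsgType);
>             if (msgType == "2")   // MsgType 35=2 is ResendRequest
>             {
>                 // Offload the potentially huge replay to the thread pool
>                 // so this thread gets right back to reading the socket.
>                 ThreadPool.QueueUserWorkItem(_ => HandleResendRequest(msg));
>             }
>             else
>             {
>                 HandleOther(msg);   // all other messages: the fast path
>             }
>         }
>
>         private void HandleResendRequest(Message msg) { /* replay from store */ }
>         private void HandleOther(Message msg) { /* normal processing */ }
>     }
>
> (The message store and the socket writes would then need locking, since
> two threads could be sending at once.)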
>
> However, in the long run, we may need / want to have distinct receive and
> process threads. As it is, the same thread will block while calling into
> the actual FIX application for message processing. Should that processing
> take a long time or involve, for whatever reason, sending a large number of
> messages ... then we could wind up in a similar problem. Say the FIX
> application took 2 minutes to process some particular received message (in
> a blocking synchronous way) ... and during that 2 minutes, the other side
> of the FIX connection had sent enough messages to fill up its socket
> buffer and block. By ensuring that message processing is in a separate
> thread from the socket reading then we will guarantee (in a far better way)
> that our socket should never wind up inadvertently blocking the other side.
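>
> Sketched out, that longer-term shape might look something like this
> (illustrative types only; the framer delegate again stands in for real
> message framing):
>
>     using System;
>     using System.Collections.Concurrent;
>     using System.Collections.Generic;
>     using System.IO;
>     using System.Threading;
>
>     // One thread owns socket reads, another owns processing, with a
>     // queue between them. However slow processing gets, the reader keeps
>     // draining the socket, so the counterparty is never blocked in send.
>     public class ReaderProcessorSketch
>     {
>         private readonly BlockingCollection<string> _queue =
>             new BlockingCollection<string>();
>
>         public void Start(Stream socketStream, Action<string> process,
>                           Func<byte[], int, IEnumerable<string>> frame)
>         {
>             var reader = new Thread(() =>
>             {
>                 var buf = new byte[4096];
>                 int n;
>                 while ((n = socketStream.Read(buf, 0, buf.Length)) > 0)
>                     foreach (string msg in frame(buf, n))
>                         _queue.Add(msg);   // hand off to the processor
>             });
>             var processor = new Thread(() =>
>             {
>                 foreach (string msg in _queue.GetConsumingEnumerable())
>                     process(msg);          // may block; reads continue
>             });
>             reader.IsBackground = processor.IsBackground = true;
>             reader.Start();
>             processor.Start();
>         }
>     }
>
> An unbounded queue trades the deadlock for memory growth under sustained
> overload, so bounding the BlockingCollection (with a capacity well above
> any realistic burst) is probably the right middle ground.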
>
> Thoughts?
>
>                        - Christian Jungers
>
> Christian.Jungers at CM3.com - Chief Technology Officer - Tel
> 877.263.1669 x705 - Fax 877.263.1669
>