{{Quickfixn}} Socket Deadlock Issue
Staffan Ulfberg
staffan at ulfberg.se
Fri Oct 31 12:10:12 PDT 2014
Hi,
This is an attempt at replying to an older thread (from a couple of months
ago), but I just subscribed tot he list and cannot see the MessageID
headers in the archive, so threading will only work if it is done by
Subject...
Sorry for the long post, but I found this thread when investigating the
same problem.
Our application hands over processing to a different thread quite quickly
(actually we do the crack in the FIX thread but nothing more). Despite
this, the receive buffer is sometimes filled and this leads to the
described problem. For reference, our busiest fix connection receives about
4-5 million messages per hour during peak times. (We currently use
quickfixj but are evaluating switching to quickfix/n.)
There are some comments in this thread that quickfixj has the same
problems, but I would say it does not. Yes, if you write an application
that runs a long time in the fix receive thread, then that leads to
problems since the receive buffer fills up, but it does not lead to a
distributed deadlock, since the code that calls send on the socer does not
have lock.
I think the main problem is that Session holds a lock of the sync_ monitor
while performing a potentially blocking operation -- sending data on a
socket.
I've seen the suggestion to just remove the lock when reading responder_ in
HasResponder, but I would argue that is not correct: to remove the lock
while accessing the responder_ member is that it is a mutable property in
Session, and as such needs some kind of synchronization for access. Not
using synchronization to access mutable members is sometimes possible, but
to understand when is non-trivial and needs more justification than "it
seems to work". Also, I think it doesn't solve the real problem (holding a
lock while performing a time consuming operation).
I believe that the changes below are correct and removed that problem, but
I haven't spend a long time thinking about it.
I can see that this particular monitor is not used in a lot of places, so
it could be argued that holding it is not a problem (if removing the lock
in HasResponder, which I think is incorrect... :)). But there are other
problems too: Disconnect() would also block during a blocking call to
Send(), and surely, if Disconnect is called during a long blocking send, it
is better to close the socket so that Send() throws an exception. Well, at
least at the moment I would think it is.
Staffan
diff --git a/QuickFIXn/Session.cs b/QuickFIXn/Session.cs
index ec50b93..470be3a 100755
--- a/QuickFIXn/Session.cs
+++ b/QuickFIXn/Session.cs
@@ -336,13 +336,15 @@ namespace QuickFix
/// <returns></returns>
public bool Send(string message)
{
+ IResponder responder;
lock (sync_)
{
if (null == responder_)
return false;
- this.Log.OnOutgoing(message);
- return responder_.Send(message);
+ responder = responder_;
}
+ this.Log.OnOutgoing(message);
+ return responder.Send(message);
}
// TODO for v2 - rename, make internal
@@ -1575,6 +1577,8 @@ namespace QuickFix
protected bool SendRaw(Message message, int seqNum)
{
+ string messageString;
+
lock (sync_)
{
string msgType =
message.Header.GetField(Fields.Tags.MsgType);
@@ -1610,11 +1614,11 @@ namespace QuickFix
}
}
- string messageString = message.ToString();
+ messageString = message.ToString();
if (0 == seqNum)
Persist(message, messageString);
- return Send(messageString);
}
+ return Send(messageString);
}
public void Dispose()
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.quickfixn.com/pipermail/quickfixn-quickfixn.com/attachments/20141031/b094d519/attachment.htm>
More information about the Quickfixn
mailing list