Scenario: We recently upgraded one of the XML servers we run to .NET 2.0 and started noticing a problem with its serialization of response messages. We have boiled it down to an issue the XML Serializer has with a schema <choice> group containing an element with no content model. In our case, it looks like this.

<element name="epp" type="epp:eppType" />
<complexType name="eppType">
<element name="hello" />
<element name="greeting" type="epp:greetingType" />

The .NET 1.1 framework serializes a greeting element correctly

<?xml version="1.0" encoding="utf-8"?>
<epp xmlns="urn:ietf:params:xml:ns:epp-1.0">

but in .NET 2.0, although it seemed to be fine initially, we started getting this instead.

<?xml version="1.0" encoding="utf-8"?>
<epp xmlns="urn:ietf:params:xml:ns:epp-1.0">
<hello d2p1:type="greetingType" 

Diagnosis: It turns out that the new XmlSerializer code in .NET 2.0 has a bug in the way it deals with empty elements in a <choice> group. If a struct or class member can have multiple types (multiple XmlElementAttribute declarations in the CLR, a choice within a complexType in XSD), .NET 2.0 does not check those types in derivation-hierarchy order when serializing the member, which causes the wrong XML output above. When the code is generated for the temporary dll that performs the actual serialization, the order in which the types are checked to choose how to serialize the member is arbitrary, so the error may or may not be reproduced.
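
To make the diagnosis a bit more concrete, here is a minimal Python sketch of the failure mode. It is not the actual XmlSerializer code, only an illustration under stated assumptions: that an element with no content model maps to the root object type, and that the serializer picks the first choice whose declared type matches the value. The names GreetingType, CHOICES, pick_element and derived_first are all hypothetical.

```python
class GreetingType:
    """Hypothetical stand-in for the class generated for greetingType."""


# (element name, declared type) pairs for the choice member.
# Assumption: "hello" has no content model, so it maps to the root type.
CHOICES = [("hello", object), ("greeting", GreetingType)]


def pick_element(value, choices):
    """Emit the element for the first choice whose declared type matches."""
    for name, declared in choices:
        if isinstance(value, declared):
            if type(value) is not declared:
                # Runtime type differs from the declared one, so a type
                # annotation is added, as in the broken output above.
                return f'<{name} xsi:type="{type(value).__name__}" />'
            return f"<{name} />"
    raise TypeError("no matching choice")


def derived_first(choices):
    """Order choices so more-derived types are tested before their bases.

    MRO length is used as a simple proxy for depth in the hierarchy.
    """
    return sorted(choices, key=lambda c: len(c[1].__mro__), reverse=True)


# Base type tested first: the greeting is swallowed by "hello".
wrong = pick_element(GreetingType(), CHOICES)
# Derived type tested first: the greeting gets its own element.
right = pick_element(GreetingType(), derived_first(CHOICES))
```

Because every object matches the base type, whichever order the checks happen to be generated in decides the output, which would be consistent with the bug showing up on some builds and not on others.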

This is now logged with Microsoft in their bug database and is still awaiting resolution.

For us at least, the hard part was replicating a bug with, as it turned out, ‘arbitrary behaviour’. Indeed, on a clean machine in .NET 2.0 the bug seemed not to occur unless you kickstarted it. A bit of clarification on this. We created a simple command line app that demonstrated the bug. It’s linked to in the MS Bug Report if you're interested. The problem was this:

  • If we created a Command Line App project in VS2005 (Proj1) and copied in the code, the bug didn't appear when we built and ran it.
  • If we created a Command Line App project in VS2003 (Proj2) and copied in the code, the bug didn't appear when we built and ran it. Only after we opened Proj2 in VS2005, migrated it to .NET 2.0, and built and ran it again did the bug appear, hey presto.
  • If we created a Command Line App project in VS2005 again (Proj3) after migrating Proj2, and copied in the code, the bug did appear when we built and ran it.

But hang on, we now have three apps with identical code exhibiting different behaviours, the only difference being that one was built and run before the bug was kickstarted. Even if we ran Proj1 again after Proj3 there were still no signs of the bug. Now Microsoft note that the behaviour of the bug itself is arbitrary, but there seems to be a pretty definite on switch. Where's the off switch I wonder?

While I’m waiting for MS to get back to me with a fix, I’ve been looking at workarounds. Two spring to mind:

Both make sense, except that in this case the former means changing a schema which is laid out in an RFC / de facto standard, which I can’t do. The latter (as far as I am aware) means altering code which was automatically generated by xsd.exe, so should that code ever need to be regenerated (for an extension to the schema, perhaps), there would also need to be several warnings and explanations on how to re-edit the new code so that it serializes correctly again. Neither is great. Ah well.

Comments: It’s ironic that the reason we moved the code affected by this problem to .NET 2.0 was a different bug in .NET 1.1 SP1 involving generated classes from schemas spread across different XSD documents. We avoided that by not installing SP1 on our Win2k boxes. As we upgraded to Win2K3, it became apparent that the version of .NET 1.1 installed by default with the OS included the bug we had previously avoided. Now we’re hit by another one in .NET 2.0. It can be worked around, but you can appreciate the irony.

This is the first time I've ever used one of the Microsoft Support Calls that come with my MSDN subscription and aside from an email going astray initially, you've got to hand it to the MS support staff. They've been pretty responsive thus far with diagnosis. Of course, I've got to wait now for one of the actual XML team to create a quick fix that I can test, but the whole process was explained nicely. For reference, it works like this.

  • A support engineer is assigned to your support call. He verifies the problem and passes it to the MS dev team.
  • The dev team may require further investigation or suggest a workaround. They may need a business case to evaluate the urgency of the case.
  • If they confirm it is a bug and can fix it, you receive a private fix for testing.
  • Once you confirm that the private fix solves the problem, MS build the official fix and release it in the KB.
  • Depending on the severity / complexity of the issue, the whole process may take several weeks, although the problem is fixed for you as soon as you have the private fix.

Before we pushed it out to Microsoft, we struggled for a few days to isolate this bug, as it was partially hidden inside another migration issue (see Migration Woes part 3 for more on that). Once we realized there were two separate issues, it was interesting to learn that the temporary dll that serializes classes to XML identifies System.Object differently in .NET 1.1 and 2.0. In .NET 1.1, the hello class is described as hello.System.Object... In .NET 2.0, it’s helloZSystem.Object, mscorlib, Version=, Culture=neutral, PublicKeyToken=b77a5c561934e089.

There’s not much documentation on how the XML Serializer works (or doesn’t) that I could find, but Kirk Allen Evans, Christoph Schittko, and Scott Hanselman all had useful posts on ways to approach the problem before we concluded that it was an actual .NET bug. Worth reading for future reference if you’re interested.