How will you make it if you never even try?

June 10, 2008

Speed up your XML code with OuterXmlCachingXmlDocument

Filed under: Performance, posted by charlieflowers @ 7:09 pm

Transitioning from a string to XmlDocument and vice versa is very slow

Recently I worked on a project which involved some performance profiling. We used tools like Red Gate’s profiler, the free .NET CLR profiler from Microsoft, and the AutomatedQA profiler. These profilers made one thing very clear: transitioning from an XmlDocument representation of XML to a text representation, and vice versa, is very expensive and slow.

However, in this particular project, we had no choice. Our code was building credit reports, which means our original input was XML and our final output was XML (both in the MISMO XML format). At various points scattered through the processing of a request, we had to call legacy components for specific tasks. Most of those legacy components wanted XML text (not a DOM) as input, and they also returned their output as XML text. But the rest of our code wanted to work with the XML as a DOM (i.e., an XmlDocument), so that we could navigate, set properties, use XPath, etc.

So we had no choice but to transition from a large text string to an XmlDocument and back again, over and over. I said before that this is slow, but I want to make sure you understand that I mean very slow! You’d be surprised. It is slow because a) it is a big job for the computer to do, and b) it generates tons of little objects and therefore causes garbage collection overhead.
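If you want to see the cost for yourself, a rough sketch like the one below times the string to XmlDocument to string round trip. The class name, the synthetic document, and the loop counts are all made up for illustration (this is not MISMO data or our production code), and a Stopwatch loop is only a crude stand-in for a real profiler.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Xml;

public static class RoundTripTiming
{
    // Builds a synthetic XML string with the given number of child elements.
    // The element names are invented for this sketch.
    public static string BuildSampleXml(int itemCount)
    {
        StringBuilder sb = new StringBuilder("<report>");
        for (int i = 0; i < itemCount; i++)
        {
            sb.Append("<item id=\"").Append(i).Append("\">value ").Append(i).Append("</item>");
        }
        sb.Append("</report>");
        return sb.ToString();
    }

    // Parses the text into a DOM and serializes it straight back to text,
    // which is the round trip each legacy call forced on us.
    public static string RoundTrip(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.OuterXml;
    }

    public static void Main()
    {
        string xml = BuildSampleXml(50000);

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < 20; i++)
        {
            RoundTrip(xml);
        }
        watch.Stop();

        Console.WriteLine("20 round trips: " + watch.ElapsedMilliseconds + " ms");
    }
}
```

The absolute numbers will vary by machine and runtime; the point is to compare them against how cheap it is to simply hand back a string you already have.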

Caching the OuterXml representation

After thinking about this for a while, I realized that it would help tremendously if we could just make XmlDocument “cache” its textual representation. In other words, when you load an XmlDocument from a string, I wanted XmlDocument to “remember” that string. As long as no one had made any changes to the XmlDocument in any way, it would merely return that string every time you call OuterXml. But the minute someone makes a change to the XmlDocument, it would know that it no longer has a valid string representation. The next time you call OuterXml, the XmlDocument would go through the big, expensive process of creating the textual representation, but then it would “remember” it again until the next change invalidates it.

And it turns out, this was fairly simple to build.

Presenting the “OuterXmlCachingXmlDocument”

I called it the “OuterXmlCachingXmlDocument” because it is an XmlDocument that caches the “OuterXml”.

It inherits from XmlDocument and does the following:

  • Overrides Load() and LoadXml(): these methods let you load XML into an XmlDocument. They both obtain a string of XML text from somewhere (a file or stream, a variable, etc.). They have been overridden to store that string in an instance variable before performing the load operation. They then register event listeners for three of XmlDocument’s change events: NodeChanged, NodeInserted, and NodeRemoved. Those events tell us when the string representation that we have cached becomes invalid.
  • Overrides OuterXml: this is a property that returns the string representation. The base implementation performs the expensive process of walking the tree of node objects in the DOM and building a string representation. We have overridden it to first check whether we have a valid cached version of the XML string. If so, we just return it! If not, we have to let the base class do the expensive conversion. But the good news is that once that expensive process has been done, we have a valid string representation again, so we cache it in the same instance variable (and register the change listeners again, since they were removed when the cache was invalidated).
  • Handles event notifications for NodeChanged, NodeInserted, and NodeRemoved: if any of these events is fired, we need to dump our cached string representation. We don’t recalculate a new string representation at this time, because avoiding that is why we’re here in the first place! We simply “make a note” that we no longer have a cached string representation. Also, and this is very important, when any of these events fires, we de-register our listener from the NodeChanged, NodeInserted, and NodeRemoved events! Otherwise, every later DOM operation that changes the XmlDocument would incur the overhead of calling us for no reason.

That’s really all there is to it. It is simple, and it works as a “silent” drop-in replacement for XmlDocument. Feel free to use it everywhere instead of XmlDocument, since it inherits the full XmlDocument API. It made a very noticeable improvement to our performance, and if you’re transitioning a lot between DOM and text, it will likely help you quite a bit as well.
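For the curious, here is a minimal sketch of the three behaviors described above. It is a reconstruction for illustration, not the original production class: the field and helper names are invented, only some of the Load() overloads are handled, and it stores the text after the load succeeds rather than before. One caveat that follows from the design: the remembered string is returned verbatim, so it can differ in formatting from what the base OuterXml would produce (for example, `<b/>` in the input versus `<b />` from the serializer).

```csharp
using System;
using System.IO;
using System.Xml;

// Sketch only: an XmlDocument that remembers its text until the DOM changes.
public class OuterXmlCachingXmlDocument : XmlDocument
{
    private string cachedOuterXml; // null means there is no valid cached text
    private bool listening;        // true while our change handlers are attached

    public override void LoadXml(string xml)
    {
        Invalidate();              // drop any old text and detach handlers during the load
        base.LoadXml(xml);
        cachedOuterXml = xml;      // remember the text only once the load has succeeded
        StartListening();
    }

    public override void Load(TextReader reader)
    {
        // Assumption: reading the whole text up front is acceptable, so it can be cached.
        LoadXml(reader.ReadToEnd());
    }

    public override void Load(XmlReader reader)
    {
        // No text is available from an XmlReader, so just make sure nothing stale survives.
        Invalidate();
        base.Load(reader);
    }

    public override string OuterXml
    {
        get
        {
            if (cachedOuterXml == null)
            {
                cachedOuterXml = base.OuterXml; // the expensive serialization
                StartListening();               // watch for the next change
            }
            return cachedOuterXml;
        }
    }

    private void OnDomChanged(object sender, XmlNodeChangedEventArgs e)
    {
        Invalidate();
    }

    private void Invalidate()
    {
        cachedOuterXml = null;
        if (listening)
        {
            // De-register so later DOM edits do not keep calling us for no reason.
            NodeChanged -= OnDomChanged;
            NodeInserted -= OnDomChanged;
            NodeRemoved -= OnDomChanged;
            listening = false;
        }
    }

    private void StartListening()
    {
        if (!listening)
        {
            NodeChanged += OnDomChanged;
            NodeInserted += OnDomChanged;
            NodeRemoved += OnDomChanged;
            listening = true;
        }
    }
}

public static class CachingDemo
{
    public static void Main()
    {
        string xml = "<a><b>1</b></a>";
        OuterXmlCachingXmlDocument doc = new OuterXmlCachingXmlDocument();
        doc.LoadXml(xml);
        Console.WriteLine(object.ReferenceEquals(xml, doc.OuterXml)); // the remembered string comes back
        doc.DocumentElement.SetAttribute("x", "1");                   // any change drops the cache
        Console.WriteLine(doc.OuterXml);                              // re-serialized once, then cached again
    }
}
```

The sketch is not thread-safe, and it does not try to capture text for the file and stream overloads of Load(); those would need the same treatment as the TextReader overload if you want them cached.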

July 5, 2006

Generics and XPATH — a beautiful match

Filed under: C#, posted by charlieflowers @ 8:19 pm

Generics are sweet. Here’s a simple little example that let me cut down on repetitious code when working with XPath.
If you know a little XPath, then you know that it lets you specify a string that contains an “XPath query”, and that query will return one or more XML nodes that match it. The nodes might be XML attributes or XML elements (or, of course, any of the other types of XML nodes), depending on your query.
I found myself needing to write code that obtains a single “required” node in an XML document. By “required”, I mean that I wanted to do an XPath query for the node, and if no match was found, I wanted to throw an exception saying “The xpath query ‘/whatever’ has no match, but exactly one match was expected.”
And of course, the kinds of XPath queries I commonly needed were those to get either a required element or a required attribute. Without generics, I would have had to do something like this.

public static XmlAttribute GetRequiredXmlAttribute(XmlDocument doc, string xpath)
{
	XmlNode node = doc.SelectSingleNode(xpath);

	if (node == null)
	{
		throw new Exception("The xpath '" + xpath + "' has no matches, but exactly one match is required.");
	}

	XmlAttribute attribute = node as XmlAttribute;

	if (attribute == null)
	{
		// There was a match, but it is not an XmlAttribute.
		throw new Exception("The xpath '" + xpath + "' matches a node of type '" + node.GetType().FullName + "', which is not an XmlAttribute.");
	}

	return attribute;
}

This code is an absolute POSTER BOY for generics. More than half the battle in learning a new technology is in understanding the motivation for it. If you understand that certain XPath queries will always match an XmlAttribute and other XPath queries will always match an XmlElement, and you know that you normally have to do a lot of type-checking and casting to figure out which kind of Xml node you’ve got, then you are looking at one of the key motivations behind generics.
Here’s the generic version of the code – very nice!

public static T GetRequiredNodeFromSourceNode<T>(XmlNode sourceNode, string requiredXpath) where T : XmlNode
{
	XmlNode node = sourceNode.SelectSingleNode(requiredXpath);

	if (node == null)
	{
		throw new ArgumentException("Tried to extract the path '" + requiredXpath + "', but nothing was found for that xpath.");
	}

	T result = node as T;

	if (result == null)
	{
		throw new ArgumentException("The xpath you provided points to a node of type " + node.GetType().FullName +
		", which cannot be cast to type " + typeof(T).FullName + ".");
	}

	return result;
}

See, generics let you express something that you always knew about but were previously unable to express. You knew that some XPaths returned elements while others returned attributes, but .NET 1.x did not give you a way to express that in your code. Now generics do.
Here’s some code that uses the above generic method:

XmlDocument doc = new XmlDocument();
doc.Load(@"C:\someFile.xml");

XmlAttribute attribute = GetRequiredNodeFromSourceNode<XmlAttribute>(doc, "/root/@someAttribute");
XmlElement element = GetRequiredNodeFromSourceNode<XmlElement>(doc, "/root/someElement");

// This will give one of our exceptions, because this xpath syntax always returns an element.
XmlAttribute attributeFail = GetRequiredNodeFromSourceNode<XmlAttribute>(doc, "/root");

// This will give one of our exceptions, because this xpath syntax always returns an attribute.
XmlElement elementFail = GetRequiredNodeFromSourceNode<XmlElement>(doc, "/root/@hello");

What it boils down to:
Some XPath queries are “typed” by nature: certain queries always return an XmlAttribute while others always return an XmlElement. However, before generics, C# gave you no way to express that fact without resorting to the common base class, XmlNode. Generics address this exact problem. So you can write less code and have it cover more ground (for example, this code works for XmlComment nodes, processing instructions, and whatever other kinds of XML nodes you might need to deal with in the future).
What’s also interesting about this example is that it lets you be strongly typed even when the method itself doesn’t know what its return type will be. If you’ve programmed in C# for a while, this is probably something you “felt the need for” at one time or another, but it couldn’t be achieved before generics.
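To back up the claim about comments and processing instructions, here is a small self-contained sketch that runs the same generic helper against an in-memory document. The helper is repeated so the sample compiles on its own; the class name, the sample XML, and its contents are made up for illustration.

```csharp
using System;
using System.Xml;

public static class RequiredNodeDemo
{
    // Same helper as in the post, repeated so this sample stands alone.
    public static T GetRequiredNodeFromSourceNode<T>(XmlNode sourceNode, string requiredXpath) where T : XmlNode
    {
        XmlNode node = sourceNode.SelectSingleNode(requiredXpath);

        if (node == null)
        {
            throw new ArgumentException("Tried to extract the path '" + requiredXpath + "', but nothing was found for that xpath.");
        }

        T result = node as T;

        if (result == null)
        {
            throw new ArgumentException("The xpath you provided points to a node of type " + node.GetType().FullName +
                ", which cannot be cast to type " + typeof(T).FullName + ".");
        }

        return result;
    }

    // Hypothetical sample document, loaded from a string rather than a file.
    public static XmlDocument BuildSampleDocument()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root someAttribute=\"42\"><!-- a note --><?target data?><someElement>hi</someElement></root>");
        return doc;
    }

    public static void Main()
    {
        XmlDocument doc = BuildSampleDocument();

        // One method, four different strongly typed results.
        XmlAttribute attribute = GetRequiredNodeFromSourceNode<XmlAttribute>(doc, "/root/@someAttribute");
        XmlElement element = GetRequiredNodeFromSourceNode<XmlElement>(doc, "/root/someElement");
        XmlComment comment = GetRequiredNodeFromSourceNode<XmlComment>(doc, "/root/comment()");
        XmlProcessingInstruction instruction =
            GetRequiredNodeFromSourceNode<XmlProcessingInstruction>(doc, "/root/processing-instruction()");

        Console.WriteLine(attribute.Value);
        Console.WriteLine(element.InnerText);
        Console.WriteLine(comment.Value);
        Console.WriteLine(instruction.Target + " " + instruction.Data);

        try
        {
            // "/root" matches an element, so asking for an attribute fails.
            GetRequiredNodeFromSourceNode<XmlAttribute>(doc, "/root");
        }
        catch (ArgumentException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

One thing to keep in mind: SelectSingleNode returns the first match, so the helper guarantees at least one match of the right type, not exactly one. If you need the stricter check, SelectNodes and a count test would be one way to get it.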
