.NET: Validating XML documents against a schema
[Edited 2004-09-08 20:20]
It's that time of year again, when we have some XML documents that require schema validation. Either that, or it's September. Hmmmm.
Anyway, the MSDN documentation on the framework doesn't exactly spell out how to load and validate XML documents that contain schema references. Under the old MSXML model it wasn't particularly straightforward — today it's a bit easier but not altogether clear (well, at least to me). I'll try to put together a sample of the minimum number of steps required to load a schema and then validate an XML document instance against it using a few classes in the System.Xml namespace.
Now, I'm not going to go into the fascinating details about why you need schemas or when to use them (and when not to), how to create good schemas and so on. You'll find veritable tomes and treatises on this topic out there. I'm just going to tell you how to solve the validation problem with a few lines of C# code, and I'll leave the rest to you. Ready?
Let's suppose you have a W3C XSD-compliant schema that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema id="myschema-schema"
    targetNamespace="http://schemas.vbbox.com/customers/"
    xmlns:myschema="http://schemas.vbbox.com/customers/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    attributeFormDefault="unqualified"
    elementFormDefault="qualified">
  <xs:simpleType name="firstNameAttribute">
    <xs:restriction base="xs:string" />
  </xs:simpleType>
  <xs:simpleType name="lastNameAttribute">
    <xs:restriction base="xs:string" />
  </xs:simpleType>
  <xs:complexType name="customer">
    <xs:attribute name="lastName" type="myschema:lastNameAttribute" />
    <xs:attribute name="firstName" type="myschema:firstNameAttribute" />
  </xs:complexType>
  <xs:complexType name="customers">
    <xs:sequence>
      <xs:element name="customer" maxOccurs="unbounded" minOccurs="1" type="myschema:customer" />
    </xs:sequence>
  </xs:complexType>
  <!-- Global declaration for the root element, so a document's customers root can be validated -->
  <xs:element name="customers" type="myschema:customers" />
</xs:schema>
This defines a customers type whose sequence can contain one or more customer elements, each of which is made up of two attributes (lastName and firstName). Simple stuff. Now let's see what an XML document that conforms to this schema would look like:
<?xml version="1.0" encoding="utf-8" ?>
<customers xmlns="http://schemas.vbbox.com/customers/">
  <customer firstName="Klaus" lastName="Probst" />
  <customer firstName="Johnny" lastName="BGood" />
</customers>
As you can see, we have one root element (customers), which contains two child elements (customer). The root element also declares the namespace the document belongs to (in this case as the xmlns default, so we don't have to qualify each element with a prefix). That namespace is the same as the schema's targetNamespace, which basically tells the parser: when you load this document, use the schema identified by http://schemas.vbbox.com/customers/ to validate it. Of course, that's not a URL, it's a URI. URIs are just intended to be unique. You could use a GUID here for all practical purposes. This is another schema refinement thing I'm not getting into here. For purposes of this little article we'll just say we need to load the document and validate it without having the parser croak when it tries to figure out just what that URI is supposed to represent.
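To make the point that the URI is only a name, here is a minimal sketch that peeks at a document's root element and prints its namespace URI, which is the key a schema would later be matched against. The class and method names (NamespacePeek, GetRootNamespace) are made up for illustration, and the document is read from a string rather than a file to keep the sample self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NamespacePeek {
    // Returns the namespace URI declared on the document's root element.
    public static string GetRootNamespace(TextReader input) {
        XmlTextReader reader = new XmlTextReader(input);
        try {
            // Skips the XML declaration, comments and whitespace, and stops on the root element
            reader.MoveToContent();
            return reader.NamespaceURI;
        }
        finally {
            reader.Close();
        }
    }

    public static void Main() {
        string xml =
            "<customers xmlns='http://schemas.vbbox.com/customers/'>" +
            "<customer firstName='Klaus' lastName='Probst' />" +
            "</customers>";
        // Prints the URI exactly as written in the xmlns attribute
        Console.WriteLine(GetRootNamespace(new StringReader(xml)));
    }
}
```

The reader only reports the string it found in the xmlns attribute; it is up to your code to decide which schema goes with it.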
So, we have one file-based schema in a location we can reach, one XML document in another file with a reference to that schema through a URI and no time to waste. How do we validate the document?
Enter the System.Xml.XmlValidatingReader class. This one is much like the standard System.Xml.XmlTextReader you've probably used, with the exception that it requires a schema to work correctly. It's nice because it validates as it reads forward through the document and raises an event, which you handle with a delegate callback, that gives your code information about validation problems that occur during reading. The problem here of course is the "requires a schema to work" part. But let's look at some code first:
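// Requires: using System; using System.Xml; using System.Xml.Schema;
private static XmlDocument LoadAndValidate(string documentPath, string schemaPath, string namespaceUri) {
    XmlDocument doc = null;
    XmlValidatingReader vreader = null;
    try {
        // Wrap a plain text reader in a validating reader
        vreader = new XmlValidatingReader(new XmlTextReader(documentPath));
        vreader.ValidationType = ValidationType.Schema;
        // Associate the schema file with the namespace URI the document uses
        XmlTextReader schemaReader = new XmlTextReader(schemaPath);
        vreader.Schemas.Add(namespaceUri, schemaReader);
        schemaReader.Close();
        doc = new XmlDocument();
        doc.Load(vreader);
    }
    catch (Exception) {
        // With no validation handler attached, the first validation error throws
        Console.WriteLine("This document is not valid");
        doc = null; // don't hand back a half-loaded document
    }
    finally {
        if (vreader != null) vreader.Close();
    }
    return doc;
}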
This method, part of a theoretical console application, accepts a path to an XML document, a path to the schema we want to load, and a namespace URI to use with the validating reader. An alternative to loading the schema from disk is to load it from the assembly where it is embedded as a resource; I've talked about how to do that before. Now here comes the interesting part: whenever the parser encounters the namespace URI we used in the schema (http://schemas.vbbox.com/customers/), it will use the schema we loaded to validate the elements in that namespace. Notice the use of the Schemas property of the reader: this is an XmlSchemaCollection object that caches loaded schemas with their corresponding namespace URIs, so you can of course use more than one schema in the validation process (and often documents do reference more than one). So we'd call the method like this:
class ConsoleApplication {
    public static void Main(string[] args) {
        XmlDocument doc = LoadAndValidate(@"C:\MyXmlDoc.xml", @"C:\MySchema.xsd", "http://schemas.vbbox.com/customers/");
        if (doc != null)
            Console.WriteLine("The document is valid!");
    }
}
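Since the embedded-resource alternative came up, here is a rough sketch of how the same steps might look when the schema arrives as a Stream instead of a file path. Everything named here is an assumption for illustration: the StreamSchemaValidator class, the OpenEmbeddedSchema helper, and the compact in-memory schema that stands in for a real embedded resource. The pragma only silences the obsolete warning that later framework versions attach to XmlValidatingReader.

```csharp
#pragma warning disable 0618 // XmlValidatingReader is flagged obsolete in .NET 2.0 and later
using System;
using System.IO;
using System.Reflection;
using System.Text;
using System.Xml;
using System.Xml.Schema;

public static class StreamSchemaValidator {
    // Opens a schema embedded in the running assembly. The resource name is
    // whatever your build assigns to the file (hypothetical here).
    public static Stream OpenEmbeddedSchema(string resourceName) {
        Stream s = Assembly.GetExecutingAssembly().GetManifestResourceStream(resourceName);
        if (s == null)
            throw new FileNotFoundException("Embedded resource not found: " + resourceName);
        return s;
    }

    // Same steps as LoadAndValidate, but the schema comes from a Stream.
    // No validation handler is attached, so the first validation error throws.
    public static XmlDocument LoadAndValidate(TextReader document, Stream schema, string namespaceUri) {
        XmlValidatingReader vreader = new XmlValidatingReader(new XmlTextReader(document));
        try {
            vreader.ValidationType = ValidationType.Schema;
            vreader.Schemas.Add(namespaceUri, new XmlTextReader(schema));
            XmlDocument doc = new XmlDocument();
            doc.Load(vreader);
            return doc;
        }
        finally {
            vreader.Close();
        }
    }

    public static void Main() {
        string ns = "http://schemas.vbbox.com/customers/";
        // Compact stand-in for the schema; a real program would call OpenEmbeddedSchema instead
        string xsd =
            "<xs:schema targetNamespace='" + ns + "' xmlns:c='" + ns + "' " +
            "xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>" +
            "<xs:complexType name='customer'>" +
            "<xs:attribute name='firstName' type='xs:string' />" +
            "<xs:attribute name='lastName' type='xs:string' />" +
            "</xs:complexType>" +
            "<xs:element name='customers'><xs:complexType><xs:sequence>" +
            "<xs:element name='customer' type='c:customer' maxOccurs='unbounded' />" +
            "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        string xml =
            "<customers xmlns='" + ns + "'>" +
            "<customer firstName='Klaus' lastName='Probst' />" +
            "</customers>";

        Stream schema = new MemoryStream(Encoding.UTF8.GetBytes(xsd));
        XmlDocument doc = LoadAndValidate(new StringReader(xml), schema, ns);
        Console.WriteLine("Loaded root element: " + doc.DocumentElement.LocalName);
    }
}
```

In a real project you would pass the result of OpenEmbeddedSchema to LoadAndValidate; the rest of the method does not care where the stream came from.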
Now, this is nice, but because of how the LoadAndValidate method is written, we don't know why the document is invalid; we just know it is (of course the exception could also have been caused by a failure to load one of the files, but let's not split hairs here). The XmlValidatingReader provides a nifty way to tell us just what the problems are with the validation: a ValidationEventHandler event you can hook a delegate into to receive messages. So we'd add one line to the method to attach the handler like this:
vreader.ValidationEventHandler += new ValidationEventHandler(ValidateCallback);
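Note that you insert this line before you start reading the XML document. Once a handler is attached, the reader reports validation errors to it instead of throwing, so if you still want LoadAndValidate to return null for an invalid document, have the handler record that an error occurred. Then we write the handler: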
private static void ValidateCallback(object sender, ValidationEventArgs e) {
    if (e.Severity == XmlSeverityType.Error) {
        // deal with errors
    }
    else {
        // deal with warnings
    }
    Console.WriteLine("{0:F}: {1}", e.Severity, e.Message);
}
Here you could maybe increment internal error and warning counts that you'd use to generate a report later; you don't necessarily need to output each error to the console unless you want to, of course. You don't have to use the validation callback at all; it's provided only as a convenient way to get detailed information on validation problems as the document is being read. Also, keep in mind that if the document (or the schema itself) is not well-formed, which is quite different from not being valid as per the schema, the load operation will simply fail with an exception and you don't have to worry about whether or not it actually validates.
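The counting idea could be sketched roughly like this. It is only one way to do it: the ValidationReport class, its fields and the Check method are names I'm inventing here, and the documents are read from strings so the sample runs on its own. Validation problems are tallied in the callback, while a document that is not well-formed surfaces as an XmlException and is recorded separately.

```csharp
#pragma warning disable 0618 // XmlValidatingReader is flagged obsolete in .NET 2.0 and later
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public class ValidationReport {
    public int Errors;
    public int Warnings;
    public bool WellFormed = true;

    // Matches the ValidationEventHandler delegate signature
    public void OnValidation(object sender, ValidationEventArgs e) {
        if (e.Severity == XmlSeverityType.Error) Errors++;
        else Warnings++;
    }

    public static ValidationReport Check(TextReader document, TextReader schema, string namespaceUri) {
        ValidationReport report = new ValidationReport();
        XmlValidatingReader vreader = new XmlValidatingReader(new XmlTextReader(document));
        try {
            vreader.ValidationType = ValidationType.Schema;
            // Attach the handler before reading so no problem goes unreported
            vreader.ValidationEventHandler += new ValidationEventHandler(report.OnValidation);
            vreader.Schemas.Add(namespaceUri, new XmlTextReader(schema));
            while (vreader.Read()) { } // read to the end; problems arrive via the callback
        }
        catch (XmlException) {
            // Malformed XML stops the read; validity is no longer the question
            report.WellFormed = false;
        }
        finally {
            vreader.Close();
        }
        return report;
    }

    public static void Main() {
        string ns = "http://schemas.vbbox.com/customers/";
        string xsd =
            "<xs:schema targetNamespace='" + ns + "' xmlns:c='" + ns + "' " +
            "xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>" +
            "<xs:complexType name='customer'>" +
            "<xs:attribute name='firstName' type='xs:string' />" +
            "<xs:attribute name='lastName' type='xs:string' />" +
            "</xs:complexType>" +
            "<xs:element name='customers'><xs:complexType><xs:sequence>" +
            "<xs:element name='customer' type='c:customer' maxOccurs='unbounded' />" +
            "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        // The order element is not allowed by the schema, so this document is invalid
        string xml =
            "<customers xmlns='" + ns + "'>" +
            "<customer firstName='Klaus' lastName='Probst' /><order />" +
            "</customers>";

        ValidationReport r = Check(new StringReader(xml), new StringReader(xsd), ns);
        Console.WriteLine("Well-formed: {0}, errors: {1}, warnings: {2}", r.WellFormed, r.Errors, r.Warnings);
    }
}
```

A caller can then treat a report with WellFormed set and zero errors as a valid document, and decide separately how much it cares about warnings.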
So there you have it: validating XML docs against a schema in 20 lines of code or so. And you thought this stuff was hard!
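A postscript for anyone reading this on .NET 2.0 or later: XmlValidatingReader and XmlSchemaCollection were marked obsolete there in favor of XmlReaderSettings and XmlSchemaSet. A rough equivalent of LoadAndValidate under that assumption might look like the sketch below. The ModernValidator class name and the in-memory sample are mine, not part of the method above, so treat it as a starting point rather than a drop-in replacement.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ModernValidator {
    // Loads a document and validates it while reading. With no
    // ValidationEventHandler attached to the settings, the first
    // validation error throws.
    public static XmlDocument LoadAndValidate(TextReader document, TextReader schema, string namespaceUri) {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        using (XmlReader schemaReader = XmlReader.Create(schema)) {
            // Schemas is an XmlSchemaSet keyed by target namespace
            settings.Schemas.Add(namespaceUri, schemaReader);
        }
        using (XmlReader reader = XmlReader.Create(document, settings)) {
            XmlDocument doc = new XmlDocument();
            doc.Load(reader);
            return doc;
        }
    }

    public static void Main() {
        string ns = "http://schemas.vbbox.com/customers/";
        string xsd =
            "<xs:schema targetNamespace='" + ns + "' xmlns:c='" + ns + "' " +
            "xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>" +
            "<xs:complexType name='customer'>" +
            "<xs:attribute name='firstName' type='xs:string' />" +
            "<xs:attribute name='lastName' type='xs:string' />" +
            "</xs:complexType>" +
            "<xs:element name='customers'><xs:complexType><xs:sequence>" +
            "<xs:element name='customer' type='c:customer' maxOccurs='unbounded' />" +
            "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        string xml =
            "<customers xmlns='" + ns + "'>" +
            "<customer firstName='Johnny' lastName='BGood' />" +
            "</customers>";

        try {
            LoadAndValidate(new StringReader(xml), new StringReader(xsd), ns);
            Console.WriteLine("The document is valid!");
        }
        catch (XmlSchemaException e) {
            Console.WriteLine("This document is not valid: " + e.Message);
        }
    }
}
```

To collect every problem instead of stopping at the first one, attach a handler to settings.ValidationEventHandler before creating the reader, just as with the validating reader above.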