XmlDOMUtil

The first thing your program needs to do is call XmlDOMUtil::Initialize(). That handles all the initialization, including initialization of the XercesC platform library underneath everything. You need to do this once before using any other XmlDOMUtil routines, and in a multithreaded program I strongly recommend you call Initialize() before you start spawning threads. Repeated calls are harmless, because the routine uses a flag within the library that makes the call a no-op if it's already been called. Note, though, that it doesn't lock that flag in any way, so you can get a race condition if you call it from multiple threads simultaneously. In my opinion you don't want to go there anyway, and if you find yourself doing it, it's a sign you've done something very wrong somewhere.

When you're done using XmlDOMUtil, you need to call XmlDOMUtil::Terminate(). That shuts everything down and frees the memory controlled by the library and XercesC. Don't try to use any XmlDOMUtil routines after you've called Terminate(), and don't try to use any pointers to memory you got from the library. Terminate() has an internal flag to protect against multiple calls or being called without Initialize() having been called, but the same caveats about multithreaded programs apply. After you've called Terminate(), it's safe to call Initialize() again to restart the library.
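
One way to keep the Initialize()/Terminate() pairing straight is a small scope guard created at the top of main(), before any threads exist. The following is only a sketch: so that it compiles on its own, it uses stand-in functions in a made-up demo namespace instead of the real XmlDOMUtil calls, and LibraryGuard is a hypothetical name, not something the library provides.

```cpp
#include <cassert>

// Stand-ins for XmlDOMUtil::Initialize()/Terminate() so the sketch compiles
// by itself. As described above, they use an unlocked flag: repeated calls
// are no-ops, but concurrent calls from several threads could race.
namespace demo {
    bool g_initialized = false;
    int  g_init_count  = 0;   // counts real (non-no-op) initializations

    void Initialize() {
        if (g_initialized) return;   // already up: no-op
        g_initialized = true;
        ++g_init_count;
    }

    void Terminate() {
        if (!g_initialized) return;  // never started or already down: no-op
        g_initialized = false;
    }
}

// Hypothetical helper: initialize in the constructor, terminate in the
// destructor. Create one at the top of main(), before spawning threads.
class LibraryGuard {
public:
    LibraryGuard()  { demo::Initialize(); assert(demo::g_initialized); }
    ~LibraryGuard() { demo::Terminate(); }

    LibraryGuard(const LibraryGuard&) = delete;
    LibraryGuard& operator=(const LibraryGuard&) = delete;
};

// Example flow: returns true if the library is up inside the guarded scope.
bool run_guarded() {
    LibraryGuard guard;          // library starts here
    demo::Initialize();          // harmless second call
    return demo::g_initialized;  // guard's destructor shuts the library down
}
```

With the real library the guard's constructor and destructor would call XmlDOMUtil::Initialize() and XmlDOMUtil::Terminate(), and everything that touches DOM nodes would have to live inside the guard's scope.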

As with XmlUtil::Exception, you probably won't use XmlDOMUtil::XmlDOMException directly much. You'll mostly catch exceptions thrown by the library, and if you do need to create your own you'll likely only need to use the single-string-argument constructors.

If you wanted to parse a document and use information from it, the code would look something like this:

XmlDOMUtil::Initialize();
xercesc::XercesDOMParser *parser = XmlDOMUtil::getParser();
// The returned document stays owned by the parser unless you adopt it
xercesc::DOMDocument *doc = XmlDOMUtil::parseDocumentFile( parser, "MyXmlFile.xml" );
std::list<xercesc::DOMElement *> elements = XmlDOMUtil::findElement( doc, "TARGET_ELEMENT" );
std::list<xercesc::DOMElement *>::iterator it = elements.begin();
while ( it != elements.end() )
{
    std::string attr_value = XmlDOMUtil::getAttributeValue( *it, "my_attribute" );
    std::string elem_value = XmlDOMUtil::getValue( *it );
    // Do some processing
    ++it;
}
// Frees the parser and the document it still owns
XmlDOMUtil::destroyParser( parser );
XmlDOMUtil::Terminate();

You might vary it, for instance by calling adoptDocument() on the parser after parseDocument()/parseDocumentFile() to take ownership of the document, and then calling doc->release() to free its memory when you're done with it. Note: if you leave the document under the control of the parser you should never call its release() method. Always use releaseDocuments() or let it be released automatically when you call destroyParser().

If you're creating a document, the code (modulo the initialization stuff) would look something like this:

xercesc::DOMDocument *doc = XmlDOMUtil::createDocument( "", "ROOT" );
xercesc::DOMElement *root = XmlDOMUtil::getRoot( doc );
char strbuffer[32];
int i = 0;
// some_list is a std::list<std::string> defined elsewhere
std::list<std::string>::iterator it = some_list.begin();
while ( it != some_list.end() )
{
    i++;
    // snprintf (from <cstdio>) can't overrun the buffer
    snprintf( strbuffer, sizeof( strbuffer ), "%d", i );
    xercesc::DOMElement *elem = XmlDOMUtil::createElement( doc, "STRING", *it );
    XmlDOMUtil::setAttributeValue( elem, "index", strbuffer );
    XmlDOMUtil::addChild( root, elem );
    ++it;
}

That code creates a document, gets its root element, then takes a list of strings and adds each one as a STRING element under the root with an "index" attribute whose value is the 1-based position (1 to n) of the string in the original list. Note that when creating an element you have to specify the document it's in. That's so XercesC can link it to the document and to document-level metadata like namespace declarations and encoding.

XmlDOMException

Derived from the XmlUtil::Exception class, it holds exception information specific to DOM parsing. The only additions it makes to the class are the additional constructors for creating XmlDOMException objects from the various types of exceptions thrown by the XercesC DOM routines.

Constructors

Prototypes:

  • XmlDOMException( const xercesc::DOMException& )
  • XmlDOMException( const xercesc::DOMRangeException& )
  • XmlDOMException( const xercesc::SAXException& )
  • XmlDOMException( const xercesc::SAXParseException& )

Normally you won't have to worry about these. They're used within the library to create the exception objects. You can of course use them if you're doing things directly with the XercesC routines and want to convert the exceptions they throw into a common type of exception to simplify things, but that's about the only time you'll need to construct exception objects.

Additional type strings supported by this class are "DOM", "DOM range", "SAX", "SAX parse", coming from the 4 exception types supported by the constructors. For "DOM" and "DOM range" the code field comes from the code field of the underlying exception object. "SAX" and "SAX parse" don't have an associated code, so the code will always be zero. The message for "SAX parse" exceptions includes the line and column numbers where the exception occurred followed by the actual message text.
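
To make the "common type of exception" idea concrete, here's a rough, self-contained sketch. Everything in it is a stand-in: FakeDOMException and FakeSAXParseException replace the XercesC classes, CommonXmlError is a hypothetical class playing the role of XmlDOMException, and the exact layout of the line/column prefix is an assumption rather than the library's actual format.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-ins for xercesc::DOMException and xercesc::SAXParseException, cut
// down to the fields this sketch needs. The real classes differ.
struct FakeDOMException      { int code; std::string message; };
struct FakeSAXParseException { long line; long column; std::string message; };

// Hypothetical common exception carrying a type string, a code and a message.
class CommonXmlError : public std::runtime_error {
public:
    // "DOM": the code is copied from the underlying exception.
    explicit CommonXmlError(const FakeDOMException& e)
        : std::runtime_error(e.message), type_("DOM"), code_(e.code) {}

    // "SAX parse": no code (always zero); the position is put before the text.
    explicit CommonXmlError(const FakeSAXParseException& e)
        : std::runtime_error(format(e)), type_("SAX parse"), code_(0) {}

    const std::string& type() const { return type_; }
    int code() const { return code_; }

private:
    // Assumed layout; the library's real message format may differ.
    static std::string format(const FakeSAXParseException& e) {
        std::ostringstream out;
        out << "line " << e.line << ", column " << e.column << ": " << e.message;
        return out.str();
    }

    std::string type_;
    int code_;
};

// Pretend low-level call that fails in one of two ways.
void lowLevelWork(bool fail_in_parser) {
    if (fail_in_parser)
        throw FakeSAXParseException{3, 14, "unexpected end tag"};
    throw FakeDOMException{8, "node not found"};
}

// Convert whatever the low-level code throws into the one common type.
void wrappedWork(bool fail_in_parser) {
    try {
        lowLevelWork(fail_in_parser);
    } catch (const FakeSAXParseException& e) {
        throw CommonXmlError(e);
    } catch (const FakeDOMException& e) {
        throw CommonXmlError(e);
    }
}

// Callers then need only a single catch clause.
std::string describe(bool fail_in_parser) {
    try {
        wrappedWork(fail_in_parser);
    } catch (const CommonXmlError& e) {
        return e.type() + " [" + std::to_string(e.code()) + "] " + e.what();
    }
    return "no error";
}
```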

Library startup and termination routines

Prototypes:

  • void Initialize()
  • void Terminate()

These routines are what you'd expect. They call the underlying XmlUtil namespace routines to handle its startup and termination, and take care of DOM-specific initialization and cleanup. They include flags to prevent problems if the routines are called more than once. The library needs to be initialized before any functions can be used, and should be terminated before your program exits so cleanup happens. Don't try to use library functions after calling Terminate(), and remember that the memory holding DOM nodes has been freed, so pointers to nodes are no longer valid after Terminate()'s been called.

Parsing documents

Parser creation and destruction

Prototypes:

  • xercesc::XercesDOMParser *getParser()
  • void destroyParser( xercesc::XercesDOMParser * )
  • void releaseDocuments( xercesc::XercesDOMParser * )

getParser() returns a new parser object for later use. destroyParser() takes a parser object and frees it and all its associated memory (including that of any documents you parsed using it and didn't adopt). Note that you can use a single parser to parse multiple documents. releaseDocuments() provides a way for you to release the memory used to hold parsed documents without destroying the parser. If you needed to process several large documents one at a time, you'd simply call releaseDocuments() after each one to free up the memory.

To make life a bit easier when dealing with exceptions, there's an XmlDOMParser class defined. This class doesn't do anything itself. Objects of this class can't be copied or assigned. The default constructor gets a parser using getParser(), the destructor calls destroyParser() on the saved parser (if any), and the no-arguments function-call operator returns a pointer to the parser just like you'd get from getParser(). Its only purpose is to make it easier to keep parsers cleaned up by giving you a way to allocate them in a normal stack-based variable that'll be automatically cleaned up upon exit from the block it's defined in, even if the exit is because of a thrown exception or some other abnormal exit path.
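
The pattern behind such a holder can be sketched in a few lines. This is not the library's source: FakeParser and the free getParser()/destroyParser() functions below are stand-ins so the example compiles without XercesC, and ParserHolder is a hypothetical name used to avoid implying this is how XmlDOMParser is actually written.

```cpp
#include <stdexcept>

// Stand-ins for xercesc::XercesDOMParser and the getParser()/destroyParser()
// pair, so the sketch compiles without XercesC.
struct FakeParser {};
int g_live_parsers = 0;   // parsers created and not yet destroyed

FakeParser* getParser() {
    ++g_live_parsers;
    return new FakeParser;
}

void destroyParser(FakeParser* parser) {
    if (parser) {
        --g_live_parsers;
        delete parser;
    }
}

// Holder along the lines described above: acquire in the constructor,
// release in the destructor, no copying or assignment.
class ParserHolder {
public:
    ParserHolder() : parser_(getParser()) {}
    ~ParserHolder() { destroyParser(parser_); }

    ParserHolder(const ParserHolder&) = delete;
    ParserHolder& operator=(const ParserHolder&) = delete;

    // Function-call operator hands back the raw parser pointer.
    FakeParser* operator()() const { return parser_; }

private:
    FakeParser* parser_;
};

// The parser is destroyed even though the block exits via an exception.
bool parser_cleaned_up_after_throw() {
    bool cleaned_up = false;
    try {
        ParserHolder holder;
        FakeParser* parser = holder();   // what you'd hand to a parse call
        (void)parser;
        throw std::runtime_error("simulated parse failure");
    } catch (const std::runtime_error&) {
        cleaned_up = (g_live_parsers == 0);
    }
    return cleaned_up;
}
```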

Document parsing and copying

Prototypes:

  • xercesc::DOMDocument *parseDocument( xercesc::XercesDOMParser *, const std::string& doc, const char *enc_string = NULL )
  • xercesc::DOMDocument *parseDocumentFile( xercesc::XercesDOMParser *, const std::string filename, const char *enc_string = NULL )
  • xercesc::DOMElement *getRoot( xercesc::DOMDocument * )
  • xercesc::DOMDocument *adoptDocument( xercesc::XercesDOMParser * )
  • xercesc::DOMDocument *cloneDocument( xercesc::DOMDocument * )

parseDocument() and parseDocumentFile() are the basic parsing functions. The first parses the document from a standard C++ string, the second parses from a named file on disk. The enc_string argument to each is used if the character encoding isn't specified in the XML itself and you need to supply it. This string matches the encoding attribute of the <?xml ... ?> declaration, and would be something like "UTF-8" or "ISO-8859-15". Often you won't need to specify it, since either the XML document will have the encoding attribute present or it'll already be in the XML default UTF-8 encoding.

getRoot() returns the root element node of the document (the immediate child of the document node itself). There's only one root element allowed.

adoptDocument() adopts the document from the parser. Normally the DOM objects that make up the document tree are owned by the parser after a document is parsed, and they're freed when the parser is destroyed or releaseDocuments() is called on the parser. By calling adoptDocument() you're taking ownership of the DOM tree's memory yourself, and the parser will release control over the memory to you. Note that the memory's still controlled by the XercesC memory manager, so while you can destroy the parser without releasing an adopted document, its memory will still be freed when you call Terminate() and shut down the XercesC library.

cloneDocument() is similar to adoptDocument() in that it returns a document that's controlled by you. But instead of returning the original document it returns a deep copy of the document you passed in, a copy you can alter without changing the original. Just remember that, as with adoptDocument(), you're responsible for releasing the document once you're done with it by calling the release() method on it.
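
If you'd rather not remember the release() call by hand, one option is to hold an adopted or cloned document in a smart pointer whose deleter calls release(). The sketch below shows the idea with stand-ins only: FakeDocument replaces xercesc::DOMDocument, the adoptDocument() here is a dummy, and DocumentReleaser and OwnedDocument are hypothetical names. With the real library such a pointer would presumably need to go out of scope before Terminate() is called, since Terminate() frees the memory anyway.

```cpp
#include <memory>

// Stand-in for xercesc::DOMDocument: the only part that matters here is that
// the object is disposed of through release() rather than delete.
struct FakeDocument {
    static int live;                 // documents not yet released
    FakeDocument() { ++live; }
    void release() { --live; delete this; }
};
int FakeDocument::live = 0;

// Deleter that calls release(), so a smart pointer can own the document.
struct DocumentReleaser {
    void operator()(FakeDocument* doc) const {
        if (doc) doc->release();
    }
};
using OwnedDocument = std::unique_ptr<FakeDocument, DocumentReleaser>;

// Dummy replacement for adoptDocument()/cloneDocument(): returns a document
// that the caller is responsible for releasing.
FakeDocument* adoptDocument() { return new FakeDocument; }

// Returns how many documents are alive while the owner is in scope.
int documents_alive_inside_scope() {
    OwnedDocument doc(adoptDocument());  // released automatically at scope exit
    return FakeDocument::live;
}
```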

Namespace URI and prefix translation

Prototypes:

  • std::string namespacePrefixToURI( xercesc::DOMNode *node, std::string prefix )
  • std::string namespaceURIToPrefix( xercesc::DOMNode *node, std::string uri )

These translate between namespace URIs and prefixes. You need to supply a node because namespace prefixes are specific to a particular document. The same namespace URI can map to prefix "abc" in one document and "xyz" in another, so without a node (and its containing document) to refer to, it makes no sense to talk about prefixes.

Note that the library doesn't have routines yet to truly handle namespace prefixes and URIs in functions like searching for elements or creating elements and attributes. I plan to add them, it's just that my current needs don't involve disambiguating names based on namespaces.

Finding elements in a document

Prototypes:

  • std::list<xercesc::DOMElement *> findElement( xercesc::DOMElement *root, const std::string element_name, bool recurse = true )
  • std::list<xercesc::DOMElement *> findElement( xercesc::DOMDocument *doc, const std::string element_name, bool recurse = true )
  • std::list<xercesc::DOMElement *> findElements( xercesc::DOMElement *root, const std::list<std::string>& element_names, bool recurse = true )
  • std::list<xercesc::DOMElement *> findElements( xercesc::DOMDocument *doc, const std::list<std::string>& element_names, bool recurse = true )

This group of functions helps in locating elements within a document tree. findElement() takes a single element name and returns a list of pointers to all occurrences of that element name, while findElements() accepts a list of element names and locates multiple element names in one search. They come in two variants. The first takes a pointer to an element node as the root of the search, and locates elements beneath that root. The second takes a pointer to a document node, searches starting at the root element of the document and includes the root element in the search. The recurse argument controls whether the search is recursive down the entire depth of the tree or limited to immediate children of the root. You can use non-recursive searches to search based on the relative positions of elements in the tree (eg. search for all CUSTOMER elements, and then search for ADDRESS elements beneath them).

Note that in all cases the search doesn't continue in the sub-tree below an element that matches the search. I found I rarely wanted to locate such children; more commonly I wanted to exclude them. I do plan on adding functionality to continue the search below matched elements at some point.
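
The search rules are easier to see on a toy tree. The sketch below is not the library's code: Element is a minimal stand-in for a DOM element, and findBelow()/findElementSketch() are hypothetical names that mimic the element-rooted variant as described above (matches are collected, nothing below a match is searched, and recurse = false looks at immediate children only).

```cpp
#include <list>
#include <string>
#include <vector>

// Minimal stand-in for an element tree (not the XercesC DOM).
struct Element {
    std::string name;
    std::vector<Element> children;
};

// Collect elements named `target` beneath `root`. A match is recorded but its
// sub-tree is not searched; with recurse == false only immediate children of
// `root` are examined.
void findBelow(const Element& root, const std::string& target, bool recurse,
               std::list<const Element*>& found) {
    for (const Element& child : root.children) {
        if (child.name == target) {
            found.push_back(&child);              // match: stop descending here
        } else if (recurse) {
            findBelow(child, target, recurse, found);
        }
    }
}

std::list<const Element*> findElementSketch(const Element& root,
                                            const std::string& target,
                                            bool recurse = true) {
    std::list<const Element*> found;
    findBelow(root, target, recurse, found);
    return found;
}

// ORDERS
//   CUSTOMER
//     ADDRESS
//     CUSTOMER   (below a match, so a CUSTOMER search never reaches it)
//   GROUP
//     CUSTOMER
Element sampleTree() {
    Element address{"ADDRESS", {}};
    Element nested{"CUSTOMER", {}};
    Element customer1{"CUSTOMER", {address, nested}};
    Element customer2{"CUSTOMER", {}};
    Element group{"GROUP", {customer2}};
    return Element{"ORDERS", {customer1, group}};
}
```

On this tree a recursive search for CUSTOMER finds the two outer CUSTOMER elements but not the nested one, and a non-recursive search finds only the one directly under ORDERS. A follow-up non-recursive search for ADDRESS under each CUSTOMER gives the two-step, position-based lookup mentioned earlier.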

I also plan on adding functions to search based on sequences of names, so you could directly search for ADDRESS elements underneath CUSTOMER elements. The code's more complex, though, and I didn't want to hack together something that wouldn't work well or had an inconvenient way of specifying sequences in combination with the list-of-names form. I'd rather take time to think about it first.

Getting information about elements/nodes

Prototypes:

  • std::string getName( xercesc::DOMNode *node )
  • std::string getNamespacePrefix( xercesc::DOMNode *node )
  • std::string getNamespaceURI( xercesc::DOMNode *node )
  • std::string getValue( xercesc::DOMNode *node )
  • std::list<xercesc::DOMElement *> getChildren( xercesc::DOMElement *node )
  • std::list<xercesc::DOMAttr *> getAttributes( xercesc::DOMElement *node )
  • bool hasAttribute( xercesc::DOMElement *node, const std::string& attr_name )
  • xercesc::DOMAttr *getAttribute( xercesc::DOMElement *node, const std::string& attr_name )
  • std::string getAttributeValue( xercesc::DOMElement *node, const std::string& attr_name )

getName(), getNamespacePrefix() and getNamespaceURI() return the obvious information about a node's name. Commonly the node will be an element or attribute. Other types of nodes, eg. text content, don't have names and these functions will return the empty string if called on them.

getValue() returns the node's contents, which varies depending on the type of node. For elements it's the contained text content. For attributes it's the attribute value. For text and CDATA nodes it's the text data of the node. Note that for elements it does not include child nodes or their text content, only text content that's directly within the element itself.
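
As an illustration of the "direct text only" rule for elements, here's a small sketch on a stand-in node type. Node and directText() are hypothetical, and joining several text pieces with no separator is an assumption about behaviour the description above doesn't spell out.

```cpp
#include <string>
#include <vector>

// Stand-in node type (not the XercesC DOM): either an element or a text node.
struct Node {
    enum Kind { ElementNode, TextNode };
    Kind kind;
    std::string data;            // element name, or the text of a text node
    std::vector<Node> children;  // only used by elements
};

// Text directly inside `element`: immediate text children only. Text inside
// child elements is ignored. Joining the pieces with no separator is assumed.
std::string directText(const Node& element) {
    std::string text;
    for (const Node& child : element.children) {
        if (child.kind == Node::TextNode)
            text += child.data;
    }
    return text;
}

// Models <P>Hello <B>big</B> world</P>
Node sampleParagraph() {
    Node hello{Node::TextNode, "Hello ", {}};
    Node bold{Node::ElementNode, "B", {Node{Node::TextNode, "big", {}}}};
    Node world{Node::TextNode, " world", {}};
    return Node{Node::ElementNode, "P", {hello, bold, world}};
}
```

Here the value of P under this rule is "Hello " followed by " world", while "big" belongs to the B child and only shows up when you ask for B's value.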

getChildren() returns a list of the child elements of an element. It only returns elements, it skips non-element children (eg. text content).

getAttributes() returns the list of attribute nodes for an element. hasAttribute() returns a flag indicating whether the attribute is present. getAttribute() returns a single attribute, or the NULL pointer if the attribute isn't present. getAttributeValue() returns the value of an attribute, or the empty string if the attribute wasn't present. Note that if you use getAttributeValue() and need to distinguish between "attribute not present" and "attribute present but with no value" you'll need to also use hasAttribute(). Most of the time attributes are either present/not-present (no value that you care about) or has-value/no-value (with not-present being functionally equivalent to no-value), so you shouldn't need the extra check often.
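
The three-way distinction can be sketched as follows. The Attributes map and the two lookup functions are stand-ins that only mimic the behaviour described above (empty string for a missing attribute); classify() and AttributeState are hypothetical names for illustration.

```cpp
#include <map>
#include <string>

// Stand-in for an element's attributes (not the XercesC DOM).
using Attributes = std::map<std::string, std::string>;

bool hasAttributeSketch(const Attributes& attrs, const std::string& name) {
    return attrs.find(name) != attrs.end();
}

// Mimics the described behaviour: empty string when the attribute is absent.
std::string getAttributeValueSketch(const Attributes& attrs, const std::string& name) {
    Attributes::const_iterator it = attrs.find(name);
    return it == attrs.end() ? std::string() : it->second;
}

enum AttributeState { NotPresent, PresentButEmpty, HasValue };

AttributeState classify(const Attributes& attrs, const std::string& name) {
    const std::string value = getAttributeValueSketch(attrs, name);
    if (!value.empty())
        return HasValue;
    // An empty string is ambiguous; only the presence check tells the
    // remaining two cases apart.
    return hasAttributeSketch(attrs, name) ? PresentButEmpty : NotPresent;
}
```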

Serializing documents

There's really only one function for serializing XML document trees back into strings of XML text:

  • std::string serializeNodes( xercesc::DOMNode *root, bool pretty_print = false )

It takes the document tree rooted at the given node and outputs it into a string. The pretty_print argument controls whether the XML is indented for human readability. Most parsers don't care whether the XML's pretty-printed, and the indentation does add to the size. About the only place it makes a difference is in whitespace content when you're not using a DTD during parsing. In that case the whitespace added for indentation may appear as element content rather than being ignored. If you're using a DTD, the parser knows whether elements can contain text content or not and can ignore whitespace where it doesn't affect content.

Creating documents

Creating documents and DTDs

Prototypes:

  • xercesc::DOMDocumentType *createDTD( std::string name, std::string public_id, std::string system_id )
  • xercesc::DOMDocument *createDocument( std::string ns_uri, std::string root_name, xercesc::DOMDocumentType *dtd = NULL )

The basic functions to create an empty document and, if needed, a DTD. Note that you don't actually populate a DTD directly. Its contents are loaded as you create it, based on the public and system ID strings. Usually that means the DTD's loaded from a URL or from a copy of the DTD already in the system DTD library. Read the XercesC library documentation about DTDs if you need to work with them. A lot of the time all you'll need is the empty document and you won't need to associate a DTD with it.

Creating text and CDATA content

Prototypes:

  • xercesc::DOMText *createText( xercesc::DOMDocument *doc, const std::string& text )
  • xercesc::DOMCDATASection *createCDATA( xercesc::DOMDocument *doc, const std::string& text, bool base64_encode = false, bool base64_split = true )
  • xercesc::DOMEntityReference *createEntity( xercesc::DOMDocument *doc, std::string name )

Create either a standard text-content node (parsed character data) or a CDATA node (unparsed data). createText() is likely to be the most common one, suitable for ordinary plain-text content. createCDATA() would be used when you need to include content that can't safely be parsed as XML. One common use is for holding encoded binary data, eg. base64-encoded data, hence the flag to let you automatically base64-encode the text in the process of creating the node. Note that the base64_split flag, which nominally controls whether base64-encoded content is split into lines or left as a single very long string, is ignored at the moment (the XercesC base64-encoding functions don't allow unsplit output and I haven't gotten around to writing code to undo the line-splitting). createEntity() creates an XML entity reference from the entity name, for when you need to create text content that includes embedded entities.

Creating elements

Prototypes:

  • xercesc::DOMElement *createElement( xercesc::DOMDocument *doc, std::string name )
  • void setContent( xercesc::DOMElement *node, const std::string& value )
  • void setCDATAContent( xercesc::DOMElement *node, const std::string& cdata, bool base64_encode = false, bool base64_split = true )
  • xercesc::DOMElement *createElement( xercesc::DOMDocument *doc, std::string name, const std::string& content, bool use_cdata = false )
  • void setAttributeValue( xercesc::DOMElement *node, std::string name, const std::string& value )
  • void setAttributeValue( xercesc::DOMElement *node, std::string name )

These are the functions you'll more commonly use to create elements and set attributes and content on them. The first form of createElement() creates an empty element with no content, which you'd use setContent() or setCDATAContent() on to attach text or CDATA content you'd created with createText() or createCDATA(). The second form of createElement() lets you pass in a string to set as the content of the element. Normally it'll be ordinary text, but you can flag it as CDATA content. I need to add flags for base64-encoding of CDATA content.

setAttributeValue() lets you set an attribute on an element. The first form sets a value for the attribute, the second allows you to set an attribute without a value (the attribute's just a flag).

Manipulating child nodes

Prototypes:

  • void addChild( xercesc::DOMNode *parent, xercesc::DOMNode *child )
  • void addChildren( xercesc::DOMNode *parent, const std::list<xercesc::DOMNode *>& children )
  • void addSibling( xercesc::DOMNode *location, xercesc::DOMNode *node, bool before = false )
  • void addSiblings( xercesc::DOMNode *location, const std::list<xercesc::DOMNode *>& nodes, bool before = false )
  • xercesc::DOMNode *removeChild( xercesc::DOMNode *parent, xercesc::DOMNode *child )
  • std::list<xercesc::DOMNode *> removeChildren( xercesc::DOMNode *parent, const std::list<xercesc::DOMNode *>& children )

addChild() and addChildren() add child nodes directly to a parent. They append the new children to the parent's list of children, so the new children will appear after any existing children. If you need to insert into the middle of the list of children you should use getChildren(), locate the position where you want to modify the list of children and use addSibling() or addSiblings() to add siblings of the relevant child nodes.

addSibling() and addSiblings() add sibling nodes to a given node, either before or after the location node in question.
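
The difference between appending and inserting next to a location node can be shown on a plain list. This is a sketch with stand-ins only: Children models a parent's ordered child list, and appendChildSketch()/insertSiblingSketch() are hypothetical functions that mirror the before/after behaviour described above, not the library's code.

```cpp
#include <algorithm>
#include <list>
#include <string>

// Stand-in: a parent's children as an ordered list of names.
using Children = std::list<std::string>;

// Append at the end, the way addChild() is described.
void appendChildSketch(Children& children, const std::string& node) {
    children.push_back(node);
}

// Insert `node` next to `location`: before it, or (by default) after it.
// Returns false if `location` is not among the children.
bool insertSiblingSketch(Children& children, const std::string& location,
                         const std::string& node, bool before = false) {
    Children::iterator pos = std::find(children.begin(), children.end(), location);
    if (pos == children.end())
        return false;
    if (!before)
        ++pos;                       // list::insert places the item before pos
    children.insert(pos, node);
    return true;
}
```

Starting from children A and C, inserting B before C and then D after C gives A, B, C, D, and a later append always lands at the end.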

removeChild() and removeChildren() remove child elements from a node. The removed elements are returned, and it's the caller's responsibility to deallocate them when necessary. Note that removeChildren() has an issue with exceptions: if some children are successfully removed and then an exception occurs, the list of successfully-removed children can't be returned but we can't safely undo the removals either. The children that were removed have to be released, and even though removeChildren() didn't complete successfully those child nodes will have been removed from the parent. This isn't ideal, but it's the only way that's safe from memory leaks. If you need absolute control, use removeChild() to remove one child node at a time and handle errors as they occur.
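
Removing one child at a time might look like the sketch below. It's self-contained, so the child list and removeChildSketch() are stand-ins (the stand-in throws when asked to remove something that isn't a child, which is only an assumed failure mode), and RemovalResult and removeEach() are hypothetical names.

```cpp
#include <algorithm>
#include <list>
#include <stdexcept>
#include <string>

// Stand-in: a parent's children as an ordered list of names.
using Children = std::list<std::string>;

// Stand-in for removeChild(): detaches and returns the child, or throws if
// `child` is not actually a child of the parent.
std::string removeChildSketch(Children& children, const std::string& child) {
    Children::iterator pos = std::find(children.begin(), children.end(), child);
    if (pos == children.end())
        throw std::runtime_error("not a child: " + child);
    std::string removed = *pos;
    children.erase(pos);
    return removed;
}

struct RemovalResult {
    Children removed;   // detached nodes the caller is now responsible for
    Children failed;    // nodes that could not be removed
};

// Remove one at a time so a failure part-way through loses track of nothing.
RemovalResult removeEach(Children& children, const Children& to_remove) {
    RemovalResult result;
    for (const std::string& child : to_remove) {
        try {
            result.removed.push_back(removeChildSketch(children, child));
        } catch (const std::runtime_error&) {
            result.failed.push_back(child);   // handle the error and carry on
        }
    }
    return result;
}
```

The caller ends up with both the nodes that were detached and the ones that failed, which is exactly the information the all-at-once call can't hand back after an exception.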