XmlSAXUtil

As with XmlDOMUtil, the first thing you'll need to do is call XmlSAXUtil::Initialize() to initialize the library. And when you're done, you'll call XmlSAXUtil::Terminate() to clean things up and shut down. The same caveats apply as for XmlDOMUtil.

The exception class here is XmlSAXException, and as with XmlDOMException you probably won't need to worry about it directly. You'll be catching it, but you won't usually be creating new exceptions, and if you do you'll normally just be using the single-string-argument constructor.

The basic usage is that you define one or more element handler classes derived from XmlSAXElementHandler. Their function-call operator implements the handler code for XML elements. In your program you'll create an XmlSAXHandler object, create element handler objects based on the element handler classes you defined, and use the XmlSAXHandler::mapElementHandler() method to assign element handler objects to element names. getParser() returns a parser object, and then you can use parseDocument() or parseDocumentFile() to parse the document. As elements are encountered, the function-call operator of the element handler object associated with that element name will be called. Technically the call happens as part of the XercesC endElement() call, occurring when the closing tag is encountered.

A very simple program which just echoes out element names and their attributes would look like this:

#include <string.h>
#include <iostream>
#include <list>
#include <string>
#include "XmlSAXUtil.h"
#include <xercesc/sax2/SAX2XMLReader.hpp>
class MyElementHandler : public XmlSAXUtil::XmlSAXElementHandler
{
public:
    virtual void operator()( XmlSAXUtil::XmlSAXUserData *user_data, const std::string& uri,
                             const std::string& localname, const std::string& qname,
                             const std::list<XmlSAXUtil::XmlSAXAttribute>& attrs, const std::string& contents );
};
void MyElementHandler::operator()( XmlSAXUtil::XmlSAXUserData *user_data, const std::string& uri,
                                   const std::string& localname, const std::string& qname,
                                   const std::list<XmlSAXUtil::XmlSAXAttribute>& attrs, const std::string& contents )
{
    std::cout << "Element: " << uri << ":" << localname << " (" << qname << ")" << std::endl;
    std::list<XmlSAXUtil::XmlSAXAttribute>::const_iterator it = attrs.begin();
    while ( it != attrs.end() )
    {
        std::cout << "Attribute: " << it->uri << ":" << it->localname << "=" << it->value << std::endl;
        it++;
    }
    if ( !contents.empty() )
    {
        std::cout << "    Contents:" << std::endl;
        std::cout << contents << std::endl;
    }
    std::cout << std::endl;
    return;
}
int main( int argc, const char *argv[] )
{
    XmlSAXUtil::Initialize();
    XmlSAXUtil::XmlSAXHandler *handler = new XmlSAXUtil::XmlSAXHandler();
    MyElementHandler *eh = new MyElementHandler;
    handler->mapElementHandler( "Root", eh );
    handler->mapElementHandler( "Element1", eh );
    handler->mapElementHandler( "Element2", eh );
    handler->mapElementHandler( "Child1", eh );
    handler->mapElementHandler( "Child2", eh );
    xercesc::SAX2XMLReader *parser = XmlSAXUtil::getParser( handler );
    if ( ( argc > 1 ) && argv[1] && ( strlen( argv[1] ) > 0 ) )
    {
        std::cout << "Parsing test started, document " << argv[1] << std::endl;
        XmlSAXUtil::parseDocumentFile( parser, argv[1] );
        std::cout << "Parsing test done" << std::endl;
    }

    delete eh;
    delete parser;
    delete handler;
    XmlSAXUtil::Terminate();
    return 0;
}

In general, SAX parsing is harder to get your head around than DOM parsing. With DOM parsing you have a static document tree and all you're doing is basic tree and list navigation, something that you probably got down pat during high-school programming classes. SAX parsing by comparison happens on-the-fly and you have to think about it bottom-up. As you can see from looking at the output of the above program, the element handlers for an element's children are called before the element handler for the element itself, so instead of starting at the top and navigating down as you do with DOM you need to save the results of the lowest-level elements somewhere and retrieve them when the higher-level elements are processed.

That's what the XmlSAXUserData class is for. If you look at its definition, you'll find that the class itself has no data members and its only methods, Reset() and Done(), do nothing but return. It's a placeholder. You'd derive a new class from it, adding data members to hold the information you need to store during parsing, and override the Reset() and Done() methods with your own operations. Reset() simply clears the data and makes the object ready for a new parse, and is called during startDocument() processing. Done() does whatever's appropriate when a document is done being parsed, and is called during endDocument() processing. You can associate a user data object with an XmlSAXHandler during construction or via the setUserData() method before a document is parsed, and retrieve a pointer to the user data object with getUserData() if you need it. For simple cases the user data Done() method would take the data members populated during parsing and fill in program variables with the relevant information. For more complex cases you'd leave Done() empty and explicitly retrieve the user data object after parsing and use its contents to do whatever needs doing.

If you've done recursive tree processing, or if you're familiar with lex/yacc programming, you've already encountered a lot of the idioms you're going to need for SAX XML parsing.
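
To make the bottom-up idea concrete, here's a minimal sketch of a user-data class plus two handlers: one for a child element that stores what it saw, and one for the parent that picks the stored results up afterwards. The XmlSAXUserData and XmlSAXElementHandler definitions in it are simplified stand-ins so the sketch compiles on its own (the real ones come from XmlSAXUtil.h, and the real function-call operator takes more arguments). OrderData, ItemHandler, OrderHandler and the Order/Item element names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for the library classes, NOT the real definitions.
struct XmlSAXUserData
{
    virtual ~XmlSAXUserData() {}
    virtual void Reset() {}
    virtual void Done() {}
};

struct XmlSAXElementHandler
{
    virtual ~XmlSAXElementHandler() {}
    // Cut-down signature: the real operator also gets uri, qname and attrs.
    virtual void operator()( XmlSAXUserData *user_data, const std::string& localname,
                             const std::string& contents ) = 0;
};

// Hypothetical state for a document shaped like
// <Order><Item>..</Item><Item>..</Item></Order>.
struct OrderData : public XmlSAXUserData
{
    std::vector<std::string> pending_items; // filled in by the Item handler
    std::size_t item_count;                 // filled in by the Order handler

    OrderData() : item_count( 0 ) {}
    virtual void Reset() { pending_items.clear(); item_count = 0; }
};

// Child handler: runs first and stashes its result in the user data.
struct ItemHandler : public XmlSAXElementHandler
{
    virtual void operator()( XmlSAXUserData *user_data, const std::string&,
                             const std::string& contents )
    {
        OrderData *data = dynamic_cast<OrderData *>( user_data );
        assert( data != NULL ); // only meaningful with an OrderData attached
        data->pending_items.push_back( contents );
    }
};

// Parent handler: runs after all of its children and uses what they stored.
struct OrderHandler : public XmlSAXElementHandler
{
    virtual void operator()( XmlSAXUserData *user_data, const std::string&,
                             const std::string& )
    {
        OrderData *data = dynamic_cast<OrderData *>( user_data );
        assert( data != NULL );
        data->item_count = data->pending_items.size();
    }
};
```

During a real parse the library would make these calls for you, in closing-tag order: every Item handler call for an order happens before the Order handler call that consumes the collected items.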

Note that the user-data approach here isn't the only possible one. You can accomplish something similar by adding data members to classes derived from XmlSAXElementHandler and XmlSAXHandler and providing appropriate references so objects can find each other. It complicates the code, though, and IMO yields clunky hard-to-follow code that's best avoided. By separating the state data out into its own class you get a better view of all the state, and parent elements can simply refer to user-data fields populated by their children without having to hang onto references to other handler objects. Since I find the user-data approach to be cleaner with easier-to-follow code, I use it in this library. It's not mandatory, though: nothing in the library stops you from deriving classes from the handler classes and decorating them with additional data and functionality.

XmlSAXUserData

A placeholder class for holding state data during parsing. You won't use this class directly; you derive from it and add your own data members and methods. When you create an XmlSAXHandler object you can associate a user data object (one of a class derived from XmlSAXUserData) with it. A pointer to that user data object is then passed to the XmlSAXElementHandler objects when they're called during parsing. They can store state information in the user data object, and retrieve stored state so that, e.g., parent element handlers know about the results of parsing their child elements.

Constructors

Prototypes:

  • XmlSAXUserData()

Methods

Prototypes:

  • Reset()
  • Done()

Reset() is called from the startDocument() handler, when parsing of a document starts. That allows user data objects to reset their state to whatever's appropriate to begin parsing, without the user having to explicitly code for it. Done() is called from the endDocument() handler, when parsing is complete, to allow the object to finalize things. The default implementations do nothing. I expect most commonly you'll only need to implement Reset(). Finalization would happen in the element handler for the root element, and once the parseDocument() call returns the calling code would access the user data object to get the stored information. It's possible to do more complex handling though, e.g. having Done() marshal the information and trigger a new thread to finish processing it, leaving the main thread to merely loop through parsing documents with no explicit handling of the results. Or you could have something intermediate, say having Done() walk collections of result objects creating linkages between them (which would be hard to do during parsing when the collections are incomplete) before handing control back to the main code.
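
As a rough illustration of the "intermediate" case, here's a sketch where Done() links result objects together once the whole collection is known. XmlSAXUserData is a simplified stand-in so the sketch is self-contained, and Employee, StaffData and the id/manager fields are invented for the example. In a real program the element handlers would add to the collection during parsing and the library would call Reset() and Done() itself.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for the library class, NOT the real definition.
struct XmlSAXUserData
{
    virtual ~XmlSAXUserData() {}
    virtual void Reset() {}
    virtual void Done() {}
};

// Hypothetical result object: a record that names its manager by id.
struct Employee
{
    std::string id;
    std::string manager_id;  // as read from the document, may be empty
    const Employee *manager; // linkage, filled in by Done()

    Employee( const std::string& i, const std::string& m )
        : id( i ), manager_id( m ), manager( NULL ) {}
};

struct StaffData : public XmlSAXUserData
{
    std::vector<Employee> staff; // element handlers would push_back here

    // Called at the start of each document: throw away the previous results.
    virtual void Reset() { staff.clear(); }

    // Called at the end of the document: the collection is complete, so ids
    // can now be resolved to pointers. The pointers stay valid only as long
    // as the vector isn't modified afterwards.
    virtual void Done()
    {
        std::map<std::string, const Employee *> by_id;
        for ( std::size_t i = 0; i < staff.size(); ++i )
            by_id[staff[i].id] = &staff[i];

        for ( std::size_t i = 0; i < staff.size(); ++i )
        {
            std::map<std::string, const Employee *>::const_iterator it =
                by_id.find( staff[i].manager_id );
            if ( it != by_id.end() )
                staff[i].manager = it->second;
            else
                staff[i].manager = NULL; // no such id, e.g. the top of the tree
        }
    }
};
```

Doing the linking in Done() rather than in a handler means the order in which employees appear in the document doesn't matter.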

XmlSAXException

Derived from the XmlUtil::Exception class, it holds exception information specific to SAX parsing. The only additions it makes to the class are the additional constructors for creating XmlSAXException objects from the various types of exceptions thrown by the XercesC SAX routines.

Constructors

Prototypes:

  • XmlSAXException( const xercesc::SAXException& )
  • XmlSAXException( const xercesc::SAXParseException& )

Normally you won't have to worry about these, they'll be used within the library to create the exception objects. You can of course use them if you're doing things directly with the XercesC routines and want to take exceptions they throw and convert them into a common type of exception to simplify things, but that's about the only time you'll need to construct exception objects.

Additional type strings supported by this class are "SAX" and "SAX parse", coming from the two exception types supported by the constructors. Neither has an associated code, so the code will always be zero. The message for "SAX parse" exceptions includes the line and column numbers where the exception occurred followed by the actual message text.

XmlSAXAttribute

This is a struct, not a class; it just has data members you access directly. This structure exists to hold the list of attributes of an element during the XmlSAXElementHandler call to process an element during parsing. You probably won't need to create XmlSAXAttribute objects yourself, just access their contents when processing an element. To allow for operations on the attribute list like searches, the full complement of comparison operators is implemented.

Data members

  • std::string uri
  • std::string localname
  • std::string value

They hold, respectively, the namespace URI, the attribute's unqualified name and the attribute's value.

Constructors

  • XmlSAXAttribute()
  • XmlSAXAttribute( const XmlUtil::XmlStr& uri, const XmlUtil::XmlStr& localname, const XmlUtil::XmlStr& value )
  • XmlSAXAttribute( const std::string& uri, const std::string& localname, const std::string& value )
  • XmlSAXAttribute( const XMLCh * const uri, const XMLCh * const localname, const XMLCh * const value )

The default constructor populates the data fields with empty strings. The other 3 constructors populate them with the values passed in.
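
Inside an element handler the usual job is pulling one attribute out of the list by name. Here's a small sketch of a helper for that. The XmlSAXAttribute definition is a stand-in with the same three public members and only two of the constructors, so the sketch compiles without the library, and attributeValue() is a hypothetical helper, not something the library provides.

```cpp
#include <cassert>
#include <list>
#include <string>

// Stand-in with the same public data members; the real struct comes from
// XmlSAXUtil.h and has more constructors plus comparison operators.
struct XmlSAXAttribute
{
    std::string uri;
    std::string localname;
    std::string value;

    XmlSAXAttribute() {}
    XmlSAXAttribute( const std::string& u, const std::string& n, const std::string& v )
        : uri( u ), localname( n ), value( v ) {}
};

// Hypothetical helper: return the value of the first attribute with the given
// unqualified name, or fallback if the element doesn't carry that attribute.
// The namespace URI is ignored here to keep the sketch short.
std::string attributeValue( const std::list<XmlSAXAttribute>& attrs,
                            const std::string& localname,
                            const std::string& fallback = std::string() )
{
    std::list<XmlSAXAttribute>::const_iterator it;
    for ( it = attrs.begin(); it != attrs.end(); ++it )
    {
        if ( it->localname == localname )
            return it->value;
    }
    return fallback;
}
```

A handler could then write something like attributeValue( attrs, "id" ) instead of repeating the loop for every attribute it cares about.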

XmlSAXElementHandler

The basic class for handling the occurrence of an element during parsing. You create a class derived from XmlSAXElementHandler to handle occurrences of elements. During parsing the endElement() handler in XmlSAXHandler looks up the name of the current element to find the element handler associated with it and calls the function-call operator on the handler object. Your derived class implements that operator to process the contents of the element and its attributes and update the user data object with the information collected.

Constructor

Prototype:

  • XmlSAXElementHandler()

The base class doesn't need anything in its constructor. Your derived classes normally won't contain any additional data members, so a default constructor suffices. If you add data members to derived classes, you'll need to implement appropriate constructors for them.

Function call method

Prototype:

  • void operator()( XmlSAXUserData *user_data, const std::string& uri, const std::string& localname, const std::string& qname, const std::list<XmlSAXAttribute>& attrs, const std::string& contents )

The uri, localname and qname arguments contain the namespace URI and unqualified name of the element and the qualified (namespace prefix, colon, local name) form of the element name. If namespaces aren't in use or the element isn't in a namespace, the namespace URI will be empty and the qualified name will be identical to the unqualified name. The attrs argument is a list of attribute objects associated with the element, and the contents argument is the non-element contents. Child elements aren't included in the contents. They're handled by their own XmlSAXElementHandler objects, and if their contents are needed by their parent they're expected to fill in the user data object with the relevant information.
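
The one piece of the name a handler isn't handed separately is the namespace prefix. If you ever need it, it can be recovered from qname. The sketch below shows how the three name forms relate; splitQName() is a hypothetical helper written for this example, not part of the library.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: split a qualified name of the form "prefix:local"
// into (prefix, local). A name without a colon yields an empty prefix and
// comes back unchanged as the local part, matching the case where the
// qualified and unqualified names are identical.
std::pair<std::string, std::string> splitQName( const std::string& qname )
{
    const std::string::size_type colon = qname.find( ':' );
    if ( colon == std::string::npos )
        return std::make_pair( std::string(), qname );
    return std::make_pair( qname.substr( 0, colon ), qname.substr( colon + 1 ) );
}
```

For an element written as xs:element the second half of the pair should match the localname argument, while the uri argument carries whatever namespace URI the xs prefix was bound to.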

One thing I need to do is add startElement() handling to this class. Most parsing only needs to handle endElement() where the attributes and contents are available, but having the ability to trigger resetting of the necessary user data fields during startElement() would simplify user code. I'll probably do that by a similar function-call operator with a different method signature (startElement() doesn't have the contents yet, leaving that off would be the obvious way to distinguish the two methods).

XmlSAXHandler

Constructors

Prototypes:

  • XmlSAXHandler()
  • XmlSAXHandler( XmlSAXUserData *ud )

The only difference between these two is that the first doesn't attach a user-data object to the handler and the second does. There's also the standard copy constructor and assignment operator defined.

User data

Prototypes:

  • void setUserData( XmlSAXUserData * )
  • XmlSAXUserData *getUserData()

Attach a new user-data object to the handler and get the currently-attached user-data object.

Element handling

Prototypes:

  • void mapElementHandler( const std::string& localname, XmlSAXElementHandler *fn )
  • void mapElementHandler( const std::string& uri, const std::string& localname, XmlSAXElementHandler *fn )

These map an element handler object to an element name. When the named element is encountered, the function-call operator of the element handler object will be called with information about the element. The first form maps a handler to an unqualified name, the second to a namespace-URI-qualified name.

SAX event handlers

Prototypes:

  • virtual void startDocument()
  • virtual void endDocument()
  • virtual void characters( const XMLCh * const chars, const XMLSize_t length )
  • virtual void ignorableWhitespace( const XMLCh * const chars, const XMLSize_t length )
  • virtual void startCDATA()
  • virtual void endCDATA()
  • virtual void startElement( const XMLCh * const uri, const XMLCh * const localname, const XMLCh * const qname, const xercesc::Attributes& attrs )
  • virtual void endElement( const XMLCh * const uri, const XMLCh * const localname, const XMLCh * const qname )

These are the low-level callbacks associated with raw XercesC SAX parsing events. The ignorableWhitespace(), startCDATA() and endCDATA() callbacks don't do anything. The others have code to perform necessary functions and invoke the element handler objects at appropriate times. startDocument() and endDocument() initialize and reset internal state, and call the user-data Reset() and Done() methods. characters() appends the character data to the current element-contents string. startElement() pushes the current element contents onto a stack, and builds a list of attributes and pushes that onto a stack for use by endElement(). endElement() searches for an element handler for the current element in the map, and if it finds one invokes it with the current element contents and the attribute list from the top of the stack. When done it pops the attributes off the stack, and pops the previous element contents off the stack, replacing the current contents. When that's done the stacks and contents are in the correct state for the parent element's endElement() call.
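
The stack bookkeeping is easier to follow in miniature. The sketch below is a toy model of the mechanism as described, not the library's actual code: names and character data are plain std::string, attributes are reduced to name/value pairs, and instead of invoking handler objects it just records what each handler would have been called with. It also assumes startElement() begins a fresh, empty contents string after pushing the parent's, which the description implies but doesn't state outright. MiniHandler and everything in it are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Toy model of the contents/attribute stacks; hypothetical, for illustration.
class MiniHandler
{
public:
    typedef std::vector<std::pair<std::string, std::string> > AttrList;

    struct Call // what one element handler invocation would have received
    {
        std::string name;
        std::string contents;
        std::size_t attr_count;
    };

    std::vector<Call> calls;

    void mapElement( const std::string& name ) { mapped_.insert( name ); }

    void startDocument()
    {
        current_.clear();
        contents_stack_ = std::stack<std::string>();
        attrs_stack_ = std::stack<AttrList>();
        calls.clear();
    }

    void startElement( const std::string& /*name*/, const AttrList& attrs )
    {
        contents_stack_.push( current_ ); // park the parent's text so far
        current_.clear();                 // assumption: child starts empty
        attrs_stack_.push( attrs );       // kept for the matching endElement()
    }

    void characters( const std::string& chars ) { current_ += chars; }

    void endElement( const std::string& name )
    {
        assert( !attrs_stack_.empty() && !contents_stack_.empty() );
        if ( mapped_.count( name ) != 0 )
        {
            Call call = { name, current_, attrs_stack_.top().size() };
            calls.push_back( call );
        }
        attrs_stack_.pop();
        current_ = contents_stack_.top(); // back to the parent's text
        contents_stack_.pop();
    }

private:
    std::set<std::string> mapped_;
    std::string current_;
    std::stack<std::string> contents_stack_;
    std::stack<AttrList> attrs_stack_;
};
```

Feeding it the events for a parent element with text on both sides of a child shows the effect: the child's call comes first with only its own text, and the parent's call comes second with its own text from before and after the child joined together.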

If you override these methods to add your own code, you'll need to make sure you call the XmlSAXHandler methods so their work gets done. You'll need to understand the implementation code to get this right. The general idea though is that you won't normally need to override these methods. If you're getting into things complex enough to need that, you probably want to work directly with XercesC instead of this library.

Library startup and termination routines

Prototypes:

  • void Initialize()
  • void Terminate()

These routines are what you'd expect. They call the underlying XmlUtil namespace routines to handle its startup and termination, and take care of SAX-specific initialization and cleanup. They include flags to prevent problems if the routines are called more than once. The library needs to be initialized before any functions can be used, and should be terminated before your program exits so cleanup happens. Don't try to use library functions after calling Terminate(): memory associated with parsers and such will have been freed and won't be accessible.
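
If you'd rather not pair the two calls by hand, a small scope guard can do it for you. The sketch below is one way that might look. Initialize() and Terminate() here are stand-ins living in a made-up Sketch namespace, written with the kind of guard flag mentioned above so the example compiles and can be checked without the library. The counters and the LibraryScope class are hypothetical.

```cpp
#include <cassert>

namespace Sketch
{
    // Stand-ins for the library routines. The flag makes repeat calls
    // harmless; the counters only exist so the behaviour can be observed.
    bool initialized = false;
    int  setup_runs = 0;
    int  cleanup_runs = 0;

    void Initialize()
    {
        if ( initialized )
            return;            // a second call is a no-op
        ++setup_runs;          // stands in for the real startup work
        initialized = true;
    }

    void Terminate()
    {
        if ( !initialized )
            return;            // nothing to clean up
        ++cleanup_runs;        // stands in for the real cleanup work
        initialized = false;
    }

    // Hypothetical scope guard: ties the two calls to a block so early
    // returns and exceptions can't skip the cleanup.
    class LibraryScope
    {
    public:
        LibraryScope() { Initialize(); }
        ~LibraryScope() { Terminate(); }
    private:
        LibraryScope( const LibraryScope& );            // not copyable
        LibraryScope& operator=( const LibraryScope& ); // not assignable
    };
}
```

With the real library you'd declare the guard at the top of main(), before creating any handlers or parsers, so that everything using the library is destroyed before the guard's destructor runs.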

Parsing documents

Parser creation and destruction

Prototypes:

  • xercesc::SAX2XMLReader *getParser( XmlSAXHandler *handler = NULL )
  • void setHandler( xercesc::SAX2XMLReader *parser, XmlSAXHandler *handler )

getParser() returns a XercesC SAX parser for later use. If a pointer to a handler is passed in, the parser is initialized to use that handler. If you don't initialize the parser with a handler, you'll need to set one later using the setHandler() function. Because of the way SAX works there's no need for a special function to destroy a parser; you just delete the object returned by getParser().

As with DOM, there's an XmlSAXParser class to help make exception-safe handling of parsers easier. This class doesn't do anything itself. Objects of this class can't be copied or assigned. The default constructor gets a parser using getParser(), and the destructor deletes the saved parser (if any). A second constructor takes two arguments: a pointer to a handler object and a boolean flag. This constructor initializes the parser with the supplied handler, and the flag controls whether the handler object will be deleted when the XmlSAXParser object is destroyed (set to true) or whether that job is left up to the application code (set to false, the default value). You'd set the flag true if the handler is dynamically allocated and needs to be cleaned up, and set it to false or omit the flag entirely if the handler's a local variable or other object that doesn't need to be explicitly deleted.

The no-arguments function-call operator returns a pointer to the parser just like you'd get from getParser(). The class's only purpose is to make it easier to keep parsers cleaned up by giving you a way to allocate them in a normal stack-based variable that'll be automatically cleaned up upon exit from the block it's defined in, even if the exit is because of a thrown exception or some other abnormal exit path.

The class also provides a setHandler() method to change the handler on the parser while taking care of all the internal bookkeeping. It takes a pointer to the new handler and the same boolean delete-handler flag (applicable to the new handler) as the two-argument constructor, and will delete the old handler object if it was set up for deletion before updating things.
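
To show the ownership rules in one place, here's a sketch of a holder class modelled on that description. It is not the library's XmlSAXParser. Parser and Handler are trivial stand-ins with live-object counters so the cleanup can be observed, and ParserHolder is a hypothetical name. The part to look at is the destructor and setHandler().

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins so the sketch is self-contained. The static counters track how
// many objects of each kind currently exist.
struct Handler
{
    static int live;
    Handler() { ++live; }
    ~Handler() { --live; }
};

struct Parser
{
    static int live;
    Handler *handler;
    explicit Parser( Handler *h ) : handler( h ) { ++live; }
    ~Parser() { --live; }
};

int Handler::live = 0;
int Parser::live = 0;

// Hypothetical holder following the behaviour described in the text.
class ParserHolder
{
public:
    ParserHolder()
        : parser_( new Parser( NULL ) ), handler_( NULL ), delete_handler_( false ) {}

    explicit ParserHolder( Handler *h, bool delete_handler = false )
        : parser_( new Parser( h ) ), handler_( h ), delete_handler_( delete_handler ) {}

    ~ParserHolder()
    {
        delete parser_;
        if ( delete_handler_ )
            delete handler_; // only when ownership was handed over
    }

    // Same pointer you'd otherwise have obtained directly.
    Parser *operator()() const { return parser_; }

    void setHandler( Handler *h, bool delete_handler = false )
    {
        if ( delete_handler_ && handler_ != h )
            delete handler_; // old handler was owned, so clean it up
        handler_ = h;
        delete_handler_ = delete_handler;
        parser_->handler = h;
    }

private:
    ParserHolder( const ParserHolder& );            // not copyable
    ParserHolder& operator=( const ParserHolder& ); // not assignable

    Parser  *parser_;
    Handler *handler_;
    bool     delete_handler_;
};
```

Declared as a local variable, a holder like this cleans up the parser (and an owned handler) when the enclosing block exits, whether normally or through an exception, which is the whole point of the real class.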

Document parsing

Prototypes:

  • void parseDocument( xercesc::SAX2XMLReader *, const std::string& doc, const char *enc_string = NULL )
  • void parseDocumentFile( xercesc::SAX2XMLReader *, const std::string filename, const char *enc_string = NULL )

parseDocument() and parseDocumentFile() are the basic parsing functions. One parses the document from a standard C++ string, the other parses from a named file on disk. The enc_string argument to each is used if the character encoding isn't specified in the XML itself and you need to specify it. This string matches the encoding attribute on the ?xml declaration, and would be something like "UTF-8" or "ISO-8859-15". Often you won't need to specify it since either the XML document will have the encoding attribute present or it'll be in the XML default UTF-8 encoding already.