Relying on URLs to directly identify resources to be retrieved often causes problems for end users:
If they're absolute URLs, they only work when you can reach them[1]. Relying on remote resources makes XML processing susceptible to both planned and unplanned network downtime.
The URL “http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd” isn't very useful if I'm on an airplane at 35,000 feet.
If they're relative URLs, they're only useful in the context where they were initially created.
The URL “../../xml/dtd/docbookx.xml” isn't useful anywhere on my system. Neither, for that matter, is “/export/home/fred/docbook412/docbookx.xml”.
One way to avoid these problems is to use an entity resolver (a standard part of SAX) or a URI Resolver (a standard part of JAXP). A resolver can examine the URIs of the resources being requested and determine how best to satisfy those requests.
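As a rough illustration of those two hooks, the sketch below implements both org.xml.sax.EntityResolver and javax.xml.transform.URIResolver in one small class with hand-written mappings. The class name LocalResolver, its helper methods, the stylesheet URL, and the local file: URIs are all hypothetical; a practical resolver would read its mappings from a catalog rather than from code.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: the class name and the local file: URIs are
// invented for illustration and are not part of the Resolver classes.
class LocalResolver implements EntityResolver, URIResolver {
    private final Map<String, String> byPublicId = new HashMap<>();
    private final Map<String, String> byUri = new HashMap<>();

    LocalResolver mapPublic(String publicId, String localUri) {
        byPublicId.put(publicId, localUri);
        return this;
    }

    LocalResolver mapUri(String uri, String localUri) {
        byUri.put(uri, localUri);
        return this;
    }

    // SAX hook: the parser calls this for each external entity.
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = byPublicId.get(publicId);
        if (local == null) {
            local = byUri.get(systemId);
        }
        // Returning null tells the parser to open the system identifier itself.
        return local == null ? null : new InputSource(local);
    }

    // JAXP hook: the transformer calls this for xsl:import, xsl:include and document().
    @Override
    public Source resolve(String href, String base) {
        String local = byUri.get(href);
        // Returning null lets the processor resolve the URI itself.
        return local == null ? null : new StreamSource(local);
    }

    // Example mappings; every path here is made up.
    static LocalResolver demo() {
        return new LocalResolver()
            .mapPublic("-//OASIS//DTD DocBook XML V4.1.2//EN",
                       "file:///share/doctypes/docbook/xml/docbookx.dtd")
            .mapUri("http://www.example.com/xsl/docbook.xsl",
                    "file:///share/xsl/docbook.xsl");
    }

    public static void main(String[] args) {
        LocalResolver r = demo();
        InputSource dtd = r.resolveEntity(
            "-//OASIS//DTD DocBook XML V4.1.2//EN",
            "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd");
        System.out.println(dtd.getSystemId());
        System.out.println(r.resolve("http://www.example.com/xsl/docbook.xsl", null).getSystemId());
    }
}
```

Returning null from either method hands the request back to the parser or processor, so unmapped identifiers still behave as they did before the resolver was installed.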
The best way to make this function in an interoperable way is to define a standard format for mapping system identifiers and URIs. The OASIS Entity Resolution Technical Committee is defining an XML representation for just such a mapping. These “catalog files” can be used to map public and system identifiers and other URIs to local files (or just other URIs).
The Resolver classes that are described in this article greatly simplify the task of using Catalog files to perform entity resolution. Many users will want to simply use these classes directly “out of the box” with their applications (such as Xalan and Saxon), but developers may also be interested in the JavaDoc API Documentation. The full documentation, current source code, and discussion mailing list are available from the Apache XML Commons project.
The most important change in this release is the availability of both source and binary forms under a generous license agreement.
Other than that, there have been a number of minor bug fixes and the introduction of system properties in addition to the CatalogManager.properties file to control the resolver.
The problems associated with system identifiers (and URIs in general) arise in several ways:
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "file:///n:/share/doctypes/docbook/xml/docbookx.dtd">
Or I remember to change the URI before I publish the document:
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<xsl:import href="/path/to/real/stylesheet.xsl"/>
There are currently two ways that I might reasonably assign an address-independent name to an object: public identifiers or Uniform Resource Names (URNs)[2].
Public identifiers are part of XML 1.0. They can occur in any form of external entity declaration. They allow you to give a globally unique name to any entity. For example, the XML version of DocBook V4.1.2 is identified with the following public identifier:
-//OASIS//DTD DocBook XML V4.1.2//EN
You'll see this identifier in the two doctype declarations I used earlier. This identifier gives no indication of where the resource (the DTD) may be found, but it does uniquely name the resource. That public identifier, now and forever, refers to the XML version of DocBook V4.1.2.
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
Public identifiers don't fit very well into the web architecture (they are not, for example, always valid URIs). This problem can be addressed by the publicid URN namespace defined by RFC 3151.
This namespace allows public identifiers to be easily represented as URNs. The OASIS XML Catalog specification accords special status to URNs of this form so that catalog resolution occurs in the expected way.
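To make the mapping concrete, here is a small sketch of the transcription that RFC 3151 describes: whitespace is normalized, "//" becomes ":", "::" becomes ";", spaces become "+", and a handful of reserved characters are percent-encoded. The class and method names are my own, and the sketch covers only the direction from public identifier to URN.

```java
// Hypothetical helper; transcribes a public identifier into a
// urn:publicid: URN following the rules in RFC 3151.
class PublicIdUrn {
    static String toUrn(String publicId) {
        // Trim the ends and collapse runs of whitespace to a single space.
        String s = publicId.trim().replaceAll("[ \\t\\r\\n]+", " ");
        StringBuilder out = new StringBuilder("urn:publicid:");
        for (int i = 0; i < s.length(); i++) {
            // Two-character sequences are checked before single characters.
            if (s.startsWith("//", i)) { out.append(':'); i++; continue; }
            if (s.startsWith("::", i)) { out.append(';'); i++; continue; }
            char c = s.charAt(i);
            switch (c) {
                case ' ':  out.append('+');   break;
                case '+':  out.append("%2B"); break;
                case ':':  out.append("%3A"); break;
                case '/':  out.append("%2F"); break;
                case ';':  out.append("%3B"); break;
                case '\'': out.append("%27"); break;
                case '?':  out.append("%3F"); break;
                case '#':  out.append("%23"); break;
                case '%':  out.append("%25"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints urn:publicid:-:OASIS:DTD+DocBook+XML+V4.1.2:EN
        System.out.println(toUrn("-//OASIS//DTD DocBook XML V4.1.2//EN"));
    }
}
```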
One important feature of this mechanism is that it can allow resources to be distributed, so you don't have to go to http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd to get the XML version of DocBook V4.1.2, if you have a local copy.
There are a few possible resolution mechanisms:
The application just “knows”. Sure, it sounds a little silly, but this is currently the mechanism being used for namespaces. Applications know what the semantics of namespaced elements are because they recognize the namespace URI.
OASIS Catalog files provide a mechanism for mapping public and system identifiers, allowing resolution to both local and distributed resources. This is the resolution scheme we're going to consider for the balance of this column.
Many other mechanisms are possible. There are already a few for URNs, including at least one built on top of DNS, but they aren't widely deployed.
Example 1. An Example Catalog File
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
          uri="docbook/xml/docbookx.dtd"/>
  <system systemId="urn:x-oasis:docbook-xml-v4.1.2"
          uri="docbook/xml/docbookx.dtd"/>
  <delegatePublic publicIdStartString="-//Example//"
                  catalog="http://www.example.com/catalog"/>
</catalog>
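If you want to see how a catalog like this behaves without installing anything, the javax.xml.catalog package that ships with Java 9 and later reads the same XML format. The sketch below uses that JDK API, not the Resolver classes this article describes. It writes a two-entry catalog to a temporary directory and looks up a public and a system identifier. The class name, the temporary directory, and the omission of the delegatePublic entry (to avoid any network access) are my own choices.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

// Hypothetical demo class built on the JDK's javax.xml.catalog (Java 9+).
class CatalogLookup {
    static final String CATALOG_XML = String.join("\n",
        "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">",
        "  <public publicId=\"-//OASIS//DTD DocBook XML V4.1.2//EN\"",
        "          uri=\"docbook/xml/docbookx.dtd\"/>",
        "  <system systemId=\"urn:x-oasis:docbook-xml-v4.1.2\"",
        "          uri=\"docbook/xml/docbookx.dtd\"/>",
        "</catalog>");

    // Writes the catalog to a temporary file and loads it.
    static Catalog load() {
        try {
            Path dir = Files.createTempDirectory("catalog-demo");
            Path file = dir.resolve("catalog.xml");
            Files.write(file, CATALOG_XML.getBytes(StandardCharsets.UTF_8));
            return CatalogManager.catalog(CatalogFeatures.defaults(), file.toUri());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Catalog catalog = load();
        // Relative uri values are resolved against the catalog file's location.
        System.out.println(catalog.matchPublic("-//OASIS//DTD DocBook XML V4.1.2//EN"));
        System.out.println(catalog.matchSystem("urn:x-oasis:docbook-xml-v4.1.2"));
        // No entry matches this identifier, so the lookup yields null.
        System.out.println(catalog.matchPublic("-//Nobody//DTD Unknown//EN"));
    }
}
```

The lookups never check that the target file exists; the catalog only maps names to URIs, and retrieving the resource is left to whoever asked.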
The OASIS Entity Resolution Technical Committee is actively defining the next generation XML-based catalog file format. When this work is finished, it is expected to become the official XML Catalog format. In the meantime, the existing OASIS Technical Resolution TR9401 format is the standard.
OASIS XML Catalogs are being defined by the Entity Resolution Technical Committee. This article describes the 01 Aug 2001 draft. Note that this draft is labelled to reflect that it is “not an official committee work product and may not reflect the consensus opinion of the committee.”
The document element for OASIS XML Catalogs is catalog. The official namespace name for OASIS XML Catalogs is “urn:oasis:names:tc:entity:xmlns:xml:catalog”.
There are eight elements that can occur in an XML Catalog: group, public, system, uri, delegatePublic, delegateSystem, delegateURI, and nextCatalog:
The catalog element is the root of an XML Catalog.
The xml:base URI is used to resolve relative URIs in the catalog as described in the XML Base specification.
Maps the public identifier pubid to the system identifier systemuri.
Maps the system identifier sysid to the alternate system identifier systemuri.
These catalogs are officially defined by OASIS Technical Resolution TR9401.
A Catalog is a text file that contains a sequence of entries. Of the 13 types of entries that are possible, only six are commonly applicable in XML systems: BASE, CATALOG, OVERRIDE, DELEGATE, PUBLIC, and SYSTEM:
Catalog entries can contain relative URIs. The BASE entry changes the base URI for subsequent relative URIs. The initial base URI is the URI of the catalog file.
In XML Catalogs, this functionality is provided by the closest applicable xml:base attribute, usually on the surrounding catalog or group element.
This entry serves the same purpose as the nextCatalog entry in XML Catalogs.
This entry enables or disables overriding of system identifiers for subsequent entries in the catalog file.
In XML Catalogs, this functionality is provided by the closest applicable prefer attribute on the surrounding catalog or group element.
An override value of “yes” is equivalent to “prefer="public"”.
This entry serves the same purpose as the delegatePublic entry in XML Catalogs.
This entry serves the same purpose as the public entry in XML Catalogs.
This entry serves the same purpose as the system entry in XML Catalogs.
Resolution is performed in roughly the following way:
For a more detailed description of resolution semantics, including the treatment of multiple catalog files and the complete rules for delegation, consult the XML Catalog standard.
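The interaction between system entries, public entries, and the prefer setting can be sketched in a few lines. This is a deliberately simplified model with invented class and method names: it keeps exact-match system and public entries in memory and ignores delegation, additional catalog files, identifier normalization, and publicid URNs. It is meant only to show the order of the checks, not how the Resolver classes are implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a single catalog.
class MiniCatalog {
    private final Map<String, String> systemEntries = new HashMap<>();
    private final Map<String, String> publicEntries = new HashMap<>();
    private final boolean preferPublic;

    MiniCatalog(boolean preferPublic) {
        this.preferPublic = preferPublic;
    }

    MiniCatalog system(String systemId, String uri) {
        systemEntries.put(systemId, uri);
        return this;
    }

    MiniCatalog publicId(String publicId, String uri) {
        publicEntries.put(publicId, uri);
        return this;
    }

    // Returns the mapped URI, or null if this catalog has no answer.
    String resolve(String publicId, String systemId) {
        // 1. A matching system entry always wins.
        if (systemId != null && systemEntries.containsKey(systemId)) {
            return systemEntries.get(systemId);
        }
        // 2. A public entry is used if public identifiers are preferred,
        //    or if no system identifier was supplied at all.
        boolean publicAllowed = preferPublic || systemId == null;
        if (publicId != null && publicAllowed && publicEntries.containsKey(publicId)) {
            return publicEntries.get(publicId);
        }
        // 3. Otherwise the caller falls back to the system identifier as given.
        return null;
    }

    public static void main(String[] args) {
        String pub = "-//Example//DTD Example V1.0//EN";
        String sys = "file:///dev/this/does/not/exist/example.dtd";
        MiniCatalog preferPub = new MiniCatalog(true).publicId(pub, "example.dtd");
        MiniCatalog preferSys = new MiniCatalog(false).publicId(pub, "example.dtd");
        System.out.println(preferPub.resolve(pub, sys)); // example.dtd
        System.out.println(preferSys.resolve(pub, sys)); // null
    }
}
```

With prefer set to system, the supplied system identifier is trusted over any public entry, which is why the second lookup comes back empty.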
My CatalogManager.properties file looks like this:
Example 2. Example CatalogManager.properties File
#CatalogManager.properties
verbosity=1
relative-catalogs=yes
# Always use semicolons in this list
catalogs=./xcatalog;/share/doctypes/catalog;/share/doctypes/xcatalog
prefer=public
static-catalog=yes
allow-oasis-xml-catalog-pi=yes
catalog-class-name=org.apache.xml.resolver.Resolver
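The file uses ordinary Java properties syntax, so its contents are easy to inspect with java.util.Properties. The sketch below is not how the resolver itself loads its configuration; it is just a small, hypothetical helper that pulls the semicolon-separated catalogs list out of text in this format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for reading the catalogs entry of a properties file.
class CatalogProps {
    static List<String> catalogFiles(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> files = new ArrayList<>();
        // The list is semicolon-separated; empty items are skipped.
        for (String item : props.getProperty("catalogs", "").split(";")) {
            String trimmed = item.trim();
            if (!trimmed.isEmpty()) {
                files.add(trimmed);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "verbosity=1",
            "# Always use semicolons in this list",
            "catalogs=./xcatalog;/share/doctypes/catalog;/share/doctypes/xcatalog",
            "prefer=public");
        // prints [./xcatalog, /share/doctypes/catalog, /share/doctypes/xcatalog]
        System.out.println(catalogFiles(text));
    }
}
```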
A number of popular applications provide easy access to catalog resolution:
-URIRESOLVER org.apache.xml.resolver.tools.CatalogResolver -ENTITYRESOLVER org.apache.xml.resolver.tools.CatalogResolver
Similarly, Saxon supports command-line access to the resolvers:
-x org.apache.xml.resolver.tools.ResolvingXMLReader -y org.apache.xml.resolver.tools.ResolvingXMLReader -r org.apache.xml.resolver.tools.CatalogResolver
The -x class is used to read source documents, the -y class is used to read stylesheets.
Similarly, for XT, use the org.apache.xml.xt.xsl.sax.Driver class.
All you have to do is set up an org.apache.xml.resolver.tools.CatalogResolver on your parser's entityResolver hook. The code listing in Example 3 demonstrates how straightforward this is:
Example 3. Adding a CatalogResolver to Your Parser
import org.apache.xml.resolver.tools.CatalogResolver;
...
CatalogResolver cr = new CatalogResolver();
...
yourParser.setEntityResolver(cr);
The system catalogs are loaded from the CatalogManager.properties file on your CLASSPATH. (For all the gory details about these classes, consult the API documentation.) You can explicitly parse your own catalogs (perhaps taken from command line arguments or a Preferences dialog) instead of or in addition to the system catalogs.
Example 4. An Example XML Catalog File
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//Example//DTD Example V1.0//EN"
          uri="example.dtd"/>
</catalog>
A demonstration of public identifier resolution can be achieved like this:
Example 5. Resolving Identifiers
$ java org.apache.xml.resolver.apps.resolver -d 2 -c example/catalog.xml \
    -p "-//Example//DTD Example V1.0//EN" public
Loading catalog: ./catalog
Loading catalog: /share/doctypes/catalog
Resolve PUBLIC (publicid, systemid):
  public id: -//Example//DTD Example V1.0//EN
Loading catalog: file:/share/doctypes/entities.cat
Loading catalog: /share/doctypes/xcatalog
Loading catalog: example/catalog.xml
Result: file:/share/documents/articles/sun/2001/01-resolver/example/example.dtd
In order to use the program, you must have the resolver.jar file on your CLASSPATH and you must be using JAXP. In the examples that follow, I've already got these files on my CLASSPATH.
The file we'll be parsing is shown in Example 6.
Example 6. An xparse Example File
<!DOCTYPE example PUBLIC "-//Example//DTD Example V1.0//EN"
                  "file:///dev/this/does/not/exist/example.dtd">
<example>
<p>This is just a trivial example.</p>
</example>
Example 7. Parsing Without a Catalog
$ java org.apache.xml.resolver.apps.xparse -d 2 example.xml
Attempting validating, namespace-aware parse
Fatal error:example.xml:2:External entity not found: "file:///dev/this/does/not/exist/example.dtd".
Parse failed with 1 error and no warnings.
Using a command-line option to specify the catalog, I can now successfully parse the document:
Example 8. Parsing With a Catalog
$ java org.apache.xml.resolver.apps.xparse -d 2 -c catalog.xml example.xml
Loading catalog: catalog.xml
Attempting validating, namespace-aware parse
Resolved public: -//Example//DTD Example V1.0//EN
        file:/share/documents/articles/sun/2001/01-resolver/example/example.dtd
Parse succeeded (0.32) with no errors and no warnings.
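The same before-and-after experiment can be approximated in plain Java. The sketch below does not use xparse or the Resolver classes; it assumes Java 9 or later and relies on the JDK's own javax.xml.catalog resolver, which implements the SAX EntityResolver interface. The class name, the temporary directory, and the tiny DTD are invented for the demonstration, and the parse is non-validating to keep the code short.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: parse the example document with and without a catalog.
class CatalogParseDemo {
    static Path write(Path dir, String name, String text) throws Exception {
        Path p = dir.resolve(name);
        Files.write(p, text.getBytes(StandardCharsets.UTF_8));
        return p;
    }

    // Returns true if the document parses, false if the parser gives up.
    static boolean parses(Path doc, EntityResolver resolver) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            if (resolver != null) {
                reader.setEntityResolver(resolver);
            }
            // DefaultHandler rethrows fatal errors and stays quiet otherwise.
            reader.setErrorHandler(new DefaultHandler());
            reader.parse(new InputSource(doc.toUri().toString()));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Returns { parsed without catalog, parsed with catalog }.
    static boolean[] run() {
        try {
            Path dir = Files.createTempDirectory("resolver-demo");
            write(dir, "example.dtd",
                "<!ELEMENT example (p)*>\n<!ELEMENT p (#PCDATA)>\n");
            Path catalog = write(dir, "catalog.xml", String.join("\n",
                "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">",
                "  <public publicId=\"-//Example//DTD Example V1.0//EN\"",
                "          uri=\"example.dtd\"/>",
                "</catalog>"));
            Path doc = write(dir, "example.xml", String.join("\n",
                "<!DOCTYPE example PUBLIC \"-//Example//DTD Example V1.0//EN\"",
                "  \"file:///dev/this/does/not/exist/example.dtd\">",
                "<example>",
                "<p>This is just a trivial example.</p>",
                "</example>"));
            CatalogFeatures features = CatalogFeatures.builder()
                .with(CatalogFeatures.Feature.PREFER, "public")
                .build();
            EntityResolver resolver =
                CatalogManager.catalogResolver(features, catalog.toUri());
            return new boolean[] { parses(doc, null), parses(doc, resolver) };
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("without catalog: " + result[0]);
        System.out.println("with catalog:    " + result[1]);
    }
}
```

Without the resolver, the parser tries to open the nonexistent system identifier and the parse should fail; with it, the public identifier is mapped to the local example.dtd and the parse should go through.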
[1] It is technically possible to use a proxy to transparently cache remote resources, thus making the cached resources available even when the real hosts are unreachable. In practice, this requires more technical skill (and system administration access) than many users have available. And I don't know of any such proxies that can be configured to provide preferential caching to the specific resources that are needed. Without such preferential treatment, it's difficult to be sure that the resources you need are actually in the cache.
[2] URIs that rely on the domain name system to identify objects (in other words, all URLs) are addresses, not names, even though the domain name provides a level of indirection and the illusion of a stable name.