gnu.xml.pipeline
Class ValidationConsumer
java.lang.Object
|
+--gnu.xml.pipeline.EventFilter
|
+--gnu.xml.pipeline.ValidationConsumer
public final class ValidationConsumer
extends EventFilter
This class checks SAX2 events to report validity errors; it works as
both a filter and a terminus on an event pipeline. It relies on the
producer of SAX events to:
- Conform to the specification of a non-validating XML parser that
reads all external entities, reported using SAX2 events.
- Report ignorable whitespace as such (through the ContentHandler
interface). This is, strictly speaking, optional for nonvalidating
XML processors.
- Make SAX2 DeclHandler callbacks, with default
attribute values already normalized (and without "<").
- Make SAX2 LexicalHandler startDTD() and endDTD()
callbacks.
- Act as if the (URI)/namespace-prefixes property were
set to true, by providing XML 1.0 names and all
xmlns*
attributes (rather than omitting either or both).
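The requirements above can be checked against a given producer. The following is a minimal sketch, assuming the JDK's bundled JAXP parser stands in for the SAX2 producer; the class name ProducerSetup and the method collectEvents are hypothetical and not part of gnu.xml.pipeline. It enables the standard SAX2 namespace-prefixes feature, registers lexical and declaration handlers, and logs the callbacks a layered validator depends on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class ProducerSetup {

    /** Parses the text and logs the callbacks a layered validator relies on. */
    static List<String> collectEvents(String xml) {
        final List<String> log = new ArrayList<>();
        // DefaultHandler2 implements ContentHandler, LexicalHandler and DeclHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDTD(String name, String publicId, String systemId) {
                log.add("startDTD:" + name);
            }

            @Override
            public void endDTD() {
                log.add("endDTD");
            }

            @Override
            public void elementDecl(String name, String model) {
                log.add("elementDecl:" + name);
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // With namespace-prefixes enabled, xmlns* attributes show up here.
                for (int i = 0; i < atts.getLength(); i++) {
                    log.add("attribute:" + atts.getQName(i));
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // Standard SAX2 feature and property identifiers.
            reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc [<!ELEMENT doc EMPTY>]>"
                   + "<doc xmlns:p='urn:example'/>";
        System.out.println(collectEvents(xml));
    }
}
```

If the log lacks the DTD callbacks or the xmlns attribute, the producer probably does not meet the requirements listed above.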
At this writing, the major SAX2 parsers (such as Ælfred2,
Crimson, and Xerces) meet these requirements, and this validation
module is used by the optional Ælfred2 validation support.
Note that because this is a layered validator, it has to duplicate some
work that the parser is doing; there are also other costs to layering.
However, because of layering it doesn't need a parser in order
to work! You can use it with anything that generates SAX events, such
as an application component that wants to detect invalid content in
a changed area without validating an entire document, or which wants to
ensure that it doesn't write invalid data to a communications partner.
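As an illustration of a producer that is not a parser, here is a minimal sketch of an application component that emits SAX events by hand. The class ParserlessProducer and its methods are hypothetical; the counting handler is only a stand-in for the validator, which could be passed in its place because it is itself a SAX ContentHandler. A validator used this way would also need the DTD, either through declaration events or by being preloaded with it.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class ParserlessProducer {

    /** Emits <note><to>recipient</to></note> as SAX events, with no parser involved. */
    static void emitNote(ContentHandler out, String recipient) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startDocument();
        out.startElement("", "note", "note", none);
        out.startElement("", "to", "to", none);
        char[] text = recipient.toCharArray();
        out.characters(text, 0, text.length);
        out.endElement("", "to", "to");
        out.endElement("", "note", "note");
        out.endDocument();
    }

    /** Feeds the events to a stand-in consumer and returns how many elements it saw. */
    static int countElements(String recipient) {
        final int[] count = {0};
        DefaultHandler standIn = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            emitNote(standIn, recipient);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countElements("Alice")); // prints 2
    }
}
```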
Also, note that because this is a layered validator, the line numbers
reported for some errors may seem strange. For example, if an element does
not permit character content, the validator will use the locator
provided to it. That might reflect the last character of a
characters event callback, rather than the first non-whitespace character.
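The effect can be observed with any SAX producer that supplies a locator. The sketch below assumes the JDK's bundled parser; LocatorDemo and lineAtLastCharacters are hypothetical names. It records the line the locator reports while a characters callback is being delivered. SAX defines that position as the end of the text associated with the event, and how a parser splits text into callbacks is up to the parser, so the exact number is parser dependent.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class LocatorDemo {

    /** Returns the line reported during the last characters() callback, or -1 if none. */
    static int lineAtLastCharacters(String xml) {
        final int[] line = {-1};
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator l) {
                locator = l;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The locator describes where the event ends, not where the
                // first non-whitespace character of the text appeared.
                if (locator != null) {
                    line[0] = locator.getLineNumber();
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return line[0];
    }

    public static void main(String[] args) {
        // The only non-whitespace character is on line 1; the reported line may be later.
        System.out.println(lineAtLastCharacters("<a>x\n\n\n</a>"));
    }
}
```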
<!--
Of interest is the fact that unlike most currently known XML validators,
this one can report some cases of non-determinism in element content models.
It is a compile-time option, enabled by default. This will only report
such XML errors if they relate to content actually appearing in a document;
content models aren't aggressively scanned for non-deterministic structure.
Documents which trigger such non-deterministic transitions may be handled
differently by different validating parsers, without losing conformance
to the XML specification.
-->
Current limitations of the validation performed are in roughly three
categories.
The first category represents constraints which demand violations
of software layering: exposing lexical details, one of the first things
that application programming interfaces (APIs) hide. These
invariably relate to XML entity handling, and to historical oddities
of the XML validation semantics. Curiously,
recent (Autumn 1999) conformance testing showed that these constraints are
among those handled worst by existing XML validating parsers. Arguments
have been made that each of these VCs should be turned into WFCs (most
of them) or discarded (popular for the standalone declaration); in short,
that these are bugs in the XML specification (not all via SGML):
- The Proper Declaration/PE Nesting and
Proper Group/PE Nesting VCs can't be tested because they
require access to particularly low-level lexical information.
In essence, the reason XML isn't a simple thing to parse is that
it's not a context free grammar, and these constraints elevate that
SGML-derived context sensitivity to the level of a semantic rule.
- The Standalone Document Declaration VC can't be
tested. This is for two reasons. First, this flag isn't made
available through SAX2. Second, it also requires breaking that
lexical layering boundary. (If you ever wondered why classes
in compiler construction or language design barely mention the
existence of context-sensitive grammars, it's because of messy
issues like these.)
- The Entity Declared VC can't be tested, because it
also requires breaking that lexical layering boundary! There's also
another issue: the VC wording (and seemingly intent) is ambiguous.
(This is still true in the "Second edition" XML spec.)
Since there is a WFC of the same name, everyone's life would be
easier if references to undeclared parsed entities were always well
formedness errors, regardless of whether they're parameter entities
or not. (Note that nonvalidating parsers are not required
to report all such well formedness errors if they don't read external
parameter entities, although currently most XML parsers read them
in an attempt to avoid problems from inconsistent parser behavior.)
The second category of limitations on this validation represents
constraints associated with information that is not guaranteed to be
available (or in one case, is guaranteed not to be available)
through the SAX2 API:
- The Unique Element Type Declaration VC may not be
reportable, if the underlying parser happens not to expose
multiple declarations. (Ælfred2 reports these validity
errors directly.)
- Similarly, the Unique Notation Name VC, added in the
14-January-2000 XML spec errata to restrict typing models used by
elements, may not be reportable. (Ælfred2 reports these
validity errors directly.)
A third category relates to ease of implementation. (Think of this
as "bugs".) The most notable issue here is character handling. Rather
than attempting to implement the voluminous character tables in the XML
specification (Appendix B), Unicode rules are used directly from
the java.lang.Character class. Recent JVMs have begun to diverge from
the original specification for that class (Unicode 2.0), meaning that
different JVMs may handle that aspect of conformance differently.
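To make the trade-off concrete, here is a minimal sketch of a name check built on java.lang.Character instead of the XML character tables. It is an approximation written for illustration, not this class's actual code: NameCheck and isNameApprox are hypothetical, and the XML rules for combining characters and extenders are ignored.

```java
/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class NameCheck {

    /** Approximates the XML 1.0 Name production using java.lang.Character. */
    static boolean isNameApprox(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        char first = s.charAt(0);
        // A name starts with a letter, underscore or colon.
        if (!(Character.isLetter(first) || first == '_' || first == ':')) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = Character.isLetterOrDigit(c)
                    || c == '.' || c == '-' || c == '_' || c == ':';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isNameApprox("xml:lang")); // prints true
        System.out.println(isNameApprox("1st"));      // prints false
    }
}
```

Because Character.isLetter follows whatever Unicode version the running JVM implements, the same string can be classified differently on different JVMs, which is the divergence described above.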
Note that for some of the validity errors that SAX2 does not
expose, a nonvalidating parser is permitted (by the XML specification)
to report validity errors. When used with a parser that does so for
the validity constraints mentioned above (or any other SAX2 event
stream producer that does the same thing), overall conformance is
substantially improved.
- David Brownell
gnu.xml.aelfred2.SAXDriver
gnu.xml.aelfred2.XmlReader
void | attributeDecl(java.lang.String eName, java.lang.String aName, java.lang.String type, java.lang.String mode, java.lang.String value) |
void | characters(char ch[], int start, int length) |
void | elementDecl(java.lang.String name, java.lang.String model) |
void | endDocument() |
void | endDTD() |
void | endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) |
void | externalEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId) |
void | internalEntityDecl(java.lang.String name, java.lang.String value) |
void | notationDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId) |
void | skippedEntity(java.lang.String name) |
void | startDocument() |
void | startDTD(java.lang.String name, java.lang.String publicId, java.lang.String systemId) |
void | startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, Attributes atts) |
void | unparsedEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId, java.lang.String notationName) |
ValidationConsumer
public ValidationConsumer()
Creates a pipeline terminus which consumes all events passed to
it; this will report validity errors as if they were fatal errors,
unless an error handler is assigned.
setErrorHandler
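An error handler that records validity errors, rather than letting them abort processing, might look like the following sketch. It assumes the JDK's own validating parser as a stand-in source of validity errors, since SAX reports them through ErrorHandler.error() in either case. CollectingErrors and validityErrors are hypothetical names, and the sketch does not exercise ValidationConsumer itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class CollectingErrors {

    /** Parses with DTD validation enabled and returns the validity error messages. */
    static List<String> validityErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Validity errors arrive through error(); returning normally
            // lets processing continue instead of treating them as fatal.
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage());
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // stand-in validator: the JDK parser's DTD checks
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE doc [<!ELEMENT doc EMPTY>]>";
        System.out.println(validityErrors(dtd + "<doc/>").size());
        System.out.println(validityErrors(dtd + "<doc>text</doc>"));
    }
}
```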
ValidationConsumer
public ValidationConsumer(EventConsumer next)
Creates a pipeline filter which reports validity errors and then
passes events on to the next consumer if they were not fatal.
- next
setErrorHandler
ValidationConsumer
public ValidationConsumer(java.lang.String rootName, java.lang.String publicId, java.lang.String systemId, java.lang.String internalSubset, EntityResolver resolver, java.lang.String minimalDocument)
Creates a validation consumer which is preloaded with the DTD provided.
It does this by constructing a document with that DTD, then parsing
that document and recording its DTD declarations. Then it arranges
not to modify that information.
The resulting validation consumer will only validate against
the specified DTD, regardless of whether some other DTD is found
in a document being parsed.
- rootName - The name of the required root element; if this is
null, any root element name will be accepted.
- publicId - If non-null and there is a non-null systemId, this
identifier provides an alternate access identifier for the DTD's
external subset.
- systemId - If non-null, this is a URI (normally URL) that
may be used to access the DTD's external subset.
- internalSubset - If non-null, holds literal markup declarations
comprising the DTD's internal subset.
- resolver - If non-null, this will be provided to the parser for
use when resolving parameter entities (including any external subset).
- minimalDocument - If non-null, a minimal valid document.
SAXNotSupportedException - If the default SAX parser does
not support the standard lexical or declaration handlers.
SAXParseException - If the specified DTD has either
well-formedness or validity errors.
java.io.IOException - If the specified DTD can't be read for
some reason.
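The step of constructing a document with that DTD can be pictured with the sketch below. It is an assumption about how such a document might be assembled from the constructor arguments, not the class's actual code. DtdDocument, build, the placeholder root name used when rootName is null, and the empty-element fallback for the document body are all hypothetical.

```java
/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class DtdDocument {

    /** Assembles a small document whose DOCTYPE carries the given DTD. */
    static String build(String rootName, String publicId, String systemId,
                        String internalSubset, String minimalDocument) {
        // Assumption: a placeholder name stands in when no root is required.
        String root = (rootName == null) ? "root" : rootName;
        StringBuilder buf = new StringBuilder("<!DOCTYPE ").append(root);
        if (systemId != null) {
            if (publicId != null) {
                buf.append(" PUBLIC \"").append(publicId)
                   .append("\" \"").append(systemId).append('"');
            } else {
                buf.append(" SYSTEM \"").append(systemId).append('"');
            }
        }
        if (internalSubset != null) {
            buf.append(" [").append(internalSubset).append(']');
        }
        buf.append(">\n");
        // Assumption: fall back to an empty root element as the document body.
        buf.append(minimalDocument != null ? minimalDocument : "<" + root + "/>");
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("doc", null, null, "<!ELEMENT doc EMPTY>", null));
    }
}
```

Parsing the resulting text with a producer that makes DeclHandler callbacks would then yield the declarations to record.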
attributeDecl
public void attributeDecl(java.lang.String eName, java.lang.String aName, java.lang.String type, java.lang.String mode, java.lang.String value)
DeclHandler Records the attribute declaration for later use
in validating document content, and checks validity constraints
that are applicable to attribute declarations.
Passed to the next consumer, unless this one was
preloaded with a particular DTD.
- eName
- aName
- type
- mode
- value
characters
public void characters(char ch[], int start, int length)
ContentHandler Reports a validity error if the element's content
model does not permit character data.
Passed to the next consumer.
- start
- length
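One way to picture this check is shown below. SAX2 DeclHandler.elementDecl reports the content model as the string EMPTY, the string ANY, or a parenthesised group with whitespace removed, so a mixed content model always begins with "(#PCDATA". The class CharacterDataRule and its method are hypothetical, and the sketch is not the class's actual logic.

```java
/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class CharacterDataRule {

    /**
     * Decides from a normalized SAX2 content model string whether
     * character data is permitted in the element.
     */
    static boolean permitsCharacterData(String model) {
        // ANY allows everything; mixed content models start with "(#PCDATA".
        return "ANY".equals(model) || model.startsWith("(#PCDATA");
    }

    public static void main(String[] args) {
        System.out.println(permitsCharacterData("(#PCDATA|em)*")); // prints true
        System.out.println(permitsCharacterData("(head,body)"));   // prints false
    }
}
```

Whitespace inside element content is a separate case: a conforming producer reports it as ignorable whitespace, not through characters.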
elementDecl
public void elementDecl(java.lang.String name, java.lang.String model)
DeclHandler Records the element declaration for later use
when checking document content, and checks validity constraints that
apply to element declarations. Passed to the next consumer, unless
this one was preloaded with a particular DTD.
- name
- model
endDocument
public void endDocument()
ContentHandler Checks whether all ID values that were
referenced have been declared, and releases all resources.
Passed to the next consumer.
setDocumentLocator
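The ID check has to wait until the end of the document because an IDREF may refer to an ID that appears later. A minimal sketch of that bookkeeping follows, assuming two sets filled in while attributes are processed; IdTracker and its methods are hypothetical names, not this class's internals.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class IdTracker {
    private final Set<String> declared = new LinkedHashSet<>();
    private final Set<String> referenced = new LinkedHashSet<>();

    /** Records an ID attribute value; returns false if the value was already used. */
    boolean declare(String id) {
        return declared.add(id);
    }

    /** Records an IDREF attribute value; it may refer to an ID seen later. */
    void reference(String idref) {
        referenced.add(idref);
    }

    /** At end of document: the values that were referenced but never declared. */
    Set<String> unresolved() {
        Set<String> missing = new LinkedHashSet<>(referenced);
        missing.removeAll(declared);
        return missing;
    }

    /** Releases all recorded state so the tracker can be reused for another parse. */
    void reset() {
        declared.clear();
        referenced.clear();
    }

    public static void main(String[] args) {
        IdTracker ids = new IdTracker();
        ids.reference("b"); // forward reference, never satisfied
        ids.declare("a");
        ids.reference("a");
        System.out.println(ids.unresolved()); // prints [b]
    }
}
```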
endDTD
public void endDTD()
LexicalHandler Verifies that all referenced notations
and unparsed entities have been declared.
Passed to the next consumer, unless this one was
preloaded with a particular DTD.
endElement
public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName)
ContentHandler Reports a validity error if the element's content
model does not permit end-of-element yet, or a well formedness error
if there was no matching startElement call.
Passed to the next consumer.
- uri
- localName
- qName
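Detecting an endElement call with no matching startElement call comes down to keeping a stack of open elements. The sketch below shows only that part, under the assumption that elements are tracked by qualified name; ElementStack is a hypothetical class, and the content model check that would accompany it is omitted.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical demo class; not part of gnu.xml.pipeline. */
final class ElementStack {
    private final Deque<String> open = new ArrayDeque<>();

    /** Called for each startElement event. */
    void start(String qName) {
        open.push(qName);
    }

    /**
     * Called for each endElement event. Returns null when the event matches
     * the innermost open element, otherwise a description of the problem.
     */
    String end(String qName) {
        if (open.isEmpty()) {
            return "no matching startElement for </" + qName + ">";
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            return "expected </" + expected + "> but saw </" + qName + ">";
        }
        return null;
    }

    public static void main(String[] args) {
        ElementStack stack = new ElementStack();
        stack.start("a");
        System.out.println(stack.end("a")); // prints null
        System.out.println(stack.end("a")); // prints no matching startElement for </a>
    }
}
```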
externalEntityDecl
public void externalEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId)
DeclHandler Passed to the next consumer, unless this
one was preloaded with a particular DTD.
- name
- publicId
- systemId
internalEntityDecl
public void internalEntityDecl(java.lang.String name, java.lang.String value)
DeclHandler Passed to the next consumer, unless this
one was preloaded with a particular DTD.
- name
- value
notationDecl
public void notationDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId)
DTDHandler Records the notation name, for checking
NOTATION attribute values and declarations of unparsed
entities. Passed to the next consumer, unless this one was
preloaded with a particular DTD.
- name
- publicId
- systemId
skippedEntity
public void skippedEntity(java.lang.String name)
ContentHandler Reports a fatal exception. Validating
XML processors may not skip any entities.
- name
startDocument
public void startDocument()
ContentHandler Ensures that state from any previous parse
has been deleted.
Passed to the next consumer.
startDTD
public void startDTD(java.lang.String name, java.lang.String publicId, java.lang.String systemId)
LexicalHandler Records the declaration of the root
element, so it can be verified later.
Passed to the next consumer, unless this one was
preloaded with a particular DTD.
- name
- publicId
- systemId
startElement
public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, Attributes atts)
ContentHandler Performs validity checks against element
(and document) content models, and attribute values.
Passed to the next consumer.
- uri
- localName
- qName
- atts
unparsedEntityDecl
public void unparsedEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId, java.lang.String notationName)
DTDHandler Records the entity name, for checking
ENTITY and ENTITIES attribute values; records the notation
name if it hasn't yet been declared. Passed to the next consumer,
unless this one was preloaded with a particular DTD.
- name
- publicId
- systemId
- notationName