Document Prolog - Building Web Services with Java

XML documents contain an optional prolog followed by a root element that holds the contents of the document. Typically the prolog serves up to three roles:

- Identifies the document as an XML document

- Includes any comments about the document

- Includes any meta-information about the content of the document

A document is identified as an XML document through the use of a processing instruction. Processing instructions (PIs) are special directives to the application that will process the XML document. They have the following syntax:

<?PITarget ...?>

PIs are enclosed in <? ... ?>. The PI target is a keyword meaningful to the processing application. Everything between the PI target and the ?> marker is considered the contents of the PI.

In general, data-oriented XML applications don’t use application-specific processing instructions. Instead, they tend to put all information in elements and attributes.

However, you should use one standard processing instruction, the XML declaration, in the XML document prolog to specify two important pieces of information: the version of XML in the document and the character encoding. Here’s an example:

<?xml version="1.0" encoding="UTF-8"?>

The version parameter of the xml PI tells the processing application the version of the XML specification to which the document conforms. (W3C released an updated XML specification, XML 1.1, in early 2004; but all examples in this book use the 1.0 version of XML, which came in 1998.) The encoding parameter is optional. It identifies the character encoding of the document. The default value is “UTF-8”.

Note

UTF-8 (RFC 2279, since superseded by RFC 3629) stands for Unicode Transformation Format-8. It’s an octet-based (8-bit), lossless encoding of characters from the Universal Character Set (UCS), aka Unicode (ISO 10646). UTF-8 is an efficient representation of English because it preserves the full US-ASCII character range. One ASCII character is encoded in 8 bits, whereas other Unicode characters take several octets: up to 48 bits under RFC 2279, a limit that RFC 3629 reduced to 32 bits. UTF-8 encoding makes it easy to move XML on the Internet using standard communication protocols such as HTTP, SMTP, and FTP. XML is internationalized by design and can support other character encodings, such as UTF-16. However, for simplicity and readability purposes, this book will use UTF-8 encoding for all samples.

If you omit the XML declaration, the XML version is assumed to be 1.0, and the processing application will try to guess the encoding of the document based on clues such as the raw byte order of the data stream. Such guessing is error-prone, and whenever interoperability is of high importance, as it is for Web services, applications should provide an explicit XML declaration and use UTF-8 encoding.
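As a minimal sketch of how an application can inspect the declaration, the following Java program parses two small documents with the JDK’s built-in DOM parser and prints the version and encoding the parser reports. The class name XmlDeclarationDemo and the one-element sample documents are assumptions made for illustration; they aren’t part of the purchase order example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the book's sample code.
class XmlDeclarationDemo {

    // Parses XML text, handed to the parser as UTF-8 bytes, into a DOM tree.
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Document with an explicit XML declaration.
        Document declared = parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?><po/>");
        System.out.println("version:  " + declared.getXmlVersion());
        System.out.println("encoding: " + declared.getXmlEncoding());

        // Document without a declaration: the version is still reported as 1.0.
        Document undeclared = parse("<po/>");
        System.out.println("version:  " + undeclared.getXmlVersion());
    }
}
```

The bytes are produced with an explicit UTF-8 charset so that what the declaration claims and what the parser actually receives agree.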

XML document prologs can also include comments that pertain to the whole document. Comments use the following syntax:

<!-- comment text -->

Comments can span multiple lines but can’t be nested (comments can’t enclose other comments). The processing application will ignore everything inside the comment markers. Some of the XML samples in this book use comments to provide you with useful context about the examples in question.

With what we’ve discussed so far, we can extend the PO example from Listing 2.1 to include an XML declaration and a comment about the document (see Listing 2.2).

Listing 2.2 XML Declaration and Comment for the Purchase Order

<?xml version="1.0" encoding="UTF-8"?>
<!-- ... -->
<po ...>
   ...
</po>

In this case, po is the root element of the XML document.

Elements

The term element is a technical name for the pairing of a start tag and an end tag in an XML document. In the previous example, the po element has the start tag <po> and the end tag </po>. Every start tag must have a matching end tag and vice versa.

Everything between these two tags is the content of the element. This includes any nested elements, text, comments, and so on.

Element names can include all standard programming language identifier characters ([0-9A-Za-z]) as well as the underscore (_), hyphen (-), period (.), and colon (:), but they must begin with a letter, an underscore, or a colon. customer-name is a valid XML element name. However, because XML is case-sensitive, customer-name isn’t the same element as Customer-Name.
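The naming rules can be checked mechanically. The sketch below, which assumes the JDK’s default DOM implementation with its standard error checking enabled, asks the DOM to create an element with a given name and reports whether the name was accepted. ElementNameDemo and isAcceptedName are hypothetical names chosen for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the book's sample code.
class ElementNameDemo {

    // Returns true if the DOM implementation accepts the name for a new element.
    static boolean isAcceptedName(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            doc.createElement(name); // throws DOMException for an illegal name
            return true;
        } catch (DOMException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isAcceptedName("customer-name")); // a legal name
        System.out.println(isAcceptedName("1customer"));     // starts with a digit
        // Element names are compared case-sensitively, like ordinary strings.
        System.out.println("customer-name".equals("Customer-Name"));
    }
}
```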

According to the XML Specification, elements can have three different content types: element-only content, mixed content, or empty content. Element-only content consists entirely of nested elements. Any whitespace separating elements isn’t considered significant in this case. Mixed content refers to any combination of nested elements and text. All elements in the purchase order example, with the exception of description, have element content. Most elements in the skateboard user guide example earlier in the chapter had mixed content.

Note that the XML Specification doesn’t define a text-only content model. Outside the letter of the specification, an element that contains only text is often referred to as having data content; but, technically speaking, it has mixed content. This awkwardness comes as a result of XML’s roots in SGML and document-oriented applications.

However, in most data-oriented applications, you’ll never see elements whose contents are both nested elements and text. The content will typically be one or the other, because limiting it to be either elements or text makes processing XML much easier.

The syntax for elements with empty content is a start tag immediately followed by an end tag, as in <emptyElement></emptyElement>. Because this form is verbose, the XML Specification also allows the shorthand form <emptyElement/>. For example, because the last item in our PO doesn’t have a nested description element, it has empty content. Therefore, we could have written it as follows:

<item ... />
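To a parser, the long and short forms of an empty element are the same thing. The following sketch, which uses the JDK’s DOM parser and a hypothetical helper named rootChildCount, counts the child nodes of the root element for each form. It also parses a third variant with a space between the tags to show that even whitespace counts as content when no schema or DTD says otherwise.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's sample code.
class EmptyElementDemo {

    // Returns the number of child nodes of the document's root element.
    static int rootChildCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Both forms of an empty element yield a root with no children.
        System.out.println(rootChildCount("<emptyElement></emptyElement>"));
        System.out.println(rootChildCount("<emptyElement/>"));
        // A single space between the tags becomes a text node.
        System.out.println(rootChildCount("<emptyElement> </emptyElement>"));
    }
}
```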

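XML elements must be strictly nested. They can’t overlap, as the b and i elements do here:

<p><b>Bold, <i>italicized</b> text</i> in a paragraph</p>

A properly nested version closes the inner element before the outer one:

<p><b>Bold, <i>italicized</i></b><i> text</i> in a paragraph</p>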

The notion of an XML document root implies that there is only one element at the very top level of a document. For example, the following wouldn’t be a well-formed XML document:

<first>I am the first element</first>

<second>I am the second element</second>
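Both rules, strict nesting and a single root element, are enforced by every conforming parser. As a rough sketch, the helper below (isWellFormed, a name made up for this example) hands a string to the JDK’s DOM parser and reports whether the parser accepted it. The sample strings in main are illustrative: one with overlapping elements, one properly nested, and the two-root document shown above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's sample code.
class WellFormedDemo {

    // Returns true if the parser accepts the text as a well-formed document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // Overlapping b and i elements.
        System.out.println(isWellFormed(
                "<p><b>Bold, <i>italicized</b> text</i> in a paragraph</p>"));
        // The same text with strictly nested elements.
        System.out.println(isWellFormed(
                "<p><b>Bold, <i>italicized</i></b><i> text</i> in a paragraph</p>"));
        // Two top-level elements instead of a single root.
        System.out.println(isWellFormed(
                "<first>I am the first element</first>"
                + "<second>I am the second element</second>"));
    }
}
```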

It’s easy to think of nested XML elements as a hierarchy. For example, Figure 2.1 shows a hierarchical tree representation of the XML elements in the purchase order example together with the data (text) associated with them.

Figure 2.1 Tree representation of XML elements in a purchase order.

Unfortunately, it’s often difficult to identify XML elements precisely in the hierarchy. To aid this task, the XML community has taken to using genealogy terms such as parent, child, sibling, ancestor, and descendant. Figure 2.2 illustrates the terminology as it applies to the order element of the PO:

- Its parent (the element immediately above it in the hierarchy) is po.

- Its ancestor is po. Ancestors are all the elements above a given element in the hierarchy.

- Its siblings (elements on the same level of the hierarchy and that have the same parent) are billTo and shipTo.

- Its children (elements that have this element as a parent) are three item elements.

- Its descendants (elements that have this element as an ancestor) are three item elements and two description elements.
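These relationships map directly onto the DOM API. The sketch below walks a stripped-down purchase order that contains only the elements named in the list above, with no attributes or text; that skeleton, and the class and method names, are assumptions made for this illustration rather than the book’s actual listing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's sample code.
class ElementFamilyDemo {

    // Assumed skeleton of the PO: three items, two of which have a description.
    static final String PO = "<po><billTo/><shipTo/><order>"
            + "<item><description/></item><item><description/></item><item/>"
            + "</order></po>";

    // Parses the skeleton and returns its order element.
    static Element order() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PO)));
            return (Element) doc.getElementsByTagName("order").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample PO", e);
        }
    }

    // Collects the names of all element nodes in the list, skipping 'exclude'.
    static List<String> elementNames(NodeList nodes, Node exclude) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE && !n.isSameNode(exclude)) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    static String parent() {
        return order().getParentNode().getNodeName();
    }

    // Siblings are the parent's other element children.
    static List<String> siblings() {
        Element o = order();
        return elementNames(o.getParentNode().getChildNodes(), o);
    }

    static List<String> children() {
        return elementNames(order().getChildNodes(), null);
    }

    // "*" matches every element below order, at any depth.
    static List<String> descendants() {
        return elementNames(order().getElementsByTagName("*"), null);
    }

    public static void main(String[] args) {
        System.out.println("parent:      " + parent());
        System.out.println("siblings:    " + siblings());
        System.out.println("children:    " + children());
        System.out.println("descendants: " + descendants());
    }
}
```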

Attributes

The start tags for XML elements can have zero or more attributes. An attribute is a name-value pair. The syntax for an attribute is a name (which uses the same character set as an XML element name) followed by an equal sign (=), followed by a quoted value. The XML Specification requires the quoting of values; you can use both single and double quotes, provided they’re correctly matched. For example, the po element of our PO has three attributes, id, submitted, and customerId:

<po id="..." submitted="..." customerId="...">

[Figure: tree of the po, billTo, shipTo, order, item, and description elements, labeled parent/ancestor, siblings, children/descendants, and descendants.]

Figure 2.2 Common terminology for XML element relationships

A family of attributes whose names begin with xml: is reserved for use by the XML Specification. Probably the best example is xml:lang, which identifies the language of the text used in the content of the element that has the xml:lang attribute. For example, we could have written the description elements in our purchase order example to identify the description text as English:

<description xml:lang="en">Skateboard backpack; five pockets</description>

Note that applications processing XML aren’t required to recognize, process, and act based on the values of these attributes. The key reason the XML Specification identified these attributes is that they address common use-cases; standardizing them aided interoperability between applications.

Without any meta-information about an XML document, attribute values are considered to be pieces of text. In the previous example, the ID might look like a number and the submission date might look like a date, but to an XML processor, they will both be strings. This behavior causes headaches when processing data-oriented XML, and it’s one of the primary reasons most data-oriented XML documents have associated meta-information described in XML schemas (introduced later in this chapter).
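The practical consequence is that the application has to convert attribute values itself. In the sketch below, the DOM returns every attribute as a String, and the code turns two of them into an int and a date. The attribute values shown (1001, 2004-01-05, and 42) are invented placeholders, not the values from the book’s purchase order, and AttributeTypesDemo is a hypothetical class name.

```java
import java.io.StringReader;
import java.time.LocalDate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's sample code.
class AttributeTypesDemo {

    // Placeholder attribute values, invented for this illustration.
    static final String PO =
            "<po id=\"1001\" submitted=\"2004-01-05\" customerId=\"42\"/>";

    // Parses the sample and returns its root element.
    static Element root() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PO)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample PO", e);
        }
    }

    // The DOM hands back a String; the application decides it is a number.
    static int id() {
        return Integer.parseInt(root().getAttribute("id"));
    }

    // Likewise, only the application knows this string is an ISO date.
    static LocalDate submitted() {
        return LocalDate.parse(root().getAttribute("submitted"));
    }

    public static void main(String[] args) {
        String rawId = root().getAttribute("id"); // always a String
        System.out.println("raw id:    " + rawId);
        System.out.println("id + 1:    " + (id() + 1));
        System.out.println("submitted: " + submitted().getDayOfWeek());
    }
}
```

If a value doesn’t match the expected format, Integer.parseInt and LocalDate.parse throw exceptions, so every consumer of the document has to repeat this checking unless a schema does it once.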

XML applications are free to attach any semantics they choose to XML markup. A common use-case leverages attributes to create a basic linking mechanism within an XML document. The typical scenario involves a document that has duplicate information in multiple locations. The goal is to eliminate information duplication. The process has three steps:

1. Put the information in the document only once.

2. Mark the information with a unique identifier.

3. Refer to this identifier every time you need to refer to the information.

The purchase order example offers the opportunity to try this technique (see Listing 2.3). As shown in the example, in most cases, the bill-to and ship-to addresses will be the same.

Listing 2.3 Duplicate Address Information in a Purchase Order

<po ...>
   <billTo>
      <company>The Skateboard Warehouse</company>
      <street>One Warehouse Park</street>
      <street>Building 17</street>
      <city>Boston</city>
   </billTo>
   <shipTo>
      <company>The Skateboard Warehouse</company>
      <street>One Warehouse Park</street>
      <street>Building 17</street>
      <city>Boston</city>
   </shipTo>
   ...
</po>

There is no reason to duplicate this information. Instead, we can use the markup shown in Listing 2.4.

Listing 2.4 Using ID/IDREF Attributes to Eliminate Redundancy

<po ...>
   <billTo id="addr-1">
      <company>The Skateboard Warehouse</company>
      <street>One Warehouse Park</street>
      <street>Building 17</street>
      <city>Boston</city>
   </billTo>
   <shipTo href="addr-1"/>
   ...
</po>

We followed the three steps described previously:

1. We put the address information in the document only once, under the billTo element.

2. We uniquely identified the address as “addr-1” and stored that information in the id attribute of the billTo element. We only need to worry about the uniqueness of the identifier within the XML document.

3. To refer to the address from the shipTo element, we use another attribute, href, whose value is the unique address identifier “addr-1”.

The attribute names id and href aren’t required but nevertheless are commonly used by convention.

You might have noticed that now both the po and billTo elements have an attribute called id. This is fine, because the attribute names are unique within the context of the two elements.
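Because id and href are only a convention here, the application itself has to follow the link. The sketch below shows one way to do that with the DOM: it scans the document for the element whose id attribute matches the href value. The class and method names are hypothetical, the po id value “po-1” is invented, and the linear scan is an assumption made for simplicity; the DOM’s own getElementById method isn’t used because it only works when a DTD or schema declares the attribute to be of type ID.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's sample code.
class IdRefDemo {

    // The po id value "po-1" is an invented placeholder.
    static final String PO = "<po id=\"po-1\">"
            + "<billTo id=\"addr-1\">"
            + "<company>The Skateboard Warehouse</company>"
            + "<street>One Warehouse Park</street>"
            + "<street>Building 17</street>"
            + "<city>Boston</city>"
            + "</billTo>"
            + "<shipTo href=\"addr-1\"/>"
            + "</po>";

    // Scans every element and returns the one whose id attribute matches.
    static Element findById(Document doc, String id) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (id.equals(e.getAttribute("id"))) {
                return e;
            }
        }
        return null; // dangling reference
    }

    // Follows an href attribute if present; otherwise returns the element itself.
    static Element resolve(Element referrer) {
        String href = referrer.getAttribute("href"); // "" when absent
        return href.isEmpty()
                ? referrer
                : findById(referrer.getOwnerDocument(), href);
    }

    // Returns the company name of the ship-to address, following the link.
    static String shipToCompany() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PO)));
            Element shipTo = (Element) doc.getElementsByTagName("shipTo").item(0);
            Element address = resolve(shipTo);
            return address.getElementsByTagName("company").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample PO", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(shipToCompany());
    }
}
```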

Elements Versus Attributes

Given that information can be stored in both element content and attribute values, sooner or later the question of whether to use an element or an attribute arises. This debate has erupted a few times in the XML community and has claimed many casualties.

One common rule is to represent structured information using markup. For example, you should use an address element with nested company, street, city, state, postalCode, and country elements instead of including a whole address as a chunk of text.

Even this simple rule is subject to interpretation and the choice of application domain. For example, the choice between

and

<work>

<ext/>

</work>

depends on whether your application needs to have phone number information in granular form (for example, to perform searches based on the area code only).

In other cases, only personal preference and stylistic choice apply. We might ask if SkatesTown should have used

<po>

...

</po>

instead of

...

</po>

There’s no good way to answer this question without adding stretchy assumptions about extensibility needs and so on.

In general, whenever humans design XML documents, you’ll see more frequent use of attributes. This is true even in data-oriented applications. On the other hand, when XML documents are automatically “designed” and generated by applications, you may see more prevalent use of elements. The reasons are somewhat complex; Chapter 3 will address some of them.
