In recent years, the extensible markup language (XML) has been adopted by more and more businesses as an industry standard for data exchange and data sharing. XML provides a system-independent standard format for specifying the information exchanged over networks and between applications.
The concept of XML is fairly simple, but the effectiveness it brings to the distributed computing world is tremendous. It revolutionizes the ways in which companies conduct business online, from Internet content delivery (wired or wireless) to electronic commerce to enterprise computing.
From a developer’s perspective, Java makes your application portable among different platforms, and XML makes your data portable among different applications. These languages make our lives easier.
In today’s article we will learn the fundamentals of parsing an XML document, covering both event-based and tree-based parsers. I will focus on Java when explaining things, but the concepts in this article are useful for anyone learning XML in general.
XML and Parsing XML Documents
An XML document is a tagged data file. The tags in an XML document define the structures and boundaries of the embedded data elements. The syntax of the tags is very similar to that of HTML. Parsing XML simply means retrieving data from an XML document based on its meaning and structure.
Listed below is a sample XML document that contains a mail message:
//mail.xml
<?xml version="1.0"?>
<!DOCTYPE mail SYSTEM "mail.dtd" [
<!ENTITY from "from@from.com">
<!ENTITY to "somebody@somewhere.com">
<!ENTITY cc "you@you.com"> ]>
<mail>
<From> &from; </From>
<To> &to; </To>
<Cc> &cc; </Cc>
<Date>Fri, 12 Jan 2001 10:21:56 -0600</Date>
<Subject>XML and Parsing XML Documents </Subject>
<Body language="english">
An XML document is a tagged data file. The tags in an XML document define the structures and boundaries of the embedded data elements.
<Signature> Zaid &from; http://www.devarticles.com </Signature>
</Body>
</mail>
In general, there are four main components associated with an XML document: elements, attributes, entities, and DTDs.
An element describes a piece of data. It consists of markup tags and the element’s content. The following is an element from the mail.xml file listed above:
<Subject> XML and Parsing XML Documents </Subject>
It contains a start tag, <Subject>, the content XML and Parsing XML Documents, and an end tag, </Subject>.
An attribute is used in an element to provide additional information about the element. It usually resides inside the start tag of an element. In the following example, language is an attribute of the element Body that describes the language used in the message body:
<Body language="english">
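As a rough illustration of how an application might read such an attribute, the sketch below uses the standard JAXP DOM API that ships with the JDK. The class name AttributeDemo and the helper rootAttribute are hypothetical names chosen for this example, and the one-line XML string is a cut-down stand-in for mail.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads one attribute from the root element.
class AttributeDemo {

    /** Parses the XML text and returns the named attribute of the root element. */
    static String rootAttribute(String xml, String attributeName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            // getAttribute returns an empty string when the attribute is absent.
            return root.getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Body language=\"english\">Hello</Body>";
        System.out.println(rootAttribute(xml, "language")); // prints "english"
    }
}
```

Because the attribute lives in the start tag, the parser hands it over together with the element itself, separately from the element's text content.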
An entity is a virtual storage of a piece of data (either text data or binary data) that you can reference in an XML document. Entities can be further categorized into internal entities and external entities. An internal entity is defined inside an XML document and doesn’t reference any outside content. For example, “from” is an internal entity defined in our XML file above:
<!ENTITY from "from@from.com">
The entity “from” is referenced later on in the XML document as &from;. When the XML document is parsed, the parser simply replaces the entity reference with its actual value: from@from.com.
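The substitution can be observed with a small sketch like the one below, which assumes the JDK's default JAXP DOM parser with its default settings (entity references expanded). The class EntityDemo and the helper rootText are hypothetical names, and the document is reduced to a single From element with the entity declared in an internal DTD subset.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows an internal entity being expanded during parsing.
class EntityDemo {

    /** Parses the XML text and returns the trimmed text content of the root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE From [\n"
                + "  <!ENTITY from \"from@from.com\">\n"
                + "]>\n"
                + "<From> &from; </From>";
        // The application sees the entity's value, not the &from; reference.
        System.out.println(rootText(xml)); // prints "from@from.com"
    }
}
```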
An external entity refers to content outside an XML document. Its declaration uses a SYSTEM or PUBLIC identifier followed by a filename or URL. A SYSTEM identifier is a URI, either relative (such as a local filename) or absolute (such as an “http://” URL), from which the parser retrieves the content. A PUBLIC identifier supplies a well-known public name that the parser may resolve through a catalog, and it is followed by a system identifier as a fallback. The following is an example of an external entity, iconimage, that references a local file called icon.png:
<!ENTITY iconimage SYSTEM "icon.png" NDATA png>
A Document Type Definition (DTD) is an optional part of XML that defines the allowable structure for a particular type of XML document. Think of a DTD as the roadmap or rulebook of the XML document. The code listed below shows the DTD for the XML file (mail.xml) listed above:
// mail.dtd
<!ELEMENT mail (From, To, Cc, Date, Subject, Body)>
<!ELEMENT From (#PCDATA)>
<!ELEMENT To (#PCDATA)>
<!ELEMENT Cc (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Signature (#PCDATA)>
<!ELEMENT Body (#PCDATA|Signature)*>
This DTD says that the element called mail contains six sub-elements, in this order: From, To, Cc, Date, Subject, and Body. The term #PCDATA stands for “parsed character data” and indicates that an element can contain only text. The last line of the DTD indicates that the element Body has mixed content: text, Signature sub-elements, or both.
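To see a DTD acting as a rulebook, a validating parser can be asked to check a document against it. The following is a minimal sketch, assuming the JDK's JAXP DOM parser with validation switched on. For self-containment it embeds the element declarations as an internal DTD subset rather than loading mail.dtd from disk, and it omits the language attribute, which the DTD above does not declare. The class DtdValidationDemo and the helper validationErrors are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: validates a mail element against the article's DTD rules.
class DtdValidationDemo {

    // Same declarations as mail.dtd, placed in an internal subset for this sketch.
    static final String DOCTYPE =
            "<!DOCTYPE mail [\n"
            + "<!ELEMENT mail (From, To, Cc, Date, Subject, Body)>\n"
            + "<!ELEMENT From (#PCDATA)>\n"
            + "<!ELEMENT To (#PCDATA)>\n"
            + "<!ELEMENT Cc (#PCDATA)>\n"
            + "<!ELEMENT Date (#PCDATA)>\n"
            + "<!ELEMENT Subject (#PCDATA)>\n"
            + "<!ELEMENT Signature (#PCDATA)>\n"
            + "<!ELEMENT Body (#PCDATA|Signature)*>\n"
            + "]>\n";

    /** Returns the validation error messages reported for the given mail element. */
    static List<String> validationErrors(String mailElement) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                // Validity violations arrive here; record them and keep parsing.
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // Well-formedness problems are fatal; let them abort the parse.
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + mailElement)));
        } catch (Exception e) {
            throw new IllegalStateException("Document is not well-formed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<mail><From>a@a.com</From><To>b@b.com</To><Cc>c@c.com</Cc>"
                + "<Date>Fri, 12 Jan 2001</Date><Subject>Hi</Subject>"
                + "<Body>Text<Signature>Zaid</Signature></Body></mail>";
        String missingCc = "<mail><From>a@a.com</From><To>b@b.com</To>"
                + "<Date>Fri, 12 Jan 2001</Date><Subject>Hi</Subject><Body>Text</Body></mail>";
        System.out.println("valid document errors: " + validationErrors(valid));
        System.out.println("missing Cc errors: " + validationErrors(missingCc));
    }
}
```

The second document is well-formed XML, yet it breaks the content model of mail, so the parser reports a validity error instead of rejecting the input outright.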
Event-Based XML Parser Versus Tree-Based XML Parser
There are two types of interfaces available for parsing XML documents: the event-based interface and the tree-based interface.
Event-Based XML Parsers
An event-based XML parser reports parsing events directly to the application through callback methods. It provides a serial-access mechanism for accessing XML documents. Applications that use a parser’s event-based interface need to implement the interface’s event handlers to receive parsing events.
The Simple API for XML (SAX) is an industry-standard event-based interface for XML parsing. The SAX 1.0 Java API defines several callback methods in one of its interfaces. Applications need to implement these callback methods to receive parsing events from the parser. For example, startElement() is one of these callback methods. When a SAX parser reaches the start tag of an element, it calls the application’s implementation of startElement(), passing the tag name as one of the method’s parameters.
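The callback style can be sketched as follows. Note that this example uses the SAX 2 classes bundled with the JDK (DefaultHandler and the four-argument startElement), not the SAX 1.0 interface mentioned above, because that is what current Java versions provide out of the box. The class SaxDemo and the helper elementNames are hypothetical names for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every start tag the SAX parser reports.
class SaxDemo {

    /** Returns the element names in the order their start tags are encountered. */
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        // DefaultHandler supplies empty callbacks; override only the ones of interest.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName); // qName is the tag name as written in the document
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<mail><From>a@a.com</From><Subject>Hi</Subject></mail>";
        System.out.println(elementNames(xml)); // prints [mail, From, Subject]
    }
}
```

The parser never builds a representation of the whole document here. The handler sees each start tag once, in document order, and must keep any state it needs itself.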
Tree-Based XML Parsers
A tree-based XML parser reads an entire XML document into an internal tree structure in memory. Each node of the tree represents a piece of data from the original document. This method allows an application to navigate and manipulate the parsed data quickly and easily.
The Document Object Model (DOM) is an industry-standard tree-based interface for XML parsing. A DOM parser can be very memory and CPU intensive because it keeps the whole data structure in memory. A DOM parser may cause performance problems in wireless applications, especially when the XML document to be parsed is large and complex.
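For comparison, here is a small tree-based sketch using the JAXP DOM API from the JDK. It reads elements out of document order and then changes one in place, which is possible because the whole tree is held in memory. The class DomDemo and the helpers parse and text are hypothetical names, and the XML string is a shortened stand-in for mail.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: navigates and modifies a parsed DOM tree.
class DomDemo {

    /** Parses the XML text into an in-memory DOM tree. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the text of the first element with the given tag name, or null if none exists. */
    static String text(Document doc, String tagName) {
        Node node = doc.getElementsByTagName(tagName).item(0);
        return node == null ? null : node.getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<mail><Subject>Old subject</Subject><Body>Hello</Body></mail>");

        // Random access: Body can be read before Subject, unlike with SAX.
        System.out.println(text(doc, "Body"));    // prints "Hello"
        System.out.println(text(doc, "Subject")); // prints "Old subject"

        // Manipulation: change a node in place and read it back.
        doc.getElementsByTagName("Subject").item(0).setTextContent("New subject");
        System.out.println(text(doc, "Subject")); // prints "New subject"
    }
}
```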
In general, SAX parsers are faster and consume less CPU and memory than DOM parsers. However, SAX parsers allow only serial access to the XML data, whereas a DOM parser’s tree-structured data is easier to access and manipulate. SAX parsers are often used by Java servlets or network-oriented programs to transmit and receive XML documents quickly and efficiently. DOM parsers are often used for manipulating XML documents that exist physically, such as a configuration file or an already saved order.
Conclusion
In this article we’ve had a quick look at the components that make up an XML file. We’ve learnt what both event-based and tree-based parsers are. We’ve seen an example DTD, and we’ve also learnt that SAX is an event-based parsing interface that uses callback methods to notify an application as the parser encounters each part of an XML document.
Originally published at http://www.devarticles.com/art/1/240
Zaid Siddiqui has worked with ASP, SQL, Core Java, JSP, VBScript, JavaScript and a whole host of different markup languages, such as HTML and XML. You can contact Zaid via email at zaid@devarticles.com.