NCBI Data in XML

Introduction

Extensible Markup Language (XML) plays an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. In the early 1990s NCBI chose a language called Abstract Syntax Notation 1 (ASN.1) for describing and exchanging information in a manner similar to the way XML is now used. ASN.1 came out of the telecommunications industry and offers a compact text or binary encoding for human-readable text as well as integers, floating point numbers, and so on. Tools for ASN.1 have largely stayed within the commercial telecommunications industry, while a host of public domain tools of varying character have arisen for XML.

While ASN.1 remains the primary data specification language at NCBI, our toolkit also supports XML input and output. An ASN.1 specification can be rendered into an XML DTD or schema. Data encoded in ASN.1 can be output as XML that validates against the DTD using standard XML tools. We hope this will make the structured sequence, map, and structure data, as well as the output of tools like BLAST, more accessible to those who wish to work in XML.

Data Conversion Details

Please note that the conversion of existing ASN.1-specified data into XML has some limitations.

ASN.1 has a number of specific data types, such as INTEGER or REAL, while an XML DTD has only strings. Our DTD therefore adds some ENTITY definitions at the top that map these numeric types to strings, so that humans reading the DTD can see where numbers are expected. When converting an ASN.1 specification into an XML schema, by contrast, our tools map ASN.1 data types onto the corresponding XML schema types.

ASN.1 does not require that an element name be unique except within a structure, similar to C or C++. An XML DTD, however, requires that all element names be unique across the DTD; attribute names are exempt, but they must come from a limited repertoire. Many XML parsers rely on this so that callback functions are associated with a tag, not a tag within context. As a trivial illustration, if both people and genes have names, they are distinct in ASN.1:

Person ::= SEQUENCE {
	name VisibleString,
	room-number INTEGER }

Gene ::= SEQUENCE {
	name VisibleString,
	map VisibleString }
but must be made unique in XML to be distinguished. To do so, we prefix all element names with the name of the context structure:
<!ELEMENT Person (Person_name, Person_room-number)>
<!ELEMENT Person_name (#PCDATA)>
<!ELEMENT Person_room-number (#PCDATA)>

<!ELEMENT Gene (Gene_name, Gene_map)>
<!ELEMENT Gene_name (#PCDATA)>
<!ELEMENT Gene_map (#PCDATA)>
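
The prefixing rule can be sketched in a few lines of Python. This is only an illustration of the naming convention, not the NCBI toolkit's implementation: the function name dtd_for_struct is hypothetical, and the sketch assumes every field is plain character data.

```python
def dtd_for_struct(struct_name, fields):
    """Return DTD element declarations for one ASN.1-style SEQUENCE.

    Hypothetical helper: each field name is prefixed with the name of
    the enclosing structure so that it is unique across the whole DTD.
    """
    children = [f"{struct_name}_{field}" for field in fields]
    lines = [f"<!ELEMENT {struct_name} ({', '.join(children)})>"]
    # Assumption: every field is declared as plain character data.
    lines += [f"<!ELEMENT {child} (#PCDATA)>" for child in children]
    return lines


person_dtd = dtd_for_struct("Person", ["name", "room-number"])
gene_dtd = dtd_for_struct("Gene", ["name", "map"])
print("\n".join(person_dtd + [""] + gene_dtd))
```

Both structures have a field called name, yet the generated declarations, Person_name and Gene_name, no longer collide.
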
While this is the default behavior, our tools allow such prefixes to be omitted if needed, for example when an XML DTD was the original specification.
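
To show why unique element names suit tag-keyed callbacks, here is a minimal sketch using Python's standard xml.sax module. The TagDispatcher class, the Record wrapper element, and the sample values are all made up for this illustration and are not part of any NCBI specification.

```python
import xml.sax


class TagDispatcher(xml.sax.ContentHandler):
    """Collect character data per element, keyed by tag name alone."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.values = {}
        self._current = None

    def startElement(self, name, attrs):
        # The tag name is enough to decide; no parent context is tracked.
        if name in self.wanted:
            self._current = name
            self.values[name] = ""

    def characters(self, content):
        if self._current is not None:
            self.values[self._current] += content

    def endElement(self, name):
        if name == self._current:
            self._current = None


# Hypothetical sample document using the prefixed element names.
SAMPLE = b"""<Record>
  <Person><Person_name>Ada</Person_name><Person_room-number>12</Person_room-number></Person>
  <Gene><Gene_name>BRCA1</Gene_name><Gene_map>17q21</Gene_map></Gene>
</Record>"""

handler = TagDispatcher(["Person_name", "Gene_name"])
xml.sax.parseString(SAMPLE, handler)
print(handler.values)
```

Because the names are prefixed, the handler can separate a person's name from a gene's name without remembering which parent element it is inside. If both were simply called name, the handler would need to keep a stack of open elements to tell them apart.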


Please email questions to: info@ncbi.nlm.nih.gov

Last updated: Aug 26, 2005