Open Node Syntax

Written By: Seairth Jacobs
Document Status: Draft
Version: 0.6.9
Last Updated: June 17, 2005

Introduction

Welcome to the Open Node Syntax (ONX, pronounced "onyx"). ONX was created as an alternative to the markup languages in use today. In particular, ONX grew out of my experiences with XML and the problems that I and others encountered along the way. ONX is designed to be data-oriented instead of document-oriented and is intended for use in platform-independent transfer of data over distributed systems, though it can be used for non-networked applications just as effectively. While it has similarities to other markup languages like XML, it is designed from the ground up with the following goals in mind:

Minimal Set of Rules : The simpler and fewer the rules, the less likely there are to be problems
Generic : Allows ONX to be used for a wide variety of applications
Compact : Since the ONX-encoded data is intended for use over networks (with an emphasis on the Internet), it is important to keep the overall size of the data down as much as possible
Simple Parser Implementation : Allows parsers to be written with fewer bugs and more speed

To get a quick feel for what ONX looks like, here are a few examples (more thorough ones are given later in the specification):

:onx{
    :calendar{
        :entry{
            :date["2003" "1" "1"]
            :type["event"]
            :note["Happy New Year!"]
        }
        :entry{
            :date["2003" "3" "8"]
            :type["birthday"]
            :note["Buy self a present..."]
        }
    }calendar
}onx

:onx{
    :fields{
        :field["ID" "integer"]
        :field["city" "string"]
        :field["state" "string"]
    }
    :data{
        :record["1" "Norfolk" "VA"]
        :record["2" "Salem" "MA"]
    }
}onx

:onx{:request{:host["www.seairth.com"]:resource["/web/onx/onx.html"]}}onx

NOTE: It is important to point out here that ONX is not intended to be a universal replacement for XML, SGML, or any other markup language. It contains many features that are similar to other languages. In many cases, ONX could be used as an effective alternative. There are also some features that are not commonly found in other markup languages which make ONX more ideal for some uses. Take a look at the specifications. Figure out what your needs are and see if they match with ONX's capabilities. In many cases, you will be quite surprised to find how effective a solution ONX will be for you.

Specification

This is the entire set of rules for ONX (in EBNF):

RootNode ::= ':onx{' (S | Node)* '}onx'
Node ::= ValueNode | ContainerNode
ContainerNode ::= ':' Name '{' (S | Node)* '}' (Name)?
ValueNode ::= ':' Name '[' (S | Value)* ']' (Name)?
Name ::= (Letter | '_') (NameChar)?
NameChar ::= (Letter | Digit | '_' )+
Letter ::= ['A'-'Z'] | ['a'-'z'] | [0xC0-0xD6] | [0xD8-0xF6] | [0xF8-0xFF]
Digit ::= ['0'-'9']
Value ::= '"' [0x00-0xFF]* '"'
S ::= (0x09 | 0x0A | 0x0D | 0x20)+

ONX is organized into Information Blocks, or "infoblocks". An infoblock is defined as the RootNode and its contents. Inside the RootNode, there may be zero or more additional "nodes". Since the RootNode is used to identify the beginning and end of the infoblock, it is possible to have multiple infoblocks in a single stream without ambiguity.

The most basic structure in ONX is the "Node". Each Node starts with a Name. Node names have a standard format and are unlimited in length. However, any Name that starts with "ONX" or any case variation (e.g. Onx, onx, onX, etc.) is reserved for current and future standardization purposes. This means that user-defined nodes should never use this string to start a Name, since it is possible that some future standard would use that same name as a reserved node name.
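
As a rough illustration of the Name rules [Rules 5-8] and the reserved prefix, here is a minimal Python sketch. It assumes names are handled as single-byte (0x00-0xFF) strings, and the function names are hypothetical, not part of ONX:

```python
import re

# Letter [Rule 7], written as a byte-oriented character class.
_LETTER = rb"A-Za-z\xC0-\xD6\xD8-\xF6\xF8-\xFF"
# Name [Rules 5-6]: a Letter or '_' followed by any number of Letters, Digits, or '_'.
_NAME_RE = re.compile(rb"[" + _LETTER + rb"_][" + _LETTER + rb"0-9_]*")


def is_valid_name(name: bytes) -> bool:
    """Return True if the whole byte string matches the Name production."""
    return _NAME_RE.fullmatch(name) is not None


def is_reserved_name(name: bytes) -> bool:
    """Return True if the name starts with "onx" in any case variation."""
    return name[:3].lower() == b"onx"


def is_user_name(name: bytes) -> bool:
    """Return True if the name is well-formed and safe for user-defined nodes."""
    return is_valid_name(name) and not is_reserved_name(name)
```

For example, is_user_name(b"calendar") is accepted, while b"OnxVersion" is rejected as reserved and b"1abc" is rejected because a Name cannot start with a Digit.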

Before continuing in detail about node types, note the use of whitespace [Rule 10] in ONX. This is entirely optional and is only intended as an aid when a human must directly read an ONX infoblock. As a result, arbitrarily inserting whitespace for readability does not change the meaning of the data. However, in production situations, it is more likely that all whitespace will be left out of the infoblock to reduce size and processing time.

Hint: ONX parsers should first test for the brace, bracket, quote, etc. characters before testing for whitespace. In production, this should reduce the number of tests performed, again assuming that a production system will not generate whitespace.

There are two Node types in an ONX infoblock:

ValueNode: Contains zero or more quoted strings.
ContainerNode: Contains zero or more additional nodes.

ValueNode

The ValueNode is the next most basic type of Node. In this case, the purpose of the node is to provide the ability to convey Name/Value pairs. The name can be associated with no value, a single value, or a list of values optionally separated by whitespace. The contents of a value can be anything, including binary data. Each value is always enclosed in a pair of quote symbols. Examples of a ValueNode are:

:Node[]
:Node["This is a value of this ValueNode"]
:Node["This is a value of this ValueNode" "This is also a value in the same ValueNode"]

The ValueNode can end in one of two formats:

"]" : The short version allows for compact representation. This is important when large infoblocks containing a high percentage of ValueNodes are being transferred over low-bandwidth connections.
"]" Name : The long version allows for increased validation and readability. Since it is possible to create an ONX infoblock with a ValueNode where a "]" has been accidentally left off, this format would lend itself to better pinpointing where the error is. Also, when directly viewing an infoblock, the long version allows easier viewing of large ValueNodes, especially those that are several lines long. This is useful for development and debugging purposes. However, it is suggested that the short version be used in production situations.

Since the beginning and end of a value is delimited by the quote symbol, this means that the quote symbol cannot be in the value without additional consideration. To solve this problem, values can contain Escape Sequences. An escape sequence in ONX is similar to the C/C++ notion of escape sequences. Each escape sequence starts with the escaping character "\". There are a total of five allowable escape sequences:

\\ : When the escaping character is in the value, it must be escaped to avoid ambiguity or incorrect parsing. Otherwise, it is possible that a valid escape sequence could be in the value which is not actually meant to be an escape sequence.
\" : When the quote character is in the value, it must be escaped to avoid incorrect parsing. Otherwise, it would be recognized as the closing quote.
\0 : Since the values can contain binary data, it is possible for the value to contain NULL characters (binary value zero). However, many programming languages use the NULL character to indicate the end of a string. As a result, it is possible that an actual NULL character could cause incorrect processing of the ONX document. To avoid this problem, all NULL characters must be escaped with this sequence.
\xHH : When it becomes necessary to encode a binary value to limit the character usage to ASCII values, the character may be escaped with the hexadecimal value for that character stored in this format. The hexadecimal value must be exactly two characters long. This escape sequence is optional.
\[HHHHHHHH] : When the value is long, especially if it contains a lot of characters that would require escaping, this escape sequence allows the next N bytes to be included as-is, where N is the hexadecimal number (one to eight digits) between the brackets. For instance, if the number was "9EF", then the next 2543 (decimal) bytes are to be read without any additional processing. Even if there are escape sequences within these 2543 bytes, they are ignored. This also allows the ability to have embedded NULL characters, since the ONX processor can accurately read them without mistaking them for an end of the string (this would be very useful for embedding MBCS strings, for instance). This escape sequence is the only way that NULL characters, quotes, and backslashes can be directly embedded in an ONX infoblock without the need to individually escape them. The maximum value allowable is 4294967295, which is the maximum value that can be held in an unsigned 32-bit integer. However, if a value will be longer than this, multiple escape sequences can be put together as necessary. The format of the number is from one to eight hexadecimal digits. Reading from left-to-right, the digits are read in most-to-least significant order. If fewer than eight digits are given, the beginning of the number is implicitly padded with zeros. Zero is a legal value.
NOTE: This escape sequence can be a security risk. It is possible for this value to be greater than the actual number of bytes within the value, possibly longer than the remaining number of bytes in the infoblock. This could cause a parser to encounter a buffer overrun, which an attacker may be able to take advantage of. To avoid this risk, it is suggested that parsers keep track of the length of the infoblock (instead of using a null character) to indicate the end of the infoblock. This allows the length of the escape sequence to be tested against that length to ensure that the buffer is not overrun (e.g. nOffset + nEscapeSequenceLength < nInfoblockLength).
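
When writing a value, the escape rules above might be applied as in the following Python sketch. The function names, the ascii_only flag, and the max_block parameter are assumptions made for illustration; they are not defined by ONX:

```python
def encode_value(raw: bytes, ascii_only: bool = False) -> bytes:
    """Quote raw bytes, escaping each backslash, quote, and NULL individually.

    With ascii_only=True, bytes outside printable ASCII use the optional \\xHH form.
    """
    out = bytearray(b'"')
    for b in raw:
        if b == 0x5C:                       # backslash
            out += b"\\\\"
        elif b == 0x22:                     # quote
            out += b'\\"'
        elif b == 0x00:                     # NULL
            out += b"\\0"
        elif ascii_only and not 0x20 <= b <= 0x7E:
            out += b"\\x%02X" % b           # exactly two hexadecimal digits
        else:
            out.append(b)
    out += b'"'
    return bytes(out)


def encode_value_raw(raw: bytes, max_block: int = 0xFFFFFFFF) -> bytes:
    """Quote raw bytes using as-is blocks, so nothing inside needs escaping.

    Data longer than max_block is split across several escape sequences.
    """
    out = bytearray(b'"')
    for start in range(0, len(raw), max_block):
        chunk = raw[start:start + max_block]
        out += b"\\[%X]" % len(chunk)       # byte count in hexadecimal
        out += chunk
    out += b'"'
    return bytes(out)
```

The first function suits short, mostly textual values; the second suits large binary payloads, where per-byte escaping would add size and processing time.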

Examples of escape sequences in ValueNodes are:

:phrase["The word \"test\" is used here."]
:string["First Line\x0D\x0ASecond Line."]
:sample["Showing Escape Sequence \\x0D\\x0A"]
:data["\[9]\"As-Is\" and \"Not As-Is\""]

The decoded values of the above examples would look like:

The word "test" is used here.
First Line
Second Line.
Showing Escape Sequence \x0D\x0A
\"As-Is\" and "Not As-Is"

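The decoding shown above can be sketched in Python as follows. This is only one possible approach, not a reference implementation; decode_value and OnxValueError are hypothetical names. The as-is escape sequence is checked against the known length of the data, as suggested in the security note:

```python
from typing import Tuple

HEX_DIGITS = b"0123456789abcdefABCDEF"


class OnxValueError(ValueError):
    """Raised when a quoted value is malformed."""


def _hex(digits: bytes) -> int:
    if not digits or any(d not in HEX_DIGITS for d in digits):
        raise OnxValueError("bad hexadecimal digits in escape sequence")
    return int(digits, 16)


def decode_value(data: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Decode the quoted value starting at data[pos].

    Returns the decoded bytes and the index just past the closing quote.
    """
    if data[pos:pos + 1] != b'"':
        raise OnxValueError("expected opening quote")
    out = bytearray()
    i = pos + 1
    while i < len(data):
        c, esc = data[i:i + 1], data[i + 1:i + 2]
        if c == b'"':                       # unescaped quote ends the value
            return bytes(out), i + 1
        if c != b"\\":                      # ordinary byte
            out += c
            i += 1
        elif esc in (b"\\", b'"'):
            out += esc
            i += 2
        elif esc == b"0":
            out += b"\x00"
            i += 2
        elif esc == b"x" and len(data) >= i + 4:
            out.append(_hex(data[i + 2:i + 4]))
            i += 4
        elif esc == b"[":
            end = data.find(b"]", i + 2, i + 11)   # at most eight digits
            length = _hex(data[i + 2:end] if end != -1 else b"")
            start = end + 1
            if start + length >= len(data):        # would overrun the buffer
                raise OnxValueError("as-is block runs past the end of the data")
            out += data[start:start + length]
            i = start + length
        else:
            raise OnxValueError("unknown escape sequence")
    raise OnxValueError("unterminated value")
```

Returning the index after the closing quote lets a caller continue scanning for the next value in the same ValueNode.
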
ContainerNode

The ContainerNode is the most complex Node type. The purpose of a ContainerNode is to contain other Nodes. The contained nodes can be of any node type. This allows Nodes to be arranged or organized in a hierarchical manner. There is no limitation to the maximum depth that ContainerNodes may be nested. As a result, both simple and complex data structures can be represented as needed.

Note: It is also important to point out here that the RootNode is a special-purpose ContainerNode. As a result, it may be possible for developers to take advantage of this characteristic when writing or using ONX parsers/processors.

The ContainerNode can end in one of two formats:

"}" : The short version allows for compact representation. This is important when large infoblocks containing a high percentage of ContainerNodes are being transferred over low-bandwidth connections.
"}" Name : The long version allows for increased validation and readability. Since it is possible to create an ONX infoblock with a ContainerNode where a "}" has been accidentally left off, this format would lend itself to better pinpointing where the error is. Also, when directly viewing an infoblock, the long version allows easier viewing of the nesting of the ContainerNodes. This is useful for development and debugging purposes. However, it is suggested that the short version be used in production situations.

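To show the difference between the short and long ending formats, here is a small Python sketch that writes nested nodes either way. The ValueNode and ContainerNode classes and the write_* functions are hypothetical helpers, and the sketch assumes text-only values (no binary data):

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ValueNode:
    name: str
    values: List[str] = field(default_factory=list)


@dataclass
class ContainerNode:
    name: str
    children: List[Union["ContainerNode", ValueNode]] = field(default_factory=list)


def escape(value: str) -> str:
    """Escape backslashes first, then quotes and NULL characters."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\0", "\\0")


def write_node(node: Union[ContainerNode, ValueNode], long_form: bool = False) -> str:
    """Serialize one node; long_form repeats the Name after the closing character."""
    tail = node.name if long_form else ""
    if isinstance(node, ValueNode):
        body = " ".join('"%s"' % escape(v) for v in node.values)
        return ":%s[%s]%s" % (node.name, body, tail)
    body = "".join(write_node(child, long_form) for child in node.children)
    return ":%s{%s}%s" % (node.name, body, tail)


def write_infoblock(children: List[Union[ContainerNode, ValueNode]],
                    long_form: bool = False) -> str:
    """Wrap the serialized nodes in the RootNode."""
    return ":onx{%s}onx" % "".join(write_node(c, long_form) for c in children)
```

With long_form=False the output carries no whitespace and no closing names, which matches the suggestion for production use; long_form=True is the variant intended for development and debugging.
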
Sample ONX Infoblock(s)

Here are a few samples of complete ONX infoblocks:

:onx{
    :Messages{
        :Message{
            :From["seairth@seairth.com"]
            :To["seairth@seairth.com"]
            :Subject["Concerning ONX..."]
            :Body["This is a simple ONX sample."]
        }Message
        :Message{
            :From["seairth@seairth.com"]
            :To["seairth@seairth.com"]
            :Subject["A Slightly Bigger Message"]
            :Body["In this case, each value in this ValueNode might indicate"
                  "each \"line\" of this message body."
                  ""
                  "Seairth"
                  "seairth@seairth.com"
            ]Body
        }Message
        :Message{
            :From["seairth@seairth.com"]
            :To["seairth@seairth.com"]
            :Subject["Whitespace in a value"]
            :Body["As was stated above, the value can contain anything that we want,
which means that the CR/LF combination at the end of the prior line
is actually part of the value. In the next value, the \\x0D\\x0A are
escaped versions of the CR/LF pair and are also considered to be
embedded whitespace (once they are evaluated, anyhow)."
                  "\x0D\x0ASeairth
seairth@seairth.com"
            ]Body
        }Message
    }Messages
}onx


:onx{:Request{:Name["GetPopulation"]:Parameters["US" "Virginia" "Norfolk"]:ReturnAs["Number"]}}onx


:onx{
    :Database{
        :Name["Inventory"]
        :Tables{
            :Table{
                :Name["Items"]
                :Header{
                    :Field{
                        :Name["id"]
                        :Type["unsigned integer"]
                        :AutoIncrement[]
                        :PrimaryKey[]
                    }Field
                    :Field{
                        :Name["itemnumber"]
                        :Type["string"]
                        :Length["10"]
                        :DefaultValue["New Item"]
                    }Field
                }Header
                :Records{
                    :Record["1" "ABC123"]
                    :Record["2" "XYZ789"]
                }Records
            }Table
        }Tables
    }Database
}onx

:onx{
    :install_file{
        :name["Super Application 1.0"]
        :platforms{
            :platform["windows" "target1"]
            :platform["linux" "target2"]
        }platforms
        :target1{
            :default_path["c:\\program files\\superapp\\"]
            :run_post_install["sa_setup.exe"]
            :data["\[113C7F](imagine 1,129,599 bytes of raw binary data here)"]
        }target1
        :target2{
            :default_path["/bin/superapp/"]
            :run_post_install["sa_setup"]
            :data["\[E6DD6](imagine 945,622 bytes of raw binary data here)"]
        }target2
    }install_file
}onx

What's Missing/Different

Attributes

For those who are familiar with XML, attributes are Name/Value pairs associated with an element. One of the common disputes among XML developers is when to use attributes or child elements. In ONX, this is not an issue. There are only ValueNodes. ValueNodes can represent attributes or content (which can be thought of as just another attribute).

Comments

Many markup languages allow comments. While this may be fine for hand-coded document-oriented markup, comments do not fit into the ONX goals stated above. If comments are required for a given implementation, they can be implemented as ValueNodes.

Document Type Declaration/Definition

This has intentionally been left out of ONX primarily due to the lesson(s) learned from XML. XML comes with a DTD mechanism. However, many have found it to be inadequate. As a result, multiple XML-based schema languages have been created (e.g. W3C XML Schema, RELAX-NG, Schematron, etc.). It is likely that the same would happen for ONX. As a result, a DTD mechanism has been left out. I intend to create a schema language for ONX, but welcome any other implementations as well.

Special-Purpose Attributes

XML, SGML, and possibly other markup languages have special-purpose attribute types. For instance, XML has id, idref, idrefs, etc. If one or more of these special-purpose attributes are needed in an ONX implementation, then they can be implemented as ValueNodes that are recognized within the particular application that needs them.

Whitespace

When developing applications that use XML, it is common to use whitespace characters (0x09, 0x0A, 0x0D, 0x20) to make it easier to read and debug the XML documents. However, it can also be used as the content of a given element. In ONX, the whitespace serves no purpose other than making it easier to develop/debug infoblocks. Under ideal circumstances, whitespace would be left out altogether. However, this would make development and debugging more difficult.

What's Left

Unicode/ISO Character Sets

One of the next things that is apparent in the above specification is the lack of support for multi-byte character sets. There is a simple reason for this: I do not have enough experience with multi-byte character sets to implement this aspect myself. In order to get international support for ONX, Unicode support needs to be added. Anyone want to take a crack at it? If so, e-mail me.

ONX Parser(s)

Well, there is plenty to do here. I have one parser which I quickly wrote as a starting point. It is a simple event-generating parser similar to XML's SAX or expat.
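
To give an idea of what such an event-generating parser might look like, here is a compact Python sketch. It is not the parser mentioned above; the event names ("start", "value", "end") and all function names are assumptions. It follows the earlier hint by testing structural characters before whitespace, and it does not check for reserved names:

```python
import re
from typing import Iterator, Tuple

_LETTER = rb"A-Za-z\xC0-\xD6\xD8-\xF6\xF8-\xFF"
NAME = re.compile(rb"[" + _LETTER + rb"_][" + _LETTER + rb"0-9_]*")
HEX = b"0123456789abcdefABCDEF"
WHITESPACE = b"\t\n\r "


class OnxError(ValueError):
    """Raised when the input is not a well-formed infoblock."""


def _hex(digits: bytes) -> int:
    if not digits or any(d not in HEX for d in digits):
        raise OnxError("bad hexadecimal digits in escape sequence")
    return int(digits, 16)


def _read_value(data: bytes, i: int) -> Tuple[bytes, int]:
    """data[i] is an opening quote; return (decoded value, index after closing quote)."""
    out = bytearray()
    i += 1
    while i < len(data):
        c, esc = data[i:i + 1], data[i + 1:i + 2]
        if c == b'"':
            return bytes(out), i + 1
        if c != b"\\":
            out += c
            i += 1
        elif esc in (b"\\", b'"'):
            out += esc
            i += 2
        elif esc == b"0":
            out += b"\x00"
            i += 2
        elif esc == b"x" and len(data) >= i + 4:
            out.append(_hex(data[i + 2:i + 4]))
            i += 4
        elif esc == b"[":
            end = data.find(b"]", i + 2, i + 11)  # at most eight digits
            length = _hex(data[i + 2:end] if end != -1 else b"")
            start = end + 1
            if start + length >= len(data):  # refuse to read past the buffer
                raise OnxError("as-is block runs past the end of the data")
            out += data[start:start + length]
            i = start + length
        else:
            raise OnxError("unknown escape sequence")
    raise OnxError("unterminated value")


def parse(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield ("start", name), ("value", bytes), and ("end", name) events for one infoblock."""
    if not data.startswith(b":onx{"):
        raise OnxError("infoblock must start with :onx{")
    stack = [(b"onx", b"}")]  # (node name, expected closing character)
    yield ("start", b"onx")
    i = 5
    while stack:
        if i >= len(data):
            raise OnxError("unexpected end of data")
        c = data[i:i + 1]
        name, closer = stack[-1]
        if c == b":" and closer == b"}":        # a new node inside a container
            m = NAME.match(data, i + 1)
            opener = data[m.end():m.end() + 1] if m else b""
            if opener not in (b"{", b"["):
                raise OnxError("expected '{' or '[' after node name")
            stack.append((m.group(), b"}" if opener == b"{" else b"]"))
            yield ("start", m.group())
            i = m.end() + 1
        elif c == b'"' and closer == b"]":      # a value inside a ValueNode
            value, i = _read_value(data, i)
            yield ("value", value)
        elif c == closer:                       # end of node, optional Name after it
            stack.pop()
            m = NAME.match(data, i + 1)
            if m and m.group() != name:
                raise OnxError("closing name does not match")
            if not m and not stack:
                raise OnxError("infoblock must end with }onx")
            i = m.end() if m else i + 1
            yield ("end", name)
        elif c in WHITESPACE:
            i += 1
        else:
            raise OnxError("unexpected byte at offset %d" % i)
```

A caller would iterate over parse(data) and react to each event, much as a SAX handler would; building an in-memory tree on top of these events is left to the application.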

ONX Representation

As with XML, it makes sense to have some sort of object model for ONX infoblock representation. However, since ONX is not document-oriented, the use of the Document Object Model would be confusing. So instead I suggest the object model be called the Infoblock Object Model (IOM). As for the implementation of such a model, this still has to be determined. If I come up with something workable, I will make it available online. In the meantime, any suggestions?

Versioning

Currently, ONX does not have a way to indicate what version it is. I have a few thoughts on this:

Do not add versioning to ONX. It's intended to be simple. Newer versions would likely mean addition of more features, which would take ONX away from its simplicity.
Embed the version number in the rootnode name. This would allow only the ":" and the name to indicate what version of ONX is being used. This has the advantage of allowing practically every part of the specification to be changed, which could be a good or bad thing.
Define a ValueNode called something like "onxversion" which must always be the first node inside an infoblock, though it would be optional (and therefore implied to be version 1.0 or whatever).
Add a new element/node type to support this feature. This option is given for thoroughness only and is not recommended.

Schema Language(s)

ONX is only concerned with well-formedness and not validity. This is where schema languages come in. What we need are some usable schema languages that allow ONX infoblocks to be validated. My only suggestion here is that it is important to keep the schema language as simple as possible. I frequently hear how large and complex the W3C XML Schema language is. ONX is meant to be simple. I would like to see a schema language to match.

Namespacing Mechanism

I have mixed feelings about adding this functionality to ONX, either directly or indirectly. Namespaces have been one of the most contentious issues for XML and I would expect no less than that for ONX. I tend to see namespaces as being connected more to the issues of validation than to the issues of well-formedness. As a result, it makes more sense to me to address this issue at the schema level, not the markup language level. Of course, lots of people seem to be very opinionated about this matter and I welcome those opinions (as long as they are constructive).

Collaborative Development

The current version of ONX was written with only a little input and feedback, much of which I got indirectly from conversations, articles, and other resources concerning markup languages and XML in particular. The current draft has had some additional direct feedback. However, in order for ONX to move towards its potential, I know that it must become a truly open and collaborative effort. To that extent, I would like to work towards putting together a group of people who would be willing to take the time and effort necessary to help promote and improve ONX. With the immense popularity of XML these days, this is no small task. But, I feel that it is a task worth taking on. If you are someone who would like to help improve ONX now and in the future, e-mail me. Also, I would like some suggestions on how to go about providing the tools for the collaborative effort online. I know that there are many projects at SourceForge, but wondered if there was anything else out there, etc. I would even consider setting up a more formal location for ONX... but I feel that I am getting ahead of myself here. Let's start off simple and see where we go.

Much, Much More...

This is only the beginning of a new markup language. I feel that it has some good potential and can be used to help solve some of the problems we face today. As this markup language matures, there will be plenty more to add, improve upon, etc. And my hope is that anyone who is interested will help out. I think that, together, we can make ONX as popular (if not more so) as XML.

Miscellaneous

Specification Version Number

The specification/document version at the top of this document is indicated by three numbers separated by periods. The first number is the major revision number. This is used to indicate major changes in the specification such as the addition of new features. The second number is the minor revision number and indicates updates to the specification due to errata such as ambiguities or typographical errors. These changes can change the actual meaning of the specification, but not the intended meaning. The third number is the document revision and indicates changes to the documentation that do not affect the specification, such as spelling and grammatical corrections, adding new examples, rewording for increased clarity, etc.

License

This document is licensed under Creative Commons Attribution-ShareAlike (version 2.5). Specific details can be found at http://creativecommons.org/licenses/by-sa/2.5/.