Sutra-Storage ideas : Sinzui

Sutra is a proposed metadata database. It stores and retrieves data about local and remote resources such as files and people. Some properties are intrinsic to the resource, like image size or music artist. Other properties are external, such as filename or creator. Resource properties are discovered incidentally and attributed in an ad hoc fashion. Property names and purposes are neither rigorously defined nor required, and that holds for intrinsic properties too. Sutra requires an extensible data model and a flexible database to accomplish its purpose.

The Storage schema is simple: ChildNode, where anonymous, numbered resources are linked; three attribute tables, where resources keep their named properties; TextRecords, where resource content is stored; and two tables that issue and record the resource ids that link everything together. This schema has some shortcomings. Some links between resources are named, but cannot be represented. For instance, the author of a resource might be represented by an entry in an address book, and that in turn might link to a homepage. In this scenario the link is both an attribute and a ChildNode, but that cannot be easily represented in the database. Named links could be put in the NumberSoup table, but there is no means to indicate that the number is not a literal value but a reference to another ChildNode. The operational data about how attributes work can be stored in the attribute tables, mixed with the content data, but this can lead to conflicts between libstorage and the content it manages. Additionally, there is no namespace facility to prevent attribute name collisions between different kinds of resources. There is no means to define attribute names, their use, and to what they belong. The three attribute tables represent only three simple data types and cannot manage more complex or refined data types like URIs or bytes, yet that is the very type of data Storage proposes to store.

The Sutra schema is also simple. It addresses Storage's problems by moving and consolidating property information, and by adding a table to store type information. The Resource table stores resources by unique id and URI. The literal and link properties of a resource are stored in the Property table. The Type table defines all properties, and the Property table's type column links to it. The Content table is identical to Storage's TextRecord table.

[Sutra-Storage schema]

The Resource table represents resources in two ways: by a unique numeric id (uid) and by a unique URI. The uid is used as a foreign key in the Property and Content tables. The URI must be valid and may, but need not, be real. Anonymous resources are represented by valid but meaningless URIs.
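One way to produce a "valid but meaningless" URI for an anonymous resource is to mint a random UUID URN. The sketch below assumes that approach; the function names and the loose validity check are illustrative only and are not part of the Sutra proposal.

```python
import uuid
from urllib.parse import urlparse


def anonymous_uri() -> str:
    """Return a syntactically valid URI that identifies nothing in particular.

    Assumption: a random urn:uuid is acceptable as the meaningless URI.
    """
    return uuid.uuid4().urn


def is_valid_uri(uri: str) -> bool:
    """Loose check (an assumption): a scheme followed by some non-empty part."""
    parts = urlparse(uri)
    return bool(parts.scheme) and bool(parts.netloc or parts.path)
```

Because each URI is random, two anonymous resources will not collide on the unique URI column in practice, and a real URI can replace the placeholder later without changing the uid.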

The Property table represents the union of Storage's three attribute tables and ChildNode's link responsibility. The resource and type columns are foreign keys referencing the Resource and Type tables respectively. The value is a string because a string is the most versatile format for representing data types, as exemplified by XML-Schema. Most of the data Storage deals with is string data, so little work will be needed to convert it. In the rarer case of numbers and dates, comparison operations can be performed easily and efficiently, provided the values are encoded correctly. The isliteral column is a flag indicating that the value is a literal rather than a foreign key pointing to a resource. The weight column provides a means to order and identify properties that have the same resource and type. It is a mutable id that applications might change to indicate that higher-valued properties are more relevant and more commonly used. There are scenarios where the value might point to a resource, but the application saving the information chooses to save the literal value. Typing information could be used to declare whether a value is a literal or a link, but that can lead to problems when there is no resource to link to, or when the Resource table fills with entities that represent leaf nodes. The property value can be converted from link to literal as needed.
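To illustrate what "correct encoding" could mean for string-valued numbers and dates, here is a minimal sketch. It assumes non-negative integers padded to a fixed width and ISO 8601 calendar dates; the width and the helper names are assumptions, and negative or fractional numbers would need a different scheme.

```python
from datetime import date

# Assumed fixed width: enough digits for any unsigned 64-bit integer.
WIDTH = 20


def encode_uint(n: int) -> str:
    """Zero-pad a non-negative integer so that string order equals numeric order."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(n).zfill(WIDTH)


def encode_date(d: date) -> str:
    """ISO 8601 dates (YYYY-MM-DD) already sort chronologically as strings."""
    return d.isoformat()
```

With values stored this way, an ordinary string comparison in the database (for example in a WHERE or ORDER BY clause) gives the same answer as a numeric or chronological comparison, so no per-row conversion is needed at query time.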

The Type table contains the definitions of the types of properties in the Property table. Each type has a unique id. The namespace column is a token indicating the group to which the property type belongs. It is synonymous with XML-Namespace and similar to namespaces in some programming languages. The property column is the name of the type and is used for external representations of the properties. The datatype column indicates the kind of data a property holds and how it should be stored, as defined in XML-Schema data types. The domain column comes from RDF-Schema, where objects can be defined as classes (resources) or as properties. Domain is a foreign key linking to another entity in the Type table. In the Sutra model, classes and properties are stored in the same table and used similarly. The super column is a foreign key pointing to the super type of the type. The Type table provides a means to define all the kinds of resources and properties in the database. Strong queries restrict property matching to exactly one property. Weak queries match a group of properties constructed from the super relation. Applications can extend and explore the data in the Type table to store new kinds of data in the database and query it. The domain is used to distinguish between properties that define major objects (resources) and those that define minor objects (properties). The property concept from RDF-Schema allows properties to be attributed to resources in an ad hoc fashion; classes do not require attributes. A resource acquires properties by context (namespace). For example, an image has intrinsic properties like width and height, plus properties like filename and size because it is stored on a file system.
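The difference between strong and weak queries can be sketched with SQLite. The column names follow the description above, but the table layout, the sample types (creator, author, photographer) and the reading of "group constructed from the super relation" as a type plus all of its sub-types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE type (
    id INTEGER PRIMARY KEY, namespace TEXT, property TEXT, datatype TEXT,
    domain INTEGER REFERENCES type(id), super INTEGER REFERENCES type(id));
CREATE TABLE property (
    resource INTEGER, type INTEGER REFERENCES type(id),
    value TEXT, isliteral INTEGER, weight INTEGER);
-- Hypothetical types: author and photographer are sub-types of creator.
INSERT INTO type (id, namespace, property, datatype, super) VALUES
    (1, 'meta', 'creator', 'string', NULL),
    (2, 'doc', 'author', 'string', 1),
    (3, 'image', 'photographer', 'string', 1),
    (4, 'doc', 'title', 'string', NULL);
INSERT INTO property (resource, type, value, isliteral, weight) VALUES
    (10, 2, 'Alice', 1, 0),
    (11, 3, 'Bob', 1, 0),
    (12, 1, 'Carol', 1, 0),
    (10, 4, 'Notes', 1, 0);
""")


def strong_query(conn, type_id):
    """Match properties whose type is exactly type_id."""
    return conn.execute(
        "SELECT resource, value FROM property WHERE type = ? ORDER BY resource",
        (type_id,)).fetchall()


def weak_query(conn, type_id):
    """Match properties whose type is type_id or any type below it via super."""
    return conn.execute("""
        WITH RECURSIVE family(id) AS (
            SELECT ?
            UNION
            SELECT type.id FROM type JOIN family ON type.super = family.id
        )
        SELECT resource, value FROM property
        WHERE type IN (SELECT id FROM family) ORDER BY resource
        """, (type_id,)).fetchall()
```

Here a strong query for creator returns only the row stored under that exact type, while a weak query also returns the rows stored as author and photographer.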

The Content table contains file data. It represents the body of a resource, when the resource is a digital artifact that has a body, such as a document. Storage does not know the structure of files, so while a file might be a compound document of several kinds of content, it is stored as a single block.
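Taken together, the four tables described above might be sketched in SQLite as follows. The column names for Resource, Property and Type come from the text. The Content columns, the constraints, the sample rows and the helper functions are assumptions for illustration, not the proposal's actual definition.

```python
import sqlite3

SCHEMA = """
CREATE TABLE resource (uid INTEGER PRIMARY KEY, uri TEXT NOT NULL UNIQUE);
CREATE TABLE type (
    id INTEGER PRIMARY KEY, namespace TEXT NOT NULL, property TEXT NOT NULL,
    datatype TEXT, domain INTEGER REFERENCES type(id),
    super INTEGER REFERENCES type(id));
CREATE TABLE property (
    resource INTEGER NOT NULL REFERENCES resource(uid),
    type INTEGER NOT NULL REFERENCES type(id),
    value TEXT NOT NULL,                   -- always a string
    isliteral INTEGER NOT NULL DEFAULT 1,  -- 0 means value holds a resource uid
    weight INTEGER NOT NULL DEFAULT 0);
-- Assumed columns: the text only says Content mirrors Storage's TextRecord.
CREATE TABLE content (
    resource INTEGER NOT NULL REFERENCES resource(uid), body BLOB);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def load_example(conn):
    """Hypothetical data: a document whose author links to an address-book entry."""
    conn.executemany("INSERT INTO resource (uid, uri) VALUES (?, ?)", [
        (1, "file:///tmp/notes.txt"),
        (2, "urn:example:addressbook:alice")])
    conn.executemany(
        "INSERT INTO type (id, namespace, property, datatype) VALUES (?, ?, ?, ?)",
        [(1, "doc", "title", "string"), (2, "doc", "author", "anyURI")])
    conn.executemany(
        "INSERT INTO property (resource, type, value, isliteral, weight) "
        "VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "Notes", 1, 0),   # literal value
         (1, 2, "2", 0, 1)])      # link: the string "2" is the uid of resource 2
    conn.execute("INSERT INTO content (resource, body) VALUES (?, ?)", (1, b"hello"))


def resolve(conn, value, isliteral):
    """Return a literal as-is, or the URI of the resource that a link points to."""
    if isliteral:
        return value
    row = conn.execute(
        "SELECT uri FROM resource WHERE uid = ?", (int(value),)).fetchone()
    return row[0] if row else None


def properties_of(conn, uid):
    """List (name, resolved value) pairs for a resource, highest weight first."""
    rows = conn.execute(
        "SELECT t.property, p.value, p.isliteral FROM property AS p "
        "JOIN type AS t ON t.id = p.type WHERE p.resource = ? "
        "ORDER BY p.weight DESC", (uid,)).fetchall()
    return [(name, resolve(conn, value, flag)) for name, value, flag in rows]
```

The author property shows the case Storage could not express: a named property whose value is a reference to another resource, distinguished from a literal only by the isliteral flag.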