Python Schema Frameworks for Serialization and Validation
We use Schemas to describe data to be exchanged and presented. Schemas can be a core feature of objects in a system. In my work at Swisscom, we have created our own Schema system, but I recently investigated whether standard solutions using a Python schema framework would be feasible.
Python is a dynamically typed language. The data type of objects is not determined during compilation, but when code is run. This is powerful, but may be less than ideal when communicating with other programs or with humans (usually through APIs for code and documentation for humans).
A description of data objects crossing program boundaries permits validation (rejecting unforeseen data) and also conversion to and from other formats. Converting a tree of objects with links between them to a sequence of bytes is called serialization (or marshalling); reconstructing the object tree from that sequence of bytes is called deserialization (or unmarshalling).
Common serialization formats are:
- Python pickle, a format internal to Python (basically instructions for a stack machine that reconstructs objects). This can be used only for communication between Python programs (e.g., across process boundaries, or as an object storage format). The serialization contains all information, and doesn’t need a schema.
- String representation of Python built-in objects: Everything is converted to built-in types, which in turn have string representations (returned by __repr__). To reconstruct the object tree, the string is simply passed to eval.
- XML doesn’t have a straightforward one-to-one mapping to built-in language structures, but such a mapping can be constructed.
Except for Python pickles, an object tree cannot be reconstructed from a serialized representation without a form of annotation. A string '3.14159' may just be a string or a floating point number. A dictionary might be an actual dictionary, or might represent the contents of an object. References between objects may be broken on serialization and restored on deserialization.
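The difference can be seen with the standard library alone. The following is a minimal sketch: the sample data and variable names are my own, and ast.literal_eval stands in for eval because it only accepts literals.

```python
import ast
import datetime
import json
import pickle

# One list is referenced twice, and one value is not a JSON-native type.
shared = [1, 2]
tree = {"when": datetime.date(2020, 1, 31), "a": shared, "b": shared}

# pickle keeps both the types and the shared reference, without a schema.
from_pickle = pickle.loads(pickle.dumps(tree))

# For the text formats, everything has to be reduced to built-ins first.
plain = {"when": tree["when"].isoformat(), "a": shared, "b": shared}

# String representation: literal_eval is a restricted alternative to eval.
from_repr = ast.literal_eval(repr(plain))

# JSON: the date comes back as a string, the shared list as two copies.
from_json = json.loads(json.dumps(plain))

print(type(from_pickle["when"]).__name__, from_pickle["a"] is from_pickle["b"])
print(type(from_json["when"]).__name__, from_json["a"] is from_json["b"])
```

Without an annotation saying that "when" is a date and that "a" and "b" are the same object, the reader of the JSON has no way to restore the original tree.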
Based on my experience with home-grown schemas (both Python and otherwise), I have a set of requirements, some of which are more important, some a matter of style and taste:
Schemas must be able to handle objects, not only built-in types. When deserializing, object trees should be reconstructed.
Serialization / Deserialization / Validation
Given an object and its schema, the object should be serialized to a Python dictionary using only built-in types, or to a JSON string. Given the serialized form, the object, including all subobjects and references, should be reconstructed.
Given an object, validation should ensure it corresponds to its schema. This is useful if objects are constructed in code, from scratch.
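To make these three operations concrete, here is a minimal hand-rolled sketch. Field, Schema, Person and Address are hypothetical names for illustration and do not belong to any of the frameworks discussed below; references between objects (other than simple nesting) are left out.

```python
import json


class Field:
    """Hypothetical field description: a type, a mandatory flag, a nested schema."""

    def __init__(self, type_, required=True, schema=None):
        self.type_ = type_
        self.required = required
        self.schema = schema  # schema of a nested object, if any


class Schema:
    """Hypothetical schema object tying a class to its named fields."""

    def __init__(self, cls, **fields):
        self.cls = cls
        self.fields = fields

    def validate(self, obj):
        """Return a list of error strings; an empty list means valid."""
        errors = []
        for name, field in self.fields.items():
            value = getattr(obj, name, None)
            if value is None:
                if field.required:
                    errors.append(name + ": missing")
            elif not isinstance(value, field.type_):
                errors.append(name + ": expected " + field.type_.__name__)
            elif field.schema is not None:
                errors.extend(name + "." + e for e in field.schema.validate(value))
        return errors

    def serialize(self, obj):
        """Convert the object to a dictionary of built-in types."""
        data = {}
        for name, field in self.fields.items():
            value = getattr(obj, name, None)
            if field.schema is not None and value is not None:
                value = field.schema.serialize(value)
            data[name] = value
        return data

    def deserialize(self, data):
        """Rebuild the object tree; __init__ is bypassed, fields are set directly."""
        obj = self.cls.__new__(self.cls)
        for name, field in self.fields.items():
            value = data.get(name)
            if field.schema is not None and value is not None:
                value = field.schema.deserialize(value)
            setattr(obj, name, value)
        return obj


class Address:
    def __init__(self, city):
        self.city = city


class Person:
    def __init__(self, name, address=None):
        self.name = name
        self.address = address


address_schema = Schema(Address, city=Field(str))
person_schema = Schema(
    Person,
    name=Field(str),
    address=Field(Address, required=False, schema=address_schema),
)

alice = Person("Alice", Address("Bern"))
as_json = json.dumps(person_schema.serialize(alice))
clone = person_schema.deserialize(json.loads(as_json))
```

The schema carries the annotation that the JSON lacks: it knows that "address" holds an Address, so the nested dictionary is turned back into an object.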
Schemas should be able to inherit from other Schemas, and refine their fields. When Schemas are attached to classes, subclasses should be able to easily (if not automatically) inherit the superclass’ schema.
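One way this could look, as a sketch under my own assumptions: the hypothetical Schema below accepts a base schema and copies its fields before applying refinements, and subclasses without a declaration find the superclass schema through ordinary attribute lookup.

```python
class Field:
    """Hypothetical field description with keyword options."""

    def __init__(self, type_, required=True):
        self.type_ = type_
        self.required = required


class Schema:
    """Hypothetical schema object that can be derived from a base schema."""

    def __init__(self, base=None, **fields):
        # Copy the base fields first so the derived schema can refine them
        # without touching the base.
        self.fields = dict(base.fields) if base is not None else {}
        self.fields.update(fields)


class Animal:
    schema = Schema(name=Field(str), legs=Field(int, required=False))


class Bird(Animal):
    # Inherit Animal's fields, refine 'legs', and add 'wingspan'.
    schema = Schema(
        Animal.schema,
        legs=Field(int, required=True),
        wingspan=Field(float, required=False),
    )


class Sparrow(Bird):
    pass  # no declaration needed: attribute lookup finds Bird.schema


print(sorted(Bird.schema.fields))
```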
Schemas should be open to additional attributes, for example, to define screen layouts.
The Schema Framework should be able to generate descriptions according to standards, e.g., OpenAPI for REST (a.k.a. Swagger). The definition is basically a JSON file itself and is pretty easy to construct.
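As an illustration of how little is needed, the sketch below builds an OpenAPI Schema Object from a simple field table. The to_openapi function and the (type, required) tuple layout are my own assumptions; only the output keys ("type", "properties", "required") and the primitive type names come from OpenAPI.

```python
import json

# Mapping from Python types to OpenAPI primitive type names.
OPENAPI_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def to_openapi(fields):
    """Build an OpenAPI Schema Object from {name: (python_type, required)}."""
    properties = {}
    required = []
    for name, (py_type, is_required) in sorted(fields.items()):
        properties[name] = {"type": OPENAPI_TYPES[py_type]}
        if is_required:
            required.append(name)
    result = {"type": "object", "properties": properties}
    if required:
        # 'required' is left out entirely when no field is mandatory.
        result["required"] = required
    return result


person_fields = {"name": (str, True), "age": (int, False)}
person_doc = to_openapi(person_fields)
print(json.dumps(person_doc, indent=2, sort_keys=True))
```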
If you need to serialize to XML, XML Schema should be supported to document and to validate XML documents.
Independence of Frameworks
A Schema framework is most versatile if it is not tied to any other component, i.e., application servers, databases, ORM frameworks. If the schema is coupled tightly into an ORM, it’s hard to use it with a non-relational database, e.g., MongoDB.
Many frameworks declare a schema as a class to benefit from inheritance. I prefer a Schema object, for the following reasons:
- Declaring Schemas as a separate class is probably the easiest to read and write. Inheritance of class variables is automatic. Dynamic manipulation of schema classes is not so easy, and the class itself cannot be used for other things (because each variable becomes part of the schema).
- However, having a separate schema class and object class makes it harder to keep the definitions in sync.
- Since an inner class cannot refer to an outer class, nesting classes is not a good solution.
- Objects are easier to handle – object state and methods are easier to work with than their class counterparts. So I prefer a Schema to be an object, declared as a single class variable in the implementation class.
- The __init__ constructor may take keyword arguments, which is much easier to read than defaults supplied somewhere in code.
- A Schema defined as nested dictionaries is hard to read, especially if fields have many parameters (such as validation, ranges, defaults, mandatory declaration, etc.)
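The two declaration styles can be compared side by side. This is only a sketch: Field, Schema, PersonSchema, Person and fields_of are hypothetical names, not the API of any particular framework.

```python
class Field:
    """Hypothetical field description; options are keyword arguments."""

    def __init__(self, type_, required=True, default=None):
        self.type_ = type_
        self.required = required
        self.default = default


# Style 1: the schema is a class of its own, separate from the object class.
class PersonSchema:
    name = Field(str)
    age = Field(int, required=False, default=0)


def fields_of(schema_class):
    """Collect fields by inspecting class attributes (inherited ones ignored)."""
    return {k: v for k, v in vars(schema_class).items() if isinstance(v, Field)}


# Style 2: the schema is an object held in a single class variable.
class Schema:
    def __init__(self, **fields):
        self.fields = dict(fields)


class Person:
    schema = Schema(
        name=Field(str),
        age=Field(int, required=False, default=0),
    )

    def __init__(self, name, age=0):
        self.name = name
        self.age = age


# Dynamic manipulation of a schema object is ordinary dictionary handling.
Person.schema.fields["email"] = Field(str, required=False)

print(sorted(fields_of(PersonSchema)))
print(sorted(Person.schema.fields))
```

In the second style the definition lives next to the code it describes, and the rest of the class namespace stays free for other uses.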
Thanks to Python’s dynamic nature, it is often possible to wrap a framework to provide another style – e.g., wrap dictionaries in objects; or provide a mapping interface for objects. However, this adds conceptual overhead, e.g., for usage documentation.
Open Source Health
I would only consider open source packages, which support Python >=3.5 and 2.7 (unless they’re new and you have left Python 2 behind). Actively maintained packages with stable releases are a plus, as creating schema declarations is a major investment for big applications.
Consider your specific requirements. As long as there is no clear leader among the frameworks, some might fulfil your exact requirements, whereas others might be extensible to accommodate them. Because your requirements might change, I’d go for the extensible ones.
Schemas embedded in frameworks
These schema tools are embedded in larger frameworks. They’re the prime choice if you’re using that framework in its intended way. They’re often a bad choice if you need something that does not fit the primary purpose of the framework, e.g., don’t use an Object Relational Mapper (ORM) with a non-relational database.
- Django is the big, established web application server with its own ORM. It sports a class-based schema, with a separation between schemas and objects referring to them.
- SQLAlchemy is a powerful and versatile ORM that abstracts from the actual SQL dialect used. It uses an object-based schema and suits all my requirements except that it is tied into the ORM framework.
- Graphene is the Python library for GraphQL, the new kid on the block for interfaces. By declaring schemas for all objects and accessors to retrieve all instances or a single instance, Graphene is able to return arbitrary graphs of objects. Graphene uses class-based schemas, but there are adapters for other styles, such as Django and SQLAlchemy.
Stand-alone Schema Frameworks
I’ve looked at some specific Schema frameworks, with no claim to completeness.
- Cerberus (from the Eve project) is a very extensible validator that uses dictionaries as a specification format. Although it doesn’t support objects natively, it is extensible.
- Colander is the framework used by the Pyramid application server. It is rather complex and supports several ways to declare schemas (class-based and object-based). Error messages and strings are translated to several languages.
- KIM serializes directly to JSON, and has many advanced features to customize its behavior. It fully supports object creation. The Schema is class based.
- Marshmallow is an extensive library supporting objects and uses a class-based declaration. It feels slightly over-engineered to me, but that’s a personal taste.
- Schema: A very simple, but extensible library. It basically validates dictionaries. I like its name – it says what it is.
- Valideer uses a dictionary-based schema. If you prefer objects for readability, the Schema can be created as a series of nested objects. Creating objects is not directly supported.
- Voluptuous has a nice object-based schema declaration. It is not clear to me how I would reconstruct objects easily.
Selecting a Framework
Some frameworks are powerful, but you have to follow their style closely; some are less expressive, but are very open to extension. Because Python has an open source community in which anyone can publish a package, there is not yet a clear standard in this domain. Unless someone like Kenneth Reitz comes along building Schemas for Humans, people will contribute their playfully named frameworks which fill various niches and leave you to pick the most suitable.
A few hints and rules might help:
- If you use a framework such as Django or SQLAlchemy, stick with the tools the framework provides (unless you know very well why you’re different).
- A framework used by a larger product gets more real-life exposure; completeness and quality are more likely.
- Be sure of your requirements: Do you need to validate documents, or perform serialization? Do you have a document exchange format, or do you want to provide an interface to the core objects of your application (Naked Objects takes this to the extreme)?
- Prototype your main use case as part of your evaluation. This is also a check on the quality of the documentation – if you get stuck, perhaps the documentation is lacking for your purpose – “If the implementation is hard to explain, it’s a bad idea.” (Zen of Python).
Finally, pick a framework that suits your style. Ideally, it should be Pythonic. What is Pythonic? The Zen of Python inspires a Pythonic way of thinking and coding. It promotes readable but succinct code that uses language features without abusing them. It states “There should be one – and preferably only one – obvious way to do it.” As we have seen, there is a wide variety of styles to express schemas, so this is a hard one to fulfill.
It is not easy to say which framework is the most Pythonic. Dynamic construction of Schemas instead of a static declaration is important to me, as well as keeping definition and code closely together.