OOP in Python, part 13: Serialization and ABCs
MP 55: - What it takes to save an object, and a real-world use of abstract base classes.
Note: This post is part of a series about OOP in Python. The previous post discussed abstract base classes, and how they can be used to enforce constraints on subclasses. The next post looks at how exceptions are implemented in Python, to see a real-world use case of inheritance.
In the last post we discussed abstract base classes, and how they allow you to define constraints that subclasses must follow. There are many examples of abstract base classes in both the Python standard library, and in third-party libraries.
In this post we’ll look at one specific example of how an abstract base class is used to enforce constraints throughout a library, even when the end user ends up writing their own subclass. We’ll do this with a focus on marshmallow, a library that helps convert instances of custom classes to a form that allows them to be saved to a file.
Writing simple data to a file
There are a number of ways to save data to a file with Python. One of the most common approaches is to convert the data to JSON, and then write that JSON data to a file:
from pathlib import Path
import json
robot = {'name': 'Marvin'}
robot_data = json.dumps(robot)
path = Path("robot.json")
path.write_text(robot_data)
Here we start with a dictionary called robot
, which consists of just a single key-value pair. If you try to write a dictionary directly to a file, you’ll get an error. We first need to convert the Python dictionary object to JSON, and then we’re free to write it to a file.
Here’s the contents of robot.json after running this program:
{"name": "Marvin"}
This JSON file is also valid Python, but that’s not true of all JSON files.
Writing class data to a file
Let’s try to do this same thing with a simple class:
from pathlib import Path
import json
class Robot:
def __init__(self, name=""):
self.name = name
def say_hi(self):
print(f"Hi, I'm {self.name} the robot.")
# Create an instance of Robot.
robot = Robot('Marvin')
# Convert the data to JSON, and save it.
robot_data = json.dumps(robot)
path = Path("robot.json")
path.write_text(robot_data)
Instead of a dictionary we create a class called Robot
, with a single name
attribute. We make an instance, and try to call json.dumps()
with that instance as an argument.
Here’s the traceback:
Traceback (most recent call last):
File "robot_class.py", line 14, in <module>
robot_data = json.dumps(robot)
^^^^^^^^^^^^^^^^^
...
TypeError: Object of type Robot is not JSON serializable
This error tells us that a Robot
object is not “serializable”. To write an object to a file, Python must first structure that object as a sequence of bits—the 1s and 0s that make up a file. This process is called serialization, and it’s not always obvious how to serialize an object. For example, the JSON module can’t serialize custom classes by default.
When data is read from a file, the process of deserialization recreates the original object from that serialized data.
Serializing a class
There are many ways to serialize a simple class. In this post we’re going to use a library that works for more complex classes as well.
Serialization comes up in every programming language; it’s not specific to Python. The process of converting data associated with an object to a series of bits that can be written to a file is sometimes called marshalling. The marshmallow library, whose name is a play on the word “marshall”, makes it easier to serialize custom objects.1
To serialize a class using marshmallow, you define a new class that inherits from marshmallow’s Schema
class. In this class, you specify exactly what data associated with an instance should be serialized.
Here’s how we can use marshmallow to store an instance of the Robot
class:
from pathlib import Path
from marshmallow import Schema, fields
class Robot:
....
class RobotSchema(Schema):
name = fields.Str()
# Create an instance of Robot.
robot = Robot('Marvin')
# Use RobotSchema to serialize the data.
schema = RobotSchema()
robot_data = schema.dumps(robot)
path = Path("robot.json")
path.write_text(robot_data)
We first import two resources from marshmallow
: the Schema
class, and a module called fields
. The fields
module contains a number of classes that define how to serialize and deserialize specific kinds of information.
We need to define a companion class to Robot
that handles serialization of Robot
objects. This class inherits from Schema
, and the convention is to give it the same name as the class it handles, with Schema
added to the end. The class RobotSchema
tells Python which data needs to be serialized. To include an attribute of the class in the serialization process, you list the name of the attribute and assign it to a class from the fields
module:
class RobotSchema(Schema):
name = fields.Str()
We’ll look more closely at the code that does the actual serialization later in this post.
To save an instance of Robot
, we first create an instance of RobotSchema
, here assigned to schema
. We then call schema.dumps()
, passing the instance as an argument. The dumps()
method generates a valid JSON string containing the data specified by RobotSchema
. This serialized data can then be written to a file.
Here’s the contents of robot.json after running this program:
{"name": "Marvin"}
This is the data we asked RobotSchema
to preserve when serializing instances of the Robot
class.
Deserializing a class
Writing class data to a file is only useful if you can then read the file and recreate an instance of the class. To do this, we’ll need to add a method to RobotSchema
that handles deserialization.
Here’s a program that reads in the data associated with an instance of Robot
, and recreates an equivalent instance:
import json
from pathlib import Path
from marshmallow import Schema, fields, post_load
class Robot:
...
class RobotSchema(Schema):
name = fields.Str()
@post_load
def make_robot(self, data, **kwargs):
return Robot(**data)
# Read in the data.
path = Path("robot.json")
robot_data = path.read_text()
robot_data = json.loads(robot_data)
# Use RobotSchema to deserialize the data.
schema = RobotSchema()
robot = schema.load(robot_data)
# Use the robot.
robot.say_hi()
print(robot)
We need to import the json
module, because we’ll need to convert the saved JSON data back to a Python data structure in order to use it. We also need to import the post_load
decorator from marshmallow
.
In RobotSchema
, we add a method called make_robot()
. The @post_load
decorator causes this method to run after we call schema.load()
. The data
argument receives the information needed to recreate an instance of Robot
. The method make_robot()
only has one line; it calls Robot()
with the data read in from robot.json, and returns the resulting object.
When we read the data from robot.json using path.read_text()
, it’s read in as a string. We use json.loads()
to convert this string to a valid Python object. In this case, the result is a dictionary containing all the information needed to recreate an instance of Robot
, as we specified in RobotSchema
.
We call schema.load()
, passing it the data that was read in. This call returns an instance of Robot
, which we assign to robot
:
schema = RobotSchema()
robot = schema.load(robot_data)
To verify that we have a functional instance of Robot
, we call say_hi()
and print the actual object.
Here’s the output:
Hi, I'm Marvin the robot.
<__main__.Robot object at 0x101845ed0>
The object we just created acts like an instance of Robot
, and we’ve verified that it is in fact an instance of Robot
.
More about serialization
Serialization is about more than just saving data to files. For example:
Streaming services must serialize any data before it’s transmitted, and that same data must be deserialized once it’s received.
When data is transmitted through an API, that data must be serialized before transmission and deserialized when it’s received as well.
Caching systems often need to handle serialization and deserialization.
There are numerous other situations where serialization plays an important role as well.
marshmallow fields
The marshmallow library is useful because it handles much more than just string attributes. The documentation for fields lists over 30 different field types that support serialization. You can work with data such as dates and times, emails, numerical data, IP addresses, URLS, many kinds of sequences, and more. Without a dedicated library, you’d have to write your own methods for how to convert different kinds of data like these to a format that can be written to a file.
The String
field
Let’s look at how the String
field is implemented, since it serves as the base class for many other fields. Here’s the most relevant code from the fields.String
class:
class String(Field):
...
default_error_messages = {
"invalid": "Not a valid string.",
"invalid_utf8": "Not a valid utf-8 string.",
}
def _serialize(...) -> str | None:
...
def _deserialize(...) -> typing.Any:
...
There are several things to note here. First, String
inherits from Field
, so we’ll look at that class in a moment. Next, the String
class defines three things:
A dictionary of error messages that will be displayed if the data provided doesn’t match the field type;
A
_serialize()
method.A
_deserialize()
method.
If you wanted to save a custom object to a file without using a library like this, you might be able to get away without writing error messages. But you would definitely have to write your own serialization and deserialization methods.
The Field
class
String
inherits from Field
, so let’s take a look at its source code. The Field
class is just over 400 lines long, so I’ll only show a small part of it:
class Field(FieldABC):
...
default_error_messages = {
"required": "Missing data for required field.",
"null": "Field may not be null.",
"validator_failed": "Invalid value.",
}
...
def serialize(...)
...
return self._serialize(...)
def deserialize(...)
...
output = self._deserialize(value, attr, data, **kwargs)
self._validate(output)
return output
def _bind_to_schema(...)
...
def _serialize(...)
...
def _deserialize(...)
...
First, there are some default error messages that can be used if a subclass of Field
doesn’t define its own error messages. There are also a number of methods not shown that handle tasks like accessing and validating data.
The five methods shown here demonstrate how several of the topics discussed in this series come together in a real-world project. We have two public methods, serialize()
and deserialize()
. These are the methods we called earlier when saving and loading an instance of Robot
. Both of these methods call helper methods to do the bulk of their work.
Next, we have a helper method called _bind_to_schema()
. This method makes sure the data specified in a Schema
subclass, such as RobotSchema
, is included in the serialization process.
There are two more helper methods, _serialize()
and _deserialize()
. These methods are overridden by subclasses, which define their own specific approaches to converting data to a form that can be written to a file.
Notice that Field
inherits from another class, FieldABC
. That must be an abstract base class; let’s take a look at it.
The FieldABC
class
Here’s the full listing of FieldABC
:
class FieldABC(ABC):
"""Abstract base class from which all Field classes inherit."""
parent = None
name = None
root = None
@abstractmethod
def serialize(self, attr, obj, accessor=None):
pass
@abstractmethod
def deserialize(self, value):
pass
@abstractmethod
def _serialize(self, value, attr, obj, **kwargs):
pass
@abstractmethod
def _deserialize(self, value, attr, data, **kwargs):
pass
The file fields.py is over 2,000 lines long and it contains more than 30 classes representing different kinds of fields. Every one of these classes has FieldABC
in its class hierarchy. The class FieldABC
is only 20 lines long, but it enforces a consistent structure on all those other field subclasses:
Every field class must implement a public
serialize()
method, and a publicdeserialize()
method. These are the methods that end users will call.Every field class must implement helper methods called
_serialize()
and_deserialize()
. These are the methods that do the messy work of serialization and deserialization.
There’s an interesting thing to note here, though, which illustrates the flexbility of OOP and inheritance. The FieldABC
class requires that at least one subclass implements these methods. Sometimes the inheritance chain is three layers deep:
class FieldABC(ABC):
...
class Field(FieldABC):
...
class String(Field):
...
In this case, Field
implements all four required methods. String
only implements _serialize()
and _deserialize()
, handling the serialization process for string data.
Some fields have four layers of inheritance:
class FieldABC:
...
class Field(FieldABC):
...
class String(Field):
...
class Email(String):
...
The Email
class doesn’t implement any of the four required methods. Instead, it focuses on validating that the field meets the required formatting conventions of email addresses. It then lets the String
class handle the serialization work, by treating the email address as a string.
When working with abstract base classes, the overall inheritance chain needs to satisfy the requirements laid out in the base class.
When working with abstract base classes, the overall inheritance chain needs to satisfy the requirements laid out in the base class. Each class in the hierarchy does not have to meet those requirements individually.
Custom Field
classes
You can create a custom field class, by inheriting directly from Field
. Typically, this involves overriding the _serialize()
and _deserialize()
methods.
The documentation for writing custom field classes uses a field for a PIN code as an example:
class PinCode(fields.Field):
"""Field that serializes to a string of numbers and deserializes
to a list of numbers.
"""
def _serialize(...):
...
def _deserialize(...):
try:
return [int(c) for c in value]
except ValueError as error:
msg = "Pin codes must contain only digits."
raise ValidationError(msg) from error
One feature of this custom field is that it raises a ValidationError if the data that’s being deserialized doesn’t match the format expected of a PIN code. Deserialization is a trust boundary, and the ability to write custom deserialization methods that include validation and security checks is quite useful.
Conclusions
While I don’t expect many people reading this to go out and start using marshmallow, I hope this post has tied together a number of OOP concepts. Many of those concepts can only really start to make sense in the context of a moderately complex, real-world project. Here are some of the main takeaways:
The instances we create from custom classes can’t be saved directly; they must be converted to a format that can be written to a file (serialization).
Serialized data is also useful for other purposes, such as streaming and implementing APIs.
Serialization is a perfect example of a common task that should be implemented by a library.
When designing such a library, abstract base classes can enforce simple constraints on a large and evolving code base.
When implementing a library, a clear understanding of inheritance and the OOP model can go a long way toward building a project that’s both flexible and maintainable over time.
When using a library like this, understanding its internal structure can give you insights beyond what you’re likely to get from just reading the documentation.
A clear understanding of OOP and inheritance also sets you up to be able to write custom extensions of well-established libraries, such as custom fields when using a library like marshmallow.
If you’re looking to use marshmallow, be sure to consider Pydantic as well. Pydantic is a newer library that can do many of the same things as marshmallow. I’ll write a followup post at some point highlighting what Pydantic offers.
Resources
You can find the code files from this post in the mostly_python GitHub repository.
For example if you wanted to save a robot
instance you could write its default __dict__
attribute to a file, if that’s all the data you wanted to preserve from the instance. However, the default data in __dict__
may not include the exact data you want to store from the instance. It may store more information than you want to keep in a file, or it may leave out some important information.
It’s also not always straightforward to recreate (deserialize) an object just from the data contained in the __dict__
attribute.