OOP in Python, part 15: Class structure in pathlib
MP 59: How file paths are modeled in pathlib.
Note: This post is part of a series about OOP in Python. The previous post discussed how classes are used to implement exceptions in Python. The next post looks at the class structure in the Matplotlib library.
In the last post we looked at the Exception class hierarchy, much of which is implemented in C. In this post we’ll look at a much newer library, pathlib, which is implemented almost entirely in Python.
The need for pathlib
In the old days of Python, people used strings to represent file paths. This was problematic for a number of reasons. One of the most significant issues arose when dealing with different operating systems.
As a brief example, consider a program that shows a file’s location before doing any other work:
path = 'static/note_images/a1.png'
print(f"File location: {path}")
This is taken from a project I’ve been working on recently that helps people learn the grand staff when playing piano. Here’s the image file:
This simple program works on my macOS system:
File location: static/note_images/a1.png
But file paths on Windows use backslashes, so that path is different on Windows:
path = 'static\note_images\a1.png'
print(f"File location: {path}")
This path looks like it should be correct on Windows, but here’s the output:
File location: static
ote_images1.png
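This program falls apart because the sequence \n in the string path is interpreted as a newline. The sequence \a is an escape sequence as well (the ASCII bell character), which is why the a in a1.png disappears too.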
To fix this, you’d see paths written like this on Windows:
path = 'static\\note_images\\a1.png'
print(f"File location: {path}")
This works because the first backslash escapes the next backslash:
File location: static\note_images\a1.png
However, this is a pretty inelegant and inefficient way of handling something as important as file and directory paths.
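As a small aside, the escaping behavior described above can be checked directly. This is only a sketch comparing three spellings of the same string; the raw-string form (the r prefix) isn’t used in the examples above, but it’s a standard Python feature.

```python
# Three ways of spelling the same Windows-style path as a plain string.
unescaped = 'static\note_images\a1.png'    # \n and \a are escape sequences
escaped = 'static\\note_images\\a1.png'    # each \\ is one literal backslash
raw = r'static\note_images\a1.png'         # raw string: backslashes kept as-is

print('\n' in unescaped)   # True: the newline really is in the string
print(escaped == raw)      # True: both spell the intended path
print(len(escaped))        # 25: each backslash counts as one character
```

All three are still plain strings, so none of them knows anything about files or operating systems.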
Beyond just making paths look consistent across OSes, there are a number of things we’d like to do with fully-featured path objects that we can’t do with strings:
Get parts of a path: root, parent, filename, file extension
Check if a file or directory exists.
Find out if a path represents a file or a directory.
Read from a file.
Write to a file.
Many more common file and directory operations.
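To make that list concrete, here’s a minimal sketch of those operations using pathlib’s public API. The file name a1.txt and the temporary directory are assumptions made for this example, so it can run without touching any real project files.

```python
from pathlib import Path
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'note_images' / 'a1.txt'

    # Parts of a path.
    print(path.name)          # a1.txt
    print(path.suffix)        # .txt
    print(path.parent.name)   # note_images

    # Existence and type checks, then writing and reading.
    path.parent.mkdir()
    exists_before = path.exists()
    path.write_text('A1')
    contents = path.read_text()
    is_file = path.is_file()
    is_dir = path.parent.is_dir()

print(exists_before, contents, is_file, is_dir)   # False A1 True True
```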
Python has a lot of resources in the os module for working with files and directories. But the move to creating a library with support for paths as dedicated objects has made it much easier and more intuitive to work with paths in Python, in a way that facilitates cross-platform functionality.
A simple example with pathlib
Let’s see what the previous example looks like if we use pathlib instead of strings:
from pathlib import Path
path = Path('static/note_images/a1.png')
print(f"File location: {path}")
We import the Path class from the pathlib module. We then make a Path object, using forward slashes. This file generates the same output as the previous example on macOS:
File location: static/note_images/a1.png
That’s good, but here’s the important part. On Windows the same program file, including the forward slashes, generates this output:
File location: static\note_images\a1.png
The file path is no longer a string that knows nothing about file paths and operating systems. Instead it’s a full Path object, which has lots of functionality built in that makes it aware of common file operations, including how they should be represented on each OS. Here the variable path is being formatted appropriately for whichever operating system the program is running on.
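If you only have one operating system available, you can still see both formats. This sketch uses the PurePosixPath and PureWindowsPath classes from pathlib, which can be created on any OS because they never touch the file system.

```python
from pathlib import PurePosixPath, PureWindowsPath

text = 'static/note_images/a1.png'

# Build the same path with each flavour, regardless of the current OS.
posix_style = str(PurePosixPath(text))
windows_style = str(PureWindowsPath(text))

print(posix_style)     # static/note_images/a1.png
print(windows_style)   # static\note_images\a1.png
```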
The pathlib module does a lot of work behind the scenes to check what the current operating system is, and how things should be handled given that information. Let’s see how OOP principles were used to implement this kind of functionality.
Path objects
Since we just made a Path object, let’s first see how that class is implemented. Here’s the definition of the class, along with part of its docstring:
class Path(PurePath):
    """PurePath subclass that can make system calls.
    Depending on your system, instantiating a Path will return
    either a PosixPath or a WindowsPath object...
    """
This is interesting already! The class Path inherits from PurePath, so we’ll take a look at that class in a bit. But also, calling Path() doesn’t return an instance of Path. Instead it either returns an instance of PosixPath, or an instance of WindowsPath. We’ll have to look at those classes as well.
A real-world use of __new__()
Let’s first see how calling Path() returns an instance of a different class. We saw in an earlier post that __new__() is the method responsible for creating new instances of a class, so let’s look at that method:
class Path(PurePath):
    ...
    def __new__(cls, *args, **kwargs):
        if cls is Path:
            cls = WindowsPath if os.name == 'nt' else PosixPath
        self = cls._from_parts(args)
        if not self._flavour.is_supported:
            raise NotImplementedError(...)
        return self
The __new__() method gets a reference to the type of class that’s being created, which is passed to the cls argument. When that class is Path itself, the code shown here checks the value of os.name. If this value is 'nt' then we’re on Windows, and it changes the cls type to WindowsPath. If the value is anything else, it changes the cls type to PosixPath. This is the type that’s appropriate for macOS and most Linux systems.
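The same pattern can be tried out in isolation. The LineEnding classes below are hypothetical and exist only for this sketch; they aren’t part of pathlib. The last line uses the real Path class to show the behavior described above.

```python
import os
from pathlib import Path


class LineEnding:
    """Hypothetical base class; calling it returns an OS-specific subclass."""

    def __new__(cls):
        # Only redirect when the base class itself is being instantiated.
        if cls is LineEnding:
            cls = WindowsLineEnding if os.name == 'nt' else PosixLineEnding
        return object.__new__(cls)


class PosixLineEnding(LineEnding):
    value = '\n'


class WindowsLineEnding(LineEnding):
    value = '\r\n'


ending = LineEnding()
print(type(ending).__name__)      # PosixLineEnding or WindowsLineEnding
print(type(Path('.')).__name__)   # PosixPath or WindowsPath
```

The cls is LineEnding check matters: without it, asking for a specific subclass directly would also be redirected.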
Path methods: path.cwd()
Many of the methods defined in Path are wrappers around calls to functions in the os module, or calls to other methods within Path itself. The overall effect is to make it easier for end users to do what they need to with path objects.
For example, here’s the implementation of the path.cwd() method:
class Path(PurePath):
    ...
    @classmethod
    def cwd(cls):
        """Return a new path pointing to the current working directory
        (as returned by os.getcwd()).
        """
        return cls(os.getcwd())
This is a one-line method, but it does a lot to simplify things for end users. It’s a thin wrapper around the os.getcwd() function. That might not seem particularly beneficial, but take a look at the output of path.cwd() compared to os.getcwd():
>>> path.cwd()
PosixPath('/Users/eric/.../mp59_oop15')
>>> os.getcwd()
'/Users/eric/.../mp59_oop15'
The main difference here is what gets returned. With path.cwd(), you get back a Path object that’s aware of how your system works. With os.getcwd(), you’re stuck with a string.
This helps explain the last line of the path.cwd() method:
return cls(os.getcwd())
The call to os.getcwd() returns a string representation of a path. If you remember that cls is a reference to the current class type, you can start to see that wrapping cls() around that return value gives us back a new Path object. Except it won’t be a Path object; it will be a WindowsPath object on Windows, and a PosixPath object on macOS and Linux.
This is pretty interesting to think about. path.cwd() is a method that, for its return value, creates an instance of its own class. It’s a little Inception-like, for those who’ve seen that movie.
The difference is even more noticeable on Windows:
>>> path.cwd()
WindowsPath('C:/Users/eric/.../mp59_oop15')
>>> os.getcwd()
'C:\\Users\\eric\\...\\mp59_oop15'
The Path object representing the current working directory on Windows is identical to what we get on other systems, except for the overall class type. All the custom information and behavior needed for that OS is contained in the class. The representation of the path, in Python code, is consistent across all systems.
The return value from os.getcwd() is a plain string, and an ugly one: its repr shows every backslash doubled. More important, though, is what you can do with the return value. When using the methods from pathlib, you can continue to work with what’s returned, because it’s another path object.
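Here’s a short sketch of what “continuing to work with what’s returned” can look like. The static/note_images/a1.png path is only joined on for illustration; nothing here requires that file to exist.

```python
import os
from pathlib import Path

cwd_path = Path.cwd()      # a PosixPath or WindowsPath object
cwd_string = os.getcwd()   # a plain string

# The path object can be extended and inspected directly.
image_path = cwd_path / 'static' / 'note_images' / 'a1.png'
print(image_path.name)          # a1.png
print(image_path.parent.name)   # note_images

# Doing the same with the string means going back to os.path functions.
image_string = os.path.join(cwd_string, 'static', 'note_images', 'a1.png')
print(os.path.basename(image_string))   # a1.png
```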
Path methods: path.exists()
Let’s look at how one of the most useful methods in the Path class is implemented. The path.exists() method tells you whether a file or directory exists, so you can verify it exists before taking any other actions. It saves you from having to use try-except blocks to handle the possibility of missing files or directories.
Here’s the method:
class Path(PurePath):
    ...
    def exists(self):
        """
        Whether this path exists.
        """
        try:
            self.stat()
        except OSError as e:
            if not _ignore_error(e):
                raise
            return False
        except ValueError:
            # Non-encodable path
            return False
        return True
This is another thin wrapper, around another method in the same class. The path.stat() method makes a call to os.stat(), which is a wrapper for a system-level stat call. Many people don’t know about stat, and don’t necessarily need to if all they really want to know is whether a path exists or not. The path.exists() method makes the stat() call, and interprets the results in a useful way.
All of this results in a more intuitive API for working with paths:
>>> path
PosixPath('static/note_images/a1.png')
>>> path.exists()
True
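To see the stat-based idea on its own, here’s a simplified sketch. The exists_via_stat() function is a hypothetical stand-in written for this example; it only handles FileNotFoundError, whereas the real method shown above also consults _ignore_error() and catches ValueError.

```python
from pathlib import Path
import tempfile


def exists_via_stat(path):
    """Hypothetical, simplified stand-in for Path.exists()."""
    try:
        path.stat()
    except FileNotFoundError:
        # The real method treats a few more error cases as "does not exist".
        return False
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    present = Path(tmp_dir) / 'a1.png'
    present.touch()
    missing = Path(tmp_dir) / 'b2.png'

    results = (exists_via_stat(present), exists_via_stat(missing))
    builtin = (present.exists(), missing.exists())

print(results)   # (True, False)
print(builtin)   # (True, False)
```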
Let’s look at the OS-specific path classes, and then come back to the PurePath class.
The PosixPath and WindowsPath classes
Here’s the entire implementation of PosixPath, which is what you get when you call Path() on most non-Windows systems:
class PosixPath(Path, PurePosixPath):
    """Path subclass for non-Windows systems.
    On a POSIX system, instantiating a Path should return this object.
    """
    __slots__ = ()
This is a small class that combines the behavior of the Path and PurePosixPath classes. If you haven’t seen __slots__ before, it’s a way of restricting the set of attributes that can be defined for an instance of a class. The empty tuple here means you can’t add any new attributes to an instance of PosixPath.
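Here’s a minimal sketch of how an empty __slots__ behaves in a subclass. BasePath and ChildPath are hypothetical classes made up for this example. Note that the restriction only holds when every class in the hierarchy defines __slots__.

```python
class BasePath:
    """Hypothetical stand-in for a class that declares its attributes."""
    __slots__ = ('_parts',)

    def __init__(self, *parts):
        self._parts = list(parts)


class ChildPath(BasePath):
    # No new attributes; without this line instances would gain a __dict__.
    __slots__ = ()


path = ChildPath('static', 'note_images', 'a1.png')

try:
    path.color = 'red'
except AttributeError:
    rejected = True
else:
    rejected = False

print(rejected)                    # True
print(hasattr(path, '__dict__'))   # False
```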
WindowsPath has an almost identical structure:
class WindowsPath(Path, PureWindowsPath):
    """Path subclass for Windows systems.
    On a Windows system, instantiating a Path should return this object.
    """
    __slots__ = ()

    def is_mount(self):
        raise NotImplementedError("Path.is_mount() is unsupported...")
The only difference here is the is_mount() method, which overrides a parent class’ is_mount() method. This makes sure anyone who calls path.is_mount() on Windows gets an appropriate message that the method isn’t available on Windows.
Now let’s move back up the hierarchy, and see what the “pure” path classes look like.
The PurePosixPath and PureWindowsPath classes
Here are two of the classes that PosixPath and WindowsPath inherit from:
class PurePosixPath(PurePath):
    """PurePath subclass for non-Windows systems..."""
    _flavour = _posix_flavour
    __slots__ = ()

class PureWindowsPath(PurePath):
    """PurePath subclass for Windows systems..."""
    _flavour = _windows_flavour
    __slots__ = ()
These are each thin classes in the hierarchy. They each define a _flavour attribute, either _posix_flavour or _windows_flavour. These are the pieces that help determine things like whether a forward slash or a backslash should be used when formatting paths.
What is _flavour?!
The underscore in _flavour tells us it’s not meant to be used outside the class. But we’re trying to understand the inner workings of the module, so let’s figure out where it’s defined.
Two lines in the middle of pathlib.py stand out because they aren’t part of any class:
_windows_flavour = _WindowsFlavour()
_posix_flavour = _PosixFlavour()
These two lines define one instance of the class _WindowsFlavour, and one instance of the class _PosixFlavour.
Here are the first parts of those two classes:
class _WindowsFlavour(_Flavour):
    sep = '\\'
    altsep = '/'
    has_drv = True
    pathmod = ntpath
    is_supported = (os.name == 'nt')
    ...

class _PosixFlavour(_Flavour):
    sep = '/'
    altsep = ''
    has_drv = False
    pathmod = posixpath
    is_supported = (os.name != 'nt')
    ...
Here you can start to see how paths are handled differently on each OS. The attribute sep is short for separator, which is the path separator Python needs to use on each OS. On Windows, that’s a single backslash, which has to be written as '\\' in Python source because the backslash is the escape character. On non-Windows systems, that’s a single forward slash, /. You can also see how the is_supported attribute is set, based on the value of os.name.
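The pathmod attributes above point at the standard library’s ntpath and posixpath modules, which expose their own sep and altsep values. This sketch inspects those public modules rather than pathlib’s private classes. One small difference to be aware of: posixpath.altsep is None, while the _PosixFlavour class shown above uses an empty string.

```python
import ntpath
import posixpath

# '\\' in source code is an escaped backslash: a one-character string.
print(len(ntpath.sep))     # 1
print(ntpath.sep)          # \
print(ntpath.altsep)       # /
print(posixpath.sep)       # /
print(posixpath.altsep)    # None
```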
I won’t include it here directly, but if you look at the source for _WindowsFlavour there are a number of longer comments that document what people have learned about handling paths on Windows systems over the course of developing and maintaining the pathlib module. One longer comment begins: Interesting findings about extended paths. The people who develop and maintain these libraries don’t start out knowing everything about each operating system. They’ve just carefully defined what they want the library to be able to do, researched how to make that happen, and documented their findings so that others don’t have to repeat all that work.
The _Flavour class
Both _WindowsFlavour and _PosixFlavour inherit from _Flavour. That base class provides some functionality for dealing with paths on specific operating systems. I’ll show one small part of that class:
class _Flavour(object):
    """A flavour implements a particular (platform-specific)
    set of path semantics.
    """
    def __init__(self):
        self.join = self.sep.join
    ...
If you haven’t come across it yet, join() is a built-in string method. It lets you do things like this:
>>> flavors = ['chocolate', 'vanilla', 'strawberry']
>>> ', '.join(flavors)
'chocolate, vanilla, strawberry'
The join() method lets you specify a separator, in this case a comma followed by a space. It then joins all the items in a sequence into one string, using that separator.
Consider this line of code from _Flavour.__init__():
self.join = self.sep.join
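This doesn’t change the built-in str.join() method. It takes the join() method of the flavour’s separator string and stores it as an attribute named join on the flavour object. Calling join() on a flavour then always uses the separator that’s appropriate for that flavour, and so for the current operating system.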
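Here’s a stripped-down sketch of that one-line pattern. The Flavour, PosixFlavour, and WindowsFlavour classes are hypothetical simplifications written for this example; they keep only the sep attribute and the join assignment.

```python
class Flavour:
    """Hypothetical, stripped-down version of the pattern shown above."""

    def __init__(self):
        # self.sep is a class attribute supplied by each subclass.
        # self.sep.join is a bound str.join method tied to that separator.
        self.join = self.sep.join


class PosixFlavour(Flavour):
    sep = '/'


class WindowsFlavour(Flavour):
    sep = '\\'


parts = ['static', 'note_images', 'a1.png']
print(PosixFlavour().join(parts))     # static/note_images/a1.png
print(WindowsFlavour().join(parts))   # static\note_images\a1.png
```

The base class never names a separator; each subclass supplies one, and the same join() call picks it up.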
Most of this code is meant for internal use, but we can play around with some of these attributes and methods if we understand how everything fits together.
Let’s explore this in a terminal session, starting on macOS. Here’s the path we’ve been working with:
>>> path = Path('static/note_images/a1.png')
Now let’s see its _flavour attribute:
>>> path._flavour
<pathlib._PosixFlavour object at 0x100a825d0>
It’s an instance of _PosixFlavour. Now let’s see the separator:
>>> path._flavour.sep
'/'
A path object doesn’t have a sep attribute. It has a _flavour attribute, which itself has a sep attribute. To get a path’s separator, you have to work through its _flavour attribute.[1]
Path objects have an attribute _parts, which consists of each element in the path:
>>> path._parts
['static', 'note_images', 'a1.png']
Putting all this together, we can rebuild a path from its parts by calling the join() method from _flavour:
>>> path._flavour.join(path._parts)
'static/note_images/a1.png'
Notice that the parts were put back together using a forward slash, without us ever specifying what the separator should be. That separator was fixed when _Flavour.__init__() bound join to the flavour’s sep.
Much of this looks the same on Windows, but key parts are different:
>>> path = Path('static/note_images/a1.png')
>>> path._flavour
<pathlib._WindowsFlavour object at 0x00000195E423C0D0>
>>> path._flavour.sep
'\\'
>>> path._parts
['static', 'note_images', 'a1.png']
>>> path._flavour.join(path._parts)
'static\\note_images\\a1.png'
We start out with the same path object. The value of _flavour is a _WindowsFlavour object, and the separator is a single backslash, shown escaped as '\\' in the repr. The parts of the path are identical, as they should be. Calling join() generates a path using the OS-specific backslash separator.
This is exactly how pathlib works internally. It looks complex from the outside, but it’s a complexity that lets all the OS-agnostic and OS-specific parts work together efficiently, and maintainably. End users have to think very little about OS-specific implementations.
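The terminal sessions above rely on private attributes, which are implementation details and may not be present under the same names in every Python version. As a hedged alternative, here’s a sketch of the same exploration using only public API. It passes the separators in by hand, which is an assumption of this example rather than how pathlib does it internally.

```python
from pathlib import PurePosixPath, PureWindowsPath

text = 'static/note_images/a1.png'

posix_path = PurePosixPath(text)
windows_path = PureWindowsPath(text)

# .parts is the public counterpart of the private _parts list.
print(posix_path.parts)                         # ('static', 'note_images', 'a1.png')
print(windows_path.parts == posix_path.parts)   # True

# Rejoining the same parts gives an OS-specific string.
print('/'.join(posix_path.parts))      # static/note_images/a1.png
print('\\'.join(windows_path.parts))   # static\note_images\a1.png
print(str(windows_path) == '\\'.join(windows_path.parts))   # True
```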
The PurePath class
All this brings us back to the PurePath class. Three classes, Path, PurePosixPath, and PureWindowsPath, all inherit from PurePath. Let’s take a look at its implementation.
Here’s the first part of PurePath:
class PurePath(object):
    """Base class for manipulating paths without I/O...
    """
    __slots__ = (
        '_drv', '_root', '_parts',
        '_str', '_hash', '_pparts', '_cached_cparts',
    )
    ...
The PurePath class implements path behaviors that aren’t related to input or output actions. These include actions like getting the parts of a path, getting an OS-specific representation of the path, and more.
The __slots__ attribute shows us the small set of attributes a PurePath object can have. One of these is the _parts attribute we just looked at.
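The relationships between these classes can be checked with issubclass(). This sketch also shows the split between pure and concrete paths; the C:/static/note_images/a1.png path is a made-up example.

```python
from pathlib import (
    Path, PosixPath, PurePath, PurePosixPath, PureWindowsPath, WindowsPath,
)

# The inheritance relationships described in this post.
print(issubclass(Path, PurePath))                 # True
print(issubclass(PosixPath, Path))                # True
print(issubclass(PosixPath, PurePosixPath))       # True
print(issubclass(WindowsPath, PureWindowsPath))   # True

# Pure paths handle path manipulation but have no I/O methods.
pure = PureWindowsPath('C:/static/note_images/a1.png')
print(pure.drive)                 # C:
print(pure.name)                  # a1.png
print(hasattr(pure, 'exists'))    # False
print(hasattr(Path, 'exists'))    # True
```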
Let’s close this out by looking at some of the methods in PurePath. Here’s the as_posix() method:
class PurePath(object):
    ...
    def as_posix(self):
        """Return the string representation of the path with forward (/)
        slashes."""
        f = self._flavour
        return str(self).replace(f.sep, '/')
Even if you’re on Windows, it’s sometimes useful to represent a path with forward slashes. This method checks the path’s _flavour attribute, builds a string representation of the path, and then replaces the OS-specific separator with a single forward slash.
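Here’s a quick sketch of as_posix() in use. It uses PureWindowsPath so the Windows behavior can be seen on any OS.

```python
from pathlib import PurePosixPath, PureWindowsPath

windows_path = PureWindowsPath(r'static\note_images\a1.png')

print(str(windows_path))         # static\note_images\a1.png
print(windows_path.as_posix())   # static/note_images/a1.png

# On a POSIX-flavoured path the separator is already '/', so nothing changes.
posix_path = PurePosixPath('static/note_images/a1.png')
print(posix_path.as_posix() == str(posix_path))   # True
```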
Here’s a method that returns the path’s file extension, if there is one:
class PurePath(object):
    ...
    @property
    def suffix(self):
        """
        The final component's last suffix, if any.
        This includes the leading period. For example: '.txt'
        """
        name = self.name
        i = name.rfind('.')
        if 0 < i < len(name) - 1:
            return name[i:]
        else:
            return ''
This method gets the name attribute of the path, which is a string. It then uses rfind() to find the rightmost dot in name. If that dot is neither the first nor the last character, it uses that index to return everything from the final dot to the end of the string. So for a path like static/note_images/a1.png, name would be 'a1.png'. It would find that the dot is the third character in the string, and return everything from that dot to the end of the string:
>>> path = Path('static/note_images/a1.png')
>>> path.suffix
'.png'
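The 0 < i < len(name) - 1 condition is easier to appreciate with a few edge cases. The last_suffix() function below is a hypothetical standalone copy of the logic shown above, written for this sketch so it can be called on plain strings.

```python
from pathlib import PurePosixPath


def last_suffix(name):
    """Hypothetical standalone copy of the suffix logic shown above."""
    i = name.rfind('.')
    # i == -1: no dot; i == 0: leading dot (hidden file);
    # i == len(name) - 1: trailing dot. None of these count as a suffix.
    if 0 < i < len(name) - 1:
        return name[i:]
    return ''


print(last_suffix('a1.png'))           # .png
print(last_suffix('archive.tar.gz'))   # .gz
print(repr(last_suffix('.bashrc')))    # ''
print(repr(last_suffix('README')))     # ''

# The public property agrees for these names.
print(PurePosixPath('archive.tar.gz').suffix)     # .gz
print(repr(PurePosixPath('.bashrc').suffix))      # ''
```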
There are many other methods in PurePath, all of which are meant to help make common file and directory tasks intuitive and simple in a cross-platform manner.
Conclusions
It can seem hard to know what to take away from some of these detailed looks at complicated implementations. You might be asking yourself, “How would I ever design a hierarchy like this?!”
The big takeaway for me is not to make a habit of building a hierarchy like this for its own sake. Instead, think carefully about what you want to accomplish, and list out all the ways your codebase will be used. What kinds of instances will people want to make? How will they use those instances? The goal is to build a hierarchy that makes it intuitive for end users to do the work they need to do, and to develop a library that can be maintained so it will do that work reliably and correctly for the foreseeable future. This all holds even if you’re the only end user at the moment.
This diagram representing the class hierarchy is shown at the top of the pathlib docs:
I hope this discussion has helped you understand a hierarchy such as this one a little better, and gain some understanding of how pathlib works as well.
[1] The use of _Flavour in pathlib is a great example of composition, which I’ll discuss in more detail before closing out this series.