Grounding yourself as a programmer in the AI era, part 3
MP #24: Refactoring a small project, without AI assistance.
Note: This is the third post in a 6-part series about starting to integrate AI tools into your programming workflow. The previous post walked through the process of creating a small command-line utility for adding borders to images. In the next post we’ll repeat the refactoring process described below, with the use of an AI assistant.
Like many people, I write a lot of exploratory code. I’ll have an idea, and then write some code to see if that idea goes anywhere meaningful. When an idea proves meaningful, I almost always refactor that exploratory code before moving on to more formal development work.
In this post we’ll take the final version of add_border.py from MP #23, and refactor it to support further development and an initial release as a pip-installable project. I’ll do this without assistance from any AI tools. In the next post we’ll repeat this process using an AI assistant, and compare the workflows and resulting code for the two approaches.
Why refactor?
Refactoring is the act of rewriting code so you get the same end result, but with a better codebase. There are several different ways the code might be “better”:
It may be simpler in some significant way: better organized, more readable, uses standard conventions.
It may run more efficiently.
It may be easier to test.
When I write exploratory code, the goal is to find out whether an idea for a project is worth pursuing, and how easily I can get some minimal version of the project working. I try to write reasonably clean code, but I don’t let that goal get in the way of quickly trying a variety of ideas. The focus is on making something that works now, for my specific use cases.
When I want to move beyond the exploratory phase, I have a number of different goals that must be balanced:
I want my code to be readable and maintainable by me in six months, and by others at any time.
I want the code to be organized enough that I can add new features cleanly.
I want to address any obvious performance issues, even if I haven’t hit any bottlenecks in my own usage yet.
I’ll work toward each of these goals in this post.1
Make sure you’re using version control
There are a number of things that are nice to have in place before starting a refactoring effort, but one thing is most critical: make sure you’ve started to use version control.
Refactoring, by definition, involves restructuring your project. You’re almost certainly going to introduce breaking changes at some point in the process. Using version control, and being able to roll back to a working state of the project, is a non-negotiable requirement in today’s development workflows.
Write some tests!
Even if you don’t necessarily like testing, I highly encourage you to write some tests before jumping into refactoring. We refactor because we have some behavior we want to preserve and improve. Don’t test everything at this point, but absolutely take a little time to write the most fundamental tests you can think of for your project. Test the behaviors and workflows you want to preserve through this particular refactoring effort.
For this project, we’ll test that these two commands work as they currently do:
$ python add_border.py bear_scratching.jpg
$ python add_border.py bear_scratching.jpg 15
Just to keep things interesting, I’m using a different image in this post. This is an image of a bear I captured with a game camera last summer:
We’ll want to know that the first command above adds a 2-pixel light gray border to this image, and the second command adds a 15-pixel light gray border. We’ll also test that passing no filename generates an error message, as does passing an invalid image file.
We won’t test for padding and custom border colors yet, because we’re going to change the way those options are passed. We’ll try to write tests for those behaviors a little later, as part of this refactoring work.
This series isn’t focused on testing, so I won’t explain how to write tests here.2 Here’s the current test suite, which passes:
% pytest
========= test session starts =========
collected 4 items
tests/test_basic_functionality.py .... [100%]
========== 4 passed in 0.21s ==========
We’ll run these regularly throughout the refactoring process, and add some new tests as we refine the project.
Refactoring add_border.py
Let’s think more specifically about refactoring this project. It works, and if we’re not going to do anything more with it then there would be no need to refactor it. What kinds of features do we want to add? What kinds of restructuring would support that ongoing development?
Here are the kinds of features and updates I have in mind for this project:
Update the CLI so that padding and border color are named options. I want to support a command like this:
$ python add_border.py image_file.png --padding 10
Let users specify a different location and filename for the modified image.
Let users convert the screenshot to a format that supports transparency.
(Maybe) let users specify different size borders on each side.
Let users process all images in a given directory.
Make the code more readable.
Make it easier to test the project.
This project is a specialized wrapper around Pillow. In a project like this, it’s easy to start exposing more and more of the underlying library until your users might as well just use the underlying library itself. The features listed here are all focused on processing screenshots efficiently. For example, the focus on transparency is intended to make screenshots look good on both light and dark themed browsers.
To support ongoing development, here are the specific goals for this round of refactoring:
Restructure the code to use functions.
I don’t think there’s any reason to write a class at this point. If we start passing too many arguments back and forth, that’s a sign that we should implement some of the code as a class.
The code isn’t difficult to read; there are just over 30 lines of actual code, ignoring blank lines and comments. But, well written code that uses functions is even easier to read and reason about. Using functions will make the code easier to develop further and easier to test. Right now it’s hard to test anything other than the entire workflow. Functions, however, can be imported and tested individually. That probably won’t be especially important in this small project, but it’s a huge benefit in projects that are likely to continue evolving indefinitely.
Restructure the CLI to use argparse.
Using sys.argv is really helpful when writing exploratory code, but it only supports a limited approach to building a CLI. Using argparse will let us keep the filename as a required argument, let the border width remain a positional argument, and let us add as many named optional arguments as we want.
Starting to modify code
Many programmers have had the experience of staring at a blank text editor window, wondering exactly how to start a project. The idea might be clear, but what should the first lines of code be? Refactoring can feel a lot like this: “I know the kinds of changes I’d like to make, but I have no idea where to start!”
One way to get past this is to write out what you’d like the project (or part of the project) to look like. Here’s what I’d like add_border.py to look like:
"""Add a border to any image."""
# import statements
# functions
process_cli_args()
load_image()
process_image()
save_image()
This is really helpful! Now we can work backwards, writing these functions as we go. We don’t need to stick to these names. In a smaller, less formal project like this I often find reason to change the number and names of these functions as I start to work on the code. But, this short step gives me a much clearer idea of where I want to end up. And with tests in place, we can test the project after writing each of these functions.
I like to write these functions in the same file as the main program. Then they can be moved to one or more separate modules as needed. We should end up with a short, easy to read main program file, and a small set of modules that you can dig into if you need to examine the underlying implementation.
Writing process_cli_args()
Let’s jump in. It can be tempting to immediately write the kind of small, self-contained functions we think we’re supposed to end up with. But that’s not realistic; you can find lots of really long, well-structured functions in highly respected codebases. We’ll just move blocks of existing code to the functions we just outlined, and break things up further if it seems helpful.
Here’s add_border.py, with all the code that processes CLI arguments in one function:
"""Add a border to any image."""

import sys
from pathlib import Path

from PIL import Image, ImageOps, UnidentifiedImageError


def process_cli_args():
    """Process all CLI arguments."""
    # Get the filename.
    try:
        path = Path(sys.argv[1])
    except IndexError:
        print("You must provide a target image.")
        sys.exit()

    # --- Get optional CLI args. ---
    try:
        border_width = int(sys.argv[2])
    except IndexError:
        border_width = 2

    try:
        padding = int(sys.argv[3])
    except IndexError:
        padding = 0

    try:
        border_color = sys.argv[4]
    except IndexError:
        border_color = "lightgray"

    return path, border_width, padding, border_color


path, border_width, padding, border_color = process_cli_args()

# Load the image.
...
After making a change like this, we should run the tests. I won’t show the test results each time I run them, but I’ll mention it; the tests do still pass at this point.
Adding an options dict
This first function is helpful, but unpacking a tuple of four return values is not ideal. That’s a nudge toward using a class, but I don’t think we need to go that far yet. Instead of packing these values into a class, we can just put them in a dictionary.
Here’s what it looks like with everything except path tucked neatly into a dictionary called options:
...

def process_cli_args(options):
    """Process all CLI arguments."""
    # Get the filename.
    try:
        path = Path(sys.argv[1])
    except IndexError:
        print("You must provide a target image.")
        sys.exit()

    # --- Get optional CLI args. ---
    try:
        options['border_width'] = int(sys.argv[2])
    except IndexError:
        pass

    try:
        options['padding'] = int(sys.argv[3])
    except IndexError:
        pass

    try:
        options['border_color'] = sys.argv[4]
    except IndexError:
        pass

    return path


# Set default options.
options = {
    'border_width': 2,
    'padding': 0,
    'border_color': 'lightgray',
}

path = process_cli_args(options)

# Load the image.
...

# Add some padding before adding the border.
# The padding is just a white border added before
# the actual border.
new_img = ImageOps.expand(img, border=options['padding'],
        fill="white")

# Add the border.
new_img = ImageOps.expand(new_img,
        border=options['border_width'],
        fill=options['border_color'])

# Save new image.
...
There are a few things to notice here. The best way to read this is starting in the middle, where the default options are set. We define a dictionary called options, which starts out with all of the default values. This is much more readable than having default option values defined in different parts of the project. It also means that, in the rest of the project, options only need to be set when the user wants to override one of the defaults.
Next, note that the call to process_cli_args() is much cleaner:
path = process_cli_args(options)
We’re down to just one return value, path. We can simply pass the options dict to the function, and any changes made there will be made to the actual dictionary; there’s no need to return it.
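If this behavior is unfamiliar, here’s a minimal standalone example (not project code) of a function mutating a dictionary that the caller still holds. Dictionaries are passed by reference in Python, so there’s nothing to return:

```python
def apply_overrides(options):
    # This assignment modifies the caller's dictionary directly.
    options['border_width'] = 15

options = {'border_width': 2, 'padding': 0, 'border_color': 'lightgray'}
apply_overrides(options)
print(options['border_width'])  # 15
```

The defaults set before the call survive untouched; only the entry the function assigns to changes.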
Now let’s look inside process_cli_args(). The try blocks are simpler; if an optional argument is present in sys.argv, we update the corresponding entry in the options dict. If it’s not present, we don’t need to do anything because the default value has already been set:
try:
    options['border_width'] = int(sys.argv[2])
except IndexError:
    pass
We’ll simplify this even further in a moment.
Finally, in the calls to expand() where the actual processing happens, we need to use the values from options:
new_img = ImageOps.expand(new_img,
        border=options['border_width'],
        fill=options['border_color'])
At this point, all of the tests once again pass.
A simpler approach to overriding options
When we’re processing the optional CLI arguments, we’re no longer doing anything in the except blocks. We can simplify these blocks by checking to see how many arguments have been passed. All of these changes are internal to process_cli_args():
def process_cli_args(options):
    """Process all CLI arguments."""
    ...

    # --- Get optional CLI args. ---
    if len(sys.argv) >= 3:
        options['border_width'] = int(sys.argv[2])
    if len(sys.argv) >= 4:
        options['padding'] = int(sys.argv[3])
    if len(sys.argv) >= 5:
        options['border_color'] = sys.argv[4]

    return path
Since these are all positional arguments, we can easily check if the argument is present based on the number of values in sys.argv. This isn’t a particularly important change, because we’re going to switch to using argparse shortly. But it’s a quick and easy change to make, and the function is much shorter and simpler after this. Also, if we don’t get time to implement the switch to argparse soon, we’ll have simpler code until we can get to that work.
It’s important to note that these need to be a set of independent if blocks. If you try to use elif or else in this block, you won’t process all of the possible arguments.
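Here’s a quick standalone sketch, using a simulated argv list rather than the real sys.argv, showing the independent if blocks handling every argument that’s present. With an if/elif chain, only the border width would be processed here:

```python
# Simulated command line: script name, path, border width, padding, color.
argv = ['add_border.py', 'image.png', '15', '10', 'black']
options = {'border_width': 2, 'padding': 0, 'border_color': 'lightgray'}

# Independent if blocks: each optional argument present in argv is processed.
if len(argv) >= 3:
    options['border_width'] = int(argv[2])
if len(argv) >= 4:
    options['padding'] = int(argv[3])
if len(argv) >= 5:
    options['border_color'] = argv[4]

print(options)
# {'border_width': 15, 'padding': 10, 'border_color': 'black'}
```

If the second and third if statements were elif branches, the first match would short-circuit the rest, and padding and border_color would keep their defaults even though the user passed values for them.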
When writing about a process like refactoring, I try to avoid giving the impression that everything worked flawlessly. The tests didn’t pass on my first attempt at making this change, because I wrote the following code:
if len(sys.argv >= 3):
    options['border_width'] = int(sys.argv[2])
The correct code is len(sys.argv), but I accidentally put the closing parenthesis at the end of the conditional statement. Having a small set of tests that cover the most basic use cases of a project is fantastic for catching simple errors like this quickly, when it’s easiest to find them. If you can’t find a bug, then knowing there’s an issue right away lets you roll back to the last working version of the project sooner rather than later.
After fixing this issue, all tests pass again.
Validating the path exists
As I look at the current structure of add_border.py, I notice that there’s one validation action in the second part of the code that really should be handled when processing the CLI arguments. In this block, we’re doing two kinds of validation:
# Load the image.
try:
    img = Image.open(path)
except FileNotFoundError:
    ...
except UnidentifiedImageError:
    ...
Here we’re using Pillow to check if the file exists, and then to check if it’s a valid image file. We have to use Pillow for the second check, but pathlib can make sure the path exists. Let’s move that validation to process_cli_args():
def process_cli_args(options):
    """Process all CLI arguments."""
    # Get the filename.
    ...

    # Make sure the file exists.
    if not path.exists():
        print(f"{path} does not seem to exist.")
        sys.exit()

    # --- Get optional CLI args. ---
    ...

    return path
This makes the code for loading an image simpler, and more focused:
# Load the image.
try:
    img = Image.open(path)
except UnidentifiedImageError:
    ...
As I ran the tests, I realized that this behavior was not actually covered by the test suite. The tests pass, but that doesn’t mean anything here. This is an easy case to write a test for, though, so I added an appropriate test, and it passed. This is also another reason to write a minimal test suite as soon as you can when moving out of the exploratory phase: once you have a couple of tests written, it’s significantly easier to add more tests whenever the need arises.
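If you want checks like this to be even easier to test, one option is to separate the decision from the exit. Here’s a sketch of that idea; the helper name and structure are mine, not part of add_border.py:

```python
from pathlib import Path

def check_path(path):
    """Return an error message if the path doesn't exist, else None.

    Hypothetical helper, not part of add_border.py: returning a value
    instead of calling sys.exit() lets a test call this directly.
    """
    if not path.exists():
        return f"{path} does not seem to exist."
    return None

# A test can now exercise the check without running the whole CLI.
print(check_path(Path('no_such_image_x9q.jpg')))
# no_such_image_x9q.jpg does not seem to exist.
```

The CLI code would then print the returned message and exit, while tests assert on the return value.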
Creating cli.py
I wasn’t sure when I’d start to move some code to a separate module, but now it’s clear that we can define a module that’s dedicated to managing the CLI. All of the CLI logic is now in process_cli_args(), so let’s move that function to a module called cli.py:
# cli.py
"""Process the CLI for add_border.py."""

import sys
from pathlib import Path


def process_cli_args(options):
    ...
Now add_border.py is shorter, and everything in it is focused on the functionality of adding a border to an image, not the functionality of managing a CLI:
"""Add a border to any image."""
from PIL import Image, ImageOps, UnidentifiedImageError
from cli import process_cli_args
# Set default options.
...
The function process_cli_args() could certainly be refactored further, and it might need to be when we switch over to argparse, but that work can all happen in cli.py. Note that sys and Path no longer need to be imported in add_border.py.
You can take this modularization concept as far as you want; an interesting exercise would be to write a get_options() function, and then move that to something like config.py. In a larger project, this would have the benefit of having one file where all of the defaults are defined. We won’t bother with that in this small project; in a project of this size it’s actually more readable to keep options in the main file. There’s some mental overhead involved in having to look at another file to see a project’s settings. With only three options, that overhead isn’t worth taking on.
Let’s leave cli.py alone for now, and focus on the rest of add_border.py.
Writing the load_image() function
The CLI was the most complex part of the project. The rest of add_border.py should be much simpler to break into functions.
Here’s add_border.py, with a load_image() function:
...
from cli import process_cli_args


def load_image(path):
    """Load the image file."""
    try:
        img = Image.open(path)
    except UnidentifiedImageError:
        print(f"{path} does not seem to be an image file.")
        sys.exit()

    return img


# Set default options.
...

path = process_cli_args(options)
img = load_image(path)

# Add some padding before adding the border.
...
We move the code for loading the image to the new function load_image(). We’re starting to see the simple series of function calls that you should see when refactoring exploratory code:
path = process_cli_args(options)
img = load_image(path)
In these two lines, we see clear actions, and it’s pretty easy to see what objects are being created and manipulated. We don’t see any implementation details, but if we want to see those we can go look at the relevant functions. In refactoring work, this is a great thing to start seeing.
All tests are passing. I noticed, however, that there’s no real need to use the img variable in load_image(). We can just put the return statement directly in the try block, and then load_image() is even simpler:
def load_image(path):
    """Load the image file."""
    try:
        return Image.open(path)
    except UnidentifiedImageError:
        print(f"{path} does not seem to be an image file.")
        sys.exit()
This is a nice little simplification, and the tests still pass.
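This pattern generalizes: returning directly from a try block behaves exactly like assigning to a temporary variable and returning it after the block. A quick standalone illustration, using a trivial function of my own rather than project code:

```python
def parse_int(text):
    """Return the parsed integer, or None if parsing fails."""
    try:
        # If int() succeeds, we return immediately; no temp variable needed.
        return int(text)
    except ValueError:
        return None

print(parse_int('42'))   # 42
print(parse_int('abc'))  # None
```

The except clause still catches any exception raised inside the try block before the return completes, so no error handling is lost.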
Writing the process_image() function
Now we can move all of the image processing work to its own function:
...

def load_image(path):
    ...

def process_image(img, options):
    """Modify the original image."""
    # Add some padding before adding the border.
    # The padding is just a white border added before
    # the actual border.
    new_img = ImageOps.expand(img, border=options['padding'],
            fill="white")

    # Add the border.
    new_img = ImageOps.expand(new_img,
            border=options['border_width'],
            fill=options['border_color'])

    return new_img


# Set default options.
...

path = process_cli_args(options)
img = load_image(path)
new_img = process_image(img, options)

# Save new image.
new_path = path.parent / f"{path.stem}_bordered{path.suffix}"
new_img.save(new_path)
We pass the img object and the options dict to process_image(), and it returns the correct new_img object. All the tests pass, although I will admit I forgot to pass the arguments to the new function when I first ran the tests.
The save_image() function
Finally, we’ll move the code for saving the modified image to its own function:
...

def load_image(path):
    ...

def process_image(img, options):
    ...

def save_image(path, new_img):
    """Save the new image."""
    new_path = path.parent / f"{path.stem}_bordered{path.suffix}"
    new_img.save(new_path)


# Set default options.
...

path = process_cli_args(options)
img = load_image(path)
new_img = process_image(img, options)
save_image(path, new_img)
The function needs the original path, and the new_img object. All of the tests pass.
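The path arithmetic in save_image() is worth a closer look. Here’s a quick standalone sketch of how pathlib builds the output filename next to the original:

```python
from pathlib import Path

path = Path('images/bear_scratching.jpg')

# parent is the containing directory, stem is the filename without its
# extension, and suffix is the extension (including the dot).
new_path = path.parent / f"{path.stem}_bordered{path.suffix}"
print(new_path)  # images/bear_scratching_bordered.jpg
```

Because the pieces come from the original path, the modified image always lands in the same directory as the source image, with the same extension.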
The image_functions.py module
We might as well move these functions to their own module, so we have a really simple main file. Here’s image_functions.py:
# image_functions.py
"""Functions to process the image."""

import sys

from PIL import Image, ImageOps, UnidentifiedImageError


def load_image(path):
    ...

def process_image(img, options):
    ...

def save_image(path, new_img):
    ...
We moved the image processing functions here, and the PIL imports belong in this file as well. Since load_image() still calls sys.exit(), sys needs to be imported here too.
Here’s the nice, clean version of add_border.py:
"""Add a border to any image."""

from cli import process_cli_args
import image_functions as img_fns


# Set default options.
options = {
    'border_width': 2,
    'padding': 0,
    'border_color': 'lightgray',
}

# Process image.
path = process_cli_args(options)
img = img_fns.load_image(path)
new_img = img_fns.process_image(img, options)
img_fns.save_image(path, new_img)
We give image_functions the alias img_fns, instead of importing each function by name.
This is a really simple main file. It’s easy to read and see what’s happening:
Set the default values for options.
Process the CLI arguments.
Load the image.
Process the image.
Save the image.
All of the tests still pass, so the basic functionality is intact. We could do some more refactoring in the new modules, but we can also put that off until the next time we’re working in those files.
When we update the CLI, there’s a good chance we’ll only have to modify cli.py. This is the point in the refactoring process where I start to look forward to adding new features, because the project is structured in a way that supports ongoing development.
Switching to argparse
We have two goals in switching to argparse, instead of sys.argv:
Clean up the external CLI, so options like border_color are named options.
Add tests for padding and custom border colors.
I won’t say a whole lot about how to use argparse; the main point of this post is showing how contained a change like this is after some effective refactoring.2 Here’s the updated cli.py:
"""Process the CLI for add_border.py."""

import sys, argparse
from pathlib import Path


def process_cli_args(options):
    """Process all CLI arguments."""
    parser = argparse.ArgumentParser()

    # Define arguments.
    parser.add_argument('path',
            help="path to the image file")
    parser.add_argument('border_width',
            nargs='?', type=int,
            help="default: 2px")
    parser.add_argument('--padding', type=int,
            help="padding between image and border")
    parser.add_argument('--border-color', type=str,
            help="default: lightgray")

    # Process arguments.
    args = parser.parse_args()

    path = Path(args.path)
    if not path.exists():
        print(f"{path} does not seem to exist.")
        sys.exit()

    if args.border_width:
        options['border_width'] = args.border_width
    if args.padding:
        options['padding'] = args.padding
    if args.border_color:
        options['border_color'] = args.border_color

    return path
We import argparse, and we still need to import sys in order to call exit() if the path doesn’t exist. To use argparse, you make an instance of ArgumentParser, typically referred to as parser. Then you define the arguments you want by calling the add_argument() method. Here, add_argument() is called in a variety of ways to get the mix of required, positional, and named CLI arguments that we want for this project.
After defining the arguments, we call parse_args() to check which arguments were actually passed. After that, every argument that was defined is available as an attribute of the args object. For each of the optional arguments, if it is present we update the options dict with the corresponding value.
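As a side note, parse_args() also accepts an explicit list of arguments, which makes this behavior easy to see in isolation. This standalone sketch (separate from the project code) shows how a flag like --border-color becomes the attribute border_color, and how an unpassed option comes back as None:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('path')
parser.add_argument('border_width', nargs='?', type=int)
parser.add_argument('--padding', type=int)
parser.add_argument('--border-color', type=str)

# Passing an explicit list instead of reading from sys.argv.
args = parser.parse_args(['image.png', '15', '--padding', '10'])
print(args.path, args.border_width, args.padding, args.border_color)
# image.png 15 10 None
```

Being able to pass a list like this is also handy for testing CLI parsing without spawning a subprocess.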
At this point, all of these are valid ways to call add_border.py:
$ python add_border.py --help
$ python add_border.py bear_scratching.jpg
$ python add_border.py bear_scratching.jpg 15
$ python add_border.py bear_scratching.jpg --padding 10
$ python add_border.py bear_scratching.jpg 15 --padding 10
$ python add_border.py bear_scratching.jpg --border-color black
...
This is not an exhaustive list, but it shows the kinds of usage we were looking for.
The tests needed to be updated slightly, because argparse handles its own error messages. For example, here’s the output when you forget to include a filename:
$ python add_border.py
usage: add_border.py [-h] [--padding PADDING]
[--border-color BORDER_COLOR] path [border_width]
add_border.py: error: the following arguments are required: path
I also added tests for including padding and custom border colors, now that we have a more stable CLI.
A cleaner approach to managing options
When we were using sys.argv, defining the options dict in the main add_border.py file made sense. Now, it’s getting pretty clunky. This code seems like it could be simplified:
if args.border_width:
    options['border_width'] = args.border_width
We’re checking the args.border_width attribute, and then if it exists we’re using it again to modify an entry in options. The add_argument() method lets you define a default value for any CLI argument you’re creating. This suggests a much cleaner architecture: instead of defining options in add_border.py, we’ll add defaults to each argument, and then just build the options dict from the argument values.
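To see how default= behaves, here’s a small standalone sketch (separate from the project files): any argument the user doesn’t pass comes back with its default value, so the options dict can be built unconditionally:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('path')
parser.add_argument('border_width', nargs='?', type=int, default=2)
parser.add_argument('--padding', type=int, default=0)
parser.add_argument('--border-color', default='lightgray')

# Only path and --padding are passed; the rest fall back to defaults.
args = parser.parse_args(['image.png', '--padding', '10'])
options = {
    'border_width': args.border_width,
    'padding': args.padding,
    'border_color': args.border_color,
}
print(options)
# {'border_width': 2, 'padding': 10, 'border_color': 'lightgray'}
```

Every attribute of args is guaranteed to have a value, either from the user or from the default, so no conditional checks are needed.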
Here’s an even cleaner version of add_border.py:
"""Add a border to any image."""
from cli import process_cli_args
import image_functions as img_fns
# Process image.
path, options = process_cli_args()
img = img_fns.load_image(path)
new_img = img_fns.process_image(img, options)
img_fns.save_image(path, new_img)
This file is so much shorter than it started! Since we’re not defining options here, we don’t need to pass any arguments to process_cli_args(). If we want to see how the options are defined, we can look in the cli.py module. If we want to see how they’re used, we can look in the image_functions.py module. This is a very clean structure for the main project file.
Here’s the cleaner cli.py:
...

def process_cli_args():
    """Process all CLI arguments."""
    parser = argparse.ArgumentParser()

    # Define arguments.
    parser.add_argument('path',
            help="path to the image file")
    parser.add_argument('border_width',
            nargs='?', type=int, default=2,
            help="default: 2px")
    parser.add_argument('--padding', type=int, default=0,
            help="default: 0px")
    parser.add_argument('--border-color', type=str,
            default='lightgray', help="default: lightgray")

    # Process arguments.
    args = parser.parse_args()

    path = Path(args.path)
    if not path.exists():
        ...

    options = {
        'border_width': args.border_width,
        'padding': args.padding,
        'border_color': args.border_color,
    }

    return path, options
The function process_cli_args() no longer needs the parameter options.
Each of the optional arguments now has a default value included in the call to add_argument(). I’ve also updated the help setting for --padding to be more consistent with the other optional arguments.
We don’t need any conditional statements when processing the optional arguments. We just use each attribute of args to set the appropriate value in the options dictionary. Each argument will either have a default value, or a value passed in by the user. This file now has a very clean structure as well.
It’s tempting to go back and edit the earlier parts of this post, to leave out the original options dict. But that doesn’t reflect a realistic development process. The process of refactoring, whether formal or informal, often suggests new ways of thinking about a project. Sometimes you can use all the new ideas that come up. Sometimes you’ll have ideas that would be really nice to implement, but would affect too many aspects of a project to be worth doing. I could easily imagine a project where options is defined in the main file, and only a few overrides are possible in the CLI. In that case, the clunkier approach to overriding options in cli.py might be the better approach.
Final changes
There are two final changes that I’ll mention, but won’t bother showing. I broke process_cli_args() into three smaller functions: define_arguments(), validate_path(), and get_options(). I’ve seen many projects that take this extra step, and just as many that keep all of this work in a single function. Consider both approaches for your own projects. I tend to prefer the modular approach when things like validation become complex; for a smaller project like this, I have a slight preference for keeping everything in a single function.
Conclusions
Almost everyone involved in programming uses the term refactoring. There are formal and informal approaches to this work, depending on the complexity of your project and how many people are currently using it.
Regardless of the workflow you use, make sure to test your project often throughout the refactoring process. Take some time to define clear goals for any phase of refactoring work, and keep notes about what’s working and what’s not working. Also note any ideas for improving the project that you’re not able to implement at the moment. If you follow a somewhat structured approach to refactoring, you should end up with a cleaner project that’s easier to maintain and develop further.
In the next post we’ll start back with the original version of this project, before we started refactoring it. We’ll then repeat this process, using an AI assistant to guide and help out with the refactoring work. At the end of the process, we’ll compare the two versions of the project, and the two workflows.
Resources
You can find the code files from this post in the mostly_python GitHub repository.
There are degrees of formality around refactoring. I’m using the terms “refactor”, “restructure”, “reorganize”, and “clean up” somewhat loosely in this post. When you have a mature, widely-used project, you need to be really careful about how you approach refactoring. There is a wide body of literature about how to follow formal refactoring methods to manage this kind of work.
With an exploratory project you should still be careful in how you go about refactoring, because you already have some behavior you want to preserve. That said, there will almost certainly be aspects of the project that aren’t worth treating formally until they’ve been refined a little further. As always, programming is about balancing competing priorities. If you follow the suggestions outlined in this post at a similar stage in your own projects, you should be able to maintain an appropriate balance between formal methods and a reasonable pace of development.