PyTPMOTW: PyYAML | Musings of an Anonymous Geek

What’s This Module For?

Reading and writing files formatted using “YAML Ain’t Markup Language” (YAML), and converting YAML syntax into native Python objects and datatypes.

What is YAML?

According to the website which houses the YAML Specification:

YAML™ (rhymes with “camel”) is a human-friendly, cross language, Unicode
based data serialization language designed around the common native data
structures of agile programming languages. It is broadly useful for
programming needs ranging from configuration files to Internet messaging to
object persistence to data auditing.

My introduction to YAML came several years ago in the context of messaging, and I then had a run-in with YAML as a logging format (actually, I was trying to parse a MySQL slow query log by coaxing it into YAML format). However, when I started writing Python full time, working on several different initiatives, YAML quickly became the standard configuration format.

Why? Simplicity. Using YAML for our config files and PyYAML to parse them, any developer can figure out what’s happening in our application in a matter of minutes, even if Python is not their primary language. It’s also nice that the YAML syntax is parsed into native Python datatypes, so Python coders looking at a config file can start to get a pretty good picture of how the program basically works.

The other thing that makes it simpler than some other config-specific options is that there’s not a lot of underlying “stuff” to know about. YAML isn’t a configuration engine; it’s essentially just a way to deal with data structures without locking the format to a specific language.

I also happen to like that it’s not config-specific, because it means that if I later need a messaging format, I already know one, and am familiar with a certain Python module to work with it!

Basic Usage

Let’s write a very simple YAML configuration for the logging portion of an application:

%YAML 1.2
---
Logging:
  format: "%(levelname) -10s %(asctime)s %(module)s:%(funcName)s()  %(message)s"
  level: 10
...

I’ve put logging-related configuration in its own “section” (really a data structure) here, so that when I configure other parts of the application I don’t shoot myself in the foot by reusing the same key names.

I’ve stored this configuration in a file called ‘log.conf’. From there you can easily play with it in an interpreter session:

>>> import yaml
>>> config_file = open('log.conf', 'r')
>>> config = yaml.safe_load(config_file)
>>> config
{'Logging': {'format': '%(levelname) -10s %(asctime)s %(module)s:%(funcName)s()  %(message)s', 'level': 10}}
>>>

With the configuration out of the way, let’s look at the code that would use it:

#!/usr/bin/env python

import logging
import yaml

def doit(uid):
    logging.debug("Working with uid: %s", uid)

if __name__ == "__main__":
    # safe_load builds only plain Python types, which is all this config needs.
    with open('log.conf', 'r') as config_file:
        config = yaml.safe_load(config_file)
    logging.basicConfig(**config['Logging'])

    doit(22222)

logging.basicConfig() takes a keyword dictionary of optional configuration items. Here I’m just using the ‘format’ and ‘level’ options, but there are more.

The only thing I do inside the doit() function is use logging to output the value of ‘uid’ passed in. This is really a test that the format I’ve configured is actually being used.

The format is fairly intuitive: indentation defines a block, just like in Python. The ‘---’ and ‘...’ lines denote the beginning and end of the YAML document. You can have several documents in a file if you so choose. This might be done if you’re storing a feed or email threads in YAML format.
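To make the multi-document idea concrete, here is a minimal sketch that reads two documents from one stream with yaml.safe_load_all(). The message fields (‘subject’, ‘body’) are made up for illustration; they aren’t part of any real feed format.

```python
import yaml

# Two YAML documents in a single stream; each '---' starts a new document.
# The fields below are hypothetical, just to have something to parse.
stream = """\
---
subject: Hello
body: First message
---
subject: "Re: Hello"
body: Second message
...
"""

# safe_load_all yields one Python object per document in the stream.
messages = list(yaml.safe_load_all(stream))

for message in messages:
    print(message["subject"])
```

Each document comes back as its own dictionary, so a file of messages can be handled as an ordinary Python list.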

Type Conversion

Type conversion to the built-in Python primitives works very well and is very intuitive in my experience. The logging configuration above would be parsed as a string for the ‘format’ key, and an ‘int’ for the ‘level’ key. The entire block will become a dictionary, and there is YAML syntax you can use to create lists and lists of lists, etc., as well.

For example, let’s say I’m creating a Django-like web application framework and I’ve decided to store my URL-to-handler mappings in a YAML file. I could easily do it with a list of lists, which looks like this in YAML:

RequestHandlers:
- [/, framework.handlers.RootHandler]
- [/signup, framework.handlers.RegisterNow]
- [/login, framework.handlers.Login]
- [/faq, framework.handlers.FAQ]

This will form a list of lists that you can work with in your code that looks like this in the config dictionary:

{'RequestHandlers': [['/', 'framework.handlers.RootHandler'], ['/signup',
'framework.handlers.RegisterNow'], ['/login', 'framework.handlers.Login'],
['/faq', 'framework.handlers.FAQ']]}

If for some reason type conversion doesn’t work as you expect, or you need to represent, say, a boolean using a quoted string like “Yes” or “On” instead of “True”, you can explicitly tag your value using tags defined in the YAML specification for this very purpose. Here’s how you’d explicitly tag “Yes” as a boolean, to ensure it’s not parsed as a string:

verbose: !!bool "Yes"

When this is parsed by PyYAML, it will be a Python boolean, and the value when printed to the screen will be ‘True’ (without quotes). There are several other explicit type tags, including ‘!!int’, ‘!!float’, ‘!!null’, ‘!!timestamp’ and more.
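Here’s a small sketch of explicit tags at work, assuming PyYAML’s safe loader. The key names (‘quoted’, ‘retries’, ‘ratio’) are invented for the example; only ‘verbose’ comes from the snippet above.

```python
import yaml

# Quoted values are normally strings; an explicit tag overrides that.
doc = """
verbose: !!bool "Yes"
quoted: "Yes"
retries: !!int "3"
ratio: !!float "0.5"
"""

settings = yaml.safe_load(doc)

for key, value in settings.items():
    # Show each value alongside the Python type it was converted to.
    print(key, repr(value), type(value).__name__)
```

The untagged ‘quoted’ entry is there for contrast: same text, but it stays a string because nothing tells the parser otherwise.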

If you like, you could alter our URL mapper from above and create a list of tuples. Note the use of the !!omap tag, which is short for ‘ordered mapping’:

RequestHandlers: !!omap
- /: framework.handlers.RootHandler
- /signup: framework.handlers.RegisterNow
- /login: framework.handlers.Login
- /faq: framework.handlers.FAQ

The resulting config dictionary looks like this:

{'RequestHandlers': [('/', 'framework.handlers.RootHandler'), ('/signup',
'framework.handlers.RegisterNow'), ('/login', 'framework.handlers.Login'),
('/faq', 'framework.handlers.FAQ')]}
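As a rough sketch of how code might consume that structure: the list of tuples unpacks naturally in a loop, and dict() turns it into a lookup table if order doesn’t matter. The handler names are just the strings from the config; nothing here imports them.

```python
import yaml

doc = """
RequestHandlers: !!omap
  - /: framework.handlers.RootHandler
  - /signup: framework.handlers.RegisterNow
  - /login: framework.handlers.Login
  - /faq: framework.handlers.FAQ
"""

config = yaml.safe_load(doc)
routes = config["RequestHandlers"]

# Each entry is a (path, handler_name) tuple, in the order written in the file.
for path, handler_name in routes:
    print(path, "->", handler_name)

# A plain dict is handy when you only need to look a path up.
lookup = dict(routes)
```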

More than once I’ve gone back to my YAML configuration to alter the type of data structure returned to better suit the code that uses it. It’s pretty convenient, and changing both the configuration file and the code is typically easy enough to be considered a non-event.

Beyond Basic Data Types

The ‘level’ option in logging.basicConfig can be specified either as a word or a numeric value (internally, logging.DEBUG maps to the integer value 10). But what if you didn’t know this, or you didn’t have the option of using an integer? Specifying ‘logging.DEBUG’ in the config file wouldn’t have worked, because it would’ve come in as the plain string ‘logging.DEBUG’, not as the constant defined in the logging module.

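If you don’t care about locking your configuration file to a language, PyYAML will let you do what you need using language-specific tags. Be aware that yaml.safe_load() deliberately rejects these tags; in current PyYAML releases you need the unsafe loader (yaml.unsafe_load()) to use them, so only do this with files you trust. So, for the purposes of our program, the following two lines in YAML produce the same effect: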

level: 10
level: !!python/name:logging.DEBUG

You might also choose to do this because reading ‘logging.DEBUG’, even with the added tag overhead, is probably easier to understand than trying to figure out what “10” means.
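A minimal sketch of loading the tagged version, assuming a PyYAML release recent enough to provide yaml.unsafe_load(). The unsafe loader can import modules and build arbitrary objects, so this is only appropriate for config files you control.

```python
import logging
import yaml

doc = """
Logging:
  level: !!python/name:logging.DEBUG
"""

# unsafe_load resolves the python/name tag by importing 'logging'
# and looking up its DEBUG attribute. Trusted input only.
config = yaml.unsafe_load(doc)

level = config["Logging"]["level"]
print(level, level == logging.DEBUG)
```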

If you’re developing code that allows users to write plugins, you can also let them add their plugins by adding a simple line to a ‘plugin’ section of the YAML config file in such a way that the config dictionary itself will contain an actual new instance of the proper object:

Plugins:
- !!python/object/new:MyPlugin.Processor ['foo.log']
- !!python/object/new:FooPluginModule.CementMixers.RotaryMixer ['chunky']

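The above will produce a list of plugin instances, with the items in the appended list passed as positional arguments when each object is created. One caveat: in current PyYAML, ‘!!python/object/new’ creates the object through the class’s __new__ method and does not call __init__; if your plugin relies on __init__ running, the ‘!!python/object/apply’ tag calls the class itself. Don’t forget that if you want to access the plugins by name instead of looping over a list, you can easily make this a dictionary. Also, PyYAML supports passing more initialization info to the class constructor.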
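Here is a hedged sketch of the plugin idea that can be run without any real plugin package. The ‘myplugin’ module and its Processor class are hypothetical; I register a throwaway module in sys.modules purely so the tag has something to resolve. It loads the same class through both ‘!!python/object/apply’ and ‘!!python/object/new’ so you can see how they differ, based on how current PyYAML behaves as far as I can tell.

```python
import sys
import types
import yaml


class Processor:
    """Hypothetical plugin class, standing in for MyPlugin.Processor."""

    def __init__(self, logfile="default.log"):
        self.logfile = logfile


# Assumption for the demo: expose the class as 'myplugin.Processor' so
# PyYAML can find it by dotted name, as it would a real installed plugin.
plugin_module = types.ModuleType("myplugin")
plugin_module.Processor = Processor
sys.modules["myplugin"] = plugin_module

doc = """
Plugins:
  - !!python/object/apply:myplugin.Processor ['foo.log']
  - !!python/object/new:myplugin.Processor ['foo.log']
"""

# The unsafe loader is needed for python/object tags. Trusted input only.
config = yaml.unsafe_load(doc)
applied, created = config["Plugins"]

# 'apply' calls Processor('foo.log'), so __init__ has run.
print(applied.logfile)
# 'new' goes through __new__ only, so __init__ has not set the attribute.
print(hasattr(created, "logfile"))
```

If your plugin classes do their setup in __init__, the ‘apply’ form is probably the one you want.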

Anchors and Aliases

You can create a block in your YAML config file, and then reference it in other sections of the configuration, and it can save you a lot of lines in a more complex configuration. This is done using anchors and aliases. An anchor starts with “&” and an alias (a reference to the anchor) begins with a “*”. So, let’s say you have multiple plugins loaded (continuing on from the example), and they all need their own configuration, but they’ll all connect to the same exact database server, and use the same credentials and db name, etc. Just create the db config once, make it an anchor, and reference it as needed:

DB: &MainDB
   server: localhost
   port: 6000
   user: dbuser
   db: myappdb
Plugins:
   loghandler: !!python/object/new:MyLogHandler
      args: ['mylogfile.log']
      db: *MainDB

When this is read in, the dictionary defined in &MainDB will appear as the value for the dict key [‘Plugins’][‘loghandler’][‘db’]. If you wanted to pass the *entire* config structure to your plugin, you technically wouldn’t need this, but I typically would only pass the portion of the config structure specifically dealing with the plugin, because configs can get large, and there could be lots of stuff in the rest of the config that has nothing to do with the plugin.
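To see the alias expansion on its own, here’s a minimal sketch using plain data. I’ve dropped the ‘!!python/object/new’ tag from the example so that it loads with yaml.safe_load() and ‘loghandler’ comes back as an ordinary dictionary; that simplification is mine, not part of the config above.

```python
import yaml

doc = """
DB: &MainDB
  server: localhost
  port: 6000
  user: dbuser
  db: myappdb
Plugins:
  loghandler:
    args: ['mylogfile.log']
    db: *MainDB
"""

config = yaml.safe_load(doc)

# Hand the plugin only its own slice of the config; the alias has
# already been expanded, so the DB settings come along with it.
plugin_config = config["Plugins"]["loghandler"]
print(plugin_config["db"]["server"], plugin_config["db"]["port"])
```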

Moving Ahead

Although 90% of your use of PyYAML might well consist of loading a YAML file or message and working with the resulting data structure, it’s nice to know that it does provide quite a bit of flexibility if you’re willing to look for it. Here are some links for further reading about PyYAML, including a couple of items not covered in this tutorial:

Pass more initialization data to classes specified with !!python/object/new

Create your own app-specific tags, a la ‘!!bool’ and ‘!!python’.

Dump Python Objects to YAML
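Since everything above reads YAML and never writes it, here is a short sketch of the dumping direction using yaml.safe_dump(). The shortened format string is just a stand-in to keep the example compact.

```python
import yaml

config = {
    "Logging": {
        "format": "%(levelname)s %(message)s",
        "level": 10,
    }
}

# default_flow_style=False asks for block style (one key per line)
# rather than inline {...} mappings.
text = yaml.safe_dump(config, default_flow_style=False)
print(text)

# Loading the dumped text should give back an equal structure.
round_tripped = yaml.safe_load(text)
```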