Validate YAML in Python with Schema

When working with YAML files, e.g., using a YAML file for configuration, it's useful to validate the contents to ensure data in the file is the right types, within valid ranges, etc.

In this post, I'll look at a useful Python library to validate YAML called Schema.

In simple terms, Schema allows us to define an outline or structure for data (known as a "schema"). We can then use this structure to validate data, in this context data coming from a YAML file, and report an error when the data doesn't conform.

Let's look at what we can do with Schema.

A simple example

Say we're working on an application that requires a token to interact with a REST API. The end-user provides this token in a YAML configuration file.

Let's use Schema to define some simple rules for this configuration data.

We start by creating a new Schema object passing in a dict that defines the structure we expect the data to have.

from schema import Schema, SchemaError
import yaml


config_schema = Schema({
    "api": {
        "token": str
    }
})

There are a couple of things we've defined:

  • Passing in a dict indicates that valid data will be a dict (remember that the yaml library returns a parsed YAML mapping as a dict)
  • We require a top-level key in the dict to be named api
  • Beneath that, we require a key named token with a value that is the str type.

Let's add some YAML and load it with the yaml library. We can use the validate() method on our newly created Schema object and pass in our loaded YAML to validate our data.

We'll deliberately make this invalid to demonstrate what happens when we validate it. Run the code below and notice what happens.

from schema import Schema, SchemaError
import yaml


config_schema = Schema({
    "api": {
        "token": str
    }
})

conf_yaml = """
api:
    passkey: 625c2043c132485b
"""

configuration = yaml.safe_load(conf_yaml)
try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

The validate() method raised a SchemaError with a message indicating that our data is missing a required key.

schema.SchemaError: Key 'api' error:
Missing key: 'token'

Let's fix the problem with the required key but change the value to an int type and re-run the code.

# ...

conf_yaml = """
api:
    token: 12345
"""

# ...

We still get a SchemaError, but this time the message indicates the value should be a str type.

schema.SchemaError: Key 'api' error:
Key 'token' error:
12345 should be instance of 'str'
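
The int here comes from YAML itself rather than from Schema: the yaml library infers scalar types while parsing, so an unquoted 12345 loads as an int. Here is a minimal sketch of that behaviour using only yaml.safe_load. Quoting the value is one way an end-user could supply an all-digit token as a string.

```python
import yaml

# Unquoted digits are parsed as a YAML integer.
unquoted = yaml.safe_load("""
api:
    token: 12345
""")

# Quoting the scalar makes the parser load it as a string instead.
quoted = yaml.safe_load("""
api:
    token: "12345"
""")

print(type(unquoted["api"]["token"]).__name__)
print(type(quoted["api"]["token"]).__name__)
```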

Before we move on, let's fix up the YAML and see what happens when it's valid. Re-run the code below.

from schema import Schema, SchemaError
import yaml


config_schema = Schema({
    "api": {
        "token": str
    }
})

conf_yaml = """
api:
    token: 625c2043c132485b
"""

configuration = yaml.safe_load(conf_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

Notice that no errors are raised.

Configuration is valid.

Pretty neat, right? We've defined a structure, and the Schema object has validated our data against that structure.

Let's take a look at what else we can do with Schema.

Use Schema with a callable

We can use a callable as part of our validation. When Schema encounters a callable, it calls it, passing in the data being validated. If the callable returns a truthy value, Schema moves on to the next rule; otherwise, it raises a SchemaError.

Here we use a lambda (anonymous) function to validate that the value of the workers key is between 1 and 10 inclusive.

from schema import Schema, SchemaError
import yaml


config_schema = Schema({
    "concurrency": {
        "workers": lambda n: 1 <= n <= 10
    }
})

conf_as_yaml = """
concurrency:
    workers: 20
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

This raises a SchemaError because the lambda function returns False.

schema.SchemaError: Key 'concurrency' error:
Key 'workers' error:
<lambda>(20) should evaluate to True

Validate using Regex

Schema can validate data using regular expressions. To do this, Schema provides a Regex class that wraps around a regular expression.

Here we check for a valid email using a regular expression pattern.

from schema import Schema, SchemaError, Regex
import yaml


config_schema = Schema({
    "email": {
        "support": Regex(r'^\S+@\S+$')
    }
})

conf_as_yaml = """
email:
    support: support_team_at_domain.tld
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

Validation fails because the value contains no @ character.

schema.SchemaError: Key 'email' error:
Key 'support' error:
Regex('^\\S+@\\S+$') does not match 'support_team_at_domain.tld'

Validate with logic operations

Schema supports boolean logic operations for validation. It does this by providing two utility classes, And and Or. These classes enable the combining of multiple validation rules.

In this example, all the validation rules passed to the And() object must evaluate to True for validation to pass.

from schema import And, Schema, SchemaError
import yaml


config_schema = Schema({
    "concurrency": {
        "workers": And(int, lambda n: 1 <= n <= 10)
    }
})

conf_as_yaml = """
concurrency:
    workers: 5.0
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

Validation fails because 5.0 is a float rather than an int.

schema.SchemaError: Key 'concurrency' error:
Key 'workers' error:
5.0 should be instance of 'int'

In this example, by contrast, only one of the validation rules needs to pass for validation to succeed.

from schema import Or, Schema, SchemaError
import yaml


config_schema = Schema({
    "chart_settings": {
        "color_palette": Or("Accent", "Dark2", "Pastel1")
    }
})

conf_as_yaml = """
chart_settings:
    color_palette: RdYlGn
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

Schema raises a SchemaError because RdYlGn is not one of Accent, Dark2, or Pastel1.

schema.SchemaError: Key 'chart_settings' error:
Key 'color_palette' error:
Or('Accent', 'Dark2', 'Pastel1') did not validate 'RdYlGn'
'Pastel1' does not match 'RdYlGn'

Making a key optional

To make a key optional, Schema provides a class named Optional. To use it, we wrap the key name in an Optional object. The validation rules for the corresponding value are defined like any other validation rule.

from schema import Optional, Schema, SchemaError
import yaml


config_schema = Schema(
    {
        "settings": {
            "temp_dir": str,
            Optional("proxy_server"): {
                "address": str,
                "port": int,
            },
        }
    }
)

conf_as_yaml = """
settings:
    temp_dir: /tmp
    proxy_server: 
        address: proxy.mydomain.com
        port: 8080
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

Configuration is valid.

Note that if we remove the optional key from the YAML, the data is still valid. However, if the optional data is invalid, this will raise a SchemaError.

# ...

conf_as_yaml = """
settings:
    temp_dir: /tmp
    proxy_server:
        url: proxy.mydomain.com
        port: 8080
"""

# ...

schema.SchemaError: Key 'settings' error:
Key 'proxy_server' error:
Missing key: 'address'

Ignoring keys

Sometimes it is useful to ignore parts of the dict or only validate some sections of it. We can do this by setting the ignore_extra_keys argument to True when creating the Schema object.

from schema import Schema, SchemaError
import yaml


config_schema = Schema({
    "application": {
        "logging": {
            "filename": lambda fp: fp.endswith(".log")
        }
    }
}, ignore_extra_keys=True)

conf_as_yaml = """
application:
    database:
        connection_string: sqlite:///app.db
    logging:
        filename: logs.log
    concurrency:
        workers: 6
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

Configuration is valid.

Notice how the logging key is validated, but the surrounding keys are not.

Another option for ignoring keys is to add an object: object rule to the dict. Because every value in Python is an instance of object, this rule matches any key and any value not already matched by a more specific rule, so those keys are always valid.

# ...

config_schema = Schema({
    "application": {
        "logging": {
            "filename": lambda fp: fp.endswith(".log")
        },
        # Matches every other key, so ignore_extra_keys is not needed here.
        object: object
    }
})

conf_as_yaml = """
application:
    database:
        connection_string: sqlite:///app.db
    logging:
        filename: logs.log
    concurrency:
        workers: 6
"""

# ...

Configuration is valid.

Working with Lists

Schema can also work with lists and validate list items. To do this, we set a list as a value and define the validation rules inside it. Every item in the input list must pass validation; when the list contains several rules, an item must match at least one of them.

from schema import Regex, Schema, SchemaError
import yaml


ip4_regex = r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$"

config_schema = Schema({
    "servers": [
        {"host": Regex(ip4_regex), "port": int}
    ]
})

conf_as_yaml = """
servers:
    - host: 146.180.127.85
      port: 5000
    - host: 89.79.252.148
      port: 5001
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    raise se

Notice that when Schema encounters the servers key, it will check that the value is a list. It will then validate each input list item against the rules defined in the Schema list.

Custom error messages

One of the minor drawbacks of Schema is that its error messaging can be a little vague. The final piece of functionality we'll take a look at is custom error messages.

To customize error messages, we pass in an error argument with a message. This message will be available on any raised SchemaError.

from schema import Or, Schema, SchemaError
import yaml


config_schema = Schema(
    {
        "retry_parameters": {
            "strategy": Or(
                "fixed",
                "double",
                error="Unsupported retry strategy. Supported retry strategies are: 'fixed' or 'double'",
            )
        }
    }
)

conf_as_yaml = """
retry_parameters:
    strategy: random
"""

configuration = yaml.safe_load(conf_as_yaml)

try:
    config_schema.validate(configuration)
    print("Configuration is valid.")
except SchemaError as se:
    for error in se.errors:
        if error:
            print(error)
    for error in se.autos:
        if error:
            print(error)

Custom error messages are accessed from the errors property of the SchemaError. Schema-specific error messages are also available from the autos property of the SchemaError.

Unsupported retry strategy. Supported retry strategies are: 'fixed' or 'double'
Unsupported retry strategy. Supported retry strategies are: 'fixed' or 'double'
Key 'retry_parameters' error:
Key 'strategy' error:
Or('fixed', 'double') did not validate 'random'
'double' does not match 'random'

Conclusion

As you can see, Schema provides a range of useful functionality for validating YAML-based data.

Lastly, we didn't cover everything Schema can do, just the features I've found most helpful while using the library. Other features not mentioned but worth checking out are Hooks (functions executed on validation) and JSON validation.

For more detailed information, check out the Schema GitHub repository.

© 2024 Andrew Villazon. All rights reserved.