publish

A tool to build and publish certain artifacts at certain times.

publish was designed specifically for the automatic publication of course materials, such as homeworks, lecture slides, etc.

Terminology

An artifact is a file – usually one that is generated by some build process.

A publication is a coherent group of one or more artifacts and their metadata.

A collection is a group of publications which all satisfy the same schema.

A schema is a set of constraints on a publication’s artifacts and metadata.

This establishes a collection -> publication -> artifact hierarchy: each artifact belongs to exactly one publication, and each publication belongs to exactly one collection.

An example of such a hierarchy is the following: all homeworks in a course form a collection. Each publication within the collection is an individual homework. Each publication may have several artifacts, such as the PDF of the problem set, the PDF of the solutions, and a .zip containing the homework’s data.

An artifact may have a release time, before which it will not be built or published. Likewise, entire publications can have release times, too.

Discovering, Building, and Publishing

When run as a script, this package follows a three-step process of discovering, building, and publishing artifacts.

In the discovery step, the script constructs a collection -> publication -> artifact hierarchy by recursively searching an input directory for artifacts.

In the build step, the script builds every artifact whose release time has passed.

In the publish step, the script copies every released artifact to an output directory.

Discovery

In the discovery step, the input directory is recursively searched for collections, publications, and artifacts.

A collection is defined by creating a file named collection.yaml in a directory. The contents of the file describe the artifacts and metadata that are required of each of the publications within the collection. For instance:

# <input_directory>/homeworks/collection.yaml

schema:
    required_artifacts:
        - homework.pdf
        - solution.pdf

    optional_artifacts:
        - template.zip

    metadata_schema:
        name:
            type: string
        due:
            type: datetime
        released:
            type: date

The file above specifies that publications must have homework.pdf and solution.pdf artifacts, and may or may not have a template.zip artifact. The publications must also have name, due, and released fields in their metadata with the listed types. The metadata specification is given in a form recognizable by the cerberus Python package.

A publication and its artifacts are defined by creating a publish.yaml file in the directory containing the publication. For instance, the file below describes how and when to build two artifacts named homework.pdf and solution.pdf, along with metadata:

# <input_directory>/homeworks/01-intro/publish.yaml

metadata:
    name: Homework 01
    due: 2020-09-04 23:59:00
    released: 2020-09-01

artifacts:
    homework.pdf:
        recipe: make homework
    solution.pdf:
        file: ./build/solution.pdf
        recipe: make solution
        release_time: 1 day after metadata.due
        ready: false
        missing_ok: false

The file field tells publish where the file will appear when the recipe is run. If it is omitted, its value is assumed to be the artifact’s key – for instance, homework.pdf’s file field is simply homework.pdf.

The release_time field provides the artifact’s release time. It can be a specific datetime in ISO 8601 format, like 2020-09-18 17:00:00, or a relative date of the form “<number> (hour|day)[s]{0,1} (before|after) metadata.<field>”, in which case the date will be calculated relative to the metadata field. The field it refers to must be a datetime.
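
The relative form described above can be illustrated with a small sketch. This is not the parser that publish itself uses; the function name resolve_release_time and the exact regular expression are assumptions made for illustration, based only on the pattern quoted above.

```python
import re
from datetime import datetime, timedelta

# Illustrative pattern for "<number> (hour|day)[s] (before|after) metadata.<field>".
# This is an assumption-based sketch, not publish's own implementation.
RELATIVE = re.compile(r"^(\d+) (hour|day)s? (before|after) metadata\.(\w+)$")


def resolve_release_time(spec, metadata):
    """Return a datetime for an absolute or relative release_time string."""
    match = RELATIVE.match(spec)
    if match is None:
        # Not a relative date: treat the string as an ISO 8601 datetime.
        return datetime.fromisoformat(spec)
    number, unit, direction, field = match.groups()
    base = metadata[field]
    if not isinstance(base, datetime):
        raise ValueError(f"metadata.{field} must be a datetime")
    if unit == "hour":
        delta = timedelta(hours=int(number))
    else:
        delta = timedelta(days=int(number))
    return base - delta if direction == "before" else base + delta


metadata = {"due": datetime(2020, 9, 4, 23, 59)}
print(resolve_release_time("1 day after metadata.due", metadata))  # 2020-09-05 23:59:00
print(resolve_release_time("2020-09-18 17:00:00", metadata))       # 2020-09-18 17:00:00
```

With the publish.yaml shown earlier, solution.pdf would therefore be released one day after the homework's due time.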

The ready field is a manual override which prevents the artifact from being built and published before it is ready. If not provided, the artifact is assumed to be ready.

The missing_ok field is a boolean which, if false, causes an error to be raised if the artifact’s file is missing after the build. This is the default behavior. If set to true, no error is raised. This can be useful when the artifact file is manually placed in the directory and it is undesirable to repeatedly edit publish.yaml to add the artifact.

Publications may also have release_time and ready attributes. If these are provided they will take precedence over the attributes of an individual artifact in the publication. The release time of the publication can be used to control when its metadata becomes available – before the release time, the publication in effect does not exist.

The file hierarchy determines which publications belong to which collections. If a publication file is placed in a directory that is a descendant of a directory containing a collection file, the publication will be placed in that collection and its contents will be validated against the collection’s schema. Publications which are not under a directory containing a collection.yaml are placed into a “default” collection with no schema. They may contain any number of artifacts and metadata keys.

Collections, publications, and artifacts all have keys which locate them within the hierarchy. These keys are inferred from their position in the filesystem. For example, a collection file placed at <input_directory>/homeworks/collection.yaml will create a collection keyed “homeworks”. A publication within the collection at <input_directory>/homeworks/01-intro/publish.yaml will be keyed “01-intro”. The keys of the artifacts are simply their keys within the publish.yaml file.
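
The way keys follow from filesystem position can be sketched with the standard library alone. The function find_publications below is hypothetical and only mimics the rules described above; in particular, keying nested directories by their relative path is an assumption, and the files are located but not parsed.

```python
import os
import tempfile
from pathlib import Path


def find_publications(input_directory, skip_directories=()):
    """Map (collection_key, publication_key) to the path of each publish.yaml."""
    root = Path(input_directory)
    found = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_directories]
        if "publish.yaml" not in filenames:
            continue
        here = Path(dirpath)
        rel = here.relative_to(root)
        # Strict ancestors of this directory, nearest first, ending at the root.
        ancestors = [root.joinpath(*rel.parts[:i])
                     for i in range(len(rel.parts) - 1, -1, -1)]
        owner = next((a for a in ancestors if (a / "collection.yaml").exists()), None)
        if owner is None:
            # No collection file above this publication: use the default collection.
            collection_key, publication_key = "default", rel.as_posix()
        else:
            collection_key = owner.relative_to(root).as_posix()
            publication_key = here.relative_to(owner).as_posix()
        found[(collection_key, publication_key)] = here / "publish.yaml"
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "homeworks" / "01-intro").mkdir(parents=True)
    (root / "homeworks" / "collection.yaml").touch()
    (root / "homeworks" / "01-intro" / "publish.yaml").touch()
    (root / "syllabus").mkdir()
    (root / "syllabus" / "publish.yaml").touch()
    keys = sorted(find_publications(root))

print(keys)  # [('default', 'syllabus'), ('homeworks', '01-intro')]
```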

Building

Once all collections, publications, and artifacts have been discovered, the script moves to the build phase.

Artifacts are built by running the command given in the artifact’s recipe field within the directory containing the artifact’s publish.yaml file. Different artifacts should have “orthogonal” build processes so that the order in which the artifacts are built is inconsequential.

If an error occurs during any build the entire process is halted and the program returns without continuing on to the publish phase. An error is considered to occur if the build process returns a nonzero error code, or if the artifact file is missing after the recipe is run.
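
A minimal sketch of the error conditions just described, using only subprocess and pathlib. The names run_recipe and BuildFailed are hypothetical stand-ins rather than part of publish, and running the recipe through the shell is an assumption.

```python
import subprocess
import tempfile
from pathlib import Path


class BuildFailed(Exception):
    """Hypothetical stand-in for the package's BuildError."""


def run_recipe(workdir, file, recipe=None, missing_ok=False):
    """Run a recipe inside workdir and apply the checks described above."""
    workdir = Path(workdir)
    returncode = None
    if recipe is not None:
        # Assumption: the recipe is a shell command run from the working directory.
        proc = subprocess.run(recipe, shell=True, cwd=workdir,
                              capture_output=True, text=True)
        returncode = proc.returncode
        if returncode != 0:
            raise BuildFailed(f"recipe {recipe!r} exited with code {returncode}")
    if not (workdir / file).exists():
        if missing_ok:
            return None  # a missing file is tolerated, so nothing was built
        raise BuildFailed(f"{file} is missing after the build")
    return returncode


with tempfile.TemporaryDirectory() as tmp:
    ok = run_recipe(tmp, "homework.txt", recipe="echo built > homework.txt")
    skipped = run_recipe(tmp, "extra.txt", missing_ok=True)
    try:
        run_recipe(tmp, "solution.txt", recipe="exit 3")
        failure = None
    except BuildFailed as exc:
        failure = str(exc)

print(ok, skipped, failure)
```

The first call succeeds, the second tolerates a missing file, and the third raises because the recipe exits with a nonzero code.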

Publishing

In the publish phase, all published artifacts – that is, those which are ready and whose release date has passed – are copied to an output directory. Additionally, a JSON file containing information about the collection -> publication -> artifact hierarchy is placed at the root of the output directory.

Artifacts are copied to a location within the output directory according to the following “formula”:

<output_directory>/<collection_key>/<publication_key>/<artifact_key>

For instance, an artifact keyed homework.pdf in the 01-intro publication of the homeworks collection will be copied to:

<output_directory>/homeworks/01-intro/homework.pdf

An artifact which has not been released will not be copied, even if the artifact file exists.
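
The path formula above can be sketched as a walk over a nested dictionary of keys. The function destinations and the dictionary layout are assumptions made for illustration; the sketch only computes paths and copies nothing.

```python
from pathlib import PurePosixPath


def destinations(universe, output_directory):
    """List <output_directory>/<collection_key>/<publication_key>/<artifact_key> paths."""
    out = PurePosixPath(output_directory)
    paths = []
    for collection_key, publications in universe.items():
        for publication_key, artifacts in publications.items():
            for artifact_key in artifacts:
                paths.append(str(out / collection_key / publication_key / artifact_key))
    return paths


# Hypothetical hierarchy: collection -> publication -> list of artifact keys.
universe = {"homeworks": {"01-intro": ["homework.pdf", "solution.pdf"]}}
paths = destinations(universe, "out")
print(paths)
# ['out/homeworks/01-intro/homework.pdf', 'out/homeworks/01-intro/solution.pdf']
```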

publish will create a JSON file named <output_directory>/published.json. This file contains nested dictionaries describing the structure of the collection -> publication -> artifact hierarchy.

For example, the below code will load the JSON file and print the path of a published artifact relative to the output directory, as well as a publication’s metadata.

>>> import json
>>> d = json.load(open('published.json'))
>>> d['collections']['homeworks']['publications']['01-intro']['artifacts']['homework.pdf']['path']
homeworks/01-intro/homework.pdf
>>> d['collections']['homeworks']['publications']['01-intro']['metadata']['due']
2020-09-04 23:59:00

Only those publications and artifacts which have been published appear in the JSON file. In particular, if an artifact has not reached its release time, it will be missing from the JSON representation entirely.
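
Because unpublished items are simply absent, a consumer can list everything that was published by walking the nested dictionaries. The sketch below assumes only the key layout used in the example above ('collections', 'publications', 'artifacts', 'path'); the sample document and the function published_paths are illustrative and not output from a real run.

```python
import json

# Hand-written sample that follows the key layout shown in the example above.
sample = json.loads("""
{"collections": {"homeworks": {"publications": {"01-intro": {
    "metadata": {"name": "Homework 01"},
    "artifacts": {"homework.pdf": {"path": "homeworks/01-intro/homework.pdf"}}
}}}}}
""")


def published_paths(published):
    """Yield (collection, publication, artifact, path) for every listed artifact."""
    for ckey, collection in published["collections"].items():
        for pkey, publication in collection["publications"].items():
            for akey, artifact in publication["artifacts"].items():
                yield ckey, pkey, akey, artifact["path"]


rows = list(published_paths(sample))
print(rows)
# [('homeworks', '01-intro', 'homework.pdf', 'homeworks/01-intro/homework.pdf')]
```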

API

publish can also be used as a Python package. Its behavior when run as a script can be reproduced using three high-level functions: publish.discover(), publish.build(), and publish.publish().

>>> import publish
>>> discovered = publish.discover('path/to/input_directory')
>>> built = publish.build(discovered)
>>> published = publish.publish(built, 'path/to/output/directory')

These functions can be used to build and publish individual collections, publications, and artifacts as well, as described below.

The full API of the package is as follows:

Exceptions

Error

Generic error.

ValidationError

Publication does not satisfy schema.

DiscoveryError(msg, path)

A configuration file is not valid.

BuildError

Problem while building the artifact.

Types

UnbuiltArtifact(workdir, file, recipe, …)

The inputs needed to build an artifact.

BuiltArtifact(workdir, file, returncode, …)

The results of building an artifact.

PublishedArtifact(path)

A published artifact.

Publication(metadata, artifacts, …)

A publication.

Collection(schema, publications, …)

A collection.

Universe(collections)

Container of all collections.

Schema(required_artifacts, …)

Rules governing publications.

Functions

build(parent, *[, ignore_release_time, …])

Build a universe/collection/publication/artifact.

deserialize(s)

Reconstruct a universe/collection/publication/artifact from JSON.

discover(input_directory[, …])

Discover the collections and publications in the filesystem.

filter_nodes(parent, predicate[, …])

Remove nodes from a Universe/Collection/Publication.

publish(parent, outdir[, prefix, callbacks])

Publish a universe/collection/publication/artifact by copying it.

read_collection_file(path)

Read a Collection from a yaml file.

read_publication_file(path[, schema, …])

Read a Publication from a yaml file.

serialize(node)

Serialize the universe/collection/publication/artifact to JSON.

validate(publication, against)

Make sure that a publication satisfies the schema.

Types

publish provides several types for representing collections, publications, and artifacts.

There are three artifact types, each used to represent artifacts at a different stage of the discover -> build -> publish process. Each is a subclass of typing.NamedTuple.

class publish.UnbuiltArtifact(workdir: pathlib.Path, file: str, recipe: Optional[str] = None, release_time: Optional[datetime.datetime] = None, ready: bool = True, missing_ok: bool = False)

The inputs needed to build an artifact.

workdir

Absolute path to the working directory used to build the artifact.

Type

pathlib.Path

file

Path (relative to the workdir) of the file produced by the build.

Type

str

recipe

Command used to build the artifact. If None, no command is necessary.

Type

Union[str, None]

release_time

Time/date the artifact should be made public. If None, it is always available.

Type

Union[datetime.datetime, None]

ready

Whether or not the artifact is ready for publication. Default: True.

Type

bool

missing_ok

If True and the file is missing after building, then no error is raised and the result of the build is None.

Type

bool

class publish.BuiltArtifact(workdir: pathlib.Path, file: str, returncode: Optional[int] = None, stdout: Optional[str] = None, stderr: Optional[str] = None)

The results of building an artifact.

workdir

Absolute path to the working directory used to build the artifact.

Type

pathlib.Path

file

Path (relative to the workdir) of the file produced by the build.

Type

str

returncode

The build process’s return code. If None, there was no process.

Type

Optional[int]

stdout

The build process’s stdout. If None, there was no process.

Type

Optional[str]

stderr

The build process’s stderr. If None, there was no process.

Type

Optional[str]

class publish.PublishedArtifact(path: str)

A published artifact.

path

The path to the artifact’s file relative to the output directory.

Type

str

For convenience, all three of these types inherit from an Artifact base class. This makes it easy to check whether an object is an artifact of any kind using isinstance(x, publish.Artifact).

Publications and collections are represented with the Publication and Collection types. Furthermore, a set of collections is represented with the Universe type. These three types all inherit from typing.NamedTuple.

class publish.Publication(metadata: Mapping[str, Any], artifacts: Mapping[str, publish.types.Artifact], ready: bool = True, release_time: Optional[datetime.datetime] = None)

A publication.

artifacts

The artifacts contained in the publication.

Type

Dict[str, Artifact]

metadata

The metadata dictionary.

Type

Dict[str, Any]

ready

If False, this publication is not ready and will not be published.

Type

Optional[bool]

release_time

The time before which this publication will not be released.

Type

Optional[datetime.datetime]

class publish.Collection(schema: Schema, publications: Mapping[str, publish.types.Publication])

A collection.

schema

The schema used to validate the publications within the collection.

Type

Schema

publications

The publications contained in the collection.

Type

Mapping[str, Publication]

class publish.Universe(collections: Mapping[str, publish.types.Collection])

Container of all collections.

collections

The collections.

Type

Dict[str, Collection]

These types exist within a hierarchy: A Universe contains instances of Collection which contain instances of Publication which contain instances of Artifact. Universe, Collection, and Publication are internal nodes of the hierarchy, while Artifact instances are leaf nodes.

Internal node types share several methods and attributes, almost as if they were inherited from a parent “InternalNode” base class (which doesn’t exist in actuality):

class publish.InternalNode
_deep_asdict(self)

Recursively compute a dictionary representation of the object.

_replace_children(self, new_children)

Replace the node’s children with a new set of children.

_children

The node’s children.

For instance, the ._children attribute of a Collection returns a dictionary mapping publication keys to Publication instances.
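
To show how a shared ._children / ._replace_children interface makes generic tree code possible, here is a sketch using simplified stand-in classes. These are not the package's own Publication and Collection types; the field sets are reduced and the helper leaf_keys is hypothetical.

```python
from typing import Any, Mapping, NamedTuple


class Publication(NamedTuple):
    """Simplified stand-in, not publish.Publication."""
    metadata: Mapping[str, Any]
    artifacts: Mapping[str, Any]

    @property
    def _children(self):
        return self.artifacts

    def _replace_children(self, new_children):
        return self._replace(artifacts=new_children)


class Collection(NamedTuple):
    """Simplified stand-in, not publish.Collection."""
    publications: Mapping[str, Publication]

    @property
    def _children(self):
        return self.publications

    def _replace_children(self, new_children):
        return self._replace(publications=new_children)


def leaf_keys(node, prefix=()):
    """Yield the key path of every leaf below an internal node."""
    for key, child in node._children.items():
        if hasattr(child, "_children"):
            yield from leaf_keys(child, prefix + (key,))
        else:
            yield prefix + (key,)


hw = Publication(metadata={"name": "Homework 01"},
                 artifacts={"homework.pdf": "artifact-1", "solution.pdf": "artifact-2"})
homeworks = Collection(publications={"01-intro": hw})
print(list(leaf_keys(homeworks)))
# [('01-intro', 'homework.pdf'), ('01-intro', 'solution.pdf')]

# _replace_children returns a new node; the original tuple is left unchanged.
trimmed = hw._replace_children({"homework.pdf": "artifact-1"})
```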

Schemas and Validation

Schemas used to validate publications are represented with the Schema class.

class publish.Schema(required_artifacts: Collection[str], optional_artifacts: Optional[Collection[str]] = None, metadata_schema: Optional[Mapping[str, Mapping]] = None, allow_unspecified_artifacts: bool = False, is_ordered: bool = False)

Rules governing publications.

required_artifacts

Names of artifacts that publications must contain.

Type

typing.Collection[str]

optional_artifacts

Names of artifacts that publications are permitted to contain. Default: empty list.

Type

typing.Collection[str], optional

metadata_schema

A dictionary describing a schema used to validate publication metadata. In the style of cerberus. If None, no validation will be performed. Default: None.

Type

Mapping[str, Any], optional

allow_unspecified_artifacts

Is it permissible for a publication to have unknown artifacts? Default: False.

Type

Optional[Boolean]

is_ordered

Should the publications be considered ordered by their keys? Default: False

Type

Optional[Boolean]

Validation is performed with the following function:

publish.validate(publication: publish.types.Publication, against: publish.types.Schema)

Make sure that a publication satisfies the schema.

This checks the publication’s metadata dictionary against against.metadata_schema. It also verifies that all required artifacts are provided, and that no unknown artifacts are given (unless against.allow_unspecified_artifacts == True).

Parameters
  • publication (Publication) – A fully-specified publication.

  • against (Schema) – A schema for validating the publication.

Raises

ValidationError – If the publication does not satisfy the schema’s constraints.
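
The artifact half of this check can be sketched with plain set arithmetic. The function check_artifacts is hypothetical and returns a list of problems instead of raising; metadata validation is left out because it depends on cerberus.

```python
def check_artifacts(artifact_names, required, optional=(), allow_unspecified=False):
    """Return a list of problems; an empty list means the names are acceptable."""
    names = set(artifact_names)
    # Every required artifact must be present.
    problems = [f"missing required artifact: {name}"
                for name in sorted(set(required) - names)]
    if not allow_unspecified:
        # Anything that is neither required nor optional counts as unknown.
        unknown = names - set(required) - set(optional)
        problems += [f"unknown artifact: {name}" for name in sorted(unknown)]
    return problems


required = ["homework.pdf", "solution.pdf"]
optional = ["template.zip"]
good = check_artifacts(["homework.pdf", "solution.pdf", "template.zip"], required, optional)
bad = check_artifacts(["homework.pdf", "notes.txt"], required, optional)
print(good)  # []
print(bad)   # ['missing required artifact: solution.pdf', 'unknown artifact: notes.txt']
```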

Discovery

The discovery of collections, publications, and artifacts is performed using the discover() function.

publish.discover(input_directory, skip_directories=None, callbacks=None, date_context=None, template_vars=None)

Discover the collections and publications in the filesystem.

Parameters
  • input_directory (Path) – The path to the directory that will be recursively searched.

  • skip_directories (Optional[Collection[str]]) – A collection of directory names that should be skipped if discovered. If None, no directories will be skipped.

  • callbacks (Optional[DiscoverCallbacks]) – Callbacks to be invoked during the discovery. If omitted, no callbacks are executed. See DiscoverCallbacks for the possible callbacks and their arguments.

  • date_context (Optional[DateContext]) – A date context used to evaluate smart dates. If None, an empty context is used.

Returns

The collections and the nested publications and artifacts, contained in a Universe instance.

Return type

Universe

Callbacks are invoked at certain points during the discovery. To provide callbacks to the function, subclass and override the desired members of the below class, and provide an instance to discover().

class publish.DiscoverCallbacks

Callbacks used in discover(). Defaults do nothing.

on_collection(path)

When a collection is discovered.

Parameters

path (pathlib.Path) – The path of the collection file.

on_publication(path)

When a publication is discovered.

Parameters

path (pathlib.Path) – The path of the publication file.

on_skip(path)

When a directory is skipped.

Parameters

path (pathlib.Path) – The path of the directory to be skipped.

Two low-level functions read_collection_file() and read_publication_file() are also available for reading individual collection and publication files. Note that they are not recursive: reading a collection file does not load any publications into the collection. Most of the time, you probably want discover().

publish.read_collection_file(path)

Read a Collection from a yaml file.

Parameters

path (pathlib.Path) – Path to the collection file.

Returns

The collection object with no attached publications.

Return type

Collection

Notes

The file should have one key, “schema”, whose value is a dictionary with the following keys/values:

  • required_artifacts

    A list of artifact names that are required.

  • optional_artifacts [optional]

    A list of artifacts that are optional. If not provided, the default value of [] (empty list) will be used.

  • metadata_schema [optional]

    A dictionary describing a schema for validating publication metadata. The dictionary should deserialize to something recognized by the cerberus package. If not provided, the default value of None will be used.

  • allow_unspecified_artifacts [optional]

    Whether or not to allow unspecified artifacts in the publications. Default: False.

publish.read_publication_file(path, schema=None, date_context=None, template_vars=None)

Read a Publication from a yaml file.

Parameters
  • path (pathlib.Path) – Path to the publication file.

  • schema (Optional[Schema]) – A schema for validating the publication. Default: None, in which case the publication’s metadata are not validated.

  • date_context (Optional[DateContext]) – A context used to evaluate smart dates. If None, no context is provided.

Returns

The publication.

Return type

Publication

Raises

DiscoveryError – If the publication file’s contents are invalid.

Notes

The file should have a “metadata” key whose value is a dictionary of metadata. It should also have an “artifacts” key whose value is a dictionary mapping artifact names to artifact definitions.

Optionally, the file can have a “release_time” key providing a time at which the publication should be considered released. It may also have a “ready” key; if this is False, the publication will not be considered released.

If the schema argument is not provided, only very basic validation is performed by this function. Namely, the metadata schema and required/optional artifacts are not enforced. See the validate() function for validating these aspects of the publication. If the schema is provided, validate() is called as a convenience.

Build

The building of whole collections, publications, and artifacts is performed with the build() function.

publish.build(parent: Union[publish.types.Universe, publish.types.Collection, publish.types.Publication, publish.types.UnbuiltArtifact], *, ignore_release_time=False, verbose=False, now=<built-in method now of type object>, run=<function run>, exists=<function Path.exists>, callbacks=None)

Build a universe/collection/publication/artifact.

Parameters
  • parent (Union[Universe, Collection, Publication, UnbuiltArtifact]) – The thing to build. Operates recursively, so if given a Universe, for instance, will build all of the artifacts within.

  • ignore_release_time (bool) – If True, all artifacts will be built, even if their release time has not yet passed.

  • callbacks (Optional[BuildCallbacks]) – Callbacks to be invoked during the build. If omitted, no callbacks are executed. See BuildCallbacks for the possible callbacks and their arguments.

Returns

A copy of the parent where each leaf artifact is replaced with an instance of BuiltArtifact. If the thing to be built is not built due to being unreleased, None is returned.

Return type

Optional[type(parent)]

Note

If a publication or artifact is not yet released, either due to its release time being in the future or because it is marked as not ready, its recipe will not be run. If the parent node is a publication or artifact that is not built, the result of this function is None. If the parent node is a collection or universe, all of the unbuilt publications and artifacts within are recursively removed from the tree.
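
The pruning behavior in this note can be sketched on plain dictionaries. The helpers is_released and prune are hypothetical; treating a release time equal to the current time as released, and letting ready=False win even when release times are ignored, are both assumptions.

```python
from datetime import datetime


def is_released(ready, release_time, now, ignore_release_time=False):
    """Decide whether a publication or artifact counts as released."""
    if not ready:
        return False  # assumption: not-ready items are never built
    if ignore_release_time or release_time is None:
        return True
    return release_time <= now


def prune(publications, now):
    """Drop unreleased publications and artifacts from a nested dict."""
    kept = {}
    for pkey, pub in publications.items():
        if not is_released(pub["ready"], pub["release_time"], now):
            continue
        artifacts = {akey: art for akey, art in pub["artifacts"].items()
                     if is_released(art["ready"], art["release_time"], now)}
        kept[pkey] = {**pub, "artifacts": artifacts}
    return kept


now = datetime(2020, 9, 2)
publications = {
    "01-intro": {"ready": True, "release_time": None, "artifacts": {
        "homework.pdf": {"ready": True, "release_time": None},
        "solution.pdf": {"ready": False, "release_time": datetime(2020, 9, 5, 23, 59)},
    }},
    "02-next": {"ready": True, "release_time": datetime(2020, 9, 8), "artifacts": {}},
}
summary = {pkey: sorted(pub["artifacts"]) for pkey, pub in prune(publications, now).items()}
print(summary)  # {'01-intro': ['homework.pdf']}
```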

Callbacks are invoked at certain points during the build. To provide callbacks to the function, subclass and override the desired members of the below class, and provide an instance to build().

class publish.BuildCallbacks

Callbacks used by build()

on_build(key, node)

Called when building a collection/publication/artifact.

on_missing(artifact: publish.types.UnbuiltArtifact)

Called when the artifact file is missing, but missing is OK.

on_not_ready(artifact: publish.types.UnbuiltArtifact)

Called when the artifact is not ready.

on_recipe(artifact: publish.types.UnbuiltArtifact)

Called when artifact is being built using its recipe.

on_success(artifact: publish.types.BuiltArtifact)

Called when the build succeeded.

on_too_soon(artifact: publish.types.UnbuiltArtifact)

Called when it is too soon to release the artifact.

Publish

publish.publish(parent, outdir, prefix='', callbacks=None)

Publish a universe/collection/publication/artifact by copying it.

Parameters
  • parent (Union[Universe, Collection, Publication, BuiltArtifact]) – The thing to publish.

  • outdir (pathlib.Path) – Path to the output directory where artifacts will be copied.

  • prefix (str) – String to prepend between output directory path and the keys of the children. If the thing being published is a BuiltArtifact, this is simply the filename.

  • callbacks (PublishCallbacks) – Callbacks to be invoked during the publication. If omitted, no callbacks are executed. See PublishCallbacks for the possible callbacks and their arguments.

Returns

A copy of the parent, but with all leaf artifact nodes replaced by PublishedArtifact instances. Artifacts which have not yet been released are still converted to PublishedArtifact, but their path is set to None.

Return type

type(parent)

Notes

The prefix is built up recursively, so that calling this function on a universe will publish each artifact to <prefix><collection_key>/<publication_key>/<artifact_key>.

Callbacks are invoked at certain points during the publication. To provide callbacks to the function, subclass and override the desired members of the below class, and provide an instance to publish().

class publish.PublishCallbacks
on_copy(src, dst)

Called when copying a file.

on_publish(key, node)

When publish is called on a node.

Serialization

Two functions are provided for serializing and deserializing objects to and from JSON.

publish.serialize(node)

Serialize the universe/collection/publication/artifact to JSON.

Parameters

node (Union[Universe, Collection, Publication, Artifact]) – The thing to serialize as JSON.

Returns

The object serialized as JSON.

Return type

str

publish.deserialize(s)

Reconstruct a universe/collection/publication/artifact from JSON.

Parameters

s (str) – The JSON to deserialize.

Returns

The reconstructed object; its type is inferred from the string.

Return type

Universe/Collection/Publication/Artifact

Filtering

Collections, publications, and artifacts can be removed using filter_nodes().

publish.filter_nodes(parent, predicate, remove_empty_nodes=False, callbacks=None)

Remove nodes from a Universe/Collection/Publication.

Parameters
  • parent – The root of the tree.

  • predicate (Callable[[node], bool]) – A function which takes in a node and returns True/False whether it should be kept.

  • remove_empty_nodes (bool) – Whether nodes without children should be removed (True) or preserved (False). Default: False.

Returns

An object of the same type as the parent, but with all filtered nodes removed. Furthermore, if remove_empty_nodes is True, a node that has no children after filtering is also removed.

Return type

type(parent)
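
The effect of remove_empty_nodes can be illustrated on a tree of nested dictionaries. The function filter_tree is a hypothetical sketch, not publish.filter_nodes; applying the predicate to every node below the root is an assumption about the semantics.

```python
def filter_tree(tree, predicate, remove_empty_nodes=False):
    """Keep only nodes for which predicate(node) is true, recursing into dicts."""
    kept = {}
    for key, child in tree.items():
        if not predicate(child):
            continue
        if isinstance(child, dict):
            child = filter_tree(child, predicate, remove_empty_nodes)
            if remove_empty_nodes and not child:
                continue  # drop internal nodes left without children
        kept[key] = child
    return kept


# Hypothetical tree: collection -> publication -> artifact key -> file kind.
tree = {
    "homeworks": {"01-intro": {"homework.pdf": "pdf", "template.zip": "zip"}},
    "labs": {"01": {"data.zip": "zip"}},
}


def not_zip(node):
    return node != "zip"


preserved = filter_tree(tree, not_zip)
compacted = filter_tree(tree, not_zip, remove_empty_nodes=True)
print(preserved)  # {'homeworks': {'01-intro': {'homework.pdf': 'pdf'}}, 'labs': {'01': {}}}
print(compacted)  # {'homeworks': {'01-intro': {'homework.pdf': 'pdf'}}}
```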
