Data Nodes

The primitive data type is the Node. It's a container for arbitrary user-supplied data. The API implementation takes care of storing the node objects in the database and provides a public web interface on which client applications can be built. Whenever a node is added, a pub/sub event is sent so that client services can take appropriate actions. Typically, this is the role of orchestrators: scheduling tasks in response to particular events.
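The add-then-notify flow described above can be sketched with a small in-process example. Everything here is an assumption made for illustration: the EventBus and NodeStore classes, the "node" channel name and the event fields are hypothetical and do not reflect the actual API or its pub/sub transport.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-process pub/sub bus standing in for the real one."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, event: dict) -> None:
        for handler in self._subscribers[channel]:
            handler(event)


class NodeStore:
    """Hypothetical store that publishes an event each time a node is added."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self._nodes: Dict[str, dict] = {}

    def add(self, node_id: str, node: dict) -> None:
        self._nodes[node_id] = node
        event = {"op": "created", "id": node_id, "name": node["name"]}
        self._bus.publish("node", event)


scheduled: List[str] = []


def orchestrator(event: dict) -> None:
    # Schedule a (made-up) processing task whenever a "report" node appears.
    if event["op"] == "created" and event["name"] == "report":
        scheduled.append(f"process:{event['id']}")


bus = EventBus()
bus.subscribe("node", orchestrator)
store = NodeStore(bus)
store.add("a1", {"name": "report"})
store.add("b2", {"name": "plot"})
print(scheduled)
```

Only the "report" node triggers the orchestrator here; the "plot" node is stored and announced but ignored by this particular subscriber.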

Each node may have a parent to form a directed tree. A node with no parent is called a root. There can of course be many root nodes in the same database, each with its own arborescence. An interesting property is that every node has a single path to its root, which can be found by recursively walking through its parent nodes.
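As a minimal sketch of that walk, the following uses a plain dictionary as a stand-in for the database. The NODES mapping, the node ids and the path_to_root helper are hypothetical, and a loop is used in place of recursion.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for the database: node id -> node document.
NODES: Dict[str, dict] = {
    "a1": {"name": "report", "parent": None},
    "b2": {"name": "processed", "parent": "a1"},
    "c3": {"name": "plot", "parent": "b2"},
}


def path_to_root(node_id: str, nodes: Dict[str, dict]) -> List[str]:
    """Return node names from the root down to node_id by following parents."""
    names: List[str] = []
    current: Optional[str] = node_id
    while current is not None:
        node = nodes[current]  # one lookup per ancestor
        names.append(node["name"])
        current = node["parent"]
    names.reverse()  # collected leaf-first, returned root-first
    return names


print(path_to_root("c3", NODES))
```

Each step needs one lookup, so the cost grows with the depth of the node in its tree.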

Note: Node objects are read-only. Once added to the database, they can't be updated. However, child nodes can be added to grow the related data.

A node may also contain a Task object. This is not required, as not all node objects are created by tasks, in the same way that tasks don't always create nodes. These are loosely coupled concepts. Still, it's a common scenario, and as such it's useful to have traceability between nodes and the tasks that created them whenever applicable.

Object Model

Node objects follow a model defined by the API. The .data field is an arbitrary one defined by users, the only constraint being that it's a dictionary and all keys must be strings. Similarly, the .artifacts field is a dictionary with file names and user-provided URLs to access them.

All the other fields are managed directly by the API and play a role in how the nodes are used, following certain rules.

Here's a slightly simplified list of the fields found in the Node model:

id: Optional[PyObjectId] = Field(alias='_id', default=None)

This is the Node's unique identifier, basically a MongoDB ObjectId. Please note that it is only unique within the database of each individual API instance; unlike Task objects, which use UUIDs, it is not universally unique. To refer to a Node object outside the API, you may use its URL, e.g. https://api-hostname.com/latest/node/ID.
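Building such a URL on the client side could look like the sketch below. The node_url helper is hypothetical, and the identifier is a made-up 24-character hexadecimal string in the shape of an ObjectId; only the /latest/node/ID layout comes from the example above.

```python
def node_url(base_url: str, node_id: str) -> str:
    """Build an externally usable reference to a node on a given API instance."""
    return f"{base_url.rstrip('/')}/latest/node/{node_id}"


# Made-up identifier, for illustration only.
print(node_url("https://api-hostname.com/", "64b7f0c2e4b0a1d2c3e4f5a6"))
```

Because the host name is part of the URL, it identifies both the API instance and the node, which the bare identifier cannot do.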

parent: Optional[PyObjectId] = Field(default=None, description="Parent node id")

To form a tree, each node may have a parent node. This field holds the parent node's identifier. Since that identifier is internal to the database, parents and therefore entire trees have to be contained within a single database. To have a parent in another API instance or database, the .data field may be used with some logic on the client side. Additional features may be built into the API to facilitate this in future versions, say with .parent.api and .parent.node fields to follow a more federated architecture. Similarly, separate trees in the same database may be linked via the .data field with some logic in the client application - for example, to point to a previous version or iteration of the same node as produced by repeated tasks.

name: str = Field(description="Name of the node object")

Each node must have a name. This makes it possible to identify the node in the tree by something other than its database identifier. The only constraint is that the name must be a string, so for anonymous nodes the identifier may be used again, or just "node" or "banana". However, much like with files and directories in a file system, having meaningful names is important: users will typically interact with node names directly via a web dashboard or command-line tools.

path: List[str] = Field(description="Full path with node names from the root")

Since each node has a name and may be in a tree, each node also has a path. This can be worked out by collecting the names of all the parent nodes recursively up to the root node, which has no parent. However, it's a costly operation with lots of database lookups, and the path is a common way for users to retrieve nodes in a tree. So instead of being computed many times, it's stored in the .path field. The current model uses a list of strings. Another popular approach is a dotted syntax, but this would add some constraints on the node names and require lots of string parsing operations outside the database engine.
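The trade-off can be illustrated with a short sketch. It assumes that a new node's path is derived from its parent's stored path, which is one plausible way to fill the field and is not confirmed here. The helper names and sample nodes are hypothetical.

```python
from typing import List, Optional


def child_path(parent_path: Optional[List[str]], name: str) -> List[str]:
    """Build a node's path from its parent's stored path; no tree walk needed."""
    return [*(parent_path or []), name]


root = {"name": "report", "path": child_path(None, "report")}
child = {"name": "v1.0", "path": child_path(root["path"], "v1.0")}
nodes = [root, child]


def find_by_path(candidates: List[dict], path: List[str]) -> List[dict]:
    """Exact match on the stored path."""
    return [n for n in candidates if n["path"] == path]


def find_subtree(candidates: List[dict], prefix: List[str]) -> List[dict]:
    """Every node whose path starts with the given prefix."""
    return [n for n in candidates if n["path"][: len(prefix)] == prefix]


# A dotted string cannot round-trip a name that itself contains a dot.
dotted = ".".join(child["path"])
round_trip = dotted.split(".")
print(child["path"], round_trip)
```

With a list, a name such as "v1.0" stays intact, whereas the dotted form splits it into two segments. This is the kind of naming constraint the list representation avoids.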

artifacts: Dict[str, AnyHttpUrl] = Field(description="Artifacts associated with the node (binaries, logs...)")

Artifacts are files or, generally speaking, any standalone piece of data that can be retrieved over a stable URL. These will usually be logs from the task that ran and produced the node, or some binary files it generated. Ultimately it's up to the user to upload them to a third-party storage service. Each artifact has a key in the dictionary, basically a string with a name to identify what is to be found at its URL.
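A client may want to check an artifacts dictionary before submitting it. The sketch below only approximates what an HTTP URL type would enforce by testing for an http or https scheme and a host. The check_artifacts helper and the storage.example.com URLs are hypothetical.

```python
from typing import Dict
from urllib.parse import urlparse


def check_artifacts(artifacts: Dict[str, str]) -> Dict[str, str]:
    """Rough client-side check: string keys and http(s) URLs with a host."""
    for name, url in artifacts.items():
        if not isinstance(name, str):
            raise TypeError(f"artifact name must be a string: {name!r}")
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"artifact {name!r} needs an http(s) URL: {url!r}")
    return artifacts


artifacts = check_artifacts({
    "log.txt": "https://storage.example.com/run-42/log.txt",
    "image.bin": "https://storage.example.com/run-42/image.bin",
})

try:
    check_artifacts({"log.txt": "/tmp/log.txt"})  # local path, not a URL
    rejected = False
except ValueError:
    rejected = True

print(sorted(artifacts), rejected)
```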

data: Dict[str, Any] = Field(description="Arbitrary data stored in the node")

Users may also provide some arbitrary object data in the form of a dictionary. The keys need to be strings, and the values can be any object accepted by the underlying database engine (e.g. MongoDB). This will usually be a mix of primitive types, lists, dictionaries and some slightly more advanced ones such as timestamps. As in a classic NoSQL document database, no schema is imposed on this data unless the user supplies one for enhanced validation - see the .kind field below.

kind: str = Field(description="Name of the optional data schema")

Users may submit "kinds" of nodes, with a schema to describe the .data field - see Issue #7 about the ongoing design of this feature. If the .kind field is set, the idea is then to look up a previously registered schema with this name and use it to validate the content of .data.
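Since this feature is still being designed, the following is only a sketch of the lookup-then-validate idea and not the API's behaviour. The SCHEMAS registry, its field-to-type format, the "weather-report" kind and the convention that None means "no kind set" are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical registry: kind name -> required field names and their types.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "weather-report": {"station": str, "temperature": float},
}


def validate_data(kind: Optional[str], data: Dict[str, Any]) -> None:
    """Raise if data breaks the basic rules or the schema registered for kind."""
    if not all(isinstance(key, str) for key in data):
        raise TypeError("all .data keys must be strings")
    if kind is None:
        return  # no kind set, so no schema is applied
    if kind not in SCHEMAS:
        raise LookupError(f"no schema registered for kind {kind!r}")
    for field, expected in SCHEMAS[kind].items():
        if field not in data:
            raise ValueError(f"missing field {field!r}")
        if not isinstance(data[field], expected):
            raise TypeError(f"field {field!r} must be {expected.__name__}")


def is_valid(kind: Optional[str], data: Dict[str, Any]) -> bool:
    try:
        validate_data(kind, data)
    except (TypeError, ValueError, LookupError):
        return False
    return True


print(is_valid("weather-report", {"station": "north", "temperature": 21.5}))
print(is_valid("weather-report", {"station": "north", "temperature": "warm"}))
```

A real implementation would more likely rely on an established schema language than on a hand-written type table. The table is used here only to keep the sketch self-contained.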

task: Optional[Task] = Field(description="Task associated with this node")

If the node was created by a task, it should become customary practice for the task to store itself in the .task field as an embedded object. This is primarily for client-side usage, to keep track of which tasks were run and how they relate to the nodes.

owner: str = Field(description="Username of the node owner")

Each node belongs to a user. This is useful when searching for nodes, to filter out data that belongs to other users. It's worth mentioning that parent nodes may belong to a different user, since making a node a parent doesn't require changing the parent itself - only setting its identifier in the child node's .parent field. This is useful, for example, when orchestrating tasks that need to be run when other users have added a node. Say a bot user is sending meteorological data reports: you may run an orchestrator with your personal user account to run tasks that process the reports and generate additional child nodes, with images as artifacts for instance, using the bot's report node as their parent.
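The scenario above can be sketched with a few in-memory node documents. The find_nodes helper, the node ids and the user names are hypothetical. The real API would run such a query against its database and not against a Python list.

```python
from typing import Any, Dict, List

# Hypothetical node documents: a bot's report, a user's child node, another root.
nodes: List[Dict[str, Any]] = [
    {"id": "a1", "name": "report", "parent": None, "owner": "bot"},
    {"id": "b2", "name": "plot", "parent": "a1", "owner": "alice"},
    {"id": "c3", "name": "report", "parent": None, "owner": "bob"},
]


def find_nodes(candidates: List[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Return the nodes whose fields equal every given criterion."""
    return [
        node for node in candidates
        if all(node.get(key) == value for key, value in criteria.items())
    ]


mine = find_nodes(nodes, owner="alice")            # only alice's nodes
parent = find_nodes(nodes, id=mine[0]["parent"])[0]  # the bot's report node
print([n["id"] for n in mine], parent["owner"])
```

The child node records the bot's node id in its own parent field, so the bot's node is never modified and keeps its original owner.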