Creating a New Schema
This is intended to be a quick guide to how to create a new schema in RAD. It is not intended to be a comprehensive guide, but rather a quick guide that highlights all the important considerations for RAD schema creation.
Note
We strongly recommend that you use the RAD Helper Tool to help you set up your schemas. This will take care of things such as creating the necessary symbolic links and checking that your selected URIs follow outward conventions. It will also help you with bumping the version of your schema when you make changes to it.
Before you begin
Before you start writing your schema, you should have a clear idea of what data you wish to store in a Roman file and how you want to organize it. Remember that RAD supports a hierarchical data model, so you can topically organize your data under headings and subheadings (or even deeper) as you see fit. In particular, you should have a clear idea of the following:
The base name of the schema you which to create. This should be a short descriptive name that should only include lower case letters and underscores. No spaces or hyphens should be used in the name or path. For example,
dark,aperture,wcsinfo,cal_logs, etc. Typically, this name will form the base name that the object you are describing will be called in Python. For example,aperturemay correspond to anApertureobject in Python. This is not always the case, but it is a good rule of thumb. From now on we will use<name>to refer to this base name.Now we need to determine the version number of the schema. Typically, if the schema is new, it will be
1.0.0. This means that your schema will often be referenced with<name>-<version>. The version number should follow standard semantic versioning, see semver.org for more details.Note
The URI for your schema will include this version number, so it will look like
asdf://stsci.edu/datamodels/roman/schemas/<path>/<name>-<version>. The version number will not typically be included in the file name itself, but will appear in the symbolic link to the schema file that you will need to create.Next you will need to decided where in RAD to locate your schema. Look inside the
latestdirectory in the top of your working copy of RAD to see the organization of the existing schemas.Note
In the end you will have something like:
latest/<path/to/your/schema>, where<path/to/your/schema>is the path to the directory insidelatestthat your schema will be located. This means you now have the URI fragment<path/to/your/schema>/<name>-<version>, which can be formed into the full RAD schema URI by prefixing it withasdf://stsci.edu/datamodels/roman/schemas/. E.G.
If your schema is not for the SOC then it should be located within a directory that corresponds to the origin of the data. For example, the IPAC/SSC schemas are located in
latest/SSC. Items located within those directories should follow the organisational structure of the schemas already in that directory. These directory names must be in ALL CAPS.If your schema is for the SOC, then it will follow a different convention.
The only schemas allowed in
latestare those that correspond to the outputs of the Romancal pipeline. For examplelatest/wfi_image.yaml, is the output of the exposure level Romancal Pipeline.If your schema is for a reference file, then it should be located in
latest/reference_files/. For example,latest/reference_files/dark.yaml,latest/reference_files/flat.yaml, etc. All represent reference files that are made available via CRDS.If your schema describes some data that is part of the
metaattribute of a Romancal pipeline output, then it should be located inlatest/meta/. For example,latest/meta/exposure.yamlis the schema for themeta.exposureattribute of Romancal pipeline outputs.If your schema provides some enumerated object which is used in multiple places, then it should be located in
latest/enums/. For example,latest/enums/wfi_detector.yamlis the schema which lists all the possible WFI detectors that can be used in Romancal pipeline outputs.If your schema works to describe the columns of a table, then it should be located in
latest/tables/. For example,latest/tables/prompt_catalog_table.yamldescribes the columns for the prompt source catalog table that is created by the Romancal pipeline.If your schema does not fit into one of the above categories, then it should be discussed with the RAD maintainers to determine the best location.
The keywords for your fields and their hierarchical organization.
Keywords are the key in the key-value pairs for information. For example, the keyword may refer to a specific value like
ra_reffor the right ascension of the reference position for the WCS. Or it may refer to an entire group of related information likewcsinfo, which contains keywords that point at information about the WCS.Hierarchical organization refers to how you organize your keywords. In the WCS example for keywords,
ra_refis a keyword that is organized under thewcsinfokeyword. If you were using the Roman Datamodels objects you might reference thera_refdata viamodel.meta.wcsinfo.ra_ref.Note
Unlike FITS headers, RAD and ASDF place no restrictions on the length or composition of the keywords. We do ask that you use keywords that will not conflict with the reserved words for the Python programming language. Simply to avoid confusion when developing code that interacts with the data.
This means that we strongly recommend that you use descriptive keywords that are easy to understand. For example,
start_timeis a much better keyword thanst_tm. Whilest_tmmay save a few characters it is not immediately clear what it means.We also recommend that you nest your keywords in a logical manner. For example, if you find your self doing
thing_keyword1,thing_keyword2,thing_keyword3, etc. then you should consider creating athingkeyword and nestingkeyword1,keyword2,keyword3under it. This will make it much easier to find and understand the data.The data types of all the fields that you wish to store. In particular, you need to pay attention to the following:
Which fields will be primitive data types like
int,float,str, orbool. In JSON-schema these will beinteger,number,string, andbooleanrespectively.Which fields will require using an ASDF tag to reference another schema corresponding to a non-primitive type.
Note
We currently do not allow internal tag references within RAD, meaning that all the structures you are creating will essentially act as nested dictionaries/mappings. The RDM library can give life to these as something that looks like a Python object.
Note
We currently only allow the use of the following external tags:
tag:stsci.edu:asdf/core/ndarray-1.*for numpy arrays.
tag:stsci.edu:asdf/time/time-1.*for astropy time objects.
tag:astropy.org:astropy/table/table-1.*for astropy tables.
tag:stsci.edu:gwcs/wcs-*for gwcs WCS objects.We also allow the following tags, but their use should be limited as there are code performance implications:
tag:stsci.edu:asdf/unit/quantity-1.*"for astropy quantities.
tag:stsci.edu:asdf/unit/unit-1.*for VO standard units.
tag:astropy.org:astropy/units/unit-1.*for units that astropy defines that are not VO standard units.If you wish to use other external tags, please discuss this with the RAD maintainers. This limited subset of tags is to make it easier for us to provide support for opening and using the RAD data in languages other than Python. The more external tags we allow, the more burden it places on the ASDF maintainers to support these tags in other languages.
Which keywords will be required and which will be optional. When you create your schema you will need to specify at each level in the hierarchy which keywords will be required. Any keywords that are not listed as required will be considered optional and will require the user to check that they exist prior to using them.
Note
We only allow the tagging of schemas that describe the top-level objects
(datamodels) that RAD outlines. Each of these tags is an ASDF tag that
you will need to define in the RAD manifest. This file is latest/manifests/datamodels.yaml. See Tag Your Schema for more
information
Note
All external tags should end with a -<major version>.* version
specifier. Rather than a specific version number, this is a wildcard that
will match any version of that tag. This is to ensure that the schema is not
tied to a specific version of the external tag.
Create the Schema Boilerplate
We suggest that you use the RAD Helper Tool to help you create
a new schema. This can be done via clicking the New button once the helper has
been started. It will require you to enter the following information:
The title of the schema.
(Optional) A short description of the schema.
Selecting that this is a schema for RAD.
Entering the path, name and version that you have selected for your schema:
<path/to/schema>/<name>-<version>.
This will create a new schema file at latest/<path/to/schema>/<name>.yaml
with the contents:
YAML 1.1
---
$schema: asdf://stsci.edu/datamodels/roman/schemas/rad_schema-1.0.0
id: asdf://stsci.edu/datamodels/roman/schemas/<path/to/schema>/<name>-<version> # Note the lack of .yaml
title: <Title of the schema>
description: |
<A long description of the schema>
# description portion will be missing if not proveded by the tool.
If you do not wish to use the RAD Helper Tool, can create a file at latest/<path/to/schema>/<name>.yaml
with the boiler plate above. You will need to create an additional symbolic link from
src/rad/resources/schemas/<path/to/schema>/<name>-<version>.yaml to this file. Without this
link, RAD will not be able to find your schema.
Note
The YAML 1.1 needs to be the very first line of the file, as this defines
the start of the YAML file and its version.
Add Your Fields
Now we will populate your schema with the fields you wish to use. In almost all
cases you will want to use an object type for your top level of the schema,
for other cases see Alternate Ways of Adding Fields. In this case you add the following
after your description in the boilerplate:
type: object
properties:
<first keyword>:
title: <Title of the field>
description: |
<A long description of the field, can be multiline>
You will repeat this step for each of the top-level fields you wish to add.
Populate a Field’s Sub-Schema
After the field’s description at the same indentation level as the
description keyword, you will start to add the sub-schema for the field.
There are several different possibilities at this point:
- Primitive type.
Things like
int,float,str, orbool. In this case you will add the following:type: <type>
Note
The <type> for a Python float is number and the <type> for a
Python bool is boolean. While the <type> for a Python int is
integer and the <type> for a Python str is string.
- Tagged type.
Things that are referenced via an ASDF tag. In this case you add the following:
tag: <tag_uri>
If you want to narrow the tag further than its general schema you add after the tag (at the same indentation level):
properties: <narrowed key from tag>: <schema information to narrow the key>
Note
If you say want to narrow an
ndarrayto a specific datatype and number of dimensions you would add the following:properties: datatype: <dtype of the ndarray> exact_datatype: true ndim: <number of dimensions of the ndarray>
RAD requires that both
datatypeandexact_datatype: truebe defined forndarraytags. Theexact_datatype: trueprevents ASDF from attempting to cast the datatype to the one in the schema, meaning that if the dtype is not a perfect match to the schema a validation error will be raised.
- Dictionary-like type.
These are things that nest further fields within them. In this case you add:
type: object properties: <first keyword>: title: <Title of the field> description: | <A long description of the field>
And then repeat the process of adding the sub-schema for each of the fields.
- List-like type.
These are lists of the same type of item. These are called an
arrayin the schema, meaning that you add the following:type: array items: type: <type>
If further narrowing is required you can narrow them just like you would a tag. If you create an object or another array you likewise add the metadata in the same way as if it were a top-level field only indented appropriately.
Special Field Considerations
There are a few special considerations that you might need to take into account when creating your schema:
- Enum.
If you have a field that can only take on a specific set of values, you can use the
enumkeyword to specify the possible values. For example:enum: [<value1>, <value2>, <value3>]
Note
enumonly works for very primitive types likestring,integer,and
number. You should specify thetypeof the field when usingenumas it gets ambiguous about if something like1is astring, aninteger, or anumber.
- Multiple Possibilities.
If a field can take on multiple different types, you can use the
anyOfcombiner to specify the different possibilities. For example:anyOf: - type: <type1> - type: <type2> - type: <type3>
where further metadata can be added to each of the types as needed.
Note
Sometimes you might want to have a field which is required, but which may not take on any values at all. In this case you can use the
nulltype as one of the possibilities in theanyOfcombiner.
Add Required and Ordering Information
After you have added all of your fields, you will want to add the required
and ordering information. This is done at the same indentation level as the
properties keyword.
required: [<required field 1>, <required field 2>, <required field 3>]
propertyOrder: [<field 1>, <field 2>, <field 3>]
Warning
The propertyOrder can only be included in schemas that are tagged. using
it outside of a tagged context will cause ASDF to fail to validate the schema, even if it might otherwise be valid.
Tag Your Schema
Warning
You should tag only your top-level schema, the one that describes an entire product.
We suggest that when using the RAD Helper Tool to create your schema, you
also use it to tag your schema. This can be done via selecting the tag option.
Note
If you wish to tag your schema manually, you will need to add an entry to the RAD tag
manifest, in latest/manifests/datamodels.yaml. To do this you will need to add
the following after the tags: keyword in the manifest file:
- tag_uri: <tag_uri>
schema_uri: <schema_uri>
title: <Title of the schema>
description: |-
<A long description of the schema>
Where <tag_uri> is the tag you wish to use and <schema_uri> matches the
id in your schema file. If a schema is tagged, it should have (the tool will
automatically do this for you if you use it to create a tagged schema):
flowStyle: block
Warning
While not explicitly necessary, RAD recommends that your formulate your
file name, schema_uri, and tag_uri following standard convention.
This is to avoid confusion and to make it easier to find the schema and tag
and determine the associations between them. The convention is to use:
Start with formulating the file name. It should always be located in the
latestdirectory. The version number should not be included as part of the file name and it should always end with a.yamlfile extension. E.g. relative tolatest/reference_files/dark.yaml,meta/exposure.yaml, orwfi_image.yaml.The “version” of the schema should be the suffix of the file name having the form
-<major>.<minor>.<patch>. E.g.-1.0.0.Next formulate the
schema_urifrom the file name by dropping the.yamlfile extension and prefixing the result with the RAD schema URI prefixasdf://stsci.edu/datamodels/roman/. Then appending-<version>E.g.asdf://stsci.edu/datamodels/roman/schemas/reference_files/dark-1.0.0,asdf://stsci.edu/datamodels/roman/schemas/meta/exposure-1.0.0, orasdf://stsci.edu/datamodels/roman/schemas/wfi_image-1.0.0,The
tag_urishould match theschema_uriwith theschemasreplaced withtags. E.g.asdf://stsci.edu/datamodels/roman/tags/reference_files/dark-1.0.0,asdf://stsci.edu/datamodels/roman/tags/meta/exposure-1.0.0, (in reality this is untagged) orasdf://stsci.edu/datamodels/roman/tags/wfi_image-1.0.0,
All of these conventions are enforced by the helper tool as it will check that
you have correctly formulated the schema_uri and then use these conventions
to automatically create the tag_uri and filename/location for you.
Note
In most cases you will not tag a schema. Tags are generally used only when the schema is intended to be used as a datamodel. This allows for easy reuse of schemas and extending another schema, see Pseudo Inheritance for more information.
Alternate Ways of Adding Fields
There are two additional ways that one might formulate the top level of a schema
which do not involve using an object type (Pseudo Inheritance is also
a method but it still involves objects). These are when one needs to tag a
specially defined list (array) data or when one needs tag a scalar type. In both
these cases, the schema is acting to mix metadata into the schema in a way that
can be reused in other schemas rather than to define a standalone object.
Aside from reuse this is done so that ASDF can correctly search and pull
metadata from the underpinning schemas. This is largely due to the difficulty
in having ASDF traverse through multiple layers of allOf combiners in its
search and find efforts in the schemas. These combiners are largely the results
of Pseudo Inheritance. By having a tag ASDf is able to
bypass the recursive search and jump directly to the schema that is being
referenced.
Testing Schemas
Once you created a schema, run the tests in the rad package before proceeding
to write the model.
Note
The schemas need to be committed to the working repository and the rad
package needs to be installed before running the tests.
Creating a Data Model
The DataModel objects from
RDM which act as the primary outward
facing Python interface to the data described by the RAD schemas are simply
wrappers around the actual data container objects. As such these
DataModel objects are not directly defined by
anything in RAD. However, they are closely related to the RAD schemas. As
such, certain additional things are added to some schemas to make this
relationship between DataModel objects and some
schemas more clear.
First, note that since all the schemas in RAD are hierarchical, there eventually
will exist a “top-level” schema which acts to describe all the data that is
expected to be in a given ASDF file for Roman. Since each ASDF file will
correspond to a specific DataModel object and
those objects are wrappers around the actual data container objects, that
“top-level” schema effectively describes the data structure of a given
DataModel object. Hence, this “top-level” schema
should be called out in a way that makes it clear that it is the schema which
fully describes the structure of a DataModel and
its associated Roman ASDF file.
To do this, right after the description of the schema in the schema file, the following should be added:
datamodel_name: <name of the datamodel in Python>
archive_meta: <archive meta table information>
The datamodel_name field is simply so that we can test that a
DataModel exists for each “top-level” schema and
that each of these schemas maps to exactly one
DataModel. Moreover, it documents which
DataModel maps to which schema as this is not
always completely clear due to the fact that the schema names and
DataModel names do not follow a strict naming
pattern.
The archive_meta field is used by the archive to define some meta table
information. This should only be included if the schema describes a datamodel
which will be archived. The value of this field should be determined by the archive
for you. Start with adding archive_meta: None and then update it when you
have the correct information from the archive team.
Pseudo Inheritance
When creating schemas, there are cases in which you might want multiple schemas
to share identical structures, but do not want to repeat this information in
multiple places. Since JSON-schema does not support inheritance in the
“classical” sense, we have to employ a workaround. This workaround employs the
JSON-schema allOf combiner together with the JSON-schema reference keyword,
$ref. This results in a schema code block that looks like the following:
allOf:
- $ref: <schema_uri>
- type: object
properties:
<additional properties to add to existing schema>
This acts somewhat like inheritance because it requires that the data described
by the schema must satisfy the requirements of the schema being referenced and
the additional new object included in the allOf combiner.
This method of combining schemas maybe used at the top level of a schema in
order to create a full inheritance-like relationship or it may be used in some
sub-schema to do a similar thing. In any case, this should be the only usage of
the $ref keyword in the schema file.
External Metadata
In addition to describing the data structure of Roman ASDF files, RAD also acts to house metadata about how the Roman ASDF files are to be interacted with. This “external metadata” is not directly related to the structure of the data structure itself, but rather describes how the data contained within that structure will be integrated into the archives or how some of that data was created external to the Romancal pipeline.
Currently, there are two types of external metadata that are supported by RAD:
sdf
archive_catalog
sdf
This is the metadata given to fields which are populated by the SDF software before the data is processed by the Romancal pipeline. This metadata currently consists of two fields:
special_processing: which is a string that describes the special processing that was done to create the data in SDF.
source: which is a string that describes the source of the data used by SDF.
Both of these values are typically provided to us by the SDF software teams and
thus should be done in consultation with them. If the SDF software teams have
not indicated the values yet then the fields should be filled with
VALUE_REQUIRED and origin: TBA respectively.
archive_catalog
This is the metadata given to fields that will be incorporated into the archive to describe the Roman ASDF file. This metadata consists of two fields:
datatype: which describes the datatype of that will be used by the archive’s database to store the data contained within the field. This maybe things such as if its a string and if so how long or what type of number it will be.
destination: This is a list of strings of the form<table name>.<column name>, which describe where that data will be stored in the archive’s database. Typically<column name>, will match the keyword of the field in the schema. This is not always the case as sometimes multiple fields from different parts of the files may end up in the same table, but whose keywords are the same. When this occurs, the archive will inform us of what the correct<column name>should be. The<table name>is the name of the table in the archive’s database and is typically provided to us by the archive to be recorded in the schema.
In both cases, the metadata should be added in consultation with the archive team. This includes if the field should even be included into the archive.