PEML

The Programming Exercise Markup Language


Designed to provide an ultra-human-friendly authoring format for describing automatically graded programming assignments.

Purpose

The Programming Exercise Markup Language (PEML) (feedback on name choice is welcome!) is intended to be a simple, easy format for CS and IT instructors of all kinds (college, community college, high school, whatever) to describe programming assignments and activities. We want it to be so easy (and obvious) to use that instructors won't see it as a technological or notational barrier to expressing their assignments.

We intend for this format to be something that authors of automated grading tools can adopt, so they can provide a very easy, low-energy onboarding path for existing instructors to get programming activities into such tools. As a result, this notation leans heavily on supporting authors and streamlining common cases, even if this may require more work on the part of tool developers--the goal is to make it super easy for authors of programming activities, not to fit into a specific auto-grader or simplify tasks for tool writers.

OK, so this is a new notation. That sucks. But there doesn't seem to be an obvious alternative that meets our goals, so ...

Also, in terms of scope of "programming activity", here we are trying to capture everything from very small programming activities, such as "here's a very short function, fill in the blank to make it work", to very large programming activities, such as "write dozens of classes spanning thousands of lines of code to implement this programming language", or whatever. Yes, that is a very broad range, but the aim is to cover the full range, rather than making simplifying assumptions that limit PEML to a narrower subset. The reason for using "Exercise" in the name of the language is to remind developers that it covers a broader range of activities than what many instructors see as pure "programming assignments".

Why not YAML? or JSON?

Actually, one of our design goals is for PEML descriptions to be directly mappable to YAML and JSON, so that people who prefer one of those notations (or tools that use those notations) can use the data model. Converting PEML to YAML or JSON should be a direct/easy tool translation that we expect will be provided as a service and/or library at some point.

But why not just use YAML (or JSON) directly, since parsers already exist?

The main reasons are:

  1. These existing formats are not writer-friendly enough for descriptions that contain large amounts of free-form text. Face it, programming activities usually require large-ish chunks of multi-line text to describe most of the interesting properties, whether you're talking about the specification for a program, or starter code to provide to a student, or reference tests to check a solution, or a sample/reference solution to provide for other instructors to look at, or whatever. In most cases, PEML is not about simple key/value pairs where values are small pieces of data, or about deeply structured nested object descriptions. It's about writer-friendly input of structured text where most of the values are multi-line text written by humans.

  2. These formats are also not syntax-friendly enough to present a minimal-effort entry path. Of course, this isn't an obstacle for programming-oriented instructors who have already used YAML (or JSON) and are familiar with them. However, the sticky bits of those formats do impose a learning curve that we hope to minimize.

    In particular, YAML's reliance on whitespace/indentation to indicate nested structure can pose an obstacle to more free-form text input. JSON's reliance on JavaScript quoting and lack of true multi-line values makes it challenging for these tasks in similar ways.

Of course, other practitioners have made the same complaints about both YAML and JSON before, so "more human-friendly" alternatives to both formats have been proposed by others (see Influences below). Naturally, we were unable to find an existing format that exactly met the needs here ... If you happen to know of one we've overlooked that would be a better fit, please say so! What we are aiming for is a format that allows instructors to copy-and-paste content from existing sources (existing assignment writeups, program solutions, program text, whatever) with minimal additional reformatting work in order to create a readable, streamlined programming exercise representation.

Design Goals

OK, with the context for this description language out of the way, here are the specific objectives we are hoping to achieve:

  1. Minimal learning curve: To paraphrase Cay Horstmann, we are aiming for a format that an average computing instructor can learn in under an hour. Further, after a little practice, we hope that an instructor who has an existing assignment can convert or map it to this representation in less than 10 minutes. We want the syntax to be super simple/direct, so that instructors can focus just on the specific properties they are describing.

  2. Plain-text file representation: PEML uses a plain-text format so that it is easy to edit with any text editor. As mentioned above, we want instructors to be able to directly copy plain-text content, including relevant code assets or instructions, right into the PEML representation. For straightforward assignments that do not require any external resources or a special execution environment setup, the entire description can be written in one simple text file. Only instructors with more advanced situations or requirements need to use PEML's facility to connect with external resources.

  3. Reference to external resources: While the central representation for an exercise is a text file, we recognize that instructors may run into situations where one or more supporting resources are also necessary for a given exercise (such as custom data files, a special library, existing PDF documents, etc.). PEML provides a clean way of specifying values through relative and absolute URLs. These URLs may refer to remote resources (located online, including in git repositories or other locations, docker containers, etc.), or local resources located alongside the PEML file.

  4. Directory structure: While exercises described as a single text file are easy to transport and process, sometimes it may be useful for an exercise to refer to external resources stored locally. This can be done by including relative URLs in the exercise description (see below). In this situation, we can view the subdirectory containing the PEML file as the root of an exercise description, with relative URLs referring to other file paths within that subdirectory (or subdirectory tree). This allows a single PEML file along with its associated local resources to be managed as a single local entity.

  5. Zip file packaging: Similarly, while a PEML file plus its associated local resources can be represented as a subdirectory (or subdirectory tree), they can also be packaged into a zip file. The PEML file should be in the root of the zip file's internal directory structure, named exercise.peml. Relative URLs inside the PEML file will then be interpreted within the zip file relative to the location of the PEML file (i.e., relative to the zip file's root). This makes it easy to zip a subdirectory representing an exercise in order to transport or upload the PEML file together with all of its associated local resources, and just as easy to unzip a packaged exercise to produce a subdirectory representation.

  6. Programming language neutral: PEML should support descriptions of programming activities in any programming language, rather than being specific to just one. Some fields/assets within one exercise description will naturally be programming-language-specific, but the notation used to describe the exercise itself should not be.

  7. Minimal technology support: Basic PEML descriptions do not require the use of any specific supporting technologies to manage build environments, execution environments, or external dependencies. This follows from the goal to use a simple text file representation (with no external resources) for basic programming activities similar to those found in many textbooks. We believe simple, low overhead assignments will be the most common case. However, we also realize that some exercise authors may prefer to use specific tools or technologies to package or compartmentalize execution environment features, supporting libraries, run-time dependencies, build environments, custom testing tools, etc. As a result, PEML is set up to allow such services to be used by exercise authors who desire it, but they are not required for mainstream (or simple) exercise descriptions. In other words, instructors who use "vanilla" assignments without any special tooling or infrastructure should be able to hop in and describe their existing exercises in PEML without having to learn about new tools or technologies.
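
The directory and zip packaging conventions in items 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of any PEML tooling: the function name `package_exercise` is hypothetical, and the only detail taken from the text is that `exercise.peml` sits at the root of the archive.

```python
import zipfile
from pathlib import Path


def package_exercise(exercise_dir, zip_path):
    """Zip an exercise directory so that exercise.peml sits at the archive root.

    Hypothetical helper for illustration; not part of any PEML tool.
    """
    root = Path(exercise_dir)
    if not (root / "exercise.peml").is_file():
        raise FileNotFoundError("expected exercise.peml in " + str(root))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the exercise root, so relative URLs
                # in the PEML file resolve the same way inside the zip.
                archive.write(path, path.relative_to(root).as_posix())
    return zip_path
```

The sketch assumes the zip file is written somewhere outside the exercise directory; otherwise the archive would try to include itself.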

Influences

In terms of inspiration, PEML has been influenced heavily by YAML and its relatives (including HAML), as well as by many other notations that have been developed for describing structured textual data.

One of the most prominent influences has been ArchieML, another format with some overlapping goals used by the New York Times to write certain types of online content. PEML reuses big chunks of ArchieML's design.

The Awesome JSON page lists a nice collection of extensions to or alternatives to JSON that address some of the shortcomings of using it as a human-authored notation, including CSON, MSON, and HOCON. Other languages like TOML and YAML variants have also influenced this design.

Finally, the data model and some of the packaging ideas have been influenced by the work of an ITiCSE 2008 working group on the topic, which produced this report: Developing a Common Format for Sharing Programming Assignments.

Basic Format

The remainder of this description is split into two main parts: first, the format for describing key/value pairs (in this section), and second, the data model (in the following section). We view these two as independent. As indicated in the Why Not YAML? section above, we view the data described for a programming assignment as directly representable in PEML, YAML, JSON, etc. We also expect that most tools will support either YAML or JSON directly for tooling purposes, and that conversions between PEML <=> YAML or PEML <=> JSON will be easy. So users who strongly prefer an alternate notation can probably freely use one. However, we strongly believe that a representation optimized for human authoring of structured text consisting primarily of many multi-line text values is warranted to make authoring easier for those who don't think/write in YAML or JSON regularly.

OK, on to the format itself.

exercise_id: edu.vt.cs.cs1114.palindromes

# Single-line comments start with #
# Comments must be on lines by themselves

title: Palindromes (A Simple PEML Example)

topics: Strings, loops, conditions
prerequisites: variables, assignment, boolean operators

instructions:----------
Write a program that reads a single string (in the form of one line
of text) from its standard input, and determines whether the string is
a _palindrome_. A palindrome is a string that reads the same way
backward as it does forward, such as "racecar" or "madam". Your
program does not need to prompt for its input, and should only generate
one line of output, in the following
format:

<pre>
"racecar" is a palindrome.
</pre>

Or:

<pre>
"Flintstone" is not a palindrome.
</pre>
----------

assets.test_format: stdin-stdout

[systems]
language: java
version: >= 1.5

[assets.tests]
stdin: racecar
stdout: "racecar" is a palindrome.

stdin: Flintstone
stdout: "Flintstone" is not a palindrome.

stdin: url(some/local/input.txt)
stdout: url(some/local/output.txt)

stdin: url(http://my.school.edu/some/local/generator/input)
stdout: url(http://my.school.edu/some/local/generator/output)

PEML uses a plain-text representation for describing exercises. This format is designed to be easy to edit in a plain text editor.

Key/Value Pairs

Like YAML, we describe a programming exercise as a series of key/value pairs. Wow, big deal.

In YAML terms, that means the top-level structure of an exercise is a mapping (a hash or dictionary).

Keys are alphanumeric identifiers (starting with a letter, and including underscores). This is more restrictive than YAML, but the more general idea of allowing any representable value to be a key has little utility here and requires more careful parsing and fancier quoting rules that only decrease writability and increase the potential learning curve ... so, PEML uses the simpler notion that is common in many programming language identifier token classes. Note that periods can be used to form dotted names to refer to nested keys, as in ArchieML.

Also as in ArchieML, each key must start at the beginning of a line and be followed by a colon (for single-valued keys; keys that map to collections will instead be either: (a) surrounded by square brackets, or (b) surrounded by curly braces, still following ArchieML).

The corresponding value follows the colon. All values are potentially multi-line values, and extend up to the beginning of the next property. Any leading/trailing white space is trimmed (including newlines), and multi-line values (i.e., those containing embedded newline(s) after trimming) are automatically terminated with a single newline. As a result, blank lines can appear immediately before any key (or before any unquoted value) for visual spacing/chunking as desired without affecting the meaning.
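
As a rough sketch of the trimming rule just described (the function name `normalize_value` is hypothetical, and this is one reading of the rule rather than a reference implementation):

```python
def normalize_value(raw):
    """Trim a raw value and re-terminate it if it spans several lines."""
    value = raw.strip()  # drop leading/trailing whitespace, including newlines
    if "\n" in value:
        # Multi-line values end with exactly one newline after trimming.
        value += "\n"
    return value
```

Under this reading, a short value such as `stdin-stdout` comes back without a trailing newline, while a multi-line block of instructions always ends in exactly one.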

Like ArchieML, PEML is intended to be parsed line by line, with the first non-whitespace sequence on the line determining its role. A simple, line-oriented parsing strategy using a basic state machine should be sufficient, without requiring complex grammar-based parsing strategies.
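
To make the line-oriented idea concrete, here is a minimal sketch of the per-line classification step such a state machine might start from. The role names, the function name `classify_line`, and the exact key pattern are assumptions made for illustration; quoted values and the rules for terminating arrays and objects are deliberately left out.

```python
import re

# Assumed key syntax: identifiers that start with a letter, joined by periods.
KEY = r"[A-Za-z][A-Za-z0-9_]*(?:\.[A-Za-z][A-Za-z0-9_]*)*"
KEY_LINE = re.compile(r"(" + KEY + r"):(.*)$")
ARRAY_LINE = re.compile(r"\[(" + KEY + r")\]\s*$")
OBJECT_LINE = re.compile(r"\{(" + KEY + r")\}\s*$")


def classify_line(line):
    """Return (role, key, rest) for one line of PEML outside a quoted value."""
    stripped = line.strip()
    if not stripped:
        return ("blank", None, "")
    if stripped.startswith("#"):
        return ("comment", None, "")
    for role, pattern in (("array", ARRAY_LINE), ("object", OBJECT_LINE)):
        match = pattern.match(stripped)
        if match:
            return (role, match.group(1), "")
    match = KEY_LINE.match(line)  # keys must start at the beginning of the line
    if match:
        return ("key", match.group(1), match.group(2).strip())
    return ("text", None, line)
```

Note that a bare line such as `format:` is classified as a key by this sketch, which is exactly the situation that quoting (described below) is meant to handle.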

Comments

PEML allows single-line comments using the # character, as in YAML. The # character must be the first (non-whitespace) character on the line (i.e., only whole-line comments are supported), and the corresponding line is completely ignored for the purposes of interpreting the meaning of the PEML. Any line beginning with a # character (and any leading indentation) is interpreted as a comment line, except in quoted values.

Inspired by YAML's document start and end markers, PEML uses a specific comment line ("#---", a pound sign followed by three dashes) to signal the start of a PEML description. This marker is optional for the first PEML description in a text stream, but serves as the delimiter between exercises if multiple PEML descriptions are presented in a single file or stream. The current PEML description continues until the next occurrence of this marker (signaling the beginning of a new exercise), or the end of input.
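
A minimal sketch of splitting a stream on this marker might look like the following. The function name `split_descriptions` is hypothetical, and the sketch assumes the marker line is exactly `#---` and never appears inside a quoted value.

```python
def split_descriptions(text):
    """Split a text stream into PEML descriptions at '#---' marker lines."""
    descriptions, current = [], []
    for line in text.splitlines():
        if line.strip() == "#---":
            # The marker starts a new description; keep the previous one
            # only if it contained something other than blank lines.
            if any(part.strip() for part in current):
                descriptions.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        descriptions.append("\n".join(current))
    return descriptions
```

Because empty chunks are skipped, a leading marker before the first description is harmless, which matches the marker being optional there.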

Quoting

On occasion, one may end up including text as part of a value that might also be recognizable as the start of a key. You can see this where the word "format:" appears in the example above, as part of the value given for the key "instructions:". In those cases, PEML uses a variant of HereDoc-style syntax, adapted to be more like triple quotes in languages like Python, Scala, R, etc.:

some_key:"""
You can put any multi-line text inside
here and it is treated as if it is
quoted: even when it contains things
that: look
like: keys and values.
"""

Any key where the colon is immediately followed by three or more repetitions of the same printing character is treated as having a HereDoc-style quoted value, with the provided sequence of repeated characters serving as the delimiter. This is more flexible than triple-quoting, since triple quotes themselves may appear in program fragments for exercises using particular programming languages. This technique allows authors to choose a custom delimiter (as with HereDocs), but allows them to use repeated punctuation symbols to provide a more identifiable/scannable horizontal delimiter around the value, rather than using a custom identifier.

As with HereDocs in many programming languages, the quoted value is terminated by the first subsequent occurrence of a line containing only the delimiter character sequence.

yet_another_key:~~~~~
You can use any printable character as the delimiter,
as long as it is repeated at least three times. The ending
delimiter must exactly match the starting one, and appear
on its own line with no indentation.
~~~~~

Of course, many programming languages also use # as a comment character. In PEML, # has no special meaning inside a quoted value. As a result, we recommend HereDoc-quoting any values that contain source code from such a programming language, to prevent a program's comment lines from being interpreted as PEML comments.
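
The quoting rule can be sketched as follows. This is an illustrative reading of the description above, not a reference parser: the names `QUOTED_START` and `read_quoted_value` are hypothetical, the input is assumed to be a list of lines without trailing newlines, and the whitespace trimming rule for values is not applied here.

```python
import re

# A key whose colon is immediately followed by three or more copies
# of the same non-whitespace character, and nothing else on the line.
QUOTED_START = re.compile(r"([A-Za-z][A-Za-z0-9_.]*):((\S)\3{2,})\s*$")


def read_quoted_value(lines, start):
    """Read the quoted value that opens at lines[start].

    Returns (key, value, index of the line after the closing delimiter).
    """
    match = QUOTED_START.match(lines[start])
    if match is None:
        raise ValueError("line does not open a quoted value")
    key, delimiter = match.group(1), match.group(2)
    body = []
    for index in range(start + 1, len(lines)):
        if lines[index] == delimiter:  # delimiter alone, no indentation
            return key, "\n".join(body), index + 1
        # Everything else is kept verbatim, including lines starting with #.
        body.append(lines[index])
    raise ValueError("unterminated quoted value for key " + key)
```

Lines inside the quoted value are never inspected for keys or comments, so source code containing `#` comments or `name: value` text passes through untouched.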

Embedding Markdown (and HTML)

Special formatting in the textual description of the exercise can be written using Markdown, which also supports embedding HTML directly in exercise descriptions. So use Markdown or HTML for adding formatting to your text. Plain, unformatted text also works, when no special formatting markup is desired. Here, we specify GitHub's flavor of Markdown (GitHub Flavored Markdown).

External Resources

Values representing external references can be expressed as absolute or relative URLs using the "url(...)" construct (similar to its use in CSS).

# An example of an external resource
instructions: url(some/directory/assignment.pdf)

While we strongly discourage the use of PDF assignment descriptions, any key value can be farmed out into an external file that may be stored locally (and referred to using a relative URL) or remotely (and referred to using an absolute URL). This approach might be most used for source code content stored in separate files, test data stored in separate files, code libraries, and so on. Also, relative (or absolute) URLs used in embedded Markdown or HTML within keys may be appropriate for referring to images, downloadable resources accessible to the student, etc.
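
One way a tool might recognize and resolve such values is sketched below. The function name `resolve_url_value` and the base URL used in the usage example are assumptions for illustration; how a real tool fetches or validates the resource is outside the scope of this sketch.

```python
import re
from urllib.parse import urljoin, urlparse

# A value consisting solely of url(...), as in CSS.
URL_VALUE = re.compile(r"url\((.*)\)$")


def resolve_url_value(value, base):
    """Return the resolved URL if value has the form url(...), else None.

    `base` is the URL of the PEML file itself.
    """
    match = URL_VALUE.match(value.strip())
    if match is None:
        return None
    target = match.group(1).strip()
    if urlparse(target).scheme:
        return target  # absolute URL: use as written
    return urljoin(base, target)  # relative URL: resolve against the PEML file
```

For example, with a hypothetical base of `https://example.org/exercises/palindromes/exercise.peml`, the value `url(some/local/input.txt)` resolves to `https://example.org/exercises/palindromes/some/local/input.txt`, while an ordinary value such as `racecar` yields `None`.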

Nested Structure

Beyond these basics, nested properties follow ArchieML's conventions for dotted keys (nested key structure), object blocks, and arrays. The main differences here compared to ArchieML are the use of multi-line values by default, the use of a HereDoc/triple-quote hybrid rather than a specific end marker with escaping of special characters when a delimiter is necessary, and support for comments.

Nested keys:
assets.test_format: stdin-stdout

JSON equivalent:
{
  "assets": {
    "test_format": "stdin-stdout"
  }
}

Array (list):
[assets.tests]
stdin: racecar
stdout: "racecar" is a palindrome.

stdin: Flintstone
stdout: "Flintstone" is not a palindrome.

stdin: url(some/local/input.txt)
stdout: url(some/local/output.txt)

JSON equivalent:
{
  "assets": {
    "tests": [
      {
        "stdin": "racecar",
        "stdout": "\"racecar\" is a palindrome."
      },
      {
        "stdin": "Flintstone",
        "stdout": "\"Flintstone\" is not a palindrome."
      },
      {
        "stdin": "url(some/local/input.txt)",
        "stdout": "url(some/local/output.txt)"
      }
    ]
  }
}

Further details about nested mappings and sequences (and how they are terminated) are available in the ArchieML definition.
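
The mapping from dotted keys and arrays to a nested data structure can be sketched as follows. The helper names `set_dotted` and `group_array_items` are hypothetical, and the array rule shown (a new object starts whenever the first key seen in the array repeats) is our reading of the ArchieML convention, so treat it as an assumption rather than a definitive statement of PEML's behavior.

```python
def set_dotted(target, dotted_key, value):
    """Store value in nested dicts, creating one level per dotted segment."""
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        target = target.setdefault(name, {})
    target[leaf] = value


def group_array_items(pairs):
    """Group (key, value) pairs into a list of objects.

    A new object starts each time the first key seen in the array repeats.
    """
    items = []
    first_key = None
    for key, value in pairs:
        if first_key is None:
            first_key = key
        if key == first_key:
            items.append({})
        set_dotted(items[-1], key, value)
    return items
```

Applied to the examples above, `set_dotted(exercise, "assets.test_format", "stdin-stdout")` followed by storing the grouped `stdin`/`stdout` pairs under `assets.tests` yields a single `assets` object containing both `test_format` and `tests`.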

Side by Side

The (very brief) example shown above can be directly represented in JSON (or YAML):

PEML:
exercise_id: edu.vt.cs.cs1114.sp2018.simple-PEML-example

# Single-line comments start with #
# Comments must be on lines by themselves

title: Palindromes (A Simple PEML Example)

topics: Strings, loops, conditions
prerequisites: variables, assignment, boolean operators

instructions:----------
Write a program that reads a single string (in the form of one line
of text) from its standard input, and determines whether the string is
a _palindrome_. A palindrome is a string that reads the same way
backward as it does forward, such as "racecar" or "madam". Your
program does not need to prompt for its input, and should only generate
one line of output, in the following
format:

<pre>
"racecar" is a palindrome.
</pre>

Or:

<pre>
"Flintstone" is not a palindrome.
</pre>
----------

assets.test_format: stdin-stdout

[systems]
language: java
version: >= 1.5

[assets.tests]
stdin: racecar
stdout: "racecar" is a palindrome.

stdin: Flintstone
stdout: "Flintstone" is not a palindrome.

stdin: url(some/local/input.txt)
stdout: url(some/local/output.txt)

stdin: url(http://my.school.edu/some/local/generator/input)
stdout: url(http://my.school.edu/some/local/generator/output)

JSON equivalent:
{
  "exercise_id": "edu.vt.cs.cs1114.sp2018.simple-PEML-example",

  "title": "Palindromes (A Simple PEML Example)",

  "topics": "Strings, loops, conditions",
  "prerequisites": "variables, assignment, boolean operators",

  "instructions": "Write a program that reads a single string (in the form of one line\nof text) from its standard input, and determines whether the string is\na _palindrome_. A palindrome is a string that reads the same way\nbackward as it does forward, such as \"racecar\" or \"madam\". Your\nprogram does not need to prompt for its input, and should only generate\none line of output, in the following\nformat:\n\n<pre>\n\"racecar\" is a palindrome.\n</pre>\n\nOr:\n\n<pre>\n\"Flintstone\" is not a palindrome.\n</pre>\n",

  "assets": {
    "test_format": "stdin-stdout",
    "tests": [
      {
        "stdin": "racecar",
        "stdout": "\"racecar\" is a palindrome."
      },
      {
        "stdin": "Flintstone",
        "stdout": "\"Flintstone\" is not a palindrome."
      },
      {
        "stdin": "url(some/local/input.txt)",
        "stdout": "url(some/local/output.txt)"
      },
      {
        "stdin": "url(http://my.school.edu/some/local/generator/input)",
        "stdout": "url(http://my.school.edu/some/local/generator/output)"
      }
    ]
  },
  "systems": [
    {
      "language": "java",
      "version": ">= 1.5"
    }
  ]
}

The Details