Content as Code

When Google created its Site Reliability Engineering team, it coined the phrase “SRE is what you get when you treat operations as if it’s a software problem.”

This article will look at content (written text, images, etc.) and treat it in the same light.

We are going to look at this from the viewpoint of Static Site Generation, i.e. doing all the calculations once on the server side, rather than pushing all the logic to the user's browser.

An example

Imagine the following code:

<h1>Heading</h1>
<h2>Heading 2</h2>
<h2>Heading 3</h2>
There are {count} headings on this page.

In this example, we could hardcode the count variable in the page, but every time we change the page we would also need to update the counter, otherwise the document would no longer be in sync.
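As a rough illustration of calculating the count at build time instead, here is a minimal sketch. The function name `fillHeadingCount` is made up for this example, and a regular expression stands in for proper parsing, which is a simplifying assumption rather than a recommendation.

```javascript
// Count heading tags in an HTML string and fill in the {count} placeholder.
// A regex is enough for this sketch; a real build step would walk a parsed tree.
function fillHeadingCount(html) {
  // Matches opening tags <h1> to <h6>, with or without attributes.
  const headings = html.match(/<h[1-6](\s[^>]*)?>/gi) || [];
  return html.replace("{count}", String(headings.length));
}

const page = [
  "<h1>Heading</h1>",
  "<h2>Heading 2</h2>",
  "<h2>Heading 3</h2>",
  "There are {count} headings on this page.",
].join("\n");

// The last line of the output reads: There are 3 headings on this page.
console.log(fillHeadingCount(page));
```

Because the count is derived from the content, adding or removing a heading never leaves the sentence out of date.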

Another example:

<h1>Heading</h1>
<h1>Heading</h1>

It is bad practice from an accessibility point of view to have more than one H1 tag on a page, but how do we check for this across large document collections?
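One possible answer is to treat the check like a lint rule that runs over every document. The sketch below assumes the collection is already loaded into an object keyed by file name; `findMultipleH1` and the file names are invented for illustration.

```javascript
// Return the names of documents that contain more than one <h1>.
function findMultipleH1(documents) {
  const offenders = [];
  for (const [name, html] of Object.entries(documents)) {
    // Count opening <h1> tags, with or without attributes.
    const h1Count = (html.match(/<h1(\s[^>]*)?>/gi) || []).length;
    if (h1Count > 1) offenders.push(name);
  }
  return offenders;
}

const collection = {
  "good.html": "<h1>Heading</h1><h2>Sub heading</h2>",
  "bad.html": "<h1>Heading</h1>\n<h1>Heading</h1>",
};

console.log(findMultipleH1(collection)); // [ 'bad.html' ]
```

A build could fail whenever the returned list is not empty, in the same way a failing unit test blocks a release.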

Content vs Templates

Formats like JSX and MDX mix the ideas of content and templates. Pure content should only be informational and not contain styling or markup. A templating step should then take this content and merge it with a template to target a particular output type (like HTML).

Separating the processes is possible with modern tools but requires some care and attention. With MDX for example, you can map the markdown constructs to custom components rather than using the 1-1 mapping of standard markdown.

Most tools, however, mix these two ideas, which makes reasoning about a document more difficult.

Premature optimisation

Tools like markdown are designed to remove the verbosity of XML-style documents, but they carry the inherent assumption that the content will be mapped to HTML, as all the built-in constructs have a 1-1 mapping with HTML. Most markdown compilers do not give you any control over the templating and instead just apply built-in templates for you.

There was a time when minimalist wrappers were all the rage, with tools like CoffeeScript positioned as a replacement for JavaScript.

What type again?

HTML is a language; a .html file is a string. This string needs to be parsed before we can do anything with it. If you want to programmatically create a .html file, you are making a string and storing it in a file.

One of the limitations of HTML is that all attributes are strings:

<MyComponent count="1" />

In the example above, count is a string, not a number.
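To make the consequence concrete, here is a small sketch that pulls the attribute value out of the tag string. The helper `readAttribute` is hypothetical and only handles double-quoted attributes.

```javascript
// Pull the raw value of a double-quoted attribute out of a tag string (sketch only).
function readAttribute(tag, name) {
  const match = tag.match(new RegExp(name + '="([^"]*)"'));
  return match ? match[1] : undefined;
}

const rawCount = readAttribute('<MyComponent count="1" />', "count");

console.log(typeof rawCount);      // string
console.log(rawCount + 1);         // 11 (string concatenation, not addition)
console.log(Number(rawCount) + 1); // 2 (only after an explicit conversion)
```

Every consumer of the attribute has to remember that conversion, which is easy to forget.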

Enter JSX

JSX stands for JavaScript XML. It allows us to write XML inside a JS document.

We can do this with JSX:

let myvar = <MyComponent count={1} />

The count attribute is now the number 1 (not a string).

A .jsx file is also a string. But an important thing to note is that a JSX file is a JavaScript file and needs to be treated as a JavaScript file.

JSX also needs a renderer to transform JSX into the target output type.
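As a sketch of what that renderer does, the example below uses plain function calls in place of JSX syntax, since a JSX compiler turns each tag into a call that builds an object. The names `h` and `renderToHtml` are hypothetical stand-ins and not the API of any particular library, and attribute escaping is left out for brevity.

```javascript
// Hypothetical stand-in for what a JSX compiler emits:
// <MyComponent count={1} /> becomes roughly h(MyComponent, { count: 1 }).
function h(type, props, ...children) {
  return { type, props: props || {}, children };
}

// Hypothetical renderer: turns the element objects into an HTML string.
function renderToHtml(node) {
  if (typeof node === "string" || typeof node === "number") return String(node);
  if (typeof node.type === "function") {
    // A component is a function from props to more elements.
    return renderToHtml(node.type({ ...node.props, children: node.children }));
  }
  const attrs = Object.entries(node.props)
    .map(([key, value]) => ` ${key}="${value}"`)
    .join("");
  const inner = node.children.map(renderToHtml).join("");
  return `<${node.type}${attrs}>${inner}</${node.type}>`;
}

// The component receives count as a number, so arithmetic works without parsing.
function MyComponent({ count }) {
  return h("p", { class: "counter" }, "Next value: ", count + 1);
}

const element = h(MyComponent, { count: 1 });

console.log(typeof element.props.count); // number
console.log(renderToHtml(element));      // <p class="counter">Next value: 2</p>
```

Swapping `renderToHtml` for a different function is what would let the same elements target another output type.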

Markdown

Markdown is a text format that makes writing HTML easier by removing the need for verbose tagging.

In this document when we mention markdown we are referring to the CommonMark specification for markdown (rather than the original looser spec).

MDX vs Markdoc

MDX is an extension to markdown that allows JSX components to be used.

Markdoc is a markdown extension written by Stripe to add more power to markdown (it powers their documentation).

One of the main differences between the two is philosophical. MDX can have code inside the content; Markdoc does not. This means a Markdoc page is pure content, whereas an MDX page has import statements and JS in it, which can look messy to some people.

In this article we’ll lean towards the Markdoc way of doing things (no code inside the content). We want to keep the content DRY, in the same way you would not copy and paste code just to change something small.

XML and XSLT

Storing content as XML keeps it DRY as there is no room for code in the documents. XSLT is a way of transforming XML documents to other formats (the templating engine for the purpose of this document).

This is much closer to what we want to achieve, but we want to be able to have complete control over the XSLT process.

Data exports

Lots of authoring tools support frontmatter. It serves a great purpose in content management for simple data like page title, publication date and author.

If we are treating content as code, we’ll need a way to programmatically access this data.

This also raises the interesting point of being able to reference one document from another. Say we have a piece of content that lists a table of sporting scores and we want to reference a particular match score in another document. We don’t want to copy and paste the score; instead we want to get the data from the document as if it were a KV store. In the rare case that the original document was wrong and gets updated, all referencing documents are updated too.
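A minimal sketch of that lookup might look like the following. The `{{path#key}}` reference syntax, the `documentStore` shape and the function `resolveReferences` are all invented for illustration; they are assumptions, not part of any existing tool.

```javascript
// Hypothetical content store: each document lists the data it exports.
const documentStore = {
  "scores/weekend": {
    exports: { finalScore: "2-1" },
    body: "Weekend results",
  },
  "reports/match": {
    exports: {},
    // {{path#key}} is an invented reference syntax for this sketch.
    body: "The match finished {{scores/weekend#finalScore}}.",
  },
};

// Replace every {{path#key}} reference with the value exported by that document.
function resolveReferences(store, path) {
  return store[path].body.replace(/\{\{([^#}]+)#([^}]+)\}\}/g, (whole, refPath, key) => {
    const target = store[refPath];
    if (!target || !(key in target.exports)) {
      throw new Error(`Unresolved reference: ${whole}`);
    }
    return String(target.exports[key]);
  });
}

console.log(resolveReferences(documentStore, "reports/match"));
// The match finished 2-1.
```

Throwing on an unresolved reference means a renamed or deleted key surfaces at build time instead of as a silent gap in the published page.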

When we think about tabular data, we immediately think of tables in HTML. We need to steer our mindset away from this and start to think about a format for storing the data as-is in the content. We can then convert the data structure to a table during the template phase, e.g.:

<!-- Not this -->
<table>
  <tr>
    <td>Score</td>
    <td>2-1</td>
    ...

// Do this
// JSON (or JSONC) is not code as it is not dynamic
// so does not break our rule of code in content

[
  {
    "sport": "",
    "score": "2-1"
  }
]

By default, all JSON can be exported as a property of the content so that other documents can reference the data. This lets us keep the data in the correct place in the document while still being able to reference it elsewhere. Metadata is one example: we want the data at the top of the page, but want to mark it as non-renderable.
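The conversion to a table during the template phase could be sketched as below. The function names `escapeHtml` and `renderTable` are hypothetical, and the sketch assumes every record is flat and shares the keys of the first one.

```javascript
// Escape the characters that would otherwise be read as markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn an array of flat records into an HTML table; columns come from the first record.
function renderTable(rows) {
  if (rows.length === 0) return "<table></table>";
  const columns = Object.keys(rows[0]);
  const head = columns.map((c) => `<th>${escapeHtml(c)}</th>`).join("");
  const body = rows
    .map((row) => "<tr>" + columns.map((c) => `<td>${escapeHtml(row[c])}</td>`).join("") + "</tr>")
    .join("");
  return `<table><tr>${head}</tr>${body}</table>`;
}

// The same data as in the content, parsed from its JSON form.
const scores = JSON.parse('[{ "sport": "", "score": "2-1" }]');

console.log(renderTable(scores));
// <table><tr><th>sport</th><th>score</th></tr><tr><td></td><td>2-1</td></tr></table>
```

The content never mentions rows or cells, so a different template could render the same records as a list or a chart.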

Thinking headless

In a headless CMS a user creates content in an interface, and then this gets exported as a JSON array of components for a templating engine to loop over and render. This is in contrast to the other tools mentioned here where components are the native break point.
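The loop a templating engine runs over such an export can be sketched in a few lines. The shape of `cmsExport` and the component names are assumptions made for this example, as each CMS defines its own format.

```javascript
// Hypothetical export from a headless CMS: an ordered array of components.
const cmsExport = [
  { component: "heading", level: 1, text: "Content as Code" },
  { component: "paragraph", text: "Content is data." },
];

// One template function per component type.
const templates = {
  heading: (item) => `<h${item.level}>${item.text}</h${item.level}>`,
  paragraph: (item) => `<p>${item.text}</p>`,
};

// Loop over the exported components and render each with its template.
function renderPage(items) {
  return items
    .map((item) => {
      const template = templates[item.component];
      if (!template) throw new Error(`No template for component: ${item.component}`);
      return template(item);
    })
    .join("\n");
}

console.log(renderPage(cmsExport));
// <h1>Content as Code</h1>
// <p>Content is data.</p>
```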

Abstract Syntax Trees

If we want to be able to treat content as code we need to convert the string (.jsx or .html) into a data structure.

UnifiedJS has done the heavy lifting of defining the structure of a few different types of ASTs for us, including mdast for markdown, hast for HTML and xast for XML.

Whilst HTML is the target output for the purposes of this document, the input and output formats could in principle be anything.

Next up…

For our pure content format, we need to load in a string (our file) and convert it into a UnifiedJS-style AST.

Once we have done that, we can write a transformer to convert it into an HTML AST (HAST), and the standard Unified framework can do the conversion into an HTML string file.

This way we get the power of our own custom content format, combined with the ecosystem of UnifiedJS.
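To give a feel for the transformer step, here is a hand-rolled sketch that does not use the Unified packages at all. The content node types ("document", "title", "para") are invented for this example; the output follows the hast node shapes (root, element with tagName, properties and children, and text with value). The `stringify` function is only a stand-in for the conversion that the Unified ecosystem would normally handle.

```javascript
// Hypothetical content AST: the node types used here are invented for this sketch.
const contentTree = {
  type: "document",
  children: [
    { type: "title", value: "Heading" },
    { type: "para", value: "Body text" },
  ],
};

// Build a hast-shaped element node that wraps a single text node.
function element(tagName, text) {
  return { type: "element", tagName, properties: {}, children: [{ type: "text", value: text }] };
}

// Transformer: map each content node onto a hast-shaped node.
function toHast(node) {
  switch (node.type) {
    case "document":
      return { type: "root", children: node.children.map(toHast) };
    case "title":
      return element("h1", node.value);
    case "para":
      return element("p", node.value);
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

// Minimal stand-in for the stringify step (no attributes or escaping).
function stringify(node) {
  if (node.type === "text") return node.value;
  const inner = node.children.map(stringify).join("");
  return node.type === "root" ? inner : `<${node.tagName}>${inner}</${node.tagName}>`;
}

const hastTree = toHast(contentTree);

console.log(stringify(hastTree)); // <h1>Heading</h1><p>Body text</p>
```

Once the content is a tree, checks such as counting headings or rejecting a second h1 become a walk over nodes instead of a search through a string.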