Compilers: An Alternative Approach to Drupal Migrations

There are two main paradigms for editing content in content management systems: the document editor, and the block editor. Document editors are similar to what you’d expect from traditional word processing software. The page is a singular entity, one continuous document, which is edited as a whole. The other approach, block editors, breaks up this continuous page into individual blocks that can be independently edited and moved around.

One of the benefits that block editors provide over document editors is the introduction of structure through semantic meaning. For instance, a contact card would be added with a Contact block, which presents a form to the user that has fields for first name, surname, email, and so on. In a document editor, however, a user might try to add this by creating a table and filling in these fields themselves. Because blocks have a defined meaning (and a specific schema), the CMS can make different presentational choices about how the information is rendered. For instance, it might render all contact cards with a photo looked up from the email address listed. This wouldn’t be possible with the document approach, as it’s all just a heap of text.

Before block editors really took off with the launch of WordPress’ Gutenberg editor (among others), structured content (like contact cards) was shoe-horned into content using magic HTML comments and a plethora of plugins. It isn’t a robust solution, that’s for sure, but it works. As web pages veer further and further away from “just a load of text”, the flexibility, extensibility, and robustness of block editors are becoming increasingly desirable.

I was recently involved in a web migration project with the goal of moving content from Drupal 7 to Drupal 10 before the old version reaches end-of-life. The kicker? Drupal 7 thinks in terms of documents, Drupal 10 in blocks.

The approach that had been settled upon was moving the long page markup from Drupal 7 and dropping all of it into one paragraph type that supports styling. Effectively, this turned D10’s block editor into a document editor – something it was not designed for. This doubled the workload for the development team, as not only did components (such as contact cards) have to be implemented in their new block format, but also in a format capable of being dropped into the document editor. Additionally, if editors wanted to make changes to a page after migration, they would have to delete the paragraph block and rebuild the page’s structure from the correct block elements. Not ideal.

Compilers?

Let’s distract ourselves for a moment with a summary of how applications run on your computer. It’s relevant, I promise!

The CPU inside your computer only understands a limited set of instructions (an instruction set). You can tell it to add some numbers together, move things around in memory, and a number of other things, by writing out a list of instructions from the instruction set in a special format (called assembly code). While it is possible to write programs directly in assembly, it is very verbose, highly technical, and wildly unreadable.

Therefore, when people write applications, they generally use programming languages instead. These let you write terse yet expressive code, which is easy to understand because it abstracts away the nitty-gritty of the underlying instruction set. For instance, say you want to write a function or_nothing that returns the number you give it if a condition is true, and 0 otherwise. In Rust, a programming language, you can write that function in a few lines of readable code. I’ve included the assembly code that Rust produces below it (movb, andb, and so on are the instructions that the processor understands).

fn or_nothing(x: i32, condition: bool) -> i32 {
    if condition {
        x
    } else {
        0
    }
}

or_nothing:
  movb  %sil, %al
  movb  %dil, %cl
  movb  %cl, -4(%rsp)
  movb  %cl, -2(%rsp)
  movb  %al, %cl
  andb  $1, %cl
  movb  %cl, -1(%rsp)
  testb  $1, %al
  jne    .LBB0_2
  movb  $0, -3(%rsp)
  jmp    .LBB0_3

.LBB0_2:
  movb  -4(%rsp), %al
  movb  %al, -3(%rsp)

.LBB0_3:
  movb  -3(%rsp), %al
  retq

The process of turning the code you write into something that can actually be executed by your computer is called “compilation”.

The first step is to parse the text that you’ve written into a data structure that can be processed more easily by the compiler. Usually, this data structure is called an abstract syntax tree (AST). Most compilers don’t care if you use tabs or spaces, or if you put your curly braces on the same line as the function name or not – they just care about the intent of the code, not the formatting of the code.

The second step is to apply a whole host of transformations to this AST, such as removing code that nobody uses, or reordering instructions to make them faster.

The third step is code generation, where the final assembly is produced that your computer can actually run.

Because each step is separate, you can inspect the output of each step to make sure it’s doing what you expect. If you’re writing a compiler, you can write tests for each step to make sure it’s working correctly. And, if you’re debugging a problem, you can see where in the process things are going wrong.
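As a rough illustration of those three steps, here’s a minimal sketch in Rust for a toy language that only knows numbers, variables, and addition. Everything in it (the Expr type, the parse, fold_constants, and generate functions, and the push/load/add instructions for an imaginary stack machine) is made up for this example, and a real compiler is far more involved.

```rust
/// The AST for a toy language: numbers, variables, and addition.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Step 1: parse source text such as "2 + 3 + x" into an AST.
/// Whitespace is trimmed away, so formatting does not affect the result.
fn parse(source: &str) -> Result<Expr, String> {
    let mut terms = source.split('+').map(|raw| {
        let token = raw.trim();
        if token.is_empty() {
            Err(String::from("expected a number or variable"))
        } else if let Ok(n) = token.parse::<i64>() {
            Ok(Expr::Num(n))
        } else if token.chars().all(|c| c.is_ascii_alphabetic()) {
            Ok(Expr::Var(token.to_string()))
        } else {
            Err(format!("unexpected token: {token}"))
        }
    });
    // `split` always yields at least one item, so `next()` is never `None`.
    let mut expr = terms.next().unwrap()?;
    for term in terms {
        expr = Expr::Add(Box::new(expr), Box::new(term?));
    }
    Ok(expr)
}

/// Step 2: a transformation pass that folds constant additions,
/// so `2 + 3` becomes `5` before any code is generated.
fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Add(lhs, rhs) => match (fold_constants(*lhs), fold_constants(*rhs)) {
            (Expr::Num(a), Expr::Num(b)) => Expr::Num(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}

/// Step 3: generate instructions for an imaginary stack machine.
fn generate(expr: &Expr, out: &mut Vec<String>) {
    match expr {
        Expr::Num(n) => out.push(format!("push {n}")),
        Expr::Var(name) => out.push(format!("load {name}")),
        Expr::Add(lhs, rhs) => {
            generate(lhs, out);
            generate(rhs, out);
            out.push(String::from("add"));
        }
    }
}
```

Because each stage is an ordinary function, you can call parse on its own and print the tree, or hand fold_constants a hand-built Expr in a unit test, without ever running the code generator.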

Programming languages fall into two broad categories. Some are interpreted or compiled just-in-time (JIT), meaning that the source code is read and translated while your program is running. This is (basically) how web browsers work – the code for webpages is downloaded as text and executed on the fly.

The alternative is to compile ahead-of-time (AOT), meaning that you download a compact list of machine instructions and the whole thing is given to your computer to execute at once (this is what a .exe download is – a pre-compiled program). Because the whole application is compiled in one fell swoop, the compiler has full knowledge of everything in the program, meaning it can detect many errors at compile time rather than at runtime.

Compiling Migrations

Congratulations on making it through the dense theory section of this write-up!

Migrating content from a document editor to a block editor is an exercise in transforming the markup of a webpage into a discrete series of blocks. Which is, if you squint a little, what a compiler does to source code.

1. Parse the page markup into an abstract syntax tree (AST)
2. Apply transformations to the AST (such as collecting referenced assets)
3. Generate a set of instructions for creating the right blocks
4. Run the generated instructions in the target CMS to create the blocks
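To give a feel for steps one and two, here’s a minimal sketch in Rust of what a page AST and a couple of transformation passes might look like. The Page and Block types and the collect_asset_ids and normalise_emails functions are hypothetical names invented for this example, and the set of block types will depend entirely on your own content.

```rust
use std::collections::BTreeSet;

/// Hypothetical AST for a migrated page: a title plus an ordered list of blocks.
#[derive(Debug, Clone, PartialEq)]
struct Page {
    title: String,
    blocks: Vec<Block>,
}

/// Hypothetical block types; a real project would have one variant per component.
#[derive(Debug, Clone, PartialEq)]
enum Block {
    Contact { name: String, email: String },
    Image { asset_id: u64 },
    Text { html: String },
}

/// A transformation pass that collects the ID of every asset the page references.
/// A set is used so that an asset referenced twice is only listed once.
fn collect_asset_ids(page: &Page) -> BTreeSet<u64> {
    page.blocks
        .iter()
        .filter_map(|block| match block {
            Block::Image { asset_id } => Some(*asset_id),
            _ => None,
        })
        .collect()
}

/// Another pass: strip a leading "mailto:" from contact emails,
/// leaving every other block untouched.
fn normalise_emails(mut page: Page) -> Page {
    for block in &mut page.blocks {
        if let Block::Contact { email, .. } = block {
            let stripped = email.strip_prefix("mailto:").map(str::to_string);
            if let Some(address) = stripped {
                *email = address;
            }
        }
    }
    page
}
```

Each pass takes the tree in and hands a result back, so you can print or test the output of one pass before moving on to the next.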

Traditional migration workflows, such as those using Drupal’s dedicated migration tooling, usually perform all these steps on the fly on a page-by-page basis. In this sense, they can be thought of as just-in-time interpreters. The main problem with this approach is that it isn’t easily debuggable: because the four steps aren’t separated, you can’t inspect the results of each stage individually to sense-check them, especially as all four take place on the server that the CMS is running on.

So here’s the big idea: run steps one to three of the migration ahead-of-time, just like building an AOT compiled program. The parsing, transformation, and code generation can happen on your local computer – only the generated output needs to be uploaded to the CMS. This allows you to inspect the steps of the process and deterministically re-run migrations because the instructions are spelled out explicitly in code. You can run a migration locally and know you’ll get exactly the same results in production.

In this model, we know what the input is (page markup) and the processor that will run the resulting code (the CMS). But what is the code that the CMS should run? What output format can we upload to a CMS as a list of instructions for re-creating content? Well, code is a list of instructions, and Drupal 10 lets us programmatically create content through PHP code. Let’s update the process:

1. Parse the page into an AST
2. Apply transformations to the AST
3. Generate PHP for creating content
4. Upload the PHP script and run it on the CMS

Because the PHP script is just a regular old text file, you can open it up on your computer and step through it line by line. Load it into a code editor and you’ll be warned about errors before you even touch your CMS. Upload the script to Teams and ping it over to a colleague – they’ll be able to run exactly the same migration.

<h1>Page title</h1>
<div class="contact">
  <p class="contact__name">Nick Bush</p>
  <p class="contact__email">mailto:nick@nickdbush.com</p>
</div>
<p>[scald=523761:sdl_editor_representation]</p>

{
  "type": "page",
  "title": "Page title",
  "blocks": [
    {
      "type": "contact",
      "name": "Nick Bush",
      "email": "nick@nickdbush.com"
    },
    {
      "type": "image",
      "id": 523761
    }
  ]
}

<?php
use Drupal\node\Entity\Node;
use Drupal\file\Entity\File;

// Create the page itself.
$node = Node::create([
  'type' => 'page',
  'title' => 'Page title',
]);
$node->save();

// Create the contact block and attach it to the page.
$contact = Node::create([
  'type' => 'contact',
  'title' => 'Nick Bush',
  'field_email' => 'nick@nickdbush.com',
]);
$contact->save();
$node->field_blocks[] = $contact;

// Create the image block and attach it to the page.
$image = Node::create([
  'type' => 'image',
  'field_image' => [
    'target_id' => 523761,
  ],
]);
$image->save();
$node->field_blocks[] = $image;

// Save the page again now that its blocks are attached.
$node->save();

The way you parse the page markup and generate the PHP code is entirely up to you and the needs of your project. Unfortunately, Drupal’s block APIs aren’t the best documented, so it can involve a bit of fumbling in the dark until you hit upon the right method. I found generating type definitions from config YAML files to be pretty helpful, but your mileage may vary.

I’ve found that allocating variables with auto-incrementing names (like $v0, $v1, etc.) is helpful for avoiding naming conflicts in the generated code. You also might want to split the generated PHP script into multiple files, as the generated code can be a lot for the PHP interpreter to handle in one go (especially if your pages are full of text).
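Here’s a minimal sketch in Rust of how those two tips might look in a code generator. The Block type, VarAllocator, php_string, generate_page, and split_into_files are all hypothetical names invented for this example, and the emitted PHP simply mirrors the shape of the example script above (including its field names), so treat it as a starting point rather than a finished tool.

```rust
/// Hypothetical block types, mirroring the example page above.
enum Block {
    Contact { name: String, email: String },
    Image { asset_id: u64 },
}

/// Hands out fresh PHP variable names: $v0, $v1, $v2, ...
struct VarAllocator {
    next: usize,
}

impl VarAllocator {
    fn new() -> Self {
        VarAllocator { next: 0 }
    }

    fn fresh(&mut self) -> String {
        let name = format!("$v{}", self.next);
        self.next += 1;
        name
    }
}

/// Quote a value as a single-quoted PHP string literal, where only
/// the backslash and the single quote need escaping.
fn php_string(value: &str) -> String {
    format!("'{}'", value.replace('\\', "\\\\").replace('\'', "\\'"))
}

/// Emit PHP that creates one page node and attaches each block to it.
fn generate_page(title: &str, blocks: &[Block], vars: &mut VarAllocator) -> String {
    let page = vars.fresh();
    let mut php = format!(
        "{page} = Node::create(['type' => 'page', 'title' => {}]);\n",
        php_string(title)
    );
    for block in blocks {
        let var = vars.fresh();
        let fields = match block {
            Block::Contact { name, email } => format!(
                "'type' => 'contact', 'title' => {}, 'field_email' => {}",
                php_string(name),
                php_string(email)
            ),
            Block::Image { asset_id } => {
                format!("'type' => 'image', 'field_image' => ['target_id' => {asset_id}]")
            }
        };
        php.push_str(&format!("{var} = Node::create([{fields}]);\n"));
        php.push_str(&format!("{var}->save();\n"));
        php.push_str(&format!("{page}->field_blocks[] = {var};\n"));
    }
    php.push_str(&format!("{page}->save();\n"));
    php
}

/// Group per-page snippets into PHP files of at most `pages_per_file` pages each.
fn split_into_files(pages: &[String], pages_per_file: usize) -> Vec<String> {
    pages
        .chunks(pages_per_file.max(1))
        .map(|chunk| format!("<?php\nuse Drupal\\node\\Entity\\Node;\n\n{}", chunk.join("\n")))
        .collect()
}
```

Passing the same VarAllocator to every generate_page call means no two pages in a file ever share a variable name, and escaping every string through one helper keeps a stray apostrophe in a title from breaking the generated script.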

If you’re interested in this approach, or have any questions, feel free to reach out. I’d love to hear about your experiences with Drupal migrations, or if you’ve tried something similar to this before!