Issue 34

Is your structured content the same as mine? Yeah, nah. Nah, yeah.

More and more people are talking about structured content, which—to be frank—I did not see coming 5 years ago. I’ve been talking about it for a long time because ~~I’m a nerd~~ I think it’s crucial to making content portable and flexible, so it’s exciting to see it referenced with increasing regularity.

But when people talk about structured content, they aren’t always talking about the same thing. Let’s try to understand some of the differences and similarities—all while I think out loud.

Some may think that structured content is simply content that was written over the framework provided by an outline, like we did back in school. Heading levels mark out the big pieces of the outline in a document like a Google Doc or HTML or a simple text format called Markdown.

Yes, there is structure here, but I think of this as pre-structured content. It’s not really what we mean when we talk about structured content. Pre-structured content organizes what’s in a document, but it doesn’t make content modular or reusable in systematic ways.

I see two main paradigms for structured content:

Assembly
Retrieval

Assembly

The technical writing world pioneered ways to structure content that enable publishing multiple variants of content to multiple outputs from a single source.

Traditionally, these approaches are build systems that assemble (or compile) outputs or artifacts that were defined explicitly up front. Delivery follows a “push” model.

A content person in an assembly paradigm asks “How do I build this output?”

One particular approach has become an official standard and is known as the Darwin Information Typing Architecture (or DITA, for short). While originally created for writing technical content like user guides, DITA can be expanded and customized for other valuable content.

In DITA, the “logic” of the system lives in the publishing layer with mechanisms like conrefs, keyrefs, and maps. Relationships between content are declared manually or inherited via parent-child file structures.
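To make the “build” idea concrete, here is a minimal sketch of an assembly step. It is not DITA’s actual processing model: the `{ref:...}` placeholder syntax, the `SNIPPETS`, `TOPICS`, and `MAPS` names, and the sample content are all hypothetical, chosen only to show a reusable snippet being pulled into outputs that a map declares up front.

```python
import re

# Hypothetical reusable snippets, looked up by ID (loosely like a conref target).
SNIPPETS = {
    "warn-power": "Unplug the device before you begin.",
}

# Hypothetical topics; {ref:ID} marks a spot where a snippet is reused.
TOPICS = {
    "install": "Installing the widget. {ref:warn-power}",
    "repair": "Repairing the widget. {ref:warn-power}",
    "specs": "The widget weighs 2 kg.",
}

# Maps declare, up front, which topics go into which output.
MAPS = {
    "user-guide": ["install", "specs"],
    "service-manual": ["install", "repair", "specs"],
}


def resolve_refs(text):
    """Replace each {ref:ID} placeholder with the snippet it points to."""
    return re.sub(r"\{ref:([\w-]+)\}", lambda m: SNIPPETS[m.group(1)], text)


def build(map_name):
    """Assemble one output by resolving every topic listed in the map."""
    return "\n".join(resolve_refs(TOPICS[topic_id]) for topic_id in MAPS[map_name])


if __name__ == "__main__":
    for name in MAPS:
        print(f"== {name} ==")
        print(build(name))
```

The warning text is written once and lands in both outputs, and nothing gets published that a map did not ask for. That is the “push” model in miniature.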

DITA uses XML as its markup—the way content is encoded for the system—and content is stored as flat text files, either in source control (such as Git) or in a Component Content Management System (CCMS). The use of XML gives very granular control over the application of semantically meaningful elements in an artifact.

Another assembly-style technical writing methodology is “docs-as-code.” In the docs-as-code approach, the content lives alongside code in repositories (often as Markdown files) and is sometimes embedded in code comments. It’s compiled by static site generators into a documentation website. This makes docs-as-code very friendly to developers who create and consume a lot of documentation while writing software code.

Docs-as-code typically is much less structured than DITA, but I do see it fitting in the assembly paradigm.

You model content to build something specific.

Retrieval

The other main paradigm of structured content is the retrieval paradigm, where content lives in a database (a headless CMS) and is retrieved through API queries.

Sometimes you’ll hear the phrase “content-as-data” used to describe the retrieval paradigm. Pieces of content have metadata attached to them, allowing them to be processed like data in a database.

Instead of being a build system, the retrieval paradigm is a query system. Delivery is a “pull” model.
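A small sketch of what “query system” can mean in practice. This uses an in-memory list rather than a real headless CMS API, and the `ContentItem` fields, the `query` function, and the sample items are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """A hypothetical content object: the content plus metadata about it."""
    id: str
    kind: str
    title: str
    tags: set = field(default_factory=set)


# Stand-in for the CMS database.
STORE = [
    ContentItem("a1", "article", "Choosing a gravel bike", {"bikes", "gravel"}),
    ContentItem("f1", "faq", "How often should I service my bike?", {"bikes", "maintenance"}),
    ContentItem("a2", "article", "Packing for a day hike", {"hiking"}),
]


def query(store, kind=None, tag=None):
    """Return items matching the given metadata; None means no filter on that field."""
    return [
        item
        for item in store
        if (kind is None or item.kind == kind) and (tag is None or tag in item.tags)
    ]


if __name__ == "__main__":
    for item in query(STORE, kind="article", tag="bikes"):
        print(item.id, item.title)
```

Nothing here decides ahead of time what the output looks like. The consumer pulls whatever matches, whenever it needs it.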

A content worker asks “How do I make this content available for use in experiences?”

In a headless retrieval system, the “logic” lives in the application or orchestration layer. Relationships are modeled explicitly between content structures, rather than happening magically.

Many headless CMSes use JSON to encode their content. JSON is popular with developers, though it has some structural differences that make it less friendly for translation than XML, which has long been a standard exchange format in translation workflows and tooling. JSON-based content is stored in the CMS as objects.
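Here is a rough sketch of JSON content objects with an explicit relationship between them, resolved by application code rather than by the store. The `{"ref": ...}` convention, the field names, and the sample entries are assumptions for illustration. Real headless CMSes each have their own reference format.

```python
import json

# Hypothetical API response: two content objects, one pointing at the other.
RAW = """
[
  {"id": "author-1", "type": "author", "name": "Sam Example"},
  {"id": "post-1", "type": "post", "title": "Hello", "author": {"ref": "author-1"}}
]
"""


def load_entries(raw):
    """Parse the JSON and index the entries by their id."""
    return {entry["id"]: entry for entry in json.loads(raw)}


def resolve(entry, entries):
    """Return a copy of the entry with one level of {"ref": id} fields expanded."""
    resolved = {}
    for key, value in entry.items():
        if isinstance(value, dict) and "ref" in value:
            resolved[key] = entries[value["ref"]]
        else:
            resolved[key] = value
    return resolved


if __name__ == "__main__":
    entries = load_entries(RAW)
    post = resolve(entries["post-1"], entries)
    print(post["title"], "by", post["author"]["name"])
```

The relationship is declared in the content model (the `author` field), but following it is the application’s job.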

You model content so it can be used in ways you haven’t fully defined yet.

It’s not a contest, but let’s compare

The purpose of this article is not to weigh pros and cons of DITA versus a headless CMS, though I’ve called out a few details above and I want to look at them in the context of where we are today and where content is going.

First, both assembly and retrieval paradigms can be semantic, but in different ways. By semantics, I mean that meaning is encoded with the content. It’s not just raw letters forming words.

Image: My whiteboard planning this issue

DITA, as noted earlier, allows a very granular level of semantics. You can, for instance, encode specific steps in a task as task steps. Or you can identify UI text within that task as UI text. Semantics in DITA are pretty tactical.
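To show what that granularity buys you, here is a sketch that reads a simplified DITA-style task with Python’s standard XML parser. The element names (`task`, `steps`, `step`, `cmd`, `uicontrol`) come from DITA, but the fragment is a made-up example that has not been validated against the DITA DTDs, and the helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# A simplified, illustrative DITA-style task (not validated against the DITA DTDs).
TASK_XML = """
<task id="change-password">
  <title>Change your password</title>
  <taskbody>
    <steps>
      <step><cmd>Open <uicontrol>Settings</uicontrol>.</cmd></step>
      <step><cmd>Select <uicontrol>Security</uicontrol>, then <uicontrol>Change password</uicontrol>.</cmd></step>
    </steps>
  </taskbody>
</task>
""".strip()


def step_texts(root):
    """Return each step's command as plain text, with inline markup flattened."""
    return ["".join(cmd.itertext()) for cmd in root.iter("cmd")]


def ui_labels(root):
    """Return every piece of text that was marked up as UI text."""
    return [el.text for el in root.iter("uicontrol")]


if __name__ == "__main__":
    root = ET.fromstring(TASK_XML)
    print(step_texts(root))
    print(ui_labels(root))
```

Because the UI labels are tagged as UI labels, a script can pull all of them out (say, to check them against the product’s actual strings) without guessing from formatting.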

As you move to retrieval systems, semantics gets a little more strategic and sits at a higher, object-based level. Generally, you don’t get granularity as fine as what DITA allows without some customization and some careful modeling.

But neither has semantic encoding at the highest level, placing all the pieces in meaningful relationships to each other and to content and data in other systems. To do that, semantic enrichment and orchestration are done by systems parallel to or arching over the CCMS or the CMS.

As large language models (LLMs) see more use, and as their accuracy remains suspect and the tools hallucinate, Retrieval-Augmented Generation (RAG) has become a buzzword. RAG is a framework that retrieves relevant information from an index at question time and hands it to the LLM, so the LLM can give better-grounded responses.
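A toy sketch of the retrieve-then-prompt pattern behind RAG. Real systems typically rank chunks with embeddings and a vector index. Here, plain word overlap stands in for that as a deliberate simplification, and no LLM is actually called. The chunk texts, function names, and prompt wording are all hypothetical.

```python
import re

# Hypothetical content chunks that have already been split and indexed.
CHUNKS = [
    "The Diverge is a gravel bike built for mixed terrain.",
    "Reset your password from the Security page in Settings.",
    "Gravel tires should be checked for wear every few months.",
]


def tokens(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, chunks, k=2):
    """Rank chunks by words shared with the question (toy stand-in for vector similarity)."""
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Put the top-ranked chunks in front of the question for the LLM to use."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("How do I reset my password?", CHUNKS, k=1))
```

How the content was chunked and labeled upstream decides what `retrieve` can find, which is where structured content earns its keep.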

Both the retrieval paradigm and the assembly paradigm can feed RAG. But RAG isn’t perfect, and there are probably better ways to manage content than what we’re doing today.

Looking to a graph-native future

If assembly is about building outputs and retrieval is about querying content, a graph-native system would be about modeling meaning itself.

Knowledge graphs feature objects—let’s call them content objects, similar to the objects in a relational database-based CMS—and explicit relationships between the objects.

For instance, you might have a person object (John) and a bike object (Specialized Diverge) connected by a relationship—a verb—of rides, giving you what’s called a triple. In this case, John rides a Specialized Diverge. And we could branch out to objects for terrain and skill level and so on.
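As a minimal sketch, triples can be modeled as plain (subject, predicate, object) tuples. The `rides` triple comes from the example above. The `suited_for` and `has_skill_level` predicates and their values are invented for illustration, as is the `match` helper.

```python
# Each triple is (subject, predicate, object).
TRIPLES = {
    ("John", "rides", "Specialized Diverge"),
    ("Specialized Diverge", "suited_for", "gravel"),   # hypothetical terrain fact
    ("John", "has_skill_level", "intermediate"),       # hypothetical skill fact
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


if __name__ == "__main__":
    print(match(TRIPLES, subject="John"))          # everything we know about John
    print(match(TRIPLES, predicate="suited_for"))  # every terrain relationship
```

The same small pattern answers “what do we know about John?” and “what is suited for what?” without either question having been planned for.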

Earlier I mentioned tools that do semantic enrichment. There are tools that help define those triples on top of DITA or headless CMS content. There aren’t really any content-specific tools built on the knowledge graph triples model. If there are, shoot me an email. I want to hear about it!

In my mind, there’s a future where there’s a graph-based content/knowledge management system. Delivery is still a “pull” model, but the “logic” is inherent because it’s defined in the graph itself. Relationships between objects are modeled and inferred through the graph.
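One way to picture relationships being “inferred through the graph” is a rule that derives a new fact from two existing ones. This is only a sketch: the `suited_for` and `rides_on` predicates and the rule itself are hypothetical, and a real graph system would express this in its own rule or ontology language rather than in a Python loop.

```python
# Facts stored explicitly in the graph, as (subject, predicate, object).
FACTS = {
    ("John", "rides", "Specialized Diverge"),
    ("Specialized Diverge", "suited_for", "gravel"),  # hypothetical fact
}


def infer_rides_on(facts):
    """Hypothetical rule: if X rides B and B is suited_for T, then X rides_on T."""
    inferred = set()
    for person, first_predicate, bike in facts:
        if first_predicate != "rides":
            continue
        for subject, second_predicate, terrain in facts:
            if second_predicate == "suited_for" and subject == bike:
                inferred.add((person, "rides_on", terrain))
    return inferred


if __name__ == "__main__":
    print(infer_rides_on(FACTS))
```

Nobody authored “John rides on gravel.” It falls out of the relationships that were modeled, which is the sense in which the logic lives in the graph.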

The typical standards behind the “bolt-on” knowledge graph tools are RDF, SHACL, and OWL. Presumably those define the relationships, but there’s probably still a need for a markup for the objects themselves. And, of course, storage is objects.

In a graph-native future, the content person will be asking “Am I properly shaping this interconnected meaning?”

Maybe this is a bit hand-wavy. Maybe it’s too technical for you. That’s fine. That’s me in the corner—me in the newsletter spotlight. Thinking my nerdy thoughts.

One final headscratcher

Remember back at the beginning? I talked about pre-structured content and mentioned Markdown files as an example. Markdown is a lightweight wiki-like format that you’ve probably used without even knowing it.

If you’ve ever made something **bold** by using asterisks or set heading levels with # Heading 1 or ## Heading 2, you’ve used Markdown. It’s really easy to use and those heading levels bring in the outline-level of structure. But there are few tools that enforce much more structure than that within a Markdown (.md) file.
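To see how little structure that is, here is a sketch that pulls the outline out of a Markdown file with a single regular expression. It only handles `#`-style headings and would be fooled by `#` lines inside code blocks, so treat it as an illustration rather than a Markdown parser. The sample document and the `outline` function are made up.

```python
import re

# A made-up Markdown document.
DOC = """# Structured content
Intro paragraph.
## Assembly
Some **bold** text.
## Retrieval
### Headless CMS
"""

# One to six # characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def outline(markdown):
    """Return (level, title) for each #-style heading line."""
    result = []
    for line in markdown.splitlines():
        found = HEADING.match(line)
        if found:
            result.append((len(found.group(1)), found.group(2)))
    return result


if __name__ == "__main__":
    for level, title in outline(DOC):
        print("  " * (level - 1) + title)
```

The outline is about all you can reliably recover. Nothing in the file says whether a section is a task, a concept, or a warning.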

You have this basic, easy-to-process, semi-structured file type. And it just so happens that LLMs love to ingest Markdown files and generate Markdown responses.

It’s a strange kind of full circle moment: the most advanced systems we’ve built are now consuming one of the simplest content formats we have.

Some resources

There’s a lot of deep stuff to understand just in what I’ve written, but there’s way more beyond this. It’s stuff I want to get more understanding of. If you’re in the same boat, here are some trusted sources:

Michael Iantosca: A Graph-based Universal Component Content Management System (I read Michael with some regularity, but I hadn’t seen his article when I wrote mine.)
Michael Andrews: Story Needle blog
Scott Abel: AI May Not Need XML in the Prompt Window, But It Still Needs Structured Content (Came out while my article was in draft.)
Lance Cummings: Do Prompts Really Need Markup?
Jessica Talisman: Intentional Arrangement Substack

A call to action

Let me know if I’m off-track on anything here, if you have a different perspective, questions you’d like me to track down, or ideas that this sparks for you. Email me—I read every message!

“The future of structured content lies in the fusion of structured content, structured knowledge, graph-based architecture, and AI-driven authoring.”

A Graph-based Universal Component Content Management System
by Michael Iantosca

Top of mind

I’ve noticed that we all have blind spots where we don’t realize there are unknown unknowns. I’ve seen this happen with CMS implementations and migrations, where a blind spot can lead to a mistake that costs six figures.

With decades in the content industry and working on cross-functional teams, I’ve learned some of the signals and patterns to watch for these costly blind spots. And today I’m excited to share that I’ve created a CMS readiness assessment tool to help you identify those blind spots on your next content project—and give you recommendations for shoring up any gaps the assessment finds.

There’s a shortened free version, and it’s available today at www.collinscontent.com/cms-assessment. Check it out!

John Collins

Thanks for reading!

Welcome to the 11 new subscribers who joined us since the last issue of Model Thinking.