Apr 28, 2023

Single Source Publishing and HTML for Publishers

Exploring HTML as the ideal "single source" format for publishing.

SSP Explained

Definition: Single Source Publishing (SSP) is an approach used by technical publishing systems that focuses on using one source file, shared across content creation and production stages.

In the world of publishing, content creation and production are often disconnected processes. Content creation happens in isolation from the production phases, and the technical systems and file formats used in each stage are often completely separate.

Single Source Publishing (SSP) utilizes a single source file throughout the content creation and production phases.

Fragmented Publishing Processes

In numerous publishing environments, authors, copy editors, and proofreaders use a single tool, often Microsoft Word, to create and refine content. This content then undergoes production, where it is converted into various formats such as HTML for the web, PDF, XML, and ebook formats, either programmatically using software or manually by individuals using applications like InDesign.

This separation between content creation and production creates a disconnect in the people, tools, and working files involved in the process. A simple example is a content change that is required after the content has entered the production stage. This necessitates a multi-step process in which content creators and production staff communicate, each make changes using their respective tools, and then check the results. This back-and-forth is not only time-consuming and expensive; it also involves numerous conversions, which produce multiple versions and can introduce errors along the way.

We could call this type of process a "Fragmented Publishing Process" (FPP). This term emphasizes the disjointed nature of the process, as it requires "jumping a gap" between different teams, tools, and formats from the content creation to the production stages.

Single Source Publishing

Single Source Publishing (SSP) tackles the main challenge of FPP by enabling both content creation and production stages to utilize the same file format. While shifts in teams and tools may still occur during each stage, SSP ensures that both stages remain linked through a shared "single source," promoting smoother transitions between them.

In a Single Source Environment, for instance, when content alterations are needed during the production phase, the content creation team can make those modifications using the same source file utilized by the production team. This ensures smooth integration into the production process, as both teams are working on the exact same files.

SSP seeks to simplify the publishing process, leading to considerable time and cost efficiencies. There is also a reduced likelihood of introducing new errors, as SSP eliminates the need for multiple conversions and manual interventions between content creation and production stages.

Concurrency - The Real Benefit of SSP

However, the advantages of single source publishing (SSP) extend beyond merely facilitating a smoother document exchange between content creators and production teams. In an SSP environment, the real value lies in concurrent workflows.

Concurrency, in general terms, refers to the execution of multiple tasks or processes at the same time. In the context of publishing workflows, it is the ability for authors, editors, illustrators, and designers to all work on the project at the same time. For example, while an author is in the process of writing the main content, an illustrator could be designing illustrations or graphics and placing them in the text. Simultaneously, a copy editor might be reviewing and editing completed sections of the text, and a designer could be working on the layout and formatting of the book or digital publication.

There are two primary advantages of concurrency:

Condensed product completion time: Concurrent workflows allow tasks to be completed in parallel, speeding up the overall process.
Enhanced collaboration: Concurrent workflows cultivate a collaborative environment with real-time interaction, fostering open communication and faster decision-making by addressing questions and feedback promptly.

Concurrency, stemming from a well-crafted SSP system, is vital in elevating the efficiency of publishing workflows. It could be argued that concurrency represents the true value of SSP.

Concurrency and Real-Time Editing

One crucial feature must be present for concurrency to reach its full potential in terms of efficiency: real-time editing. Real-time editing allows multiple people to edit a document simultaneously and see each other's changes in real time, similar to tools like Google Docs or design applications such as Plasmic and Figma.

To achieve genuinely concurrent workflows, real-time editing must be enabled, allowing all team members to modify the file and perform their tasks at the same time. Without this feature, we would need to rely on locking files while one team member works on them, which may be preferable in certain situations (for example, a copy editor may not want an author to add new content to the same chapter they are editing). However, in most cases, real-time editing significantly enhances the benefits of concurrency.

Simple SSP

The simplest, and most common, way to implement SSP is to require that everyone involved in the publishing process use the same tools. Since production tools can modify content but authoring tools (e.g., MS Word) generally cannot adjust layout, style, or design elements, content creators are often required to use production tools. Some effort has been made to improve these tools for content creators, but the strategy has had limited success.

We'll start by exploring this strategy in a common scenario – using an "XML-first" environment in which all parties involved in the publishing process work directly with XML files and tools more commonly used by production staff - XML editors. As highlighted in Peter Meyer's insightful 2004 presentation on SSP, this approach offers both benefits and challenges that we'll discuss in more detail.

Meyer presents many arguments in favor of designing SSP systems around XML, but chief amongst them are its adaptability and longevity. In situations where documents are held and updated over long periods, employing XML can help avoid technological obsolescence of the content caused by changing software versions and publishing styles. This ability to stand the test of time and adapt to various requirements makes XML a common and celebrated choice for publishing systems. Meyer, amongst many others, believes that XML serves as an excellent file format for Single Source Publishing.

This strategy can be successful for a known group of regular authors; however, Meyer points out several challenges. One notable challenge is the need for authors to possess a thorough understanding of the specific rules governing a document's structure, known as the "schema," as well as the capability to edit XML. Meyer observed that within any group, about a quarter of authors could edit XML and work with a moderately complex schema without extensive training and support. Half of the group might learn to use the application with varying efficiency, while the remaining quarter would likely struggle to adapt to the new content authoring process.

One way to address this issue is to develop an editor that feels like a word processor but edits structured XML "under the hood." This, however, is challenging. Developing an interface that is intuitive, flexible, and preserves the structural integrity of the highly structured and rigid XML document is difficult and expensive. Additionally, updating the interface to accommodate changes to the overall document structure (schema) can be a complex and costly process.

It's important to keep in mind that even with the most user-friendly interface for editing XML, authors are still required to understand the document structure, use it correctly, and learn a new tool. These factors can contribute to challenges in adoption and productivity when implementing Single Source Publishing with an XML-first approach.

Essentially, when dealing with XML, it becomes clear that expecting authors and content creators to use tools that require them to add any kind of structure beyond the basic "display structure" (the visual organization of content as it appears in a typical word processor) is not particularly effective.

There have been other attempts to solve the problem of an unfriendly author experience when using highly structured document formats. Some platforms, such as Overleaf, have attempted to simplify the author experience by combining user-friendly, word processor-like tools with LaTeX markup. This removes some of the disruptions to the writing flow, but it still requires content creators to learn and conform to the structure and markup of LaTeX, and it faces the same adoption and productivity issues. The strategy has nonetheless proven effective for those willing to learn LaTeX, mainly researchers in the 'hard sciences'.

Overleaf with the Rich Text + LaTeX Editor

There are still other approaches that require everyone to use tools more commonly used by production staff. A few publishers have required authors to use desktop publishing systems like InDesign. The idea is that authors write in InDesign (or similar) so that designers can work in the same environment. Thankfully, this approach is uncommon, as it also disrupts the natural writing process for authors more familiar with word processing software like Microsoft Word or Google Docs.

The examples provided are not exhaustive, but they effectively demonstrate the challenges encountered when expecting all users to adopt tools typically used by production staff. Although it may be somewhat successful, this approach can lead to issues with adoption, a higher likelihood of introducing errors (particularly in terms of structure), and often results in productivity loss. Additionally, it necessitates considerable resources for training and support.

Simple File Formats

As mentioned earlier, highly structured file formats can present challenges. Can a simpler file format be the solution? Some SSP systems attempt to address this issue by using simpler file formats like Markdown and AsciiDoc. I have previously discussed the benefits and limitations of this approach in detail. The primary drawbacks are the lack of advanced formatting options, the necessity for content producers to learn a new, somewhat technical syntax, and the difficulty in adding the required structural information for styling and layout of published content, even though these formats are known as, paradoxically, 'structured text' formats.

= My Document

This is a paragraph.

== Section 1

This is a subsection.

=== Subsubsection 1.1

This is a sub-subsection.

== Section 2

This is another subsection.

A Simple AsciiDoc Example

There is a use case for Markdown and AsciiDoc in SSP, but it is largely for content creators who are also technical, e.g., authors of technical documentation (although some production companies, such as Electric Bookworks, have taken Markdown workflows quite far into other categories).

Qualities of an SSP File Format

How can we solve the 'SSP problem' when both complex and simple file formats fall short? What, then, are the characteristics of a good SSP file format?

An ideal file format for a single source publishing system must have certain qualities to meet the needs of both content creators and production staff. First, as we have already established, it should be compatible with tools that are familiar to both content creation and production teams.

Secondly, it is important to consider the distinct objectives of each group working with the file. Content creators focus on adding content, while production staff primarily concentrate on structure to enable conversion to other formats and display environments (discussed in more detail below). Essentially, the file format must be capable of containing minimal, moderate, or extensive structure as it passes continuously through the hands of both content creation and production staff. In other words, the file format should support progressive structuring without disrupting the workflow.

The Argument for HTML

Interestingly, HTML emerges as a strong candidate for an SSP format due to its flexibility in handling document structure and the wide array of content creation and production tools available for working with HTML.

In terms of structure, unlike other formats that enforce strict structure, HTML is capable of accommodating unstructured, partially structured, or fully structured documents with ease. While some may perceive this lack of enforced structure as a drawback or weakness, I believe it is actually one of HTML's key strengths. Moreover, HTML's structural flexibility enables it to accommodate progressive structuring throughout the publishing process, a characteristic critical to well-designed SSP systems.
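
As a rough illustration of progressive structuring, the sketch below shows the same content at three stages: as an author might leave it, after an editor adds one identifier, and after production staff add finer-grained markup. The class names (`author`, `given-name`, `surname`, `affiliation`) and the helper `inspect` are hypothetical, chosen only for this example. The point is that every stage is usable HTML with the same visible text, while the amount of structure grows.

```python
from html.parser import HTMLParser


class TextAndClasses(HTMLParser):
    """Collect the visible text and any class names in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.extend(value.split())

    def handle_data(self, data):
        self.text_parts.append(data)


def inspect(fragment):
    """Return (visible text, sorted class names) for an HTML fragment."""
    parser = TextAndClasses()
    parser.feed(fragment)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    text = " ".join(" ".join(parser.text_parts).split())
    return text, sorted(set(parser.classes))


# The same content at three stages (all class names are hypothetical).
AUTHOR_DRAFT = "<p>Jane Doe</p><p>University of Somewhere</p>"
EDITED = '<p class="author">Jane Doe</p><p>University of Somewhere</p>'
PRODUCTION = (
    '<p class="author"><span class="given-name">Jane</span> '
    '<span class="surname">Doe</span></p>'
    '<p class="affiliation">University of Somewhere</p>'
)

for stage in (AUTHOR_DRAFT, EDITED, PRODUCTION):
    print(inspect(stage))
```

Nothing about the earlier stages has to be thrown away or converted for the later ones to exist; structure is layered onto the same file.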

The question then arises, can HTML be used to support the 'native tooling' of both content creators and production staff? For content creators, the answer is a resounding yes, as many online word processors that support HTML as an underlying format look, feel, and operate like traditional word processors.

For production staff, HTML can also be effectively employed, including for converting to XML and creating visually appealing print-ready PDFs, EPUBs, and other formats. The ability to transform HTML into various formats, especially XML and high-quality PDFs, may prompt some questions. How exactly is this achieved, and what tools are necessary?

To address these questions, we must explore the requirements for converting HTML to other formats in greater detail. This examination will offer a better understanding of the essential tools and their impact on the culture of production in the publishing industry.

How Does HTML Get to Other Formats?

There are essentially three types of document conversion: Upconversion, Downconversion, and Typesetting.

Downconversion

Downconversion entails converting a structured format to a less structured one, such as transforming HTML to plain text. This process involves losing information rather than adding it. Downconversion is straightforward.
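
A minimal sketch of downconversion, using only Python's standard library: it discards tags and attributes and keeps the text, one block per paragraph or heading. The function name `html_to_text` and the list of block-level tags are assumptions made for this example; dedicated conversion tools handle far more cases.

```python
from html.parser import HTMLParser

# Tags treated as paragraph-like blocks (an illustrative, incomplete list).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote"}


class PlainTextExtractor(HTMLParser):
    """Drop all markup, keeping one text block per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.current = []

    def flush(self):
        # Collapse whitespace and store the block if it is not empty.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.current.append(data)


def html_to_text(html):
    """Convert an HTML fragment to plain text, losing all structure."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)


print(html_to_text(
    '<h1 class="chapter-title">Chapter One</h1>'
    "<p>It was a <em>dark</em> night.</p>"
))
```

The class on the heading and the emphasis on "dark" are simply gone from the output, which is what makes this direction of conversion easy.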

Upconversion

Upconversion involves enhancing a document's structural fidelity. This process requires someone or something to add structural information to the document.

As previously discussed, one can add as much structure to HTML as needed. This is accomplished technically by incorporating attributes to the elements and enclosing items with different identifiers. Upconversion is facilitated by mapping these identifiers to other document structures, enabling the automatic conversion of the single-source file format to the structure of the desired target format (e.g., XML).
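
The mapping idea can be sketched as follows. This is a simplified illustration, not a real JATS exporter: the class names, the mapping table, and the function `upconvert` are all hypothetical, the output is only loosely JATS-flavoured and would not validate, and the input is assumed to be well-formed XHTML so that the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from (HTML tag, class) to a target element name.
# A class of None is the fallback for an element without a known class.
ELEMENT_MAP = {
    ("h1", "article-title"): "article-title",
    ("p", "abstract"): "abstract",
    ("h2", None): "title",
    ("p", None): "p",
}


def upconvert(xhtml):
    """Map identified HTML elements onto a more structured XML tree."""
    source = ET.fromstring(xhtml)
    target = ET.Element("article")
    for node in source:
        specific = (node.tag, node.get("class"))
        fallback = (node.tag, None)
        name = ELEMENT_MAP.get(specific) or ELEMENT_MAP.get(fallback)
        if name is None:
            raise ValueError(f"no mapping for <{node.tag}>")
        out = ET.SubElement(target, name)
        # Inline markup is flattened to text to keep the sketch short.
        out.text = "".join(node.itertext())
    return ET.tostring(target, encoding="unicode")


print(upconvert(
    "<div>"
    '<h1 class="article-title">On Publishing</h1>'
    '<p class="abstract">A short summary.</p>'
    "<p>Body text.</p>"
    "</div>"
))
```

The conversion itself is mechanical; the work lies in someone having added the identifiers to the source beforehand.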

Typesetting

Typesetting is the process of adapting content to fit specific layouts, such as mobile displays or paginated PDFs for print, and making it visually appealing.

Typesetting within an HTML context is a fascinating topic in publishing. There are two primary use cases: HTML display environments (such as laptop browsers or mobile phones) and PDFs for screen display or print. CSS is the design language that determines the style and layout of HTML in browsers and on phones. Interestingly, CSS has been extended to include rules governing design and layout in paged environments, making it a popular method for generating PDFs from HTML. CSS PagedMedia, an evolving standard, is commonly used to create visually appealing, print-ready works from HTML, including the features expected in print layouts, such as page numbers, margin control, running headers, and orphan control.
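
To give a sense of what such rules look like, here is a small sketch that builds a print stylesheet as a string. The function name `paged_css` and the default trim size and margin are illustrative assumptions. The CSS it emits uses the `@page` rule, page-margin boxes, the `page` counter, and the `orphans` and `widows` properties; turning it into a PDF still requires a renderer that supports paged media.

```python
def paged_css(running_title, width="6in", height="9in", margin="0.75in"):
    """Return a small print stylesheet built from CSS paged media rules."""
    # Escape characters that would otherwise end the CSS string early.
    safe_title = running_title.replace("\\", "\\\\").replace('"', '\\"')
    return f"""@page {{
  size: {width} {height};
  margin: {margin};
  @top-center {{ content: "{safe_title}"; }}
  @bottom-center {{ content: counter(page); }}
}}
h1 {{ break-before: page; }}
p {{ orphans: 2; widows: 2; }}
"""


print(paged_css("Single Source Publishing"))
```

The same HTML file can be paired with a different stylesheet for the web, so the design for each output lives outside the single source.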

An example textbook PDF typeset from HTML by Pagedjs

HTML and CSS are currently being employed to create PDFs for publishers across a diverse range of use cases, including books (as per above) and journals.

eLife PDF generated from HTML + CSS using PagedMedia

In summary, it is possible to achieve all the required formats through the combination of these three strategies, along with the numerous conversion tools readily accessible for converting HTML. In fact, one could argue that the variety of strategies and tools for format conversion is greater for HTML than for any other available format.

The Tools

As previously mentioned, it is crucial to identify tools that cater to the needs and working styles of both content creators and production staff. The ideal tools should work with HTML and, optimally, support real-time concurrent operations. Fortunately, such tools do exist, and if they don't perfectly meet your specific use case, they can be cost-effectively extended or custom-built.

Numerous modern word processors that use HTML as their underlying document structure offer a range of benefits, including the ability to support real-time editing. These web-based word processors can be tailored to accommodate production teams, allowing them to mark up more complex structures.

Downconversion is easily supported by numerous conversion tools like Pandoc, making the tooling for this aspect simple. However, the two high-value transformations to consider are upconversion and typesetting.

Typesetting can be accomplished with a relatively simple setup. For example, the book production system PressBooks utilizes WordPress and PagedMedia typesetting (employing the proprietary PrinceXML). Users add the structural information needed for typesetting through the platform's customized editor.

PressBooks Editor (Image CC-BY University of Hawai'i)

Similarly, Ketida employs the web-based word processor Wax and integrates with open-source Pagedjs for its typesetting needs.

Design is managed in both systems by linking the structure with CSS styles. This can be done by editing CSS files using a range of CSS editors popular in web design workflows or by using the built-in CSS template editors available in both platforms.

Ketida Inbuilt CSS Editor and Side-by-Side Renderer

Another strategy is to build an InDesign-like interface. Hederis, for example, follows this paradigm. The underlying technology behind Hederis is HTML and CSS (and PagedJS for HTML pagination).

If publishers do not have in-house CSS PagedMedia skills, this work can be outsourced to a growing number of design professionals.

Upconversion solutions for transforming from HTML to highly structured formats are available, but they are not extensively distributed or widely adopted. Kotahi is a prime example, providing a word processor interface with controls for marking up display document elements with additional structural information. During export, this information is mapped to the Journal Article Tag Suite (JATS) format and validated, resulting in a clean, well-formatted JATS representation of the article.

Kotahi SSP Production Interface for Journals

Metadata

The observant reader may ask, "What about metadata?" Preparing files in production isn't solely about the structure of the underlying content; it also involves including additional information about the content that isn't considered "display content." For instance, the ISBN is not typically part of the content, since we don't usually expect to see it within the text of a book. Placing an ISBN at the beginning of a chapter just to "carry that information forward" would be a rather cumbersome solution. However, the ISBN might need to be embedded in some outputs (e.g., XML). In journal publishing in particular, a significant amount of metadata needs to be merged with content into an XML file (JATS) for archiving and distribution.

So, how does this fit into the overall picture? The answer lies in combining content from an HTML file with metadata at "export time" (when creating JATS for example) and constructing the XML from the individual components. There is no need for metadata like this to "live" in the original single source file. It can exist anywhere that is accessible to the export mechanism when generating the desired format.
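
A minimal sketch of combining content and metadata at export time follows. The element names `front`, `body`, and `meta`, the function `export_with_metadata`, and the placeholder identifier are assumptions for illustration; real JATS has a much richer and stricter metadata model. The content is assumed to be well-formed XHTML, and the metadata is an ordinary dictionary that could just as well come from a database or a separate file.

```python
import xml.etree.ElementTree as ET


def export_with_metadata(body_xhtml, metadata):
    """Build one XML document from separate content and metadata."""
    article = ET.Element("article")

    # Metadata is supplied at export time; it never lived in the HTML.
    front = ET.SubElement(article, "front")
    for key, value in metadata.items():
        meta = ET.SubElement(front, "meta", {"name": key})
        meta.text = str(value)

    # Copy the children of the HTML wrapper element into the body.
    body = ET.SubElement(article, "body")
    body.extend(ET.fromstring(body_xhtml))

    return ET.tostring(article, encoding="unicode")


print(export_with_metadata(
    "<div><p>Hello.</p></div>",
    {"isbn": "000-PLACEHOLDER"},  # not a real ISBN
))
```

Because the two inputs stay separate until this step, authors never see the metadata in their document, and the metadata can be corrected without touching the content.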

Conclusion

Single Source Publishing is an effective strategy for accelerating publishing workflows and lowering costs. The secret to achieving SSP lies in the nature of a shared file format across content creation and production. Highly structured formats can be problematic, while formats with reduced structure have limited use cases. An effective SSP file format must:

Be compatible with tools that are familiar to both content creation and production teams.
Support progressive structuring without interrupting the workflow.

HTML emerges as the best contender, as it can accommodate both content creation and production tools, and supports progressive structuring. Although a growing number of tools support this approach, they remain largely under-explored by those who stand to benefit the most from them.