According to a recent survey carried out on behalf of the digital association Bitkom, an average of 26 e-mails are received daily in every professional mailbox in Germany. Their processing takes up a large part of the working time. […]
In addition, e-mails are an integral part of business processes. Some of them must be retained under tax law, including orders and invoices, but also any documents that may be relevant to a business transaction. Electronic messages also often contain valuable knowledge that needs to be preserved. But how can e-mails be archived elegantly? There is no definitive solution to date. For many reasons, however, the route via PDF currently seems the most practicable.
The good news first: e-mails are digital per se and already contain metadata. In principle, this makes them easier to archive than paper-based communication. In many cases, however, the company sets no requirements, so users decide individually how to deal with their e-mails. There is therefore a high risk that business-relevant messages will be lost.
E-mails: more than just a file
Different, specialized systems are used to handle e-mails, which enable the creation, transport, viewing and storage of these electronic messages (lifecycle: client, server, relay, archive system). Emails consist of three components:
- The header is essentially the counterpart to a letterhead. It carries, as metadata, the sender and recipient information, the creation date and optional fields such as the subject. It often also contains an ID that helps the e-mail client associate the message with other e-mails when a conversation consists of replies and forwards. To assess e-mails and the reliability of the header information properly, it is important to understand that the actual routing is independent of the header data and is handled by the Simple Mail Transfer Protocol (SMTP). SMTP acts as the envelope and controls the routing of the message: alongside the content of the e-mail (including the header), the sending client passes the server SMTP commands that contain the recipient address, and it is this address that determines the routing.
- The body, i.e. the actual content of the e-mail, is displayed differently depending on the user's settings and the capabilities of the e-mail software. It can be plain text (ASCII) without umlauts, simply formatted text (such as bold or italics) with support for country-specific encodings (umlauts), or comprehensive HTML formatting with embedded images and more. An e-mail file can contain several of these variants at the same time, and there is no guarantee that their content is congruent: it is easy to place different texts in each variant. For example, the ASCII text often contains only a note that an HTML-capable e-mail client is required for display. This is a crucial aspect when the format is changed for archiving.
- The third, optional part consists of attachments. This opens up the infinite field of file formats feared by every archivist: often they are documents or images, which may be combined in a ZIP file, but exotic file formats or executable programs or scripts may also be included.
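The three components described above can be made visible with a few lines of Python. The following is only a minimal sketch using the standard library's email package; the addresses, subject, texts and file name are invented sample data, and describe() is a hypothetical helper name. Note that the plain-text and HTML variants deliberately carry different texts.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Create an invented sample e-mail with two body variants and one attachment."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Invoice 2024-001"
    msg["Message-ID"] = "<invoice-1@example.com>"
    # Plain-text variant: only a hint, as is often seen in practice.
    msg.set_content("Please use an HTML-capable e-mail client.")
    # HTML variant: carries the real content.
    msg.add_alternative("<p>Your <b>invoice</b> is attached.</p>", subtype="html")
    msg.add_attachment(b"%PDF-1.7 dummy", maintype="application",
                       subtype="pdf", filename="invoice.pdf")
    return msg.as_bytes()


def describe(raw: bytes) -> dict:
    """Split a raw e-mail into header fields, body variants and attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    header = {name: str(msg[name])
              for name in ("From", "To", "Subject", "Message-ID")}
    bodies, attachments = {}, []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only group other parts
        if part.is_attachment():
            attachments.append((part.get_filename(), part.get_content_type()))
        elif part.get_content_maintype() == "text":
            bodies[part.get_content_type()] = part.get_content().strip()
    return {"header": header, "bodies": bodies, "attachments": attachments}


info = describe(build_sample())
for section, content in info.items():
    print(section, content)
```

An archiving tool that converted only the text/plain variant of this message would preserve the hint but lose the actual content.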
The ideal email archive
As already described, the e-mail is transported via SMTP: from the client to the sender's server, then via the mail relays to the recipient's server and from there to the recipient's client. Since e-mails are often sent as replies in “conversations” and the complete history is not always included, it would be ideal to archive the entire mail system so that the e-mail communication can later be traced in full, with all its steps.
This is obviously not feasible in practice. Alternatively, it would be good if at least the receiving or sending mailbox were archived completely, with all references between the e-mails. To date there is no standardized, interoperable approach to this, but there are interesting initiatives and approaches (e.g. a report recently prepared by the University of Illinois with the support of the PDF Association: https://www.pdfa.org/packaging-email-archives-using-pdf/).
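The references between e-mails are usually carried in the Message-ID, In-Reply-To and References header fields. As a rough illustration of what a mailbox archive would have to preserve, the following sketch groups messages into conversations. The sample messages and the helper names thread_root() and group_by_conversation() are invented, and the grouping rule is a simplification, not an established threading algorithm.

```python
from email import message_from_string

# Invented sample mailbox: one conversation of three messages, one unrelated message.
RAW_MESSAGES = [
    "Message-ID: <a@example.com>\nSubject: Offer\n\nFirst message\n",
    "Message-ID: <b@example.com>\nIn-Reply-To: <a@example.com>\n"
    "References: <a@example.com>\nSubject: Re: Offer\n\nReply\n",
    "Message-ID: <c@example.com>\nIn-Reply-To: <b@example.com>\n"
    "References: <a@example.com> <b@example.com>\nSubject: Re: Offer\n\nSecond reply\n",
    "Message-ID: <x@example.com>\nSubject: Unrelated\n\nOther topic\n",
]


def thread_root(msg) -> str:
    """Return the ID of the message that started the conversation."""
    references = msg.get("References", "").split()
    if references:
        return references[0]  # the first reference is the oldest ancestor
    parent = msg.get("In-Reply-To", "").strip()
    return parent or msg["Message-ID"].strip()


def group_by_conversation(raw_messages) -> dict:
    """Map each conversation root ID to the IDs of its messages."""
    conversations = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        conversations.setdefault(thread_root(msg), []).append(
            msg["Message-ID"].strip())
    return conversations


conversations = group_by_conversation(RAW_MESSAGES)
for root, members in conversations.items():
    print(root, "->", members)
```

If a message in the middle of a conversation is missing from the archive, these header fields still show that it existed, but its content cannot be recovered.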
No original format for emails
Originally, e-mail was conceived purely as a communication protocol. There is no dedicated standard for a storage format; RFC 822 (since superseded by RFC 5322) standardizes only the message as it is transmitted. The e-mail format that most closely matches the protocol is the EML format. It is a practical way to store e-mails on the hard disk or other storage media and open them later with the e-mail program in use. However, this does not necessarily guarantee the long-term availability of the e-mails, since there is no standardized documentation for the EML format and special software is required for display.
Such an approach is also problematic because Microsoft's technology, the most widely used in business processes, relies on its own proprietary format (MSG). Although this format is documented, it is subject to frequent changes. In some cases the programs do not even insert the content into the body of the e-mail but send it as a “Winmail.dat” attachment, which can then only be interpreted and displayed by suitably equipped clients on the recipient side.
For these reasons alone, converting e-mails into an archive-compatible standard format seems essential. This becomes even more pressing when attachments are taken into account, because there are no limits to the file formats that can be used in them. It is therefore impossible to ensure that an application capable of displaying the attachments will still be available in years or even decades. This is one of the reasons why PDF/A was developed and became established so quickly.
PDF/A for secure archiving
To get rid of this dependency, system-independent archiving of all e-mails and attachments in PDF/A is recommended. The format has long been established for general archival material. With PDF/A-4f, a conformance level has recently become available as a successor to PDF/A-3 in which arbitrary files can be embedded. On this basis, at least the format question in e-mail archiving can be answered satisfactorily.
Most e-mail systems offer an export function to PDF. Unfortunately, this approach often falls short, because usually only the e-mail body is taken into account and not the header or any attachments.
When e-mails are archived completely in PDF, the header data should be stored as XMP metadata in the PDF file. This makes it possible to search specifically for e-mails later. Ideally, the e-mail body is converted from the body variant (plain ASCII, formatted text or HTML) that reproduces the content most completely. Links and referenced images in HTML must then be integrated as well.
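The two steps just described, choosing the richest body variant and turning header data into XMP, might be sketched as follows. This is an illustration under several assumptions: the sample message is invented, richest_body() and headers_to_xmp() are hypothetical helpers, and mapping From, Subject and Date to the Dublin Core properties dc:creator, dc:title and dc:date is one possible choice, not a prescribed mapping. Other header fields would need a namespace of their own, and the xpacket wrapper as well as the actual PDF/A generation are left out.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage
from email.utils import parsedate_to_datetime

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix: str, name: str) -> str:
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{name}"


def richest_body(msg: EmailMessage):
    """Prefer the HTML variant and fall back to plain text."""
    part = msg.get_body(preferencelist=("html", "plain"))
    return part.get_content_type(), part.get_content()


def add_array(desc, prop, container, value, lang=None):
    """Add a Dublin Core property stored as an RDF container (Alt or Seq)."""
    holder = ET.SubElement(ET.SubElement(desc, q("dc", prop)), q("rdf", container))
    attrib = {XML_LANG: lang} if lang else {}
    ET.SubElement(holder, q("rdf", "li"), attrib).text = value


def headers_to_xmp(msg: EmailMessage) -> str:
    """Map selected header fields to Dublin Core properties (assumed mapping)."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    add_array(desc, "title", "Alt", str(msg["Subject"]), lang="x-default")
    add_array(desc, "creator", "Seq", str(msg["From"]))
    sent = parsedate_to_datetime(str(msg["Date"]))
    add_array(desc, "date", "Seq", sent.isoformat())
    return ET.tostring(xmpmeta, encoding="unicode")


# Invented sample message with a plain-text and an HTML variant.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["Subject"] = "Invoice 2024-001"
msg["Date"] = "Mon, 04 Mar 2024 09:30:00 +0100"
msg.set_content("Please use an HTML-capable e-mail client.")
msg.add_alternative("<p>Your <b>invoice</b> is attached.</p>", subtype="html")

content_type, body = richest_body(msg)
xmp = headers_to_xmp(msg)
print(content_type)
print(xmp)
```

Preferring HTML is a simplification. As noted earlier, the variants are not guaranteed to be congruent, so a careful converter may want to compare them rather than pick one blindly.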
The greatest flexibility in the use of archived e-mails is achieved if the original e-mail file in EML or MSG format and the attachments are additionally embedded in the PDF, which is possible with PDF/A-3 or PDF/A-4f.
But experience has shown that the emails archived as PDF/A are almost always larger than the original files. Another factor is that the PDF/A standard requires the embedding of fonts or ICC profiles for colors to ensure the reproducibility of e-mails for years to come. On the other hand, the file size can be minimized through compression methods integrated into the PDF, a possibility that does not exist in the “email formats”.
In order to include as much information as possible in email archiving and to find and use it in the future, the following steps are recommended in summary:
- Convert e-mails to PDF/A-3 or PDF/A-4f with the look and feel of the e-mail client
- Add all header information as metadata
- Convert attachments to PDF/A as well, if possible
- Additionally embed attachments in their original format
- Embed the original files (e-mails in EML or MSG format)
This procedure can already be implemented today with standard software. What is not yet covered in an interoperable, standardized way are requirements for archiving and restoring e-mails in conversations (replies and forwards), for example with functionality for searching them.
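Before the actual conversion, it can help to list what would have to be embedded for a given e-mail. The following sketch builds such a list for the last two steps of the summary. The function embedding_manifest(), the default file name and the sample data are invented for illustration, and the SHA-256 checksums are an addition of this sketch intended for later integrity checks; generating the PDF/A file itself is left to the conversion software.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def embedding_manifest(raw_eml: bytes, original_name: str = "message.eml") -> list:
    """List the original e-mail and its attachments as files to embed."""
    entries = [{
        "name": original_name,
        "role": "original e-mail",
        "mime": "message/rfc822",
        "size": len(raw_eml),
        "sha256": hashlib.sha256(raw_eml).hexdigest(),
    }]
    msg = message_from_bytes(raw_eml, policy=policy.default)
    for index, part in enumerate(msg.iter_attachments(), start=1):
        payload = part.get_payload(decode=True) or b""
        entries.append({
            # Fall back to a generated name if the attachment has none.
            "name": part.get_filename() or f"attachment-{index}",
            "role": "attachment",
            "mime": part.get_content_type(),
            "size": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),
        })
    return entries


# Invented sample e-mail with a single attachment.
ATTACHMENT = b"PK dummy archive content"
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample["Subject"] = "Drawings"
sample.set_content("See attachment.")
sample.add_attachment(ATTACHMENT, maintype="application",
                      subtype="zip", filename="drawings.zip")

manifest = embedding_manifest(sample.as_bytes())
for entry in manifest:
    print(entry["role"], entry["name"], entry["mime"], entry["size"])
```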
* Dietrich von Seggern, Managing Director of callas software GmbH, has been working in the area of prepress since 1991. The graduate engineer is an expert in publishing and PDF. callas software develops PDF technologies for publishing, prepress, document exchange and archiving as well as for optimizing PDF-based processes. The company is a founding member of the PDF Association and has been involved in the board of the international association from the very beginning. www.callassoftware.com