The open XML-based eBook format
Summary: Need to distribute documentation, create an
eBook, or just archive your favorite blog posts? EPUB is an open
specification for digital books based on familiar technologies like XML,
CSS, and XHTML, and EPUB files can be read on portable e-ink devices, mobile
phones, and desktop computers. This tutorial explains the EPUB format in
detail, demonstrates EPUB validation using Java
05 Feb 2009 - As a followup to reader comments, the author revised the content of Listing 3 and refreshed the epub-raw-files.zip file (see Downloads).
27 Apr 2010 - Refreshed the epub-raw-files.zip file (see Downloads).
03 Jun 2010 - At author request, revised the content of Listings 3 and 8. Also refreshed the epub-raw-files.zip file (see Downloads).
11 Jan 2011 - At author request, revised the content of Listing 5. Changed second line of code from <item id="ncx" href="toc.ncx" media-type="text/xml"/> to <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>.
12 Jul 2011 - As a followup to reader comments, revised the content of Listing 14. Removed ` character near end of first line of code from <?xml version="1.0" encoding="utf-8"?`>. Revised code now reads: <?xml version="1.0" encoding="utf-8"?>.
Date: 13 Jul 2011 (Published 25 Nov 2008)
Level: Intermediate
Building your first EPUB
A minimally conforming EPUB bundle has several required files. The specification can be quite strict about the format, contents, and location of those files within the EPUB archive. This section explains what you must know when you work with the EPUB standard.
The basic structure of a minimal EPUB file follows the pattern in Listing 1. When ready for distribution, this directory structure is bundled together into a ZIP-format file, with a few special requirements discussed in Bundling your EPUB file as a ZIP archive.
mimetype
META-INF/
    container.xml
OEBPS/
    content.opf
    title.html
    content.html
    stylesheet.css
    toc.ncx
    images/
        cover.png
Note: A sample book following this pattern is available from Downloads, but I recommend that you create your own as you follow the tutorial.
To start building your EPUB book, create a directory for the EPUB project. Open a text editor or an IDE such as Eclipse. I recommend using an editor that has an XML mode—in particular, one that can validate against the Relax NG schemas listed in Resources.
This one's pretty easy: The mimetype file is required and must be named mimetype. The contents of the file are always:
application/epub+zip
Note that the mimetype file cannot contain any newlines or carriage returns.
Additionally, the mimetype file must be the first file in the ZIP archive and must not itself be compressed. You'll see how to include it using common ZIP arguments in Bundling your EPUB file as a ZIP archive. For now, just create this file and save it, making sure that it's at the root level of your EPUB project.
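Because many editors and shell commands append a trailing newline, it can be safer to write the mimetype file programmatically. The following is a minimal sketch, not part of the original tutorial; the function name write_mimetype and the project directory argument are illustrative assumptions.

```python
from pathlib import Path

# The required contents: exactly 20 bytes, with no newline or carriage return.
MIMETYPE = b"application/epub+zip"


def write_mimetype(project_dir: str) -> Path:
    """Write the mimetype file at the root of the EPUB project directory.

    write_mimetype is a hypothetical helper name used for illustration.
    """
    root = Path(project_dir)
    root.mkdir(parents=True, exist_ok=True)
    target = root / "mimetype"
    # Writing bytes avoids any platform-specific newline translation.
    target.write_bytes(MIMETYPE)
    return target
```

Calling write_mimetype("my-epub") creates my-epub/mimetype; checking that the file size is 20 bytes is a quick way to confirm that no newline slipped in.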
At the root level of the EPUB, there must be a META-INF directory, and it must contain a file named container.xml. EPUB reading systems will look for this file first, as it points to the location of the metadata for the digital book.
Create a directory called META-INF. Inside it, open a new file called container.xml for writing. The container file is very small, but its structural requirements are strict. Paste the code in Listing 2 into META-INF/container.xml.
<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml" />
  </rootfiles>
</container>
The value of full-path (OEBPS/content.opf in this example) is the only part of this file that will ever vary. The directory path must be relative to the root of the EPUB file itself, not relative to the META-INF directory.
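To see how a reading system might use the container file, here is a small sketch that extracts the full-path value with Python's standard xml.etree.ElementTree module. The function name find_opf_path is an assumption made for illustration; real reading systems may handle more cases, such as multiple rootfile elements.

```python
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml (taken from Listing 2).
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(container_xml: str) -> str:
    """Return the full-path of the first rootfile in container.xml.

    The returned path is relative to the root of the EPUB archive,
    not to the META-INF directory.
    """
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no rootfile element")
    full_path = rootfile.get("full-path")
    if not full_path:
        raise ValueError("rootfile element has no full-path attribute")
    return full_path
```

For the container file in Listing 2, this returns the path of the OPF file as stored in the archive, which is where a reader would look next.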
The mimetype and container files are the only two whose locations in the EPUB archive are strictly controlled. It is recommended (although not required) that you store the remaining files in the EPUB in a sub-directory. (By convention, this is usually called OEBPS, for Open eBook Publication Structure, but it can be whatever you like.)
Next, create the directory named OEBPS in your EPUB project. The following section of this tutorial covers the files that go into OEBPS—the real meat of the digital book: its metadata and its pages.
Open Packaging Format metadata file
Although this file can be named anything, the OPF file is conventionally called content.opf. It specifies the location of all the content of the book, from its text to other media such as images. It also points to another metadata file, the Navigation Control file for XML (NCX) table of contents.
The OPF file is the most complex metadata in the EPUB specification. Create OEBPS/content.opf, and paste the contents of Listing 3 into it.
<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         unique-identifier="bookid" version="2.0">
  <metadata>
    <dc:title>Hello World: My First EPUB</dc:title>
    <dc:creator>My Name</dc:creator>
    <dc:identifier id="bookid">urn:uuid:0cc33cbd-94e2-49c1-909a-72ae16bc2658</dc:identifier>
    <dc:language>en-US</dc:language>
    <meta name="cover" content="cover-image" />
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="cover" href="title.html" media-type="application/xhtml+xml"/>
    <item id="content" href="content.html" media-type="application/xhtml+xml"/>
    <item id="cover-image" href="images/cover.png" media-type="image/png"/>
    <item id="css" href="stylesheet.css" media-type="text/css"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="cover" linear="no"/>
    <itemref idref="content"/>
  </spine>
  <guide>
    <reference href="title.html" type="cover" title="Cover"/>
  </guide>
</package>
The OPF document itself must use the namespace http://www.idpf.org/2007/opf, and the metadata will be in the Dublin Core Metadata Initiative (DCMI) namespace, http://purl.org/dc/elements/1.1/.
This would be a good time to add the OPF and DCMI schema to your XML editor. All the schemas used in EPUB are available from Downloads.
Dublin Core defines a set of common metadata terms that you can use to describe a wide variety of digital materials; it's not part of the EPUB specification itself. Any of these terms are allowed in the OPF metadata section. When you build an EPUB for distribution, include as much detail as you can here, although the extract provided in Listing 4 is sufficient to start.
...
<metadata>
  <dc:title>Hello World: My First EPUB</dc:title>
  <dc:creator>My Name</dc:creator>
  <dc:identifier id="bookid">urn:uuid:0cc33cbd-94e2-49c1-909a-72ae16bc2658</dc:identifier>
  <dc:language>en-US</dc:language>
  <meta name="cover" content="cover-image" />
</metadata>
...
The required terms are title, identifier, and language (dc:language, set to en-US in Listing 3). According to the EPUB specification, the identifier must be a unique value, although it's up to the digital book creator to define that unique value. For book publishers, this field will typically contain an ISBN or Library of Congress number. For other EPUB creators, consider using a URL or a large, randomly generated universally unique identifier (UUID). Note that the value of the package element's unique-identifier attribute must match the id attribute of the dc:identifier element.
Other metadata to consider adding, if it's relevant to your content, includes:
- Publication date (dc:date).
- Publisher (dc:publisher). (This can be your company or individual name.)
- Copyright information (dc:rights). (If releasing the work under a Creative Commons license, put the URL for the license here.)
See Resources for more information on DCMI.
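If you choose a randomly generated UUID for dc:identifier, Python's standard uuid module can produce one in the urn:uuid: form used in Listing 3. This is only a sketch; the helper name new_book_identifier is an assumption, and any scheme that yields a unique value is acceptable.

```python
import uuid


def new_book_identifier() -> str:
    """Return a random (version 4) UUID in URN form, 'urn:uuid:<uuid>'.

    new_book_identifier is a hypothetical helper name. Generate the value
    once per book and reuse it; do not regenerate it on every build.
    """
    return uuid.uuid4().urn
```

The same value must then be used wherever the book's identifier appears, so it is worth storing it alongside the project rather than recomputing it.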
Including a meta element with the name attribute containing cover is not part of the EPUB specification directly, but is a recommended way to make cover pages and images more portable. Some EPUB renderers prefer to use an image file as the cover, while others will use an XHTML file containing an inlined cover image. This example shows both forms. The value of the meta element's content attribute should be the ID of the book's cover image in the manifest, which is the next part of the OPF file.
The OPF manifest lists all the resources found in the EPUB that are part of the content (and excluding metadata). This usually means a list of XHTML files that make up the text of the eBook plus some number of related media such as images. EPUB encourages the use of CSS for styling book content, so CSS files are also included in the manifest. Every file that goes into your digital book must be listed in the manifest.
Listing 5 shows the extracted manifest section.
...
<manifest>
  <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  <item id="cover" href="title.html" media-type="application/xhtml+xml"/>
  <item id="content" href="content.html" media-type="application/xhtml+xml"/>
  <item id="cover-image" href="images/cover.png" media-type="image/png"/>
  <item id="css" href="stylesheet.css" media-type="text/css"/>
</manifest>
...
You must include the first item, toc.ncx (discussed in the next section). Note that all items have an appropriate media-type value and that the media type for the XHTML content is application/xhtml+xml. This exact value is required and cannot be text/html or some other type.
EPUB supports four image file formats as core types: Joint Photographic Experts Group (JPEG), Portable Network Graphics (PNG), Graphics Interchange Format (GIF), and Scalable Vector Graphics (SVG). You can include non-supported file types if you provide a fall-back to a core type. See the OPF specification for more information on fall-back items.
The value of each href attribute should be a Uniform Resource Identifier (URI) that is relative to the OPF file. (This is easy to confuse with the reference to the OPF file in the container.xml file, where it must be relative to the EPUB as a whole.) In this case, the OPF file is in the same OEBPS directory as your content, so no path information is required here.
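The difference between the two kinds of relative paths can be illustrated with a short sketch that resolves each manifest href against the directory of the OPF file. The function name manifest_archive_paths is hypothetical, and the sketch assumes plain relative hrefs without percent-encoding or fragments.

```python
import posixpath
import xml.etree.ElementTree as ET
from typing import Dict

# OPF namespace, as used in Listing 3.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def manifest_archive_paths(opf_xml: str, opf_path: str) -> Dict[str, str]:
    """Map each manifest item id to its path inside the EPUB archive.

    opf_path is the full-path value from container.xml (relative to the
    archive root); each href is relative to the OPF file's directory.
    """
    root = ET.fromstring(opf_xml)
    base_dir = posixpath.dirname(opf_path)
    paths = {}
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        href = item.get("href", "")
        # ZIP archives always use forward slashes, hence posixpath.
        paths[item.get("id")] = posixpath.normpath(posixpath.join(base_dir, href))
    return paths
```

With the sample book, an href of images/cover.png in OEBPS/content.opf resolves to OEBPS/images/cover.png inside the archive.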
Although the manifest tells the EPUB reader which files are part of the archive, the spine indicates the order in which they appear, or—in EPUB terms—the linear reading order of the digital book. One way to think of the OPF spine is that it defines the order of the "pages" of the book. The spine is read in document order, from top to bottom. Listing 6 shows an extract from the OPF file.
...
<spine toc="ncx">
  <itemref idref="cover" linear="no"/>
  <itemref idref="content"/>
</spine>
...
Each itemref element has a required attribute idref, which must match one of the IDs in the manifest. The toc attribute on the spine element is also required. It must reference the ID of the manifest item for the NCX table of contents.
The linear attribute in the spine indicates whether the item is considered part of the linear reading order versus being extraneous front- or end-matter. I recommend that you define any cover page as linear="no". Conforming EPUB reading systems will open the book to the first item in the spine that's not set as linear="no".
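The spine rules described above (every idref and the toc attribute must point at manifest IDs, and reading starts at the first item not marked linear="no") can be checked with a few lines of Python. This is an illustrative sketch, not a replacement for a real validator; check_spine is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# OPF namespace, as used in Listing 3.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def check_spine(opf_xml: str) -> Tuple[List[str], Optional[str]]:
    """Return (problems, first_linear_idref) for an OPF document."""
    root = ET.fromstring(opf_xml)
    manifest_ids = {
        item.get("id") for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    problems: List[str] = []
    spine = root.find("opf:spine", OPF_NS)
    if spine is None:
        return ["missing spine element"], None

    # The toc attribute must name the manifest item for the NCX file.
    toc = spine.get("toc")
    if toc not in manifest_ids:
        problems.append(f"spine toc={toc!r} is not a manifest id")

    first_linear = None
    for itemref in spine.findall("opf:itemref", OPF_NS):
        idref = itemref.get("idref")
        if idref not in manifest_ids:
            problems.append(f"itemref idref={idref!r} is not a manifest id")
        elif first_linear is None and itemref.get("linear", "yes") != "no":
            # linear defaults to "yes" when the attribute is absent.
            first_linear = idref
    return problems, first_linear
```

For the sample book, the cover is skipped as non-linear, so the first linear item is the one with the ID content.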
The last part of the OPF content file is the guide. This section is optional but recommended. Listing 7 shows the guide extracted from the OPF file.
...
<guide>
  <reference href="title.html" type="cover" title="Cover"/>
</guide>
...
The guide is a way of providing semantic information to an EPUB reading system. While the manifest defines the physical resources in the EPUB and the spine provides information about their order, the guide explains what the sections mean. Here's a partial list of the values that are allowed in the OPF guide:
- cover: The book cover
- title-page: A page with author and publisher information
- toc: The table of contents
For a complete list, see the OPF 2.0 specification, available from Resources.
Although the OPF file is defined as part of EPUB itself, the last major metadata file is borrowed from a different digital book standard. DAISY is a consortium that develops data formats for readers who are unable to use traditional books, often because of visual impairments or the inability to manipulate printed works. EPUB has borrowed DAISY's NCX DTD. The NCX defines the table of contents of the digital book. In complex books, it is typically hierarchical, containing nested parts, chapters, and sections.
Using your XML editor, create OEBPS/toc.ncx, and include the code in Listing 8.
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN"
                 "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta name="dtb:uid" content="urn:uuid:0cc33cbd-94e2-49c1-909a-72ae16bc2658"/>
    <meta name="dtb:depth" content="1"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>
  <docTitle>
    <text>Hello World: My First EPUB</text>
  </docTitle>
  <navMap>
    <navPoint id="navpoint-1" playOrder="1">
      <navLabel>
        <text>Book cover</text>
      </navLabel>
      <content src="title.html"/>
    </navPoint>
    <navPoint id="navpoint-2" playOrder="2">
      <navLabel>
        <text>Contents</text>
      </navLabel>
      <content src="content.html"/>
    </navPoint>
  </navMap>
</ncx>
The DTD requires four meta elements inside the NCX <head> tag:
- dtb:uid: The unique ID for the digital book. This value should match the dc:identifier in the OPF file.
- dtb:depth: Reflects the level of the hierarchy in the table of contents. This example has only one level, so this value is 1.
- dtb:totalPageCount and dtb:maxPageNumber: Apply only to paper books and can be left at 0.
The content of docTitle/text is the title of the work, and matches the value of dc:title in the OPF.
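Because the identifier appears in two files, it is easy for the copies to drift apart. The following sketch compares the NCX dtb:uid value with the dc:identifier that the OPF unique-identifier attribute points to. The function name identifiers_match is an assumption for illustration, and the comparison is a simple exact string match after trimming whitespace.

```python
import xml.etree.ElementTree as ET

# Namespaces taken from Listings 3 and 8.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
}


def identifiers_match(opf_xml: str, ncx_xml: str) -> bool:
    """Return True if the NCX dtb:uid equals the OPF book identifier."""
    opf = ET.fromstring(opf_xml)
    ncx = ET.fromstring(ncx_xml)

    # Find the dc:identifier whose id matches package/@unique-identifier.
    wanted_id = opf.get("unique-identifier")
    book_id = None
    for ident in opf.findall("opf:metadata/dc:identifier", NS):
        if ident.get("id") == wanted_id:
            book_id = (ident.text or "").strip()

    # Find the dtb:uid meta element in the NCX head.
    ncx_uid = None
    for meta in ncx.findall("ncx:head/ncx:meta", NS):
        if meta.get("name") == "dtb:uid":
            ncx_uid = (meta.get("content") or "").strip()

    return book_id is not None and book_id == ncx_uid
```

A check like this is a cheap guard to run before bundling, since a mismatch here is a common cause of validation complaints.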
A good rule of thumb is that the NCX often contains more navPoint elements than there are itemref elements in the OPF spine. In practice, all the items in the spine appear in the NCX, but the NCX can be more granular than the spine.
The navMap is the most important part of the NCX file, as it defines the table of contents for the actual book. The navMap contains one or more navPoint elements. Each navPoint must contain the following:
- A playOrder attribute, which reflects the reading order of the document. This follows the same order as the list of itemref elements in the OPF spine.
- A navLabel/text element, which describes the title for this section of the book. This is typically a chapter title or number, such as "Chapter One," or, as in this example, "Book cover."
- A content element whose src attribute points to the physical resource containing the content. This will be a file declared in the OPF manifest. (It is also acceptable to use fragment identifiers here to point to anchors within XHTML content, for example, content.html#footnote1.)
- Optionally, one or more child navPoint elements. Nested points are how hierarchical documents are expressed in the NCX.
The structure of the sample book is simple: It has only two pages, and they are not nested. That means that you'll have two navPoint elements with ascending playOrder values, starting at 1. In the NCX, you have the opportunity to name these sections, allowing readers to jump into different parts of the eBook.
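The navMap structure can be walked recursively to produce a flat table of contents. The sketch below collects the playOrder, nesting depth, label, and src of each navPoint, and includes a simplified check that playOrder values start at 1 and increase. The names toc_entries and play_order_is_ascending are hypothetical, and the ordering check is a deliberate simplification rather than the full NCX rule.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# NCX namespace, as used in Listing 8.
NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# (playOrder, depth, label, src)
Entry = Tuple[int, int, str, Optional[str]]


def toc_entries(ncx_xml: str) -> List[Entry]:
    """Return the navPoints of an NCX document in document order."""
    root = ET.fromstring(ncx_xml)
    entries: List[Entry] = []

    def walk(parent: ET.Element, depth: int) -> None:
        for point in parent.findall("ncx:navPoint", NS):
            label = point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NS)
            content = point.find("ncx:content", NS)
            src = content.get("src") if content is not None else None
            entries.append((int(point.get("playOrder")), depth, label.strip(), src))
            # Child navPoint elements express the hierarchy.
            walk(point, depth + 1)

    nav_map = root.find("ncx:navMap", NS)
    if nav_map is not None:
        walk(nav_map, 1)
    return entries


def play_order_is_ascending(entries: List[Entry]) -> bool:
    """Simplified check: playOrder starts at 1 and strictly increases."""
    orders = [entry[0] for entry in entries]
    return (
        bool(orders)
        and orders[0] == 1
        and all(a < b for a, b in zip(orders, orders[1:]))
    )
```

The largest depth value found this way is also what dtb:depth in the NCX head is meant to reflect.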
Now you know all the metadata required in EPUB, so it's time to put in the actual book content. You can use the sample content provided in Downloads or create your own, as long as the file names match the metadata.
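Next, create these files and folder:
- title.html: The title page. Include an img element that references a cover image, with the value of the src attribute as images/cover.png.
- content.html: The main content page.
- stylesheet.css: The style sheet linked from the XHTML pages.
- images: The folder that holds the cover image, cover.png.
Listing 9 contains an example of a valid EPUB content page. Use this sample for your title page (title.html) and a similar one for the main content page (content.html) of your book.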
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Hello World: My First EPUB</title>
    <link type="text/css" rel="stylesheet" href="stylesheet.css" />
  </head>
  <body>
    <h1>Hello World: My First EPUB</h1>
    <div><img src="images/cover.png" alt="Title page"/></div>
  </body>
</html>
XHTML content in EPUB follows a few rules that might be unfamiliar to you from general Web development:
- The name attribute has been removed. (Use IDs to refer to anchors within content.)
- img elements can only reference images that are local to the eBook: The elements cannot reference images on the Web.
- script blocks should be avoided: There is no requirement for EPUB readers to support JavaScript code.
There are some minor differences in the way EPUB supports CSS, but none that affect common uses of styles (consult the OPS specification for details). Listing 10 demonstrates a simple CSS file that you can apply to the content to set basic font guidelines and to color headings in red.
body {
  font-family: sans-serif;
}
h1, h2, h3, h4 {
  font-family: serif;
  color: red;
}
One point of interest is that EPUB specifically supports the CSS 2 @font-face rule, which allows for embedded fonts. If you create technical documentation, this is probably not relevant, but developers who build EPUBs in multiple languages or for specialized domains will appreciate the ability to specify exact font data.
You now have everything you need to create your first EPUB book. In the next section, you'll bundle the book according to the OCF specifications and find out how to validate it.
The command echo "application/epub+zip" > mimetype inserts a newline in the mimetype file. Use echo -n "application/epub+zip" > mimetype instead; -n does not print the trailing newline character.
Posted by CarloTafuro on 31 December 2012
I took the sample from this tutorial to make my first EPUB. I found it strange that the mimetype file, which should be only 20 bytes, is 21 bytes in the sample, and that it requires only the OEBPS and META-INF folders; the other 2 files are not required. For those trying to find a way to build it easily, please provide an appropriate sample. I believe I will be able to find it in the updated tutorial.
Posted by fazeela on 06 July 2012
s/proscribe/prescribe/ as these are commonly confused and almost opposite in meaning; "proscribe" in this context would mean "prohibit", "prescribe" would mean "direct" or "specify"
Posted by matt-wartell on 09 May 2012
Removed syntax error in first line of Listing 14 per comment from Ursula_Kallio. Republished updated tutorial and PDF file.
Posted by v_dulcimer on 13 July 2011