==========================
 Docutils_ Hacker's Guide
==========================

:Author: Lea Wiemann
:Contact: docutils-develop@lists.sourceforge.net
:Revision: $Revision: 7302 $
:Date: $Date: 2012-01-03 20:23:53 +0100 (Di, 03 Jan 2012) $
:Copyright: This document has been placed in the public domain.

:Abstract: This is the introduction to Docutils for all persons who
    want to extend Docutils in some way.
:Prerequisites: You have used reStructuredText_ and played around with
    the `Docutils front-end tools`_ before. Some (basic) Python
    knowledge is certainly helpful (though not necessary, strictly
    speaking).

.. _Docutils: http://docutils.sourceforge.net/
.. _reStructuredText: http://docutils.sourceforge.net/rst.html
.. _Docutils front-end tools: ../user/tools.html

.. contents::


Overview of the Docutils Architecture
=====================================

To give you an understanding of the Docutils architecture, we'll dive
right into the internals using a practical example.

Consider the following reStructuredText file::

    My *favorite* language is Python_.

    .. _Python: http://www.python.org/

Using the ``rst2html.py`` front-end tool, you would get an HTML output
which looks like this::

    [uninteresting HTML code removed]
    <body>
    <div class="document">
    <p>My <em>favorite</em> language is <a class="reference" href="http://www.python.org/">Python</a>.</p>
    </div>
    </body>
    </html>

While this looks very simple, it's enough to illustrate all internal
processing stages of Docutils. Let's see how this document is
processed from the reStructuredText source to the final HTML output:


Reading the Document
--------------------

The **Reader** reads the document from the source file and passes it
to the parser (see below). The default reader is the standalone
reader (``docutils/readers/standalone.py``) which just reads the input
data from a single text file. Unless you want to do really fancy
things, there is no need to change that.

Since you probably won't need to touch readers, we will just move on
to the next stage:


Parsing the Document
--------------------

The **Parser** analyzes the input document and creates a **node
tree** representation. In this case we are using the
**reStructuredText parser** (``docutils/parsers/rst/__init__.py``).
To see what that node tree looks like, we call ``quicktest.py`` (which
can be found in the ``tools/`` directory of the Docutils distribution)
with our example file (``test.txt``) as first parameter (Windows users
might need to type ``python quicktest.py test.txt``)::

    $ quicktest.py test.txt
    <document source="test.txt">
        <paragraph>
            My
            <emphasis>
                favorite
             language is
            <reference name="Python" refname="python">
                Python
            .
        <target ids="python" names="python" refuri="http://www.python.org/">

Let us now examine the node tree:

The top-level node is ``document``. It has a ``source`` attribute
whose value is ``test.txt``. There are two children: A ``paragraph``
node and a ``target`` node. The ``paragraph`` in turn has children: A
text node ("My "), an ``emphasis`` node, a text node (" language is "),
a ``reference`` node, and again a ``Text`` node (".").

These node types (``document``, ``paragraph``, ``emphasis``, etc.) are
all defined in ``docutils/nodes.py``. The node types are internally
arranged as a class hierarchy (for example, both ``emphasis`` and
``reference`` have the common superclass ``Inline``). To get an
overview of the node class hierarchy, use epydoc (type ``epydoc
nodes.py``) and look at the class hierarchy tree.

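The parser stage can also be run on its own from Python. The
following is a minimal sketch, not a description of how
``quicktest.py`` is implemented. It assumes a Docutils installation,
and the helper name ``parse_only`` is made up for this example.
Settings are obtained with ``docutils.frontend.get_default_settings()``
where it is available, falling back to the older ``OptionParser``
interface otherwise::

    import docutils.frontend
    import docutils.parsers.rst
    import docutils.utils

    SOURCE = """\
    My *favorite* language is Python_.

    .. _Python: http://www.python.org/
    """


    def parse_only(text, source_path="test.txt"):
        """Run only the reStructuredText parser; no transforms are applied."""
        parser_class = docutils.parsers.rst.Parser
        try:
            # Newer Docutils releases provide this convenience function.
            settings = docutils.frontend.get_default_settings(parser_class)
        except AttributeError:
            # Older releases: build the default settings via OptionParser.
            settings = docutils.frontend.OptionParser(
                components=(parser_class,)).get_default_values()
        document = docutils.utils.new_document(source_path, settings)
        parser_class().parse(text, document)
        return document


    document = parse_only(SOURCE)
    paragraph = document[0]     # first child of the document node
    target = document[1]        # second child of the document node
    reference = paragraph[3]    # fourth child of the paragraph
    print(document.pformat())   # pseudo-XML dump of the node tree

At this point ``reference`` still carries a ``refname`` attribute and
no ``refuri``, matching the ``quicktest.py`` output shown above.
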


Transforming the Document
-------------------------

In the node tree above, the ``reference`` node does not contain the
target URI (``http://www.python.org/``) yet.

Assigning the target URI (from the ``target`` node) to the
``reference`` node is *not* done by the parser (the parser only
translates the input document into a node tree).

Instead, it's done by a **Transform**. In this case (resolving a
reference), it's done by the ``ExternalTargets`` transform in
``docutils/transforms/references.py``.

In fact, there are quite a lot of Transforms, which do various useful
things like creating the table of contents, applying substitution
references or resolving auto-numbered footnotes.

The Transforms are applied after parsing. To see how the node tree
has changed after applying the Transforms, we use the
``rst2pseudoxml.py`` tool:

.. parsed-literal::

    $ rst2pseudoxml.py test.txt
    <document source="test.txt">
        <paragraph>
            My
            <emphasis>
                favorite
             language is
            <reference name="Python" **refuri="http://www.python.org/"**>
                Python
            .
        <target ids="python" names="python" ``refuri="http://www.python.org/"``>

For our small test document, the only change is that the ``refname``
attribute of the reference has been replaced by a ``refuri``
attribute |---| the reference has been resolved.

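The same effect can be observed from Python. As a rough sketch
(assuming a Docutils installation; the variable names are only
illustrative), ``docutils.core.publish_doctree()`` reads and parses
the source and applies the transforms, so the ``reference`` node in
the returned tree is already resolved::

    import docutils.core

    SOURCE = """\
    My *favorite* language is Python_.

    .. _Python: http://www.python.org/
    """

    # Reader, parser and transforms all run inside publish_doctree().
    doctree = docutils.core.publish_doctree(SOURCE)

    reference = doctree[0][3]          # fourth child of the paragraph
    print(reference['refuri'])         # the URI copied from the target
    print(reference.get('refname'))    # the unresolved name is gone
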

While this does not look very exciting, transforms are a powerful tool
for applying any kind of transformation to the node tree.

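To give an idea of what a transform looks like, here is a minimal
sketch of a custom one. The class ``EmphasisToStrong`` is hypothetical
and not part of Docutils. It subclasses
``docutils.transforms.Transform`` and is applied by hand to a tree
obtained from ``docutils.core.publish_doctree()``, rather than being
registered with a reader or writer. The priority value is an arbitrary
choice for this example::

    import docutils.core
    from docutils import nodes
    from docutils.transforms import Transform


    class EmphasisToStrong(Transform):
        """Hypothetical transform: replace each emphasis node by a strong node."""

        # Arbitrary value; only relevant when run by the Transformer.
        default_priority = 700

        def apply(self):
            # findall() exists in newer Docutils; traverse() is the older name.
            find = (getattr(self.document, 'findall', None)
                    or self.document.traverse)
            # Collect the matches first, because the tree is modified below.
            for node in list(find(nodes.emphasis)):
                node.replace_self(nodes.strong('', '', *node.children))


    doctree = docutils.core.publish_doctree(
        "My *favorite* language is Python.\n")
    EmphasisToStrong(doctree).apply()
    print(doctree.pformat())

Whether such a change belongs in a transform or in a writer depends on
the use case; this sketch only shows the mechanics.
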

By the way, you can also get a "real" XML representation of the node
tree by using ``rst2xml.py`` instead of ``rst2pseudoxml.py``.


Writing the Document
--------------------

To get an HTML document out of the node tree, we use a **Writer**, the
HTML writer in this case (``docutils/writers/html4css1.py``).

The writer receives the node tree and returns the output document.
For HTML output, we can test this using the ``rst2html.py`` tool::

    $ rst2html.py --link-stylesheet test.txt
    <?xml version="1.0" encoding="utf-8" ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta name="generator" content="Docutils 0.3.10: http://docutils.sourceforge.net/" />
    <title></title>
    <link rel="stylesheet" href="../docutils/writers/html4css1/html4css1.css" type="text/css" />
    </head>
    <body>
    <div class="document">
    <p>My <em>favorite</em> language is <a class="reference" href="http://www.python.org/">Python</a>.</p>
    </div>
    </body>
    </html>

So here we finally have our HTML output. The actual document contents
are in the fourth-last line. Note, by the way, that the HTML writer
did not render the (invisible) ``target`` node |---| only the
``paragraph`` node and its children appear in the HTML output.

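The writer can also be driven from Python. The sketch below is an
illustration under the assumption that Docutils is installed, not a
description of how ``rst2html.py`` works internally. It calls
``docutils.core.publish_parts()`` with the ``html4css1`` writer and
picks the ``body`` part, which holds the rendered document content
without the HTML head::

    import docutils.core

    SOURCE = """\
    My *favorite* language is Python_.

    .. _Python: http://www.python.org/
    """

    # publish_parts() returns a dictionary of output fragments.
    parts = docutils.core.publish_parts(SOURCE, writer_name='html4css1')
    body = parts['body']
    print(body)

The exact markup (for example the ``class`` value on the link) may
differ between Docutils versions.
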


Extending Docutils
==================

Now you'll ask, "how do I actually extend Docutils?"

First of all, once you are clear about *what* you want to achieve, you
have to decide *where* to implement it |---| in the Parser (e.g. by
adding a directive or role to the reStructuredText parser), as a
Transform, or in the Writer. There is often one obvious choice among
those three (Parser, Transform, Writer). If you are unsure, ask on
the Docutils-develop_ mailing list.

In order to find out how to start, it is often helpful to look at
similar features which are already implemented. For example, if you
want to add a new directive to the reStructuredText parser, look at
the implementation of a similar directive in
``docutils/parsers/rst/directives/``.

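As an illustration of the Parser option, the following sketch
registers a small custom directive. The directive name ``shout`` and
the class ``Shout`` are invented for this example; ``Directive`` and
``directives.register_directive()`` come from ``docutils.parsers.rst``.
Treat it as a starting point that assumes a reasonably recent
Docutils, not as a recommended design::

    from docutils import nodes
    from docutils.core import publish_doctree
    from docutils.parsers.rst import Directive, directives


    class Shout(Directive):
        """Hypothetical directive: emit its content as an upper-case paragraph."""

        has_content = True   # the directive accepts an indented content block

        def run(self):
            # self.content holds the lines of the directive's content block.
            text = ' '.join(self.content).upper()
            # run() returns a list of nodes to insert into the tree.
            return [nodes.paragraph(text, text)]


    # Make the directive known to the reStructuredText parser.
    directives.register_directive('shout', Shout)

    doctree = publish_doctree(".. shout::\n\n   hello docutils\n")
    print(doctree.pformat())
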


Modifying the Document Tree Before It Is Written
------------------------------------------------

You can modify the document tree right before the writer is called.
One possibility is to use the publish_doctree_ and
publish_from_doctree_ functions.

To retrieve the document tree, call::

    document = docutils.core.publish_doctree(...)

Please see the docstring of publish_doctree for a list of parameters.

.. XXX Need to write a well-readable list of (commonly used) options
   of the publish_* functions. Probably in api/publisher.txt.

``document`` is the root node of the document tree. You can now
change the document by accessing the ``document`` node and its
children |---| see `The Node Interface`_ below.

When you're done with modifying the document tree, you can write it
out by calling::

    output = docutils.core.publish_from_doctree(document, ...)

.. _publish_doctree: ../api/publisher.html#publish_doctree
.. _publish_from_doctree: ../api/publisher.html#publish_from_doctree

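Put together, the two calls above can be sketched as a complete round
trip. This is only an example under stated assumptions: the input
text, the appended paragraph, and the choice of the ``html4css1``
writer are arbitrary, and the ``output_encoding`` override is used
here to get a text string back instead of encoded bytes::

    import docutils.core
    from docutils import nodes

    SOURCE = "My *favorite* language is Python.\n"

    # Reader, parser and transforms run here; nothing is written yet.
    document = docutils.core.publish_doctree(SOURCE)

    # Modify the tree: append a new paragraph to the document root.
    document += nodes.paragraph('', 'Added before writing.')

    # Hand the modified tree to a writer.
    output = docutils.core.publish_from_doctree(
        document, writer_name='html4css1',
        settings_overrides={'output_encoding': 'unicode'})
    print(output)
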


The Node Interface
------------------

As described in the overview above, Docutils' internal representation
of a document is a tree of nodes. We'll now have a look at the
interface of these nodes.

(To be completed.)

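As a first, non-authoritative taste of that interface, the sketch
below builds the example paragraph by hand using classes from
``docutils/nodes.py`` and then inspects it. Only a handful of
operations are shown (child access by index, attribute access by key,
``astext()``, ``parent``, ``tagname``); it is not a complete
description of the interface::

    from docutils import nodes

    # Build the paragraph from the overview example by hand.
    paragraph = nodes.paragraph()
    paragraph += nodes.Text('My ')
    paragraph += nodes.emphasis('', 'favorite')
    paragraph += nodes.Text(' language is ')
    paragraph += nodes.reference('', 'Python', name='Python',
                                 refuri='http://www.python.org/')
    paragraph += nodes.Text('.')

    emphasis = paragraph[1]      # children are accessed by index
    reference = paragraph[3]

    print(paragraph.tagname)                    # name of the node type
    print(len(paragraph.children))              # number of direct children
    print(reference['refuri'])                  # attributes are accessed by key
    print(emphasis.parent is paragraph)         # a child knows its parent
    print(isinstance(emphasis, nodes.Inline))   # class hierarchy in nodes.py
    print(paragraph.astext())                   # text content of the subtree
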


What Now?
=========

This document is not complete. Many topics could (and should) be
covered here. To find out which topics we should write about
first, we are awaiting *your* feedback. So please ask your questions
on the Docutils-develop_ mailing list.


.. _Docutils-develop: ../user/mailing-lists.html#docutils-develop


.. |---| unicode:: 8212 .. em-dash
   :trim:


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: