***************************************************************************************
****         IMPORTANT: This filter is now deprecated                              ****
**** The current MSWord import filter can be found in koffice/filters/kword/msword ****
***************************************************************************************

Microsoft Word filter design
============================

This note describes the architecture of the Microsoft Word filter,
and some implementation details that should help navigate the code.

From the bottom up, we have...

A parser for individual on-disk structures
------------------------------------------

NOTE: This layer is machine-generated code!

       - mswordgenerated.{h,cc} is a layer of code generated
       from the specification at www.btinternet.com/~shaheedhaque.
       The code parses the on-disk format of every Word 97
       structure in that specification that has a simple C struct
       mapping.

       Each structure has a static read() function associated with
       it.
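
As a rough illustration of what such a generated read() might look like,
here is a minimal sketch. ExampleRecord and its fields are hypothetical
and are not taken from the specification or from mswordgenerated.{h,cc};
the only assumption carried over from the file format is that multi-byte
values are stored little-endian.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical on-disk record; the field names are illustrative only.
struct ExampleRecord {
    std::uint16_t flags;
    std::uint32_t offset;

    // Decode one little-endian record starting at `in` and return the
    // number of bytes consumed, so that callers can step through a buffer.
    static std::size_t read(const std::uint8_t *in, ExampleRecord *out)
    {
        assert(in != nullptr && out != nullptr);
        std::size_t n = 0;

        out->flags = static_cast<std::uint16_t>(in[n] | (in[n + 1] << 8));
        n += 2;

        out->offset = static_cast<std::uint32_t>(in[n])
                    | (static_cast<std::uint32_t>(in[n + 1]) << 8)
                    | (static_cast<std::uint32_t>(in[n + 2]) << 16)
                    | (static_cast<std::uint32_t>(in[n + 3]) << 24);
        n += 4;

        return n;
    }
};
```

Decoding byte by byte, rather than casting the buffer to a struct pointer,
keeps the sketch independent of host byte order and struct padding.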

A parser for a complete .doc file
---------------------------------

This layer has all the ugly, hand-written code that mucks with
the streams of data in the .doc document's Word OLE stream.

       - msword.{h,cc} is a layer of code that provides a parse()
       routine which invokes callbacks for the major syntactical
       elements in a Word document.

       Callbacks (a.k.a. virtual functions) are used to avoid
       an in-memory representation of the bulk of a (large)
       document.

       It is intended that accessor functions will be provided for
       all the meta-data in the document (e.g. getAuthor()) since
       this has no "ordering" within the document and is not a
       scalability issue.

       This layer also provides custom code for reading on-disk
       structures. This code augments the mswordgenerated layer in
       two respects:

               - providing template-based access to some common
               array-like structures ("Plex" and "FKP" structures).

               - providing overrides for the mswordgenerated read()
               when differences between Word 97 and other versions
               must be supported.

       - properties.{h,cc} is a simple encapsulation of the
       properties of a region of text (a.k.a. a "run" of text). The
       idea is to take the formatting information for a run, which is
       scattered around a .doc file, and gather it into one fully
       expanded form.

       In other words, we take the base properties for a run, and
       apply all the exceptions to end up with the complete format
       in one object. This is then delivered along with runs of text
       from the callbacks in msword.{h,cc}.
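
The two ideas in this layer, callbacks instead of an in-memory document
and fully expanded run properties, can be sketched together as follows.
This is a simplified illustration and not the actual msword.{h,cc} or
properties.{h,cc} API: RunProperties, Exception, Run, Parser, gotRun()
and Collector are all hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily simplified formatting for one run of text.
struct RunProperties {
    bool bold = false;
    bool italic = false;
    int fontSize = 10;
};

// Hypothetical "exception": a single override applied on top of the base.
struct Exception {
    enum Kind { SetBold, SetItalic, SetFontSize };
    Kind kind;
    int value;
};

// Start from a copy of the base properties and apply each exception in
// order, giving the complete format for the run in one object.
RunProperties expand(RunProperties base, const std::vector<Exception> &exceptions)
{
    for (const Exception &e : exceptions) {
        switch (e.kind) {
        case Exception::SetBold:     base.bold = (e.value != 0); break;
        case Exception::SetItalic:   base.italic = (e.value != 0); break;
        case Exception::SetFontSize: base.fontSize = e.value; break;
        }
    }
    return base;
}

// A run as it might be found on disk: text plus unexpanded exceptions.
struct Run {
    std::string text;
    std::vector<Exception> exceptions;
};

// parse() hands each run to a virtual callback as soon as it is decoded,
// so the parser itself never stores the whole document.
class Parser {
public:
    virtual ~Parser() {}
    void parse(const std::vector<Run> &runs, const RunProperties &base)
    {
        for (const Run &run : runs)
            gotRun(run.text, expand(base, run.exceptions));
    }
protected:
    virtual void gotRun(const std::string &text, const RunProperties &props) = 0;
};

// Example consumer that simply records what it is given.
class Collector : public Parser {
public:
    std::vector<std::string> texts;
    std::vector<RunProperties> formats;
protected:
    void gotRun(const std::string &text, const RunProperties &props) override
    {
        texts.push_back(text);
        formats.push_back(props);
    }
};

// Usage: the second run overrides the base with two exceptions.
void demoExpandedRuns()
{
    std::vector<Run> runs;
    runs.push_back(Run{"Hello ", {}});
    runs.push_back(Run{"world", {{Exception::SetBold, 1}, {Exception::SetFontSize, 14}}});

    Collector out;
    out.parse(runs, RunProperties());
    assert(!out.formats[0].bold && out.formats[0].fontSize == 10);
    assert(out.formats[1].bold && out.formats[1].fontSize == 14);
}
```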

An abstraction layer to hide Microsoft-isms
-------------------------------------------

This layer has more hand-written code.

       - document.{h,cc} is intended to hide the Microsoft-isms
       in the msword.{h,cc} layer and present an easy to use
       parse()-with-callbacks API.

       Thus, this layer would hide the fact that the insertion
       points for embedded pictures and objects are coded
       with funny characters etc., and simply add extra callbacks
       to output them instead. Similarly, this layer will eventually
       strip the ASCII(13) that marks the end of each paragraph.

       Thus, each callback from the msword.{h,cc} layer has an
       analogue in this layer (more or less). Compare the prototypes
       of the callbacks in the private part of this class, which
       override the output callbacks from msword, with the prototypes
       of the output callbacks defined by this layer to see the
       reason for this layer.
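
The pattern described above, a private override of a low-level callback
that re-emits cleaner high-level callbacks, might look roughly like the
sketch below. RawParser stands in for the msword.{h,cc} layer, and all
names here (gotRawParagraph(), gotParagraph(), gotPicture(), Recorder)
are hypothetical. The placeholder character value is likewise an
assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const char kParagraphEnd = '\r';          // ASCII(13)
const char kPicturePlaceholder = '\x01';  // assumed value, for illustration

// Stand-in for the lower layer: delivers raw paragraph text, still
// containing the paragraph mark and any placeholder characters.
class RawParser {
public:
    virtual ~RawParser() {}
    void parse(const std::vector<std::string> &rawParagraphs)
    {
        for (const std::string &raw : rawParagraphs)
            gotRawParagraph(raw);
    }
protected:
    virtual void gotRawParagraph(const std::string &raw) = 0;
};

// The abstraction layer: overrides the raw callback privately and
// exposes its own, simpler callbacks to the output layer.
class Document : public RawParser {
protected:
    virtual void gotParagraph(const std::string &text) = 0;
    virtual void gotPicture(std::size_t offsetInParagraph) = 0;
private:
    void gotRawParagraph(const std::string &raw) override
    {
        std::string body = raw;
        if (!body.empty() && body.back() == kParagraphEnd)
            body.pop_back();  // strip the trailing paragraph mark

        std::string text;
        for (char c : body) {
            if (c == kPicturePlaceholder)
                gotPicture(text.size());  // report the insertion point
            else
                text += c;
        }
        gotParagraph(text);
    }
};

// Example output layer that records the high-level callbacks.
class Recorder : public Document {
public:
    std::vector<std::string> paragraphs;
    std::vector<std::size_t> pictureOffsets;
protected:
    void gotParagraph(const std::string &text) override { paragraphs.push_back(text); }
    void gotPicture(std::size_t offset) override { pictureOffsets.push_back(offset); }
};

// Usage: one paragraph with a picture placeholder after "ab".
void demoDocumentLayer()
{
    std::string raw = "ab";
    raw += kPicturePlaceholder;
    raw += "cd";
    raw += kParagraphEnd;

    Recorder out;
    out.parse(std::vector<std::string>(1, raw));
    assert(out.paragraphs[0] == "abcd");
    assert(out.pictureOffsets[0] == 2);
}
```

The output layer only ever sees gotParagraph() and gotPicture(), so it
does not need to know how either is encoded in the raw text.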

An output layer
---------------

This layer converts the callbacks from the document.{h,cc} layer into
the required output format. Currently, there is only one implementation:

       - winworddoc.{h,cc} outputs KWord-compatible XML. It does
       so by implementing the callbacks defined by document.{h,cc}
       and then calling its parse() routine.

       Actually, parse() is called twice, mostly to allow the table
       handling code to work out the gory details of variable-width
       columns.
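
One way to see why two passes help with variable-width columns: a row's
cells can only be mapped onto a shared column grid once the cell edges of
every row are known. The sketch below illustrates that idea under stated
assumptions. It is not the winworddoc.{h,cc} implementation, and Row,
collectEdges() and columnOf() are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Each row is described by its cell edge positions, from the left edge
// of the first cell to the right edge of the last cell.
typedef std::vector<int> Row;

// Pass 1: gather every edge seen in any row into a sorted list without
// duplicates. Adjacent entries bound one column of the final grid.
std::vector<int> collectEdges(const std::vector<Row> &rows)
{
    std::vector<int> edges;
    for (const Row &row : rows)
        edges.insert(edges.end(), row.begin(), row.end());
    std::sort(edges.begin(), edges.end());
    edges.erase(std::unique(edges.begin(), edges.end()), edges.end());
    return edges;
}

// Pass 2: translate an edge position into its index in the grid. A cell
// running from edge a to edge b then spans columns columnOf(a) up to,
// but not including, columnOf(b).
std::size_t columnOf(const std::vector<int> &edges, int edge)
{
    std::vector<int>::const_iterator it =
        std::lower_bound(edges.begin(), edges.end(), edge);
    assert(it != edges.end() && *it == edge);  // edge must come from pass 1
    return static_cast<std::size_t>(it - edges.begin());
}
```

For example, two rows with edges {0, 100, 300} and {0, 200, 300} produce
the grid {0, 100, 200, 300}, so the first cell of the second row spans
two grid columns.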
