libextractor
============

libextractor is a simple library for keyword extraction.  libextractor
does not support all formats, but it provides a simple plugin mechanism
so that you can quickly add extractors for additional formats, even
without recompiling libextractor.  libextractor typically ships with a
dozen helper libraries that can be used to obtain keywords from common
file types.

libextractor is a part of the GNU project (http://www.gnu.org/).



extract
=======

extract is a simple command-line interface to libextractor.



Dependencies
============

libextractor requires Python (2.3, preferably 2.4, including the
development files) and a JNI header file (jni.h) for Java.  Further
requirements include:

* libvorbisfile
* zlib (compression library)
* a C++ compiler
* libltdl (from GNU libtool)
* GNU gettext
* glib 2.6
* gtk 2.6 (for thumbnails, gdk-pixbuf)

When building libextractor binaries, please make sure all of these
dependencies are available.  Otherwise the build system may
automatically build only a subset of libextractor.



Writing plugins
===============


If you want to write your own extractor for some filetype, all you
need to do is write a little library that implements a single method
with this signature:


KeywordList * <libraryname>_extract(const char * filename,
                                    char * data,
                                    size_t size,
                                    KeywordList * prev,
                                    const char * options);

where <libraryname> is the name of the library file that you will tell
libextractor to load, minus the suffix.  For example, if you link your
extractor into a file called 'myextractor.so', the method above should
be called 'myextractor_extract'.

The filename is the name of the file and may be NULL, data is a pointer
to the contents of the file and size is the size of the file in bytes.
The extract method must prepend the keywords that it finds to the
linked list 'prev' and return the new head.  The library must allocate
(malloc) both the entry in the keyword list and the memory for the
keyword string, since both will be freed by libextractor once the
application calls freeKeywords.

An example implementation can be found in mp3extractor.c.
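
The contract described above can be illustrated with a minimal sketch.
It is not taken from the libextractor sources: the KeywordList
definition below is only a stand-in so that the example compiles on its
own (a real plugin would include libextractor's header instead, and the
real field names may differ).  The "MYFMT" magic bytes, the keyword
text, the type value 0 and the helper prepend_keyword are likewise
hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* ASSUMPTION: stand-in for the list type from libextractor's header.
 * The field names here are illustrative and may not match the real ones. */
typedef struct KeywordList {
  char * keyword;             /* malloc'ed keyword string */
  int keywordType;            /* kind of keyword; values are hypothetical */
  struct KeywordList * next;  /* next entry in the list, or NULL */
} KeywordList;

/* Hypothetical helper: malloc a new entry plus a private copy of the
 * keyword text, and link it in front of 'prev'.  If an allocation
 * fails, the list is returned unchanged. */
static KeywordList * prepend_keyword(const char * text,
                                     int type,
                                     KeywordList * prev) {
  KeywordList * entry = malloc(sizeof(KeywordList));
  if (entry == NULL)
    return prev;
  entry->keyword = malloc(strlen(text) + 1);
  if (entry->keyword == NULL) {
    free(entry);
    return prev;
  }
  strcpy(entry->keyword, text);
  entry->keywordType = type;
  entry->next = prev;
  return entry;
}

/* Entry point for a library linked as 'myextractor.so'. */
KeywordList * myextractor_extract(const char * filename,
                                  char * data,
                                  size_t size,
                                  KeywordList * prev,
                                  const char * options) {
  (void) filename;  /* may be NULL, so it is not used here */
  (void) options;   /* unused in this sketch */

  /* Hypothetical format check: only handle data starting with "MYFMT". */
  if (data == NULL || size < 5 || memcmp(data, "MYFMT", 5) != 0)
    return prev;    /* not our format: return the list unchanged */

  /* Prepend one keyword and return the new head of the list. */
  return prepend_keyword("application/x-myformat", 0, prev);
}
```

Returning 'prev' unchanged when the data is not recognized keeps the
keywords found by earlier extractors intact, and copying the keyword
text with malloc ensures that both the entry and the string can safely
be freed later.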



Notes
=====

libextractor contains some very large C files.  gcc can easily use
over (!) 100 MB of memory to compile them.  If you have that much
memory, libextractor will compile in about a minute.  If you don't,
you may want to consider using prebuilt binaries instead.

On Mac OS X, libextractor will avoid using GCC 3.1, because of
problems compiling one of the extractors.  GCC 3.3 and 2.95.2 are
known to work well; as such, libextractor will first look for 3.3 (by
attempting to run gcc-3.3, cpp-3.3, and g++-3.3) and then 2.95.2 (by
attempting to run gcc2 and g++2).

exiv2 requires G++ 3.0 or higher.  With older GCC versions (and other
broken C++ compilers), you have to manually disable exiv2 by passing
"--disable-exiv2" to "configure" in order to avoid compilation
problems.
