The thematic catalogs have an external representation that allows easy
transportation of their content. The format used is XML. The file
containing the XML representation of a catalog is named with the .rdf
extension. When the XML/RDF conventions are better supported, the
catalogs will eventually use these conventions, hence the extension.
You should not invest too much in the current XML format because
it is likely to change drastically in the next few months. Nevertheless,
we have found it very convenient to have a text representation of the catalogs,
especially for importing data from various sources.
9.1 XML short example
9.2 XML document encoding
9.3 XML structure
9.4 dmoz.org

9.1 XML short example

Here is a short example of an XML file:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<RDF xmlns:rdf="http://www.w3.org/TR/1999/REC-rdf-syntax-19990222#"
     xmlns="http://www.senga.org/">
  <Table>
    <![CDATA[
    CREATE TABLE urldemo (
      rowid int(11) DEFAULT '0' NOT NULL auto_increment,
      created datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
      modified timestamp(14),
      info enum('active','inactive') DEFAULT 'active',
      url char(128),
      comment char(255),
      UNIQUE cdemo1 (rowid)
    )
    ]]>
  </Table>
  <Catalog>
    <navigation>theme</navigation>
    <tablename>urldemo</tablename>
    <name>urltheme</name>
  </Catalog>
  <Category>
    <name>News</name>
    <rowid>12</rowid>
    <parent>1</parent>
  </Category>
  <Link>
    <row>135</row>
    <category>12</category>
  </Link>
  <Record table="urldemo">
    <url>http://www.mediaslink.com/</url>
    <comment>Medias Link</comment>
    <rowid>135</rowid>
  </Record>
  <Sync/>
</RDF>
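
To check that a file of this shape is well formed before handing it to Catalog, it can be read with any XML parser. The following is a minimal sketch in Python, which is not part of Catalog; the helper names summarize and local_name are hypothetical, and the Table element is left out of the sample to keep it short. Because the root element declares http://www.senga.org/ as its default namespace, every element name must be qualified with that namespace when searching.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the RDF root element in the example.
NS = {"c": "http://www.senga.org/"}

SAMPLE = """<?xml version="1.0" encoding="ISO-8859-1" ?>
<RDF xmlns:rdf="http://www.w3.org/TR/1999/REC-rdf-syntax-19990222#"
     xmlns="http://www.senga.org/">
  <Catalog>
    <navigation>theme</navigation>
    <tablename>urldemo</tablename>
    <name>urltheme</name>
  </Catalog>
  <Category>
    <name>News</name>
    <rowid>12</rowid>
    <parent>1</parent>
  </Category>
  <Link>
    <row>135</row>
    <category>12</category>
  </Link>
  <Record table="urldemo">
    <url>http://www.mediaslink.com/</url>
    <comment>Medias Link</comment>
    <rowid>135</rowid>
  </Record>
  <Sync/>
</RDF>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize(xml_bytes):
    """Return the catalog name and the names of the top-level elements."""
    root = ET.fromstring(xml_bytes)
    catalog_name = root.find("c:Catalog/c:name", NS).text
    return catalog_name, [local_name(child.tag) for child in root]


# Pass bytes so the parser honours the declared ISO-8859-1 encoding.
catalog_name, elements = summarize(SAMPLE.encode("iso-8859-1"))
print(catalog_name, elements)
```

A file that is not well formed, for instance one with a missing opening tag, makes ET.fromstring raise xml.etree.ElementTree.ParseError, which is a cheap way to catch damage before an import.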

9.2 XML document encoding

The encoding of an XML document is specified in the <?xml ... ?> line
at the beginning. Accepted encodings are:
More encodings should become available as the XML manipulation library evolves.
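
When preparing files from various sources it can help to check which encoding a file declares before importing it. The sketch below is an illustration in Python, not a Catalog tool; the function name declared_encoding is hypothetical. It assumes an ASCII-compatible encoding, so it would not detect UTF-16 files, and it falls back to UTF-8, the default the XML specification prescribes when no declaration or byte order mark is present.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the document (ASCII-compatible encodings only).
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, default="UTF-8"):
    """Return the encoding named in the <?xml ... ?> line, or the default."""
    match = _DECLARATION.match(data)
    return match.group(1).decode("ascii") if match else default


with_declaration = declared_encoding(
    b'<?xml version="1.0" encoding="ISO-8859-1" ?><RDF/>'
)
without_declaration = declared_encoding(b"<RDF/>")
print(with_declaration, without_declaration)
```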

9.3 XML structure

When an element is said to describe a record, it means that it contains elements whose names are record field names and whose content is the value of the corresponding field. For instance:

<Record table="urldemo">
  <url>http://www.senga.org/</url>
  <comment>Senga</comment>
</Record>

defines a record of the urldemo table with two fields (url and comment) whose values are, respectively, http://www.senga.org/ and Senga.
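
The correspondence between child elements and record fields can be sketched as follows. This is an illustration in Python of the convention just described, not the code Catalog uses; the function name record_to_row is hypothetical. Note that a well-formed XML document requires attribute values to be quoted, so the table attribute is written table="urldemo" here.

```python
import xml.etree.ElementTree as ET


def record_to_row(record_xml):
    """Return (table name, {field name: value}) for one Record element."""
    element = ET.fromstring(record_xml)
    table = element.get("table")
    # Each child element name is a field name, its text is the field value.
    fields = {child.tag: (child.text or "") for child in element}
    return table, fields


table, fields = record_to_row(
    '<Record table="urldemo">'
    "<url>http://www.senga.org/</url>"
    "<comment>Senga</comment>"
    "</Record>"
)
print(table, fields)
```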
The Catalog element describes a record of the catalog table (see section catalog). The remainder of the file relates to the catalog described in this element. There must be only one Catalog element in a given file.

The Category element describes a record of the catalog_category table (see section catalog_category_NAME). The pseudo field parent builds a record in the catalog_category2category table linking the category to its parent (see section catalog_category2category_NAME).

The Link element describes a record of the catalog_entry2category table (see section catalog_entry2category_NAME).

A record of the catalog_category2category table (see section catalog_category2category_NAME) can also be described directly. The info field of such a record is automatically set to symbolic.
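
The effect of the parent pseudo field can be pictured with a small sketch. It is written in Python for illustration and does not reproduce Catalog's own loader: the function name category_rows and the keys parent and child used for the linking record are hypothetical, not the actual column names of the catalog_category2category table.

```python
import xml.etree.ElementTree as ET


def category_rows(category_xml):
    """Split one Category element into a category record and a link record."""
    element = ET.fromstring(category_xml)
    fields = {child.tag: child.text for child in element}
    # parent is a pseudo field: it is not stored in the category record.
    parent = fields.pop("parent", None)
    link = None
    if parent is not None:
        # Hypothetical keys; they only show that the link record ties the
        # category (child) to the category named by the parent field.
        link = {"parent": parent, "child": fields.get("rowid")}
    return fields, link


category, link = category_rows(
    "<Category><name>News</name><rowid>12</rowid><parent>1</parent></Category>"
)
print(category, link)
```

With the Category element of the short example, the category record keeps name and rowid, and the link record ties category 12 to its parent, category 1.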

9.4 dmoz.org

The site http://www.dmoz.org/ provides a dump of its catalog data. The format of the dump is a custom XML that resembles RDF but is not actually RDF. Since the XML format of dmoz.org and the XML format of Catalog are not compatible, the convert_dmoz command is provided to perform the translation. It must be called from the command line.
Since the dmoz.org catalog has specific requirements, a specialized version of Catalog is also provided. If you access Catalog using the CGIDIR/dmoz CGI script instead of CGIDIR/Catalog, you will use this specialized version.
The easiest way to reach it is to start from the Catalog home page that is installed with the product at http://localhost/Catalog/ and follow the DMOZ Control Panel link. Alternatively, you can jump directly to http://localhost/cgi-bin/DMOZ/dmoz?context=ccontrol_panel.
We have loaded a version of dmoz.org that contains approximately 1,500,000 records and around 250,000 categories on a Pentium 450. The result is a 500 MB MySQL database, and the load takes about one hour. The response time when navigating the categories is excellent, provided you are using Apache + mod_perl.
The memory used is around 70 MB during the conversion and 10 MB during the load.
In order to load dmoz.org data using Catalog, you must follow the steps listed below. This procedure assumes that you have created a database named dmoz and a catalog named dmoz within this database. A dmoz database is created during the installation process. If you do not have it, create it with the following command:

mysql -e "create database dmoz"

convert_dmoz -exclude '^/Adult' -what content content.rdf.gz

convert_dmoz -load all ~/dmoz
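
The -exclude argument in the conversion command above has the form of a regular expression anchored at the start of a category path. The sketch below, in Python, shows what such a pattern would and would not match. It rests on the assumption that the pattern is applied to category paths beginning with a slash; the sample paths are invented for the illustration and the exact matching rules of convert_dmoz are not shown here.

```python
import re

# Same pattern as in the convert_dmoz command line above.
EXCLUDE = re.compile(r"^/Adult")


def is_kept(path):
    """Return True when the category path is not excluded by the pattern."""
    return EXCLUDE.search(path) is None


# Hypothetical category paths, for illustration only.
paths = ["/Adult/Arts", "/Arts/Movies", "/Society/Adult_Education"]
kept = [path for path in paths if is_kept(path)]
print(kept)
```

Because of the ^ anchor, only paths that start with /Adult are dropped; a path that merely contains the word further down is kept.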
After a while, you will want to load a new version of the dmoz.org data. This can be done using the same commands. The problem is that the catalog is unavailable to users while you do so, because the data are first removed and then repopulated. Catalog does not currently offer support for user-transparent reloading. Instead, we suggest you follow these steps: