
6. The configuration file

6.1 General syntax

ht://Check uses a flexible configuration file. This configuration file is a plain ASCII text file. Each line in the file is either a comment or contains an attribute. Comment lines are blank lines or lines that start with a '#'.

6.2 Attributes

Attributes consist of a variable name and an associated value:

<name>:<whitespace><value><newline> 

The name may contain alphanumeric characters and underscores (_).

The value can include any character except newline. It cannot start with spaces or tabs, since those are considered part of the whitespace after the colon. Keep in mind that any trailing spaces or tabs are kept as part of the value.

It is possible to split the value across several lines of the configuration file by ending each line with a backslash (\). The effect on the value is that a space is added where the line split occurs.

If ht://Check needs a particular attribute and it is not in the configuration file, it will use the default value which is defined in htcommon/defaults.cc of the source directory.
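
The syntax rules above can be sketched as a small parser. This is only an illustration in Python, not the parser ht://Check actually uses; the function name parse_config is hypothetical, and trimming the whitespace around a line split before inserting the single space is an assumption.

```python
import re

def parse_config(text):
    """Parse 'name: value' lines into a dict (illustrative sketch only)."""
    attributes = {}
    pending = None  # logical line accumulated across backslash continuations
    for raw in text.split("\n"):
        # Join a continued line to the previous one with a single space
        # (assumption: surrounding whitespace at the split is trimmed).
        line = raw if pending is None else pending + " " + raw.lstrip(" \t")
        if line.endswith("\\"):
            pending = line[:-1].rstrip(" \t")
            continue
        pending = None
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and '#' lines are comments
        match = re.match(r"([A-Za-z0-9_]+):[ \t]*(.*)$", line)
        if match:
            # Leading whitespace after the colon is dropped;
            # trailing spaces or tabs stay in the value.
            attributes[match.group(1)] = match.group(2)
    return attributes
```

An attribute that is missing from the returned dictionary would then fall back to the built-in default.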

6.3 Inclusion and variable expansion

A configuration file can include another file, by using a special name, include. The value is taken as the file name of another configuration file to be read in at this point. If the given file name is not fully qualified, it is taken relative to the directory in which the current configuration file is found.

Variable expansion is permitted in the file name. Multiple include statements, and nested includes are also permitted. Example:

include: common.conf 
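
As a rough sketch of how an include value could be turned into a file name, the Python fragment below expands variables and then resolves a relative name against the directory of the current configuration file. The function names are hypothetical, the ${name} form is taken from the default values shown in this chapter, and replacing an unknown variable with an empty string is an assumption.

```python
import posixpath
import re

def expand_variables(value, attributes):
    """Replace ${name} with the value of attribute 'name' (empty if unknown)."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: attributes.get(m.group(1), ""),
                  value)

def resolve_include(current_config, include_value, attributes):
    """Return the path of the file named by an 'include' attribute."""
    file_name = expand_variables(include_value, attributes)
    if posixpath.isabs(file_name):
        return file_name  # fully qualified: used as given
    # Otherwise relative to the directory of the including file.
    return posixpath.join(posixpath.dirname(current_config), file_name)
```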

6.4 Configuration attributes

Here you can find a brief explanation of ht://Check configuration attributes.

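They are grouped into the following sections: Setting the "spider", Setting the database info, Setting HTTP connections, Setting what to store, and Setting what to report.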

Setting the "spider"

start_url

This is the list of URLs used to start a dig when no database exists yet. Note that multiple URLs can be given here.

Type: string

Default: http://htcheck.sourceforge.net/

Example:

start_url:      http://www.somewhere.org/alldata/index.html

limit_urls_to

This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Any number of strings can be specified, separated by spaces. If multiple patterns are given, at least one of the patterns has to match the URL. Matching is a case-insensitive string match on the URL to be used. The match will be performed after the relative references have been converted to a valid URL. This means that the URL will always start with http://. Granted, this is not the perfect way of doing this, but it is simple enough and it covers most cases.

Type: string

Default: ${start_url}

Example:

limit_urls_to:  .sdsu.edu kpbs

limit_normalized

This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Unlike the limit_urls_to directive, this is done after the URL is normalized.

Type: string

Default:

Example:

limit_normalized: http://www.mydomain.com

exclude_urls

If a URL contains any of the space-separated patterns, it will be rejected. This is typically used to exclude things such as infinite virtual web trees, for example those starting with cgi-bin.

Type: string

Default:

Example:

exclude_urls: students.html cgi-bin
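
The interplay of limit_urls_to and exclude_urls can be sketched as follows. This is an illustration in Python, not htcheck's own code: the function name url_allowed is hypothetical, patterns are treated as plain substrings, and matching exclude_urls case-sensitively is an assumption.

```python
def url_allowed(url, limit_urls_to, exclude_urls):
    """Return True if the URL passes both filters (illustrative sketch)."""
    limits = limit_urls_to.split()
    excludes = exclude_urls.split()
    lowered = url.lower()
    # At least one limit pattern must match, case-insensitively.
    if limits and not any(p.lower() in lowered for p in limits):
        return False
    # Any exclude pattern contained in the URL rejects it.
    return not any(p in url for p in excludes)
```

With the two example values above, http://www.sdsu.edu/cgi-bin/search would be rejected by the cgi-bin pattern even though it matches .sdsu.edu.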

bad_extensions

This is a list of extensions on URLs which are considered non-parsable. This list is used mainly to supplement the MIME-types that the HTTP server provides with documents. Some HTTP servers do not have a correct list of MIME-types and so may advertise certain documents as text when they are actually in a binary format.

Type: string

Default:

Example:

bad_extensions: .foo .bar .bad
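
A minimal sketch of such an extension test, in Python. The function name has_bad_extension is hypothetical, and comparing the extension case-insensitively against the path part of the URL only (ignoring any query string) is an assumption.

```python
from urllib.parse import urlsplit

def has_bad_extension(url, bad_extensions):
    """Return True if the URL path ends with one of the listed extensions."""
    path = urlsplit(url).path.lower()  # query string is not considered
    return any(path.endswith(ext.lower()) for ext in bad_extensions.split())
```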

bad_querystr

This is a list of CGI query strings to be excluded from indexing. This can be used in conjunction with CGI-generated portions of a website to control which pages are indexed.

Type: string

Default:

Example:

bad_querystr: forum=private section=topsecret&passwd=required

max_hop_count

Instead of limiting the indexing process by URL pattern, it can also be limited by the number of hops or clicks a document is removed from the starting URL. The starting page will have hop count 0.

Type: number

Default: 999999

Example:

max_hop_count: 4
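
Hop counting can be pictured as a breadth-first walk over the link graph. The Python sketch below works on an in-memory dictionary of links rather than real HTTP retrievals; the function name crawl_hops is hypothetical, and it assumes that documents exactly max_hop_count hops away are still visited but their links are not followed.

```python
from collections import deque

def crawl_hops(links, start_url, max_hop_count):
    """Return {url: hop count} for every URL reachable within the limit."""
    hops = {start_url: 0}  # the starting page has hop count 0
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hop_count:
            continue  # at the limit: do not follow this page's links
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops
```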

check_external

If set to 'true', htcheck checks whether external URLs exist. An external URL is a URL which does not match the limit configuration attributes. External URLs are not parsed.

Type: boolean

Default: true

Example:

check_external: false

Setting the database info

db_name

Name of the MySQL database to be created or read.

Type: string

Default: htcheck (or as defined by the --with-db-name configure option)

Example:

db_name: test

db_name_prepend

String to be prepended to the specified MySQL database name. This lets you set a common string that identifies all the database names used by ht://Check, and grant database privileges based on that string. You can also change the default value with the configure option --with-db-name-prepend (default empty).

Type: string

Default: (or as defined by the --with-db-name-prepend configure option)

Example:

db_name_prepend: htcheck_

mysql_conf_file_prefix

Prefix of the MySQL configuration file to search for. The default is 'my', so the file searched is usually ~/.my.cnf (suggested). If it is not found, the /etc/.my.cnf file is searched. For its syntax, see the 'Option File' section of the MySQL documentation.

Type: string

Default: my

Example:

mysql_conf_file_prefix: htcheck

mysql_conf_group

Group to be searched inside the .my.cnf file of MySQL for getting the settings for the connection to the server. In other words, it's the section marked with [<group>] inside the MySQL option file (default is [client]).

Type: string

Default: client

Example:

mysql_conf_group: htcheck

optimize_db

Optimize the database tables at the end of the crawl. Disable it if the database server doesn't support it.

Type: boolean

Default: false

Example:

optimize_db: true

sql_big_table_option

Enable or disable the MySQL big-table option, which is useful when performing huge queries. When it is not set, the MySQL server may sometimes return a 'table is full' error.

Type: boolean

Default: true

Example:

sql_big_table_option: false

url_index_length

This number specifies the length of the index on the Url field in the Schedule and Url tables of the database. You can set different values depending on the average length of the URLs that htcheck finds in your sites. If you do not want to set any limit, use the value '-1'. This attribute may affect crawl performance, since the length of an index can either slow down or speed up the spidering process.

Type: number

Default: 64

Example:

url_index_length: -1

Setting HTTP connections

user_agent

This allows customization of the User-Agent header field sent when the digger requests a file from a server.

Type: string

Default: ht://Check

Example:

user_agent: htcheck-crawler

persistent_connections

If set to true, htcheck takes advantage of persistent connections, as defined by HTTP/1.1 (RFC2616), whenever the server supports them. This reduces the number of connection open/close operations when retrieving documents over HTTP.

Type: boolean

Default: true

Example:

persistent_connections: false

head_before_get

This option works only when persistent connections are in use (see the persistent_connections attribute). If set to true, an HTTP/1.1 HEAD request is made first in order to retrieve header information about a document. If the returned status code and content-type indicate that the document is parsable, a 'GET' request follows.

Type: boolean

Default: true

Example:

head_before_get: false

timeout

Specifies the time the digger will wait to complete a network read. This is just a safeguard against unforeseen things like the all too common transformation from a network to a notwork.

The timeout is specified in seconds.

Type: number

Default: 30

Example:

timeout: 42

authorization

This tells htcheck to send the supplied username:password with each HTTP request. The credentials will be encoded using the "Basic" authentication scheme. There must be a colon (:) between the username and password.

Type: string

Default:

Example:

authorization: myusername:mypassword
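
For reference, the "Basic" scheme simply Base64-encodes the username:password string. A minimal Python sketch, in which the function name basic_auth_header is hypothetical and the UTF-8 encoding of the credentials is an assumption:

```python
import base64

def basic_auth_header(credentials):
    """Build an Authorization header line from 'username:password'."""
    if ":" not in credentials:
        raise ValueError("expected username:password")
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Authorization: Basic " + token
```

Since Base64 is an encoding and not encryption, the password travels effectively in clear text over plain HTTP.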

max_retries

This option sets the maximum number of retries when retrieving a document fails (mainly because of connection problems).

Type: number

Default: 3

Example:

max_retries: 6

tcp_max_retries

This option sets the maximum number of connection attempts made when a connection times out. Once these attempts are exhausted, the connection is reported as timed out.

Type: number

Default: 1

Example:

tcp_max_retries: 6

tcp_wait_time

This attribute sets the wait time after a connection fails and the timeout is raised.

Type: number

Default: 5

Example:

tcp_wait_time: 10
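
How tcp_max_retries and tcp_wait_time might work together is sketched below in Python. This is not htcheck's implementation: the function name connect_with_retries and the connect callable are hypothetical, tcp_max_retries is assumed to count the total number of attempts, and the wait time is assumed to be in seconds.

```python
import time

def connect_with_retries(connect, tcp_max_retries=1, tcp_wait_time=5,
                         sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up."""
    attempts = max(1, tcp_max_retries)  # always try at least once
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except TimeoutError:
            if attempt == attempts:
                raise  # attempts exhausted: report the timeout
            sleep(tcp_wait_time)  # wait before the next attempt
```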

http_proxy

When this attribute is set, all HTTP document retrievals will be done using the HTTP-PROXY protocol. The URL specified in this attribute points to the host and port where the proxy server resides.

The use of a caching proxy server can considerably improve the performance of the indexing process.

Type: string

Default:

Example:

http_proxy: http://proxy.bigbucks.com:3128

http_proxy_exclude

When this is set, URLs matching this pattern are retrieved directly, without using the proxy. This is useful when some of your sites are close to the digging server and others are far away.

Type: string

Default:

Example:

http_proxy_exclude: http://intranet.foo.com/

http_proxy_authorization

This tells htcheck to send the supplied username:password with each HTTP request, when using a proxy that requires authorization. The credentials will be encoded using the "Basic" authentication scheme. There must be a colon (:) between the username and password.

Type: string

Default:

Example:

http_proxy_authorization: myusername:mypassword

accept_language

This attribute restricts the set of natural languages that are preferred in responses to HTTP requests performed by the digger. Specify one or more language tags (as defined by RFC 1766) in order of preference, separated by spaces. When the server performs content negotiation based on the 'accept-language' value given by the HTTP user agent, different content can be returned depending on the value of this attribute. If left empty, no language is sent and the server default is returned.

Type: string

Default:

Example:

accept_language:        en-us en it
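
A possible mapping from this attribute to the request header is sketched below in Python. The function name accept_language_header is hypothetical, and joining the tags with commas without explicit quality values is an assumption about the header that gets sent.

```python
def accept_language_header(accept_language):
    """Return the header line for the attribute value, or None if empty."""
    tags = accept_language.split()  # tags are listed in preferred order
    if not tags:
        return None  # nothing is sent; the server default applies
    return "Accept-Language: " + ", ".join(tags)
```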

remove_default_doc

Set this to the default documents in a directory used by the servers you are indexing. If one of these names appears after the final slash of a URL, it is stripped off when the URL is normalized, so that a URL like http://foo.com/index.html becomes http://foo.com/. You can disable this stripping by setting the list to an empty string. The list should only contain names that all the servers you index recognize as default documents for directory URLs, as defined for example by the DirectoryIndex setting in Apache's srm.conf.

Type: string list

Default:

Example:

remove_default_doc: default.html default.htm index.html index.htm
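
The stripping step can be illustrated with the following Python sketch. The function name strip_default_doc is hypothetical; the sketch assumes the URL has a path component and ignores query strings and fragments.

```python
def strip_default_doc(url, remove_default_doc):
    """Drop a default document name that follows the final slash."""
    names = remove_default_doc.split()  # an empty list disables stripping
    head, slash, last = url.rpartition("/")
    if slash and last in names:
        return head + "/"
    return url
```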

disable_cookies

If set to 'true', htcheck will disable the management of HTTP cookies.

Type: boolean

Default: false

Example:

disable_cookies: true

cookies_input_file

Set the input file to be used when importing cookies for the crawl; cookies must be specified according to Netscape's format. For more information, have a look at the example cookies file distributed with ht://Check. By default, no input file is read.

Type: string

Default:

Example:

cookies_input_file: /tmp/cookies.txt

url_reserved_chars

This string customises the set of characters that are considered reserved in a URL and therefore need not be encoded under the RFC1738 standard. It is used when checking whether a URL is well-encoded; a URL that is not results in a 'BadEncoded' state for the link which created it. The default value is slightly different from what the RFC says, giving more flexibility to the spider (it is suggested not to change it unless you are absolutely sure of what you are doing).

Type: string

Default: ;/?:@&=+$,._%-#x~

Example:

url_reserved_chars: \\;/?:@&=+\$,._%-#x~

Setting what to store

max_doc_size

This is the upper limit to the amount of data retrieved for documents. This is mainly used to prevent unreasonable memory consumption since each document will be read into memory by htcheck.

Type: number

Default: 100000

Example:

max_doc_size: 5000000
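
The effect of this limit can be pictured as a capped read loop. In the Python sketch below, the function name read_limited and the chunk_size parameter are hypothetical, and the limit is assumed to be expressed in bytes.

```python
def read_limited(stream, max_doc_size=100000, chunk_size=8192):
    """Read at most max_doc_size bytes from a binary stream."""
    chunks = []
    remaining = max_doc_size
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # end of document reached before the limit
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```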

store_only_links

If set to false, htcheck stores in the database every tag it finds in every document it crawls. If set to true, htcheck stores only those HTML attributes and statements that produce a link or set an anchor (identified by the pair tag: A, attribute: name).

Type: boolean

Default: false

Example:

store_only_links: true

store_url_contents

This attribute enables storing the contents of the parsed URLs. It is very useful, but can also be dangerous: if you enable it, performance may degrade and disk storage requirements can become extremely high. It is recommended to use this only for small crawls.

Type: boolean

Default: false

Example:

store_url_contents: true

available_charsets

This attribute specifies the set of possible charsets that htcheck recognises and stores into the database; other charsets will be marked as 'other'.

Type: string list

Default: windows-1250 iso-8859-1 iso-8859-10 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 koi8-r koi8-u utf-8 windows-1251 windows-1252 windows-1253 windows-1254 windows-1255 windows-1256 windows-1257 windows-1258 windows-874

Example:

available_charsets: iso-8859-1
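
The classification described above amounts to a simple lookup. A Python sketch, in which the function name stored_charset is hypothetical and the case-insensitive comparison is an assumption:

```python
def stored_charset(charset, available_charsets):
    """Return the charset name to store, or 'other' if it is not listed."""
    known = {name.lower() for name in available_charsets.split()}
    name = charset.lower()
    return name if name in known else "other"
```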

Setting what to report

summary_anchor_not_found

Enable or disable the display of the summary of HTML anchors that have not been found.

Type: boolean

Default: true

Example:

summary_anchor_not_found: false

