ht://Check uses a flexible configuration file. This configuration file is a plain ASCII text file. Each line in the file is either a comment or contains an attribute. Comment lines are blank lines or lines that start with a '#'.
Attributes consist of a variable name and an associated value:
<name>:<whitespace><value><newline>
The name can contain any alphanumeric character or underscore (_).
The value can include any character except newline. It cannot start with spaces or tabs, since those are considered part of the whitespace after the colon. Keep in mind that any trailing spaces or tabs are included in the value.
A value can be split across several lines of the configuration file by ending each line with a backslash (\). A space is added to the value where each line split occurs.
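The syntax rules above can be illustrated with a minimal Python sketch. It is not the parser used by ht://Check; the function name parse_config is hypothetical, and the sketch assumes that the whitespace after the colon may be empty and that leading whitespace on a continuation line is kept as part of the value.

```python
import re

# <name>:<whitespace><value>; the name is alphanumerics or underscore.
ATTRIBUTE = re.compile(r"^([A-Za-z0-9_]+):[ \t]*(.*)$")


def parse_config(text):
    """Parse configuration text into a dict mapping attribute names to values."""
    attributes = {}
    pending = None  # logical line being assembled from continuation lines
    for raw in text.split("\n"):
        # A space is added where a backslash split occurred.
        line = raw if pending is None else pending + " " + raw
        pending = None
        if line.endswith("\\"):
            pending = line[:-1]  # drop the backslash and wait for the next line
            continue
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        match = ATTRIBUTE.match(line)
        if match:
            # Trailing spaces or tabs stay in the value.
            attributes[match.group(1)] = match.group(2)
    return attributes
```

Every value is returned as a string; converting it to a number or boolean according to the attribute's type is left out of the sketch.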
If ht://Check needs a particular attribute and it is not in the configuration file, it will use the default value which is defined in htcommon/defaults.cc of the source directory.
A configuration file can include another file by using the special name include. The value is taken as the file name of another configuration file to be read in at this point. If the given file name is not fully qualified, it is taken relative to the directory in which the current configuration file is found.
Variable expansion is permitted in the file name. Multiple include statements, and nested includes are also permitted. Example:
include: common.conf
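As an illustration of how a relative include file name is resolved, here is a small Python sketch. The function name resolve_include and the path /etc/htcheck/htcheck.conf are hypothetical, and variable expansion in the file name is not handled.

```python
import posixpath


def resolve_include(current_config_path, include_value):
    """Return the path of an included file.

    A name that is not fully qualified is taken relative to the
    directory of the configuration file that contains the include.
    """
    if posixpath.isabs(include_value):
        return include_value
    return posixpath.join(posixpath.dirname(current_config_path), include_value)
```

Under these assumptions, an "include: common.conf" line found in /etc/htcheck/htcheck.conf would refer to /etc/htcheck/common.conf.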
Here you can find a brief explanation of ht://Check configuration attributes.
start_url
This is the list of URLs that will be used to start a dig when there is no existing database. Note that multiple URLs can be given here.
Type: string
Default: http://htcheck.sourceforge.net/
Example:
start_url: http://www.somewhere.org/alldata/index.html
limit_urls_to
This specifies a set of patterns that all URLs have to match in order to be included in the search. Any number of strings can be specified, separated by spaces. If multiple patterns are given, at least one of them has to match the URL.
Matching is a case-insensitive string match on the URL to be used. The match is performed after relative references have been converted to a valid URL, which means that the URL will always start with http://.
Granted, this is not the perfect way of doing this, but it is simple enough and it covers most cases.
Type: string
Default: ${start_url}
Example:
limit_urls_to: .sdsu.edu kpbs
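The matching rule for limit_urls_to can be sketched in Python as follows. This is only an illustration: the function name url_is_within_limits is hypothetical, and the sketch assumes that a pattern matches when it occurs anywhere in the already absolute URL.

```python
def url_is_within_limits(url, limit_urls_to):
    """Return True if at least one space-separated pattern matches the URL.

    Assumption: a pattern matches when it appears as a substring of the
    URL, compared without regard to case.
    """
    patterns = limit_urls_to.lower().split()
    lowered_url = url.lower()
    return any(pattern in lowered_url for pattern in patterns)
```

With the example value above, http://www.SDSU.edu/index.html would be accepted because it contains .sdsu.edu once case is ignored, while a URL containing neither pattern would be rejected.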
limit_normalized
This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Unlike the limit_urls_to directive, this is done after the URL is normalized.
Type: string
Default:
Example:
limit_normalized: http://www.mydomain.com
exclude_urls
If a URL contains any of the space-separated patterns, it will be rejected. This is used to exclude common things such as an infinite virtual web tree that starts with cgi-bin.
Type: string
Default:
Example:
exclude_urls: students.html cgi-bin
bad_extensions
This is a list of extensions on URLs which are considered non-parsable. This list is used mainly to supplement the MIME-types that the HTTP server provides with documents. Some HTTP servers do not have a correct list of MIME-types and so can advertise certain documents as text when they are actually in some binary format.
Type: string
Default:
Example:
bad_extensions: .foo .bar .bad
bad_querystr
This is a list of CGI query strings to be excluded from indexing. This can be used in conjunction with CGI-generated portions of a website to control which pages are indexed.
Type: string
Default:
Example:
bad_querystr: forum=private section=topsecret&passwd=required
max_hop_count
Instead of limiting the indexing process by URL pattern, it can also be limited by the number of hops or clicks a document is removed from the starting URL. The starting page will have hop count 0.
Type: number
Default: 999999
Example:
max_hop_count: 4
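To make the notion of hop count concrete, the following Python sketch walks a small in-memory link graph breadth first. It is not the crawler's code: the function name urls_within_hops and the links dictionary are hypothetical, and the sketch assumes that documents exactly max_hop_count hops away are still retrieved but their links are not followed.

```python
from collections import deque


def urls_within_hops(links, start_url, max_hop_count):
    """Return a dict of reachable URLs and their hop counts.

    `links` maps each URL to the list of URLs it links to.
    The starting page has hop count 0.
    """
    hop_count = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if hop_count[url] >= max_hop_count:
            continue  # at the limit: do not follow this document's links
        for target in links.get(url, []):
            if target not in hop_count:
                hop_count[target] = hop_count[url] + 1
                queue.append(target)
    return hop_count
```

Because the walk is breadth first, each URL is assigned the smallest number of clicks needed to reach it from the starting URL.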
check_external
If set to 'true', htcheck checks whether external URLs exist. An external URL is a URL that doesn't match the limit configuration attributes. External URLs aren't parsed.
Type: boolean
Default: true
Example:
check_external: false
db_name
Name of the MySQL database to be created or read.
Type: string
Default: htcheck (or as defined by the --with-db-name configure option)
Example:
db_name: test
db_name_prepend
String to be prepended to the specified MySQL database name. This makes it possible to set a common string that identifies all the database names used by ht://Check, and to grant database privileges based on this string. The default value can also be changed with the configure option --with-db-name-prepend (default empty).
Type: string
Default: empty (or as defined by the --with-db-name-prepend configure option)
Example:
db_name_prepend: htcheck_
mysql_conf_file_prefix
Prefix of the MySQL configuration file to be searched. The default is 'my', so the file searched is usually ~/.my.cnf (suggested). If it is not found, the /etc/.my.cnf file is searched. For its syntax, see the 'Option File' contents of the MySQL documentation.
Type: string
Default: my
Example:
mysql_conf_file_prefix: htcheck
mysql_conf_group
Group to be searched inside the .my.cnf file of MySQL for getting the settings for the connection to the server. In other words, it's the section marked with [<group>] inside the MySQL option file (default is [client]).
Type: string
Default: client
Example:
mysql_conf_group: htcheck
optimize_db
Optimize the database tables at the end of the crawl. Disable it if the database server doesn't support it.
Type: boolean
Default: false
Example:
optimize_db: true
sql_big_table_option
Enable or disable this option, which is useful when performing huge queries. When it is not set, the MySQL database server may sometimes return a 'table is full' error.
Type: boolean
Default: true
Example:
sql_big_table_option: false
url_index_length
This number specifies the length of the index on the Url field in the Schedule and Url tables of the database. You can set different values depending on the average length of the URLs that htcheck finds on your sites. If you don't want to set any limit, use the value '-1'. This attribute may affect the performance of crawls, since the length of an index can either slow down or speed up the spidering process.
Type: number
Default: 64
Example:
url_index_length: -1
user_agent
This allows customization of the User-Agent header field sent when the digger requests a file from a server.
Type: string
Default: ht://Check
Example:
user_agent: htcheck-crawler
persistent_connections
If set to true, and when servers make it possible, htcheck can take advantage of persistent connections, as defined by HTTP/1.1 (RFC 2616). This reduces the number of connection open/close operations when retrieving documents over HTTP.
Type: boolean
Default: true
Example:
persistent_connections: false
head_before_get
This option works only when persistent connections are in use (see the persistent_connections attribute). If set to true, an HTTP/1.1 HEAD request is made first to retrieve header information about a document. If the returned status code and content type indicate that the document is parsable, a GET request follows.
Type: boolean
Default: true
Example:
head_before_get: false
timeout
Specifies the time the digger will wait to complete a network read. This is just a safeguard against unforeseen things like the all too common transformation from a network to a notwork.
The timeout is specified in seconds.
Type: number
Default: 30
Example:
timeout: 42
authorization
This tells htcheck to send the supplied username:password with each HTTP request. The credentials will be encoded using the "Basic" authentication scheme. There must be a colon (:) between the username and password.
Type: string
Default:
Example:
authorization: myusername:mypassword
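The "Basic" scheme encodes the username:password string with base64 and sends it in the Authorization request header. The Python sketch below shows that encoding; the function name basic_authorization_header is hypothetical and the code is an illustration, not what htcheck runs internally.

```python
import base64


def basic_authorization_header(credentials):
    """Build an HTTP Basic Authorization header line from 'username:password'."""
    if ":" not in credentials:
        raise ValueError("expected a colon between username and password")
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Authorization: Basic " + token
```

Note that base64 is an encoding, not encryption, so the credentials are only as protected as the connection that carries them.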
max_retries
This option sets the maximum number of retries when retrieving a document fails (mainly because of connection problems).
Type: number
Default: 3
Example:
max_retries: 6
tcp_max_retries
This option sets the maximum number of attempts made when a connection times out. After all these attempts have failed, the connection is reported as timed out.
Type: number
Default: 1
Example:
tcp_max_retries: 6
tcp_wait_time
This attribute sets the wait time after a connection fails and the timeout is raised.
Type: number
Default: 5
Example:
tcp_wait_time: 10
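One way to picture how tcp_max_retries and tcp_wait_time interact is the retry loop sketched below in Python. This is an assumption about the behaviour rather than a description of the actual implementation: the function name connect_with_retries and its connect and sleep parameters are hypothetical, and the wait time is assumed to be in seconds.

```python
import time


def connect_with_retries(connect, tcp_max_retries=1, tcp_wait_time=5, sleep=time.sleep):
    """Call `connect` until it succeeds or the attempts are used up.

    `connect` is any callable that raises TimeoutError on a timeout.
    After each timed-out attempt except the last, wait `tcp_wait_time`.
    """
    attempts = max(1, tcp_max_retries)
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except TimeoutError:
            if attempt == attempts:
                raise  # every attempt timed out: report the timeout
            sleep(tcp_wait_time)
```

Passing the sleep function as a parameter keeps the sketch easy to try out without real delays or network access.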
http_proxy
When this attribute is set, all HTTP document retrievals will be done using the HTTP-PROXY protocol. The URL specified in this attribute points to the host and port where the proxy server resides.
The use of a proxy server can greatly improve the performance of the indexing process.
Type: string
Default:
Example:
http_proxy: http://proxy.bigbucks.com:3128
http_proxy_exclude
When this is set, URLs matching this will not use the proxy. This is useful when you have a mixture of sites near to the digging server and far away.
Type: string
Default:
Example:
http_proxy_exclude: http://intranet.foo.com/
http_proxy_authorization
This tells htcheck to send the supplied username:password with each HTTP request, when using a proxy that requires authorization. The credentials will be encoded using the "Basic" authentication scheme. There must be a colon (:) between the username and password.
Type: string
Default:
Example:
http_proxy_authorization: myusername:mypassword
accept_language
This attribute restricts the set of natural languages that are preferred in the response to an HTTP request performed by the digger. List one or more language tags (as defined by RFC 1766) in order of preference, separated by spaces. When the server performs content negotiation based on the Accept-Language header sent by the HTTP user agent, different content can be returned depending on the value of this attribute. If left empty, no language is sent and the server default is returned.
Type: string
Default:
Example:
accept_language: en-us en it
remove_default_doc
Set this to the default documents in a directory used by the servers you are indexing. If one of these names appears after the final slash of a URL, it is stripped off when the URL is normalized, translating URLs like http://foo.com/index.html into http://foo.com/. Note that you can disable stripping of these names during normalization by setting the list to an empty string. The list should only contain names that all servers you index recognize as default documents for directory URLs, as defined by the DirectoryIndex setting in Apache's srm.conf, for example.
Type: string list
Default:
Example:
remove_default_doc: default.html default.htm index.html index.htm
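The stripping of default document names can be sketched in Python as follows. The function name strip_default_doc is hypothetical, and this illustrates only this one normalization step, not the rest of the URL normalization performed by htcheck.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_doc(url, remove_default_doc):
    """Drop a default document name that follows the final slash of the path."""
    names = remove_default_doc.split()
    parts = urlsplit(url)
    directory, slash, last = parts.path.rpartition("/")
    if slash and last in names:
        # Keep the directory part, including its trailing slash.
        parts = parts._replace(path=directory + slash)
    return urlunsplit(parts)
```

With an empty list nothing is stripped, which mirrors the way an empty value disables this step.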
disable_cookies
If set to 'true', htcheck disables HTTP cookie management.
Type: boolean
Default: false
Example:
disable_cookies: true
cookies_input_file
Set the input file to be used when importing cookies for the crawl; cookies must be specified according to Netscape's format. For more information, have a look at the example cookies file distributed with ht://Check. By default, no input file is read.
Type: string
Default:
Example:
cookies_input_file: /tmp/cookies.txt
url_reserved_chars
This string customises the set of characters that are considered reserved in a URL, so that they are not encoded under the RFC 1738 standard.
This string is used when checking whether a URL is well encoded or not, issuing a 'BadEncoded' state for the link which created it.
The default value is slightly different from what the RFC says, giving more flexibility to the spider (it is suggested not to change it unless you are extremely sure of what you are doing).
Type: string
Default: ;/?:@&=+$,._%-#x~
Example:
url_reserved_chars: \\;/?:@&=+\$,._%-#x~
max_doc_size
This is the upper limit to the amount of data retrieved for documents. This is mainly used to prevent unreasonable memory consumption since each document will be read into memory by htcheck.
Type: number
Default: 100000
Example:
max_doc_size: 5000000
store_only_links
If set to false, htcheck stores in the database every tag it finds in every document it crawls.
If set to true, htcheck stores only those HTML attributes and statements that produce a link or set an anchor (identified by the pair tag: A, attribute: name).
Type: boolean
Default: false
Example:
store_only_links: true
store_url_contents
This attribute makes htcheck store the contents of the parsed URLs. It is very useful, but can also be dangerous: if you enable it, performance may degrade and disk storage requirements can become extremely high. It is recommended to use this only for small crawls.
Type: boolean
Default: false
Example:
store_url_contents: true
available_charsets
This attribute specifies the set of possible charsets that htcheck recognises and stores into the database; other charsets will be marked as 'other'.
Type: string list
Default: windows-1250 iso-8859-1 iso-8859-10 iso-8859-13 iso-8859-14
iso-8859-15 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7
iso-8859-8 iso-8859-9 koi8-r koi8-u utf-8 windows-1251 windows-1252 windows-1253
windows-1254 windows-1255 windows-1256 windows-1257 windows-1258 windows-874
Example:
available_charsets: iso-8859-1
summary_anchor_not_found
Enable or disable the display of the summary of HTML anchors that have not been found.
Type: boolean
Default: true
Example:
summary_anchor_not_found: false