BruteFIR

last updated 2003-08-10

News

2003-08-10
BruteFIR v0.99k. This is a maintenance release, which fixes a few bugs, including a severe powersave bug which could cause unexpected and very loud noise to come out. It also adds an option to run in daemon mode.

2003-07-11
BruteFIR v0.99j. Now we are getting near a 1.0 release. This release contains quite a few new features and bug fixes. Some feature highlights: BruteFIR now employs FFTW3, there is support for 32 and 64 bits in the same binary, and buffer over/underflows can be ignored. Among the important bug fixes are that FFTW wisdom is now stored properly, so it can be re-used more often, and that the equaliser module now sets the magnitude properly at the edges.

2003-02-11
BruteFIR v0.99i. I released the h-version a bit too early, and lots of small but significant mistakes followed. This version fixes those (hopefully).

2003-02-09
BruteFIR v0.99h. A couple of bug fixes associated with the new callback I/O. It also adds support for native endian and auto sample formats, and a simple automatic load balancer for multi-processor machines.

2003-02-02
BruteFIR v0.99g. This release adds support for callback I/O. One callback I/O module is available, supporting JACK. This support means that the program has gone through quite radical reorganisations, so something might be broken. If you discover any problems, please let me know.

2003-01-05
BruteFIR v0.99f. Minor peak meter adjustment and bug fix.

2002-12-25
BruteFIR v0.99e. Lots of tuning has been done to work better with sound card I/O. It should now be more reliable in low latency configurations. The release also includes various minor improvements and bug fixes.

For those that find the default configuration file unnecessary and just in the way, there is now the -nodefault command line option, which will cause BruteFIR to skip the default configuration file.

2002-11-28
BruteFIR v0.99d. Fixes yet another bug in the ALSA code, which caused the software not to work with hardware with odd period sizes, such as some (all?) ice1712-based cards. The real-time index has also been much simplified and improved in terms of reliability, and a power-save feature was added.

Sometime soon, there will be 1.0...

2002-10-10
BruteFIR v0.99c is an important bug fix release. Among other fixes, it fixes the slightly embarrassing bug of incorrect reading of 3 byte 24 bit formats. Apart from many bug fixes, it adds double buffer support to the equaliser module, and a simple script function to the CLI. The risk of buffer underflow at startup has also been strongly reduced.

2002-09-12
BruteFIR v0.99b, fixed a serious bug in the ALSA code, which caused buffer underflow when the software buffer size was larger than the hardware buffer size.

2002-08-25
BruteFIR v0.99a, a couple of minor bug fixes, discovered during the development of AlmusVCU.

2002-08-04
This new release (v0.99) contains a first version of an equaliser module, which allows equalisation to be changed in runtime. Now the I/O delay is fixed, always exactly twice the filter block length (if the sound card hardware is properly designed). Good for synchronisation with other audio processors, or clustering. There is also a slight change in configuration file format, so you know why it will complain when run with an old configuration file.

2002-07-26
Added a minor feature that proved necessary for some applications, such as Ambisonics. This feature makes it possible to multiply inputs/outputs in mixing with negative values, not just positive. The new version is BruteFIR 0.98e. An invalid version was available for a few hours during this day (I forgot to include some CLI patches), so if you downloaded your v0.98e on this date, download it again.

2002-07-21
Two bug fixes in this new release, BruteFIR 0.98d. The first concerns scaling of coefficient parameters, where PCM coefficients were incorrectly scaled. The other fix is in the ALSA I/O module, which could on some occasions fail to set the sample rate.

2002-06-14
BruteFIR 0.98c, another small step towards 1.0. This contains an important bugfix. Earlier versions could mix up the mix buffers, which caused looping sound with some filter configurations; this is now fixed. The common mistake (at least for me) of linking a 32 bit BruteFIR with a 64 bit FFTW, or the other way around, is now taken care of.

2002-05-05
BruteFIR 0.98b. The sample rate monitoring added in 0.98a is now optional, through the option monitor_rate. Also support for SSE2 for Pentium 4 processors is implemented (only used when compiled with double precision). It is also possible to compile and run on Solaris with Sparc processors.

2002-04-16
Yet another of the usual minor updates: BruteFIR 0.98a. This fixes a minor bug which could cause stray processes to be left after exit. It also improves the real-time index calculation so it works properly on SMP, and the program now exits with an error when the sample rate is changed at runtime. There are now interpretable exit codes from the program as well, so one can know why it exited.

2002-03-25
BruteFIR 0.98: This new release supports virtual inputs and outputs, which can be used to control delay of individual outputs even if they are mixed to the same physical output.

2001-12-20
Another bugfix release, 0.97d. Also added a -quiet command line parameter to suppress title, warnings and informational messages at startup.

2001-12-17
Due to popular demand, the ALSA I/O module has got support for accessing the software modes of the ALSA library. The new release is 0.97c.

2001-12-16
Ooops. The new sample format handling was not as good as I initially thought. Now that has been fixed. Oh, clipping for 32 bit formats works again. I hope I did not burst anyone's ears (other than mine). The release version is 0.97b.

2001-12-15
Some major bugs were introduced in 0.97; hopefully most of them have been squashed in this new release, 0.97a.

2001-12-09
BruteFIR 0.97: a new release with lots of major changes. The software is now much more modular. It uses modules for input and output, ALSA and file I/O being the first modules available. It also supports logic modules, the old BruteFIR CLI being the first example. The logic modules can be used to achieve adaptive filtering. The new module architecture will probably need some time to stabilise, and due to the large number of changes to the code, there is a great risk that this new version is less stable than the last. A few details in the configuration file format have changed as well, for which the documentation has been updated. The documentation for how to program a BruteFIR module is not yet available though.

2001-11-04
Added a todo list. Any suggestions are welcome of course.

2001-10-27
Added some quick and dirty benchmarks, and added some new documentation. I made a low latency benchmark due to popular demand, and the interesting result is that it is possible to get as low as three milliseconds I/O delay, which is much lower than what I expected.

2001-09-27
New release, BruteFIR 0.96a. Some minor bugfixes, and at last processor capability detection code has been included, so BruteFIR will detect SSE or 3DNow, and use the optimised code accordingly.

2001-08-26
Updated documentation to cover all the new features of BruteFIR 0.96.

2001-08-20
BruteFIR 0.96 has been released, with a few important bugfixes, but also many new features, which have not yet been documented here. It is now possible to make filter networks, and to have different lengths for different filters.

2001-07-18
A new release, BruteFIR 0.95b, which contains an important bugfix is available for download. It fixes a block bounds violation error when converting from 32 bit integers to floating point. It also contains some tuning of realtime priorities.

2001-06-10
Some minor updates to the documentation.

2001-06-03
A bugfix release, BruteFIR 0.95a, is available for download. It fixes a bug which caused the program to crash when long filters in raw format were read.

The documentation is now up to date again.

2001-05-26
New release, BruteFIR 0.95. This includes some new features, for example support for changing delay in runtime and support for non-interleaved sound cards. An important bug fix has also been applied, when mixing files and sound cards for inputs/outputs trouble could occur, but that should be fixed now.

Again, the documentation on this page is not entirely up to date with the software itself.

2001-04-11
BruteFIR 0.94a released, which is a bugfix release. A severe bug in the ALSA support code caused the error "Hardware does not support enough fragments." with common sound cards. Now it is gone. Still there is some work to do on the ALSA support code, like adding support for cards with non-interleaved buffer layout (like the RME9652).

2001-04-08
Major changes and cleanups of this page have been made, and the source code has been re-released. The new version is 0.94, and contains a new improved convolution algorithm with hand-coded assembler optimisations for Intel's SSE and AMD's 3DNow. With this, BruteFIR is now capable of even higher throughput.

What is it?

BruteFIR is a software convolution engine, a program for applying long FIR filters to multi-channel digital audio, either offline or in realtime. Its basic operation is specified through a configuration file, and filters, attenuation and delay can be changed in runtime through a simple command line interface. The FIR filter algorithm used is an optimised frequency domain algorithm, partly implemented in hand-coded assembler, thus throughput is extremely high. In realtime, a standard computer can typically run more than 10 channels with more than 60000 filter taps each.

Through its highly modular design, things like adaptive filtering, signal generators and sample I/O are easily added, extended and modified, without the need to alter the program itself.

BruteFIR is free and open-source. It is licensed through the GNU General Public License [6].

The preferred operating system platform for the program is Linux [11], but it is easily ported to other Unices as well, and supports for example Solaris out of the box. BruteFIR uses the high-performance FFTW library [7] for the Fast Fourier Transform (FFT, [5]) calculations, and ALSA, the Advanced Linux Sound Architecture [2], is the preferred way of interfacing sound cards. The main features are:

What is it good for?

A few examples of applications where BruteFIR could be a central component: Among these, room equalisation and auralisation need the longest FIR filters in the common case. Many applications can actually do with quite short filters, but the point is that you will probably not need to compromise on the filter lengths when you use BruteFIR, even when sample rates go up. However, BruteFIR is pretty useless by itself, since it is only a FIR filter engine. It does not provide any filter coefficients, thus it is not a filter design program. Also, due to its relatively high I/O-delay, BruteFIR is most suited for applications where the input signal is not live.

If you are interested in room equalisation, my old NWFIIR project [18] might be of interest. It's a bit dated though. A better program for room equalisation is Denis Sbragion's DRC [22].

BruteFIR convolution

The main design goal of BruteFIR is to achieve as high throughput as possible when filters are long (longer than 10000 taps). This means that the filter algorithm must be very fast, since it will be consuming almost all processor time of the whole program. BruteFIR's convolution algorithm is an example of a situation where a theoretically less efficient algorithm is faster in practice, because it is easily optimised and hides performance problems of more complex components.

Frequency domain algorithms for convolution are much faster than the straightforward time domain one when filters are long. The well-known overlap-save algorithm is used as the base in BruteFIR's convolution. However, there are practical problems with this algorithm, as we will see.

The problem of complexity

Efficient convolution is done in the frequency domain and therefore an FFT algorithm is needed. The FFT calculations typically occupy more than 90% of all processing time when plain overlap-save is employed. Unfortunately, the FFT is not easy to implement. There exist numerous implementations which vary greatly in performance, which is one proof of the complexity. Since it takes up almost all processing time, we must optimise it in order to make the convolution faster. This leaves us with a quite hard optimisation problem.

One way to optimise is to code assembler by hand and try to be better than the compiler. Modern processors for personal computers like Intel's Pentium III [10] or AMD's Athlon [1] have custom SIMD instructions (Single Instruction Multiple Data), which allow a single instruction to operate on more than one data element at a time. For example, a single instruction may add together four or eight floating point numbers. Typically, one can improve the performance of an algorithm four times when using these instructions. They are not used by common compilers like GCC (GNU Compiler Collection [9]), meaning that we have a good opportunity to write assembler code that will outperform code generated by the compiler by a wide margin. Most FFT libraries are written in C, and thus do not use these efficient SIMD instructions. So, theoretically, we could implement an FFT algorithm using SIMD instructions and beat the ones already available. However, we are going for a simpler approach, as we shall see. Since one of the design goals of BruteFIR is to be fairly portable, we want to make any assembler implementation small and simple, so it can easily be ported to other processor architectures. Maybe 'small', but certainly not 'simple', would be applicable to an assembler implementation of FFT. In conclusion, we find optimisation with assembler an attractive method to increase performance of existing algorithms. However, the algorithm we need to optimise, FFT, is quite complex and thus not an attractive target for optimisation.
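
To make the SIMD idea concrete, here is a minimal sketch of adding four floats per instruction. It is not BruteFIR code: it uses SSE compiler intrinsics rather than hand-written assembler, the function name add_floats is hypothetical, and it falls back to a plain scalar loop when SSE is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
 * n is assumed to be a multiple of 4 so that no tail loop is needed. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;

    assert(n % 4 == 0);
#if defined(__SSE__)
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);            /* load 4 floats */
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* 4 additions at once */
    }
#else
    for (i = 0; i < n; i++) {                       /* scalar fallback */
        dst[i] = a[i] + b[i];
    }
#endif
}
```

The unaligned load and store variants are used so the sketch works with any float array; an optimised routine would typically align its buffers instead.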

One of the fastest FFT libraries available is FFTW [7], [8], which is used by BruteFIR. There are more efficient FFT libraries out there (?), but they are often limited to short lengths (typically less than 8192), or are not free software nor open-source, which is a requirement of the BruteFIR project.

Problems with long FFTs

Many of the fastest FFT implementations support only shorter filter lengths (djbfft [3] being one example), and those that support long lengths may behave poorly on some architectures. One example is FFTW which on my 900 MHz AMD Athlon test system gets a large performance dip when FFT lengths become larger than 32768 (real-valued transforms). On the test system, a 262144 point FFT is 30 times slower than a 32768 point, which theoretically should be only 10 times. Although the behaviour is more stable on my 550 MHz Pentium III test system, performance drops more than O(n * log2(n)) which is the complexity of the FFT algorithm. Note that these tests were performed using FFTW2.
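
The "only 10 times" figure follows from the n * log2(n) cost model. As a rough check, the sketch below computes the idealised cost ratio between two power-of-two FFT lengths; the function names are hypothetical and the model ignores constant factors and all memory effects.

```c
#include <assert.h>

/* Integer log2 of a power of two. */
static unsigned ilog2(unsigned long n)
{
    unsigned k = 0;

    assert(n > 0 && (n & (n - 1)) == 0);
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Idealised FFT cost ratio, using cost(n) = n * log2(n). */
double fft_cost_ratio(unsigned long n_long, unsigned long n_short)
{
    return ((double)n_long * ilog2(n_long)) /
           ((double)n_short * ilog2(n_short));
}
```

For 262144 versus 32768 points this gives (262144 * 18) / (32768 * 15) = 9.6, so a measured factor of 30 is about three times worse than the model predicts.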

These performance problems are of course due to memory accesses, and poor cooperation between the hardware caching architecture and the software. When the data of the algorithm exceeds the cache size, the problem becomes obvious.

Both Pentium and Athlon architectures allow the software to give the cache hints to reduce problems in these situations, but this must be done in assembler, and is therefore seldom used.

Apart from performance problems, long FFTs include more multiplications and scalings, which induce a larger quantisation error. This is however a minor problem (?).

Partitioned convolution

We have seen that the central algorithm of fast convolution, the Fast Fourier Transform, is complex to implement and optimise. We have also seen that the need of long FFTs reduces the choices of available implementations and that the existing can behave poorly on some hardware architectures. A modified fast convolution algorithm that uses shorter FFTs, and where most time is spent in code which is small and easily optimised, would be ideal.

Many have worked on improving the standard frequency domain convolution algorithms for different purposes. The central idea found in many of these improvements is that the impulse response, that is the filter, is partitioned into several smaller parts. When each part is filtered with the input, the results delayed suitably and finally added together, one gets the same result as when processing the whole filter at once. As far as I know, the earliest user of this simple but powerful concept is T.G. Stockham [16], who published his results only one year after the famous Cooley and Tukey FFT paper [5]. The concept can be used to solve several problems. Stockham used it for saving memory, but in later work made in the eighties and early nineties, at the time when realtime DSP became feasible for the first time, it was stated that it can also be used to reduce quantisation errors, reduce I/O-delay, and adapt to optimal FFT lengths of a specific implementation. All these improvements are described by J.S. Soo and K.K. Pang [14], [15]. Other realtime partitioned convolution pioneers are B.D. Kulp [17], P.C.W. Sommen [12], [13] and J.M.P. Borrallo and M.G. Otero [4]. Their work is a good place to start reading for the one interested in getting a more detailed description of partitioned convolution. The convolution algorithm in BruteFIR is conceptually exactly the same as the one found in these papers.
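
The partitioning idea itself does not depend on the FFT. The following sketch shows it with plain time domain convolution, which is not how BruteFIR computes it: the filter is cut into parts, each part is convolved with the input, and each partial result is added to the output delayed by the position of that part in the filter. The function names and the part_len parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Direct convolution, accumulated into y.
 * y must hold at least nx + nh - 1 samples. */
void conv_add(double *y, const double *x, size_t nx,
              const double *h, size_t nh)
{
    size_t i, j;

    for (i = 0; i < nx; i++) {
        for (j = 0; j < nh; j++) {
            y[i + j] += x[i] * h[j];
        }
    }
}

/* Partitioned convolution: split h into parts of at most part_len taps,
 * convolve each part with x, and add the result delayed by the offset
 * of that part within h. y must be zeroed and hold nx + nh - 1 samples. */
void conv_partitioned_add(double *y, const double *x, size_t nx,
                          const double *h, size_t nh, size_t part_len)
{
    size_t off;

    assert(part_len > 0);
    for (off = 0; off < nh; off += part_len) {
        size_t len = (nh - off < part_len) ? nh - off : part_len;
        conv_add(y + off, x, nx, h + off, len); /* delay by 'off' samples */
    }
}
```

Calling both functions on the same input and filter, each with its own zeroed output buffer, should give the same output up to floating point rounding, whatever partition length is chosen.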

When partitioned convolution is used, something interesting happens in the processing time distribution of the algorithm. The major part of processing is moved from the FFT algorithm to the trivial operation of convolution in the frequency domain, which is simply multiplication. The more parts we split the impulse response into, the more convolution and less FFT is done. Naturally the FFTs get shorter, and thus we get rid of the problems associated with long FFTs. We now realise that partitioned convolution is the answer to our wishes: we do not need long FFTs and it becomes less important to optimise the FFT algorithm.

Optimising where it counts

We notice that we will earn most from optimising the operation where a segment of input converted to the frequency domain is multiplied with the corresponding part of the filter, also in the frequency domain. The result is then added to the output. When the data format is half-complex, a format used by most real-valued FFTs, the straightforward implementation looks like this when programmed in C:

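    /* bin 0 (DC) is purely real */
    d[0] += b[0] * c[0];
    for (n = 1; n < n_fft / 2; n++) {
        /* real part at index n, imaginary part at index n_fft - n */
        d[n] += b[n] * c[n] - b[n_fft - n] * c[n_fft - n];
        d[n_fft - n] += b[n] * c[n_fft - n] + b[n_fft - n] * c[n];
    }
    /* here n == n_fft / 2: the Nyquist bin, also purely real */
    d[n] += b[n] * c[n];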

b is the input, c is the filter coefficients, and d is the output. As we see, this is a very short and simple algorithm, which is easy to implement in assembler. There are a couple of problems though. The data in each array is accessed from the tail and the front at the same time. It would be better for the cache to localise the accesses, and move from front to end only. It is also a problem that the data is accessed both in forward and reverse order (both 0,1,2,3 and 3,2,1,0), since we want to use SIMD instructions. To solve the problem, we need to reorder the data. This will only be necessary to do once with the filter coefficients, so it is free. For the input however, we need to do this once after each forward transform, and for the output we need to restore the half-complex order prior to each inverse transform. In BruteFIR the input reordering is put into the mixing and scaling step, and the output reordering in the quantisation step, so the cost is next to nothing. Below is a C implementation of the previous algorithm, when data has been reordered to better fit SIMD instructions and to improve the memory access pattern:

    d1s = d[0] + b[0] * c[0];
    d2s = d[4] + b[4] * c[4];
    for (n = 0; n < n_fft; n += 8) {
        d[n+0] += b[n+0] * c[n+0] - b[n+4] * c[n+4];
        d[n+1] += b[n+1] * c[n+1] - b[n+5] * c[n+5];
        d[n+2] += b[n+2] * c[n+2] - b[n+6] * c[n+6];
        d[n+3] += b[n+3] * c[n+3] - b[n+7] * c[n+7];

        d[n+4] += b[n+0] * c[n+4] + b[n+4] * c[n+0];
        d[n+5] += b[n+1] * c[n+5] + b[n+5] * c[n+1];
        d[n+6] += b[n+2] * c[n+6] + b[n+6] * c[n+2];
        d[n+7] += b[n+3] * c[n+7] + b[n+7] * c[n+3];
    }
    d[0] = d1s;
    d[4] = d2s;

The above function is easily converted into assembler using Intel's SSE instructions, or AMD's 3DNow instructions, with cache hint instructions. The key loop (which is unrolled to further improve performance) becomes less than 50 lines long.
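
The reordering itself is not shown here, so the following sketch makes an assumption about the layout, inferred from the indices in the loop: each group of eight values holds four real parts followed by the four matching imaginary parts, and slot 4 of the first group (where the imaginary part of bin 0 would go) holds the real-valued Nyquist bin. The function names are hypothetical, and BruteFIR's actual reordering may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Reference multiply-accumulate on half-complex data. */
void mac_halfcomplex(double *d, const double *b, const double *c,
                     size_t n_fft)
{
    size_t n;

    d[0] += b[0] * c[0];
    for (n = 1; n < n_fft / 2; n++) {
        d[n] += b[n] * c[n] - b[n_fft - n] * c[n_fft - n];
        d[n_fft - n] += b[n] * c[n_fft - n] + b[n_fft - n] * c[n];
    }
    d[n] += b[n] * c[n];
}

/* Assumed reordering: half-complex -> groups of 4 reals + 4 imaginaries.
 * out and hc must not overlap; n_fft must be a multiple of 8. */
void reorder(double *out, const double *hc, size_t n_fft)
{
    size_t k;

    assert(n_fft >= 8 && n_fft % 8 == 0);
    for (k = 0; k < n_fft / 2; k++) {
        size_t base = 8 * (k / 4) + (k % 4);
        out[base] = hc[k];                                   /* real part */
        out[base + 4] = (k == 0) ? hc[n_fft / 2]             /* Nyquist   */
                                 : hc[n_fft - k];            /* imaginary */
    }
}

/* Multiply-accumulate on reordered data, same arithmetic as the
 * unrolled listing above but with the four lanes written as a loop. */
void mac_reordered(double *d, const double *b, const double *c,
                   size_t n_fft)
{
    size_t n, l;
    double d1s = d[0] + b[0] * c[0];  /* DC bin, purely real      */
    double d2s = d[4] + b[4] * c[4];  /* Nyquist bin, purely real */

    for (n = 0; n < n_fft; n += 8) {
        for (l = 0; l < 4; l++) {
            d[n + l]     += b[n + l] * c[n + l] - b[n + 4 + l] * c[n + 4 + l];
            d[n + 4 + l] += b[n + l] * c[n + 4 + l] + b[n + 4 + l] * c[n + l];
        }
    }
    d[0] = d1s;  /* overwrite the two slots the loop treated as complex */
    d[4] = d2s;
}
```

Under this assumed layout, reordering the three arrays and calling mac_reordered gives the same values as calling mac_halfcomplex and reordering its output, which is a convenient way to test an optimised loop against the simple one.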

It is interesting that partitioned convolution makes much more memory references than ordinary overlap-save. In the most simple algorithm analysis, only the number of mathematical operations (like multiplications and additions) are considered when evaluating performance. Better analysis also counts the number of memory references, but unfortunately that is not enough considering the modern computer architecture; it is also of profound importance to take how the accesses are done into consideration. One bad reference can be worse in terms of performance than ten good ones on a modern computer.

Conclusion

By implementing partitioned convolution we have avoided the need of using long FFTs, and moved the major part of the processing time from the FFT to a simple multiplication loop. By reordering data after the forward transform and restoring it prior to the inverse transform, the multiplication loop can be easily realised with SIMD instructions, and thus become very efficient. On the 900 MHz AMD Athlon test system, filtering of a 131072 tap long filter is twice as fast when 16 partitions of 8192 taps each are used instead of a single partition (note: this test case is exceptional, the performance improvement is less in the common case). This is despite the fact that the new algorithm uses more memory references and more mathematical operations.

Apart from the improvement in throughput, we also get lower I/O-delay (equals about twice the partition length), lower memory consumption, and more flexible filter length options. A 140000 tap filter would have to be padded to a 262144 tap filter if ordinary overlap-save were used, but with partitioned convolution we can use 18 partitions of 8192 taps, and then get a gross performance improvement, coupled with delay reduction.
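
The numbers in the 140000 tap example can be reproduced with a few lines of integer arithmetic. This is only a sketch with hypothetical function names: one rounds a length up to the next power of two, the other counts how many fixed-size partitions are needed.

```c
#include <assert.h>

/* Smallest power of two that is >= n (for n >= 1). */
unsigned long next_pow2(unsigned long n)
{
    unsigned long p = 1;

    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Number of partitions of part_len taps needed to hold 'taps' taps. */
unsigned long partitions_needed(unsigned long taps, unsigned long part_len)
{
    assert(part_len > 0);
    return (taps + part_len - 1) / part_len;  /* round up */
}
```

For 140000 taps, next_pow2 gives 262144, while partitions_needed with 8192 tap partitions gives 18, that is 147456 taps in total, so far less padding is processed.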

Still, one must not over-estimate partitioned convolution. If there really is an optimal FFT algorithm available, ordinary overlap-save will certainly outperform the partitioned algorithm. An example of an assembler-optimised FFT algorithm can be found in the non-free and non-portable Intel Native Signal Processing library [19].

Where can I get it?

You are free to download version 0.99k, which is a development release, but should be quite stable by now.

The package contains the source code; you will need a supported platform to run it on (Linux is recommended). Apart from the basic stuff you must also have FFTW3 installed (note that FFTW2, as used by old versions of BruteFIR, won't work).

If you want sound card support, that is, if you want to compile the ALSA I/O module, you need to have ALSA 0.9 installed.

If you want to use the JACK support, you need an up to date version of JACK installed.

Be sure that you use an official gcc compiler when compiling BruteFIR. One user reported bad sound quality (noise artifacts in the BruteFIR output), and it was shown that he had used gcc 2.96 (not an official version), that caused errors in the floating point calculations of BruteFIR.

The package does not yet contain configure scripts or other nice things to make compiling easier. However, with some luck it should work simply by typing 'make'. You can also view the Makefile to see what compile options there are. If you have any questions, just mail me, torger@ludd.luth.se.

How fast is it?

BruteFIR's main feature is that it is fast. It's brutally fast. The key component making BruteFIR fast is the convolution algorithm described above.

How high throughput can I get?

With a massive convolution configuration file (note: the format is from an older version of BruteFIR and is not fully compatible with the current) setting up BruteFIR to run 26 filters, each 131072 taps long, each connected to its own input and output (that is 26 inputs and outputs), meaning a total of 3407872 filter taps, a 1 GHz AMD Athlon with 266 MHz DDR RAM gets about 90% processor load, and can successfully run it in real time. The sample rate was 44.1 kHz, BruteFIR was compiled with 32 bit floating point precision, and the I/O delay was set to 375 ms. The sound card used was an RME Audio Hammerfall.
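
To put the tap count in perspective, the sketch below computes the total number of taps and how many multiply-accumulate operations per second a straightforward time domain FIR filter would need for the same job (one per tap per sample). The function names are hypothetical and the estimate is plain arithmetic, not a measurement.

```c
#include <assert.h>

/* Total number of filter taps over all filters. */
unsigned long long total_taps(unsigned filters, unsigned long taps_each)
{
    return (unsigned long long)filters * taps_each;
}

/* Multiply-accumulates per second for direct time domain filtering:
 * every tap is applied once per output sample. */
unsigned long long direct_macs_per_second(unsigned long long taps,
                                          unsigned long rate_hz)
{
    assert(rate_hz > 0);
    return taps * rate_hz;
}
```

With 26 filters of 131072 taps at 44100 Hz this is 3407872 taps and roughly 1.5e11 multiply-accumulates per second, which illustrates why a frequency domain algorithm is needed at these filter lengths.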

How low I/O delay can I get?

BruteFIR is mainly designed for high throughput, not low delay. However, there is an interest in using BruteFIR for low delay convolution anyway, so here are some benchmarks so you know what to expect. Partitioned convolution can indeed allow for quite low delay, very low if the processing power is available, and the filters are not too long.

Below is an example of a simple cross-talk cancellation application running on a 1 GHz AMD Athlon with 266 MHz DDR RAM and an RME Audio Hammerfall sound card. You can download the cross-talk cancellation configuration file that was used if you want to test yourself. There are only four filters and their lengths are no more than 8192 taps, so it is indeed a very light application, which is a requirement if you want very low delay, since partitioned convolution does not scale very well with low delays (meaning a large number of partitions). The sample rate in these tests is 44.1 kHz, and BruteFIR was running with 32 bit floating point precision.

delay in ms   processor load   partition size   number of partitions
3 ms          60%              64 samples       128
6 ms          30%              128 samples      64
12 ms         16%              256 samples      32
24 ms         11%              512 samples      16
47 ms         8%               1024 samples     8

As seen in the table, BruteFIR allows for as low delay as 3 milliseconds, which is the limit of the sound card used, which cannot have shorter than 64 sample partitions.
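
The delay column can be estimated from the partition size, taking the I/O delay to be twice the partition length as stated earlier. The helper below is a hypothetical sketch of that estimate, not something BruteFIR provides.

```c
#include <assert.h>

/* Estimated I/O delay in milliseconds, assuming the delay is twice
 * the partition length (in samples) at the given sample rate. */
double io_delay_ms(unsigned long part_len, double rate_hz)
{
    assert(rate_hz > 0.0);
    return 2.0 * (double)part_len * 1000.0 / rate_hz;
}
```

At 44.1 kHz this gives about 2.9 ms for 64 samples and about 46.4 ms for 1024 samples, close to the 3 ms and 47 ms listed in the table.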

If you want to run BruteFIR to achieve high throughput, you should expect to have a delay of at least 100 ms though (and using no more than 16 partitions or so).

If you try to run BruteFIR with shorter delay than the computer can handle, or with too long filters, the program will exit with a broken pipe signal. If you get broken pipe only after a while, this is probably because you have not applied a good low latency patch to the kernel (there are bad ones as well), or because you have cron jobs or other software running that competes for the processor. For reasonably low latency, a low latency kernel can handle other processes running, but for as low as 3 milliseconds like in this example, you should have a dedicated clean system for running BruteFIR.

Hardware considerations

What is important for BruteFIR is that the machine has fast memory and a fast processor. A Pentium 4 with its RDRAM is probably the best choice today. However, an Athlon with DDR RAM is not bad either, and significantly cheaper. A fast processor on a computer with slow memory is what most often causes disappointment. For example, a dual Pentium III at 1 GHz with good use of both processors was found to be slower than a single processor 1 GHz AMD Athlon with DDR RAM. The problem was that the Pentium III had poor memory performance. The stream benchmark [20] is a good program to use to verify the memory bandwidth if you think you get poor BruteFIR performance.

If you use SDRAM you will never get exceptional memory bandwidth, however, some tuning of timer settings in the BIOS, or overclocking of the memory bus can give you quite decent performance.

When it comes to sound hardware, you should be able to use any card that is compatible with ALSA [2]. However, it is not very likely that the sound card code of BruteFIR will work for all sound cards supported by ALSA, although that is the goal. If you get problems with your sound card, please send me a mail, and I will do my best to get it to work, or even better, try to get it to work yourself and send me a patch.

The best sound cards are those which support partition sizes which are powers of two. If that is not the case, BruteFIR must run in input poll mode, which is not necessarily less reliable, but will consume a part of the spare processor time.

The worst possible sound card is one which does not support partition sizes with a power of two, and can only transfer large sample blocks at a time. Then BruteFIR will run unreliably or not at all.

If you want to avoid problems I recommend RME Audio [21] Hammerfall (Light) (RME9652 and RME9636) and also cards from the RME Audio Digi96 series (RME96), since those are the cards I use myself. The Hammerfall cards support up to 26 inputs and 26 outputs, the Digi96 cards support up to 8 channels. They are not the cheapest cards out there, but these are clean professional cards, fully digital with ADAT and S/PDIF inputs and outputs, which means you can have high-quality DACs and ADCs outside the computer to get the best sonic performance possible.

The Hammerfall cards allow for shorter delay (minimum partition size is 64 samples) than the Digi96 series (minimum size 1024 samples).

Configuring and running

When BruteFIR is run for the first time (without parameters), it will generate a default configuration file (~/.brutefir_defaults) (unless the -nodefault option is used), and then complain that it cannot find .brutefir_config in the home directory, which is the default location. The default configuration file contains default settings, which are extended and/or overridden in the main configuration file. A setting that is specified in the default configuration file does not need to be listed in the main configuration file.

BruteFIR takes only four parameters: the filename of the main configuration file, and optionally -quiet to suppress title, warnings and informational messages at startup, -nodefault if BruteFIR should read all settings from the main configuration file, and -daemon if it should run as a daemon.

If no parameters are given, the filename given in the default configuration file is used. If the filename is "stdin", BruteFIR will expect the configuration file to be available on the standard input.

The (default) default configuration file looks like this:

## DEFAULT GENERAL SETTINGS ##
 
float_bits: 32;             # internal floating point precision
sampling_rate: 44100;       # sampling rate in Hz of audio interfaces
filter_length: 65536;       # length of filters
config_file: "~/.brutefir_config"; # standard location of main config file
overflow_warnings: true;    # echo warnings to stderr if overflow occurs
show_progress: true;        # echo filtering progress to stderr
max_dither_table_size: 0;   # maximum size in bytes of precalculated dither
allow_poll_mode: false;     # allow use of input poll mode
modules_path: ".";          # path where to find BruteFIR modules
powersave: false;           # pause filtering when input is zero
monitor_rate: false;        # monitor sample rate
lock_memory: true;          # try to lock memory if realtime prio is set
convolver_config: "~/.brutefir_convolver"; # location of convolver config file
 
## COEFF DEFAULTS ##
 
coeff {
        format: "text";     # file format
        attenuation: 0.0;   # attenuation in dB
        blocks: -1;         # how long in blocks
        shared_mem: false;  # allocate in shared memory
};
 
## INPUT DEFAULTS ##
 
input {
        device: "file" {};  # module and parameters to get audio
        sample: "S16_LE";   # sample format
        channels: 2/0,1;    # number of open channels / which to use
        delay: 0,0;         # delay in samples for each channel
	maxdelay: -1;	    # max delay for variable delays
	mute: false, false; # mute active on startup for each channel
};
 
## OUTPUT DEFAULTS ##
 
output {
        device: "file" {};  # module and parameters to put audio
        sample: "S16_LE";   # sample format
        channels: 2/0,1;    # number of open channels / which to use
        delay: 0,0;         # delay in samples for each channel
	maxdelay: -1;	    # max delay for variable delays
	mute: false, false; # mute active on startup for each channel
        dither: false;      # apply dither
};
 
## FILTER DEFAULTS ##
 
filter {
        process: -1;        # process index to run in (-1 means auto)
	delay: 0;           # predelay, in blocks
};

The syntax of the main configuration file is very similar, as we will see. As the default configuration file shows, there are five sections in the configuration: general settings, coeff, input, output and filter.

The general syntax rules for the configuration files are easily grasped from the default configuration file. The semicolons are important: they, not line breaks, mark the end of a setting, so you may have several settings on one line if you like. All characters on a line after a # are ignored. There are three data types: strings, numbers and booleans. Strings are text between quotes, a number is written with or without a decimal point, and a boolean is either 'true' or 'false'.

Note that everything is case sensitive, so setting names must be written in lowercase letters. Although the configuration file examples shown here are nicely ordered in sections, it is perfectly all right to mix settings in any order you like.
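As a minimal sketch of these rules, here is a fragment of a default configuration file. It uses only settings shown above; the values are arbitrary and chosen for illustration.

```
# A comment runs from the # to the end of the line.
float_bits: 32; sampling_rate: 48000;   # two settings on one line
config_file: "~/.brutefir_config";      # string: text between quotes
filter_length: 65536;                   # number without a decimal point
show_progress: false;                   # boolean: true or false

coeff {
        attenuation: 1.5;               # number with a decimal point
};
```

Setting names are case sensitive, so for example Float_Bits would not be recognised as float_bits.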

The general settings section in the main configuration file has the same syntax as in the default configuration file. The difference is that coeff, input, output and filter structures can exist in multiples, and are given names and more parameters.

General settings

Default values of all general settings (except logic) must be given in the default configuration file. Any of these settings may be overridden in the main configuration file (except config_file). These settings are:

float_bits: <NUMBER: internal floating point resolution, either 32 or 64>;
sampling_rate: <NUMBER: sampling rate in Hz>;
filter_length: <NUMBER: length in samples of the (sub)filters>[,<NUMBER: number of subfilters per filter>];
config_file: <STRING: default location of main configuration file>;
overflow_warnings: <BOOLEAN: echo overflow warnings to stderr>;
show_progress: <BOOLEAN: echo progress to stderr>;
max_dither_table_size: <NUMBER: maximum size in bytes of precalculated dither>;
allow_poll_mode: <BOOLEAN: allow input poll mode>;
modules_path: <STRING: path where to find BruteFIR modules>;
logic: <STRING: logic module name> { <logic module parameters> }[, ...];
powersave: <BOOLEAN: pause filtering when input is zero>;
monitor_rate: <BOOLEAN: monitor sample rate, and abort if it changes>;
lock_memory: <BOOLEAN: try to lock memory if realtime prio is set>;
convolver_config: <STRING: file to store FFTW wisdom in>;
benchmark: <BOOLEAN: start in benchmark mode (can only be used in main config file)>;

The filter_length setting specifies how long the filters should be. This can be done in two ways. The first is to specify the length as a single number, which must be a power of two; the convolution is then done on the whole filter length. The second is to specify the partition length followed by the number of partitions: to partition a 65536 tap filter into 16 parts, you write filter_length: 4096,16. Partitioned filters can be used to improve performance and reduce I/O-delay.
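The two forms could be written as follows. Only one filter_length setting would normally appear in a configuration file, so the partitioned form is commented out in this sketch.

```
# Unpartitioned: one convolution over the whole 65536 tap filter.
filter_length: 65536;

# Partitioned: 16 partitions of 4096 taps, 16 * 4096 = 65536 taps in total.
#filter_length: 4096,16;
```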

The convolver_config setting specifies where FFTW wisdom, that is, optimisation information for the FFT calculations, should be stored.

If overflow_warnings is set to true, information about overflows will be printed to the screen when they occur. Note that overflowed samples are always set to the maximum output value of the output device, so there is no actual overflow on the output (unless the floating point value itself overflows). An overflow means that the filter is amplifying too much, either through its coefficients or through input and output attenuation. Overflow is not checked if the output values are floating point.

If dither is applied to any output, a dither table will be calculated when the program is started. It contains uncorrelated random values that are used to generate the dither. The more channels that apply dither, the larger the table needs to be in order to keep the dither uncorrelated between channels. This table can get quite large memory-wise. If you want to limit its size, set max_dither_table_size to a value. It should preferably not be less than one megabyte, though. If it is set to zero or negative, the program will choose a size itself.

BruteFIR uses external modules to provide sample I/O, and optionally to add new logic. The modules_path field specifies the directory to search for BruteFIR modules.

If any logic modules should be loaded, these are listed in the logic field, in pairs of module name / module parameters, separated with commas. Which logic modules are available and what functionality they provide can be found in the Logic modules section.
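A sketch of a logic field that loads two modules might look like this. The "cli" parameters are taken from the CLI section of this document; the "eq" parameters are a stripped-down form of the equaliser example and assume that the fields left out are optional, and that coefficient set 0 is defined elsewhere in the file and stored in shared memory.

```
modules_path: ".";                      # directory searched for modules

# module name / module parameters pairs, separated with commas
logic: "cli" { port: 3000; },
       "eq" { { coeff: 0; bands: "ISO octave"; }; };
```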

If any sound card is used for input or output (or any other sample-clock dependent device), BruteFIR will automatically set its delay-sensitive processes to realtime priority, so you will typically need to run the program as root. To maintain realtime performance, it is important that no memory belonging to the program is in the swapfile, so all memory must be locked to RAM. This is done if lock_memory is set to true. Note that the memory is never locked when realtime priority is not set (that is, when only files are used for input and output). Warning: there seems to be a bug in the Linux kernel which causes shared memory to be locked once for each process, meaning that when lock_memory is set to true, BruteFIR will seem to consume a lot more memory than it should. Also, it of course makes no sense to lock memory if your system has no swap activated. Due to this issue, the best thing to do is to have a system with no swap and avoid locking the memory.

The powersave feature, if activated, monitors the inputs. If an input channel provides only zero samples, the associated filters will not do any processing, since with zero on the input, BruteFIR knows in advance that there will be zero on the output. BruteFIR will continue to run as normal, and filters with non-zero inputs will continue to process normally. As soon as there is non-zero input on a suspended filter, it starts processing again. The powersave feature is transparent: there will be no convolution errors if it is activated. The reason for having it optional is that one may want to make performance tests without the need to feed a meaningful signal to BruteFIR.

If benchmark mode is activated (can only be done in the main configuration file), performance statistics will be printed on screen. Note that due to the complex caching effects of modern computers, the displayed processing times can look strange: a step that requires many more arithmetic operations than another may in certain circumstances still be considerably faster, if it has better luck with the cache. Since benchmarking measures elapsed time, the computer must not be loaded with any other tasks if the results are to be reliable.

If a sound card which is used for input cannot be configured to have a period size (interrupt interval) equal to or smaller than the configured filter (partition) length, or if the period size cannot be a power of two, BruteFIR must be run in input poll mode. This means that the sound card is polled for data, and sound card interrupts are not used. BruteFIR will run just as reliably (as long as the sound card allows for small transfers) but will consume more of the spare processor time. Thus it will look like BruteFIR uses more processor time than it actually needs. If more processor time is used for filtering, less will be used for polling, so input poll mode does not prevent filters from being as long as in normal mode. However, for some applications (for example when the spare processor time is used by another vital program), input poll mode is not suitable, and by setting allow_poll_mode to false, BruteFIR will exit with an error if input poll mode is required.

General structure syntax

<structure type name> <STRING: name (list for some) | NUMBER: index> {
	<field name 1>: <setting 1>;
	[...]
};

Names of structures (given after the type name) are not given in the default configuration file, but must be provided in the main configuration file. The name is either a custom string, or an index number, which must then be the same as the order of the structure in the file, that is, the first structure must be indexed 0, the second 1 and so on. If a string name is given, the index number is assigned automatically (the opposite also applies), and when referring to the structure, either the string name or the index number can be used. Some structures, namely input and output, may have a comma-separated list of names, since the names apply to the channels defined in the structure.

After the name, or the structure type name if in the default configuration file, there is a left brace ({), and then structure fields and their settings, each field/setting pair ending with a semicolon (;). As for the general settings, field names always end with a colon (:). The order of the fields is not important. The structure is closed with a right brace (}) and ended with a semicolon.
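To illustrate how names and index numbers relate, here is a small sketch of a main configuration file. The file names and channel names are made up for the example; the fields themselves are described in the sections that follow.

```
coeff "direct path" { filename: "direct_path.txt"; };   # index 0
coeff "cross path"  { filename: "cross_path.txt"; };    # index 1

input "left", "right" {                 # one name per channel
        device: "file" { path: "in.raw"; };
        sample: "S16_LE";
        channels: 2;
};

output "out left", "out right" {
        device: "file" { path: "out.raw"; };
        sample: "S16_LE";
        channels: 2;
};

filter "by name" {
        from_inputs: "left";
        to_outputs: "out left";
        coeff: "cross path";            # referred to by string name
};

filter "by index" {
        from_inputs: 1;                 # input channel "right"
        to_outputs: 1;                  # output channel "out right"
        coeff: 1;                       # the same set, "cross path"
};
```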

Coeff structure

coeff <STRING: name | NUMBER: index> {
	filename: <STRING: filename> | <NUMBER: shmid>/<NUMBER: offset>/<NUMBER: blocks>[,...];
	format: <STRING: sample format string | "text" | "processed">;
	attenuation: <NUMBER: attenuation in dB>;
	blocks: <NUMBER: length in blocks>;
	shared_mem: <BOOLEAN: allocate in shared mem>;
};

In the default configuration file, the filename field is not set, so it must be present in the main configuration file.

The coeff structure defines a set of filter coefficients, which becomes a FIR filter. There are several different file formats, selected with the format field: a sample format string, "text" or "processed".

Note that BruteFIR currently does not provide any way to convert other formats to the "processed" format (well actually it does, but only through its module API).

The coefficients can be scaled, by setting the attenuation to non-zero.

Instead of a filename, comma-separated number groups can be given. The first number is a shared memory ID (man shmat) where the data is found, the second number is the offset in bytes into the shared memory area where the program starts to read, and the third is how many blocks should be read. A block is a filter segment, that is, if filter_length is 4096,16, one block is 4096 coefficients, and there can be no more than 16 blocks per coefficient set. If the first group does not cover all blocks, following number groups must provide the rest of the length. When a shared memory segment is given, the format is required to be "processed".
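A coefficient set read from two shared memory segments could be sketched as below. The shared memory IDs are invented for illustration; in practice they would come from whatever program created the segments.

```
filter_length: 4096,16;                 # 16 blocks of 4096 coefficients

coeff "from shared memory" {
        # <shmid>/<offset in bytes>/<blocks>, the IDs are hypothetical
        filename: 65538/0/8, 98307/0/8; # 8 + 8 = 16 blocks in total
        format: "processed";            # required for shared memory
};
```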

In some cases, when one wants to test the performance of a certain BruteFIR configuration but does not feel like generating coefficients, one can set the filename to "dirac pulse". BruteFIR will then generate a dirac pulse filter internally and use it as any other filter, so it will cost as much in processing as any other filter of the same length. However, if you need a dirac pulse in the real case, it makes no sense to use this feature, since simply setting the coeff field in the filter structure to -1 gives the same effect and uses very little processor power (and memory).

The blocks field says how long in filter blocks the coefficient set should be. If it is set to -1, the full length is assumed. Note that custom lengths are only possible if partitioned convolution is employed (quite naturally, since else there will only be one filter block covering the full length).

The shared_mem field indicates whether the coefficient set should be stored in shared memory. Some modules, such as the equalisation module, may require that.

Input and output structure

input <STRING: name | NUMBER: index>[, ...] {
        device: <STRING: I/O module name> { <I/O module settings> };
        sample: <STRING: sample format>;
        channels: <NUMBER: open channels>[/<NUMBER: channel index>[, ...]];
	delay: <NUMBER: delay in samples>[, ...];
	maxdelay: <NUMBER: maximum delay for dynamic changes>;
	individual_maxdelay: <NUMBER: maximum delay for dynamic changes>[, ...];
	mute: <BOOLEAN: mute channel>[, ...];
	mapping: <NUMBER: channel index>[, ...];
};

output <STRING: name | NUMBER: index>[, ...] {
        device: <same syntax as for the input structure>;
        sample: <same syntax as for the input structure>;
        channels: <same syntax as for the input structure>;
	delay: <same syntax as for the input structure>;
	maxdelay: <same syntax as for the input structure>;
	individual_maxdelay: <same syntax as for the input structure>;
	mute: <same syntax as for the input structure>;
	mapping: <same syntax as for the input structure>;
	dither: <BOOLEAN: apply dither>;
};

All fields for the input and output structures except mapping, delay and mute must be set in the default configuration file.

The device field specifies the source/destination of the digital audio. This is always an I/O module. First the name of the module is stated, followed by its configuration within {}. If the audio is read/written from/to a module which does not continue forever (for example reading from a file), BruteFIR will finish when the first I/O module comes to an end (hopefully an input module; write failure of an output module is considered an error).

The sample format should be one of the following strings:

The common format 16 bit signed little endian, found for example in 16 bit wav-files, is thus "S16_LE". Floating point samples can be in any range. However, all integer formats are scaled to the range -1.0 to +1.0 internally, so to match an integer format the floating point range should be -1.0 to +1.0. There is no overflow checking for floating point formats (that is, values larger than +1.0 or smaller than -1.0 are not truncated).

The channels field specifies the number of open and used channels of the device. If the number of open channels exceeds the number of used channels, a slash (/) followed by a comma-separated list of channel indexes of used channels must be appended. If we for example have an eight channel ADAT sound card, but only want to use the first two, we write 8/0,1 as the channels setting. As you see, the lowest channel index is zero, not one.

The length of the list of names (given after the structure type name) must match or exceed the number of used channels. If there are more channels in the header (the logical, or virtual channels) than there are available through the device, the specified channels must be mapped onto the physical device channels. This is done with the mapping field, which is simply a list of indexes stating which physical device channel each channel in the header is mapped to. Here is a simplified example:

output 14,15,16 {
        ...
        channels: 8/5,4;
	mapping: 0,1,0;
};

In this example, two channels from the eight channel device are used, the channels with index 5 and 4. The order of the channel indexes matters: physical channel 5 will now be considered the first (index 0) of the available physical channels, and 4 the second (index 1). The mapping field tells how to map the channels called 14, 15 and 16 in the header to those two physical channels. The mapping is in the same order as the channels in the header, that is, 14 is mapped to physical channel index 0 (which is channel 5 on the eight channel device), 15 to index 1 (channel 4 on the device), and 16 to index 0, so the logical channels 14 and 16 will mix into the same output on the device. In the standard case, where the number of logical channels is the same as the number of channels made available through the channels field, a mapping specification is not needed. Then the first logical channel is mapped to the first listed device channel and so on.

The list of delays specifies how many samples each channel should be delayed. This could be used to compensate for speaker positions that are either too close or too far away. It could also be used to compensate for acausal filters. Delay can be changed at runtime if maxdelay is not set to a negative value. The maxdelay setting defines the upper bound of the delay in samples. When the program is started, delay buffers matching maxdelay are allocated for all channels. If maxdelay is negative, only the precise amount specified by the delay list is allocated.

The setting individual_maxdelay was added later, and works the same as maxdelay with the difference that it is specified per channel. It is useful to save memory when there are many channels, and only some of them need dynamic delay (or considerably larger buffer than the others).
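As an illustration, the following sketch delays one of two output channels and leaves room for runtime changes. The channel names and values are chosen for the example, and the default sampling rate of 44100 Hz is assumed in the comments.

```
output "left", "right" {
        device: "alsa" { param: "hw"; };
        sample: "S16_LE";
        channels: 2;
        delay: 0, 44;       # delay "right" by 44 samples, about 1 ms
        maxdelay: 4410;     # allow runtime changes up to 4410 samples (100 ms)
};
```

With the CLI logic module loaded, the delay could then be changed at runtime with a command such as cod "right" 88.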

The mute list of booleans specifies, in order, which channels should be muted from the beginning. The muted channels can later be unmuted from the CLI.

If the dither flag is set to true, dither is applied on all used channels. Dither is a method to add carefully devised noise to improve the resolution. Although most modern recordings contain dither, they need to be redithered after they have been filtered for best resolution. Dither should be applied when the resolution is reduced, for example from 24 bits on the input to 16 bits on the output. However, one can claim that dither should always be applied, since the internal resolution is always higher than the output. When BruteFIR is compiled with single precision, it is not possible to apply dither to 24 bit output, since the internal resolution is not high enough. BruteFIR's dither algorithm is the highly efficient HP TPDF dither algorithm (High Pass Triangular Probability Distribution Function).

Filter structure

filter <STRING: name | NUMBER: index> {
        from_inputs: <STRING: name | NUMBER: index>[/<NUMBER:attenuation in dB>][/<NUMBER:multiplier>][, ...];
        from_filters: <same syntax as from_inputs field>;
        to_outputs: <same syntax as from_inputs field>;
        to_filters: <STRING: name | NUMBER: index>[, ...];
        process: <NUMBER: process index>;
	coeff: <STRING: name | NUMBER: index>;
	delay: <NUMBER: pre-delay in blocks>;
};

Only the process field should be given in the default configuration file.

The filter structure defines where a filter is placed and what its parameters are. The following is done in a filter:

  1. Possible attenuation is applied to the inputs, whereafter they are mixed together.
  2. The mixed-together inputs are filtered.
  3. The filter output is copied to the output channels, possibly with individual attenuation. Attenuation is however not applicable to outputs going to other filters.

If an output channel exists in several filter structures, the filter outputs will be mixed into that channel. Thus, a set of filter structures defines how inputs and outputs should be copied, mixed and filtered.

With the help of the from_filters and to_filters fields, filters can be connected to each other. The only real constraint is that there must be no loops. BruteFIR will detect and point out errors if such exist in a given filter network. Note that, if possible, coefficients should be pre-convolved rather than put as filters in series, since a 2N length filter computes much faster than two cascaded N length filters.

The from_inputs, from_filters and to_outputs fields have the same syntax. One channel/filter is given as the string name or index number, and if attenuation should be applied, it is followed by a slash (/) and the attenuation in dB. Instead of, or combined with, attenuation in dB, a multiplier can be given, a number which all samples will be multiplied with. The notation "channel 1"/6/-1 means that channel 1 is attenuated 6 dB and the polarity is changed (multiplication with -1). It is also possible to write "channel 1"//-0.5, which is approximately equivalent to the first example, since an attenuation of 6 dB corresponds to a multiplier of about 0.5.

If more than one channel should be included, they are separated with commas. The to_filters field has the same syntax with the exception that attenuation is not allowed.
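The attenuation syntax could be used as in the sketch below. The channel names are hypothetical and would have to be defined by input and output structures elsewhere in the file. An attenuation of A dB corresponds to a multiplier of 10^(-A/20), so 6 dB gives about 0.501.

```
filter "example" {
        # "left": attenuated 6 dB with inverted polarity; "right": unchanged
        from_inputs: "left"/6/-1, "right";

        # multiplier only, with the dB part left empty (nearly the same result)
        #from_inputs: "left"//-0.5, "right";

        to_outputs: "out"/3.0;          # attenuate the output by 3 dB
        coeff: -1;                      # no filtering, just mix and copy
};
```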

The process field specifies in which Unix process the filter should be run. All filters with the same process index will run in the same process. Process index 0 must exist, and if there are more processes they should be numbered in series, 0, 1, 2, 3 and so on. This field is important if BruteFIR runs on a multi-processor machine. The optimal situation is that there is one process per processor, and that each process requires the same processor time. Then you will get the most out of your multi-processor computer. There is one limitation on how filters can be distributed between processes: mixing to an output channel or a filter input must be done within the same process.

If the process field is set to -1, an automatic but naive load balancing will take place, which may or may not be as good as a hand-made load balancing.

The coeff field defines which coefficient set should be used for the filter. It can be given as the string name of the set, or as its index number. If the index number is set to minus one (-1), there will be no filtering in the filter; it will just mix and copy inputs/outputs as specified. Note that the length of the coefficient set determines how processor intensive the filter will be.

The delay field specifies how many filter blocks pre-delay there should be. Zero or negative means no delay. The maximum allowed delay is one block less than full length. Thus, with unpartitioned filtering there can be no delay at all. The delay cost is zero both in terms of memory and processing.
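As a worked example of the pre-delay, assume the partitioned filter length used earlier in this document. The filter, channel and coefficient set names are placeholders.

```
filter_length: 4096,16;   # 16 blocks, so the pre-delay may be 0 to 15 blocks

filter "delayed" {
        from_inputs: "left";
        to_outputs: "out left";
        coeff: "direct path";
        delay: 2;         # 2 blocks = 2 * 4096 = 8192 samples of pre-delay
};
```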

Configuration file example

Here follows an example of a main configuration file, showing some of BruteFIR's possibilities. It implements a cross talk cancellation filter for a stereo dipole. The filters are split across two processes to get the most out of a dual processor machine. A computer with a single processor should if possible keep all filters within the same process for best performance. Note that the configuration uses the default settings extensively. For example, no general settings have been specified apart from the addition of the CLI logic module, and in the coeff structures, only the filename field is used.

logic: "cli" { port: 3000; };

coeff "direct path" {
        filename: "direct_path.txt";
};

coeff "cross path" {
        filename: "cross_path.txt";
};

input "left", "right" {
        device: "file" { path: "/disk0/tmp/music.raw"; };
        sample: "S16_LE";
        channels: 2;
};

output "stereo dipole left", "stereo dipole right" {
        device: "file" { path: "output01.raw"; };
        sample: "S16_LE";
        channels: 2;
};

filter "left speaker direct path" {
        from_inputs: 0/6.0;
        to_outputs: 0;
        process: 0;
        coeff: "direct path";
};

filter "left speaker cross path" {
        from_inputs: "right"/6.0;
        to_outputs: "stereo dipole left";
        process: 0;
        coeff: "cross path";
};

filter "right speaker direct path" {
        from_inputs: "right"/6.0;
        to_outputs: "stereo dipole right";
        process: 1;
        coeff: "direct path";
};

filter "right speaker cross path" {
        from_inputs: "left"/6.0;
        to_outputs: "stereo dipole right";
        process: 1;
        coeff: 1;
};

I/O modules

I/O modules are used to provide sample input and output for the BruteFIR convolution engine. It is entirely up to the I/O module how to produce input samples or store output samples. It could for example read input from a sound card or a file, or simply generate noise from a formula.

In the BruteFIR configuration file, an I/O module is specified in each input and output structure.

The purpose of having I/O modules instead of building all functionality directly into BruteFIR is that it should be easy to extend with new functionality, without compromising the core convolution engine.

All I/O modules have the extension ".bfio".

ALSA sound card I/O (alsa)

The ALSA I/O module (named "alsa") is used to read and write samples from/to sound cards. It supports all BruteFIR sample formats that are also supported by the referenced sound device. The basic configuration is simple: only one field, called param, needs to be set, and its value is a string which is passed without modification to ALSA's device open function. Examples: "alsa" { param: "hw"; } or "alsa" { param: "hw:1"; }.

In the above examples, the hardware is accessed directly (the "hw" prefix), but you can also use ALSA's software modes. That is however not recommended, since some functions of BruteFIR, for example overflow protection, expect to be at the very last output stage, and not before another software layer which may perform for example mixing or volume control.

In theory it should also be possible to access files (for example wav-files) through ALSA with "alsa" { param: "file:test.wav"; }, but this does not seem to work currently. It is not recommended in any case, since the module assumes that every device is driven by a sample clock (that is, that it is a sound card).

If the ALSA I/O module is used in several input/output structures, all referenced sound cards will be linked together using the ALSA API. This makes starting and stopping of the sound cards synchronised if the hardware and driver support it; if not, the ALSA subsystem makes starting and stopping as synchronised as it can. However, when many alsa devices are used, this linking can cause the computer to lock up; at least it has happened in the past. This is probably due to a problem in ALSA, and may have been resolved when you read this. However, should you bump into problems, you can disable linking by setting link to false (example: "alsa" { param: "hw:1"; link: false; }).

By default, when reading fails due to an overflow, or writing fails due to an underflow, BruteFIR will abort. If your computer is heavily loaded, and/or partitions are short, and/or other services are running on the computer, over/underflow can occur occasionally. In those cases, one might prefer occasional clicks in the sound to a total stop. The ALSA I/O module can hide over/underflow from BruteFIR, so that it will not abort when that occurs. Just set the ignore_xrun parameter to true (example: "alsa" { param: "hw:1"; ignore_xrun: true; }).
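Put together, a pair of input and output structures using the ALSA module might look like the following sketch. The device string "hw:1" and the channel names are examples only, and combining link and ignore_xrun in one device is assumed to work as the two fields are described above.

```
input "left", "right" {
        device: "alsa" { param: "hw:1"; ignore_xrun: true; };
        sample: "S16_LE";
        channels: 2;
};

output "out left", "out right" {
        # link: false disables linking, in case it causes lock-ups
        device: "alsa" { param: "hw:1"; link: false; ignore_xrun: true; };
        sample: "S16_LE";
        channels: 2;
};
```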

JACK audio server I/O (jack)

The JACK I/O module (named "jack") provides BruteFIR with support for the low-latency JACK audio server [23]. JACK is an audio server under development, and the goal for the JACK I/O module is that it should be compatible with the current CVS version.

To avoid putting I/O-delay into the JACK graph, the JACK buffer size should be set to the same as the BruteFIR partition size. It is however possible to set the JACK buffer size to a smaller value. The I/O-delay in number of JACK buffers as seen by following JACK clients will be:

2 * <BruteFIR partition size> / <JACK buffer size> - 2

Note that both the JACK buffer size and the BruteFIR partition size are always powers of two.

Currently, the JACK I/O module assumes that jackd is run with the -R parameter, at its default client realtime priority which is 9.

The module has only one field, ports, where the associated string values are the names of the ports to connect to. Examples: "jack" { ports: "alsa_pcm:capture_1", "alsa_pcm:capture_2"; } for input, and "jack" { ports: "alsa_pcm:playback_1", "alsa_pcm:playback_2"; } for output. The channel count must equal the number of opened ports, and the sample format should be set to AUTO.
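A complete pair of structures for JACK could be sketched as follows, with the delay formula above worked out in the comments. The port names are the ones from the examples; the channel names and the partition size are chosen for illustration, and the sample format is assumed to be written as the string "AUTO".

```
filter_length: 4096,16;                 # BruteFIR partition size 4096

# I/O-delay seen by following JACK clients, in JACK buffers:
#   JACK buffer size 4096: 2 * 4096 / 4096 - 2 = 0
#   JACK buffer size 1024: 2 * 4096 / 1024 - 2 = 6

input "left", "right" {
        device: "jack" { ports: "alsa_pcm:capture_1", "alsa_pcm:capture_2"; };
        sample: "AUTO";
        channels: 2;                    # same as the number of ports
};

output "out left", "out right" {
        device: "jack" { ports: "alsa_pcm:playback_1", "alsa_pcm:playback_2"; };
        sample: "AUTO";
        channels: 2;
};
```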

Raw PCM file I/O (file)

The raw PCM file I/O module (named "file") is used to read and write samples from/to files. It supports all BruteFIR sample formats and reads/writes them directly in raw, interleaved form. In the simplest case the only parameter is the filename. Example: "file" { path: "test.pcm"; }. One can also specify how many bytes to skip at the beginning of input files, and whether to append to output files. Examples: "file" { path: "test.pcm"; skip: 44; } and "file" { path: "test.pcm"; append: true; }.

If the file I/O module is used for input, the input file can be looped, by setting loop to true.

By using /dev/stdin like this "file" { path: "/dev/stdin"; }, BruteFIR will read data from standard input, so it is then possible to do things like mpg123 -s test.mp3 | brutefir.
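The file module parameters described above could be combined as in this sketch. The file names are placeholders, and it is assumed that skip and loop can be used together on the same input.

```
input "left", "right" {
        # skip the first 44 bytes of the file, and start over at the end
        device: "file" { path: "test.pcm"; skip: 44; loop: true; };
        sample: "S16_LE";
        channels: 2;
};

output "out left", "out right" {
        # add to the end of an existing file instead of overwriting it
        device: "file" { path: "result.pcm"; append: true; };
        sample: "S16_LE";
        channels: 2;
};
```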

Writing your own I/O module

This will probably never be documented. The best way is to look at the source code to see how it is done.

Logic modules

Command line interface (cli)

The CLI logic module (named "cli") provides a command line interface available through telnet or a local socket. If the port field is set to a number, that will be the port the CLI listens on for connecting telnet clients. If it is a string, it is interpreted as the path to a local socket which is created instead. Example: "cli" { port: 3000; }.

Instead of specifying a port, you can specify a string of commands, which will be run in a loop as a script. Example: "cli" { script: "cfc 0 0;; sleep 10;; cfc 0 1;; sleep 10"; }. The script may span several lines. Each line is carried out atomically (this is also true for command line mode), so if there are several commands on a single line, separated with semicolons, they will be performed atomically. The exception is when an empty statement (just a semicolon) is put in the line, as in the script example; this works as a line break, and thus separates atomic statements.

A typical use for atomic statements is to change filter coefficients and volume at the same time.
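Such a script could be sketched like this, using the cfc and cfoa commands from the help text further down. The filter, output and coefficient indexes and the attenuation values are made up for the example.

```
logic: "cli" {
        # Each coefficient change is paired with an output attenuation
        # change; a single semicolon keeps the pair atomic, and the empty
        # statements (;;) separate the atomic steps.
        script: "cfc 0 1; cfoa 0 0 6.0;; sleep 10;; cfc 0 0; cfoa 0 0 0.0;; sleep 10";
};
```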

If the field echo is set to true, the CLI commands will be printed on BruteFIR's console as well. This is off by default.

The command line interface is used for changing settings at runtime, which is of course only suitable when BruteFIR is used in realtime. When you are connected and type "help" at the prompt, you will get the following output:

Commands:

lf -- list filters.
lc -- list coeffient sets.
li -- list inputs.
lo -- list outputs.
lm -- list modules.

cfoa -- change filter output attenuation.
        cfoa <filter> <output> <attenuation|Mmultiplier>
cfia -- change filter input attenuation.
        cfia <filter> <input> <attenuation|Mmultiplier>
cffa -- change filter filter-input attenuation.
        cffa <filter> <filter-input> <attenuation|Mmultiplier>
cfc  -- change filter coefficients.
        cfc <filter> <coeff>
cod  -- change output delay.
        cod <output> <delay>
cid  -- change input delay.
        cid <input> <delay>
tmo  -- toggle mute output.
        tmo <output>
tmi  -- toggle mute input.
        tmi <input>
imc  -- issue input module command.
        imc <index> <command>
omc  -- issue output module command.
        omc <index> <command>
lmc  -- issue logic module command.
        lmc <module> <command>

sleep -- sleep for the given number of seconds [and milliseconds].
abort -- terminate immediately.
tp    -- toggle prompt.
ppk   -- print peak info, channels/samples/max dB.
rpk   -- reset peak meters.
upk   -- toggle print peak info on changes.
rti   -- print current realtime index.
quit  -- close connection.
help  -- print this text.

Notes:

- When entering several commands on a single line,
  separate them with semicolons (;).
- Inputs/outputs/filters can be given as index
  numbers or as strings between quotes ("").

Most commands are simple and don't need to be further explained. Naturally, any change will lag behind by the I/O delay. The exceptions are the mute and change delay commands, which lag behind by the period size of the sound card, which is most often smaller than the program's total I/O delay. However, when there is a virtual channel mapping, mute and delay changes will be lagged as well.

The imc, omc and lmc commands are used to give commands to I/O modules and logic modules at runtime. To find out which modules are loaded and which indexes they have, use the command lm. Not all modules support runtime commands though.

Attenuations can be changed with cffa, cfia and cfoa, either with a dB number or by giving a multiplier, which is then prefixed with m, like this: cfoa 0 0 m-0.5. Changing the attenuation with a dB number will not change the sign of the resulting multiplier.
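
The relation between the two forms can be sketched in Python as follows. The sketch assumes the usual amplitude convention, in which an attenuation of A dB corresponds to a multiplier of 10^(-A/20). That convention is an assumption here and is not stated above, and the function names are hypothetical.

```python
import math
from typing import Tuple


def attenuation_db_to_multiplier(attenuation_db: float) -> float:
    """Convert an attenuation in dB to a positive linear multiplier.

    Assumes the amplitude convention: multiplier = 10 ** (-dB / 20).
    """
    return 10.0 ** (-attenuation_db / 20.0)


def multiplier_to_attenuation_db(multiplier: float) -> Tuple[float, bool]:
    """Return (attenuation in dB, polarity inverted?) for a multiplier.

    A negative multiplier inverts the polarity, which a dB value alone
    cannot express.
    """
    if multiplier == 0:
        raise ValueError("a zero multiplier has no finite dB value")
    return -20.0 * math.log10(abs(multiplier)), multiplier < 0


# The multiplier from the example command, m-0.5
db, inverted = multiplier_to_attenuation_db(-0.5)
print("%.2f dB, inverted: %s" % (db, inverted))
```

Under this assumption, m-0.5 corresponds to roughly 6 dB of attenuation with inverted polarity, which is why a multiplier is needed when the sign has to change.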

Run-time equaliser

The equaliser logic module takes control over one or more coefficient sets, and renders equaliser filters to them, as specified by the user. This can be done in the initial configuration, and the equalisers can also be updated at run-time through the CLI.

The startup configuration can look like this:

  "eq"  {
		debug_dump_filter: "/tmp/rendered-%d";
		{
			coeff: 0, 1;
			#bands: "ISO octave";
			#bands: "ISO 1/3 octave";
			bands: 100, 200, 500;
			magnitude: 20/-3.2, 100/8.5;
			phase: 20/0, 100/180;
		};
		{
			coeff: "eq-1";
			bands: "ISO octave";
			magnitude: 31.5/-3.2, 125/8.5;
			phase: 31.5/3.2;
		};
	};

If you want to analyse the rendered filters, the debug_dump_filter setting specifies a file name to which the rendered coefficients will be written. It must contain %d, which will be replaced by the coefficient index. Then follow the equalisers. Each specifies which coefficient index (or name) it should render the equaliser filter to. These coefficients must be allocated and must be stored in shared memory, for example like this:

coeff 0 {
        filename: "dirac pulse";
        shared_mem: true;
        blocks: 4;
};

The dirac pulse will be replaced by the rendered filter. Each equaliser has a set of frequency bands (max 128), which can be specified manually or taken from the ISO octave band presets. Optionally, magnitude (in dB) and phase (in degrees) settings can be specified. Each frequency value must then match one of the given bands.
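
The two constraints above (at most 128 bands, and every magnitude or phase frequency matching a band) can be checked before a configuration is written. The following Python sketch is a hypothetical pre-flight check and not part of BruteFIR. It compares frequencies exactly, which is an assumption about how the match is done.

```python
from typing import Dict, List, Sequence

MAX_BANDS = 128  # upper limit on bands per equaliser, as stated in the text


def check_equaliser(bands: Sequence[float],
                    settings: Dict[float, float]) -> List[str]:
    """Return a list of problems; an empty list means nothing was found.

    bands    -- band frequencies in Hz
    settings -- frequency in Hz mapped to a magnitude (dB) or phase (degrees)
    """
    problems = []
    if len(bands) > MAX_BANDS:
        problems.append("too many bands: %d > %d" % (len(bands), MAX_BANDS))
    for freq in sorted(settings):
        # Assumption: a setting must name a band frequency exactly.
        if freq not in bands:
            problems.append("%g Hz is not one of the given bands" % freq)
    return problems


# Example with made-up values: 250 Hz is not among the bands.
print(check_equaliser([100, 200, 500], {100: 8.5, 250: -3.2}))
```
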

If you specify two coefficients (as in coeff: 0, 1 above), the rendering will be double-buffered, meaning that the eq module will keep one coefficient active in the filter(s), render to the other, and switch when ready. This removes the risk of playing an incomplete equaliser, which can cause some noise (usually in the form of a beep), so double-buffered mode is recommended. In the filter configuration, and when referring to the equaliser in the CLI, the first of the two coefficients should then be used.

At run-time, equalisers can be modified through the CLI. An example: lmc eq 0 mag 20/-10, 4000/10 will set the magnitude to -10 dB at 20 Hz and +10 dB at 4000 Hz for the equaliser for coefficient 0. Instead of mag, phase can be given. The command lmc eq "eq-1" info will list the current settings for the equaliser stored in the coefficient called "eq-1".
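
When such commands are generated by a script, the frequency/value pairs can be formatted with a small helper. The Python sketch below only builds the command text in the form shown in the example above. The function name eq_command is hypothetical, and sending the text to the CLI is left out.

```python
from typing import Mapping, Union


def eq_command(coeff: Union[int, str], kind: str,
               points: Mapping[float, float]) -> str:
    """Build an 'lmc eq' command string.

    coeff  -- coefficient index, or coefficient name (will be quoted)
    kind   -- "mag" (values in dB) or "phase" (values in degrees)
    points -- band frequency in Hz mapped to the new value
    """
    if kind not in ("mag", "phase"):
        raise ValueError("kind must be 'mag' or 'phase'")
    target = str(coeff) if isinstance(coeff, int) else '"%s"' % coeff
    # Emit the pairs in ascending frequency order as freq/value
    pairs = ", ".join("%g/%g" % (freq, points[freq]) for freq in sorted(points))
    return "lmc eq %s %s %s" % (target, kind, pairs)


# Reproduces the example command from the text
print(eq_command(0, "mag", {20: -10, 4000: 10}))
```
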

The more heavily the computer is loaded by convolution, the longer it will take to render the new equaliser. If the coefficient it renders to is very short, and the magnitude and phase response is very detailed (sharp edges etc.), the rendered filter will not be able to follow the requested response fully.

Sometime in the distant future I may make the documentation of this module understandable, and perhaps even make a graphical user interface for it (but that is not very likely).

Writing your own logic module

This will probably never be documented. Just look at the source code and see how it is done.

Tuning

Realtime index

The program calculates a realtime index which can be shown through the CLI, or will be printed periodically to the screen if the show_progress flag is set. The realtime index is a floating point value. When it is 1.0, 100% of the available processing power must be used at all times to be able to achieve realtime performance. If it is larger than 1.0, it means that with the current configuration, BruteFIR will not manage realtime performance.

If your configuration is too demanding for realtime, you should shorten the filters (or remove channels) until the realtime index is just below 1.0, perhaps 0.95. This way you make full use of your computer. However, if you have multiple processors, it is not as simple. The realtime index shows how much is needed from the most loaded processor, but leaves proper load balancing to you. So, devise your configuration carefully if you have multiple processors. The number of input and output channels and the filter length are what consume processor time. The number of filters, dither, delay, mixing and attenuation are very cheap in comparison.

When testing with realtime indexes above 1.0, inputs and outputs must of course be files. For performance testing, you could use "/dev/zero" for input and "/dev/null" for output. Also note that it takes some time for the index to stabilise.

The realtime index typically matches the processor load when running with a sound card. However, if input poll mode is employed, the realtime index can be considerably lower than the processor load, since input polling is performed in the spare processor time.

FFTW wisdom

When BruteFIR runs for the first time, it will generate FFTW wisdom, which takes some time. FFTW wisdom is benchmarking information which tells the FFTW library how to run FFTs in the most efficient way on the given computer. Since the information is hardware and binary dependent, the file should be removed when hardware is changed/upgraded or BruteFIR is recompiled. A wisdom file that was not generated on the hardware BruteFIR is running on, or not by the binary that is run, may yield suboptimal performance. When BruteFIR is calculating FFTW wisdom, the computer should not be running other processor-demanding software.

Naturally, it is very important that FFTW was compiled with the correct optimisation flags to achieve optimal performance.

The wisdom is loaded, used and updated each time BruteFIR is run. Each time BruteFIR uses a partition length it has not used before (and thus there is no wisdom available), it will need to generate new wisdom, which will take some time.

Low latency patch

If you are going to use BruteFIR in realtime, it is strongly recommended that you patch your kernel to reduce latency, or else the program may fail to keep up when a cron job or a screen saver starts. The Linux kernel's latency problems have been reduced in the new 2.4 kernel, but the result is still not satisfactory without the patch applied.

For the 2.4 kernel, Andrew Morton's low latency patches are recommended [24].

Sample clock problems

If you use digital input and output, as I would recommend, you may get problems if the sound card is not configured properly. It is very important that the input and output sample clocks use the same clock as reference. Otherwise, micro-differences between the input and output sample clocks will make BruteFIR's I/O buffers slide apart, and eventually make the program stop. Usually there is an option to set the digital sound card's sample clock to 'slave'.

If you have analog input or output or both, you cannot get this problem (unless you use several different sound cards, in which case it will fail due to differences in clocking).

Digital sound cards that work in slave mode allow the sample clock to be changed at run-time. Usually, this is not what one wants for BruteFIR, since the filters are designed for only one sample rate. Therefore BruteFIR can be configured to exit if it detects a sample clock different from the one given in the configuration file.

Double precision or not

BruteFIR can run with 32 or 64 bit floating point internal resolution. Traditionally, 32 bit is called "single precision", and 64 bit "double precision". The float_bits setting is used to change resolution. By default, BruteFIR runs in 32 bit.

Depending on the processor used, you may lose assembler optimisations when running in 64 bit. Also, the memory bandwidth used by BruteFIR will naturally double, which reduces performance. Thus, although 64 bit and 32 bit operations are generally equally fast, the increased memory usage means that BruteFIR needs 30 - 50% extra processor time, not counting additional effects if assembler optimisations are lost.

When do you need double precision? If you are picky enough about sound quality that you would require dither on 24 bit output, then you need double precision. For most audio work however, 32 bit precision is enough.

Choosing number of partitions

There is no formula for calculating the optimal number of partitions to get maximum throughput. It varies between hardware platforms, so trial and error is the only working method. More than about 16 partitions are generally not recommended though.
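
Since trial and error is the method, it can help to list the candidates first. The Python sketch below enumerates ways to split a filter of a given length into at most 16 equally long partitions. It assumes that partition lengths should be powers of two, which is common for FFT-based convolution but is an assumption here. The function name is hypothetical, and each candidate still has to be benchmarked on the target machine, for example by comparing the realtime index.

```python
from typing import List, Tuple


def candidate_partitionings(filter_length: int,
                            max_partitions: int = 16) -> List[Tuple[int, int]]:
    """List (partition_length, partitions) pairs worth benchmarking.

    Only even splits with a power-of-two partition length are kept
    (assumption, see the text above).
    """
    if filter_length < 1:
        raise ValueError("filter_length must be positive")
    candidates = []
    for partitions in range(1, max_partitions + 1):
        if filter_length % partitions != 0:
            continue  # filter does not split evenly
        length = filter_length // partitions
        if (length & (length - 1)) == 0:  # true only for powers of two
            candidates.append((length, partitions))
    return candidates


# Example: a 65536-tap filter (example value)
for length, partitions in candidate_partitionings(65536):
    print("%d partitions of %d taps" % (partitions, length))
```
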

If you are using partitioned filters to reduce the I/O delay for realtime filtering, make sure that the delay does not get too low. If the I/O delay is too low, the sound card buffers can overflow or underflow, causing the program to exit with a broken pipe signal.

Realtime issues

Extremely low latencies, such as 64 sample partitions, will probably not work for long periods of time, even with a low latency patched kernel.

The processor typically cannot be loaded more than 85% for safe realtime operation. For very low latencies, this number could go down to 70%. The reason is that computing time will vary somewhat (that is how modern computers work), and to be able to cope with the maximum computing times, some spare processor time must be left.
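
To get a feeling for these numbers, the time available per partition can be worked out directly. The Python sketch below assumes that one partition of samples must be processed within the time it takes to play it, and uses 44100 Hz as an example sample rate. Both the function names and the sample rate are assumptions made for the illustration.

```python
def block_duration_ms(partition_length: int, sample_rate: int) -> float:
    """Duration of one partition in milliseconds."""
    return 1000.0 * partition_length / sample_rate


def compute_budget_ms(partition_length: int, sample_rate: int,
                      max_load: float = 0.85) -> float:
    """Computing time per partition that keeps the load at or below max_load.

    max_load is 0.85 for typical cases and 0.70 for very low latencies,
    following the figures given in the text.
    """
    if not 0.0 < max_load <= 1.0:
        raise ValueError("max_load must be in the range (0, 1]")
    return max_load * block_duration_ms(partition_length, sample_rate)


# Example: 64 sample partitions at 44100 Hz (example sample rate)
print("block:  %.3f ms" % block_duration_ms(64, 44100))
print("budget: %.3f ms at 70%% load" % compute_budget_ms(64, 44100, 0.70))
```

With 64 sample partitions at 44100 Hz, each block lasts about 1.45 ms, so even a short scheduling hiccup uses up the whole margin, which is consistent with the caution about extremely low latencies above.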

Request features

Which new features get into BruteFIR is decided by its users. If you need a feature, let me know, and I'll see what I can do (and want to do).

References

  1. Advanced Micro Devices, Inc. website. http://www.amd.com.
    Makers of the Athlon processor.
  2. A. Bagnara, J. Kysela et al ALSA, Advanced Linux Sound Architecture. http://www.alsa-project.org.
    A powerful and flexible audio applications API developed primarily for Linux.
  3. D.J. Bernstein djbfft. http://cr.yp.to/djbfft.html.
    A compact FFT library implemented in C, faster than most, including FFTW.
  4. J. M. P. Borallo, M. G. Otero On the implementation of a partitioned block frequency domain adaptive filter (PBFDAF) for long acoustic echo cancellation. Elsevier Signal Processing, vol 27 No 3 June 1992, page 301-315.
  5. J. W. Cooley, J. W. Tukey An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation, Vol. 19, April 1965, pp. 297-301.
  6. Free Software Foundation GNU General Public License. http://www.gnu.org/copyleft.
    One of the most common free software licenses. Its main purpose is to make sure that the software is kept free and open source.
  7. M. Frigo, S. G. Johnson FFTW. http://www.fftw.org.
    A fast and full-featured FFT library implemented in C. Called "Fastest Fourier Transform in the West".
  8. M. Frigo, S. G. Johnson FFTW: An Adaptive Software Architecture for the FFT. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 3, 1998, pp. 1381-1384.
  9. GNU Compiler Collection. http://gcc.gnu.org.
    A free software multi-platform compiler supporting the programming languages C, C++, Objective C and Fortran.
  10. Intel Corporation website. http://www.intel.com.
    Makers of the Pentium processor.
  11. Linux Online website. http://www.linux.org.
    Linux is a free Unix-type operating system originally created by Linus Torvalds with the assistance of developers around the world.
  12. P. C. W. Sommen Adaptive Filtering Methods. Ph. D. dissertation, Tech. Univ. Eindhoven, Eindhoven, The Netherlands, 1992.
  13. P. C. W. Sommen Partitioned frequency domain adaptive filters. Proc Asilomar Conf. Signals, Systems and Computers, 1989, pp. 676 - 681.
  14. J. S. Soo, K. K. Pang A new structure for block FIR adaptive digital filters. Proc. IREECON, vol 38, pp. 364 - 367, 1987.
  15. J. S. Soo, K. K. Pang Multidelay block frequency adaptive filter, IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-38, No. 2, February 1990.
  16. T. G. Stockham Jr. High-speed convolution and correlation. AFIPS Proc. 1966 Spring Joint Computer Conf., Vol 28, Spartan Books, 1966, pp. 229 - 233.
  17. B. D. Kulp Digital Equalization using Fourier Transform Techniques. AES preprint 2694, 1988.
  18. A. Torger NWFIIR Audio Tools. http://www.ludd.luth.se/~torger/filter.html.
    A set of tools for measuring and processing impulse responses, room equalisation being the target application.
  19. Intel Signal Processing Library. http://developer.intel.com/software/products/perflib/spl/index.htm.
  20. STREAM: Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream/.
    A portable and simple memory benchmark program.
  21. RME Audio. http://www.rme-audio.com.
  22. D. Sbragion Digital Room Correction. http://freshmeat.net/projects/drc.
    A program which generates room correction FIR filters to be used in HiFi systems.
  23. P. Davis et al JACK audio server. http://jackit.sourceforge.net/.
    A low-latency audio server, written primarily for the GNU/Linux operating system.
  24. A. Morton Linux Scheduling Latency. http://www.zip.com.au/~akpm/linux/schedlat.html.
    A collection of notes and tools related to an effort to decrease the typical scheduling latency of the 2.4.x kernel.





(c) Copyright 2001 - 2003 - Anders Torger