6 Profiling
With MLton and mlprof, you can profile your program to find
out for each function how much time it spends or how many bytes it
allocates. To profile your program, compile with either -profile
alloc or -profile time (you cannot do both simultaneously).
Then, run the executable. While it runs, the profiler maintains
counts (bytes or clock ticks) for each source function in the program.
When the program finishes, it automatically writes the counts to an
mlmon.out file. You can then run mlprof on the executable
and the mlmon.out file to see the percentage of the total
(allocation or time) spent in each function.
Here is an example of time profiling, run from within the examples/profiling directory.
% mlton -profile time tak.sml
% ./tak
% mlprof tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
function cur
------------- -----
Tak.tak1.tak2 75.8%
Tak.tak1 24.2%
This is a contrived example with two mutually recursive copies of the
tak function, defined within the Tak structure. This
example shows how mlprof indicates lexical nesting via a
sequence of period-separated names indicating the structures and
functions in which a function definition is nested. The profiling
data shows that roughly three-quarters of the time is spent in the
Tak.tak1.tak2 function, while the rest is spent in Tak.tak1.
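As a rough illustration (the actual tak.sml in the examples/profiling directory may differ in detail), such a program might look like the following, with tak2 nested inside tak1, which is in turn nested inside the Tak structure:

```sml
structure Tak =
   struct
      fun tak1 (x, y, z) =
         let
            (* tak2 is lexically nested in tak1, so mlprof
               reports it as Tak.tak1.tak2 *)
            fun tak2 (x, y, z) =
               if y >= x
                  then z
               else tak1 (tak2 (x - 1, y, z),
                          tak2 (y - 1, z, x),
                          tak2 (z - 1, x, y))
         in
            if y >= x
               then z
            else tak2 (tak1 (x - 1, y, z),
                       tak1 (y - 1, z, x),
                       tak1 (z - 1, x, y))
         end
   end

val _ = Tak.tak1 (18, 12, 6)
```

Because tak2 does most of the recursive work before handing back to tak1, it is plausible that it accounts for the larger share of the ticks.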
You can display raw counts in addition to percentages with -raw
true.
% mlprof -raw true tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
function cur raw
------------- ----- -------
Tak.tak1.tak2 75.8% (4.55s)
Tak.tak1 24.2% (1.45s)
You can display the filename and line numbers for functions in addition
to their names with -show-line true.
% mlprof -show-line true tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
function cur
------------------------- -----
Tak.tak1.tak2 tak.sml: 5 75.8%
Tak.tak1 tak.sml: 3 24.2%
Allocation profiling is very similar to time profiling. Here is an
example run from within the examples/profiling directory.
% mlton -profile alloc list-rev.sml
% ./list-rev
% mlprof -show-line true list-rev mlmon.out
6,030,136 bytes allocated (108,336 bytes by GC)
function cur
----------------------- -----
append list-rev.sml: 1 97.6%
<gc> 1.8%
<main> 0.4%
rev list-rev.sml: 6 0.2%
The data shows that most of the allocation is done by the append
function defined on line 1 of list-rev.sml. The table also
shows how special functions like &lt;gc&gt; and &lt;main&gt; are handled:
they are printed with surrounding brackets. C functions are displayed
similarly. Here, as is usually the case, the allocation done by the
garbage collector is due to it growing the stack.
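A plausible shape for list-rev.sml (hypothetical; the real example file may differ) is a quadratic reverse built on append. Every call to append allocates a fresh cons cell for each element of its first argument, and rev calls append once per element, which is why append dominates the allocation profile:

```sml
fun append (l1, l2) =
   case l1 of
      [] => l2
      (* one fresh cons cell per element of l1 *)
    | x :: l1' => x :: append (l1', l2)

fun rev l =
   case l of
      [] => []
      (* quadratic: one append per element *)
    | x :: l' => append (rev l', [x])

val _ = rev (List.tabulate (1000, fn i => i))
```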
Time profiling typically has a very small performance impact.
However, the performance impact of allocation profiling is noticeable,
because it inserts additional C calls for object allocation.
6.1 Profiling the stack
For both allocation and time profiling, you can use -profile-stack true to count the time spent (or bytes allocated)
while a function is on the stack. Here is an example.
% mlton -profile alloc -profile-stack true list-rev.sml
% ./list-rev
% mlprof -show-line true list-rev mlmon.out
6,030,136 bytes allocated (108,336 bytes by GC)
function cur stack GC
----------------------- ----- ----- ----
append list-rev.sml: 1 97.6% 97.6% 1.4%
<gc> 1.8% 0.0% 1.8%
<main> 0.4% 98.2% 1.8%
rev list-rev.sml: 6 0.2% 97.6% 1.8%
In the above table, we see that rev, defined on line 6 of list-rev.sml, is only responsible for 0.2% of the allocation, but is
on the stack while 97.6% of the allocation is done by the user
program and while 1.8% of the allocation is done by the garbage
collector.
The performance impact of -profile-stack true can be noticeable
since there is some extra bookkeeping at every nontail call and
return.
6.2 Call graphs
For easier visualization of profiling data, mlprof creates a
call graph of the program in dot format, from which you can use the
graphviz
software package to create a PostScript graph. For example, mlprof foo mlmon.out will create foo.dot with a complete call
graph. For each source function, there will be one node in the graph
that contains the function name (and source position with -show-line true), as well as the percentage of ticks. If you want to
create a call graph for your program without any profiling data, you
can simply call mlprof without any mlmon.out files,
e.g. mlprof foo.
Because SML has higher-order functions, the call graph depends
on MLton's analysis of which functions call each other. This
analysis depends on many implementation details and might display
spurious edges that a human could conclude are impossible. However,
in practice, the call graphs tend to be very accurate.
Because call graphs can get big, mlprof provides the -keep
option to specify the nodes that you would like to see. This
option also controls which functions appear in the table that mlprof prints. The argument to -keep is an expression
describing a set of source functions (i.e. graph nodes), taken from
the following grammar.
e ::= all
    | "s"
    | (and e ...)
    | (from e)
    | (not e)
    | (or e ...)
    | (pred e)
    | (succ e)
    | (to e)
    | (thresh x)
    | (thresh-gc x)
    | (thresh-stack x)
In the grammar, all denotes the set of all nodes. "s" is a regular expression denoting the set of functions whose name
(followed by a space and the source position) has a prefix matching
the regexp. The and, not, and or expressions denote
intersection, complement, and union, respectively. The pred and
succ expressions add the set of immediate predecessors or
successors to their argument, respectively. The from and to expressions denote the set of nodes that have paths from or to the
set of nodes denoted by their arguments, respectively. Finally, thresh, thresh-gc, and thresh-stack denote the set of
nodes whose percentage of ticks, gc ticks, or stack ticks,
respectively, is greater than or equal to the real number x.
For example, if you want to see the entire call graph for a program,
you can use -keep all (this is the default). If you want to see
all nodes reachable from function foo in your program, you would
use -keep '(from "foo")'. Or, if you want to see all the
functions defined in subdirectory bar of your project that used
at least 1% of the ticks, you would use -keep '(and ".*/bar/"
(thresh 1.0))'. To see all functions with ticks above a threshold,
you can also use -thresh x, which is an abbreviation for -keep '(thresh x)'. You cannot use multiple -keep arguments
or both -keep and -thresh. When you use -keep to
display a subset of the functions, mlprof will add dashed edges
to the call graph to indicate a path in the original call graph from
one function to another.
When compiling with -profile-stack true, you can use mlprof
-gray true to make the nodes darker or lighter depending on whether
their stack percentage is higher or lower.
MLton's optimizer may duplicate source functions for any of a
number of reasons (functor duplication, monomorphisation,
polyvariance, inlining). By default, all duplicates of a function are
treated as one. If you would like to treat the duplicates separately,
you can use mlprof -split regexp, which will cause all
duplicates of functions whose name has a prefix matching the regular
expression to be treated separately. This can be especially useful
for higher-order utility functions like General.o.
6.3 Using MLton.Profile
To profile individual portions of your program, you can use the MLton.Profile structure (see Section 8.2.15). This
allows you to create many units of profiling data (essentially,
mappings from functions to counts) during a run of a program, to
switch between them while the program is running, and to output
multiple mlmon.out files. Executing mlprof with multiple
mlmon.out files sums the profiling data to produce the output
profiling information.
Here is an example, run from within the examples/profiling
directory, showing how to profile the executions of the fib and
tak functions separately.
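A sketch of what fib-tak.sml might contain (the actual example file may differ in detail) uses MLton.Profile.Data.malloc to create two units of profiling data, MLton.Profile.withData to run each function under its own unit, and MLton.Profile.Data.write to emit separately named mlmon files:

```sml
structure Profile = MLton.Profile

val fibData = Profile.Data.malloc ()
val takData = Profile.Data.malloc ()

fun fib n = if n <= 1 then n else fib (n - 1) + fib (n - 2)

fun tak (x, y, z) =
   if y >= x
      then z
   else tak (tak (x - 1, y, z),
             tak (y - 1, z, x),
             tak (z - 1, x, y))

(* counts accumulated while fib runs go to fibData,
   and likewise for tak *)
val _ = Profile.withData (fibData, fn () => ignore (fib 38))
val _ = Profile.withData (takData, fn () => ignore (tak (18, 12, 6)))

val _ = Profile.Data.write (fibData, "mlmon.fib.out")
val _ = Profile.Data.write (takData, "mlmon.tak.out")
```

Counts accumulated outside any withData call (for example, in top-level setup code) still go to the default unit, which is written to mlmon.out at program exit; this accounts for the &lt;unknown&gt; entries below.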
% mlton -profile time fib-tak.sml
% ./fib-tak
% mlprof fib-tak mlmon.fib.out
5.77 seconds of CPU time (0.00 seconds GC)
function cur
--------- -----
fib 96.9%
<unknown> 3.1%
% mlprof fib-tak mlmon.tak.out
0.68 seconds of CPU time (0.00 seconds GC)
function cur
-------- ------
tak 100.0%
% mlprof fib-tak mlmon.fib.out mlmon.tak.out mlmon.out
6.45 seconds of CPU time (0.00 seconds GC)
function cur
--------- -----
fib 86.7%
tak 10.5%
<unknown> 2.8%
6.4 Profiling details
Technically speaking, mlprof produces a call-stack graph rather
than a call graph, because it describes the set of possible call
stacks. The difference is in how tail calls are displayed. For
example, if f nontail calls g and g tail calls h, then the call-stack graph has edges from f to g and
f to h, while the call graph has edges from f to g and g to h.
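The distinction can be seen in a tiny sketch (hypothetical functions):

```sml
fun h z = z + 1    (* leaf function *)
fun g y = h y      (* tail call: g's frame is replaced by h's *)
fun f x = 1 + g x  (* nontail call: f's frame stays on the stack *)
```

Here the possible call stacks are [f], [f, g], and [f, h], because g's frame is gone by the time h runs; hence the call-stack graph has edges f to g and f to h, while the ordinary call graph has f to g and g to h.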
Conceptually, both allocation and time profiling work in the same way.
The compiler produces information that maps machine code positions to
source functions that the profiler uses while the program is running
to find out the current source function. With -profile-stack
true, the profiler also keeps track of which functions are on the
stack. The profiler associates counters (either clock ticks or byte
counts) with source functions. For allocation profiling, the compiler
inserts code whenever an object is allocated to call a C function to
increment the appropriate counter. For time profiling, the profiler
catches the SIGPROF signal 100 times per second and increments
the appropriate counter. When the program finishes, the profiler
writes the counts out to the mlmon.out file. Then, mlprof
uses source information stored in the executable to associate the
counts in the mlmon.out file with source functions.
With -profile time, use of the following in your program will
cause a run-time error, since they would interfere with the profiler's
signal handler.
MLton.Itimer.set (MLton.Itimer.Prof, ...)
MLton.Signal.setHandler (MLton.Signal.prof, ...)
Also, because of the random sampling used to implement -profile
time, it is best to have a long running program (at least tens of
seconds) in order to get reasonable time data.
There may be a few missed clock ticks or bytes allocated at the
very end of the program after the data is written.
Profiling has not been tested with threads. In particular, stack
profiling may behave strangely.