Emerging Infectious Diseases
Volume 5 No.3 / May - June 1999


Dispatches

Application of Data Mining to Intensive Care Unit Microbiologic Data(ft 1)

Stephen A. Moser, Warren T. Jones, and Stephen E. Brossette
The University of Alabama at Birmingham, Birmingham, Alabama, USA

---------------------------------------------------------------------------
      We describe refinements to and new experimental applications of
      the Data Mining Surveillance System (DMSS), which uses a large
      electronic health-care database for monitoring emerging
      infections and antimicrobial resistance. For example,
      information from DMSS can indicate potentially important shifts
      in infection and antimicrobial resistance patterns in the
      intensive care units of a single health-care facility.

We have defined a new exploratory data mining process for automatically
identifying new, unexpected, and potentially interesting patterns in
hospital infection control and public health surveillance data. This
process, and the system based on it, Data Mining Surveillance System
(DMSS), use association rules to represent outcomes and association rule
confidences to monitor changes in the incidence of those outcomes over
time. Through experiments with infection control data from the University
of Alabama at Birmingham Hospital, we have demonstrated that DMSS can
identify potentially interesting and previously unknown patterns. Future
work on prospective clinical studies to determine the usefulness of DMSS in
hospital infection control is needed, as is improved event presentation for
the user and strategies for handling larger datasets.

The statistical strategies developed for automatically detecting temporal
patterns in surveillance data require that analysts explicitly define
outcomes of interest before surveillance begins. The Data Mining
Surveillance System (DMSS), on the other hand, is not constrained to
monitoring changes in user-defined outcomes. In DMSS, complex outcomes are
represented by association rules, and outcome incidence is captured
monthly.

An early version of DMSS, along with association rules and early
experiments with a single organism, has been described (1). We briefly
describe a newer version of DMSS and experimental results obtained by using
it to analyze 1 year's data from intensive care units (ICUs) at the
University of Alabama at Birmingham Hospital.

DMSS uses the following definitions. An itemset is a subset of the set of
all items. The support of an itemset x, sup(x), is the number of records
that contain x. If sup(x) >/= FSST, where FSST is the frequent set support
threshold, then x is a frequent set. An association rule, A ==> B, where A
and B are frequent sets and the intersection of A and B = Ø, is a statement
about how often the items of B are found with the items of A. The incidence
proportion of A ==> B, denoted ip(A ==> B), is equal to
sup(union of A and B)/sup(A). The precondition support of association rule
A ==> B is sup(A). The incidence proportion of an association rule A ==> B
in data partition p(sub i) describes the incidence of the outcome, B, in
the group, A, during time t(sub i). A series of incidence proportions for
A ==> B from partitions p(sub 1), p(sub 2), ..., p(sub n) describes the
incidence of the outcome B in group A from t(sub 1) through t(sub n).
Therefore, by analyzing the series of incidence proportions of an
association rule A ==> B, it should be possible to detect important shifts
or trends in the incidence of B in A over time. In this way, surveillance
of B in A is possible.
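
The definitions of support and incidence proportion can be illustrated with
a minimal Python sketch. The item strings (such as "Org=S. aureus") and the
four sample records are hypothetical and are not taken from the study data;
the function names are ours, not part of DMSS.

```python
from typing import FrozenSet, Iterable, List, Tuple

Record = FrozenSet[str]  # one isolate record, represented as a set of items


def support(itemset: Iterable[str], records: List[Record]) -> int:
    """sup(x): the number of records that contain every item of x."""
    items = frozenset(itemset)
    return sum(1 for record in records if items <= record)


def incidence_proportion(a: Iterable[str], b: Iterable[str],
                         records: List[Record]) -> Tuple[int, int]:
    """ip(A ==> B) = sup(A union B) / sup(A), kept as (numerator, denominator)."""
    a, b = frozenset(a), frozenset(b)
    if a & b:
        raise ValueError("A and B must not share items")
    return support(a | b, records), support(a, records)


# Hypothetical one-month partition (illustrative only).
partition = [
    frozenset({"Org=S. aureus", "Source=TRACHASP", "R~Oxacillin"}),
    frozenset({"Org=S. aureus", "Source=TRACHASP"}),
    frozenset({"Org=S. aureus", "Source=BLOOD", "R~Oxacillin"}),
    frozenset({"Org=E. coli", "Source=BLOOD"}),
]

num, den = incidence_proportion({"Org=S. aureus"}, {"R~Oxacillin"}, partition)
print(f"ip = {num}/{den}")  # prints "ip = 2/3"
```

Keeping the numerator and denominator separate, rather than dividing, makes
it possible to pool several partitions into a cumulative proportion later.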

Bacterial susceptibility and related demographic data of patients in the
University of Alabama at Birmingham Hospital ICUs (medical, surgical
[SICU], cardiac, neurologic [NICU]) during 1997 were extracted from the
PathNet laboratory information system. Each record describes a single
isolate and contains the following data elements: date of admission, date
of sample collection, date of results reported, source of isolate (e.g.,
sputum, blood), organism isolated, organism Gram stain and morphologic
features, patient's location in the hospital, and resistant (R),
intermediate (I), or susceptible (S) test results to relevant antibiotics,
according to the National Committee for Clinical Laboratory Standards MIC
breakpoints (2).

Duplicate records were removed so that for each patient, no more than one
isolate per organism per month was included. In each remaining record,
certain antimicrobial drug items were removed (only drugs to which the
organism is historically susceptible at least 50% of the time remained).
Additionally, items of the form S~Antimicrobial were removed so that only
I~Antimicrobial and R~Antimicrobial items remained. Finally, data were
divided into 1-month partitions (p(sub 1), ..., p(sub n)) before analysis.
For each partition p(sub i), all frequent sets with support of at least 3
(FSST = 3) and association rules with precondition support greater than 5
were generated. Both the frequent set discovery and association
rule-generating algorithms are beyond the scope of this report and are
described elsewhere (3).
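
The record-cleaning steps described above might be sketched as follows.
This is an illustration under several assumptions of ours: the dictionary
keys are hypothetical, the collection date is used to assign the month, the
earliest isolate in a month is the one retained, and the drug filter based
on historical susceptibility is omitted because it needs reference data not
given here.

```python
from collections import defaultdict
from datetime import date


def preprocess(isolates):
    """Return {(year, month): [item sets]} after de-duplication.

    Each isolate is a dict with the hypothetical keys 'patient',
    'organism', 'collected' (a date), and 'items' (a set of strings).
    """
    seen = set()
    partitions = defaultdict(list)
    for iso in sorted(isolates, key=lambda i: i["collected"]):
        month = (iso["collected"].year, iso["collected"].month)
        key = (iso["patient"], iso["organism"], month)
        if key in seen:
            continue  # at most one isolate per patient, organism, and month
        seen.add(key)
        # Drop susceptible results so that only I~ and R~ drug items remain.
        items = frozenset(i for i in iso["items"] if not i.startswith("S~"))
        partitions[month].append(items)
    return dict(partitions)


isolates = [
    {"patient": "p1", "organism": "S. aureus", "collected": date(1997, 1, 3),
     "items": {"Org=S. aureus", "R~Oxacillin", "S~Vancomycin"}},
    {"patient": "p1", "organism": "S. aureus", "collected": date(1997, 1, 20),
     "items": {"Org=S. aureus", "R~Oxacillin"}},
    {"patient": "p1", "organism": "S. aureus", "collected": date(1997, 2, 2),
     "items": {"Org=S. aureus", "I~Clindamycin"}},
]

parts = preprocess(isolates)
print(sorted(parts))  # prints "[(1997, 1), (1997, 2)]"
```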

Each generated association rule must pass a set of rule templates that
describe families of interesting and uninteresting rules. Each template is
a construct of the form be(sub 1) ==> be(sub 2), where be(sub 1) and
be(sub 2) are Boolean expressions over items and attributes. Association
rule A ==> B satisfies rule template be(sub 1) ==> be(sub 2) if A satisfies
be(sub 1) and B satisfies be(sub 2). Two types of association rule
templates are used: include templates and exclude templates. An association
rule A ==> B passes a set of rule templates if A ==> B satisfies at least
one include template in the set and does not satisfy any exclude template
in the set.
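
The pass rule for templates can be sketched in Python as below. Boolean
expressions are modeled as predicates over item sets; the item prefixes
("Org=", "Loc=", "Source=", "GrMp=") and the helper names are assumptions
made for this illustration, and the three templates are only loosely
modeled on those used in DMSS.

```python
from typing import Callable, FrozenSet, List, Tuple

ItemSet = FrozenSet[str]
Predicate = Callable[[ItemSet], bool]
Template = Tuple[Predicate, Predicate]  # (be1 for the left, be2 for the right)


def has(prefix: str) -> Predicate:
    """Boolean expression that is true if any item starts with prefix."""
    return lambda items: any(item.startswith(prefix) for item in items)


def anything(items: ItemSet) -> bool:
    return True


def satisfies(a: ItemSet, b: ItemSet, template: Template) -> bool:
    be1, be2 = template
    return be1(a) and be2(b)


def passes(a: ItemSet, b: ItemSet,
           include: List[Template], exclude: List[Template]) -> bool:
    """At least one include template matches and no exclude template does."""
    return (any(satisfies(a, b, t) for t in include)
            and not any(satisfies(a, b, t) for t in exclude))


exclude = [
    (has("R~"), anything),        # resistance items belong on the right only
    (anything, has("Source=")),   # the source is not an outcome
]
include = [
    (lambda a: has("Org=")(a) or has("Loc=")(a),
     lambda b: (has("R~")(b) or has("GrMp=")(b) or has("Org=")(b))
     and not has("Loc=")(b)),
]

org = frozenset({"Org=S. aureus"})
res = frozenset({"R~Oxacillin"})
print(passes(org, res, include, exclude))  # prints "True"
print(passes(res, org, include, exclude))  # prints "False"
```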

Rule templates are handcrafted by domain experts to eliminate inherently
uninteresting or nonsense rules. This is accomplished through iterative
experiments with representative data by initially using few templates and
then creating and modifying templates on the basis of pattern review.

The history is a database that holds association rules and their incidence
proportions for different data partitions. In DMSS, the user specifies a
set of rule templates that contains any number of include and exclude
templates (Table 1). Only association rules that pass the rule templates
are included in the history. To establish a baseline for an association
rule, the incidence proportions of the rule for the three previous
partitions are obtained and stored in the history. Once stored in the
history, a rule is updated for each new partition regardless of whether or
not it is generated in the partition. Therefore, for every association
rule, the history contains an up-to-date time-series of incidence
proportions.
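
One way to picture the history is as a mapping from each rule to its time
series of incidence proportions. The class below is a sketch under our own
assumptions (in-memory storage, rules represented as pairs of frozensets,
hypothetical item strings); it is not the DMSS implementation.

```python
def ip(rule, records):
    """Incidence proportion of rule (A, B) as (numerator, denominator)."""
    a, b = rule
    denominator = sum(1 for r in records if a <= r)
    numerator = sum(1 for r in records if (a | b) <= r)
    return numerator, denominator


class History:
    def __init__(self, baseline=3):
        self.baseline = baseline  # number of earlier partitions to backfill
        self.series = {}          # rule -> list of (numerator, denominator)

    def update(self, partitions, generated_rules):
        """partitions: chronological list of record lists; the last is new."""
        current = len(partitions) - 1
        for rule in generated_rules:
            if rule not in self.series:
                start = max(0, current - self.baseline)
                # Baseline: the rule's proportions in the previous partitions.
                self.series[rule] = [ip(rule, p)
                                     for p in partitions[start:current]]
        # Every stored rule is updated, even if not generated this month.
        for rule, values in self.series.items():
            values.append(ip(rule, partitions[current]))


def month(positive, total):
    """Hypothetical partition: `positive` of `total` SICU isolates match."""
    hit = frozenset({"Loc=SICU", "Org=A. baumannii"})
    miss = frozenset({"Loc=SICU"})
    return [hit] * positive + [miss] * (total - positive)


rule = (frozenset({"Loc=SICU"}), frozenset({"Org=A. baumannii"}))
months = [month(0, 4), month(0, 5), month(0, 3), month(2, 6), month(1, 4)]

history = History()
history.update(months[:4], [rule])  # rule first generated in the 4th month
history.update(months[:5], [])      # not generated again, but still updated
print(history.series[rule])
# prints "[(0, 4), (0, 5), (0, 3), (2, 6), (1, 4)]"
```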

 Table 1. Templates used to filter association rules

 --------------------------------------------------------------------------
 Template
 type     Left (be(sub 1))  Right (be(sub 2))   Explanation
 --------------------------------------------------------------------------
 Exclude  (R~Antibiotic)    (Anything)          Want antibiotic sensitivity
                                                info on the right only.
 Exclude  (Anything)        (Source)            Source of infection is not
                                                an outcome. Therefore,
                                                exclude all rules with a
                                                source on the right.
 Exclude  (NS OR Org OR     (NS OR Org OR       NS, Org, and GrMp are more
            GrMp)             GrMp)             informative if kept
                                                together in either a group
                                                or an outcome.
 Exclude  (Loc)             (Org OR GrMp) AND   If the left contains
                              (R~Antibiotic)    location, then exclude
                                                rules that have Org and
                                                R~Antibiotic or GrMp and
                                                R~Antibiotic.
 Include  (Org OR Loc)      (R~Antibiotic OR    Include rules whose groups
                              GrMp OR Org)      are Org- or Loc-specific
                              AND Not (Loc)     and whose outcomes are
                                                Antibiotic- or
                                                GrMp-specific.
 --------------------------------------------------------------------------
 be(sub 1) and be(sub 2), Boolean expressions; R, resistant; NS, nosocomial;
 OR, "or"; Org, organism; GrMp, Gram stain and morphology; Loc, location.

Table 2. A sample event generated by the Data Mining Surveillance System
-----------------------------------------------------------------------------
Association rule: {nosocomial, SICU(sup b), tracheal aspirate}
                  ==> {Acinetobacter baumannii}
-----------------------------------------------------------------------------
P(sub c-5)  P(sub c-4)  P(sub c-3)  P(sub c-2)  P(sub c-1)  P(sub c)(sup a)
-----------------------------------------------------------------------------
0/11        0/10        0/9         0/13        2/9         3/9
-----------------------------------------------------------------------------
w(sub p)(sup c)                                 w(sub c)
-----------------------------------------------------------------------------
(sup a)P(sub c), current pair.
(sup b)SICU, surgical intensive care unit.
(sup c)w(sub p), past window; w(sub c), current window.

By analyzing information stored in the history, DMSS generates alerts that
describe an extreme change in the incidence of an outcome B in a group A
over time. For example, Table 2 describes the incidence of Acinetobacter
baumannii in nosocomial tracheal aspirate isolates from the SICU over the
past six partitions. Clearly, a shift in incidence occurs between the first
4 months and the most recent 2 months of the series. If we call months 1,
2, 3, and 4 the past window, w(sub p), and months 5 and 6 the current
window, w(sub c), we can ask if there is an extreme change in the incidence
between w(sub p) and w(sub c). We compute the cumulative incidence
proportion for w(sub p) (0/43) and for w(sub c) (5/18) and compare the two
by a statistical test of two proportions.

To generate an alert for an association rule r, DMSS first constructs a
current window (w(sub c)) and a past window (w(sub p)) on the series of
incidence proportions of r (w(sub c)[r,0], w(sub p)[r,0] from the algorithm
in the Figure). Second, it computes the cumulative incidence proportion for
each window. Third, it compares the two cumulative incidence proportions by
a test of two proportions. Finally, if the difference between the
proportions is statistically extreme (p </= alpha = 0.01), it generates an
alert. The value of alpha is user-defined and rather arbitrary. If an alert
is not generated, the next set of current and past windows is formed
(w(sub c)[r,1], w(sub p)[r,1] from the algorithm in the Figure), and the
cumulative incidence proportions are compared. Window pairs are generated
for the same association rule until an alert is generated or no more window
pairs remain to be formed. DMSS generates all alerts by executing the
procedure described on every association rule in the history.
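
The comparison of the two windows in Table 2 can be worked through in a
short Python sketch. The text does not name the specific test of two
proportions, so a two-sided Fisher exact test is used here purely as an
assumed stand-in; the function names are ours.

```python
from math import comb


def cumulative(window):
    """Pool a window of (numerator, denominator) pairs into one proportion."""
    return sum(x for x, _ in window), sum(n for _, n in window)


def fisher_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher exact p value for x1/n1 versus x2/n2."""
    successes = x1 + x2

    def prob(k):
        # Hypergeometric probability that k of the successes fall in group 1.
        return comb(n1, k) * comb(n2, successes - k) / comb(n1 + n2, successes)

    observed = prob(x1)
    low, high = max(0, successes - n2), min(n1, successes)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(low, high + 1))
               if p <= observed * (1 + 1e-9))


def is_alert(past, current, alpha=0.01):
    """True if the pooled proportions differ at level alpha (assumed test)."""
    x_p, n_p = cumulative(past)
    x_c, n_c = cumulative(current)
    return fisher_two_sided(x_p, n_p, x_c, n_c) <= alpha


past = [(0, 11), (0, 10), (0, 9), (0, 13)]  # w(sub p) in Table 2: 0/43
current = [(2, 9), (3, 9)]                   # w(sub c) in Table 2: 5/18

p_value = fisher_two_sided(*cumulative(past), *cumulative(current))
print(f"{p_value:.4f}")          # prints "0.0014"
print(is_alert(past, current))   # prints "True"
```

Under this assumed test, the Table 2 event falls below alpha = 0.01 and
would be reported; a different test of two proportions could give a
somewhat different p value.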

        [fig]
  Fig. Algorithm used to construct current and past windows for
  association rule r.

Current and past window pairs are generated by the algorithm in the Figure.
If n is the number of incidence proportions in the history for a given
rule, (w(sub c):w(sub p)) pairs are generated for that rule in the
following order:

  (p(sub c):[p(sub c-1),p(sub c-2)]), ...,
  (p(sub c):[p(sub c-1),...,p(sub c-n)]),
  ([p(sub c),p(sub c-1)]:[p(sub c-2),p(sub c-3)]),
  ([p(sub c),p(sub c-1)]:[p(sub c-2),p(sub c-3),p(sub c-4)]),
  ([p(sub c),p(sub c-1)]:[p(sub c-2),p(sub c-3),p(sub c-4),...,p(sub c-n)]),
  ([p(sub c),p(sub c-1),p(sub c-2)]:[p(sub c-3),p(sub c-4),p(sub c-5)]),
  ([p(sub c),p(sub c-1),p(sub c-2)]:[p(sub c-3),p(sub c-4),p(sub c-5),
    p(sub c-6)]), ...,
  ([p(sub c),p(sub c-1),p(sub c-2)]:[p(sub c-3),p(sub c-4),p(sub c-5),
    p(sub c-6),...,p(sub c-n)]).

For each pair, w(sub p) must be at least as large as w(sub c).
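
Because the Figure is not available in this format, the following Python
generator is only our reading of the ordering listed above, not the
algorithm in the Figure itself. Two details are assumptions: the smallest
past window has two partitions when the current window has one (as in the
first listed pair), and the current window grows to at most three
partitions (the largest shown in the listing).

```python
def window_pairs(series, max_current=3):
    """Yield (current_window, past_window) pairs from a time series.

    series holds one (numerator, denominator) pair per partition, oldest
    first. Windows are returned oldest first as well.
    """
    n = len(series)
    for k in range(1, max_current + 1):          # size of the current window
        current = series[n - k:]
        # The past window is adjacent to the current one and grows backward.
        for m in range(max(k, 2), n - k + 1):    # size of the past window
            past = series[n - k - m:n - k]
            yield current, past


# The six partitions of Table 2, oldest first.
series = [(0, 11), (0, 10), (0, 9), (0, 13), (2, 9), (3, 9)]
pairs = list(window_pairs(series))

print(len(pairs))  # prints "8"
print(pairs[0])    # prints "([(3, 9)], [(0, 13), (2, 9)])"
```

With these six partitions, the pair made of the last 2 months against the
first 4 months, which is the one discussed for Table 2, appears among the
generated pairs.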

The total number of events was reduced from 251, by including all rules, to
36, by using the templates in Table 1; thus, classes of inherently
uninteresting rules were eliminated. A retrospective look at the 155 events
eliminated by the rule templates showed that they were uninformative.
Therefore, the introduction of templates resulted in a more focused
presentation of DMSS output.

Of the 36 events, 18 were judged potentially interesting. Table 3 contains
several representative events, one per row. Each row contains the
association rule, the incidence proportions in w(sub c) (bold), and the
incidence proportions in w(sub p) (nonbold). For example, event 1 in Table 3
describes an increase in the number of Staphylococcus aureus isolates
resistant to oxacillin, clindamycin, and erythromycin recovered from
tracheal aspirates in the fourth partition compared with those recovered in
the second and third partitions. Location-specific events were identified
only for the NICU and SICU (Table 3); eight events were not
location-specific.

The events identified by DMSS must be investigated by domain experts to
determine their actual importance. In this example, the data burden was
small since in a prospective analysis only a few events would be presented
to the user each month, thus allowing for the investigation of each event.

Table 3. Representative events identified and considered of potential
interest
-------------------------------------------------------------------------------
                                            Partition
                                   -------------------------
Left                 Right
Denominator          Numerator     1  2    3   4   5   6  7  Interpretation
-------------------------------------------------------------------------------
Staphylococcus ==> R~Oxacillin       0/10 0/8 7/14           Increase in the
  aureus             (sup a,b)                               incidence of
  Source           R~Clindamycin                             oxacillin (ORSA),
  TRACHASP(sup c)  R~Erythromycin                            clindamycin, and
                                                             erythromycin
                                                             resistance in
                                                             S. aureus isolated
                                                             from tracheal
                                                             aspirates.
NSNoso(sup d)  ==> R~Ceftazidime              3/88 11/70     Increase in the
                                                             incidence of
                                                             ceftazidime
                                                             resistance in all
                                                             nosocomial
                                                             isolates.
NP_GNR(sup e)  ==> R~Piperacillin             0/17 6/14      Increase in the
  LocSICU                                                    incidence of
                                                             piperacillin
                                                             resistance in
                                                             non-pseudomonas
                                                             gram-negative
                                                             bacilli isolated
                                                             from NSNoso.
NP_GNR         ==> R~Piperacillin     1/12  0/14 4/11  4/8   Increase in the
  LocSICU(sup f)                                             incidence of
                                                             piperacillin
                                                             resistance in
                                                             non-pseudomonas,
                                                             nosocomial
                                                             gram-negative
                                                             bacilli from the
                                                             SICU.
NSNoso         ==> S. aureus 26  3/26 2/28  6/27 5/20  3/11  Increase in the
  LocNICU(sup g)                                             incidence of
                                                             S. aureus in
                                                             nosocomial
                                                             isolates from the
                                                             NICU.
-------------------------------------------------------------------------------
(sup a)R, resistant.
(sup b)Oxacillin, resistance implies resistance to amoxycillin/clavulanic acid, 
cephalothin, and cefazolin.
(sup c)SourceTRACHASP, tracheal aspirates.
(sup d)NSNoso, nosocomial (3 days from admission).
(sup e)NP_GNR, non-pseudomonas gram-negative rod.
(sup f)LocSICU, location, surgical intensive care unit (SICU).
(sup g)LocNICU, location, neurologic intensive care unit (NICU).

We believe that this approach to surveillance will allow hospital infection
control programs to focus their limited resources on issues of probable
significance. We also believe that this approach is a step toward the
public health surveillance system described by Dean, Fagan, and
Panter-Conner (4).

---------------------------------------------------------------------------

This work was supported in part by cooperative agreement U47-CCU411451 with
the Centers for Disease Control and Prevention (SAM) and a predoctoral
research fellowship LM-00057 from the National Library of Medicine (SEB).

Dr. Moser is associate professor, Department of Pathology, University of
Alabama at Birmingham, and serves as director of Laboratory Information
Services, associate director of Clinical Microbiology for University
Hospital, and director of the Pathology Informatics Section. His research
interests are applied research in diagnostic microbiology and the
application of software as an aid to the intelligent analysis of medical
information, especially that generated in laboratory medicine.

Address for correspondence: Stephen A. Moser, University of Alabama at
Birmingham, Department of Pathology, P246, 619 19th St. South, Birmingham,
AL 35233-7331, USA; fax: 205-975-4468; e-mail: moser@uab.edu.

(ft 1) Presented in part at the International Conference on Emerging
Infectious Diseases, March 8-11, 1998, Atlanta, Georgia.

References

  1. Brossette SE, Sprague AP, Hardin JM, Waites KB, Jones WT, Moser SA.
     Association rules and data mining in hospital infection control and
     public health surveillance. J Am Med Inform Assoc 1998;5:373-81.
  2. National Committee for Clinical Laboratory Standards. Methods for
     dilution antimicrobial susceptibility tests for bacteria that grow
     aerobically. 4th ed. Approved standard. NCCLS document M7-A4. Wayne
     (PA): The Committee; 1997.
  3. Brossette SE. Data mining and epidemiologic surveillance
     [dissertation]. Birmingham (AL): University of Alabama at Birmingham;
     1998.
  4. Dean AG, Fagan RF, Panter-Conner BJ. Computerizing public health
     surveillance systems. In: Teutsch SM, Churchill RE, editors.
     Principles and practice of public health surveillance. New York:
     Oxford University Press; 1994. p. 200-17.

Emerging Infectious Diseases
National Center for Infectious Diseases
Centers for Disease Control and Prevention
Atlanta, GA

URL: ftp://ftp.cdc.gov/pub/EID/vol5no3/ascii/moser.txt

Please note that figures and equations are not available in ASCII format; 
their placement within the text is noted by [fig] and [eq], respectively. 
Greek symbols are spelled out. The following codes are used: 
(ft) for footnote; (sup) for superscript; (sub) for subscript; 
>/= for greater than or equal to.