Methods of Information in Medicine © F. K. Schattauer Verlagsgesellschaft mbH (1999)

Development of a Bayesian Network for the Prognosis of Head Injuries using Graphical Model Selection Techniques

G. C. Sakellaropoulos, G. C. Nikiforidis

Abstract: The assessment of a head-injured patient’s prognosis is a task that involves the evaluation of diverse sources of information. In this study we propose an analytical approach, based on a Bayesian Network (BN), for combining the available evidence. The BN’s structure and parameters are derived by learning techniques applied to a database (600 records) of seven clinical and laboratory findings. The BN produces quantitative estimates of the prognosis after 24 hours for head-injured patients in the outpatients department. Alternative models are compared, and their performance is tested against the success rate of an expert neurosurgeon.

Computer Laboratory, School of Medicine, University of Patras, Greece

Keywords: Bayesian Networks, Head Injuries, Prognosis, Learning Models

1. Introduction
The debate over the ability of humans to manage uncertainty, and over the possible benefit they could gain from statistical support in their decision making, has long continued in the scientific community. Despite expert clinicians’ proven skill in evaluating the various sources of information that enable them to assess a patient’s prognosis correctly, the abundance of available data challenges the limited human capacity for indirect inference [1]. Decision-support systems that take into account positive correlations between clinical and laboratory tests process information more efficiently and consistently and, as a consequence, avoid overprediction. Bayesian or belief networks (BNs) comprise an expert system modality that is not only based on firm mathematical grounds, but also provides a means of information evaluation, Bayes’ rule, that is familiar to most clinicians. In addition, statements of conditional independence can be read directly from the graph of a BN.

This research was performed (in part) using the DXpress expert system development tool from Knowledge Industries of Palo Alto, California.

Meth Inform Med 1999; 38: 37-42

The close relationship between BNs and graphical log-linear models facilitates model selection for data interpretation. Hence, models adapted to a specific clinical problem and the available sample of patient data can be constructed, eliminating bias introduced by experts. Bayesian networks that are supported by a database of patient records can capture the knowledge and experience acquired during years of medical practice in a large hospital and make it available to remote health-care practices and to less experienced clinicians. This study concerns the development and validation of BNs for the assessment of prognosis after 24 hours for head-injured patients of the outpatients department in the University Hospital of Patras, Greece. Different selection strategies resulted in BNs with varying structures and prognostic performance.

2. Materials and Methods
In our study, a database of patient findings was created. Using part of the database as a training set, different methods were applied to select the graphical log-linear models that best described the data. These models were transformed into BNs, and their prognostic performance, evaluated on the remainder of the records, was compared with that of an expert neurosurgeon. The steps involved in this procedure are described below.

2.1 Bayesian Networks
A BN is a graph consisting of vertices that represent random variables, and arcs that represent probabilistic dependencies between these variables [2]. The BN graph should be acyclic, meaning that no path through the graph forms a cycle, to guarantee that the graph represents a valid joint probability distribution. For a variable Xᵢ in the BN, let pa(Xᵢ) be its parents. Each vertex contains a conditional probability distribution of the form P(Xᵢ | pa(Xᵢ)) in a conditional probability table that describes the relationship between the variable and its parents (for a variable with no parents this reduces to its prior distribution). Each element of this table expresses the probability of a child state, given a configuration of parent states. The main independence assumption represented by a BN is that each vertex is independent of all its non-descendant vertices, given its parents. Consider the

Downloaded from www.methods-online.com on 2011-08-04 | IP: 46.198.64.216 For personal or educational use only. No other uses without permission. All rights reserved.


variables in a BN X₁, X₂, …, Xₙ, ordered in such a way that the parents of a vertex come before the vertex in the ordering. The independence assumption would then mean that P(Xᵢ | Xᵢ₋₁, Xᵢ₋₂, …, X₁) = P(Xᵢ | pa(Xᵢ)) and, therefore, the joint probability distribution described by the BN is:

P(X₁, X₂, …, Xₙ) = ∏ᵢ₌₁ⁿ P(Xᵢ | Xᵢ₋₁, …, X₁) = ∏ᵢ₌₁ⁿ P(Xᵢ | pa(Xᵢ)).
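As a sketch of this factorization, the joint probability of a full assignment can be read off the conditional probability tables. The network, variable names, and numbers below are an invented toy example, not the paper’s model:

```python
# Toy three-vertex BN: GCS -> GOS <- CT (GCS and CT are parentless).
# All probabilities are made up for illustration only.
parents = {"GCS": [], "CT": [], "GOS": ["GCS", "CT"]}

# CPTs: map (own state, tuple of parent states) -> probability.
cpt = {
    "GCS": {("low", ()): 0.3, ("high", ()): 0.7},
    "CT":  {("abnormal", ()): 0.4, ("normal", ()): 0.6},
    "GOS": {
        ("poor", ("low", "abnormal")): 0.8,  ("good", ("low", "abnormal")): 0.2,
        ("poor", ("low", "normal")): 0.5,    ("good", ("low", "normal")): 0.5,
        ("poor", ("high", "abnormal")): 0.4, ("good", ("high", "abnormal")): 0.6,
        ("poor", ("high", "normal")): 0.1,   ("good", ("high", "normal")): 0.9,
    },
}

def joint_probability(assignment):
    """P(x1, ..., xn) = product over i of P(xi | pa(xi)), read off the CPTs."""
    p = 1.0
    for var, pa in parents.items():
        pa_states = tuple(assignment[q] for q in pa)
        p *= cpt[var][(assignment[var], pa_states)]
    return p

print(joint_probability({"GCS": "high", "CT": "normal", "GOS": "good"}))
# 0.7 * 0.6 * 0.9, i.e. about 0.378
```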

Apart from capturing conditional (in)dependencies in their structure, BNs can be used to facilitate probabilistic inference and belief updating. The general problem of inference in BNs is proven to be NP-hard [3, 4], but efficient local computation algorithms have been developed (e.g., [5-7]). Every time a new piece of evidence becomes available, it is introduced into the BN and causes an update of the belief in the various prognostic outcomes, i.e., the posterior probability of the prognostic outcomes, given the specific evidence, is computed. It should be mentioned that the BNs we consider are given no causal interpretation. Rather, they are models that best describe the relationships observed in the available data. Statistical relationships between variables cannot by themselves establish causation; that must be inferred from the subject matter. It is therefore possible that the relative position of the variables in the network does not agree with the common notion of influence between them.
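The belief update can be illustrated on a two-variable toy chain, computing the posterior by Bayes’ rule directly (variable names and numbers are invented for illustration):

```python
# Toy chain GCS -> GOS with made-up probabilities.
p_gcs = {"low": 0.3, "high": 0.7}
p_gos_given_gcs = {("poor", "low"): 0.8, ("good", "low"): 0.2,
                   ("poor", "high"): 0.2, ("good", "high"): 0.8}

def posterior_gcs(observed_gos):
    """P(GCS | GOS = observed): weight each prior by its likelihood, renormalize."""
    weights = {g: p_gcs[g] * p_gos_given_gcs[(observed_gos, g)] for g in p_gcs}
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

# Observing a poor outcome raises the belief in a low GCS from 0.3 to
# 0.24 / 0.38, about 0.63.
print(posterior_gcs("poor"))
```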

2.2 Database
A total of 600 cases of head-injured patients from the outpatients department of the University Hospital of Patras were collected during the period 1994-1996. Seventy-five cases from 1996 were reserved as a prospective test set, whereas the remaining 525 cases were used as a training set. Model selection and estimation of probabilities were based on the training set, whereas the evaluation of prognostic performance was carried out on the test set. The patient records were completed by clinicians of the Neurosurgery Department at the time of patient admission. Each case consisted of eight variables (Table 1), including the actual outcome after 24 hours according to the Glasgow Outcome Scale [8]. The variables included are generally regarded as contributing to the assessment of the prognosis [8-16]. The need for mutual exclusivity and exhaustion of the sample space was satisfied through appropriate choice of the variables’ states. Possible CT examination states were classified with the help of the revised and expanded Diffuse Injury Scale (DIS) [17]. More specifically, states 5 and 6 of the DIS were eliminated, while new states were introduced, giving a total of 7 possible states (Table 1). Mean arterial pressure (MAP) was recorded as the weighted average of systolic (SBP) and diastolic (DBP) blood pressure at admission, MAP = (SBP + 2·DBP)/3. The entire range of MAP was divided into three intervals: below 60 mmHg, between 60 and 120 mmHg, and above 120 mmHg. A similar approach was followed for the patient’s age and the time delay between head injury and hospital admission. The patient’s visual, verbal, and motor responses were recorded at admission together with their sum, i.e., the patient’s score on the Glasgow Coma Scale (GCS). To limit the states of the GCS variable, ranges of scores were grouped into three possible states: scores 3 to 8 form state 1, scores 9 to 13 state 2, and scores 14 and 15 state 3. Two more variables were recorded: cause of injury (automobile accident, fall, or other) and possible presence of concomitant injuries.

Table 1 The clinical and laboratory variables included in the belief network, with their respective states. The states were chosen so as to be mutually exclusive and to exhaust the sample space. The CT scans were classified according to a modification of the Diffuse Injury Scale (DIS).
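The discretization rules above translate directly into code; the following sketch uses the cut-points from the text (the function names are ours):

```python
def map_state(sbp, dbp):
    """Weighted mean arterial pressure, MAP = (SBP + 2*DBP)/3, in three bands."""
    mean_ap = (sbp + 2 * dbp) / 3
    if mean_ap < 60:
        return 1   # below 60 mmHg
    if mean_ap <= 120:
        return 2   # 60-120 mmHg
    return 3       # above 120 mmHg

def gcs_state(score):
    """Group Glasgow Coma Scale scores 3-15 into the three states used here."""
    if 3 <= score <= 8:
        return 1
    if 9 <= score <= 13:
        return 2
    if score in (14, 15):
        return 3
    raise ValueError("GCS score must be between 3 and 15")

print(map_state(120, 80))  # MAP = 93.3 mmHg -> state 2
print(gcs_state(15))       # state 3
```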

2.3 Model Selection
The development of a BN for the prognosis of head-injured patients involves the determination of the network’s architecture and the calculation of its parameters, i.e., the conditional probability tables stored in its vertices that drive the inference mechanism. A number of techniques for learning structure from data have been developed [18-20], based on Bayesian methods [21] or on concepts of information theory [22, 23]. In this study we follow a different perspective: we treat the clinical problem of prognosis as a multivariate statistical analysis of discrete data, approached through a graphical log-linear model. The correspondence between a subclass of graphical log-linear models, called decomposable models, and BNs of discrete variables enables us to use model selection methods to obtain graphical models compatible with our data and then transform them into equivalent BNs. Graphical log-linear models [24, 25] are probability models for multivariate random observations whose independence structure is characterized by a graph: the conditional independence graph. The term “log-linear” stems from the fact that the probability density function is represented as a log-linear expansion. The conditional independence graph G = (V, E) consists of a set of vertices V, corresponding to our clinical or laboratory variables, and a set of undirected edges E that connect pairs of vertices. Each missing edge denotes that the two unconnected variables are independent, conditioned on the rest. Decomposable models form a subclass of graphical models. Their graphs are triangulated, i.e., they contain no chordless (undirected) cycles of length 4 or more. They have the property that their density function can be factorized, and this factorization can be fully simplified under a perfect numbering of the vertices. Such a numbering guarantees that, for every vertex in the graph, its lower-numbered adjacent vertices form a complete set, i.e., a set with all vertices pairwise joined.
By directing all existing edges from vertices of a lower index to vertices of a higher index, we obtain a directed graph whose Markov properties are identical to those of the undirected one

[26, 27]. One of these, the local Markov property for directed graphs [28, 29], is the main independence statement represented by a BN: a vertex v is independent of its non-descendants, given its parents.
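The two steps just described, checking that a numbering is perfect and directing edges from lower to higher index, can be sketched as follows. The graph and ordering are an invented toy example, not the paper’s selected model:

```python
def lower_neighbours_complete(adj, order):
    """Defining property of a perfect numbering: for every vertex, its
    lower-numbered neighbours are pairwise adjacent (a complete set)."""
    for v, nbrs in adj.items():
        lower = [u for u in nbrs if order[u] < order[v]]
        for i, a in enumerate(lower):
            for b in lower[i + 1:]:
                if b not in adj[a]:
                    return False
    return True

def to_dag(adj, order):
    """Direct every edge from its lower-numbered to its higher-numbered end."""
    return sorted((u, v) for u, nbrs in adj.items() for v in nbrs
                  if order[u] < order[v])

# Toy triangle on three vertices; any ordering of a complete graph is perfect.
adj = {"GCS": {"CT", "GOS"}, "CT": {"GCS", "GOS"}, "GOS": {"GCS", "CT"}}
order = {"GCS": 1, "CT": 2, "GOS": 3}
print(lower_neighbours_complete(adj, order))  # True
print(to_dag(adj, order))
# [('CT', 'GOS'), ('GCS', 'CT'), ('GCS', 'GOS')]
```

A chordless 4-cycle fails the check, which mirrors the triangulation requirement above.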

2.3.1 Model Selection Strategy
The search of the space of decomposable models can be carried out with different techniques that result either in a set of acceptable models or in a single model that is best according to some criterion. In this study we considered two variants of the latter approach: forward inclusion [30] and backward elimination [31] of edges. These procedures are sequential in that they maintain a current model and add or delete edges one at a time. Backward elimination begins with the model in which all edges are present (the saturated model) and examines how the quality of the data description is affected by removing each edge in turn, i.e., by assuming conditional independence. If any edge can be removed according to some criterion, the most removable edge is removed. The process continues recursively until no further edge can be removed. Hence, backward elimination starts with a complex model that is consistent with the data and successively simplifies it. Forward inclusion acts in the opposite manner. Starting from the model with all variables independent (the main effects model), edges that meet the predefined criterion are eligible for inclusion, and the most significant edge is added. This is repeated until no edge remains significant for inclusion. Since this method starts with a simple model that is inconsistent with the data, it enlarges the model until an acceptable one is reached. The criterion used is the significance (p < 0.05) of the appropriate test statistic. The test statistic used in the forward inclusion procedure is the deviance difference, which has a chi-squared distribution as the sample size tends to infinity. Strictly speaking, it is valid only for large samples, and in the case of sparse contingency tables, i.e., tables including many cells with zero counts, the results

may be unreliable. Hence, we performed Monte Carlo sampling, generating 1,000 random tables [32, 33]. In the case of backward elimination, we used the Jonckheere-Terpstra test, appropriate for ordinal variables, except for tests involving nominal variables (Cause, Injury), where the Kruskal-Wallis test was used. Again, Monte Carlo sampling with the same specifications was applied.
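The Monte Carlo idea can be sketched in a few lines. For brevity, this sketch resamples by permuting one variable’s labels, which also preserves both table margins, rather than implementing Patefield’s algorithm [32] used by the actual software; it is equivalent in spirit, not in detail, to the paper’s procedure:

```python
import random
from collections import Counter

def chi2_stat(xs, ys):
    """Pearson chi-squared statistic for the two-way table of xs vs. ys."""
    n = len(xs)
    obs = Counter(zip(xs, ys))
    rx, ry = Counter(xs), Counter(ys)
    stat = 0.0
    for a in rx:
        for b in ry:
            exp = rx[a] * ry[b] / n
            stat += (obs.get((a, b), 0) - exp) ** 2 / exp
    return stat

def mc_p_value(xs, ys, n_tables=1000, seed=0):
    """Monte Carlo p-value: fraction of resampled tables at least as extreme."""
    rng = random.Random(seed)
    observed = chi2_stat(xs, ys)
    ys = list(ys)          # work on a copy; shuffling preserves both margins
    hits = 0
    for _ in range(n_tables):
        rng.shuffle(ys)
        if chi2_stat(xs, ys) >= observed:
            hits += 1
    return (hits + 1) / (n_tables + 1)   # add-one correction avoids p = 0

# Strongly associated toy data: a small p-value, independence is rejected.
xs = ["a"] * 10 + ["b"] * 10
ys = ["u"] * 9 + ["v"] * 10 + ["u"]
print(mc_p_value(xs, ys))
```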

2.3.2 Software for Model Selection
Model selection was performed using the program MIM [34]. Forward inclusion of edges into the main effects model resulted in a model that was then tested for removal of edges, and all edges with a non-significant (p > 0.05) test statistic were removed. A second search started from the model that assumes no independence among the variables, and non-significant edges were successively removed. The search was restricted to the space of decomposable models.
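Schematically, the forward inclusion loop described above looks like this. The significance test is abstracted into a caller-supplied function; a real implementation would compute deviance differences on the data and restrict candidates so the model stays decomposable:

```python
from itertools import combinations

def forward_inclusion(variables, p_value, alpha=0.05):
    """Greedy forward inclusion: starting from the main effects model (no
    edges), repeatedly add the most significant edge until none reaches
    p < alpha. `p_value(edge, current_edges)` is supplied by the caller."""
    edges = set()
    candidates = set(combinations(sorted(variables), 2))
    while True:
        scored = [(p_value(e, edges), e) for e in candidates - edges]
        if not scored:
            return edges
        best_p, best_edge = min(scored)
        if best_p >= alpha:
            return edges          # no remaining edge is significant
        edges.add(best_edge)

# Toy oracle: pretend only these two edges are significant.
significant = {("A", "B"): 0.001, ("B", "C"): 0.01}
def toy_p(edge, current):
    return significant.get(edge, 0.5)

print(sorted(forward_inclusion({"A", "B", "C"}, toy_p)))
# [('A', 'B'), ('B', 'C')]
```

Backward elimination is the mirror image: start from all edges and repeatedly delete the least significant one.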

2.3.3 BN Construction and Inferencing
The undirected models selected with MIM were transformed into directed acyclic graphs by directing all edges from vertices of a lower index to vertices of a higher index, according to a perfect vertex numbering, which always exists for decomposable graphical models. The resulting structure and all the conditional probability tables necessary for inference were entered into the DXpress graphical network editor and compiled with DXpress’ inference engine. The elements of the conditional probability tables were calculated from the frequency counts in the database of findings. Certain combinations of variable states do not occur in our database; in such cases we assigned the value of 1% to the corresponding conditional probability.
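Filling a conditional probability table from frequency counts, with a 1% floor for unseen combinations, can be sketched as follows. The renormalization step is our assumption; the paper does not say how the remaining probability mass is adjusted, and the function and variable names are ours:

```python
from collections import Counter

def estimate_cpt(records, child, parent_vars, child_states, floor=0.01):
    """Estimate P(child | parents) from a list of record dicts, flooring
    unobserved child states at `floor` and renormalizing each row."""
    counts = Counter((tuple(r[p] for p in parent_vars), r[child]) for r in records)
    totals = Counter(tuple(r[p] for p in parent_vars) for r in records)
    cpt = {}
    for pa in totals:
        probs = {s: counts.get((pa, s), 0) / totals[pa] for s in child_states}
        probs = {s: max(p, floor) for s, p in probs.items()}   # 1% floor
        z = sum(probs.values())
        cpt[pa] = {s: p / z for s, p in probs.items()}
    return cpt

# Toy records: 'dead' is never observed for GCS state 1, so it gets ~1%.
records = [{"GCS": 1, "GOS": "poor"}] * 9 + [{"GCS": 1, "GOS": "good"}]
table = estimate_cpt(records, "GOS", ["GCS"], ["poor", "good", "dead"])
print(table[(1,)])
```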

2.4 Validation
The prognostic performance of the belief networks was evaluated using the actual outcome after 24 hours as gold standard. The expert’s belief in the patient’s outcome after 24 hours was recorded after all available information regarding each case had been gathered. The cases were classified according to the expert’s belief in one of five prognostic outcomes. The BNs were then fed with exactly the same evidence that the expert had at his disposal. The BNs calculate exact probabilities; therefore, the cases were categorized according to the most favored outcome (the outcome with maximum probability). Table 2 shows the expert’s and the BNs’ success rates.

Fig. 1 Starting from the saturated model (no independence among the variables assumed), the backward elimination-of-edges method selected a model which, after directing all edges, resulted in BN-1.

Fig. 2 The Bayesian network equivalent to the model selected by forward inclusion of edges, starting from the main effects model (all variables independent).

Table 2 The performance of the BNs, compared to the estimations of prognosis made by an expert neurosurgeon.

3. Results
Backward elimination from the saturated model (no independence assumed) resulted in the network BN-1 (Fig. 1). Forward inclusion, starting from the main effects model (all variables independent), resulted in a model that was further investigated by using it as the starting model for backward elimination. Four edges were removed in this manner, finally leading to the network BN-2 (Fig. 2). Before activating the Bayesian networks, the d-separation criterion [2], or the equivalent criterion proposed by Lauritzen et al. [28], can offer useful insights regarding conditional independencies of the prognostic outcome on the Glasgow Outcome Scale (GOS). The structure of BN-2 states that, if the cause of the head injury is known, the GOS is rendered independent of the mean arterial pressure, the patient’s age, and the possible existence of concomitant injuries. In addition, if the Glasgow Coma Score (GCS) is known, the GOS is independent of the cause of the injury. Knowledge of the time elapsed between injury and admission to the hospital gives no additional evidence for the belief in prognostic outcomes if a CT scan has already been obtained. In BN-1, the GOS is connected with all vertices except age. Consequently, the only independence statement involving GOS is that it is independent of age, given information on the mean arterial pressure and a CT scan. In both BNs, the relative position of the vertices GOS, GCS, and CT is identical, and in both structures the arrows point towards GOS. Therefore, the impact of evidence on GOS from these variables is the same. Regarding BN-2, evidence on any variable other than the Glasgow Coma Score, CT scan findings, and cause of injury results in marginal variations in the prognostic beliefs, i.e., the values remain almost identical to the prior beliefs. If, on the other hand, information regarding GCS and CT is given, the



rest of the variables offer no further evidence, since the corresponding vertices become d-separated from GOS.

Fig. 3 The different results that the two networks produce under the same evidence: the cause of injury is not an automobile accident or a fall, and the CT scan shows a midline shift of 0-5 mm without epidural hematoma. Both networks favor a good recovery outcome, but while BN-1 assigns a larger belief (81.6%) than the prior probability (69.4%), the probability assigned to this outcome by BN-2 is reduced (46.9%).

Figure 3 depicts the different results the two networks produce under the same evidence: the cause of injury is not an automobile accident or a fall, and the CT scan shows a midline shift of 0-5 mm. Although both networks favor a good recovery outcome, the probability assigned to this outcome by BN-1 is much larger. The performance of the BNs, together with that of the expert neurosurgeon, is shown in Table 2. BN-1 correctly predicted the patient’s outcome in 61 of the 75 cases (81%), while BN-2 was correct in 52 cases (69%). The success rate of the BN produced by backward elimination of edges is closer to the success rate of the expert (successful prediction in 67 cases, or 89%), and consistently better than that of the BN produced by forward inclusion. All three systems (the human included) exhibited better performance at the extremes of the prognostic scale (death versus good recovery).
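Independence statements of the kind read off BN-1 and BN-2 above can be checked mechanically with a d-separation test. The following is a minimal reachability-style sketch; the chain Cause -> GCS -> GOS is a toy fragment echoing one statement about BN-2, not the full network:

```python
def d_separated(edges, x, y, given):
    """True if x and y are d-separated by `given` in the DAG given as a
    list of (parent, child) edges."""
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)
    # Evidence nodes and their ancestors: these open v-structures (colliders).
    opened = set(given)
    frontier = list(given)
    while frontier:
        for p in parents.get(frontier.pop(), ()):
            if p not in opened:
                opened.add(p)
                frontier.append(p)
    # Explore (node, direction) states; "up" means we are moving via parents.
    visited, stack = set(), [(x, "up")]
    while stack:
        node, direction = stack.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y:
            return False          # an active trail reaches y
        if direction == "up" and node not in given:
            stack += [(p, "up") for p in parents.get(node, ())]
            stack += [(c, "down") for c in children.get(node, ())]
        elif direction == "down":
            if node not in given:
                stack += [(c, "down") for c in children.get(node, ())]
            if node in opened:    # collider with observed (descendant) evidence
                stack += [(p, "up") for p in parents.get(node, ())]
    return True

edges = [("Cause", "GCS"), ("GCS", "GOS")]
print(d_separated(edges, "Cause", "GOS", {"GCS"}))  # True: GCS screens off Cause
print(d_separated(edges, "Cause", "GOS", set()))    # False: open chain
```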

4. Discussion
In this study we examined two decision-support systems, based on Bayesian networks, for the prognosis of head-injured patients visiting the outpatients department. By providing probabilistic measures of belief in the possible outcomes, the systems expose the relative importance of the included nodes and transform qualitative, intuitive estimations into a quantitative form. They were derived using two variants of stepwise model selection. These techniques lead to the selection of a single best model and may overlook other models that fit the data just as well. Future research will aim at comparing the prognostic performance of models selected by a more global procedure that retains multiple models and, therefore, reflects the inherent model uncertainty [35]. The model selected through successive elimination of edges (BN-1) performed better than the model reached with forward inclusion (BN-2). We expected this, for two reasons. First, the backward elimination method searches among models that fit the data well, since it tests for conditional independencies starting from the most conservative, saturated model, which is guaranteed to fit the data optimally. Forward inclusion, on the other hand, presupposes independence of the variables and, therefore, searches among incompatible models. Second, the tests used in forward inclusion (likelihood ratio tests) do not take into account the specific character of the variables; their power is limited for variables known to be ordinal or nominal.

The fact that our data are sparse, a very common phenomenon in real-world applications, affected the testing in both methods: the assumption that the deviance difference of two nested models follows a chi-squared distribution does not hold. The sparsity of the data was dealt with by the use of a Monte Carlo sampling method. A combination of experts’ opinions and learning algorithms is expected to exhibit better success rates than either constituent alone. A structure derived from data exploration methods and appropriate selection algorithms could be augmented by experts through correction of inconsistencies, removal of artifactual vertices, or inclusion of important variables from related domains. Similarly, clinicians can benefit from the ability of the model selection techniques to investigate the existence of conditional dependence between pairs of clinical or laboratory variables, and can consult other sources of evidence to estimate the patient’s prognosis. In all cases, the maintenance of a database containing many variables related to the specific domain is essential, since it provides both the training set for a possible structure and reliable values for the conditional probability tables.

REFERENCES
1. Elstein AS. Clinical judgment: psychological research and medical practice. Science 1976; 194: 696-700.
2. Pearl J. Probabilistic Reasoning in Intelligent Systems. San Mateo, California: Morgan Kaufmann, 1988.
3. Cooper GF. The computational complexity of probabilistic inference using Bayesian belief networks. Artif Intell 1990; 42: 393-405.
4. Dagum P, Luby M. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif Intell 1993; 60: 141-53.
5. Lauritzen SL, Spiegelhalter DJ. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J R Statist Soc B 1988; 50: 157-224.
6. Jensen FV, Lauritzen SL, Olesen KG. Bayesian updating in causal probabilistic networks by local computations. Comput Stat Quarterly 1990; 4: 269-82.
7. Zhang NL, Poole D. Exploiting causal independence in Bayesian network inference. JAIR 1996; 5: 301-28.
8. Jennett B, Bond M. Assessment of outcome after severe brain damage. Lancet 1975; i: 480-4.
9. Barlow P, Murray L, Teasdale G. Outcome after severe head injury - the Glasgow model. In: Corbett WA, ed. Medical Applications of Microcomputers. New York: Wiley, 1987; 105-26.
10. Choi SC, Narayan RK, Anderson RL, Ward JD. Enhanced specificity of prognosis in severe head injury. J Neurosurg 1988; 69: 381-5.
11. Feldman Z, Contant CF, Robertson CS, Narayan RK, Grossman RG. Evaluation of the Leeds prognostic score for severe head injury. Lancet 1991; 337: 1451-3.
12. Gibson RM, Stephenson GC. Aggressive management of severe closed head trauma: time for reappraisal. Lancet 1989; 2 (8659): 369-71.
13. Kruse JA, Thill-Baharozian MC, Carlson RW. Comparison of clinical assessment with APACHE II for predicting mortality risk in patients admitted to a medical intensive care unit. JAMA 1988; 260: 1739-42.
14. Luerssen TG, Klauber MR, Marshall LF. Outcome from head injury related to patient’s age: a longitudinal prospective study of adult and pediatric head injury. J Neurosurg 1988; 68: 409-16.
15. Parkan C, Hollands L. The use of efficiency linear programs for sensitivity analysis in medical decision making. Med Decis Making 1990; 10: 116-25.
16. Teasdale G, Jennett B. Assessment of coma and impaired consciousness. A practical scale. Lancet 1974; 2 (7872): 81-4.
17. Marshall LF, Bowers Marshall S, Klauber MR. A new classification of head injury based on computerized tomography. J Neurosurg 1991; 75: S14-S20.
18. Buntine WL. Operations for learning with graphical models. JAIR 1994; 2: 159-225.
19. Heckerman D, Geiger D, Chickering D. Learning Bayesian networks: the combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft, 1994.
20. San Martini A, Spezzaferi F. A predictive model selection criterion. J Roy Statist Soc B 1984; 46: 296-303.
21. Cooper G, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach Learn 1992; 9: 309-47.
22. Chow GC. A comparison of the information and posterior probability criteria for model selection. J Econometr 1981; 16: 21-33.
23. Lam W, Bacchus F. Learning Bayesian belief networks. An approach based on the MDL principle. Comput Intell 1994; 10: 269-93.
24. Whittaker J. Graphical Models in Applied Multivariate Statistics. Chichester: Wiley, 1990.
25. Christensen R. Log-Linear Models. Berlin, Heidelberg: Springer Verlag, 1990.
26. Lauritzen SL, Wermuth N. Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann Statist 1989; 17: 31-57.
27. Dawid AP, Lauritzen SL. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann Statist 1993; 21: 1272-317.
28. Lauritzen SL, Dawid AP, Larsen BN, Leimer HG. Independence properties of directed Markov fields. Networks 1990; 20: 491-505.
29. Kiiveri H, Speed TP, Carlin JB. Recursive causal models. J Austral Math Soc (Series A) 1984; 36: 30-52.
30. Dempster AP. Covariance selection. Biometrics 1972; 28: 157-75.
31. Wermuth N. Model search among multiplicative models. Biometrics 1976; 32: 253-63.
32. Patefield WM. Algorithm AS 159. An efficient method of generating random r × c tables with given row and column totals. Appl Statist 1981; 30: 91-7.
33. Kreiner S. Graphical modelling using DIGRAM. Research Report 11/89, Statistical Research Unit, University of Copenhagen, 1989.
34. Edwards D. Introduction to Graphical Modelling. Berlin, Heidelberg: Springer Verlag, 1995.
35. Edwards D, Havranek T. A fast model selection procedure for large families of models. J Amer Statist Assoc 1987; 82: 205-13.

Address of the authors: G. C. Sakellaropoulos, M.Sc., G. C. Nikiforidis, Ph.D., Computer Laboratory, School of Medicine, University of Patras, GR-265 00 Rion-Patras, Greece. E-mail: [email protected]

