LIST OF PEER REVIEWED PUBLISHED ARTICLES 2009
Isabel Healthcare Inc, 3120 Fairview Park Drive, Ste 200, Falls Church, VA 22042. T: 703 289 8855 E: [email protected]

1. HOW ACCURATELY DOES A WEB-BASED DIAGNOSIS DECISION SUPPORT SYSTEM SUGGEST THE CORRECT DIAGNOSIS IN ADULT ER AND PEDIATRIC CASES?
a. Ramnarayan P, Cronje N, Brown R, et al. Validation of a diagnostic reminder system in emergency medicine: a multi-centre study. Emerg Med J 2007;24:619-624. doi:10.1136/emj.2006.044107
b. Ramnarayan P, Tomlinson A, Rao A, Coren M, Winrow A, Britto J. ISABEL: a web-based differential diagnostic aid for pediatrics: results from an initial performance evaluation. Arch Dis Child 2003 May;88(5):408-13.

2. WHAT IS THE IMPACT ON DIAGNOSTIC DECISION MAKING OF USING A DIAGNOSIS DECISION SUPPORT SYSTEM, MEASURED BEFORE AND AFTER USE?
a. Ramnarayan P, Winrow A, Coren M, Nanduri V, Buchdahl R, Jacobs B, Fisher H, Taylor PM, Wyatt JC, Britto JF. Diagnostic omission errors in acute paediatric practice: impact of a reminder system on decision-making. BMC Med Inform Decis Mak 2006 Nov 6;6(1):37.
b. Ramnarayan P, Roberts GC, Coren M, et al. Assessment of the potential impact of a reminder system on the reduction of diagnostic errors: a quasi-experimental study. BMC Med Inform Decis Mak 2006 Apr 28;6:22.

3. HOW ACCURATELY DOES A WEB-BASED DIAGNOSIS DECISION SUPPORT SYSTEM SUGGEST THE CORRECT DIAGNOSIS IN 50 ADULT NEJM CPC CASES?
Graber ML, Mathews A. Performance of a web-based clinical diagnosis support system for internists. J Gen Intern Med 2008 Jan;23 Suppl 1:85-7.

4. HOW DOES A WEB-BASED DIAGNOSIS DECISION SUPPORT SYSTEM IMPROVE DIAGNOSTIC ACCURACY IN A CRITICAL CARE SETTING?
Thomas NJ, Ramnarayan P, Bell P, et al. An international assessment of a web-based diagnostic tool in critically ill children. Technology and Health Care 2008;16:103-110.

5. WHAT IS THE MAGNITUDE OF DIAGNOSTIC ERROR?
Schiff GD, Kim S, Abrams R, et al. Diagnosing diagnosis errors: lessons from a multi-institutional collaborative project. Advances in Patient Safety 2005;2:255-278.

6. WHAT PATIENT CARE MANAGEMENT PROBLEMS ARISE DURING THE DIAGNOSTIC PROCESS, THE LEADING PHASE OF WORK IN THE EMERGENCY DEPARTMENT?
Cosby KS, Roberts R, Palivos L, et al. Characteristics of patient care management problems identified in emergency department morbidity and mortality investigations during 15 years. Ann Emerg Med 2008;51:251-261.

7. WHAT ARE THE CAUSES OF MISSED AND DELAYED DIAGNOSES IN THE AMBULATORY SETTING?
Gandhi TK, Kachalia A, Thomas EJ, et al. Missed and delayed diagnoses in the ambulatory setting: a study of closed malpractice claims. Ann Intern Med 2006;145:488-496.

8. WHAT FACTORS CONTRIBUTE TO DIAGNOSTIC ERROR?
a. Graber ML, et al. Diagnostic error in internal medicine. Arch Intern Med 2005 Jul 11;165(13):1493-9.
b. Graber ML. Diagnostic errors in medicine: a case of neglect. Jt Comm J Qual Patient Saf 2005 Feb;31(2):106-13.

9. DO PHYSICIANS KNOW WHEN THEIR DIAGNOSES ARE CORRECT?
Friedman CP, et al. Do physicians know when their diagnoses are correct? Implications for decision support and error reduction. J Gen Intern Med 2005;20:334-9.
--------------------------------------------------------------------------------------------------------

HOW ACCURATELY DOES A WEB-BASED DIAGNOSIS DECISION SUPPORT SYSTEM SUGGEST THE CORRECT DIAGNOSIS IN ADULT ER AND PEDIATRIC CASES?

Ramnarayan P, Cronje N, Brown R, et al. Validation of a diagnostic reminder system in emergency medicine: a multi-centre study. Emerg Med J 2007;24:619-624. doi:10.1136/emj.2006.044107

ORIGINAL ARTICLE
Validation of a diagnostic reminder system in emergency medicine: a multi-centre study
Padmanabhan Ramnarayan, Natalie Cronje, Ruth Brown, Rupert Negus, Bill Coode, Philip Moss, Taj Hassan, Wayne Hamer, Joseph Britto
Emerg Med J 2007;24:619-624. doi:10.1136/emj.2006.044107
Information on the Isabel system can be found at www.isabelhealthcare.com. See end of article for authors' affiliations.
Correspondence to: Padmanabhan Ramnarayan, 44B Bedford Row, Children's Acute Transport Service, London WC1R 4LL, UK; [email protected]
Accepted 22 March 2007

Background: Diagnostic error is a significant problem in emergency medicine, where initial clinical assessment and decision making is often based on incomplete clinical information. Traditional computerised diagnostic systems have been of limited use in the acute setting, mainly due to the need for lengthy system consultation. We evaluated a novel web-based reminder system, which provides rapid diagnostic advice to users based on free text search terms.
Methods: Clinical data collected from patients presenting to three emergency departments with acute medical problems were entered into the diagnostic system. The displayed results were assessed against the final discharge diagnoses for patients who were admitted to hospital (diagnostic accuracy) and against a set of "appropriate" diagnoses for each case provided by an expert panel (potential utility).
Results: Data were collected from 594 patients (53.4% of screened attendances). Mean age was 49.4 years (95% CI 47.7 to 51.1) and the majority had significant past illnesses. Most were assessed first by junior doctors (70%) and 266/594 (44.6%) were admitted to hospital. Overall, the diagnostic system displayed the final discharge diagnosis in 95% of inpatients and 90% of "must-not-miss" diagnoses suggested by the expert panel. The discharge diagnosis appeared within the first 10 suggestions in 78% of cases.
Conclusions: The Isabel diagnostic aid has been shown to be of potential use in reminding junior doctors of key diagnoses in the emergency department. The effects of its widespread use on decision making and diagnostic error can be clarified by evaluating its impact on routine clinical decision making.
Emergency departments (EDs) have been shown to be high risk clinical areas for the occurrence of medical adverse events.1 2 In contrast to the inpatient setting, where diagnostic delays and missed diagnoses account for only 10–15% of adverse events, diagnostic errors are more frequent and assume greater significance in primary care, emergency medicine and critical care, resulting in a large number of successful negligence claims.3–6 Various reasons may account for this difference. In EDs, acute clinical presentations are often characterised by incomplete and poor quality information at initial assessment, leading to considerable diagnostic uncertainty. Systematic factors such as frequent shift changes, overwork, time pressure and high patient throughput may further contribute to diagnostic errors.7 Cognitive biases inherent within diagnostic reasoning have also been shown to play a crucial role in perpetuating patient safety breaches during diagnostic assessment.8 9 Despite easier access to health information for the public through NHS Direct and NHS Online, the demand for emergency care is steadily rising,10 and it is possible that greater pressure on emergency staff to cut waiting times and increase efficiency with limited resources will lead to a higher incidence of patient safety incidents during diagnostic assessment.

The use of computerised diagnostic decision support systems (DDSS) has been proposed as a technological solution to minimise diagnostic error.11 A number of DDSS have been developed over the past few decades, but most have not shown promise in EDs, either because they focussed on a narrow clinical problem (eg, chest pain) or because computerised aids intended for general use were used only for consultation in rare and complex diagnostic dilemmas. In order to enter complete clinical data into these systems using system-specific medical terminology, considerable user motivation and time was required (often 20–40 min of data entry time).12

Isabel (www.isabelhealthcare.com) is a novel DDSS which was primarily developed for acute paediatrics. Isabel users enter their search terms in natural language free text, and are shown a list of diagnostic suggestions (up to a maximum of 30, displayed on three consecutive pages, 10 diagnoses per page), which are arranged by body system (eg, cardiovascular, gastrointestinal) rather than by clinical probability.13 14 These suggestions are intended only as reminders to prompt clinicians to consider them in their diagnostic investigation, not as "correct" choices or "likely" diagnoses. The system uses statistical natural language processing techniques to search an underlying knowledge base containing textual descriptions of >10 000 diseases. In clinical trials performed in acute paediatrics, mean Isabel usage time was <3 min, and the system reminded junior doctors to consider clinically significant diagnoses in 12.5% of cases.15
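Isabel's retrieval engine (Autonomy) is proprietary and the paper does not describe its internals. The Python sketch below only illustrates the general class of technique the text describes: statistical matching of free-text clinical findings against textual disease descriptions, with output grouped by body system rather than ranked by probability. The knowledge base snippets, scoring scheme (TF-IDF with cosine similarity) and function names are all invented for illustration.

```python
# Illustrative sketch only: score each disease's textbook-style description
# against the clinician's free-text findings and return the best matches
# grouped by body system. Disease snippets are invented.
import math
from collections import Counter

KNOWLEDGE_BASE = {  # disease -> (body system, textbook-style description)
    "Meningococcal disease": ("Infection", "fever purpuric rash neck stiffness shock headache"),
    "Henoch-Schonlein purpura": ("Rheumatic", "purpuric rash legs abdominal pain arthralgia"),
    "Acute appendicitis": ("Surgical", "abdominal pain right iliac fossa fever vomiting anorexia"),
}

def tf_idf_vectors(docs):
    """Build simple TF-IDF vectors for a list of token lists."""
    df = Counter(t for doc in docs for t in set(doc))
    n = len(docs)
    return [{t: c / len(doc) * math.log((1 + n) / (1 + df[t]))
             for t, c in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest(findings, top_n=10):
    names = list(KNOWLEDGE_BASE)
    docs = [KNOWLEDGE_BASE[d][1].split() for d in names] + [findings.lower().split()]
    vecs = tf_idf_vectors(docs)
    query = vecs[-1]
    ranked = sorted(zip(names, vecs[:-1]), key=lambda p: -cosine(query, p[1]))
    # Group reminders by body system rather than by score, mirroring how
    # Isabel presents its output.
    out = {}
    for name, _ in ranked[:top_n]:
        out.setdefault(KNOWLEDGE_BASE[name][0], []).append(name)
    return out

print(suggest("fever purpuric rash and abdominal pain"))
```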
The DDSS was extended to cover adult disease conditions in January 2005.16 Although the extended Isabel system was closely modelled on the paediatric version, a large scale validation of the newly developed system was felt necessary to establish its diagnostic accuracy and potential utility for a range of clinical scenarios among adult patients in EDs. This was especially important since acute presentations in adults differ significantly from those in children. Adults may have multiple pre-morbid conditions that confound their acute illness. Further, the spectrum of diseases encountered is quite distinct, and the relative importance of diagnoses during initial assessment may be different from that in paediatric cases. It was also postulated that the greater amount of clinical detail available on adults presenting to EDs may prolong data entry time. In addition, since the Isabel system relies on extracting key concepts from its knowledge base (textbooks and journal articles), variations in natural language textual patterns in the medical sources used for the adult system may significantly influence its diagnostic suggestions.

Abbreviations: DDSS, diagnostic decision support systems; ED, emergency department; RA, research assistant

Table 1 Characteristics of participating emergency departments
Centre A: teaching hospital; 74 000 annual attendances (majors); 16% inpatient admission rate; clinical decision unit: yes; 5 consultants.
Centre B: teaching hospital; 80 000 annual attendances (majors); 29.8% inpatient admission rate; clinical decision unit: yes; 5 consultants.
Centre C: teaching hospital; 44 000 annual attendances (majors); 19% inpatient admission rate; clinical decision unit: no; 4 consultants.
Overall: 198 000 annual attendances (majors); 21.6% inpatient admission rate.

METHODS
This preliminary validation study was designed purely to examine the clinical performance of the Isabel system and identify its potential utility in adult EDs, not to assess its impact on clinical practice or diagnostic errors. Therefore, clinicians were not allowed real-time access to the system during patient assessment in this study. Clinical data from ED patients presenting with a range of acute medical problems were used to validate the results of the DDSS. The study was approved by the London multi-regional ethics committee (04/MREC02/41) and relevant local research governance committees.

Study centres
A convenience sample of four UK EDs was selected for data collection during the study. Due to significant delay in completing research governance procedures at one centre, data were finally only collected from three participating sites. The characteristics of the study sites are summarised in table 1.

Study patient data
Data were collected from all consecutive patients over 16 years old presenting to the "majors area" or resuscitation rooms with an acute medical problem. Patients presenting to "minors" or a similar area, patients with surgical complaints (including trauma, orthopaedics, ENT, ophthalmology and gynaecology), post-operative surgical problems, psychiatric problems (including substance and alcohol abuse) and complaints directly related to pregnancy were excluded. Patients presenting for reassessment of the same clinical problem within a week of an earlier visit to the ED were also excluded.

Study procedure
This study was a prospective, multi-centre observational study utilising medical case note review. No interventions were performed on patients.

Screening for eligibility
The attendance records of all consecutive patients presenting to an ED "majors area" within a pre-designated 2-week period were screened for eligibility by the primary research assistant (RA). Screening was performed at study sites one after the other, ie, the 2-week period was different for each centre. The complete medical and nursing notes of patients with presenting complaints that fitted study criteria were selected for further review. Reasons for exclusion were recorded for the remainder using specific codes established a priori. Where the RA was unsure of study eligibility, patients were included and data were collected.
Such notes were reviewed at regular intervals by the principal investigator, and a final decision regarding study eligibility was made. To assess inter-rater reliability during screening for eligibility, a second RA examined patient notes from two randomly chosen dates within the specified 2-week period at each study ED. The primary RA was blinded to these dates. Concordance between the two RAs was calculated using the kappa statistic (κ = 0.65, 95% CI 0.62 to 0.68).

Data collection
From eligible patient notes, the primary RA extracted data regarding patient details; date and time of patient assessment; details of clinical presentation such as symptoms, past medical and family history; examination findings, tests performed and results available at the end of complete assessment by the first examining clinician; differential diagnosis and management plan of the first examining clinician; referral for specialist or senior opinion; and outcome of ED assessment. This was entered directly into an Access database (Microsoft, Reading, UK) by means of pre-designed electronic forms. During data entry, both positive and negative symptoms, signs and test results were collected as recorded in the patient notes. Following complete data collection at all centres, final discharge diagnoses for patients at the end of ED assessment, recorded either on discharge letters or on ED electronic systems, were ascertained. For patients admitted as inpatients, final primary diagnoses at hospital discharge were obtained from hospital electronic coding systems.

Data quality assurance was achieved by multiple means, including: use of a training manual created before study commencement containing standardised case examples to practise screening, data collection and abstraction; the use of 25 medical records to practise data collection, with doubts being clarified by the study investigator at study outset; weekly meetings to discuss data abstraction issues and examine collected data; and a log of discussions for ready reference. Reliability of the data collection process was established by randomly assigning 50% of eligible patient notes from two randomly chosen dates to a second RA. Concordance for key variables was analysed using the kappa statistic (κ = 0.58, 95% CI 0.52 to 0.64).
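The concordance figures above are Cohen's kappa statistics, ie, agreement corrected for the agreement two raters would reach by chance alone. As a minimal illustration of how such a figure is computed, here is a short Python sketch; the two raters' eligibility decisions are invented.

```python
# Minimal sketch of Cohen's kappa, the chance-corrected agreement statistic
# reported above (eg, kappa = 0.65 for screening concordance). Ratings invented.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters beyond what chance alone would produce."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if the raters labelled cases independently at
    # their observed marginal rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Example: two research assistants deciding study eligibility for 10 notes.
ra1 = ["eligible", "eligible", "excluded", "eligible", "excluded",
       "eligible", "excluded", "eligible", "eligible", "excluded"]
ra2 = ["eligible", "excluded", "excluded", "eligible", "excluded",
       "eligible", "excluded", "eligible", "eligible", "eligible"]
print(f"kappa = {cohens_kappa(ra1, ra2):.2f}")
```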
Expert panel
An expert panel was set up at each study site consisting of two consultants (attending physicians), which met regularly to provide gold standard diagnoses for a randomly selected subset of study patients. At each panel meeting, moderated by the primary RA, data collected during initial ED assessment for each patient were provided in the form of a pre-formatted clinical summary report generated from the Access database (table 2). Presenting clinical symptoms were provided as recorded in the patient notes. Only relevant co-morbidities, family history, positive clinical signs and results of initial tests (if performed by the examining clinician) were provided. The panel were blinded to the ED where the patient was assessed, clinical decisions made and eventual patient outcome. The panel were instructed to provide, for each case following mutual consensus, a set of "must-not-miss" diagnoses, defined as key diagnoses that would influence patient management, ie, result in a specific action, either eliciting further history or physical examination or initiating new tests and/or treatments. To establish concordance between the three expert panels, a random selection of 5% of study patients was assigned to each panel in blinded fashion. For this subset, gold standard diagnoses were defined as those suggested by two or more panels.

Table 2 Template of patient summary report provided to panel
Patient characteristics: age, gender, ethnic origin
Clinical presentation: registering complaint(s); triage category; vital signs at triage; presenting clinical symptoms; relevant co-morbidities and family history; current medications; positive clinical signs; initial tests performed and results (if available)

DDSS data entry
Concurrent with the data collection process, an Isabel prototype was created from the original paediatric version. In order to generate a mature DDSS for use in adult patients, this prototype needed refinement. A total of 130 notes randomly drawn from patients not admitted to hospital were used for this final fine-tuning (development set). In an iterative process, the system's results for each case were critically examined by clinicians in the development team. To generate a focussed list of diagnoses, each diagnosis in the Isabel database had to be tagged to particular age group(s), gender and regions in which the disease was commonly seen (paediatric tags could not be used for adult patients). Once a fully developed Isabel system was available, the remainder of the cases (464/594) were used to test its performance (validation set). During the validation stage, the RA exported the clinical summary report (as presented to the panel) for each case from the Access database in the form of individual text files. Since the Isabel system accepted search terms only as text and negative findings could not be searched within its medical content, information from the patient summary report needed modification by the RA during data entry into Isabel. Patient characteristics (eg, age and gender) were input using a drop-down menu, numerical values from vital signs and test results were converted to text terms (eg, temperature 38.8°C into "fever"), and only positive findings and test results (when available) were entered. This procedure was standardised prior to data entry by establishing normal ranges for vital signs and test results. The patient summary was aggregated into a single block of text, with each symptom, positive clinical finding, salient past illness and the result of each test placed on a separate line, and then pasted into the search box. Any amount of clinical information could be entered, although it was accepted that the specificity of search results would improve with detailed data entry. Figure 1 illustrates this procedure; figure 2 shows the results as displayed in Isabel.

Figure 1 Clinical data extracted from patient medical notes is entered into the diagnostic reminder system in free text natural language. Filters for age, gender, pregnancy status and geographical region are selected as drop-down choices.

Figure 2 Diagnostic reminders are grouped under body system. Ten suggestions are displayed on the first page with an option to view more diagnoses. Clicking RD leads directly to other Related Diagnoses.
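The conversion of numeric vitals to search terms lends itself to a simple lookup. The sketch below illustrates the idea of standardising against pre-agreed normal ranges and keeping only positive (abnormal) findings; the specific ranges and terms are illustrative assumptions, not the study's actual protocol.

```python
# Sketch of the standardisation step described above: numeric vital signs are
# converted to searchable text terms using pre-agreed normal ranges, and only
# positive (abnormal) findings are kept. Ranges/terms below are illustrative,
# not the study's actual protocol.

# (low, high, term when below, term when above); assumed adult ranges
RANGES = {
    "temperature":      (36.0, 37.5, "hypothermia", "fever"),
    "heart_rate":       (60, 100, "bradycardia", "tachycardia"),
    "respiratory_rate": (12, 18, None, "tachypnoea"),
    "systolic_bp":      (100, 140, "hypotension", "hypertension"),
}

def vitals_to_terms(vitals):
    """Return one search term per abnormal vital sign; drop normal values."""
    terms = []
    for name, value in vitals.items():
        low, high, below, above = RANGES[name]
        if value < low and below:
            terms.append(below)
        elif value > high and above:
            terms.append(above)
    return terms

# The worked example from the paper: temperature 38.8 C becomes "fever".
print(vitals_to_terms({"temperature": 38.8, "heart_rate": 120,
                       "respiratory_rate": 20, "systolic_bp": 108}))
# -> ['fever', 'tachycardia', 'tachypnoea']
```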
Outcome measures
Two separate outcome measures were assessed. Diagnostic accuracy was used to provide an indication of the system's clinical performance and was defined as the proportion of cases admitted to hospital in which the final discharge diagnosis appeared among the DDSS suggestions. However, this did not provide an indication of the system's utility in terms of guiding appropriate patient management in clinical practice. Therefore, the proportion of cases in which the DDSS included the entire set of key diagnoses deemed "must-not-miss" by the expert panel indicated its utility in an ED setting.

Table 4 Patient characteristics (Centre A / Centre B / Centre C / Overall)
Average age (years): 47.57 / 50.5 / 50.23 / 49.4
Male/female ratio: 1.06 / 0.83 / 1.08 / 0.99
Mode of arrival: self 95 (47.5%) / 98 (50.5%) / 101 (49.9%) / 294 (49.5%); ambulance 100 (50.0%) / 96 (49.5%) / 96 (48.0%) / 292 (49.2%); other 5 / 0 / 3 / 8
Triage category: unknown 7 / 0 / 2 / 9; category 1: 4 / 3 / 4 / 11; category 2: 30 (15.1%) / 18 (11.4%) / 53 (26.9%) / 101; category 3: 110 (55.6%) / 98 (62.0%) / 97 (49.2%) / 305 (55.0%); category 4: 44 (22.2%) / 38 (24.0%) / 41 (20.8%) / 123; category 5: 3 / 1 / 0 / 4
Examining clinician: SHO (resident) 162 (81.0%) / 154 (79.4%) / 129 (64.5%) / 445 (74.9%); registrar (fellow) 23 / 19 / 22 / 64; consultant (attending) 2 / 4 / 2 / 8; unclear/other 13 / 17 / 47 / 77
Time of assessment: 0800–1800: 118 (59.0%) / 126 (64.9%) / 111 (55.5%) / 355 (59.8%); 1800–0800: 82 / 68 / 89 / 239
Significant co-morbidities: 180 / 180 / 183 / 543
Significant family history: 29 / 10 / 32 / 71
Primary assessment outcome: admit 19 (9.5%) / 80 (41.2%) / 29 (14.5%) / 128; review after tests 2 / 7 / 3 / 12; senior opinion 18 / 9 / 19 / 46; specialist opinion 63 / 23 / 63 / 149; discharge 81 / 72 / 82 / 235; unclear/other 17 / 3 / 4 / 24
Final outcome: hospital admission 80 (40.0%) / 101 (52.0%) / 85 (42.5%) / 266 (44.8%); discharge, no follow-up 45 / 90 / 59 / 194; discharge, GP follow-up 43 / NR / 38 / 81; discharge, OP follow-up 13 / NR / 9 / 22; transferred out 6 / NR / 2 / 8; other/unclear 13 / 3 / 7 / 23; total 200 / 194 / 200 / 594
GP, general practitioner; NR, not recorded; OP, out patient.

Sample size and statistical analysis
Using an estimated diagnostic accuracy of 85% (from the paediatric system) and an acceptable error rate of 3.5%, data were required from 450 patients to ensure adequate power. Differences between study EDs were analysed using the χ² test for proportions and ANOVA for continuous variables. Statistical significance was set at p<0.05.
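The 450-patient target is consistent with the usual normal-approximation formula for estimating a proportion; the paper does not state which formula was used, so the sketch below simply checks the arithmetic under that assumption.

```python
# Back-of-envelope check of the sample size target, assuming the standard
# normal-approximation formula for estimating a proportion:
#     n >= z^2 * p * (1 - p) / d^2
# with p = 0.85 (expected accuracy) and d = 0.035 (acceptable error).
import math

def n_for_proportion(p, d, z=1.96):
    """Minimum n so a 95% CI around p has half-width at most d."""
    return math.ceil(z**2 * p * (1 - p) / d**2)

print(n_for_proportion(0.85, 0.035))        # ~400 patients at the stated error
# With the 450 patients actually targeted, the achievable half-width is
# slightly tighter than the 3.5% acceptable error:
print(1.96 * math.sqrt(0.85 * 0.15 / 450))  # ~0.033, ie about 3.3%
```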
RESULTS
During the study period, 1113 consecutive patient notes were screened for study eligibility at the three centres. A total of 489 patients were excluded by the RA and a further 30 patients were judged ineligible on secondary review (overall 46.6%). Therefore, 594 medical notes were reviewed in detail. There were significant differences between centres with respect to the proportion of patient notes excluded and reasons for exclusion (table 3).

Table 3 Breakdown of patients screened and reasons for exclusion
Centre A: 344 notes screened; 200 eligible (58.1%); 144 excluded: trauma 50 (34.7%), surgical/post-surgical 15 (10.4%), pregnancy-related 18 (12.5%), psychiatric/psychosomatic 13 (9.0%), substance abuse 15 (10.4%), orthopaedic 2 (1.4%), incomplete data 31 (21.5%).
Centre B: 403 notes screened; 194 eligible (48.1%); 209 excluded: trauma 96 (45.9%), surgical/post-surgical 10 (4.8%), pregnancy-related 14 (6.7%), psychiatric/psychosomatic 12 (5.7%), substance abuse 23 (11.0%), orthopaedic 19 (9.1%), incomplete data 35 (16.7%).
Centre C: 366 notes screened; 200 eligible (54.6%); 166 excluded: trauma 47 (28.3%), surgical/post-surgical 20 (12.0%), pregnancy-related 25 (15.0%), psychiatric/psychosomatic 12 (7.2%), substance abuse 21 (12.6%), orthopaedic 5 (3.0%), incomplete data 36 (21.7%).
Total: 1113 notes screened; 594 eligible; 519 excluded: trauma 193, surgical/post-surgical 45, pregnancy-related 57, psychiatric/psychosomatic 37, substance abuse 59, orthopaedic 26, incomplete data 102.

Patient characteristics are summarised in table 4. The mean age of patients included in the study was 49.4 years (95% CI 47.7 to 51.1), with an equal male to female ratio. A significant number of eligible patients were brought into EDs by ambulance (49.2%), and the majority were triaged into level 3 (55%). Most patients were seen by an SHO in the first instance (74.9%), and 40% were seen out of working hours (1800–0800). The majority of patients had past medical illnesses of note (91.4%). In addition, a significant family history was present in a number of patients (12%). The primary examining clinician indicated having sought senior opinion in 7.7% and specialist opinion in nearly 25% of patients. Overall, 44.8% were admitted to hospital as inpatients, while 33% of all patients were discharged without further follow-up arrangements.

Diagnostic accuracy was measured using 217 inpatient discharge diagnoses available from 266 admissions. Overall, the DDSS displayed 206/217 diagnoses, an accuracy rate of 95% (CI 92% to 98%). Seventy eight per cent of the discharge diagnoses were displayed on the first page (first ten suggestions). Panel members examined 129 cases to provide "must-not-miss" diagnoses. A total of 30 cases were assessed by all three expert panels and 99 others by a single panel. In the former set, 52 diagnoses were suggested (mean 1.7 per case). Isabel displayed 50/52 suggestions (96%) in the list of its diagnostic reminders, with the majority present on the first page (35/50, 70%). For the latter set, 100 "must-not-miss" diagnoses were provided by the panel (mean 1 per case). Isabel displayed 90/100 suggestions; 53/100 were present on the first page (first 10 reminders). Comprehensiveness improved significantly when all three pages were examined (fig 3). An example of one expert panel's assessment of a case and the relevant "must-not-miss" diagnoses is shown in table 5.

Figure 3 Diagnostic accuracy plotted for final diagnosis as well as expert panel "must-not-miss" diagnoses using 10, 20 and 30 results.

Table 5 Case assessment and "must-not-miss" diagnoses
Patient characteristics: 69 years old, male, white, Irish
Registering complaint(s): abdominal pain
Triage category: 3
Vital signs at triage: temp 38.8°C, BP 108/65 mm Hg, heart rate 120/min, respiratory rate 20 bpm, saturations 96% in room air, GCS 15/15
Presenting clinical symptoms: fever; rigors; right iliac fossa pain radiating to central abdomen; vomiting; nausea; chronic blood per rectum; reduced appetite; constipation; passing flatus; chronic productive cough
Relevant co-morbidities and family history: peripheral vascular disease; diabetes mellitus; ischaemic heart disease; smoker
Current medications: simvastatin; metformin
Positive clinical signs: bibasal crackles; abdominal tenderness with percussion; tender rectum; pallor; dehydration; fever; tachycardia
Initial tests performed and results (if available): leucocytosis; high serum bilirubin; elevated CRP; sinus tachycardia on ECG; thickened right colon, suspected mass on ultrasound scan of abdomen
Panel assessment, "must not miss" diagnoses to consider: acute appendicitis; carcinoma of colon; ischaemic bowel
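The headline accuracy figure above is a simple proportion with what appears to be a normal-approximation confidence interval; the short sketch below reproduces the reported 95% (92% to 98%) from 206/217 under that assumption.

```python
# Sketch reproducing the headline accuracy figure: a simple proportion with a
# normal-approximation 95% confidence interval (206/217 discharge diagnoses
# displayed by the DDSS).
import math

def proportion_ci(hits, total, z=1.96):
    p = hits / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

p, lo, hi = proportion_ci(206, 217)
print(f"accuracy {p:.0%} (95% CI {lo:.0%} to {hi:.0%})")  # 95% (92% to 98%)
```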
DISCUSSION
We have demonstrated in this large validation study that the Isabel diagnostic aid shows significant accuracy for hospital discharge diagnosis among patients presenting to EDs with acute medical problems. The system also showed potential utility in the ED setting by including all key diagnoses among its diagnostic reminders. The Isabel system has been previously evaluated in acute paediatrics, and shown to display the final diagnosis on the first page in >90% of cases drawn from real life practice.17 The heterogeneous nature of acute presentations among adult patients and their lengthy past history may have resulted in more complex clinical data entry, accounting for some decrease in accuracy in this study. Most adult patients had significant past medical illnesses and were on numerous medications. In addition, textual descriptions of diseases in the current Isabel knowledge base may be quite different between adult and paediatric sources, resulting in a poorer match between the clinical data entered and the diagnostic results generated. Despite these findings, Isabel's diagnostic performance is better than that of previously described DDSS.18 VanScoy et al showed, in an ED setting using two diagnostic systems (QMR and ILIAD), that the final diagnosis was present in the top five choices in only 30% of cases. Data entry time was prolonged, since detailed case descriptions needed to be matched to system-specific terminology, and specific training was required in the use of the DDSS.19 In our study, clinical assessment was entered into Isabel in the examining clinicians' own words in free text, leading to the conclusion that rapid and easy use of the system is possible without prolonged data entry. In this context, Graber et al have also recently shown that merely pasting the entire history and physical examination section from the Case Records of the Massachusetts General Hospital series as text, without any modification, into Isabel resulted in a diagnostic accuracy rate of 74%.20

Identifying the precise patient population in which DDSS might prove useful is a major challenge. Expert systems such as QMR were intended to be used in a diagnostic dilemma. However, there is sufficient evidence that diagnostic errors occur during routine practice, and that there is poor correlation between physicians' diagnostic accuracy and their own perception of the need for diagnostic assistance.21 We chose to focus on validating Isabel against a pre-selected subset of ED patients (acute medical problems seen in the "majors area"). Nearly half of all patients screened qualified using our liberal study criteria, although it is improbable that in practice medical staff would have used Isabel in all these patients. Experience from our paediatric clinical study indicates that clinicians may seek diagnostic advice in 5–7% of acute medical presentations (three to five ED patients per day).22 Identifying the optimal parameter against which DDSS performance can be measured also remains controversial.23 We used hospital discharge diagnoses and an expert panel's opinion of key diagnoses to provide a combined view of system accuracy and utility. During this initial evaluation, we deliberately denied clinicians access to Isabel. Yet, by extrapolation from these results, it seems likely that clinicians will use and benefit from its diagnostic advice in situations of uncertainty, especially since minimal data entry time was required. Integration of the DDSS into an electronic medical record may allow active diagnostic advice to be delivered to staff with minimal effort.
Such an interface has been developed recently.24

LIMITATIONS
The main limitation of this study was the fact that users did not interact with the system, making it difficult to estimate its true utility in practice. The impact of a DDSS is best measured by its ability to improve clinicians' diagnostic assessment; in addition, unexpected negative effects might be seen. We used electronic systems or discharge summaries to provide data on final diagnoses, but these sources have been shown to be unrepresentative and of variable quality. However, due to logistical reasons, we could not follow up all patients seen in this study. Also, since ED discharge diagnoses were missing for a number of patients, we used inpatient discharge diagnoses for our main outcome analysis.

CONCLUSIONS
Diagnostic assistance may be useful in a large proportion of patients seen in an emergency department. The Isabel diagnostic aid performs with an acceptable degree of clinical accuracy in this setting. Further studies to elucidate its effects on decision making and diagnostic error are essential in order to clarify its role in routine practice.

ACKNOWLEDGEMENTS
We gratefully acknowledge the advice during study design provided by Dr Paul Taylor.

Authors' affiliations
Padmanabhan Ramnarayan, Children's Acute Transport Service, London, UK
Natalie Cronje, Isabel Healthcare, London, UK
Ruth Brown, Rupert Negus, St Mary's Hospital, London, UK
Bill Coode, Philip Moss, Newham General Hospital, London, UK
Taj Hassan, Wayne Hamer, Leeds General Infirmary, Leeds, UK
Joseph Britto, Isabel Healthcare, Reston, VA, USA

Funding: This study was supported by a research grant from the National Health Service (NHS) Research & Development Unit, London. The sponsor did not influence the study design; the collection, analysis, and interpretation of data; the writing of the manuscript; or the decision to submit the manuscript for publication.

Competing interests: The Isabel system is currently managed by Isabel Healthcare, and is available only to subscribed individual and institutional users. Dr Ramnarayan is a part-time research advisor for Isabel Healthcare, Ms Cronje was employed as a research assistant by Isabel Healthcare for this study, and Dr Britto is Clinical Director of Isabel Healthcare. All other authors declare that they have no competing interests.

REFERENCES
1 Weingart SN, Wilson RM, Gibberd RW, et al. Epidemiology of medical error. BMJ 2000;320(7237):774-7.
2 Burroughs TE, Waterman AD, Gallagher TH, et al. Patient concerns about medical errors in emergency departments. Acad Emerg Med 2005;12(1):57-64.
3 Leape L, Brennan TA, Laird N, et al. The nature of adverse events in hospitalized patients. Results of the Harvard Medical Practice Study II. N Engl J Med 1991;324:377-84.
4 Davis P, Lay-Yee R, Briant R, et al. Adverse events in New Zealand public hospitals II: preventability and clinical context. N Z Med J 2003;116(1183):U624.
5 Sandars J, Esmail A. The frequency and nature of medical error in primary care: understanding the diversity across studies. Fam Pract 2003;20(3):231-6.
6 Bartlett EE. Physicians' cognitive errors and their liability consequences. J Healthc Risk Manage 1998;18:62-9.
7 Driscoll P, Thomas M, Touquet R, et al. Risk management in accident and emergency medicine. In: Vincent CA, ed. Clinical risk management. Enhancing patient safety. London: BMJ Publications, 2001.
8 Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them.
Acad Med 2003;78(8):775-80.
9 Kovacs G, Croskerry P. Clinical decision making: an emergency medicine perspective. Acad Emerg Med 1999;6(9):947-52.
10 Bache J. Emergency medicine: past, present, and future. J R Soc Med 2005;98(6):255-8.
11 Graber M, Gordon R, Franklin N. Reducing diagnostic errors in medicine: what's the goal? Acad Med 2002;77(10):981-92.
12 Lemaire JB, Schaefer JP, Martin LA, et al. Effectiveness of the Quick Medical Reference as a diagnostic tool. CMAJ 1999;161(6):725-8.
13 Ramnarayan P, Tomlinson A, Kulkarni G, et al. A novel diagnostic aid (ISABEL): development and preliminary evaluation of clinical performance. Medinfo 2004;11(Pt 2):1091-5.
14 Greenough A. Help from ISABEL for pediatric diagnoses. Lancet 2002;360(9341):1259.
15 Ramnarayan P, Roberts GC, Coren M, et al. Assessment of the potential impact of a reminder system on the reduction of diagnostic errors: a quasi-experimental study. BMC Med Inform Decis Mak 2006;6:22.
16 E-health insider. http://www.e-health-insider.com/news/item.cfm?ID=1009 (accessed 9 July 2007).
17 Ramnarayan P, Tomlinson A, Rao A, et al. ISABEL: a web-based differential diagnostic aid for paediatrics: results from an initial performance evaluation. Arch Dis Child 2003;88(5):408-13.
18 Berner ES, Webster GD, Shugerman AA, et al. Performance of four computer-based diagnostic systems. N Engl J Med 1994;330(25):1792-6.
19 Graber MA, VanScoy D. How well does decision support software perform in the emergency department? Emerg Med J 2003;20(5):426-8.
20 Graber ML, Mathews A. Performance of a web-based clinical diagnosis support system using whole text data entry. http://www.isabelhealthcare.com/info/independentq1.doc (accessed 9 July 2007).
21 Friedman CP, Gatti GG, Franz TM, et al. Do physicians know when their diagnoses are correct? Implications for decision support and error reduction. J Gen Intern Med 2005;20(4):334-9.
22 Ramnarayan P, Winrow A, Coren M, et al. Diagnostic omission errors in acute paediatric practice: impact of a reminder system on decision-making. BMC Med Inform Decis Mak 2006;6(1):37.
23 Berner ES. Diagnostic decision support systems: how to determine the gold standard? J Am Med Inform Assoc 2003;10(6):608-10.
24 PatientKeeper. http://www.patientkeeper.com/success_partners.html (accessed 9 July 2007).

--------------------------------------------------------------------------------------------------------

HOW ACCURATELY DOES A WEB-BASED DIAGNOSIS DECISION SUPPORT SYSTEM SUGGEST THE CORRECT DIAGNOSIS IN ADULT ER AND PEDIATRIC CASES?

Ramnarayan P, Tomlinson A, Rao A, Coren M, Winrow A, Britto J. ISABEL: a web-based differential diagnostic aid for pediatrics: results from an initial performance evaluation. Arch Dis Child 2003 May;88(5):408-13.

ORIGINAL ARTICLE
ISABEL: a web-based differential diagnostic aid for paediatrics: results from an initial performance evaluation
P Ramnarayan, A Tomlinson, A Rao, M Coren, A Winrow, J Britto
Arch Dis Child 2003;88:408-413
See end of article for authors' affiliations.
Correspondence to: Dr J Britto, Department of Paediatrics, Imperial College School of Medicine, St Mary's Hospital, Norfolk Place, London W2 1PG, UK; [email protected]
Accepted 4 October 2002
Aims: To test the clinical accuracy of a web based differential diagnostic tool (ISABEL) for a set of case histories collected during a two stage evaluation.
Methods: Setting: acute paediatric units in two teaching and two district general hospitals in the south-east of England. Materials: sets of summary clinical features from both stages, and the diagnoses expected for these features from stage I (hypothetical cases provided by participating clinicians in August 2000) and final diagnoses for cases in stage II (children presenting to participating acute paediatric units between October and December 2000). Main outcome measure: presence of the expected or final diagnosis in the ISABEL output list.
Results: A total of 99 hypothetical cases from stage I and 100 real life cases from stage II were included in the study. Cases from stage II covered a range of paediatric specialties (n = 14) and final diagnoses (n = 55). ISABEL displayed the diagnosis expected by the clinician in 90/99 hypothetical cases (91%). In stage II evaluation, ISABEL displayed the final diagnosis in 83/87 real cases (95%).
Conclusion: ISABEL showed acceptable clinical accuracy in producing the final diagnosis for a variety of real as well as hypothetical case scenarios.

A large proportion of routine clinical decision making depends on the availability of good quality clinical information and instant access to up to date medical knowledge.1 There is increasing evidence that the use of computer aided clinical decision support to manage medical knowledge results in better healthcare processes and patient outcomes.2 However, the use of computer based knowledge management techniques in medical decision making has remained poor.3 4 Considerable advances have been made in making the latest processed information from clinical trials available, exemplified by the Cochrane database and the Clinical Evidence series.5 The free accessibility of Medline on the internet (http://www.ncbi.nlm.nih.gov/PubMed) has enabled easy and universal search of the medical literature. However, these endeavours do not primarily serve the busy clinician at the bedside seeking bottom line answers to routine clinical questions. These questions involve diagnosis and immediate management for a patient in general practice6 or in the emergency department.7 Recent initiatives such as the ATTRACT project have attempted to provide quick, up to date answers to clinical queries for the bedside physician, but involve a substantial financial and resource commitment.8 In addition, the current model of healthcare delivery within the UK National Health Service (NHS) accentuates these problems by providing care in the form of an "inverted pyramid of knowledge": clinical wisdom and knowledge are concentrated at the top among senior staff, in many cases distant from the patient, who is first seen by a junior doctor at the bottom of the pyramid. This may contribute to a significant proportion of medical error,9 resulting in extra bed days and a preventable financial burden.10

ISABEL (Isabel Medical Charity, UK) is a computerised differential diagnostic aid for paediatrics that is delivered via the world wide web. It was developed following a missed diagnosis in a 3 year old girl with necrotising fasciitis complicating chicken pox. It currently has over 9000 registered users (including doctors, nurses, and other healthcare professionals) and receives over 100 000 page requests every month.
Powered by Autonomy, proprietary software that serves as an efficient information retrieval engine by matching patterns within unformatted text, the tool produces differential diagnoses for any set of clinical features by searching text from standard paediatric textbooks. Rather than providing a single diagnosis (a diagnostic tool), ISABEL is primarily intended to suggest only a differential diagnosis and serve as a "reminder" system, reminding the clinician of potentially important diagnoses that might have been missed. During the development of ISABEL, text from each textbook pertaining to each of 3500 different diagnostic labels (for example, measles, migraine) was added to the ISABEL database. These textbooks included Nelson's textbook of pediatrics (16th edition, 2000, WB Saunders), Forfar and Arneil's textbook of paediatrics (5th edition, 1998, Churchill Livingstone, UK), Jones and Dargan's Churchill's pocket book of toxicology (2001, Churchill Livingstone, UK), and Rennie and Roberton's textbook of neonatology (3rd edition, 1999, Churchill Livingstone, UK). Each diagnostic label was allocated an age group classification ("newborn", "infant", "child", or "adolescent") to prevent inappropriate diagnoses being presented to the user for a specific patient age group.
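The age-group tagging described above can be pictured as a simple filter applied to the retrieval output. The sketch below is schematic only; the tags shown are invented examples, not ISABEL's actual assignments.

```python
# Schematic sketch of age-group tagging: each diagnostic label carries the
# age group(s) in which it is plausibly seen, and search results are filtered
# against the age group the user selects. Tags below are invented.
AGE_TAGS = {
    "Necrotising enterocolitis": {"newborn"},
    "Pyloric stenosis": {"newborn", "infant"},
    "Bronchiolitis": {"infant"},
    "Appendicitis": {"child", "adolescent"},
    "Slipped upper femoral epiphysis": {"adolescent"},
}

def filter_by_age(candidates, age_group):
    """Drop diagnoses not tagged for the selected age group."""
    return [d for d in candidates if age_group in AGE_TAGS.get(d, set())]

# An "infant" query never sees adolescent-only diagnoses:
ranked = ["Bronchiolitis", "Appendicitis", "Pyloric stenosis",
          "Slipped upper femoral epiphysis"]
print(filter_by_age(ranked, "infant"))  # ['Bronchiolitis', 'Pyloric stenosis']
```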
In order to examine ISABEL's utility, an evaluation programme was planned in a stepwise fashion: initial system performance to establish the safety of the tool, subsequent evaluation of impact in a simulated setting, and evaluation of impact in a real life clinical setting. This paper describes only the initial evaluation of ISABEL's capability, and focuses on system performance. The tool was isolated from intended users (clinicians); its impact on clinical practice and decision making was not examined. This study was planned in two stages in two different settings, with the following considerations:
• The two stages were to be conducted in series for ease of data collection.
• They were intended to provide a variety of hypothetical as well as real cases for the investigators to test the tool.
• ISABEL is useful primarily as a reminder tool and not as an "oracle". However, in reminding clinicians of diagnoses in a given clinical setting, it is imperative that the final diagnosis is also one of the "reminders" suggested. ISABEL would be considered "unsafe" for clinical use if many plausible diagnoses were suggested to the junior doctor, but the final diagnosis was not. For this reason, it was crucial that ISABEL showed acceptable clinical accuracy by displaying the final diagnosis (especially for inexperienced junior doctors who may not have considered the "correct" diagnosis).

METHODS
Differential diagnostic tool
ISABEL was delivered on the internet free of charge (www.isabel.org.uk). During the development phase, only the investigators had access to ISABEL, enabled by a secure log-in procedure. This ensured that clinicians participating in the data collection would not be able to use ISABEL and thereby influence clinical management. On accessing the tool, the patient age group had to be chosen first (newborn, infant, child, or adolescent). Following this, the summary clinical features of the case were entered into a free text box. These features would normally be gathered from the history, physical examination, and results of initial investigations. Any additional findings could be subsequently entered into the tool to focus the differential diagnosis further. To maximise the tool's performance, clinical features had to be entered in appropriate medical terminology (rather than lay terms, as the database consisted of textbooks), the spelling had to be accurate (British or American), and laboratory results needed interpretation in words ("leucocytosis" or "increased white cell count" for a white cell count of 36.7 × 10⁶/µl). These data were then submitted to the ISABEL database, searched by Autonomy, and a fresh web page displaying the results from ISABEL was produced. This list included around 10–15 unique diagnoses for consideration, classified on the basis of the systems from which they were drawn (respiratory, metabolic, etc). The diagnoses were not ranked in order of probability, thus reinforcing the function of the tool as a "reminder" system.

Study design and conduct
Stage I: We undertook this study in August 2000 within the Department of Paediatrics at St Mary's Hospital, London, which has both general paediatric and specialist services in infectious diseases, neonatology, and intensive care. Clinicians with varying levels of experience (consultant, registrar, and senior house officer) were contacted to provide hypothetical case histories of acute paediatric presentations. For each case, they specified the age group category, summarised the clinical features, and listed the expected diagnosis(es).
Stage II: This study was undertaken in four acute paediatric units: two teaching hospitals (St Mary's Hospital, London and Addenbrooke's Hospital, Cambridge) and two large district general hospitals (Kingston Hospital, Surrey and Royal Alexandra Hospital, Brighton). This study was done from October to December 2000. Junior doctors working within these departments prospectively collected data on children presenting to the acute paediatric unit. Only data regarding the age group, a summary of clinical features at initial presentation, and the working diagnosis(es) were collected from the doctors. The final diagnosis for each patient, as decided by the clinical team at the end of the hospital stay (or at the end of the clinical assessment), was collected from the discharge summary. Guidance was provided to the doctors to specify clinical features in medical terminology and to interpret results of initial investigations in words. This was done so that the investigators would not need to modify the content provided before entering the clinical features into ISABEL. One investigator (AT) collected these data in a structured form. In one sitting at the end of the data collection, she then entered the age group and the clinical features of each case into ISABEL as provided by the junior doctors, without modifying the content or spelling. This preserved the inherent user variability in summarising clinical features. Results generated by the tool for each case were collected for analysis.

Outcome measures
The main outcome measure used for both stages of the study was the presence of the expected or final diagnosis(es) within the results generated by ISABEL. This was a measure of the clinical accuracy of the tool, defined as reminding clinicians of the "correct" diagnosis.

Figure 2 Sample screen from ISABEL: input of clinical features.
Figure 3 Sample screen from ISABEL: output of diagnostic reminders classified by system.
Table 1 Clinical details of cases tested in stage I evaluation of ISABEL: primary diagnoses expected by the paediatricians
Acute graft versus host disease; acute pyelonephritis; AIDS; aspergillosis; bacterial meningitis; bacterial pneumonia; blastomycosis; brucellosis; Campylobacter infection; cat scratch disease; cellulitis; Chlamydia pneumoniae infection; Chlamydia trachomatis infection; chronic granulomatous disease; congenital cytomegalovirus infection; Crohn's disease; croup; dermatomyositis; E coli 0157 infection and haemolytic uraemic syndrome; Epstein-Barr virus infection; endocarditis; endotracheal tube obstruction; enteric fever; enteroviral infection; erythema infectiosum; gastroenteritis; gram negative septicaemia; haemolytic disease of the newborn; hand, foot, and mouth disease; herpes simplex gingivostomatitis; herpes zoster (shingles); HIV; human herpes virus infection; iatrogenic blood loss; impetigo; infant botulism; infant of a diabetic mother; infectious mononucleosis (glandular fever); infectious mononucleosis; juvenile psoriatic arthritis; juvenile spondyloarthropathy; Kawasaki disease; leptospirosis; Lyme disease; malaria; measles; meconium aspiration syndrome; meningitis; meningococcal disease; meningococcus; mumps; Mycoplasma infection; necrotising enterocolitis; nonpolio enterovirus infection; osteomyelitis; parotitis; pauciarticular juvenile chronic arthritis; periorbital cellulitis; pneumonia; polio; rheumatic fever; respiratory syncytial virus bronchiolitis; respiratory syncytial virus infection; scarlet fever; sepsis; Shigella infection; staphylococcal scalded skin syndrome; staphylococcal toxic shock; Still's disease (systemic juvenile chronic arthritis); subacute sclerosing panencephalitis; systemic sclerosis; tuberculous pneumonia; toxic shock syndrome; tuberculosis; urinary tract infection; varicella zoster (chicken pox); viral encephalitis; viral upper respiratory tract infection

RESULTS
Figure 1 summarises the outcome of all data that were collected for both stages of the evaluation. On the whole, 99 hypothetical cases (provided by 13 clinicians) were eligible for analysis in stage I, and 100 cases in stage II. Forms in which no diagnoses were entered, and forms where the clinical features section included the final diagnosis in its wording, were excluded from testing. Figures 2 and 3 show sample screens from ISABEL for a real case in stage II testing. Tables 1 and 2 summarise the clinical characteristics of cases collected in stages I and II, providing an indication of the spectrum and frequency of final diagnoses as well as the specialties covered during the evaluation.

Figure 1 Outcome of all cases collected for both stages of evaluation.

Stage I: presence of the expected diagnosis(es) in the ISABEL differential diagnosis list. Of the 99 hypothetical cases that were used to test ISABEL, the expected diagnosis (if it was a single diagnosis) or all the expected diagnoses (if the clinician expected to see multiple diagnoses) were present in 90 cases (91%).
Stage II: presence of the final diagnosis in the ISABEL differential diagnosis list. One hundred real cases were used to test ISABEL. Each of these cases had a primary final diagnosis, which was used as the outcome variable (11 cases also had a supplementary diagnosis, which was not used to test the tool). Table 3 shows an example from stage II evaluation data.
In 13/100 cases, the final diagnosis was non-specific (such as "viral illness"). Since textbooks do not describe such non-specific diagnoses, and the ISABEL database was itself created from textbooks, these were not recognised as distinct final diagnoses. In the remaining 87 cases, the final diagnosis was present in the ISABEL output in 83 cases (95%). The four cases in which the final diagnosis was absent from the ISABEL output were: Stevens-Johnson syndrome, respiratory syncytial virus bronchiolitis, erythema multiforme, and staphylococcal cellulitis. Recognising that non-specific final diagnoses are often made in routine clinical practice, ISABEL was separately tested against these diagnoses. In 10/13 such cases, ISABEL suggested diagnoses that were nearly synonymous. For a non-specific final diagnosis such as "viral illness", ISABEL suggested alternatives such as roseola infantum, Epstein-Barr virus, and enteroviruses. For each case, the ISABEL differential diagnostic tool suggested more than 10 diagnoses (mode 13, range 10–15).

DISCUSSION
Presence of the expected/final diagnosis
This outcome measure is akin to the "sensitivity" of a test. In this respect, ISABEL showed a level of sensitivity between 83% and 91%. More importantly, ISABEL displayed the final diagnosis in stage II cases even when only the presenting clinical features were entered. Although the primary function of the tool is to remind clinicians to consider other reasonable diagnoses in their diagnostic plan and thus influence their management plan, it is important that the tool also generates accurate diagnoses in its output. The use of a final diagnosis as the gold standard is useful in this regard, to ensure that even inexperienced clinicians using the system remain "safe".

Table 2 Clinical characteristics of cases collected during stage II evaluation: final diagnoses (n = 55) grouped by system*
Allergy: allergic reaction; egg allergy; urticarial reaction
Dermatology: erythema multiforme; staphylococcal infection
Endocrine: diabetes; diabetic ketoacidosis
Gastroenterology: foreign body ingestion; non-specific gastritis; non-specific abdominal pain; pyloric stenosis
Haematology: sickle cell crisis
Infection: acute gastroenteritis; cellulitis; cervical lymphadenitis; enteroviral infection; infected insect bite; Mycoplasma infection; osteomyelitis; otitis media; pneumococcal meningitis; pneumonia; preauricular abscess; preseptal cellulitis; roseola infantum; rotavirus infection; rubella; scalded skin syndrome; septic arthritis; Stevens-Johnson syndrome; streptococcal infection; tonsillitis; toxic shock syndrome; upper respiratory tract infection; viral illness; viral meningitis
Neonate: opiate withdrawal
Nephrology: nephrotic syndrome; urinary tract infection
Neurology: febrile convulsion; vasovagal attacks
Respiratory: acute severe asthma; acute sinusitis; asthma; bronchiolitis; croup; laryngomalacia; respiratory syncytial virus bronchiolitis; viral induced wheeze
Rheumatic: acute exacerbation of systemic lupus erythematosus; Henoch-Schönlein purpura; Kawasaki disease
Skeletal: diskitis of L1/L2; slipped upper femoral epiphysis
Surgical: appendicitis
*Some patients had more than one final diagnosis (for example, one child had a primary diagnosis of urinary tract infection as well as a supplementary diagnosis of asthma).
In this context, diagnostic accuracy rates of clinicians, in an unselected patient population, have been around 60%.11 These studies used necropsy findings or results of specific “diagnostic” tests as the gold standard. In a prospective evaluation of a prototype of the medical diagnostic aid Quick Medical Reference (QMR), the unaided diagnostic accuracy rate of physicians (an entire ward team considered as a single unit) was 60% for diagnostically challenging cases.12 In this highly selective patient population, it was possible to establish a final www.archdischild.com diagnosis only in 20 out of 31 cases, even after follow up for six months and extensive investigation. In general, published reports of similar evaluations of other diagnostic systems are sparse. In one study, some commonly used systems for adult medicine (Dxplain, Iliad, QMR, and Meditel) had an accuracy of between 50% and 70% when tested against a set of challenging cases.13 Our figures compare favourably with results obtained from testing other diagnostic systems in clinical use.14–16 Most such systems are expert systems, and use a combination of rule based and Bayesian approaches to model diagnostic decision making.17 Most systems are also oriented towards adult medicine. Even the few existing paediatric diagnostic aids offer support related to very specific areas such as rheumatic diseases and abdominal ISABEL 413 Table 3 Example from stage II evaluation data showing the clinical features, the ISABEL differential diagnosis list, and the final diagnosis Age group Clinical features Final diagnosis Infant Cough Respiratory syncytial virus bronchiolitis Wheeze Increased work of breathing Subcostal recession Wheeze and crackles on examination Tachypnoea Poor feeding The spectrum of cases tested is broad enough to ensure that the tool is not limited to a special subset of users. Delivered via the world wide web, ISABEL could impact on patient management on a global level. Because of a potential synergy between the doctor and ISABEL, the results of this study cannot be directly extrapolated to estimate the clinical impact of the tool. Further studies are underway to evaluate the effects of clinicians using ISABEL to aid clinical decision making in simulated as well as real settings. These studies are essential to explore the positive as well as negative impact ISABEL might produce on healthcare processes, health economics, as well as patient outcomes. ..................... Authors’ affiliations pain, making them difficult to use in routine clinical practice.16 18 The evaluation of one such system for paediatric rheumatic diseases suggested 80% accuracy (92% when only diagnoses included in its knowledge base were used).16 Limitations of the study This study was designed to evaluate only the presence of the final diagnosis in the reminder list. It did not address the plausibility of the other diagnoses suggested by ISABEL. In this respect, it is possible that although the final diagnosis was present in the differential diagnosis list, the other suggestions made by ISABEL were misleading. However, in order to verify the safety of the tool, it was crucial to show that the final diagnosis formed part of the “reminder” list. Further evaluation is underway to test the relevance of the alternate diagnoses. The hypothetical cases did not represent the complete spectrum of paediatrics and may not have tested the true limits of the tool. 
Similarly, the number of real cases used to test the tool may have been insufficient to ensure a complete evaluation of the system's capability, especially in the specialities of neonatal paediatrics, oncology, and paediatric surgery. In addition, this study was conducted in secondary hospital care. Further testing is planned in a primary care setting to assess the safety of the tool. Considering that this study was only a preliminary evaluation of the performance of the tool, these results cannot be readily extrapolated to routine clinical use. This evaluation separates the tool from the intended user (the clinician). This tool, like other similar tools, will only be used as an adjunct to the existing system of clinical decision making.19 Despite this, it is possible that inexperienced doctors using the tool either in a primary care or hospital setting might cause an increase in referrals and investigations. The true measure of the tool's benefits and risks will be clear only when tested by real clinicians uninvolved in the development of the tool. These questions will be answered in subsequent clinical impact evaluations.

CONCLUSIONS

Our study shows that in a large proportion of cases, ISABEL has the potential to remind the clinician of the final diagnosis in a variety of hypothetical as well as real clinical situations. The spectrum of cases tested is broad enough to ensure that the tool is not limited to a special subset of users. Delivered via the world wide web, ISABEL could impact on patient management on a global level. Because of a potential synergy between the doctor and ISABEL, the results of this study cannot be directly extrapolated to estimate the clinical impact of the tool. Further studies are underway to evaluate the effects of clinicians using ISABEL to aid clinical decision making in simulated as well as real settings. These studies are essential to explore the positive as well as negative impact ISABEL might produce on healthcare processes, health economics, as well as patient outcomes.

Authors' affiliations

P Ramnarayan, A Tomlinson, M Coren, J Britto, Department of Paediatrics, Imperial College School of Medicine, St Mary's Hospital, Norfolk Place, London, UK
A Rao, Department of Haematology, Chase Farm Hospital, The Ridgeway, Enfield, UK
A Winrow, Department of Paediatrics, Kingston Hospital, Galsworthy Road, Kingston upon Thames, Surrey, UK

REFERENCES

1 Wyatt JC. Clinical knowledge and practice in the information age: a handbook for health professionals. London: The Royal Society of Medicine Press, 2001.
2 Hunt DL, Haynes RB, Hanna SE, et al. Effects of computer-based clinical decision support systems on physician performance and patient outcomes: a systematic review. JAMA 1998;280:1339–46.
3 Ely JW, Osheroff JA, Ebell MH, et al. Analysis of questions asked by family doctors regarding patient care. BMJ 1999;319:358–61.
4 Hersh WR, Hickam DH. How well do physicians use electronic information retrieval systems? A framework for investigation and systematic review. JAMA 1998;280:1347–52.
5 Godlee F, Smith R, Goldmann D. Clinical evidence. BMJ 1999;318:1570–1.
6 Gorman PN, Ash J, Wykoff L. Can primary care physicians' questions be answered using the medical journal literature? Bull Med Libr Assoc 1994;82:140–6.
7 Wyer PC, Rowe BH, Guyatt GH, et al. Evidence-based emergency medicine. The clinician and the medical literature: when can we take a shortcut? Ann Emerg Med 2000;36:149–55.
8 Brassey J, Elwyn G, Price C, et al. Just in time information for clinicians: a questionnaire evaluation of the ATTRACT project. BMJ 2001;322:529–30.
9 Neale G, Woloshynowych M, Vincent C. Exploring the causes of adverse events in NHS hospital practice. J R Soc Med 2001;94:322–30.
10 Alberti KG. Medical errors: a common problem. BMJ 2001;322:501–2.
11 Sadegh-Zadeh K. Fundamentals of clinical methodology. 4. Diagnosis. Artif Intell Med 2000;20:227–41.
12 Bankowitz RA, McNeil MA, Challinor SM, et al. A computer-assisted medical diagnostic consultation service. Implementation and prospective evaluation of a prototype. Ann Intern Med 1989;110:824–32.
13 Berner ES, Webster GD, Shugerman AA, et al. Performance of four computer-based diagnostic systems. N Engl J Med 1994;330:1792–6.
14 Feldman M, Barnett GO. Pediatric computer-based diagnostic decision support. Scientific Program, Section on Computers and Other Technologies. American Academy of Pediatrics, New Orleans, LA, 27 October 1991.
15 Swender PT, Tunnessen WW Jr, Oski FA. Computer-assisted diagnosis. Am J Dis Child 1974;127:859–61.
16 Athreya BH, Cheh ML, Kingsland LC. Computer-assisted diagnosis of pediatric rheumatic diseases. Pediatrics 1998;102.
17 Shortliffe EH, Perreault LE. Medical informatics. New York: Springer-Verlag, 2001.
18 De Dombal FT, Leaper DJ, Staniland JR, et al. Computer-aided diagnosis of acute abdominal pain. BMJ 1972;2:9–13.
19 Ramnarayan P, Britto J. Paediatric clinical decision support systems. Arch Dis Child 2002;87:361–2.

WHAT IS THE IMPACT ON CHANGE OF DIAGNOSIS DECISION MAKING BEFORE AND AFTER USING A DIAGNOSIS DECISION SUPPORT SYSTEM?

Ramnarayan P, Winrow A, Coren M, Nanduri V, Buchdahl R, Jacobs B, Fisher H, Taylor PM, Wyatt JC, Britto JF. Diagnostic omission errors in acute paediatric practice: impact of a reminder system on decision-making. BMC Med Inform Decis Mak. 2006 Nov 6;6(1):37

BMC Medical Informatics and Decision Making

Research article (Open Access)

Diagnostic omission errors in acute paediatric practice: impact of a reminder system on decision-making

Padmanabhan Ramnarayan*1, Andrew Winrow2, Michael Coren3, Vasanta Nanduri4, Roger Buchdahl5, Benjamin Jacobs6, Helen Fisher7, Paul M Taylor8, Jeremy C Wyatt9 and Joseph Britto7

Address: 1Children's Acute Transport Service (CATS), 44B Bedford Row, London, WC1H 4LL, UK; 2Department of Paediatrics, Kingston General Hospital, Galsworthy Road, Kingston-upon-Thames, KT2 7QB, UK; 3Department of Paediatrics, St Mary's Hospital, Paddington, London, W2 1NY, UK; 4Department of Paediatrics, Watford General Hospital, Vicarage Road, Watford, WD18 0HB, UK; 5Department of Paediatrics, Hillingdon Hospital, Pield Heath Road, Middlesex, UB8 3NN, UK; 6Department of Paediatrics, Northwick Park Hospital, Watford Road, Harrow, Middlesex, HA1 3UJ, UK; 7Isabel Healthcare Ltd, PO Box 244, Haslemere, Surrey, GU27 1WU, UK; 8Centre for Health Informatics and Multiprofessional Education (CHIME), Archway Campus, Highgate Hill, London, N19 5LW, UK; 9Health Informatics Centre, The Mackenzie Building, University of Dundee, Dundee, DD2 4BF, UK

Email: Padmanabhan Ramnarayan* - [email protected]; Andrew Winrow - [email protected]; Michael Coren - [email protected]; Vasanta Nanduri - [email protected]; Roger Buchdahl - [email protected]; Benjamin Jacobs - [email protected]; Helen Fisher - [email protected]; Paul M Taylor - [email protected]; Jeremy C Wyatt - [email protected]; Joseph Britto - [email protected]
* Corresponding author

Published: 06 November 2006. Received: 15 July 2006; Accepted: 06 November 2006
BMC Medical Informatics and Decision Making 2006, 6:37 doi:10.1186/1472-6947-6-37
This article is available from: http://www.biomedcentral.com/1472-6947/6/37
© 2006 Ramnarayan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

Background: Diagnostic error is a significant problem in specialities characterised by diagnostic uncertainty, such as primary care, emergency medicine and paediatrics. Despite widespread availability, computerised aids have not been shown to significantly improve diagnostic decision-making in a real world environment, mainly due to the need for prolonged system consultation. In this study performed in the clinical environment, we used a Web-based diagnostic reminder system that provided rapid advice with free text data entry to examine its impact on clinicians' decisions in an acute paediatric setting during assessments characterised by diagnostic uncertainty.

Methods: Junior doctors working over a 5-month period at four paediatric ambulatory units consulted the Web-based diagnostic aid when they felt the need for diagnostic assistance. Subjects recorded their clinical decisions for patients (differential diagnosis, test-ordering and treatment) before and after system consultation. An expert panel of four paediatric consultants independently suggested clinically significant decisions indicating an appropriate and 'safe' assessment. The primary outcome measure was change in the proportion of 'unsafe' workups by subjects during patient assessment. A more sensitive evaluation of impact was performed using specific validated quality scores. Adverse effects of consultation on decision-making, as well as the additional time spent on system use, were examined.

Results: Subjects attempted to access the diagnostic aid on 595 occasions during the study period (8.6% of all medical assessments); subjects examined diagnostic advice in only 177 episodes (30%). Senior House Officers at hospitals with a greater number of available computer workstations in the clinical area were most likely to consult the system, especially out of working hours. Diagnostic workups construed as 'unsafe' occurred in 47/104 cases (45.2%); this reduced to 32.7% following system consultation (McNemar test, p < 0.001). Subjects' mean 'unsafe' workups per case decreased from 0.49 to 0.32 (p < 0.001). System advice prompted the clinician to consider the 'correct' diagnosis (established at discharge) during initial assessment in 3/104 patients. Median usage time was 1 min 38 sec (IQR 50 sec – 3 min 21 sec). Despite a modest increase in the number of diagnostic possibilities entertained by the clinician, no adverse effects were demonstrable on patient management following system use. Numerous technical barriers prevented subjects from accessing the diagnostic aid in the majority of eligible patients in whom they sought diagnostic assistance.

Conclusion: We have shown that junior doctors used a Web-based diagnostic reminder system during acute paediatric assessments to significantly improve the quality of their diagnostic workup and reduce diagnostic omission errors. These benefits were achieved without any adverse effects on patient management following a quick consultation.

Background

Studies suggest that a significant proportion of adverse events in primary as well as in secondary care result from errors in medical diagnosis [1-3]; diagnostic errors also constitute the second leading cause for malpractice suits against hospitals [4].
Specialities such as primary care and emergency medicine have specifically been identified as high risk areas for diagnostic mishaps, where cognitive biases in decision making contribute to errors of omission, resulting in incomplete workup and 'missed diagnoses' [5-7]. Adverse events are also commoner at extremes of age, such as in paediatric patients and the elderly [8,9].

Diagnostic decision support systems (DDSS), computerised tools that provide accurate and useful patient- and situation-specific advice, have been proposed as a technological solution for the reduction of diagnostic errors in practice [10]. Although a number of 'expert diagnostic systems' exist currently, a recent systematic review showed that these systems were less effective in practice than systems that provided preventive care reminders and prescription advice [11]. To a large extent, this may be because most of the latter systems were integrated into an existing electronic medical record (EMR), enabling effortless and frequent use by clinicians; in contrast, expert DDSS such as Quick Medical Reference (QMR), ILIAD and MEDITEL-PEDS were typically used in stand-alone fashion [12-14]. Due to a lengthy data input process, considerable clinician motivation and effort was required for their regular use, leading to infrequent usage [15]. As a result, busy clinical areas have been poorly served by existing DDSS. Attempts to integrate diagnostic decision support into an EMR have been sporadic [16,17], mainly limited by the difficulties associated with converting a complex clinical narrative into structured clinical data in a standard EMR, especially for specialities such as paediatrics and emergency medicine. In the medium term, effortless provision of decision support for busy clinical areas at high risk for diagnostic error therefore appears possible only through alternative approaches.

A Web-based paediatric DDSS that permits rapid use in a busy clinical environment by using natural language free text data entry has recently been described [18,19]. Its underlying knowledge base consists of textual descriptions of diseases; using statistical natural language processing, the DDSS matches clinical features to disease descriptions in the database. This approach is similar to that adopted by the RECONSIDER program [20]. Diagnostic suggestions are displayed in sets of 10 up to a maximum of 30, and arranged by body system (e.g. cardiology) rather than by clinical probability. Between 2001 and 2003, >15,000 users registered for its use, 10% of whom used it on >10 separate occasions, resulting in >60,000 distinct user log-ins (personal communication). Thus, although poor usage has been a major confounding factor during evaluations of the clinical benefits of a number of DDSS [21], Isabel usage statistics led us to believe that a study evaluating its clinical impact would permit the assessment of its benefits and risks to be interpreted with confidence, and provide useful insights into the user-DDSS dynamic. Results from an independent email questionnaire survey also suggested that most regular users in the UK found it helpful during patient management [22].

In this study, we aimed to measure the clinical impact of the Isabel system on diagnostic decision making. We hypothesised that lessons learnt from our evaluation study could be generalised to the design, implementation and evaluation of other stand-alone DDSS, and clarify the risks associated with the use of such a system in real life.
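The paper describes matching free-text clinical features against textual disease descriptions using statistical natural language processing, in the spirit of RECONSIDER. A minimal sketch of that general idea, using TF-IDF cosine similarity over a tiny invented disease-description corpus (this is not Isabel's actual algorithm; the corpus and query below are made up for illustration):

```python
# Generic sketch of matching free-text clinical features against textual
# disease descriptions. The three-entry "knowledge base" is invented for
# the example; a real system would index full textbook descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

disease_descriptions = {
    "bronchiolitis": "infant cough wheeze crackles tachypnoea poor feeding "
                     "respiratory distress winter epidemic",
    "asthma": "recurrent wheeze cough breathlessness diurnal variation atopy",
    "pneumonia": "fever cough tachypnoea focal crackles consolidation",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(disease_descriptions.values())

query = "cough wheeze poor feeding tachypnoea"        # free-text entry
scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]

# Rank diseases by similarity to the entered clinical features.
for name, score in sorted(zip(disease_descriptions, scores),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.2f}")
```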
Diagnostic suggestions were provided to junior doctors during acute paediatric assessments in which they experienced diagnostic uncertainty.

Methods

The study was co-ordinated from St Mary's Hospital, Imperial College London, and was approved by the London multi-centre research ethics committee (MREC/02/2/70) and relevant local research ethics committees.

Study centres

From a list of non-London district general hospitals (DGH) at which >4 individual users were registered in the Isabel database, four paediatric departments (two university-affiliated DGHs and two DGHs without official academic links) were enrolled, based on logistical and clinical reasons – all sites were <100 miles driving distance from London; three were geographically clustered in the Eastern Region and two sites were separated by large distances from regional tertiary centres. The baseline characteristics of each of the participating centres are detailed in table 1. None of the study centres were involved in the development of the DDSS.

Table 1: Characteristics of participating paediatric departments
Centre A – nature: university; 24-hour dedicated PAU: no*; annual PAU attendance: 4194; junior doctors: 31; acute consultants: 10 (6); computers in PAU: 2; metropolitan area: yes; distance to tertiary centre: <25 miles; clinical activity (PAU attendances per hour PAU open): 1.1; computer accessibility index (available computers per unit clinical activity): 1.8; senior support (acute consultants per subject enrolled): 0.2
Centre B – nature: dgh; 24-hour dedicated PAU: yes; annual PAU attendance: 4560; junior doctors: 21; acute consultants: 7 (4); computers in PAU: 1; metropolitan area: mixed; distance to tertiary centre: 26 miles; clinical activity: 0.53; computer accessibility index: 1.9; senior support: 0.2
Centre C – nature: dgh; 24-hour dedicated PAU: yes; annual PAU attendance: 4780; junior doctors: 12; acute consultants: 7 (4); computers in PAU: 1; metropolitan area: no; distance to tertiary centre: 71 miles; clinical activity: 0.54; computer accessibility index: 1.85; senior support: 0.3
Centre D – nature: university; 24-hour dedicated PAU: yes; annual PAU attendance: 4800; junior doctors: 16; acute consultants: 11 (7); computers in PAU: 3; metropolitan area: no; distance to tertiary centre: 69 miles; clinical activity: 0.67; computer accessibility index: 4.5; senior support: 0.4
Abbreviations: dgh = district general hospital, PAU = paediatric assessment unit. * PAU open from 0800 to 0000 only.

Study participants

All junior doctors (Senior House Officers [interns] and Registrars [residents]) in substantive posts at each of the participating paediatric departments between December 2002 and April 2003 were enrolled after informed verbal consent. Consultants (attending physicians) and locum doctors were excluded.

Study patients

All children (age 0–16 years) presenting with an acute medical complaint, and assessed by a junior doctor in a designated Paediatric Assessment Area/Ambulatory Unit (PAU), were eligible for DDSS use. Outpatients, re-attendances for ward follow up, and day cases were ineligible. Based on subjects' feedback collected prior to the study start date, we made a pragmatic decision to allow junior doctors to selectively consult the DDSS only for patients in whom they experienced diagnostic uncertainty. This latter subset formed the actual study population.

Study design and power

Our study was a within-subject 'before and after' evaluation in which each study subject acted as their own control. Participants explicitly recorded their diagnostic workup and clinical plans (tests and treatment) for cases before seeking DDSS advice. Following real-time use, subjects either decided to act on system advice (by recording their revised diagnostic workup and clinical plans) or chose not to, thus ending the consultation. On the basis of a pilot study in an experimental setting, we calculated that the trial required data from 180 cases to detect a 33% reduction in clinically 'unsafe' diagnostic workups (80% power; type I error 5%).
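The pilot proportions behind this sample-size calculation are not reported. As a rough sketch of how a figure of this order can be reached, the snippet below treats the comparison as two independent proportions (a simplification of the paired design, which would need fewer cases) and assumes, purely for illustration, a baseline 'unsafe' rate of 45% falling by a third to 30%:

```python
# Illustrative sample-size calculation (NOT the authors' actual method):
# compare two independent proportions at 80% power, alpha = 0.05.
# The 45% -> 30% proportions are assumed for the example only
# (a one-third relative reduction, as in the paper's target).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_before, p_after = 0.45, 0.30          # assumed pilot proportions
effect = proportion_effectsize(p_before, p_after)
n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                           alpha=0.05, power=0.80)
# ~81 per group, i.e. the same order of magnitude as the 180 cases quoted.
print(f"~{n_per_group:.0f} cases per group")
```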
We defined diagnostic workups as being 'unsafe' if they deviated from a 'minimum gold standard' provided by an independent expert panel.

Intervention

Decision support system

A limited, secure (128-bit SSL encryption) and password-protected trial version was used. This differed from the publicly accessible DDSS – only the first 10 diagnostic suggestions were displayed, and reading material related to each diagnosis was not made available. The trial version was accessible only via a designated shortcut icon placed on each computer present at study commencement within the participating PAUs. This mechanism utilised cookies, and automatically facilitated identification of the centre at which the request originated, allowing subjects to access the system without an additional log-in procedure.

On the trial DDSS, data were captured in two consecutive screens (figures 1 and 2). On screen 1, subjects recorded the patient's clinical details in their own words, and their own diagnostic workup and clinical plans (pre-DDSS). Based on the clinical details submitted in screen 1, diagnostic suggestions were displayed on screen 2. Subjects had the choice to revise their original diagnostic workup and clinical plans by adding or deleting decisions at this stage. It was not possible to go back from screen 2 to screen 1. The consultation ended when clinical decisions from screen 2 (unmodified or revised) were finally submitted, or after >2 hours of inactivity.

[Figure 1: Screen 1 during DDSS usage during the study. This page is loaded following a successful log-in, which is initiated by clicking a designated icon on each study computer at each participating centre. Information collected automatically at this stage includes date and time of screen 1 display; patient identifiers; subject identifiers; clinical features of the patient; and initial diagnostic workup and tests as well as treatments.]

Training

Three separate group training sessions were organised by one investigator (PR) at each centre one month before the study start date, coinciding with weekly mandatory departmental teaching sessions. At each session, subjects used the trial DDSS with practice cases created for the study. Sessions were repeated twice during the study period to recruit and train new post-holders.

Outcome measures

The primary outcome measure was change in the proportion of 'unsafe' diagnostic workups following DDSS consultation. We defined 'unsafe' workups as instances in which subjects' diagnostic workup (pre- and post-DDSS consultation) deviated from a 'minimum gold standard' provided by an independent expert panel of clinicians. Inclusion of the correct discharge diagnosis in the diagnostic workup (pre- and post-Isabel consultation), quality scores for diagnostic workup and clinical action plans, time taken by subjects to complete system usage, number of diagnoses included in the diagnostic assessment pre- and post-DDSS, inappropriate tests and treatments ordered by subjects following system advice, and significant decisions deleted following consultation were examined as secondary outcome measures.
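For concreteness, here is a hypothetical sketch of the per-episode record implied by the two-screen capture above (field names are invented; the data actually logged by the trial website are listed in table 2):

```python
# Hypothetical sketch of a per-episode trial record (field names invented;
# see table 2 for the data the trial website actually collected).
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ConsultationEpisode:
    study_id: str                       # unique ID issued at log-in
    centre_code: str                    # identified via the shortcut icon
    subject_grade: str                  # e.g. "SHO" or "Registrar"
    screen1_time: datetime
    clinical_features: str              # free-text entry in the doctor's words
    pre_ddss_diagnoses: list[str] = field(default_factory=list)
    pre_ddss_tests: list[str] = field(default_factory=list)
    pre_ddss_treatments: list[str] = field(default_factory=list)
    ddss_suggestions: list[str] = field(default_factory=list)  # first 10 shown
    post_ddss_diagnoses: list[str] = field(default_factory=list)
    post_ddss_tests: list[str] = field(default_factory=list)
    post_ddss_treatments: list[str] = field(default_factory=list)
    completed: bool = False             # screen 2 submitted vs timed out
```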
Data collection

Clinical data were collected during the study period (December 2002 – April 2003) by two complementary methods: automatically from trial DDSS logs, and manually by a single research assistant (HF). Data collected by the trial website during system use are shown in table 2. At the end of each clinical consultation, the user indicated how useful the diagnostic advice provided had been for a) patient management and b) as an educational exercise.

[Figure 2: Screen 2, which is displayed following submission of the information entered on screen 1. The subject has the opportunity to revise their decisions, including diagnoses and tests and treatments. A brief survey attempts to capture the user's satisfaction with respect to educational value as well as clinical utility. Information submitted following this page completes the episode of data collection.]

This was recorded on a Likert scale from 0–5 (not useful to extremely useful). All subjects were sent two rounds of email and postal questionnaires each to collect feedback at trial completion.

The research assistant obtained study patients' medical records that matched the patient identifiers collected automatically from the trial website during system use. It was not possible to use partial or invalid entries to match medical records. Copies of available medical records were made such that information was available only up to the point of DDSS use. Copies were anonymised by masking patient and centre details. Diagnostic workup and clinical plans recorded by subjects on the trial website were verified for each case against entries in the medical records and in hospital laboratory systems. Discharge diagnoses were collected from routinely collected coding data for all study patients, and additionally from discharge summaries where available. Discharge diagnoses were validated by a consultant involved in study conduct at each centre. In addition, limited demographic and clinical details of all eligible patients at each study centre were collected from hospital administrative data.

A panel of four consultant paediatricians independently examined study medical records, in which subjects' clinical decisions were masked to ensure blinding.
In the first instance, each panel member provided a list of 'clinically significant' diagnoses, tests and treatments (the latter two were collectively termed 'clinical action plans') for each case that would ensure a safe clinical assessment. The absence of 'clinically significant' items in a subject's workup was explicitly defined during panel review to represent inappropriate clinical care. For this reason, the panel members did not include all plausible diagnoses for each case as part of their assessment, and instead focused on the minimum gold standard. Using this list as a template, the appropriateness of each decision suggested by subjects for each case was subsequently scored by the panel in blinded fashion using a previously validated scoring system [23]. This score rewarded decision plans for being comprehensive (sensitive) as well as focussed (specific). 25% of medical records were randomly assigned to all four panel members for review; a further 20% was assigned to one of the six possible pairs (i.e. slightly more than half the records were assessed by a single panel member). Clinically significant decisions (diagnoses, tests and treatments) were collated as a 'minimum gold standard' set for each case. For cases assessed by multiple panel members, significant decisions provided by a majority of assessors were used to form the gold standard set. Concordance between panel members for clinical decisions was moderate to good, as assessed by the intraclass correlation co-efficient for decisions examined by all four members (0.70 for diagnoses, 0.47 for tests and 0.57 for treatments).

Table 2: Study data automatically collected by the DDSS logs
Patient details: surname; date of birth; age group (neonate, infant, child or adolescent); sex
User details: centre code (based on identity of icon clicked); subject identity (including an option for anonymous); subject grade
Operational details: date and time of usage (log in, submission of each page of data); unique study ID assigned at log in
Clinical details: patient clinical features at assessment; doctor's differential diagnosis, investigation plan and management plan (pre-ISABEL); Isabel list of differential diagnoses; diagnoses selected from the Isabel list by the user as being relevant; doctor's differential diagnosis, investigation plan and management plan (post-ISABEL)
Survey details: satisfaction score for patient management; satisfaction score for educational use

Analysis

We analysed study data from two main perspectives: operational and clinical. For operational purposes, we defined each attempt by a subject to log into the DDSS by clicking on the icon as a 'DDSS attempt'. Each successful display of screen 1 was defined as a 'successful log in'; a unique study identifier was automatically generated by the trial website for each successful log in. Following log in, DDSS usage data were either 'complete' (data were available from screens 1 and 2) or 'incomplete' (data were available from screen 1 only, although screen 2 may have been displayed to the subject). Time spent by the user processing system advice was calculated as the difference between the time screen 2 was displayed and the end of the consultation (or session time out). In the first instance, we used McNemar's test for paired proportions to analyse the change in proportion of 'unsafe' diagnostic workups.
In order to account for the clustering effects resulting from the same subject assessing a number of cases, we also calculated a mean number of 'unsafe' diagnostic workups per case attempted for each subject. Change in this variable following DDSS consultation was analysed using two-way mixed-model analysis of variance (subject grade being the between-subjects factor and occasion the within-subjects factor). In order to exclude a re-thinking effect as an explanation for change in the primary outcome variable, all episodes in which there was a difference between the workup pre- and post-DDSS consultation were examined. If diagnoses that were changed by subjects were present in the Isabel suggestion list, it could be inferred that the DDSS was responsible for the change. A more objective marker of clinical benefit was assessed by examining whether the post-Isabel diagnostic workup (but not the pre-Isabel workup) included the discharge diagnosis. We analysed changes in pre- and post-DDSS diagnostic quality scores, as well as clinical action plan scores, using subjects as the unit of analysis. We tested for statistical significance using one way analysis of variance (grade was the between-subjects factor) to provide a sensitive measure of changes in diagnostic workup, and tests and treatments. The median test was used to examine differences between grades in system usage time. Diversity of suggestions displayed by the DDSS during the study, and therefore its dynamic nature, was assessed by calculating the number of unique diagnoses suggested by the system across all episodes of completed usage (i.e. if the diagnostic suggestions remained constant irrespective of case characteristics, this number would be 10). Statistical significance was set for all tests at a p value <0.05.

Subjects were expected to use the DDSS in only a subset of eligible patients. In order to fully understand the characteristics of patients in whom junior doctors experienced diagnostic difficulty and consulted the DDSS, we examined this group in more detail. We analysed available data on patient factors (age, discharge diagnosis, outcome of assessment and length of inpatient stay if admitted), user factors (grade of subject), and other details. These included the time of system usage (daytime: 0800–1800; out-of-hours: 1800–0800), centre of use and its nature (DGH vs university-affiliated hospital), an index of PAU activity (number of acute assessments per 60 min period the PAU was functional), computer accessibility index (number of available computers per unit PAU activity) and level of senior support (number of acute consultants per subject). We subsequently aimed to identify factors that favoured completion of DDSS usage using a multiple logistic regression analysis. Significant variables were identified by univariate analysis and entered in forward step-wise fashion into the regression model. Characteristics of patients where subjects derived clinical benefit from DDSS usage were also analysed in similar fashion. We correlated subjects' own perception of system benefit (Likert-style response from the user survey) with actual benefit (improvement in diagnostic quality score) using the Pearson test. Qualitative analysis of feedback from subjects provided at the end of the study period was performed to provide insights into system design and user interface.
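To make the primary analysis concrete, the sketch below runs McNemar's test on a paired 2×2 table built from the counts reported in the Results (47/104 'unsafe' workups pre-consultation, 34/104 post), assuming, as the Results imply, that no workup became newly 'unsafe' after consultation:

```python
# McNemar's test for paired proportions, as used for the primary outcome.
# 2x2 table of (pre, post) 'unsafe' status per case; counts taken from the
# Results, assuming no workup became newly 'unsafe' after consultation.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

#                 post safe   post unsafe
table = np.array([[57,         0],    # pre safe   (104 - 47 = 57)
                  [13,        34]])   # pre unsafe (13 resolved, 34 remained)

result = mcnemar(table, exact=True)   # exact binomial test on discordant pairs
print(f"p = {result.pvalue:.5f}")     # ~0.0002, consistent with p < 0.001
```

The test uses only the discordant pairs (here 13 vs 0), which is why a change in a modest number of cases can still be highly significant.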
Results

During the 5-month study period, 8995 children were assessed in the 4 PAUs; 76.7% (6903/8995 children) presented with medical problems and were eligible for inclusion in the study. Subjects attempted to seek diagnostic advice on 595 separate occasions across all centres (8.6%). Subjects successfully logged in only in 226 episodes. Data were available for analysis in 177 cases (complete data in 125 cases; screen 1 data only in an additional 52 cases). The summary of flow of patients and data through the study is illustrated in figure 3. [Figure 3: Flow diagram of patients and data through the study.] Centre-wise distribution of attrition in DDSS usage is demonstrated in table 3. Medical records were available for 104 patients, and were examined by the panel according to the allocation scheme shown in figure 4.

The largest subset of patients in whom the DDSS was consulted belonged to the 1–6 year age group (61/177, 34.5%). The mean age of patients was 5.1 years (median 3.3 years). The DDSS was most frequently used by SHOs (126/177, 71.2%). Although 25% of all eligible patients were seen on PAUs outside the hours of 0800–1800, more than a third of episodes of system usage fell outside these hours (64/177, 36.2%). Subjects at Centre D used the DDSS most frequently (59/177, 33.3%). In general, usage was greater at university-affiliated hospitals than at DGHs (106/177, 59.9%).

Discharge diagnosis was available in 77 patients. The commonest diagnosis was non-specific viral infection; however, a wide spread of diagnoses was seen in the patient population. 58 patients on whom the DDSS was consulted were admitted to hospital wards; the rest were discharged home following initial assessment. Inpatients stayed in hospital for an average of 3.7 days (range: 0.7–26 days). Clinical characteristics of patients on whom Isabel was consulted are summarised in table 4. A list of discharge diagnoses in table 5 indicates the diverse case mix represented in the study population.

80 subjects enrolled during the study. 63/80 used the system to record some patient data (mean 2, range: 1–12); 56/80 provided complete patient information and their revised decisions post-DDSS consultation (mean 2, range: 1–6). Due to limitations in the trial website design, it was unclear how many of the subjects who did not use the system during the study period (17/80) had attempted to access the DDSS and failed. It was evident that a small number of subjects used the DDSS on multiple (>9) occasions but did not provide their revised clinical decisions, leading to incomplete system use. System usage data with respect to subjects is shown in figure 5. [Figure 5: DDSS usage data shown as distribution of number of subjects by episodes of system use.]

'Unsafe' diagnostic workups

104 cases in which medical records were available were analysed. Before DDSS consultation, 'unsafe' diagnostic workups occurred in 47/104 cases (45.2%); they constituted episodes in which all clinically significant diagnoses were not considered by subjects during initial decision making.
Table 3: Centre-wise attrition of DDSS usage and study data
Patients seen in PAU – A: 2679; B: 1905; C: 1974; D: 2437; total: 8995
Medical patients seen in PAU – A: 2201; B: 1383; C: 1405; D: 1914; total: 6903
Number eligible for diagnostic decision support – unknown at all centres
DDSS attempts – A: 338; B: 118; C: 52; D: 87; total: 595
DDSS successful log in – unknown by centre; total: 226*
Step 1 completed† – A: 47; B: 26; C: 45; D: 59; total: 177
Steps 1 & 2 completed – A: 30; B: 25; C: 20; D: 50; total: 125
Medical records available – A: 24; B: 24; C: 16; D: 40; total: 104
* Each successful log in was automatically provided a unique study identifier which was not centre-specific. The number of successful log-ins was thus calculated as the total number of study identifiers issued by the trial website.
† Step 1 completed indicates that following successful log-in, the subject entered data on the first screen, i.e. patient details.

Overall, the proportion of 'unsafe' workups reduced to 32.7% (34/104 cases) following DDSS consultation, an absolute reduction of 12.5% (McNemar test p value <0.001, table 6). In a further 5 cases, appropriate diagnoses that were missing in subjects' workup did form part of Isabel suggestions, but were ignored by subjects during their review of DDSS advice. In 11/13 cases in which 'unsafe' diagnostic workups were eliminated, the DDSS was used by SHOs. Mean number of 'unsafe' workups per case reduced from 0.49 to 0.32 post-DDSS consultation among subjects (p < 0.001); a significant interaction was demonstrated with grade (p < 0.01). Examination of the Isabel suggestion list for cases in which there was a difference between pre- and post-DDSS workup showed that in all cases, the additional diagnoses recorded by subjects formed part of the system's advice. Similar results were obtained for clinical plans, but smaller reductions were observed. In 3/104 cases, the final diagnosis for the patient was present in the post-DDSS list but not in the pre-DDSS list, indicating that diagnostic errors of omission were averted following Isabel advice.

[Figure 4: Allocation of medical records for expert panel assessment consisting of four raters. All panel members rated 25% of cases and two raters (six possible pairs) rated an additional 20% of cases.]

Table 4: Characteristics of patients in whom Isabel was consulted; values are number of DDSS consultation episodes (completed episodes in parentheses)
Patient factors – Age (n = 177): neonate 19; infant 33; young child (1–6 yrs) 61; older child (6–12 yrs) 38; adolescent 26
Primary diagnostic group (n = 77): respiratory 9; cardiac 0; neurological 6; surgical 3; rheumatology 5; infections 31; haematology 3; other 20
Outcome (n = 104): IP admission 58; discharge 46
User factors – Grade (n = 177): SHO 126 (79); Registrar 51 (46)
Operational factors (n = 177) – Time of use: in hours (0800–1800) 113 (84); out of hours (1800–0800) 64 (41). Centre: A 47 (30); B 26 (25); C 45 (20); D 59 (50)

Diagnostic quality scores

104 cases assessed by 51 subjects in whom medical records were available were analysed. Mean diagnostic quality score across all subjects increased by 6.86 (95% CI 4.0–9.7) after DDSS consultation.
The analysis of variance model indicated that there was no significant effect of grade on this improvement (p = 0.15). Similar changes in clinical plan scores were smaller in magnitude (table 7).

Usage time

Reliable time data were available in 122 episodes. Median time spent on system advice was 1 min 38 sec (IQR 50 sec – 3 min 21 sec). There was no significant difference between grades with respect to time spent on screen 2 (median test, p = 0.9). This included the time taken to process DDSS diagnostic suggestions, record changes to the original diagnostic workup and clinical plans, and to complete the user satisfaction survey.

Impact on clinical decision making

Pre-DDSS, a mean of 2.2 diagnoses were included in subjects' workup; this rose to 3.2 post-DDSS. Similarly, the number of tests ordered also rose from 2.7 to 2.9; there was no change in the number of treatment steps. Despite these increases, no deleterious tests or treatment steps were added by subjects to their plans following DDSS consultation. In addition, no clinically significant diagnoses were deleted from their original workup after Isabel advice.

Table 5: Discharge diagnoses in children in whom the diagnostic aid was consulted (number of patients)
Viral infection 8; acute lymphadenitis 3; viral gastroenteritis 3; pelvic region and thigh infection 3; epilepsy 3; acute lower respiratory infection 3; allergic purpura 2; acute inflammation of orbit 2; chickenpox with cellulitis 2; gastroenteritis 2; rotaviral enteritis 2; feeding problem of newborn 2; syncope and collapse 2; lobar pneumonia 2; Kawasaki disease 2; abdominal pain 2; 1 each: angioneurotic oedema, erythema multiforme, constipation, irritable bladder and bowel syndrome, coagulation defect, G6PD deficiency, sickle cell dactylitis, cellulitis, clavicle osteomyelitis, kerion, labyrinthitis, meningitis, myositis, purpura, scarlet fever, staphylococcal scalded skin syndrome, mitochondrial complex 1 deficiency, adverse drug effect, eye disorder, musculoskeletal back pain, trauma to eye, disorders of bilirubin metabolism, foetal alcohol syndrome, neonatal erythema toxicum, physiological jaundice, stroke, acute bronchiolitis, acute upper respiratory infection, asthma, hyperventilation, juvenile arthritis with systemic onset, polyarthritis, reactive arthropathy, anorectal anomaly

Using forward step-wise regression analysis, grade of subject (registrar), time of system usage (in-hours), centre identity, senior support and computer accessibility index were identified as independent factors predicting completion of DDSS usage. Patients in whom actual benefit was demonstrated on diagnostic decision making were more likely to stay longer in hospital.

469 unique diagnostic suggestions were generated by the DDSS during its use on 125 cases. This represented a high degree of diversity of responses appropriate for the diverse case mix seen in this study – a static list would have consisted of the same 10 diagnoses, and a unique set of suggestions for each single episode of use would have generated 1250 distinct suggestions.

User perception

Data were available in 125 cases in which subjects completed DDSS usage. Mean satisfaction score for patient management was 1.6 (95% CI 1.4–1.96); for Isabel use as an educational adjunct, this was higher (2.4, 95% CI 1.98–2.82).
There was moderate correlation between subjects' perception of DDSS usefulness in patient management and actual increment in diagnostic quality score (r = 0.28, p = 0.0038, figure 6). [Figure 6: Correlation between user perception of system utility and change in diagnostic quality score.] Feedback from questionnaires indicated that many subjects found the trial website cumbersome to use in real time, since it forced them to record all their decisions prior to advice, thus taking up time during patient assessment. This was especially problematic since many subjects had used Isabel in its original form. A number of subjects were dissatisfied with computer access during the trial; their complaints related to unavailability of passwords to access the Internet, slow computer connections, unavailability of adequate workstations at the point of clinical use and lack of infrastructure support. Another theme that emerged from user feedback involved the lack of access to reading material on diagnoses during the trial period – most users felt this was an important part of the system and the advice provided.

Discussion

This study demonstrates that diagnostic uncertainty occurs frequently in clinical practice, and that it is feasible for a DDSS, unintegrated into an EMR, to improve the process of diagnostic assessment when used by clinicians in real life practice. We have also shown that this improvement prevented a small but significant number of diagnostic errors of omission. A number of barriers to computer and Internet access in the clinical setting prevented system use in a significant proportion of eligible patients in whom subjects sought diagnostic assistance.

The DDSS studied provided advice in the field of diagnosis, an area in which computerised systems have rarely been shown to be effective. In an early clinical study, Wexler et al showed that consultation of MEDITEL-PEDS, a DDSS for paediatric practice, resulted in a decrease in the number of incorrect diagnoses made by residents [24]. However, subjects did not interact with the DDSS themselves; advice generated by the system was provided to
This design allowed us to explore the complex interplay between user-DDSS interaction, user decisions in the face of diagnostic advice, and barriers to usage. The DDSS selected was already being used frequently in practice; a number of previous system evaluations have been confounded by inadequate usage. The clinical performance of the DDSS studied has also been previously validated [29]. A preliminary assessment of Isabel impact on subjects' diagnostic decisions has already been made in a simulated environment, results of which closely mirror our current findings [30]. Although the nature and frequency of clinicians' information needs have been previously described, we were able to estimate the need for diagnostic decision support, and characterise the subgroup of patients in whom junior clinicians sought diagnostic advice. Since diagnostic uncertainty only occurs in a subset of acutely ill patients, similar interventions in the future will need to be targeted, rather than being universally applied. However, this has to be balanced against our finding that there was poor correlation between subjects' own perception of system utility and actual clinical benefit, which suggests that a universal approach to usage may be more beneficial. This phenomenon has been previously described [31]. We have also identified that junior doctors, such as SHOs, are more Table 7: Changes in mean quality scores for diagnostic workup and clinical action plans Diagnostic quality score change (SD) Clinical action plan score change (SD) SHO Registrar Overall 8.3 (11.6) 1.4 (6.3) 3.8 (6.1) 1.7 (7.4) 6.9 (10.3) 1.5 (6.7) Page 12 of 16 (page number not for citation purposes) BMC Medical Informatics and Decision Making 2006, 6:37 http://www.biomedcentral.com/1472-6947/6/37 Table 6: Reduction in unsafe diagnostic workups following DDSS consultation (n = 104) Unsafe diagnostic workup SHO Registrar Pre-DDSS consultation Post-DDSS consultation Relative Reduction (%) 28 19 17 17 39.3 10.5 likely to use and benefit from DDSS, including in an educational role. Cognitive biases, of which 'premature closure' and faulty context generation are key examples, contribute significantly to diagnostic errors of omission [32], and it is likely that in combination with cognitive forcing strategies adopted during decision making, DDSS may act as 'safety nets' for junior clinicians in practice [33]. Fundamental deviation in function and interface design from other expert systems may have contributed to the observed DDSS impact on decision-making in this study. The provision of reminders has proved highly effective in improving the process of care in other settings [34]. Rapid access to relevant and valid advice is crucial in ensuring usability in busy settings prone to errors of omission – average DDSS consultation time during this study was <2 minutes. It also appears that system adoption is possible during clinical assessment in real time with current computer infrastructure, providing an opportunity for reduction in diagnostic error. EMR integration would allow further control on the quality of the clinical input data as well as provision of active decision support with minimum extra effort; such an interface has currently been developed for Isabel and tested with four commercial EMRs [35]. Such integration facilitates iterative use of the Figure 6 between user perception of system utility and change in diagnostic quality score Correlation Correlation between user perception of system utility and change in diagnostic quality score. 
system during the evolution of a patient's condition, leading to increasingly specific diagnostic advice.

A number of other observations are worthy of note: despite an increase in the number of diagnoses considered, no inappropriate tests were triggered by the advice provided; the quality of data input differed widely between users; the system dynamically generated a diverse set of suggestions based on case characteristics; the interpretation of DDSS advice itself was user-dependent, leading to variable individual benefit; and finally, on some occasions even useful advice was rejected by users. User variability in data input cannot be solely attributed to the natural language data entry process; considerable user variation in data entry has been demonstrated even in DDSS that employ controlled vocabularies for input [36]. Further potential benefit from system usage was compromised in this study for many reasons: unavailability of computers, poor Internet access, and slow network connections frequently prevented subjects from accessing the DDSS. Paradoxically, the need to enter detailed information including subjects' own clinical decisions into the trial website (not required during real life usage) may itself have compromised system usage during the study, limiting the extent to which usage data from the study can be extrapolated to real life.

This study had a number of limitations. Our study was compromised by the lack of detailed qualitative data to fully explore issues related to why users sometimes ignored DDSS advice, or specific cases in which users found the DDSS useful. The comparison of the system against a panel gold standard had its own drawbacks – Isabel was provided a variable amount of patient detail depending on the subject who used it, while the panel were provided detailed clinical information from medical notes. Changes in decision making were also assessed at one fixed point during the clinical assessment, preventing us from examining the impact of iterative use of the DDSS with evolving and sometimes rapidly changing clinical information. Due to the before-after design, it could also be argued that any improvement observed resulted purely from subjects rethinking the case; since all appropriate diagnoses included after system consultation were present in the DDSS advice, this seems unlikely. Subjects also spent negligible time between their initial assessment of cases and processing the system's diagnostic suggestions. Our choice of primary outcome focused on improvements in process, although we were also able to demonstrate a small but significant prevention of diagnostic error based on the discharge diagnosis. The link between improvements in diagnostic process and patient outcome may be difficult to illustrate, although the model developed by Schiff et al suggests that avoiding process errors will prevent actual errors in some instances, as we have demonstrated in this study [37]. However, in our study design, it was not possible to test whether an 'unsafe' diagnostic workup would directly lead to patient harm. Finally, due to barriers associated with computer access and usage, we were not able to reach the target number of cases on whom complete medical data were available.
Conclusion

This clinical study demonstrates that it is possible for a stand-alone diagnostic system based on the reminder model to be used in routine practice to improve the process of diagnostic decision making among junior clinicians. Elimination of barriers to computer access is essential to fulfil the significant need for diagnostic assistance demonstrated in this study.

Competing interests

This study was conducted when the Isabel system was available free to users, funded by the Isabel Medical Charity. Dr Ramnarayan performed this research as part of his MD thesis at Imperial College London. Dr Britto was a Trustee of the medical charity in a non-remunerative post. Ms Fisher was employed by the Isabel Medical Charity as a Research Nurse for this study. Since June 2004, Isabel has been managed by a commercial subsidiary of the Medical Charity called Isabel Healthcare. The system is now available only to subscribed users. Dr Ramnarayan now advises Isabel Healthcare on research activities on a part-time basis; Dr Britto is now Clinical Director of Isabel Healthcare. Both hold restricted shares in Isabel Healthcare. All other authors declare that they have no competing interests.

Authors' contributions

PR conceived the study, contributed to the study design, analyzed the data and drafted the manuscript. BJ assisted with data collection, validation of discharge diagnoses, and data analysis. MC assisted with the study design, served as gold standard panel member, and revised the draft manuscript. VN assisted with study design, served as gold standard panel member, and helped with data analysis. RB assisted with the study design, served as gold standard panel member, and revised the draft manuscript. AW assisted with the study design, served as gold standard panel member, and revised the draft manuscript. HF assisted with data collection and analysis. PT assisted with study conception, study design and revised the draft manuscript. JW assisted with study conception, provided advice regarding study design, and revised the draft manuscript. JB assisted with study conception, study design and data analysis. All authors read and approved the final manuscript.

Acknowledgements

A number of clinicians and ward administrators were involved in assisting in study conduct at each participating centre. The authors would like to thank Nandu Thalange, Michael Bamford, and Elmo Thambapillai for their help in this regard. Statistical assistance was provided by Henry Potts.

Financial support: This study was supported by a research grant from the National Health Service (NHS) Research & Development Unit, London. The sponsor did not influence the study design; the collection, analysis, and interpretation of data; the writing of the manuscript; or the decision to submit the manuscript for publication.

References

1. Sandars J, Esmail A: The frequency and nature of medical error in primary care: understanding the diversity across studies. Fam Pract 2003, 20:231-6.
2. Leape L, Brennan TA, Laird N, Lawthers AG, Localio AR, Barnes BA, Hebert L, Newhouse JP, Weiler PC, Hiatt H: The nature of adverse events in hospitalized patients. Results of the Harvard Medical Practice Study II. N Engl J Med 1991, 324:377-84.
3. Davis P, Lay-Yee R, Briant R, Ali W, Scott A, Schug S: Adverse events in New Zealand public hospitals II: preventability and clinical context. N Z Med J 2003, 116(1183):U624.
4. Bartlett EE: Physicians' cognitive errors and their liability consequences. J Healthcare Risk Manage 1998, Fall:62-9.
5. Croskerry P: The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med 2003, 78(8):775-80.
6. Graber M, Franklin N, Gordon R: Diagnostic error in internal medicine. Arch Intern Med 2005, 165(13):1493-9.
7. Burroughs TE, Waterman AD, Gallagher TH, Waterman B, Adams D, Jeffe DB, Dunagan WC, Garbutt J, Cohen MM, Cira J, Inguanzo J, Fraser VJ: Patient concerns about medical errors in emergency departments. Acad Emerg Med 2005, 12(1):57-64.
8. Weingart SN, Wilson RM, Gibberd RW, Harrison B: Epidemiology of medical error. BMJ 2000, 320(7237):774-7.
9. Rothschild JM, Bates DW, Leape LL: Preventable medical injuries in older patients. Arch Intern Med 2000, 160(18):2717-28.
10. Graber M, Gordon R, Franklin N: Reducing diagnostic errors in medicine: what's the goal? Acad Med 2002, 77(10):981-92.
11. Garg AX, Adhikari NK, McDonald H, Rosas-Arellano MP, Devereaux PJ, Beyene J, Sam J, Haynes RB: Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: a systematic review. JAMA 2005, 293(10):1223-38.
12. Miller R, Masarie FE, Myers JD: Quick medical reference (QMR) for diagnostic assistance. MD Comput 1986, 3(5):34-48.
13. Warner HR Jr: Iliad: moving medical decision-making into new frontiers. Methods Inf Med 1989, 28(4):370-2.
14. Barness LA, Tunnessen WW Jr, Worley WE, Simmons TL, Ringe TB Jr: Computer-assisted diagnosis in pediatrics. Am J Dis Child 1974, 127(6):852-8.
15. Graber MA, VanScoy D: How well does decision support software perform in the emergency department? Emerg Med J 2003, 20(5):426-8.
16. Welford CR: A comprehensive computerized patient record with automated linkage to QMR. Proc Annu Symp Comput Appl Med Care 1994:814-8.
17. Elhanan G, Socratous SA, Cimino JJ: Integrating DXplain into a clinical information system using the World Wide Web. Proc AMIA Annu Fall Symp 1996:348-52.
18. Greenough A: Help from ISABEL for pediatric diagnoses. Lancet 2002, 360(9341):1259.
19. Ramnarayan P, Tomlinson A, Kulkarni G, Rao A, Britto J: A novel diagnostic aid (ISABEL): development and preliminary evaluation of clinical performance. Medinfo 2004:1091-5.
20. Nelson SJ, Blois MS, Tuttle MS, Erlbaum M, Harrison P, Kim H, Winkelmann B, Yamashita D: Evaluating RECONSIDER. A computer program for diagnostic prompting. J Med Syst 1985, 9(5-6):379-88.
21. Eccles M, McColl E, Steen N, Rousseau N, Grimshaw J, Parkin D, Purves I: Effect of computerised evidence based guidelines on management of asthma and angina in adults in primary care: cluster randomised controlled trial. BMJ 2002, 325(7370):941.
22. Briggs JS, Fitch CJ: The ISABEL user survey. Med Inform Internet Med 2005, 30(2):167-72.
23. Ramnarayan P, Kapoor RR, Coren M, Nanduri V, Tomlinson AL, Taylor PM, Wyatt JC, Britto JF: Measuring the impact of diagnostic decision support on the quality of clinical decision-making: development of a reliable and valid composite score. J Am Med Inform Assoc 2003, 10(6):563-72.
24. Wexler JR, Swender PT, Tunnessen WW Jr, Oski FA: Impact of a system of computer-assisted diagnosis. Initial evaluation of the hospitalized patient. Am J Dis Child 1975, 129(2):203-5.
WHAT IS THE IMPACT ON CHANGE OF DIAGNOSIS DECISION MAKING BEFORE AND AFTER USING A DIAGNOSIS DECISION SUPPORT SYSTEM?
Ramnarayan P, Roberts GC, Coren M et al. Assessment of the potential impact of a reminder system on the reduction of diagnostic errors: a quasi-experimental study. BMC Med Inform Decis Mak. 2006 Apr 28;6:22
BMC Medical Informatics and Decision Making
Research article (Open Access)

Assessment of the potential impact of a reminder system on the reduction of diagnostic errors: a quasi-experimental study

Padmanabhan Ramnarayan (1)(*), Graham C Roberts (2), Michael Coren (3), Vasantha Nanduri (4), Amanda Tomlinson (5), Paul M Taylor (6), Jeremy C Wyatt (7) and Joseph F Britto (5)

Addresses: (1) Children's Acute Transport Service (CATS), 44B Bedford Row, London, WC1H 4LL, UK; (2) Department of Paediatric Allergy and Respiratory Medicine, Southampton University Hospital Trust, Tremona Road, Southampton, SO16 6YD, UK; (3) Department of Paediatrics, St Mary's Hospital, Paddington, London, W2 1NY, UK; (4) Department of Paediatrics, Watford General Hospital, Vicarage Road, Watford, WD18 0HB, UK; (5) Isabel Healthcare Ltd, PO Box 244, Haslemere, Surrey, GU27 1WU, UK; (6) Centre for Health Informatics and Multiprofessional Education (CHIME), Archway Campus, Highgate Hill, London, N19 5LW, UK; (7) Health Informatics Centre, The Mackenzie Building, University of Dundee, Dundee, DD2 4BF, UK

Email: Padmanabhan Ramnarayan (*) - [email protected]; Graham C Roberts - [email protected]; Michael Coren - [email protected]; Vasantha Nanduri - [email protected]; Amanda Tomlinson - [email protected]; Paul M Taylor - [email protected]; Jeremy C Wyatt - [email protected]; Joseph F Britto - [email protected]
(*) Corresponding author

Published: 28 April 2006 (received: 25 September 2005; accepted: 28 April 2006)
BMC Medical Informatics and Decision Making 2006, 6:22; doi:10.1186/1472-6947-6-22
This article is available from: http://www.biomedcentral.com/1472-6947/6/22
© 2006 Ramnarayan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Computerized decision support systems (DSS) have mainly focused on improving clinicians' diagnostic accuracy in unusual and challenging cases. However, since diagnostic omission errors may predominantly result from incomplete workup in routine clinical practice, the provision of appropriate patient- and context-specific reminders may result in greater impact on patient safety. In this experimental study, a mix of easy and difficult simulated cases was used to assess the impact of a novel diagnostic reminder system (ISABEL) on the quality of clinical decisions made by various grades of clinicians during acute assessment.

Methods: Subjects of different grades (consultants, registrars, senior house officers and medical students) assessed a balanced set of 24 simulated cases on a trial website. Subjects recorded their clinical decisions for the cases (differential diagnosis, test-ordering and treatment) before and after system consultation. A panel of two pediatric consultants independently provided gold standard responses for each case, against which subjects' quality of decisions was measured. The primary outcome measure was the change in the count of diagnostic errors of omission (DEO). A more sensitive assessment of the system's impact was achieved using specific quality scores; the additional consultation time resulting from DSS use was also calculated.
Results: 76 subjects (18 consultants, 24 registrars, 19 senior house officers and 15 students) completed a total of 751 case episodes. The mean count of DEO fell from 5.5 to 5.0 across all subjects (repeated measures ANOVA, p < 0.001); no significant interaction was seen with subject grade. Mean diagnostic quality score increased after system consultation (0.044; 95% confidence interval 0.032, 0.054). ISABEL reminded subjects to consider at least one clinically important diagnosis in 1 in 8 case episodes, and prompted them to order an important test in 1 in 10 case episodes. Median extra time taken for DSS consultation was 1 min (IQR: 30 sec to 2 min).

Conclusion: The provision of patient- and context-specific reminders has the potential to reduce diagnostic omissions across all subject grades for a range of cases. This study suggests a promising role for the use of future reminder-based DSS in the reduction of diagnostic error.

Background
A recent Institute of Medicine report has brought the problem of medical error under intense scrutiny [1]. While the use of computerized prescription software has been shown to substantially reduce the incidence of medication-related error [2,3], few solutions have demonstrated a similar impact on diagnostic error. Diagnostic errors impose a significant burden on modern healthcare: they account for a large proportion of medical adverse events in general [4-6], and form the second leading cause of malpractice suits against hospitals [7]. In particular, diagnostic errors of omission (DEO) during acute medical assessment, resulting from cognitive biases such as 'premature closure' and 'confirmation bias', lead to incomplete diagnostic workup and 'missed diagnoses' [8]. This is especially relevant in settings such as family practice [9], as well as hospital areas such as the emergency room and critical care [11,12]: in a recent survey, 20% of patients discharged from emergency rooms raised concerns that their clinical assessment had been complicated by diagnostic error [12].

The use of clinical decision-support systems (DSS) has been one of many strategies proposed for the reduction of diagnostic errors in practice [13]. Consequently, a number of DSS have been developed over the past few years to assist clinicians during the process of medical diagnosis [14-16]. Even though studies of several diagnostic DSS have demonstrated improved physician performance in simulated (and rarely real) patient encounters [17,18], two specific characteristics may have contributed to their infrequent use in routine practice: intended purpose and design. Many general diagnostic DSS were built as 'expert systems' to solve diagnostic conundrums and provide the correct diagnosis during a 'clinical dead-end' [19]. Since true diagnostic dilemmas are rare in practice [20], and the initiative for DSS use had to originate from the physician, diagnostic advice was not sought routinely, particularly since clinicians prefer to store the patterns needed to solve medical problems in their heads [21]. There is, however, evidence that clinicians frequently underestimate their need for diagnostic assistance, and that the perception of diagnostic difficulty does not correlate with their clinical performance [22]. In addition, due to the demands of the information era [23],
diagnostic errors may not be restricted to cases perceived as being difficult, and might occur even when dealing with common problems in a stressful environment under time pressure [24]. Further, most 'expert systems' utilized a design in which clinical data entry was achieved through a controlled vocabulary specific to each DSS. This process frequently took > 15 minutes, contributing to infrequent use in a busy clinical environment [25]. These 'expert systems' also provided between 20 and 30 diagnostic possibilities [26], with detailed explanations, leading to a lengthy DSS consultation process.

In order to significantly affect the occurrence of diagnostic error, it seems reasonable to conclude that DSS advice must be readily available, and sought, during most clinical encounters, even if the perceived need for diagnostic assistance is minor. Ideally, real-time advice for diagnosis can be actively provided by integrating a diagnostic DSS into an existing electronic medical record (EMR), as has been attempted in the past [27,28]. However, the limited uptake of EMRs capable of recording sufficient narrative clinical detail indicates that a stand-alone system may prove much more practical in the medium term [29]. The key characteristic of a successful system would be the ability to deliver reliable diagnostic reminders rapidly, following a brief data entry process, in most clinical situations.

ISABEL (ISABEL Healthcare, UK) is a novel Web-based pediatric diagnostic reminder system that suggests important diagnoses during clinical assessment [30,31]. The development of the system and its underlying structure have been described in detail previously [32,33]. The main hypotheses underlying the development of ISABEL were that the provision of diagnostic reminders generated following a brief data entry session in free text would promote user uptake, and lead to improvement in the quality of diagnostic decision making in acute medical settings. The reminders provided (a set of 10 in the first instance) aimed to remind clinicians of important diagnoses that they might have missed in the workup. Data entry is by means of natural language descriptions of the patient's clinical features, including any combination of symptoms, signs and test results. The system's knowledge base consists of natural language text descriptions of > 5000 diseases, in contrast to most 'expert systems', which use complex disease databases [34-36]. The advantages and trade-offs of these differences in system design have been discussed in detail elsewhere [37]. In summary, although the ability to rapidly enter patient features in natural language to derive a short-list of diagnostic suggestions may allow frequent use by clinicians during most patient encounters, variability resulting from the use of natural language for data entry, and the absence of probability ranking, may compromise the accuracy and usefulness of the diagnostic suggestions.

The overall evaluation of the ISABEL system was planned in systematic fashion as a series of consecutive studies [38]:

a) An initial clinical performance evaluation: this would evaluate the feasibility of providing relevant diagnostic suggestions for a range of cases when data is entered in natural language. System accuracy, speed and relevance of suggestions were studied.
b) An assessment of the impact of the system in a quasi-experimental setting: this would examine the effects of diagnostic decision support on subjects using simulated cases.

c) An assessment of the impact of the system in a real-life setting: this would examine the effects of diagnostic advice on clinicians managing real patients in their natural environment.

In the initial performance evaluation, the ISABEL system formed the unit of intervention, and the quality of its diagnostic suggestions was validated against data drawn from 99 hypothetical cases and 100 real patients. Key findings from cases were entered into the system in free text by one of the developers. The system included the final diagnosis in 95% of the cases [39]. This design was similar to early evaluations of a number of other individual diagnostic DSS [40-42], as well as a large study assessing the performance characteristics of four expert diagnostic systems [43]. Since this step studied ISABEL in isolation, and did not include users uninvolved in the development of the system, it was vital to examine DSS impact on decision making by demonstrating in the subsequent step that the clinician-DSS combination functioned better than either the clinician or the system working in isolation [44,45]. Evaluation of impact is especially relevant to ISABEL: despite good system performance when tested in isolation, clinicians may not benefit from its advice, either due to variability associated with user data entry leading to poor results, or due to an inability to distinguish between diagnostic suggestions given the lack of ranking [46]. A previous evaluation of Quick Medical Reference (QMR) assessed a group of clinicians working a set of difficult cases, and suggested that the extent of benefit gained by different users varied with their level of experience [47].

In this study, we aimed to perform an impact evaluation of ISABEL in a quasi-experimental setting in order to quantify the effects of diagnostic advice on the quality of clinical decisions made by various grades of clinicians during acute assessment, using a mix of easy and difficult simulated cases drawn from all pediatric sub-specialties. Study design was based on an earlier evaluation of the impact of ILIAD and QMR on diagnostic reasoning in a simulated environment [48]. Our key outcome measure focused on the appropriateness of decisions during diagnostic workup rather than accuracy in identifying the correct diagnosis. The validity of textual case simulations has previously been demonstrated in medical education exercises [49], and during the assessment of mock clinical decision making [50,51].

Methods
The simulated field study involved recording subjects' clinical decisions regarding diagnoses, test-ordering and treatment for a set of simulated cases, both before and immediately after DSS consultation. The impact of diagnostic reminders was determined by measuring changes in the quality of decisions made by subjects. In this study, the quality of ISABEL's diagnostic suggestion list per se was not examined. The study was coordinated at Imperial College School of Medicine, St Mary's Hospital, London, UK between February and August 2002. The study was approved by the Local Research Ethics Committee.
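None of the papers in this collection discloses ISABEL's actual matching algorithm; it is described only behaviourally (free-text findings in, a shortlist of up to ten reminders out, backed by natural-language descriptions of > 5000 diseases). Purely as an illustration of that reminder model, the sketch below ranks a toy knowledge base by TF-IDF cosine similarity; the disease entries, the remind helper and the cut-off are all hypothetical, not ISABEL's implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base of free-text disease descriptions (illustrative only).
knowledge_base = {
    "bronchiolitis": "infant cough wheeze respiratory distress poor feeding winter",
    "heart failure": "infant poor feeding tachypnoea hepatomegaly murmur failure to thrive",
    "meningitis": "fever irritability lethargy vomiting bulging fontanelle",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(knowledge_base.values())

def remind(findings: str, top_n: int = 10) -> list[str]:
    # Return up to top_n diseases whose descriptions best match the findings.
    scores = cosine_similarity(vectorizer.transform([findings]), doc_matrix)[0]
    ranked = sorted(zip(knowledge_base, scores), key=lambda kv: -kv[1])
    return [disease for disease, score in ranked[:top_n] if score > 0]

print(remind("poor feeding, tachypnoea, murmur, hepatomegaly"))

The essential design point this mimics is that unstructured input is matched against unstructured disease text, which is why entry takes seconds rather than the > 15 minutes of controlled-vocabulary 'expert systems'.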
Subjects
A convenience sample consisting of pediatricians of different grades (senior house officers [interns], registrars [residents] and consultants [attending physicians] from different geographical locations across the UK), and final-year medical students, was enrolled for the study. All students were drawn from one medical school (Imperial College School of Medicine, London, UK). Clinicians were recruited by invitation from the ISABEL registered user database, which consisted of a mixture of regular users as well as pediatricians who had never used the system after registration. After a short explanation of the study procedure, all subjects who consented to the study were included within the sample.

Cases
Cases were drawn from a pool of 72 textual case simulations, constructed by one investigator, based on case histories of real children presenting to emergency departments (data collected during an earlier evaluation). Each case was limited to between 150 and 200 words, and only described the initial presenting symptoms, clinical signs and basic laboratory test results, in separate sections. Since the clinical data were collected from pediatric emergency rooms, the amount of clinical information available at assessment was limited but typical for this setting. Ample negative features were included in order to prevent the reader from picking up positive cues from the text. These cases were then classified by the author into one of 12 different pediatric sub-specialties (e.g. cardiology, respiratory) and one of 3 case difficulty levels within each specialty (level 1: unusual; level 2: not unusual; level 3: common clinical presentation, with reference to UK general hospital pediatric practice). This allocation process was duplicated by a pediatric consultant working independently. Both investigators assigned 57 cases to the same sub-specialty and 42 cases to both the same sub-specialty and the same level of difficulty (raw agreement 0.79 and 0.58 respectively). From the 42 cases in which both investigators agreed on both specialty and level of difficulty, 24 cases were drawn such that a pair of cases per sub-specialty, representing two different levels of difficulty (levels 1 & 2, 1 & 3 or 2 & 3), was chosen for the final case mix. This process ensured a balanced set of cases representing all sub-specialties and comprising easy as well as difficult cases.
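The raw-agreement figures quoted above are simply the proportion of the 72 cases on which the two raters matched. A minimal sketch of that arithmetic (the variable names are ours, not the paper's):

total_cases = 72
same_specialty = 57              # both raters chose the same sub-specialty
same_specialty_and_level = 42    # same sub-specialty AND same difficulty level

print(round(same_specialty / total_cases, 2))            # 0.79
print(round(same_specialty_and_level / total_cases, 2))  # 0.58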
Data collection website
A customized, password-protected version of ISABEL was used to collect data during the study. This differed from the main website in that it automatically displayed the study cases to each subject in sequence, assigned each case episode a unique study number, and recorded time data in addition to displaying ten diagnostic suggestions. Three separate text boxes were provided to record subjects' clinical decisions (diagnoses, tests and treatment) pre- and post-DSS consultation. The use of the customized trial website ensured that subjects proceeded from one step to the next without being able to skip steps or revise clinical decisions already submitted.

Training
Training was intended only to familiarize subjects with the trial website. During training, all subjects were assigned unique log-ins and passwords, and one sample case as practice material. Practice sessions involving medical students were supervised by one investigator in group sessions of 2–3 subjects each. Pediatricians (being from geographically disparate locations) were not supervised during training, but received detailed instructions regarding the use of the trial website by email. Context-specific help was provided at each step on the website for assistance during the practice session. All subjects completed their assigned practice case, and were recruited for the study.

Study procedure
Subjects were allowed to complete their assessments of the simulated cases from any computer connected to the Internet, at any time (i.e. they were not supervised). After logging into the trial website, subjects were presented with text from a case simulation. They assessed the case, abstracted the salient clinical features according to their own interpretation of the case, and entered them into the designated search query box in free text. Following this, they keyed in their decisions regarding diagnostic workup, test-ordering and treatment into the designated text boxes. These constituted pre-DSS clinical decisions. See figure 1 for an illustration of this step of the study procedure. On submitting this information, a list of diagnostic suggestions was instantly presented to the subject based on the abstracted clinical features. The subjects could not read the case text again at this stage, preventing them from processing the case a second time and thus avoiding 'second-look' bias. Diagnostic suggestions were different for different users, since the search query was unique to each subject, depending on their understanding of the case and how they expressed it in natural language. On the basis of the diagnostic suggestions, subjects could modify their pre-DSS clinical decisions by adding or deleting items: these constituted post-DSS clinical decisions. All clinical decisions, and the time taken to complete each step, were recorded automatically. See figure 2 for an illustration of this step of the study procedure. The text from one case, and the variability associated with its interpretation during the study, is depicted in figure 3.

Each subject was presented with 12 cases such that one of the pair drawn from each sub-specialty was displayed. Cases were presented in random order (in no particular order of sub-specialty). Subjects could terminate their session at any time and return to complete the remainder of the cases. If a session was terminated midway through a case, that case was presented again on the subject's return. If the website detected no activity for > 2 hours, the subject was automatically logged off, and the session was continued on their return. All subjects had 3 weeks to complete their assigned 12 cases. Since each case was used more than once, by different subjects, we termed each attempt by a subject at a case a 'case episode'.

Scoring metrics
We aimed to assess whether the provision of key diagnostic reminders would reduce errors of omission in the simulated environment. For the purposes of this study, a subject was defined to have committed a DEO for a case episode if they failed to include all 'clinically important diagnoses' in the diagnostic workup (rather than failing to include the 'correct diagnosis'). A diagnosis was judged 'clinically important' if an expert panel working the case independently decided that the particular diagnosis had to be included in the workup in order to ensure safe and appropriate clinical decision making, i.e. it would significantly affect patient management and/or course, and failure to include it would be construed as clinically inadequate. The expert panel comprised two general pediatricians with > 3 years of consultant-level experience. 'Clinically important' diagnoses suggested by the panel thus included the 'most likely diagnosis/es' and other key diagnoses; they did not constitute a full differential containing all plausible diagnoses. This outcome variable was quantified by a binary measure (for each case episode, a subject either committed a DEO or not). Errors of omission were defined for tests and treatments in similar fashion.
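The DEO definition above is effectively a set-inclusion test: a subject commits a DEO exactly when the panel's 'clinically important' set is not a subset of the submitted workup. A minimal sketch, with the function name and example diagnoses ours (the diagnoses echo the case illustrated in figure 1):

def committed_deo(workup: set[str], clinically_important: set[str]) -> bool:
    # True if the subject omitted any clinically important diagnosis.
    return not clinically_important.issubset(workup)

panel = {"nasopharyngitis", "meningitis/encephalitis"}
pre_dss_workup = {"nasopharyngitis", "otitis media"}
print(committed_deo(pre_dss_workup, panel))  # True: meningitis/encephalitis omitted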
We also sought a more sensitive assessment of changes in the quality of clinical decisions made by subjects in this study. Since an appropriate and validated instrument was essential for this purpose, a measurement study was first undertaken to develop and validate such an instrument. The measurement study, including the development and validation of a diagnostic quality score and a management plan quality score, has previously been reported in detail [52]. The scoring process, tested using a subset of cases worked on by clinicians during this study (190 case episodes), was reliable (intraclass correlation coefficient 0.79) and valid (face, construct and concurrent validity).

Figure 1. Screenshot of the ISABEL simulated study procedure, step 1. The figure shows how one subject was presented with the text of a case simulation, how he could search ISABEL using summary clinical features, and how he recorded his clinical decisions prior to viewing ISABEL's results. For this case, the clinically important diagnoses provided by the expert panel were nasopharyngitis (or viral upper respiratory tract infection) and meningitis/encephalitis. This subject committed a DEO (he failed to include both clinically important diagnoses in his diagnostic workup).

Figure 2. Screenshot of the ISABEL simulated study procedure, step 2. The figure shows how the same subject was provided with the results of ISABEL's search in one click, and how he was given the opportunity to modify the clinical decisions made in step 1. It was not possible to go back from this step to step 1 to modify the clinical decisions made earlier. Note that the subject did not identify meningitis/encephalitis from the ISABEL suggestions as clinically important: he made no changes to his workup, and committed a DEO despite system advice.

During the scoring process, the expert panel was provided with an aggregate list of decisions drawn from all subjects (pre- and post-DDSS consultation) for each case. They provided measurements of quality for each of the clinical decisions, in addition to identifying 'clinically important' decisions for each case. Prior to scoring, one investigator (PR) mapped diagnoses proposed by subjects and the expert panel to the nearest equivalent diagnoses in the ISABEL database.
Quality of each diagnostic decision was scored for its degree of plausibility, its likelihood in the clinical setting, and its impact on further patient management. These measurements were used to derive scores for each set of subjects' clinical decisions (diagnostic workup, tests and treatments). As per the scoring system, subjects' decision plans were awarded the highest score (score range: 0 to 1) only if they were both comprehensive (contained all important clinical decisions) and focused (contained only important decisions). Scores were calculated for each subject's diagnostic, test-ordering and treatment plans both pre- and post-ISABEL consultation.

Figure 3. Example of one simulated case used in the study (assigned to Cardiology and case level 2 by both investigators), the variability in clinical features as abstracted by five different users (verbatim), and the clinically important diagnoses as judged by the panel. The case text read: "A 6-week old infant is admitted with a week's history of poor feeding. Whereas previously the infant had been growing along the 25th centile, he has now fallen below the 10th centile. In the past week, his parents also think he is breathing quite quickly. He has a cough and is vomiting most feeds. On examination, he is tachypnoeic, has moderate respiratory distress, and a cough. A grade 3 murmur is heard all over his precordium. He also has a 3 cm liver palpable. Initial lab results show Haemoglobin 9 g/dL, White cell count 12.4 x 10^6/µL, Platelets 180 x 10^6/µL."
User A: failure to thrive, growth below 10th centile, previous growth on 25th centile, tachypnoeic, moderate respiratory distress, cough, precordial grade 3 murmur, palpable liver 3cm, neutrophilia, thrombocytophilia
User B: poor feeding for 1/52, FTT, DIB + COUGH + vomiting, HEART MURMUR, liver palpable
User C: Weight loss, cough, vomiting, respiratory distress, heart murmur, hepatomegaly
User D: poor feeding, failure to thrive, tachypnoeic, cough, vomiting, respiratory distress, heart murmur, large liver, anaemia
User E: poor feeding, wt loss, tafchypneic, cardiac murmur
'Most likely diagnosis' (panel): Ventricular septal defect
'Clinically important' diagnoses (panel): Ventricular septal defect (or other congenital heart disease with left to right shunt); Nasopharyngitis (or Bronchiolitis)

Figure 4. Schematic of the scoring procedure: subjects record their decisions for each case, the system generates its list of suggestions, and subjects decide again; responses from all subjects are scored by the panel on a scale for appropriateness, and the panel's suggestions (responses scored > 0) form the reference standard against which subjects' pre-system and post-system lists are compared.
Figure 4 provides a schematic diagram of the complete scoring procedure.

Primary outcome
1. Change in the number of diagnostic errors of omission among subjects.

Secondary outcomes
1. Mean change in subjects' diagnostic, test-ordering and treatment plan quality scores.
2. Change in the number of irrelevant diagnoses contained within the diagnostic workup.
3. Proportion of case episodes in which at least one additional 'important' diagnosis, test or treatment decision was considered by the subject after DSS consultation.
4. Additional time taken for DSS consultation.

Analysis
Subjects were used as the unit of analysis for the primary outcome measure. For each subject, the total number of DEOs was counted separately for pre- and post-DSS diagnostic workup plans; only subjects who had completed all assigned cases were included in this calculation. Statistically significant changes in DEO count following DDSS consultation, and their interaction with grade, were assessed by two-way mixed-model analysis of variance (grade being the between-subjects factor and time the within-subjects factor). The mean number of DEOs was calculated for each subject grade, and DEOs were additionally analyzed according to level of case difficulty. Statistical significance was set at a p value of 0.05.

Subjects were also used as the unit of analysis for the change in mean quality scores (the development of the quality scores and their validation have been previously described; however, the scores had never been used as an outcome measure prior to this evaluation). In the first step, each subject's quality score (pre- and post-DSS) was calculated for each case episode. For each subject, a mean quality score across all 12 cases was then computed. Only case episodes from subjects who completed all 12 assigned cases were used in this calculation. A two-way mixed-model ANOVA (grade as between-subjects factor; time as within-subjects factor) was used to examine statistically significant differences in quality scores. This analysis was performed for diagnostic quality scores as well as test-ordering and treatment plan scores. Data from a pilot study suggested that 64 subjects were needed to demonstrate a mean diagnostic quality score change of 0.03 (standard deviation 0.06, power 80%, level of significance 5%).
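The paper quotes this sample-size target without the underlying formula. Assuming the classical two-group normal-approximation calculation was used (an assumption on our part), the stated inputs land within one subject of the reported figure; a sketch:

from scipy.stats import norm

# Design targets stated in the paper: detect a mean diagnostic quality score
# change of 0.03 with SD 0.06, 80% power, two-sided alpha of 0.05.
delta, sigma = 0.03, 0.06
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # approx 1.96
z_beta = norm.ppf(power)            # approx 0.84

n = 2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2
print(round(n))  # approx 63, consistent with the 64 subjects quoted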
Using subjects as the unit of analysis, the mean count of diagnoses (and irrelevant diagnoses) included in the workup was calculated pre- and post-DSS consultation for each subject as an average across all case attempts. Only subjects who attempted all assigned cases were included in this analysis. Using these data, a mean count of diagnoses (and irrelevant diagnoses) was calculated for each subject grade. A two-way mixed-model ANOVA was used to assess statistically significant differences in this outcome with respect to grade as well as occasion.

Using case episodes as the unit of analysis, the proportion of case episodes in which at least one additional 'important' diagnosis, test or treatment was prompted by ISABEL was determined. The proportion of case episodes in which at least one clinically significant decision was deleted, and at least one inappropriate decision was added, after system consultation was also computed. All data were analyzed separately for the subjects' grades.

Two further analyses were conducted to enable the interpretation of our results. First, in order to provide a direct comparison of our results with other studies, we used case episodes as the unit of analysis and examined the presence of the 'most likely diagnosis' in the diagnostic workup. The 'most likely diagnosis' was part of the set of 'clinically important' diagnoses provided by the panel, and represented the closest match to a 'correct' diagnosis in our study design. This analysis was conducted separately for each grade. Second, since it was important to verify whether any reduction in omission errors was directly prompted by ISABEL, or simply by subjects re-thinking the assigned cases, all case episodes in which at least one additional significant diagnosis was added by the user were examined. If the diagnostic suggestion added by the user had been displayed in the DSS list of suggestions, it strongly suggested that the system, rather than the subjects' rethinking, prompted these additions.

Results
The characteristics of subjects, cases and case episodes are summarized in table 1. Ninety-seven subjects were invited for the study. Although all subjects consented and completed their training, only seventy-six subjects attempted at least one case (attempters) during their allocated three weeks. This group consisted of 15 medical students, 19 SHOs, 24 registrars and 18 consultants. There was no significant difference between attempters and non-attempters with respect to grade (Chi square test, p = 0.07). Only 6/76 subjects had used ISABEL regularly (at least once a week) prior to the study period (3 SHOs, 1 registrar and 2 consultants); all the others had registered for the service, but had never used the DSS previously. A total of 751 case episodes were completed by the end of the study period. Fifty-two subjects completed all 12 assigned cases, producing 624 case episodes (completers); 24 other subjects did not complete all their assigned cases (non-completers). Completers and non-completers did not differ significantly with respect to grade (Chi square test, p = 0.06). However, more subjects in the non-completers group had been trained remotely (Chi square test, p = 0.003). The majority of non-completers had worked at least two cases (75%); slightly less than half (42%) had worked at least 6 cases.

Table 1: Study participants, cases and case episodes*
(columns: Consultant (%) / Registrar (%) / SHO (%) / Student (%) / Total / Case episodes / Cases)
Subjects invited to participate: 27 (27.8) / 33 (34) / 20 (20.6) / 17 (17.5) / 97 / - / -
Subjects who attempted at least one case (attempters): 18 (23.7) / 24 (31.6) / 19 (25) / 15 (19.7) / 76 / 751 / 24
Subjects who attempted at least six cases: 16 (25.8) / 18 (29) / 15 (24.2) / 13 (20.9) / 62 / 715 / 24
Subjects who completed all 12 cases (completers): 15 (28.8) / 14 (26.9) / 10 (19.2) / 13 (25) / 52 / 624 / 24
Regular DSS users (usage > once/week): 2 / 1 / 3 / 0 / 6 / - / -
* Since each case was assessed more than once, each attempt by a subject at a case was termed a 'case episode'.
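Table 1 contains enough information to re-derive the attempter vs non-attempter comparison quoted above. A sketch, assuming a plain chi-square test of the resulting 2 x 4 grade table was what the authors ran:

import numpy as np
from scipy.stats import chi2_contingency

attempters = np.array([18, 24, 19, 15])   # consultant, registrar, SHO, student
invited = np.array([27, 33, 20, 17])
table = np.vstack([attempters, invited - attempters])

chi2, p, dof, expected = chi2_contingency(table)
print(round(p, 2))  # approx 0.07, matching the reported value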
Forty-seven diagnoses were considered 'clinically important' by the panel across all 24 cases (an average of about 2 per case). For 21/24 cases, the panel had specified a single 'most likely diagnosis'; for 3 cases, two diagnoses were included in this definition.

Diagnostic errors of omission
624 case episodes generated by 52 subjects were used to examine DEO. During the pre-DSS consultation phase, all subjects committed a DEO in at least one of their cases, and 21.1% (11/52) did so in more than half of their cases. In the pre-ISABEL consultation stage, medical students and SHOs committed the most and the fewest DEO respectively (6.6 vs. 4.4); this gradient was maintained post-ISABEL consultation (5.9 vs. 4.1). Overall, 5.5 DEO were noted per subject pre-DSS consultation; this fell to 5.0 DEO after DSS advice (p < 0.001). No significant interaction was noticed with grade (F3,48 = 0.71, p = 0.55). The reduction in DEO following DSS advice within each grade is shown in table 2.

Table 2: Mean count of diagnostic errors of omission (DEO) pre-ISABEL and post-ISABEL consultation
(columns: DEO pre-ISABEL (SD) / DEO post-ISABEL (SD) / Reduction (SD))
Consultant: 5.13 (1.3) / 4.6 (1.4) / 0.53 (0.7)
Registrar: 5.64 (1.5) / 5.14 (1.6) / 0.5 (0.5)
SHO: 4.4 (1.6) / 4.1 (1.6) / 0.3 (0.5)
Medical student: 6.61 (1.3) / 5.92 (1.4) / 0.69 (0.7)
Mean DEO across all subjects (n = 52)*: 5.50 (1.6) / 4.98 (1.5) / 0.52 (0.6)
* Total number of subjects who completed all 12 assigned cases.

Overall, more DEOs were noted for easy cases than for difficult cases both pre- and post-DSS advice (2.17 vs. 2.05 and 2.0 vs. 1.8); however, this was not true for medical students as a subgroup (2.5 vs. 2.9). Improvement following DSS advice seemed greater for difficult cases for all subjects, although this was not statistically significant. These findings are summarized in table 3.

Table 3: Mean DEO count analyzed by level of case difficulty and subject grade
(columns: Difficult cases pre-DSS / post-DSS; Easy cases pre-DSS / post-DSS)
Consultant: 1.66 / 1.47; 2.0 / 1.87
Registrar: 2.21 / 1.93; 2.14 / 1.92
SHO: 1.3 / 1.2; 2.0 / 1.8
Medical student: 2.92 / 2.54; 2.54 / 2.30

Number of irrelevant diagnoses
624 case episodes from 52 subjects were used for this analysis; the results are illustrated in table 5. Overall, the mean count of diagnoses included by subjects in their workup pre-DSS advice was 3.9. This increased to 5.7 post-DSS consultation (an increase of 1.8 diagnoses). The increase was largest for medical students (a mean increase of 2.6 diagnoses) and smallest for consultants (1.4 diagnoses). The ANOVA showed a significant interaction between grade and occasion (F3,58 = 3.14, p = 0.034). The number of irrelevant diagnoses in the workup changed from 0.7 pre-DSS to 1.4 post-DSS advice (an increase of 0.7 irrelevant diagnoses, 95% CI 0.5–0.75). There was a significant difference in this increase across grades (greatest for medical students and least for consultants; 1.1 vs. 0.3 irrelevant diagnoses, F3,48 = 6.33, p < 0.01). The increase in irrelevant diagnoses did not result in a corresponding increase in the number of irrelevant or deleterious tests and treatments (an increase of 0.09 tests and 0.03 treatment decisions).

Table 5: Increase in the average number of diagnoses and irrelevant diagnoses before and after DSS advice, broken down by grade
(columns: No. of diagnoses pre-DSS / post-DSS / increase; No. of irrelevant diagnoses pre-DSS / post-DSS / increase)
Consultant: 3.3 / 4.6 / 1.3; 0.4 / 0.7 / 0.3
Registrar: 4.3 / 5.9 / 1.6; 0.8 / 1.3 / 0.5
SHO: 4.4 / 6.1 / 1.7; 0.6 / 1.4 / 0.8
Medical student: 4.0 / 6.6 / 2.6; 1.1 / 2.2 / 1.1

Mean quality score changes
624 case episodes from the 52 subjects who had completed all 12 assigned cases were used for this analysis. Table 4 summarizes mean diagnostic quality scores pre- and post-ISABEL consultation, and the change in mean quality score for diagnoses, for each grade of subject. There was a significant change in the weighted mean of the diagnostic quality score (0.044; 95% confidence interval: 0.032, 0.054; p < 0.001). No significant interaction between grade and occasion was demonstrated. In 9/52 subjects (17.3%), the pre-DSS score for diagnostic quality was higher than the post-DSS score, indicating that these subjects had lengthened their diagnostic workup without substantially improving its quality. Overall, the mean score for test-ordering plans increased significantly from 0.345 to 0.364 (an increase of 0.019, 95% CI 0.011–0.027, t51 = 4.91, p < 0.001); the increase was smaller for treatment plans (0.01, 95% CI 0.007–0.012, t51 = 7.15, p < 0.001).

Table 4: Mean quality scores for diagnoses broken down by grade of subject
(columns: Mean pre-ISABEL score / Mean post-ISABEL score / Mean score change*)
Consultant: 0.39 / 0.43 / 0.044
Registrar: 0.40 / 0.44 / 0.038
SHO: 0.45 / 0.48 / 0.032
Medical student: 0.31 / 0.37 / 0.059
Weighted average (all subjects): 0.383 / 0.426 / 0.044
* There was no significant difference between grades in terms of change in diagnosis quality score (one-way ANOVA p > 0.05).

Additional diagnoses, tests and treatment decisions
At least one 'clinically important' diagnosis was added by the subject to their differential diagnosis after ISABEL consultation in 94/751 case episodes (12.5%, 95% CI 10.1%–14.9%). 47/76 (61.8%) subjects added at least one 'clinically important' diagnosis to their diagnostic workup after consultation. Overall, 130 'clinically important' diagnoses were added after DSS advice during the experiment. In general, students were reminded to consider many more important diagnoses than consultants, although this was not statistically significant (44 vs. 26, Chi square p > 0.05); a similar gradient was seen for difficult cases, but DSS consultation seemed helpful even for easy cases. Similar proportions for tests and treatment items were smaller in magnitude (table 6). No clinically significant diagnoses were deleted after consultation. Important tests included by subjects in their pre-DSS plan were sometimes deleted from the post-DSS plan (64 individual items from 44 case episodes). A similar effect was seen for treatment steps (34 individual items from 24 case episodes). An inappropriate test was added to the post-ISABEL list in 7/751 cases.

Table 6: Number of case episodes in which clinically 'important' decisions were prompted by ISABEL consultation
(columns: Diagnoses / Tests / Treatment steps)
1 'important' decision prompted: 69 / 56 / 42
2 prompted: 19 / 12 / 5
3 prompted: 2 / 2 / 2
4 prompted: 3 / 0 / 0
5 prompted: 1 / 0 / 0
None: 657 / 637 / 678
Case episodes in which at least ONE 'significant' decision was prompted by ISABEL: 94 (12.5%) / 70 (9.3%) / 49 (6.5%)
Total number of individual 'significant' decisions prompted by ISABEL: 130 / 86 / 58
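The headline 12.5% figure behaves like a textbook binomial proportion. Assuming a simple Wald interval was used (the paper does not say), the reported 95% CI reproduces exactly:

import math

k, n = 94, 751   # episodes in which at least one important diagnosis was added
p = k / n
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"{p:.1%} (95% CI {p - half_width:.1%} to {p + half_width:.1%})")
# -> 12.5% (95% CI 10.1% to 14.9%)

Note also how the Table 6 rows tie together: 69 + 19 + 2 + 3 + 1 = 94 episodes with at least one prompted diagnosis, and the count-weighted sum (69 x 1 + 19 x 2 + 2 x 3 + 3 x 4 + 1 x 5) gives the 130 individual diagnoses quoted.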
751 case episodes were used to examine the presence of the 'most likely diagnosis'. Overall, the 'most likely diagnosis/es' were included in the pre-DSS diagnostic workup in 507/751 (67.5%) case episodes. This increased to 561/751 (74.7%) case episodes after DSS advice. The improvement was fully attributable to positive consultation effects (where the 'most likely diagnosis' was absent pre-DSS but present post-DSS); no negative consultations were observed. Diagnostic accuracy pre-ISABEL was greatest for consultants (73%) and least for medical students (57%). Medical students gained the most after DSS advice (an absolute increase of 10%). Analysis performed to elucidate whether ISABEL was responsible for the changes seen in the rate of diagnostic error indicated that all additional diagnoses were indeed present in the system's list of diagnostic suggestions.

Time intervals
Reliable time data were available for 633/751 episodes (table 7). The median time taken for subjects to abstract clinical features and record their initial clinical decisions on the trial website was 6 min (IQR 4–10 min); the median time taken to examine ISABEL's suggestions and make changes to clinical decisions was 1 min (IQR 30 sec–2 min). The time taken for ISABEL to display its suggestions was less than 2 sec on all occasions.

Table 7: Time taken to process case simulations broken down by grade of subject
(columns: Median time pre-ISABEL / Median time post-ISABEL)
Consultant: 5 min 5 sec / 42 sec
Registrar: 5 min 45 sec / 57 sec
SHO: 5 min 54 sec / 53 sec
Medical student: 8 min 36 sec / 3 min 42 sec
Overall: 6 min 2 sec (IQR 4:03–9:47) / 1 min (IQR 30 sec–2:04)
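The two-way mixed-model ANOVAs reported throughout the Results (grade as the between-subjects factor, occasion as the within-subjects factor) can be run with standard tooling. A sketch using the pingouin library on synthetic data shaped like Table 2; the per-subject data were not published, so every value below is simulated, and the exact F statistics would differ:

import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
grades = ["consultant", "registrar", "SHO", "student"]
rows = []
for sid in range(52):                  # 52 completers, as in the study
    grade = grades[sid % 4]
    pre = rng.normal(5.5, 1.6)         # DEO counts on the Table 2 scale
    post = pre - rng.normal(0.5, 0.6)  # mean reduction of about 0.5
    rows.append({"subject": sid, "grade": grade, "time": "pre", "deo": pre})
    rows.append({"subject": sid, "grade": grade, "time": "post", "deo": post})

df = pd.DataFrame(rows)
# Grade between subjects, time (pre/post) within subjects.
print(pg.mixed_anova(data=df, dv="deo", within="time",
                     between="grade", subject="subject"))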
Discussion
We have shown in this study that errors of omission occur frequently during diagnostic workup in an experimental setting, including in cases perceived as being common in routine practice. Such errors seem to occur in most subjects, irrespective of their level of experience. We have also demonstrated that it is possible to influence clinicians' diagnostic workup and reduce errors of omission using a stand-alone diagnostic reminder system. Following DSS consultation, the quality of diagnostic, test-ordering and treatment decisions made by various grades of clinicians improved for a range of cases, such that a clinically important alteration in diagnostic decision-making resulted in 12.5% of all consultations (1 in 8 episodes of system use).

In a previous study assessing the impact of ILIAD and QMR, in which only diagnostically challenging cases were used in an experimental setting, Friedman et al showed that the 'correct diagnosis' was prompted by DSS use in approximately 1 in 16 consultations [48]. Although we used a similar experimental design, we used a mix of easy as well as difficult cases, to test the hypothesis that incomplete workup is encountered in diagnostic conundrums as well as in routine clinical problems. Previous evaluations of expert systems used the presence of the 'correct' diagnosis as the main outcome. We focused on clinical safety as the key outcome, preferring to use the inclusion of all 'clinically important' diagnoses in the workup as the main variable of interest. In acute settings such as emergency rooms and primary care, where an incomplete and evolving clinical picture results in considerable diagnostic uncertainty at assessment, the ability to generate a focused and 'safe' workup is a more clinically relevant outcome, and one which accurately reflects the nature of decision making in this environment [53]. Consequently, we defined diagnostic errors of omission at assessment as the 'failure to consider all clinically important diagnoses (as judged by an expert panel working the same cases)'. This definition resulted in the 'correct' diagnosis, as well as other significant diagnoses, being included within the 'minimum' workup. Further, changes in test-ordering and treatment decisions were uniquely measured in this study as a more concrete marker of the impact of diagnostic decision support on the patient's clinical management; we were able to demonstrate an improvement in test-ordering in 1 in 10 system consultations, indicating that diagnostic DSS may strongly influence patient management, despite only offering diagnosis-related advice.
Finally, the time expended during DSS consultation is an important aspect that has not been fully explored in previous studies. In our study, subjects spent a median of 6 minutes on clinical data entry (including typing in their unaided decisions), and a median of 1 minute to process the advice provided and make changes to their clinical decisions.

The research design employed in this study allowed us to confirm a number of observations previously reported, as well as to generate numerous unique ones. These findings relate to the operational consequences of providing diagnostic assistance in practice. In keeping with other DSS evaluations, different subject grades processed system advice in different ways, depending on their prior knowledge and clinical experience, leading to variable benefit. Since ISABEL merely offered diagnostic suggestions, and allowed the clinician to make the final decisions (acting as the 'learned intermediary') [54], in some cases subjects ignored even important advice. In some other cases, they added irrelevant decisions or deleted important decisions after DSS consultation, leading to a reduced net positive effect of the DDSS on decision making. For some subjects whose pre-DSS performance was high, a ceiling effect prevailed, and no further improvement could be demonstrated. These findings complement the results of our earlier system performance evaluation, which focused solely on system accuracy and not on user interaction with the DDSS.

One of the main findings from this study was that consultants tended to generate shorter diagnostic workup lists containing the 'most likely' diagnoses, with a predilection to omit other 'important' diagnoses that might account for the patient's clinical features, resulting in a high incidence of DEO. Medical students generated long diagnostic workup lists, but missed many key diagnoses, leading to a high DEO rate. Interestingly, all subject grades gained from the use of ISABEL in terms of a reduction in the number of DEO, although to varying degrees. Despite more DEOs occurring in cases considered to be routine in practice than in rare and difficult ones in the pre-DSS consultation phase, ISABEL advice seemed mainly to improve decision making for difficult cases, with a smaller effect on easy cases. The impact of DSS advice showed a decreasing level of beneficial effect from diagnostic to test-ordering to treatment decisions. Finally, although the time taken to process cases without DSS advice in this study compared favorably with the Friedman evaluation of QMR and ILIAD (6 min vs. 8 min), the time taken to generate a revised workup with DSS assistance was dramatically shorter (1 min vs. 22 min).

We propose a number of explanations for our findings. There is sufficient evidence to suggest that clinicians with more clinical experience resort to established pattern-recognition techniques and the use of heuristics while making diagnostic decisions [55]. While these shortcuts enable quick decision making in practice, and work successfully on most occasions, they involve a number of cognitive biases, such as 'premature closure' and 'confirmation bias', that may lead to incomplete assessment on some occasions. On the other hand, medical students may not have developed adequate pattern-recognition techniques or acquired sufficient knowledge of heuristics to make sound diagnostic decisions.
It may well be that grades at an intermediate level are able to process cases in an acute setting with a greater emphasis on clinical safety. This explanation may also account for the finding that subjects failed to include 'important' diagnoses during the assessment of easy cases. Recognition that a case is unusual may trigger a departure from established pattern-recognition techniques and clinical shortcuts towards a more considered cognitive assessment, leading to fewer DEO in such cases. We have shown that it is possible to reduce DEOs by the use of diagnostic reminders, including in easy cases, although subjects appeared to be more willing to revise their decisions for difficult cases on the basis of ISABEL suggestions. It is also possible that some subjects ignored relevant advice because the system's explanatory capacity was inadequate and did not allow subjects to discriminate sufficiently between the suggestions offered.

User variability in summarizing cases may also explain why variable benefits were derived from ISABEL usage: subjects may have obtained different results depending on how they abstracted and entered clinical features. This user variability during clinical data entry has been demonstrated even with the use of a controlled vocabulary in QMR [56]. We observed marked differences between users' search terms for the same textual case; however, the diagnostic suggestions did not seem to vary noticeably. This observation could be partially explained by the enormous diversity associated with the various natural language disease descriptions contained within the ISABEL database, as well as by the system's use of a thesaurus that converts medical slang into recognized medical terms.

The diminishing level of impact from diagnostic to test-ordering to treatment decisions may be a result of system design: ISABEL does not explicitly state which tests and treatments to perform for each of its diagnostic suggestions. This advice is usually embedded within the textual description of the disease provided to the user. Study and system design may both account for the differences in time taken to process the cases. In previous evaluations, subjects processed cases without using the DSS in the first instance; in a subsequent step, they used the DSS to enter clinical data, record their clinical decisions, and process system advice to generate a second diagnostic hypothesis list. In our study, subjects processed the case and recorded their own clinical decisions while using the DSS for clinical data entry. The second stage of the procedure only involved processing ISABEL advice and modifying previous clinical decisions. As such, direct comparison between the studies can be made only on the total time involved per case (30 min vs. 7 min). This difference could be explained by features of the system's design that resulted in shorter times to enter clinical data and to easily process the advice provided.

The findings from this study have implications specifically for ISABEL, as well as for diagnostic DSS design, evaluation and implementation in general. It is well recognized that the dynamic interaction between user and DSS plays a major role in their acceptance by physicians [57]. We feel that adoption of the ISABEL system during clinical assessment in real time is possible even with current computer infrastructure, providing an opportunity for a reduction in DEO.
Its integration into an EMR would allow further control over the quality of the clinical input data, as well as the provision of active decision support with minimal extra effort. Such an ISABEL interface has currently been developed and tested with four commercial EMRs [58]; this integration also facilitates iterative use of the system during the evolution of a patient's condition, leading to increasingly specific diagnostic advice. The reminder system model aims to enable clinicians to generate 'safe' diagnostic workups in busy environments at high risk of diagnostic error. This model has been used successfully to alter physician behavior by reducing errors of omission in preventive care [59]. It is clear from recent studies that diagnostic errors occur in the emergency room for a number of reasons. Cognitive biases, of which 'premature closure' and faulty context generation are key examples, contribute significantly [60]. Use of a reminder system may minimize the impact of some of these cognitive biases. When combined with cognitive forcing strategies during decision making, DDSS may act as 'safety nets' to reduce the incidence of omission errors in practice [61]. Reminders to perform important tests and treatment steps may also allow a greater impact on patient outcome [62]. The Web-based system model used in our study allowed users from disparate parts of the country to participate without the need for additional infrastructure or financial resources, an implementation model that would minimize the cost associated with deployment in practice. Finally, the role of DDSS in medical education and training needs formal evaluation. In our study, medical students gained significantly from the advice provided, suggesting that use of DDSS during specific diagnostic tasks (e.g. problem-based case exercises) might be a valuable adjunct to current educational strategies. Familiarity with DDSS will also predispose to greater adoption of computerized aids during future clinical decision making.

The limitations of this study stem mainly from its experimental design. The repeated measures design raises the possibility that some of the beneficial effects seen in the study are a result of subjects 'rethinking' the case, or the consequence of a reflective process [63]. Consequently, ISABEL's effects in practice could be related to the extra time taken by users in processing cases. We believe that any such effects are likely to be minimal, since subjects did not actually process the cases twice during the study: a summary of the clinical features was generated by subjects when the case was displayed for the first time, and subjects could not review the cases while processing ISABEL suggestions in the next step. Subjects also spent negligible time between their first assessment of the cases and processing the diagnostic suggestions from the DSS. The repeated measures design provided the power to detect differences between users with minimal resources; a randomized design using study and control groups of subjects would have necessitated the involvement of over 200 subjects. The cases used in our study contained only basic clinical data gained at the time of acute assessment, and may have proved too concise or easy to process. However, this seems unlikely, since subjects took an average of only 8 min to process even diagnostic conundrums prior to DSS use when 'expert systems' were tested.
Our cases pertained to emergency assessments, making it difficult to generalize the results to other ambulatory settings. The ability to extract clinical features from textual cases may not accurately simulate a real patient encounter, where missed data or 'red herrings' are quite common. The inherent complexity involved in assessing a patient and summarizing the clinical findings in words may lead to poorer performance of the ISABEL system in real life, since its diagnostic output depends on the quality of user input. As a corollary, some of our encouraging results may be explained by our choice of subjects: a few were already familiar with summarizing clinical features for entry into the DSS. Subjects were not supervised during their case exercises, since they may have performed differently under scrutiny, raising the prospect of a Hawthorne effect [64]. The use of a structured website to explicitly record clinical decisions may have invoked the checklist effect, as illustrated in the Leeds abdominal pain system study [65]. The checklist effect might also have been invoked during the process of summarizing clinical features for ISABEL input; this may have worked in conjunction with 'rethinking' to promote better decision making pre-ISABEL. We also measured decision making at a single point in time, making it difficult to assess the effects of iterative use of the DSS on the same patient. Finally, our definition of diagnostic error aimed to identify inadequate diagnostic workup at initial assessment that might result in a poor patient outcome. We recognize the absence of an evidence-based link between omission errors and diagnostic adverse events in practice, although, following the Schiff model [53], it seems logical to assume that avoiding process errors will prevent actual errors in at least some instances. In the simulated setting, it was not possible to test whether inadequate diagnostic workup would directly lead to a diagnostic error and cause patient harm. Our planned clinical impact assessment in real life would help clarify many of the questions raised during this experimental study.

Conclusion

This experimental study demonstrates that diagnostic omission errors are common during the assessment of easy as well as difficult cases. The provision of patient- and context-specific diagnostic reminders has the potential to reduce these errors across all subject grades. Our study suggests a promising role for the use of future reminder-based DSS in the reduction of diagnostic error.
An impact evaluation, utilizing a naturalistic design and conducted in real-life clinical practice, is underway to verify the conclusions derived from this simulation.

Authors' contributions

PR conceived the study, contributed to the study design, analyzed the data and drafted the manuscript. GR assisted with the design of the study and data analysis. MC assisted with the study design, served as gold standard panel member, and revised the draft manuscript. VN assisted with study design, served as gold standard panel member, and helped with data analysis. MT assisted with data collection and analysis. PT assisted with study conception, study design and revised the draft manuscript. JW assisted with study conception, provided advice regarding study design, and revised the draft manuscript. JB assisted with study conception, study design and data analysis. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Tina Sajjanhar for her help in allocating cases to specialties and levels of difficulty; Helen Fisher for data analysis; and Jason Maude of the ISABEL Medical Charity for help rendered during the design of the study. Financial support: This study was supported by a research grant from the National Health Service (NHS) Research & Development Unit, London. The sponsor did not influence the study design; the collection, analysis, and interpretation of data; the writing of the manuscript; or the decision to submit the manuscript for publication.

Competing interests

This study was conducted when the ISABEL system was available free to users, funded by the ISABEL Medical Charity (2001–2002). Dr Ramnarayan performed this research as part of his MD thesis at Imperial College London. Dr Britto was a Trustee of the ISABEL Medical Charity (a non-remunerative post). Ms Tomlinson was employed by the ISABEL Medical Charity as a Research Nurse. Since June 2004, ISABEL has been managed by a commercial subsidiary of the Medical Charity called ISABEL Healthcare, and the system is now available only to subscribers. Dr Ramnarayan now advises ISABEL Healthcare on research activities on a part-time basis; Dr Britto is now Clinical Director of ISABEL Healthcare; Ms Tomlinson is responsible for Content Management within ISABEL Healthcare. All three hold stock options in ISABEL Healthcare. All other authors declare that they have no competing interests.

References

1. Institute of Medicine: To Err is Human: Building a Safer Health System. Washington, DC: National Academy Press; 1999.
2. Fortescue EB, Kaushal R, Landrigan CP, McKenna KJ, Clapp MD, Federico F, Goldmann DA, Bates DW: Prioritizing strategies for preventing medication errors and adverse drug events in pediatric inpatients. Pediatrics 2003, 111(4 Pt 1):722-9.
3. Kaushal R, Bates DW: Information technology and medication safety: what is the benefit? Qual Saf Health Care 2002, 11(3):261-5.
4. Leape L, Brennan TA, Laird N, Lawthers AG, Localio AR, Barnes BA, Hebert L, Newhouse JP, Weiler PC, Hiatt H: The nature of adverse events in hospitalized patients. Results of the Harvard Medical Practice Study II. N Engl J Med 1991, 324:377-84.
5. Wilson RM, Harrison BT, Gibberd RW, Harrison JD: An analysis of the causes of adverse events from the Quality in Australian Health Care Study. Med J Aust 1999, 170:411-5.
6. Davis P, Lay-Yee R, Briant R, Ali W, Scott A, Schug S: Adverse events in New Zealand public hospitals II: preventability and clinical context. N Z Med J 2003, 116(1183):U624.
7. Bartlett EE: Physicians' cognitive errors and their liability consequences. J Healthcare Risk Manage 1998, Fall:62-9.
8. Croskerry P: The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med 2003, 78(8):775-80.
9. Green S, Lee R, Moss J: Problems in general practice. Delays in diagnosis. Manchester, UK: Medical Defence Union; 1998.
10. Thomas EJ, Studdert DM, Burstin HR, Orav EJ, Zeena T, Williams EJ, Howard KM, Weiler PC, Brennan TA: Incidence and types of adverse events and negligent care in Utah and Colorado. Med Care 2000, 38:261-71.
11. Rothschild JM, Landrigan CP, Cronin JW, Kaushal R, Lockley SW, Burdick E, Stone PH, Lilly CM, Katz JT, Czeisler CA, Bates DW: The Critical Care Safety Study: the incidence and nature of adverse events and serious medical errors in intensive care. Crit Care Med 2005, 33(8):1694-700.
12. Burroughs TE, Waterman AD, Gallagher TH, Waterman B, Adams D, Jeffe DB, Dunagan WC, Garbutt J, Cohen MM, Cira J, Inguanzo J, Fraser VJ: Patient concerns about medical errors in emergency departments. Acad Emerg Med 2005, 12(1):57-64.
13. Graber M, Gordon R, Franklin N: Reducing diagnostic errors in medicine: what's the goal? Acad Med 2002, 77(10):981-92.
14. Barnett GO, Cimino JJ, Hupp JA, Hoffer EP: DXplain: an evolving diagnostic decision-support system. JAMA 1987, 258:67-74.
15. Miller R, Masarie FE, Myers JD: Quick Medical Reference (QMR) for diagnostic assistance. MD Comput 1986, 3(5):34-48.
16. Warner HR Jr: Iliad: moving medical decision-making into new frontiers. Methods Inf Med 1989, 28(4):370-2.
17. Barness LA, Tunnessen WW Jr, Worley WE, Simmons TL, Ringe TB Jr: Computer-assisted diagnosis in pediatrics. Am J Dis Child 1974, 127(6):852-8.
18. Wilson DH, Wilson PD, Walmsley RG, Horrocks JC, De Dombal FT: Diagnosis of acute abdominal pain in the accident and emergency department. Br J Surg 1977, 64(4):250-4.
19. Miller RA: Medical diagnostic decision support systems – past, present, and future: a threaded bibliography and brief commentary. J Am Med Inform Assoc 1994, 1(1):8-27.
20. Bankowitz RA, McNeil MA, Challinor SM, Miller RA: Effect of a computer-assisted general medicine diagnostic consultation service on housestaff diagnostic strategy. Methods Inf Med 1989, 28(4):352-6.
21. Tannenbaum SJ: Knowing and acting in medical practice: the epistemological politics of outcomes research. J Health Polit 1994, 19:27-44.
22. Friedman CP, Gatti GG, Franz TM, Murphy GC, Wolf FM, Heckerling PS, Fine PL, Miller TM, Elstein AS: Do physicians know when their diagnoses are correct? Implications for decision support and error reduction. J Gen Intern Med 2005, 20(4):334-9.
23. Wyatt JC: Clinical Knowledge and Practice in the Information Age: A Handbook for Health Professionals. London: The Royal Society of Medicine Press; 2001.
24. Smith R: What clinical information do doctors need? BMJ 1996, 313:1062-68.
25. Graber MA, VanScoy D: How well does decision support software perform in the emergency department? Emerg Med J 2003, 20(5):426-8.
26. Lemaire JB, Schaefer JP, Martin LA, Faris P, Ainslie MD, Hull RD: Effectiveness of the Quick Medical Reference as a diagnostic tool. CMAJ 1999, 161(6):725-8.
27. Welford CR: A comprehensive computerized patient record with automated linkage to QMR. Proc Annu Symp Comput Appl Med Care 1994:814-8.
28. Elhanan G, Socratous SA, Cimino JJ: Integrating DXplain into a clinical information system using the World Wide Web. Proc AMIA Annu Fall Symp 1996:348-52.
29. Berner ES, Detmer DE, Simborg D: Will the wave finally break? A brief view of the adoption of electronic medical records in the United States. J Am Med Inform Assoc 2005, 12(1):3-7.
30. Greenough A: Help from ISABEL for pediatric diagnoses. Lancet 2002, 360(9341):1259.
31. Thomas NJ: ISABEL. Crit Care 2002, 7(1):99-100.
32. Ramnarayan P, Britto J: Pediatric clinical decision support systems. Arch Dis Child 2002, 87(5):361-2.
33. Fisher H, Tomlinson A, Ramnarayan P, Britto J: ISABEL: support with clinical decision making. Paediatr Nurs 2003, 15(7):34-5.
34. Giuse DA, Giuse NB, Miller RA: A tool for the computer-assisted creation of QMR medical knowledge base disease profiles. Proc Annu Symp Comput Appl Med Care 1991:978-9.
35. Johnson KB, Feldman MJ: Medical informatics and pediatrics. Decision-support systems. Arch Pediatr Adolesc Med 1995, 149(12):1371-80.
36. Giuse DA, Giuse NB, Miller RA: Evaluation of long-term maintenance of a large medical knowledge base. J Am Med Inform Assoc 1995, 2(5):297-306.
37. Ramnarayan P, Tomlinson A, Kulkarni G, Rao A, Britto J: A novel diagnostic aid (ISABEL): development and preliminary evaluation of clinical performance. Medinfo 2004:1091-5.
38. Wyatt J: Quantitative evaluation of clinical software, exemplified by decision support systems. Int J Med Inform 1997, 47(3):165-73.
39. Ramnarayan P, Tomlinson A, Rao A, Coren M, Winrow A, Britto J: ISABEL: a web-based differential diagnostic aid for pediatrics: results from an initial performance evaluation. Arch Dis Child 2003, 88:408-13.
40. Feldman MJ, Barnett GO: An approach to evaluating the accuracy of DXplain. Comput Methods Programs Biomed 1991, 35(4):261-6.
41. Middleton B, Shwe MA, Heckerman DE, Henrion M, Horvitz EJ, Lehmann HP, Cooper GF: Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. II. Evaluation of diagnostic performance. Methods Inf Med 1991, 30(4):256-67.
42. Nelson SJ, Blois MS, Tuttle MS, Erlbaum M, Harrison P, Kim H, Winkelmann B, Yamashita D: Evaluating RECONSIDER: a computer program for diagnostic prompting. J Med Syst 1985, 9:379-388.
43. Berner ES, Webster GD, Shugerman AA, Jackson JR, Algina J, Baker AL, Ball EV, Cobbs CG, Dennis VW, Frenkel EP: Performance of four computer-based diagnostic systems. N Engl J Med 1994, 330(25):1792-6.
44. Miller RA: Evaluating evaluations of medical diagnostic systems. J Am Med Inform Assoc 1996, 3(6):429-31.
45. Miller RA, Masarie FE: The demise of the "Greek Oracle" model for medical diagnostic systems. Methods Inf Med 1990, 29:1-2.
46. Maisiak RS, Berner ES: Comparison of measures to assess change in diagnostic performance due to a decision support system. Proc AMIA Symp 2000:532-6.
47. Berner ES, Maisiak RS: Influence of case and physician characteristics on perceptions of decision support systems. J Am Med Inform Assoc 1999, 6(5):428-34.
48. Friedman CP, Elstein AS, Wolf FM, Murphy GC, Franz TM, Heckerling PS, Fine PL, Miller TM, Abraham V: Enhancement of clinicians' diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 1999, 282(19):1851-6.
49. Issenberg SB, McGaghie WC, Hart IR, Mayer JW, Felner JM, Petrusa ER, Waugh RA, Brown DD, Safford RR, Gessner IH, Gordon DL, Ewy GA: Simulation technology for health care professional skills training and assessment. JAMA 1999, 282(9):861-6.
50. Clauser BE, Margolis MJ, Swanson DB: An examination of the contribution of computer-based case simulations to the USMLE Step 3 examination. Acad Med 2002, 77(10 Suppl):S80-2.
51. Dillon GF, Clyman SG, Clauser BE, Margolis MJ: The introduction of computer-based case simulations into the United States Medical Licensing Examination. Acad Med 2002, 77(10 Suppl):S94-6.
52. Ramnarayan P, Kapoor RR, Coren M, Nanduri V, Tomlinson AL, Taylor PM, Wyatt JC, Britto JF: Measuring the impact of diagnostic decision support on the quality of clinical decision-making: development of a reliable and valid composite score. J Am Med Inform Assoc 2003, 10(6):563-72.
53. Schiff GD, Kim S, Abrams R, Cosby K, Lambert B, Elstein AS, Hasler S, Krosnjar N, Odwazny R, Wisniewski MF, McNutt RA: Diagnosing diagnosis errors: lessons from a multi-institutional collaborative project. [http://www.ahrq.gov/downloads/pub/advances/vol2/Schiff.pdf]. Accessed 10 May 2005.
54. Brahams D, Wyatt J: Decision-aids and the law. Lancet 1989, 2:632-4.
55. Croskerry P: Achieving quality in clinical decision making: cognitive strategies and detection of bias. Acad Emerg Med 2002, 9(11):1184-204.
56. Bankowitz RA, Blumenfeld BH, Bettinsoli Giuse N: User variability in abstracting and entering printed case histories with Quick Medical Reference (QMR). In Proceedings of the Eleventh Annual Symposium on Computer Applications in Medical Care. New York: IEEE Computer Society Press; 1987:68-73.
57. Kassirer JP: A report card on computer-assisted diagnosis – the grade: C. N Engl J Med 1994, 330(25):1824-5.
58. [http://www.healthcareitnews.com/story.cms?ID=3030].
59. Dexter PR, Perkins S, Overhage JM, Maharry K, Kohler RB, McDonald CJ: A computerized reminder to increase the use of preventive care for hospitalized patients. N Engl J Med 2001, 345:965-970.
60. Graber ML, Franklin N, Gordon R: Diagnostic error in internal medicine. Arch Intern Med 2005, 165(13):1493-9.
61. Croskerry P: Cognitive forcing strategies in clinical decision making. Ann Emerg Med 2003, 41(1):110-20.
62. Berner ES: Diagnostic decision support systems: how to determine the gold standard? J Am Med Inform Assoc 2003, 10(6):608-10.
63. Mamede S, Schmidt HG: The structure of reflective practice in medicine. Med Educ 2004, 38(12):1302-8.
64. Randolph AG, Haynes RB, Wyatt JC, Cook DJ, Guyatt GH: Users' guides to the medical literature: XVIII. How to use an article evaluating the clinical impact of a computer-based clinical decision support system. JAMA 1999, 282(1):67-74.
65. Adams ID, Chan M, Clifford PC, Cooke WM, Dallos V, de Dombal FT, Edwards MH, Hancock DM, Hewett DJ, McIntyre N: Computer aided diagnosis of acute abdominal pain: a multicentre study. BMJ 1986, 293:800-4.

HOW ACCURATELY DOES A WEB-BASED DIAGNOSIS DECISION SUPPORT SYSTEM SUGGEST THE CORRECT DIAGNOSIS IN 50 ADULT NEJM CPC CASES?

Graber ML, Mathews A. Performance of a web-based clinical diagnosis support system for internists. J Gen Intern Med. 2008 Jan;23 Suppl 1:85-7.

Performance of a Web-Based Clinical Diagnosis Support System for Internists

Mark L. Graber, MD and Ashlei Mathew
Medical Service-111, VA Medical Center, Northport, NY 11768, USA; Department of Medicine, SUNY at Stony Brook, Stony Brook, NY, USA

BACKGROUND: Clinical decision support systems can improve medical diagnosis and reduce diagnostic errors. Older systems, however, were cumbersome to use and had limited success in identifying the correct diagnosis in complicated cases.

OBJECTIVE: To measure the sensitivity and speed of "Isabel" (Isabel Healthcare Inc., USA), a new web-based clinical decision support system designed to suggest the correct diagnosis in complex medical cases involving adults.

METHODS: We tested 50 consecutive Internal Medicine case records published in the New England Journal of Medicine. We either manually entered 3 to 6 key clinical findings from the case (the recommended approach) or pasted in the entire case history. The investigator entering key words was aware of the correct diagnosis. We then determined how often the correct diagnosis was suggested in the list of 30 differential diagnoses generated by the clinical decision support system. We also evaluated the speed of data entry and results recovery.

RESULTS: The clinical decision support system suggested the correct diagnosis in 48 of 50 cases (96%) with key-findings entry, and in 37 of the 50 cases (74%) if the entire case history was pasted in. Pasting took seconds, manual entry less than a minute, and results were provided within 2–3 seconds with either approach.

CONCLUSIONS: The Isabel clinical decision support system quickly suggested the correct diagnosis in almost all of these complex cases, particularly with key-finding entry. The system performed well in this experimental setting and merits evaluation in more natural settings and clinical practice.

KEY WORDS: clinical diagnosis support systems; Isabel; internal medicine; diagnostic error; Google; decision support.

J Gen Intern Med 23(Suppl 1):37–40. DOI: 10.1007/s11606-007-0271-8. © Society of General Internal Medicine 2007

INTRODUCTION

The best clinicians excel in their ability to discern the correct diagnosis in perplexing cases. This skill requires an extensive knowledge base, keen interviewing and examination skills, and the ability to synthesize coherently all of the available information. Unfortunately, the level of expertise varies among clinicians, and even the most expert can sometimes fail. There is also a growing appreciation that diagnostic errors can be made just as easily in simple cases as in the most complex. Given this dilemma, and the fact that diagnostic error rates are not trivial, clinicians are well advised to explore tools that can help them establish correct diagnoses.

Clinical diagnosis support systems (CDSS) can direct physicians to the correct diagnosis and have the potential to reduce the rate of diagnostic errors in medicine.1,2 The first-generation computer-based products (e.g., QMR—First Databank, Inc, CA; Iliad—University of Utah; DXplain—Massachusetts General Hospital, Boston, MA) used precompiled knowledge bases of syndromes and diseases with their characteristic symptoms, signs, and laboratory findings. The user would enter findings from their own patients, selected from a menu of choices, and the programs would use Bayesian logic or pattern-matching algorithms to suggest diagnostic possibilities.
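To make the preceding description concrete, here is a toy sketch of the Bayesian scoring such first-generation systems used. All priors and likelihoods below are invented for illustration; the real systems ran over far larger precompiled knowledge bases.

```python
# Toy naive-Bayes ranking in the spirit of first-generation CDSS such as
# QMR or Iliad. All numbers are invented; real knowledge bases encoded
# thousands of disease-finding relationships.
PRIORS = {"pneumonia": 0.05, "pulmonary embolism": 0.01, "bronchitis": 0.10}
LIKELIHOOD = {  # P(finding | disease), hypothetical values
    "pneumonia":          {"fever": 0.80, "cough": 0.85, "pleuritic pain": 0.30},
    "pulmonary embolism": {"fever": 0.15, "cough": 0.20, "pleuritic pain": 0.60},
    "bronchitis":         {"fever": 0.30, "cough": 0.90, "pleuritic pain": 0.05},
}

def rank(findings: list[str]) -> list[tuple[str, float]]:
    """Rank diseases by unnormalized posterior, assuming independent findings."""
    scores = {}
    for disease, prior in PRIORS.items():
        p = prior
        for f in findings:
            p *= LIKELIHOOD[disease].get(f, 0.01)  # small default for unmodeled findings
        scores[disease] = p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank(["fever", "pleuritic pain"]))  # pneumonia ranks first with these toy numbers
```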
Typically, the suggestions were judged to be helpful in clinical settings, even when used by expert clinicians.3 These diagnosis support systems were also useful in teaching clinical reasoning.4,5 Surprisingly, and despite their demonstrated utility in experimental settings, none of these earlier systems gained widespread acceptance for clinical use, apparently because of the considerable time needed to input clinical data and their somewhat limited sensitivity and specificity.3,6 A study of Iliad and QMR in an emergency department setting, for example, found that the final impression of the attending physician appeared among the suggested diagnoses only 72% and 52% of the time, respectively, and data input required 20 to 40 minutes for each case.7

In this study, we evaluated the clinical performance of "Isabel" (Isabel Healthcare Inc, USA), a new, second-generation, web-based CDSS that accepts either key findings or whole-text entry and uses a novel search strategy to identify candidate diagnoses from the clinical findings. The clinician first enters the key findings from the case using free-text entry (see Fig. 1). There is no limit on the number of terms entered, although excellent results are typically obtained by entering just a few key findings. The program includes a thesaurus that facilitates recognition of terms. The program then uses natural language processing and search algorithms to compare these terms to those used in a selected reference library. For Internal Medicine cases, the library includes 6 key textbooks, anchored around the Oxford Textbook of Medicine, 4th Edition (2003) and the Oxford Textbook of Geriatric Medicine, and 46 major journals in general and subspecialty medicine and toxicology. The search domain and results are filtered to take into account the patient's age, sex, geographic location, pregnancy status, and other clinical parameters that are either preselected by the clinician or automatically entered if the system is integrated with the clinician's electronic medical record.

Figure 1. The data-entry screen of the Isabel diagnosis support software. The clinician manually enters the age, gender, locality, and specialty. The query terms can be entered manually from the key findings, or findings can be pasted from an existing write-up. This patient was ultimately found to have a B-cell lymphoma secreting a cryo-paraprotein.

The system then displays a total of 30 suggested diagnoses, with 10 diagnoses presented on each web page (see Fig. 2). The order of listing reflects the degree of matching between the findings selected and the reference materials searched, but is not meant to suggest a ranked order of clinical probabilities. As in the first-generation systems, more detailed information on each diagnosis can be obtained through links to authoritative texts.
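As a rough sketch of the retrieve-filter-paginate flow just described (free-text findings matched against a reference library, demographic filters applied, and 30 suggestions shown 10 per page), consider the following. The corpus, scoring, and all names are assumptions for illustration, not Isabel's actual search code.

```python
# Hypothetical sketch of the second-generation workflow described above.
from dataclasses import dataclass

@dataclass
class DiseaseDoc:
    name: str
    text: str            # narrative description drawn from a reference library
    min_age: int = 0     # crude stand-in for the demographic filters
    max_age: int = 120

def suggest(findings: list[str], age: int, library: list[DiseaseDoc],
            page_size: int = 10, max_results: int = 30) -> list[list[str]]:
    """Return up to 30 candidate diagnoses, grouped into pages of 10."""
    query = {w for f in findings for w in f.lower().split()}
    eligible = [d for d in library if d.min_age <= age <= d.max_age]
    # Crude relevance proxy: overlap between query tokens and document tokens.
    scored = sorted(eligible,
                    key=lambda d: len(query & set(d.text.lower().split())),
                    reverse=True)[:max_results]
    names = [d.name for d in scored]
    return [names[i:i + page_size] for i in range(0, len(names), page_size)]
```

A bag-of-words matcher of this kind has no notion of negation: a pasted phrase like "denies chest pain" contributes the same tokens as "chest pain", which is the failure mode the authors report for whole-text pasting in the Discussion below.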
The Isabel CDSS was originally developed for use in pediatrics. In an initial evaluation, 13 clinicians (trainees and staff) at St Mary's Hospital, London, submitted a total of 99 hypothetical case scenarios representing different diagnoses, and Isabel displayed the expected diagnosis in 91% of these cases.8 Out of 100 real case scenarios gathered from 4 major teaching hospitals in the UK, Isabel suggested the correct diagnosis in 83 of 87 cases (95%). In a separate evaluation of 24 case scenarios in which the gold-standard differential diagnosis was established by two senior academic pediatricians, Isabel decreased the chance of clinicians (a mix of trainees and staff clinicians) omitting a key diagnosis by suggesting a highly relevant new diagnosis in 1 of every 8 cases. In that study, the time to enter data and obtain diagnostic suggestions averaged less than 1 minute.9

The Isabel clinical diagnosis support system has now been adapted for adult medicine. The goal of this study was to evaluate the speed and accuracy of this product in suggesting the correct diagnosis in a series of complex cases in Internal Medicine.

METHODS

We considered 61 consecutive "Case Records of the Massachusetts General Hospital" (New England Journal of Medicine, vol. 350:166–176, 2004 and 353:189–198, 2005). Each case had an anatomical or final diagnosis, which was considered to be correct by the discussants. We excluded 11 cases (patients under the age of 10 and cases that focused solely on management issues). The 50 remaining case histories were copied and pasted into the Isabel data-entry field. The pasted material typically included the history, physical examination findings, and laboratory test results, but data from tables and figures were not submitted. Beyond entering the patient's age, sex, and nationality, the investigators did not attempt to otherwise tailor the search strategy. Findings were compared with the recommended (but slightly more time-consuming) strategy of entering discrete key findings, as compiled by a senior internist (MLG). Because the correct diagnosis is presented at the end of each case, data entry was not blinded.

RESULTS

Using the recommended method of manually entering key findings, the list of diagnoses suggested by Isabel contained the correct diagnosis in 48 of the 50 cases (96%). Typically, 3–6 key findings from each case were used. The 2 diagnoses that were not suggested (progressive multifocal leukoencephalopathy and nephrogenic fibrosing dermopathy) were not included in the Isabel database at the time of the study; thus, these 2 diagnoses would never have been suggested, even with different keywords.

Figure 2. The first page of results of the Isabel diagnosis support software. Additional diagnoses are presented by selecting the 'more diagnoses' box.

Using the copy/paste method for entering the whole text, the list of diagnoses suggested by Isabel contained the correct diagnosis in 37 of the 50 cases (74%). Isabel presented 10 diagnoses on the first web page and 10 additional diagnoses on each subsequent page, up to a total of 30 diagnoses. Because users may tend to disregard suggestions shown only on later web pages, we tracked this parameter for the copy/paste method of data entry: the correct diagnosis was presented on the first page in 19 of the 37 cases (51%) and within the first two pages in 28 of the 37 cases (77%). Similar data were not collected for manual data entry because the order of presentation depended on which key findings were entered. Both data-entry approaches were fast: manually entering data and obtaining diagnostic suggestions typically required less than 1 minute per case, and the copy/paste method typically required less than 5 seconds.

DISCUSSION

Diagnostic errors are an underappreciated cause of medical error,10 and any intervention that has the potential to produce correct and timely medical diagnosis is worthy of serious consideration.
Our recent analysis of diagnostic errors in Internal Medicine found that clinicians often stop thinking after arriving at a preliminary diagnosis that explains all the key findings, leading to context errors and 'premature closure', in which further possibilities are not considered.11 These and other errors contribute to diagnoses that are wrong or delayed, causing substantial harm to the patients affected. Systems that help clinicians explore a more complete range of diagnostic possibilities could conceivably reduce these types of error.

Many different CDSSs have been developed over the years, and these typically matched the manually entered features of the case in question to a database of key findings abstracted from experts or the clinical literature. The sensitivity of these systems was in the range of 50%–60%, and the time needed to access and query the database was often several minutes.3 More recently, the possibility of using Google to search for clinical diagnoses has been suggested. However, a formal evaluation of this approach on a subset of the same "Case Records" cases used in our study found a sensitivity of 58%,12 in the range of the first-generation CDSSs and unacceptably low for clinical use.

The findings of our study indicate that CDSS products have evolved substantially. Using the Isabel CDSS, we found that data entry takes under 1 minute, and the sensitivity in a series of highly complex cases approached 100% using entry of key findings. Entry of entire case histories using copy/paste functionality allowed even faster data entry but reduced sensitivity. The loss of sensitivity seemed primarily related to negative findings included in the pasted history and physical (e.g., "the patient denies chest pain"), which are treated as positive findings (chest pain) by the search algorithm.

There are several relevant limitations of this study that make it difficult to predict how Isabel might perform as a diagnostic aid in clinical practice. First, the results obtained here reflect the theoretical upper limit of performance, given that an investigator who was aware of the correct diagnosis entered the key findings. Further, clinicians in real life seldom have the wealth of reliable and organized information that is presented in the Case Records, or the time needed to use a CDSS in every case. To the extent that Isabel functions as a 'learned intermediary',13 success in using the program will also clearly depend on the clinical expertise of the user and their facility in working with Isabel. A more serious concern is whether presenting a clinician with dozens of diagnostic suggestions might be a distraction or lead to unnecessary testing. We have previously identified these trade-offs as an unavoidable cost of improving patient safety: the price of improving the odds of reaching a correct diagnosis is the extra time and resources consumed in using the CDSS and considering alternative diagnoses that might turn out to be irrelevant.14

In summary, the Isabel CDSS performed quickly and accurately in suggesting correct diagnoses for complex adult medicine cases. However, the test setting was artificial, and the CDSS should be evaluated in more natural environments for its potential to support clinical diagnosis and reduce the rate of diagnostic error in medicine.

Acknowledgements: The administrative and library-related support of Ms. Grace Garey and Ms. Mary Lou Glazer is gratefully acknowledged.
Funding was provided by the National Patient Safety Foundation. This study was approved by the Institutional Review Board (IRB).

Conflict of interest statement: None disclosed.

Corresponding Author: Mark L. Graber, MD; Medical Service-111, VA Medical Center, Northport, NY 11768, USA (e-mail: [email protected]).

REFERENCES
1. Berner ES. Clinical Decision Support Systems. New York: Springer; 1999.
2. Garg AX, Adhikari NKJ, McDonald H, et al. Effects of computerized clinical decision support systems on practitioner performance and patient outcomes. JAMA. 2005;293:1223–38.
3. Berner ES, Webster GD, Shugerman AA, et al. Performance of four computer-based diagnostic systems. N Engl J Med. 1994;330:1792–6.
4. Friedman CP, Elstein AS, Wolf FM, et al. Enhancement of clinicians' diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA. 1999;282:1851–6.
5. Lincoln MJ, Turner CW, Haug PJ, et al. Iliad's role in the generalization of learning across a medical domain. Proc Annu Symp Comput Appl Med Care. 1992;174–8.
6. Kassirer JP. A report card on computer-assisted diagnosis—the grade: C. N Engl J Med. 1994;330:1824–5.
7. Graber MA, VanScoy D. How well does decision support software perform in the emergency department? Emerg Med J. 2003;20:426–8.
8. Ramnarayan P, Tomlinson A, Rao A, Coren M, Winrow A, Britto J. ISABEL: a web-based differential diagnosis aid for paediatrics: results from an initial performance evaluation. Arch Dis Child. 2003;88:408–13.
9. Ramnarayan P, Roberts GC, Coren M, et al. Assessment of the potential impact of a reminder system on the reduction of diagnostic errors: a quasi-experimental study. BMC Med Inform Decis Mak. 2006;6:22.
10. Graber ML. Diagnostic error in medicine—a case of neglect. Joint Comm J Saf Qual Improv. 2004;31:112–19.
11. Graber ML, Franklin N, Gordon RR. Diagnostic error in internal medicine. Arch Intern Med. 2005;165:1493–9.
12. Tang H, Ng JHK. Googling for a diagnosis—use of Google as a diagnostic aid: internet based study. Br Med J. 2006.
13. Kantor G. Guest software review: Isabel diagnosis software. 2006;1–31.
14. Graber ML, Franklin N, Gordon R. Reducing diagnostic error in medicine: what's the goal? Acad Med. 2002;77:981–92.

HOW DOES A WEB-BASED DIAGNOSIS DECISION SUPPORT SYSTEM IMPROVE DIAGNOSIS ACCURACY IN A CRITICAL CARE SETTING?

Thomas NJ, Ramnarayan P, Bell P et al. An international assessment of a web-based diagnostic tool in critically ill children. Technology and Health Care 16 (2008) 103–110.

An international assessment of a web-based diagnostic tool in critically ill children

Neal J. Thomas, Padmanabhan Ramnarayan, Michael J. Bell, Prabhat Maheshwari, Shaun Wilson, Emily B. Nazarian, Lorri M. Phipps, David C. Stockwell, Michael Engel, Frank A. Maffei, Harish G. Vyas and Joseph Britto
Penn State Children's Hospital and The Pennsylvania State University College of Medicine, Hershey, PA, USA; Children's Acute Transport Service, London, UK; Pediatric Critical Care Medicine, Children's National Medical Center, Washington DC, USA; Pediatric Intensive Care Unit, St Mary's Hospital, Paddington, London, UK; Pediatric Intensive Care Unit, Queen's Medical Centre, Nottingham, UK; Pediatric Critical Care Medicine, Golisano Children's Hospital, University of Rochester at Strong, Rochester, NY, USA; Isabel Healthcare Inc., Reston, VA, USA

Received 5 September 2007; revised/accepted 30 October 2007

Abstract. Improving diagnostic accuracy is essential. The extent of diagnostic uncertainty at patient admission is not well described in critically ill children. We therefore studied the extent to which pediatric trainee diagnostic performance could be improved with the aid of a computerized diagnostic tool. Data regarding patient admissions to five Pediatric Intensive Care Units were collected, including patients' clinical details, the admitting team's diagnostic workup and the discharge diagnosis. An attending physician assessed each case independently and suggested additional diagnostic possibilities. Diagnostic accuracy was calculated using the discharge diagnosis as the gold standard. 206 of 927 patients (22.2%) admitted to the PICUs did not have an established diagnosis at admission. The trainee teams considered a median of three diagnoses in their workup (IQR 3–5) and made an accurate diagnosis in 89.4% of cases (95% CI 84.6%–94.2%). Diagnostic accuracy improved to 92.5% with use of the diagnostic tool alone, and to 95% with the addition of the attending physicians' diagnostic suggestions. We conclude that a modest proportion of admissions to these PICUs were characterized by diagnostic uncertainty during initial assessment. Although the accuracy rate of initial assessment in our clinical setting was relatively high, it was further improved by both the diagnostic tool and the physicians' diagnostic suggestions. It is plausible that the tool's utility would be even greater in clinical settings with less expertise in critical illness assessment, such as community hospitals or the emergency departments of non-training institutions. The role of diagnostic aids in the care of critically ill children merits further study.

Keywords: Diagnosis, diagnostic reminder system, internet, pediatrics

Address for correspondence: Neal J. Thomas, M.D., M.Sc, Pediatric Critical Care Medicine, Penn State Children's Hospital, 500 University Drive, MC H085, Hershey, PA 17033, USA. Tel.: +1 717 531 5337; Fax: +1 717 531 0809; E-mail: [email protected].

1. Introduction

Accurate diagnosis of medical conditions has been of paramount importance for centuries, and faulty diagnostic evaluation can lead to patient harm and poor medical management. Three etiologies of diagnostic error have been recognized within the medical community: no-fault errors, system-related errors and cognitive errors [20]. No-fault errors occur either when disease presentation is unusual and unpredictable or when patients are uncooperative. System-related errors occur as a result of either organizational flaws or mechanical problems within the institution.
Cognitive errors occur because of a lack of adequate medical knowledge or inadequate data collection. With the medical literature growing at an exponential rate, avoiding cognitive errors with the help of emerging technology may be the most efficient way to improve diagnostic skills.

A large body of evidence suggests that diagnostic error by clinicians poses a significant problem across a wide variety of settings, comprising 10–30% of adverse medical events (AME) [9,22]. One recent observational study suggested that 10% of AME in adult critical care patients were related to misdiagnosis [16]. Moreover, autopsy studies have demonstrated that major diagnoses are missed in critically ill adults and children who die in intensive care units [2,4,8,17].

We hypothesized that a web-based diagnostic reminder system would produce additional differential diagnostic possibilities for children admitted without an established diagnosis to five international PICUs. We further hypothesized that the web-based tool would increase diagnostic accuracy when the findings recorded in the admission history and physical examination were entered into it. To do this, we compared the accuracy of determining the discharge diagnosis for (i) the admitting team, (ii) an unbiased board-certified/board-eligible pediatric intensivist and (iii) ISABEL, a web-based diagnostic reminder system (www.isabelhealthcare.com) [14,21].

2. Materials and methods

This prospective study was conducted over a three-month period at five pediatric intensive care units (PICUs), two from the United Kingdom (UK) and three from the United States (USA). The selection of ICUs was based on working relationships between the five primary investigators at each site, with the goal of studying institutions that had diverse patient populations, different patient volumes, and diverse cultures. All patient data collection was approved by the research ethics committees (institutional review boards) at the participating centers prior to initiation of the study. The requirement for informed consent was waived.

All admissions to the PICU during the study period were screened by a designated study investigator at each center. Medical admissions to the PICU without an established diagnosis were eligible for study. All surgical admissions and medical admissions with known primary diagnoses (such as culture-proven sepsis, diabetic ketoacidosis and others) were excluded. All the PICUs utilize various medical personnel (including residents from pediatrics or pediatric subspecialties, clinical fellows, advanced practice nurses (APNs), physician assistants and pediatric intensivists) as their daily clinical team.

2.1. Diagnosis by admitting team

At each PICU, the screening investigator (a pediatric critical care medicine fellow, pediatric resident or pediatric APN) reviewed all admissions and enrolled eligible patients. Importantly, these investigators neither admitted the child to the PICU nor participated in the clinical care given to the child, so that the utility of the tool itself could be tested without bias from difficulties with the system or changes in routine practice. The medical records of eligible patients were reviewed, and the presenting symptoms, physical findings and differential diagnosis (including all diagnostic possibilities) were extracted.
The screening investigator further determined the discharge diagnosis based on a review of the medical record at discharge.

2.2. Diagnosis by ISABEL

The screening investigator at each center used a customized study version of ISABEL to securely log in, enter, and store data on eligible patients. Each patient was assigned a unique study number, and the date and time of electronic access were logged. Clinical data from the chart were entered and used to drive the diagnostic tool's algorithm for the production of diagnostic possibilities.

ISABEL is a Web-based diagnostic decision support system (DSS). Computerized decision support tools utilize two or more pieces of clinical data to provide patient-specific recommendations; assist in diverse clinical processes such as diagnosis, prescribing or clinical management; and consist of an underlying knowledge base, a user interface for data input, and an inference engine [23]. The ISABEL knowledge base consists of more than 100,000 raw text documents describing more than 10,000 diseases. Documents are drawn from multiple resources, such as reputable textbooks (e.g. the Oxford Textbook of Medicine) and review articles in journals. The user interface permits data entry in the form of a free-text description of clinical findings, without the need for a specific terminology. The inference engine utilizes statistical natural language processing software to search through the entire database of text documents and returns all disease descriptions that match the clinical features entered. This process is similar to a search engine, although much more sensitive and specific. The software indexes each document in the database daily, extracting relevant concepts based on the frequency of their occurrence, their uniqueness and their proximity to other concepts. Each diagnosis in the database is assigned clinical weighting scores, derived from expert opinion, to reflect its prevalence by age group (e.g. child), gender and geographical region (e.g. North America). In the current study, when investigators entered the historical data and physical findings noted at admission for each patient as search terms, and applied the appropriate age, gender and region filters, 10–12 matching diagnoses from the database were displayed, arranged by body system (e.g. cardiovascular, respiratory) rather than by clinical probability. Hyperlinks from each diagnostic suggestion led to the text documents that matched the search query. There was no limit on the number of search terms that could be entered, although the tool performed well even with 3–5 terms.

A number of different diagnostic DSS have been studied in the past. Tools such as De Dombal's abdominal pain system were developed using data from more than 6000 patients with proven surgical causes of abdominal pain and their clinical findings. Using probabilistic techniques, this tool assisted clinicians in differentiating surgical from non-surgical causes of abdominal pain [5]. Other DSS, such as DXplain, Quick Medical Reference and ILIAD, utilized a complex database of relationships between hundreds of diseases and thousands of clinical findings (e.g. rheumatoid arthritis and fever), derived from expert opinion, to provide diagnostic hypotheses (i.e. quasi-probabilistic systems) [1]. Keeping such knowledge bases up to date was a challenge: an expert panel needed to frequently appraise new information and make significant changes to existing relationships within the database.
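The indexing and weighting ideas just described (concept frequency, uniqueness across the corpus, and expert-assigned demographic weights) might be sketched as follows. The formulas and all values are illustrative assumptions, not ISABEL internals; proximity between concepts is omitted for brevity.

```python
# Hedged sketch: TF-IDF-like concept weighting plus a demographic multiplier.
import math

def concept_weight(term: str, doc_terms: list[str], corpus: list[list[str]]) -> float:
    """Weight a concept by frequency in this document and rarity across the corpus."""
    tf = doc_terms.count(term) / len(doc_terms)       # frequency of occurrence
    containing = sum(term in doc for doc in corpus)   # how unique the term is
    idf = math.log(len(corpus) / (1 + containing))
    return tf * idf

# Expert-assigned clinical weighting scores by (disease, age group, region);
# the entries and numbers below are invented for illustration.
CLINICAL_WEIGHT = {
    ("kawasaki disease", "child", "north america"): 1.5,
    ("kawasaki disease", "adult", "north america"): 0.4,
}

def adjusted_score(base: float, disease: str, age_group: str, region: str) -> float:
    """Scale a retrieval score by the expert prevalence weighting, if any."""
    return base * CLINICAL_WEIGHT.get((disease, age_group, region), 1.0)
```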
The ISABEL system utilizes a novel mechanism to keep its knowledge base current. When an updated textbook or review becomes available on a topic, the old documents are simply replaced with the new text in the database, without any expert input. A small content management team searches monthly for newer editions of textbooks and reviews, and updates the database on a quarterly basis. Expert input is necessary only to modify the clinical weighting scores, and is undertaken as soon as important new information becomes available (e.g. the SARS virus outbreak in Hong Kong). Since the tool is web based, updates are instantly available to users. A more detailed description of the methodology involved in the development and maintenance of this tool is available [13].

2.3. Diagnosis by board-certified/board-eligible pediatric intensivist

At each center, a pediatric intensivist independently examined the clinical data available at the time of admission for each patient and the admitting PICU team's diagnostic workup; the discharge diagnosis was withheld from the attending physician during this process. They then highlighted further diagnoses from within the computerized tool's list that they considered clinically relevant or significant for patient assessment. This judgment was based on whether consideration of that particular diagnosis in the workup would have triggered a specific action (either performing a further focused clinical examination or a particular test or treatment). If two closely related (or nearly synonymous) diagnoses were displayed in the list, only one was taken into account. This investigator was then able to propose additional diagnoses they felt were relevant, generating an additional list of potential diagnoses.

2.4. Diagnostic accuracy assessment

Diagnostic accuracy was examined by matching the discharge diagnosis with the lists generated by (i) the admitting team, (ii) ISABEL and (iii) the study intensivist. The diagnosis was considered accurate if the discharge diagnosis was located within the list of diagnoses generated by each of these three methods. An a priori calculation demonstrated that 250 patients would be required to identify the frequency of additional significant diagnoses in the diagnostic tool's list, assuming 20% of the patients were without an established diagnosis (with 80% power, type I error 5%). Data collection was planned for three months at each unit, based on an estimated annual admission rate of approximately 4000 patients. The diagnostic accuracy rates of the admitting teams, ISABEL and the attending physicians were calculated against the discharge diagnosis. Results were calculated as proportions and expressed as percentages with 95% confidence interval limits. The chi-square test was used to detect statistically significant differences between proportions (defined as p value < 0.05). Factors influencing diagnostic accuracy were analyzed by entering the variables into a multiple logistic regression model.
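The proportion-with-95%-CI calculation described in Section 2.4 can be reproduced in a few lines. The normal-approximation (Wald) interval below is an assumption, since the paper does not state which confidence interval method was used.

```python
# Minimal sketch of the accuracy statistics described above.
import math

def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and normal-approximation 95% confidence limits."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

# Admitting team: 144 of the 161 analyzed cases contained the discharge diagnosis.
p, lo, hi = proportion_ci(144, 161)
print(f"accuracy {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
# -> accuracy 89.4% (95% CI 84.7% to 94.2%), close to the abstract's 89.4% (84.6%-94.2%)
```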
3. Results

Data were collected from all five participating centers during the study. All centers admitted a general mix of medical and surgical pediatric patients. The characteristics of each of the participating centers are shown in Table 1. A total of 927 patients were admitted to the five PICUs between May and July 2003. The majority of patients were either surgical/post-surgical or had an established diagnosis by the time they were admitted to the PICU (721/927, 77.8%). A total of 206 patients (22.2%) had no established diagnosis at the time of PICU admission and were therefore eligible for study; 69 were from the UK units and 137 from the US units. Admission data for each center are presented in Table 2. Of these 206 patients, 45 lacked discharge diagnoses and were excluded; the remaining 161 patients were analyzed.

There were significant differences between the various centers. The number of patients without an established diagnosis at admission varied between centers, ranging from 19.6% to 77.8% (mean 38.9%; chi-square test, p < 0.05). The proportion of surgical admissions on the units ranged from 22.7% to 65% (mean 42.9%).

Table 1
Key baseline characteristics of the participating critical care units

                                   C1        C2        C3        C4        C5
Country                            UK        UK        USA       USA       USA
Total number of beds               6         10        12        12        26
Medical admissions in 2003 (%)     143 (35)  363 (77)  289 (37)  402 (62)  1001 (67)
Surgical admissions in 2003 (%)    267 (65)  106 (23)  483 (63)  248 (38)  492 (33)
Total annual admissions*           410       469       772       650       1493

* Total annual admissions at all participating centers: 3794. (C = center; UK = United Kingdom; USA = United States.)

Table 2
Breakdown of patient admissions according to diagnostic groups: May–July 2003

        Known surgical   Known medical   Unknown          Total
        diagnosis (%)    diagnosis (%)   diagnosis (%)*
C1      67 (65.0)        8 (7.8)         28 (27.2)        103
C2      22 (22.7)        33 (34.0)       42 (43.3)        97
C3      117 (61.3)       30 (15.7)       44 (23.0)        191
C4      60 (38.2)        78 (49.7)       19 (12.1)        157
C5      132 (34.8)       174 (45.9)      73 (19.2)        379
Total   398 (42.9)       323 (34.8)      206 (22.2)       927

* The difference between centers was statistically significant (chi-square test, p < 0.05). (C = center.)

The admitting team's differential list contained a median of three diagnoses per case (IQR 3–5, range 1–12) and contained the discharge diagnosis in 144/161 cases (89.4%). In five of the remaining 17 cases, in which the admitting team did not include the discharge diagnosis in their initial workup, the diagnostic tool displayed the final diagnosis in its list of suggestions. ISABEL generated a differential list of between 10 and 12 diagnoses, and this list contained the final diagnosis in 149/161 cases (92.5%). The attending intensivist accurately provided the final diagnosis in a further four cases, leading to an overall accuracy rate of 92% (when combined with the admitting team) and 95% (when combined with the admitting team and the diagnostic tool). Although the diagnostic tool displayed the discharge diagnosis in five patients in whom the admitting team had failed to consider it, the attending physician identified only two of these as being clinically relevant to the case. The admitting team's diagnostic accuracy appeared to be a function of the age of the patient, but not of the number of diagnoses considered at admission or of the diagnostic group.
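As a quick arithmetic check of the combined figures above (and assuming, as the text implies, that the tool's five additional hits and the intensivist's further four do not overlap):

```python
# Worked check of the reported accuracy combinations over the 161 analyzed cases.
n = 161
team = 144            # admitting team listed the discharge diagnosis
tool_extra = 5        # ISABEL displayed it in 5 of the team's 17 misses
attending_extra = 4   # the intensivist added it in a further 4 cases

print(f"team alone:              {team / n:.1%}")                                   # 89.4%
print(f"team + intensivist:      {(team + attending_extra) / n:.1%}")               # 91.9%, reported as 92%
print(f"team + tool + attending: {(team + tool_extra + attending_extra) / n:.1%}")  # 95.0%
```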
4. Discussion

In summary, we have demonstrated that within tertiary care PICUs, approximately 20% of critically ill children are admitted without an established diagnosis and that the clinical team makes an accurate preliminary diagnosis almost 90% of the time. This rate was improved both by use of the ISABEL tool and by the higher degree of clinical training and experience of an intensivist. As we chose to examine every PICU admission during the study period, the tool may have added utility in a variety of patient populations, particularly those presenting complex diagnostic challenges.

In general, it is believed that improvements in technology will ultimately lead to improved outcomes for patients in the health care systems of all countries. While this might be true in some instances, such as the prevention of medication errors [11], there is also some evidence that increased reliance on technology can lead to increased patient harm [7]. To our knowledge, ISABEL is unique in that it collates data from multiple sources, including textbooks, journal articles and reviews, to generate a list of potential diagnoses for the user. However, its ability to improve outcome has yet to be demonstrated, and the direct impact of the system's use on patient outcome is an area that requires further clinical study.

Estimates of diagnostic uncertainty in critical care have come from various sources. In extreme circumstances, persistent diagnostic uncertainty is often resolved only by post-mortem studies. A number of autopsy studies have concluded that the rate of clinically significant missed diagnoses in intensive care patients ranges from 2.1% to 21.2% [10,15,19]. In addition, it has been reported that in 17% of pediatric autopsies in critical care, pre-mortem knowledge of the diagnosis would have influenced survival by affecting decision making [3]. It is therefore crucial to suspect and confirm the correct diagnosis early in order to improve outcome in these children. The rate of missed diagnoses leading directly to death is higher in patients with a shorter stay in critical care, while unexpected minor findings are common in patients with a longer ICU stay, presumably because of the addition of new medical problems, either nosocomial or iatrogenic, acquired while undergoing critical care treatment [6]. These post-mortem studies suggest a significant burden of missed diagnoses, and imply that diagnosis constitutes an important part of critical care decision making. We did not find such a critical burden of missed diagnoses in our study, and the missed diagnoses in our patients did not appear to significantly alter patient care or outcome. However, prospective evaluation of diagnostic errors in the critical care environment suggests that they constitute only a small proportion of medical error; medication-related errors are the most common cause of AME in this setting [16]. A more rigorous evaluation of the ISABEL tool in a variety of clinical situations may lead to improved knowledge about missed diagnoses, as well as a more thorough evaluation of ISABEL's usefulness.

We aimed to explore the role of a diagnostic tool in critical illness as part of this preliminary study. Recently, ISABEL has been shown to be an effective reminder system in adult emergency departments, with 95% accuracy in presenting the user with the final diagnosis and 90% accuracy in capturing "must-not-miss" diagnoses [12]. We undertook the current study in the Pediatric Intensive Care Units of the five centers for several reasons. First, we wanted to demonstrate the utility of ISABEL in a complex population of children with complex medical conditions, thereby collecting information on a very challenging study population.
Second, we believe it is important to demonstrate the utility of a tool in diverse patient groups. As three of the PICUs were in the USA and two were in the UK, and all of the units showed similar ISABEL performance, we believe this highlights the potential utility and generalizability of the tool. Moreover, the patient volumes at the five sites differed by more than 300%, again arguing for the utility of the tool in a diverse population.

The most interesting question that can be asked of our data and device is whether clinicians can improve their own diagnostic ability when working in collaboration with ISABEL. ISABEL is not intended to become a substitute for clinical acumen, but rather an ancillary tool for establishing potential diagnoses. We believe this is consistent with the view that the performance of the intended user working in combination with the tool is greater than the performance of the user alone [18]. Obviously, this requires the user-computer interaction to be seamless and user-friendly. None of our more than a dozen investigators had difficulty in using the ISABEL tool to generate useful diagnostic lists. Further studies with more diverse clinician populations may demonstrate whether this interface is sufficient for all users' needs.

This study has a number of limitations. Selecting only cases without an established diagnosis at admission to the ICU may not accurately reflect the requirement or need for diagnostic decision making. Since diagnoses are constantly being made and adjusted within the first hours of PICU admission, it is possible that ISABEL might have uses other than the one we tested. Our methods also did not allow for testing in other settings, such as the emergency department, where less data are available and diagnostic uncertainty might be considerably greater. Second, we studied this tool only in tertiary care facilities with well-trained clinicians and highly productive trainees; the results of a similar study in a smaller unit are unpredictable at this time. It is possible that the experience of the trainees masked the true utility of the ISABEL device. Conversely, it is possible that in a smaller unit the diagnoses might be less challenging, which might make ISABEL's utility even smaller. Only future studies can answer this question. Lastly, we intentionally excluded surgical cases from our study population, even though they made up almost 50% of total admissions to the PICUs. It is therefore not possible to assess the utility of ISABEL in diagnosing common pediatric surgical emergencies, such as a ruptured appendix, or other disorders.

In conclusion, ISABEL improved the diagnostic accuracy of the clinical team caring for children arriving at the PICU without an established diagnosis. Future studies demonstrating ISABEL's utility in other patient populations will be required to fully assess whether web-based tools can improve children's outcomes after critical illness. It will be important to determine whether tools such as ISABEL can be seamlessly integrated into clinical practice and whether subsequent versions of the tool can remain sufficiently rigorous as the field of pediatrics evolves and changes.
This challenge must be met by constant updating and maintenance of diagnostic tools, as well as ongoing evaluation of the efficacy of these tools in practice.

Acknowledgements

The computerized diagnostic tool studied was provided free to registered users by the Isabel Medical Charity at the time of this study. It is now managed by a commercial entity, Isabel Healthcare, and is available only to subscribers. Dr P Ramnarayan currently advises Isabel Healthcare on research activities on a part-time basis. Dr J Britto is Clinical Director of Isabel Healthcare Inc, USA. All the other authors have no competing interests.

References

[1] E.S. Berner, G.D. Webster, A.A. Shugerman, J.R. Jackson, J. Algina, A.L. Baker et al., Performance of four computer-based diagnostic systems, N Engl J Med 330 (1994), 1792–1796.
[2] S.A. Blosser, H.E. Zimmerman and J.L. Stauffer, Do autopsies of critically ill patients reveal important findings that were clinically undetected? Crit Care Med 26 (1998), 1332–1336.
[3] M.P. Cardoso, D.C. Bourguignon, M.M. Gomes, P.H. Saldiva, C.R. Pereira and E.J. Troster, Comparison between clinical diagnoses and autopsy findings in a pediatric intensive care unit in Sao Paulo, Brazil, Pediatr Crit Care Med 7 (2006), 423–427.
[4] A.M. Combes, A. Mokhtari, A. Couvelard, J.L. Trouillet, J. Baudot, D. Henin et al., Clinical and autopsy diagnoses in the intensive care unit: a prospective study, Arch Intern Med 164 (2004), 389–392.
[5] F.T. De Dombal, D.J. Leaper, J.R. Staniland, A.P. McCann and J.C. Horrocks, Computer-aided diagnosis of acute abdominal pain, Br Med J 2 (1972), 9–13.
[6] G.M. Dimopoulos, M. Piagnerelli, J. Berre, I. Salmon and J.L. Vincent, Post mortem examination in the intensive care unit: still useful? Intensive Care Med 30 (2004), 2080–2085.
[7] Y.Y. Han, J.A. Carcillo, S.T. Venkataraman, R.S. Clark, R.S. Watson, T.C. Nguyen et al., Unexpected increased mortality after implementation of a commercially sold computerized physician order entry system, Pediatrics 116 (2005), 1506–1512.
[8] P. Kumar, J. Taxy, D.B. Angst and H.H. Mangurten, Autopsies in children: are they still useful? Arch Pediatr Adolesc Med 152 (1998), 558–563.
[9] L.L. Leape, T.A. Brennan, N. Laird, A.G. Lawthers, A.R. Localio, B.A. Barnes et al., The nature of adverse events in hospitalized patients. Results of the Harvard Medical Practice Study II, N Engl J Med 324 (1991), 377–384.
[10] G.D. Perkins, D.F. McAuley, S. Davies and F. Gao, Discrepancies between clinical and postmortem diagnoses in critically ill patients: an observational study, Crit Care 7 (2003), R129–R132.
[11] A.L. Potts, F.E. Barr, D.F. Gregory, L. Wright and N.R. Patel, Computerized physician order entry and medication errors in a pediatric critical care unit, Pediatrics 113 (2004), 59–63.
[12] P. Ramnarayan, N. Cronje, R. Brown, R. Negus, B. Coode, P. Moss et al., Validation of a diagnostic reminder system in emergency medicine: a multi-centre study, Emerg Med J 24 (2007), 619–624.
[13] P. Ramnarayan, A. Tomlinson, G. Kulkarni, A. Rao and J. Britto, A novel diagnostic aid (ISABEL): development and preliminary evaluation of clinical performance, Medinfo 11 (2004), 1091–1095.
[14] P. Ramnarayan, A. Tomlinson, A. Rao, M. Coren, A. Winrow and J.
Britto, ISABEL: a web-based differential diagnostic aid for paediatrics: results from an initial performance evaluation, Arch Dis Child 88 (2003), 408–413.
[15] J. Roosen, E. Frans, A. Wilmer, D.C. Knockaert and H. Bobbaers, Comparison of premortem clinical diagnoses in critically ill patients and subsequent autopsy findings, Mayo Clin Proc 75 (2000), 562–567.
[16] J.M. Rothschild, C.P. Landrigan, J.W. Cronin, R. Kaushal, S.W. Lockley, E. Burdick et al., The Critical Care Safety Study: The incidence and nature of adverse events and serious medical errors in intensive care, Crit Care Med 33 (2005), 1694–1700.
[17] K.G. Shojania, E.C. Burton, K.M. McDonald and L. Goldman, Changes in rates of autopsy-detected diagnostic errors over time: a systematic review, JAMA 289 (2003), 2849–2856.
[18] K.G. Shojania, E.C. Burton, K.M. McDonald and L. Goldman, Overestimation of clinical diagnostic performance caused by low necropsy rates, Qual Saf Health Care 14 (2005), 408–413.
[19] J.J. Stambouly, E. Kahn and R.A. Boxer, Correlation between clinical diagnoses and autopsy findings in critically ill children, Pediatrics 92 (1993), 248–251.
[20] D.C. Sutherland, Improving medical diagnoses by understanding how they are made, Intern Med J 32 (2002), 277–280.
[21] N.J. Thomas, Isabel, Crit Care 7 (2002), 99–100.
[22] C. Vincent, G. Neale and M. Woloshynowych, Adverse events in British hospitals: preliminary retrospective record review, BMJ 322 (2001), 517–519.
[23] J.C. Wyatt and J.L. Liu, Basic concepts in medical informatics, J Epidemiol Community Health 56 (2002), 808–812.

WHAT IS THE MAGNITUDE OF DIAGNOSIS ERROR?

Gordon D. Schiff, Seijeoung Kim, Richard Abrams et al. Diagnosing Diagnosis Errors: Lessons from a Multi-institutional Collaborative Project. Advances in Patient Safety 2005; 2:255-278

Diagnosing Diagnosis Errors: Lessons from a Multi-institutional Collaborative Project

Gordon D. Schiff, Seijeoung Kim, Richard Abrams, Karen Cosby, Bruce Lambert, Arthur S. Elstein, Scott Hasler, Nela Krosnjar, Richard Odwazny, Mary F. Wisniewski, Robert A. McNutt

Abstract

Background: Diagnosis errors are frequent and important, but represent an underemphasized and understudied area of patient safety. Diagnosis errors are challenging to detect and dissect. It is often difficult to agree whether an error has occurred, and even harder to determine with certainty its causes and consequences. The authors applied four safety paradigms: (1) diagnosis as part of a system, (2) less reliance on human memory, (3) the need to open "breathing space" to reflect and discuss, and (4) multidisciplinary perspectives and collaboration.

Methods: The authors reviewed the literature on diagnosis errors and developed a taxonomy delineating the stages of the diagnostic process: (1) access and presentation, (2) history taking/collection, (3) the physical exam, (4) testing, (5) assessment, (6) referral, and (7) followup. The taxonomy identifies where in the diagnostic process failures occur. The authors used this approach to analyze diagnosis errors collected over a 3-year period of weekly case conferences and through a survey of physicians.

Results: The authors summarize challenges encountered in their review of diagnosis error cases, presenting lessons learned using four prototypical cases. A recurring issue is the sorting-out of relationships among errors in the diagnostic process, delay and misdiagnosis, and adverse patient outcomes.
To help understand these relationships, the authors present a model that identifies four key challenges in assessing potential diagnosis error cases: (1) uncertainties about diagnosis and findings, (2) the relationship between diagnosis failure and adverse outcomes, (3) challenges in reconstructing clinician assessment of the patient and clinician actions, and (4) global assessment of improvement opportunities.

Conclusions and recommendations: Finally, the authors catalogue a series of ideas for change. These include: reengineering followup of abnormal test results; standardizing protocols for reading x-rays/lab tests, particularly in training programs and after hours; identifying "red flag" and "don't miss" diagnoses and situations, with use of manual and automated checklists; engaging patients on multiple levels to become "coproducers" of safer medical diagnosis practices; and weaving "safety nets" to mitigate harm from uncertainties and errors in diagnosis. These change ideas need to be tested and implemented for more timely and error-free diagnoses.

Introduction

Diagnosis errors are frequent and important, but represent an underemphasized and understudied area of patient safety.1–8 This belief led us to embark on a 3-year project, funded by the Agency for Healthcare Research and Quality (AHRQ), to better understand where and how diagnosis fails and to explore ways to target interventions that might prevent such failures. Diagnosis errors are common and underemphasized, but they are also challenging to detect and dissect; it is often difficult even to agree whether or not a diagnosis error has occurred. In this article we describe how we have applied patient safety paradigms (blame-free reporting/reviewing/learning, attention to process and systems, an emphasis on communication and information technology) to better understand diagnosis error.2,7–9 We review evidence about the types and importance of diagnosis errors and summarize challenges we have encountered in our review of more than 300 cases of diagnosis error. In the second half of the article, we present lessons learned through analysis of four prototypical cases. We conclude with suggested "change ideas": interventions for improvement, testing, and future research.

Although much of the patient safety spotlight has focused on medication errors, two recent studies of malpractice claims revealed that diagnosis errors far outnumber medication errors as a cause of claims lodged (26 percent versus 12 percent in one study;10 32 percent versus 8 percent in another11). A Harris poll commissioned by the National Patient Safety Foundation found that one in six people had personally experienced a medical error related to misdiagnosis.12 Most medical error studies find that 10–30 percent (range = 0.6–56.8 percent) of errors are errors in diagnosis (Table 1).1–3,5,11,13–21 A recent review of 53 autopsy studies found an average rate of 23.5 percent major missed diagnoses (range = 4.1–49.8 percent). Selected disease-specific studies (Table 2)6,22–32 also show that substantial percentages of patients (range = 2.1–61 percent) experienced missed or delayed diagnoses. Thus, while these studies view the problem from varying vantage points using heterogeneous methodologies (some nonsystematic and lacking standardized definitions), what emerges is compelling evidence for the frequency and impact of diagnosis error and delay.
Of the 93 safety projects funded by AHRQ, only 1 is focused on diagnosis error, and none of the 20 evidence-based AHRQ Patient Safety Indicators directly measures failure to diagnose.33 Nonetheless, for each of AHRQ's 26 "sentinel complications" (e.g., decubitus ulcer, iatrogenic pneumothorax, postoperative septicemia, accidental puncture/laceration), timely diagnosis can be decisive in determining whether patients experience major adverse outcomes. Hence, while diagnosis error remains more in the shadows than in the spotlight of patient safety, this aspect of clinical medicine is clearly vulnerable to well-documented failures and warrants examination through the lens of modern patient safety and quality improvement principles.

Table 1. General medical error studies that reported errors in diagnosis (columns: Author, Year | Context/Design | Diagnosis errors/Total errors (%) | Comments/Types of Diagnostic Errors)

Flannery FT (1991) | Physicians and surgeons update, St. Paul, MN; malpractice claims data reviewed | 1126/7233 (27.4%) failure to diagnose | Top 5 diagnoses (cancer, circulatory/thrombosis, fracture/dislocation, lack of attendance, infection)

Leape LL (1991) | Medical practice study, 1984, AE in NY hospitals | 11731/68645 (17.1%) | 17.1% of preventable errors due to diagnostic errors; 71% of them were negligent

Bogner M (1994) | Human error in medicine | 168/1276 (13.8%) | Failure to use indicated tests, act on tests, order appropriate tests; delay in diagnosis; practicing outside area of expertise

Bhasale AL (1998) | Australian GP, diagnostic incidents | 275/805 (34.2%) | Calculated rates (N = 275): missed diagnosis (40 per 100), delayed (34), misdiagnosed (23), diagnostic procedural (14)

Wilson RM (1999) | The Quality in Australian Health Care Study | 267/470 (56.8%) of AEs due to delays were delays in diagnosis; 191 (40.6%) treatment delay; in the investigation category, 198/252 (78.6%) investigation not performed, 39 (15.5%) not acted on, 9 (3.6%) inappropriate | 1922/2351 AEs (81.8%) associated w/ human errors: 470 (20%) delays in diagnosis or treatment, 460 (19.6%) treatment, 252 (10.7%) investigation. 1922 AEs w/ 2940 causes identified; causes of human errors: 465/2940 (15.8%) failure to synthesize/decide/act on information, 346 (11.8%) failure to request or arrange an investigation

Weingart S (2000) | Potential adverse events in hospitalized patients | 5/118 (0.6%) | 118/840 (14%) had AE; 57% of all AE cognitive. 5/118 (0.6%) of admissions were associated with incorrect diagnoses: 2 missed heart failures, 2 incorrect assessments of abdominal pain, 1 missed fracture

Neale G (2001) | AE in England hospitals | 29/110 (26.4%) process of care problems associated w/ diagnosis | 18/110 (16.4%) inadequate evaluation, 7 (6.4%) diagnostic error, 4 (3.6%) delayed consultation

Baker GR (2002) | Mailed survey questionnaires regarding patient safety issues, Canadian health care facilities, colleges and associations | 1/25 (4%) of health care errors in health care facilities, 9/21 (39%) in colleges/associations | Human factors, 3 (12%) and 9 (39%); competency, 5 (20%) and 2 (9%)

JCAHO Sentinel event advisory group (2002) | ED sentinel event | 23/55 (42%) | 55 delays in treatment, due to misdiagnosis (42%), test results availability (13%), delayed initial assessment (7%). Most frequent missed diagnosis: meningitis 7/23 (30%); also cardiac disease, PE, trauma, asthma, neurologic disorder

Makeham MA (2002) | GP in 6 countries | 17/104 (13%) in Australia, 55/236 (19%) in other countries | Process errors 104/134 (78%) in Australia, 235/301 (78%) elsewhere; of the process errors, investigation errors (lab errors, diagnostic imaging errors, and others) were 13% and 19%

Medical malpractice lawyers and attorneys online (2002) | Medical malpractice | 3–8% | Failure to diagnose, failure to diagnose in timely fashion

Kravitz RL (2003) | Malpractice claims data, 4 specialties | 40%; 26126 claims peer reviewed, 5921/26126 (23%) related to negligence; 1371 claims for failure to order appropriate diagnostic testing/monitoring | 2003/5921 (34%) associated with diagnosis error; 16% failure to monitor case, 4% failure/delay in referral, 2% failure to recognize a complication

Chaudhry S (2003) | Error detection by attending physicians, general medicine inpatients | 55/528 (10.4%) patients admitted to the hospitalists had at least 1 error

Phillips R (2004) | Malpractice claims in primary care | 12/63 (19.1%) of all errors, 8/39 (20.5%) of near misses, and 4/24 (16.7%) of AEs were related to diagnostic errors

AE = adverse events; GP = general practitioners; JCAHO = Joint Commission for Accreditation of Healthcare Organizations; ED = emergency department

Table 2. Illustrative disease-specific studies of diagnosis errors (columns: Author | Disease/Design | Findings)

Steere AC (1993) | Lyme disease | Overdiagnosis of Lyme disease (patients given the diagnosis who on review did not meet criteria for it): 452/788 (57%)
Craven ER (1994) | Glaucoma | Diagnosis error in glaucoma claims/lawsuits: 42/194 (21.7%)

Lederle FA (1994) | Ruptured abdominal aortic aneurysm | Ruptured aneurysm diagnosis initially missed: 14/23 (61%)

Mayer PL (1996) | Symptomatic cerebral aneurysm | Patients initially misdiagnosed: 54/217 (25%, ranging from 13% to 35% at 4 sites); of these, 26/54 (48%) deteriorated before definitive treatment. Erroneous working diagnoses in the 54 patients: viral meningitis, 8 (15%); migraine, 7 (13%); headache of uncertain etiology, 7 (13%)

Williams V (1997) | Brain and spinal cord biopsies | Errors based on "second opinion" reviews: 214/500 (43%); 44/214 (20%) w/ serious complications, 96 (45%) w/ substantial, 50 (23%) w/ minor errors

Clark S (1998) | Pyogenic spinal infection | Delay in diagnosis: 41/69 (59.4%); more than 1 month of back/neck pain before specialist referral

Arbiser ZK (2000) | Soft tissue pathology | Second opinion reviews: minor discrepancy, 20/266 (7.5%); major discrepancy, 65 (25%)

Pope JH (2000) | Acute cardiac ischemia in ED | Mistakenly discharged from ED with MI: 19/889 (2.1%)

Edelman D (2002) | Diabetes in an outpatient clinic | Blood glucoses meeting criteria for DM with no diagnosis of diabetes in records: 258/1426 (18%)

Goodson WH (2002) | Breast cancer | Inappropriately reassured of benign lesions: 21/435 (5%); 14 (3%) misread mammogram, 4 (1%) misread pathologic finding, 5 (1%) missed by poor fine-needle biopsy

Kowalski RG (2004) | Aneurysmal subarachnoid hemorrhage (SAH) | Patients w/ SAH initially misdiagnosed: 56/482 (12%); migraine/tension headache most common incorrect dx, 20/56 (36%); failure to obtain a CT most common error, 41/56 (73%)

Traditional and innovative approaches to learning from diagnosis error

Traditionally, the study of missed or incorrect diagnoses had a central role in medical education, research, and quality assurance in the form of autopsies.34–36 Other traditional methods of learning about misdiagnosed cases include malpractice litigation, morbidity and mortality (M&M) conferences, and unsystematic feedback from patients, from other providers, or simply from patients' illnesses as they evolved over time.3,14,15,32 Beyond the negative aspects of being grounded in patients' deaths or malpractice accusations, these historical approaches have other limitations, including:

• Lack of systematic approaches to surveillance, reporting, and learning from errors, with a nonrandom sample of cases subjected to such review37,39
• Lack of timeliness, with cases often reviewed months or years after the event38
• Examinations that rarely dig to the root of problems: not focused on the "Five Whys"40
• Postmortems that seldom go beyond the case at hand, with minimal linkage to formal quality improvement activities41
• Atrophy of the value of even these suboptimal approaches, with autopsy rates in the single digits (in many hospitals, zero), many malpractice experiences sealed by nondisclosure agreements, and shorter hospitalizations limiting opportunities for followup to ultimate diagnosis34,41,42

What is needed to overcome these limitations is not only a more systematic method for examining cases of diagnosis failure, but also a fresh approach. Our team therefore approached diagnosis error with the following perspectives:

Diagnosis as part of a system.
Diagnostic accuracy should be viewed as a system property rather than simply what happens between the doctor's two ears.2,43–45 While cognitive issues figure heavily in the diagnostic process, a quote from Don Berwick summarizes a much-lacking and much-needed perspective:46 "Genius diagnosticians make great stories, but they don't make great health care. The idea is to make accuracy reliable, not heroic."

Less reliance on human memory. Relying on clinicians' memory (to trigger consideration of a particular diagnosis, to recall a disease's signs, symptoms, and patterns from a textbook or experience, or simply to remember to check on a patient's lab result) is an invitation to variation and failure. This lesson from other error research resonates powerfully with clinicians, who are losing the battle to keep up to date.9,45,47

Need for "space" to allow open reflection and discussion. Transforming an adversarial atmosphere into one conducive to honest reflection is an essential first step.48,49 An equally important and difficult challenge, however, is creating venues that allow clinicians (and patients) to discuss concerns in an efficient and productive manner.37 Cases need to be reviewed in sufficient detail to make them "real." Firsthand clinical information often radically changes our understanding from what the more superficial "first story" suggested. As complex clinical circumstances are better understood, new light is often shed on what at first appeared to be indefensible diagnostic decisions and actions. Unsuspected additional errors also emerge. Equally important is not getting mired in details or in making judgments (whether to label a case as a diagnosis error); it is more valuable to focus on generalizable lessons about how to ensure better treatment of similar future patients.16

Adopting multidisciplinary perspectives and collaboration. A broad range of skills and vantage points is valuable in understanding the complex diagnostic problems we encountered. We considered input from specialists and primary care physicians essential. In addition, specialists in emergency medicine (where many patients first present) offered a vital perspective, both for their diagnostic expertise and for their pivotal interface with system constraints (resource limits mean that not every patient with a confusing diagnosis can be hospitalized). Even more valuable has been the role of non-MDs, including nursing quality specialists, information scientists, and social scientists (a cognitive psychologist, a decision theory specialist), in forging a team to examine diagnosis errors broadly.

Innovative screening approaches. Developing new ways to uncover errors is a priority; we cannot afford to wait for a death, a lawsuit, or manual review. Approaches we have been exploring include electronic screening that links pharmacy and lab data (e.g., to screen for abnormal results, such as an elevated thyroid stimulating hormone [TSH] unaddressed by thyroxine therapy), trajectory studies (retrospectively probing delays in a series of cases with a particular diagnosis), and screening for discrepancies between admitting and discharge diagnoses. A related approach is to survey specialists (who are poised to see diagnoses missed in referred patients), primary care physicians (about their own missed diagnoses), or patients themselves (who frequently have stories to share about incorrect diagnoses), in addition to various ad hoc queries and self-reports. A sketch of the first of these screens follows below.
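To make the lab-pharmacy linkage screen concrete, the sketch below expresses the idea in Python. The record layout, field names, and TSH threshold are assumptions chosen for illustration; this is not the project's actual screening system.

```python
# Minimal sketch of an electronic screen linking lab and pharmacy data
# to flag markedly elevated TSH results with no thyroxine therapy on file.
# Field names and the threshold are illustrative assumptions.

TSH_THRESHOLD = 20.0  # mIU/L; illustrative cut-off, not a clinical standard

lab_results = [
    {"patient_id": "P1", "test": "TSH", "value": 42.0},
    {"patient_id": "P2", "test": "TSH", "value": 2.1},
    {"patient_id": "P3", "test": "TSH", "value": 25.5},
]

pharmacy_fills = [
    {"patient_id": "P3", "drug": "levothyroxine"},  # P3 is already on therapy
]

def unaddressed_tsh(labs, fills, threshold=TSH_THRESHOLD):
    """Return ids of patients with TSH at/above threshold and no thyroxine fill."""
    treated = {f["patient_id"] for f in fills if f["drug"] == "levothyroxine"}
    return [r["patient_id"] for r in labs
            if r["test"] == "TSH" and r["value"] >= threshold
            and r["patient_id"] not in treated]

print(unaddressed_tsh(lab_results, pharmacy_fills))  # ['P1'] -> flag for review
```

The same join-and-filter pattern would cover the potassium example discussed later in the article (potassium prescriptions written for patients who are already hyperkalemic).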
Where does the diagnostic process fail?

One of the most powerful heuristics in medication safety has been delineation of the steps in the medication-use process (prescribing, transcribing, dispensing, administering, and monitoring) to help localize where an error has occurred. Diagnosis, while more difficult to classify neatly (compared with medication use, its stages are more concurrent, recurrent, and complex), can nonetheless be divided into seven stages: (1) access/presentation, (2) history taking/collection, (3) the physical exam, (4) testing, (5) assessment, (6) referral, and (7) followup. We have found this framework helpful for organizing discussions, aggregating cases, and targeting areas for improvement and research. It identifies what went wrong, and situates where in the diagnostic process the failure occurred (Table 3). We have used it for a preliminary analysis of several hundred diagnosis error cases we collected by surveying physicians. This taxonomy for categorizing diagnostic "assessment" draws on the work of Kassirer and others,53 highlighting the two key steps of (a) hypothesis generation and (b) differential diagnosis, or hypothesis weighing/prioritization. We add another aspect of diagnostic assessment, one that connects to other medical and iatrogenic error work: the need to recognize the urgency of diagnoses and complications. This addition underscores the fact that failure to make the exact diagnosis is often less important than correctly assessing the urgency of the patient's illness. We divide the "testing" stage into three components (ordering, performing, and clinician processing), similar but not identical to the laboratory literature's classification of the phases of lab testing as preanalytic, analytic, and postanalytic.50,51

Table 3. Taxonomy of where and what errors occurred (Where in Diagnostic Process ~ anatomic localization; What Went Wrong ~ lesion)

1. Access/presentation: denied care; delayed presentation
2. History: failure/delay in eliciting critical piece of history data; inaccurate/misinterpreted; suboptimal weighing; failure/delay to followup
3. Physical exam: failure/delay in eliciting critical physical exam finding; inaccurate/misinterpreted; suboptimal weighing; failure/delay to followup
4. Tests (lab/radiology):
   Ordering: failure/delay in ordering needed test(s); failure/delay in performing ordered test(s); suboptimal test sequencing; ordering of wrong test(s)
   Performance: sample mix-up/mislabeled (e.g., wrong patient); technical errors/poor processing of specimen/test; erroneous lab/radiology reading of test; failed/delayed transmission of result to clinician
   Clinician processing: failed/delayed followup action on test result; erroneous clinician interpretation of test
5. Assessment:
   Hypothesis generation: failure/delay in considering the correct diagnosis
   Suboptimal weighing/prioritizing: too much weight to low(er) probability/priority dx; too little consideration of high(er) probability/priority dx; too much weight on competing diagnosis
   Recognizing urgency/complications: failure to appreciate urgency/acuity of illness; failure/delay in recognizing complication(s)
6. Referral/consultation: failure/delay in ordering needed referral; inappropriate/unneeded referral; suboptimal consultation diagnostic performance; failed/delayed communication/followup of consultation
7. Followup: failure to refer to setting for close monitoring; failure/delay in timely followup/rechecking of patient
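One way to see how this taxonomy supports aggregation across reviewed cases is to encode each failure as a (stage, lesion) pair. The sketch below is a hypothetical encoding of our own devising, not the authors' instrument; the class and field names are invented for illustration.

```python
# Minimal sketch: encoding where/what failed per the seven-stage taxonomy.
# Names are illustrative only.

from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    ACCESS_PRESENTATION = 1
    HISTORY = 2
    PHYSICAL_EXAM = 3
    TESTING = 4
    ASSESSMENT = 5
    REFERRAL = 6
    FOLLOWUP = 7

@dataclass
class ProcessError:
    stage: Stage          # "where" (the anatomic localization)
    what_went_wrong: str  # "what" (the lesion), free text from Table 3

case_errors = [
    ProcessError(Stage.TESTING, "failed/delayed followup action on test result"),
    ProcessError(Stage.ASSESSMENT, "failure/delay in considering the correct diagnosis"),
    ProcessError(Stage.TESTING, "erroneous clinician interpretation of test"),
]

# Aggregating across cases highlights where in the process failures cluster.
print(Counter(e.stage.name for e in case_errors))
# Counter({'TESTING': 2, 'ASSESSMENT': 1})
```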
For each broad category, we specified the types of problems we observed. A recurring theme running through our reviews of potential diagnosis error cases pertains to the relationship between errors in the diagnostic process, delay and misdiagnosis, and adverse patient outcomes. Bates52 has promulgated a useful model for depicting the relationships between medication errors and outcomes. Similarly, we find that most errors in the diagnostic process do not adversely impact patient outcomes. And many adverse outcomes associated with misdiagnosis or delay do not necessarily result from any error in the diagnostic process: the cancer may simply be undiagnosable at that stage, or the illness presentation too atypical, rare, or unlikely for even the best of clinicians to diagnose early. These situations are often referred to as "no fault" or "forgivable" errors, terms best avoided because they imply fault or blame for the remaining, preventable errors (Figure 1).7,53 While deceptively simple, the model raises a series of extremely challenging questions, questions we found ourselves repeatedly returning to in our weekly discussions. We hope these questions can provide insights into recurring themes and challenges we faced, and perhaps even serve as a checklist for others to structure their own patient care reviews. While humbled by our own inability to provide more conclusive answers to these questions, we believe researchers and practitioners will be forced to grapple with them before significant progress can be made.

Questions for consideration by diagnosis error evaluation and research (DEER) investigators in assessing cases

Uncertainties about diagnosis and findings

1. What is the correct diagnosis? How much certainty do we have, even now, about what the correct diagnosis is?
2. What were the findings at the various points in time when the patient was being seen? How much certainty do we have that a particular finding and diagnosis was actually present at the time(s) we are positing an error?

Relationship between diagnosis failure and adverse outcomes

3. What is the probability that the error resulted in the adverse outcome? How treatable is the condition, and how critical is timely diagnosis and treatment for affecting the outcome, both in general and in this case?
4. How did the error in the diagnostic process contribute to making the wrong diagnosis and giving the wrong treatment?

Figure 1. Relationships between diagnostic process errors, misdiagnosis, and adverse events (a Venn diagram of three overlapping sets: diagnostic process errors, diagnosis* errors, and adverse events, whose regions A–G are defined below). *Delayed, missed, or misdiagnosis

Group A = Errors in diagnostic process (blood sample switched between two patients; MD does not do a physical exam for a patient with abdominal pain)
Group B = Diagnostic process error with resulting misdiagnosis (patient given wrong diagnosis because blood samples were switched)
Group C = Adverse outcome resulting from error-related misdiagnosis (patient is given toxic treatment and has an adverse effect as a result of switched samples; appendicitis is not diagnosed because of failure to examine the abdomen, and it ruptures and the patient dies)
Group D = Harm from error in diagnostic process (colon perforation from colonoscopy done on wrong patient)
Group E = Misdiagnosis, delayed diagnosis, or missed diagnosis, but no error in care or harm (incidental prostate cancer found on autopsy)
Group F = Adverse event due to misdiagnosis but no identifiable process error (death from acute MI with no chest pain or other symptoms that were missed)
Group G = Adverse events not related to misdiagnosis, delay, or error in the diagnostic process (e.g., death from a correctly diagnosed disease complication, or a nonpreventable drug reaction such as PCN anaphylaxis in a patient never previously exposed)
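The model in Figure 1 can be restated mechanically: each case is characterized by three flags (diagnostic process error, diagnosis error, adverse event), and every combination lands in one of the figure's regions. The sketch below is our illustrative encoding, not part of the article.

```python
# Minimal sketch: classifying a case into Figure 1's groups A-G from three
# flags. The encoding is illustrative, derived from the caption above.

def figure1_group(process_error: bool, diagnosis_error: bool, adverse_event: bool):
    if process_error and not diagnosis_error and not adverse_event:
        return "A"  # process error alone
    if process_error and diagnosis_error and not adverse_event:
        return "B"  # process error causing misdiagnosis, no harm
    if process_error and diagnosis_error and adverse_event:
        return "C"  # error-related misdiagnosis with adverse outcome
    if process_error and not diagnosis_error and adverse_event:
        return "D"  # harm from the process error itself
    if not process_error and diagnosis_error and not adverse_event:
        return "E"  # missed/delayed diagnosis, no error in care or harm
    if not process_error and diagnosis_error and adverse_event:
        return "F"  # adverse event from misdiagnosis, no process error
    if not process_error and not diagnosis_error and adverse_event:
        return "G"  # adverse event unrelated to diagnosis
    return None     # no error and no harm: outside the diagram

# Example: switched blood samples (process error) -> wrong diagnosis ->
# toxic treatment with an adverse effect.
print(figure1_group(True, True, True))  # "C"
```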
Clinician assessment and actions

5. What was the physician's diagnostic assessment? How much consideration was given to the correct diagnosis? (This is usually difficult to reconstruct because the differential diagnosis often is not well documented.)
6. How good or bad was the diagnostic assessment based on the evidence clinicians had on hand at the time (should have been obvious from available data vs. no way anyone could have suspected)?
7. How erroneous was the diagnostic assessment, given the difficulty of making the diagnosis at that point? (Was there a difficult "signal-to-noise" situation, a rare low-probability diagnosis, or an atypical presentation?)
8. How justifiable was the failure to obtain additional information (i.e., history, tests) at a particular point in time? How can this be analyzed absolutely, as well as relative to the difficulties and constraints in obtaining the missing data? (Did the patient withhold or refuse to give accurate/additional history? Were there backlogs and delays that made it impossible to obtain the desired test?)
9. Was there a problem in diagnostic assessment of the severity of the illness, with resulting failure to observe or follow up the patient more closely? (Again, both absolutely and relative to constraints.)

Global assessment of improvement opportunities

10. To what extent did the clinicians' actions deviate from the standard of care (i.e., was there negligent care, with failure to follow accepted diagnostic guidelines and expected practices, or to pursue an abnormal finding that should never be ignored)?
11. How preventable was the error? How ameliorable or amenable to change are the factors/problems that contributed to the error? How much would such changes, designed to prevent this error in the future, cost?
12. What should we do better the next time we encounter a similar patient or situation? Is there a general rule, or are there measures that can be implemented to ensure this is reliably done each time?

Diagnosis error case vignettes

Case 1

A 25-year-old woman presents with crampy abdominal pain, vaginal bleeding, and amenorrhea for 6 weeks. Her serum human chorionic gonadotropin (HCG) level is markedly elevated. A pelvic ultrasound is read by the on-call radiology chief resident and obstetrics (OB) attending physician as showing an empty uterus, suggesting ectopic pregnancy. The patient is informed of the findings and treated with methotrexate.
The following morning the radiology attending reviews the ultrasound and amends the report, officially reading it as "normal intrauterine pregnancy."

Case 2

A 49-year-old, previously healthy man presents to the emergency department (ED) with nonproductive cough, "chest congestion," and dyspnea lasting 2 weeks; he has a history of smoking. The patient is afebrile, with pulse = 105, respiration rate (RR) = 22, and white blood count (WBC) = 6.4. Chest x-ray shows "marked cardiomegaly, diffuse interstitial and reticulonodular densities with blunting of the right costophrenic angle; impression: congestive heart failure (CHF)/pneumonia. Rule out (R/O) cardiomyopathy, valve disease or pericardial effusion." The patient is sent home with a diagnosis of pneumonia and an oral antibiotic. He returns 1 week later with worsening symptoms. He is found to have pulsus paradoxus, and an emergency echocardiogram shows massive pericardial effusion. Pericardiocentesis obtains 350 cc of fluid with cytology positive for adenocarcinoma. Computed tomography of the chest suggests "lymphangitic carcinomatosis."

Case 3

A 50-year-old woman with frequent ED visits for asthma (four visits in the preceding month) presents to the ED with a chief complaint of dyspnea and new back pain. She is treated for asthma exacerbation and discharged with nonsteroidal anti-inflammatory drugs (NSAIDs) for back pain. She returns 2 days later with acutely worsening back pain, which started when she was reaching for something in her cupboard. A chest x-ray shows a "tortuous and slightly ectatic aorta," and the radiologist's impression concludes, "If aortic dissection is suspected, further evaluation with chest CT with intravenous (IV) contrast is recommended." The ED resident proceeds to order a chest CT, which concludes "no evidence of aneurysm or dissection." The patient is discharged. She returns to the ED 3 days later, again complaining of worsening asthma and back pain. While waiting to be seen, she collapses in the waiting room and cannot be resuscitated. Autopsy shows a ruptured aneurysm of the ascending aorta.

Case 4

A 50-year-old woman with a past history of diabetes and alcohol and IV drug abuse presents with abdominal pain and vomiting and is diagnosed as having "acute chronic pancreatitis." Her amylase and lipase levels are normal. She is admitted and treated with IV fluids and analgesics. On hospital day 2 she begins having spiking fevers, and antibiotics are administered. The next day, blood cultures are growing gram-negative organisms. At this point, the service is clueless about the patient's correct diagnosis. It becomes evident only the following day, when (a) review of laboratory data over the past year shows that the patient had four prior blood cultures, each positive with a different gram-negative organism; (b) a nurse reports the patient was "behaving suspiciously," rummaging through the supply room where syringes were kept; and (c) a medical student looks up posthospital outpatient records from 4 months earlier and finds several notes stating that "the patient has probable Munchausen syndrome rather than pancreatitis." Upon discovering these findings, the patient's IVs are discontinued, and sensitive, appropriate followup primary and psychiatric care are arranged.
A postscript to this admission: 3 months later, the patient was again readmitted to the same hospital for "pancreatitis" and an unusual "massive leg abscess." The physicians caring for her were unaware of her past diagnoses and never suspected or discovered the likely etiology of her abscess (self-induced from unsterile injections).

Lessons and issues raised by the diagnosis error cases

Difficulties in sorting out "don't miss" diagnoses

Before starting our project, we compiled a list of "don't miss" diagnoses (available from the authors). These are diagnoses that are considered critical but often difficult to make: critical because timely diagnosis and treatment can have major impact (for the patient, the public's health, or both), and difficult because they are either rare or pose diagnostic challenges. Diagnoses such as spinal epidural abscess (where paraplegia can result from delayed diagnosis) or active pulmonary tuberculosis (TB) (where preventing spread of infection is critical) are examples of "don't miss" diagnoses. While there is a scant evidence base from which to definitively compile and prioritize such a list, three of our cases (ectopic pregnancy, dissecting aortic aneurysm, and pericardial effusion with tamponade) involve diagnoses that would unquestionably be considered life-threatening and that ought not be delayed or missed.25 Although these cases raise numerous issues concerning diagnosis error, they also illustrate problems relating to uncertainties, lack of gold standards (for both testing and standard of care), and difficulties reaching consensus about the best ways to prevent future errors and harmful delays. Below we briefly discuss some of these issues and controversies.

Diagnostic criteria and strategies for diagnosing ectopic pregnancy are controversial, and our patient's findings were particularly confusing. Even after careful review of all aspects of the case, we were still not certain who was "right": the physicians who read the initial images and interpreted them as consistent with ectopic pregnancy, or the attending physician who reread the films the next day as normal. The literature is unclear about criteria for establishing this diagnosis.54–57 In addition, there is a lack of standards in the performance and interpretation of ultrasound exams, plus controversies about the timing of interventions. Thus this "obvious" error is obviously more complex, highlighting a problem-prone clinical situation.

The patient who was found to have a malignant pericardial effusion illustrates problems in determining the appropriate course of action for patients with unexplained cardiomegaly, which the ED physicians failed to address on his first presentation: Was he hemodynamically stable at this time? If so, did he require an urgent echocardiogram? What criteria should have mandated that this be done immediately? How did his cardiomegaly get "lost" as his diagnosis prematurely "closed" on pneumonia when handed over from the day to the night ED shift? How should one assess the empiric treatment for pneumonia given his abnormal chest x-ray? Was this patient with metastatic lung cancer harmed by the 1-week delay?

The diagnosis of aneurysms (e.g., aortic, intracranial) arises repeatedly in discussions of misdiagnosis. Every physician seems to recall a case of a missed aneurysm with catastrophic outcomes where, in retrospect, warnings may have been overlooked.
A Wall Street Journal article recently won a Pulitzer Prize for publicizing such aneurysm cases.58 Our patient's back pain was initially dismissed. Because of her frequent visits, she had been labeled a "frequent flyer," and back pain is an extremely common and nonspecific symptom. A review of the literature on the frequency of dissecting aortic aneurysm reveals that it is surprisingly rare, perhaps fewer than 1 in 50,000 ED visits for chest pain, and likely an equally rare cause of back pain.59–62 She did ultimately undergo a recommended imaging study after a suspicious plain chest x-ray; however, it was deemed "negative." Thus, in each case, seemingly egregious and unequivocal errors were found to be more complex and uncertain.

Issues related to limitations of diagnostic testing

Even during our 3 years of diagnosis case reviews, clinicians have been confronted with rapid changes in diagnostic testing. New imaging modalities, lab tests, and testing recommendations have been introduced, often leaving clinicians confused about which tests to order and how to interpret their confusing and, at times, contradictory (from one radiologist to the next) results.63 If diagnosis errors are to be avoided, clinicians must be aware of the limitations of the diagnostic tests they are using. It is well known that a normal mammogram in a woman with a breast lump does not rule out the diagnosis of breast cancer, because the sensitivity of the test is only 70 to 85 percent.13,26,64,65 A recurring theme in our cases is failure to appreciate the pitfalls of weighing test results without considering the patient's pretest disease probabilities. Local factors, such as variation in the quality of test performance and readings, combined with communication failures between radiology/laboratory and ordering physicians (either no direct communication, or interactions where complex interpretations get reduced to "positive" or "negative," overlooking subtleties and limitations), provide further sources of error.66

The woman with the suspected ectopic pregnancy, whose emergency ultrasound was initially interpreted as "positive," illustrates the pitfalls of taking irreversible therapeutic actions without carefully weighing the limitations of a test reading. Perhaps an impending rupture of an ectopic pregnancy warrants urgent action; however, it is also imperative for decisions of this sort that institutions have fail-safe protocols that anticipate such emergencies and the associated test limitations.

The patient with a dissecting aortic aneurysm clearly had a missed diagnosis, as confirmed by autopsy. The diagnosis was suspected premortem but was considered "ruled out" by a CT scan that did not show a dissection. Studies of the role of chest CT, particularly with earlier CT scanning technology, show a sensitivity for dissecting aneurysm of only 83 percent.67 When we reexamined the old films for this patient, several radiologists questioned the adequacy of the study (the quality of the infusion, plus a question of motion artifact). Newer, faster scanners reportedly are less prone to these errors, but the experience is variable.
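The pitfall of a "negative" test in a high-suspicion patient can be made explicit with a short worked example of Bayes' rule. The 83 percent sensitivity is the figure cited above; the specificity and pretest probability below are assumptions chosen purely for illustration.

```python
# Worked example: post-test probability of dissection after a negative CT.
# Sensitivity 0.83 is cited in the text; the specificity (0.95) and the
# pretest probability (0.20) are illustrative assumptions.

def post_test_prob_negative(pretest: float, sensitivity: float, specificity: float) -> float:
    """P(disease | negative test), by Bayes' rule."""
    false_neg = pretest * (1 - sensitivity)   # diseased patients who test negative
    true_neg = (1 - pretest) * specificity    # healthy patients who test negative
    return false_neg / (false_neg + true_neg)

p = post_test_prob_negative(pretest=0.20, sensitivity=0.83, specificity=0.95)
print(f"{p:.1%}")  # ~4.3%: roughly 1 in 23 such patients still has a dissection
```

Under these assumptions, a "negative" CT lowers but does not eliminate the probability of dissection, which is precisely why a high pretest suspicion should temper reassurance from a single negative study.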
We identified another patient in whom a "known" artifact on a spiral CT nearly led to an unnecessary aneurysm surgery; it was prevented only by the fact that she was a Jehovah's Witness: because her religious beliefs precluded transfusions, surgery was considered too risky.62

The role of information transfer and the communication of critical laboratory information

Failure of diagnosis because of missing information is another theme in our weekly case reviews and the medical literature. Critical information can be missed because of failures in history-taking, lack of access to medical records, failures in the transmission of diagnostic test results, or faulty records organization (paper or electronic) that creates problems for quickly reviewing or finding needed information. For the patient with the self-induced illness, all of the "missing" information was available online. Ironically, although many patients with a diagnosis of Munchausen syndrome go to great lengths to conceal information (i.e., giving false names, using multiple hospitals), in our case there was so much data in the computer from previous admissions and outpatient visits that the condition was "lost" in a sea of information overload, a problem certain to grow as more and more clinical information is stored online. While this patient is an unusual example of the general problems related to information transfer, the case illustrates important principles related to the need for conscientious review, the synthesis of information, and continuity (of both physicians and information) to avoid errors.

Simply creating and maintaining a patient problem list can help prevent diagnosis errors. It can ensure that each active problem is being addressed, helping all caregivers be aware of diagnoses, allergies, and unexplained findings. Had our patient with "unexplained cardiomegaly" been discharged with this listed as one of his problems, instead of only "pneumonia," perhaps it would not have been overlooked. However, making this seemingly simple documentation tool operational has been unsuccessful in most institutions, even ones with advanced electronic information systems, and thus it represents a challenge as much as a panacea.

One area of information transfer, the followup of abnormal laboratory test results, represents an important example of this information transfer paradigm in diagnostic patient safety.68–73 We identified failure rates of more than 1 in 50 both for followup of abnormal thyroid tests (the diagnosis of hypothyroidism was missed in 23 of 982, or 2.3 percent, of patients with markedly elevated TSH results) and, in our earlier study, for failure to act on elevated potassium levels (674 of 32,563, or 2.0 percent, of potassium prescriptions were written for hyperkalemic patients).74 Issues of communication, teamwork, systems design, and information technology stand out as areas for improvement.75–77 Recognizing this, the Massachusetts Safety Coalition has launched a statewide initiative on Communicating Critical Test Results.78

Physician time, test availability, and other system constraints

Our project was based in two busy urban hospitals, including a public hospital with serious constraints on bed availability and access to certain diagnostic tests. An important recurring theme in our case discussions (and in health care generally) is the interaction between diagnostic imperatives and these resource limitations.
To what extent is failure to obtain an echocardiogram, or even a more thorough history or physical exam, understandable and justified by the circumstances under which physicians find themselves practicing? Certainly our patient with the massive cardiomegaly needed an echocardiogram at some point. Was it reasonable for the ED to defer the test (meaning a wait of perhaps several months in the clinic), or would a more "just-in-time" approach be more efficient, as well as safer in minimizing diagnosis error and delay?79 Since, by definition, we expect ED physicians to triage and treat emergencies, not to work up thoroughly every problem patients have, we find complex trade-offs operating at multiple levels. Similar trade-offs affect whether a physician had time to review all of the past records of our factitious-illness patient (only the medical student did), or how much radiology expertise is available around the clock to read ultrasound or CT exams to diagnose ectopic pregnancy or aortic dissection. This is perhaps the most profound and poorly explored aspect of diagnosis error and delay, but one that will increasingly be front and center in health care.

Cognitive issues in diagnosis error

We briefly conclude where most diagnosis error discussions begin, with cognitive errors.45,53,80–84 Hindsight bias and the difficulty of weighing prior probabilities of the possible diagnoses bedeviled our efforts to assess decisions and actions retrospectively. Many "don't miss" diagnoses are rare; it would be an error to pursue each one for every patient. We struggled to delineate guidelines that would accurately identify high-risk patients and to design strategies to prevent missing these diagnoses. Our case reviews and firsthand interviews often found that each physician had his or her own individual way of approaching patients and their problems. Such differences made for lively conference discussions, but they have disturbing implications for developing more standardized approaches to diagnosis. The putative dichotomy between "cognitive" and "process" errors is in many ways an artificial distinction.7,8 If a physician is interrupted while talking to the patient or thinking about a diagnosis, and forgets to ask a critical question or consider a critical diagnosis, is this a process error or a cognitive error?

Conclusion

Because of their complexity, there are no quick fixes for diagnosis errors. As we review what we learned from a variety of approaches and cases, certain areas stood out as ripe for improvement: both small-scale improvements that can be tested locally and larger improvements that need more rigorous, formal research. Table 4 summarizes these change ideas, which harvest the lessons of our 3-year project. As outlined in the table, there needs to be a commitment to building learning organizations, in which feedback to earlier providers who may have failed to make a correct diagnosis becomes routine, so that institutions can learn from this aggregated feedback data. To better protect patients, we will need to conceptualize and construct safety nets to mitigate harm from uncertainties and errors in diagnosis. The followup of abnormal test results is a prime candidate for reengineering, to ensure low "defect" rates comparable to those achieved in other fields; one possible shape for such a system is sketched below.
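To picture what such reengineering might look like, here is a minimal sketch of a closed-loop tracker that escalates abnormal results not acknowledged within a time limit. The workflow, field names, and 48-hour limit are illustrative assumptions, not a system described by the authors.

```python
# Minimal sketch of a closed-loop followup tracker for abnormal results.
# The escalation rule and time limit are illustrative assumptions.

from datetime import datetime, timedelta

ACKNOWLEDGE_WITHIN = timedelta(hours=48)  # assumed limit, for illustration

def overdue_results(pending, now):
    """Return abnormal results not acknowledged within the time limit,
    so they can be escalated (e.g., to a covering physician or the patient)."""
    return [
        r for r in pending
        if r["abnormal"] and not r["acknowledged"]
        and now - r["reported_at"] > ACKNOWLEDGE_WITHIN
    ]

pending = [
    {"id": 1, "test": "TSH", "abnormal": True, "acknowledged": False,
     "reported_at": datetime(2005, 3, 1, 9, 0)},
    {"id": 2, "test": "K+", "abnormal": True, "acknowledged": True,
     "reported_at": datetime(2005, 3, 1, 9, 0)},
]

for r in overdue_results(pending, now=datetime(2005, 3, 4, 9, 0)):
    print(f"Escalate result {r['id']} ({r['test']}): unacknowledged > 48h")
```

The essential design point is that the loop closes automatically: an unacknowledged abnormal result surfaces again by default, rather than depending on any one clinician's memory.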
More standardized and reliable protocols for reading x-rays and laboratory tests (such as pathology specimens), particularly in residency training programs and "after hours," could minimize the errors we observed. In addition, we need to better delineate "red flag" and "don't miss" diagnoses and situations, based on better understanding of, and data about, pitfalls in diagnosis and ways to avoid them.

Table 4. Change ideas for preventing and minimizing diagnostic error

Change idea: Upstream feedback to earlier providers who may have failed to make the correct diagnosis
Rationale/Description:
• Promotes culture of safety, accountability, continuous and blame-free learning, and communication
• "Hard-wires" organizational and practitioner learning from diagnosis evolution and delays
• Permits aggregation for tracking, uncovering patterns, learning across cases, elucidating pitfalls, measuring improvements
• Build in "feedback from the feedback" to capture reflective practitioner assessment of why errors may have occurred and considerations for future prevention
• Feedback from specialists poised to see missed diagnoses could be especially useful
Challenges:
• To be highest leverage, needs to be coupled with reporting and case conferences
• Ways to extend to previous hospitals and physicians outside one's own institution
• Protecting confidentiality, legal liabilities, blame-free atmosphere
• Logistical requirements for implementation and surveillance screening to identify errors

Change idea: Safety nets to mitigate harm from diagnostic uncertainty and error
Rationale/Description:
• Well-designed observation and followup venues and systems (e.g., admit for observation; followup calls or e-mail from MD in 48 hours; automated 2-week phone followup to ensure tests obtained) for high-risk, uncertain diagnoses
• Educating and empowering patients to have a lower threshold for seeking followup care or advice, including better defining and specifying particular warning symptoms
Challenges:
• Not creating excessive patient worry/anxiety
• Avoiding false positive errors and inappropriate use of scarce/costly resources
• Avoiding "tampering" from availability bias that neglects base rates (ordering an aortogram on every chest pain patient to "rule out" dissection)
• Resource constraints (bed, test availability, clinician time)
• Logistics
Table 4. Change ideas for preventing and minimizing diagnostic error, cont.

Change idea: Timely and reliable abnormal test result followup systems
Rationale/Description:
• Fail-safe, prospectively designed systems to identify which results are critical abnormals (panic and routine), whom to communicate the result to, and how to deliver it
• Uniform approach across various test types and disciplines (lab, pathology, radiology, cardiology)
• Leverage information technologies to automate and increase reliability
• Involve patients to ensure timely notification of results, with contacting of the provider when this fails
• Modeled on Massachusetts Patient Safety Coalition program78
Challenges:
• Achieving consensus on test types and cut-off thresholds
• On-call, cross-coverage issues for critical panic results
• Defining responsibilities: e.g., for ED patients, for lab personnel

Change idea: Fail-safe protocols for preliminary/resident and definitive readings of tests
Rationale/Description:
• Must be a well-defined and organized system for supervision, creating and communicating final reports
• Need for a system for amending reports and alerting clinicians to critical changes
• "After-hours" systems for readings and amending
Challenges:
• Quality control poses major unmet challenges
• How best to recognize and convey variations in the expertise of attendings who write final reports
• Practitioner efficiency issues: avoiding excess duplicate work; efficient documentation

Change idea: Prospectively defining red flag diagnoses and situations and instituting prospective readiness
Rationale/Description:
• Create "pull" systems for patients with particular medical problems to ensure standardized, expedited diagnostic evaluations (so one does not have to "push" to obtain them quickly)
• Like the AMI thrombolytic "clot box," in place for ready activation the moment the patient first presents
• Embodies/requires a coordinated multidisciplinary approach (e.g., pharmacy, radiology, specialists)
Challenges:
• Difficulties in evidence-based delineation of diagnoses, situations, patient selection, criteria, and standardized actions
• Avoiding excessive work-up and diverting resources from other problems

Change idea: Automated screening checklists to avoid missing key history, physical, and lab data
Rationale/Description:
• Less reliance on human memory for more thorough questioning
• Queries triggered by individual patient features
• Could be customized based on presenting problem (i.e., work exposures for lung symptoms, travel history for fever)
Challenges:
• Evidence of efficacy to date unconvincing; unclear value of an unselective "review of systems"
• Sorting out and avoiding false positive errors from data with a poor signal:noise ratio
• Potentially more time-consuming for practitioners
• Avoiding unnecessary testing
• Effectively implementing with teamwork and integrated information technology

Change idea: High-level patient education and engagement with diagnosis probabilities and uncertainties
Rationale/Description:
• Since diagnosis is often so complex and difficult, this needs to be shared with patients in ways that minimize disappointments and surprises
• Support and enhance patients taking the initiative to question a diagnosis, particularly if they are not responding as expected
Challenges:
• Doing so in a way that balances the need to preserve patient confidence in their physicians and advice with education about, and recognition of, the fallibility of diagnosis

Change idea: Test and leverage information technology tools to avoid known cognitive and care process pitfalls
Rationale/Description:
• Prompts to suggest consideration of medication effects in the differential, based on linkages to the patient's medication profile and lab results
• Sophisticated decision support tools that use complex rules and individualized patient data
• Facilitating real-time access to medical knowledge sources
• Better design of ways to streamline documentation (including the differential diagnosis) and the access/display of historical data
• Easing documentation time demands to give practitioners more time to talk to patients and think about their problems
Challenges:
• Paucity of standardized, accepted, sharable clinical alerts/rules
• Challenges coupling knowledge bases with individual patient characteristics
• Shortcomings of the first generation of "artificial intelligence" diagnosis software
• New errors and distractions introduced by the intrusion of the computer into the clinical encounter
• Alleged atrophy of unaided cognitive skills
To achieve many of these advances, automated (and manual) checklists and reminders will be needed to overcome the current reliance on human memory. But information technology must also be deployed and reengineered to overcome the growing problems associated with information overload. Finally, and most importantly, patients will have to be engaged on multiple levels to become "coproducers" in a safer practice of medical diagnosis.85 It is our hope that these change ideas can be tested and implemented to ensure safer treatment based on better diagnoses: diagnosis with fewer delays, mistakes, and process errors.

Acknowledgments

This work was supported by AHRQ Patient Safety Grant #11552, the Cook County–Rush Developmental Center for Research in Patient Safety (DCERPS) Diagnostic Error Evaluation and Research (DEER) Project.

Author affiliations

Cook County John H. Stroger Hospital and Bureau of Health Services, Chicago (GDS, KC, MFW). Rush University Medical Center, Chicago (RA, SH, RO, RAM). Hektoen Research Institute, Chicago (SK, NK). University of Illinois at Chicago, College of Pharmacy (BL). University of Illinois at Chicago Medical School (ASE).

Address correspondence to: Gordon D. Schiff; Cook County John H. Stroger Hospital and Bureau of Health Services, 1900 W. Polk Street, Administration Building #901, Chicago, IL 60612. Phone: 312-864-4949; fax: 312-864-9594; e-mail: [email protected].

References

1. Baker GR, Norton P. Patient safety and healthcare error in the Canadian healthcare system. Ottawa, Canada: Health Canada; 2002. pp. 1–167. (http://www.hc-sc.gc.ca/english/care/report. Accessed 12/24/04.)
2. Leape L, Brennan T, Laird N, et al. The nature of adverse events in hospitalized patients: Results of the Harvard medical practice study II. NEJM 1991;324:377–84.
3. Medical Malpractice Lawyers and Attorneys Online. Failure to diagnose. http://www.medical-malpracticeattorneys-lawsuits.com/pages/failure-todiagnose.html. 2004.
4. Ramsay A. Errors in histopathology reporting: detection and avoidance. Histopathology 1999;34:481–90.
5. JCAHO Sentinel event alert advisory group. Sentinel event alert. http://www.jcaho.org/about+us/news+letters/sentinel+event+alert/sea_26.htm. Jun 17, 2002.
6. Edelman D. Outpatient diagnostic errors: unrecognized hyperglycemia. Eff Clin Pract 2002;5:11–6.
7. Graber M, Gordon R, Franklin N. Reducing diagnostic errors in medicine: what's the goal? Acad Med 2002;77:981–92.
8. Kuhn G. Diagnostic errors. Acad Emerg Med 2002;9:740–50.
9. Kohn LT, Corrigan JM, Donaldson MS, editors. To err is human: building a safer health system. A report of the Committee on Quality of Health Care in America, Institute of Medicine. Washington, DC: National Academy Press; 2000.
10. Sato L. Evidence-based patient safety and risk management technology. J Qual Improv 2001;27:435.
11. Phillips R, Bartholomew L, Dovey S, et al. Learning from malpractice claims about negligent, adverse events in primary care in the United States. Qual Saf Health Care 2004;13:121–6.
12. Golodner L. How the public perceives patient safety. Newsletter of the National Patient Safety Foundation 2004;1997:1–6.
13. Bhasale A, Miller G, Reid S, et al. Analyzing potential harm in Australian general practice: an incident-monitoring study. Med J Aust 1998;169:73–9.
14. Kravitz R, Rolph J, McGuigan K. Malpractice claims data as a quality improvement tool, I: Epidemiology of error in four specialties. JAMA 1991;266:2087–92.
15. Flannery F. Utilizing malpractice claims data for quality improvement. Legal Medicine 1992;92:1–3.
16. Neale G, Woloshynowych M, Vincent C. Exploring the causes of adverse events in NHS hospital practice. J Royal Soc Med 2001;94:322–30.
17. Bogner MS, editor. Human error in medicine. Hillsdale, NJ: Lawrence Erlbaum Assoc.; 1994.
18. Makeham M, Dovey S, County M, et al. An international taxonomy for errors in general practice: a pilot study. Med J Aust 2002;177:68–72.
19. Wilson R, Harrison B, Gibberd R, et al. An analysis of the causes of adverse events from the quality in Australian health care study. Med J Aust 1999;170:411–5.
20. Weingart S, Ship A, Aronson M. Confidential clinician-reported surveillance of adverse events among medical inpatients. J Gen Intern Med 2000;15:470–7.
21. Chaudhry S, Olofinboda K, Krumholz H. Detection of errors by attending physicians on a general medicine service. J Gen Intern Med 2003;18:595–600.
22. Kowalski R, Claassen J, Kreiter K, et al. Initial misdiagnosis and outcome after subarachnoid hemorrhage. JAMA 2004;291:866–9.
23. Mayer P, Awad I, Todor R, et al. Misdiagnosis of symptomatic cerebral aneurysm. Prevalence and correlation with outcome at four institutions. Stroke 1996;27:1558–63.
24. Craven ER. Risk management issues in glaucoma diagnosis and treatment. Surv Ophthalmol 1996 May–Jun;40(6):459–62.
25. Lederle F, Parenti C, Chute E. Ruptured abdominal aortic aneurysm: the internist as diagnostician. Am J Med 1994;96:163–7.
26. Goodson W, Moore D. Causes of physician delay in the diagnosis of breast cancer. Arch Intern Med 2002;162:1343–8.
27. Clark S. Spinal infections go undetected. Lancet 1998;351:1108.
28. Williams VJ. Second-opinion consultations assist community physicians in diagnosing brain and spinal cord biopsies. http://www3.mdanderson.org/~oncolog/brunder.html. Oncol 1997. 11-29-0001.
29. Arbiser Z, Folpe A, Weiss S. Consultative (expert) second opinions in soft tissue pathology. Amer J Clin Pathol 2001;116:473–6.
30. Pope J, Aufderheide T, Ruthazer R, et al. Missed diagnoses of acute cardiac ischemia in the emergency department. NEJM 2000;342:1163–70.
31. Steere A, Taylor E, McHugh G, et al. The overdiagnosis of Lyme disease. JAMA 1993;269:1812–6.
32. American Society for Gastrointestinal Endoscopy. Medical malpractice claims and risk management in gastroenterology and gastrointestinal endoscopy. http://www.asge.org/gui/resources/manual/gea_risk_update_on_endoscopic.asp, Jan 8, 2002.
33. Zhan C, Miller MR. Administrative data based patient safety research: a critical review. Qual Saf Health Care 2003;12(Suppl 2):ii58–ii63.
34. Shojania K, Burton E, McDonald K, et al. Changes in rates of autopsy-detected diagnostic errors over time: a systematic review. JAMA 2003;289:2849–56.
35. Gruver R, Fries E. A study of diagnostic errors. Ann Intern Med 1957;47:108–20.
36. Kirch W, Schafii C. Misdiagnosis at a university hospital in 4 medical eras. Medicine 1996;75:29–40.
37. Thomas E, Petersen L. Measuring errors and adverse events in health care. J Gen Intern Med 2003;18:61–7.
38. Orlander J, Barber T, Fincke B. The morbidity and mortality conference: the delicate nature of learning from error. Acad Med 2002;77:1001–6.
39. Bagian J, Gosbee J, Lee C, et al. The Veterans Affairs root cause analysis system in action. J Qual Improv 2002;28:531–45.
40. Spear S, Bowen H. Decoding the DNA of the Toyota production system. Harvard Bus Rev 1999;106.
41. Anderson R, Hill R. The current status of the autopsy in academic medical centers in the United States. Am J Clin Pathol 1989;92:S31–S37.
42. The Royal College of Pathologists of Australasia Autopsy Working Party. The decline of the hospital autopsy: a safety and quality issue for health care in Australia. Med J Aust 2004;180:281–5.
43. Plsek P. Section 1: evidence-based quality improvement, principles, and perspectives. Pediatrics 1999;103:203–14.
44. Leape L. Error in medicine. JAMA 1994;272:1851–7.
45. Reason J. Human error. New York, NY: Cambridge University Press; 1990.
46. Gaither C. What your doctor doesn't know could kill you. The Boston Globe. Jul 14, 2002.
47. Lambert B, Chang K, Lin S. Effect of orthographic and phonological similarity on false recognition of drug names. Soc Sci Med 2001;52:1843–57.
48. Woloshynowych M, Neale G, Vincent C. Care record review of adverse events: a new approach. Qual Saf Health Care 2003;12:411–5.
49. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med 2003;78:775–80.
50. Pansini N, Di Serio F, Tampoia M. Total testing process: appropriateness in laboratory medicine. Clin Chim Acta 2003;333:141–5.
51. Stroobants A, Goldschmidt H, Plebani M. Error budget calculations in laboratory medicine: linking the concepts of biological variation and allowable medical errors. Clin Chim Acta 2003;333:169–76.
52. Bates D. Medication errors. How common are they and what can be done to prevent them? Drug Saf 1996;15:303–10.
53. Kassirer J, Kopelman R. Learning clinical reasoning. Baltimore, MD: Williams & Wilkins; 1991.
54. Pisarska M, Carson S, Buster J. Ectopic pregnancy. Lancet 1998;351:1115–20.
55. Soderstrom RM. Obstetrics/gynecology: ectopic pregnancy remains a malpractice dilemma. http://www.thedoctors.com/risk/specialty/obgyn/J4209.asp. 1999. Apr 26, 2004.
56. Tay J, Moore J, Walker J. Ectopic pregnancy. BMJ 2000;320:916–9.
57. Gracia C, Barnhart K. Diagnosing ectopic pregnancy: decision analysis comparing six strategies. Obstet Gynecol 2001;97:464–70.
58. Helliker K, Burton TM. Medical ignorance contributes to toll from aortic illness. Wall Street Journal; Nov 4, 2003.
59. Hagan P, Nienaber C, Isselbacher E, et al. The international registry of acute aortic dissection (IRAD): new insights into an old disease. JAMA 2000;283:897–903.
60. Kodolitsch Y, Schwartz A, Nienaber C. Clinical prediction of acute aortic dissection. Arch Intern Med 2000;160:2977–82.
61. Spittell P, Spittell J, Joyce J, et al. Clinical features and differential diagnosis of aortic dissection: experience with 236 cases (1980 through 1990). Mayo Clin Proc 1993;68:642–51.
62. Loubeyre P, Angelie E, Grozel F, et al. Spiral CT artifact that simulates aortic dissection: image reconstruction with use of 180 degrees and 360 degrees linear-interpolation algorithms. Radiology 1997;205:153–7.
63. Cabot R. Diagnostic pitfalls identified during a study of 3000 autopsies. JAMA 1912;59:2295–8.
64. Conveney E, Geraghty J, O'Laoide R, et al. Reasons underlying negative mammography in patients with palpable breast cancer. Clin Radiol 1994;49:123–5.
65. Burstein HJ. Highlights of the 2nd European breast cancer conference. http://www.medscape.com/viewarticle/408459. 2000.
66. Berlin L. Malpractice issues in radiology: defending the "missed" radiographic diagnosis. Amer J Roent 2001;176:317–22.
67. Cigarroa J, Isselbacher E, DeSanctis R, et al. Diagnostic imaging in the evaluation of suspected aortic dissection—old standards and new directions. NEJM 1993;328:35–43.
68. McCarthy B, Yood M, Boohaker E, et al. Inadequate follow-up of abnormal mammograms. Am J Prev Med 1996;12:282–8.
69. McCarthy B, Yood M, Janz N, et al. Evaluation of factors potentially associated with inadequate follow-up of mammographic abnormalities. Cancer 1996;77:2070–6.
70. Boohaker E, Ward R, Uman J, et al. Patient notification and follow-up of abnormal test results: a physician survey. Arch Intern Med 1996;156:327–31.
71. Murff H, Gandhi T, Karson A, et al. Primary care physician attitudes concerning follow-up of abnormal test results and ambulatory decision support systems. Int J Med Inf 2003;71:137–49.
72. Iordache S, Orso D, Zelingher J. A comprehensive computerized critical laboratory results alerting system for ambulatory and hospitalized patients. Medinfo 2001;10:469–73.
73. Kuperman G, Boyle D, Jha A, et al. How promptly are inpatients treated for critical laboratory results? JAMIA 1998;5:112–9.
74. Schiff G, Kim S, Wisniewski M, et al. Every system is perfectly designed to: missed diagnosis of hypothyroidism uncovered by linking lab and pharmacy data. J Gen Intern Med 2003;18(suppl 1):295.
75. Kuperman G, Teich J, Tanasijevic M, et al. Improving response to critical laboratory results with automation: results of a randomized controlled trial. JAMIA 1999;6:512–22.
76. Poon E, Kuperman G, Fiskio J, et al. Real-time notification of laboratory data requested by users through alphanumeric pagers. J Am Med Inform Assoc 2002;9:217–22.
77. Schiff GD, Klass D, Peterson J, et al. Linking laboratory and pharmacy: opportunities for reducing errors and improving care. Arch Intern Med 2003;163:893–900.
78. Massachusetts Coalition for the Prevention of Medical Error. Communicating critical test results safe practice. JCAHO J Qual Saf 2005;31(2) [In press].
79. Berwick D. Eleven worthy aims for clinical leadership of health system reform. JAMA 1994;272:797–802.
80. Dawson N, Arkes H. Systematic errors in medical decision making: judgment limitations. J Gen Intern Med 1987;2:183–7.
81. Bartlett E. Physicians' cognitive errors and their liability consequences. J Health Care Risk Manag 1998;18:62–9.
82. McDonald J. Computer reminders, the quality of care and the nonperfectability of man. NEJM 1976;295:1351–5.
83. Smetzer J, Cohen M. Lesson from the Denver medication error/criminal negligence case: look beyond blaming individuals. Hosp Pharm 1998;33:640–57.
84. Fischhoff B. Hindsight ≠ foresight: the effect of outcome knowledge on judgment under uncertainty. J Exp Psychol: Hum Perc Perform 1975;1:288–99.
85. Hart JT. Cochrane Lecture 1997. What evidence do we need for evidence based medicine? J Epidemiol Community Health 1997 Dec;51(6):623–9.
WHAT ARE THE PRIMARY & SECONDARY OUTCOMES OF DIAGNOSTIC PROCESS - THE LEADING PHASE OF WORK IN PATIENT CARE MANAGEMENT PROBLEMS IN EMERGENCY DEPARTMENT?

Cosby KS, Roberts R, Palivos L et al. Characteristics of Patient Care Management Problems Identified in Emergency Department Morbidity and Mortality Investigations During 15 Years. Ann Emerg Med. 2008;51:251-261.

CLINICAL PRACTICE AND PATIENT SAFETY/ORIGINAL RESEARCH
Characteristics of Patient Care Management Problems Identified in Emergency Department Morbidity and Mortality Investigations During 15 Years
Karen S. Cosby, MD; Rebecca Roberts, MD; Lisa Palivos, MD; Christopher Ross, MD; Jeffrey Schaider, MD; Scott Sherman, MD; Isam Nasr, MD; Eileen Couture, DO; Moses Lee, MD; Shari Schabowski, MD; Ibrar Ahmad, BS; R. Douglas Scott II, PhD
From the Department of Emergency Medicine, Cook County Hospital, Rush Medical School, Chicago, IL.

Study objective: We describe cases referred for physician review because of concern about quality of patient care and identify factors that contributed to patient care management problems.
Methods: We performed a retrospective review of 636 cases investigated by an emergency department physician review committee at an urban public teaching hospital over a 15-year period. At referral, cases were initially investigated and analyzed, and specific patient care management problems were noted. Two independent physicians subsequently classified problems into 1 or more of 4 major categories according to the phase of work in which each occurred (diagnosis, treatment, disposition, and public health) and identified contributing factors that likely affected outcome (patient factors, triage, clinical tasks, teamwork, and system). Primary outcome measures were death and disability. Secondary outcome measures included specific life-threatening events and adverse events. Patient outcomes were compared with the expected outcome with ideal care and the likely outcome of no care.
Results: Physician reviewers identified multiple problems and contributing factors in the majority of cases (92%). The diagnostic process was the leading phase of work in which problems were observed (71%). Three leading contributing factors were identified: clinical tasks (99%), patient factors (61%), and teamwork (61%). Despite imperfections in care, half of all patients received some benefit from their medical care compared with the likely outcome with no care.
Conclusion: These reviews suggest that physicians would be especially interested in strategies to improve the diagnostic process and clinical tasks, address patient factors, and develop more effective medical teams. Our investigation allowed us to demonstrate the practical application of a framework for case analysis. We discuss the limitations of retrospective case analyses and recommend future directions in safety research. [Ann Emerg Med. 2008;51:251-261.]
Copyright © 2008 by the American College of Emergency Physicians. doi:10.1016/j.annemergmed.2007.06.483

Editor's Capsule Summary
What is already known on this topic
Morbidity and mortality conferences and other forms of quality review are commonplace. Their content has not been well studied.
What question this study addressed
What types of events and what causal and contributing factors are common in morbidity and mortality reviews?
What this study adds to our knowledge
The diagnostic process was judged the most common locus of failure in more than 600 cases spanning 15 years. Despite imperfections in care, more than half the patients still received some benefit compared with the likely outcome with no care at all.
How this might change clinical practice
Detailed case reviews provide useful information for practice improvement but have selection bias in the choice of cases and biases produced by the retrospective interpretation of incomplete information.
SEE EDITORIAL, P. 262.

INTRODUCTION
Background
Several major studies have reported the incidence of medical error and risks associated with health care.1-6 These studies have raised awareness of imperfections in care but offer little understanding of the nature of the work or solutions to improve safety. Although the focus on patient safety is relatively new, processes for improvement have existed for decades. We suggest that cumulative reviews of existing data, available in most health care organizations, can be used to guide current efforts to improve safety.
Our study was initially aimed at detecting "medical errors." Although the concept of error has been widely disseminated in the medical literature during the last decade, the term "error" carries with it connotations of carelessness, implies blame, and often focuses unduly on humans as the source of harm. As our understanding of these problems grew, we modified our study to describe events once labeled errors as "patient care management problems." This subtle change is intended to promote a healthier perspective and constructive analysis. The term "problem" implies something to be solved and states more directly our intent to seek strategies and solutions for improvement.
Importance
Demand for improvement has led to changes throughout health care, including new safety regulations for health care organizations, proposed legislation for error reporting and malpractice reform, and advancements in information technology applications for medical settings.7-10 Reforms in medical education have led to restrictions in house staff working hours, development of curriculums for patient safety, increased rigor in requirements for lifelong learning for physicians, and new emphasis on competency standards.11,12 Although we are anxious to improve, recommended changes should be founded on a thorough understanding of the types of patient care management problems that occur in medicine and factors that may increase risk.
Goals of This Investigation
The objective of this study was to characterize the types of cases referred to a physician review committee of an urban emergency department (ED) and identify the phase of work in which problems were detected and specific factors that affected quality of patient care. Our long-term goal is to use our existing morbidity and mortality investigations to guide safety interventions and to develop valid methods for future study of patient safety targets.
The ED has been described as a high-risk practice environment, particularly under conditions of crowding, and serves as a natural setting to study safety in health care.13,14

MATERIALS AND METHODS
Study Design and Setting
This was a retrospective study to characterize patient care management problems identified by routine mortality and morbidity surveillance at an urban public teaching hospital during the 15-year period spanning 1989 through 2003. The annual adult ED census during those years ranged from 109,000 to 128,000. All cases referred for emergency physician review between 1989 and 2003 were eligible. Our institution has separate reporting mechanisms for pediatric and trauma care; these populations are not included in our study. Case referrals came from ED staff, admitting services, consultants, or quality assurance managers. Two independent emergency physician reviewers assessed each case by using an existing framework described previously.15,16 The lead author designed the data collection form and performed review 1 in every case. Cases were randomly assigned to one of 8 coinvestigators for review 2. Whenever possible, reviewers were disqualified from cases in which they had direct involvement or firsthand knowledge. The study was exempted from review by our institutional review board.
Our department actively sought referrals from hospital staff of cases in which there were concerns about quality of care and encouraged reports of administrative and system problems, as well as questions about medical management. This process began as routine surveillance for morbidity and mortality cases but became a significant part of the quality process in our department. Cases underwent an investigation at referral that included interviews with medical staff involved in each case and a review of medical records and clinical data. Medical staff were invited to record details of their cases to explain circumstances as they existed at the time. Unlike many retrospective medical record reviews, these investigations yielded thick files rich in context and content. Each case was typically presented for discussion at 1 or more forums, including physician review committees, morbidity and mortality conferences, and faculty meetings. Once investigations were complete, the original medical records, notes from interviews and committee discussions, formal recommendations, and final case summaries were placed in an investigations library. Throughout the 15-year period of case collection summarized in this study, a common philosophy guided case review: cases were scrutinized with the intent to detect all patient care management problems, their potential causes, and contributing factors, with the ultimate goal to find solutions and seek improvement. These records constitute the source data reviewed for this study.
Data Collection
At the start of this study, a research assistant deidentified the source data from the investigations library. Study investigators reviewed the case files, identified management problems, and then classified the problems by the phase of work in which they occurred, their contributing factors, and patient outcome. Investigators were allowed to use summary statements and conclusions from the initial reviews but were also encouraged to record any other relevant factors they identified during their reviews of file documents. To ensure internal validity throughout the study, specific criteria were established for each data field before data collection.
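The dual-review design above is worth making concrete: review 1 is fixed (the lead author), while review 2 is drawn at random from the 8 coinvestigators, skipping anyone involved in the case. The following Python fragment is a hypothetical sketch of that assignment rule only; the reviewer names and the fallback behavior when every reviewer is conflicted are invented for illustration, since the paper says only "whenever possible."

    # Hypothetical sketch of the second-reviewer assignment rule:
    # pick review 2 at random from 8 coinvestigators, excluding anyone
    # with direct involvement in the case. Names are placeholders.
    import random

    COINVESTIGATORS = [f"reviewer_{i}" for i in range(1, 9)]  # 8 reviewers

    def assign_second_reviewer(case_staff: set[str], rng: random.Random) -> str:
        """Choose review 2 from coinvestigators not involved in the case."""
        eligible = [r for r in COINVESTIGATORS if r not in case_staff]
        # "Whenever possible": if everyone is conflicted, fall back to the
        # full pool rather than leave the case without a second review.
        pool = eligible or COINVESTIGATORS
        return rng.choice(pool)

    rng = random.Random(0)  # seeded only so the example is reproducible
    print(assign_second_reviewer({"reviewer_3"}, rng))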
Reviewers were asked to consider each study category. If they selected a category, they were required to check at least 1 of the category criteria to justify their selection. Reviewers participated in educational sessions and a pilot study, with feedback to verify consistent application of study definitions.
The framework that guided this investigation applies a variety of approaches to categorizing problems in care.15 We determine the stage at which a problem occurred and what factors likely contributed. Identified problems were classified into 1 or more broad categories according to the phase of work in which they occurred: diagnostic, treatment, disposition, and public health (Table 1). Reviewers were then asked to determine whether any of the following contributing factors added risk, contributed to a problem in care, or had a negative impact on patient outcome: patient factors, triage, clinical factors, teamwork, and system. Clinical factors included medical reasoning, specific interpretive and procedural skill sets, task-based problems, and affective influences. The concept of task-based problems has been described previously: they involve specific tasks that are so routine and basic to care that they are relegated to lower-order thinking, that is, they are done as a matter of habit and typically with little conscious thought.15,17 Problems in these basic tasks are observed as a failed human behavior but likely reflect a system weakness. Such events tend to occur when the system is overloaded, workers are distracted, or altered staffing patterns affect work flow. They can be used as a marker of system overload or disorganization. The category of affective influences includes the human tendency to be influenced by factors other than objective facts; it includes the sometimes unconscious effects of personality, external pressures, and conflict on reasoning.18 Individual assessments were made for each team of clinicians. System problems included any process, service, equipment, or supplies that failed or were unavailable, broken, or faulty. System problems were further classified by location: ED microsystem, hospital-wide failure, administrative factors, and community resources. Specific problems were also characterized according to Reason's17 scheme as problems in "planning," "execution," or both. This classification distinguishes problems that are primarily cognitive from those that are primarily related to process failure; this distinction offers additional insight into potential solutions. None of these descriptors or classifications was mutually exclusive. Judgments about care were based on whether actions and decisions were accurate, timely, and effective. An example of a training case analysis is demonstrated in Figure 1.
Outcome Measures
Primary patient outcome was categorized as death, permanent disability, long-term disability (between 30 days and 12 months), short-term disability (up to 30 days), or no disability. These outcomes were mutually exclusive. Final outcomes, however, do not always reflect significant events. Thus, we added 2 additional secondary outcome measures to capture events that deserve attention: acute but short-lived life-threatening events and adverse events (harm caused by medical management itself, independent of the patient's disease). These secondary outcome measures were not mutually exclusive and could also overlap with the primary outcomes.
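The data-collection rule described above (a category may be selected only when at least one of its criteria is checked) is essentially a record-validation constraint. The sketch below illustrates that constraint in Python; the category and criterion names are invented placeholders, not fields from the authors' actual form.

    # Illustrative sketch of the form constraint: a selected category must
    # be justified by at least one checked criterion. Names are invented.
    PHASES = {"diagnosis", "treatment", "disposition", "public_health"}

    def validate_review(selected: dict[str, list[str]]) -> list[str]:
        """Return validation problems (empty if the form is consistent)."""
        problems = []
        for category, checked_criteria in selected.items():
            if category not in PHASES:
                problems.append(f"unknown category: {category}")
            elif not checked_criteria:
                problems.append(f"{category} selected without a justifying criterion")
        return problems

    # 'diagnosis' is justified; 'treatment' is selected with no criterion.
    form = {"diagnosis": ["delay in diagnosis affected outcome"], "treatment": []}
    print(validate_review(form))  # ['treatment selected without a justifying criterion']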
Finally, each reviewer was asked to compare the actual patient outcome for each patient in this series to the outcome expected for optimal care and the likely outcome of no care to determine to what extent our management met expectations or caused harm.
Primary Data Analysis
The results for both reviewers were entered in a SAS data set. To increase data entry accuracy, 2 personnel collaborated: one to read aloud results from the data collection forms and validate the keystroke entry and the other to perform keystrokes. Observer agreement between reviewers 1 and 2 was quantified using the κ statistic. For rates and proportions, 95% confidence intervals were calculated. All analyses were performed with SAS version 7 (SAS Institute, Cary, NC).

RESULTS
Of the original library of 673 cases, 37 were excluded from the study because of incomplete or illegible records, inadequate 30-day follow-up, incomplete investigations, or inconclusive summaries (4 unexplained deaths). The final study group of 636 patients includes 353 male patients, 276 female patients, and 7 with undesignated sex. (Seven cases had sex designation inadvertently removed from the original source data.)

Table 1. Study definitions.

Phases of work in which problems were observed
• Diagnosis: Failure to make a correct diagnosis, a delay in diagnosis that affects outcome, or an unnecessary or excessive diagnostic evaluation that itself caused harm. Examples: failure to diagnose pneumonia; delay in recognizing a cord compression leads to paralysis; contrast used in an unnecessary test causes renal failure.
• Treatment: Failure to provide appropriate treatment (error of omission) or an inappropriate treatment (error of commission). Examples: failure to give antibiotics for meningitis; treating a viral infection with antibiotics.
• Disposition: Failure to provide appropriate level of care according to acuity, or an inadequate plan for follow-up care. Examples: patient moved to higher level of care after admission, deterioration in condition not anticipated; neck mass not referred for biopsy, inadequate follow-up plan leads to delay in care and poor prognosis.
• Public health: Care that poses no specific threat to the patient but places others at risk. Examples: failing to isolate tuberculosis; failure of contact tracing for meningitis; failing to recognize risk to relatives of victims of carbon monoxide exposure.

Reason's classification
• Problem with planning: A poor plan at the outset. Example: failing to recognize and thus treat a condition.
• Problem with execution: Failure to carry out the plan as intended. Example: intending to deliver a treatment but mistakenly delivering the wrong drug.

Contributing factors
• Patient factors: A specific patient characteristic that poses risk, including inability to communicate (unconscious or altered mental status; language barrier), inability to cooperate (delirium, dementia, or under the influence of substances), specific anatomy that poses risk (anatomically difficult airway, morbid obesity, difficult venous access), or a characteristic that elicits destructive affective bias in the care provider (borderline personality disorder alienates medical staff).
• Triage factors: Failure to apply triage protocol or failure of criteria to accurately assess severity of illness. May be cognitive (knowledge deficit), design (inadequate criteria), or situational (excessive demand).
• Clinical factors/tasks: Failure in a specific clinical task, including medical reasoning (incorrect clinical impression or a flawed decision, e.g., failure to detect acute coronary syndrome; failing to recognize a clinical pattern of illness), specific skill set (inaccurate interpretation of clinical data such as ECGs, laboratory data, or imaging studies, or complication from invasive procedures, e.g., missing a relevant radiographic abnormality; pneumothorax from central line placement), task-based failure (failure in routine bedside tasks such as measurement of vital signs, e.g., failing to detect a change in vital signs when staffing ratio exceeds established margins of safety), and affective influences (personal bias, evidence of interpersonal conflict, or the influence of factors other than objective facts in clinical decisions, e.g., interpersonal conflict interferes with clinical judgment).
• Teamwork factors: Miscommunication, poor team transition, authority gradients, or failure to supervise. Example: confusion about treatment given in the ED leads to failure to give medication.
• System factors: Failure due to inadequate supplies, equipment, services, or policies, in the ED microsystem (airway supplies not restocked), hospital-wide system (hospital call system not updated; paging system down), administration (consultants not available), or community (community dialysis center closes, limiting access for care).

There was significant overlap between the main categories of work in which problems were noted, with diagnosis being the most commonly identified classification (451; 71%), followed by disposition (280; 44%) and then treatment (265; 42%). Public health decisions comprised the smallest fraction of cases (23; 4%). Multiple categories were selected in more than half the cases (322; 51%) (Figure 2). Problems were most likely to occur in the planning stage (591; 93%) compared with the execution stage (170; 27%), although some events had problems in both (144; 23%).

Figure 2. Major categories of work in which problems were observed. This Venn diagram illustrates the number of events identified in each phase of work and their overlap. Numbers in parentheses indicate total numbers for each major heading. Public health decisions, not shown in this diagram, accounted for an additional 23 events (4% of all events).

TRAINING CASE EXAMPLE DEMONSTRATING THE FRAMEWORK FOR ANALYSIS
Case Summary
A prominent businessman with crushing chest pain and severe hypertension arrives in a remote rural community ED that lacks an interventional cardiologist. By triage policy he is rapidly moved to an acute care area, where an EKG is promptly interpreted as evidence of an acute myocardial infarction. A chest x-ray is viewed by the treating physician, but the radiologist is unavailable to review it. The nearest hospital with cardiology support is more than an hour away. A decision is made to offer the patient thrombolytic therapy. Consent is obtained and the medication delivered. The patient later dies of an aortic dissection.
After interviews with medical staff and further investigation, additional details were obtained and the case analyzed. The critique of this case concluded that failures occurred in two main categories of work:
1. Diagnosis (failure to diagnose aortic dissection), and
2. Treatment (thrombolytic therapy given in the face of severe, uncontrolled hypertension).
Contributing factors included:
1. Flawed medical reasoning (uncontrolled hypertension not treated; thrombolytics given despite uncontrolled blood pressure).
2. Specific skill-set failure (failure to recognize radiographic evidence for dissection, findings later verified by expert review).
3. Ineffective team communication (the medical team did not communicate to the radiologist the urgent need for his assistance).
4. Task-based failure (the radiologist noted that he was distracted by competing demands).
5. Adverse affective influences (the physician later noted that he was influenced by his desire to act aggressively because he recognized the patient to be an influential community leader, as well as his desire to conform to local standards to decrease "door-to-needle time").
A limited group discussion centered around the question of whether or not the community should support local chest pain centers and improve resources for similar cases, but ultimately no system factors were identified.
Figure 1. Training case example demonstrating the framework for case analysis.

Problems in specific clinical tasks were the most common contributing factor identified (632; 99%) (Table 2). Problems in clinical tasks most commonly involved medical reasoning (595; 94%) but also included specific skill sets (212; 33%), routine task-based problems (173; 27%), and destructive affective influences (38; 6%). In almost half of all cases, more than 1 group of clinicians participated in specific management problems (289; 45%). Thus, many clinical problems did not occur in isolation and were often not the result of a single person, team, or process. Other leading factors that affected outcome included patient factors (391; 61%) and teamwork (387; 61%). System and triage factors were selected to a lesser extent (41% and 16%, respectively). The distribution of contributing factors was similar across all phases of work.
More than 1 contributing factor was identified for most cases. In fact, 2 or more factors were identified in all but 56 cases (92%); 420 cases (66%) had 3 or more factors identified; 113 cases (18%) had 5 or more factors identified. Specific problems in care were typically not the result of failure by 1 factor (person or process) alone but rather the result of the complex interplay of multiple factors that ultimately influenced clinical work.
Two thirds (66%) of patients in this series had death, major permanent disability, or disability exceeding 30 days; 34% returned to their baseline state of health within 30 days or had no disability (Table 2). Acute life-threatening events occurred sometime during the ED or hospital course in 66 cases (10%), including cardiac arrest, airway emergency, respiratory failure, shock, anaphylaxis, or major dysrhythmia. Adverse events (harm from medical management itself) occurred in 86 cases (14%). Of all adverse events, 24 (28%) resulted in death and 23 (27%) were associated with an acute life-threatening event. Despite these identified events, 50% of all patients in this series received net benefit from their care; 40% had an outcome similar to that expected for no care. The remaining 11% had an outcome that was worse than expected according to the natural history of their disease. The majority of cases in this data set failed to achieve the full benefit expected from optimal care (590; 93%).
Observer agreement was measured with the κ statistic.
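The agreement statistic used throughout the results is Cohen's κ, which corrects raw percentage agreement for the agreement expected by chance. A worked toy computation may make the values in Table 3 easier to read; the case data below are invented, not figures from the study.

    # Worked example of Cohen's kappa for two reviewers making yes/no
    # selections on the same cases. Toy data, not figures from the study.
    def cohens_kappa(r1: list[bool], r2: list[bool]) -> float:
        """kappa = (p_observed - p_chance) / (1 - p_chance)."""
        n = len(r1)
        p_obs = sum(a == b for a, b in zip(r1, r2)) / n
        # Chance agreement from each reviewer's marginal 'yes' rate.
        p1, p2 = sum(r1) / n, sum(r2) / n
        p_chance = p1 * p2 + (1 - p1) * (1 - p2)
        return (p_obs - p_chance) / (1 - p_chance)

    # 10 toy cases: the reviewers agree on 9 of 10.
    reviewer1 = [True, True, True, False, False, True, False, True, True, False]
    reviewer2 = [True, True, True, False, True, True, False, True, True, False]
    print(round(cohens_kappa(reviewer1, reviewer2), 2))  # 0.78

Note that 90% raw agreement yields κ of only 0.78 here because chance agreement is already 54%; this is why the study's subjective categories show high percentage agreement but low κ.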
Observer agreement was highest in categories with well-defined, objective criteria such as phases of work (86% to 100% agreement; κ 0.71 to 1.00) and patient outcomes (100% agreement; κ 0.98 to 1.00). When reviewers were asked to assess more subjective factors such as destructive affective influences, the observer agreement was only fair (92% agreement; κ 0.34) (Table 3).

LIMITATIONS
This study is a retrospective review of investigations completed on cases referred because of poor outcome, flawed processes, or the perception of imperfections in care. There are several limitations inherent to the study design.
First, the study has selection bias. Because our cases came largely from physician referrals, our data are weighted toward the types of problems that physicians find interesting or important and do not necessarily reflect a broad sampling of all types of cases. This does not mean that they represent the most common or even the "worst" management problems; our cases probably represent the most visible events or those most likely to have immediate impact on patient outcome. There is no control group of randomly selected patients during the same study period. Thus, many of the same factors identified in the study sample may have been present in other cases without harm. Cases with significant problems or harm may have occurred during the same period but gone unreported.
Second, post hoc critiques of care are inevitably subject to bias. Our cases were typically referred after outcome was known; thus, they have outcome bias, the tendency to judge similar events more harshly when the outcome is poor.19 In addition, our investigations were challenged by our ability to judge the events as they actually occurred, not as we might imagine them. Hindsight bias, the reconstruction of events in a way that makes sense to the investigator, can mislead judgments about events.20 Even clinicians present at the time, in an effort to explain their actions, may reconstruct memories that are only partially true. We attempted to minimize the influence of bias by encouraging staff to record their impressions and recall specific information they had or lacked at the time of their decisions, what factors influenced their decisions, and what the ambient conditions were in the ED at the time. Timelines of actions were generated from electronic records and added to information from interviews to reconstruct, as much as possible, the actual events. The thick descriptions in our study files permit detailed analysis aimed at providing contextually rich detail to explain the circumstances surrounding care.
There are several other limitations to our study. The type of factors identified is constrained by the expertise of the reviewers and the quality of information they preserved. Although we observe the presence of factors that may have contributed to outcome, we do not make conclusions of causality. Although our study attempted to define discrete phases of work in which problems occurred, we found overlap between the major categories, which suggests that our classifications (and perhaps the events themselves) were not independent of one another; clinical decisions in one phase of work likely affect other phases of work. Failure to discriminate between these categories also reflects several natural limitations in quality reviews. First, it is difficult to separate a single flawed step in the continuum of care.
Second, a critical appraisal of many cases, with good or bad outcome, can often find multiple flaws. Last, such distinctions require judgments about what was in the mind of the clinician, facts that may be difficult for the involved clinicians to recall accurately and even more difficult for an independent reviewer remote from the event to determine.
The language of safety science and models of causation are controversial, particularly when applied to health care.21 Terms such as "error" tend to evoke accusations of blame, whereas terms such as "problems" or "failures" may be vague. Judgments about quality and causality are flawed by outcome knowledge. From a simple human perspective, we want patients to have the best possible outcome. When they do not, there is a natural tendency to cite any imperfection in their care as a contributing (if not causal) factor. No doubt, similar problems in the care of patients with satisfactory results are more easily forgiven and dismissed. This study is ultimately a summary of the aspects of clinical work that our reviewers found to be most prone to problems and thus most likely to contribute to risk. The development of safety systems in other fields has started by analysis of accidents or events and then progressed to more sophisticated, systematic, and prospective design. Studies such as ours provide a framework from which to begin.

Table 2a. Association between contributing factors and primary outcome measures. Values are No. (%). Contributing factors are not mutually exclusive; thus, percentages do not sum to 100.
Contributing factor | All cases (n=636) | Death (n=226) | Permanent disability (n=24) | Long-term disability (n=171) | Short-term disability (n=178) | No disability (n=37)
Patient | 391 (61) | 170 (75) | 16 (67) | 86 (50) | 102 (57) | 17 (46)
Triage | 103 (16) | 47 (21) | 2 (8) | 20 (12) | 31 (17) | 3 (8)
Clinical (all subsets) | 632 (99) | 225 (99) | 23 (96) | 171 (100) | 177 (99) | 36 (97)
Reasoning | 595 (94) | 212 (94) | 21 (88) | 164 (96) | 169 (95) | 29 (78)
Skill set | 212 (33) | 77 (34) | 12 (50) | 47 (27) | 58 (33) | 18 (49)
Task based | 173 (27) | 91 (40) | 7 (29) | 29 (17) | 37 (21) | 9 (24)
Affective influences | 38 (6) | 11 (5) | 2 (8) | 9 (5) | 14 (8) | 2 (5)
Teamwork | 387 (61) | 155 (66) | 13 (54) | 92 (54) | 100 (56) | 27 (73)
System | 261 (41) | 103 (46) | 9 (38) | 79 (46) | 60 (34) | 10 (27)

Table 2b. Association between contributing factors and secondary outcome measures. Values are No. (%). Secondary outcomes note specific events that are a subset of all cases, and contributing factors are not mutually exclusive; thus, percentages do not sum to 100.
Contributing factor | All cases (n=636) | Life threats (n=66) | Adverse events (n=86)
Patient | 391 (61) | 47 (71) | 59 (69)
Triage | 103 (16) | 11 (17) | 10 (12)
Clinical (all subsets) | 632 (99) | 66 (100) | 84 (98)
Reasoning | 595 (94) | 60 (91) | 69 (80)
Skill set | 212 (33) | 28 (42) | 49 (57)
Task based | 173 (27) | 23 (35) | 30 (35)
Affective influences | 38 (6) | 7 (11) | 5 (6)
Teamwork | 387 (61) | 40 (61) | 50 (58)
System | 261 (41) | 22 (33) | 30 (35)

DISCUSSION
Our data demonstrate that cases referred because of patient care management problems are often complex and not the result of the failure of any single individual or process at any one moment in time. The typical course of a "medical error" has been described as a cascade, in which one failure leads to another.22 However, our study went beyond simply recognizing complexity to identifying specific factors that can be targeted to improve safety.
Our reviewers selected "diagnosis" as the leading phase of work in which problems were noticed, consistent with results from other major studies,2,4,23-25 which is not unexpected, because diagnosis is at the heart of clinical work and is the foundation on which all other actions are predicated. That judgment is likely influenced by the perspective of our reviewers, who may be inclined to focus on their own expertise and reluctant to judge factors outside their usual sphere of influence.
Our results provide some clues about why the diagnostic process is difficult and at times imprecise. Two major contributing factors were consistently identified in most cases in our study and likely created an environment in which rapid decisions may be problematic: patient factors and teamwork. Patient factors include conditions that impair the ability of patients to communicate or cooperate with medical staff, anatomic problems that add difficulty to the ability to provide care, and conditions that elicit affective bias in care providers. Patients who present to the ED with altered mental status or in distress may be unable to offer important details of their medical history, which contributes to the "information gap" described by Stiell et al,26 in which physicians must act despite a paucity of reliable facts. Some of these factors are inherent to the practice of emergency medicine and contribute substantially to risk. Other acute care specialties have recognized the importance of teamwork; our study confirmed that problems in care often involve ineffective team functioning.27,28 ED care tends to be punctuated by numerous transitions that can lead to dropped tasks and lost information. In addition, handoffs in care may hamper the ability of clinicians to recognize dynamic changes in a patient's condition.29 There are implications from this study that can be used to design safer care.

Table 3. Summary of descriptive study statistics and observer agreement. Categories are not mutually exclusive; thus, percentages do not sum to 100. CI, confidence interval.
Variable | No. of cases | Total, % (95% CI) | Agreement, % | κ (95% CI)
Phases of work in which problems were observed
Diagnosis | 451 | 71 (67–75) | 98 | 0.95 (0.93–0.98)
Treatment | 265 | 42 (38–46) | 93 | 0.85 (0.81–0.89)
Disposition | 280 | 44 (40–48) | 86 | 0.71 (0.66–0.77)
Public health | 23 | 4 (2–6) | 100 | 1.00 (1.00–1.00)
Reason's classification
Planning | 591 | 93 (96–98) | 92 | 0.43 (0.30–0.56)
Execution | 170 | 27 (24–30) | 79 | 0.50 (0.43–0.58)
Planning and execution | 144 | 23 (20–26) | 79 | 0.47 (0.39–0.54)
Contributing factors
Patient | 391 | 61 (57–65) | 77 | 0.57 (0.50–0.63)
Triage | 103 | 16 (13–19) | 93 | 0.74 (0.67–0.81)
Clinical (all subsets) | 632 | 99 (98–100) | 99 | 0.54 (0.19–0.90)
Reasoning | 595 | 94 (92–96) | 94 | 0.45 (0.31–0.60)
Specific skill set | 212 | 33 (29–37) | 90 | 0.77 (0.73–0.83)
Task based | 173 | 27 (24–30) | 82 | 0.47 (0.39–0.55)
Affective influences | 38 | 6 (4–8) | 92 | 0.34 (0.20–0.48)
Teamwork | 387 | 61 (57–65) | 72 | 0.41 (0.34–0.48)
System | 261 | 41 (37–45) | 83 | 0.49 (0.38–0.55)
Primary outcomes
Death | 226 | 36 (32–40) | 100 | 0.99 (0.98–1.10)
Permanent disability | 24 | 4 (2–5) | 100 | 1.00 (1.00–1.00)
Long-term disability | 171 | 27 (23–30) | 100 | 0.99 (0.98–1.00)
Short-term disability | 178 | 28 (24–31) | 100 | 0.99 (0.98–1.00)
No disability | 37 | 6 (4–8) | 100 | 0.98 (0.96–1.01)
Secondary outcomes
Life-threatening events | 66 | 10 (8–12) | 100 | 0.99 (0.98–1.01)
Adverse events | 86 | 14 (11–17) | 100 | 0.99 (0.98–1.01)
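The "Total, % (95% CI)" column in Table 3 is a binomial proportion with a confidence interval; the paper does not name the exact interval method. As a check, the sketch below recomputes the first row (diagnosis: 451 of 636 cases) under the assumption of a normal-approximation (Wald) interval, which gives 71% (67–74), within rounding of the printed 67–75.

    # Recomputing Table 3's first row: 451/636 diagnosis cases.
    # The Wald interval is an assumption on our part; the paper states
    # only that 95% CIs were calculated.
    import math

    def proportion_ci(successes: int, n: int, z: float = 1.96):
        """Return (proportion, lower, upper) for a Wald 95% CI."""
        p = successes / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        return p, p - half_width, p + half_width

    p, lo, hi = proportion_ci(451, 636)
    print(f"{100*p:.0f}% (95% CI {100*lo:.0f}-{100*hi:.0f})")  # 71% (95% CI 67-74)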
First, clinicians need reliable access to information and strategies for making decisions in the face of uncertainty. Cognitive psychologists can offer specific training in cognition, critical thinking, and decisionmaking.30-38 Clinical decision support systems and cognitive aids can reduce cognitive load and standardize clinical protocols.39,40 Medical simulation can provide opportunities to practice and refine performance.41
Second, our study reveals that patient factors contribute substantially to management problems. Vincent et al16,42 have observed that patient factors may contribute to the likelihood of an adverse event; their framework for analyzing clinical events includes consideration of patient factors. Patient factors are often beyond our immediate control; harm caused by some patient factors has even been labeled "no fault."38,43 Although a no-fault judgment may be fair to those present at the time, it prematurely limits discussions about how to improve care. We argue that all information from investigations can be used to improve care, even if problems are judged to be unavoidable at the time. Creative use of resources can address common patient factors, such as providing 24-hour access to interpreter services, improving electronic medical record databases for access to reliable medical information, and using ultrasonographic guidance to improve our ability to care for patients with difficult venous access, to name just a few.
Third, organization and teamwork in acute care medicine present specific challenges. Reliable and consistent processes for transitions and exchange of patient care responsibility and accurate communication of test results and patient information are fundamental to the design of safe health care systems, particularly in the dynamic setting of the ED.
The fact that system factors were not identified as a leading contributor to problems identified in this study contrasts sharply with current views in safety science.8,44 Our study was performed by physicians, not safety scientists; thus, their reviews focus on medical judgments, not system design. Human tasks and the systems that support them are inextricably linked. The fact that our reviewers consistently selected human aspects of tasks over system factors may be due to fundamental attribution error, the tendency for organizational actors to attribute undesirable outcomes to the perceived character flaws of people rather than to the processes engendered by the organization's structure.45 This tendency is reinforced by cultural pressures that promote an overriding sense of professional accountability but create "system blindness," that is, clinicians are largely unaware of the potential for system design to improve care.46 The failure to detect a similar proportion of human and system factors may also reflect an imbalance in health care design that encourages reliance on individuals over development of support systems. If so, the recent focus on design and development of safe systems to support clinical work offers significant promise for improvement.
Future safety research will likely focus on prospective error reporting systems. The development of a national error reporting mechanism in the United States is under way but faces challenges to define process: how to collect case information, what information to collect, and how to analyze and then disseminate useful results.
Once in place, reporting systems are often slow to produce actionable items. Meanwhile, studies such as ours can use historical cases to identify major problems needing correction and identify the types of factors that are worth tracking prospectively.
In the course of this study, we learned several lessons that can affect the success of future error reporting systems. During the early phases of the study, we attempted to identify a variety of potential contributing factors based on safety research, including cognitive bias, human factors, ergonomics, and ED design. Our efforts failed, primarily because physician reviewers were not familiar with the concepts and routine clinical investigations do not typically address them. We recommend that future studies and error reporting systems engage nonclinician experts from safety disciplines to provide fresh insight and perspective. We also found that observer agreement was best for categories that were described by detailed checklists and definitions. Reviewers were less likely to agree when asked to make subjective assessments. Large-scale reporting systems need a workable reporting mechanism and may struggle between the need for uniformity and objective definitions and the desire for free-text reports. An ideal system may need both and may need to be modified periodically to allow for evolving standards of care and improved understanding of causes of medical failure.
For future work, there is a variety of sources of data that can be used to probe problems in patient care. Dramatic, visible, and highly publicized accounts of individual cases have been used successfully to drive safety initiatives.47 In contrast, aggregate data from existing databases of cases may provide a broad overview of risk and help define particular tasks and moments in patient care that are most vulnerable and most in need of system support. Several studies have used closed malpractice claims and risk management files to identify causes and contributing factors to adverse events.48,49 We suggest that collective reviews of existing mortality and morbidity investigations and internal review processes, available in most health care organizations, offer an additional source of data to guide improvement (Appendix E1, available online at www.annemergmed.com). Finally, we suggest that collaboration with safety experts in other disciplines can expand our perspective on safety concepts and the potential for system improvements in health care.
In summary, the primary goal for understanding patient care management problems is to prevent them or at least mitigate harm. However, our efforts to improve care have been frustrated by the realization that health care and patient care processes are complex.5,16,36-38 Our data confirm that most problems are multifactorial. The leading factors that contributed to problems detected by physician reviewers in this study include medical reasoning, patient factors, and teamwork, although these problems occurred in a background of system imperfections. Efforts to improve safety in medicine should be directed at improving access to accurate medical information and communication across the continuum of medical care; coordination of care among health care teams; and improved processes, education, and support for decisionmaking in the uncertain environment of acute care medicine.
Advances in system design to support these aspects of clinical care offer opportunities for improvement in health care safety.
We wish to thank Mr. Geoffrey Andrade for assistance with data entry and Mr. Donald Bockenfeld, BS, for assistance with article preparation.
Supervising editor: Robert L. Wears, MD, MS
Author contributions: KSC and RR conceived the study and obtained funding. KSC, RR, JS, and RDS designed the study. KSC, LP, CR, JS, S Sherman, IN, EC, ML, and S Schabowski were responsible for data acquisition. KSC, RR, IA, and RDS analyzed and interpreted the data. KSC and RR drafted the article and all authors contributed substantially to its revision. KSC, RR, IA, and RDS were responsible for statistical analysis. RR, LP, CR, JS, S Sherman, IN, EC, ML, S Schabowski, IA, and RDS provided administrative, technical, or material support. KSC and RR were responsible for overall study supervision. KSC and RR had full access to all the data and take responsibility for the integrity of the data and the accuracy of the data analysis. KSC takes responsibility for the paper as a whole.
Funding and support: By Annals policy, all authors are required to disclose any and all commercial, financial, and other relationships in any way related to the subject of this article that might create any potential conflict of interest. See the Manuscript Submission Agreement in this issue for examples of specific conflicts covered by this statement. This study was funded in part by the Agency for Healthcare Research and Quality, grant number 5 P20 HS011552, and the Department of Emergency Medicine, Cook County Hospital (Stroger).
Publication dates: Received for publication November 18, 2006. Revisions received March 8, 2007; May 21, 2007; and June 14, 2007. Accepted for publication June 25, 2007. Available online October 15, 2007.
Address for reprints: Karen Cosby, MD, Department of Emergency Medicine, Cook County Hospital (Stroger), 10th floor Administration Building, 1900 W Polk St, Chicago, IL 60612; 312-864-1986 or 312-864-0060, fax 312-864-9656; E-mail [email protected].
REFERENCES
1. Brennan TA, Leape LL, Laird NM, et al. Incidence of adverse events and negligence in hospitalized patients: results of the Harvard Medical Practice Study I. N Engl J Med. 1991;324:370-376.
2. Leape LL, Brennan TA, Laird N, et al. The nature of adverse events in hospitalized patients: results of the Harvard Medical Practice Study II. N Engl J Med. 1991;324:377-384.
3. Thomas EJ, Studdert DM, Burstin HR, et al. Incidence and types of adverse events and negligent care in Utah and Colorado. Med Care. 2000;38:261-271.
4. Wilson RM, Runciman WB, Gibberd RW, et al. The Quality in Australian Health Care Study. Med J Aust. 1995;163:458-471.
5. Vincent C, Neale G, Woloshynowych M. Adverse events in British hospitals: preliminary retrospective record review. BMJ. 2001;322:517-519.
6. Forster AJ, Asmis TR, Clark HD, et al. Ottawa Hospital Patient Safety Study: incidence and timing of adverse events in patients admitted to a Canadian teaching hospital. CMAJ. 2004;170:1235-1240.
7. Kohn LT, Corrigan JM, Donaldson MS, eds. To Err Is Human: Building a Safer Health System. Washington, DC: National Academy Press; 2000.
8. Committee on Quality of Health Care in America, Institute of Medicine. Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academy Press; 2001.
9. Fassett WE. Patient Safety and Quality Improvement Act of 2005.
Ann Pharmacother. 2006;40:917-924.
10. Clinton HR, Obama B. Making patient safety the centerpiece of medical liability reform. N Engl J Med. 2006;354:2205-2208.
11. Landrigan CP, Rothschild JM, Cronin JW, et al. Effect of reducing interns' work hours on serious medical errors in intensive care units. N Engl J Med. 2004;351:1838-1848.
12. Cosby KS, Croskerry P. Patient safety: a curriculum for teaching patient safety in emergency medicine. Acad Emerg Med. 2003;10:69-78.
13. Croskerry P, Sinclair D. Emergency medicine: a practice prone to error? Can J Emerg Med. 2001;3:271-276.
14. Committee on the Future of Emergency Care in the United States Health System, Institute of Medicine. Hospital-Based Emergency Care: At the Breaking Point. Washington, DC: National Academies Press; 2006.
15. Cosby KS. A framework for classifying factors that contribute to error in the emergency department. Ann Emerg Med. 2003;42:815-823.
16. Vincent C, Taylor-Adams S, Stanhope N. Framework for analysing risk and safety in clinical medicine. BMJ. 1998;316:1154-1157.
17. Reason J. Human Error. Cambridge, England: Cambridge University Press; 1990.
18. Croskerry P. Diagnostic failure: a cognitive and affective approach. In: Henriksen K, Battles JB, Marks ES, et al, eds. Advances in Patient Safety: From Research to Implementation. Vol 2. Rockville, MD: Agency for Healthcare Research and Quality; 2005:241-254. AHRQ Publication No. 05-0021-2.
19. Caplan RA, Posner KL, Cheney FW. Effect of outcome on physician judgments of the appropriateness of care. JAMA. 1991;265:1957-1960.
20. Henriksen K, Kaplan H. Hindsight bias, outcome knowledge and adaptive learning. Qual Saf Health Care. 2003;12(suppl II):ii46-ii50.
21. Senders JW, Moray NP. Human Error (Cause, Prediction, and Reduction): Analysis and Synthesis. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc; 1991.
22. Woolf SH, Kuzel AJ, Dovey SM, et al. A string of mistakes: the importance of cascade analysis in describing, counting, and preventing medical errors. Ann Fam Med. 2004;2:317-326.
23. Kuhn GJ. Diagnostic errors. Acad Emerg Med. 2002;9:740-750.
24. Schiff GD, Kim S, Abrams R, et al. Diagnosing diagnostic errors: lessons from a multi-institutional collaborative project. In: Henriksen K, Battles JB, Marks ES, et al, eds. Advances in Patient Safety: From Research to Implementation. Vol 2. Rockville, MD: Agency for Healthcare Research and Quality; 2005:255-278. AHRQ Publication No. 05-0021-2.
25. Croskerry P. Achilles heels of the ED: delayed or missed diagnoses. ED Leg Lett. 2003;14:109-120.
26. Stiell A, Forster AJ, Stiell IG, et al. Prevalence of information gaps in the emergency department and the effect on patient outcomes. CMAJ. 2003;169:1023-1028.
27. Davies JM. Team communication in the operating room. Acta Anaesthesiol Scand. 2005;49:898-901.
28. Jain M, Miller L, Belt D, et al. Decline in ICU adverse events, nosocomial infections and cost through a quality improvement initiative focusing on teamwork and culture change. Qual Saf Health Care. 2006;15:235-239.
29. Behara R, Wears RL, Perry SJ, et al. A conceptual framework for studying the safety of transitions in emergency care. In: Henriksen K, Battles JB, Marks ES, et al, eds. Advances in Patient Safety: From Research to Implementation. Vol 2. Rockville, MD: Agency for Healthcare Research and Quality; 2005:309-321. AHRQ Publication No. 05-0021-2.
30. Croskerry P. Achieving quality in clinical decision making: cognitive strategies and detection of bias.
31. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003;78:775-780.
32. Croskerry P. Cognitive forcing strategies in clinical decisionmaking. Ann Emerg Med. 2003;41:110-120.
33. Croskerry P. The cognitive imperative: thinking about how we think. Acad Emerg Med. 2000;7:1223-1231.
34. Graber M. Metacognitive training to reduce diagnostic errors: ready for prime time? Acad Med. 2003;78:781.
35. Elstein AS, Schwarz A. Clinical problem solving and diagnostic decision making: selective review of the cognitive literature. BMJ. 2002;324:729-732.
36. Kassirer JP, Kopelman RI. Learning Clinical Reasoning. Baltimore, MD: Williams & Wilkins; 1991.
37. Graber M, Gordon R, Franklin N. Reducing diagnostic errors in medicine: what's the goal? Acad Med. 2002;77:981-992.
38. Graber ML, Franklin N, Gordon R. Diagnostic error in internal medicine. Arch Intern Med. 2005;165:1493-1499.
39. Garg AX, Adhikari NK, McDonald H, et al. Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: a systematic review. JAMA. 2005;293:1223-1238.
40. Bates DW, Kuperman GJ, Wang S, et al. Ten commandments for effective clinical decision support: making the practice of evidence-based medicine a reality. J Am Med Inform Assoc. 2003;10:523-530.
41. Bond WF, Deitrick LM, Arnold DC, et al. Using simulation to instruct emergency medicine residents in cognitive forcing strategies. Acad Med. 2004;79:438-446.
42. Vincent C, Taylor-Adams S, Chapman EJ. How to investigate and analyse clinical incidents: Clinical Risk Management and Association of Litigation and Risk Management protocol. BMJ. 2000;320:777-781.
43. Kassirer JP, Kopelman RI. Cognitive errors in diagnosis: instantiation, classification, and consequences. Am J Med. 1989;86:433-441.
44. Wang MC, Hyun JK, Harrison M, et al. Redesigning health systems for quality: lessons from emerging practices. Jt Comm J Qual Patient Saf. 2006;32:599-611.
45. Ross L. The intuitive psychologist and his shortcomings: distortions in the attribution process. In: Berkowitz L, ed. Advances in Experimental Social Psychology. New York, NY: Academic Press; 1977:174-221.
46. Banja J. Medical Errors and Medical Narcissism. Sudbury, MA: Jones and Bartlett Publishers; 2005.
47. Conway JB, Weingart SN. Organizational change in the face of highly public errors. I. The Dana-Farber Cancer Institute experience. Agency for Healthcare Research and Quality Morbidity and Mortality Rounds Web site. Available at: http://www.webmm.ahrq.gov/perspective.aspx?perspectiveID=3. Accessed January 26, 2007.
48. Kachalia A, Gandhi TK, Poupolo AL, et al. Missed and delayed diagnoses in the emergency department: a study of closed malpractice claims from 4 liability insurers. Ann Emerg Med. 2007;49:196-205.
49. White AA, Wright SW, Blanco R, et al. Cause-and-effect analysis of risk management files to assess patient care in the emergency department. Acad Emerg Med. 2004;11:1035-1041.

IMAGES IN EMERGENCY MEDICINE (continued from p. 230)

DIAGNOSIS: Emphysematous cystitis.
Emphysematous cystitis is a rare, necrotizing infection characterized by gas collection in the urinary bladder wall and lumen, resulting from infection by gas-producing pathogens.1 The risk factors are diabetes (up to 80%), bladder outlet obstruction, recurrent urinary tract infection, urinary stasis, neurogenic bladder, immunosuppression, female sex, and being a transplant recipient.2 The mechanism and pathogenesis of emphysematous cystitis are still unknown. The gas is thought to be produced by the infecting organism through fermentation of albumin or glucose in the urine. The most common organisms are Escherichia coli,3 Enterobacter aerogenes, and Klebsiella pneumoniae.

Emphysematous cystitis has nonspecific clinical features and is often misdiagnosed; in practice, the diagnosis is frequently made from unanticipated imaging findings. Plain abdominal radiography usually makes the diagnosis, with high sensitivity (97.4%),4 but abdominal computed tomography is the most sensitive and specific diagnostic tool.5 About 18.8% of emphysematous cystitis cases have complicated courses.4 Emphysematous cystitis demands prompt diagnosis and intervention,6 including aggressive parenteral antibiotics and even bladder drainage.7 Generally, emphysematous cystitis has a favorable prognosis, but delays in diagnosis and treatment may contribute to a mortality rate that approaches 20%. A favorable prognosis may be achieved by early recognition of emphysematous cystitis by clinical and radiologic assessment, by appropriate antibiotic use, and by timely surgical intervention when indicated.

This patient received empiric antibiotics and underwent prompt surgical drainage. Escherichia coli was subsequently isolated from both urine and drainage pus cultures. The patient was discharged after a 2-week hospitalization.

REFERENCES
1. Quint HJ, Drach GW, Rappaport WD, et al. Emphysematous cystitis: a review of the spectrum of disease. J Urol. 1992;147:134-137.
2. Akalin E, Hyde C, Schmitt G, et al. Emphysematous cystitis and pyelitis in a diabetic renal transplant recipient. Transplantation. 1996;62:1024-1026.
3. Bailey H. Cystitis emphysematosa: 19 cases with intraluminal and interstitial collections of gas. Am J Roentgenol Radium Ther Nucl Med. 1961;86:850-862.
4. Grupper M, Kravtsov A, Potasman I. Emphysematous cystitis: illustrative case report and review of the literature. Medicine. 2007;86:47-53.
5. Perlemoine C, Neau D, Ragnaud JM, et al. Emphysematous cystitis. Diabetes Metab. 2004;30:377-379.
6. Quint HJ, Drach GW, Rappaport WD, et al. Emphysematous cystitis: a review of the spectrum of disease. J Urol. 1992;147:134-137.
7. Yasumoto R, Asakawa M, Nishisaka N. Emphysematous cystitis. Br J Urol. 1989;63:644.

APPENDIX E1. Examples of actions resulting from morbidity and mortality investigations.

Uncommon Diagnosis Requiring Complex Evaluations and Multispecialty Interventions

The Problem: After observing difficulties in the timely diagnosis and treatment of aortic dissections, we reviewed a series of cases of aortic dissections from our morbidity and mortality database and identified key problem areas. The diagnosis was observed to be difficult, often overlapping more common chest pain entities.
The evaluation for dissection could involve a variety of approaches (echocardiography, computed tomography, or angiography) and involved multiple consultants (chest surgeons, cardiologists, vascular surgeons, intensivists, and interventional radiologists). Because of the variation in practice, there was often conflict between specialists and services about the optimal testing and management of these cases; care was often delayed by conflict and indecision.

Actions Taken: After review of a series of problematic cases, the ED staff convened a multidisciplinary conference with all the involved specialties, eventually developing a standard approach agreed on by all disciplines. Once the policy was developed, hospital staff worked to ensure that services were available to follow the protocol. The result was a much simpler and more direct approach to caring for similar cases.

Cases With Evolving Standards of Care and Controversy About Optimal Management: Ectopic Pregnancy

The Problem: Multiple problems were observed in the timely diagnosis of ectopic pregnancy and in management of patients with first trimester bleeding. Because of evolving standards in the care of patients with first trimester pregnancy, the ED staff experienced conflict and inconsistent standards within their own group, as well as with consultants. In the process, the diagnosis of several ectopic pregnancies was delayed.

Actions Taken: The ED organized a literature review and grand rounds conference on ectopic pregnancy, developed a treatment protocol for first trimester pregnancy, and then met with a working group with obstetrics to standardize criteria for consultation, admission, and follow-up. Meanwhile, the ED improved its proficiency in bedside ultrasonography and began more liberal screening of all first trimester pregnancy bleeding.

Dealing With a Local Infectious Disease Threat: Tuberculosis

The Problem: A resurgence of tuberculosis in our city led to an increase in the number of patients with active tuberculosis in our ED. There were a limited number of isolation beds available within the hospital, too few to isolate all potential cases. A number of patients admitted to general medical wards were found to have active tuberculosis; at the same time, there was an increase in the number of house staff who converted to positive tuberculin skin tests.

The Action: Cases with positive tuberculosis culture results were identified and their ED records tracked. The ED convened a multidisciplinary conference with the Infectious Disease and Radiology Departments to review the cases. Ultimately, we found that many active cases of tuberculosis could not be predicted according to clinical and radiographic criteria. Eventually, the medical staff obtained additional isolation rooms in the ED and in the hospital, applied new screening at triage to prioritize chest radiographs in patients with respiratory symptoms, prioritized movement of patients with suspected cases from triage to isolation beds in the ED, and successfully argued for increased resources to handle the challenge.

WHAT ARE THE CAUSES OF MISSED AND DELAYED DIAGNOSES IN THE AMBULATORY SETTING?

Gandhi TK, Kachalia A, Thomas EJ et al. Missed and Delayed Diagnoses in the Ambulatory Setting: A Study of Closed Malpractice Claims.
Ann Intern Med. 2006;145:488-496.

Isabel Healthcare Inc 2941 Fairview Park Drive, Ste 800, Falls Church, VA 22042 T: 703 289 8855 E: [email protected]

Annals of Internal Medicine | Article

Missed and Delayed Diagnoses in the Ambulatory Setting: A Study of Closed Malpractice Claims

Tejal K. Gandhi, MD, MPH; Allen Kachalia, MD, JD; Eric J. Thomas, MD, MPH; Ann Louise Puopolo, BSN, RN; Catherine Yoon, MS; Troyen A. Brennan, MD, JD; and David M. Studdert, LLB, ScD

Background: Although missed and delayed diagnoses have become an important patient safety concern, they remain largely unstudied, especially in the outpatient setting.

Objective: To develop a framework for investigating missed and delayed diagnoses, advance understanding of their causes, and identify opportunities for prevention.

Design: Retrospective review of 307 closed malpractice claims in which patients alleged a missed or delayed diagnosis in the ambulatory setting.

Setting: 4 malpractice insurance companies.

Measurements: Diagnostic errors associated with adverse outcomes for patients, process breakdowns, and contributing factors.

Results: A total of 181 claims (59%) involved diagnostic errors that harmed patients. Fifty-nine percent (106 of 181) of these errors were associated with serious harm, and 30% (55 of 181) resulted in death. For 59% (106 of 181) of the errors, cancer was the diagnosis involved, chiefly breast (44 claims [24%]) and colorectal (13 claims [7%]) cancer. The most common breakdowns in the diagnostic process were failure to order an appropriate diagnostic test (100 of 181 [55%]), failure to create a proper follow-up plan (81 of 181 [45%]), failure to obtain an adequate history or perform an adequate physical examination (76 of 181 [42%]), and incorrect interpretation of diagnostic tests (67 of 181 [37%]). The leading factors that contributed to the errors were failures in judgment (143 of 181 [79%]), vigilance or memory (106 of 181 [59%]), knowledge (86 of 181 [48%]), patient-related factors (84 of 181 [46%]), and handoffs (36 of 181 [20%]). The median number of process breakdowns and contributing factors per error was 3 for both (interquartile range, 2 to 4).

Limitations: Reviewers were not blinded to the litigation outcomes, and the reliability of the error determination was moderate.

Conclusions: Diagnostic errors that harm patients are typically the result of multiple breakdowns and individual and system factors. Awareness of the most common types of breakdowns and factors could help efforts to identify and prioritize strategies to prevent diagnostic errors.

Ann Intern Med. 2006;145:488-496. For author affiliations, see end of text. www.annals.org © 2006 American College of Physicians

See also: Editors' Notes (p. 489); Editorial comment (p. 547); Summary for Patients (p. I-12). Web-Only: Appendix, Appendix Tables, Appendix Figures, conversion of tables and figures into slides.

Editors' Notes

Context: Efforts to reduce medical errors and improve patient safety have not generally addressed errors in diagnosis. As with treatment, diagnosis involves complex, fragmented processes within health care systems that are vulnerable to failures and breakdowns.

Contributions: The authors reviewed malpractice claims alleging injury from a missed or delayed diagnosis. In 181 cases in which there was a high likelihood that error led to the missed diagnosis, the authors analyzed where the diagnostic process broke down and why. The most common missed diagnosis was cancer, and the most common breakdowns were failure to order appropriate tests and inadequate follow-up of test results. A median of 3 process breakdowns occurred per error, and 2 or more clinicians were involved in 43% of cases.

Cautions: The study relied on malpractice claims, which are not representative of all diagnostic errors that occur. There was only moderate agreement among the authors in their subjective judgments about errors and their causes.

Implications: Like other medical errors, diagnostic errors are multifactorial. They arise from multiple process breakdowns, usually involving multiple providers. The results highlight the challenge of finding effective ways to reduce diagnostic errors as a component of improving health care quality. —The Editors

Missed and delayed diagnoses in the ambulatory setting are an important patient safety problem. The current diagnostic process in health care is complex, chaotic, and vulnerable to failures and breakdowns. For example, one third of women with abnormal results on mammography or Papanicolaou smears do not receive follow-up care that is consistent with well-established guidelines (1, 2), and primary care providers often report delays in reviewing test results (3). Recognition of systemic problems in this area has prompted urgent calls for improvements (4). However, this type of error remains largely unstudied (4). At least part of the reason is technical: Because omissions characterize missed diagnoses, they are difficult to identify; there is no standard reporting mechanism; and when they are identified, documentation in medical records is usually insufficiently detailed to support detailed causal analyses. The result is a relatively thin evidence base from which to launch efforts to combat diagnostic errors. Moreover, conceptions of the problem tend to remain rooted in the notion of physicians failing to be vigilant or up-to-date. This is a less nuanced view of error causation than careful analysis of other major patient safety problems, such as medication errors (5, 6), has revealed.

Several considerations highlight malpractice claims as a potentially rich source of information about missed and delayed diagnoses. First, misdiagnosis is a common allegation. Over the past decade, lawsuits alleging negligent misdiagnoses have become the most prevalent type of claim in the United States (7, 8). Second, diagnostic breakdowns that lead to claims tend to be associated with especially severe outcomes. Third, relatively thorough documentation on what happened is available in malpractice insurers' claim files. In addition to the medical record, these files include depositions, expert opinions, and sometimes the results of internal investigations.

Previous attempts to use data from malpractice claims to study patient safety have had various methodologic constraints, including small sample size (9, 10), a focus on single insurers (11) or verdicts (9, 10) (which constitute <10% of claims), limited information on the claims (8-11), reliance on internal case review by insurers rather than by independent experts (8, 11), and a general absence of robust frameworks for classifying types and causes of failures. To address these issues, we analyzed data from closed malpractice claims at 4 liability insurance companies. Our goals were to develop a framework for investigating missed and delayed diagnoses, advance understanding of their causes, and identify opportunities for prevention.

METHODS

Study Sites

Four malpractice insurance companies based in 3 regions (northeastern, southwestern, and western United States) participated in the study. Collectively, the participating companies insured approximately 21 000 physicians, 46 acute care hospitals (20 academic and 26 nonacademic), and 390 outpatient facilities, including a wide variety of primary care and outpatient specialty practices. The ethics review boards at the investigators' institutions and at each review site approved the study.

Claims Sample

Data were extracted from random samples of closed claim files from each insurer. A claim is classified as closed when it has been dropped, dismissed, paid by settlement, or resolved by verdict. The claim file is the repository of information accumulated by the insurer during the life of a claim. It captures a wide variety of data, including the statement of claim, depositions, interrogatories, and other litigation documents; reports of internal investigations, such as risk management evaluations and sometimes root-cause analyses; expert opinions from both sides; medical reports detailing the plaintiff's preevent and postevent condition; and, while the claim is open, medical records pertaining to the episode of care at issue. We reacquired the relevant medical records for sampled claims.

Following previous studies, we defined a claim as a written demand for compensation for medical injury (12, 13). Claims involving missed or delayed diagnoses were defined as those alleging an error in diagnosis or testing that caused a delay in appropriate treatment or a failure to act or follow up on results of diagnostic tests. We excluded allegations related to pregnancy and those pertaining to care rendered solely in the inpatient setting. We reviewed 429 diagnostic claims alleging injury due to missed or delayed diagnoses. Insurers contributed to the study sample in proportion to their annual claims volume (Appendix, available at www.annals.org). The claims were divided into 2 main categories based on the primary setting of the outpatient care involved in the allegation: the emergency department (122 claims) and all other locations (for example, physician's office, ambulatory surgery, pathology laboratory, or radiology suites) (307 claims). The latter group, which we call ambulatory claims, is the focus of this analysis.

Study Instruments and Claim File Review

Physicians who were board-certified attendings, fellows, or third-year residents in internal medicine reviewed sampled claim files at the insurers' offices or insured facilities. Physician-investigators trained the reviewers in the content of claim files, use of the study instruments, and confidentiality procedures in 1-day sessions at each site. The reviewers also used a detailed manual. Reviews took on average 1.4 hours per file. To test review reliability, a second reviewer reviewed a random selection of 10% (42 of 429) of the files. Thirty-three of the 307 ambulatory claims that are the focus of this analysis were included in the random blinded re-review. A sequence of 4 instruments guided the review. For all claims, insurance staff recorded administrative details of the case (Appendix Figure 1, available at www.annals.org), and clinical reviewers recorded details of the adverse outcome the patient experienced, if any (Appendix Figure 2, available at www.annals.org).
Reviewers scored adverse outcomes on a 9-point severity scale ranging from emotional injury only to death. This scale was developed by the National Association of Insurance Commissioners (14) and has been used in previous research (15). If the patient had multiple adverse outcomes, reviewers scored the most severe outcome. To simplify presentation of our results, we grouped scores on this scale into 5 categories (emotional, minor, significant, major, and death).

Next, reviewers considered the potential role of a series of contributing factors (Appendix Figure 3, available at www.annals.org) in causing the adverse outcome. The factors covered cognitive-, system-, and patient-related causes that were related to the episode of care as a whole, not to particular steps in the diagnostic process. The factors were selected on the basis of a review of the patient safety literature performed in 2001 by 5 of the authors in consultation with physician-collaborators from surgery and obstetrics and gynecology.

Reviewers then judged, in light of available information and their decisions about contributing factors, whether the adverse outcome was due to diagnostic error. We used the Institute of Medicine's definition of error, namely, "the failure of a planned action to be completed as intended (i.e. error of execution) or the use of a wrong plan to achieve an aim (i.e. error of planning)" (16). Reviewers recorded their judgment on a 6-point confidence scale ranging from "1. Little or no evidence that adverse outcome resulted from error/errors" to "6. Virtually certain evidence that adverse outcome resulted from error/errors." Claims that scored 4 ("More likely than not that adverse outcome resulted from error/errors; more than 50-50 but a close call") or higher were classified as having an error. The confidence scale and cutoff point were adapted from instruments used in previous studies of medical injury (15, 17). Reviewers were not blinded to the litigation outcomes but were instructed to ignore them and rely on their own clinical judgment in making decisions about errors. Training sessions stressed that the study definition of error is not synonymous with the legal definition of negligence and that a mix of factors extrinsic to merit influence whether claims are paid during litigation.

Finally, for the subset of claims judged to involve errors, reviewers completed an additional form (Appendix Figure 4, available at www.annals.org) that collected additional clinical information about the missed diagnosis. Specifically, reviewers considered a defined sequence of diagnostic steps (for example, history and physical examination, test ordering, and creation of a follow-up plan) and were asked to grade their confidence that a breakdown had occurred at each step (5-point Likert scale ranging from "highly unlikely" to "highly likely"). If a breakdown was judged to have been at least "somewhat likely" (score of ≥3), the form elicited additional information on the particular breakdown, including a non–mutually exclusive list of reasons for the breakdown.
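A minimal sketch of the two coding rules just described, rendered in Python for concreteness (the study itself used paper forms; the function names here are invented for illustration):

    # Map a NAIC 9-point severity score (1-9) to the 5 reporting categories
    # used in this article (see the footnote to Table 1).
    def severity_category(score: int) -> str:
        if score == 1:
            return "emotional"
        if score <= 3:
            return "minor"
        if score <= 6:
            return "significant"
        if score <= 8:
            return "major"
        return "death"  # score 9

    # A claim counts as involving an error when the reviewer's confidence
    # score on the 6-point scale is 4 ("more likely than not") or higher.
    def involves_error(confidence: int) -> bool:
        return confidence >= 4

    print(severity_category(7), involves_error(4))  # -> major True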
Statistical Analysis

The primary unit of analysis was the sequence of care in claims judged to involve a diagnostic error that led to an adverse outcome. For ease of exposition, we henceforth refer to such sequences as errors. The hand-filled data forms were electronically entered and verified by a professional data entry vendor and sent to the Harvard School of Public Health for analysis. Additional validity checks and data cleaning were performed by study programmers. Analyses were conducted by using SAS, version 8.2 (SAS Institute, Cary, North Carolina) and Stata SE, version 8.0 (Stata Corp., College Station, Texas). We examined characteristics of the claims, patients, and injuries in our sample and the frequency of the various contributing factors. We compared characteristics of error subgroups using Pearson chi-square tests and used percentage agreement and κ scores (18) to measure interrater reliability of the injury and error determinations.

Role of the Funding Sources

This study was funded by grants from the Agency for Healthcare Research and Quality (HS011886-03) and the Harvard Risk Management Foundation. These organizations did not play a role in the design, conduct, or analysis of the study, and the decision to submit the manuscript for publication was that of the authors.

RESULTS

The 307 diagnosis-related ambulatory claims closed between 1984 and 2004. Eighty-five percent (262 of 307) of the alleged errors occurred in 1990 or later, and 80% (245 of 307 claims) closed in 1997 or later. In 2% (7 of 307) of claims, no adverse outcome or change in the patient's clinical course was evident; in 3% (9 of 307), the reviewer was unable to judge the severity of the adverse outcome from the information available; and in 36% (110 of 307), the claim was judged not to involve a diagnostic error. The remaining group of 181 claims, 59% (181 of 307) of the sample, were judged to involve diagnostic errors that led to adverse outcomes. This group of errors is the focus of further analyses.

In 40 of the 42 re-reviewed claims, reviewers agreed about whether an adverse outcome had occurred (95% agreement). The reliability of the determination of whether an error had occurred (a score <4 vs. ≥4 on the confidence scale) was moderate (72% agreement; κ = 0.42 [95% CI, −0.05 to 0.66]).
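The reported interrater statistics combine raw percentage agreement with the chance-corrected κ statistic. A small illustration of how κ is computed from a 2 × 2 agreement table, with hypothetical counts (the paper reports 72% agreement and κ = 0.42 for the error determination but does not publish the underlying table):

    # Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
    def cohens_kappa(table):
        # table[i][j] = claims rated category i by reviewer 1 and j by reviewer 2
        n = sum(sum(row) for row in table)
        p_obs = sum(table[i][i] for i in range(2)) / n
        rows = [sum(row) for row in table]
        cols = [table[0][j] + table[1][j] for j in range(2)]
        p_exp = sum(rows[k] * cols[k] for k in range(2)) / n ** 2
        return (p_obs - p_exp) / (1 - p_exp)

    # Hypothetical re-review counts: rows/columns = (error, no error).
    print(round(cohens_kappa([[14, 5], [4, 10]]), 2))  # -> 0.45 on these made-up counts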
Errors and Diagnoses

Fifty-nine percent (106 of 181) of the errors were associated with significant or major physical adverse outcomes and 30% (55 of 181) were associated with death (Table 1). For 59% (106 of 181) of errors, cancer was the diagnosis missed, chiefly breast (44 of 181 [24%]), colorectal (13 of 181 [7%]), and skin (8 of 181 [4%]) cancer. The next most commonly missed diagnoses were infections (9 of 181 [5%]), fractures (8 of 181 [4%]), and myocardial infarctions (7 of 181 [4%]). Most errors occurred in physicians' offices (154 of 181 [85%]), and primary care physicians were the providers most commonly involved (76 of 181 [42%]). The mean interval between when diagnoses should have been made (that is, in the absence of error) and when they actually were made was 465 days (SD, 571 days), and the median was 303 days (interquartile range, 36 to 681 days). Appendix Table 1 (available at www.annals.org) outlines several examples of missed and delayed diagnoses in the study sample.

Table 1. Characteristics of Patients, Involved Clinicians, Adverse Outcomes, and Missed or Delayed Diagnoses among 181 Diagnostic Errors in Ambulatory Care (values are errors, n (%))

Female patient: 110 (61)
Patient age (mean, 44 y [SD, 17]): <1 y, 5 (3); 1-18 y, 9 (5); 18-34 y, 34 (19); 35-49 y, 64 (35); 50-64 y, 50 (28); >64 y, 19 (10)
Health insurance (n = 117): private, 103 (88); Medicaid, 3 (3); uninsured, 7 (6); Medicare, 1 (1); other, 3 (3)
Clinicians involved in error*, by specialty: primary care†, 88 (49); radiology, 31 (17); general surgery, 23 (13); pathology, 13 (7); physician's assistant, 13 (7); registered nurse or nurse practitioner, 14 (8); trainee (resident, fellow, or intern), 20 (11)
Setting: physician's office, 154 (85); ambulatory surgery facility, 8 (4); pathology or clinical laboratory, 8 (4); radiology suite, 5 (3); other, 6 (3)
Missed or delayed diagnosis: cancer, 106 (59), comprising breast, 44 (24); colorectal, 13 (7); skin, 8 (4); hematologic, 7 (4); gynecologic, 7 (4); lung, 6 (3); brain, 5 (3); prostate, 5 (3); liver or gastric, 2 (1); and other, 9 (5). Noncancer diagnoses: infection, 9 (5); fracture, 8 (4); myocardial infarction, 7 (4); embolism, 6 (3); appendicitis, 5 (3); cerebral vascular disease, 4 (2); other neurologic condition, 4 (2); aneurysm, 3 (2); other cardiac condition, 3 (2); gynecologic disease (noncancer), 3 (2); endocrine disorder, 3 (2); birth defect, 3 (2); ophthalmologic disease, 3 (2); peripheral vascular disease, 2 (1); abdominal disease, 2 (1); psychiatric illness, 2 (1); other, 8 (4)
Adverse outcome‡: psychological or emotional, 9 (5); minor physical, 11 (6); significant physical, 72 (40); major physical, 34 (19); death, 55 (30)

* Percentages do not sum to 100% because multiple providers were involved in some errors.
† Includes 12 pediatricians; therefore, pediatrics accounts for an absolute 7% of the 49%.
‡ Categories correspond to the following scores on the National Association of Insurance Commissioners' 9-point severity scale: psychiatric or emotional (score 1), minor physical (scores 2 and 3), significant physical (scores 4 to 6), major physical (scores 7 and 8), and death (score 9). For details, see Appendix Figure 2, available at www.annals.org.

Breakdowns in the Diagnostic Process

The leading breakdown points in the diagnostic process were failure to order an appropriate diagnostic test (100 of 181 [55%]), failure to create a proper follow-up plan (81 of 181 [45%]), failure to obtain an adequate history or to perform an adequate physical examination (76 of 181 [42%]), and incorrect interpretation of a diagnostic test (67 of 181 [37%]) (Table 2). Missed cancer diagnoses were significantly more likely to involve diagnostic tests being performed incorrectly (14 of 106 [13%] vs. 1 of 75 [1%]; P = 0.004) and being interpreted incorrectly (49 of 106 [46%] vs. 18 of 75 [24%]; P = 0.002), whereas missed noncancer diagnoses were significantly more likely to involve delays by patients in seeking care (4 of 106 [4%] vs. 12 of 75 [16%]; P = 0.004), inadequate history or physical examination (24 of 106 [23%] vs. 52 of 75 [69%]; P < 0.001), and failure to refer (19 of 106 [18%] vs. 28 of 75 [37%]; P = 0.003). Table 3 details the 4 most common process breakdowns.
Among failures to order diagnostic tests, most tests involved were from the imaging (45 of 100 [45%]) or other test (50 of 100 [50%]) categories. With respect to the specific tests involved, biopsies were most frequently at issue (25 of 100 [25%]), followed by computed tomography scans, mammography, ultrasonography, and colonoscopy (11 of 100 for each [11%]). The most common explanation for the failures to order was that the physician seemed to lack knowledge of the appropriate test in the clinical circumstances. Among misinterpretations of test results, 27% (18 of 67) related to results on mammography and 15% (10 of 67) related to results on radiography. The main reasons for inadequate follow-up plans were that the physician did not think follow-up was necessary (32 of 81 [40%]), selected an inappropriate follow-up interval (29 of 81 [36%]), or did not document the plan correctly (22 of 81 [27%]).

Table 2. Breakdown Points in the Diagnostic Process (values are n (%); P values from Pearson chi-square tests for differences between cancer and noncancer categories)

Process Breakdown | All Missed Diagnoses (n = 181) | Missed Cancer Diagnoses (n = 106) | Missed Noncancer Diagnoses (n = 75) | P Value
Initial delay by the patient in seeking care | 16 (9) | 4 (4) | 12 (16) | 0.004
Failure to obtain adequate medical history or physical examination | 76 (42) | 24 (23) | 52 (69) | <0.001
Failure to order appropriate diagnostic or laboratory tests | 100 (55) | 63 (59) | 37 (49) | 0.178
Adequate diagnostic or laboratory tests ordered but not performed | 17 (9) | 10 (9) | 7 (9) | 0.98
Diagnostic or laboratory tests performed incorrectly | 15 (8) | 14 (13) | 1 (1) | 0.004
Incorrect interpretation of diagnostic or laboratory tests | 67 (37) | 49 (46) | 18 (24) | 0.002
Responsible provider did not receive diagnostic or laboratory test results | 23 (13) | 17 (16) | 6 (8) | 0.110
Diagnostic or laboratory test results were not transmitted to patient | 22 (12) | 15 (14) | 7 (9) | 0.33
Inappropriate or inadequate follow-up plan | 81 (45) | 51 (48) | 30 (40) | 0.28
Failure to refer | 47 (26) | 19 (18) | 28 (37) | 0.003
Failure of a requested referral to occur | 9 (5) | 8 (8) | 1 (1) | 0.058
Failure of the referred-to clinician to convey relevant results to the referring clinician | 3 (2) | 2 (2) | 1 (1) | 0.77
Patient nonadherence to the follow-up plan | 31 (17) | 21 (20) | 10 (13) | 0.25

Contributing Factors

The leading factors that contributed to errors were failures in judgment (143 of 181 [79%]), vigilance or memory (106 of 181 [59%]), knowledge (86 of 181 [48%]), patient-related factors (84 of 181 [46%]), and handoffs (36 of 181 [20%]) (Table 4). The patient-related factors included nonadherence (40 of 181 [22%]), atypical clinical presentation (28 of 181 [15%]), and complicated medical history (18 of 181 [10%]). There were no significant differences in the prevalence of the various contributing factors when missed cancer and missed noncancer diagnoses were compared, with the exception of lack of supervision, which was significantly more likely to have occurred in missed noncancer diagnoses (10 of 75 [13%] vs. 5 of 106 [5%]; P = 0.038). The higher prevalence of lack of supervision as a contributing factor to missed noncancer diagnoses was accompanied by greater involvement of trainees in these cases (13 of 75 [17%] vs. 7 of 106 [7%]; P = 0.023).
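The cancer versus noncancer comparisons above are Pearson chi-square tests on 2 × 2 tables. A sketch of one such test using the counts reported in the text (SciPy here stands in for the SAS and Stata procedures the authors actually used):

    from scipy.stats import chi2_contingency

    # Inadequate history or physical examination:
    # 24 of 106 missed cancer vs. 52 of 75 missed noncancer diagnoses.
    observed = [[24, 106 - 24],   # cancer: breakdown present, absent
                [52, 75 - 52]]    # noncancer: breakdown present, absent
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(f"chi2 = {chi2:.1f}, P = {p:.2g}")  # P far below 0.001, as reported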
Multifactorial Nature of Missed or Delayed Diagnoses

The diagnostic errors were complex and frequently involved multiple process breakdowns, contributing factors, and contributing clinicians (Table 5). In 43% (78 of 181) of errors, 2 or more clinicians contributed to the missed diagnosis, and in 16% (29 of 181), 3 or more clinicians contributed. There was a median of 3 (interquartile range, 2 to 4) process breakdowns per error; 54% (97 of 181) of errors had 3 or more process breakdowns and 29% (52 of 181) had 4 or more. Thirty-five diagnostic errors (35 of 181 [19%]) involved a breakdown at only 1 point in the care process (single-point breakdowns). Twenty-four (24 of 35 [69%]) of these single-point breakdowns were either failures to order tests (n = 10) or incorrect interpretations of tests (n = 14); only 3 involved trainees.

The median number of contributing factors involved in diagnostic errors was 3 (interquartile range, 2 to 4); 59% (107 of 181) had 3 or more contributing factors, 27% (48 of 181) had 4 or more, and 13% (23 of 181) had 5 or more. Thus, although virtually all diagnostic errors were linked to cognitive factors, especially judgment errors, cognitive factors operated alone in a minority of cases. They were usually accompanied by communication factors, patient-related factors, or other system factors. Specifically, 36% (66 of 181) of errors involved cognitive factors alone, 16% (29 of 181) involved judgment or vigilance and memory factors alone, and 9% (16 of 181) involved only judgment factors. The likelihood that cognitive factors alone led to the error was significantly lower among errors that involved an inadequate medical history or physical examination (21 of 66 [32%] vs. 55 of 115 [48%]; P = 0.036), lack of receipt of ordered tests by the responsible provider (3 of 66 [5%] vs. 20 of 115 [17%]; P = 0.013), and inappropriate or inadequate follow-up planning (21 of 66 [32%] vs. 60 of 115 [52%]; P = 0.008). In other words, the role of communication and other systems factors was especially prominent in these 3 types of breakdowns.
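The single-count and cumulative ("≥2", "≥3", and so on) percentages quoted above, and tabulated in Table 5 below, follow mechanically from the per-error counts. A brief sketch, with invented per-error data standing in for the study's:

    # Convert per-error counts (clinicians, breakdowns, or factors) into the
    # "1, >=2, >=3, >=4, >=5" distribution used in Table 5.
    def cumulative_rows(counts, max_threshold=5):
        n = len(counts)
        rows = {"1": 100 * sum(c == 1 for c in counts) / n}
        for t in range(2, max_threshold + 1):
            rows[f">={t}"] = 100 * sum(c >= t for c in counts) / n
        return rows

    breakdowns_per_error = [3, 1, 4, 2, 3, 5, 3, 2, 4, 3]  # hypothetical data
    print(cumulative_rows(breakdowns_per_error))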
DISCUSSION

Our study of closed malpractice claims identified a group of missed diagnoses in the ambulatory setting that were associated with dire outcomes for patients. Over half of the missed diagnoses were cancer, primarily breast and colorectal cancer; no other diagnosis accounted for more than 5% of the sample. The main breakdowns in the diagnostic process were failure to order appropriate diagnostic tests, inappropriate or inadequate follow-up planning, failure to obtain an adequate medical history or perform an adequate physical examination, and incorrect interpretation of diagnostic test results. Cognitive factors, patient-related factors, and handoffs were the most prominent contributing factors overall. However, relatively few diagnostic errors could be linked to single-point breakdowns or lone contributing factors. Most missed diagnoses involved manifold breakdowns and a potent combination of individual and system factors. The resultant delays tended to be long, setting diagnosis back more than 1 year on average.

The threshold question of what constitutes a harmful diagnostic error is extremely challenging. Indeed, the nebulous nature of missed diagnoses probably helps to explain why patient safety research in this area has lagged. Our approach was 2-dimensional. We considered the potential role of a range of contributing factors drawn from the fields of human factors and systems analysis (16, 19) in causing the patient's injury. We also disaggregated the diagnostic process into discrete steps and then examined the prevalence of problems within particular steps. Trained physician-reviewers had the benefit of information from the claim file and from the medical record in deciding whether diagnoses were missed. Even so, agreement among reviewers was only fair, highlighting how difficult and subjective clinical judgments regarding missed diagnoses are.

Table 3. Details of the 4 Most Frequent Process Breakdowns* (values are n (% within process breakdown category))

Failure to order appropriate diagnostic or laboratory tests (n = 100). Leading reasons: provider lacked knowledge of appropriate test, 37 (37); poor documentation, 10 (10); failure of communication, 15 (15), comprising among providers, 7 (7), and between provider and patient, 8 (8); patient did not schedule or keep appointment, 9 (9); patient declined tests, 5 (5). Type of test not ordered: imaging, 45 (45), including computed tomography scanning, 11 (11), mammography, 11 (11), and ultrasonography, 11 (11); blood, 14 (14); other†, 50 (50), including biopsy, 25 (25), and colonoscopy or flexible sigmoidoscopy, 11 (11).

Inappropriate or inadequate follow-up plan (n = 81). Leading reasons: provider did not think follow-up was necessary, 32 (40); provider selected inappropriate follow-up interval, 29 (36); follow-up plan not documented, 22 (27); follow-up appointment not scheduled, 20 (25); miscommunication between patient and provider, 20 (25); miscommunication between providers, 10 (12).

Failure to obtain adequate medical history or physical examination (n = 76). Leading reasons: incomplete physical examination, 39 (50); failure to elicit relevant information, 35 (46); poor documentation, 20 (26); patient provided inaccurate history, 13 (17); error in clinical judgment, 52 (78); failure of communication among providers, 5 (7); inexperience, 3 (5).

Incorrect interpretation of diagnostic or laboratory tests (n = 67). Whose misinterpretation: radiologist, 27 (40); primary care physician, 19 (28); pathologist, 10 (15). Type of test interpreted incorrectly: imaging, 36 (54), including mammography, 18 (27), radiography, 10 (15), and ultrasonography, 5 (7); blood, 9 (13); other‡, 21 (31), including biopsy, 8 (12).

* Category and subcategory totals may exceed 100% because of non–mutually exclusive reasons and multiple tests. Only leading reasons and tests are presented.
† The tests not shown in this category are echocardiography (n = 3), cardiac catheterization (n = 3), electrocardiography (n = 2), treadmill (n = 2), urinalysis (n = 1), esophagogastroduodenoscopy/barium swallow (n = 1), colposcopy (n = 1), urine pregnancy (n = 1), and joint aspirate (n = 1).
‡ The tests not shown in this category are electrocardiography (n = 4), urinalysis (n = 1), echocardiography (n = 1), stool guaiac (n = 2), Papanicolaou smear (n = 3), cystoscopy (n = 1), and treadmill (n = 1).
Table 4. Factors That Contributed to Missed or Delayed Diagnoses (values are n (%) of 181 missed or delayed diagnoses*)

Provider- and systems-related factors: 179 (99)
Cognitive factors: judgment, 143 (79); vigilance or memory, 106 (59); lack of knowledge, 86 (48)
Communication factors: 55 (30), comprising handoffs, 36 (20); failure to establish clear lines of responsibility, 17 (9); conflict, 3 (2); and some other failure of communication, 16 (9)
Other systems factors: 31 (17), comprising lack of supervision, 15 (8); workload, 12 (7); interruptions, 6 (3); technology problems, 5 (3); and fatigue, 1 (1)
Patient-related factors†: 84 (46), comprising nonadherence, 40 (22); atypical presentation, 28 (15); complicated medical history, 18 (10); information about medical history of poor quality, 8 (4); and other, 15 (8)

* Percentages were calculated by using the denominator of 181 missed or delayed diagnoses.
† Sum of specific patient-related factors exceeds total percentage because some errors had multiple contributing factors.

Table 5. Distribution of Contributing Clinicians, Process Breakdowns, and Contributing Factors*

Number per Error | Contributing Clinicians, % | Process Breakdowns, % | Contributing Factors, %
1 | 57 | 19 | 14
≥2 | 43 | 81 | 86
≥3 | 16 | 54 | 59
≥4 | 6 | 29 | 27
≥5 | 2 | 11 | 13

* The total number of errors was 181.

Troublesome Diagnoses

The prevalence of missed cancer diagnoses in our sample is consistent with previous recognition of this problem as a major quality concern for ambulatory care (8). Missed cancer diagnoses were more likely than other missed diagnoses to involve errors in the performance and interpretation of tests. Lack of adherence to guidelines for cancer screening and test ordering is a well-recognized problem, and there are ongoing efforts to address it (1, 20). Our data did not allow assessment of the extent to which guideline adherence would have prevented errors.

The next most commonly missed diagnoses were infections, fractures, and myocardial infarctions, although there was also a broad range of other missed diagnoses. Primary care physicians were centrally involved in most of these errors. Most primary care physicians work under considerable time pressure, handle a wide range of clinical problems, and treat predominantly healthy patients. They practice in a health care system in which test results are not easily tracked, patients are sometimes poor informants, multiple handoffs exist, and information gaps are the norm. For even the most stellar practitioners, clinical processes and judgments are bound to fail occasionally under such circumstances.

Cognitive and Other Factors

Although there were several statistically significant differences in the patterns of process breakdowns between missed cancer and missed noncancer diagnoses, the patterns of contributing factors were remarkably similar. This suggests that although vulnerable points may vary according to the idiosyncratic pathways of different diagnoses, a core set of human and system vulnerabilities permeates diagnostic work. For example, relying exclusively on a physician's knowledge or memory to ensure that the correct test is ordered, practicing in a clinic with poor internal communication strategies, or having a high prevalence of patients with complex disorders may be risky regardless of the potential diagnosis. Such cognitive factors as judgment and knowledge were nearly ubiquitous but rarely were the sole contributing factors.
Previous research has highlighted the importance of cognitive factors in misdiagnosis and has proposed frameworks for better understanding the underlying causes of such errors (21–23). Detecting these cognitive failures and determining why they occur is difficult, but the design of robust diagnostic processes will probably depend on it.

Follow-Up

Nearly half of all errors involved an inadequate follow-up plan, a failure that was split fairly evenly between situations in which the physician determined that follow-up was not necessary and those in which the need for follow-up was recognized but the wrong interval was selected. Poor documentation, scheduling problems, and miscommunication among providers also played a role. A recent survey (24) of patients in 5 countries found that 8% to 20% of adults who were treated as outpatients did not receive their test results, and 9% to 15% received incorrect results or delayed notification of abnormal results. In addition, in our study, 56% (45 of 81) of the errors that occurred during follow-up also had patient-related factors that contributed to the poor outcome, a finding that highlights the need for risk reduction initiatives that address potential breakdowns on both sides of the patient–physician relationship during follow-up.

Interventions

In general, our findings reinforce the need for systems interventions that mitigate the potential impact of cognitive errors by reducing reliance on memory, forcing consideration of alternative diagnostic plans or second opinions, and providing clinical decision support systems (21, 22). However, cognitive factors rarely appear alone, so interventions must also address other contributing factors, such as handoffs and communication. To be clinically useful and maximally effective, such interventions must be easy to use and well integrated into clinicians' workflows (25). For example, incorporating clinical decision support into the electronic medical record of a patient with breast symptoms should help guard against failures in history taking or test ordering by bolstering physicians' knowledge and reducing their reliance on vigilance and memory.

An important feature of interventions is that their use should not rely on voluntary decisions to access them, because physicians are often unaware of their need for help (26). Rather, the use of an intervention should be automatically triggered by certain predetermined and explicit characteristics of the clinical encounter. For example, strategies to combat misinterpretation could mandate second reviews of test results in designated circumstances or require rapid expert reviews when physicians interpret test results outside of their areas of expertise (a toy sketch of such a trigger appears at the end of this section). After follow-up plans are made, improvements in scheduling procedures, tickler systems, and test result tracking systems could help keep patients and physicians on track. Nationwide attention is being given to the issue of follow-up (27, 28). Research and quality improvement efforts that equip physicians with tools to reliably perform follow-up must be a priority.

Selecting just 1 of the interventions we have outlined may not be sufficient. The multifactorial and complex nature of diagnostic errors suggests that meaningful reductions will require prevention strategies that target multiple levels in the diagnostic process and multiple contributing factors. Nevertheless, the resource constraints most health care institutions face demand priorities.
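A toy illustration of the automatic-trigger idea sketched above: flag an interpretation for mandatory second review whenever the test falls outside the interpreting physician's specialty. The rule table and field names are invented for illustration; a production version would live inside the electronic medical record:

    # Tests mapped to the specialties expected to interpret them (illustrative only).
    SECOND_REVIEW_RULES = {
        "mammography": {"radiology"},
        "biopsy": {"pathology"},
    }

    def needs_second_review(test: str, interpreter_specialty: str) -> bool:
        # True when the encounter matches a predetermined trigger: the
        # interpreter's specialty is not one expected for this test.
        expected = SECOND_REVIEW_RULES.get(test)
        return expected is not None and interpreter_specialty not in expected

    print(needs_second_review("mammography", "primary care"))  # -> True: route for rapid expert review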
Attention to the 3 vulnerable points we identified—ordering decisions, test interpretation, and follow-up planning—is a useful starting point and promises high yields, especially in reducing the number of missed cancer diagnoses.

Study Limitations

Our study has several limitations. First, unlike prospective observational studies or root-cause analyses, retrospective review of records, even the detailed records found in malpractice claim files, will miss certain breakdowns (for example, patient nonadherence) and contributing factors (for example, fatigue and workload), unless they emerged as issues during litigation. This measurement problem means that prevalence findings for such estimates will be lower bounds, and the multifactorial causality we observed probably understates the true complexity of diagnostic errors. An additional measurement problem relates to the process breakdowns. Although reviewers considered breakdowns as independent events, in some situations breakdowns may have been prompted or influenced by earlier breakdowns in the diagnostic sequence. For example, a failure to order appropriate tests may stem from oversights in the physical examination. Such interdependence would tend to inflate the frequency of some breakdowns.

Second, awareness of the litigation outcome may have biased reviewers toward finding errors in claims that received compensation and vice versa (29, 30). Several factors militate against this bias: Reviewers were instructed to ignore the litigation outcome; physicians, who as a group tend to be skeptical of the malpractice system, may have been disinclined to credit the system's findings; and, in fact, one quarter of error judgments diverged from the litigation outcomes.

Third, the reliability of the error determination was not high, and the CIs around the κ score are statistically compatible with poor or marginal reliability. Diagnostic errors are one of the most difficult types of errors to detect reliably (31). Although reviewers may have included some episodes of care in the study sample that were not "true" errors, and may have overlooked some that were, we know of no reason why the characteristics of such false-positive results and false-negative results would differ systematically from those we analyzed.

Fourth, malpractice claims data generally, and in our sample in particular, have several other biases. Severe injuries and younger patients are overrepresented in the subset of patients with medical injuries that trigger litigation (32, 33). It is possible that the factors that lead to errors in litigated cases may differ systematically from the factors that lead to errors in nonlitigated cases, although we know of no reason why they would. In addition, outpatient clinics associated with teaching hospitals are overrepresented in our sample, so the diagnostic errors we identified may not be generalizable outside this setting.

Conclusions

Our findings highlight the complexity of diagnostic errors in ambulatory care. Just as Reason's "Swiss cheese" model of accident causation suggests (19), diagnostic errors that harm patients seem to result from the alignment of multiple breakdowns, which in turn stem from a confluence of contributing factors. The task of effecting meaningful improvements to the diagnostic process—with its numerous clinical steps, stretched across multiple providers and months or years, and the heavy reliance on patient initiative—looms as a formidable challenge.
The prospects for "silver bullets" in this area seem remote. Are meaningful gains achievable in the short to medium term? The answer will probably turn on whether simple interventions that target 2 or 3 critical breakdown points are sufficient to disrupt the causal chain or whether interventions at a wider range of points are necessary to avert harm. In this sense, our findings are humbling, and they underscore the need for continuing efforts to develop the "basic science" of error prevention in medicine (34), which remains in its infancy.

From Brigham and Women's Hospital and Harvard School of Public Health, Boston, Massachusetts; University of Texas Health Science Center, Houston, Texas; and the Harvard Risk Management Foundation, Cambridge, Massachusetts.

Note: Drs. Gandhi, Kachalia, and Studdert had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Grant Support: This study was supported by grants from the Agency for Healthcare Research and Quality (HS011886-03) and the Harvard Risk Management Foundation. Dr. Studdert was also supported by the Agency for Healthcare Research and Quality (KO2HS11285).

Potential Financial Conflicts of Interest: Employment: T.A. Brennan (Aetna); Stock ownership or options (other than mutual funds): T.A. Brennan (Aetna); Expert testimony: T.A. Brennan.

Requests for Single Reprints: David M. Studdert, LLB, ScD, Department of Health Policy and Management, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115; e-mail, [email protected].

Current author addresses and author contributions are available at www.annals.org.

References
1. Haas JS, Cook EF, Puopolo AL, Burstin HR, Brennan TA. Differences in the quality of care for women with an abnormal mammogram or breast complaint. J Gen Intern Med. 2000;15:321-8. [PMID: 10840267]
2. Marcus AC, Crane LA, Kaplan CP, Reading AE, Savage E, Gunning J, et al. Improving adherence to screening follow-up among women with abnormal Pap smears: results from a large clinic-based trial of three intervention strategies. Med Care. 1992;30:216-30. [PMID: 1538610]
3. Poon EG, Gandhi TK, Sequist TD, Murff HJ, Karson AS, Bates DW. "I wish I had seen this test result earlier!": dissatisfaction with test result management systems in primary care. Arch Intern Med. 2004;164:2223-8. [PMID: 15534158]
4. Margo CE. A pilot study in ophthalmology of inter-rater reliability in classifying diagnostic errors: an underinvestigated area of medical error. Qual Saf Health Care. 2003;12:416-20. [PMID: 14645756]
5. Bates DW, Cullen DJ, Laird N, Petersen LA, Small SD, Servi D, et al. Incidence of adverse drug events and potential adverse drug events. Implications for prevention. ADE Prevention Study Group. JAMA. 1995;274:29-34. [PMID: 7791255]
6. Leape LL, Bates DW, Cullen DJ, Cooper J, Demonaco HJ, Gallivan T, et al. Systems analysis of adverse drug events. ADE Prevention Study Group. JAMA. 1995;274:35-43. [PMID: 7791256]
7. Chandra A, Nundy S, Seabury SA. The growth of physician medical malpractice payments: evidence from the national practitioner data bank. Health Aff (Millwood). 2005;W5-240-9. [PMID: 15928255]
8. Phillips RL Jr, Bartholomew LA, Dovey SM, Fryer GE Jr, Miyoshi TJ, Green LA. Learning from malpractice claims about negligent, adverse events in primary care in the United States. Qual Saf Health Care. 2004;13:121-6. [PMID: 15069219]
9. Kern KA. Medicolegal analysis of bile duct injury during open cholecystectomy and abdominal surgery. Am J Surg. 1994;168:217-22. [PMID: 8080055]
10. Kern KA. Malpractice litigation involving laparoscopic cholecystectomy. Cost, cause, and consequences. Arch Surg. 1997;132:392-8. [PMID: 9108760]
11. Kravitz RL, Rolph JE, McGuigan K. Malpractice claims data as a quality improvement tool. I. Epidemiology of error in four specialties. JAMA. 1991;266:2087-92. [PMID: 1920696]
12. Weiler PC, Hiatt HH, Newhouse JP, Johnson WG, Brennan T, Leape LL. A Measure of Malpractice: Medical Injury, Malpractice Litigation, and Patient Compensation. Cambridge, MA: Harvard Univ Pr; 1993.
13. Studdert DM, Brennan TA, Thomas EJ. Beyond dead reckoning: measures of medical injury burden, malpractice litigation, and alternative compensation models from Utah and Colorado. Indiana Law Rev. 2000;33:1643-86.
14. Sowka M, ed. National Association of Insurance Commissioners. Malpractice Claims: Final Compilation. Brookfield, WI: National Association of Insurance Commissioners; 1980.
15. Thomas EJ, Studdert DM, Burstin HR, Orav EJ, Zeena T, Williams EJ, et al. Incidence and types of adverse events and negligent care in Utah and Colorado. Med Care. 2000;38:261-71. [PMID: 10718351]
16. Kohn LT, Corrigan JM, Donaldson MS. To Err Is Human. Building a Safer Health System. Institute of Medicine. Washington, DC: National Academy Pr; 1999.
17. Brennan TA, Leape LL, Laird NM, Hebert L, Localio AR, Lawthers AG, et al. Incidence of adverse events and negligence in hospitalized patients. Results of the Harvard Medical Practice Study I. N Engl J Med. 1991;324:370-6. [PMID: 1987460]
18. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-74. [PMID: 843571]
19. Reason J. Managing the Risks of Organizational Accidents. Brookfield, VT: Ashgate; 1997.
20. Harvard Risk Management Foundation. Breast care management algorithm: improving breast patient safety. Accessed at www.rmf.harvard.edu/files/flash/breast-care-algorithm/index.htm on 9 August 2006.
21. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003;78:775-80. [PMID: 12915363]
22. Graber M, Gordon R, Franklin N. Reducing diagnostic errors in medicine: what's the goal? Acad Med. 2002;77:981-92. [PMID: 12377672]
23. Elstein AS, Schwarz A. Clinical problem solving and diagnostic decision making: selective review of the cognitive literature. BMJ. 2002;324:729-32. [PMID: 11909793]
24. Schoen C, Osborn R, Huynh PT, Doty M, Davis K, Zapert K, et al. Primary care and health system performance: adults' experiences in five countries. Health Aff (Millwood). 2004;Suppl:W4-487-503. [PMID: 15513956]
25. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ. 2005;330:765. [PMID: 15767266]
26. Friedman CP, Gatti GG, Franz TM, Murphy GC, Wolf FM, Heckerling PS, et al. Do physicians know when their diagnoses are correct? Implications for decision support and error reduction. J Gen Intern Med. 2005;20:334-9. [PMID: 15857490]
27. Joint Commission on Accreditation of Healthcare Organizations. National Patient Safety Goals. Accessed at www.jointcommission.org/PatientSafety/NationalPatientSafetyGoals/07_amb_obs_npsgs.htm on 9 August 2006.
28. Gandhi TK.
29. Guthrie C, Rachlinski JJ, Wistrich AJ. Inside the judicial mind. Cornell Law Rev. 2001;86:777-830.
30. LaBine SJ, LaBine G. Determinations of negligence and the hindsight bias. Law Hum Behav. 1996;20:501-16.
31. Studdert DM, Mello MM, Gawande AA, Gandhi TK, Kachalia A, Yoon C, et al. Claims, errors, and compensation payments in medical malpractice litigation. N Engl J Med. 2006;354:2024-33. [PMID: 16687715]
32. Burstin HR, Johnson WG, Lipsitz SR, Brennan TA. Do the poor sue more? A case-control study of malpractice claims and socioeconomic status. JAMA. 1993;270:1697-701. [PMID: 8411499]
33. Studdert DM, Thomas EJ, Burstin HR, Zbar BI, Orav EJ, Brennan TA. Negligent care and malpractice claiming behavior in Utah and Colorado. Med Care. 2000;38:250-60. [PMID: 10718350]
34. Brennan TA, Gawande A, Thomas E, Studdert D. Accidental deaths, saved lives, and improved quality. N Engl J Med. 2005;353:1405-9. [PMID: 16192489]

Current Author Addresses: Drs. Gandhi and Kachalia and Ms. Yoon: Division of General Internal Medicine, Brigham and Women’s Hospital, 75 Francis Street, Boston, MA 02114. Dr. Thomas: University of Texas Medical School at Houston, 6431 Fannin, MSB 1.122, Houston, TX 77030. Ms. Puopolo: CRICO/RMF, 101 Main Street, Cambridge, MA 02142. Dr. Brennan: 151 Farmington Avenue, RC5A, Hartford, CT 06156. Dr. Studdert: Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115.

Author Contributions: Conception and design: T.K. Gandhi, E.J. Thomas, T.A. Brennan, D.M. Studdert. Analysis and interpretation of the data: T.K. Gandhi, A. Kachalia, C. Yoon, D.M. Studdert. Drafting of the article: T.K. Gandhi, A. Kachalia, T.A. Brennan, D.M. Studdert. Critical revision of the article for important intellectual content: T.K. Gandhi, A. Kachalia, E.J. Thomas, T.A. Brennan, D.M. Studdert. Final approval of the article: T.K. Gandhi, A. Kachalia, E.J. Thomas, D.M. Studdert. Provision of study materials or patients: A.L. Puopolo. Statistical expertise: D.M. Studdert. Obtaining of funding: D.M. Studdert. Administrative, technical, or logistic support: A. Kachalia, A.L. Puopolo, D.M. Studdert. Collection and assembly of data: A. Kachalia, A.L. Puopolo, D.M. Studdert.

WHAT ARE THE FACTORS THAT CONTRIBUTE TO DIAGNOSIS ERROR?
Mark Graber, MD et al. Diagnostic Error in Internal Medicine. Arch Intern Med. 2005 Jul 11;165(13):1493-9

ORIGINAL INVESTIGATION
Diagnostic Error in Internal Medicine
Mark L. Graber, MD; Nancy Franklin, PhD; Ruthanna Gordon, PhD

Background: The goal of this study was to determine the relative contribution of system-related and cognitive components to diagnostic error and to develop a comprehensive working taxonomy.

Methods: One hundred cases of diagnostic error involving internists were identified through autopsy discrepancies, quality assurance activities, and voluntary reports. Each case was evaluated to identify system-related and cognitive factors underlying error using record reviews and, if possible, provider interviews.

Results: Ninety cases involved injury, including 33 deaths. The underlying contributions to error fell into 3 natural categories: “no fault,” system-related, and cognitive.
Seven cases reflected no-fault errors alone. In the remaining 93 cases, we identified 548 different system-related or cognitive factors (5.9 per case). System-related factors contributed to the diagnostic error in 65% of the cases and cognitive factors in 74%. The most common system-related factors involved problems with policies and procedures, inefficient processes, teamwork, and communication. The most common cognitive problems involved faulty synthesis. Premature closure, ie, the failure to continue considering reasonable alternatives after an initial diagnosis was reached, was the single most common cause. Other common causes included faulty context generation, misjudging the salience of findings, faulty perception, and errors arising from the use of heuristics. Faulty or inadequate knowledge was uncommon.

Conclusions: Diagnostic error is commonly multifactorial in origin, typically involving both system-related and cognitive factors. The results identify the dominant problems that should be targeted for additional research and early reduction; they also further the development of a comprehensive taxonomy for classifying diagnostic errors.

Arch Intern Med. 2005;165:1493-1499

Once we realize that imperfect understanding is the human condition, there is no shame in being wrong, only in failing to correct our mistakes. — George Soros

Author Affiliations: Department of Veterans Affairs Medical Center, Northport, NY (Dr Graber); Departments of Medicine (Dr Graber) and Psychology (Dr Franklin), State University of New York at Stony Brook; and Department of Psychology, Illinois Institute of Technology, Chicago (Dr Gordon). Financial Disclosure: None.

In his classic studies of clinical reasoning, Elstein1 estimated the rate of diagnostic error to be approximately 15%, in reasonable agreement with the 10% to 15% error rate determined in autopsy studies.2-4 Considering the frequency and impact of diagnostic errors, one is struck by how little is known about this type of medical error.5 Data on the types and causes of errors encountered in the practice of internal medicine are scant, and the field lacks both a standardized definition of diagnostic error and a comprehensive taxonomy, although preliminary versions have been proposed.6-12 According to the Institute of Medicine, the most powerful way to reduce error in medicine is to focus on system-level improvements,13,14 but these interventions are typically discussed in regard to patient treatment issues. The possibility that system-level dysfunction could also contribute to diagnostic errors has received little attention. Typically, diagnostic error is viewed as a cognitive failing.7,15-17 Diagnosis reflects the clinician’s knowledge, clinical acumen, and problem-solving skills.1,18 In everyday practice, clinicians use expert skills to arrive at a diagnosis, often taking advantage of various mental shortcuts known as heuristics.16,18-21 These strategies are highly efficient, relatively effortless, and generally accurate, but they are not infallible. The goal of this study was to clarify the basic etiology of diagnostic errors in internal medicine and to develop a working taxonomy. To understand how these errors arise and how they might be prevented in the future, we systematically examined the etiology of error using root cause analysis to classify both system-related and cognitive components.
METHODS

Based on a classification used by the Australian Patient Safety Foundation, we defined diagnostic error operationally as a diagnosis that was unintentionally delayed (sufficient information was available earlier), wrong (another diagnosis was made before the correct one), or missed (no diagnosis was ever made), as judged from the eventual appreciation of more definitive information.

Cases of suspected diagnostic error were collected from 5 large academic tertiary care medical centers over 5 years. To obtain a broad sampling of errors, we reviewed all eligible cases from 3 sources:
1. Performance improvement and risk management coordinators and peer review committees.
2. Voluntary reports from staff physicians and resident trainees.
3. Discrepancies between clinical impressions and autopsy findings.

Cases were included if internists (staff specialists or generalists or trainees) were primarily responsible for the diagnosis and if sufficient details about the case and the decision-making process could be obtained to allow analysis. In all cases, details were gathered from a review of the medical record, from fact-finding information obtained in the course of quality assurance activities when available, and, in 42 cases, involved practitioners were interviewed, typically within 1 month of error identification. To minimize hindsight bias, reviews of medical records and interviews used a combination of open-ended queries and a root cause checklist developed by the Veterans Health Administration (VHA).22,23 The VHA instrument24,25 identifies specific flaws in the standard dimensions of organizational performance26,27 and is well suited to exploring system-related factors. We developed the cognitive factors portion of the taxonomy by incorporating and expanding on the categories suggested by Chimowitz et al,11 Kassirer and Kopelman,12 and Bordage.28 These categories differentiate flaws in the clinician’s knowledge and skills, ability to gather data, and ability to synthesize all available information into verifiable hypotheses. Criteria and definitions for each category were refined, and new categories added, as the study progressed. Case histories were redacted of identifying information and analyzed as a team to the point of consensus by 1 internist and 2 cognitive psychologists to confirm the existence of a diagnostic error and to assign the error type (delayed, wrong, or missed) and both the system-related and the cognitive factors contributing to the error.29

To identify 100 usable cases, 129 cases of suspected error were reviewed, 29 of which were rejected. Definitive confirmation of an error was lacking in 19 cases. In 6 cases, the diagnosis was somewhat delayed but judged to have been made within an acceptable time frame. In 3 cases, the data were inadequate for analysis, and 1 case was rejected because the error reflected an intentional act that violated local policies (wrong diagnosis of hyponatremia from blood drawn above an intravenous line).

Impact was judged by an internist using a VHA scale that multiplies the likelihood of recurrence (1, remote; up to 4, frequent) by the severity of harm (1, minor injury; up to 4, catastrophic injury).24,25 A minor injury with a remote chance of recurrence received an impact score of 1, and a catastrophic event with frequent recurrence received an impact score of 16. Close-call errors were assigned an impact score of 0, and psychological impact was similarly discounted.
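To make the arithmetic of this scale concrete, here is a minimal sketch in Python; the function name and the validation step are ours for illustration and are not part of the VHA instrument:

    def vha_impact_score(recurrence: int, severity: int, close_call: bool = False) -> int:
        """Impact = likelihood of recurrence (1 = remote .. 4 = frequent)
        multiplied by severity of harm (1 = minor .. 4 = catastrophic).
        Close calls score 0, as described in the text."""
        if close_call:
            return 0
        if not (1 <= recurrence <= 4 and 1 <= severity <= 4):
            raise ValueError("both ratings must be between 1 and 4")
        return recurrence * severity

    # A minor injury with remote recurrence scores 1 * 1 = 1;
    # a catastrophic, frequently recurring event scores 4 * 4 = 16.
    assert vha_impact_score(1, 1) == 1
    assert vha_impact_score(4, 4) == 16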
The relative frequency of error types was compared using the Fisher exact test. The impact scores of different groups were compared by 1-way analysis of variance, and if a significant difference was found, group means were compared by a t test.
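For readers who want to see the first of these comparisons in action, here is a minimal Python sketch; the 2x2 counts are invented for illustration and are not data from this study:

    from scipy.stats import fisher_exact

    # Hypothetical 2x2 table: rows are two error types (eg, delayed vs wrong);
    # columns are counts of cases with and without a system-related factor.
    table = [[25, 3],
             [19, 19]]

    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.4f}")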
The study was approved by the institutional review board(s) at the participating institutions. Confidentiality protections were provided by the federal Privacy Act and the Tort Claims Act, by New York State statutes, and by a certificate of confidentiality from the US Department of Health and Human Services.

RESULTS

We analyzed 100 cases of diagnostic error: 57 from quality assurance activities, 33 from voluntary reports, and 10 from autopsy discrepancies. The error was revealed by tissue in 53 cases (19 autopsy specimens and 34 surgical or biopsy specimens), by definitive tests in 44 cases (24 x-ray studies and 20 laboratory investigations), and from pathognomonic clinical findings or procedure results in the remaining 3 cases. The diagnosis was wrong in 38 cases, missed in 34 cases, and delayed in 28 cases.

IMPACT

Ten cases were classified as close calls, and 90 cases involved some degree of harm, including 33 deaths. The clinical impact averaged 3.80±0.28 (mean±SEM) on the VHA impact scale, indicating substantial levels of harm, on average. The impact score tended to be lower in cases of delayed diagnosis than in cases that were missed or wrong (3.79±0.52 vs 4.76±0.40 and 4.47±0.46; P=.34). Cases with solely cognitive factors or with mixed cognitive and system-related factors had significantly higher impact scores than cases with only system-related factors (4.11±0.46 and 4.27±0.47 vs 2.54±0.55; P=.03 for both comparisons). These 2 effects may be related, as delays were the type of error most likely to result from system-related factors alone.

ETIOLOGY OF DIAGNOSTIC ERROR

Our results suggested that diagnostic error in medicine could best be described using a taxonomy that includes no-fault, system-related (Table 1), and cognitive (Table 2) factors:
• No-fault errors: masked or unusual presentation of disease; patient-related error (uncooperative, deceptive)
• System-related errors: technical failure and equipment problems; organizational flaws
• Cognitive errors: faulty knowledge; faulty data gathering; faulty synthesis

In 46% of the cases, both system-related and cognitive factors contributed to diagnostic error. Cases involving only cognitive factors (28%) or only system-related factors (19%) were less common, and 7 cases were found to reflect solely no-fault factors, without any other system-related or cognitive factors. Combining the pure and the mixed cases, system-related factors contributed to the diagnostic error in 65% of the 100 cases and cognitive factors contributed in 74% (Figure). Overall, we identified 228 system-related factors and 320 cognitive factors, averaging 5.9 per case.

The relative frequency of both system-related and cognitive factors varied with the source of the case and the type of error involved. Cases identified from quality assurance reports and from voluntary reports had a similar prevalence of system-related factors (72% and 76%, respectively) and cognitive factors (65% and 85%, respectively). In contrast, cases identified from autopsy discrepancies involved cognitive factors 90% of the time (P>.50) and system-related factors only 10% of the time (P<.001). Cases of delayed diagnosis had relatively more system-related errors (89%) and fewer cognitive errors (36%) on average, and cases of wrong diagnosis involved more cognitive errors (92%) and fewer system-related errors (50%; P<.01 for both pairs).

Table 1. System-Related Contributions to Diagnostic Error (type; no. of encounters; definition; example)

Technical (n = 13)
• Technical failure and equipment problems (13): Test instruments are faulty, miscalibrated, or unavailable. Example: Wrong glucose reading on incorrectly calibrated glucometer.

Organizational (n = 215)
• Clustering (35): Repeating instances of the same error type. Example: Repeated instances of incorrect readings of x-ray studies in the emergency department by covering physicians; radiologists not available; administration aware.
• Policy and procedures (33): Policies that fail to account for certain conditions or that actively create error-prone situations. Example: Recurrent colon cancer missed: no policy to ensure regular follow-up of patients after colon cancer surgery.
• Inefficient processes (32): Standardized processes resulting in unnecessary delay; absence of expedited pathways. Example: 9-month delay in diagnosis of colon cancer, reflecting additive delays in scheduling clinic visits, procedures, and surgery.
• Teamwork or communications (27): Failure to share needed information or skills. Example: Alarming increase in PSA levels never effectively communicated to a provider who had changed clinic sites.
• Patient neglect (23): Failure to provide necessary care. Example: Biopsy report of cancer never communicated to patient who missed a clinic appointment.
• Management (20): Failed oversight of system issues. Example: Multiple x-ray studies not read in a timely manner; films repeatedly lost or misplaced.
• Coordination of care (18): Clumsy interactions between caregivers or sites of care; hand-off problems. Example: Delayed workup of pulmonary nodule; consultation request lost and, when found, not acted on promptly.
• Supervision (8): Failed oversight of trainees. Example: Delayed diagnosis of peritonitis; reinsertion of a percutaneous gastric feeding tube not appropriately supervised.
• Expertise unavailable (8): Required specialist not available in a timely manner. Example: Radiologist not available to read a key film on a holiday evening.
• Training and orientation (7): Clinicians not made aware of correct processes, policy, or procedures. Example: Delay in diagnosis of Wegener granulomatosis; staff not aware that ANCA testing required special approvals.
• Personnel (4): Clinician laziness, rude behavior, or known to have recurring problems with communication or teamwork. Example: Clinician known to commonly skip elements of the physical examination missed detection of gangrenous toes.
• External interference (0): Interference with proper care by corporate or government institutions. Example: None encountered.

Abbreviations: ANCA, antineutrophil cytoplasmic autoantibody; PSA, prostate-specific antigen.

NO-FAULT ERRORS

No-fault factors were identified in 44 of the 100 cases and constituted the sole explanation in 7 cases. Eleven of these cases involved patient-related factors, including 2 instances of deception (surreptitious self-injection of saliva, mimicking sepsis, and denial of high-risk sexual activity, which delayed diagnosis of Pneumocystis carinii pneumonia) and 9 cases involving delayed diagnoses related to missed appointments or instances in which patient statements were unintentionally misleading or incomplete.
By far, the most common no-fault factor was an atypical or masked disease presentation, encountered in 33 cases.

SYSTEM-RELATED CONTRIBUTIONS TO ERROR

In 65 cases, system-related factors contributed to diagnostic error (Table 1). The vast majority of these (215 instances) were related to organizational problems, and a small fraction (13 instances) involved technical and equipment problems. The factors encountered most often related to policies and procedures, inefficient processes, and difficulty with teamwork and communication, especially communication of test results. Many error types were encountered more than twice in the same institution, an event we referred to as clustering.

COGNITIVE CONTRIBUTIONS TO ERROR

We identified 320 cognitive factors in 74 cases (Table 2). The most common category of factors was faulty synthesis (264 instances), or flawed processing of the available information. Faulty data gathering was identified in 45 instances. Inadequate or faulty knowledge or skills were identified in only 11 instances.

Table 2. Cognitive Contributions to Diagnostic Error (type; no. of encounters; definition; example)

Faulty Knowledge (n = 11)
• Knowledge base inadequate or defective (4): Insufficient knowledge of relevant condition. Example: Providers not aware of Fournier gangrene.
• Skills inadequate or defective (7): Insufficient diagnostic skill for relevant condition. Example: Missed diagnosis of complete heart block: clinician misread electrocardiogram.

Faulty Data Gathering (n = 45)
• Ineffective, incomplete, or faulty workup (24): Problems in organizing or coordinating patient’s tests and consultations. Example: Delayed diagnosis of drug-related lupus: failure to consult patient’s old medical records.
• Ineffective, incomplete, or faulty history and physical examination (10): Failure to collect appropriate information from the initial interview and examination. Example: Delayed diagnosis of abdominal aortic aneurysm: incomplete history questioning.
• Faulty test or procedure techniques (7): Standard test/procedure is conducted incorrectly. Example: Wrong diagnosis of myocardial infarction: electrocardiographic leads reversed.
• Failure to screen (prehypothesis) (3): Failure to perform indicated screening procedures. Example: Missed prostate cancer: rectal examination and PSA testing never performed in a 55-year-old man.
• Poor etiquette leading to poor data quality (1): Failure to collect required information owing to poor interaction with patient. Example: Missed CNS contusion after very abbreviated history; pejorative questioning.

Faulty Synthesis: Faulty Information Processing (n = 159)
• Faulty context generation (26): Lack of awareness/consideration of aspects of patient’s situation that are relevant to diagnosis. Example: Missed perforated ulcer in a patient presenting with chest pain and laboratory evidence of myocardial infarction.
• Overestimating or underestimating usefulness or salience of a finding (25): Clinician is aware of symptom but either focuses too closely on it to the exclusion of others or fails to appreciate its relevance. Example: Wrong diagnosis of sepsis in a patient with stable leukocytosis in the setting of myelodysplastic syndrome.
• Faulty detection or perception (25): Symptom, sign, or finding should be noticeable, but clinician misses it. Example: Missed pneumothorax on chest radiograph.
• Failed heuristics (23): Failure to apply appropriate rule of thumb, or overapplication of such a rule under inappropriate/atypical circumstances. Example: Wrong diagnosis of bronchitis in a patient later found to have pulmonary embolism.
• Failure to act sooner (15): Delay in appropriate data-analysis activity. Example: Missed diagnosis of ischemic bowel in a patient with a 12-week history of bloody diarrhea.
• Faulty triggering (14): Clinician considers inappropriate conclusion based on current data or fails to consider conclusion reasonable from data. Example: Wrong diagnosis of pneumonia in a patient with hemoptysis: never considered the eventual diagnosis of vasculitis.
• Misidentification of a symptom or sign (11): One symptom is mistaken for another. Example: Missed cancer of the pancreas in a patient with pain radiating to the back, attributed to GERD.
• Distraction by other goals or issues (10): Other aspects of patient treatment (eg, dealing with an earlier condition) are allowed to obscure diagnostic process for current condition. Example: Wrong diagnosis of panic disorder: patient with a history of schizophrenia presenting with abnormal mental status and found to have CNS metastases.
• Faulty interpretation of a test result (10): Test results are read correctly, but incorrect conclusions are drawn. Example: Missed diagnosis of Clostridium difficile enteritis in a patient with a negative stool test result.
• Reporting or remembering findings not gathered (0): Symptoms or signs reported that do not exist, often findings that are typically present in the suspected illness. Example: None encountered.

Faulty Synthesis: Faulty Verification (n = 106)
• Premature closure (39): Failure to consider other possibilities once an initial diagnosis has been reached. Example: Wrong diagnosis of musculoskeletal pain after a car crash: ruptured spleen ultimately found.
• Failure to order or follow up on appropriate test (18): Clinician does not use an appropriate test to confirm a diagnosis or does not take appropriate next step after test. Example: Wrong diagnosis of urosepsis in a patient: bedside urinalysis never performed.
• Failure to consult (15): Appropriate expert is not contacted. Example: Hyponatremia inappropriately ascribed to diuretics in a patient later found to have lung cancer; no consultations requested.
• Failure to periodically review the situation (10): Failure to gather new data to determine whether situation has changed since initial diagnosis. Example: Missed colon cancer in a patient with progressively declining hematocrit attributed to gastritis.
• Failure to gather other useful information to verify diagnosis (10): Appropriate steps to verify diagnosis are not taken. Example: Wrong diagnosis of osteoarthritis in a patient found to have drug-induced lupus after ANA testing.
• Overreliance on someone else’s findings or opinion (8): Failure to check previous clinician’s diagnosis against current findings. Example: Outpatient followed up with diagnosis of CHF, admitted with increased shortness of breath, later found to have lung cancer as the cause.
• Failure to validate findings with patient (5): Clinician does not check with patient concerning additional symptoms that might confirm/disconfirm diagnosis. Example: Wrong diagnosis of bone metastases in a patient with many prior broken ribs.
• Confirmation bias (1): Tendency to interpret new results in a way that supports one’s previous diagnosis. Example: Wrong diagnosis of pulmonary embolism: positive results on D-dimer test taken to support this diagnosis in a patient with respiratory failure due to ARDS and gram-negative sepsis.

Abbreviations: ANA, antinuclear antibody; ARDS, adult respiratory distress syndrome; CHF, congestive heart failure; CNS, central nervous system; GERD, gastroesophageal reflux disease; PSA, prostate-specific antigen.

FAULTY KNOWLEDGE OR SKILLS

Inadequate knowledge was identified in only 4 cases, each concerning a rare condition: (1) a case of missed Fournier gangrene; (2) a missed diagnosis of calciphylaxis in a patient undergoing dialysis with normal levels of serum calcium and phosphorus; (3) a case of chronic thrombotic thrombocytopenic purpura; and (4) a wrong diagnosis of disseminated intravascular coagulation in a patient ultimately thought to have clopidogrel-associated thrombotic thrombocytopenic purpura. The 7 cases involving inadequate skills involved misinterpretations of x-ray studies and electrocardiograms by nonexperts.

FAULTY DATA GATHERING

The dominant cause of error in the faulty data-gathering category lay in the subcategory of ineffective, incomplete, or faulty workup (24 instances). For example, the diagnosis of subdural hematoma was missed in a patient who was seen after a motor vehicle crash because the physical examination was incomplete. Problems with ordering the appropriate tests and interpreting test results were also common in this group.

[Figure. The categories of factors contributing to diagnostic error in 100 patients: both system-related and cognitive factors, 46%; cognitive error only, 28%; system-related error only, 19%; no-fault factors only, 7%.]

FAULTY INFORMATION SYNTHESIS

Faulty information synthesis, which includes a wide range of factors, was the most common cause of cognitive-based errors. The single most common phenomenon was premature closure: the tendency to stop considering other possibilities after reaching a diagnosis. Other common synthesis factors included faulty context generation, misjudging the salience of a finding, faulty perception, and failed use of heuristics. Faulty context generation and misjudging the salience of a finding often occurred in the same case (15 of 25 instances). Perceptual failures most commonly involved incorrect readings of x-ray studies by internists and emergency department staff before official reading by a radiologist. Of the 23 instances related to heuristics, 14 reflected the bias to assume that all findings were related to a single cause when a patient actually had more than 1 condition. In 7 cases, the most common condition was chosen as the likely diagnosis, although a less common condition was responsible.

COVARIATION AMONG FACTORS

Cognitive and system-related factors were found to often co-occur, and these factors may have led, directly or indirectly, to each other.
For example, a mistake relatively early on (eg, an inadequate history or physical examination) is likely to lead to subsequent mistakes (eg, in interpreting test results, considering appropriate candidate diagnoses, or calling in appropriate specialists). We examined the patterns of factors identified in these 100 cases to identify clusters of cognitive factors that tended to co-occur. Using Pearson r tests and correcting for the use of multiple pairwise analyses, we found several such clusters of cognitive factors (a minimal computational sketch follows this list). The more common clusters of 3 factors, all of which have significant pairwise correlations within a cluster, were as follows:
• Incomplete/faulty history and physical examination; failure to consider the correct candidate diagnosis; and premature closure
• Incomplete/excessive data gathering; bias toward a single explanation; and premature closure
• Underestimating the usefulness of a finding; premature closure; and failure to consult
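The sketch below illustrates this kind of pairwise co-occurrence analysis in Python; the case-by-factor matrix is invented, and a Bonferroni adjustment stands in for whichever multiple-comparison correction the authors applied:

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical data: one row per case, one column per cognitive factor;
    # 1 means the factor was identified in that case, 0 means it was not.
    factors = ["faulty history/physical", "premature closure", "failure to consult"]
    X = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 0, 0],
                  [1, 1, 1],
                  [0, 1, 1],
                  [0, 0, 0]])

    n_pairs = len(factors) * (len(factors) - 1) // 2
    for i in range(len(factors)):
        for j in range(i + 1, len(factors)):
            r, p = pearsonr(X[:, i], X[:, j])
            # Bonferroni correction for the multiple pairwise tests
            if p * n_pairs < 0.05:
                print(f"{factors[i]} and {factors[j]} co-occur: r = {r:.2f}")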
COMMENT

In classifying the underlying factors contributing to error, 3 natural categories emerged: no fault, system-related, and cognitive. This classification validates the cognitive and no-fault distinctions described by Chimowitz et al,11 Kassirer and Kopelman,12 and Bordage28 and adds a third major category of system-level factors. A second objective was to assess the relative contributions of system-related and cognitive root cause factors. The results allow 3 major conclusions regarding diagnostic error in internal medicine settings.

DIAGNOSTIC ERROR IS TYPICALLY MULTIFACTORIAL IN ORIGIN

Excluding the 7 cases of pure no-fault error, we identified an average of 5.9 factors contributing to error in each case. Reason’s6 “Swiss cheese” model of error suggests that harm results from multiple breakdowns in the series of barriers that normally prevent injury. This phenomenon was identified in many of our cases, in which the ultimate diagnostic failure involved separate factors at multiple levels of both the system-related and the cognitive pathways. A second reason for encountering multiple factors in a single case is the tendency for one type of error to lead to another. For example, a patient with retrosternal and upper epigastric pain was given a diagnosis of myocardial infarction on the basis of new Q waves in his electrocardiogram and elevated levels of troponin. The clinicians missed a coexisting perforated ulcer, illustrating that if a case is viewed in the wrong context, clinicians may miss relevant clues and may not consider the correct diagnosis.

SYSTEM FLAWS CONTRIBUTE COMMONLY TO DIAGNOSTIC ERROR

System-related factors were identified in 65% of cases. This finding supports a previous study linking diagnostic errors to system issues30 but contrasts with the prevailing belief that diagnostic errors overwhelmingly reflect defective cognition. The system flaws identified in our study reflected far more organizational issues than technical problems. Errors related to suboptimal supervision of trainees occurred, but uncommonly.

COGNITIVE ERRORS ARE A COMMON CAUSE OF DIAGNOSTIC ERROR AND PREDOMINANTLY REFLECT PROBLEMS WITH SYNTHESIS OF THE AVAILABLE INFORMATION

Faulty data gathering was much less commonly encountered, and defective knowledge was rare. These results are consistent with conclusions from earlier studies28,30 and from autopsy data almost 50 years ago: “... mistakes were due not so much to lack of knowledge of factual data as to certain deficiencies of approach and judgment.”31 This finding may distinguish medical diagnosis from other types of expert decision making, in which knowledge deficits are more commonly encountered as the cause of error.32 As predicted by other authors,1,9,15,33 premature closure was encountered more commonly than any other type of cognitive error. Simon34 described the initial stages of problem solving as a search for an explanation that best fits the known facts, at which point one stops searching for additional explanations, a process he termed satisficing. Experienced clinicians are as likely as more junior colleagues to exhibit premature closure,15 and elderly physicians may be particularly predisposed.35

STUDY LIMITATIONS

This study has a variety of limitations that restrict the generality of the conclusions. First, because the types of error are dependent on the source of the cases,36-38 a different spectrum of case types would be expected outside internal medicine. Also, selection bias might be expected in cases that are reported voluntarily. Distortions could also result from our nonrandomized method of case selection if the selected cases are not representative of the errors that actually occurred. A second limitation is the difficulty in discerning exactly how a given diagnosis was reached. Clinical reasoning is hidden from direct examination and may be just as mysterious to the clinician involved. A related problem is our limited ability to identify other factors that likely affect many clinical decision-making situations, such as stress, fatigue, and distractions. Clinicians had difficulty recalling such factors, which undoubtedly existed. Their recollections might also be distorted because of the unavoidable lag time between the experience and the interview, and by their knowledge of the clinical outcomes. A third weakness is the subjective assignment of root causes. The field as it evolves will benefit from further clarification and standardization for each of these causes. A final concern is the inevitable bias that is introduced in a retrospective analysis in which the outcomes are known.23,39,40 With this in mind, we did not attempt to evaluate the appropriateness of care or the preventability of adverse events, judgments that are highly sensitive to hindsight bias.

STRATEGIES TO DECREASE DIAGNOSTIC ERROR

Although diagnostic error can never be eliminated,41 our results identify the common causes of diagnostic error in medicine, ideal targets for future efforts to reduce the incidence of these errors. The high prevalence of system-related factors offers the opportunity to reduce diagnostic errors if health care institutions accept the responsibility of addressing these factors. For example, errors could be avoided if radiologists were reliably available to interpret x-ray studies and if abnormal test results were reliably communicated. Institutions should be especially sensitive to clusters of errors of the same type.
Although these institutions may simply excel at error detection, clustering could also indicate misdirected resources or a culture of tolerating suboptimal performance.

Devising strategies for reducing cognitive error is a more complex problem. Our study suggests that internists generally have sufficient medical knowledge and that errors of clinical reasoning overwhelmingly reflect inappropriate cognitive processing and/or poor skills in monitoring one’s own cognitive processes (metacognition).42 Croskerry43 and others44 have argued that clinicians who are oriented to the common pitfalls of clinical reasoning would be better able to avoid them. High-fidelity simulations may be one way to provide this training.45,46 Elstein1 has suggested the value of compiling a complete differential diagnosis to combat the tendency to premature closure, the most common cognitive factor we identified. A complementary strategy for considering alternatives involves the technique of prospective hindsight, the “crystal ball experience”: the clinician would be told to assume that his or her working diagnosis is incorrect and asked, “What alternatives should be considered?”47 A final strategy is to augment a clinician’s inherent metacognitive skills by using expert systems, an approach currently under active research and development.48-50

Accepted for Publication: February 21, 2005.
Correspondence: Mark L. Graber, MD, Medical Service 111, Veterans Affairs Medical Center, Northport, NY 11768 ([email protected]).
Funding/Support: This work was supported by a research support grant honoring James S. Todd, MD, from the National Patient Safety Foundation, North Adams, Mass.
Acknowledgment: We thank Grace Garey and Kathy Kessel for their assistance with the manuscript and references.

REFERENCES
1. Elstein AS. Clinical reasoning in medicine. In: Higgs J, Jones MA, eds. Clinical Reasoning in the Health Professions. Woburn, Mass: Butterworth-Heinemann; 1995:49-59.
2. Kirch W, Schafii C. Misdiagnosis at a university hospital in 4 medical eras. Medicine (Baltimore). 1996;75:29-40.
3. Shojania KG, Burton EC, McDonald KM, Goldman L. Changes in rates of autopsy-detected diagnostic errors over time. JAMA. 2003;289:2849-2856.
4. Goldman L, Sayson R, Robbins S, Cohn LH, Bettmann M, Weisberg M. The value of the autopsy in three different eras. N Engl J Med. 1983;308:1000-1005.
5. Graber M. Diagnostic error in medicine: a case of neglect. Jt Comm J Qual Patient Saf. 2005;31:106-113.
6. Reason J. Human Error. New York, NY: Cambridge University Press; 1990.
7. Zhang J, Patel VL, Johnson TR. Medical error: is the solution medical or cognitive? J Am Med Inform Assoc. 2002;9(suppl):S75-S77.
8. Leape L, Lawthers AG, Brennan TA, Johnson WG. Preventing medical injury. QRB Qual Rev Bull. 1993;19:144-149.
9. Voytovich AE, Rippey RM, Suffredini A. Premature conclusions in diagnostic reasoning. J Med Educ. 1985;60:302-307.
10. Friedman MH, Connell KJ, Olthoff AJ, Sinacore JM, Bordage G. Medical student errors in making a diagnosis. Acad Med. 1998;73(suppl):S19-S21.
11. Chimowitz MI, Logigian EL, Caplan LR. The accuracy of bedside neurological diagnosis. Ann Neurol. 1990;28:78-85.
12. Kassirer JP, Kopelman RI. Cognitive errors in diagnosis: instantiation, classification, and consequences. Am J Med. 1989;86:433-441.
13. Institute of Medicine. To Err Is Human: Building a Safer Health System. Washington, DC: National Academy Press; 1999.
14. Spath PL. Reducing errors through work system improvements. In: Error Reduction in Health Care: A Systems Approach to Improving Patient Safety. San Francisco, Calif: Jossey-Bass; 2000:199-234.
15. Kuhn GJ. Diagnostic errors. Acad Emerg Med. 2002;9:740-750.
16. Croskerry P. Achieving quality in clinical decision making: cognitive strategies and detection of bias. Acad Emerg Med. 2002;9:1184-1204.
17. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003;78:775-780.
18. Kassirer JP. Diagnostic reasoning. Ann Intern Med. 1989;110:893-900.
19. Elstein AS. Heuristics and biases: selected errors in clinical reasoning. Acad Med. 1999;74:791-794.
20. McDonald CJ. Medical heuristics, the silent adjudicators of clinical practice. Ann Intern Med. 1996;124:56-62.
21. Gigerenzer G, Goldstein DG. Reasoning the fast and frugal way: models of bounded rationality. Psychol Rev. 1996;103:650-669.
22. Johnson C. Designing forms to support the elicitation of information about incidents involving human error. Available at: http://www.dcs.gla.ac.uk/~johnson/papers/incident_forms/. Accessed May 1, 2005.
23. Henriksen K, Kaplan H. Hindsight bias, outcome knowledge and adaptive learning. Qual Saf Health Care. 2003;12(suppl):ii46-ii50.
24. Bagian JP, Gosbee J, Lee CZ, Williams L, McKnight SD, Mannos DM. The Veterans Affairs root cause analysis system in action. Jt Comm J Qual Improv. 2002;28:531-545.
25. VA National Center for Patient Safety. Root cause analysis. Available at: www.patientsafety.gov/tools.html. Accessed May 1, 2005.
26. Battles JB, Kaplan HS, Van der Schaaf TW, Shea CE. The attributes of medical event-reporting systems. Arch Pathol Lab Med. 1998;122:231-238.
27. Reason J. Managing the Risks of Organizational Accidents. Brookfield, Vt: Ashgate Publishing Co; 1997.
28. Bordage G. Why did I miss the diagnosis? some cognitive explanations and educational implications. Acad Med. 1999;74(suppl):S128-S143.
29. Graber M. Expanding the goals of peer review to detect both practitioner and system error. Jt Comm J Qual Improv. 1999;25:396-407.
30. Weinberg NS, Stason WB. Managing quality in hospital practice. Int J Qual Health Care. 1998;10:295-302.
31. Gruver RH, Freis ED. A study of diagnostic errors. Ann Intern Med. 1957;47:108-120.
32. Klein G. Sources of error in naturalistic decision making tasks. In: Proceedings of the 37th Annual Meeting of the Human Factors and Ergonomics Society: Designing for Diversity; October 11-15, 1993; Seattle, Wash.
33. Dubeau CE, Voytovich AE, Rippey RM. Premature conclusions in the diagnosis of iron-deficiency anemia: cause and effect. Med Decis Making. 1986;6:169-173.
34. Simon HA. Invariants of human behavior. Annu Rev Psychol. 1990;41:1-19.
35. Eva KW. The aging physician: changes in cognitive processing and their impact on medical practice. Acad Med. 2002;77(suppl):S1-S6.
36. Thomas EJ, Petersen LA. Measuring errors and adverse events in health care. J Gen Intern Med. 2003;18:61-67.
37. O’Neil AC, Petersen LA, Cook F, Bates DW, Lee TH, Brennan TA. Physician reporting compared with medical-record review to identify adverse medical events. Ann Intern Med. 1993;119:370-376.
38. Heinrich J. Adverse Events: Surveillance Systems for Adverse Events and Medical Errors. Washington, DC: US Government Accounting Office; February 9, 2000.
39. Caplan RA, Posner KL, Cheney FW. Effect of outcome on physician judgments of appropriateness of care. JAMA. 1991;265:1957-1960.
40. Fischoff B. Hindsight does not equal foresight: the effect of outcome knowledge on judgment under uncertainty. J Exp Psychol Hum Percept Perform. 1975;1:288-299.
41. Graber M, Gordon R, Franklin N. Reducing diagnostic error in medicine: what’s the goal? Acad Med. 2002;77:981-992.
42. Croskerry P. The cognitive imperative: thinking about how we think. Acad Emerg Med. 2000;7:1223-1231.
43. Croskerry P. Cognitive forcing strategies in clinical decision making. Ann Emerg Med. 2003;41:110-120.
44. Hall KH. Reviewing intuitive decision-making and uncertainty: the implications for medical education. Med Educ. 2002;36:216-224.
45. Satish U, Streufert S. Value of a cognitive simulation in medicine: towards optimizing decision making performance of healthcare professionals. Qual Saf Health Care. 2002;11:163-167.
46. Bond WF, Deitrick LM, Arnold DC, et al. Using simulation to instruct emergency medicine residents in cognitive forcing strategies. Acad Med. 2004;79:438-446.
47. Mitchell DJ, Russo JE, Pennington N. Back to the future: temporal perspective in the explanation of events. J Behav Decis Making. 1989;2:25-38.
48. Trowbridge R, Weingarten S. Clinical decision support systems. In: Shojania KG, Duncan BW, McDonald KM, Wachter RM, eds. Making Health Care Safer: A Critical Analysis of Patient Safety Practices. Rockville, Md: Agency for Healthcare Research and Quality; 2001.
49. Hunt DL, Haynes RB, Hanna SE, Smith K. Effects of computer-based clinical decision support systems on physician performance and patient outcomes: a systematic review. JAMA. 1998;280:1339-1346.
50. Sim I, Gorman P, Greenes RA, et al. Clinical decision support systems for the practice of evidence-based medicine. J Am Med Inform Assoc. 2001;8:527-534.

WHAT ARE THE FACTORS THAT CONTRIBUTE TO DIAGNOSIS ERROR?
Mark Graber MD. Diagnostic errors in medicine: a case of neglect. Jt Comm J Qual Patient Saf. 2005 Feb;31(2):106-13

Forum
Diagnostic Errors in Medicine: A Case of Neglect
Mark Graber, M.D.

More than 90 years have passed since Harvard’s Dr. Richard Cabot illustrated the value of autopsies to reveal diagnostic error.1 At his weekly meetings with medical students, Cabot would compare the clinical impressions before death with the findings at autopsy, creating the powerful and enduring teaching format known as the “CPC” (clinical-pathological correlation). From his experience with more than 3,000 cases, Cabot estimated that the clinical diagnosis was correct less than half the time in patients dying of thoracic aneurysms, cirrhosis, pericarditis, nephritis, and a variety of other conditions. The reaction at the time was one of incredulity.2 Dr. Alfred Croftan, a Chicago physician, responded that “the overwhelming majority of cases are today diagnosed correctly…” and that the errors noted by Dr. Cabot could only reflect the “unpardonably careless work” by the physicians of Boston!3(p. 145) Modern clinicians still struggle to recognize and accept the possibility of diagnostic error.
Although many of the diseases on Dr. Cabot’s list can now be diagnosed with routine blood tests or imaging, diagnostic error still exists and always will.4 Medical diagnoses that are wrong, missed, or delayed make up a large fraction of all medical errors and cause substantial suffering and injury. Compared with other types of medical error, however, diagnostic errors receive little attention. We and others have previously proposed that this neglect is a major factor in perpetuating unacceptable rates of diagnostic error.4–6 In this article, I explore the reasons for this neglect in detail. I propose that three factors explain why diagnostic errors tend to be ignored. First, these errors are fundamentally obscure. Diagnostic errors are difficult to identify and, when found, are less easily understood than other types of medical error. Second, health care organizations have not viewed diagnostic errors as a system problem. Third, physicians responsible for making medical decisions seldom perceive their own error rates as problematic. The safety of modern health care can be improved if these three issues are understood and addressed in an effort to make diagnostic errors visible.

Article-at-a-Glance
Background: Medical diagnoses that are wrong, missed, or delayed make up a large fraction of all medical errors and cause substantial suffering and injury. Compared with other types of medical error, however, diagnostic errors receive little attention—a major factor in perpetuating unacceptable rates of diagnostic error. Diagnostic errors are fundamentally obscure, health care organizations have not viewed them as a system problem, and physicians responsible for making medical decisions seldom perceive their own error rates as problematic. The safety of modern health care can be improved if these three issues are understood and addressed.
Solutions: Opportunities to improve the visibility of diagnostic errors are evident. Diagnostic error needs to be included in the normal spectrum of quality assurance surveillance and review. The system properties that contribute to diagnostic errors need to be systematically identified and addressed, including issues related to reliable diagnostic testing processes. Even for cases entirely dependent on the skill of the clinician for accurate diagnosis, health care organizations could minimize errors by using system-level interventions to aid the clinician, such as second readings of key diagnostic tests and providing resources for clinical decision support. Physicians need to improve their calibration by getting feedback on the diagnoses they make. Finally, clinicians need to learn about overconfidence and other innate cognitive tendencies that detract from optimal reasoning and learning.
Conclusion: Clinicians and their health care organizations need to take active steps to discover, analyze, and prevent diagnostic errors.

Neglected Errors
Although clinicians routinely arrive at most of their diagnoses with ease and accuracy, diagnostic errors occur at non-negligible rates. Diagnostic errors are encountered in every specialty and every type of practice.
It is estimated that mistakes are made in roughly 5% of radiology and pathology diagnoses.7,8 Error rates in clinical practice are not known with certainty but might be as high as 12%, for example, in the emergency department, where complex decision making is involved in settings of above-average uncertainty and stress.9 Diagnostic errors are easily demonstrated when physicians are tested with standardized patients or scenarios.10,11 There is excess variation between providers who analyze the same case,12,13 and physicians at times even disagree with themselves when re-presented with a case they have previously diagnosed.14 Medical residents report that diagnostic errors are the most common medical errors they encounter, and only a minority of these are ever discussed with their mentors.15 In the Harvard Medical Practice Study, diagnostic errors were the second leading cause of adverse events.16 Autopsies are considered the gold standard for detecting diagnostic errors, and, indeed, discrepancies from the clinical diagnosis are found in approximately one quarter of cases.17 Not all these discrepancies have clinical relevance, but errors that potentially could have changed the outcome are found in 5% to 10% of all autopsies.17,18

Despite this evidence that diagnostic errors are a major issue in patient safety, they receive scant attention. This may reflect the general tendency for clinicians to be unaware of medical error,19 but even within the patient safety community diagnostic errors are barely on the radar screen. Clinical decision making is an underfunded research area, and the topic is generally neglected in patient safety symposia and in medical school curricula. If we are ever to optimize patient safety, it will require that we minimize diagnostic errors. As a start, we need to understand why these errors are neglected.

Fundamental Issues
Although clinical decision making has been described in general terms, at a fundamental level the process is complicated, hidden, and only partially understood.20,21 Making a diagnosis is primarily a cognitive process and is subject to influence by the affective state of the clinician.22 These processes are difficult to study, quantify, and understand. Even the decision maker may not be aware of how or why a given diagnosis was reached. Experts in this field work in the area of cognitive psychology and discuss specialized concepts such as “metacognition,” “heuristics,” and “debiasing.” It is little wonder that practicing clinicians and patient safety staff have difficulty acquiring a comprehensive understanding of clinical decision making, the first step in trying to understand diagnostic errors. Root cause analysis, so powerful in understanding other types of medical error, is less easily applied when the root causes are cognitive.

Diagnostic errors often escape detection. It has been estimated that for every error we detect, scores are missed.23 Compared with adverse events related to surgery or other “process” and treatment errors, which are often glaring and obvious, diagnostic errors tend to be more subtle and difficult to pinpoint in time and place. Detecting unacceptable delays in diagnosis is especially difficult, because clear standards are lacking, and the perceptions of the clinician regarding the existence of a delay may differ substantially from the impressions of an affected patient.

Physicians are uncomfortable discussing diagnostic error.
It is unsettling to consider the possibility that one’s own diagnostic capabilities might be in question. This reflects, in part, the real concern that diagnostic errors can lead to career-threatening malpractice suits. A substantial proportion of these suits are provoked by diagnostic errors. In the Veterans Health Administration, tort claims related to diagnostic errors were twice as common as claims related to medication errors.24 Malpractice suits related to diagnostic errors involve all specialties and are the most difficult to defend.25

System Issues
Health care organizations view diagnostic error as physician failure rather than as an institutional problem. The available data, however, suggest that system-related factors commonly contribute to diagnostic error. Emergency departments are particularly predisposed to system-related diagnostic errors relating to workload-related stress, constant distractions, nonstandardized processes, and a variety of associated issues.26 Similarly, when diagnostic errors involve internists, both cognitive and latent system flaws are typically identified as root causes.27 The system-level processes most commonly found in these cases involve cultural acceptance of suboptimal systems, inadequate policies and procedures, inefficient processes, and problems with teamwork and communication.

Diagnostic errors are simply not a priority for health care organizations. No one has told them they should be. Patient safety is an enormous universe, and many organizations look to expert advisory groups for advice on which areas need attention. Unfortunately, the guidance provided by advisory organizations has yet to identify diagnostic error as an area in need of attention. The Joint Commission on Accreditation of Healthcare Organizations encourages facilities to develop their own patient safety priorities using the tools of root cause analysis to investigate serious injuries.28 In practice, however, organizations tend to focus on the initial seven National Patient Safety Goals, none of which directly involves diagnostic error (but see the Solutions section below). The Joint Commission also requires organizations to study and report on 13 categories of sentinel event, but none of these address diagnostic accuracy,28 nor do the 27 reportable events advocated by the National Quality Forum.29 Similarly, the Leapfrog Group, a consortium of 145 organizations providing health care benefits, has issued patient safety priorities for health care organizations, none directly involving medical diagnosis.30 The Agency for Healthcare Research and Quality recently compiled evidence-based patient safety practices.31 Of the 25 measures with the strongest evidence base, none are measures to improve diagnostic accuracy or timeliness.

Health care organizations have many incentives to bypass thorny issues, such as diagnostic error, and focus instead on “low-hanging fruit,” such as medication errors and wrong-site surgery. These errors are easier to understand, interventions are more obvious, and the organization can show progress towards achieving patient safety goals.

Provider Issues
Although the act of making a medical diagnosis is a defining skill of physicians, clinicians seldom think about how well they carry out this function. Two questions immediately arise: (1) Are clinicians aware that diagnostic errors can occur? (2) Are they themselves susceptible to this problem?
The answer to the first question is unequivocally yes: Physicians are keenly aware that diagnostic errors are made. Throughout their training and on a regular basis in their practice years, physicians learn of diagnostic errors via quality assurance proceedings and malpractice suits. The fear of a malpractice suit relating to diagnostic error is a major impetus for the practice of defensive medicine—excessive referrals to specialists and the use of expensive and redundant tests and procedures.32 Defensive practice is ubiquitous, and some estimate that it accounts for 5% to 9% of health care expenditures in the United States.33 Of direct relevance, this malpractice pressure likely has more significant impact on diagnostic decisions than on management decisions.34

The answer to the second question is more interesting: Although clinicians are aware of the possibility for diagnostic error and know that they are personally at risk, they believe that their own diagnoses are correct. Errors are made by someone else. Clinicians can perhaps remember a diagnostic error or two they’ve made in their career, but when asked about their current patients and their current diagnoses, my personal observation is that most believe, like Dr. Croftan, that virtually all of these are accurate. Indeed, only 29% of physicians reported encountering any medical error in the past year.35 Given the evidence that the true error rate is in the range of 5% to 15%, how is it possible that individual physicians believe that their own error rates approach zero? Inadequate calibration and overconfidence may help explain this contradiction.

Inadequate calibration. Across a wide spectrum of skills and professions, receiving feedback on one’s performance is an essential requirement for developing expertise. Feedback provides the follow-up information necessary to assess the accuracy of initial predictions, a process referred to as calibration. Professionals who are well calibrated agree with each other and correctly interpret “gold standard” cases. Physicians occasionally receive feedback regarding some of their diagnoses as patients move through a diagnostic evaluation or return for follow-up after treatment. Feedback is too sporadic, however, and lack of feedback contributes to the physician’s sincere belief that the great majority (or all) of their diagnoses are correct. A variety of factors limit feedback.36 Some of the more important factors follow:
■ Availability of definitive diagnostic information. The true diagnosis is not known in every case. This would include, for example, most diagnoses regarding less serious ailments; if a patient’s knee pain resolves after two weeks of treatment with aspirin, the clinician may never know if the correct diagnosis was arthritis, bursitis, or tendonitis.
■ Communication barriers. If definitive diagnostic information does become available at some later time, the clinician making the initial diagnosis may no longer be in the information loop. The clinician in an outpatient setting may not be routinely informed if his or her initial diagnosis is later modified in the hospital, or vice versa. Even if the opportunity does arise to convey knowledge of an error, clinicians feel awkward informing colleagues that their diagnoses were wrong.
■ Paucity of planned feedback. The autopsy rate in the United States is now estimated to be less than 5%, depriving clinicians of a unique calibration tool.
Although some medical specialty organizations provide formal programs that provide and encourage feedback, this is more the exception than the rule. Physicians are rarely expected to show evidence that they have tracked or processed feedback in any organized way.

Overconfidence. Lack of calibration contributes to overconfidence, a second factor causing clinicians to overestimate their diagnostic accuracy. There is substantial evidence that humans are overconfident in a variety of settings; we routinely overestimate our ability to function flawlessly.37 A classic example is eyewitness testimony: the correlation between the accuracy of eyewitness identification and the confidence level of the witness is generally less than 0.25.38 Another sobering statistic is that only 1% of drivers rate their driving skills below those of the average driver.39 Just as we overestimate our skills and knowledge, we are overconfident that our decisions, such as medical diagnoses, are correct.37,40 Overconfidence in the setting of medical decision making, however, is particularly inappropriate given the general uncertainty that surrounds most diagnostic endeavors. A chilling observation in studies of overconfidence is that the least skilled are the most overconfident. In tests of grammar, logic, and humor, the subjects scoring the lowest by objective criteria are substantially more overconfident of their performance than subjects who score well on these tests.41 Exactly the same phenomenon is seen when medical residents estimate their skill in communicating with patients.42 The residents least skilled in communicating were the ones least able to judge their skill level and the most likely to overestimate it. In the same vein, trainees dismissed from residency training programs believed that they rarely made any mistakes.43 This phenomenon fits well with current theories regarding the development of expertise. Experts not only possess superior knowledge of their field but are highly aware of their performance level and are less likely to exhibit overconfidence in regard to their performance. Superior metacognition and accurate calibration are the hallmarks of an expert.

Overconfidence and poor calibration become self-perpetuating if our (inappropriate) certainty about our diagnoses dissuades us from asking for autopsies.44 Is our diagnostic acumen sharp enough to predict that an autopsy will simply confirm our suspicions? The available evidence would suggest otherwise.45 For example, in a recent study in which clinicians were asked about the certainty of their diagnoses, fatal but potentially treatable errors were found at autopsy in 10% of cases regardless of whether the clinicians were certain or not.46 Physicians are uncomfortable with uncertainty. Partly in response to pressures from our patients, and partly to satisfy our own unease, we assign a "working" diagnosis. Typically, this is produced by a "cognitive disposition to respond" without always ensuring that the process is rigorous and the product is accurate.47 Physicians are expected to present a professional air of confidence and expertise. Diagnostic errors fall in our blind spot; we ignore them.

Solutions

Opportunities to improve the visibility of diagnostic errors are evident.

Fundamental Issues

Diagnostic error needs to be included in the normal spectrum of quality assurance surveillance and review.
Staff responsible for organizational performance need simple definitions and working rules to identify, classify, and study these errors. A simple working definition of a diagnostic error is a diagnosis that is missed, wrong, or delayed, as detected by some subsequent definitive test or finding. The origins of these errors can be classified by considering the provider-specific (cognitive) elements, the system-related contributions, and "no fault" elements reflecting diseases that present atypically or involve excessive patient noncompliance.4,27 Medical schools need to teach principles of optimal decision making and cognitive debiasing.47,48 More research on the nature of clinical decision making is needed to understand how errors arise and how they can be prevented. Research specifically targeted at diagnostic errors is currently being sponsored by the Agency for Healthcare Research and Quality,49 as well as the National Patient Safety Foundation, and hopefully future research funding can be identified to build on this foundation.

System Issues

A host of system properties contributes to diagnostic errors, and these need to be systematically identified and addressed, including issues related to the reliability of diagnostic testing processes, which are addressed in this issue of the Joint Commission Journal on Quality and Patient Safety. Health care organizations need to accept partial responsibility for diagnostic errors,50 and diagnostic accuracy should be a concern of regulatory groups and policy-guiding organizations. The Joint Commission has taken a positive step in this direction by adding two requirements to the 2005 National Patient Safety Goals that are directed at improving the timeliness of reporting abnormal laboratory results and ensuring that critical abnormalities are communicated to a licensed caregiver.51 These requirements address one of the more common system-related diagnostic errors: failure to appropriately communicate abnormal test results. Health care organizations would also be well advised to address another common system-related condition that predisposes to error by ensuring that specialty expertise is available when needed, at all times and on all days. Mandatory second opinions and computer-assisted diagnosis have the potential to reduce errors in pathology and radiology.52–56 Health care organizations need to evaluate the costs versus benefits of these novel interventions. Even for those cases entirely dependent on the skill of the clinician for accurate diagnosis, health care organizations could help minimize errors by using system-level interventions to aid the clinician. For example, second readings of key diagnostic tests improve diagnostic accuracy,57 and clinical decision support tools may be helpful in a variety of settings. Similarly, efforts to enhance feedback to clinicians regarding their diagnoses would be beneficial. Organizations also have responsibility for training their clinicians in both the fundamentals of their science and in the diagnostic pitfalls that contribute to errors.58

Provider Issues

Physicians need to improve their calibration by getting feedback on the diagnoses they make.
Cognitive feedback improves judgment and decisions made under conditions of uncertainty.59 For example, feedback regarding their diagnostic accuracy in reading standardized cases is thought to be a major factor explaining how radiologists in the United Kingdom have reduced their rate of diagnostic error.60 For the perceptual specialties (radiology, pathology, and dermatology), feedback can be provided through voluntary or mandated participation in quality assurance procedures. For primary care physicians and specialists, the pathway to enhanced feedback is not so clear-cut, and creative ideas are needed to identify how to obtain and provide this information.

The value of the autopsy as a feedback tool needs to be rediscovered.61 Beyond identifying what diagnoses were missed or wrong, the autopsy is the best tool we have to combat the overconfidence that seems to be ubiquitous in modern medicine. If providers and organizations are unable to increase the autopsy rate voluntarily, autopsies could be required as a condition of Medicare participation or for Joint Commission certification.62 A novel strategy just emerging is the use of postmortem magnetic resonance imaging examination as a supplement or alternative to autopsy.63 Grieving family members may view this noninvasive procedure as a more acceptable alternative to autopsy. Although they are less likely to feature autopsy findings than in Cabot's day, conferences dedicated to reconciling clinical and pathological findings (CPCs) continue to be warranted. Morbidity and mortality conferences, while ubiquitous today in virtually all medical schools, need to be better structured to distill and learn from diagnostic mistakes.64 An extension of this theme is the growing number of sections in leading medical journals featuring quality and safety discussions, and the creative, multimedia "Web M&M" conferences sponsored by the Agency for Healthcare Research and Quality, presenting monthly cases of medical error in internal medicine, surgery and anesthesiology, critical care, and emergency medicine.65 These new resources, like autopsies, can increase awareness of medical error. It remains to be seen, however, whether they are as effective a feedback tool, insofar as the autopsy focuses on one's own case while the literature and Web resources focus on the errors of others: ". . . the goal of autopsy is not to uncover clinician's mistakes or judge them, but rather to instruct clinicians in the sense of errando discimus (to be taught by one's own mistakes)."66(p. 37)

Finally, clinicians need to learn about overconfidence and the many other innate cognitive tendencies that detract from optimal reasoning and learning.6,22,37 Principles and pitfalls of clinical decision making need to be incorporated into the medical school curriculum,48 and education must also reach the legions of practicing clinicians who have had inadequate exposure to these concepts. A potential concern is that feedback, though likely to improve calibration, may not be well received. An interesting and relevant experiment is currently underway in Major League Baseball, exploring the possibility that umpires can improve the consistency of their strike calls by viewing videos of each pitch at the completion of a game.67 The umpires have filed suit to stop the experiment, viewing the process as an objectionable intrusion.
Similarly, some radiologists have resisted mandatory self-assessment meant to improve the accuracy of reading mammograms.68 This tension between autonomy and the need to improve calibration via feedback challenges efforts to increase diagnostic accuracy through this approach, but perhaps new incentives can be offered to make clinicians more interested in self-improvement.

Conclusion

A bright light is now being focused on patient safety, but diagnostic errors lie in the shadow. Clinicians and their health care organizations both need to take active steps to discover and analyze these errors. Diagnostic error rates will fall as remediable problems are identified, but the first step is to make these errors visible.

The author thanks Pat Croskerry and Sam Campbell for suggesting this topic; Nancy Franklin, Pat Croskerry, and Gordon Schiff for editorial suggestions; and Grace Garey and Kathy Kessel for help obtaining and organizing the references. This work was supported by a grant from the National Patient Safety Foundation.

Mark Graber, M.D., is Chief, Medical Service, VA Medical Center, Northport, New York, and Professor and Vice Chair, Department of Medicine, State University of New York at Stony Brook, New York. Please address requests for reprints to Mark Graber, M.D., [email protected].

References

1. Cabot R.C.: Diagnostic pitfalls identified during a study of three thousand autopsies. JAMA 59:2295–2298, Dec. 28, 1912.
2. Gore D.C., Gregory S.R.: Historical perspective on medical errors: Richard Cabot and the Institute of Medicine. J Am Coll Surg 197:609–611, Oct. 2003.
3. Croftan A.C.: The diagnostic doubts of Dr. Cabot. JAMA 60:145, 1913.
4. Graber M., et al.: Reducing diagnostic error in medicine: What's the goal? Acad Med 77:981–992, Oct. 2002.
5. Kuhn G.J.: Diagnostic errors. Acad Emerg Med 9:740–750, Jul. 2002.
6. Croskerry P.: The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med 78:775–780, Aug. 2003.
7. Foucar E., Foucar M.K.: Error in anatomic pathology. In: Foucar M.K. (ed.): Bone Marrow Pathology, 2nd ed. Chicago: ASCP Press, 2000.
8. Fitzgerald R.: Error in radiology. Clin Radiol 56:938–946, Dec. 2001.
9. O'Connor P.M., et al.: Unnecessary delays in accident and emergency departments: Do medical and surgical senior house officers need to vet admissions? J Accid Emerg Med 12:251–254, Dec. 1995.
10. Gorter S., et al.: Psoriatic arthritis: Performance of rheumatologists in daily practice. Ann Rheum Dis 61:219–224, Mar. 2002.
11. Christensen-Szalanski J.J.J., Bushyhead J.B.: Physicians' use of probabilistic information in a real clinical setting. J Exp Psychol Hum Percept Perform 7:928–935, Aug. 1981.
12. Margo C.E.: A pilot study in ophthalmology of inter-rater reliability in classifying diagnostic errors: An under-investigated area of medical error. Qual Saf Health Care 12:416–420, Dec. 2003.
13. Beam C.A., et al.: Variability in the interpretation of screening mammograms by US radiologists: Findings from a national sample. Arch Intern Med 156:209–213, Jan. 22, 1996.
14. Hoffman P.J., et al.: An analysis-of-variance model for the assessment of configural cue utilization in clinical judgment. Psychol Bull 69:338–349, May 1968.
15. Wu A.W., et al.: Do house officers learn from their mistakes? JAMA 265:2089–2094, Apr. 24, 1991.
16. Leape L., et al.: The nature of adverse events in hospitalized patients: Results of the Harvard Medical Practice Study II. N Engl J Med 324:377–384, Feb. 7, 1991.
17. Shojania K.G., et al.: Changes in rates of autopsy-detected diagnostic errors over time. JAMA 289:2849–2856, Jun. 4, 2003.
18. The Autopsy as an Outcome and Performance Measure. Evidence Report/Technology Assessment No. 58, AHRQ Publication No. 03-E002. Rockville, MD: Agency for Healthcare Research and Quality, 2003.
19. Berwick D.M.: Errors today and errors tomorrow. N Engl J Med 348:2570–2572, Jun. 19, 2003.
20. Kassirer J.P.: Diagnostic reasoning. Ann Intern Med 110:893–900, Jun. 1, 1989.
21. Elstein A.S.: Clinical reasoning in medicine. In: Higgs J.M. (ed.): Clinical Reasoning in the Health Professions. Oxford, U.K.: Butterworth-Heinemann, 1995, pp. 49–59.
22. Croskerry P.: The cognitive imperative: Thinking about how we think. Acad Emerg Med 7:1223–1231, Nov. 2000.
23. Croskerry P., Wears R.L.: Safety errors in emergency medicine. In: Markovchick V.J., Pons P.T. (eds.): Emergency Medicine Secrets. Philadelphia: Hanley & Belfus, 2003, pp. 29–37.
24. Weeks W.B., et al.: Tort claims analysis in the Veterans Health Administration for quality improvement. J Law Med Ethics 29:335–345, Fall–Winter 2001.
25. Bartlett E.E.: Physicians' cognitive errors and their liability consequences. J Healthc Risk Manag 18:62–69, Fall 1998.
26. Adams J.G., Bohan J.S.: System contributions to error. Acad Emerg Med 7:1189–1193, Nov. 2000.
27. Graber M.L., et al.: Diagnostic error in internal medicine. Poster presented at the National Patient Safety Foundation Annual Meeting, Boston, May 3–7, 2004.
28. Joint Commission on Accreditation of Healthcare Organizations: Reporting of Medical/Health Care Errors. A position statement of the Joint Commission on Accreditation of Healthcare Organizations. Nov. 8, 2003.
29. National Quality Forum: Serious Reportable Events in Healthcare: A Consensus Report. National Quality Forum Publication No. CR01-02. Washington, D.C.: National Quality Forum, 2002.
30. The Leapfrog Group for Patient Safety. http://www.leapfroggroup.org (last accessed Dec. 8, 2004).
31. Shojania K.G., et al.: Making Health Care Safer: A Critical Analysis of Patient Safety Practices. Evidence Report/Technology Assessment No. 43 from the Agency for Healthcare Research and Quality, AHRQ Publication No. 01-E058. Rockville, MD, 2001. http://www.ahrq.gov/clinic/ptsafety/ (last accessed Dec. 8, 2004).
32. Anderson R.E.: Billions for defense: The pervasive nature of defensive medicine. Arch Intern Med 159:2399–2402, Nov. 8, 1999.
33. Kessler D.P., McClellan M.B.: Do doctors practice defensive medicine? Q J Econ 111:353–390, 1996.
34. Kessler D.P., McClellan M.B.: How liability law affects medical productivity. J Health Econ 21:931–955, Nov. 2002.
35. Blendon R.J., et al.: Views of practicing physicians and the public on medical errors. N Engl J Med 347:1933–1940, Dec. 12, 2002.
36. Croskerry P.: The feedback sanction. Acad Emerg Med 7:1232–1238, Nov. 2000.
37. Parker D., Lawton R.: Psychological contribution to the understanding of adverse events in health care. Qual Saf Health Care 12:453–457, Dec. 2003.
38. Sporer L.E., et al.: Choosing, confidence, and accuracy: A meta-analysis of the confidence-accuracy relation in eyewitness identification studies. Psychol Bull 118:315–327, Nov. 1995.
39. Reason J.T., et al.: Errors and violations on the roads: A real distinction? Ergonomics 33:1315–1332, 1990.
40. Russo J.E., Schoemaker P.J.H.: Decision Traps: Ten Barriers to Brilliant Decision-Making and How to Overcome Them. New York: Doubleday/Currency, 1989.
41. Kruger J., Dunning D.: Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. J Pers Soc Psychol 77:1121–1134, Dec. 1999.
42. Hodges B., et al.: Difficulties in recognizing one's own incompetence: Novice physicians who are unskilled and unaware of it. Acad Med 76(10 Suppl):S87–S89, Oct. 2001.
43. Hall J.C., et al.: Surgeons and cognitive processes. Br J Surg 90:10–16, Jan. 2003.
44. Gawande A.: Final cut: Medical arrogance and the decline of the autopsy. The New Yorker, 2001, pp. 94–99.
45. Shojania K.G.: Diagnostic surprises and the persistent value of the autopsy. AHRQ Web M&M 2004. http://webmm.ahrq.gov/cases.aspx?ic=54 (last accessed Dec. 8, 2004).
46. Podbregar M., et al.: Should we confirm our clinical diagnostic certainty by autopsies? Intensive Care Med 27:1750–1755, Nov. 2001.
47. Croskerry P.: Achieving quality in clinical decision making: Cognitive strategies and detection of bias. Acad Emerg Med 9:1184–1204, Nov. 2002.
48. Cosby K.S., Croskerry P.: Patient safety: A curriculum for teaching patient safety in emergency medicine. Acad Emerg Med 10:69–78, Jan. 2003.
49. Schiff G., et al.: Diagnostic Errors: Lessons from a Multi-Institutional Collaborative Project for the Diagnostic Error Evaluation and Research Project Investigators. Rockville, MD: Agency for Healthcare Research and Quality, 2004.
50. Nolan T.W.: System changes to improve patient safety. BMJ 320:771–773, Mar. 18, 2000.
51. Joint Commission on Accreditation of Healthcare Organizations: 2005 Laboratory Service National Patient Safety Goals. http://www.jcaho.org/accredited+organizations/patient+safety/05+npsg/05_npsg_lab.htm (last accessed Dec. 8, 2004).
52. Hunt D.L., et al.: Effects of computer-based clinical decision support systems on physician performance and patient outcomes: A systematic review. JAMA 280:1339–1346, Oct. 21, 1998.
53. Trowbridge R., Weingarten S.: Clinical decision support systems. In: Shojania K.G., et al. (eds.): Making Health Care Safer: A Critical Analysis of Patient Safety Practices. Evidence Report/Technology Assessment No. 43 from the Agency for Healthcare Research and Quality, AHRQ Publication No. 01-E058. Rockville, MD, 2001. http://www.ahrq.gov/clinic/ptsafety/ (last accessed Dec. 8, 2004).
54. Espinosa J.A., Nolan T.W.: Reducing errors made by emergency physicians in interpreting radiographs: Longitudinal study. BMJ 320:737–740, Mar. 18, 2000.
55. Kripalani S., et al.: Reducing errors in the interpretation of plain radiographs and computed tomography scans. In: Shojania K.G., et al. (eds.): Making Health Care Safer: A Critical Analysis of Patient Safety Practices. Evidence Report/Technology Assessment No. 43 from the Agency for Healthcare Research and Quality, AHRQ Publication No. 01-E058. Rockville, MD, 2001. http://www.ahrq.gov/clinic/ptsafety/ (last accessed Dec. 8, 2004).
56. Westra W.H., et al.: The impact of second opinion surgical pathology on the practice of head and neck surgery: A decade experience at a large referral hospital. Head and Neck 24:684–693, Jul. 2002.
57. Ruchlin H.S., et al.: The efficiency of second-opinion consultation programs: A cost-benefit perspective. Med Care 20:3–19, Jan. 1982.
58. Croskerry P.: Cognitive forcing strategies in clinical decision making. Ann Emerg Med 41:110–120, Jan. 2003.
59. Balzer W.K., et al.: Effects of cognitive feedback on performance. Psychol Bull 106:410–433, 1989.
60. Smith-Bindman R., et al.: Comparison of screening mammography in the United States and the United Kingdom. JAMA 290:2129–2137, Oct. 22, 2003.
61. Hill R.B., Anderson R.E.: Autopsy: Medical Practice and Public Policy. Oxford, U.K.: Butterworth-Heinemann, 1988.
62. Lundberg G.D.: Low-tech autopsies in the era of high-tech medicine: Continued value for quality assurance and patient safety. JAMA 280:1273–1274, Oct. 14, 1998.
63. Patriquin L., et al.: Postmortem whole-body magnetic resonance imaging as an adjunct to autopsy: Preliminary clinical experience. J Magn Reson Imaging 13:277–287, Feb. 2001. Erratum in: J Magn Reson Imaging 13:818, May 2001.
64. Pierluissi E., et al.: Discussion of medical errors in morbidity and mortality conferences. JAMA 290:2838–2842, Dec. 3, 2003.
65. Agency for Healthcare Research and Quality: Welcome to AHRQ Web M&M. http://webmm.ahrq.gov/default.aspx (last accessed Dec. 8, 2004).
66. Kirch W., Schafii C.: Misdiagnosis at a university hospital in 4 medical eras. Medicine (Baltimore) 75:29–40, Jan. 1996.
67. Silver N., Wooner K.: QuesTec not yet showing consistency from umpires. ESPN Baseball, Jun. 5, 2003.
68. Maguire P.: Is an access crisis on the horizon in mammography? ACP Observer 23:1, Oct. 2003. http://www.acponline.org/journals/news/oct03/mammo.htm.

--------------------------------------------------------------------------------------------------------

DO PHYSICIANS KNOW WHEN THEIR DIAGNOSES ARE CORRECT? Charles P. Friedman, PhD et al. Do Physicians Know When Their Diagnoses Are Correct? Implications for Decision Support and Error Reduction. J Gen Intern Med 2005; 20:334-9.

ORIGINAL ARTICLE

Do Physicians Know When Their Diagnoses Are Correct? Implications for Decision Support and Error Reduction

Charles P. Friedman, PhD,1 Guido G. Gatti, MS,1 Timothy M. Franz, PhD,2 Gwendolyn C. Murphy, PhD,3 Fredric M. Wolf, PhD,4 Paul S. Heckerling, MD,5 Paul L. Fine, MD,7 Thomas M. Miller, MD,8 Arthur S. Elstein, PhD6

1Center for Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA; 2Department of Psychology, St. John Fisher College, Rochester, NY, USA; 3Division of Community Health, Duke University, Durham, NC, USA; 4Department of Medical Education and Informatics, University of Washington, Seattle, WA, USA; Departments of 5Medicine and 6Medical Education, University of Illinois at Chicago, Chicago, IL, USA; 7Department of Medicine, University of Michigan, Ann Arbor, MI, USA; 8Department of Medicine, University of North Carolina, Chapel Hill, NC, USA.

OBJECTIVE: This study explores the alignment between physicians' confidence in their diagnoses and the "correctness" of these diagnoses, as a function of clinical experience, and whether subjects were prone to over- or underconfidence.

DESIGN: Prospective, counterbalanced experimental design.

SETTING: Laboratory study conducted under controlled conditions at three academic medical centers.

PARTICIPANTS: Seventy-two senior medical students, 72 senior medical residents, and 72 faculty internists.

INTERVENTION: We created highly detailed, 2- to 4-page synopses of 36 diagnostically challenging medical cases, each with a definitive correct diagnosis. Subjects generated a differential diagnosis for each of 9 assigned cases, and indicated their level of confidence in each diagnosis.

MEASUREMENTS AND MAIN RESULTS: A differential was considered "correct" if the clinically true diagnosis was listed in that subject's hypothesis list.
To assess confidence, subjects rated the likelihood that they would, at the time they generated the differential, seek assistance in reaching a diagnosis. Subjects' confidence and correctness were "mildly" aligned (κ = .314 for all subjects, .285 for faculty, .227 for residents, and .349 for students). Residents were overconfident in 41% of cases where their confidence and correctness were not aligned, whereas faculty were overconfident in 36% of such cases and students in 25%.

CONCLUSIONS: Even experienced clinicians may be unaware of the correctness of their diagnoses at the time they make them. Medical decision support systems, and other interventions designed to reduce medical errors, cannot rely exclusively on clinicians' perceptions of their needs for such support.

KEY WORDS: diagnostic reasoning; clinical decision support; medical errors; clinical judgment; confidence. DOI: 10.1111/j.1525-1497.2005.30145.x. J Gen Intern Med 2005;20:334–9.

Accepted for publication October 28, 2004. The authors have no conflicts of interest to report. This report is based on a paper presented at the Medinfo 2001 conference in London, September, 2001. Address correspondence and requests for reprints to Dr. Friedman: Center for Biomedical Informatics, University of Pittsburgh, 8084 Forbes Tower, Pittsburgh, PA 15213 (e-mail: [email protected]). Address for 2003–2005: Dr. Friedman, Division of Extramural Programs, National Library of Medicine, Suite 301, 6705 Rockledge Drive, Bethesda, MD 20892 (e-mail: [email protected]).

When making a diagnosis, clinicians combine what they personally know and remember with what they can access or look up. While many decisions will be made based on a clinician's own personal knowledge, others will be informed by knowledge that derives from a range of external sources including printed books and journals, communications with professional colleagues, and, increasingly, a range of computer-based knowledge resources.1 In general, the more routine or familiar the problem, the more likely it is that an experienced clinician can "solve it" and decide what to do based on personal knowledge only. This method of decision making uses a minimum of time, which is a scarce and precious resource in health care practice. Every practitioner's personal knowledge is, however, incomplete in various ways, and decisions based on incorrect, partial, or outdated personal knowledge can result in errors. A recent landmark study2 has documented that medical errors are a significant cause of morbidity and mortality in the United States. Although these errors have a wide range of origins,3 many are caused by a lack of information or knowledge necessary to appropriately diagnose and treat.4 The exponential growth of biomedical knowledge and the shortening half-life of any single item of knowledge both suggest that modern medicine will increasingly depend on external knowledge to support practice and reduce errors.5 Still, the advent of modern information technology has not changed the fundamental nature of human problem solving. Diagnostic and therapeutic decisions, for the foreseeable future, will be made by human clinicians, not machines.
What has changed in recent years is the potential for computer-based decision support systems (DSSs) to provide relevant and patient-specific external knowledge at the point of care, assembling this knowledge in a way that complements and enhances what the clinician decision maker already knows.6,7 DSSs can function in many ways, ranging from the generation of alerts and reminders to the critiquing of management plans.8–14 Some DSSs "push" information and advice to clinicians whether they request it or not; others offer no advice until it is specifically requested. The decision support process presupposes the clinician's openness to the knowledge or advice being offered. Clinicians who believe they are correct, or believe they know all they need to know to reach a decision, will be unmotivated to seek additional knowledge and unreceptive to any knowledge or suggestions a DSS presents to them. The literatures of psychology and medical decision making15–18 address the relationship between these subjective beliefs and objective reality. The well-established psychological bias of "anchoring"15 stipulates that all human decision makers are more loyal to their current ideas, and resistant to changing them, than they objectively should be in light of compelling external evidence.

This study addresses a question central to the potential utility and success of clinical decision support. If clinicians' openness to external advice hinges on their confidence in their assessments based on personal knowledge, how valid are these perceptions? Conceptually, there are 4 possible combinations of objective "correctness" of a diagnosis and subjective confidence in it: 2 in which confidence and correctness are aligned and 2 in which they are not. The ideal condition is an alignment of high confidence in a correct diagnosis. Confidence and correctness can also be aligned in the opposing sense: low confidence in a diagnosis that is incorrect. In this state, clinicians are likely to be open to advice and disposed to consult an external knowledge resource. In the "underconfident" state of nonalignment, a clinician with low confidence in a correct diagnosis will be motivated to seek information that will likely confirm an intent to act correctly. However, it is also possible that a consultation with an external resource can talk a clinician out of a correct assessment.19 The other nonaligned state, of greater concern for quality of care, is high confidence in an incorrect diagnosis. In this "overconfident" state, clinicians may not be open or motivated to seek information that could point to a correct assessment.

This work addresses the following specific questions:

1. In internal medicine, what is the relationship between clinicians' confidence in their diagnoses and the correctness of these diagnoses?
2. Does the relationship between confidence and correctness depend on clinicians' levels of experience, ranging from medical student to attending physician?
3. To the extent that confidence and correctness are mismatched, do clinicians tend toward overconfidence or underconfidence, and does this tendency depend on level of clinical experience?

One study similar to this one in design and intent,20 but limited to medical students as subjects, found that students were frequently unconfident about correct diagnostic judgments when classifying abnormal heart rhythms.
Our preliminary study of this question found the relationship between correctness and confidence, across a range of training levels, to be modest at best.21

METHODS

Experimental Design and Dataset

To address these questions, we employed a large dataset originally collected for a study of the impact of decision support systems on the accuracy of clinician diagnoses.19 We developed for this study detailed written synopses of 36 diagnostically challenging cases from patient records at the University of Illinois at Chicago, the University of Michigan, and the University of North Carolina. Each institution contributed 12 cases, each with a firmly established final diagnosis. The 2- to 4-page case synopses were created by three coauthors who are experienced academic internists (PSH, PLF, TMM). The synopses contained comprehensive historical, examination, and diagnostic test information. They did not, however, contain results of definitive tests that would have made the correct diagnosis obvious to most or all clinicians. The cases were divided into 4 approximately equivalent sets balanced by institution, pathophysiology, organ systems, and rated difficulty. Each set, with all patient- and institution-identifying information removed, therefore contained 9 cases, with 3 from each institution. We then recruited to the study 216 volunteer subjects from these same institutions: 72 fourth-year medical students, 72 second- and third-year internal medicine residents, and 72 general internists with faculty appointments and at least 2 years of postresidency experience (mean, 11 years). Recruitment was balanced so that each institution contributed 24 subjects at each experience level. Each subject was randomly assigned to work the 9 cases comprising 1 of the 4 case sets. Each subject then worked through each of the assigned cases first without, and then with, assistance from an assigned computer-based decision support system. On the first pass through each case, subjects generated a diagnostic hypothesis set with up to 6 items. After generating their diagnostic hypotheses, subjects indicated their perceived confidence in their diagnosis in a manner described below. On the second pass through the case, subjects employed a decision support system to generate diagnostic advice, and again offered a differential diagnosis and confidence ratings. After deleting cases with missing data, the final dataset for this work consisted of 1,911 cases completed by 215 subjects. Results reported elsewhere19 indicated that the computer-based decision support systems engendered modest but statistically significant improvements in the accuracy of diagnostic hypotheses (overall effect size of .32). The questions addressed by this study, emphasizing the concordance between confidence and correctness under conditions of uncertainty, focus on the first pass through each case, where the subjects applied only their personal knowledge to the diagnostic task.

Measures

To assess the correctness of each clinician's diagnostic hypothesis set for each case, we employed a binary score (correct or incorrect). We scored a case as correct if the established diagnosis for that case, or a very closely related disease, appeared anywhere in the subject's hypothesis set. Final scoring decisions, to determine whether a closely related disease should be counted as correct, were made by a panel comprising coauthors PLF, PSH, and TMM.
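This binary scoring rule is easy to make concrete. The sketch below is an illustration rather than the authors' code: the function name, the lowercase normalization, and the disease labels are all assumptions, and the panel's judgment of which closely related diseases count as correct is represented simply as a precomputed list of accepted labels.

```python
def case_correct(hypotheses, accepted_diagnoses):
    """Return 1 if any offered hypothesis matches the established
    diagnosis or a panel-approved closely related disease, else 0.

    hypotheses: the subject's diagnostic hypothesis set (up to 6 items)
    accepted_diagnoses: labels the expert panel counts as correct
    """
    accepted = {d.strip().lower() for d in accepted_diagnoses}
    return int(any(h.strip().lower() in accepted for h in hypotheses))

# Hypothetical example: the panel accepts the established diagnosis
# plus one closely related disease as correct for this case.
print(case_correct(
    ["heart failure", "restrictive cardiomyopathy"],
    ["constrictive pericarditis", "restrictive cardiomyopathy"]))  # -> 1
```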
The measure of clinician confidence was the response to the specific question: "How likely is it that you would seek assistance in establishing a diagnosis for this case?" "Assistance" was not limited to that which might be provided by a computer. After generating their diagnostic hypotheses for each case, subjects responded to this question using an ordinal 1 to 4 response scale with anchor points of 1 representing "unlikely" (indicative of high confidence in their diagnosis) and 4 representing "likely" (indicative of low confidence). Because subjects did not receive feedback, they offered their confidence judgments for each case without any definitive knowledge of whether their diagnoses were, in fact, correct. Because they reflect only the subjects' first pass through each case, these confidence judgments were not confounded by any advice subjects might later have received from the decision support systems.

Analysis

In this study, each data point pairs a subjective confidence assessment on a 4-level ordinal scale with a binary objective correctness score. The structure of this experiment and the resulting data suggested two approaches to analyzing the results. Given that each subject in this study worked 9 cases, and offered confidence ratings on a 1 to 4 scale for each case, interpretations of the meanings of these scale points might be highly consistent for each subject but highly variable across subjects. Our first analytic approach therefore sought to identify an optimal threshold for each subject to distinguish subjective states of "confident" and "unconfident." This approach addresses the "pooling" problem, identified by Swets and Pickett,22 that would tend to underestimate the magnitude of the relationship between confidence and correctness. Our second analytical approach made the assumption that all subjects gave the same subjective interpretation to the confidence scale. This second approach entails a direct analysis of the 2-level by 4-level data with no within-subject thresholding. Qualitatively, the first approach approximates the upper bound on the relationship between confidence and correctness, while the second approach approximates the lower bound.

To implement the first approach, we identified, for each subject, the threshold value along the 1 to 4 scale that maximized the proportion of cases where confidence and correctness were aligned. With reference to Table 1, we sought to find the threshold value that maximized the numbers of cases in the on-diagonal cells. For 58 subjects (27%), we found that maximum alignment was achieved by classifying only ratings of 1 as confident and all other ratings as unconfident; for 105 subjects (49%), maximum alignment was achieved by classifying ratings of 1 or 2 as confident; and for the remaining 52 subjects (24%), maximum alignment was achieved by classifying ratings of 1, 2, or 3 as confident. This finding validated our assumption that subjects varied in their interpretations of the scale points. We then created a dataset for further analysis that consisted, for each case worked by each subject, of a binary correctness score and a binary confidence score calculated using each subject's optimal threshold. To address the first research question with the first approach, we computed Kendall's τb and κ coefficients to characterize the relationship between subjects' correctness and confidence levels.
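The per-subject threshold search just described can be sketched in a few lines. This is an illustration of the idea, not the authors' implementation; the function name and the example values are assumptions.

```python
def optimal_threshold(ratings, correct):
    """Find the cut point on the 1-4 confidence scale that maximizes
    alignment for one subject: a case is aligned when the subject was
    confident and correct, or unconfident and incorrect.

    ratings: confidence ratings (1 = high confidence ... 4 = low)
    correct: booleans, True if the case was diagnosed correctly
    """
    best_t, best_aligned = None, -1
    for t in (1, 2, 3):  # classify ratings <= t as "confident"
        aligned = sum((r <= t) == c for r, c in zip(ratings, correct))
        if aligned > best_aligned:
            best_t, best_aligned = t, aligned
    return best_t

# Hypothetical subject who tends to rate 1-2 when correct:
ratings = [1, 2, 2, 3, 4, 4, 3, 1, 2]
correct = [True, True, False, False, False, False, False, True, True]
print(optimal_threshold(ratings, correct))  # -> 2
```

With ratings of 1 or 2 treated as confident, this example subject's confidence agrees with correctness in 8 of 9 cases, so the search returns a threshold of 2.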
We then modeled statistically the proportions of cases correctly diagnosed, as a function of confidence (computed as a binary variable as described above), subjects' level of training (faculty, resident, student), and the interaction of confidence and training level. To address the second question, we modeled the proportions of cases in which confidence and correctness were aligned, as a function of training level. To address the third research question, we focused only on those cases in which confidence and correctness were not aligned. We modeled the proportions of cases in which subjects were overconfident (high confidence in an incorrect diagnosis) as a function of training level. All statistical models used the Generalized Linear Model (GzLM) procedure,23 assuming diagnostic correctness, alignment, and overconfidence to be distributed as Bernoulli variables with a logit link, and used naive empirical covariance estimates24 for the model effects to account for the clustering of cases within subjects. Wald statistics were employed to test the observed results against the null condition. Ninety-five percent confidence intervals were calculated by transforming logit-scale Wald intervals, computed with naive empirical standard error estimates, into percent-scale intervals.25 The SPSS for Windows (SPSS Inc., Chicago, IL) and SAS26 Proc GENMOD (SAS Institute Inc., Cary, NC) software were employed for statistical modeling and data analyses.

Our second approach offers a contrasting strategy to address the first and second research questions. To this end, we computed nonparametric correlation coefficients (Kendall's τb) between the 2-level variable of correctness and the 4 levels of confidence from the original data, without thresholding. We computed separate τb coefficients for subjects at each experience level, and for the sample as a whole. Correlations were computed with case as the unit of analysis after exploratory analyses correcting for the nesting of cases within subjects led to negligible changes in the results.

The power of the inferential statistics employed in this analysis was based on the two-tailed t test, as the tests we performed are analogous to testing differences in means on a logit scale. Because our tests are based on a priori unknown marginal cell counts, we halved the sample size to estimate power. For the analyses addressing research question 1, which use all cases, power is greater than .96 to detect a small to moderate effect of .3 standard deviations at an α level of .05. For analyses addressing research questions 2 and 3, which are based on subsets of cases, the analogous statistical power estimate is greater than .81.
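For readers who want to reproduce this style of analysis, the sketch below shows roughly comparable computations with common Python libraries; the authors used SPSS and SAS Proc GENMOD, so this is an approximation, not their code. The data frame, its column names, and the input file are illustrative assumptions. scipy's kendalltau computes the τb variant by default (and handles either binary or 4-level confidence codings), and the GEE fit stands in for the generalized linear model with naive empirical (robust) covariance estimates.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# df: one row per worked case, with assumed columns
#   subject   - subject identifier (cases are clustered within subjects)
#   level     - "student", "resident", or "faculty"
#   correct   - 1 if the true diagnosis appeared in the hypothesis set
#   confident - 1 if the rating fell at or below the subject's threshold
df = pd.read_csv("cases.csv")  # hypothetical input file

tau_b, p_value = kendalltau(df["confident"], df["correct"])
kappa = cohen_kappa_score(df["confident"], df["correct"])

# Logit model of correctness as a function of confidence, training
# level, and their interaction, fit with generalized estimating
# equations so the covariance estimates account for the clustering
# of cases within subjects.
model = smf.gee("correct ~ confident * level", groups="subject",
                data=df, family=sm.families.Binomial())
result = model.fit()
print(result.summary())  # Wald tests for each effect
```

The `confident * level` formula term expands to main effects for confidence and training level plus their interaction, matching the model structure described above.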
RESULTS

Results with Threshold Correction

Table 1 displays the crosstabulation of correctness of diagnosis and binary levels of confidence (with 95% confidence intervals) for all subjects, and separately for each clinical experience level, using each subject's optimal threshold to dichotomize the confidence scale. The difficulty of these cases is evident from Table 1, as 760 of 1,911 (40%) were correctly diagnosed by the full set of subjects. Diagnostic accuracy increased monotonically with subjects' clinical experience. The difficulty of the cases is also reflected in the distribution of the confidence ratings, with subjects classified as confident for 583 (31%) of 1,911 cases, after adjustment for varying interpretations of the scale. These confidence levels revealed the same general monotonic relationship with clinical experience. Across the entire sample of subjects, confidence and correctness were aligned for 1,308 of 1,911 cases (68%), corresponding to Kendall's τb = .321 (P<.0001) and a κ value of .314. Alignment was seen in 64% of cases for faculty (τb = .291 [P<.0001]; κ = .285), 63% for residents (τb = .230 [P<.0001]; κ = .227), and 78% for students (τb = .369 [P<.0001]; κ = .349).

Table 1. Crosstabulation of Correctness and Confidence for Each Clinical Experience Level and for All Subjects, with Optimal Thresholding for Each Subject. Cells contain counts of cases, with 95% confidence intervals in parentheses.

Experience level  Confidence  Correct                Incorrect                Total
Students          High        63 (55 to 71)          35 (27 to 43)            98 (72 to 132)
Students          Low         105 (88 to 125)        442 (422 to 459)         547 (513 to 573)
Students          Total       168 (146 to 192)       477 (453 to 499)         645
Residents         High        140 (129 to 150)       98 (88 to 109)           238 (193 to 287)
Residents         Low         141 (124 to 159)       259 (241 to 276)         400 (351 to 445)
Residents         Total       281 (256 to 306)       357 (332 to 382)         638
Faculty           High        167 (155 to 178)       80 (69 to 92)            247 (205 to 300)
Faculty           Low         144 (128 to 160)       237 (221 to 253)         381 (338 to 433)
Faculty           Total       311 (293 to 339)       317 (299 to 346)         628
All subjects      High        370 (352 to 388)       213 (195 to 231)         583 (527 to 667)
All subjects      Low         390 (357 to 425)       938 (903 to 971)         1,328 (1,246 to 1,405)
All subjects      Total       760 (713 to 808)       1,151 (1,103 to 1,198)   1,911

Figure 1 offers a graphical portrayal, for each experience level, of the proportions of correct diagnoses as a function of confidence, with 95% confidence intervals. The relationship between correctness and confidence, at each level, is seen in the differences between these proportions. Wald statistics generated by the statistical model reveal a significant alignment between diagnostic correctness and confidence across all subjects (χ2 = 199.64, df = 1, P<.0001). Significant relationships are also seen between correctness and training level (χ2 = 20.40, df = 2, P<.0001) and in the interaction between confidence and training level (χ2 = 17.00, df = 2, P<.0002). Alignment levels for faculty and residents differ from those of the students (P<.05); and from inspection of Figure 1 it is evident that students' alignment levels are higher than those of faculty or residents.

[Figure 1 (plot not reproduced in this transcription). Proportions of cases correctly diagnosed at each confidence level, with thresholding to achieve maximum calibration of each subject; brackets indicate 95% confidence intervals for each proportion. Y-axis: proportion of correct diagnoses (0.0 to 1.0); x-axis: low vs. high diagnostic confidence, plotted separately for faculty, residents, and students.]

With reference to the third research question, Table 2 summarizes the case frequencies for which clinicians at each level were correctly confident—where confidence was aligned with correctness—as well as frequencies for the "nonaligned" cases where they were overconfident and underconfident. Students were overconfident in 25% of nonaligned cases, corresponding to 5% of cases they completed. Residents were overconfident in 41% of nonaligned cases, and 15% of cases overall. Faculty physicians were overconfident in 36% of nonaligned cases, and 13% of cases overall. All subjects were more likely to be underconfident than overconfident (χ2 = 29.05, P<.0001). Students were found to be more underconfident than residents (Wald statistic: χ2 = 6.19, df = 2, P<.05). All other differences between subjects' experience levels were not significant.
Results Without Threshold Correction

The second approach to analysis yielded Kendall τb measures of association between the binary measure of correctness and the 4-level measure of confidence, computed directly from the study data, without any corrections. For all subjects and cases, we observed τb = .106 (N = 1,911 cases; P<.0001). Separately for each level of training, the Kendall coefficients are: faculty τb = .103 (n = 628; P<.005), residents τb = .041 (n = 638; NS), and students τb = .121 (n = 645 cases; P<.001). The polarity of the relationship is as would be expected, associating correctness of diagnosis with higher confidence levels. The τb values reported here can be compared with their counterparts, reported above, for the analyses that included threshold correction.

Table 2. Breakdown of Under- and Overconfidence for Cases in Which Clinician Confidence and Correctness Are Nonaligned. Parentheses contain 95% confidence intervals.

Experience level  Proportion of cases nonaligned  % of nonaligned cases underconfident  % of nonaligned cases overconfident
Students          140/645                         75.0 (65.2 to 82.8)                   25.0 (17.2 to 34.8)
Residents         239/638                         59.0 (50.7 to 66.8)                   41.0 (33.2 to 49.3)
Faculty           224/628                         64.3 (55.4 to 72.3)                   35.7 (27.7 to 44.6)
All subjects      603/1911                        64.7 (59.5 to 69.5)                   35.3 (30.5 to 40.5)

DISCUSSION

The assumption built into the first analytic strategy, that subjects make internally consistent but personally idiosyncratic interpretations of confidence, generates what may be termed upper-bound estimates of alignment between confidence and correctness. Under the assumptions embedded in this analysis, the results of this study indicate that the correctness of clinicians' diagnoses and their perceptions of the correctness of these diagnoses are, at most, moderately aligned. The correctness and confidence of faculty physicians and senior medical residents were aligned about two thirds of the time—and in cases where correctness and confidence were not aligned, these subjects were more likely to be underconfident than overconfident. While faculty subjects demonstrated tendencies toward greater alignment and less frequent overconfidence than residents, these differences were not statistically significant. Students' results were substantially different from those of their more experienced colleagues, as their confidence and correctness were aligned about four fifths of the time and more highly skewed, when nonaligned, toward underconfidence. The alignment between "being correct" and "being confident"—within groups and for all subjects—would be qualitatively characterized as "fair," as seen by κ coefficients in the range .2 to .4.27 The more conservative second mode of analysis yielded smaller relationships between correctness and confidence, as seen in the τb coefficient for all subjects, which is smaller by a factor of three. For the residents, the relationship between correctness and confidence does not exceed chance expectations when computed without thresholding. Comparison across experience levels reveals the same trend seen in the primary analysis, with students displaying the highest level of alignment. The greater apparent alignment for the students, under both analytic approaches, may be explained by the difficulty of the cases.
The students were probably overmatched by many of these cases, perhaps guessing at diagnoses, and were almost certainly aware that they were overmatched. This is seen in the low proportions of correct diagnoses for students and the low levels of expressed confidence. These skewed distributions would generate alignment between correctness and confidence of 67% by chance alone. While students' alignments exceeded these chance expectations, a better estimate of their concordance between confidence and correctness might be obtained by challenging the students with less difficult cases, making the diagnostic task as difficult for them as it was for the faculty and residents with the cases employed in this study. We do not believe it is valid to conclude from these results that the students are "more aware" than experienced clinicians of when they are right and wrong.
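The 67% chance-alignment figure quoted above follows from the students' marginal proportions in Table 1 (168/645 cases correct, 98/645 rated confident): under independence, the expected alignment is P(correct)·P(confident) + P(incorrect)·P(unconfident). A quick check of the arithmetic:

```python
# Chance alignment for students, from the Table 1 marginals: if
# confidence were independent of correctness, agreement would occur
# whenever both are "yes" or both are "no".
p_correct = 168 / 645      # proportion of student cases diagnosed correctly
p_confident = 98 / 645     # proportion of student cases rated confident
chance_alignment = (p_correct * p_confident
                    + (1 - p_correct) * (1 - p_confident))
print(round(chance_alignment, 2))  # -> 0.67
```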
By contrast, residents and faculty correctly diagnosed 44% and 50% of these difficult cases, respectively, and generated distributions of confidence ratings that were less skewed than those of the students. In cases for which these clinicians' correctness and confidence were not aligned, both faculty and residents showed an overall tendency toward underconfidence in their diagnoses. Despite the general tendency toward underconfidence, residents and faculty in this study were overconfident, placing credence in a diagnosis that was in fact incorrect, in 15% (98/638) and 12% (80/628) of cases, respectively. Because these two more experienced groups are directly responsible for patient care, and offered much more accurate diagnoses for these difficult cases, findings for these groups take on a different interpretation and perhaps greater potential significance.

In designing the study, we approached the measurement of "confidence" by grounding it in hypothetical clinical behavior. Rather than asking subjects directly to estimate their confidence levels in either probabilistic or qualitative terms, we asked them for the likelihood of their seeking help in reaching a diagnosis for each case. We considered this measure to be a proxy for "confidence." Because our intent was to inform the design of decision support systems and medical error reduction efforts generally, we believe that this behavioral approach to assessment of confidence lends validity to our conclusions.

Limitations of this study include restriction of the task to diagnosis. Differences in results may be seen in clinical tasks other than diagnosis, such as determination of appropriate therapy for a problem already diagnosed. The cases, chosen to be very difficult and with definitive findings excluded, certainly generated lower rates of accurate diagnoses than are typically seen in routine clinical practice. Had the cases in this study been more routine, the measured levels of alignment between confidence and correctness might have differed. In addition, this study was conducted in a laboratory setting, using written case synopses, to provide experimental precision and control. While the case synopses contained very large amounts of clinical information, the task environment for these subjects was not the task environment of routine patient care. Clinicians might have been more, or less, confident in their assessments had the cases used in the study been real patients for whom these clinicians were responsible; and in actual practice, physicians may be more likely to consult on difficult cases regardless of their confidence level.

While we employed volunteer subjects in this study, the sample sizes at each institution for the resident and faculty groups were large relative to the sizes at each institution of their respective populations, and thus unlikely to be skewed by sampling bias. The relationships, of "fair" magnitude, between correctness and confidence were seen only after adjusting each subject's confidence ratings to reflect differing interpretations of the confidence scale. The secondary analytic approach, which does not correct individuals' judgments against their own optimal thresholds, results in observed relationships between correctness and confidence that are smaller. Under either set of assumptions, the relationship between confidence and correctness is such that designers of clinical decision support systems cannot assume clinicians to be accurate in their own assessments of when they do and do not require assistance from external knowledge resources.

This work was supported by grant R01-LM-05630 from the National Library of Medicine.

REFERENCES

1. Hersh WR. "A world of knowledge at your fingertips": the promise, reality, and future directions of online information retrieval. Acad Med. 1999;74:240–3.
2. Kohn LT, Corrigan JM, Donaldson MS, eds. To Err Is Human: Building a Safer Health System. Washington, DC: National Academy Press; 2000.
3. Bates DW, Gawande AA. Error in medicine: what have we learned? Ann Intern Med. 2000;132:763–7.
4. Leape LL, Bates DW, Cullen DJ, et al. Systems analysis of adverse drug events. ADE Prevention Study Group. JAMA. 1995;274:35–43.
5. Wyatt JC. Clinical data systems, part 3: development and evaluation. Lancet. 1994;344:1682–7.
6. Norman DA. Melding mind and machine. Technol Rev. 1997;100:29–31.
7. Chueh H, Barnett GO. "Just in time" clinical information. Acad Med. 1997;72:512–7.
8. Hunt DL, Haynes RB, Hanna SE, Smith K. Effects of computer-based clinical decision support systems on physician performance and patient outcomes. JAMA. 1998;280:1339–46.
9. Miller RA. Medical diagnostic decision support systems—past, present, and future. J Am Med Inform Assoc. 1994;1:8–27.
10. Evans RS, Pestotnik SL, Classen DC, et al. A computer-assisted management program for antibiotics and other antiinfective agents. N Engl J Med. 1998;338:232–8.
11. McDonald CJ, Overhage JM, Tierney WM, et al. The Regenstrief Medical Record System: a quarter century experience. Int J Med Inform. 1999;54:225–53.
12. Wagner MM, Pankaskie M, Hogan W, et al. Clinical event monitoring at the University of Pittsburgh. Proc AMIA Annu Fall Symp. 1997;188–92.
13. Cimino JJ, Elhanan G, Zeng Q. Supporting infobuttons with terminological knowledge. Proc AMIA Annu Fall Symp. 1997;528–32.
14. Miller PL. Building an expert critiquing system: ESSENTIAL-ATTENDING. Methods Inf Med. 1986;25:71–8.
15. Tversky A, Kahneman D. Judgment under uncertainty: heuristics and biases. Science. 1974;185:1124–31.
16. Lichtenstein S, Fischhoff B. Do those who know more also know more about how much they know? Organ Behav Hum Perform. 1977;20:159–83.
17. Christensen-Szalanski JJ, Bushyhead JB. Physicians' use of probabilistic information in a real clinical setting. J Exp Psychol. 1981;7:928–35.
18. Tierney WM, Fitzgerald J, McHenry R, et al. Physicians' estimates of the probability of myocardial infarction in emergency room patients with chest pain. Med Decis Making. 1986;6:12–7.
19. Friedman CP, Elstein AS, Wolf FM, et al.
Enhancement of clinicians' diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA. 1999;282:1851–6.
20. Mann D. The relationship between diagnostic accuracy and confidence in medical students. Presented at the annual meeting of the American Educational Research Association, Atlanta, 1993.
21. Friedman C, Gatti G, Elstein A, Franz T, Murphy G, Wolf F. Are clinicians correct when they believe they are correct? Implications for medical decision support. Proceedings of the Tenth World Congress on Medical Informatics, London. Medinfo. 2001;10(Pt 1):454–8.
22. Swets JA, Pickett RM. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. New York, NY: Academic Press; 1982.
23. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. New York, NY: Chapman and Hall; 1991.
24. Liang K, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
25. Neter J, Kutner MH, Nachtsheim CJ, Wasserman W. Applied Linear Regression Models. Chicago, IL: Irwin; 1996.
26. SAS Institute Inc. SAS/STAT User's Guide, Version 8. Cary, NC: SAS Institute Inc.; 1999.
27. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.