Document 6424736
Transcription
"Form Approved Through 05/2004 OMB No. 0925-0001 LEAVE BLANK—FOR PHS USE ONLY. Type Activity Number Review Group Formerly Department of Health and Human Services Public Health Services Grant Application Do not exceed 56-character length restrictions, including spaces. 1. TITLE OF PROJECT Council/Board (Month, Year) Date Received Industrialized informatics for drug discovery 2. RESPONSE TO SPECIFIC REQUEST FOR APPLICATIONS OR PROGRAM ANNOUNCEMENT OR SOLICITATION (If “Yes,” state number and title) Number: PAR-03-106 Title: NO ✘ YES Innovations in biomedical computational science and technology ✘ 3. PRINCIPAL INVESTIGATOR/PROGRAM DIRECTOR New Investigator 3a. NAME (Last, first, middle) 3b. DEGREE(S) Ling, Bruce, Xuefeng BS, MA, Ph.D. 3c. POSITION TITLE 3d. MAILING ADDRESS (Street, city, state, zip code) Director, Research Informatics 1120 Veterans Blvd, South San Francisco, CA 94080 3e. DEPARTMENT, SERVICE, LABORATORY, OR EQUIVALENT Bioinformatics No Yes 3f. MAJOR SUBDIVISION Research Informatics 3g. TELEPHONE AND FAX (Area code, number and extension) TEL: 650-825-7143 4. HUMAN SUBJECTS RESEARCH ✘ No FAX: 4a. Research Exempt [email protected] 509-271-7814 ✘ No Yes 5. VERTEBRATE ANIMALS ✘ No Yes If “Yes,” Exemption No. 4b. Human Subjects Assurance No. Yes E-MAIL ADDRESS: 4c. NIH-defined Phase III Clinical Trial ✘ No 5a. If “Yes,” IACUC approval Date 5b. Animal welfare assurance no Yes 6. DATES OF PROPOSED PERIOD OF SUPPORT (month, day, year—MM/DD/YY) 7. COSTS REQUESTED FOR INITIAL BUDGET PERIOD 8. COSTS REQUESTED FOR PROPOSED PERIOD OF SUPPORT From Through 7a. Direct Costs ($) 7b. Total Costs ($) 8a. Direct Costs ($) 10/01/03 10/01/08 $450,000 $684,000 $2,406,772 9. APPLICANT ORGANIZATION $3,658,283 10. TYPE OF ORGANIZATION Tularik Inc. Name 8b. Total Costs ($) Address Public: → Private: → For-profit: → 1120 Veterans Blvd, South San Francisco, CA 94080 Federal State Local Private Nonprofit General ✘ Small Business Woman-owned Socially and Economically Disadvantaged 11. 
ENTITY IDENTIFICATION NUMBER 94-31-48800 DUNS NO. (if available) 795475946 Institutional Profile File Number (if known) Congressional District 12. ADMINISTRATIVE OFFICIAL TO BE NOTIFIED IF AWARD IS MADE 13. OFFICIAL SIGNING FOR APPLICANT ORGANIZATION Name Title Address Tel Bruce Ling, Ph.D. Director, Research Informatics FAX Tel 509-271-7814 [email protected] 650-825-7181 E-Mail 14. PRINCIPAL INVESTIGATOR/PROGRAM DIRECTOR ASSURANCE: I certify that the statements herein are true, complete and accurate to the best of my knowledge. I am aware that any false, fictitious, or fraudulent statements or claims may subject me to criminal, civil, or administrative penalties. I agree to accept responsibility for the scientific conduct of the project and to provide the required progress reports if a grant is awarded as a result of this application. 15. APPLICANT ORGANIZATION CERTIFICATION AND ACCEPTANCE: I certify that the statements herein are true, complete and accurate to the best of my knowledge, and accept the obligation to comply with Public Health Services terms and conditions if a grant is awarded as a result of this application. I am aware that any false, fictitious, or fraudulent statements or claims may subject me to criminal, civil, or administrative penalties. PHS 398 (Rev. 05/01) Louisa M. Daniels Corporate Counsel Veterans Blvd, South San Francisco, CA 94080 Title Senior Address 1120 1120 Veterans Blvd, South San Francisco, CA 94080 650-825-7143 E-Mail Name 12th FAX 650-825-7664 [email protected] SIGNATURE OF PI/PD NAMED IN 3a. (In ink. “Per” signature not acceptable.) DATE SIGNATURE OF OFFICIAL NAMED IN 13. (In ink. “Per” signature not acceptable.) DATE Face Page Form Page 1 Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. DESCRIPTION: State the application’s broad, long-term objectives and specific aims, making reference to the health relatedness of the project. 
Describe concisely the research design and methods for achieving these goals. Avoid summaries of past accomplishments and the use of the first person. This abstract is meant to serve as a succinct and accurate description of the proposed work when separated from the application. If the application is funded, this description, as is, will become public information. Therefore, do not include proprietary/confidential information. DO NOT EXCEED THE SPACE PROVIDED.

The global pharmaceutical industry currently faces unprecedented pressure to increase its productivity in delivering new chemical entities. The long-term goal of this proposal is to develop a robust, high-throughput informatics platform to accelerate industrialized drug discovery. The specific aims for the Tularik Discovery Informatics Platform are:

(1) Architect scalable and robust high-throughput enterprise computing infrastructures. The Java 2 Platform, Enterprise Edition (J2EE), Microsoft .NET, and high-throughput/high-performance computing (HTC/HPC) technologies will be made interoperable to build a state-of-the-art Discovery Informatics platform.

(2) Integrate various standalone robotic applications into networked, automated Discovery pipelines. Thanks to technological innovations, robotics and automation are now essential at various stages of the drug discovery process. The Discovery Informatics platform will integrate robotic vendors' proprietary software through .NET Web Services to automate inter-robotic data management and mechanical operations.

(3) Systemize the high-throughput discovery workflows to reveal “knowledge” from the raw data. The Tularik Discovery Informatics platform has automated data analysis and data management in the areas of array-based comparative genomic hybridization, high-throughput screening (HTS), structure-activity relationship (SAR), and ADMET. Additional machine learning algorithms and visualization modules will be developed to automatically extract knowledge, e.g. novel compound structural motifs, from large-scale bioassay databases. The Discovery Informatics platform will integrate computational chemistry approaches for parallel drug lead optimization of potency, selectivity, and ADMET properties.

(4) Integrate in silico drug lead seeking, explosion, and optimization processes into the high-throughput Discovery platform. Ligand- or receptor-based virtual screening algorithms will be integrated into the Discovery platform to increase the throughput of drug lead seeking, explosion, and optimization. Algorithms, including proper compound filters, will be developed to create a 1-billion-member virtual screening library.

(5) Standardize the informatics data flow and implement an interoperable service-oriented computing architecture. Tularik will work with I3C (Interoperable Informatics Infrastructure Consortium) to adopt and enact proper standards for data flow in the areas of genomics, biological pathways, compound acquisition, compound inventory, lead discovery, and optimization. The current Tularik Discovery platform hosts various XML-based J2EE and .NET distributed applications, providing a solid foundation on which to extend to the service-oriented computing architecture.

(6) Establish industrialized software configuration management (SCM) mechanisms for application build and deployment. The Discovery platform has evolved to ensure code portability, robust builds, and easy deployment to Tularik’s worldwide campuses and relevant research communities. The Discovery platform will continue to improve through the use of open source standards and applications.

These developments will make the Tularik Discovery Informatics platform generalizable, scalable, extensible, and interoperable for the entire biomedical research community.

PERFORMANCE SITE(S) (organization, city, state): Tularik Inc., South San Francisco, California

KEY PERSONNEL. See instructions. Use continuation pages as needed to provide the required information in the format shown below.
Start with Principal Investigator. List all other key personnel in alphabetical order, last name first.

Name | Organization | Role on Project
Ling, Bruce, Ph.D. | Tularik Inc. | Principal Investigator
King, Brian | LifeCode, Inc. & Interoperable Informatics Infrastructure Consortium | Consultant, standardization and interoperability
Hoey, Tim, Ph.D. | Tularik Inc. | Director, directs biology efforts
Jaen, Juan, C., Ph.D. | Tularik Inc. | VP, directs chemistry efforts
Shuttleworth, Stephen J., Ph.D. | Tularik Inc. | Director, directs combinatorial chemistry efforts
Young, Stephen, W., Ph.D. | Tularik Inc. | Director, directs lead discovery efforts
Waszkowycz, Bohdan, Ph.D. | Tularik Ltd. (UK) | Director, directs virtual screening
Connor, Richard, Ph.D. | Tularik Inc. | Scientist, combi-chem
Cardozo, Mario, Ph.D. | Tularik Inc. | Scientist, computational chemistry
Young, Steve, Ph.D. | Tularik Ltd. (UK) | Scientist, virtual screening
Cutler, Gene, Ph.D. | Tularik Inc. | Research Investigator, in silico target identification and microarray
Pan, Zheng, Ph.D. | Tularik Inc. | Scientist, informatics operation
Liu, Jane, M.D. | Tularik Inc. | Scientist, chemoinformatics
Lukes, Melissa | Tularik Inc. | DBA
Porter, Richard | Tularik Inc. | Developer
Subramani, Jayanthi | Tularik Inc. | Developer
Ding, Epic | Tularik Inc. | Developer
Charati, Kaveri | Self-employed consultant | XML specialist

The name of the principal investigator/program director must be provided at the top of each printed page and each continuation page.

RESEARCH GRANT TABLE OF CONTENTS

Face Page
Description, Performance Sites, and Personnel
Table of Contents
Detailed Budget for Initial Budget Period (or Modular Budget)
Budget for Entire Proposed Period of Support (not applicable with Modular Budget)
Budgets Pertaining to Consortium/Contractual Arrangements (not applicable with Modular Budget)
Biographical Sketch: Principal Investigator/Program Director (Not to exceed four pages)
Other Biographical Sketches (Not to exceed four pages for each; see instructions)
Resources

Research Plan
Introduction to Revised Application (Not to exceed 3 pages)
Introduction to Supplemental Application (Not to exceed one page)
A. Specific Aims
B. Background and Significance
C. Preliminary Studies/Progress Report/Phase I Progress Report (SBIR/STTR Phase II ONLY)
D. Research Design and Methods
(Items A-D: not to exceed 25 pages. SBIR/STTR Phase I: Items A-D limited to 15 pages.)
E. Human Subjects
   Protection of Human Subjects (Required if Item 4 on the Face Page is marked “Yes”)
   Inclusion of Women (Required if Item 4 on the Face Page is marked “Yes”)
   Inclusion of Minorities (Required if Item 4 on the Face Page is marked “Yes”)
   Inclusion of Children (Required if Item 4 on the Face Page is marked “Yes”)
   Data and Safety Monitoring Plan (Required if Item 4 on the Face Page is marked “Yes” and a Phase I, II, or III clinical trial is proposed)
F. Vertebrate Animals
G. Literature Cited
H. Consortium/Contractual Arrangements
I. Letters of Support (e.g., Consultants)
J. Product Development Plan (SBIR/STTR Phase II and Fast-Track ONLY)

Checklist

Appendix (Five collated sets. No page numbering necessary for Appendix.) Check if Appendix is Included.
Appendices NOT PERMITTED for Phase I SBIR/STTR unless specifically solicited.
Number of publications and manuscripts accepted for publication (not to exceed 10): 10
Other items (list): Appendix summary

BUDGET FOR ENTIRE PROPOSED PROJECT PERIOD
DIRECT COSTS ONLY

Budget category totals | Initial budget period (from Form Page 4) | 2nd year | 3rd year | 4th year | 5th year
PERSONNEL: Salary and fringe benefits. Applicant organization only. | $346,400 | $360,256 | $374,666 | $374,666 | $389,653
CONSULTANT COSTS | $86,000 | $89,440 | $93,018 | $96,739 | $100,608
EQUIPMENT | $0 | $0 | $0 | $0 | $0
SUPPLIES | $0 | $0 | $0 | $0 | $0
TRAVEL | $17,600 | $18,304 | $19,036 | $19,797 | $20,589
PATIENT CARE COSTS (INPATIENT, OUTPATIENT) | (no amounts entered)
ALTERATIONS AND RENOVATIONS | (no amounts entered)
OTHER EXPENSES | (no amounts entered)
SUBTOTAL DIRECT COSTS | $450,000 | $468,000 | $486,720 | $491,202 | $510,850
CONSORTIUM/CONTRACTUAL COSTS (DIRECT, F&A) | (no amounts entered)
TOTAL DIRECT COSTS | $450,000 | $468,000 | $486,720 | $491,202 | $510,850

TOTAL DIRECT COSTS FOR ENTIRE PROPOSED PROJECT PERIOD (Item 8a, Face Page): $2,406,772
SBIR/STTR Only, Fee Requested: $0
SBIR/STTR Only: Total Fee Requested for Entire Proposed Project Period (Add Total Fee amount to “Total direct costs for entire proposed project period” above and Total F&A/indirect costs from Checklist Form Page, and enter these as “Costs Requested for Proposed Period of Support” on Face Page, Item 8b.)

JUSTIFICATION. Follow the budget justification instructions exactly. Use continuation pages as needed.

CONSULTANT: Brian King, President, LifeCode Inc., and I3C (Interoperable Informatics Infrastructure Consortium) committee section leader. Mr. King has agreed to serve as a consultant on this project. He will advise, at a rate of $110 per hour, on the architecture and process of informatics standardization and interoperability. Funding of $20,000 is requested for his consulting work in the first year, with a 4% annual increase.

Kaveri Charati, XML specialist. Ms. Charati has agreed to serve as a consultant on the project. She will be responsible for XML-based data transactions and XSLT data transformation. Funding of $66,000 is requested for her consulting work in the first year, with a 4% annual increase.

PERSONNEL:

Bruce Xuefeng Ling, Ph.D. will supervise all studies and manage the implementation progress on a weekly basis. He will directly coordinate project team staff in different disciplinary areas and interact with them on a daily basis if necessary.

Juan Jaen, Ph.D. will oversee the entire chemistry effort. Salary is not requested.

Tim Hoey, Ph.D. will coordinate target identification and high-throughput assay development to ensure data integrity and data flow. Salary is not requested.

Stephen Shuttleworth, Ph.D. will coordinate combinatorial chemistry data flow and will be actively involved in the design of the informatics architecture to integrate high-throughput combi-chem into the lead discovery informatics platform. Salary is not requested.

Stephen W. Young, Ph.D. will coordinate and be involved in the informatics area of lead discovery, high-throughput screening, and compound inventory. Salary is not requested.

Steve Young, Ph.D. will be part of the team to design and integrate in silico lead identification (docking) and optimization into the Discovery informatics platform. Salary is not requested.

Bohdan Waszkowycz, Ph.D. will supervise and coordinate the high-throughput computational chemistry efforts. Salary is not requested.

Richard Connor, Ph.D. will be part of the team to integrate the combi-chem robotics Tularik proprietary driver and master control program into the Discovery informatics platform through .NET technologies. Salary is not requested.

Mario Cardozo, Ph.D. will be responsible for enabling algorithms for high-throughput compound property calculation and their integration into the Discovery informatics platform. Salary is not requested.

Gene Cutler, Ph.D. will be responsible for in silico target identification and high-throughput microarray data management.

Zheng Pan, Ph.D. will be responsible for the Discovery site configuration and ISIS integration for compound handling.

Jane Liu, M.D. will be responsible for integrating high-throughput chemoinformatics applications into the Discovery informatics platform.

Melissa Lukes will be responsible for Oracle database architecture, setup, maintenance, and data integrity.

Rick Porter will be responsible for the robotics informatics implementation, which enables the interfacing and integration of campus-wide robotics machines into the Discovery platform through vendor drivers and .NET framework applications.

Jayanthi Subramani will be responsible for compound inventory data management and Discovery platform web application architecture.

Epic Ding will be responsible for third-party software integration and Discovery platform database modeling.

Walter Pan will be responsible for HTS and SAR data flow support and .NET architecture implementation.
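The direct-cost figures in the budget for the entire proposed project period can be cross-checked with simple arithmetic. The short Python sketch below re-adds the personnel, consultant, and travel rows and compares them with the stated yearly totals and the stated project total of $2,406,772. The variable names are illustrative only and are not part of the application.

```python
# Direct-cost rows as stated in the budget table (years 1-5, in dollars).
# Variable names are illustrative and not taken from the application.
personnel = [346_400, 360_256, 374_666, 374_666, 389_653]
consultants = [86_000, 89_440, 93_018, 96_739, 100_608]
travel = [17_600, 18_304, 19_036, 19_797, 20_589]

# Yearly direct-cost totals and project total as stated in the application.
stated_yearly = [450_000, 468_000, 486_720, 491_202, 510_850]
stated_total = 2_406_772

# Sum the three non-zero categories year by year.
computed_yearly = [p + c + t for p, c, t in zip(personnel, consultants, travel)]
computed_total = sum(computed_yearly)

# First-year consultant cost is the sum of the two consultant requests
# ($20,000 and $66,000) named in the justification.
first_year_consultants = 20_000 + 66_000

print(computed_yearly == stated_yearly)
print(computed_total == stated_total)
print(first_year_consultants == consultants[0])
```

Equipment and supplies are entered as $0 in every year, so they are left out of the sums.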
SUPPLIES: Tularik will cover the necessary cost of various supplies and software licenses.

TRAVEL: Funding of $17,600 is requested for the project members to attend the following conferences. The remaining balance of the registration fees, lodging, and airfare expenses will be covered by Tularik Inc.

Conference name | Date | Location | Registration fee
JAVA ONE | Jun-04 | San Francisco | $2,500
Intelligent Drug Discovery & Development | May-04 | Philadelphia, Pennsylvania | $2,000
Information Systems and Technology for Life Sciences | Feb-04 | London, UK | $2,000
ICSB2003: 4th International Conference on Systems Biology | Nov-03 | St. Louis, MO | $1,000

BUDGET JUSTIFICATION PAGE
MODULAR RESEARCH GRANT APPLICATION

Initial Budget Period: $450,000
Second Year of Support: $468,000
Third Year of Support: $486,720
Fourth Year of Support: $491,202
Fifth Year of Support: $510,850
Total Direct Costs Requested for Entire Project Period: $2,406,772

Personnel
Details of the personnel budget justification can be found on Form Page 5.

Name | Organization | Role on Project
Ling, Bruce, Ph.D. | Tularik Inc. | Principal Investigator
Hoey, Tim, Ph.D. | Tularik Inc. | Director, directs biology efforts
Shuttleworth, Stephen J., Ph.D. | Tularik Inc. | Director, directs high-throughput combinatorial chemistry efforts
Young, Stephen, W., Ph.D. | Tularik Inc. | Director, directs high-throughput lead discovery efforts
Waszkowycz, Bohdan, Ph.D. | Tularik Ltd. (UK) | Director, directs virtual screening
Jaen, Juan, C., Ph.D. | Tularik Inc. | VP Chemistry, directs chemistry efforts
Connor, Richard, Ph.D. | Tularik Inc. | Scientist, combi-chem
Cardozo, Mario, Ph.D. | Tularik Inc. | Scientist, computational chemistry
Young, Steve, Ph.D. | Tularik Ltd. (UK) | Scientist, virtual screening
Cutler, Gene, Ph.D. | Tularik Inc. | Research Investigator, in silico target identification and microarray
Pan, Zheng, Ph.D. | Tularik Inc. | Scientist, informatics operation
Liu, Jane, M.D. | Tularik Inc. | Scientist, chemoinformatics
Lukes, Melissa | Tularik Inc. | DBA
Porter, Richard | Tularik Inc. | Developer
Pan, Walter | Tularik Inc. | Developer
Subramani, Jayanthi | Tularik Inc. | Developer
Ding, Epic | Tularik Inc. | Developer
King, Brian | LifeCode, Inc. & I3C | Consultant, standardization and interoperability
Charati, Kaveri | Self-employed consultant | XML specialist

Consortium Fee (SBIR/STTR Only)

BIOGRAPHICAL SKETCH
Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES.

NAME: Ling, Bruce, Xuefeng
POSITION TITLE: Director, Research Informatics

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)

Institution and location | Degree (if applicable) | Year(s) | Field of study
Fudan University | B.S. | 1990 | Biochemistry
UCLA | M.A. | 1994 | Molecular Biology
UCLA | Ph.D. | 1996 | Biological Chemistry
Stanford Medical Center | Postdoc. | 1996-1998 | Molecular Immunology

A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee.

Positions
2003 Director, Research Informatics, Tularik Inc.
2001 - 2002 Director, Bioinformatics, Tularik Inc.
2000 - 2001 Associate Director, R&D, DoubleTwist, Inc.
1999 - 2000 Project Manager, Research Dept., Pangea Systems, Inc.
1998 - 1999 Computation/Bioinformatics Scientist, Incyte Pharmaceuticals, Inc.
1997 Member, Medical Advisory Board, National Kidney Foundation of Northern California
1996 - 1998 Fellow, Stanford Molecular Immunology Laboratory, SUMC, CA

Honors
1997 - 1998 Walter Berry Medical Research Award
1997 National Kidney Foundation Research Award
1996 Dean's Fellowship, Stanford University
1992 - 1993 University Fellowship, UCLA, CA
1991 - 1992 University Fellowship, University of Iowa, IA
1990 - 1991 University Fellowship, Fudan University, China
1990 Summa cum laude, Fudan University, China
1986 - 1990 University Fellowship, Fudan University, China

B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation.

• Li S, Cutler G, Liu J, Hoey T, Chen L, Schultz PG, Liao J, Ling XB (corresponding author). 2003. A Comparative Analysis of HGSC and Celera Human Genome Assemblies and Gene Sets. Bioinformatics, in press.
• Li S, Liao J, Cutler G, Hoey T, Hogenesch JB, Cooke MP, Schultz PG, Ling XB (corresponding author). 2002. Comparative Analysis of Human Genome Assemblies Reveals Genome-Level Differences. Genomics 80(2):138.
• Pei L, Peng Y, Yang Y, Ling XB, van Eyndhoven WG, Nguyen K, Rubin M, Hoey T, Powers S, Li J. 2002. PRC17, a novel oncogene encoding a Rab GTPase-activating protein, is amplified in prostate cancer. Cancer Res. 62(19):5420-4.
• Jiang Y, Chen D, Lyu S-C, Ling X, Krensky AM, Clayberger C. 2002. DQ 65-79, a Peptide Derived from HLA Class II, Induces IkappaB Expression. J. of Immunology 168(7):3323-8.

PHS 398/2590 (Rev. 05/01) Biographical Sketch Format Page

• Pouliot Y, Gao J, Su Q, Liu G, Ling XB (corresponding author). 2001. DIAN, a Novel Algorithm for Genome Ontological Classification. Genome Res. 11(10):1766-79.
• Ling X, Kamangar S, Boytim ML, Kelman Z, Huie P, Lyu S-C, Sibley RK, Hurwitz J, Clayberger C, Krensky A. 2000. Proliferating Cell Nuclear Antigen as the Cell Cycle Sensor for an HLA-Derived Peptide Blocking T Cell Proliferation. J. of Immunology 164:6188-92.
• Ling X, Tamaki T, Xiao Y, Kamangar S, Clayberger C, Lewis DB, Krensky AM. 2000. An immunosuppressive and anti-inflammatory HLA class I-derived peptide binds vascular cell adhesion molecule-1. Transplantation 70(4):662-7.
• Lenfant F, Mann RK, Thomsen B, Ling X, Grunstein M. 1996. All four core histone N-termini contain sequences required for the repression of basal transcription in yeast. EMBO J. 15:3974-85.
• Ling X, Harkness TA, Schultz MC, Fisher-Adams G, Grunstein M. 1996. Yeast histone H3 and H4 amino termini are important for nucleosome assembly in vivo and in vitro: redundant and position-independent functions in assembly but not in gene regulation. Genes and Development 10:686-99.
• Thompson JS, Ling X, Grunstein M. 1994. Histone H3 amino terminus is required for telomeric and silent mating locus repression in yeast. Nature 369:245-7.

C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects.

• Role: Research Informatics Director. Responsible for research informatics (bioinformatics, chemoinformatics, lead discovery) support for genomics-based drug discovery at Tularik Inc.
   • IEEE (2002) report: http://siliconvalleycs.org/Ling.htm http://siliconvalleycs.org/LingBio.htm
   • Bioinformatics data mining for drug targets: in silico identification and validation
   • Architect in the implementation of microarray amplicon data analysis and oligo-chip design software for cancer drug target identification
   • Lead Discovery Data Flow: automatic data flow and data management development
   • Set up the robust and scalable enterprise computing platform for genomics and lead discovery data flow
   • Architect the Java 2 Enterprise Edition Platform for genomics and lead discovery data flow
   • Architect and integrate the .NET Enterprise Platform with the J2EE Platform computing backbone for genomics and lead discovery data flow
   • Architect data modeling for the data repository covering genomics, assay development, HTS, SAR, lead optimization, etc.
   • Set up the Linux cluster and IT infrastructure for genomics-based drug discovery
   • Build Linux clusters that placed Tularik on the US TOP-500 cluster list:
     http://clusters.top500.org/db/site.php?mode=listanon
     http://www.bio-itworld.com/archive/071102/linux.html
• Role: Architect and project manager. Responsible for the DoubleTwist Genomic Mapping Project for integration into the DoubleTwist Human Genomic Database.
   • Design the Genomic Mapping Project data flow
   • Manage the project implementation and system integration
• Role: Architect and project manager. Responsible for the DoubleTwist Concept Mining Tool (DIAN System).
   • Design and architect the DIAN system data flow and algorithm logistics
   • Manage the system implementation
• Role: Architect and developer. Responsible for the design of the protein domain analysis pipeline for the DoubleTwist Protein Comprehensive Analysis Agent.
• Role: Architect and developer for the http://www.doubletwist.com genomic project.
   • Design the genomic project data flow
   • Implement the Human Genomic Project BAC fragment ordering tool for genomic BAC sequence ordering and assembly (coded in C++ and Perl)
   • Provide leadership in the implementation of the DoubleTwist Human Genomic Database (Prophecy)

BIOGRAPHICAL SKETCH

NAME: Hoey, Timothy C.
POSITION TITLE: Director, Biology Department

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)

Institution and location | Degree (if applicable) | Year(s) | Field of study
Columbia University, New York, NY | Ph.D. | 1989 | Molecular Biology
Columbia University, New York, NY | M.Phil. | 1987 | Molecular Biology
Columbia University, New York, NY | M.A. | 1986 | Molecular Biology
University of Michigan, Ann Arbor, MI | B.S. | 1980 | Biology

A. Positions and Honors.
Positions and Employment 1983-1984 Research Technician, Catholic Medical Center, New York 1983-1984 Research Technician, Columbia University, New York 1984-1989 Graduate Research Fellow/Teaching Assistant, Columbia University, New York 1989-1993 Postdoctoral Fellow, University of California, Berkeley 1993-1999 Scientist, Biology Department, Tularik, Inc., South San Francisco 1999Director, Biology Department, Tularik, Inc., South San Francisco Other Experience and Professional Memberships 1994 Seminar, Department of Molecular Pharmacology, Stanford University 1996 Seminar, Department of Immunology, University of Washington 1996 Seminar, Department of Pathology, Brown University 1996 Seminar, Department of Biology, UC Santa Cruz 1996 Seminar, Roussel Signal Transduction Symposium, Oxford University, UK 1996 Seminar, IBC Transcriptional Regulation Conference, San Diego 1996 Seminar, Department of Pathology, Yale University 1996 Seminar, Samsung International Symposium, Seoul, Korea 1996 Seminar, Shock Society Conference, Indian Wells, CA 1997 Seminar, IBC Transcriptional Regulation Conference, San Diego 1997 Seminar, Department of Molecular Pharmacology, Stanford University 1997 Seminar, Keystone Symposium of Jaks and STATS, Tamarron, CO 1997 Seminar, Swiss Society for Experimental Biology, Lausanne, Switzerland 1998 Seminar, Department of Pathology, Emory University 1998 Seminar, Department of Immunology, Gladstone Institute, UCSF 1999 Seminar, ZMBH Cancer Center, Heidelberg, Germany 2000 Seminar, Department of Molecular Biology, USC 2001 Seminar, Inflammation Society Annual Meeting, San Diego, CA 2001 Seminar, Bay Area Biotechnology Conference, UCSF 2001 Seminar, Department of Immunology, Lerner Institute, Cleveland Clinic 2002 Seminar, Institute of Medicine Cancer Meeting, National Academy of Sciences 2002Associate Director, Journal of Immunology ? PHS 398/2590 (Rev. 05/01) Page __12_____ Number pages consecutively at the bottom throughout the application. 
Do not use suffixes such as 3a, 3b. Biographical Sketch Format Page ? ? Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. B. Selected peer-reviewed publications (in chronological order). (Publications selected from 40 peer-reviewed publications.) 1. Hoey, T. and Levine, M. Divergent homeo box proteins recognize similar DNA sequences in Drosophila. Nature 1988;332:858-861. 2. Levine, M. and Hoey, T. Homeo box proteins as sequence-specific transcription factors. Cell 1988;55:537540. 3. Hoey, T., Dynlacht, B.D., Peterson, M.G., Pugh, B.F., and Tjian, R. Isolation and characterization of the Drosophila gene encoding the TATA box binding protein, TFIID. Cell 1990;61:1179-1186. 4. Dynlacht, B.D., Hoey, T. and Tjian, R. Isolation of coactivators associated with the TATA-binding protein that mediate transcriptional activation. Cell 1991;66:563-576. 5. Hoey, T., Weinzierl, R.O.J., Gill, G., Chen, J-L., Dynlacht, B.D. and Tjian, R. Molecular cloning and functional analysis of Drosophila TAF110 reveal properties expected of coactivators. Cell 1993;72:247-260. 6. Goodrich, J.A. Hoey, T., Thut, C.J., Admon, A. and Tjian, R. Drosophila TAFΠ40 interacts with both a VP16 activation domain and the basal transcription factor TFIIB. Cell 1993;75:519-530. 7. Rooney, J.W., Hoey, T., and Glimcher, L.H. Coordinate and cooperative roles for NF-AT and AP-1 in the regulation of the murine IL-4 gene. Immunity 1995;2:461-472. 8. Hoey, T., Sun, Y.L., Williamson, K. and Xu, X. Isolation of two new member of the NFAT gene family and functional characterization of the NFAT proteins. Immunity 1995;2:473-483. 9. Rooney, J.W., Sun, Y.L., Glimcher, LG., and Hoey, T. Novel NFAT sites that mediate activation of the Interleukin-2 promoter in response to T-cell receptor stimulation. Mo. Cell. Biol. 1995;15:6299-6310. 10. Hodge, M.R., Ranger,A.M., de la Brousse, F., Hoey, T., Grusby, M.J., Glimcher, L.H. 
Hyperproliferation and dysregulation of Interleukin-4 expression in NFATp deficient mice. Immunity 1996;4:397-405. 11. Kaplan, M.H., Sun, Y.L., Hoey, T., and Grusby, M. Impaired IL-12 responses and enhanced development of TH2 cells in STAT4-deficient mice. Nature 1996;382:174-177. 12. Xu, X., Sun, Y.L., and Hoey, T. The STAT amino-terminal domain mediates cooperative DNA binding and confers selective sequence recognition. Science 1996;263:794-797. 13. Hoey, T. A new play in cell death. Science 1997;278:1578-1579. 14. Naeger, L., and Hoey, T. Identification of STAT4 binding site in the IL-12 receptor required for signaling. J. Biol. Chem. 1999;274:1875-1878. 15. Lawless, V.A., Zhang, S., Ozes, O.N., Bruns, H.A., Oldham, I., Hoey, T., Grusby, M.J., and Kaplan, M.H. STAT4 regulates multiple components of IFN-gamma-inducing signaling pathways. J. Immunol. 2000;165:6803-6808. 16. Li, J., Yan, Y., Austin, R., van Eyndhoven, W., Peng, Y., McCurrach, M.E., Nguyen, K., Appella, E., Lowe, S.W., Hoey, T., and Powers, S. Oncogenic properties of PPM1D located within a breast cancer amplification epicenter at 17q23. Nature Genetics. 2002;31:133-134. PHS 398/2590 (Rev. 05/01) Page ___13____ Number pages consecutively at the bottom throughout the application. Do not use suffixes such as 3a, 3b. Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. 17. Pei, L., Peng, Y., van Eyndhoven, W., Ling, X.B., Nguyen, K., Rubin, M., Hoey, T., Powers, S., and Li, J. (in press, 2002 Cancer Research). 18. Li, S., Liao, J., Cutler, G., Hoey, T., Hogenesch, J., Cooke, M., Schultz, P., and Ling, X. Genomics 2002;80:138. Patents/Inventorships TATA-Binding Protein Associated Factors drug screens TATA-Binding Protein Associated Factors nucleic acids Human Nuclear Factors and binding assays Human Signal Transducer and binding assays PHS 398/2590 (Rev. 05/01) U.S. patent number 5,534,410 U.S. patent number 5,637,686 U.S. 
patent number 5,612,455 U.S. patent number 5,639,858 Page ___14____ Biographical Sketch Format Page Ling, Bruce, Xuefeng, Ph.D. Principal Investigator/Program Director (Last, first, middle): BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Shuttleworth, Stephen J. Scientist EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION University of Liverpool, Liverpool, UK University of Liverpool, Liverpool, UK DEGREE (if applicable) B.Sc. PhD YEAR(s) FIELD OF STUDY 1991 Chemistry 1994 Organic Chemistry A. Positions and Honors Positions and Employment 1994-1997 Senior Research Chemist Chiroscience Ltd., Cambridge, UK 1997 Head of Combinatorial Chemistry, Glycodesign, Inc., Toronto, Ontario, Canada 1999-2000 Research Leader, Combinatorial Chemistry, BioChem Pharma, Inc. Laval, Montreal, Canada 2000- present Associate Director, Chemistry, Tularik Inc., South San Francisco, CA, USA Other Experience and Professional Memberships 1990-present Member of the Society of Chemical Industry 1990 Awarded GRSC from the Royal Society of Chemistry 1993 Awarded C Chem, MRSC from the Royal Society of Chemistry 1995-present Member of the European Chemical Society 1996-1997 Member of the UK Automation Society 1997-present Member of the Chemical Institute of Canada 1997 Awarded MCIC from the Chemical Institute of Canada 1999-present Member of the American Chemical Society Honors 1990 1991 1994 Nuffield Foundation Research Scholarship Fully-funded Ph.D. Studentship awarded from Glaxo, UK First Prize, SCI National Postgraduate Symposium, Manchester University, UK B. Peer-reviewed Publications (In Chronological Order). 1. Allcock, S.J., Gilchrist, T. L., Shuttleworth, S.J. King, F.D. 
Intramolecular and Intermolecular Diels-Alder Reactions of Acylhydrazones Derived from Methacrolein and Ethylacrolein. Tetrahedron 1991;47:10053-10064. 2. Page, P.C.B., Gareh, M.T., Shuttleworth, S.J. 1,3-Dithiane 1-Oxide: New Applications and its First Asymmetric Synthesis. IUPAC 18th Symposium on the Chemistry of Natural Products. 1992;262-263. 3. Page, P.C.B., Shuttleworth, S.J., Schilling, M.B., Tapolczay, D.J. One-Pot Stereocontrolled Cycloalkanone Synthesis using 1,3-Dithiane 1-Oxides. Tetrahedron Lett. 1993;34:6947-6950. 4. Page, P.C.B., Shuttleworth, S.J., McKenzie, M.J., Schilling, M.B., Tapolczay, D.J. Pummerer and Related Rearrangements of 2-Acyl-1,3-Dithiane 1-Oxides. Synthesis 1995;73-77. 5. Allin, S.M. and Shuttleworth, S.J. Synthesis and Uses of a Resin-Bound “Evans” Auxiliary. Tetrahedron 1996;37:8023-8026. PHS 398/2590 (Rev. 05/01) Page ___15 ____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. 6. Allin, S.M., Button, M.C. and Shuttleworth, S.J. Aza-Cope Rearrangement in the Asymmetric Alkylation of Enamines. Synlett. 1997;725-727. 7. Shuttleworth, S.J., Allin, S.M. and Sharma, P.K. Functionalized Polymers: Recent Developments & New Applications in Synthetic Organic Chemistry. Synthesis 1997;1217-1239. 8. Page, P.C.B., Allin, S.M., Shuttleworth, S.J. Organosulfur Chemistry: Synthetic and Stereochemical Aspects. Organosulfur Chemistry: Volume 2 ed. P.C.B. Page, Academic Press, UK 1998;97-155. 9. Shuttleworth, S.J. An Overview of Combinatorial Synthesis and its Applications in the Identification of Matrix Metalloproteinase Inhibitors. Advances in Drug Discovery, ed. Harvey, A., Wiley, UK. 1998;115-141. 10. Montana, J., Baxter, A., Shuttleworth, S.J., Manallack, D., Bird, J., Bhogal, R., Minton, K., Jagpal, S. Combinatorial Synthesis of Matrix Metalloproteinase Inhibitors. Int. J. Pharm. Med. 1998;9-12. 11. Shuttleworth, S.J., Quimpere, M., Lee, N., DeLuca, J. 
Parallel Solution Synthesis of Pyridinones, Pyridinethiones and Thienopyridines. Molecular Diversity 1999;4(3):183-185. 12. Shuttleworth, S.J., Allin, S.M., Wilson, R.D., Nasturica, D. Functionalised Polymers in Organic Chemistry, Part 2. Synthesis 2000;8:1035-1074. 13. Shuttleworth, S.J., Nasturica, D., Gervais, C., Siddiqui, M.A., Rando, R., Lee, N. Parallel Synthesis of Isatin-Based Serine Protease Inhibitors. Bioorg. Med. Chem. Lett. 2000;2501-2504. 14. Kearney, P.C., Fernandez, M., Fu, M., Flygare, J., Shuttleworth, S.J., Wahhab, A., Wilson, R., De Luca, J. Solid Phase Synthesis of 2-Aminothiazoles. Solid-Phase Org. Synth. 2001;1:1-8. 15. Shuttleworth, S.J., (Guest Editor), Development and Applications of Polymer-Supported Reagents and Ion Exchange Resins in Organic Synthesis and Combinatorial Chemistry. Combinatorial Chemistry & High Throughput Screening 2002;5(3):197-261. 16. Lizaraburu, M.E., Shuttleworth, S.J. Synthesis of Aryl Ethers from Protected Aminoalcohols Using Polymer-Supported Triphenylphosphine. Tetrahedron Lett. 2002;43:2157-2159. 17. Kong, L.C.C., Bedard, J., Das, S.K., Ba, N.N., Pereira, O.Z., Shuttleworth, S.J. Compounds and Methods for the Treatment or Prevention of Flavivirus Infections. PHAR-130-002-USA. 18. Connors, R.V., Zhang, A. J. and Shuttleworth, S.J. Pictet-Spengler Synthesis of Tetrahydro-β-Carbolines using Vinylsulfonylmethyl Resin. Tetrahedron Lett. 2002;43:6661-6663. PHS 398/2590 (Rev. 05/01) Page ___16____ Number pages consecutively at the bottom throughout the application. Do not use suffixes such as 3a, 3b. Biographical Sketch Format Page Ling, Bruce, Xuefeng, Ph.D. Principal Investigator/Program Director (Last, first, middle): BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Stephen W. 
Young Director, Lead Discovery Tularik Inc 1120 Veterans Blvd South San Francisco CA 94080 EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION DEGREE (if applicable) University of Bristol, Bristol, U.K. BSc University of Bristol, Bristol, U.K. PhD Open University, U.K. Diploma YEAR(s) 1987-1990 1991-1994 1999-2001 FIELD OF STUDY Biochemistry Insulin Receptor Signal Transduction Business Studies (Executive study – conducted while working at Roche) NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. 1994-1996 Senior Research Biologist, BioMolecular Screening Department, Glaxo R&D, Stevenage, U.K. 1996-1999 Senior Research Biologist, Immunology Unit, GlaxoWellcome R&D, Stevenage, U.K. 1999-2001 Head of High Throughput Screening, Roche Discovery, Welwyn, U.K. 2001-2003 Director, Lead Discovery, Tularik Inc, South San Francisco, California, USA B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. 1. Issad, T., Young, S.W., Tavaré, J.M. and Denton, R.M. “Effect of Glucagon on Insulin Receptor Phosphorylation in Intact Cells” FEBS Lett. 296 41-45 1992 2. Young, S.W., Poole, R.C., Hudson, A.T., Halestrap, A.P., Denton, R.M. and Tavaré, J.M. “Effects of Tyrosine Kinase Inhibitors on Protein Kinase Independent Systems” FEBS Lett. 316, 278-282 1993 3. Young, S.W., Dickens, M. and Tavaré, J.M. “Differentiation of PC12 Cells in Response to a cAMP Analogue is accompanied by Sustained Activation of Mitogen Activated Protein Kinase. 
Comparison with the Effects of Insulin, Growth Factors and Phorbol Esters” FEBS Lett. 338, 212-216 1994 4. Welsh, G.I., Foulstone, E.J., Young, S.W., Tavaré, J.M. and Proud, C.G. “Wortmannin Inhibits the Effects of Insulin and Serum on the Activities of Glycogen Synthase Kinase-3 and Mitogen Activated Protein Kinase” Biochem J. 303, 12-20 1994 5. Young, S.W., Dickens, M. and Tavaré, J.M. “Activation of Mitogen Activated Protein Kinase by PKC isotypes α, β and γ but not ε” FEBS Lett. 384, 181-184 1996 6. Young, S.W. “HTS Personal Perspectives: Big Pharma – interview with Rebecca Lawrence” Drug Discovery Today 6 (12) S8-S10 2001 7. Mallari, R., Swearingen, E., Liu, W., Ow, A., Young, S.W. and Huang, S.G. “A Generic High-throughput Screening Assay for Kinases: Protein Kinase A as an Example” J. Biomol. Screening 8 (2) 198-204 2003 PHS 398/2590 (Rev. 05/01) Page ___17____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. 8. Hong, C.A., Swearingen, E., Mallari, R., Gao, X., Cao, Z., North, A., Young, S.W. and Huang, S.G. “Development of A High-Throughput Time-Resolved Fluorescence Resonance Energy Transfer Assay for TRAF6 Ubiquitin Polymerization” Assay and Drug Development Technologies 1 (1-2) 175-180 2003 C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. PHS 398/2590 (Rev. 05/01) Page ___18____ Biographical Sketch Format Page Ling, Bruce, Xuefeng, Ph.D. Principal Investigator/Program Director (Last, first, middle): BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. 
Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Jaen, Juan. Vice President, Chemistry EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION University of Complutense, Madrid, Spain University of Complutense, Madrid, Spain University of Michigan, Ann Arbor, Michigan University of Michigan, Ann Arbor, Michigan DEGREE (if applicable) B.S. M.S. M.S. Ph.D. YEAR(s) 1979 1980 1981 1984 FIELD OF STUDY Organic Chemistry Organic Chemistry Organic Chemistry Organic Chemistry A. Positions and Honors. Positions and Employment 1983-1985 Scientist, Parke-Davis Pharmaceutical Research Division, Ann Arbor, MI. 1985-1987 Senior Scientist, Parke-Davis Pharmaceutical Research Division, 1988-1991 Research Associate, Parke-Davis Pharmaceutical Research Division 1991-1992 Senior Research Associate, Parke-Davis Pharmaceutical Research Division 1992-1993 Section Director, Neurodegenerative Diseases Chem., Parke-Davis Research 1993-1996 Director, Neurodegenerative Diseases Chem., Parke-Davis Research Division, 1996-2000 Director of Chemistry, Tularik Inc., South San Francisco, CA 1999 Vice-President of Chemistry, Tularik Inc., South San Francisco, CA Professional Activities 1992-1995 CNS Section Editor for Expert Opinion on Therapeutic Patents. 1992-1996 Editorial Board of Current Drugs: Neurodegenerative Disorders. 1992-1993 NIH Special Study Section 7, Drug Development & Drug Delivery - Small Business Innovation Research 1994-1995 Special Study Section Z, Multidisciplinary Special Emphasis Panel - Small Business Innovation Research 2000-present Editorial Board – Current Medicinal Chemistry (Immunology, Endocrine and Metabolic Agents). B. Selected Peer-reviewed publications (in chronological order). (Publications selected from 58 peer-reviewed publications) 1. 
Davis, R.E., Doyle, P.D., Carroll, R.T., Emmerling, M.R., Jaen, J.C. Cholinergic therapies for Alzheimer’s Disease: palliative or disease altering? Arzneim.Forsch/Drug Res. 1995;45(1):425. 2. Jaen, J.C., Laborde, E., Bucsh, R.A, Caprathe, B.W., Sorenson, R.J., Fergus, J., Spiegel, K., Dickerson, M.R., Davis, R.E. Kynurenic acid derivatives inhibit the binding of Nerve Growth Factor (NGF) to the low affinity p75 NGF receptor. J. Med. Chem. 1995;38:4439-4445. 3. Emmerling, M.R., Gregor, V.E., Callahan, M.J., Schwarz, R.D., Scholten, J.D., Orr, E.L., Pugsley, T., Moore, C.J., Raby, C., Myers, S.L., Davis, R.E., Jaen, J.C. CI-1002, a combined acetylcholinesterase inhibitor and muscarinic antagonist. CNS Drug Reviews 1995;1:27-29. 4. Pool, W.F., Woolf, T.F., Reily, M.D., Caprathe, B.W., Emmerling, M.R., Jaen, J.C. Identification of a 3-hydroxylated tacrine metabolite in rat and man: metabolic profiling implications and pharmacology. J. Med. Chem. 1996;39:3014-3018. 5. Glase, S.A., Akunne, H.C., Heffner, T.G., Jaen, J.C., Meltzer, L.T., Pugsley, T.A., Smith, S.J., Wise, L.D. Aryl 1-but-3-ynyl-4-phenyl-1,2,3,6-tetrahydropyridines as potential antipsychotic agents: Synthesis and Structure-Activity relationships. J. Med. Chem. 1996;39:3179-3187. 6. Jaen, J.C. and Schwarz, R.D. Development of muscarinic agonists for the symptomatic treatment of Alzheimer’s Disease. In: Pharmacological Treatment of Alzheimer’s Disease. J.D. Brioni and M.W. Decker, Eds. Wiley & Sons: 1997;409-432. PHS 398/2590 (Rev. 05/01) Page ___19____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. 7. Schwarz, R.D., Callahan, M.J., Davis, R.E., Jaen, J.C., Tecle, H. Development of M1-subtype-selective muscarinic agonists for Alzheimer’s Disease: translation of in vitro selectivity into in vivo efficacy. Drug Dev. Res. 1997;40:133-143. 8. 
Hays, S.J., Caprathe, B.W., Gilmore, J.L., Amin, N., Emmerling, M.R., Michael, W., Nadimpali, R., Nath, R., Raser, K.J., Stafford, D., Watson, D., Wang, K., Jaen, J.C. 2-Amino-4H-3,1-benzoxazin-4-ones as inhibitors of C1r serine protease. J. Med. Chem. 1998;41:1060-1067. 9. Augelli-Szafran, C.E., Jaen, J.C., Moreland, D.W., Nelson, C.B., Penvose-Yi, J.R., Schwarz, R.D. Identification and characterization of m4-selective muscarinic antagonists. BioOrg. Med. Chem. Lett. 1998;8:1991-1996. 10. Medina, J.C., Shan, B., Beckmann, H., Farrell, R.P., Clark, D.L., Learned, M., Roche, D., Li, A., Baichwal, V., Case, C., Baeurle, P., Rosen, T., Jaen, J.C. Novel antineoplastic agents with efficacy against multidrug resistant tumor cells. BioOrg. Med. Chem. Lett. 1998;8:2653-2656. 11. Augelli-Szafran, C.E., Blankley, C.J., Jaen, J.C., Moreland, D.W., Nelson, C.B., Penvose-Yi, J.R., Schwarz, R.D., Thomas, A.J. Identification and characterization of m1 selective muscarinic receptor antagonists. J. Med. Chem. 1999;42:356-363. 12. Plummer, J.S., Cai, C., Hays, S.J., Gilmore, J.L., Emmerling, M.R., Michael, W., Narasimhan, L.S., Watson, M.D., Wang, K., Nath, R., Evans, L.M., Jaen, J.C. Benzenesulfonamide derivatives of 2-substituted 4H-3,1-benzoxazin-4-ones and benzthiazin-4-ones as inhibitors of complement C1r protease. BioOrg. Med. Chem. Lett. 1999;9:815-820. 13. Shan, B., Medina, J.C., Santha, E., Frankmoelle, W.P., Chou, T.C., Learned, R.M., Narbut, M.R., Stott, D., Wu, P., Jaen, J.C., Rosen, T., Timmermans, P.B.M.W.M., Beckmann, H. Selective, covalent modification of β-tubulin residue Cys239 by T138067, an antitumor agent with in vivo efficacy against multidrug-resistant tumors. Proc. Nat. Acad. Sci. (USA) 1999;96:5686-5691. 14. Medina, J.C., Roche, D., Shan, B., Learned, R.M., Frankmoelle, W.P., Clark, D.L., Rosen, T., Jaen, J.C. Novel halogenated sulfonamides inhibit the growth of multidrug resistant MCF-7/ADR cancer cells. BioOrg. Med. Chem. Lett. 1999;9:1843-1846. 15. 
Tecle, H., Schwarz, R.D., Barrett, S.D., Callahan, M.J., Caprathe, B.W., Davis, R.E., Doyle, P., Emmerling, M., Lauffer, D.J., Mirzadegan, T., Moreland, D.W., Lipinski, W., Nelson, C., Raby, C., Spencer, C., Spiegel, K., Thomas, A.J., Jaen, J.C. CI-1017, a functionally M1-selective muscarinic agonist: design, synthesis and preclinical pharmacology. Pharm. Acta Helv. 2000;74(2-3):141-148. Patents/Inventorships (Selected Patents out of 41 US Patents and PCT Publications for Pending US Patents) Preparation of 2-substituted-4H-3,1-benzoxazin-4-ones and benzothiazin-4-ones as inhibitors of complement C1r protease for the treatment of inflammatory processes. Caprathe, B.W., Gilmore, J., Hays, S., Jaen, J.C.: US 5,652,237 (July 29, 1997). Method of imaging amyloid deposits. Caprathe, B.W., Gilmore, J.L., Hays, S., Jaen, J.C., LeVine, H.: US 6,001,331 (December 14, 1999). PPARg modulators. DeLaBrouse-Elwood, F., Chen, J-L., Cushing, T.D., Flygare, J.A., Houze, J.B., Jaen, J.C., McGee, L.R., Miao, S-C., Rubenstein, S.M., Kearney, P.C.: US 6,200,995 (March 13, 2001). Pyrimidine derivatives. Cushing, T.D., Mellon, H.L., Jaen, J.C., Flygare, J.A., Miao, S-C., Chen, X., Powers, J.P.: US 6,200,977 (March 13, 2001). HIV integrase inhibitors. Young, S.D., Egbertson, M., Payne, L.S., Wai, J.S., Fisher, T.E., Guare, J.P., Embrey, M.W., Tran, L., Zhuang, L., Vacca, J.P., Langford, M., Melamed, J., Jaen, J.C., Clark, D.L., Medina, J.C.: US 6,380,249 (April 30, 2002). Preparation of arylsulfonanilide amino acid derivatives. Rubenstein, S., Jaen, J.C.: US 6,153,585 (November 28, 2000). PHS 398/2590 (Rev. 05/01) Page ____20___ Number pages consecutively at the bottom throughout the application. Do not use suffixes such as 3a, 3b. Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. 
Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Waszkowycz, Bohdan Head of Computational Chemistry EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION University of Manchester, UK University of Manchester, UK DEGREE (if applicable) BSc PhD YEAR(s) FIELD OF STUDY 1983 Pharmacy 1990 Theoretical Chemistry NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. 1984-1985: Pharmacist, Withington Hospital, Manchester, UK 1985-1987: Pharmacist, Christie Hospital, Manchester, UK 1990-1993: Computational Chemist, Proteus Molecular Design Ltd, Stockport, UK 1993-1999: Group Leader, Computational Chemistry, Proteus Molecular Design Ltd, Macclesfield, UK 1999-2001: Group Leader, Computational Chemistry, Protherics Molecular Design Ltd, Macclesfield, UK 2001-present: Head, Computational Chemistry, Tularik Ltd, Macclesfield, UK B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. 1. B Waszkowycz, I H Hillier, N Gensmantel and D W Payling, Aspects of the Mechanism of Catalysis in Phospholipase A2. A Combined ab initio Molecular Orbital and Molecular Mechanics Study. J. Chem. Soc. Perkin Trans. 2, 1989, 1795. 2. D E Clark, D Frenkel, S A Levy, J Li, C W Murray, B Robson, B Waszkowycz and D R Westhead, PRO_LIGAND: An Approach to de Novo Molecular Design. 1: Application to the Design of Organic Molecules. J. Comput.-Aided Mol. Design, 1995, 9, 13. 3. 
B Waszkowycz, D E Clark, D Frenkel, J Li, C W Murray, B Robson and D R Westhead, PRO_LIGAND: An Approach to de Novo Molecular Design. 2: Design of Novel Molecules from Molecular Field Analysis (MFA) Models and Pharmacophores, J. Med. Chem., 1994, 37, 3994. 4. CW Murray, DE Clark, TR Auton et al. PRO_SELECT: Combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology. J. Comput.-Aided Mol. Des. 1997, 11, 193. 5. J Li, CW Murray, B Waszkowycz and SC Young. Targeted molecular diversity in drug discovery: integration of structure-based design and combinatorial chemistry. Drug Discovery Today 1998, 3, 105. 6. B Waszkowycz. New methods for structure-based de novo drug design in “Advances in Drug Discovery Techniques”, Ed. A L Harvey, Publ. Wiley 1998. 7. CA Baxter, CW Murray, B Waszkowycz et al. New approach to molecular docking and its application to virtual screening of chemical databases. J. Chem. Inf. Comp. Sci. 2000, 40, 254. 8. B Waszkowycz, TDJ Perkins, RA Sykes & J Li, Large scale virtual screening for lead discovery in the post-genomics era. IBM Systems J. 2001, 40, 360. PHS 398/2590 (Rev. 05/01) Page ____21__ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): 9. JW Liebeschuetz et al, PRO_SELECT: combining structure-based drug design and array-based chemistry for rapid lead discovery. 2. The development of a series of highly potent and selective factor Xa inhibitors. J. Med. Chem. 2002, 45, 1221. 10. B Waszkowycz, Structure-based approaches to drug design and virtual screening. Curr. Opin. Drug Discov. Devel. 2002, 5, 407. C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. 
Do not list award amounts or percent effort in projects. PHS 398/2590 (Rev. 05/01) Page __22_____ Biographical Sketch Format Page Principal Investigator/Program Director Ling, Bruce, Xuefeng, Ph.D.: BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Stephen C Young Head of Chemistry, Tularik Ltd EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) DEGREE (if applicable) INSTITUTION AND LOCATION Nottingham University (UK) Nottingham University (UK) BSc Hons PhD YEAR(s) 1981 1984 FIELD OF STUDY Chemistry Medicinal Chemistry NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. 1984 – 1986 Research Fellow, Edinburgh University (UK). 1986 – 1989 Senior Research Chemist, Merck Sharp and Dohme (Neuroscience Research Labs). 1989 – 1991 Sales and Marketing Manager, Novabiochem (UK) Ltd. 1992 – 1995 Experimental Facilities Manager, Proteus Molecular Design Ltd. 1996 – 2001 Synthetic Chemistry Section Head, Protherics Molecular Design Ltd. 2001 – 2003 Head of Chemistry, Tularik Ltd. 1981 – 2003 Member of the Royal Society of Chemistry. 1987 – 2003 Member of the American Chemical Society. B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. Papers: Neuroscience Letters, 1987, 80, 321-326. Senktide, a selective neurokinin B-like agonist, elicits serotonin-mediated behaviour following intracisternal administration in the mouse, A.J. Stoessl, C.T. 
Dourish, S.C. Young, S.D. Iversen and L.L. Iversen. Merck Sharp and Dohme Research Laboratories, Harlow, Essex, U.K. Peptides 1990, 313-5 (Eds. E. Giralt and D. Andreu). Counterion distribution monitoring: A novel method for acylation monitoring in solid phase peptide synthesis. S.C. Young, P.D. White, J.W. Davies, D.E.I.A. Owen, S.A. Salisbury and E.J. Tremeer. Novabiochem U.K. Ltd., Cambridge, U.K. Journal of Medicinal Chemistry, 1993, 36, 2-10. Cyclic peptides as selective tachykinin antagonists, B.J. Williams, N.R. Curtis, A.T. McKnight, J.J. Maguire, S.C. Young, D.F. Veber, R. Baker. Merck Sharp and Dohme Research Laboratories, Harlow, Essex, U.K. Veterinary Immunology and Immunopathology, 1996, 55, 243, Immunisation of rainbow trout Oncorhynchus mykiss with multiple antigen peptide system (MAPS). E.M. Riley, S.C. Young, C.J. Secombes. University of Aberdeen and Proteus Molecular Design Ltd., UK Journal of Computer-Aided Molecular Design 1997, 11, 193, PRO-SELECT: Combining combinatorial chemistry and structure-based drug design for rapid lead discovery. 1. Technology C.W. Murray, D.E. Clark, T.R. Auton, M.A. Firth, J. Li, B. Waszkowycz, D.R. Westhead, and S.C. Young, Proteus Molecular Design Ltd, Macclesfield, Cheshire, U.K. Drug Discovery Today 1998, 3, 105-112, Targeted molecular diversity in drug discovery – integration of structure-based design and combinatorial chemistry. Li, Jin; Murray, Christopher W.; Waszkowycz Bohdan; Young, Stephen C., Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK PHS 398/2590 (Rev. 05/01) Page ___23____ Biographical Sketch Format Page Principal Investigator/Program Director Ling, Bruce, Xuefeng, Ph.D.: Acta Crystallogr. 1999, C55, IUC9900072: 2-Amino-4-(methoxymethyl)-thiazole-5-carboxylic acid methyl ester, A.R. Kennedy, A.I. Khalaf, A.R. Pitt, M. Scobie, C.J. Suckling, J. Urwin, R.D. Waigh and S.C. Young J. Med. Chem. 
2000, 43, 3257-3266, DNA binding, solubility and partitioning characteristics of extended Lexitropsins. R.V. Fishleigh, K.R. Fox, A.I. Khalaf, A.R.Pitt, M. Scobie, C.J. Suckling, J. Urwin, R.D. Waigh and S.C.Young. Proteus Molecular Design Ltd, Macclesfield, Cheshire, U.K and University of Strathclyde, Glasgow, UK. Tetrahedron. 2000, 56, 5225-5239, The synthesis of some head to head linked DNA minor groove binders. A.I. Khalaf, A.R.Pitt, M. Scobie, C.J. Suckling, J. Urwin, R.D. Waigh, R.V. Fishleigh, W.A. Wylie and S.C.Young. Proteus Molecular Design Ltd, Macclesfield, Cheshire, U.K and University of Strathclyde, Glasgow, UK. Journal of Chemical Research-S 2000, 6, 264-265 Synthesis of novel DNA binding agents: indole-containing analogues of bis-netropsin. Khalaf, A.I.; Pitt, A.R.; Scobie, M.; Suckling, C.J.; Urwin, J.; Waigh, R.D.; Fishleigh R.V.; Young, S.C. Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK , and Univ Strathclyde, Glasgow G4 0NR, UK Drug Discovery & Development, April 2000, 34-38: Virtual screening speeds discovery. Young, Stephen; Li, Jin Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK Innovations in Pharmaceutical Technology (2000), 00(5), 24-28: Virtual screening of focused combinatorial libraries Young, S. ; Li, J. , Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK BioOrganic & Medicinal Chemistry Letters (2001), 11(5), 733-736, The design of phenylglycine containing benzamidine carboxamides as potent and selective inhibitors of factor Xa; Jones, S D ; Liebeschuetz, J W ; Morgan, P J ; Murray, C W ; Rimmer, A D ; Roscoe, J M E ; Waszkowycz, B ; Welsh, P M ; Wylie, W A ; Young, S C ; Mahler, J ; Martin, H., Brady ; L., and Wilkinson, K ; Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK and Bristol University J. Med. Chem. (2002), 45, 1221; PRO-SELECT: Combining structure-based drug design and array-based chemistry for rapid lead discovery. 2. 
The Development of a Series of Highly Potent and Selective Factor Xa Inhibitors. J. Liebeschuetz, S.D. Jones, J. Mahler, H. Martin, P.J. Morgan, C.W. Murray, A.D. Rimmer, J.M.E. Roscoe, B. Waszkowycz, P.M. Welsh, W.A. Wylie and S.C. Young, Protherics Molecular Design Ltd, Macclesfield, Cheshire, U.K. and Bristol University BioOrganic & Medicinal Chemistry Letters (2003 in Press), A Four Component Coupling Strategy for the Synthesis of D-Phenylglycinamide-Derived Non-Covalent Factor Xa Inhibitors, Scott M. Sheehan, John J. Masters, Michael R. Wiley, Stephen C. Young, John W. Liebeschuetz, Stuart D. Jones, Christopher W. Murray, Jeffrey B. Franciskovich, David B. Engel, Wayne W. Weber II, Jothirajah Marimuthu, Jeffrey A. Kyle, Jeffrey K. Smallwood, Mark W. Farmen, and Gerald F. Smith. Patents: EP818744 Process for selecting candidate drug compounds. C.W. Murray and S.C. Young. Proteus Molecular Design Ltd, Macclesfield, Cheshire, U.K. World Patent WO-9858952 Angiotensin derivatives. Glover, J F.; Rushton A.; Morgan P J.; Young S C. Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK World Patent WO-9911657 1-Amino-7-isoquinoline derivatives as serine protease inhibitors. Liebeschuetz, J.W.; Wylie, W.A.; Waszkowycz, B.; Young, S.C.; Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK US2002/0040144 1-Amino-7-isoquinoline derivatives as serine protease inhibitors. 
Liebeschuetz, J.W.; Wylie, W.A.; Waszkowycz, B.; Young, S.C.; Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK World Patent WO-9911658 meta-Benzamidine derivatives as serine protease inhibitors Liebeschuetz, J.W.; Wylie, W.A.; Waszkowycz, B.; Young, S.C.; Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK US 2002/0055522 meta-Benzamidine derivatives as serine protease inhibitors Liebeschuetz, J.W.; Wylie, W.A.; Waszkowycz, B.; Young, S.C.; Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK World Patent WO-0076970 Use of compounds as serine protease inhibitors Liebeschuetz, J W ; Lyons, A J ; Murray, C W ; Rimmer, A D, Young S.C., Camp, N.P., Jones S.D., Morgan, P.J., Richards, S.J., Wylie, W.A., Lively, S.E., Harrison, M.J., Waszkowycz, B., Masters, J.J., Wiley, M.J. Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285 World Patent WO-0076971 Compounds as serine protease (especially factor Xa) inhibitors useful as antithrombotic agents Liebeschuetz, J W ; Lyons, A J ; Murray, C W ; Rimmer, A D, Young S.C., Camp, N.P., Jones S.D., Morgan, P.J., Richards, S.J., Wylie, W.A., Masters, J.J., Wiley, M.J. Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285 World Patent WO-0077027 Compounds as serine protease (especially tryptase) inhibitors useful as antiinflammatory agents Liebeschuetz, J W; Young, S C; Lively, S E; Harrison, M J; Morgan, P.J; Waszkowycz, B; Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK World Patent WO-0196303 compounds as serine protease inhibitors Liebeschuetz, J W, Murray, C W., Young S.C., Camp, N.P., Jones S.D., Wylie, W.A., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. 
B., Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285 World Patent WO-0196304 compounds as serine protease inhibitors Liebeschuetz, J W, Murray, C W., Young S.C., Camp, N.P., Jones S.D., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. B., Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285 World Patent WO-0196323 compounds as serine protease inhibitors Liebeschuetz, J W, Murray, C W., Young S.C., Camp, N.P., Jones S.D., Wylie, W.A., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. B., Guzzo, P.R., Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285 World Patent WO-0196296 compounds as serine protease inhibitors Liebeschuetz, J W, Murray, C W., Young S.C., Camp, N.P., Jones S.D., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. B., Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285 C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. PHS 398/2590 (Rev. 05/01) Page ___25____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D.
BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Mario G. Cardozo Research Investigator EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION DEGREE (if applicable) YEAR(s) FIELD OF STUDY Faculty of Chemical Sciences, National University of Cordoba, Cordoba, ARGENTINA, B. Sc. in Pharmacy, 4/78 to 12/81, Pharmacy; Faculty of Chemical Sciences, National University of Cordoba, Cordoba, ARGENTINA, M. Sc. in Organic Chemistry, 12/81 to 3/83, Physical Organic Chemistry; Faculty of Pharmacy and Biochemistry, University of Buenos Aires, Buenos Aires, ARGENTINA, Ph. D. in Pharmacy, 4/83 to 12/87, Medicinal Chemistry; College of Pharmacy, University of Illinois at Chicago (UIC), Chicago, IL, Postdoctoral Research Associate, 05/89 to 06/92, Computer-aided drug design. NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. POSITIONS 07/92 to 12/96: Senior Scientist, Medicinal Chemistry Department, Boehringer Ingelheim Pharmaceuticals, Inc. 01/97 to 07/02: Principal Scientist, Molecular Modeling Laboratory-Structural Research Group, Boehringer Ingelheim Pharmaceuticals, Inc., 900 Ridgebury RD, Ridgefield CT 08/02 to Present: Tularik Inc, Department of Structural Biology, 1120 Veterans Blvd, South San Francisco, CA PHS 398/2590 (Rev. 05/01) Page ___26____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D.
FELLOWSHIPS AND AWARDS: 04/84 to 03/88: Graduate Student Fellowship (National Research Council, ARGENTINA). 05/89 to 04/91: Fogarty International Postdoctoral Fellowship (NIH-USA). 1998: Boehringer Ingelheim Vice President Golden Achievement Award. 1999: Boehringer Ingelheim Medicinal Chemistry Department Achievement Award. B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. 1. Cardozo, M.G., with Pierini, A.B., Montiel, A.A., Albonico, S.M., and Pizzorno, M.T., 1,3-dipolar Cycloaddition Reactions. Regioselective Synthesis of Heterocycles and Theoretical Studies. J. Heterocyclic Chem., 26, 1003 (1989). 2. Cardozo, M.G., with Hopfinger, A.J., Molecular Mechanics and Molecular Dynamics Studies of the Intercalation of Dynemicin-A with Oligonucleotide Models of DNA. Mol. Pharmacol., 40, 1023 (1991). 3. Cardozo, M.G., with Hopfinger, A.J., Iimura, Y., Sugimoto, H., and Yamanishi, Y., QSAR Analyses of the Substituted Indanone and Benzyl Piperidine Inhibitors of Acetylcholinesterase, J. Med. Chem. 35, 584 (1992). 4. Cardozo, M.G., with Hopfinger, A.J., Iimura, Y., Sugimoto, H., and Yamanishi, Y., Conformational Analysis and Molecular Shape Comparisons of a Series of Indanone Benzyl Piperidine Inhibitors of Acetylcholinesterase, J. Med. Chem. 35, 590 (1992). 5. Cardozo, M.G., with Hopfinger, A.J., Burke, B.J., Rowberg, K.L., and Koehler, M.G., New Methods in Molecular Shape Analysis to Identify and Characterize Active Conformations, in Second International Telesymposium Proceedings on QSAR, (ed. K. Kuchar), Prous Press, Barcelona, Spain, 1991. 6. Cardozo, M.G., with Kawakami, Y., and Hopfinger, A.J., Construction of QSARs from Ligand-DNA Intercalation Molecular Modeling Studies. "Nucleic Acid Targeted Drug Design", In Computer-Aided Drug Design Methods and Applications, vol 2. Ed. T. J. Perum and C.L. Propst, Marcel Dekker, Inc. New York. pp 151-193 (1992). 7.
Cardozo, M.G., with Hopfinger, A.J., A Model for the Dynemicin-A Cleavage of DNA Using Molecular Dynamics Simulation, Biopolymers 33, 377 (1993). 8. Cardozo, M.G., with Tong, L.; Jones P.-J.; and Adams, J., Preliminary Structural Analysis of the Mutations Selected by Non-Nucleoside Inhibitors of HIV-1 Reverse Transcriptase. Bioorganic & Medicinal Chemistry Letters, 3, 721 (1993). 9. Cardozo, M.G., with Hopfinger, A.J., and Kawakami, Y. Molecular Modelling of Ligand-DNA Intercalation Interactions. J. Chem. Soc. Faraday Trans. 1995, 91 (16) 2515-2524. 10. Cardozo, M.G., with Proudfoot, J., et al. Novel Non-nucleoside Inhibitors of Human Immunodeficiency Virus Type 1 Reverse Transcriptase. 4. J. Med. Chem. 1995, 38 (24) 4830-4838. 11. Cardozo, M.G., with Kelly, T.A., Proudfoot, J.R., McNeil, D.W., Patel, U.R., David, E., Farina, V., Hargrave, K.D., Grob, P., Agarwal, A., and Adams, J. Non-nucleoside Inhibitors of Human Immunodeficiency Virus Type 1 Reverse Transcriptase. 5. J. Med. Chem. 1995, 38 (24) 4839-4847. 12. Cardozo, M.G., with Betageri, R., et al. Phosphotyrosine-Containing Dipeptides as High-Affinity Ligands for p56lck SH2 Domain. J. Med. Chem. 1999, 42 (4) 722-729. 13. Cardozo, M.G., with Betageri, R., et al. Ligands for the Tyrosine p56lck SH2 Domain: Discovery of potent Dipeptide Derivatives with Monocharged, Nonhydrolyzable Phosphate Replacement. J. Med. Chem. 1999, 42 (10) 1757-1766. 14. Cardozo, M.G., with Last-Barney, K., Davidson, W., et al. Binding Site Elucidation of Hydantoin-based Antagonists of LFA-1 Using Multidisciplinary Technologies: Evidences for the Allosteric Inhibition of a Protein-Protein Interaction. J. Am. Chem. Soc. 2001, 123, 5643-5650. 15. Cardozo, M.G., with Proudfoot, J.R. et al. Non-peptidic, Monocharged, Cell Permeable Ligands for the p56lck SH2 Domain. J. Med. Chem.
2001, 44 (15) 2421-2431. 16. Cardozo, M.G., with Graham, E, and Jacober, S. A method for selecting compounds from a combinatorial or other chemistry libraries for efficient synthesis. J. Chem. Inf. Comp. Sci. 2001, 41 (6) 1508-1516. 20. Cardozo, M.G., with Snow, R. J., Morwick, T. M. et al. Discovery of 2-Phenylamine-imidazo [4,5h]isoquinolin-9-one: A New Class of Inhibitors of Lck Kinase. J. Med. Chem. 2002, 45 (16) 3394-3405. C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. PHS 398/2590 (Rev. 05/01) Page ___28____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Connors, Richard Victor Senior Chemistry Scientist EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION DEGREE (if applicable) YEAR(s) FIELD OF STUDY Laurentian University, Sudbury, Canada, B.Sc.(Hon), 1988, Biochemistry; University of Ottawa, Ottawa, Canada, Ph.D., 1994, Chemistry; Columbia University, Postdoc, 1996, Chemistry; Duke University, Postdoc, 1997, Chemistry. A. Positions and Honors. Positions and Employment 1997-1999 Research Scientist, Pharmacopeia Inc, Princeton, NJ. 1999-2001 Senior Scientist, Pharmacopeia Inc, Princeton, NJ. 2003- Senior Chemistry Scientist, Tularik, Inc, South San Francisco, CA. Honors 1986-1988 Dean’s Honor List, Laurentian University, Sudbury, Canada. 1988-1990 University of Ottawa Entrance Scholarship, Ottawa, Canada. 1990-1992 NSERC PGS-3 Predoctoral Scholarship, Ottawa, Canada. B. Peer-Reviewed Publications. Connors, Richard; Durst, Tony. Acyl cyanides as carbonyl heterodienophiles. Tetrahedron Letters (1992), 33(48), 7277-80. Breslow, Ronald; Connors, Richard V. Quantitative Antihydrophobic Effects as Probes for Transition State Structures. 1. Benzoin Condensation and Displacement Reactions. Journal of the American Chemical Society (1995), 117(24), 6601-2. Connors, Richard; Tran, Elisabeth; Durst, Tony. Acyl cyanides as carbonyl heterodienophiles: application to the synthesis of naphthols, isoquinolones, and isocoumarins. Canadian Journal of Chemistry (1996), 74(2), 221-6. Breslow, Ronald; Connors, Richard. Antihydrophobic Cosolvent Effects Detect Two Different Geometries for an SN2 Displacement and the Change to a Single-Electron-Transfer Mechanism in Related Cases. Journal of the American Chemical Society (1996), 118(26), 6323-6324. Breslow, Ronald; Connors, Richard; Zhu, Zhaoning. Mechanistic studies using antihydrophobic agents. Pure and Applied Chemistry (1996), 68(8), 1527-1533. Pirrung, Michael C.; Connors, Richard V.; Odenbaugh, Amy L.; Montague-Smith, Michael P.; Walcott, Nathan G.; Tollett, Jeff J. The arrayed primer extension method for DNA microchip analysis. Molecular computation of satisfaction problems. Journal of the American Chemical Society (2000), 122(9), 1873-1882. Pirrung, Michael; Connors, Richard; Odenbaugh, Amy; Montague-Smith, Michael; Walcott, Nathan; Tollett, Jeff. Arrayed primer extension on DNA microchips (APEX). Molecular computation of satisfaction (SAT) problems. Frontiers Science Series (2000), 30(Currents in Computational Molecular Biology), 20-21.
Pirrung, Michael C.; Odenbaugh, Amy L.; Connors, Richard V.; Worden, Janice D. Method of attaching a biopolymer to a solid support using bromoacetamidosilanes to functionalize the support. U.S. Pat. Appl. Publ. (2002), 13 pp. Connors, Richard V.; Zhang, Alex J.; Shuttleworth, Stephen J. Pictet-Spengler synthesis of tetrahydro-β-carbolines using vinylsulfonylmethyl resin. Tetrahedron Letters (2002), 43(37), 6661-6663. PHS 398/2590 (Rev. 05/01) Page ___30____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Gene Cutler Research Investigator EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION DEGREE (if applicable) YEAR(s) FIELD OF STUDY Cornell University, Ithaca, NY, BA, 1988-1992, Biology; University of California, Berkeley, CA, PhD, 1992-1997, Molecular and Cell Biology; Tularik Inc., Postdoc, 1998-2000, Biology. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. Positions and Employment 1997 – 1998 Postdoctoral Fellow, Molecular and Cell Biology Dept, University of California, Berkeley 1998 – 2000 Postdoctoral Fellow, Biology Dept, Tularik Inc. 2000 – 2002 Scientist, Bioinformatics Dept, Tularik Inc. 2002 – Research Investigator, Bioinformatics Dept, Tularik Inc.
Honors 1988 New York State Science Supervisors Association Award 1988 National Merit Scholarship Finalist 1988 Lavinia Wright Scholarship, OPEIU 1989 New York State Scholarship of Excellence 1989 Howard Coughlin Memorial Scholarship, OPEIU 1989-1992 Dean’s List, Cornell University 1991 Cornell Hughes Scholars Program 1992 Phi Beta Kappa Society Membership, Cornell Chapter 1992 National Science Foundation Graduate Fellowship 1992 Howard Hughes Medical Institute Predoctoral Fellowship B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. • Goodrich JA, Cutler G, and Tjian R. Contacts in context: promoter specificity and macromolecular interactions in transcription. Cell, 1996 Mar 22, 84(6):825-30. • Cutler G, Perry K, and Tjian R. Transcription factor Adf-1 contains a TAF-binding myb motif as part of a non-modular activation domain. Molecular and Cellular Biology, 1998 Apr; 18(4):2252-61. • An S, Cutler G, Zhao JJ, Huang SG, Tian H, Li W, Liang L, Rich M, Bakleh A, Du J, Chen JL, Dai K. Identification and characterization of a melanin-concentrating hormone receptor. Proceedings of the National Academy of Sciences USA, 2001 Jun 19; 98(13):7576-81. • Li S, Liao J, Cutler G, Hoey T, Hogenesch JB, Cooke MP, Schultz PG, Ling XB. Comparative analysis of human genome assemblies reveals genome-level differences. Genomics, 2002 Aug; 80(2):138-9. • Li S, Cutler G, Liu JJ, Hoey T, Chen L, Schultz PG, Liao J, Ling XB. A Comparative Analysis Of HGSC and Celera Human Genome Assemblies and Gene Sets. Bioinformatics, in press. C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application.
Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. Ongoing Research Support Tularik Inc 2000 – present Role: Co-Investigator Design microarray experiments and analyze resulting data in experiments to probe the activities of novel bioactive compounds, the effects of ectopically expressed genes, and the effects of knock-out genes in animals and tissue-culture systems. These experiments probe a variety of pathways related to cancer, disorders of the immune system, and metabolic disorders. Tularik Inc 2002 – present Role: Co-Investigator Analyze human protein kinase sequences to better predict which kinases will bind a given substrate analog. Completed Research Support Tularik Inc 2000 – 2002 Role: Co-Investigator Perform exhaustive sequence analysis of Human genome sequence to identify and classify novel G Protein-Coupled Receptors and Protein Kinases. Tularik Inc 2000 – 2003 Role: PI Design and develop a suite of tools for storing, retrieving, manipulating, and analyzing microarray data. This includes development of novel algorithms for microarray data normalization and gene expression data clustering. Tularik Inc 2002 Role: Co-Investigator Perform analysis on human genome assemblies from different sources, comparing the development of these assemblies over time. PHS 398/2590 (Rev. 05/01) Page ____32___ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow this format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Jane Liu Bioinformatics specialist EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) 
INSTITUTION AND LOCATION Beijing Medical University Santa Clara University, CA DEGREE (if applicable) M.D. M.S. YEAR(s) 1992-1997 2000-2002 FIELD OF STUDY Medicine Computer Engineering NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. Positions and Employment 2002 – Bioinformatics specialist, Bioinformatics Dept, Tularik Inc. Honors B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. Li S, Cutler G, Liu J (Co-first author), Hoey T, Chen L, Schultz PG, Liao J, Ling XB. 2003. A Comparative Analysis Of HGSC and Celera Human Genome Assemblies and Gene Sets. Bioinformatics in press. C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. Ongoing Research Support Tularik Inc 2003 – present Role: Co-Investigator Develop machine learning algorithms e.g. neural network and HMM to associate biological sequences with Gene Ontology terms. Completed Research Support Tularik Inc 2002 Role: Co-Investigator Perform comparative analysis on human genome assemblies from HGSC and Celera Genomics and their associated gene sets. Study the evolvement of these assemblies over time. PHS 398/2590 (Rev. 
05/01) Page ____33___ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow this format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Zheng (Sam) Pan Scientist EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION DEGREE (if applicable) YEAR(s) FIELD OF STUDY Fudan Univ., Shanghai, China, BS, 1987, Biology/Plant Physiology; Fudan Univ., Shanghai, China, MS, 1990, Biochemistry/Plant Physiology; Univ. of Massachusetts, Amherst, MA, USA, Ph.D., 1997, Fungal Genetics; Harvard Medical School, Boston, MA, USA, Post Doc, 2000, Leukemia, cancer. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. 1992-1997 Research Assistant, Dept. of Microbiology, Univ. of Massachusetts, Amherst, MA 1997-2000 Post-doc/Research fellow, Harvard Institute of Medicine/Harvard Medical School, Boston, MA 2000-2002 Scientist, DoubleTwist, Inc, Oakland, CA 2002- Scientist, Tularik Inc, South San Francisco, CA B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. Pan Z, Zhou LM, Hetherington CJ, et al. Hepatocytes contribute to soluble CD14 production, and CD14 expression is differentially regulated in hepatocytes and monocytes, J BIOL CHEM 275 (46): 36430-36435 NOV 17 2000 Schwer H, Liu LQ, Zhou LM, Pan Z, et al. Cloning and characterization of a novel human ubiquitin-specific protease, a homologue of murine UBP43 (Usp18), GENOMICS 65 (1): 44-52 APR 1 2000 Pan Z, Zhou L, Hetherington CJ, et al.
Differential regulation of human CD14 expression in monocytes and hepatocytes. BLOOD 94 (10): 1644 Part 1 Suppl. 1 NOV 15 1999 Libermann TA, Pan Z, Akbarali Y, et al. AML1 (CBF alpha 2) cooperates with B cell-specific activating protein (BSAP/PAX5) in activation of the B cell-specific BLK gene promoter, J BIOL CHEM 274 (35): 24671-24676 AUG 27 1999 Pan Z, Hetherington CJ, Zhang DE, CCAAT/enhancer-binding protein activates the CD14 promoter and mediates transforming growth factor beta signaling in monocyte development, J BIOL CHEM 274 (33): 23242-23248 AUG 13 1999 Pan Z, Hetherington CJ, Zhang DE, Regulation of CD14 gene expression during monocytic differentiation by C/EBPs., BLOOD 92 (10): 2875 Part 1 Suppl. 1 NOV 15 1998 C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. Ongoing Research Support Tularik Inc 2002-present Role: Investigator Designed and developed tools and oligos for cancer-related amplicon discovery in the human genome. Completed Research Support Tularik Inc 2002-2003 Role: Investigator Perform exhaustive sequence analysis of the human genome sequence with Hidden Markov Models to identify and classify novel phosphatases and proteases. PHS 398/2590 (Rev. 05/01) Page ___35____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2.
Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE King, Brian D. President, Life Code, Inc. EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION DEGREE (if applicable) YEAR(s) FIELD OF STUDY Michigan State University, B.S., 1989, Computer Science. NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample. A. Positions and Honors. Positions and Employment 2003 President, Life Code, Inc. 2002-2003 Contractor, Sun Microsystems 1999-2002 Senior Software Architect, DoubleTwist, Inc. 1998-1999 Contractor, Oregon Dept. of Transportation 1997-1998 Contractor, Hewlett-Packard 1997-1998 Contractor, Pangea Systems 1996-1997 Contractor, Kaiser Permanente 1995-1996 Contractor, Strategic Concepts Corp. 1994-1995 Software Engineer, IA Corp. 1989-1994 Staff Programmer, IBM Professional Memberships 2001-2003 Interoperable Informatics Infrastructure Consortium (I3C) B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. PHS 398/2590 (Rev. 05/01) Page ____36___ Biographical Sketch Format Page Ling, Bruce, Xuefeng, Ph.D. Principal Investigator/Program Director (Last, first, middle): BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person.
DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Lukes, Melissa, Ann Database Administrator EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION DEGREE (if applicable) YEAR(s) FIELD OF STUDY California State University Hayward, BS, 1985, Biology, Minor Computer Science. NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. Position and Employment 1985-1988 Network Manager at NASA AMES Research Center, Sterling Software, Palo Alto, CA 1988-1990 Bioanalyst, Syntex Research, Palo Alto, CA 1990-1991 Sr. Analysis Programmer, Syntex Research, Palo Alto, CA 1991-1993 Technical Analyst, Syntex Research, Palo Alto, CA 1993-1994 Sr. Technical Analyst, Syntex Research, Palo Alto, CA 1994-1996 Systems Project Manager, Syntex Research, Palo Alto, CA 1996-1999 Data Manager/Analyst, Mercator Genetics/Progenitor, Menlo Park, CA 1999- Database Administrator, Tularik Inc, South San Francisco, CA Other Experience and Professional Memberships 1985- NOCOUG, Northern California Oracle Users Group Membership Honors 1988 Ozone Hole Participation Award, NASA AMES Research Center 1994 Chairman's Recognition Award for Individual Effort, Syntex Research B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. Elizabeth Kunysz, Douglas W. Bonhaus and Melissa Lukes, 'Bar-code technology and a Centralized database: Key components in a Radioligand Binding High Throughput Screening Program', Accepted for publication by Packard 1996 Marshall B.
Wallach, Tim Maslyn, Melissa Lukes, and Ronald Rhodes, 'Automated Sample Preparation and Dispensing For High Throughput Screening Assays', Syntex Discovery Research. Proceedings International Symposium on Laboratory Automation and Robotics, 1994 p474 Maureen Laney, Ronnel Cabuslay, Ronald Rhodes, Melissa Lukes, and Randall Schatzman, 'A fully Automated Assay of in vitro Enzyme Activity by Continuous Kinetic Measurement', Syntex Discovery Research. Proceedings International Symposium on Laboratory Automation and Robotics, 1994 p485 PHS 398/2590 (Rev. 05/01) Page ___37____ Biographical Sketch Format Page Ling, Bruce, Xuefeng, Ph.D. Principal Investigator/Program Director (Last, first, middle): BIOGRAPHICAL SKETCH Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES. NAME POSITION TITLE Richard K. Porter Senior Software Engineer EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.) INSTITUTION AND LOCATION Temple University, Phila., PA Temple University, Phila., PA Brown University, Providence, RI UC Santa Cruz Extension, Santa Clara, CA DEGREE (if applicable) YEAR(s) FIELD OF STUDY B.S. M.S. 1968 1971 1969 - 1972 1993 - 1994 Chemistry Organic Chemistry Phys. Org. Chem. C Prog. Language, and Advanced C SQL*Forms V3 to Oracle Forms 4.5: New Feature Basic Java Programming Intro to Jdeveloper MDL ISIS/Direct Molecules v2.0 Designing with Visual Basic 6 Oracle 9i ODTUG JDeveloper Certificate of Training Oracle Corp, Redwood Shores, CA 1998 Oracle Corp, Redwood Shores, CA 2000 MDL Information Systems, San Leandro, CA 2002 Foothill College, Los Altos, CA 2002 Oracle Development Tools User Group, Las Vegas, NV 2002 NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. 
Follow the formats and instructions on the attached sample. A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee. 1978-1979 Scientist/Analytical Chemistry, Lockheed Missiles and Space Company, Palo Alto, CA 1979-1984 Senior Scientist/Analytical Chemistry, Lockheed Missiles and Space Company, Palo Alto, CA 1984-1990 Senior Member of the Technical Staff/Project Mgr., Space Applications Corp., Sunnyvale, CA 1990-1992 Senior Software Engineer, Advanced Software Resources, Santa Clara, CA 1993-1996 Senior Analyst/Programmer (contract), Syntex/Roche Bioscience, Palo Alto, CA 1996-1997 Senior Systems Analyst/Database Programmer (contract), CareAmerica Compensation, Burlingame, CA 1997-2000 Senior Analyst/Programmer (contract), Quantum Corporation, Milpitas, CA 2001- Senior Software Engineer, Tularik Inc., South San Francisco, CA B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation. Nature of the carbonium ion. VIII. Cycloalkyl cations from thiocyanate isomerizations Langley A. Spurlock, Richard K. Porter, Walter G. Cox; J. Org. Chem.; 1972; 37(8); 1162-1168. C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. PHS 398/2590 (Rev. 05/01) Page ___38____ Biographical Sketch Format Page Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D. C. Research Support.
List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects. Ongoing Research Support Tularik Inc, 2001- SQUID, Select Query Inventory Data, Project Manager/Analyst/Designer Inventory Automation: automation and visualization of compound inventory processes supporting the inventory staff, chemists, and biologists, including support for material for HTS assays and Structure Activity Projects. Completed Research Support Tularik Inc, 2000-2001 COI, Compounds of Interest, Project Manager/Analyst/Designer Analysis and Reporting of Compound Activity: automation and visualization of compound activity across all projects and assays, giving the history of activity of each compound as well as a statistical summary of compound assay activity. Tularik Inc, 1999-2000 Automation of Assay Data Analysis, Project Manager/Analyst/Designer Automation of assay data analysis and reporting for High Throughput Assays, including data storage. Mercator Genetics/Progenitor Inc, 1996-1998 Genotyper, Project Manager/Analyst/Designer Automation of data analysis and querying of human genotyping data for an asthma project. Worked with researchers to implement a company-critical genotyping database, including integration and automation of analysis programs and visualization tools. Mercator Genetics/Progenitor Inc, 1998 Mutation Detection, Project Analyst/Designer Worked with researchers to design and begin implementing a mutation detection database. Syntex Research, Inc, 1991-1995 Robotic Implementation and Data Analysis, Project Manager/Designer/Analyst Designed and developed automated statistical systems required to handle the volume of data generated from robotic systems.
Designed and developed inventory and assay robotic system processes for High Throughput Screening.

BIOGRAPHICAL SKETCH
Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES.

NAME: Pan, Zhiyu
POSITION TITLE: Scientific Programmer

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION | DEGREE (if applicable) | YEAR(s) | FIELD OF STUDY
Drexel University | M.S. | 1993 | Computer Science
Beijing Polytechnic University | B.S. | 1985 | Computer Software

A. Positions and Honors.

Positions and Employment
1985-1989     Software Engineer, National Laboratory of Pattern Recognition, Beijing, CHINA
1989-1991     Programmer, Dept. of Physiology, University of Pennsylvania, Philadelphia, PA
1992-1993     Programmer, HEM Pharmaceuticals Corporation, Philadelphia, PA
1993-1995     Software Engineer, Amiable Technologies Inc., Philadelphia, PA
1996-1997     Senior Member Technical Staff, NYMA Inc, JPL, NASA, Pasadena, CA
1997-1998     Software Engineer, National Semiconductor, Santa Clara, CA
1998-2000     Software Engineer, Triada Ltd., Foster City, CA
2000-2001     Sr. Software Engineer, E-Compare Corp., San Jose, CA
2000-2002     Sr. Software Engineer, Brokat Technologies, San Jose, CA
2002-present  Scientific Programmer, Tularik Inc., South San Francisco, CA

B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation.

C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application.
Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects.

BIOGRAPHICAL SKETCH
Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES.

NAME: Epic Junjie Ding
POSITION TITLE: Software Engineer

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION | DEGREE (if applicable) | YEAR(s) | FIELD OF STUDY
Stanford University | MS | 1999-2001 | Computer Science
University of Wisconsin-Madison | | 1998-1999 | Engineering Physics
Tsinghua University, P.R.China | BS | 1993-1998 | Engineering Physics

A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee.

Positions and Employment
2001 –         Software Engineer, Research Informatics Dept, Tularik Inc.
1999 – 2001    Research Assistant, Medical School, Stanford University

B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation.

C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects.
Ongoing Research Support

Tularik Inc, 2003 – present
Role: System Architect
Evaluate, integrate, and customize third-party drug discovery software in the areas of compound inventory, registration, assay data management and reporting, and chemo-informatics software in the areas of compound library generation, analysis and compound property calculation, with in-house developed software.

Completed Research Support

Tularik Inc, 2001 – 2003
Role: Database Architect, System Architect
Design, develop, and maintain a drug discovery data management web application suite to keep track of enterprise electronic data creation, collection, analysis, modification, storage, and reporting for each phase of drug discovery. This includes the design of a corporate database to provide storage for drug discovery data and for the information needed for validation, authorization, and signature of data operations. J2EE is the main technology being used. XML is widely used in the system.

Stanford University, 1999 – 2001
Role: Developer and Administrator
Design and develop a suite of tools for storing, retrieving, manipulating and analyzing gene sequences and gene trees. This includes the design of an Oracle relational database for storage and of algorithms for gene tree analysis. The project was done for the Genetics Dept., Stanford University.

BIOGRAPHICAL SKETCH
Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES.

NAME: Jayanthi Subramani
POSITION TITLE: Software Engineer

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION | DEGREE (if applicable) | YEAR(s) | FIELD OF STUDY
Government College Of Technology, Coimbatore, India | B.S. | 1996 | Electronics and Communication Engineering
SSI Systems, Chennai, India | Post Graduate Diploma | 1997 | Relational Database Management Systems
San Jose State University, San Jose, CA | M.S. | 2001 | Computer Engineering

NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample.

A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee.

SOFTWARE ENGINEER, Tularik Inc., South San Francisco, CA (Mar 2002 - current)
SOFTWARE INTERN, Tularik Inc., South San Francisco, CA (Sep 2001 - Feb 2002)
SOFTWARE PROGRAMMER, Turbo Information Systems, Chennai, India (Dec 1996 - May 1998)

B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation.

SMIL-based graphical interface for Interactive TV: The paper was accepted for and presented at the Internet Imaging TV Conference at IS&T/SPIE’s Electronic Imaging 2003.

C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects.

Discovery Framework: Designed and implemented a comprehensive, easy-to-use and easily reconfigurable software platform to automate gathering, storing, organizing and analysing data for the purpose of drug discovery. The technologies used are Java (EJB, JMS, Servlets, JSP, Applets, JDBC), Oracle, and XML.
BIOGRAPHICAL SKETCH
Provide the following information for the key personnel in the order listed for Form Page 2. Follow the sample format for each person. DO NOT EXCEED FOUR PAGES.

NAME: Kaveri Charati
POSITION TITLE: Software Engineer

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION | DEGREE (if applicable) | YEAR(s) | FIELD OF STUDY
K.L.E College of Engineering and Technology, Belgaum, India | BS | 1996 | Electrical and Electronics Engineering
San Jose State University, San Jose, CA | MS | 2001 | Computer Engineering

NOTE: The Biographical Sketch may not exceed four pages. Items A and B (together) may not exceed two of the four-page limit. Follow the formats and instructions on the attached sample.

A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List any honors. Include present membership on any Federal Government public advisory committee.

1996-1998  Lecturer, Motichand Lengade Bharatesh Polytechnic, Belgaum, India
2000-2001  Graduate Assistant, San Jose State University, San Jose, CA
2001-2001  Software Engineer, iCommerce, San Jose, CA
2002-2003  Software Engineer, Tularik Inc, South San Francisco, CA

B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in preparation.

C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and nonfederal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award amounts or percent effort in projects.
Project: Assay Registration
Role: Software Engineer
Currently developing an application to register assays (experiments) for High Throughput Screening (HTS) and Structure Activity Relationship (SAR) projects. The registration information gathered includes data mining parameters, data analysis configuration parameters and screening logistics. Through this application scientists can register new assays, edit and update registered assays, promote an assay to the next stage, etc.
Environment: Jakarta-tomcat, Oracle 8 RDBMS, JSP, Servlet, XML
Operating System: Red Hat Linux 7.1

Project: Structural Biology
Role: Software Engineer
Designed and developed a web application that provides a central repository for protein preparation data and keeps track of the different stages in protein preparation. Through this user-friendly web application, users can add requests to obtain proteins, get status updates on their protein preparation via email, etc. The application also supports different types of users with different privileges.
Environment: J2EE- JBoss, Jakarta-tomcat, Oracle 8 RDBMS, JSP, Servlet, XML
Operating System: Red Hat Linux 7.1

Project: Security Server
Role: Software Engineer
Designed and implemented a security system to handle authentication and integrated it with the current architecture. This project aimed at developing a dedicated server to handle authentication and security privileges. A centralized intra-web site was designed and developed to provide users with a common interface to access various projects. The authentication and privilege information was maintained in LDAP and Oracle.
Environment: J2EE- JBoss, Jakarta-tomcat, Oracle 8 RDBMS
Operating System: Red Hat Linux 7.1

Project: Search
Role: Software Engineer
Designed and developed a scalable and user-friendly application that allows scientists to mine drug discovery data generated at various stages (high throughput screening, structure activity relationship, lead optimization, etc.). The application retrieves data based on a list of compounds provided as input. XML is used to configure data fields for each assay.
Environment: Jakarta-tomcat, Oracle 8 RDBMS, JSP, Servlet, XML
Operating System: Red Hat Linux 7.1

Project: XML database
Role: Software Engineer
Designed an XML Schema to accommodate scientific data, which can be classified as assay based, process based, and protocol based. The aim of this project was to gather data in an XML format according to the DTD to keep processing and retrieval overhead minimal. This included designing the XML database to store XML documents. An open source XML database (eXist) was used as the XML repository. XML Authority and XML Spy were used to develop the schema.
Operating System: Mandrake Linux 8.2

Project: ISIS
Role: Software Engineer
Designed Oracle tables and views to be used by ISIS. This involved analyzing the existing Oracle tables to extract the information to be viewed in ISIS, and creating new tables and views to allow scientists to connect to the backend data through the ISIS interface.

Project: SAR Application
Role: Software Engineer
Designed and developed an application for Structure Activity Relationship (SAR) projects using Java and XML. The SAR data was maintained in XML format. SAS was used to analyze and plot data to provide a graphical analysis to the users.

RESOURCES

FACILITIES: Specify the facilities to be used for the conduct of the proposed research.
Indicate the performance sites and describe capacities, pertinent capabilities, relative proximity, and extent of availability to the project. Under “Other,” identify support services such as machine shop, electronics shop, and specify the extent to which they will be available to the project. Use continuation pages if necessary.

Laboratory: The Tularik Research Division has granted full access to its state of the art labs, robotic equipment and computing infrastructure for the proposed research in the area of drug discovery informatics. Worldwide, the company employs 400 people, 85 percent of whom are engaged in research and development. We currently have five subsidiaries, located in the U.S. and Europe. Tularik has fully equipped laboratories for biology, chemistry and pharmacology, an extensive library, and complete wiring for electronic communication and data transmission. All the research related data generated from the different laboratories will be managed through the Discovery informatics platform.

Clinical:

Animal:

Computer: The Tularik Research Informatics Department has a state of the art supercomputer, Linux clusters, and various Sun and SGI workstations. All these computing facilities, Oracle/MySQL servers and various statistics tools can be fully accessed for the proposed research in the areas of algorithm development, high throughput genome scale annotation and data repository.

Office: Tularik has sufficient office space and equipment to support the activities for the proposed research.

Other: Over the years, Tularik has developed novel algorithms, database schemas and data flow architectures. All these utilities and proprietary databases, such as internal genome databases, microarray databases and amplicon databases, can be utilized for algorithm development and data analysis.

MAJOR EQUIPMENT: List the most important equipment items already available for this project, noting the location and pertinent capabilities of each.
Equipment | Location | Capabilities
Linux cluster and parallel computing environment | Tularik Inc | High throughput data analysis
Paracel super computer | Tularik Inc | High throughput HMM data analysis
Oracle/MySQL database server | Tularik Inc | Data repository
Genome Server | Tularik Inc | Mirror all the public domain genome data and encapsulate Tularik proprietary genomic content
Various robotic equipment | Tularik Inc. | High throughput screening and combinatorial chemistry
Wet lab facilities | Tularik Inc | Allow all the state of the art molecular and cellular lab work

Research Plan

This proposal contains proprietary information.

A. Specific Aims

The genomics revolution and other rapid advances in technologies, such as combinatorial chemistry, high throughput drug screening, and computer aided drug design, demand efficient high throughput data management and powerful computing application support. Operating at the crossroads of biomedical research and computing innovation, the Tularik Research Informatics team has pioneered, within the pharmaceutical industry, the integration of cutting edge enterprise technologies including J2EE, Microsoft .NET and Linux clusters to meet the ever-increasing scale and complexity of discovery research. The Tularik Discovery Informatics platform (http://discovery.tularik.com) has been a successful prototype, enabling a powerful, flexible infrastructure that promotes workflow efficiency in a high throughput, collaborative discovery environment. In order to continue this development such that the Discovery Informatics platform can ultimately be generalizable, scalable, extensible and interoperable, we are proposing the following approaches.

1. Architect scalable and robust high throughput enterprise computing infrastructures.
The Java 2 platform, Enterprise Edition (J2EE), Microsoft .NET and high throughput/performance computing (HTC/HPC) technologies will be made interoperable to build a state of the art Discovery Informatics platform.

2. Integrate various standalone robotics applications into networked automated Discovery pipelines.
Thanks to technological innovations, robotics and automation are now essential in various stages of the drug discovery process. The Discovery Informatics platform will integrate robotic vendor proprietary software through Microsoft .NET Web Services to automate the inter-robotic data management and mechanical operations.

3. Automate high throughput discovery workflows.
The Tularik Discovery Informatics platform has automated the data analysis and data management in the areas of array-based comparative genomic hybridization, high throughput screening (HTS), structure activity relationship (SAR) and ADMET. Additional machine learning algorithms and visualization modules will be developed to automatically extract knowledge, e.g. novel compound structural motifs, from large-scale bioassay databases. The Discovery Informatics platform will integrate computational chemistry approaches for parallel drug lead optimization of potency, selectivity, and ADMET properties.

4. Integrate in silico drug lead seeking, explosion and optimization processes into the high throughput Discovery platform.
Integrate ligand or receptor based virtual screening algorithms into the Discovery platform to increase the throughput for drug lead seeking, explosion and optimization. Algorithms, including proper compound filters, will be developed to create a 1 billion-member virtual screening library.

5. Standardize the informatics data flow and implement interoperable service-oriented computing architecture.
Tularik will work with I3C (Interoperable Informatics Infrastructure Consortium) to adopt and enact proper standardizations for data flow in the areas of genomics, biological pathway, compound acquisition, compound inventory, lead discovery and optimization. The current Tularik Discovery platform hosts various XML based J2EE and .NET distributed applications, providing a solid foundation to extend to the service-oriented computing architecture.

6. Establish industrialized software configuration management (SCM) mechanisms for application build and deployment.
The Discovery platform has evolved to ensure code portability, robust builds and easy deployment to Tularik’s worldwide campuses and relevant research communities. The Discovery platform will continue to improve through the utilization of open source standards and applications.

These developments will make the Tularik Discovery Informatics platform generalizable, scalable, extensible and interoperable for the entire biomedical research community.

B. Background and Significance

J2EE, Microsoft .NET enterprise platforms and their implications in pharmaceutical research informatics

Both the J2EE (Java 2 Platform, Enterprise Edition) and Microsoft .NET platforms offer cutting edge scalable technologies to simplify multitier distributed application architecture and Web Service development. J2EE is platform neutral and has “Write Once, Run Anywhere” portability, while .NET is restricted to the Windows platform. Pharmaceutical research enterprises have very dynamic use cases and requirements, demanding fast informatics development and deployment to gain a competitive advantage. J2EE and Microsoft .NET offer enterprise solutions that promise to scale up data management, deliver robust business services, and accelerate the discovery process.
Robotics and industrialized drug discovery process

Modern high throughput methodologies and robotics symbolize the industrialization of the drug discovery process. Large investments in robotic instruments are justified not only on the basis of labor cost savings, operational precision and higher throughput, but also by the financial impact of shortening drug discovery time lines. Almost all robotics come with some form of scheduling software in addition to the required system management software. Most, if not all, of the robotic applications are dependent on the Microsoft Windows platform. Interoperability between different robotics can be difficult due to the proprietary nature of the vendor software and the distribution of robotic applications across PC workstations with different Windows versions. All of these may create operational bottlenecks once robotics from different vendors are required to work interactively.

High throughput discovery data management

The publication of the human genome sequence and the advance of biomedical and robotic technology have significantly increased the volume and complexity of drug discovery data. In the area of target identification and validation, microarray applications have become an indispensable and routine process. Modern combinatorial chemistry (CC) allows the automated explosion of a large number of compounds across many drug discovery programs. Compound acquisition, registration, receiving, inventory and distribution are very complex processes, yet data transactions need to be robust, flexible and real time. With the advent of modern high throughput bioassays and robotics, HTS platforms are capable of screening more than ten thousand compounds per day. Thus, developing comprehensive customized informatics packages for discovery research has become a formidable endeavor.
There are commercially available software packages for various types of discovery applications, including GeneSpring, Spotfire, ActivityBase, MDL Information Systems, Oxford Molecular, Tripos, and Accelrys Accord Enterprise. In spite of the promises from the vendors, none of these packages can supply a complete solution. Highly complicated and dynamic situations require high levels of cooperation between the various teams that produce and consume information and those that develop or integrate the software applications. Every consideration should be given to ease of use and flexibility, but inevitably the flexibility desired by the users must be carefully balanced against both the flexibility allowed by the automated assay systems and the IT cost of maintaining overly complex software.

Targeted molecular diversity in drug discovery: in silico lead seeking, explosion and optimization

In silico screening of combinatorial libraries prior to synthesis promises to be a valuable aid to lead discovery (Lyne 2002). Although still an evolving approach, virtual screening (VS) can serve as a complementary approach to experimental screening (HTS).

Figure 1. Compound library diversity analysis and lead explosion. The molecules in the full library are marked as circles according to two computational molecular descriptors. For a molecule to be active, it must lie in a grey area on the graph. The red circles indicate the elements of a diverse sublibrary. Lead explosion: The gold circles show how a directed sublibrary can be expanded around an active library member.

When coupled with structural biology, virtual screening has emerged as an efficient, cost-effective means of identifying lead molecules (Figure 1). Broadly speaking, virtual screening can be classified into either ligand-based or receptor-based categories.
Ligand-based methods extract the common structural motif (Mestres and Knegtel 2000), similar pharmacophore (Mason et al. 2001), and 3D shape (Srinivasan et al. 2002) from the known active compounds to screen for additional compounds with similar properties. The receptor-based approach docks the compound library to the predetermined target structure and prioritizes compounds according to the quality of the fit to the target-binding site. Advances in high throughput computing infrastructure have made it technically feasible for computational chemistry to analyze large databases of chemical compounds for lead seeking, explosion and optimization.

Standardization and interoperability

Industrialized drug discovery strongly depends on information exchange. Automated robotic applications in the fields of chemical synthesis, biological assay, drug metabolism, and even protein crystallography have resulted in an explosion of data. The discovery data sets are very difficult to relate to each other because they can come from heterogeneous sources, in many formats and file types, and are typically dispersed across incompatible IT systems (Attwood 2000). The problem of bringing together heterogeneous and distributed systems is known as the interoperability problem. The current approach to achieving data interoperability is mainly to write ad-hoc data interface programs for each pair of communicating systems, resulting in bottlenecks due to the continuous development and maintenance of these programs. Ontology mediated document exchange (Ashburner et al. 2000; Pouliot et al. 2001) provides a standard conceptual specification and a solution for this scalability problem. The data model of the source contents is aligned with the representation specified by the ontological standardization and transformed accordingly in both directions.
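The alignment step described above can be illustrated with a minimal Python sketch. It is not taken from the Tularik platform: the field names, the ontology terms, and the compound identifier are all hypothetical, chosen only to show how each system declares one mapping to a shared vocabulary and how records are then transformed in both directions.

```python
# Hypothetical field names and ontology terms, chosen only for illustration.
# Each source system declares how its local fields align with shared terms.
INVENTORY_TO_ONTOLOGY = {
    "cmpd_id": "compound.identifier",
    "mw": "compound.molecular_weight",
}
ASSAY_TO_ONTOLOGY = {
    "CompoundNo": "compound.identifier",
    "IC50_uM": "assay.ic50_micromolar",
}


def to_ontology(record, mapping):
    """Rename a source record's fields to the shared ontology terms."""
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def from_ontology(document, mapping):
    """Rename shared ontology terms back to one system's local field names."""
    reverse = {term: field for field, term in mapping.items()}
    return {reverse[term]: value for term, value in document.items() if term in reverse}


def exchange(record, source_mapping, target_mapping):
    """Move a record between two systems by way of the shared representation."""
    return from_ontology(to_ontology(record, source_mapping), target_mapping)


# A made-up inventory record, converted to the shared form, to the assay
# system's view, and back to the inventory system's own field names.
inventory_record = {"cmpd_id": "T-0001", "mw": 312.4}
shared = to_ontology(inventory_record, INVENTORY_TO_ONTOLOGY)
assay_view = exchange(inventory_record, INVENTORY_TO_ONTOLOGY, ASSAY_TO_ONTOLOGY)
round_trip = from_ontology(shared, INVENTORY_TO_ONTOLOGY)
```

In this sketch the assay system receives only the fields it has a term for, so the molecular weight is dropped from `assay_view`. The design point is the one made in the text: with N systems, one mapping per system replaces up to N(N-1) directed pairwise interface programs.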
The W3C standard for semi-structured data representation, XML (eXtensible Markup Language), has become the industry standard for data exchange among web-based enterprise applications. In the life science field, the Interoperable Informatics Infrastructure Consortium (I3C) was formed in 2001 to promote global, vendor-neutral informatics solutions to accelerate discovery and product development.

Web Services

Web Services provide a standard means of interoperating between different software applications, running on a variety of platforms and/or frameworks. Web Services promise to greatly increase interoperability and ease data exchange even as they lower costs. It is expected that the impact of Web Services on the IT industry will be profound. Specifically:
1). By lowering the cost of software integration between systems, Web Services offer a way to maintain and integrate legacy IT systems at a lower cost than typical Enterprise Application Integration (EAI) efforts.
2). By allowing software running on different platforms to communicate, Web Services enable interoperability between multiple platforms running on everything from mainframes to servers to desktops to PDAs.
3). By employing universal, nonproprietary standards, Web Services dramatically lower the IT costs of collaborating with external partners, vendors and clients.

Software configuration management

One of the key issues in enterprise software development is how to manage the software code repository. Any software configuration management (SCM) scheme requires a central system and robust architecture for tracking and deploying software packages throughout their entire lifecycle. Specifically, the SCM should:
1). Maintain source code under revision control.
2). Manage code dependencies and third-party library dependencies.
3). Manage builds and build dependencies.
4). Manage dependencies on third-party libraries.

Enterprise applications built on J2EE and Microsoft .NET demand a comprehensive SCM, which requires the integration of proper code management and revision tools, and a carefully planned source code tree.

C. Preliminary Studies

This section summarizes the preliminary studies for each of the Specific Aims. The detailed design and procedures for the results are described in the next section, “Research Design and Methods”.

Specific Aim 1: Architect scalable and robust high throughput enterprise computing infrastructures.

Over the past two years, Tularik has built and deployed several Linux farms for different applications: a 440-processor cluster for high throughput computational chemistry, a 150-processor cluster for genome-scale bioinformatics computing, and a 100-processor cluster for J2EE enterprise applications (Figure 2).

Old vs. New: time taken to BLAST raw mouse genomic sequence reads against the human genome database

Input | Darwin (SGI Irix 6.5) | Linux Cluster (RedHat 7.1)
1 mouse sequence | 1 minute | 10 seconds
1000 mouse sequences | 15 hours | 3 minutes
All mouse genomic sequence reads at NCBI (22 million reads) | 38 years | 34 days

Note: Darwin is a heavily used machine, so this is not a machine to machine comparison. However, it does accurately reflect the environment in which these computers are used.

Figure 2. Left: Linux cluster benchmark. Right: High availability of the database clusters.

The high throughput computing for computational biology and chemistry was set up through PBS (Portable Batch System, http://www.openpbs.org), which improved utilization of overall computing resources (CPU) from less than 20% to over 90%. These computing environments significantly strengthen Tularik’s scientific computing capability in the areas of computational chemistry (Waszkowycz et al.
2001) and functional genomics (Li et al. 2003; Li et al. 2002). The distribution and clustering of nodes for J2EE enterprise applications were managed through application servers including BEA WebLogic and open source JBOSS (Figure 3).

Figure 3. Tularik J2EE Discovery Architecture. Diagram labels include Data Central, JMS Reader, Application Controller Cluster, Oracle Server, XSLT Transformer, Central Controller, XML Objects, Client, Security Server, Web Server, LDAP and XML Rule Server.

Specific Aim 2: Integrate various standalone robotic applications into networked automated Discovery pipelines.

Figure 4. Integration of liquid handler, LCMS, HPLC robotics instruments into automated high throughput combi-chem synthesis pipeline. Instrument panels: Tecan Genesis RSP 100 (library synthesis; fraction pooling); Biotage Parallex Flex 2-Channel Prep-HPLC (two channels scalable to four in parallel; each channel independently controlled; monitored at 220 and 254 nm simultaneously); PE Sciex High-Throughput Analytical HPLC (teamed with a Gilson 215 Liquid Handler; confirmation of product mass and purity). Workflow labels include SubMaster, VialMaster, BioMaster, SynMaster, SciexMaster, Tecan-Program, VB-Program, Hardware and Spreadsheet.

Most robotic instruments are not interoperable with each other, performing unique applications on vendor proprietary platforms and incompatible computing environments. Thus, workflow integration of functionally relevant robotics has largely remained labor-intensive manual work, impeding operational productivity.
As a pilot program, Tularik has successfully completed robotic workflow integration in the area of combinatorial chemistry. As shown in Figure 4, a centralized software program and the associated robotic interfacing drivers have been developed to integrate the robots as components of an automated workflow pipeline, enabling automated synthesis, purification, and quantitation in the combi-chem process.

Specific Aim 3: Systemize the high throughput discovery workflows to reveal “knowledge” from the raw data.

As summarized in Figure 5, the Tularik Discovery Informatics platform has:
1). Established the genomics infrastructure integrating Ensembl (Clamp et al. 2003; Hubbard et al. 2002) utilities and MySQL databases.
2). Automated the data management in the areas of array-based comparative genomic hybridization, HTS, SAR and ADMET.

Rational Rose, Borland Together Control Center, and Microsoft .NET ARCHITECT have been used to design and architect the Discovery informatics platform using the Unified Modeling Language (UML). Both the XML and Oracle databases have been rigorously designed to enforce data integrity on the various business entities, their associated properties, and their relationships in the areas of druggable target identification and validation, compound acquisition and inventory, and bioassays (HTS, SAR and ADMET).

Figure 5. Discovery Informatics platform use cases and enterprise services. (Diagram titled “Types of Data, Links, Applications”. Areas shown: Data Management, MicroArray, Structural Biology, Personnel, Legal, Acquisition, Genome, Clinical, Virtual Screening, Storage, Targets, Identification, Validation, HTS, Reports, Assays, Chemistry, Compounds, Inventory, SAR, ADMET, In vivo, Processes. Items listed: T-number database; biological results from HTS; equipment scheduling; chemical characterization; inventory; biological assays; CYP, solubility/permeability, cytotoxicity; PK; pharmacology; special PK experiments.)

A successful prototype, the Tularik Discovery Platform (Figure 6), has been developed based upon the J2EE architecture (Figure 3) and the high throughput computing Linux clusters (Figure 2). Using high throughput screening or Structure Activity Relationship analysis as examples, data generated by robotic readers are stored via a SAMBA server directly on the data central server, which has a rigorous backup mechanism for disaster recovery. http://discovery.tularik.com is the centralized gateway to the Discovery platform application suites, enterprise services and data sets. User logins are authenticated through the LDAP server and the security application server. The user profiling process is flexible enough to distinguish different user types, and it dynamically configures the web interface to authorize different categories of applications based upon the user’s predefined privileges. Since efficient data analysis and advanced data visualization demand powerful and flexible application support, the 100-CPU Linux cluster has been dedicated to powering live web transactions and Web Services. Third party software such as ISIS can be used to open a window onto the corporate database to provide additional data analysis and visualization capabilities.

Figure 6. Discovery Platform: Informatics gateway. (Diagram labels: Corporate Database, Inventory, Data Analysis, Raw Data upload, Legacy Database, Results, Official Report, HVIEW, Reconfiguration, SQL upload, Point To Point, ISIS/HOST, Live Data Tracking, Bruno Reports, View Browser.)

TMAX (Tularik MicroArray eXplorer, Figure 7), an essential component of the Tularik Discovery platform, is Tularik’s in-house microarray data storage and analysis software solution. TMAX has satisfied the needs of Tularik’s scaled-up microarray operation in the areas of gene expression and genomic amplification/deletion research.

Figure 7. TMAX has a flexible, technology-independent design that handles data in a variety of formats including Incyte, Affymetrix, Scanalyze, Genepix, Rosetta, Motorola, and simple spreadsheets. TMAX comprises 12 front-end applications plus a database server and several administrator support tools. Currently, TMAX contains more than 14 million data points. (Panels: TMAX ChipViewer, TMAX ChipCluster, TMAX ChipPlotter, TMAX GenomeScan.)

Specific Aim 4: Integrate in silico drug lead seeking, explosion and optimization processes into the high throughput Discovery platform.

Tularik has acquired and integrated the Protherics (Li et al. 1998; Liebeschuetz et al. 2002; Waszkowycz et al. 2001) virtual screening technology as a cost-effective approach to enable in silico lead seeking, explosion and optimization (Figure 8). The recent setup of the 440-CPU Linux cluster has solved the computing deficit challenge: the computational cost has been reduced to the order of one minute of CPU time per ligand per processor.

Figure 8. Tularik Protherics virtual screen platform.

Specific Aim 5: Standardize the informatics data flow and implement interoperable service-oriented computing architecture.

Based upon careful use case study (Figure 5), the Discovery Informatics platform has standardized on a unified data model to encapsulate discovery data sets. The architecture (Figure 3) relies heavily on XML and Java technologies and tools. Through the use of a unified Discovery Java class library and XML serialization of objects, the Discovery platform provides a common data model and a common data exchange format.
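
The XML serialization of objects mentioned above can be illustrated with a minimal round trip. The sketch below is written in Python rather than the platform's Java, and the record type AssayResult, its fields, and the XML element names are all hypothetical; the actual Discovery class library and DTD are not reproduced in this proposal.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class AssayResult:
    """Hypothetical record; the field names are illustrative only."""
    compound_id: str
    assay: str
    ic50_nm: float


def to_xml(result: AssayResult) -> str:
    """Serialize one record to an XML string (assumed element names)."""
    elem = ET.Element("assayResult", {"compoundId": result.compound_id})
    ET.SubElement(elem, "assay").text = result.assay
    ET.SubElement(elem, "ic50nM").text = repr(result.ic50_nm)
    return ET.tostring(elem, encoding="unicode")


def from_xml(text: str) -> AssayResult:
    """Rebuild the record from XML produced by to_xml."""
    elem = ET.fromstring(text)
    return AssayResult(
        compound_id=elem.attrib["compoundId"],
        assay=elem.findtext("assay"),
        ic50_nm=float(elem.findtext("ic50nM")),
    )


if __name__ == "__main__":
    original = AssayResult("T-0001", "HTS-kinase", 125.0)
    document = to_xml(original)
    print(document)
    # The object survives the trip through the exchange format unchanged.
    assert from_xml(document) == original
```

Because every component reads and writes the same document shape, a component in another language only needs an equivalent pair of functions to take part in the exchange.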
Central to the Discovery platform are a Discovery XML Document Type Definition (DTD) and a corresponding Java Object Model, which facilitate data exchange, data integration and data transformation between components. Based upon this lightweight architecture, the entire Discovery platform has been designed and configured through the Discovery XML Document Type Definition. The Discovery architecture has allowed significant system flexibility, enabled rapid system integration and application interoperability, and eased the burden of schema and use case evolution.

Microsoft Windows has been widely used to host various in-house or vendor-specific applications, including most robotic management software. Interoperability between Windows applications and the mainframe J2EE Discovery platform needs to be solved in order to scale up and completely automate the Discovery informatics operations. Web Services, a cutting edge technology, provide a solution for this purpose. Both the Microsoft .NET and J2EE platforms support Web Services, enabling SOAP over HTTP for XML-based data transactions and services. Resources and efforts have been allocated to pilot the adoption of Microsoft .NET technology to integrate Windows applications as Web Services into the J2EE Discovery platform (Figure 9).

Figure 9. Web Services promise interoperability across application platforms. (Diagram labels: Window Apps .NET Platform, J2EE Discovery Platform, XML, SOAP over HTTP, Internet.)

The pilot project has been designed and implemented to launch a bioassay reporting Web Service. Specifically, upon a service request from the J2EE Discovery platform, a .NET Windows server extracts the compound 2D chemical structure via integrated MDL Windows applications and combines the compound structure and the assay data points into a comprehensive Microsoft Office Excel report.
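
To make the SOAP over HTTP exchange in the bioassay reporting pilot more concrete, the following Python sketch builds a SOAP 1.1 request envelope of the kind the J2EE platform might send to the .NET server. Only the SOAP 1.1 envelope namespace is standard; the operation name GetAssayReport, the namespace urn:example:bioassay-report, and the parameter names are assumptions made for illustration, since the pilot's actual service contract is not given here.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
REPORT_NS = "urn:example:bioassay-report"  # hypothetical service namespace


def build_report_request(compound_id: str, assay_name: str) -> bytes:
    """Return a SOAP 1.1 envelope requesting a report (assumed operation)."""
    ET.register_namespace("soap", SOAP_ENV)
    ET.register_namespace("rpt", REPORT_NS)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    request = ET.SubElement(body, f"{{{REPORT_NS}}}GetAssayReport")
    ET.SubElement(request, f"{{{REPORT_NS}}}compoundId").text = compound_id
    ET.SubElement(request, f"{{{REPORT_NS}}}assayName").text = assay_name
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def soap_http_headers(action: str) -> dict:
    """HTTP headers conventionally sent with a SOAP 1.1 POST."""
    return {
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": f'"{action}"',
    }


if __name__ == "__main__":
    payload = build_report_request("T-0001", "HTS-kinase")
    print(payload.decode("utf-8"))
    print(soap_http_headers(REPORT_NS + "/GetAssayReport"))
```

The sketch stops short of sending the request; in practice the payload and headers would be posted to the endpoint published by the .NET server.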
The success of this pilot development helped to finalize our decision to introduce Microsoft .NET as a secondary mainframe platform and to use Web Services to address the interoperability issues in the Discovery operations.

Specific Aim 6: Establish industrialized software configuration management (SCM) mechanisms for application build and deployment.

The open source Concurrent Versions System (CVS) has been introduced in house to manage software code versioning and code sharing among project team members. The open source tools Ant and NAnt have been introduced to automate the nightly build and deployment for the J2EE and .NET Discovery platforms, respectively. During Discovery platform development, reticulate interdependencies arose within the source code tree once the tree grew beyond a certain level of complexity. This package interdependency problem has been carefully examined and resolved. The source code tree has since been restructured, and code management policies have been established to govern the application development and check-in process so as to avoid circular code interdependencies.

D. Research Design and Methods

Specific Aim 1: Architect scalable and robust high throughput enterprise computing infrastructures.

Rationale: As a result of the prototyping work on the Discovery platform (http://discovery.tularik.com), we have accumulated significant experience and technical know-how in the areas of J2EE and Microsoft .NET enterprise application architecture, XML-based data transaction and validation, and Linux cluster high throughput environment setup. The goal of this specific aim is to create an architecture that consolidates the different enterprise technologies into a mature Discovery informatics platform.
Design and Methods: Although both are powered by the Linux operating system, the J2EE cluster and the PBS (Portable Batch System) configured cluster are currently functionally related but mechanistically separate units of the Discovery platform. Integrating the high throughput computing PBS farm with the high performance computing J2EE farm will harness the advantages of both systems, boost productivity and streamline the operational process. To achieve this, we will deploy a J2EE application server on the master node of the PBS farm and develop a Java driver object to wrap the PBS computing job management functionalities. As a result, the PBS farm master node will become one application node in the J2EE farm, and thus the entire PBS farm can be harnessed upon computational request from the J2EE central controller. An alternative approach is to use the J2EE application server on the PBS farm master node to wrap the PBS job management utilities into a Web Service. In principle, either approach should enable interoperability between the two computing platforms.

Based upon our in-house .NET prototyping efforts, the Microsoft .NET architecture implementation has proven to be robust, scalable and interoperable. Since Tularik has decided to switch from the MAC to the Microsoft Windows platform, .NET technology will provide add-on value to Tularik if leveraged properly during this desktop transition. With the potentially large number of Windows-based machines available after the MAC to Windows transition, a new high throughput computing resource, a Microsoft .NET cluster, can emerge. Microsoft .NET will integrate these PC processors into a high throughput computing cluster for applications that require parallelism, fault tolerance, and load balancing.
Thanks to the relatively homogeneous Windows computing environment, the integration overhead will be compensated by simplicity, ease of Windows programming, immediacy of execution, and desktop integration.

The Discovery Informatics platform depends heavily on XML for data processing and data transaction. We will integrate a native XML database into the Discovery platform. After careful evaluation of the XML databases available on the market, we have chosen eXist, an open source native XML database, to support Discovery XML transactions. The eXist database features efficient, index-based XPath query processing, extensions for keyword search, and tight integration with existing XML development tools. The database is lightweight, written entirely in Java, and may be easily deployed in a number of ways: running as a stand-alone server process, inside a servlet engine, or directly embedded in an application. We believe the integration of the eXist XML database will significantly improve the scalability of the Discovery Informatics platform.

Advantages and limitations: Although J2EE, .NET, high throughput and high performance computing (HTC/HPC), and native XML database technologies promise to offer a competitive edge, coping with heterogeneous computing platforms and environments may impose significant operational and integration overhead and challenges. We will focus our initial efforts on scaling up the Microsoft .NET integration of mission-critical Windows applications into the Discovery Informatics platform.

Specific Aim 2: Integrate various standalone robotics applications into networked automated Discovery pipelines.

Rationale: Most robotics driver applications remain proprietary, Win32 dependent, and not interoperable, giving rise to bottlenecks in operation and application integration.
The Discovery Informatics platform will integrate the standalone third party vendor software through Microsoft .NET Web Services to automate the inter-robotic data management and mechanical operations.

Figure 10. .NET robotics integration architecture schematic. (Diagram labels: User Interface, Application Module, VB defined Procedures, C# Objects, Application Layer, ActiveX/TCP/IP Link, Robotics Device Drivers, Driver Layer, RS232/485 and IEEE488 Link, Web Services, Devices, XML DB, Hardware Layer, Oracle Servers.)

Design and Methods: The Microsoft .NET framework will be deployed to the PCs hosting the robotics Windows applications (Figure 10). The .NET applications will integrate the three main components of the robot-specific platforms: the robot-specific application, the hardware modules of the mechanical system, and the module-level software. The .NET integration will enable the robotic instruments to provide Web Services that allow live data transactions and interoperation with different enterprise applications and equipment. To assemble a series of different types of robotic equipment into a pipeline fulfilling specific discovery needs, one master program will be developed to coordinate the various Web Services exposed by the different robotic .NET servers.

Specific Aim 3: Systemize the high throughput discovery workflows to reveal “knowledge” from the raw data.

Rationale: The Tularik Discovery Informatics platform has significantly automated data analysis and data management in the areas of array-based comparative genomic hybridization, HTS, SAR and ADMET. The total amount of data continues to explode because of the high throughput nature of these processes. An important goal of the Discovery platform is to assist scientists in analyzing the raw data, thereby revealing new “knowledge” that might otherwise have been missed.
Design and Methods: Knowledge discovery is defined as “the non-trivial extraction of implicit, unknown, and potentially useful information from data” (Frawley et al. 1991). Often this information is not retrievable by standard techniques but is uncovered through the use of Artificial Intelligence (AI) techniques. The Discovery informatics platform will provide user-friendly interface and visualization modules to integrate data mining technologies, including various multivariable classification, linear or non-linear regression, expert system, and machine learning methods. Most likely these tools and modules will be acquired through third party licenses. Middleware software drivers and data exchange utilities need to be developed to bring these technologies into the Discovery platform so that the average scientist is empowered to use them for data mining purposes. The Discovery platform has integrated commercial MATLAB and SAS toolboxes as mathematical computation, analysis, visualization, and algorithm development utilities for the automated SAR and ADMET data flow. In addition, a database will be modeled to encapsulate the entire data sets. The exploratory analysis will select appropriate descriptors, which relies on a clear understanding of the scientific problem that one is trying to solve. Patterns that can lead to reasonable predictions will be discovered using relevant descriptors and managed in the knowledge Oracle database. Once a pattern has been validated, de-convolution or data visualization technologies are required to translate abstract patterns, such as neural network patterns, into a form on which scientists can take chemical or biological action. Although the knowledge discovery process depends heavily on similar statistical approaches, there is a lack of commercially available solutions
suitable for browsing through large lists of data and enabling simultaneous visual inspection and interrogation of various aspects of the data. The Discovery Informatics platform will integrate algorithms developed in house or obtained from commercial vendors such as MDL, Tripos, and Accelrys to computationally derive compound physicochemical and ADMET properties that contribute to the knowledge database. These tools will be available as protocols that run upon request. Further, the implementation will allow the computational chemist to update the models as more validation studies become available. In silico computed compound properties will be loaded into the Discovery database and used as part of the multiparametric optimization drug discovery strategy.

Advantages and limitations: Due to the extreme complexity of the drug discovery process, caution should be exercised in the process of knowledge discovery. It is true that studies of HTS data can discover knowledge, e.g. the compound structural patterns that are responsible for the bioactivities. However, at the start of one’s data mining efforts, it is not known whether such knowledge is present in the database, whether it can be effectively used, or even whether patterns can be reasonably extracted. The issue is not a lack of good computational science but a matter of not having enough underlying data. The paradox of predictivity versus diversity can arise: the greater the diversity of the data set, the smaller the chance that models with predictive power can be uncovered; on the other hand, the information content of a model (if it exists) increases as the boundaries of the space and the diversity of the subjects under investigation increase. Many ADMET models are based upon small sets of chemical compounds (from tens to hundreds) and are thus frequently regarded as not significant by potential users. A similar situation also exists with empirical observations.
That is why the value or utility of the Lipinski “Rule of Five” (Lipinski 2000) has been questioned by many medicinal chemists. Since most of these in silico efforts have not been well validated due to lack of data, the project team will coordinate so that the results from these data sets are appropriately integrated.

Specific Aim 4: Integrate in silico drug lead seeking, explosion and optimization processes into the high throughput Discovery platform.

Rationale: Considering the total lead-like molecular space, the percentage occupied by compounds that current technologies have made and screened is quite limited. Virtual screening on Linux clusters has made it possible to screen compounds that do not exist in the corporate inventory. In this proposal, the goal is to generate a 1 billion member virtual screening library to extend database diversity and availability for the processes of in silico lead seeking, explosion and optimization.

Design and Methods: This 1 billion member virtual screening (VS) library will be generated using a computational approach. The quality and usefulness of the virtual screening library depend on the library diversity, the compounds’ ADMET properties, and synthetic chemistry accessibility. We will define the list of scaffolds and reagents as the basis for the virtual library construction. The compiled database should contain reagent lists covering the majority of relevant reagent classes. These reagents will be stored in a 3D processed form, which will be available for routine library construction. Drug-like scaffolds will be identified and compiled from the literature or from our combinatorial chemistry experience. The first round library will be built based on A+B reactions.
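
The scale of an A+B library follows directly from the sizes of the two reagent lists, which a short sketch can make explicit. The Python code below is illustrative only: the reagent identifiers are placeholders, the string join stands in for real reaction chemistry (which would be performed by chemoinformatics software), and the function names are hypothetical.

```python
from itertools import product
from math import isqrt


def enumerate_ab(a_reagents, b_reagents):
    """Lazily yield one label per (A, B) pairing; nothing is held in memory."""
    for a, b in product(a_reagents, b_reagents):
        yield f"{a}+{b}"


def library_size(n_a: int, n_b: int) -> int:
    """Number of products from n_a A-reagents and n_b B-reagents."""
    return n_a * n_b


def reagents_per_side(target: int) -> int:
    """Smallest n such that an n x n A+B library has at least `target` members."""
    n = isqrt(target)
    return n if n * n >= target else n + 1


if __name__ == "__main__":
    for label in enumerate_ab(["A1", "A2"], ["B1", "B2", "B3"]):
        print(label)
    print(library_size(2, 3))
    print(reagents_per_side(10**9))
```

Under these assumptions, a single two-component reaction needs roughly 31,623 reagents on each side to reach one billion products, which suggests why broad reagent lists and lazy (streamed) enumeration matter at this scale.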
Sybyl diversity analysis tools will be evaluated to explore any particular properties applicable to the library design. Compounds that violate Lipinski’s “Rule of Five” (Lipinski 2000) will be flagged (but not excluded). A database of viable scaffolds and routes appropriate for parallel chemistry will be established. Software applications will be developed to integrate various chemoinformatics filters to construct and store the drug-like libraries. A graphical interface will be developed to facilitate enumeration and sampling of the library, which will allow modelers to easily evaluate, optimize, and build subset libraries when required. Because of the large size of this targeted VS library, the current virtual screening data flow should be updated with additional capability for screening virtual libraries on the 1 billion compound scale: the goal is to access large, synthetically accessible libraries for docking or pharmacophore search. The 1 billion member virtual screening library, if built as projected, will significantly expand the lead-like or drug-like molecular space and improve the overall in silico lead seeking, explosion and optimization processes.

Specific Aim 5: Standardize the informatics data flow and implement interoperable service-oriented computing architecture.

Rationale: We intend to define an interoperable service-oriented architecture and workflow that will allow applications to be provided in whatever language and on whatever platform is most appropriate, while ensuring that applications can intercommunicate seamlessly and be managed and assembled with minimal effort.

Figure 11. A service based Discovery architecture. (Diagram service groups include: Repository Services, Orchestration Services, Discovery Services, Publishing Services, Versioning Services, Presentation Services, Logical Services, Lifecycle Services, Component Services, MetaData Services, Data Integration Services, External Services, Remote Services, Database Services, Local Services, Core Services. Data domains shown: Gene Expression; Gene sequence; Patents; References; Compounds; Reactions; Bio assay (HTS, SAR); ADMET.)

Design and Methods: The Discovery Informatics platform has focused on design approaches, processes, and application tools supporting the concept that large software systems can be assembled from independent, reusable collections of functionality. Some of the functionalities, such as compound handling and analysis, genomics data mining etc., may already be available, either implemented in house or acquired from a third party, while the remaining functionalities may need to be created. We will implement a service-oriented architecture (Figure 11) to bring all of these elements together into a single, coherent whole. Each service will provide access to a well-defined collection of functionalities and interact with other services. The J2EE and .NET frameworks make the implementation of this architecture feasible through Web Services.
For example, robotics services have to improve performance, availability and scalability by coordinating functionality executing on a collection of distributed hardware. Handling of the service provider, requestor, locator and broker roles will leverage the open source I3C Life Science Identifier Resolution (LSIR) scheme using Web Services. The LSID URN format template and examples are as follows:

urn:LSID:<AuthorityID>:<NamespaceID>:<ObjectID>[:<RevisionID>]

Examples:
urn:LSID:ebi.ac.uk:SWISS-PROT/accession:P34355:3
urn:LSID:rcsb.org:PDB:1D4X:22
urn:LSID:ncbi.nlm.nih.gov:GenBank/accession:NT_001063:2
urn:LSID:ibm.com:rowenfsdb:DAC8266B-9B9E-4CD3-853F-7DB764F9D2D3:1

LSID client software will resolve access to data objects named using the LSID format by discovering the network location of the LSID resolution service using a combination of the Dynamic Delegation Discovery System (DDDS) standard, DNS Naming Authority Pointer (NAPTR) records, the DNS SRV standard and, finally, a number of web service interfaces. The grant co-applicant and I3C member Brian King is the author of the LSIR implementation and will contribute to the integration of LSIR into the Tularik Discovery Informatics platform.

Our prototype http://discovery.tularik.com and its associated applications have been built as a component-based architecture, relying heavily on XML and Java technologies and tools. We intend to leverage this previous development by wrapping components into Web Services. The effective use of XML as a serialization syntax for Java objects and as the inter-service exchange format is key to the transition from a component-based to a service-based design and architecture. We intend to share a common data model and a common data exchange format throughout the platform via ubiquitous use of a unified XML and Java data model for data exchange and persistence.
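
The LSID URN template given above maps naturally onto a small parser. The following Python sketch splits an identifier into the template's four fields; the LSID class and the parse_lsid function are hypothetical names introduced for illustration, the prefix is compared case-insensitively as an assumption, and the sketch performs only syntactic parsing, not the DDDS/DNS resolution described in the text.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Fields of urn:LSID:<AuthorityID>:<NamespaceID>:<ObjectID>[:<RevisionID>]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn: str) -> LSID:
    """Split an LSID URN into its fields, raising ValueError if malformed."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {urn!r}")
    scheme, nid, authority, namespace, object_id = parts[:5]
    if scheme.lower() != "urn" or nid.lower() != "lsid":
        raise ValueError(f"not an LSID URN: {urn!r}")
    if not (authority and namespace and object_id):
        raise ValueError(f"empty field in LSID: {urn!r}")
    # The revision is optional; an absent or empty revision becomes None.
    revision = (parts[5] or None) if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    print(parse_lsid("urn:LSID:ebi.ac.uk:SWISS-PROT/accession:P34355:3"))
    print(parse_lsid("urn:LSID:rcsb.org:PDB:1D4X"))
```

A client would typically parse the identifier first and then use the authority field as the starting point for locating the resolution service.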
This shared data model allows the freedom to use different languages without risking errors due to impedance mismatch between data models. Since XML-based data transformation can be done with robust standard components and tools, XML and Java validation eliminates errors early in development and improves data quality.

Advantages and limitations: The current Tularik Discovery platform hosts various XML-based J2EE and .NET distributed applications, providing a solid foundation on which to extend to the service-oriented computing architecture. Data modeling and workflow standardization with the open source and other research communities will be important for developing interoperable platforms. Tularik will work with the I3C (Interoperable Informatics Infrastructure Consortium) to adopt and enact proper standards for the data flows involved in the areas of genomics, biological pathways, compound acquisition, compound inventory, lead discovery and optimization. Orchestration between the various teams in an open source project or consortium can be effort- and time-consuming. This overhead cost needs to be budgeted in the informatics operations to ensure the timeliness of the final delivery.

Specific Aim 6: Set up industrialized software configuration management (SCM) mechanisms for application build and deployment.

Rationale: Tularik has established a well-structured source tree of J2EE and .NET code versioned in a CVS repository. Nightly build and deployment are managed through Apache Ant and NAnt. Build and deploy operations have to scale up and remain robust given the ever-increasing number and variety of application servers. Tularik’s current worldwide campuses and operations demand global informatics support, including the capability to deploy enterprise solutions globally.
Design and Methods: Both J2EE and .NET application servers allow “hot” deployment without interruption of the enterprise services and applications. One application server will be dedicated to providing the application publishing Web Service, which nightly checks out code remotely from the CVS repository server, builds the software packages, and deploys them to the relevant machines worldwide using SOAP over HTTP through the Internet or the intranet. The .NET or J2EE platform servers host the application assembler Web Service, which accepts the packages and, after validation according to the predefined contract, deploys them into the application server environment.

Figure 12. Application publishing and assembler Web Services. (Diagram labels: Code, CVS Code Repository, check out, Application Publishing Service, Application Assembler Service, XML, SOAP over HTTP, Intranet, Internet, Window Apps .NET Platform, J2EE Discovery Platform.)

Data and software sharing:

Academic License Agreement: All software, design methods, and analysis protocols developed through the funding of this grant will be made publicly available for free use by biomedical researchers in academic universities and institutions. The software is provided on an “as is” basis for non-commercial research purposes. Please do not distribute the software, or any portion or derivative thereof, beyond the academic organization. We are providing the software without warranties and with no provisions for support or future enhancements. Please note that Tularik Inc. and its employees have no liability in connection with the use of the software.
Commercial License Agreement: Commercial or corporate use of the relevant information and utilities requires a signed license agreement from Tularik Inc. For the appropriate forms and detailed instructions for licensed use of the software packages, please contact Terry Rosen, [email protected].

E. Human Subjects
N/A

F. Vertebrate Animals
N/A

G. Literature Cited
Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-29.
Attwood, T. K. 2000. Genomics. The Babel of bioinformatics. Science 290: 471-473.
Clamp, M., D. Andrews, D. Barker, P. Bevan, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen et al. 2003. Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res 31: 38-42.
Frawley, W. J., G. Piatetsky-Shapiro and C. Matheus. 1991. Knowledge Discovery In Databases: An Overview. In Knowledge Discovery In Databases. AAAI Press/MIT Press.
Hubbard, T., D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down et al. 2002. The Ensembl genome database project. Nucleic Acids Res 30: 38-41.
Li, J., C. W. Murray, B. Waszkowycz and S. C. Young. 1998. Targeted molecular diversity in drug discovery: integration of structure-based design and combinatorial chemistry. DDT 3: 105-112.
Li, S., G. Cutler, J. J. Liu, T. Hoey, C. Chen, P. G. Schultz, J. Liao and X. B. Ling. 2003. A comparative analysis of HGSC and Celera human genome assemblies and gene sets. Bioinformatics In press.
Li, S., J. Liao, G. Cutler, T. Hoey, J. B. Hogenesch, M. P. Cooke, P. G. Schultz and X. B. Ling. 2002. Comparative analysis of human genome assemblies reveals genome-level differences. Genomics 80: 138-139.
Liebeschuetz, J. W., S. D. Jones, P. J. Morgan, C. W. Murray, A. D. Rimmer, J. M. Roscoe, B. Waszkowycz, P. M. Welsh, W. A. Wylie, S. C. Young et al. 2002. PRO_SELECT: combining structure-based drug design and array-based chemistry for rapid lead discovery. 2. The development of a series of highly potent and selective factor Xa inhibitors. J Med Chem 45: 1221-1232.
Lipinski, C. A. 2000. Drug-like properties and the causes of poor solubility and poor permeability. J Pharmacol Toxicol Methods 44: 235-249.
Lyne, P. D. 2002. Structure-based virtual screening: an overview. Drug Discov Today 7: 1047-1055.
Mason, J. S., A. C. Good and E. J. Martin. 2001. 3-D pharmacophores in drug discovery. Curr Pharm Des 7: 567-597.
Mestres, J., and R. M. Knegtel. 2000. Similarity versus docking in 3D virtual screening. Perspect. Drug Des. Discovery 20: 191-207.
Pouliot, Y., J. Gao, Q. J. Su, G. G. Liu and X. B. Ling. 2001. DIAN: a novel algorithm for genome ontological classification. Genome Res 11: 1766-1779.
Srinivasan, J., A. Castellino, E. K. Bradley, J. E. Eksterowicz, P. D. Grootenhuis, S. Putta and R. V. Stanton. 2002. Evaluation of a novel shape-based computational filter for lead evolution: application to thrombin inhibitors. J Med Chem 45: 2494-2500.
Waszkowycz, B., T. D. J. Perkins, R. A. Sykes and J. Li. 2001. Large-scale virtual screening for discovering leads in the postgenomic era. IBM Systems Journal 40: 360-376.

H. Consortium/Contractual Arrangements
N/A

I. Letters of Support (e.g., Consultants)
Fax attached as the next page.

PHS 398 (Rev. 05/01) Page 66 Research Plan
Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D.

CHECKLIST
TYPE OF APPLICATION (Check all that apply.)
NEW application. (This application is being submitted to the PHS for the first time.)
SBIR Phase I
SBIR Phase II: SBIR Phase I Grant No. _______________________
SBIR Fast Track
STTR Phase I
STTR Phase II: STTR Phase I Grant No.
_______________________
STTR Fast Track
REVISION of application number: (This application replaces a prior unfunded version of a new, competing continuation, or supplemental application.)
COMPETING CONTINUATION of grant number: (This application is to extend a funded grant beyond its current project period.)
INVENTIONS AND PATENTS (Competing continuation appl. and Phase II only): No / Yes. If “Yes,” Previously reported / Not previously reported
SUPPLEMENT to grant number: (This application is for additional funds to supplement a currently funded grant.)
CHANGE of principal investigator/program director. Name of former principal investigator/program director:
FOREIGN application or significant foreign component.

1. PROGRAM INCOME (See instructions.)
All applications must indicate whether program income is anticipated during the period(s) for which grant support is requested. If program income is anticipated, use the format below to reflect the amount and source(s).
Budget Period    Anticipated Amount    Source(s)

2. ASSURANCES/CERTIFICATIONS (See instructions.)
The following assurances/certifications are made and verified by the signature of the Official Signing for Applicant Organization on the Face Page of the application. Descriptions of individual assurances/certifications are provided in Section III. If unable to certify compliance, where applicable, provide an explanation and place it after this page.
• Debarment and Suspension
• Drug-Free Workplace (applicable to new [Type 1] or revised [Type 1] applications only)
• Lobbying
• Non-Delinquency on Federal Debt
• Research Misconduct
• Civil Rights (Form HHS 441 or HHS 690)
• Handicapped Individuals (Form HHS 641 or HHS 690)
• Sex Discrimination (Form HHS 639-A or HHS 690)
• Age Discrimination (Form HHS 680 or HHS 690)
• Recombinant DNA and Human Gene Transfer Research
• Financial Conflict of Interest (except Phase I SBIR/STTR)
• STTR ONLY: Certification of Research Institution Participation
• Human Subjects
• Research Using Human Embryonic Stem Cells
• Research on Transplantation of Human Fetal Tissue
• Women and Minority Inclusion Policy
• Inclusion of Children Policy
• Vertebrate Animals

3. FACILITIES AND ADMINISTRATIVE COSTS (F&A)/INDIRECT COSTS. See specific instructions.
DHHS Agreement dated:
No Facilities And Administrative Costs Requested.
DHHS Agreement being negotiated with _______ Regional Office.
No DHHS Agreement, but rate established with _______ Date _______
CALCULATION* (The entire grant application, including the Checklist, will be reproduced and provided to peer reviewers as confidential information.)

Budget period               Amount of base    x Rate applied    = F&A costs
a. Initial budget period    $450,000          52.0%             $234,000
b. 02 year                  $468,000          52.0%             $243,360
c. 03 year                  $486,720          52.0%             $253,094
d. 04 year                  $491,202          52.0%             $255,425
e. 05 year                  $510,850          52.0%             $265,642
TOTAL F&A Costs                                                 $1,251,511

*Check appropriate box(es): Salary and wages base / Modified total direct cost base / Other base (Explain) / Off-site, other special rate, or more than one rate involved (Explain)
Explanation (Attach separate sheet, if necessary.):

4. SMOKE-FREE WORKPLACE: Yes / No (The response to this question has no impact on the review or funding of this application.)

PHS 398 (Rev. 05/01) Page 69 Checklist Form Page

Principal Investigator/Program Director (Ling, Bruce): Ling, Bruce, Xuefeng
Place this form at the end of the signed original copy of the application. Do not duplicate.

PERSONAL DATA ON PRINCIPAL INVESTIGATOR/PROGRAM DIRECTOR
The Public Health Service has a continuing commitment to monitor the operation of its review and award processes to detect, and deal appropriately with, any instances of real or apparent inequities with respect to age, sex, race, or ethnicity of the proposed principal investigator/program director.
To provide the PHS with the information it needs for this important task, complete the form below and attach it to the signed original of the application after the Checklist. Do not attach copies of this form to the duplicated copies of the application. Upon receipt of the application by the PHS, this form will be separated from the application. This form will not be duplicated, and it will not be a part of the review process. Data will be confidential, and will be maintained in Privacy Act record system 09-25-0036, “Grants: IMPAC (Grant/Contract Information).”

The PHS requests Social Security Numbers for accurate identification, referral, and review of applications and for management of PHS grant programs. Provision of the Social Security Number is voluntary. No individual will be denied any right, benefit, or privilege provided by law because of refusal to disclose his or her Social Security Number. The PHS requests the Social Security Number under Sections 301 (a) and 487 of the PHS Act as amended (42 USC 214a and USC 288). All analyses conducted on the date of birth and race and/or ethnic origin data will report aggregate statistical findings only and will not identify individuals. If you decline to provide this information, it will in no way affect consideration of your application. Your cooperation will be appreciated.

DATE OF BIRTH (07/27/68)
SEX/GENDER: Female / Male
Social Security Number: [redacted]

ETHNICITY
1. Do you consider yourself to be Hispanic or Latino? (See definition below.) Select one.
Hispanic or Latino. A person of Mexican, Puerto Rican, Cuban, South or Central American, or other Spanish culture or origin, regardless of race. The term, “Spanish origin,” can be used in addition to “Hispanic or Latino.”
Hispanic or Latino / Not Hispanic or Latino

RACE
2. What race do you consider yourself to be? Select one or more of the following.
American Indian or Alaska Native.
A person having origins in any of the original peoples of North, Central, or South America, and who maintains tribal affiliation or community attachment.
Asian. A person having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent, including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam. (Note: Individuals from the Philippine Islands have been recorded as Pacific Islanders in previous data collection strategies.)
Black or African American. A person having origins in any of the black racial groups of Africa. Terms such as “Haitian” or “Negro” can be used in addition to “Black or African American.”
Native Hawaiian or Other Pacific Islander. A person having origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands.
White. A person having origins in any of the original peoples of Europe, the Middle East, or North Africa.
Check here if you do not wish to provide some or all of the above information.

PHS 398 (Rev. 05/01) DO NOT PAGE NUMBER THIS FORM Personal Data Form Page

Principal Investigator/Program Director (Last, first, middle): Ling, Bruce, Xuefeng, Ph.D.

Appendix
1. Tularik Inc., a biopharmaceutical company – http://www.tularik.com.
2. J2EE, The Platform for Enterprise Solutions.
3. Microsoft .NET, Working Beyond the Network.
4. I3C: Interoperable Informatics Infrastructure Consortium.
5. September 20, 1999 New York Times Technology headline: Surfing the Human Genome – Databases of Genetic Code Are Moving to the Web.
6. Journal “Computer World” Tularik case study – Linux cluster tackles gene research.
7. Pouliot, Y., J. Gao, Q. J. Su, G. G. Liu and X. B. Ling. 2001. DIAN: a novel algorithm for genome ontological classification. Genome Res 11: 1766-1779.
8. Li, S., J. Liao, G. Cutler, T. Hoey, J. B. Hogenesch, M. P. Cooke, P. G. Schultz and X. B. Ling. 2002. Comparative analysis of human genome assemblies reveals genome-level differences. Genomics 80: 138-139.
9. Li, S., G. Cutler, J. J. Liu, T. Hoey, C. Chen, P. G. Schultz, J. Liao and X. B. Ling. 2003. A comparative analysis of HGSC and Celera human genome assemblies and gene sets. Bioinformatics In press.
10. Liebeschuetz, J. W., S. D. Jones, P. J. Morgan, C. W. Murray, A. D. Rimmer, J. M. Roscoe, B. Waszkowycz, P. M. Welsh, W. A. Wylie, S. C. Young et al. 2002. PRO_SELECT: combining structure-based drug design and array-based chemistry for rapid lead discovery. 2. The development of a series of highly potent and selective factor Xa inhibitors. J Med Chem 45: 1221-1232.

PHS 398/2590 (Rev. 05/01) Page 70 Appendix

Appendix 1
Tularik

Our Integrated Drug Discovery and Development Platform Places us in a Leading Position to Create Novel and Superior Drugs.
The gene regulation approach to the discovery of novel therapeutics is enabled by a solid foundation of biology, biochemistry and molecular biology. Tularik does not depend on a single technique or technology. Rather our scientists develop and take advantage of multiple approaches and cutting-edge technologies, bringing them all to bear on the drug discovery process. We are continually incorporating new capabilities that increase the probability of success.

A Distinctive Approach for Building the Company
Tularik has grown in a logical, stepwise fashion, having concentrated in its early years on establishing excellence in fundamental biological science. Our approach has been to build the company "from the bottom up," with scientific need driving the internal development or acquisition of appropriate enabling technologies. To our core biological expertise we first added assay and high throughput screening capabilities and assembled a substantial chemical library.
We then made major investments in medicinal chemistry, structural biology and pharmacology. More recently, we have been building strength in the clinical area, as a number of our drug candidates enter and move through human testing.

We have integrated a number of new technologies that help us advance our programs. We acquired a core technology called Representational Difference Analysis (RDA) in order to discover the full set of cancer-causing genes; it is the centerpiece of our oncogene-based drug discovery effort. We have also acquired an innovative computer-aided molecular design (CAMD) capability that enables us to add virtual screening to our already robust high throughput screening program.

Drug Discovery and Development Process
We seek to develop compounds that have novel mechanisms of action and treat serious diseases more effectively than the best existing drugs. The discovery of new drug leads is a multi-step process. We begin by establishing a precise biological link between a disease state and inappropriate gene expression. Once such a link is clear, we seek to elucidate the corresponding gene regulation pathway inside the cell. We can then begin to develop highly specific biochemical and cell-based assays for these targets that will help us to identify promising leads for therapeutic intervention.

At this stage, we bring to bear our million-compound chemical library, and, with the aid of robotics technology developed in-house, we perform high throughput screens, running select portions of that vast library against our targets. "Hits" are subjected to secondary assays designed to eliminate compounds that lack potency or specificity, or that have other unwanted characteristics.
If a compound survives the secondary assay screening process, it is then subjected to further testing and ultimately optimization by medicinal chemists to improve drug-like characteristics such as potency, specificity, oral bioavailability and safety. Once again, our team's expertise in the fundamentals--biology and medicinal chemistry--guides the process. Our tools are state-of-the-art: combinatorial chemistry; CAMD; and an impressive array of advanced structural biology technologies that include nuclear magnetic resonance (NMR) spectroscopy and x-ray crystallography. Utilizing structural information, our chemists can design and synthesize new analogs of lead compounds that are more likely to have a better fit with target proteins, and thus, potentially, greater potency and specificity. Finally, our lead series enters pharmacological analysis and testing in animal models of disease. From these studies, we learn about our candidate drug's effectiveness, pharmacokinetic profile, its selectivity with respect to its target, its potency and its possible side effects. All data are carefully recorded and compiled, in preparation for the submission of an Investigational New Drug application (IND) to the U.S. Food and Drug Administration. Once we have decided to pursue an IND, our clinical team--assisted and counseled by the scientific team that discovered and developed our drug candidate--undertakes the task of designing and implementing clinical trials. Members of our expanding clinical development group have brought many important medicines through trials to commercialization at major pharmaceutical companies. They possess expertise in clinical research, clinical pharmacology, biostatistics and data management, drug safety and surveillance and regulatory affairs. Copyright © Tularik Inc. 2003. All rights Reserved. 
http://www.tularik.com/page.php?id=3 6/17/2003

Java 2 Platform, Enterprise Edition (J2EE) - Overview

The Platform for Enterprise Solutions
The Java 2 Platform, Enterprise Edition (J2EE) defines the standard for developing multitier enterprise applications. J2EE simplifies enterprise applications by basing them on standardized, modular components, by providing a complete set of services to those components, and by handling many details of application behavior automatically, without complex programming.

The Java 2 Platform, Enterprise Edition, takes advantage of many features of the Java 2 Platform, Standard Edition, such as "Write Once, Run Anywhere" portability, JDBC API for database access, CORBA technology for interaction with existing enterprise resources, and a security model that protects data even in internet applications. Building on this base, Java 2 Enterprise Edition adds full support for Enterprise JavaBeans components, Java Servlets API, JavaServer Pages(TM) and XML technology. The J2EE standard includes complete specifications and compliance tests to ensure portability of applications across the wide range of existing enterprise systems capable of supporting J2EE.

Making Middleware Easier
Today's enterprises gain competitive advantage by quickly developing and deploying custom applications that provide unique business services.
Whether they're internal applications for employee productivity, or internet applications for specialized customer or vendor services, quick development and deployment are key to success. Portability and scalability are also important for long term viability. Enterprise applications must scale from small working prototypes and test cases to complete 24 x 7, enterprise-wide services, accessible by tens, hundreds, or even thousands of clients simultaneously.

However, multitier applications are hard to architect. They require bringing together a variety of skill sets and resources, legacy data and legacy code. In today's heterogeneous environment, enterprise applications have to integrate services from a variety of vendors with a diverse set of application models and other standards. Industry experience shows that integrating these resources can take up to 50% of application development time.

As a single standard that can sit on top of a wide range of existing enterprise systems -- database management systems, transaction monitors, naming and directory services and more -- J2EE breaks the barriers inherent between current enterprise systems. The unified J2EE standard wraps and embraces existing resources required by multitier applications with a unified, component-based application model. This enables the next generation of components, tools, systems, and applications for solving the strategic requirements of the enterprise. With simplicity, portability, scalability and legacy integration, J2EE is the platform for enterprise solutions.

The Standard with Industry Momentum
While Sun Microsystems invented the Java programming language and pioneered its use for enterprise services, the J2EE standard represents a collaboration between leaders from throughout the enterprise software arena.
Our partners include OS and database management system providers, middleware and tool vendors, and vertical market applications and component developers. Working with these partners, Sun has defined a robust, flexible platform that can be implemented on the wide variety of existing enterprise systems currently available, and that supports the range of applications IT organizations need to keep their enterprises competitive.

[ This page was last updated Apr-12-2003 ]
Java, J2EE, J2SE, J2ME, and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Unless otherwise licensed, code in all technical manuals herein (including articles, FAQs, samples) is provided under this License.
Copyright © 1995-2003 Sun Microsystems, Inc. All Rights Reserved.
http://java.sun.com/j2ee/overview.html 6/16/2003

Microsoft .NET

Keep working whether or not you're plugged into the network. Smart Client software, based on Microsoft® .NET-connected technology, combines the reach of the Internet with the power of local computing hardware.

news
Smart Clients combine the power of the PC with the reach of the Web
The technology map to build Smart Client software
The .NET Framework is the foundation of a new generation of software

technical resources
Get the latest tools, guides, code samples, and community links to help you build Web services and deploy and maintain a .NET-connected environment.
Microsoft releases recommended practices for solving enterprise problems with .NET
Using the .NET Framework
MSDN® resources for .NET Framework developers
TechNet resources for IT professionals

business agility
Don't throw out your existing systems. Microsoft .NET-connected software makes it easier for you to share or integrate information using the technology you own now.
McKinsey Quarterly shows businesses how to fight complexity in IT
Microsoft helps businesses benefit from Web services today
What .NET means for IT professionals
Case studies: See .NET in action

for partners
Your success is our success. That's why we build programs that provide your company with new business opportunities.
.NET Connected Directory helps businesses find Web service solutions and products
Register your product with the .NET Connected Logo program
Online resources for Microsoft partners

home and entertainment
Change flight reservations with your PDA? Automatically reserve concert tickets the minute they go on sale? The possibilities are endless in a .NET-connected world.
Digital decade vision: From personal computer to personal computing
Can the Internet change your oil?

what is .NET?: .NET 101: The basic elements of .NET; The ABCs of Web services; .NET glossary; Frequently asked questions
product information: .NET products; Smart devices; Servers; Developer tools
.NET services: .NET Alerts; .NET Passport; MSN® Messenger Connect; Microsoft MapPoint® Web Service

© 2003 Microsoft Corporation. All rights reserved.
http://www.microsoft.com/net/ 6/16/2003

Interoperable Informatics Infrastructure Consortium
Accelerating Life Sciences Discovery Through Software Interoperability

I3C develops and promotes global, vendor-neutral informatics solutions that improve data quality and accelerate the development of life science products.

What's NEW!

WEB SITE CHANGES IN STORE
As the result of your comments, we'll be making some navigational adjustments to the Web site over the next few days. The result should be a simplified left menu and easier, more direct access to Working Group activities. Immediate requests for fixes/links should be addressed to Suzanne Mahler at [email protected]. Thank you once again for your patience.

Upcoming Events
Mark your calendar for the remainder of 2003!
• I3C Demo at BIO 2003, "Merging LSID & BioMOBY," June 22-25 at the Washington Convention Center, Washington, D.C.
• Technical Meeting, Oct. 27-29, and Hackathon, Oct. 30-31, Wellcome Trust Genome Campus, Hinxton, U.K.

I3C Annual Meeting, Elections, Technical Meeting & Hackathon Summary
I3C's first Annual Meeting began with Board President Tim Clark's "State of I3C" report. (See next item for details.) Tim reviewed I3C's purpose of promoting, developing, and recommending "best-in-class" interoperability solutions for the life sciences community. But more than interoperability, Tim stressed that we need to look at the total recommended approach that includes specific point solutions and approaches, interoperability solutions with open interfaces, and methods of semantic integration across the domain. He also reviewed I3C's approach to accomplishing its work by bringing the best people together from IT, academic, and biopharma communities. Tim went on to outline I3C's many accomplishments to date, as well as exciting future opportunities. "What we want to do is make it easier for life scientists, and their laboratories, to do their jobs while lowering cost and speeding the development of needed drugs," he said.
To read the entire summary, click here. To download a photo slide show of the Hackathon, click here.

I3C Annual Meeting - Chairperson's Report
On Monday, May 5, I3C Board President Tim Clark gave an important presentation at the Annual Meeting. An outline of the talk appears below; if you missed it in person, you can download the PowerPoint presentation here.
• I3C's Purpose & Approach
• Accomplishments 2002-3
• Problems Requiring Attention
• Driving the Work Forward
• The I3C Vision: Next Stage

MAKE PLANS FOR HINXTON - Details now available
The last meeting of 2003 is scheduled for October 27-31 on the Wellcome Trust Genome Campus in Hinxton, U.K. Meeting and accommodation details are available here. Remember to reserve your room early for best selection.

EMERGING WORK AREAS
In addition to I3C's existing Working Groups (LSID, Registry, and Security), new work areas are forming, including Chemiinformatics, Pathways/Systems Biology, and Life Science Object Ontology. You may join a group at any time.
Click on the "Working Groups" link in the left menu for more information.

Copyright © 2003 I3C
http://www.i3c.org/ 6/13/2003

September 20, 1999
Surfing the Human Genome
Databases of Genetic Code Are Moving to the Web
By LAWRENCE M. FISHER
http://www.nytimes.com/library/tech/99/09/biztech/articles/20gene.html 6/12/2003

SAN FRANCISCO -- Call it an end-of-the-century business case study. Pangea Systems Inc. is a small but leading company in "bioinformatics," a hot new field that combines the two keystone technologies of the 1990s -- computing and biotechnology. But its products are expensive and difficult for mortals to use, which limits Pangea's potential market and reduces the prospects for a public stock offering.

What to do? This being 1999, the answer if you are Pangea is to dot-com yourself.

This week Pangea, which is based in Oakland, Calif., intends to begin a shakedown test of DoubleTwist.com, a new Web site intended to make online genetic and biological research fast, easy and available to any amateur or professional biologist. While the test phase is available only to faculty and students at Stanford University, the site is scheduled to go live for general use in December.

[Photo, Thor Swift for The New York Times: At Pangea Systems, left to right, Kyle Hart, Bruce Xuefeng Ling and Brian King, revise genetic data base software.]

The DoubleTwist site, whose name is a play on the double-helix structure of DNA, holds the near-term promise of lifting Pangea above the pack of competitors chasing the business opportunities in bioinformatics. But other companies may not be far behind. And the implications go beyond the interests of professional biologists and biotechnology executives.
As more of the arcane secrets of genetics and molecular biology become available to the modemed masses, some industry executives foresee the day when an educated consumer might take a CD-ROM containing a laboratory's rendering of his or her genetic profile, and combine it with a Web surf through gene libraries to determine the person's predisposition toward adverse drug reactions, for example, or for Alzheimer's disease, colon cancer or other afflictions that might eventually be treatable through gene therapy.

To promote its name and capabilities, Pangea plans to let individuals who make only casual use of the site have access to its software and data base at no charge. Heavy users and corporations may obtain licenses to pay for access on a sliding fee scale -- which could run tens of thousands of dollars a year, but would still be significantly less than the $500,000 or more that Pangea now charges big pharmaceutical companies to buy its software outright.

"The power of bioinformatics has been somewhat limited to those who could afford it," said John Couch, Pangea's president and chief executive, who was an executive at Apple Computer in the late 1970s and early 1980s.

"I've been trying to figure out how to empower the scientist the way we did computer users at Apple in the early days," Couch said. "We saw the opportunity to be the first Web portal that enabled scientists to do molecular research."

Celera Genomics Group is another company that has said it will offer its bioinformatics tools from its Web site, although it has not specified a launch date. "This is an Internet company," said Craig Venter, president and chief executive of Celera, a unit of the PE Corp., which is based in Rockville, Md. Scientists and nonscientists alike, he said, will be able to use Celera's tools to gain insights into their genetic makeup.
And as catalogs of common mutations correlated with disease become broadly available, he said, individuals will be able to make appropriate lifestyle changes or health-care decisions. "You'll be able to log on to our data base and get information about yourself," Venter said. "Our ultimate customer on the Internet is individuals."

[Image caption: Pangea Systems' Doubletwist is a Web version of genetics research software designed to let a person search libraries of gene-code fragments for matches.]

Bioinformatics is a field that emerged from the Human Genome Project, the international quest -- which began in 1988 and is expected to be concluded in the next two years -- to spell out the precise sequence of the three billion letters in the human genetic code. The first industry spawned by the genome project was genomics companies, which sell data bases of individual genes whose sequences have already been identified or are developing drugs aimed at gene targets.

As these efforts began to produce vast amounts of biological information, they needed powerful software to keep track and make sense of it all. And so, in the early 1990s, bioinformatics was born as a tool of genomics.

While the software created by the government-funded labs like the Whitehead Institute at the Massachusetts Institute of Technology is in the public domain, with intriguing names like Blast and Fasta, the genomics companies, like Human Genome Sciences Inc. and Incyte Pharmaceuticals Inc., have kept their tools for use by themselves or their licensed partners. That is Celera's primary business as well, despite Venter's intent to offer bioinformatics services on the Web.

It was not long before a few entrepreneurs and venture capitalists saw an opportunity in a pure-play bioinformatics company, which would sell not genes or data, but software.
As private companies, none of the bioinformatics players publish revenue figures, but most say they are between $5 million and $10 million in annual sales, and growing. Indeed, some analysts predict a multibillion-dollar bioinformatics market within the next 10 years. "Bioinformatics is not necessarily the next wave, but the glue that holds everything together," said Tim Wilson, an analyst with S.G. Cowen. "If you don't get that part right, it's hard to realize the value of genomics," he said. "The opportunity is something obvious to anyone who speaks to pharmaceutical companies." With the DoubleTwist site, according to Pangea, a researcher would have many of the same capabilities previously available only to the company's big corporate customers, which include drug companies like Bristol-Myers Squibb and Hoechst Marion Roussel. After logging on to the DoubleTwist site, a visitor could enter a partial sequence of a gene -- some combination of the letters A, C, T and G, which make up the genetic alphabet -- and then search for contiguous sequences that might lead to a full-length gene. Or if the code of a full-length gene were known, the researcher could ask in which tissues of the body that gene is found or found only when in the presence of cancer. To the extent the answer is available in the scientific literature, including patent filings, the software would retrieve it and highlight relevant passages. Other cross-referenced data might include notations on what biochemical materials are required for working with a given gene in the laboratory. Such are the capabilities of the computational biology that underlies bioinformatics -- a field that Francis Collins, director of the Human Genome Project for the National Institutes of Health, says he now often counsels promising graduate students to look to for career opportunities. 
"I just think it is going to hit us like a freight train and we really have too small a supply of expertise in that area," he said. But there has been a dichotomy between the opportunity and the market reality for Pangea and competitors like Netgenics Inc. of Cleveland; Informax Inc. of Rockville, Md.; Lion Bioscience AG of Heidelberg, Germany; Compugen Ltd. of Tel Aviv; the Genomica Corp. of Boulder, Colo.; and Molecular Applications Group of Palo Alto, Calif. Most of these companies are five years old or more, yet few are profitable. Couch, Pangea's president, said the two hurdles to expanding the market have been complexity and cost. Besides the $500,000 price for Pangea's suite of software programs, a suite customer must make a comparable investment in hardware. And even though they have a point-and-click graphical user interface, like any Windows application, their sophistication has tended to restrict their use to bioinformatics specialists within large pharmaceutical or biotechnology companies, not to individual research scientists without special training.
Caption (Thor Swift for The New York Times): John Couch, left, president of Pangea Systems, and Robert Williamson, senior vice president for marketing, say the Oakland, Calif., company's Web site will let scientists do complex genetic research on line.
In moving to the Web, Pangea will find neighbors with some similar-sounding offerings. This week, HySeq Inc., a genomics company in Sunnyvale, Calif., will launch GeneSolutions.com, which will sell genes and genetic information over the Web. And there are various Web sites, for example, that freely offer public-domain algorithms, or mathematical formulas, that can perform the basic tasks of bioinformatics. 
These include a technique called clustering and alignment, which pieces together full-length genes from the fragments spewed out by so-called automated sequencing machines that derive their data from DNA samples. But these public-domain tools tend to be difficult to use, and limited in their application to specific gene data bases. Pangea's DoubleTwist, by contrast, will aggregate data from multiple sources, and then make it available using software agents -- small automated software programs that will scan the Web at a user's request and return answers to complex biological queries via e-mail. These agents can update information as it becomes available, suggest necessary laboratory supplies and provide links to vendors. DoubleTwist is intended to complement rather than supplant Pangea's established software suites. But Couch said it was possible that a growing portion of the company's revenues would come from the Web rather than packaged programs. Rather than buy Pangea's software suite for $500,000, companies or academic institutions could spend, say, $10,000 a year to provide each user access to these programs over the Web. Pangea's competition in this arena is companies very much like itself: small, financed with venture capital and possessing more programming prowess than marketing skills. All of these companies are looking for ways to differentiate themselves, and while an Internet presence is one way to do that, it is by no means the only one. For example, Netgenics' programs run on corporate intranets, rather than the World Wide Web. But they are built using Internet technology like the Java programming language so that they can be easily adapted to the specific needs of different customers. "Pangea decided they would come up with the perfect schema for all types of drug discovery and put a nice graphic user interface on it," said Manuel J. 
Glynias, president and chief executive of Netgenics, which was founded in 1996. "We decided there was no perfect schema because every pharmaceutical company is different." Netgenics did consider a Web-based electronic commerce business model, but decided a faster route to growth was to bundle consulting services with custom bioinformatics software. So far, customers include Abbott Laboratories and Pfizer. "We've very much targeted big pharma and biotech," Glynias said. "They're the only ones who can afford it, and really the only ones it makes sense for. At the end of the day you've got 50 big pharma and biotech companies and 100 medium-sized ones. It's not a big market." If the market is small, creating a big company requires that each sale be large, and Netgenics bases its goals on finding at least 20 customers willing to pay $5 million annually for its services. Another player, Lion Bioscience, takes that model a step further. It recently announced a deal in which it would develop new bioinformatics systems and identify target genes for drug development by Bayer A.G. for an investment estimated at $100 million. The figure includes an up-front equity stake in Lion as well as fees for use of Lion's existing information systems, research and set-up costs for a new subsidiary to be based in Cambridge, Mass., and royalties on drugs developed from the gene targets identified at the subsidiary. Lion calls its concept iBiology, and like Netgenics' approach, it uses intranets rather than the Internet. "It goes far beyond the usual gene sequence analysis software," said Claus Kermoser, Lion's vice president for corporate development. "We crawl further up the value chain to include the chemical side, and also pharmacological and toxicology data. It's not just a software package, tools and data; it's a solution for pharmaceuticals research data management." 
In fact, Lion is actually a hybrid of pure-play bioinformatics and genomics, because it sells gene targets along with information-processing capabilities. Similarly, Compugen, after building a successful business selling bioinformatics tools, has recently added a genomics thrust, selling novel gene variants the Compugen researchers have identified with the company's tools. Compared with these other companies, which have aimed for a corporate clientele, Informax has taken a vastly different tack. For six years it has sold a program for individual scientists, Vector NTI, which is almost to biology what desktop publishing software was to print publications. At $3,500 a user, for the Windows or Macintosh versions, Vector NTI is not inexpensive. But because it is a purchase that typically can be authorized at the department level, it is the most widely used bioinformatics program in the industry. It is used at 60 pharmaceuticals companies, 250 biotechnology concerns and 500 academic institutions. "We've built our franchise by meeting the needs of the bench biologist," said Timothy Sullivan, Informax's senior vice president for marketing and sales. "Informax took a bottom-up approach and did it well, versus Pangea and Netgenics, who started out at the enterprise level," he said. Informax recently introduced its own enterprise product, Software Solution for Bioscience, and hopes to use the leverage of its existing customer base to win sales at large companies. One hurdle for all of these competitors is that the large companies that are their obvious customers often have substantial bioinformatics capabilities of their own -- expertise that the company may even view as a proprietary advantage. "You're trying to do cutting-edge research, and if you're on the leading edge of the curve that means you also have to develop the software to do it," said Paul Godowski, director of molecular biology at Genentech Inc., the pioneering biotech company. 
"On the other hand, there are products out there from these third-party vendors we can import for our programs," Godowski said. "It's a mixture, and I don't see that going away, certainly not at a place like Genentech." No wonder Pangea is looking to cyberspace to expand its potential audience. "Only a few select pharmas can afford the tools, and if they can, then in some cases they can also afford to produce their own software," Couch said. "Why not take the infrastructure we've created, add a graphic interface that makes it easier, and offer it directly to the scientist? We are taking the Internet, which was originally developed to do research, and giving it back to the researchers."
Copyright 1999 The New York Times Company

Linux cluster tackles gene research
Computerworld
By TODD R. 
WEISS AUGUST 05, 2002 Source: Computerworld
A 150-processor cluster supercomputer built by Linux NetworX Inc. is helping a California biopharmaceutical company compare genes from mice and humans in the race to find effective drugs to fight cancer and other diseases. Using the cluster supercomputer, Tularik Inc. has been able to accomplish in two months what would have taken the company 38 years with its older hardware, said Bruce Ling, director of bioinformatics at the South San Francisco-based company. Tularik had been using an older four-processor mainframe computer from Silicon Graphics Inc. for such work. The new cluster, by contrast, has 150 Pentium III 1-GHz processors, 300GB of memory and 3TB of storage; it has given Tularik researchers far greater potential for their gene mapping work, Ling said. The cluster was installed late last year and began operations in December. Details about the project were announced today after performance benchmark testing results were released, showing performance that was 75 times greater than the old system. 
The Evolocity cluster supercomputer was built by Salt Lake City-based Linux NetworX and has helped accelerate Tularik's drug discovery efforts by mining massive databases of information and quickly identifying gene combinations behind diseases in the areas of cancer, immunology and metabolic disorder. The 75-node supercomputer has two CPUs per node, plus a four-CPU administrative node. The machine was built at about a tenth of the cost of a traditional mainframe, Ling said. The price is not being released, he said. The biopharmaceutical company is working to compare human and mouse genomes because the genetic makeup of the two species is very similar, Ling said. A genome is a collection of genetic information consisting of individual genes. When experiments are necessary, researchers can use mouse genomes instead of having humans undergo experimental procedures. By cross-mapping the genomes, researchers can find similarities and differences in the genomes between the species that can help them in their experiments. Tularik, founded in 1991, has been a pioneer in the recent cross-mapping of mouse and human genomes, Ling said, and hopes to develop pharmaceuticals that regulate the development of cancer and other diseases in genes. Clark Roundy, a vice president of marketing at Linux NetworX, said the company is seeing continued growth in the biotechnology and pharmaceutical marketplaces for cluster supercomputers for research. "This cluster has saved them years of time in mapping the mouse genome," Roundy said. 
Methods
DIAN: A Novel Algorithm for Genome Ontological Classification
Yannick Pouliot, Jing Gao, Qiaojuan Jane Su, Guozhen Gordon Liu, and Xuefeng Bruce Ling(1,2)
DoubleTwist, Inc., Oakland, California 94612, USA

Following the determination of many completely sequenced genomes, computational biology is now faced with the challenge of interpreting the significance of these data sets. A multiplicity of data-related problems impedes this goal: Biological annotations associated with raw data are often not normalized, and the data themselves are often poorly interrelated and their interpretation unclear. All of these problems make interpretation of genomic databases increasingly difficult. With the current explosion of sequences now available from the human genome as well as from model organisms, the task of sorting this vast amount of conceptually unstructured source data into a limited universe of genes, proteins, functions, structures, and pathways has become a bottleneck for the field. To address this problem, we have developed a method of interrelating data sources by applying a novel approach to associating biological objects with ontologies. We have developed an intelligent knowledge-based algorithm, DIAN, to support biological knowledge mapping, and, in particular, to facilitate the interpretation of genomic data. In this respect, the method makes it possible to inventory genomes by collapsing multiple types of annotations and normalizing them to various ontologies. 
By relying on a conceptual view of the genome, researchers can now easily navigate the human genome in a biologically intuitive, scientifically accurate manner.

(1) Present address: Tularik, Inc., 2 Corporate Drive, South San Francisco, CA 94080, USA.
(2) Corresponding author. E-MAIL [email protected]; FAX (650) 825-7400.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.183301.
Genome Research www.genome.org

Biologists have never before been exposed to such vast amounts of sequence data as that from the human genome and a variety of model organisms. This development now raises the issue of how to interpret the meaning of the genome on the basis of prior biological understandings. Annotation tasks, such as the prediction of protein function and structure, are essential to this process and are by no means completely robust. Furthermore, the integration of historical domain knowledge accumulated in individual research fields with these sequence and structural annotations is becoming increasingly complex and difficult. The size, diversity, and complexity of the data, which include biological sequence information itself, third party or in-house annotation, and information from the scientific literature, are responsible for these difficulties. Another reason relates to the lack of data and information normalization, because the data repositories are often poorly designed, particularly in the case of older repositories. Furthermore, data processing procedures vary substantially, and the underlying semantic and data models are moving targets. Finally, there is the extreme specialization of research fields. Despite these problems, model organism studies and associated DNA and protein sequence data sets have revealed a high degree of sequence and functional conservation between organisms (Chervitz et al. 1999). Similarly, the accumulated protein structure data have shown that the number of protein folds is probably limited (Bowie et al. 1991). The limited number of biological roles, protein functions, and structural types enables a common language for annotation, which is beginning to be implemented by biocomputational ontology engineering (Riley 1993; Baker et al. 1999; Ashburner et al. 2000; Karp 2000). Ontologies provide an ideal mechanism of organization of biological data at the conceptual level by providing a framework for data whose properties are otherwise nonnormalized. “Normalization” is used here to refer to a state in which several types of signifiers ultimately express the same concept, and in which a concept is defined as a generic abstraction derived from instances. An example of a concept is the notion of “cell adhesion molecules”, to which specific types (instances) of proteins such as cadherins, neural cell adhesion molecules, and integrins are conceptually associated. The proper assignment of DNA and protein sequences to ontologies therefore leverages the rigor of the underlying concepts networked within these ontologies, and enables computations that would otherwise be unreliable due to the variability of terms used to describe biological data in most of the biocomputational databases. For example, ontology-based querying can enable the retrieval of DNA and protein sequences based on biological concepts rather than relying on keyword or synonym searches, which are inherently unreliable due to their present nonnormalized nature, therefore greatly hampering effective computing (Attwood 2000). Here we describe DIAN, an ontology assignment algorithm that assigns concepts to source records or, more generally, to biological objects within a database, and supports their querying using concepts rather than keywords. The algorithm supports a variety of ontologies for biological role, protein function, and protein structure, whereby each ontology is implemented on a knowledge base established via computer-assisted human curation of the protein universe. 
DIAN has the necessary throughput capacity to annotate entire genomes, transcriptomes, and proteomes onto any number of ontologies. The DIAN algorithm, together with the precomputed DIAN annotation database and its associated utilities, enables users to retrieve, summarize, and predict the higher order properties of biological objects, therefore increasing their information content. Overall, DIAN is intended to facilitate the navigation of genomic data repositories in a biologically intuitive, scientifically accurate manner.

11:1766–1779 ©2001 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org

RESULTS AND DISCUSSION
Biologists rely heavily on databases and search tools such as the National Center for Biotechnology Information's Entrez system to search and identify records containing information associated with biological objects such as protein structures and biological sequences (Wheeler et al. 2001). However, when computing on such information, most query systems suffer from the limitations inherent to the annotations associated with these objects. Even in highly curated databases such as the SWISS-PROT database of protein information (Bairoch 1991), there remains significant variability in the descriptors present in these source records. This is because there are many legitimate ways of describing biological concepts. Furthermore, even when the data are curated by experts, a variety of factors introduce variability in the quality and comprehensiveness of these annotations. Thus, when querying annotation databases, conventional search tools encounter fundamental limitations, such that they cannot return records in a reliable manner unless a complete set of descriptors known to be present in the targeted records is provided in the query. This, of course, rarely is the case. 
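The limitation described above can be illustrated with a minimal Python sketch. The records, descriptors, and the `keyword_search` helper are hypothetical and are not taken from Entrez or SWISS-PROT. The sketch only shows that a keyword query returns just those records whose annotation happens to contain one of the descriptors supplied.

```python
# Hypothetical annotation records: several legitimate ways of describing
# the same biological concept (programmed cell death), plus one unrelated.
RECORDS = {
    "rec1": "caspase involved in apoptosis",
    "rec2": "protease required for programmed cell death",
    "rec3": "death domain containing adaptor protein",
    "rec4": "glycolytic enzyme",
}


def keyword_search(records, keywords):
    """Return the IDs of records whose annotation contains any keyword."""
    hits = set()
    for rec_id, text in records.items():
        lowered = text.lower()
        if any(kw.lower() in lowered for kw in keywords):
            hits.add(rec_id)
    return hits


# A single descriptor misses records annotated with other wording.
single = keyword_search(RECORDS, ["apoptosis"])

# Only a complete set of descriptors retrieves every relevant record.
complete = keyword_search(
    RECORDS, ["apoptosis", "programmed cell death", "death domain"]
)
```

In this toy data set the single-keyword query finds one of the three relevant records, which is the situation that concept-based querying is meant to avoid.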
DIAN is designed to enable the querying of popular biological databases in such a way that the limitations associated with the original source records of these databases can be partially overcome. This is accomplished by having the operator query biological ontologies for records associated with these ontologies, rather than querying the source records directly (for details, see supplementary material at http:// www.genome.org). The primary algorithm used by DIAN for associating records to ontologies relies on a domain-based approach that does not depend on the presence of annotations in the source record, thus bypassing the limitations associated with these annotations. In addition, because of this approach, DIAN often makes suggestive assignments, whereby proteins are predicted to belong to ontological nodes in the absence of definitive information. For these reasons, when performed using conventional keyword-based search engines, the queries described in Table 1 will fail to return a fraction of records because of an absence of matching annotations or because of the indirectness of these annotations (i.e., hyperlinked records). Three such cases of records that would otherwise not have been returned without DIAN are illustrated in Table 1. They involve two novel genes, one with predicted functional information listed in the source record and one without such information, as well as one well-characterized gene. In case 1, DIAN identified a gene with no known functional activity by predicting the cellular role and protein function of a sequence on the basis of its pattern of protein domains. UniGene was queried for records involved in the apoptotic Cellular Role. DIAN returned a record from the UniGene database where no functional information is available regarding this sequence, such that this record would not have been identified by keyword-based querying (Table 1). 
It is only after consulting the SWISS-PROT record linked to this UniGene entry that an apoptotic function is uncovered. Case 2 concerns the prediction of a cellular role for a hypothetical gene in SWISS-PROT in which putative functional information is available (zinc finger; DNA binding) but where the annotation does not specify a cellular role. In this case, DIAN predicted an involvement in the “RNA synthesis/transcription factor” Cellular Role node. In case 3, DIAN predicted a novel property for a highly characterized gene. Here, UniGene was queried for records involved in the apoptotic Cellular Role. The gene coding for the protein associated with the Wiskott-Aldrich syndrome (WAS; Derry et al. 1994) was one of the hits returned by this query. The WAS protein is thought to be involved in signal transduction, yet there is no indication of an apoptotic role in any of the records associated with this gene, including the associated SWISS-PROT and OMIM records. However, indications suggestive of a possible apoptotic role were found in these sources. On subsequent analysis of the scientific literature associated with WAS and its Drosophila ortholog, several publications were uncovered that strongly substantiate a recently discovered apoptotic role for this gene (Rawlings et al. 1999; Rengan et al. 2000; Ben-Yaacov et al. 2001). Beyond the performance of DIAN in returning records otherwise unretrievable, the combination of ontology-based and Boolean operators (e.g., NOT, AND, OR) enables users to query databases in a biologically meaningful manner rather than to submit to unfamiliar querying syntaxes and the vagaries of unstructured data. For example, using DIAN it is possible to formulate directly the following questions in a simple manner: Are there cytokines involved in the apoptosis biological process? Are there proteins harboring the caspase domain that are involved in apoptosis? What receptors are associated with apoptosis? 
What proteins are both apoptosis-associated and DNA-associated in terms of cellular role? (i.e., proteins that might perform an apoptotic role via DNA binding). Such questions cannot be addressed if the contents of annotation databases have not been normalized to various biological concepts, and furthermore, comprehensive biological queries cannot be performed reliably using only the simple keyword-search approach seen in most public databases.

Organization of Biological Data Using Ontologies
An ontology is a specification of a conceptualization that provides a written, formal description of a set of concepts and their relationships within a domain of interest (Karp 2000). Ontologies are object-oriented data structures that use object composition and inheritance as techniques to encapsulate conceptual relationships. In biology, there are two kinds of relationships between conceptual objects to be represented: inheritance and compositional relationships. Inheritance hierarchies model IS-A relationships among base and derived conceptual objects. This is because a derived object IS-A type of base object. In contrast, composite objects, that is, objects that contain other objects as members, model HAS-A relationships. This is because the container object HAS-A component object as its member. For example, in the Gene Ontology (GO) ontologies (Ashburner et al. 2000), the Cellular Component ontology relies on HAS-A compositional relationships, whereas the Molecular Function ontology uses IS-A inheritance relationships. In this way, the granularity and richness of the universe of biological concepts can be modeled by ontologies. 
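As an illustration of IS-A hierarchies and of the concept-based Boolean questions listed above, the following Python sketch may help. It is not DIAN's implementation: the `Concept` class, the node names, and the protein identifiers are all hypothetical, and the Boolean operators are modeled simply as set operations over the objects assigned to each concept.

```python
class Concept:
    """A node in an IS-A ontology (hypothetical, for illustration only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # the base concept this one IS-A kind of
        self.children = []        # derived concepts
        self.members = set()      # objects assigned directly to this node
        if parent is not None:
            parent.children.append(self)

    def lineage(self):
        """Concept names from this node up to the root of the tree."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

    def all_members(self):
        """Objects assigned to this concept or to any derived concept."""
        found = set(self.members)
        for child in self.children:
            found |= child.all_members()
        return found


# Hypothetical fragment of a cellular-role ontology.
role = Concept("cellular role")
apoptosis = Concept("apoptosis", role)
dna = Concept("DNA-associated", role)
tf = Concept("transcription factor", dna)

# Hypothetical assignments of proteins to concepts.
apoptosis.members.update({"protA", "protB"})
tf.members.update({"protB", "protC"})

# "Apoptosis-associated AND DNA-associated" becomes a set intersection.
both = apoptosis.all_members() & dna.all_members()
# "Apoptosis-associated NOT DNA-associated" becomes a set difference.
only_apoptosis = apoptosis.all_members() - dna.all_members()
```

Because `all_members` walks the derived concepts, a protein assigned to "transcription factor" is also returned by a query on "DNA-associated", which is the behavior an IS-A relationship implies.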
Table 1. Example Queries that Cannot be Resolved Accurately by Conventional Querying Systems
Three illustrative cases of records that cannot be returned by conventional keyword-based querying systems but that were returned by DIAN are described here.
Source record:
Case 1: Novel gene with no predicted function. Identifier: UniGene Hs. 104305. Annotation: None
Case 2: Hypothetical gene with predicted function. Identifier: SWISS-PROT P39959. Annotation: Putative Zinc Protein
Case 3: Known gene with novel predicted function. Identifier: UniGene Hs. 2157; OMIM: 30100; SWISS-PROT: P42768. Annotation: Wiskott-Aldrich syndrome protein
DIAN system annotation:
EGAD cellular role: •cell division •apoptosis •gene/protein expression •RNA synthesis •transcription factors •cell/organism defense •homeostasis •apoptosis •cell division •apoptosis
Protein function: •Enzymes •Transferase •Post-translational modifications •DNA or RNA associated proteins; None
Enzyme classification: •Transferases; None; None
DoubleTwist biological role: •Non-immune cell defense •Apoptosis •Genome structure and Gene expression •Transcription factors •Non-immune cell defense •Apoptosis
Structure (SCOP): •All beta proteins •PH domain-like •PH domain-like •Enabled/VASP homology 1 domain (EVH1 domain) •Small proteins •Classic zinc finger, C2H2 •Classic zinc finger, C2H2 •Classic zinc finger, C2H2 •All alpha proteins •DEATH domain •DEATH domain •DEATH domain

To encapsulate biological conceptual objects and support the goal of concept-based searching, the DIAN algorithm segments the spaces of protein function, biological role, and protein structure using a collection of ontologies. Although HAS-A relationships should be supportable, in this study we rely exclusively on IS-A ontologies as a paradigm to show the DIAN methodology. Computationally, ontologies take the form of a graph, a tree being a special form of a graph. 
A node always inherits the properties of all parental nodes, such that a complete description of the biochemical function of a protein involves starting the path from the leaf to the root of the tree. The first three levels of the PROSITE Protein Function ontology are used to illustrate this conceptualization (Fig. 1). Starting from the root of the tree (level 0), each level describes biochemical protein function in increasingly greater detail. In this illustration, six proteins were assigned to the transferase node. Because this is a protein function ontology, proteins can belong to different families and species and yet be assigned to the same node. By providing standard classification data structures, ontologies are ideal in providing a common platform for annotation and therefore promoting reuse across different informatics systems and research fields. Because the focus of this paper is on a methodology for assigning protein sequences to ontologies, the relative merits of individual ontologies are addressed only briefly. Choice of Ontologies Because an ontology is essentially a specification of conceptualization (Karp 2000), the choice and quality of the chosen ontologies are essential in ensuring the integrity of encapsulating the biological data. To support the conceptualization of protein functions, biological roles, and cellular processes, substantial attention has been devoted, both in academia and industry alike, to the development of various ontologies to meet these needs. Examples include the enzyme commission classification system (Commission on Biochemical Nomenclature and International Union of Biochemistry. Standing Committee on Enzymes 1973; International Union of Biochemistry and Molecular Biology. Nomenclature Committee and Webb 1992; International Union of Biochemistry. No- menclature Committee and Commission on Biochemical Nomenclature 1979; International Union of Biochemistry. Nomenclature Committee et al. 
1979; International Union of Biochemistry. Nomenclature Committee et al. 1984; International Union of Biochemistry. Standing Committee on Enzymes 1965); the Escherichia coli Protein Function ontology (Riley 1993); the EcoCyc system for E. coli metabolic pathway (Karp 2000); the PROSITE ontology of domain biological functions (Hofmann et al. 1999); the GO ontologies (Ashburner et al. 2000); the KEGG system for the classification of genes according to pathway information (Ogata et al. 1999); RIBOWEB (Chen et al. 1997); and the TIGR expressed gene anatomy database (EGAD, http://www.tigr.org/tdb/egad/egad.shtml). Similarly, to facilitate the understanding of and access to information on protein structures, several protein structure classifications have been constructed (Murzin et al. 1995; Orengo et al. 1997). Despite these efforts, there is still no accepted ontology with the necessary robustness, comprehensiveness, and level of detail to satisfy the demands of genome annotation, although this is an implied goal of the GO project. Given these limitations, and in the absence of the GO ontologies, we originally chose to rely on various publicly available ontologies, in addition to deriving the DoubleTwist Biological Role Ontology (Table 1). For Protein Function, DIAN supports the PROSITE Protein Function and the enzyme commission classification. Given their rigor, comprehensiveness, and rapid evolution, the GO ontologies, including their three components (molecular function, biological process, cellular components), are expected to be integrated within DIAN in the foreseeable future. For the Cellular Role of proteins, the TIGR expressed gene anatomy database (EGAD) ontology and DoubleTwist Biological Role ontology are supported. Although not very comprehensive, the EGAD ontology currently is the only publicly available ontology designed to inventory human expressed genes.
The DoubleTwist Biological Role ontology was derived from Riley’s Protein Function ontology (Riley 1993) and has been designed for the concise conceptual encapsulation of the biological role of the human gene to enable comprehensive human genome assignment. As for protein structure classification, the Structure Classification of Proteins (SCOP) ontology was selected because SCOP is sequence-based and its classifications provide a detailed and comprehensive description of the structural and evolutionary relationships of proteins of known structure (Murzin et al. 1995).

Figure 1 Defining ontologies. Ontologies represent a specification of a domain of knowledge expressed in the structure of mathematical graphs (a tree being a special form of a graph). Connecting lines represent the relationship between the nodes, specifically IS-A relationships. Known protein functions are assigned to nodes (represented by circles) within the ontology graph. A child node always inherits the properties of all parent nodes, such that a complete description of the biochemical function of a protein involves retracing the path from the leaf to the root of the tree.

Architectural Design

DIAN (Fig. 2) integrates several databases through algorithms that perform the ontological assignment of proteins on the basis of two distinct principles. The first algorithm (vocabulary-based mapping) relies on the recognition of vocabulary within a source record from a database of protein annotations. The second algorithm (domain-based mapping) assigns protein sequences based on the detection of protein domains and does not rely on preexisting sequence annotation.
DIAN has several subcomponents in support of these functions: a knowledge base of assignments of SWISS-PROT proteins to ontologies; two databases that provide operational definitions for each ontological node, based either on vocabulary or the assignment of protein domains; two assignment algorithms for assigning proteins on the basis of either vocabulary or the presence of protein domains; and lastly a data indexing and retrieval engine to support user queries. Each subcomponent is described in the following sections.

Figure 2 DIAN overview. (Left) computer-aided human curation process for the assignment of SWISS-PROT sequences to the Protein Function and Cellular Role ontologies; (right) application of ontologies to organize biological annotation databases. Multiple ontologies, each representing a body of biological knowledge, are stored in the DIAN database. Individual source records stored in biological sequence and structure annotation databases are associated with one or more ontologies via domain-based and/or vocabulary-based mapping, such that they can be queried simultaneously across multiple ontologies. (GB) GenBank; (GP) GenPept; (SP) SWISS-PROT; (DB) database.

Development of the DIAN Knowledge Base

An essential component in DIAN is a knowledge base derived from a computer-aided human curation process that associates entries of the SWISS-PROT database to ontologies. SWISS-PROT is known for its high-quality curated annotations of protein sequences and minimal level of redundancy (Bairoch and Apweiler 2000). Although most sequence databases provide SWISS-PROT links to leverage its high-quality annotations, accurate and comprehensive classification of SWISS-PROT entries onto Protein Function and Cellular Role ontologies has not been achieved. DIAN relies on this knowledge base as a foundation to define parameters and data sets to support the computational assignment of proteins to ontologies.

During the early phase of the development of this knowledge base, we attempted to rely on preexisting links between SWISS-PROT and other publicly available databases to determine whether these links could be used directly to associate SWISS-PROT records to ontologies. It was found that this superficially straightforward method of assignment is error prone, and that the resulting coverage of SWISS-PROT was not comprehensive. Instead, the assignment of SWISS-PROT to the Protein Function and Cellular Role ontologies stored in the knowledge base was achieved through a computer-aided manual curation process (illustrated in Fig. 2). A group of scientific curators was assembled to manually assign SWISS-PROT sequences to the DIAN Protein Function and Cellular Role ontologies by matching the functional annotation of each SWISS-PROT record to the definition of each node in a given ontology. To ensure the high accuracy of this underlying data set, we analyzed only the subset of SWISS-PROT proteins that are full length and have been characterized biochemically. This resulted in the initial assignment of over 40,000 proteins to the DIAN ontologies.

Subsequent to this manual curation process, a database of controlled vocabulary was evolved from the assignment, in which for each ontological node, essential keywords were extracted from the annotations of the SWISS-PROT proteins assigned to the node. To enhance the selectivity and sensitivity of each definition, this data set was used to partition SWISS-PROT according to records that are either positively or negatively assigned to a node. Each set of partitioned SWISS-PROT records was examined thoroughly by curators to identify false positive records in the positive pool, and records characterized as false negatives in the negative pool. This information was then used for a second round of keyword refinement as feedback data in generating a subsequent, more refined set of controlled vocabulary. This process was repeated until no further additional identifiable false positives could be detected. Once this data set stabilized for all nodes, the SWISS-PROT-Ontology assignment table was finalized, resulting in the assignment of over 84% of SWISS-PROT (for further details, see on-line supplementary Table 2B at http://www.genome.org). This information was added to the knowledge base, such that it now provides an operational definition that expresses the knowledge associated with each node. The knowledge base was later used as the foundation for the development of nodal signatures, and along with periodic verifications of the selectivity and sensitivity on new releases of SWISS-PROT, it ensures the continued assignment of SWISS-PROT entries to the Protein Function and Cellular Role ontologies as SWISS-PROT evolves.

Development of Nodal Signatures

To classify biological sequence annotations by assigning them to ontologies, we developed annotation signatures for each node of the supported ontologies. Such nodal signatures provide the operational definitions used by the DIAN assignment algorithms to recognize properties in protein sequences, such that sequences from input databases can be assigned to ontologies. Two kinds of nodal signatures are used in the DIAN algorithm: signatures based on either controlled vocabulary or protein domain profiles. A protein domain is here defined as an independent structural unit, which can be found alone, or in conjunction with other domains.
Domains are often the mediators of the biochemical functions of proteins, although a substantial fraction of domains appears to play structural roles only. For this and other reasons, not all domains can be used as nodal signatures. For the Protein Function and Cellular Role ontologies, controlled vocabulary databases are used to efficiently collapse protein annotations present in source records and to assign these records to ontologies, as was done when assigning SWISS-PROT sequences to ontologies during the development of the knowledge base. This controlled vocabulary is expected to accurately classify sequences via annotations preexisting in the source records as long as the quality of these annotations is comparable to that of SWISS-PROT. Although sensitive enough to capture input sequence annotations under most circumstances, this approach is essentially a keyword-matching mechanism that may incorrectly assign records to ontologies as compared with the actual sequence annotation. This is an expected consequence of the process by which nodal vocabularies are derived. For example, it is possible for both a kinase substrate and a kinase enzyme to become assigned to the same ontology kinase node, when in fact only kinase enzymes should be assigned to this node. This is a consequence of the difficulty of defining assignments on the basis of vocabularies alone. Of larger consequence is the intrinsic quality of the annotations associated with a sequence to be assigned, because annotations in most sequencing projects are transferences obtained through sequence similarity alignments with characterized genes or proteins. This can lead to so-called “multiple-linkage” errors during the annotation transfer process, which creates misleading annotations due to the localization of the alignment in a region with low functional information content (e.g., a region devoid of a functional domain).
Therefore, an additional assignment algorithm was derived to compensate for this well-known problem by relying only on the presence of domains within protein sequences or the translation of DNA sequences into proteins. Whereas evolutionarily and functionally related protein sequences can diverge significantly through evolution, three-dimensional substructures, such as motifs, domains, and active sites, can remain largely unchanged (Gusfield 1997). As a result, protein domain profiles compiled from multiple sequence alignments can enable more accurate representation of protein families and superfamilies. Furthermore, such conserved sequence features are highly correlated with structure and function. As a result of the success of the protein profiling methodology, several protein domain and motif databases have been built: PFAM (Sonnhammer et al. 1998), PROSITE (Bairoch 1991), PRODOM (Corpet et al. 1998), DOMO (Gracy and Argos 1998), EMOTIF/EMATRIX (Nevill-Manning et al. 1998; Wu et al. 2000), BLOCKS (Henikoff et al. 1999), and PRINTS (Attwood et al. 1997). Although in the current DIAN algorithm we have chosen to rely on domains provided by the PFAM database because of its extensive coverage and the richness of its associated annotations, other domain or motif databases can be integrated in the same fashion. Because of the close relationship between a given protein domain and the function and structure of a protein that harbors this domain, the ontological classification of protein sequences using well-chosen protein domains can be achieved by using an effective balance between the specificity and sensitivity of individual domains. A filtering algorithm was therefore developed to select domains qualified to function as nodal signatures to be used in assigning proteins to ontologies.
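A minimal sketch of such a domain filter is given here. It is not the DIAN filtering algorithm itself: the function name `select_nodal_signatures`, the thresholds `min_specificity` and `min_support`, and all protein and domain identifiers are assumptions made for illustration, and a simple frequency criterion stands in for the curated comparison of annotations against node descriptions. A domain is kept as a signature for a node only when nearly all knowledge-base proteins containing that domain are assigned to the node.

```python
from collections import defaultdict


def select_nodal_signatures(domain_hits, node_assignments,
                            min_specificity=0.9, min_support=2):
    """Pick domains that are specific to one ontology node.

    domain_hits:      protein id -> set of domain ids found in it
    node_assignments: protein id -> set of ontology nodes (knowledge base)
    Both thresholds are illustrative assumptions.
    """
    proteins_with = defaultdict(set)
    for protein, domains in domain_hits.items():
        for domain in domains:
            proteins_with[domain].add(protein)

    signatures = defaultdict(set)
    for domain, proteins in proteins_with.items():
        if len(proteins) < min_support:
            continue  # too few examples to judge specificity
        counts = defaultdict(int)
        for protein in proteins:
            for node in node_assignments.get(protein, ()):
                counts[node] += 1
        for node, n in counts.items():
            if n / len(proteins) >= min_specificity:
                signatures[node].add(domain)
    return dict(signatures)


# Hypothetical toy knowledge base.
domain_hits = {
    "P1": {"kinase_dom", "SH3"}, "P2": {"kinase_dom"},
    "P3": {"kinase_dom", "SH3"}, "P4": {"SH3"}, "P5": {"SH3"},
}
node_assignments = {
    "P1": {"transferase"}, "P2": {"transferase"}, "P3": {"transferase"},
    "P4": {"structural"}, "P5": {"receptor"},
}
signatures = select_nodal_signatures(domain_hits, node_assignments)
```

In this toy data the `kinase_dom` domain occurs only in transferase proteins and is retained, whereas `SH3` is spread across several nodes and is rejected, mirroring the many-to-many problem that a filter of this kind is meant to resolve.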
Comprehensive analyses of the DIAN knowledge base for patterns of association between PFAM domains and SWISS-PROT sequences assigned to ontological nodes revealed frequent many-to-many relationships between domains and nodes. To promote specificity, it was therefore necessary to analyze all preliminarily assigned protein domains for the possibility of conversion to nodal signatures for a particular node. This was accomplished in the following way: First, for each of the protein domains in the source pool, the annotations of all SWISS-PROT sequences containing a particular protein domain were compared against the assignment of these sequences to a node, as maintained in the DIAN knowledge base. Second, if a set of annotations associated with sequences containing a given protein domain was found to be correlated with the description of the node, this domain was accepted as the annotation signature for that ontology node, as this domain is relatively specifically correlated with that node.

These concepts are illustrated in Figure 3 in the case of the Protein Function ontology. This ontology is expressed as a tree in which each node represents a concept and is associated with other concepts via an “IS-A” relationship. The root of this tree (level 0) is a generic function. Child nodes inherit the properties of their parent and express increasingly specific protein functions. For example, among the children of the root lies the “enzyme” node, which is defined as “biomolecules that can catalyze reactions.” Associated with this node are keywords positively correlated with this function, such as “Oxidoreductase OR Transferase OR Hydrolase OR Lyase OR Isomerase OR Ligase”. As a first step in the derivation of the DIAN knowledge base, proteins described in the SWISS-PROT database were assigned to the most specific nodes possible. Here, six proteins were assigned to the transferase node (Fig. 3). Two proteins belong to the same gene family and are of human origin, whereas all other proteins are from different gene families from various species. Various protein domains are present within these proteins, sometimes more than once in a given protein. Thus, a total of five distinct types of protein domains are present within the group of proteins assigned to the transferase node. However, only three types of domains are retained by DIAN as protein annotation signatures, because according to the DIAN knowledge base these domains are the only domains to be specifically associated with transferase-related functions. Thus, the two remaining domain types were rejected as annotation signatures because they are either not encoding a function related to transferases, or are purely structural domains not directly involved in protein function. In this way, any database of protein motifs or domains can in principle be integrated in the DIAN algorithm to derive ontological node signatures. The current DIAN implementation relies on protein domains from the PFAM database as its source of protein domains to be converted into ontological node signatures.

Figure 3 Converting protein domains into ontological nodal signatures. The Protein Function ontology is used here to illustrate the derivation and assignment of nodal annotation signatures. Proteins are depicted as rectangles; identical colors indicate membership to the same protein family in a given species, whereas the various protein domains are represented as geometrical shapes.

Based on overlaps between the annotations present in the 86,593 sequences of release 39 of SWISS-PROT and the
concepts associated with our ontological nodes, computer-aided human curation associated 73% of SWISS-PROT sequences to the PROSITE Protein Function ontology, 68% to the EGAD Cellular Role ontology, and 68% to the DoubleTwist Biological Role ontology. Overall, 205,694 keyword-based patterns and 1699 PFAM domains were compiled to represent the biological concepts associated with each ontology node. Nodal signatures for the structural ontology were derived differently from the process described in Figure 3. This was achieved by profiling the SCOP domain sequences compiled by the SCOP consortium (Brenner et al. 1998), using selected protein domains from the PFAM database. Because high sequence similarity usually implies significant functional and structural similarity (Gusfield 1997), 824 PFAM domains were identified that are referenced in sequences of the SCOP domain database (S. Brenner, pers. comm.). These PFAM domains show strong sequence similarity to SCOP domains and were selected because they are likely to represent a similar structure in three-dimensional space.

Ontological Assignment Process

Two assignment algorithms are used to assign proteins to DIAN ontologies. This is achieved on the basis of either the presence of protein domains or the recognition of vocabularies within the source record. As shown in Figure 2, annotations in various biological sequence databases, including GenBank, SWISS-PROT, GenPept, PDB, and UniGene, are collapsed through either the domain-based or vocabulary-based algorithms into a centralized DIAN database. In cases where DNA sequences are operated on by the domain-based algorithm, a translation algorithm is applied, as DIAN ultimately operates only on protein sequences. Genomic DNA sequences are treated differently in this process because these sequences show very different properties from cDNA and proteins. In particular, sequence length can easily exceed a million characters.
For this reason, it would be incorrect to apply ontologies directly at the level of an entire genomic sequence. Thus, location coordinates are essential to segment genomic sequences into biologically meaningful ranges (“units”) before further processing. If available in the source record, information specifying the presence of genes, derived ab initio or experimentally, is used to define the unit. However, in sequences derived from high-throughput sequencing projects (e.g., sequences from the GenBank HTG division), this information is frequently unavailable. In such cases, DIAN can use available gene predictions from algorithms such as GENSCAN (Burge and Karlin 1997) or GENEWISE (Birney and Durbin 2000) to locate the genes in the genomic sequence. As mentioned earlier, another assignment approach applied by DIAN is based on the scanning of annotations associated with the input biological sequence using a vocabulary-based mapping process. This is accomplished by the application of a collection of keywords that serve as the ontology node annotation signatures, enabling the collapse of preexisting annotations and their assignment to ontological nodes. The input sequence annotations can be derived from sequence similarity information, domain profiling information, human curation, computation-derived annotations, third-party annotations, and so forth. Together, the domain-based and vocabulary-based algorithms are used by DIAN to annotate and classify sequences from input biological databases in a high-throughput manner.

DIAN Algorithm Evaluation

The sensitivity and selectivity of the DIAN algorithm were evaluated. Based on sequence similarity results, the vocabulary-based algorithm implicitly transfers existing annotations and assigns proteins to ontological nodes.
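Vocabulary-based mapping of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `assign_by_vocabulary`, the dictionary `NODE_VOCABULARY`, the one-word vocabulary of the "kinase" node, and both annotation strings are hypothetical, whereas the enzyme keywords are the ones quoted for the enzyme node in the text. The sketch also reproduces the kinase substrate problem that plain keyword matching cannot avoid.

```python
import re

# Keyword signatures per ontology node. The "enzyme" keywords come from the
# text; the "kinase" entry is an assumption added for illustration.
NODE_VOCABULARY = {
    "enzyme": {"oxidoreductase", "transferase", "hydrolase",
               "lyase", "isomerase", "ligase"},
    "kinase": {"kinase"},
}


def assign_by_vocabulary(annotation, vocabulary=NODE_VOCABULARY):
    """Return every node whose keyword set intersects the annotation text."""
    tokens = set(re.findall(r"[a-z0-9]+", annotation.lower()))
    return {node for node, keywords in vocabulary.items() if tokens & keywords}


# Hypothetical annotation strings.
enzyme_hit = assign_by_vocabulary("Serine/threonine-protein kinase; transferase")
substrate_hit = assign_by_vocabulary("Substrate of protein kinase C")
```

Both annotations land on the "kinase" node, although only the first describes a kinase enzyme; the second assignment is a false positive of the type that curated refinement of the vocabulary, or a domain-based signature, is intended to remove.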
However, this process suffers from two intrinsic types of errors: Because of the variability of vocabularies in the annotations, it is very difficult to identify and compensate for incorrect annotations during this annotation transfer process. Furthermore, multiple linkage errors are generated when annotations are wrongly transferred when the sequence similarity between both sequences is only present within core structural regions with low information content, rather than encompassing functional domains. However, the domain-based assignment algorithm is not susceptible to these problems. Thus, despite the observation that the domain-based algorithm generated less coverage than the vocabulary-based algorithm, the domain-based algorithm can make annotation assignments in the absence of preexisting annotations in the source records. The accuracy of an ontological mapping algorithm such as DIAN is defined as the fraction of correct assignments made to the nodes of an ontology, both in terms of type I variations (assignments that should not have been made but are present) and type II variations (assignments that are missing and that should have been made). Here we use the terms types I and II “variation”, rather than “type I error” and “type II error”, to emphasize that providing exact error rates in this context is fundamentally impossible (see the following discussion of error measurements in this context). The accuracy of the DIAN algorithm was evaluated using three complementary approaches, summarized in Table 2. The construction of the underlying data sets is described in Figure 4. Detailed results of evaluations are documented in Table 3. The DIAN assignments of well-characterized mouse sequences were compared with assignments made via an independent assignment process (method 2, Table 2). These assignments were provided by the Mouse Genome Database (MGD; Blake et al. 
2000) using the Molecular Function and Biological Process ontologies from the Gene Ontology (GO) Consortium (Ashburner et al. 2000; http://www.geneontology.org). The application of GO ontologies to the mouse genome was chosen over that of other organisms such as Drosophila and others because of its closer relationship to human proteins and the bias in the SWISS-PROT database toward higher organisms. Because these ontologies are different from those currently supported by DIAN, a cross-referencing was first determined to enable comparisons of assignments. As shown in Figure 4B, comparing assignments made to ontologies is accomplished first by manually selecting nodes from a reference ontology for concepts that are shared between the ontologies. Because of the different levels of resolution supported by different ontologies, nodes at equivalent levels of resolution need to be identified. This results in some of the terminal nodes of one ontology being associated with middle nodes of the counterpart ontology. Furthermore, multiple nodes from one ontology may need to be selected to represent the concepts associated with a single node from the counterpart ontology (indicated by purple nodes from the reference ontology, all of which are conceptually equivalent to a single node from the DIAN ontology). Thus, the node associated with the INHIBITORS concept on level 3 of the DIAN ontology is conceptually equivalent to the APOPTOSIS INHIBITORS and ENZYME INHIBITORS nodes and subnodes on levels 6 and 8 and lower of the reference ontology. Other problems arise from the differing extent of coverage between ontologies, which can obscure the interpretation of the comparison. In this example, there are several more proteins mapped to the DIAN ontology than to corresponding nodes of the reference ontology. Some proteins are mapped to both ontologies (green area, where individual pair members are indicated by double arrows), whereas other proteins are mapped only to a single ontology (red area). Within the latter, a manual review will find that some proteins are correctly mapped (blue rectangles), whereas others are incorrectly mapped (yellow rectangles). Lastly, there can be variations in the comprehensiveness of assignments made to individual proteins, such that only a fraction of the properties associated with a single protein are assigned to an ontology (data not shown).

Table 2. Methodologies Involved in the Evaluation of DIAN: Strengths and Weaknesses

Approach 1. Manual verification of assignments made to selected proteins.
  Description: In-depth review by domain experts of assignments made to well-understood proteins.
  Strengths: Extensive human expertise can confirm assignments made by method and substantiate its effectiveness.
  Weaknesses: Suffers from lack of comprehensiveness; biased in favor of well-understood proteins.

Approach 2. Comparisons with other assignment data sets using a test set of sequences.
  Description: Evaluation of sequence assignments made to cross-referenced ontologies using different methods.
  Strengths: Presence of extensive shared assignments for numerous proteins lends credence to the method under evaluation.
  Weaknesses: Assumes that the reference ontology can be treated as a standard of comparison; in practice, this is not the case. Results in the identification of weaknesses in both the test and reference ontology. Manual review is required to evaluate unbalanced assignments.

Approach 3. Comparisons between orthologs.
  Description: Verification that assignments made to closely related orthologs are balanced (i.e., nearly identical).
  Strengths: Strong expectation that balanced assignments will be made.
  Weaknesses: Although orthologs share functions, even orthologs from closely related species don’t necessarily have identical functions, resulting in unbalanced assignments; manual review is required to evaluate unbalanced assignments.
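The comparison against a cross-referenced ontology (approach 2) can be sketched as a set computation. The function name `compare_assignments`, the sequence identifiers, and the flat list of reference nodes are assumptions for illustration; in particular, this sketch assumes that the set of reference nodes equivalent to a DIAN node has already been expanded to include any relevant subnodes. The INHIBITORS example follows the cross-reference described in the text.

```python
def compare_assignments(dian, reference, cross_reference):
    """Split sequences into shared, DIAN-only, and reference-only groups.

    dian:            sequence id -> set of DIAN nodes
    reference:       sequence id -> set of reference-ontology nodes
    cross_reference: DIAN node -> set of equivalent reference nodes
                     (assumed to already include subnodes)
    """
    result = {}
    for dian_node, ref_nodes in cross_reference.items():
        in_dian = {s for s, nodes in dian.items() if dian_node in nodes}
        in_ref = {s for s, nodes in reference.items() if nodes & ref_nodes}
        result[dian_node] = {
            "both": in_dian & in_ref,
            "dian_only": in_dian - in_ref,
            "reference_only": in_ref - in_dian,
        }
    return result


cross_reference = {
    "INHIBITORS": {"APOPTOSIS INHIBITORS", "ENZYME INHIBITORS"},
}
# Hypothetical sequence assignments.
dian = {"seq1": {"INHIBITORS"}, "seq2": {"INHIBITORS"}, "seq3": set()}
reference = {"seq1": {"APOPTOSIS INHIBITORS"}, "seq3": {"ENZYME INHIBITORS"}}

comparison = compare_assignments(dian, reference, cross_reference)
```

The three resulting groups correspond to the "DIAN and GO", "DIAN only", and "GO only" counts used in the comparative evaluation; as the weaknesses listed for approach 2 indicate, membership in either single-ontology group still requires manual review before it can be called an error.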
Detailed results of this evaluation are listed in Table 3A and 3B.

Figure 4 Validation approaches. (A) Evaluating the effectiveness of DIAN by comparing assignments made to a reference ontology. Selected nodes from Gene Ontology (GO) ontologies were manually associated with nodes in the DIAN ontologies. Sequences assigned to these GO nodes by the MGI were processed by the DIAN pipeline to compare the assignments made by DIAN with those made by MGI. (B) Associating nodes and sequences from a reference ontology to a DIAN ontology for comparative evaluation. To estimate the error rates associated with the DIAN assignment algorithms, we compared mouse sequences mapped via DIAN (A) with assignments made to GO ontologies by MGI.

Table 3. Comparison between DIAN and MGI Ontological Assignments
Results from the comparative approach are shown. A number of intrinsic problems were identified from this approach, such that type I and type II variances described here are for comparative purposes only and cannot be interpreted strictly as type I and II errors.

Table 3A. Comparing Assignments Made to the Cellular Role Ontology

Concept                   DIAN node   Highest level matching GO nodes   DIAN and GO   DIAN only   GO only   Type I   Type II   Sensitivity   Selectivity
Chromosome structure      1.1         GO:0007001;GO:0006323             7             4           4         0.267    0.267     0.636         0.636
Transcription factors     1.4         GO:0003700                        59            54          38        0.358    0.252     0.608         0.522
DNA duplication           3.2         GO:0006260;GO:0003964             10            2           2         0.143    0.143     0.833         0.833
Cell-cell adhesion        5.2         GO:0007155                        35            15          14        0.234    0.219     0.714         0.700
Transcription factors     9.1.1.1     GO:0008135                        1             0           1         0.000    0.500     0.500         1.000
Microtubule               6.2         GO:0007017                        9             0           2         0.000    0.182     0.818         1.000
DNA repair                8.1         GO:0006281                        14            2           9         0.080    0.360     0.609         0.875
Programmed cell death     8.2         GO:0006915                        14            7           23        0.159    0.523     0.378         0.667
Channel and transporter   4.6         GO:0006810;GO:0005216             47            8           27        0.098    0.329     0.635         0.855
Amino acid metabolism     9.2         GO:0006519                        4             1           7         0.083    0.407     0.476         0.625
Stress response           8.4         GO:0006950                        5             6           55        0.091    0.833     0.083         0.455
Nucleotide metabolism     9.4         GO:0006140:0006205 GO:0006143     0             2           7         0.222    0.778     0.000         0.000
Cofactor metabolism       9.5         GO:0006731                        4             0           3         0.000    0.429     0.571         1.000

Total DIAN and GO: 229; Total DIAN only: 113; Total GO only: 214; Total: 556; Average type I: 0.203; Average type II: 0.385; Sensitivity: 0.517; Selectivity: 0.670

DIAN assignments made to a group of well-characterized, nonredundant mouse sequences were compared to assignments made by the MGI to the GO Process and Function ontologies. GO nodes corresponding to DIAN nodes are listed, along with the abbreviated essential concept from the DIAN Role ontology. For brevity, only the highest level GO nodes are listed. The number of sequences whose assignment is shared by both sets of ontologies is indicated (DIAN and GO), as well as the number of sequence assignments which differed (DIAN only, GO only). These numbers are used to calculate Type I and II variation using the following equations:
  Type I variation = DIAN only/(DIAN and GO + DIAN only + GO only)
  Type II variation = GO only/(DIAN and GO + DIAN only + GO only)
  Sensitivity = DIAN and GO/(DIAN and GO + GO only)
  Selectivity = DIAN and GO/(DIAN and GO + DIAN only)
Sensitivity is defined as the ability of the DIAN algorithm to make what are believed to be all possible correct assignments. Selectivity is defined as the ability of the DIAN algorithm to not make what is believed to be an incorrect assignment.

Table 3B. Comparing Assignments Made to the Protein Function Ontology

Concept                           DIAN node   DIAN and GO   DIAN only   GO only   Type I   Type II   Sensitivity   Selectivity
Hormones and active peptides      10          7             3           8         0.167    0.444     0.467         0.700
Inhibitors                        12          12            3           9         0.125    0.375     0.571         0.800
DNA or RNA associated proteins    3           255           21          26        0.070    0.086     0.907         0.924
Protein secretion and chaperones  13          11            3           2         0.188    0.125     0.846         0.786
Electron transport proteins       5           0             7           6         0.538    0.462     0.000         0.000
Other transport proteins          6           62            17          19        0.173    0.194     0.765         0.785
Structural proteins               7           31            23          40        0.245    0.426     0.437         0.574
Receptors                         8           67            43          15        0.344    0.120     0.817         0.609
Cytokines and growth factors      9           35            19          10        0.297    0.156     0.778         0.648

Total DIAN and GO: 480; Total DIAN only: 139; Total GO only: 135; Total: 754; Average type I: 0.184; Average type II: 0.179; Sensitivity: 0.780; Selectivity: 0.775

Highest level matching GO nodes, by DIAN node:
  10 (Hormones and active peptides): GO:0005179;GO:0005103 GO:0005104;GO:0005105 GO:0005106;GO:0005109 GO:0005110;GO:0005111 GO:0005112;GO:0005113 GO:0005114;GO:0005115 GO:0005116;GO:0005117 GO:0005118;GO:0005119 GO:0005120;GO:0005121 GO:0005122;GO:0005123 GO:0005124;GO:0005177 GO:0005178;GO:0005186
  12 (Inhibitors): GO:0004857;GO:0008189 GO:0005074;GO:0005092 GO:0008200;GO:0005517
  3 (DNA or RNA associated proteins): GO:0003676;GO:0003735 GO:0004748;GO:0003910 GO:0003911;GO:0004518 GO:0003899;GO:0008534 GO:0008263;GO:0003907 GO:0003905;GO:0003906 GO:0003904 GO:0004844;GO:0003908
  13 (Protein secretion and chaperones): GO:0003754;GO:0008565 GO:0006605
  5 (Electron transport proteins): GO:0005489
  6 (Other transport proteins): GO:0005215
  7 (Structural proteins): GO:0005198
  8 (Receptors): GO:0004872
  9 (Cytokines and growth factors): GO:0008083;GO:0005125 GO:0008009

A group of well-characterized, nonredundant mouse sequences were assigned to the Protein Function ontology by the DIAN domain-based mapping algorithm. These assignments were compared to assignments made to the GO Process and Function ontology by the MGI.

A number of intrinsic problems were identified from our investigation of the different evaluation methodologies described here. These are summarized in Table 2. For example, 50% of the Drosophila genes were classified against the GO Molecular and Biological Function ontologies by the Drosophila community, yet no analysis of the errors associated with this work was presented (Ashburner et al. 2000). This is due to the inherent difficulty of assessing error rates associated with ontological classification, such that none of these genome annotations and their associated evidence codes can be statistically evaluated with confidence levels. Here we provide the first attempt to analyze the error rates associated with ontological classification.
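The variation, sensitivity, and selectivity definitions given with Table 3A can be checked directly against the reported totals. The short sketch below applies those four equations to the Table 3B totals (480 shared, 139 DIAN only, 135 GO only); the function name `variation_metrics` and the dictionary keys are hypothetical, and returning `None` for an undefined ratio is an assumption of this sketch rather than a convention used in the tables.

```python
def variation_metrics(both, dian_only, go_only):
    """Apply the Table 3 equations to one row or to the column totals."""
    total = both + dian_only + go_only

    def ratio(numerator, denominator):
        # Undefined ratios (zero denominator) are reported as None here.
        return numerator / denominator if denominator else None

    return {
        "type_I": ratio(dian_only, total),
        "type_II": ratio(go_only, total),
        "sensitivity": ratio(both, both + go_only),
        "selectivity": ratio(both, both + dian_only),
    }


# Totals reported for Table 3B.
table_3b = variation_metrics(both=480, dian_only=139, go_only=135)
rounded = {name: round(value, 3) for name, value in table_3b.items()}
```

Rounded to three decimals, the pooled totals give 0.184, 0.179, 0.780, and 0.775, which agrees with the summary line of Table 3B. This suggests that the reported "average" variations were computed from pooled counts rather than as a mean of the per-node rates.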
Table 4. Requirement for Manual Validation of Comparative Results

Columns "DIAN and GO", "DIAN only", and "GO only" are grouped under "Present in".

Concept                 DIAN/Role node number   DIAN and GO   DIAN only   GO only   Reported type I variation   Effective number of type I assignments   Effective rate of type I variation
Cytoskeletal            3.1                     30            19          16        0.29                        4                                        0.06
Nucleotide              6.5                     6             25          11        0.60                        3                                        0.05
Sugar/glycolysis        6.7                     0             35          8         0.81                        3                                        0.07
RNA polymerases         5.1.1                   0             4           0         1.00                        4                                        0.00
RNA processing          5.1.2                   4             9           10        0.39                        1                                        0.04
Transcription factors   5.1.3                   142           124         98        0.34                        0                                        0.00**

Type I variation here refers to those assignments made by DIAN but not in the reference ontology implementation system (GO system). Manual validation results show that Type I variation (DIAN-only assignments) cannot simply be treated as Type I error in a strict statistical sense.

Because of the lack of a collection of comprehensive, robust assignments that can be used as a standard of comparison, it is inherently impossible to achieve a completely robust assessment of any assignment methodology. Consequently, none of the approaches described here were entirely satisfactory. Problems range from multiple types of biases in testing sets to the field's incomplete understanding of the function of the proteins in the test sets. Therefore, in many cases the DIAN algorithms were found to be making plausible assignments that cannot be verified with the present data. Additional problems include variability in the comprehensiveness of assignments made to a given protein, as well as variability in the comprehensiveness of assignments of various proteins to ontologies, that is, differences in the coverage between assignment data sets produced by different methods. For example, in the experiment depicted in Table 4, 40% of assignments generated by DIAN (representing 216 assignments) were originally found to be absent in MGD. These were initially considered to be erroneously introduced by the DIAN algorithm, and were
therefore classified as type I variations. However, on manual review, most of these assignments were found to be correct, such that the proportion of true type I variations was ultimately reduced to 2.5%. Thus the Type I and II variations in our evaluation scheme cannot be interpreted simply as Type I or II errors in a strict statistical sense. The missing assignments presumably reflect limitations in the keyword-recognition algorithm used in most of the assignments currently provided by the Mouse Genome Database (outlined in Fig. 5A). As an illustration, MGD assignments for entry #104661, which codes for RAR-related orphan receptor α, are depicted in Figure 5B. This gene, a member of the nuclear hormone receptor superfamily involved in the thyroid hormone signaling pathway, was assigned to GO categories by MGD on the basis of electronic annotation using a keyword-scanning algorithm (GO evidence code IEA). This algorithm correctly identified the protein function as “DNA binding” and the role of the gene as “transcriptional regulation”, but failed to also indicate its receptor function, which is involved in cell signaling (Fig. 5B).

Although a more systematic evaluation of assignment algorithms is not feasible because of these deficiencies, results from the evaluation approaches applied here indicate that DIAN returns generally correct assignments of proteins to its various ontologies. Deficiencies in DIAN’s assignment algorithms were most manifest in its tendency toward underprediction (type II variation). Our manual curation and validation indicate that this error type is far more common than overprediction. This reflects the conservativeness of the selection of protein domains as bona fide annotation signatures for a given node, as well as the limited coverage of the protein universe by domains presently available in the PFAM database, on which the current version of the algorithm is based. In contrast, overprediction is much less frequent and relates to domains that are not completely specific to a given concept and thus return spurious assignments. Other problems include limitations in the resolution of the algorithm, such that DIAN may be unable to correctly assign sequences to very specific nodes such as leaves in the Enzyme hierarchy.

Figure 5 Comparison of assignment methods. (A) Comparison of automated and manual assignment methods. The properties of automated assignment methods such as DIAN are compared with those of manually generated assignments. (B) Comparison of DIAN and MGI assignments. Results from a simple keyword-based method are illustrated here in assignments made by the algorithm used by Mouse Genome Informatics Database, as compared with DIAN assignments. Note that the “DNA binding” cellular role is vague, as the correct function for this gene should be “transcription factor.”

DISCUSSION

Considerations Related to the Assignment of Protein Domains to Biological Ontologies

Because protein domains often involve many-to-many relationships with respect to biochemical function, that is, a given domain may be associated with multiple biochemical functions, curating these associations to ensure specificity is essential to reduce incorrect assignments. This is most manifest in cases where a simple linkage is made between a protein domain and a biological ontology, such as in the PRINTS and PROSITE databases. Therefore, it is necessary to review the specificity of an assignment in the context of all other assignments this domain may have to other nodes.
Furthermore, an evaluation of a nodal annotation signature with respect to the protein universe, here currently approximated by the SWISS-PROT database, is required to be statistically rigorous. For such a review to be robust, it becomes necessary to first associate all known domains to all protein functions described by an ontology, and then to estimate the significance of these associations to ensure that they are informative and not due to, for example, a requirement for a structural role unconnected to the protein function under consideration. This is because only a fraction of domains are truly diagnostic for a given protein function, and although careful manual review can help strengthen the quality of these associations, we believe that only when a global view of associations is available can domains with a low specificity to different functions be eliminated and meaningful assignments be made. Because of the magnitude of the work, generating such a global view can only be achieved via a combination of automation followed by manual curation. In the case of DIAN, this was accomplished by manually deriving a knowledge base composed of the assignments of all SWISS-PROT proteins to the various ontologies used by DIAN. This was to serve as the first step in defining domains that are meaningfully associated with protein function. This knowledge base was then used to perform exhaustive verifications of the significance of these associations by deriving a heuristic decision rule by which to accept or reject the association of individual domains to ontological nodes. For each candidate protein domain for the annotation signature of the ontology node, the annotations of all SWISS-PROT sequences containing this particular protein domain were analyzed against the SWISS-PROT sequences previously associated with this node by the DIAN knowledge base.
A significant overlap between these pools of SWISS-PROT records ensures that a particular PFAM domain can be used as a nodal signature. This information is encoded in a heuristic rule that further requires that a majority of at least four of five SWISS-PROT proteins used in the knowledge base be nonfragmentary, and that annotations associated with these sequences be derived from published laboratory results.

Evaluation of Keyword-Based Versus Domain-Based Ontology Nodal Assignment Methods

As described earlier, DIAN combines two algorithms for the automated assignment of proteins to ontologies that rely on an underlying knowledge base assembled using manual curation, along with heuristic rules. By comparison, other assignment efforts, such as those made by MGD in the context of the GO consortium, currently rely primarily on a simple process of scanning source records for keywords that map to GO terms. Full manual assignment of records is intended to follow this initial phase. However, such human curation has several significant limitations, among which is the prohibitive expense of genome-scale assignment. For this reason, over 84% of the 14,801 assignments presently available in MGD were generated by using keyword-based association, with the remaining assignments being produced manually. Because automated assignment methods can be expected to remain a key technology due to their high-throughput capability, development of algorithms that go beyond the limitations of simple keyword-based assignment is imperative, as most genes will not receive the kind of textual descriptions that lend themselves to this approach. Therefore, the domain-based approach of DIAN provides a distinct additional approach beyond keyword scanning, and permits high-throughput assignment independently of the presence of prior textual annotations, while retaining reasonable accuracy.
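The nodal-signature check described earlier in this Discussion compares two pools of SWISS-PROT records: the records that contain a candidate PFAM domain, and the records curated to an ontology node. A minimal sketch of such an overlap test is given below. The text does not state the exact decision rule, so the overlap definition (the fraction of domain-containing records that are also curated to the node), the 0.8 threshold, and all identifiers are assumptions made purely for illustration.

```python
def overlap_fraction(domain_pool: set, node_pool: set) -> float:
    """Fraction of domain-containing records that are also curated to the node."""
    if not domain_pool:
        return 0.0
    return len(domain_pool & node_pool) / len(domain_pool)


def accept_signatures(domain_pools: dict, node_pools: dict,
                      min_overlap: float = 0.8) -> dict:
    """Map each node to the sorted list of domains accepted as its signatures.

    The mapping is many-to-many: a domain may be accepted for several nodes,
    and a node may accept several domains.
    """
    accepted = {}
    for node, node_pool in node_pools.items():
        hits = [domain for domain, pool in domain_pools.items()
                if overlap_fraction(pool, node_pool) >= min_overlap]
        accepted[node] = sorted(hits)
    return accepted


# Hypothetical accessions, domain names, and node names.
domain_pools = {
    "PF_kinase": {"P1", "P2", "P3", "P4", "P5"},
    "PF_generic_linker": {"P1", "P6", "P7", "P8", "P9"},
}
node_pools = {
    "protein kinases": {"P1", "P2", "P3", "P4", "P10"},
    "structural proteins": {"P6", "P7", "P11"},
}
signatures = accept_signatures(domain_pools, node_pools)
print(signatures)
```

In this toy data the kinase domain is accepted for the kinase node, while the non-specific linker domain is rejected for both nodes, which mirrors the stated goal of eliminating domains with low specificity.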
Lastly, because of the frequent difficulty of confirming whether a given assignment is incorrect, such reviews should perhaps be limited to providing a general confidence value on the mappings made by automated methods. As was done here, selective manual reviews of individual assignments based on the comparison of different algorithmic implementations can also be used to uncover possible errors and defects in their respective mapping methodologies. Worthy of mention here is DIAN’s validation module, which integrates manual reviews to compensate for deficiencies in the various automated validation methods.

In summary, DIAN is a high-throughput annotation algorithm that uses biological ontologies to segment the spaces of protein function, biological role, and structure. When applied to data generated from genome sequencing projects, DIAN is an effective algorithm for the conceptual annotation of genome-scale data in a timely and scientifically accurate manner. It is also an effective data mining algorithm, applicable to the identification of novel correlations that can only be made at the conceptual level.

METHODS

Ontologies

DIAN currently supports five ontologies: the PROSITE ontology was used for Protein Function (http://www.expasy.ch/prosite/, release/version 16.30); Cellular Role is enabled by the EGAD ontology from TIGR (http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_report.spl, release/version N/A), which was originally derived from Monica Riley’s E. coli protein ontology; the Enzyme classification is from IUBMB (International Union of Biochemistry and Molecular Biology; http://www.chem.qmw.ac.uk/iubmb/enzyme/, release/version Enzyme Nomenclature 1992 and all of its supplements); SCOP is from the Medical Research Council (MRC) of the United Kingdom (http://scop.mrc-lmb.cam.ac.uk/scop/index.html, release/version 1.53); the DoubleTwist Biological Role was derived internally (release/version 1.00).
These ontologies can be viewed as taxonomies of IS-A links, in which a node situated at level 1 (Fig. 1) expresses a more general concept than a node at level 2, whereas a node situated at level 3 is more specialized than one at level 2.

Component Databases Supported by DIAN

The component databases supported by DIAN are the GenBank primate division (GB Release 121); UniGene (Build #129); SWISS-PROT (Release 39); PDB (Release as of 1/1/2001); and GenPept (Release as of 12/27/2000).

Construction of the DIAN Knowledge Base

Two databases were constructed as the foundation of the knowledge base associated with ontological nodes: a controlled vocabulary and regular expression database, and a protein domain signature database. For classification of protein structures, the PFAM motifs within the SCOP domain sequences compiled by the SCOP consortium (Brenner et al. 1998) were used as source material for the nodal signatures of the structural ontology. The controlled vocabulary database was populated during the construction of the SWISS-PROT-Ontology mapping table. A computer-aided human curation process was performed by a group of domain specialists whereby SWISS-PROT sequences were manually assigned to the supported ontologies. Node-specific vocabulary and regular expressions were derived and later used to control the association of source-record annotations to a given node of the supported ontologies. In this way, vocabulary data sets for each relevant node were created and manually curated with subsequent releases of SWISS-PROT. Using the manually curated association between SWISS-PROT sequences and ontological nodes, sequences in this database were processed to identify PFAM protein domains using Paracel’s GeneMatcher system.
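The node-specific vocabulary and regular expressions described above can be pictured as a lookup from ontology node to a text pattern that is applied to source-record annotations. The sketch below is illustrative only: the node names are borrowed from the tables in this paper, but the patterns, the `assign_nodes` helper, and the sample annotation string are hypothetical and do not come from the DIAN knowledge base.

```python
import re

# Hypothetical node vocabulary: ontology node name -> compiled regular expression.
NODE_PATTERNS = {
    "Receptors": re.compile(r"\breceptor\b", re.IGNORECASE),
    "Transcription factors": re.compile(
        r"\btranscription(al)?\s+(factor|regulat\w+)", re.IGNORECASE),
    "DNA or RNA associated proteins": re.compile(
        r"\b(DNA|RNA)[- ]binding\b", re.IGNORECASE),
}


def assign_nodes(annotation: str) -> list:
    """Return the sorted names of all nodes whose pattern matches the annotation."""
    return sorted(node for node, pattern in NODE_PATTERNS.items()
                  if pattern.search(annotation))


# Made-up annotation text for a nuclear receptor record.
record = "RAR-related orphan receptor alpha; DNA-binding transcriptional regulator"
nodes = assign_nodes(record)
print(nodes)
```

Because each node carries its own pattern, a single record can be associated with several nodes at once, which is the behavior a single-keyword scan tends to miss.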
Through the SWISS-PROT-Ontology table, annotations made with respect to PFAM domains in SWISS-PROT source records were used to verify the accuracy of the association of PFAM domains to an ontology node before assigning a domain to a node. Specifically, because each PFAM domain and each ontology node has a satellite pool of SWISS-PROT records, the extent of the overlap between these pools of records is used to confirm the correctness of the assignment of this PFAM domain to a particular ontology node. This was done in a many-to-many manner, such that a domain can be assigned to more than one node, and a given node can have more than one domain associated with it.

DIAN Algorithm Implementation

The underlying DIAN knowledge base was implemented using the Oracle 7.3 relational database management system (Oracle). For Hidden Markov Model searching, the GeneMatcher system was selected for its ability to perform high-throughput protein domain profiling using the PFAM database. User queries of the DIAN data set are performed using the PLS index-based search engine (http://www.pls.com) from America Online. Most of the DIAN pipeline was implemented using the Perl (v.5.0) language.

Benchmarks of chromosome 22 were obtained as follows: chromosome 22 was first fragmented into overlapping fragments of 200,000 bp. GENSCAN (Burge and Karlin 1997) was used to generate a database of predicted gene sequences. This collection of gene predictions was then processed by the DIAN pipeline for annotation analysis. In this case, rather than using GeneMatcher, PFAM domain profiling was done by farming the predicted gene translations to four workstations running the HMMER software package to show that DIAN can be applied easily as a component of a large-scale annotation system for genome-scale sequencing projects using a conventional computing architecture.
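The fragmentation step used for the chromosome 22 benchmark can be sketched as a sliding window. Only the fragment size of 200,000 bp is given in the text; the 20,000 bp overlap used below, the function name, and the example sequence length are assumptions for illustration.

```python
def overlapping_fragments(seq_len: int, size: int = 200_000, overlap: int = 20_000):
    """Yield (start, end) half-open coordinates of overlapping windows.

    The overlap value is an assumed parameter; the text states only the size.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if seq_len <= 0:
        return
    step = size - overlap
    start = 0
    while True:
        end = min(start + size, seq_len)
        yield start, end
        if end >= seq_len:
            break
        start += step


# Toy example: a 500,000 bp sequence.
windows = list(overlapping_fragments(500_000))
print(windows)
```

Each window would then be passed to the gene prediction program, and the overlap reduces the chance that a gene spanning a fragment boundary is truncated in every fragment that contains part of it.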
The coverage by DIAN of chromosome 22 was thus based on this database of predicted gene sequences. Only the domain-based assignment algorithm was used in this case.

DIAN Algorithm Evaluation

Three approaches were applied in evaluating the assignment accuracy of DIAN: manual verification, comparisons between assignments to different ontologies, and ortholog-based validation. In the first approach, assignments made to selected proteins were verified manually. A group of domain experts was given the task of reviewing annotation assignments of biological sequences made by the DIAN pipeline within their domain of expertise. Several dozen proteins of varied types were evaluated in this manner.

In the second approach, a test set was constructed for the comparative evaluation of assignments. Nodes from the GO:Process or GO:Function ontologies that are conceptually equivalent to nodes of the DIAN Protein Function or Cellular Role ontologies were identified (Table 3, Fig. 4A,B; see Fig. 4 for explanation). Mouse genes assigned by MGD (http://www.informatics.jax.org/; Baker et al. 1999) to these GO nodes (or their child nodes) were then retrieved. The protein sequences for these genes were obtained from RefSeq via shared HUGO gene names (http://www.gene.ucl.ac.uk/nomenclature/). An all-versus-all Smith-Waterman sequence similarity search (Smith and Waterman 1981) was further performed to eliminate sequence redundancy within these mouse sequences. Only sequences with <40% overall similarity were retained as the testing set, composed of 857 proteins. These sequences were then assigned to DIAN ontologies by the DIAN algorithm for comparison against their original assignments in GO ontologies (Table 3A, 3B).
Sequences with unbalanced assignments between GO and DIAN ontologies were examined manually to assess the source of the imbalance: the presence of a missing assignment of a bona fide property listed in GO, or a missing or incorrect assignment of a bona fide property in DIAN.

In the last approach, assignments made to orthologous sequences were compared. A test set of orthologous proteins was assembled, composed of a random set of 37 pairs of orthologous RefSeq protein records for mouse and human. Orthology was assumed when genes shared the same HUGO gene name. Sequences from the test set were processed by the DIAN pipeline, and resulting assignments were compared between proteins, with the expectation that identical assignments would be generated. Sequences with unbalanced assignments were examined manually to assess the source of the imbalance, such as the presence of a species-specific function or a possibly erroneous assignment made by the DIAN algorithm.

ACKNOWLEDGMENTS

A patent application for the DIAN algorithm has been filed with the U.S. Patent and Trademark Office. The authors are grateful to Drs. Doug Brutlag (Stanford University), Peter Karp (SRI International), and Andrew Karsaskis (DoubleTwist, Inc.) for valuable discussions. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

REFERENCES

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25–29. Attwood, T.K. 2000. The Babel of bioinformatics. Science 290: 471–473. Attwood, T.K., Avison, H., Beck, M.E., Bewley, M., Bleasby, A.J., Brewster, F., Cooper, P., Degtyarenko, K., Geddes, A.J., Flower, D.R., et al.
1997. The PRINTS database of protein fingerprints: A novel information resource for computational molecular biology. J. Chem. Inf. Comput. Sci. 37: 417–424. Bairoch, A. 1991. PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Res. (Suppl.) 19: 2241–2245. Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45–48. Baker, P.G., Goble, C.A., Bechhofer, S., Paton, N.W., Stevens, R., and Brass, A. 1999. An ontology for bioinformatics applications. Bioinformatics 15: 510–520. Ben-Yaacov, S., Le Borgne, R., Abramson, I., Schweisguth, F., and Schejter, E.D. 2001. Wasp, the Drosophila Wiskott-Aldrich syndrome gene homologue, is required for cell fate decisions mediated by Notch signaling. J. Cell Biol. 152: 1–14. Birney, E. and Durbin, R. 2000. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10: 547–548. Blake, J.A., Eppig, J.T., Richardson, J.E., and Davisson, M.T. 2000. The Mouse Genome Database (MGD): Expanding genetic and genomic resources for the laboratory mouse. The Mouse Genome Database Group. Nucleic Acids Res. 28: 108–111. Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170. Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95: 6073–6078. Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78–94. Chen, R.O., Felciano, R., and Altman, R.B. 1997. RIBOWEB: Linking structural computations to a knowledge base of published experimental data. Ismb 5: 84–87. Chervitz, S.A., Hester, E.T., Ball, C.A., Dolinski, K., Dwight, S.S., Harris, M.A., Juvik, G., Malekian, A., Roberts, S., Roe, et al. 1999. 
Using the Saccharomyces Genome Database (SGD) for analysis of protein similarities and structure. Nucleic Acids Res. 27: 74–78. Commission on Biochemical Nomenclature, and International Union of Biochemistry. Standing Committee on Enzymes. 1973. Enzyme nomenclature; recommendations (1972) of the Commission on Biochemical Nomenclature on the nomenclature and classification of enzymes together with their units and the symbols of enzyme kinetics. Elsevier Scientific, New York. Corpet, F., Gouzy, J., and Kahn, D. 1998. The ProDom database of protein domain families. Nucleic Acids Res. 26: 323–326. Derry, J.M., Ochs, H.D., and Francke, U. 1994. Isolation of a novel gene mutated in Wiskott-Aldrich syndrome. Cell 78: 635–644. Gracy, J. and Argos, P. 1998. Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities. Bioinformatics 14: 174–187. Gusfield, D. 1997. Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge. Henikoff, S., Henikoff, J.G., and Pietrokovski, S. 1999. Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics 15: 471–479. Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. 1999. The PROSITE database, its status in 1999. Nucleic Acids Res. 27: 215–219. International Union of Biochemistry. Standing Committee on Enzymes. 1965. Enzyme nomenclature; recommendations, 1964, of the International Union of Biochemistry on the nomenclature and classification of enzymes, together with their units and the symbols of enzyme kinetics. Elsevier, New York. International Union of Biochemistry. Nomenclature Committee and Commission on Biochemical Nomenclature. 1979. Enzyme nomenclature, 1978: Recommendations of the Nomenclature Committee of the International Union of Biochemistry of the nomenclature and classification of enzymes. Academic Press, New York. 
International Union of Biochemistry. Nomenclature Committee, International Union of Biochemistry, and Commission on Biochemical Nomenclature. 1979. Enzyme nomenclature, 1978: Recommendations of the Nomenclature Committee of the International Union of Biochemistry on the nomenclature and classification of enzymes. Academic Press, New York. International Union of Biochemistry. Nomenclature Committee, Webb, E.C., and International Union of Biochemistry. 1984. Enzyme nomenclature 1984: Recommendations of the Nomenclature Committee of the International Union of Biochemistry on the nomenclature and classification of enzyme-catalysed reactions. Academic Press, Orlando, FL. International Union of Biochemistry and Molecular Biology. Nomenclature Committee and Webb, E.C. 1992. Enzyme nomenclature 1992: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes. Academic Press, San Diego. Karp, P.D. 2000. An ontology for biological function based on molecular interactions. Bioinformatics 16: 269–285. Lewis, S., Ashburner, M., and Reese, M.G. 2000. Annotating eukaryote genomes. Curr. Opin. Struct. Biol. 10: 349–354. Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540. Nevill-Manning, C.G., Wu, T.D., and Brutlag, D.L. 1998. Highly specific protein sequence motifs for genome analysis. Proc. Natl. Acad. Sci. 95: 5865–5871. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M. 1999. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27: 29–34. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH – A hierarchic classification of protein domain structures. Structure 5: 1093–1108.
Rawlings, S.L., Crooks, G.M., Bockstoce, D., Barsky, L.W., Parkman, R., and Weinberg, K.I. 1999. Spontaneous apoptosis in lymphocytes from patients with Wiskott-Aldrich syndrome: Correlation of accelerated cell death and attenuated bcl-2 expression. Blood 94: 3872–3882. Rengan, R., Ochs, H.D., Sweet, L.I., Keil, M.L., Gunning, W.T., Lachant, N.A., Boxer, L.A., and Omann, G.M. 2000. Actin cytoskeletal function is spared, but apoptosis is increased, in WAS patient hematopoietic cells. Blood 95: 1283–1292. Riley, M. 1993. Functions of the gene products of Escherichia coli. Microbiol. Rev. 57: 862–952. Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197. Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin, R. 1998. Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26: 320–322. Wheeler, D.L., Church, D.M., Lash, A.E., Leipe, D.D., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Tatusova, T.A., Wagner, L., et al. 2001. Database resources of the national center for biotechnology information. Nucleic Acids Res. 29: 11–16. Wu, T.D., Nevill-Manning, C.G., and Brutlag, D.L. 2000. Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics 16: 233–244.

Received February 7, 2001; accepted in revised form August 14, 2001.

Short Communication
doi:10.1006/geno.2002.6824, available online at http://www.idealibrary.com on IDEAL

Comparative Analysis of Human Genome Assemblies Reveals Genome-Level Differences

Shuyu Li,1 Jiayu Liao,2,* Gene Cutler,1 Timothy Hoey,1 John B. Hogenesch,2 Michael P. Cooke,2 Peter G.
Schultz,2 and Xuefeng Bruce Ling1,*

1Tularik, Inc., Two Corporate Drive, South San Francisco, California 94080, USA
2The Genomic Institute of Novartis Research Foundation, 10675 John Jay Hopkins Drive, San Diego, California 92121, USA

*To whom correspondence and reprint requests should be addressed. Fax: (858) 812-1502. E-mail: [email protected]. Fax: (650) 825-7400. E-mail: [email protected].

Previous comparative analysis has revealed a significant disparity between the predicted gene sets produced by the International Human Genome Sequencing Consortium (HGSC) and Celera Genomics. To determine whether the source of this discrepancy was due to underlying differences in the genomic sequences or different gene prediction methodologies, we analyzed both genome assemblies in parallel. Using the GENSCAN gene prediction algorithm, we generated predicted transcriptomes that could be directly compared. BLAST-based comparisons revealed a 20–30% difference between the transcriptomes. Further differences between the two genomes were revealed with protein domain PFAM analyses. These results suggest that fundamental differences between the two genome assemblies are likely responsible for a significant portion of the discrepancy between the transcript sets predicted by the two groups.

Celera Genomics and the International Human Genome Sequencing Consortium (HGSC) simultaneously published the description of the human genome sequencing, analysis, and gene annotation [1,2]. Although both teams identified approximately 30,000 human genes [1,2], a direct comparison of the Celera and HGSC (Ensembl) data sets revealed little overlap between their novel predicted genes [3]. Questions arose as to whether this observed difference is due to discrepancies in the underlying raw sequence data, the resultant genome assemblies, or the independent gene prediction methodologies used by both groups.
To distinguish between these possibilities, we have carried out a comparative analysis of the HGSC genomes (Ensembl 1.0.0, Ensembl 1.1.0, and Ensembl 1.2.0; performed at Tularik, Inc.) and the Celera genome (CHGD_assembly_R25h; performed at the Genomics Institute of the Novartis Research Foundation) using the GENSCAN [4] gene prediction program to generate corresponding predicted transcriptomes. GENSCAN, which was a key component of both the Celera and HGSC gene prediction pipelines, predicts both partial and full-length transcripts. GENSCAN full-length transcripts are defined as those for which GENSCAN predicts a promoter region, one or more exons, and a polyadenylation signal. This analysis revealed that the Celera transcriptome (150,571) has more predicted transcripts than that of HGSC (Ensembl 1.0.0; 109,083). The more recent HGSC genome releases (Ensembl 1.1.0, Ensembl 1.2.0) gave very similar results, which are therefore not shown here. A detailed analysis of these GENSCAN-predicted transcripts found that Celera (71,721) has fewer full-length gene predictions than does HGSC (87,295). A BLAST [5]-based comparison of all GENSCAN transcripts (threshold of ≥ 98% identity over at least 100 nucleotides) showed that 80% of predicted HGSC genes have at least one matching sequence in the Celera GENSCAN predictions, whereas 70% of Celera predictions have at least one overlapping sequence in the HGSC set. These results demonstrate that significant discrepancies exist even between Celera and HGSC assembly-derived gene sets predicted with the exact same methodology.

To understand the impact of these transcriptome differences on the derived proteomes, we have analyzed the predicted translations of these sequences for the presence of known protein domains using the PFAM [6] 7.0 set of Hidden Markov Models (HMMs) (3360 models, hit threshold E value 1 × 10^-10).
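The BLAST-based comparison described above amounts to counting how many query transcripts have at least one hit that passes both thresholds (at least 98% identity over at least 100 nucleotides). The sketch below shows that counting step under stated assumptions: hits are taken to be tab-separated with query ID, subject ID, percent identity, and alignment length as the first four fields, as in the common BLAST tabular layout; the report does not say which output format was used, and the sequence IDs and hit values here are invented.

```python
import csv
import io


def matched_queries(tabular_text: str, min_identity: float = 98.0,
                    min_length: int = 100) -> set:
    """Return IDs of queries with at least one hit passing both thresholds."""
    matched = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, pident, length = row[0], float(row[2]), int(row[3])
        if pident >= min_identity and length >= min_length:
            matched.add(query_id)
    return matched


# Invented hits: query, subject, % identity, alignment length.
hits = "\n".join([
    "hgsc_1\tcelera_7\t99.2\t450",
    "hgsc_2\tcelera_3\t97.5\t800",   # fails the identity threshold
    "hgsc_3\tcelera_9\t100.0\t60",   # fails the length threshold
    "hgsc_3\tcelera_4\t98.0\t100",   # passes exactly at both boundaries
])
all_queries = {"hgsc_1", "hgsc_2", "hgsc_3", "hgsc_4"}
matched = matched_queries(hits)
fraction = len(matched) / len(all_queries)
print(sorted(matched), fraction)
```

Running the same count in both directions (HGSC against Celera, then Celera against HGSC) would give the two percentages compared in the text; the denominator must be the full set of predicted transcripts, not only those with any hit.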
The differences between the number of hits for each protein domain model in the HGSC and Celera predicted gene sets were plotted in Fig. 1 for the 1495 models that had hits (data for searches with E values of 1 × 10⁻⁵ or 1 × 10⁻² gave similar results and are not shown). Of all the matching PFAM models, 47% have more matches in the HGSC-derived gene set than in the Celera-derived genes. This is more than the share of models that matched both data sets equally (30%), and more than twice the share that had excess matches in the Celera data (22%). This analysis further supports the conclusion that the genome assemblies had a significant impact on the predicted transcript sets. This parallel analysis of the genome assemblies released by the HGSC and Celera teams provides strong evidence that there are major fundamental differences between these two data sets in the numbers, identities, and properties of predicted genes derived from these sequences. Based on this, we conclude that these sequence-level differences must be at least partly responsible for the discrepancies in the previous findings [3]. Along with the recent re-analysis [7,8] of Celera’s genome assembly [1], this report provides further evidence that the whole genome approach and the hierarchical shotgun sequencing approach yielded different genomes.
RECEIVED FOR PUBLICATION JUNE 6; ACCEPTED JUNE 19, 2002.
GENOMICS Vol. 80, Number 2, August 2002 Copyright © 2002 Elsevier Science (USA). All rights reserved. 0888-7543/02 $35.00 doi:10.1006/geno.2002.6824, available online at http://www.idealibrary.com on IDEAL Short Communication
FIG. 1. PFAM domain profiling of Celera and HGSC derived transcriptomes. The x-axis represents the excess of matches per PFAM model in the HGSC versus Celera data sets. The y-axis represents the number of models that fall into each category. Upward bars represent PFAM models that have more hits in the HGSC data set. Downward bars represent PFAM models that have more hits in the Celera data set.
REFERENCES
1. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291: 1304–1351.
2. Lander, E. S., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409: 860–921.
3. Hogenesch, J. B., et al. (2001). A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell 106: 413–415.
4. Burge, C., and Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78–94.
5. Altschul, S. F., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
6. Bateman, A., et al. (2002). The Pfam protein families database. Nucleic Acids Res. 30: 276–280.
7. Waterston, R. H., Lander, E. S., and Sulston, J. E. (2002). On the sequencing of the human genome. Proc. Natl. Acad. Sci. USA 99: 3712–3716.
8. Myers, E. W., Sutton, G. G., Smith, H. O., Adams, M. D., and Venter, J. C. (2002). On the sequencing and assembly of the human genome. Proc. Natl. Acad. Sci. USA 99: 4145–4146.
BIOINFORMATICS Vol. 19 no. 0 2003, pages 1–9 DOI: 10.1093/bioinformatics/btg219
A comparative analysis of HGSC and Celera human genome assemblies and gene sets
Shuyu Li1,†, Gene Cutler1,†, Jane Jijun Liu1,†, Timothy Hoey1, Liangbiao Chen2, Peter G. Schultz3, Jiayu Liao3,∗ and Xuefeng Bruce Ling1,∗
1 Tularik, Inc., Two Corporate Drive, South San Francisco, CA 94080, USA, 2 Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, People’s Republic of China and 3 The Genomic Institute of Novartis Research Foundation, 10675 John Jay Hopkins Drive, San Diego, CA 92121, USA
Received on December 21, 2002; revised on March 11, 2003; accepted on March 26, 2003
ABSTRACT
Motivation: Since the simultaneous publication of the human genome assembly by the International Human Genome Sequencing Consortium (HGSC) and Celera Genomics, several comparisons have been made of various aspects of these two assemblies. In this work, we set out to provide a more comprehensive comparative analysis of the two assemblies and their associated gene sets.
Results: The local sequence content for both draft genome assemblies has been similar since the early releases; however, it took a year for the quality of the Celera assembly to approach that of HGSC, suggesting an advantage of HGSC’s hierarchical shotgun (HS) sequencing strategy over Celera’s whole genome shotgun (WGS) approach. While similar numbers of ab initio predicted genes can be derived from both assemblies, Celera’s Otto approach consistently generated larger, more varied gene sets than the Ensembl gene build system. The presence of a non-overlapping gene set has persisted with successive data releases from both groups. Since most of the unique genes from either genome assembly could be mapped back to the other assembly, we conclude that the gene set discrepancies do not reflect differences in local sequence content but rather in the assemblies and especially the different gene-prediction methodologies.
Contact: [email protected]
INTRODUCTION
In February 2001, the International Human Genome Sequencing Consortium (HGSC) and Celera Genomics simultaneously published descriptions of the sequencing, assembly, analysis, and gene annotation of the human genome (IHGSC, 2001; Venter et al., 2001).
Although both teams identified approximately 30 000 human genes (IHGSC, 2001; Venter et al., 2001), a direct comparison of the Celera and HGSC (Ensembl) data sets revealed relatively little overlap between their novel predicted genes (Hogenesch et al., 2001). Our previous parallel analysis (Li et al., 2002) of the two genome assemblies showed that there are major fundamental differences between these two data sets, in the numbers, identities, and properties of predicted genes derived from these sequences, and that assembly-level differences must be at least partly responsible for the gene set discrepancies. In addition, the recent re-analyses (Myers et al., 2002; Waterston et al., 2002) of Celera’s genome assembly debated how much of an impact Celera’s use of the public-domain genome data had on its assembly. In order to provide an up-to-date status report of the human genome sequencing efforts, understand how the genome assemblies have been evolving since their initial releases, and compare the different assembly approaches and their resulting gene data sets, we have collected the majority of HGSC and Celera assembly releases and performed a systematic comparative analysis.
∗ To whom correspondence should be addressed. † Equal contribution to this publication.
METHODS
Sequence databases
HGSC and Celera assemblies and transcriptomes released from May 2000 to July 2002 were collected and are summarized in Table 1. A total of nine HGSC human genome assemblies (June 2000, July 2000, September 2000, October 2000, December 2000, April 2001, August 2001, December 2001, April 2002) were downloaded from http://www.genome.ucsc.edu/#Downloading. Ensembl curated gene sets (Ensembl 0.8.0, Ensembl 1.0.0, Ensembl 1.2.0, Ensembl 3.26 and Ensembl 5.28) were downloaded from ftp.ensembl.org.
Five Celera human genome assembly releases (R20, R25h, R26b, R26f and R26i) and four Celera gene sets (R25e, R25h, R26b, R26k) were licensed by GNF through a subscription to the Celera Discovery System and analyzed by GNF (The Genomic Institute of Novartis Research Foundation). Human RefSeq sequences were obtained by FTP from ftp.ncbi.gov/refseq/H_sapiens. The PFAM 7.0 Hidden Markov Model (HMM) database was obtained by FTP from ftp.genetics.wustl.edu/pub/eddy/pfam 7.0/. The Research Genetics cDNA database was obtained by FTP from ftp://ftp.resgen.com/pub/sv_libraries/RG_Hs_seq_ver_101100.txt. The 07-2002 RefSeq database was downloaded from the NCBI site ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/.
Bioinformatics 19(0) © Oxford University Press 2003; all rights reserved.
Table 1. HGSC and Celera genome assembly and gene set release history
Release date: 05-2000 06-2000 07-2000 08-2000 09-2000 10-2000 11-2000 12-2000 01-2001 04-2001 07-2001 08-2001 10-2001 11-2001 12-2001 01-2002 03-2002 04-2002 05-2002 06-2002
Column groups: Assembly [HGSC (UCSC); Celera]; Curated genes [HGSC (Ensembl); Celera]
Entries: 06-2000 07-2000 R18, R19 R20, R21 R22, R23 R24 09-2000 10-2000 E−0.8.0 R25e 12-2000 E−1.0.0 R25h 04-2001 R25e R25h E−1.1.0 R26b R26b 08-2001 E−1.2.0 R26d R26e 12-2001 R26f E−3.26 E−4.28 R26i E−5.28 04-2002 R26f, R26h R26j R26k
The release dates and release names (where applicable) are shown for the HGSC and Celera genome assembly releases analyzed in this study. The Ensembl and Celera gene set releases are also shown.
Local genome database setup and configuration
UCSC annotation databases hg4, hg5, hg6, hg7, hg8, hg10, and hg11, corresponding to the September 2000, October 2000, December 2000, April 2001, August 2001, December 2001 and April 2002 UCSC genome assemblies respectively, were downloaded and imported into a local relational database.
The UCSC relational database schema is available online at http://genome.ucsc.edu/goldenPath/gbdDescriptions.html. Ensembl databases were set up and configured on local servers following instructions from http://www.ensembl.org/Docs/ and personal communications with Ensembl colleagues ([email protected]). Data sets were downloaded from Ensembl and imported to a local relational database.
BLAT to map sequences onto genome assemblies
Gene sequences were mapped onto genome assemblies using the BLAT program (Kent, 2002) either directly (local BLAT server setup) or indirectly (locally installed UCSC genome database with pre-computed BLAT results). In the UCSC genome database, chromosome locations are stored in the all_est or all_mrna tables, whose qName column stores RG Genbank accession numbers. The BLAT server setup and homology search were performed using instructions from UCSC. BLAT analysis was run using an identity threshold of 95% over at least 40 bp as described at the UCSC genome browser (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=human). These criteria have been previously determined to give optimal sensitivity, specificity, and speed for genomic searches (Kent, 2002). Similar results were obtained whether sequences were mapped by running BLAT or by querying pre-computed BLAT results from the UCSC database.
Gene prediction
Predicted gene sets were derived from the HGSC and Celera genome assemblies by running the GENSCAN algorithm (Burge and Karlin, 1997) with its default settings. Full-length gene sets were derived from these total gene sets by selecting all predicted genes for which GENSCAN identified both the 5′ promoter and the 3′ poly-A signal sequences.
BLAST comparative data analysis
Sequence comparison was performed using the NCBI BLAST algorithm (Altschul et al., 1997): BLASTN for gene–gene comparisons (E-value < 1 × 10⁻⁵, at least 98% identity over 100 bp) and BLASTX for gene/SWISS-PROT comparisons (E-value < 1 × 10⁻⁵).
PFAM domain analysis
The PFAM 7.0 database release, containing 3360 HMMs, was used to analyze gene sets for their protein domain content. For this analysis, the HMMER software package (Eddy, 1998) or its compatible implementations from Paracel (http://www.paracel.com) and TimeLogic (http://www.timelogic.com) were run on a Linux computing cluster (150 CPUs, Linux Networks), a Paracel GENEMATCHER machine, and a TimeLogic Decypher machine, respectively.
RESULTS AND DISCUSSION
Gene-based quality assessment of HGSC and Celera genome assemblies
Multiple releases of human genome assemblies and their associated predicted gene sets from HGSC (International Human Genome Sequencing Consortium, UCSC, Ensembl) and Celera are listed in Table 1 based on release dates. These data sets were the basis for comparing the HGSC and Celera genome assemblies and analyzing how they have changed over time. Genome assemblies can vary due to differences in local sequence content as well as long-range differences due to differing sequence assembly. As a gauge of the quality and completeness of the draft local sequence content in both genome assemblies, we used the BLAT algorithm (Kent, 2002) to map the large Research Genetics human cDNA sequence database (RG, 41 472 sequences) against the genome assemblies (Fig. 1). Since a positive BLAT hit only requires a match of 40 bp, this analysis should be largely insensitive to global assembly issues.
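The BLAT mapping criterion (at least 95% identity over at least 40 bp) and the resulting percentage of mapped RG sequences can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure used in the study: it assumes BLAT's tab-delimited PSL output (match counts in columns 1 to 3, query name in column 10) and uses a simplified identity that ignores gaps rather than UCSC's own percent-identity score. The function names are hypothetical.

```python
from typing import Iterable, Optional


def psl_hit_query(psl_line: str,
                  min_identity: float = 0.95,
                  min_aligned: int = 40) -> Optional[str]:
    """Return the query name if this PSL alignment passes both thresholds.

    Assumption: PSL columns are matches, misMatches, repMatches, ...,
    with qName in column 10. Identity here is a simplification:
    (matches + repMatches) / aligned bases, gaps ignored.
    """
    fields = psl_line.rstrip("\n").split("\t")
    matches = int(fields[0])
    mismatches = int(fields[1])
    rep_matches = int(fields[2])
    aligned = matches + mismatches + rep_matches
    if aligned < min_aligned:
        return None
    identity = (matches + rep_matches) / aligned
    return fields[9] if identity >= min_identity else None


def percent_mapped(query_ids: Iterable[str], psl_lines: Iterable[str]) -> float:
    """Percentage of query sequences with at least one qualifying hit."""
    total = set(query_ids)
    if not total:
        return 0.0
    mapped = set()
    for line in psl_lines:
        # PSL header lines do not start with a digit; skip them and blanks.
        if not line.strip() or not line[0].isdigit():
            continue
        name = psl_hit_query(line)
        if name is not None:
            mapped.add(name)
    return 100.0 * len(mapped & total) / len(total)
```

Calling `percent_mapped` once per assembly release, with the same list of RG accession numbers each time, would give one point per release for a plot of the kind described for Fig. 1.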
We have observed a gradual increase in the number of mapped RG sequences with both HGSC and Celera assemblies, leveling off for both at around 97%. These results suggest that the HGSC and Celera assemblies have had similar local sequence content since their early releases.
Fig. 1. BLAT mapping of Research Genetics sequences to HGSC and Celera genome assemblies. The percentages of sequences from the Research Genetics sequence database that give positive BLAT hits against various releases of the HGSC and Celera genome assemblies are plotted.
Gene sets derived from the genome assemblies can vary due to differences in local sequence, global assembly, and the particular gene-prediction pipelines used. Since genes can span large sequence lengths, all gene prediction algorithms will, to some extent, be sensitive to sequence coverage and assembly issues. To eliminate variability due to differing gene-prediction pipelines, GENSCAN was used to generate two sets of genes (total and full-length) from multiple releases of both genome assemblies. The full-length GENSCAN gene subsets were extracted from the full sets, including only those GENSCAN predictions containing both 5′ promoter and 3′ poly-adenylation signal sequence predictions. Since long-range sequence discontinuity in the assemblies can lead GENSCAN to predict partial genes that would lack 5′ promoter and/or 3′ poly-adenylation signal sequences, this full-length subset can be used to probe the quality of the genome assembly. The total and full-length GENSCAN HGSC gene counts as well as the Celera full-length GENSCAN gene counts all showed modest and gradual increases over time (Fig. 2A). In contrast, the total GENSCAN gene counts for the Celera assemblies started out at levels more than twice as high as the HGSC gene sets, and only came down to comparable levels in the July 2001 release.
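The full-length filter described above can be sketched in a few lines. This is a minimal illustration that assumes the GENSCAN predictions have already been parsed into (gene ID, feature type) pairs; the feature labels used here ("Prom", "PlyA", "Init", "Intr", "Term", "Sngl") are assumed to follow GENSCAN's text output and should be adjusted if the parsed labels differ. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

# Assumed GENSCAN feature labels (an assumption, not taken from the source).
PROMOTER = "Prom"
POLYA = "PlyA"
EXON_TYPES = {"Init", "Intr", "Term", "Sngl"}


def full_length_genes(features: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return IDs of predicted genes that have a promoter, at least one
    exon, and a poly-A signal among their predicted features."""
    seen = defaultdict(set)
    for gene_id, feature_type in features:
        seen[gene_id].add(feature_type)
    return {
        gene_id
        for gene_id, types in seen.items()
        if PROMOTER in types and POLYA in types and types & EXON_TYPES
    }
```

Comparing `len(full_length_genes(...))` with the total number of distinct gene IDs for each assembly release gives the two GENSCAN counts discussed here; a fragmented assembly would be expected to show a large total but a comparatively small full-length count.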
Since gene prediction depends not only on local sequence content but also on long-range assembled sequence, we believe that the initially high total GENSCAN gene numbers for Celera were due to sequence fragmentation resulting in many individual genes being split into separate GENSCAN predictions. This apparent Celera genome fragmentation, perhaps due to gaps or assembly errors, may indicate a disadvantage of Celera’s whole genome shotgun (WGS) sequencing approach (Huson et al., 2001; Myers et al., 2002) compared to HGSC’s hierarchical shotgun (HS) approach (IHGSC, 2001).
Both the Ensembl gene build system (Hubbard et al., 2002) and Celera’s Otto pipeline (Venter et al., 2001) use various forms of evidence including homology to known proteins, ESTs and ab initio gene prediction with algorithms including GENSCAN (Burge and Karlin, 1997). Ensembl is more dependent on known human proteins from SPTREMBL, GENSCAN predictions, and gene prediction HMMs, while Celera uses more data from outside of their genome such as cross-genome homology and even the Ensembl gene set [(Venter et al., 2001), reference 62]. Analyzing the length distributions of the Ensembl and Celera gene sets (Fig. 2B) shows a large decrease in short Celera genes accompanied by increases in the numbers of longer genes over time, similar to but more pronounced than what is seen with the HGSC genes. A similar trend is seen with the GENSCAN-predicted gene sets (data not shown), further reinforcing the notion that initial Celera assembly releases may have had comparatively high levels of fragmentation. Interestingly, the latest two Celera gene sets released show a reversal of this gene-length trend, with increasing numbers of short genes concomitant with an increase in total gene number.
Fig. 2. Gene set distributions from multiple HGSC and Celera genome releases. (A) The numbers of pipeline-derived genes from various releases of Ensembl and Celera gene sets along with the numbers of total GENSCAN-predicted genes and full-length GENSCAN-predicted genes derived from various releases of the HGSC and Celera genomes are plotted based on release dates. For reference, the number of human genes in the July 2002 release of RefSeq is also shown. (B) Multiple Ensembl and Celera gene sets were analyzed based on gene length. The numbers of sequences from each release that lie in the given gene-length bins are shown.
Fig. 3. Gene changes within gene sets across multiple releases. (A) The changes in the numbers of genes in the Ensembl and Celera gene sets and in the HGSC- and Celera-derived GENSCAN genes between successive genome releases are plotted. (B) Genes in successive gene sets were compared to genes in the previous gene sets using BLAST. The percent of genes that did not match any sequence in the previous gene set are plotted for each gene set group.
Within-group gene set comparisons
An alternate way to look at changes in the assembly and gene data set is to compare the genes derived from each genome assembly release with those from the previous release. New genes were identified by BLAST (Altschul et al., 1997) comparison as those sequences that did not match any sequence in the set from the previous release. The analysis of the HGSC GENSCAN gene sets shows a 10–20% level of new gene content per gene set (Fig. 3), consistent with the modest increases in gene number (Fig. 2) and sequence coverage (Fig. 1) already observed. In contrast, the Celera GENSCAN gene sets show an initially high level of new GENSCAN gene content being added (30–40%) concomitant with a large decrease in gene number, a trend that has diminished in the most recent genome releases, where very few new GENSCAN genes appear to be present. The large gene count of the initial Celera GENSCAN set and its decrease over the course of time correlate with the decrease of the initial large fraction of short (<1 kb) Celera genes (Fig. 2), suggesting that the levels of fragmentation seen in the initial releases decreased over time. The pattern of changes in the Celera Otto genes in successive releases is even more dramatic: more than 50% of the genes in the January 2001 gene set release cannot be found in the previous December 2000 release. By October 2001, however, virtually no new genes were being added. Interestingly, new gene addition can again be observed in the recent Celera releases, occurring in the same releases where the total gene number (Fig. 2A) and the short gene number (Fig. 2B) rebound. Since neither the genome content nor quality appears to have changed much in these releases, we believe that this recent trend is likely due to changes in Celera’s gene-prediction pipeline.
Fig. 4. Gene set comparisons between groups. (A) Human RefSeq genes were compared to multiple Ensembl and Celera gene sets using BLAST. The numbers of RefSeq sequences that matched both gene sets, only the Ensembl gene set, only the Celera gene set, and neither gene set are plotted on the left. The human RefSeq genes were also compared to multiple HGSC- and Celera-derived GENSCAN gene sets using BLAST. The distribution of matching sequences is plotted on the right. (B) Ensembl genes from multiple releases were compared with the corresponding Celera gene set releases using BLAST. The numbers of matching and non-matching (Ensembl-unique) sequences are plotted on the left. Similarly, Celera genes were compared with the corresponding Ensembl gene sets using BLAST and the numbers of matching and non-matching (Celera-unique) sequences are plotted on the right. (C) HGSC-derived GENSCAN genes and Celera-derived GENSCAN genes were compared with each other using BLAST in both directions as in (B). The numbers of sequences found in both gene sets, HGSC-unique sequences, and Celera-unique sequences are plotted.
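Once BLAST has reported which sequences found a match, both the "new gene" percentage per release (Fig. 3B) and the four-way breakdown of a reference set against two gene sets (Fig. 4A) reduce to set arithmetic. The sketch below illustrates that bookkeeping only; the function names and the idea of passing in pre-computed sets of matched IDs are assumptions made for the example.

```python
from typing import Dict, Iterable


def new_gene_fraction(current_ids: Iterable[str],
                      matched_to_previous: Iterable[str]) -> float:
    """Fraction of genes in the current release with no BLAST match
    to any gene in the previous release."""
    current = set(current_ids)
    if not current:
        return 0.0
    return len(current - set(matched_to_previous)) / len(current)


def four_way_breakdown(reference_ids: Iterable[str],
                       matched_in_a: Iterable[str],
                       matched_in_b: Iterable[str]) -> Dict[str, int]:
    """Count reference genes matched in both sets, only A, only B, or neither."""
    ref = set(reference_ids)
    a = set(matched_in_a) & ref
    b = set(matched_in_b) & ref
    return {
        "both": len(a & b),
        "a_only": len(a - b),
        "b_only": len(b - a),
        "neither": len(ref - a - b),
    }
```

For the RefSeq comparison, `reference_ids` would be the RefSeq gene IDs, `matched_in_a` the RefSeq genes with an Ensembl (or HGSC GENSCAN) match, and `matched_in_b` those with a Celera match.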
RefSeq-based quality assessment of Ensembl and Celera gene sets
The NCBI RefSeq database (Maglott et al., 2000; Pruitt and Maglott, 2001), derived from Genbank sequences and the published literature, provides a non-redundant view of the current knowledge about human genes, transcripts and proteins. We evaluated the quality and comprehensiveness of the in silico GENSCAN-predicted gene sets by comparing them to the human RefSeq database with BLAST. Comparing RefSeq to multiple Ensembl and Celera pipeline gene sets and HGSC and Celera GENSCAN gene sets reveals that, even with the earliest releases, greater than 75% of RefSeq genes can be found in some form in gene sets from both groups (Fig. 4A). Small fractions of RefSeq genes could be matched only to genes from HGSC, only to Celera genes, or to neither gene set. Over the course of time, the numbers of unmatched RefSeq genes and those matching only HGSC have significantly decreased. At the same time, the Celera gene set continues to have a modest number of RefSeq genes not found in Ensembl, suggesting that the Celera gene set can be more comprehensive than the Ensembl data set with respect to RefSeq. Similar BLAST results were obtained after a permissive sequence clustering approach (Hogenesch et al., 2001) was applied to eliminate sequence redundancy in all RefSeq, HGSC and Celera gene sets (data not shown). Because RefSeq (07-2002 release, 15 740 genes) contains far fewer genes than Ensembl and Celera, more effort is needed to complete RefSeq as a gene reference standard.
Between-group gene set comparisons
Much has been made of the concordance between the gene numbers of the initial HGSC and Celera gene releases (IHGSC, 2001; Venter et al., 2001) and the subsequent observations that each set actually contained many unique genes (Hogenesch et al., 2001; Li et al., 2002). We have repeated this analysis across multiple gene set releases.
Comparing Ensembl to Celera genes shows that the fraction of Ensembl-unique genes ranges from 29% initially to 12% in the most recent release analyzed (Fig. 4B), indicating that most of the Ensembl genes can find matches in the Celera set. The reverse comparison, Celera compared to Ensembl, reveals that the fraction of Celera-unique genes decreased from an initial 56% to 26% in the most recent analyzed release. The large increase in Celera-unique genes in the R25h release coincided with the large increase in total gene number (Fig. 2A) consisting largely of short genes (Fig. 2B). Similar results were obtained when redundancy was removed from the data sets (data not shown). To discriminate between changes in actual sequence information versus changes in gene-prediction pipelines, this analysis was repeated with the GENSCAN-derived gene sets. The HGSC versus Celera GENSCAN gene set comparison (Fig. 4C) looks much like the Ensembl versus Celera pipeline gene comparison (Fig. 4B), with approximately 16% of the HGSC genes being unique. In contrast, the Celera versus HGSC GENSCAN-gene comparison shows an initially high fraction (33%) of Celera-unique genes, decreasing to a fraction (13%) similar to that of unique HGSC GENSCAN genes. The difference between these results and the pipeline-gene comparison suggests that the unique gene content of the Celera pipeline gene set cannot be explained by fundamental differences in the genome assemblies. To further characterize the HGSC and Celera-unique gene sets, we mapped the unique genes back to the genome assemblies from which they came as well as to that of the other group using BLAT. Nearly all of the sequences from all four unique gene sets can be mapped to both genome assemblies of the same or similar release date (Fig. 5A). This again confirms that genome content, specifically the local sequence content, is very similar between both assemblies.
Since the differences between Ensembl and Celera gene sets are much larger than those observed between HGSC and Celera GENSCAN gene sets, we can conclude that the gene-building process, including human curation, must have contributed more to the observed gene set difference than the different genome sequencing and assembly processes. In order to estimate how likely the unique Ensembl or Celera genes are to represent true genes, we compared the unique pipeline genes to the large SWISS-PROT protein database using BLAST with moderate stringency (E-value threshold of 1 × 10⁻⁵). While more than 60% of some of the earlier unique gene sets appear to have no significant homology to any protein sequence in SWISS-PROT, analysis of the most recent gene sets shows that 55% of Celera-unique genes and 68% of Ensembl-unique genes have known protein homologs (Fig. 5B). Using SWISS-PROT homology matches as a rough estimate of the likelihood that predicted genes are real, it appears that a large fraction of the unique genes from both data sets are likely to be real.
Fig. 5. Analysis of HGSC and Celera-unique genes. (A) Sequences that were unique to the Ensembl, Celera, HGSC-derived GENSCAN, and Celera-derived GENSCAN gene sets based on BLAST analysis (Fig. 4) were mapped back to the genome assembly from which they were derived as well as to the other genome assembly using BLAT. The percentages of sequences from each unique set that could be mapped to either genome assembly are plotted. (B) The unique Ensembl and Celera genes were compared with the SWISS-PROT database using a moderate-stringency BLAST analysis. The percentages of sequences from both sets for which homologous sequences could be identified in SWISS-PROT are plotted.
Fig. 6. Estimation of non-redundant gene count. For the releases shown, the Ensembl and Celera gene sets were combined along with the human subset of RefSeq. This combined gene set was clustered via a permissive clustering algorithm. The resulting gene cluster number represents the total number of unique genes in the Ensembl, Celera and RefSeq gene sets that could be resolved by our BLAST analysis.
Total number of protein-coding genes: lower bound estimation
As shown in Figure 2A, the Ensembl gene sets have consistently comprised around 30 000 sequences, while the Celera gene set has varied in the range of 20 000–45 000 sequences. Interestingly, the two latest Celera gene sets analyzed show an increase in gene number, bringing the total well above that of the Ensembl gene set. To put these numbers in perspective, the human component of RefSeq (Maglott et al., 2000; Pruitt and Maglott, 2001) contains many fewer genes (07-2002 release, 15 740 genes) than either of these two gene sets. In order to estimate the total gene number, the Ensembl, Celera and RefSeq gene sets were combined into a large superset. Following an all-to-all BLAST comparison, redundant sequences were removed with a permissive clustering algorithm (Hogenesch et al., 2001). The resulting gene cluster number represents the total number of unique genes in the Ensembl, Celera and RefSeq gene sets that could be resolved by our BLAST analysis. Different Ensembl and Celera releases were combined with RefSeq and processed to analyze how this total gene number has changed over time, increasing from an initial 24 238 to over 40 000 and then down to 28 475 (Fig. 6). The non-redundant gene number we computed here should represent a lower bound for the true human gene count for two reasons: our BLAST threshold cannot distinguish between the nearly identical paralogs that are found in some gene families, and this approach omits genes that were missed by both the Ensembl and Celera gene identification processes.
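The text does not spell out the permissive clustering algorithm of Hogenesch et al. (2001), so the sketch below substitutes plain single-linkage clustering (any two sequences connected by a chain of BLAST matches fall into one cluster) purely as an assumed stand-in. It shows why the cluster count behaves as a lower bound: near-identical paralogs that match each other collapse into a single cluster. The function name and input shapes are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def count_gene_clusters(gene_ids: Iterable[str],
                        matching_pairs: Iterable[Tuple[str, str]]) -> int:
    """Count single-linkage clusters over a combined gene set.

    gene_ids: all sequences in the combined (e.g. Ensembl + Celera + RefSeq) set.
    matching_pairs: pairs of IDs that passed the all-to-all BLAST threshold.
    """
    parent: Dict[str, str] = {g: g for g in gene_ids}

    def find(x: str) -> str:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matching_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)  # merge the two clusters

    return len({find(g) for g in parent})
```

Running this once per combination of Ensembl, Celera and RefSeq releases would give one non-redundant count per time point, comparable in spirit to the series plotted in Fig. 6.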
This analysis of multiple gene sets together, coupled with the removal of redundancy, allows us to make a more complete estimate of the total human genome gene content than has previously been described (IHGSC, 2001; Venter et al., 2001).
Gene set domain content analysis
Similar to the SWISS-PROT homology analysis (Fig. 5B), protein domain profiling should provide an indirect measure of the quality of the genome-derived gene sets. The drawback of this analysis is that it can only analyze genes that contain already known protein domains. We used the PFAM 7.0 (Bateman et al., 2002) database of domain models to look at the comparative domain content of gene sets from the HGSC and Celera genome assemblies. Figure 7A shows the numbers of PFAM models that have an excess of matches against various releases of either the Ensembl or Celera gene sets. In early releases, many more PFAM models had more matches against the Ensembl gene set than against the Celera gene set. However, in recent releases, the domain content of the Celera gene set has increased dramatically relative to Ensembl. In contrast, when the GENSCAN gene sets are analyzed (Fig. 7B), while the gap has narrowed, the HGSC genes continue to contain more domain matches than the Celera GENSCAN genes. Taken as an approximate measure of assembly quality, the GENSCAN-derived gene set numbers suggest that over time the Celera genome assembly has approached the quality of the HGSC assembly. Given the similarity of local sequence content between the HGSC and Celera assemblies, this PFAM analysis of the GENSCAN gene sets supports the idea that the HGSC HS approach may have had advantages over the Celera WGS approach. The significant difference in PFAM matches to the recent Celera pipeline gene sets, in contrast, suggests that Celera has been able to add many new gene types to their gene set that would not otherwise be identified by ab initio gene prediction, making their gene annotation efforts more comprehensive than those of Ensembl.
Numerous reports comparing the HGSC and Celera genome assemblies (Aach et al., 2001; Olivier et al., 2001; Li et al., 2002; Xuan et al., 2003) and gene sets (Hogenesch et al., 2001) have been made since the simultaneous publication of the two genomes in February 2001 (IHGSC, 2001; Venter et al., 2001). The analysis presented here suggests that the initial HGSC genome assembly, although containing a similar amount of genomic sequence information as the Celera genome assembly, was in a much better state of assembly. This is not entirely unexpected, as whole genome shotgun sequencing, the technique used by Celera, is more challenging to assemble than HGSC’s hierarchical shotgun approach. Over the course of two years, however, Celera has made up for the shortcomings of their initial assemblies with newer assemblies that have approached the quality of HGSC’s draft genome.
Fig. 7. Domain profiling of HGSC and Celera gene sets. The domain content of multiple HGSC and Celera gene sets was analyzed by performing a search of these gene sets with the PFAM database. For each PFAM domain model, the number of hits against each pair of HGSC and Celera gene sets was identified. The numbers of PFAM models that have an excess of hits against HGSC are plotted in the upper section, based on how large the excess of HGSC hits was. Similarly, the numbers of PFAM models that have an excess of hits against Celera are plotted in the lower section based on the number of excess Celera hits. (A) PFAM analysis of Ensembl and Celera gene sets. (B) PFAM analysis of HGSC and Celera assembly GENSCAN gene sets.
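The per-model excess counts behind the PFAM domain profiles can be sketched as below. This is an illustration only: it assumes the hit counts per PFAM model have already been tallied from the HMM search output into two dictionaries, and the function name and return shape are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def excess_hit_profile(hits_a: Dict[str, int],
                       hits_b: Dict[str, int]) -> Tuple[Counter, Counter]:
    """Compare per-model hit counts between two gene sets.

    Returns (summary, histogram):
      summary   counts models with more hits in A, more in B, or equal hits;
      histogram maps each excess value (hits in A minus hits in B) to the
                number of models showing that excess.
    Only models with at least one hit in either set are considered.
    """
    models = {
        m for m in set(hits_a) | set(hits_b)
        if hits_a.get(m, 0) > 0 or hits_b.get(m, 0) > 0
    }
    excess = {m: hits_a.get(m, 0) - hits_b.get(m, 0) for m in models}
    summary = Counter(
        "a_excess" if d > 0 else "b_excess" if d < 0 else "equal"
        for d in excess.values()
    )
    histogram = Counter(excess.values())
    return summary, histogram
```

With the HGSC-derived set as A and the Celera-derived set as B, positive histogram keys would correspond to the upper section of a plot like Fig. 7 and negative keys to the lower section; dividing the summary counts by the number of models with hits gives shares of the kind reported in the text.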
Since the Ensembl gene build system predicts genes through GENSCAN, homology, and gene prediction HMM methods, the quality and quantity of their gene predictions should mirror the quality of the genome assembly, as we have observed. In contrast, Celera uses a richer gene prediction pipeline named Otto that places greater emphasis on cross-species genome comparisons, EST homology, and curated gene set homology (Venter et al., 2001). By incorporating information in addition to its genome sequence, Celera has been able to generate a larger gene set containing more unique genes. While many of the predicted genes unique to either the Ensembl or the Celera gene set are likely to be proven not to be bona fide genes (Fig. 5B; Hogenesch et al., 2001), we expect that a significant number of them will be validated when the full content of the human transcriptome is finally determined.
ACKNOWLEDGEMENTS
We thank Jim Kent (UCSC) and the members of the Ensembl project (UK) for technical assistance and help in HGSC genome database setup, and Tularik/GNF Bioinformatics and IT staff for outstanding computational support. The authors are also grateful to Drs Greg Peterson and Zheng Pan for critical discussions.
REFERENCES
Aach,J., Bulyk,M.L., Church,G.M., Comander,J., Derti,A. and Shendure,J. (2001) Computational comparison of two draft sequences of the human genome. Nature, 409, 856–859.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
Hogenesch,J.B., Ching,K.A., Batalov,S., Su,A.I., Walker,J.R., Zhou,Y., Kay,S.A., Schultz,P.G. and Cooke,M.P. (2001) A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell, 106, 413–415.
Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30, 38–41.
Huson,D.H., Reinert,K., Kravitz,S.A., Remington,K.A., Delcher,A.L., Dew,I.M., Flanigan,M., Halpern,A.L., Lai,Z., Mobarry,C.M. et al. (2001) Design of a compartmentalized shotgun assembler for the human genome. Bioinformatics, 17, S132–S139.
International Human Genome Sequencing Consortium (IHGSC) (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Kent,W.J. (2002) BLAT: the BLAST-like alignment tool. Genome Res., 12, 656–664.
Li,S., Liao,J., Cutler,G., Hoey,T., Hogenesch,J., Cooke,M., Schultz,P. and Ling,X. (2002) Comparative analysis of human genome assemblies reveals genome-level differences. Genomics, 80, 138.
Maglott,D.R., Katz,K.S., Sicotte,H. and Pruitt,K.D. (2000) NCBI’s LocusLink and RefSeq. Nucleic Acids Res., 28, 126–128.
Myers,E.W., Sutton,G.G., Smith,H.O., Adams,M.D. and Venter,J.C. (2002) On the sequencing and assembly of the human genome. Proc. Natl Acad. Sci. USA, 99, 4145–4146.
Olivier,M., Aggarwal,A., Allen,J., Almendras,A.A., Bajorek,E.S., Beasley,E.M., Brady,S.D., Bushard,J.M., Bustos,V.I., Chu,A. et al. (2001) A high-resolution radiation hybrid map of the human genome draft sequence. Science, 291, 1298–1302.
Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137–140.
Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J., Sutton,G.G., Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.
Waterston,R.H., Lander,E.S.
and Sulston,J.E. (2002) On the sequencing of the human genome. Proc. Natl Acad. Sci. USA, 99, 3712–3716. Xuan,Z., Wang,J. and Zhang,M.Q. (2003) Computational comparison of two mouse draft genomes and the human golden path. Genome Biol., 4.

J. Med. Chem. 2002, 45, 1221-1232

PRO_SELECT: Combining Structure-Based Drug Design and Array-Based Chemistry for Rapid Lead Discovery. 2. The Development of a Series of Highly Potent and Selective Factor Xa Inhibitors

John W. Liebeschuetz,*,† Stuart D. Jones,† Phillip J. Morgan,† Chris W. Murray,† Andrew D. Rimmer,† Jonathan M. E. Roscoe,† Bohdan Waszkowycz,† Pauline M. Welsh,† William A. Wylie,† Stephen C. Young,† Harry Martin,† Jacqui Mahler,† Leo Brady,‡ and Kay Wilkinson‡

Protherics Molecular Design, Beechfield House, Lyme Green Business Park, Macclesfield SK11 0JL, U.K., and Department of Biochemistry, University of Bristol, Bristol BS8 1TD, U.K.

Received June 6, 2001

In silico screening of combinatorial libraries prior to synthesis promises to be a valuable aid to lead discovery. PRO_SELECT, a tool for the virtual screening of libraries for fit to a protein active site, has been used to find novel leads against the serine protease factor Xa. A small seed template was built upon using three iterations of library design, virtual screening, synthesis, and biological testing. Highly potent molecules with selectivity for factor Xa over other serine proteases were rapidly obtained.

Introduction

Serine proteases represent a class of enzymes of great therapeutic importance. Members of the class which have been targeted for drug design include tryptase and urokinase and, in the blood coagulation cascade, thrombin, factor VIIa, and factor Xa. Factor Xa lies at the junction of the intrinsic and extrinsic pathways of the coagulation cascade. It is the active enzyme present in the prothrombinase complex, which converts prothrombin into thrombin.
Thrombin is the final enzymatic product of the blood coagulation cascade and is responsible for the conversion of fibrinogen into fibrin. Much effort has been spent targeting thrombin, in particular, and, more recently, factor Xa, with the aim of designing antithrombotic drugs which are orally available and which show a reduced potential for bleeding as a side effect.1 Current therapies include the heparins, which are not orally available, and the coumarins, which have a narrow therapeutic window with regard to bleeding. Factor Xa has been claimed to be a better antithrombotic target than thrombin because there are indications that factor Xa inhibitors may have less propensity to show bleeding side effects.2,3 Additionally, a rebound effect has been observed following cessation of therapy with direct thrombin inhibitors.4 Potential indications are for deep vein and arterial thrombosis, post operative prophylactic use, myocardial infarction, and stroke. Crystal derived structural models are available for quite a number of serine proteases. Their mode of catalytic action is well understood, and the structural features that give rise to substrate selectivity have in many cases been elucidated. Recently, for both thrombin and factor Xa, structures have been published which have a variety of competitive inhibitors bound in the active sites. Despite this wealth of structural information, the role of structure-based design, to date, has generally been to help suggest analogues of an existing lead and to post-rationalize the activity data, rather than as a tool for the de novo design of inhibitors.5,6,7

* Corresponding author: John W. Liebeschuetz, Tularik Ltd, Beechfield House, Lyme Green Business Park, Macclesfield SK11 0JL, U.K. Tel: (44) 1625 427369. Fax: (44) 1625 612311. E-mail: jliebeschuetz@tularik.com. † Protherics Molecular Design. ‡ University of Bristol.

10.1021/jm010944e CCC: $22.00 © 2002 American Chemical Society. Published on Web 02/16/2002. Journal of Medicinal Chemistry, 2002, Vol. 45, No. 6.

Figure 1. Bis-amidine factor Xa inhibitors DX-9065a and YM-60828.

Figure 2. Flow diagram demonstrating the PRO_SELECT approach.

The first crystal structure of the factor Xa enzyme was published by Bode’s group in 1993 (Brookhaven code 1HCG).8 This crystal structure is missing the Gla (γ-carboxyglutamic acid) domain (N terminal residues 1-45) and also residues Glu 146-Gln 151, close to the active site, which are apparently autocleaved during crystallization. The S1 pocket of the active site is occupied by the A-chain terminal Arg 439 of a neighboring factor Xa molecule, which hydrogen bonds to Asp 189 in the standard bidentate fashion. The active site is similar to trypsin, but differs from many other serine proteases in having a large S4 pocket. The S1 pocket differs from trypsin in that Ser 190 is replaced with hydrophobic Ala 190. These features suggest that selective small molecule inhibitors for factor Xa can be obtained and, indeed, many are already known. For example, DX-9065a is a dicationic inhibitor with a Ki of 41 nM against factor Xa and a Ki of 630 nM against trypsin and >2000 nM against thrombin (Figure 1).9 YM-60828 is another dicationic inhibitor with a Ki of 2.3 nM against factor Xa, 159 nM against trypsin, and >10 000 nM against thrombin.10 Both these compounds are currently in preclinical and clinical trials and show some promise as oral antithrombotics without bleeding side effects.11,12 A crystal derived structure for DX-9065a bound to factor Xa (Brookhaven code 1FAX) exists.
This indicates that the naphthamidine portion sits in the S1 pocket, making a single hydrogen bond to Asp 189, while the acetamidinopyrrolidine portion sits in the electron rich S4 pocket and makes a hydrogen bond to the pendant Glu 97 side chain.13

Virtual Screening Methodology

We have recently published a methodology for the computational or ‘virtual’ screening of combinatorial libraries against the active site of an enzyme.14 The program central to this methodology is called PRO_SELECT. A flow diagram illustrating the methodology is given in Figure 2. The idea is to generate relatively small candidate libraries (10-1000 compounds) which have a high ratio of hits to inactives. These libraries are generally based on a template chemistry and are designed to be synthetically accessible in a timely and cost-effective manner. The catholic nature of the screening procedure allows a wide diversity of structures to be explored within the constraints imposed by the template. We start with a template structure. This is designed to fit into part of the active site, normally in a central position, and make favorable interactions with the enzyme. The template has attachment points to which substituents can be affixed using simple chemical reactions. Each attachment point is ideally directed toward a different pocket in the binding site, in which a suitable substituent may find favorable interactions. The substituents must be readily accessible to allow rapid chemical synthesis. Therefore the substituents are directly derived from lists of appropriate reagents, each selected from a directory of commercially available chemicals. The selection is normally carried out by searching according to a simple pharmacophore appropriate to the target pocket. Each list may contain several thousand different chemicals. Each substituent from a list is computationally screened for fit to the target pocket. This was done in two stages in the work presented here.
First each substituent is attached to the template and then a user defined number of conformations are roughly assessed for goodness of fit and favorable interactions. This was done by assessing the substituent for complementarity to an ‘interaction site model’ in the manner of Klebe.15 The number of attempts at finding graph matches is of the order of a thousand. A match may not represent a viable conformation for a substituent, and this is checked by carrying out a local conformationally flexible fitting of the substituent onto the interaction sites using a directed tweak algorithm. The number of attempts at finding a viable conformation from a given match is of the order 30 to 120. Many substituents may be rejected at this stage. Accepted substituents can be further refined in a second stage, using molecular mechanics to optimize internal geometries and substituent:receptor interactions, via an implementation of the ‘CLEAN’ force field.16 Each substituent is scored using an empirical scoring function to find and preserve the best match for each substituent. More recent versions of PRO_SELECT use a docking protocol on both template and substituent to obtain a good binding conformation in which the template position can be adjusted.17 The empirical scoring function represents binding energy and is derived by regression analysis of measured binding affinity to terms known to be important in determining affinity and calculable from existing crystal structures. A number of different such functions have been published. The empirical scoring function used in this work is that of Böhm, although we have subsequently derived our own scoring function, ChemScore.18-20 The score is thus used to drive placement of the substituent. It is also used by the molecular designer to differentiate between substituents in the design of the final sublibrary. 
It is generally used only as a cutoff filter to pick out those substituents which have the best chance of making good binding interactions. It is not expected that the Böhm score will correlate well with actual binding affinities of the library members, as the Böhm score is derived from complexes in which only favorable ligand:protein interactions are made. Unfavorable interactions which are not penalised by the Böhm score, for instance polar:lipophilic contacts, are liable to frequently arise in the PRO_SELECT placements. For this reason, other criteria are also used in the selection process. These can include strain energy estimation, a diversity metric or calculated physical chemical properties. Manual inspection of the predicted binding modes also plays an important role. The process may be repeated for separate substituent lists attached to different points on the template. Thus the final library will consist of an array of substituents for a single attachment point or a combinatorial array constructed from two or more lists each corresponding to a separate attachment point.

PRO_SELECT: A Tool for Rapid Lead Discovery

Figure 3. Factor Xa active site VdW surface (A) and schematic (B) illustrating the strategy used in the iterative design process. The initial template, 3-benzamidinecarbonyl, and the interaction site model for the first library are shown (A).

PRO_SELECT was first validated using thrombin as a target.21 We now report on the use of this methodology to design factor Xa inhibitors as the primary stage in an ongoing program to discover novel antithrombotic drugs. This has led to a new class of chemistry that is highly active and specific for factor Xa.
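The cutoff-style use of the score described in the methodology above, combined with a second criterion such as strain energy, can be pictured as a simple filter. The Python sketch below is only an illustration and is not PRO_SELECT's implementation: the `Substituent` record, its field names, the four example entries, and the default thresholds are assumptions made for the example (the thresholds echo the score of -10 and strain of 25 kJ/mol quoted in the Results for the second search).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    name: str      # hypothetical reagent label
    score: float   # empirical score, kJ/mol (more negative is better)
    strain: float  # estimated strain energy, kJ/mol


def select_candidates(subs, score_cutoff=-10.0, strain_cutoff=25.0):
    """Keep entries scoring better than score_cutoff with strain below
    strain_cutoff, and return them ordered best score first."""
    kept = [s for s in subs
            if s.score < score_cutoff and s.strain < strain_cutoff]
    return sorted(kept, key=lambda s: s.score)


# Made-up values, purely to exercise the filter.
example = [
    Substituent("amine_A", -12.5, 10.0),  # passes both criteria
    Substituent("amine_B", -4.0, 5.0),    # score too weak
    Substituent("amine_C", -15.0, 40.0),  # too strained
    Substituent("amine_D", -11.0, 20.0),  # passes both criteria
]

shortlist = select_candidates(example)
print([s.name for s in shortlist])  # ['amine_A', 'amine_D']
```

A filter like this only narrows the list; as the text notes, diversity, calculated properties, and manual inspection of the binding modes still decide the final library.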
Several groups have described similar ‘Virtual Screening’ approaches that have led to active inhibitors against other targets.22,23 To our knowledge this is one of the first examples where such an approach has led to molecules sufficiently active to show therapeutic effect in relevant animal disease models. It is also the first published example of the de novo design of inhibitors for factor Xa.

Design Strategy

It is usual, when considering the design of a combinatorial library, to build the library around a central template to which two or more substituents are attached or incorporated via facile chemistry. The placement of such a template in an active site would necessarily be central. Examination of the factor Xa 1HCG structure reveals that the central region of the active site is very broad, and the disposition of polar ‘anchoring’ sites is sparse (Figure 3A). Therefore we felt no confidence that a central template could be designed which would be guaranteed to bind in a single predictable manner. An alternative strategy was chosen in which a template would be placed in one of the specificity pockets and PRO_SELECT would be used to find potential substituents to fit the central portion of the active site. Synthesis and screening of this initial library would then allow a lead molecule to be selected. This could be further elaborated, again by use of PRO_SELECT, to exploit other specificity pockets within the active site. The S1 pocket was chosen as the most suitable pocket in which to place a template. Inhibitors of the trypsin class of serine proteases generally have a cationic moiety in this pocket which can make a single or bidentate H-bond with Asp 189. This provides a good anchor and a predictable binding orientation for a variety of ligands.
The strategy of iterative library design to grow into different regions of the factor Xa active site is demonstrated in Figure 3, with reference to the 1HCG structure. This was the only structure openly available at the start of this work and is therefore the one that was used. Initial placement of the template in S1 (blue) was to be followed by development of the first library to probe the central (red) region. This was to be followed up by second and third libraries to occupy the green and purple pockets. The red region incorporates the Ser 214 to Gly 218 backbone, the position of which is well conserved in many serine proteases and which is known to be capable of providing H-bonding recognition interactions with natural substrates. It also partially contains the characteristic ‘aromatic box’ of factor Xa, constructed from the Tyr 99, Trp 215, and Phe 174 side chains. This ‘box’ constitutes the majority of the S4 pocket. The purple region represents the back of the S4 pocket and is characterized by three backbone carbonyls from Thr 98, Glu 97, and Lys 96 and the anionic Glu 97 side chain which can overhang the pocket. In theory these groups are available to hydrogen bond strongly to electropositive and cationic groups. The green region represents a hydrophobic pocket that has as its base the Cys 191-Cys 220 disulfide bridge, Gln 192 and Arg 143 side chains as the left-hand wall, and the Gly 218 backbone as the right-hand wall. This pocket is well conserved in many serine proteases but has not been exploited frequently in the design of inhibitors, perhaps because the pocket can often be occupied by the mobile side chain of residue 192. 
The structure of the potent anti-Xa protein, tick anticoagulant peptide, bound to factor Xa, shows that this ligand can make use of this region, albeit with considerable reorganization of the active site.24 We were of the opinion, after consideration of the 1HCG structure, that this pocket could be a prime target area for substituent placement in factor Xa.

Results and Discussion

Template Selection and First Library Design. It was decided to employ an amidino group to anchor the S1 template via a bidentate hydrogen bond to Asp 189. We were aware that such a group has in the past led to problems in oral availability and rapid clearance. Nevertheless it was felt important to use a template that had reasonable base activity of its own so that structure/activity trends would become immediately apparent. It was envisaged that the synthetically facile linkage of the template to the substituent would be via an amide bond. It was also envisaged that this amide bond might, itself, be able to make hydrogen bonding interactions with the active site. Several possible template candidates were considered. It was decided to use PRO_SELECT to design a library with each template and to compare the quality of the libraries in order to select the best template. Three of the templates examined are shown in Table 1.

Table 1. Hit Rate and Hit Quality of Libraries Designed around Three Different S1 Templates and a Single Substituent List, Using PRO_SELECT

The initial stage in using PRO_SELECT was to create a ‘Design Model’. This is a simple ‘interaction site model’ of the active cleft.14 The Design Model for the region targeted for the first library is included in Figure 3A. The dark/light blue vectors represent hydrogen bond donor sites, the blue/purple vectors represent hydrogen bond acceptor sites, and lipophilic point sites are represented as orange crosses.
The first step in substituent evaluation was to find matches between interaction sites in the substituent, attached to template, and complementary sites in the design model. Also illustrated in Figure 3A is one of the templates bound in the S1 pocket in the binding orientation employed in the PRO_SELECT job. The choice of substituent lists and the virtual screening protocol used are given in the Experimental Section. A summary of the output performance is given in Table 1 for each of the three templates. The number of substituents that passed the matching stage was roughly the same for each template. However, the number of substituents binding reasonably well (Böhm score < -3) was much less for the 4-benzamidinocarbonyl template than for the other two templates. The arginine template had marginally more high quality hits than the 3-benzamidinocarbonyl template. However, the Böhm score for the former was 10 kJ/mol higher than that for the latter (-5.3 versus -16.4), corresponding to a shortfall of roughly 2 orders of magnitude in terms of binding affinity. It was concluded therefore that the 3-benzamidinocarbonyl template was the most appropriate template to use. A disappointing feature, even in the case of the 3-benzamidinocarbonyl template, was the paucity of high scoring substituents (score < -10). Substituents could be found which made either the desired polar interactions or hydrophobic interactions but rarely both. The number of highly promising substituents therefore appeared limited. The reason for this is likely to be the restriction on substituent diversity, placed upon us by using solely the Available Chemicals Directory as a source. Six targets from this first PRO_SELECT run were selected for synthesis. One substituent that did score exceptionally well was the glycine-2-naphthylamide, 1 (Figure 4). The naphthyl group was found to sit well in the hydrophobic S4 pocket, and the glycine C=O was able to H-bond with Gly 218.
Both of these effects led to a good Böhm score. When modeled against the active site, it was found that the NH from the benzamide amide group also could make a hydrogen bond to Gly 216. This naphthylglycinamide substituent was found to be unobtainable, and the presence of the highly carcinogenic β-naphthylamine substructure militated against its synthesis. Nevertheless the motif looked of sufficient interest to be investigated further.

Figure 4. Lead compounds arising out of library 1.

Accordingly it was decided to append the glycine to the 3-amidinobenzoyl group and use this molecule as a larger template. PRO_SELECT was used to search for substituents which, when attached via an amide bond to the glycine carbonyl, would probe deeper into the S4 pocket. It was envisaged that it might be possible to use the carbonyl groups at the back of the S4 pocket for hydrogen bonding. For this reason, the library of substituents chosen for virtual screening was selected to be bis-amines separated by a hydrophobic group. PRO_SELECT offers the ability to interconvert functional groups in silico prior to virtual screening. This ‘deprotection’ facility mimics a synthetically facile functional group transforming reaction.13 This was exploited here to broaden the diversity of possible substituents by inclusion of a list of bis-nitriles, converted on the fly to bis-amines. Further details of the virtual screening protocol are given in the Experimental Section. Eight substituents were chosen which had Böhm contributions of better than -10 and strain energies of lower than 25 kJ/mol.25 Good hits from this second search were incorporated with those of the first, and 14 compounds in total were selected for synthesis (library 1a). Addition of bulky and partially hydrophobic substituents to a small template could arguably be expected to lead to increased efficacy through nonspecific lipophilic interactions.
We wanted to be confident that any increase in activity in the initial library was not generated this way. Therefore it was decided to prepare a second library using some of the same substituents selected for the 3-amidinobenzoyl template but attached to the ‘wrong’ 4-amidinobenzoyl template. Eight substituents were chosen (library 1b). The synthetic routes used for the preparation of these compounds are shown in Schemes 1 and 2. Simple amine substituents, as exemplified by amino acids and their esters, were prepared by coupling with 3-cyanobenzoic acid (using TBTU in DMF), conversion to the imidate (HCl in ethanol), and then to the amidine (using ammonia in ethanol). Hydrolysis of the ester was accomplished using aqueous sodium hydroxide in ethanol (Scheme 1).

Table 2. Comparison of Activities against Factor Xa, Trypsin, and Thrombin for Benzamidine and Libraries 1a-c, 2a-e, and 3a (a)

Factor Xa
library        n     mean pKi ± SD   mean Ki (b) (µM)   best Ki (µM)
benzamidine          3.7             200                200
1a             14    4.3 ± 0.6       50                 8.5
1a (subset)    7     4.2 ± 0.8       58                 8.5
1b             7     3.1 ± 0.3       780                162
1a + 1c        36    4.4 ± 0.8       36                 2.1
2a             6     5.5 ± 0.8       3.5                0.22
2b             6     4.7 ± 0.7       21                 1.0
2c             9     4.7 ± 0.4       21                 2.9
2d             11    4.5 ± 0.5       29                 3.0
2a + 2e        34    5.7 ± 0.8       1.9                0.063
3a             106   6.5 ± 0.8       0.34               0.016

Trypsin
library        n     mean pKi ± SD   mean Ki (b) (µM)   best Ki (µM)
benzamidine          4.8             17                 17
1a             14    4.4 ± 0.8       38                 6
1a (subset)    7     4.2 ± 0.9       61                 6
1b             7     3.7 ± 0.9       180                2.3
1a + 1c        36    4.2 ± 0.8       61                 6.5
2a             6     4.6 ± 0.6       26                 9
2b             6     4.3 ± 0.4       44                 21
2c             9     4.3 ± 0.4       43                 8
2d             11    4.3 ± 0.4       47                 15
2a + 2e        34    4.9 ± 0.7       11                 0.79
3a             106   5.4 ± 0.6       4.1                0.040

Thrombin
library        n     mean pKi ± SD   mean Ki (b) (µM)   best Ki (µM)
benzamidine          4.6             25                 25
1a             14    4.5 ± 0.7       31                 5
1a (subset)    7     4.6 ± 0.8       23                 5
1b             7     3.6 ± 0.5       230                89
1a + 1c        36    4.3 ± 0.9       51                 3.0
2a             6     (c)             (c)                (c)
2b             6     (c)             (c)                (c)
2c             9     (c)             (c)                (c)
2d             11    (c)             (c)                (c)
2a + 2e        34    4.7 (d) ± 0.5   18                 1.4
3a             106   5.2 (e) ± 0.6   7                  0.023

(a) Figures not in bold represent benchmark libraries. (b) Ki figures are geometric means calculated as the reciprocal antilog of the pKi mean. (c) Insufficient compounds tested against thrombin. (d) Only 23 compounds tested against thrombin. (e) Only 99 compounds tested against thrombin.
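The Table 2 footnote on mean Ki values describes a geometric mean obtained by averaging pKi values and converting back. As a minimal sketch of that arithmetic, assuming the usual convention that pKi is the negative base-10 logarithm of Ki expressed in mol/L (the function names and the three-value input are illustrative only, not taken from the paper):

```python
import math


def pki_to_ki_um(pki):
    """Convert a pKi (Ki in mol/L) to Ki in micromolar."""
    return 10.0 ** (-pki) * 1e6


def geometric_mean_ki_um(ki_values_um):
    """Geometric mean Ki in micromolar: the antilog of the mean pKi."""
    pkis = [-math.log10(k * 1e-6) for k in ki_values_um]
    mean_pki = sum(pkis) / len(pkis)
    return pki_to_ki_um(mean_pki)


# Benzamidine against factor Xa: pKi 3.7 corresponds to about 200 µM.
print(round(pki_to_ki_um(3.7)))

# Made-up Ki values of 1, 10 and 100 µM have a geometric mean of 10 µM,
# whereas their arithmetic mean would be 37 µM.
print(round(geometric_mean_ki_um([1.0, 10.0, 100.0]), 3))
```

Averaging on the logarithmic scale keeps a single weak compound from dominating the library mean, which is presumably why the table reports geometric rather than arithmetic means.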
Scheme 1. Solution-Phase Synthesis of Inhibitorsa a Conditions: (a) TBTU in DMF; (b) HCl in ethanol; (c) ammonia in ethanol; (d) NaOH in ethanol. Where the S4 unit was a bis-amine a solid-phase route was preferred, Scheme 2. The bis-amine was attached to 2-chlorotrityl polystyrene resin and coupled to an Fmoc protected amino acid using TBTU in DMF. The Fmoc protection was removed with piperidine in DMF and free amine coupled to 3-amidinobenzoic acid TFA salt using DIPCI and HOBt. The product was then cleaved from the resin using TFA/triethylsilane. Compounds were tested in chromogenic assays and Ki’s calculated against a range of serine proteases. These included factor Xa, trypsin, and thrombin. The mean, standard deviation, and best activities of libraries 1a and 1b against factor Xa, trypsin, and thrombin are given in Table 2. The corresponding activities for the subset of library 1a with common substituents to library 1b are also given. The designed library 1a with the 3-amidinobenzoyl template demonstrates on average markedly improved activity over benzamidine (Ki of 200 µM, pKi of 3.7). Moreover, the ‘control’ library with the 4-amidinobenzoyl template, 1b, shows no such improvement and is, on average, more than an order of magnitude less active than the corresponding subset of compounds from library 1a. The elevation of activity in library 1a is therefore not simply a function of increasing the size of the molecule by addition of a lipophilic fragment. It was decided to broaden the library 1a by making some simple structure-predicated modifications of some of the hits. Thus a variety of more lipophilic esters of the naphthylalanine and tryptophan analogues 2 and 3 were prepared, with the idea they might better fill the S4 pocket (library 1c). This led to the low micromolar lead 4. 
This compound is selective for factor Xa over trypsin (Ki of 2.0 µM vs Ki of 12.5 µM) despite the fact that benzamidine itself is 10-fold better for trypsin (Ki of 20 µM vs trypsin). Another one of the most active compounds in library 1a was 5 (Ki, 14 µM), and this was selected as a second possible lead compound. The more active of the two leads, 4, did not readily lend itself to modification via easily accessible chemistry. Compound 5, on the other hand, looked ideally set up for quick variation, both by replacement of the glycine with other amino acids and by replacement of the 1,4-bis-aminomethylcyclohexane fragments. This was therefore chosen as the lead molecule from which to develop the second library.

Scheme 2. Solid-Phase Synthesis of Inhibitors. Conditions: (a) TBTU/DIPEA and Fmoc-amino acid in DMF; (b) 20% piperidine in DMF; (c) 3-amidinobenzoic acid TFA salt and DIPCI/HOBt in DMF; (d) 10% triethylsilane in TFA.

Figure 5. Compound 5 modeled in the 1HCG factor Xa structure.

Second Library Design. Figure 5 illustrates 5 docked into the 1HCG structure. The terminal nitrogen of the S4 binding portion has been modeled as protonated. The glycine in 5 can be elaborated from either of the prochiral hydrogens. Elaboration involving a D-amino acid or analogue thereof (i.e., coming off the hydrogen marked purple in Figure 5) appeared to have a good chance of exploiting the lipophilic disulfide pocket (green area, Figure 3, disulfide in yellow in Figure 5) according to the model. The enantiomeric L-amino acids appear not to be able to access good binding sites in the vicinity. PRO_SELECT jobs were carried out using the list of available α-amino acids. Both L and D configurations were examined, and both polar and hydrophobic interactions were sought. The list of amino acids that resulted consisted mainly of D-amino acids, which docked into the disulfide pocket.
A library of seven compounds from this list, all D-enantiomers, was synthesized (sublibrary 2a), all with the 1,4-bis-aminomethylcyclohexane S4 fragment. The corresponding L-enantiomers were also made (sublibrary 2b). A variety of other D-amino acids (9 in all, sublibrary 2c) and L-amino acids (11 in all, sublibrary 2d) was also utilized. Thus libraries 2b, 2c, and 2d represent benchmark libraries to compare with 2a. The solid-phase synthetic route in Scheme 2 was applicable to all the library 2 compounds. Table 2 gives the mean, standard deviation, and best activities for all four sublibraries against factor Xa, trypsin, and thrombin. The figures in Table 2 indicate that designed sublibrary 2a is, on average, almost an order of magnitude more active against factor Xa than any of the three comparison libraries, 2b, 2c, and 2d. However, this is not mirrored in the average trypsin activities. The compound within the library that showed the highest activity was 6 (Figure 6), with a Ki of 220 nM against factor Xa and 7.4 µM against trypsin. Analysis of the PRO_SELECT run and further modeling revealed that the phenyl group appeared able to sit well into the disulfide pocket. It also appeared able to make an edge-on interaction with the disulfide bridge. This compound was selected as the lead molecule for the third cycle of PRO_SELECT driven optimization.

Figure 6. Lead molecule for library 3 (6) and examples of library 3 compounds.

Analysis of the 1HCG structure suggested that there was plenty of room left in the disulfide pocket for further hydrophobic elaboration, especially in the vicinity of the 3 and 4 positions of the phenyl ring in 6. Accordingly, to exploit this extra binding possibility, further analogues were designed, using both medicinal chemistry principles and modeling (library 2e). Only modest increases in activity were achieved.
The reasons for this remained unclear until the crystal structure of DX9065a bound to factor Xa was published (1FAX).12 This structure retains the autolysis loop, missing in the 1HCG structure. This loop sits at the back of the disulfide pocket area, severely curtailing its size and depth. Third Library Design and Activity. Only a limited number of substituents designed to access the S4 pocket was tried in libraries 1 and 2. Many of these were diamines. However, it was accepted that there was considerable scope to find better S4 pocket binders and that these could either contain cations or hydrogen bond donors or, alternatively, they could be hydrophobic in nature and interact strongly with the aromatic ‘box’. Therefore it was decided to carry out PRO_SELECT jobs using a docked conformation of 6 as the template, with the aim of replacing the terminal diamine with a hydrophobic primary or secondary monoamine. Several jobs were run. The list of available monoamines was large, and the pocket to be filled is also sizable. Therefore, a much bigger list of quality substituents was found from these jobs than from those run previously. Approximately 100 targets were selected (library 3a). The vast majority of these compounds contained a lipophilic S4 binder. Where the S4 binder was lipophilic, a solution-phase synthetic route was used (Scheme 3). Boc-D-phenylglycine was coupled to an S4 component in DMF using EDC/HOBt or HOAt. The products thus obtained were deprotected, using TFA in DCM, and then coupled to 3-amidinobenzoic acid TFA salt. Where the S4 binder was a diamine, the solid-phase route (Scheme 2) could be used. Summary activity data for library 3a is given in Table 2. Activity, in relation to the lead compound 6, was generally retained and, in many cases, improved upon in this library, despite the fact that there was usually no possibility of obtaining a hydrogen bond to the back of the S4 pocket in the new series. 
The most active targets out of this set of analogues, 7, Ki 16 nM, and 8, Ki 16 nM (Figure 6), showed a greater than 10-fold increase in activity over 6 and retained selectivity over other serine proteases (Ki’s of 980 and 1700 nM against trypsin, 1100 and 1600 nM against thrombin). These molecules are both amenable to further optimization at the S4 end of the molecule and thus represent good third cycle leads. Selectivity for the library as a whole against thrombin and trypsin was slightly improved, despite the fact that both these enzymes have sizable lipophilic pockets that correspond to the Xa S4 pocket. A cocrystal of compound 7 bound in trypsin was successfully obtained. Figure 7 compares the predicted binding mode in factor Xa (blue) with that found in trypsin (green). The benzamidine is found in S1, hydrogen bonding in a bidentate fashion to Asp189; the phenyl portion of the phenylglycine linker is found in the disulfide pocket, which can also be accessed in trypsin, and the benzoyl piperidine group sits centrally in the S4-pocket with the carbonyl pointing upward. Both amide bonds in the ligand make the predicted hydrogen bonds. The first H-bonds Gly 216 through NH, the second, Gly 218, through C=O. The bound ligands appear least well superimposed in the S4 region. However, trypsin and factor Xa differ in the S4 pocket, most noticeably at residues 99 and 174 (Leu and Gln in trypsin). Leu 99 allows more room for the terminal phenyl group to sit on the right-hand side of the pocket than does Tyr 99 in factor Xa.
PRO_SELECT: A Tool for Rapid Lead Discovery. Journal of Medicinal Chemistry, 2002, Vol. 45, No. 6, 1227
Scheme 3. Solution-Phase Synthesis of Inhibitors. Conditions: (a) coupling agents, see text; (b) 25% TFA in DCM; (c) 3-amidinobenzoic acid TFA salt and DIPCI/HOBt in DMF.
Figure 7. Compound 7 bound in trypsin (green) and superimposed with the predicted binding mode in Factor Xa (blue).
The third library contained a wide diversity of active chemistries. This gave rise to a number of structurally different lead molecules that could be exploited further using either classical medicinal chemistry or a structure informed approach. Compound 9 represents one example where structure-based modification of a PRO_SELECT lead gave rise to other chemistries with potent activity and selectivity. The synthesis of this type of compound is outlined in Scheme 4. The Boc-N-protected S4 intermediates were first prepared as shown in Scheme 3. The Boc protection was removed using TFA/DCM, and the resulting amine reacted with 1,3-bis-tert-butoxycarbonyl methyl pseudothiourea in the presence of mercury II chloride to give the bis-butoxycarbonyl guanidines. Final treatment with TFA/DCM and purification by preparative HPLC gave the final products isolated as TFA salts. Compound 9 has a Ki of 26 nM against factor Xa but only 1.6 µM against trypsin and 8 µM against thrombin. Other related compounds had similar activity and selectivity factors of 250.
Scheme 4. Preparation of S4 Guanidino Compounds. Conditions: (a) TFA/DCM; (b) 1,3-bis-tert-butoxycarbonyl methyl pseudothiourea/HgCl2; (c) TFA/DCM.
Overview. Figure 8 illustrates how the activity of the series evolved. The benzamidine template and lead molecules from each library are indicated. Each library contains, as well as the original PRO_SELECT members, additional structure-based and medicinal-chemistry-based analogues, some of which are mentioned above. To obtain predicted binding affinities for all compounds under a consistent set of docking conditions, all molecules were subsequently redocked, keeping the appropriate template portion of each molecule rigid as in the original PRO_SELECT run. Predicted binding affinity was calculated from the docking score for each molecule. Predicted binding affinity is plotted against factor Xa activity in Figure 9. Both those compounds selected through virtual screening and those not so selected (libraries 1b and 2b,c,d) are plotted. There is not a tight correlation. Nevertheless, out of the selected compounds, there is only a small proportion which score well but which show poor activity. In addition, those compounds not selected through virtual screening generally show both poor predicted and actual activity. These things are what we would hope to see as the primary aim of the virtual screening approach is to concentrate synthetic resource in the areas where reasonable activity is most likely to reside. One of the reasons the correlation of predicted and measured activity is not better is because the empirical scoring function is only able to describe positive features of the binding mode and not negative ones other than rotational entropy. In addition, current scoring functions ignore subtle electrostatic effects such as π stacking and so on, which can greatly influence activity. So quite a spread of activities is obtained in the final data set. This spread of activity can be very useful, however, as it often envelops a rudimentary structure-activity relationship among subgroups of similar chemistry within the library. This provides a springboard for a classical lead optimization approach. For this reason, we found it important to have good compound integrity in each library. Impure compounds gave rise to data that confounded early stage SAR development in some cases.
Figure 8. Activity progression in the benzamidine series from the first library (circles) to the second (squares) and third (triangles) libraries. Original template and lead molecules for subsequent library development are marked.
Figure 9. Activity against factor Xa in the benzamidine series versus predicted activity calculated from docking score. Compounds are from both the PRO_SELECT (●) and benchmark (△) libraries.
Several other points are worth making. First, the lead molecule chosen at the beginning of each cycle was not necessarily the most active molecule in the previous library. It is more important that the lead be reasonably easy to chemically modify and also be likely to have the least pharmacokinetic problems. Second, this approach is synergistic with modern combinatorial chemical techniques and it could be used with benefit alongside them, to design large focused libraries. However, in such an approach only one synthetic route is generally followed, and it is accepted that a fraction of library members will not be successfully made. All compounds in a PRO_SELECT designed library are to some extent ‘cherished’, and therefore there is more incentive to make all members of the library than is usual in combinatorial chemical exercises. If medium throughput array chemistry is employed and the libraries are relatively small, then it is practical and useful to develop and employ more than one synthetic route, as was done in this study.
Conclusions
We have described the successful application of ‘virtual’ screening in the rational design of potent and selective factor Xa inhibitors. The starting point for the program was a simple template, benzamidine, placed in the S1 pocket. Iterative library design built upon the template in order to access other pockets in the active site. The chemistry involved in the synthesis of these libraries was designed to be straightforward, allowing rapid access of targets. Libraries using the same chemistry were also synthesized which were not designed through use of PRO_SELECT but which were reasonable from a medicinal chemistry standpoint. These consistently showed poorer average activity against (by roughly a factor of 10) and selectivity for factor Xa than those designed by ‘virtual’ screening.
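As a worked illustration of the selectivity figures quoted above, the ratio of an off-target Ki to the factor Xa Ki can be computed directly from the values reported for compounds 7, 8, and 9. The sketch below is only an arithmetic check; the function name `selectivity` and the dictionary layout are illustrative assumptions and are not part of PRO_SELECT.

```python
# Ki values (nM) as quoted in the text for compounds 7, 8, and 9.
KI_NM = {
    "7": {"factor Xa": 16.0, "trypsin": 980.0, "thrombin": 1100.0},
    "8": {"factor Xa": 16.0, "trypsin": 1700.0, "thrombin": 1600.0},
    "9": {"factor Xa": 26.0, "trypsin": 1600.0, "thrombin": 8000.0},
}


def selectivity(ki_target_nm: float, ki_off_target_nm: float) -> float:
    """Fold selectivity: off-target Ki divided by target Ki (larger is more selective)."""
    return ki_off_target_nm / ki_target_nm


if __name__ == "__main__":
    for compound, ki in KI_NM.items():
        for enzyme in ("trypsin", "thrombin"):
            ratio = selectivity(ki["factor Xa"], ki[enzyme])
            print(f"compound {compound}: {ratio:.0f}-fold selective over {enzyme}")
```

On these numbers compound 9 comes out at roughly 60-fold selective over trypsin and roughly 300-fold over thrombin, the same order of magnitude as the selectivity factor of 250 quoted for related compounds.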
Several lead molecules with diverse structure and Ki’s in the range of 10 to 50 nM were obtained, representing 4 orders of magnitude increase with regard to the binding affinity of the starting template. Further testing established that some of these compounds showed antithrombotic activity when given intraperitoneally in the Wessler stasis model of venous thrombosis in rat and therefore have potential therapeutic use as an injectable treatment. None of the compounds showed an effect when given orally, however. It was felt this was likely to be because of the highly basic benzamidine moiety. Further work has since been carried out to replace this group with a group of moderate basicity and to optimize potency by modification elsewhere in the molecule. Highly potent and selective compounds with strong oral antithrombotic activity were found. These will be described in later publications. The design of libraries through structure-based ‘virtual screening’ is a methodology of drug design that is currently of high interest. We have demonstrated that it can be an efficient method for the generation of potent and selective lead molecules, in cases where a target protein structure exists.
Figure 10. Pharmacophores used in finding substituent lists for library 1a: (A) pharmacophore for H-bond acceptor substituents, (B) pharmacophore for lipophilic substituents.
Experimental Section
Computational Details. Manipulation and inspection of receptor, template, and ligands before and after minimization or simulation was carried out using InsightII 95.0.25 All molecular mechanics minimizations and molecular dynamics simulations were carried out using the Discover 2.97 program26 with the CFF95 force field. The Discover calculations were carried out on a Convex Exemplar (16 × HP7100s) running SPP-UX 3.1.
The pharmacophore searches of the Available Chemical Directory (ACD)27 and substituent list generation were performed using ISIS/Base 1.228 and ISIS/Draw 1.28 Searches were carried out allowing full conformational flexibility. Inspection of the structures and associated numerical data generated by PRO_SELECT was carried out using in-house graphics software (XMOLBROWSE). Substituent list generation and graphical inspection were carried out on SGI Indigo R3000 workstations running IRIX 4.0.5.
Protocol for the Design of Library 1a Members Based on Three S1 Templates. The template positioning for the arginine template (Table 1) was taken from a minimized structure of PPACK in FXa. The docked conformation of PPACK was derived from the 1PPB (PDB designation) PPACK/thrombin structure. The template positioning of the benzamidines was derived from a docking of DX-9065a into the 1HCG (PDB designation) factor Xa structure. The carbonyl group was placed at the ring position proximal to the lip of S4 in the case of the 3-amidinobenzoyl template (Table 1). Both possible orientations of the carbonyl group planar to the ring were treated as valid, and PRO_SELECT runs were carried out on each. Only one orientation was deemed useful for the 4-amidinobenzoyl template. Two lists of substituents were prepared using the ACD as a source. Pharmacophores for the substituent selection were calculated from the crystal structure assuming reasonable placement of the template and are given in Figure 10. One pharmacophore was targeted toward the polar functionality at the lip of S4, with the primary aim of picking up interactions with the N-H of Gly 218 (Figure 10A). The other was targeted at the hydrophobic ‘aromatic box’ (Figure 10B). The amino group common to both is the linking group to the template. The initial aim was to probe only the central region of the active site. Therefore a molecular weight limit of 250 was set.
Conformationally flexible searching of the ACD with these pharmacophores using ISIS/Base software generated 2D structure lists which were then converted to 3D using Converter26 (Molecular Simulations Inc.). The final ‘polar’ list numbered 1534 substituents, the ‘nonpolar’ one numbered 797. Each list of substituents was evaluated separately by PRO_SELECT. The polar substituents were assigned hydrogen bonding acceptor sites on double bonded oxygen, nitrogen, and sulfur (this was found as effective at finding hits as assigning both acceptor and donor sites). Hydrophobic sites were assigned to carbons in five- and six-membered carbocyclic rings. Passes from the interaction site matching stage were minimized using the Clean force field, keeping the template rigid, and scored for binding affinity using the empirical scoring function of Böhm.18 The best scoring conformation per substituent was retained. The final list of substituents was filtered to pick out only those which had Böhm contributions of -3 kJ mol-1 or better.
Figure 11. Template and pharmacophore for library 1b: (A) template, (B) pharmacophore for bis-amine substituents, (C) pharmacophore for bis-nitrile list.
Protocol for the Further Design of Library 1a Members Based around the 3-Amidinobenzoylglycine Template. Molecular dynamics simulations were carried out on ligands 1 and 4, among others, docked into the 1HCG factor Xa structure. Snapshots were taken from the simulations at 1 ps intervals and each snapshot minimized. Analysis of these simulations allowed the selection of two significantly different template geometries, both of which gave reasonably good Böhm scores. PRO_SELECT runs were carried out using each geometry. The template structure is given in Figure 11A. The substituent list was prepared by searching the ACD using the pharmacophore in Figure 11B. The list contained 422 substituents.
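The post-docking selection described above for the first library 1a protocol (retain the best-scoring conformation per substituent, then keep only substituents with a score contribution of -3 kJ mol-1 or better) can be sketched as follows. This is a minimal illustration and not the PRO_SELECT implementation: the `Pose` record, the substituent names, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    substituent: str      # hypothetical substituent identifier
    score_kj_mol: float   # empirical score contribution; more negative is better


def best_pose_per_substituent(poses):
    """Keep only the best-scoring (most negative) pose for each substituent."""
    best = {}
    for pose in poses:
        current = best.get(pose.substituent)
        if current is None or pose.score_kj_mol < current.score_kj_mol:
            best[pose.substituent] = pose
    return best


def filter_by_score(best, cutoff_kj_mol=-3.0):
    """Return poses whose score is at or below the cutoff, best first."""
    kept = [p for p in best.values() if p.score_kj_mol <= cutoff_kj_mol]
    return sorted(kept, key=lambda p: p.score_kj_mol)


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    poses = [
        Pose("amine_A", -2.1), Pose("amine_A", -5.4),
        Pose("amine_B", -1.0), Pose("amine_C", -3.0),
    ]
    for pose in filter_by_score(best_pose_per_substituent(poses)):
        print(pose.substituent, pose.score_kj_mol)
```

The cutoff is a parameter, so the same two functions would cover the stricter thresholds used for later libraries; a strain-energy criterion would need an extra field and a second comparison.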
A second bis-nitrile list was prepared according to the pharmacophore in Figure 11C. PRO_SELECT runs were carried out separately on this list which contained 656 substituents. Substituents for library 1a were selected out of that set of substituents with Böhm contributions of -9 kJ mol-1 or better, and strain energies of 27 kJ mol-1 or lower.25 The hits were clustered using the Jarvis-Patrick method within XMOLBROWSE, and substituents were selected from those that passed the criteria via manual inspection of each ligand docking. It was found that the two different template positions gave rise to quite different sets of high scoring substituents.
Protocol for the Design of Library 2. The origin of the templates for this library was an “annealed” geometry of 5 manually docked into the 1HCG FXa X-ray structure. The receptor geometry was held rigid while the ligand was subjected to molecular dynamics at temperatures up to 300 K with subsequent slow cooling followed by minimization. This final geometry then provided an appropriate ligand template geometry. The disconnection point for this template was taken to be either of the prochiral hydrogens of the ligand glycine methylene. Each of these hydrogen atoms was substituted during separate PRO_SELECT runs as it was desired to look at both L- and D-amino acids. The list of potential “amino acid” substituents was obtained by searching the ACD for all free α-amino acids. There were 1230 possible amino acid substituents after removal of high molecular weight compounds (MW > 250), filtering of undesirable chemistries, and conversion into 3D. The substituent disconnection point employed here was the amino acid side chain-to-Cα bond. Several SELECT jobs were then carried out to focus on lipophilic or polar interactions arising from the amino acid substituent with the receptor interaction sites. A list of substituents suitable for synthesis was generated after removal of duplicates, inappropriate substituents, e.g., side chains from substituents available only in L form where the D was preferred, and filtering of poor scoring substituents using the criterion that the Böhm score be less than -3.0 kJ mol-1.
Figure 12. Pharmacophores used in finding substituent lists for library 3: (A) pharmacophore for hydrophobic primary amines, (B) pharmacophore for hydrophobic cyclic secondary amines, (C) pharmacophore for hydrophobic acyclic secondary amines. Aromatic and fused bicyclic systems were allowed in the lists generated by pharmacophores A, B, and C.
Protocol for the Design of Library 3. Compound 6 and also an analogue of compound 6 which had the bis(aminomethyl)cyclohexane replaced by a 1-adamantylamine, docked into the 1HCG factor Xa structure, were simulated at 300 K. The receptor was kept rigid, and snapshots were taken at 5 ps intervals and minimized. Low energy snapshots were selected to provide two different template positionings. Three pharmacophores were used to search the ACD for appropriate hydrophobic amines. These pharmacophores, given in Figure 12, represent respectively primary amines, secondary cyclic amines, and secondary acyclic amines. The hydrophobic parts of the pharmacophores were designed so as to avoid linear hydrocarbons. Hits that had several polar groups were excluded, as were certain classes of reactive chemistry. A molecular weight limit of 250 for free base was used. The primary amine list numbered, after conversion to 3D and enumeration of enantiomers and diastereomers, 1053 molecules, and the cyclic amine list numbered 250 compounds. The secondary acyclic amine list, which was restricted to substituents containing a six-membered carbocycle, numbered 366 compounds.
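The molecular weight limit of 250 applied to the free-base amines above is easy to reproduce once a molecular formula is available. The sketch below is an assumption-laden illustration rather than the ISIS/Base procedure used in the study: the formula parser handles only simple formulae without brackets or charges, the atomic masses are standard rounded values, and the helper names are hypothetical.

```python
import re

# Standard atomic masses (g/mol), rounded; only the elements needed here.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06, "Cl": 35.45}


def molecular_weight(formula: str) -> float:
    """Molecular weight of a simple formula such as 'C10H17N' (no brackets or charges)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # Unknown elements raise KeyError rather than being silently ignored.
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def passes_mw_limit(formula: str, limit: float = 250.0) -> bool:
    """True if the free-base molecular weight does not exceed the limit."""
    return molecular_weight(formula) <= limit


if __name__ == "__main__":
    # 1-Adamantylamine (C10H17N) and 4-benzoylpiperidine (C12H15NO) are amines named in the text.
    for name, formula in [("1-adamantylamine", "C10H17N"), ("4-benzoylpiperidine", "C12H15NO")]:
        mw = molecular_weight(formula)
        print(f"{name}: MW {mw:.1f}, within limit: {passes_mw_limit(formula)}")
```

Both example amines fall well under the limit, whereas a full inhibitor such as compound 6 (C24H31N5O2) would not, which is consistent with the limit being applied to the substituent lists and not to the assembled ligands.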
The link site was chosen to be the N-H trans to the carbonyl of the phenyl glycine, in the case of the primary amine list, but the C(=O)-N bond for both the secondary and the cyclic lists. Hydrophobic interaction sites were placed at carbons adjacent to a hydrophobic branch point. Passes from the interaction site matching stage were treated as described above. Priority substituent lists were selected on the basis of having favorable Böhm contributions (generally -12 kJ mol-1 or better) and low strain energies (generally less than -21 kJ mol-1). These lists were clustered according to chemistry and processed by manual inspection of the binding mode to generate synthetic candidates.
Chemistry. Abbreviations used follow IUPAC-IUB nomenclature. Additional abbreviations are HPLC, high performance liquid chromatography; DMF, dimethylformamide; DCM, dichloromethane; HATU, O-(7-azabenzotriazol-1-yl)-1,1,3,3-tetramethyluronium hexafluorophosphate; HOBt, 1-hydroxybenzotriazole; TBTU, 2-(1H-benzotriazol-1-yl)-1,1,3,3-tetramethyluronium tetrafluoroborate; DIPEA, diisopropylethylamine; TEA, triethylamine; HOAt, 1-hydroxy-7-azabenzotriazole; Fmoc, 1-(9H-fluoren-9-yl)methoxycarbonyl; TFA, trifluoroacetic acid; MALDI-TOF, matrix assisted laser desorption ionization-time-of-flight mass spectrometry. Unless otherwise indicated, amino acid derivatives, resins, and coupling reagents were obtained from Novabiochem (Nottingham, U.K.) and other solvents and reagents from Rathburn (Walkerburn, U.K.) or Aldrich (Gillingham, U.K.) and were used without further purification. Purification was by gradient reverse-phase HPLC on a Waters Deltaprep 4000 at a flow rate of 50 mL/min using a Deltapak C18 radial compression column (40 mm × 210 mm, 10-15 µm particle size) and solvent mixtures consisting of eluant A (0.1% aq TFA) and eluant B (90% MeCN in 0.1% aq TFA) with gradient elution.
Analytical HPLC was on a Shimadzu LC6 gradient system equipped with an autosampler and a variable wavelength detector, at flow rates of 0.4 mL/min. Eluents A and B were as for preparative HPLC, and the following columns were used: Luna2 C18 2 × 150 mm 5 µm, Symmetry C8 4.6 × 30 mm 3.5 µm (Phenomenex). Purified products were further analyzed by MALDI-TOF and/or LCMS and 1H NMR. Compound libraries were prepared using both solid-phase and solution-phase parallel synthetic methods as described below.
Simple Amine S4 Units (Scheme 1). Amino acid ester hydrochlorides were either (i) coupled to 3-amidinobenzoic acid TFA salt using DIPCI/HOBt in DMF containing 1 equiv of DIPEA, purified by reverse-phase preparative HPLC and isolated as the TFA salt, or (ii) coupled to 3-cyanobenzoic acid using TBTU/HOBt in DMF containing 1 equiv of DIPEA and the nitrile converted to the amidine by sequential treatment with HCl gas in ethanol and ammonia gas in ethanol. The products were purified by reverse-phase preparative HPLC and isolated as the TFA salt. Compounds 3 and 4 were prepared by these routes.
3-Amidinobenzoyl-D-tryptophan TFA Salt, 3. To a solution of 3-cyanobenzoic acid (500 mg, 3.4 mmol), D-tryptophan methyl ester hydrochloride (866 mg, 3.40 mmol), and HOBt (459 mg, 3.4 mmol) in DMF (10 mL) were added TBTU (1.09 g, 3.40 mmol) and DIPEA (592 µL, 3.4 mmol). The reaction was stirred until complete by TLC and then partitioned between ethyl acetate and water. The organic solution was evaporated in vacuo to give 3-cyanobenzoyl-D-tryptophan methyl ester (954 mg, 81%). HCl gas was bubbled into a solution of 3-cyanobenzoyl-D-tryptophan methyl ester (925 mg, 2.66 mmol) in ethanol, and the mixture was left overnight before evaporating to dryness in vacuo. The solid was taken up in ethanol, and the solution was saturated with ammonia gas.
After being stirred overnight, the mixture was evaporated to dryness in vacuo and the resulting solid purified by preparative HPLC to give a mixture of 3-amidinobenzoyl-D-tryptophan methyl and ethyl ester TFA salts. To a solution of 3-amidinobenzoyl-D-tryptophan methyl and ethyl ester TFA salts (50 mg) in ethanol (5 mL) was added 1 M aqueous sodium hydroxide (1 mL), and the mixture was stirred overnight. The mixture was evaporated to dryness in vacuo and purified by preparative HPLC to give 3-amidinobenzoyl-D-tryptophan TFA salt (11 mg). 1H NMR (CD3CN) δ 8.14 (1H, s, Ar); 8.08 (1H, d, Ar); 7.82 (1H, d, Ar); 7.68 (2H, m, Ar); 7.45 (1H, d, Ar); 7.05-7.30 (3H, m, Ar); 4.96 (1H, m, α-proton); 3.3-3.5 (d-ABq, β-proton). Homogeneous by HPLC Luna C18, Symmetry C8. High resolution MS (M+1)+ found 351.14555 (C19H18N4O3 requires 351.14568).
3-Amidinobenzoyl-D-2-naphthylalanine Ethyl Ester TFA Salt, 4. 3-Amidinobenzoic acid TFA salt (100 mg) was added to a mixture of HOBt (48.6 mg) and DIPCI (57 µL) in DMF that had been stirring for 10 min. To this mixture was added a solution of 2-naphthylalanine ethyl ester hydrochloride (100.5 mg) and triethylamine (50 µL). After being stirred overnight, the crude reaction mixture was purified by preparative HPLC to give 3-amidinobenzoyl-D-2-naphthylalanine ethyl ester TFA salt. 1H NMR (CD3CN) δ 7.98 (1H, s, Ar); 7.88 (1H, d, Ar); 7.67 (5H, m, Ar); 7.50 (1H, t, Ar); 7.3-7.4 (3H, m, Ar); 4.80 (1H, dd, α-proton); 4.0 (2H, q, Et); 3.35-3.1 (d-ABq, β-proton); 1.1 (3H, t, Et). Homogeneous by HPLC Luna C18, Symmetry C8. LCMS 390 (M+1)+, high resolution MS (M+1)+ found 390.18077 (C23H23N3O3 requires 390.181740).
Bis-amine S4 Units by Solid-Phase Methodology (Scheme 2). The S4 component (bis-1,4-aminomethylcyclohexane) was supported on 2-chlorotrityl resin (1.2 mmol/g) and coupled with an Fmoc protected amino acid using TBTU/DIPEA in DMF.
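The agreement between the found and required high-resolution masses reported above for compounds 3 and 4 can be expressed as a signed error in parts per million. The sketch below is a simple arithmetic check; the function name `ppm_error` is an illustrative choice, and no acceptance threshold is implied by the source.

```python
def ppm_error(found_mz: float, required_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (found_mz - required_mz) / required_mz * 1e6


if __name__ == "__main__":
    # (found, required) (M+1)+ values as quoted in the text.
    reported = {
        "3": (351.14555, 351.14568),
        "4": (390.18077, 390.181740),
    }
    for compound, (found, required) in reported.items():
        print(f"compound {compound}: {ppm_error(found, required):+.2f} ppm")
```

For the quoted values the errors are about -0.4 ppm for compound 3 and about -2.5 ppm for compound 4.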
The washed resin was treated with 20% piperidine in DMF to remove the Fmoc protection and then reacted with 3-amidinobenzoic acid TFA salt using DIPCI/HOBt in DMF. The product was then cleaved off the washed resin using 10% triethylsilane in TFA, and the crude product obtained was purified by preparative reverse-phase HPLC and isolated as the TFA salt. Compounds 5 and 6 were made by this route.
3-Amidinobenzoyl-glycinyl-(4-aminomethylcyclohexyl)methylamine, 5. 1H NMR (CD3CN/D2O) mixture of cis/trans isomers, major isomer only: δ 8.15 (s, 1H, Ar); 8.1 (d, 1H, Ar); 7.9 (d, 1H, Ar); 7.66 (t, 1H, Ar); 4.97 (s, 2H, “Gly CH2”); 3.0 (d, 2H, amide CH2); 2.70 (d, 2H, amine CH2); 1.70 (m, 4H, cyclohexyl); 1.40 (m, 3H, cyclohexyl); 0.90 (m, 3H, cyclohexyl). Homogeneous by HPLC Luna C18, Symmetry C8. LCMS 346 (M+1)+, high resolution MS (M+1)+ found 346.22378 (C18H27N5O2 requires 346.22426).
3-Amidinobenzoyl-D-phenylglycinyl-(4-aminomethylcyclohexyl)methylamine, 6. 1H NMR (D2O) mixture of cyclohexyl cis and trans isomers δ 8.09 (1H, s); 8.05 (1H, d, J = 7.5 Hz); 7.90 (1H, d, J = 7.5 Hz); 7.66 (1H, t, J = 7.5 Hz); 7.43 (5H, m); 5.47 (1H, s); 3.05 (2H, m); 2.78 (2H, m); 1.48 (7H, m); 0.86 (3H, m). Homogeneous by HPLC Luna C18, Symmetry C8. LCMS 422 (M+1)+, high resolution MS (M+1)+ found 422.25548 (C24H31N5O2 requires 422.25556).
Solution Phase II (Scheme 3). Boc-D-Phenylglycine was coupled to an S4 component in DMF using HATU or TBTU/DIPEA or, alternatively, EDCI or DIPCI with HOBt or HOAt as an additive. When the S4 component was an alcohol, catalytic DMAP was added. The products thus obtained were deprotected using TFA in DCM and then coupled to 3-amidinobenzoic acid TFA salt using DIPCI/HOBt in DMF. The products were purified by reverse-phase preparative HPLC and isolated as the TFA salts. Compounds 7 and 8 were made by this route.
1-(3-Amidinobenzoyl-D-phenylglycinyl)-4-benzoylpiperidine, 7.
To a solution of Boc-D-phenylglycine (251 mg, 1 mmol) in a mixture of DMF (1 mL) and DCM were added 4-benzoylpiperidine (339 mg, 1.5 mmol), DIPEA (348 µL, 2 mmol), and TBTU (353 mg, 1.1 mmol). After being stirred at room temperature overnight, the mixture was partitioned between ethyl acetate (6 mL) and 10% hydrochloric acid (2 mL). The organic layer was washed with 10% hydrochloric acid (2 mL), saturated aqueous sodium bicarbonate, and then brine. Evaporation of solvent gave the crude product which was taken up in dichloromethane (2 mL) and treated with trifluoroacetic acid (2 mL) until removal of the Boc group was complete. Solvent was evaporated in vacuo, and the residue was taken up in ethyl acetate and washed with saturated aqueous sodium bicarbonate and then brine before evaporating to dryness. The residue was dissolved in DMF (5 mL), and to this was added a mixture of HOAt (150 mg, 1.1 mmol), 3-amidinobenzoic acid TFA salt (300 mg, 1.08 mmol), and DIPCI (180 µL, 1.15 mmol), and the mixture was stirred overnight. Any solids were removed by filtration, and solvent was removed in vacuo. The residue was taken up in ethyl acetate, washed with saturated aqueous sodium bicarbonate, dried (MgSO4), and evaporated in vacuo. The residue was converted to the TFA salt by addition and evaporation of 25% TFA in acetonitrile and then dissolved in a minimum of aqueous acetonitrile for purification by preparative RPHPLC to give 1-(3-amidinobenzoyl-D-phenylglycinyl)-4-benzoylpiperidine TFA salt (159 mg, 27% over three steps). 1H NMR (DMSO-d6) δ 8.40 (2H, m); 8.10 (1H, d); 7.70 (1H, t); 7.50 (10H, m); 5.55 (1H, s); 3.60 (1H, m); 2.5 (2H, m); 1.00 (6H, m). Homogeneous by HPLC Luna C18, Symmetry C8. LCMS 469 (M+1)+, high resolution MS (M+1)+ = 469.22282 (C28H28N4O3 requires 469.223935).
1-(3-Amidinobenzoyl-D-phenylglycinyl)-4-chlorophenylpiperazine TFA Salt, 8.
1-(3-Amidinobenzoyl-D-phenylglycinyl)-4-chlorophenylpiperazine TFA salt was prepared from 4-chlorophenylpiperazine in a manner similar to that described above. 1H NMR (CD3CN) δ 8.05 (1H, s); 8.00 (1H, d); 7.87 (1H, d); 7.55 (1H, t); 7.31 (5H, m); 7.08 (2H, d); 6.75 (2H, d); 5.95 (1H, s); 3.70 (1H, m); 3.55 (2H, m); 3.45 (1H, m); 3.12 (1H, m); 3.00 (1H, m); 2.85 (1H, m); 2.35 (1H, m). Homogeneous by HPLC Luna C18, Symmetry C8. LCMS 476 (M+1)+, high resolution MS (M+1)+ = 476.18340 (C26H26N5O2 requires 476.18529).
Amidino compounds such as 9 were prepared initially using the solution-phase method II described above to give Boc-N-protected S4 intermediates. The Boc protection was removed using TFA/DCM, and the resulting amine reacted with 1,3-bis-tert-butoxycarbonyl methyl pseudothiourea in the presence of mercury II chloride to give the bis-butoxycarbonyl guanidines. Final treatment with TFA/DCM and purification by preparative HPLC gave the final products isolated as TFA salts (Scheme 4).
3-Amidinobenzoyl-D-phenylglycine 1-Amidinopiperidin-4-ylethyl Ester, 9. 3-Amidinobenzoyl-D-phenylglycine 1-Boc-piperidin-4-ylethyl ester was prepared using the general solution-phase method described above for compounds 7 and 8 and then treated with 25% TFA in DCM to remove the Boc protection. Treatment with 1,3-bis-tert-butyloxycarbonylmethylthiopseudourea (1 equiv), TEA (3 equiv), and mercury(II) chloride (1 equiv) in DMF overnight followed by extraction into ethyl acetate, washing with 2 N aqueous sodium hydroxide and water, and drying (MgSO4) gave 3-amidinobenzoylphenylglycine 1-(1,3-bis-tert-butyloxycarbonyl amidine)-4-piperidin-4-ylethanol ester. The above compound was treated with 25% TFA in DCM until the Boc protection was removed and then evaporated in vacuo. The residue was purified by preparative RPHPLC to give 3-amidinobenzoyl-D-phenylglycine 1-amidinopiperidin-4-ylethyl ester.
1H NMR (D2O) δ 8.17 (1H, m); 8.07 (1H, d); 7.93 (1H, d); 7.70 (1H, t); 7.45 (5H, m); 5.60 (1H, s); 4.25 (2H, m); 3.55 (2H, m); 2.75 (2H, m); 1.60 (4H, m); 1.25 (1H, m); 1.00 (2H, m). Homogeneous by HPLC Luna C18, Symmetry C8. LCMS 451 (M+1)+, high resolution MS (M+1)+ = 451.244439 (C24H30N6O3 requires 451.245725).
Biology. Inhibition of factor Xa was assessed at room temperature in 0.1 M phosphate buffer, pH 7.4, according to the method of Tapparelli et al.29 Purified human factor Xa was purchased from Alexis corporation, Nottingham, U.K. Chromogenic substrate pefachrome-FXA was purchased from Pentapharm AG, Basel, Switzerland. Product (4-nitroaniline) was quantified by absorption at 405 nm in 96-well plates using a Dynatech MR5000 reader (Dynex Ltd, Billingshurst, U.K.). Km and Ki were calculated using SAS PROC NLIN (SAS Institute, Cary, NC, Release 6.11). Km values were determined as 100.9 µM for factor Xa/pefachrome-FXA. Inhibitor stock solutions were prepared at 40 mM in dimethylsulfoxide and tested at 500 µM, 50 µM, and 5 µM. Accuracy of Ki measurements was confirmed by comparison with Ki values of known inhibitors of factor Xa.
Crystallization, Data Collection, and Refinement. Bovine trypsin (Sigma, Type III) was further purified by ion-exchange chromatography (Mono S, 0.1 M sodium phosphate pH 6.0, eluted with a 0-1 M sodium chloride gradient). A complex of the purified trypsin with compound 7 was prepared by incubating a 3-fold molar excess of the inhibitor with the enzyme, which was then concentrated to 15 mg/mL in 0.05 M Tris pH 8, 3 mM calcium chloride, 18% acetonitrile, and 5% DMF. Crystals were grown by vapor diffusion against a well containing 2.1 M ammonium sulfate and 0.05 M Tris pH 8.15.
Nucleation of crystal growth required streak seeding using low-density crystals grown in the presence of benzamidine according to the procedure of Bartunik (1989).30 Crystals of the bovine trypsin-compound 7 complex belonged to space group P2₁2₁2₁, with a = 60.08 Å, b = 63.83 Å, and c = 70.04 Å, and diffracted to beyond 2.0 Å. A complete native data set, comprising 17 272 unique reflections in the range 30-2.0 Å and with an average redundancy of 4.0, was collected at 100 K using station PX7.2 of the Daresbury SRS synchrotron (wavelength 1.488 Å). These data were processed using DENZO and SCALEPACK,31 and the structure solved by molecular replacement using AMORE32 with the coordinates from PDB entry 3PTN as search model. The structure was refined using iterative cycles of simulated annealing refinement with X-PLOR33 and manual rebuilding using O.34 The final model has good geometry and an Rcryst of 17.8% and Rfree of 24.0% (calculated with data in the range 15-2.0 Å). The coordinates have been deposited in the PDB (reference code 1eb2).
(13) Brandstetter, H.; Kühne, A.; Bode, W.; Huber, R.; von der Saal, W.; Wirtensohn, K.; Engh, R. A. X-ray Structure of Active Site-inhibited Clotting Factor Xa. J. Biol. Chem. 1996, 271 (47), 29988-29992.
(14) Murray, C. W.; Clark, D. E.; Auton, T. R.; Firth, M. A.; Li, J.; Sykes, R. A.; Waszkowycz, B.; Westhead, D. R.; Young, S. C. PRO_SELECT: Combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology. J. Comput.-Aided Mol. Des. 1997, 11, 193.
(15) Klebe, G. J. The Use of Composite Crystal-field Environments in Molecular recognition and the de Novo Design of Protein Ligands. J. Mol. Biol. 1994, 237, 212.
(16) Hahn, M. Receptor Surface Models. 1. Definition and Construction. J. Med. Chem. 1995, 38, 2080.
(17) Baxter, C. A.; Murray, C. W.; Clark, D. E.; Westhead, D. R.; Eldridge, M. D.
Flexible Docking using Tabu Search and an Empirical Estimate of Binding Affinity. Proteins: Struct., Funct., Genet. 1998, 33, 367.
(18) Böhm, H.-J. The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure. J. Comput.-Aided Mol. Des. 1994, 8, 243.
(19) Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee, R. P. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J. Comput.-Aided Mol. Des. 1997, 11, 425-445.
(20) Murray, C. W.; Auton, T. R.; Eldridge, M. D. Empirical Scoring Functions. II. The testing of an empirical scoring function for the prediction of ligand-receptor binding affinities and the use of Bayesian Regression to improve the quality of the model. J. Comput.-Aided Mol. Des. 1998, 12, 503.
(21) Li, J.; Murray, C. W.; Waszkowycz, B.; Young, S. C. Targeted Molecular Diversity in Drug Discovery - Integration of Structure-Based design and Combinatorial Chemistry. Drug Discovery Today 1998, 3 (3), 105-112.
(22) Kick, E. K.; Roe, D. C.; Skillman, A. G.; Liu, G.; Ewing, T. J. A.; Sun, Y.; Kuntz, I. D.; Ellman, J. A. Structure-based design and combinatorial chemistry yield low nanomolar inhibitors of cathepsin D. Chem. Biol. 1997, 4, 297-307.
(23) Böhm, H.-J.; Banner, D. W.; Weber, L. Combinatorial docking and combinatorial chemistry: Design of potent non-peptide thrombin inhibitors. J. Comput.-Aided Mol. Des. 1999, 13, 51-56.
(24) Wei, A.; Alexander, R. S.; Duke, J.; Ross, H.; Rosenfeld, S. A.; Chang, C.-H. Unexpected Binding Mode of Tick Anticoagulant Peptide Complexed to Bovine Factor Xa. J. Mol. Biol. 1998, 283, 147-154.
(25) The strain energies quoted here are generally much higher than the associated calculated binding energy.
This is because they are calculated by different methods, the strain energy arising out of estimates, derived using the ‘Clean’ force field. Therefore they cannot be compared and are used independently from one another in the process of ranking the substituents. (26) Copyright 1995, BIOSYM/Molecular Simulations, San Diego. (27) Copyright 1990-1994, MDL Information Systems, Inc. San Leandro, CA. All Rights Reserved. (28) MDL Information Systems, Inc. San Leandro, CA. All Rights Reserved. (29) Tapparelli, C.; Metternich, R.; Ehrardt, C.; Zurini, M.; Claeson, G.; Scully, M. F.; Stone, S. R. In Vitro and In Vivo Characterization of a Neutral Boron-containing Thrombin Inhibitor. J. Biol. Chem. 1993, 268, 4734-4741. (30) Bartunik, H. D.; Summers, L. J.; Bartsch, H. H. Crystal structure of bovine b-trypsin at 1.5 Å resolution in a crystal form with low molecular packing density. J. Mol. Biol. 1989, 210, 813828. (31) Otwinoski, Z.; Minor, W. Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 1996, 276, 307326. (32) Collaborative Computational Project, Number 4. The CCP4 Suite: Programs for Protein Crystallography. Acta Crystallogr. 1994, D50, 760-763. (33) Brunger, A. T. 1992 X-PLOR Manual Version 3.1. (34) Jones, T. A.; Zou, J.-Y.; Cowan, S. W.; Kjeldgaard, M. Improved methods for building protein structures in electron-density maps and the location of errors in these models. Acta Crystallogr. 1991, A47, 110-119. Acknowledgment. The authors thank Allen Miller for encouragement and for helpful suggestions during the preparation of this manuscript. References (1) (a) Wiley, M. R.; Fisher, M. J. Small-molecule direct thrombin inhibitors. Expert Opin. Ther. Pat. 1997, 7 (11), 1265-1282. (b) Menear, K. Progress towards the discovery of orally active thrombin inhibitors. Curr. Med. Chem. 1998, 5, 457-468. (c) Rewinkel, J. B. M.; Adang, A. E. P. Strategies and progress towards the ideal orally active thrombin inhibitor. Curr. Pharm. Des. 
1999, 5, 1043-1075. (d) Zhu, B.-Y.; Scarborough, R. M. Recent advances in inhibitors of factor Xa in the prothrombinase complex. Curr. Opin. Cardiovasc., Pulm. Renal Invest. Drugs 1999, 1 (1), 63-88. (e) Walenga, J. M.; Jeske, W. P.; Hoppensteadt, D.; Kaiser, B. Factor Xa inhibitors: Today and beyond. Curr. Opin. Cardiovas., Pulm. Renal Invest. Drugs 1999, 1 (1), 13-27. (f) Al-Obeidi, F.; Ostrem, J. A. Factor Xa inhibitors. Expert Opin. Ther. Pat. 1999, 9 (7), 931-953. (2) Chi, L.; Rogers, K. L.; Uprichard, A. C. G.; Gallagher, K. P. The therapeutic potential of novel anticoagulants. Expert Opin. Invest. Drugs 1997, 6 (11), 1591-1622. (3) Morishima, Y.; Tanabe, K.; Terada, Y.; Hara, T.; Kunitada, S. Antithrombotic and Hemorrhagic Effects of DX-9065a, a Direct and Selective Factor Xa: Comparison of a Direct Thrombin Inhibitor and Antithrombin III-Dependent Anticoagulants. Thromb. Haemostasis 1997, 78, 1366-1371. (4) Gold, H. K.; Torres, F. W.; Garabedian, H. D.; Werner, W.; Jang, I.; Khan, A.; Hagstrom, J. N.; Yasuda, T.; Leinbach, R. C.; Newell, J. B.; Bovill, E. G.; Stump, D. C.; Collen, D. Evidence for a Rebound Coagulation Phenomenon after Cessation of a 4-hour Infusion of a Specific Thrombin Inhibitor in Patients with Unstable Angina Pectoris. J. Am. Coll. Cardiol. 1993, 21, 10391047. (5) Galemmo, R. A.; Maduskuie, T. P.; Dominguez, C.; Rossi, K. A.; Knabb, R. M.; Wexler, R. R.; Stouten, P. F. W. The de novo design and synthesis of cyclic urea inhibitors of Factor Xa: Initial SAR studies. Bioorg. Med. Chem. Lett. 1998, 8, 2705-2710. (6) Klein, S. I.; Czekaj, M.; Gardner, C. J.; Guertin, K. R.; Cheney, D. L.; Spada, A. P.; Bolton, S. A.; Brown, K.; Colussi, D.; Heran, C. L.; Morgan, S. R.; Leadley, R. J.; Dunwiddie, C. T.; Perrone, M. H.; Chu, V. Identification and Initial Structure-Activity Relationships of a Novel Class of Nonpeptide Inhibitors of Blood Coagulation. J. Med. Chem. 1998, 41, 437-450. (7) Dominguez, C.; Duffy, D. E.; Han, Q.; Alexander, R. 
S.; Galemmo, R. A.; Park, J. M.; Wong, P. C.; Amparo, E. C.; Knabb, R. M.; Luettgen, J.; Wexler, R. R. Design and Synthesis of Potent and Selective 5,6-fused Heterocyclic Thrombin Inhibitors. Bioorg. Med. Chem. Lett. 1999, 9, 925-930. (8) Padmanabhan, K. P.; Tulinsky, A.; Park, C. H.; Bode, W.; Huber, R.; Blankenship, D. T.; Cardin, A. D.; Kiesel, W. Structure of Human Des(1-45) Factor Xa at 2.2 Å Resolution. J. Mol. Biol. 1993, 232, 947-966. (9) Hara, T.; Yokoyama, A.; Ishihara, H.; Yokoyama, Y.; Nagahara, T.; Iwamoto, M. DX-9065a, a New Synthetic, Potent Anticoagulant and Selective Inhibitor for Factor Xa. Thromb. Haemostasis 1994, 71 (3), 314-319. (10) Hirayama, F.; Koshio, H.; Taniuchi, Y.; Sato, K.; Hisamichi, N.; Sakai, Y.; Katayama, N.; Kawasaki, T.; Matsumoto, Y.; Yanagisawa, I. Abstracts of Papers, 214th National Meeting of the American Chemical Society, Las Vegas, NV, 1997; American Chemical Society: Washington, DC, 1997; MEDI049. (11) Yamazaki, M.; Asakura, H.; Aoshima, K.; Saito, M.; Jokaji, H.; Uotani, C.; Kumbashiri, I.; Morishita, E.; Ikeda, T.; Matsuda, T. Protective Effects of DX-9065a, an Orally Active Novel Synthesized and Selective Inhibitor of Factor Xa, Against Thromboplastin-Induced Experimental Disseminated Intravascular Coagulation in Rats. Sem. Thromb. Hemostasis 1996, 22 (3), 255-259. (12) Sato, K.; Taniuchi, Y.; Hirayama, T.; Koshio, H.; Matsumoto, Y.; Iizumi, Y. Comparison of the Anticoagulant and Antithrombotic Effects of YM-75466, a novel orally-Active Factor Xa Inhibitor and warfarin in Mice. Jpn. J. Pharmacol. 1998, 78, 191-197. JM010944E ceptor–ligand steric complementarity. As this median value was found to be correlated with ligand size, the value is normalized by ligand surface area with respect to the set of receptor–ligand complexes used for the calibration of the ChemScore scoring function. 
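As an illustration only, the size normalization described here can be sketched as a regression residual. The text does not give the functional form, so the linear fit against surface area, the names `fit_line` and `steric_penalty`, and every number in the reference set below are assumptions. The sketch also assumes the raw median is a distance-like quantity, so that smaller values correspond to tighter packing.

```python
from statistics import mean


def fit_line(areas, medians):
    """Least-squares line median ~ slope * area + intercept over a reference set."""
    a_bar, m_bar = mean(areas), mean(medians)
    sxx = sum((a - a_bar) ** 2 for a in areas)
    sxy = sum((a - a_bar) * (m - m_bar) for a, m in zip(areas, medians))
    slope = sxy / sxx
    return slope, m_bar - slope * a_bar


def steric_penalty(median, area, slope, intercept):
    """Observed median minus the value expected for a ligand of this surface area."""
    return median - (slope * area + intercept)


# Hypothetical reference set: (ligand surface area, raw median value).
REFERENCE = [(200.0, 1.1), (300.0, 1.4), (400.0, 2.1), (500.0, 2.4)]
SLOPE, INTERCEPT = fit_line([a for a, _ in REFERENCE], [m for _, m in REFERENCE])

if __name__ == "__main__":
    # Two made-up ligands of equal size, one packed tighter than the other.
    for area, median in [(300.0, 1.2), (300.0, 2.0)]:
        print(area, median, round(steric_penalty(median, area, SLOPE, INTERCEPT), 3))
```

Because the fit includes an intercept, the residuals over the reference set sum to zero, which is one way of anchoring the score to the reference-set average.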
The normalized value, StericPenalty, has a value of zero for ligands as tightly bound as the average of the reference set, has a negative value for ligands more tightly bound (e.g., clashing), and a positive value for ligands less tightly bound.
31. J. D. Obourn, N. J. Koszewski, and A. C. Notides, “Hormone- and DNA-Binding Mechanisms of the Recombinant Human Estrogen Receptor,” Biochemistry 32, 6229–6236 (1993).
Accepted for publication January 22, 2001.
Bohdan Waszkowycz Protherics Molecular Design Ltd., Beechfield House, Lyme Green Business Park, Macclesfield, Cheshire SK11 0JL, United Kingdom (electronic mail: Bohdan.Waszkowycz@protherics.com). Dr. Waszkowycz received a B.Sc. degree in pharmacy and a Ph.D. degree in computational chemistry at the University of Manchester, then joined Proteus Molecular Design Ltd. (now Protherics) in 1990. He served as a molecular modeler on a number of structure-based drug design projects, most recently on the design of tryptase inhibitors, before leading the computational team in the development of the in-house software suite, Prometheus. The recent focus of the group has been the validation of software for high-throughput virtual screening, and Dr. Waszkowycz is currently responsible for establishing collaborative projects on virtual screening with a number of pharmaceutical and biotechnology companies.
Tim D. J. Perkins Protherics Molecular Design Ltd., Beechfield House, Lyme Green Business Park, Macclesfield, Cheshire SK11 0JL, United Kingdom. Dr. Perkins graduated from Cambridge with a B.A. degree in natural sciences and received a Ph.D. degree in medicinal chemistry from the University of London. He joined the Drug Design Group in the Pharmacology Department at Cambridge with Dr. Philip Dean, where he researched computational methods for conformational analysis and molecular superposition. This group formed the basis of the TeknoMed drug design collaboration with Rhône-Poulenc Rorer. Since joining Protherics in 1998, Dr. Perkins has worked on software development within Prometheus, particularly in implementing tools for facilitating high-throughput virtual screening, including novel methods for analysis of receptor–ligand complementarity.
Richard A. Sykes Protherics Molecular Design Ltd., Beechfield House, Lyme Green Business Park, Macclesfield, Cheshire SK11 0JL, United Kingdom. Mr. Sykes received a B.Sc. degree in logic with mathematics from the University of Sussex and worked as a programmer for a number of years before returning to academia to research the theory and application of functional programming languages at the University of London. Since joining Protherics in 1991, he has been the senior programmer involved in the development of Prometheus. He has a particular interest in the development of the scripting language Global and the design and implementation of graphical user interfaces to support the requirements of the group’s structure-based design and virtual screening efforts.
Jin Li Protherics Molecular Design Ltd., Beechfield House, Lyme Green Business Park, Macclesfield, Cheshire SK11 0JL, United Kingdom. Dr. Li is Head of Computational Chemistry at Protherics. He received a B.Sc. degree from Sichuan University, China, and a Ph.D. degree in macromolecular sciences from Aston University, followed by postdoctoral research in theoretical biochemistry at the University of Manchester with Dr. Barry Robson. He joined Protherics in 1990 to undertake research into computer simulation of protein folding and protein structure prediction. Since 1994 he has led the computational chemistry team in developing methods and software for molecular design, particularly in the areas of de novo design, molecular docking, and combinatorial library design. More recently Dr. Li was responsible for initiating and directing the DockCrunch project, involving a million-compound virtual screen versus the estrogen receptor.
376 WASZKOWYCZ ET AL. IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001