Document 6424736

Form Approved Through 05/2004
OMB No. 0925-0001
LEAVE BLANK—FOR PHS USE ONLY.
Type
Activity
Number
Review Group
Formerly
Department of Health and Human Services
Public Health Services
Grant Application
Do not exceed 56-character length restrictions, including spaces.
1. TITLE OF PROJECT
Council/Board (Month, Year)
Date Received
Industrialized informatics for drug discovery
2. RESPONSE TO SPECIFIC REQUEST FOR APPLICATIONS OR PROGRAM ANNOUNCEMENT OR SOLICITATION
NO   ✘ YES (If “Yes,” state number and title)
Number: PAR-03-106
Title: Innovations in biomedical computational science and technology
✘
3. PRINCIPAL INVESTIGATOR/PROGRAM DIRECTOR
New Investigator
3a. NAME (Last, first, middle)
3b. DEGREE(S)
Ling, Bruce, Xuefeng
BS, MA, Ph.D.
3c. POSITION TITLE
3d. MAILING ADDRESS (Street, city, state, zip code)
Director, Research Informatics
1120 Veterans Blvd,
South San Francisco,
CA 94080
3e. DEPARTMENT, SERVICE, LABORATORY, OR EQUIVALENT
Bioinformatics
No
Yes
3f. MAJOR SUBDIVISION
Research Informatics
3g. TELEPHONE AND FAX (Area code, number and extension)
TEL:
650-825-7143
4. HUMAN SUBJECTS
RESEARCH
✘
No
FAX:
4a.
Research Exempt
[email protected]
509-271-7814
✘
No
Yes
5. VERTEBRATE ANIMALS
✘
No
Yes
If “Yes,” Exemption No.
4b. Human Subjects
Assurance No.
Yes
E-MAIL ADDRESS:
4c. NIH-defined Phase III
Clinical Trial
✘
No
5a.
If “Yes,” IACUC approval Date
5b. Animal welfare assurance no
Yes
6. DATES OF PROPOSED PERIOD OF
SUPPORT (month, day, year—MM/DD/YY)
From: 10/01/03   Through: 10/01/08
7. COSTS REQUESTED FOR INITIAL BUDGET PERIOD
7a. Direct Costs ($): $450,000   7b. Total Costs ($): $684,000
8. COSTS REQUESTED FOR PROPOSED PERIOD OF SUPPORT
8a. Direct Costs ($): $2,406,772   8b. Total Costs ($): $3,658,283
9. APPLICANT ORGANIZATION
Name: Tularik Inc.
Address: 1120 Veterans Blvd, CA 94080
10. TYPE OF ORGANIZATION
Public: →
Private: →
For-profit: →
Federal
State
Local
Private Nonprofit
General
✘
Small Business
Woman-owned
Socially and Economically Disadvantaged
11. ENTITY IDENTIFICATION NUMBER
94-31-48800
DUNS NO. (if available) 795475946
Institutional Profile File Number (if known)
Congressional District
12. ADMINISTRATIVE OFFICIAL TO BE NOTIFIED IF AWARD IS MADE
13. OFFICIAL SIGNING FOR APPLICANT ORGANIZATION
Name
Title
Address
Tel
Bruce Ling, Ph.D.
FAX
Tel
509-271-7814
[email protected]
650-825-7181
E-Mail
14. PRINCIPAL INVESTIGATOR/PROGRAM DIRECTOR ASSURANCE: I certify that the
statements herein are true, complete and accurate to the best of my knowledge. I am
aware that any false, fictitious, or fraudulent statements or claims may subject me to
criminal, civil, or administrative penalties. I agree to accept responsibility for the scientific
conduct of the project and to provide the required progress reports if a grant is awarded as
a result of this application.
15. APPLICANT ORGANIZATION CERTIFICATION AND ACCEPTANCE: I certify that the
statements herein are true, complete and accurate to the best of my knowledge, and
accept the obligation to comply with Public Health Services terms and conditions if a grant
is awarded as a result of this application. I am aware that any false, fictitious, or fraudulent
statements or claims may subject me to criminal, civil, or administrative penalties.
PHS 398 (Rev. 05/01)
Louisa M. Daniels
Corporate Counsel
Veterans Blvd,
CA 94080
Title Senior
Address 1120
1120 Veterans Blvd,
CA 94080
650-825-7143
E-Mail
Name
12th
FAX
650-825-7664
[email protected]
SIGNATURE OF PI/PD NAMED IN 3a.
(In ink. “Per” signature not acceptable.)
DATE
SIGNATURE OF OFFICIAL NAMED IN 13.
(In ink. “Per” signature not acceptable.)
DATE
Face Page
Form Page 1
Principal Investigator/Program Director (Last, first, middle):
Ling, Bruce, Xuefeng, Ph.D.
DESCRIPTION: State the application’s broad, long-term objectives and specific aims, making reference to the health relatedness of the project. Describe
concisely the research design and methods for achieving these goals. Avoid summaries of past accomplishments and the use of the first person. This abstract
is meant to serve as a succinct and accurate description of the proposed work when separated from the application. If the application is funded, this
description, as is, will become public information. Therefore, do not include proprietary/confidential information. DO NOT EXCEED THE SPACE
PROVIDED.
The global pharmaceutical industry currently faces unprecedented pressure to increase its productivity in delivering new
chemical entities. The long-term goal of this proposal is to develop a robust, high throughput informatics platform to
accelerate industrialized drug discovery. The specific aims for the Tularik Discovery Informatics Platform are: (1)
Architect scalable and robust high throughput enterprise computing infrastructures. The Java 2 Platform,
Enterprise Edition (J2EE), Microsoft .NET and high throughput/performance computing (HTC/HPC) technologies will
be made interoperable to build a state-of-the-art Discovery Informatics platform. (2) Integrate various standalone robotic
applications into networked automated Discovery pipelines. Thanks to technological innovations, robotics and
automation are now essential in various stages of the drug discovery process. The Discovery Informatics
platform will integrate robotic vendors' proprietary software through .NET Web Services to automate inter-robotic data
management and mechanical operations. (3) Systemize the high throughput discovery workflows to reveal
“knowledge” from the raw data. The Tularik Discovery Informatics platform has automated data analysis and data
management in the areas of array-based comparative genomic hybridization, high throughput screening (HTS),
structure activity relationship (SAR) and ADMET. Additional machine learning algorithms and visualization modules will
be developed to automatically extract knowledge, e.g. novel compound structural motifs, from large-scale bioassay
databases. The Discovery Informatics platform will integrate computational chemistry approaches for parallel drug lead
optimization of potency, selectivity, and ADMET properties. (4) Integrate in silico drug lead seeking, explosion
and optimization processes into the high throughput Discovery platform. Ligand- or receptor-based
virtual screening algorithms will be integrated into the Discovery platform to increase the throughput of drug lead seeking,
explosion and optimization. Algorithms, including proper compound filters, will be developed to create a 1 billion-member
virtual screening library. (5) Standardize the informatics data flow and implement interoperable service-oriented
computing architecture. Tularik will work with I3C (Interoperable Informatics Infrastructure Consortium) to adopt and
enact proper standards for data flow in the areas of genomics, biological pathways, compound acquisition,
compound inventory, lead discovery and optimization. The current Tularik Discovery platform hosts various XML-based
J2EE and .NET distributed applications, providing a solid foundation to extend to a service-oriented computing
architecture. (6) Establish industrialized software configuration management (SCM) mechanisms for
application build and deployment. The Discovery platform has evolved to ensure code portability, robust builds and easy
deployment to Tularik’s worldwide campuses and relevant research communities. The Discovery platform will continue to
improve through the use of open source standards and applications. These developments will make the Tularik
Discovery Informatics platform generalizable, scalable, extensible and interoperable for the entire biomedical research
community.
PERFORMANCE SITE(S) (organization, city, state)
Tularik Inc., South San Francisco, California
KEY PERSONNEL. See instructions. Use continuation pages as needed to provide the required information in the format shown below.
Start with Principal Investigator. List all other key personnel in alphabetical order, last name first.
Name                              Organization                                                          Role on Project
Ling, Bruce, Ph.D.                Tularik Inc.                                                          Principal Investigator
King, Brian                       LifeCode Inc. & Interoperable Informatics Infrastructure Consortium   Consultant, standardization and interoperability
Hoey, Tim, Ph.D.                  Tularik Inc.                                                          Director, directs biology efforts
Jaen, Juan, C., Ph.D.             Tularik Inc.                                                          VP, directs chemistry efforts
Shuttleworth, Stephen J., Ph.D.   Tularik Inc.                                                          Director, directs combinatorial chemistry efforts
Young, Stephen, W., Ph.D.         Tularik Inc.                                                          Director, directs lead discovery efforts
Waszkowycz, Bohdan, Ph.D.         Tularik Ltd. (UK)                                                     Director, directs virtual screening
Connor, Richard, Ph.D.            Tularik Inc.                                                          Scientist, combi-chem
PHS 398 (Rev. 05/01)
Page _2a___
Form Page 2
Name                    Organization               Role on Project
Cardozo, Mario, Ph.D.   Tularik Inc.               Scientist, computational chemistry
Young, Steve, Ph.D.     Tularik Ltd. (UK)          Scientist, virtual screening
Cutler, Gene, Ph.D.     Tularik Inc.               Research Investigator, in silico target identification and microarray
Pan, Zheng, Ph.D.       Tularik Inc.               Scientist, informatics operation
Liu, Jane, M.D.         Tularik Inc.               Scientist, chemoinformatics
Lukes, Melissa          Tularik Inc.               DBA
Porter, Richard         Tularik Inc.               Developer
Subramani, Jayanthi     Tularik Inc.               Developer
Ding, Epic              Tularik Inc.               Developer
Charati, Kaveri         Self employed consultant   XML specialist
PHS 398 (Rev. 05/01)
Page _2b___
Form Page 2
The name of the principal investigator/program director must be provided at the top of each printed page and each continuation page.
RESEARCH GRANT
TABLE OF CONTENTS
Page Numbers
Face Page .......................................................................................... 1
Description, Performance Sites, and Personnel ...................................................... 2-a,b
Table of Contents .................................................................................. 3
Detailed Budget for Initial Budget Period (or Modular Budget) ...................................... 4
Budget for Entire Proposed Period of Support (not applicable with Modular Budget) .................. 5-8
Budgets Pertaining to Consortium/Contractual Arrangements (not applicable with Modular Budget)
Biographical Sketch: Principal Investigator/Program Director (Not to exceed four pages) ............ 9-11
Other Biographical Sketches (Not to exceed four pages for each; see instructions) .................. 12-44
Resources .......................................................................................... 45
Research Plan
Introduction to Revised Application (Not to exceed 3 pages) ........................................
Introduction to Supplemental Application (Not to exceed one page) ..................................
A. Specific Aims ................................................................................... 46-47
B. Background and Significance ..................................................................... 47-50
C. Preliminary Studies/Progress Report/Phase I Progress Report (SBIR/STTR Phase II ONLY) ........... 50-56
D. Research Design and Methods ..................................................................... 56-65
(Items A-D: not to exceed 25 pages*)
* SBIR/STTR Phase I: Items A-D limited to 15 pages.
E. Human Subjects ..................................................................................
Protection of Human Subjects (Required if Item 4 on the Face Page is marked “Yes”)
Inclusion of Women (Required if Item 4 on the Face Page is marked “Yes”) ..................................................................
Inclusion of Minorities (Required if Item 4 on the Face Page is marked “Yes”) ................................................................
Inclusion of Children (Required if Item 4 on the Face Page is marked “Yes”) ..................................................................
Data and Safety Monitoring Plan (Required if Item 4 on the Face Page is marked “Yes” and a Phase I, II, or III clinical
trial is proposed) ..................................................................................
F. Vertebrate Animals ...............................................................................
G. Literature Cited .................................................................................
H. Consortium/Contractual Arrangements ..............................................................
I. Letters of Support (e.g., Consultants) ...........................................................
J. Product Development Plan (SBIR/STTR Phase II and Fast-Track ONLY) ................................
65-66
67-68
69
Checklist ........................................................................................................................................
Check if
Appendix is
Included
Appendix (Five collated sets. No page numbering necessary for Appendix.)
Appendices NOT PERMITTED for Phase I SBIR/STTR unless specifically solicited.
Number of publications and manuscripts accepted for publication (not to exceed 10)
10
70
Other items (list):
Appendix summary
PHS 398 (Rev. 05/01)
Page ___3____
Form Page 3
Ling, Bruce, Ph.D.
BUDGET FOR ENTIRE PROPOSED PROJECT PERIOD
DIRECT COSTS ONLY
BUDGET CATEGORY TOTALS                   INITIAL BUDGET PERIOD   ADDITIONAL YEARS OF SUPPORT REQUESTED
                                         (from Form Page 4)      2nd        3rd        4th        5th
PERSONNEL: Salary and fringe benefits.
  Applicant organization only.           $346,400                $360,256   $374,666   $374,666   $389,653
CONSULTANT COSTS                         $86,000                 $89,440    $93,018    $96,739    $100,608
EQUIPMENT                                $0                      $0         $0         $0         $0
SUPPLIES                                 $0                      $0         $0         $0         $0
TRAVEL                                   $17,600                 $18,304    $19,036    $19,797    $20,589
PATIENT CARE COSTS: INPATIENT
PATIENT CARE COSTS: OUTPATIENT
ALTERATIONS AND RENOVATIONS
OTHER EXPENSES
SUBTOTAL DIRECT COSTS                    $450,000                $468,000   $486,720   $491,202   $510,850
CONSORTIUM/CONTRACTUAL COSTS: DIRECT
CONSORTIUM/CONTRACTUAL COSTS: F&A
TOTAL DIRECT COSTS                       $450,000                $468,000   $486,720   $491,202   $510,850
TOTAL DIRECT COSTS FOR ENTIRE PROPOSED PROJECT PERIOD (Item 8a, Face Page): $2,406,772
SBIR/STTR Only: Fee Requested
SBIR/STTR Only: Total Fee Requested for Entire Proposed Project Period: $0
(Add Total Fee amount to “Total direct costs for entire proposed project period” above and Total F&A/indirect costs from
Checklist Form Page, and enter these as “Costs Requested for Proposed Period of Support” on Face Page, Item 8b.)
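The yearly figures in this budget can be cross-checked against the totals it reports. The following is a minimal Python sketch, assuming the three non-zero series correspond to personnel, consultant costs, and travel (an assignment inferred from the budget justification, where the two consultants total $86,000 and travel is $17,600 in the first year). The variable names are illustrative only.

```python
# Yearly direct-cost figures (years 1-5) as transcribed from the budget form.
# The row labels are an assumption inferred from the budget justification.
personnel = [346_400, 360_256, 374_666, 374_666, 389_653]
consultants = [86_000, 89_440, 93_018, 96_739, 100_608]
travel = [17_600, 18_304, 19_036, 19_797, 20_589]

# Equipment and supplies are listed as $0 in every year, so they are omitted.
yearly_totals = [p + c + t for p, c, t in zip(personnel, consultants, travel)]
project_total = sum(yearly_totals)

print(yearly_totals)   # [450000, 468000, 486720, 491202, 510850]
print(project_total)   # 2406772
```

Under that assumption, the per-year sums reproduce the subtotal direct costs for each year, and the five-year sum reproduces the $2,406,772 entered as Item 8a on the Face Page.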
JUSTIFICATION. Follow the budget justification instructions exactly. Use continuation pages as needed.
PHS 398 (Rev. 05/01)
Page __5_____
Form Page 5
CONSULTANT:
King, Brian, President, LifeCode Inc., I3C (Interoperable Informatics Infrastructure Consortium) committee section leader.
Mr. King has agreed to serve as a consultant on this project. He will advise, at a rate of $110 per hour, on the architecture
and process of informatics standardization and interoperability. Funding of $20,000 is requested for his consulting work
in the first year, with a 4% annual increase thereafter.
Charati, Kaveri, XML specialist. Ms. Charati has agreed to serve as a consultant on the project. She will be responsible for
the XML-based data transactions and XSLT data transformations. Funding of $66,000 is requested for her consulting work
in the first year, with a 4% annual increase thereafter.
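The 4% annual increase on the combined first-year consultant funding ($20,000 + $66,000) can be worked through as follows. This is a minimal sketch, assuming simple compounding from the first-year amount and rounding to the nearest dollar; the helper name `escalate` is hypothetical, and the rounding convention actually used in the application is not stated.

```python
from fractions import Fraction


def escalate(first_year, rate, years):
    """Return exact (unrounded) yearly amounts growing by `rate` per year."""
    return [Fraction(first_year) * (1 + rate) ** k for k in range(years)]


RATE = Fraction(4, 100)              # 4% annual increase
FIRST_YEAR = 20_000 + 66_000         # King + Charati, first-year funding

exact = escalate(FIRST_YEAR, RATE, 5)
rounded = [round(amount) for amount in exact]

print(rounded)  # [86000, 89440, 93018, 96738, 100608]
```

The rounded series agrees with the $86,000, $89,440, $93,018, $96,739 and $100,608 figures listed in the budget to within one dollar; the one-dollar difference in the fourth year presumably reflects a different rounding step in the original calculation.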
PERSONNEL:
Bruce Xuefeng Ling, Ph.D. will supervise all studies and manage the implementation progress on a weekly basis. He will
directly coordinate project team staff in different disciplinary areas and interact with them on a daily basis if necessary.
Juan Jaen, Ph.D. will oversee the entire chemistry efforts. Salary is not requested.
Tim Hoey, Ph.D. will coordinate target identification and high throughput assay development to ensure the data integrity and
data flow. Salary is not requested.
Stephen Shuttleworth, Ph.D. will coordinate combinatorial chemistry data flow and will be actively involved in the design of
the informatics architecture to integrate high throughput combi-chem into the lead discovery informatics platform. Salary is
not requested.
Stephen W. Young, Ph.D. will coordinate and be involved in the informatics area of lead discovery high throughput screening
and compound inventory. Salary is not requested.
Steve Young, Ph.D. will be part of the team to design and integrate the in silico lead identification (docking) and optimization
into the Discovery informatics platform. Salary is not requested.
Bohdan Waszkowycz, Ph.D. will supervise and coordinate the high throughput computational chemistry efforts. Salary is not
requested.
Richard Connor, Ph.D. will be part of the team to integrate the combi-chem robotics Tularik proprietary driver and master
control program into the Discovery informatics platform through the .NET technologies. Salary is not requested.
Mario Cardozo, Ph.D. will be responsible for the enabling of algorithms for the high throughput compound property
calculation and integration into the Discovery informatics platform. Salary is not requested.
Gene Cutler, Ph.D. will be responsible for the in silico target identification and high throughput microarray data management.
Zheng Pan, Ph.D. will be responsible for the Discovery site configuration and ISIS integration for compound handling.
Jane Liu, M.D. will be responsible for the high throughput chemoinformatics application integration into the Discovery
informatics platform.
Melissa Lukes will be responsible for Oracle database architecture, setup, maintenance, and data integrity.
Rick Porter will be responsible for the robotics informatics implementation, which enables the interfacing and integration of
campus-wide robotic machines into the Discovery platform through vendor drivers and .NET framework applications.
PHS 398 (Rev. 05/01)
Page ___ 6____
Form Page 5
Jayanthi Subramani will be responsible for compound inventory data management and Discovery platform web application
architecture.
Epic Ding will be responsible for third party software integration and Discovery platform database modeling.
Walter Pan will be responsible for HTS, SAR data flow support and .NET architecture implementation.
SUPPLIES:
Tularik will cover the necessary cost for various supplies and software licenses.
TRAVEL:
Funding of $17,600 is requested for the project members to attend the following conferences. The remaining balance of
the registration fees, lodging and airfare expenses will be covered by Tularik Inc.
Conference name                                              Date     Location                     Registration fee
JAVA ONE                                                     Jun-04   San Francisco                $2,500
Intelligent Drug Discovery & Development                     May-04   Philadelphia, Pennsylvania   $2,000
Information Systems and Technology for Life Sciences         Feb-04   London, UK                   $2,000
ICSB2003: 4th International Conference on Systems Biology    Nov-03   St. Louis, MO                $1,000
PHS 398 (Rev. 05/01)
Page ___7____
Form Page 5
BUDGET JUSTIFICATION PAGE
MODULAR RESEARCH GRANT APPLICATION
Initial Budget Period                                    $ 450,000
Second Year of Support                                   $ 468,000
Third Year of Support                                    $ 486,720
Fourth Year of Support                                   $ 491,202
Fifth Year of Support                                    $ 510,850
Total Direct Costs Requested for Entire Project Period   $ 2,406,772
Personnel
Details of the personnel budget justification can be found on Form Page 5.
Name                              Organization               Role on Project
Ling, Bruce, Ph.D.                Tularik Inc.               Principal Investigator
Hoey, Tim, Ph.D.                  Tularik Inc.               Director, directs biology efforts
Shuttleworth, Stephen J., Ph.D.   Tularik Inc.               Director, directs high throughput combinatorial chemistry efforts
Young, Stephen, W., Ph.D.         Tularik Inc.               Director, directs high throughput lead discovery efforts
Waszkowycz, Bohdan, Ph.D.         Tularik Ltd. (UK)          Director, directs virtual screening
Jaen, Juan, C., Ph.D.             Tularik Inc.               VP Chemistry, directs chemistry efforts
Connor, Richard, Ph.D.            Tularik Inc.               Scientist, combi-chem
Cardozo, Mario, Ph.D.             Tularik Inc.               Scientist, computational chemistry
Young, Steve, Ph.D.               Tularik Ltd. (UK)          Scientist, virtual screening
Cutler, Gene, Ph.D.               Tularik Inc.               Research Investigator, in silico target identification and microarray
Pan, Zheng, Ph.D.                 Tularik Inc.               Scientist, informatics operation
Liu, Jane, M.D.                   Tularik Inc.               Scientist, chemoinformatics
Lukes, Melissa                    Tularik Inc.               DBA
Porter, Richard                   Tularik Inc.               Developer
Pan, Walter                       Tularik Inc.               Developer
Subramani, Jayanthi               Tularik Inc.               Developer
Ding, Epic                        Tularik Inc.               Developer
King, Brian                       LifeCode Inc. & I3C        Consultant, standardization and interoperability
Charati, Kaveri                   Self employed consultant   XML specialist
Consortium
Fee (SBIR/STTR Only)
PHS 398 (Rev. 05/01)
Page ___8____
Modular Budget Format Page
BIOGRAPHICAL SKETCH
Provide the following information for the key personnel in the order listed for Form Page 2.
Follow the sample format for each person. DO NOT EXCEED FOUR PAGES.
NAME
POSITION TITLE
EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION   DEGREE (if applicable)   YEAR(s)     FIELD OF STUDY
Fudan University           B.S.                     1990        Biochemistry
UCLA                       M.A.                     1994        Molecular Biology
UCLA                       Ph.D.                    1996        Biological Chemistry
Stanford Medical Center    Postdoc.                 1996-1998   Molecular Immunology
A. Positions and Honors. List in chronological order previous positions, concluding with your present position. List
any honors. Include present membership on any Federal Government public advisory committee.
Positions
2003          Director, Research Informatics, Tularik Inc.
2001 - 2002   Director, Bioinformatics, Tularik Inc.
2000 - 2001   Associate Director, R&D, DoubleTwist, Inc.
1999 - 2000   Project manager, Research Dept., Pangea System, Inc.
1998 - 1999   Computation/Bioinformatics Scientist, Incyte Pharmaceuticals, Inc.
1997          Member, Medical Advisor Board, National Kidney Foundation of Northern California
1996 - 1998   Fellow, Stanford Molecular Immunology Laboratory, SUMC, CA
Honors
1997 - 1998   Walter Berry Medical Research Award
1997          National Kidney Foundation Research Award
1996          Dean's Fellowship, Stanford University
1992 - 1993   University Fellowship, UCLA, CA
1991 - 1992   University Fellowship, University of Iowa, IA
1990 - 1991   University Fellowship, Fudan University, China
1990          Summa cum laude, Fudan University, China
1986 - 1990   University Fellowship, Fudan University, China
B. Selected peer-reviewed publications (in chronological order). Do not include publications submitted or in
preparation.
• Li S, Cutler G, Liu J, Hoey T, Chen L, Schultz PG, Liao J, Ling XB (corresponding author). 2003. A
  Comparative Analysis of HGSC and Celera Human Genome Assemblies and Gene Sets.
  Bioinformatics, in press.
• Li S, Liao J, Cutler G, Hoey T, Hogenesch JB, Cooke MP, Schultz PG, Ling XB (corresponding
  author). 2002. Comparative Analysis of Human Genome Assemblies Reveals Genome-Level
  Differences. Genomics 80(2):138.
• Pei L, Peng Y, Yang Y, Ling XB, van Eyndhoven WG, Nguyen K, Rubin M, Hoey T, Powers S, Li J.
  2002. PRC17, a novel oncogene encoding a Rab GTPase-activating protein, is amplified in prostate
  cancer. Cancer Res. 62(19):5420-4.
• Jiang Y, Chen D, Lyu S-C, Ling X, Krensky AM, Clayberger C. 2002. DQ 65-79, a Peptide Derived
  from HLA Class II, Induces IkappaB Expression. J. of Immunology 168(7):3323-8.
PHS 398/2590 (Rev. 05/01)
Page ____9___
Biographical Sketch Format Page
• Pouliot Y, Gao J, Su Q, Liu G, Ling XB (corresponding author). 2001. DIAN, a Novel Algorithm for
  Genome Ontological Classification. Genome Res. 11(10):1766-79.
• Ling X, Kamangar S, Boytim ML, Kelman Z, Huie P, Lyu S-C, Sibley RK, Hurwitz J, Clayberger C,
  Krensky A. 2000. Proliferating Cell Nuclear Antigen as the Cell Cycle Sensor for an HLA-Derived
  Peptide Blocking T Cell Proliferation. J. of Immunology 164:6188-92.
• Ling X, Tamaki T, Xiao Y, Kamangar S, Clayberger C, Lewis DB, Krensky AM. 2000. An immunosuppressive
  and anti-inflammatory HLA class I-derived peptide binds vascular cell adhesion molecule-1. Transplantation
  70(4):662-7.
• Lenfant F, Mann RK, Thomsen B, Ling X, Grunstein M. 1996. All four core histone N-termini contain
  sequences required for the repression of basal transcription in yeast. EMBO J. 15:3974-85.
• Ling X, Harkness TA, Schultz MC, Fisher-Adams G, Grunstein M. 1996. Yeast histone H3 and H4
  amino termini are important for nucleosome assembly in vivo and in vitro: redundant and
  position-independent functions in assembly but not in gene regulation. Genes and Development 10:686-99.
• Thompson JS, Ling X, Grunstein M. 1994. Histone H3 amino terminus is required for telomeric and
  silent mating locus repression in yeast. Nature 369:245-7.
C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal
and non-federal support). Begin with the projects that are most relevant to the research proposed in this
application. Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in
the research project. Do not list award amounts or percent effort in projects.
• Role: Research Informatics Director. Responsible for the research informatics area
  (bioinformatics/chemoinformatics/lead discovery) support for genomics-based drug discovery at
  Tularik Inc.
    • IEEE (2002) report: http://siliconvalleycs.org/Ling.htm http://siliconvalleycs.org/LingBio.htm
    • Bioinformatics data mining for drug targets: in silico identification and validation
    • Architect in implementation of micro-array amplicon data analysis and Oligo-chip design software
      for cancer drug target identification
    • Lead Discovery Data Flow: automatic data flow and data management development
    • Set up the robust and scalable enterprise computing platform for genomics and lead discovery data
      flow
    • Architect the Java 2 Enterprise Edition platform for genomics and lead discovery data flow
    • Architect and integrate the .NET Enterprise platform with the J2EE platform computing backbone
      for genomics and lead discovery data flow
    • Architect data modeling for the data repository covering genomics, assay development, HTS, SAR,
      lead optimization, etc.
    • Set up the Linux cluster and IT infrastructure for genomics-based drug discovery
    • Built Linux clusters that placed Tularik on the US TOP-500 cluster list
        • http://clusters.top500.org/db/site.php?mode=listanon
        • http://www.bio-itworld.com/archive/071102/linux.html
• Role: Architect and project manager. Responsible for the DoubleTwist Genomic Mapping Project for
  integration into the DoubleTwist Human Genomic Database
    • Design the Genomic Mapping Project data flow
    • Manage the project implementation and system integration
• Role: Architect and project manager. Responsible for the DoubleTwist Concept Mining Tool (DIAN
  System)
    • Design and architect the DIAN system data flow and algorithm logistics
    • Manage the system implementation
• Role: Architect and developer. Responsible for the design of the protein domain analysis pipeline for
  the DoubleTwist Protein Comprehensive Analysis Agent
• Role: Architect and developer for the http://www.doubletwist.com genomic project
PHS 398/2590 (Rev. 05/01)
Page ___10____
    • Design the genomic project data flow
    • Implementation of the Human Genomic Project BAC fragment ordering tool for genomic BAC
      sequence ordering and assembly (coded in C++ and Perl)
    • Provide leadership in the implementation of the DoubleTwist Human Genomic Database (Prophecy)
PHS 398/2590 (Rev. 05/01)
Page ____11___
BIOGRAPHICAL SKETCH
Follow the sample format on for each person. (See attached sample). DO NOT EXCEED FOUR PAGES.
NAME: Hoey, Timothy C.
POSITION TITLE: Director, Biology Department
INSTITUTION AND LOCATION                DEGREE (if applicable)   YEAR(s)   FIELD OF STUDY
Columbia University, New York, NY       Ph.D.                    1989      Molecular Biology
                                        M. Phil.                 1987      Molecular Biology
                                        M.A.                     1986      Molecular Biology
University of Michigan, Ann Arbor, MI   B.S.                     1980      Biology
A. Positions and Honors.
Positions and Employment
1983-1984   Research Technician, Catholic Medical Center, New York
1983-1984   Research Technician, Columbia University, New York
1984-1989   Graduate Research Fellow/Teaching Assistant, Columbia University, New York
1989-1993   Postdoctoral Fellow, University of California, Berkeley
1993-1999   Scientist, Biology Department, Tularik, Inc., South San Francisco
1999-       Director, Biology Department, Tularik, Inc., South San Francisco
Other Experience and Professional Memberships
1994        Seminar, Department of Molecular Pharmacology, Stanford University
1996        Seminar, Department of Immunology, University of Washington
1996        Seminar, Department of Pathology, Brown University
1996        Seminar, Department of Biology, UC Santa Cruz
1996        Seminar, Roussel Signal Transduction Symposium, Oxford University, UK
1996        Seminar, IBC Transcriptional Regulation Conference, San Diego
1996        Seminar, Department of Pathology, Yale University
1996        Seminar, Samsung International Symposium, Seoul, Korea
1996        Seminar, Shock Society Conference, Indian Wells, CA
1997        Seminar, IBC Transcriptional Regulation Conference, San Diego
1997        Seminar, Department of Molecular Pharmacology, Stanford University
1997        Seminar, Keystone Symposium of Jaks and STATS, Tamarron, CO
1997        Seminar, Swiss Society for Experimental Biology, Lausanne, Switzerland
1998        Seminar, Department of Pathology, Emory University
1998        Seminar, Department of Immunology, Gladstone Institute, UCSF
1999        Seminar, ZMBH Cancer Center, Heidelberg, Germany
2000        Seminar, Department of Molecular Biology, USC
2001        Seminar, Inflammation Society Annual Meeting, San Diego, CA
2001        Seminar, Bay Area Biotechnology Conference, UCSF
2001        Seminar, Department of Immunology, Lerner Institute, Cleveland Clinic
2002        Seminar, Institute of Medicine Cancer Meeting, National Academy of Sciences
2002-       Associate Director, Journal of Immunology
PHS 398/2590 (Rev. 05/01)
Page __12_____
Number pages consecutively at the bottom throughout the application. Do not use suffixes such as 3a, 3b.
Biographical Sketch Format Page
B. Selected peer-reviewed publications (in chronological order).
(Publications selected from 40 peer-reviewed publications.)
1. Hoey, T. and Levine, M. Divergent homeo box proteins recognize similar DNA sequences in Drosophila.
Nature 1988;332:858-861.
2. Levine, M. and Hoey, T. Homeo box proteins as sequence-specific transcription factors. Cell 1988;55:537540.
3. Hoey, T., Dynlacht, B.D., Peterson, M.G., Pugh, B.F., and Tjian, R. Isolation and characterization of the
Drosophila gene encoding the TATA box binding protein, TFIID. Cell 1990;61:1179-1186.
4. Dynlacht, B.D., Hoey, T. and Tjian, R. Isolation of coactivators associated with the TATA-binding protein
that mediate transcriptional activation. Cell 1991;66:563-576.
5. Hoey, T., Weinzierl, R.O.J., Gill, G., Chen, J-L., Dynlacht, B.D. and Tjian, R. Molecular cloning and
functional analysis of Drosophila TAF110 reveal properties expected of coactivators. Cell 1993;72:247-260.
6. Goodrich, J.A. Hoey, T., Thut, C.J., Admon, A. and Tjian, R. Drosophila TAFΠ40 interacts with both a
VP16 activation domain and the basal transcription factor TFIIB. Cell 1993;75:519-530.
7. Rooney, J.W., Hoey, T., and Glimcher, L.H. Coordinate and cooperative roles for NF-AT and AP-1 in the
regulation of the murine IL-4 gene. Immunity 1995;2:461-472.
8. Hoey, T., Sun, Y.L., Williamson, K. and Xu, X. Isolation of two new member of the NFAT gene family and
functional characterization of the NFAT proteins. Immunity 1995;2:473-483.
9. Rooney, J.W., Sun, Y.L., Glimcher, L.H., and Hoey, T. Novel NFAT sites that mediate activation of the
Interleukin-2 promoter in response to T-cell receptor stimulation. Mol. Cell. Biol. 1995;15:6299-6310.
10. Hodge, M.R., Ranger, A.M., de la Brousse, F., Hoey, T., Grusby, M.J., Glimcher, L.H. Hyperproliferation
and dysregulation of Interleukin-4 expression in NFATp deficient mice. Immunity 1996;4:397-405.
11. Kaplan, M.H., Sun, Y.L., Hoey, T., and Grusby, M. Impaired IL-12 responses and enhanced development
of TH2 cells in STAT4-deficient mice. Nature 1996;382:174-177.
12. Xu, X., Sun, Y.L., and Hoey, T. The STAT amino-terminal domain mediates cooperative DNA binding and
confers selective sequence recognition. Science 1996;263:794-797.
13. Hoey, T. A new play in cell death. Science 1997;278:1578-1579.
14. Naeger, L., and Hoey T. Identification of STAT4 binding site in the IL-12 receptor required for signaling. J.
Biol. Chem. 1999;274:1875-1878.
15. Lawless, V.A., Zhang, S., Ozes, O.N., Bruns, H.A., Oldham, I., Hoey, T., Grusby, M.J., and Kaplan, M.H.
STAT4 regulates multiple components of IFN-gamma-inducing signaling pathways. J. Immunol.
2000;165:6803-6808.
16. Li, J., Yan, Y., Austin, R., van Eyndhoven, W., Peng, Y., McCurrach, M.E., Nguyen, K., Appella, E., Lowe,
S.W., Hoey, T., and Powers, S. Oncogenic properties of PPM1D located within a breast cancer amplification
epicenter at 17q23. Nature Genetics. 2002;31:133-134.
17. Pei, L, Peng, Y., van Eyndhoven, W., Ling, X.B., Nguyen, K., Rubin, M., Hoey, T., Powers, S., and Li, J.
(in press, 2002 Cancer Research).
18. Li, S., Liao, J., Cutler, G., Hoey, T., Hogenesch, J., Cooke, M., Schultz, P., and Ling, X. Genomics
2002;80:138.
Patents/Inventorships
TATA-Binding Protein Associated Factors drug screens
TATA-Binding Protein Associated Factors nucleic acids
Human Nuclear Factors and binding assays
Human Signal Transducer and binding assays
U.S. patent number 5,534,410
BIOGRAPHICAL SKETCH
NAME
POSITION TITLE
Shuttleworth, Stephen J.
Scientist
DEGREE (if applicable), YEAR(s), FIELD OF STUDY
University of Liverpool, Liverpool, UK: B.Sc., 1991, Chemistry
University of Liverpool, Liverpool, UK: PhD, 1994, Organic Chemistry
A. Positions and Honors
1994-1997
Senior Research Chemist Chiroscience Ltd., Cambridge, UK
1997
Head of Combinatorial Chemistry, Glycodesign, Inc., Toronto, Ontario, Canada
1999-2000
Research Leader, Combinatorial Chemistry, BioChem Pharma, Inc. Laval, Montreal, Canada
2000- present
Associate Director, Chemistry, Tularik Inc., South San Francisco, CA, USA
1990-present
Member of the Society of Chemical Industry
1990
Awarded GRSC from the Royal Society of Chemistry
1993
Awarded C Chem, MRSC from the Royal Society of Chemistry
1995-present
Member of the European Chemical Society
1996-1997
Member of the UK Automation Society
1997-present
Member of the Chemical Institute of Canada
1997
Awarded MCIC from the Chemical Institute of Canada
1999-present
Member of the American Chemical Society
Honors
1990
Nuffield Foundation Research Scholarship
1991
Fully-funded Ph.D. Studentship awarded from Glaxo, UK
1994
First Prize, SCI National Postgraduate Symposium, Manchester University, UK
B. Peer-reviewed Publications (In Chronological Order).
1. Allcock, S.J., Gilchrist, T. L., Shuttleworth, S.J., King, F.D. Intramolecular and Intermolecular Diels-Alder
Reactions of Acylhydrazones Derived from Methacrolein and Ethylacrolein. Tetrahedron 1991;47:10053-10064.
2. Page, P.C.B., Gareh, M.T., Shuttleworth, S.J. 1,3-Dithiane 1-Oxide: New Applications and its First
Asymmetric Synthesis. IUPAC 18th Symposium on the Chemistry of Natural Products. 1992;262-263
3. Page, P.C.B., Shuttleworth, S.J., Schilling, M.B., Tapolczay, D.J. One-Pot Stereocontrolled
Cycloalkanone Synthesis using 1,3-Dithiane 1-Oxides. Tetrahedron Lett. 1993;34:6947-6950.
4. Page, P.C.B., Shuttleworth, S.J., McKenzie, M.J., Schilling M.B., Tapolczay, D.J. Pummerer and
Related Rearrangements of 2-Acyl-1,3-Dithiane 1-Oxides. Synthesis 1995;73-77
5. Allin, S.M. and Shuttleworth, S.J. Synthesis and Uses of a Resin-Bound “Evans” Auxiliary. Tetrahedron
1996;37:8023-8026.
6. Allin, S.M., Button, M.C. and Shuttleworth, S.J. Aza-Cope Rearrangement in the Asymmetric Alkylation
of Enamines. Synlett. 1997;725-727.
7. Shuttleworth, S.J., Allin, S.M. and Sharma, P.K. Functionalized Polymers: Recent Developments & New
Applications in Synthetic Organic Chemistry. Synthesis 1997;1217-1239.
8. Page, P.C.B., Allin, S.M., Shuttleworth, S.J. Organosulfur Chemistry: Synthetic and Stereochemical
Aspects. Organosulfur Chemistry: Volume 2 ed. P.C.B. Page, Academic Press, UK 1998;97-155.
9. Shuttleworth, S.J. An Overview of Combinatorial Synthesis and its Applications in the Identification of
Matrix Metalloproteinase Inhibitors. Advances in Drug Discovery, ed. Harvey, A., Wiley, UK. 1998;115-141.
10. Montana, J., Baxter, A., Shuttleworth, S.J., Manallack, D., Bird, J., Bhogal, R., Minton, K., Jagpal, S.
Combinatorial Synthesis of Matrix Metalloproteinase Inhibitors. Int. J. Pharm. Med. 1998;9-12.
11. Shuttleworth, S.J., Quimpere, M., Lee, N., DeLuca, J. Parallel Solution Synthesis of Pyridinones,
Pyridinethiones and Thienopyridines. Molecular Diversity 1999;4(3):183-185.
12. Shuttleworth, S.J., Allin, S.M., Wilson, R.D., Nasturica, D. Functionalised Polymers in Organic
Chemistry, Part 2. Synthesis 2000;8:1035-1074.
13. Shuttleworth, S.J., Nasturica, D., Gervais, C., Siddiqui, M.A., Rando, R., Lee, N. Parallel Synthesis of
Isatin-Based Serine Protease Inhibitors. Bioorg. Med. Chem. Lett. 2000;2501-2504.
14. Kearney, P.C., Fernandez, M., Fu, M., Flygare, J., Shuttleworth, S.J., Wahhab, A., Wilson, R., De
Luca, J. Solid Phase Synthesis of 2-Aminothiazoles. Solid-Phase Org. Synth. 2001;1:1-8.
15. Shuttleworth, S.J., (Guest Editor), Development and Applications of Polymer-Supported Reagents and
Ion Exchange Resins in Organic Synthesis and Combinatorial Chemistry. Combinatorial Chemistry & High
Throughput Screening 2002;5(3):197-261.
16. Lizaraburu, M.E., Shuttleworth, S.J. Synthesis of Aryl Ethers from Protected Aminoalcohols Using
Polymer-Supported Triphenylphosphine. Tetrahedron Lett. 2002;43:2157-2159.
17. Kong, L.C.C., Bedard, J., Das, S.K., Ba, N.N., Pereira, O.Z., Shuttleworth, S.J. Compounds and
Methods for the Treatment or Prevention of Flavivirus Infections. PHAR-130-002-USA.
18. Connors, R.V., Zhang, A. J. and Shuttleworth, S.J. Pictet-Spengler Synthesis of Tetrahydro-β-Carbolines using Vinylsulfonylmethyl Resin. Tetrahedron Lett. 2002;43:6661-6663.
BIOGRAPHICAL SKETCH
NAME
POSITION TITLE
Stephen W. Young
Director, Lead Discovery
Tularik Inc
1120 Veterans Blvd
South San Francisco
CA 94080
DEGREE (if applicable), YEAR(s), FIELD OF STUDY
University of Bristol, Bristol, U.K.: BSc, 1987-1990, Biochemistry
University of Bristol, Bristol, U.K.: PhD, 1991-1994, Insulin Receptor Signal Transduction
Open University, U.K.: Diploma, 1999-2001, Business Studies (Executive study – conducted while working at Roche)
1994-1996 Senior Research Biologist, BioMolecular Screening Department, Glaxo R&D, Stevenage, U.K.
1996-1999 Senior Research Biologist, Immunology Unit, GlaxoWellcome R&D, Stevenage, U.K.
1999-2001 Head Of High Throughput Screening, Roche Discovery, Welwyn, U.K.
2001-2003 Director, Lead Discovery, Tularik Inc, South San Francisco, California, USA
1. Issad, T., Young, S.W., Tavaré, J.M. and Denton, R.M. “Effect of Glucagon on Insulin Receptor
Phosphorylation in Intact Cells” FEBS Lett. 296 41-45 1992
2. Young, S.W., Poole, R.C., Hudson, A.T., Halestrap, A.P., Denton, R.M. and Tavaré, J.M. “Effects of Tyrosine
Kinase Inhibitors on Protein Kinase Independent Systems” FEBS Lett. 316, 278-282 1993
3. Young, S.W., Dickens, M. and Tavaré, J.M. “Differentiation of PC12 Cells in Response to a cAMP Analogue is
accompanied by Sustained Activation of Mitogen Activated Protein Kinase. Comparison with the Effects of
Insulin, Growth Factors and Phorbol Esters” FEBS Lett. 338, 212-216 1994
4. Welsh, G.I., Foulstone, E.J., Young, S.W., Tavaré, J.M. and Proud, C.G. “Wortmannin Inhibits the Effects of
Insulin and Serum on the Activities of Glycogen Synthase Kinase-3 and Mitogen Activated Protein Kinase”
Biochem J. 303, 12-20 1994
5. Young, S.W., Dickens, M. and Tavaré, J.M. “Activation of Mitogen Activated Protein Kinase by PKC isotypes α, β
and γ but not ε” FEBS Lett. 384, 181-184 1996
6. Young, S.W. “HTS Personal Perspectives: Big Pharma – interview with Rebecca Lawrence” Drug Discovery
Today 6 (12) S8-S10 2001
7. Mallari, R., Swearingen, E., Liu, W., Ow, A., Young, S.W. and Huang, S.G. “A Generic High-throughput
Screening Assay for Kinases: Protein Kinase A as an Example” J. Biomol. Screening 8 (2) 198-204 2003
8. Hong, C.A., Swearingen, E., Mallari, R., Gao, X., Cao, Z., North, A., Young, S.W. and Huang, S.G.
“Development of A High-Throughput Time-Resolved Fluorescence Resonance Energy Transfer Assay for
TRAF6 Ubiquitin Polymerization” Assay and Drug Development Technologies 1 (1-2) 175-180 2003
BIOGRAPHICAL SKETCH
NAME
POSITION TITLE
Jaen, Juan.
Vice President, Chemistry
DEGREE (if applicable), YEAR(s), FIELD OF STUDY
University of Complutense, Madrid, Spain: B.S., 1979, Organic Chemistry
University of Complutense, Madrid, Spain: M.S., 1980, Organic Chemistry
University of Michigan, Ann Arbor, Michigan: M.S., 1981, Organic Chemistry
University of Michigan, Ann Arbor, Michigan: Ph.D., 1984, Organic Chemistry
A.
Positions and Honors.
1983-1985
Scientist, Parke-Davis Pharmaceutical Research Division, Ann Arbor, MI.
1985-1987
Senior Scientist, Parke-Davis Pharmaceutical Research Division,
1988-1991
Research Associate, Parke-Davis Pharmaceutical Research Division
1991-1992
Senior Research Associate, Parke-Davis Pharmaceutical Research Division
1992-1993
Section Director, Neurodegenerative Diseases Chem., Parke-Davis Research
1993-1996 Director, Neurodegenerative Diseases Chem., Parke-Davis Research Division,
1996-2000
Director of Chemistry, Tularik Inc., South San Francisco, CA
1999
Vice-President of Chemistry, Tularik Inc., South San Francisco, CA
Professional Activities
1992-1995
CNS Section Editor for Expert Opinion on Therapeutic Patents.
1992-1996
Editorial Board of Current Drugs: Neurodegenerative Disorders.
1992-1993
NIH Special Study Section 7, Drug Development & Drug Delivery - Small Business Innovation Research
1994-1995
Special Study Section Z, Multidisciplinary Special Emphasis Panel - Small Business Innovation Research
2000-present Editorial Board – Current Medicinal Chemistry (Immunology, Endocrine and Metabolic Agents).
B.
Selected Peer-reviewed publications (in chronological order).
(Publications selected from 58 peer-reviewed publications)
1. Davis, R.E., Doyle, P.D., Carroll, R.T., Emmerling, M.R., Jaen, J.C. Cholinergic therapies for Alzheimer’s Disease:
palliative or disease altering? Arzneim.Forsch/Drug Res. 1995;45(1):425.
2. Jaen, J.C., Laborde, E., Bucsh, R.A, Caprathe, B.W., Sorenson, R.J., Fergus, J., Spiegel, K., Dickerson, M.R., Davis,
R.E. Kynurenic acid derivatives inhibit the binding of Nerve Growth Factor (NGF) to the low affinity p75 NGF receptor. J.
Med. Chem. 1995;38:4439-4445.
3. Emmerling, M.R., Gregor, V.E., Callahan, M.J., Schwarz, R.D., Scholten, J.D., Orr, E.L., Pugsley, T., Moore, C.J., Raby,
C., Myers, S.L., Davis, R.E., Jaen, J.C. CI-1002, a combined acetylcholinesterase inhibitor and muscarinic antagonist.
CNS Drug Reviews 1995;1:27-29.
4. Pool, W.F., Woolf, T.F., Reily, M.D., Caprathe, B.W., Emmerling, M.R., Jaen, J.C. Identification of a 3-hydroxylated
tacrine metabolite in rat and man: metabolic profiling implications and pharmacology. J. Med. Chem. 1996;39:3014-3018.
5. Glase, S.A., Akunne, H.C., Heffner, T.G., Jaen, J.C., Meltzer, L.T., Pugsley, T.A., Smith, S.J., Wise, L.D. Aryl 1-but-3-ynyl-4-phenyl-1,2,3,6-tetrahydropyridines as potential antipsychotic agents: Synthesis and Structure-Activity relationships.
J. Med. Chem. 1996;39:3179-3187.
6. Jaen, J.C. and Schwarz, R.D. Development of muscarinic agonists for the symptomatic treatment of Alzheimer’s
Disease. In: Pharmacological Treatment of Alzheimer’s Disease. J.D. Brioni and M.W. Decker, Eds. Wiley & Sons:
1997;409-432.
7. Schwarz, R.D., Callahan, M.J., Davis, R.E., Jaen, J.C., Tecle, H. Development of M1-subtype-selective
muscarinic agonists for Alzheimer’s Disease: translation of in vitro selectivity into in vivo efficacy. Drug Dev. Res.
1997;40:133-143.
8. Hays, S.J., Caprathe, B.W., Gilmore, J.L., Amin, N., Emmerling, M.R., Michael, W., Nadimpali, R., Nath, R., Raser, K.J.,
Stafford, D., Watson, D., Wang, K., Jaen, J. C. 2-Amino-4H-3,1-benzoxazin-4-ones as inhibitors of C1r serine protease. J.
Med. Chem. 1998;41:1060-1067.
9. Augelli-Szafran C.E., Jaen, J.C., Moreland, D.W., Nelson, C.B., Penvose-Yi, J.R., Schwarz, R.D. Identification and
characterization of m4-selective muscarinic antagonists. BioOrg. Med. Chem. Lett. 1998;8:1991-1996.
10. Medina, J.C., Shan, B., Beckmann, H., Farrell, R.P., Clark, D.L., Learned, M., Roche, D., Li, A., Baichwal, V., Case, C.,
Baeurle, P., Rosen, T., Jaen, J.C. Novel antineoplastic agents with efficacy against multidrug resistant tumor cells.
BioOrg. Med. Chem. Lett. 1998;8:2653-2656.
11. Augelli-Szafran, C.E., Blankley, C.J., Jaen, J.C., Moreland, D.W., Nelson, C.B., Penvose-Yi, J.R., Schwarz, R.D.,
Thomas, A.J. Identification and characterization of m1 selective muscarinic receptor antagonists. J. Med. Chem.
1999;42:356-363.
12. Plummer, J.S., Cai, C., Hays, S.J., Gilmore, J.L., Emmerling, M.R., Michael, W., Narasimhan, L.S., Watson, M.D.,
Wang, K., Nath, R., Evans, L.M., Jaen, J.C. Benzenesulfonamide derivatives of 2-substituted 4H-3,1-benzoxazin-4-ones
and benzthiazin-4-ones as inhibitors of complement C1r protease. BioOrg. Med. Chem. Lett. 1999;9:815-820.
13. Shan, B., Medina, J.C., Santha, E., Frankmoelle, W.P., Chou, T.C., Learned, R.M., Narbut, M.R., Stott, D., Wu, P.,
Jaen, J.C., Rosen, T., Timmermans, P.B.M.W.M., Beckmann, H. Selective, covalent modification of β-tubulin residue
Cys239 by T138067, an antitumor agent with in vivo efficacy against multidrug-resistant tumors. Proc. Nat. Acad. Sci.
(USA) 1999;96:5686-5691.
14. Medina, J.C., Roche, D., Shan, B., Learned, R.M., Frankmoelle, W.P., Clark, D.L., Rosen, T., Jaen, J.C., Novel
halogenated sulfonamides inhibit the growth of multidrug resistant MCF-7/ADR cancer cells. BioOrg. Med. Chem. Lett.
1999;9:1843-1846.
15. Tecle, H., Schwarz, R.D., Barrett, S.D., Callahan, M.J., Caprathe, B.W., Davis, R.E., Doyle, P., Emmerling, M.,
Lauffer, D.J., Mirzadegan, T., Moreland, D.W., Lipinski, W., Nelson, C., Raby, C., Spencer, C., Spiegel, K., Thomas, A.J.,
Jaen, J.C. CI-1017, a functionally M1-selective muscarinic agonist: design, synthesis and preclinical pharmacology.
Pharm. Acta Helv. 2000;74(2-3):141-148.
Patents/Inventorships
(Selected Patents out of 41 US Patents and PCT Publications for Pending US Patents)
Preparation of 2-substituted-4H-3,1-benzoxazin-4-ones and benzothiazin-4-ones as inhibitors of complement C1r
protease for the treatment of inflammatory processes. Caprathe, B.W., Gilmore, J., Hays, S., Jaen, J.C.: US 5,652,237
(July 29, 1997).
Method of imaging amyloid deposits. Caprathe, B.W., Gilmore, J.L., Hays, S., Jaen, J.C., LeVine, H.: US 6,001,331
(December 14, 1999).
PPARg modulators. DeLaBrouse-Elwood, F., Chen, J-L., Cushing, T.D., Flygare, J.A., Houze, J.B., Jaen, J.C., McGee,
L.R., Miao, S-C., Rubenstein, S.M., Kearney, P.C.: US 6,200,995 (March 13, 2001).
Pyrimidine derivatives. Cushing, T.D., Mellon, H.L., Jaen, J.C., Flygare, J.A., Miao, S-C., Chen, X, Powers, J.P. US
6,200,977 (March 13, 2001).
HIV integrase inhibitors. Young, S.D., Egbertson, M., Payne, L.S., Wai, J.S., Fisher, T.E., Guare, J.P., Embrey, M.W.,
Tran, L., Zhuang, L., Vacca, J.P., Langford, M., Melamed, J., Jaen, J.C., Clark, D.L., Medina, J.C. US 6,380,249 (April 30,
2002).
Preparation of arylsulfonanilide amino acid derivatives. Rubenstein, S., Jaen, J.C.; US 6,153,585 (November 28, 2000).
BIOGRAPHICAL SKETCH
NAME
POSITION TITLE
Waszkowycz, Bohdan
Head of Computational Chemistry
DEGREE (if applicable), YEAR(s), FIELD OF STUDY
University of Manchester, UK: BSc, 1983, Pharmacy
University of Manchester, UK: PhD, 1990, Theoretical Chemistry
1984-1985: Pharmacist, Withington Hospital, Manchester, UK
1985-1987: Pharmacist, Christie Hospital, Manchester, UK
1990-1993: Computational Chemist, Proteus Molecular Design Ltd, Stockport, UK
1993-1999: Group Leader, Computational Chemistry, Proteus Molecular Design Ltd, Macclesfield, UK
1999-2001: Group Leader, Computational Chemistry, Protherics Molecular Design Ltd, Macclesfield,
UK
2001-present: Head, Computational Chemistry, Tularik Ltd, Macclesfield, UK
1. B Waszkowycz, I H Hillier, N Gensmantel and D W Payling, Aspects of the Mechanism of Catalysis in
Phospholipase A2. A Combined ab initio Molecular Orbital and Molecular Mechanics Study. J. Chem. Soc.
Perkin Trans. 2, 1989, 1795.
2. D E Clark, D Frenkel, S A Levy, J Li, C W Murray, B Robson, B Waszkowycz and D R Westhead,
PRO_LIGAND: An Approach to de Novo Molecular Design. 1: Application to the Design of Organic
Molecules. J. Comput.-Aided Mol. Design, 1995, 9, 13.
3. B Waszkowycz, D E Clark, D Frenkel, J Li, C W Murray, B Robson and D R Westhead, PRO_LIGAND: An
Approach to de Novo Molecular Design. 2: Design of Novel Molecules from Molecular Field Analysis (MFA)
Models and Pharmacophores, J. Med. Chem., 1994, 37, 3994.
4. CW Murray, DE Clark, TR Auton et al PRO_SELECT : Combining structure-based drug design and
combinatorial chemistry for rapid lead discovery. 1. Technology. J. Comput.-Aided Mol.Des. 1997, 11, 193.
5. J Li, CW Murray, B Waszkowycz and SC Young. Targeted molecular diversity in drug discovery – integration of structure-based design and combinatorial chemistry. Drug Discovery Today 1998, 3, 105.
6. B Waszkowycz. New methods for structure-based de novo drug design in “Advances in Drug Discovery
Techniques”, Ed. A L Harvey, Publ. Wiley 1998
7. CA Baxter, CW Murray, B Waszkowycz et al. New approach to molecular docking and its application to
virtual screening of chemical databases. J. Chem. Inf. Comp. Sci. 2000, 40, 254.
8. B Waszkowycz, TDJ Perkins, RA Sykes & J Li, Large scale virtual screening for lead discovery in the
post-genomics era. IBM Systems J. 2001, 40, 360.
9. JW Liebeschuetz et al, PRO_SELECT: combining structure-based drug design and array-based
chemistry for rapid lead discovery. 2. The development of a series of highly potent and selective factor Xa
inhibitors. J. Med. Chem. 2002, 45, 1221.
10. B Waszkowycz, Structure-based approaches to drug design and virtual screening. Curr. Opin. Drug
Discov. Devel. 2002, 5, 407
BIOGRAPHICAL SKETCH
NAME
POSITION TITLE
Stephen C Young
Head of Chemistry, Tularik Ltd
DEGREE (if applicable), YEAR(s), FIELD OF STUDY
Nottingham University (UK): BSc Hons, 1981, Chemistry
Nottingham University (UK): PhD, 1984, Medicinal Chemistry
1984 – 1986 Research Fellow, Edinburgh University (UK).
1986 – 1989 Senior Research Chemist, Merck Sharp and Dohme (Neuroscience Research Labs).
1989 – 1991 Sales and Marketing Manager, Novabiochem (UK) Ltd.
1992 – 1995 Experimental Facilities Manager, Proteus Molecular Design Ltd.
1996 – 2001 Synthetic Chemistry Section Head, Protherics Molecular Design Ltd.
2001 – 2003 Head of Chemistry, Tularik Ltd.
1981 – 2003 Member of the Royal Society of Chemistry
1987 – 2003 Member of the American Chemical Society
Papers:
Neuroscience Letters, 1987, 80, 321-326. Senktide, a selective neurokinin B-like agonist, elicits serotonin-mediated
behaviour following intracisternal administration in the mouse, A.J. Stoessl, C.T. Dourish, S.C. Young, S.D. Iversen and L.L.
Iversen. Merck Sharp and Dohme Research Laboratories, Harlow, Essex, U.K.
Peptides 1990, 313-5 (Eds. E. Giralt and D. Andreu). Counterion distribution monitoring: A novel method for acylation
monitoring in solid phase peptide synthesis. S.C. Young, P.D. White. J.W. Davies, D.E.I.A. Owen, S.A. Salisbury and E.J.
Tremeer. Novabiochem U.K. Ltd., Cambridge, U.K.
Journal of Medicinal Chemistry, 1993, 36, 2-10. Cyclic peptides as selective tachykinin antagonists, B.J. Williams, N.R.
Curtis, A.T. McKnight, J.J. Maguire, S.C. Young, D.F. Veber, R. Baker. Merck Sharp and Dohme Research Laboratories,
Harlow, Essex, U.K.
Veterinary Immunology and Immunopathology, 1996, 55, 243, Immunisation of rainbow trout Oncorhynchus mykiss with
multiple antigen peptide system (MAPS). E.M. Riley, S.C. Young, C.J. Secombes. University of Aberdeen and Proteus
Molecular Design Ltd., UK
Journal of Computer aided Molecular Design 1997, 11, 193, PRO-SELECT: Combining combinatorial chemistry and
structure-based drug design for rapid lead discovery. 1. Technology C.W. Murray, D.E. Clark, T.R Auton, M.A. Firth, J. Li, B.
Waszkowycz, D.R. Westhead, and S.C. Young, Proteus Molecular Design Ltd, Macclesfield, Cheshire, U.K.
Drug Discovery Today 1998, 3, 105-112, Targeted molecular diversity in drug discovery – integration of structure-based
design and combinatorial chemistry. Li, Jin; Murray, Christopher W.; Waszkowycz Bohdan; Young, Stephen C., Proteus
Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK
Acta Crystallogr. 1999, C55, IUC9900072: 2-Amino-4-(methoxymethyl)-thiazole-5-carboxylic acid methyl ester, A.R.
Kennedy, A.I. Khalaf, A.R.Pitt, M. Scobie, C.J. Suckling, J. Urwin, R.D. Waigh and S.C.Young
J. Med. Chem. 2000, 43, 3257-3266, DNA binding, solubility and partitioning characteristics of extended Lexitropsins. R.V.
Fishleigh, K.R. Fox, A.I. Khalaf, A.R.Pitt, M. Scobie, C.J. Suckling, J. Urwin, R.D. Waigh and S.C.Young. Proteus Molecular
Design Ltd, Macclesfield, Cheshire, U.K and University of Strathclyde, Glasgow, UK.
Tetrahedron. 2000, 56, 5225-5239, The synthesis of some head to head linked DNA minor groove binders. A.I. Khalaf,
A.R.Pitt, M. Scobie, C.J. Suckling, J. Urwin, R.D. Waigh, R.V. Fishleigh, W.A. Wylie and S.C.Young. Proteus Molecular
Design Ltd, Macclesfield, Cheshire, U.K and University of Strathclyde, Glasgow, UK.
Journal of Chemical Research-S 2000, 6, 264-265 Synthesis of novel DNA binding agents: indole-containing analogues of
bis-netropsin. Khalaf, A.I.; Pitt, A.R.; Scobie, M.; Suckling, C.J.; Urwin, J.; Waigh, R.D.; Fishleigh R.V.; Young, S.C. Proteus
Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK , and Univ Strathclyde, Glasgow G4 0NR, UK
Drug Discovery & Development, April 2000, 34-38: Virtual screening speeds discovery. Young, Stephen; Li, Jin Protherics
Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK
Innovations in Pharmaceutical Technology (2000), 00(5), 24-28: Virtual screening of focused combinatorial libraries Young,
S. ; Li, J. , Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK
BioOrganic & Medicinal Chemistry Letters (2001), 11(5), 733-736, The design of phenylglycine containing benzamidine
carboxamides as potent and selective inhibitors of factor Xa; Jones, S D ; Liebeschuetz, J W ; Morgan, P J ; Murray, C W ;
Rimmer, A D ; Roscoe, J M E ; Waszkowycz, B ; Welsh, P M ; Wylie, W A ; Young, S C ; Mahler, J ; Martin, H ; Brady, L ;
and Wilkinson, K ; Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK and Bristol University
J. Med. Chem. (2002), 45, 1221; PRO-SELECT: Combining structure-based drug design and array-based chemistry for
rapid lead discovery. 2. The Development of a Series of Highly Potent and Selective Factor Xa Inhibitors. J Liebeschuetz,
S.D. Jones, J. Mahler, H. Martin, P.J. Morgan, C.W. Murray, A.D. Rimmer, J.M.E. Roscoe, B. Waszkowycz, P.M. Welsh,
W.A. Wylie and S.C. Young, Protherics Molecular Design Ltd, Macclesfield, Cheshire, U.K. and Bristol University
BioOrganic & Medicinal Chemistry Letters (2003 in Press), A Four Component Coupling Strategy for the Synthesis of D-Phenylglycinamide-Derived Non-Covalent Factor Xa Inhibitors, Scott M. Sheehan, John J. Masters, Michael R. Wiley,
Stephen C. Young, John W. Liebeschuetz, Stuart D. Jones, Christopher W. Murray, Jeffrey B. Franciskovich, David B.
Engel, Wayne W. Weber II, Jothirajah Marimuthu, Jeffrey A. Kyle, Jeffrey K. Smallwood, Mark W. Farmen, and Gerald F.
Smith.
Patents:
EP818744 Process for selecting candidate drug compounds. C.W. Murray and S.C. Young. Proteus Molecular Design Ltd,
Macclesfield, Cheshire, U.K.
World Patent WO-9858952 Angiotensin derivatives. Glover, J F.; Rushton A.; Morgan P J.; Young S C. Proteus Molecular
Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK
World Patent WO-9911657 1-Amino-7-isoquinoline derivatives as serine protease inhibitors. Liebeschuetz, J.W.; Wylie,
W.A.; Waszkowycz, B.; Young, S.C.; Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK
US2002/0040144 1-Amino-7-isoquinoline derivatives as serine protease inhibitors. Liebeschuetz, J.W.; Wylie, W.A.;
Waszkowycz, B.; Young, S.C.; Proteus Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK
World Patent WO-9911658 meta-Benzamidine derivatives as serine protease inhibitors Liebeschuetz, J.W.; Wylie, W.A.;
US 2002/0055522 meta-Benzamidine derivatives as serine protease inhibitors Liebeschuetz, J.W.; Wylie, W.A.;
World Patent WO-0076970 Use of compounds as serine protease inhibitors Liebeschuetz, J W ; Lyons, A J ; Murray, C W ;
Rimmer, A D, Young S.C., Camp, N.P., Jones S.D., Morgan, P.J., Richards, S.J., Wylie, W.A., Lively, S.E., Harrison, M.J.,
Waszkowycz, B., Masters, J.J., Wiley, M.J. Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli
Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285
World Patent WO-0076971 Compounds as serine protease (especially factor Xa) inhibitors useful as antithrombotic agents
Liebeschuetz, J W ; Lyons, A J ; Murray, C W ; Rimmer, A D, Young S.C., Camp, N.P., Jones S.D., Morgan, P.J., Richards,
S.J., Wylie, W.A., Masters, J.J., Wiley, M.J. Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli
Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285
World Patent WO-0077027 Compounds as serine protease (especially tryptase) inhibitors useful as antiinflammatory agents
Liebeschuetz, J W; Young, S C; Lively, S E; Harrison, M J; Morgan, P.J; Waszkowycz, B; Protherics Molecular Design Ltd.,
Macclesfield, Cheshire, SK11 0JL, UK
World Patent WO-0196303 compounds as serine protease inhibitors Liebeschuetz, J W, Murray, C W., Young S.C., Camp,
N.P., Jones S.D., Wylie, W.A., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. B., Protherics Molecular
Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285
N.P., Jones S.D., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. B., Protherics Molecular Design Ltd.,
Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285
N.P., Jones S.D., Wylie, W.A., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. B., Guzzo, P.R.,
Protherics Molecular Design Ltd., Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center,
Indianapolis, IN 46285
N.P., Jones S.D., Masters, J.J., Wiley, M.J., Sheehan, S. M., Watson, B., Engel, D. B., Protherics Molecular Design Ltd.,
Macclesfield, Cheshire, SK11 0JL, UK & Eli Lilly & Co., Lilly Corporate Center, Indianapolis, IN 46285
BIOGRAPHICAL SKETCH
NAME
POSITION TITLE
Mario G. Cardozo
Research Investigator
DEGREE (if applicable), YEAR(s), FIELD OF STUDY
Faculty of Chemical Sciences, National University of Cordoba, Cordoba, ARGENTINA: B. Sc. in Pharmacy, 4/78 to 12/81, Pharmacy
Faculty of Chemical Sciences, National University of Cordoba, Cordoba, ARGENTINA: M. Sc. in Organic Chemistry, 12/81 to 3/83, Physical Organic Chemistry
Faculty of Pharmacy and Biochemistry, University of Buenos Aires, Buenos Aires, ARGENTINA: Ph. D. in Pharmacy, 4/83 to 12/87, Medicinal Chemistry
College of Pharmacy, University of Illinois at Chicago (UIC), Chicago, IL: Postdoctoral Research Associate, 05/89 to 06/92, Computer-aided drug design
POSITIONS
07/92 to 12/96:
Senior Scientist
Medicinal Chemistry Department
Boehringer Ingelheim Pharmaceuticals, Inc.
01/97 to 07/02:
Principal Scientist
Molecular Modeling Laboratory-Structural Research Group
Boehringer Ingelheim Pharmaceuticals, Inc.
900 Ridgebury RD, Ridgefield CT
08/02 to Present
Tularik Inc
Department of Structural Biology
1120 Veterans Blvd, South San Francisco, CA
FELLOWSHIPS AND AWARDS:
04/84 to 03/88:
Graduate Student Fellowship (National Research Council, ARGENTINA).
05/89 to 04/91:
Fogarty International Postdoctoral Fellowship (NIH-USA).
1998:
Boehringer Ingelheim Vice President Golden Achievement Award.
1999:
Boehringer Ingelheim Medicinal Chemistry Department Achievement Award.
1. Cardozo, M.G., with Pierini, A.B., Montiel, A.A., Albonico, S.M., and Pizzorno, M.T., 1,3-dipolar
Cycloaddition Reactions. Regioselective Synthesis of Heterocycles and Theoretical Studies. J. Heterocyclic
Chem., 26, 1003 (1989).
2. Cardozo, M.G., with Hopfinger, A.J., Molecular Mechanics and Molecular Dynamics Studies of the
Intercalation of Dynemicin-A with Oligonucleotide Models of DNA. Mol. Pharmacol., 40, 1023 (1991).
3. Cardozo, M.G., with Hopfinger, A.J., Iimura, Y., Sugimoto, H., and Yamanishi, Y., QSAR Analyses of the
Substituted Indanone and Benzyl Piperidine Inhibitors of Acetylcholinesterase, J. Med. Chem. 35, 584 (1992).
4. Cardozo, M.G., with Hopfinger, A.J., Iimura,. Y., Sugimoto, H., and Yamanishi, Y., Conformational Analysis
and Molecular Shape Comparisons of a Series of Indanone Benzyl Piperidine Inhibitors of Acetylcholinesterase,
J. Med. Chem. 35, 590 (1992).
5. Cardozo, M.G., with Hopfinger, A.J., Burke, B.J., Rowberg, K.L., and Koehler, M.G., New Methods in
Molecular Shape Analysis to Identify and Characterize Active Conformations, in Second International
Telesymposium Proceed. on QSAR, (ed. K. Kuchar), Prous Press, Barcelona, Spain, 1991.
6. Cardozo, M.G., with Kawakami, Y., and Hopfinger, A.J., Construction of QSARs from Ligand-DNA
Intercalation Molecular Modeling Studies. "Nucleic Acid Targeted Drug Design", In Computer-Aided Drug
Design Methods and Applications, vol 2. Ed. T. J. Perun and C.L. Propst, Marcel Dekker, Inc. New York. pp
151-193 (1992)
7. Cardozo, M.G., with Hopfinger, A.J., A Model for the Dynemicin-A Cleavage of DNA Using Molecular
Dynamics Simulation, Biopolymers 33, 377 (1993).
8. Cardozo, M.G., with Tong, L.; Jones P.-J.; and Adams, J., Preliminary Structural Analysis of the Mutations
Selected by Non-Nucleoside Inhibitors of HIV-1 Reverse Transcriptase. Bioorganic & Medicinal Chemistry
Letters, 3, 721 (1993).
9. Cardozo, M.G. with Hopfinger, A.J., , and Kawakami, Y. Molecular Modelling of Ligand-DNA Intercalation
Interactions. J. Chem. Soc. Faraday Trans. 1995, 91 (16) 2515-2524..
10. Cardozo, M.G. with Proudfoot, J., etal. Novel Non-nucleoside Inhibitors of Human
Virus Type 1 Reverse Transcriptase. 4. J. Med. Chem. 1995, 38 (24) 4830-4838.
Immunodeficiency
11. Cardozo, M.G. with Kelly, T.A., Proudfoot, J.R., McNeil, D.W., Patel, U.R., David, E, Farina, V., Hargrave,
K.D., Grob, P., Agarwal, A., and Adams, J. Non-nucleoside Inhibitors of Human Immunodeficiency Virus Type
1 Reverse Transcriptase. 5. J. Med. Chem. 1995, 38 (24) 4839-4847.
12. Cardozo, M.G., with Betageri, R., et al. Phosphotyrosine-Containing Dipeptides as High-Affinity Ligands for
p56lck SH2 Domain. J. Med. Chem. 1999, 42 (4) 722-729.
13. Cardozo, M.G., with Betageri, R., et al. Ligands for the Tyrosine p56lck SH2 Domain: Discovery of potent
Dipeptide Derivatives with Monocharged, Nonhydrolyzable Phosphate Replacement. J. Med. Chem. 1999, 42
(10) 1757-1766.
14. Cardozo, M.G., with Last-Barney, K., Davidson, W., et al. Binding Site Elucidation of Hydantoin-based
Antagonists of LFA-1 Using Multidisciplinary Technologies: Evidences for the Allosteric Inhibition of a
Protein-Protein Interaction. J. Am. Chem. Soc. 2001, 123, 5643-5650.
15. Cardozo, M.G., with Proudfoot, J.R. et al. Non-peptidic, Monocharged, Cell Permeable Ligands for the
p56lck SH2 Domain. Submitted J. Med. Chem. 2001, 44 (15) 2421-2431.
16. Cardozo, M.G., with Graham, E, and Jacober, S. A method for selecting compounds
from a combinatorial or other chemistry libraries for efficient synthesis. J. Chem. Inf.
Comp. Sci. 2001, 41 (6) 1508-1516.
20. Cardozo, M.G., with Snow, R. J., Morwick, T. M. et al. Discovery of 2-Phenylamine-imidazo [4,5h]isoquinolin-9-one: A New Class of Inhibitors of Lck Kinase. J. Med. Chem. 2002, 45 (16) 3394-3405.
BIOGRAPHICAL SKETCH
Provide the following information for the key personnel in the order listed for Form Page 2.
NAME: Connors, Richard Victor
POSITION TITLE: Senior Chemistry Scientist
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Laurentian University, Sudbury, Canada; B.Sc.(Hon); 1988; Biochemistry
University of Ottawa, Ottawa, Canada; Ph.D.; 1994; Chemistry
Columbia University; Postdoc; 1996; Chemistry
Duke University; Postdoc; 1997; Chemistry
A. Positions and Honors.
1997-1999 Research Scientist, Pharmacopeia Inc, Princeton, NJ.
1999-2001 Senior Scientist, Pharmacopeia Inc, Princeton, NJ.
2003- Senior Chemistry Scientist, Tularik, Inc, South San Francisco, CA.
Honors
1986-1988 Dean’s Honor List, Laurentian University, Sudbury, Canada.
1988-1990 University of Ottawa Entrance Scholarship, Ottawa, Canada.
1990-1992 NSERC PGS-3 Predoctoral Scholarship, Ottawa, Canada.
B. Peer-Reviewed Publications.
Connors, Richard; Durst, Tony. Acyl cyanides as carbonyl heterodienophiles. Tetrahedron
Letters (1992), 33(48), 7277-80.
Breslow, Ronald; Connors, Richard V.. Quantitative Antihydrophobic Effects as Probes for
Transition State Structures. 1. Benzoin Condensation and Displacement Reactions. Journal
of the American Chemical Society (1995), 117(24), 6601-2.
Connors, Richard; Tran, Elisabeth; Durst, Tony. Acyl cyanides as carbonyl heterodienophiles:
application to the synthesis of naphthols, isoquinolones, and isocoumarins. Canadian
Journal of Chemistry (1996), 74(2), 221-6.
Breslow, Ronald; Connors, Richard. Antihydrophobic Cosolvent Effects Detect Two Different
Geometries for an SN2 Displacement and the Change to a Single-Electron-Transfer
Mechanism in Related Cases. Journal of the American Chemical Society (1996), 118(26), 6323-6324.
Breslow, Ronald; Connors, Richard; Zhu, Zhaoning. Mechanistic studies using antihydrophobic
agents. Pure and Applied Chemistry (1996), 68(8), 1527-1533.
Pirrung, Michael C.; Connors, Richard V.; Odenbaugh, Amy L.; Montague-Smith, Michael P.; Walcott,
Nathan G.; Tollett, Jeff J. The arrayed primer extension method for DNA microchip analysis.
Molecular computation of satisfaction problems. Journal of the American Chemical Society
(2000), 122(9), 1873-1882.
Pirrung, Michael; Connors, Richard; Odenbaugh, Amy; Montague-Smith, Michael; Walcott, Nathan;
Tollett, Jeff. Arrayed primer extension on DNA microchips (APEX). Molecular computation of
satisfaction (SAT) problems. Frontiers Science Series (2000), 30(Currents in Computational
Molecular Biology), 20-21.
Pirrung, Michael C.; Odenbaugh, Amy L.; Connors, Richard V.; Worden, Janice D. Method of
attaching a biopolymer to a solid support using bromoacetamidosilanes to functionalize the
support. U.S. Pat. Appl. Publ. (2002), 13 pp.
Connors, Richard V.; Zhang, Alex J.; Shuttleworth, Stephen J. Pictet-Spengler synthesis of
tetrahydro-β-carbolines using vinylsulfonylmethyl resin. Tetrahedron Letters (2002), 43(37),
6661-6663.
BIOGRAPHICAL SKETCH
NAME: Gene Cutler
POSITION TITLE: Research Investigator
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Cornell University, Ithaca, NY; BA; 1988-1992; Biology
University of California, Berkeley, CA; PhD; 1992-1997; Molecular and Cell Biology
Tularik Inc.; Postdoc; 1998-2000; Biology
1997 – 1998 Postdoctoral Fellow, Molecular and Cell Biology Dept, University of California, Berkeley
1998 – 2000 Postdoctoral Fellow, Biology Dept, Tularik Inc.
2000 – 2002 Scientist, Bioinformatics Dept, Tularik Inc.
2002 – Research Investigator, Bioinformatics Dept, Tularik Inc.
Honors
1988 New York State Science Supervisors Association Award
1988 National Merit Scholarship Finalist
1988 Lavinia Wright Scholarship, OPEIU
1989 New York State Scholarship of Excellence
1989 Howard Coughlin Memorial Scholarship, OPEIU
1989-1992 Dean’s List, Cornell University
1991 Cornell Hughes Scholars Program
1992 Phi Beta Kappa Society Membership, Cornell Chapter
1992 National Science Foundation Graduate Fellowship
1992 Howard Hughes Medical Institute Predoctoral Fellowship
• Goodrich JA, Cutler G, and Tjian R. Contacts in context: promoter specificity and
macromolecular interactions in transcription. Cell, 1996 Mar 22, 84(6):825-30.
• Cutler G, Perry K, and Tjian R. Transcription factor Adf-1 contains a TAF-binding myb motif
as part of a non-modular activation domain. Molecular and Cell Biology, 1998 Apr; 18(4):2252-61.
• An S, Cutler G, Zhao JJ, Huang SG, Tian H, Li W, Liang L, Rich M, Bakleh A, Du J, Chen JL,
Dai K. Identification and characterization of a melanin-concentrating hormone receptor.
Proceedings of the National Academy of Sciences USA, 2001 Jun 19; 98(13):7576-81.
• Li S, Liao J, Cutler G, Hoey T, Hogenesch JB, Cooke MP, Schultz PG, Ling XB. Comparative
analysis of human genome assemblies reveals genome-level differences. Genomics, 2002 Aug;
80(2):138-9.
• Li S, Cutler G, Liu JJ, Hoey T, Chen L, Schultz PG, Liao J, Ling XB. A Comparative
Analysis Of HGSC and Celera Human Genome Assemblies and Gene Sets. Bioinformatics, in
press.
Ongoing Research Support
Tularik Inc
2000 – present
Role: Co-Investigator
Design microarray experiments and analyze resulting data in experiments to probe the activities of novel
bioactive compounds, the effects of ectopically expressed genes, and the effects of knock-out genes in
animals and tissue-culture systems. These experiments probe a variety of pathways related to cancer,
disorders of the immune system, and metabolic disorders.
Tularik Inc
2002 – present
Analyze human protein kinase sequences to better predict which kinases will bind a given substrate
analog.
Completed Research Support
Tularik Inc
2000 – 2002 Role: Co-Investigator
Perform exhaustive sequence analysis of Human genome sequence to identify and classify novel G
Protein-Coupled Receptors and Protein Kinases.
Tularik Inc
2000 – 2003
Role: PI
Design and develop a suite of tools for storing, retrieving, manipulating, and analyzing microarray data.
This includes development of novel algorithms for microarray data normalization and gene expression
data clustering.
Tularik Inc
2002
Perform analysis on human genome assemblies from different sources, comparing the development of
these assemblies over time.
BIOGRAPHICAL SKETCH
Follow this format for each person. DO NOT EXCEED FOUR PAGES.
NAME: Jane Liu
POSITION TITLE: Bioinformatics specialist
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Beijing Medical University; M.D.; 1992-1997; Medicine
Santa Clara University, CA; M.S.; 2000-2002; Computer Engineering
2002 – Bioinformatics specialist, Bioinformatics Dept, Tularik Inc.
Honors
Li S, Cutler G, Liu J (Co-first author), Hoey T, Chen L, Schultz PG, Liao J, Ling XB. 2003. A Comparative Analysis
Of HGSC and Celera Human Genome Assemblies and Gene Sets. Bioinformatics in press.
Tularik Inc 2003 – present
Develop machine learning algorithms, e.g. neural networks and HMMs, to associate biological sequences with Gene
Ontology terms.
Tularik Inc 2002
Perform comparative analysis on human genome assemblies from HGSC and Celera Genomics and their
associated gene sets. Study the evolution of these assemblies over time.
BIOGRAPHICAL SKETCH
Follow this format for each person. DO NOT EXCEED FOUR PAGES.
NAME: Zheng (Sam) Pan
POSITION TITLE: Scientist
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Fudan Univ., Shanghai, China; BS; 1987; Biology/Plant Physiology
Fudan Univ., Shanghai, China; MS; 1990; Biochemistry/Plant Physiology
Univ. of Massachusetts, Amherst, MA, USA; Ph.D.; 1997; Fungal Genetics
Harvard Medical School, Boston, MA, USA; Post Doc; 2000; Leukemia, cancer
1992-1997 Research Assistant, Dept. of Microbiology, Univ. of Massachusetts, Amherst, MA
1997-2000 Post-doc/Research fellow, Harvard Institute of Medicine/Harvard Medical School, Boston, MA
2000-2002 Scientist, DoubleTwist, Inc, Oakland, CA
2002- Scientist, Tularik Inc, South San Francisco, CA
• Pan Z, Zhou LM, Hetherington CJ, et al. Hepatocytes contribute to soluble CD14 production, and
CD14 expression is differentially regulated in hepatocytes and monocytes, J BIOL CHEM 275 (46):
36430-36435 NOV 17 2000
• Schwer H, Liu LQ, Zhou LM, Pan Z, et al. Cloning and characterization of a novel human
ubiquitin-specific protease, a homologue of murine UBP43 (Usp18), GENOMICS 65 (1): 44-52
APR 1 2000
• Pan Z, Zhou L, Hetherington CJ, et al. Differential regulation of human CD14 expression in
monocytes and hepatocytes. BLOOD 94 (10): 1644 Part 1 Suppl. 1 NOV 15 1999
• Libermann TA, Pan Z, Akbarali Y, et al. AML1 (CBF alpha 2) cooperates with B cell-specific
activating protein (BSAP/PAX5) in activation of the B cell-specific BLK gene promoter, J BIOL
CHEM 274 (35): 24671-24676 AUG 27 1999
• Pan Z, Hetherington CJ, Zhang DE, CCAAT/enhancer-binding protein activates the CD14
promoter and mediates transforming growth factor beta signaling in monocyte development, J
BIOL CHEM 274 (33): 23242-23248 AUG 13 1999
• Pan Z, Hetherington CJ, Zhang DE, Regulation of CD14 gene expression during monocytic
differentiation by C/EBPs, BLOOD 92 (10): 2875 Part 1 Suppl. 1 NOV 15 1998
C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and
non-federal support). Begin with the projects that are most relevant to the research proposed in this application.
Briefly indicate the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research
project. Do not list award amounts or percent effort in projects.
Tularik Inc
2002-present
Role: Investigator
Designed and developed tools and oligos for cancer-related amplicon discovery in the human genome.
Tularik Inc
2002-2003
Role: Investigator
Perform exhaustive sequence analysis of the human genome sequence with Hidden Markov Models to identify
and classify novel phosphatases and proteases.
BIOGRAPHICAL SKETCH
NAME: King, Brian D.
POSITION TITLE: President, Life Code, Inc.
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Michigan State University; B.S.; 1989; Computer Science
2003 President, Life Code, Inc.
2002-2003 Contractor, Sun Microsystems
1999-2002 Senior Software Architect, DoubleTwist, Inc.
1998-1999 Contractor, Oregon Dept. of Transportation
1997-1998 Contractor, Hewlett-Packard
1997-1998 Contractor, Pangea Systems
1996-1997 Contractor, Kaiser Permanente
1995-1996 Contractor, Strategic Concepts Corp.
1994-1995 Software Engineer, IA Corp.
1989-1994 Staff Programmer, IBM
Professional Memberships
2001-2003 Interoperable Informatics Infrastructure Consortium (I3C)
BIOGRAPHICAL SKETCH
NAME: Lukes, Melissa, Ann
POSITION TITLE: Database Administrator
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
California State University Hayward; BS; 1985; Biology, Minor Computer Science
Position and Employment
1985-1988 Network Manager at NASA AMES Research Center, Sterling Software, Palo Alto, CA
1988-1990 Bioanalyst, Syntex Research, Palo Alto, CA
1990-1991 Sr. Analysis Programmer, Syntex Research, Palo Alto, CA
1991-1993 Technical Analyst, Syntex Research, Palo Alto, CA
1993-1994 Sr. Technical Analyst, Syntex Research, Palo Alto, CA
1994-1996 Systems Project Manager, Syntex Research, Palo Alto, CA
1996-1999 Data Manager/Analyst, Mercator Genetics/Progenitor, Menlo Park, CA
1999- Database Administrator, Tularik Inc, South San Francisco, CA
1985- NOCOUG, Northern California Oracle Users Group Membership
Honors
1988 Ozone Hole Participation Award, NASA AMES Research Center
1994 Chairman's Recognition Award for Individual Effort, Syntex Research
Elizabeth Kunysz, Douglas W. Bonhaus and Melissa Lukes, 'Bar-code technology and a Centralized database: Key
components in a Radioligand Binding High Throughput Screening Program', Accepted for publication by Packard 1996
Marshall B. Wallach, Tim Maslyn, Melissa Lukes, and Ronald Rhodes, 'Automated Sample Preparation and
Dispensing For High Throughput Screening Assays', Syntex Discovery Research. Proceedings International
Symposium on Laboratory Automation and Robotics, 1994 p474
Maureen Laney, Ronnel Cabuslay, Ronald Rhodes, Melissa Lukes, and Randall Schatzman, 'A fully Automated Assay
of in vitro Enzyme Activity by Continuous Kinetic Measurement', Syntex Discovery Research. Proceedings
International Symposium on Laboratory Automation and Robotics, 1994 p485
BIOGRAPHICAL SKETCH
NAME: Richard K. Porter
POSITION TITLE: Senior Software Engineer
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Temple University, Phila., PA; B.S.; 1968; Chemistry
Temple University, Phila., PA; M.S.; 1971; Organic Chemistry
Brown University, Providence, RI; 1969 - 1972; Phys. Org. Chem.
UC Santa Cruz Extension, Santa Clara, CA; 1993 - 1994; C Prog. Language, and Advanced C
Oracle Corp, Redwood Shores, CA; 1998; SQL*Forms V3 to Oracle Forms 4.5: New Feature
Oracle Corp, Redwood Shores, CA; 2000; Basic Java Programming Intro to JDeveloper
MDL Information Systems, San Leandro, CA; 2002; MDL ISIS/Direct Molecules v2.0
Foothill College, Los Altos, CA; 2002; Designing with Visual Basic 6
Oracle Development Tools User Group, Las Vegas, NV; 2002; Oracle 9i ODTUG JDeveloper Certificate of Training
1978-1979 Scientist/Analytical Chemistry, Lockheed Missiles and Space Company, Palo Alto, CA
1979-1984 Senior Scientist/Analytical Chemistry, Lockheed Missiles and Space Company, Palo Alto, CA
1984-1990 Senior Member of the Technical Staff/Project Mgr., Space Applications Corp., Sunnyvale, CA
1990-1992 Senior Software Engineer, Advanced Software Resources, Santa Clara, CA
1993-1996 Senior Analyst/Programmer (contract), Syntex/Roche Bioscience, Palo Alto, CA
1996-1997 Senior Systems Analyst/Database Programmer (contract), CareAmerica Compensation, Burlingame, CA
1997-2000 Senior Analyst/Programmer (contract), Quantum Corporation, Milpitas, CA
2001- Senior Software Engineer, Tularik Inc., South San Francisco, CA
Nature of the carbonium ion. VIII. Cycloalkyl cations from thiocyanate isomerizations
Langley A. Spurlock, Richard K. Porter, Walter G. Cox;
J. Org. Chem.; 1972; 37(8); 1162-1168.
On-going Research Support
Tularik Inc, 2001-
SQUID, Select Query Inventory Data, Project Manager/Analyst/Designer
Inventory Automation
Automation and visualization of compound inventory processes supporting the inventory staff, chemists, and biologists, including support for
material for HTS assays and Structure Activity Projects
Tularik Inc, 2000-2001
COI, Compounds of Interest, Project Manager/Analyst/Designer
Analysis and Reporting of Compound Activity
Automation and visualization of compound activity across all projects and assays, giving the history of activity of each compound as well as a statistical
summary of compound assay activity.
Tularik Inc, 1999-2000
Automation of Assay Data Analysis, Project Manager/Analyst/Designer
Automation of assay data analysis and reporting for High Throughput Assays including data storage.
Mercator Genetics/Progenitor Inc, 1996-1998
Genotyper, Project Manager/Analyst/Designer
Automation of data analysis and querying of human genotyping data for an asthma project. Worked with researchers to implement a company-critical
genotyping database, including integration and automation of analysis programs and visualization tools.
Mercator Genetics/Progenitor Inc, 1998
Mutation Detection, Project Analyst/Designer
Worked with researchers to design and begin implementing a Mutation Detection database.
Syntex Research, Inc 1991-1995
Robotic Implementation and Data Analysis, Project Manager/Designer/Analyst
Designed and developed automated statistical systems required to handle volume of data generated from robotic systems.
Designed and developed inventory and assay robotic system processes for High Throughput Screening.
BIOGRAPHICAL SKETCH
NAME: Pan, Zhiyu
POSITION TITLE: Scientific Programmer
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Drexel University; M.S.; 1993; Computer Science
Beijing Polytechnic University; B.S.; 1985; Computer Software
1985-1989 Software Engineer, National Laboratory of Pattern Recognition, Beijing, CHINA
1989-1991 Programmer, Dept. of Physiology, University of Pennsylvania, Philadelphia, PA
1992-1993 Programmer, HEM Pharmaceuticals Corporation, Philadelphia, PA
1993-1995 Software Engineer, Amiable Technologies Inc., Philadelphia, PA
1996-1997 Senior Member Technical Staff, NYMA Inc, JPL, NASA, Pasadena, CA
1997-1998 Software Engineer, National Semiconductor, Santa Clara, CA
1998-2000 Software Engineer, Triada Ltd., Foster City, CA
2000-2001 Sr. Software Engineer, E-Compare Corp., San Jose, CA
2000-2002 Sr. Software Engineer, Brokat Technologies, San Jose, CA
2002-present Scientific Programmer, Tularik Inc., South San Francisco, CA
BIOGRAPHICAL SKETCH
NAME
POSITION TITLE
Epic Junjie Ding
Software Engineer
Stanford University
University of Wisconsin-Madison
Tsinghua University, P.R.China
DEGREE
(if applicable)
MS
BS
YEAR(s)
1999-2001
1998-1999
1993-1998
FIELD OF STUDY
Computer Science
Engineering Physics
Engineering Physics
2001 – Software Engineer, Research Informatics Dept, Tularik Inc.
1999 – 2001 Research Assistant, Medical School, Stanford University
C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support).
Tularik Inc
2003 – present
Role: System Architect
Evaluate, integrate, and customize third-party drug discovery software in the area of compound
inventory, registration, assay data management and reporting, and chemo-informatics software in the
area of compound library generation, analysis and compound property calculation with in-house
developed software.
Tularik Inc
2001 – 2003
Role: Database Architect, System Architect
Design, develop, and maintain a drug discovery data management web application suite to keep track of
enterprise electronic data creation, collection, analysis, modification, storage, and reporting for each
phase of drug discovery. This includes the design of a corporate database to store drug
discovery data, along with information for validation, authorization, and signature of data operations. J2EE is
the main technology used, and XML is widely used in the system.
Stanford University
1999 – 2001 Role: Developer and Administrator
Design and develop a suite of tools for storing, retrieving, manipulating and analyzing gene sequence
and gene tree. This includes design of an Oracle relational database for the storage and algorithms for
gene tree analysis. The project was done for the Genetics Dept., Stanford University.
BIOGRAPHICAL SKETCH
NAME: Jayanthi Subramani
POSITION TITLE: Software Engineer
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
Government College Of Technology, Coimbatore, India; B.S.; 1996; Electronics and Communication Engineering
SSI Systems, Chennai, India; Post Graduate Diploma; 1997; Relational Database Management Systems
San Jose State University, San Jose, CA; M.S.; 2001
SOFTWARE ENGINEER, Tularik Inc., South San Francisco, CA (Mar 2002 - current)
SOFTWARE INTERN, Tularik Inc., South San Francisco, CA (Sep 2001 - Feb 2002)
SOFTWARE PROGRAMMER, Turbo Information Systems, Chennai, India, Dec 1996- May 98
SMIL-based graphical interface for Interactive TV: The paper was accepted and presented for the Internet
Imaging TV Conference at IS&T/SPIE’s Electronic Imaging 2003.
Discovery Framework: Designed and implemented a comprehensive, easy-to-use and easily reconfigurable Software
Platform to automate gathering, storing, organizing and analysing data for the purpose of drug discovery. The
technology used is Java: EJB, JMS, Servlets, JSP, Applets, JDBC, Oracle, and XML.
BIOGRAPHICAL SKETCH
NAME: Kaveri Charati
POSITION TITLE: Software Engineer
INSTITUTION; DEGREE (if applicable); YEAR(s); FIELD OF STUDY
K.L.E College of Engineering and Technology, Belgaum, India; BS; 1996; Electrical and Electronics Engineering
San Jose State University, San Jose, CA; MS; 2001
1996-1998 Lecturer, Motichand Lengade Bharatesh Polytechnic, Belgaum, India
2000-2001 Graduate Assistant, San Jose State University, San Jose, CA
2001-2001 Software Engineer, iCommerce, San Jose, CA
2002-2003 Software Engineer, Tularik Inc, South San Francisco, CA
C. Research Support. List selected ongoing or completed (during the last three years) research projects (federal and non-federal support). Begin with the projects that are most relevant to the research proposed in this application. Briefly indicate
the overall goals of the projects and your role (e.g. PI, Co-Investigator, Consultant) in the research project. Do not list award
amounts or percent effort in projects.
Project: Assay Registration
Role: Software Engineer
Currently developing an application to register assays (experiments) for High Throughput Screening (HTS) and Structure
Activity Relationship (SAR) projects. The registration information gathered includes data mining parameters, data analysis
configuration parameters and screening logistics. Through this application, scientists can register new assays, edit and update
registered assays, promote an assay to the next stage, etc.
Environment: Jakarta-tomcat, Oracle 8 RDBMS, JSP, Servlet, XML
Operating System: Red Hat Linux 7.1
Project: Structural Biology
Designed and developed a web application to provide a central repository to store protein preparation data and to keep track
of the different stages in protein preparation. Through this user-friendly web application, users can add requests to obtain
proteins, get status on their protein preparation via email, etc. The application also supports different types of users with
different privileges.
Environment: J2EE- JBoss, Jakarta-tomcat, Oracle 8 RDBMS, JSP, Servlet, XML
Project: Security Server
Designed and implemented a security system to handle authentication and integrated it with the current
architecture. This project aimed at developing a dedicated server to handle authentication and security
privileges. A centralized intranet web site was designed and developed to provide users with a common
interface to access various projects. The authentication and privilege information was maintained in
LDAP and Oracle.
Environment: J2EE- JBoss, Jakarta-tomcat, Oracle 8 RDBMS
Project: Search
Designed and developed a scalable and user-friendly application to allow scientists to mine drug discovery data generated
at various stages (high throughput screening, structure activity relationship, lead optimization, etc). The application
retrieves data based on a list of compounds provided as input. XML is used to configure data fields for each assay.
Environment: Jakarta-tomcat, Oracle 8 RDBMS, JSP, Servlet, XML
Project: XML database
Designed an XML Schema to accommodate scientific data, which can be classified as assay based,
process based, and protocol based. The aim of this project was to gather data in an XML format
according to the DTD to provide minimal processing and retrieval. This included designing the XML
database to store XML documents. The open source XML database eXist was used as the XML
repository. XML Authority and XML Spy were used to develop the schema.
Operating System: Mandrake Linux 8.2
Project: ISIS
Designed Oracle tables and views to be used by ISIS. This involved analyzing the existing Oracle
tables to extract the information to be viewed in ISIS, and creating new tables and views to allow scientists
to connect to the backend data through the ISIS interface.
Project: SAR Application
Designed and developed an application for structure activity relationship (SAR)
projects using Java and XML. The SAR data was maintained in XML format.
SAS was used to analyze and plot data, giving users a graphical analysis.
RESOURCES
FACILITIES: Specify the facilities to be used for the conduct of the proposed research. Indicate the performance sites and describe capacities,
pertinent capabilities, relative proximity, and extent of availability to the project. Under “Other,” identify support services such as machine shop,
electronics shop, and specify the extent to which they will be available to the project. Use continuation pages if necessary.
Laboratory:
Tularik Research Division has granted full access to its state-of-the-art labs, robotic equipment and computing
infrastructure for the proposed research in the area of drug discovery informatics. Worldwide, the company employs 400
people, 85 percent of whom are engaged in research and development. We currently have five subsidiaries, located in the
U.S. and Europe. Tularik has fully equipped laboratories for biology, chemistry and pharmacology, an extensive library,
and complete wiring for electronic communication and data transmission. All the research-related data generated by the
different laboratories will be managed through the Discovery informatics platform.
Clinical:
Animal:
Computer:
Tularik Research Informatics Department has a state-of-the-art supercomputer, Linux clusters, and various SUN and SGI
workstations. All these computing facilities, Oracle/MySQL servers and various statistics tools are fully accessible for
the proposed research in the area of algorithm development and high throughput genome scale annotation and data
repository.
Office:
Tularik has sufficient office space and equipment to support the activities for the proposed research.
Other:
Over the years, Tularik has developed novel algorithms, database schemas and data flow architecture. All these utilities
and proprietary databases, such as internal genome databases, microarray databases and amplicon databases, can be
used for algorithm development and data analysis.
MAJOR EQUIPMENT: List the most important equipment items already available for this project, noting the location and pertinent capabilities of each.
Linux cluster and parallel computing environment; Tularik Inc; High throughput data analysis
Paracel super computer; Tularik Inc; High throughput HMM data analysis
Oracle/MySQL database server; Tularik Inc; Data repository
Genome Server; Tularik Inc; Mirrors all the public domain genome data and encapsulates Tularik proprietary genomic content
Various robotic equipment; Tularik Inc.; High throughput screening and combinatorial chemistry
Wet lab facilities; Tularik Inc; Supports state-of-the-art molecular and cellular lab work
Research Plan
This proposal contains proprietary information.
A. Specific Aims
The genomics revolution and other rapid advances in technologies,
such as combinatorial chemistry, high throughput drug screening, and
computer aided drug design, demand efficient high throughput data
management and powerful computing application support. Operating at
the crossroads of biomedical research and computing innovation, the
Tularik Research Informatics team has pioneered, within the
pharmaceutical industry, the integration of cutting edge enterprise
technologies, including J2EE, Microsoft .NET, and Linux clusters, to meet
the ever-increasing scale and complexity of discovery research. The
Tularik Discovery Informatics platform (http://discovery.tularik.com) has
been a successful prototype, providing a powerful, flexible
infrastructure that promotes workflow efficiency in a high throughput,
collaborative discovery environment. To continue this development so
that the Discovery Informatics platform ultimately becomes
generalizable, scalable, extensible and interoperable, we propose the
following approaches.
1. Architect scalable and robust high throughput enterprise
computing infrastructures.
The Java 2 Platform, Enterprise Edition (J2EE), Microsoft .NET
and high throughput/performance computing (HTC/HPC)
technologies will be made interoperable to build a state-of-the-art
Discovery Informatics platform.
2. Integrate various standalone robotics applications into
networked automated Discovery pipelines.
Thanks to technological innovations, robotics and automation are
now essential in various stages of the drug discovery
process. The Discovery Informatics platform will integrate robotic
vendor proprietary software through Microsoft .NET Web
Services to automate inter-robotic data management and
mechanical operations.
3. Automate high throughput discovery workflows.
The Tularik Discovery Informatics platform has automated data
analysis and data management in the areas of array-based
comparative genomic hybridization, high throughput screening
(HTS), structure activity relationship (SAR) and ADMET. Additional
machine learning algorithms and visualization modules will be
developed to automatically extract knowledge, e.g. novel
compound structural motifs, from large-scale bioassay databases.
The Discovery Informatics platform will integrate computational
chemistry approaches for parallel drug lead optimization of potency,
selectivity, and ADMET properties.
4. Integrate in silico drug lead seeking, explosion and
optimization processes into the high throughput Discovery
platform.
Integrate ligand or receptor based virtual screening algorithms into
the Discovery platform to increase the throughput for drug lead
seeking, explosion and optimization. Algorithms, including proper
compound filters, will be developed to create a 1 billion-member
virtual screening library.
5. Standardize the informatics data flow and implement
interoperable service-oriented computing architecture.
Tularik will work with I3C (Interoperable Informatics Infrastructure
Consortium) to adopt and enact proper standardizations for data
flow in the areas of genomics, biological pathway, compound
acquisition, compound inventory, lead discovery and optimization.
The current Tularik Discovery platform hosts various XML based
J2EE and .NET distributed applications, providing a solid
foundation to extend to the service-oriented computing architecture.
6. Establish industrialized software configuration management
(SCM) mechanisms for application build and deployment.
The Discovery platform has evolved to ensure code portability,
robust builds and easy deployment to Tularik’s worldwide campuses
and relevant research communities. The Discovery platform will
continue to improve through the use of open source standards and
applications. These developments will make the Tularik Discovery
Informatics platform generalizable, scalable, extensible and
interoperable for the entire biomedical research community.
B. Background and Significance
J2EE, Microsoft .NET enterprise platforms and their implications in
pharmaceutical research informatics
Both the J2EE (Java 2 Platform, Enterprise Edition) and Microsoft .NET
platforms offer scalable technologies that simplify multitier
distributed application architecture and Web Service development. J2EE is
platform neutral and has “Write Once, Run Anywhere” portability, while .NET
is restricted to the Windows platform. Pharmaceutical
research enterprises have very dynamic use cases and requirements,
demanding fast informatics development and deployment to gain a
competitive advantage. J2EE and Microsoft .NET offer enterprise
solutions that promise to scale up data management, deliver robust
business services, and accelerate the discovery process.
Robotics and industrialized drug discovery process
Modern high throughput methodologies and robotics symbolize the
industrialization of the drug discovery process. Large investments in robotic
instruments are justified not only on the basis of the labor cost savings,
operational precision and higher throughput, but also by the financial impact
of shortening the drug discovery time lines. Almost all robotics come with
some form of scheduling software in addition to the required system
management software. Most, if not all, of the robotic applications are
dependent on the Microsoft Windows platform. The interoperability between
different robotics can be difficult due to the proprietary nature of the vendor
software, and the distribution of robotic applications on PC workstations with
different Windows versions. All of these may create operational bottlenecks
once different vendor robotics are required to work interactively.
High throughput discovery data management
The publication of the human genome sequence and advances in
biomedical and robotic technology have significantly increased the volume
and complexity of drug discovery data. In the area of target identification
and validation, microarray applications have become an indispensable and
routine process. Modern combinatorial chemistry (CC) allows the automated
explosion of a large number of compounds across many drug discovery
programs. Compound acquisition, registration, receiving, inventory and
distribution are very complex processes, yet data transactions need to be
robust, flexible and real time. With the advent of modern high throughput
bioassays and robotics, HTS platforms are capable of screening more than
ten thousand compounds per day. Thus, developing comprehensive
customized informatics packages for discovery research has become a
formidable endeavor.
There are commercially available software packages for various types of
discovery applications, including GeneSpring, Spotfire, ActivityBase, MDL
Information Systems, Oxford Molecular, Tripos, and Accelrys Accord Enterprise.
In spite of the promises from the vendors, none of these packages can supply
a complete solution. Highly complicated and dynamic situations require high
levels of cooperation between the various teams that produce and consume
information and those that develop or integrate the software applications.
Every consideration should be given to the ease of use and flexibility, but
inevitably the flexibility desired by the users must be carefully balanced
against both the flexibility allowed by the automated assay systems and the IT
cost of maintaining overly complex software.
Targeted molecular diversity in drug discovery: in silico lead seeking,
explosion and optimization
In silico screening of combinatorial libraries prior to synthesis promises
to be a valuable aid to lead discovery (Lyne 2002). Although still an
evolving approach, virtual screening (VS) can serve as a complementary
approach to experimental screening (HTS). When coupled with structural
biology, virtual screening has emerged as an efficient, cost-effective
means of identifying lead molecules (Figure 1). Broadly speaking, virtual
screening methods can be classified as either ligand-based or
receptor-based. Ligand-based methods extract the common structural motif
(Mestres and Knegtel 2000), similar pharmacophore (Mason et al. 2001), and
3D shape (Srinivasan et al. 2002) from known active compounds to screen
for additional compounds with similar properties.
Figure 1. Compound library diversity analysis
and lead explosion. The molecules in the full
library are marked as circles according to two
computational molecular descriptors. For a
molecule to be active, it must lie in a grey area
on the graph. The red circles indicate the
elements of a diverse sublibrary. Lead
explosion: The gold circles show how a directed
sublibrary can be expanded around an active
library member.
The receptor-based approach docks the compound library to the
predetermined target structure and prioritizes compounds according to the
quality of the fit to the target-binding site. Advances in high throughput
computing infrastructure have made it technically feasible for
computational chemistry to analyze large databases of chemical compounds
for lead seeking, explosion and optimization.
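As a rough illustration of the ligand-based idea, the sketch below ranks
library compounds by their similarity to a known active. It is only a
sketch under stated assumptions: the integer feature sets stand in for
real fingerprints, the compound names are hypothetical, and the Tanimoto
coefficient is a commonly used similarity measure chosen here for
illustration rather than the method of the cited references.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(active, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(active, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints: each integer stands for one structural feature.
known_active = frozenset({1, 2, 3, 4})
library = {
    "cmpd_A": frozenset({1, 2, 3, 4}),   # same features as the active
    "cmpd_B": frozenset({1, 2, 5, 6}),   # shares two features
    "cmpd_C": frozenset({7, 8}),         # shares none
}
ranking = rank_by_similarity(known_active, library)
```

The top of such a ranking would be the candidates passed on to docking
or experimental screening.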
Standardization and interoperability
Industrialized drug discovery depends strongly on information exchange.
Automated robotic applications in the fields of chemical synthesis,
biological assay, drug metabolism, and even protein crystallography have
resulted in an explosion of data. The discovery data sets are very
difficult to relate to each other because they come from heterogeneous
sources, in many formats and file types, and are typically dispersed across
incompatible IT systems (Attwood 2000). The problem of bringing together
heterogeneous and distributed systems is known as the interoperability
problem. The current approach to achieving data interoperability is mainly
to write ad-hoc data interface programs for each pair of communicating
systems, resulting in bottlenecks due to the continuous development and
maintenance of these programs.
Ontology mediated document exchange (Ashburner et al. 2000; Pouliot et
al. 2001) provides a standard conceptual specification and a solution for
this scalability problem. The data model of the source contents is aligned
with the representation specified by the ontological standardization and
transformed accordingly in both directions.
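The scalability argument can be made concrete with a small calculation.
The sketch below assumes that each pair of communicating systems needs
one bidirectional interface program, and that a mediated design needs
one adapter per system to map to and from the shared representation;
both counting conventions are simplifying assumptions.

```python
def pairwise_interfaces(n_systems: int) -> int:
    """Point-to-point interfaces needed so every pair of systems can talk."""
    return n_systems * (n_systems - 1) // 2


def mediated_adapters(n_systems: int) -> int:
    """Adapters needed when each system maps to and from one shared model."""
    return n_systems


for n in (5, 10, 20):
    print(n, pairwise_interfaces(n), mediated_adapters(n))
```

Under these assumptions, 20 systems require 190 pairwise interface
programs but only 20 adapters to a shared model, so the maintenance
burden grows quadratically in one case and linearly in the other.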
XML (eXtensible Markup Language), the W3C standard for semi-structured
data representation, has become the industry standard for data
exchange among web based enterprise applications. In the life sciences, the
Interoperable Informatics Infrastructure Consortium (I3C) was formed in
2001 to promote global, vendor-neutral informatics solutions to accelerate
discovery and product development.
Web Services
Web Services provide a standard means of interoperating between
different software applications running on a variety of platforms and/or
frameworks. Web Services promise to greatly increase interoperability and
ease data exchange even as they lower costs. The impact of Web Services on
the IT industry is expected to be profound. Specifically: 1). By
lowering the cost of software integration between systems, Web Services
offer a way to maintain and integrate legacy IT systems at a lower cost than
typical Enterprise Application Integration (EAI) efforts. 2). By allowing
software running on different platforms to communicate, Web Services enable
interoperability between multiple platforms running on everything from
mainframes to servers to desktops to PDAs. 3). By employing universal,
nonproprietary standards, Web Services dramatically lower the IT costs of
collaborating with external partners, vendors and clients.
Software configuration management
One of the key issues in enterprise software development is how to
manage the software code repository. Any software configuration
management (SCM) scheme requires a central system and a robust
architecture for tracking and deploying software packages throughout
their entire lifecycle. Specifically, the SCM should: 1). Maintain source
code under revision control. 2). Manage code dependencies. 3). Manage
builds and build dependencies. 4). Manage dependencies on third-party
libraries. Enterprise platforms such as J2EE and Microsoft .NET demand a
comprehensive SCM, which requires the integration of proper code
management and revision tools, and a carefully planned source code tree.
C. Preliminary Studies
This section summarizes the preliminary studies for each of the Specific
Aims. The detailed design and procedures are described in the next
section, “Research Design and Methods.”
Specific Aim 1: Architect scalable and robust high throughput
enterprise computing infrastructures.
Over the past two years, Tularik has built and deployed several Linux
Farms for different applications: a 440-processor cluster for high throughput
computational chemistry, a 150-processor cluster for genome-scale
bioinformatics computing, and a 100-processor cluster for J2EE enterprise
applications (Figure 2).
Old vs. New: time taken to BLAST raw mouse genomic sequence reads against
the human genome database.

                                    Darwin            Linux Cluster
                                    (SGI Irix 6.5)    (RedHat 7.1)
1 mouse sequence                    1 minute          10 seconds
1000 mouse sequences                15 hours          3 minutes
All mouse genomic sequence reads
at NCBI (22 million reads)          38 years          34 days

Note: Darwin is a heavily used machine, so this is not a machine-to-machine
comparison. However, it does accurately reflect the environment in which
these computers are used.

Figure 2. Left: Linux cluster benchmark. Right: High availability of the database clusters.
The high throughput computing for computational biology and
chemistry was set up through PBS (Portable Batch System,
http://www.openpbs.org), which improved utilization of overall computing
resources (CPU) from less than 20% to over 90%. These computing
environments significantly strengthen Tularik’s scientific computing
capability in the areas of computational chemistry (Waszkowycz et al. 2001)
and functional genomics (Li et al. 2003; Li et al. 2002).
The distribution and clustering of nodes for J2EE enterprise applications
were managed through application servers including BEA WebLogic and
open source JBOSS (Figure 3).
Figure 3. Tularik J2EE Discovery Architecture. (Diagram labels: Data
Central, JMS, Reader, Application Controller, Cluster, Oracle Server, XSLT
Transformer, Central Controller, XML Objects, Client, Security Server, Web
Server, LDAP, Rule Server.)
Specific Aim 2: Integrate various standalone robotic applications into
networked automated Discovery pipelines.
Most robotic instruments are not interoperable with one another; each
performs a unique application on a vendor-proprietary platform in an
incompatible computing environment. Thus, workflow integration of
functionally relevant robotics has largely remained labor-intensive manual
work, impeding operational productivity. As a pilot program, Tularik has
successfully completed robotic workflow integration in the area of
combinatorial chemistry. As shown in Figure 4, a centralized software
program and associated robotic interfacing drivers have been developed to
integrate the robotics as components of an automated workflow pipeline,
enabling automated synthesis, purification and quantitation in the
combi-chem process.
Figure 4. Integration of liquid handler, LCMS, HPLC robotics instruments
into automated high throughput combi-chem synthesis pipeline.
Instruments shown:
•Tecan Genesis RSP 100: library synthesis; fraction pooling
•Biotage Parallex Flex 2-Channel Prep-HPLC: two channels scalable to four
in parallel; each channel independently controlled; monitored at 220 and
254 nm simultaneously
•PE Sciex High-Throughput Analytical HPLC: teamed with a Gilson 215 Liquid
Handler; confirmation of product mass and purity
Workflow modules named in the diagram include AccordConv, SubMaster,
VialMaster, BioMaster, SynMaster, SciexMaster, PureConv, BioConv,
SciexConv, PurityConv and PurityReport.
Specific Aim 3: Systemize the high throughput discovery workflows to
reveal “knowledge” from the raw data.
As summarized in Figure 5, the Tularik Discovery Informatics platform has:
1). Established the genomics infrastructure integrating Ensembl (Clamp et
al. 2003; Hubbard et al. 2002) utilities and MySQL databases. 2). Automated
the data management in the areas of array-based comparative genomic
hybridization, HTS, SAR and ADMET. Rational Rose, Borland Together
Control Center, and Microsoft .NET ARCHITECT have been utilized to
design and architect the Discovery informatics platform using the Unified
Modeling Language (UML). Both the XML and Oracle databases have been
rigorously designed to enforce data integrity on various business entities,
associated properties, and relationships in the areas of druggable target
identification and validation, compound acquisition and inventory, and
bioassays (HTS, SAR and ADMET).
Figure 5. Discovery Informatics platform use cases and enterprise services.
(Diagram title: Types of Data, Links, Applications. Labeled areas include
Data Management, MicroArray, Structural Biology, Personnel, Legal,
Acquisition, Genome, Clinical, Virtual Screening, Storage, Targets
(Identification, Validation), HTS, Reports, Assays, Chemistry, Inventory,
Compounds, SAR, In vivo, ADMET and Processes. Annotated items include the
T-number database, biological results from HTS, equipment scheduling,
chemical characterization, biological assays, CYP,
solubility/permeability, cytotoxicity, PK, pharmacology, and special PK
experiments.)
A successful prototype, the Tularik Discovery Platform (Figure 6), has been
developed based upon the J2EE architecture (Figure 3) and high throughput
computing Linux clusters (Figure 2). Using high throughput screening or
Structure Activity Relationship analysis as examples, data generated from
robotic readers are stored via a Samba server directly on the data central
server, which has a rigorous backup mechanism for disaster recovery.
http://discovery.tularik.com is the centralized gateway to the Discovery
platform application suites, enterprise services and data sets. User logins
are authenticated through an LDAP server and a security application server.
The user-profiling process is flexible enough to distinguish different user
types and dynamically configures the web interface to authorize different
categories of applications based upon the user’s predefined privileges.
Since efficient data analysis and advanced data visualization demand
powerful and flexible application support, the 100 CPU Linux cluster has
been dedicated to powering live web transactions and Web Services. Third
party software such as ISIS can be utilized to open a window to the
corporate database to provide additional data analysis and visualization
capabilities.
Figure 6. Discovery Platform – Informatics gateway. (Diagram labels:
Corporate Database, Inventory, Data Analysis, Raw Data, Legacy Database,
Results, Official Report, HVIEW, Reconfiguration, SQL, Point To Point,
ISIS/HOST, Live Data Tracking, Bruno, Reports, View, Browser; arrows
labeled upload, invoke and update.)
TMAX (Tularik MicroArray eXplorer, Figure 7), an essential component of
the Tularik Discovery platform, is Tularik’s in-house microarray data
storage and analysis software solution. TMAX has satisfied Tularik’s
scaled-up microarray operation in the area of gene expression and genomic
amplification/deletion research.
Figure 7. TMAX has a flexible, technology-independent design that handles
data in a variety of formats including Incyte, Affymetrix, Scanalyze,
Genepix, Rosetta, Motorola, and simple spreadsheets. TMAX is comprised of
12 front-end applications plus a database server and several administrator
support tools. Currently, TMAX contains more than 14 million data points.
(Panels shown: TMAX ChipViewer, TMAX ChipCluster, TMAX ChipPlotter, TMAX
GenomeScan.)
Specific Aim 4: Integrate in silico drug lead seeking, explosion and
optimization processes into the high throughput Discovery platform.
Tularik has acquired and integrated Protherics (Li et al. 1998;
Liebeschuetz et al. 2002; Waszkowycz et al. 2001) virtual screening
technology as a cost-effective approach to enable in silico lead seeking,
explosion and optimization (Figure 8). The recent setup of the 440 CPU
Linux cluster has solved the computing deficit challenge: the computational
cost has been reduced to the order of one minute of CPU time per ligand per
processor.
Figure 8. Tularik Protherics virtual
screen platform.
Specific Aim 5: Standardize the informatics data flow and implement
interoperable service-oriented computing architecture.
Based upon a careful use case study (Figure 5), the Discovery Informatics
platform has been standardized around a unified data model that
encapsulates discovery data sets. The architecture (Figure 3) relies
heavily on XML and Java technologies and tools. Through the use of a
unified Discovery Java class library and XML serialization of objects, the
Discovery platform provides a common data model and a common data exchange
format. Central to the Discovery platform is a Discovery XML Document Type
Definition (DTD) and a corresponding Java Object Model that facilitate data
exchange, data integration and data transformation between components.
Based upon this lightweight architecture, the entire Discovery platform has
been designed and configured through the Discovery XML Document Type
Definition. The Discovery architecture has allowed significant system
flexibility, enabled rapid system integration and application
interoperability, and eased the burden of schema and use case evolution.
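The pairing of an object model with XML serialization can be sketched as
a round trip between an object and its XML form. This is a minimal
illustration only: the platform described here uses a Java class library
and a DTD that are not shown in this document, so the `AssayResult`
class, its fields, and the element names below are hypothetical, and
Python stands in for Java.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class AssayResult:
    """Hypothetical business entity; field names are illustrative only."""
    compound_id: str
    assay: str
    ic50_nm: float

    def to_xml(self) -> str:
        """Serialize this object to an XML string."""
        elem = ET.Element("assayResult", {"compoundId": self.compound_id})
        ET.SubElement(elem, "assay").text = self.assay
        ET.SubElement(elem, "ic50nM").text = repr(self.ic50_nm)
        return ET.tostring(elem, encoding="unicode")

    @classmethod
    def from_xml(cls, text: str) -> "AssayResult":
        """Rebuild the object from the XML produced by to_xml."""
        elem = ET.fromstring(text)
        return cls(
            compound_id=elem.attrib["compoundId"],
            assay=elem.findtext("assay"),
            ic50_nm=float(elem.findtext("ic50nM")),
        )


original = AssayResult("T-0001", "kinase_panel", 12.5)
round_tripped = AssayResult.from_xml(original.to_xml())
```

Because every component reads and writes the same XML form, a new
component only needs to understand that one format rather than each
peer's internal representation.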
Microsoft Windows has been widely used to host various in-house or
vendor-specific applications, including most robotic management software.
Interoperability between Windows applications and the mainframe J2EE
Discovery platform needs to be solved in order to scale up and completely
automate the Discovery informatics operations. Web Services technology
provides a solution for this purpose. Both the Microsoft .NET and J2EE
platforms support Web Services, enabling SOAP over HTTP for XML based data
transactions and services. Resources and efforts have been allocated to
pilot the adoption of Microsoft .NET technology to integrate Windows
applications as Web Services into the J2EE Discovery platform (Figure 9).
The pilot project has been designed and implemented to launch a bioassay
reporting Web Service. Specifically, upon a service request from the
J2EE Discovery platform, the .NET Windows server extracts the compound 2D
chemical structure via integrated MDL Windows applications and combines
both the compound structure and the assay data points into a comprehensive
Microsoft Office Excel report. The success of this pilot development helped
to finalize our decision to introduce Microsoft .NET as a secondary
mainframe platform and to use Web Services to address the interoperability
issues in the Discovery operations.
Figure 9. Web Services promise interoperability across application
platforms. (Diagram: Windows applications on the .NET platform and the J2EE
Discovery platform exchange XML via SOAP over HTTP across the Internet.)
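To make "SOAP over HTTP for XML based data transactions" concrete, the
sketch below builds the kind of request envelope a client might send to
a bioassay reporting service. The SOAP 1.1 envelope namespace is
standard, but the service namespace, the `GetAssayReport` operation, and
its parameters are hypothetical; the actual pilot interface is not
described in this document.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:bioassay-report"  # hypothetical service namespace


def build_report_request(compound_id: str, assay: str) -> bytes:
    """Build a SOAP 1.1 request for a hypothetical GetAssayReport call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetAssayReport")
    ET.SubElement(call, f"{{{SERVICE_NS}}}compoundId").text = compound_id
    ET.SubElement(call, f"{{{SERVICE_NS}}}assay").text = assay
    return ET.tostring(envelope, encoding="utf-8")


request_bytes = build_report_request("T-0001", "kinase_panel")
# The bytes would be sent in an HTTP POST to the service endpoint; the
# endpoint URL and operation name depend on the deployed service.
```

Since the request is plain XML over HTTP, the caller does not need to
know whether the service is implemented on .NET or J2EE.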
Specific Aim 6: Establish industrialized software configuration
management (SCM) mechanisms for application build and deployment.
The open source Concurrent Versions System (CVS) has been
introduced in house to manage software code versioning and code
sharing among project team members. The open source tools Ant and NAnt
have been introduced to automate the nightly build and deployment for the
J2EE and .NET Discovery platforms, respectively. During Discovery platform
development, reticulate interdependencies arose within the source code tree
when the tree grew beyond a certain point of complexity. This package
interdependency problem has been carefully examined and resolved. The
source code tree has since been restructured, and code management policies
have been established to govern the application development and check-in
process so as to avoid circular code interdependencies.
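One way a check-in policy against circular interdependencies could be
enforced automatically is to run a cycle check over the package
dependency graph. The sketch below is an assumption about how such a
check might look, not a description of the tooling actually used; the
package names are hypothetical.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of packages, or None.

    deps maps each package to the packages it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {pkg: WHITE for pkg in deps}
    path = []

    def visit(pkg):
        color[pkg] = GREY
        path.append(pkg)
        for dep in deps.get(pkg, ()):
            state = color.get(dep, WHITE)
            if state == GREY:                      # back edge: cycle found
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[pkg] = BLACK
        return None

    for pkg in list(deps):
        if color[pkg] == WHITE:
            cycle = visit(pkg)
            if cycle:
                return cycle
    return None


# Hypothetical source tree in which two packages depend on each other.
tree = {
    "discovery.ui": ["discovery.model"],
    "discovery.model": ["discovery.io"],
    "discovery.io": ["discovery.model"],
}
cycle = find_cycle(tree)
```

A nightly build could fail fast when `find_cycle` returns a non-empty
result, reporting the offending packages before the cycle spreads.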
D. Research Design and Methods
Specific Aim 1: Architect scalable and robust high throughput
enterprise computing infrastructures.
Rationale:
As a result of the prototyping work on the Discovery platform
(http://discovery.tularik.com), we have accumulated significant experience
and technical know-how in the areas of J2EE and Microsoft .NET
enterprise application architecture, XML based data transaction and
validation, and Linux cluster high throughput environment setup. The goal
of this specific aim is to create an architecture that consolidates
different enterprise technologies into a mature Discovery informatics
platform.
Design and Methods:
Although both are powered by the Linux operating system, the J2EE cluster
and the PBS (Portable Batch System) configured cluster are currently
functionally related but mechanistically separate units within the
Discovery platform. Integrating the high throughput computing PBS farm with
the high performance computing J2EE farm will harness the advantages of
both systems, boost productivity and streamline the operational process.
To achieve this, we will deploy a J2EE application server on the master
node of the PBS farm and develop a Java driver object to wrap the PBS
computing job management functionalities. As a result, the PBS farm master
node will become one application node in the J2EE farm, and thus the entire
PBS farm can be harnessed upon computational request from the J2EE central
controller. Another approach is to use the J2EE application server on the
PBS farm master node to wrap the PBS job management utilities into a Web
Service. In principle, either approach should enable interoperability
between these two computing platforms.
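The driver object described above would essentially translate a
computational request into a PBS job submission. The sketch below shows
that translation step only, in Python rather than the Java planned here;
the script name, job name and queue are hypothetical, and the command is
assembled but not executed.

```python
import shlex


def build_qsub_command(script, job_name, nodes=1, queue=None):
    """Build the argument list for an OpenPBS ``qsub`` submission.

    Only the command is assembled here; a driver object or Web Service
    would run it on the PBS master node and capture the returned job id.
    """
    cmd = ["qsub", "-N", job_name, "-l", f"nodes={nodes}"]
    if queue:
        cmd += ["-q", queue]
    cmd.append(script)
    return cmd


cmd = build_qsub_command("dock_batch.sh", "vs_dock", nodes=4, queue="chem")
print(shlex.join(cmd))
```

Keeping command construction separate from execution makes the wrapper
easy to test without access to a PBS farm, whichever of the two
integration approaches is chosen.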
Based upon our in-house .NET prototyping efforts, the Microsoft .NET
architecture implementation has proven to be robust, scalable and
interoperable. Since Tularik has decided to switch from the Mac to the
Microsoft Windows platform, .NET technology will provide added value to
Tularik if leveraged properly during this desktop transition. With the
potentially large number of Windows-based machines after the Mac to Windows
platform transition, a new high throughput computing resource, a Microsoft
.NET cluster, can emerge. Microsoft .NET will integrate these PC processors
into a high throughput computing cluster for applications that require
parallelism, fault tolerance, and load balancing. Thanks to the relatively
homogeneous Windows computing environment, the integration overhead will
be compensated by simplicity, ease of Windows programming, immediacy of
execution, and desktop integration.
The Discovery Informatics platform depends heavily on XML for data
processing and data transaction. We will integrate a native XML database
into the Discovery platform. After careful evaluation of the available XML
databases on the market, we have chosen eXist, an open source native XML
database, to support Discovery XML transactions. The eXist database
features efficient, index-based XPath query processing, extensions for
keyword search and tight integration with existing XML development tools.
The database is
lightweight, completely written in Java and may be easily deployed in a
number of ways, running either as a stand-alone server process, inside a
servlet-engine or directly embedded into an application. We believe the
integration of the eXist XML database will significantly improve the scalability
of the Discovery Informatics platform.
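The kind of XPath query such a database would serve can be illustrated
on a small in-memory document. The sketch below uses the XPath subset in
the Python standard library rather than eXist itself, and the document
structure and element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical Discovery XML fragment; element names are illustrative only.
DOC = """
<discovery>
  <compound id="T-0001"><assay name="HTS">active</assay></compound>
  <compound id="T-0002"><assay name="HTS">inactive</assay></compound>
  <compound id="T-0003"><assay name="HTS">active</assay></compound>
</discovery>
"""

root = ET.fromstring(DOC)

# XPath-style predicate: compounds with an <assay> child whose text is "active".
active_ids = [c.get("id") for c in root.findall("./compound[assay='active']")]
```

A native XML database answers the same style of query across many
stored documents, using indexes instead of scanning each document.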
Advantages and limitations:
Despite the fact that J2EE, .NET, high throughput and high performance
computing (HTC/HPC), and native XML database technologies promise to
offer a competitive edge, coping with heterogeneous computing platforms and
environments may impose significant operational and integration overhead
and challenges. We will focus our initial efforts on the scaling up of Microsoft
.NET integration of mission critical Windows applications into the Discovery
Informatics platform.
Specific Aim 2: Integrate various standalone robotics applications into
networked automated Discovery pipelines.
Rationale: Most robotics driver applications remain proprietary,
Win32-dependent, and not interoperable, giving rise to bottlenecks in
operation and application integration. The Discovery Informatics platform
will integrate the standalone third party vendor software through Microsoft
.NET Web Services to automate inter-robotic data management and mechanical
operations.
Figure 10. .NET robotics integration Architecture Schematic. (Diagram
layers: User Interface; Application Layer with Application Module, VB
defined Procedures and C# Objects; ActiveX/TCP/IP Link; Driver Layer with
Device Drivers; RS232/485 and IEEE488 Link; Hardware Layer with robotics
Devices. Web Services connect to an XML DB and Oracle Servers.)
Design and Methods: The Microsoft .NET framework will be deployed to PCs
hosting robotics Windows applications (Figure 10). The .NET applications
will integrate the three main components of the robot-specific platforms:
the robot-specific application; the hardware modules of the mechanical
system; and the module level software. The .NET integration will empower
the robotic instruments to provide Web Services that allow live data
transactions and interoperation with different enterprise applications and
equipment. To assemble a series of different types of robotic equipment
into a pipeline fulfilling specific discovery needs, one master program
will be developed to coordinate the various Web Services provided by the
different robotic .NET servers.
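The role of the master program can be sketched as a coordinator that
passes a batch through each robot's service in order. In this
illustration each function merely stands in for a Web Service call; the
step names, payload keys and plate identifier are hypothetical.

```python
from typing import Callable, Dict, List

# Each step stands in for a call to one robot's Web Service.
Step = Callable[[Dict], Dict]


def synthesize(batch: Dict) -> Dict:
    return {**batch, "synthesized": True}


def purify(batch: Dict) -> Dict:
    if not batch.get("synthesized"):
        raise RuntimeError("purification requested before synthesis")
    return {**batch, "purified": True}


def analyze(batch: Dict) -> Dict:
    return {**batch, "purity_checked": batch.get("purified", False)}


def run_pipeline(batch: Dict, steps: List[Step]) -> Dict:
    """Master program: pass the batch through each service in order."""
    for step in steps:
        batch = step(batch)
    return batch


result = run_pipeline({"plate": "P-001"}, [synthesize, purify, analyze])
```

Because the coordinator only depends on a common call signature, a robot
can be swapped or a step added without changing the other services.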
Specific Aim 3: Systemize the high throughput discovery workflows to
reveal “knowledge” from the raw data.
Rationale:
The Tularik Discovery Informatics platform has significantly automated data
analysis and data management in the areas of array-based comparative
genomic hybridization, HTS, SAR and ADMET. The total amount of data
continues to explode because of the high throughput nature of these
processes. An important goal of the Discovery platform is to assist
scientists in analyzing the raw data, thereby revealing new “knowledge”
that might otherwise have been missed.
Design and Methods:
Knowledge discovery is defined as “the non-trivial extraction of implicit,
unknown, and potentially useful information from data” (Frawley et al. 1991).
Often this information is not retrievable by standard techniques but is
uncovered through the use of Artificial Intelligence (AI) techniques. The
Discovery informatics platform will provide a user-friendly interface and
visualization modules to integrate data mining technologies, including
various multivariable classification, linear or non-linear regression,
expert system, and machine learning methods. Most likely these tools and
modules will be acquired through third party licenses. Middleware software
drivers and data exchange utilities need to be developed to bring these
technologies to the Discovery platform so that average scientists can be
empowered to utilize them for data mining purposes. The Discovery platform
has integrated commercial MATLAB and SAS toolboxes as mathematical
computation, analysis, visualization, and algorithm development utilities
for the automated SAR and ADMET data flow.
In addition, a database will be modeled to encapsulate the entire data
sets. The exploratory analysis will select appropriate descriptors, a step
that relies on a clear understanding of the scientific problem that one is
trying to solve. Patterns that can lead to reasonable prediction will be
discovered using relevant descriptors and managed in the knowledge Oracle
database. Once a pattern has been validated, de-convolution or data
visualization technologies are required to translate the abstract pattern,
such as a neural network pattern, so that scientists can take chemical or
biological actions. Despite the fact that the knowledge discovery process
depends heavily on similar statistical approaches, there is a lack of
commercially available solutions
suitable for browsing through large lists of data and enabling simultaneous
visual inspection and interrogation of various aspects of the data.
The Discovery Informatics platform will integrate algorithms from in-house
development or from commercial vendors such as MDL, Tripos and Accelrys to
computationally derive compound physicochemical and ADMET
properties to contribute to the knowledge database. These tools will be
available as protocols that run upon request. Further, the implementation
will allow the computational chemist to update the models as more validation
studies become available. In silico computed compound properties will be
loaded into the Discovery database and will be utilized as part of the
multiparametric optimization drug discovery strategy.
Due to the extreme complexity of the drug discovery process, caution
should be exercised in the process of knowledge discovery.
It is true that HTS data studies can discover knowledge, e.g. compound
structural patterns that are responsible for the bioactivities. However, at
the start of one’s data mining efforts, it is not known whether such
knowledge is present in the database, whether it can be effectively used,
or even whether patterns can be reasonably extracted. The issue is not a
lack of good computational science, but a matter of not having enough
underlying data.
The paradox of predictivity versus diversity can arise: the greater the
diversity of the data set, the smaller the chance that models with
predictive power can be uncovered; on the other hand, the information
content of the model (if it exists) will increase as the boundaries of the
space and the diversity of the subjects under investigation increase. Many
ADMET models are based upon small sets of chemical compounds (from tens to
hundreds) and are thus frequently dismissed as not significant by potential
users. A similar situation also exists with empirical observations.
That is why the value or utility of the Lipinski ‘Rule of Five’ (Lipinski
2000) has been questioned by many medicinal chemists. Since most of these
in silico efforts have not been well validated due to lack of data, project
team coordination is needed so that the results of these data sets are
appropriately integrated.
Specific Aim 4: Integrate in silico drug lead seeking, explosion and
optimization processes into the high throughput Discovery platform.
Rationale:
Considering the total lead-like molecular space, the percentage occupied
by compounds that current technologies have made and screened is quite
limited. Virtual screening on a Linux cluster has made it possible to
screen compounds that do not exist within the corporate inventory. In this
proposal, the goal is to generate a 1 billion member virtual screening
library to extend the database diversity and availability for the processes
of in silico lead seeking, explosion and optimization.
PHS 398 (Rev. 05/01)
60
Page _______
Research Plan
Design and Methods:
This 1 billion member virtual screening (VS) library will be generated using
a computational approach. The quality and usefulness of the virtual screening
library depend on the library diversity, the compounds’ ADMET properties
and synthetic chemistry accessibility.
We will define the list of scaffolds and reagents as the basis for the virtual
library construction. The compiled database should contain reagent lists,
covering the majority of relevant reagent classes. These reagents will be
stored in a compound 3D processed form, which will be available for routine
library construction. Drug-like scaffolds will be identified and compiled from
the literature or from our combinatorial chemistry experiences. The first-round
library will be built based on A+B reactions. Sybyl diversity analysis tools will
be evaluated to explore any particular properties applicable to the library
design. Compounds that violate Lipinski’s “Rule of Five” (Lipinski 2000) will
be flagged (but not excluded).
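The flag-but-keep policy for the "Rule of Five" can be sketched as follows. This is a minimal illustration only: the record type, the compound identifiers and the descriptor values are hypothetical, and the descriptors are assumed to have been computed upstream by whatever chemoinformatics tool is chosen. The thresholds are those of the published rule (molecular weight 500, logP 5, 5 hydrogen bond donors, 10 hydrogen bond acceptors).

```python
from dataclasses import dataclass, field

# Thresholds of Lipinski's "Rule of Five"; a flag is raised for each one exceeded.
MAX_MOL_WEIGHT = 500.0       # daltons
MAX_LOGP = 5.0               # calculated octanol-water partition coefficient
MAX_H_BOND_DONORS = 5
MAX_H_BOND_ACCEPTORS = 10


@dataclass
class VirtualCompound:
    """Hypothetical record for one enumerated library member."""
    compound_id: str
    mol_weight: float
    logp: float
    h_bond_donors: int
    h_bond_acceptors: int
    flags: list = field(default_factory=list)


def flag_rule_of_five(compound):
    """Attach one flag per violated criterion; the compound is never dropped."""
    violations = []
    if compound.mol_weight > MAX_MOL_WEIGHT:
        violations.append("MW>500")
    if compound.logp > MAX_LOGP:
        violations.append("logP>5")
    if compound.h_bond_donors > MAX_H_BOND_DONORS:
        violations.append("HBD>5")
    if compound.h_bond_acceptors > MAX_H_BOND_ACCEPTORS:
        violations.append("HBA>10")
    compound.flags = violations
    return compound


# Illustrative values only; they do not describe real compounds.
library = [
    VirtualCompound("VS-0001", 342.4, 2.1, 2, 5),
    VirtualCompound("VS-0002", 612.7, 6.3, 1, 8),
]
flagged = [flag_rule_of_five(c) for c in library]
for c in flagged:
    print(c.compound_id, c.flags or "no violations")
```

Because every compound is returned with its flags attached, downstream filters can decide per project whether a flagged compound should still be screened.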
A database of viable scaffolds and routes appropriate for parallel chemistry
will be established. Software applications will be developed to integrate
various chemoinformatics filters to construct and store the drug-like libraries.
A graphical interface will be developed to facilitate enumeration and sampling
of the library, which will allow modelers to easily evaluate, optimize, and build
subset libraries when required. Because of the large size of this targeted VS
library, the current virtual screening data flow should be updated with additional
capability for screening virtual libraries on the 1 billion compound scale: the
goal is to access large synthetically accessible libraries for docking or
pharmacophore search.
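One way to enumerate and sample a library of this scale without materializing it is sketched below. It is an assumption of this sketch, not a description of the planned software, that A+B products can be represented as (reagent A, reagent B) pairs and that a uniform random subset is what a modeler wants; the reagent identifiers are hypothetical.

```python
import itertools
import random


def enumerate_products(reagents_a, reagents_b):
    """Lazily yield every (reagent A, reagent B) pair of an A+B library.

    itertools.product generates pairs on demand, so the full virtual
    library never has to be held in memory.
    """
    return itertools.product(reagents_a, reagents_b)


def reservoir_sample(items, k, seed=0):
    """Draw k items uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Hypothetical identifiers; real lists would come from the reagent database.
reagents_a = [f"A{i}" for i in range(200)]
reagents_b = [f"B{i}" for i in range(300)]

library_size = len(reagents_a) * len(reagents_b)
subset = reservoir_sample(enumerate_products(reagents_a, reagents_b), k=5)
print(library_size, subset)
```

The same single-pass sampling applies unchanged to larger reagent lists, since only the k sampled pairs are stored at any time.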
The 1 billion member virtual screening library, if built as projected, will
significantly expand the lead or drug like molecular space to improve the
overall in silico lead seeking, explosion and optimization processes.
Specific Aim 5: Standardize the informatics data flow and implement
Rationale:
We intend to define an interoperable service-oriented architecture and work
flow that will allow various applications to be provided in whatever language
and on whatever platform is most appropriate, while ensuring that
applications can inter-communicate seamlessly and be managed and
assembled with minimal effort.
Design and Methods:
[Figure 11. A service based Discovery architecture. Service layers labeled in the diagram: Presentation Services; Orchestration Services; Logical Services; Component Services; Data Integration Services; Database Services; Core Services; External Service Access Methods. Other labeled blocks include State Security and Data Management, Security Management and Dynamic Service Management. Data domains listed: Gene Expression; Gene sequence; Patents; References; Compounds; Reactions; Bio assay (HTS, SAR); ADMET.]
Discovery Informatics platform has focused on design approaches,
processes, and application tools supporting the concept that large software
systems can be assembled from independent, reusable collections of
functionality. Some of the functionalities, such as compound handling and
analysis, genomics data mining etc., may already be available and
implemented in house or acquired from a third party, while the remaining
functionalities may need to be created. We will implement a service oriented
architecture (Figure 11) to bring together all of these elements into a single,
coherent whole. Each service will provide access to a well-defined collection
of functionalities interacting with other services. The J2EE and .NET frameworks
make it feasible to implement this architecture through Web
Services. For example, robotics services have to improve performance,
availability and scalability by coordinating functionality executing on a
collection of distributed hardware. Handling of the service provider,
requestor, locator and broker will leverage the open source I3C’s Life Science
Identifier Resolution (LSIR) scheme using Web Services. The LSID URN
format template and examples are shown below:
urn:LSID:<AuthorityID>:<NamespaceID>:<ObjectID>[:<RevisionID>]
Examples:
urn:LSID:ebi.ac.uk:SWISS-PROT/accession:P34355:3
urn:LSID:rcsb.org:PDB:1D4X:22
urn:LSID:ncbi.nlm.nih.gov:GenBank/accession:NT_001063:2
urn:LSID:ibm.com:rowenfsdb:DAC8266B-9B9E-4CD3-853F-7DB764F9D2D3:1
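A minimal sketch of how a client might split such an identifier into its fields is given here. It follows only the template stated above; the class and function names are hypothetical and are not taken from the LSIR software.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(urn: str) -> Lsid:
    """Split an LSID URN into its fields, following the template in the text."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {urn!r}")
    scheme, nid = parts[0], parts[1]
    if scheme.lower() != "urn" or nid.lower() != "lsid":
        raise ValueError(f"not an LSID URN: {urn!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty field in LSID: {urn!r}")
    # The revision field is optional in the template.
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


for example in (
    "urn:LSID:ebi.ac.uk:SWISS-PROT/accession:P34355:3",
    "urn:LSID:rcsb.org:PDB:1D4X:22",
):
    print(parse_lsid(example))
```

A strict parser of this kind rejects identifiers with an empty or extra field, which helps catch malformed URNs before a resolution request is sent.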
LSID Client software will resolve access to data objects named using the
LSID format by discovering the network location of the LSID resolution
service using a combination of the Dynamic Delegation Discovery System
(DDDS) standard, DNS Naming Authority Pointer (NAPTR) records, the DNS
SRV standard and finally a number of web service interfaces. The grant co-applicant and I3C member Brian King is the author of the LSIR implementation and will
contribute the integration of LSIR into the Tularik Discovery Informatics
platform.
Our prototype http://discovery.tularik.com and its associated applications
have been built as a component based architecture, heavily relying on XML
and Java technologies and tools. We intend to leverage the previous
development, wrapping components into Web Services. The effective use of
XML as a serialization syntax for Java objects and as the inter-service
exchange format is key to the transition from a component based to a service
based design and architecture. We intend to share a common data model
and common data exchange format throughout the platform via ubiquitous
use of a unified XML and Java data model for data exchange and persistence.
This design allows the freedom to use different languages without risking
errors due to impedance mismatch between data models. Since XML-based
data transformation can be done with robust standard components and tools,
XML and Java validation catches errors early in development and improves
data quality.
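The idea of one shared exchange format with validation at the service boundary can be sketched as below. Python is used here only for brevity; the element names and the required-field list are hypothetical and do not represent the platform's actual data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared data model: the fields every compound message must carry.
REQUIRED_FIELDS = ("id", "smiles", "molWeight")


def compound_to_xml(record: dict) -> str:
    """Serialize one compound record to an XML string."""
    root = ET.Element("compound")
    for name in REQUIRED_FIELDS:
        ET.SubElement(root, name).text = str(record[name])
    return ET.tostring(root, encoding="unicode")


def compound_from_xml(xml_text: str) -> dict:
    """Parse and validate a message; raise ValueError if it breaks the model."""
    root = ET.fromstring(xml_text)
    if root.tag != "compound":
        raise ValueError(f"unexpected root element: {root.tag}")
    record = {}
    for name in REQUIRED_FIELDS:
        value = root.findtext(name)
        if value is None:
            raise ValueError(f"missing required element: {name}")
        record[name] = value
    record["molWeight"] = float(record["molWeight"])
    return record


message = compound_to_xml({"id": "T-000123", "smiles": "CCO", "molWeight": 46.07})
round_tripped = compound_from_xml(message)
print(message)
print(round_tripped)
```

A receiving service written in another language would apply the same checks to the same message, so a malformed record is rejected at the boundary rather than propagated into the database.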
The current Tularik Discovery platform hosts various XML based J2EE
and .NET distributed applications, providing solid foundations to extend to the
service-oriented computing architecture. Data modeling and workflow
standardization with open source and other research communities will be
important for developing interoperable platforms. Tularik will work with I3C
(Interoperable Informatics Infrastructure Consortium) to adopt and enact
proper standardizations for data flow involved in the areas of genomics,
biological pathway, compound acquisition, compound inventory, lead
discovery and optimization. Orchestration between various teams in an open
source project or consortium can be effort- and time-consuming. This overhead cost
needs to be projected in the informatics operations to ensure the timeliness of
the final delivery.
Specific Aim 6: Set up industrialized software configuration management
(SCM) mechanisms for application build and deployment.
Rationale:
Tularik has established a well-structured source tree of J2EE and .NET code
versioned in a CVS repository. Nightly build and deployment are managed
through Apache Ant and NAnt. Build and deploy operations have to scale
up and remain robust because of the ever-increasing number and
variety of application servers. Tularik’s current worldwide campuses
and operations demand global informatics support, including capabilities
to deploy enterprise solutions globally.
Design and Methods:
[Figure 12. Application publishing and assembler Web Services. Diagram labels: CVS Code Repository (code check out); Application Publishing Service; Application Assembler Service on the .NET Platform (Window Apps) and on the J2EE Discovery Platform; XML sent by SOAP over HTTP across the intranet and Internet.]
Both J2EE and .NET application servers allow “hot” deployment without
interruption of the enterprise services and applications. One application
server will be dedicated to providing the application publishing Web Service,
which nightly checks out code remotely from the CVS repository server,
builds the software packages and deploys them to the relevant machines worldwide
using SOAP over HTTP through the Internet or intranet. The .NET or J2EE platform servers host
the application assembler Web Service, which accepts the packages and deploys
them into the application server environment after validation
according to the predefined contract.
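The validation step on the assembler side can be sketched as follows. The contract used here (a manifest with name, version, target platform and a SHA-256 checksum of the package) is an assumption made for illustration; the proposal does not specify the contract's contents, and the function names are hypothetical.

```python
import hashlib


def build_manifest(name: str, version: str, target: str, payload: bytes) -> dict:
    """Publisher side: describe a package so that the receiver can check it."""
    return {
        "name": name,
        "version": version,
        "target": target,  # e.g. "j2ee" or "dotnet"
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def validate_package(manifest: dict, payload: bytes, accepted_targets: set) -> list:
    """Assembler side: return contract violations; an empty list means deployable."""
    problems = []
    for key in ("name", "version", "target", "sha256"):
        if not manifest.get(key):
            problems.append(f"missing manifest field: {key}")
    if manifest.get("target") not in accepted_targets:
        problems.append(f"unsupported target: {manifest.get('target')}")
    if manifest.get("sha256") != hashlib.sha256(payload).hexdigest():
        problems.append("payload checksum mismatch")
    return problems


payload = b"example archive bytes"
manifest = build_manifest("discovery-portal", "2003.06.17", "j2ee", payload)

print(validate_package(manifest, payload, {"j2ee"}))         # intact package
print(validate_package(manifest, payload + b"x", {"j2ee"}))  # corrupted in transit
```

Returning a list of violations rather than a single pass/fail value lets the publishing service log exactly why a nightly package was refused by a remote server.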
Data and software sharing:
Academic License Agreement:
All software, design methods, and analysis protocols developed through the
funding of this grant will be made publicly available for free use by the
biomedical researchers in academic universities and institutions. The
software is being provided on an 'as is' basis for non-commercial research
purposes. Please do not distribute the software, or any portion or derivative
thereof, beyond the academic organization. We are providing the software
without warranties and with no provisions for support or future enhancements.
Please note that Tularik Inc. and its employees have no liability in connection
with the use of the software.
Commercial License Agreement:
Commercial or corporate use of the relevant information and utilities requires
a signed license agreement from Tularik Inc. To get the appropriate forms
and detailed instructions for licensed use of the software packages, please
contact Terry Rosen, [email protected].
E. Human Subjects
N/A
F. Vertebrate Animals
N/A
G. Literature Cited
Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry,
A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig et al. 2000. Gene
ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nat Genet 25: 25-29.
Attwood, T. K. 2000. Genomics. The Babel of bioinformatics. Science 290:
471-473.
Clamp, M., D. Andrews, D. Barker, P. Bevan, G. Cameron, Y. Chen, L.
Clark, T. Cox, J. Cuff, V. Curwen et al. 2003. Ensembl 2002:
accommodating comparative genomics. Nucleic Acids Res 31: 38-42.
Frawley, W. J., G. Piatetsky-Shapiro and C. Matheus. 1991. Knowledge
Discovery In Databases: An Overview. In Knowledge Discovery In
Databases. AAAI Press/MIT Press.
Hubbard, T., D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox,
J. Cuff, V. Curwen, T. Down et al. 2002. The Ensembl genome database
project. Nucleic Acids Res 30: 38-41.
Li, J., C. W. Murray, B. Waszkowycz and S. C. Young. 1998. Targeted
molecular diversity in drug discovery: integration of structure-based design
and combinatorial chemistry. Drug Discov Today 3: 105-112.
Li, S., G. Cutler, J. J. Liu, T. Hoey, C. Chen, P. G. Schultz, J. Liao and X.
B. Ling. 2003. A comparative analysis of HGSC and Celera human
genome assemblies and gene sets. Bioinformatics In press.
Li, S., J. Liao, G. Cutler, T. Hoey, J. B. Hogenesch, M. P. Cooke, P. G.
Schultz and X. B. Ling. 2002. Comparative analysis of human genome
assemblies reveals genome-level differences. Genomics 80: 138-139.
Liebeschuetz, J. W., S. D. Jones, P. J. Morgan, C. W. Murray, A. D.
Rimmer, J. M. Roscoe, B. Waszkowycz, P. M. Welsh, W. A. Wylie, S. C.
Young et al. 2002. PRO_SELECT: combining structure-based drug design
and array-based chemistry for rapid lead discovery. 2. The development of
a series of highly potent and selective factor Xa inhibitors. J Med Chem
45: 1221-1232.
Lipinski, C. A. 2000. Drug-like properties and the causes of poor solubility
and poor permeability. J Pharmacol Toxicol Methods 44: 235-249.
Lyne, P. D. 2002. Structure-based virtual screening: an overview. Drug
Discov Today 7: 1047-1055.
Mason, J. S., A. C. Good and E. J. Martin. 2001. 3-D pharmacophores in
drug discovery. Curr Pharm Des 7: 567-597.
Mestres, J., and R. M. Knegtel. 2000. Similarity versus docking in 3D
virtual screening. Perspect. Drug Des. Discovery 20: 191-207.
Pouliot, Y., J. Gao, Q. J. Su, G. G. Liu and X. B. Ling. 2001. DIAN: a novel
algorithm for genome ontological classification. Genome Res 11: 1766-1779.
Srinivasan, J., A. Castellino, E. K. Bradley, J. E. Eksterowicz, P. D.
Grootenhuis, S. Putta and R. V. Stanton. 2002. Evaluation of a novel
shape-based computational filter for lead evolution: application to thrombin
inhibitors. J Med Chem 45: 2494-2500.
Waszkowycz, B., T. D. J. Perkins, R. A. Sykes and J. Li. 2001. Large-scale virtual screening for discovering leads in the postgenomic era. IBM
Systems Journal 40: 360-376.
H. Consortium/Contractual Arrangements
N/A
I. Letters of Support (e.g., Consultants)
Fax attached as the next page.
CHECKLIST
TYPE OF APPLICATION (Check all that apply.)
NEW application. (This application is being submitted to the PHS for the first time.)
SBIR Phase I
SBIR Phase II: SBIR Phase I Grant No. _
______________________
SBIR Fast Track
STTR Phase I
STTR Phase II: STTR Phase I Grant No. _
______________________
STTR Fast Track
REVISION of application number:
(This application replaces a prior unfunded version of a new, competing continuation, or supplemental application.)
INVENTIONS AND PATENTS
COMPETING CONTINUATION of grant number:
(Competing continuation appl. and Phase II only)
(This application is to extend a funded grant beyond its current project period.)
No
Previously reported
SUPPLEMENT to grant number:
(This application is for additional funds to supplement a currently funded grant.)
Yes. If “Yes,”
Not previously reported
CHANGE of principal investigator/program director.
Name of former principal investigator/program director:
FOREIGN application or significant foreign component.
1. PROGRAM INCOME (See instructions.)
All applications must indicate whether program income is anticipated during the period(s) for which grant support is requested. If program income
is anticipated, use the format below to reflect the amount and source(s).
Budget Period
Anticipated Amount
Source(s)
2. ASSURANCES/CERTIFICATIONS (See instructions.)
The following assurances/certifications are made and verified by the
signature of the Official Signing for Applicant Organization on the Face
Page of the application. Descriptions of individual assurances/
certifications are provided in Section III. If unable to certify compliance,
where applicable, provide an explanation and place it after this page.
•Debarment and Suspension; •Drug- Free Workplace (applicable to new
[Type 1] or revised [Type 1] applications only); •Lobbying; •Non-Delinquency on Federal Debt; •Research Misconduct; •Civil Rights
(Form HHS 441 or HHS 690); •Handicapped Individuals (Form HHS 641
or HHS 690); •Sex Discrimination (Form HHS 639-A or HHS 690); •Age
Discrimination (Form HHS 680 or HHS 690); •Recombinant DNA and
Human Gene Transfer Research; •Financial Conflict of Interest (except
Phase I SBIR/STTR) •STTR ONLY: Certification of Research Institution
Participation.
•Human Subjects; •Research Using Human Embryonic Stem Cells•
•Research on Transplantation of Human Fetal Tissue •Women and
Minority Inclusion Policy •Inclusion of Children Policy• Vertebrate Animals•
3. FACILITIES AND ADMINISTRATIVE COSTS (F&A)/ INDIRECT COSTS. See specific instructions.
DHHS Agreement dated:
No Facilities And Administrative Costs Requested.
Regional Office.
DHHS Agreement being negotiated with
Date
No DHHS Agreement, but rate established with
CALCULATION* (The entire grant application, including the Checklist, will be reproduced and provided to peer reviewers as confidential information.)
a. Initial budget period:  Amount of base $450,000  x Rate applied 52.0 % = F&A costs $234,000
b. 02 year                 Amount of base $468,000  x Rate applied 52.0 % = F&A costs $243,360
c. 03 year                 Amount of base $486,720  x Rate applied 52.0 % = F&A costs $253,094
d. 04 year                 Amount of base $491,202  x Rate applied 52.0 % = F&A costs $255,425
e. 05 year                 Amount of base $510,850  x Rate applied 52.0 % = F&A costs $265,642
TOTAL F&A Costs $1,251,511
*Check appropriate box(es):
Salary and wages base
Modified total direct cost base
Other base (Explain)
Off-site, other special rate, or more than one rate involved (Explain)
Explanation (Attach separate sheet, if necessary.):
4. SMOKE-FREE WORKPLACE
PHS 398 (Rev. 05/01)
Yes
No (The response to this question has no impact on the review or funding of this application.)
Page __69_____
Checklist Form Page
Principal Investigator/Program Director (Ling, Bruce):
Place this form at the end of the signed original copy of the application.
Do not duplicate.
PERSONAL DATA ON
PRINCIPAL INVESTIGATOR/PROGRAM DIRECTOR
The Public Health Service has a continuing commitment to monitor the operation of its review and award
processes to detect—and deal appropriately with—any instances of real or apparent inequities with
respect to age, sex, race, or ethnicity of the proposed principal investigator/program director. To provide the PHS with
the information it needs for this important task, complete the form below and
attach it to the signed original of the application after the Checklist. Do not attach copies of this form
to the duplicated copies of the application.
Upon receipt of the application by the PHS, this form will be separated from the application. This form will
not be duplicated, and it will not be a part of the review process. Data will be confidential, and will be
maintained in Privacy Act record system 09-25-0036, “Grants: IMPAC (Grant/Contract Information).” The
PHS requests Social Security numbers for accurate identification, referral, and review of applications and
for management of PHS grant programs. Provision of the Social Security number is voluntary. No
individual will be denied any right, benefit, or privilege provided by law because of refusal to disclose his
or her Social Security Number. The PHS requests the Social Security Number under Sections 301 (a) and
487 of the PHS Act as amended (42 USC214a and USC288). All analyses conducted on the date of birth
and race and/or ethnic origin data will report aggregate statistical findings only and will not identify
individuals. If you decline to provide this information, it will in no way affect consideration of your application. Your
cooperation will be appreciated.
DATE OF BIRTH (07/27/68)
SEX/GENDER
Female
Male
Social Security Number 484-19-0649
ETHNICITY
1. Do you consider yourself to be Hispanic or Latino? (See definition below.) Select one.
Hispanic or Latino. A person of Mexican, Puerto Rican, Cuban, South or Central American, or other Spanish culture or
origin, regardless of race. The term, “Spanish origin,” can be used in addition to “Hispanic or Latino.”
Hispanic or Latino
Not Hispanic or Latino
RACE
2. What race do you consider yourself to be? Select one or more of the following.
American Indian or Alaska Native. A person having origins in any of the original peoples of North, Central, or South
America, and who maintains tribal affiliation or community attachment.
Asian. A person having origins in any of the original peoples of the Far East, Southeast Asia, or the
Indian subcontinent, including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine
Islands, Thailand, and Vietnam. (Note: Individuals from the Philippine Islands have been recorded as Pacific Islanders in
previous data collection strategies.)
Black or African American. A person having origins in any of the black racial groups of Africa. Terms such as “Haitian” or
“Negro” can be used in addition to “Black or African American.”
Native Hawaiian or Other Pacific Islander. A person having origins in any of the original peoples of Hawaii, Guam,
Samoa, or other Pacific Islands.
White. A person having origins in any of the original peoples of Europe, the Middle East, or North Africa.
Check here if you do not wish to provide some or all of the above information.
PHS 398 (Rev. 05/01)
DO NOT PAGE NUMBER THIS FORM
Personal Data Form Page
Appendix
1. Tularik Inc., a biopharmaceutical company – http://www.tularik.com.
2. J2EE, The Platform for Enterprise Solutions.
3. Microsoft  .NET, Working Beyond the Network.
4. I3C: Interoperable Informatics Infrastructure Consortium.
5. September 20, 1999 New York Times Technology Headline
Surfing the Human Genome – Databases of Genetic Code Are Moving to the Web.
6. Journal “Computer World” Tularik case study – Linux cluster tackles gene research
7. Pouliot, Y., J. Gao, Q. J. Su, G. G. Liu and X. B. Ling. 2001. DIAN: a novel algorithm for genome
ontological classification. Genome Res 11: 1766-1779.
8. Li, S., J. Liao, G. Cutler, T. Hoey, J. B. Hogenesch, M. P. Cooke, P. G. Schultz and X. B. Ling.
2002. Comparative analysis of human genome assemblies reveals genome-level differences.
Genomics 80: 138-139.
9. Li, S., G. Cutler, J. J. Liu, T. Hoey, C. Chen, P. G. Schultz, J. Liao and X. B. Ling. 2003. A
comparative analysis of HGSC and Celera human genome assemblies and gene sets.
Bioinformatics In press.
10. Liebeschuetz, J. W., S. D. Jones, P. J. Morgan, C. W. Murray, A. D. Rimmer, J. M. Roscoe, B.
Waszkowycz, P. M. Welsh, W. A. Wylie, S. C. Young et al. 2002. PRO_SELECT: combining
structure-based drug design and array-based chemistry for rapid lead discovery. 2. The
development of a series of highly potent and selective factor Xa inhibitors. J Med Chem 45: 1221-1232.
PHS 398/2590 (Rev. 05/01)
Page ___70____
Appendix
Appendix 1
Our Integrated Drug Discovery and Development Platform Places us in a Leading
Position to Create Novel and Superior Drugs.
The gene regulation approach to the discovery of novel therapeutics is enabled by
a solid foundation of biology, biochemistry and molecular biology. Tularik does
not depend on a single technique or technology. Rather our scientists develop and
take advantage of multiple approaches and cutting-edge technologies, bringing
them all to bear on the drug discovery process. We are continually incorporating
new capabilities that increase the probability of success.
A Distinctive Approach for Building the Company
Tularik has grown in a logical, stepwise fashion, having concentrated in its early
years on establishing excellence in fundamental biological science. Our approach
has been to build the company "from the bottom up," with scientific need driving
the internal development or acquisition of appropriate enabling technologies. To
our core biological expertise we first added assay and high throughput screening
capabilities and assembled a substantial chemical library. We then made major
investments in medicinal chemistry, structural biology and pharmacology. More
recently, we have been building strength in the clinical area, as a number of our
drug candidates enter and move through human testing. We have integrated a
number of new technologies that help us advance our programs. We acquired a
core technology called Representational Difference Analysis (RDA) in order to
discover the full set of cancer-causing genes; it is the centerpiece of our
oncogene-based drug discovery effort. We have also acquired an innovative
computer-aided molecular design (CAMD) capability that enables us to add
virtual screening to our already robust high throughput screening program.
Drug Discovery and Development Process
We seek to develop compounds that have novel mechanisms of action and treat
serious diseases more effectively than the best existing drugs. The discovery of
new drug leads is a multi-step process. We begin by establishing a precise
biological link between a disease state and inappropriate gene expression. Once
such a link is clear, we seek to elucidate the corresponding gene regulation
pathway inside the cell. We can then begin to develop highly specific
biochemical and cell-based assays for these targets that will help us to identify
promising leads for therapeutic intervention. At this stage, we bring to bear our
million-compound chemical library, and, with the aid of robotics technology
developed in-house, we perform high throughput screens, running select portions
of that vast library against our targets. "Hits" are subjected to secondary assays
designed to eliminate compounds that lack potency or specificity, or that have
other unwanted characteristics.
If a compound survives the secondary assay screening process, it is then
subjected to further testing and ultimately optimization by medicinal chemists to
improve drug-like characteristics such as potency, specificity, oral bioavailability
and safety. Once again, our team's expertise in the fundamentals--biology and
medicinal chemistry--guides the process. Our tools are state-of-the-art:
combinatorial chemistry; CAMD; and an impressive array of advanced structural
biology technologies that include nuclear magnetic resonance (NMR)
spectroscopy and x-ray crystallography. Utilizing structural information, our
chemists can design and synthesize new analogs of lead compounds that are more
likely to have a better fit with target proteins, and thus, potentially, greater
potency and specificity.
Finally, our lead series enters pharmacological analysis and testing in animal
models of disease. From these studies, we learn about our candidate drug's
effectiveness, pharmacokinetic profile, its selectivity with respect to its target, its
potency and its possible side effects. All data are carefully recorded and
compiled, in preparation for the submission of an Investigational New Drug
application (IND) to the U.S. Food and Drug Administration.
Once we have decided to pursue an IND, our clinical team--assisted and
counseled by the scientific team that discovered and developed our drug
candidate--undertakes the task of designing and implementing clinical trials.
Members of our expanding clinical development group have brought many
important medicines through trials to commercialization at major pharmaceutical
companies. They possess expertise in clinical research, clinical pharmacology,
biostatistics and data management, drug safety and surveillance and regulatory
affairs.
Copyright © Tularik Inc. 2003. All rights Reserved.
http://www.tularik.com/page.php?id=3
6/17/2003
Java 2 Platform, Enterprise Edition - Overview
The Platform for Enterprise Solutions
The Java 2 Platform, Enterprise Edition (J2EE) defines the standard for developing
multitier enterprise applications. J2EE simplifies enterprise applications by basing them
on standardized, modular components, by providing a complete set of services to those
components, and by handling many details of application behavior automatically,
without complex programming.
The Java 2 Platform, Enterprise Edition, takes advantage of many features of the Java
2 Platform, Standard Edition, such as "Write Once, Run Anywhere" portability, JDBC
API for database access, CORBA technology for interaction with existing enterprise
resources, and a security model that protects data even in internet applications.
Building on this base, Java 2 Enterprise Edition adds full support for Enterprise
JavaBeans components, Java Servlets API, JavaServer Pages TM and XML technology.
The J2EE standard includes complete specifications and compliance tests to ensure
portability of applications across the wide range of existing enterprise systems capable
of supporting J2EE.
Making Middleware Easier
Today's enterprises gain competitive advantage by quickly developing and deploying
custom applications that provide unique business services. Whether they're internal
applications for employee productivity, or internet applications for specialized customer
or vendor services, quick development and deployment are key to success.
Portability and scalability are also important for long term viability. Enterprise
applications must scale from small working prototypes and test cases to complete 24 x
7, enterprise-wide services, accessible by tens, hundreds, or even thousands of clients
simultaneously.
However, multitier applications are hard to architect. They require bringing together a
variety of skill-sets and resources, legacy data and legacy code. In today's
heterogeneous environment, enterprise applications have to integrate services from a
variety of vendors with a diverse set of application models and other standards.
Industry experience shows that integrating these resources can take up to 50% of
application development time.
As a single standard that can sit on top of a wide range of existing enterprise systems
-- database management systems, transaction monitors, naming and directory services
and more -- J2EE breaks the barriers inherent between current enterprise systems. The
unified J2EE standard wraps and embraces existing resources required by multitier
applications with a unified, component-based application model. This enables the next
generation of components, tools, systems, and applications for solving the strategic
requirements of the enterprise.
With simplicity, portability, scalability and legacy integration, J2EE is the platform for
enterprise solutions.
The Standard with Industry Momentum
While Sun Microsystems invented the Java programming language and pioneered its
use for enterprise services, the J2EE standard represents a collaboration between
leaders from throughout the enterprise software arena. Our partners include OS and
database management system providers, middleware and tool vendors, and vertical
market applications and component developers. Working with these partners, Sun has
defined a robust, flexible platform that can be implemented on the wide variety of
existing enterprise systems currently available, and that supports the range of
applications IT organizations need to keep their enterprises competitive.
[ This page was last updated Apr-12-2003 ]
Java, J2EE, J2SE, J2ME, and all Java-based marks are trademarks or registered trademarks of Sun Microsystems,
Inc. in the United States and other countries.
Unless otherwise licensed, code in all
technical manuals herein (including articles,
FAQs, samples) is provided under this License.
http://java.sun.com/j2ee/overview.html
Copyright © 1995-2003 Sun Microsystems, Inc.
All Rights Reserved. Terms of Use
6/16/2003
Microsoft .NET
Page 1 of 3
news
Keep working whether or not you're plugged into the
network. Smart Client software, based on
Microsoft® .NET-connected technology, combines the
reach of the Internet with the power of local
computing hardware. Learn how.
Smart Clients combine the power of the PC with the reach of the Web
The technology map to build Smart Client software
The .NET Framework is the foundation of a new generation of software
downloads
technical resources
Get the latest tools, guides, code samples, and community links to
help you build Web services and deploy and maintain a .NET-connected environment.
Microsoft releases recommended practices for solving enterprise problems with .NET
Using the .NET Framework
MSDN ® resources for .NET Framework developers
TechNet resources for IT professionals
More technical resources …
business agility
Don't throw out your existing systems. Microsoft .NET-connected
software makes it easier for you to share or integrate information
using the technology you own now.
webcasts
McKinsey Quarterly shows businesses how to fight complexity in IT
Microsoft helps businesses benefit from Web services today
What .NET means for IT professionals
Case studies: See .NET in action
More business agility information…
for partners
Your success is our success. That's why we build programs that
provide your company with new business opportunities.
.NET Connected Directory helps businesses find Web service
solutions and products
Register your product with the .NET Connected Logo program
Online resources for Microsoft partners
home and entertainment
Change flight reservations with your PDA? Automatically reserve
concert tickets the minute they go on sale? The possibilities are
endless in a .NET-connected world.
Digital decade vision: From personal computer to personal computing
http://www.microsoft.com/net/
6/16/2003
Can the Internet change your oil?
More home and entertainment information…
what is .NET?
.NET 101: The basic elements of .NET
The ABCs of Web services
.NET glossary
Frequently asked questions
More links …
product information
.NET products
Smart devices
Servers
Developer tools
More links …
.NET services
.NET Alerts
.NET Passport
MSN® Messenger Connect
Microsoft MapPoint® Web Service
More links …
© 2003 Microsoft Corporation. All rights reserved.
Interoperable Informatics Infrastructure Consortium
Page 1 of 2
Interoperable Informatics Infrastructure
Consortium
http://www.i3c.org/
Accelerating Life Sciences Discovery Through Software
Interoperability
I3C develops and promotes global, vendor-neutral
informatics solutions that improve data quality and
accelerate the development of life science products.
Upcoming Events
Mark your calendar for the remainder of 2003!
I3C Demo at BIO 2003: "Merging LSID & BioMOBY", June 22-25, at the Washington Convention Center, Washington, D.C.
Technical Meeting, Oct. 27-29, and Hackathon, Oct. 30-31, Wellcome Trust Genome Campus, Hinxton, U.K.
What's NEW!
WEB SITE CHANGES IN STORE
As the result of your comments, we'll be making some navigational adjustments to
the Web site over the next few days. The result should be a simplified left menu and
easier, more direct access to Working Group activities. Immediate requests for
fixes/links should be addressed to Suzanne Mahler at [email protected]. Thank you
once again for your patience.
I3C Annual Meeting, Elections, Technical Meeting & Hackathon Summary
I3C's first Annual Meeting began with Board President Tim Clark's "State of I3C"
report. (See next item for details.) Tim reviewed I3C's purpose of promoting,
developing, and recommending "best-in-class" interoperability solutions for the life
sciences community. But more than interoperability, Tim stressed that we need
to look at the total recommended approach that includes specific point solutions and
approaches, interoperability solutions with open interfaces, and methods of semantic
integration across the domain. He also reviewed I3C's approach to accomplishing
its work by bringing the best people together from IT, academic, and
biopharma communities. Tim went on to outline I3C's many accomplishments
to-date, as well as exciting future opportunities. "What we want to do is
make it easier for life scientists, and their laboratories, to do their jobs while lowering
cost and speeding the development of needed drugs," he said.
To read the entire summary, click here.
To download a photo slide show of the
Hackathon, click here.
I3C Annual Meeting - Chairperson's
Report
On Monday, May 5, I3C Board President
Tim Clark gave an important presentation
at the Annual Meeting. An outline of the
talk appears below; if you missed it in
person, you can download the PowerPoint
presentation here.
- I3C's Purpose & Approach
- Accomplishments 2002-3
- Problems Requiring Attention
- Driving the Work Forward
- The I3C Vision: Next Stage
MAKE PLANS FOR HINXTON - Details
now available
The last meeting of 2003 is scheduled for
October 27-31 on the Wellcome Trust
Genome Campus in Hinxton, U.K. Meeting
and accommodation details are available
here. Remember to reserve your room
early for best selection.
EMERGING WORK AREAS
In addition to I3C's existing Working
Groups (LSID, Registry, and Security), new
work areas are forming, including
Cheminformatics, Pathways/Systems
Biology, and Life Science Object Ontology.
You may join a group at any time.
Click on the "Working Groups" link in
the left menu for more information.
Copyright © 2003 I3C
http://www.i3c.org/
6/13/2003
Surfing the Human Genome
Page 1 of 7
September 20, 1999
Databases of Genetic Code Are Moving to the Web
By LAWRENCE M. FISHER
SAN FRANCISCO -- Call it an end-of-the-century business
case study.
Pangea Systems Inc. is a small but leading company in
"bioinformatics," a hot new field that combines the two keystone
technologies of the 1990s -- computing and biotechnology. But its
products are expensive and difficult for mortals to use, which limits
Pangea's potential market and reduces the prospects for a public
stock offering.
What to do? This being 1999, the answer if you are Pangea is to dot-com yourself.
This week Pangea, which is based in Oakland, Calif., intends to begin a shakedown
test of DoubleTwist.com, a new Web site intended to make online genetic and
biological research fast, easy and available to any amateur or professional
biologist. While the test phase is available only to faculty and students at
Stanford University, the site is scheduled to go live for general use in December.
[Photo: Thor Swift for The New York Times. At Pangea Systems, left to right, Kyle
Hart, Bruce Xuefeng Ling and Brian King revise genetic data base software.]
The DoubleTwist site, whose name is a play on the double-helix
structure of DNA, holds the near-term promise of lifting Pangea
above the pack of competitors chasing the business opportunities in
bioinformatics. But other companies may not be far behind. And the
implications go beyond the interests of professional biologists and
biotechnology executives.
http://www.nytimes.com/library/tech/99/09/biztech/articles/20gene.html
6/12/2003
As more of the arcane secrets of genetics and molecular biology
become available to the modemed masses, some industry executives
foresee the day when an educated consumer might take a CD-ROM
containing a laboratory's rendering of his or her genetic profile, and
combine it with a Web surf through gene libraries to determine the
person's predisposition toward adverse drug reactions, for example,
or for Alzheimer's disease, colon cancer or other afflictions that
might eventually be treatable through gene therapy.
To promote its name and capabilities, Pangea plans to let
individuals who make only casual use of the site have access to its
software and data base at no charge. Heavy users and corporations
may obtain licenses to pay for access on a sliding fee scale -- which
could run tens of thousands of dollars a year, but would still be
significantly less than the $500,000 or more that Pangea now
charges big pharmaceutical companies to buy its software outright.
"The power of bioinformatics has been somewhat limited to those
who could afford it," said John Couch, Pangea's president and chief
executive, who was an executive at Apple Computer in the late
1970s and early 1980s. "I've been trying to figure out how to
empower the scientist the way we did computer users at Apple in
the early days," Couch said. "We saw the opportunity to be the first
Web portal that enabled scientists to do molecular research."
Celera Genomics Group is another company that has said it will
offer its bioinformatics tools from its Web site, although it has not
specified a launch date.
"This is an Internet company," said Craig Venter, president and chief executive
of Celera, a unit of the PE Corp., which is based in Rockville, Md. Scientists and
nonscientists alike, he said, will be able to use Celera's tools to gain insights
into their genetic makeup. And as catalogs of common mutations correlated with
disease become broadly available, he said, individuals will be able to make
appropriate lifestyle changes or health-care decisions. "You'll be able to log
on to our data base and get information about yourself," Venter said. "Our
ultimate customer on the Internet is individuals."
[Image caption: Pangea Systems' Doubletwist is a Web version of genetics research
software designed to let a person search libraries of gene-code fragments for matches.]
Bioinformatics is a field that emerged from the Human Genome
Project, the international quest -- which began in 1988 and is
expected to be concluded in the next two years -- to spell out the
precise sequence of the three billion letters in the human genetic
code. The first industry spawned by the genome project was
genomics companies, which sell data bases of individual genes
whose sequences have already been identified or are developing
drugs aimed at gene targets. As these efforts began to produce vast
amounts of biological information, they needed powerful software
to keep track and make sense of it all. And so, in the early 1990s,
bioinformatics was born as a tool of genomics.
While the software created by the government-funded labs like the
Whitehead Institute at the Massachusetts Institute of Technology is
in the public domain, with intriguing names like Blast and Fasta, the
genomics companies, like Human Genome Sciences Inc. and Incyte
Pharmaceuticals Inc., have kept their tools for use by themselves or
their licensed partners. That is Celera's primary business as well,
despite Venter's intent to offer bioinformatics services on the Web.
It was not long before a few entrepreneurs and venture capitalists
saw an opportunity in a pure-play bioinformatics company, which
would sell not genes or data, but software. As private companies,
none of the bioinformatics players publish revenue figures, but most
say they are between $5 million and $10 million in annual sales, and
growing. Indeed, some analysts predict a multibillion-dollar
bioinformatics market within the next 10 years.
"Bioinformatics is not necessarily the next wave, but the glue that
holds everything together," said Tim Wilson, an analyst with S.G.
Cowen. "If you don't get that part right, it's hard to realize the value
of genomics," he said. "The opportunity is something obvious to
anyone who speaks to pharmaceutical companies."
With the DoubleTwist site, according to Pangea, a researcher would
have many of the same capabilities previously available only to the
company's big corporate customers, which include drug companies
like Bristol-Myers Squibb and Hoechst Marion Roussel.
After logging on to the DoubleTwist site, a visitor could enter a
partial sequence of a gene -- some combination of the letters A, C, T
and G, which make up the genetic alphabet -- and then search for
contiguous sequences that might lead to a full-length gene. Or if the
code of a full-length gene were known, the researcher could ask in
which tissues of the body that gene is found or found only when in
the presence of cancer. To the extent the answer is available in the
scientific literature, including patent filings, the software would
retrieve it and highlight relevant passages. Other cross-referenced
data might include notations on what biochemical materials are
required for working with a given gene in the laboratory.
Such are the capabilities of the computational biology that underlies
bioinformatics -- a field that Francis Collins, director of the Human
Genome Project for the National Institutes of Health, says he now
often counsels promising graduate students to look to for career
opportunities. "I just think it is going to hit us like a freight train and
we really have too small a supply of expertise in that area," he said.
But there has been a dichotomy between the opportunity and the
market reality for Pangea and competitors like Netgenics Inc. of
Cleveland; Informax Inc. of Rockville, Md.; Lion Bioscience AG of
Heidelberg, Germany; Compugen Ltd. of Tel Aviv; the Genomica
Corp. of Boulder, Colo.; and Molecular Applications Group of Palo
Alto, Calif. Most of these companies are five years old or more, yet
few are profitable.
Couch, Pangea's president, said the two hurdles to expanding the market have
been complexity and cost. Besides the $500,000 price for Pangea's suite of software
programs, a suite customer must make a comparable investment in hardware. And
even though they have a point-and-click graphical user interface, like any Windows
application, their sophistication has tended to restrict their use to
bioinformatics specialists within large pharmaceutical or biotechnology companies,
not to individual research scientists without special training.
[Photo: Thor Swift for The New York Times. John Couch, left, president of Pangea
Systems, and Robert Williamson, senior vice president for marketing, say the
Oakland, Calif., company's Web site will let scientists do complex genetic research
on line.]
In moving to the Web, Pangea will find neighbors with some
similar-sounding offerings. This week, HySeq Inc., a genomics
company in Sunnyvale, Calif., will launch GeneSolutions.com,
which will sell genes and genetic information over the Web. And
there are various Web sites, for example, that freely offer public-domain algorithms, or mathematical formulas, that can perform the
basic tasks of bioinformatics. These include a technique called
clustering and alignment, which pieces together full-length genes
from the fragments spewed out by so-called automated sequencing
machines that derive their data from DNA samples.
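The clustering-and-alignment step described above can be illustrated with a toy greedy overlap merge. This is only a minimal sketch under strong assumptions: error-free fragments, a single strand, and exact suffix-to-prefix matches. The names `overlap`, `greedy_assemble` and `min_len`, and the example fragments, are hypothetical and are not taken from Pangea's software or from any public-domain tool mentioned in the article.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` letters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


# Made-up fragments using the four-letter genetic alphabet.
example_fragments = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(example_fragments))
```

Real assembly tools also have to cope with sequencing errors, reverse-complement strands and repeated regions, none of which this sketch attempts.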
But these public-domain tools tend to be difficult to use, and limited
in their application to specific gene data bases. Pangea's
DoubleTwist, by contrast, will aggregate data from multiple
sources, and then make it available using software agents -- small
automated software programs that will scan the Web at a user's
request and return answers to complex biological queries via e-mail.
These agents can update information as it becomes available,
suggest necessary laboratory supplies and provide links to vendors.
DoubleTwist is intended to complement rather than supplant
Pangea's established software suites. But Couch said it was possible
that a growing portion of the company's revenues would come from
the Web rather than packaged programs. Rather than buy Pangea's
software suite for $500,000, companies or academic institutions
could spend, say $10,000 a year to provide each user access to these
programs over the Web.
Pangea's competition in this arena is companies very much like
itself: small, financed with venture capital and possessing more
programming prowess than marketing skills.
All of these companies are looking for ways to differentiate
themselves, and while an Internet presence is one way to do that, it
is by no means the only one.
For example, Netgenics' programs run on corporate intranets, rather
than the World Wide Web. But they are built using Internet
technology like the Java programming language so that they can be
easily adapted to the specific needs of different customers. "Pangea
decided they would come up with the perfect schema for all types of
drug discovery and put a nice graphic user interface on it," said
Manuel J. Glynias, president and chief executive of Netgenics,
which was founded in 1996. "We decided there was no perfect
schema because every pharmaceutical company is different."
Netgenics did consider a Web-based electronic commerce business
model, but decided a faster route to growth was to bundle consulting
services with custom bioinformatics software. So far, customers
include Abbott Laboratories and Pfizer. "We've very much targeted
big pharma and biotech," Glynias said. "They're the only ones who
can afford it, and really the only ones it makes sense for. At the end
of the day you've got 50 big pharma and biotech companies and 100
medium-sized ones. It's not a big market."
If the market is small, creating a big company requires that each sale
be large, and Netgenics bases its goals on finding at least 20
customers willing to pay $5 million annually for its services.
Another player, Lion Bioscience, takes that model a step further. It
recently announced a deal in which it would develop new
bioinformatics systems and identify target genes for drug
development by Bayer A.G. for an investment estimated at $100
million. The figure includes an up-front equity stake in Lion as well
as fees for use of Lion's existing information systems, research and
set-up costs for a new subsidiary to be based in Cambridge, Mass.,
and royalties on drugs developed from the gene targets identified at
the subsidiary.
Lion calls its concept iBiology, and like Netgenics' approach, it uses
intranets rather than the Internet. "It goes far beyond the usual gene
sequence analysis software," said Claus Kermoser, Lion's vice
president for corporate development. "We crawl further up the value
chain to include the chemical side, and also pharmacological and
toxicology data. It's not just a software package, tools and data; it's a
solution for pharmaceuticals research data management."
In fact, Lion is actually a hybrid of pure-play bioinformatics and
genomics, because it sells gene targets along with information-processing capabilities. Similarly, Compugen, after building a
successful business selling bioinformatics tools, has recently added
a genomics thrust, selling novel gene variants the Compugen
researchers have identified with the company's tools.
Compared with these other companies, which have aimed for a
corporate clientele, Informax has taken a vastly different tack. For
six years it has sold a program for individual scientists, Vector NTI,
which is almost to biology what desktop publishing software was to
print publications. At $3,500 a user, for the Windows or Macintosh
versions, Vector NTI is not inexpensive. But because it is a
purchase that typically can be authorized at the department level, it
is the most widely used bioinformatics program in the industry. It is
used at 60 pharmaceuticals companies, 250 biotechnology concerns
and 500 academic institutions.
"We've built our franchise by meeting the needs of the bench
biologist," said Timothy Sullivan, Informax's senior vice president
for marketing and sales. "Informax took a bottom-up approach and
did it well, versus Pangea and Netgenics, who started out at the
enterprise level," he said. Informax recently introduced its own
enterprise product, Software Solution for Bioscience, and hopes to
use the leverage of its existing customer base to win sales at large
companies.
One hurdle for all of these competitors is that the large companies
that are their obvious customers often have substantial
bioinformatics capabilities of their own -- expertise that the
company may even view as a proprietary advantage.
"You're trying to do cutting-edge research, and if you're on the
leading edge of the curve that means you also have to develop the
software to do it," said Paul Godowski, director of molecular
biology at Genentech Inc., the pioneering biotech company. "On the
other hand, there are products out there from these third-party
vendors we can import for our programs," Godowski said. "It's a
mixture, and I don't see that going away, certainly not at a place like
Genentech."
No wonder Pangea is looking to cyberspace to expand its potential
audience.
"Only a few select pharmas can afford the tools, and if they can,
then in some cases they can also afford to produce their own
software," Couch said. "Why not take the infrastructure we've
created, add a graphic interface that makes it easier, and offer it
directly to the scientist? We are taking the Internet, which was
originally developed to do research, and giving it back to the
researchers."
Related Sites
These sites are not part of The New York Times on the Web, and The Times has
no control over their content or availability.
- Pangea Systems Inc.
- DoubleTwist.com
- Celera Genomics Group
- Human Genome Project
- GeneSolutions.com
- Informax Inc.
- Lion Bioscience AG
- Compugen Ltd.
- Genomica Corp.
- Molecular Applications Group
- Netgenics Inc.
Copyright 1999 The New York Times Company
6/12/2003
Linux cluster tackles gene research - Computerworld
Page 1 of 3
Computerworld
Linux cluster tackles gene
research
By TODD R. WEISS
AUGUST 05, 2002
A 150-processor cluster supercomputer built by Linux NetworX Inc. is helping a
California biopharmaceutical company compare genes from mice and humans
in the race to find effective drugs to fight cancer and other
diseases.
Using the cluster supercomputer, Tularik Inc. has been able
to accomplish in two months what would have taken the
company 38 years with its older hardware, said Bruce Ling,
director of bioinformatics at the South San Francisco-based
company. Tularik had been using an older four-processor
mainframe computer from Silicon Graphics Inc. for such
work.
http://www.computerworld.com/softwaretopics/os/linux/story/0,10801,73254,00.html
6/15/2003
The new cluster, by contrast, has 150 Pentium III 1-GHz
processors, 300GB of memory and 3TB of storage; it has
given Tularik researchers far greater potential for their gene
mapping work, Ling said.
The cluster was installed late last year and began
operations in December. Details about the project were
announced today after performance benchmark testing
results were released, showing performance that was 75
times greater than the old system.
The Evolocity cluster supercomputer was built by Salt Lake
City-based Linux NetworX and has helped accelerate
Tularik's drug discovery efforts by mining massive
databases of information and quickly identifying gene
combinations behind diseases in the areas of cancer,
immunology and metabolic disorder.
The 75-node supercomputer has two CPUs per node, plus
a four-CPU administrative node. The machine was built at
about a tenth of the cost of a traditional mainframe, Ling
said. The price is not being released, he said.
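The hardware and speed figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the article; the variable names are illustrative. Note that the "two months versus 38 years" comparison and the 75-times benchmark result are separate figures in the article, and the sketch simply computes the first ratio without attempting to reconcile the two.

```python
# Figures quoted in the article.
NODES = 75            # compute nodes
CPUS_PER_NODE = 2     # CPUs in each compute node
ADMIN_CPUS = 4        # CPUs in the administrative node
MEMORY_GB = 300       # total memory
OLD_YEARS = 38        # estimated run time on the older hardware
NEW_MONTHS = 2        # run time on the cluster

compute_cpus = NODES * CPUS_PER_NODE          # matches the "150 processors"
total_cpus = compute_cpus + ADMIN_CPUS        # including the admin node
memory_per_cpu_gb = MEMORY_GB / compute_cpus  # memory per compute CPU
workload_speedup = OLD_YEARS * 12 / NEW_MONTHS

print(f"compute CPUs: {compute_cpus}")
print(f"total CPUs incl. admin node: {total_cpus}")
print(f"memory per compute CPU: {memory_per_cpu_gb} GB")
print(f"implied speedup for the quoted workload: {workload_speedup}x")
```

The implied ratio for that workload comes out higher than the reported benchmark factor of 75, which suggests the two numbers were measured on different tasks or by different methods; the article does not say which.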
The biopharmaceutical company is working to compare
human and mouse genomes because the genetic makeup
of the two species is very similar, Ling said. A genome is a
collection of genetic information consisting of individual
genes. When experiments are necessary, researchers can
use mouse genomes instead of having humans undergo
experimental procedures.
By cross-mapping the genomes, researchers can find
similarities and differences in the genomes between the
species that can help them in their experiments.
Tularik, founded in 1991, has been a pioneer in the recent
cross-mapping of mouse and human genomes, Ling said,
and hopes to develop pharmaceuticals that regulate the
development of cancer and other diseases in genes.
Clark Roundy, a vice president of marketing at Linux
NetworX, said the company is seeing continued growth in
the biotechnology and pharmaceutical marketplaces for
cluster supercomputers for research.
"This cluster has saved them years of time in mapping the
mouse genome," Roundy said.
Related Content
IBM expands server cluster technology, AUG 02, 2002
Clusters give users supercomputer power, JUN 07, 2002
IBM nets its biggest supercomputer deal yet, JUN 03, 2002
Source: Computerworld
Copyright © 2003 Computerworld Inc. All rights reserved. Reproduction in whole or in part in any form or medium without
express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective
logos are trademarks of International Data Group Inc.
6/15/2003
Methods
DIAN: A Novel Algorithm for Genome
Ontological Classification
Yannick Pouliot, Jing Gao, Qiaojuan Jane Su, Guozhen Gordon Liu, and
Xuefeng Bruce Ling1,2
DoubleTwist, Inc., Oakland, California 94612, USA
Faced with the determination of many completely sequenced genomes, computational biology is now faced with
the challenge of interpreting the significance of these data sets. A multiplicity of data-related problems impedes
this goal: Biological annotations associated with raw data are often not normalized, and the data themselves are
often poorly interrelated and their interpretation unclear. All of these problems make interpretation of genomic
databases increasingly difficult. With the current explosion of sequences now available from the human genome
as well as from model organisms, the importance of sorting this vast amount of conceptually unstructured
source data into a limited universe of genes, proteins, functions, structures, and pathways has become a
bottleneck for the field. To address this problem, we have developed a method of interrelating data sources by
applying a novel method of associating biological objects to ontologies. We have developed an intelligent
knowledge-based algorithm, DIAN, to support biological knowledge mapping, and, in particular, to facilitate the
interpretation of genomic data. In this respect, the method makes it possible to inventory genomes by collapsing
multiple types of annotations and normalizing them to various ontologies. By relying on a conceptual view of
the genome, researchers can now easily navigate the human genome in a biologically intuitive, scientifically
accurate manner.
Biologists have never before been exposed to such vast
amounts of sequence data as that from the human genome
and a variety of model organisms. This development now
raises the issue of how to interpret the meaning of the genome on the basis of prior biological understandings. Annotation tasks, such as the prediction of protein function and
structure, are essential to this process and are by no means
completely robust. Furthermore, the integration of historical
domain knowledge accumulated in individual research fields
with these sequence and structural annotations is becoming
increasingly complex and difficult. The size, diversity, and
complexity of the data, which include biological sequence
information itself, third party or in-house annotation, and
information from the scientific literature, are responsible for
these difficulties. Another reason relates to the lack of data
and information normalization, because the data repositories
are often poorly designed, particularly in the case of older
repositories. Furthermore, data processing procedures vary
substantially, and the underlying semantic and data models
are moving targets. Finally, there is the extreme specialization
of research fields.
Despite these problems, model organism studies and associated DNA and protein sequence data sets have revealed a
high degree of sequence and functional conservation between
organisms (Chervitz et al. 1999). Similarly, the accumulated
protein structure data have shown that the number of protein
folds is probably limited (Bowie et al. 1991). The limited number of biological roles, protein functions, and structural types
enables a common language for annotation, which is beginning to be implemented by biocomputational ontology engineering (Riley 1993; Baker et al. 1999; Ashburner et al. 2000;
Karp 2000). Ontologies provide an ideal mechanism of organization of biological data at the conceptual level by providing a framework for data whose properties are otherwise nonnormalized. “Normalization” is used here to refer to a state in
which several types of signifiers ultimately express the same
concept, and in which a concept is defined as a generic abstraction derived from instances. An example of a concept is
the notion of “cell adhesion molecules”, to which specific
types (instances) of proteins such as cadherins, neural cell
adhesion molecules, and integrins are conceptually associated. The proper assignment of DNA and protein sequences to
ontologies therefore leverages the rigor of the underlying concepts networked within these ontologies, and enables computations that would otherwise be unreliable due to the variability of terms used to describe biological data in most of
the biocomputational databases. For example, ontology-based querying can enable the retrieval of DNA and protein
sequences based on biological concepts rather than relying on
keyword or synonym searches, which are inherently unreliable due to their present nonnormalized nature, therefore
greatly hampering effective computing (Attwood 2000).
1 Present address: Tularik, Inc., 2 Corporate Drive, South San Francisco, CA 94080, USA.
2 Corresponding author. E-MAIL [email protected]; FAX (650) 825-7400.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.183301.
1766 Genome Research www.genome.org
Here we describe DIAN, an ontology assignment algorithm that assigns concepts to source records or, more generally, to biological objects within a database, and supports
their querying using concepts rather than keywords. The algorithm supports a variety of ontologies for biological role,
protein function, and protein structure, whereby each ontology is implemented on a knowledge base established via computer-assisted human curation of the protein universe. DIAN
has the necessary throughput capacity to annotate entire genomes, transcriptomes, and proteomes onto any number of
ontologies. The DIAN algorithm, together with the precomputed DIAN annotation database and its associated utilities,
enables users to retrieve, summarize, and predict the higher-order
properties of biological objects, thereby increasing
their information content. Overall, DIAN is intended to facilitate the navigation of genomic data repositories in a biologically intuitive, scientifically accurate manner.
11:1766–1779 ©2001 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org
DIAN Knowledge Mining System
RESULTS AND DISCUSSION
Biologists rely heavily on databases and search tools such as
the National Center for Biotechnology Information’s Entrez system to
search and identify records containing information associated
with biological objects such as protein structures and biological sequences (Wheeler et al. 2001). However, when computing on such information, most query systems suffer from the
limitations inherent to the annotations associated with these
objects. Even in highly curated databases such as the SWISS-PROT database of protein information (Bairoch 1991), there
remains significant variability in the descriptors present in
these source records. This is because there are many legitimate
ways of describing biological concepts. Furthermore, even
when the data are curated by experts, a variety of factors introduce variability in the quality and comprehensiveness of
these annotations. Thus, when querying annotation databases, conventional search tools encounter fundamental limitations, such that they cannot return records in a reliable
manner unless a complete set of descriptors known to be present in the targeted records is provided in the query. This, of
course, rarely is the case.
DIAN is designed to enable the querying of popular biological databases in such a way that the limitations associated
with the original source records of these databases can be
partially overcome. This is accomplished by having the operator query biological ontologies for records associated with
these ontologies, rather than querying the source records directly (for details, see supplementary material at http://www.genome.org). The primary algorithm used by DIAN for
associating records to ontologies relies on a domain-based approach that does not depend on the presence of annotations
in the source record, thus bypassing the limitations associated
with these annotations. In addition, because of this approach,
DIAN often makes suggestive assignments, whereby proteins
are predicted to belong to ontological nodes in the absence of
definitive information.
For these reasons, when performed using conventional
keyword-based search engines, the queries described in Table
1 will fail to return a fraction of records because of an absence
of matching annotations or because of the indirectness of
these annotations (i.e., hyperlinked records). Three such cases
of records that would otherwise not have been returned without DIAN are illustrated in Table 1. They involve two novel
genes, one with predicted functional information listed in the
source record and one without such information, as well as
one well-characterized gene. In case 1, DIAN identified a gene
with no known functional activity by predicting the cellular
role and protein function of a sequence on the basis of its
pattern of protein domains. UniGene was queried for records
involved in the apoptotic Cellular Role. DIAN returned a record from the UniGene database where no functional information is available regarding this sequence, such that this record
would not have been identified by keyword-based querying
(Table 1). It is only after consulting the SWISS-PROT record
linked to this UniGene entry that an apoptotic function is
uncovered. Case 2 concerns the prediction of a cellular role
for a hypothetical gene in SWISS-PROT in which putative
functional information is available (zinc finger; DNA binding)
but where the annotation does not specify a cellular role. In
this case, DIAN predicted an involvement in the “RNA synthesis/transcription factor” Cellular Role node. In case 3,
DIAN predicted a novel property for a highly characterized
gene. Here, UniGene was queried for records involved in the
apoptotic Cellular Role. The gene coding for the protein associated with the Wiskott-Aldrich syndrome (WAS; Derry et
al. 1994) was one of the hits returned by this query. The WAS
protein is thought to be involved in signal transduction, yet
there is no indication of an apoptotic role in any of the records associated with this gene, including the associated
SWISS-PROT and OMIM records. However, indications suggestive of a possible apoptotic role were found in these
sources. On subsequent analysis of the scientific literature associated with WAS and its Drosophila ortholog, several publications were uncovered that strongly substantiate a recently
discovered apoptotic role for this gene (Rawlings et al. 1999;
Rengan et al. 2000; Ben-Yaacov et al. 2001). Beyond the performance of DIAN in returning records otherwise unretrievable, the combination of ontology-based and Boolean operators (e.g., NOT, AND, OR) enables users to query databases in
a biologically meaningful manner rather than to submit to
unfamiliar querying syntaxes and the vagaries of unstructured
data. For example, using DIAN it is possible to formulate directly the following questions in a simple manner: Are there
cytokines involved in the apoptosis biological process? Are
there proteins harboring the caspase domain that are involved in apoptosis? What receptors are associated with apoptosis? What proteins are both apoptosis-associated and DNA-associated in terms of cellular role (i.e., proteins that might
perform an apoptotic role via DNA binding)? Such questions
cannot be addressed if the contents of annotation databases
have not been normalized to various biological concepts.
Furthermore, comprehensive biological queries cannot be performed reliably using only the
simple keyword-search approach found in most public databases.
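The kind of question listed above can be sketched as set operations over records assigned to ontology nodes. This is a minimal illustration, not the DIAN query engine: the node names, the record identifiers, and the helper functions `descendants` and `records` are all hypothetical, and the only idea taken from the text is that Boolean operators are applied to concept-level assignments rather than to keywords.

```python
# Toy IS-A hierarchy: each node maps to its parent (None marks a root).
# All node names and record identifiers below are invented for illustration.
PARENT = {
    "cellular role": None,
    "apoptosis": "cellular role",
    "DNA-associated": "cellular role",
    "protein function": None,
    "cytokine": "protein function",
    "receptor": "protein function",
}

# Toy assignments of record identifiers to ontology nodes.
ASSIGNED = {
    "apoptosis": {"rec1", "rec2", "rec3"},
    "DNA-associated": {"rec3", "rec4"},
    "cytokine": {"rec1", "rec5"},
    "receptor": {"rec2"},
}


def descendants(node):
    """Return the node plus every node below it in the IS-A hierarchy."""
    found = {node}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def records(node):
    """Records assigned to a node or to any of its descendants."""
    result = set()
    for name in descendants(node):
        result |= ASSIGNED.get(name, set())
    return result


# "Are there cytokines involved in apoptosis?" (AND)
cytokines_in_apoptosis = records("cytokine") & records("apoptosis")
# "What proteins are both apoptosis-associated and DNA-associated?" (AND)
apoptosis_and_dna = records("apoptosis") & records("DNA-associated")
# Apoptosis-associated records that are NOT receptors (AND NOT)
apoptosis_not_receptor = records("apoptosis") - records("receptor")
```

Because each query operates on node membership, a record is returned whether or not its source annotation happens to contain the word used in the question.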
Organization of Biological Data Using Ontologies
An ontology is a specification of a conceptualization that provides a written, formal description of a set of concepts and their
relationships within a domain of interest (Karp 2000). Ontologies are object-oriented data structures that use object
composition and inheritance as techniques to encapsulate
conceptual relationships.
In biology, there are two kinds of relationships between
conceptual objects to be represented: inheritance and compositional relationships. Inheritance hierarchies model IS-A relationships among base and derived conceptual objects. This
is because a derived object IS-A type of base object. In contrast, composite objects, that is, objects that contain other
objects as members, model HAS-A relationships. This is because the container object HAS-A(nother) object as its member component. For example, in the Gene Ontology (GO) ontologies
(Ashburner et al. 2000), the Cellular Component ontology
relies on HAS-A compositional relationships, whereas the Molecular Function ontology uses IS-A inheritance relationships.
In this way, the granularity and richness of the universe of
biological concepts can be modeled by ontologies.
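The distinction between IS-A and HAS-A relationships can be sketched with ordinary object-oriented constructs. The class names, the `lineage` helper, and the example components below are illustrative assumptions and are not taken from GO or from DIAN; the sketch only shows inheritance modeling IS-A and composition modeling HAS-A.

```python
# IS-A: a derived concept inherits from its base concept.
class MolecularFunction:
    label = "molecular function"


class Enzyme(MolecularFunction):      # an Enzyme IS-A MolecularFunction
    label = "enzyme"


class Transferase(Enzyme):            # a Transferase IS-A(n) Enzyme
    label = "transferase"


def lineage(concept_cls):
    """List a concept and all of its base concepts, most specific first."""
    return [c.label for c in concept_cls.__mro__ if c is not object]


# HAS-A: a container object holds other objects as members.
class Component:
    def __init__(self, name, parts=()):
        self.name = name
        self.parts = list(parts)      # the container HAS-A list of parts

    def all_parts(self):
        """Names of every part contained directly or indirectly."""
        names = []
        for part in self.parts:
            names.append(part.name)
            names.extend(part.all_parts())
        return names


nucleolus = Component("nucleolus")
nucleus = Component("nucleus", [nucleolus])
cell = Component("cell", [nucleus])

transferase_lineage = lineage(Transferase)
cell_parts = cell.all_parts()
```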
Table 1. Example Queries that Cannot be Resolved Accurately by Conventional Querying Systems
Case 1: Novel gene with no predicted function
  Source record, identifier: UniGene Hs. 104305
  Source record, annotation: None
  DIAN system annotation, DoubleTwist biological role: Non-immune cell defense; Apoptosis
  DIAN system annotation, enzyme classification: None
  DIAN system annotation, protein function: None
  DIAN system annotation, EGAD cellular role: cell/organism defense; homeostasis; apoptosis; cell division; apoptosis
  DIAN system annotation, structure (SCOP): All alpha proteins; DEATH domain; DEATH domain; DEATH domain
Case 2: Hypothetical gene with predicted function
  Source record, identifier: SWISS-PROT P39959
  Source record, annotation: Putative Zinc Protein
  DIAN system annotation, DoubleTwist biological role: Genome structure and Gene expression; Transcription factors
  DIAN system annotation, enzyme classification: None
  DIAN system annotation, protein function: DNA or RNA associated proteins
  DIAN system annotation, EGAD cellular role: gene/protein expression; RNA synthesis; transcription factors
  DIAN system annotation, structure (SCOP): Small proteins; Classic zinc finger, C2H2
Case 3: Known gene with novel predicted function
  Source record, identifier: UniGene Hs. 2157; SWISS-PROT: P42768; OMIM: 30100
  Source record, annotation: Wiskott-Aldrich syndrome protein
  DIAN system annotation, DoubleTwist biological role: Non-immune cell defense; Apoptosis
  DIAN system annotation, enzyme classification: Transferases
  DIAN system annotation, protein function: Enzymes; Transferase; Post-translational modifications
  DIAN system annotation, EGAD cellular role: cell division; apoptosis
  DIAN system annotation, structure (SCOP): All beta proteins; PH domain-like; PH domain-like; Enabled/VASP homology 1 domain (EVH1 domain)
Three illustrative cases of records that cannot be returned by conventional keyword-based querying systems but that were returned by DIAN are described here.
Pouliot et al.
To encapsulate biological conceptual objects and support
the goal of concept-based searching, the DIAN algorithm segments the spaces of protein function, biological role, and protein structure using a collection of ontologies. Although
HAS-A relationships should be supportable, in this study we
rely exclusively on IS-A ontologies as a paradigm to show the
DIAN methodology. Computationally, ontologies take the
form of a graph, a tree being a special form of a graph. A node
always inherits the properties of all parental nodes, such that
a complete description of the biochemical function of a protein involves retracing the path from the leaf to the root of the
tree. The first three levels of the PROSITE Protein Function
ontology are used to illustrate this conceptualization (Fig. 1).
Starting from the root of the tree (level 0), each level describes
biochemical protein function in increasingly greater detail. In
this illustration, six proteins were assigned to the transferase
node. Because this is a protein function ontology, proteins
can belong to different families and species and yet be assigned to the same node. By providing standard classification
data structures, ontologies are ideal in providing a common
platform for annotation and therefore promoting reuse across
different informatics systems and research fields. Because the
focus of this paper is on a methodology for assigning protein
sequences to ontologies, the relative merits of individual ontologies are addressed only briefly.
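The statement that a complete description requires retracing the path from a leaf to the root can be sketched with parent pointers. The `Node` class and the three-level example are assumptions made for illustration; only the node names "enzyme" and "transferase" come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One concept in an IS-A tree; parent is None for the root (level 0)."""
    name: str
    parent: Optional["Node"] = None


def path_to_root(node):
    """Collect node names from the given node up to the root."""
    path = []
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path


# Hypothetical three-level slice of a protein function ontology.
root = Node("protein function")
enzyme = Node("enzyme", root)
transferase = Node("transferase", enzyme)

# A protein assigned to the transferase node is fully described by every
# concept along this path, since each node inherits from its parents.
full_description = path_to_root(transferase)
```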
Choice of Ontologies
Because an ontology is essentially a specification of conceptualization (Karp 2000), the choice and quality of the chosen
ontologies are essential in ensuring the integrity of encapsulating the biological data. To support the conceptualization of
protein functions, biological roles, and cellular processes, substantial attention has been devoted, both in academia and
industry alike, to the development of various ontologies to
meet these needs. Examples include the enzyme commission
classification system (Commission on Biochemical Nomenclature and International Union of Biochemistry. Standing
Committee on Enzymes 1973; International Union of Biochemistry and Molecular Biology. Nomenclature Committee
and Webb 1992; International Union of Biochemistry. Nomenclature Committee and Commission on Biochemical Nomenclature 1979; International Union of Biochemistry. Nomenclature Committee et al. 1979; International Union of
Biochemistry. Nomenclature Committee et al. 1984; International Union of Biochemistry. Standing Committee on Enzymes 1965); the Escherichia coli Protein Function ontology
(Riley 1993); the EcoCyc system for E. coli metabolic pathway
(Karp 2000); the PROSITE ontology of domain biological
functions (Hofmann et al. 1999); the GO ontologies (Ashburner et al. 2000); the KEGG system for the classification of
genes according to pathway information (Ogata et al. 1999);
RIBOWEB (Chen et al. 1997); and the TIGR expressed gene
anatomy database (EGAD, http://www.tigr.org/tdb/egad/egad.shtml). Similarly, to facilitate understanding of, and access to, information on protein structures, several protein
structure classifications have been constructed (Murzin et al.
1995; Orengo et al. 1997). Despite these efforts, there is still
no accepted ontology with the necessary robustness, comprehensiveness, and level of detail to satisfy the demands of genome annotation, although this is an implied goal of the GO
project.
Given these limitations, and in the absence of the GO
ontologies, we originally chose to rely on various publicly
available ontologies, in addition to deriving the DoubleTwist
Biological Role Ontology (Table 1). For Protein Function,
DIAN supports the PROSITE Protein Function and the enzyme
commission classification. Given their rigor, comprehensiveness, and rapid evolution, the GO ontologies, including their
three components (molecular function, biological process,
cellular component), are expected to be integrated within
DIAN in the foreseeable future. For the Cellular Role of proteins, the TIGR expressed gene anatomy database (EGAD) ontology and DoubleTwist Biological Role ontology are supported. Although not very comprehensive, the EGAD ontology currently is the only publicly available ontology designed
to inventory human expressed genes. The DoubleTwist Biological Role ontology was derived from Riley’s Protein Function ontology (Riley 1993) and has been designed for the concise conceptual encapsulation of the biological role of the
human gene to enable comprehensive human genome assignment. As for protein structure classification, the Structure
Classification of Proteins (SCOP) ontology was selected because SCOP is sequence-based and its classifications provide a
detailed and comprehensive description of the structural and
evolutionary relationships of proteins of known structure
(Murzin et al. 1995).
Figure 1 Defining ontologies. Ontologies represent a specification
of a domain of knowledge expressed in the structure of mathematical
graphs (a tree being a special form of a graph). Connecting lines
represent the relationship between the nodes, specifically IS-A relationships. Known protein functions are assigned to nodes (represented by circles) within the ontology graph. A child node always
inherits the properties of all parent nodes, such that a complete description of the biochemical function of a protein involves retracing
the path from the leaf to the root of the tree.
Architectural Design
DIAN (Fig. 2) integrates several databases through algorithms
that perform the ontological assignment of proteins on the
basis of two distinct principles. The first algorithm (vocabulary-based mapping) relies on the recognition of vocabulary
within a source record from a database of protein annotations. The second algorithm (domain-based mapping) assigns
protein sequences based on the detection of protein domains
and does not rely on preexisting sequence annotation.
DIAN has several subcomponents in support of these
functions: a knowledge base of assignments of SWISS-PROT
proteins to ontologies; two databases that provide operational
definitions for each ontological node, based either on vocabulary or the assignment of protein domains; two assignment
algorithms for assigning proteins on the basis of either vocabulary or the presence of protein domains; and lastly a data
indexing and retrieval engine to support user queries. Each
subcomponent is described in the following sections.
Figure 2 DIAN overview. (Left) computer-aided human curation
process for the assignment of SWISS-PROT sequences to the Protein
Function and Cellular Role ontologies; (right) application of ontologies to organize biological annotation databases. Multiple ontologies,
each representing a body of biological knowledge, are stored in the
DIAN database. Individual source records stored in biological sequence and structure annotation databases are associated with one or
more ontologies via domain-based and/or vocabulary-based mapping, such that they can be queried simultaneously across multiple
ontologies. (GB) GenBank; (GP) GenPept; (SP) SWISS-PROT; (DB) database.
Development of the DIAN Knowledge Base
An essential component in DIAN is a knowledge base derived
from a computer-aided human curation process that associates entries of the SWISS-PROT database to ontologies. SWISS-PROT is known for its high-quality curated annotations of
protein sequences and minimal level of redundancy (Bairoch
and Apweiler 2000). Although most sequence databases provide SWISS-PROT links to leverage its high-quality annotations, accurate and comprehensive classification of SWISS-PROT entries onto Protein Function and Cellular Role ontologies has not been achieved. DIAN relies on this knowledge
base as a foundation to define parameters and data sets to
support the computational assignment of proteins to ontologies.
During the early phase of the development of this knowledge base, we attempted to rely on preexisting links between
SWISS-PROT and other publicly available databases to determine whether these links could be used directly to associate
SWISS-PROT records to ontologies. It was found that this superficially straightforward method of assignment is error
prone, and that the resulting coverage of SWISS-PROT was not
comprehensive. Instead, the assignment of SWISS-PROT to
the Protein Function and Cellular Role ontologies stored in
the knowledge base was achieved through a computer-aided
manual curation process (illustrated in Fig. 2). A group of
scientific curators was assembled to manually assign SWISS-PROT sequences to the DIAN Protein Function and Cellular
Role ontologies by matching the functional annotation of
each SWISS-PROT record to the definition of each node in a
given ontology. To ensure the high accuracy of this underlying data set, we analyzed only the subset of SWISS-PROT proteins that are full length and have been characterized biochemically. This resulted in the initial assignment of over
40,000 proteins to the DIAN ontologies. Subsequent to this
manual curation process, a database of controlled vocabulary
was derived from the assignment, in which for each ontological node, essential keywords were extracted from the annotations of the SWISS-PROT proteins assigned to the node. To
enhance the selectivity and sensitivity of each definition, this
data set was used to partition SWISS-PROT according to records that are either positively or negatively assigned to a
node. Each set of partitioned SWISS-PROT records was examined thoroughly by curators to identify false positive records
in the positive pool, and records characterized as false negatives in the negative pool. This information was then used as feedback for
a second round of keyword refinement,
generating a subsequent, more refined set of controlled vocabulary. This process was repeated until no further false positives could be identified. Once this
data set stabilized for all nodes, the SWISS-PROT-Ontology
assignment table was finalized, resulting in the assignment of
over 84% of SWISS-PROT (for further details, see on-line
supplementary Table 2B at http://www.genome.org). This information was added to the knowledge base, such that it now
provides an operational definition that expresses the knowledge associated with each node. The knowledge base was later
used as the foundation for the development of nodal signatures, and along with periodic verifications of the selectivity
and sensitivity on new releases of SWISS-PROT, it ensures the
continued assignment of SWISS-PROT entries to the Protein
Function and Cellular Role ontologies as SWISS-PROT
evolves.
Development of Nodal Signatures
To classify biological sequence annotations by assigning them
to ontologies, we developed annotation signatures for each
node of the supported ontologies. Such nodal signatures provide the operational definitions used by the DIAN assignment
algorithms to recognize properties in protein sequences, such
that sequences from input databases can be assigned to ontologies. Two kinds of nodal signatures are used in the DIAN
algorithm: signatures based on either controlled vocabulary
or protein domain profiles. A protein domain is here defined
as an independent structural unit, which can be found alone,
or in conjunction with other domains. Domains are often the
mediators of the biochemical functions of proteins, although
a substantial fraction of domains appears to play structural
roles only. For this and other reasons, not all domains can be
used as nodal signatures. For the Protein Function and Cellular Role ontologies, controlled vocabulary databases are used
to efficiently collapse protein annotations present in source
records and to assign these records to ontologies, as was done
when assigning SWISS-PROT sequences to ontologies during
the development of the knowledge base. This controlled vocabulary is expected to accurately classify sequences via annotations preexisting in the source records as long as the quality
of these annotations is comparable to that of SWISS-PROT.
Although sensitive enough to capture input sequence annotations under most circumstances, this approach is essentially
a keyword-matching mechanism that may incorrectly assign
records to ontologies as compared with the actual sequence
annotation. This is an expected consequence of the process by
which nodal vocabularies are derived. For example, it is possible for both a kinase substrate and a kinase enzyme to become assigned to the same ontology kinase node, when in
fact only kinase enzymes should be assigned to this node. This
is a consequence of the difficulty of defining assignments on
the basis of vocabularies alone.
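The keyword-matching behavior described above, including the kinase substrate problem, can be sketched as follows. The "enzyme" keyword list is the one quoted in the text for that node; the "kinase" keyword list, the `NODE_KEYWORDS` structure, and the annotation strings are assumptions made for illustration and do not reproduce the DIAN controlled vocabulary.

```python
import re

# Hypothetical nodal vocabularies: a record is assigned to a node when any
# of the node's keywords occurs as a whole word in its annotation.
NODE_KEYWORDS = {
    "enzyme": ["oxidoreductase", "transferase", "hydrolase",
               "lyase", "isomerase", "ligase"],
    "kinase": ["kinase"],  # assumed keyword list, for illustration only
}


def assign_by_vocabulary(annotation):
    """Return the set of nodes whose keywords match the annotation text."""
    text = annotation.lower()
    hits = set()
    for node, keywords in NODE_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            hits.add(node)
    return hits


# A genuine kinase enzyme is assigned to the kinase node.
enzyme_hit = assign_by_vocabulary("Serine/threonine-protein kinase")
# A kinase substrate is also assigned to it: a false positive that keyword
# matching alone cannot rule out.
substrate_hit = assign_by_vocabulary("Protein kinase C substrate")
# A transferase annotation matches the enzyme node vocabulary.
transferase_hit = assign_by_vocabulary("Glutathione S-transferase")
```

Both the enzyme and its substrate land on the same node here, which is the limitation that motivates the additional domain-based algorithm.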
Of larger consequence is the intrinsic quality of the annotations associated with a sequence to be assigned, because
annotations in most sequencing projects are transferred through sequence similarity alignments with characterized genes or proteins. This can lead to so-called “multiple-linkage” errors during the annotation transfer process, which
create misleading annotations due to the localization of the
alignment in a region with low functional information content (e.g., a region devoid of a functional domain). Therefore,
an additional assignment algorithm was derived to compensate for this well-known problem by relying only on the presence of domains within protein sequences or the translation
of DNA sequences into proteins. Whereas evolutionarily and
functionally related protein sequences can diverge significantly through evolution, three-dimensional substructures,
such as motifs, domains, and active sites, can remain largely
unchanged (Gusfield 1997). As a result, protein domain profiles compiled from multiple sequence alignments can enable
more accurate representation of protein families and superfamilies. Furthermore, such conserved sequence features are
highly correlated with structure and function. As a result of
the success of the protein profiling methodology, several protein domain and motif databases have been built: PFAM (Sonnhammer et al. 1998), PROSITE (Bairoch 1991), PRODOM
(Corpet et al. 1998), DOMO (Gracy and Argos 1998), EMOTIF/EMATRIX (Nevill-Manning et al. 1998; Wu et al. 2000),
BLOCKS (Henikoff et al. 1999), and PRINTS (Attwood et al. 1997).
Although in the current DIAN algorithm we have chosen to
rely on domains provided by the PFAM database because of its
extensive coverage and the richness of its associated annotations, other domain or motif databases can be integrated in
the same fashion.
Because of the close relationship between a given protein
domain and the function and structure of a protein that harbors this domain, the ontological classification of protein sequences using well-chosen protein domains can be achieved
by using an effective balance between the specificity and sensitivity of individual domains. A filtering algorithm was therefore developed to select domains qualified to function as
nodal signatures to be used in assigning proteins to ontologies. Comprehensive analyses of the DIAN knowledge base for
patterns of association between PFAM domains and SWISS-PROT sequences assigned to ontological nodes revealed frequent many-to-many relationships between domains and
nodes. To promote specificity, it was therefore necessary to
analyze all preliminarily assigned protein domains for the
possibility of conversion to nodal signatures for a particular
node. This was accomplished in two steps. First, for each
of the protein domains in the source pool, the annotations of
all SWISS-PROT sequences containing a particular protein domain were compared against the assignment of these sequences
to a node, as maintained in the DIAN knowledge base. Second, if the set of annotations associated with sequences containing a given protein domain was found to be correlated
with the description of the node, this domain was accepted as
an annotation signature for that ontology node, because it is relatively specifically correlated with that node.
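The filtering idea can be sketched numerically, with one important caveat: the text describes a curated comparison of annotations against node descriptions, whereas the sketch below substitutes a simple specificity ratio. The function name, the `min_specificity` and `min_support` thresholds, and the toy data are all assumptions and should not be read as the DIAN criteria.

```python
def select_signature_domains(protein_domains, node_members,
                             min_specificity=0.9, min_support=2):
    """Pick domains that occur almost only in proteins assigned to a node.

    protein_domains: dict mapping protein id -> set of domain names
    node_members:    set of protein ids assigned to the node
    Thresholds are illustrative assumptions, not values from the paper.
    """
    selected = {}
    all_domains = set().union(*protein_domains.values())
    for domain in sorted(all_domains):
        carriers = {p for p, doms in protein_domains.items() if domain in doms}
        in_node = carriers & node_members
        specificity = len(in_node) / len(carriers)
        if len(in_node) >= min_support and specificity >= min_specificity:
            selected[domain] = specificity
    return selected


# Toy knowledge base: P1-P4 are assigned to the node, P5 and P6 are not.
toy_domains = {
    "P1": {"domA", "domS"},
    "P2": {"domA"},
    "P3": {"domB", "domS"},
    "P4": {"domB"},
    "P5": {"domS"},
    "P6": {"domS", "domC"},
}
toy_node = {"P1", "P2", "P3", "P4"}

signatures = select_signature_domains(toy_domains, toy_node)
```

In this toy case `domS` is rejected because it occurs as often outside the node as inside it, which mirrors the many-to-many relationships the filtering step is meant to resolve.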
Figure 3 Converting protein domains into ontological nodal signatures. The Protein Function ontology is used here to illustrate the
derivation and assignment of nodal annotation signatures. Proteins
are depicted as rectangles; identical colors indicate membership to
the same protein family in a given species, whereas the various protein domains are represented as geometrical shapes.
These concepts are illustrated in Figure 3 in the case of
the Protein Function ontology. This ontology is expressed as
a tree in which each node represents a concept and is associated with other concepts via an “IS-A” relationship. The root
of this tree (level 0) is a generic function. Child nodes inherit
the properties of their parent and express increasingly specific
protein functions. For example, among the children of the
root lies the “enzyme” node, which is defined as “biomolecules that can catalyze reactions.” Associated with this node
are keywords positively correlated with this function, such as
“Oxidoreductase OR Transferase OR Hydrolase OR Lyase OR
Isomerase OR Ligase”. As a first step in the derivation of the
DIAN knowledge base, proteins described in the SWISS-PROT
database were assigned to the most specific nodes possible.
Here, six proteins were assigned to the transferase node (Fig.
3). Two proteins belong to the same gene family and are of
human origin, whereas all other proteins are from different
gene families from various species. Various protein domains
are present within these proteins, sometimes more than once
in a given protein. Thus, a total of five distinct types of protein domains are present within the group of proteins assigned to the transferase node. However, only three types of
domains are retained by DIAN as protein annotation signatures, because according to the DIAN knowledge base these
domains are the only domains to be specifically associated
with transferase-related functions. Thus, the two remaining
domain types were rejected as annotation signatures because
they either do not encode a function related to transferases,
or are purely structural domains not directly involved in protein function. In this way, any database of protein motifs or
domains can in principle be integrated in the DIAN algorithm
to derive ontological node signatures. The current DIAN
implementation relies on protein domains from the PFAM
database as its source of protein domains to be converted into
ontological node signatures.
Based on overlaps between the annotations present in
the 86,593 sequences of release 39 of SWISS-PROT and the
concepts associated with our ontological nodes, computer-aided human curation associated 73% of SWISS-PROT sequences to the PROSITE Protein Function ontology, 68% to
the EGAD Cellular Role ontology, and 68% to the DoubleTwist Biological Role ontology. Overall, 205,694 keyword-based patterns and 1699 PFAM domains were compiled to
represent the biological concepts associated with each ontology node.
Nodal signatures for the structural ontology were derived
differently from the process described in Figure 3. This was
achieved by profiling the SCOP domain sequences compiled
by the SCOP consortium (Brenner et al. 1998), using selected
protein domains from the PFAM database. Because high sequence similarity usually implies significant functional and
structural similarity (Gusfield 1997), 824 PFAM domains were
identified that are referenced in sequences of the SCOP domain database (S. Brenner, pers. comm.). These PFAM domains show strong sequence similarity to SCOP domains and
were selected because they are likely to represent a similar
structure in three-dimensional space.
Ontological Assignment Process
Two assignment algorithms are used to assign proteins to
DIAN ontologies. This is achieved on the basis of either the
presence of protein domains or the recognition of vocabularies within the source record. As shown in Figure 2, annotations in various biological sequence databases, including GenBank, SWISS-PROT, GenPept, PDB, and UniGene, are collapsed through either the domain-based or vocabulary-based
algorithms into a centralized DIAN database. In cases where
DNA sequences are operated on by the domain-based algorithm, a translation algorithm is applied, as DIAN ultimately operates only on protein sequences. Genomic DNA sequences are treated differently in this process because these
sequences show very different properties from cDNA and proteins. In particular, sequence length can easily exceed a million characters. For this reason, it would be incorrect
to apply ontologies directly at the level of an entire genomic
sequence. Thus, location coordinates are essential to segment
genomic sequences into biologically meaningful ranges
(“units”) before further processing. If available in the source
record, information specifying the presence of genes, derived
ab initio or experimentally, is used to define the unit. However, in sequences derived from high-throughput sequencing
projects (e.g., sequences from the GenBank HTG division),
this information is frequently unavailable. In such cases,
DIAN can use available gene predictions from algorithms such
as GENSCAN (Burge and Karlin 1997) or GENEWISE (Birney and
Durbin 2000) to locate the genes in the genomic sequence.
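The segmentation and translation steps can be sketched as follows. This is an illustration under stated assumptions: exon coordinates are taken to be 0-based, end-exclusive, and on the forward strand; the helper names `translate` and `units_to_proteins` are hypothetical; and no attempt is made to parse real GENSCAN or GENEWISE output. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def units_to_proteins(genomic, gene_units):
    """Cut a genomic sequence into gene units and translate each one.

    gene_units maps a unit name to a list of (start, end) exon coordinates
    (assumed 0-based, end-exclusive, forward strand).
    """
    proteins = {}
    for name, exons in gene_units.items():
        coding = "".join(genomic[start:end] for start, end in sorted(exons))
        proteins[name] = translate(coding)
    return proteins


# Toy genomic sequence: flank, exon 1, intron, exon 2, flank.
genomic = "CC" + "ATGGCC" + "GTAAGT" + "AAATGA" + "CC"
predicted_units = {"gene1": [(2, 8), (14, 20)]}

unit_proteins = units_to_proteins(genomic, predicted_units)
```

Each translated unit, rather than the whole genomic record, would then be passed to the domain-based assignment step.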
As mentioned earlier, another assignment approach applied by DIAN is based on the scanning of annotations associated with the input biological sequence using a vocabulary-based mapping process. This is accomplished by the application of a collection of keywords that serve as the ontology
node annotation signatures, enabling the collapse of preexisting annotations and their assignment to ontological nodes.
The input sequence annotations can be derived from sequence similarity information, domain profiling information,
human curation, computation-derived annotations, third-party annotations, and so forth. Together, the domain-based
and vocabulary-based algorithms are used by DIAN to annotate and classify sequences from input biological databases in
a high-throughput manner.
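As a rough illustration of vocabulary-based mapping, the sketch below matches free-text annotations against keyword patterns attached to ontology nodes. The node names are borrowed from Table 3A, but the patterns themselves are invented for this example and do not reproduce DIAN's curated vocabulary.

```python
import re
from typing import Dict, List

# Hypothetical node signatures: ontology node -> keyword patterns.
NODE_SIGNATURES: Dict[str, List[str]] = {
    "Transcription factors": [r"\btranscription factor\b", r"\bzinc finger\b"],
    "DNA repair": [r"\bDNA repair\b", r"\bmismatch repair\b"],
}


def assign_by_vocabulary(annotation: str) -> List[str]:
    """Return the ontology nodes whose keyword signature matches the annotation."""
    hits = []
    for node, patterns in NODE_SIGNATURES.items():
        if any(re.search(p, annotation, flags=re.IGNORECASE) for p in patterns):
            hits.append(node)
    return sorted(hits)


nodes = assign_by_vocabulary("Putative zinc finger protein involved in DNA repair")
```

A record can match several nodes at once, which mirrors the way a single annotation may carry more than one concept.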
1772
Genome Research
www.genome.org
DIAN Algorithm Evaluation
The sensitivity and selectivity of the DIAN algorithm were
evaluated. Based on sequence similarity results, the vocabulary-based algorithm implicitly transfers existing annotations
and assigns proteins to ontological nodes. However, this process suffers from two intrinsic types of errors. First, because of the
variability of vocabularies in the annotations, it is very difficult to identify and compensate for incorrect annotations
during this annotation transfer process. Second, multiple linkage errors arise when annotations are
wrongly transferred because the sequence similarity between
the two sequences is confined to core structural regions
with low information content, rather than encompassing
functional domains. The domain-based assignment
algorithm is not susceptible to these problems. Moreover, although
the domain-based algorithm generated
less coverage than the vocabulary-based algorithm, it can make annotation assignments in
the absence of preexisting annotations in the source records.
The accuracy of an ontological mapping algorithm such
as DIAN is defined as the fraction of correct assignments made
to the nodes of an ontology, both in terms of type I variations
(assignments that should not have been made but are present)
and type II variations (assignments that are missing and that
should have been made). Here we use the terms types I and II
“variation”, rather than “type I error” and “type II error”, to
emphasize that providing exact error rates in this context is
fundamentally impossible (see the following discussion of error measurements in this context). The accuracy of the DIAN
algorithm was evaluated using three complementary approaches, summarized in Table 2. The construction of the
underlying data sets is described in Figure 4. Detailed results
of evaluations are documented in Table 3.
The DIAN assignments of well-characterized mouse
sequences were compared with assignments made via an
independent assignment process (method 2, Table 2). These
assignments were provided by the Mouse Genome Database (MGD; Blake et al. 2000) using the Molecular Function
and Biological Process ontologies from the Gene Ontology
(GO) Consortium (Ashburner et al. 2000; http://www.
geneontology.org). The application of GO ontologies to the
mouse genome was chosen over that of other organisms such
as Drosophila because of its closer relationship to
human proteins and the bias in the SWISS-PROT database
toward higher organisms. Because these ontologies are different from those currently supported by DIAN, a cross-referencing was first determined to enable comparisons of assignments. As shown in Figure 4B, comparing assignments
made to ontologies is accomplished first by manually selecting nodes from a reference ontology for concepts that are
shared between the ontologies. Because of the different levels
of resolution supported by different ontologies, nodes at
equivalent levels of resolution need to be identified. This results in some of the terminal nodes of one ontology being
associated with middle nodes of the counterpart ontology.
Furthermore, multiple nodes from one ontology may need to
be selected to represent the concepts associated with a single
node from the counterpart ontology (indicated by purple
nodes from the reference ontology, all of which are conceptually equivalent to a single node from the DIAN ontology).
Thus, the node associated with the INHIBITORS concept on
level 3 of the DIAN ontology is conceptually equivalent to the
APOPTOSIS INHIBITORS and ENZYME INHIBITORS nodes
and subnodes on levels 6 and 8 and lower of the reference
ontology. Other problems arise from the differing extent of
coverage between ontologies, which can obscure the interpretation of the comparison. In this example, there are several
more proteins mapped to the DIAN ontology than to corresponding nodes of the reference ontology. Some proteins are
mapped to both ontologies (green area, where individual pair
members are indicated by double arrows), whereas other proteins are mapped only to a single ontology (red area). Within
the latter, a manual review will find that some proteins are
correctly mapped (blue rectangles), whereas others are incorrectly mapped (yellow rectangles). Lastly, there can be variations in the comprehensiveness of assignments made to individual proteins, such that only a fraction of the properties
associated with a single protein are assigned to an ontology
(data not shown). Detailed results of this evaluation are listed
in Table 3A and 3B.
Table 2. Methodologies Involved in the Evaluation of DIAN: Strengths and Weaknesses
Approach number: 1
Approach type: Manual verification of assignments made to selected proteins.
Description: In-depth review by domain experts of assignments made to well-understood proteins.
Strengths: Extensive human expertise can confirm assignments made by method and substantiate its effectiveness.
Weaknesses: Suffers from lack of comprehensiveness; biased in favor of well-understood proteins.
Approach number: 2
Approach type: Comparisons with other assignment data sets using a test set of sequences.
Description: Evaluation of sequence assignments made to cross-referenced ontologies using different methods.
Strengths: Presence of extensive shared assignments for numerous proteins lends credence to the method under evaluation.
Weaknesses: Assumes that the reference ontology can be treated as a standard of comparison; in practice, this is not the case. Results in the identification of weaknesses in both the test and reference ontology. Manual review is required to evaluate unbalanced assignments.
Approach number: 3
Approach type: Comparisons between orthologs.
Description: Verification that assignments made to closely related orthologs are balanced (i.e., nearly identical).
Strengths: Strong expectation that balanced assignments will be made.
Weaknesses: Although orthologs share functions, even orthologs from closely related species don’t necessarily have identical functions, resulting in unbalanced assignments; manual review is required to evaluate unbalanced assignments.
A number of intrinsic problems were identified from our
Figure 4 Validation approaches. (A) Evaluating the effectiveness of DIAN by comparing assignments made to a reference ontology. Selected
nodes from Gene Ontology (GO) ontologies were manually associated with nodes in the DIAN ontologies. Sequences assigned to these GO nodes
by the MGI were processed by the DIAN pipeline to compare the assignments made by DIAN with those made by MGI. (B) Associating nodes and
sequences from a reference ontology to a DIAN ontology for comparative evaluation. To estimate the error rates associated with the DIAN
assignment algorithms, we compared mouse sequences mapped via DIAN (A) with assignments made to GO ontologies by MGI.
Pouliot et al.
Table 3. Comparison between DIAN and MGI Ontological Assignments
Results from the comparative approach are shown. A number of intrinsic problems were identified from this approach, such that type I and type
II variances described here are for comparative purposes only and cannot be interpreted strictly as type I and II errors.
Table 3A. Comparing Assignments Made to the Cellular Role Ontology
Columns: Concept; DIAN node number; Highest level matching GO nodes; Present in (DIAN and GO, DIAN only, GO only); Variation (Type I, Type II); Sensitivity; Selectivity. Values are listed column by column.
Concept: Chromosome structure; Transcription factors; DNA duplication; Cell-cell adhesion; Microtubule; DNA repair; Programmed cell death; Channel and transporter; Amino acid metabolism; Stress response; Nucleotide metabolism; Cofactor metabolism
DIAN node number: 1.1; 1.4; 3.2; 5.2; 9.1.1.1; 6.2; 8.1; 8.2; 4.6; 9.2; 8.4; 9.4; 9.5
Highest level matching GO nodes: GO:0007001;GO:0006323, GO:0003700, GO:0006260;GO:0003964, GO:0007155, GO:0008135, GO:0007017, GO:0006281, GO:0006915, GO:0006810;GO:0005216, GO:0006519, GO:0006950, GO:0006140;GO:0006205, GO:0006143, GO:0006731
Present in DIAN and GO: 7; 59; 10; 35; 1; 9; 14; 14; 47; 4; 5; 0; 4
Present in DIAN only: 4; 54; 2; 15; 0; 0; 2; 7; 8; 1; 6; 2; 0
Present in GO only: 4; 38; 2; 14; 1; 2; 9; 23; 27; 7; 55; 7; 3
Type I variation: 0.267; 0.358; 0.143; 0.234; 0.000; 0.000; 0.080; 0.159; 0.098; 0.083; 0.091; 0.222; 0.000
Type II variation: 0.267; 0.252; 0.143; 0.219; 0.500; 0.182; 0.360; 0.523; 0.329; 0.407; 0.833; 0.778; 0.429
Sensitivity: 0.636; 0.608; 0.833; 0.714; 0.500; 0.818; 0.609; 0.378; 0.635; 0.476; 0.083; 0.000; 0.571
Selectivity: 0.636; 0.522; 0.833; 0.700; 1.000; 1.000; 0.875; 0.667; 0.855; 0.625; 0.455; 0.000; 1.000
Total DIAN and GO: 229
Total DIAN only: 113
Total GO only: 214
Total: 556
Average type I: 0.203
Average type II: 0.385
Sensitivity: 0.517
Selectivity: 0.670
DIAN assignments made to a group of well-characterized, nonredundant mouse sequences were compared to assignments made by the MGI
to the GO Process and Function ontologies. GO nodes corresponding to DIAN nodes are listed, along with the abbreviated essential concept
from the DIAN Role ontology. For brevity, only the highest level GO nodes are listed. The number of sequences whose assignment is shared
by both sets of ontologies is indicated (DIAN and GO), as well as the number of sequence assignments that differed (DIAN only, GO only).
These numbers are used to calculate Type I and II variation using the following equations:
Type I variation = DIAN only/(DIAN and GO + DIAN only + GO only);
Type II variation = GO only/(DIAN and GO + DIAN only + GO only);
Sensitivity = DIAN and GO/(DIAN and GO + GO only);
Selectivity = DIAN and GO/(DIAN and GO + DIAN only).
Sensitivity is defined as the ability of the DIAN algorithm to make what are believed
to be all possible correct assignments. Selectivity is defined as the ability of the DIAN algorithm to not make what is believed to be an incorrect
assignment.
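The four equations in the Table 3A legend can be written as a short function. This is a sketch for checking the arithmetic, not the authors' code; the function and key names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for convenience.

```python
from typing import Dict


def variation_metrics(both: int, dian_only: int, go_only: int) -> Dict[str, float]:
    """Type I/II variation, sensitivity and selectivity as defined in the Table 3A legend."""
    total = both + dian_only + go_only

    def ratio(numerator: int, denominator: int) -> float:
        # Assumption: report 0.0 when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    return {
        "type_I": ratio(dian_only, total),
        "type_II": ratio(go_only, total),
        "sensitivity": ratio(both, both + go_only),
        "selectivity": ratio(both, both + dian_only),
    }


# Totals reported for Table 3A: 229 shared, 113 DIAN only, 214 GO only.
totals = {name: round(value, 3) for name, value in variation_metrics(229, 113, 214).items()}
```

Applied to the Table 3A totals, the function gives 0.203, 0.385, 0.517, and 0.670, matching the summary values reported in the table.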
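Table 3B. Comparing Assignments Made to the Protein Function Ontology
Concept: Hormones and active peptides
DIAN node number: 10
Highest level matching GO nodes: GO:0005179; GO:0005103; GO:0005104; GO:0005105; GO:0005106; GO:0005109; GO:0005110; GO:0005111; GO:0005112; GO:0005113; GO:0005114; GO:0005115; GO:0005116; GO:0005117; GO:0005118; GO:0005119; GO:0005120; GO:0005121; GO:0005122; GO:0005123; GO:0005124; GO:0005177; GO:0005178; GO:0005186
Present in DIAN and GO: 7; DIAN only: 3; GO only: 8
Type I variation: 0.167; Type II variation: 0.444; Sensitivity: 0.467; Selectivity: 0.700
Concept: Inhibitors
DIAN node number: 12
Highest level matching GO nodes: GO:0004857; GO:0008189; GO:0005074; GO:0005092; GO:0008200; GO:0005517
Present in DIAN and GO: 12; DIAN only: 3; GO only: 9
Type I variation: 0.125; Type II variation: 0.375; Sensitivity: 0.571; Selectivity: 0.800
Concept: DNA or RNA associated proteins
DIAN node number: 3
Highest level matching GO nodes: GO:0003676; GO:0003735; GO:0004748; GO:0003910; GO:0003911; GO:0004518; GO:0003899; GO:0008534; GO:0008263; GO:0003907; GO:0003905; GO:0003906; GO:0003904; GO:0004844; GO:0003908
Present in DIAN and GO: 255; DIAN only: 21; GO only: 26
Type I variation: 0.070; Type II variation: 0.086; Sensitivity: 0.907; Selectivity: 0.924
Concept: Protein secretion and chaperones
DIAN node number: 13
Highest level matching GO nodes: GO:0003754; GO:0008565; GO:0006605
Present in DIAN and GO: 11; DIAN only: 3; GO only: 2
Type I variation: 0.188; Type II variation: 0.125; Sensitivity: 0.846; Selectivity: 0.786
Concept: Electron transport proteins
DIAN node number: 5
Highest level matching GO nodes: GO:0005489
Present in DIAN and GO: 0; DIAN only: 7; GO only: 6
Type I variation: 0.538; Type II variation: 0.462; Sensitivity: 0.000; Selectivity: 0.000
Concept: Other transport proteins
DIAN node number: 6
Highest level matching GO nodes: GO:0005215
Present in DIAN and GO: 62; DIAN only: 17; GO only: 19
Type I variation: 0.173; Type II variation: 0.194; Sensitivity: 0.765; Selectivity: 0.785
Concept: Structural proteins
DIAN node number: 7
Highest level matching GO nodes: GO:0005198
Present in DIAN and GO: 31; DIAN only: 23; GO only: 40
Type I variation: 0.245; Type II variation: 0.426; Sensitivity: 0.437; Selectivity: 0.574
Concept: Receptors
DIAN node number: 8
Highest level matching GO nodes: GO:0004872
Present in DIAN and GO: 67; DIAN only: 43; GO only: 15
Type I variation: 0.344; Type II variation: 0.120; Sensitivity: 0.817; Selectivity: 0.609
Concept: Cytokines and growth factors
DIAN node number: 9
Highest level matching GO nodes: GO:0008083; GO:0005125; GO:0008009
Present in DIAN and GO: 35; DIAN only: 19; GO only: 10
Type I variation: 0.297; Type II variation: 0.156; Sensitivity: 0.778; Selectivity: 0.648
Total DIAN and GO: 480
Total DIAN only: 139
Total GO only: 135
Total: 754
Average type I: 0.184
Average type II: 0.179
Sensitivity: 0.780
Selectivity: 0.775
A group of well-characterized, nonredundant mouse sequences were assigned to the Protein Function ontology by the DIAN domain-based
mapping algorithm. These assignments were compared to assignments made to the GO Process and Function ontology by the MGI.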
investigation of the different evaluation methodologies described here. These are summarized in Table 2. For example,
50% of the Drosophila genes were classified against the GO
Molecular and Biological Function ontologies by the Drosophila community, yet no analysis of the errors associated
with this work was presented (Ashburner et al. 2000). This is
due to the inherent difficulty of assessing error rates associated with ontological classification, such that none of these
genome annotations and their associated evidence codes can
be statistically evaluated with confidence levels. Here we provide the first attempt to analyze the error rates associated with
ontological classification.
Because of the lack of a collection of comprehensive,
robust assignments that can be used as a standard of comparison, it is inherently impossible to achieve a completely robust
assessment of any assignment methodology. Consequently,
none of the approaches described here were entirely satisfactory because of these fundamental limitations. Problems
range from multiple types of biases in testing sets, to the partiality of the field’s understanding of the function of the proteins in the test sets. Therefore, in many cases the DIAN algorithms were found to be making plausible assignments that
cannot be verified with the present data. Additional problems
include variability in the comprehensiveness of assignments
made to a given protein, as well as variability in the comprehensiveness of assignments of various proteins to ontologies,
that is, differences in the coverage between assignment data
sets produced by different methods. For example, in the experiment depicted in Table 4, 40% of assignments generated
by DIAN (representing 216 assignments) were originally found
to be absent in MGD. These were initially considered to be
erroneously introduced by the DIAN algorithm, and were
therefore classified as type I variations. However, on manual
review, most of these assignments were found to be correct,
such that the number of true type I variations was ultimately
reduced to 2.5%. Thus the Type I and II variations in our
evaluation scheme cannot be interpreted simply as Type I or
II errors in a strict statistical sense. The missing assignments
presumably reflect limitations in the keyword-recognition algorithm used in most of the assignments currently provided
by the Mouse Genome Database (outlined in Fig. 5A). As an
illustration, MGD assignments for entry #104661, which
codes for RAR-related orphan receptor α, are depicted in Figure 5B. This gene, a member of the nuclear hormone receptor
superfamily involved in the thyroid hormone signaling pathway,
was assigned to GO categories by MGD on the basis of electronic annotation using a keyword-scanning algorithm (GO
evidence code IEA). This algorithm correctly identified the
protein function as “DNA binding” and the role of the gene as
“transcriptional regulation”, but failed to also indicate its receptor function, which is involved in cell signaling (Fig. 5B).
Table 4. Requirement for Manual Validation of Comparative Results
Columns: Concept; DIAN/Role node number; Present in (DIAN and GO, DIAN only, GO only); Reported type I variation; Effective number of type I assignments; Effective rate of type I variation. Values are listed column by column.
Concept: Cytoskeletal; Nucleotide; Sugar/glycolysis; RNA polymerases; RNA processing
DIAN/Role node number: 3.1; 6.5; 6.7; 5.1.1; 5.1.2; 5.1.3
Present in DIAN and GO: 30; 6; 0; 0; 4; 142
Present in DIAN only: 19; 25; 35; 4; 9; 124
Present in GO only: 16; 11; 8; 0; 10; 98
Reported type I variation: 0.29; 0.60; 0.81; 1.00; 0.39; 0.34
Effective number of type I assignments: 4; 3; 3; 4; 1; 0
Effective rate of type I variation: 0.06; 0.05; 0.07; 0.00; 0.04; 0.00**
Type I variation here refers to those assignments made by DIAN but not in the reference ontology implementation system (GO system). Manual
validation results show that Type I variation (DIAN-only assignments) cannot simply be treated as Type I error in a strict statistical sense.
Figure 5 Comparison of assignment methods. (A) Comparison of
automated and manual assignment methods. The properties of automated assignment methods such as DIAN are compared with those
of manually generated assignments. (B) Comparison of DIAN and
MGI assignments. Results from a simple keyword-based method are
illustrated here in assignments made by the algorithm used by Mouse
Genome Informatics Database, as compared with DIAN assignments.
Note that the “DNA binding” cellular role is vague, as the correct
function for this gene should be “transcription factor.”
Despite the fact that a more systematic evaluation of assignment algorithms is not feasible because of these deficiencies, results from the evaluation approaches applied here indicate that DIAN returns generally correct assignments of proteins to its various ontologies. Deficiencies in DIAN’s
assignment algorithms were most manifest in its favoring of
underprediction (type II variation). Our manual curation and
validation indicate that this error type is far more common
than overprediction. This reflects the conservativeness of the
selection of protein domains as bona fide annotation signatures for a given node, as well as the limited coverage of the
protein universe by domains presently available in the PFAM
database, on which the current version of the algorithm is
based. In contrast, overprediction is much less frequent and
relates to domains that are not completely specific to a given
concept and thus return spurious assignments. Other problems include limitations in the resolution of the algorithm,
such that DIAN may be unable to correctly assign sequences to
very specific nodes such as leaves in the Enzyme hierarchy.
DISCUSSION
Considerations Related to the Assignment of Protein
Domains to Biological Ontologies
Protein domains often have many-to-many relationships with biochemical function; that is, a
given domain may be associated with multiple biochemical
functions. Curating these associations to
ensure specificity is therefore essential to reduce incorrect assignments.
This is most evident in cases where a simple linkage is made
between a protein domain and a biological ontology, such as in
the PRINTS and PROSITE databases. Therefore, it is necessary
to review the specificity of an assignment in the context of all
other assignments this domain may have to other nodes. Furthermore, an evaluation of a nodal annotation signature with
respect to the protein universe, here currently approximated
by the SWISS-PROT database, is required to be statistically
rigorous. For such a review to be robust, it becomes necessary
to first associate all known domains to all protein functions
described by an ontology, followed by estimating the significance of these associations to ensure that they are informative
and not due to, for example, a requirement for a structural
role unconnected to the protein function under consideration. This is because only a fraction of domains are truly
diagnostic for a given protein function, and although careful
manual review can help strengthen the quality of these associations, we believe that only when a global view of associations is available can domains with a low specificity to different functions be eliminated and meaningful assignments be
made. Because of the magnitude of the work, generating such
a global view can only be achieved via a combination of automation followed by manual curation.
In the case of DIAN, this was accomplished by deriving
manually a knowledge base composed of the assignments of
all SWISS-PROT proteins to the various ontologies used by
DIAN. This was to serve as the first step in defining domains
that are meaningfully associated with protein function. This
knowledge base was then used to perform exhaustive verifications of the significance of these associations by deriving a
heuristic decision rule by which to accept or reject the association of individual domains to ontological nodes. For each
candidate protein domain for the annotation signature of the
ontology node, the annotations of all SWISS-PROT sequences
containing this particular protein domain were analyzed
against the SWISS-PROT sequences previously associated with
this node by the DIAN knowledge base. A significant overlap between these pools of SWISS-PROT records and PFAM
domains indicates that a particular protein domain can be used
as a nodal signature. This information is encoded in a heuristic
rule that further requires that a majority of at least four of five
SWISS-PROT proteins used in the knowledge base be nonfragmentary, and that annotations associated with these sequences be derived from published laboratory results.
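A decision rule of this kind might be sketched as follows. The text does not give the overlap threshold, so `min_overlap` is a hypothetical parameter, and reading "four of five" as a fraction of at least 0.8 of the shared records is an interpretation made for this illustration only. The `Record` fields and all names are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Record:
    accession: str
    fragment: bool      # True if the entry is a fragmentary sequence
    experimental: bool  # True if the annotation derives from published laboratory results


def accept_domain(domain_pool: Set[str], node_pool: Set[str],
                  records: Dict[str, Record],
                  min_overlap: float = 0.8, min_quality: float = 0.8) -> bool:
    """Accept a domain as a nodal signature if its record pool overlaps the node's pool.

    domain_pool: accessions of records containing the candidate domain.
    node_pool: accessions of records curated to the ontology node.
    Both thresholds are assumptions, not values from the source.
    """
    shared = domain_pool & node_pool
    if not domain_pool or not shared:
        return False
    if len(shared) / len(domain_pool) < min_overlap:
        return False
    good = [a for a in shared if not records[a].fragment and records[a].experimental]
    return len(good) / len(shared) >= min_quality


records = {f"P{i}": Record(f"P{i}", fragment=False, experimental=True) for i in range(1, 6)}
domain_pool = {"P1", "P2", "P3", "P4", "P5"}
specific = accept_domain(domain_pool, {"P1", "P2", "P3", "P4", "P9"}, records)
unspecific = accept_domain(domain_pool, {"P1", "P2"}, records)
```

In this toy case the domain is accepted for a node that shares four of its five records and rejected for a node that shares only two.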
Evaluation of Keyword-Based Versus Domain-Based
Ontology Nodal Assignment Methods
As described earlier, DIAN combines two algorithms for the
automated assignment of proteins to ontologies that rely on
an underlying knowledge base assembled using manual curation, along with heuristic rules. By comparison, other assignment efforts, such as those made by MGD in the context of
the GO consortium, currently rely primarily on a simple process of scanning source records for keywords that map to GO terms. Full
manual assignment of records is intended to follow this initial
phase. However, such human curation poses several significant limitations, among which is the prohibitive expense of
genome-scale assignment. For this reason, over 84% of the
14,801 assignments presently available in MGD were generated by using keyword-based association, with the remaining
assignments being produced manually. Because automated
assignment methods can be expected to remain a key technology due to their high-throughput capability, development
of algorithms that go beyond the limitations of simple keyword-based assignment is imperative, as most genes will not
receive the kind of textual descriptions that lend themselves
to this approach. Therefore, the domain-based approach of
DIAN provides a distinct additional approach beyond keyword scanning, and permits high-throughput assignment independently of the presence of prior textual annotations,
while retaining reasonable accuracy. Lastly, because of the
frequent difficulty of confirming whether a given assignment
is incorrect, such reviews should perhaps be limited to providing a general confidence value on the mappings made by
automated methods. As was done here, selective manual reviews of individual assignments based on the comparison of
different algorithmic implementations can also be used to
uncover possible errors and defects in their respective mapping methodologies. Worthy of mention here is DIAN’s validation module, which integrates manual reviews to compensate for deficiencies in the various automated validation
methods.
In summary, DIAN is a high-throughput annotation algorithm that uses biological ontologies to segment the spaces
of protein function, biological role, and structure. When applied to data generated from genome sequencing projects,
DIAN is an effective algorithm for the conceptual annotation
of genome-scale data in a timely and scientifically accurate manner. It is also an effective data mining algorithm, applicable to
the identification of novel correlations that can only be made
at the conceptual level.
METHODS
Ontologies
DIAN currently supports five ontologies: the PROSITE ontology was used for Protein Function (http://www.expasy.ch/prosite/, release/version 16.30); Cellular Role is enabled by
the EGAD ontology from TIGR (http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_report.spl, release/version N/A),
which was originally derived from Monica Riley’s E. coli protein ontology; the Enzyme classification is from the IUBMB (International Union of Biochemistry and Molecular Biology;
http://www.chem.qmw.ac.uk/iubmb/enzyme/, release/version Enzyme Nomenclature 1992 and all of its supplements); SCOP is from the Medical Research Council (MRC) of
the United Kingdom (http://scop.mrc-lmb.cam.ac.uk/scop/index.html, release/version 1.53); the DoubleTwist Biological
Role was derived internally (release/version 1.00). These ontologies can be viewed as taxonomies of IS-A links, in which a
node situated at level 1 (Fig. 1) indicates a node expressing a
more general concept than that of a node at level 2, whereas
a node situated at level 3 indicates a more specialized node
than the one at level 2.
Component Databases Supported by DIAN
The component databases supported by DIAN are the GenBank primate division (GB Release 121); UniGene (Build
#129); SWISS-PROT (Release 39); PDB (Release as of 1/1/2001);
and GenPept (Release as of 12/27/2000).
Construction of the DIAN Knowledge Base
Two databases were constructed as the foundation of the
knowledge base associated with ontological nodes: a controlled vocabulary and regular expression database, and a protein domain signature database. For classification of protein
structures, the PFAM motifs within the SCOP domain sequences compiled by the SCOP consortium (Brenner et al.
1998) were used as source material for the nodal signatures of
the structural ontology. The controlled vocabulary database
was populated during the construction of the SWISS-PROT-Ontology mapping table. A computer-aided human curation
process was performed by a group of domain specialists
whereby SWISS-PROT sequences were manually assigned to
the supported ontologies. Node-specific vocabulary and regular expressions were derived and later used to control the association of source-record annotations to a given node of the
supported ontologies. In this way, vocabulary data sets for
each relevant node were created and manually curated with
subsequent releases of SWISS-PROT. Using the manually curated association between SWISS-PROT sequences and ontological nodes, sequences in this database were processed to
identify PFAM protein domains using Paracel’s GeneMatcher system. Through the SWISS-PROT-Ontology table,
annotations made with respect to PFAM domains in SWISS-PROT source records were used to verify the accuracy of the
association of PFAM domains to an ontology node before assigning a domain to a node. Specifically, because each PFAM domain and each ontology node has a satellite pool of SWISS-PROT records, the extent of the overlap between these pools
of records is used to confirm the correctness of the assignment
of this PFAM domain to a particular ontology node. This was
done in a many-to-many manner, such that a domain can be
assigned to more than one node, and a given node can have
more than one domain associated with it.
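The many-to-many relationship between domains and nodes can be represented with a pair of dictionaries, as in the sketch below. The domain identifiers are placeholders rather than real PFAM accessions, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Placeholder domain identifiers mapped to node names from the Protein Function ontology.
DOMAIN_TO_NODES: Dict[str, Set[str]] = {
    "domain_A": {"Receptors"},
    "domain_B": {"Receptors", "DNA or RNA associated proteins"},
    "domain_C": {"Inhibitors"},
}


def invert(domain_to_nodes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build the reverse view: each node with all of its signature domains."""
    node_to_domains = defaultdict(set)
    for domain, nodes in domain_to_nodes.items():
        for node in nodes:
            node_to_domains[node].add(domain)
    return dict(node_to_domains)


def assign_by_domains(protein_domains: Iterable[str],
                      domain_to_nodes: Dict[str, Set[str]]) -> Set[str]:
    """Collect every node reachable from the domains found in a protein."""
    nodes: Set[str] = set()
    for domain in protein_domains:
        nodes |= domain_to_nodes.get(domain, set())
    return nodes


node_view = invert(DOMAIN_TO_NODES)
assigned = assign_by_domains(["domain_B", "domain_X"], DOMAIN_TO_NODES)
```

Domains with no curated association, such as `domain_X` here, simply contribute no assignment.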
DIAN Algorithm Implementation
The underlying DIAN knowledge base was implemented using the Oracle 7.3 relational database management system
(Oracle). For Hidden Markov Model searching, the GeneMatcher system was selected for its ability to perform high-throughput protein domain profiling using the PFAM database. User queries of the DIAN data set are performed using the
PLS index-based search engine (http://www.pls.com) from
America Online. Most of the DIAN pipeline was implemented using the Perl (v.5.0) language. Benchmarks of chromosome 22 were obtained as follows: chromosome 22 was
first fragmented into overlapping fragments of 200,000 bp.
GENSCAN (Burge and Karlin 1997) was used to generate a database
of predicted gene sequences. This collection of gene
predictions was then processed by the DIAN pipeline for annotation analysis. In this case, rather than using GeneMatcher, PFAM domain profiling was done by farming the
predicted gene translations to four workstations running the
HMMER software package to show that DIAN can be applied
easily as a component of a large-scale annotation system for
genome-scale sequencing projects using a conventional computing architecture. The coverage by DIAN of chromosome 22
was thus based on this database of predicted gene sequences.
Only the domain-based assignment algorithm was used in
this case.
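The fragmentation step can be sketched as a sliding window. The text gives the fragment size (200,000 bp) but not the amount of overlap, so the `overlap` default below is a hypothetical value chosen only for illustration.

```python
from typing import List, Tuple


def fragment(sequence: str, size: int = 200_000, overlap: int = 10_000) -> List[Tuple[int, str]]:
    """Split a sequence into overlapping fragments, returned as (start offset, fragment) pairs.

    size follows the 200,000 bp stated in the text; overlap is an assumed value.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    fragments = []
    for start in range(0, len(sequence), step):
        fragments.append((start, sequence[start:start + size]))
        if start + size >= len(sequence):
            break  # the last fragment already reaches the end of the sequence
    return fragments


# Toy example with a 10-character sequence, window 4, overlap 1.
toy = fragment("ABCDEFGHIJ", size=4, overlap=1)
```

Keeping the start offset with each fragment allows gene predictions made on a fragment to be mapped back to chromosome coordinates.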
DIAN Algorithm Evaluation
Three approaches were applied in evaluating the assignment
accuracy of DIAN: manual verification, comparisons between
assignments to different ontologies, and ortholog-based validation. In the first approach, manual verification of assignments was made to selected proteins. A group of domain experts was given the task of reviewing annotation assignments
of biological sequences made by the DIAN pipeline within
their domain of expertise. Several dozen proteins of varied
types were evaluated in this manner. In the second approach,
a test set was constructed for the comparative evaluation of
assignments. Nodes from the GO:Process or GO:Function ontologies that are conceptually equivalent to nodes of the DIAN
Protein Function or Cellular Role ontologies were identified
(Table 3, Fig. 4A,B; see Fig. 4 for explanation). Mouse genes
assigned by MGD (http://www.informatics.jax.org/; Baker et
al. 1999) to these GO nodes (or their child nodes) were then
retrieved. The protein sequences for these genes were obtained from RefSeq via shared HUGO gene names (http://www.gene.ucl.ac.uk/nomenclature/).
An all-versus-all Smith-Waterman sequence similarity search (Smith and Waterman
1981) was further performed to eliminate sequence redundancy within these mouse sequences. Only sequences with
<40% overall similarity were retained as the testing set,
composed of 857 proteins. These sequences were then assigned to DIAN ontologies by the DIAN algorithm for comparison against their original assignments in GO ontologies
(Table 3A, 3B). Sequences with unbalanced assignments between GO and DIAN ontologies were examined manually to
assess the source of the imbalance: the presence of a missing
assignment of a bona fide property listed in GO, or a missing
or incorrect assignment of a bona fide property in DIAN. In
the last approach, assignments made to orthologous sequences were compared. A test set of orthologous proteins
was assembled, composed of a random set of 37 pairs of orthologous RefSeq protein records for mouse and human. Orthology was assumed when genes shared the same HUGO
gene name. Sequences from the test set were processed by the
DIAN pipeline, and resulting assignments were compared between proteins, with the expectation that identical assignments would be generated. Sequences with unbalanced assignments were examined manually to assess the source of the
imbalance, such as the presence of a species-specific function
or a possibly erroneous assignment made by the DIAN
algorithm.
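The ortholog-based check reduces to a set comparison between the assignments made to each member of a pair. The sketch below is a hypothetical illustration: the node names are examples, and the function is not part of the DIAN pipeline.

```python
from typing import Dict, Set


def compare_orthologs(mouse_nodes: Set[str], human_nodes: Set[str]) -> Dict[str, object]:
    """Report whether two orthologs received identical ("balanced") assignments."""
    return {
        "balanced": mouse_nodes == human_nodes,
        "shared": mouse_nodes & human_nodes,
        "mouse_only": mouse_nodes - human_nodes,
        "human_only": human_nodes - mouse_nodes,
    }


result = compare_orthologs(
    {"Receptors", "Transcription factors"},
    {"Receptors"},
)
```

Pairs flagged as unbalanced would then go to manual review, since the difference may reflect either a species-specific function or an assignment error.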
ACKNOWLEDGMENTS
A patent application for the DIAN algorithm has been filed
with the U.S. Patent and Trademark Office. The authors are
grateful to Drs. Doug Brutlag (Stanford University), Peter Karp
(SRI International), and Andrew Karsaskis (DoubleTwist, Inc.)
for valuable discussions.
The publication costs of this article were defrayed in part
by payment of page charges. This article must therefore be
hereby marked “advertisement” in accordance with 18 USC
section 1734 solely to indicate this fact.
REFERENCES
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry,
J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.
2000. Gene ontology: Tool for the unification of biology. The
Gene Ontology Consortium. Nat. Genet. 25: 25–29.
Attwood, T.K. 2000. The Babel of bioinformatics. Science
290: 471–473.
Attwood, T.K., Avison, H., Beck, M.E., Bewley, M., Bleasby, A.J.,
Brewster, F., Cooper, P., Degtyarenko, K., Geddes, A.J., Flower,
D.R., et al. 1997. The PRINTS database of protein fingerprints: A
novel information resource for computational molecular biology.
J. Chem. Inf. Comput. Sci. 37: 417–424.
Bairoch, A. 1991. PROSITE: A dictionary of sites and patterns in
proteins. Nucleic Acids Res. (Suppl.) 19: 2241–2245.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein
sequence database and its supplement TrEMBL in 2000. Nucleic
Acids Res. 28: 45–48.
Baker, P.G., Goble, C.A., Bechhofer, S., Paton, N.W., Stevens, R., and
Brass, A. 1999. An ontology for bioinformatics applications.
Bioinformatics 15: 510–520.
Ben-Yaacov, S., Le Borgne, R., Abramson, I., Schweisguth, F., and
Schejter, E.D. 2001. Wasp, the Drosophila Wiskott-Aldrich
syndrome gene homologue, is required for cell fate decisions
mediated by Notch signaling. J. Cell Biol. 152: 1–14.
Birney, E. and Durbin, R. 2000. Using GeneWise in the Drosophila
annotation experiment. Genome Res. 10: 547–548.
Blake, J.A., Eppig, J.T., Richardson, J.E., and Davisson, M.T. 2000.
The Mouse Genome Database (MGD): Expanding genetic and
genomic resources for the laboratory mouse. The Mouse Genome
Database Group. Nucleic Acids Res. 28: 108–111.
Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify
protein sequences that fold into a known three-dimensional
structure. Science 253: 164–170.
Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing
sequence comparison methods with reliable structurally
identified distant evolutionary relationships. Proc. Natl. Acad. Sci.
95: 6073–6078.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures
in human genomic DNA. J. Mol. Biol. 268: 78–94.
Chen, R.O., Felciano, R., and Altman, R.B. 1997. RIBOWEB: Linking
structural computations to a knowledge base of published
experimental data. Ismb 5: 84–87.
Chervitz, S.A., Hester, E.T., Ball, C.A., Dolinski, K., Dwight, S.S.,
Harris, M.A., Juvik, G., Malekian, A., Roberts, S., Roe, et al. 1999.
Using the Saccharomyces Genome Database (SGD) for analysis of
protein similarities and structure. Nucleic Acids Res. 27: 74–78.
Commission on Biochemical Nomenclature, and International
Union of Biochemistry. Standing Committee on Enzymes. 1973.
Enzyme nomenclature; recommendations (1972) of the Commission
on Biochemical Nomenclature on the nomenclature and classification
of enzymes together with their units and the symbols of enzyme
kinetics. Elsevier Scientific, New York.
Corpet, F., Gouzy, J., and Kahn, D. 1998. The ProDom database of
protein domain families. Nucleic Acids Res. 26: 323–326.
Derry, J.M., Ochs, H.D., and Francke, U. 1994. Isolation of a novel
gene mutated in Wiskott-Aldrich syndrome. Cell 78: 635–644.
Gracy, J. and Argos, P. 1998. Automated protein sequence database
classification. II. Delineation of domain boundaries from
sequence similarities. Bioinformatics 14: 174–187.
Gusfield, D. 1997. Algorithms on strings, trees, and sequences: Computer
science and computational biology. Cambridge University Press,
Cambridge.
Henikoff, S., Henikoff, J.G., and Pietrokovski, S. 1999. Blocks+: A
non-redundant database of protein alignment blocks derived
from multiple compilations. Bioinformatics 15: 471–479.
Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. 1999. The
PROSITE database, its status in 1999. Nucleic Acids Res.
27: 215–219.
International Union of Biochemistry. Standing Committee on
Enzymes. 1965. Enzyme nomenclature; recommendations, 1964, of
the International Union of Biochemistry on the nomenclature and
classification of enzymes, together with their units and the symbols of
enzyme kinetics. Elsevier, New York.
International Union of Biochemistry. Nomenclature Committee and
Commission on Biochemical Nomenclature. 1979. Enzyme
nomenclature, 1978: Recommendations of the Nomenclature
Committee of the International Union of Biochemistry of the
nomenclature and classification of enzymes. Academic Press, New
York.
International Union of Biochemistry. Nomenclature Committee,
International Union of Biochemistry, and Commission on
Biochemical Nomenclature. 1979. Enzyme nomenclature, 1978:
Recommendations of the Nomenclature Committee of the
International Union of Biochemistry on the nomenclature and
classification of enzymes. Academic Press, New York.
International Union of Biochemistry. Nomenclature Committee,
Webb, E.C., and International Union of Biochemistry. 1984.
Enzyme nomenclature 1984: Recommendations of the Nomenclature
Committee of the International Union of Biochemistry on the
nomenclature and classification of enzyme-catalysed reactions.
Academic Press, Orlando, FL.
International Union of Biochemistry and Molecular Biology.
Nomenclature Committee and Webb, E.C. 1992. Enzyme
nomenclature 1992: Recommendations of the Nomenclature
Committee of the International Union of Biochemistry and Molecular
Biology on the nomenclature and classification of enzymes. Academic
Press, San Diego.
Karp, P.D. 2000. An ontology for biological function based on
molecular interactions. Bioinformatics 16: 269–285.
Lewis, S., Ashburner, M., and Reese, M.G. 2000. Annotating
eukaryote genomes. Curr. Opin. Struct. Biol. 10: 349–354.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995.
SCOP: A structural classification of proteins database for the
investigation of sequences and structures. J. Mol. Biol.
247: 536–540.
Nevill-Manning, C.G., Wu, T.D., and Brutlag, D.L. 1998. Highly
specific protein sequence motifs for genome analysis. Proc. Natl.
Acad. Sci. 95: 5865–5871.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa,
M. 1999. KEGG: Kyoto encyclopedia of genes and genomes.
Nucleic Acids Res. 27: 29–34.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B.,
and Thornton, J.M. 1997. CATH—A hierarchic classification of
protein domain structures. Structure 5: 1093–1108.
Rawlings, S.L., Crooks, G.M., Bockstoce, D., Barsky, L.W., Parkman,
R., and Weinberg, K.I. 1999. Spontaneous apoptosis in
lymphocytes from patients with Wiskott-Aldrich syndrome:
Correlation of accelerated cell death and attenuated bcl-2
expression. Blood 94: 3872–3882.
Rengan, R., Ochs, H.D., Sweet, L.I., Keil, M.L., Gunning, W.T.,
Lachant, N.A., Boxer, L.A., and Omann, G.M. 2000. Actin
cytoskeletal function is spared, but apoptosis is increased, in
WAS patient hematopoietic cells. Blood 95: 1283–1292.
Riley, M. 1993. Functions of the gene products of Escherichia coli.
Microbiol. Rev. 57: 862–952.
Smith, T.F. and Waterman, M.S. 1981. Identification of common
molecular subsequences. J. Mol. Biol. 147: 195–197.
Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin,
R. 1998. Pfam: Multiple sequence alignments and HMM-profiles
of protein domains. Nucleic Acids Res. 26: 320–322.
Wheeler, D.L., Church, D.M., Lash, A.E., Leipe, D.D., Madden, T.L.,
Pontius, J.U., Schuler, G.D., Schriml, L.M., Tatusova, T.A.,
Wagner, L., et al. 2001. Database resources of the national center
for biotechnology information. Nucleic Acids Res. 29: 11–16.
Wu, T.D., Nevill-Manning, C.G., and Brutlag, D.L. 2000. Fast
probabilistic analysis of sequence function using scoring
matrices. Bioinformatics 16: 233–244.
Received February 7, 2001; accepted in revised form August 14, 2001.
Short Communication
doi:10.1006/geno.2002.6824, available online at http://www.idealibrary.com on IDEAL
Comparative Analysis of Human Genome Assemblies
Reveals Genome-Level Differences
Shuyu Li,1 Jiayu Liao,2,* Gene Cutler,1 Timothy Hoey,1 John B. Hogenesch,2
Michael P. Cooke,2 Peter G. Schultz,2 and Xuefeng Bruce Ling1,*
1Tularik, Inc., Two Corporate Drive, South San Francisco, California 94080, USA
2The Genomic Institute of Novartis Research Foundation, 10675 John Jay Hopkins Drive, San Diego, California 92121, USA
*To whom correspondence and reprint requests should be addressed. Fax: (858) 812-1502. E-mail: [email protected]. Fax: (650) 825-7400. E-mail: [email protected].
Previous comparative analysis has revealed a significant disparity between the predicted gene sets produced by the International Human Genome
Sequencing Consortium (HGSC) and Celera
Genomics. To determine whether this
discrepancy was due to underlying differences in the
genomic sequences or different gene prediction
methodologies, we analyzed both genome assemblies
in parallel. Using the GENSCAN gene prediction
algorithm, we generated predicted transcriptomes
that could be directly compared. BLAST-based comparisons revealed a 20–30% difference between the
transcriptomes. Further differences between the two
genomes were revealed with protein domain PFAM
analyses. These results suggest that fundamental
differences between the two genome assemblies are
likely responsible for a significant portion of the
discrepancy between the transcript sets predicted by
the two groups.
Celera Genomics and the International Human Genome
Sequencing Consortium (HGSC) simultaneously published
the description of the human genome sequencing, analysis,
and gene annotation [1,2]. Although both teams identified
approximately 30,000 human genes [1,2], a direct comparison of the Celera and HGSC (Ensembl) data sets revealed
little overlap between their novel predicted genes [3].
Questions arose as to whether this observed difference is
due to discrepancies in the underlying raw sequence data,
the resultant genome assemblies, or the independent gene
prediction methodologies used by both groups.
To distinguish between these possibilities, we have
carried out a comparative analysis of the HGSC genomes
(Ensembl 1.0.0, Ensembl 1.1.0, and Ensembl 1.2.0; performed at Tularik, Inc.) and the Celera genome
(CHGD_assembly_R25h; performed at the Genomics
Institute of the Novartis Research Foundation) using the
GENSCAN [4] gene prediction program to generate corresponding predicted transcriptomes. GENSCAN, which was
a key component of both the Celera and HGSC gene
prediction pipelines, predicts both partial and full-length
transcripts. GENSCAN full-length transcripts are defined as
those for which GENSCAN predicts a promoter region, one
or more exons, and a polyadenylation signal. This analysis
revealed that the Celera transcriptome (150,571) has more
predicted transcripts than that of HGSC (Ensembl 1.0.0;
109,083). Analyses of the more recent HGSC genome
releases (Ensembl 1.1.0, Ensembl 1.2.0) gave very similar
results and are therefore not shown here. A detailed analysis of these GENSCAN-predicted transcripts found that
Celera (71,721) has fewer full-length gene predictions than
does HGSC (87,295). A BLAST [5]-based comparison of all
GENSCAN transcripts (threshold of ≥ 98% identity over at
least 100 nucleotides) showed that 80% of predicted HGSC
genes have at least one matching sequence in the Celera
GENSCAN predictions, whereas 70% of Celera predictions
have at least one overlapping sequence in the HGSC set.
These results demonstrate that significant discrepancies
exist even between Celera and HGSC assembly-derived
gene sets predicted with the exact same methodology.
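The matching criterion above (at least 98% identity over at least 100 nucleotides) can be sketched as a filter over BLAST hits. This is an illustrative sketch only: the tuple layout (query, subject, percent identity, alignment length) and the function name are assumptions, not the scripts actually used for the analysis.

```python
def fraction_with_match(query_ids, hits, min_identity=98.0, min_length=100):
    """Fraction of query transcripts with at least one qualifying hit.

    `hits` is an iterable of (query_id, subject_id, percent_identity,
    alignment_length) tuples, a simplified stand-in for BLAST output.
    """
    matched = {
        q for q, _s, identity, length in hits
        if identity >= min_identity and length >= min_length
    }
    query_ids = set(query_ids)
    return len(matched & query_ids) / len(query_ids)


# Toy data: identifiers and scores are invented for illustration.
hgsc_ids = ["h1", "h2", "h3", "h4", "h5"]
hits = [
    ("h1", "c7", 99.5, 850),   # passes both thresholds
    ("h2", "c3", 98.0, 100),   # passes, exactly at the thresholds
    ("h3", "c9", 97.9, 400),   # identity too low
    ("h4", "c2", 100.0, 60),   # alignment too short
]
frac = fraction_with_match(hgsc_ids, hits)
```

Running such a filter in both directions (HGSC against Celera, then Celera against HGSC) would give the two percentages of the kind reported above.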
To understand the impact of these transcriptome differences on the derived proteomes, we have analyzed the predicted translations of these sequences for the presence of
known protein domains using the PFAM [6] 7.0 set of
Hidden Markov Models (HMMs) (3360 models, hit threshold E value 1 × 10^-10). The differences between the number of hits for each protein domain model in the HGSC and
Celera predicted gene sets were plotted in Fig. 1 for the
1495 models that had hits (data for searches with E values
of 1 × 10^-5 or 1 × 10^-2 gave similar results and are not
shown). Of all the matching PFAM models, a large percentage have more matches (47%) in the HGSC-derived
gene set than in the Celera-derived genes. This is more than
the number of models that matched both data sets equally
(30%), and more than twice the number that had excess
matches in the Celera data (22%). This analysis further supports the conclusion that the genome assemblies had a significant impact on the predicted transcript sets.
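The per-model comparison described above can be sketched as follows. The sketch assumes that hit counts per PFAM model have already been tabulated for each gene set; the function name, the dictionary layout, and the toy counts are hypothetical.

```python
from collections import Counter


def classify_models(hgsc_hits, celera_hits):
    """Tally models by which gene set has more hits.

    `hgsc_hits` and `celera_hits` map a PFAM model name to its hit count.
    Models with no hits in either set are skipped.
    """
    tally = Counter()
    for model in set(hgsc_hits) | set(celera_hits):
        h = hgsc_hits.get(model, 0)
        c = celera_hits.get(model, 0)
        if h == 0 and c == 0:
            continue
        if h > c:
            tally["more_in_hgsc"] += 1
        elif c > h:
            tally["more_in_celera"] += 1
        else:
            tally["equal"] += 1
    return tally


# Toy counts; the numbers are invented for illustration.
hgsc = {"pkinase": 12, "zf-C2H2": 30, "7tm_1": 8, "SH3": 5}
celera = {"pkinase": 10, "zf-C2H2": 30, "7tm_1": 9}
tally = classify_models(hgsc, celera)
```

Dividing each tally by the number of models with hits would give percentages comparable to those quoted in the text.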
This parallel analysis of the genome assemblies released
by the HGSC and Celera teams provides strong evidence that
there are major fundamental differences between these two
GENOMICS Vol. 80, Number 2, August 2002
Copyright © 2002 Elsevier Science (USA). All rights reserved.
0888-7543/02 $35.00
data sets in the numbers, identities, and properties of predicted genes derived from these sequences. Based on this, we
conclude that these sequence-level differences must be at least
partly responsible for the discrepancies in the previous findings [3]. Along with the recent re-analysis [7,8] of Celera’s
genome assembly [1], this report provides further evidence
that the whole genome approach and the hierarchical shotgun
sequencing approach yielded different genomes.
RECEIVED FOR PUBLICATION JUNE 6; ACCEPTED JUNE 19, 2002.
FIG. 1. PFAM domain profiling of Celera and HGSC derived transcriptomes.
The x-axis represents the excess of matches per PFAM model in the HGSC
versus Celera data sets. The y-axis represents the number of models that fall
into each category. Upward bars represent PFAM models that have more
hits in the HGSC data set. Downward bars represent PFAM models that
have more hits in the Celera data set.
REFERENCES
1. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291: 1304–1351.
2. Lander, E. S., et al. (2001). Initial sequencing and analysis of the human genome. Nature
409: 860–921.
3. Hogenesch, J. B., et al. (2001). A comparison of the Celera and Ensembl predicted gene
sets reveals little overlap in novel genes. Cell 106: 413–415.
4. Burge, C., and Karlin, S. (1997). Prediction of complete gene structures in human genomic
DNA. J. Mol. Biol. 268: 78–94.
5. Altschul, S. F., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res. 25: 3389–3402.
6. Bateman, A., et al. (2002). The Pfam protein families database. Nucleic Acids Res. 30:
276–280.
7. Waterston, R. H., Lander, E. S., and Sulston, J. E. (2002). On the sequencing of the human
genome. Proc. Natl. Acad. Sci. USA 99: 3712–3716.
8. Myers, E. W., Sutton, G. G., Smith, H. O., Adams, M. D., and Venter, J. C. (2002). On the
sequencing and assembly of the human genome. Proc. Natl. Acad. Sci. USA 99: 4145–4146.
Vol. 19 no. 0 2003, pages 1–9
DOI: 10.1093/bioinformatics/btg219
BIOINFORMATICS
A comparative analysis of HGSC and Celera
human genome assemblies and gene sets
Shuyu Li1,† , Gene Cutler1,† , Jane Jijun Liu1,† , Timothy Hoey1 ,
Liangbiao Chen2 , Peter G. Schultz3 , Jiayu Liao3, ∗ and
Xuefeng Bruce Ling1,∗
1 Tularik, Inc., Two Corporate Drive, South San Francisco, CA 94080, USA, 2 Institute of
Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101,
People’s Republic of China and 3 The Genomic Institute of Novartis Research
Foundation, 10675 John Jay Hopkins Drive, San Diego, CA 92121, USA
Received on December 21, 2002; revised on March 11, 2003; accepted on March 26, 2003
ABSTRACT
Motivation: Since the simultaneous publication of the human
genome assembly by the International Human Genome
Sequencing Consortium (HGSC) and Celera Genomics,
several comparisons have been made of various aspects of
these two assemblies. In this work, we set out to provide a
more comprehensive comparative analysis of the two assemblies and their associated gene sets.
Results: The local sequence content for both draft genome
assemblies has been similar since the early releases; however,
it took a year for the quality of the Celera assembly to approach
that of HGSC, suggesting an advantage of HGSC’s hierarchical shotgun (HS) sequencing strategy over Celera’s whole
genome shotgun (WGS) approach. While similar numbers of
ab initio predicted genes can be derived from both assemblies, Celera’s Otto approach consistently generated larger,
more varied gene sets than the Ensembl gene build system.
The presence of a non-overlapping gene set has persisted with
successive data releases from both groups. Since most of the
unique genes from either genome assembly could be mapped
back to the other assembly, we conclude that the gene set
discrepancies do not reflect differences in local sequence content but rather in the assemblies and especially the different
gene-prediction methodologies.
Contact: [email protected]
∗ To whom correspondence should be addressed.
† Equal contribution to this publication.
INTRODUCTION
In February 2001, the International Human Genome Sequencing Consortium (HGSC) and Celera Genomics simultaneously published descriptions of the sequencing, assembly,
analysis, and gene annotation of the human genome (IHGSC,
2001; Venter et al., 2001). Although both teams identified
approximately 30 000 human genes (IHGSC, 2001; Venter
et al., 2001), a direct comparison of the Celera and HGSC
(Ensembl) data sets revealed relatively little overlap between
their novel predicted genes (Hogenesch et al., 2001). Our
previous parallel analysis (Li et al., 2002) of the two genome assemblies showed that there are major fundamental
differences between these two data sets, in the numbers, identities, and properties of predicted genes derived from these
sequences, and that assembly-level differences must be at least
partly responsible for the gene set discrepancies. In addition,
the recent re-analyses (Myers et al., 2002; Waterston et al.,
2002) of Celera’s genome assembly debated how much of an
impact Celera’s use of the public-domain genome data had on
its assembly. In order to provide an up-to-date status report
of the human genome sequencing efforts, understand how
the genome assemblies have been evolving since their initial
releases, and compare the different assembly approaches and
their resulting gene data sets, we have collected the majority of HGSC and Celera assembly releases and performed a
systematic comparative analysis.
METHODS
Sequence databases
HGSC and Celera databases of assemblies and transcriptomes, released from May 2000 to July 2002, were collected and are summarized in Table 1. A total of nine HGSC
human genome assemblies (June 2000, July 2000, September 2000, October 2000, December 2000, April 2001, August
2001, December 2001, April 2002) were downloaded from
http://www.genome.ucsc.edu/#Downloading. Ensembl curated gene sets (Ensembl 0.8.0, Ensembl 1.0.0, Ensembl
1.2.0, Ensembl 3.26 and Ensembl 5.28) were downloaded
from ftp.ensembl.org. Five Celera human genome assembly
releases (R20, R25h, R26b, R26f and R26i) and four
Celera gene sets (R25e, R25h, R26b, R26k) were licensed
from subscription of the Celera Discovery System by GNF
and analyzed by GNF (The Genomic Institute of Novartis
Bioinformatics 19(0) © Oxford University Press 2003; all rights reserved.
Table 1. HGSC and Celera genome assembly and gene set release history
Release date
05-2000
06-2000
07-2000
08-2000
09-2000
10-2000
11-2000
12-2000
01-2001
04-2001
07-2001
08-2001
10-2001
11-2001
12-2001
01-2002
03-2002
04-2002
05-2002
06-2002
Assembly
Curated genes
HGSC (UCSC)
Celera
HGSC (Ensembl)
06-2000
07-2000
R18, R19
R20, R21
R22, R23
R24
09-2000
10-2000
Celera
E−0.8.0
R25e
12-2000
E−1.0.0
R25h
04-2001
R25e
R25h
E−1.1.0
R26b
R26b
08-2001
E−1.2.0
R26d
R26e
12-2001
R26f
E−3.26
E−4.28
R26i
E−5.28
04-2002
R26f, R26h
R26j
R26k
The release dates and release names (where applicable) are shown for the HGSC and Celera genome assembly releases analyzed in this study. The Ensembl and Celera gene set
releases are also shown.
Research Foundation). Human RefSeq sequences were
obtained by FTP from ftp.ncbi.gov/refseq/H_sapiens. The
PFAM 7.0 Hidden Markov Model (HMM) database was
obtained by FTP from ftp.genetics.wustl.edu/pub/eddy/pfam
7.0/. The Research Genetics cDNA database was obtained by
FTP from ftp://ftp.resgen.com/pub/sv_libraries/RG_Hs_seq_ver_101100.txt. The 07-2002 RefSeq database was downloaded
from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/).
Local genome database setup and configuration
UCSC annotation databases hg4, hg5, hg6, hg7, hg8,
hg10, and hg11, corresponding to the September 2000,
October 2000, December 2000, April 2001, August 2001,
December 2001 and April 2002 UCSC genome assemblies
respectively, were downloaded, and imported into a local
relational database. The UCSC relational database schema
is available online at http://genome.ucsc.edu/goldenPath/
gbdDescriptions.html
Ensembl databases were set up and configured on local
servers following instructions from http://www.ensembl.org/
Docs/ and personal communications with Ensembl colleagues
([email protected]). Data sets were downloaded from
Ensembl and imported to a local relational database.
BLAT to map sequences onto genome assemblies
Gene sequences were mapped onto genome assemblies
using the BLAT program (Kent, 2002) directly (local BLAT
server setup) or indirectly (locally installed UCSC genome
database with pre-computed BLAT results). In the UCSC
genome database, chromosome locations are stored in the
all_est or all_mrna tables, of which the qName column
stores RG Genbank accession numbers. The BLAT server
setup and homology search were performed using instructions from UCSC. BLAT analysis was run using an identity threshold of 95% over at least 40 bp as described
at the UCSC genome browser (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=human). These criteria
have been previously determined to give optimal sensitivity,
specificity, and speed for genomic searches (Kent, 2002). Similar
results were obtained when sequences were mapped by running BLAT or by querying pre-computed BLAT results from
the UCSC database.
Gene prediction
Predicted gene sets were derived from the HGSC and Celera
genome assemblies by running the GENSCAN algorithm
(Burge and Karlin, 1997) with its default settings. Full-length gene sets were derived from these total gene sets by
selecting all predicted genes for which GENSCAN identified
5′ promoter and 3′ poly-A signal sequences.
BLAST comparative data analysis
Sequence comparison was performed using the NCBI BLAST
algorithm (Altschul et al., 1997): BLASTN for gene–gene
comparisons (E-value < 1 × 10^-5, at least 98% identity over
100 bp) and BLASTX for gene/SWISS-PROT comparisons
(E-value < 1 × 10^-5).
PFAM domain analysis
The PFAM 7.0 database release, containing 3360 HMMs,
was used to analyze gene sets for their protein domain
content. For this analysis, the HMMER software package (Eddy, 1998) or its compatible implementations from
Paracel (http://www.paracel.com) and TimeLogic (http://
www.timelogic.com) were run on a Linux computing cluster
(150 CPUs, Linux Networks), a Paracel GENEMATCHER
machine, and a TimeLogic Decypher machine, respectively.
RESULTS AND DISCUSSION
Gene-based quality assessment of HGSC and Celera
genome assemblies
Multiple releases of human genome assemblies and their
associated predicted gene sets from HGSC (International
Human Genome Sequencing Consortium, UCSC, Ensembl)
and Celera are listed in Table 1 based on release dates. These
data sets were the basis for comparing the HGSC and Celera
genome assemblies and analyzing how they have changed
over time. Genome assemblies can vary due to differences in
local sequence content as well as long-range differences due
to differing sequence assembly. As a gauge of the quality and
completeness of the draft local sequence content in both genome assemblies, we used the BLAT algorithm (Kent, 2002)
to map the large Research Genetics human cDNA sequence
database (RG, 41 472 sequences) against the genome assemblies (Fig. 1). Since a positive BLAT hit only requires a match
of 40 bp, this analysis should be largely insensitive to global
assembly issues. We have observed a gradual increase in the
number of mapped RG sequences with both HGSC and Celera
assemblies, leveling off for both at around 97%. These results
suggest that the HGSC and Celera assemblies have had similar
local sequence content since their early releases.
Gene sets derived from the genome assemblies can vary
due to differences in local sequence, global assembly, and
the particular gene-prediction pipelines used. Since genes can
span large sequence lengths, all gene prediction algorithms,
to some extent, will be sensitive to sequence coverage
and assembly issues. To eliminate variability due to differing gene-prediction pipelines, GENSCAN was used to
generate two sets of genes from multiple releases of both
genome assemblies. The full-length GENSCAN gene subsets were extracted from the full sets, including only those
GENSCAN predictions containing both 5′ promoter and 3′
poly-adenylation signal sequence predictions. Since long-range sequence discontinuity in the assemblies can lead
GENSCAN to predict partial genes that would lack 5′
promoter and/or 3′ poly-adenylation signal sequences, this
full-length subset can be used to probe the quality of the
genome assembly.
Fig. 1. BLAT mapping of Research Genetics sequences to HGSC
and Celera genome assemblies. The percentages of sequences from
the Research Genetics sequence database that give positive BLAT
hits against various releases of the HGSC and Celera genome
assemblies are plotted.
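The full-length selection described above can be sketched as a simple filter over predicted features. This is a minimal sketch under stated assumptions: the input is taken to be (gene ID, feature type) pairs already parsed from the prediction output, and the labels "Prom" and "PlyA" are assumed to mark promoter and poly-adenylation signal features. The function name and the toy records are hypothetical.

```python
def full_length_genes(features):
    """Return IDs of predicted genes having both a promoter and a poly-A signal.

    `features` is an iterable of (gene_id, feature_type) pairs. The labels
    "Prom" and "PlyA" are assumed here to mark promoter and poly-A features.
    """
    seen = {}
    for gene_id, feature_type in features:
        seen.setdefault(gene_id, set()).add(feature_type)
    return sorted(g for g, types in seen.items() if {"Prom", "PlyA"} <= types)


# Toy records, invented for illustration.
features = [
    ("g1", "Prom"), ("g1", "Init"), ("g1", "Term"), ("g1", "PlyA"),
    ("g2", "Intr"), ("g2", "Term"), ("g2", "PlyA"),   # lacks a promoter
    ("g3", "Prom"), ("g3", "Init"),                   # lacks a poly-A signal
]
full = full_length_genes(features)
```

A gene split across a gap in the assembly would tend to appear as two partial predictions, like g2 and g3 here, and neither would be counted as full length.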
The total and full-length GENSCAN HGSC gene counts
as well as the Celera full-length GENSCAN gene counts all
showed modest and gradual increases over time (Fig. 2A).
In contrast, the total GENSCAN gene counts for the Celera
assemblies started out at levels more than twice as high as
the HGSC gene sets, and only came down to comparable
levels in the July 2001 release. Since gene prediction depends
not only on local sequence content but also on long-range
assembled sequence, we believe that the initially high total
GENSCAN gene numbers for Celera were due to sequence
fragmentation resulting in many individual genes being split
into separate GENSCAN predictions. This apparent Celera
genome fragmentation, perhaps due to gaps or assembly
errors, may indicate a disadvantage of Celera’s whole genome shotgun (WGS) sequencing approach (Huson et al., 2001;
Myers et al., 2002) compared to HGSC’s hierarchical shotgun
(HS) approach (IHGSC, 2001).
Both the Ensembl gene build system (Hubbard et al., 2002)
and Celera’s Otto pipeline (Venter et al., 2001) use various
forms of evidence including homology to known proteins,
ESTs and ab initio gene prediction with algorithms including GENSCAN (Burge and Karlin, 1997). Ensembl is more
dependent on known human proteins from SPTREMBL,
GENSCAN predictions, and gene prediction HMMs while
Celera uses more data from outside of their genome such
as cross-genome homology and even the Ensembl gene set
[(Venter et al., 2001), reference 62]. Analyzing the length
distributions of the Ensembl and Celera gene sets (Fig. 2B)
shows a large decrease in short Celera genes accompanied by
increases in the numbers of longer genes over time, similar to but
more pronounced than what is seen with the HGSC genes. A
similar trend is seen with the GENSCAN-predicted gene sets
(data not shown), further reinforcing the notion that initial
Celera assembly releases may have had comparatively high
levels of fragmentation. Interestingly, the latest two Celera
gene sets released show a reversal of this gene-length trend,
with increasing numbers of short genes concomitant with an
increase in total gene number.
Fig. 2. Gene set distributions from multiple HGSC and Celera genome releases. (A) The numbers of pipeline-derived genes from various releases of Ensembl and Celera gene sets along with the numbers
of total GENSCAN-predicted genes and full-length GENSCAN-predicted genes derived from various releases of the HGSC and
Celera genomes are plotted based on release dates. For reference,
the number of human genes in the July 2002 release of RefSeq is
also shown. (B) Multiple Ensembl and Celera gene sets were analyzed based on gene length. The numbers of sequences from each
release that lie in the given gene-length bins are shown.
Fig. 3. Gene changes within gene sets across multiple releases. (A)
The changes in the numbers of genes in the Ensembl and Celera gene
sets and in the HGSC- and Celera-derived GENSCAN genes between
successive genome releases are plotted. (B) Genes in successive gene
sets were compared to genes in the previous gene sets using BLAST.
The percent of genes that did not match any sequence in the previous
gene set are plotted for each gene set group.
Within-group gene set comparisons
An alternate way to look at changes in the assembly and
gene data set is to compare the genes derived from each genome assembly release with those from the previous release.
BLAST (Altschul et al., 1997) comparative analysis of genes
with those from previous releases identified new genes as
those sequences that did not match any sequence in the previous set. The analysis of the HGSC GENSCAN gene sets shows
a 10–20% level of new gene content per gene set (Fig. 3), consistent with the modest increases in gene number (Fig. 2) and
sequence coverage (Fig. 1) already observed. In contrast, the
Celera GENSCAN gene sets show an initially high level of
new GENSCAN gene content being added (30–40%) concomitant with a large decrease in gene number, a trend that
has diminished in the most recent genome releases, where
very few new GENSCAN genes appear to be present. The
large gene count of the initial Celera GENSCAN set and its
decrease over the course of time correlate with the decrease
of the initially large fraction of short (<1 kb) Celera genes
(Fig. 2), suggesting that the levels of fragmentation seen in
the initial releases decreased over time. The pattern of changes
in the Celera Otto genes in successive releases is even more
dramatic: more than 50% of the genes in the January 2001
gene set release cannot be found in the previous December
2000 release. By October 2001, however, virtually no new
genes were being added. Interestingly, new gene addition can
again be observed in the recent Celera releases, occurring in
the same releases where the total gene number (Fig. 2A) and
the short gene number (Fig. 2B) rebound. Since neither the
genome content nor quality appears to have changed much in
these releases, we believe that this recent trend is likely due
to changes in Celera’s gene-prediction pipeline.
Fig. 4. Gene set comparisons between groups. (A) Human RefSeq genes were compared to multiple Ensembl and Celera gene sets using
BLAST. The numbers of RefSeq sequences that matched both gene sets, only the Ensembl gene set, only the Celera gene set, and neither gene
set are plotted on the left. The human RefSeq genes were also compared to multiple HGSC- and Celera-derived GENSCAN gene sets using
BLAST. The distribution of matching sequences is plotted on the right. (B) Ensembl genes from multiple releases were compared with the
corresponding Celera gene set releases using BLAST. The numbers of matching and non-matching (Ensembl-unique) sequences are plotted
on the left. Similarly, Celera genes were compared with the corresponding Ensembl gene sets using BLAST and the numbers of matching and
non-matching (Celera-unique) sequences are plotted on the right. (C) HGSC-derived GENSCAN genes and Celera-derived GENSCAN genes
were compared with each other using BLAST in both directions as in (B). The numbers of sequences found in both gene sets, HGSC-unique
sequences, and Celera-unique sequences are plotted.
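The new-gene measure used in this section (genes with no match in the immediately preceding release) can be sketched as below. The sketch assumes the BLAST comparisons have already been reduced to, for each release, the set of its genes that matched something in the previous release. The function name, the argument layout, and the toy releases are hypothetical.

```python
def percent_new_genes(releases, gene_sets, matched_to_previous):
    """Percent of genes in each release lacking a match in the prior release.

    releases: release names in chronological order.
    gene_sets: release name -> set of gene identifiers.
    matched_to_previous: release name -> set of its genes that had a
        match in the immediately preceding release.
    The first release has no predecessor and is therefore omitted.
    """
    result = {}
    for current in releases[1:]:
        genes = gene_sets[current]
        new = genes - matched_to_previous.get(current, set())
        result[current] = 100.0 * len(new) / len(genes)
    return result


# Toy releases, invented for illustration.
releases = ["r1", "r2", "r3"]
gene_sets = {
    "r1": {"a", "b"},
    "r2": {"a", "b", "c", "d"},
    "r3": {"a", "b", "c", "d", "e"},
}
matched = {"r2": {"a", "b"}, "r3": {"a", "b", "c", "d"}}
new_pct = percent_new_genes(releases, gene_sets, matched)
```

In this toy series the share of new genes falls from 50% to 20% between successive releases, the kind of declining pattern described for the Celera gene sets.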
RefSeq-based quality assessment of
Ensembl and Celera gene sets
The NCBI RefSeq database (Maglott et al., 2000; Pruitt and
Maglott, 2001), derived from Genbank sequences and the published literature, provides a non-redundant view of the current
knowledge about human genes, transcripts and proteins. We
evaluated the quality and comprehensiveness of the in silico
GENSCAN-predicted gene sets by comparing them to the
human RefSeq database with BLAST. Comparing RefSeq to
multiple Ensembl and Celera pipeline gene sets and HGSC
and Celera GENSCAN gene sets reveals that, even with the
earliest releases, greater than 75% of RefSeq genes can be
found in some form in gene sets from both groups (Fig. 4A).
Small fractions of RefSeq genes could be matched only to
genes from HGSC, only to Celera genes, or to neither gene set.
Over the course of time, the numbers of unmatched RefSeq
genes and those matching only HGSC have significantly
decreased. At the same time, the Celera gene set continues to
have a modest number of RefSeq genes not found in Ensembl,
suggesting that the Celera gene set can be more comprehensive
than the Ensembl data set with respect to RefSeq. Similar
BLAST results were obtained after a permissive sequence
clustering approach (Hogenesch et al., 2001) was applied
to eliminate sequence redundancy in all RefSeq, HGSC and
Celera gene sets (data not shown). Because RefSeq (07-2002
release, 15 740 genes) contains far fewer genes than Ensembl
and Celera, more efforts are needed in order to complete
RefSeq as a gene reference standard.
Between-group gene set comparisons
Much has been made of the concordance between the gene
numbers of the initial HGSC and Celera gene releases
(IHGSC, 2001; Venter et al., 2001) and the subsequent observations that each set actually contained many unique genes
(Hogenesch et al., 2001; Li et al., 2002). We have repeated
this analysis across multiple gene set releases. Comparing
Ensembl to Celera genes shows that the fraction of Ensembl-unique genes ranges from 29% initially to 12% in the most
recent release analyzed (Fig. 4B), indicating that most of the
Ensembl genes can find matches in the Celera set. The reverse
comparison, Celera compared to Ensembl, reveals that the
fraction of Celera-unique genes decreased from an initial 56 to
26% in the most recent analyzed release. The large increase in
Celera-unique genes in the R25h release coincided with the large
increase in total gene number (Fig. 2A) consisting largely
of short genes (Fig. 2B). Similar results were obtained when
redundancy was removed from the data sets (data not shown).
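Once the BLAST comparison has been run, the between-group comparison described above reduces to a simple set calculation. The following Python sketch is illustrative only and is not the authors' code: the function name, the hit tuples, the gene identifiers, and the E-value cutoff are all assumptions made for the example.

```python
def unique_fraction(query_ids, hits, max_evalue=1e-5):
    """Return the fraction of query genes with no acceptable hit in the other set.

    query_ids  -- iterable of gene identifiers in the query gene set
    hits       -- iterable of (query_id, subject_id, evalue) tuples, e.g. parsed
                  from tabular BLAST output (the parsing step is not shown)
    max_evalue -- hypothetical significance cutoff; an assumption of this sketch
    """
    query_ids = set(query_ids)
    if not query_ids:
        raise ValueError("query gene set is empty")
    # A query gene counts as matched if at least one hit passes the cutoff.
    matched = {q for q, _subject, evalue in hits
               if q in query_ids and evalue <= max_evalue}
    return (len(query_ids) - len(matched)) / len(query_ids)


# Toy example with made-up identifiers: two of four query genes have a match.
toy_queries = ["geneA", "geneB", "geneC", "geneD"]
toy_hits = [
    ("geneA", "otherX", 1e-30),
    ("geneB", "otherY", 1e-8),
    ("geneC", "otherZ", 0.5),   # too weak to count as a match
]
toy_result = unique_fraction(toy_queries, toy_hits)
```

Running the function in both directions (set A against set B, then B against A) would give the two "unique" fractions that are compared in the text.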
To discriminate between changes in actual sequence information versus changes in gene-prediction pipelines, this analysis was repeated with the GENSCAN-derived gene sets.
The HGSC versus Celera GENSCAN gene set comparison
(Fig. 4C) looks much like the Ensembl versus Celera pipeline
gene comparison (Fig. 4B), with approximately 16% of the
HGSC genes being unique. In contrast, the Celera versus
HGSC GENSCAN-gene comparison shows an initially high
number (33%) of Celera-unique genes, decreasing to a fraction (13%) similar to the number of unique HGSC GENSCAN
genes. The difference between these results and the pipeline-gene comparison suggests that the unique gene content of the
Celera pipeline gene set cannot be explained by fundamental
differences in the genome assemblies.
To further characterize the HGSC and Celera-unique gene
sets, we mapped the unique genes back to the genome assemblies from which they came as well as to that of the other
group using BLAT. Nearly all of the sequences from all four
unique gene sets can be mapped to both genome assemblies of the same or similar release date (Fig. 5A). This
again confirms that genome content, specifically the local
sequence content, is very similar between both assemblies.
Since the differences between Ensembl and Celera gene sets
are much larger than that observed between HGSC and Celera
GENSCAN gene sets, we can conclude that the gene-building
process, including human curation, must have contributed
more to the observed gene set difference than the different
genome sequencing and assembly processes.
In order to estimate how likely the unique Ensembl or Celera
genes are to represent true genes, we compared the unique
pipeline genes to the large SWISS-PROT protein database
using BLAST with moderate stringency (E-value = 1e-5).
Fig. 5. Analysis of HGSC and Celera-unique genes. (A) Sequences
that were unique to the Ensembl, Celera, HGSC-derived GENSCAN,
and Celera-derived GENSCAN gene sets based on BLAST analysis
(Fig. 4) were mapped back to the genome assembly from which they
were derived as well as to the other genome assembly using BLAT.
The percentages of sequences from each unique set, which could be
mapped to either genome assembly, are plotted. (B) The unique
Ensembl and Celera genes were compared with the SWISS-PROT
database using a moderate-stringency BLAST analysis. The percentages of sequences from both sets for which homologous sequences
could be identified in SWISS-PROT are plotted.
While more than 60% of some of the earlier unique gene
sets appear to have no significant homology to any protein sequence in SWISS-PROT, analysis of the most recent
gene sets shows that 55% of Celera-unique genes and 68%
of Ensembl unique genes have known protein homologs
(Fig. 5B). Using SWISS-PROT homology matches as a rough
estimate of the likelihood that predicted genes are real, it
Fig. 6. Estimation of non-redundant gene count. For the releases shown, the Ensembl and Celera gene sets were combined along with the
human subset of RefSeq. This combined gene set was clustered via a permissive clustering algorithm. The resulting gene cluster number
represents the total number of unique genes in the Ensembl, Celera and RefSeq gene sets that could be resolved by our BLAST analysis.
appears that a large fraction of the unique genes from both
data sets are likely to be real.
Total number of protein-coding genes—lower
bound estimation
As shown in Figure 2A, the Ensembl gene sets have consistently been comprised of around 30 000 sequences, while
the Celera gene set has varied in the range of 20 000–45 000
sequences. Interestingly, the two latest Celera gene sets analyzed show an increase in gene number, bringing the total
well above that of the Ensembl gene set. To put these numbers in perspective, the human component of RefSeq (Maglott
et al., 2000; Pruitt and Maglott, 2001) contains many fewer
genes (07-2002 release, 15 740 genes) than either of these two
gene sets.
In order to estimate the total gene number, the Ensembl,
Celera and RefSeq gene sets were combined into a large
superset. Following an all-to-all BLAST comparison, redundant sequences were removed with a permissive clustering
algorithm (Hogenesch et al., 2001). The resulting gene
cluster number represents the total number of unique genes
in the Ensembl, Celera and RefSeq gene sets that could be
resolved by our BLAST analysis. Different Ensembl and
Celera releases were combined with RefSeq and processed
to analyze how this total gene number has changed over time,
increasing from an initial 24 238 to over 40 000 and then down
to 28 475 (Fig. 6). The non-redundant gene number we computed here should represent a lower bound for the true human
gene count: our BLAST threshold cannot distinguish between
the nearly identical paralogs that are found in some gene families; this approach omits genes that were missed by both
Ensembl and Celera gene identification processes. This analysis of multiple gene sets together, coupled with the removal
of redundancy, allows us to make a more complete estimate
of the total human genome gene content than has previously
been described (IHGSC, 2001; Venter et al., 2001).
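The redundancy-removal step can be pictured as grouping sequences that are connected by BLAST matches and then counting the groups. The sketch below uses single-linkage clustering with a union-find structure as a stand-in; the source does not spell out the permissive clustering algorithm of Hogenesch et al. (2001), so this is an assumption, as are the function name and the toy identifiers.

```python
def count_clusters(gene_ids, matching_pairs):
    """Count single-linkage clusters among gene_ids given pairwise matches."""
    parent = {g: g for g in gene_ids}

    def find(g):
        # Walk to the root, compressing the path along the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in matching_pairs:
        if a not in parent or b not in parent:
            continue  # ignore matches to sequences outside the combined set
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    return len({find(g) for g in parent})


# Toy superset: hypothetical identifiers from three sources.
toy_genes = ["ens1", "ens2", "cel1", "cel2", "ref1"]
toy_pairs = [("ens1", "cel1"), ("cel1", "ref1")]  # ens1, cel1, ref1 collapse
toy_count = count_clusters(toy_genes, toy_pairs)
```

Because any chain of matches merges sequences into one cluster, near-identical paralogs collapse together, which is one reason a count obtained this way behaves as a lower bound.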
Gene set domain content analysis
Similar to the SWISS-PROT homology analysis (Fig. 5B),
protein domain profiling should provide an indirect measure
of the quality of the genome-derived gene sets. The drawback of this analysis is that it can only analyze genes that
contain already known protein domains. We used the PFAM
7.0 (Bateman et al., 2002) database of domain models to
look at the comparative domain content of gene sets from
the HGSC and Celera genome assemblies. Figure 7A shows
the numbers of PFAM models that have an excess of matches
against various releases of either the Ensembl or Celera gene
sets. In early releases, many more PFAM models had more
matches against the Ensembl gene set than against the Celera
gene set. However, in recent releases, the domain content
of the Celera gene set has increased dramatically relative to
Ensembl. In contrast, when the GENSCAN gene sets are analyzed (Fig. 7B), while the gap has narrowed, the HGSC genes
continue to contain more domain matches than the Celera
GENSCAN genes. Similar to the SWISS-PROT homology
analysis (Fig. 5B), this domain analysis should provide an
approximate measure of the quality of the genome-derived
gene sets. The GENSCAN-derived gene set numbers suggest
that over time the Celera genome assembly has approached the
quality of the HGSC assembly. Given the similarity of local
sequence content between the HGSC and Celera assemblies,
Fig. 7. Domain profiling of HGSC and Celera gene sets. The domain content of multiple HGSC and Celera gene sets was analyzed by
performing a search of these gene sets with the PFAM database. For each PFAM domain model, the number of hits against each pair of
HGSC and Celera gene sets was identified. The numbers of PFAM models that have an excess of hits against HGSC are plotted in the upper
section, based on how large the excess of HGSC hits was. Similarly, the numbers of PFAM models that have an excess of hits against Celera
are plotted in the lower section based on the number of excess Celera hits. (A) PFAM analysis of Ensembl and Celera gene sets. (B) PFAM
analysis of HGSC and Celera assembly GENSCAN gene sets.
this PFAM analysis of the GENSCAN gene sets supports the
idea that the HGSC HS approach may have had advantages
over the Celera WGS approach. The significant difference in
PFAM matches to the recent Celera pipeline gene sets, in contrast, suggests that Celera has been able to add many new gene
types to their gene set that would not otherwise be identified
by ab initio gene prediction, making their gene annotation
efforts more comprehensive than that of Ensembl.
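The domain-profiling comparison summarized in Figure 7 amounts to counting, for each domain model, how many more hits it has in one gene set than in the other, and then binning the models by the size of that excess. A minimal Python sketch follows, assuming the domain search has already been run and reduced to one model name per hit; the function name, the bin edges, and the model names are hypothetical.

```python
from collections import Counter


def excess_profile(hits_a, hits_b, bin_edges=(1, 5, 10)):
    """Bin domain models by how many more hits they have in one gene set.

    hits_a, hits_b -- iterables of domain-model names, one entry per hit
    bin_edges      -- hypothetical lower edges of the excess bins; must include 1
    Returns two Counters keyed by bin lower edge: (excess in A, excess in B).
    """
    count_a, count_b = Counter(hits_a), Counter(hits_b)
    excess_a, excess_b = Counter(), Counter()
    for model in set(count_a) | set(count_b):
        diff = count_a[model] - count_b[model]
        if diff == 0:
            continue  # equal numbers of hits: no excess on either side
        # Largest bin edge not exceeding the size of the excess.
        edge = max(e for e in bin_edges if e <= abs(diff))
        (excess_a if diff > 0 else excess_b)[edge] += 1
    return excess_a, excess_b


# Toy hit lists with made-up model names.
toy_hits_a = ["model_1"] * 7 + ["model_2"] * 2 + ["model_3"]
toy_hits_b = ["model_1"] * 1 + ["model_2"] * 2 + ["model_3"] * 3
toy_a, toy_b = excess_profile(toy_hits_a, toy_hits_b)
```

In the toy data, model_1 has six more hits in set A, model_3 has two more hits in set B, and model_2 is balanced and therefore not counted.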
Numerous reports comparing the HGSC and Celera genome
assemblies (Aach et al., 2001; Olivier et al., 2001; Li et al.,
2002; Xuan et al., 2003) and gene sets (Hogenesch et al.,
2001) have been made since the simultaneous publication of
the two genomes in February 2001 (IHGSC, 2001; Venter
et al., 2001). The analysis presented here suggests that the
initial HGSC genome assembly, although containing a similar amount of genomic sequence information as the Celera
genome assembly, was in a much better state of assembly. This
is not entirely unexpected as whole genome shotgun sequencing, the technique used by Celera, is more challenging to
assemble than HGSC’s hierarchical shotgun approach. Over
the course of two years, however, Celera has made up for the
shortcomings of their initial assemblies with newer assemblies
that have approached the quality of HGSC’s draft genome.
Since the Ensembl gene build system predicts genes through
GENSCAN, homology, and gene prediction HMM methods,
the quality and quantity of their gene predictions should mirror
the quality of the genome assembly, as we have observed. In
contrast, Celera uses a richer gene prediction pipeline named
Otto that places greater emphasis on cross-species genome
comparisons, EST homology, and curated gene set homology
(Venter et al., 2001). By incorporating information in addition
to its genome sequence, Celera has been able to generate a larger, more unique gene set. While many of the predicted genes
unique to both the Ensembl and Celera gene sets are likely
to be proven not to be bona fide genes [Fig. 5B (Hogenesch
et al., 2001)], we expect that a significant number of them will
be validated when the full content of the human transcriptome
is finally determined.
ACKNOWLEDGEMENTS
We thank Jim Kent (UCSC) and the members of the Ensembl
project (UK) for various technical assistance and help in
HGSC genome database setup, and Tularik/GNF Bioinformatics and IT staff for outstanding computational support. The
authors are also grateful to Drs Greg Peterson and Zheng Pan
for critical discussions.
REFERENCES
Aach,J., Bulyk,M.L., Church,G.M., Comander,J., Derti,A. and
Shendure,J. (2001) Computational comparison of two draft
sequences of the human genome. Nature, 409, 856–859.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res., 25, 3389–3402.
Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L.,
Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and
Sonnhammer,E.L. (2002) The Pfam protein families database.
Nucleic Acids Res., 30, 276–280.
Burge,C. and Karlin,S. (1997) Prediction of complete gene structures
in human genomic DNA. J. Mol. Biol., 268, 78–94.
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14,
755–763.
Hogenesch,J.B., Ching,K.A., Batalov,S., Su,A.I., Walker,J.R.,
Zhou,Y., Kay,S.A., Schultz,P.G. and Cooke,M.P. (2001) A comparison of the Celera and Ensembl predicted gene sets reveals little
overlap in novel genes. Cell, 106, 413–415.
Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L.,
Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl
genome database project. Nucleic Acids Res., 30, 38–41.
Huson,D.H., Reinert,K., Kravitz,S.A., Remington,K.A.,
Delcher,A.L., Dew,I.M., Flanigan,M., Halpern,A.L., Lai,Z.,
Mobarry,C.M. et al. (2001) Design of a compartmentalized
shotgun assembler for the human genome. Bioinformatics, 17,
S132–S139.
International Human Genome Sequencing Consortium (IHGSC)
(2001) Initial sequencing and analysis of the human genome.
Nature, 409, 860–921.
Kent,W.J. (2002) BLAT—the BLAST-like alignment tool. Genome
Res., 12, 656–664.
Li,S., Liao,J., Cutler,G., Hoey,T., Hogenesch,J., Cooke,M.,
Schultz,P. and Ling,X. (2002) Comparative analysis of human
genome assemblies reveals genome-level differences. Genomics,
80, 138.
Maglott,D.R., Katz,K.S., Sicotte,H. and Pruitt,K.D. (2000) NCBI’s
LocusLink and RefSeq. Nucleic Acids Res., 28, 126–128.
Myers,E.W., Sutton,G.G., Smith,H.O., Adams,M.D. and Venter,J.C.
(2002) On the sequencing and assembly of the human genome.
Proc. Natl Acad. Sci. USA, 19, 19.
Olivier,M., Aggarwal,A., Allen,J., Almendras,A.A., Bajorek,E.S.,
Beasley,E.M., Brady,S.D., Bushard,J.M., Bustos,V.I., Chu,A.
et al. (2001) A high-resolution radiation hybrid map of the human
genome draft sequence. Science, 291, 1298–1302.
Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI
gene-centered resources. Nucleic Acids Res., 29, 137–140.
Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J.,
Sutton,G.G., Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A.
et al. (2001) The sequence of the human genome. Science, 291,
1304–1351.
Waterston,R.H., Lander,E.S. and Sulston,J.E. (2002) On the sequencing of the human genome. Proc. Natl Acad. Sci. USA, 99,
3712–3716.
Xuan,Z., Wang,J. and Zhang,M.Q. (2003) Computational comparison of two mouse draft genomes and the human golden path.
Genome Biol., 4.
J. Med. Chem. 2002, 45, 1221-1232
PRO_SELECT: Combining Structure-Based Drug Design and Array-Based
Chemistry for Rapid Lead Discovery. 2. The Development of a Series of Highly
Potent and Selective Factor Xa Inhibitors
John W. Liebeschuetz,*,† Stuart D. Jones,† Phillip J. Morgan,† Chris W. Murray,† Andrew D. Rimmer,†
Jonathan M. E. Roscoe,† Bohdan Waszkowycz,† Pauline M. Welsh,† William A. Wylie,† Stephen C. Young,†
Harry Martin,† Jacqui Mahler,† Leo Brady,‡ and Kay Wilkinson‡
Protherics Molecular Design, Beechfield House, Lyme Green Business Park, Macclesfield SK11 0JL, U.K., and Department of
Biochemistry, University of Bristol, Bristol BS8 1TD, U.K.
Received June 6, 2001
In silico screening of combinatorial libraries prior to synthesis promises to be a valuable aid to
lead discovery. PRO_SELECT, a tool for the virtual screening of libraries for fit to a protein
active site, has been used to find novel leads against the serine protease factor Xa. A small
seed template was built upon using three iterations of library design, virtual screening,
synthesis, and biological testing. Highly potent molecules with selectivity for factor Xa over
other serine proteases were rapidly obtained.
Introduction
Serine proteases represent a class of enzymes of great
therapeutic importance. Members of the class which
have been targeted for drug design include tryptase and
urokinase and, in the blood coagulation cascade, thrombin, factor VIIa, and factor Xa. Factor Xa lies at the
junction of the intrinsic and extrinsic pathways of the
coagulation cascade. It is the active enzyme present in
the prothrombinase complex, which converts prothrombin into thrombin. Thrombin is the final enzymatic
product of the blood coagulation cascade and is responsible for the conversion of fibrinogen into fibrin. Much
effort has been spent targeting thrombin, in particular,
and, more recently, factor Xa, with the aim of designing
antithrombotic drugs which are orally available and
which show a reduced potential for bleeding as a side
effect.1 Current therapies include the heparins, which
are not orally available, and the coumarins, which have
a narrow therapeutic window with regard to bleeding.
Factor Xa has been claimed to be a better antithrombotic target than thrombin because there are indications
that factor Xa inhibitors may have less propensity to
show bleeding side effects.2,3 Additionally, a rebound
effect has been observed following cessation of therapy
with direct thrombin inhibitors.4 Potential indications
are for deep vein and arterial thrombosis, postoperative
prophylactic use, myocardial infarction, and stroke.
Crystal derived structural models are available for
quite a number of serine proteases. Their mode of
catalytic action is well understood, and the structural
features that give rise to substrate selectivity have in
many cases been elucidated. Recently, for both thrombin
and factor Xa, structures have been published which
have a variety of competitive inhibitors bound in the
active sites. Despite this wealth of structural information, the role of structure-based design, to date, has
generally been to help suggest analogues of an existing
lead and to post-rationalize the activity data, rather
than as a tool for the de novo design of inhibitors.5,6,7
* Corresponding author: John W. Liebeschuetz, Tularik Ltd, Beechfield House, Lyme Green Business Park, Macclesfield SK11 0JL, U.K.
Tel: (44) 1625 427369. Fax: (44) 1625 612311. E-mail: jliebeschuetz@tularik.com.
† Protherics Molecular Design.
‡ University of Bristol.
Figure 1. Bis-amidine factor Xa inhibitors DX-9065a and YM-60828.
The first crystal structure of the factor Xa enzyme
was published by Bode’s group in 1993 (Brookhaven
code 1HCG).8 This crystal structure is missing the Gla
(γ-carboxyglutamic acid) domain (N terminal residues
1-45) and also residues Glu 146-Gln 151, close to the
active site, which are apparently autocleaved during
crystallization. The S1 pocket of the active site is
occupied by the A-chain terminal Arg 439 of a neighboring factor Xa molecule, which hydrogen bonds to Asp
189 in the standard bidentate fashion. The active site
is similar to trypsin, but differs from many other serine
proteases in having a large S4 pocket. The S1 pocket
differs from trypsin in that Ser 190 is replaced with
hydrophobic Ala 190. These features suggest that selective small molecule inhibitors for factor Xa can be
obtained and, indeed, many are already known. For
example, DX-9065a is a dicationic inhibitor with a Ki
of 41 nM against factor Xa and a Ki of 630 nM against
trypsin and >2000 nM against thrombin (Figure 1).9
YM-60828 is another dicationic inhibitor with a Ki of
2.3 nM against factor Xa, 159 nM against trypsin, and
10.1021/jm010944e CCC: $22.00 © 2002 American Chemical Society
Published on Web 02/16/2002
Journal of Medicinal Chemistry, 2002, Vol. 45, No. 6
Figure 2. Flow diagram demonstrating the PRO_SELECT
approach.
>10 000 nM against thrombin.10 Both these compounds
are currently in preclinical and clinical trials and show
some promise as oral antithrombotics without bleeding
side effects.11,12 A crystal derived structure for DX-9065a
bound to factor Xa (Brookhaven code 1FAX) exists. This
indicates that the naphthamidine portion sits in the S1
pocket, making a single hydrogen bond to Asp 189, while
the acetamidinopyrrolidine portion sits in the electron
rich S4 pocket and makes a hydrogen bond to the
pendant Glu 97 side chain.13
Virtual Screening Methodology
We have recently published a methodology for the
computational or ‘virtual’ screening of combinatorial
libraries against the active site of an enzyme.14 The
program central to this methodology is called PRO_SELECT. A flow diagram illustrating the methodology is
given in Figure 2. The idea is to generate relatively
small candidate libraries (10-1000 compounds) which
have a high ratio of hits to inactives. These libraries
are generally based on a template chemistry and are
designed to be synthetically accessible in a timely and
cost-effective manner. The catholic nature of the screening procedure allows a wide diversity of structures to
be explored within the constraints imposed by the
template.
We start with a template structure. This is designed
to fit into part of the active site, normally in a central
position, and make favorable interactions with the
enzyme. The template has attachment points to which
substituents can be affixed using simple chemical reactions. Each attachment point is ideally directed toward
a different pocket in the binding site, in which a suitable
substituent may find favorable interactions. The substituents must be readily accessible to allow rapid
chemical synthesis. Therefore the substituents are
directly derived from lists of appropriate reagents, each
selected from a directory of commercially available
chemicals. The selection is normally carried out by
searching according to a simple pharmacophore appropriate to the target pocket. Each list may contain
several thousand different chemicals.
Each substituent from a list is computationally
screened for fit to the target pocket. This was done in
two stages in the work presented here. First each
substituent is attached to the template and then a user
defined number of conformations are roughly assessed
for goodness of fit and favorable interactions. This was
done by assessing the substituent for complementarity
to an ‘interaction site model’ in the manner of Klebe.15
The number of attempts at finding graph matches is of
the order of a thousand. A match may not represent a
viable conformation for a substituent, and this is
checked by carrying out a local conformationally flexible
fitting of the substituent onto the interaction sites using
a directed tweak algorithm. The number of attempts at
finding a viable conformation from a given match is of
the order of 30 to 120. Many substituents may be rejected
at this stage. Accepted substituents can be further
refined in a second stage, using molecular mechanics
to optimize internal geometries and substituent:receptor
interactions, via an implementation of the ‘CLEAN’
force field.16 Each substituent is scored using an empirical scoring function to find and preserve the best match
for each substituent. More recent versions of PRO_SELECT use a docking protocol on both template and substituent to obtain a good binding conformation in which
the template position can be adjusted.17 The empirical
scoring function represents binding energy and is
derived by regression analysis of measured binding
affinity to terms known to be important in determining
affinity and calculable from existing crystal structures.
A number of different such functions have been published. The empirical scoring function used in this work
is that of Böhm, although we have subsequently derived
our own scoring function, ChemScore.18-20
The score is thus used to drive placement of the
substituent. It is also used by the molecular designer
to differentiate between substituents in the design of
the final sublibrary. It is generally used only as a cutoff
filter to pick out those substituents which have the best
chance of making good binding interactions. It is not
expected that the Böhm score will correlate well with
actual binding affinities of the library members, as the
Böhm score is derived from complexes in which only
favorable ligand:protein interactions are made. Unfavorable interactions which are not penalised by the
Böhm score, for instance polar:lipophilic contacts, are
liable to frequently arise in the PRO_SELECT placements. For this reason, other criteria are also used in
the selection process. These can include strain energy
estimation, a diversity metric or calculated physical
chemical properties. Manual inspection of the predicted
binding modes also plays an important role. The process
may be repeated for separate substituent lists attached
to different points on the template. Thus the final
library will consist of an array of substituents for a
single attachment point or a combinatorial array constructed from two or more lists each corresponding to a
separate attachment point.
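The use of the score as a cutoff filter, followed by assembly of a combinatorial array from the surviving lists, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not PRO_SELECT itself: the function names, substituent labels, scores, and cutoff value are hypothetical, and the other selection criteria mentioned above (strain energy, diversity, manual inspection) are not modelled.

```python
from itertools import product


def select_substituents(scores, cutoff):
    """Keep substituents whose score is below the cutoff (lower is better)."""
    return sorted(name for name, score in scores.items() if score < cutoff)


def enumerate_library(*selected_lists):
    """Return every combination of one substituent per attachment point."""
    return list(product(*selected_lists))


# Hypothetical scores for substituents at two attachment points.
point_1 = {"R1a": -12.0, "R1b": -4.5, "R1c": -1.0}
point_2 = {"R2a": -8.2, "R2b": -2.9}

kept_1 = select_substituents(point_1, cutoff=-3.0)  # hypothetical cutoff
kept_2 = select_substituents(point_2, cutoff=-3.0)
library = enumerate_library(kept_1, kept_2)
```

Passing a single list to `enumerate_library` gives the single-attachment-point case; passing two or more gives the combinatorial array.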
PRO_SELECT was first validated using thrombin as
a target.21 We now report on the use of this methodology
Figure 3. Factor Xa active site VdW surface (A) and schematic (B) illustrating the strategy used in the iterative design
process. The initial template, 3-benzamidinecarbonyl, and the
interaction site model for the first library are shown (A).
to design factor Xa inhibitors as the primary stage in
an ongoing program to discover novel antithrombotic
drugs. This has led to a new class of chemistry that is
highly active and specific for factor Xa. Several groups
have described similar ‘Virtual Screening’ approaches
that have led to active inhibitors against other targets.22,23 To our knowledge this is one of the first
examples where such an approach has led to molecules
sufficiently active to show therapeutic effect in relevant
animal disease models. It is also the first published
example of the de novo design of inhibitors for factor
Xa.
Design Strategy
It is usual, when considering the design of a combinatorial library, to build the library around a central
template to which two or more substituents are attached
or incorporated via facile chemistry. The placement of
such a template in an active site would necessarily be
central. Examination of the factor Xa 1HCG structure
reveals that the central region of the active site is very
broad, and the disposition of polar ‘anchoring’ sites is
sparse (Figure 3A). Therefore we felt no confidence that
a central template could be designed which would be
guaranteed to bind in a single predictable manner. An
alternative strategy was chosen in which a template
would be placed in one of the specificity pockets and
PRO_SELECT would be used to find potential substituents to fit the central portion of the active site.
Synthesis and screening of this initial library would
then allow a lead molecule to be selected. This could be
further elaborated, again by use of PRO_SELECT, to
exploit other specificity pockets within the active site.
The S1 pocket was chosen as the most suitable pocket
in which to place a template. Inhibitors of the trypsin
class of serine proteases generally have a cationic moiety
in this pocket which can make a single or bidentate
H-bond with Asp 189. This provides a good anchor and
a predictable binding orientation for a variety of ligands.
The strategy of iterative library design to grow into
different regions of the factor Xa active site is demonstrated in Figure 3, with reference to the 1HCG structure. This was the only structure openly available at
the start of this work and is therefore the one that was
used. Initial placement of the template in S1 (blue) was
to be followed by development of the first library to
probe the central (red) region. This was to be followed
up by second and third libraries to occupy the green and
purple pockets. The red region incorporates the Ser 214
to Gly 218 backbone, the position of which is well
conserved in many serine proteases and which is known
to be capable of providing H-bonding recognition interactions with natural substrates. It also partially contains the characteristic ‘aromatic box’ of factor Xa,
constructed from the Tyr 99, Trp 215, and Phe 174 side
chains. This ‘box’ constitutes the majority of the S4
pocket. The purple region represents the back of the S4
pocket and is characterized by three backbone carbonyls
from Thr 98, Glu 97, and Lys 96 and the anionic Glu
97 side chain which can overhang the pocket. In theory
these groups are available to hydrogen bond strongly
to electropositive and cationic groups. The green region
represents a hydrophobic pocket that has as its base
the Cys 191-Cys 220 disulfide bridge, Gln 192 and Arg
143 side chains as the left-hand wall, and the Gly 218
backbone as the right-hand wall. This pocket is well
conserved in many serine proteases but has not been
exploited frequently in the design of inhibitors, perhaps
because the pocket can often be occupied by the mobile
side chain of residue 192. The structure of the potent
anti-Xa protein, tick anticoagulant peptide, bound to
factor Xa, shows that this ligand can make use of this
region, albeit with considerable reorganization of the
active site.24 We were of the opinion, after consideration
of the 1HCG structure, that this pocket could be a prime
target area for substituent placement in factor Xa.
Results and Discussion
Template Selection and First Library Design. It
was decided to employ an amidino group to anchor the
S1 template via a bidentate hydrogen bond to Asp 189.
We were aware that such a group has in the past led
to problems in oral availability and rapid clearance.
Nevertheless it was felt important to use a template
that had reasonable base activity of its own so that
structure/activity trends would become immediately
apparent.
It was envisaged that the synthetically facile linkage
of the template to the substituent would be via an amide
bond. It was also envisaged that this amide bond might,
itself, be able to make hydrogen bonding interactions
with the active site. Several possible template candidates were considered. It was decided to use PRO_SELECT to design a library with each template and to
compare the quality of the libraries in order to select
the best template. Three of the templates examined are
shown in Table 1.
Table 1. Hit Rate and Hit Quality of Libraries Designed
around Three Different S1 Templates and a Single Substituent
List, Using PRO_SELECT
The initial stage in using PRO_SELECT was to create
a ‘Design Model’. This is a simple ‘interaction site model’
of the active cleft.14 The Design Model for the region
targeted for the first library is included in Figure 3A.
The dark/light blue vectors represent hydrogen bond
donor sites, the blue/purple vectors represent hydrogen
bond acceptor sites, and lipophilic point sites are
represented as orange crosses. The first step in substituent evaluation was to find matches between interaction sites in the substituent, attached to template, and
complementary sites in the design model. Also illustrated in Figure 3A is one of the templates bound in
the S1 pocket in the binding orientation employed in
the PRO_SELECT job.
The choice of substituent lists and the virtual screening protocol used are given in the Experimental Section.
A summary of the output performance is given in Table
1 for each of the three templates. The number of
substituents that passed the matching stage was roughly
the same for each template. However, the number of
substituents binding reasonably well (Böhm score < -3)
was much less for the 4-benzamidinocarbonyl template
than for the other two templates. The arginine template
had marginally more high quality hits than the 3-benzamidinocarbonyl template. However, the Böhm score
for the former was 10 kJ mol⁻¹ higher than that for
the latter (-5.3 versus -16.4), corresponding to a
shortfall of roughly 2 orders of magnitude in terms of
binding affinity. It was concluded therefore that the
3-benzamidinocarbonyl template was the most appropriate template to use.
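As a rough check on the statement that a score gap of about 10 kJ mol-1 corresponds to roughly 2 orders of magnitude in binding affinity, the following minimal sketch converts a free-energy difference into a ratio of binding constants through ΔG = -RT ln K. It assumes that the Böhm score can be read as an estimated binding free energy in kJ mol-1 and that T = 298.15 K; the function name is illustrative only and is not part of PRO_SELECT.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ mol-1 K-1


def log10_affinity_ratio(delta_score_kj, temperature_k=298.15):
    """Orders of magnitude in binding constant for a free-energy gap (kJ mol-1).

    Follows from delta_G = -RT ln K, so log10(K1/K2) = |ddG| / (RT ln 10).
    """
    return abs(delta_score_kj) / (R_KJ_PER_MOL_K * temperature_k * math.log(10))


# Score gap quoted in the text for the two templates (-5.3 versus -16.4).
gap = -5.3 - (-16.4)
orders = log10_affinity_ratio(gap)
fold = 10 ** orders
print(f"gap = {gap:.1f} kJ mol-1 -> {orders:.2f} log units (~{fold:.0f}-fold)")
```

Under these assumptions the quoted gap comes out a little under 2 log units, in line with the "roughly 2 orders of magnitude" stated above.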
A disappointing feature, even in the case of the
3-benzamidinocarbonyl template, was the paucity of
high scoring substituents (score < -10). Substituents
could be found which made either the desired polar
interactions or hydrophobic interactions but rarely both.
The number of highly promising substituents therefore
appeared limited. The reason for this is likely to be the
restriction on substituent diversity, placed upon us by
using solely the Available Chemicals Directory as a
source. Six targets from this first PRO_SELECT run
were selected for synthesis.
Figure 4. Lead compounds arising out of library 1.
One substituent that did score exceptionally well was
the glycine-2-naphthylamide, 1 (Figure 4). The naphthyl
group was found to sit well in the hydrophobic S4
pocket, and the glycine C=O was able to H-bond with
Gly 218. Both of these effects led to a good Böhm score.
When modeled against the active site, it was found that
the NH from the benzamide amide group also could
make a hydrogen bond to Gly 216. This naphthylglycinamide
substituent was found to be unobtainable, and
the presence of the highly carcinogenic β-naphthylamine
substructure militated against its synthesis. Nevertheless the motif looked of sufficient interest to be investigated further. Accordingly it was decided to append
the glycine to the 3-amidinobenzoyl group and use this
molecule as a larger template. PRO_SELECT was used
to search for substituents which, when attached via an
amide bond to the glycine carbonyl, would probe deeper
into the S4 pocket. It was envisaged that it might be
possible to use the carbonyl groups at the back of the
S4 pocket for hydrogen bonding. For this reason, the
library of substituents chosen for virtual screening was
selected to be bisamines separated by a hydrophobic
group. PRO_SELECT offers the ability to interconvert
functional groups in silico prior to virtual screening.
This ‘deprotection’ facility mimics a synthetically facile
functional group transforming reaction.13 This was
exploited here to broaden the diversity of possible
substituents by inclusion of a list of bis-nitriles, converted on the fly to bis-amines.
Further details of the virtual screening protocol are
given in the Experimental Section. Eight substituents
were chosen which had Böhm contributions of better
than -10 and strain energies of lower than 25 kJ mol-1.25
Good hits from this second search were incorporated with those of the first, and 14 compounds in total
were selected for synthesis (library 1a).
Addition of bulky and partially hydrophobic substituents to a small template could arguably be expected
to lead to increased efficacy through nonspecific lipophilic interactions. We wanted to be confident that any
increase in activity in the initial library was not
generated this way. Therefore it was decided to prepare
a second library using some of the same substituents
selected for the 3-amidinobenzoyl template but attached
to the ‘wrong’ 4-amidinobenzoyl template. Eight substituents were chosen (library 1b).
The synthetic routes used for the preparation of these
compounds are shown in Schemes 1 and 2. Simple
amine substituents, as exemplified by amino acids and
their esters, were prepared by coupling with 3-cyanobenzoic acid (using TBTU in DMF), conversion to
the imidate (HCl in ethanol), and then to the amidine
(using ammonia in ethanol). Hydrolysis of the ester was
accomplished using aqueous sodium hydroxide in ethanol, Scheme 1.
Table 2. Comparison of Activities against Factor Xa, Trypsin, and Thrombin for Benzamidine and Libraries 1a-c, 2a-e, and 3a (a)

                     factor Xa                           trypsin                             thrombin
library       n      mean pKi     mean Ki(b)  best Ki    mean pKi     mean Ki(b)  best Ki    mean pKi       mean Ki(b)  best Ki
                     (± SD)       (µM)        (µM)       (± SD)       (µM)        (µM)       (± SD)         (µM)        (µM)
benzamidine          3.7          200         200        4.8          17          17         4.6            25          25
1a            14     4.3 ± 0.6    50          8.5        4.4 ± 0.8    38          6          4.5 ± 0.7      31          5
1a (subset)   7      4.2 ± 0.8    58          8.5        4.2 ± 0.9    61          6          4.6 ± 0.8      23          5
1b            7      3.1 ± 0.3    780         162        3.7 ± 0.9    180         2.3        3.6 ± 0.5      230         89
1a + 1c       36     4.4 ± 0.8    36          2.1        4.2 ± 0.8    61          6.5        4.3 ± 0.9      51          3.0
2a            6      5.5 ± 0.8    3.5         0.22       4.6 ± 0.6    26          9          (c)            (c)         (c)
2b            6      4.7 ± 0.7    21          1.0        4.3 ± 0.4    44          21         (c)            (c)         (c)
2c            9      4.7 ± 0.4    21          2.9        4.3 ± 0.4    43          8          (c)            (c)         (c)
2d            11     4.5 ± 0.5    29          3.0        4.3 ± 0.4    47          15         (c)            (c)         (c)
2a + 2e       34     5.7 ± 0.8    1.9         0.063      4.9 ± 0.7    11          0.79       4.7(d) ± 0.5   18          1.4
3a            106    6.5 ± 0.8    0.34        0.016      5.4 ± 0.6    4.1         0.040      5.2(e) ± 0.6   7           0.023

(a) Figures not in bold represent benchmark libraries. (b) Ki figures are geometric means calculated as the reciprocal antilog of the pKi
mean. (c) Insufficient compounds tested against thrombin. (d) Only 23 compounds tested against thrombin. (e) Only 99 compounds tested against
thrombin.
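Footnote b of Table 2 describes the mean Ki as a geometric mean obtained from the mean pKi. A minimal sketch of that conversion is given below, assuming pKi = -log10(Ki in mol/L) and Ki values supplied in µM; the function names and the two-value example are illustrative and are not taken from the paper's data.

```python
import math


def pki_from_ki_um(ki_um):
    """pKi = -log10(Ki), with Ki converted from micromolar to molar."""
    return -math.log10(ki_um * 1e-6)


def ki_um_from_pki(pki):
    """Reciprocal antilog of pKi, returned in micromolar."""
    return 10 ** (-pki) * 1e6


def geometric_mean_ki_um(ki_values_um):
    """Geometric mean Ki (µM), computed as the antilog of the mean pKi."""
    mean_pki = sum(pki_from_ki_um(k) for k in ki_values_um) / len(ki_values_um)
    return ki_um_from_pki(mean_pki)


# Hypothetical two-compound library: the geometric mean of 1 and 100 µM is 10 µM.
example_mean = geometric_mean_ki_um([1.0, 100.0])

# A pKi of 3.7 (the benzamidine entry for factor Xa) corresponds to about 200 µM.
benzamidine_ki = ki_um_from_pki(3.7)
print(f"{example_mean:.1f} µM, {benzamidine_ki:.0f} µM")
```

Averaging on the pKi scale keeps a single very weak compound from dominating the library mean, which is presumably why the table reports geometric rather than arithmetic means.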
Scheme 1. Solution-Phase Synthesis of Inhibitorsa
a Conditions: (a) TBTU in DMF; (b) HCl in ethanol; (c) ammonia
in ethanol; (d) NaOH in ethanol.
Where the S4 unit was a bis-amine, a solid-phase route
was preferred (Scheme 2). The bis-amine was attached
to 2-chlorotrityl polystyrene resin and coupled to an
Fmoc protected amino acid using TBTU in DMF. The
Fmoc protection was removed with piperidine in DMF
and the free amine was coupled to 3-amidinobenzoic acid TFA
salt using DIPCI and HOBt. The product was then
cleaved from the resin using TFA/triethylsilane.
Compounds were tested in chromogenic assays and
Ki’s calculated against a range of serine proteases. These
included factor Xa, trypsin, and thrombin.
The mean, standard deviation, and best activities of
libraries 1a and 1b against factor Xa, trypsin, and
thrombin are given in Table 2. The corresponding
activities for the subset of library 1a with common
substituents to library 1b are also given. The designed
library 1a with the 3-amidinobenzoyl template demonstrates on average markedly improved activity over
benzamidine (Ki of 200 µM, pKi of 3.7). Moreover, the
‘control’ library with the 4-amidinobenzoyl template, 1b,
shows no such improvement and is, on average, more
than an order of magnitude less active than the corresponding subset of compounds from library 1a. The
elevation of activity in library 1a is therefore not simply
a function of increasing the size of the molecule by
addition of a lipophilic fragment.
It was decided to broaden the library 1a by making
some simple structure-predicated modifications of some
of the hits. Thus a variety of more lipophilic esters of
the naphthylalanine and tryptophan analogues 2 and
3 were prepared, with the idea they might better fill
the S4 pocket (library 1c). This led to the low micromolar
lead 4. This compound is selective for factor Xa over
trypsin (Ki of 2.0 µM vs Ki of 12.5 µM) despite the fact
that benzamidine itself is 10-fold better for trypsin (Ki
of 20 µM vs trypsin).
Another one of the most active compounds in library
1a was 5 (Ki, 14 µM), and this was selected as a second
possible lead compound. The more active of the two
leads, 4, did not readily lend itself to modification via
easily accessible chemistry. Compound 5, on the other
hand, looked ideally set up for quick variation, both by
replacement of the glycine with other amino acids and
by replacement of the 1,4-bis-aminomethylcyclohexane
fragments. This was therefore chosen as the lead
molecule from which to develop the second library.
Scheme 2. Solid-Phase Synthesis of Inhibitorsa
a Conditions: (a) TBTU/DIPEA and Fmoc-amino acid in DMF; (b) 20% piperidine in DMF; (c) 3-amidinobenzoic acid TFA salt and
DIPCI/HOBt in DMF; (d) 10% triethylsilane in TFA.
Figure 5. Compound 5 modeled in the 1HCG factor Xa
structure.
Second Library Design. Figure 5 illustrates 5
docked into the 1HCG structure. The terminal nitrogen
of the S4 binding portion has been modeled as protonated. The glycine in 5 can be elaborated from either of
the prochiral hydrogens. Elaboration involving a D-amino acid or analogue thereof (i.e., coming off the
hydrogen marked purple in Figure 5) appeared to have
a good chance of exploiting the lipophilic disulfide pocket
(green area, Figure 3, disulfide in yellow in Figure 5)
according to the model. The enantiomeric L-amino acids
appear not to be able to access good binding sites in the
vicinity. PRO_SELECT jobs were carried out using the
list of available α-amino acids. Both L and D configurations were examined, and both polar and hydrophobic
interactions were sought. The list of amino acids that
resulted consisted mainly of D-amino acids, which
docked into the disulfide pocket. A library of seven
compounds from this list, all D-enantiomers, was synthesized (sublibrary 2a), all with the 1,4-bis-aminomethylcyclohexane S4 fragment. The corresponding L-enantiomers were also made (sublibrary 2b). A variety
of other D-amino acids (9 in all, sublibrary 2c) and
L-amino acids (11 in all, sublibrary 2d) was also utilized.
Thus libraries 2b, 2c, and 2d represent benchmark
libraries to compare with 2a. The solid-phase synthetic
route in Scheme 2, was applicable to all the library 2
compounds.
Table 2 gives the mean, standard deviation, and best
activities for all four sublibraries against factor Xa,
trypsin, and thrombin.
The figures in Table 2 indicate that designed sublibrary 2a is, on average, almost an order of magnitude
more active against factor Xa than any of the three
comparison libraries, 2b, 2c, and 2d. However, this is
not mirrored in the average trypsin activities. The
compound within the library that showed the highest
activity was 6 (Figure 6), with a Ki of 220 nM against
factor Xa and 7.4 µM against trypsin. Analysis of the
PRO_SELECT run and further modeling revealed that
the phenyl group appeared able to sit well into the
disulfide pocket. It also appeared able to make an edge-on interaction with the disulfide bridge. This compound
was selected as the lead molecule for the third cycle of
PRO_SELECT driven optimization.
Figure 6. Lead molecule for library 3 (6) and examples of
library 3 compounds.
Analysis of the 1HCG structure suggested that there
was plenty of room left in the disulfide pocket for further
hydrophobic elaboration, especially in the vicinity of the
3 and 4 positions of the phenyl ring in 6. Accordingly,
to exploit this extra binding possibility, further analogues
were designed, using both medicinal chemistry
principles and modeling (library 2e). Only modest
increases in activity were achieved. The reasons for this
remained unclear until the crystal structure of DX-9065a bound to factor Xa was published (1FAX).12 This
structure retains the autolysis loop, missing in the
1HCG structure. This loop sits at the back of the
disulfide pocket area, severely curtailing its size and
depth.
Third Library Design and Activity. Only a limited
number of substituents designed to access the S4 pocket
was tried in libraries 1 and 2. Many of these were
diamines. However, it was accepted that there was
considerable scope to find better S4 pocket binders and
that these could either contain cations or hydrogen bond
donors or, alternatively, they could be hydrophobic in
nature and interact strongly with the aromatic ‘box’.
Therefore it was decided to carry out PRO_SELECT jobs
using a docked conformation of 6 as the template, with
the aim of replacing the terminal diamine with a
hydrophobic primary or secondary monoamine. Several
jobs were run. The list of available monoamines was
large, and the pocket to be filled is also sizable.
Therefore, a much bigger list of quality substituents was
found from these jobs than from those run previously.
Approximately 100 targets were selected (library 3a).
The vast majority of these compounds contained a
lipophilic S4 binder. Where the S4 binder was lipophilic,
a solution-phase synthetic route was used (Scheme 3).
Boc-D-phenylglycine was coupled to an S4 component
in DMF using EDC/HOBt or HOAt. The products thus
obtained were deprotected, using TFA in DCM, and then
coupled to 3-amidinobenzoic acid TFA salt. Where the
S4 binder was a diamine, the solid-phase route (Scheme
2) could be used.
Summary activity data for library 3a is given in Table
2. Activity, in relation to the lead compound 6, was
generally retained and, in many cases, improved upon
in this library, despite the fact that there was usually
no possibility of obtaining a hydrogen bond to the back
of the S4 pocket in the new series. The most active
targets out of this set of analogues, 7, Ki 16 nM, and 8,
Ki 16 nM (Figure 6), showed a greater than 10-fold
increase in activity over 6 and retained selectivity over
other serine proteases (Ki’s of 980 and 1700 nM against
trypsin, 1100 and 1600 nM against thrombin). These
molecules are both amenable to further optimization at
the S4 end of the molecule and thus represent good third
cycle leads. Selectivity for the library as a whole against
thrombin and trypsin was slightly improved, despite the
fact that both these enzymes have sizable lipophilic
pockets that correspond to the Xa S4 pocket.
Scheme 3. Solution-Phase Synthesis of Inhibitorsa
a Conditions: (a) coupling agents, see text; (b) 25% TFA in DCM;
(c) 3-amidinobenzoic acid TFA salt and DIPCI/HOBt in DMF.
A cocrystal of compound 7 bound in trypsin was
successfully obtained. Figure 7 compares the predicted
binding mode in factor Xa (blue) with that found in
trypsin (green). The benzamidine is found in S1, hydrogen bonding in a bidentate fashion to Asp189; the
phenyl portion of the phenylglycine linker is found in
the disulfide pocket, which can also be accessed in
trypsin, and the benzoyl piperidine group sits centrally
in the S4-pocket with the carbonyl pointing upward.
Both amide bonds in the ligand make the predicted
hydrogen bonds. The first H-bonds Gly 216 through NH,
the second, Gly 218, through C=O. The bound ligands
appear least well superimposed in the S4 region.
However, trypsin and factor Xa differ in the S4 pocket,
most noticeably at residues 99 and 174 (Leu and Gln
in trypsin). Leu 99 allows more room for the terminal
phenyl group to sit on the right-hand side of the pocket,
than does Tyr 99 in factor Xa.
Figure 7. Compound 7 bound in trypsin (green) and superimposed with the predicted binding mode in Factor Xa (blue).
The third library contained a wide diversity of active
chemistries. This gave rise to a number of structurally
different lead molecules that could be exploited further
using either classical medicinal chemistry or a structure
informed approach. Compound 9 represents one example where structure-based modification of a PRO_SELECT lead gave rise to other chemistries with potent
activity and selectivity. The synthesis of this type of
compound is outlined in Scheme 4. The Boc-N-protected
S4 intermediates were first prepared as shown in
Scheme 3. The Boc protection was removed using TFA/DCM,
and the resulting amine reacted with 1,3-bis-tert-butoxycarbonyl methyl pseudothiourea in the presence
of mercury(II) chloride to give the bis-butoxycarbonyl
guanidines. Final treatment with TFA/DCM and purification by preparative HPLC gave the final products
isolated as TFA salts. Compound 9 has a Ki of 26 nM
against factor Xa but only 1.6 µM against trypsin and
8 µM against thrombin. Other related compounds had
similar activity and selectivity factors of 250.
Scheme 4. Preparation of S4 Guanidino Compoundsa
a Conditions: (a) TFA/DCM; (b) 1,3-bis-tert-butoxycarbonyl methyl pseudothiourea/HgCl2; (c) TFA/DCM.
Overview. Figure 8 illustrates how the activity of the
series evolved. The benzamidine template and lead
molecules from each library are indicated. Each library
contains, as well as the original PRO_SELECT members, additional structure-based and medicinal-chemistry-based analogues, some of which are mentioned above.
To obtain predicted binding affinities for all compounds
under a consistent set of docking conditions, all molecules were subsequently redocked, keeping the appropriate template portion of each molecule rigid as in
the original PRO_SELECT run. Predicted binding affinity was calculated from the docking score for each
molecule. Predicted binding affinity is plotted against
factor Xa activity in Figure 9. Both those compounds
selected through virtual screening and those not so
selected (libraries 1b and 2b,c,d) are plotted. There is
not a tight correlation. Nevertheless, out of the selected
compounds, there is only a small proportion which score
well but which show poor activity. In addition, those
compounds not selected through virtual screening generally show both poor predicted and actual activity.
These things are what we would hope to see as the
primary aim of the virtual screening approach is to
concentrate synthetic resource in the areas where
reasonable activity is most likely to reside. One of the
reasons the correlation of predicted and measured
activity is not better is because the empirical scoring
function is only able to describe positive features of the
binding mode and not negative ones other than rotational entropy. In addition, current scoring functions
ignore subtle electrostatic effects such as π stacking and
so on, which can greatly influence activity. So quite a
spread of activities is obtained in the final data set. This
spread of activity can be very useful, however, as it often
envelops a rudimentary structure-activity relationship
among subgroups of similar chemistry within the library. This provides a springboard for a classical lead
optimization approach. For this reason, we found it
important to have good compound integrity in each
library. Impure compounds gave rise to data that
confounded early stage SAR development in some cases.
Figure 8. Activity progression in the benzamidine series from
the first library (circles) to the second (squares) and third
(triangles) libraries. Original template and lead molecules for
subsequent library development are marked.
Figure 9. Activity against factor Xa in the benzamidine series
versus predicted activity calculated from docking score. Compounds are from both the PRO_SELECT (●) and benchmark
(△) libraries.
Several other points are worth making. First, the lead
molecule chosen at the beginning of each cycle was not
necessarily the most active molecule in the previous
library. It is more important that the lead be reasonably
easy to chemically modify and also be likely to have the
least pharmacokinetic problems.
Second, this approach is synergistic with modern
combinatorial chemical techniques and it could be used
with benefit alongside them, to design large focused
libraries. However, in such an approach only one
synthetic route is generally followed, and it is accepted
that a fraction of library members will not be successfully made. All compounds in a PRO_SELECT designed
library are to some extent ‘cherished’, and therefore
there is more incentive to make all members of the
library than is usual in combinatorial chemical exercises. If medium throughput array chemistry is employed and the libraries are relatively small, then it is
practical and useful to develop and employ more than
one synthetic route, as was done in this study.
Conclusions
We have described the successful application of
‘virtual’ screening in the rational design of potent and
selective factor Xa inhibitors. The starting point for the
program was a simple template, benzamidine, placed
in the S1 pocket. Iterative library design built upon the
template in order to access other pockets in the active
site. The chemistry involved in the synthesis of these
libraries was designed to be straightforward, allowing
rapid access of targets. Libraries using the same chemistry were also synthesized which were not designed
through use of PRO_SELECT but which were reasonable from a medicinal chemistry standpoint. These
consistently showed poorer average activity against (by
roughly a factor of 10) and selectivity for factor Xa than
those designed by ‘virtual’ screening.
Several lead molecules with diverse structure and Ki’s
in the range of 10 to 50 nM were obtained, representing 4 orders of magnitude increase with regard to the
binding affinity of the starting template. Further testing
established that some of these compounds showed
antithrombotic activity when given intraperitoneally in
the Wessler stasis model of venous thrombosis in rat
and therefore have potential therapeutic use as an
injectable treatment. None of the compounds showed
an effect when given orally, however. It was felt this
was likely to be because of the highly basic benzamidine
moiety. Further work has since been carried out to
replace this group with a group of moderate basicity and
to optimize potency by modification elsewhere in the
molecule. Highly potent and selective compounds with
strong oral antithrombotic activity were found. These
will be described in later publications.
The design of libraries through structure-based ‘virtual screening’ is a methodology of drug design that is
currently of high interest. We have demonstrated that
it can be an efficient method for the generation of potent
and selective lead molecules, in cases where a target
protein structure exists.
Figure 10. Pharmacophores used in finding substituent lists
for library 1a: (A) pharmacophore for H-bond acceptor substituents, (B) pharmacophore for lipophilic substituents.
Experimental Section
Computational Details. Manipulation and inspection of
receptor, template, and ligands before and after minimization
or simulation was carried out using InsightII 95.0.25 All
molecular mechanics minimizations and molecular dynamics
simulations were carried out using the Discover 2.97 program26
with the CFF95 force field. The Discover calculations were
carried out on a Convex Exemplar (16 × HP7100s) running
SPP-UX 3.1. The pharmacophore searches of the Available
Chemical Directory (ACD)27 and substituent list generation
were performed using ISIS/Base 1.228 and ISIS/Draw 1.28
Searches were carried out allowing full conformational flexibility. Inspection of the structures and associated numerical
data generated by PRO_SELECT was carried out using in-house graphics software (XMOLBROWSE). Substituent list
generation and graphical inspection was carried out on SGI
Indigo R3000 workstations running IRIX 4.0.5.
Protocol for the Design of Library 1a Members Based
on Three S1 Templates. The template positioning for the
arginine template (Table 1) was taken from a minimized
structure of PPACK in FXa. The docked conformation of
PPACK was derived from the 1PPB (PDB designation) PPACK/
thrombin structure. The template positioning of the benzamidines was derived from a docking of DX-9065a into the
1HCG (PDB designation) factor Xa structure. The carbonyl
group was placed at the ring position proximal to the lip of S4
in the case of the 3-amidinobenzoyl template (Table 1). Both
possible orientations of the carbonyl group planar to the ring
were treated as valid, and PRO_SELECT runs were carried
out on each. Only one orientation was deemed useful for the
4-amidinobenzoyl template.
Two lists of substituents were prepared using the ACD as
a source. Pharmacophores for the substituent selection were
calculated from the crystal structure assuming reasonable
placement of the template and are given in Figure 10. One
pharmacophore was targeted toward the polar functionality
at the lip of S4, with the primary aim of picking up interactions
with the N-H of Gly 218 (Figure 10A). The other was targeted
at the hydrophobic ‘aromatic box’ (Figure 10B). The amino
group common to both is the linking group to the template.
The initial aim was to only probe the central region of the
active site. Therefore a molecular weight limit of 250 was set.
Conformationally flexible searching of the ACD with these
pharmacophores using ISIS/Base software generated 2D structure lists which were then converted to 3D using Converter26
(Molecular Simulations Inc.).
Figure 11. Template and pharmacophore for library 1b: (A)
template, (B) pharmacophore for bis-amine substituents, (C)
pharmacophore for bis-nitrile list.
The final ‘polar’ list numbered 1534 substituents, the
‘nonpolar’ one numbered 797. Each list of substituents was
evaluated separately by PRO_SELECT. The polar substituents
were assigned hydrogen bonding acceptor sites on double
bonded oxygen, nitrogen, and sulfur (this was found to be as
effective at finding hits as assigning both acceptor and donor
sites). Hydrophobic sites were assigned to carbons in five- and
six-membered carbocyclic rings. Passes from the interaction
site matching stage were minimized using the Clean force field,
keeping the template rigid, and scored for binding affinity
using the empirical scoring function of Böhm.18 The best
scoring conformation per substituent was retained. The final
list of substituents was filtered to pick out only those which
had Böhm contributions of -3 kJ mol-1 or better.
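The last two steps of this protocol (keep the best scoring conformation per substituent, then keep only substituents at or below the score cutoff) can be sketched as follows. This is an illustrative reconstruction, not PRO_SELECT code: the `ScoredPose` record, the function names and the example scores are all hypothetical, and "better" is taken to mean a more negative score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPose:
    substituent: str   # hypothetical substituent identifier
    bohm_score: float  # estimated contribution in kJ mol-1 (more negative = better)


def best_pose_per_substituent(poses):
    """Retain only the lowest (best) scoring conformation of each substituent."""
    best = {}
    for pose in poses:
        current = best.get(pose.substituent)
        if current is None or pose.bohm_score < current.bohm_score:
            best[pose.substituent] = pose
    return list(best.values())


def filter_by_score(poses, cutoff=-3.0):
    """Keep substituents whose best pose scores at the cutoff or better."""
    return [p for p in best_pose_per_substituent(poses) if p.bohm_score <= cutoff]


# Made-up scores for three substituents, one of them with two conformations.
poses = [
    ScoredPose("sub_A", -2.1),
    ScoredPose("sub_A", -6.4),
    ScoredPose("sub_B", -1.0),
    ScoredPose("sub_C", -3.0),
]
hits = filter_by_score(poses)
print([(p.substituent, p.bohm_score) for p in hits])
```

The same pattern extends to the later libraries by adding a second condition on strain energy alongside the score cutoff.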
Protocol for the Further Design of Library 1a Members Based around the 3-Amidinobenzoylglycine Template. Molecular dynamics simulations were carried out on
ligands 1 and 4, among others, docked into the 1HCG factor
Xa structure. Snapshots were taken from the simulations at
1 ps intervals and each snapshot minimized. Analysis of these
simulations allowed the selection of two significantly different
template geometries, both of which gave reasonably good Böhm
scores. PRO_SELECT runs were carried out using each
geometry. The template structure is given in Figure 11A.
The substituent list was prepared by searching the ACD
using the pharmacophore in Figure 11B. The list contained
422 substituents. A second bis-nitrile list was prepared according to the pharmacophore in Figure 11C. PRO_SELECT
runs were carried out separately on this list which contained
656 substituents. Substituents for library 1a were selected out
of that set of substituents with Böhm contributions of -9 kJ
mol-1 or better, and strain energies of 27 kJ mol-1 or lower.25
The hits were clustered using the Jarvis-Patrick method
within XMOLBROWSE, and substituents were selected from
those that passed the criteria via manual inspection of each
ligand docking. It was found that the two different template
positions gave rise to quite different sets of high scoring
substituents.
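For readers unfamiliar with Jarvis-Patrick clustering, a minimal sketch of the general method is given below. Two items are joined when each appears in the other's list of k nearest neighbours and the two lists share at least k_min members. How XMOLBROWSE implemented it, which distance measure was used, and which k and k_min were chosen are not stated in the text, so the parameters and the one-dimensional toy data here are assumptions for illustration only.

```python
def jarvis_patrick(dist, k, k_min):
    """Cluster items from a symmetric distance matrix (list of lists).

    Items i and j are merged when they are in each other's k-nearest-neighbour
    lists and those lists have at least k_min members in common.
    Returns one cluster label per item.
    """
    n = len(dist)
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(order[:k]))

    parent = list(range(n))  # union-find forest used to merge clusters

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in neighbours[i]:
            mutual = i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])
            if j > i and mutual and shared >= k_min:
                parent[find(j)] = find(i)

    return [find(i) for i in range(n)]


# Toy data: two well separated groups of points on a line.
coords = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
dist = [[abs(a - b) for b in coords] for a in coords]
labels = jarvis_patrick(dist, k=3, k_min=2)
print(labels)
```

In a setting like the one described, the distance matrix would come from some measure of substituent dissimilarity, and one representative per cluster would then go forward to manual inspection.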
Protocol for the Design of Library 2. The origin of the
templates for this library was an “annealed” geometry of 5
manually docked into the 1HCG FXa X-ray structure. The
receptor geometry was held rigid while the ligand was
subjected to molecular dynamics at temperatures up to 300 K
with subsequent slow cooling followed by minimization. This
final geometry then provided an appropriate ligand template
geometry. The disconnection point for this template was taken
to be either of the prochiral hydrogens of the ligand glycine
methylene. Each of these hydrogen atoms was substituted
during separate PRO_SELECT runs as it was desired to look
at both L- and D-amino acids.
The list of potential “amino acid” substituents was obtained
by searching the ACD for all free α-amino acids. There were
1230 possible amino acid substituents after removal of high
molecular weight compounds (MW > 250), filtering of undesirable chemistries, and conversion into 3D. The substituent
disconnection point employed here was the amino acid side
chain to C-α bond.
Figure 12. Pharmacophores used in finding substituent lists
for library 3: (A) pharmacophore for hydrophobic primary
amines, (B) pharmacophore for hydrophobic cyclic secondary
amines, (C) pharmacophore for hydrophobic acyclic secondary
amines. Aromatic and fused bicyclic systems were allowed in
the lists generated by pharmacophores A, B, and C.
Several SELECT jobs were then carried out to focus on
lipophilic or polar interactions arising from the amino acid
substituent with the receptor interaction sites. A list of
substituents suitable for synthesis was generated after removal of duplicates, inappropriate substituents, e.g., side
chains from substituents available only in L form where the D
was preferred, and filtering of poor scoring substituents using
the criterion that the Böhm score be less than -3.0 kJ mol-1.
Protocol for the Design of Library 3. Compound 6 and
also an analogue of compound 6 which had the bis(aminomethyl)cyclohexane replaced by a 1-adamantylamine, docked
into the 1HCG factor Xa structure, were simulated at 300 K.
The receptor was kept rigid, and snapshots were taken at 5
ps intervals and minimized. Low energy snapshots were
selected to provide two different template positionings. Three
pharmacophores were used to search the ACD for appropriate
hydrophobic amines.
These pharmacophores, given in Figure 12, represent
respectively primary amines, secondary cyclic amines, and
secondary acyclic amines. The hydrophobic parts of the pharmacophores were designed so as to avoid linear hydrocarbons.
Hits that had several polar groups were excluded, as were
certain classes of reactive chemistry. A molecular weight limit
of 250 for free base was used. The primary amine list
numbered, after conversion to 3D and enumeration of enantiomers and diastereomers, 1053 molecules, and the cyclic
amine list numbered 250 compounds. The secondary acyclic
amine list, which was restricted to substituents containing a
six-membered carbocycle, numbered 366 compounds. The link
site was chosen to be the N-H trans to the carbonyl of the
phenyl glycine, in the case of the primary amine list, but the
C(=O)-N bond for both the secondary and the cyclic lists.
Hydrophobic interaction sites were placed at carbons adjacent
to a hydrophobic branch point. Passes from the interaction site
matching stage were treated as described above. Priority
substituent lists were selected on the basis of having favorable
Böhm contributions (generally -12 kJ mol-1 or better) and
low strain energies (generally less than -21 kJ mol-1). These
lists were clustered according to chemistry and processed by
manual inspection of the binding mode to generate synthetic
candidates.
Chemistry. Abbreviations used follow IUPAC-IUB nomenclature. Additional abbreviations are HPLC, high performance liquid chromatography; DMF, dimethylformamide;
DCM, dichloromethane; HATU, O-(7-azabenzotriazol-1-yl)-1,1,3,3-tetramethyluronium hexafluorophosphate; HOBt, 1-hydroxybenzotriazole; TBTU, 2-(1H-benzotriazol-1-yl)-1,1,3,3-tetramethyluronium tetrafluoroborate; DIPEA, diisopropylethylamine; TEA, triethylamine; HOAt, 1-hydroxy-7-azabenzotriazole; Fmoc, 1-(9H-fluoren-9-yl)methoxycarbonyl; TFA, trifluoroacetic acid; MALDI-TOF, matrix assisted laser desorption
ionization-time-of-flight mass spectrometry. Unless otherwise
indicated, amino acid derivatives, resins, and coupling reagents were obtained from Novabiochem (Nottingham, U.K.)
and other solvents and reagents from Rathburn (Walkerburn,
U.K.) or Aldrich (Gillingham, U.K.) and were used without
further purification.
Purification was by gradient reverse-phase HPLC on a
Waters Deltaprep 4000 at a flow rate of 50 mL/min using a
Deltapak C18 radial compression column (40 mm × 210 mm,
10-15 µm particle size) and solvent mixtures consisting of
eluant A (0.1% aq TFA) and eluant B (90% MeCN in 0.1% aq
TFA) with gradient elution.
Analytical HPLC was on a Shimadzu LC6 gradient system
equipped with an autosampler and a variable wavelength detector,
at flow rates of 0.4 mL/min. Eluents A and B were as for preparative
HPLC, and the following columns were used: Luna2 C18 2 × 150 mm 5
µm, Symmetry C8 4.6 × 30 mm 3.5 µm (Phenomenex). Purified
products were further analyzed by MALDI-TOF and/or LCMS
and 1H NMR.
Compound libraries were prepared using both solid-phase
and solution-phase parallel synthetic methods as described
below.
Simple Amine S4 Units (Scheme 1). Amino acid ester
hydrochlorides were either (i) coupled to 3-amidinobenzoic acid
TFA salt using DIPCI/HOBt in DMF containing 1 equiv of
DIPEA, purified by reverse-phase preparative HPLC and
isolated as the TFA salt, or (ii) coupled to 3-cyanobenzoic acid
using TBTU/HOBt in DMF containing 1 equiv of DIPEA and
the nitrile converted to the amidine by sequential treatment
with HCl gas in ethanol and ammonia gas in ethanol. The
products were purified by reverse-phase preparative HPLC
and isolated as the TFA salt. Compounds 3 and 4 were
prepared by these routes.
3-Amidinobenzoyl-D-tryptophan TFA Salt, 3. To a solution of 3-cyanobenzoic acid (500 mg, 3.4 mmol), D-tryptophan
methyl ester hydrochloride (866 mg, 3.40 mmol), and HOBt
(459 mg, 3.4 mmol) in DMF (10 mL) were added TBTU (1.09
g, 3.40 mmol) and DIPEA (592 µL, 3.4 mmol). The reaction
was stirred until complete by TLC and then partitioned
between ethyl acetate and water. The organic solution was
evaporated in vacuo to give 3-cyanobenzoyl-D-tryptophan
methyl ester (954 mg, 81%).
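The quoted 81% yield can be checked with a short calculation. The sketch below is illustrative only: the molecular formula C20H17N3O3 for 3-cyanobenzoyl-D-tryptophan methyl ester is an assumption derived here from the two coupling partners minus water (it is not stated in the text), and the function names are hypothetical.

```python
# Average atomic masses (g/mol), rounded standard values.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molar_mass(composition):
    """Molar mass in g/mol for a composition such as {"C": 20, "H": 17}."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def percent_yield(product_mg, product_mw, limiting_mmol):
    """Isolated yield as a percentage of the limiting reagent."""
    product_mmol = product_mg / product_mw
    return 100.0 * product_mmol / limiting_mmol


# Assumed formula of the coupled ester: C20H17N3O3 (not given in the text).
ester = {"C": 20, "H": 17, "N": 3, "O": 3}
mw = molar_mass(ester)

# 954 mg isolated from 3.40 mmol of each coupling partner.
yield_pct = percent_yield(954.0, mw, 3.40)
print(f"MW = {mw:.2f} g/mol, yield = {yield_pct:.1f}%")
```

Under these assumptions the result rounds to the 81% reported above.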
HCl gas was bubbled into a solution of 3-cyanobenzoyl-D-tryptophan methyl ester (925 mg, 2.66 mmol) in ethanol, and
the mixture was left overnight before evaporating to dryness
in vacuo. The solid was taken up in ethanol, and the solution
was saturated with ammonia gas. After being stirred overnight, the mixture was evaporated to dryness in vacuo and
the resulting solid purified by preparative HPLC to give a
mixture of 3-amidinobenzoyl-D-tryptophan methyl and ethyl
ester TFA salts.
To a solution of 3-amidinobenzoyl-D-tryptophan methyl and
ethyl ester TFA salts (50 mg) in ethanol (5 mL) was added 1
M aqueous sodium hydroxide (1 mL), and the mixture was
stirred overnight. The mixture was evaporated to dryness in
vacuo and purified by preparative HPLC to give 3-amidinobenzoyl-D-tryptophan TFA salt (11 mg). 1H NMR (CD3CN)
δ 8.14 (1H, s, Ar); 8.08 (1H, d, Ar); 7.82 (1H, d, Ar); 7.68 (2H,
m, Ar); 7.45 (1H, d, Ar); 7.05-7.30 (3H, m, Ar); 4.96 (1H, m,
α-proton); 3.3-3.5 (d-ABq, β-proton). Homogeneous by HPLC
Luna C18, Symmetry C8. High resolution MS (M+1)+ found
351.14555 (C19H18N4O3 requires 351.14568).
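The "requires" values quoted with the high resolution mass spectra can be reproduced from the molecular formula. The sketch below is a minimal illustration, not the authors' procedure: it assumes that the (M+1)+ value is the monoisotopic mass of the neutral formula plus one hydrogen atom, without subtracting the electron mass, a convention chosen here only because it reproduces the quoted value to within about 0.1 mDa. The function names are hypothetical.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def parse_formula(formula):
    """Turn a simple formula such as 'C19H18N4O3' into an element-count dict."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def mh_plus(formula):
    """Monoisotopic mass of M plus one H atom (electron mass not subtracted)."""
    counts = parse_formula(formula)
    counts["H"] = counts.get("H", 0) + 1
    return sum(MONOISOTOPIC[element] * n for element, n in counts.items())


def ppm_error(found, required):
    """Deviation of the measured mass from the required mass, in ppm."""
    return 1e6 * (found - required) / required


required = mh_plus("C19H18N4O3")        # compound 3
error = ppm_error(351.14555, 351.14568)  # found vs. quoted required value
print(f"calculated (M+1)+ = {required:.5f}, error = {error:.2f} ppm")
```

Only C, H, N, and O are tabulated here; other elements would have to be added to the mass table before the function is applied to formulas that contain them.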
3-Amidinobenzoyl-D-2-naphthylalanine Ethyl Ester TFA Salt,
4. 3-Amidinobenzoic acid TFA salt (100 mg) was added to a
mixture of HOBt (48.6 mg) and DIPCI (57 µL) in DMF that
had been stirring for 10 min. To this mixture was added a
solution of 2-naphthylalanine ethyl ester hydrochloride (100.5
mg) and triethylamine (50 µL). After being stirred overnight,
the crude reaction mixture was purified by preparative HPLC
to give 3-amidinobenzoyl-D-2-naphthylalanine ethyl ester TFA
salt. 1H NMR (CD3CN) δ 7.98 (1H, s, Ar); 7.88 (1H, d, Ar);
7.67 (5H, m, Ar); 7.50 (1H, t, Ar); 7.3-7.4 (3H, m, Ar); 4.80
(1H, dd, α-proton); 4.0 (2H, q, Et); 3.35-3.1 (d-ABq, β-proton);
1.1 (3H, t, Et). Homogeneous by HPLC Luna C18, Symmetry
C8. LCMS 390 (M+1)+ , high resolution MS (M+1)+ found
390.18077 (C23H23N3O3 requires 390.181740).
Bis-amine S4 Units by Solid-Phase Methodology
(Scheme 2). The S4 component (bis-1,4-aminomethylcyclohexane) was supported on 2-chlorotrityl resin (1.2 mmol/g) and
coupled with an Fmoc protected amino acid using TBTU/
DIPEA in DMF. The washed resin was treated with 20%
piperidine in DMF to remove the Fmoc protection and then
reacted with 3-amidinobenzoic acid TFA salt using DIPCI/
HOBt in DMF. The product was then cleaved off the washed
resin using 10% triethylsilane in TFA, and the crude product
obtained was purified by preparative reverse-phase HPLC and
isolated as the TFA salt. Compounds 5 and 6 were made by
this route.
3-Amidinobenzoyl-glycinyl-(4-aminomethylcyclohexyl)methylamine, 5. 1H NMR (CD3CN/D2O) mixture of cis/trans
isomers, major isomer only: δ 8.15 (s, 1H, Ar); 8.1 (d, 1H, Ar);
7.9 (d, 1H, Ar); 7.66 (t, 1H, Ar); 4.97 (s, 2H, “Gly CH2”); 3.0 (d,
2H, amide CH2); 2.70 (d, 2H, amine CH2); 1.70 (m, 4H,
cyclohexyl); 1.40 (m, 3H, cyclohexyl); 0.90 (m, 3H, cyclohexyl).
Homogeneous by HPLC Luna C18, Symmetry C8. LCMS
346 (M+1)+, high resolution MS (M+1)+ found 346.22378
(C18H27N5O2 requires 346.22426).
3-Amidinobenzoyl-D-phenylglycinyl-(4-aminomethylcyclohexyl)methylamine, 6. 1H NMR (D2O) mixture of
cyclohexyl cis and trans isomers: δ 8.09 (1H, s); 8.05 (1H, d, J
= 7.5 Hz); 7.90 (1H, d, J = 7.5 Hz); 7.66 (1H, t, J = 7.5 Hz);
7.43 (5H, m); 5.47 (1H, s); 3.05 (2H, m); 2.78 (2H, m); 1.48
(7H, m); 0.86 (3H, m). Homogeneous by HPLC Luna C18,
Symmetry C8. LCMS 422 (M+1)+, high resolution MS (M+1)+
found 422.25548 (C24H31N5O2 requires 422.25556).
Solution Phase II (Scheme 3). Boc-D-Phenylglycine was
coupled to an S4 component in DMF using HATU or TBTU/
DIPEA or, alternatively, EDCI or DIPCI with HOBt or HOAt
as an additive. When the S4 component was an alcohol,
catalytic DMAP was added. The products thus obtained were
deprotected using TFA in DCM and then coupled to 3-amidinobenzoic acid TFA salt using DIPCI/HOBt in DMF. The
products were purified by reverse-phase preparative HPLC
and isolated as the TFA salts. Compounds 7 and 8 were made
by this route.
1-(3-Amidinobenzoyl-D-phenylglycinyl)-4-benzoylpiperidine, 7. To a solution of Boc-D-phenylglycine (251 mg, 1
mmol) in a mixture of DMF (1 mL) and DCM were added
4-benzoylpiperidine (339 mg, 1.5 mmol), DIPEA (348 µL, 2
mmol), and TBTU (353 mg, 1.1 mmol). After being stirred at
room temperature overnight, the mixture was partitioned
between ethyl acetate (6 mL) and 10% hydrochloric acid (2
mL). The organic layer was washed with 10% hydrochloric acid
(2 mL), saturated aqueous sodium bicarbonate, and then brine.
Evaporation of solvent gave the crude product which was taken
up in dichloromethane (2 mL) and treated with trifluoroacetic
acid (2 mL) until removal of the Boc group was complete.
Solvent was evaporated in vacuo, and the residue was taken
up in ethyl acetate and washed with saturated aqueous sodium
bicarbonate and then brine before evaporating to dryness. The
residue was dissolved in DMF (5 mL), and to this was added
a mixture of HOAt (150 mg, 1.1 mmol), 3-amidinobenzoic acid
TFA salt (300 mg, 1.08 mmol), and DIPCI (180 µL, 1.15 mmol),
and the mixture was stirred overnight. Any solids were
removed by filtration, and solvent was removed in vacuo. The
residue was taken up in ethyl acetate, washed with saturated
aqueous sodium bicarbonate, dried (MgSO4), and evaporated
in vacuo. The residue was converted to the TFA salt by
addition and evaporation of 25% TFA in acetonitrile and then
dissolved in a minimum of aqueous acetonitrile for purification
by preparative RPHPLC to give 1-(3-amidinobenzoyl-D-phenylglycinyl)-4-benzoylpiperidine TFA salt (159 mg, 27% over
three steps). 1H NMR (DMSO-d6) δ 8.40 (2H, m); 8.10 (1H, d);
7.70 (1H, t); 7.50 (10H, m); 5.55 (1H, s); 3.60 (1H, m); 2.5 (2H,
m); 1.00 (6H, m). Homogeneous by HPLC Luna C18, Symmetry
C8. LCMS 469 (M+1)+, high resolution MS (M+1)+ =
469.22282 (C28H28N4O3 requires 469.223935).
1-(3-Amidinobenzoyl-D-phenylglycinyl)-4-chlorophenylpiperazine TFA Salt, 8. 1-(3-Amidinobenzoyl-D-phenylglycinyl)-4-chlorophenylpiperazine TFA salt was prepared from
4-chlorophenylpiperazine in a manner similar to that described
above. 1H NMR (CD3CN) δ 8.05 (1H, s); 8.00 (1H, d); 7.87 (1H,
d); 7.55 (1H, t); 7.31 (5H, m); 7.08 (2H, d); 6.75 (2H, d); 5.95 (1H,
s); 3.70 (1H, m); 3.55 (2H, m); 3.45 (1H, m); 3.12 (1H, m); 3.00
(1H, m); 2.85 (1H, m); 2.35 (1H, m). Homogeneous by HPLC
Luna C18, Symmetry C8. LCMS 476 (M+1)+, high resolution
MS (M+1)+ = 476.18340 (C26H26ClN5O2 requires 476.18529).
Amidino compounds such as 9 were prepared initially using
the solution-phase method II described above to give Boc-N-protected S4 intermediates. The Boc protection was removed
using TFA/DCM, and the resulting amine was reacted with 1,3-bis-tert-butoxycarbonyl methyl pseudothiourea in the presence
of mercury(II) chloride to give the bis-butoxycarbonyl guanidines.
Final treatment with TFA/DCM and purification by preparative HPLC gave the final products, isolated as TFA salts
(Scheme 4).
3-Amidinobenzoyl-D-phenylglycine 1-Amidinopiperidin-4-ylethyl Ester, 9. 3-Amidinobenzoyl-D-phenylglycine
1-Boc-piperidin-4-ylethyl ester was prepared using the general
solution-phase method described above for compounds 7 and
8 and then treated with 25% TFA in DCM to remove the Boc
protection. Treatment with 1,3-bis-tert-butyloxycarbonylmethylthiopseudourea (1 equiv), TEA (3 equiv), and mercury(II) chloride
(1 equiv) in DMF overnight followed by extraction into ethyl
acetate, washing with 2 N aqueous sodium hydroxide and
water, and drying (MgSO4) gave 3-amidinobenzoylphenylglycine 1-(1,3-bis-tert-butyloxycarbonyl amidine)-4-piperidin-4-ylethanol ester.
The above compound was treated with 25% TFA in DCM
until the Boc protection was removed and then evaporated in
vacuo. The residue was purified by preparative RPHPLC to
give 3-amidinobenzoyl-D-phenylglycine 1-amidinopiperidin-4-ylethyl ester. 1H NMR (D2O) δ 8.17 (1H, m); 8.07 (1H, d); 7.93
(1H, d); 7.70 (1H, t); 7.45 (5H, m); 5.60 (1H, s); 4.25 (2H, m);
3.55 (2H, m); 2.75 (2H, m); 1.60 (4H, m); 1.25 (1H, m); 1.00
(2H, m). Homogeneous by HPLC Luna C18, Symmetry C8.
LCMS 451 (M+1)+, high resolution MS (M+1)+ = 451.244439
(C24H30N6O3 requires 451.245725).
Biology. Inhibition of factor Xa was assessed at room
temperature in 0.1 M phosphate buffer, pH 7.4, according to
the method of Tapparelli et al.29 Purified human factor Xa was
purchased from Alexis Corporation, Nottingham, U.K. Chromogenic substrate pefachrome-FXA was purchased from Pentapharm AG, Basel, Switzerland. Product (4-nitroaniline) was
quantified by absorption at 405 nm in 96-well plates using a
Dynatech MR5000 reader (Dynex Ltd, Billingshurst, U.K.). Km
and Ki were calculated using SAS PROC NLIN (SAS Institute,
Cary, NC, Release 6.11). Km values were determined as 100.9
µM for factor Xa/pefachrome-FXA. Inhibitor stock solutions
were prepared at 40 mM in dimethylsulfoxide and tested at
500 µM, 50 µM, and 5 µM. Accuracy of Ki measurements was
confirmed by comparison with Ki values of known inhibitors
of factor Xa.
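The relationship between measured rates and Ki can be illustrated with a closed-form, single-point calculation. This is only a sketch and not the nonlinear regression (SAS PROC NLIN) used in the text: it assumes purely competitive inhibition, and the substrate concentration S_UM and the inhibitor constant TRUE_KI are hypothetical values chosen for the demonstration. Only the Km of 100.9 µM and the three test concentrations come from the text.

```python
KM_UM = 100.9  # Km for factor Xa/pefachrome-FXA reported in the text (µM)


def rate(s, i, ki, km=KM_UM, vmax=1.0):
    """Competitive-inhibition rate law: v = Vmax*S / (Km*(1 + I/Ki) + S)."""
    return vmax * s / (km * (1.0 + i / ki) + s)


def ki_from_rates(v0, vi, s, i, km=KM_UM):
    """Ki from the uninhibited rate v0 and the inhibited rate vi at one [S], [I].

    Rearranged from v0/vi = 1 + Km*I / (Ki*(Km + S)).
    """
    return km * i / ((km + s) * (v0 / vi - 1.0))


S_UM = 200.0    # hypothetical substrate concentration (µM), not from the text
TRUE_KI = 1.5   # hypothetical inhibitor constant (µM), used to simulate rates

recovered = []
for inhibitor_um in (500.0, 50.0, 5.0):  # test concentrations from the text
    v0 = rate(S_UM, 0.0, TRUE_KI)
    vi = rate(S_UM, inhibitor_um, TRUE_KI)
    recovered.append(ki_from_rates(v0, vi, S_UM, inhibitor_um))

print([round(k, 6) for k in recovered])
```

With noise-free simulated rates every concentration returns the same Ki. With real plate-reader data the three estimates would differ, which is one reason a regression over all points is preferable.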
Crystallization, Data Collection, and Refinement.
Bovine trypsin (Sigma, Type III) was further purified by ion-exchange chromatography (Mono S, 0.1 M sodium phosphate
pH 6.0, eluted with a 0-1 M sodium chloride gradient). A
complex of the purified trypsin with compound 7 was prepared
by incubating a 3-fold molar excess of the inhibitor with the
enzyme, which was then concentrated to 15 mg/mL in 0.05 M
Tris pH 8, 3 mM calcium chloride, 18% acetonitrile, and 5%
DMF. Crystals were grown by vapor diffusion against a well
containing 2.1 M ammonium sulfate and 0.05 M Tris pH 8.15.
Nucleation of crystal growth required streak seeding using
low-density crystals grown in the presence of benzamidine
according to the procedure of Bartunik (1989).30 Crystals of the
bovine trypsin-compound 7 complex belonged to space group
P212121, with a = 60.08 Å, b = 63.83 Å, and c = 70.04 Å, and
diffracted to beyond 2.0 Å. A complete native data set,
comprising 17 272 unique reflections in the range 30-2.0 Å
and with an average redundancy of 4.0, was collected at 100
K using station PX7.2 of the Daresbury SRS synchrotron
(wavelength 1.488 Å). These data were processed using
DENZO and SCALEPACK,31 and the structure solved by
molecular replacement using AMORE32 with the coordinates
from PDB entry 3PTN as search model. The structure was
refined using iterative cycles of simulated annealing refinement with X-PLOR33 and manual rebuilding using O.34 The
final model has good geometry and an Rcryst of 17.8% and Rfree
of 24.0% (calculated with data in the range 15-2.0 Å). The
coordinates have been deposited in the PDB (reference code
1eb2).
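The reported cell dimensions allow a quick estimate of the crystal packing density. The sketch below computes the orthorhombic cell volume and the Matthews coefficient. It rests on two assumptions that are not stated in the text: one trypsin-inhibitor complex per asymmetric unit, and a molecular mass of about 23,300 Da for bovine trypsin. The solvent fraction uses the standard approximation 1 - 1.23/Vm.

```python
# Unit cell of the trypsin-compound 7 crystals as reported in the text (Å).
a, b, c = 60.08, 63.83, 70.04

# P212121 has four symmetry-equivalent positions per unit cell.
ASYMMETRIC_UNITS = 4

# Assumption: one complex per asymmetric unit, protein mass about 23.3 kDa.
MW_DA = 23300.0

# For an orthorhombic cell all angles are 90 degrees, so V = a*b*c.
volume = a * b * c

# Matthews coefficient: crystal volume per dalton of protein (Å^3/Da).
vm = volume / (ASYMMETRIC_UNITS * MW_DA)

# Standard approximation for the solvent fraction of a protein crystal.
solvent_fraction = 1.0 - 1.23 / vm

print(f"V = {volume:.0f} Å^3, Vm = {vm:.2f} Å^3/Da, "
      f"solvent = {100 * solvent_fraction:.0f}%")
```

A different assumed molecular mass or a different number of molecules per asymmetric unit would change Vm proportionally, so the figures are indicative only.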
Acknowledgment. The authors thank Allen Miller
for encouragement and for helpful suggestions during
the preparation of this manuscript.
References
(1) (a) Wiley, M. R.; Fisher, M. J. Small-molecule direct thrombin
inhibitors. Expert Opin. Ther. Pat. 1997, 7 (11), 1265-1282. (b)
Menear, K. Progress towards the discovery of orally active
thrombin inhibitors. Curr. Med. Chem. 1998, 5, 457-468. (c)
Rewinkel, J. B. M.; Adang, A. E. P. Strategies and progress
towards the ideal orally active thrombin inhibitor. Curr. Pharm.
Des. 1999, 5, 1043-1075. (d) Zhu, B.-Y.; Scarborough, R. M.
Recent advances in inhibitors of factor Xa in the prothrombinase
complex. Curr. Opin. Cardiovasc., Pulm. Renal Invest. Drugs
1999, 1 (1), 63-88. (e) Walenga, J. M.; Jeske, W. P.; Hoppensteadt, D.; Kaiser, B. Factor Xa inhibitors: Today and beyond.
Curr. Opin. Cardiovasc., Pulm. Renal Invest. Drugs 1999, 1 (1),
13-27. (f) Al-Obeidi, F.; Ostrem, J. A. Factor Xa inhibitors.
Expert Opin. Ther. Pat. 1999, 9 (7), 931-953.
(2) Chi, L.; Rogers, K. L.; Uprichard, A. C. G.; Gallagher, K. P. The
therapeutic potential of novel anticoagulants. Expert Opin.
Invest. Drugs 1997, 6 (11), 1591-1622.
(3) Morishima, Y.; Tanabe, K.; Terada, Y.; Hara, T.; Kunitada, S.
Antithrombotic and Hemorrhagic Effects of DX-9065a, a Direct
and Selective Factor Xa: Comparison of a Direct Thrombin
Inhibitor and Antithrombin III-Dependent Anticoagulants.
Thromb. Haemostasis 1997, 78, 1366-1371.
(4) Gold, H. K.; Torres, F. W.; Garabedian, H. D.; Werner, W.; Jang,
I.; Khan, A.; Hagstrom, J. N.; Yasuda, T.; Leinbach, R. C.;
Newell, J. B.; Bovill, E. G.; Stump, D. C.; Collen, D. Evidence
for a Rebound Coagulation Phenomenon after Cessation of a
4-hour Infusion of a Specific Thrombin Inhibitor in Patients with
Unstable Angina Pectoris. J. Am. Coll. Cardiol. 1993, 21, 1039-1047.
(5) Galemmo, R. A.; Maduskuie, T. P.; Dominguez, C.; Rossi, K. A.;
Knabb, R. M.; Wexler, R. R.; Stouten, P. F. W. The de novo design
and synthesis of cyclic urea inhibitors of Factor Xa: Initial SAR
studies. Bioorg. Med. Chem. Lett. 1998, 8, 2705-2710.
(6) Klein, S. I.; Czekaj, M.; Gardner, C. J.; Guertin, K. R.; Cheney,
D. L.; Spada, A. P.; Bolton, S. A.; Brown, K.; Colussi, D.; Heran,
C. L.; Morgan, S. R.; Leadley, R. J.; Dunwiddie, C. T.; Perrone,
M. H.; Chu, V. Identification and Initial Structure-Activity
Relationships of a Novel Class of Nonpeptide Inhibitors of Blood
Coagulation. J. Med. Chem. 1998, 41, 437-450.
(7) Dominguez, C.; Duffy, D. E.; Han, Q.; Alexander, R. S.; Galemmo,
R. A.; Park, J. M.; Wong, P. C.; Amparo, E. C.; Knabb, R. M.;
Luettgen, J.; Wexler, R. R. Design and Synthesis of Potent and
Selective 5,6-fused Heterocyclic Thrombin Inhibitors. Bioorg.
Med. Chem. Lett. 1999, 9, 925-930.
(8) Padmanabhan, K. P.; Tulinsky, A.; Park, C. H.; Bode, W.; Huber,
R.; Blankenship, D. T.; Cardin, A. D.; Kiesel, W. Structure of
Human Des(1-45) Factor Xa at 2.2 Å Resolution. J. Mol. Biol.
1993, 232, 947-966.
(9) Hara, T.; Yokoyama, A.; Ishihara, H.; Yokoyama, Y.; Nagahara,
T.; Iwamoto, M. DX-9065a, a New Synthetic, Potent Anticoagulant and Selective Inhibitor for Factor Xa. Thromb. Haemostasis
1994, 71 (3), 314-319.
(10) Hirayama, F.; Koshio, H.; Taniuchi, Y.; Sato, K.; Hisamichi, N.;
Sakai, Y.; Katayama, N.; Kawasaki, T.; Matsumoto, Y.; Yanagisawa, I. Abstracts of Papers, 214th National Meeting of the
American Chemical Society, Las Vegas, NV, 1997; American
Chemical Society: Washington, DC, 1997; MEDI049.
(11) Yamazaki, M.; Asakura, H.; Aoshima, K.; Saito, M.; Jokaji, H.;
Uotani, C.; Kumbashiri, I.; Morishita, E.; Ikeda, T.; Matsuda,
T. Protective Effects of DX-9065a, an Orally Active Novel
Synthesized and Selective Inhibitor of Factor Xa, Against
Thromboplastin-Induced Experimental Disseminated Intravascular Coagulation in Rats. Sem. Thromb. Hemostasis 1996, 22
(3), 255-259.
(12) Sato, K.; Taniuchi, Y.; Hirayama, T.; Koshio, H.; Matsumoto,
Y.; Iizumi, Y. Comparison of the Anticoagulant and Antithrombotic Effects of YM-75466, a novel orally-Active Factor Xa
Inhibitor and warfarin in Mice. Jpn. J. Pharmacol. 1998, 78,
191-197.
(13) Brandstetter, H.; Kühne, A.; Bode, W.; Huber, R.; von der Saal,
W.; Wirtensohn, K.; Engh, R. A. X-ray Structure of Active Site-inhibited Clotting Factor Xa. J. Biol. Chem. 1996, 271 (47),
29988-29992.
(14) Murray, C. W.; Clark, D. E.; Auton, T. R.; Firth, M. A.; Li, J.;
Sykes, R. A.; Waszkowycz, B.; Westhead, D. R.; Young, S. C.
PRO_SELECT: Combining structure-based drug design and
combinatorial chemistry for rapid lead discovery. 1. Technology.
J. Comput.-Aided Mol. Des. 1997, 11, 193.
(15) Klebe, G. J. The Use of Composite Crystal-field Environments
in Molecular recognition and the de Novo Design of Protein
Ligands. J. Mol. Biol. 1994, 237, 212.
(16) Hahn, M. Receptor Surface Models. 1. Definition and Construction. J. Med. Chem. 1995, 38, 2080.
(17) Baxter, C. A.; Murray, C. W.; Clark, D. E.; Westhead, D. R.;
Eldridge, M. D. Flexible Docking using Tabu Search and an
Empirical Estimate of Binding Affinity. Proteins: Struct., Funct.,
Genet. 1998, 33, 367.
(18) Böhm, H.-J. The development of a simple empirical scoring
function to estimate the binding constant for a protein-ligand
complex of known three-dimensional structure. J. Comput.-Aided Mol. Des. 1994, 8, 243.
(19) Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee,
R. P. Empirical scoring functions: I. The development of a fast
empirical scoring function to estimate the binding affinity of
ligands in receptor complexes. J. Comput.-Aided Mol. Des. 1997,
11, 425-445.
(20) Murray, C. W.; Auton, T. R.; Eldridge, M. D. Empirical Scoring
Functions. II. The testing of an empirical scoring function for
the prediction of ligand-receptor binding affinities and the use
of Bayesian Regression to improve the quality of the model. J.
Comput.-Aided Mol. Des. 1998, 12, 503.
(21) Li, J.; Murray, C. W.; Waszkowycz, B.; Young, S. C. Targeted
Molecular Diversity in Drug Discovery - Integration of Structure-Based design and Combinatorial Chemistry. Drug Discovery
Today 1998, 3 (3), 105-112.
(22) Kick, E. K.; Roe, D. C.; Skillman, A. G.; Liu, G.; Ewing, T. J. A.;
Sun, Y.; Kuntz, I. D.; Ellman, J. A. Structure-based design and
combinatorial chemistry yield low nanomolar inhibitors of
cathepsin D. Chem. Biol. 1997, 4, 297-307.
(23) Böhm, H.-J.; Banner, D. W.; Weber, L. Combinatorial docking
and combinatorial chemistry: Design of potent non-peptide
thrombin inhibitors. J. Comput.-Aided Mol. Des. 1999, 13, 51-56.
(24) Wei, A.; Alexander, R. S.; Duke, J.; Ross, H.; Rosenfeld, S. A.;
Chang, C.-H. Unexpected Binding Mode of Tick Anticoagulant
Peptide Complexed to Bovine Factor Xa. J. Mol. Biol. 1998, 283,
147-154.
(25) The strain energies quoted here are generally much higher than
the associated calculated binding energy. This is because they
are calculated by different methods, the strain energy arising
out of estimates, derived using the ‘Clean’ force field. Therefore
they cannot be compared and are used independently from one
another in the process of ranking the substituents.
(26) Copyright 1995, BIOSYM/Molecular Simulations, San Diego.
(27) Copyright 1990-1994, MDL Information Systems, Inc. San
Leandro, CA. All Rights Reserved.
(28) MDL Information Systems, Inc. San Leandro, CA. All Rights
Reserved.
(29) Tapparelli, C.; Metternich, R.; Ehrardt, C.; Zurini, M.; Claeson,
G.; Scully, M. F.; Stone, S. R. In Vitro and In Vivo Characterization of a Neutral Boron-containing Thrombin Inhibitor. J. Biol.
Chem. 1993, 268, 4734-4741.
(30) Bartunik, H. D.; Summers, L. J.; Bartsch, H. H. Crystal structure
of bovine β-trypsin at 1.5 Å resolution in a crystal form with
low molecular packing density. J. Mol. Biol. 1989, 210, 813-828.
(31) Otwinowski, Z.; Minor, W. Processing of X-ray diffraction data
collected in oscillation mode. Methods Enzymol. 1996, 276, 307-326.
(32) Collaborative Computational Project, Number 4. The CCP4
Suite: Programs for Protein Crystallography. Acta Crystallogr.
1994, D50, 760-763.
(33) Brunger, A. T. 1992 X-PLOR Manual Version 3.1.
(34) Jones, T. A.; Zou, J.-Y.; Cowan, S. W.; Kjeldgaard, M. Improved
methods for building protein structures in electron-density maps
and the location of errors in these models. Acta Crystallogr. 1991,
A47, 110-119.
JM010944E
ceptor–ligand steric complementarity. As this median value
was found to be correlated with ligand size, the value is normalized by ligand surface area with respect to the set of receptor–ligand complexes used for the calibration of the ChemScore scoring function. The normalized value, StericPenalty,
has a value of zero for ligands as tightly bound as the average
of the reference set, has a negative value for ligands more
tightly bound (e.g., clashing), and a positive value for ligands
less tightly bound.
31. J. D. Oburn, N. J. Koszewski, and A. C. Notides, “Hormone- and DNA-Binding Mechanisms of the Recombinant Human
Estrogen Receptor,” Biochemistry 32, 6229–6236 (1993).
Accepted for publication January 22, 2001.
Bohdan Waszkowycz Protherics Molecular Design Ltd., Beechfield House, Lyme Green Business Park, Macclesfield, Cheshire SK11
0JL, United Kingdom (electronic mail: Bohdan.Waszkowycz@protherics.com). Dr. Waszkowycz received a B.Sc. degree in pharmacy and a Ph.D. degree in computational chemistry at the University of Manchester, then joined Proteus Molecular Design Ltd.
(now Protherics) in 1990. He served as a molecular modeler on
a number of structure-based drug design projects, most recently
on the design of tryptase inhibitors, before leading the computational team in the development of the in-house software suite,
Prometheus. The recent focus of the group has been the validation of software for high-throughput virtual screening, and Dr.
Waszkowycz is currently responsible for establishing collaborative projects on virtual screening with a number of pharmaceutical and biotechnology companies.
Tim D. J. Perkins Protherics Molecular Design Ltd., Beechfield
House, Lyme Green Business Park, Macclesfield, Cheshire SK11
0JL, United Kingdom. Dr. Perkins graduated from Cambridge with
a B.A. degree in natural sciences and received a Ph.D. degree
in medicinal chemistry from the University of London. He joined
the Drug Design Group in the Pharmacology Department at Cambridge with Dr. Philip Dean, where he researched computational
methods for conformational analysis and molecular superposition. This group formed the basis of the TeknoMed drug design
collaboration with Rhône-Poulenc Rorer. Since joining Protherics
in 1998, Dr. Perkins has worked on software development within
Prometheus, particularly in implementing tools for facilitating
high-throughput virtual screening, including novel methods for
analysis of receptor–ligand complementarity.
Richard A. Sykes Protherics Molecular Design Ltd., Beechfield
House, Lyme Green Business Park, Macclesfield, Cheshire SK11
0JL, United Kingdom. Mr. Sykes received a B.Sc. degree in logic
with mathematics from the University of Sussex and worked as
a programmer for a number of years before returning to academia
to research the theory and application of functional programming
languages at the University of London. Since joining Protherics
in 1991, he has been the senior programmer involved in the development of Prometheus. He has a particular interest in the development of the scripting language Global and the design and
implementation of graphical user interfaces to support the requirements of the group’s structure-based design and virtual
screening efforts.
Jin Li Protherics Molecular Design Ltd., Beechfield House, Lyme
Green Business Park, Macclesfield, Cheshire SK11 0JL, United
Kingdom. Dr. Li is Head of Computational Chemistry at Protherics. He received a B.Sc. degree from Sichuan University,
China, and a Ph.D. degree in macromolecular sciences from Aston University, followed by postdoctoral research in theoretical
biochemistry at the University of Manchester with Dr. Barry Robson. He joined Protherics in 1990 to undertake research into computer simulation of protein folding and protein structure prediction. Since 1994 he has led the computational chemistry team in
developing methods and software for molecular design, particularly in the areas of de novo design, molecular docking, and combinatorial library design. More recently Dr. Li was responsible
for initiating and directing the DockCrunch project, involving a
million compound virtual screen versus the estrogen receptor.
376
WASZKOWYCZ ET AL.
IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001
