CS470: Introduction to Database Management Systems Functional Dependencies and Normal Forms
Relational Database Design (Chapters 10 and 11)
V Kumar
School of Computing and Engineering
University of Missouri-Kansas City

Relational Database Design

Logical database schema design is concerned with organizing data into a logical form acceptable to the underlying database system. One such logical structure is the relational structure, which we use to develop the logical schema (conceptual schema). Designing it is a complicated process. Some of the important points that make it complicated are:

1. The designer is constrained by the limited data structure types supported by the database system.
2. The database designer may have to consider the access path of the records.
3. The database designer may have to consider how to make database access and modification efficient.
4. The designer has to identify and select the set of most relevant attributes for an entity.
5. The designer has to decide the size of each relation and connect two or more relations for navigation.

A good relational database is a set of good relational schemas. A good relational schema contains a set of relevant attributes of the entity it represents, where every attribute is clearly related (directly or indirectly) to the other attributes of the relation. A good relation should require minimum storage space and have minimum data redundancy. A relational schema has a name and a set of related attributes. Let us consider the following relational schema, Dept, of an entity "Department".

Example: a bad and a good relational schema

Dept (Dnumber, Dlocation, Dname, Interest rate)
Dept (Dnumber, Dlocation, Dname, Room size)

The first schema is bad because the attribute "Interest rate" has nothing to do with the entity Department; as a result it is not related to the other attributes (Dnumber, Dlocation, and Dname). The second schema is a good schema.
Sport
Sname      Inst_id  Inst_name  Expertise  Fee
Football   1        Tom        Football   200
Tennis     1        Tom        Tennis     200
Baseball   2        Peter      Baseball   300
Golf       2        Peter      Golf       300

S1
Inst_name  Inst_id  Expertise 1  Expertise 2  Fee
Tom        1        Football     Tennis       200
Peter      2        Baseball     Golf         300

S2
Sname      Inst_id
Football   1
Tennis     1
Baseball   2
Golf       2

Where storage space is concerned, a good schema conserves space. To achieve this, information redundancy (repetition of the same data) is minimized. This is done by breaking (splitting) the relation into two or more smaller relations. The split reduces information redundancy, which improves database consistency (correctness).

Example: The schema Sport (given above) is a good schema but it can be further improved. The storage requirement of Sport can be reduced by splitting it into two relations S1 and S2. Suppose it takes one word to store the value of an attribute. The Sport relation then requires 20 words (4 tuples × 5 attributes; this does not include the space the structure itself takes). When Sport is split into S1 and S2, S1 needs 2 × 5 = 10 words and S2 needs 4 × 2 = 8 words, so together they need 18 words to store the same information. The duplications (Tom Tom and Peter Peter) are removed.

Let us identify a few relational schema design problems. Consider the following Student relation, which records student identity, the activities students take, and the fees they pay for these activities.

Student
Stu_id  Activity  Fee
100     Skiing    200
100     Golf      65
150     Swimming  50
175     Squash    50
175     Swimming  50
200     Swimming  50
200     Golf      65

Suppose Student 100 gave up skiing. The first record must be deleted from the database. If it is not, the database will hold incorrect information (not consistent with the real world). This deletion has a bad side effect that makes the database incomplete.

Effect: Deleting this record also removes the information about the Skiing fee. Thus, more information is deleted than intended.
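The deletion effect just described can be reproduced with a few lines of Python. This is only a minimal sketch: the relation is held as a list of tuples, and the helper names `fee_for` and `drop_enrollment` are illustrative assumptions, not part of the notes.

```python
# Student relation from the notes: (Stu_id, Activity, Fee)
student = [
    (100, "Skiing", 200),
    (100, "Golf", 65),
    (150, "Swimming", 50),
    (175, "Squash", 50),
    (175, "Swimming", 50),
    (200, "Swimming", 50),
    (200, "Golf", 65),
]

def fee_for(relation, activity):
    """Return the fee recorded for an activity, or None if no tuple mentions it."""
    for _stu_id, act, fee in relation:
        if act == activity:
            return fee
    return None

def drop_enrollment(relation, stu_id, activity):
    """Return a copy of the relation without the given (student, activity) tuple."""
    return [t for t in relation if not (t[0] == stu_id and t[1] == activity)]

fee_before = fee_for(student, "Skiing")            # 200
after_delete = drop_enrollment(student, 100, "Skiing")
fee_after = fee_for(after_delete, "Skiing")        # None: the fee fact is gone too

print(fee_before, fee_after)
```

Only one enrollment was removed, yet the fee for Skiing can no longer be looked up, because that fact was stored nowhere else.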
This is not a good relation because the query "What is the fee for Skiing?" can no longer be answered.

Now suppose you want to add the activity Baseball to the database and you are going to charge 200 for it.

Effect: You cannot add this information since there is no student enrolled in Baseball. You cannot leave the Stu_id field for Baseball empty, because you are not allowed to insert a null value in the Stu_id attribute. In other words, we cannot add a fact about one attribute until we have an additional fact about another related attribute: to add this record you must first have a student willing to learn baseball. This is not a good relation.

These problems are referred to as Modification Anomalies. There are many of them and we will discuss them at length. What we need is a relation where a delete operation deletes only the relevant tuples and an insert can insert any tuple at any time; in other words, we want relations that have minimum or no modification anomalies.

How can modification anomalies be minimized or eliminated? To answer this question we first need to understand dependency theory, so that we can design good relational schemas. Dependency theory describes how one or more attributes of R depend on one or more attributes of R, that is, how Ai depends on Aj (where i = j or i ≠ j).

Dependency Theory

To maintain consistency a relation must satisfy a set of integrity constraints. A consistent relation reflects the facts. For example, if an instructor teaches a database course then the database must reflect this information, i.e., the relation that stores this information must satisfy the constraints related to the instructor and course attributes. With respect to database processing, it means that assigning new values to a set of attributes must satisfy this set of constraints to maintain the correctness of the database. We need to formalize these concepts. We need the following notation:

R = a relational schema name.
A1, A2, …, An = the attributes included in R.
R(A1, A2, …, An) = the structure of the schema with name R.
r = a relation that is an instance of a relational schema.
r(R) = an instance of R(A1, A2, …, An).

Let γ be an integrity constraint on a schema R. γ is a function that associates with each relation r(R) a Boolean value γ(r). A relation r satisfies γ if γ(r) = true (i.e., γ holds in r) and violates γ if γ(r) = false.

Every database must satisfy a set of defined constraints to preserve correctness. For this reason every database schema is associated with a set of constraints, which are usually expressed by means of closed sentences of some first-order predicate calculus. Integrity constraints can be classified into two main categories:

1. Intra-relational constraints. Each such constraint involves only one relation scheme.
2. Inter-relational constraints. Each such constraint spans more than one relation scheme.

We will deal mainly with intra-relational constraints. One of the most common intra-relational constraints is called key dependency. Given a relation scheme R(U), where U = {A1, A2, …, An}, a key dependency is expressed as key(K), where K ⊆ U, and is satisfied by a relation r if and only if no two distinct tuples of r agree on K, i.e., t1(K) ≠ t2(K) for every pair of distinct tuples t1, t2 ∈ r.

When intra-relational constraints encompass non-key attributes, they are called Functional Dependencies (FD), and key dependency becomes a special case of functional dependency. These functional dependencies are the basis of relational database design. Inter-relational constraints involve more than one relation, where an attribute of one relation establishes a relationship with an attribute of another relation in forming functional dependencies.

Functional Dependencies (FD)

We will study the reasons for the modification anomalies discussed earlier and study possible solutions. These problems arise because the attributes of a relation are logically and semantically related.
A good relation requires that its attributes be related, but on the other hand such relationships cause modification anomalies. Our aim is to minimize these anomalies because they cannot be eliminated completely. To do so we begin our discussion with Functional Dependency (FD).

Informally, a functional dependency defines the relationship among attributes of a relation, i.e., it defines the effect that a change in the value of one set of attributes has on another set of attributes. For example, if we change the value of an SSN then the value of the corresponding Name attribute must also change to preserve consistency. We can represent this functional relationship (functional since an operation on one set of attributes initiates corresponding changes on other related attributes of the relation) in terms of a set of constraints. Consider the following relation SCHEDULE.

Schedule
Pilot    Flight  Date    Departs
Cushing  83      9 Aug   10:15a
Cushing  116     10 Aug  1:25p
Clark    281     8 Aug   5:50a
Clark    301     12 Aug  6:35p
Clark    83      11 Aug  10:15a
Chin     83      13 Aug  10:15a
Chin     116     12 Aug  1:25p
Copley   281     9 Aug   5:50a
Copley   281     13 Aug  5:50a
Copley   412     15 Aug  1:25p

We can identify the following restrictions:

1. There is exactly one departure time for each flight.
2. For a given {pilot, date, time} there is one flight.
3. For a given {flight, date} there is one pilot.

These restrictions indicate how this relation can be processed (modified, expanded, contracted, etc.). They are examples of a relationship called Functional Dependency (FD). We say that an FD occurs when, in a tuple, the value of one set of attributes uniquely determines the values of another set of attributes.

Notation: we will use → to indicate a functional dependency between two sets of attributes. So X → Y means X functionally determines Y. X is called the left side of the FD and Y the right side of the FD.
X is also called the determinant of the FD X → Y.

Example: If we have Phone number → City name, then the value of the phone number determines the city name, and if the value of the phone number changes then the value of the city name may have to change as well.

Formally: Let r be a relation on R(X, Y). If r satisfies the FD X → Y, then for any two tuples t1, t2 ∈ r with t1(X) = t2(X) we must have t1(Y) = t2(Y). This means that the Y value of a tuple in r(R) is determined by the X value of that tuple, i.e., Y is functionally dependent on X, or X functionally determines Y.

Full functional dependency: X → Y is a full functional dependency if all members of the attribute set X must be present for the dependency to hold, i.e., removing any attribute from X breaks the dependency. For example, let X = {A, B, C} and Y = {D}. If {A, B, C} → {D} and no proper subset of {A, B, C} determines D, then it is a full functional dependency, i.e., Y is fully functionally dependent on X. This means that the value of D can only be determined by the values of A, B, and C together. On the other hand, if {A, B} → {D} also holds, then Y is only partially dependent on X (partial functional dependency) because the value of C is not necessary to determine the value of D.

Prime attribute: Attribute ∈ primary key set (more generally, an attribute that belongs to some candidate key).
Non-prime attribute: Attribute ∉ primary key set.

There is no formula for establishing FDs. The semantics of a relation should indicate how the attributes of the relation are related.

Algorithm to test whether a given FD holds on a relation: SATISFY (<relation name>, FD).
Output: true if the relation satisfies the FD X → Y, false otherwise.

1. Sort the relation r on its X attribute values to bring tuples with equal X-values together.
2. If every pair of tuples ti, tj with ti(X) = tj(X) also has ti(Y) = tj(Y), return true; otherwise return false.

Apply SATISFY (SCHEDULE, Flight → Departs) to relation SCHEDULE.
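The two steps of SATISFY can be sketched in Python as follows. This is an illustrative sketch rather than a definitive implementation: tuples are represented as dictionaries, and the names `project` and `satisfy` are assumptions introduced here. After sorting on X, comparing neighbouring tuples is enough, because tuples with equal X-values end up next to each other.

```python
# SCHEDULE relation from the notes; the dictionary keys are the attribute names.
ROWS = [
    ("Cushing", 83, "9 Aug", "10:15a"),
    ("Cushing", 116, "10 Aug", "1:25p"),
    ("Clark", 281, "8 Aug", "5:50a"),
    ("Clark", 301, "12 Aug", "6:35p"),
    ("Clark", 83, "11 Aug", "10:15a"),
    ("Chin", 83, "13 Aug", "10:15a"),
    ("Chin", 116, "12 Aug", "1:25p"),
    ("Copley", 281, "9 Aug", "5:50a"),
    ("Copley", 281, "13 Aug", "5:50a"),
    ("Copley", 412, "15 Aug", "1:25p"),
]
schedule = [dict(zip(("Pilot", "Flight", "Date", "Departs"), row)) for row in ROWS]

def project(t, attrs):
    """Return the values of tuple t on the listed attributes."""
    return tuple(t[a] for a in attrs)

def satisfy(relation, x_attrs, y_attrs):
    """Return True if the relation instance satisfies the FD X -> Y."""
    # Step 1: sort on X so that tuples with equal X-values become adjacent.
    ordered = sorted(relation, key=lambda t: project(t, x_attrs))
    # Step 2: adjacent tuples that agree on X must also agree on Y.
    for prev, curr in zip(ordered, ordered[1:]):
        same_x = project(prev, x_attrs) == project(curr, x_attrs)
        same_y = project(prev, y_attrs) == project(curr, y_attrs)
        if same_x and not same_y:
            return False
    return True

print(satisfy(schedule, ["Flight"], ["Departs"]))        # True
print(satisfy(schedule, ["Departs"], ["Flight"]))        # False
print(satisfy(schedule, ["Flight", "Date"], ["Pilot"]))  # True
```

Note that such a check only tells us whether one particular instance satisfies the FD; it says nothing about other instances of the same schema.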
Schedule (sorted on Flight)
Pilot    Flight  Date    Departs
Cushing  83      9 Aug   10:15a
Clark    83      11 Aug  10:15a
Chin     83      13 Aug  10:15a
Cushing  116     10 Aug  1:25p
Chin     116     12 Aug  1:25p
Clark    281     8 Aug   5:50a
Copley   281     9 Aug   5:50a
Copley   281     13 Aug  5:50a
Clark    301     12 Aug  6:35p
Copley   412     15 Aug  1:25p

Result: This FD holds because whenever the left-hand side value is Flight = 281, the right-hand side value is Departs = 5:50a. Similarly, whenever the left-hand side value is Flight = 83, the right-hand side value is Departs = 10:15a. Note that an FD is a relationship among attributes of a relation.

We can examine another FD: Departs → Flight. SATISFY (SCHEDULE, Departs → Flight). The relation Schedule is analyzed as before, this time sorted on Departs. This FD is not satisfied because for the left-hand side value Departs = 1:25p there are two different values of Flight, i.e., Flight = 116 and Flight = 412. The following relation illustrates the result (false). Similarly, the FD Date → Flight is not satisfied in this relation.

Schedule (sorted on Departs)
Pilot    Flight  Date    Departs
Clark    281     8 Aug   5:50a
Copley   281     9 Aug   5:50a
Copley   281     13 Aug  5:50a
Cushing  83      9 Aug   10:15a
Clark    83      11 Aug  10:15a
Chin     83      13 Aug  10:15a
Cushing  116     10 Aug  1:25p
Chin     116     12 Aug  1:25p
Copley   412     15 Aug  1:25p
Clark    301     12 Aug  6:35p

Two extreme cases: X → ∅ is trivially satisfied by any relation, and ∅ → Y is satisfied by those relations in which every tuple has the same Y-value.

Graphical representation of FDs

In a dependency diagram, the heads of the arrows point to the right side of each FD and the tails are connected to the left side.

Schema Emp_Dept (Ename, SSN, Bdate, Address, Dnumber, Dname, Dmgrssn)
FDs:
SSN → {Ename, Bdate, Address, Dnumber}
Dnumber → {Dname, Dmgrssn}

Schema Emp_Proj (SSN, Pnumber, Hours, Ename, Pname, Plocation)
FDs (F):
{SSN, Pnumber} → Hours
SSN → Ename
Pnumber → {Pname, Plocation}

IMPORTANT

1. One must remember that FDs in a relation are defined by the database designer.
In the above examples these FDs may not be valid if they have not been defined, even though particular instances of these relation schemas exhibit them.

2. A relational schema R may have many instances. If an FD is identified for R then every instance of R must satisfy the FD. An FD does not hold on R if one instance satisfies it while another instance does not. To verify that a certain FD is true, one would have to check all possible instances of R.

Closure of a set of FDs: Defining the FDs of a schema requires semantic knowledge. A database designer defines a set of FDs for a schema. Let us call this set F. It is possible that some additional FDs hold as a side effect of F. We will call them derived FDs. Formally, the set of all FDs that can be derived from F is called the closure of F and is represented by F+. For example, consider the schema Emp_Dept:

Emp_Dept (Ename, SSN, Bdate, Address, Dnumber, Dname, Dmgrssn)

Defined F:
SSN → {Ename, Bdate, Address, Dnumber}
Dnumber → {Dname, Dmgrssn}

Derived FDs in F+ include:
SSN → {Dname, Dmgrssn}
SSN → SSN
Dnumber → Dname

It is very time consuming to discover F+ for a given F and schema by sequentially scanning relations. Even though the degree (number of attributes) of a schema is finite, the cardinality (number of tuples) of an instance may be very large, and repeated complete scans of all possible instances of the schema would be prohibitively expensive. To discover F+ we instead define a set of rules called inference rules. We use the notation F ⊨ X → Y to indicate that F infers X → Y, i.e., that X → Y can be derived from the defined FD set F.

We define 6 inference rules. The first three are known as Armstrong's inference rules.

1. Reflexive rule: This rule states that a set of attributes determines itself and any of its subsets. For example, {SSN, Account No.} → {SSN, Account No.} or {SSN, Account No.} → {Account No.}. Formally, if Y ⊆ X ⊆ U, then X → Y. This rule gives the trivial dependencies, those whose right side is contained in the left side.
Proof: The reflexivity axiom is clearly sound, i.e., using this rule we cannot deduce from F any dependency that is not in F+. We cannot have a relation r with two tuples that agree on X yet disagree on some subset of X. Let R be a schema with attribute subsets X and Y where Y ⊆ X. Consider a pair of tuples t1 and t2 ∈ r(R). If t1(X) = t2(X), then t1(A) = t2(A) for every A ∈ X. Then, since Y ⊆ X, t1(A) = t2(A) for every A ∈ Y, which is equivalent to t1(Y) = t2(Y). Therefore, X → Y must hold in r(R). In particular, for any attribute set X we have X → X, so ΠX(σX=x(r)) always has at most one tuple.

Consider a relation Emp:

Emp
SSN  Emp_Name  Salary  Age
123  A         50K     65
111  B         60K     45
100  C         60K     42
110  D         65K     42

Suppose X = {SSN, Emp_Name, Age} and Y = {Emp_Name, Age}. Thus, Y ⊆ X, and we have X → Y, that is, {SSN, Emp_Name, Age} → {Emp_Name, Age}. Take two tuples with t1(X) = (123, A, 65) and t2(X) = (111, B, 45), so that t1(Y) = (A, 65) and t2(Y) = (B, 45). Here t1(X) ≠ t2(X), so the dependency places no requirement on this pair (and indeed t1(Y) ≠ t2(Y)). Now suppose t1(X) = t2(X), which means t1(X) = (123, A, 65) and t2(X) = (123, A, 65). Then necessarily t1(Y) = t2(Y), i.e., (A, 65) = (A, 65).

2. Augmentation: This rule states that a new valid FD is generated if the same set of attributes is added to the left and right sides of an existing FD. Formally, if X → Y then XZ → YZ, and hence also XZ → Y, where Z ⊆ R.

Proof: If r satisfies X → Y, then ΠY(σX=x(r)) has at most one tuple for any X-value x. This means that when we apply a select on r with predicate X = x, all selected tuples share the same Y-value, so a projection of the result on Y yields at most one tuple. Similarly, if Z ⊆ R then σXZ=xz(r) ⊆ σX=x(r) and hence ΠY(σXZ=xz(r)) ⊆ ΠY(σX=x(r)). Thus ΠY(σXZ=xz(r)) has at most one tuple and r must satisfy XZ → Y.
Example: F = {A → B}

r
A   B   C   D
a1  b1  c1  d1
a2  b2  c1  d1
a1  b1  c1  d2
a3  b3  c2  d3

By rule 2, F+ contains AB → B, AC → B, AD → B, ABC → B, ABD → B, ACD → B and ABCD → B. It can also be seen that AC → BC, AD → BD, ACD → BCD and so on. AC → BC means that whenever t1(AC) = t2(AC), we also have t1(BC) = t2(BC). The textbook (Elmasri/Navathe) proves this rule by contradiction.

3. Transitive rule: This rule establishes that if X → Y and Y → Z then X → Z.

Proof: Let F = {X → Y, Y → Z} hold in r. Let t1 ∈ r and t2 ∈ r. We know that if t1(X) = t2(X), then t1(Y) = t2(Y), and also that if t1(Y) = t2(Y), then t1(Z) = t2(Z). Therefore if t1(X) = t2(X), then t1(Z) = t2(Z). This is one of the most important axioms.

Example: F = {A → B, B → C}; r satisfies A → C.

r
A   B   C   D
a1  b1  c2  d1
a2  b2  c1  d2
a3  b1  c2  d1
a1  b1  c2  d3

The remaining three rules can be proved using the first three.

4. Projectivity (decomposition): This rule states that some attributes can be removed from the right-hand side without affecting the FD. Formally, if X → YZ then X → Y.

Proof: If r satisfies X → YZ, then ΠYZ(σX=x(r)) has at most one tuple for any X-value x. Since ΠY(ΠYZ(σX=x(r))) = ΠY(σX=x(r)), ΠY(σX=x(r)) can have at most one tuple. Hence r satisfies X → Y.

5. Union or additive rule: This rule allows us to combine two or more FDs with the same left side. Formally, if X → Y and X → Z, then X → YZ.

Proof: If r satisfies X → Y and X → Z then ΠY(σX=x(r)) and ΠZ(σX=x(r)) both have at most one tuple for any X-value x. If ΠYZ(σX=x(r)) had more than one tuple, then at least one of ΠY(σX=x(r)) and ΠZ(σX=x(r)) would have more than one tuple. Thus X → YZ.

6. Pseudotransitivity rule: This rule allows us to extend the transitive rule further. Formally, if {X → Y, WY → Z} then WX → Z.

Proof: Let r satisfy X → Y and WY → Z, and let t1 and t2 be tuples in r. We know that if t1(X) = t2(X), then t1(Y) = t2(Y), and also that if t1(WY) = t2(WY), then t1(Z) = t2(Z).
From t1(WX) = t2(WX) we can deduce that t1(X) = t2(X), and so t1(Y) = t2(Y), and further t1(WY) = t2(WY), which implies t1(Z) = t2(Z). Thus WX → Z.

The reflexivity, augmentation and transitivity rules are called Armstrong's inference rules; the other three rules can be proved from them. Using these 6 rules, it is possible to derive other inference rules for FDs.

Normal Forms and Modification Anomalies

Normal forms (NF): The normal form of a relation defines the type of modification anomalies it eliminates. There are First normal form (1NF), Second normal form (2NF), Third normal form (3NF), Boyce-Codd normal form (BCNF), Fourth normal form (4NF), Domain/Key normal form (DK/NF) and Fifth normal form (5NF). We will study several of them.

Normalization: The process of transforming a relation from a lower normal form (including non-normalized form) to a higher normal form is called normalization.

Degree (non-normalized)
Student Name  Year  Degree
John          1990  MS
              2002  BS
Kumar         1967  BS
              1969  MS
              1983  Ph.D.

Repeating groups: If one value of an attribute determines more than one value of another attribute, then these multiple values are called repeating groups. For example, in relation Degree, Student Name and Degree are two attributes. For the one value Student Name = Kumar, there are three values of Degree (BS, MS, Ph.D.). Relation Degree is in a non-normalized form. Year values are likewise repeated for one value of Student Name; thus, Year also has repeating groups. Degree is a non-normalized relation and is not allowed in the relational model. Relation Degree must be normalized before it can be processed by relational database systems. The normalized relation Degree is given below.

Degree (normalized)
Student Name  Year  Degree
John          1990  MS
John          2002  BS
Kumar         1967  BS
Kumar         1969  MS
Kumar         1983  Ph.D.

We now define these ideas in terms of Normal Forms (NF).

1NF: A relation is in 1NF if it does not contain repeating groups, i.e., all its attributes are atomic.
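The step from the non-normalized Degree relation to its 1NF form amounts to repeating the student name for every member of its repeating group. A minimal Python sketch follows; the nested dictionary used to hold the repeating groups and the function name `to_1nf` are assumptions made purely for illustration.

```python
# Non-normalized Degree: each student name carries a repeating group
# of (Year, Degree) pairs.
degree_nested = {
    "John": [(1990, "MS"), (2002, "BS")],
    "Kumar": [(1967, "BS"), (1969, "MS"), (1983, "Ph.D.")],
}

def to_1nf(nested):
    """Flatten repeating groups into one atomic tuple per (name, year, degree)."""
    return [
        (name, year, degree)
        for name, group in nested.items()
        for year, degree in group
    ]

degree_1nf = to_1nf(degree_nested)
for row in degree_1nf:
    print(row)
```

Every value in the flattened relation is atomic, at the cost of repeating the student name in several tuples.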
Normalized:

Order
Ono    Date   Pno   No_Ordered
12489  90287  AX12  11
12491  90287  BT04  1
12491  90287  BZ66  1
12494  90487  CB03  4
12495  90487  CX11  2
12498  90587  AZ52  2
12498  90587  BA74  4
12500  90587  BT04  1

The Order relation is in 1NF, since it does not have repeating groups.
Degree: 4. Cardinality: 8.
Primary Key (PK): {Order_number, Part_number}. No single attribute can serve as a candidate key. There are many superkeys.

We want to identify whether this relation has any modification anomalies. If it has, we try to minimize or eliminate them. Consider the following relation. It is in 1NF.

Order
Ono    Date   Part_descrip  Pno   No_Ordered
12489  90287  Iron          AX12  11
12491  90287  Stove         BT04  1
12491  90287  Washer        BZ66  1
12494  90487  Bike          CB03  4
12495  90487  Mixer         CX11  2
12498  90587  Skates        AZ52  2
12498  90587  Baseball      BA74  4
12500  90587  Stove         BT04  1

Modification anomalies

Update: A change to the description of BT04 requires several changes since BT04 has been duplicated as a result of normalization.
Inconsistent data: If an update is incomplete, BT04 may end up with different descriptions in different tuples.
Additions: We cannot add part ZZ14 until we have an order for ZZ14.
Deletion: By deleting the tuples for BT04 we lose the fact that BT04 represents Stove.

Conclusion: We conclude that Order has modification anomalies, which should be minimized or eliminated.

Minimization or elimination process: A relation with modification anomalies is further normalized to a higher normal form. Usually normalization to the next higher normal form resolves the issue. If not, the relation is normalized to the next higher normal form again. Further normalization is usually done by splitting the relation (vertically) into two or more relations. So the solution for Order is to normalize it to 2NF. We first define 2NF.

2NF: A relation schema is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on the PK.
Dependency diagram of Order
Attributes: Order_no, Date, Part_no, Part_desc, No_ordered, Price
PK: {Order_no, Part_no}

Order is in 1NF: all attributes are atomic. Order is not in 2NF, because the non-key attribute Part_desc is dependent upon Part_no, which is only a portion of the PK. Also, Date is dependent upon Order_no (again only part of the PK). There are partial dependencies among the attributes of Order.

Order
Order_no  Date
12489     90287
12491     90287
12494     90487
12495     90487
12498     90587
12500     90587

Part
Part_no  Part_desc
AX12     Iron
AZ52     Skates
BA74     Baseball
BH22     Toaster
BT04     Stove
BZ66     Washer
CA14     Skillet
CB03     Bike
CX11     Mixer

Order_line
Order_no  Part_no  No_ordered  Price
12489     AX12     11          14.95
12491     BT04     1           402.99
12491     BZ66     1           311.95
12494     CB03     4           175.00
12495     CX11     2           57.95
12498     AZ52     2           22.95
12498     BA74     4           4.95
12500     BT04     1           402.99

1NF → 2NF: Split the relation into two or more relations via projection as follows.

1. Identify the set of attributes that makes up the PK: {Order_no, Part_no}.
2. Create all non-empty subsets of the above set: {Order_no}, {Part_no} and {Order_no, Part_no}.
3. Designate each of these subsets as the PK of a relation that contains those attributes which are dependent on that PK.

Primary keys: {Order_no}, {Part_no} and {Order_no, Part_no}

Relational schemas:
Order (Order_no, Date). Date is functionally dependent on Order_no (see diagram above).
Part (Part_no, Part_desc). Part_desc is functionally dependent on Part_no.
Order_line (Order_no, Part_no, No_ordered, Price).

All these relations are in 2NF. The anomalies have been eliminated, which can be verified as follows:

Change: If the description of BT04 is changed to something else, it requires only one change in the Part relation.
Add a new part and its description: If a new tuple is added to Part, there is no need for an order to exist for that part.
Delete order 12489: This delete does not cause AX12 to be deleted from Part, thus we do not lose the description of AX12.
Information loss: none.

Q.
Does this imply that relations in 2NF do not have modification anomalies?
A. No. Relations in 2NF may still suffer from all the modification anomalies.

Example

Customer
Cust_no  Name     Address      Slsrep_no  Slsrep_name
124      Sally A  4747 Troost  2          Tom J
256      Ann S    215 Oak      6          Bill S
311      Don C    48 College   12         Sam B
315      Tom D    914 Cherry   6          Bill S
405      Al W     519 Watson   12         Sam B
412      Sally A  16 Elm       3          Mary J
522      Mary N   108 Pine     12         Sam B
567      Joe B    808 Ridge    6          Bill S
587      Judy R   512 Pine     6          Bill S
622      Dan M    419 Chip     3          Mary J

Dependency diagram of Customer
Attributes: Cust_no, Name, Address, Slsrep_no, Slsrep_name

Customer is in 2NF. It suffers from all the anomalies:

Update: A change to a Slsrep_name requires multiple changes.
Inconsistent data: There is nothing in the design that would prohibit a sales rep (Slsrep_no) from having two different names.
Additions: To add Slsrep_no 47, there must first be a customer for 47.
Deletions: If we delete all the customers of a sales rep then we lose the name of the sales rep as well.

Reason for these anomalies: Slsrep_no, which is not the PK, determines Slsrep_name. As a result a Slsrep_no, together with its name, can appear many times in the relation.

Remedy: Normalize the Customer relation by transforming it into 3NF relations.

3NF: A relation scheme R is in 3NF if it is in 2NF and no non-prime attribute of R is transitively dependent on the primary key.

Transitive dependency: A transitive dependency exists among 3 or more attributes.

Example: Emp_Dept (Ename, SSN, Bdate, Address, Dnumber, Dname, Dmgrssn)
SSN → Dnumber
Dnumber → Dname and Dnumber → Dmgrssn
Therefore SSN → Dname and SSN → Dmgrssn transitively.

Example: The Housing relation is in 2NF. Fee is transitively dependent on SID (through Building), so Housing is not in 3NF. This relation has all the modification anomalies. Housing can be converted to 3NF by the normalization process.
Housing
SID  Building  Fee
100  Randolph  1200
150  Ingersol  1100
200  Randolph  1200
250  Pitkin    1100
300  Randolph  1200

2NF → 3NF

Housing
SID  Building
100  Randolph
150  Ingersol
200  Randolph
250  Pitkin
300  Randolph

Fee
Building  Fee
Randolph  1200
Ingersol  1100
Pitkin    1100

Relations in 3NF can also have modification anomalies. Consider the Advisor relation.

Relationships
1. A student can have one or more majors.
2. A major can have several faculty members as advisors.
3. A faculty member advises in only one major area.
4. SID cannot be a key since a student can have many majors and therefore many advisors.
5. A student cannot have many advisors in the same area.

Candidate keys are {SID, Major}, since (SID, Major) → Fname, and {SID, Fname}, since (SID, Fname) → Major. One of these sets can be selected as the primary key.
Determinant: Fname → Major

Advisor
SID  Major       Fname
100  Math        Cauchy
150  Psychology  Jung
200  Math        Riemann
250  Math        Cauchy
300  Psychology  Perls
300  Math        Riemann

This relation does not have a transitive dependency of a non-prime attribute on the key (every attribute belongs to some candidate key). Advisor is therefore in 3NF, but it has modification anomalies.

Deletion: If we delete the tuple for SID 300 with Perls, we lose the fact that Perls advises in Psychology.
Addition: We cannot add the fact that Keynes advises in Economics if there is no student enrolled.
Update: If Cauchy changes to advising in Physics then multiple changes are required.
Inconsistency: A change applied to only some of the Cauchy-Math tuples will make Advisor inconsistent.

Solution: Further normalization to Boyce/Codd normal form.

Boyce/Codd Normal Form (BCNF): A relation is in BCNF if every determinant is a candidate key. Advisor is not in BCNF because Fname → Major, and Fname is not a candidate key.

3NF to BCNF: Advisor is split by projection into a relation on (SID, Fname) and a relation on (Fname, Major), shown here with duplicate tuples removed.

SID  Fname
100  Cauchy
150  Jung
200  Riemann
250  Cauchy
300  Perls
300  Riemann

Fname    Major
Cauchy   Math
Jung     Psychology
Riemann  Math
Perls    Psychology

Relations in BCNF are not entirely free from anomalies.
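The BCNF test applied to Advisor above can be sketched in Python using attribute closure, i.e., repeated application of the inference rules to see which attributes a given set determines. This is a hedged sketch under two assumptions: the function names `closure`, `is_superkey` and `bcnf_violations` are illustrative, and the check uses the common "left side is a superkey" formulation of BCNF rather than the exact wording of the definition above.

```python
def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, so is the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(attrs, all_attrs, fds):
    """A set of attributes is a superkey if its closure covers the whole schema."""
    return closure(attrs, fds) == set(all_attrs)

def bcnf_violations(all_attrs, fds):
    """Return the nontrivial FDs whose left side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not is_superkey(lhs, all_attrs, fds)
    ]

advisor_attrs = {"SID", "Major", "Fname"}
advisor_fds = [
    (frozenset({"SID", "Major"}), frozenset({"Fname"})),
    (frozenset({"Fname"}), frozenset({"Major"})),
]

violations = bcnf_violations(advisor_attrs, advisor_fds)
print(violations)  # only Fname -> Major is reported

# After the split, the relation on (Fname, Major) keeps Fname -> Major,
# and Fname is now a key of that smaller relation.
split_violations = bcnf_violations({"Fname", "Major"}, advisor_fds[1:])
print(split_violations)  # []
```

The closure of {SID, Fname} also covers all three attributes, which agrees with {SID, Fname} being a candidate key of Advisor.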
Consider the Student relation:

Student
SID  Major       Activity
100  Music       Swimming
100  Accounting  Swimming
100  Music       Tennis
100  Accounting  Tennis
150  Math        Jogging

Semantics: A student can enroll in more than one major and in more than one activity.
PK: All three attributes.

What is the relationship between SID and Major? It is not a functional dependency, because students can have several majors. Still, there is some sort of relationship, which can be illustrated by an example. Suppose Student 100 wants to enroll in Skiing.

Add: tuple (100, Music, Skiing). Resulting relation:

Student
SID  Major       Activity
100  Music       Skiing
100  Music       Swimming
100  Accounting  Swimming
100  Music       Tennis
100  Accounting  Tennis
150  Math        Jogging

Semantics: This implies that Student 100 skis as a Music major but does not ski as an Accounting major, which is illogical.

Solution: Add the tuple (100, Accounting, Skiing). The resulting relation is consistent.

The relationship between SID and Major is a multivalued dependency. SID determines not a single value but several values. Thus SID 100 determines the majors (Music, Accounting) and the activities (Skiing, Swimming, Tennis). The relation is in BCNF since all attributes together make up the primary key. It nevertheless has anomalies.

Addition: If one tuple such as (100, Accounting, Skiing) is added, then further tuples must be added to preserve consistency.

Student
SID  Major       Activity
100  Music       Skiing
100  Accounting  Skiing
100  Music       Swimming
100  Accounting  Swimming
100  Music       Tennis
100  Accounting  Tennis
150  Math        Jogging

Delete: If (100, Music, Skiing) is deleted then (100, Accounting, Skiing) also has to be deleted, even though Major and Activity are not related.

Solution: Break this relation into two by projection.

S_major
SID  Major
100  Music
100  Accounting
150  Math

S_Activity
SID  Activity
100  Skiing
100  Swimming
100  Tennis
150  Jogging

Multivalued dependencies always occur in pairs.
For the Student relation, SID →→ Major because Major depends only on the value of SID and not on the value of Activity. Similarly, SID →→ Activity. Activity does not depend on Major, since having a particular major implies nothing about Activity. The absence of a relationship between Major and Activity creates a problem in the sense that whenever we add a new Major for a student, we must add a tuple for every value of Activity of that student.

In a relational scheme R(A, B, C), if A →→ B then A →→ C; this is why multivalued dependencies occur in pairs. This must be so because if B is unrelated to C then C must also be unrelated to B.

Independence among attributes is not a problem if the attributes have a single value.

Example

Student_shoe
SID  Shoe_size  Marital_status
100  8          M
150  10         S
200  5          S
250  12         S

Primary key: SID. In Student_shoe, Shoe_size and Marital_status are independent and each has a single value per student, so anomalies cannot occur. We can delete or add any tuple with no problem. This observation leads to the definition of 4NF.

4NF: A relation is in 4NF if it is in BCNF and it has no multivalued dependencies.
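Whether a given instance respects SID →→ Major can be checked mechanically: for each SID, the (Major, Activity) pairs present must be exactly every combination of that student's majors and activities. The Python sketch below illustrates this on the Student relation; the function name `satisfies_mvd` and the positional tuple layout are assumptions made for the illustration.

```python
from itertools import product

def satisfies_mvd(relation, x, y, z):
    """Check X ->> Y on a three-column relation; x, y, z are column positions."""
    groups = {}
    for t in relation:
        groups.setdefault(t[x], set()).add((t[y], t[z]))
    for pairs in groups.values():
        ys = {p[0] for p in pairs}
        zs = {p[1] for p in pairs}
        # Every Y-value must be paired with every Z-value for this X-value.
        if pairs != set(product(ys, zs)):
            return False
    return True

student = [
    (100, "Music", "Swimming"),
    (100, "Accounting", "Swimming"),
    (100, "Music", "Tennis"),
    (100, "Accounting", "Tennis"),
    (150, "Math", "Jogging"),
]

ok_before = satisfies_mvd(student, 0, 1, 2)

# Adding only (100, Music, Skiing) breaks the dependency ...
with_one = student + [(100, "Music", "Skiing")]
ok_one = satisfies_mvd(with_one, 0, 1, 2)

# ... and adding (100, Accounting, Skiing) as well restores it.
with_both = with_one + [(100, "Accounting", "Skiing")]
ok_both = satisfies_mvd(with_both, 0, 1, 2)

print(ok_before, ok_one, ok_both)  # True False True
```

In the decomposed relations S_major and S_Activity this bookkeeping disappears, since enrolling Student 100 in Skiing is a single new tuple in S_Activity.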