Traverse Hierarchies - Teradata Magazine Online
HANDS ON

Traverse Hierarchies

Analyze data and prevent problems with recursive queries.

by Ralph Meira

We find hierarchies in our everyday lives: bills of materials, organizational structures, genealogy charts, etc. In theory, hierarchies are easy to understand. However, they can be quite a challenge when you attempt to represent and manipulate them using a database and SQL.

The efficient handling of hierarchies can lead to the detection of anomalies in complex systems, the uncovering of unknown relationships and much more. Unfortunately, in the real world, they are often plagued by data quality problems that can negatively affect their analysis. The Teradata Database can shield you from these problems through the use of recursive queries, as long as you take a few precautions.

Analyze Levels of Data

Hierarchies are best represented in the form of a treelike structure with several levels. Their in-database representation typically relies on parent-child relationships that, in the case of an organizational chart, identify the relationship between employer and employee. Due to the nature of hierarchical structures, data quality issues such as missing links or cyclic child-parent-child relationships can often complicate the analysis of data.

Querying Hierarchies

The recursive query syntax was created to enable SQL to query hierarchical data of an unknown depth. Using the WITH clause, you can define derived tables before the main query instead of within it.
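As a rough illustration of the WITH clause, the sketch below runs the same aggregate twice: once with the derived table written inside the main query and once with it defined up front. It is written in Python against an in-memory SQLite database rather than Teradata, purely so that it can be run anywhere. The rows are those listed in table 1 of this article; the helper name and the alias C are made up for the example.

```python
import sqlite3

# Parent-child rows as listed in table 1 of the article.
ROWS = [(0, 1), (0, 2), (0, 3), (1, 11), (2, 22), (3, 33),
        (11, 111), (11, 1111), (22, 222), (33, 333),
        (222, 2222), (1111, 11111)]

# Derived table written inside the main query.
INLINE_SQL = """
SELECT C.PARENT_ID, C.N
FROM (SELECT PARENT_ID, COUNT(*) AS N
      FROM TABLE_1
      GROUP BY PARENT_ID) AS C
WHERE C.N > 1
ORDER BY C.PARENT_ID
"""

# The same derived table, defined before the main query with WITH.
WITH_SQL = """
WITH C (PARENT_ID, N) AS (
    SELECT PARENT_ID, COUNT(*)
    FROM TABLE_1
    GROUP BY PARENT_ID
)
SELECT C.PARENT_ID, C.N
FROM C
WHERE C.N > 1
ORDER BY C.PARENT_ID
"""


def parents_with_several_children(rows):
    """Hypothetical helper: run both forms and return both result lists."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TABLE_1 (PARENT_ID INTEGER, CHILD_ID INTEGER)")
    con.executemany("INSERT INTO TABLE_1 VALUES (?, ?)", rows)
    inline = con.execute(INLINE_SQL).fetchall()
    with_form = con.execute(WITH_SQL).fetchall()
    con.close()
    return inline, with_form


if __name__ == "__main__":
    inline, with_form = parents_with_several_children(ROWS)
    print(inline)
    print(with_form)
```

Both forms return the same rows; the WITH form simply names the derived table first, which is the shape a recursive query builds on.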
A recursive query contains at least one reference to its name within its own definition and is composed of these parts:
> A seed query UNIONed to an iterative (recursive) segment
> At least one logical condition to prevent infinite loops from occurring

PAGE 1 l Teradata Magazine l Q3/2011 l ©2011 Teradata Corporation l AR-6393

[Figure 1: Simple Hierarchy. Navigate this straightforward hierarchy by following the parent-child relationships.]

[Figure 2: Hierarchy With Problems. This hierarchy has some data problems: The green lines highlight (a) a cyclic loop (1,11,111,1,…) and (b) a lower-level child item 11111 as a parent to 222, which is higher up in the hierarchy. There is also a missing link between 22 and 222.]

You can easily visualize how data quality problems occur by comparing figure 1 to figure 2. Notice how the link between “22” and “222” is missing in figure 2, perhaps due to input error. Also notice that “1,” “11” and “111” can lock you into a cyclic loop that never ends.

Recursive queries can be used to analyze hierarchical data while guarding against data quality problems. As an example, the recursive SQL syntax in SQL sample 1 has “safety features” that help identify cyclic data loops and keep the recursion depth to a maximum of 10 levels. This syntax can be repurposed to unravel any hierarchy that follows the format in table 1 and, by raising the level limit, to go much deeper than just 10 levels.

Some key elements in the syntax are:
> A materialized PATH is composed of concatenated values that show the route taken by each level of recursion.
> POSITION determines whether a node from the hierarchy is being repeated in the PATH, thus identifying cyclic data.
> The TREE.LEVEL < 10 condition stops the query from going any deeper than 10 levels.
> The WHERE E.PARENT_ID = 0 clause in the SQL is responsible for defining node 0 as the starting point. Each step of all possible routes initiated at node 0 is shown following the PARENT-CHILD row order.

It’s not uncommon to be confused by the syntax of recursive queries, so it is useful to employ recursive VIEWS to simplify the analysis of hierarchies. Using the RECURSIVE VIEW format enables the use of set operators such as MINUS and INTERSECT that allow comparative analysis between two similar hierarchies.

SQL SAMPLE 1

    WITH RECURSIVE TREE (LEVEL, PARENT_ID, CHILD_ID, PATH) AS
    (
        -- Seed query: the rows whose parent is node 0
        SELECT 0 AS LEVEL,
               E.PARENT_ID,
               E.CHILD_ID,
               CAST(E.PARENT_ID AS VARCHAR(200)) AS PATH
        FROM TABLE_1 E
        WHERE E.PARENT_ID = 0
        UNION ALL
        -- Recursive segment: step from each child to its own children
        SELECT TREE.LEVEL + 1,
               S.PARENT_ID,
               S.CHILD_ID,
               CAST(TREE.PATH || S.PARENT_ID AS VARCHAR(200)) AS PATH
        FROM TREE, TABLE_1 S
        WHERE TREE.CHILD_ID = S.PARENT_ID
          AND TREE.LEVEL < 10                          -- depth limit
          AND POSITION(S.PARENT_ID IN TREE.PATH) < 1   -- stop on a repeated node
    )
    SELECT PARENT_ID,
           CHILD_ID,
           MIN(LEVEL) + 1 AS DEPTH,
           PATH || CHILD_ID AS PATH,
           CASE WHEN (POSITION(CHILD_ID IN PATH) > 0) THEN 'CYCLIC' ELSE '' END AS CYCLIC
    FROM TREE
    GROUP BY 1, 2, 4, 5
    ORDER BY 4, 3, 1, 2;

Parent-Child Rules (table 1)

    PARENT_ID  CHILD_ID
    0          1
    0          2
    0          3
    1          11
    2          22
    3          33
    11         111
    11         1111
    22         222
    33         333
    222        2222
    1111       11111

By using REPLACE RECURSIVE VIEW, you can start to analyze changes or anomalies introduced by data quality issues. ANSI SQL recursive queries are also available to Teradata Database users and have been for many years. SQL sample 2 shows what syntax with recursive VIEWS can look like.

If the appropriate row changes are made to table 1, it’s possible to have the table represent the hierarchy shown in figure 2. Executing the recursive SQL syntax against a revised table 1 will produce the results shown in table 2. You can compare the hierarchy of nodes in figures 1 and 2 using SQL sample 3.
Note that, apart from the view names, the SELECT before and after the MINUS is the same one used in SQL sample 1 to extract and group the results of the recursive query, with a single ORDER BY applied to the combined result.

A simpler analysis of changes to a hierarchy can often be carried out without the need for recursive queries at all. For example, to find out which nodes in a hierarchy are considered to be top nodes, that is, nodes without parents, you only need to SELECT all PARENT_IDs MINUS the SELECT of all CHILD_IDs for the same hierarchy.

To obtain the results shown in table 3, it is assumed that TREE_FIG2_V and TREE_FIG1_V are recursive VIEWs in the format shown in SQL sample 2, where the underlying TABLEs contain the parent-child relationships that correspond to figures 2 and 1, respectively. Table 3 shows the differences between figures 1 and 2, as originally intended.

Data Quality Protection

Hierarchies, though easy to traverse based on their parent-child rules, often come with data quality issues that can lead to infinite loops. Recursive queries can protect you from data problems such as missing links and cyclic loops. The WITH RECURSIVE syntax can guide you through hierarchies to find their depth, detect infinite loops and more. You’ll be surprised how easy it is to use SQL to navigate the structures and identify data quality problems.

Ralph Meira is a Teradata senior solution architect focused primarily on manufacturing accounts.
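The top-node check described in the article (all PARENT_IDs MINUS all CHILD_IDs) can be sketched as follows. This is a minimal sketch in Python against an in-memory SQLite database, not Teradata: SQLite spells the set difference EXCEPT where the article's SQL uses MINUS. The rows come from table 1; the helper name top_nodes is hypothetical.

```python
import sqlite3

# Parent-child rows as listed in table 1 of the article.
ROWS = [(0, 1), (0, 2), (0, 3), (1, 11), (2, 22), (3, 33),
        (11, 111), (11, 1111), (22, 222), (33, 333),
        (222, 2222), (1111, 11111)]

# Assumption: SQLite dialect, where the set difference is EXCEPT
# (the article's Teradata SQL writes MINUS).
TOP_NODES_SQL = """
SELECT PARENT_ID FROM TABLE_1
EXCEPT
SELECT CHILD_ID FROM TABLE_1
ORDER BY 1
"""


def top_nodes(rows):
    """Hypothetical helper: return the nodes that never appear as a child."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TABLE_1 (PARENT_ID INTEGER, CHILD_ID INTEGER)")
    con.executemany("INSERT INTO TABLE_1 VALUES (?, ?)", rows)
    result = [row[0] for row in con.execute(TOP_NODES_SQL)]
    con.close()
    return result


if __name__ == "__main__":
    print(top_nodes(ROWS))
    # Drop the 22 -> 222 link to mimic the missing link described for figure 2.
    print(top_nodes([r for r in ROWS if r != (22, 222)]))
```

With the table 1 rows, node 0 is the only top node. Removing the 22 to 222 link leaves 222 without a parent, so it shows up as a second top node, which is one quick way to spot a missing link without any recursion.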
SQL SAMPLE 2

    REPLACE RECURSIVE VIEW TREE_V (LEVEL, PARENT_ID, CHILD_ID, PATH) AS
    (
        SELECT 0 AS LEVEL,
               E.PARENT_ID,
               E.CHILD_ID,
               CAST(E.PARENT_ID AS VARCHAR(200)) AS PATH
        FROM TABLE_1 E
        WHERE E.PARENT_ID = 0
        UNION ALL
        SELECT TREE_V.LEVEL + 1,
               S.PARENT_ID,
               S.CHILD_ID,
               CAST(TREE_V.PATH || S.PARENT_ID AS VARCHAR(200)) AS PATH
        FROM TREE_V, TABLE_1 S
        WHERE TREE_V.CHILD_ID = S.PARENT_ID
          AND TREE_V.LEVEL < 10
          AND POSITION(S.PARENT_ID IN TREE_V.PATH) < 1
    );

SQL SAMPLE 3

    SELECT PARENT_ID,
           CHILD_ID,
           MIN(LEVEL) + 1 AS DEPTH,
           PATH || CHILD_ID AS PATH,
           CASE WHEN (POSITION(CHILD_ID IN PATH) > 0) THEN 'CYCLIC' ELSE '' END AS CYCLIC
    FROM TREE_FIG2_V
    GROUP BY 1, 2, 4, 5
    MINUS
    SELECT PARENT_ID,
           CHILD_ID,
           MIN(LEVEL) + 1 AS DEPTH,
           PATH || CHILD_ID AS PATH,
           CASE WHEN (POSITION(CHILD_ID IN PATH) > 0) THEN 'CYCLIC' ELSE '' END AS CYCLIC
    FROM TREE_FIG1_V
    GROUP BY 1, 2, 4, 5
    ORDER BY 4, 3, 1, 2;

Paths From Node 0 (table 2)

    PARENT_ID  CHILD_ID  DEPTH  PATH                              CYCLIC
    0          1         1      0 1
    1          11        2      0 1 11
    11         111       3      0 1 11 111
    111        1         4      0 1 11 111 1                      CYCLIC
    11         1111      3      0 1 11 1111
    1111       11111     4      0 1 11 1111 11111
    11111      222       5      0 1 11 1111 11111 222
    222        2222      6      0 1 11 1111 11111 222 2222
    2222       22222     7      0 1 11 1111 11111 222 2222 22222
    0          2         1      0 2
    2          22        2      0 2 22
    22         222       3      0 2 22 222
    222        2222      4      0 2 22 222 2222
    2222       22222     5      0 2 22 222 2222 22222
    0          3         1      0 3
    3          33        2      0 3 33
    33         333       3      0 3 33 333
    333        2222      4      0 3 33 333 2222
    2222       22222     5      0 3 33 333 2222 22222
    33         22222     3      0 3 33 22222

There are many available paths when starting at Node 0. The cyclic path is correctly flagged in the last column.

Different Paths (table 3)

    PARENT_ID  CHILD_ID  DEPTH  PATH                              CYCLIC
    111        1         4      0 1 11 111 1                      CYCLIC
    1111       11111     4      0 1 11 1111 11111
    11111      222       5      0 1 11 1111 11111 222
    222        2222      6      0 1 11 1111 11111 222 2222
    2222       22222     7      0 1 11 1111 11111 222 2222 22222
    22         222       3      0 2 22 222
    222        2222      4      0 2 22 222 2222
    2222       22222     5      0 2 22 222 2222 22222
    333        2222      4      0 3 33 333 2222
    2222       22222     5      0 3 33 333 2222 22222
    33         22222     3      0 3 33 22222

This table shows how the paths differ between figures 1 and 2.
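To watch the safety features of the recursive query at work on a loop, the sketch below runs a query of the same shape as SQL sample 1 over the cyclic chain 0, 1, 11, 111, 1 described for figure 2. It is an illustration under stated assumptions, not the article's Teradata SQL: it uses Python with an in-memory SQLite database, SQLite's instr() stands in for POSITION, and every node in the path is wrapped in an explicit "/" delimiter (a choice made for this sketch) so that a node such as 11 is never matched inside 111. The helper name and the FULL_PATH alias are hypothetical.

```python
import sqlite3

# A small cyclic hierarchy: 0 -> 1 -> 11 -> 111 -> 1 (the loop described for figure 2).
CYCLIC_ROWS = [(0, 1), (1, 11), (11, 111), (111, 1)]

# Assumptions of this sketch: SQLite dialect, instr() instead of POSITION,
# and '/' delimiters around each node in the materialized path.
TRAVERSE_SQL = """
WITH RECURSIVE TREE (LEVEL, PARENT_ID, CHILD_ID, PATH) AS (
    -- Seed query: rows whose parent is node 0
    SELECT 0, E.PARENT_ID, E.CHILD_ID, '/' || E.PARENT_ID || '/'
    FROM TABLE_1 E
    WHERE E.PARENT_ID = 0
    UNION ALL
    -- Recursive segment: follow each child to its own children
    SELECT TREE.LEVEL + 1, S.PARENT_ID, S.CHILD_ID,
           TREE.PATH || S.PARENT_ID || '/'
    FROM TREE JOIN TABLE_1 S ON TREE.CHILD_ID = S.PARENT_ID
    WHERE TREE.LEVEL < 10
      AND instr(TREE.PATH, '/' || S.PARENT_ID || '/') = 0
)
SELECT PARENT_ID, CHILD_ID, LEVEL + 1 AS DEPTH,
       PATH || CHILD_ID || '/' AS FULL_PATH,
       CASE WHEN instr(PATH, '/' || CHILD_ID || '/') > 0
            THEN 'CYCLIC' ELSE '' END AS CYCLIC
FROM TREE
ORDER BY DEPTH, PARENT_ID, CHILD_ID
"""


def paths_from_node_0(rows):
    """Hypothetical helper: list every step reachable from node 0."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TABLE_1 (PARENT_ID INTEGER, CHILD_ID INTEGER)")
    con.executemany("INSERT INTO TABLE_1 VALUES (?, ?)", rows)
    result = con.execute(TRAVERSE_SQL).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in paths_from_node_0(CYCLIC_ROWS):
        print(row)
```

On this data the query stops after four steps instead of looping: the step from 111 back to 1 is reported once and flagged CYCLIC, and the recursion does not continue past it because node 1 is already in the path.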