Traverse Hierarchies - Teradata Magazine Online

Transcription

Traverse Hierarchies - Teradata Magazine Online
HANDS ON
Traverse Hierarchies
Analyze data and prevent problems with recursive queries.
by Ralph Meira
W
e find hierarchies in our
everyday lives: bills of
materials, organizational
structures, genealogy charts, etc. In
theory, hierarchies are easy to understand.
However, they can be quite a challenge
when you attempt to represent and manipulate them using a database and SQL.
The efficient handling of hierarchies can
lead to the detection of anomalies in complex
systems, the uncovering of unknown relationships and much more. Unfortunately, in
the real world, they are often plagued by data
quality problems that can negatively affect
their analysis. The Teradata Database can
shield you from these problems through the
use of recursive queries—as long as you take
a few precautions.
Analyze Levels of Data
Hierarchies are best represented in the form
of a treelike structure with several levels.
Their in-database representation typically
relies on parent-child relationships that, in
the case of an organizational chart, identify
the relationship between employer and
employee. Due to the nature of hierachical structures, data quality issues such
as missing links or cyclic childparent-child relationships can
often complicate the analysis
of data.
Querying Hierarchies
The recursive query syntax was created to enable SQL to query hierarchical data of an unknown depth. Using the WITH clause, you
can define derived tables before the main
query instead of within it. A recursive
query contains at least one reference
to its name within its own definition
and is composed of these parts:
>A
seed query UNIONed
to an iterative (recursive) segment
> At least one logical
condition to prevent infinite loops
from occurring
PAGE 1 l Teradata Magazine l Q3/2011 l ©2011 Teradata Corporation l AR-6393
Simple Hierarchy
figure 1
1
11
111
111
1
11
1111
11111
0
0
2
2
22
1111
11111
22
3
2222
222
1
11
11
33
111
222
1
3
33
333
Hierarchy With Problems
figure 2
111
0
0
2
2
22
22
11111
1
11111
1
111111
111
111111
111
333
2222
3
3
33
33
222
222
2222
2222
Navigate this straightforward hierarchy by following the parentchild relationships.
333
333
22222 22222
This hierarchy has some data problems: The green lines highlight
(a) a cyclic loop (1,11,111,1,…) and (b) a lower-level child item 11111 as
a parent to 222, which is higher up in the hierarchy. There is also a
missing link between 22 and 222.
You can easily visualize how data quality
problems occur by comparing figure 1 to
figure 2. Notice how the link between “22”
and “222” is missing in figure 2, perhaps due
to input error. Also notice that “1,” “11” and
“111” can lock you into a cyclic loop that
never ends.
Recursive queries can be used to analyze
hierarchical data while preventing data
quality problems. As an example, the
recursive SQL syntax shown below has
“safety features” that help identify cyclic
data loops and keep the recursion depth to
a maximum of 10 levels. This syntax can be
repurposed to unravel any hierarchy that
follows the format in table 1 and to let you
go much deeper than just 10 levels.
Some key elements in the syntax are:
> A materialized PATH is composed of
concatenated values that show the route
taken by each level of recursion.
> POSITION determines whether a node
from the hierarchy is being repeated in
the PATH, thus identifying cyclic data.
> The TREE.LEVEL < 10 effectively
stops the query from going any deeper
than 10 levels.
> The WHERE E.PARENT_ID = 0
clause in the SQL is responsible
for defining node 0 as the starting
point. Each step of all possible
routes initiated at node 0 is shown
following the PARENT-CHILD
row order.
It’s not uncommon to be confused
by the syntax of recursive queries, so
it is useful to employ recursive VIEWS
to simplify the analysis of hierarchies.
Using the RECURSIVE VIEW format
enables the use of SET functions such
as MINUS and INTERSECT that allow
comparative analysis between two similar hierarchies.
table 1
Parent-Child Rules
PARENT_ID
SQL SAMPLE 1
WITH RECURSIVE TREE
( LEVEL, PARENT_ID, CHILD_ID, PATH) AS
( SELECT 0 AS LEVEL,
E.PARENT_ID,
E.CHILD_ID,
CAST (E.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TABLE_1 E
WHERE E.PARENT_ID = 0
UNION ALL
SELECT TREE.LEVEL +1 ,
S.PARENT_ID, S.CHILD_ID,
CAST (TREE.PATH || S.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TREE, TABLE_1 S
WHERE TREE.CHILD_ID=S.PARENT_ID
AND TREE.LEVEL < 10
AND POSITION (S.PARENT_ID IN TREE.PATH) < 1
)
SELECT PARENT_ID, CHILD_ID,
MIN(LEVEL)+1 AS DEPTH,
PATH||CHILD_ID AS PATH,
CASE
WHEN (POSITION(CHILD_ID IN PATH)>0)
THEN ‘CYCLIC’
ELSE ‘’ END AS CYCLIC
FROM TREE GROUP BY 1,2,4,5
ORDER BY 4,3,1,2;
PAGE 2 l Teradata Magazine l Q3/2011 l ©2011 Teradata Corporation l AR-6393
CHILD_ID
0
1
0
2
0
3
1
11
2
22
3
33
11
111
11
1111
22
222
33
333
222
2222
1111
11111
By using REPLACE RECURSIVE VIEW,
you can start to analyze changes or anomalies
introduced by data quality issues. ANSI
SQL recursive queries are also available to
Teradata Database users and have been for
many years.
SQL sample 2 shows what syntax with
recursive VIEWS can look like.
If the appropriate row changes are made
to table 1, it’s possible to have the table
represent the hierarchy shown in figure 2
(page 2). Executing the recursive SQL syntax
against a revised table 1 will produce the
results shown in table 2.
You can compare the hierarchy of nodes in
figures 1 and 2 using SQL sample 3.
Note that the SQL before and after the
MINUS is identical to the syntax used
earlier in the recursive SQL syntax to
extract, group and order the results of
the recursive query. A simpler analysis of
changes to a hierarchy can often be carried
out without the need for recursive queries
at all. For example, to find out which nodes
in a hierarchy are considered to be top
nodes, that is, nodes without parents, you
only need to SELECT all PARENT_IDs
MINUS the SELECT of all CHILD_IDs for
the same hierarchy.
In order to obtain the results shown in table
3, it is being assumed that TREE_FIG2_V
and TREE_FIG1_V correspond to recursive
VIEWs in the format shown in the recursive
VIEWs syntax where the underlying TABLEs
contain the parent-child relationships that
correspond to figures 2 and 1, respectively.
Table 3 shows the differences between figures 1
and 2, as originally intended.
Data Quality Protection
Hierarchies, though easy to traverse based
on their parent-child rules, often come
with data quality issues that can lead
to infinite loops. Recursive queries can
protect you from data problems such as
missing links and cyclic loops.
The WITH RECURSIVE syntax can
guide you through hierarchies to find their
depth, detect infinite loops and more. You’ll
be surprised how easy it is to use SQL to
navigate the structures and identify data
quality problems. T
Ralph Meira is a Teradata senior
solution architect focused primarily
on manufacturing accounts.
SQL SAMPLE 2
REPLACE RECURSIVE VIEW TREE_V
( LEVEL, PARENT_ID, CHILD_ID, PATH) AS
( SELECT 0 AS LEVEL,
E.PARENT_ID,
E.CHILD_ID,
CAST (E.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TABLE_1 E
WHERE E.PARENT_ID = 0
UNION ALL
SELECT TREE_V.LEVEL +1 ,
S.PARENT_ID, S.CHILD_ID,
CAST (TREE_V.PATH || S.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TREE_V, TABLE_1 S
WHERE TREE_V.CHILD_ID=S.PARENT_ID
AND TREE_V.LEVEL < 10
AND POSITION (S.PARENT_ID IN TREE_V.PATH)
<1
);
SQL SAMPLE 3
SELECT PARENT_ID, CHILD_ID,
MIN(LEVEL)+1 AS DEPTH,
PATH||CHILD_ID AS PATH,
CASE
WHEN ( POSITION(CHILD_ID IN PATH) > 0)
THEN ‘CYCLIC’ ELSE ‘’ END AS CYCLIC
FROM TREE_FIG2_V GROUP BY 1,2,4,5
MINUS
SELECT PARENT_ID, CHILD_ID,
MIN(LEVEL)+1 AS DEPTH,
PATH||CHILD_ID AS PATH,
CASE
WHEN ( POSITION(CHILD_ID IN PATH) > 0)
THEN ‘CYCLIC’ ELSE ‘’ END AS CYCLIC
FROM TREE_FIG1_V GROUP BY 1,2,4,5
ORDER BY 4,3,1,2;
Paths From Node 0
table 2
PARENT_ID
CHILD_ID
DEPTH
PATH
CYCLIC
0
0
1
0
1
1
1
2
0
1
11
11
111
3
0
1
11
111
111
1
4
0
1
11
111
11
1111
3
0
1
11
1111
1111
11111
4
0
1
11
1111
11111
11111
222
5
0
1
11
1111
11111
222
222
2222
6
0
1
11
1111
11111
222
2222
2222
22222
7
0
1
11
1111
11111
222
2222
0
2
1
0
2
2
22
2
0
2
22
22
222
3
0
2
22
222
222
2222
4
0
2
22
222
2222
2222
22222
5
0
2
22
222
2222
0
3
1
0
3
3
33
2
0
3
33
33
333
3
0
3
33
333
333
2222
4
0
3
33
333
2222
2222
22222
5
0
3
33
333
2222
33
22222
3
0
3
33
22222
1
CYCLIC
22222
22222
22222
There are many available paths when starting at Node 0. The cyclic path is correctly
flagged in the last column.
Different Paths
table 3
PARENT_ID
CHILD_ID
DEPTH
PATH
CYCLIC
111
1
4
0
1
11
111
1
CYCLIC
1111
11111
4
0
1
11
1111
11111
222
11111
222
5
0
1
11
1111
11111
222
2222
222
2222
6
0
1
11
1111
11111
222
2222
2222
22222
7
0
1
11
1111
11111
22
222
3
0
2
22
222
2222
4
0
2
22
222
2222
2222
22222
5
0
2
22
222
2222
333
2222
4
0
3
33
333
2222
2222
22222
5
0
3
33
333
2222
33
22222
3
0
3
33
22222
22222
222
22222
22222
This table shows how the paths differ between figures 1 and 2.
xx
l 3TeradataMagazine.com
PAGE
l Teradata Magazine l Q3/2011 l ©2011 Teradata Corporation l AR-6393
QX/201X l TDM l XX