White Paper What’s Cool about Columns (and how to extend their benefits)

A White Paper by Bloor Research
Author : Philip Howard
Publish date : November 2010
…the use of columns is not
a panacea. In fact, Infobright
demonstrates this very clearly
by its improvement on the
fundamental architecture
of column-based relational
databases
Philip Howard
Free copies of this publication have been
sponsored by
Executive summary
It hardly needs reiterating that companies
and other organisations are under increasing
pressure to understand their customers, work
ever more closely with their suppliers, evaluate
their own performance, improve their competitive position, and generally take advantage of
whatever business opportunities arise. Add to
this the huge growth in information generated
by the Internet as well as specialist technologies such as RFID (radio frequency identification) and event processing, plus the need to
retain data for extended periods of time for
compliance reasons, and it is not surprising
that business intelligence and query systems in
general are under increasing pressure. Worse,
demands on this information are increasingly
widespread and of a real-time nature.
Historically, all long-term storage of data for
query purposes has relied on data warehouses
and data marts that have been traditionally supplied by vendors using conventional
relational databases. However, for more than a
decade a sub-genre of the relational database
has been making inroads into the market, using a technology known as a column-based
relational database (sometimes referred to as
CBRD). While we will explain the differences
between a conventional and a column-based
approach in due course, for the moment just
think of it as an ordinary relational database
(using SQL and so forth) that inserts and reads
columns of data instead of rows of data.
For much of the last decade the use of column-based approaches has been very much a niche
activity. However, with a substantial number
of vendors now active in the market, with over
2,000 customers (many of them Global 2000
companies) between them, we believe that it
is time for columns to step out of the shadows
to become a major force in the data warehouse
and associated markets. Given that view, this
paper will define what column-based relational databases do and how they do it, and where
they have advantages (and disadvantages)
compared to traditional approaches. This, in
turn, will lead to a discussion of the sort of environments for which columns are best suited.
Up to this point all discussions in this paper
are generic. However, we will conclude with a
section on Infobright, a column-based vendor,
discussing how its technology has extended
the column-based paradigm to provide additional performance and other benefits.
© 2010 Bloor Research
The general problem
There is a popular joke in which a traveler asks
a local for directions to a well-known location. The latter, after a pause, replies that he
wouldn’t start from here if that was where he
was going.
Relational databases are in an analogous
position. Having dominated the market for
OLTP, the relational vendors have extended
the capabilities of their products into areas for
which relational technology was not originally
designed. The question is whether you would
start from a relational base if you began with a
blank sheet of paper.
Traditional relational databases were designed initially to process transactions. In
this environment you manipulate individual
transactions one at a time. Each transaction is
represented by one or more rows that have to
be inserted or modified in one or more tables.
When you are processing a query, on the other
hand, you typically work with one or more tables, from which data is selected according to
certain criteria—and those criteria are nearly
always column-based. For example, the query
“retrieve the names of all personnel who are
females over the age of 50” would typically be
resolved based on columns for name, gender
and date of birth. So, while transactions are
naturally row-based, queries are naturally
column-based.
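As a rough illustration of the distinction, the sketch below holds the same personnel data in a row layout and in a column layout and runs the query above against each. The sample records, the `department` field, the fixed reference date and the function names are all assumptions made for the example; it is not how any particular database is implemented.

```python
from datetime import date

# Hypothetical personnel data in a row layout: one record per person.
rows = [
    {"name": "Ann", "gender": "F", "date_of_birth": date(1950, 3, 1), "department": "Sales"},
    {"name": "Bob", "gender": "M", "date_of_birth": date(1948, 7, 9), "department": "IT"},
    {"name": "Cat", "gender": "F", "date_of_birth": date(1985, 1, 20), "department": "IT"},
]

# The same data in a column layout: one list per column, positions aligned.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def age_on(born, today):
    """Whole years elapsed between `born` and `today`."""
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))


def females_over_50_rows(rows, today):
    # Row-based: each complete record is fetched, although only three fields are used.
    return [r["name"] for r in rows
            if r["gender"] == "F" and age_on(r["date_of_birth"], today) > 50]


def females_over_50_columns(columns, today):
    # Column-based: only the name, gender and date-of-birth columns are read.
    needed = zip(columns["name"], columns["gender"], columns["date_of_birth"])
    return [name for name, gender, born in needed
            if gender == "F" and age_on(born, today) > 50]


TODAY = date(2010, 11, 1)  # fixed reference date so the result is reproducible
row_result = females_over_50_rows(rows, TODAY)
column_result = females_over_50_columns(columns, TODAY)
```

Both functions return the same names; the difference lies in how much data each has to touch to get there.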
There is, of course, a major performance issue involved here. If you are using a simple
row-based paradigm then you retrieve all of
the personnel records and then search them
according to the defined criteria. This will
obviously be relatively long-winded and slow.
A number of opportunities for improving performance present themselves in a relational
environment.
The first is to implement indexes on each of the
relevant columns. There are two problems with
this. First, you cannot necessarily predict the
queries that business analysts and managers
may choose to make. This means that there
may be a requirement for an ad hoc enquiry
that involves a search against a non-indexed
field. So, the first problem is that indexes cannot be guaranteed to be universally applicable.
Secondly, indexes impose a significant overhead in terms of additional disk capacity. It is
by no means unusual for a heavily indexed database (together with other constructs) to take
up five times (or more) the space
that would be required for the data alone. Nor
is this just a question of additional hardware
requirement. Even at a simple level, it means
that you at least double the amount of I/O
required, which necessarily impairs performance. Moreover, this hit also applies when you
are updating your data warehouse, and you will
also incur overheads for index maintenance.
So:
1. We would like a way around the problem
of indexes: implementing multiple indexes
improves performance for specific queries
(and therefore may be useful in specific
cases) but causes a performance overhead
in more general terms.
To meet this need, traditional approaches implement other performance features in order
to cover the downside created by the need to
hold indexes in addition to data. The first of
these is data compression. The problem with
this is that individual fields in a row tend to be
represented by different datatypes. This makes
compression complex to implement since different compression algorithms will be better
suited to different datatypes. Moreover, because each row has to be
compressed in its entirety, you
are forced to adopt a relatively low-level approach to compression. In the recent past the
leading relational database vendors have all
introduced (or announced) more sophisticated
compression algorithms for their databases.
However, they still do not match up to those
offered by column-based vendors. Moreover,
there is an administrative overhead in applying this compression on a row basis. We will
discuss this in due course.
Another performance enhancement adopted
by most of the major data warehousing vendors is the extensive use of parallelism. This
applies both within the software and in making
use of hardware facilities that can be provided
by the platform. Both of these forms of parallelism have associated problems.
In the case of database parallelism the problem is that the parallelism is (just like the
use of indexes) designed to improve the performance of individual queries (with facilities
such as parallel sorting, for example). This, of
course, is a tacit admission that there is something wrong with their performance in the first
place. However, that is not the main point.
What you really would like to do with parallelism is use it to ensure that a mix of queries
can run simultaneously (especially, using
divide and conquer rather than a sequential
approach) with optimum results, but this, of
course, pre-supposes that individual queries
will perform well in the first place.
The most significant point arising out of support for hardware-based parallelism is the
use of disk partitioning. This is the ability to
split a table across multiple disks, which
is normally done through what is known as
horizontal partitioning. This can be achieved
in a variety of ways such as hash partitioning,
round-robin partitioning and so on, but what
these methods have in common is that they
partition tables by row, so that some rows are
stored on one disk and some on another. The
problem with this approach is that it is difficult
to ensure that each partition is roughly the
same size as all the others. If this ceases to
be the case for some reason then you start to
lose any performance benefits. That is, if one
partition becomes significantly larger than
the others then it is, by definition, being accessed more frequently, which in turn means
that its I/O performance must deteriorate.
The different forms of partitioning algorithm
are designed to combat this problem, but they
are neither infallible nor universally available.
Should an imbalance occur, then a major redistribution of data, with all the consequences
that that implies, may be required to rectify
the problem.
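The imbalance described above can be sketched in a few lines. The example assumes a hypothetical table whose partitioning key (a country code) is heavily skewed, and uses CRC32 purely as a convenient deterministic hash; real products use their own partitioning functions.

```python
import zlib
from collections import Counter


def hash_partition_sizes(keys, n_partitions):
    """Rows per partition when each row is placed by hashing its key."""
    sizes = Counter()
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % n_partitions] += 1
    return [sizes[p] for p in range(n_partitions)]


def round_robin_partition_sizes(keys, n_partitions):
    """Rows per partition when rows are dealt out in turn, ignoring the key."""
    sizes = [0] * n_partitions
    for position, _ in enumerate(keys):
        sizes[position % n_partitions] += 1
    return sizes


# Hypothetical skewed data: 70% of the rows share a single key value.
keys = ["UK"] * 700 + ["FR"] * 100 + ["DE"] * 100 + ["US"] * 100

hashed = hash_partition_sizes(keys, 4)
dealt = round_robin_partition_sizes(keys, 4)
```

With this data every "UK" row hashes to the same partition, so one partition holds at least 700 of the 1,000 rows, whereas round-robin placement gives four partitions of 250 rows each. Round-robin keeps sizes even but gives up the ability to locate a row from its key, which is one reason neither method suits every case.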
Finally, a further hardware solution is proposed by many vendors, by which they suggest that the best approach to business
intelligence is to implement a central data
warehouse supporting satellite data marts. It
is the latter, which hold only limited subsets
of data, that are to be used to fulfill query requirements. By limiting the scope of information held you obviously reduce the size of the
database and, thence, the performance issues
that will arise.
We can therefore further extend our list of requirements (apart from fixing the index problem) by saying that we need:
2. A way around the problems of horizontal
partitioning.
3. A form of parallelism that optimises cross-query loading rather than individual queries, predicated on the assumption that we
already have decent performance for individual queries (no matter how complex).
The problem with conventional data warehousing is that it is not as all-embracing as
it might appear on the surface or as the vendors of such solutions might have you believe.
The reason for this is that in order to provide
the best possible performance to the largest
number of users, warehouses are significantly
pre-designed. While logically this may be a reflection of the business model that underpins
the warehouse, in physical terms this means
the pre-definition of indexes and index clustering, specified data partitioning, particular uses
of parallel disk striping, specially identified
pre-joined tables and so on.
Now, these techniques have an important role
to play when it comes to improving performance. However, they all pre-suppose that you
know, in advance, what you are going to do
with your data. The problem is that this is not
always the case. In particular, the data warehouse you plan today may not take account of
the exigencies of tomorrow.
To reduce this argument to its simplest level:
if you have two queries that you want to run
against a data source then you can optimise
the database design for one query or for the
other, but not both. For two queries you can
probably produce a hybridised structure that
will provide acceptable performance for most
users most of the time. However, the greater
the number of query types that you have to support, the more compromises you have to make.
As any data warehousing administrator will tell
you, left to their own devices just a handful of
business analysts, pursuing their own trains of
thought, can bring any data warehouse to its
knees. This is why all the popular systems have
query governors that can limit the resources
that will be assigned to any particular query.
However, query governors tend to penalise
queries that fall outside the parameters that
were originally considered when the data
warehouse was constructed. Moreover, and
here’s the rub, it is precisely those queries that
fall outside the scope of the initial conception
of the data warehouse that can bring the biggest benefit to the business. This is because
the issues raised by predictable enquiries are,
almost by definition, the ones that the company already knows how to deal with. It is the
unpredictable that often offers the greatest
threat, or the greatest reward, to an organisation, and it is precisely these questions that
a conventional data warehouse is least well
equipped to answer.
On the basis of these discussions we can further define some of the features that we would
like from a data warehouse:
4. Flexibility is paramount. The whole point
is that what we want is an “ask anything”
warehouse.
5. It should be possible to store interim results. That is, you may want to perform a
query and use the output from that query
as a part of the input to another.
6. It should be easy to administer.
7. It should be cost effective and offer a return
on investment in as short a timescale as is
reasonable.
8. It should be efficient in terms of its resources, both in machine and personnel
terms. In particular, the business analyst
pursuing a line-of-thought enquiry should
be able to follow this train through to its
end from his desktop, without requiring
outside assistance of any sort.
9. Performance is also fundamental. While
different queries will obviously take different lengths of time, typical responses
should be in seconds, or minutes at most.
10. In modern-day enterprise data warehouses
there is a growing requirement to support a
much larger number of users/queries than
was previously the case and, at the same
time, a much broader range of query types.
Thus user scalability is as much an issue
as conventional concerns in terms of disk
scalability.
Size matters
For obvious reasons there is a continuing emphasis, in all forms of business intelligence
environments, on providing improved performance. To run queries faster, to run more
queries simultaneously, to run more complex
queries, to run against larger datasets. One
way to achieve this is by investing in more, and
faster, hardware. However, a more fruitful approach is to achieve this through software; and
one of the most useful ways of achieving this
is by reducing the size of the database, not in
logical terms, but physically. If, for example,
you have two databases that contain the same
data, and one requires 10TB and the other requires 5TB then, all other things being equal, it
will take less time to search through the latter
than the former.
Of course, there are also other benefits of a
smaller database. For example, it takes less
time to load it, it requires less maintenance
and tuning, it has a smaller footprint and it
requires less in the way of cooling and power
requirements.
The first and most obvious way of reducing the
size of a database is to use compression techniques. However, the problem with this is that
the different fields in a data row all have different attributes and it is therefore not possible,
using conventional databases, to optimise
the compression beyond a certain point. New
introductions in the databases of the leading
relational vendors have improved on this of
late but the facilities offered are still not as
capable as they might be: we will discuss this
further later.
Another way to reduce the size of a database is
to reduce the number of indexes. Typically, the
indexes in a data warehouse take up as much
space as the data itself, in effect doubling
the size of the database. Indeed, it is not uncommon for
indexes, together with other constructs such as
materialised views, to make the resulting data warehouse as much as 8 times the
size of the raw data.
However, in conventional environments, the
more indexes you remove the slower individual queries run. In effect this is a
Catch-22 situation.
We can therefore add to our list of requirements:
11. We would like to reduce the size of the
database.
12. We would like to minimise the number of
indexes (and other constructs) that we
need to define.
The specific problem
While we have identified a number of issues
that traditional data warehousing solutions
face there are also specific issues that relate
to particular types of query.
Unpredictable queries
Unpredictable queries (as used in exploratory
analysis) are, by definition, those where you do
not know in advance what the user may want
to find out. These pose a number of problems,
including:
• You do not know whether the answer to a
query will require aggregated data or access
to transaction-level data. If the answer can
be satisfied by aggregated data then an approach that uses some form of OLAP (on-line
analytic processing) solution may be appropriate. However, if transaction-level access
is required then an OLAP-based approach
will slow down significantly, particularly if
the requirement is substantial.
• Even where a query may be satisfied through
use of pre-aggregated data, the nature of
unpredictable queries is such that you cannot guarantee that the correct aggregations
are available. If they are not then the cube
will need to be re-generated with the desired
aggregations. Not only will this take time in
itself but it will also usually mean recourse
to the IT department, where it will typically
join a long queue of work to be done. The
user will be lucky to get a response to his
query within weeks—a time lapse of months
is more likely.
• A third problem with OLAP and unpredictable queries is that the cube has pre-defined
dimensions and hierarchies. If a user happens to want to pose a question that has aspects that fall outside of these parameters
then, again, it will be necessary to redefine
and re-generate the cube with all the likely
delays outlined above.
• Thus OLAP-based approaches cannot cope
with unpredictable queries, except in very
limited circumstances. This means that recourse will have to be made to the main data
warehouse (or a suitable data mart). However, conventional relational databases also
have problems with unpredictable queries.
As far as an RDBMS is concerned, the problem
with unpredictable queries is that appropriate
indexes may not be defined. While the database optimiser can re-write badly constructed
SQL, determine the most efficient joins and
optimise the query path in general, it cannot
make up for any lack of indexes. In practice,
if a column is not indexed at all, then this will
usually mean that the query has to perform a
full table scan, and if this is a large table (see
below) then there will be a substantial performance hit as a result.
Of course, the obvious route to take is to build
indexes on every conceivable column. Unfortunately this is not usually practical. While every
index you build will help to improve the performance of queries that use that index, this
is subject to the law of diminishing returns.
Every index you add to the database increases
the size of the database as a whole, doubles
the maintenance whenever that column is updated (because you have to update the index as
well) and doubles I/O requirements in a similar
fashion. All of this means that database performance as a whole deteriorates, not to mention the time spent in tuning indexes to provide
best performance.
For these reasons, users tend to be severely
restricted in the degree of unpredictability that
they are allowed. Even ad hoc query tools tend
to be limited in what they allow the user to ask.
If you want to go outside those parameters
then you will be obliged to refer to the IT department, for them to program your query for
you, with all the attendant delays that that involves. Even when you can define such queries
through your front-end tool, all too often your
question will be cut short by the database’s
query limiter because it takes too long to run
or consumes too much resource.
Complexity
There is no hard and fast definition of what
constitutes a complex query. However, we can
say that they typically involve transaction-level
data, usually depend on multiple business
rules requiring multiple joins, and are often
forced to resort to full table scans. Perhaps a
reasonable definition would be that a complex
query always involves multiple set operations.
That is, you make a selection and then, based
on the result of that selection, go on to make
further selections. In other words complexity
involves recursive set operations.
In non-technical terms, complex queries often
involve a requirement to contrast and compare
different data sets. Some typical complex queries are as follows:
• “To what extent has our new service cannibalised
existing products?” – that is, which
customers are using the new service instead
of the old ones, rather than as an addition.
• “List the top 10% of customers most likely to
respond to our new marketing campaign.”
• “Which good shoppers shop where bad shoppers shop?”
• “What aspects of a bill are most likely to lead to
customer defection?”
• “Are employees more likely to be sick when
they are overdue for a holiday?”
• “Which promotions shorten sales cycles the
most?”
Consider just the question about the top 10% of
customers. In order to answer this question we
need to analyse previous marketing campaigns,
understand which customers responded (which
is not easy in itself: it often means a time-lapsed comparison between the campaign and
subsequent purchases), and identify common
characteristics shared by those customers. We
then need to search for recipients of the campaign that share those characteristics and rank
them (how to do this may require significant
input) according to the closeness of their match
to the identified characteristics.
It is unlikely that anyone would question the
premise that this is a complex query. You could
answer it using a conventional relational database but it would be time-consuming and slow,
not to mention difficult to program. It would be
impossible using an OLAP-based approach.
It might be thought that you could use data
mining technology for these sorts of complex
queries. However, there are three problems
with data mining. First, these products are typically the domain of specialist business analysts
rather than ordinary business users. Second, if
we refer to this particular query about the top
10% of customers, then the volume of data to
be processed could well be prohibitive. Third,
even if you could use data mining techniques
then what you get back will be a predictive
model, which is not particularly useful when
what you want is a set of customers.
It might also appear reasonable to argue that
these sorts of queries can satisfactorily be
answered by analytic applications both from
CRM vendors and specialist suppliers. To a certain extent this is true, but where complex queries are supported as part of such an application
they have typically been purpose-built, and most
analytic applications will not support the full
range of complex queries that you might want to
ask, in particular because they cannot support
the unpredictable nature of such questions.
Finally, another aspect of complexity is what
is sometimes called a ‘predicate’. These are
selection criteria such as those based on
sex, age, house ownership, annual income,
social classification, location and so on. As
companies want to understand their customers better, the range of these predicates, and
their combinations, is substantially increasing. However, this complexity places an added
strain on query performance. In particular, the
efficiency of the database in evaluating these
predicates is of increasing importance.
It should be noted that a similar concern exists for e-commerce based search engines.
For example, in the travel industry you want
to make it easy for clients to select holidays
based on a wide range of different predicates
such as average temperature, distance from
the beach, whether rooms have air conditioning and so on.
Large table scans
There is nothing clever or mystical about this.
Certain types of queries require that the whole
of a table must be scanned. Some of these
arise when there are no available indexes, or
from the sorts of complex queries described
above. However, very much simpler queries
can also give rise to full table scans. Two such
are quoted by the Winter Corporation in its
white paper “Efficient Data Warehousing for a
New Era”. These are:
• “List the full name and email address for
customers born in July” – given that one
in 12 customers are born in July a typical database optimiser will not consider it
worthwhile to use an index, and it will conduct a full table scan. If you have 10 million
customers for each of whom you store 3,200
bytes, say, then this will mean reading a total
of 32,000,000,000 bytes. As we will see later
a column-based database could reduce this
by a factor of more than 100.
• “Count the married, employed customers who
own their own home” – if we assume the
database as above, then conventional approaches still mean reading 32,000,000,000
bytes. In this case, however, column-based
approaches can achieve improvements
measured in thousands of times.
A point of warning: this is the first time we
have mentioned some of the performance
improvements that can be reached by using
column-based products—you may not believe
them. If you are not familiar with the technologies covered here, then talk of hundreds or
thousands of times performance benefits may
seem outlandish and unreasonable. However,
they are true. We will discuss how these can be
achieved (and when they cannot) in due course.
Time-based queries
We are referring here to time-lapse queries
rather than time per se. This is particularly
important because it is often desirable to study
people’s behaviour and activities over time.
Take a very simple example:
“Which customers bought barbecues within
7 days of ordering patio furniture?” In order to
answer this sort of query you need to search
the database to find out who bought barbecues
and then scan for patio furniture buying within
the required time period.
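The two-step search just described can be sketched as follows. The purchase records, product names and the function `bought_within` are hypothetical, and the sketch works on an in-memory list rather than a database; it illustrates the logic of a time-lapse query only, not its performance.

```python
from datetime import date, timedelta

# Hypothetical transaction-level data: (customer, product, order date).
purchases = [
    ("alice", "patio furniture", date(2010, 6, 1)),
    ("alice", "barbecue", date(2010, 6, 5)),    # 4 days later: qualifies
    ("bob", "patio furniture", date(2010, 6, 1)),
    ("bob", "barbecue", date(2010, 6, 20)),     # 19 days later: too late
    ("carol", "barbecue", date(2010, 6, 2)),    # never ordered patio furniture
]


def bought_within(purchases, first, second, days):
    """Customers who bought `second` within `days` days of ordering `first`."""
    window = timedelta(days=days)

    # Pass 1: find out who bought `second`, and when.
    second_dates = {}
    for customer, product, ordered_on in purchases:
        if product == second:
            second_dates.setdefault(customer, []).append(ordered_on)

    # Pass 2: scan for orders of `first` that fall inside the time window.
    matches = set()
    for customer, product, ordered_on in purchases:
        if product != first:
            continue
        for later in second_dates.get(customer, []):
            if timedelta(0) <= later - ordered_on <= window:
                matches.add(customer)
    return sorted(matches)


result = bought_within(purchases, "patio furniture", "barbecue", 7)
```

Note that both passes read the whole purchase history, which is why queries of this kind tend to imply large scans of transaction-level data.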
You cannot easily answer this sort of query
using either conventional relational databases
or OLAP cubes. In the case of an OLAP solution, you would have to organise your cube by
the shortest time period you are ever going to
measure against (days in this case) and then
you count cells for seven days. Unfortunately,
this will mean very large cubes (30 times typical sizes today, which are most commonly implemented by month) and such queries would
therefore be extremely inefficient. Moreover,
the question posed is based on transaction-level detail in any case, which will not be contained in a cube.
Row-based relational databases cannot cope
well with this sort of query either. Not, in this
case, because the information isn’t there but
because SQL is not very good at coping with
this sort of query: in SQL ’99 you could do it
but it is very complex, whereas in SQL ’92 you
would have to use a multi-pass approach. Of
course, some vendors have specialised data
extenders that handle time series problems of
this type, while others have extended their versions of SQL but then you would still have the
performance issues arising from large table
scans, as identified above.
Qualitative/quantitative queries
This is another straightforward issue. There
are occasions, such as if you wanted to compare the performance of different medical
teams carrying out surgery to treat a particular
condition, in which it is useful to be able to employ text searching against a data warehouse.
To answer this sort of query in a conventional
environment you would typically have to combine quantitative results from the data warehouse with qualitative details extracted from
a content management or document management system.
In practice, of course, it would not be difficult to
build text indexes and search capabilities into
conventional data warehouses. However, it is
likely that the sorts of queries that would require this capability would fall into the complex
category described above and, for this reason,
would be prone to the poor performance that
is symptomatic of relational and OLAP-based
approaches to complex queries.
Combination/workload issues
It is important that the foregoing, and traditional queries and reports, are not treated in isolation. Certainly, there is significant demand for
what we might call ‘analytic warehouses’ but
there is also a demand to support all of these
query types along with others, such as look-up queries in what is sometimes referred to
as EDW (enterprise data warehouse) 2.0. This
predicates a mix of query types and users that
may run into thousands or, in the future, tens
of thousands of concurrent queries.
Column-based databases: a query solution
Conventional approaches to data warehousing
use traditional relational databases. However, these were originally designed to support transaction processing (OLTP) and do not
have an architecture specifically designed for
supporting queries. Column-based relational
databases, on the other hand, have been designed from the ground up with that specific
goal in mind.
A column-based relational database is exactly
what its name suggests, a relational database
(using conventional set algebra, SQL and so
on) that stores and retrieves data by column
instead of by row. In all other respects it is conceptually identical to a conventional relational
database. So the use of such a product does
not require any re-training, and does not need
the user to learn any new concepts (except any
that may be specific to a particular vendor).
The change from rows to columns may seem
a trivial one but it does, in fact, have profound
consequences, which we now need to examine
in some detail. In particular, we need to consider the impact of using columns with respect
to indexes because there are many circumstances where it is not necessary to define an
index when using a column-based approach.
For example, suppose that you simply want to
list all customers by name. Using a standard
relational database you would define an index
against the name and then use that to access
the data. Now consider the same situation
from a column-based perspective. In effect,
the column is the index. So you don’t need to
define a separate index for this sort of query.
The effect is quite dramatic. Not only do you
not have the overhead (in disk space, maintenance and so forth) of an index, you also halve
the number of I/Os required, because you don’t
have to read the index prior to every data read.
Further, because indexes and columns are so
closely aligned, it is a relatively easy process
for column-based products to provide automated indexing capabilities where that is appropriate, though some vendors, particularly
those employing large-scale parallelism, eschew indexes altogether.
To put it baldly, the use of columns enables
you to answer certain types of query, especially those highlighted in the previous section,
much more quickly than would otherwise be
the case, so we will begin by considering these
query types. In the following section we will discuss the importance (on query performance) of
other consequences of using a column-based
approach such as compression.
Unpredictable queries
As we have seen, a column is equivalent to an
index but without any of the overhead incurred
by having to define an index. It is as if you had
a conventional database with an index on every
column. It should be easy to see, therefore, that
if you are undertaking some exploratory analysis using unpredictable queries then these
should run just as quickly as predictable ones
when using a column-based approach. Moreover, all sorts of queries (with the exception
of row-based look-up queries) will run faster
than when using a traditional approach, all
other things being equal, precisely because of
the reduced I/O involved in not having indexes.
The same considerations apply to quantitative/
qualitative queries.
Complex queries
Complex queries tend to be slow or, in some
cases, simply not achievable, not because of
their complexity per se but because they combine elements of unpredictable queries and
time-based or quantitative/qualitative queries
and they frequently require whole table scans.
Column-based approaches make complex
queries feasible precisely because they optimise the capability of the warehouse in all of
these other areas.
Large table scans
It is usually the case that queries are only
interested in a limited subset of the data in
each row. However, when using a traditional
approach it is necessary to read each row in
its entirety. This is wasteful in the extreme.
Column-based approaches simply read the
relevant data from each column.
If we return to the question posed previously:
“List the full name and email address for customers born in July” then if the row consists of
3,200 bytes and there are ten million rows then
the total read requirement for a conventional
relational database is 32,000,000,000 bytes.
However, if we assume that the date of birth
field consists of 4 bytes, and the full name and
email addresses both consist of 25 characters,
then the total amount of data that needs to be
read from each row is just 54 bytes if you are
using a column-based approach. This makes
a total read requirement of 540,000,000 bytes.
This represents a 59.26-fold reduction,
and this is before we take other factors into
account, so it is hardly surprising then that
column-based approaches provide dramatically improved performance.
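The arithmetic above can be checked with a short sketch. The byte sizes and row count are the ones given in the text; the function and variable names are illustrative only, and the model deliberately ignores compression, page sizes and any other real-world I/O effects.

```python
# Figures taken from the worked example in the text.
ROW_BYTES = 3_200          # size of one full row
ROW_COUNT = 10_000_000     # number of customer rows
# Columns the query touches: date of birth, full name, email address.
COLUMN_BYTES = {"date_of_birth": 4, "full_name": 25, "email": 25}


def row_store_bytes(row_bytes: int, rows: int) -> int:
    """A row store reads every row in its entirety."""
    return row_bytes * rows


def column_store_bytes(column_bytes: dict, rows: int) -> int:
    """A column store reads only the columns the query references."""
    return sum(column_bytes.values()) * rows


row_total = row_store_bytes(ROW_BYTES, ROW_COUNT)
col_total = column_store_bytes(COLUMN_BYTES, ROW_COUNT)
reduction = row_total / col_total

print(row_total, col_total, round(reduction, 2))
# prints: 32000000000 540000000 59.26
```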
It should be noted that this advantage is not
necessarily all one way. Each column you need
to retrieve needs to be accessed separately
whereas you can retrieve an entire row in a
single read. So the greater the amount of the
information that you need from a row the less
performance advantage that a column-based
approach offers. To take a simplistic example,
if you want to read a single row then that is
one read. If that row has 15 columns then that
is, in theory, 15 reads, so there is a trade-off
between the number of rows you want to read
versus the number of columns, together with
the overhead of finding the rows/columns you
need to read in the first place.
A further consideration is that there is a
class of query that can be answered directly
from an index. These are what are known as
“count queries”. For example, the question
posed previously: “Count the married, employed
customers who own a house.” If you have a
row-based database and you have appropriate indexes defined then you can resolve these
queries without having to read the data at all
(see box for details). Of course, in the case of
a column-based database the data is the index
(or vice versa) so you should always be able to
answer count queries in this way.
Let us assume that relevant indexes are available in a row-based database. If you compare
the advantage of this approach to using a
standard method (using the same 3,200 byte
records with 10 million customers as above)
then you get a performance advantage that
works out at a factor of more than 8,500. So, you get this
advantage for row-based approaches if the
right subset of indexes is available. If it isn’t
you don’t. Using a column-based approach you
always get this advantage. Further, it should
be noted that count queries extend to arithmetic comparisons (greater than, less than and so
on) and ordering queries as well, since all of
these results can be derived directly using the
same approach.
Time-based queries
The issue here is not so much one of performance but more one of whether relevant queries
are possible at all. This is because a) you need
the extended SQL (or other approach) in order
to handle time lapse queries and b) you need
the ability to store time-stamped transactions.
Neither of these is typically the case with traditional purveyors of data warehousing. Conversely, there are a number of column-based
vendors that provide exactly such an approach.
Note that there are a number of use cases
that require such capabilities that go beyond
conventional warehouse environments. For
example, in telecommunications it is mandated that companies must retain call detail
records, against which relevant queries can be
run, often on a time-limited basis. Similarly,
you will want to be able to run time-based queries against log information (from databases,
system logs, web logs and so forth) as well as
emails and other corporate data that you may
need for evidentiary reasons.
To support count queries conventionally you need bit-mapped indexes defined against
the relevant tables; then you can read the bitmaps, intersect them and count the results
without reference to the underlying data.
As it happens, all the leading data warehouse vendors offer some form of bit-mapping
so that it could be argued that this is simply an illustration of the advantages of bit-mapping over (say) B-trees. However, it is not as simple as that. Bit-maps are usually
only applied for numeric data so there is a limit to what you can bit-map in a conventional environment. Moreover, conventional approaches tend to have sparsity issues
(see later). However, if you can combine tokenisation (see later) with bit-mapping, as a
number of column-based vendors do, or you can make use of vector processing (as is
the case with some other column-based suppliers) then you can, in effect, bit-map any
column. So, while the benefits of bit-mapping may be constant, they are more widely
applicable within column-based approaches.
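The read, intersect and count steps described in the box can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the customer data is invented, Python integers stand in for bit-maps, and the helper name build_bitmap is hypothetical.

```python
# Hypothetical toy data: one tuple per customer (married, employed, owns_house).
customers = [
    (True, True, True),
    (True, False, True),
    (False, True, True),
    (True, True, False),
    (True, True, True),
]


def build_bitmap(flags):
    """Pack a sequence of booleans into an int; bit i is set when row i matches."""
    bitmap = 0
    for row_id, flag in enumerate(flags):
        if flag:
            bitmap |= 1 << row_id
    return bitmap


married = build_bitmap(c[0] for c in customers)
employed = build_bitmap(c[1] for c in customers)
owns_house = build_bitmap(c[2] for c in customers)

# Intersect the three bit-maps and count the set bits; no row is read here.
matches = married & employed & owns_house
count = bin(matches).count("1")

print(count)
# prints: 2
```

Once the bit-maps exist, the query "Count the married, employed customers who own a house" touches only three integers, which is the point the box makes about not referring to the underlying data.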
Column-based databases: generalised functions
There is no question that column-based databases typically outperform, by a significant
margin, their row-based counterparts. However, there are other considerations when it
comes to using columns over and above those
associated with not needing to define indexes
or other constructs. The main issues are discussed in the following sections.
Compression
One of the major advantages that a column-based approach has, and perhaps the easiest
to understand, is its effect on compression.
Because you are storing data by column, and
each column consists of a single datatype
(often with recurring and/or similar values),
it is possible to apply optimal compression
algorithms for each column. This may seem
like a small change (in some senses it is) but
the difference in database size can be very significant when compared to other approaches.
Moreover, there is a performance benefit: because there is more data held within a specific
space you can read more data with a single I/O,
which means fewer I/Os per query and therefore better performance. Of course, the better
the compression the greater the performance
improvement and the smaller the overall
warehouse, with all of the cost benefits that
that implies.
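To illustrate why a column of recurring values compresses well, here is a minimal sketch using run-length encoding. Run-length encoding is chosen here purely as an example of a datatype-appropriate algorithm; the text does not name the algorithms that vendors actually use, and the column contents and function names are assumptions.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in pairs for _ in range(length)]


# Hypothetical "state" column, stored contiguously as a column store would.
state_column = ["Michigan"] * 4 + ["New York"] * 3 + ["Ohio"] * 2
encoded = rle_encode(state_column)

print(encoded)
# prints: [('Michigan', 4), ('New York', 3), ('Ohio', 2)]
```

Nine stored values shrink to three pairs. In a row-based layout the same state values would be interleaved with the other fields of each row, so the runs would not be adjacent and this particular technique would gain far less.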
As we have already mentioned, in the recent
past the merchant database vendors have
started to introduce more sophisticated compression algorithms (in at least one case,
using tokenisation—see later), which has significantly improved their ability to compress
data, so the advantages that columns can offer
in this area are not as significant as they once
were. For example, you might get a typical average compression ratio of 75% (depending on
the type of data) from a column-based vendor
(some suppliers can do significantly better
than this) whereas 50–60% might be more
typical for a row-based database.
However, this isn’t the only issue: there is, of
course, an overhead involved in de-compressing the data (and indexes) in order to process
the data. This means a performance hit, so for
small tables it is usually not worth compressing the data because it will slow queries down.
In other words there is an administrative overhead in deciding which tables to compress and
which not to. Conversely, when using columns
typically everything is compressed. Further,
in some cases vendors allow direct querying
of compressed data without having to decompress it first.
A further point is that some column-based
products can compress unstructured data such
as text. Often, this will result in significantly
better compression ratios than those mentioned above and it can make it cost effective
to store large amounts of unstructured data
alongside relational data where that would not
otherwise be the case.
Partitioning
As you might expect, column-based approaches partition the data by column rather than
by row. That is, they use vertical partitioning
rather than horizontal partitioning. In principle, at least, there is no reason why this should
have any effect on performance, depending on
the particular algorithms employed. However,
where it does have an impact is that partitions
cannot become unbalanced when they are arranged by column.
Horizontal partitions become unbalanced
when new row insertions and old row deletions are not uniformly spread across a table.
This means that you end up with a situation
where there are different numbers of records
(rows) in each partition. If this imbalance
becomes significant then performance will
be impaired and it will be necessary to rebalance the partitions, which is a significant
maintenance operation.
By contrast, when you use vertical partitioning, the partitions never become unbalanced.
This is because there are always exactly the
same number of fields in each column of the
table. There is, however, a downside. When
you are partitioning by row, you know that
each row is exactly the same size (disregarding nulls) as every other row in the table.
This is not the case with columns. One column may contain 5 digit numeric fields while
another holds a 30 digit alphanumeric field.
Thus the initial calculation as to the optimal
partitioning approach is relatively complex
(though you would expect the database software to help you in this process). In addition,
the different compression algorithms used
against each column also need to be taken
into account. Nevertheless, in our view this
trade-off is one that is worthwhile since it
obviates the need for re-balancing once the
system is set up.
Loading data
Loading large amounts of batch data is not typically a problem for column-based databases
as they load data by column. Moreover, leading vendors typically have partnerships with
ETL (extract, transform and load) suppliers to
enable this. As a result, there is no theoretical reason why bulk load speeds should be any
different for a column-based as opposed to a
row-based database.
It is a different matter, however, when it comes
to inserting new records on a one-by-one basis
(or updating and deleting them, for that matter). In a conventional environment you simply
add or delete the relevant row and update any
indexes that may be in place. However, when
using a column-based approach you have to
make a separate insertion or deletion for each
column to which that row refers. Thus you
could easily have 30 or 100 times as much work
to do. Of course, the judicious use of parallelism can reduce the performance implications
of this but a sensible approach would be to
defer insertions and deletions until you have a
micro-batch in which the number of rows
at least equals the number of columns to be
updated (so batch sizes of 30 or 100, say). Note
that, provided the software offers the ability to
query this data in memory prior to its being written
to disk, the use of these micro-batches
for loading should have no impact on real-time
query performance.
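The deferred-insert idea can be sketched as a small buffer in front of the column lists. This is an illustrative toy under stated assumptions: the class MicroBatchColumnStore and its methods are hypothetical, Python lists stand in for on-disk columns, and no real product is claimed to work this way.

```python
class MicroBatchColumnStore:
    """Toy column store that buffers row inserts and flushes them in batches."""

    def __init__(self, column_names, batch_size):
        self.column_names = list(column_names)
        self.batch_size = batch_size
        self.columns = {name: [] for name in self.column_names}  # stands in for disk
        self.buffer = []       # pending rows, held in memory
        self.flush_count = 0   # number of batch writes performed

    def insert(self, row):
        """Buffer one row (a dict); flush once the batch is full."""
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Append all buffered values to each column in a single pass."""
        if not self.buffer:
            return
        for name in self.column_names:
            self.columns[name].extend(row[name] for row in self.buffer)
        self.buffer.clear()
        self.flush_count += 1

    def scan(self, name):
        """Return a column, including rows still waiting in the buffer."""
        return self.columns[name] + [row[name] for row in self.buffer]


store = MicroBatchColumnStore(["id", "state"], batch_size=3)
for i, state in enumerate(["Michigan", "New York", "Ohio", "Texas"], start=1):
    store.insert({"id": i, "state": state})
```

After four inserts with a batch size of three, each column has been written once rather than four times, and scan still sees the fourth row because it also reads the in-memory buffer.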
Nevertheless, it should be clear from the
preceding paragraph that column-based databases are not suitable for use in transactional
environments because one-by-one row insertions and updates are precisely what you need
in this environment. Columns are used in data
retrieval scenarios not update ones.
Parallelism
There are no intrinsic advantages that parallelism brings to columns as opposed to rows:
spreading a query load across multiple processors should bring commensurate benefits in
either environment and, similarly, parallelising loads across columns is equivalent to doing
the same thing across rows. However, that assumes that you are starting with a blank sheet
of paper: you are not likely to achieve such
good results if you shoehorn parallelism into
a database that was not originally designed for
it, as opposed to designing it for parallelism in
the first place.
More pertinent to this column versus row-based
discussion is that some column-based vendors
have taken a different approach to parallelism when compared to traditional row-based
vendors. As discussed, the emphasis from the
latter is on improving the performance of individual queries because, all too often, they simply
aren’t good enough. However, column-based
suppliers, thanks to the superior performance
that columns can bring, do not have the same
concerns and, for this reason, a number of
these vendors have concentrated, for parallelism, on improving performance across queries
rather than within them so that overall workload
throughput is much improved. Of course, ideally
you would like to have both and some suppliers
are doing this: so that you can dedicate parallel
resources to any queries that need them while
otherwise focusing on the broader workload. As
workloads increase (especially with more real-time and operational BI users) this is going to
become increasingly important.
Combination/workload issues
The issue here is combining high performance
for individual queries with similarly high performance across multiple queries and query
types, some of which may be very short running queries and others of which may be long
running, or anything in between. There is a
clear architectural benefit to be gained here
from using a column-based approach. This is
because, as previously stated, you do not have
to worry about the performance of individual
queries so that suppliers can focus their design efforts on ensuring high performance
across the potentially (tens of) thousands of
queries that may be running at any one time.
This is not to say that this is impossible to resolve using a traditional row-based approach
but the challenge is much greater because you
have two, not necessarily complementary,
design criteria that you have to meet.
Tokenisation
While the core of column-based approaches is
based upon the use of columns this is typically
(though not always) combined with some form
of tokenisation, which is the subject of this section. This section is quite technical and it can
be skipped by those not needing this level of
information. Note that you don’t need to know
anything about this, even as a DBA, because it
should all be handled automatically for you by
the software, so this is just to explain what is
going on under the covers.
Tokenisation is often taken to be synonymous
with column-based processing. However, it
is not. Some vendors employ tokenisation
throughout their products while for others it
is optional. In addition, some of the leading
row-based vendors now also use a form of
tokenisation, specifically to provide advanced
compression facilities (as discussed previously). Nevertheless, tokenisation is a major
part of the column-based story.
Put briefly, the aim of tokenisation is to separate data values from data use. This has the
effect of reducing data requirements and
improving performance. As a practical example, in a customer table you might have many
customers in Michigan and each of them would
have “Michigan” stored as a part of their address. To store this several hundred, or even
thousands, of times is wasteful. Tokenisation
aims to minimise this redundancy. However,
there is more than one way of doing this.
The simplest method of tokenisation is to store
a token (usually a numerical value; see below)
that represents “Michigan” each time that it
appears within a table and then have a look-up
table so that you can convert from the token
to the data value. Now, of course, there is no
reason why you couldn’t do this with a conventional relational database. The reason why it
isn’t usually implemented is that any savings
are not worth the candle (but see discussion
on compression). Yes, you may require less
storage capacity but you have additional I/O
because you have to read the look-up table.
A second approach is to store “Michigan” once
and then use pointers (more accurately, vectors), which associate this data value to its
use. A simple way to think of this is that data
is stored as a giant list, which holds each data
value just once, together with the vectors that
define where each data value is used. Then you
can use search engine technology to answer
queries. A typical approach would be to apply
tokenisation algorithms that are datatype-specific. So, for example, there would be a
different tokenisation algorithm for numeric,
decimal, alphanumeric, and date and time datatypes, amongst others.
However, the usual (third) method of supporting tokenisation (on which we will now concentrate) is that each row in a table is assigned a
row ID (usually a sequential integer) and each
unique value within a column is assigned a
value ID. These may also consist of sequential
integers (for example, Michigan might be assigned “7” and New York “8”) but where the
column contains numeric values then these
can form their own IDs.
The next step in the tokenisation process is, for
each column, to combine the row IDs with the
column values into a matrix. This might result
in a table such as Table 1.
Table 1
Row ID   Value ID
1        1
2        7
3        8
4        8
5        3
So, row 2 refers to a customer in Michigan,
while rows 3 and 4 are for customers both
of whom are in New York. This process is referred to as decomposition into collections of
columns. In practice, of course, it is not necessary to store the Row IDs, since these can be
inferred because of their sequential position.
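The assignment of value IDs can be sketched as below. This is a minimal illustration, assuming IDs are handed out sequentially in order of first appearance, so the numbers differ from the 7 and 8 used in Table 1; the function names tokenise and detokenise and the sample column are hypothetical. Row IDs are not stored: they are implied by list position, as the text notes.

```python
def tokenise(column):
    """Return (dictionary, value_ids) for a column of values.

    dictionary maps each distinct value to a sequential integer ID, assigned
    in order of first appearance; value_ids holds one ID per row.
    """
    dictionary = {}
    value_ids = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
        value_ids.append(dictionary[value])
    return dictionary, value_ids


def detokenise(dictionary, value_ids):
    """Convert value IDs back to the original values via a reverse look-up."""
    reverse = {value_id: value for value, value_id in dictionary.items()}
    return [reverse[value_id] for value_id in value_ids]


states = ["Ohio", "Michigan", "New York", "New York", "Texas"]
dictionary, value_ids = tokenise(states)

print(value_ids)
# prints: [1, 2, 3, 3, 4]
```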
However, while this is the way that we can
logically think about tokenisation it is not, in
practice, the way that it is implemented. This
is achieved through an incidence matrix (or bit
array) that would represent this information as
shown in Table 2.
Table 2
                Value ID
Row ID   1  2  3  4  5  6  7  8
1        1  0  0  0  0  0  0  0
2        0  0  0  0  0  0  1  0
3        0  0  0  0  0  0  0  1
4        0  0  0  0  0  0  0  1
5        0  0  1  0  0  0  0  0
In this diagram a one (1) represents a correspondence and a zero (0) shows no such relationship. There is precisely a single one in each
row but there may be multiple ones in each
column. It should be immediately clear that
if there are a large number of unique values
then there will be an explosion of zeros to be
stored. While bit arrays can offer very rapid information retrieval, particularly when cached
in memory, this data expansion needs to be
contained if it is not to mean that you lose the
space saving advantages of using tokenisation
in the first place. This is, of course, a common
problem, not just with tokenisation but also
in OLAP cubes, for example, as well as with
conventional bit-maps vis a vis our previous
discussions on count queries.
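The growth in zeros is easy to see in a small sketch that builds the incidence matrix from the value IDs of Table 1. Nested Python lists stand in for the bit array, and the function name incidence_matrix is illustrative rather than taken from any product.

```python
def incidence_matrix(value_ids, cardinality):
    """One row per record, one column per value ID (1..cardinality)."""
    return [
        [1 if value_id == column else 0 for column in range(1, cardinality + 1)]
        for value_id in value_ids
    ]


# Value IDs for rows 1..5, as in Table 1; eight possible value IDs, as in Table 2.
value_ids = [1, 7, 8, 8, 3]
matrix = incidence_matrix(value_ids, cardinality=8)

ones = sum(map(sum, matrix))
zeros = len(value_ids) * 8 - ones
# Rows holding value ID 8 (New York in the text's example).
new_york_rows = [i + 1 for i, row in enumerate(matrix) if row[7] == 1]

print(ones, zeros, new_york_rows)
# prints: 5 35 [3, 4]
```

Five rows need 40 cells, of which 35 are zeros. The number of ones always equals the number of rows, so every additional distinct value adds a column made up almost entirely of zeros.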
In theory there are two obvious approaches
to this issue. The first is to limit the use of
tokenisation to low cardinality fields. That is,
fields where there are a limited number of different values. There are, for example, only 50
US states, so this would be ideal for tokenisation. In practice, you start to lose the benefits
of tokenisation with a cardinality much above
1,500, so it may be useful if a vendor can offer
alternative approaches for higher cardinality fields such as other types of (conventional)
indexing. Alternatively, you can compress each
column in the bit array into an encoded bit vector and then process these encoded bit vectors
directly, without their being unpacked. So, one
way or another, you limit any data explosion
caused by the zeros.
Moving beyond columns
It should be clear that column-based architectures offer some significant advantages over
traditional row-based approaches. However,
this does not mean that they are better than
their row-based counterparts for all types of
queries and it does not mean that they are
optimal in every respect, even in areas where
they already offer superior performance. A report card might read “has done well, at the top
of the class—but could do better”.
Needless to say, there are multiple ways in
which one might approach the issue of improving the performance and utility of column-based databases. In the remainder of this
section we will describe and discuss the approach taken by Infobright to extend columnar
capability. This is based around the use of a
technology known as a Knowledge Grid. This
is unique and makes Infobright completely different from any other product on the market.
In order to understand the Knowledge Grid you
first need to understand data packs. A data
pack is a part of a column with a total size of
64k and Infobright breaks each column down
into these packs and then stores each data
pack separately, with compression applied at
the pack level (which actually means that it
can sometimes be more efficient than when
operating on a pure column basis). At the same
time the software creates metadata about the
contents of each data pack, which is stored in
the Knowledge Grid and is automatically updated whenever the warehouse is updated.
This metadata in the Knowledge Grid includes
parameters for each data pack such as the
maximum and minimum values contained
therein, a histogram of the range of these
values, a count of the number of entries within
the data pack, aggregates (where relevant)
and so on. What all of this means is that a
number of queries (such as count queries)
can be resolved without reading the data at all.
Moreover, as Infobright continues to expand
the metadata held in its Knowledge Grid (for
example, it intends to extend it into vertical
and domain-specific areas) then more and
more queries will be answered directly from
the Knowledge Grid.
So, the first thing that happens when a query
is received is that the database engine looks
to see if it can answer all or part of the query
directly from the Knowledge Grid. However the
nirvana of being able to answer all questions
directly from the Knowledge Grid will never
come. Sometimes you just have to read the
data. When this is necessary the software first
accesses the Knowledge Grid to see which
data packs it needs to resolve that query and
then it only reads and decompresses those
data packs. Further, it works in an iterative
fashion so that as it processes each part of
a query it can eliminate the need to access
more data. Thus, at least for some queries,
depending on how many data packs you need
to access, you read and decompress even less
data than you would when using a standard
columnar database.
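The general idea of consulting per-pack metadata before touching the data can be sketched as follows. This is a simplified illustration under stated assumptions and not Infobright's implementation: the names DataPack, build_packs and count_greater_than are hypothetical, the pack size is tiny for readability, and only minimum, maximum and count are kept, whereas the text describes richer metadata such as histograms and aggregates.

```python
from dataclasses import dataclass

PACK_SIZE = 4  # tiny for illustration; the text describes 64k packs


@dataclass
class DataPack:
    values: list      # stands in for the compressed pack on disk
    minimum: int
    maximum: int
    count: int


def build_packs(column, pack_size=PACK_SIZE):
    """Split a column into packs and record min/max/count metadata for each."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(DataPack(chunk, min(chunk), max(chunk), len(chunk)))
    return packs


def count_greater_than(packs, threshold):
    """Count values > threshold, reading only packs the metadata cannot settle.

    Returns (count, packs_read).
    """
    total, packs_read = 0, 0
    for pack in packs:
        if pack.maximum <= threshold:
            continue                  # no value in this pack can match: skip it
        if pack.minimum > threshold:
            total += pack.count       # every value matches: use metadata only
            continue
        packs_read += 1               # mixed pack: the data itself must be read
        total += sum(1 for v in pack.values if v > threshold)
    return total, packs_read


ages = [21, 25, 23, 22, 40, 45, 19, 38, 60, 62, 65, 70]
packs = build_packs(ages)
result = count_greater_than(packs, 39)

print(result)
# prints: (6, 1)
```

Of the three packs, one is skipped outright, one is answered from its stored count, and only one has to be read, which mirrors the behaviour described above of resolving what it can from metadata and decompressing only the packs that remain in doubt.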
Further, apart from understanding each data
pack, the Knowledge Grid also understands the
relationships that exist between different data
packs in order to provide even better query performance; for example, because it knows which
pairs of data packs (on columns from different
tables) would need to be accessed if a join condition applied across their respective columns.
Note that thanks to the Knowledge Grid, Infobright does not require you to partition the data.
This not only reduces administration but it also
prevents data skew, which is a performance
problem for vendors using horizontal (row-based) partitioning and which forces re-balancing of the warehouse, as discussed previously.
Readers interested in learning more about Infobright can visit its website at www.infobright.com
or the Infobright open source community
at www.infobright.org.
Conclusion
Columns provide better performance at a lower cost with a smaller footprint: it is difficult to understand why any company seriously interested
in query performance would not consider a column-based solution.
Using columns instead of rows means that you get greatly reduced I/O
because you only read the columns referenced by the query. This means
that you get dramatically improved performance. The use of compression improves I/O rates (and performance) still further. In addition, depending on the supplier, you can eliminate or greatly reduce any need
for indexes (thereby reducing on-going administration requirements)
and, where they may be usefully used, they can be created automatically. In summary, this means that queries run faster (much faster), the
database is much smaller than it would otherwise be (with all the upfront
and ongoing cost benefits that implies) and there is less administration
required than would otherwise be the case (with further ongoing cost
benefits). In addition, you may be able to run queries that simply could
not be supported by more conventional means.
However, despite all of these comments, the use of columns is not a
panacea. In fact, Infobright demonstrates this very clearly by its improvement on the fundamental architecture of column-based relational
databases, which will provide better compression, at least in some
instances, and improved query performance through extending the columnar paradigm.
Further Information
Further information about this subject is available from
http://www.BloorResearch.com/update/2065
Bloor Research overview
Bloor Research is one of Europe’s leading IT research, analysis and consultancy organisations. We
explain how to bring greater Agility to corporate IT
systems through the effective governance, management and leverage of Information. We have built a
reputation for ‘telling the right story’ with independent, intelligent, well-articulated communications
content and publications on all aspects of the ICT
industry. We believe the objective of telling the right
story is to:
• Describe the technology in context to its business value and the other systems and processes
it interacts with.
• Understand how new and innovative technologies fit in with existing ICT investments.
• Look at the whole market and explain all the solutions available and how they can be more effectively evaluated.
• Filter “noise” and make it easier to find the additional information or news that supports both
investment and implementation.
• Ensure all our content is available through the
most appropriate channel.
Founded in 1989, we have spent over two decades
distributing research and analysis to IT user and
vendor organisations throughout the world via online
subscriptions, tailored research services, events and
consultancy projects. We are committed to turning
our knowledge into business value for you.
About the author
Philip Howard
Research Director - Data
Philip started in the computer industry way back in
1973 and has variously worked as a systems analyst,
programmer and salesperson, as well as in marketing
and product management, for a variety of companies
including GEC Marconi, GPT, Philips Data Systems,
Raytheon and NCR.
After a quarter of a century of not being his own boss
Philip set up what is now P3ST (Wordsmiths) Ltd in 1992
and his first client was Bloor Research (then ButlerBloor), with Philip working for
the company as an associate analyst. His relationship with Bloor Research has
continued since that time and he is now Research Director. His practice area encompasses anything to do with data and content and he has five further analysts
working with him in this area. While maintaining an overview of the whole space
Philip himself specialises in databases, data management, data integration, data
quality, data federation, master data management, data governance and data
warehousing. He also has an interest in event stream/complex event processing.
In addition to the numerous reports Philip has written on behalf of Bloor Research, Philip also contributes regularly to www.IT-Director.com and www.ITAnalysis.com and was previously the editor of both “Application Development
News” and “Operating System News” on behalf of Cambridge Market Intelligence
(CMI). He has also contributed to various magazines and written a number of
reports published by companies such as CMI and The Financial Times.
Away from work, Philip’s primary leisure activities are canal boats, skiing,
playing Bridge (at which he is a Life Master) and walking the dog.
Copyright & disclaimer
This document is copyright © 2010 Bloor Research. No part of this publication
may be reproduced by any method whatsoever without the prior consent of Bloor
Research.
Due to the nature of this material, numerous hardware and software products
have been mentioned by name. In the majority, if not all, of the cases, these product
names are claimed as trademarks by the companies that manufacture the
products. It is not Bloor Research’s intent to claim these names or trademarks
as our own. Likewise, company logos, graphics or screen shots have been reproduced with the consent of the owner and are subject to that owner’s copyright.
Whilst every care has been taken in the preparation of this document to ensure
that the information is correct, the publishers cannot accept responsibility for
any errors or omissions.
2nd Floor,
145–157 St John Street
LONDON,
EC1V 4PY, United Kingdom
Tel: +44 (0)207 043 9750
Fax: +44 (0)207 043 9748
Web: www.BloorResearch.com