White Paper What’s Cool about Columns (and how to extend their benefits)
Transcription
White Paper What’s Cool about Columns (and how to extend their benefits)
White Paper What’s Cool about Columns (and how to extend their benefits) A White Paper by Bloor Research Author : Philip Howard Publish date : November 2010 …the use of columns is not a panacea. In fact, Infobright demonstrates this very clearly by its improvement on the fundamental architecture of column-based relational databases Philip Howard Free copies of this publication have been sponsored by What’s Cool about Columns (and how to extend their benefits) Executive summary It hardly needs reiterating that companies and other organisations are under increasing pressure to understand their customers, work ever more closely with their suppliers, evaluate their own performance, improve their competitive position, and generally take advantage of whatever business opportunities arise. Add to this the huge growth in information generated by the Internet as well as specialist technologies such as RFID (radio frequency identification) and event processing, plus the need to retain data for extended periods of time for compliance reasons, and it is not surprising that business intelligence and query systems in general are under increasing pressure. Worse, demands on this information are increasingly widespread and of a real-time nature. Historically, all long term storage of data for query purposes has relied on data warehouses and data marts that have been traditionally supplied by vendors using conventional relational databases. However, for more than a decade a sub-genre of the relational database has been making inroads into the market, using a technology known as a column-based relational database (sometimes referred to as CBRD). While we will explain the differences between a conventional and a column-based approach in due course, for the moment just think of it as an ordinary relational database (using SQL and so forth) that inserts and reads columns of data instead of rows of data. A Bloor White Paper 1 For much of the last decade the use of columnbased approaches has been very much a niche activity. However, with a substantial number of vendors now active in the market, with over 2,000 customers (many of them Global 2000 companies) between them, we believe that it is time for columns to step out of the shadows to become a major force in the data warehouse and associated markets. Given that view, this paper will define what column-based relational databases do and how they do it, and where they have advantages (and disadvantages) compared to traditional approaches. This, in turn, will lead to a discussion of the sort of environments for which columns are best suited. Up to this point all discussions in this paper are generic. However, we will conclude with a section on Infobright, a column-based vendor, discussing how its technology has extended the column-based paradigm to provide additional performance and other benefits. © 2010 Bloor Research What’s Cool about Columns (and how to extend their benefits) The general problem There is a popular joke in which a traveler asks a local for directions to a well-known location. The latter, after a pause, replies that he wouldn’t start from here if that was where he was going. Relational databases are in an analogous position. Having dominated the market for OLTP, the relational vendors have extended the capabilities of their products into areas for which relational technology was not originally designed. The question is whether you would start from a relational base if you began with a blank sheet of paper? Traditional relational databases were designed initially to process transactions. In this environment you manipulate individual transactions one at a time. Each transaction is represented by one or more rows that have to be inserted or modified in one or more tables. When you are processing a query, on the other hand, you typically work with one or more tables, from which data is selected according to certain criteria—and those criteria are nearly always column-based. For example, the query “retrieve the names of all personnel who are females over the age of 50” would typically be resolved based on columns for name, gender and date of birth. So, while transactions are naturally row-based, queries are naturally column-based. There is, of course, a major performance issue involved here. If you are using a simple row-based paradigm then you retrieve all of the personnel records and then search them according to the defined criteria. This will obviously be relatively long-winded and slow. A number of opportunities for improving performance present themselves in a relational environment. The first is to implement indexes on each of the relevant columns. There are two problems with this. First, you cannot necessarily predict the queries that business analysts and managers may choose to make. This means that there may be a requirement for an ad hoc enquiry that involves a search against a non-indexed field. So, the first problem is that indexes cannot guarantee to be universally applicable. Secondly, indexes impose a significant overhead in terms of additional disk capacity. It is by no means unusual for a heavily indexed database (together with other constructs) to take up anything up to five times (or more) the space that would be required for the data alone. Nor © 2010 Bloor Research 2 is this just a question of additional hardware requirement. Even at a simple level, it means that you at least double the amount of I/O required, which necessarily impairs performance. Moreover, this hit also applies when you are updating your data warehouse, and you will also incur overheads for index maintenance. So: 1. We would like a way around the problem of indexes: implementing multiple indexes improves performance for specific queries (and therefore may be useful in specific cases) but causes a performance overhead in more general terms. To meet this need, traditional approaches implement other performance features in order to cover the downside created by the need to hold indexes in addition to data. The first of these is data compression. The problem with this is that individual fields in a row tend to be represented by different datatypes. This makes compression complex to implement since different compression algorithms will be better suited to different datatypes. However, as we mentioned above, because each row has to be compressed in its entirety, this means that you are forced to adopt a relatively low level approach to compression. In the recent past the leading relational database vendors have all introduced (or announced) more sophisticated compression algorithms for their databases. However, they still do not match up to those offered by column-based vendors. Moreover, there is an administrative overhead in applying this compression on a row basis. We will discuss this in due course. Another performance enhancement adopted by most of the major data warehousing vendors is the extensive use of parallelism. This applies both within the software and in making use of hardware facilities that can be provided by the platform. Both of these forms of parallelism have associated problems. In the case of database parallelism the problem is that the parallelism is (just like the use of indexes) designed to improve the performance of individual queries (with facilities such as parallel sorting, for example). This, of course, is a tacit admission that there is something wrong with their performance in the first place. However, that is not the main point. What you really would like to do with parallelism is use it to ensure that a mix of queries A Bloor White Paper What’s Cool about Columns (and how to extend their benefits) The general problem can run simultaneously (especially, using divide and conquer rather than a sequential approach) with optimum results, but this, of course, pre-supposes that individual queries will perform well in the first place. The most significant point arising out of support for hardware-based parallelism is the use of disk partitioning. This is the ability to split a table across multiple disks, which is normally done through what is known as horizontal partitioning. This can be achieved in a variety of ways such as hash partitioning, round-robin partitioning and so on, but what these methods have in common is that they partition tables by row, so that some rows are stored on one disk and some on another. The problem with this approach is that it is difficult to ensure that each partition is roughly the same size as all the others. If this ceases to be the case for some reason then you start to lose any performance benefits. That is, if one partition becomes significantly larger than the others then it is, by definition, being accessed more frequently, which in turn means that its I/O performance must deteriorate. The different forms of partitioning algorithm are designed to combat this problem, but they are neither infallible nor universally available. Should an imbalance occur, then a major redistribution of data, with all the consequences that that implies, may be required to rectify the problem. Finally, a further hardware solution is proposed by many vendors, by which they suggest that the best approach to business intelligence is to implement a central data warehouse supporting satellite data marts. It is the latter, which hold only limited subsets of data, that are to be used to fulfill query requirements. By limiting the scope of information held you obviously reduce the size of the database and, thence, the performance issues that will arise. We can therefore further extend our list of requirements (apart from fixing the index problem) by saying that we need: 2. A way around the problems of horizontal partitioning. 3. A form of parallelism that optimises crossquery loading rather that individual queries, predicated on the assumption that we already have decent performance for individual queries (no matter how complex). A Bloor White Paper 3 The problem with conventional data warehousing is that it is not as all-embracing as it might appear on the surface or as the vendors of such solutions might have you believe. The reason for this is that in order to provide the best possible performance to the largest number of users, warehouses are significantly pre-designed. While logically this may be a reflection of the business model that underpins the warehouse, in physical terms this means the pre-definition of indexes and index clustering, specified data partitioning, particular uses of parallel disk striping, specially identified pre-joined tables and so on. Now, these techniques have an important role to play when it comes to improving performance. However, they all pre-suppose that you know, in advance, what you are going to do with your data. The problem is that this is not always the case. In particular, the data warehouse you plan today may not take account of the exigencies of tomorrow. To reduce this argument to its simplest level: if you have two queries that you want to run against a data source then you can optimise the database design for one query or for the other, but not both. For two queries you can probably produce a hybridised structure that will provide acceptable performance for most users most of the time. However, the greater the number of query types that you have to support, the more compromises you have to make. As any data warehousing administrator will tell you, left to their own devices just a handful of business analysts, pursuing their own trains of thought, can bring any data warehouse to its knees. This is why all the popular systems have query governors that can limit the resources that will be assigned to any particular query. However, query governors tend to penalise queries that fall outside the parameters that were originally considered when the data warehouse was constructed. Moreover, and here’s the rub, it is precisely those queries that fall outside the scope of the initial conception of the data warehouse that can bring the biggest benefit to the business. This is because the issues raised by predictable enquiries are, almost by definition, the ones that the company already knows how to deal with. It is the unpredictable that often offers the greatest threat, or the greatest reward, to an organisation, and it is precisely these questions that a conventional data warehouse is least well equipped to answer. © 2010 Bloor Research What’s Cool about Columns (and how to extend their benefits) The general problem On the basis of these discussions we can further define some of the features that we would like from a data warehouse: 4. Flexibility is paramount. The whole point is that what we want is an “ask anything” warehouse. 5. It should be possible to store interim results. That is, you may want to perform a query and use the output from that query as a part of the input to another. 6. It should be easy to administer. 7. It should be cost effective and offer a return on investment in as short a timescale as is reasonable. 8. It should be efficient in terms of its resources, both in machine and personnel terms. In particular, the business analyst pursuing a line-of-thought enquiry should be able to follow this train through to its end from his desktop, without requiring outside assistance of any sort. 9. Performance is also fundamental. While different queries will obviously take different lengths of time, typical responses should be in seconds, or minutes at most. 10.In modern-day enterprise data warehouses there is a growing requirement to support a much larger number of users/queries than was previously the case and, at the same time, a much broader range of query types. Thus user scalability is as much an issue as conventional concerns in terms of disk scalability. Size matters For obvious reasons there is a continuing emphasis, in all forms of business intelligence environments, on providing improved performance. To run queries faster, to run more queries simultaneously, to run more complex queries, to run against larger datasets. One way to achieve this is by investing in more, and faster, hardware. However, a more fruitful approach is to achieve this through software; and © 2010 Bloor Research 4 one of the most useful ways of achieving this is by reducing the size of the database, not in logical terms, but physically. If, for example, you have two databases that contain the same data, and one requires 10Tb and the other requires 5Tb then, all other things being equal, it will take less time to search through the latter than the former. Of course, there are also other benefits of a smaller database. For example, it takes less time to load it, it requires less maintenance and tuning, it has a smaller footprint and it requires less in the way of cooling and power requirements. The first and most obvious way of reducing the size of a database is to use compression techniques. However, the problem with this is that the different fields in a data row all have different attributes and it is therefore not possible, using conventional databases, to optimise the compression beyond a certain point. New introductions in the databases of the leading relational vendors have improved on this of late but the facilities offered are still not as capable as the might be: we will discuss this further later. Another way to reduce the size of a database is to reduce the number of indexes. Typically, the indexes in a data warehouse take up as much space as the data itself, in effect doubling (or more; it is not uncommon for the addition of indexes, as well as other constructs such as materialised views to mean that the resulting data warehouse is as much as 8 times the size of the raw data) the size of the database. However, in conventional environments, the more indexes you remove the slower that individual queries run. In effect this is a Catch 22 situation. We can therefore add to our list of requirements: 11.We would like to reduce the size of the database. 12.We would like to minimise the number of indexes (and other constructs) that we need to define. A Bloor White Paper What’s Cool about Columns (and how to extend their benefits) The specific problem While we have identified a number of issues that traditional data warehousing solutions face there are also specific issues that relate to particular types of query. Unpredictable queries Unpredictable queries (as used in exploratory analysis) are, by definition, those where you do not know in advance what the user may want to find out. These pose a number of problems, including: • You do not know whether the answer to a query will require aggregated data or access to transaction-level data. If the answer can be satisfied by aggregated data then an approach that uses some form of OLAP (on-line analytic processing) solution may be appropriate. However, if transaction-level access is required then an OLAP-based approach will slow down significantly, particularly if the requirement is substantial. • Even where a query may be satisfied through use of pre-aggregated data, the nature of unpredictable queries is such that you cannot guarantee that the correct aggregations are available. If they are not then the cube will need to be re-generated with the desired aggregations. Not only will this take time in itself but it will also usually mean recourse to the IT department, where it will typically join a long queue of work to be done. The user will be lucky to get a response to his query within weeks—a time lapse of months is more likely. • A third problem with OLAP and unpredictable queries is that the cube has pre-defined dimensions and hierarchies. If a user happens to want to pose a question that has aspects that fall outside of these parameters then, again, it will be necessary to redefine and re-generate the cube with all the likely delays outlined above. • Thus OLAP-based approaches cannot cope with unpredictable queries, except in very limited circumstances. This means that recourse will have to be made to the main data warehouse (or a suitable data mart). However, conventional relational databases also have problems with unpredictable queries. As far as an rdbms is concerned, the problem with unpredictable queries is that appropriate indexes may not be defined. While the database optimiser can re-write badly constructed A Bloor White Paper 5 SQL, determine the most efficient joins and optimise the query path in general, it cannot make up for any lack of indexes. In practice, if a column is not indexed at all, then this will usually mean that the query has to perform a full table scan, and if this is a large table (see below) then there will be a substantial performance hit as a result. Of course, the obvious route to take is to build indexes on every conceivable column. Unfortunately this is not usually practical. While every index you build will help to improve the performance of queries that use that index, this is subject to the law of diminishing returns. Every index you add to the database increases the size of the database as a whole, doubles the maintenance whenever that column is updated (because you have to update the index as well) and doubles I/O requirements in a similar fashion. All of this means that database performance as a whole deteriorates, not to mention the time spent in tuning indexes to provide best performance. For these reasons, users tend to be severely restricted in the degree of unpredictability that they are allowed. Even ad hoc query tools tend to be limited in what they allow the user to ask. If you want to go outside those parameters then you will be obliged to refer to the IT department, for them to program your query for you, with all the attendant delays that that involves. Even when you can define such queries through your front-end tool, all too often your question will be cut short by the database’s query limiter because it takes too long to run or consumes too much resource. Complexity There is no hard and fast definition of what constitutes a complex query. However, we can say that they typically involve transaction-level data, usually depend on multiple business rules requiring multiple joins, and are often forced to resort to full table scans. Perhaps a reasonable definition would be that a complex query always involves multiple set operations. That is, you make a selection and then, based on the result of that selection, go on to make further selections. In other words complexity involves recursive set operations. In non-technical terms, complex queries often involve a requirement to contrast and compare different data sets. Some typical complex queries are as follows: © 2010 Bloor Research What’s Cool about Columns (and how to extend their benefits) The specific problem • “To what extent has our new service can- nibalised existing products?” – that is, which customers are using the new service instead of the old ones, rather than as an addition. • “List the top 10% of customers most likely to respond to our new marketing campaign.” • “Which good shoppers shop where bad shoppers shop?” • “What aspects of a bill are most likely to lead to customer defection?” • “Are employees more likely to be sick when they are overdue for a holiday?” • “Which promotions shorten sales cycles the most?” Consider just the question about the top 10% of customers. In order to answer this question we need to analyse previous marketing campaigns, understand which customers responded (which is not easy in itself: it often means a timelapsed comparison between the campaign and subsequent purchases), and identify common characteristics shared by those customers. We then need to search for recipients of the campaign that share those characteristics and rank them (how to do this may require significant input) according to the closeness of their match to the identified characteristics. It is unlikely that anyone would question the premise that this is a complex query. You could answer it using a conventional relational database but it would be time-consuming and slow, not to mention difficult to program. It would be impossible using an OLAP-based approach. It might be thought that you could use data mining technology for these sorts of complex queries. However, there are three problems with data mining. First, these products are typically the domain of specialist business analysts rather than ordinary business users. Second, if we refer to this particular query about the top 10% of customers, then the volume of data to be processed could well be prohibitive. Third, even if you could use data mining techniques then what you get back will be a predictive model, which is not particularly useful when what you want is a set of customers. It might also appear reasonable to argue that these sorts of queries can satisfactorily be © 2010 Bloor Research 6 answered by analytic applications both from CRM vendors and specialist suppliers. To a certain extent this is true, but where complex queries are supported as part of such an application they have typically been purpose-built, and most analytic applications will not support the full range of complex queries that you might want to ask, in particular because they cannot support the unpredictable nature of such questions. Finally, another aspect of complexity is what is sometimes called a ‘predicate’. These are selection criteria such as those based on sex, age, house ownership, annual income, social classification, location and so on. As companies want to understand their customers better, the range of these predicates, and their combinations, is substantially increasing. However, this complexity places an added strain on query performance. In particular, the efficiency of the database in evaluating these predicates is of increasing importance. It should be noted that a similar concern exists for e-commerce based search engines. For example, in the travel industry you want to make it easy for clients to select holidays based on a wide range of different predicates such as average temperature, distance from the beach, whether rooms have air conditioning and so on. Large table scans There is nothing clever or mystical about this. Certain types of queries require that the whole of a table must be scanned. Some of these arise when there are no available indexes, or from the sorts of complex queries described above. However, very much simpler queries can also give rise to full table scans. Two such are quoted by the Winter Corporation in its white paper “Efficient Data Warehousing for a New Era”. These are: • “List the full name and email address for customers born in July” – given that one in 12 customers are born in July a typical database optimiser will not consider it worthwhile to use an index, and it will conduct a full table scan. If you have 10 million customers for each of whom you store 3,200 bytes, say, then this will mean reading a total of 32,000,000,000 bytes. As we will see later a column-based database could reduce this by a factor of more than 100. A Bloor White Paper What’s Cool about Columns (and how to extend their benefits) The specific problem • “Count the married, employed customers who own their own home” – if we assume the database as above, then conventional approaches still mean reading 32,000,000,000 bytes. In this case, however, column-based approaches can achieve improvements measured in thousands of times. A point of warning: this is the first time we have mentioned some of the performance improvements that can be reached by using column-based products—you may not believe them. If you are not familiar with the technologies covered here, then talk of hundreds or thousands of times performance benefits may seem outlandish and unreasonable. However, they are true. We will discuss how these can be achieved (and when they cannot) in due course. Time-based queries We are referring here to time-lapse queries rather than time per se. This is particularly important because it is often desirable to study people’s behaviour and activities over time. For example, take a very simple example such as: “Which customers bought barbecues within 7 days of ordering patio furniture?” In order to answer this sort of query you need to search the database to find out who bought barbecues and then scan for patio furniture buying within the required time period. You cannot easily answer this sort of query using either conventional relational databases or OLAP cubes. In the case of an OLAP solution, you would have to organise your cube by the shortest time period you are ever going to measure against (days in this case) and then you count cells for seven days. Unfortunately, this will mean very large cubes (30 times typical sizes today, which are most commonly implemented by month) and such queries would therefore be extremely inefficient. Moreover, the question posed is based on transactionlevel detail in any case, which will not be contained in a cube. but it is very complex, whereas in SQL ’92 you would have to use a multi-pass approach. Of course, some vendors have specialised data extenders that handle time series problems of this type, while others have extended their versions of SQL but then you would still have the performance issues arising from large table scans, as identified above. Qualitative/quantitative queries This is another straightforward issue. There are occasions, such as if you wanted to compare the performance of different medical teams carrying out surgery to treat a particular condition, in which it is useful to be able to employ text searching against a data warehouse. To answer this sort of query in a conventional environment you would typically have to combine quantitative results from the data warehouse with qualitative details extracted from a content management or document management system. In practice, of course, it would not be difficult to build text indexes and search capabilities into conventional data warehouses. However, it is likely that the sorts of queries that would require this capability would fall into the complex category described above and, for this reason, would be prone to the poor performance that is symptomatic of relational and OLAP-based approaches to complex queries. Combination/workload issues It is important that the foregoing, and traditional queries and reports, are not treated in isolation. Certainly, there is significant demand for what we might call ‘analytic warehouses’ but there is also a demand to support all of these query types along with others, such as lookup queries in what is sometimes referred to as EDW (enterprise data warehouse) 2.0. This predicates a mix of query types and users that may run into thousands or, in the future, tens of thousands of concurrent queries. Row-based relational databases cannot cope well with this sort of query either. Not, in this case, because the information isn’t there but because SQL is not very good at coping with this sort of query: in SQL ’99 you could do it A Bloor White Paper 7 © 2010 Bloor Research What’s Cool about Columns (and how to extend their benefits) Column-based databases: a query solution Conventional approaches to data warehousing use traditional relational databases. However, these were originally designed to support transaction processing (OLTP) and do not have an architecture specifically designed for supporting queries. Column-based relational databases, on the other hand, have been designed from the ground up with that specific goal in mind. A column-based relational database is exactly what its name suggests, a relational database (using conventional set algebra, SQL and so on) that stores and retrieves data by column instead of by row. In all other respects it is conceptually identical to a conventional relational database. So the use of such a product does not require any re-training, and does not need the user to learn any new concepts (except any that may be specific to a particular vendor). The change from rows to columns may seem a trivial one but it does, in fact, have profound consequences, which we now need to examine in some detail. In particular, we need to consider the impact of using columns with respect to indexes because there are many circumstances where it is not necessary to define an index when using a column-based approach. For example, suppose that you simply want to list all customers by name. Using a standard relational database you would define an index against the name and then use that to access the data. Now consider the same situation from a column-based perspective. In effect, the column is the index. So you don’t need to define a separate index for this sort of query. The effect is quite dramatic. Not only do you not have the overhead (in disk space, maintenance and so forth) of an index, you also halve the number of I/Os required, because you don’t have to read the index prior to every data read. Further, because indexes and columns are so closely aligned, it is a relatively easy process for column-based products to provide automated indexing capabilities where that is appropriate, though some vendors, particularly those employing large-scale parallelism, eschew indexes altogether. To put it baldly, the use of columns enables you to answer certain types of query, especially those highlighted in the previous section, much more quickly than would otherwise be the case, so we will begin by considering these query types. In the following section we will discuss the importance (on query performance) of © 2010 Bloor Research 8 other consequences of using a column-based approach such as compression. Unpredictable queries As we have seen, a column is equivalent to an index but without any of the overhead incurred by having to define an index. It is as if you had a conventional database with an index on every column. It should be easy to see, therefore, that if you are undertaking some exploratory analysis using unpredictable queries then these should run just as quickly as predictable ones when using a column-based approach. Moreover, all sorts of queries (with the exception of row-based look-up queries) will run faster than when using a traditional approach, all other things being equal, precisely because of the reduced I/O involved in not having indexes. The same considerations apply to quantitative/ qualitative queries. Complex queries Complex queries tend to be slow or, in some cases, simply not achievable, not because of their complexity per se but because they combine elements of unpredictable queries and time-based or quantitative/qualitative queries and they frequently require whole table scans. Column-based approaches make complex queries feasible precisely because they optimise the capability of the warehouse in all of these other areas. Large table scans It is usually the case that queries are only interested in a limited subset of the data in each row. However, when using a traditional approach it is necessary to read each row in its entirety. This is wasteful in the extreme. Column-based approaches simply read the relevant data from each column. If we return to the question posed previously: “List the full name and email address for customers born in July” then if the row consists of 3,200 bytes and there are ten million rows then the total read requirement for a conventional relational database is 32,000,000,000 bytes. However, if we assume that the date of birth field consists of 4 bytes, and the full name and email addresses both consist of 25 characters, then the total amount of data that needs to be read from each row is just 54 bytes if you are using a column-based approach. This makes a total read requirement of 540,000,000 bytes. A Bloor White Paper What’s Cool about Columns (and how to extend their benefits) Column-based databases: a query solution This represents a reduction of 59.26 times, and this is before we take other factors into account, so it is hardly surprising then that column-based approaches provide dramatically improved performance. It should be noted that this advantage is not necessarily all one way. Each column you need to retrieve needs to be accessed separately whereas you can retrieve an entire row in a single read. So the greater the amount of the information that you need from a row the less performance advantage that a column-based approach offers. To take a simplistic example, if you want to read a single row then that is one read. If that row has 15 columns then that is, in theory, 15 reads, so there is a trade-off between the number of rows you want to read versus the number of columns, together with the overhead of finding the rows/columns you need to read in the first place. A further consideration is that there is a class of query that can be answered directly from an index. These are what are known as “count queries”. For example, the question posed previously: “Count the married, employed customers who own a house.” If you have a row-based database and you have appropriate indexes defined then you can resolve these queries without having to read the data at all (see box for details). Of course, in the case of a column-based database the data is the index (or vice versa) so you should always be able to answer count queries in this way. Let us assume that relevant indexes are available in a row-based database. If you compare the advantage of this approach to using a standard method (using the same 3,200 byte records with 10 million customers as above) then you get a performance advantage that works out at more than 8,500. So, you get this advantage for row-based approaches if the right subset of indexes is available. If it isn’t you don’t. Using a column-based approach you always get this advantage. Further, it should be noted that count queries extend to arithmetic comparisons (greater than, less than and so on) and ordering queries as well, since all of these results can be derived directly using the same approach. Time-based queries The issue here is not so much one of performance but more one of whether relevant queries are possible at all. This is because a) you need the extended SQL (or other approach) in order to handle time lapse queries and b) you need the ability to store time-stamped transactions. Neither of these is typically the case with traditional purveyors of data warehousing. Conversely, there are a number of column-based vendors that provide exactly such an approach. Note that there are a number of use cases that require such capabilities that go beyond conventional warehouse environments. For example, in telecommunications it is mandated that companies must retain call detail records, against which relevant queries can be run, often on a time-limited basis. Similarly, you will want to be able to run time-based queries against log information (from databases, system logs, web logs and so forth) as well as emails and other corporate data that you may need for evidentiary reasons. To support count queries conventionally you need bit-mapped indexes defined against the relevant tables; then you can read the bitmaps, intersect them and count the results without reference to the underlying data. As it happens, all the leading data warehouse vendors offer some form of bit-mapping so that it could be argued that this is simply an illustration of the advantages of bitmapping over (say) Btrees. However, it is not as simple as that. Bit-maps are usually only applied for numeric data so there is a limit to what you can bit-map in a conventional environment. Moreover, conventional approaches tend to have sparsity issues (see later). However, if you can combine tokenisation (see later) with bit-mapping, as a number of column-based vendors do, or you can make use of vector processing (as is the case with some other column-based suppliers) then you can, in effect, bit-map any column. So, while the benefits of bit-mapping may be constant, they are more widely applicable within column-based approaches. A Bloor White Paper 9 © 2010 Bloor Research What’s Cool about Columns (and how to extend their benefits) Column-based databases: generalised functions There is no question that column-based databases typically outperform, by a significant margin, their row-based counterparts. However, there are other considerations when it comes to using columns over and above those associated with not needing to define indexes or other constructs. The main issues are discussed in the following sections. Compression One of the major advantages that a columnbased approach has, and perhaps the easiest to understand, is its effect on compression. Because you are storing data by column, and each column consists of a single datatype (often with recurring and/or similar values), it is possible to apply optimal compression algorithms for each column. This may seem like a small change (in some senses it is) but the difference in database size can be very significant when compared to other approaches. Moreover, there is a performance benefit: because there is more data held within a specific space you can read more data with a single I/O, which means fewer I/Os per query and therefore better performance. Of course, the better the compression the greater the performance improvement and the smaller the overall warehouse, with all of the cost benefits that that implies. As we have already mentioned, in the recent past the merchant database vendors have started to introduce more sophisticated compression algorithms (in at least one case, using tokenisation—see later), which has significantly improved their ability to compress data, so the advantages that columns can offer in this area are not as significant as they once were. For example, you might get a typical average compression ratio of 75% (depending on the type of data) from a column-based vendor (some suppliers can do significantly better than this) whereas 50–60% might be more typical for a row-based database. However, this isn’t the only issue: there is, of course, an overhead involved in de-compressing the data (and indexes) in order to process the data. This means a performance hit, so for small tables it is usually not worth compressing the data because it will slow queries down. In other words there is an administrative overhead in deciding which tables to compress and which not to. Conversely, when using columns typically everything is compressed. Further, in some cases vendors allow direct querying © 2010 Bloor Research 10 of compressed data without having to decompress it first. A further point is that some column-based products can compress unstructured data such as text. Often, this will result in significantly better compression ratios than those mentioned above and it can make it cost effective to store large amounts of unstructured data alongside relational data where that would not otherwise be the case. Partitioning As you might expect, column-based approaches partition the data by column rather than by row. That is, they use vertical partitioning rather than horizontal partitioning. In principle, at least, there is no reason why this should have any effect on performance, depending on the particular algorithms employed. However, where it does have an impact is that partitions cannot become unbalanced when they are arranged by column. Horizontal partitions become unbalanced when new row insertions and old row deletions are not uniformly spread across a table. This means that you end up with a situation where there are different numbers of records (rows) in each partition. If this imbalance becomes significant then performance will be impaired and it will be necessary to rebalance the partitions, which is a significant maintenance operation. By contrast, when you use vertical partitioning, the partitions never become unbalanced. This is because there are always exactly the same number of fields in each column of the table. There is, however, a downside. When you are partitioning by row, you know that each row is exactly the same size (disregarding nulls) as every other row in the table. This is not the case with columns. One column may contain 5 digit numeric fields while another holds a 30 digit alphanumeric field. Thus the initial calculation as to the optimal partitioning approach is relatively complex (though you would expect the database software to help you in this process). In addition, the different compression algorithms used against each column also needs to be taken into account. Nevertheless, in our view this trade-off is one that is worthwhile since it obviates the need for re-balancing once the system is set up. A Bloor White Paper What’s Cool about Columns (and how to extend their benefits) Column-based databases: generalised functions Loading data Loading large amounts of batch data is not typically a problem for column-based databases as they load data by column. Moreover, leading vendors typically have partnerships with ETL (extract, transform and load) suppliers to enable this. As a result, there is no theoretical reason why bulk load speeds should be any different for a column-based as opposed to a row-based database. It is a different matter, however, when it comes to inserting new records on a one-by-one basis (or updating and deleting them, for that matter). In a conventional environment you simply add or delete the relevant row and update any indexes that may be in place. However, when using a column-based approach you have to make a separate insertion or deletion for each column to which that row refers. Thus you could easily have 30 or 100 times as much work to do. Of course, the judicious use of parallelism can reduce the performance implications of this but a sensible approach would be to defer insertions and deletions until you have a mini-batch size where the number of columns to be updated at least equals the number of rows (so batch sizes of 30 or 100 say). Note that, provided the software offers the ability to query this data in-memory prior to be written to disk, then the use of these micro-batches for loading should have no impact on real-time query performance. Nevertheless, it should be clear from the preceding paragraph that column-based databases are not suitable for use in transactional environments because one-by-one row insertions and updates are precisely what you need in this environment. Columns are used in data retrieval scenarios not update ones. Parallelism There are no intrinsic advantages that parallelism brings to columns as opposed to rows: spreading a query load across multiple processors should bring commensurate benefits in either environment and, similarly, parallelising loads across columns is equivalent to doing the same thing across rows. However, that assumes that you are starting with a blank sheet of paper: you are not likely to achieve such good results if you shoehorn parallelism into a database that was not originally designed for it, as opposed to designing it for parallelism in the first place. A Bloor White Paper 11 More pertinent to this column versus row-based discussion is that some column-based vendors have taken a different approach to parallelism when compared to traditional row-based vendors. As discussed, the emphasis from the latter is on improving the performance of individual queries because, all too often, they simply aren’t good enough. However, column-based suppliers, thanks to the superior performance that columns can bring, do not have the same concerns and, for this reason, a number of these vendors have concentrated, for parallelism, on improving performance across queries rather than within them so that overall workload throughput is much improved. Of course, ideally you would like to have both and some suppliers are doing this: so that you can dedicate parallel resources to any queries that need them while otherwise focusing on the broader workload. As workloads increase (especially, with more realtime and operational BI users) this is going to become increasingly important. Combination/workload issues The issue here is combining high performance for individual queries with similarly high performance across multiple queries and query types, some of which may be very short running queries and others of which may be long running, or anything in between. There is a clear architectural benefit to be gained here from using a column-based approach. This is because, as previously stated, you do not have to worry about the performance of individual queries so that suppliers can focus their design efforts on ensuring high performance across the potentially (tens of) thousands of queries that may be running at any one time. This is not to say that this is impossible to resolve using a traditional row-based approach but the challenge is much greater because you have two, not necessarily complementary, ������������������ design criteria that you have to meet. Tokenisation While the core of column-based approaches is based upon the use of columns this is typically (though not always) combined with some form of tokenisation, which is the subject of this section. This section is quite technical and it can be skipped by those not needing this level of information. Note that you don’t need to know anything about this, even as a DBA, because it should all be handled automatically for you by the software, so this is just to explain what is going on under the covers. © 2010 Bloor Research What’s Cool about Columns (and how to extend their benefits) Column-based databases: generalised functions Tokenisation is often taken to be synonymous with column-based processing. However, it is not. Some vendors employ tokenisation throughout their products while for others it is optional. In addition, some of the leading row-based vendors now also use a form of tokenisation, specifically to provide advanced compression facilities (as discussed previously). Nevertheless, tokenisation is a major part of the column-based story. However, the usual (third) method of supporting tokenisation (on which we will now concentrate) is that each row in a table is assigned a row ID (usually a sequential integer) and each unique value within a column is assigned a value ID. These may also consist of sequential integers (for example, Michigan might be assigned “7” and New York “8”) but where the column contains numeric values then these can form their own IDs. Put briefly, the aim of tokenisation is to separate data values from data use. This has the effect of reducing data requirements and improving performance. As a practical example, in a customer table you might have many customers in Michigan and each of them would have “Michigan” stored as a part of their address. To store this several hundred, or even thousands, of times is wasteful. Tokenisation aims to minimise this redundancy. However, there is more than one way of doing this. The next step in the tokenisation process is, for each column, to combine the row IDs with the column values into a matrix. This might result in a table such as Table 1. The simplest method of tokenisation is to store a token (usually a numerical value—see below) that represents “Michigan” each time that it appears within a table and then have a look-up table so that you can convert from the token to the data value. Now, of course, there is no reason why you couldn’t do this with a conventional relational database. The reason why it isn’t usually implemented is that any savings are not worth the candle (but see discussion on compression). Yes, you may require less storage capacity but you have additional I/O because you have to read the look-up table. A second approach is to store “Michigan” once and then use pointers (more accurately, vectors), which associate this data value to its use. A simple way to think of this is that data is stored as a giant list, which holds each data value just once, together with the vectors that define where each data value is used. Then you can use search engine technology to answer queries. A typical approach would be to apply tokenisation algorithms that are datatypespecific. So, for example, there would be a different tokenisation algorithm for numeric, decimal, alphanumeric, and date and time datatypes, amongst others. © 2010 Bloor Research 12 Row ID Value ID 1 1 2 7 3 8 4 8 5 3 So, row 2 refers to a customer in Michigan, while rows 3 and 4 are for customers both of whom are in New York. This process is referred to as decomposition into collections of columns. In practice, of course, it is not necessary to store the Row IDs, since these can be inferred because of their sequential position. However, while this is the way that we can logically think about tokenisation it is not, in practice, the way that it is implemented. This is achieved through an incidence matrix (or bit array) that would represent this information as shown in Table 2. A Bloor White Paper What’s Cool about Columns (and how to extend their benefits) Column-based databases: generalised functions Value ID Row ID 1 2 3 5 6 7 8 1 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0 3 0 0 0 0 0 0 0 1 4 0 0 0 0 0 0 0 1 5 0 0 1 0 0 0 0 0 In this diagram a one (1) represents a correspondence and a zero (0) shows no such relationship. There is precisely a single one in each row but there may be multiple ones in each column. It should be immediately clear that if there are a large number of unique values then there will be an explosion of zeros to be stored. While bit arrays can offer very rapid information retrieval, particularly when cached in memory, this data expansion needs to be contained if it is not to mean that you lose the space saving advantages of using tokenisation in the first place. This is, of course, a common problem, not just with tokenisation but also in OLAP cubes, for example, as well as with conventional bit-maps vis a vis our previous discussions on count queries. A Bloor White Paper 4 13 In theory there are two obvious approaches to this issue. The first is to limit the use of tokenisation to low cardinality fields. That is, fields where there are a limited number of different values. There are, for example, only 50 US states, so this would be ideal for tokenisation. In practice, you start to lose the benefits of tokenisation with a cardinality much above 1,500, so it may be useful if a vendor can offer alternative approaches for higher cardinality fields such as other types of (conventional) indexing. Alternatively, you can compress each column in the bit array into an encoded bit vector and then process these encoded bit vectors directly, without their being unpacked. So, one way or another, you limit any data explosion because of nulls. © 2010 Bloor Research What’s Cool about Columns (and how to extend their benefits) Moving beyond columns It should be clear that column-based architectures offer some significant advantages over traditional row-based approaches. However, this does not mean that they are better than their row-based counterparts for all types of queries and it does not mean that they are optimal in every respect, even in areas where they already offer superior performance. A report card might read “has done well, at the top of the class—but could do better”. Needless to say, there are multiple ways in which one might approach the issue of improving the performance and utility of columnbased databases. In the remainder of this section we will describe and discuss the approach taken by Infobright to extend columnar capability. This is based around the use of a technology known as a Knowledge Grid. This is unique and makes Infobright completely different from any other product on the market. In order to understand the Knowledge Grid you first need to understand data packs. A data pack is a part of a column with a total size of 64k and Infobright breaks each column down into these packs and then stores each data pack separately, with compression applied at the pack level (which actually means that it can sometimes be more efficient than when operating on a pure column basis). At the same time the software creates metadata about the contents of each data pack, which is stored in the Knowledge Grid and is automatically updated whenever the warehouse is updated. This metadata in the Knowledge Grid includes parameters for each data pack such as the maximum and minimum values contained therein, a histogram of the range of these values, a count of the number of entries within the data pack, aggregates (where relevant) and so on. What all of this means is that a number of queries (such as count queries) can be resolved without reading the data at all. Moreover, as Infobright continues to expand the metadata held in its Knowledge Grid (for example, it intends to extend it into vertical and domain-specific areas) then more and more queries will be answered directly from the Knowledge Grid. © 2010 Bloor Research 14 So, the first thing that happens when a query is received is that the database engine looks to see if it can answer all or part of the query directly from the Knowledge Grid. However the nirvana of being able to answer all questions directly from the Knowledge Grid will never come. Sometimes you just have to read the data. When this is necessary the software first accesses the Knowledge Grid to see which data packs it needs to resolve that query and then it only reads and decompresses those data packs. Further, it works in an iterative fashion so that as it processes each part of a query it can eliminate the need to access more data. Thus, at least for some queries, depending on how many data packs you need to access, you read and decompress even less data than you would when using a standard columnar database. Further, apart from understanding each data pack, the Knowledge Grid also understands the relationships that exist between different data packs in order to provide even better query performance; for example, because it knows which pairs of data packs (on columns from different tables) would need to be accessed if a join condition applied across their respective columns. Note that thanks to the Knowledge Grid, Infobright does not require you to partition the data. This not only reduces administration but it also prevents data skew, which is a performance problem for vendors using horizontal (rowbased) partitioning and which forces re-balancing of the warehouse, as discussed previously. Readers interested in learning more about Infobright can visit its website at www.infobright. com or the Infobright open source community at www.infobright.org. A Bloor White Paper What’s Cool about Columns (and how to extend their benefits) Conclusion Columns provide better performance at a lower cost with a smaller footprint: it is difficult to understand why any company seriously interested in query performance would not consider a column-based solution. Using columns instead of rows means that you get greatly reduced I/O because you only read the columns referenced by the query. This means that you get dramatically improved performance. The use of compression improves I/O rates (and performance) still further. In addition, depending on the supplier, you can eliminate or greatly reduce any need for indexes (thereby reducing on-going administration requirements) and, where they may be usefully used, they can be created automatically. In summary, this means that queries run faster (much faster), the database is much smaller that it would otherwise be (with all the upfront and ongoing cost benefits that implies) and there is less administration required than would otherwise be the case (with further ongoing cost benefits). In addition, you may be able to run queries that simply could not be supported by more conventional means. However, despite all of these comments, the use of columns is not a panacea. In fact, Infobright demonstrates this very clearly by its improvement on the fundamental architecture of column-based relational databases, which will provide better compression, at least in some instances, and improved query performance through extending the columnar paradigm. Further Information Further information about this subject is available from http://www.BloorResearch.com/update/2065 A Bloor White Paper 15 © 2010 Bloor Research Bloor Research overview About the author Bloor Research is one of Europe’s leading IT research, analysis and consultancy organisations. We explain how to bring greater Agility to corporate IT systems through the effective governance, management and leverage of Information. We have built a reputation for ‘telling the right story’ with independent, intelligent, well-articulated communications content and publications on all aspects of the ICT industry. We believe the objective of telling the right story is to: Philip Howard Research Director - Data • Describe the technology in context to its business value and the other systems and processes it interacts with. • Understand how new and innovative technologies fit in with existing ICT investments. • Look at the whole market and explain all the solutions available and how they can be more effectively evaluated. • Filter “noise” and make it easier to find the additional information or news that supports both investment and implementation. • Ensure all our content is available through the most appropriate channel. Founded in 1989, we have spent over two decades distributing research and analysis to IT user and vendor organisations throughout the world via online subscriptions, tailored research services, events and consultancy projects. We are committed to turning our knowledge into business value for you. Philip started in the computer industry way back in 1973 and has variously worked as a systems analyst, programmer and salesperson, as well as in marketing and product management, for a variety of companies including GEC Marconi, GPT, Philips Data Systems, Raytheon and NCR. After a quarter of a century of not being his own boss Philip set up what is now P3ST (Wordsmiths) Ltd in 1992 and his first client was Bloor Research (then ButlerBloor), with Philip working for the company as an associate analyst. His relationship with Bloor Research has continued since that time and he is now Research Director. His practice area encompasses anything to do with data and content and he has five further analysts working with him in this area. While maintaining an overview of the whole space Philip himself specialises in databases, data management, data integration, data quality, data federation, master data management, data governance and data warehousing. He also has an interest in event stream/complex event processing. In addition to the numerous reports Philip has written on behalf of Bloor Research, Philip also contributes regularly to www.IT-Director.com and www.ITAnalysis.com and was previously the editor of both “Application Development News” and “Operating System News” on behalf of Cambridge Market Intelligence (CMI). He has also contributed to various magazines and published a number of reports published by companies such as CMI and The Financial Times. Away from work, Philip’s primary leisure activities are canal boats, skiing, playing Bridge (at which he is a Life Master) and walking the dog. Copyright & disclaimer This document is copyright © 2010 Bloor Research. No part of this publication may be reproduced by any method whatsoever without the prior consent of Bloor Research. Due to the nature of this material, numerous hardware and software products have been mentioned by name. In the majority, if not all, of the cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not Bloor Research’s intent to claim these names or trademarks as our own. Likewise, company logos, graphics or screen shots have been reproduced with the consent of the owner and are subject to that owner’s copyright. Whilst every care has been taken in the preparation of this document to ensure that the information is correct, the publishers cannot accept responsibility for any errors or omissions. 2nd Floor, 145–157 St John Street LONDON, EC1V 4PY, United Kingdom Tel: +44 (0)207 043 9750 Fax: +44 (0)207 043 9748 Web: www.BloorResearch.com email: [email protected]