ClickHouse Projections for Query Optimization

Table of Contents

Introduction

Among DBAs, there is much debate regarding the significance of database performance tuning. As we now understand, various types of data are what really power the business world. It should be operating very effectively and always be accessible to ensure that the data is quickly and easily accessible for all beneficiaries.

Getting a regular database performance audit is one of the finest strategies to deal with previous performance problems; however, some performance tips are unique to each database. These special techniques present excellent database performance, and in this article, we’ll mention another one for ClickHouse.

We’ll use the table we created in the earlier article. Recall the following table and the same approximately 12 million data again:

CREATE TABLE exp_test
(
    `id` UInt64,
    `location` String,
    `value` Varchar
)
ENGINE = MergeTree
ORDER BY id;

Only filtering the id column in a query, the primary key will use to optimize scanning. All values are sorted for this column. But if we filter the location or value column in our query, it will lead to a full table scan. All of the granules and parts have been read 100%, which is expected if we look at the EXPLAIN output.

Let’s see:

EXPLAIN json = 1, indexes = 1 SELECT count(*)
FROM exp_test
WHERE location = 'Geneva';

"Initial Parts": 6,
"Selected Parts": 6,
"Initial Granules": 12209,
"Selected Granules": 12209

Using Projection with One Column

Projections can be used to create additional sorted datasets for table columns, which ClickHouse can leverage to run particular queries more quickly. A projection is simply a portion of the original table’s data that has been stored in an ordered format so that it can be used for specific queries.

To apply this, firstly, we can add the location column for Projection. Data that has just received the table will automatically be processed and stored in the projection for this column. Then, if we want to make a projection for already-existing data, we require the MATERIALIZE operation.

ALTER TABLE exp_test ADD projection location (SELECT location ORDER BY location);
ALTER TABLE exp_test MATERIALIZE projection location;

Thus, the location column’s data is sequentially ordered, just like the primary key column. So, when we run the query, we will see that our query is executed much faster.

┌─explain────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ [
  {
    "Plan": {
      "Node Type": "Expression",
      "Description": "(Projection + Before ORDER BY)",
      "Plans": [
        {
          "Node Type": "Aggregating",
          "Plans": [
            {
              "Node Type": "Expression",
              "Description": "Before GROUP BY",
              "Plans": [
                {
                  "Node Type": "ReadFromStorage",
                  "Description": "MergeTree(with Normal projection location)"
                }
              ]
            }
          ]
        }
      ]
    }
  }
] │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We can obsolete the outputs with and without the Projection feature:

Without Projection

┌─count()─┐
│ 4299126 │
└─────────┘

1 row in set. Elapsed: 0.025 sec. Processed 11.97 million rows, 11.97 MB (488.72 million rows/s., 488.72 MB/s.)

To determine the solution, ClickHouse must search the full data collection.

With Projection

┌─count()─┐
│ 4299126 │
└─────────┘

1 row in set. Elapsed: 0.012 sec. Processed 4.31 million rows, 4.31 MB (351.75 million rows/s., 351.75 MB/s.)

Now query is executed in a much more efficient way with Projection. Certainly, the difference in performance will widen with Projection, despite the bigger data set.

Using Projection with Multiple Columns

Of course we can utilize this feature with adding more column if we need.

ALTER TABLE exp_test ADD projection location_value (SELECT location, value ORDER BY location, value);
ALTER TABLE exp_test MATERIALIZE projection location_value;

Now, we can utilize the location and value column together or separately. Good thing is that we will utilize the Projection functionality if we only wish to query the value column even though, Projection exists for location and value column together.

Drop Projection

We need to carefully analyze the whole query and choose the optimum candidate query. Because operations will be slower if our table contains INSERT operations due to the Projection feature, we should pick a query that has a negative effect on the system, if possible.

ALTER TABLE exp_test DROP PROJECTION location;
ALTER TABLE exp_test DROP PROJECTION location_value;

Conclusion

Projections can be utilized to enhance query performance when the primary key is unable to manage the load. It is important to remember that each projection will reduce the speed at which data is ingested; therefore, projections should only be used when slow queries severely affect the system. It’s important to note that, all of these procedures have been examined with ClickHouse 22.11.2.30.

To read more about SQL Engineering in ClickHouse for query performance, do consider reading the following articles:

ChistaDATA Inc.

Enterprise-class 24*7 ClickHouse Consultative Support and Managed Services

ClickHouse Performance: How to use Projections for Query Optimization