Comprehensive Guide to ClickHouse EXPLAIN

ClickHouse Query Analysis using Explain Plan

Introduction

Have you ever pondered over how database engines handle queries and optimize their execution? Have you ever wondered which queries are responsible for slowing down your database? Are you curious to know who is running those queries? And where exactly is it going wrong? Well, look no further, because the Explain plan is the answer you’ve been seeking!

In this blog post, we will delve into the world of ClickHouse’s Explain plan, revealing its significance and the invaluable insights it provides. By understanding the Explain plan, you’ll better understand how your queries are processed and optimized, empowering you to identify and resolve performance bottlenecks in your database. So, let’s dive in.

As someone with a background in MySQL, I initially encountered challenges in comprehending the comprehensive plan generated by ClickHouse. However, upon closer examination, I discovered that ClickHouse provides detailed information that is immensely valuable for developers and administrators seeking to optimize queries. In my opinion, compared to other databases, ClickHouse offers the most comprehensive and informative explanation plan.

In this experiment, I used the “Airlines On-Time Performance” database from the US Department of Transportation Research and Innovative Technology Administration website. I chose to analyze a larger dataset, which I obtained from a publicly available link. . The dataset I worked with contains information about domestic flights within the USA, encompassing flight length and delays. Spanning from 1988 to 2009, the data is divided into monthly chunks. For detailed information on loading this data into ClickHouse, refer to the official ClickHouse website.

Although slightly out of context, I wanted to share something that again caught my attention while loading these data. It’s about the intelligent utilization of a multi-threaded, parallelized approach and vectorization techniques by ClickHouse for efficient data processing. ClickHouse leverages the full potential of modern hardware by distributing the workload across multiple cores on the underlying host. This capability makes ClickHouse exceptionally suitable for handling demanding workloads and large-scale analytics tasks.

Now, let’s delve into the main topic of this blog post

Explaining the Explain Plan.

The EXPLAIN feature has been available since ClickHouse version 20.6. At the time of writing this article, I am using the ClickHouse version 23.3.2.

Different types of EXPLAIN are available in ClickHouse; We will look at them one by one with examples.

  1. AST (Abstract Syntax Tree): Represents the Abstract syntax tree.
  2. SYNTAX: Represents Query text after AST-level optimizations.
  3. QUERY TREE: Display Query tree after Query Tree level optimizations
  4. PLAN: Represents the query execution plan.
  5. PIPELINE — Shows Query execution pipeline.

What happens in the background

When a query is sent to ClickHouse, it undergoes a series of well-orchestrated steps from parsing into an Abstract Syntax Tree (AST), which is then rewritten and normalized. ClickHouse creates a query plan from the AST, builds a concrete query pipeline, and executes it using the query engine, resulting in efficient processing and storage of results.

Here’s a high-level overview of the journey your query takes within ClickHouse:

  1. Client: You send your query to ClickHouse from your client application.
  2. TCP Handler: ClickHouse’s TCP Handler receives the query and initiates the parsing process.
  3. Lexical Analyzer: The query is passed through the Lexical Analyzer, which transforms it into a sequence of tokens. Each token represents a distinct element, such as keywords, identifiers, operators, or literals.
  4. Syntactic Analyzer: The Syntactic Analyzer takes the tokens and constructs an Abstract Syntax Tree (AST). The AST represents the query’s structure and the relationships between its components.
  5. Semantic Analyzer: The Semantic Analyzer examines the AST for semantic errors, ensuring the validity of table and column references and performing data type checks, among other semantic validations.
  6. Expression Analyzer and Optimizer: The Expression Analyzer leverages the AST to optimize the query execution plan. It applies various optimizations based on query patterns, statistics, and configuration settings, enhancing overall performance.
  7. Executor: With the optimized execution plan in hand, the Query Engine proceeds to execute it. This involves reading data from storage, applying filters and transformations, aggregating results, and performing any necessary operations to produce the final outcome.
  8. Socket Buffer: The query results are stored in a Block Stream within the Socket Buffer. This optimized storage mechanism facilitates efficient data transfer between the ClickHouse server and the client application.
  9. Client: The client application retrieves and reads the results from the Socket Buffer, allowing you to access and utilize the final outcome of the query.

 

 

The above diagram is a representation that offers a simplified view of ClickHouse’s query processing pipeline. The process involves additional optimizations and internal procedures, including data filtering, indexing, and more.

See below syntax for EXPLAIN.

Syntax:

EXPLAIN [AST | SYNTAX | QUERY TREE | PLAN | PIPELINE | ESTIMATE | TABLE OVERRIDE] [setting = value, ...]
    [
      SELECT ... |
      tableFunction(...) [COLUMNS (...)] [ORDER BY ...] [PARTITION BY ...] [PRIMARY KEY] [SAMPLE BY ...] [TTL ...]
    ]
    [FORMAT ...]

Explain Plan in ClickHouse is configurable, so here are some important settings related to the Explain Plan.

  • header: Prints an output header for each step. Default: 0.
  • description: Prints a description for each step. Default: 1.
  • indexes: Shows the indexes used and the number of filtered parts and granules for each index applied. This is supported for MergeTree tables. Default: 0.
  • actions: Print detailed information about the actions performed at each step. Default: 0.
  • json: Prints the query plan steps in JSON format. Default: 0. It is recommended to use the TSVRaw format to avoid unnecessary escaping.

By default, ClickHouse’s explain plan shows the breakdown of the entire query plan. To enhance its usefulness for readers, we can leverage the aforementioned options as its configurable listed above.

EXPLAIN
SELECT count(*)
FROM airline.ontime
WHERE OriginState = 'NY'

Query id: a3f8d68e-2dc8-40e2-b4ab-3382f09e08b2

┌─explain─────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│   Aggregating                               │
│     Expression (Before GROUP BY)            │
│       Filter (WHERE)                        │
│         ReadFromMergeTree (airline.ontime)  │
└─────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.012 sec.

Note: When reading the Explain Plan, starting from the bottom and working your way up is typically best. This allows for a logical understanding of the query execution flow.

Table Schema used in sample queries  

clickhouse1 :)  select formatReadableQuantity(count()) from ontime;

SELECT formatReadableQuantity(count())
FROM ontime

Query id: 4b13dd8d-53b0-463c-85bb-ea04b7e25bd1

┌─formatReadableQuantity(count())─┐
│ 209.96 million                  │
└─────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.


clickhouse1 :) show create table airline.ontime\G

SHOW CREATE TABLE airline.ontime

Query id: 4d781105-262d-4f4d-9c19-0c0417285a1a

Row 1:
──────
statement: CREATE TABLE airline.ontime
(
    `Year` UInt16,
    `Quarter` UInt8,
    `Month` UInt8,
    `DayofMonth` UInt8,
    `DayOfWeek` UInt8,
    `FlightDate` Date,
    `Reporting_Airline` LowCardinality(String),
    `DOT_ID_Reporting_Airline` Int32,
    `IATA_CODE_Reporting_Airline` LowCardinality(String),
    `Tail_Number` LowCardinality(String),
    `Flight_Number_Reporting_Airline` LowCardinality(String),
    `OriginAirportID` Int32,
    `OriginAirportSeqID` Int32,
    `OriginCityMarketID` Int32,
    `Origin` FixedString(5),
    `OriginCityName` LowCardinality(String),
    `OriginState` FixedString(2),
    `OriginStateFips` FixedString(2),
    `OriginStateName` LowCardinality(String),
    `OriginWac` Int32,
    `DestAirportID` Int32,
    `DestAirportSeqID` Int32,
    `DestCityMarketID` Int32,
    `Dest` FixedString(5),
    `DestCityName` LowCardinality(String),
    `DestState` FixedString(2),
    `DestStateFips` FixedString(2),
    `DestStateName` LowCardinality(String),
    `DestWac` Int32,
    `CRSDepTime` Int32,
    `DepTime` Int32,
    `DepDelay` Int32,
    `DepDelayMinutes` Int32,
    `DepDel15` Int32,
    `DepartureDelayGroups` LowCardinality(String),
    `DepTimeBlk` LowCardinality(String),
    `TaxiOut` Int32,
    `WheelsOff` LowCardinality(String),
    `WheelsOn` LowCardinality(String),
    `TaxiIn` Int32,
    `CRSArrTime` Int32,
    `ArrTime` Int32,
    `ArrDelay` Int32,
    `ArrDelayMinutes` Int32,
    `ArrDel15` Int32,
    `ArrivalDelayGroups` LowCardinality(String),
    `ArrTimeBlk` LowCardinality(String),
    `Cancelled` Int8,
    `CancellationCode` FixedString(1),
    `Diverted` Int8,
    `CRSElapsedTime` Int32,
    `ActualElapsedTime` Int32,
    `AirTime` Int32,
    `Flights` Int32,
    `Distance` Int32,
    `DistanceGroup` Int8,
    `CarrierDelay` Int32,
    `WeatherDelay` Int32,
    `NASDelay` Int32,
    `SecurityDelay` Int32,
    `LateAircraftDelay` Int32,
    `FirstDepTime` Int16,
    `TotalAddGTime` Int16,
    `LongestAddGTime` Int16,
    `DivAirportLandings` Int8,
    `DivReachedDest` Int8,
    `DivActualElapsedTime` Int16,
    `DivArrDelay` Int16,
    `DivDistance` Int16,
    `Div1Airport` LowCardinality(String),
    `Div1AirportID` Int32,
    `Div1AirportSeqID` Int32,
    `Div1WheelsOn` Int16,
    `Div1TotalGTime` Int16,
    `Div1LongestGTime` Int16,
    `Div1WheelsOff` Int16,
    `Div1TailNum` LowCardinality(String),
    `Div2Airport` LowCardinality(String),
    `Div2AirportID` Int32,
    `Div2AirportSeqID` Int32,
    `Div2WheelsOn` Int16,
    `Div2TotalGTime` Int16,
    `Div2LongestGTime` Int16,
    `Div2WheelsOff` Int16,
    `Div2TailNum` LowCardinality(String),
    `Div3Airport` LowCardinality(String),
    `Div3AirportID` Int32,
    `Div3AirportSeqID` Int32,
    `Div3WheelsOn` Int16,
    `Div3TotalGTime` Int16,
    `Div3LongestGTime` Int16,
    `Div3WheelsOff` Int16,
    `Div3TailNum` LowCardinality(String),
    `Div4Airport` LowCardinality(String),
    `Div4AirportID` Int32,
    `Div4AirportSeqID` Int32,
    `Div4WheelsOn` Int16,
    `Div4TotalGTime` Int16,
    `Div4LongestGTime` Int16,
    `Div4WheelsOff` Int16,
    `Div4TailNum` LowCardinality(String),
    `Div5Airport` LowCardinality(String),
    `Div5AirportID` Int32,
    `Div5AirportSeqID` Int32,
    `Div5WheelsOn` Int16,
    `Div5TotalGTime` Int16,
    `Div5LongestGTime` Int16,
    `Div5WheelsOff` Int16,
    `Div5TailNum` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (Year, Quarter, Month, DayofMonth, FlightDate, IATA_CODE_Reporting_Airline)
SETTINGS index_granularity = 8192

1 row in set. Elapsed: 0.002 sec.

Now, let’s put the above configuration into action. We’ll consider two simple queries: one utilizing the Primary Index and another that doesn’t use any index, resulting in a scan of all parts and granules.

In the first query, we filter based on the Year column in the WHERE clause. When we examine the execution plan, we can observe the Indexes block, which provides information about the used indexes and their details. In our example, ClickHouse was able to leverage the primary key for query execution, requiring only a single part out of the total 8 and 701 granules out of the total 25,839 to retrieve the results. Generally, having fewer values for Selected Parts and Selected Granules indicates better performance.

**************************** Query using INDEX ******************************************************** 

clickhouse1 :) explain header = 1 , json = 1, indexes = 1 , actions = 1 , description = 1 select count(*) from airline.ontime where Year = '2000'\G

EXPLAIN header = 1, json = 1, indexes = 1, actions = 1, description = 1
SELECT count(*)
FROM airline.ontime
WHERE Year = '2000'

Query id: 00bfb656-5233-4327-b8e5-5ecb8f536003

Row 1:
──────
explain: [
  {
    "Plan": {
      "Node Type": "Expression",
      "Description": "(Projection + Before ORDER BY)",
      "Header": [
        {
          "Name": "count()",
          "Type": "UInt64"
        }
      ],
      "Expression": {
        "Inputs": [
          {
            "Name": "count()",
            "Type": "UInt64"
          }
        ],
        "Actions": [
          {
            "Node Type": "INPUT",
            "Result Type": "UInt64",
            "Result Name": "count()",
            "Arguments": [0],
            "Removed Arguments": [0],
            "Result": 0
          }
        ],
        "Outputs": [
          {
            "Name": "count()",
            "Type": "UInt64"
          }
        ],
        "Positions": [0],
        "Project Input": true
      },
      "Plans": [
        {
          "Node Type": "Aggregating",
          "Header": [
            {
              "Name": "count()",
              "Type": "UInt64"
            }
          ],
          "Keys": [],
          "Aggregates": [
            {
              "Name": "count()",
              "Function": {
                "Name": "count",
                "Argument Types": [],
                "Result Type": "UInt64"
              },
              "Arguments": []
            }
          ],
          "Skip merging": false,
          "Plans": [
            {
              "Node Type": "Expression",
              "Description": "Before GROUP BY",
              "Header": [],
              "Expression": {
                "Inputs": [],
                "Actions": [],
                "Outputs": [],
                "Positions": [],
                "Project Input": false
              },
              "Plans": [
                {
                  "Node Type": "Filter",
                  "Description": "WHERE",
                  "Header": [],
                  "Filter Column": "equals(Year, '2000')",
                  "Removes Filter": true,
                  "Expression": {
                    "Inputs": [
                      {
                        "Name": "Year",
                        "Type": "UInt16"
                      }
                    ],
                    "Actions": [
                      {
                        "Node Type": "INPUT",
                        "Result Type": "UInt16",
                        "Result Name": "Year",
                        "Arguments": [0],
                        "Removed Arguments": [],
                        "Result": 0
                      },
                      {
                        "Node Type": "COLUMN",
                        "Result Type": "String",
                        "Result Name": "'2000'",
                        "Column": "Const(String)",
                        "Arguments": [],
                        "Removed Arguments": [],
                        "Result": 1
                      },
                      {
                        "Node Type": "FUNCTION",
                        "Result Type": "UInt8",
                        "Result Name": "equals(Year, '2000')",
                        "Function": "equals",
                        "Compiled": false,
                        "Arguments": [0, 1],
                        "Removed Arguments": [0, 1],
                        "Result": 2
                      }
                    ],
                    "Outputs": [
                      {
                        "Name": "equals(Year, '2000')",
                        "Type": "UInt8"
                      }
                    ],
                    "Positions": [2],
                    "Project Input": true
                  },
                  "Plans": [
                    {
                      "Node Type": "ReadFromMergeTree",
                      "Description": "airline.ontime",
                      "Header": [
                        {
                          "Name": "Year",
                          "Type": "UInt16"
                        }
                      ],
                      "Read Type": "Default",
                      "Parts": 1,
                      "Granules": 701,
                      "Indexes": [
                        {
                          "Type": "PrimaryKey",
                          "Keys": ["Year"],
                          "Condition": "(Year in [2000, 2000])",
                          "Initial Parts": 8,
                          "Selected Parts": 1,
                          "Initial Granules": 25839,
                          "Selected Granules": 701
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  }
]

1 row in set. Elapsed: 0.006 sec.

Now Let’s consider the second query, where we filter based on the OriginState column in the WHERE clause. In this case, the execution plan reveals that all granules (25,839 out of 25,839) and all parts (8 out of 8) are being scanned. This implies that the query needs to process the entire dataset, which can have an impact on performance.

**************************** Query not using INDEX ******************************************************** 


clickhouse1 :) explain header = 1 , json = 1, indexes = 1 , actions = 1 , description = 1 select count(*) from airline.ontime where OriginState='NY'\G

EXPLAIN header = 1, json = 1, indexes = 1, actions = 1, description = 1
SELECT count(*)
FROM airline.ontime
WHERE OriginState = 'NY'

Query id: 1b20574a-86fe-4f81-abfc-d64011e92a09

Row 1:
──────
explain: [
  {
    "Plan": {
      "Node Type": "Expression",
      "Description": "(Projection + Before ORDER BY)",
      "Header": [
        {
          "Name": "count()",
          "Type": "UInt64"
        }
      ],
      "Expression": {
        "Inputs": [
          {
            "Name": "count()",
            "Type": "UInt64"
          }
        ],
        "Actions": [
          {
            "Node Type": "INPUT",
            "Result Type": "UInt64",
            "Result Name": "count()",
            "Arguments": [0],
            "Removed Arguments": [0],
            "Result": 0
          }
        ],
        "Outputs": [
          {
            "Name": "count()",
            "Type": "UInt64"
          }
        ],
        "Positions": [0],
        "Project Input": true
      },
      "Plans": [
        {
          "Node Type": "Aggregating",
          "Header": [
            {
              "Name": "count()",
              "Type": "UInt64"
            }
          ],
          "Keys": [],
          "Aggregates": [
            {
              "Name": "count()",
              "Function": {
                "Name": "count",
                "Argument Types": [],
                "Result Type": "UInt64"
              },
              "Arguments": []
            }
          ],
          "Skip merging": false,
          "Plans": [
            {
              "Node Type": "Expression",
              "Description": "Before GROUP BY",
              "Header": [],
              "Expression": {
                "Inputs": [],
                "Actions": [],
                "Outputs": [],
                "Positions": [],
                "Project Input": false
              },
              "Plans": [
                {
                  "Node Type": "Filter",
                  "Description": "WHERE",
                  "Header": [],
                  "Filter Column": "equals(OriginState, 'NY')",
                  "Removes Filter": true,
                  "Expression": {
                    "Inputs": [
                      {
                        "Name": "OriginState",
                        "Type": "FixedString(2)"
                      }
                    ],
                    "Actions": [
                      {
                        "Node Type": "INPUT",
                        "Result Type": "FixedString(2)",
                        "Result Name": "OriginState",
                        "Arguments": [0],
                        "Removed Arguments": [],
                        "Result": 0
                      },
                      {
                        "Node Type": "COLUMN",
                        "Result Type": "String",
                        "Result Name": "'NY'",
                        "Column": "Const(String)",
                        "Arguments": [],
                        "Removed Arguments": [],
                        "Result": 1
                      },
                      {
                        "Node Type": "FUNCTION",
                        "Result Type": "UInt8",
                        "Result Name": "equals(OriginState, 'NY')",
                        "Function": "equals",
                        "Compiled": false,
                        "Arguments": [0, 1],
                        "Removed Arguments": [0, 1],
                        "Result": 2
                      }
                    ],
                    "Outputs": [
                      {
                        "Name": "equals(OriginState, 'NY')",
                        "Type": "UInt8"
                      }
                    ],
                    "Positions": [2],
                    "Project Input": true
                  },
                  "Plans": [
                    {
                      "Node Type": "ReadFromMergeTree",
                      "Description": "airline.ontime",
                      "Header": [
                        {
                          "Name": "OriginState",
                          "Type": "FixedString(2)"
                        }
                      ],
                      "Read Type": "Default",
                      "Parts": 8,
                      "Granules": 25839,
                      "Indexes": [
                        {
                          "Type": "PrimaryKey",
                          "Condition": "true",
                          "Initial Parts": 8,
                          "Selected Parts": 8,
                          "Initial Granules": 25839,
                          "Selected Granules": 25839
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  }
]

1 row in set. Elapsed: 0.008 sec

Let’s see how to read the output with the below tags.

  • Name — An index name (for now, is used only for Skip index).
  • Keys — An array of columns used by the index.
  • Condition — A string with condition used.
  • Description — An index (for now, is used only for Skip index).
  • Initial Parts — A number of parts before the index is applied.
  • Selected Parts — A number of parts after the index is applied.
  • Initial Granules — A number of granules before the index is applied.
  • Selected Granulesis — A number of granules after the index is applied.

Let’s look at some more forms of Explains in ClickHouse to illustrate its application.

EXPLAIN ESTIMATE

By utilizing the EXPLAIN ESTIMATE feature, we can pre-estimate the number of rows to scan before executing the query. This estimation can help assess the potential impact on execution time and resource utilization.

clickhouse1 :) EXPLAIN ESTIMATE select count(*) from airline.ontime where Year = '2000';

EXPLAIN ESTIMATE
SELECT count(*)
FROM airline.ontime
WHERE Year = '2000'

Query id: be6d263e-e4a3-469e-8d3a-f90cbf647614

┌─database─┬─table──┬─parts─┬────rows─┬─marks─┐
│ airline  │ ontime │     1 │ 5686047 │   701 │
└──────────┴────────┴───────┴─────────┴───────┘

1 row in set. Elapsed: 0.011 sec.

clickhouse1 :) EXPLAIN ESTIMATE select count(*) from airline.ontime where OriginState='NY';

EXPLAIN ESTIMATE
SELECT count(*)
FROM airline.ontime
WHERE OriginState = 'NY'

Query id: 97b0ec8f-4bce-4c0e-8861-7f91350a9a5b

┌─database─┬─table──┬─parts─┬──────rows─┬─marks─┐
│ airline  │ ontime │     8 │ 209961414 │ 25839 │
└──────────┴────────┴───────┴───────────┴───────┘

1 row in set. Elapsed: 0.005 sec.

EXPLAIN PLAN

The EXPLAIN PLAN function is the default and widely used tool for viewing it. With the EXPLAIN function, you can gain valuable insights into the execution plan by utilizing different parameters, such as header, description, indexesjson and action. This allows you to explore and analyze the plan from various angles, providing a comprehensive understanding of how your queries are executed in the database.

clickhouse1 :) EXPLAIN PLAN SELECT
                  min(Year), max(Year), IATA_CODE_Reporting_Airline AS Carrier, count(*) AS cnt,
                  sum(ArrDelayMinutes>30) AS flights_delayed,
                  round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
               FROM ontime
               WHERE
                  DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
                  AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
                  AND FlightDate < '2010-01-01'
               GROUP by Carrier
               HAVING cnt>100000 and max(Year)>1990
               ORDER by rate DESC
               LIMIT 1000;

EXPLAIN
SELECT
    min(Year),
    max(Year),
    IATA_CODE_Reporting_Airline AS Carrier,
    count(*) AS cnt,
    sum(ArrDelayMinutes > 30) AS flights_delayed,
    round(sum(ArrDelayMinutes > 30) / count(*), 2) AS rate
FROM ontime
WHERE (DayOfWeek NOT IN (6, 7)) AND (OriginState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (DestState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (FlightDate < '2010-01-01')
GROUP BY Carrier
HAVING (cnt > 100000) AND (max(Year) > 1990)
ORDER BY rate DESC
LIMIT 1000

Query id: 2eb05bc0-e353-411c-abcc-1c58c6ab8b75

┌─explain────────────────────────────────────────────┐
│ Expression (Project names)                         │
│   Limit (preliminary LIMIT (without OFFSET))       │
│     Sorting (Sorting for ORDER BY)                 │
│       Expression ((Before ORDER BY + Projection))  │
│         Filter (HAVING)                            │
│           Aggregating                              │
│             Expression (Before GROUP BY)           │
│               Filter                               │
│                 ReadFromMergeTree (airline.ontime) │
└────────────────────────────────────────────────────┘

9 rows in set. Elapsed: 0.006 sec.

EXPLAIN AST

It is used to view the syntax tree of the SQL. AST is supported for any command; below example, we have used Explain on ALTER and SELECT commands.

clickhouse1 :) EXPLAIN AST ALTER TABLE  airline.ontime  DELETE WHERE  Year = '2000';

EXPLAIN AST
ALTER TABLE airline.ontime
    DELETE WHERE Year = '2000'

Query id: 6b2d23c3-3d36-4975-b04c-0f4e31be21ae

┌─explain────────────────────────────────┐
│ AlterQuery airline ontime (children 3) │
│  ExpressionList (children 1)           │
│   AlterCommand DELETE (children 1)     │
│    Function equals (children 1)        │
│     ExpressionList (children 2)        │
│      Identifier Year                   │
│      Literal '2000'                    │
│  Identifier airline                    │
│  Identifier ontime                     │
└────────────────────────────────────────┘

9 rows in set. Elapsed: 0.001 sec.


clickhouse1 :) EXPLAIN AST SELECT
                   Origin,
                   count(*) AS c
               FROM ontime
               WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
               GROUP BY Origin
               ORDER BY c DESC
               LIMIT 10

EXPLAIN AST
SELECT
    Origin,
    count(*) AS c
FROM ontime
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY Origin
ORDER BY c DESC
LIMIT 10

Query id: fa4ed131-d22b-4175-9c76-04c7623e7078

┌─explain─────────────────────────────────────┐
│ SelectWithUnionQuery (children 1)           │
│  ExpressionList (children 1)                │
│   SelectQuery (children 6)                  │
│    ExpressionList (children 2)              │
│     Identifier Origin                       │
│     Function count (alias c) (children 1)   │
│      ExpressionList (children 1)            │
│       Asterisk                              │
│    TablesInSelectQuery (children 1)         │
│     TablesInSelectQueryElement (children 1) │
│      TableExpression (children 1)           │
│       TableIdentifier ontime                │
│    Function and (children 1)                │
│     ExpressionList (children 3)             │
│      Function greater (children 1)          │
│       ExpressionList (children 2)           │
│        Identifier DepDelay                  │
│        Literal UInt64_10                    │
│      Function greaterOrEquals (children 1)  │
│       ExpressionList (children 2)           │
│        Identifier Year                      │
│        Literal UInt64_2000                  │
│      Function lessOrEquals (children 1)     │
│       ExpressionList (children 2)           │
│        Identifier Year                      │
│        Literal UInt64_2008                  │
│    ExpressionList (children 1)              │
│     Identifier Origin                       │
│    ExpressionList (children 1)              │
│     OrderByElement (children 1)             │
│      Identifier c                           │
│    Literal UInt64_10                        │
└─────────────────────────────────────────────┘

32 rows in set. Elapsed: 0.001 sec.

EXPLAIN SYNTAX

It is used to optimize the syntax; when validation of the query is successful, we the prepared form on SQL in return.

clickhouse1 :) EXPLAIN SYNTAX SELECT CASE
                                  WHEN number = 1 THEN 'yes'
                                  WHEN number = 2 THEN 'no'
                                  ELSE 'unknown'
                                END
                              FROM numbers(10)

EXPLAIN SYNTAX
SELECT multiIf(number = 1, 'yes', number = 2, 'no', 'unknown')
FROM numbers(10)

Query id: 04517992-1e79-41fa-ba80-dd4d4aa08157

┌─explain────────────────────────────────────────────────────────┐
│ SELECT multiIf(number = 1, 'yes', number = 2, 'no', 'unknown') │
│ FROM numbers(10)                                               │
└────────────────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.001 sec.

EXPLAIN QUERY TREE

It shows the Query tree after Query Tree level optimizations. It works with parameters such as run_passes, dump_passes, and passes. For more details refer clickhouse explain 

Please note that EXPLAIN QUERY TREE is only supported with the new analyzer. You can enable it by setting allow_experimental_analyzer = 1. (Please note that it is marked as NOT_IMPLEMENTED.)

clickhouse1 :) Set allow_experimental_analyzer = 1

SET allow_experimental_analyzer = 1

Query id: 0baf3926-23e5-4153-9135-657c3cc0b825

Ok.

0 rows in set. Elapsed: 0.001 sec.

clickhouse1 :) EXPLAIN QUERY TREE
               SELECT
                   IATA_CODE_Reporting_Airline AS Carrier,
                   count(*)
               FROM ontime
               WHERE (DepDelay > 10) AND (Year = 2007)
               GROUP BY Carrier
               ORDER BY count(*) DESC

EXPLAIN QUERY TREE
SELECT
    IATA_CODE_Reporting_Airline AS Carrier,
    count(*)
FROM ontime
WHERE (DepDelay > 10) AND (Year = 2007)
GROUP BY Carrier
ORDER BY count(*) DESC

Query id: 6be6b829-8fb1-4849-944d-4aa18846924c

┌─explain─────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY id: 0                                                                                                     │
│   PROJECTION COLUMNS                                                                                            │
│     Carrier LowCardinality(String)                                                                              │
│     count() UInt64                                                                                              │
│   PROJECTION                                                                                                    │
│     LIST id: 1, nodes: 2                                                                                        │
│       COLUMN id: 2, column_name: IATA_CODE_Reporting_Airline, result_type: LowCardinality(String), source_id: 3 │
│       FUNCTION id: 4, function_name: count, function_type: aggregate, result_type: UInt64                       │
│   JOIN TREE                                                                                                     │
│     TABLE id: 3, table_name: airline.ontime                                                                     │
│   WHERE                                                                                                         │
│     FUNCTION id: 5, function_name: and, function_type: ordinary, result_type: UInt8                             │
│       ARGUMENTS                                                                                                 │
│         LIST id: 6, nodes: 2                                                                                    │
│           FUNCTION id: 7, function_name: greater, function_type: ordinary, result_type: UInt8                   │
│             ARGUMENTS                                                                                           │
│               LIST id: 8, nodes: 2                                                                              │
│                 COLUMN id: 9, column_name: DepDelay, result_type: Int32, source_id: 3                           │
│                 CONSTANT id: 10, constant_value: UInt64_10, constant_value_type: UInt8                          │
│           FUNCTION id: 11, function_name: equals, function_type: ordinary, result_type: UInt8                   │
│             ARGUMENTS                                                                                           │
│               LIST id: 12, nodes: 2                                                                             │
│                 COLUMN id: 13, column_name: Year, result_type: UInt16, source_id: 3                             │
│                 CONSTANT id: 14, constant_value: UInt64_2007, constant_value_type: UInt16                       │
│   GROUP BY                                                                                                      │
│     LIST id: 15, nodes: 1                                                                                       │
│       COLUMN id: 2, column_name: IATA_CODE_Reporting_Airline, result_type: LowCardinality(String), source_id: 3 │
│   ORDER BY                                                                                                      │
│     LIST id: 16, nodes: 1                                                                                       │
│       SORT id: 17, sort_direction: DESCENDING, with_fill: 0                                                     │
│         EXPRESSION                                                                                              │
│           FUNCTION id: 18, function_name: count, function_type: aggregate, result_type: UInt64                  │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

32 rows in set. Elapsed: 0.002 sec.

EXPLAIN PIPELINE

It is used to view pipeline plans and see how many threads execute the assigned process; PIPELINE can also work with parameters like header, graph, and compact. For graphs, DOT graphic language is used to describe the pipeline in graphical format, and it is disabled by default.

clickhouse1 :) EXPLAIN PIPELINE
               SELECT
                   DestCityName,
                   uniqExact(OriginCityName) AS u
               FROM ontime
               WHERE (Year >= 2000) AND (Year <= 2010)
               GROUP BY DestCityName
               ORDER BY u DESC
               LIMIT 10

EXPLAIN PIPELINE
SELECT
    DestCityName,
    uniqExact(OriginCityName) AS u
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2010)
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10

Query id: c19cc5e6-39e6-4d05-a64d-e699ff0a8cf6

┌─explain───────────────────────────────────────────┐
│ (Expression)                                      │
│ ExpressionTransform                               │
│   (Limit)                                         │
│   Limit                                           │
│     (Sorting)                                     │
│     MergingSortedTransform 8 → 1                  │
│       MergeSortingTransform × 8                   │
│         LimitsCheckingTransform × 8               │
│           PartialSortingTransform × 8             │
│             (Expression)                          │
│             ExpressionTransform × 8               │
│               (Aggregating)                       │
│               Resize 8 → 8                        │
│                 AggregatingTransform × 8          │
│                   StrictResize 8 → 8              │
│                     (Expression)                  │
│                     ExpressionTransform × 8       │
│                       (Expression)                │
│                       ExpressionTransform × 8     │
│                         (ReadFromMergeTree)       │
│                         MergeTreeThread × 8 0 → 1 │
└───────────────────────────────────────────────────┘

21 rows in set. Elapsed: 0.005 sec.

Bonus Tips and Best Practices for Unleashing Query Performance

When it comes to optimization, investing time in carefully designing your schema can accomplish more than 50% of the optimization job. Below are some best practices to consider when designing your schema.

  1. Schema Design: Use the most appropriate data type that accurately represents the data while consuming the least amount of space.
  2. Selection of Codec and LowCardinality: Utilize codecs to compress your data and reduce storage requirements. Additionally, consider using the LowCardinality feature for columns with limited distinct values to optimize memory usage and query performance.
  3. Use Encoding: Use encoding to reduce data size before compression, for example, Delta, Double Delta, and Gorilla.
  4. Compression: Experiment with different compression algorithms to find the best match for your data. Benchmark the performance of various compression methods to determine the one that provides the highest compression ratio without sacrificing query speed.
  5. Select Minimal Columns: Since ClickHouse is a columnar database, fetching only the required columns can significantly improve query performance. Avoid using “SELECT *” when possible, especially if your database is large, as it can cause performance issues.
  6. Filter Rows with WHERE Clause: Use the WHERE clause to filter out unnecessary rows before querying. Filtering at the earliest possible stage reduces the amount of data that needs to be processed and improves query performance.
  7. Optimize Join Operations: If you’re performing JOIN operations, ensure that the table on the right side of the JOIN is smaller and can fit into memory. Loading a smaller table into memory improves the performance of join operations.
  8. Skip Index: Use Skip indexes to avoid unnecessary I/O, for example, Bloom filter indices.
  9. PREWHERE: Use PREWHERE to Filter Unindexed Data. The PREWHERE clause can filter out rows before applying the main WHERE conditions. This is particularly useful for filtering unindexed data and can significantly improve query performance.
  10. Partition Key Selection: Choose a partition key that results in well-distributed and evenly sized partitions. Aim for partition sizes between 1-300 GB and keep the total number of partitions below 1000. Proper partitioning can enhance query performance, especially when working with large datasets.
  11. Bulk Insert Instead of Single Inserts: When inserting data into ClickHouse, use bulk insert methods instead of individual single inserts. Bulk inserts are more efficient and can help avoid background merging operations, which can impact performance.
  12. Primary key/Sorting Key Selection: Consider selecting the primary and sorting keys based on how frequently you access the corresponding columns in your queries. Choosing the right keys can improve data locality and reduce the amount of data that needs to be scanned, resulting in faster query performance.
  13. Materialized views: Use materialized views to transform data outside the source table.

Remember to measure and benchmark the performance impact of each optimization technique in your specific use case to determine the best configuration for your ClickHouse database.

Conclusion

To conclude, numerous external factors can contribute to slow-running queries in database performance, such as network congestion, limited disk space, or high CPU usage. Running it multiple times is crucial to accurately assess query speed, considering various factors like concurrency or isolated testing on dedicated hosts. The Explain Plan is a valuable asset to support this optimization journey. It visually represents the query execution flow within ClickHouse, shedding light on the optimization and execution processes. This empowers developers and administrators to understand the underlying mechanisms at play better. Armed with this knowledge, they can pinpoint areas for potential optimization or fine-tuning, paving the way for improved performance.

To read more about the EXPLAIN tool in ClickHouse, do consider reading the following articles

  1. ClickHouse Performance: Decoding Query Execution Plan with EXPLAIN
  2. ClickHouse EXPLAIN: Determine JOIN Order in Query Execution Plans
  3. ClickHouse EXPLAIN: Display & Analyze Execution Plans