Understanding Named Collections in ClickHouse

Introduction

ClickHouse, a high-performance columnar database management system, provides a feature called “named collections” to streamline the management and reuse of configuration parameters. This feature is especially beneficial for managing external dictionaries and database connections, making your configurations more maintainable and your SQL queries cleaner and more readable.

What are Named Collections?

Named collections in ClickHouse are predefined sets of parameters that can be reused across different parts of your configuration and SQL queries. They help avoid redundancy, simplify updates, and keep your SQL queries concise and clear.

Benefits of Named Collections

  • Reusability: Define parameters once and reuse them as needed across various configurations and queries.
  • Maintainability: Easily update connection details or other parameters in a single location without needing to modify multiple files or queries.
  • Clarity: Simplify and shorten SQL queries and configuration files, making them easier to understand and manage.

Creating Named Collections

Named collections in ClickHouse allow users to define reusable sets of key-value pairs that can be referenced in SQL queries. This feature is particularly useful for managing connection parameters to external data sources like S3, MySQL, and PostgreSQL.

To create a named collection, you use the CREATE NAMED COLLECTION statement. Here is an example of creating a named collection for accessing an S3 bucket:

CREATE NAMED COLLECTION s3_mydata AS 
    access_key_id = 'CHISTADATA47EXAMPLE', 
    secret_access_key = 'wJalrXUtnFEMI/KCHISTADATA7MDENG/bPxRfiCYEXAMPLEKEY', 
    format = 'CSV', 
    url = 'https://s3.us-east-1.amazonaws.com/yourbucket/mydata/';

This named collection can then be used in various functions and table engines to simplify and secure the connection details.

Defining Named Collections in Configuration Files

Named collections can also be defined in the ClickHouse server configuration file, typically config.xml, or another included XML configuration file. Here’s an example of defining a named collection for a MySQL connection:

<yandex>
    ...
    <named_collections>
        <mysql_connection>
            <host>mysql_host</host>
            <port>3306</port>
            <user>mysql_user</user>
            <password>mysql_password</password>
            <database>mysql_database</database>
        </mysql_connection>
    </named_collections>
    ...
</yandex>

In this example, the mysql_connection collection includes parameters required to connect to a MySQL database.

Using Named Collections

Once a named collection is created, it can be used in SQL queries to reference the predefined parameters. For instance, using the named collection s3_mydata in an INSERT statement:

INSERT INTO FUNCTION s3(s3_mydata, filename = 'test_file.tsv.gz', format = 'TSV', structure = 'number UInt64', compression_method = 'gzip')
SELECT * FROM numbers(10000);

Similarly, it can be used to create tables with the S3 engine:

CREATE TABLE s3_engine_table (number Int64)
ENGINE=S3(s3_mydata, url='https://s3.us-east-1.amazonaws.com/yourbucket/mydata/test_file.tsv.gz', format = 'TSV')
SETTINGS input_format_with_names_use_header = 0;

Modifying Named Collections

You can alter existing named collections using the ALTER NAMED COLLECTION statement. This allows you to add, update, or delete keys within the collection. Here are some examples:

  1. Add or Update Keys:
    ALTER NAMED COLLECTION s3_mydata 
    SET access_key_id = 'NEWACCESSKEY', 
        secret_access_key = 'NEWSECRETKEY';
    
  2. Delete Keys:
    ALTER NAMED COLLECTION s3_mydata 
    DELETE format, url;
    
  3. Combination of Add and Delete:
    ALTER NAMED COLLECTION s3_mydata 
    SET access_key_id = 'NEWACCESSKEY' 
    DELETE secret_access_key;
    

     

Dropping Named Collections

To delete a named collection, use the DROP NAMED COLLECTION statement:

DROP NAMED COLLECTION s3_mydata;

Advanced Use Cases

  1. Handling Multiple Connections: If your application needs to connect to multiple databases or different instances of the same database, you can define multiple named collections:
    <yandex>
        ...
        <named_collections>
            <mysql_connection_primary>
                <host>primary_mysql_host</host>
                <port>3306</port>
                <user>primary_user</user>
                <password>primary_password</password>
                <database>primary_database</database>
            </mysql_connection_primary>
            <mysql_connection_secondary>
                <host>secondary_mysql_host</host>
                <port>3306</port>
                <user>secondary_user</user>
                <password>secondary_password</password>
                <database>secondary_database</database>
            </mysql_connection_secondary>
        </named_collections>
        ...
    </yandex>
    
  2. Dynamic Configuration Updates: In environments where database configurations change frequently, using named collections allows for dynamic updates. By updating the named collection definitions in the configuration file and restarting the ClickHouse server, you ensure all queries referencing these collections automatically use the updated settings.

Challenges and Solutions

  1. Configuration Management: Managing configurations across multiple environments (development, staging, production) can be challenging. Using named collections can simplify this by having environment-specific configuration files, ensuring consistency and reducing manual errors.
  2. Security Considerations:Ensure that sensitive information such as database passwords is handled securely. Use environment variables or secrets management tools to inject these values into the ClickHouse configuration securely.

Best Practices

  1. Consistent Naming Conventions: Use clear and consistent naming conventions for your named collections to avoid confusion and make the configuration files more readable.
  2. Document Configuration Changes: Maintain documentation for any changes made to named collections. This practice helps in tracking updates and understanding the current configuration state.
  3. Test Configuration Updates: Before applying configuration changes in a production environment, thoroughly test them in a staging environment to ensure they work as expected without causing disruptions.

Practical Example

Consider a scenario where you have multiple external dictionaries using a named MySQL connection. Here’s how you can manage the named collection effectively:

<yandex>
    <named_collections>
        <mysql_connection>
            <host>mysql_host</host>
            <port>3306</port>
            <user>mysql_user</user>
            <password>mysql_password</password>
            <database>mysql_database</database>
        </mysql_connection>
    </named_collections>
</yandex>

Initial Definition:

<yandex>
    <named_collections>
        <mysql_connection>
            <host>mysql_host</host>
            <port>3306</port>
            <user>mysql_user</user>
            <password>mysql_password</password>
            <database>mysql_database</database>
        </mysql_connection>
    </named_collections>
</yandex>

Using the Named Collection in SQL Queries:

CREATE DICTIONARY example_dict
(
    id UInt64,
    name String
)
PRIMARY KEY id
SOURCE(MYSQL(named_collection 'mysql_connection', table 'example_table'))
LAYOUT(HASHED())
LIFETIME(300);

Updating the Connection:

If the MySQL server host changes:

<yandex>
    <named_collections>
        <mysql_connection>
            <host>new_mysql_host</host>
            <port>3306</port>
            <user>mysql_user</user>
            <password>mysql_password</password>
            <database>mysql_database</database>
        </mysql_connection>
    </named_collections>
</yandex>

Dropping the Named Collection:

If the connection is no longer needed, remove it:

<yandex>
    <named_collections>
        <!-- mysql_connection block removed -->
    </named_collections>
</yandex>

Managing Permissions

To manage named collections, users need appropriate permissions, typically configured in the ClickHouse server’s user settings. This involves editing the configuration files to grant the named_collection_control privilege.

<clickhouse>
  <users>
    <default>
      <password_sha256_hex>65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5</password_sha256_hex replace=true>
      <access_management>1</access_management>
      <named_collection_control>1</named_collection_control>
    </default>
  </users>
</clickhouse>

Conclusion

We can define, update, and remove named collections in ClickHouse. These operations help ensure that your configuration remains clean, efficient, and easy to maintain. By following the outlined procedures, you can effectively manage your named collections, keeping your ClickHouse environment organized and up-to-date. For more detailed information, refer to the official ClickHouse documentation​ .

About Can Sayn 38 Articles
Can Sayın is experienced Database Administrator in open source relational and NoSql databases, working in complicated infrastructures. Over 5 years industry experience, he gain managing database systems. He is working at ChistaDATA Inc. His areas of interest are generally on open source systems.
Contact: Website

Be the first to comment

Leave a Reply