Introduction
Benchmarking is the practice of testing the performance of a system under various workloads and situations in order to identify potential errors, bottlenecks and opportunities for improvement. In the case of ClickHouse, benchmarking can help users determine the hardware and software combinations required to achieve the best performance for specific use cases. Therefore, in this article we will use the TPC-DS benchmarking on the ClickHouse database.
TPC-DS simulates many features of a decision support system, including query and data management. TPC-DS is intended to provide a complete and realistic workload for testing and evaluating database system performance in a retail environment. The TPC-DS benchmark replicates sales and returns data from three retail distribution channels. It includes a simple inventory system, a marketing system, and the construction of tables for sales and returns data models. More information about TPC and its members may be found on the official website.
Why is the TPC-DS benchmark used?
There are various ClickHouse benchmarks available. But TPC-DS is an industry-standard online analytical processing benchmarks (OLAP) for data warehouse. So we will dive into them.
We will demonstrate that how to use dsqgen
tools on an Ubuntu machine to generate test data and test queries, and how to load test data into the generated database to run some TPC-DS queries. There is no need to go into too much detail, as TPC already gives a very thorough discussion of the database and schema it provides, as well as a clear analysis of the limitations and assumptions. This article focuses on creating the components needed to evaluate the target platform (ClickHouse) for TPC-DS datasets and loading data.
TPC-DS Performance Benchmarking with ClickHouse
Firstly you can clone the tpcds-kit and then clone in your environment.
Then you can learn the your gcc version and install it if needed. On Ubuntu, we decided to use gcc-9 to compile the dsdgen, as it might fail with gcc-11 or newer.
gcc --version sudo add-apt-repository ppa:ubuntu-toolchain-r/ppa -y sudo apt update sudo apt install g++-9 gcc-9
Then we can proceed with the process.
sudo apt-get install gcc make flex bison byacc git git clone https://github.com/gregrahn/tpcds-kit.git cd tpcds-kit/tools make clean make CC=gcc-9 OS=LINUX
Data Generation
Created a tpcds
directory to stock up the generated data, then we generate the 10 TB data.
vi generate_tpcds.sh #! /bin/bash cd /data/tpcds-kit/tools/ ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 1 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 2 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 3 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 4 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 5 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 6 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 7 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 8 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 9 & ./dsdgen -scale 10000 -dir /data/tpcds -TERMINATE Y -parallel 10 -child 10 & chmod +x generate_tpcds.sh
Some quick updates about the commands;
PARALLEL
is the total number of chunks. Each statement can create only one chunk. Therefore, the number of chunks indicates the number of times the statement will be executed.CHILD
is the serial number of the chunk created by the current statement.TERMINATE
means a field separator such as a vertical bar (|) is added at the end of each row. You can sayN
for no any field separator is added at the end of each row but we need.
Everything is now ready for the data to be generated.
nohup ./generate_tpcds.sh &
When all the processes are finished, we can check our stock to see if everything is OK or not.
cd /data/tpcds ls call_center_1_10.dat customer_7_10.dat inventory_2_10.dat store_sales_5_10.dat catalog_page_1_10.dat customer_8_10.dat inventory_3_10.dat store_sales_6_10.dat catalog_returns_10_10.dat customer_9_10.dat inventory_4_10.dat store_sales_7_10.dat catalog_returns_1_10.dat customer_address_10_10.dat inventory_5_10.dat store_sales_8_10.dat catalog_returns_2_10.dat customer_address_1_10.dat inventory_6_10.dat store_sales_9_10.dat catalog_returns_3_10.dat customer_address_2_10.dat inventory_7_10.dat time_dim_1_10.dat catalog_returns_4_10.dat customer_address_3_10.dat inventory_8_10.dat warehouse_1_10.dat catalog_returns_5_10.dat customer_address_4_10.dat inventory_9_10.dat web_page_1_10.dat catalog_returns_6_10.dat customer_address_5_10.dat item_1_10.dat web_returns_10_10.dat catalog_returns_7_10.dat customer_address_6_10.dat nohup.out web_returns_1_10.dat catalog_returns_8_10.dat customer_address_7_10.dat promotion_1_10.dat web_returns_2_10.dat catalog_returns_9_10.dat customer_address_8_10.dat reason_1_10.dat web_returns_3_10.dat catalog_sales_10_10.dat customer_address_9_10.dat ship_mode_1_10.dat web_returns_4_10.dat catalog_sales_1_10.dat customer_demographics_10_10.dat store_1_10.dat web_returns_5_10.dat catalog_sales_2_10.dat customer_demographics_1_10.dat store_returns_10_10.dat web_returns_6_10.dat catalog_sales_3_10.dat customer_demographics_2_10.dat store_returns_1_10.dat web_returns_7_10.dat catalog_sales_4_10.dat customer_demographics_3_10.dat store_returns_2_10.dat web_returns_8_10.dat catalog_sales_5_10.dat customer_demographics_4_10.dat store_returns_3_10.dat web_returns_9_10.dat catalog_sales_6_10.dat customer_demographics_5_10.dat store_returns_4_10.dat web_sales_10_10.dat catalog_sales_7_10.dat customer_demographics_6_10.dat store_returns_5_10.dat web_sales_1_10.dat catalog_sales_8_10.dat customer_demographics_7_10.dat store_returns_6_10.dat web_sales_2_10.dat catalog_sales_9_10.dat customer_demographics_8_10.dat store_returns_7_10.dat web_sales_3_10.dat customer_10_10.dat customer_demographics_9_10.dat store_returns_8_10.dat web_sales_4_10.dat customer_1_10.dat date_dim_1_10.dat store_returns_9_10.dat web_sales_5_10.dat customer_2_10.dat dbgen_version_1_10.dat store_sales_10_10.dat web_sales_6_10.dat customer_3_10.dat household_demographics_1_10.dat store_sales_1_10.dat web_sales_7_10.dat customer_4_10.dat income_band_1_10.dat store_sales_2_10.dat web_sales_8_10.dat customer_5_10.dat inventory_10_10.dat store_sales_3_10.dat web_sales_9_10.dat customer_6_10.dat inventory_1_10.dat store_sales_4_10.dat web_site_1_10.dat
Converting the File Formats
A file with the extension DAT is usually a generic data file that stores information specific to the application it refers to. We will convert it into a CSV file for loading into the ClickHouse database. To do this we can create a sh file in the tpcds
folder and execute it.
for file in $(ls -la | grep ".dat" | awk '{print $9}'); do mv "$file" "${file%.*}.csv" done
Creating the Tables
So far we have followed all the steps up to uploading the data to the ClickHouse database. In order to load the data, we now need to create the tables. So this benchmark runs 99 complicated SQL queries on 24 tables with different data sizes. Here is the table list we need to create.
CREATE DATABASE IF NOT EXISTS tpcdsch; USE tpcdsch; CREATE TABLE customer_address ( ca_address_sk Int64, ca_address_id String, ca_street_number Nullable(String), ca_street_name Nullable(String), ca_street_type Nullable(String), ca_suite_number Nullable(String), ca_city Nullable(String), ca_county Nullable(String), ca_state Nullable(String), ca_zip Nullable(String), ca_country Nullable(String), ca_gmt_offset Nullable(Float32), ca_location_type Nullable(String) ) ENGINE = MergeTree() ORDER BY (ca_address_sk); CREATE TABLE customer_demographics ( cd_demo_sk Int64, cd_gender Nullable(String), cd_marital_status Nullable(String), cd_education_status Nullable(String), cd_purchase_estimate Nullable(Int64), cd_credit_rating Nullable(String), cd_dep_count Nullable(Int64), cd_dep_employed_count Nullable(Int64), cd_dep_college_count Nullable(Int64) ) ENGINE = MergeTree() ORDER BY (cd_demo_sk); CREATE TABLE date_dim ( d_date_sk Int64, d_date_id String, d_date Nullable(Date), d_month_seq Nullable(Int64), d_week_seq Nullable(Int64), d_quarter_seq Nullable(Int64), d_year Nullable(Int64), d_dow Nullable(Int64), d_moy Nullable(Int64), d_dom Nullable(Int64), d_qoy Nullable(Int64), d_fy_year Nullable(Int64), d_fy_quarter_seq Nullable(Int64), d_fy_week_seq Nullable(Int64), d_day_name Nullable(String), d_quarter_name Nullable(String), d_holiday Nullable(String), d_weekend Nullable(String), d_following_holiday Nullable(String), d_first_dom Nullable(Int64), d_last_dom Nullable(Int64), d_same_day_ly Nullable(Int64), d_same_day_lq Nullable(Int64), d_current_day Nullable(String), d_current_week Nullable(String), d_current_month Nullable(String), d_current_quarter Nullable(String), d_current_year Nullable(String) ) ENGINE = MergeTree() ORDER BY (d_date_sk); CREATE TABLE warehouse ( w_warehouse_sk Int64, w_warehouse_id String, w_warehouse_name Nullable(String), w_warehouse_sq_ft Nullable(Int64), w_street_number Nullable(String), w_street_name Nullable(String), w_street_type Nullable(String), w_suite_number Nullable(String), w_city Nullable(String), w_county Nullable(String), w_state Nullable(String), w_zip Nullable(String), w_country Nullable(String), w_gmt_offset Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (w_warehouse_sk); CREATE TABLE ship_mode ( sm_ship_mode_sk Int64, sm_ship_mode_id String, sm_type Nullable(String), sm_code Nullable(String), sm_carrier Nullable(String), sm_contract Nullable(String) ) ENGINE = MergeTree() ORDER BY (sm_ship_mode_sk); CREATE TABLE time_dim ( t_time_sk Int64, t_time_id String, t_time Nullable(Int64), t_hour Nullable(Int64), t_minute Nullable(Int64), t_second Nullable(Int64), t_am_pm Nullable(String), t_shift Nullable(String), t_sub_shift Nullable(String), t_meal_time Nullable(String) ) ENGINE = MergeTree() ORDER BY (t_time_sk); CREATE TABLE reason ( r_reason_sk Int64, r_reason_id String, r_reason_desc Nullable(String) ) ENGINE = MergeTree() ORDER BY (r_reason_sk); CREATE TABLE income_band ( ib_income_band_sk Int64, ib_lower_bound Nullable(Int64), ib_upper_bound Nullable(Int64) ) ENGINE = MergeTree() ORDER BY (ib_income_band_sk); CREATE TABLE item ( i_item_sk Int64, i_item_id String, i_rec_start_date Nullable(Date), i_rec_end_date Nullable(Date), i_item_desc Nullable(String), i_current_price Nullable(Float32), i_wholesale_cost Nullable(Float32), i_brand_id Nullable(Int64), i_brand Nullable(String), i_class_id Nullable(Int64), i_class Nullable(String), i_category_id Nullable(Int64), i_category Nullable(String), i_manufact_id Nullable(Int64), i_manufact Nullable(String), i_size Nullable(String), i_formulation Nullable(String), i_color Nullable(String), i_units Nullable(String), i_container Nullable(String), i_manager_id Nullable(Int64), i_product_name Nullable(String) ) ENGINE = MergeTree() ORDER BY (i_item_sk); CREATE TABLE store ( s_store_sk Int64, s_store_id String, s_rec_start_date Nullable(Date), s_rec_end_date Nullable(Date), s_closed_date_sk Nullable(Int64), s_store_name Nullable(String), s_number_employees Nullable(Int64), s_floor_space Nullable(Int64), s_hours Nullable(String), s_manager Nullable(String), s_market_id Nullable(Int64), s_geography_class Nullable(String), s_market_desc Nullable(String), s_market_manager Nullable(String), s_division_id Nullable(Int64), s_division_name Nullable(String), s_company_id Nullable(Int64), s_company_name Nullable(String), s_street_number Nullable(String), s_street_name Nullable(String), s_street_type Nullable(String), s_suite_number Nullable(String), s_city Nullable(String), s_county Nullable(String), s_state Nullable(String), s_zip Nullable(String), s_country Nullable(String), s_gmt_offset Nullable(Float32), s_tax_precentage Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (s_store_sk); CREATE TABLE call_center ( cc_call_center_sk Int64, cc_call_center_id String, cc_rec_start_date Nullable(Date), cc_rec_end_date Nullable(Date), cc_closed_date_sk Nullable(Int64), cc_open_date_sk Nullable(Int64), cc_name Nullable(String), cc_class Nullable(String), cc_employees Nullable(Int64), cc_sq_ft Nullable(Int64), cc_hours Nullable(String), cc_manager Nullable(String), cc_mkt_id Nullable(Int64), cc_mkt_class Nullable(String), cc_mkt_desc Nullable(String), cc_market_manager Nullable(String), cc_division Nullable(Int64), cc_division_name Nullable(String), cc_company Nullable(Int64), cc_company_name Nullable(String), cc_street_number Nullable(String), cc_street_name Nullable(String), cc_street_type Nullable(String), cc_suite_number Nullable(String), cc_city Nullable(String), cc_county Nullable(String), cc_state Nullable(String), cc_zip Nullable(String), cc_country Nullable(String), cc_gmt_offset Nullable(Float32), cc_tax_percentage Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (cc_call_center_sk); CREATE TABLE customer ( c_customer_sk Int64, c_customer_id String, c_current_cdemo_sk Nullable(Int64), c_current_hdemo_sk Nullable(Int64), c_current_addr_sk Nullable(Int64), c_first_shipto_date_sk Nullable(Int64), c_first_sales_date_sk Nullable(Int64), c_salutation Nullable(String), c_first_name Nullable(String), c_last_name Nullable(String), c_preferred_cust_flag Nullable(String), c_birth_day Nullable(Int64), c_birth_month Nullable(Int64), c_birth_year Nullable(Int64), c_birth_country Nullable(String), c_login Nullable(String), c_email_address Nullable(String), c_last_review_date Nullable(String) ) ENGINE = MergeTree() ORDER BY (c_customer_sk); CREATE TABLE web_site ( web_site_sk Int64, web_site_id String, web_rec_start_date Nullable(Date), web_rec_end_date Nullable(Date), web_name Nullable(String), web_open_date_sk Nullable(Int64), web_close_date_sk Nullable(Int64), web_class Nullable(String), web_manager Nullable(String), web_mkt_id Nullable(Int64), web_mkt_class Nullable(String), web_mkt_desc Nullable(String), web_market_manager Nullable(String), web_company_id Nullable(Int64), web_company_name Nullable(String), web_street_number Nullable(String), web_street_name Nullable(String), web_street_type Nullable(String), web_suite_number Nullable(String), web_city Nullable(String), web_county Nullable(String), web_state Nullable(String), web_zip Nullable(String), web_country Nullable(String), web_gmt_offset Nullable(Float32), web_tax_percentage Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (web_site_sk); CREATE TABLE store_returns ( sr_returned_date_sk Nullable(Int64), sr_return_time_sk Nullable(Int64), sr_item_sk Int64, sr_customer_sk Nullable(Int64), sr_cdemo_sk Nullable(Int64), sr_hdemo_sk Nullable(Int64), sr_addr_sk Nullable(Int64), sr_store_sk Nullable(Int64), sr_reason_sk Nullable(Int64), sr_ticket_number Int64, sr_return_quantity Nullable(Int64), sr_return_amt Nullable(Float32), sr_return_tax Nullable(Float32), sr_return_amt_inc_tax Nullable(Float32), sr_fee Nullable(Float32), sr_return_ship_cost Nullable(Float32), sr_refunded_cash Nullable(Float32), sr_reversed_charge Nullable(Float32), sr_store_credit Nullable(Float32), sr_net_loss Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (sr_item_sk, sr_ticket_number); CREATE TABLE household_demographics ( hd_demo_sk Int64, hd_income_band_sk Nullable(Int64), hd_buy_potential Nullable(String), hd_dep_count Nullable(Int64), hd_vehicle_count Nullable(Int64) ) ENGINE = MergeTree() ORDER BY (hd_demo_sk); CREATE TABLE web_page ( wp_web_page_sk Int64, wp_web_page_id String, wp_rec_start_date Nullable(Date), wp_rec_end_date Nullable(Date), wp_creation_date_sk Nullable(Int64), wp_access_date_sk Nullable(Int64), wp_autogen_flag Nullable(String), wp_customer_sk Nullable(Int64), wp_url Nullable(String), wp_type Nullable(String), wp_char_count Nullable(Int64), wp_link_count Nullable(Int64), wp_image_count Nullable(Int64), wp_max_ad_count Nullable(Int64) ) ENGINE = MergeTree() ORDER BY (wp_web_page_sk); CREATE TABLE promotion ( p_promo_sk Int64, p_promo_id String, p_start_date_sk Nullable(Int64), p_end_date_sk Nullable(Int64), p_item_sk Nullable(Int64), p_cost Nullable(Float64), p_response_target Nullable(Int64), p_promo_name Nullable(String), p_channel_dmail Nullable(String), p_channel_email Nullable(String), p_channel_catalog Nullable(String), p_channel_tv Nullable(String), p_channel_radio Nullable(String), p_channel_press Nullable(String), p_channel_event Nullable(String), p_channel_demo Nullable(String), p_channel_details Nullable(String), p_purpose Nullable(String), p_discount_active Nullable(String) ) ENGINE = MergeTree() ORDER BY (p_promo_sk); CREATE TABLE catalog_page ( cp_catalog_page_sk Int64, cp_catalog_page_id String, cp_start_date_sk Nullable(Int64), cp_end_date_sk Nullable(Int64), cp_department Nullable(String), cp_catalog_number Nullable(Int64), cp_catalog_page_number Nullable(Int64), cp_description Nullable(String), cp_type Nullable(String) ) ENGINE = MergeTree() ORDER BY (cp_catalog_page_sk); CREATE TABLE inventory ( inv_date_sk Int64, inv_item_sk Int64, inv_warehouse_sk Int64, inv_quantity_on_hand Nullable(Int64) ) ENGINE = MergeTree() ORDER BY (inv_date_sk, inv_item_sk, inv_warehouse_sk); CREATE TABLE catalog_returns ( cr_returned_date_sk Nullable(Int64), cr_returned_time_sk Nullable(Int64), cr_item_sk Int64, cr_refunded_customer_sk Nullable(Int64), cr_refunded_cdemo_sk Nullable(Int64), cr_refunded_hdemo_sk Nullable(Int64), cr_refunded_addr_sk Nullable(Int64), cr_returning_customer_sk Nullable(Int64), cr_returning_cdemo_sk Nullable(Int64), cr_returning_hdemo_sk Nullable(Int64), cr_returning_addr_sk Nullable(Int64), cr_call_center_sk Nullable(Int64), cr_catalog_page_sk Nullable(Int64), cr_ship_mode_sk Nullable(Int64), cr_warehouse_sk Nullable(Int64), cr_reason_sk Nullable(Int64), cr_order_number Int64, cr_return_quantity Nullable(Int64), cr_return_amount Nullable(Float32), cr_return_tax Nullable(Float32), cr_return_amt_inc_tax Nullable(Float32), cr_fee Nullable(Float32), cr_return_ship_cost Nullable(Float32), cr_refunded_cash Nullable(Float32), cr_reversed_charge Nullable(Float32), cr_store_credit Nullable(Float32), cr_net_loss Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (cr_item_sk, cr_order_number); CREATE TABLE web_returns ( wr_returned_date_sk Nullable(Int64), wr_returned_time_sk Nullable(Int64), wr_item_sk Int64, wr_refunded_customer_sk Nullable(Int64), wr_refunded_cdemo_sk Nullable(Int64), wr_refunded_hdemo_sk Nullable(Int64), wr_refunded_addr_sk Nullable(Int64), wr_returning_customer_sk Nullable(Int64), wr_returning_cdemo_sk Nullable(Int64), wr_returning_hdemo_sk Nullable(Int64), wr_returning_addr_sk Nullable(Int64), wr_web_page_sk Nullable(Int64), wr_reason_sk Nullable(Int64), wr_order_number Int64, wr_return_quantity Nullable(Int64), wr_return_amt Nullable(Float32), wr_return_tax Nullable(Float32), wr_return_amt_inc_tax Nullable(Float32), wr_fee Nullable(Float32), wr_return_ship_cost Nullable(Float32), wr_refunded_cash Nullable(Float32), wr_reversed_charge Nullable(Float32), wr_account_credit Nullable(Float32), wr_net_loss Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (wr_item_sk, wr_order_number); CREATE TABLE web_sales ( ws_sold_date_sk Nullable(Int64), ws_sold_time_sk Nullable(Int64), ws_ship_date_sk Nullable(Int64), ws_item_sk Int64, ws_bill_customer_sk Nullable(Int64), ws_bill_cdemo_sk Nullable(Int64), ws_bill_hdemo_sk Nullable(Int64), ws_bill_addr_sk Nullable(Int64), ws_ship_customer_sk Nullable(Int64), ws_ship_cdemo_sk Nullable(Int64), ws_ship_hdemo_sk Nullable(Int64), ws_ship_addr_sk Nullable(Int64), ws_web_page_sk Nullable(Int64), ws_web_site_sk Nullable(Int64), ws_ship_mode_sk Nullable(Int64), ws_warehouse_sk Nullable(Int64), ws_promo_sk Nullable(Int64), ws_order_number Int64, ws_quantity Nullable(Int64), ws_wholesale_cost Nullable(Float32), ws_list_price Nullable(Float32), ws_sales_price Nullable(Float32), ws_ext_discount_amt Nullable(Float32), ws_ext_sales_price Nullable(Float32), ws_ext_wholesale_cost Nullable(Float32), ws_ext_list_price Nullable(Float32), ws_ext_tax Nullable(Float32), ws_coupon_amt Nullable(Float32), ws_ext_ship_cost Nullable(Float32), ws_net_paid Nullable(Float32), ws_net_paid_inc_tax Nullable(Float32), ws_net_paid_inc_ship Nullable(Float32), ws_net_paid_inc_ship_tax Nullable(Float32), ws_net_profit Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (ws_item_sk, ws_order_number); CREATE TABLE catalog_sales ( cs_sold_date_sk Nullable(Int64), cs_sold_time_sk Nullable(Int64), cs_ship_date_sk Nullable(Int64), cs_bill_customer_sk Nullable(Int64), cs_bill_cdemo_sk Nullable(Int64), cs_bill_hdemo_sk Nullable(Int64), cs_bill_addr_sk Nullable(Int64), cs_ship_customer_sk Nullable(Int64), cs_ship_cdemo_sk Nullable(Int64), cs_ship_hdemo_sk Nullable(Int64), cs_ship_addr_sk Nullable(Int64), cs_call_center_sk Nullable(Int64), cs_catalog_page_sk Nullable(Int64), cs_ship_mode_sk Nullable(Int64), cs_warehouse_sk Nullable(Int64), cs_item_sk Int64, cs_promo_sk Nullable(Int64), cs_order_number Int64, cs_quantity Nullable(Int64), cs_wholesale_cost Nullable(Float32), cs_list_price Nullable(Float32), cs_sales_price Nullable(Float32), cs_ext_discount_amt Nullable(Float32), cs_ext_sales_price Nullable(Float32), cs_ext_wholesale_cost Nullable(Float32), cs_ext_list_price Nullable(Float32), cs_ext_tax Nullable(Float32), cs_coupon_amt Nullable(Float32), cs_ext_ship_cost Nullable(Float32), cs_net_paid Nullable(Float32), cs_net_paid_inc_tax Nullable(Float32), cs_net_paid_inc_ship Nullable(Float32), cs_net_paid_inc_ship_tax Nullable(Float32), cs_net_profit Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (cs_item_sk, cs_order_number); CREATE TABLE store_sales ( ss_sold_date_sk Nullable(Int64), ss_sold_time_sk Nullable(Int64), ss_item_sk Int64, ss_customer_sk Nullable(Int64), ss_cdemo_sk Nullable(Int64), ss_hdemo_sk Nullable(Int64), ss_addr_sk Nullable(Int64), ss_store_sk Nullable(Int64), ss_promo_sk Nullable(Int64), ss_ticket_number Int64, ss_quantity Nullable(Int64), ss_wholesale_cost Nullable(Float32), ss_list_price Nullable(Float32), ss_sales_price Nullable(Float32), ss_ext_discount_amt Nullable(Float32), ss_ext_sales_price Nullable(Float32), ss_ext_wholesale_cost Nullable(Float32), ss_ext_list_price Nullable(Float32), ss_ext_tax Nullable(Float32), ss_coupon_amt Nullable(Float32), ss_net_paid Nullable(Float32), ss_net_paid_inc_tax Nullable(Float32), ss_net_profit Nullable(Float32) ) ENGINE = MergeTree() ORDER BY (ss_item_sk, ss_ticket_number); CREATE TABLE dbgen_version ( dv_version Nullable(String), dv_create_date Nullable(Date), dv_create_time Nullable(String), dv_cmdline_args Nullable(String) ) ENGINE = Log;
Data Loading
The dsdgen
commands are used to generate data files in text format. The default field separator is a vertical bar (|) according to our TERMINATE
option. So, this is how I loaded the data into the ChistaDATA DBaaS for one table! Or you can use this article to load the data more effectively.
for filename in /data/tpcds/customer_address_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/customer_demographics_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/date_dim_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/warehouse_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/ship_mode_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/time_dim_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/reason_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/income_band_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/item_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/store_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/call_center_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/customer_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/web_site_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/store_returns_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/household_demographics_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/web_page_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/promotion_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/catalog_page_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/inventory_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/catalog_returns_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/web_returns_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/web_sales_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/catalog_sales_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done for filename in /data/tpcds/store_sales_*.csv; do clickhouse-client --secure --user '****@chistadata.com' --password '****' --host '---.chistadata.io' --port '****' --format_csv_delimiter="|" --query="INSERT INTO tpcdsch.customer_demographics_sharded FORMAT CSV" < $filename; done
Conclusion
As can be seen from the last code block, we loaded this data into sharded tables in ChistaDATA DBaaS. As you can see, we strongly recommend that you run the benchmark in a ClickHouse sharding architecture. We utilized from an EC2 instance m5.24xlarge
for each shard for balance between compute/memory. You can also find the 99 queries from the reference as ClickHouse dialect. Have a nice benchmarking!
To read more about ClickHouse benchmarking, please do consider reading the below articles
- Benchmarking ClickHouse Performance with Brown University’s Benchmark
- ChistaDATA Cloud for ClickHouse v/s Google BigQuery: Comparative Benchmarking
- ChistaDATA Cloud for ClickHouse: Benchmarking on OpenTelemetry Dataset
References
[1] https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.11.0.pdf