Machine Learning with ClickHouse – Part 1 – Linear Regression

 

ClickHouse is quite a popular and capable columnar data store for real-time analytics on petabytes of data. While the ClickHouse database is frequently used to run analytical queries, it is still possible to perform minor machine learning algorithms on the data stored in ClickHouse using built-in SQL functions, eliminating the need for other external tools and frameworks. Before we get into the example, I will provide the commonly used jargon and its meaning,  to make it easier for the readers.

Machine Learning

Machine learning enables computers to learn from data and make decisions or predictions without being explicitly programmed to do so. – Source

Supervised and Unsupervised Learning

The biggest difference between supervised and unsupervised machine learning is the type of data used. Supervised learning uses labeled training data, and unsupervised learning does not.Source

Regression & Classification

In machine learning, regression is a supervised learning technique used to predict continuous numerical values whereas Classification techniques (supervised) predict discrete labels.

Independent & Dependant variables

Independent variables (feature(s)) are the input for a machine learning algorithm and the Dependent variables are the output/predictions that depend on the independent variables.

Linear Regression

Linear regression is a type of machine learning algorithm that models the linear relationship between a dependent variable and independent variables.

Training Data vs Test Data

The training data is a subset of the original dataset, which is used to train the machine learning model. Test data is another subset of the original dataset which is not part of the training dataset and it is used to evaluate the performance of the trained model.

Model Training

The process of training an ML model involves providing an ML algorithm (that is, the learning algorithm) with training data to learn from. The term ML model refers to the model artifact that is created by the training process. – Source

Model Hyperparameters

A model hyperparameter is a parameter whose value is set before the model starts training. They cannot be learned by fitting the model to the data. – Source

Taxi Trip Fare Prediction

Let us try out Stochastic Linear Regression to predict the taxi trip fares. The data set is available in Kaggle for download. The downloaded archive consists of three files. train.csv and test.csv are the training data and test/validation data respectively. The training data has more than 200,000 samples. Let us look at the steps to create a Stochastic Linear Regression model to predict the fares and infer from the model. As a prerequisite, download the data and have the ClickHouse instance ready.

Step 1 – Create the table

Login to the ClickHouse instance and create two different tables to store the training data and test data.

CREATE TABLE taxi_fare_train
(
trip_duration Float32,
distance_traveled Float32,
num_of_passengers Float32,
fare Float32,
tip Float32,
miscellaneous_fees Float32,
total_fare Float32,
surge_applied Float32,
actual_fare Float32 DEFAULT (total_fare - tip)
)
Engine = Log;

CREATE TABLE taxi_fare_test
(
trip_duration Float32,
distance_traveled Float32,
num_of_passengers Float32,
fare Float32,
tip Float32,
miscellaneous_fees Float32,
total_fare Float32,
surge_applied Float32,
actual_fare Float32 DEFAULT (total_fare - tip)
)
Engine = Log;

Step 2 – Insert the data

Use the below commands in the terminal to insert the downloaded data into the ClickHouse tables.

cat train.csv | clickhouse-client --query "INSERT INTO taxi_fare_train FORMAT CSVWithNames"

cat test.csv | clickhouse-client --query "INSERT INTO taxi_fare_test FORMAT CSVWithNames" 

 

Step 3 – Create the machine learning model

Let us create the model using the inbuilt function available in ClickHouse.

CREATE TABLE regression_taxi_fare ENGINE = Memory AS SELECT
stochasticLinearRegressionState(0.01, 0.0, 5)(actual_fare, trip_duration, distance_traveled,
num_of_passengers, surge_applied)
AS trained_model FROM taxi_fare_train;

The stochasticLinearRegressionState function can accept four parameters (Hyperparameters) as input. They are

  1. Learning Rate – Default is 0.00001
  2. L2 regularization co-efficient – Default is 0.1
  3. Mini batch size – Default is 15
  4. Weight update method – Default is Adam

More details can be found in the docs. We have created a ML model in the memory (regression_taxi_fare) using memory table engine. The target variable is actual_fare and the input variables are trip_duration, distance_traveled, num_of_passengers, surge_applied. The parameters are tweaked from defaults except for the weight update method.

Step 4 – Predict the actual_fare from the test data

Let us predict the fares using the trained model and the test data.

WITH (SELECT trained_model FROM regression_taxi_fare) AS model SELECT
evalMLMethod(model,trip_duration, distance_traveled,
num_of_passengers, surge_applied) AS predicted_fare FROM taxi_fare_test;
WITH (
        SELECT trained_model
        FROM regression_taxi_fare
    ) AS model
SELECT evalMLMethod(model, trip_duration, distance_traveled, num_of_passengers, surge_applied) AS predicted_fare
FROM taxi_fare_test
LIMIT 10

Query id: 8209fe31-4411-4a53-9982-e854cf912bbd

┌─────predicted_fare─┐
│  92.44493929867386 │
│  91.81795068701524 │
│  92.32603987173391 │
│  79.70719986398052 │
│    90.434714982121 │
│ 192.58361817101292 │
│  49.31004159031567 │
│  76.83238118281483 │
│  50.46457575384593 │
│ 196.93331312696364 │
└────────────────────┘

10 rows in set. Elapsed: 0.006 sec. 

We use the evalMLMethod to predict the output from the input data. The first parameter is the trained model and the subsequent parameters are the input variables from the test data. Note that, we should use the same set of input variables used to train the data.  We can manually verify the predicted values from the test data table.

Note:

With the current set of hyperparameters, the machine learning model may not be trained sufficiently and the predictions may not be accurate. For accurate predictions, the hyperparameters need to be tuned.

Conclusion

We have trained a regression model to predict the taxi fare using information such as trip duration, distance of the trip, number of passengers, and whether surge pricing was applied. In the next part of this series, let us perform a binary classification in ClickHouse using the same dataset to predict whether the surge pricing was used to calculate the travel fare.

References

ClickHouse StochasticLinearRegression – https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/stochasticlinearregression

Jargons – https://allanmacgregor.medium.com/machine-learning-jargon-5988f9b19380

Dataset – https://www.kaggle.com/datasets/raviiloveyou/predict-taxi-fare-with-a-bigquery-ml-forecasting

Basics of Linear Regression – https://www.analyticsvidhya.com/blog/2021/10/everything-you-need-to-know-about-linear-regression/

Basics of Linear Regression – http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

Hyperparameter Tuning – https://www.geeksforgeeks.org/hyperparameter-tuning/