Machine Learning With ClickHouse: Logistic Regression

Introduction

In the previous article, we covered the basics of linear regression and how to run it on data stored in ClickHouse. In this article, we look at an example of logistic regression in ClickHouse.

"Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set." (Source: TechTarget, reference [4])

Logistic Regression in ClickHouse

Let us reuse the dataset from the previous example (the Taxi Trip Fare dataset). The original dataset is intended for linear regression (predicting taxi fares, a continuous-valued output), but it also contains a field called surge_applied, which is either true or false (binary). We will treat surge_applied as the target variable and use actual_fare, trip_duration, distance_traveled, and num_of_passengers as the input variables.
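Before training, it can help to check how the two classes are distributed in the target column, since a heavily imbalanced target affects how the predictions should be read. The query below is a minimal sketch; it assumes that surge_applied is stored in taxi_fare_train as a numeric or boolean column, which the article does not state explicitly.

```sql
-- Count the training rows for each value of the binary target.
-- Assumption: surge_applied holds exactly two distinct values (for example 0 and 1).
SELECT
    surge_applied,
    count() AS trips
FROM taxi_fare_train
GROUP BY surge_applied
ORDER BY surge_applied;
```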

(1) Create a Logistic Regression Model

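Let us create the model using the built-in aggregate function stochasticLogisticRegression, combined with the -State combinator so that the trained state can be stored in a table and reused for prediction.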

CREATE TABLE classification_taxi_surgefare ENGINE = Memory AS
SELECT stochasticLogisticRegressionState(0.01, 0.0, 5)(
           surge_applied, actual_fare, trip_duration, distance_traveled, num_of_passengers
       ) AS trained_classification_model
FROM taxi_fare_train;

The stochasticLogisticRegressionState function accepts up to four optional parameters (hyperparameters), in this order:

  1. Learning rate – Default is 0.00001
  2. L2 regularization coefficient – Default is 0.1
  3. Mini-batch size – Default is 15
  4. Weight update method – Default is Adam

In this model, the learning rate (0.01), the L2 regularization coefficient (0.0), and the mini-batch size (5) are changed from their defaults. The weight update method is omitted, so it stays at its default (Adam).
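To show where the fourth hyperparameter goes, here is a sketch of the same training statement with the weight update method passed explicitly. The table name classification_taxi_surgefare_sgd is hypothetical, and 'SGD' is chosen only for illustration; per the ClickHouse documentation, the accepted methods are 'Adam', 'SGD', 'Momentum', and 'Nesterov'.

```sql
-- Same features and target as above, with all four hyperparameters spelled out:
-- learning rate = 0.01, L2 coefficient = 0.0, mini-batch size = 5, update method = 'SGD'.
-- Assumption: the table name below is illustrative and not part of the original example.
CREATE TABLE classification_taxi_surgefare_sgd ENGINE = Memory AS
SELECT stochasticLogisticRegressionState(0.01, 0.0, 5, 'SGD')(
           surge_applied, actual_fare, trip_duration, distance_traveled, num_of_passengers
       ) AS trained_classification_model
FROM taxi_fare_train;
```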

(2) Predicting Surge Fare Pricing

WITH (
        SELECT trained_classification_model
        FROM classification_taxi_surgefare
    ) AS model
SELECT evalMLMethod(model, actual_fare, trip_duration, distance_traveled, num_of_passengers) AS predicted_surgefare
FROM taxi_fare_test
LIMIT 10

Query id: 7289003a-d7af-400a-8746-dd8d7ca0c284

┌─predicted_surgefare─┐
│                   1 │
│    0.99999999999996 │
│                   1 │
│                   1 │
│  0.9999999999999998 │
│                   1 │
│  0.9999999793719152 │
│                   1 │
│                   1 │
│                   1 │
└─────────────────────┘

10 rows in set. Elapsed: 0.006 sec. 

As in the previous example, the evalMLMethod function is used to predict the output from the input data. Its first argument is the trained model state, and the remaining arguments are the input variables from the test data, passed in the same order as they were during training.
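To get a rough idea of prediction quality, the probabilities can be converted to class labels and compared with the actual values. The following is a minimal sketch with two assumptions: that taxi_fare_test also contains the surge_applied column stored as 0/1, and that a cut-off of 0.5 is an acceptable decision threshold.

```sql
-- Convert predicted probabilities to 0/1 labels and measure the share of correct predictions.
-- Assumptions: taxi_fare_test has a 0/1 surge_applied column; 0.5 is used as the threshold.
SELECT
    count() AS total_rows,
    avg(predicted_label = surge_applied) AS accuracy
FROM
(
    WITH (
            SELECT trained_classification_model
            FROM classification_taxi_surgefare
        ) AS model
    SELECT
        surge_applied,
        evalMLMethod(model, actual_fare, trip_duration, distance_traveled, num_of_passengers) >= 0.5 AS predicted_label
    FROM taxi_fare_test
);
```

Accuracy alone can be misleading when one class dominates the data, so it is worth reading this number alongside the class distribution of the target.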

Conclusion

The predicted values are probabilities between 0 and 1. To obtain a class label, a probability >= 0.5 is mapped to 1 and a probability < 0.5 is mapped to 0. Nearly all of the sample predictions above are very close to 1, which suggests that the model, as trained with the current hyperparameters, is not yet separating the two classes well. Tuning the hyperparameters for more accurate predictions is left as an exercise for the reader.
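One thing worth checking before tuning: the ClickHouse documentation for stochasticLogisticRegression notes that the training labels are expected to be in [-1, 1]. If surge_applied is stored as 0/1, a possible starting point is to remap the target to -1/1 while training, as sketched below. The table name classification_taxi_surgefare_pm1 is hypothetical, the 0/1 storage is an assumption, and the hyperparameters are kept unchanged only so that the effect of the remapping can be seen in isolation.

```sql
-- Retrain with the target remapped from 0/1 to -1/1.
-- Assumptions: surge_applied is stored as 0/1; the table name is illustrative.
CREATE TABLE classification_taxi_surgefare_pm1 ENGINE = Memory AS
SELECT stochasticLogisticRegressionState(0.01, 0.0, 5)(
           if(surge_applied = 1, 1, -1),
           actual_fare, trip_duration, distance_traveled, num_of_passengers
       ) AS trained_classification_model
FROM taxi_fare_train;
```

The prediction query stays the same apart from the table name; evalMLMethod still returns the probability that a row belongs to the positive class.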

References

[1] Machine Learning Jargon – https://allanmacgregor.medium.com/machine-learning-jargon-5988f9b19380

[2] Dataset – https://www.kaggle.com/datasets/raviiloveyou/predict-taxi-fare-with-a-bigquery-ml-forecasting

[3] Logistic Regression – https://www.sciencedirect.com/topics/computer-science/logistic-regression/

[4] Logistic Regression – https://www.techtarget.com/searchbusinessanalytics/definition/logistic-regression

[5] Logistic Regression Hyperparameter – https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/