Data Science in ClickHouse: Skewed Q-Q Plots

Introduction

Skewed Q-Q plots are a useful visualization tool for data scientists working with non-normal distributions. In practice, data scientists use them for the following tasks:

  1. Assessing normality: Skewed Q-Q plots are often used to assess whether a dataset follows a normal distribution. If the points in the plot deviate significantly from a straight line, it indicates that the data is not normally distributed. This information is useful for selecting appropriate statistical tests and models.
  2. Identifying skewness: Skewed Q-Q plots can help data scientists identify the direction and degree of skewness in the data. With sample quantiles on the vertical axis and normal quantiles on the horizontal axis, a curve that bends upward (convex) indicates positive (right) skewness, and a curve that bends downward (concave) indicates negative (left) skewness. This information is useful for choosing transformations, such as a log transform for right-skewed data, that reduce skewness and make the data more amenable to modeling.
  3. Outlier detection: Skewed Q-Q plots can help data scientists identify outliers in the data. Outliers are points that fall far away from the expected range of values based on the normal distribution. These points may indicate errors in the data or be genuinely interesting observations that require further investigation.
  4. Model assessment: Skewed Q-Q plots can also be used to assess the fit of statistical models, typically by plotting the model's residuals. If the points in the plot deviate significantly from a straight line, it may indicate that the model is not a good fit for the data.
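
Before drawing any plot, a single number can hint at the direction of skew. The query below is a minimal sketch, assuming a table named purchases with a Float64 column amount (like the one defined later in this article). It uses the ClickHouse aggregate functions skewSamp and quantileExact.

```sql
-- Sketch only: assumes a table purchases(amount Float64).
SELECT
    skewSamp(amount)           AS skewness,  -- > 0 suggests right skew, < 0 left skew
    avg(amount)                AS mean,
    quantileExact(0.5)(amount) AS median     -- mean well above median also hints at right skew
FROM purchases;
```

A skewness value near zero does not prove normality; it only says the two tails are roughly balanced, which is why the Q-Q plot is still worth drawing.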

Overall, skewed Q-Q plots are a powerful tool for data scientists when dealing with non-normal distributions. They provide insights into the distribution, skewness, and outliers of the data, which are essential for selecting appropriate statistical tests and models, identifying data issues, and making informed decisions.

Skewed Q-Q plots use case

Let’s consider an example use case of skewed Q-Q plots to illustrate their application in data analysis.

Suppose a data scientist is analyzing the distribution of customer purchases on an e-commerce website. They want to determine whether the purchase amounts follow a normal distribution and identify any potential outliers. To do this, they can create a skewed Q-Q plot of the purchase amounts.

The data scientist can start by computing the purchase amounts and then creating a histogram of the data to get a rough idea of its distribution. Suppose the histogram shows that the data is right-skewed, indicating that there may be a few large purchases that are skewing the distribution.
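
The histogram step can be done directly in SQL by grouping amounts into fixed-width buckets. This is a rough sketch, assuming the purchases table defined later in this article; the bucket width of 25 is an arbitrary illustrative value and should be tuned to the range of the data.

```sql
-- Sketch only: fixed-width buckets; 25 is an assumed bucket width.
SELECT
    floor(amount / 25) * 25 AS bucket_start,   -- lower edge of the bucket
    count()                 AS purchase_count  -- number of purchases in the bucket
FROM purchases
GROUP BY bucket_start
ORDER BY bucket_start;
```

For right-skewed data, the counts should peak in the low buckets and trail off slowly through the high ones.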

To get a more detailed understanding of the distribution, the data scientist can then create a skewed Q-Q plot. They can plot the quantiles of the purchase amounts against the quantiles of a normal distribution. If the points in the plot follow a straight line, it suggests that the purchase amounts follow a normal distribution. However, if the plot curves or deviates significantly from a straight line, it indicates that the data is not normally distributed.

In this case, suppose the skewed Q-Q plot shows that the points in the plot curve upward at the high end, indicating positive skewness. This result suggests that there are a few large purchases that are skewing the distribution, which the data scientist can investigate further to identify potential outliers.
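
To follow up on the large purchases, one option is to list the rows above an upper cutoff. The sketch below uses Tukey's fence (third quartile plus 1.5 times the interquartile range), which is an assumption of this example rather than something the Q-Q plot prescribes. It relies on the purchases table defined later in this article.

```sql
-- Sketch only: the 1.5 * IQR fence is one common, assumed cutoff.
WITH
    (SELECT quantilesExact(0.25, 0.75)(amount) FROM purchases) AS q,  -- [Q1, Q3]
    q[2] + 1.5 * (q[2] - q[1]) AS upper_fence                        -- arrays are 1-indexed
SELECT id, customer_id, amount, date
FROM purchases
WHERE amount > upper_fence
ORDER BY amount DESC
LIMIT 20;
```

With strongly skewed data, many legitimate purchases can exceed this fence, so the result is a list of candidates to review, not a list of confirmed errors.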

Using this approach, the data scientist can gain insight into the distribution of purchase amounts, identify potential outliers, and make informed decisions about how to model and analyze the data. The skewed Q-Q plot complements the histogram by showing exactly where, and in which direction, the data departs from a normal distribution.

Data Scientist’s SQL guide for analyzing the distribution of customer purchases for an e-commerce website on ClickHouse

The following example ClickHouse database schema supports analyzing the distribution of customer purchases on an e-commerce website:

CREATE DATABASE ecommerce;

USE ecommerce;

CREATE TABLE purchases (
    id UInt32,
    customer_id UInt32,
    amount Float64,
    date Date
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (customer_id, id);

CREATE TABLE customer (
    id UInt32,
    name String,
    email String,
    city String,
    state String
) ENGINE = MergeTree()
ORDER BY id;

This schema includes two tables: purchases and customer. The purchases table stores information about customer purchases, including the customer ID, purchase amount, and date. The customer table stores information about the customers, including their name, email, city, and state.
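
To try the queries in this article without real data, the purchases table can be filled with synthetic, right-skewed amounts. This is only a sketch under stated assumptions: the row count, the 1,000 customers, the 365-day window, and the exponential distribution with a mean of about 50 are all made up for illustration.

```sql
-- Sketch only: synthetic data; every constant here is an assumed value.
INSERT INTO purchases
SELECT
    number                                       AS id,
    rand(1) % 1000                               AS customer_id,  -- 1,000 hypothetical customers
    round(-50 * log(1 - rand(3) / 4294967296), 2) AS amount,      -- exponential, mean about 50
    today() - (rand(2) % 365)                    AS date          -- spread over the last year
FROM numbers(100000);
```

The different arguments to rand() are there so that each call produces its own random value instead of being merged into one shared expression.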

To generate a Q-Q plot of the purchase amounts, a data scientist could use the following SQL query:

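SELECT quantilesExact(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)(amount) AS quantiles,
       avg(amount) AS mean,
       stddevSamp(amount) AS stddev
FROM purchases;

This query calculates the exact quantiles of the purchase amounts at the specified levels (5%, 10%, 25%, 50%, 75%, 90%, and 95%) using the quantilesExact function, along with the sample mean and standard deviation. ClickHouse has no built-in normal_distribution function, so the corresponding normal quantiles are derived from those two values as mean + z * stddev, where z is the standard normal quantile for each level (approximately -1.645, -1.282, -0.674, 0, 0.674, 1.282, and 1.645 for the levels above). The result is a single row with three columns: quantiles, an array containing the quantiles of the purchase amounts, and mean and stddev, which define the normal distribution to compare against.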

The data scientist can then plot the quantiles of the purchase amounts against the quantiles of the normal distribution to generate a Q-Q plot. If the points in the plot follow a straight line, it suggests that the purchase amounts follow a normal distribution. However, if the plot curves or deviates significantly from a straight line, it indicates that the data is not normally distributed.
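
Most charting tools expect one row per plotted point. The sketch below returns each sample quantile next to a normal quantile with the same mean and standard deviation. The z-scores are hard-coded standard normal quantiles for the seven levels, which is an assumption of this example; a finer grid of levels needs a matching list of z-scores.

```sql
-- Sketch only: z_scores are standard normal quantiles for the listed levels.
SELECT
    level,
    sample_quantile,
    mean + z * sd AS normal_quantile
FROM
(
    SELECT
        [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95] AS levels,
        [-1.6449, -1.2816, -0.6745, 0.0, 0.6745, 1.2816, 1.6449] AS z_scores,
        quantilesExact(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)(amount) AS sample_quantiles,
        avg(amount) AS mean,
        stddevSamp(amount) AS sd
    FROM purchases
)
ARRAY JOIN
    levels AS level,
    z_scores AS z,
    sample_quantiles AS sample_quantile;
```

Plotting normal_quantile on the horizontal axis and sample_quantile on the vertical axis gives the Q-Q plot; points on the line y = x mean the data matches the fitted normal distribution at that level.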

Conclusion

This article showed how to compute the inputs for a skewed Q-Q plot in ClickHouse: the sample quantiles of a column and the normal quantiles to compare them with. Because the quantiles are computed where the data lives, only a handful of values need to leave the database for plotting, which makes ClickHouse a practical fit for SQL-based statistical analysis on large datasets.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv is currently the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker at open source software conferences globally.