Data Science in ClickHouse: Normal Distributions and t Distributions

Introduction

The normal distribution and t-distribution are two of the most commonly used probability distributions in statistics. They are used to model the distribution of continuous variables, such as heights, weights, and test scores, and are essential tools for data scientists in analyzing and interpreting data.

The Normal Distribution:

The normal distribution is a continuous probability distribution that is symmetric and bell-shaped, with most of the data points clustering around the mean. The distribution is characterized by two parameters: the mean (μ) and the standard deviation (σ). The formula for the normal distribution is:

f(x) = 1/(σ√(2π)) * e^(-(x-μ)^2/(2σ^2))

Where f(x) is the probability density function, x is the value of the variable, e is the mathematical constant e (~2.718), and π is the mathematical constant pi (~3.14159).
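As a quick illustration, the density formula above can be written out term by term and compared with Python's built-in `statistics.NormalDist`. This is a minimal sketch: the function name `normal_pdf` and the parameter values (μ = 69, σ = 2.5) are assumptions chosen only for the example and do not come from the article's data.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Hypothetical parameters, for illustration only.
mu, sigma = 69.0, 2.5

for x in (64.0, 69.0, 74.0):
    from_formula = normal_pdf(x, mu, sigma)
    from_library = NormalDist(mu, sigma).pdf(x)
    print(f"x={x}: formula={from_formula:.6f}, NormalDist={from_library:.6f}")
```

The two columns should agree, and the values at 64 and 74 should match each other because the density is symmetric around the mean.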

The normal distribution is commonly used in real-life applications, such as:

Height: The heights of people in a population often follow a normal distribution. For example, if we measure the heights of all the students in a school, the distribution of heights will be approximately normal.

IQ Scores: Intelligence quotient (IQ) scores are approximately normally distributed by design: tests are scaled so that the population mean is 100, commonly with a standard deviation of 15. Most people therefore score close to 100, while progressively fewer people score far above or below it.

Stock Prices: Daily changes in stock prices are often modeled as approximately normal. On most days the price of a stock changes by a small amount, while large increases or decreases are rare. In practice, extreme moves occur more often than a normal distribution predicts, so this is an approximation rather than an exact fit.

SQL code to calculate the mean and standard deviation for a sample dataset using the normal distribution:

Assuming we have a sample dataset “heights” with the following values:

height
65
68
71
73
70
67
69
72

We can calculate the mean and standard deviation using the following SQL code:

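-- stddevSamp is the sample standard deviation (n - 1 denominator); use stddevPop for the population version
SELECT avg(height) AS mean, stddevSamp(height) AS std_dev
FROM heights;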

The output of this code would be:

mean      std_dev
69.375    2.669
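To cross-check the query result outside the database, the same eight height values can be run through Python's `statistics` module. This is a sketch rather than part of the ClickHouse workflow; it shows both the sample standard deviation (n - 1 denominator) and the population standard deviation (n denominator), since the two are easy to confuse.

```python
from statistics import mean, pstdev, stdev

# The eight values from the "heights" sample dataset.
heights = [65, 68, 71, 73, 70, 67, 69, 72]

sample_mean = mean(heights)
sample_sd = stdev(heights)       # sample standard deviation, divides by n - 1
population_sd = pstdev(heights)  # population standard deviation, divides by n

print(f"n = {len(heights)}")
print(f"mean = {sample_mean}")
print(f"sample std dev = {sample_sd:.3f}")
print(f"population std dev = {population_sd:.3f}")
```

For this dataset the mean is 69.375, the sample standard deviation is about 2.669, and the population standard deviation is about 2.497.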

The t-distribution:

The t-distribution is a continuous probability distribution that is similar to the normal distribution but has heavier tails. It is used when the population standard deviation is unknown and must be estimated from the sample, which matters most when the sample size is small. The t-distribution is characterized by a parameter called the degrees of freedom (df); for a one-sample test, df is equal to n-1, where n is the sample size. As df grows, the t-distribution approaches the normal distribution.

The formula for the t-distribution is:

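f(x) = Γ((df+1)/2) / (Γ(df/2) * σ√(π*df)) * (1 + (x-μ)^2/(σ^2*df))^(-(df+1)/2)

Where Γ is the gamma function, f(x) is the probability density function, x is the value of the variable, π is the mathematical constant pi (~3.14159), μ is the location parameter (the mean when df > 1), and σ is the scale parameter. Note that σ is not the standard deviation here: for df > 2, the standard deviation is σ√(df/(df-2)). Setting μ = 0 and σ = 1 gives the standard t-distribution used in t-tests.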
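The following sketch evaluates the standard location-scale Student's t density, including the Γ(df/2) normalizing term, using only Python's standard library. The function name `student_t_pdf` is an assumption made for this example. It works on the log scale with `math.lgamma` so that large df values do not overflow, and it compares the result with the standard normal density to show the heavier tails.

```python
import math
from statistics import NormalDist


def student_t_pdf(x: float, df: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a location-scale Student's t distribution (mu = location, sigma = scale)."""
    if df <= 0 or sigma <= 0:
        raise ValueError("df and sigma must be positive")
    z = (x - mu) / sigma
    # log of Gamma((df+1)/2) / (Gamma(df/2) * sigma * sqrt(pi * df))
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(math.pi * df)
        - math.log(sigma)
    )
    # log of (1 + z^2 / df) ** (-(df + 1) / 2)
    log_kernel = -(df + 1) / 2 * math.log1p(z * z / df)
    return math.exp(log_norm + log_kernel)


standard_normal = NormalDist()
for x in (0.0, 1.0, 3.0):
    t_density = student_t_pdf(x, df=7)
    normal_density = standard_normal.pdf(x)
    print(f"x={x}: t(df=7)={t_density:.5f}, normal={normal_density:.5f}")
```

At x = 3 the t density with 7 degrees of freedom is larger than the normal density, which is why critical values from the t-distribution are wider than those from the normal distribution for small samples.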

The t-distribution is commonly used in real-life applications, such as:

  1. Medical Research: Clinical trials often involve a small sample size, making the t-distribution a better fit for analyzing the data than the normal distribution.
  2. Quality Control: In manufacturing, the t-distribution is used to test whether a sample mean is significantly different from a population mean.
  3. Business: The t-distribution is used to test hypotheses about the difference between two population means or the correlation between two variables.

SQL code to calculate the t-statistic:

Assuming we have a sample dataset “weights” with the following values:

weight
120
128
135
129
132
127
131
130

We can calculate the t-statistic for the null hypothesis that the population mean is 130 using the following SQL code:

SELECT avg(weight) AS mean, stddevSamp(weight) AS std_dev, count(*) AS n,
       (avg(weight) - 130) / (stddevSamp(weight) / sqrt(count(*))) AS t_value
FROM weights;

The output of this code would be:

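mean    std_dev    n    t_value
129     4.408      8    -0.642

In this example, we are testing the hypothesis that the population mean weight is equal to 130 pounds. For the eight values listed, the sample mean is 129, the sample standard deviation is about 4.408, and the standard error is 4.408/√8 ≈ 1.558. The t-value is therefore about -0.64, which means that the sample mean is 0.64 standard errors below the hypothesized mean. To determine whether this difference is statistically significant, we compare the absolute value of the t-value to the critical value from a t-distribution table with 7 degrees of freedom (n-1=8-1=7) at the desired level of significance (e.g., alpha=0.05). For a two-sided test at alpha=0.05 with 7 degrees of freedom, the critical value is about 2.365. If the absolute t-value exceeded the critical value, we would reject the null hypothesis and conclude that the population mean weight is significantly different from 130 pounds. Here 0.64 is well below 2.365, so we fail to reject the null hypothesis: the sample provides no evidence that the mean differs from 130 pounds.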
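The one-sample t-test on the eight values listed in the "weights" dataset can be reproduced step by step in Python as a cross-check. This is a minimal sketch using only the standard library. The critical value 2.365 is hard-coded from a standard t table (two-sided, alpha = 0.05, 7 degrees of freedom) rather than computed, and the variable names are assumptions made for the example.

```python
import math
from statistics import mean, stdev

# The eight values from the "weights" sample dataset.
weights = [120, 128, 135, 129, 132, 127, 131, 130]
hypothesized_mean = 130.0

n = len(weights)
sample_mean = mean(weights)
sample_sd = stdev(weights)                 # sample standard deviation (n - 1)
standard_error = sample_sd / math.sqrt(n)  # standard error of the mean
t_value = (sample_mean - hypothesized_mean) / standard_error
df = n - 1

# Two-sided critical value for alpha = 0.05 and df = 7, taken from a t table.
T_CRITICAL = 2.365
reject_null = abs(t_value) > T_CRITICAL

print(f"n = {n}, df = {df}")
print(f"mean = {sample_mean}, std dev = {sample_sd:.3f}, std error = {standard_error:.3f}")
print(f"t = {t_value:.3f}, reject H0 at alpha = 0.05: {reject_null}")
```

A different sample size needs a different critical value, so `T_CRITICAL` has to be looked up again whenever the number of rows changes.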

Conclusion

In summary, both the normal distribution and t-distribution are important tools in statistics and data science. They are used to model the distribution of continuous variables and to test hypotheses about population parameters. SQL can be used to calculate the mean, standard deviation, and t-value for sample datasets, which can then be used to make inferences about the population.

About Shiv Iyer
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.