
Tuesday, August 27, 2024

Linear Regression | Hypothesis Testing

To See All ML Articles: Index of Machine Learning


Help from ChatGPT

I saw this screen in a video on the t-test for the slope. Can you shed some light on what is being taught here?

Answer

The screen you are seeing explains how to perform a t-test for the slope of a simple linear regression model. The goal is to assess whether the slope \( b_1 \) of the regression line is statistically significantly different from zero, which would indicate a linear relationship between the dependent and independent variables.

Here's an explanation of the components shown in the image:

t-Statistic for the Slope:

  • Formula: \[ t = \frac{b_1}{SE(b_1)} \]
  • \( b_1 \) is the estimated slope of the regression line.
  • \( SE(b_1) \) is the standard error of the slope estimate, which measures the variability of the slope coefficient across different samples.

This formula is used to calculate the t-statistic for the slope. The t-statistic quantifies how many standard errors the estimated slope is away from zero.

Standard Error of the Slope \( SE(b_1) \):

  • The standard error \( SE(b_1) \) of the slope is calculated using the formula: \[ SE(b_1) = \frac{\sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}}{\sqrt{\sum (x_i - \bar{x})^2}} \]
  • Numerator:
    • \( \sum (y_i - \hat{y}_i)^2 \) is the sum of squared residuals (SSR), where \( y_i \) are the actual values of the dependent variable and \( \hat{y}_i \) are the predicted values from the regression model.
    • \( n - 2 \) accounts for the degrees of freedom. Since two parameters are estimated in a simple linear regression (the intercept and the slope), we subtract 2 from the total number of data points \( n \).
  • Denominator:
    • \( \sum (x_i - \bar{x})^2 \) is the sum of squared deviations of the independent variable \( x \) from its mean \( \bar{x} \). This reflects how much variability exists in the independent variable \( x \).

Interpretation:

  • t-Statistic: Once the t-statistic is calculated, you compare it to the critical value from the t-distribution (based on the degrees of freedom and the desired significance level, e.g., 0.05). If the t-statistic is greater in magnitude than the critical value, you reject the null hypothesis that the slope is zero.

  • Null Hypothesis (\( H_0 \)): \( b_1 = 0 \). This means that there is no linear relationship between the variables.

  • Alternative Hypothesis (\( H_1 \)): \( b_1 \neq 0 \). This suggests a linear relationship between the variables.

If the null hypothesis is rejected, it implies that there is a statistically significant linear relationship between the independent and dependent variables.

In summary, this screen is demonstrating how to perform a t-test on the slope of a regression line to check whether the slope is significantly different from zero, indicating a linear relationship between the two variables. The t-test is based on the ratio of the slope to its standard error, and the standard error itself is calculated using the residuals from the regression and the variability of the independent variable.
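
As a concrete illustration, here is a minimal sketch (with made-up data, not the data from the video) that computes the slope, its standard error, and the t-statistic exactly as described above, using NumPy and SciPy:

# A minimal sketch (made-up data) of the t-test for the slope of a simple
# linear regression, using NumPy for the fit and SciPy for the t-distribution.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Least-squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Residuals and the standard error of the slope, SE(b1)
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)                              # sum of squared residuals
se_b1 = np.sqrt(sse / (n - 2)) / np.sqrt(np.sum((x - x_bar) ** 2))

# t-statistic and two-sided p-value with n - 2 degrees of freedom
t_stat = b1 / se_b1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))

print(f"b1 = {b1:.4f}, SE(b1) = {se_b1:.4f}, t = {t_stat:.2f}, p = {p_value:.6f}")
# Reject H0 (b1 = 0) if p_value is below the chosen significance level, e.g. 0.05.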

Tags: Machine Learning,Mathematical Foundations for Data Science,

Monday, August 26, 2024

Summary of Support Vector Machines (SVM) - Based on Video by Intuitive Machine Learning

To See All ML Articles: Index of Machine Learning


Support Vector Machines (SVM) are among the top non-linear supervised learning models.

SVMs help identify the optimal hyperplane to categorize data. In one dimension, a hyperplane is a point; in two dimensions, it's a line; and in three dimensions, it's a plane that separates the positive and negative categories.

Linear Model: For linearly separable data, there are multiple ways to draw a hyperplane that separates positive and negative samples. 

While all of these hyperplanes separate the categories, SVMs choose the best one by maximizing the margin, which earns them the name "maximum margin classifier."

What is a support vector? Support vectors are the data points closest to the hyperplane. Identifying these crucial data points, whether negative or positive, is a key challenge that SVMs address. Unlike models such as linear regression or neural networks, where all data points influence the final optimization, in SVMs only the support vectors affect the final decision boundary. When a support vector moves, the decision boundary changes; moving any other point has no effect.

As with other machine learning models, SVMs optimize a set of weights, but only the support vectors determine those weights and the resulting decision boundary. Understanding the mathematical process behind SVMs requires knowledge of linear algebra and optimization theory. Before diving into the model calculations, it is important to understand \( H_0 \), \( H_1 \), \( H_2 \), and \( W \), as shown in the image. \( W \) is drawn perpendicular to \( H_0 \).

Equation for \( H_0 \)

The equation for the hyperplane \( H_0 \) is \( W \cdot X + b = 0 \). This applies to any number of dimensions; in a two-dimensional scenario, for example, the equation becomes \( W_1 X_1 + W_2 X_2 + b = y \). Since \( H_0 \) is defined as \( W \cdot X + b = 0 \):

  • \( H_1 \) is \( W \cdot X + b \geq 0 \), which can be rewritten as \( W \cdot X + b = k \), where \( k \) is a variable. For easier mathematical calculations, we set \( k = 1 \), so \( W \cdot X + b = 1 \).
  • \( H_2 \) is \( W \cdot X + b < 0 \), which can be rewritten as \( W \cdot X + b = -k \). Again setting \( k = 1 \), we get \( W \cdot X + b = -1 \).
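
To make these three hyperplanes concrete, here is a tiny sketch (with a \( W \), \( b \), and points chosen arbitrarily for illustration) that evaluates \( W \cdot X + b \) for a few points:

# A tiny sketch (arbitrary W, b, and points chosen for illustration) showing how
# the value of W . X + b places a point relative to H0 (= 0), H1 (= +1), H2 (= -1).
import numpy as np

W = np.array([1.0, 2.0])   # hypothetical weight vector, perpendicular to H0
b = -3.0                   # hypothetical bias

points = np.array([
    [3.0, 1.5],   # W . X + b = +3.0 -> outside the margin on the positive side
    [1.0, 1.0],   # W . X + b =  0.0 -> exactly on H0
    [0.0, 0.5],   # W . X + b = -2.0 -> outside the margin on the negative side
])

for x in points:
    score = W @ x + b
    if np.isclose(score, 0.0):
        side = "on H0"
    elif score > 0:
        side = "positive side"
    else:
        side = "negative side"
    print(f"x = {x}, W . X + b = {score:+.1f} -> {side}")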

Applying Lagrange Multipliers

Lagrange multipliers help identify local maxima and minima of a function subject to equality constraints; applying them to the SVM optimization problem yields the decision rule. The assumption here is that the data is linearly separable with no outliers, a setting known as Hard Margin SVM.
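
For reference, the standard textbook form of the hard-margin problem and its Lagrangian (the video may use slightly different notation) is:

\[ \min_{W,\, b} \ \frac{1}{2}\lVert W \rVert^2 \quad \text{subject to} \quad y_i \,(W \cdot X_i + b) \ge 1 \ \text{for all } i \]
\[ \mathcal{L}(W, b, \alpha) = \frac{1}{2}\lVert W \rVert^2 - \sum_i \alpha_i \left[\, y_i \,(W \cdot X_i + b) - 1 \right], \qquad \alpha_i \ge 0 \]

where the \( \alpha_i \) are the Lagrange multipliers; setting the derivatives with respect to \( W \) and \( b \) to zero shows that \( W \) is a weighted sum of the support vectors alone, which is why only they shape the decision boundary.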

Handling Noise and Outliers

If the data contains noise or outliers, Hard Margin SVM fails to find a solution. To address this, Soft Margin SVM is used: slack variables (referred to as "theta" in the video) are added to the constraints of the optimization problem, making it possible to satisfy the constraints even when some outliers violate the original ones. However, this alone requires a large volume of data in which all examples satisfy the constraints.

Regularization:

One technique to handle this problem is regularization. An L1 penalty can be applied to large values of the slack variables, with a constant \( C \) as the regularization parameter. If \( C \) is small, the penalty is weak and large slack values are tolerated; if \( C \) is large, the slack is forced to stay small. Setting \( C \) to positive infinity gives the same output as Hard Margin SVM. A smaller \( C \) value results in a wider margin, while a larger \( C \) value results in a narrower margin, and a narrow margin is less tolerant of outliers. In this way, Soft Margin SVM can handle data that is not linearly separable because of outliers and noise.
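
As a rough illustration of this behaviour (my own sketch, not from the video), scikit-learn's SVC exposes exactly this \( C \) parameter:

# A minimal sketch (assumed toy data) of how the regularization parameter C
# controls the soft margin in scikit-learn's SVC.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters: nearly linearly separable, with some noise
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C -> wider margin and more support vectors (more tolerance for outliers);
    # larger C -> narrower margin, behaving closer to a Hard Margin SVM.
    print(f"C={C}: {clf.n_support_.sum()} support vectors")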

Kernel Trick

When data is inherently non-linearly separable, as shown in the example below, the solution is to use the kernel trick.
In this example, the kernel trick introduces a new variable \( k \) equal to \( x^2 \), which helps solve the problem; this approach is derived from applying the Lagrange formulation. By squaring \( x \), we achieve a clear separation of the data, allowing us to draw a separating line.

Polynomial Kernel

This kernel uses two parameters: a constant \( c \) and the degree \( d \) of the polynomial. A larger \( d \) value makes the boundary more complex, which might result in overfitting.

RBF Kernel (Gaussian Kernel)

The RBF kernel has one parameter, \( \gamma \). A smaller \( \gamma \) value makes the model behave like a linear SVM, while a larger value gives individual support vectors a heavy influence and may result in overfitting.
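
The sketch below (an assumed example, not from the video) compares these kernels in scikit-learn's SVC on data that is not linearly separable:

# A minimal sketch (assumed example) comparing SVC kernels on two concentric circles.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

models = {
    "linear": SVC(kernel="linear", C=1.0),
    "poly, d=3": SVC(kernel="poly", degree=3, coef0=1.0, C=1.0),   # larger degree -> more complex boundary
    "rbf, gamma=0.5": SVC(kernel="rbf", gamma=0.5, C=1.0),         # larger gamma -> risk of overfitting
}

for name, clf in models.items():
    acc = clf.fit(X, y).score(X, y)
    print(f"{name}: training accuracy = {acc:.2f}")
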
Tags: Machine Learning,Mathematical Foundations for Data Science,Mariswaran

Wednesday, August 14, 2024

Cost Function For Logistic Regression Model

To See All ML Articles: Index of Machine Learning

Can you show the derivation for this Cost function?

Here's a step-by-step derivation of the cost function:

1. Understanding Logistic Regression:

  • In logistic regression, the probability that a given sample \( \mathbf{x}^{(i)} \) belongs to class \( y^{(i)} \) (which can be either 0 or 1) is modeled using the logistic function: \[ \phi(z) = \frac{1}{1 + e^{-z}} \quad \text{where} \quad z = \mathbf{w}^T \mathbf{x} \]
  • For binary classification, \( y^{(i)} \) can either be 0 or 1, so: \[ P(y^{(i)} = 1 \mid \mathbf{x}^{(i)}; \mathbf{w}) = \phi(\mathbf{w}^T \mathbf{x}^{(i)}), \qquad P(y^{(i)} = 0 \mid \mathbf{x}^{(i)}; \mathbf{w}) = 1 - \phi(\mathbf{w}^T \mathbf{x}^{(i)}) \]
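
A tiny sketch (with hypothetical weights and a hypothetical sample) of how these probabilities are computed:

# A minimal sketch (hypothetical weights and sample) of the logistic function
# phi(z) and the two class probabilities it yields.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5])      # hypothetical weight vector
x = np.array([1.2, 2.0])       # hypothetical sample

z = w @ x                      # z = w^T x
p1 = sigmoid(z)                # P(y = 1 | x; w)
p0 = 1.0 - p1                  # P(y = 0 | x; w)
print(f"z = {z:.3f}, P(y=1) = {p1:.3f}, P(y=0) = {p0:.3f}")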

2. Likelihood Function for a Single Observation:

  • The likelihood for a single observation \( i \) can be written as: \[ P(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}) = \phi(\mathbf{w}^T \mathbf{x}^{(i)})^{y^{(i)}} \cdot \left(1 - \phi(\mathbf{w}^T \mathbf{x}^{(i)})\right)^{1 - y^{(i)}} \]
  • This formula works because:
    • If \( y^{(i)} = 1 \), the likelihood is \( \phi(\mathbf{w}^T \mathbf{x}^{(i)}) \).
    • If \( y^{(i)} = 0 \), the likelihood is \( 1 - \phi(\mathbf{w}^T \mathbf{x}^{(i)}) \).

3. Likelihood Function for the Entire Dataset:

  • Assuming that the samples are independent, the likelihood of the entire dataset (with \( n \) samples) is the product of the likelihoods of the individual observations: \[ L(\mathbf{w}) = \prod_{i=1}^{n} P(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w}) = \prod_{i=1}^{n} \left[\phi(\mathbf{w}^T \mathbf{x}^{(i)})^{y^{(i)}} \cdot \left(1 - \phi(\mathbf{w}^T \mathbf{x}^{(i)})\right)^{1 - y^{(i)}}\right] \]

4. Log-Likelihood:

  • It is often easier to work with the log-likelihood (because the logarithm turns the product into a sum), which is given by: \[ \log L(\mathbf{w}) = \sum_{i=1}^{n} \left[y^{(i)} \log \phi(\mathbf{w}^T \mathbf{x}^{(i)}) + (1 - y^{(i)}) \log \left(1 - \phi(\mathbf{w}^T \mathbf{x}^{(i)})\right)\right] \]

5. Cost Function:

  • In practice, we minimize the negative log-likelihood, which gives us the cost function for logistic regression: \[ J(\mathbf{w}) = -\log L(\mathbf{w}) = -\sum_{i=1}^{n} \left[y^{(i)} \log \phi(\mathbf{w}^T \mathbf{x}^{(i)}) + (1 - y^{(i)}) \log \left(1 - \phi(\mathbf{w}^T \mathbf{x}^{(i)})\right)\right] \]
  • This cost function is convex, which makes it suitable for optimization algorithms like gradient descent.
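
To make the cost concrete, here is a minimal NumPy sketch (with toy data I made up) that evaluates \( J(\mathbf{w}) \) exactly as written above:

# A minimal sketch (toy data assumed) of the negative log-likelihood cost J(w)
# for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Negative log-likelihood for labels y in {0, 1}."""
    phi = sigmoid(X @ w)          # phi(w^T x^(i)) for every sample
    eps = 1e-12                   # guard against log(0)
    return -np.sum(y * np.log(phi + eps) + (1 - y) * np.log(1 - phi + eps))

# Toy data: 4 samples, 2 features (a bias/intercept column could be appended similarly)
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [2.5, 2.5]])
y = np.array([0, 0, 1, 1])
w = np.array([0.1, -0.2])

print(f"J(w) = {cost(w, X, y):.4f}")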

Conclusion:

The derivation of the likelihood function for logistic regression involves expressing the probability of each individual sample given the model's parameters and then combining these probabilities across all samples in the dataset, assuming independence. The final likelihood function provides a way to measure how well the model's parameters fit the data, and minimizing the negative log-likelihood (or equivalently, maximizing the likelihood) leads to the optimal model parameters.

Reference

Sebastian Raschka, Python Machine Learning, Packt Publishing, 2015.
Tags: Machine Learning,Mathematical Foundations for Data Science,