Linear regression is a simple solution to our classification problems but what happens when it fails. As we will see in below problem

Suppose we want to classify **Y= {0,1}** and **X** are **data samples**. It is binary classification. Let’s try it with linear Regression

Wow, Linear Regression has done the job.It is really a good fit but what happens if i add a new data to given data set.

It’s really a bad solution. Linear Regression didn’t worked. What should we do next? If there is problem, there is solution and the solution is **Logistic Regression.** Let’s start with formal definition and have an idea what is linear and logistic Regression are:

In ** linear regression**, the outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values. In

**logistic regression**, the outcome (dependent variable) has only a limited number of possible values.

**Logistic regression**is used when the response variable is categorical in nature.

Intuitively, it also doesn’t make sense for h(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}.To fix this, lets change the form for our hypotheses h(x). We will choose hypothesis as follows :

g(z) is called ** Sigmoid function** or L

**. It look like as follows:**

*ogistic function*Let’s assume that

if we combine above equations together it can be re written as follows:

As in above equation when **y =0**, h(x) will become 1, p(y|x;theta) = (1-h(x)) and when **y =1**,(1- h(x)) will become 1, p(y|x;theta) = h(x)

Now, **likelihood of the parameters** are given by

and when we *Maximize log likelihood,**it* will be given by :

Now we can use **Gradient Descent** , we already know that h(x) is given by sigmoid function.

For ease, the partial derivative of g(z) with respect to z is given by

Now if apply gradient descent by taking the *partial derivative of log likelihood with respect to theta*

If we compare the above rule to the least mean square, it looks identical but it’s not. This is a different learning algorithm because h(x) is now defined as non-linear function of theta transpose * x[i].

If you find any inconsistency in my post, please feel free to point out in the comments. Thanks for reading.