I came across an old technical report written by Michael Jordan (no, not the basketball guy):
“Why the logistic function? A tutorial discussion on probabilities and neural networks”. M. I. Jordan. MIT Computational Cognitive Science Report 9503, August 1995.
The material is remarkably straightforward and easy to understand. It answers, at least partially, a long-standing question of mine: why is the logistic function used in regression? Regardless of how it was first introduced, the report shows that it can be derived from a simple binary classification setting in which we wish to estimate the posterior probability:
$$P(w_{0}|\mathbf{x}) = \frac{P(\mathbf{x}|w_{0})P(w_{0})}{P(\mathbf{x}|w_{0})P(w_{0}) + P(\mathbf{x}|w_{1})P(w_{1})}$$

where $w_{0}$ is one of the two classes and $\mathbf{x}$ is the feature vector. Dividing the numerator and the denominator by the numerator gives the logistic form:

$$P(w_{0}|\mathbf{x}) = \frac{1}{1 + \exp(-\log a - \log b)}$$

where $a=\frac{P(\mathbf{x}|w_{0})}{P(\mathbf{x}|w_{1})}$, $b= \frac{P(w_{0})}{P(w_{1})}$, and $P(\mathbf{x}|w)$ is the class-conditional density.
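As a quick numerical sanity check, here is a minimal sketch that computes the posterior twice: once directly from Bayes' rule, and once as a logistic function of $\log a + \log b$. The one-dimensional Gaussian class-conditionals with a shared standard deviation, the parameter values, and the function names are all assumptions made for illustration; they are not taken from the report.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as an assumed P(x | w)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_bayes(x, prior0, mean0, mean1, std):
    """P(w0 | x) computed directly from Bayes' rule."""
    joint0 = gaussian_pdf(x, mean0, std) * prior0
    joint1 = gaussian_pdf(x, mean1, std) * (1.0 - prior0)
    return joint0 / (joint0 + joint1)


def posterior_logistic(x, prior0, mean0, mean1, std):
    """P(w0 | x) written as 1 / (1 + exp(-(log a + log b)))."""
    # a is the likelihood ratio, b is the prior ratio.
    log_a = math.log(gaussian_pdf(x, mean0, std)) - math.log(gaussian_pdf(x, mean1, std))
    log_b = math.log(prior0) - math.log(1.0 - prior0)
    return 1.0 / (1.0 + math.exp(-(log_a + log_b)))


if __name__ == "__main__":
    # Illustrative (hypothetical) parameters: P(w0) = 0.3, class means +1 and -1.
    for x in (-2.0, 0.0, 0.5, 3.0):
        p_bayes = posterior_bayes(x, 0.3, 1.0, -1.0, 1.0)
        p_logit = posterior_logistic(x, 0.3, 1.0, -1.0, 1.0)
        print(f"x={x:+.1f}  bayes={p_bayes:.6f}  logistic={p_logit:.6f}")
```

The two columns should agree up to floating-point error. Under the equal-variance assumption used here, $\log a$ works out to a linear function of $x$, which is why the argument of the logistic function looks like the familiar linear term of logistic regression.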
However, the point of the report is not just to show where the logistic function comes from, but to show that discriminative and generative models in this particular setting are two sides of the same coin. In addition, Jordan demonstrates that the two sides are NOT equivalent and should be treated carefully when different learning criteria are considered. In general, a simple take-away is that the discriminative model (logistic regression) is more “robust”, whereas the generative model may be more accurate if its assumptions are correct.
For more details, please refer to the report.