Multiple Linear Regression Model and Its Variants as Solutions for Regression Problems in Machine Learning – Part I

Wednesday, March 27, 2019

What is a regression problem? This question is easier to answer with an example than with several paragraphs of description. Take a look at the sample data shown below. This is a screenshot of a pandas data frame as displayed in a Python Jupyter notebook. The data is arranged in rows and columns: every column is one variable (data field) and every row is one record. From here on, I will use machine learning (ML) terminology as much as possible.

Here’s a question for you. If you had to pick the one data field whose values are strongly related to the values in one or more of the remaining fields, which would you choose? You got it right: “Chance of admit”. This is the target data field, which is called the label. Next, pick one or more fields from the remaining set that you feel can be mapped to the label. The fields you select are the features, which are simply the predictor variables for the label. Mind you, the predictors should ideally not be strongly correlated with one another, because correlated predictors make the individual coefficients of a linear model hard to estimate and interpret. So, the rule of thumb here is:

Predictors – Independent variables; each predictor is independent of the others.
Target – Dependent variable
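
One rough way to check the rule of thumb above is to compute the pairwise Pearson correlation between candidate predictors. The sketch below is only an illustration: the `pearson` helper and the sample values are made up for this example and are not taken from the admissions data. A value close to +1 or -1 suggests that two predictors are strongly interrelated.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical feature values (illustrative only, not the real dataset).
features = {
    "GRE_Score": [337, 324, 316, 322, 314, 330],
    "SOP": [4.5, 4.0, 3.0, 3.5, 2.0, 4.5],
    "Research": [1, 1, 1, 1, 0, 1],
}

# Correlation for every pair of candidate predictors.
correlations = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

If the data is already in a pandas data frame, calling `corr()` on the frame gives the same pairwise table in one step.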

In a regression problem, you find a mathematical relationship between the features and the label. Once this relationship is established, the value of the label can be predicted for any given set of feature values. The label is always a continuous numeric variable. In this example all the feature values are numeric, but in general features can also be ordinal, nominal, or interval. A detailed discussion of feature data types is outside the scope of this blog.

There are several open source ML algorithms available to handle regression problems, ranging from multiple linear regression to random forests to complex neural networks. They all work in different ways. The fundamental difference is that linear regression looks only for a linear relationship between the label and the features, whereas the other algorithms can capture non-linear relationships as well. My sole focus in this blog is multiple linear regression (its variants follow in subsequent parts of this series), because it is the simplest and therefore the easiest to understand; it is also a good starting point for anyone who wants a jump start in ML.

Multiple Linear Regression

To start with, I am picking GRE Score, University Rating, SOP, and Research as features. There is no concrete reason why I picked these fields; I just picked them. For this set of features, the equation for the regression model can be written as

Chance_of_Admit = (a × GRE_Score) + (b × University_Rating) + (c × SOP) + (d × Research) + e

where a, b, c, and d are the coefficients of the four features and e is the intercept; together they are the model parameters. Now the question is how to get the values of a, b, c, d, and e. You feed the data for both the features and the label into your ML algorithm and let it find the best values for the model parameters. An example of how to perform this using Python is shown below.
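
A minimal sketch of what such code might look like follows. It assumes that pandas, NumPy, and statsmodels are installed, and it uses a synthetic stand-in for the admissions data, so the column names, the made-up coefficients, and the `new_data` frame are assumptions for illustration, not the original dataset or results. It fits the model with `ols()` followed by `fit()`, prints the `summary()` table, and calls `predict()` on a couple of new records.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the admissions data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GRE_Score": rng.integers(290, 341, size=n),
    "University_Rating": rng.integers(1, 6, size=n),
    "SOP": rng.integers(2, 11, size=n) / 2.0,  # 1.0 to 5.0 in steps of 0.5
    "Research": rng.integers(0, 2, size=n),    # 0 or 1
})

# Label built from made-up coefficients plus a little noise, so that the
# fitted parameters can be compared against known values.
df["Chance_of_Admit"] = (
    0.01 * df["GRE_Score"]
    + 0.02 * df["University_Rating"]
    + 0.03 * df["SOP"]
    + 0.05 * df["Research"]
    - 2.6
    + rng.normal(0, 0.02, size=n)
)

# "label ~ feature1 + feature2 + ..." is the formula syntax used by ols().
# The intercept (e in the equation) is included automatically.
model = smf.ols(
    "Chance_of_Admit ~ GRE_Score + University_Rating + SOP + Research",
    data=df,
).fit()

print(model.summary())  # parameters plus goodness-of-fit statistics
print(model.params)     # Intercept (e) and coefficients a, b, c, d

# Hypothetical new records to score with the fitted model.
new_data = pd.DataFrame({
    "GRE_Score": [320, 335],
    "University_Rating": [3, 5],
    "SOP": [3.5, 4.5],
    "Research": [0, 1],
})
predictions = model.predict(new_data)
print(predictions)
```

Because the synthetic label is generated from known coefficients, the values in `model.params` should land close to them, which is a quick sanity check that the fit behaves as expected.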
To run this code you have to import the library statsmodels.formula.api. Note that the best values for the model parameters are obtained by calling the function ols() to define the model and then fit() to run the algorithm. These values can be displayed, along with several other statistics, by calling the function summary(). The statistics shown in the above screenshot also contain information about the goodness of fit of the model and the relevance of each feature to the model.

Now that the model has been created, you can use it to get predictions for the label. To run the model on some new data, I have created a dataset, new data, as shown below. Predictions from the model are obtained using the predict() function.

Remarks

In this blog, I have not talked about data types and how to handle non-numeric data. The blog also does not cover data preparation, which can go a long way toward getting very good model performance. Details of the model statistics and improving the quality of the model through feature engineering are two more things that go on the “Missing List”. I omitted these topics intentionally for fear of deviating too much from what I really wanted to cover. But stay tuned: these will be covered in upcoming blogs.
