Let’s walk through the steps for predicting salary with regression. The model used here is multiple linear regression.
Assume that a company’s salaries for its programmers are as shown below:
Total, Lead, Manager, Certifications, Salary
1,0,0,0,20000
1.5,0,0,0,23000
2,0,0,0,25000
2,0,0,1,30000
2.5,0,0,0,27000
3,0,0,2,30000
3.5,1,0,0,33000
3.5,1,0,1,35000
3.5,1,0,2,40000
4,1,0,0,35000
4,2,0,0,40000
4,2,0,2,43000
The table has four independent variables: total years of experience, years of experience as a team lead, years of experience as a project manager, and the number of certifications the employee holds.
The dependent variable is salary.
We will try to predict the salary of an employee with 5 years of total experience, 2 years as a team lead and 1 year as a project manager, who also holds 2 certifications.
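The code in the following steps reads from a pandas DataFrame named dataset, but how it gets loaded is not shown. Here is a minimal sketch, assuming pandas is installed, that parses the table above from an in-memory string. The name CSV_TEXT is made up for this example, and skipinitialspace is used only because the header row has a space after each comma.

```python
import io

import pandas as pd

# The sample table from above, embedded as text so the sketch is self-contained.
# CSV_TEXT is a hypothetical name used only in this example.
CSV_TEXT = """Total, Lead, Manager, Certifications, Salary
1,0,0,0,20000
1.5,0,0,0,23000
2,0,0,0,25000
2,0,0,1,30000
2.5,0,0,0,27000
3,0,0,2,30000
3.5,1,0,0,33000
3.5,1,0,1,35000
3.5,1,0,2,40000
4,1,0,0,35000
4,2,0,0,40000
4,2,0,2,43000
"""

# skipinitialspace=True drops the blank after each comma in the header row,
# so the column is named "Lead" rather than " Lead".
dataset = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)

print(dataset.shape)           # (number of rows, number of columns)
print(list(dataset.columns))   # column names as parsed
```

If the table lives in a file instead, the same call should work with the file path in place of the StringIO object.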
The first step is to create the matrix of independent variables.
Let’s take all rows and all columns except the last column:
X = dataset.iloc[:, :-1].values
Next, take the dependent variable vector, which is the last column (the 5th column; indexing starts at zero, so its index is 4):
Y = dataset.iloc[:, 4].values
Now let’s split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
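With only 12 rows, it helps to check how many of them end up in each set. This is a small sketch, assuming NumPy and scikit-learn are installed, that uses placeholder arrays with the same shape as the sample data rather than the real values:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the sample data: 12 rows, 4 features.
X = np.zeros((12, 4))
Y = np.zeros(12)

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# scikit-learn rounds the test share up: 0.2 * 12 = 2.4 becomes 3 rows.
print(len(X_train), len(X_test))
```

So the model is trained on 9 rows and evaluated on 3, which is worth keeping in mind when reading any score computed on the test set.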
Now let’s fit a linear regression model to the training data:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Now we can predict the salary of an employee by passing the feature values to the regression object:
total = 5
lead = 2
manager = 1
cert = 2
predicted = regressor.predict([[total, lead, manager, cert]])[0]
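Putting the steps so far together, here is a self-contained sketch, assuming NumPy and scikit-learn are installed. It enters the sample rows directly as an array instead of reading a file, and the names data and employee are made up for this example. The exact number it prints depends on the library versions and the split, so no expected output is shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sample rows: total, lead, manager, certifications, salary.
data = np.array([
    [1, 0, 0, 0, 20000], [1.5, 0, 0, 0, 23000], [2, 0, 0, 0, 25000],
    [2, 0, 0, 1, 30000], [2.5, 0, 0, 0, 27000], [3, 0, 0, 2, 30000],
    [3.5, 1, 0, 0, 33000], [3.5, 1, 0, 1, 35000], [3.5, 1, 0, 2, 40000],
    [4, 1, 0, 0, 35000], [4, 2, 0, 0, 40000], [4, 2, 0, 2, 43000],
], dtype=float)

X = data[:, :-1]   # all columns except the last
Y = data[:, -1]    # the salary column

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# predict() expects a 2-D array: one row per employee, one column per feature.
employee = [[5, 2, 1, 2]]  # total, lead, manager, cert
predicted = regressor.predict(employee)[0]
print(predicted)

# The learned intercept and one coefficient per feature.
print(regressor.intercept_, regressor.coef_)
```

Two things are worth noticing when reading the result. The Manager column is 0 in every sample row, so the fit has no information about manager experience and the manager value in the input has no real effect on the prediction. The employee’s 5 years of total experience also lies outside the 1 to 4 year range in the table, so the prediction is an extrapolation.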
We have created an api.py that listens for requests on port 5000.
The API call below returns a predicted salary of 48017.16738197425:
curl -X POST -d "total=5&lead=2&manager=1&cert=2" http://127.0.0.1:5000
{
  "predicted": 48017.16738197425
}
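The contents of api.py are not shown here, so the following is only a sketch of what such an endpoint could look like. It assumes Flask, NumPy and scikit-learn are installed, and the route, the function name and the error message are made up for illustration. It accepts the same form fields as the curl command above and returns JSON in the same shape.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sample rows: total, lead, manager, certifications, salary.
DATA = np.array([
    [1, 0, 0, 0, 20000], [1.5, 0, 0, 0, 23000], [2, 0, 0, 0, 25000],
    [2, 0, 0, 1, 30000], [2.5, 0, 0, 0, 27000], [3, 0, 0, 2, 30000],
    [3.5, 1, 0, 0, 33000], [3.5, 1, 0, 1, 35000], [3.5, 1, 0, 2, 40000],
    [4, 1, 0, 0, 35000], [4, 2, 0, 0, 40000], [4, 2, 0, 2, 43000],
], dtype=float)

# Fit once at start-up, using the same split settings as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    DATA[:, :-1], DATA[:, -1], test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

app = Flask(__name__)


@app.route("/", methods=["POST"])
def predict():
    # Read the four form fields sent by curl -d "total=...&lead=...".
    try:
        features = [float(request.form[name])
                    for name in ("total", "lead", "manager", "cert")]
    except (KeyError, ValueError):
        # A field is missing or is not a number.
        return jsonify(error="total, lead, manager and cert must be numbers"), 400
    predicted = model.predict([features])[0]
    # Convert the NumPy float to a plain float so it serializes as JSON.
    return jsonify(predicted=float(predicted))
```

Saved as api.py, this could be started on a recent Flask version with flask --app api run --port 5000 and then queried with the curl command above.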
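Now let’s measure how well the model fits the held-out test data.
r2 = regressor.score(X_test, y_test)
The calculated score is 0.9321110353200895. For a regressor, score returns the coefficient of determination (R²) rather than a classification accuracy, so this value means the model explains about 93% of the variance in the test salaries. A score of 1 would mean that the predicted salaries match the test salaries exactly.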
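For reference, the score of a linear regressor in scikit-learn is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand in plain Python. The three salaries are made-up numbers chosen for easy arithmetic, not values from the dataset, and r_squared is a name invented for this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [30000, 35000, 40000]
predicted = [31000, 35000, 39000]

# SS_res = 1000^2 + 0 + 1000^2 = 2,000,000
# SS_tot = 5000^2 + 0 + 5000^2 = 50,000,000
print(r_squared(actual, predicted))  # 0.96
```

Unlike a percentage accuracy, R² can be negative when the predictions are worse than simply predicting the mean salary.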
Sample code is available at https://github.com/abdunnasir/data-science-open/tree/master/multi_linear_regression