AI PROJECT-1.6 How to Train the Model?

STEP-1: Import necessary libraries and session information

  • First, we will import the Amazon SageMaker Python SDK, which provides framework estimators and generic estimators to train our model while orchestrating the machine learning (ML) lifecycle. It gives access to the SageMaker training features and the underlying AWS infrastructure, such as Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Simple Storage Service (Amazon S3).
import sagemaker
from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.session import TrainingInput

region = sagemaker.Session().boto_region_name
print("AWS Region: {}".format(region))

role = sagemaker.get_execution_role()
print("RoleArn: {}".format(role))

where

  • sagemaker.debugger = Provides Amazon SageMaker Debugger, which we use to inspect training parameters and data throughout the training process when working with the TensorFlow, PyTorch, and Apache MXNet frameworks or the XGBoost algorithm. Debugger automatically detects and alerts users to commonly occurring errors, such as parameter values getting too large or small. After the training job has completed, we can also download an XGBoost training report and a profiling report generated by SageMaker Debugger.
  • region = Current AWS Region where SageMaker notebook instance is running.
  • role = IAM role used by the notebook instance.
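  • Note: sagemaker.get_execution_role() only resolves a role when the code runs on SageMaker (for example, in a notebook instance). A minimal sketch of a fallback for running elsewhere, assuming your account has a SageMaker-capable IAM role (the name "MySageMakerRole" below is hypothetical):
import boto3

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker the current AWS identity is usually not a role, so look
    # one up explicitly; replace "MySageMakerRole" with a role in your account.
    role = boto3.client("iam").get_role(RoleName="MySageMakerRole")["Role"]["Arn"]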

STEP-2: Create S3 path for output location where XGBoost Model will be stored

s3_output_location='s3://{}/{}/{}'.format(bucket, prefix, 'xgboost_model')
display(s3_output_location)

where

  • 's3://' = indicates that the path is an S3 path.
  • '{}' = Placeholders used to insert values into the string.
  • bucket = Name of S3 bucket where model will be stored.
  • prefix = Directory prefix within the bucket, providing a logical separation of data.
  • 'xgboost_model' = Name of directory (or object) within specified prefix where XGBoost model will be saved.
  • 's3://{}/{}/{}'.format(bucket, prefix, 'xgboost_model') = Uses string formatting to construct the complete S3 path.
  • s3_output_location = Variable which holds the complete S3 path for the output location of the XGBoost model.
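  • The variables bucket and prefix are assumed to have been defined in a previous step. If they are not already in scope, a minimal sketch using the session's default bucket (the prefix name below is only an example) would be:
# Default S3 bucket that SageMaker manages for this account and Region,
# plus a prefix to keep this project's objects together.
bucket = sagemaker.Session().default_bucket()
prefix = "demo-sagemaker-xgboost"  # example prefix; choose your own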

STEP-3: Retrieve the URI for the XGBoost algorithm container in Amazon SageMaker

container=sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")
print(container)

where

  • sagemaker.image_uris.retrieve = This function is part of the SageMaker SDK and is used to retrieve the container image URI for a specific algorithm.
  • "xgboost": Name of algorithm for which you want to retrieve the container URI. In this case, it's "xgboost," indicating the XGBoost algorithm.
  • region: AWS region in which the algorithm container is available.
  • "1.7-1": The third argument specifies the version tag for the container image. In this case, it's "1.7-1," indicating the desired version of the XGBoost algorithm.
  • container: This variable stores the retrieved container image URI for the specified version of the XGBoost algorithm.
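  • The returned value is an Amazon ECR image URI. As a sketch of how the version argument works, the same call can pin any other released version of the built-in image ("1.5-1" below is used purely as an illustration; the account ID in the URI varies by Region):
# Retrieve an older release of the same built-in algorithm image.
older_container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
print(older_container)  # e.g. <account>.dkr.ecr.<region>.amazonaws.com/sagemaker-xgboost:1.5-1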

STEP-4: Construct a SageMaker estimator that calls the xgboost-container

Why do we need an Estimator?

  • An Estimator is a high-level interface used to set up, configure, and run machine learning (ML) training and inference tasks.
  • Estimators simplify the process of using SageMaker by encapsulating various details and configurations.
  • Abstraction of Infrastructure Details: You don't need to worry about low-level infrastructure details such as instance types, container images, and distributed training settings.
  • Configuration of Training Jobs: Estimators allow users to easily configure and customize training jobs by specifying parameters like the algorithm image, instance type, instance count, volume size, and output location.
    • Can define hyperparameters specific to the chosen algorithm.
  • Consistent Interface: Whether you're training a linear model, a deep learning model, or an XGBoost model, the process of setting up and running a training job remains similar.
  • Rule Configuration: Allow users to specify rules to monitor training jobs for common issues or anomalies. These rules can include checks for data validation, model quality, and resource utilization.
xgb_model=sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    volume_size=5,
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session(),
    rules=[
        Rule.sagemaker(rule_configs.create_xgboost_report()),
        ProfilerRule.sagemaker(rule_configs.ProfilerReport())
    ]
)

where

  • image_uri=container : Specifies the Docker container image URI for the XGBoost algorithm. We already retrieved this URI using sagemaker.image_uris.retrieve in the previous step.
  • role=role: Specifies the IAM role that SageMaker assumes to perform tasks on your behalf, such as launching instances, reading training data from Amazon S3, and writing model artifacts and training results back to Amazon S3.
  • instance_count=1: Specifies the number of instances to use for model training. In this case, it's set to 1.
  • instance_type='ml.m4.xlarge': Specifies the type of instance to use for model training. The 'ml.m4.xlarge' instance is a general-purpose compute instance with 4 vCPUs, 16 GB of memory, Amazon Elastic Block Store (Amazon EBS) storage, and high network performance.
  • volume_size=5: Specifies the size of the EBS (Elastic Block Store) storage volume, in gigabytes, to be attached to the training instance. If you don't specify this parameter, its value defaults to 30.
  • output_path=s3_output_location: Specifies the Amazon S3 location where the trained model artifacts and training results will be stored.
  • sagemaker_session=sagemaker.Session(): Specifies the SageMaker session to be used. The sagemaker.Session() creates a new session.
  • rules : Specifies a list of SageMaker Debugger built-in rules to be applied during training. The rules check for specific conditions, and in this case, it includes rules related to creating a report for the XGBoost model and profiling information. In this example, the create_xgboost_report() rule creates an XGBoost report that provides insights into the training progress and results, and the ProfilerReport() rule creates a report regarding the EC2 compute resource utilization. For more information, see SageMaker Debugger XGBoost Training Report.

STEP-5: Set Hyperparameters of XGBoost

What are Hyperparameters?

  • External configuration settings that are not learned from the data but are set prior to the training process.
  • These parameters influence the overall behavior of a machine learning model, impacting how the model learns and generalizes from the training data.
xgb_model.set_hyperparameters(
    max_depth = 5,
    eta = 0.2,
    gamma = 4,
    min_child_weight = 6,
    subsample = 0.7,
    objective = "binary:logistic",
    num_round = 1000
)

where

  • max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit.

    • Valid values: Integer. Range: [0,∞)
    • Default value: 6
  • eta: Learning rate or shrinkage parameter. It controls the step-size shrinkage during the optimization process; a lower value makes the optimization more robust but slower. After each boosting step you can directly get the weights of new features, and eta shrinks those feature weights to make the boosting process more conservative (see the update formula after this list).

    • Valid values: Float. Range: [0,1].
    • Default value: 0.3
  • gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. It is a regularization term that encourages pruning of the tree. A larger gamma value leads to a more conservative algorithm.

    • Valid values: Float. Range: [0,∞).
    • Default value: 0
  • min_child_weight: Minimum sum of instance weight (hessian) needed in a child. It is another regularization term that controls the minimum number of samples (instances) required for a child node to be created. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.

    • Valid values: Float. Range: [0,∞).
    • Default value: 1
  • subsample: Subsample ratio of the training instances. It controls the fraction of samples used to grow each tree. Setting it to 0.5 means that XGBoost randomly samples half of the training instances to grow each tree; a value less than 1.0 introduces randomness and helps prevent overfitting.

    • Valid values: Float. Range: [0,1].
    • Default value: 1
  • objective: The learning task and corresponding objective function. In this case, it is set to "binary:logistic" indicating a binary classification problem with logistic regression.

    • Valid values: String
    • Default value: "reg:squarederror"
  • num_round: The number of boosting rounds to run during training, that is, the number of boosting iterations (trees) to build.

    • Valid values: Integer.
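  • To make eta concrete, the standard gradient-boosting update (shown here for illustration) scales each new tree's contribution before adding it to the ensemble:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i)$$

where f_t is the tree fitted at round t. With eta = 0.2, each tree contributes only 20% of its raw prediction, which is why a conservative eta is usually paired with a larger num_round.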

STEP-6: Configure a data input flow for training

  • We will use the TrainingInput class to configure a data input flow for training, so that we can use the training and validation datasets we uploaded to Amazon S3.
from sagemaker.session import TrainingInput

train_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "data/train.csv"), content_type="csv"
)
validation_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "data/validation.csv"), content_type="csv"
)

where

  • 's3://' = indicates that the path is an S3 path.
  • '{}' = Placeholders used to insert values into the string.
  • bucket = Name of S3 bucket where the training and validation data are stored.
  • prefix = Directory prefix within the bucket, providing a logical separation of data.
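  • The train.csv and validation.csv files are assumed to have been uploaded to S3 in a previous step. If they only exist locally, a minimal upload sketch would look like the following (the local file names are assumptions; note that the built-in XGBoost algorithm expects CSV input with the label in the first column and no header row):
# Upload the local CSV files to the same bucket/prefix that the
# TrainingInput objects above point at.
session = sagemaker.Session()
session.upload_data(path="train.csv", bucket=bucket, key_prefix="{}/data".format(prefix))
session.upload_data(path="validation.csv", bucket=bucket, key_prefix="{}/data".format(prefix))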

STEP 7: Start Model Training

  • To start model training, we call the estimator's fit method with the training and validation datasets as input.
xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)

where

  • wait=True : Tells the fit method to display progress logs and wait until training is complete.

  • This training job might take up to 10 minutes.
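  • If you prefer not to block the notebook while the job runs, a minimal sketch with wait=False starts the job asynchronously and attaches to it later:
# Start the training job without streaming logs into the notebook...
xgb_model.fit({"train": train_input, "validation": validation_input}, wait=False)

# ...do other work, then block until the job finishes.
xgb_model.latest_training_job.wait()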

XGBoost Training Report

  • After the training job has been completed, we can download an XGBoost training report and a profiling report generated by SageMaker Debugger.
  • The XGBoost training report offers you insights into the training progress and results, such as the loss function with respect to iteration, feature importance, confusion matrix, accuracy curves, and other statistical results of training.
  • The Debugger profiling report shows summaries and details of the EC2 instance resource utilization, system bottleneck detection results, and Python operation profiling results.
  • For example, the loss curve in the XGBoost training report can clearly indicate an overfitting problem, as explained below.

How to read the loss curve from the XGBoost training report

  • The loss curve is a graphical representation of how the training and validation loss (or objective function value) change over the course of training rounds.

  • It can provide insights into whether the model is overfitting or underfitting.

  • Initially, both training and validation loss decrease, indicating that the model is learning patterns from the training data.

  • After a certain point, the training loss continues to decrease, but validation loss plateaus or even starts to increase.

  • This divergence between training and validation loss indicates that the model is becoming too specialized to the training data and may not generalize well to new, unseen data.

  • So, the loss curve serves as a diagnostic tool: a noticeable gap between training and validation loss suggests overfitting.
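  • As a sketch, you can also plot these curves yourself from the metrics the training job emits (this assumes matplotlib is installed; the metric names below are what the built-in XGBoost algorithm emits for the binary:logistic objective):
from sagemaker.analytics import TrainingJobAnalytics
import matplotlib.pyplot as plt

# Fetch the per-round metrics that the training job published to CloudWatch.
metrics_df = TrainingJobAnalytics(
    xgb_model.latest_training_job.job_name,
    metric_names=["train:logloss", "validation:logloss"],
).dataframe()

# One line per metric: elapsed seconds on x, log loss on y.
for name, group in metrics_df.groupby("metric_name"):
    plt.plot(group["timestamp"], group["value"], label=name)
plt.xlabel("Seconds since training start")
plt.ylabel("Log loss")
plt.legend()
plt.show()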

STEP-8: Check if the XGBoost training report and profiling report exist

  • Run the following code to specify the S3 bucket URI where the Debugger training reports are generated and check if the reports exist.
rule_output_path = xgb_model.output_path + "/" + xgb_model.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive

where

  • output_path: Specifies the Amazon S3 location where the trained model artifacts and training results will be stored.
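  • The same check can be done from Python (a sketch using the SDK's S3 helper instead of the AWS CLI):
from sagemaker.s3 import S3Downloader

# List the S3 keys under the rule output path; an empty list means the
# Debugger reports have not been written yet.
print(S3Downloader.list(rule_output_path))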

STEP-9: Download the Debugger XGBoost training and profiling reports

  • Download reports to current workspace.
! aws s3 cp {rule_output_path} ./ --recursive
from IPython.display import FileLink, FileLinks
display("Click link below to view the XGBoost Training report", FileLink("CreateXgboostReport/xgboost_report.html"))

profiler_report_name = [rule["RuleConfigurationName"] 
                        for rule in xgb_model.latest_training_job.rule_job_summary() 
                        if "Profiler" in rule["RuleConfigurationName"]][0]
print(profiler_report_name)
display("Click link below to view the profiler report", FileLink(profiler_report_name+"/profiler-output/profiler-report.html"))

STEP-10: Model Artifacts of XGBoost Model

  • Now we have a fully trained XGBoost model, and SageMaker stores the model artifact in your S3 bucket. To find the location of the model artifact, run the following code to print the model_data attribute of the xgb_model estimator:
xgb_model.model_data
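  • The returned value is an S3 URI ending in model.tar.gz. As a sketch, you can download and unpack the archive to inspect the raw XGBoost model file:
from sagemaker.s3 import S3Downloader
import tarfile

# Download the trained model archive to the current directory and unpack it.
S3Downloader.download(xgb_model.model_data, ".")
with tarfile.open("model.tar.gz") as tar:
    tar.extractall("xgboost_model_artifacts")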
