
AI PROJECT-2.0 AWS SageMaker Tutorial




Initial Setup

pip install sagemaker pandas numpy --upgrade


  • pip = Package installer for Python. It is a command-line tool that allows you to find, install, upgrade, and remove Python packages from the Python Package Index (PyPI) or other package indexes. PyPI is a repository of software packages developed and maintained by the Python community.
  • install = pip command used to install packages.
  • "sagemaker, pandas, and numpy" = Names of the Python packages that are being installed or upgraded.
  • --upgrade = pip flag to upgrade the specified packages to the latest versions available, if they are already installed.
pip freeze


  • freeze = pip command that generates a list of all Python packages installed in your environment, with their versions. Commonly used to create a requirements.txt file, which can be shared with others to replicate the same environment or to install the same dependencies on a different system.
pip freeze > requirements.txt
  • Save the list of installed packages and versions to a file named requirements.txt in the current directory. The resulting requirements.txt file can then be shared with others or used to install the same dependencies on another system using the pip install -r requirements.txt command.
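
As a complement to pip freeze, the same version information can be read from inside Python. This is a minimal sketch using the standard-library importlib.metadata module; the helper name installed_version is hypothetical and not part of the tutorial's code.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this tutorial installs.
for name in ("sagemaker", "pandas", "numpy"):
    print(name, installed_version(name) or "not installed")
```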


The dataset is available in this book: Book. Create an S3 bucket in the appropriate region, create a folder in it, and upload churn.txt to the bucket's DataSet folder.

import boto3


  • import statement = used to bring modules or specific attributes (functions, classes, variables) from modules into the current namespace, allowing you to use them in your code. E.g. from math import sqrt, import math.
  • Modules are files containing Python code and can define functions, classes, variables, and other Python objects.
  • After importing a module, you can access its contents using dot notation (module_name.function_name, module_name.variable_name, etc.).
  • boto3: AWS SDK for Python. It provides APIs for interacting with various AWS services, including S3.
s3 = boto3.client("s3")


  • client() = creates an S3 client object. Allows you to interact with S3 buckets and objects. It requires AWS credentials to authenticate and authorize requests.
s3.download_file("aimldemobootcamp", "DataSet/churn.txt", "churn.txt")


  • download_file() = downloads a file named churn.txt from the specified S3 bucket (aimldemobootcamp) and object key (DataSet/churn.txt). It saves the downloaded file locally with the name churn.txt.
import pandas as pd
churn = pd.read_csv("./churn.txt")


  • import pandas as pd = imports the pandas library and assigns it the alias pd.
  • pd.read_csv() = function to read the CSV file named churn.txt from the current directory (./) and stores its contents in a pandas DataFrame named churn.
  • DataFrame = Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).


  • churn.index and churn.columns = retrieve the row labels and the column labels of the DataFrame churn, respectively.
  • len() = calculates the length of an object; for example, len(churn.columns) is the number of columns in the DataFrame.
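
To make the difference between the two attributes concrete, here is a small sketch on a made-up three-row frame. The values are illustrative and are not taken from churn.txt.

```python
import pandas as pd

# Tiny stand-in for the churn data; the values are made up for illustration.
sample = pd.DataFrame(
    {"State": ["KS", "OH", "NJ"], "Day Mins": [265.1, 161.6, 243.4], "Churn?": ["False.", "False.", "True."]}
)

n_rows = len(sample.index)       # number of rows
n_columns = len(sample.columns)  # number of columns
print(n_rows, n_columns)         # the same two numbers as sample.shape
```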


Now we have to remove redundancy and bias from the data, so that our model is trained properly, is free from bias, and can predict customer churn accurately.

As we can see, our dataset has two types of features:

  • Numeric features
  • Categorical (Non-numeric)

Let's start analyzing non-numeric features, by creating frequency tables:

# Frequency tables for each categorical feature
for column in churn.select_dtypes(include=["object"]).columns:
    display(pd.crosstab(index=churn[column], columns="% observations", normalize="columns"))


  • The code above iterates over each column in the DataFrame churn that contains categorical data (object dtype).
  • crosstab() = Generates a frequency table for each categorical column. A cross-tabulation normally counts combinations of two (or more) factors; here the second factor is a single constant label, so the result is one column showing the share of observations (out of 1) that falls into each category.
  • index = Specifies the categorical feature to be tabulated.
  • columns = Normally the second factor; here the constant string "% observations" simply labels the single output column.
  • normalize="columns" = Divides each count by the column total, so the values are proportions rather than raw counts.
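
The same call can be tried on a toy column first. This is a sketch with made-up values; the Series name "Plan" is hypothetical.

```python
import pandas as pd

# Made-up plan flags for five customers.
plans = pd.Series(["no", "yes", "no", "no", "yes"], name="Plan")

freq = pd.crosstab(index=plans, columns="% observations", normalize="columns")
print(freq)

# The proportions match a normalized value count of the same column.
share_no = plans.value_counts(normalize=True)["no"]
```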

Let's start analyzing the numeric features:

display(churn.describe()) #Display statistical overview of the dataset


  • describe() = Returns a statistical summary (count, mean, std, min, 25%, 50%, 75%, max) for each numeric column in the DataFrame churn.

  • Area Code is reported as a numeric feature, but it is only an identifier: it could just as well be ABCD or some other label. It is therefore a categorical feature, not a numeric one, so we convert it to a non-numeric type to allow operations and analyses that are specific to categorical data.

churn["Area Code"] = churn["Area Code"].astype(object)


  • astype() = Converts the numeric feature to the object data type.
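
One visible effect of the conversion is that describe() stops summarising the column. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; "Area Code" is numeric only by notation.
sample = pd.DataFrame({"Area Code": [415, 408, 510], "Day Mins": [265.1, 161.6, 243.4]})
before = list(sample.describe().columns)  # both columns are summarised

sample["Area Code"] = sample["Area Code"].astype(object)
after = list(sample.describe().columns)   # the object column is skipped
print(before, after)
```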

Displaying histograms of the numeric features of the dataset

What is Histogram?

  • A histogram is a representation of the distribution of data.
hist = churn.hist()

But as we can see it is not very well aligned and presentable.

  • So, we have to add some parameters like bins, figsize and sharey. Let's discuss them one by one.

What is figsize in Histogram?

  • Sets the size of the overall figure in inches.
  • The figure size is specified as a tuple with width and height.
hist = churn.hist(figsize=(10, 10))
  • In this case, the width is set to 10 inches, and the height is set to 10 inches. There is still some unused horizontal space, so we can widen the figure.
  • Let me try with a 20-inch width.
hist = churn.hist(figsize=(20, 10))
  • This looks good, but there are still too few bins.

What is Bin In Histogram?

  • Refers to an interval or range into which the data is divided.
  • The data points are then counted or aggregated within each bin to create a visual representation of the distribution of the data.
  • Default Value: 10
  • I tried multiple values and found that 30 seems to give a good histogram.
hist = churn.hist(figsize=(20, 10), bins=30)

Important Note

  • Choosing an appropriate number of bins is important in creating a meaningful and informative histogram.
  • Too few bins may oversimplify the distribution, while too many bins may introduce noise and make it harder to identify patterns.
  • The choice of bin size can impact the visual representation of the data, so it's often a good idea to experiment with different bin sizes to find the one that best reveals the underlying distribution.
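
The effect of the bin count can also be checked numerically. The sketch below uses np.histogram on synthetic, roughly bell-shaped data (an assumption standing in for a numeric feature of the dataset) rather than the plotting call itself.

```python
import numpy as np

# Synthetic data standing in for a numeric feature; not taken from churn.txt.
rng = np.random.default_rng(0)
values = rng.normal(loc=180.0, scale=50.0, size=1000)

for bins in (5, 30, 200):
    counts, edges = np.histogram(values, bins=bins)
    # Every value lands in exactly one bin, so the counts always sum to 1000;
    # only the resolution (bin width) changes.
    print(bins, "bins, width", round(edges[1] - edges[0], 2), "tallest bin", counts.max())
```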

What is sharey or sharex In Histogram?

  • Specifies whether the y-axis or x-axis should be shared among histograms.
  • When set to True, all histograms will share the same y-axis or x-axis, making it easier to compare the distributions across different features.
  • In our case, sharey=True will make our histogram more presentable, as the main goal with a histogram is a correct and presentable representation of the distribution of data.
hist = churn.hist(figsize=(20, 10), bins=30, sharey=True)


  • We can see that "State" appears to be quite evenly distributed.
  • Phone takes on too many unique values and doesn't add any practical value, so we will remove unnecessary features to provide relevant data only for model training. It’s possible that parsing out prefixes could have some value, but without more context on how these are allocated, we should avoid using it.
churn = churn.drop("Phone", axis=1)
  • Most of the numeric features are surprisingly nicely distributed, with many showing a bell-like, roughly Gaussian shape, which suggests the values are not heavily skewed. VMail Message is a notable exception: not everyone has opted for the VMail Plan, while those who did opt in are using it well.
  • To avoid redundancy and bias, we have to remove one feature from each highly correlated pair: Day Charge from the pair with Day Mins, Eve Charge from the pair with Eve Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins. The two features in each pair are directly dependent on each other, so keeping both adds redundancy and bias.
churn = churn.drop(["Day Charge", "Eve Charge", "Night Charge", "Intl Charge"], axis=1)
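
Why one column of each pair is redundant can be seen on a toy example. The sketch assumes the charge is a fixed per-minute rate times the minutes; the rate and the minute values are made up.

```python
import pandas as pd

# Made-up minutes; the per-minute rate is an assumption for illustration.
RATE_PER_MINUTE = 0.17
sample = pd.DataFrame({"Day Mins": [265.1, 161.6, 243.4, 299.4, 166.7]})
sample["Day Charge"] = sample["Day Mins"] * RATE_PER_MINUTE

corr = sample["Day Mins"].corr(sample["Day Charge"])
print(round(corr, 6))

# One column of a perfectly correlated pair adds no new information, so drop it.
sample = sample.drop("Day Charge", axis=1)
```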

Now that we have cleaned our dataset, let's determine which algorithm to use.


  • With VMail Plan and some other variables, both high and low (but not intermediate) values appear to be predictive of churn. That is:
    • Customers with no VMail Plan may be disengaged customers who are likely to churn.
    • Similarly, customers with an active VMail Plan may be heavy users who are also likely to churn due to potential dissatisfaction with the service or excessive costs.
  • Also, we know that our target variable has only two outcomes: True and False. Because the target attribute is binary, our model will perform binary prediction, also known as binary classification. To accommodate the non-linear pattern described above in a linear algorithm like linear regression, we'd need to generate polynomial (or bucketed) terms. Instead, let's attempt to model this problem using gradient-boosted trees, so we will use a gradient boosting algorithm.

What is Gradient Boosting

  • Gradient Boosting is a popular boosting algorithm in machine learning used for classification and regression tasks. Boosting is an ensemble learning method that trains models sequentially, with each new model trying to correct the errors of the previous one.


  • Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint.
  • XGBoost uses gradient boosted trees which naturally account for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.

XGBoost Requirements

  • Amazon SageMaker XGBoost is capable of training with data in CSV or LibSVM formats. In this example, CSV will be used.
  • The target variable must be in the first column.
  • The file must not have a header row.

But first, let’s convert our categorical features into numeric features.

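# dtype=int keeps the indicator columns as 0/1 (pandas 2.0+ would otherwise produce True/False)
model_data = pd.get_dummies(churn, dtype=int)
model_data = pd.concat(
    [model_data["Churn?_True."], model_data.drop(["Churn?_False.", "Churn?_True."], axis=1)], axis=1
)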
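
To see what this encoding and reordering produce, here is a sketch on a made-up three-row frame. The column names mirror the dataset, but the rows are illustrative only.

```python
import pandas as pd

# Made-up rows with one categorical feature and a categorical target.
sample = pd.DataFrame(
    {"Day Mins": [265.1, 161.6, 243.4], "Int'l Plan": ["no", "yes", "no"], "Churn?": ["False.", "False.", "True."]}
)

encoded = pd.get_dummies(sample, dtype=int)  # one 0/1 column per category value
# Put the positive-class indicator first and remove both target dummies from the features.
encoded = pd.concat(
    [encoded["Churn?_True."], encoded.drop(["Churn?_False.", "Churn?_True."], axis=1)], axis=1
)
print(list(encoded.columns))
```

The first column now holds the 0/1 target and every remaining column is numeric, which is the layout the XGBoost requirements above ask for.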


We are splitting the dataset into three datasets:

  • The first (train dataset) for training the model,
  • The second (test dataset) to evaluate the performance of the final trained model (testing the model's accuracy on data it hasn't already seen), and
  • The third (validation dataset), which is often used during the training phase to fine-tune the model's hyperparameters and prevent overfitting.

STEP-1 : Importing Necessary Library

import numpy as np 


  • NumPy (imported here as np) is a general-purpose array-processing package.

STEP-2 : Splitting Dataset

train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=7777),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)


  • This line shuffles the rows of model_data using .sample() with:
  • frac=1 to return all rows in random order
  • random_state=7777 to set the random seed for reproducibility.
  • Then, it splits the shuffled DataFrame into three parts at the indices calculated by multiplying the length of model_data by 0.7 and 0.9. These splits result in approximately 70% of the data for training, 20% for validation, and 10% for testing.
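
The 70/20/10 arithmetic can be checked on a frame of 100 made-up rows. This sketch deliberately uses .iloc slicing instead of np.split; it is an equivalent alternative, not the tutorial's own call.

```python
import pandas as pd

# 100 made-up rows so the split sizes are easy to read.
data = pd.DataFrame({"label": [i % 2 for i in range(100)], "feature": range(100)})

shuffled = data.sample(frac=1, random_state=7777)  # shuffle all rows reproducibly
train_end = int(0.7 * len(shuffled))
validation_end = int(0.9 * len(shuffled))

train = shuffled.iloc[:train_end]
validation = shuffled.iloc[train_end:validation_end]
test = shuffled.iloc[validation_end:]
print(len(train), len(validation), len(test))
```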

STEP-3: Saves DataSet into CSV Files

  • We need to save DataSet into CSV Files without including headers and indices, so we can provide data to the algorithm in the right format, as discussed earlier.
train_data.to_csv("train.csv", header=False, index=False)
validation_data.to_csv("validation.csv", header=False, index=False)


  • The code above saves train_data and validation_data into CSV files named "train.csv" and "validation.csv".
  • header=False excludes the header row from the CSV files.
  • index=False excludes the row indices from the CSV files.
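
The effect of the two flags is easiest to see by writing to a string instead of a file. A small sketch with made-up rows:

```python
import pandas as pd

# Two made-up rows with the label in the first column.
sample = pd.DataFrame({"Churn?_True.": [1, 0], "Day Mins": [265.1, 161.6]})

with_defaults = sample.to_csv()                         # header row and index column included
for_xgboost = sample.to_csv(header=False, index=False)  # data rows only
print(with_defaults)
print(for_xgboost)
```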

STEP-4: Upload DataSet to S3

  • We will upload the training dataset in CSV format to S3.


  • boto3.resource("s3"): Creates an S3 resource object.

  • .Bucket("aimldemobootcamp"): Specifies the S3 bucket to interact with.

  • .Object(...): Specifies the S3 object (file) to upload.

  • os.path.join(".../train.csv"): Constructs the S3 key (path) where the file will be uploaded.

  • .upload_file("train.csv"): Uploads the local file "train.csv" to the specified S3 object.

  • Similarly, we will upload the validation dataset in CSV format to S3.
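
The chain of calls described above can be sketched as follows. The helper names s3_key and upload are hypothetical, and the keys "DataSet/Train/train.csv" and "DataSet/Validation/validation.csv" are assumptions; adjust them to the folders you created. The sketch builds keys with posixpath.join because S3 keys use forward slashes on every operating system.

```python
import posixpath


def s3_key(prefix, split, filename):
    """Build an S3 object key; S3 keys always use forward slashes."""
    return posixpath.join(prefix, split, filename)


def upload(bucket, key, local_path):
    """Upload one local file to s3://bucket/key (requires AWS credentials)."""
    import boto3  # imported here so the key helper works without boto3 installed

    boto3.resource("s3").Bucket(bucket).Object(key).upload_file(local_path)


train_key = s3_key("DataSet", "Train", "train.csv")                  # assumed key
validation_key = s3_key("DataSet", "Validation", "validation.csv")   # assumed key
print(train_key, validation_key)

# Uncomment to run against your own bucket:
# upload("aimldemobootcamp", train_key, "train.csv")
# upload("aimldemobootcamp", validation_key, "validation.csv")
```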


To confirm whether the files were uploaded successfully, run the commands below.

! aws s3 ls aimldemobootcamp/DataSet/Train --recursive
! aws s3 ls aimldemobootcamp/DataSet/Validation --recursive


  • First, we need to specify the location of our algorithm(XGBoost algorithm) containers.
import sagemaker
container = sagemaker.image_uris.retrieve("xgboost", sagemaker.Session().boto_region_name, "latest")


  • import sagemaker = Imports SageMaker SDK, which provides tools and APIs for working with SageMaker services.
  • Now we will retrieve the URI for the latest version of the XGBoost algorithm container image.
  • retrieve() = Function from sagemaker.image_uris that takes:
    • "xgboost" = The name of the algorithm (in this case, XGBoost).
    • sagemaker.Session().boto_region_name = The AWS region from which the image is retrieved.
    • "latest" = Specifies that the latest version of the XGBoost container image should be retrieved.




  • We can see the location of the XGBoost container image, which resides in our region inside Amazon Elastic Container Registry (ECR), the AWS service where container images are stored.

  • As we’re training with CSV file format, we’ll create TrainingInputs that our training function can use as a pointer to the files in S3. So, we will create the TrainingInput object "s3_input_train" for the training data.

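from sagemaker.inputs import TrainingInput
s3_input_train = TrainingInput(s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv")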


  • "s3://{}/{}/train".format(bucket, prefix) = S3 URI where the training data is located, which we copied from the S3 console.
  • content_type="csv" specifies the content type of the data, indicating that the training data is in CSV format.

In the same way, we create a TrainingInput object for the validation dataset stored in Amazon S3.

s3_input_validation = TrainingInput(

Start the Training Job

  • Now we will set up a training job to start the training process.
  • To set up a training job:
    • We need to specify a few variables, such as how many and what type of training instances we'd like to use.
    • We also need to specify hyperparameters, which we will discuss later.

To define these settings and hyperparameters, we need to construct an estimator in Amazon SageMaker, which is an object that encapsulates the details of a machine-learning model and how it should be trained or deployed.

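from sagemaker.debugger import Rule, ProfilerRule, rule_configs

xgb_model = sagemaker.estimator.Estimator(
    image_uri=container, role=role, instance_count=1, instance_type="ml.m4.xlarge",
    output_path=s3_output_location, sagemaker_session=sagemaker.Session(),
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report()), ProfilerRule.sagemaker(rule_configs.ProfilerReport())],
)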


  • image_uri=container : Specifies the Docker container image URI for the XGBoost algorithm. This URI we already retrieved using sagemaker.image_uris.retrieve in previous code.
  • role: Specifies the IAM role that SageMaker will assume to perform tasks on your behalf, such as creating instances, reading training results (called model artifacts) from Amazon S3, and writing training results to Amazon S3.
  • instance_count=1: Specifies the number of instances to use for model training. In this case, it's set to 1.
  • instance_type='ml.m4.xlarge': Specifies the type of instance to use for model training. The 'ml.m4.xlarge' instance is a general-purpose compute instance, which has 4 vCPUs, 16 GB of memory, Amazon Elastic Block Store (Amazon EBS) storage, and high network performance.
  • output_path=s3_output_location: Specifies the Amazon S3 location where the trained model artifacts and training results will be stored.
  • sagemaker_session= sagemaker.Session(): Specifies the SageMaker session to be used. The sagemaker.Session() creates a new session.
  • rules: Specifies a list of SageMaker Debugger built-in rules to be applied during training. The rules check for specific conditions, and in this case, it includes rules related to creating a report for the XGBoost model and profiling information. In this example, the create_xgboost_report() rule creates an XGBoost report that provides insights into the training progress and results, and the ProfilerReport() rule creates a report regarding the EC2 compute resource utilization. For more information, see SageMaker Debugger XGBoost Training Report.

Why we need an Estimator

  • Estimator is a high-level interface that is used to set up, configure, and run machine learning (ML) training and inference tasks.
  • Estimators simplify the process of using SageMaker by encapsulating various details and configurations.
  • Abstraction of Infrastructure Details: You don't need to build the infrastructure yourself; you declare the instance types, container images, and distributed training settings, and SageMaker provisions them.
  • Configuration of Training Jobs: Allows users to easily configure and customize training jobs by specifying parameters like the algorithm image, instance type, instance count, volume size, and output location.
    • Can define hyperparameters specific to the chosen algorithm.
  • Consistent Interface: Whether you're training a linear model, a deep learning model, or an XGBoost model, the process of setting up and running a training job remains similar.
  • Rule Configuration: Allows users to specify rules to monitor training jobs for common issues or anomalies. These rules can include checks for data validation, model quality, and resource utilization.

What is Hyperparameter in AI/ML

  • External configuration settings that are not learned from the data but are set prior to the training process.
  • These parameters influence the overall behavior of a machine learning model, impacting how the model learns and generalizes from the training data.


  • max_depth: Controls how deep each tree within the XGBoost model can grow. A deeper tree can capture more complex patterns but may lead to overfitting if set too high. 0 indicates no limit.

    • Valid values: Integer. Range: [0,∞)
    • Default value: 6
  • eta: Also known as learning rate or shrinkage parameter, controls the step size shrinkage during the boosting(optimization) process. A lower value makes the optimization process more robust but slower. After each boosting step, you can directly get the weights of new features. It affects the contribution of each tree to the final prediction. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.

    • Valid values: Float. Range: [0,1].
    • Default value: 0.3
  • gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. It is a regularization term that encourages pruning of the tree. A larger gamma value leads to more conservative algorithm.

    • Valid values: Float. Range: [0,∞).
    • Default value: 0
  • min_child_weight: Minimum sum of instance weight (hessian) needed in a child. It is another regularization term that controls the minimum number of samples (instances) required for a child node to be created. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.

    • Valid values: Float. Range: [0,∞).
    • Default value: 1
  • subsample: Subsample ratio of the training instances. It specifies the fraction of samples used to train each tree. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow each tree. A value less than 1.0 introduces randomness and helps prevent overfitting.

    • Valid values: Float. Range: [0,1].
    • Default value: 1
  • verbosity: Controls the level of detail of the logging messages and diagnostic output generated by the XGBoost algorithm during training. Setting it to 0 means no output will be printed during training.

    • 0: Silent mode. No output is printed during training.
    • 1: Minimal output. Only critical information, such as errors and warnings, is printed.
    • 2: Information mode. Prints additional information about the training progress, such as metrics for each boosting round.
    • 3: Debug mode. Prints detailed debug information, including internal state and intermediate results. This mode is useful for troubleshooting and understanding the inner workings of the algorithm.
  • objective: Specifies the learning objective, which determines the loss function to be minimized. In this case, "binary:logistic" indicates binary classification with logistic regression.

    • Valid values: String
    • Default value: "reg:squarederror"
  • num_round: Number of boosting rounds (trees) to be run in training. It specifies the number of iterations of boosting to be performed or trees to build.

    • Valid values: Integer.
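
The ranges listed above can be turned into a quick sanity check before a training job is launched. The sketch below is illustrative: the values in the dictionary are examples only and are not taken from this tutorial's training job, and check_hyperparameters is a hypothetical helper.

```python
# Illustrative values only; they are not this tutorial's actual settings.
hyperparameters = {
    "max_depth": 5,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "subsample": 0.8,
    "verbosity": 0,
    "objective": "binary:logistic",
    "num_round": 100,
}

# (lower bound, upper bound) for the numeric ranges listed above; None means unbounded.
RANGES = {
    "max_depth": (0, None),
    "eta": (0, 1),
    "gamma": (0, None),
    "min_child_weight": (0, None),
    "subsample": (0, 1),
    "verbosity": (0, 3),
}


def check_hyperparameters(params):
    """Raise ValueError if a numeric hyperparameter is outside its listed range."""
    for name, (low, high) in RANGES.items():
        if name not in params:
            continue
        value = params[name]
        if value < low or (high is not None and value > high):
            upper = "inf" if high is None else high
            raise ValueError(f"{name}={value} is outside [{low}, {upper}]")
    return params


check_hyperparameters(hyperparameters)
```

A dictionary like this can be given to a SageMaker estimator through its hyperparameters argument or its set_hyperparameters method.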


hyperparameters = {
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          volume_size=5, # 5 GB 


# Define the SageMaker Estimator
from sagemaker.estimator import Estimator

estimator = Estimator(image_uri=container,
                      instance_type='ml.m4.xlarge',  # Specify the type of training instance
                      instance_count=1,             # Specify the number of instances
                      role='YOUR_ROLE_ARN',         # Specify the IAM role for SageMaker
                      hyperparameters={'max_depth': 5,  # Set the max_depth hyperparameter
                                       # Add other hyperparameters as needed
                                       })

  • So far we have only set up the configuration. To start the training process for the XGBoost model, use the code below.
xgb_model.fit({"train": s3_input_train, "validation": s3_input_validation})


  • fit() = Fits the model to the provided training data, optimizing the specified objective function with the configured hyperparameters. Inside it we need to specify the input data channels for training and validation. The keys ("train" and "validation") correspond to the names of the data channels, while the values (s3_input_train and s3_input_validation) are the TrainingInput objects representing the training and validation datasets stored in Amazon S3.


Create Deployable Model Endpoint

  • First, we will import CSVSerializer class from the SageMaker SDK, which will help us to serialize input data in CSV format. Also, we will create an object "xgb_predictor" as a SageMaker Predictor, representing deployed model endpoint. We can use this object to make predictions on new data.
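from sagemaker.serializers import CSVSerializer
xgb_predictor = xgb_model.deploy(initial_instance_count=1, instance_type='ml.t2.medium', serializer=CSVSerializer())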


  • xgb_model.deploy: Method used to deploy trained XGBoost model as an endpoint on SageMaker for making predictions.

  • initial_instance_count=1: Specifies initial number of instances (containers) to deploy. In our case, we set it to 1.

  • instance_type='ml.t2.medium': Specifies type of compute instance to use for hosting the model.

  • serializer=CSVSerializer(): Specifies serializer to be used for input data of various formats (a NumPy array, list, file, or buffer). The CSVSerializer is chosen, indicating that input data should be formatted as CSV.

  • The deploy method creates a deployable model, configures the SageMaker hosting services endpoint, and launches the endpoint to host the model.

Retrieve Endpoint Name

  • xgb_predictor.endpoint_name returns the endpoint name of the xgb_predictor. The format of the endpoint name is "sagemaker-xgboost-YYYY-MM-DD-HH-MM-SS-SSS".


Now that our model is deployed, we will evaluate it. We create a function named predict() that sends the test data, held as a NumPy array, to the deployed model endpoint in properly formatted batches and collects the predictions made by the XGBoost model.

def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = ",".join([predictions, xgb_predictor.predict(array).decode("utf-8")])

    return np.fromstring(predictions[1:], sep=",")
  • predict() takes two parameters:
    • data: The input data for which predictions are to be made. It's expected to be a NumPy array.
    • rows: The number of rows to process at a time. It defaults to 500 rows but can be adjusted based on memory constraints.
  • np.array_split = Splits input data (data) into smaller arrays, with each array containing a maximum of "rows" rows.
  • for array in split_array = Loop iterates over each split array of data (array) and makes predictions using the XGBoost predictor (xgb_predictor.predict()). Predictions are obtained as byte strings, so decode("utf-8") is used to convert them to strings. Predictions from each iteration are concatenated with a comma separator.
  • return np.fromstring(predictions[1:], sep=",") = converts the concatenated predictions string into a NumPy array using np.fromstring(). [1:] is used to exclude the leading comma from the concatenated predictions.
predictions = predict(test_data.to_numpy()[:, 1:])
  • predict() = Calls the predict() function to make predictions on test data (test_data).
  • test_data.to_numpy()[:, 1:] = test data is converted to a NumPy array using to_numpy(), and only the features (not the labels) are passed to the function ([:, 1:] slices all rows and all columns starting from the second column).
  • predictions = Variable will contain the predictions made by the XGBoost model for the test data. These predictions can then be evaluated against the actual labels to assess the model's performance.
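
The batching logic can be exercised without a live endpoint. In the sketch below, fake_endpoint is a hypothetical stand-in that returns one constant score per row as CSV text, and the result is parsed with str.split and np.array rather than np.fromstring; both are substitutions made for illustration.

```python
import numpy as np


def predict_in_batches(data, predict_one_batch, rows=500):
    """Split data into chunks, call predict_one_batch on each, and join the CSV replies."""
    n_chunks = int(data.shape[0] / float(rows) + 1)
    replies = [predict_one_batch(chunk) for chunk in np.array_split(data, n_chunks)]
    return np.array(",".join(replies).split(","), dtype=float)


def fake_endpoint(chunk):
    """Stand-in for the endpoint: returns one constant score per row as CSV text."""
    return ",".join("0.5" for _ in range(chunk.shape[0]))


features = np.zeros((1200, 3))  # made-up feature matrix
chunk_sizes = [len(c) for c in np.array_split(features, int(1200 / 500.0 + 1))]
scores = predict_in_batches(features, fake_endpoint)
print(chunk_sizes, scores.shape)
```

With 1200 rows and rows=500, the formula gives three chunks, and np.array_split makes them equal in size instead of filling the first two to 500.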

What is Confusion Matrix?

  • Provides a summary of the model's performance on a set of test data by showing the number of true positives, true negatives, false positives, and false negatives.
  • It allows you to evaluate the model's ability to correctly classify instances into their respective classes.
                     | Actual: Churn        | Actual: Not Churn
Predicted: Churn     | True Positives (TP)  | False Positives (FP)
Predicted: Not Churn | False Negatives (FN) | True Negatives (TN)


  • True positives (TP): the model correctly predicts a positive data point (predicted positive, actually positive).
  • True negatives (TN): the model correctly predicts a negative data point (predicted negative, actually negative).
  • False positives (FP): the model predicts positive for a data point that is actually negative.
  • False negatives (FN): the model predicts negative for a data point that is actually positive.
confusion_matrix = pd.crosstab(
    index=test_data.iloc[:, 0],       # Actual labels (index)
    columns=np.round(predictions),    # Predicted labels (columns)
    rownames=["actual"],               # Label for the index axis
    colnames=["predictions"]          # Label for the columns axis
)
print(confusion_matrix)


  • index=test_data.iloc[:, 0]: Specifies the actual labels (ground truth) from the test data. test_data.iloc[:, 0] selects all rows of the first column of the test_data DataFrame, assuming that the first column of test_data contains the actual labels.

  • columns=np.round(predictions): Specifies the predicted labels for the columns of confusion matrix rounded to the nearest integer (assuming the predictions are probabilities). np.round(predictions) rounds each prediction to the nearest integer (0 or 1).

  • rownames=["actual"]: Assigns the name "actual" to the rows of the confusion matrix.

  • colnames=["predictions"]: Assigns the name "predictions" to the columns of the confusion matrix.

  • print(confusion_matrix): Displays the generated confusion matrix.

Of all the churners, we've correctly predicted most of them (true positives). We also incorrectly predicted that some customers would churn who then ended up not doing so (false positives). There are also some customers who ended up churning whom we predicted would not (false negatives).
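
The four cells, and two summary figures derived from them, can be computed by hand on a toy example. The labels and scores below are made up, and the 0.5 cutoff is an assumption.

```python
# Made-up labels and predicted probabilities for eight customers.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.4, 0.7]

predicted = [1 if s > 0.5 else 0 for s in scores]  # the 0.5 cutoff is an assumption

pairs = list(zip(actual, predicted))
tp = pairs.count((1, 1))  # churners predicted to churn
fn = pairs.count((1, 0))  # churners the model missed
fp = pairs.count((0, 1))  # non-churners flagged as churners
tn = pairs.count((0, 0))  # non-churners predicted to stay

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners the model caught
print(tp, fn, fp, tn, precision, recall)
```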

Congratulations! You have successfully completed this AI/ML project. Keep learning continuously and stay up to date in AI/ML. Make sure you subscribe to our YouTube channel. It's free for you, but it helps me a lot!