AI PROJECT-1.5 Upload the Dataset to Amazon S3

Why are we uploading the dataset to Amazon S3?

  • The datasets stored in the S3 bucket will be consumed by a compute-optimized SageMaker instance (running on Amazon EC2) during model training.

STEP-1: Import necessary libraries

We will upload the training and validation datasets to the default Amazon S3 bucket using the SageMaker Python SDK and Boto3.

import os         # used to build S3 key paths
import boto3      # AWS SDK for Python
import sagemaker  # SageMaker Python SDK
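
Optionally, you can confirm which library versions your notebook kernel is running. This is just a sanity check, not a required step:

# Optional: print the versions available in the notebook kernel.
print("sagemaker:", sagemaker.__version__)
print("boto3:", boto3.__version__)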

STEP-2: Get the Default S3 Bucket Associated with the SageMaker Session

The code below obtains the name of the default S3 bucket associated with the SageMaker session (where the dataset will reside) and stores it in the variable "bucket", which can later be used to store and retrieve data in the context of SageMaker workflows.

bucket = sagemaker.Session().default_bucket()

where

  • sagemaker = the SageMaker Python SDK imported above.
  • sagemaker.Session(): Creates an instance of the "sagemaker.Session" class, which provides functionality for managing SageMaker sessions.
  • .default_bucket(): Called on the Session instance, this method retrieves the name of the default S3 bucket associated with the SageMaker session. In Amazon SageMaker, S3 buckets are commonly used for storing input and output data as well as model artifacts.
  • bucket = the variable to which the name of the default S3 bucket is assigned.
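
As a quick sanity check, you can print the resolved bucket name and the session's region. The snippet below is a minimal sketch; boto_region_name is a standard attribute of "sagemaker.Session":

sess = sagemaker.Session()
bucket = sess.default_bucket()  # typically named "sagemaker-<region>-<account-id>"
print("Default bucket:", bucket)
print("Region:", sess.boto_region_name)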

STEP-3: Upload the Training and Validation Datasets to Amazon S3

  • Now, we will create a new "demo-sagemaker-xgboost-adult-income-prediction" key prefix (S3 has no real folders; the prefix behaves like one) and upload the training and validation datasets to its data subfolder using the AWS SDK for Python (Boto3).
prefix = "demo-sagemaker-xgboost-adult-income-prediction"

# Upload the local CSV files to s3://<bucket>/<prefix>/data/
boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'data/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'data/validation.csv')).upload_file('validation.csv')

where

  • prefix = the variable holding the new folder (key prefix) name "demo-sagemaker-xgboost-adult-income-prediction".
  • boto3.Session().resource('s3') = creates a Boto3 S3 resource object, which lets us interact with Amazon S3.
  • .Bucket(bucket) = specifies the S3 bucket using the "bucket" variable obtained from the default SageMaker session.
  • .Object(os.path.join(prefix, 'data/train.csv')) = specifies the object key within the S3 bucket where the first file, 'train.csv', will be uploaded.
  • os.path.join() = concatenates the prefix and the file path into a single S3 key.
  • .upload_file('train.csv') = uploads the local file 'train.csv' to the specified S3 object.

The same procedure is repeated for the "validation.csv" file.
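
As an aside, the SageMaker Python SDK can perform the same upload in a single call with Session.upload_data(), which returns the resulting S3 URI. The snippet below is an optional alternative sketch, not part of the original steps:

sess = sagemaker.Session()
# upload_data() uploads a local file and returns its S3 URI,
# e.g. s3://<bucket>/<prefix>/data/train.csv
train_uri = sess.upload_data('train.csv', bucket=bucket, key_prefix=prefix + '/data')
validation_uri = sess.upload_data('validation.csv', bucket=bucket, key_prefix=prefix + '/data')
print(train_uri, validation_uri)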

STEP-4: Check Whether the CSV Files Were Successfully Uploaded to Amazon S3

  • Using the AWS CLI to check (the "!" prefix runs a shell command from a notebook cell, and IPython expands {bucket} and {prefix} from the Python variables):
! aws s3 ls {bucket}/{prefix}/data --recursive
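
If you prefer to verify from Python rather than the CLI, here is a minimal Boto3 sketch (assuming the "bucket" and "prefix" variables from the steps above):

s3 = boto3.client('s3')
# List every object under the data/ prefix and print its key and size in bytes.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix + '/data')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])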
