How to Download DataSet in Jupyter or JupyterLab?

STEP-0: (ONLY IF YOU ARE NOT USING SAGEMAKER NOTEBOOK INSTANCES, For SageMaker Notebook Instances user, proceed with Step-2)

As we already using SageMaker Notebook Instance, So we don't need to install conda and other library package.
But if we are working on a system which doesn't have Conda installed, you can install conda by Downloading from Here.
Then to download datset, we require SHAP(SHapley Additive exPlanations) library, a popular tool for interpreting the output of machine learning models, which we can install using Conda using below code:

%conda install -c conda-forge shap

If you're using JupyterLab, you have to manually refresh Jupyter kernel after the installation and updates have completed. Run the following IPython script to shut down the kernel (the kernel will restart automatically):

import IPython
IPython.Application.instance().kernel.do_shutdown(True)

STEP-1: Update Conda Version

First, we will check current conda version using:

conda -V

First we will update conda using below code:

conda update -n base -c conda-forge conda

Then, again we will check current(updated) conda version using:

conda -V

STEP-2: Import SHAP Library

First we need to import the SHAP(SHapley Additive exPlanations) library using below commands.

import shap #Import the SHAP library, a popular tool for interpreting the output of machine learning models.

STEP-3: Load DataSet

Then, we have to load the Adult Census dataset (available here), also known as "Census Income" dataset to your notebook instance.

import shap # Import the SHAP library, a popular tool for interpreting the output of machine learning models.
X = shap.datasets.adult() #Load the Adult Census dataset using shap library
X  #Displays the value of X

Output:

(        Age  Workclass  Education-Num  Marital Status  Occupation  \
 0      39.0          7           13.0               4           1   
 1      50.0          6           13.0               2           4   
 2      38.0          4            9.0               0           6   
 3      53.0          4            7.0               2           6   
 4      28.0          4           13.0               2          10   
 ...     ...        ...            ...             ...         ...   
 32556  27.0          4           12.0               2          13   
 32557  40.0          4            9.0               2           7   
 32558  58.0          4            9.0               6           1   
 32559  22.0          4            9.0               4           1   
 32560  52.0          5            9.0               2           4   
 
        Relationship  Race  Sex  Capital Gain  Capital Loss  Hours per week  \
 0                 0     4    1        2174.0           0.0            40.0   
 1                 4     4    1           0.0           0.0            13.0   
 2                 0     4    1           0.0           0.0            40.0   
 3                 4     2    1           0.0           0.0            40.0   
 4                 5     2    0           0.0           0.0            40.0   
 ...             ...   ...  ...           ...           ...             ...   
 32556             5     4    0           0.0           0.0            38.0   
 32557             4     4    1           0.0           0.0            40.0   
 32558             1     4    0           0.0           0.0            40.0   
 32559             3     4    1           0.0           0.0            20.0   
 32560             5     4    0       15024.0           0.0            40.0   
 
        Country  
 0           39  
 1           39  
 2           39  
 3           39  
 4            5  
 ...        ...  
 32556       39  
 32557       39  
 32558       39  
 32559       39  
 32560       39  
 
 [32561 rows x 12 columns],
 array([False, False, False, ..., False, False,  True]))

Jupyter Notebooks retain memory throughout the notebook.So, if I put X in the next cell like below, it will show the value of X.

Output:

(        Age  Workclass  Education-Num  Marital Status  Occupation  \
 0      39.0          7           13.0               4           1   
 1      50.0          6           13.0               2           4   
 2      38.0          4            9.0               0           6   
 3      53.0          4            7.0               2           6   
 4      28.0          4           13.0               2          10   
 ...     ...        ...            ...             ...         ...   
 32556  27.0          4           12.0               2          13   
 32557  40.0          4            9.0               2           7   
 32558  58.0          4            9.0               6           1   
 32559  22.0          4            9.0               4           1   
 32560  52.0          5            9.0               2           4   
 
        Relationship  Race  Sex  Capital Gain  Capital Loss  Hours per week  \
 0                 0     4    1        2174.0           0.0            40.0   
 1                 4     4    1           0.0           0.0            13.0   
 2                 0     4    1           0.0           0.0            40.0   
 3                 4     2    1           0.0           0.0            40.0   
 4                 5     2    0           0.0           0.0            40.0   
 ...             ...   ...  ...           ...           ...             ...   
 32556             5     4    0           0.0           0.0            38.0   
 32557             4     4    1           0.0           0.0            40.0   
 32558             1     4    0           0.0           0.0            40.0   
 32559             3     4    1           0.0           0.0            20.0   
 32560             5     4    0       15024.0           0.0            40.0   
 
        Country  
 0           39  
 1           39  
 2           39  
 3           39  
 4            5  
 ...        ...  
 32556       39  
 32557       39  
 32558       39  
 32559       39  
 32560       39  
 
 [32561 rows x 12 columns],
 array([False, False, False, ..., False, False,  True]))

STEP-4: Disaplaying Features and Labels

We can see that X displays features(Age, Occupation, etc.) and labels(If Person's income exceeds $50K/yr, in form of 'True' or 'False').
So, we need to seperate them using below code.

X, Y = shap.datasets.adult() #Load the Adult Census dataset using shap library, where features are assigned to X and labels to Y
X 
Y

Output:

array([False, False, False, ..., False, False,  True])

As we can see from above code, only results of Y is displayed.
But if you load individual value of X, it will show output.

Output:


Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
0	39.0	7	13.0	4	1	0	4	1	2174.0	0.0	40.0	39
1	50.0	6	13.0	2	4	4	4	1	0.0	0.0	13.0	39
2	38.0	4	9.0	0	6	0	4	1	0.0	0.0	40.0	39
3	53.0	4	7.0	2	6	4	2	1	0.0	0.0	40.0	39
4	28.0	4	13.0	2	10	5	2	0	0.0	0.0	40.0	5
...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27.0	4	12.0	2	13	5	4	0	0.0	0.0	38.0	39
32557	40.0	4	9.0	2	7	4	4	1	0.0	0.0	40.0	39
32558	58.0	4	9.0	6	1	1	4	0	0.0	0.0	40.0	39
32559	22.0	4	9.0	4	1	3	4	1	0.0	0.0	20.0	39
32560	52.0	5	9.0	2	4	5	4	0	15024.0	0.0	40.0	39
32561 rows × 12 columns

And value of Y also will be displayed.

Output:

array([False, False, False, ..., False, False,  True])

This tells us that Cells only displays result of last variable.
And, to display results of results of variables other than last, we have to modify code as below:

display(X) #Displays the value of X
Y #Displays the value of Y

Output:

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
0	39.0	7	13.0	4	1	0	4	1	2174.0	0.0	40.0	39
1	50.0	6	13.0	2	4	4	4	1	0.0	0.0	13.0	39
2	38.0	4	9.0	0	6	0	4	1	0.0	0.0	40.0	39
3	53.0	4	7.0	2	6	4	2	1	0.0	0.0	40.0	39
4	28.0	4	13.0	2	10	5	2	0	0.0	0.0	40.0	5
...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27.0	4	12.0	2	13	5	4	0	0.0	0.0	38.0	39
32557	40.0	4	9.0	2	7	4	4	1	0.0	0.0	40.0	39
32558	58.0	4	9.0	6	1	1	4	0	0.0	0.0	40.0	39
32559	22.0	4	9.0	4	1	3	4	1	0.0	0.0	20.0	39
32560	52.0	5	9.0	2	4	5	4	0	15024.0	0.0	40.0	39
32561 rows × 12 columns

array([False, False, False, ..., False, False,  True])

STEP-5: Performing Some Operations with Loaded Dataset

We can also perform certain functions here, just like if I want to display only features:

features = list(X.columns) #List all features,which is in Header Column of X
features #Display all features

Output:

['Age',
 'Workclass',
 'Education-Num',
 'Marital Status',
 'Occupation',
 'Relationship',
 'Race',
 'Sex',
 'Capital Gain',
 'Capital Loss',
 'Hours per week',
 'Country']

Load a version of the dataset specifically designed for display purposes.

X_display, Y_display = shap.datasets.adult(display=True) #Load Display version of Dataset
display(X_display) #Show Features of Display version of Dataset
Y_display #Show Labels of Display version of Dataset

Output:

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
0	39.0	State-gov	13.0	Never-married	Adm-clerical	Not-in-family	White	Male	2174.0	0.0	40.0	United-States
1	50.0	Self-emp-not-inc	13.0	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.0	0.0	13.0	United-States
2	38.0	Private	9.0	Divorced	Handlers-cleaners	Not-in-family	White	Male	0.0	0.0	40.0	United-States
3	53.0	Private	7.0	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0.0	0.0	40.0	United-States
4	28.0	Private	13.0	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0.0	0.0	40.0	Cuba
...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27.0	Private	12.0	Married-civ-spouse	Tech-support	Wife	White	Female	0.0	0.0	38.0	United-States
32557	40.0	Private	9.0	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0.0	0.0	40.0	United-States
32558	58.0	Private	9.0	Widowed	Adm-clerical	Unmarried	White	Female	0.0	0.0	40.0	United-States
32559	22.0	Private	9.0	Never-married	Adm-clerical	Own-child	White	Male	0.0	0.0	20.0	United-States
32560	52.0	Self-emp-inc	9.0	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024.0	0.0	40.0	United-States
32561 rows × 12 columns

array([False, False, False, ..., False, False,  True])

You can see in Display version of dataset; Country, Occupation, Relationship etc. are not displayed in numeric form, instead it is displayed in Text form.
That's why display form of dataset is used for Human Understanding purposes.
But for models, numeric values is easier to understand or learn. That's why for overview purpose, we will use simple version, which has numeric features.

STEP-6: Displaying statistical overview of the dataset

To display statistical overview of the dataset, we require numeric features, which are available through normal version, not with display version.

display(X.describe()) #Display statistical overview of the dataset

Above command will display general statistical calculations like count, mean, min, max , etc.
Output is shown below:


Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
count	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000	32561.000000
mean	38.581646	3.868892	10.080679	2.611836	6.572740	2.494518	3.665858	0.669205	1077.648804	87.303833	40.437454	36.718866
std	13.640442	1.455960	2.572562	1.506222	4.228857	1.758232	0.848806	0.470506	7385.911621	403.014771	12.347933	7.823782
min	17.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000
25%	28.000000	4.000000	9.000000	2.000000	3.000000	0.000000	4.000000	0.000000	0.000000	0.000000	40.000000	39.000000
50%	37.000000	4.000000	10.000000	2.000000	7.000000	3.000000	4.000000	1.000000	0.000000	0.000000	40.000000	39.000000
75%	48.000000	4.000000	12.000000	4.000000	10.000000	4.000000	4.000000	1.000000	0.000000	0.000000	45.000000	39.000000
max	90.000000	8.000000	16.000000	6.000000	14.000000	5.000000	4.000000	1.000000	99999.000000	4356.000000	99.000000	41.000000

STEP-7: Displaying histograms of the numeric features of the dataset

What is Histogram?

A histogram is a representation of the distribution of data.

hist = X.hist()

Output:

But as we can see it is not very well aligned and presentable.

So, we have to add some parameters like bin, figsize and sharey. Let's discuss them one by one.

What is figsize in Histogram?

Sets the size of the overall figure in inches.
The figure size is specified as a tuple with width and height.

hist = X.hist(figsize=(10, 10))

In this case, the width is set to 10 inches, and the height is set to 10 inches. But, still there is some space left in width, So, we can adjust accordingly.
Let me try with 20 inch width.

hist = X.hist(figsize=(20, 10))

This looks good. But, still I see bins are very less.

What is Bin In Histogram?

Refers to an interval or range into which the data is divided.
The data points are then counted or aggregated within each bin to create a visual representation of the distribution of the data.
Default Value: 10
I tried with multiple values, I found it, 30 seems to be gives a good histogram.

hist = X.hist(figsize=(20, 10), bins=30)

Output:

Important Note

Choosing an appropriate number of bins is important in creating a meaningful and informative histogram.
Too few bins may oversimplify the distribution, while too many bins may introduce noise and make it harder to identify patterns.
The choice of bin size can impact the visual representation of the data, so it's often a good idea to experiment with different bin sizes to find the one that best reveals the underlying distribution.

What is sharey or sharex In Histogram?

Specifies whether the y-axis or x-axis should be shared among histogram.
When set to True, it means that all histograms will share the same y-axis or x-axis, making it easier to compare the distributions across different features.
In our case, sharey=True will make our histogram more presentable, as main goal with histogram is correct and presentable representation of the distribution of data

hist = X.hist(figsize=(20, 10), bins=30, sharey=True)

Output:

NEXT STEP OF AI PROJECT:

Split the Dataset into Train, Validation, and Test Datasets