AI PROJECT-1.3 Split the Dataset into Multiple Datasets

Splitting the Dataset into Multiple Datasets

  • We split the dataset into three datasets:
    • a training dataset, used to train the model;
    • a test dataset, used to evaluate the performance of the final trained model; and
    • a validation dataset, used during the training phase to tune the model's hyperparameters and help prevent overfitting.

STEP-1 : Importing Necessary Library

from sklearn.model_selection import train_test_split

where

  • sklearn is the import name of scikit-learn, a machine learning library for predictive data analysis.
  • train_test_split is a function from scikit-learn, which is commonly used for splitting datasets into training and testing sets.
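As a minimal sketch of what train_test_split does, the example below splits a small made-up array. The names X_toy and y_toy are hypothetical and not part of this project. When no test_size is given, scikit-learn holds out 25% of the samples by default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus binary labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# No test_size is passed, so the default 25% hold-out applies (rounded up to 3 rows).
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)
```

Features and labels are shuffled together, so each row of X_tr still lines up with the matching entry of y_tr.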

STEP-2 : Splitting Train and Test Dataset

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
display(X)
display(X_train)
display(X_test)
display(Y)
display(Y_train)
Y_test

where

  • X and Y are the original feature matrix and label vector.
  • test_size=0.2 specifies that 20% of the data should be used for testing, and the remaining 80% for training.
  • random_state=1 fixes the seed of the random number generator used to shuffle the data, so the same rows land in the training and test sets every time the code runs. This makes the results reproducible.

Output:

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
0	39.0	7	13.0	4	1	0	4	1	2174.0	0.0	40.0	39
1	50.0	6	13.0	2	4	4	4	1	0.0	0.0	13.0	39
2	38.0	4	9.0	0	6	0	4	1	0.0	0.0	40.0	39
3	53.0	4	7.0	2	6	4	2	1	0.0	0.0	40.0	39
4	28.0	4	13.0	2	10	5	2	0	0.0	0.0	40.0	5
...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27.0	4	12.0	2	13	5	4	0	0.0	0.0	38.0	39
32557	40.0	4	9.0	2	7	4	4	1	0.0	0.0	40.0	39
32558	58.0	4	9.0	6	1	1	4	0	0.0	0.0	40.0	39
32559	22.0	4	9.0	4	1	3	4	1	0.0	0.0	20.0	39
32560	52.0	5	9.0	2	4	5	4	0	15024.0	0.0	40.0	39
32561 rows × 12 columns

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
16465	39.0	6	7.0	2	14	4	4	1	0.0	0.0	40.0	39
5625	54.0	6	13.0	2	4	4	4	1	0.0	0.0	40.0	39
30273	32.0	4	9.0	2	12	4	4	1	0.0	1902.0	50.0	39
3136	45.0	6	10.0	4	5	0	4	1	0.0	0.0	50.0	39
4521	60.0	4	6.0	2	6	4	2	1	0.0	0.0	40.0	39
...	...	...	...	...	...	...	...	...	...	...	...	...
32511	25.0	2	13.0	4	1	3	2	0	0.0	0.0	40.0	39
5192	32.0	4	13.0	2	4	4	4	1	15024.0	0.0	45.0	39
12172	27.0	4	13.0	4	7	0	1	1	0.0	0.0	40.0	0
235	59.0	7	9.0	2	8	4	4	1	0.0	0.0	40.0	39
29733	33.0	4	13.0	2	1	4	4	1	0.0	1902.0	45.0	39
26048 rows × 12 columns

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
9646	62.0	6	4.0	6	8	0	4	0	0.0	0.0	66.0	39
709	18.0	4	7.0	4	8	2	4	1	0.0	0.0	25.0	39
7385	25.0	4	13.0	4	5	3	4	1	27828.0	0.0	50.0	39
16671	33.0	4	9.0	2	10	4	4	1	0.0	0.0	40.0	39
21932	36.0	4	7.0	4	7	1	4	0	0.0	0.0	40.0	39
...	...	...	...	...	...	...	...	...	...	...	...	...
5889	39.0	4	13.0	2	10	5	4	0	0.0	0.0	20.0	39
25723	17.0	4	6.0	4	12	3	4	0	0.0	0.0	20.0	39
29514	35.0	4	9.0	4	14	3	4	1	0.0	0.0	40.0	39
1600	30.0	4	7.0	2	3	4	4	1	0.0	0.0	45.0	39
639	52.0	6	16.0	2	10	4	4	1	0.0	0.0	60.0	39
6513 rows × 12 columns

array([False, False, False, ..., False, False,  True])
array([False,  True,  True, ..., False, False,  True])

The function returns four sets:

  • X_train: The feature matrix for the training set.
  • X_test: The feature matrix for the testing set.
  • Y_train: The labels for the training set.
  • Y_test: The labels for the testing set.
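The effect of test_size and random_state described in this step can be checked on a small example. This is only a sketch on made-up data (X_toy and y_toy are hypothetical names): it splits the same arrays twice with the same seed and confirms that both runs produce the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 samples with 2 features each.
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50) % 2

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

# Same seed, same rows in each part.
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)

print(len(a_train), len(a_test), same_split)  # 40 10 True
```

Changing random_state to another value would generally select different rows, while the 80/20 proportions stay the same.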

STEP-3 : Further Splitting the Training Set into Training and Validation Datasets

X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.25, random_state=1)
display(X_train)
display(X_val)
display(Y_train)
Y_val

The function returns four sets:

  • X_train: Feature matrix for the updated training set.
  • X_val: Feature matrix for the validation set.
  • Y_train: Labels for the updated training set.
  • Y_val: Labels for the validation set.

Output:

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
13825	54.0	6	6.0	2	3	4	4	1	0.0	0.0	36.0	39
2843	41.0	2	10.0	2	8	4	4	1	0.0	1485.0	40.0	39
3112	24.0	4	9.0	4	1	3	4	1	0.0	0.0	40.0	39
10886	33.0	4	12.0	0	7	0	4	0	0.0	0.0	42.0	39
12148	33.0	4	9.0	2	1	5	4	0	0.0	1887.0	20.0	39
...	...	...	...	...	...	...	...	...	...	...	...	...
245	56.0	4	9.0	2	1	4	4	1	0.0	0.0	35.0	0
10156	28.0	4	9.0	4	6	3	4	1	0.0	0.0	40.0	39
21991	35.0	4	9.0	2	6	4	4	1	0.0	0.0	40.0	26
342	36.0	7	9.0	2	11	4	4	1	7298.0	0.0	40.0	39
25283	56.0	4	10.0	2	4	4	4	1	0.0	0.0	40.0	39
14652 rows × 12 columns

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
22308	24.0	4	10.0	4	6	3	4	1	0.0	0.0	40.0	39
8499	66.0	0	10.0	2	0	4	4	1	0.0	0.0	40.0	39
27309	38.0	4	8.0	4	7	0	2	0	0.0	0.0	50.0	39
18937	21.0	4	8.0	4	6	3	4	1	0.0	0.0	32.0	39
30262	30.0	4	10.0	4	4	0	4	1	0.0	0.0	52.0	39
...	...	...	...	...	...	...	...	...	...	...	...	...
21639	33.0	4	11.0	2	1	5	2	0	0.0	0.0	40.0	39
28968	29.0	4	4.0	0	3	1	4	0	0.0	0.0	55.0	39
21714	28.0	4	5.0	4	8	2	4	1	0.0	0.0	52.0	39
12412	39.0	2	8.0	2	14	4	4	1	0.0	1848.0	40.0	27
11419	39.0	7	13.0	2	10	4	4	1	0.0	0.0	45.0	39
4884 rows × 12 columns

array([False,  True, False, ..., False,  True,  True])
array([False, False, False, ..., False,  True,  True])
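A note on the row counts: taking 25% of the 26048 training rows from Step 2 should give 6512 validation rows and 19536 training rows, which together with the 6513 test rows is a 60/20/20 split of the 32561 originals. The counts shown above (14652 and 4884) add up to 19536 instead, which is consistent with the Step 3 cell having been executed a second time: because the cell overwrites X_train and Y_train, every re-run splits the already reduced training set again. The sketch below uses dummy arrays of the same length and distinct variable names (X_train_full and Y_train_full are hypothetical names) so that re-running the second split does not shrink the data further.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins with the same number of rows as the dataset in this article.
X_dummy = np.zeros((32561, 1))
Y_dummy = np.zeros(32561, dtype=bool)

# First split: 80% (train + validation) versus 20% test.
X_train_full, X_test, Y_train_full, Y_test = train_test_split(
    X_dummy, Y_dummy, test_size=0.2, random_state=1
)

# Second split reads from X_train_full and never overwrites it,
# so running this statement again gives the same sizes.
X_train, X_val, Y_train, Y_val = train_test_split(
    X_train_full, Y_train_full, test_size=0.25, random_state=1
)

print(len(X_train), len(X_val), len(X_test))  # 19536 6512 6513
```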

STEP-4 : Aligning Dataset for Display Purposes

  • First, we fetch the index of the training feature matrix so that we can use it to look up the corresponding rows of the display data.
display(X_train.index)

Output:

Index([13825,  2843,  3112, 10886, 12148,  5773, 26924,  7925, 18589, 25383,
       ...
       20387, 32381, 27533, 11772, 31925,   245, 10156, 21991,   342, 25283],
      dtype='int64', length=14652)

where

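  • X_train.index returns the row labels of X_train (the leftmost, unnamed column in the tables above), not its first row. The split keeps each row's original label, so these labels identify the same rows in any dataset that shares the original index.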

  • Next, we create a subset of the display dataset (X_display) by selecting only the rows whose index labels are present in the training set (X_train).

  • This keeps the display dataset aligned, row for row, with the training data for visualization purposes.

  • To do this we use loc, the pandas indexer for label-based selection (the name is short for "location"). It accesses a group of rows and columns by label or by a boolean array, so passing X_train.index returns exactly those rows, in the same order.

  • We do the same for the test and validation datasets.
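The label-based lookup described above can be tried on a small example first. This is a sketch on made-up data: X_encoded and X_readable are hypothetical stand-ins for an encoded feature matrix and its human-readable counterpart, which share the same index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded features and a readable version with the same index.
X_encoded = pd.DataFrame({"Sex": [1, 0, 1, 0, 0, 1, 1, 0]})
X_readable = pd.DataFrame({"Sex": X_encoded["Sex"].map({1: "Male", 0: "Female"})})

# The split shuffles the rows but keeps their original index labels.
X_part_train, X_part_test = train_test_split(X_encoded, test_size=0.25, random_state=1)

# .loc selects rows by label, in the order the labels are given.
train_readable = X_readable.loc[X_part_train.index]

print(X_part_train.index.tolist())
print(train_readable)
```

Because the selection is by label rather than by position, train_readable has the same rows in the same order as X_part_train.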

X_train_display = X_display.loc[X_train.index]
X_test_display = X_display.loc[X_test.index]
X_val_display = X_display.loc[X_val.index]
display(X_train_display)
display(X_test_display)
X_val_display

Output:

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
13825	54.0	Self-emp-not-inc	6.0	Married-civ-spouse	Craft-repair	Husband	White	Male	0.0	0.0	36.0	United-States
2843	41.0	Local-gov	10.0	Married-civ-spouse	Other-service	Husband	White	Male	0.0	1485.0	40.0	United-States
3112	24.0	Private	9.0	Never-married	Adm-clerical	Own-child	White	Male	0.0	0.0	40.0	United-States
10886	33.0	Private	12.0	Divorced	Machine-op-inspct	Not-in-family	White	Female	0.0	0.0	42.0	United-States
12148	33.0	Private	9.0	Married-civ-spouse	Adm-clerical	Wife	White	Female	0.0	1887.0	20.0	United-States
...	...	...	...	...	...	...	...	...	...	...	...	...
245	56.0	Private	9.0	Married-civ-spouse	Adm-clerical	Husband	White	Male	0.0	0.0	35.0	?
10156	28.0	Private	9.0	Never-married	Handlers-cleaners	Own-child	White	Male	0.0	0.0	40.0	United-States
21991	35.0	Private	9.0	Married-civ-spouse	Handlers-cleaners	Husband	White	Male	0.0	0.0	40.0	Mexico
342	36.0	State-gov	9.0	Married-civ-spouse	Protective-serv	Husband	White	Male	7298.0	0.0	40.0	United-States
25283	56.0	Private	10.0	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.0	0.0	40.0	United-States
14652 rows × 12 columns

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
9646	62.0	Self-emp-not-inc	4.0	Widowed	Other-service	Not-in-family	White	Female	0.0	0.0	66.0	United-States
709	18.0	Private	7.0	Never-married	Other-service	Other-relative	White	Male	0.0	0.0	25.0	United-States
7385	25.0	Private	13.0	Never-married	Farming-fishing	Own-child	White	Male	27828.0	0.0	50.0	United-States
16671	33.0	Private	9.0	Married-civ-spouse	Prof-specialty	Husband	White	Male	0.0	0.0	40.0	United-States
21932	36.0	Private	7.0	Never-married	Machine-op-inspct	Unmarried	White	Female	0.0	0.0	40.0	United-States
...	...	...	...	...	...	...	...	...	...	...	...	...
5889	39.0	Private	13.0	Married-civ-spouse	Prof-specialty	Wife	White	Female	0.0	0.0	20.0	United-States
25723	17.0	Private	6.0	Never-married	Sales	Own-child	White	Female	0.0	0.0	20.0	United-States
29514	35.0	Private	9.0	Never-married	Transport-moving	Own-child	White	Male	0.0	0.0	40.0	United-States
1600	30.0	Private	7.0	Married-civ-spouse	Craft-repair	Husband	White	Male	0.0	0.0	45.0	United-States
639	52.0	Self-emp-not-inc	16.0	Married-civ-spouse	Prof-specialty	Husband	White	Male	0.0	0.0	60.0	United-States
6513 rows × 12 columns

	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
22308	24.0	Private	10.0	Never-married	Handlers-cleaners	Own-child	White	Male	0.0	0.0	40.0	United-States
8499	66.0	?	10.0	Married-civ-spouse	?	Husband	White	Male	0.0	0.0	40.0	United-States
27309	38.0	Private	8.0	Never-married	Machine-op-inspct	Not-in-family	Black	Female	0.0	0.0	50.0	United-States
18937	21.0	Private	8.0	Never-married	Handlers-cleaners	Own-child	White	Male	0.0	0.0	32.0	United-States
30262	30.0	Private	10.0	Never-married	Exec-managerial	Not-in-family	White	Male	0.0	0.0	52.0	United-States
...	...	...	...	...	...	...	...	...	...	...	...	...
21639	33.0	Private	11.0	Married-civ-spouse	Adm-clerical	Wife	Black	Female	0.0	0.0	40.0	United-States
28968	29.0	Private	4.0	Divorced	Craft-repair	Unmarried	White	Female	0.0	0.0	55.0	United-States
21714	28.0	Private	5.0	Never-married	Other-service	Other-relative	White	Male	0.0	0.0	52.0	United-States
12412	39.0	Local-gov	8.0	Married-civ-spouse	Transport-moving	Husband	White	Male	0.0	1848.0	40.0	Nicaragua
11419	39.0	State-gov	13.0	Married-civ-spouse	Prof-specialty	Husband	White	Male	0.0	0.0	45.0	United-States
4884 rows × 12 columns

STEP-5 : Concatenating the numeric features with the true labels

  • We import the pandas package and create a pandas Series from the target variable (Y_train) that shares its index with the feature matrix (X_train).
import pandas as pd
pd.Series(Y_train, index=X_train.index)
  • Because Y_train is a NumPy array, its values are assigned to the labels in X_train.index by position.

Output:

13825    False
2843      True
3112     False
10886    False
12148     True
         ...  
245      False
10156    False
21991    False
342       True
25283     True
Length: 14652, dtype: bool
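One detail worth knowing here: the index argument behaves differently depending on the type of the data. With a NumPy array, as in this article, the values are assigned to the new labels by position. If the data were already a pandas Series, pandas would instead look up each new label in the existing index. The sketch below illustrates both cases on three made-up values (labels_array and row_labels are hypothetical names).

```python
import numpy as np
import pandas as pd

labels_array = np.array([False, True, False])
row_labels = pd.Index([13825, 2843, 3112])

# NumPy array as data: values are attached to the labels by position.
from_array = pd.Series(labels_array, index=row_labels)

# Series as data: pandas reindexes, looking up each label in the old index
# (0, 1, 2). None of the new labels exist there, so every value is missing.
labels_series = pd.Series(labels_array)
from_series = pd.Series(labels_series, index=row_labels)

print(from_array.tolist())       # [False, True, False]
print(from_series.isna().all())  # True
```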

Let's set the name of the Series to 'Income>50K'.

pd.Series(Y_train, index=X_train.index, name='Income>50K')

where

  • name is an additional parameter that sets the name of the Series.

Output:

13825    False
2843      True
3112     False
10886    False
12148     True
         ...  
245      False
10156    False
21991    False
342       True
25283     True
Name: Income>50K, Length: 14652, dtype: bool
  • The data type is currently boolean, so we convert it to integer (False becomes 0, True becomes 1). Some machine learning tools expect numeric labels, and integers make calculations such as sums and means straightforward.
pd.Series(Y_train, index=X_train.index, name='Income>50K', dtype=int)

where

  • The dtype=int parameter explicitly sets the data type of the values in the Series to integer.

Output:

13825    0
2843     1
3112     0
10886    0
12148    1
        ..
245      0
10156    0
21991    0
342      1
25283    1
Name: Income>50K, Length: 14652, dtype: int64
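As a small illustration of the conversion, the sketch below builds the Series from three made-up boolean values (flags is a hypothetical name) in two equivalent ways: with dtype=int in the constructor, as above, and with astype(int) afterwards.

```python
import numpy as np
import pandas as pd

flags = np.array([False, True, True])

# Convert while constructing the Series, as in the article.
as_int = pd.Series(flags, name="Income>50K", dtype=int)

# Equivalent alternative: build the boolean Series first, then cast it.
also_int = pd.Series(flags, name="Income>50K").astype(int)

# With 0/1 labels, the sum counts the positive examples.
n_positive = as_int.sum()

print(as_int.tolist(), n_positive)  # [0, 1, 1] 2
```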

Now let's concatenate the labels ('Income>50K') with the features so that the label is the first column of each DataFrame, allowing for easy reference to the target variable during analysis or modeling.

train = pd.concat([pd.Series(Y_train, index=X_train.index, name='Income>50K', dtype=int), X_train], axis=1)

where

  • It creates a single DataFrame, train, by concatenating two objects: a Series containing the labels (Y_train) with the same index as X_train, and the original feature matrix X_train. The new DataFrame has the labels as the first column ('Income>50K'), followed by the features.
  • axis=1 specifies that the concatenation happens along columns, so the Series and the feature matrix are combined horizontally, side by side, and rows are matched by index label. If axis were 0 (the default), concatenation would occur along rows: the Series and the feature matrix would be stacked on top of each other, producing a DataFrame with more rows.
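The difference between the two axes is easy to see on a two-row example. This is a sketch on made-up data: features and target are hypothetical names, and the values are copied from the first two training rows shown earlier.

```python
import pandas as pd

features = pd.DataFrame({"Age": [54.0, 41.0], "Sex": [1, 1]}, index=[13825, 2843])
target = pd.Series([0, 1], index=[13825, 2843], name="Income>50K")

# axis=1: combine side by side, matching rows by index label.
side_by_side = pd.concat([target, features], axis=1)

# axis=0: stack on top of each other; cells with no matching column are left missing.
stacked = pd.concat([target, features], axis=0)

print(side_by_side.shape)  # (2, 3)
print(stacked.shape)       # (4, 3)
```

Only the axis=1 result has one row per sample with the label next to its features, which is the layout used for train, test and val.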

We do the same for the test and validation datasets:

test = pd.concat([pd.Series(Y_test, index=X_test.index, name='Income>50K', dtype=int), X_test], axis=1)
val = pd.concat([pd.Series(Y_val, index=X_val.index, name='Income>50K', dtype=int), X_val], axis=1)

STEP-6 : Checking the Split Datasets

  1. Check Training DataSet:
train

Output:

	Income>50K	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
13825	0	54.0	6	6.0	2	3	4	4	1	0.0	0.0	36.0	39
2843	1	41.0	2	10.0	2	8	4	4	1	0.0	1485.0	40.0	39
3112	0	24.0	4	9.0	4	1	3	4	1	0.0	0.0	40.0	39
10886	0	33.0	4	12.0	0	7	0	4	0	0.0	0.0	42.0	39
12148	1	33.0	4	9.0	2	1	5	4	0	0.0	1887.0	20.0	39
...	...	...	...	...	...	...	...	...	...	...	...	...	...
245	0	56.0	4	9.0	2	1	4	4	1	0.0	0.0	35.0	0
10156	0	28.0	4	9.0	4	6	3	4	1	0.0	0.0	40.0	39
21991	0	35.0	4	9.0	2	6	4	4	1	0.0	0.0	40.0	26
342	1	36.0	7	9.0	2	11	4	4	1	7298.0	0.0	40.0	39
25283	1	56.0	4	10.0	2	4	4	4	1	0.0	0.0	40.0	39
14652 rows × 13 columns
  2. Check Test DataSet:
test

Output:

	Income>50K	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
9646	0	62.0	6	4.0	6	8	0	4	0	0.0	0.0	66.0	39
709	0	18.0	4	7.0	4	8	2	4	1	0.0	0.0	25.0	39
7385	1	25.0	4	13.0	4	5	3	4	1	27828.0	0.0	50.0	39
16671	0	33.0	4	9.0	2	10	4	4	1	0.0	0.0	40.0	39
21932	0	36.0	4	7.0	4	7	1	4	0	0.0	0.0	40.0	39
...	...	...	...	...	...	...	...	...	...	...	...	...	...
5889	1	39.0	4	13.0	2	10	5	4	0	0.0	0.0	20.0	39
25723	0	17.0	4	6.0	4	12	3	4	0	0.0	0.0	20.0	39
29514	0	35.0	4	9.0	4	14	3	4	1	0.0	0.0	40.0	39
1600	0	30.0	4	7.0	2	3	4	4	1	0.0	0.0	45.0	39
639	1	52.0	6	16.0	2	10	4	4	1	0.0	0.0	60.0	39
6513 rows × 13 columns
  3. Check Validation DataSet:
val

Output:

	Income>50K	Age	Workclass	Education-Num	Marital Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country
22308	0	24.0	4	10.0	4	6	3	4	1	0.0	0.0	40.0	39
8499	0	66.0	0	10.0	2	0	4	4	1	0.0	0.0	40.0	39
27309	0	38.0	4	8.0	4	7	0	2	0	0.0	0.0	50.0	39
18937	0	21.0	4	8.0	4	6	3	4	1	0.0	0.0	32.0	39
30262	0	30.0	4	10.0	4	4	0	4	1	0.0	0.0	52.0	39
...	...	...	...	...	...	...	...	...	...	...	...	...	...
21639	0	33.0	4	11.0	2	1	5	2	0	0.0	0.0	40.0	39
28968	0	29.0	4	4.0	0	3	1	4	0	0.0	0.0	55.0	39
21714	0	28.0	4	5.0	4	8	2	4	1	0.0	0.0	52.0	39
12412	1	39.0	2	8.0	2	14	4	4	1	0.0	1848.0	40.0	27
11419	1	39.0	7	13.0	2	10	4	4	1	0.0	0.0	45.0	39
4884 rows × 13 columns
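Beyond inspecting the tables by eye, a few programmatic checks can confirm that the three parts are consistent. The sketch below repeats the whole procedure on made-up data and then verifies the sizes, that no row appears in more than one part, and that each label still sits next to its own features. The helper with_label and all variable names here are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, label is True for even ages.
X_all = pd.DataFrame({"Age": np.arange(100, dtype=float)})
y_all = np.arange(100) % 2 == 0

X_rest, X_te, y_rest, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

def with_label(X_part, y_part):
    """Prepend the integer label column, following the pattern of Step 5."""
    label = pd.Series(y_part, index=X_part.index, name="Income>50K", dtype=int)
    return pd.concat([label, X_part], axis=1)

parts = {
    "train": with_label(X_tr, y_tr),
    "val": with_label(X_va, y_va),
    "test": with_label(X_te, y_te),
}

# Sizes of each part.
sizes = {name: len(part) for name, part in parts.items()}

# Every original row should appear exactly once across the three parts.
combined_index = pd.concat(list(parts.values())).index
no_overlap = not combined_index.has_duplicates
covers_all_rows = set(combined_index) == set(X_all.index)

# Each label should still match the rule used to generate it.
labels_match = all(
    (part["Income>50K"] == (part["Age"] % 2 == 0).astype(int)).all()
    for part in parts.values()
)

print(sizes)  # {'train': 60, 'val': 20, 'test': 20}
print(no_overlap, covers_all_rows, labels_match)  # True True True
```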
