Author: KAUSTUBH SHARMA
Prerequisite
- Read about How to Download Dataset (only a 1 min. read).
Splitting the Dataset into Multiple Datasets
- We are splitting the dataset into three datasets:
- One (the train dataset) for training the model,
- A second (the test dataset) for evaluating the performance of the final trained model, and
- A third (the validation dataset), which is used during the training phase to fine-tune the model's hyperparameters and prevent overfitting.
STEP-1 : Importing the Necessary Library
from sklearn.model_selection import train_test_split
where
- sklearn is a machine learning library for predictive data analysis.
- train_test_split is a function from scikit-learn, which is commonly used for splitting datasets into training and testing sets.
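Before moving on, you can optionally confirm that scikit-learn is available in your environment; a quick version check (a minimal sketch, assuming a standard Python installation):
# Print the installed scikit-learn version; any reasonably recent release provides train_test_split
import sklearn
print(sklearn.__version__)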
STEP-2 : Splitting into Train and Test Datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
display(X)
display(X_train)
display(X_test)
display(Y)
display(Y_train)
Y_test
where
- X and Y are the original feature matrix and label vector.
- test_size=0.2 specifies that 20% of the data should be used for testing, and the remaining 80% for training.
- random_state=1 ensures reproducibility by fixing the random seed: with the seed fixed, the same rows land in the same split every time the code is run.
Output:
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
0 39.0 7 13.0 4 1 0 4 1 2174.0 0.0 40.0 39
1 50.0 6 13.0 2 4 4 4 1 0.0 0.0 13.0 39
2 38.0 4 9.0 0 6 0 4 1 0.0 0.0 40.0 39
3 53.0 4 7.0 2 6 4 2 1 0.0 0.0 40.0 39
4 28.0 4 13.0 2 10 5 2 0 0.0 0.0 40.0 5
... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27.0 4 12.0 2 13 5 4 0 0.0 0.0 38.0 39
32557 40.0 4 9.0 2 7 4 4 1 0.0 0.0 40.0 39
32558 58.0 4 9.0 6 1 1 4 0 0.0 0.0 40.0 39
32559 22.0 4 9.0 4 1 3 4 1 0.0 0.0 20.0 39
32560 52.0 5 9.0 2 4 5 4 0 15024.0 0.0 40.0 39
32561 rows × 12 columns
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
16465 39.0 6 7.0 2 14 4 4 1 0.0 0.0 40.0 39
5625 54.0 6 13.0 2 4 4 4 1 0.0 0.0 40.0 39
30273 32.0 4 9.0 2 12 4 4 1 0.0 1902.0 50.0 39
3136 45.0 6 10.0 4 5 0 4 1 0.0 0.0 50.0 39
4521 60.0 4 6.0 2 6 4 2 1 0.0 0.0 40.0 39
... ... ... ... ... ... ... ... ... ... ... ... ...
32511 25.0 2 13.0 4 1 3 2 0 0.0 0.0 40.0 39
5192 32.0 4 13.0 2 4 4 4 1 15024.0 0.0 45.0 39
12172 27.0 4 13.0 4 7 0 1 1 0.0 0.0 40.0 0
235 59.0 7 9.0 2 8 4 4 1 0.0 0.0 40.0 39
29733 33.0 4 13.0 2 1 4 4 1 0.0 1902.0 45.0 39
26048 rows × 12 columns
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
9646 62.0 6 4.0 6 8 0 4 0 0.0 0.0 66.0 39
709 18.0 4 7.0 4 8 2 4 1 0.0 0.0 25.0 39
7385 25.0 4 13.0 4 5 3 4 1 27828.0 0.0 50.0 39
16671 33.0 4 9.0 2 10 4 4 1 0.0 0.0 40.0 39
21932 36.0 4 7.0 4 7 1 4 0 0.0 0.0 40.0 39
... ... ... ... ... ... ... ... ... ... ... ... ...
5889 39.0 4 13.0 2 10 5 4 0 0.0 0.0 20.0 39
25723 17.0 4 6.0 4 12 3 4 0 0.0 0.0 20.0 39
29514 35.0 4 9.0 4 14 3 4 1 0.0 0.0 40.0 39
1600 30.0 4 7.0 2 3 4 4 1 0.0 0.0 45.0 39
639 52.0 6 16.0 2 10 4 4 1 0.0 0.0 60.0 39
6513 rows × 12 columns
array([False, False, False, ..., False, False, True])
array([False, True, True, ..., False, False, True])
The function returns four sets:
- X_train: The feature matrix for the training set.
- X_test: The feature matrix for the testing set.
- Y_train: The labels for the training set.
- Y_test: The labels for the testing set.
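Because random_state fixes the seed, re-running the same call reproduces exactly the same partition. A minimal sketch of that check, reusing the variables from the code above:
# Split again with the same seed and confirm the row indices match exactly
X_train_2, X_test_2, Y_train_2, Y_test_2 = train_test_split(X, Y, test_size=0.2, random_state=1)
print(X_train.index.equals(X_train_2.index))  # True: same seed, same split
print(X_test.index.equals(X_test_2.index))    # True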
STEP-3 : Further Splitting the Training DataSet into Training and Validation DataSets
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.25, random_state=1)
display(X_train)
display(X_val)
display(Y_train)
Y_val
The function returns four sets:
- X_train: Feature matrix for the updated training set.
- X_val: Feature matrix for the validation set.
- Y_train: Labels for the updated training set.
- Y_val: Labels for the validation set.
Output:
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
13825 54.0 6 6.0 2 3 4 4 1 0.0 0.0 36.0 39
2843 41.0 2 10.0 2 8 4 4 1 0.0 1485.0 40.0 39
3112 24.0 4 9.0 4 1 3 4 1 0.0 0.0 40.0 39
10886 33.0 4 12.0 0 7 0 4 0 0.0 0.0 42.0 39
12148 33.0 4 9.0 2 1 5 4 0 0.0 1887.0 20.0 39
... ... ... ... ... ... ... ... ... ... ... ... ...
245 56.0 4 9.0 2 1 4 4 1 0.0 0.0 35.0 0
10156 28.0 4 9.0 4 6 3 4 1 0.0 0.0 40.0 39
21991 35.0 4 9.0 2 6 4 4 1 0.0 0.0 40.0 26
342 36.0 7 9.0 2 11 4 4 1 7298.0 0.0 40.0 39
25283 56.0 4 10.0 2 4 4 4 1 0.0 0.0 40.0 39
14652 rows × 12 columns
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
22308 24.0 4 10.0 4 6 3 4 1 0.0 0.0 40.0 39
8499 66.0 0 10.0 2 0 4 4 1 0.0 0.0 40.0 39
27309 38.0 4 8.0 4 7 0 2 0 0.0 0.0 50.0 39
18937 21.0 4 8.0 4 6 3 4 1 0.0 0.0 32.0 39
30262 30.0 4 10.0 4 4 0 4 1 0.0 0.0 52.0 39
... ... ... ... ... ... ... ... ... ... ... ... ...
21639 33.0 4 11.0 2 1 5 2 0 0.0 0.0 40.0 39
28968 29.0 4 4.0 0 3 1 4 0 0.0 0.0 55.0 39
21714 28.0 4 5.0 4 8 2 4 1 0.0 0.0 52.0 39
12412 39.0 2 8.0 2 14 4 4 1 0.0 1848.0 40.0 27
11419 39.0 7 13.0 2 10 4 4 1 0.0 0.0 45.0 39
4884 rows × 12 columns
array([False, True, False, ..., False, True, True])
array([False, False, False, ..., False, True, True])
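With all three feature matrices in hand, here is an optional sketch to report how many rows each split contains and its share of the combined data:
# Print the size of each split and its fraction of all rows
total = len(X_train) + len(X_val) + len(X_test)
for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(f"{name}: {len(part)} rows ({len(part) / total:.1%})")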
STEP-4 : Aligning the Dataset for Display Purposes
- First, we fetch the index of the training feature matrix so that we can use it to pick out the corresponding rows of the display data for the training set.
display(X_train.index)
Output:
Index([13825, 2843, 3112, 10886, 12148, 5773, 26924, 7925, 18589, 25383,
...
20387, 32381, 27533, 11772, 31925, 245, 10156, 21991, 342, 25283],
dtype='int64', length=14652)
where
- X_train.index returns the row labels (index) of X_train.
Next, we create a subset of the display dataset (X_display, the human-readable version of the features with the original category names rather than the encoded values) by selecting only the rows whose labels appear in the training set (X_train). This keeps the display dataset aligned with the training data for visualization purposes.
For this we use the loc indexer. In pandas, loc performs label-based indexing: it selects a group of rows (and columns) by their labels or by a boolean array, which is exactly what we need to pull out rows by index label.
We do the same for the test and validation datasets:
X_train_display = X_display.loc[X_train.index]
X_test_display = X_display.loc[X_test.index]
X_val_display = X_display.loc[X_val.index]
display(X_train_display)
display(X_test_display)
X_val_display
Output:
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
13825 54.0 Self-emp-not-inc 6.0 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 36.0 United-States
2843 41.0 Local-gov 10.0 Married-civ-spouse Other-service Husband White Male 0.0 1485.0 40.0 United-States
3112 24.0 Private 9.0 Never-married Adm-clerical Own-child White Male 0.0 0.0 40.0 United-States
10886 33.0 Private 12.0 Divorced Machine-op-inspct Not-in-family White Female 0.0 0.0 42.0 United-States
12148 33.0 Private 9.0 Married-civ-spouse Adm-clerical Wife White Female 0.0 1887.0 20.0 United-States
... ... ... ... ... ... ... ... ... ... ... ... ...
245 56.0 Private 9.0 Married-civ-spouse Adm-clerical Husband White Male 0.0 0.0 35.0 ?
10156 28.0 Private 9.0 Never-married Handlers-cleaners Own-child White Male 0.0 0.0 40.0 United-States
21991 35.0 Private 9.0 Married-civ-spouse Handlers-cleaners Husband White Male 0.0 0.0 40.0 Mexico
342 36.0 State-gov 9.0 Married-civ-spouse Protective-serv Husband White Male 7298.0 0.0 40.0 United-States
25283 56.0 Private 10.0 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 40.0 United-States
14652 rows × 12 columns
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
9646 62.0 Self-emp-not-inc 4.0 Widowed Other-service Not-in-family White Female 0.0 0.0 66.0 United-States
709 18.0 Private 7.0 Never-married Other-service Other-relative White Male 0.0 0.0 25.0 United-States
7385 25.0 Private 13.0 Never-married Farming-fishing Own-child White Male 27828.0 0.0 50.0 United-States
16671 33.0 Private 9.0 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 40.0 United-States
21932 36.0 Private 7.0 Never-married Machine-op-inspct Unmarried White Female 0.0 0.0 40.0 United-States
... ... ... ... ... ... ... ... ... ... ... ... ...
5889 39.0 Private 13.0 Married-civ-spouse Prof-specialty Wife White Female 0.0 0.0 20.0 United-States
25723 17.0 Private 6.0 Never-married Sales Own-child White Female 0.0 0.0 20.0 United-States
29514 35.0 Private 9.0 Never-married Transport-moving Own-child White Male 0.0 0.0 40.0 United-States
1600 30.0 Private 7.0 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 45.0 United-States
639 52.0 Self-emp-not-inc 16.0 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 60.0 United-States
6513 rows × 12 columns
Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
22308 24.0 Private 10.0 Never-married Handlers-cleaners Own-child White Male 0.0 0.0 40.0 United-States
8499 66.0 ? 10.0 Married-civ-spouse ? Husband White Male 0.0 0.0 40.0 United-States
27309 38.0 Private 8.0 Never-married Machine-op-inspct Not-in-family Black Female 0.0 0.0 50.0 United-States
18937 21.0 Private 8.0 Never-married Handlers-cleaners Own-child White Male 0.0 0.0 32.0 United-States
30262 30.0 Private 10.0 Never-married Exec-managerial Not-in-family White Male 0.0 0.0 52.0 United-States
... ... ... ... ... ... ... ... ... ... ... ... ...
21639 33.0 Private 11.0 Married-civ-spouse Adm-clerical Wife Black Female 0.0 0.0 40.0 United-States
28968 29.0 Private 4.0 Divorced Craft-repair Unmarried White Female 0.0 0.0 55.0 United-States
21714 28.0 Private 5.0 Never-married Other-service Other-relative White Male 0.0 0.0 52.0 United-States
12412 39.0 Local-gov 8.0 Married-civ-spouse Transport-moving Husband White Male 0.0 1848.0 40.0 Nicaragua
11419 39.0 State-gov 13.0 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 45.0 United-States
4884 rows × 12 columns
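To confirm that each display DataFrame is truly aligned with its encoded counterpart, you can compare their indices; a minimal sketch:
# The display frames should carry exactly the same row labels, in the same order,
# as the encoded frames they mirror
print(X_train_display.index.equals(X_train.index))  # True
print(X_test_display.index.equals(X_test.index))    # True
print(X_val_display.index.equals(X_val.index))      # True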
STEP-5 : Concatenating the numeric features with the true labels
- We will import the pandas package and create a pandas Series from the target variable (Y_train), using the same index as the feature matrix (X_train).
import pandas as pd
pd.Series(Y_train, index=X_train.index)
- This will create a pandas Series using Y_train as the data and X_train.index as the index.
Output:
13825 False
2843 True
3112 False
10886 False
12148 True
...
245 False
10156 False
21991 False
342 True
25283 True
Length: 14652, dtype: bool
Let's set the name of the Series to 'Income>50K'.
pd.Series(Y_train, index=X_train.index, name='Income>50K')
where
- name is an additional parameter that sets the name of the Series.
Output:
13825 False
2843 True
3112 False
10886 False
12148 True
...
245 False
10156 False
21991 False
342 True
25283 True
Name: Income>50K, Length: 14652, dtype: bool
- As we can see, the data type is currently boolean, so we convert it to integer. This is often needed when preparing data for machine learning tasks where specific numeric types are required, and it makes downstream calculations easier.
pd.Series(Y_train, index=X_train.index, name='Income>50K', dtype=int)
where
- The dtype=int parameter explicitly sets the data type of the values in the Series to integer.
Output:
13825 0
2843 1
3112 0
10886 0
12148 1
..
245 0
10156 0
21991 0
342 1
25283 1
Name: Income>50K, Length: 14652, dtype: int64
Now let's concatenate (add) the labels ('Income>50K') as the first column of the DataFrame, allowing easy reference to the target variable during analysis or modeling.
train = pd.concat([pd.Series(Y_train, index=X_train.index, name='Income>50K', dtype=int), X_train], axis=1)
where
- It creates a DataFrame named "train" by concatenating a Series of labels (Y_train, with the same index as X_train) and the original feature matrix X_train. The resulting DataFrame has the labels as the first column ('Income>50K'), followed by the features.
- axis=1 specifies that the concatenation should happen along columns, so the Series and the feature matrix are combined horizontally, side by side. If axis were set to 0 (the default), concatenation would occur along rows, stacking the Series and the feature matrix on top of each other and producing a DataFrame with more rows instead (a small sketch illustrating this follows the test/validation lines below).
We do the same for the test and validation datasets:
test = pd.concat([pd.Series(Y_test, index=X_test.index, name='Income>50K', dtype=int), X_test], axis=1)
val = pd.concat([pd.Series(Y_val, index=X_val.index, name='Income>50K', dtype=int), X_val], axis=1)
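To make the role of axis concrete, here is a tiny self-contained sketch with made-up data (unrelated to our dataset), comparing axis=1 with axis=0:
import pandas as pd
# Two small frames sharing the same row labels
left = pd.DataFrame({"label": [0, 1]}, index=["a", "b"])
right = pd.DataFrame({"x": [10, 20]}, index=["a", "b"])
print(pd.concat([left, right], axis=1))  # side by side: columns 'label' and 'x', 2 rows
print(pd.concat([left, right], axis=0))  # stacked: 4 rows, NaN where a column is missing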
STEP-6 : Check the Split of the DataSet
- Check Training DataSet:
train
Output:
Income>50K Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
13825 0 54.0 6 6.0 2 3 4 4 1 0.0 0.0 36.0 39
2843 1 41.0 2 10.0 2 8 4 4 1 0.0 1485.0 40.0 39
3112 0 24.0 4 9.0 4 1 3 4 1 0.0 0.0 40.0 39
10886 0 33.0 4 12.0 0 7 0 4 0 0.0 0.0 42.0 39
12148 1 33.0 4 9.0 2 1 5 4 0 0.0 1887.0 20.0 39
... ... ... ... ... ... ... ... ... ... ... ... ... ...
245 0 56.0 4 9.0 2 1 4 4 1 0.0 0.0 35.0 0
10156 0 28.0 4 9.0 4 6 3 4 1 0.0 0.0 40.0 39
21991 0 35.0 4 9.0 2 6 4 4 1 0.0 0.0 40.0 26
342 1 36.0 7 9.0 2 11 4 4 1 7298.0 0.0 40.0 39
25283 1 56.0 4 10.0 2 4 4 4 1 0.0 0.0 40.0 39
14652 rows × 13 columns
- Check Test DataSet:
test
Output:
Income>50K Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
9646 0 62.0 6 4.0 6 8 0 4 0 0.0 0.0 66.0 39
709 0 18.0 4 7.0 4 8 2 4 1 0.0 0.0 25.0 39
7385 1 25.0 4 13.0 4 5 3 4 1 27828.0 0.0 50.0 39
16671 0 33.0 4 9.0 2 10 4 4 1 0.0 0.0 40.0 39
21932 0 36.0 4 7.0 4 7 1 4 0 0.0 0.0 40.0 39
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5889 1 39.0 4 13.0 2 10 5 4 0 0.0 0.0 20.0 39
25723 0 17.0 4 6.0 4 12 3 4 0 0.0 0.0 20.0 39
29514 0 35.0 4 9.0 4 14 3 4 1 0.0 0.0 40.0 39
1600 0 30.0 4 7.0 2 3 4 4 1 0.0 0.0 45.0 39
639 1 52.0 6 16.0 2 10 4 4 1 0.0 0.0 60.0 39
6513 rows × 13 columns
- Check Validation DataSet:
val
Output:
Income>50K Age Workclass Education-Num Marital Status Occupation Relationship Race Sex Capital Gain Capital Loss Hours per week Country
22308 0 24.0 4 10.0 4 6 3 4 1 0.0 0.0 40.0 39
8499 0 66.0 0 10.0 2 0 4 4 1 0.0 0.0 40.0 39
27309 0 38.0 4 8.0 4 7 0 2 0 0.0 0.0 50.0 39
18937 0 21.0 4 8.0 4 6 3 4 1 0.0 0.0 32.0 39
30262 0 30.0 4 10.0 4 4 0 4 1 0.0 0.0 52.0 39
... ... ... ... ... ... ... ... ... ... ... ... ... ...
21639 0 33.0 4 11.0 2 1 5 2 0 0.0 0.0 40.0 39
28968 0 29.0 4 4.0 0 3 1 4 0 0.0 0.0 55.0 39
21714 0 28.0 4 5.0 4 8 2 4 1 0.0 0.0 52.0 39
12412 1 39.0 2 8.0 2 14 4 4 1 0.0 1848.0 40.0 27
11419 1 39.0 7 13.0 2 10 4 4 1 0.0 0.0 45.0 39
4884 rows × 13 columns
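As an optional sanity check (a small sketch using the train, test, and val DataFrames built above), you can confirm that no row appears in more than one split:
# The three index sets should be pairwise disjoint
train_idx, test_idx, val_idx = set(train.index), set(test.index), set(val.index)
print(train_idx & test_idx)  # set() - empty
print(train_idx & val_idx)   # set() - empty
print(test_idx & val_idx)    # set() - empty
print(len(train), len(test), len(val))  # row counts of the three splits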
NEXT STEP OF AI PROJECT:
- We need to convert our new datasets (train, test, val) to CSV, as that is the input format for the XGBoost algorithm, which we are going to use for model training.
- Read about XGBoost algorithm.
- Learn about How-to-convert-Datasets-to-CSV-Files.
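As a short preview of that step, here is a minimal sketch; the exact CSV layout expected by your XGBoost setup may differ, so treat the no-header/no-index choice below as an assumption to verify against the linked guide.
# Write each split to CSV. Because the label is already the first column,
# writing without an index and without a header matches the CSV convention
# commonly used for XGBoost training input (adjust if your setup expects headers).
train.to_csv("train.csv", index=False, header=False)
val.to_csv("validation.csv", index=False, header=False)
test.to_csv("test.csv", index=False, header=False)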