M/L Learning Byte: Load, Split and Scale Data
Shakir
Posted on October 16, 2023
Let's play with some data in Pandas. I'm using a Jupyter notebook for this, created inside a Notebook instance in Amazon SageMaker. This post covers some of the preliminary steps we would do before creating or training a model.
In my case the notebook instance was created with just the default settings.
Note that the notebook instance name cannot be in snake_case (for example, note_book is invalid); I mention this because snake_case is a common convention for naming variables and files in Python.
Click on JupyterLab once the notebook instance is created.
Inside JupyterLab, create a new notebook; I picked the TensorFlow conda environment from the launcher.
Let's get started...
We are going to create a dataframe in Pandas that should hold the following data:
| Age (years) | Income (thousands) | Hours_Worked | Salary (thousands) |
|---|---|---|---|
| 32 | 45 | 50 | 70 |
| 41 | 50 | 45 | 80 |
| 28 | 30 | 60 | 60 |
| 35 | 38 | 55 | 75 |
| 45 | 60 | 42 | 90 |
| 29 | 32 | 48 | 65 |
| 37 | 40 | 35 | 75 |
| 42 | 55 | 47 | 85 |
| 36 | 48 | 38 | 80 |
| 31 | 35 | 52 | 70 |
Dictionary
For this, we would create a dictionary in Python with keys as the column names and wrap that dictionary in the Pandas DataFrame object.
import pandas as pd
salary_dict = {
    'Age (years)': [32, 41, 28, 35, 45, 29, 37, 42, 36, 31],
    'Income (thousands)': [45, 50, 30, 38, 60, 32, 40, 55, 48, 35],
    'Hours_Worked': [50, 45, 60, 55, 42, 48, 35, 47, 38, 52],
    'Salary (thousands)': [70, 80, 60, 75, 90, 65, 75, 85, 80, 70]
}
salary_df = pd.DataFrame(salary_dict)
Let's print this dataframe.
print(salary_df)
Age (years) Income (thousands) Hours_Worked Salary (thousands)
0 32 45 50 70
1 41 50 45 80
2 28 30 60 60
3 35 38 55 75
4 45 60 42 90
5 29 32 48 65
6 37 40 35 75
7 42 55 47 85
8 36 48 38 80
9 31 35 52 70
Awesome, it resembles our table. However, it also has extra information on the left: the indices, starting from 0. The last index is 9, which means there are 9 + 1 = 10 rows. By looking at the data we can tell there are 10 rows and 4 columns; we can also get this from the shape attribute, which is handy with big datasets.
print(salary_df.shape)
(10, 4)
Index
Instead of the default integer indices, we can set something more meaningful if we would like to; for example, here we could use person names as the indices. This time, let's print the first 5 and last 5 rows with head and tail respectively.
person_names = ['Ali', 'Sara', 'Ahmed', 'Emily', 'David', 'Olivia', 'Michael', 'Sophia', 'Aryan', 'Neha']
salary_df_with_custom_indices = pd.DataFrame(
    salary_dict,
    index=person_names
)
print(salary_df_with_custom_indices.head())
print(salary_df_with_custom_indices.tail())
Age (years) Income (thousands) Hours_Worked Salary (thousands)
Ali 32 45 50 70
Sara 41 50 45 80
Ahmed 28 30 60 60
Emily 35 38 55 75
David 45 60 42 90
Age (years) Income (thousands) Hours_Worked Salary (thousands)
Olivia 29 32 48 65
Michael 37 40 35 75
Sophia 42 55 47 85
Aryan 36 48 38 80
Neha 31 35 52 70
Series
We could also create the salary separately as a Series, since it's our label (target) and a single column; a Series is typically used to represent a single list or column. Note that we usually separate features and labels in datasets for training in machine learning: we want the model to predict the labels based on the features. Sometimes it's just a single label that we want to predict.
salary_series = pd.Series(
    [70, 80, 60, 75, 90, 65, 75, 85, 80, 70]
)
print(salary_series.head())
0 70
1 80
2 60
3 75
4 90
dtype: int64
We can also add custom indices to a Series, just like we did for the dataframe, and give the Series a name.
salary_series_with_custom_indices = pd.Series(
    [70, 80, 60, 75, 90, 65, 75, 85, 80, 70],
    index=person_names,
    name='Salary (thousands)'
)
print(salary_series_with_custom_indices.head())
Ali 70
Sara 80
Ahmed 60
Emily 75
David 90
Name: Salary (thousands), dtype: int64
Instead of displaying the complete table, we can set how many rows we want to see, for example 5. This option affects not just printing dataframes but also other pandas outputs, such as describe.
pd.set_option('display.max_rows', 5)
print(salary_df)
Age (years) Income (thousands) Hours_Worked Salary (thousands)
0 32 45 50 70
1 41 50 45 80
.. ... ... ... ...
8 36 48 38 80
9 31 35 52 70
[10 rows x 4 columns]
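For example, with the limit still set to 5, printing the statistical summary (covered in more detail below) would be truncated the same way:
print(salary_df.describe())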
Let's remove this limit by setting the option to None (no limit).
pd.set_option('display.max_rows', None)
File
So far we added the data directly to a dataframe; now let's read it from a file, which is what we usually do. The file can be remote (a URL, as shown a little further below) or local. We'll go local this time, so let's write our table to a file named salary.csv.
%%writefile salary.csv
Age(years),Income (thousands),Hours_Worked,Salary (thousands)
32,45,50,70
41,50,45,80
28,30,60,60
35,38,55,75
45,60,42,90
29,32,48,65
37,40,35,75
42,55,47,85
36,48,38,80
31,35,52,70
Overwriting salary.csv
We can create a dataframe out of this and check the column names and shape.
salary_df_from_file = pd.read_csv('salary.csv')
print(salary_df_from_file.columns)
print(salary_df_from_file.shape)
Index(['Age(years)', 'Income (thousands)', 'Hours_Worked',
'Salary (thousands)'],
dtype='object')
(10, 4)
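As an aside, reading a remote file works the same way; a URL simply takes the place of the local path. The address below is just a placeholder:
# the URL here is hypothetical, any reachable CSV URL would work
remote_salary_df = pd.read_csv('https://example.com/salary.csv')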
We can check the statistical description of this dataframe with the describe method.
print(salary_df_from_file.describe())
Age(years) Income (thousands) Hours_Worked Salary (thousands)
count 10.000000 10.000000 10.000000 10.000000
mean 35.600000 43.300000 47.200000 75.000000
std 5.738757 9.989439 7.612855 9.128709
min 28.000000 30.000000 35.000000 60.000000
25% 31.250000 35.750000 42.750000 70.000000
50% 35.500000 42.500000 47.500000 75.000000
75% 40.000000 49.500000 51.500000 80.000000
max 45.000000 60.000000 60.000000 90.000000
Index on file
This time, let's add the index in the file itself.
%%writefile salary_with_index.csv
Age(years),Income (thousands),Hours_Worked,Salary (thousands)
Ali,32,45,50,70
Sara,41,50,45,80
Ahmed,28,30,60,60
Emily,35,38,55,75
David,45,60,42,90
Olivia,29,32,48,65
Michael,37,40,35,75
Sophia,42,55,47,85
Aryan,36,48,38,80
Neha,31,35,52,70
Overwriting salary_with_index.csv
So we have added person names as indices to all the rows except the header row. This time we just need to tell Pandas that the first column (column 0) is meant to be the index, using the index_col argument.
salary_df_from_file = pd.read_csv(
    'salary_with_index.csv',
    index_col=0
)
print(salary_df_from_file.shape)
print(salary_df_from_file.head())
(10, 4)
Age(years) Income (thousands) Hours_Worked Salary (thousands)
Ali 32 45 50 70
Sara 41 50 45 80
Ahmed 28 30 60 60
Emily 35 38 55 75
David 45 60 42 90
We could slightly change the file by adding a Name column.
%%writefile salary_with_names.csv
Name,Age(years),Income (thousands),Hours_Worked,Salary (thousands)
Ali,32,45,50,70
Sara,41,50,45,80
Ahmed,28,30,60,60
Emily,35,38,55,75
David,45,60,42,90
Olivia,29,32,48,65
Michael,37,40,35,75
Sophia,42,55,47,85
Aryan,36,48,38,80
Neha,31,35,52,70
Overwriting salary_with_names.csv
And we can specify the Name column as our index.
salary_df_from_file = pd.read_csv(
    'salary_with_names.csv',
    index_col='Name'
)
print(salary_df_from_file.shape)
print(salary_df_from_file.head())
(10, 4)
Age(years) Income (thousands) Hours_Worked Salary (thousands)
Name
Ali 32 45 50 70
Sara 41 50 45 80
Ahmed 28 30 60 60
Emily 35 38 55 75
David 45 60 42 90
Split dataset
We can split the dataset into two subsets: one for training (say 80%, frac=0.8) and one for validating (say 20%).
# you can set the argument random_state=0 if you want to see the same training
# dataframe each time you run the sample method
train_df = salary_df_from_file.sample(frac=0.8)
val_df = salary_df_from_file.drop(train_df.index)
print(train_df)
print(val_df)
Age(years) Income (thousands) Hours_Worked Salary (thousands)
Name
Olivia 29 32 48 65
Michael 37 40 35 75
Neha 31 35 52 70
Sophia 42 55 47 85
Sara 41 50 45 80
Ali 32 45 50 70
David 45 60 42 90
Ahmed 28 30 60 60
Age(years) Income (thousands) Hours_Worked Salary (thousands)
Name
Emily 35 38 55 75
Aryan 36 48 38 80
As you can see, there is no overlap between the two datasets, because we dropped the indices of train_df from the original dataframe to get the validation dataset. We can further split these two datasets into features (first 3 columns) and label (last column).
features = ['Age(years)', 'Income (thousands)', 'Hours_Worked']
label = 'Salary (thousands)'
train_df_features = train_df[features]
train_series_labels = train_df[label]
val_df_features = val_df[features]
val_series_labels = val_df[label]
Note that we have retrieved the features as a dataframe with column names, and the labels as a series (a single column); the name of the column becomes the name of the series.
print(val_df_features)
print('-' * 10)
print(val_series_labels)
print('-' * 10)
print(type(val_df_features))
print(type(val_series_labels))
Age(years) Income (thousands) Hours_Worked
Name
Emily 35 38 55
Aryan 36 48 38
----------
Name
Emily 75
Aryan 80
Name: Salary (thousands), dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
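As an aside, scikit-learn's train_test_split can do the row split, the shuffling and the feature/label separation in one call. A minimal sketch, assuming scikit-learn is available in the notebook environment:
from sklearn.model_selection import train_test_split

# 80/20 split of features and label, with a fixed seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    salary_df_from_file[features],
    salary_df_from_file[label],
    test_size=0.2,
    random_state=0
)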
Scale and Save
Let's scale each feature column and save the result as a new dataframe. It's usual to scale or normalize data before training models so that the values for each feature fall in a similar range.
# 'df' is just a short parameter name for the dataframe passed in
def scale(df):
    # standard min-max scaling formula, applied column-wise
    min_max_scaled_df = (df - df.min()) / (df.max() - df.min())
    return min_max_scaled_df
scaled_train_df_features = scale(train_df_features)
scaled_val_df_features = scale(val_df_features)
print(scaled_train_df_features.head())
print(scaled_val_df_features)
Age(years) Income (thousands) Hours_Worked
Name
Olivia 0.058824 0.066667 0.52
Michael 0.529412 0.333333 0.00
Neha 0.176471 0.166667 0.68
Sophia 0.823529 0.833333 0.48
Sara 0.764706 0.666667 0.40
Age(years) Income (thousands) Hours_Worked
Name
Emily 0.0 0.0 1.0
Aryan 1.0 1.0 0.0
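One thing to note: each set above was scaled with its own min and max, which is why Emily and Aryan land exactly on 0 and 1. In practice the validation features are usually scaled with the training set's statistics so both sets share the same scale; a minimal sketch of that variant:
# reuse the training set's min and max so the validation features
# land on the same scale as the training features
train_min = train_df_features.min()
train_max = train_df_features.max()
scaled_val_df_features_alt = (val_df_features - train_min) / (train_max - train_min)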
Either way, all the values now fall between 0 and 1. We can save the scaled features to new csv files with the to_csv method.
scaled_train_df_features.to_csv('train_features.csv')
scaled_val_df_features.to_csv('test_features.csv')
Let's view the file content.
!cat train_features.csv
!cat test_features.csv
Name,Age(years),Income (thousands),Hours_Worked
Olivia,0.058823529411764705,0.06666666666666667,0.52
Michael,0.5294117647058824,0.3333333333333333,0.0
Neha,0.17647058823529413,0.16666666666666666,0.68
Sophia,0.8235294117647058,0.8333333333333334,0.48
Sara,0.7647058823529411,0.6666666666666666,0.4
Ali,0.23529411764705882,0.5,0.6
David,1.0,1.0,0.28
Ahmed,0.0,0.0,1.0
Name,Age(years),Income (thousands),Hours_Worked
Emily,0.0,0.0,1.0
Aryan,1.0,1.0,0.0
Note that commas are used as the delimiter by default.
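If you ever need a different delimiter, to_csv accepts a sep argument; for example, to write a tab-separated file instead:
# write tab-separated values instead of the default commas
scaled_train_df_features.to_csv('train_features.tsv', sep='\t')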
The files we have created or are interacting with should show up in the file browser pane.
Alright, that's the end of this post. We have seen how to load data with Pandas, split the data and scale the features, steps that are usually part of the machine learning lifecycle. Thanks for reading!
To avoid being billed while the instance sits unused, make sure you stop the notebook instance with the stop option.
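If you prefer the command line over the console, the AWS CLI has an equivalent command; the instance name below is just a placeholder for whatever you named yours:
!aws sagemaker stop-notebook-instance --notebook-instance-name my-notebook-instance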