Laid-Back Approaches to Clean Data
Alit Darma Putra
Posted on January 10, 2024
Because the quality of a model's output depends heavily on the quality of its input data, the input data needs careful preparation. One way to improve data quality is to apply data cleaning techniques.
What is Data Cleaning?
Data cleaning is a technique that aims to improve data quality by identifying and eliminating errors and inconsistencies in data.
Here I will share some data cleaning methods along with simple implementations of each.
There are four commonly used methods in data cleaning:
- Scaling Feature Values
- Handling Extreme Outliers
- Binning
- Scrubbing
Scaling Feature Values
Feature scaling is the process of normalizing the range of features in a dataset. In real cases, feature ranges vary greatly. If one feature has a much wider range of values than the others, it will dominate the calculations of the algorithm used. Therefore, the ranges of all features should be normalized so that each feature contributes comparably. Several techniques can be used to perform feature scaling, including:
Absolute Maximum Scaling
Absolute Maximum Scaling is a scaling technique that is carried out based on the absolute maximum value of each feature. The stages of this technique are:
- Determine the maximum absolute value of the feature in the data set.
- Divide all values in the column by the maximum value.
import matplotlib.pyplot as plt
def max_absolute_scaling(data):
    # Determines the maximum absolute value
    max_abs_value = max(map(abs, data))
    # Divide each value by the maximum absolute value
    scaled_data = []
    for x in data:
        scaled_data.append(x/max_abs_value)
    return scaled_data
data = [3, -1, 6, 2, -4]
scaled_data = max_absolute_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.5, -0.16666666666666666, 1.0, 0.3333333333333333, -0.6666666666666666]
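For comparison, scikit-learn provides MaxAbsScaler, which does the same thing. The following is a minimal sketch, assuming scikit-learn is installed; note that the scaler expects a 2D array with one column per feature.
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

# MaxAbsScaler works column-wise on a 2D array, so reshape the list first
data = np.array([3, -1, 6, 2, -4]).reshape(-1, 1)
scaled_data = MaxAbsScaler().fit_transform(data)
print(scaled_data.ravel())  # approximately [0.5, -0.167, 1.0, 0.333, -0.667]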
Min-Max Scaling
Min-Max Scaling is a scaling technique in which each value in the dataset is reduced by the minimum value and then divided by the range of the dataset (maximum minus minimum). After applying this technique, all feature values lie between 0 and 1. Like absolute maximum scaling, this technique is susceptible to outliers.
import matplotlib.pyplot as plt
def min_max_scaling(data):
    # Determine the maximum and minimum values
    max_value = max(data)
    min_value = min(data)
    # Reduces each value by the minimum value
    # then divided by the range of dataset values
    scaled_data = []
    for x in data:
        scaled_data.append(
            (x-min_value)/(max_value-min_value)
        )
    return scaled_data
data = [3, -1, 6, 2, -4]
scaled_data = min_max_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.7, 0.3, 1.0, 0.6, 0.0]
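The same result can be obtained with scikit-learn's MinMaxScaler; a minimal sketch, assuming scikit-learn is available:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# MinMaxScaler also expects a 2D array (one column per feature)
data = np.array([3, -1, 6, 2, -4]).reshape(-1, 1)
scaled_data = MinMaxScaler().fit_transform(data)
print(scaled_data.ravel())  # [0.7, 0.3, 1.0, 0.6, 0.0]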
Normalization
Normalization (mean normalization) is a scaling technique similar to min-max scaling, except that each feature value is reduced by the mean of the dataset instead of the minimum. The result is then divided by the range of the dataset (maximum minus minimum).
import matplotlib.pyplot as plt
from statistics import mean
def normalization(data):
    # Determine maximum, minimum, average values
    max_value = max(data)
    min_value = min(data)
    mean_value = mean(data)
    # Subtract each value by the average value
    # then divided by the range of dataset values
    scaled_data = []
    for x in data:
        scaled_data.append(
            (x-mean_value)/(max_value-min_value)
        )
    return scaled_data
data = [3, -1, 6, 2, -4]
scaled_data = normalization(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.18, -0.22000000000000003, 0.48, 0.08, -0.52]
Standardization (Z-score Normalization)
Standardization is a scaling technique in which the mean is subtracted from each feature value and the result is divided by the standard deviation; the resulting value is usually called the z-score. The scaled data is centered on zero with a standard deviation of 1. This technique is suitable when a feature follows a roughly normal distribution, such as salary or age.
import matplotlib.pyplot as plt
from statistics import mean, stdev
def standardization(data):
    # Determine the average value, standard deviation
    mean_value = mean(data)
    stdev_value = stdev(data)
    # Subtract each value by the average value
    # then divided by the standard deviation
    scaled_data = []
    for x in data:
        scaled_data.append(
            (x-mean_value)/(stdev_value)
        )
    return scaled_data
data = [3, -1, 6, 2, -4]
scaled_data = standardization(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.469476477861571, -0.5738045840530313, 1.2519372742975226, 0.20865621238292043, -1.3562653804889828]
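scikit-learn's StandardScaler implements the same idea; a minimal sketch, assuming scikit-learn is installed. Note that StandardScaler uses the population standard deviation (ddof=0), so its numbers differ slightly from the statistics.stdev-based example above.
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([3, -1, 6, 2, -4]).reshape(-1, 1)
scaled_data = StandardScaler().fit_transform(data)
# Values are slightly larger than above because the population
# standard deviation (ddof=0) is used instead of the sample one
print(scaled_data.ravel())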
Robust Scaling
In the robust scaling technique, each data point is reduced by the median and then divided by the interquartile range (IQR), which is the difference between the upper quartile (Q3) and the lower quartile (Q1). Because the median and IQR are insensitive to extreme values, this technique is less affected by outliers.
import matplotlib.pyplot as plt
import pandas as pd
def robust_scaling(data):
    series = pd.Series(data)
    # Determine the median value, IQR
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    IQR = q3 - q1
    # Subtract each value by the median value
    # then divided by the IQR value
    scaled_data = []
    for x in data:
        scaled_data.append(
            (x-median)/IQR
        )
    return scaled_data
data = [3, -1, 6, 2, -4]
scaled_data = robust_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.25, -0.75, 1.0, 0.0, -1.5]
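scikit-learn's RobustScaler centers on the median and scales by the IQR in the same way; a minimal sketch, assuming scikit-learn is installed:
from sklearn.preprocessing import RobustScaler
import numpy as np

data = np.array([3, -1, 6, 2, -4]).reshape(-1, 1)
scaled_data = RobustScaler().fit_transform(data)
print(scaled_data.ravel())  # [0.25, -0.75, 1.0, 0.0, -1.5]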
Scaling to Vector Unit Length
Scaling to Vector Unit Length is a scaling technique that is carried out by transforming the components of a feature vector so that the transformed vector has a length of 1. In this technique, each feature value is divided by the vector length.
This technique can only be applied when ||X|| > 0.
import matplotlib.pyplot as plt
import numpy as np
def vector_normalization(data):
    vector = np.array(data)
    # Determine the length (L2 norm) of the vector
    magnitude = np.linalg.norm(vector)
    # Normalize the vector to unit length
    scaled_data = vector/magnitude
    return scaled_data
data = [3, -1, 6, 2, -4]
scaled_data = vector_normalization(data)
print("Data:", data)
print("Scaled Data:", scaled_data.tolist())
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.3692744729379982, -0.12309149097933272, 0.7385489458759964, 0.24618298195866545, -0.4923659639173309]
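The same unit-length scaling is available as sklearn.preprocessing.normalize; a minimal sketch, assuming scikit-learn is installed. The function works row-wise, so the values are passed as a single row.
from sklearn.preprocessing import normalize
import numpy as np

data = np.array([[3, -1, 6, 2, -4]])  # one row = one sample
scaled_data = normalize(data, norm='l2')
print(scaled_data.ravel())  # approximately [0.369, -0.123, 0.739, 0.246, -0.492]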
Handling Extreme Outliers
Outliers are values that differ greatly from the majority of the data in a dataset. Outliers may represent natural variation in the population, but in most cases they are caused by errors in the data collection process, such as incorrect data entry, equipment failure, or other measurement errors. If outliers are not handled, they can distort the results of statistical analysis and reduce the accuracy of the model being developed.
Outlier Detection
Outlier Detection with the Sorting Method
The sorting method is the simplest way to detect outliers. Quantitative data can be sorted from low to high, and values that are conspicuously low or high can be spotted manually. In Python, sorting can be done with the built-in sorted function.
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
sorted_data = sorted(data)
print("Data:", data)
print("Sorted Data:", sorted_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
Sorted Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27, 100]
The value 100 is very high compared to the rest of the data, so it can be flagged as an outlier. However, this method is less rigorous because outliers are not determined using statistical calculations.
Outlier Detection using the Histogram Method
Histograms help visualize data and reveal whether a set of values contains outliers: data that falls far outside the main body of the distribution can be flagged as an outlier. The disadvantage of this method is the same as the sorting method: outliers are determined only from visual observation of the data.
import matplotlib.pyplot as plt
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.hist(data)
plt.show()
Output: (histogram of the data)
In the histogram above, most of the values are concentrated at the lower end of the range, below 40. One value, 100, sits apart from the rest and can be categorized as an outlier.
Outlier Detection with Box-Plot
A box plot is a graphical summary of the sample distribution that describes the shape of the distribution (skewness), its central tendency, and its spread. Five statistics can be read from a box plot: the minimum, maximum, Q1, median, and Q3. Values that fall outside the box and whiskers can be categorized as outliers.
import matplotlib.pyplot as plt
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.boxplot(data, vert=False)
plt.title("Detecting outliers using Boxplot")
plt.xlabel('Data')
plt.show()
Output: (box plot of the data)
The value 100, which lies outside the whisker, can be categorized as an outlier.
Outlier Detection with Z-Score
The criterion for detecting outliers with the z-score is that any data point whose z-score lies more than 3 standard deviations from the mean is an outlier. The stages of this technique are:
- For all data points, calculate the z-score value using the formula (Xi-mean)/std.
- Initialize the threshold value=3 and mark data points that have an absolute z-score value greater than the threshold as outliers.
import statistics as s
def detect_outliers_zscore(data):
    # Threshold initialization
    thres = 3
    # Determine the average value and standard deviation
    mean = s.mean(data)
    std = s.stdev(data)
    outliers = []
    # Determine the z-score value for each data point
    for i in data:
        z_score = (i-mean)/std
        # Check whether the data point is an outlier
        if (abs(z_score) > thres):
            outliers.append(i)
    return outliers
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_zscore(data)
print("Outliers:", outliers)
Output:
Outliers: [100]
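The same check can be written more compactly with numpy boolean indexing; a minimal sketch, assuming numpy is available (ddof=1 matches statistics.stdev used above):
import numpy as np

data = np.array([16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27])
# ddof=1 gives the sample standard deviation, like statistics.stdev
z_scores = (data - data.mean()) / data.std(ddof=1)
print(data[np.abs(z_scores) > 3].tolist())  # [100]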
Outlier Detection with Inter Quartile Range (IQR)
Based on the interquartile range (IQR), a data point is detected as an outlier if it lies more than 1.5 times the IQR above Q3 or below Q1. The stages in determining outliers with the IQR are:
- Sort the data in ascending order
- Calculate 1st and 3rd quartiles (Q1, Q3)
- Calculate the value of IQR=Q3-Q1
- Calculate the lower limit = (Q1 - 1.5*IQR) and the upper limit = (Q3 + 1.5*IQR)
- For all data in the data set, check whether any data is below the lower limit and above the upper limit. Then mark the data as an outlier.
import pandas as pd
def detect_outliers_iqr(data):
    series = pd.Series(data)
    # Determine Q1, Q3, IQR values
    Q1, Q3 = series.quantile([0.25, 0.75])
    IQR = Q3-Q1
    # Determine lower bound and upper bound
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = []
    # Check whether each data point falls outside the bounds
    for i in data:
        if (i < lower_bound or i > upper_bound):
            outliers.append(i)
    return outliers
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_iqr(data)
print("Outliers:", outliers)
Output:
Outliers: [100]
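With pandas, the same bounds check can be expressed as a boolean mask over a Series; a short sketch under the same assumptions as the example above:
import pandas as pd

series = pd.Series([16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27])
Q1, Q3 = series.quantile([0.25, 0.75])
IQR = Q3 - Q1
# Mark every value that falls below the lower bound or above the upper bound
mask = (series < Q1 - 1.5 * IQR) | (series > Q3 + 1.5 * IQR)
print(series[mask].tolist())  # [100]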
Outlier Handling
After successfully detecting outlier data in the dataset, the next stage is to handle the outlier data. There are several ways that can be done to handle outlier data that has been detected.
Trimming
In this technique, detected outliers are simply removed from the dataset. However, this is generally not considered best practice because it discards information.
def trimming(data, outliers):
    # Keep only the values that are not listed as outliers
    new_data = [x for x in data if x not in outliers]
    return new_data
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = trimming(data, outlier)
print("Data:", data)
print("New Data:", new_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27]
Quantile Based Flooring and Capping
In this technique, outliers are limited to certain percentile values: values above the 90th percentile are capped at the 90th percentile, and values below the 10th percentile are floored at the 10th percentile.
import pandas as pd
def handle_quantile_outlier(data):
    series = pd.Series(data)
    # Determine the 10th and 90th percentiles
    P10 = series.quantile(0.1)
    P90 = series.quantile(0.9)
    new_data = []
    # Replace the data value with P10 for data < P10
    # and with P90 for data > P90
    for x in data:
        if x < P10:
            x = P10
        elif x > P90:
            x = P90
        new_data.append(x)
    return new_data
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_quantile_outlier(data)
print("Data:", data)
print("New Data:", new_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [17.1, 17.1, 18, 19, 20, 21, 22, 23, 24, 26.9, 26, 26.9]
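pandas can also perform the flooring and capping in one call with Series.clip; a minimal sketch that should reproduce the values above:
import pandas as pd

series = pd.Series([16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27])
P10, P90 = series.quantile([0.1, 0.9])
# clip floors values below P10 and caps values above P90
new_data = series.clip(lower=P10, upper=P90)
print(new_data.tolist())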
Mean/Median Imputation
The mean is strongly influenced by the presence of outliers, so it is usually better to replace detected outliers with the median.
import statistics as s
def handle_median_outlier(data, outliers):
    # Determine median value
    median = s.median(data)
    # Replace each outlier with the median value
    new_data = []
    for x in data:
        if x in outliers:
            new_data.append(median)
        else:
            new_data.append(x)
    return new_data
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = handle_median_outlier(data, outlier)
print("Data:", data)
print("New Data:", new_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 21.5, 26, 27]
Log Transformation
Log transformation is a common technique used to reduce the skew in a distribution and make it more symmetric. In this way, the occurrence of extreme values can be reduced and the data becomes more normally distributed.
import math
import matplotlib.pyplot as plt
def handle_log_outlier(data):
    new_data = []
    # Transform each value with a base-10 logarithm
    for x in data:
        new_data.append(math.log(x, 10))
    return new_data
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_log_outlier(data)
print("Data:", data)
print("New Data:", new_data)
plt.plot(data, 'red', label='Data')
plt.plot(new_data, 'blue', label='New Data')
plt.legend()
plt.show()
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [1.2041199826559246, 1.2304489213782739, 1.2552725051033058, 1.2787536009528289, 1.301029995663981, 1.322219294733919, 1.3424226808222062, 1.3617278360175928, 1.380211241711606, 2.0, 1.414973347970818, 1.4313637641589871]
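The same transformation can be done in one step with numpy; a small sketch, assuming all values are positive (the logarithm is undefined for zero or negative values):
import numpy as np

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
# Apply a base-10 logarithm to every value at once
new_data = np.log10(data)
print(new_data.tolist())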
Binning
Data binning is a method of separating or grouping continuous numerical values into discrete intervals called "bins" or "groups". Data grouping methods can be used to simplify data distribution and assist in statistical analysis and visualization. There are several techniques that are often used to group data, including:
Equal Width Binning
This technique groups data into a predetermined number of intervals (bins) of equal width. Although this method is simple, it handles skewed distributions poorly.
import pandas as pd
data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)
num_bins = 4
# Calculate the bin width
bin_width = (df['age'].max() - df['age'].min()) / num_bins
# Create bin limits
bin_edges = [df['age'].min() + i * bin_width for i in range(num_bins + 1)]
df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True, right=True)
print(df)
Output:
age age_bins
0 16 (15.999, 18.75]
1 17 (15.999, 18.75]
2 18 (15.999, 18.75]
3 19 (18.75, 21.5]
4 20 (18.75, 21.5]
5 21 (18.75, 21.5]
6 22 (21.5, 24.25]
7 23 (21.5, 24.25]
8 24 (21.5, 24.25]
9 20 (18.75, 21.5]
10 26 (24.25, 27.0]
11 27 (24.25, 27.0]
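pandas can also compute the equal-width edges itself when an integer is passed to pd.cut; a short sketch (pd.cut pads the lowest edge slightly, so the interval labels may differ marginally from the manual edges above):
import pandas as pd

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)
# Passing an integer lets pd.cut derive four equal-width bins on its own
df['age_bins'] = pd.cut(df['age'], bins=4)
print(df)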
Equal Frequency Binning
In this technique, data is grouped into bins so that each bin contains approximately the same number of data points. It is useful when keeping the frequency the same across bins matters, and it also deals effectively with outliers and skewed data.
import pandas as pd
data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)
df['age_bins'] = pd.qcut(df['age'], q=3)
print(df)
Output:
age age_bins
0 16 (15.999, 19.667]
1 17 (15.999, 19.667]
2 18 (15.999, 19.667]
3 19 (15.999, 19.667]
4 20 (19.667, 22.333]
5 21 (19.667, 22.333]
6 22 (19.667, 22.333]
7 23 (22.333, 27.0]
8 24 (22.333, 27.0]
9 20 (19.667, 22.333]
10 26 (22.333, 27.0]
11 27 (22.333, 27.0]
Quantile Binning
In this technique, data is grouped based on percentile values. The limits of a bin are based on certain percentile values (e.g. 25th, 50th, and 75th percentiles).
import pandas as pd
import numpy as np
data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)
# Defines percentiles for bin boundaries
percentiles = [0, 25, 50, 75, 100] # In this case, quartiles are used
# Calculate the bin edges at those percentiles
bin_edges = np.percentile(df['age'], percentiles)
df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True)
print(df)
Output:
age age_bins
0 16 (15.999, 18.75]
1 17 (15.999, 18.75]
2 18 (15.999, 18.75]
3 19 (18.75, 20.5]
4 20 (18.75, 20.5]
5 21 (20.5, 23.25]
6 22 (20.5, 23.25]
7 23 (20.5, 23.25]
8 24 (23.25, 27.0]
9 20 (18.75, 20.5]
10 26 (23.25, 27.0]
11 27 (23.25, 27.0]
Scrubbing
Data scrubbing is the process of changing or deleting incomplete, incorrect, inaccurate, or repetitive data in a dataset. Carrying out this process helps improve data consistency, accuracy, and reliability.
Deleting Duplicate Data
Deleting duplicate records is one way to perform data scrubbing. Duplicate data often appears when the dataset is assembled from several different sources.
import pandas as pd
data = {'age': [15, 17, 23, 22, 17],
'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)
print("Data:")
print(df)
# Check if data is duplicated
duplicate_data = df[df.duplicated()]
print("Duplicate Data:")
print(duplicate_data)
# Delete duplicate data
df = df.drop_duplicates()
print("New Data:")
print(df)
Output:
Data:
age height
0 15 155
1 17 162
2 23 165
3 22 170
4 17 162
Duplicate Data:
age height
4 17 162
New Data:
age height
0 15 155
1 17 162
2 23 165
3 22 170
Handling Missing Data
In real cases, datasets usually contain a lot of missing data. The causes vary widely, from data corruption to device failure while recording measurements.
Deleting Missing Data
Missing data can be resolved by deleting rows or columns of data that have NULL values.
import pandas as pd
import numpy as np
data = {'age': [15, 17, 23, np.nan, 17],
'height': [155, 162, np.nan, 170, 162]}
df = pd.DataFrame(data)
print("Data:")
print(df)
# Deletes rows of missing data
df.dropna(axis=0, inplace=True)
print("New Data:")
print(df)
Output:
Data:
age height
0 15.0 155.0
1 17.0 162.0
2 23.0 NaN
3 NaN 170.0
4 17.0 162.0
New Data:
age height
0 15.0 155.0
1 17.0 162.0
4 17.0 162.0
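Columns can be removed in the same way by passing axis=1; a small sketch with hypothetical data in which only the height column contains a missing value, so only that column is dropped:
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [15, 17, 23],
                   'height': [155, np.nan, 165]})
# axis=1 drops any column that contains at least one NaN
print(df.dropna(axis=1))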
Pros:
• A model trained only on complete rows is unaffected by imputation noise, which can make it more robust.
Cons:
• Losing a lot of information.
• Works poorly if the percentage of missing values is too large compared to the data set.
Fill in missing data with mean/median/mode values
Missing values in numeric columns can be replaced with the mean, median, or mode of the other values in that column. Unlike the previous method, this technique prevents data loss.
import pandas as pd
import numpy as np
data = {'age': [15, 17, 23, np.nan, 17],
'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)
print("Data:")
print(df)
df_filled_mean = df.copy()
df_filled_median = df.copy()
df_filled_mode = df.copy()
# Determine mean/median/mode value
mean = df['age'].mean()
median = df['age'].median()
mode = df['age'].mode().values[0]
# Fill in the data with the mean/median/mode value
df_filled_mean['age'] = df_filled_mean['age'].fillna(mean)
df_filled_median['age'] = df_filled_median['age'].fillna(median)
df_filled_mode['age'] = df_filled_mode['age'].fillna(mode)
print("New Data:")
print("Filled mean:")
print(df_filled_mean)
print("Filled median:")
print(df_filled_median)
print("Filled mode:")
print(df_filled_mode)
Output:
Data:
age height
0 15.0 155
1 17.0 162
2 23.0 165
3 NaN 170
4 17.0 162
New Data:
Filled mean:
age height
0 15.0 155
1 17.0 162
2 23.0 165
3 18.0 170
4 17.0 162
Filled median:
age height
0 15.0 155
1 17.0 162
2 23.0 165
3 17.0 170
4 17.0 162
Filled mode:
age height
0 15.0 155
1 17.0 162
2 23.0 165
3 17.0 170
4 17.0 162
Pros:
• Prevent data loss resulting in deleted rows or columns
• Works well with small data sets and is easy to implement.
Cons:
• Only works with numeric continuous variables.
• May cause data leakage.
Fill in Missing Data in Categorical Columns
When missing data is found in a categorical column (whether character or numeric), it can be filled in with the most frequent category. If a large portion of the data is missing, it can instead be replaced with a new category (a short sketch of this option follows the example below).
import pandas as pd
import numpy as np
data = {'age': [15, 17, 23, 20, 17],
'impression': ['good', 'fair', 'fair', 'very good', np.nan]}
df = pd.DataFrame(data)
print("Data:")
print(df)
most_category = df['impression'].mode().values[0]
# Fill in the data with the highest category frequency
df['impression'] = df['impression'].fillna(most_category)
print("New Data:")
print(df)
Output:
Data:
age impression
0 15 good
1 17 fair
2 23 fair
3 20 very good
4 17 NaN
New Data:
age impression
0 15 good
1 17 fair
2 23 fair
3 20 very good
4 17 fair
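A minimal sketch of the "new category" option mentioned earlier: missing values are replaced with a placeholder label (the name 'unknown' is chosen here purely for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'impression': ['good', 'fair', np.nan, 'very good', np.nan]})
# Treat missing entries as their own category instead of guessing a value
df['impression'] = df['impression'].fillna('unknown')
print(df)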
Pros:
• Prevent data loss resulting in deleted rows or columns
• Works well with small data sets and is easy to implement.
• Eliminate data loss by adding unique categories
Cons:
• Only works with categorical variables.
• Adding a new category introduces an extra feature during encoding, which may hurt model performance.
Data Type Conversion
Most machine learning models cannot work directly with categorical data, so categorical data needs to be converted into numerical data. One technique that can be used is one-hot encoding, which represents a categorical variable as a binary vector.
import numpy as np
# Categorical data to be converted
colors = ["red", "green", "yellow", "red", "blue"]
# Color list
total_colors = ["red", "green", "blue", "black", "yellow"]
# map each color to numeric
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x
one_hot_encode = []
# Convert the numeric value of each data
for c in colors:
    arr = list(np.zeros(len(total_colors), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)
print(one_hot_encode)
Output:
[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
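For comparison, pandas offers get_dummies, which builds one binary column per category found in the data; a minimal sketch (unlike the manual example above, categories that never appear, such as "black", get no column):
import pandas as pd

colors = pd.Series(["red", "green", "yellow", "red", "blue"])
# One binary column is created for each distinct value in the Series
one_hot = pd.get_dummies(colors)
print(one_hot)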
Deleting Irrelevant Data
Data is considered irrelevant when it has no bearing on the problem being studied; such attributes can simply be dropped from the dataset.
import pandas as pd
data = {'age': [15, 17, 23],
'email': ['ahmad@gmail.com', 'putra@yahoo.com', 'tegar@gmail.com']}
df = pd.DataFrame(data)
print("Data:")
print(df)
# Remove irrelevant attributes
df = df.drop('email', axis=1)
print("New Data:")
print(df)
Output:
Data:
age email
0 15 ahmad@gmail.com
1 17 putra@yahoo.com
2 23 tegar@gmail.com
New Data:
age
0 15
1 17
2 23
Avoiding Structural Errors
Structural errors include typos, inconsistent naming conventions, incorrect capitalization, and so on. The following example fixes inconsistent capitalization in a categorical feature:
import pandas as pd
import numpy as np
data = {'age': [15, 17, 23, 20, 17],
'impression': ['good', 'Fair', 'fair', 'Very good', 'Good']}
df = pd.DataFrame(data)
print("Data:")
print(df)
# Fixed letter capitalization
df['impression'] = df['impression'].str.lower()
print("New Data:")
print(df)
Output:
Data:
age impression
0 15 good
1 17 Fair
2 23 fair
3 20 Very good
4 17 Good
New Data:
age impression
0 15 good
1 17 fair
2 23 fair
3 20 very good
4 17 good
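The same idea extends to other structural errors; a small sketch that also strips stray whitespace and maps a hypothetical misspelling to the correct label (the typo mapping here is invented for illustration):
import pandas as pd

impression = pd.Series(['good ', 'Fiar', 'fair', 'Very good', 'Good'])
# Remove surrounding whitespace, normalize case, then fix known typos
cleaned = impression.str.strip().str.lower().replace({'fiar': 'fair'})
print(cleaned.tolist())  # ['good', 'fair', 'fair', 'very good', 'good']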
Closing
In conclusion, employing effective data cleaning methods not only enhances the reliability of your analyses but also paves the way for informed decision-making, ensuring that your data-driven work is built on a solid foundation of accuracy and integrity.