Reading and Manipulating Your Dataset With Pandas (2)
Marjan Ferdousi
Posted on December 19, 2020
Manipulation
Let's say you need to see only one column of your dataframe. To see the 'fixed acidity' column of our dataset, you need to write:
df['fixed acidity']
If you add a condition to this column, for example, if you want to see the rows that have a fixed acidity higher than 9:
df[df['fixed acidity']>9]
Sometimes you might need rows with multiple conditions added to columns:
df[(df['fixed acidity']>9) & (df['citric acid']>0.5)]
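Conditions can also be combined with `|` (OR) instead of `&` (AND); each condition still needs its own parentheses. A minimal sketch using a small stand-in frame, since the wine CSV is not loaded here:

```python
import pandas as pd

# Small stand-in for the wine dataset (illustrative values only).
df = pd.DataFrame({
    'fixed acidity': [7.4, 9.3, 10.1, 8.0],
    'citric acid':   [0.0, 0.6, 0.4, 0.7],
})

# AND: both conditions must hold for a row to be kept.
both = df[(df['fixed acidity'] > 9) & (df['citric acid'] > 0.5)]

# OR: a row is kept if either condition holds.
either = df[(df['fixed acidity'] > 9) | (df['citric acid'] > 0.5)]

print(len(both), len(either))  # 1 3
```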
If you need to select specific columns:
df.loc[:,['volatile acidity', 'chlorides']]
You may want to add conditions with them too, for example, you may want to see the 'volatile acidity' and 'chlorides' content of those rows that have a 'fixed acidity' of 9.2:
df.loc[df['fixed acidity'] == 9.2, ['fixed acidity','volatile acidity', 'chlorides']]
You can view the rows for specific indices (as discussed in the previous chapter) too, like this:
df.loc[0:3, ['volatile acidity', 'chlorides']]
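One subtlety worth knowing: unlike ordinary Python slicing, a label slice with loc includes both endpoints, so 0:3 returns four rows. A quick sketch with a toy frame (the wine CSV is not loaded here):

```python
import pandas as pd

# Toy frame standing in for the wine data.
df = pd.DataFrame({
    'volatile acidity': [0.70, 0.88, 0.76, 0.28, 0.66],
    'chlorides':        [0.076, 0.098, 0.092, 0.075, 0.069],
})

# .loc slices are inclusive of BOTH endpoints,
# so 0:3 selects the rows labeled 0, 1, 2 AND 3.
subset = df.loc[0:3, ['volatile acidity', 'chlorides']]
print(len(subset))  # 4
```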
Now, if you want to locate a specific value, for example, the alcohol content of the wine in row 0:
df['alcohol'].loc[0]
and you will get a value of 9.4
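As a side note, pandas also provides `at` for fast label-based access to a single value; with a default integer index it gives the same result as `df['alcohol'].loc[0]`. A sketch with a toy frame (assumed values, not the real CSV):

```python
import pandas as pd

# Toy frame standing in for the wine data.
df = pd.DataFrame({'alcohol': [9.4, 9.8, 10.0]})

# .at takes a row label and a column label and returns a single scalar.
value = df.at[0, 'alcohol']
print(value)  # 9.4
```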
You can locate a row by its integer position using iloc too:
df.iloc[100]
Now if you want to pinpoint a value within this row, for example, the attribute at column position 1 (volatile acidity in this case) of row 100, pass both positions to iloc:
df.iloc[100, 1]
and you will get 0.61 as expected.
You can select consecutive rows and columns with this iloc command as well, for example, the first three columns of rows 3 through 7:
df.iloc[3:8, 0:3]
and non-consecutive rows and columns too:
df.iloc[[71, 122, 400], [0, 2]]
What if you want to add a new column to your dataframe? Let's add a 'new column' containing the word 'hi' for all rows:
df['new column'] = 'hi'
df.head()
Let's try changing the value of 'new column' at index 0 of the dataframe from 'hi' to 'bye' using iloc:
df.iloc[0, df.columns.get_loc('new column')] = 'bye'
df.head()
Now let's try to find the word that starts with 'by' (the one we just added) and replace it with 'hello'. Assigning through df.loc with a row mask and the column label avoids pandas' chained-assignment warning:
df.loc[df['new column'].str.startswith('by'), 'new column'] = 'hello'
df.head()
You can also replace null values of your data using pandas. We do not have any null values here, so let's introduce a null value first. Let's replace the string 'hello' with null. To do so, we would need the numpy library.
import numpy as np
df.loc[df['new column'].str.startswith('hel'), 'new column'] = np.nan
df.head()
To check the number of null values, you can use the isna() method like this:
df.isna().sum()
This isna() method can also be used to show where the null values are, like this:
pd.isna(df.head())
Let's replace the null value with 'hey'.
df.fillna(value='hey', inplace=True)
df.head()
If you want to drop null values, use the dropna() method.
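A short sketch of dropna() on a toy frame (the column names here are illustrative, not from the wine data): by default it drops any row containing at least one null, and the `subset` parameter restricts the check to specific columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some null values.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', 'y', None],
})

# Drop every row that contains at least one null value.
cleaned = df.dropna()

# Drop rows only when column 'a' is null.
cleaned_a = df.dropna(subset=['a'])

print(len(cleaned), len(cleaned_a))  # 1 2
```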
Now we will try to create a new dataframe using a loop, where one column of the new dataframe would look the same as the 'new column' of our dataframe df.
rows = []
for i in range(df.shape[0]):
    rows.append(['hi', 'bye'])
df_new = pd.DataFrame(rows, columns=["new column 2", "new column 3"])
df_new.iloc[0, df_new.columns.get_loc('new column 2')] = 'hey'
df_new.head()
You can merge these two dataframes using their common attributes:
df_merged = df.merge(df_new, left_on='new column', right_on='new column 2')
df_merged.head()
You can vary your merging operations as needed, for example by dropping a redundant key column after merging, or by using on instead of left_on/right_on when both dataframes share a column name, and so on.
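A small sketch of these variations on hypothetical toy frames (column names here are illustrative): when both frames share the key column name, `on` replaces `left_on`/`right_on`, and the `how` parameter controls which rows survive the merge.

```python
import pandas as pd

# Hypothetical frames sharing a key column called 'key'.
left = pd.DataFrame({'key': ['hi', 'hey'], 'v1': [1, 2]})
right = pd.DataFrame({'key': ['hi', 'bye'], 'v2': [3, 4]})

# Default inner join: keeps only keys present in BOTH frames.
inner = left.merge(right, on='key')

# Left join: keeps every row of 'left'; unmatched values become NaN.
left_join = left.merge(right, on='key', how='left')

print(len(inner), len(left_join))  # 1 2
```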
You can also group your dataframes:
df.groupby(['volatile acidity', 'chlorides']).count().head()
You can also aggregate the groups with other functions, such as sum().
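For example, summing one attribute within each group, sketched here on a toy frame with assumed values rather than the real wine CSV:

```python
import pandas as pd

# Toy frame standing in for the wine data.
df = pd.DataFrame({
    'quality': [5, 5, 6, 6],
    'alcohol': [9.4, 9.8, 10.0, 11.0],
})

# Total alcohol per quality group (sum instead of count).
sums = df.groupby('quality')['alcohol'].sum()
print(sums)
```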
When you are done with manipulation of your dataframes, you are ready to visualize your data.