Pandas - Brief
Soujanya Satpute
Posted on June 14, 2022
What is pandas?
Pandas is a Python package built on top of two other Python packages: NumPy, which it uses to store data, and Matplotlib, which it uses for plotting.
Pandas has 14 million users.
DataFrame: a 2-dimensional, mutable, tabular data structure whose columns can hold different data types (heterogeneous).
- .info() Method: Generates a summary of the DataFrame with column names, non-null counts, dtypes, and memory usage.
- .head() Method: Returns the first few rows (the “head” of the DataFrame).
- .describe() Method: Calculates summary statistics such as mean, max, standard deviation, and percentiles for the numeric columns.
- .values Attribute: Returns a NumPy representation of the DataFrame. The newer method .to_numpy() should be used rather than .values.
- .columns Attribute: Lists the column labels of the DataFrame. The data types of the columns are available from .dtypes.
- .index Attribute: Lists the row labels of the DataFrame. By default these are the row numbers.
- .shape Attribute: Returns a tuple with the number of rows and the number of columns.
- .size Attribute: Returns the total number of elements in the DataFrame.
- .ndim Attribute: Returns the number of dimensions of the DataFrame.
- DataFrame column selecting
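A single column can be selected with square brackets or with dot notation. You can also select multiple columns from a DataFrame with the double square bracket syntax. The outer square brackets are the DataFrame selection syntax, and the inner square brackets create a list of column names.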
column1 = dataFrame['columnName']
column1 = dataFrame.columnName
column1 = dataFrame[['columnName', 'col2']]
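To try the inspection attributes and the column selection syntax described above, here is a minimal sketch. The `dogs` DataFrame and its column names are made up for illustration, they are not from any real dataset.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
dogs = pd.DataFrame({
    "dog": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chihuahua"],
    "height_cm": [56, 43, 46],
})

dogs.info()                # column names, non-null counts, dtypes, memory usage
print(dogs.head(2))        # first two rows
print(dogs.describe())     # summary statistics for the numeric column
print(dogs.shape)          # (3, 3)
print(dogs.size)           # 9
print(dogs.ndim)           # 2
as_array = dogs.to_numpy() # preferred over dogs.values

# Single column: both forms return a pandas Series
breed_brackets = dogs["breed"]
breed_dot = dogs.breed

# Multiple columns: the inner brackets are a list, the result is a DataFrame
subset = dogs[["dog", "height_cm"]]
print(subset)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the square bracket form is the safer default.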
- DataFrame row selecting with logical testing
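- And/or operators in row selection: Conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses.
- Specific value row selection: This selects the rows where the given column is equal to 'Value'. Other comparison operators (such as >, < and !=) can be used here as well.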
row1 = dataFrame[dataFrame.column == 'Value']
row1 = dataFrame[dataFrame['column'] == 'Value']
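Here is a small sketch of row selection with logical tests, including the and/or operators. The `dogs` data is hypothetical and only serves the example.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Labrador", "Chihuahua"],
    "color": ["Brown", "Black", "Black", "Tan"],
    "height_cm": [56, 43, 59, 18],
})

# One condition: the comparison gives a boolean Series used as a row mask
tall = dogs[dogs["height_cm"] > 50]

# "And": both conditions must hold, each one wrapped in parentheses
black_labs = dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "Black")]

# "Or": at least one condition must hold
black_or_tan = dogs[(dogs["color"] == "Black") | (dogs["color"] == "Tan")]

print(tall)
print(black_labs)
print(black_or_tan)
```

The parentheses matter because & and | bind more tightly than == and > in Python.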
- Sorting DataFrame:
sortedDataFrame = dataFrame.sort_values('column_to_sort')
sortedDataFrame = dataFrame.sort_values(by = ['column_to_sort1', 'column_to_sort2'])
Sorting can be performed on numbers, strings, and dates.
Extra parameters:
ascending = True / False (or a list with one value per sort column),
na_position = 'first' / 'last': where to put NaN values.
Example:
homelessness_reg_fam = homelessness.sort_values(['region','family_members'],ascending=[True,False])
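A runnable sketch of the same sorting options, using a made-up `dogs` DataFrame (an assumption for illustration) with one missing weight to show `na_position`:

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this sketch
dogs = pd.DataFrame({
    "breed": ["Poodle", "Labrador", "Labrador", "Chihuahua"],
    "weight_kg": [np.nan, 25.0, 29.0, 2.0],
})

# Heaviest first, with the missing value placed at the top
by_weight = dogs.sort_values("weight_kg", ascending=False, na_position="first")

# Breed A to Z, and within each breed heaviest first
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

print(by_weight)
print(by_breed_weight)
```

sort_values() returns a new DataFrame and keeps the original row labels, which is why the index looks shuffled after sorting.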
- .isin() Method: .isin() is used to filter a DataFrame, keeping the rows where a particular column matches any of the values in a given list.
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness.state.isin(canu)]
- Adding a new column to a DataFrame: This is also called mutating or transforming the DataFrame, or feature engineering.
dataframe['new_column'] = old_column.some_transformation
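As a concrete version of the pattern above, here is a minimal sketch that derives new columns from existing ones. The data and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
dogs = pd.DataFrame({
    "height_cm": [56, 43, 46],
    "weight_kg": [25, 23, 22],
})

# New column computed from one existing column
dogs["height_m"] = dogs["height_cm"] / 100

# New column computed from two columns, row by row
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs)
```

Assigning to a column name that does not exist yet adds the column at the end of the DataFrame.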
- Summary Statistics
Summary statistics are a way to summarise and learn more about your data.
mean(), median(), mode(), min(), max(), var(), std(), sum(), quantile(), agg()
The agg() method is used to calculate custom summary statistics. agg() can also take several functions at once, passed as a list. An example of a custom percentile is as follows.
def percentile30(column):
    # 30th percentile of the column
    return column.quantile(0.3)
dataFrame['columnName'].agg(percentile30)
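To show agg() with a single custom function and with a list of functions, here is a small sketch. The `dogs` data and the extra `percentile40` helper are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
dogs = pd.DataFrame({
    "weight_kg": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
    "height_cm": [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
})

def percentile30(column):
    # 30th percentile of the column
    return column.quantile(0.3)

def percentile40(column):
    # 40th percentile of the column (hypothetical second helper)
    return column.quantile(0.4)

# One custom function on one column gives a single number
single = dogs["weight_kg"].agg(percentile30)

# A list of functions on several columns gives one row per function
several = dogs[["weight_kg", "height_cm"]].agg([percentile30, percentile40, "median"])

print(single)
print(several)
```

Built-in statistics can be passed by name as strings, so custom and built-in functions can be mixed in the same list.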
Functions like min() and max() also work on date columns.
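A quick sketch of min() and max() on dates, assuming a hypothetical `date_of_birth` column that has been converted with pd.to_datetime():

```python
import pandas as pd

# Hypothetical data, invented for this sketch
dogs = pd.DataFrame({
    "dog": ["Bella", "Charlie", "Lucy"],
    "date_of_birth": pd.to_datetime(["2013-07-01", "2016-09-16", "2014-08-25"]),
})

oldest = dogs["date_of_birth"].min()    # earliest date
youngest = dogs["date_of_birth"].max()  # latest date
age_gap = youngest - oldest             # a Timedelta between the two dates

print(oldest, youngest, age_gap)
```

This only behaves as a date comparison when the column has a datetime dtype. If the dates are stored as plain strings, min() and max() compare them alphabetically instead.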
Calculating Cumulative Statistics
cumsum(), cummax(), cummin(), cumprod()
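The cumulative methods return one value per row, each computed over all rows up to and including that row. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical data, invented for this sketch
sales = pd.DataFrame({"units": [3, 1, 4, 2]})

sales["cum_sum"] = sales["units"].cumsum()    # running total
sales["cum_max"] = sales["units"].cummax()    # largest value seen so far
sales["cum_min"] = sales["units"].cummin()    # smallest value seen so far
sales["cum_prod"] = sales["units"].cumprod()  # running product

print(sales)
```

Unlike sum() or max(), which collapse a column to a single number, these keep the length of the original column.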
To be Continued...