Pandas 1.0.0 - jan/2020 - What's new?
Cezar Peixeiro
Posted on February 2, 2020
Pandas is a famous Python library for data wrangling, and in January 2020 its development team made a major release! Yes, big changes were made in this library, which took it from version 0.25 to 1.0.0.
I just read the official release docs and wrote down below the most important changes:
IMPROVEMENTS
1. rolling.apply() and expanding.apply() can use the Numba engine!
Numba is a JIT compiler project that translates a subset of Python code into optimized machine code using the LLVM compiler library. In other words, Numba makes Python code run faster. (You can use Numba in your own code by importing it and applying it as a decorator. Read more in the official docs.)
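As a minimal sketch of the decorator usage mentioned above: the function name sum_plus_five is only illustrative, and the try/except fallback is an assumption added so the snippet still runs (without any speed-up) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # Numba's "nopython" JIT decorator
except ImportError:
    # Fallback (assumption for this sketch): a no-op decorator,
    # so the example still runs without Numba, just without compilation.
    def njit(func):
        return func


@njit
def sum_plus_five(values):
    # A plain Python loop: the kind of code Numba compiles well.
    total = 0.0
    for v in values:
        total += v
    return total + 5


result = sum_plus_five(np.arange(10, dtype=np.float64))
print(result)  # 50.0
```

The first call includes compilation time when Numba is present; later calls reuse the compiled function.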
The official release says:
"Using the Numba engine can yield significant performance gains if the apply function can operate on numpy arrays and the data set is larger (1 million rows or greater)"
Until now, calling DataFrame.rolling.apply() or DataFrame.expanding.apply() with a custom function could be very expensive. Now we can pass engine="numba" to the apply method and get a performance gain like the following:
import numpy as np
import pandas as pd  # pandas 1.0.0
data = pd.Series(range(1_000_000))
roll = data.rolling(10)
def f(x):
    return np.sum(x) + 5
# Running in Jupyter Notebook
# Run the first time, compilation time will affect performance
In [4]: %%timeit -r 1 -n 1
roll.apply(f, engine='numba', raw=True)
Out [4]: 1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# Function is cached and performance will improve
In [5]: %%timeit
roll.apply(f, engine='numba', raw=True)
Out [5]: 188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %%timeit
roll.apply(f, engine='cython', raw=True)
Out [6]: 3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Read more HERE.
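Before timing anything, it can help to check that a function written for raw=True returns what you expect. The sketch below is only a sanity check on a small Series, using the default engine so it runs without Numba; the names via_apply and via_builtin are illustrative.

```python
import numpy as np
import pandas as pd


def f(x):
    # With raw=True, x arrives as a NumPy array holding one window.
    return np.sum(x) + 5


data = pd.Series(range(20))
roll = data.rolling(10)

# Custom function applied to each window (default engine).
via_apply = roll.apply(f, raw=True)

# Same computation with the built-in rolling sum, for comparison.
via_builtin = roll.sum() + 5

print(via_apply.iloc[9])  # 50.0 (0 + 1 + ... + 9, plus 5)
```

The first nine entries are NaN in both results because the window of 10 is not yet full.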
2. DataFrame.to_markdown()
This new method is an easy way to output DataFrames in Markdown format (it relies on the optional tabulate package). To illustrate this, I created a simple DataFrame below and applied the method.
df = pd.DataFrame(data={"col_1": ["a", "b"], "col_2": ["c", "d"]})
print(df.to_markdown())
>>> | | col_1 | col_2 |
|---:|:------|:------|
| 0 | a | c |
| 1 | b | d |
EXPERIMENTAL NEW FEATURES
These changes are optional to use and are available for evaluation by the community. Beyond that, some of the experimental features are really cool and useful, and can make some work and learning easier. Let's see:
pandas.NA value
When dealing with missing values in our DataFrames, it is really common to use numpy.nan to represent them. Pandas now has its own missing value type, pandas.NA. If we create a pandas Series containing a None value with pandas 1.0.0 and a nullable dtype, we'll see something like this:
import pandas as pd #pandas 1.0.0 version
s = pd.Series([1, 2, None], dtype="Int64")
print(s)
>>> 0 1
1 2
2 <NA>
Length: 3, dtype: Int64
and if we print the type of s[2], we'll get the following answer:
print(type(s[2]))
>>> <class 'pandas._libs.missing.NAType'>
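To see what this changes in practice, here is a minimal sketch comparing the default behaviour with the nullable Int64 dtype. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Without an explicit dtype, None becomes NaN and the integers become floats.
default = pd.Series([1, 2, None])
print(default.dtype)  # float64

# With the nullable Int64 dtype, the integers stay integers
# and the missing value is pandas.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)  # Int64

print(np.isnan(default[2]))   # True
print(nullable[2] is pd.NA)   # True
```

Note the capital "I" in "Int64": the lowercase "int64" dtype has no way to store a missing value.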
Comparisons
- When numpy.nan is compared with a number using ==, >, >=, < or <=, the result is always False; with !=, Python returns True.
- With the new pandas.NA, the missing value is propagated by all the comparison operators. So all of these operations return <NA>.
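The two behaviours above can be sketched side by side. This is only an illustration; the dictionary names are made up for the example.

```python
import numpy as np
import pandas as pd

# numpy.nan: every comparison is False, except != which is True.
nan_results = {
    "==": np.nan == 1,
    ">": np.nan > 1,
    "<=": np.nan <= 1,
    "!=": np.nan != 1,
}
print(nan_results)  # {'==': False, '>': False, '<=': False, '!=': True}

# pandas.NA: the missing value propagates through every comparison.
na_results = {
    "==": pd.NA == 1,
    ">": pd.NA > 1,
    "<=": pd.NA <= 1,
    "!=": pd.NA != 1,
}
print(na_results)  # {'==': <NA>, '>': <NA>, '<=': <NA>, '!=': <NA>}

# Because the result is <NA> rather than True/False,
# use pd.isna() to test for a missing value.
print(pd.isna(pd.NA))  # True
```

One consequence: pd.NA cannot be used directly in an if statement, since converting it to a bool raises a TypeError.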
Logical
- Logical operators (&, |) are not supported with numpy.nan.
- With pandas.NA, the result follows three-valued logic: it is <NA> only when the missing value could change the outcome.
import pandas as pd #pandas 1.0.0 version
print(pd.NA & False)
>>> False
print(pd.NA & True)
>>> <NA>
print(pd.NA | False)
>>> <NA>
print(pd.NA | True)
>>> True
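The same rules apply element by element on a Series with the nullable boolean dtype. The sketch below also shows the numpy.nan case failing; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# numpy.nan is a float, so bitwise/logical operators raise a TypeError.
try:
    np.nan & True
    nan_supported = True
except TypeError:
    nan_supported = False
print(nan_supported)  # False

flags = pd.Series([True, False, pd.NA], dtype="boolean")

# "& False" is False whatever the missing value is.
and_false = flags & False

# "| True" is True whatever the missing value is.
or_true = flags | True

# "& True" depends on the missing value, so it stays <NA>.
and_true = flags & True
print(and_true.isna().tolist())  # [False, False, True]
```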
String and Boolean Data Types
When we check the dtype of a pandas DataFrame column that contains strings, the usual result is object. But an object column can hold more than one data type, which makes the analysis confusing. Now a dedicated string dtype is defined, and a string column can hold only strings (and missing values):
import pandas as pd #pandas 1.0.0 version
s = pd.Series(['abc', None, 'def'], dtype="string")
print(s)
>>> 0 abc
1 <NA>
2 def
Length: 3, dtype: string
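As a small sketch of the difference, the example below contrasts an object column holding mixed values with a string column, and shows that the .str methods keep missing values as <NA>. The variable names are illustrative.

```python
import pandas as pd

# An object column happily mixes strings, numbers and None.
mixed = pd.Series(["abc", 1, None])
print(mixed.dtype)  # object

# A string column, with a missing value in the middle.
text = pd.Series(["abc", None, "def"], dtype="string")
print(text.dtype)  # string

# String methods skip the missing value and keep it as <NA>.
upper = text.str.upper()
print(upper[0])  # ABC

# Numeric results come back as nullable integers.
lengths = text.str.len()
print(lengths.dtype)  # Int64
```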
Unlike strings, a bool dtype already existed, but columns with that dtype don't support missing values. This inconvenience is solved by the new boolean dtype:
import pandas as pd #pandas 1.0.0 version
s = pd.Series([True, False, None], dtype="boolean")
print(s)
>>> 0 True
1 False
2 <NA>
Length: 3, dtype: boolean
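To make the contrast concrete, here is a minimal sketch of what happens with and without the new dtype. The names plain, nullable and filled are illustrative.

```python
import pandas as pd

# Without the new dtype, a missing value forces the column to object.
plain = pd.Series([True, False, None])
print(plain.dtype)  # object

# With dtype="boolean", the column stays boolean and holds <NA>.
nullable = pd.Series([True, False, None], dtype="boolean")
print(nullable.dtype)  # boolean

# Reductions skip the missing value by default.
print(nullable.sum())  # 1

# Replace the missing value when a definite answer is needed.
filled = nullable.fillna(False)
print(filled.isna().sum())  # 0
```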
Well, that's all! Improvements were also made to several existing functions, and I really encourage you to take a look at the release notes!