Is it time for Data Science / Engineering patterns?
Renato Cordeiro Ferreira
Posted on January 5, 2019
Most software engineers have already read Design Patterns, the seminal book by Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides (the Gang of Four a.k.a. GoF) that made most programmers of object-oriented languages talk in terms of singletons, iterators, strategies, etc.
For those who don't know the concept, here is the definition from Wikipedia:
In software engineering, a software design pattern is a general, reusable solution to a commonly occurring problem within a given context in software design. It is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations. Design patterns are formalized best practices that the programmer can use to solve common problems when designing an application or system.
I remember reading this book cover to cover in two days in 2014 (an uncommon approach, since the book is a kind of catalogue). I had some experience with object-oriented programming (OOP) from my undergraduate courses. However, I still had difficulty designing new systems in this paradigm. Suddenly, someone had given me a guide for building new projects. The idea really appealed to me!
It's a common mistake for new readers of the GoF book to think that design patterns are "the hammer that will pound all nails". I made this mistake. I tried to plan everything in terms of patterns. But there are no silver bullets. I overlooked their trade-offs: they give flexibility on the one hand but increase complexity on the other. This is one of the biggest criticisms of the concept of design patterns (this and, for GoF patterns, OOP language limitations).
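To make that trade-off concrete, here is a minimal sketch in Python of the Strategy pattern applied to a small data-preparation task. Every name in it (`scale_direct`, `ScalingStrategy`, `MinMaxStrategy`, `MaxAbsStrategy`, `Preprocessor`) is hypothetical and invented for this illustration; none of them comes from the GoF book or from any library.

```python
from abc import ABC, abstractmethod


def scale_direct(values):
    """Plain function: min-max scale values to [0, 1], no pattern involved.

    Assumes the values are not all equal (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


class ScalingStrategy(ABC):
    """Strategy interface (hypothetical): one interchangeable scaling rule."""

    @abstractmethod
    def scale(self, values):
        ...


class MinMaxStrategy(ScalingStrategy):
    def scale(self, values):
        # Reuses the plain function above; same result, one more layer.
        return scale_direct(values)


class MaxAbsStrategy(ScalingStrategy):
    def scale(self, values):
        # Divide by the largest absolute value (assumed to be non-zero).
        peak = max(abs(v) for v in values)
        return [v / peak for v in values]


class Preprocessor:
    """Context: depends only on the interface, so the rule can be swapped."""

    def __init__(self, strategy: ScalingStrategy):
        self._strategy = strategy

    def run(self, values):
        return self._strategy.scale(values)


if __name__ == "__main__":
    data = [0, 5, 10]
    print(scale_direct(data))
    print(Preprocessor(MinMaxStrategy()).run(data))
    print(Preprocessor(MaxAbsStrategy()).run([-2, 1, 4]))
```

The pattern version lets `Preprocessor` change its scaling rule without being edited, which is the flexibility. The price is one interface, two classes and an extra object to construct for a job that a single function already did, which is the added complexity.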
My current master's advisor, who also advised my undergraduate monograph, did his PhD with Ralph Johnson at the University of Illinois at Urbana-Champaign (UIUC). We've been working together on refactoring ToPS, a machine learning framework created by our research group. The framework uses patterns so extensively that we identified and documented a new design pattern (the Secretary pattern), which we presented at the 11th SugarLoaf-PLoP, the Latin American conference on Pattern Languages of Programs.
Thanks to this project and the relationship with my advisor, I think I realized how and why patterns are valuable (at least to me): they standardize and increase the descriptive power of developers' language. As I said at the beginning of this text, the GoF book made most programmers of object-oriented languages talk in terms of singletons, iterators, strategies, etc. This saves time and makes design / architecture discussions more concise and accurate.
All this context brings me to the reason for this post. Today Andrew Ng, the famous Stanford professor who co-founded Coursera and made one of the most popular machine learning online courses of all time, posted this thread on Twitter:
This post is my (long) response to him.
GoF gave birth to a movement of pattern writers who document individual patterns and whole pattern languages for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, game programming, and many others.
But where are the data science and data engineering patterns?
As an enthusiast of this way of documenting best practices, I believe that we should search for these patterns and document them. I first realized this when I went to SugarLoaf-PLoP in 2016 and saw patterns for many areas, but none related to these. As Andrew Ng said (emphasis mine):
I'm also seeing many AI teams use new processes that haven't been formalized or named yet, ranging from how we write product requirement docs to how we version data and ML pipelines.
As I said above, I think patterns are most valuable to standardize and increase the descriptive power of developers' language.
This is an exciting time for developing these ideas!
So let's create them!
It's important to note that I talked about data science and data engineering instead of AI or ML, as Andrew Ng did. This was a deliberate word choice. I usually describe machine learning as one of the artificial intelligence techniques (statistical learning) available to data scientists when they're trying to reason about their data. However, data science is broader and includes other tasks (such as data mining or visualization) where patterns could also be useful. Meanwhile, dealing with lots of data has its own challenges, so (big) data engineering can also benefit from this approach.
Thanks for reading! What do you think? Would you read data science / engineering patterns? Do you believe this is a good way to document best practices for the area?
In my PhD (which I hope will start soon), I want to explore the relationship between software engineering and data science / data engineering. I think that exploring patterns could be an interesting direction to take, hence my interest in Andrew Ng's tweet.
If you liked this discussion, take a look at my other post, where I wrote about software engineering for the first time. There I introduced my idea of R.A.D.I.C.A.L systems, which can also use AI/ML for Intelligent data transformation.