For many people, Python is their go-to language for data analysis. This is in large part because Python makes it very easy to think about and write functions that operate on units of data. For example, a function such as age_segment can determine which numbered segment a person of a given age belongs to.
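To make the idea of functions on single units of data concrete, here is a minimal sketch of the three functions this post refers to by name. The bodies are assumptions for illustration only: the segment boundaries, the threshold of 18, and the printed wording are hypothetical and may differ from the original implementations.

```python
def age_segment(age: int) -> int:
    """Map an age to a numbered segment (boundaries are assumed)."""
    if age < 18:
        return 0
    if age < 35:
        return 1
    if age < 55:
        return 2
    return 3


def is_under_age(segment: int) -> bool:
    """Return True if the segment is the assumed under-18 segment."""
    return segment == 0


def print_age(segment: int) -> int:
    """Print a segment and return it unchanged so it can be passed on."""
    print(f"segment: {segment}")
    return segment
```

Each function takes one value and returns one value, which is what lets them be chained together later.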
Functions like these each operate on a single "unit of data". But a very common pattern in data analysis is to have a collection of such units and a "pipeline" of functions that each unit passes through, with each function progressively transforming the data.
As an example, let's say you have a list of ages, and you want to pass each element through a pipeline that sends it first to age_segment, then to is_under_age, and finally to print_age.
ages = [9, 19, 37, 28, 48, 13]
Now you might realize that you could just use Python's for loop to iterate over the list and call the functions in order on each age. That is what many people do, but you will run into a few issues:
You have to know Python to change the structure of the pipeline (e.g., to remove age_segment because your data already consists of age segments).
You have to write (in my opinion) messy code to make functions run in parallel. Python provides the multiprocessing library to run functions at the same time, instead of one after the other, and to pass data between them, but this introduces quite a bit of extra code.
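For a rough sense of both approaches, here is a minimal sketch; it is not the 63-line version this post refers to. The function bodies are hypothetical stand-ins, and the parallel variant only spreads each stage across worker processes with multiprocessing.Pool. A fuller version that runs the stages concurrently and passes items between them through queues would need considerably more code.

```python
from multiprocessing import Pool


# Hypothetical stand-ins for the functions named in the post.
def age_segment(age):
    if age < 18:
        return 0
    if age < 35:
        return 1
    if age < 55:
        return 2
    return 3


def is_under_age(segment):
    return segment == 0


def print_age(segment):
    print(f"segment: {segment}")
    return segment


def run_sequential(ages):
    """Plain for loop: each age goes through every stage in order."""
    passed = []
    for age in ages:
        segment = age_segment(age)
        if is_under_age(segment):  # filter step
            passed.append(print_age(segment))
    return passed


def run_parallel(ages, workers=2):
    """Spread the transform and filter stages across worker processes."""
    with Pool(processes=workers) as pool:
        segments = pool.map(age_segment, ages)
        flags = pool.map(is_under_age, segments)
    kept = [seg for seg, keep in zip(segments, flags) if keep]
    for seg in kept:
        print_age(seg)
    return kept


if __name__ == "__main__":
    ages = [9, 19, 37, 28, 48, 13]
    run_sequential(ages)
    run_parallel(ages)
```

Even in this small form, changing the order of the stages or removing one means editing the loop body and the pool calls by hand.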
Maybe that kind of multiprocessing code is legible to you, but it certainly isn't to me.
What if there were a simple high-level language for describing the structure of a pipeline, with a compiler that turns that description into Python? It would be a language that anybody can use: a scientist, a business executive, literally anyone.
You would work with the components of a pipeline with a basic understanding of what each one does (e.g., is_under_age tells you whether an age is considered legally underage) but with no required knowledge of how it is implemented. Somebody who knows Python writes the implementations of the components. After that, the pipeline can be written, rewritten, and restructured without consulting that person, because it is described in a simple high-level language that abstracts away the functions that make it up.
This high-level language exists! It's called Pipelines. With the Pipelines language, the above 63 lines of Python code are reduced to the 5-line description below.
The |> indicates that each element passed into age_segment and print_age is transformed: it is given to the age_segment() or print_age() function, and the result is passed on to the next function in the pipeline, if one exists.
The \> indicates that each element passed into is_under_age is filtered. The element is given to is_under_age, and if the result is True the element is passed on. If the result is False, it is not passed on.
There's much more to the language than just pipes and filters, including features designed for more complex data flow. Everything can be found in the README on the GitHub repository.
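The difference between the two operators can be sketched in plain Python. This only illustrates the semantics described above; it is not the code the Pipelines compiler generates, and the helper names pipe and keep_if, along with the example functions, are hypothetical.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def pipe(fn: Callable[[T], U], items: Iterable[T]) -> Iterator[U]:
    """Transform: pass each element to fn and forward the result."""
    for item in items:
        yield fn(item)


def keep_if(predicate: Callable[[T], bool], items: Iterable[T]) -> Iterator[T]:
    """Filter: forward the element itself only when predicate is True."""
    for item in items:
        if predicate(item):
            yield item


# Hypothetical components used only for this illustration.
def double(n: int) -> int:
    return n * 2


def is_even(n: int) -> bool:
    return n % 2 == 0


# Filter first, then transform: only 2 and 4 survive, then get doubled.
result = list(pipe(double, keep_if(is_even, [1, 2, 3, 4])))
```

A transform replaces each element with the function's return value, while a filter keeps or drops the original element based on it.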
An experimental programming language for data flow
Pipelines is a language and runtime for crafting massively parallel pipelines. Unlike other languages for defining data flow, the Pipeline language requires implementation of components to be defined separately in the Python scripting language. This allows the details of implementations to be separated from the structure of the pipeline, while providing access to thousands of active libraries for machine learning, data analysis and processing. Skip to Getting Started to install the Pipeline compiler.
An example
As an introductory example, a simple pipeline for Fizz Buzz on even numbers could be written as follows -
If you do decide to use Pipelines in your own data analysis work, please let me know. I would love to see what you make! If not, and you find this interesting and potentially useful, I would appreciate it if you could star the GitHub repository for later.