Introducing dataDisk: Simplify Your Data Processing Pipelines
David Ansa
Posted on June 29, 2024
Are you looking for an easy, efficient way to create and manage data processing pipelines? Look no further! I am excited to introduce dataDisk, a Python package designed to streamline your data processing tasks. Whether you are a data scientist, a data engineer, or a developer working with data, dataDisk offers a flexible, robust solution for your data transformation and validation needs.
Key Features
- Flexible Data Pipelines: Define a sequence of data processing tasks, including transformations and validations, with ease.
- Built-in Transformations: Use a variety of pre-built transformations such as normalization, standardization, and encoding.
- Custom Transformations: Define and integrate your own transformation functions (see the sketch after this list).
- Parallel Processing: Enhance performance with parallel execution of pipeline tasks.
- Easy Integration: Simple and intuitive API to integrate dataDisk into your existing projects.
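For example, a custom transformation can be an ordinary Python function. Here is a minimal sketch using the source, sink, and pipeline classes shown in the next section; note that `drop_duplicate_rows` is my own illustrative helper, not part of dataDisk, and the sketch assumes `add_task` accepts any callable that takes and returns a pandas DataFrame:

```python
import pandas as pd

from dataDisk.data_sources import CSVDataSource
from dataDisk.data_sinks import CSVSink
from dataDisk.pipeline import DataPipeline
from dataDisk.transformation import Transformation

def drop_duplicate_rows(data: pd.DataFrame) -> pd.DataFrame:
    # Illustrative custom step (not part of dataDisk): drop exact duplicate rows.
    return data.drop_duplicates()

pipeline = DataPipeline(source=CSVDataSource('input_data.csv'),
                        sink=CSVSink('output_data.csv'))
pipeline.add_task(drop_duplicate_rows)       # custom function (assumed to be accepted)
pipeline.add_task(Transformation.normalize)  # built-in transformation
pipeline.process()
```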
How It Works
1. Define Your Data Source and Sink

Specify the source of your data and where you want the processed data to be saved.

```python
from dataDisk.data_sources import CSVDataSource
from dataDisk.data_sinks import CSVSink

source = CSVDataSource('input_data.csv')
sink = CSVSink('output_data.csv')
```
2. Create Your Data Pipeline

Initialize the data pipeline and add the desired tasks.

```python
from dataDisk.pipeline import DataPipeline
from dataDisk.transformation import Transformation

pipeline = DataPipeline(source=source, sink=sink)
pipeline.add_task(Transformation.data_cleaning)
pipeline.add_task(Transformation.normalize)
pipeline.add_task(Transformation.label_encode)
```
3. Run the Pipeline

Execute the pipeline to process your data.

```python
pipeline.process()
print("Data processing complete.")
```
Get Started
To start using dataDisk, simply install it via pip:
```bash
pip install dataDisk
```
Contribute to dataDisk
I believe in the power of community and open source. dataDisk is still growing, and I need your help to make it even better! Here’s how you can contribute:
Star the Repository: If you find dataDisk useful, please star the GitHub repository. Stars help the project gain visibility and attract more contributors.
Submit Issues: Found a bug or have a feature request? Submit an issue on GitHub.
Contribute Code: I welcome pull requests! If you have improvements or new features to add, please fork the repository and submit a PR.
Spread the Word: Share dataDisk with your colleagues and friends who might benefit from it.
Example: Testing Transformations
Here's an example demonstrating how to test the transformation features available in dataDisk:
```python
import logging

import pandas as pd

from dataDisk.transformation import Transformation

logging.basicConfig(level=logging.INFO)

# Sample DataFrame with numeric, categorical, and missing values
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [6, 7, 8, 9, 10],
    'category': ['A', 'B', 'A', 'B', 'A'],
    'feature3': [None, 2.0, None, 4.0, 5.0]
})

logging.info("Original Data:")
logging.info(data)

# Test standardize
logging.info("Testing standardize transformation")
try:
    standardized_data = Transformation.standardize(data.copy())
    logging.info(standardized_data)
except Exception as e:
    logging.error(f"Standardize transformation failed: {str(e)}")

# Test other transformations...
# Add similar blocks for normalize, label_encode, etc.
```
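Following the same pattern, one of those additional blocks might look like this (a sketch that continues the script above and reuses the Transformation.normalize task shown earlier):

```python
# Test normalize, mirroring the standardize block above
logging.info("Testing normalize transformation")
try:
    normalized_data = Transformation.normalize(data.copy())
    logging.info(normalized_data)
except Exception as e:
    logging.error(f"Normalize transformation failed: {str(e)}")
```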
Join me in making dataDisk the go-to solution for data processing pipelines!
GitHub: dataDisk repository
Please star the project if you find it useful.