Refactoring based on design principles: example of a data collection crawler system
ma2mori
Posted on July 19, 2024
Introduction
Improving code quality is a perennial concern in software development. In this article, we take a data collection crawler system as an example and show, step by step, how to apply design principles and best practices through incremental refactoring.
Code before improvement
We start with a very simple web scraper that crams all of its functionality into a single class.
project_root/
├── web_scraper.py
├── main.py
└── requirements.txt
web_scraper.py
import requests
import json
import sqlite3

class WebScraper:
    def __init__(self, url):
        self.url = url

    def fetch_data(self):
        response = requests.get(self.url)
        data = response.text
        parsed_data = self.parse_data(data)
        enriched_data = self.enrich_data(parsed_data)
        self.save_data(enriched_data)
        return enriched_data

    def parse_data(self, data):
        return json.loads(data)

    def enrich_data(self, data):
        # Apply business logic here
        # Example: extract only data containing specific keywords
        return {k: v for k, v in data.items() if 'important' in v.lower()}

    def save_data(self, data):
        conn = sqlite3.connect('test.db')
        cursor = conn.cursor()
        # Ensure the table exists so the insert below does not fail on a fresh database
        cursor.execute('CREATE TABLE IF NOT EXISTS data (json_data TEXT)')
        cursor.execute('INSERT INTO data (json_data) VALUES (?)', (json.dumps(data),))
        conn.commit()
        conn.close()
main.py
from web_scraper import WebScraper

def main():
    scraper = WebScraper('https://example.com/api/data')
    data = scraper.fetch_data()
    print(data)

if __name__ == "__main__":
    main()
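requirements.txt is not shown in the original post; since json and sqlite3 ship with the standard library, it presumably only needs the requests package:
requests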
Points to be improved
- Violation of the single responsibility principle: one class is responsible for data acquisition, parsing, enrichment, and storage
- Unclear business logic: the business logic lives inside the `enrich_data` method, mixed in with other processing
- Lack of reusability: the functions are tightly coupled, making individual reuse difficult
- Difficult to test: individual functions cannot easily be tested in isolation
- Rigid configuration: the database path and other settings are hard-coded
Refactoring phase
1. Separation of responsibilities: splitting data acquisition, parsing, and storage
- Major change: split the responsibilities for data acquisition, parsing, and storage into separate classes
- Objective: apply the single responsibility principle and introduce environment variables
Directory structure
project_root/
├── data_fetcher.py
├── data_parser.py
├── data_saver.py
├── data_enricher.py
├── web_scraper.py
├── main.py
└── requirements.txt
data_enricher.py
class DataEnricher:
    def enrich(self, data):
        return {k: v for k, v in data.items() if 'important' in v.lower()}
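The original post does not show the extracted fetcher, parser, and saver classes. A minimal sketch of what they might contain, lifted straight from the monolithic version above (the DB_PATH environment variable name is an assumption, reflecting this step's stated goal of introducing environment variables):

# data_fetcher.py (sketch; not shown in the original)
import requests

class DataFetcher:
    def fetch(self, url):
        response = requests.get(url)
        return response.text

# data_parser.py (sketch; not shown in the original)
import json

class DataParser:
    def parse(self, raw_data):
        return json.loads(raw_data)

# data_saver.py (sketch; DB_PATH is an assumed variable name)
import os
import json
import sqlite3

class DataSaver:
    def save(self, data):
        conn = sqlite3.connect(os.getenv('DB_PATH', 'test.db'))
        conn.execute('CREATE TABLE IF NOT EXISTS data (json_data TEXT)')
        conn.execute('INSERT INTO data (json_data) VALUES (?)', (json.dumps(data),))
        conn.commit()
        conn.close()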
web_scraper.py
from data_fetcher import DataFetcher
from data_parser import DataParser
from data_enricher import DataEnricher
from data_saver import DataSaver

class WebScraper:
    def __init__(self, url):
        self.url = url
        self.fetcher = DataFetcher()
        self.parser = DataParser()
        self.enricher = DataEnricher()
        self.saver = DataSaver()

    def fetch_data(self):
        raw_data = self.fetcher.fetch(self.url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data
This change clarifies the responsibilities of each class and improves reusability and testability. However, the business logic is still embedded in the `DataEnricher` class.
2. Introduction of interfaces and dependency injection
- Major change: introduce interfaces and implement dependency injection
- Objective: increase flexibility and extensibility, extend the use of environment variables, and abstract the business logic
Directory structure
project_root/
├── interfaces/
│ ├── __init__.py
│ ├── data_fetcher_interface.py
│ ├── data_parser_interface.py
│ ├── data_enricher_interface.py
│ └── data_saver_interface.py
├── implementations/
│ ├── __init__.py
│ ├── http_data_fetcher.py
│ ├── json_data_parser.py
│ ├── keyword_data_enricher.py
│ └── sqlite_data_saver.py
├── web_scraper.py
├── main.py
└── requirements.txt
interfaces/data_fetcher_interface.py
from abc import ABC, abstractmethod

class DataFetcherInterface(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str:
        pass
interfaces/data_parser_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any

class DataParserInterface(ABC):
    @abstractmethod
    def parse(self, raw_data: str) -> Dict[str, Any]:
        pass
interfaces/data_enricher_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any

class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass
interfaces/data_saver_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any

class DataSaverInterface(ABC):
    @abstractmethod
    def save(self, data: Dict[str, Any]) -> None:
        pass
implementations/keyword_data_enricher.py
import os
from interfaces.data_enricher_interface import DataEnricherInterface

class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}
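The concrete http_data_fetcher.py, json_data_parser.py, and sqlite_data_saver.py are likewise not shown in the original post. As one representative example, an HTTP fetcher implementing the interface might look like the sketch below; the JSON parser and SQLite saver would wrap the step 1 logic behind `DataParserInterface` and `DataSaverInterface` in the same way (the class names HttpDataFetcher, JsonDataParser, and SqliteDataSaver are assumptions inferred from the file names).

# implementations/http_data_fetcher.py (hypothetical sketch)
import requests
from interfaces.data_fetcher_interface import DataFetcherInterface

class HttpDataFetcher(DataFetcherInterface):
    def fetch(self, url: str) -> str:
        response = requests.get(url)
        response.raise_for_status()  # surface HTTP errors instead of parsing an error page
        return response.text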
web_scraper.py
from interfaces.data_fetcher_interface import DataFetcherInterface
from interfaces.data_parser_interface import DataParserInterface
from interfaces.data_enricher_interface import DataEnricherInterface
from interfaces.data_saver_interface import DataSaverInterface

class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface,
                 enricher: DataEnricherInterface, saver: DataSaverInterface):
        self.fetcher = fetcher
        self.parser = parser
        self.enricher = enricher
        self.saver = saver

    def fetch_data(self, url):
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data
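The original post does not show how these dependencies are wired together in main.py. A minimal sketch, assuming the implementation class names mentioned above (HttpDataFetcher, JsonDataParser, SqliteDataSaver are hypothetical names inferred from the file names):

# main.py (hypothetical wiring)
from implementations.http_data_fetcher import HttpDataFetcher
from implementations.json_data_parser import JsonDataParser
from implementations.keyword_data_enricher import KeywordDataEnricher
from implementations.sqlite_data_saver import SqliteDataSaver
from web_scraper import WebScraper

def main():
    # Dependencies are constructed here and injected into WebScraper
    scraper = WebScraper(
        fetcher=HttpDataFetcher(),
        parser=JsonDataParser(),
        enricher=KeywordDataEnricher(),
        saver=SqliteDataSaver(),
    )
    data = scraper.fetch_data('https://example.com/api/data')
    print(data)

if __name__ == "__main__":
    main()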
The main changes at this stage are:
- Interfaces were introduced, making it easy to switch to different implementations.
- Dependency injection makes the `WebScraper` class more flexible.
- The `fetch_data` method now takes `url` as an argument, making URL specification more flexible.
- The business logic has been abstracted as `DataEnricherInterface` and implemented as `KeywordDataEnricher`.
- The business logic is more configurable, since the keyword can be set through an environment variable.

These changes have greatly improved the flexibility and extensibility of the system. However, the business logic remains embedded in `DataEnricherInterface` and its implementation. The next step is to separate this business logic further and define it clearly as a domain layer.
3. Introduction of a domain layer and separation of business logic
In the previous step, introducing interfaces increased the flexibility of the system. However, the business logic (here, determining the importance of data and filtering it) is still treated as part of the data layer. Following the ideas of domain-driven design, treating this business logic as a central concept of the system and implementing it as an independent domain layer brings the following benefits:
- Centralized management of business logic
- More expressive code through the domain model
- Greater flexibility when business rules change
- Ease of testing
Updated directory structure:
project_root/
├── domain/
│ ├── __init__.py
│ ├── scraped_data.py
│ └── data_enrichment_service.py
├── data/
│ ├── __init__.py
│ ├── interfaces/
│ │ ├── __init__.py
│ │ ├── data_fetcher_interface.py
│ │ ├── data_parser_interface.py
│ │ └── data_saver_interface.py
│ ├── implementations/
│ │ ├── __init__.py
│ │ ├── http_data_fetcher.py
│ │ ├── json_data_parser.py
│ │ └── sqlite_data_saver.py
├── application/
│ ├── __init__.py
│ └── web_scraper.py
├── main.py
└── requirements.txt
At this stage, the roles of `DataEnricherInterface` and `KeywordDataEnricher` are moved to the `ScrapedData` model and the `DataEnrichmentService` in the domain layer. The details of this change are shown below.
Before the change (step 2)
class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass

class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}
After the change (step 3)
# domain/scraped_data.py
import os
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ScrapedData:
    content: Dict[str, Any]
    source_url: str

    def is_important(self) -> bool:
        important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important')
        return any(important_keyword in str(v).lower() for v in self.content.values())

# domain/data_enrichment_service.py
import os
from domain.scraped_data import ScrapedData

class DataEnrichmentService:
    def __init__(self):
        self.important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data: ScrapedData) -> ScrapedData:
        if data.is_important():
            enriched_content = {k: v for k, v in data.content.items()
                                if self.important_keyword in str(v).lower()}
            return ScrapedData(content=enriched_content, source_url=data.source_url)
        return data
This change brings the following improvements:
- The business logic has been moved to the domain layer, eliminating the need for `DataEnricherInterface`.
- The functionality of `KeywordDataEnricher` has been merged into `DataEnrichmentService`, centralizing the business logic in one place.
- An `is_important` method has been added to the `ScrapedData` model. The domain model itself is now responsible for determining the importance of data, which makes the domain concept clearer.
- `DataEnrichmentService` now handles `ScrapedData` objects directly, improving type safety.
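Because the importance rule now lives in the domain model, it can be unit-tested without touching the network or the database, which is the "ease of testing" benefit mentioned earlier. A minimal illustrative sketch (the test name and sample values are invented, and it relies on the default 'important' keyword):

# test_domain.py (illustrative)
from domain.scraped_data import ScrapedData
from domain.data_enrichment_service import DataEnrichmentService

def test_enrich_keeps_only_important_entries():
    data = ScrapedData(content={'a': 'important news', 'b': 'noise'},
                       source_url='https://example.com')
    enriched = DataEnrichmentService().enrich(data)
    assert enriched.content == {'a': 'important news'}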
The `WebScraper` class is also updated to reflect this change.
from data.interfaces.data_fetcher_interface import DataFetcherInterface
from data.interfaces.data_parser_interface import DataParserInterface
from data.interfaces.data_saver_interface import DataSaverInterface
from domain.scraped_data import ScrapedData
from domain.data_enrichment_service import DataEnrichmentService

class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface,
                 saver: DataSaverInterface, enrichment_service: DataEnrichmentService):
        self.fetcher = fetcher
        self.parser = parser
        self.saver = saver
        self.enrichment_service = enrichment_service

    def fetch_data(self, url: str) -> ScrapedData:
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        scraped_data = ScrapedData(content=parsed_data, source_url=url)
        enriched_data = self.enrichment_service.enrich(scraped_data)
        self.saver.save(enriched_data)
        return enriched_data
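To close the loop, the composition root would now inject the domain service instead of the old enricher. A brief hypothetical sketch, reusing the implementation class names assumed in step 2; note that the saver now receives a `ScrapedData` object, so a concrete saver would need to serialize `data.content`:

# main.py (hypothetical wiring for the final layered structure)
from data.implementations.http_data_fetcher import HttpDataFetcher
from data.implementations.json_data_parser import JsonDataParser
from data.implementations.sqlite_data_saver import SqliteDataSaver
from domain.data_enrichment_service import DataEnrichmentService
from application.web_scraper import WebScraper

scraper = WebScraper(
    fetcher=HttpDataFetcher(),
    parser=JsonDataParser(),
    saver=SqliteDataSaver(),          # assumed to accept ScrapedData and persist data.content
    enrichment_service=DataEnrichmentService(),
)
print(scraper.fetch_data('https://example.com/api/data'))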
This change completely shifts the business logic from the data layer to the domain layer, giving the system a clearer structure. The removal of `DataEnricherInterface` and the introduction of `DataEnrichmentService` are not just interface replacements, but a fundamental change in the way business logic is handled.
Summary
This article has demonstrated how to improve code quality and apply design principles through a step-by-step refactoring of a data collection crawler system. The main improvements were as follows.
- Separation of responsibilities: applying the single responsibility principle, we split data acquisition, parsing, enrichment, and storage into separate classes.
- Introduction of interfaces and dependency injection: greatly increased the flexibility and extensibility of the system, making it easier to switch to different implementations.
- Introduction of a domain model and services: clearly separated the business logic and defined the core concepts of the system.
- Adoption of a layered architecture: clearly separated the domain, data, and application layers and defined the responsibilities of each.
- Preservation of interfaces: maintained abstraction at the data layer to keep implementations easy to change.
These improvements have greatly enhanced the system's modularity, reusability, testability, maintainability, and scalability. In particular, by applying some concepts of domain-driven design, the business logic became clearer and the structure flexible enough to accommodate future changes in requirements. At the same time, by keeping the interfaces in place, we preserved the freedom to change and extend the data layer implementation easily.
It is important to note that this refactoring process is not a one-time event, but part of a continuous improvement process. Depending on the size and complexity of the project, it is important to adopt design principles and DDD concepts at the appropriate level and to make incremental improvements.
Finally, the approach presented in this article can be applied to a wide variety of software projects, not just data collection crawlers. We encourage you to use it as a reference as you work to improve code quality and design.