Retry Pattern: Handling Transient Failures in Distributed Systems

In distributed environments, transient failures are inevitable: network latency, timeouts, temporarily unavailable services. The Retry pattern provides a strategy for handling these temporary failures, letting applications recover automatically from errors that may resolve on their own.

Understanding the Retry Pattern

The Retry pattern automatically re-attempts an operation when it fails, on the assumption that the cause of the failure is temporary and can resolve without manual intervention. The key is to distinguish transient failures from permanent ones and to apply a retry strategy suited to each. A timeout or an HTTP 503 is usually worth retrying; a validation error or an HTTP 404 is not.
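One way to make that distinction concrete is to classify a failure before deciding whether to retry it. The following is a minimal sketch, assuming HTTP-style status codes and Python's built-in exception types. The name `is_transient` and the exact sets of retryable codes and exceptions are illustrative assumptions, not rules taken from any particular library, and a real service may need a different list.

```python
from typing import Union

# Assumption for illustration: these status codes usually signal a temporary condition.
TRANSIENT_STATUS_CODES = frozenset({408, 429, 500, 502, 503, 504})

# Assumption for illustration: built-in exceptions that typically indicate a network hiccup.
TRANSIENT_EXCEPTIONS = (ConnectionError, TimeoutError)


def is_transient(failure: Union[int, BaseException]) -> bool:
    """Return True if the failure looks temporary and is worth retrying."""
    if isinstance(failure, int):
        # Treat integers as HTTP status codes.
        return failure in TRANSIENT_STATUS_CODES
    return isinstance(failure, TRANSIENT_EXCEPTIONS)
```

A retry loop can call a helper like this in its `except` branch and re-raise immediately when it returns `False`, so that permanent errors are not retried.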

Common Strategies

Immediate Retry: Retries the operation right away, with no delay.
Retry with Backoff: Increases the wait time between retries.
Exponential Retry: Doubles the wait time after each attempt.
Retry with Jitter: Adds randomness to the wait time to avoid a thundering herd, where many clients retry at the same moment.
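To compare the four strategies side by side, the sketch below computes the wait before each retry. It rests on some assumptions: the function name `delay_schedule` and the strategy labels are hypothetical, plain "backoff" is interpreted as linear growth, and the jitter variant shown draws a uniform value between zero and the exponential delay, which is only one of several common ways to add randomness.

```python
import random
from typing import List, Optional


def delay_schedule(
    strategy: str,
    attempts: int,
    base: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) before each of `attempts` retries."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        if strategy == "immediate":
            delay = 0.0
        elif strategy == "linear":
            # Assumed interpretation of plain backoff: grow by `base` each time.
            delay = base * (n + 1)
        elif strategy == "exponential":
            delay = base * (2 ** n)
        elif strategy == "jitter":
            # Uniform random wait between 0 and the exponential delay.
            delay = rng.uniform(0.0, base * (2 ** n))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        delays.append(delay)
    return delays


print(delay_schedule("linear", 4))       # [1.0, 2.0, 3.0, 4.0]
print(delay_schedule("exponential", 4))  # [1.0, 2.0, 4.0, 8.0]
```

With jitter, two clients that fail at the same instant will most likely wait different amounts of time, which spreads their retries out instead of sending them in lockstep.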

Practical Implementation

Let's look at several implementations of the Retry pattern in Python:

1. Simple Retry with a Decorator

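import time
from functools import wraps
from typing import Callable, Tuple, Type

import requests

def retry(
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
    max_attempts: int = 3,
    delay: float = 1.0
):
    """Retry the decorated function, waiting a fixed delay between attempts."""
    def decorator(func: Callable):
        @wraps(func)
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    attempts += 1
                    if attempts >= max_attempts:
                        # Out of attempts: re-raise with the original traceback
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

# requests raises its own exception types, which do not inherit from the
# built-in ConnectionError/TimeoutError, so they are named explicitly here.
@retry(
    exceptions=(requests.exceptions.ConnectionError, requests.exceptions.Timeout),
    max_attempts=3
)
def fetch_data(url: str):
    # Example API call; the timeout keeps a hung request from blocking forever
    return requests.get(url, timeout=5)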

2. Retry with Exponential Backoff

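import asyncio
import random
from typing import Awaitable, Callable, Optional

class ExponentialBackoff:
    def __init__(
        self,
        initial_delay: float = 1.0,
        max_delay: float = 60.0,
        max_attempts: int = 5,
        jitter: bool = True
    ):
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.max_attempts = max_attempts
        self.jitter = jitter
        self.attempt = 0

    def next_delay(self) -> Optional[float]:
        """Return the next delay in seconds, or None once attempts are exhausted."""
        if self.attempt >= self.max_attempts:
            return None

        # Double the delay on every attempt, up to max_delay
        delay = min(
            self.initial_delay * (2 ** self.attempt),
            self.max_delay
        )

        if self.jitter:
            # Scale by a random factor in [0.5, 1.5), then re-apply the cap
            delay = min(delay * (0.5 + random.random()), self.max_delay)

        self.attempt += 1
        return delay

async def retry_operation(
    operation: Callable[[], Awaitable],
    backoff: ExponentialBackoff
):
    last_exception = None

    while (delay := backoff.next_delay()) is not None:
        try:
            return await operation()
        except Exception as e:
            last_exception = e
            # Skip the sleep after the final attempt; nothing follows it
            if backoff.attempt < backoff.max_attempts:
                await asyncio.sleep(delay)

    if last_exception is None:
        # The backoff was already exhausted, so the operation never ran
        raise RuntimeError("No attempts left in backoff")
    raise last_exception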

3. Retry with a Circuit Breaker

import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Awaitable, Callable, Optional

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5  # consecutive failures before the circuit opens
    reset_timeout: timedelta = timedelta(minutes=1)  # time spent OPEN before a trial call
    retry_timeout: timedelta = timedelta(seconds=10)  # not used in this snippet

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, config: CircuitBreakerConfig):
        self.config = config
        self.failures = 0
        self.last_failure: Optional[datetime] = None
        self.state = "CLOSED"

    def can_retry(self) -> bool:
        if self.state == "CLOSED":
            return True

        if self.state == "OPEN":
            # After reset_timeout, allow a trial call through
            if datetime.now() - self.last_failure > self.config.reset_timeout:
                self.state = "HALF_OPEN"
                return True
            return False

        return True  # HALF_OPEN

    def record_failure(self):
        self.failures += 1
        self.last_failure = datetime.now()

        # A failed trial call in HALF_OPEN also lands here and reopens the circuit
        if self.failures >= self.config.failure_threshold:
            self.state = "OPEN"

    def record_success(self):
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"
        self.failures = 0
        self.last_failure = None

# ExponentialBackoff is the class defined in the previous section
async def retry_with_circuit_breaker(
    operation: Callable[[], Awaitable],
    circuit_breaker: CircuitBreaker,
    backoff: ExponentialBackoff
):
    while True:
        if not circuit_breaker.can_retry():
            raise CircuitBreakerOpenError("Circuit breaker is open")

        try:
            result = await operation()
            circuit_breaker.record_success()
            return result
        except Exception:
            circuit_breaker.record_failure()
            if (delay := backoff.next_delay()) is None:
                # Backoff exhausted: re-raise with the original traceback
                raise
            await asyncio.sleep(delay)

Applications in the Cloud

The Retry pattern is especially useful in cloud scenarios:

1. Communication Between Microservices

import httpx
from fastapi import FastAPI, HTTPException
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

app = FastAPI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    # httpx reports network failures and timeouts as httpx.TransportError
    # subclasses, not as the built-in ConnectionError
    retry=retry_if_exception_type(httpx.TransportError)
)
async def call_dependent_service(data: dict):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://dependent-service/api/v1/process",
            json=data,
            timeout=5.0
        )
        return response.json()

@app.post("/process")
async def process_request(data: dict):
    try:
        return await call_dependent_service(data)
    except Exception as exc:
        # Covers tenacity.RetryError (retries exhausted) and non-retryable errors
        raise HTTPException(
            status_code=503,
            detail="Service temporarily unavailable"
        ) from exc

2. Database Operations

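import time
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError

class DatabaseRetry:
    def __init__(self, url: str, max_attempts: int = 3):
        self.engine = create_engine(url)
        self.max_attempts = max_attempts

    def _connect(self):
        # Retry only the connection attempt, with exponential backoff
        attempt = 0
        while True:
            try:
                return self.engine.connect()
            except OperationalError:
                attempt += 1
                if attempt >= self.max_attempts:
                    raise
                time.sleep(2 ** attempt)

    @contextmanager
    def session(self):
        # A @contextmanager generator may yield only once, so the retry loop
        # must not wrap the yield; errors raised by the caller's block propagate.
        connection = self._connect()
        try:
            yield connection
        finally:
            connection.close()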

Benefits of the Retry Pattern

Resilience: Handles transient failures automatically.
Availability: Improves the overall availability of the system.
Transparency: Retries are transparent to the user.
Flexibility: Supports different strategies depending on the use case.

Design Considerations

When implementing the Retry pattern, consider:

Idempotency: Operations must be safe to retry; repeating them should not duplicate side effects.
Timeouts: Set clear limits on retries, both on the number of attempts and on the total time spent.
Logging: Log retries so they can be monitored.
Backoff: Use strategies that avoid overloading the system.
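Several of these considerations can be combined in a single helper. The sketch below is one possible arrangement, not a definitive implementation: it caps the total time spent retrying, logs every retry, and backs off exponentially. The name `retry_with_deadline`, its parameters, the default values, and the logger name are all hypothetical, and the helper assumes the wrapped operation is idempotent.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("retry")  # hypothetical logger name


def retry_with_deadline(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    deadline_seconds: float = 30.0,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff until a total time budget runs out."""
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except retry_on as exc:
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            elapsed = time.monotonic() - start
            # Give up if the next wait would exceed the overall budget.
            if elapsed + delay > deadline_seconds:
                logger.error(
                    "giving up after %d attempts (%.1fs elapsed): %r",
                    attempt, elapsed, exc,
                )
                raise
            logger.warning(
                "attempt %d failed (%r); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
```

Exceptions outside `retry_on` propagate immediately, which keeps permanent errors from being retried. The `sleep` parameter exists only so the wait can be replaced in tests.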

Conclusion

The Retry pattern is essential in modern distributed architectures. A careful implementation that accounts for idempotency and backoff strategies can significantly improve your system's resilience. It should be used judiciously, however, so that it does not hide systemic problems that need attention.
