Project Journey #2: 🛠️ Coding, Failing, and Learning with AI Law Shield ⚖️

0x2e73

Posted on November 17, 2024

Welcome back to my AI journey, where I stumbled, learned, and maybe cried a little! 😂

1. Diving into the Code: The Good, The Bad, and The Ugly

This time, I got my hands dirty by coding the first version of my AI model. Spoiler alert: I achieved an accuracy of just 0.18945%! 🎯 (Ouch! I guess even my toaster could do better 🤖🍞).

Let's dive into the code and see what went wrong.


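# Initializing BERT for sequence classification
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# num_labels=5: one output class per danger level (1 to 5)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)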


What’s happening here?
I'm using BERT, the superstar transformer model, to classify the danger level of legal contracts on a scale of 1 to 5. 📄

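def preprocess_data(dataframe, tokenizer):
    # 'texte' holds the contract text, 'niveau_de_danger' the danger level (1 to 5)
    texts = dataframe['texte'].tolist()
    # Shift labels from 1-5 to 0-4, the class indices the model expects
    labels = [label - 1 for label in dataframe['niveau_de_danger'].tolist()]
    # Pad to the longest text in the batch; truncate anything past the model's maximum length
    encoded_data = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    return encoded_data, labels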

Why preprocess data?
I’ve tokenized the contract texts for BERT to digest (like breaking down a complex contract into easier-to-understand clauses). 🍽️
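One thing worth knowing about that tokenizer call: with truncation=True, anything beyond the model's maximum input length (512 tokens for bert-base-uncased) is silently dropped, and contracts tend to be long. Here is a minimal sketch of one possible workaround, splitting a contract into overlapping windows before tokenizing. The helper name chunk_words and its default sizes are hypothetical, and whitespace-separated words are only a rough stand-in for BERT tokens, so treat this as an illustration rather than the project's actual preprocessing.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping windows of whitespace-separated words.

    Words are only an approximation of BERT tokens (one word can become
    several WordPiece tokens), so max_words should stay well below 512.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Made-up 450-word "contract", purely for illustration
contract = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(contract, max_words=200, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [200, 200, 150]
```

Each chunk could then go through preprocess_data on its own, with the per-chunk predictions combined afterwards (for example, by taking the highest danger level). How to combine them is a design choice the post does not cover.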

2. Training My Model: And… It Crashed and Burned 💥


import torch

def train_model(model, train_loader, num_epochs=5):
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    for epoch in range(num_epochs):
        model.train()
        total_loss, correct, seen = 0.0, 0, 0
        for batch in train_loader:
            optimizer.zero_grad()
            # Each batch is expected to be (input_ids, attention_mask, labels)
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            # Accumulate over the whole epoch instead of reporting only the last batch
            total_loss += loss.item() * labels.size(0)
            correct += (outputs.logits.argmax(dim=-1) == labels).sum().item()
            seen += labels.size(0)
        epoch_loss, accuracy = total_loss / seen, correct / seen
        print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss:.4f}')
    print(f'Final Loss: {epoch_loss:.4f}, Accuracy: {accuracy:.4f}')


This function trains BERT to classify contracts, but let’s just say it didn’t pass the bar exam 😬. The low accuracy told me that my model was basically guessing randomly.
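To put "guessing randomly" into numbers, it helps to compare the model against two trivial baselines: picking one of the five levels uniformly at random, and always predicting the most common level. The sketch below computes both. The function name baseline_accuracies and the example labels are made up for illustration and are not taken from the project's dataset.

```python
from collections import Counter


def baseline_accuracies(labels, num_classes=5):
    """Return (uniform-guess accuracy, majority-class accuracy) for a label list."""
    counts = Counter(labels)
    chance = 1.0 / num_classes  # expected accuracy of a uniform random guess
    majority = max(counts.values()) / len(labels)  # always predict the most common label
    return chance, majority


# Made-up danger levels (1 to 5), purely for illustration
example_labels = [1, 1, 2, 3, 3, 3, 4, 5, 3, 2]
chance, majority = baseline_accuracies(example_labels)
print(f"chance: {chance:.2f}, majority class: {majority:.2f}")
# chance: 0.20, majority class: 0.40
```

A model that scores at or below these numbers has not learned anything useful from the text yet.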

3. Evaluating the Model: Reality Check 🧑‍⚖️


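import torch
from sklearn.metrics import classification_report

def evaluate_model(model, test_loader):
    # Use the device the model already lives on (`device` was undefined in this function)
    device = next(model.parameters()).device
    model.eval()
    predictions = []
    true_labels = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask)
            # Index of the highest logit is the predicted class (0 to 4)
            _, predicted = torch.max(outputs.logits, dim=-1)
            predictions.extend(predicted.cpu().tolist())
            true_labels.extend(labels.cpu().tolist())
    return classification_report(true_labels, predictions)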


After running this, I got a brutal classification report that screamed, "You need more data, buddy!" 📉
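For anyone wondering what that report actually contains: for each class it lists precision (how many predictions of that class were right), recall (how many true examples of that class were found), an F1 score, and support (how many true examples there were). Here is a small plain-Python sketch of precision, recall, and support, using a hypothetical helper per_class_metrics and invented labels. It is meant to show the idea, not to replace scikit-learn's implementation.

```python
def per_class_metrics(true_labels, predictions):
    """Compute precision, recall, and support per class from two equal-length lists."""
    metrics = {}
    for cls in sorted(set(true_labels) | set(predictions)):
        tp = sum(1 for t, p in zip(true_labels, predictions) if t == cls and p == cls)
        predicted = sum(1 for p in predictions if p == cls)
        actual = sum(1 for t in true_labels if t == cls)
        metrics[cls] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
            "support": actual,
        }
    return metrics


# Invented labels: the model over-predicts class 1
true = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 1, 2]
for cls, row in per_class_metrics(true, pred).items():
    print(cls, row)
```

In this toy case class 1 gets perfect recall but only 0.5 precision, which is the typical signature of a model that leans on one label.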

4. The Root Cause: My Dataset Needs a Lawyer-Grade Makeover 📊

After some reflection, I realized the real issue was my dataset. It’s like trying to learn law from a pamphlet instead of an encyclopedia. 📚

I need to get my hands on a large, reliable, and indexed dataset that can better train the model. If anyone knows where to find high-quality legal datasets, I’m all ears! 👂
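Before hunting for new data, a quick health check of the current dataset can show how thin it really is. The sketch below is one possible check, assuming the texts and danger levels are available as two plain lists. The function name dataset_summary and the sample rows are invented for illustration.

```python
from collections import Counter


def dataset_summary(texts, labels):
    """Summarize size, label balance, exact duplicates, and typical text length."""
    lengths = sorted(len(t.split()) for t in texts)
    return {
        "examples": len(texts),
        "per_label": dict(sorted(Counter(labels).items())),
        "duplicates": len(texts) - len(set(texts)),
        # Middle element of the sorted lengths (upper middle for even counts)
        "median_words": lengths[len(lengths) // 2] if lengths else 0,
    }


# Invented rows, purely for illustration
texts = [
    "The tenant waives all rights.",
    "Payment is due in 30 days.",
    "The tenant waives all rights.",
    "Either party may terminate with notice.",
]
labels = [5, 1, 5, 2]
print(dataset_summary(texts, labels))
```

Missing danger levels, heavy duplication, or only a handful of examples per level would each help explain near-random accuracy.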

5. Annotating Contracts (A Work in Progress) ✍️


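def annotate_contract(model, tokenizer, contract_text):
    device = next(model.parameters()).device
    model.eval()  # disable dropout for inference
    # Move the encoded inputs to the same device as the model
    inputs = tokenizer(contract_text, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        _, predicted = torch.max(outputs.logits, dim=-1)
    danger_level = predicted.item() + 1  # shift back from 0-4 to the 1-5 scale
    # analyze_problematic_sections is not shown in this post and must be defined elsewhere
    problematic_sections = analyze_problematic_sections(contract_text, danger_level)

    return {
        'danger_level': danger_level,
        'problematic_sections': problematic_sections
    }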


This function is supposed to analyze the legal contract and predict the danger level, but as you might guess, it’s not ready to replace your lawyer just yet. 🧐
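Since the model is still unreliable, it could help to report how confident each prediction is alongside the danger level. The sketch below shows the idea in plain Python: apply softmax to the five logits and return the winning level with its probability. The function danger_level_with_confidence and the example logits are hypothetical. In the real function the logits would come from outputs.logits.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def danger_level_with_confidence(logits):
    """Return (danger level on the 1 to 5 scale, probability of that level)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1, probs[best]


# Made-up, nearly flat logits: the model barely prefers level 2
level, confidence = danger_level_with_confidence([0.1, 0.3, 0.2, 0.25, 0.15])
print(f"danger level {level}, confidence {confidence:.2f}")
```

A confidence close to 0.20 on a five-level scale means the prediction is barely better than a coin toss between all levels, which is worth surfacing to the reader of the annotation.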

Next Steps: A Better Dataset and Model Tuning 📈

I’m planning to go on a treasure hunt for a better dataset. Once I have more data, I’ll revisit model training, tweak hyperparameters, and hopefully get a model that can actually understand legal jargon! ⚖️

Until next time, may your accuracy be ever in your favor! 🚀
