🕵️ OSINT: link company acronyms to Standard Occupation Classification w. Open Source LLMs

🧑‍🎓 `OSINT`: "What is Open-Source Intelligence?"

According to sans.org :

"Open-Source Intelligence (OSINT) is defined as intelligence produced by collecting, evaluating and analyzing publicly available information with the purpose of answering a specific intelligence question."

In this post, I'll walk you through an experiment I ran.

My main goal is to show how seemingly naive data can lead to valuable intelligence, help discover new kinds of information at scale, and reveal strategic insights.

"[...] information does not equal intelligence. Without giving meaning to the data we collect, open-source findings are considered raw data. It is only once this information is looked at from a critical thinking mindset and analyzed that it becomes intelligence."

Finally, we'll focus on

"[...] finding meaningful information that is applicable to the intelligence question and being able to provide actionable intelligence[...]."

🔁 Intelligence cycle

We'll implement (and share the source code on Kaggle) a full "Intelligence cycle": it takes Open Data as input and delivers a brand new, enhanced and structured Open Data dataset, with an ollama-based GenAI processing step in the middle. All the code and the LLMs used are publicly available.

🍿 For the impatient

💭 About Standard Occupation Classifications

The SOC system categorizes jobs in a standardized way to:

Help governments and organizations track employment trends
Understand skill needs
Shape workforce policies

By using a consistent framework, SOC data makes it easier to compare job markets, plan training programs, and respond to shifts in the economy, benefiting both policymakers and employers.

So... plenty of strategic insights, which made me want to link SOC codes to other datasets.

🇫🇷 About the French `ROME` code

In France, the ROME (Répertoire Opérationnel des Métiers et des Emplois) system does something similar, classifying jobs like "Développeur Informatique" (Software Developer) under specific categories to match skills with job needs and support workforce planning. See 🗂️ Codes ROME database for more.

🤔 About acronyms

Every enterprise has its own set of acronyms. I find them very useful because, in a way, they reveal an organization's points of interest and activities.
Also, as a new recruit, you may need a reference to understand the common jargon, the documents and your colleagues in meetings (which was my case).

OPT-NC publicly shared its acronyms as an Open Data dataset:

Teaser tweet

In general, an acronym has:

Very few letters (SaaS, for example)
A sentence that explains its meaning, which is very specific

So a collection of them embeds a lot of information, especially when they are specific to your activities.

💡 The idea : delegate to `LLM`

Linking acronyms to job classifications should (that's my hypothesis) give insights about activities.

☝️ But with an ever-increasing number of acronyms and activities, it is much more interesting to delegate the creation of these relationships to an LLM.

🎯 Our goal

As output, we want a traditional, well-structured database with integrity constraints: a ready-to-use duckdb database (and csv) that links acronyms and activity codes.
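To make "integrity constraints" concrete, here is a minimal sketch of what such a schema could look like. The table and column names (`acronym`, `rome_code`, `acronym_rome`) and the `create_database` / `try_link` helpers are hypothetical, not those of the published dataset, and the sample rows are only illustrative. The standard-library `sqlite3` module stands in for duckdb so the snippet runs without extra dependencies.

```python
import sqlite3

# Hypothetical schema: two reference tables and one link table whose
# foreign keys reject any code that is unknown to the reference data.
SCHEMA = """
CREATE TABLE acronym (
    code    TEXT PRIMARY KEY,   -- e.g. 'SaaS'
    meaning TEXT NOT NULL
);
CREATE TABLE rome_code (
    code  TEXT PRIMARY KEY,     -- e.g. 'M1805'
    label TEXT NOT NULL
);
CREATE TABLE acronym_rome (
    acronym_code TEXT NOT NULL REFERENCES acronym(code),
    rome_code    TEXT NOT NULL REFERENCES rome_code(code),
    PRIMARY KEY (acronym_code, rome_code)
);
"""


def create_database() -> sqlite3.Connection:
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
    conn.executescript(SCHEMA)
    return conn


def try_link(conn: sqlite3.Connection, acronym_code: str, rome_code: str) -> bool:
    """Insert a link; return False when a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO acronym_rome VALUES (?, ?)",
                (acronym_code, rome_code),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = create_database()
conn.execute("INSERT INTO acronym VALUES ('SaaS', 'Software as a Service')")
conn.execute(
    "INSERT INTO rome_code VALUES ('M1805', 'Études et développement informatique')"
)
conn.commit()

print(try_link(conn, "SaaS", "M1805"))  # known code: accepted
print(try_link(conn, "SaaS", "Z9999"))  # unknown code: rejected
```

With constraints like these, a code invented by the model cannot silently end up in the final link table.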

Here is how I'll give OSINT a try:

Preparation : Transform existing open data into well-structured datasets
Collection : Make all required datasets available within a single Notebook
Processing : Build relationships thanks to an LLM and structured outputs
Analysis : Do some reporting on the output data with simple SQL and dataviz
Dissemination : Deliver the output data as a Kaggle duckdb dataset

🦾 All about relationship automation

The main idea of this prototype is to delegate the hard work to the LLM, relying on its core knowledge:

No RAG
No dedicated fine-tuned LLM
No Pydantic to ensure well-structured outputs

👉 Here, we'll focus on pure prompting over out-of-the-box LLMs.
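As an illustration of what "pure prompting" can look like, the sketch below builds a request for Ollama's `/api/generate` endpoint. The system prompt wording, the `rome_codes` JSON key, the `build_request` and `ask_ollama` helper names and the default local URL are assumptions made for this example; the actual prompt lives in the Kaggle notebook. `"format": "json"` is Ollama's JSON mode, and whether the notebook relies on it is not stated here.

```python
import json
import urllib.request

# Hypothetical system prompt: the real one is defined in the notebook.
SYSTEM_PROMPT = (
    "You are an expert of the French ROME job classification. "
    "Given an acronym and its meaning, answer ONLY with JSON of the form "
    '{"rome_codes": ["M1805", ...]} listing the most relevant ROME codes.'
)


def build_request(model: str, acronym: str, meaning: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,            # switching LLM = changing this value
        "system": SYSTEM_PROMPT,
        "prompt": f"Acronym: {acronym}\nMeaning: {meaning}",
        "format": "json",          # ask Ollama for a JSON answer
        "stream": False,           # one complete answer, no token stream
    }


def ask_ollama(payload: dict,
               url: str = "http://localhost:11434/api/generate") -> str:
    """Send the request to a running Ollama server and return the raw answer."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Only the payload is built here; ask_ollama() needs a running server.
payload = build_request("llama3.1:8b", "SaaS", "Software as a Service")
print(json.dumps(payload, indent=2))
```

Keeping the model name as a plain parameter is what makes it cheap to rerun the same prompt against several LLMs.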

The pipeline runs in eight steps:

Import the acronyms dataset: 📘 Lexique des acronymes de l’OPT-NC
Import the SOC/ROME codes dataset: 🗂️ Codes ROME database
Build a customized ollama model with a dedicated PROMPT to get structured JSON output: OPT-NC : Acronymes genai augmentés
For each acronym, get from this custom model a collection of matching SOC/ROME codes as JSON
Load the JSON into a duckdb staging table
Check and remove hallucinations: verify the integrity of the generated SOC codes against the reference database
Share the generated data as a dataset: OPT-NC acronyms Enhanced by Open Source AI
Enjoy the generated knowledge: run some analysis on the output database, see the Kaggle Notebook OPT-NC acronyms genai exploration
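The parsing and hallucination-check steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notebook's code: the `rome_codes` key, the `parse_codes` and `split_hallucinations` helpers, the "one letter plus four digits" code shape and the tiny reference set are all made up for the example.

```python
import json
import re

# Assumed shape of a ROME code: one uppercase letter followed by four digits.
ROME_PATTERN = re.compile(r"^[A-Z]\d{4}$")


def parse_codes(raw_answer: str) -> list[str]:
    """Extract candidate ROME codes from a model answer; tolerate bad JSON."""
    try:
        data = json.loads(raw_answer)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    codes = data.get("rome_codes", [])
    if not isinstance(codes, list):
        return []
    # Keep only strings that at least look like a ROME code.
    return [c for c in codes if isinstance(c, str) and ROME_PATTERN.match(c)]


def split_hallucinations(codes: list[str], reference: set[str]):
    """Separate codes found in the reference set from invented ones."""
    valid = [c for c in codes if c in reference]
    hallucinated = [c for c in codes if c not in reference]
    return valid, hallucinated


# Illustrative reference set; the real one comes from the ROME codes dataset.
reference_codes = {"M1805", "M1801", "I1401"}
raw = '{"rome_codes": ["M1805", "M9999", "not-a-code"]}'

valid, hallucinated = split_hallucinations(parse_codes(raw), reference_codes)
print(valid)         # ['M1805']
print(hallucinated)  # ['M9999']
```

In this sketch, malformed values such as `not-a-code` are dropped before the check rather than counted as hallucinations; the notebook may count them differently.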

⚖️ Accuracy ratio and LLMs benchmark

With this approach, switching between LLMs and benchmarking them only requires changing a parameter and waiting for the Notebook to finish on Kaggle:

| LLM | Hallucinated ROME codes | Valid ROME codes | Duration |
|---|---|---|---|
| `reflection` | 25 | 127 | 3h47' |
| `llama3.1:70b` | 31 | 137 | 4h05' |
| `llama3.1:8b` | 71 | 40 | 4' |
| `nemotron` | 41 | 194 | 6h08' |
| `qwen2.5:72b` | 52 | 146 | 6h15' |
| `nous-hermes2-mixtral:8x7b` | 7 | 23 | 8h53' |

Next, we can compute the accuracy ratio, defined as (valid codes) / (valid codes + hallucinated codes):

| LLM | Hallucinated ROME codes | Valid ROME codes | Accuracy ratio | Duration |
|---|---|---|---|---|
| `reflection` | 25 | 127 | 83.6 % | 3h47' |
| `llama3.1:70b` | 31 | 137 | 81.5 % | 4h05' |
| `llama3.1:8b` | 71 | 40 | 36.0 % | 4' |
| `nemotron` | 41 | 194 | 82.6 % | 6h08' |
| `qwen2.5:72b` | 52 | 146 | 73.7 % | 6h15' |
| `nous-hermes2-mixtral:8x7b` | 7 | 23 | 76.7 % | 8h53' |

As my goal is to get as many valid ROME codes as possible with few hallucinations, here are the three best LLMs in my case:

🥇 Nemotron (nvidia/Llama-3.1-Nemotron-70B-Instruct) produces the most valid codes: 194, with an accuracy ratio of 82.6 %.
🥈 Llama3.1:70b with 137 valid codes
🥉 Reflection with 127 valid codes

Note that `qwen2.5:72b` returns more valid codes (146) than the last two, but with a lower accuracy ratio (73.7 %).
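The accuracy ratio is simple enough to recompute from the counts in the benchmark table. The snippet below is a small sketch that does so and sorts the models by number of valid codes; the `accuracy_ratio` function name and the `results` dictionary layout are choices made for this example.

```python
# (hallucinated codes, valid codes) per model, copied from the benchmark table.
results = {
    "reflection": (25, 127),
    "llama3.1:70b": (31, 137),
    "llama3.1:8b": (71, 40),
    "nemotron": (41, 194),
    "qwen2.5:72b": (52, 146),
    "nous-hermes2-mixtral:8x7b": (7, 23),
}


def accuracy_ratio(valid: int, hallucinated: int) -> float:
    """valid / (valid + hallucinated); 0.0 when the model returned nothing."""
    total = valid + hallucinated
    return valid / total if total else 0.0


# Rank models by number of valid codes, the criterion used in this post.
ranking = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for model, (hallucinated, valid) in ranking:
    ratio = accuracy_ratio(valid, hallucinated)
    print(f"{model:<28} {valid:>4} valid  {ratio:6.1%}")
```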

💰 Benefits

For example, the output database makes it possible to drill down into job categories.
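A drill-down of this kind boils down to a `GROUP BY` on a prefix of the ROME code. The sketch below assumes that the first letter of a ROME code identifies its top-level domain; the `acronym_rome` table, its columns and the sample rows are made up for the example, and `sqlite3` stands in for duckdb so that it runs with the standard library only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical link table with a few made-up sample rows.
conn.execute("CREATE TABLE acronym_rome (acronym TEXT, rome_code TEXT)")
conn.executemany(
    "INSERT INTO acronym_rome VALUES (?, ?)",
    [
        ("SaaS", "M1805"),
        ("SaaS", "M1801"),
        ("DSI", "M1803"),
        ("FTTH", "I1307"),
    ],
)

# Assumption: the first letter of a ROME code is its top-level domain.
query = """
SELECT substr(rome_code, 1, 1) AS major_domain,
       COUNT(DISTINCT acronym) AS acronyms,
       COUNT(*)                AS links
FROM acronym_rome
GROUP BY major_domain
ORDER BY links DESC, major_domain
"""

rows = conn.execute(query).fetchall()
for major_domain, acronyms, links in rows:
    print(major_domain, acronyms, links)
```

Widening the prefix (for example `substr(rome_code, 1, 3)`) gives a finer-grained view with the same query shape.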

📑 Resources : Notebooks and datasets

🦾 Notebook that links acronyms to ROME codes : OPT-NC : Acronymes genai augmentés
📚 Dataset OPT-NC acronyms Enhanced by Open Source AI
📊 Analysis Notebook : OPT-NC acronyms genai exploration

Core datasets:

Author: adriens