Jonathan Eccker
Posted on October 11, 2019
The Problem
Over the last few years I have done a ton of legacy code/database clean up. This often involved confirming that a table we suspected was dead wasn't referenced in any active repositories. There was also a ton of stored procedures left over from the days of old (I think over a thousand at its worst, we're down to about 60 nowadays). We needed a way to tell if a table was recursively referenced through these stored procedures back to any file in active repositories. This recursion made it really difficult to manually audit table usage.
About a year into dealing with very slow manual audits, I happened across Neo4J when doing R & D for a project. The thing that really struck me is how simplistic recursive queries are in Cypher (the querying language Neo4J uses). More importantly, the language really helped in getting speedy answers to relationship-focused questions. So naturally, I set out to map our table/stored procedure/code dependency graph.
The Solution
I won't go too much into the actual process that built these relationships, it was fairly crude and mostly focused on loading stored procedure, view definitions and repository information into memory, then using the regular expression:
( |\.|\[|"|\n)StoredProcedure/Table/ViewName( |\.|\]|"|\n)
It returned a lot of false positives for tables that have simplistic names (I.E User
, Vehicle
). Fortunately for me, my predecessors liked weirdly named tables and occasionally using the tTableName
naming pattern, so this crude RegEx was sufficient for most cases. For those curious, I just hosted a Neo4J container on my own machine using Docker to house the results, although Neo4J does provide free sandboxes you can spin up.
The end result of this is a series of nodes representing Entities (Files, Tables, Views, StoredProcedures) and Locations (SqlDatabase, GitRepository). The two relationships used are LIVES_IN
(for mapping an entity to it's database or repository) and USED_IN
(representing a dependency).
They're Dead, Jim
The big question I originally wanted to ask was "What stored procedures cannot be referenced back to a file?"
The Cypher query for this looks something like this:
MATCH(sp:StoredProcedure)
WHERE NOT (sp)-[:USED_IN*..]->(:File)
RETURN sp.Database, sp.Name
The *..
makes the relationship match recursive. So it will include matches, for example, if a table is used in a stored procedure which is used in a stored procedure which is used in a file.
Because our crude RegEx leaned on the side of returning false positives over ignoring false negatives, this simplistic query actually helped us get rid of about 600 out of the 1000-ish stored procedures. I even used it to write the query for dropping the procedures:
MATCH(sp:StoredProcedure)
WHERE NOT (sp)-[:USED_IN*..]->(:File)
RETURN 'USE '+sp.Database+' DROP PROCEDURE IF EXISTS '+sp.Name
(Using Cypher to write SQL definitely feels weird)
Trace It Back
After the above was taken care of, we were often left with tables that we figured should be dead, but were being referenced by code that was probably dead (mapping file->file relationships is not something I've managed to accomplish yet, it gets tricky when you use things like dependency injection). So another common query that ended up being used was:
MATCH path = (t:Table {Database:'DealerOn', Name:'DealerMake'})-[:USED_IN*..]-> ↩
(:File)-[:LIVES_IN]->(:GitRepository)
RETURN path
Because this is returning actual nodes (within the path), Neo4J gives you a nice visualization of the dependencies. This is good for getting an initial feel for if one repository is more tightly coupled to the table in question than other repositories.
You Can Ask Most Anything
I think the main reason I love GraphDbs as analytics tools is because they feel very fluid for answering any question you come up with, not just questions the database was designed to answer. A couple I've found useful over the months:
Get list of stored procedures that are used in more than one file:
MATCH(sp:StoredProcedure)-[r:USED_IN]->(:File)
WITH sp, COUNT(r) AS pathCount
WHERE pathCount>1
RETURN sp.Name, pathCount
Get count of unique stored procedures used by each repository:
MATCH(g:GitRepository)<-[:LIVES_IN]-(:File)<-[:USED_IN]-(sp:StoredProcedure)
RETURN g.Name, COUNT(DISTINCT sp)
Get the longest recursive stored procedure reference path (neo4j prevents these queries from being infinitely recursive):
MATCH path = (sp:StoredProcedure)-[:USED_IN*..]->(:StoredProcedure)
RETURN path ORDER BY LENGTH(path) DESCENDING LIMIT 1
Finding stored procedures that reference tables/stored procedures from other databases:
MATCH(sp:StoredProcedure)<-[:USED_IN]-(otherEntity:Entity)-[:LIVES_IN]->(db:SqlDatabase)
WHERE sp.Database <> db.Name
RETURN sp.Database, sp.Name, db.Name, otherEntity.Name
Find number of paths to a file each table has, and get a list of each repository the table is referenced in. Note that optional match allows the query to include tables that have no paths to files.
MATCH(t:Table)
OPTIONAL MATCH paths = (t)-[:USED_IN*..]->(f:File)
RETURN t.Database, t.Name, COUNT(paths), COLLECT(DISTINCT f.Repository)
ORDER BY COUNT(paths) ASC
What's Next
I'm looking into if I can figure out a way to build code->code dependency relationships (which I know is possible as I've seen other tools do it), and a better method for determining entity usage than a simplistic RegEx. With those two problems solved I really believe GraphDbs could be an extremely strong Code Dependency analyzer.
Posted on October 11, 2019
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.