Investigating large batches of files for Data Integrity using CLI tools and Vim
Michael Cantu
Posted on December 1, 2023
This is post #2 in my Data Integrity for the Laboratory series.
The last post was mostly background info; here we'll explore a practical example in the realm of data integrity.
Background info
When testing is done on a sample, any measurement data collected should be secured in a contemporaneous manner (e.g. removed from the PC and stored in a secure repository) to prevent any accidental or intentional modification or loss of data. I'll refer to this measurement data as the raw data. The risk of data being compromised is higher for instruments that do not save their data directly into a secure database. Some instrument vendors have transitioned to saving raw data into a DB, but many still save their raw data as flat files or binary files in a local folder on the instrument PC.
Problem
For this example, I will not write any scripts; I'll use the CLI and Vim as power tools to get the job done. I find this approach works well for one-offs and infrequent projects.
In this exercise, we have an Instrument PC with several years of unaccounted-for, unsecured instrument raw data files: let's say ~10k files spanning every nook and cranny of the PC. Ideally, according to a typical data integrity SOP, there should be no (or close to no) raw data files on this Instrument PC. All raw data files should have been secured and accounted for on a secure server, where files are accessible but immutable once moved there. I won't get into the weeds of why this raw data was there, but I'll coarsely define two possible scenarios:
- Some instrument software requires that the secured raw data be copied back to the instrument PC for data processing.
- The raw data was never secured properly (e.g. removed from the PC onto a secure server) for various reasons, accidental or not.
So item 1 is a legitimate use case, but we still need to verify that we have a corresponding exact copy on our secure server. If there is an exact match, the copy on the Instrument PC carries less of an investigational burden, and we can handle removing it through SOPs.
Item 2 may or may not be accidental, but just as with item 1, we need to confirm whether there is a corresponding exact copy on our secure server.
If not, we need to dig deeper and use the other redundancies we have in place to ensure data integrity (e.g. audit trails, file system monitoring, LIMS, access control, etc.; I hope to touch on these throughout the series).
Reiterating our primary goal from above:
Find any and all raw data on the instrument PC, and verify that we have a corresponding exact copy on our secure server
The above figure is just to help visualize the problem; it's not a network diagram.
For this exercise, both the Secure File Server and the Instrument PC are Windows machines (it's rare for an Instrument PC to run anything other than Windows), and both are networked to a dev box with the following tools:
- Git Bash and WSL (Debian) for Linux command-line tools (find, sha256sum)
- Vim to clean and reformat search results
- Diff tools: vimdiff or VS Code's diff tool (very similar)
- SSMS (SQL Server Management Studio) to pull file locations of raw data on the secure server
Plan of attack
- Create a filepath list of all raw data files on the instrument PC.
- Use this filepath list to compute file hashes for all found raw data files.
- Create another list of just the filenames from the raw data filepath list.
- Use the extracted filenames to query the LIMS DB to get filepaths for corresponding secured files on the secure file server.
- Use this filepath list of secured files to compute file hashes for the secured files.
- Compare the raw data file hashes with their secured counterparts to identify a) unsecured files and b) files with mismatched hashes.
OK, let's do it!
1. First, we need to find all relevant raw data files on the Instrument PC. For this exercise, searching by file extension is good enough: all our data has the extension ".wiff" or ".wiff.scan". There may be scenarios where you can't or shouldn't search by file extension alone, and should instead look for a unique, common signature within the file's metadata (I'll see if I can explore that in a later post). Using Git Bash, we'll search as follows:
find //instrument_pc/d$ -name '*.wiff*' >> found_files.txt
find //instrument_pc/e$ -name '*.wiff*' >> found_files.txt
find //instrument_pc/f$ -name '*.wiff*' >> found_files.txt
etc...
Each line above will recursively search a lettered drive and append the find output to "found_files.txt".
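If there are many drives to cover, a quick loop saves the repetition. Here's a minimal sketch, assuming Git Bash and the same d$/e$/f$ admin-share pattern as above:
for drive in d e f; do
  find "//instrument_pc/${drive}\$" -name '*.wiff*' >> found_files.txt
done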
There are probably going to be filepaths with spaces, and those won't play well in subsequent steps when we feed them to other commands. So be sure to wrap each filepath in single quotes (' and '). An easy way to do this is with Vim.
In the Vim command line:
%s:^:':
Add a quote at the start of each line.
%s:$:':
Add a quote at the end of each line.
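If you'd rather stay in the shell for this step, a single sed substitution can do both at once; this is a sketch assuming GNU sed's in-place flag:
sed -i "s/.*/'&'/" found_files.txt
This wraps every line (the whole match, &) in single quotes.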
2. After we compile our list of filepaths for the raw data on the Instrument PC, we can feed these filepaths to sha256sum to create a two-column list of hashes and filepaths. We do this by piping the contents of "found_files.txt" into sha256sum with the help of xargs.
cat < found_files.txt | xargs sha256sum > instrPC_hash_results.txt
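As an aside, if you skip the quoting step above, GNU xargs can split strictly on newlines instead, so filepaths with spaces still survive intact:
xargs -d '\n' sha256sum < found_files.txt > instrPC_hash_results.txt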
3. Next, we need to compare this list of hashes + filepaths (instrPC_hash_results.txt) to the corresponding files on the Secure File Server. In our exercise, our LIMS (Laboratory Information Management System) database keeps track of the filepaths where the original copies of raw data are stored on the secure server (so we don't have to comb through terabytes of files and folders); we just have to get a list of files to query. In this exercise, the filenames themselves can be treated as primary keys, i.e. unique identifiers we can use to query our LIMS DB and pull the secure-server filepaths. We're going to use Vim as a "precision chainsaw" on our first results file, "found_files.txt", to get a list of just filenames (i.e. no filepaths, no file extensions).
For this step, we'll work on a copy of instrPC_hash_results.txt, since we'll need the original later.
With the power of Vim, I'm going to get this done with less thinking and more action:
First, I'll remove the extension strings on all lines by substituting them away:
%s:\.wiff.*$::
Next, I'll remove any common filepath strings on all lines:
%s:^//instrument_pc/d\$/scratch_folder/::
(etc., saving the rest of the path to use as an example for the next step)
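If all you're after is the bare filename, a shortcut (assuming every path uses forward slashes, like our find output) is one greedy substitution that strips everything up to the last slash, whatever the drive or folder:
%s:^.*/::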
After that, there might be less predictable strings, so we should inspect and delete large blocks of text visually. Here comes the chainsaw part.
This works well in VS Code with the Vim extension using visual block mode, which lets you highlight rectangular blocks of text without the whole-line wrapping that's typical when selecting multiple lines.
Hit Ctrl+V
to enter visual block mode, free-select large blocks of text, then press d
to delete.
One final cleanup step: in the Vim command line, use the command
:sort u
to sort the list and get rid of duplicates. (For this example it's OK to drop duplicates; there should be only a one-to-one relationship between filenames and filepaths.)
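The same cleanup works from the shell too, if you'd rather not use Vim; the filename here is just a placeholder for wherever you saved the list:
sort -u filenames.txt -o filenames.txt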
4. At this point, we should have a list of bare filenames that we can use to query the LIMS DB.
Continuing the theme of just using Vim, I'm going to format this list so it can be plugged straight into a SQL query, e.g.
SELECT filepath FROM datatable WHERE filename IN (1234567, 9876543, ...)
A quick way to do this is to add a comma to the end of each line of our file list:
%s:$:,:
This adds a comma to every line; we'll just need to get rid of the very last comma at the end of the file.
Next, I'll select all lines in visual line mode, starting from the bottom:
Shift+G
to go to the bottom (or gg to start from the top; it doesn't matter, just invert these steps if you do),
Shift+V
to enter visual line mode,
gg
to go to the top, so all lines are highlighted. Now hit
Shift+J
to join all lines.
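For what it's worth, the add-commas-then-join sequence can also be collapsed into a single substitution on the original one-per-line list, replacing each newline with a comma and a space (expect to clean up one stray trailing comma afterwards):
%s:\n:, :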
Now you can just plug this into the WHERE clause, using a tool that queries the DB directly, such as SSMS (our LIMS DB in this example is a SQL Server DB).
Depending on how many files you have, you might have to break the list into smaller batches to make sure you don't hit the maximum query length.
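If you do hit that limit, coreutils' split can chunk the one-filename-per-line list before the comma/join formatting; the 1000-line chunk size and the query_batch_ prefix here are arbitrary:
split -l 1000 filenames.txt query_batch_
Each resulting query_batch_* file can then be formatted and queried separately.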
We'll save the results as "LIMS_secured_filepaths.txt"
5. Just as we did in step 2, we'll use the same commands to create a new list of SHA256 file hashes + filepaths.
cat < LIMS_secured_filepaths.txt | xargs sha256sum > secured_files_hash_results.txt
Just like in step 2, secured_files_hash_results.txt will be a two-column list with the hash in the first column and its corresponding filepath in the second.
6. Now we should have two two-column lists of file hashes plus their corresponding filepaths (refer back to step 2 for an example).
We need to do some cleanup to get the two lists into a state where we can compare them easily with vimdiff.
First, for both lists, we need to get rid of the full filepaths like we did in step 3, while keeping the file hashes. Ensure there's still a space/tab delimiter between the hash and the bare filename.
Next, again for both lists, we need to swap the columns. You can use Vim's visual block select again (like we did in step 3) to cut and paste so that the bare filename column becomes the first column. You might need to add a space or tab to the start or end of each line as a delimiter first:
:%s:$: :
or :%s:$:\t:
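Alternatively, the whole swap can be done in one substitution, capturing the hash and the filename as groups and writing them back in reverse order with a tab in between (a sketch, assuming the hash is the first whitespace-free field on each line):
%s:^\(\S\+\)\s\+\(.*\)$:\2\t\1: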
Finally, for both lists, we'll use the Vim command
:sort u
once more to sort the lists, remove duplicates, and get things in order.
We do this column swap and sort because, by moving the filename to the beginning of each line in both lists, both lists should end up in the same order once we apply the sort command; as I said before, the filenames are used as primary keys in our example LIMS DB. If we didn't swap and the file hashes came first, we'd be forcing Vim to order the lines by hash.
Also, it should be safe to remove dupes; in theory, we won't be losing anything critical. Although we lost the filepath information in these lists, and a file hash can repeat in a list (as can a filename), the combination of file hash plus filename should be unique, even if exact copies live in different filepaths. And if they are exact copies, then we only need a single comparison against the secure file server version.
We'll use vimdiff to get a nice visual comparison
vimdiff secured_files_hash_results.txt instrPC_hash_results.txt
So in this example, the secured files are on the left and the raw data files are on the right. We can see that we are missing secured copies of "123456789-3.wiff" and its ".wiff.scan" counterpart, and that elsewhere the secured and raw data versions do not match; both cases need further investigation.
Alternatively, we can use the diff command to output a list of the lines that differ, or of lines that exist only on one side (left or right). I'll leave that up to you as extra credit :)
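As a head start on that extra credit, something like the following should work; note that comm expects both files to be sorted, which our :sort u pass already handled:
diff secured_files_hash_results.txt instrPC_hash_results.txt
comm -3 secured_files_hash_results.txt instrPC_hash_results.txt
With comm -3, lines found only in the secured list print in the first column, and lines found only in the instrument list print in the second.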