Max
Posted on April 7, 2020
A practical guide for large-scale "on-prem" to Azure SQL migration projects
Preparing a system for migration to a cloud environment can take a long time, and there is a good chance the production codebase will change during that time. This post offers a simple solution for identifying and merging such changes. It is very specific to Azure migration and is unlikely to be of interest in any other context.
Our cloud migration project took 3 months from taking a snapshot of all production databases to making the code fully compatible with Azure SQL. In the meantime, the SQL code in production kept changing to keep up with customer requests and bug fixes. We decided to ignore those changes until the very end of the migration and then compare and merge the entire codebase in a single sweep. That decision was a simple precaution: there could have been any number of untracked changes we didn't know about and would have missed if we had relied on tracking changes in production as they happened.
This diagram illustrates the relationship between the DB codebase and the Git repository as a timeline.
Background
The system consisted of two types of MS SQL databases: customer and shared (central) databases.
Every customer had a separate DB with a standardized schema and identical (in theory) codebase. Every shared DB had a different schema and codebase. All databases were tightly linked via cross-DB SQL queries.
Customer databases
Customer DBs held data specific to a single customer but shared the same codebase. In practice, there were minute differences in the codebase as well, e.g. customer-specific code and an accumulation of DevOps errors.
Only the differences affecting Azure compatibility would have to be identified and merged.
Naming convention:
- c_4vallees_base - the original customer DB snapshot taken for development
- c_4vallees_model - c_4vallees_base modified for Azure
- c_4vallees - the latest copy from production
- c_8hdke93, c_a83hdk2, c_hdj3ud5 - the latest production copies of other customer DBs
Shared databases
Shared DBs had schema and code that were specific to each DB. No code was shared between them.
- central, helpdesk, reporting - latest production copies
- central_base, helpdesk_base, reporting_base - original snapshots taken for development (before any Azure modifications)
Merge sequence
Start with 3 sets of T-SQL code for each database:
- base - the original snapshot before Azure modifications
- model - a modified snapshot that works on Azure
- prod - the latest copy from production
and compare them in this order:

- base to model - to get the list of all Azure compatibility changes
- base to prod - to find out what changed in production while we were busy modding base for Azure
- merge prod and model
The main advantage of this method is that it can be re-run on a newer production set within minutes if the Azure deployment is delayed or rolled back.
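As a rough illustration, here is how those three comparisons map onto the Git layout used in the rest of this guide (the DB name is just an example; Steps 1-3 below contain the full scripts):

# A rough map of the three comparisons for one shared DB:
#   1. base -> model : git diff between the first commit and HEAD of the DB repo (Step 1)
#   2. base -> prod  : extract the changed objects straight from the DBs with the AZPM tool (Step 2)
#   3. prod + model  : review and merge the branches in Git (Step 3)
$dbRepo = "db\central"
$baseCommit = (git -C $dbRepo rev-list HEAD).split()[-1]   # the very first commit holds the base code
git -C $dbRepo diff --name-only $baseCommit HEAD           # base -> model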
Project structure
All scripts provided in this guide rely on the following file structure:
- \ - root solution folder
  - db - contains folders with T-SQL code from snapshot DBs
    - c_4vallees - T-SQL code of the c_4vallees DB, including its own Git repo
    - central - same for central and other DBs
    - helpdesk
    - reporting
    - etc ...
  - utils - all sorts of shell and SQL scripts for code refactoring and automation
  - staging-diff - diff files for comparing base and model versions
  - customer-dbs - modified code for customer DBs, based on a single customer model DB
All SQL code was committed to Git repositories: one repo per shared DB and one repo for a sample customer DB (c_4vallees). The rest of the customer databases were supposed to be identical to the sample.
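The post does not show how those repos were created, but the key assumption used by all the scripts below is that the very first commit of each repo holds the unmodified base snapshot. A minimal setup sketch under that assumption:

# one Git repo per DB folder; the first commit is the unmodified base snapshot
foreach ($dbName in @("c_4vallees", "central", "helpdesk", "reporting")) {
    $dbDir = Join-Path "db" $dbName
    git -C $dbDir init
    git -C $dbDir add -A
    git -C $dbDir commit -m "Base snapshot of $dbName"
}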
Step 1: identifying Azure compatibility modifications
This script diffs the latest version of modified (model) DBs against their original (base) state and outputs a list of SQL objects changed for migration. We will need this list to limit our search for changes in production to those objects only. Any other production changes will carry over to Azure as-is.
This and all the other scripts in this guide should be run from the solution root because they expect database repositories to be in ./db/db_name
folders.
Required input vars:
- $diffDBs - an array of DB names, e.g. $diffDBs = @("central", "helpdesk", "reporting")
- $diffFolderName - relative path to the output folder, e.g. $diffFolderName = "staging-diff"
Output: a collection of diff files, one per DB, e.g. staging-diff/init-head-diff-db_name.txt
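All scripts in this guide start by dot-sourcing a vars.ps1 file located next to them. Its contents are not shown in this post, but for Step 1 it could be as simple as this (later steps add their own variables to the same file):

# vars.ps1 - shared input variables for the migration scripts (Step 1 subset)
$diffDBs = @("central", "helpdesk", "reporting")   # DBs to diff
$diffFolderName = "staging-diff"                   # output folder for the diff files

With the variables in place, the Step 1 script is: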
. (Join-Path $PSScriptRoot vars.ps1)
# prepare paths and variables
$rootDir = (Get-Item -Path ".\").FullName
$diffDir = Join-Path $rootDir $diffFolderName
# create a folder for diffs, if needed
if (!(Test-Path -Path $diffDir )) {
New-Item -Path $rootDir -Name $diffFolderName -ItemType "directory" -Force | Out-Null
}
"Diff folder: $diffDir"
foreach ($dbName in $diffDBs) {
"`nDB: $dbName"
# try to checkout master branch
$dbCodeFolder = Join-Path "db" $dbName
git -C $dbCodeFolder checkout master --force
if ($LASTEXITCODE -ne 0) {
Write-Host "Cannot checkout MASTER on $dbCodeFolder" -ForegroundColor DarkRed
continue
}
# get the very first commit with the unmodified (base) code
$allCommits = git -C $dbCodeFolder rev-list HEAD
$initialCommit = $allCommits.split()[-1]
"Initial commit: $initialCommit"
# save the list of all modified files
$diffFileName = Join-Path $diffDir "init-head-diff-$dbName.txt"
if (!(Test-Path -Path $diffFileName )) {
$modFiles = git -C $dbCodeFolder diff --name-only $initialCommit HEAD
"Files in diff: " + $modFiles.split().length
Set-Content -Path $diffFileName -Value $modFiles
}
else {
Write-Host "Diff file already exist: $diffFileName" -ForegroundColor Yellow
}
}
The output of the script above is a list, per DB, of all the file names (objects) affected by the Azure modifications. The file names follow the SSMS object scripting format owner.object.type.sql, one file per object, e.g. dbo.SomeProcedure.StoredProcedure.sql.
Step 2: identifying recent changes in PROD
In this step, we compare the code directly between the base and prod DBs using their sys.syscomments tables, which avoids exporting and committing all objects from the production DBs into Git repos. The comparison is done by an open-source CLI tool (AZPM) we had to build for this project.
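The AZPM tool does the heavy lifting, but the idea behind the comparison can be illustrated with a rough manual spot check that reads an object's source text from sys.syscomments in both databases and compares a checksum. The object and DB names below are placeholders:

# compare one object's source text between base and prod straight from sys.syscomments
$q = "SET NOCOUNT ON; SELECT CHECKSUM_AGG(CHECKSUM(c.text)) FROM sys.syscomments c " +
     "JOIN sys.objects o ON o.object_id = c.id WHERE o.name = 'usp_SomeProc'"
$base = sqlcmd -S . -d central_base -h -1 -Q $q
$prod = sqlcmd -S . -d central -h -1 -Q $q
if ("$base".Trim() -ne "$prod".Trim()) { "usp_SomeProc differs between central_base and central" }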
This PowerShell script relies on the output from Step 1 and should also be run from the solution root.
Required input vars:
- $diffDBs - an array of DB names, e.g. $diffDBs = @("central", "helpdesk", "reporting")
- $diffFolderName - relative path to the output folder, e.g. $diffFolderName = "staging-diff"
- $azpm - location of the Azure migration CLI tool, e.g. $azpm = "C:\Temp\AzurePoolCrossDbGenerator.exe"
- $modelCustomerDB - name of the model customer DB. It is required only if you are processing customer DBs from a single model.
Output: new Git repo branches with unstaged files
. (Join-Path $PSScriptRoot vars.ps1)
# prepare paths and variables
$rootDir = (Get-Item -Path ".\").FullName
$diffDir = Join-Path $rootDir $diffFolderName
"Diff folder: $diffDir"
# check if the diff folder exists
if (!(Test-Path -Path $diffDir )) {
Write-Host "Cannot access the diffs in $diffDir " -ForegroundColor Red
Exit
}
foreach ($dbName in $diffDBs) {
$dbBase = $dbName + "_base" # the original version of the DB used for development
$dbCodeDir = Join-Path "db" $dbName # folder with the code repo
$diffFileName = Join-Path $diffDir "init-head-diff-$dbName.txt"
<# Uncomment this block if processing customer DBs to compare them to the same base
$dbBase = $modelCustomerDB + "_base"
$dbCodeDir = Join-Path "db" $modelCustomerDB
$diffFileName = Join-Path $diffDir "init-head-diff-$modelCustomerDB.txt"
#>
"`nStaging DB: $dbName"
"Base DB: $dbBase"
"Code repo: $dbCodeDir"
# get the list of modified files
if (!(Test-Path -Path $diffFileName )) {
Write-Host "Cannot access diff file: $diffFileName" -ForegroundColor Red
continue
}
# try to checkout master branch
git -C $dbCodeDir checkout master --force
if ($LASTEXITCODE -ne 0) {
Write-Host "Cannot checkout MASTER on $dbCodeDir" -ForegroundColor Red
exit
}
# get the initial commit
$allCommits = git -C $dbCodeDir rev-list HEAD
$initialCommit = $allCommits.split()[-1]
"Initial commit on master: $initialCommit"
# check out the very first commit into a new branch
$branchName = $dbName +"-" + (Get-Date -Format "yyyyMMddHHmm")
git -C $dbCodeDir checkout -B $branchName $initialCommit
if ($LASTEXITCODE -ne 0) {
"Failed to checkout $initialCommit as branch $branchName"
continue
}
# extract changed files from the diff list into the DB folder
$csl = "`"Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=$dbName;Server=.`"" # staging DB (latest prod)
$csb = "`"Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=$dbBase;Server=.`"" # a DB with the original code (base)
$wd = Join-Path $rootDir $dbCodeDir
Start-Process -FilePath $azpm -ArgumentList @("extract", "-csl", $csl, "-csb", $csb, "-l", "$diffFileName") -NoNewWindow -Wait -WorkingDirectory $wd
# multiple customer DBs are extracted into the same folder, so we need to stash the changes to keep them separate
# the stashes would need to be popped manually for further editing
$stashMsg = "staging"
git -C $dbCodeDir stash push -m $stashMsg
# run `git stash clear` to remove all stashes in the repo
}
Customer DBs vs Shared DBs
The data in our system was partitioned into a single DB per customer and several shared DBs for reporting, lists of values and taxonomies. All shared DBs were compared to their own base in a 1:1 relationship, so we could create a Git branch for every comparison in every repo and leave the changes staged for review.
Customer DBs, on the other hand, were all compared to the same sample DB called c_4vallees_base. To keep the Git repo clean we stashed the changes from every customer-DB-to-base comparison before moving on to the next DB. Stashing allowed us to review the diffs before making a commit.
For example, comparing customer DB c_hdj3ud5 to base creates stash c_hdj3ud5-202002211117:staging.
The screenshot above shows 3 branches named DB + timestamp with corresponding stashes to be reviewed and merged with the Azure-ready code.
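Reviewing a particular customer DB then comes down to a couple of plain git commands; the branch and stash names below follow the example above, and all commands run in the model customer repo because that is where the production code was extracted to:

# list the stashes and inspect the one for c_hdj3ud5 without popping it
git -C db\c_4vallees stash list
git -C db\c_4vallees stash show -p 'stash@{0}'
# or check out the matching branch and pop its stash to start the merge
git -C db\c_4vallees checkout c_hdj3ud5-202002211117
git -C db\c_4vallees stash pop 'stash@{0}'     # pick the stash@{n} that matches this DB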
Step 3: merging base and prod
The purpose of this merge is to converge our base version with the latest PROD changes.
For example, the diff in the screenshot below told us that prod had some new columns we had to incorporate into our Azure modifications.
After some "cherry-picking" we ended up with this version of accepted differences for a merge:
Once base and prod were merged, the tree had 2 branches: master for the model DB and c_8hdke93-202002211116 for PROD, followed by a merge with all Azure compatibility changes from master.
This diff confirmed that our Azure compatibility changes were present: cross-DB queries were correctly merged with Azure modifications and the new column names from PROD.
We repeated the merge for the other customer DBs from stashes and kept all customer-specific changes within their branches.
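In Git terms, Step 3 for one customer DB came down to a handful of commands; a sketch under the same branch and stash naming as above (conflict resolution is, of course, manual):

# merge PROD changes and Azure (model) changes for one customer DB
git -C db\c_4vallees checkout c_8hdke93-202002211116        # branch holding the base code
git -C db\c_4vallees stash pop 'stash@{0}'                   # PROD changes from Step 2 (pick the matching stash@{n})
git -C db\c_4vallees add -A
git -C db\c_4vallees commit -m "PROD changes for c_8hdke93"  # the branch is now base + prod
git -C db\c_4vallees merge master                            # bring in the Azure changes from the model
# resolve any conflicts and commit - the branch now holds prod merged with model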
Applying merged changes back to PROD DBs
The following script generates SQL files with the merged code for all shared and customer DBs. The scripts can then be applied to PROD databases.
Run the script from the solution root.
Required input vars:
- $dbCustomers - an array of DB names, e.g. $dbCustomers = @("c_8hdke93", "c_a83hdk2", "c_hdj3ud5")
- $modelCustomerDB - name of the model customer DB, e.g. $modelCustomerDB = "4vallees"
- Copy the diff for the customer model DB to customer-dbs\diff.txt and edit the list inside, if needed.
- This version of the script assumes that the SQL code modified for Azure is identical between all customer DBs. There may still be differences in other parts of the customer databases, but what changed for Azure is identical.
Output: scripts for all modified objects are saved in .\customer-dbs\db_name folders.
. (Join-Path $PSScriptRoot vars.ps1)
# source folder name - using the model DB
$sd = "db\" + $modelCustomerDB
# get the list of modified files from a diff
$diffFile = "customer-dbs\diff.txt"
if (!(Test-Path -Path $diffFile )) {
Write-Host "Missing the list of files to export: $diffFile" -ForegroundColor Red
Exit
}
$diffFiles = (Get-Content $diffFile)
# remove all target DB scripts from the previous run
Write-Host "Deleting all SQL files in the target location ..." -ForegroundColor Yellow
foreach ($db in $dbCustomers) {
$dbPath = "customer-dbs\" + $db
if (Test-Path -Path $dbPath ) { Remove-Item -Path $dbPath -Force -Recurse }
}
$firstRun = 1 # reset to 0 after the first DB to reduce console output
foreach ($db in $dbCustomers) {
Write-Host "$db" -ForegroundColor Yellow
# target folder name - using the customer DB name
$td = "customer-dbs\" + $db
# process all SQL files from the diff
foreach ($sqlFile in $diffFiles.Split("`n")) {
if (! ($sqlFile -match ".+\.sql$")) {
# report skipped files on the first run only
if ($firstRun -eq 1) { Write-Host "Skipped $sqlFile" -ForegroundColor Yellow }
continue
}
# check if the source file exists
$sqlSourceFileName = Join-Path $sd $sqlFile
if (!(Test-Path -Path $sqlSourceFileName )) {
if ($firstRun -eq 1) { Write-Host "Missing $sqlFile" -ForegroundColor Red }
continue
}
# replace model DB name with the target DB name
$sqlTargetFileName = Join-Path $td $sqlFile
# replace the contents of the file
New-Item -Path $sqlTargetFileName -Force | Out-Null
(Get-Content $sqlSourceFileName) | Foreach-Object { $_ -replace "$modelCustomerDB", "$db" } | Set-Content $sqlTargetFileName -Force
# The () brackets around Get-Content are needed to make the file writable. Otherwise it would be locked.
}
# reset this flag to reduce console output
$firstRun = 0
}
Running the above script for our 3 customer DBs produced ~ 600 SQL files.
The following script applies the files generated in the previous step to production databases. Run it from the solution root.
Required input vars:
- $dbCustomers - an array of DB names, e.g. $dbCustomers = @("c_8hdke93", "c_a83hdk2", "c_hdj3ud5")
- $customerDBsFolderName - set to $customerDBsFolderName = "customer-dbs"
Output: processed DB names and error messages from SQLCMD utility.
. (Join-Path $PSScriptRoot vars.ps1)
foreach ($db in $dbCustomers) {
Write-Host "$db" -ForegroundColor Yellow
# folder name with DB files
$td = Join-Path $customerDBsFolderName $db
if (!(Test-Path -Path $td )) {
Write-Host "# Missing DB folder $db" -ForegroundColor Red
continue
}
# get the list of files
$allFiles = Get-ChildItem -Path $td -Filter "*.sql"
foreach ($sqlFileName in $allFiles) {
# file name as `customer-dbs\db_name\file_name.sql`
$sqlFullFileName = (Join-Path $td $sqlFileName.Name)
# run the SQL file on the local server, add -U and -P params if needed
sqlcmd -b -S . -d $db -i `"$sqlFullFileName`"
# output the file name for troubleshooting if there was a problem
if ($LASTEXITCODE -ne 0) { $sqlFullFileName }
}
}
This script exports all production DBs with their Azure modifications into .bacpac files, as required by the Azure SQL DB import process. Run it from the root of the solution.
Required input vars:
- $dbs - an array of DB names, e.g. $dbs = @("c_8hdke93", "c_a83hdk2", "c_hdj3ud5")
- Environment variable SA_PWD - enter your sa password there.
Output: one .bacpac file per DB in the bacpac folder.
. (Join-Path $PSScriptRoot vars.ps1)
# output folder
$folder = "bacpac"
# create the output folder on the first run
if (!(Test-Path -Path $folder )) {
New-Item -Path . -Name $folder -ItemType "directory" -Force
}
foreach ($db in $dbs) {
# SA password is taken from env var called `SA_PWD`
$pwd = $env:SA_PWD
$fileName = "$folder\$db.bacpac"
# do not overwrite existing files
if (Test-Path -Path $fileName) {
Write-Host "`n$fileName exists.`n" -ForegroundColor Yellow
continue
}
sqlpackage.exe /Action:Export /ssn:127.0.0.1 /su:sa /sp:$pwd /sdn:$db /tf:$fileName #/d:True
# You may want to enable the `/d:True` flag if the utility fails with no good explanation as to why
}
There is a chance you will get some errors during the export. They will have to be addressed before the export can be completed. Sometimes it means going back to the very beginning and re-running the entire process again.
Successfully exported files can be imported into Azure SQL Pool with this PowerShell script.
Required input vars:
- $dbs - an array of DB names, e.g. $dbs = @("c_8hdke93", "c_a83hdk2", "c_hdj3ud5")
- Environment variable SA_PWD - enter your sa password for the Azure SQL server / pool.
- $AzUserName - Azure SQL server admin user name, e.g. $AzUserName = "sa"
- $AzResourceGroup, $AzServerName, $AzPoolName - Azure SQL server params.
Output: a list of .bacpac file names imported into Azure SQL.
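The script uses the Azure CLI as well as sqlpackage, so sign in and select the right subscription before the first run; a minimal pre-flight check (the subscription name is a placeholder):

# the import script below assumes an authenticated Azure CLI session
az login
az account set --subscription "My Subscription Name"
az account show --output table   # confirm the right subscription before deleting / importing DBs

With the session in place, run the import script: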
. (Join-Path $PSScriptRoot vars.ps1)
foreach ($db in $dbs) {
# SA password is taken from env var called `SA_PWD`
$pwd = $env:SA_PWD
$filename = "$db.bacpac"
Write-Host "`n`nProcessing $fileName`n" -ForegroundColor Yellow
# ignore if the file is not there
if (!(Test-Path -Path $fileName)) {
Write-Host "`n$fileName doesn't exist.`n" -ForegroundColor Red
continue
}
# this line requires running `az login` and `az account set --subscription ...` first
Write-Host "Deleting AZ SQL DB ..." -ForegroundColor Yellow
az sql db delete --name $db --resource-group "$AzResourceGroup" --server $AzServerName --yes
if ($lastexitcode -ne 0) { continue }
sqlpackage.exe /a:import /tcs:"Data Source=tcp:$AzServerName.database.windows.net,1433;Initial Catalog=$db;User Id=$AzUserName;Password=$pwd" /sf:$filename /p:DatabaseEdition=Standard /p:DatabaseServiceObjective=S4 #/d:True
# You may want to enable the `/d:True` flag if the utility fails with no good explanation as to why
if ($lastexitcode -ne 0) { continue }
# add the DB to elastic pool
Write-Host "Moving to pool ..." -ForegroundColor Yellow
az sql db create --name $db --resource-group "$AzResourceGroup" --elastic-pool "$AzPoolName" --server $AzServerName
if ($lastexitcode -ne 0) { continue }
# rename the .bacpac file so that it doesn't get imported again
Move-Item -Path $filename -Destination "_$filename"
}
Congratulations - you are done
All databases should have been imported into your Azure SQL pool by the above script.
This post is based on my recent experience migrating a real estate management system with hundreds of MS SQL databases from on-prem to Azure SQL. Read my other articles for more learnings from that project.
I realise that this article makes little sense outside the narrow context of migrating a large number of MS SQL DBs to Azure and is hard to comprehend. It took me quite some time to figure out the right merge process, so I decided that it's better to share what I learned even in its current form. It may still save someone a lot of time and effort.