Skip to contents

In development - results are not yet reliable

This is particularly true for relationships across CTE, subquery, and temptable workflows (i.e. most real-world sql worklows!).
Collaboration is welcome!

{sqlMiner} extracts table relationship metadata from a folder of SQL scripts. Point it at a directory, and it returns a tidy data.frame describing every JOIN, INSERT, and CTE dependency it finds — one row per relationship. Use the output to audit data lineage, map table dependencies, or spot unexpected cross-schema joins without reading every script by hand.

What the package handles:

  • LEFT, RIGHT, INNER, FULL OUTER, and CROSS JOINs
  • INSERT INTO … SELECT … FROM patterns (ETL lineage)
  • Chained CTEs, with a depth column that tracks nesting level
  • Inline subqueries (parsed recursively)
  • Single-line (--) and block (/* */) comment stripping before extraction

Installation

Install the development version from GitHub:

# install.packages("pak")
pak::pak("nottmhospitals/sqlMiner")

Hello world

Point sq_read_sql_files() at a folder of .sql files, then pass the result to sq_sql_mine():

library(sqlMiner)

scripts <- sq_read_sql_files("path/to/your/sql/folder")
relationships <- sq_sql_mine(scripts)

head(relationships)
#>                script_name source_table          target_table relationship_type
#> 1   fixture_basic_join.sql     episodes              patients              JOIN
#> 2     fixture_comments.sql dbo.episodes          dbo.patients              JOIN
#> 3 fixture_ctes_chained.sql dbo.episodes             dbo.staff              JOIN
#> 4 fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes              JOIN
#> 5 fixture_ctes_chained.sql dbo.episodes         dbo.diagnoses              JOIN
#> 6  fixture_ctes_simple.sql dbo.episodes             dbo.wards              JOIN
#>   join_type depth
#> 1 LEFT JOIN     1
#> 2 LEFT JOIN     1
#> 3 LEFT JOIN     1
#> 4 LEFT JOIN     3
#> 5 LEFT JOIN     1
#> 6 LEFT JOIN     1

Each row is one relationship. The key columns are:

Column Description
script_name The .sql file the relationship came from
source_table The primary read table (the FROM clause)
target_table The table being joined to, or written to for INSERTs
relationship_type "JOIN" or "INSERT"
join_type "LEFT JOIN", "INNER JOIN", etc. (NA for INSERTs)
depth CTE nesting level — base tables = 1, each CTE layer adds 1
source_is_cte / target_is_cte Whether the table is a CTE defined in the same script
left_keys / right_keys List columns of joining key expressions — source side and target side respectively (NA for INSERTs or unresolvable conditions)

Extended examples

library(sqlMiner)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.5.2

fixture_dir <- system.file("extdata", package = "sqlMiner")
scripts <- sq_read_sql_files(fixture_dir)
relationships <- suppressMessages(sq_sql_mine(scripts))

Count relationships per script

Quickly see which scripts are the most complex:

relationships |>
  count(script_name, sort = TRUE)
#>                       script_name n
#> 1          fixture_join_types.sql 5
#> 2         fixture_insert_into.sql 4
#> 3        fixture_ctes_chained.sql 3
#> 4         fixture_select_into.sql 3
#> 5     fixture_subquery_nested.sql 3
#> 6         fixture_ctes_simple.sql 2
#> 7      fixture_multistatement.sql 2
#> 8     fixture_subquery_simple.sql 2
#> 9           fixture_union_all.sql 2
#> 10         fixture_basic_join.sql 1
#> 11           fixture_comments.sql 1
#> 12           fixture_group_by.sql 1
#> 13 fixture_identifier_quoting.sql 1
#> 14  fixture_insert_cte_source.sql 1
#> 15      fixture_schema_prefix.sql 1
#> 16        fixture_table_hints.sql 1

Find the most frequently joined tables

Which tables appear most often as JOIN targets across all scripts?

relationships |>
  filter(relationship_type == "JOIN") |>
  count(target_table, sort = TRUE) |>
  head(10)
#>             target_table  n
#> 1           dbo.patients 10
#> 2              dbo.staff  5
#> 3              dbo.wards  3
#> 4          dbo.diagnoses  2
#> 5               patients  2
#> 6  cte_enriched_episodes  1
#> 7        dbo.departments  1
#> 8              diagnoses  1
#> 9            icd10_codes  1
#> 10     recent_admissions  1

Audit INSERT lineage

Find every script that writes to a table (ETL population):

relationships |>
  filter(relationship_type == "INSERT") |>
  select(script_name, source_table, target_table, relationship_type)
#>                     script_name       source_table         target_table
#> 1 fixture_insert_cte_source.sql cte_ready_episodes    dbo.fact_episodes
#> 2       fixture_insert_into.sql       dbo.episodes    dbo.fact_episodes
#> 3       fixture_select_into.sql       dbo.episodes dbo.staging_episodes
#>   relationship_type
#> 1            INSERT
#> 2            INSERT
#> 3            INSERT

Filter by join type

Isolate FULL OUTER JOINs, which often signal data-quality checks or reconciliation logic:

relationships |>
  filter(join_type == "FULL OUTER JOIN") |>
  select(script_name, source_table, target_table, on_condition)
#>              script_name source_table target_table                on_condition
#> 1 fixture_join_types.sql     episodes    diagnoses e.episode_id = d.episode_id

Understanding join keys

left_keys and right_keys are list columns — each element is a character vector of the column expressions that link source_table to target_table. left_keys is always the source (FROM) side; right_keys is always the target (JOIN) side, regardless of how the condition was written in the SQL.

Terminology

Term Definition
join predicate A condition that references columns from both sides of the join, e.g. e.patient_id = p.patient_id
filter predicate A condition that references only one side, e.g. p.active = 1 — dropped silently
equi-predicate A join predicate using = — these populate left_keys / right_keys
non-equi predicate A join predicate using >, <, BETWEEN, etc. — excluded from key columns, visible in on_condition

Rules applied during parsing

  • Both AND- and OR-connected equi-predicates are included (all represent potential linking columns).
  • Non-equi predicates and filter predicates (constants, literals) are silently dropped.
  • When a function call wraps a column — e.g. COALESCE(e.id, 0) — the raw expression is stored as-is.
  • When the full condition cannot be resolved (no equi-predicates, or no aliases to identify sides), left_keys and right_keys are NA. The raw on_condition is always available as a fallback.
  • INSERT rows always have NA for both key columns.

Inspecting keys

relationships |>
  filter(relationship_type == "JOIN") |>
  select(source_table, target_table, left_keys, right_keys) |>
  head(5)
#>   source_table          target_table    left_keys right_keys
#> 1     episodes              patients   patient_id patient_id
#> 2 dbo.episodes          dbo.patients   patient_id patient_id
#> 3 dbo.episodes             dbo.staff a.consul....   staff_id
#> 4 dbo.episodes cte_enriched_episodes p.patient_id patient_id
#> 5 dbo.episodes         dbo.diagnoses e.episode_id episode_id

Compound join keys — use tidyr::unnest() to get one row per key pair:

library(tidyr)

relationships |>
  filter(relationship_type == "JOIN", !is.na(left_keys)) |>
  select(script_name, source_table, target_table, left_keys, right_keys) |>
  unnest(cols = c(left_keys, right_keys)) |>
  head(10)
#> # A tibble: 10 × 5
#>    script_name                    source_table target_table left_keys right_keys
#>    <chr>                          <chr>        <chr>        <chr>     <chr>     
#>  1 fixture_basic_join.sql         episodes     patients     patient_… patient_id
#>  2 fixture_comments.sql           dbo.episodes dbo.patients patient_… patient_id
#>  3 fixture_ctes_chained.sql       dbo.episodes dbo.staff    a.consul… staff_id  
#>  4 fixture_ctes_chained.sql       dbo.episodes cte_enriche… p.patien… patient_id
#>  5 fixture_ctes_chained.sql       dbo.episodes dbo.diagnos… e.episod… episode_id
#>  6 fixture_ctes_simple.sql        dbo.episodes dbo.wards    ward_id   ward_id   
#>  7 fixture_ctes_simple.sql        dbo.episodes recent_admi… p.patien… patient_id
#>  8 fixture_group_by.sql           dbo.episodes dbo.departm… e.depart… departmen…
#>  9 fixture_identifier_quoting.sql dbo.episodes dbo.patients patient_… patient_id
#> 10 fixture_insert_into.sql        dbo.episodes dbo.patients patient_… patient_id

Inspect CTE depth

Scripts with depth > 1 reference CTEs defined in the same file. Depth tracks how many CTE layers deep a reference goes:

relationships |>
  filter(depth > 1) |>
  select(script_name, source_table, target_table, target_is_cte, depth) |>
  arrange(desc(depth))
#>                   script_name source_table          target_table target_is_cte
#> 1    fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes          TRUE
#> 2 fixture_subquery_nested.sql dbo.episodes          dbo.patients         FALSE
#> 3     fixture_ctes_simple.sql dbo.episodes     recent_admissions          TRUE
#> 4 fixture_subquery_nested.sql __subquery__             dbo.staff         FALSE
#> 5 fixture_subquery_simple.sql dbo.episodes          dbo.patients         FALSE
#>   depth
#> 1     3
#> 2     3
#> 3     2
#> 4     2
#> 5     2

Cross-schema joins

Spot joins that cross schema boundaries (a common governance concern):

relationships |>
  filter(relationship_type == "JOIN") |>
  mutate(
    source_schema = sub("\\..*", "", source_table),
    target_schema = sub("\\..*", "", target_table)
  ) |>
  filter(
    source_schema != target_schema,
    grepl("\\.", source_table),
    grepl("\\.", target_table)
  ) |>
  select(script_name, source_table, target_table, join_type)
#>                 script_name source_table     target_table join_type
#> 1 fixture_schema_prefix.sql dbo.episodes staging.patients LEFT JOIN

What tables does a script depend on?

sq_script_dependencies() summarises the external tables a single script reads from and writes to. CTE names are excluded — they are internal to the script.

sq_script_dependencies(relationships, "fixture_insert_into.sql")
#>          table_name   role
#> 1      dbo.episodes  reads
#> 2      dbo.patients  reads
#> 3         dbo.staff  reads
#> 4     dbo.diagnoses  reads
#> 5 dbo.fact_episodes writes

How is a table used across all scripts?

sq_table_usage() finds every script that references a given table and describes its role in each relationship.

sq_table_usage(relationships, "dbo.episodes")
#>                       script_name   role
#> 1            fixture_comments.sql source
#> 2        fixture_ctes_chained.sql source
#> 3         fixture_ctes_simple.sql source
#> 4            fixture_group_by.sql source
#> 5  fixture_identifier_quoting.sql source
#> 6         fixture_insert_into.sql source
#> 7      fixture_multistatement.sql source
#> 8       fixture_schema_prefix.sql source
#> 9         fixture_select_into.sql source
#> 10    fixture_subquery_nested.sql source
#> 11    fixture_subquery_simple.sql source
#> 12        fixture_table_hints.sql source
#> 13          fixture_union_all.sql source