You have GA4 data in BigQuery. You have SFCC orders in BigQuery. You write your first LEFT JOIN. You get a 40% NULL rate where you expected matches.

At this point, most people open their LookML model and start adjusting joins. That is the wrong move. The join is not the problem. The identifiers are.

This article covers every way the GA4/SFCC transaction ID can break, how to systematically diagnose which failure mode you are dealing with, and how to fix it in a way that benefits every dashboard automatically.


Why the Transaction ID Is Your Critical Path

The entire reconciliation model depends on one assumption: the transaction_id in GA4 is the same value as the order_id in SFCC for the same order. This assumption fails in practice more often than most teams expect, for reasons that are entirely preventable but rarely documented.

Here is why it is the critical path and not something you can work around: without a reliable join key, every reconciliation metric is either wrong or unmeasurable. You cannot calculate the revenue gap. You cannot classify orders by origin. You cannot build any of the governance metrics from the rest of this series. The join key is the foundation.


The Five Failure Modes

Transaction ID failure modes, as a decision tree. Is the match rate below 85%? If so, check whether the ID patterns are the same in both systems:

- No → Mode 1: completely different identifiers (GA4 receives a session token, not the SFCC order_id), or Mode 2: prefix mismatch (ORD-00123 vs 00123, #123 vs 123).
- Yes → Does LOWER(TRIM()) help? Run Query 2 to check the normalised match rate:
  - LOWER() helps → Mode 3: case sensitivity (ABC-123 vs abc-123). Fix: LOWER() on both sides.
  - TRIM() helps → Mode 4: whitespace ("00123 " ≠ "00123"). Fix: TRIM() on both sides.
  - Neither helps → Mode 5: structural. Offline, B2B, or back-office orders where no GA4 session is expected.
Figure 1 — Transaction ID failure mode decision tree. Run the diagnostic queries in order to identify which mode applies to your implementation before attempting any fix.

Mode 1: Completely Different Identifiers

This is the most severe failure mode and the hardest to fix because it requires a product-level decision. In some implementations, the value pushed to the dataLayer as transaction_id is not the SFCC order_id at all. It could be a payment provider reference, a session-scoped token, or some other front-end-generated value that never existed in SFCC.

How to detect it: Take 20 recent GA4 purchase events and look up the corresponding SFCC orders manually. If the IDs share no structural pattern at all, you have Mode 1.
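That manual check can be scripted. A minimal local sketch of the idea, comparing structural "signatures" of IDs from both systems (the sample lists below are purely illustrative, not real exports):

```python
import re

def id_signature(value: str) -> tuple:
    """Structural signature of an ID: non-digit prefix, total digit count, total length."""
    prefix = re.match(r'^[^0-9]*', value).group(0)
    digit_runs = re.findall(r'[0-9]+', value)
    return (prefix, sum(len(d) for d in digit_runs), len(value))

# Illustrative samples -- replace with real IDs exported from GA4 and SFCC
ga4_ids = ['a1b2c3d4e5f6', 'f6e5d4c3b2a1']   # looks like session tokens
sfcc_ids = ['ORD-00123456', 'ORD-00123457']   # SFCC order numbers

ga4_sigs = {id_signature(v) for v in ga4_ids}
sfcc_sigs = {id_signature(v) for v in sfcc_ids}

# If the two systems share no structural signature at all, suspect Mode 1
if ga4_sigs.isdisjoint(sfcc_sigs):
    print("No shared ID structure -- likely Mode 1 (different identifiers)")
```

If the signatures overlap but the values still do not match, the problem is more likely Mode 2, 3, or 4 and the queries below will narrow it down.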

How to fix it: This is not fixable in BigQuery or LookML alone. You need to change what value is pushed to the dataLayer. The correct approach is to generate the SFCC order_id before the confirmation page renders and pass it explicitly to the dataLayer. In most SFCC implementations, the order object is available server-side at the time the confirmation page is generated:

// dataLayer push on order confirmation page (SFCC ISML template or headless)
dataLayer.push({
  event: 'purchase',
  ecommerce: {
    transaction_id: '${pdict.Order.orderNo}',  // SFCC order number -- NOT payment ref
    value: ${pdict.Order.totalGrossPrice.value},
    currency: '${pdict.Order.getCurrencyCode()}',
    items: [...]
  }
});
// pdict.Order.orderNo is the SFCC internal order number that matches order_id in the DB

Mode 2: Prefix or Suffix Mismatch

SFCC stores ORD-00123456 while the dataLayer sends 00123456 to GA4. Or the reverse: SFCC stores 123456 and GA4 receives #123456 or order_123456.

This is extremely common when the dataLayer implementation was built by a front-end team that formatted the ID differently from the SFCC back-end team. It is also common after a platform migration where the ID format changed but the tracking code was not updated.

-- Step 1: Identify prefix patterns in SFCC
SELECT
  REGEXP_EXTRACT(order_id, r'^[^0-9]*')       AS prefix,
  REGEXP_EXTRACT(order_id, r'[0-9]+$')        AS numeric_suffix,
  LENGTH(order_id)                             AS total_length,
  COUNT(*)                                     AS order_count
FROM `project.sfcc.orders`
WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1, 2, 3
ORDER BY order_count DESC;

-- Step 2: Same for GA4
SELECT
  REGEXP_EXTRACT(transaction_id, r'^[^0-9]*') AS prefix,
  REGEXP_EXTRACT(transaction_id, r'[0-9]+$')  AS numeric_suffix,
  LENGTH(transaction_id)                       AS total_length,
  COUNT(*)                                     AS ga4_count
FROM `project.analytics.ga4_purchases`
WHERE DATE(event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1, 2, 3
ORDER BY ga4_count DESC;

-- If SFCC shows prefix='ORD-' but GA4 shows prefix=''
-- the fix is: REGEXP_REPLACE(sfcc.order_id, r'^ORD-', '')

Mode 3: Case Sensitivity

BigQuery string comparisons are case-sensitive by default. 'ABC-00123' does not equal 'abc-00123'. A join written as s.order_id = g.transaction_id will silently fail on any records where the case differs.

This failure mode is particularly insidious because the join does not error. It silently returns NULLs for the affected rows, and the match rate often stays high enough that nobody questions it until the numbers are audited.

Detection: If normalised match rate (with LOWER + TRIM) is significantly higher than raw match rate, you have Mode 3 or Mode 4.
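The gap between raw and normalised matching is easy to demonstrate locally. A sketch with made-up IDs, where normalise mirrors the LOWER(TRIM()) logic used in the diagnostic queries:

```python
def normalise(value: str) -> str:
    """Mirror of LOWER(TRIM(...)) applied to both sides of the join."""
    return value.strip().lower()

# Illustrative samples: the same three orders, with casing and whitespace drift
ga4 = ['ABC-00123', 'abc-00124', 'ABC-00125 ']
sfcc = ['abc-00123', 'ABC-00124', 'abc-00125']

raw_matches = len(set(ga4) & set(sfcc))
norm_matches = len({normalise(v) for v in ga4} & {normalise(v) for v in sfcc})
print(f"raw: {raw_matches}, normalised: {norm_matches}")
```

Here the raw join matches nothing while the normalised join matches all three rows: exactly the signature of Mode 3 or Mode 4.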

Mode 4: Whitespace and Invisible Characters

Leading or trailing spaces are the most common invisible character issue. They are typically introduced by manual data entry in back-office tools, CSV exports and re-imports, or string concatenation in ETL middleware.

Less common but real: Unicode zero-width characters, non-breaking spaces (U+00A0), or carriage return characters embedded in string fields. These require more aggressive cleaning:

-- Detect non-standard characters
SELECT
  order_id,
  LENGTH(order_id)               AS raw_length,
  LENGTH(TRIM(order_id))         AS trimmed_length,
  order_id != TRIM(order_id)     AS has_whitespace,
  -- Detect non-printable characters
  REGEXP_CONTAINS(order_id, r'[\x00-\x1F\x7F]') AS has_control_chars
FROM `project.sfcc.orders`
WHERE order_id != TRIM(order_id)
   OR REGEXP_CONTAINS(order_id, r'[\x00-\x1F\x7F]')
LIMIT 100;

-- Clean function for extreme cases
-- (note: RE2 uses \x{00A0} for Unicode escapes, not \u00A0)
SELECT
  REGEXP_REPLACE(TRIM(LOWER(order_id)), r'[\x00-\x1F\x7F\x{00A0}]', '') AS ultra_clean_id
FROM `project.sfcc.orders`;

Mode 5: Structurally Unmatchable Orders

Some SFCC orders will never have a GA4 counterpart, not because of any data quality issue, but because they were created in environments where GA4 tracking cannot exist. These are not join failures. They are a different category of business reality.

Important: Do not try to fix Mode 5. Instead, classify these orders explicitly in your data model so they are excluded from match rate calculations and treated as their own segment. The article on dark traffic and orphan orders (Part 5 of this series) covers this in full.
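The classification logic is simple enough to sketch. A local mirror of the CASE expression used in Query 4 later in this article (channel and customer_type values are assumptions about your SFCC schema):

```python
def classify_unmatched(channel: str, customer_type: str) -> str:
    """Label an unmatched SFCC order by its structural cause, per the Query 4 CASE logic."""
    if channel in ('store', 'pos'):
        return 'offline_store'
    if channel in ('phone', 'callcenter'):
        return 'offline_callcenter'
    if customer_type in ('b2b', 'wholesale'):
        return 'b2b_wholesale'
    if channel in ('web', 'storefront'):
        return 'web_unmatched_dark_traffic'
    return 'other'

print(classify_unmatched('pos', 'b2c'))   # offline_store -- expected, not fixable
print(classify_unmatched('web', 'b2c'))   # web_unmatched_dark_traffic -- worth investigating
```

Only the last category represents a potential tracking problem; the others are business reality and should be reported as such.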


The Full Diagnostic Sequence: Four Queries

Run these queries in order before building anything in Looker. Each one narrows the problem.

Query 1: Baseline match rate (raw, no normalisation)

SELECT
  COUNT(DISTINCT g.transaction_id)                           AS ga4_transactions,
  COUNT(DISTINCT s.order_id)                                 AS sfcc_web_orders,
  COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                      THEN g.transaction_id END)             AS matched_raw,
  ROUND(
    COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                        THEN g.transaction_id END)
    / NULLIF(COUNT(DISTINCT g.transaction_id), 0) * 100, 1
  )                                                          AS raw_match_rate_pct
FROM `project.analytics.ga4_purchases` g
LEFT JOIN `project.sfcc.orders` s
  ON g.transaction_id = s.order_id  -- raw, no normalisation
  -- the channel filter belongs in the ON clause: putting it in WHERE
  -- would drop every unmatched GA4 row and inflate the match rate to 100%
  AND s.channel IN ('web', 'storefront')
WHERE DATE(g.event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- Result interpretation:
-- >= 90%: small normalisation issue (Mode 3 or 4)
-- 70-89%: prefix mismatch likely (Mode 2)
-- < 70%:  structurally different IDs (Mode 1) or high offline volume

Query 2: Normalised match rate (with LOWER + TRIM)

SELECT
  COUNT(DISTINCT g.transaction_id)                           AS ga4_total,
  COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                      THEN g.transaction_id END)             AS matched_normalised,
  ROUND(
    COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                        THEN g.transaction_id END)
    / NULLIF(COUNT(DISTINCT g.transaction_id), 0) * 100, 1
  )                                                          AS normalised_match_rate_pct,
  -- Compare with Query 1 result to measure gain
  ROUND(
    COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                        THEN g.transaction_id END)
    / NULLIF(COUNT(DISTINCT g.transaction_id), 0) * 100, 1
  ) - [raw_match_rate_from_query_1]                          AS normalisation_gain_pct
FROM `project.analytics.ga4_purchases` g
LEFT JOIN `project.sfcc.orders` s
  ON LOWER(TRIM(g.transaction_id)) = LOWER(TRIM(s.order_id))  -- normalised
  AND s.channel IN ('web', 'storefront')  -- same population as Query 1
WHERE DATE(g.event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- If normalised_match_rate >> raw_match_rate:
--   Fix: apply LOWER(TRIM()) to both sides in all joins
-- If normalised_match_rate ≈ raw_match_rate:
--   The issue is structural (Mode 1 or 2) -- normalisation alone doesn't help

Query 3: ID structure comparison (detect Mode 1 and 2)

WITH ga4_ids AS (
  SELECT
    'ga4'                                          AS source,
    REGEXP_EXTRACT(transaction_id, r'^[^0-9]*')    AS prefix,
    LENGTH(REGEXP_EXTRACT(transaction_id, r'^[^0-9]*')) AS prefix_len,
    LENGTH(transaction_id)                          AS total_len,
    COUNT(*)                                        AS record_count
  FROM `project.analytics.ga4_purchases`
  WHERE DATE(event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  GROUP BY 1, 2, 3, 4
),
sfcc_ids AS (
  SELECT
    'sfcc'                                          AS source,
    REGEXP_EXTRACT(order_id, r'^[^0-9]*')           AS prefix,
    LENGTH(REGEXP_EXTRACT(order_id, r'^[^0-9]*'))   AS prefix_len,
    LENGTH(order_id)                                AS total_len,
    COUNT(*)                                        AS record_count
  FROM `project.sfcc.orders`
  WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    AND channel IN ('web', 'storefront')
  GROUP BY 1, 2, 3, 4
)
SELECT * FROM ga4_ids
UNION ALL
SELECT * FROM sfcc_ids
ORDER BY source, record_count DESC;

-- If GA4 shows prefix='' and SFCC shows prefix='ORD-':
--   Fix: REGEXP_REPLACE(sfcc.order_id, r'^ORD-', '')
-- If GA4 shows total_len=12 and SFCC shows total_len=18:
--   They are structurally different IDs (Mode 1) -- fix at dataLayer level

Query 4: Classify the unmatched bucket

SELECT
  s.channel,
  s.customer_type,
  CASE
    WHEN s.channel IN ('store', 'pos')      THEN 'offline_store'
    WHEN s.channel IN ('phone', 'callcenter') THEN 'offline_callcenter'
    WHEN s.customer_type IN ('b2b', 'wholesale') THEN 'b2b_wholesale'
    WHEN s.channel IN ('web', 'storefront') THEN 'web_unmatched_dark_traffic'
    ELSE 'other'
  END                                                       AS unmatched_category,
  COUNT(DISTINCT s.order_id)                                AS unmatched_orders,
  SUM(s.order_value)                                        AS unmatched_revenue,
  ROUND(
    COUNT(DISTINCT s.order_id)
    / SUM(COUNT(DISTINCT s.order_id)) OVER () * 100, 1
  )                                                         AS share_of_unmatched_pct
FROM `project.sfcc.orders` s
LEFT JOIN `project.analytics.ga4_purchases` g
  ON LOWER(TRIM(s.order_id)) = LOWER(TRIM(g.transaction_id))
WHERE g.transaction_id IS NULL
  AND DATE(s.created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3
ORDER BY unmatched_orders DESC;

-- Expected result for a healthy implementation:
-- 'offline_store' and 'b2b_wholesale': 70-80% of unmatched (expected, not fixable)
-- 'web_unmatched_dark_traffic': 20-30% of unmatched (ITP, WebView, consent)
-- If 'web_unmatched_dark_traffic' > 50% of unmatched: tracking implementation issue

Fixing It in LookML: Three Patterns

Once you have identified the failure mode, implement the fix in LookML or BigQuery views. The correction should be centralised — not scattered across individual dashboards.

Pattern 1: Normalisation only (Mode 3 + 4)

explore: ga4_purchases {
  label: "GA4 + SFCC Reconciliation"

  join: sfcc_orders {
    type: left_outer
    sql_on:
      LOWER(TRIM(${ga4_purchases.transaction_id}))
      = LOWER(TRIM(${sfcc_orders.order_id})) ;;
    relationship: one_to_one
  }
}

Pattern 2: Prefix stripping + normalisation (Mode 2)

# SFCC uses ORD-00123456, GA4 uses 00123456
explore: ga4_purchases {
  join: sfcc_orders {
    type: left_outer
    sql_on:
      LOWER(TRIM(${ga4_purchases.transaction_id}))
      = LOWER(TRIM(
          REGEXP_REPLACE(${sfcc_orders.order_id}, r'^ORD-', '')
        )) ;;
    relationship: one_to_one
  }
}

# Alternative: GA4 uses #123456, SFCC uses 123456
# sql_on: REGEXP_REPLACE(${ga4_purchases.transaction_id}, r'^#', '')
#         = ${sfcc_orders.order_id}

Pattern 3: Normalised views in BigQuery (best for complex cases)

When normalisation logic is complex or involves multiple transformations, the cleanest approach is to create normalised views in BigQuery before Looker touches the data. Looker then joins on clean, pre-normalised fields:

-- BigQuery: create normalised views
CREATE OR REPLACE VIEW `project.reconciliation.norm_sfcc_orders` AS
SELECT
  order_id                                              AS raw_order_id,
  -- Apply all normalisation in one place
  LOWER(
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(order_id, r'^ORD-', ''),  -- strip prefix
        r'[\x00-\x1F\x7F]', ''                   -- strip control chars
      )
    )
  )                                                     AS norm_order_id,
  created_at,
  order_value,
  channel,
  customer_type,
  status,
  refund_amount
FROM `project.sfcc.orders`
WHERE status NOT IN ('fraud', 'test');

CREATE OR REPLACE VIEW `project.reconciliation.norm_ga4_purchases` AS
SELECT
  transaction_id                                        AS raw_transaction_id,
  LOWER(TRIM(transaction_id))                           AS norm_transaction_id,
  event_date,
  session_id,
  reported_revenue,
  session_source,
  session_medium,
  session_campaign
FROM `project.analytics.ga4_purchases`;

# LookML: clean join on normalised fields
explore: ga4_purchases {
  join: sfcc_orders {
    type: left_outer
    sql_on: ${ga4_purchases.norm_transaction_id}
             = ${sfcc_orders.norm_order_id} ;;
    relationship: one_to_one
  }
}

What a Healthy Match Rate Looks Like

After applying the appropriate normalisation, here are the realistic match rate targets by order segment. These benchmarks are based on typical e-commerce implementations; your numbers will vary based on channel mix and Safari market share in your user base.

| Order segment | Target match rate | Acceptable minimum | If below minimum, investigate |
| --- | --- | --- | --- |
| Web orders, same-day (Chrome/Firefox) | 92–97% | 88% | Tag firing on confirmation page, consent mode config |
| Web orders, 7+ day return (Safari) | 75–85% | 70% | ITP cookie loss: structural, partially addressable with server-side |
| Mobile web orders | 78–88% | 70% | WebView traffic, in-app browser, ITP on iOS |
| Mobile app orders | 50–75% | 40% | Firebase/GA4 SDK missing or misconfigured in app |
| Call centre orders | 0–10% | N/A | Expected: offline by nature. Classify, don't fix. |
| B2B / wholesale orders | 0–20% | N/A | Expected: no web session. Classify, don't fix. |
| POS / in-store orders | 0% | N/A | Expected: fully offline. Should not appear in reconciliation scope. |

No reconciliation model reaches 100%. The goal is to understand your match rate by segment so you can make informed decisions about which orders to include in attribution analysis, and distinguish between structural gaps (not fixable) and implementation gaps (fixable).
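Segment-level match rates are a small aggregation once you have matched/unmatched flags per order. A local sketch over illustrative rows (segment names and the flag structure are assumptions; in practice this comes from a Query 4 style join):

```python
from collections import defaultdict

# Illustrative (segment, matched) pairs per order
orders = [
    ('web_desktop', True), ('web_desktop', True), ('web_desktop', False),
    ('mobile_web', True), ('mobile_web', False),
    ('callcenter', False),
]

totals = defaultdict(lambda: [0, 0])   # segment -> [matched, total]
for segment, matched in orders:
    totals[segment][1] += 1
    if matched:
        totals[segment][0] += 1

for segment, (m, t) in sorted(totals.items()):
    print(f"{segment}: {m / t * 100:.1f}% match rate")
```

Compare each segment's rate against the table above rather than judging the blended number: a healthy web segment can hide a broken app implementation, and vice versa.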


Common Mistakes to Avoid

- Adjusting LookML joins before diagnosing which failure mode you are actually dealing with.
- Scattering normalisation logic across individual dashboards instead of centralising it in a view or explore.
- Trying to "fix" Mode 5 orders instead of classifying them as structurally unmatchable.
- Reporting a single blended match rate instead of segmenting by order type and browser.

What Comes Next


Sources and Further Reading