The Transaction ID Problem: A Complete Diagnostic Guide to Fixing Your GA4/SFCC Join

You have GA4 data in BigQuery. You have SFCC orders in BigQuery. You write your first LEFT JOIN. You get a 40% NULL rate where you expected matches.

At this point, most people open their LookML model and start adjusting joins. That is the wrong move. The join is not the problem. The identifiers are.

This article covers every way the GA4/SFCC transaction ID can break, how to systematically diagnose which failure mode you are dealing with, and how to fix it in a way that benefits every dashboard automatically.

Why the Transaction ID Is Your Critical Path

The entire reconciliation model depends on one assumption: the transaction_id in GA4 is the same value as the order_id in SFCC for the same order. This assumption fails in practice more often than most teams expect, for reasons that are entirely preventable but rarely documented.

Here is why it is the critical path and not something you can work around: without a reliable join key, every reconciliation metric is either wrong or unmeasurable. You cannot calculate the revenue gap. You cannot classify orders by origin. You cannot build any of the governance metrics from the rest of this series. The join key is the foundation.

The Five Failure Modes

Figure 1 — Transaction ID failure mode decision tree. Run the diagnostic queries in order to identify which mode applies to your implementation before attempting any fix.

Mode 1: Completely Different Identifiers

This is the most severe failure mode and the hardest to fix because it requires a product-level decision. In some implementations, the value pushed to the dataLayer as transaction_id is not the SFCC order_id. It could be:

A payment gateway transaction reference (e.g. Stripe payment intent ID: pi_3Abc123). The payment gateway generates this before SFCC creates the order. GA4 receives this. SFCC stores a completely different internal order ID.
A checkout session token generated by the frontend for tracking continuity. This token may not be stored in SFCC at all.
A cart ID or basket ID generated at add-to-cart time. SFCC creates a new order ID at payment confirmation, which is different from the cart ID.
A confirmation number generated by the storefront that the customer sees, while SFCC uses an internal order number for operational purposes.

How to detect it: Take 20 recent GA4 purchase events and look up the corresponding SFCC orders manually. If the IDs share no structural pattern at all, you have Mode 1.

How to fix it: This is not fixable in BigQuery or LookML alone. You need to change what value is pushed to the dataLayer. The correct approach is to generate the SFCC order_id before the confirmation page renders and pass it explicitly to the dataLayer. In most SFCC implementations, the order object is available server-side at the time the confirmation page is generated:

// dataLayer push on order confirmation page (SFCC ISML template or headless)
dataLayer.push({
  event: 'purchase',
  ecommerce: {
    transaction_id: '${pdict.Order.orderNo}',  // SFCC order number -- NOT payment ref
    value: ${pdict.Order.totalGrossPrice.value},
    currency: '${pdict.Order.getCurrencyCode()}',
    items: [...]
  }
});
// pdict.Order.orderNo is the SFCC internal order number that matches order_id in the DB

Mode 2: Prefix or Suffix Mismatch

SFCC stores ORD-00123456. The dataLayer sends GA4 00123456. Or the reverse: SFCC stores 123456 and GA4 receives #123456 or order_123456.

This is extremely common when the dataLayer implementation was built by a front-end team that formatted the ID differently from the SFCC back-end team. It is also common after a platform migration where the ID format changed but the tracking code was not updated.

-- Step 1: Identify prefix patterns in SFCC
SELECT
  REGEXP_EXTRACT(order_id, r'^[^0-9]*')       AS prefix,
  REGEXP_EXTRACT(order_id, r'[0-9]+$')        AS numeric_suffix,
  LENGTH(order_id)                             AS total_length,
  COUNT(*)                                     AS order_count
FROM `project.sfcc.orders`
WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1, 2, 3
ORDER BY order_count DESC;

-- Step 2: Same for GA4
SELECT
  REGEXP_EXTRACT(transaction_id, r'^[^0-9]*') AS prefix,
  REGEXP_EXTRACT(transaction_id, r'[0-9]+$')  AS numeric_suffix,
  LENGTH(transaction_id)                       AS total_length,
  COUNT(*)                                     AS ga4_count
FROM `project.analytics.ga4_purchases`
WHERE DATE(event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1, 2, 3
ORDER BY ga4_count DESC;

-- If SFCC shows prefix='ORD-' but GA4 shows prefix=''
-- the fix is: REGEXP_REPLACE(sfcc.order_id, r'^ORD-', '')

Mode 3: Case Sensitivity

BigQuery string comparisons are case-sensitive by default. 'ABC-00123' does not equal 'abc-00123'. A join written as s.order_id = g.transaction_id will silently fail on any records where the case differs.

This failure mode is particularly insidious because:

It is invisible in any data preview or spreadsheet tool
It affects a seemingly random subset of orders (those that happened to be stored in different case)
The match rate looks reasonable (85–95%) rather than catastrophically bad, so it often goes undiagnosed
It can be introduced by a single system update or environment change — one day your match rate drops from 94% to 88% and nobody knows why

Detection: If normalised match rate (with LOWER + TRIM) is significantly higher than raw match rate, you have Mode 3 or Mode 4.

Mode 4: Whitespace and Invisible Characters

Leading or trailing spaces are the most common invisible character issue. They are introduced by:

CSV exports from SFCC where the field is not quoted correctly
API responses that include padding in JSON string values
Manual data entry in back-office tools
Copy-paste from spreadsheets where Excel adds invisible formatting

Less common but real: Unicode zero-width characters, non-breaking spaces ( ), or carriage return characters embedded in string fields. These require more aggressive cleaning:

-- Detect non-standard characters
SELECT
  order_id,
  LENGTH(order_id)               AS raw_length,
  LENGTH(TRIM(order_id))         AS trimmed_length,
  order_id != TRIM(order_id)     AS has_whitespace,
  -- Detect non-printable characters
  REGEXP_CONTAINS(order_id, r'[\x00-\x1F\x7F]') AS has_control_chars
FROM `project.sfcc.orders`
WHERE order_id != TRIM(order_id)
   OR REGEXP_CONTAINS(order_id, r'[\x00-\x1F\x7F]')
LIMIT 100;

-- Clean function for extreme cases:
SELECT
  REGEXP_REPLACE(TRIM(LOWER(order_id)), r'[\x00-\x1F\x7F\u00A0]', '') AS ultra_clean_id
FROM `project.sfcc.orders`

Mode 5: Structurally Unmatchable Orders

Some SFCC orders will never have a GA4 counterpart, not because of any data quality issue, but because they were created in environments where GA4 tracking cannot exist. These are not join failures. They are a different category of business reality.

Call centre and back-office orders: agent creates the order in SFCC directly. No browser, no session, no event.
B2B API orders: the customer’s procurement system sends an API request to SFCC. No browser.
POS syncs: in-store transaction recorded in a POS system and synced to SFCC. No web session.
Subscription renewals: processed automatically by SFCC on a schedule. No user interaction.

Important: Do not try to fix Mode 5. Instead, classify these orders explicitly in your data model so they are excluded from match rate calculations and treated as their own segment. The article on dark traffic and orphan orders (Part 5 of this series) covers this in full.

The Full Diagnostic Sequence: Four Queries

Run these queries in order before building anything in Looker. Each one narrows the problem.

Query 1: Baseline match rate (raw, no normalisation)

SELECT
  COUNT(DISTINCT g.transaction_id)                           AS ga4_transactions,
  COUNT(DISTINCT s.order_id)                                 AS sfcc_web_orders,
  COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                      THEN g.transaction_id END)             AS matched_raw,
  ROUND(
    COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                        THEN g.transaction_id END)
    / NULLIF(COUNT(DISTINCT g.transaction_id), 0) * 100, 1
  )                                                          AS raw_match_rate_pct
FROM `project.analytics.ga4_purchases` g
LEFT JOIN `project.sfcc.orders` s
  ON g.transaction_id = s.order_id  -- raw, no normalisation
WHERE DATE(g.event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND s.channel IN ('web', 'storefront');

-- Result interpretation:
-- >= 90%: small normalisation issue (Mode 3 or 4)
-- 70-89%: prefix mismatch likely (Mode 2)
-- < 70%:  structurally different IDs (Mode 1) or high offline volume

Query 2: Normalised match rate (with LOWER + TRIM)

SELECT
  COUNT(DISTINCT g.transaction_id)                           AS ga4_total,
  COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                      THEN g.transaction_id END)             AS matched_normalised,
  ROUND(
    COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                        THEN g.transaction_id END)
    / NULLIF(COUNT(DISTINCT g.transaction_id), 0) * 100, 1
  )                                                          AS normalised_match_rate_pct,
  -- Compare with Query 1 result to measure gain
  ROUND(
    COUNT(DISTINCT CASE WHEN s.order_id IS NOT NULL
                        THEN g.transaction_id END)
    / NULLIF(COUNT(DISTINCT g.transaction_id), 0) * 100, 1
  ) - [raw_match_rate_from_query_1]                          AS normalisation_gain_pct
FROM `project.analytics.ga4_purchases` g
LEFT JOIN `project.sfcc.orders` s
  ON LOWER(TRIM(g.transaction_id)) = LOWER(TRIM(s.order_id))  -- normalised
WHERE DATE(g.event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- If normalised_match_rate >> raw_match_rate:
--   Fix: apply LOWER(TRIM()) to both sides in all joins
-- If normalised_match_rate ≈ raw_match_rate:
--   The issue is structural (Mode 1 or 2) -- normalisation alone doesn't help

Query 3: ID structure comparison (detect Mode 1 and 2)

WITH ga4_ids AS (
  SELECT
    'ga4'                                          AS source,
    REGEXP_EXTRACT(transaction_id, r'^[^0-9]*')    AS prefix,
    LENGTH(REGEXP_EXTRACT(transaction_id, r'^[^0-9]*')) AS prefix_len,
    LENGTH(transaction_id)                          AS total_len,
    COUNT(*)                                        AS record_count
  FROM `project.analytics.ga4_purchases`
  WHERE DATE(event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  GROUP BY 1, 2, 3, 4
),
sfcc_ids AS (
  SELECT
    'sfcc'                                          AS source,
    REGEXP_EXTRACT(order_id, r'^[^0-9]*')           AS prefix,
    LENGTH(REGEXP_EXTRACT(order_id, r'^[^0-9]*'))   AS prefix_len,
    LENGTH(order_id)                                AS total_len,
    COUNT(*)                                        AS record_count
  FROM `project.sfcc.orders`
  WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    AND channel IN ('web', 'storefront')
  GROUP BY 1, 2, 3, 4
)
SELECT * FROM ga4_ids
UNION ALL
SELECT * FROM sfcc_ids
ORDER BY source, record_count DESC;

-- If GA4 shows prefix='' and SFCC shows prefix='ORD-':
--   Fix: REGEXP_REPLACE(sfcc.order_id, r'^ORD-', '')
-- If GA4 shows total_len=12 and SFCC shows total_len=18:
--   They are structurally different IDs (Mode 1) -- fix at dataLayer level

Query 4: Classify the unmatched bucket

SELECT
  s.channel,
  s.customer_type,
  CASE
    WHEN s.channel IN ('store', 'pos')      THEN 'offline_store'
    WHEN s.channel IN ('phone', 'callcenter') THEN 'offline_callcenter'
    WHEN s.customer_type IN ('b2b', 'wholesale') THEN 'b2b_wholesale'
    WHEN s.channel IN ('web', 'storefront') THEN 'web_unmatched_dark_traffic'
    ELSE 'other'
  END                                                       AS unmatched_category,
  COUNT(DISTINCT s.order_id)                                AS unmatched_orders,
  SUM(s.order_value)                                        AS unmatched_revenue,
  ROUND(
    COUNT(DISTINCT s.order_id)
    / SUM(COUNT(DISTINCT s.order_id)) OVER () * 100, 1
  )                                                         AS share_of_unmatched_pct
FROM `project.sfcc.orders` s
LEFT JOIN `project.analytics.ga4_purchases` g
  ON LOWER(TRIM(s.order_id)) = LOWER(TRIM(g.transaction_id))
WHERE g.transaction_id IS NULL
  AND DATE(s.created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3
ORDER BY unmatched_orders DESC;

-- Expected result for a healthy implementation:
-- 'offline_store' and 'b2b_wholesale': 70-80% of unmatched (expected, not fixable)
-- 'web_unmatched_dark_traffic': 20-30% of unmatched (ITP, WebView, consent)
-- If 'web_unmatched_dark_traffic' > 50% of unmatched: tracking implementation issue

Fixing It in LookML: Three Patterns

Once you have identified the failure mode, implement the fix in LookML or BigQuery views. The correction should be centralised — not scattered across individual dashboards.

Pattern 1: Normalisation only (Mode 3 + 4)

explore: ga4_purchases {
  label: "GA4 + SFCC Reconciliation"

  join: sfcc_orders {
    type: left_outer
    sql_on:
      LOWER(TRIM(${ga4_purchases.transaction_id}))
      = LOWER(TRIM(${sfcc_orders.order_id})) ;;
    relationship: one_to_one
  }
}

Pattern 2: Prefix stripping + normalisation (Mode 2)

-- SFCC uses ORD-00123456, GA4 uses 00123456
explore: ga4_purchases {
  join: sfcc_orders {
    type: left_outer
    sql_on:
      LOWER(TRIM(${ga4_purchases.transaction_id}))
      = LOWER(TRIM(
          REGEXP_REPLACE(${sfcc_orders.order_id}, r'^ORD-', '')
        )) ;;
    relationship: one_to_one
  }
}

-- Alternative: GA4 uses #123456, SFCC uses 123456
-- sql_on: REGEXP_REPLACE(${ga4_purchases.transaction_id}, r'^#', '')
--         = ${sfcc_orders.order_id}

Pattern 3: Normalised views in BigQuery (best for complex cases)

When normalisation logic is complex or involves multiple transformations, the cleanest approach is to create normalised views in BigQuery before Looker touches the data. Looker then joins on clean, pre-normalised fields:

-- BigQuery: create normalised views
CREATE OR REPLACE VIEW `project.reconciliation.norm_sfcc_orders` AS
SELECT
  order_id                                              AS raw_order_id,
  -- Apply all normalisation in one place
  LOWER(
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(order_id, r'^ORD-', ''),  -- strip prefix
        r'[\x00-\x1F\x7F]', ''                   -- strip control chars
      )
    )
  )                                                     AS norm_order_id,
  created_at,
  order_value,
  channel,
  customer_type,
  status,
  refund_amount
FROM `project.sfcc.orders`
WHERE status NOT IN ('fraud', 'test');

CREATE OR REPLACE VIEW `project.reconciliation.norm_ga4_purchases` AS
SELECT
  transaction_id                                        AS raw_transaction_id,
  LOWER(TRIM(transaction_id))                           AS norm_transaction_id,
  event_date,
  session_id,
  reported_revenue,
  session_source,
  session_medium,
  session_campaign
FROM `project.analytics.ga4_purchases`;

-- LookML: clean join on normalised fields
explore: ga4_purchases {
  join: sfcc_orders {
    type: left_outer
    sql_on: ${ga4_purchases.norm_transaction_id}
             = ${sfcc_orders.norm_order_id} ;;
    relationship: one_to_one
  }
}

What a Healthy Match Rate Looks Like

After applying the appropriate normalisation, here are the realistic match rate targets by order segment. These benchmarks are based on typical e-commerce implementations; your numbers will vary based on channel mix and Safari market share in your user base.

Order segment	Target match rate	Acceptable minimum	If below minimum, investigate
Web orders, same-day (Chrome/Firefox)	92–97%	88%	Tag firing on confirmation page, consent mode config
Web orders, 7+ day return (Safari)	75–85%	70%	ITP cookie loss — structural, partially addressable with server-side
Mobile web orders	78–88%	70%	WebView traffic, in-app browser, ITP on iOS
Mobile app orders	50–75%	40%	Firebase/GA4 SDK missing or misconfigured in app
Call centre orders	0–10%	N/A	Expected — offline by nature. Classify, don't fix.
B2B / wholesale orders	0–20%	N/A	Expected — no web session. Classify, don't fix.
POS / in-store orders	0%	N/A	Expected — fully offline. Should not appear in reconciliation scope.

No reconciliation model reaches 100%. The goal is to understand your match rate by segment so you can make informed decisions about which orders to include in attribution analysis, and distinguish between structural gaps (not fixable) and implementation gaps (fixable).

Common Mistakes to Avoid

Applying normalisation only to one side of the join. If you LOWER the SFCC ID but not the GA4 ID, you solve nothing. Always normalise both sides consistently.
Treating Mode 5 (offline orders) as a join failure. Trying to force a GA4 match for call centre orders wastes engineering time and produces incorrect data. Classify them as offline and move on.
Fixing the join in a dashboard instead of the model. If you apply a normalisation fix in a specific Looker dashboard SQL override, it only affects that dashboard. Fix it in the LookML model or the BigQuery view.
Not rechecking after platform updates. SFCC platform upgrades and dataLayer updates frequently change ID formats. Your match rate should be a monitored metric, not a one-time check.

What Comes Next

Part 4: LookML metric governance — once the join works, defining net revenue, discrepancy metrics, and conversion rate with full documentation
Part 5: Dark traffic and orphan orders — building a classification system for the orders that will never match and making them analytically useful

Sources and Further Reading

GA4 ecommerce purchase event schema: developers.google.com/analytics/devguides/collection/ga4/reference/events#purchase
BigQuery REGEXP_REPLACE function: cloud.google.com/bigquery/docs/reference/standard-sql/string_functions
SFCC Order Management API — order number vs order ID: documentation.b2c.commercecloud.salesforce.com
Looker LookML join relationship types: cloud.google.com/looker/docs — relationship parameter
BigQuery case sensitivity in string comparisons: cloud.google.com/bigquery/docs/reference/standard-sql/operators
Simo Ahava — transaction ID best practices: simoahava.com — transaction tracking