'Unique hit-level ID dupe and mis-matched data issue in Universal Analytics export to BigQuery

I'm looking for some help troubleshooting a dupe + mismatched data issue in my hit-level query of Universal Analytics data export to BigQuery. Double-checking my understanding of the UA export schema along the way.

If I wanted to produce a result set that gave me some unnested, hit-level data for sessions in BigQuery, I might use a query like the following:

SELECT
    GA.date
,   GA.clientId
,   GA.fullVisitorId
,   GA.visitId
,   GA.visitNumber
,   GA.visitStartTime
,   GA.channelGrouping
,   GA.trafficSource.source
,   GA.trafficSource.medium
,   GA.trafficSource.campaign
,   GA.trafficSource.referralPath
,   GA.trafficSource.keyword
,   GA.trafficSource.campaignCode
,   GA.trafficSource.adContent
,   GA.trafficSource.adwordsClickInfo.adGroupId
,   GA.trafficSource.adwordsClickInfo.adNetworkType
,   GA.trafficSource.adwordsClickInfo.campaignId
,   GA.trafficSource.adwordsClickInfo.gclId
,   GA.trafficSource.adwordsClickInfo.page
,   GA.trafficSource.adwordsClickInfo.slot
,   GA.trafficSource.adwordsClickInfo.targetingCriteria.boomuserlistId
,   GA.geoNetwork.latitude
,   GA.geoNetwork.longitude
,   GA.geoNetwork.metro
,   GA.device.deviceCategory
,   GA.totals.bounces
,   GA.totals.newVisits
,   GA.totals.timeOnSite
,   GA.totals.visits
,   HIT.hitNumber
,   HIT.time
,   HIT.isEntrance
,   HIT.isExit
,   HIT.type
,   HIT.page.hostname
,   HIT.page.pagepath
,   HIT.page.pageTitle
,   HIT.eventInfo.eventCategory
,   HIT.eventInfo.eventAction
,   HIT.eventInfo.eventLabel
FROM `ga_data.1234567.ga_sessions_20191216` GA
,UNNEST(GA.hits) AS HIT

If I wanted to produce a field to uniquely identify each row, I would expect a concatenation like the following to do the trick:

CONCAT(GA.fullVisitorId,'_',GA.visitId,'_',HIT.hitNumber)

However, I'll occasionally (albeit rarely) see some duplicate ID'd results in this concatenation within a result-set, AND with differing information. Ex. with some obfuscated/randomized info while maintaining formats and matches between fields.

Unique_hit-level_ID date clientId fullVisitorId visitId visitNumber visitStartTime channelGrouping source medium campaign referralPath keyword campaignCode adContent adGroupId adNetworkType campaignId gclId page slot boomuserlistId latitude longitude metro deviceCategory bounces newVisits timeOnSite visits hitNumber time isEntrance isExit type hostname pagepath pageTitle eventCategory eventAction eventLabel
535816238356047000_1576517283_2 20191216 86768368.1576517283 535816238356047000 1576517283 1 1576517283 Organic Search google organic (not set) (not set) (not set) 1 34 1 2 7 EVENT www.example.com / Title1 Button Click MagazineView
535816238356047000_1576517283_2 20191216 86768368.1576517283 535816238356047000 1576517283 1 1576517283 Organic Search google organic (not set) (not set) (not set) 1 34 1 2 33909 TRUE PAGE ssl.example.com /customer/summary/index.cfm Title2

The time, isExit, type, hostname, pagepath, pageTitle, and event fields all do not match.

I can think of a few different possibilities for what is happening:

  1. I do not properly understand the export schema, and thus have constructed an incorrect hit-level ID.
  2. My query is incorrect in some way that mis-represents the data.
  3. Our export to BigQuery has been done incorrectly.
  4. Error of some sort on Google's side.

Does anyone have thoughts on what might be causing this?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source