'Unique hit-level ID dupe and mis-matched data issue in Universal Analytics export to BigQuery
I'm looking for some help troubleshooting a dupe + mismatched data issue in my hit-level query of Universal Analytics data export to BigQuery. Double-checking my understanding of the UA export schema along the way.
If I wanted to produce a result set that gave me some unnested, hit-level data for sessions in BigQuery, I might use a query like the following:
SELECT
GA.date
, GA.clientId
, GA.fullVisitorId
, GA.visitId
, GA.visitNumber
, GA.visitStartTime
, GA.channelGrouping
, GA.trafficSource.source
, GA.trafficSource.medium
, GA.trafficSource.campaign
, GA.trafficSource.referralPath
, GA.trafficSource.keyword
, GA.trafficSource.campaignCode
, GA.trafficSource.adContent
, GA.trafficSource.adwordsClickInfo.adGroupId
, GA.trafficSource.adwordsClickInfo.adNetworkType
, GA.trafficSource.adwordsClickInfo.campaignId
, GA.trafficSource.adwordsClickInfo.gclId
, GA.trafficSource.adwordsClickInfo.page
, GA.trafficSource.adwordsClickInfo.slot
, GA.trafficSource.adwordsClickInfo.targetingCriteria.boomuserlistId
, GA.geoNetwork.latitude
, GA.geoNetwork.longitude
, GA.geoNetwork.metro
, GA.device.deviceCategory
, GA.totals.bounces
, GA.totals.newVisits
, GA.totals.timeOnSite
, GA.totals.visits
, HIT.hitNumber
, HIT.time
, HIT.isEntrance
, HIT.isExit
, HIT.type
, HIT.page.hostname
, HIT.page.pagepath
, HIT.page.pageTitle
, HIT.eventInfo.eventCategory
, HIT.eventInfo.eventAction
, HIT.eventInfo.eventLabel
FROM `ga_data.1234567.ga_sessions_20191216` GA
,UNNEST(GA.hits) AS HIT
If I wanted to produce a field to uniquely identify each row, I would expect a concatenation like the following to do the trick:
CONCAT(GA.fullVisitorId,'_',GA.visitId,'_',HIT.hitNumber)
However, I'll occasionally (albeit rarely) see some duplicate ID'd results in this concatenation within a result-set, AND with differing information. Ex. with some obfuscated/randomized info while maintaining formats and matches between fields.
| Unique_hit-level_ID | date | clientId | fullVisitorId | visitId | visitNumber | visitStartTime | channelGrouping | source | medium | campaign | referralPath | keyword | campaignCode | adContent | adGroupId | adNetworkType | campaignId | gclId | page | slot | boomuserlistId | latitude | longitude | metro | deviceCategory | bounces | newVisits | timeOnSite | visits | hitNumber | time | isEntrance | isExit | type | hostname | pagepath | pageTitle | eventCategory | eventAction | eventLabel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 535816238356047000_1576517283_2 | 20191216 | 86768368.1576517283 | 535816238356047000 | 1576517283 | 1 | 1576517283 | Organic Search | organic | (not set) | (not set) | (not set) | 1 | 34 | 1 | 2 | 7 | EVENT | www.example.com | / | Title1 | Button | Click | MagazineView | |||||||||||||||||
| 535816238356047000_1576517283_2 | 20191216 | 86768368.1576517283 | 535816238356047000 | 1576517283 | 1 | 1576517283 | Organic Search | organic | (not set) | (not set) | (not set) | 1 | 34 | 1 | 2 | 33909 | TRUE | PAGE | ssl.example.com | /customer/summary/index.cfm | Title2 |
The time, isExit, type, hostname, pagepath, pageTitle, and event fields all do not match.
I can think of a few different possibilities for what is happening:
- I do not properly understand the export schema, and thus have constructed an incorrect hit-level ID.
- My query is incorrect in some way that mis-represents the data.
- Our export to BigQuery has been done incorrectly.
- Error of some sort on Google's side.
Does anyone have thoughts on what might be causing this?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
