<img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=1870732089876948&amp;ev=PageView&amp;noscript=1">

Extract Nested Structs Without Cross Joining Data In BigQuery

This article was originally posted on Medium on June, 20 2020.

As a Google Cloud consultant focused on the Data & Analytics space, I spend a lot of time with BigQuery, Google Cloud’s petabyte-scale data warehouse-as-a-service, and for the most part, I love it. It’s really powerful, there’s no infrastructure management, and it comes with all sorts of built in analytics and AI capabilities.

Because of its nature as a columnar data store, however, BigQuery SQL syntax can sometimes be non-intuitive to work with. Google recommended best-practices for query performance and cost optimization are to denormalize your data, or in other words to take normalized data (i.e. data that is split across multiple tables) and combine it into a single table using specialized data types like arrays and key-value stores.

Querying denormalized data may be more performant than joining data at scale but querying the data isn’t that easy if you’re used to working with data in normalized tables, which had been my experience up until the point where I joined Cloudbakers. So in this post, I’ll cover some query syntax for common and less-common denormalized data structures in BigQuery, using a mock GCP billing dataset that has the same structure as the BigQuery billing export.

Basic STRUCTS, or key-value fields, are straightforward enough, as you can simply use dot notation to select subfields:

-- service is a STRUCT field and id is one of its keys
SELECT service.id AS service_id 
FROM   `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` 
LIMIT  1000

When working with arrays or REPEATED fields, things get a little bit more complicated. You can select individual components of an array using the array index, which can be useful if the order of elements is predetermined or doesn’t matter (you can see from the first two rows of the results below that in this case, order is not predetermined)

SELECT credits[OFFSET(0)] as credit_offset_0, 
       credits[OFFSET(1)] AS credit_offset_1 
FROM   `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` 
WHERE  ARRAY_LENGTH(credits) > 1 limit 1000

You also have the option to flatten the data using what’s called a correlated cross join. This takes any repeated field, pivots it so that each element in the array is a new row, and then joins that new tabular data with the original table, creating a flattened schema with repeated rows for every element in the original repeated field.

SELECT billing.credits,
       c.*
FROM   `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` billing, UNNEST(credits) c
WHERE  ARRAY_LENGTH(credits) > 1 limit 1000

This is useful if you need to flatten your data and calculate aggregate values or metrics based off the data contained in a array - for example, if you need to calculate the total credit amount per credit type for GCP consumption that was invoiced in the month of December, 2019, you would use the following query:

SELECT   c.NAME        AS credit_type, 
         SUM(c.amount) AS total_credit_amount 
FROM     `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` billing, 
         unnest(credits) c 
WHERE    billing.invoice.month = '201912' 
GROUP BY credit_type 
ORDER BY credit_type

But what if you simply need the arrayed data to be pivoted so that it can be consumed by another system or exported into spreadsheet software like Excel or Google Sheets? Again, because of its nature as a columnar data store, BigQuery doesn’t easily support syntax that would be the equivalent of something like the Pandas library’s unstack method.

As a specific example, let’s take two of the four repeated fields in the BigQuery billing export:

  • Project Labels
  • Resource Labels

If you’re not already familiar with the concept, project and resource labels allow you as an organization to apply business concepts and processes to GCP compute resources. Because the labels applied to your compute resources are propagated to your billing data, if you set up labels correctly and consistently, you can break down your GCP cost consumption by different environments, organizational departments, or initiatives. You can also initiate processes like chargebacks to individual departments.

Long story short, labels are or should be an important aspect of your GCP cost management strategy. Which makes it all the more frustrating that both the project labels and resource labels in the BigQuery billing export are in fields of repeated STRUCT objects, or in other words are written to the table as arrays of key-value pairs.

If you do a correlated cross join on the project labels in this instance, because the cost field is not contained within the project label array in the same way the credit costs are associated with specific credit types within the credit array, flattening the project label data causes the cost field to be repeated once per project label. Same goes if you do a correlated cross join on the resource labels.

SELECT project, 
       project_labels, 
       cost 
FROM   `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` billing, 
       unnest(project.labels) project_labels 
WHERE  cost > 1

If you tried to group by and sum in this instance, your GCP costs would potentially be much higher than your actual costs, depending on how many labels you applied to each project. If your table has a unique identifier for each row, you could use that to join against a flattened array that filtered out all but one label, and could do so repeatedly to integrate each different project label. But the billing data export does not have unique row identifiers, so that’s not really an option here. So what can you do?

All hope is not lost, thanks to this article from Lak Lakshmanan of Google Cloud, which explores some powerful querying patterns in BigQuery. I strongly recommend you read the full article, but in particular there was one component I found helpful in solving the billing problem described above, which was the application of expression subqueries to unpack nested struct objects in BigQuery tables. You can see below an example of how you can use this querying syntax to unstack project label data, creating a tabular result that can be easily worked with in BigQuery or other data analytics systems:

SELECT project.labels, 
       ( 
              SELECT value 
              FROM   Unnest(project.labels) 
              WHERE  KEY='creator') AS project_creator, 
       ( 
              SELECT value 
              FROM   Unnest(project.labels) 
              WHERE  KEY='account') AS project_account, 
       cost 
FROM   `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` 
WHERE  array_length(project.labels) > 0 
AND    cost > 1

That’s all I have for today - although the queries I presented are not necessarily very complicated, I found them useful (particularly the expression subqueries) and I hope you will too.

New call-to-action

Originally published on August 20, 2020

If you enjoyed this post, please consider sharing

Want more tech tips?
Subscribe to our IT Superhero Newsletter!