How to process large BigQuery tables/job result in a single memory machine with python
Google BigQuery is a great piece of technology that solves many of today’s big data challenges. It abstracts for you the pains of storage which means you don’t think anymore about how big is your dataset in order to store it, and compute, in other words, you don’t think anymore about how to distribute compute operations across multiple nodes. Billing will still be up to you.
But still sometimes you want to process your dataset or a subset of it in your local machine, and in order to do that, at least at the best of my knowledge in python there is some extra work you have to do around the BigQuery client library provided by Google, to handle memory load, (you will have to request multiple time using pagination)
This article aims to introduce you package I have built as a wrapper around the BigQuery client library, that makes an abstraction over it, and allows you to work with large BigQuery data as you would do with lists in python
And as a bonus, there is also a BigQuery fake data generator CLI linked (that I have also built for the purpose of this article, you can use it to generate large BigQuery fake data tables outside the context of this article)
How this article is organized :
- This article shows an example (few lines of code) of the package usage
- Explains the ideas behind
- Goes through a reproducible example where we will generate fake BigQuery data
- Makes a conclusion
How bq_iterate looks like ?
The package is named bq_iterate, before explaining what it does, we can see how in a few lines, we can request our data, and process it in batches
What is happening here ?
bq_iterate provide two functionalities :
- 2 classes BqQueryIterator and BqTableRowIterator, they behave like an iterator, where they hold only <batch_size> elements in memory and when you want to access the element <batch_size + 1> the iterator calls in memory the next batch_size elements
- A function batchify_iterator, what this function does, it takes an iterator and yields slices of it, the <batch_slice> can be bigger than the <batch_size> even if by common sense it’s supposed to be smaller, it doesn’t matter, since batchify_iterator will create in memory at each batch it yields, a list of <batch_slice> elements, once the batch gets consumed, it is freed from memory, thanks to it being a generator.
So for the example above, row_iterator is treated as an iterator and batchify_iterator is constructed from a generator of slices containing row_iterator elements
Let’s generate fake data and test the example above
To generate fake data in BigQuery, you can either use the package here or follow this example :
We will use Faker a python package that generates fake realistic data, we then convert the list of generated fake data to a data frame, then it is sent to BigQuery, we loop over this process in order to generate a large data table (larger than what our single machine can hold in memory) by appending the data generated to the BigQuery table at each step of the loop
Let’s see how the code looks like :
In order to make this code work:
- You should have gcloud already configured
- Have authorized gcloud client libraries see here
- Replace <project_id> variable with the name of your project id
Also you can play with the parameters nrows to generate more rows at each function gen_fake_data call, you can also play with main loop to repeat the process multiple times, see more infos of how this code works in the README.md of this repo
As a result in google cloud depending on the parameters you entered in fake_config, you obtain something that looks like that
So now you can execute the snippet :
Also if you have noticed I have added a column position which will allow you to order data in an easier manner if you want to do some comparison. For example you can change the query value to “query = “select * from <project_id>.<dataset_id>.<table_id> order by position”” then do the same request in BigQuery UI, and do a comparison to check if the batches match the data request.
So here is the package :
- Git hub repo : https://github.com/Senhaji-Rhazi-Hamza/bq_iterate
- Python package index : https://pypi.org/project/bq-iterate/
It’s my first published package, I hope it’ll help make your experience of BigQuery data manipulation with python better.
Please don’t hesitate to reach out, propose feedback, ideas and if you try it and like it, you’re more than welcome to give it a star!