Select your data by turning complex conditions into a simple boolean expression

What will you learn in this article?

Assuming you know what a pandas DataFrame is in Python, you will learn how to turn complex selection conditions on that DataFrame into simpler ones, using some notions of boolean algebra and first-order logic, in order to extract the subset of data that satisfies these conditions.
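As a rough illustration of the idea (not code taken from the article itself), here is a minimal sketch that names each elementary condition as a boolean Series and combines them with `&`, `|` and `~`. The sample data, column names and variable names are all hypothetical. It also checks one boolean algebra identity, De Morgan's law, on the masks.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "name": ["Ana", "Bo", "Cy", "Di"],
        "age": [25, 41, 17, 33],
        "city": ["Paris", "Lyon", "Paris", "Nice"],
    }
)

# Name each elementary condition as a boolean Series (one True/False per row).
is_adult = df["age"] >= 18
in_paris = df["city"] == "Paris"
over_40 = df["age"] > 40

# De Morgan's law: not (A or B) is equivalent to (not A) and (not B).
mask_direct = ~(in_paris | over_40)
mask_rewritten = ~in_paris & ~over_40
assert mask_direct.equals(mask_rewritten)

# Combine the named conditions, then index the DataFrame once.
selected = df[is_adult & mask_rewritten]
print(selected)
```

Naming the conditions first keeps the final selection readable, and rewriting a mask with an identity such as De Morgan's law leaves the selected rows unchanged.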

What is this article not?

This article is not another article about the many ways to select data from a DataFrame in pandas; it focuses on how you can use boolean algebra and first-order logic to express some quite complex conditions on your…


What will you learn or find in this article?

  • Some context and history on how and why we are shifting to the cloud paradigm
  • A complete pattern example of how to migrate (or create from scratch) your PySpark jobs to GCP with Dataproc workflow templates (you can use the same logic for a Spark or Hadoop migration; some further references will also be given)
  • A GitHub repo that you can copy and adapt for your own migration

Who might find this article useful?

  • Anyone who wants to migrate their on-premise Spark/Hadoop infrastructure to GCP, or who just wants to implement their Spark/Hadoop workflows on GCP with Dataproc workflow templates
  • Anyone curious

Introduction: a bit of history & context

From vertical to horizontal scaling

Ali Ghodsi, co-founder of Databricks


Introduction

Google BigQuery is a great piece of technology that solves many of today’s big data challenges. It abstracts away the pain of storage, which means you no longer think about how big your dataset is in order to store it, and of compute, which means you no longer think about how to distribute compute operations across multiple nodes. Billing will still be up to you.

Still, sometimes you want to process your dataset, or a subset of it, on your local machine, and in order to do that, at least to the best of my knowledge in Python…


What is multi-tenancy?

Briefly, multi-tenancy is an architecture that allows one instance of a piece of software to serve multiple clients (called tenants). One of its advantages is cost saving: suppose you have a licence for software you use (MS SQL Server, for example); it would be nice if you could handle all your clients in one instance and pay for just one licence.

Where might you need multi-tenancy?

The basic use case is when you are building a SaaS app: you want to serve multiple clients (tenants) while keeping their data isolated. There are many ways to achieve a multi-tenant architecture; describing those ways is outside of…

Senhaji Rhazi hamza

Full-stack DevOps based in Paris, Python advocate. Let’s connect on LinkedIn: https://www.linkedin.com/in/hamza-senhaji-rhazi-72170678/
