Automating upcoming Data Engineering event notifications with Python, dbt, Snowflake & Zapier

by Nathan Purvis

Background

As someone who attends a lot of data events - particularly those pertaining to Data Engineering - I often share these internally with colleagues in case they want to join. So far, my approach has been to keep a Notion database up-to-date and rely on people looking at it. However, I noticed that the ‘last viewed’ widget was often very sparse and figured that, over time, folks might not know it existed, forget about it, or simply not want to add yet another bookmark to their browser. Internally at The Information Lab, we use a communication platform called Convo, and so I thought it’d be good to create an automated ‘upcoming events’ post at the beginning of each week, removing the need for navigating elsewhere or remembering to check!

The build could likely be trimmed down, but I always like to use these projects as an opportunity to keep my skills fresh across a variety of tools. In this blog I’ll provide more of a high-level look into the entire process, with key parts picked out. Broadly, it was broken up into the following parts:

  1. Extract/Load - Python alongside the Notion API ingests data from my Notion database, isolates key elements and lands this in Snowflake

  2. Transform - dbt Cloud creates a simple, curated table of just that week’s events, as well as a Snapshot tracking any changes people have made to database entries

  3. Publish post - Zapier accesses this Snowflake table and creates an HTML-formatted post that is then sent to Convo

  4. Orchestration - GitHub Actions, dbt Cloud job scheduling and Zapier’s Zap scheduling

If you’d like to look over the code for this then feel free to check out my GitHub repository here.

Extract/Load

This part of the process extracts key information from my Notion database, converts it into a DataFrame and then pushes the result to Snowflake. The key packages involved here are:

  • requests

  • jsonpath-ng

  • pandas

  • snowflake-connector-python[pandas]

  • python-dotenv

To quickly answer ‘why?’, a combination of these packages allows us to pull data from an API (requests), isolate key elements within the response (jsonpath-ng), turn this into a DataFrame (pandas) and push this to Snowflake (snowflake-connector-python[pandas]), all while keeping our credentials out of sight by populating them from environment variables placed in a .env file (python-dotenv). A couple of features I’m pleased I got to use are JSONPath and the Snowflake connector, particularly cursor objects. In future blogs I plan to write about these in more detail.
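To give a flavour of how those pieces fit together, here's a minimal sketch of the extract/load step. The Notion property names, environment variable names and target table are illustrative placeholders rather than the exact ones used in the repo:

```python
import os

import pandas as pd
import requests
import snowflake.connector
from dotenv import load_dotenv
from jsonpath_ng import parse
from snowflake.connector.pandas_tools import write_pandas

# Populate credentials/IDs from a local .env file so nothing sensitive is hard-coded
load_dotenv()

# Query the Notion database via the REST API
response = requests.post(
    f"https://api.notion.com/v1/databases/{os.getenv('NOTION_DATABASE_ID')}/query",
    headers={
        "Authorization": f"Bearer {os.getenv('NOTION_TOKEN')}",
        "Notion-Version": "2022-06-28",
    },
)
response.raise_for_status()
payload = response.json()

# Use JSONPath expressions to isolate just the fields we care about.
# The property names ('Name', 'Date') depend entirely on how the database is set up.
names = [m.value for m in parse("results[*].properties.Name.title[0].plain_text").find(payload)]
dates = [m.value for m in parse("results[*].properties.Date.date.start").find(payload)]

# Shape the extracted values into a DataFrame ready for loading
df = pd.DataFrame({"EVENT_NAME": names, "EVENT_DATE": dates})

# Land the DataFrame in Snowflake; write_pandas handles the staging and COPY INTO for us
conn = snowflake.connector.connect(
    account=os.getenv("SNOWFLAKE_ACCOUNT"),
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
    warehouse=os.getenv("SNOWFLAKE_WAREHOUSE"),
    database=os.getenv("SNOWFLAKE_DATABASE"),
    schema=os.getenv("SNOWFLAKE_SCHEMA"),
)
write_pandas(conn, df, "RAW_NOTION_EVENTS", auto_create_table=True, overwrite=True)
conn.close()
```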

Transform

dbt is probably the hottest Data/Analytics Engineering framework for SQL right now and, although I use the open-source (Core) software almost daily with clients, I haven’t had a chance to play around in the Cloud (SaaS) product as much. Luckily the fundamentals and core functionality are identical, making it easy to set up a project and get going. After setting up my source/bronze layer data, I knew I wanted two things:

  1. A curated table of events for the next 7 days, to be replaced every Sunday

  2. A Snapshot table showing all database entries and how they have been changed (for example, has someone edited the time or URL?)

dbt Cloud makes it easy to do all of this and more. With a small collection of YAML and SQL files we can very quickly generate a workflow that will run freshness tests on our sources to check they’re up-to-date, create a directed acyclic graph (DAG) with inferred upstream/downstream dependencies, apply data checks (like ensuring primary keys aren’t null and are unique), materialize database objects in our desired form and much, much more!

I’m happy I got to implement a Snapshot model to keep a historical record of source data, as well as using sensible materialization choices to prevent unnecessary clutter in my warehouse. I’ve actually already published a blog on the latter of these here, and plan on writing a lot more about dbt in future so keep your eyes peeled!

Publish post

Zapier provides useful integrations across multiple web applications in order to automate parts of a workflow. Given Convo is a native integration, this offers the perfect link to get our curated Snowflake table into a nice post for our audience! Within Zapier we can build Zaps, which are workflows in themselves and represent the steps we want to take. Here's the one I created:

To summarise, this Zap completes the following:

  1. Sets the weekly schedule (10:30 am Monday, in time for everyone getting online and nailing their first coffee!)

  2. Selects from the curated Snowflake table, excluding the unnecessary unique id (uid) field

  3. Pulls the profile picture for the bot's post from a Google Drive input

  4. Splits into two possible paths:

    1. There are records in the Snowflake table

      1. Uses additional Python (sketched after this list) to process the entries, which arrive as comma-separated strings, stitching them together and wrapping them in HTML formatting to create a visually appealing post

      2. Sends this post to Convo, including the relevant groups (Data Engineering & Sales Team). These look like so:

    2. There aren't any records in the Snowflake table

      1. Sends a very simple post to Convo stating there are no events this week and reminding users to add any they come across to the Notion database, like so:
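Here's roughly what that Python step involves - a sketch in the style of a Code by Zapier (Python) step, where the input keys and the HTML structure are illustrative rather than the exact ones used:

```python
# Code by Zapier (Python) sketch: the Snowflake step hands each column over as a single
# comma-separated string, so we split the strings back out, zip the pieces together and
# wrap each event in simple HTML. The input keys below are placeholders.
names = input_data["event_names"].split(",")
dates = input_data["event_dates"].split(",")
urls = input_data["event_urls"].split(",")

rows = []
for name, date, url in zip(names, dates, urls):
    rows.append(
        f"<li><b>{name.strip()}</b> - {date.strip()} - "
        f"<a href='{url.strip()}'>Sign up</a></li>"
    )

post_body = (
    "<p>Upcoming Data Engineering events this week:</p>"
    "<ul>" + "".join(rows) + "</ul>"
)

# Whatever is placed in 'output' becomes available to later steps in the Zap
output = {"post_body": post_body}
```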

Orchestration

For this project, each stage has its own built-in scheduling element. Of course, this could be enhanced by using a central orchestration tool, but again I wanted to go through the motions with each one to keep things fresh in my mind. It's worth noting that I could also trigger each step via webhooks/API calls once the previous one completes.
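As a flavour of what that chaining could look like, the load step could kick off the dbt Cloud job directly when it finishes. Below is a rough sketch assuming the dbt Cloud v2 'trigger job run' endpoint, with the account ID, job ID and API token read from placeholder environment variables:

```python
import os

import requests

# Trigger a dbt Cloud job run as soon as the load completes, instead of waiting for a
# fixed schedule. The account ID, job ID and API token are placeholders from .env.
account_id = os.getenv("DBT_CLOUD_ACCOUNT_ID")
job_id = os.getenv("DBT_CLOUD_JOB_ID")

response = requests.post(
    f"https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/{job_id}/run/",
    headers={"Authorization": f"Token {os.getenv('DBT_CLOUD_API_TOKEN')}"},
    json={"cause": "Triggered by the extract/load step"},
)
response.raise_for_status()  # anything other than a 2xx means the run wasn't queued
```

For now, though, each tool's own scheduler does the job, so let's walk through those.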

First of all we have GitHub Actions, with all steps set in an actions.yml file, running on a Cron schedule of 0 12 * * 0.

This syntax tells GHA to run at midday every Sunday in the following way:

0 - Minute (0th minute)

12 - Hour (12th hour)

* - Day of the month (unused)

* - Month (unused)

0 - Day of the week (Cron weeks are Sunday to Saturday and are 0-based)
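If you ever want to sanity-check a Cron expression before committing it, the third-party croniter package (not part of this project, just handy for illustration) can preview the next few run times:

```python
from datetime import datetime

from croniter import croniter

# Preview the next few run times of the schedule from an arbitrary starting point
schedule = croniter("0 12 * * 0", datetime(2024, 1, 1))
for _ in range(3):
    print(schedule.get_next(datetime))  # each result lands on a Sunday at 12:00
```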

dbt Cloud also offers very user-friendly scheduling in the deploy GUI. We can again use Cron scheduling for added flexibility, but as I only want a straightforward weekly run, I opted for the basic approach and set this to run 3 hours after the source data refresh:

Finally, as we've just seen in the Zapier section above, the 'Schedule by Zapier' step determines when the Zap runs. There's not much to say about this one - it simply executes at 10:30 am every Monday:

And that’s a wrap! I hope you’ve enjoyed reading this overview of a recent process I automated, and I hope you’ll take a look over the repo if you have the time. It was very fun to build and, as usual, I picked up a couple of extra packages and tricks to add to my arsenal along the way! There are a few improvements I already have in mind for this, including:

  • Automatically adding events to the Notion database by periodically scraping certain pages of interest

  • Creating Convo posts when a new database entry is added so that people can sign up ASAP

  • Using webhooks to trigger runs upon successful completion of their dependency, rather than simple scheduling

As usual, if you have any additional suggestions, feedback, or requests for future content then please do reach out!
