GCP 101: Building a Data Pipeline in Google Cloud Platform
As more data is created, the need for big data analytics grows; the field is expected to reach $103 billion by 2023. In a previous post on Google Cloud Platform, we discussed how Google’s own tool, BigQuery, is a differentiator for its public cloud.
This time we take a deeper look at the role data analytics plays in the Google ecosystem. By highlighting the key products and their specific uses, you’ll get a better idea of how these can work for your organization, and how to use them to build a data pipeline.
Data analytics is a growing field led by further innovations in the cloud. With the power to crunch information even faster, organizations have more access to data. Google Cloud offers a full set of products that process data faster and more efficiently.
Once you have your data organized into a system (or, more likely, several systems), you will need a way to move data between them. This can include several steps, from copying data to moving it to the cloud, then reformatting it and combining it with other data sources. Together, these steps make up a data pipeline.
Why Create a Data Pipeline
Data pipelines are used for a variety of reasons. If you’re in retail and selling a product, you can use a data pipeline to combine data from two data warehouses. Let’s say one houses purchase data, and the other keeps inventory data. A data pipeline can combine the two, so you can determine when your stock is running low. Another example, as illustrated below, is an energy company with market information data that needs to be processed. A data pipeline can process the pricing and usage data to streamline results.
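As a rough illustration of the retail example above, the sketch below combines purchase records with inventory records and flags products whose remaining stock falls below a threshold. The field names (sku, quantity, on_hand) and the threshold are hypothetical, and plain Python lists stand in for the two data warehouses.

```python
from collections import defaultdict


def low_stock(purchases, inventory, threshold=10):
    """Return (sku, remaining) pairs whose remaining stock is below `threshold`.

    purchases: iterable of dicts with hypothetical keys "sku" and "quantity"
    inventory: iterable of dicts with hypothetical keys "sku" and "on_hand"
    """
    # Total units sold per SKU, taken from the purchase data.
    sold = defaultdict(int)
    for purchase in purchases:
        sold[purchase["sku"]] += purchase["quantity"]

    # Combine with inventory: remaining = units on hand minus units sold.
    flagged = []
    for item in inventory:
        remaining = item["on_hand"] - sold[item["sku"]]
        if remaining < threshold:
            flagged.append((item["sku"], remaining))
    return sorted(flagged)
```

In a real pipeline the same join would more likely run inside the warehouse or the ETL tool; the point here is only the shape of the combination step.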
Build a Data Pipeline in GCP
Creating a data pipeline involves utilizing GCP’s tech stack. Here’s an overview of how this would work in the above scenario with the energy company.
Google Cloud Dataflow is used as the primary ETL mechanism: it extracts the data from the API endpoints specified by the customer, transforms it into the required format, and pushes it into BigQuery, Cloud Storage and Pub/Sub.
In a pipeline project, Dataflow serves several purposes. The first task is to extract data from multiple HTTP API endpoints, which update their data once every five minutes.
After the data is extracted, Dataflow coalesces everything from one endpoint into a single JSON message. It then sends each message to Pub/Sub and also writes it to a .json file in Cloud Storage.
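The coalescing step can be sketched in plain Python. This is a minimal illustration rather than the Dataflow job itself: the coalesce and object_name helpers, the message layout (endpoint, count, records), and the file-naming scheme are all assumptions made for the example.

```python
import json


def coalesce(endpoint, records):
    """Fold every record fetched from one endpoint into one JSON message."""
    message = {
        "endpoint": endpoint,      # hypothetical field names
        "count": len(records),
        "records": list(records),
    }
    # sort_keys keeps key order stable, so identical data yields identical text.
    return json.dumps(message, sort_keys=True)


def object_name(endpoint):
    """Derive a .json file name for an endpoint (naming scheme is assumed)."""
    return endpoint.strip("/").replace("/", "_") + ".json"
```

In the pipeline described here, the string returned by coalesce would be what gets published to Pub/Sub and written to Cloud Storage under a per-endpoint name.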
Google Cloud Pub/Sub serves as the messaging system. It receives messages from Google Cloud Dataflow, and the visualization application housed in Compute Engine pulls them from the subscription.
Pub/Sub is the messaging layer of the pipeline. Dataflow publishes the JSON messages built from the API endpoints to Pub/Sub, and a visualization application hosted on a Google Cloud Compute Engine instance reads them. Once read, the messages are acknowledged, which clears them from the Pub/Sub queue.
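The pull-then-acknowledge pattern can be shown with a toy in-memory model. The InMemorySubscription class below is a hypothetical stand-in written for this example, not the Pub/Sub client library; it only mimics the behaviour that a message stays available until it is acknowledged.

```python
import itertools


class InMemorySubscription:
    """Toy stand-in for a pull subscription; NOT the Pub/Sub client API."""

    def __init__(self):
        self._pending = {}                 # ack_id -> message payload
        self._ids = itertools.count(1)

    def publish(self, payload):
        self._pending[next(self._ids)] = payload

    def pull(self, max_messages=10):
        """Return up to max_messages (ack_id, payload) pairs without removing them."""
        return list(self._pending.items())[:max_messages]

    def acknowledge(self, ack_ids):
        """Remove acknowledged messages so they are not delivered again."""
        for ack_id in ack_ids:
            self._pending.pop(ack_id, None)


def drain(subscription, handle):
    """Pull messages, process each one, and acknowledge only after success."""
    processed = 0
    for ack_id, payload in subscription.pull():
        handle(payload)                    # e.g. update the dashboard
        subscription.acknowledge([ack_id])
        processed += 1
    return processed
```

Acknowledging only after the handler returns means that a message whose processing fails stays in the queue and can be pulled again.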
Google Cloud App Engine houses the Dataflow application and launches it through the cron service on a schedule.
App Engine hosts a small Flask app that wraps the Dataflow job. This lets the App Engine cron service run the Dataflow job every five minutes, matching the update frequency of the API endpoints.
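A wrapper of this kind might look roughly like the sketch below. To keep it free of dependencies it uses a plain WSGI callable instead of Flask (Flask apps are WSGI applications too). The /run-dataflow route and the launch_job callable are hypothetical; the check relies on the X-Appengine-Cron header that App Engine adds to requests from its cron service.

```python
def make_app(launch_job):
    """Build a WSGI app that runs `launch_job` when App Engine cron calls it.

    `launch_job` is a hypothetical callable that starts the Dataflow job.
    """

    def app(environ, start_response):
        # App Engine cron requests carry the header "X-Appengine-Cron: true";
        # WSGI exposes it as HTTP_X_APPENGINE_CRON.
        from_cron = environ.get("HTTP_X_APPENGINE_CRON") == "true"
        if environ.get("PATH_INFO") != "/run-dataflow":   # hypothetical route
            status, body = "404 Not Found", b"not found"
        elif not from_cron:
            status, body = "403 Forbidden", b"cron only"
        else:
            launch_job()
            status, body = "200 OK", b"job launched"
        start_response(status, [("Content-Type", "text/plain")])
        return [body]

    return app
```

A matching cron.yaml entry would point at the same URL with a schedule such as "every 5 minutes", in line with the five-minute update rate of the endpoints.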
Google Cloud Compute Engine hosts the visualization application stack the customer uses to visualize their data from the API endpoints.
The custom visualization application, which gives real-time insights into the energy market, is hosted on Google Cloud Compute Engine. The application reads data from the Pub/Sub subscription that holds the messages.
Google Cloud BigQuery warehouses the data read from the API endpoints. In this pipeline it simply stores the same data that is sent to Pub/Sub.
Google Cloud Storage houses JSON versions of the API data, which are auto-loaded as startup files within the VM for the visualization dashboard.
Every time you push a message to Pub/Sub, you also overwrite the file in Cloud Storage that holds the same values. The visualization application reads these files at startup, and they act as the initial data load into the graphs.
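The overwrite-and-reload behaviour can be sketched with the local filesystem standing in for a Cloud Storage bucket. The helper names and the one-file-per-endpoint layout are assumptions for illustration; a real implementation would go through the Cloud Storage client instead of pathlib.

```python
import json
from pathlib import Path


def write_startup_file(directory, endpoint, message):
    """Overwrite the endpoint's file so it always holds the latest message."""
    path = Path(directory) / f"{endpoint}.json"
    path.write_text(json.dumps(message))   # write_text replaces existing content
    return path


def load_startup_data(directory):
    """Read every .json file as the dashboard's initial data load."""
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(Path(directory).glob("*.json"))
    }
```

Because each write replaces the previous file, the dashboard always starts from the most recent message per endpoint and then picks up newer data from Pub/Sub.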
Get Started With Big Data Products
Data is here to stay — and will only increase over time. That’s why organizing, processing and analyzing it will continue to be a key function for organizations. With Google Cloud Platform, you have the ability to do that and more.
We want you to understand all that Google Cloud Platform has to offer, so be sure to check out the other blogs in our GCP 101 series. We'll be adding more soon!