Building Batch Data Pipelines on GCP (Coursera)

Saturday, Feb 5, 2022 - Author: Alexandre Shimono


Overview

This is another course that Google developed with Coursera, and it comes with quizzes and hands-on labs. Batch processing (the “opposite” of stream processing) is a way to process huge amounts of data that are available as files, either locally or in cloud storage buckets. The main advantage of modern batch processing techniques is their use of parallelism, which greatly improves performance and scalability. Google Cloud has several tools for this problem, and I was surprised to learn that it is possible to run Spark on GCP Dataproc.
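To make the idea concrete, here is a minimal sketch of file-based batch processing on a single machine. It is not taken from the course: the function names `count_words` and `run_batch` are hypothetical, and a local thread pool stands in for the cluster that a tool like Dataproc would provide. Each file is processed independently, and the partial results are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def count_words(path: Path) -> Counter:
    """Process one input file independently of all the others."""
    return Counter(path.read_text(encoding="utf-8").split())


def run_batch(paths: List[Path], workers: int = 4) -> Counter:
    """Fan the files out to a worker pool, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each file is an independent unit of work, so the pool can
        # handle several of them at the same time.
        for partial in pool.map(count_words, paths):
            total.update(partial)
    return total
```

Calling `run_batch(list(Path("data").glob("*.txt")))` would count words across every text file in a (hypothetical) `data` folder. Threads are enough here because reading files is mostly I/O; for CPU-heavy work on really large inputs, the same split-then-merge pattern is what engines such as Spark apply across many machines.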

Pros

The big difference between this course and the “regular” Coursera ones is the integration with Qwiklabs, which lets the student access GCP and run all the exercises there. This is definitely an advantage for students who are reluctant to provide a credit card number (or who, depending on their age, don’t even have a credit card) just to test and learn GCP.

Another big difference is that the instructors are not university professors, who are typically great at teaching foundations and theoretical concepts but usually lack industry knowledge. All the instructors in this course work at Google, so the approach is quite practical.

Lastly, it gives a brief overview of the many tools available in GCP for batch processing. This is a huge advantage compared to, for example, Azure. Some months ago I had to develop a batch processing task in Azure and really missed having a dedicated tool for it (this has probably changed by now). Part of that problem was upper management ruling out Databricks integration to cut costs.

Cons

The main con of this course is how little the practical exercises demand. They are not designed to push students to learn how to solve problems with the tools. The solution is already written, and the student just goes step by step, copying and pasting into the Google Cloud console. So, despite being good for getting some practical experience, it is easy to forget everything you saw within hours of finishing.

Another point that could be covered a little better is the concept of batch processing itself, perhaps with a mention of comparable tools for the job. At some points the course really feels like an advertisement for GCP, listing all its advantages over the competition, which is not necessary since this is a paid course (we already paid for your product, Google, relax).

Conclusion

This is mostly an introduction to many GCP technologies for batch processing. Unfortunately, as with many other courses that Google is developing with Coursera, the practical activities are all already solved, and the student mostly has to copy and paste content instead of really trying to solve a problem. It is a good start, but you will not pass any Google certification just by finishing this course.
