https://cloud.google.com/certification/data-engineer
A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.
Based on the Linux Academy Google Cloud Certified Professional Data Engineer course structure.
Chapter 4. Cloud SQL: managed relational storage https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_014.html
Chapter 5. Cloud Datastore: document storage https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_015.html
5.1.1. Design goals for Cloud Datastore | 5.1.4. Consistency with data locality
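The consistency section above turns on entity groups: within one ancestor path, Datastore offers strongly consistent ancestor queries, while global queries are only eventually consistent. A minimal Python sketch of the difference, assuming the google-cloud-datastore client and hypothetical kind names:

    from google.cloud import datastore

    client = datastore.Client()

    # Entities sharing an ancestor form one entity group, the unit of
    # strong consistency (and of the ~1 write/sec/group write limit).
    parent_key = client.key("TaskList", "default")
    task = datastore.Entity(key=client.key("Task", "sample-task", parent=parent_key))
    task["description"] = "Review Bigtable schema design"
    client.put(task)

    # Ancestor queries are strongly consistent: this sees the write above.
    query = client.query(kind="Task", ancestor=parent_key)
    print(list(query.fetch()))

    # A non-ancestor (global) query on kind "Task" is only eventually
    # consistent and may not reflect the write immediately.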
Chapter 7. Cloud Bigtable: large-scale structured data https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_017.html
7.1.1. Design goals | 7.2.1. Data model concepts
Overview of Cloud Bigtable https://cloud.google.com/bigtable/docs/overview
Instances, clusters, and nodes https://cloud.google.com/bigtable/docs/instances-clusters-nodes
Understanding Cloud Bigtable performance https://cloud.google.com/bigtable/docs/performance
Designing your schema https://cloud.google.com/bigtable/docs/schema-design
Choosing Between SSD and HDD Storage https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
Media: Articles, Videos, and Podcasts https://cloud.google.com/bigtable/docs/media
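A minimal sketch of the row-key advice in the schema-design guide above, assuming the google-cloud-bigtable client; project, instance, table, and column-family names are hypothetical. Bigtable stores rows sorted by key, so keys should lead with a high-cardinality value (here a device ID) rather than a timestamp, spreading sequential writes across nodes instead of hotspotting one:

    import datetime
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("metrics")

    # Reversed-timestamp suffix: the newest cells for a device sort first
    # in a prefix scan over that device's rows.
    device_id = "device-4711"
    now = datetime.datetime.utcnow()
    reverse_ts = 10**13 - int(now.timestamp() * 1000)
    row_key = f"{device_id}#{reverse_ts}".encode()

    row = table.direct_row(row_key)
    row.set_cell("stats", b"temperature", b"21.5", timestamp=now)
    row.commit()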
Chapter 6. Cloud Spanner: large-scale SQL https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_016.html
6.5.4. Choosing primary keys
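Section 6.5.4 warns against monotonically increasing primary keys: Spanner splits data by key range, so sequential keys (timestamps, auto-increment IDs) send every insert to the last split. A minimal sketch using a UUIDv4 key to spread writes, assuming google-cloud-spanner and hypothetical instance, database, and table names:

    import uuid
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-db")

    # A random UUID distributes inserts evenly across key ranges.
    with database.batch() as batch:
        batch.insert(
            table="Singers",
            columns=("SingerId", "Name"),
            values=[(str(uuid.uuid4()), "Alice")],
        )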
Chapter 21. Cloud Pub/Sub: managed event publishing https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_034.html
Using customer-managed encryption keys https://cloud.google.com/pubsub/docs/cmek
Ordering messages https://cloud.google.com/pubsub/docs/ordering
Cloud Pub/Sub: A Google-Scale Messaging Service https://cloud.google.com/pubsub/architecture [Recommended]
Restricting Pub/Sub resource locations https://cloud.google.com/pubsub/docs/resource-location-restriction
Consistent Hashing with Bounded Loads https://ai.googleblog.com/2017/04/consistent-hashing-with-bounded-loads.html
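A minimal sketch of the "Ordering messages" doc above, assuming google-cloud-pubsub and a hypothetical project and topic. Messages published with the same ordering key are delivered in publish order, provided the subscription also has message ordering enabled:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(
            enable_message_ordering=True
        )
    )
    topic_path = publisher.topic_path("my-project", "orders")

    for i in range(3):
        # All events for one customer share a key, so they arrive in order.
        future = publisher.publish(
            topic_path, f"event-{i}".encode(), ordering_key="customer-42"
        )
        print(future.result())  # message ID once the publish succeeds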
Chapter 20. Cloud Dataflow: large-scale data processing https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_033.html
20.1. What is Apache Beam?
Deploying a pipeline https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline
Google Cloud Dataflow In the Smart Home https://nest.tech/google-cloud-dataflow-in-the-smart-home-data-pipeline-5ae71781b856
Apache Beam: An advanced unified programming model https://beam.apache.org/
Windowing https://beam.apache.org/documentation/programming-guide/#windowing
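A minimal windowing sketch from the Beam programming guide above, runnable locally with the apache-beam package; keys, values, and timestamps are hypothetical:

    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline() as p:
        (
            p
            | beam.Create([("user1", 1), ("user1", 2), ("user2", 5)])
            # Stamp each element with an event time for windowing to use.
            | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
            # Group elements into non-overlapping 60-second event-time windows.
            | beam.WindowInto(window.FixedWindows(60))
            | beam.CombinePerKey(sum)
            | beam.Map(print)
        )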
Dataproc https://cloud.google.com/dataproc/
What is Google Cloud Dataproc? https://cloud.google.com/dataproc/docs/concepts/overview
Introduction to partitioned tables https://cloud.google.com/bigquery/docs/partitioned-tables
Creating and using ingestion-time partitioned tables https://cloud.google.com/bigquery/docs/creating-partitioned-tables
Creating and using date/timestamp partitioned tables https://cloud.google.com/bigquery/docs/creating-column-partitions
Querying partitioned tables https://cloud.google.com/bigquery/docs/querying-partitioned-tables
Switching SQL dialects https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql
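A minimal sketch tying together the partitioned-table docs above, assuming google-cloud-bigquery and hypothetical project, dataset, and table names: create a table partitioned on a DATE column, then filter on that column so BigQuery prunes the scan to a single partition (and bills accordingly):

    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table(
        "my-project.analytics.events",
        schema=[
            bigquery.SchemaField("event_date", "DATE"),
            bigquery.SchemaField("payload", "STRING"),
        ],
    )
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="event_date"
    )
    client.create_table(table)

    # Standard SQL; the filter on the partitioning column limits the scan
    # to one day's partition.
    query = """
        SELECT payload
        FROM `my-project.analytics.events`
        WHERE event_date = '2020-01-01'
    """
    for row in client.query(query).result():
        print(row.payload)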
AI and machine learning products https://cloud.google.com/products/ai/
Responsible AI Practices https://ai.google/responsibilities/responsible-ai-practices/
AutoML Vision Beginner's guide https://cloud.google.com/vision/automl/docs/beginners-guide
AI Hub https://cloud.google.com/ai-hub/
Hosted AI repository with one-click deployment for machine learning teams.
Vision AI https://cloud.google.com/vision/
Glossary https://developers.google.com/machine-learning/glossary
This glossary defines general machine learning terms and terms specific to TensorFlow.
AI Platform https://cloud.google.com/ai-platform/
Create your AI applications once, then run them easily on both GCP and on-premises.
Cloud Datalab https://cloud.google.com/datalab/
Cloud Dataprep by Trifacta https://cloud.google.com/dataprep/
When to use what? https://cloud.google.com/products/databases
Machine Learning Crash Course https://developers.google.com/machine-learning/crash-course (recommended)
A self-study guide for aspiring machine learning practitioners
Machine Learning Crash Course features a series of lessons with video lectures, real-world case studies, and hands-on practice exercises.
Curated articles for developers https://medium.com/google-cloud
Processing logs at scale using Cloud Dataflow https://cloud.google.com/solutions/processing-logs-at-scale-using-dataflow
Smart business analytics and AI https://cloud.google.com/solutions/#smart-business-analytics-ai
Google Cloud Platform Blog https://cloud.google.com/blog/products/gcp
How Google Invented An Amazing Datacenter Network Only They Could Create
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines https://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006
Security Configuration https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/security
Streaming 101: The world beyond batch https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Dataflow: A Unified Model for Batch and Streaming Data Processing https://youtu.be/3UfZN59Nsk8
After Lambda: Exactly-once processing in Google Cloud Dataflow, Part 1 https://cloud.google.com/blog/products/gcp/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1
Cloud Data Fusion https://cloud.google.com/data-fusion/
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.
The biggest barrier to enterprise data analysis and machine learning is data integration: companies commonly struggle to get their data in one place, move it around, transform it, and make sense of it.
What is Cloud Data Fusion? https://cloud.google.com/data-fusion/docs/concepts/overview
A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion. To get started with Cloud Data Fusion, you create a Cloud Data Fusion instance through the Cloud Console.
Cloud Data Fusion creates ephemeral execution environments to run pipelines, whether a pipeline is run manually, on a schedule, or by a pipeline-state trigger. Cloud Data Fusion supports Dataproc as an execution environment, in which pipelines run as MapReduce, Spark, or Spark Streaming programs: Cloud Data Fusion provisions an ephemeral Dataproc cluster in your customer project at the beginning of a pipeline run, executes the pipeline in the cluster, and deletes the cluster after execution completes.
A pipeline is a way to visually design data and control flows to extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources. Building pipelines allows you to create complex data processing workflows that can help you solve data ingestion, integration, and migration problems. You can use Cloud Data Fusion to build both batch and real-time pipelines, depending on your needs.
A plugin is a customizable module that can be used to extend the capabilities of Cloud Data Fusion. Cloud Data Fusion provides plugins for sources, transforms, aggregates, sinks, error collectors, alert publishers, actions, and post-run actions.
A compute profile specifies how and where a pipeline is executed. A profile encapsulates any information required to set up and delete the pipeline's physical execution environment.
Architecture components https://cloud.google.com/data-fusion/docs/concepts/architecture
Build natural and rich conversational experiences https://dialogflow.com/
Give users new ways to interact with your product by building engaging voice and text-based conversational interfaces, such as voice apps and chatbots, powered by AI. Connect with users on your website, mobile app, the Google Assistant, Amazon Alexa, Facebook Messenger, and other popular platforms and devices.
Dialogflow documentation https://cloud.google.com/dialogflow/docs/
Dialogflow is a natural language understanding platform that makes it easy to design and integrate a conversational user interface into your mobile app, web application, device, bot, interactive voice response system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your product.
Dialogflow can analyze multiple types of input from your customers, including text and audio (such as from a phone call or voice recording). It can respond either through text or with synthetic speech.
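A minimal sketch of text-based intent detection, assuming the google-cloud-dialogflow client and a hypothetical agent project and session ID:

    from google.cloud import dialogflow

    session_client = dialogflow.SessionsClient()
    session = session_client.session_path("my-project", "session-123")

    text_input = dialogflow.TextInput(
        text="What are your opening hours?", language_code="en"
    )
    query_input = dialogflow.QueryInput(text=text_input)

    response = session_client.detect_intent(
        request={"session": session, "query_input": query_input}
    )
    result = response.query_result
    print("Matched intent:", result.intent.display_name)
    print("Reply:", result.fulfillment_text)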
Building Amazing Apps With Cloud Firestore (Cloud Next '19) https://youtu.be/ah5tQ7yOh2s
Pipeline fundamentals for the Apache Beam SDKs https://cloud.google.com/dataflow/docs/guides/beam-creating-a-pipeline
The 8 Requirements of Real-Time Stream Processing http://cs.brown.edu/~ugur/8rulesSigRec.pdf
Migrating On-Premises Hadoop Infrastructure to Google Cloud Platform https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-overview
Migrating from an on-premises Hadoop solution to GCP requires a shift in approach. A typical on-premises Hadoop system consists of a monolithic cluster that supports many workloads, often across multiple business areas. As a result, the system becomes more complex over time and can require administrators to make compromises to get everything working in the monolithic cluster. When you move your Hadoop system to GCP, you can reduce the administrative complexity. However, to get that simplification and to get the most efficient processing in GCP with the minimal cost, you need to rethink how to structure your data and jobs.
The most cost-effective and flexible way to migrate your Hadoop system to GCP is to shift away from thinking in terms of large, multi-purpose, persistent clusters and instead think about small, short-lived clusters that are designed to run specific jobs. You store your data in Cloud Storage to support multiple, temporary processing clusters. This model is often called the ephemeral model, because the clusters you use for processing jobs are allocated as needed and are released as jobs finish.
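A minimal sketch of the ephemeral model, using a Dataproc inline workflow template: Dataproc provisions a managed cluster, runs the job, and deletes the cluster when it finishes. Assumes google-cloud-dataproc; project, bucket, and machine settings are hypothetical:

    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    template = {
        "id": "ephemeral-wordcount",
        "placement": {
            "managed_cluster": {
                "cluster_name": "ephemeral-cluster",
                "config": {
                    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
                },
            }
        },
        "jobs": [
            {
                "step_id": "wordcount",
                # Input and output live in Cloud Storage, not on the cluster,
                # which is what makes the cluster disposable.
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
            }
        ],
    }

    operation = client.instantiate_inline_workflow_template(
        request={"parent": f"projects/my-project/regions/{region}", "template": template}
    )
    operation.result()  # blocks until the job completes and the cluster is deleted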
Dataproc and Google Cloud provide several features that can help secure your data. This guide explains how Hadoop security works, how it translates to Google Cloud, and how to architect security when deploying on Google Cloud.
This guide describes the process of moving data from on-premises Hadoop Distributed File System (HDFS) to Google Cloud Platform (GCP).