https://cloud.google.com/certification/data-engineer
A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.
Based on the Linux Academy Google Cloud Certified Professional Data Engineer course structure.
Chapter 4. Cloud SQL: managed relational storage https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_014.html
Chapter 5. Cloud Datastore: document storage https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_015.html
5.1.1. Design goals for Cloud Datastore | 5.1.4. Consistency with data locality
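The consistency section above turns on entity groups: within one ancestor path, Datastore offers strongly consistent ancestor queries, while global queries are only eventually consistent. A minimal Python sketch of the difference, assuming the google-cloud-datastore client and hypothetical kind names:

    from google.cloud import datastore

    client = datastore.Client()

    # Entities sharing an ancestor form one entity group, the unit of
    # strong consistency (and of the ~1 write/sec/group write limit).
    parent_key = client.key("TaskList", "default")
    task = datastore.Entity(key=client.key("Task", "sample-task", parent=parent_key))
    task["description"] = "Review Bigtable schema design"
    client.put(task)

    # Ancestor queries are strongly consistent: this sees the write above.
    query = client.query(kind="Task", ancestor=parent_key)
    print(list(query.fetch()))

    # A non-ancestor (global) query on kind "Task" is only eventually
    # consistent and may not reflect the write immediately.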
Chapter 7. Cloud Bigtable: large-scale structured data https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_017.html
7.1.1. Design goals | 7.2.1. Data model concepts
Overview of Cloud Bigtable https://cloud.google.com/bigtable/docs/overview
Instances, clusters, and nodes https://cloud.google.com/bigtable/docs/instances-clusters-nodes
Understanding Cloud Bigtable performance https://cloud.google.com/bigtable/docs/performance
Designing your schema https://cloud.google.com/bigtable/docs/schema-design
Choosing Between SSD and HDD Storage https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
Media: Articles, Videos, and Podcasts https://cloud.google.com/bigtable/docs/media
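A minimal sketch of the row-key advice in the schema-design guide above, assuming the google-cloud-bigtable client; project, instance, table, and column-family names are hypothetical. Bigtable stores rows sorted by key, so keys should lead with a high-cardinality value (here a device ID) rather than a timestamp, spreading sequential writes across nodes instead of hotspotting one:

    import datetime
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("metrics")

    # Reversed-timestamp suffix: the newest cells for a device sort first
    # in a prefix scan over that device's rows.
    device_id = "device-4711"
    now = datetime.datetime.utcnow()
    reverse_ts = 10**13 - int(now.timestamp() * 1000)
    row_key = f"{device_id}#{reverse_ts}".encode()

    row = table.direct_row(row_key)
    row.set_cell("stats", b"temperature", b"21.5", timestamp=now)
    row.commit()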
Chapter 6. Cloud Spanner: large-scale SQL https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_016.html
6.5.4. Choosing primary keys
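Section 6.5.4 warns against monotonically increasing primary keys: Spanner splits data by key range, so sequential keys (timestamps, auto-increment IDs) send every insert to the last split. A minimal sketch using a UUIDv4 key to spread writes, assuming google-cloud-spanner and hypothetical instance, database, and table names:

    import uuid
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-db")

    # A random UUID distributes inserts evenly across key ranges.
    with database.batch() as batch:
        batch.insert(
            table="Singers",
            columns=("SingerId", "Name"),
            values=[(str(uuid.uuid4()), "Alice")],
        )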
Chapter 21. Cloud Pub/Sub: managed event publishing https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_034.html
Using customer-managed encryption keys https://cloud.google.com/pubsub/docs/cmek
Ordering messages https://cloud.google.com/pubsub/docs/ordering
Cloud Pub/Sub: A Google-Scale Messaging Service https://cloud.google.com/pubsub/architecture [Recommended]
Restricting Pub/Sub resource locations https://cloud.google.com/pubsub/docs/resource-location-restriction
Consistent Hashing with Bounded Loads https://ai.googleblog.com/2017/04/consistent-hashing-with-bounded-loads.html
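A minimal sketch of the "Ordering messages" doc above, assuming google-cloud-pubsub and a hypothetical project and topic. Messages published with the same ordering key are delivered in publish order, provided the subscription also has message ordering enabled:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(
            enable_message_ordering=True
        )
    )
    topic_path = publisher.topic_path("my-project", "orders")

    for i in range(3):
        # All events for one customer share a key, so they arrive in order.
        future = publisher.publish(
            topic_path, f"event-{i}".encode(), ordering_key="customer-42"
        )
        print(future.result())  # message ID once the publish succeeds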
Chapter 20. Cloud Dataflow: large-scale data processing https://learning.oreilly.com/library/view/google-cloud-platform/9781617293528/kindle_split_033.html
20.1. What is Apache Beam?
Deploying a pipeline https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline
Google Cloud Dataflow In the Smart Home https://nest.tech/google-cloud-dataflow-in-the-smart-home-data-pipeline-5ae71781b856
Apache Beam: An advanced unified programming model https://beam.apache.org/
Windowing https://beam.apache.org/documentation/programming-guide/#windowing
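A minimal windowing sketch from the Beam programming guide above, runnable locally with the apache-beam package; keys, values, and timestamps are hypothetical:

    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline() as p:
        (
            p
            | beam.Create([("user1", 1), ("user1", 2), ("user2", 5)])
            # Stamp each element with an event time for windowing to use.
            | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
            # Group elements into non-overlapping 60-second event-time windows.
            | beam.WindowInto(window.FixedWindows(60))
            | beam.CombinePerKey(sum)
            | beam.Map(print)
        )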
Dataproc https://cloud.google.com/dataproc/
What is Google Cloud Dataproc? https://cloud.google.com/dataproc/docs/concepts/overview
Introduction to partitioned tables https://cloud.google.com/bigquery/docs/partitioned-tables
Creating and using ingestion-time partitioned tables https://cloud.google.com/bigquery/docs/creating-partitioned-tables
Creating and using date/timestamp partitioned tables https://cloud.google.com/bigquery/docs/creating-column-partitions
Querying partitioned tables https://cloud.google.com/bigquery/docs/querying-partitioned-tables
Switching SQL dialects https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql
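A minimal sketch tying together the partitioned-table docs above, assuming google-cloud-bigquery and hypothetical project, dataset, and table names: create a table partitioned on a DATE column, then filter on that column so BigQuery prunes the scan to a single partition (and bills accordingly):

    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table(
        "my-project.analytics.events",
        schema=[
            bigquery.SchemaField("event_date", "DATE"),
            bigquery.SchemaField("payload", "STRING"),
        ],
    )
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="event_date"
    )
    client.create_table(table)

    # Standard SQL; the filter on the partitioning column limits the scan
    # to one day's partition.
    query = """
        SELECT payload
        FROM `my-project.analytics.events`
        WHERE event_date = '2020-01-01'
    """
    for row in client.query(query).result():
        print(row.payload)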
AI and machine learning products https://cloud.google.com/products/ai/
Responsible AI Practices https://ai.google/responsibilities/responsible-ai-practices/
AutoML Vision Beginner's guide https://cloud.google.com/vision/automl/docs/beginners-guide
AI Hub https://cloud.google.com/ai-hub/
Hosted AI repository with one-click deployment for machine learning teams.
Vision AI https://cloud.google.com/vision/
Glossary https://developers.google.com/machine-learning/glossary
This glossary defines general machine learning terms and terms specific to TensorFlow.
AI Platform https://cloud.google.com/ai-platform/
Create your AI applications once, then run them easily on both GCP and on-premises.
Cloud Datalab https://cloud.google.com/datalab/
Cloud Dataprep by Trifacta https://cloud.google.com/dataprep/
When to use what? https://cloud.google.com/products/databases
Machine Learning Crash Course https://developers.google.com/machine-learning/crash-course (recommended)
A self-study guide for aspiring machine learning practitioners
Machine Learning Crash Course features a series of lessons with video lectures, real-world case studies, and hands-on practice exercises.
Curated articles for developers https://medium.com/google-cloud
Processing logs at scale using Cloud Dataflow https://cloud.google.com/solutions/processing-logs-at-scale-using-dataflow
Smart business analytics and AI https://cloud.google.com/solutions/#smart-business-analytics-ai
Google Cloud Platform Blog https://cloud.google.com/blog/products/gcp
How Google Invented An Amazing Datacenter Network Only They Could Create
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines https://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006
Security Configuration https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/security
Streaming 101: The world beyond batch https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Dataflow: A Unified Model for Batch and Streaming Data Processing https://youtu.be/3UfZN59Nsk8
After Lambda: Exactly-once processing in Google Cloud Dataflow, Part 1 https://cloud.google.com/blog/products/gcp/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1
Cloud Data Fusion https://cloud.google.com/data-fusion/
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.
The biggest barrier to enterprise data analysis and machine learning is data integration: companies commonly struggle to get their data in one place, move it around, transform it, and make sense of it.
What is Cloud Data Fusion? https://cloud.google.com/data-fusion/docs/concepts/overview
A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion. To get started with Cloud Data Fusion, you create a Cloud Data Fusion instance through the Cloud Console.
Cloud Data Fusion creates ephemeral execution environments to run pipelines, whether a pipeline is run manually, on a schedule, or by a pipeline-state trigger. Cloud Data Fusion supports Dataproc as an execution environment, in which pipelines run as MapReduce, Spark, or Spark Streaming programs: Cloud Data Fusion provisions an ephemeral Dataproc cluster in your customer project at the beginning of a pipeline run, executes the pipeline in the cluster, and deletes the cluster after execution completes.
A pipeline is a way to visually design data and control flows to extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources. Building pipelines allows you to create complex data processing workflows that can help you solve data ingestion, integration, and migration problems. You can use Cloud Data Fusion to build both batch and real-time pipelines, depending on your needs.
A plugin is a customizable module that can be used to extend the capabilities of Cloud Data Fusion. Cloud Data Fusion provides plugins for sources, transforms, aggregates, sinks, error collectors, alert publishers, actions, and post-run actions.
A compute profile specifies how and where a pipeline is executed. A profile encapsulates any information required to set up and delete the pipeline's physical execution environment.
Architecture components https://cloud.google.com/data-fusion/docs/concepts/architecture
Build natural and rich conversational experiences https://dialogflow.com/
Give users new ways to interact with your product by building engaging voice and text-based conversational interfaces, such as voice apps and chatbots, powered by AI. Connect with users on your website, mobile app, the Google Assistant, Amazon Alexa, Facebook Messenger, and other popular platforms and devices.
Dialogflow documentation https://cloud.google.com/dialogflow/docs/
Dialogflow is a natural language understanding platform that makes it easy to design and integrate a conversational user interface into your mobile app, web application, device, bot, interactive voice response system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your product.
Dialogflow can analyze multiple types of input from your customers, including text and audio (such as from a phone call or voice recording). It can respond either through text or with synthetic speech.
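A minimal sketch of text-based intent detection, assuming the google-cloud-dialogflow client and a hypothetical agent project and session ID:

    from google.cloud import dialogflow

    session_client = dialogflow.SessionsClient()
    session = session_client.session_path("my-project", "session-123")

    text_input = dialogflow.TextInput(
        text="What are your opening hours?", language_code="en"
    )
    query_input = dialogflow.QueryInput(text=text_input)

    response = session_client.detect_intent(
        request={"session": session, "query_input": query_input}
    )
    result = response.query_result
    print("Matched intent:", result.intent.display_name)
    print("Reply:", result.fulfillment_text)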
Building Amazing Apps With Cloud Firestore (Cloud Next '19) https://youtu.be/ah5tQ7yOh2s
Pipeline fundamentals for the Apache Beam SDKs https://cloud.google.com/dataflow/docs/guides/beam-creating-a-pipeline
The 8 Requirements of Real-Time Stream Processing http://cs.brown.edu/~ugur/8rulesSigRec.pdf
Migrating On-Premises Hadoop Infrastructure to Google Cloud Platform https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-overview
Migrating from an on-premises Hadoop solution to GCP requires a shift in approach. A typical on-premises Hadoop system consists of a monolithic cluster that supports many workloads, often across multiple business areas. As a result, the system becomes more complex over time and can require administrators to make compromises to get everything working in the monolithic cluster. When you move your Hadoop system to GCP, you can reduce the administrative complexity. However, to get that simplification and to get the most efficient processing in GCP with the minimal cost, you need to rethink how to structure your data and jobs.
The most cost-effective and flexible way to migrate your Hadoop system to GCP is to shift away from thinking in terms of large, multi-purpose, persistent clusters and instead think about small, short-lived clusters that are designed to run specific jobs. You store your data in Cloud Storage to support multiple, temporary processing clusters. This model is often called the ephemeral model, because the clusters you use for processing jobs are allocated as needed and are released as jobs finish.
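A minimal sketch of the ephemeral model, using a Dataproc inline workflow template: Dataproc provisions a managed cluster, runs the job, and deletes the cluster when it finishes. Assumes google-cloud-dataproc; project, bucket, and machine settings are hypothetical:

    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    template = {
        "id": "ephemeral-wordcount",
        "placement": {
            "managed_cluster": {
                "cluster_name": "ephemeral-cluster",
                "config": {
                    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
                },
            }
        },
        "jobs": [
            {
                "step_id": "wordcount",
                # Input and output live in Cloud Storage, not on the cluster,
                # which is what makes the cluster disposable.
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
            }
        ],
    }

    operation = client.instantiate_inline_workflow_template(
        request={"parent": f"projects/my-project/regions/{region}", "template": template}
    )
    operation.result()  # blocks until the job completes and the cluster is deleted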
Dataproc and Google Cloud provide several features that can help secure your data. This guide explains how Hadoop security works, how it translates to Google Cloud, and how to architect security when deploying on Google Cloud.
This guide describes the process of moving data from on-premises Hadoop Distributed File System (HDFS) to Google Cloud Platform (GCP).