BigQuery BI Engine is a fast, in-memory analysis service. By using BI Engine you can analyze data stored in BigQuery with sub-second query response time and with high concurrency.
BI Engine integrates with familiar Google tools like Google Data Studio to accelerate data exploration and analysis. With BI Engine, you can build rich, interactive dashboards and reports in Data Studio without compromising performance, scale, security, or data freshness.
https://cloud.google.com/solutions/data-warehouse-modernization
Streamline your migration path and unlock intelligent insights with BigQuery
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-migration-overview
What and how to migrate: The migration framework
Transition state
Fully migrated
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-schema-and-data-transfer-overview
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-data-governance
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-data-pipelines
This document is part of a series that explores how to migrate the upstream data pipelines that load data into your data warehouse. It discusses what data pipelines are and what to consider when migrating them.
In the context of data warehousing, data pipelines are commonly used to read data from transactional systems, apply transformations, and then write the results to the data warehouse. Sources are usually transactional systems, for example an RDBMS, and the sink connects to the data warehouse. A pipeline can be modeled as a directed acyclic graph (DAG), with processing steps as nodes and data dependencies as edges; this type of graph is referred to as a data flow DAG. You can also use DAGs to orchestrate data movement between data pipelines and other systems. This usage is referred to as an orchestration or control flow DAG.
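To make the orchestration idea concrete, here is a minimal sketch of a control flow DAG using Python's standard-library graphlib. The task names are hypothetical; a real orchestrator (for example, Cloud Composer/Airflow) would attach work to each node, but the dependency-ordering logic is the same.

```python
from graphlib import TopologicalSorter

# Hypothetical orchestration DAG: each task maps to the set of tasks it
# depends on. Extractions can run in parallel; the join waits for both.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
}

# A topological order is any valid execution order of the DAG.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

An orchestrator walks this order (running independent tasks concurrently) and only starts a task once all of its upstream dependencies have finished.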
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-reporting-and-analysis
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-performance-optimization
Use the Power of Google Cloud and Informatica to Build a Modern Data Architecture (Cloud Next '18)
Exploring and Preparing your Data with BigQuery https://www.coursera.org/learn/gcp-exploring-preparing-data-bigquery
Creating New BigQuery Datasets and Visualizing Insights https://www.coursera.org/learn/gcp-creating-bigquery-datasets-visualizing-insights
Modernizing Data Lakes and Data Warehouses with GCP https://www.coursera.org/learn/data-lakes-data-warehouses-gcp/home/welcome
The two key components of any data pipeline are data lakes and data warehouses. This course highlights use cases for each type of storage and dives into the available data lake and warehouse solutions on Google Cloud Platform in technical detail. It also describes the role of a data engineer and the benefits of a successful data pipeline to business operations, and examines why data engineering should be done in a cloud environment.
Sandbox https://cloud.google.com/bigquery/docs/sandbox
The BigQuery sandbox gives you free access to the power of BigQuery, subject to the sandbox's limits. It lets you use the web UI in the GCP Console without providing a credit card, creating a billing account, or enabling billing for your project.
15 Awesome things you probably didn’t know about Google BigQuery https://medium.com/google-cloud/15-awesome-things-you-probably-didnt-know-about-google-bigquery-6654841fa2dc
Access control examples https://cloud.google.com/bigquery/docs/access-control-examples
Creating authorized views https://cloud.google.com/bigquery/docs/authorized-views
Introduction to interacting with BigQuery https://cloud.google.com/bigquery/docs/interacting-with-bigquery
Provides an overview of ways to interact with BigQuery.
Introduction to partitioned tables
Using GROUP BY to avoid self-joins
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
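The cost benefit comes from partition pruning: when a query filters on the partitioning column, BigQuery skips partitions that cannot match, so their bytes are never read. This toy model (hypothetical data, date-keyed partitions) illustrates the idea.

```python
from datetime import date

# Toy model of a date-partitioned table: partition key -> rows.
partitions = {
    date(2024, 1, 1): [{"user": "a", "amount": 10}] * 3,
    date(2024, 1, 2): [{"user": "b", "amount": 20}] * 3,
    date(2024, 1, 3): [{"user": "c", "amount": 30}] * 3,
}

def scan(partitions, predicate=None):
    """Return (rows_read, rows). A predicate on the partition key lets the
    scanner skip whole partitions without reading them (pruning)."""
    rows_read, rows = 0, []
    for day, day_rows in partitions.items():
        if predicate and not predicate(day):
            continue  # partition pruned: its rows are never touched
        rows_read += len(day_rows)
        rows.extend(day_rows)
    return rows_read, rows

full_rows_read, _ = scan(partitions)  # full scan reads every partition
pruned_rows_read, _ = scan(partitions, lambda d: d == date(2024, 1, 2))
```

In real BigQuery the same effect applies at the byte level: a `WHERE` clause on the partitioning column reduces both query time and the bytes billed.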
Standard SQL Functions & Operators. https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators
Let’s talk data. https://medium.com/@hoffa
Why Nesting Is So Cool https://looker.com/blog/why-nesting-is-so-cool
When you're setting up a data warehouse, one of the key questions is how to structure your data. Do you completely normalize the data into a snowflake schema? Completely denormalize it into a very wide table with lots of repeated values? Or do something in the middle like a star schema?
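BigQuery adds a fourth option to this design space: keep related records nested inside a single row using repeated fields (`ARRAY<STRUCT<...>>`), so you avoid both the joins of a snowflake schema and the value repetition of a flat wide table. This sketch (hypothetical order data) shows one nested row and the denormalized rows you would get by flattening it, analogous to `UNNEST` in BigQuery SQL.

```python
# One nested row, BigQuery-style: "items" is a repeated record.
nested_order = {
    "order_id": 1001,
    "customer": "acme",
    "items": [  # analogous to ARRAY<STRUCT<sku STRING, qty INT64>>
        {"sku": "A", "qty": 2},
        {"sku": "B", "qty": 1},
    ],
}

def unnest(order):
    """Flatten one nested row into N wide rows, like SELECT ... , item
    FROM orders, UNNEST(items) AS item. The parent columns repeat."""
    return [
        {"order_id": order["order_id"], "customer": order["customer"], **item}
        for item in order["items"]
    ]

flat_rows = unnest(nested_order)
```

The nested form stores `order_id` and `customer` once per order; flattening repeats them per line item, which is the trade-off the linked post explores.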
Exporting Table Data https://cloud.google.com/bigquery/docs/exporting-data
BigQuery can export up to 1 GB of data to a single file. If you are exporting more than 1 GB of data, use a wildcard in the destination URI so that BigQuery shards the export across multiple files.
You cannot export table data to a local file, to Google Sheets, or to Google Drive. The only supported export location is Cloud Storage.
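To illustrate the sharding behavior, here is a local sketch of how a wildcard destination URI maps to multiple output files. The bucket name and shard size are hypothetical; the zero-padded shard numbering mirrors how BigQuery expands the `*` in the URI.

```python
def shard_export(rows, max_rows_per_file,
                 uri_pattern="gs://my-bucket/export-*.json"):
    """Toy sketch of a wildcard export: split rows across numbered files.
    BigQuery replaces '*' with a zero-padded shard number; here we cap
    shards by row count rather than the real 1 GB byte limit."""
    shards = {}
    for start in range(0, len(rows), max_rows_per_file):
        shard_no = start // max_rows_per_file
        uri = uri_pattern.replace("*", f"{shard_no:012d}")
        shards[uri] = rows[start:start + max_rows_per_file]
    return shards

shards = shard_export(list(range(10)), max_rows_per_file=4)
```

Ten rows with four rows per file yield three sharded objects in Cloud Storage, the only supported export destination.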
Query plan and timeline https://cloud.google.com/bigquery/query-plan-explanation
Embedded within query jobs, BigQuery includes diagnostic query plan and timing information.
When BigQuery executes a query job, it converts the declarative SQL statement into a graph of execution, broken up into a series of query stages, which themselves are composed of more granular sets of execution steps. BigQuery leverages a heavily distributed parallel architecture to run these queries. Stages model the units of work that many potential workers may execute in parallel.
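The stage model above can be sketched for a simple `SELECT SUM(x)` query: a first stage computes partial aggregates over input shards in parallel, and a second stage merges the partials. This is a simulation of the execution pattern, not BigQuery's actual engine; the shard layout and worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Four input shards of a hypothetical table column x = 0..399.
shards = [list(range(i, i + 100)) for i in range(0, 400, 100)]

def stage1_partial_sum(shard):
    # Stage 1: each worker reads one shard and emits a partial aggregate.
    return sum(shard)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(stage1_partial_sum, shards))

# Stage 2: a single worker merges the partial results into the final sum.
total = sum(partials)
print(total)
```

The key property, as in BigQuery's plan, is that stage 1's units of work are independent and can run on as many workers as are available, while stage 2 depends on all of stage 1's output.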
In-memory query execution in Google BigQuery https://cloud.google.com/blog/products/gcp/in-memory-query-execution-in-google-bigquery
How To Control Access To BigQuery At Row Level With Groups https://medium.com/google-cloud/how-to-control-access-to-bigquery-at-row-level-with-groups-1cbccb111d9e
https://cloud.google.com/blog/products/gcp/life-of-a-bigquery-streaming-insert
Google BigQuery, the fully managed cloud-native data warehouse for Google Cloud Platform (GCP) customers, supports several ways to ingest data into its managed storage, including explicit load jobs and queries against external sources. These methods share a common theme: they transfer and append new storage as part of a (potentially large) single commit to a table. BigQuery also supports a method of ingestion known as streaming, intended for users who need a more open-ended, continuous style of ingestion. In this post, you'll learn how the streaming system works, which should help you understand observed behaviors and make it easier to reason about building integrations with BigQuery.
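A minimal sketch of the streaming model described in that post: rows land in a streaming buffer where they are immediately queryable, duplicates sharing an insertId are dropped on a best-effort basis while still buffered, and a background process later commits buffered rows to managed storage. The class and method names are hypothetical; only the behavior mirrors the documented system.

```python
class StreamingTable:
    """Toy model of BigQuery streaming ingestion (not the real client API)."""

    def __init__(self):
        self.buffer = {}      # insert_id -> row, pending commit
        self.committed = []   # rows already in managed storage

    def insert(self, insert_id, row):
        # Best-effort dedup: a retried insert with the same insertId is
        # dropped while the original is still in the buffer window.
        if insert_id not in self.buffer:
            self.buffer[insert_id] = row

    def query(self):
        # Queries see committed storage plus the streaming buffer, so
        # streamed rows are available before they are committed.
        return self.committed + list(self.buffer.values())

    def flush(self):
        # Background commit: move buffered rows into managed storage.
        self.committed.extend(self.buffer.values())
        self.buffer.clear()

t = StreamingTable()
t.insert("id-1", {"v": 1})
t.insert("id-1", {"v": 1})  # client retry: deduplicated by insertId
t.insert("id-2", {"v": 2})
```

Even before `flush()`, both distinct rows are visible to `query()`, which is the key difference from load jobs, where data appears only after the single commit.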