Building a Data Lake on Google Cloud Platform with CDAP

It is no secret that traditional platforms for data analysis, such as data warehouses, are difficult and expensive to scale to meet current demands for storage and compute. Purpose-built platforms designed to process big data often require significant up-front and ongoing investment when deployed on-premises. Cloud computing, by contrast, is a natural vehicle for scaling to such large volumes of data economically. But while the economics are right, enterprises migrating their on-premises data warehouses, or building a new warehouse or data lake in the cloud, face many challenges along the way: architecting the network, securing critical data, building the right skill sets for the chosen cloud technologies, and figuring out the right set of tools and technologies to create operational workflows that load, transform, and blend data.

Businesses are more dependent on data than ever before, and putting the right toolsets in the hands of the right people makes data readily available for better, faster decision-making. Choosing toolsets that make integration simple, so teams can focus on solving business problems rather than on infrastructure and technology, is one of the most important steps in migrating a data warehouse to the cloud or building one there.

CDAP Pipelines (workflows) are a data orchestration capability that moves, transforms, blends, and enriches data. CDAP Pipelines manage the scheduling, orchestration, and monitoring of all pipeline activities, and handle failure scenarios. They offer a collection of hundreds of pre-built connectors, simplified stream processing on top of open-source streaming engines, and new out-of-the-box connectivity to BigTable, BigQuery, Google Cloud Storage, Google PubSub, and other GCP technologies, enabling users to integrate nearly any data, anywhere, in a Google Cloud environment.
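For the curious, an exported pipeline is essentially a declarative document of stages and the connections between them. The sketch below approximates that shape as a Python dict; the plugin names and properties are illustrative, not the exact contract of any particular CDAP release.

```python
# Approximate shape of an exported CDAP pipeline definition. Illustrative only:
# actual plugin names and properties depend on the CDAP and plugin versions.
pipeline = {
    "name": "gcs-to-bigquery-demo",
    "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM"},
    "config": {
        # Each stage wraps a plugin: a source, a transform, or a sink.
        "stages": [
            {
                "name": "ReadFromGCS",
                "plugin": {
                    "name": "GCSFile",        # hypothetical source plugin name
                    "type": "batchsource",
                    "properties": {"path": "gs://my-bucket/raw/", "format": "text"},
                },
            },
            {
                "name": "WriteToBigQuery",
                "plugin": {
                    "name": "BigQueryTable",  # hypothetical sink plugin name
                    "type": "batchsink",
                    "properties": {"dataset": "demo", "table": "events"},
                },
            },
        ],
        # Connections define the DAG between stages.
        "connections": [{"from": "ReadFromGCS", "to": "WriteToBigQuery"}],
    },
}
```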

Governance is an important requirement of any data lake or data warehouse, whether it is deployed on-premises or in the cloud. The ability to automatically capture and index technical, business and operational metadata for any pipelines built within CDAP makes it easy to discover datasets, perform impact analysis, trace the lineage of a dataset, and create audit trails.

So, let’s look at some of the capabilities recently added in CDAP to integrate with Google Cloud Platform technologies.

At Cask, we believe that seamless workflows optimizing the larger user journey provide a complete and enjoyable experience when working with complex technologies, and let users focus on their business use cases rather than on infrastructure. We have observed firsthand with our customers that doing so yields higher efficiency, lower operating costs, less user frustration, and ultimately the democratization of access to data, which delivers greater value from data, faster. In the spirit of achieving higher efficiency, we decided to first integrate CDAP's Data Prep capability with Google Cloud Storage.

CDAP Pipelines provide plugins for integrating with GCS natively, irrespective of whether you are working with structured or unstructured data. They also provide seamless integration with CDAP Data Prep capabilities and make it easy to create a GCS connection to your project, browse GCS, and immediately wrangle your data without having to use code or move to another console.
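To give a sense of what such a GCS connection is doing behind the scenes, here is a minimal sketch using the google-cloud-storage Python client; the project, bucket, and object names are hypothetical, and credentials are assumed to come from Application Default Credentials.

```python
from google.cloud import storage

# Hypothetical project and bucket; auth via Application Default Credentials.
client = storage.Client(project="my-gcp-project")
bucket = client.bucket("my-data-lake-bucket")

# Browse: list the objects under a prefix, as the Data Prep browser would.
for blob in client.list_blobs(bucket, prefix="raw/customers/"):
    print(blob.name, blob.size, blob.content_type)

# Peek at the first few lines of a text object before wrangling it.
sample = bucket.blob("raw/customers/customers.csv").download_as_text()
print("\n".join(sample.splitlines()[:5]))
```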

Watch the screencast below to see how CDAP Data Prep, CDAP Pipelines, and GCS work together.

This flow, from start (configuring GCS) to finish (pipeline deployed), takes about two minutes to build, and not a single line of code is written.

In addition to the Data Prep integration, CDAP provides plugins for working with GCS, including the GCS Text File Source shown below.

CDAP Data Prep automatically determines the file type and chooses the right source based on the file extension and the content type of the file. Below is a simple pipeline and the configuration associated with the GCS Text File Source, for reference.
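That automatic detection amounts to a dispatch on content type and file extension. The sketch below is a rough approximation of the idea, not CDAP's actual implementation; the reader names are made up.

```python
import mimetypes

# Illustrative approximation of Data Prep's source selection: map a file's
# content type (or, failing that, its extension) to the reader that parses it.
READERS = {
    "text/csv": "delimited-text-source",
    "application/json": "json-source",
    "text/plain": "text-source",
}

def pick_source(path, content_type=None):
    """Choose a reader from the content type, falling back to the extension."""
    guessed = content_type or mimetypes.guess_type(path)[0]
    return READERS.get(guessed or "", "binary-blob-source")

print(pick_source("gs://bucket/raw/trips.csv"))    # delimited-text-source
print(pick_source("gs://bucket/raw/payload.bin"))  # binary-blob-source
```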

The above pipeline reads the New York trips dataset (available as a public dataset on Google BigQuery), performs some transformations and calculations on the cluster, and writes the results back into Google BigQuery. Admittedly, you could accomplish the same thing with BigQuery SQL alone; the pipeline is for demonstration purposes, to show that sources and sinks for Google BigQuery are available to read from and write to.
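For readers who prefer code to the visual pipeline, here is a rough equivalent using the google-cloud-bigquery Python client. The public table name and the destination project, dataset, and table are assumptions for illustration; check the BigQuery public datasets for the current table names.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

# Aggregate the public NYC taxi trips data; the exact public table name is
# an assumption and may have moved within the public datasets.
query = """
    SELECT
      EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
      COUNT(*) AS trips,
      AVG(trip_distance) AS avg_distance
    FROM `bigquery-public-data.new_york.tlc_yellow_trips_2016`
    GROUP BY pickup_hour
"""

# Write the results back to BigQuery, overwriting the (hypothetical) table.
job_config = bigquery.QueryJobConfig(
    destination="my-gcp-project.demo.trip_stats_by_hour",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(query, job_config=job_config).result()  # wait for completion
```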

The BigQuery plugins keep things simple by importing metadata from BigQuery and automatically creating destination tables with the right schema, derived from the pipeline schema.
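The same metadata operations are available programmatically. Here is a small sketch, assuming hypothetical dataset and table names, of fetching a table's schema and creating a new table from it, which is essentially what the sink does when the output table does not yet exist.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

# Import metadata: fetch the schema of an existing table.
source = client.get_table("my-gcp-project.demo.trip_stats_by_hour")
for field in source.schema:
    print(field.name, field.field_type)

# Auto-create a destination table with the same schema.
dest = bigquery.Table("my-gcp-project.demo.trip_stats_copy", schema=source.schema)
client.create_table(dest, exists_ok=True)
```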

Following is a simple real-time CDAP data pipeline that pushes data from on-premises Kafka up to Google Cloud Platform PubSub. Published data is immediately available to be consumed for further transformation and processing.
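Outside of CDAP, the core of such a bridge looks roughly like the sketch below, using kafka-python and the Pub/Sub publisher client; the broker address, topic names, and project are hypothetical.

```python
from kafka import KafkaConsumer          # pip install kafka-python
from google.cloud import pubsub_v1

# Hypothetical broker, topics, and project; GCP auth via default credentials.
consumer = KafkaConsumer("orders", bootstrap_servers="kafka.internal:9092")
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "orders")

# Relay each Kafka record to Pub/Sub as soon as it arrives.
for record in consumer:
    if record.value:
        future = publisher.publish(topic_path, data=record.value)
        future.result()  # block until acknowledged; batch in production
```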

Over the past decades, enterprises have installed appliances and other pre-configured hardware for data warehousing. The goal of these solutions, which often required heavy investment in proprietary technology, was to make data easier to manage and analyze. Recent advances in open-source technology, however, provide far cheaper ways to store and process massive amounts of data, and have led enterprises to question the cost of expensive hardware. This time, instead of replacing legacy systems with new hardware, enterprises are looking to the cloud to build their data lakes when it makes sense for them. But the right tooling is needed to support the many possible use cases of a data warehouse in the cloud, and to offload data from an on-premises data warehouse efficiently and reliably.

Now, the main problem is how an enterprise can efficiently offload data from its on-premises warehouses into BigTable and keep the data in BigTable in sync. To support the EDW offload to BigTable use case, CDAP provides capabilities to perform Change Data Capture (CDC) on relational databases, along with data pipelines and plugins that consume the change data events and update the corresponding Google BigTable instance to keep the data in sync. CDC solutions can use one of three approaches for capturing changes in the source databases; the first, log-based capture, is described below.

The first solution reads the database transaction logs and publishes all DDL and DML operations to Kafka or Google PubSub. A real-time CDAP data pipeline consumes these changesets from Kafka or Google PubSub, normalizes them, and performs the corresponding insert, update, and delete operations against BigTable using the CDC BigTable Sink plugin.

Following is a pipeline that reads the changesets from a streaming source and writes them to BigTable, recreating all the table updates and keeping the tables in sync.
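A minimal sketch of the sink side of that pipeline, applying normalized change events to BigTable with the google-cloud-bigtable client, might look like the following; the event format, column family, and instance and table names are assumptions, not the CDC plugin's actual wire format.

```python
from google.cloud import bigtable

# Hypothetical instance/table names; rows use a single column family "cf".
client = bigtable.Client(project="my-gcp-project")
table = client.instance("cdc-demo-instance").table("customers")

def apply_change(event):
    """Apply one normalized CDC event: {"op": ..., "key": ..., "columns": {...}}."""
    row = table.direct_row(event["key"].encode())
    if event["op"] == "delete":
        row.delete()                      # drop the whole row
    else:                                 # inserts and updates look the same
        for column, value in event["columns"].items():
            row.set_cell("cf", column.encode(), str(value).encode())
    row.commit()

apply_change({"op": "upsert", "key": "cust-42", "columns": {"city": "Austin"}})
apply_change({"op": "delete", "key": "cust-17", "columns": {}})
```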

There are multiple reasons why an enterprise might decide to migrate from one public cloud platform to another, or to use more than one cloud provider. A different provider might offer better pricing, or a better match in terms of services offered. Another common case is a merger in which the acquirer already has a preferred public cloud provider. Whatever the reason, one way to ease migration or support more than one cloud is to start with a multi-cloud data management platform that integrates with multiple cloud environments. A multi-cloud data management solution such as CDAP creates an abstraction that hides the underlying cloud differences and allows simple migration of workflows and data. Adopting such a platform from the get-go is extremely valuable in a hybrid environment, where you may be managing on-premises, hosted private, and public clouds.

Building workflows that efficiently and reliably migrate data from one public cloud store to another is simple with CDAP Pipelines. Following is an example that migrates data from Amazon S3 into GCS and, along the way, transforms it and stores it in Google BigQuery.

After the pipeline executes, the results are available both on GCS and within BigQuery.
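The raw data movement at the heart of that migration can be approximated in a few lines of boto3 and google-cloud-storage; the bucket names and prefix are hypothetical, and this streams each object through the client rather than using a bulk transfer service.

```python
import io

import boto3
from google.cloud import storage

s3 = boto3.client("s3")                        # AWS credentials from environment
gcs = storage.Client(project="my-gcp-project") # hypothetical GCP project
dest_bucket = gcs.bucket("my-data-lake-bucket")

# Copy every object under a (hypothetical) S3 prefix into GCS.
listing = s3.list_objects_v2(Bucket="legacy-s3-bucket", Prefix="exports/")
for obj in listing.get("Contents", []):
    buf = io.BytesIO()
    s3.download_fileobj("legacy-s3-bucket", obj["Key"], buf)
    buf.seek(0)
    dest_bucket.blob(obj["Key"]).upload_from_file(buf)
    print("copied", obj["Key"])
```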

Transcription is the best way to convert recorded audio into highly accurate, searchable, readable text. Being able to index and search audio content helps your users find relevant material; it can boost organic traffic, improve accessibility, and feed your AI workflows so you can provide better service to your customers.

Let's say you run a company that offers customer support services, and you record random customer conversations to get better insight into how representatives handle calls and to improve the quality of service. The first step is to transcribe the recorded audio files into digitized, readable text. From there, the text can flow through various AI/ML workflows to determine the mood of the call, customer sentiment, resolution latency, and more.

To simplify transcribing massive numbers of recorded audio files, Google Cloud Platform technologies and CDAP together give users an integrated, scalable, code-free way to transcribe audio. This integration lets users build pipelines that can be scheduled and monitored with ease, ready for production deployment in minutes to hours rather than weeks or months.

Below is a simple CDAP Pipeline that takes raw audio files stored on Google Cloud Storage, passes them through the Google Speech Translator plugin, and writes the transcribed text to another location on Google Cloud Storage.

The Google Speech Translator CDAP plugin is ready to go after minor configuration, depending on the type of files being recorded. In the example above, the translation applied to the raw audio file generates a JSON output describing the file that was transcribed, along with the computed confidence for the transcription.
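The plugin wraps Google's speech recognition service. The sketch below approximates the same step with the Cloud Speech-to-Text Python client (our choice for illustration, not necessarily what the plugin calls internally), reading a hypothetical WAV recording from GCS and printing each transcript with its confidence.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical call recording on GCS; encoding/sample rate must match the file.
audio = speech.RecognitionAudio(uri="gs://my-data-lake-bucket/calls/call-001.wav")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Long-running recognition suits recorded files rather than live streams.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

for result in response.results:
    best = result.alternatives[0]
    print(best.transcript, best.confidence)
```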

Our current level of integration with GCP is just a start; future work will focus on integration with Stackdriver for logs and metrics, Google Container Engine, Apache Beam, and much more.
