Enterprise data warehouses (EDWs) are unified databases for all of an organization’s historical data, optimized for analytics. Companies implementing data warehouses today often consider building them in the cloud rather than on-premises. Many are also weighing data lakes as an alternative to traditional data warehouses. Whether to combine historical data with live streaming data is another crucial question.
A data warehouse is an analytical, usually relational, database created from two or more data sources. The goal is usually to save historical data for later analysis. Data warehouses often have extensive computing and storage resources for running complicated queries and generating reports. They are often used as data sources for business intelligence and machine learning systems.
The write throughput requirements of transactional operational databases limit the number and type of indexes you can create (more indexes mean more writes and updates per record added, but also more potential conflicts), which slows down analytical queries against the operational database. Once you’ve exported your data to a data warehouse, you can index whatever interests you. This allows you to achieve good performance for analytical queries without affecting the write performance of the separate OLTP (online transaction processing) database.
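A minimal sketch of this idea, using SQLite in place of both systems (the table and index names are invented for the example): rows are exported from the "OLTP" database into a separate "warehouse" database, where an analytical index can be added without touching the write path.

```python
import sqlite3

# Hypothetical "OLTP" database: kept lean, no analytical indexes.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
oltp.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                 [("eu", 10.0), ("us", 25.0), ("eu", 7.5)])

# Separate "warehouse" database: rows are exported out of the write path.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                      oltp.execute("SELECT id, region, amount FROM orders"))

# Index whatever the analysts filter on -- it only slows down *this* copy.
warehouse.execute("CREATE INDEX idx_orders_region ON orders (region)")

# Confirm the analytical query is index-backed rather than a full scan.
plan = warehouse.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM orders WHERE region = 'eu'"
).fetchall()
uses_index = any("idx_orders_region" in row[-1] for row in plan)
```

The same `CREATE INDEX` on the OLTP side would tax every subsequent insert; on the exported copy it costs the write path nothing.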
Data marts contain data geared towards a specific line of business. They can be dependent on the data warehouse, independent (that is, they can come from an operational database or an external source), or they can be a mixture of both.
Data lakes store data in its native format and are essentially “schema on read”: any application that reads data from the data lake must impose its own types and relationships on the data. Traditional data warehouses, on the other hand, are “schema on write”: data types, indexes, and relationships are imposed when the data is stored in the data warehouse.
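The contrast can be sketched in a few lines (the field names are made up for the example): the lake holds raw bytes, and it is the reading application that casts fields into types, whereas a warehouse row already carries its types.

```python
import json

# Schema on write: types were enforced when the data landed.
# A plain dict stands in here for a typed warehouse row.
warehouse_row = {"user_id": 42, "signup_year": 2021}  # already integers

# Schema on read: the lake stores the record in its native (raw) form;
# every reader must impose its own types on it.
raw_lake_record = b'{"user_id": "42", "signup_year": "2021"}'

def read_with_schema(raw: bytes) -> dict:
    """One application's view of the raw record: cast fields at read time."""
    doc = json.loads(raw)
    return {"user_id": int(doc["user_id"]),
            "signup_year": int(doc["signup_year"])}

parsed = read_with_schema(raw_lake_record)
```

A second application could read the same raw record with a completely different schema; with schema on write, that flexibility is traded away for consistency and query speed.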
When evaluating a cloud data warehouse, you should primarily pay attention to the following points:
- Simple administration
- High scalability
- High performance
- Good integration capabilities
- Reasonable cost
Also, ask for customer references, especially for large deployments, and run your own proof of concept. You should pay particular attention to the vendor’s currently available and planned machine learning capabilities, since the bulk of a data warehouse’s business value comes from the application of predictive analytics.
The following 13 products, listed alphabetically, are either cloud data warehouses or offer the functionality of data warehouses but are built on a different base architecture – such as a data lake. However, because all of these products add heterogeneous federated query engines, the functional distinction between data lake and data warehouse tends to blur.
Ahana Cloud for Presto
This solution turns a data lake on Amazon S3 into a data warehouse without moving data. SQL queries execute quickly even when connecting multiple heterogeneous data sources. Presto is an open-source, distributed SQL query engine for running interactive, analytical queries on data sources of any size. Data can be queried where it is – including Hive, Cassandra, relational databases and proprietary data stores. A single Presto query can combine data from multiple sources. For example, Facebook uses Presto for interactive queries against multiple internal data stores, including its 300PB data warehouse.
Ahana Cloud for Presto runs on Amazon, has a relatively simple user interface, and offers end-to-end cluster lifecycle management. The solution also runs in Kubernetes, is highly scalable and has an integrated data catalog and easy integration with other data sources, catalogs and dashboarding tools. Ahana’s default query interface is Apache Superset. You can also use Jupyter or Zeppelin notebooks, especially if you do machine learning.
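A conceptual stand-in for what a federated engine like Presto does, using SQLite’s `ATTACH` in place of Presto connectors (the table names and data are invented): one SQL statement joins tables that live in two separate databases.

```python
import sqlite3

# "Source A": a database holding user profiles.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")

# "Source B": a second, separately attached database holding clickstream data.
# ATTACH plays the role of a Presto connector to a heterogeneous source.
con.execute("ATTACH DATABASE ':memory:' AS events")
con.execute("CREATE TABLE events.clicks (user_id INTEGER, page TEXT)")
con.execute("INSERT INTO events.clicks VALUES (1, '/home'), (1, '/docs'), (2, '/home')")

# A single query spanning both "sources".
rows = con.execute("""
    SELECT u.name, COUNT(*) AS clicks
    FROM users u JOIN events.clicks c ON c.user_id = u.id
    GROUP BY u.name ORDER BY u.name
""").fetchall()
```

In Presto the two sides of such a join could be, say, a Hive table and a relational database, with the engine pushing work down to each source; the SQL the user writes stays a single statement.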
Amazon Redshift

With Amazon Redshift, you can combine exabytes of structured and semi-structured data in your data warehouse, operational database, and data lake with standard SQL queries. Redshift easily stores the results of your queries in your S3 data lake using open formats like Apache Parquet, so you can run additional analytics through other services like Amazon EMR, Amazon Athena, and Amazon SageMaker.
Azure Synapse Analytics
This analytics service brings together data integration, data warehousing, and big data analytics. Azure Synapse Analytics enables you to process, explore, prepare, manage, and deliver data for BI and machine learning purposes. In addition, data can be queried at scale using serverless or dedicated resources. Azure Synapse can run queries using Spark or SQL engines and has deep integration with Azure Machine Learning, Azure Cognitive Services, and Power BI.
Databricks Delta Lake
This open source project makes it possible to build a “lakehouse” architecture on top of existing storage systems such as Amazon S3, Microsoft Azure Data Lake Storage, Google Cloud Storage, and HDFS. The solution extends data lakes with ACID transactions, metadata processing, data versioning, schema enforcement, and schema evolution. Databricks’ Lakehouse Platform leverages Delta Lake, Spark, and MLflow in a cloud service running on AWS, Microsoft Azure, and Google Cloud. This allows data management and performance more typical of data warehouses to be combined with the cost-effective, flexible object storage of data lakes.
Google BigQuery

This solution is a serverless, petabyte-scale cloud data warehouse that has its own internal engines for business intelligence and machine learning on board. In addition, BigQuery integrates with all Google Cloud services, including Vertex AI and TensorFlow.
BigQuery Omni extends BigQuery to analyze data across clouds using Anthos. Data QnA extends BigQuery with a natural language front end. Connected Sheets allows users to analyze billions of rows of live BigQuery data in Google Sheets. BigQuery can process federated queries, including external data sources in object stores (Google Cloud Storage) for Parquet and ORC (Optimized Row Columnar) file formats, transactional databases (Google Cloud Bigtable, Google Cloud SQL), or spreadsheets in Google Drive.
Oracle Autonomous Data Warehouse
The Oracle solution is a cloud data warehouse service that automates the provisioning, configuration, tuning, scaling, and securing of the data warehouse. Oracle Autonomous Data Warehouse includes self-service tools for data loading, data transformations, business models, automated insights, and built-in converged database capabilities that enable easier querying across multiple data types and machine learning analytics. The solution is available both in the Oracle Public Cloud and in customers’ data centers with Oracle Cloud@Customer.
Qubole

This simple, open, and secure data lake platform for machine learning, streaming, and ad hoc analytics is available on AWS, Azure, Google, and Oracle clouds. Qubole helps you ingest datasets from a data lake, create schemas with Hive, query the data with Hive, Presto, Quantum, or Spark, and then use the results for data engineering and data science. You can work with Qubole data in Zeppelin or Jupyter notebooks and Airflow workflows.
Rockset

This operational analytics database occupies a niche between transactional database and data warehouse. Rockset can analyze gigabytes to terabytes of current real-time and streaming data and has indexes that execute most queries in milliseconds. Rockset builds a converged index for structured and semi-structured data (from OLTP databases, streams, and data lakes) in real time and provides a RESTful SQL interface.
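A toy illustration in the spirit of a converged index (this is a conceptual sketch, not Rockset’s implementation; all names and data are invented): every field of every document is written into both a row store and an inverted index, so point lookups and equality filters are both index-backed instead of requiring a scan.

```python
from collections import defaultdict

class ConvergedIndex:
    """Toy converged index: each field lands in a row store, an inverted
    index, and a columnar store at write time."""

    def __init__(self):
        self.rows = {}                    # row store: doc_id -> document
        self.inverted = defaultdict(set)  # (field, value) -> set of doc_ids
        self.columns = defaultdict(list)  # columnar store: field -> values

    def add(self, doc_id, doc):
        self.rows[doc_id] = doc
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.columns[field].append(value)

    def filter(self, field, value):
        """Index-backed equality filter: no full scan of the row store."""
        return [self.rows[i] for i in sorted(self.inverted[(field, value)])]

idx = ConvergedIndex()
idx.add(1, {"city": "berlin", "amount": 10})
idx.add(2, {"city": "paris", "amount": 20})
idx.add(3, {"city": "berlin", "amount": 30})
berlin_docs = idx.filter("city", "berlin")
```

The trade-off is the same one the article describes for warehouses generally: every write pays for several index updates, in exchange for fast reads across very different query shapes.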
Snowflake

This dynamically scalable enterprise data warehouse was developed for the cloud and runs on AWS, Azure, and Google Cloud. Snowflake has storage, compute, and global service layers that are physically separate but logically integrated. Data workloads scale independently, making Snowflake a viable platform for data warehousing, data lakes, data engineering, data science, modern data sharing, and data application development.
Teradata Vantage

This connected multi-cloud data platform for enterprise analytics unifies data lakes, data warehouses, analytics, and new data sources and types. Teradata Vantage runs in public clouds (such as AWS, Azure, and Google Cloud), hybrid multi-cloud environments, on-premises with Teradata IntelliFlex, or on commodity hardware with VMware.
Vertica

Vertica offers a unified analytics warehouse for the major public clouds and on-premises data centers. The solution analyzes data in cloud object storage and HDFS without you having to move it first. Vertica offers two deployment options: In Enterprise Mode, it runs on standard servers with tightly coupled storage and offers the highest performance for use cases that require consistent compute capacity.
In Eon Mode, Vertica features a cloud-native architecture that separates compute from storage for simplified management for variable workloads. This maintains the flexibility to apply specific compute resources to shared storage for different business applications. Vertica in Eon Mode is available on Amazon Web Services and Google Cloud Platform, but is not limited to public cloud deployments.
ClickHouse

This open-source, column-oriented OLAP database management system manages extremely large amounts of data (including non-aggregated data) and allows you to generate customized data reports online in real time. The system is linearly scalable and can be scaled up to store and process trillions of rows and petabytes of data. ClickHouse is designed to work on regular hard drives, which means the cost per GB of data storage is low. If available, SSDs and additional RAM can also be fully used.
In ClickHouse, data can be stored on different shards. Each shard can be a set of replicas used for fault tolerance. The query is processed in parallel on all shards. ClickHouse uses asynchronous multi-master replication. After the data is written to any available replica, it is distributed to all remaining replicas in the background. ClickHouse is available as a cloud service from Yandex, Altinity (on AWS), Alibaba, SberCloud and Tencent.
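The scatter/gather pattern behind this can be sketched in a few lines (a deliberately simplified model, with invented data; a real cluster runs the per-shard step concurrently on separate machines and handles replicas, which are omitted here): rows are routed to shards by a hash of a key, the same aggregate runs on every shard, and the partial results are merged.

```python
import zlib

N_SHARDS = 3
shards = [[] for _ in range(N_SHARDS)]  # each list stands in for one shard

def route(row):
    """Route a row to a shard by a deterministic hash of its key."""
    shard_id = zlib.crc32(row["user"].encode()) % N_SHARDS
    shards[shard_id].append(row)

for row in [{"user": "ada", "bytes": 100},
            {"user": "grace", "bytes": 250},
            {"user": "ada", "bytes": 50}]:
    route(row)

# Scatter: run the same aggregate on every shard (sequential here;
# ClickHouse executes this step in parallel across the cluster).
partials = [sum(r["bytes"] for r in shard) for shard in shards]

# Gather: merge the per-shard partial aggregates into the final answer.
total_bytes = sum(partials)
```

Because sums (and counts, mins, maxes) merge cleanly, the final answer is independent of how the hash happened to distribute rows across shards.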
Yellowbrick Data Warehouse
This modern analytics database processes large amounts of data in parallel and was designed for demanding batch and real-time workloads, among other uses. With Yellowbrick, you can deploy data warehouses wherever they are needed – in private data centers, multiple public clouds, and at the network edge. (FM)
This post is based on an article from our US sister publication Infoworld.