IHME Believes Open Source MariaDB ColumnStore Is The Future of Data Warehousing
Note: This is a guest post by Andrew Ernst, Assistant Director, Infrastructure at the Institute for Health Metrics and Evaluation.
In the early 1990s, the World Bank Commissioned an in-depth study to measure disability and death from a multitude of causes worldwide. Over the past few decades, this study has grown into an international consortium of more than 1,800 researchers from more than 120 countries, and its estimates are being updated annually.
Today, the Global Burden of Disease report, managed by the Institute for Health Metrics and Evaluation (IHME), serves as the most comprehensive effort to systematically measure the world’s health problems. In fact, the tools can be used at the global, national, and local levels to understand health trends over time, just like gross domestic product data are used to monitor a country’s economic activity.
The data is growing each year. In 2015, the Global Burden of Disease results were three times larger than any other year. As the Global Burden of Disease report continues to grow in size, focusing on more granular geographies, the data requirements also continue to scale exponentially.
Solving Volume and Scale Challenges
Over several decades, the size and scope of the Global Burden of Disease results have grown - today reaching multi-billion row tables. As it has grown, we’ve tried several storage engines which have failed miserably.
The Global Burden results are developed through many internal processes and pipelines that rely on a MySQL-compliant infrastructure. The choice to adopt MySQL was made at a time when the scope of IHME’s work was much smaller, and today’s scale would have been incomprehensible. The ambitions of our researchers and abundance of available input data have pushed the boundaries of research and traditional database engines.
IHME has leveraged the advanced Percona XtraDB Barracuda (InnoDB-compliant) storage engine for every database environment, and supports more than 90 database instances within their infrastructure. The environments supporting those critical scientific computing pipelines run on high-end hardware with and are optimized for low-latency and to support servicing extreme multi-concurrency from their High Performance Compute cluster.
Knowing that the existing solutions would not scale with the Institute’s growth, efforts have been underway to evaluate platforms that offer:
a MySQL-compatible interface
don’t require a vast amount of research application code to be re-written
Ideally offer a Open Source development effort with community-driven input and contributions.
memSQL was one of the top contenders and performed extremely well at scale, but had a number of non-standard constructs for database design and implementation, and lacked definable security mechanisms for authentication and authorization with their non-commercial product.
MySQL 5.7, while offering higher benchmark speeds for ingest and query optimization didn’t offer enough of a paradigm shift to make a huge impact in our workload. We knew that any platform selected would need to leverage multi-host sharding with a multi-threaded software. Our database team is small, and while building a sharding infrastructure is reasonably straightforward, realistically cannot ask the development staff to make the applications shard-aware.
Then IHME evaluated MariaDB ColumnStore - which combined the power of big data analytics with the the ease of use of SQL. Leveraging MariaDB’s open source model, MariaDB offers high performance search queries on massive billion row data tables.
With MariaDB ColumnStore, we were able to improve the performance of our multi-billion row tables. IHME found several benefits to using ColumnStore including:
Higher performance: Compared to row-based storage, MariaDB ColumnStore column storage reduces disk I/O, making it much faster for read-intensive analytic workloads on large datasets.
Better security: ColumnStore accesses all the same security capabilities delivered in MariaDB Server including encryption for data in motion, role-based access and audit features.
Benefits of leveraging SQL: ColumnStore brings transactional and analytic workloads into a single enterprise grade system. It simplifies enterprise administration and execution with a standard SQL front end for OLTP and analytics.
We also found MariaDB engineers to be incredibly responsive to IHME - we trust we can work with them for a very long time.
Looking into the future, IHME has to design around a future of growing data that allows for regular updates as new data and epidemiological studies are made available.
MariaDB's ColumnStore storage engine solved both a volume and scale problem within our environment that allows us to seamlessly handle both current and planned increases in workload.
When IHME released our results in 2010, there were approximately 2 billion data points, and with the 2015 effort, that number has grown to just shy of 100 billion. The 2016 results are already suggesting we will far exceed 10 billion results per table in the next six months (each result set is about 9-10 tables of roughly the same size). Looking further into the future, IHME will be focusing on smaller geographical areas across the globe, and will need to support analytical workloads that include geospatial calculations.
High performance, flexible data analytics using MariaDB ColumnStore doesn’t just make my day to day job easier. It will have a profound impact on how the global community can assess disease around the world.
The pioneering effort of the IHME continues to be hailed as a major landmark in public health and an important foundation for policy formulation and priority setting.