Big Data – An Analysis and Overview

We could consider Big Data as the new hype of the moment. Many believe that it brings eternal success and increased revenues to their companies.

It may be true that by analyzing Big Data, many enterprises will find new ways to sell, create new products and optimize their processes. The important point is that these achievements will not come for free.

Is your Big Data really “Big”?

The term “Big Data” has been misused in too many situations. Technology providers abuse the buzzword to jump on the bandwagon, hoping to increase their profits. Some technology enthusiasts simply find it more attractive to work on world-record-sized databases than on medium-sized data sets. The reality is that data can be really big for some, but that is not always the case. The term “Big” is also often used loosely to conflate size, complexity and variety.

As a quick rule of thumb, you are dealing with Big Data if:

Your datasets are so large that single storage devices cannot store and manage them
The analysis of your data requires computing power that goes beyond a single server, no matter how big it is
If not handled properly, data transfer and loading may take so long that the acquired data piles up and becomes unmanageable
The type of data acquired is a very large set of structured data, a set of unstructured data, or a mix of both

If one or more of these bullets describes the data you acquire, load and analyze, then you are dealing with Big Data.
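
The checklist above can be sketched as a small function. This is only an illustration: the Workload fields, the signal names and the constant-rate backlog formula are assumptions made for the example, not part of any product or standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical description of a data workload (all fields are illustrative)."""
    dataset_bytes: int                 # total size of the data set
    largest_device_bytes: int          # capacity of the biggest single storage device
    fits_single_server_compute: bool   # can one server run the analysis in time?
    arrival_bytes_per_hour: int        # rate at which new data is acquired
    load_bytes_per_hour: int           # rate at which data can be transferred and loaded
    has_unstructured_data: bool        # text, documents, images, audio, video
    structured_is_very_large: bool     # a judgment call made by the project team


def big_data_signals(w: Workload) -> List[str]:
    """Return which of the four criteria from the checklist this workload meets."""
    signals = []
    if w.dataset_bytes > w.largest_device_bytes:
        signals.append("storage")
    if not w.fits_single_server_compute:
        signals.append("computation")
    if w.arrival_bytes_per_hour > w.load_bytes_per_hour:
        signals.append("loading backlog")
    if w.has_unstructured_data or w.structured_is_very_large:
        signals.append("data type")
    return signals


def backlog_after(w: Workload, hours: int) -> int:
    """Bytes still waiting to be loaded after `hours`, assuming constant rates."""
    return max(0, (w.arrival_bytes_per_hour - w.load_bytes_per_hour) * hours)
```

A non-empty result from big_data_signals means at least one criterion applies. The backlog function makes the third criterion concrete: whenever data arrives faster than it can be loaded, the amount of unloaded data grows with every hour.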

Aspects to consider when handling Big Data

One of the factors that determine the real cost of Big Data is the technology adopted in the various phases of the project. We can group these phases into three:

Data acquisition, transformation/aggregation and loading
Business analysis and data mining
Users querying and reporting

Just as in traditional data warehouse projects, the first step is to start from the objectives that we want to achieve. Initially we know that we have “some data”, but we do not know exactly how to organize it, where to store it and how to analyze it. We do know that some information can be obtained from this data. Here are a few questions to contemplate.

What kind of queries and reports do you need?
How often will you need this information?
What kind of pre-analysis is required to provide the information that can be queried and reported?
How can we structure the data in order to execute this analysis?

The result is a matrix that will help the project managers define the following:

The user tools and applications that can be used to provide the data
The software needed to manage and analyze the data
The model used at various stages of the process
The infrastructure in terms of storage, computation and networking

Which technologies should we use?

A practically unlimited number of technologies, in countless combinations, can give you the impression that the project is already close to meeting its requirements.

The user tools

Sometimes, ad hoc tools and applications are the best solution, especially when the objective is to provide information for very specific requirements. For example, the results may be integrated in e-commerce and online services or fundamental for customer and technical support services.

In other situations, where reports reflect more typical results in terms of charts, graphs, pivots or matrices of results, standard reporting and BI tools may be the right solution. In this area, a large number of purely commercial products are paired with some open source products that have commercial support. The clear trend is to provide these tools “as a service”, so that customers can benefit from vast scalability for their analysis. The “as a service” model is an extremely cost-effective way to start a project. One thing to consider, however, is whether it will be possible to change this approach if it becomes less convenient than others in the future. Such a change may require re-engineering all the operations.

The analysis software

Today, the majority of the software used to analyze different information is commercial or open source with commercial support. Software varies in terms of type of data and objectives. For structured data, data mining tools have been around for many years and have reached maturity, so they can be effectively used in Big Data. Map/Reduce operations can crunch a large amount of structured data and find patterns or show behaviors in ways that would not have been feasible only a few years ago. Extensions to this software can successfully analyze unstructured data, in terms of collected text, documents, images, as well as audio and video streams. These extensions can find similarities in video and audio clips, or in photos. They can understand not only the text stored in a document but also the sentiment and the emotions expressed in collected comments and texts.
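
To make the Map/Reduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names and the sample orders are invented for illustration, and a real deployment would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map step, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: collapse the list of values collected for each key.
    return {key: reducer(key, values) for key, values in groups.items()}


def emit_product_amount(order):
    """Mapper: emit the order amount keyed by product."""
    yield order["product"], order["amount"]


def total(_key, values):
    """Reducer: sum all amounts collected for one product."""
    return sum(values)


# Hypothetical sample data.
orders = [
    {"product": "book", "amount": 10},
    {"product": "pen", "amount": 5},
    {"product": "book", "amount": 20},
]

revenue_per_product = map_reduce(orders, emit_product_amount, total)
```

Because the mapper looks at one record at a time and the reducer looks at one key at a time, both steps can be distributed over many machines, which is what makes the approach suitable for very large data sets.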

Modelling software

Data modelling is very much related to the kind of analysis to perform. In general, this software comes with the analysis, but some technologies can be alternatives and follow different approaches.

Infrastructure

This is the area where commodity hardware and open source software are most widely used. There are many commercial solutions that promise improved performance and optimizations, but the common trend is to use less expensive and generally available boxes and products.

More than in any other area, the adoption of cloud technologies is the common trend in Big Data. This is mainly due to the elasticity provided by these technologies: computation can be turned on and off on demand, and storage can be used, discarded or reused. An “as a service” approach is again very interesting, since it removes the hassle of managing the computation and storage infrastructure and leaves communication and integration as the most important aspects to consider.

Which database should we use?

The technology at the center of a Big Data project is, without any doubt, the database. There are several commercial options for Big Data, but the common trend is in the open source area. The set of products developed by the Apache Software Foundation under the Hadoop umbrella, together with many side projects, is extremely popular and is considered the de facto standard for Big Data. The truth is that Hadoop can solve only some aspects of the analysis required for Big Data, and it may be necessary to pair it with other database technologies. NoSQL technologies are particularly popular for their scalability and performance.

Cassandra, for example, is a technology mainly used to store large collections of data. It is more effective for fast data inserts than for analysis. Therefore, many projects use Cassandra to store the acquired data with a denormalised model.
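
As an illustration of what a denormalised model means in practice, the following Python sketch flattens an event and its related customer record into one wide row before it is stored. The field names and sample data are hypothetical, and no Cassandra API is used.

```python
# Hypothetical reference data and one acquired event.
customers = {
    1: {"name": "Ada", "country": "UK"},
}
event = {"customer_id": 1, "page": "/home", "ts": "2014-01-01T10:00:00"}


def denormalise(event, customers):
    """Copy the customer attributes into the event so no join is needed later."""
    customer = customers[event["customer_id"]]
    return {
        **event,
        "customer_name": customer["name"],
        "customer_country": customer["country"],
    }


wide_row = denormalise(event, customers)
```

The trade-off is deliberate: customer attributes are repeated in every row, which costs storage, but each insert is a single self-contained write and reads do not need a join.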

MongoDB is used in many cases to store documents or unstructured data in general. Data can be later reviewed and analyzed so that more structured information can be stored in other databases.

SQL databases are always a viable choice for Big Data, although they seem to be less popular than Hadoop, Cassandra and MongoDB. Due to their internal architecture, relational databases may struggle if the acquired data is unstructured or organized in large objects, such as documents and multimedia clips. In recent years, much has been done in this area, so relational databases today are very different from the ones that were used 10 or more years ago. Certainly, the handling and analysis of structured data is where relational databases can play a leading role.

Modern relational databases combine the efficiency of SQL with functionality that can provide faster indexing and optimized access to the data. Columnar relational databases provide great improvements in traditional data analysis. New indexing algorithms also remove the nuisance of rebuilding data statistics, optimizing indexes and dealing with storage inefficiency when data is moved in large sets. In addition to these aspects, some relational databases also provide a map/reduce approach similar to the one available in Hadoop and in other NoSQL products.

MariaDB and the NoSQL integration

MariaDB is a drop-in replacement for MySQL, the most widely used open source database for online applications. MariaDB falls into the category of NewSQL products, i.e. products that provide NoSQL features together with the typical features available in relational databases. Therefore, aspects like transaction management, durability and consistency are available together with schema or schema-less modelling, full text storage and analysis, and integration with other NoSQL technologies.

MariaDB can be part of the database infrastructure for Big Data. It is not meant to be a replacement for Hadoop, but it can be a technology used in conjunction with it. Hadoop is used in batch, ad-hoc analysis. In projects that require the processing of Terabytes or Petabytes of data, Hadoop is definitely a good fit. The results can be queried and reported via a standard MySQL/MariaDB interface, which is compatible with virtually all the BI tools and development frameworks available today.

MariaDB provides multiple storage engines through the use of the Pluggable Storage Engine Architecture. Two of the most interesting storage engines for Big Data are Cassandra and Connect.

Cassandra provides a direct connection between MariaDB and a ring of Cassandra nodes. Column families in Cassandra are seen as tables and they can be joined with local MariaDB tables, or they can be used for all the standard R/W SQL operations.

Connect is a storage engine used to integrate a large number of file formats into MariaDB. The engine can be used to connect MariaDB to external files in order to analyze them immediately or to load them into other MariaDB engines. The automatic discovery of new files and their format makes Connect an extremely flexible and efficient way to integrate and import data from many different sources. Log acquisition and loading is where this engine shines. In an era where machine-generated data is exploding, this engine is an essential piece of the puzzle in a Big Data project.

MariaDB can also provide access to documents and index them with the use of the Sphinx storage engine. Sphinx is fully integrated into the relational database so documents are accessible through standard SQL queries. They can be joined with other standard relational tables, but their storage and retrieval is handled by separate processes. This allows document formats, compression and indexing to happen without affecting the use of internal resources.

Another interesting engine for Big Data available with MariaDB is TokuDB. TokuDB provides advanced compression, fast insert rate and an indexing algorithm that can reduce or even remove database maintenance for structured data. In addition to these features, online operations are significantly improved with respect to the standard MySQL engines.

Version 10 of MariaDB also provides an advanced replication mechanism called multi-source replication. With multi-source replication, data sets can be built by combining data stored on multiple database masters. This gives users a single, consolidated view of a subset or an aggregation of the data available in a distributed environment.
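
The effect of such a consolidated view can be pictured with a toy model in Python. This is not MariaDB's replication protocol or configuration: the master names, the rows and the added "source" field are assumptions made purely to show how data from several masters ends up queryable in one place.

```python
# Hypothetical rows held by two independent masters.
masters = {
    "europe": [{"order_id": 1, "amount": 40}, {"order_id": 2, "amount": 60}],
    "asia": [{"order_id": 1, "amount": 25}],
}


def consolidate(masters):
    """Collect the rows of every master into one list, tagged with their origin."""
    combined = []
    for name in sorted(masters):
        for row in masters[name]:
            combined.append({"source": name, **row})
    return combined


def total_amount(rows):
    """Aggregate over the consolidated data set."""
    return sum(row["amount"] for row in rows)


consolidated = consolidate(masters)
grand_total = total_amount(consolidated)
```

In this model the aggregation runs once over the combined rows instead of once per master, which is the practical benefit of having a single server that receives data from all sources.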

Last but not least, dynamic columns in MariaDB allow users to store text objects and unstructured data, discover their content later, and define the columns contained in a document on the fly.
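
The idea behind dynamic columns can be modelled in a few lines of Python. The sketch below does not call MariaDB: JSON text stands in for the packed value that the database stores, and the row layout and helper names are invented for the example.

```python
import json

# Each row has a fixed "id" plus a free-form set of attributes stored as one value.
rows = [
    {"id": 1, "attrs": json.dumps({"color": "red", "size": "M"})},
    {"id": 2, "attrs": json.dumps({"color": "blue", "weight_kg": 2})},
]


def discover_columns(rows):
    """Find out, after the fact, which attribute names exist across all rows."""
    names = set()
    for row in rows:
        names.update(json.loads(row["attrs"]).keys())
    return sorted(names)


def get_column(row, name, default=None):
    """Read one attribute from a row; rows that lack it return the default."""
    return json.loads(row["attrs"]).get(name, default)


known_columns = discover_columns(rows)
sizes = [get_column(row, "size") for row in rows]
```

Rows do not need to share the same attributes, and no schema change is required when a new attribute appears. The set of columns is discovered from the data rather than declared up front.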

All these aspects make MariaDB an ideal technology for Big Data when the data set has a structured or semi-structured format. The ready-made integration with NoSQL technology also reduces the cost to develop and administer data extraction, transfer and reload.

Conclusions

When it comes to Big Data, it is not always necessary to move away from well-known technologies like relational databases. Modern NewSQL databases like MariaDB can achieve the objective and provide all the features required. The result is a smoother learning curve, less risk, reuse of known technologies and resources and ultimately a reduced total cost of a Big Data project.

In other cases, MariaDB can be used in conjunction with NoSQL technologies and integrated in many different ways.

In the end, the most important point to consider is that when a single technology is not enough for a successful project, ease of integration is a must.