Sunday, May 3, 2015

Big Data - Impact on Enterprise Integration

In this article I try to explain the new challenges in enterprise integration caused by the advent of Big Data, and present some approaches to overcome them.



Big Data

Big data is the latest buzz-word in the industry. At the basic level, big data is just that: large amounts of data that cannot be handled by conventional systems. With increases in hardware capacity, our definition of what constitutes a conventional system changes, and so does the threshold of big data. Big data is therefore not something new; it has always been around, and only the threshold of what constitutes big data has changed. Today, the threshold for big data may be terabytes (10^12 bytes). Soon, it will be petabytes (10^15 bytes). Twenty years ago, there were very few systems that could process gigabytes (10^9 bytes) of data in an acceptable time-frame, so gigabytes would have been the lower threshold of big data at that time.

Integration

Integration is needed when we have to connect two or more software systems together. Large enterprises will have tens or hundreds of systems to connect together. Since the number of systems involved in integration could be large, the amount of data that flows across these connections will also be large.

How does big data impact enterprise integration?

With the advent of big data processors, more and more organizations recognize the value of big data analytics, and the need to process big data within enterprises is increasing. This will lead to large amounts of data (a subset of the big data captured by the organization) moving across integration middle-ware (software that connects two or more software systems). Such large amounts of data would overload existing integration middle-ware systems, since they were designed to handle lower volumes of data. This is depicted in Figure 1.

Figure 1: Big data and integration

Let us analyze Figure 1 in detail. Big data is not what is really interesting. Who needs a mountain of data anyway? What we want are the results of processing this data. Depending on the algorithm used to process the data, we get different results. This is where it gets interesting.

Let us suppose that an enterprise implements a big data solution. The big data processing (refer to the big data processor in the figure) solution could be a shared service across the enterprise. Due to the technical complexity and cost involved in building a big data processor, it is not possible for each division in the organization to have its own. But due to the business value accruing from big data analysis, sooner or later, different divisions in the enterprise will want their own processing on the big data set. This can be done by moving a subset of the big data, relevant to that division, over to its systems. This is the point at which the integration systems in the enterprise will feel the impact. To enable different divisions in the organization to “do their own stuff” with data, a subset of the big data will start moving across the integration middle-ware.

Overcoming the big data challenge

So how can we solve the new challenge caused by subsets of big data moving across the integration middle-ware? Let us look at three options:
  1. Buy more hardware: This is technically called scaling, either vertical (bigger servers) or horizontal (more servers). This approach may work with smaller challenges, but not with big data, since the amount of new hardware required to support the load will make the idea financially unviable.
  2. Buy specialized big data solutions: These solutions can be purchased and given to each division that needs a big data processor. There are very few big data processors in the market now, but we can expect more soon. This will be cheaper than buying more hardware, and can be considered, if the organization has enough funds. Note that this approach works by avoiding the problem altogether: we avoid moving any data across the integration middle-ware. On the flip side, this approach will result in multiple copies of big data in the enterprise.
  3. The third option involves extending our middle-ware’s capabilities by using data grids or distributed caching platforms. This attempts to overcome the big data challenge by increasing the integration middle-ware’s available memory, and introducing an asynchronous link in the integration middle-ware’s data-persistence mechanism.
How do these techniques help us overcome the challenge? Answering that question requires a deeper understanding of the root cause of the underlying problem. A middle-ware solution fails at high loads due to the following issues:
  • Memory overload, caused by data, threads or sockets
  • Lack of system resources like threads, sockets and swap space
A distributed cache helps with the first issue, by increasing the available memory. For example, if you have ten servers with 10 GB each, a distributed cache can help you add up all that RAM and use it like local RAM, effectively giving you 100 GB. It helps with the second issue by avoiding the need for a large number of threads, or a large swap space. This is accomplished by intelligent persistence mechanisms, like write-behind-cache.
In a write-behind-cache, data that needs to be persisted is written in an asynchronous manner: the write request is accepted, and the write function returns immediately. The persistence mechanism then writes into the file or DB; this frees up the persistence threads of the middle-ware, increasing the scalability of the overall solution.
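
The write-behind idea can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular data grid implements it: the class name `WriteBehindCache`, the `persist` callable and the `failed` list are all assumptions made for the example. The point it shows is that `put` returns as soon as the value is in memory and queued, while a single background thread does the slow write.

```python
import queue
import threading


class WriteBehindCache:
    """In-memory cache whose writes reach the backing store asynchronously."""

    def __init__(self, persist):
        # `persist` is a hypothetical callable persist(key, value) that
        # stands in for the slow file or DB write.
        self._persist = persist
        self._data = {}
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        self.failed = []
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        # Update memory and queue the write; do not wait for the store.
        with self._lock:
            self._data[key] = value
        self._pending.put((key, value))

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def flush(self):
        # Block until every queued write has been attempted.
        self._pending.join()

    def _drain(self):
        # Background thread: the only code that talks to the slow store.
        while True:
            key, value = self._pending.get()
            try:
                self._persist(key, value)
            except Exception:
                # Assumption: failures are only recorded here; a real
                # system would retry or raise an alert.
                self.failed.append((key, value))
            finally:
                self._pending.task_done()
```

A caller would construct the cache with its own persistence function, call `put` on the request path, and call `flush` only at shutdown or in tests. The trade-off is that data queued but not yet written can be lost if the process dies, which is why real products pair this technique with replication across nodes.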

 

A telecom use-case

Let us look at how the third option from the previous section, of using data grids, can be implemented. The use-case here is from a telecom scenario, and is depicted in Figure 2.


Figure 2: Big data and integration: A telecom use case

Figure 2 has conceptual similarity with Figure 1. The Network Switch is the data source here. The big data processor maps to the Mediation solution here. The Analytic Application is similar to the Data Warehouse. The Event Processor system is similar to the Fraud Management application. Let us understand the data flow in this figure.

This is the data flow for a cell phone services provider (a telecom company, or telco, in industry parlance). Whenever a phone call is made, records called Call Detail Records are generated by the telco’s hardware, and the records get collated at the Network Switch. The Mediation system then processes these records. It performs validation, filtering, etc., and gives the records to the three systems it connects to: Fraud Management, Billing and Data Warehouse. The Billing system needs to connect to many other systems: CRM, Inventory, Fraud Management. For some of the data flows, like CRM to Billing, the volume is so high that we have to provide a direct connection from Billing to CRM for the few use-cases (around 5 per cent) that result in such high volumes.
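
To make the mediation step concrete, here is a rough Python sketch of validating and filtering records and then handing each surviving record to every downstream system. The record fields, the validation rule and the "drop zero-length calls" filter are assumptions invented for the example; real Call Detail Records and mediation rules are far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallDetailRecord:
    # Field names are illustrative assumptions, not a real CDR format.
    caller: str
    callee: str
    duration_seconds: int


def is_valid(record):
    # Validation: reject records with a missing party or negative duration.
    return (
        bool(record.caller)
        and bool(record.callee)
        and record.duration_seconds >= 0
    )


def mediate(records, sinks):
    """Validate and filter records, then pass each survivor to every sink."""
    delivered = 0
    for record in records:
        if not is_valid(record):
            continue
        if record.duration_seconds == 0:
            continue  # Filtering: drop zero-length calls (an assumed rule).
        for sink in sinks:
            sink(record)
        delivered += 1
    return delivered
```

Here each sink is simply a callable standing in for Fraud Management, Billing or the Data Warehouse. Every valid record is delivered three times, which illustrates why the volume leaving Mediation is a multiple of the volume entering it.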
Normal middle-ware, even with clustering and load balancing, cannot handle such volumes. This is where the need for the Middle-ware Infrastructure component in the architecture comes in (see Figure 2). The Middle-ware Infrastructure component is a separate product that provides features like local and distributed caching, load balancing, fail-over and recovery, with much higher scalability than that provided by standard middle-ware products. Some of the middle-ware infrastructure products come in the form of data grids, which support scaling to hundreds of nodes. Examples of such products are Oracle Coherence, JBoss Infinispan, WebSphere eXtreme Scale and Terracotta BigMemory.
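
How a data grid pools the memory of many nodes can be sketched as a hash-partitioned cache. This is a simplified in-process model written for illustration: `CacheNode` and `PartitionedCache` are hypothetical names, each "node" is just a dictionary, and the simple modulo placement shown here is an assumption rather than the scheme used by any of the products named above.

```python
import hashlib


class CacheNode:
    """Stands in for one server's local memory (hypothetical, in-process)."""

    def __init__(self, name):
        self.name = name
        self.entries = {}


class PartitionedCache:
    """Spreads keys over nodes so total capacity grows with the node count."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def _owner(self, key):
        # Stable hash so every client maps a given key to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._nodes)
        return self._nodes[index]

    def put(self, key, value):
        self._owner(key).entries[key] = value

    def get(self, key):
        return self._owner(key).entries.get(key)
```

Because each key lives on exactly one node, ten nodes can hold roughly ten times what a single node can. Production data grids typically go further, for example with consistent hashing so that adding a node moves only a fraction of the keys, and with backup copies so that losing a node does not lose data.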

The future of Big data and integration

Any solution in the technology space starts out in a niche area, and as it becomes mainstream, it gets more and more commoditized. We can expect big data solutions to follow this path in the near future with:
  1. The arrival of big data processing appliances
  2. Support for big data in cloud platforms
  3. Cloud-based integration platforms that are pre-packaged with middle-ware infrastructure
Hopefully, this article gives you a good overview of the challenges posed to integration solutions by the advent of big data in enterprises. The example discussed, which is a use-case from the telecom domain, is generic enough to be applicable to other domains. The key value that open source brings to such solutions is that we can scale out our solution with much lower financial implications, compared to commercial solutions.



