Saturday, February 22, 2014

Contrasting Cloud Storage, Data Analytics and Flash Storage

Two recent events tempted me to start this blog. The editor of our German book approached me and asked whether we would be interested in publishing a third edition. In addition, I came across several predictions about storage trends for 2014. Recapping both threads, three major topics seem to be of particular relevance: data analytics, flash and cloud. All three technologies touch different aspects of data processing. Let’s sort this out.

1. Cloud Storage

First, it is all about the business value of data. Nobody spends big money on high-end storage systems, storage networks or cloud storage just for the sake of owning cool technology. In 2002 Hermann Strass wrote in the foreword of our book: ‘Today stored data and the information it contains are the crown jewels of a company. The computers (servers) needed for processing data can be purchased by the dozen or in larger quantities – individually as server blades or packed into cabinets – at any time, integrated into a LAN or a WAN or exchanged for defective units. However, if stored data is lost, restore of it is very expensive and time-consuming, assuming that all or some of it can even be recovered. As a rule, data must be available ‘around the clock’. Data networks must therefore be designed with redundancy and high availability.’

In those days Fibre Channel SANs arrived in the open systems world (Unix, Linux, Windows) and changed the IT architecture from a server-centric to a storage-centric one. Storage networks opened up new possibilities for data management. In contrast to a server-centric IT architecture, in storage networks storage devices exist completely independently of any computer. Several servers can access the same storage device directly over the storage network without another server having to be involved. Storage devices are thus placed at the center of the IT architecture; servers, on the other hand, become an appendage of the storage devices that ‘just process data’. IT architectures with storage networks are therefore known as storage-centric IT architectures.

The cost of storage has always been a burden for CIOs, and storage networks helped to reduce the cost of owning and accessing data. The consolidation of data on large storage systems (disk systems, tape systems, NAS) enabled more efficient strategies for storage virtualization, resource allocation, data tiering, policy-based data placement, thin provisioning and backup, for example. Storage consolidation also made advanced features such as instant copies, remote mirroring, consistency groups and immutability more affordable. These features are required to meet the business requirements for availability (RPO, RTO) and regulatory compliance.

In my opinion the shift from storage networks to cloud storage is just an evolutionary step to further reduce the cost of owning and accessing data. It will also reduce the cost of processing data if both the application and the data reside in the cloud. Business processes must run around the clock and adhere to laws. This does not change, whether an application and its data reside in the cloud or on traditional IT infrastructure. Therefore future storage cloud implementations will include sophisticated features to support the business requirements for availability and regulatory compliance.

Nevertheless, cloud (storage) is imposing disruptive change on IT by offering a very cost-effective delivery model, which forces vendors, providers and customers to optimize hardware, software and operations for it. This puts huge pressure on everyone and everything, which in turn drives more and more evolutionary technical enhancements to adapt to the cloud delivery model. The sum of these evolutionary technical enhancements may result in a disruptive change for the whole IT industry. The technical changes of cloud (storage) are not rocket science; cloud is more about the maturing of IT by establishing standardized processes.

2. Data Analytics

Second, data analytics is orthogonal to storage networks and cloud storage. Data analytics is about processing data which is stored in traditional storage networks or in new cloud storage. The objective of data analytics is to gain more insight from the already available data. CEOs want to gain a competitive advantage and researchers want to gain more knowledge about their research domain. Data analytics is a completely different discipline from storage and storage management.

Last year I had the opportunity to attend the 2nd Symposium of the Large Scale Data Management and Analysis (LSDMA) project. The talks were about data-intensive science and the data life cycle. Key topics which I recall are the quality of acquired and ingested data (e.g., valid, accurate, consistent, having integrity), the correlation of data to gain new insight, and the preservation of data forever. The discipline of data curation comprises activities to maintain research data long-term, making it available and reusable forever in the interest of science and education. I was surprised when one speaker said that he considers a few terabytes of data a Big Data problem: a data set which cannot be shipped on an external USB drive is a Big Data problem.

Data analytics is nothing new, though cloud computing enables new possibilities for data analytics. Many data analytics algorithms require huge amounts of CPU power to process even larger amounts of data. The standardized and cost effective delivery model of cloud computing provides the infrastructure for data analytics applications which cannot be afforded on traditional IT architectures.

There are challenges ahead, though: the huge size of a business warehouse makes it time-consuming and expensive to transfer it, for example, from a traditional storage system into the cloud. Data analytics in the cloud requires that the analyzed data already resides in the cloud or that the storage for the data warehouse and the cloud resources are geographically co-located.
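To get a feeling for the transfer problem, here is a rough back-of-the-envelope sketch in Python. It only divides the size of the data set by the usable bandwidth of a network link. The function name, the sample sizes, the link speeds and the 80% efficiency factor are my own illustrative assumptions, not measurements from a real installation.

```python
def transfer_time_hours(size_tb: float, link_mbit_s: float,
                        efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_tb terabytes over a network link.

    size_tb      -- data set size in decimal terabytes (1 TB = 10**12 bytes)
    link_mbit_s  -- nominal link speed in Mbit/s
    efficiency   -- assumed fraction of the nominal speed that is usable
                    (hypothetical value, protocol overhead and sharing vary)
    """
    if size_tb < 0 or link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link speed > 0, 0 < efficiency <= 1")
    size_bits = size_tb * 1e12 * 8                     # terabytes -> bits
    usable_bits_per_s = link_mbit_s * 1e6 * efficiency
    return size_bits / usable_bits_per_s / 3600        # seconds -> hours


if __name__ == "__main__":
    # Illustrative combinations only; replace them with your own numbers.
    for size_tb in (1, 10, 100):
        for link_mbit_s in (100, 1_000, 10_000):
            hours = transfer_time_hours(size_tb, link_mbit_s)
            print(f"{size_tb:>4} TB over {link_mbit_s:>6} Mbit/s: "
                  f"{hours:10.1f} h ({hours / 24:7.1f} days)")
```

The estimate ignores latency, retransmissions and any throttling by the provider, so real transfers will typically take longer. Even so, it shows why moving a large warehouse over a WAN is measured in days rather than minutes.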

3. Flash Storage

Third, flash storage is a new storage technology which can be integrated into traditional storage-centric IT architectures and into cloud storage. Flash significantly reduces the response time for reading and writing data. Today flash storage still has a high price tag per GB, so it is primarily used where it offers a high economic advantage. The earlier the results of data analytics or other data-intensive computations are available, the bigger the competitive advantage.

Given the falling prices for flash storage, we already see more and more flash in the mainstream market. The availability of flash DIMMs is a drastic change to the storage hierarchy (fast volatile storage close to the CPU vs. slow persistent storage on external devices). This will impact the architecture of operating systems and data-intensive applications such as relational databases. I would not be surprised if flash storage turned out to be a much more disruptive change to the IT industry than the evolutionary change imposed by cloud.


There is no doubt that future storage cloud implementations will integrate huge amounts of flash storage, enabling data analytics applications to tackle problems which we do not think of today. Consequently these three topics should be added to a potential third edition of our book, in addition to the typical minor updates and corrections.

What is your view? I am looking forward to receiving your comments, thoughts and suggestions!

Thanks to Axel Köster for reviewing an early draft of this posting and his suggestions for improvements. 

Updated on March 1, 2014, incorporating feedback from Sandeep Patil. Added a paragraph about the disruptive change triggered by cloud (storage).