The datacenter operating system discussed last week (https://florenttastet.wordpress.com/2014/04/06/are-we-failing-the-true-business-needs-part2/) is a strong pillar of business requirements, now more than ever, as datacenters are almost entirely virtualized, from the servers to the endpoints and, while we are at it, the applications.
The efforts deployed to solidify the foundation of tomorrow’s IT reality are paying off, as the true business needs revolve around the end user community. We are able to witness a strong focus on their needs through the technologies made available, and we are starting to feel the results: users are more mobile than ever, their access to applications is far more agile, and those same applications are far more responsive than, let’s say, 5 years ago.
On the storage side.
Call it vCenter, SDDC, Hyper-V, SCVMM, Azure, vCHS, Amazon, private or public cloud or any other form of aggregated compute-storage-application, a.k.a. “IT as a Service”: these solutions are unlocking complete flexibility for business needs. We all agree that without virtualization at its core, our datacenters and businesses wouldn’t be as strong as they are, and we wouldn’t be achieving the high level of application portability we have today, resulting in a much happier end user community that demands performance and accessibility.
I also mean “strong” in two ways: 1) it’s one of the most important elements of a successful business strategy that needs to redefine the way IT plays its role in the organization; 2) we should not forget how much “invisible” visibility it has to the end users.
As it stands today (and we all know “tomorrow” can be different), the datacenter operating system relies on different layers to enhance the various derived values of the impacts it causes. Today we will be reviewing the storage pillar of the datacenter and how it handles the various types of data it is asked to manage, handle, store, retrieve, optimize and align.
Structured or unstructured
Fundamentally, centralized storage arrays store data in blocks, and these blocks occupy sectors on our disks. I don’t want to go too deep on this subject but we all understand what I mean; this wiki can help: http://en.wikipedia.org/wiki/Disk_sector.
Aggregated, these blocks of data form datagrams, and once “retrieved” by the disk heads and reassembled by the controllers, these datagrams present the information we consume, by means of an enterprise application or a simple Word document. Regardless of the nature of the data, it needs to be stored and accessible when needed, and its nature should not be a concern.
Data accessibility is at the core of today’s most demanding applications or end users, or both. Essentially two types of data exist: structured and unstructured.
Take for example the various data we are creating almost every day now: Word documents, plain text files, templates, PowerPoint, Excel… etc. In short, with all of the end user’s data, we are creating sets of unstructured data that will be growing exponentially in the next 10 years (in fact, many studies say that it will explode to about 40ZB by 2020… that’s in 6 years… see the Ashworthcollege blog http://community.ashworthcollege.edu/community/fields/technology/blog/2012/03/13/dealing-with-the-explosion-of-data).
So this growth is all around us, and the data needs to be managed, evaluated, organized, accessible and reliable, as it remains the lifeblood of any organization.
On the other hand, there is the structured data. Primarily formed by anything related to databases or proprietary file systems, such as VMFS, NTFS, NFS… etc., this aspect of the data is, although not growing at the same pace, forming what today’s arrays need to deal with, and what tomorrow will look like for data management.
The structured data
As we are talking about the storage pillar of the datacenter, structured data is our concern for this blog. Not that unstructured data is unimportant, but it would not follow the spirit of the blog. I will come back to it later in another blog.
So structured data is what we’re dealing with when the datacenter operating system is concerned. It forms the largest concern when virtualization (or any form of database) is discussed, and while the end user community is unaware of this, they contribute daily to its growth by interacting with it from different angles and pushing the limits of the tools made available to them.
Whether we, as IT professionals, are providing direct access to the underlying disks through block based protocols, disk passthrough, RDM or any other form of direct disk access, or are leveraging file based protocols, the underlying array controllers need to manage the requests sent to them and retrieve the structured data in a very timely fashion.
We, as end users, are not really concerned about where or “how” the data is stored. Retrieving the information it contains is what we are concerned about, and moreover how quickly this data is made available to us.
Say, for example, you are on the phone with a customer and are required to enter data in a CRM system. You surely don’t want to be telling your client: “that won’t be long, my system is slow this morning”. Although understandable, it is often an uncomfortable message to deliver, and we all feel we are underperforming when this situation occurs… or worse, irritated.
The same applies when you’re trying to retrieve data from that same database; the last thing you want is to be waiting for the system to provide you with the data you’re looking for because the array is not able to aggregate the blocks of data fast enough for your expectations, or because the controllers are saturated, busy dealing with other requests from other systems.
You can now imagine how this would feel when, instead of a database system, you’re dealing with a virtual machine… but it could be even worse… picture that same database you’re working with, hosted on a virtual machine.
Structured data is one of the most commonly encountered types of data that end users are helping grow every day. Take a peek at http://www.webopedia.com/TERM/S/structured_data.html for a high level view of structured data.
Virtual machines are part of the structured data growth and are heavily contributing to the challenges we’re all facing day in, day out. We have sold our management on how efficient virtualization would be for the business, and it is now time to deliver on these words. However, virtualization is only as good as the underlying storage and how that same storage responds to our requests.
VMDKs or VHDs, to name only the two most popular, in fact reside on a file volume (either VMFS or NTFS), and these volumes need to be precisely stored on precisely architected array RAIDs to perform at the level they are expected to perform. Ever heard about partition alignment? Not an issue with Windows Server 2012, but it has long been a source of concern.
I believe that structured data is the most challenging aspect of architecting a reliable storage solution, for the sole reason that it is crucial to understand HOW this structured data will be accessed, when it will be accessed and definitely how often and by whom (read computerized system).
What arrays are dealing with
Sequential data access, random data access, unstructured data, structured data, performance, cache, cache extension, IOPS, RAIDs …etc are only a few of the most challenging aspects of data management arrays are requested to work with.
As the end users are consuming the data, creating data, modifying data, various forms of data access are daily managed and orchestrated by storage arrays. It requires a precise understanding of accessibility to everyone (read computerized systems), ensuring that every request is treated in order, respecting a tight SLA.
SEQUENTIAL data access is the simplest way of projecting how data will be accessed (http://en.wikipedia.org/wiki/Sequential_access). Defragmentation is surely a word you would need to master to remain on top of sequential accesses. Ensuring that the blocks accessed are in sequential order minimizes the time controllers wait for the heads to pull the data off the disks, resulting in faster datagram aggregation and presentation to the requestor (read computerized system).
On the opposite side, RANDOM data access is the toughest (http://en.wikipedia.org/wiki/Random_access). When you don’t know which data or block will be accessed, projecting the performance it requires becomes a challenge, and typically you find this challenge in structured data, at the data level itself.
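To make the difference concrete, here is a minimal sketch that compares how far a single disk head would travel when visiting the same blocks in sequential order and in a shuffled order. It is a toy model only: the `head_travel` helper, the 1,000-block range and the idea of measuring distance in "blocks" are assumptions for illustration, not how any real array accounts for seek time.

```python
import random


def head_travel(block_addresses):
    """Total distance (in blocks) a single head moves to visit the addresses in order."""
    addresses = list(block_addresses)
    return sum(abs(b - a) for a, b in zip(addresses, addresses[1:]))


BLOCKS = 1000  # hypothetical number of blocks in the file
sequential = list(range(BLOCKS))

# Same blocks, visited in a shuffled order (seeded so the run is repeatable).
shuffled = sequential[:]
random.Random(42).shuffle(shuffled)

print("sequential travel:", head_travel(sequential))
print("random travel:    ", head_travel(shuffled))
```

The sequential pass moves the head 999 blocks in total, while the shuffled pass moves it many times further for exactly the same data, which is the gap defragmentation tries to close.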
PERFORMANCE is a vast discussion but can generally be covered by IOPS, cache, controller memory and SSD. Often we see “IOPS” as the primary driver for performance. Right or wrong, it drives a large part of the outcome for applications. A single SAS drive generates about 175 IOPS. Once aggregated with others using a “RAID” (Redundant Array of Inexpensive Disks) technology (http://en.wikipedia.org/wiki/RAID) you will find yourself, in the case of a “RAID5+1” with, say, 5 disks, with an outcome of 175×5 = 875 IOPS (minus whatever the RAID parity costs in performance). Not bad. Now multiply this by 5 or 10 RAIDs, and the total outcome is 4,375 to 8,750 IOPS.
Cache, controller memory and SSD have been working tightly together lately to deliver the performance expected by the most demanding Tier 1 applications. Very successfully, this has allowed IT departments to rely on the various benefits centralized arrays have to offer, and we often see this as the second wave of application virtualization.
Just think about a single flash disk driving around 3,000 IOPS (http://www.storagesearch.com/ssd-slc-mlc-notes.html) and the performance picture is brighter all of a sudden.
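The arithmetic above can be written down as a small sketch. It uses the 175 IOPS per SAS drive quoted in this post and, like the figures above, ignores the RAID parity penalty; the function names are made up for illustration.

```python
SAS_IOPS = 175  # per-drive figure quoted in the post


def raid_group_iops(disks, iops_per_disk=SAS_IOPS):
    """Raw aggregate IOPS of one RAID group, ignoring any parity penalty."""
    return disks * iops_per_disk


def array_iops(groups, disks_per_group, iops_per_disk=SAS_IOPS):
    """Raw aggregate IOPS of several identical RAID groups."""
    return groups * raid_group_iops(disks_per_group, iops_per_disk)


print(raid_group_iops(5))   # 875
print(array_iops(5, 5))     # 4375
print(array_iops(10, 5))    # 8750
```

Real arrays will land below these raw numbers once parity writes, the read/write mix and controller limits are taken into account.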
See http://www.tomshardware.com/charts/hard-drives-and-ssds,3.html and http://www.storagesearch.com/ssd-jargon.html for more details on the drives and their performances.
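As a rough comparison, the sketch below counts how many SAS drives it would take to match one flash disk on raw IOPS, using only the two figures quoted in this post (about 175 IOPS per SAS drive and around 3,000 IOPS per flash disk). The helper name is hypothetical, and RAID overhead is ignored.

```python
import math

SAS_IOPS = 175     # per SAS drive, as quoted in the post
FLASH_IOPS = 3000  # per flash disk, as quoted in the post


def sas_drives_to_match(target_iops, iops_per_drive=SAS_IOPS):
    """Smallest number of drives whose raw combined IOPS reaches the target."""
    return math.ceil(target_iops / iops_per_drive)


print(sas_drives_to_match(FLASH_IOPS))  # 18
```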
While the hardware performance is “easy to forecast”, the overall performance is far more challenging, as it requires taking into consideration a multitude of angles: yes, IOPS, but also controller memory, data retention and the “type” of data (hot, meaning highly and frequently accessed; cool, meaning averagely accessed; or cold, meaning little accessed and almost ready to be archived).
In the case of virtualization, the data accessed is almost always “hot”, and here comes the challenge, as the controllers and many arrays were NOT built for such a requirement.
To address this challenge you’ll have to look at the data at the block level:
which blocks composing a structured file are the most accessed, and where should these blocks be hosted (stored) to maximize performance and storage; but moreover, where should the LEAST ACCESSED blocks composing that same file be stored, to minimize the capacity and performance impact they may have, while still servicing the rest of the datagram properly so as not to impact the retrieval SLA required by the application requesting them?
I know it’s a long phrase… but it says it all.
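The question above can be sketched in a few lines: count how often each block is accessed, keep the most accessed blocks on the fast tier and leave the rest on the capacity tier. This is a deliberately naive illustration; the `place_blocks` function, the access log and the two-slot fast tier are all assumptions, and real arrays use their own (often proprietary) tiering logic.

```python
from collections import Counter


def place_blocks(access_log, fast_tier_slots):
    """Split block IDs into (hot, cold) sets based on access counts.

    The most accessed blocks go to the fast tier until it is full;
    every other block seen in the log stays on the capacity tier.
    """
    counts = Counter(access_log)
    ranked = [block for block, _ in counts.most_common()]
    hot = set(ranked[:fast_tier_slots])
    cold = set(ranked[fast_tier_slots:])
    return hot, cold


# Hypothetical access log: each entry is the ID of a block that was read.
log = [7, 7, 7, 3, 3, 9, 7, 3, 1]
hot, cold = place_blocks(log, fast_tier_slots=2)
print("fast tier:    ", sorted(hot))   # [3, 7]
print("capacity tier:", sorted(cold))  # [1, 9]
```

Because access patterns change over time, a placement like this would need to be recomputed regularly rather than decided once.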
This is where the interesting part of a storage array supporting a virtual farm is. All arrays are dealing with data management in the same “manner”; they store the data, and access the data. Now when this data has a unique requirement, a unique need, a unique purpose, this is where arrays are getting differentiated.
The end users are again, in the storage pillar of the datacenter, creating a constant opportunity in data management. We, as end users, are growing our needs and are not ready to compromise, because why should we?
When storage is architected, we often focus on the hosted application, but rarely look at the users’ expectations and system requirements. We all ensure, from all possible angles, that the application hosted on the array will service the end user adequately.
We might take the right approach by looking at the application metrics and seating the data (blocks) where it will be best serviced, but as systems and data are accessed, this will change, and our agility at providing what the data needs when it needs it will set us apart when the big requirements hit our architectures.
It has become almost impossible to project how the data will be accessed in the future, or how it will perform. The reasons for that? Us, the end users, and how strong we are at changing or forming new opportunities.
We, as a species, have demonstrated over and over how we are strong at pushing the limits, and something stable today, can become unstable tomorrow simply because we asked for more and are doing more.
We need to adjust automatically and stop looking at “how” we should do it, but “where” it can land, and agnostically speaking, this is the real challenge when we’re architecting a solution for a business need.
The future is bright, let’s make sure our architectures are forecasting something that can’t be forecasted!
Last question, but not least: how do you back up that data? LOL, that’s another topic, but don’t forget about it.
Have a good weekend, friends.