Object Storage Demystified

I’ve seen a few definitions and watched a few presentations, and I’ve never really been able to easily and clearly articulate what object storage actually is! We all know it is an architecture that manages data as objects (rather than as blocks/sectors or a hierarchy), but I never really understood what an object was…! Might just be me, but after a bit of reading it clicked once I understood the characteristics of an object, e.g.:

  • An object is independent of the application, i.e. it doesn’t need an OS or an application to make sense of the data. This means a user can access the content (e.g. a JPEG, video or PDF) directly from a browser (over HTTP/HTTPS) rather than needing a specific application. No app servers are required, dramatically improving simplicity and performance (of course you can still access object storage via an application if needed).
  • Object storage is globally accessible, i.e. there is no requirement to move or copy data (between locations, across firewalls etc.)… instead, data is accessible from anywhere.
  • Object storage is highly parallelised: there are no locks on write operations, so hundreds of thousands of users distributed around the world can all write simultaneously; none of the users need to know about one another, and their behaviour does not impact others. This is very different to traditional NAS, where making data available at a secondary site means replicating it to another NAS platform that sits passive and cannot be written to directly.
  • Object storage is linearly scalable, i.e. there is no point at which we would expect performance to degrade; it can continue to grow, and there is no need to manage around limitations or constraints such as capacity or structure.
  • Finally, it’s worth noting that object platforms are extensible: capabilities can be extended easily without large implementation efforts. Examples include the ability to enrich data with metadata and to add policies such as retention, protection and restrictions on where data can live (compliance).

Object storage is a way of organising data by addressing and manipulating discrete units of data called objects. Each object, like a file, is a stream of binary data. However, unlike files, objects are not organised in a hierarchy of folders and are not identified by their path in that hierarchy. Each object is assigned a string key when it is created, and you retrieve an object by using that key to query the object store. As a result, all objects live in a flat namespace (one object cannot be placed inside another). This organisation eliminates dependencies between objects while retaining the fundamental functionality of a storage system: storing and retrieving data. Its main benefit is a very high level of scalability.
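To make the “key in a flat namespace” idea concrete, here is a minimal sketch using the AWS SDK for Python (boto3) against a generic S3-compatible endpoint; the endpoint URL, bucket name, key and credentials are placeholders of my own, not anything from a specific platform.

```python
# A minimal sketch of key-based access with boto3. Endpoint, bucket and
# credentials below are placeholders, not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.com",   # any S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY_PLACEHOLDER",
    aws_secret_access_key="SECRET_KEY_PLACEHOLDER",
)

# The key is just a string; "reports/2018/q1.pdf" is not a folder path,
# only a flat key that happens to contain slashes.
s3.put_object(Bucket="demo-bucket", Key="reports/2018/q1.pdf", Body=b"%PDF-1.4 ...")

# Retrieval is a simple lookup by the same key.
obj = s3.get_object(Bucket="demo-bucket", Key="reports/2018/q1.pdf")
print(obj["Body"].read()[:20])
```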

Both files and objects have metadata associated with the data they contain, but objects are characterised by their extended metadata. Each object is assigned a unique identifier, which allows a server or end user to retrieve the object without needing to know the physical location of the data. This approach is useful for automating and streamlining data storage in cloud computing environments. S3 and Swift are the most commonly used cloud object protocols. Amazon S3 (Simple Storage Service) is an online storage web service offered by Amazon Web Services, and its API is open to third-party developers; it is the most commonly used object storage protocol, so if you’re using third-party applications that rely on object storage, S3 will usually be the most compatible choice. Swift is a little less widespread than S3 but is still a very popular object protocol. The Swift protocol is managed by the OpenStack Foundation, a non-profit corporate entity established in September 2012 to promote OpenStack (a free and open-source cloud computing platform) and its community; more than 500 companies have joined the project. Below are some major differences between S3 and Swift.

Unique features of S3:

  • Bucket-level controls for versioning and expiration that apply to all objects in the bucket
  • Copy Object – This allows you to do server-side copies of objects
  • Anonymous Access – The ability to set PUBLIC access on an object and serve it via HTTP/HTTPS without authentication.
  • S3 stores its objects in buckets (a minimal sketch of some of these calls follows after this list).
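Here is a hedged sketch of those S3-specific features with boto3 (placeholder bucket and key names); versioning, server-side copy and per-object ACLs are standard S3 API calls, though support can vary between S3-compatible platforms.

```python
# A sketch of S3-specific features: bucket versioning, server-side copy and
# a public-read ACL. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Bucket-level versioning applies to every object in the bucket.
s3.put_bucket_versioning(
    Bucket="demo-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Server-side copy: the object data never travels through the client.
s3.copy_object(
    Bucket="demo-bucket",
    Key="copies/report.pdf",
    CopySource={"Bucket": "demo-bucket", "Key": "reports/2018/q1.pdf"},
)

# Anonymous access: mark a single object as publicly readable over HTTP/HTTPS.
s3.put_object_acl(Bucket="demo-bucket", Key="copies/report.pdf", ACL="public-read")
```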

Unique features of SWIFT

The Swift API allows unsized object creation: Swift is the only one of the two protocols where you can use “chunked” transfer encoding to upload an object whose size is not known beforehand. S3 requires multiple requests (multipart upload) to achieve the same thing. Swift stores its objects in “containers”.
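Below is a minimal sketch of what an unsized, chunked upload to a Swift endpoint can look like; the storage URL, container, token and file path are placeholders, and in practice the python-swiftclient library wraps this for you. The point is simply that the client never declares the total object size.

```python
# A minimal sketch of a chunked (unsized) upload to a Swift object endpoint.
# Passing a generator to `requests` makes it send chunked transfer encoding,
# so the total size never has to be declared up front.
import requests

def stream_chunks(path, chunk_size=64 * 1024):
    """Yield a file in fixed-size chunks without knowing its total size."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

resp = requests.put(
    "https://swift.example.com/v1/AUTH_demo/mycontainer/bigfile.bin",  # placeholder URL
    data=stream_chunks("/tmp/bigfile.bin"),
    headers={"X-Auth-Token": "TOKEN_PLACEHOLDER"},
)
resp.raise_for_status()
```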

Authentication (S3 vs SWIFT)

S3 – Amazon S3 uses an authorization header that must be present in all requests to identify the user (Access Key Id) and provide a signature for the request. An Amazon access key ID has 20 characters. Both HTTP and HTTPS protocols are supported.

SWIFT – Authentication in Swift is quite flexible. It is done through a separate mechanism creating a “token” that can be passed around to authenticate requests. Both HTTP and HTTPS protocols are supported.
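As a rough illustration of the difference, here is a hedged sketch of Swift’s classic v1.0 (TempAuth-style) token flow; the URL, account and key are placeholders. With S3, by contrast, the SDK derives a signature for every request from the access key ID and secret key, so there is no separate token step.

```python
# A sketch of Swift token-based authentication (classic /auth/v1.0 flow).
# The auth URL, account and key are placeholders; the response carries a
# token and a storage URL that are then sent with every subsequent request.
import requests

auth = requests.get(
    "https://swift.example.com/auth/v1.0",
    headers={
        "X-Auth-User": "demo:admin",        # account:user, placeholder
        "X-Auth-Key": "KEY_PLACEHOLDER",
    },
)
auth.raise_for_status()

token = auth.headers["X-Auth-Token"]
storage_url = auth.headers["X-Storage-Url"]

# Any later request just carries the token.
containers = requests.get(storage_url, headers={"X-Auth-Token": token})
print(containers.status_code, containers.text[:200])
```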

Retention and AUDIT (S3 vs SWIFT)

Retention periods are supported on all object interfaces including S3 and Swift. The controller API provides the ability to audit the use of the S3 and Swift object interfaces.

Large Objects (S3 vs SWIFT)

S3 Multipart Upload allows you to upload a single object as a set of parts. After all of these parts are uploaded, the data will be presented as a single object. OpenStack Swift Large Object is comprised of two types of objects: segment objects that store the object content, and a manifest object that links the segment objects into one logical large object. When you download a manifest object, the contents of the segment objects will be concatenated and returned in the response body of the request.
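A minimal sketch of an S3 multipart upload with boto3 (placeholder names and file path): each part is uploaded separately and a final call stitches them into one object, which is the S3 counterpart of Swift’s segments-plus-manifest approach.

```python
# S3 multipart upload: create the upload, send parts, then complete it.
# Bucket, key and local file path are placeholders.
import boto3

s3 = boto3.client("s3")
bucket, key = "demo-bucket", "backups/large.img"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

with open("/tmp/large.img", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(64 * 1024 * 1024)   # 64 MiB parts (S3 minimum is 5 MiB, except the last)
        if not chunk:
            break
        part = s3.upload_part(
            Bucket=bucket, Key=key, PartNumber=part_number,
            UploadId=mpu["UploadId"], Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": part["ETag"]})
        part_number += 1

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```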

So which object storage API should you use? Both have their benefits for specific use cases. DellEMC ECS is an on-premise object storage solution that supports multiple object protocols, such as S3, Swift, CAS, HTTPS, HDFS and NFSv3, all on a single platform. It is built on servers with DAS storage running the ECS software, and it is also available as software that can be deployed on your own servers.


There are many benefits to using ECS as your own object storage.


Understanding GDPR

For the past few months a lot has been spoken and written about GDPR compliance. The write-up below is an amalgamation of major takeaways from those articles and from the actual GDPR document (OK, I did not read the whole document, but I did read parts of it). If you find something I have missed and should have mentioned, please do point it out, because this is something we have to get right. I’ll start by highlighting some key aspects of GDPR:

  • What is GDPR
  • Key Regulatory Requirements
  • Role of IT Professionals
  • Actions for Compliance (12 Steps)

What is GDPR –

It’s not something entirely new: before GDPR we had the Data Protection Act, so if you already comply with that you will go through less pain, since a lot of elements are partially covered by it. The whole idea is to know how data about EU citizens is collected, where it resides, how it is stored, processed and deleted, who can access it and how it is used. Organizations will be required to show the data flow or life-cycle, to minimize any risk of personal data being leaked and to demonstrate that all required steps are in place under GDPR. In short, GDPR codifies common-sense data security ideas: minimize the collection of personal data, delete personal data that is no longer necessary, restrict access and secure data through its entire life-cycle. It adds requirements for documenting IT procedures, performing risk assessments under certain conditions, notifying consumers and authorities when there is a breach, and strengthening rules for data minimization.

Key Regulatory Requirements –

  • Privacy by Design: PbD is referenced heavily in Article 25 of the GDPR and in many other places in the new regulation. Privacy by Design focuses on minimizing data collection and retention, and it formalizes more explicitly the requirement to gain consent from consumers when processing their data. The idea is to minimize the collection of consumer data, minimize who you share the data with, and minimize how long you keep it. Less is more: less data for a hacker to take means a more secure environment. So the data points you collected from a web campaign over three years ago, maybe 15,000 email addresses along with favourite pet names, that now live in a spreadsheet no one ever looks at? Find them and delete them. If a hacker gets hold of that spreadsheet and uses it for phishing, you’ve created a security risk for your customers; and if the local EU authority can trace the breach back to your company, you can face heavy fines.
  • Data Protection Impact Assessments: When certain types of data associated with individuals are to be processed, companies will first have to analyze the risks to their privacy. This is another new requirement in the regulation. You may also need to run a DPIA if the nature, scope, context and purposes of your data processing place the rights and freedoms of individuals at high risk. If so, before data processing can commence, the controller must produce an assessment of the impact on the protection of personal data. Who exactly determines whether your organization’s processing presents a high risk to individuals’ rights and freedoms? The text of the GDPR is not specific, so each organization will have to decide for itself. If you find more details about this, please mention them in the comments below.
  • Right to Erase and to be Forgotten: Discussed in Article 17 of the GDPR, it states that “The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where … the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed; … the data subject withdraws consent on which the processing is based … the controller has made the personal data public and is obliged … to erase the personal data”. There’s been a long standing requirement in the DPD allowing consumers to request that their data be deleted. The GDPR extends this right to include data published on the web. This is the still controversial right to stay out of the public view and “be forgotten”. This means that in the case of a social media service that publishes personal data of a subscriber to the Web, they would have to remove not only the initial information, but also contact other web sites that may have copied the information. The new principle of extraterritoriality in the GDPR says that even if a company doesn’t have a physical presence in the EU but collects data about EU data subjects — for example, through a web site—then all the requirements of GDPR are in effect. In other words, the new law will extend outside the EU. This will especially affect e-commerce companies and other cloud businesses.
  • Breach Notification: A new requirement not in the existing DPD is that companies will have to notify data authorities within 72 hours after a breach of personal data has been discovered. Data subjects will also have to be notified but only if the data poses a “high risk to their rights and freedoms”. Breaches can be categorized according to the following security levels:
      • Confidentiality Breach: where there is an unauthorized or accidental disclosure of, or access to, personal data.
      • Integrity Breach: where there is an unauthorized or accidental alteration of personal data.
      • Availability Breach: where there is an accidental or unauthorized loss of access to, or destruction of, personal data (include where data has been deleted either accidentally or by an unauthorized person).
  • Fines: The GDPR has a tiered penalty structure that will take a large bite out of offenders’ funds. More serious infringements can merit a fine of up to 4% of a company’s global revenue. This can include violations of basic principles related to data security, especially PbD principles. A lesser fine of up to 2% of global revenue (still enormous) can be issued if company records are not in order or if a supervising authority and data subjects are not notified after a breach. This makes breach notification oversights a serious and expensive offense.

Role of IT Professionals

Information security today is not limited to the IT department of an organization. As businesses have evolved over time, so has the need for everyone in the business to contribute to the security of the organisation’s information and to protecting the personal data the organisation uses. You will notice that most GDPR webinars are attended by business managers, compliance people and the like, and these are the people responsible for operating and overseeing GDPR compliance: asking colleagues what data they hold, getting the company lawyer to update standard contract terms and write privacy notices. But they can’t really do all of this on their own, since they need IT for most of the actual work: providing a dump of the database schema that gives a guaranteed-correct picture of what is held, and, not to forget, the privileged access required to scan local hard disks and networked file shares for the millions of files we use in the form of documents, emails, spreadsheets, meeting notes and so on. It is extremely important to engage the IT team from the discovery phase; for example, most of us have hardly ever had to do this properly, because nobody has really been bothered to spend the money and ask what data you hold about them. The other thing you need to understand is whether there is a gap between how you think you work and how you actually work. Take backups: even though the customer’s backup strategy is documented, do you really understand how it is implemented by the tech teams? How your disk-to-disk-to-tape setup really works? Who transports the tapes to offsite storage? Do you destroy tapes when you say you will? If you’ve erased someone’s data on request, does the tech team re-delete the data from the live system if they’ve had to restore from backup?
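To make the IT side of that discovery work concrete, here is a deliberately trivial sketch of my own (not anything mandated by GDPR or any particular tool) that walks a file share and flags files that appear to contain email addresses; real discovery tooling handles office formats, mailboxes and databases and does far better pattern matching. The share path is a placeholder.

```python
# A naive discovery scan: walk a file share and report files whose contents
# look like they hold email addresses. Illustration only, not a real tool.
import os
import re

EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scan_share(root):
    """Yield (path, match_count) for files that appear to contain email addresses."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    hits = EMAIL_RE.findall(f.read(1024 * 1024))  # first 1 MiB only
            except OSError:
                continue  # unreadable file, skip it
            if hits:
                yield path, len(hits)

for path, count in scan_share(r"\\fileserver\shared"):   # placeholder share path
    print(f"{count:5d} address-like strings in {path}")
```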

Nearly every organization I have come across keeps some sort of backup, and not everyone is fully utilizing cloud infrastructure and backup tools. The data aspect is important: becoming compliant is one thing, but being able to quantify compliance is quite another. This applies especially to data protection admins (note: there is a reason I did not say backup administrators, since the data protection/management team should manage backups, archives, LTR copies etc.) who handle the data for the company and its customers. What customers need is a sound, tested data protection scheme that can also report well, and that is something which can be delivered by DellEMC DPS solutions.

Actions for Compliance

Below is a list of actions an organization needs to take in order to comply with GDPR. Notice that I have not mentioned any timeline, since different organizations have different data set sizes and may require more or less time to carry out the same set of actions.

  • Step 1 – Data Mapping: Identify and map your data processing activities, including data flows and use cases in order to create a comprehensive record of activities since GDPR requires you to keep detailed records of data processing activities. These records can be used to assess the compliance steps required by the business going forward and respond quickly to data breaches and to individuals who request their own data.
  • Step 2 – Privacy Governance / Data Protection Officer: Improve the corporate governance policies and structure to ensure that they are effective in achieving reasonable compliance throughout the business. Organizations that are in the EU or deal heavily with EU users’ data have to assign a Data Protection Officer who meets the GDPR criteria.
  • Step 3 – Data Sharing: Customers have to identify any data sharing with third parties, determine the role of those parties and put appropriate safeguards in place since GDPR imposes mandatory content for certain agreements and requires the clear assignment of roles and responsibilities.
  • Step 4 – Justification of Processing: Review or establish legal bases for processing for key use cases, then plan and implement remedial action to fill any compliance gaps. GDPR requires that all data processing has a legal basis, which makes some usage more difficult. GDPR also contains restrictions and additional obligations relating to the use of automated processing, including profiling.
  • Step 5 – Privacy Notices & Consents
  • Step 6 – Data Protection Impact Assessment: Assess whether the business carries out any “high risk” processing under the GDPR. If so, carry out a Data Protection Impact Assessment (DPIA) and, if necessary, consult with your supervisory authority, vendors (this is where we come in with NetWorker, DD, Avamar, Storage assessments as we can inform customer of their backup data, retention policies etc.).
  • Step 7 – Policies: Review and supplement the company’s existing suite of policies and processes dealing with data protection, including those dealing with data retention and integrity, such as data accuracy and relevance. The GDPR imposes stricter obligations to keep data accurate, proportionate and held no longer than necessary.
  • Step 8 – Individuals Rights: Organizations have to identify the new individual rights provided by the GDPR and establish procedures for dealing  with them. Review the procedures in place in order to comply with existing rights and set up any new internal procedures and processes, where required.
  • Step 9 – Data Quality, Privacy by Design: Organizations have to make sure that GDPR compliance is embedded, from the start, in all applications and processes that involve personal data. Default settings must comply with the GDPR.
  • Step 10 – International Data Transfers: Organizations have to make sure they Identify and review the data transfer mechanisms in place in order to comply with the GDPR. Fill any gaps, including entering into Standard Contractual Clauses with service providers and group companies.
  • Step 11 – Data Security & Breach Management Process: Review the data security measures in place to ensure they are sufficient and to assess whether the specific measures referred to in the GDPR are (or should be) in place. Review or establish an effective Data Breach Response Plan (this is where we can talk a bit about IRS, encryption, WORM functionality of DPS products.). The GDPR implements stricter requirements regarding appropriate technical and organizational data security measures. It also requires data breaches involving risk to individuals to be reported to supervisory authorities without delay and within 72 hours (unless a longer period can be justified); affected individuals must also be notified if the breach is high risk.
  • Step 12 – Roll out of Compliance Tools & Staff Training: Roll-out amended and new privacy notices and consent forms. Publish new and revised policies and procedures and conduct training of key personnel on GDPR compliance.

Complete GDPR information can be found at: https://gdpr-info.eu/

Buying a Software Defined Storage Solution

I am back! And not with a data protection write-up, because for the last few weeks I have not been able to put away the thought of Software Defined Storage. So once again, here is a backup guy talking about storage. I was curious about the current state of software defined storage in the industry and decided to get my hands dirty. I’ve done some research and reading on SDS over the course of the last month (well, actually more than that), and this is the crux of what I’ve learned from my teammates, customers and the people I work with.

What is SDS?

The term is used very broadly to describe many different products with various features and capabilities. It describes the trend towards data storage becoming independent of the underlying hardware: simply put, no more fancy SAN boxes. SDS is data storage software that provides policy-based provisioning and management of data storage, independent of the underlying hardware, virtualization or OS. It is hardware agnostic, which is good news considering the prices of servers and DAS.

I first looked at IDC and Gartner. IDC defines software defined storage solutions as solutions that deploy controller software (the storage software platform) that is decoupled from underlying hardware, runs on industry standard hardware, and delivers a complete set of enterprise storage services. Gartner defines SDS in two separate parts, Infrastructure and Management:

  • Infrastructure SDS uses commodity hardware such as x86 servers and JBODs and offers features through software orchestration. It creates and provides data center services to replace or augment traditional storage arrays.
  • Management SDS controls hardware but also controls legacy storage products to integrate them into a SDS environment. It interacts with existing storage systems to deliver greater agility of storage services.


The general Attributes of SDS

There are many characteristics of SDS; in fact, each vendor adds a new dimension to the offering, only making it better in the long run. I have put some key characteristics below, which are worth a look:

  • Hardware and Software Abstraction – SDS always includes abstraction of logical storage services and capabilities from the underlying physical storage systems. It does not really matter to the SDS software whether a server has SAS, SATA, SSD or a PCIe card as storage; all are welcome.
  • Storage Virtualization – External-controller-based arrays include storage virtualization to manage usage and access across the drives within their own pools; other products exist independently to manage across arrays and/or directly attached server storage.
  • Automation and Orchestration – SDS includes automation with policy-driven storage provisioning, and service-level agreements (SLAs) generally replace the precise details of the actual hardware.
  • Centralized Management – SDS includes management capabilities with a centralized point of management.
  • Enterprise Storage Features – SDS includes support for the features desired in an enterprise storage offering, such as compression, deduplication, replication, snapshots, data tiering and thin provisioning.

When and How to use SDS

There are a host of considerations when developing a software defined storage strategy. Below is a list of some of the important items to consider during the process.

  • Storage Management – You need to know how storage actually works: not just IOPS and raw performance, but details such as queue depth in the OS and how different applications see the same storage differently (a rough block-size throughput sketch follows after this list). You will have to test different settings to get the necessary performance for your environment. Not all applications require read cache or compression, but some do; that is just one example of how detailed this can get, and the more detail you go into, the better the SLA you deliver as a storage admin. Most of the time these nuances are overlooked, which costs dearly in any environment. So while designing or purchasing even ordinary storage, understand your application, because you buy storage not for its features but for the application that runs your business. Massive research is required to choose from so many options (vSAN, ScaleIO, Ceph etc.); all you need to remember is that you are doing this for your application, not to save money or to get fancy IOPS numbers, but to deliver the needed SLA.
  • Cloud and Microservices Integration – Yes, probably today you are not using cloud and you may not require Jenkins for your application deployment, but it’s only a matter of time before all of this is required. IT is moving faster than light these days (not really!) and the growth of data has created new avenues for where it can be used; data is the new oil. An intelligent SDS product has the capability to tier data across cloud platforms or even to “cheaper storage”. Does your software defined storage (in the case of object, for example) support S3, REST APIs, Swift, HDFS, CAS etc. all at once? Design the SDS solution so that it is future ready. To give you an outline: if you are using an SDS solution for block storage, you should look for Docker, virtualization and COTS integration, understand your vendor’s roadmap and see whether it aligns with your present and future requirements.
  • Expansion and Scalability – So how much can your SDS solution really scale? 10 nodes? 100 nodes? 1000 nodes? Well, the answer lies in your requirement; very few organizations need 1000+ nodes, but when we talk about scalability we also mean scalability without a drop in performance. A lot of vendors may be able to scale, but if performance suffers we are back to square one. There are different parameters for judging performance (the easy way is to just quote IOPS and throughput), ranging from SLAs and the service catalogue to stress testing. While procuring an SDS solution, remember that performance should only improve as the system grows (in capacity, nodes, controllers etc.). The other important thing to consider is how many sites it can support. A good SDS solution (block, object etc.) should be able to support multiple sites across the globe, by which I mean it should be able to replicate data selectively across sites and be managed from a single management console.
  • Architecture Matters – How does your SDS solution transfer data from the host to the device? How does an actual read or write occur, how does each solution differ, and what suits you? You will have to go into the details and understand them. With SDS the architecture and details matter much more, because you no longer have the luxury of the old SAN box; what you now have are servers and the disks inside them, so how do you get performance? By knowing the software you procured. You need to understand networking, OS, hypervisor, disk and RAM at the 101 and 201 level. Look for solutions that are stateless and do not depend on specific processes completing before they can move forward, thus averting bottlenecks. Have multiple detailed discussions with your vendor; the more you learn now, the fewer service issues you will have later.
  • Test, Test and Test Again before GO LIVE – Understanding your application is one thing and knowing how it will behave in your environment is another, so before you cut the red ribbon, TEST. Make sure you have left no stone unturned. Yes, you probably cannot do this for every application in your setup, but Tier-1 applications deserve it; don’t shy away, you will thank yourself later. Another important point: understand how the data will be migrated from the old legacy SAN array to the SDS solution, what the implications will be, whether there will be downtime and, if so, how much and how to minimize it. One of the original purposes of SDS was to be hardware agnostic, so there should be no reason to remove and replace all of your existing hardware. A good SDS solution should let you protect your existing investment in hardware rather than requiring a forklift upgrade. A new SDS implementation should complement your environment and protect your investment in existing servers, storage, networking, management tools and employee skill sets.
  • Functionalities and Features – So does your SDS solution perform deduplication? Compression? Encryption? Erasure coding? Replication? Let’s step back a bit: how many of these features do you actually need for your applications? Probably encryption, compression and replication for performance block storage, or erasure coding on object storage. Do not go for functionality you will never use; think of those features as side missions to your actual requirement. Understand what you really need (performance? scale? availability? IOPS?) and then decide.
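Referring back to the block-size point above, here is a very rough sketch of the kind of quick test I mean; the target path is a placeholder for a file on the volume under test, and a proper evaluation should use a real tool (fio, vdbench etc.), direct I/O and your actual application profile.

```python
# A crude sequential-write test at several block sizes. Illustration only;
# buffered writes plus a final fsync, so treat the numbers as relative.
import os
import time

TARGET = "/mnt/sds-volume/testfile"      # placeholder path on the SDS-backed volume
TOTAL = 256 * 1024 * 1024                # write 256 MiB per run

def write_run(block_size):
    """Return MiB/s for a sequential write of TOTAL bytes at block_size."""
    buf = os.urandom(block_size)
    fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.time()
    written = 0
    while written < TOTAL:
        written += os.write(fd, buf)
    os.fsync(fd)                         # make sure the data actually hit the storage
    os.close(fd)
    return (TOTAL / (1024 * 1024)) / (time.time() - start)

for bs in (4 * 1024, 8 * 1024, 64 * 1024, 1024 * 1024):
    print(f"{bs // 1024:5d} KiB blocks: {write_run(bs):8.1f} MiB/s")
```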

There are many use cases and benefits of SDS (file, block and object): expansion, automation, cloud integration, reduction in operational and management expenses, scalability, availability and durability, the ability to leverage COTS, and so on. A few weeks back I wrote a comparison between two major SDS block vendors; you can check it out here. There are many products, from DellEMC ScaleIO and VMware vSAN to DellEMC ECS and beyond. The one key takeaway I want you to have is to learn your environment, understand your setup in detail and then choose what suits you best!

Software Defined Storage for Block Workloads

Almost two months since I last wrote. No, I was not terribly busy, just procrastinating on my blog topics. Last week I was lucky to be part of a meeting that had nothing to do with data protection (I didn’t know this going in!). The customer I met has several thousand virtual machines (OK, around 9,000 VMs), and their concern is not data protection (at least initially) but performance. These VMs run web servers, databases and Hadoop clusters; some even hold cold archives, and so on. Obviously the storage and corresponding IOPS required here are mammoth. Also, just to make things clear, the infrastructure has VMware, Hyper-V, KVM, RHEV etc., and they have a bunch of storage equipment as well, from almost all of the major data storage vendors. In conversation with the customer, I learned their main concerns were cost, scalability, performance, data services, DR capabilities and integration with the ecosystem (applications, hypervisors, OS, network etc.). They had already tried almost every vendor and were “satisfied” but not particularly happy. They were looking for something software defined that could perform like enterprise storage, or even better, at their scale.

As Wikipedia describes it, “Software-defined storage (SDS) is a term for computer data storage software for policy-based provisioning and management of data storage independent of the underlying hardware. Software-defined storage typically includes a form of storage virtualization to separate the storage hardware from the software that manages it. The software enabling a software-defined storage environment may also provide policy management for features such as replication, thin provisioning, snapshots and backup.” Software-defined storage is a key driver of data center transformation. The enterprise features, availability, performance and flexibility of a data-center-grade SDS make it well suited for traditional array consolidation, private cloud/IaaS, and emerging approaches like DevOps and container microservices. Since SDS is hardware agnostic (it does not depend on the type of drive, disk or network), it can take advantage of new hardware releases immediately; with SDS you can leverage newer hardware on the market (such as NVMe drives) for performance and acceleration advancements.

Well then, what are the options in the market for SDS? Before I take the plunge into this topic, I want to clarify that in this post I will only be referring to software defined block storage (I will leave file and object for another day). If you perform a quick Google search, you will find almost everyone proclaiming their right to the throne of the SDS block kingdom. Before we choose a winner, I want to reiterate the requirements so that we can judge wisely. We need the following attributes: scalability (no, not terabytes; come on, that was the requirement of the late 2000s; petabytes, even zettabytes), performance (on almost all block sizes, not just 8K or 16K; this is important because different applications have different block sizes at which they deliver their best results, and the storage should adapt to that), COTS enabled (can I deploy the storage on commodity servers?), data services (snapshots, compression, replication, encryption etc.) and integration with the ecosystem (support for all OSes, hypervisors, container systems, microservices etc.). That seems a lot to ask from a single product, but this is how the dice rolls for block storage requirements. But hadn’t we already solved all of these issues with traditional SAN systems? Only for a while: as IT infrastructure scales, stateless systems managed by microservices are needed more and more to run “newer” applications and to optimize existing ones.

We all want what we can’t have; it’s normal human nature: a single globally distributed, unified storage system that is infinitely scalable, easy to manage, replicated between several data centers, and serves block devices, file systems and object, all without any issues, while delivering data services such as compression, snapshots and so on. However, this is not really possible, not at scale. The point is that some storage systems are built for IOPS, some for scale, some just for sprawl; with such different requirements it becomes extremely difficult to code one storage system for all the use cases. Adding more data services just increases the data hops between the daemons involved, reducing performance. As far as I can see, with present technology and trends it is difficult to build a storage system that does it all; not that unified storage does not work, but when you need performance, purpose-built is the way to go. I was a storage admin for some time in my earlier life, and I acknowledge that managing unified storage is much easier and simpler than managing multiple purpose-built appliances, but then again, if I need block performance I would bet my life on purpose-built software defined block storage.

ScaleIO and Ceph are the two strongest candidates to hold the baton for software defined block storage, but who is the real winner in terms of the attributes mentioned above? I will try to demystify this at the level of architecture and usability. Here is what Ceph delivers in a single piece of software, in a single go…

  • Scalable distributed block storage
  • Scalable distributed object storage
  • Scalable distributed file system storage
  • Scalable control plane that manages all of the above

To sweeten the deal, all of this is free, at almost any capacity (well, that depends; if you are a storage admin, you know what I mean). This is the Holy Grail of storage (almost!), and it is all open source. But as a technologist, if you look underneath the skin, remove the flesh and understand the skeleton of the software, there is a lot happening here. Let’s check what Ceph has in its kitty. As I mentioned earlier, the fundamental problem with any multi-purpose tool is that it makes compromises in each “purpose” it serves, for the simple reason that a multi-purpose storage system like Ceph is designed to do many things, and those different things interfere with each other. It’s like asking a toaster to toast (which is fine) and also to fry your steak (all the best with that!); with present technology and coding it is possible, but there are trade-offs. Ceph’s trade-off, as a multi-purpose tool, is the use of a single “object storage” layer. You have a block interface (RBD), an object interface (RADOSGW) and a filesystem interface (CephFS), all of which talk to an underlying object storage system (RADOS). Here is the Ceph architecture from their documentation:

[Diagram: Ceph architecture – RBD, RADOSGW and CephFS on top of RADOS]

RADOS itself is reliant on an underlying file system to store its objects. So the diagram should actually look like this:

[Diagram: Ceph architecture with the underlying file system layer included]
So in a given data path, for example a block written to disk, there is a high level of overhead:

[Diagram: layers traversed by a single block write in Ceph]

In contrast, a purpose-built block storage system that does not compromise and is focused solely on block storage, like DellEMC ScaleIO, can be significantly more efficient:

[Diagram: ScaleIO data path – SDC writing directly to SDS]

(Here, SDC is the ScaleIO Data Client, which hosts the application that requires the IOPS, and SDS is the ScaleIO Data Server, which pools storage from multiple SDS machines. A single server can act as both SDC and SDS.) This allows skipping two steps, but more importantly, it avoids complications and additional layers of indirection/abstraction, as there is a 1:1 mapping between the ScaleIO client’s block and the block(s) on disk in the ScaleIO cluster. By comparison, multi-purpose systems need a single unified way of laying out storage data, which can add significant overhead, even at smaller scales. Ceph, for example, takes any of its “client data formats” (object, file, block), slices them up into “stripes”, and distributes those stripes across many “objects”, each of which is distributed within replicated sets, which are ultimately stored on a Linux file system in the Ceph cluster. Here’s the diagram from the Ceph documentation describing this:

[Diagram: Ceph striping of client data across objects and replicated sets]

This is a great architecture if you are going to normalize multiple protocols, but it’s a terrible architecture if you are designing for high performance block storage only, for a simple reason: there are just too many calculations and “internal IOPS” for a heavy transactional workload. In terms of latency, Ceph’s situation gets much grimmer, with incredibly poor latency, almost certainly due to these architectural compromises.
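To make the “everything ends up as RADOS objects” point tangible, here is a minimal, hedged sketch using the Python librados bindings; it assumes a reachable Ceph cluster, a readable client keyring, an existing pool called “testpool” and the python3-rados package, all of which are my own placeholders. Every RBD block and CephFS file is ultimately stored through an interface much like this.

```python
# Writing and reading an object directly in RADOS, the layer underneath
# RBD, RADOSGW and CephFS. Config path, pool and object names are placeholders.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # uses the default client keyring
cluster.connect()

ioctx = cluster.open_ioctx("testpool")                 # an existing pool
try:
    # RBD/CephFS/RGW data all ends up as objects written much like this one.
    ioctx.write_full("demo-object", b"hello rados")
    print(ioctx.read("demo-object"))
finally:
    ioctx.close()
    cluster.shutdown()
```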

DellEMC ScaleIO is software that creates a server-based SAN from local application server storage (local or network storage devices). ScaleIO delivers flexible, scalable performance and capacity on demand and integrates storage and compute resources, scaling to hundreds of servers (also called nodes). As an alternative to traditional SAN infrastructures, ScaleIO combines hard disk drives (HDD), solid state disks (SSD), Peripheral Component Interconnect Express (PCIe) flash cards and NVMe drives to create a virtual pool of block storage with varying performance tiers. As opposed to traditional Fibre Channel SANs, ScaleIO has no requirement for a Fibre Channel fabric between the servers and the storage, which further reduces the cost and complexity of the solution. In addition, ScaleIO is hardware-agnostic and supports both physical and virtual application servers.

[Diagram: ScaleIO server-based SAN architecture]

It creates a Software-Defined Storage (SDS) environment that allows users to exploit the unused local storage capacity in any server. ScaleIO provides a scalable, high performance, fault tolerant distributed shared storage system. Once again it can be installed on VMware, Xen, Hyper-V, Bare Metal servers etc., you get the vibe.

There are other problems besides performance with a multi-purpose system. The overhead I outlined above also means the system has to be hefty just to do its internal jobs; every new task or purpose it takes on adds overhead in terms of business logic, processing time and resources consumed. In most common configurations ScaleIO, being purpose-built, takes less of the host system’s resources such as memory and CPU. Ceph would take significantly more resources than ScaleIO, making it a poor choice for hyper-converged, semi-hyper-converged and scale-out deployments. This means that if you built two separate configurations of Ceph and ScaleIO designed to deliver the same performance levels, ScaleIO would have significantly better TCO, just factoring in the cost of the more expensive hardware required to support Ceph’s heavyweight footprint. So purpose-built software not only promises and delivers performance but also cost effectiveness. I stumbled upon an old YouTube video (https://www.youtube.com/watch?v=S9wjn4WN4tE) showcasing how, on block storage, ScaleIO performs better than Ceph on similar compute resources. If you watch the video in its entirety, it clearly shows that ScaleIO exploits the underlying resources much more efficiently, making it more scalable over time.

If you want to build a relatively low cost, high performance, distributed block storage system that supports bare metal, virtual machines and containers, then you need something purpose built for block storage (performance matters!). You need a system optimized for block: ScaleIO. If you haven’t already, check out ScaleIO, which is free to download and use at whatever size you want; installation is very easy and it can be up and running in less than 10 minutes. Run these tests yourself and report the results if you like. I am adding some documentation for ScaleIO which I found extremely useful for understanding the way it works: the ScaleIO Architecture Guide. I will be writing more about SDS, specifically about its native data services like snapshots (as they pertain to data protection) and ways to protect it via enterprise backup software and data protection appliances.