The following is excerpted from my just-released book Windows Azure Data Storage (Wiley Press, Oct 2013). And, since the format is eBook only, there will be updates to the content as new features are added to the Azure Data Storage world.
Business craves data.
As a developer, this is not news to you. The people running businesses have wanted it for years. They demand data about how many widgets have been ordered, how much inventory is available to be used in manufacturing, how many accounts are more than 45 days past due. More recently, the corporate appetite for data has spread way past these snacks. They want to store information about how individual consumers navigate through their website. They want to keep track of different metrics about the machines used in the manufacturing process. They have hundreds of MB of documents, spreadsheets, pictures, audio, and video files that need to be stored and managed. And the volume of data that is collected grows by an obscene amount every single day.
What businesses plan on doing with this information depends greatly on the industry, as well as the type and quality of the data. Inevitably, the data needs to be stored. Fortunately (or it would be an incredibly short book) Windows Azure has a number of different data storage technologies that are targeted at some of the most common business scenarios. Whether you have transient storage requirements or the need for a more permanent resting place for your data, Windows Azure is likely to have you covered.
Business Scenarios for Storage
A feature without a problem to solve is like a lighthouse on a sunny day—no one really notices and it’s not really helping anyone. To ensure that the features covered in this book don’t meet the same fate, the rest of this chapter maps the Windows Azure Data Storage components and functionality onto problems that you are likely already familiar with. If you haven’t faced them in your own workplace, then you probably know people or companies that have. At a minimum, your own toolkit will be enriched by knowing how you can address common problems that may come up in the future.
NoSQL
A style of data storage that has recently received a lot of attention in the development community is NoSQL. While the immediate impression, given the name, is that the style considers SQL to be anathema, this is not the case. The name actually means Not Only SQL.
To a certain extent, the easiest way to define NoSQL is to look at what it’s not, as well as the niche it tries to fill. There is no question that the amount of data stored throughout the world is vast. And the volume is increasing at an accelerating rate. Studies indicate that over the course of four years (2008-2012), the total amount of digital data has increased by 500 percent. While this is not quite exponential growth, it is very steep linear growth. What is also readily apparent is that this growth is not likely to plateau in the near future.
Now think for a moment about how you might model a large collection of web pages, and the links between them, using a relational database. For relational databases, you would need tables and columns with foreign key relationships. For instance, start with a page table that has a URL column in it. A second table containing the links from that page to other pages would also be created. Each record in the second table would contain the key to the first page and the key to the linked-to page. In the relational database world, this is commonly how many-to-many relationships are created. While feasible, querying against this structure would be time-consuming, as every single link in the network would be stored in that one, single table. And to this point, the contents of the page have not yet been considered.
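The page and link tables described above can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under stated assumptions: the table names, column names, and the helper functions build_link_db and outbound_links are hypothetical, and SQLite stands in for whatever relational database you would actually use.

```python
import sqlite3

def build_link_db(links):
    """Create an in-memory database of pages and the links between them.

    `links` is an iterable of (source_url, target_url) pairs.
    All table and column names here are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            id  INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
        -- One row per link: the many-to-many relationship between pages.
        CREATE TABLE link (
            from_page_id INTEGER NOT NULL REFERENCES page(id),
            to_page_id   INTEGER NOT NULL REFERENCES page(id),
            PRIMARY KEY (from_page_id, to_page_id)
        );
    """)
    for source, target in links:
        # Make sure both pages exist before recording the link.
        for url in (source, target):
            conn.execute("INSERT OR IGNORE INTO page (url) VALUES (?)", (url,))
        conn.execute(
            """INSERT OR IGNORE INTO link (from_page_id, to_page_id)
               SELECT s.id, t.id FROM page s, page t
               WHERE s.url = ? AND t.url = ?""",
            (source, target),
        )
    conn.commit()
    return conn

def outbound_links(conn, url):
    """Return the URLs a page links to; note that this needs two joins."""
    rows = conn.execute(
        """SELECT target.url
           FROM page AS source
           JOIN link ON link.from_page_id = source.id
           JOIN page AS target ON target.id = link.to_page_id
           WHERE source.url = ?
           ORDER BY target.url""",
        (url,),
    )
    return [row[0] for row in rows]
```

Even the simple question "which pages does this page link to?" touches the link table and joins the page table twice, which hints at why the approach becomes expensive as the number of links grows.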
NoSQL is designed to address these issues. To start, it is not a relational data store. There is no fixed schema, and querying does not require any joins to be performed. At least, not in the traditional sense. Instead, NoSQL is a variation (depending on the implementation) of the key-value paradigm. In the Windows Azure world, different forms of NoSQL-style storage are provided through Tables and Blobs.
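To make the key-value idea concrete, here is a minimal sketch that uses an ordinary Python dictionary as a stand-in for a key-value store. It is not the Azure Tables or Blobs API; the class KeyValueStore and its put and get methods are hypothetical names chosen for illustration.

```python
class KeyValueStore:
    """An in-memory stand-in for a key-value store (illustration only)."""

    def __init__(self):
        self._items = {}

    def put(self, key, **properties):
        # No fixed schema: each item carries whatever properties it needs.
        self._items[key] = dict(properties)

    def get(self, key):
        # A lookup by key replaces the joins a relational model would need.
        return self._items.get(key)


store = KeyValueStore()
store.put("a.example/home", title="Home", links=["a.example/about"])
store.put("a.example/about", links=[], author="hypothetical author")

home = store.get("a.example/home")
```

The two items have different sets of properties, and the links for a page are stored with the page itself, so retrieving them is a single lookup rather than a query across tables.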
Big Data
Any discussion of NoSQL tends to lead into the topic of Big Data. As a concept, Big Data has been generating a lot of buzz over the last 12-18 months. Yet, like the cloud before it, people find it challenging to define Big Data specifically. Sure, they know it’s “Big,” and they know that it’s “Data,” but beyond that, there is not a high level of agreement or understanding of the purpose and process of collecting and evaluating Big Data.
Most frequently, you read about Big Data in the context of Business Intelligence (BI). The goal of BI is to provide decision makers with the important information they need to make the choices that are inevitable in any organization. In order to achieve this goal, BI needs to gain access to data from a variety of sources within an organization, rationalize the definitions (i.e., make sure that the definitions for common terms are the same across the different data sources), and present visualizations of the information to the user.
Based on the previous section, you might see why Big Data and NoSQL are frequently covered together. NoSQL supports large volumes of semi-structured data, and Big Data produces large volumes of semi-structured information. It seems like they are made for one another. Under the covers, they are. However, to go beyond Table and Blob Storage, the front for Big Data in Windows Azure is Apache Hadoop. Or, more accurately, the Azure HDInsight Services.
Relational Data
For the vast majority of developers, relational data is what immediately springs to mind when the term Data is mentioned. But since relational data has been intertwined with computers since early in the history of computer programming, this shouldn’t be surprising.
With Windows Azure, there are two areas where relational data can live. First, there are Windows Azure Virtual Machines (Azure VMs), which are easy to create and can contain almost any database that you can imagine. Second, there is Windows Azure SQL Database. How you can configure, access, and synchronize data with both of these is covered in detail in the book.
Messaging
Messaging, message queues, and service buses have a long and occasionally maligned history. The concepts behind messages and message queues are quite old (in technology terms) and, when used appropriately, are incredibly useful for implementing certain application patterns. In fact, many developers take advantage of the message pattern when they use seemingly non-messaging-related technologies such as Windows Communication Foundation (WCF). If you look under the covers of guaranteed, in-order delivery over protocols that don’t support such functionality (cough…HTTP…cough), you will see a messaging structure being used extensively.
In Windows Azure, basic queuing functionality is offered through Queue Storage. It feels a little odd to think of a message queue as a storage medium, yet ultimately that’s what it is. An application creates a message and posts it to the appropriate queue. That message sits there (that is to say, is stored) until a second application decides to remove it from the queue. So, unlike the data in a relational database, which is stored for long periods of time, Queue Storage is much more transient. But it still fits into the category of storage.
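The post-then-remove flow described above can be sketched with an in-memory queue. This is a stand-in for the idea only, not the Queue Storage API; the class MessageQueue and its post and remove methods are hypothetical names.

```python
from collections import deque

class MessageQueue:
    """An in-memory stand-in for a message queue (illustration only)."""

    def __init__(self):
        self._messages = deque()

    def post(self, body):
        # The first application adds a message; it is stored until removed.
        self._messages.append(body)

    def remove(self):
        # The second application takes the oldest message, or None if empty.
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


orders = MessageQueue()
orders.post("order-1")
orders.post("order-2")
waiting = len(orders)      # both messages are stored until someone removes them
first = orders.remove()    # the consumer receives the oldest message first
```

Between post and remove, the message simply sits in the queue, which is the sense in which a queue is a (transient) form of storage.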
Windows Azure Service Bus is conceptually just an extension of Queue Storage. Messages are posted to and popped from the Service Bus. However, it also provides the ability for messages to pass between different networks, through firewalls, and even across corporate boundaries. Additionally, there is no requirement to open up an endpoint on either side of the communications channel that would expose the participant to external attacks.
Summary
It should be apparent even from just these sections that the level of integration between Azure and the various tools (both for developers and administrators) is quite high. This may not seem like a big deal, but anything that can improve your productivity is important. And deep integration definitely fits into that category. In addition, the features in Azure are priced to let you play with them at low or no cost. Most features have a long-enough trial period so that you can feel comfortable with the capabilities. Even after the trial, Azure bills based on usage, which means you would only be paying for what you use.
The goal of the book is to provide you with more details about the technologies introduced in this chapter. While the smallest detail of every technology is not covered, there is more than enough information for you to get started on the projects that you need to determine Azure’s viability in your environment.