Big Data Cloud Database & Computing
The rise of cloud computing and cloud data stores has been a precursor to and facilitator of the emergence of big data. Cloud computing is the commodification of computing time and data storage by means of standardized technologies.
It has significant advantages over traditional physical deployments. However, cloud platforms come in several forms and sometimes have to be integrated with traditional architectures.
This leads to a dilemma for decision makers in charge of big data projects: which kind of cloud computing is the optimal choice for their computing needs, and how should it be adopted? Big data projects regularly exhibit unpredictable, bursting, or immense computing power and storage needs. At the same time, business stakeholders expect swift, inexpensive, and dependable products and project outcomes. This article introduces cloud computing and cloud storage, describes the core cloud architectures, and discusses what to look for and how to get started with cloud computing.
A decade ago, an IT project or start-up that needed reliable and Internet-connected computing resources had to rent or place physical hardware in one or several data centers. Today, anyone can rent computing time and storage of any size. The range extends from virtual machines barely powerful enough to serve web pages to the equivalent of a small supercomputer. Cloud services are mostly pay-as-you-go, which means that for a few hundred dollars anyone can enjoy a few hours of supercomputer power. At the same time, cloud services and resources are globally distributed. This setup ensures a level of availability and durability unattainable by all but the largest organizations.
The cloud computing space was dominated by Amazon Web Services until recently. Increasingly serious alternatives are emerging, such as Google Cloud Platform, Microsoft Azure, Rackspace, or Qubole, to name only a few. Importantly for customers, a struggle over platform standards is underway. The two front-running solutions are Amazon Web Services compatible solutions, i.e. Amazon’s own offering or offerings from companies with compatible application programming interfaces, and OpenStack, an open source project with wide industry backing. Consequently, the choice of a cloud platform standard determines which tools are available and which alternative providers offer the same technology.
Professional cloud storage needs to be highly available, highly durable, and able to scale from a few bytes to petabytes. Amazon’s S3 cloud storage and Microsoft Azure Blob Storage are the most prominent solutions in the space. They promise in the range of 99.9% monthly availability and 99.999999999% durability per year. The availability figure corresponds to less than an hour of outage per month. The durability can be illustrated with an example: a customer who stores 10,000 objects can expect to lose one object every 10,000,000 years on average. Providers achieve this by storing data in multiple facilities, with error checking and self-healing processes to detect and repair errors and device failures. This is completely transparent to the user and requires no action or knowledge.
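The availability and durability figures above can be checked with a little arithmetic. The following is a minimal sketch, assuming a 30-day month and independent object losses; the function names are illustrative and not part of any provider's API.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def max_downtime_minutes(availability, hours=HOURS_PER_MONTH):
    """Longest outage (in minutes) that still meets the availability target."""
    return (1.0 - availability) * hours * 60


def years_per_lost_object(annual_durability, object_count):
    """Average number of years between single-object losses."""
    expected_losses_per_year = (1.0 - annual_durability) * object_count
    return 1.0 / expected_losses_per_year


if __name__ == "__main__":
    # 99.9% monthly availability allows roughly 43 minutes of outage.
    print(round(max_downtime_minutes(0.999), 1))
    # Eleven nines of durability with 10,000 objects: roughly 10 million years.
    print(round(years_per_lost_object(0.99999999999, 10_000)))
```

Because the loss probability is so small, floating point rounding makes the second result approximate rather than exactly ten million.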
A company could build a similarly reliable storage solution, but it would require tremendous capital expenditure and pose operational challenges. Global data-centric companies like Google or Facebook have the expertise and scale to do this economically. Big data projects and start-ups, however, benefit from using a cloud storage service. They can trade capital expenditure for operational expenditure, which is attractive since it requires no capital outlay and carries less risk. From the first byte, it provides reliable and scalable storage of a quality otherwise unachievable.
This gives new products and projects a viable option to start on a small scale at low cost. When a product proves successful, these storage solutions scale virtually indefinitely. Cloud storage is effectively a boundless data sink. Importantly for computing performance, many solutions also scale horizontally, i.e. when data is copied in parallel by cluster or parallel computing processes, the throughput scales linearly with the number of nodes reading or writing.
Cloud computing employs virtualization of computing resources to run numerous standardized virtual servers on the same physical machine. Cloud providers thereby achieve economies of scale, which permit low prices and billing based on small time intervals, e.g. hourly.
This standardization makes it an elastic and highly available option for computing needs. The availability is not obtained by spending resources to guarantee the reliability of a single instance but by the interchangeability of instances and a practically limitless pool of replacements. This impacts design decisions and requires applications to deal with instance failure gracefully.
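Dealing with instance failure gracefully usually means treating instances as disposable and retrying work on a replacement. The following is a minimal sketch of that idea; InstanceFailure, run_with_replacement, and the launch_instance callback are hypothetical names for illustration and do not correspond to any provider's API.

```python
class InstanceFailure(Exception):
    """Raised when a (simulated) virtual instance dies mid-task."""


def run_with_replacement(task, launch_instance, max_attempts=5):
    """Run task on an instance; on failure, retry on a fresh replacement."""
    last_error = None
    for _ in range(max_attempts):
        instance = launch_instance()  # every attempt gets a new instance
        try:
            return task(instance)
        except InstanceFailure as err:
            last_error = err  # drop the broken instance and try another
    raise RuntimeError(f"task failed on {max_attempts} instances") from last_error


if __name__ == "__main__":
    launched = []

    def launch_instance():
        # Hypothetical stand-in for a provider call that starts an instance.
        launched.append(f"instance-{len(launched) + 1}")
        return launched[-1]

    def flaky_task(instance):
        # Simulate the first two instances failing.
        if instance in ("instance-1", "instance-2"):
            raise InstanceFailure(instance)
        return f"done on {instance}"

    print(run_with_replacement(flaky_task, launch_instance))
```

The design choice here is that no effort goes into repairing a failed instance; it is simply discarded, which mirrors the interchangeability argument made above.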
The implications for an IT project or company using cloud computing are significant and change the traditional approach to planning and utilization of resources. Firstly, resource planning becomes less important. It is still required for costing scenarios to establish the viability of a project or product. However, the focus needs to shift to deploying and removing resources automatically based on demand. Vertical and horizontal scaling become viable once a resource becomes easily deployable.
Vertical scaling refers to the ability to replace a single small computing resource with a bigger one to account for increased demand. Cloud computing supports this by offering various resource types to switch between. This also works in the opposite direction, i.e. switching to a smaller and cheaper instance type when demand decreases. Since cloud resources are commonly paid for on a usage basis, no sunk costs or capital expenditures block fast decision making and adaptation. Demand is difficult to anticipate despite planning efforts, which in most traditional projects naturally results in over- or under-provisioned resources. Therefore, traditional projects tend to waste money or provide poor outcomes.
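Switching between instance sizes as demand changes can be sketched as a simple selection problem. The catalogue below is entirely hypothetical (names, capacities, and prices are made up for illustration), and the function shows only the selection logic, not any provider's resizing API.

```python
# Hypothetical catalogue: (name, capacity in requests per second, price per hour).
INSTANCE_TYPES = [
    ("small", 100, 0.02),
    ("medium", 400, 0.08),
    ("large", 1600, 0.32),
]


def pick_instance_type(demand_rps, catalogue=INSTANCE_TYPES):
    """Return the cheapest instance type whose capacity covers the demand."""
    candidates = [t for t in catalogue if t[1] >= demand_rps]
    if not candidates:
        raise ValueError("demand exceeds the largest instance type")
    return min(candidates, key=lambda t: t[2])


if __name__ == "__main__":
    for demand in (50, 300, 1200):
        print(demand, pick_instance_type(demand)[0])
```

Re-running the selection whenever demand changes moves the workload up to a bigger type or back down to a cheaper one.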
Cloud Big Data Challenges
Horizontal scaling achieves elasticity by adding additional instances, each of them serving a part of the demand. Software like Hadoop is specifically designed as a distributed system to take advantage of horizontal scaling. It processes small independent tasks on a massively parallel scale. Distributed systems can also serve as data stores, like the NoSQL databases Cassandra or HBase, or as filesystems, like Hadoop’s HDFS. Alternatives like Storm provide coordinated stream data processing in near real-time through a cluster of machines with complex workflows.
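Scaling by adding instances comes down to two steps: deciding how many instances the demand requires and dealing the independent tasks out to them. A minimal sketch follows, assuming identical instances with a known capacity; both function names are illustrative.

```python
import math


def instances_needed(demand_rps, capacity_per_instance_rps, minimum=1):
    """Number of identical instances required to serve the demand."""
    if capacity_per_instance_rps <= 0:
        raise ValueError("capacity per instance must be positive")
    return max(minimum, math.ceil(demand_rps / capacity_per_instance_rps))


def partition(tasks, instance_count):
    """Deal independent tasks round-robin onto the available instances."""
    return [tasks[i::instance_count] for i in range(instance_count)]


if __name__ == "__main__":
    count = instances_needed(950, 100)
    print(count)
    print(partition(list(range(7)), 3))
```

Round-robin partitioning only works because the tasks are independent; tasks that share state would need coordination that this sketch does not attempt.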
The interchangeability of resources, together with distributed software design, absorbs both the failure and the scaling of virtual computing instances without disruption. Spiking or bursting demand can be accommodated just as well as seasonality or continued growth.
Renting practically unlimited resources for short periods allows one-off or periodic projects at modest expense. Data mining and web crawling are great examples. It is conceivable to crawl huge web sites with millions of pages in days or hours for a few hundred dollars or less. Inexpensive tiny virtual instances with minimal CPU resources are ideal for this purpose since most of the time spent crawling the web is spent waiting for IO. Instantiating thousands of these machines to achieve millions of requests per day is easy and often costs only a fraction of a cent per instance hour.
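A rough cost estimate for such a crawl is simple to compute. The sketch below is illustrative only: the request rate per instance and the hourly price are assumed inputs, not figures from any provider's price list.

```python
import math


def crawl_plan(pages, requests_per_instance_hour, deadline_hours,
               price_per_instance_hour):
    """Return (instances, total_cost) to fetch `pages` within the deadline."""
    total_instance_hours = pages / requests_per_instance_hour
    instances = math.ceil(total_instance_hours / deadline_hours)
    cost = instances * deadline_hours * price_per_instance_hour
    return instances, cost


if __name__ == "__main__":
    # Assumed numbers: 5 million pages in 24 hours, 1,000 requests per
    # instance hour, at half a cent per instance hour.
    instances, cost = crawl_plan(5_000_000, 1_000, 24, 0.005)
    print(instances, round(cost, 2))
```

The estimate bills every instance for the full deadline, so it slightly overstates the cost when the last instance finishes early.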
Of course, such mining operations should be mindful of the resources of the web sites or application interfaces they mine, respect their terms, and not impede their service. A poorly planned data mining operation is equivalent to a denial of service attack. Lastly, cloud computing is naturally a good fit for storing and processing the big data accumulated from such operations.
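One common way to avoid impeding a site is to enforce a minimum delay between requests to the same host. The following is a minimal sketch of such a throttle; PoliteThrottle is a hypothetical helper, the delay value is an assumption to be tuned per site, and a real crawler would also need to honor each site's terms and robots rules.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait(self, host):
        """Block until `host` may be contacted again, then record the request."""
        now = self.clock()
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request[host] = now


if __name__ == "__main__":
    throttle = PoliteThrottle(min_delay_seconds=0.5)
    for _ in range(3):
        throttle.wait("example.com")  # call before each request to the host
        print("request allowed at", round(time.monotonic(), 2))
```

The clock and sleep functions are injectable so the throttle can be exercised without real waiting. Note that this limits a single process only; thousands of instances crawling the same site would need a shared limit.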
Three main cloud architecture models have developed over time: private, public, and hybrid cloud. They all share the idea of resource commodification and to that end usually virtualize computing and abstract storage layers.
Private clouds are dedicated to one organization and do not share physical resources. The resources can be provided in-house or externally. A typical underlying requirement of private cloud deployments is security: regulations may demand a strict separation of an organization’s data storage and processing from accidental or malicious access through shared resources. Private cloud setups are challenging since the economic advantages of scale are usually not achievable within most projects and organizations despite the utilization of industry standards. The return on investment rarely matches that of public cloud offerings, and the operational overhead and risk of failure are significant.
Additionally, cloud providers have responded to the trend towards increased security and provide special environments, i.e. dedicated hardware to rent and encrypted virtual private networks as well as encrypted storage, to address most security concerns. Cloud providers may also offer data storage, transfer, and processing restricted to specific geographic regions to ensure compliance with local privacy laws and regulations.
Another reason for private cloud deployments is legacy systems with special hardware needs or exceptional resource demands, e.g. extreme memory or computing instances that are not available in public clouds. These are valid concerns. However, if these demands are extraordinary, the question of whether a cloud architecture is the correct solution has to be raised. One reasonable approach is to establish a private cloud for a transitional period to run legacy and demanding systems in parallel while their services are ported to a cloud environment, culminating in a switch to a cheaper public or hybrid cloud.
Public clouds share physical resources for data transfers, storage, and processing. However, customers have private virtualized computing environments and isolated storage. Security concerns, which entice a few to adopt private clouds or custom deployments, are irrelevant for the vast majority of customers and projects. Virtualization makes access to other customers’ data extremely difficult.
Real-world problems around public cloud computing are more mundane, like data lock-in and fluctuating performance of individual instances. Data lock-in is a soft measure and works by making data inflow to the cloud provider free or very cheap, while copying data out to local systems or other providers is often more expensive. This is not an insurmountable problem and in practice encourages customers to use more services from one cloud provider instead of moving data in and out for different services or processes. Moving data back and forth is usually not sensible anyway due to network speed and the complexity of dealing with multiple platforms.
The varying performance of instances typically stems from the load other customers generate on the shared physical infrastructure. In addition, the physical infrastructure providing the virtual resources changes and is updated over time. The resources available to each customer on a physical machine are usually throttled to ensure that each customer receives a guaranteed level of performance. Larger instances generally deliver very predictable performance since they are much more closely aligned with the performance of the physical machine. Horizontally scaling projects with small instances should not rely on an exact performance of each instance but should be adaptive, focus on the average performance required, and scale according to need.
The hybrid cloud architecture merges private and public cloud deployments. This is often an attempt to achieve both security and elasticity, or to provide a cheaper base load together with burst capabilities. Some organizations experience short periods of extremely high load, e.g. as a result of seasonality like Black Friday for retail, or marketing events like sponsoring a popular TV event. These events can have a huge economic impact on organizations if they are serviced poorly.
The hybrid cloud provides the opportunity to serve the base load with in-house services and to rent, for a short period, a multiple of those resources to service the extreme demand. This requires a great deal of operational ability in the organization to scale seamlessly between the private and public cloud. Tools for hybrid or private cloud deployments exist, like Eucalyptus for Amazon Web Services. In the long term, the additional expense of the hybrid approach is often not justifiable since cloud providers offer major discounts for multi-year commitments. This makes moving base load services to the public cloud attractive since it is accompanied by a simpler deployment strategy.
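The base load and burst split can be expressed in a few lines. This is a minimal sketch under the assumption that demand and capacity are measured in the same unit (for example, requests per second); split_load and the demand profile are illustrative.

```python
def split_load(demand, private_capacity):
    """Split demand into (served in-house, burst to the public cloud)."""
    in_house = min(demand, private_capacity)
    burst = max(0, demand - private_capacity)
    return in_house, burst


if __name__ == "__main__":
    # Assumed hourly demand profile with one short spike.
    hourly_demand = [80, 90, 350, 120, 70]
    for demand in hourly_demand:
        print(demand, split_load(demand, private_capacity=100))
```

The arithmetic is trivial; the operational difficulty noted above lies in actually routing the burst portion to rented resources quickly enough.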
Keep it simple
Organizations that are faced with architecture decisions should evaluate their security concerns or legacy systems ruthlessly before accepting a potentially unnecessarily complex private or hybrid cloud deployment. A public cloud solution is often achievable. The questions to ask are which new processes can be deployed in the cloud and which legacy processes are feasible to transfer to the cloud. It may make sense to retain a core data set or process internally, but most big data projects are served well in the public cloud due to the flexibility it provides.
Typical cloud big data projects focus on scaling or adopting Hadoop for data processing. MapReduce has become a de facto standard for large scale data processing. Tools like Hive and Pig have emerged on top of Hadoop and make it feasible to process huge data sets easily. Hive, for example, transforms SQL-like queries into MapReduce jobs. It unlocks data sets of all sizes for data and business analysts for reporting and greenfield analytics projects.
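To make the idea concrete, a SQL-like aggregation such as SELECT page, COUNT(*) FROM logs GROUP BY page maps naturally onto a map phase and a reduce phase. The following is a conceptual sketch in plain Python, not a description of how Hive is implemented; the table and column names are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, 1) pair per record, as a mapper would for COUNT(*)."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, yielding the GROUP BY result."""
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    logs = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
    print(reduce_phase(shuffle(map_phase(logs))))
```

In a real cluster the map and reduce functions run on many nodes at once and the shuffle moves data across the network; here all three steps run in one process purely to show the data flow.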
Data can be either transferred to or collected in a cloud data sink like Amazon’s S3 or Microsoft Azure Blob Storage, e.g. to collect log files or export text formatted data. Alternatively, database adapters can be utilized to access data from databases directly with Hadoop, Hive, and Pig. Qubole is a leading provider of cloud based services in this space. They provide unique database adapters that can unlock data instantly, which otherwise would be inaccessible or require significant development resources. One great example is their MongoDB adapter. It gives Hive table-like access to MongoDB collections. Qubole scales Hadoop jobs to extract data as quickly as possible without overpowering the MongoDB instance.
Ideally, a cloud service provider offers Hadoop clusters that scale automatically with the demand of the customer. This provides maximum performance for large jobs and optimal savings when little or no processing is going on. Amazon Web Services Elastic MapReduce and Azure HDInsight, for example, allow scaling of Hadoop clusters. However, the scaling is not automatic and requires user action. The scaling itself is not optimal since it does not utilize HDFS well and squanders Hadoop’s strong point, data locality. This means that an Elastic MapReduce cluster wastes resources when scaling and has diminishing returns with more instances. Furthermore, Amazon’s Elastic MapReduce and HDInsight require a customer to explicitly request a cluster every time it is needed and to remove it when it is no longer required. There is also no user friendly interface for interaction with or exploration of the data. This results in an operational burden and excludes all but the most proficient users.
Qubole scales and handles Hadoop clusters very differently. The clusters are managed transparently without any action required by the user. When no activity is taking place clusters are stopped and no further expenses accumulate. The Qubole system detects demand, e.g. when a user queries Hive, and starts a new cluster if needed. It does this even faster than Amazon raises its clusters on explicit user requests. The clusters that Qubole manages for the user have a user defined minimum and maximum size and scale as needed to provide the user with the optimal performance and minimal expense.
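The behavior described here, a cluster that stops when idle and otherwise stays between a user defined minimum and maximum size, can be illustrated with a small sizing rule. This is only a sketch of the general idea and not Qubole's actual algorithm; the function name, the pending task count, and the tasks-per-node figure are assumptions.

```python
import math


def target_cluster_size(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return the node count for the current demand; 0 means cluster stopped."""
    if pending_tasks <= 0:
        return 0  # no activity: stop the cluster so no cost accrues
    wanted = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, wanted))


if __name__ == "__main__":
    for pending in (0, 5, 95, 1000):
        print(pending, target_cluster_size(pending, 10, min_nodes=2, max_nodes=20))
```

Evaluating such a rule periodically lets the cluster grow for large jobs, shrink afterwards, and disappear entirely when nobody is querying.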
Importantly, users of all kinds, whether developers, data engineers, or business analysts, require an easy-to-use graphical interface for ad hoc data analysis and for designing jobs and workflows. Qubole provides a powerful web interface including workflow management and querying capabilities. Data is accessed from permanent data stores like S3 or Azure Blob Storage and through database connectors, using transient clusters. The pay-as-you-go billing of cloud computing makes it easy to compare and try out systems. Sign up to Qubole and try it for free to experience how easy it is to use.