Big Data Analytics Using Window Functions

 

Webinar Transcription

Gill: All right, good morning everyone, and thank you for joining us for today’s webinar. Today’s title is “Big Data Analytics Using Apache Hive Window Functions,” for your Big Data application infrastructure. Today’s presenters are Siva Narayanan and Harish Butani. I’ll just give you a quick bio on both of our panelists. Harish Butani is a member of the Hortonworks Engineering Team, an Apache Hive committer, and has been working on Hive for about a year. Overall, Harish has about 20 years of experience in architecting, leading, and implementing back-end systems. In the past, Harish helped develop SAP’s MDX Engine, Cloud9 Analytics’ MDX Engine, Oracle FS, and Agile, now Oracle PLM.

Also with us is Siva Narayanan, who previously worked on query optimization and workflow management at EMC Greenplum, and holds a Bachelor of Engineering degree from BITS Pilani, India, and a PhD in Computer Science from Ohio State. Siva has several pending patents and has published many papers in academic conferences in the area of large-scale data management. Just a couple of housekeeping items for today’s webinar: this webinar is being recorded, so you will get a link via email to view it on demand, at your own convenience.

Siva N.: Hello, everyone. This is Siva from Qubole. Today we’re going to be talking about Big Data Analytics using Apache Hive Window Functions, which are new in the Hive 0.11 release. Let’s start with a brief history of window functions. Window functions became part of the SQL standard in the SQL:2003 edition. They’re also called Analytic Functions in Oracle, where they first appeared in release 8.1.6, around the year 2000. Window functions are syntactic sugar, but that doesn’t mean they’re less important. They make certain questions much, much easier to express. They’re now implemented in several analytic databases, including Greenplum, Teradata and others. And now, they are also available in Hive, courtesy of my co-presenter Harish and the good folks at Hortonworks.

Let’s start with a little example of sales data, to motivate the need for window functions. Here’s a simple table (Figure 1). It’s got 3 columns: the name of the salesperson, the country that they belong to, and the revenue that they brought in. So we’ve got 4 rows here, 2 salespeople from Sweden, 2 from USA, with revenue numbers on the right side. Now, what we want to do here is find the top salesperson across all countries. I’m sure you know that this is an extremely straightforward query. It’s just an Order By on revenue, with a limit 1. That’s really easy.
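In Hive SQL, that first query might look something like this (a sketch, assuming a table named sales with name, country, and revenue columns, per Figure 1):

    SELECT name, country, revenue
    FROM sales
    ORDER BY revenue DESC   -- highest revenue first
    LIMIT 1;                -- keep only the top salesperson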

Now, let’s make it a little bit more challenging. What if you want to find the top salesperson in every country? You want to know who’s the best salesperson in Sweden, and in USA. The way you’d go about that using standard SQL is, you’d first find the best revenue by country. For example, you’d find the best revenue in Sweden was 3 million, and in USA it was 2.2 million. And you’d do this by, let’s say, creating a temporary table, or a derived table as part of the bigger query. This query does a GROUP BY on country, and a MAX aggregate computation on revenue. So in this example (Figure 2), you’d see 2 rows in this temporary table.
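A sketch of that temporary-table step, on the same assumed sales table (the country_best name is the one used in the next step of the example):

    CREATE TABLE country_best AS
    SELECT country,
           MAX(revenue) AS best_revenue   -- best revenue per country
    FROM sales
    GROUP BY country;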

Then, what we do is, we match this table with the original table, to find out, “Who was the salesperson who matched this maximum revenue?” So this is a JOIN between the original sales table and the temporary country_best table that we just created, and the join condition is on both the revenue number and the country name. This will give us one row for every country, and it’ll tell us who the best salesperson in every country is (Figure 3). So this got a little more complicated than the previous question.
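The join step might be written like this (again a sketch, using the assumed column names):

    SELECT s.name, s.country, s.revenue
    FROM sales s
    JOIN country_best b
      ON s.country = b.country          -- match the country
     AND s.revenue = b.best_revenue;    -- and the maximum revenue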

Now, let’s make it even more challenging. What if you wanted to find the top 2 salespeople per country? The way we’d go about doing that using standard SQL constructs is, first we would find the best revenue for every country, just like the previous example. Then you’d find the best of the rest in each country, or the second-best. This could be accomplished by, essentially, eliminating everybody who was the best in every country. Then you’d join the sales table with the union of these two temporary tables. All right, the SQL is left as a homework assignment for our viewers, because I’m too lazy to write it down.

Now, you might have already noticed, there are a few problems with this approach. The first is, this approach is very difficult to generalize when we want to find the top-k salespeople per country. This approach uses lots of temporary tables, complex subqueries, or derived tables. And most databases and BI tools can’t execute this efficiently; they cannot optimize when you’re breaking down a big query into lots of smaller queries. So in summary, this approach is difficult to read, difficult to write, and difficult for the system to execute.

So let’s try this example again, of finding the top salesperson per country, version 2.0 (Figure 4). This is the way you might do it using window functions. What we’re doing is, we have a derived table called T, and we see a new function called RANK. This RANK function tells us, “What is the rank of the salesperson in every country?” And then we filter on the top-ranking ones.
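A sketch of the windowed version (the derived table T and the Position column are the names used on the slide; the underlying table and column names are assumptions):

    SELECT name, country, revenue
    FROM (
      SELECT name, country, revenue,
             RANK() OVER (PARTITION BY country
                          ORDER BY revenue DESC) AS position
      FROM sales
    ) t
    WHERE position = 1;   -- keep only the top-ranked salesperson per country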

So let’s break this down. What does this rank window function do, and how does this give us the answer that we’re looking for? So we take a look at the derived table, right, and break this down using our example.

This was our original table that we started off with. What this query does is, it conceptually divides all the rows into 2 partitions. The first one is the Sweden partition; this contains the first 2 rows here. And the next is the USA partition. This is reflected by the PARTITION BY country clause in the query. Then, the RANK function is called on the ordered data, and this produces a position for every row (Figure 5). So the first row, Mikael from Sweden, is number 1 in terms of revenue, and gets position 1. Martin from Sweden gets position 2.

And in USA, Tony Iommi gets position 1. Note that this position is within a partition. And so, the additional column that is produced, which we call Position, corresponds to this rank. And now, what we can do is, we can simply filter on Position = 1, and that will give us the top salesperson from every country.

Now, if I want to find the top 2 salespeople per country, it’s quite straightforward, since we have this Position column. All this requires is a small change: instead of filtering on position = 1, we just say position is 1 or 2. So the query looks pretty much the same. And this generalizes very well to top-k. We can say top 3, top 4, top 5, 6, and the query doesn’t get any more complicated.
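The top-2 version is the same sketch with only the filter changed, and top-k just changes the constant:

    SELECT name, country, revenue
    FROM (
      SELECT name, country, revenue,
             RANK() OVER (PARTITION BY country
                          ORDER BY revenue DESC) AS position
      FROM sales
    ) t
    WHERE position <= 2;   -- for top-k, change 2 to any k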

So let’s break this down now into the anatomy of a window function, and how its semantics get broken down. Basically, you have a table with 3 columns, C1, C2, C3. The first step that happens is, your rows get partitioned into disjoint sets of rows.

So in this example, we just have 3 partitions. Within each partition, the rows are ordered; in this case, our rows are ordered by column C3. Then, corresponding to every row in every partition, there is a notion of a window, or a window frame, that is established. In this case, the first 2 rows belong to the window frame corresponding to row 1. Rows 2 and 3 correspond to the window frame for row 2, and so on. So each row is now associated with another set of rows. Then the function is applied on that set of rows to produce an additional column. This is a window function: it computes a result based on some computation over this set of rows, and produces an additional column, which can be conceptually tied back to the original row.

We’ve talked about partitions, and ordering within a partition. Now let’s formally describe what a window frame looks like. Remember, the window frame is the set of rows that becomes the input for the window function. A window definition, in general, has a frame_start and a frame_end clause. The beginning and the end of the frame can be described relative to the current row. So you can say the frame for this window starts at my CURRENT ROW. Or you can say it starts at the beginning of the partition, which is represented using UNBOUNDED PRECEDING. Or it can be 5 rows before the current row.

Remember, the rows are already ordered in some fashion. The ordering is optional, but if the rows are ordered, then you can say the window corresponding to the current row starts 5 rows before this row. Similarly, a window can start after, or end after, the current row. And finally, you can say the window ends at the end of the partition.
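As an illustration, frame clauses like the ones just described might look like this inside an OVER clause (illustrative fragments on the assumed sales table, not full queries):

    -- from 5 rows before the current row through the current row
    AVG(revenue) OVER (PARTITION BY country ORDER BY revenue
                       ROWS BETWEEN 5 PRECEDING AND CURRENT ROW)

    -- from the start of the partition through the end of the partition
    SUM(revenue) OVER (PARTITION BY country ORDER BY revenue
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)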

So we already talked about how window functions are useful to find the top salesperson. Let me give a couple of other examples that are possible, and fairly straightforward, using window functions. First, we already covered RANK, which can be used to find the top salesperson in every country. There is an extension called percent_rank, which can be used to answer questions like, “Give me the top 1 percent of salespeople in every country.” This is useful when you don’t have the same number of salespeople in every country, and you want some kind of proportional representation.
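A sketch of the percent_rank version of that question, on the same assumed sales table:

    SELECT name, country, revenue
    FROM (
      SELECT name, country, revenue,
             PERCENT_RANK() OVER (PARTITION BY country
                                  ORDER BY revenue DESC) AS pr
      FROM sales
    ) t
    WHERE pr <= 0.01;   -- roughly the top 1 percent in each country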

There’s another interesting window function called first_value. This function returns the first value in the frame corresponding to the current row. So this can be used to answer questions like, “How is every salesperson doing relative to the best in his or her country?” You can compare the revenue of a salesperson to the revenue of the best salesperson in the country, and do interesting analysis on how much worse off they are than the best, or how much better than average they are.
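A sketch of that comparison using first_value (again on the assumed sales table):

    SELECT name, country, revenue,
           FIRST_VALUE(revenue) OVER (PARTITION BY country
                                      ORDER BY revenue DESC) AS best_revenue,
           revenue - FIRST_VALUE(revenue) OVER (PARTITION BY country
                                                ORDER BY revenue DESC) AS gap_to_best
    FROM sales;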

So now, you must be wondering, window functions do look similar to certain other SQL constructs. So are they like anything else? Window functions are like scalar functions, in that, ultimately, they produce one output row for every input row. Unlike scalar functions, however, they consume multiple rows to produce one row, so they can do slightly more sophisticated analytics. Like an aggregate function, a window function generally consumes multiple rows to produce one row. However, unlike an aggregate function, it produces one row for every input row. Remember, an aggregate function like SUM or COUNT generally collapses multiple rows into a single row. So a window function is somewhat like a cross between a scalar function and an aggregate function.

And now, Hive has it. I’ll pass the mic over to my co-presenter Harish, who actually built it. And it’s part of the Hive 0.11 release. Harish.

Harish B.: Thanks, Siva. That was really good. Okay, so let me start by summarizing what windowing enables, as a set of patterns. It enables ranking, so things like top and bottom N supplier by region reports. Siva went through a very similar example. It enables what I call reporting aggregates. These are reports at the entity level, but they also include aggregation statistics. For example, you want to build a report about a supplier and his sales, but you also want to include the average within his region, and possibly also his percentage contribution. And it enables what are called sliding window aggregates.
In this case, each row has a different aggregation, based on a window. So things like, if you want to build a report on sales by month, but you also want to include the cumulative sales, or year-to-date sales.

Or if you want a stock price report, and you want to include a sliding three-month average. The other pattern it enables is lead/lag analysis, or what I call navigational analysis. For example, you want to build a report on regions, and include the sales by quarter, but you also want to include the last quarter’s sales on every row. Or you want to build a report for stock prices, and you want to show what the opening and closing prices were for every day.
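Two sketches of these patterns, on hypothetical monthly_sales and quarterly_sales tables:

    -- reporting year-to-date (cumulative) sales alongside each month
    SELECT month, sales,
           SUM(sales) OVER (ORDER BY month
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND CURRENT ROW) AS ytd_sales
    FROM monthly_sales;

    -- lead/lag analysis: last quarter's sales on every row
    SELECT region, quarter, sales,
           LAG(sales, 1) OVER (PARTITION BY region
                               ORDER BY quarter) AS last_quarter_sales
    FROM quarterly_sales;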

Okay, so what does Hive have? What is in Hive is pretty close to the standard. You can do windowing invocations on all the aggregation functions in Hive, plus the new functions: the ranking, lead/lag, and first/last value functions. The window frames can be physical or logical. By physical, I mean that for each row, the window size is static. So like Siva showed, you can have 5 preceding and 10 following. But they can also be logical. Logical windows are based on the value in the row, so you can say, “go back until the value is 5 less.” Window functions can appear anywhere an aggregation function can. Just like aggregation functions, you can have logical and arithmetic expressions on top of them. And you can embed aggregation functions inside window invocations.
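For illustration, here is how a physical window and a logical (value-based) window might be written, on a hypothetical stock_prices table (the logical form is a sketch following the specification mentioned below):

    -- physical window: a static number of rows around the current row
    AVG(price) OVER (PARTITION BY ticker ORDER BY trade_day
                     ROWS BETWEEN 5 PRECEDING AND 10 FOLLOWING)

    -- logical window: go back until the ordering value is 5 less
    AVG(price) OVER (PARTITION BY ticker ORDER BY price
                     RANGE BETWEEN 5 PRECEDING AND CURRENT ROW)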

And finally, a query can contain multiple window invocations, and these can differ in frame, partition, and other specifications. What happens under the covers is, we group the different window invocations into sets based on common partition and order specifications, and then we process them set by set. But there are limitations. So here’s a set of limitations.

First of all, Hive doesn’t have a concept of controlling the ordering of nulls, so there’s no nulls first/last clause. Hive doesn’t have a concept of collation: if you have character-based data types, you cannot specify how to collate them. The window ranges today are numeric constants, whereas in the spec you can have arbitrary numerical expressions. And Hive doesn’t have date intervals. So if you want to do date-based windows, you pretty much have to break the date down into components, and treat each component as an int.
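For example, a date-based sliding window might be approximated by extracting int components, along these lines (a sketch on the hypothetical stock_prices table, assuming one row per ticker per month):

    SELECT ticker, yr, mon, price,
           AVG(price) OVER (PARTITION BY ticker
                            ORDER BY yr, mon
                            ROWS BETWEEN 2 PRECEDING
                                     AND CURRENT ROW) AS sliding_3mo_avg
    FROM (
      SELECT ticker,
             YEAR(trade_date)  AS yr,    -- break the date into int components
             MONTH(trade_date) AS mon,
             price
      FROM stock_prices
    ) t;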

And finally, there’s no support for frame exclusion. Frame exclusion has to do with controlling window boundaries. You can say, for example, “Do not extend the end of the window to rows which have the same value as the current row.” The default we give here is to include those rows. So we don’t have the frame exclusion clause supported in Hive today. And for details on how you can specify window frames and windows, there is a specification document attached to HIVE-4197.

Okay, so a little bit on how windowing is processed. In general, in a query, the Join clause, the Where, Group By, and Having are processed just as if there were no windowing present. Then the output of that is partitioned based on the window specification, and each partition is ordered based on the window specification. Then each of the window functions is computed partition by partition. So they are computed independently, partition by partition.

And as I said, a query can contain multiple window functions; if there are multiple window functions, we’ll group them into sets, and we’ll repeat this process. Finally, at the end, if there’s an order by clause, that will be processed. The key observation here is that window functions are processed partition by partition, and so this process is highly parallelizable.

So what does that mean in Hive? Just like how Joins and Group Bys are executed in Hive, we use the shuffle to divide the data into partitions and order them. Each Reducer processes the partitions that it’s given: it processes the window functions for those partitions, independently of, and in parallel with, the other Reducers. So it’s highly parallelized. And the way we implemented windowing in Hive is on top of a more generic framework, which is the Partitioned Table Function framework.

What this framework lets you do is inject user-defined table functions into the Hive data flow; such a function operates on the table partition by partition. So, internally, we have a special PTF to handle windowing. What this windowing PTF does is, when it’s given an ordered partition, it invokes the window functions associated with it. Each window function is, ultimately, just an aggregation function. Based on the pattern, it may be given the entire partition, all the rows in the partition, to aggregate. Or it may be given a window of rows for the current row.

And then, finally, once the window functions are evaluated, it assembles the output by combining the data from the input columns and the window function columns.
Okay, a little bit about performance in windowing. One of the observations about windowing is that you don’t need to compute aggregation statistics separately and then join them back. As Siva went through in the Top Salesperson example, the way you do this without windowing is, you first compute the best sales by country. Then you join these aggregation statistics, this intermediate result, back to the person table, to find the person whose sales match.

So it’s a two-step process. Then he showed the windowing query, where you only had to scan the person table once: we partitioned and ordered the person table, but we didn’t re-scan it. What that means is, we avoid the extra scan on the underlying data. And in the MapReduce context, that means it eliminates 1 or more MapReduce jobs. So this performance impact can be quite significant.

So let me go through a little more elaborate example: TPC-H query 2, the Minimum Cost Supplier query. This is about finding, for each part, the supplier who can supply it at the lowest price. The pattern is very similar to the salesperson example that Siva was showing, but it’s a more elaborate query. It works in the same way. In order to do this query, the first thing you do is, for each part, compute the minimum cost.

So you have this sub-query. Once you’ve computed the minimum cost per part, you join it back to the PartSupplier table on minimum cost, to get the supplier who provides this part at the minimum cost. You’re basically scanning PartSupplier twice: once to find the minimum cost, and then again when you join it back to PartSupplier.

As opposed to that, if you write this query with windowing, the way you can express it is the following. You partition the PartSupplier table by part, and then order each partition by supply cost. Given this ordering, you rank the rows. Then, in the outer query, you say, “Just keep the first row from each partition. Keep the row with rank 1.” There are 2 things to observe here. The first is, this inner query is doing about the same thing as the inner query in the previous example: it’s shuffling on the partkey, and then it’s ordering on cost.
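A sketch of that windowed form of the query, on the standard TPC-H partsupp columns (simplified, leaving out the region and supplier joins of the full Q2):

    SELECT ps_partkey, ps_suppkey, ps_supplycost
    FROM (
      SELECT ps_partkey, ps_suppkey, ps_supplycost,
             RANK() OVER (PARTITION BY ps_partkey          -- one partition per part
                          ORDER BY ps_supplycost) AS r     -- cheapest supplier first
      FROM partsupp
    ) t
    WHERE r = 1;   -- keep the minimum-cost supplier for each part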

In the previous case, you shuffled on the partkey and then computed the minimum cost, whereas here you’re doing the ranking. So it’s a little bit more expensive, but ballpark, it’s about the same. The key difference is, I’m not doing the second scan of PartSupplier, and we are not doing the Join.

You can observe this if you look at the plans for the 2 queries. The right-side plan is the one for windowing, and you can see it’s much simpler than the left-side one. On the right side, you are invoking the PTF operator on the PartSupplier table. The left-hand side looks a little more complicated; what it’s showing is Hive’s conditional operator plans. But the thing to observe is, it does two scans on PartSupplier: the first one computes the minimum cost, and the second one then joins that back. My experiments showed that the improvement from using the windowing query is quite significant. We ran at the 300GB data scale.

Okay, so now I just want to go over a couple of real-world examples. These are from the ClickStream Analytics use case at RichRelevance. A really simplified view of the data is a ClickStream table which has session-level information; for each session, it’s capturing the pages visited, and for each page, you have the URL, the page type, and so on. RichRelevance is a marketing analytics company, and they have some of the top retailers as customers. If you want more details about the use case, there’s a really nice presentation by their architect, Murtaza Doctor, and the link is here.

So here are a couple of examples from their use case. The first one is the Most Frequent Item example. In this case, what they want to find is, “When you’re on a page, what is the most frequent next page you visit?” The reason they want to compute this is to improve their website optimization process. This use case is an example of the lead function. Essentially, what we are saying is, “Partition the data by session. Then, for each page, for each impression, I want the page type and the next page’s page type.” And then, in the outer query, you’re counting the combinations.
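A sketch of that lead query, on a hypothetical clickstream table with session_id, click_time, and page_type columns:

    SELECT page_type, next_page_type, COUNT(*) AS transitions
    FROM (
      SELECT page_type,
             LEAD(page_type, 1) OVER (PARTITION BY session_id
                                      ORDER BY click_time) AS next_page_type
      FROM clickstream
    ) t
    WHERE next_page_type IS NOT NULL    -- the last page of a session has no next page
    GROUP BY page_type, next_page_type;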

Another example they have is the Landing and Exit Page Analysis. What they want to find is which pages people enter a site on and leave a site from, and they want a count of these combinations. This is an example of the first value and last value functions. So again, here you partition by session, order by time, and you want the first page and the last page. When you invoke the query, you get all the rows associated with the session returned; with windowing, the rows don’t get aggregated away. In this case, you only want 1 row per session. So what you do is, you say, “Just keep the first row, the one with rank 1.”
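A sketch of that landing/exit analysis, on the same hypothetical clickstream table:

    SELECT landing_page, exit_page, COUNT(*) AS sessions
    FROM (
      SELECT session_id,
             FIRST_VALUE(url) OVER (PARTITION BY session_id
                                    ORDER BY click_time) AS landing_page,
             -- LAST_VALUE needs the full-partition frame; the default frame
             -- would end at the current row
             LAST_VALUE(url) OVER (PARTITION BY session_id
                                   ORDER BY click_time
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                            AND UNBOUNDED FOLLOWING) AS exit_page,
             ROW_NUMBER() OVER (PARTITION BY session_id
                                ORDER BY click_time) AS rn
      FROM clickstream
    ) t
    WHERE rn = 1   -- keep one row per session
    GROUP BY landing_page, exit_page;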

They have lots of other interesting use cases; I just mentioned a couple. Instead of just doing first and last page, they also do something called path to purchase. In that case, if a session leads to a purchase, they want to know what events led to the purchase, the sequence of pages that led to it. Another example they’ve looked at is trying to compute the most-frequent item set: they want to compute the combinations of pages that people visit together.

Hopefully, now you’ve got a good sense of what is possible with windowing. I just quickly want to mention what’s in the works. These fall into the buckets of performance improvements and query rewrites. In the case of performance improvements, as I said, if there are multiple windowing functions in a query, we group them into sets, and compute them set by set. So one of the things we are looking at is how to optimize the order in which these are executed, so that the number of MapReduce jobs/shuffles is minimized. This work is similar to what’s done in Hive 0.11 for Joins and Group Bys.

And this is not unique to Hive; similar work was demonstrated on Postgres, and there’s a really neat paper about it. The query rewrite work is around the observation that windowing queries are very often expressed as Correlated SubQueries, especially the ones which deal with reporting aggregates. For example, we identified 4 of the 22 TPC-H queries which could actually be written as windowing queries, but are written as Correlated SubQueries. And as we showed earlier, there’s a big performance impact if we rewrite them as windowing.

So the work going on here is to look at how to automatically rewrite these queries as windowing queries. I just want to mention that this work is being done at Duke by Prajakta, the same person who helped build windowing in Hive 0.11.
And back to Siva now for a demo.

Siva N.: Thanks, Harish, for the detailed look at windowing in Hive. Now that you have heard all about it, I’m guessing you want to see it in action. What we’re going to do is show the same examples that I started off with at the beginning of the talk. We’re going to be using the Qubole Data Service to demonstrate the window functions in Hive.

A little bit about Qubole before we do the demo. Qubole delivers next-generation Big Data as a Service, called the Qubole Data Service. Qubole is a startup founded by 2 co-founders of Apache Hive, Ashish and Joydeep, who ran the Data Operations team at Facebook that managed over 25 petabytes of data. Qubole offers Big Data as a Service; what that means is, you don’t need to set up any hardware or software. You can just log in and get the power of Hadoop, Hive, Pig, Oozie, and Sqoop. Qubole has been used to process over 20 petabytes of data in the past few months, by several interesting companies.

Now on to the demo. What I’m going to do here is briefly describe what the interface looks like. I’ve just logged into the Qubole Data Service here using my login. What I see here is a history of queries that I’ve executed in the past. These queries were run against data that lives in Amazon’s S3 storage. Qubole takes care of launching clusters for me and making sure they’re the right size; the clusters grow and shrink depending on my workload. So I can focus on the queries and the data, and not worry about how these queries are getting executed. I’ve already uploaded the data for this tiny table that we used in our demo, the sales data, and we’re going to replay it on Qubole.

As a first step, I’m going to create a Hive external table. This table, of course, has 3 columns: the name of the salesperson, the country that he or she belongs to, and the revenue. This data is backed by S3; I have a little text file that lives in this S3 location on Amazon. What I do is just submit this query, and it goes ahead and creates a Hive table that I can use for subsequent queries. You’ll see the log updates here, and this says that the query is complete.
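A sketch of that DDL (the S3 path and the comma delimiter are placeholders, not the actual demo values):

    CREATE EXTERNAL TABLE sales (
      name    STRING,
      country STRING,
      revenue BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3n://my-bucket/demo/sales/';   -- hypothetical S3 location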

So now, I go ahead and quickly preview the data, to make sure that it’s right. I just execute a simple SELECT * query to see what the data looks like. And this comes back, yeah. The rows it’s showing me are basically identical to the example that I presented at the beginning of the webinar. There are 4 rows, 2 from Sweden, 2 from USA, with the revenue numbers here.

So now, what we do is, we first find the top salesperson across all countries. This is the simple example that I talked about already; all it takes is a simple Order By query, ordering by revenue and doing a limit of 1, to find the top salesperson. This query is a little more complicated than the previous preview query, since it requires an Order By and a limit. What Qubole does underneath is translate this Hive query to a MapReduce job, as is standard in Hive.

And in this case, you see a message here about getting Hadoop cluster information. If your cluster was not up already, Qubole brings up the cluster for you. Typically, that takes 1 or 2 minutes, and then the query gets submitted there, and you get the results back, as you see here. Qubole takes care of keeping the cluster around, so if you want to execute subsequent queries, you can reuse the same cluster. And when you’re done using the cluster, Qubole shuts it down automatically for you.

So now, we’ll do the second example, finding the top salesperson in every country. First, the naive way of doing this: we create a temporary table which computes the maximum revenue per country, as we already talked about. In the Qubole system, this temporary table lives in Hadoop’s HDFS. In this world, S3 is the ultimate source of truth, and temporary tables are created in HDFS, because clusters can come and go as necessary. I execute this query. This results in 2 jobs: the first one to create the temporary table, and the second one to do the Join. What you’ll see here is, the first job is done, and I see logs that show that the second one is going on.

Okay, so it’s complete, and I should see the results for our query soon. This tells me that Tony Iommi is the best salesperson from USA, and he has done 2.2 million worth of sales. And Mikael, the best from Sweden, has done 3 million worth of sales.

But now, we’ll use the window function way of finding the top salesperson in every country. In this case, the subquery adds a new column called Position, which uses the rank window function. And finally, there’s a filter: “I want rows with position 1.” Remember, this is new in Hive 0.11. One of the differences you would notice from the previous run is that this requires only 1 Hadoop job, as opposed to 2. You see that this is complete, and we should see the results fairly quickly. And you see the same row set: Mikael is the best in Sweden, and Tony is the best in USA.

Now we’re going to do another example, something very similar to the previous one, only we’re going to try to find the second-best in every country. All this requires is a little change: instead of position = 1, we say position = 2. So this makes creating these queries very, very easy, and you can even templatize the query. One of the features Qubole offers is the ability to templatize queries, so an end user can just put in the position value, and we execute the corresponding query. In this case, we see the other 2 rows, the salespeople who are second in their countries.

A couple of other points about Qubole. If you were to use Qubole, you would enter your Amazon account details in the Control Panel section. You can see that I’ve put in my Amazon credentials; this allows me to access the data that lives in S3. I can put in my AWS credentials, and put in the Hadoop cluster settings. You can set the minimum and maximum cluster sizes as different numbers, and Qubole takes care of scaling the cluster up and down as required. You can choose different instance types, and you have a whole lot of other options here that you can play with. You also have the ability to schedule workloads.

Most of the time, a report is not executed just once; you want to execute a reporting query on a daily, weekly, or monthly basis. And you can compose complex workflows, which involve a Hive query and a Hadoop job, then a Pig job, and finally push the data back into a reporting database. Qubole makes this extremely simple, so you don’t have to worry about things like, “Oh, how do I set up my Hadoop cluster? Do I have the right versions and tools in place?” and so on. And finally, there is a small Administrator dashboard that shows you the recent activity in your account: how many queries you’ve executed, what your top queries are, what your cluster usage looks like, and so on.

All right, so that’s it for the demo. To conclude the talk, windowing simplifies many analytical SQL tasks. Windowing is primarily about partitioning of data, ordering the data within partitions, and window frame and window function definitions. Window functions are extremely useful for reporting aggregates, windowing aggregates, time series analysis, and a whole lot of other applications. And Hive just got a lot more powerful and easy to use, thanks to the work from Harish and Hortonworks.

And that’s it. I’d like to thank Harish and Hortonworks and Gill. And I’m going to hand it over to Gill for a question-and-answer session.

Question and Answer Session

Gill: Thank you very much, Siva, and thank you very much, Harish. This was a very impressive presentation. We already have many, many questions in the queue for you guys to answer, so I’ll read out the questions one by one and have you answer them appropriately.

All right, so for the first question: “Are we using these functions for simplicity? And how efficient would it be to add new rows on the fly?”

Harish B.: Yeah, so as Siva said, some of this is just syntactic sugar, but there are cases which windowing enables that are really hard otherwise, like the top 2 example. After a while, it gets really painful to hand-write these. In terms of injecting new rows, there are 2 options there. One, Hive has this notion of LateralViews, which is the way to inject rows. Or else you can use table functions; with table functions, you are not restricted to the rows you’ve got, you can output whatever you want.

Gill: Very good. Thank you, Harish.

Second question would be, “Will any BI tools benefit from these new extensions? For example, will MicroStrategy be able to use these?”

Harish B.: I’m not a MicroStrategy expert, but I do know that many BI tools have a bunch of options that allow them to translate BI workloads into SQL. And I know that, for example, when windowing is not available, they do create temporary tables. If windowing is available, they could switch to that. So that could definitely result in performance improvements, if the BI tool can actually take advantage of window functions.

Siva N.: Yeah, just to add to that, I can’t speak to MicroStrategy specifically; we can check and get back. But I do know that the SAP team was looking at invoking window functions. So yeah, I think, in general, BI tools can take advantage of this.

Gill: Very good. Thank you very much, Siva, and thank you very much, Harish.

Next question is, “Is the windowing functionality part of Hive as a whole, or something proprietary to Qubole?” Talking about some of the window functions.

Harish B.: No, it’s part of Hive. It’s part of Apache Hive, so it’s there, you can go get it from the source code.

Gill: Thanks for that clarification, Harish.

Another question coming up is, “We’re a little bit confused. The queries you were running with window functions: are these Apache Hive queries that are embedded in Qubole? Or can we run those on Hortonworks or any other kind of distribution?” This is very similar to the previous question.

Harish B.: Yeah. So I think Qubole basically just uses Apache Hive and expands on it.
Siva N.: Yeah, Qubole’s Hive is primarily based on Apache Hive, with a few modifications, which are primarily performance improvements to run on Amazon and S3. But the window functionality is basically from Apache Hive.

Harish B.: Right. And the same goes for Hortonworks, so the Hortonworks flavor of Hive is pretty much the Apache flavor.

Gill: Perfect. Thank you, guys, for that clarification. I just wanted to make sure.

And next question coming up is, “Can you talk a little bit about how Qubole can help understand user behavior on web pages?”

Siva N.: Yeah. I’m not the right person to answer about that specific use case, but I can tell you how you would solve such problems within Hive. The way you would do it is to write a custom UDF. There’s been interest in that, and either someone from Hortonworks, or someone in the community, will do that. As for how it’s useful, I’m not the right person to say.

Harish B.: I can maybe add a little bit there. Generally, when you’re capturing ClickStream data, what you’re capturing is basically timestamps: when a user clicks on a certain link, and so on. And it’s after the fact that you want to understand, “What is the sequence of events that happened in the session?” Window functions are very useful there, to order the data on timestamp, rank it, and understand the sequence of actions that a user took, and then try to find common patterns and common transitions, and so on. And if you want to do anything fancier, then you could always build your own table functions to derive other interesting insights.
Gill: Very good. Thank you very much, guys. Another question coming up is, “You mentioned some things about time series data. Is it handled based on the timestamp? Do we have any special functions to handle timestamps through window functions?”

Harish B.: No, no. There are no special data types for time series yet. Pretty much, there are 2 ways to handle time in Hive. There’s the timestamp data type, which goes down to the second or millisecond, I don’t remember exactly which.

Gill: Another question coming up is, “Can we use R as well on the Qubole Data Service, as a statistical language?”

Siva N.: R is not natively provided, but it is possible to run R. Qubole supports the ability to run arbitrary shell commands, so you can run R programs, and you can use tools like RHadoop to access data that is in HDFS and do certain computations within R. So it is possible. It’s not something that we offer out of the box, but it’s definitely possible.

Gill: Another question coming up would be regarding the slides. So, as I mentioned before, all of the slides for this webinar, as well as the recording, will be provided to you in a follow-up email on this webinar. So you’ll be able to download the slides and share them with all your colleagues, as well as watch the recording of this webinar on demand, at your convenience.

The next question coming up is about pricing for Qubole. If you go to Qubole.com, you’ll see a pricing page, where you can see all the details about Qubole’s monthly pricing.

Next question coming up: “Is the interface that Siva mentioned and demonstrated tied to Qubole/S3, or can it be ported to other Hadoop distributions? For example, Hortonworks, or anything of that sort?”

Siva N.: Yes, so Qubole provides Hadoop as a service. The motivation here is that you don’t need to worry about things like Hadoop distributions and setting up Hive, and so on. Our Hadoop tree is based off Facebook’s Hadoop distribution, for historical reasons, and we have compatibility with Hadoop 1 in terms of APIs and so on.

And our Hive is basically Apache Hive, with performance improvements that we have detailed in our blog. You can go there and check out what kinds of performance improvements we have done. And we definitely want to contribute this back to open source, so at some point, it will make its way back there as well.
