as a Service
What is Apache Hive?
Hive is an Apache open-source project built on top of Hadoop for querying, summarizing, and analyzing large data sets through a SQL-like interface. It is noted for bringing the familiarity of relational technology to big data processing through its Hive Query Language (HiveQL), along with structures and operations comparable to those of relational databases, such as tables, joins, and partitions.
Apache Hive is used mostly for batch processing of large ETL jobs and batch SQL queries on very large data sets.
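To make the relational structures described above concrete, here is a minimal HiveQL sketch of a partitioned table, a join, and a batch aggregation. The table names (page_views, users), their columns, and the date value are hypothetical and chosen only for illustration; they do not come from any particular deployment.

```sql
-- Hypothetical event table, partitioned by date so that a query
-- filtering on dt only reads the matching partition directories.
CREATE TABLE IF NOT EXISTS page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Hypothetical dimension table to join against.
CREATE TABLE IF NOT EXISTS users (
  user_id BIGINT,
  country STRING
);

-- Batch query: page views per country for a single day.
SELECT u.country, COUNT(*) AS views
FROM page_views pv
JOIN users u ON pv.user_id = u.user_id
WHERE pv.dt = '2015-01-01'   -- partition filter (illustrative value)
GROUP BY u.country;
```

Filtering on the partition column is what typically keeps a query like this practical on very large data sets, since Hive can skip partitions that do not match instead of scanning the whole table.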
We collect events from our various systems via a Flume pipeline that writes data out to Amazon S3. From there, we use a data processing pipeline hosted by Qubole to process and aggregate statistics to Hive tables and to an AWS Redshift-based data warehouse. For easy access to the data for the entire company, we use Tableau to navigate through our tables and produce visualizations.
Prakash Janakiraman, Co-Founder and VP Engineering at NextDoor