Big Data Tips From the Experts: Setup is Key
Are you struggling with a current big data project, or looking for advice on how to make your first project successful? Check out the following tips from data scientists, engineers, and experienced business users. This is the second blog post in a series of big data tips. To see more tips from big data experts, click here.
1. Careful and Smart Integration with BI Tools
Big data tools (MapReduce, Hive, etc.) are known for their latency, but they excel at processing petabytes of data in a distributed computing environment. When integrating with BI/reporting tools, use big data technologies in a way that avoids their weaknesses and leverages their strengths.
For example, if you are building an integrated pipeline with BI tools, aggregate as much as you can upstream and use the caching or cube technologies of the BI tools to make the experience faster for the end user. Real-time connectivity to big data sources like Hive/HDFS makes for a poor end-user experience in the BI space, so it should be avoided. -Ashish Dubey, Solutions Architect at Qubole
For more best practices on using Apache Hive, see this article.
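One way to picture this advice: aggregate the detail rows once in the batch layer, cache the result, and let the BI layer serve queries from the small aggregate rather than hitting raw data in real time. A minimal Python sketch; the event fields and the cache key are hypothetical stand-ins, not Qubole's implementation:

```python
from collections import defaultdict

# Hypothetical raw event rows, standing in for detail-level data in Hive/HDFS.
raw_events = [
    {"day": "2015-06-01", "country": "US", "revenue": 120.0},
    {"day": "2015-06-01", "country": "US", "revenue": 80.0},
    {"day": "2015-06-01", "country": "DE", "revenue": 50.0},
    {"day": "2015-06-02", "country": "US", "revenue": 200.0},
]

_cache = {}

def daily_revenue_by_country(events):
    """Pre-aggregate detail rows once; the BI tool then reads this small
    result (or a cube built from it) instead of querying raw data live."""
    key = id(events)  # hypothetical cache key; a real pipeline would key on a run or partition id
    if key not in _cache:
        totals = defaultdict(float)
        for e in events:
            totals[(e["day"], e["country"])] += e["revenue"]
        _cache[key] = dict(totals)
    return _cache[key]

agg = daily_revenue_by_country(raw_events)
```

The second call with the same input is a cache hit, which is the whole point: the expensive scan happens once, and the end user's dashboard reads a few aggregate rows.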
2. Invest in Your Pipeline
As a rule of thumb, invest 80% of your time in your data lake and data pipeline (mining, extracting, cleaning, transforming, loading), and 20% in the high-level data science and machine learning effort. Data in the wild is complex, wrong, contradictory, and hard to access and find. Consequently, more, faster, and more accurate data usually has a higher impact than more complex models, and makes for a robust system. -Christian Prokopp, Principal Consultant at Big Data Partnership
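The cleaning/transforming part of that 80% can be sketched in a few lines. This is a generic illustration, not Big Data Partnership's code; the row schema is hypothetical:

```python
def clean_records(raw):
    """Minimal extract/clean/transform step: drop malformed rows,
    coerce types, and de-duplicate before any modeling happens."""
    seen = set()
    cleaned = []
    for row in raw:
        try:
            user = row["user"].strip().lower()
            amount = float(row["amount"])
        except (KeyError, ValueError, AttributeError):
            continue  # data in the wild: skip wrong or contradictory rows
        if (user, amount) in seen:
            continue  # drop duplicates that only differ in formatting
        seen.add((user, amount))
        cleaned.append({"user": user, "amount": amount})
    return cleaned

rows = [
    {"user": " Alice ", "amount": "10.5"},
    {"user": "bob", "amount": "oops"},    # bad value: dropped
    {"user": "alice", "amount": "10.5"},  # duplicate after normalization: dropped
]
result = clean_records(rows)
```

Even this toy version shows why the pipeline dominates the time budget: every downstream model quietly depends on decisions like these.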
3. Don’t Rush Into Analysis
Everyone with a Big Data project wants to rush straight into analysis. That is where things usually fall apart, however, because there is simply too much data flowing across the network, and it is mostly in a format that current analytics software cannot handle. -Rick Aguirre, President of Cirries Technologies
4. Start with Heavy Lifting
Big Data success requires three steps of heavy lifting before you ever analyze it.

Step 1 is data capture. Most of the Big Data torrent is a big nothing and not relevant. Decide what data you want to analyze and set up algorithms to locate and corral it.

Step 2 is data control. You want to capture the data you need as it comes across the network. It may stop being relevant within minutes, or you may need to store it for years if, for example, it is data that might be needed later for law enforcement purposes.

Step 3 is data humanization. This is where you convert the data from whatever format it is in to a format that your analytics software can use. Only now, at this step, do you have the right data in the right format for whatever kind of analytics you have in mind. -Rick Aguirre, President of Cirries Technologies
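The three steps above can be sketched as a tiny pipeline. The record fields, relevance rule, and retention periods are hypothetical placeholders, purely to show the shape of capture, control, and humanization:

```python
import json

def capture(stream, is_relevant):
    """Step 1: locate and corral only the data worth analyzing."""
    return [rec for rec in stream if is_relevant(rec)]

def control(records, retention_days):
    """Step 2: tag each record with how long it must be kept."""
    return [dict(rec, retain_days=retention_days(rec)) for rec in records]

def humanize(records):
    """Step 3: convert to a format the analytics software can read (JSON here)."""
    return [json.dumps(rec, sort_keys=True) for rec in records]

stream = [
    {"type": "heartbeat", "src": "10.0.0.1"},   # noise: filtered out in step 1
    {"type": "login_fail", "src": "10.0.0.2"},  # relevant: kept
]
relevant = capture(stream, lambda r: r["type"] != "heartbeat")
tagged = control(relevant, lambda r: 365 if r["type"] == "login_fail" else 7)
ready = humanize(tagged)
```

Only `ready`, the output of step 3, is handed to analytics; everything before it is the heavy lifting.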
5. Think Wide
Once data is collected, you have easy access for advanced analytics. Don't stop at analyzing only one log source or one dimension of data: analyze across log sources and multiple entities. For example, to discover advanced cyber attacks that leverage users' credentials, we profile users' behavioral activity, including their permissions configuration, their access to files and systems, and their web activity. We analyze their historical activity as well as comparing them against their peers. -Idan Tendler, CEO of Fortscale
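A toy version of this cross-source, peer-comparison idea: combine per-user activity counts from two log sources and flag users whose combined activity is far above the peer average. The sources, counts, and threshold factor are hypothetical, not Fortscale's actual model:

```python
# Hypothetical per-user activity counts from two different log sources.
file_access = {"alice": 5, "bob": 7, "mallory": 60}
web_logins = {"alice": 3, "bob": 4, "mallory": 40}

def peer_outliers(*sources, factor=2.0):
    """Combine activity across log sources and flag users whose combined
    activity exceeds `factor` times the peer average."""
    users = set().union(*sources)
    combined = {u: sum(s.get(u, 0) for s in sources) for u in users}
    mean = sum(combined.values()) / len(combined)
    return sorted(u for u, total in combined.items() if total > factor * mean)

flagged = peer_outliers(file_access, web_logins)
```

The point of the sketch is the join across sources: neither log alone has to look alarming for the combined, peer-relative picture to stand out.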
6. Use the ODBC Driver
“Perform BI Analytics and Visualization with the ODBC Driver” -Minesh Patel, Qubole
7. Use a Subsample
I always start by looking at a subsample of the data. You often get a very good impression of what the main focus of the data munging or cleaning will be just by looking at some of the numbers (or characters). -Benedikt Koehler, Data Scientist and Blogger at Beautiful Data
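In Python this first peek can be a one-liner built on the standard library; a minimal sketch (the row shape is a made-up example):

```python
import random

def subsample(rows, k, seed=42):
    """Take a small, reproducible random sample to scope out the
    munging/cleaning work before processing the full dataset."""
    rng = random.Random(seed)  # fixed seed so the peek is repeatable
    return rng.sample(rows, min(k, len(rows)))

rows = [{"id": i, "value": f"v{i}"} for i in range(100_000)]
peek = subsample(rows, 10)
```

Fixing the seed matters: when you later clean the data, you can re-pull the same sample and check that the problems you spotted are actually gone.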