Modernizing ML & AI Operations to Advance Healthcare
So, to start with the agenda of this talk. This talk is broadly divided into three sections. In the first section I’ll be talking about the state of enterprise data science and healthcare. In the second section we’ll be covering certain customer data science examples across the healthcare verticals that we serve, and in that way we’ll cover certain case studies. One of them is in the provider space; the second case study is a robotics company that is trying to leverage ML and AI to advance its use cases. Towards the end of this talk we’ll be leading you into the demo of a real-world example of an application of drug discovery through deep learning. So, let’s get started.
So to start with an interesting story, I’ll take you all back to 1854, London. Back in the day there was this wide outbreak of cholera. So back then cholera was believed to be caused due to bad air and this theory was otherwise known as the Miasma Theory. So if you see, Dr. John Snow who is on the right is a physician who refuted this Miasma theory and he came up with an ingenious idea to mark on a map of London all known locations of death caused by cholera. He took this map of London, went door to door and he made his observations, so if you see the map, each mark on the map is a case of death caused by cholera. He took this observation back to his drawing board and he made a bold conclusion, that cholera was not really spreading due to bad air, it was really spreading due to water.
So his justification came from the fact that the stacks of death on the map are really tracking really close to the water pumps across the board, across the map. So he took these observations back to the civic authorities and finally the civic authorities were able to contain the cholera back in the day.
So this really marked the first application of data mining and data science in the field of life sciences. This led to the birth of a separate field of life sciences called epidemiology. So epidemiology, if you’ve heard of it, is a field of life sciences that is more about data than about life sciences. This map has ever since been referred to as the ghost map, and a book has been dedicated to this subject, written by Steven Johnson. It’s a great read if you haven’t read it.
Fast forward to today, what we are seeing is a different landscape; we’re experiencing a different kind of shift in growth. Though the population is at an all-time high, we are seeing that the growth rate of the population is trending really low; it’s at an all-time low right now. Also, if you see, the general population has gotten healthier, we are living longer, so the average life expectancy is at an all-time high. So given those positives, we have certain unique challenges today. If you look at one end of the spectrum we have an aging population, and on the other end of the spectrum healthcare costs are rising really rapidly. Starbucks’ CEO once famously said that they spend a lot more on insuring their employees’ healthcare than on procuring coffee, so it speaks volumes about the kind of problem that we deal with, the rising cost of healthcare.
These challenges really need certain compelling kind of solutions to address. Basically, all these practitioners, medicine and healthcare practitioners, are looking at leveraging data. If you look at the data, if you look at the infographic on the right bottom corner, human data, especially a large percentage of human data is human health data. That data is rapidly growing, it’s growing ten times faster than the traditional business data.
So how do you leverage this data and solve those precise and compelling problems? How do you understand which mutation in the gene is causing this disease? Or why is my patient not responding to this chemotherapy? So those precise and specific questions can be answered if you tap this human health data. That requires a different rigor and really requires a different field of data science called human data science.
To really execute on this human data science, again, there are a lot of challenges. If you see, the transformational promise of Big Data projects remains extremely elusive. So these are the metrics published by McKinsey and Gartner: 85% of Big Data projects fail to meet expectations and more than 70% of analytics potential value today is unrealized. This is across the board, not specific to healthcare. Being successful at Big Data is extremely challenging.
If you look at the different kinds of use cases across verticals in healthcare, be it biotech, be it pharma, be it life sciences or providers, what you see is a wide variety of use cases, but all of them have one common theme. What they’re trying to do is leverage those billions and billions of data points to solve that variety of use cases.
If you look at data, data is one piece of the puzzle. You also need the right tooling. You need to pick the right tool for the right job. Again, you need to embrace the choice, be it data preparation, be it ensuring governance and security, be it model building, or be it deployment of the solution. You need to embrace the choice and be able to pick the right tool for the right job.
So, again, we talked about data, we talked about the tooling, and another aspect to this is leveraging the right capabilities and following the right process. You can no longer be constrained by the limits of your data centers. They have finite capacity. You’ve got to leverage this infinite compute power that is available in the form of cloud, which is rapidly growing every day. You also have to be looking at how you leverage machine learning to solve these specific and precise questions. Machine learning is becoming ever more accessible.
Last but not least, you have to think about how you enable that self-service analytics platform so that you can facilitate those bottom-up use cases. Again, I’ll pause there and explain what I mean by bottom-up use cases. So typically what happens is business comes to IT with certain specific problems and IT projects get created around those problems. So that’s really a top-down kind of approach.
So what I mean by bottom-up is, when you enable these self-service data analytics platforms, you let the business solve their own problems. They find their own data and answer their own questions. So that’s really the true enterprise transformation that everyone seeks.
Again, all the enterprises are really recognizing this trend. The ones who embrace this trend and embrace the change are the ones who are leading with disruptive technologies to solve these compelling and precise problems. Again, it goes to show that embracing this change is really, really important in this day and age.
So if you look at different use cases across the different verticals, all of them really have a common theme. They’re all trying to leverage those billions and billions of data points to solve their use cases. How do you really create a platform around these use cases to be able to achieve success and answer those precise and specific questions that we talked about?
So all these different data points come at you at different velocities and in different form factors. So in this particular case we are showcasing Qubole as the platform that stitches together this whole healthcare data operation. Qubole is able to ingest all those different data points, which are coming in different form factors and at different velocities, and provide you the choice of tooling in terms of engines like Spark, Hive, and Presto, to be able to execute on AI and machine learning and leverage your data, those billions and billions of data points.
Finally, at the other end of the spectrum, once you’re able to create models which solve those compelling problems, which answer those precise and specific questions, you’re looking at how do you kind of report out, how do you kind of do business intelligence out of these solutions? So all that is possible through this kind of big data operations, facilitated by a platform like Qubole.
When you’re doing big data or data science in the cloud, auto-scaling is an essential aspect. When you’re operating in a data center you’re really running your infrastructure 24/7, but in the cloud you’re really renting your compute. So it is really essential to optimize your usage. So as trivial as it sounds, auto-scaling is really, really important.
If you look at this cluster, again, this is a sample of one of the Qubole clusters in production. If you look at this cluster at 6:00 AM, the cluster was really dormant. We were trending at the minimum capacity of the cluster, which is around ten or twenty nodes, and right around 7:00 AM we saw an increased demand for the cluster and we immediately scaled up the cluster. We sustained that cluster capacity at 72 nodes for an hour.
We could sustain that cluster capacity of 72 nodes for an hour because we saw that demand for the cluster. At right around 8:00 AM the demand decreased, so we scaled it back down to around 20 nodes. So that’s the aggressive downscaling that happened around 8:00 AM. So what this really shows is, all this white space that you see here is really the dollars saved. It’s the cost saving, because if you were running your cluster in an on-prem kind of ecosystem you would run it at peak capacity all the time to meet that peak demand. So it’s really essential to ensure auto-scaling when you’re doing big data in the cloud.
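The "white space" saving described here can be estimated with simple arithmetic. The following is a minimal sketch, assuming an hourly node profile loosely based on the figures in the talk (about 20 nodes when idle, 72 nodes for the peak hour); the exact profile and the function names are hypothetical and not taken from the Qubole dashboard.

```python
def node_hours(profile):
    """Total node-hours consumed, given a list of hourly node counts."""
    return sum(profile)


def savings_vs_fixed(profile):
    """Node-hours saved by auto-scaling versus a fixed cluster sized for peak."""
    fixed = max(profile) * len(profile)  # always-on cluster sized for peak demand
    return fixed - node_hours(profile)


# Hypothetical three-hour window: idle at 6 AM, peak at 7 AM, scaled down at 8 AM.
hourly_nodes = [20, 72, 20]
print(node_hours(hourly_nodes), savings_vs_fixed(hourly_nodes))
```

Multiplying the saved node-hours by a per-node hourly price would turn this into a rough dollar figure.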
So, again, talking about the differences between doing big data on-prem vs. in the cloud, data locality is the biggest difference. So if you look at the on-prem kind of infrastructure, data and compute are co-located; the fundamental principle in on-prem infrastructure is to bring compute to the storage. So if you look at it, it’s really impossible to scale storage separately without scaling the compute. So if you want to scale either of them independently, it leads to an expensive kind of deployment. Another aspect is that it’s really difficult to share the data locked in HDFS across your operating units and across your functions. Compare and contrast that with cloud operations: storage and compute are separated, so it’s all about leveraging that separation of storage and compute. And what you get out of that is ephemeral clusters. What I mean by an ephemeral cluster is that the cluster is not really running 24/7. You only run the cluster when there is a workload. So that leads to really excellent operational efficiency. And if you look at it, data can be easily shared across operating units and can be accessed from different locations because all your data assets are really stored in your cloud storage.
So if you look at the advantages of cloud, it leads to a lower cost, it enables that iterative and collaborative approach, and you can scale your infrastructure needs with automation. So again, if you contrast that with on-prem, you’re dealing with large infrastructure, perhaps a million dollars for an upgrade, and you’re dealing with a finite compute capacity, so you’re hitting walls all the time. Again, there are certain pitfalls to avoid when you’re doing data science in the cloud. Data governance is essential, so for a successful data lake there needs to be a proper focus on ensuring the right anonymization. You need to anonymize your data and ensure the right data policies to be able to provide that well-rounded data security.
You also need to avoid keeping users siloed from the data. You need to allow your users on-demand access to data and analytics in order to enable that collaboration, which will help you provide that self-service data platform that will enable those bottom-up use cases. Again, focus on ensuring time-to-value and focus on ensuring delivery versus overbuilding or over-engineering. All the cloud infrastructure and the platform story covered so far really helps you deliver that. It helps you focus more on time-to-value rather than on over-engineering or overbuilding things.
Again, you don’t need to compromise on compliance when you’re operating in the cloud. If you look at the typical kind of solution architecture that we lay out for doing big data or data operations in the cloud, you take your data down a path of curation. So you land all your data assets, all your siloed data assets, in a raw unstructured zone and you take them down the path of curation. You take them to a derived kind of zone where you ensure data governance. You refine your data sets, you blend your data sets, you try to get more insights out of your siloed data sets, and then finally you further refine them and create this source-of-truth zone where you share all your refined data assets for further analysis, for your advanced analytics use cases.
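The raw, derived, and source-of-truth zones can be pictured as successive transformations over the same records. The sketch below is only an illustration under stated assumptions: the field names (`patient_id`, `measure`, `value`), the salt handling, and all three functions are hypothetical, and a salted hash is pseudonymization rather than a complete anonymization policy.

```python
import hashlib


def anonymize(record, secret_salt, id_fields=("patient_id",)):
    """Replace direct identifiers with salted SHA-256 digests (pseudonymization)."""
    out = dict(record)
    for field in id_fields:
        if field in out:
            raw = (secret_salt + str(out[field])).encode("utf-8")
            out[field] = hashlib.sha256(raw).hexdigest()
    return out


def to_derived_zone(raw_records, secret_salt):
    """Raw zone -> derived zone: drop rows without an identifier, then anonymize."""
    return [
        anonymize(r, secret_salt)
        for r in raw_records
        if r.get("patient_id") is not None
    ]


def to_source_of_truth(derived_records):
    """Derived zone -> source-of-truth zone: one row per (patient, measure)."""
    latest = {}
    for r in derived_records:
        latest[(r["patient_id"], r["measure"])] = r  # last record wins
    return list(latest.values())
```

In practice each zone would be a separate location in cloud object storage with its own access policy; the in-memory lists here only show the direction of the flow.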
So again, this particular use case is what we’ll cover towards the end of this session. Towards the end of this session we’ll take you on a tour of how we enable this workflow and provide an example case study of drug discovery through deep learning in a Keras notebook. Again, Keras is a deep learning library. So we’ll be demonstrating this workflow towards the end of this session.
Again, this is to cover the data science workflow at a high level. So this data science workflow is from Microsoft. It’s called the Team Data Science Process. The Team Data Science Process is all about ensuring that collaborative and iterative kind of approach to data science and AI. So it all starts with business understanding. As important as data science is today, it all really starts with that business understanding, or the specific problem that you’re trying to solve. So once you understand the problem space, the business case, you move into this phase of data acquisition. So you try to understand what data you need to solve the specific business problem, and this is where you spend a lot of time. 80 percent of your time is really spent in the data acquisition and understanding phase.
So here’s where you run into a lot of data. You explore your data. You clean your data. You normalize your features. You prepare your data for modeling. Right? And really the modeling aspect is what helps you answer those precise and specific questions. So you look at your historical observations and you create these models which answer those specific questions across the board. How do you help a physician understand why a patient is not responding to this chemotherapy? Or, how do you help a physician understand what mutation in this gene has caused this particular disease? So those precise and specific questions are answered by doing this modeling and tapping those billions and billions of data points.
So in this phase what you end up doing is a lot of feature engineering and model training. Again, you try to evaluate a lot of algorithms; there is no one master algorithm. You also spend a lot of time doing model comparison, evaluation, cross-validation, all those fun aspects of modeling, and once you have a successful model that is able to solve your problem in a compelling way, then you are ready for deployment. So when you’re ready for deployment, that’s when you really productionize those models, put them up behind a web service, and are able to answer those questions in real time. Those compelling and precise questions, in real time.
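Model comparison with cross-validation can be sketched in a few lines. This is a minimal illustration, not the workflow used in the talk: the two candidate "algorithms" (a mean baseline and a least-squares line) and all function names are hypothetical, and the folds are contiguous rather than shuffled to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_mse(fit, xs, ys, k=5):
    """Mean squared error averaged over k folds.

    `fit(train_xs, train_ys)` must return a `predict(x)` callable.
    """
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


def fit_mean(train_xs, train_ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(train_ys) / len(train_ys)
    return lambda x: mean_y


def fit_line(train_xs, train_ys):
    """Least-squares straight line through the training points."""
    n = len(train_xs)
    mx, my = sum(train_xs) / n, sum(train_ys) / n
    sxx = sum((x - mx) ** 2 for x in train_xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_xs, train_ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)
```

Calling `cross_val_mse` once per candidate and keeping the one with the lower error is the comparison step; a real project would typically rely on a library implementation and shuffled or stratified folds.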
And again, this is an extremely iterative process, so one team can work on data acquisition and the data science team can work on modeling. So it’s an extremely iterative process; you jump back and forth in that circle, and that really enables a successful data science workflow.
So at this time I will lead you into a poll question. And I will also hand this off to my colleague Ojas to take you through the rest of this presentation. Thank you for listening. Ojas take it away.
Yes. Perfect. Thanks, Pradeep. So just by way of introduction, some quick information about me. My name is Ojas. I’m a Solutions Architect at Qubole, and prior to Qubole I worked at Amgen, a biologics company, for close to nine years. So I’m pretty familiar with some of the challenges in the healthcare industry overall. And what I wanted to share with you today is more around some of the use cases and examples of how customers are leveraging the Qubole platform and cloud infrastructure in the healthcare field.
So just to recap what Pradeep mentioned, we have seen a lot of modernization, moving from on-prem infrastructure to cloud infrastructure and really bringing value to the users. So one such example is a Fortune 500 healthcare services company. So this company builds technology solutions for its customers; you can imagine these customers to be pharma and life sciences companies or insurance providers. What they do is basically ingest lots of data. And some of this data could come from third-party providers. So they are able to ingest lots of data that comes in all shapes, sizes, and formats.
So they are able to really extract valuable data and curate that data and then provide that data to their customers in the form of reports, visualizations, or even, in some cases, just pre-curated data, which the customers can run and evaluate or get value out of. So this organization started on the commercial side, where the data was pure data or p-data or, in some scenarios, drug sales data. And what they were able to achieve with the solution was to really make it accessible to a lot of customers through our build platform. In terms of technologies, one decision they made was to really leverage the right tool for the use case, so for data processing they leverage Hive, and for some of their more ML use cases they are leveraging Spark. It is really exciting to see how this organic growth happens within the organization.
So another such example is AURIS. For those of you who don’t know, AURIS is a robotics company that builds robotic instruments for medical intervention. So as you can see, they have a lot of sensors on these devices, which collect a lot of data points. So there’s a steady stream of data points being captured by these devices, and they needed a way of really capturing those data points, processing them, and providing value, or gaining insights from them.
So in this case they were able to process this data on a near-real-time basis and feed it back to those robots to inform the next action. So this was just an example of how you could use real-time data for processing. They were also able to build maintenance schedules based on the data that they captured on these devices. It’s very difficult to process such large amounts of data on on-prem infrastructure, and I think Pradeep mentioned decoupling compute and storage for big data processing, so just imagine what kind of infrastructure you would need if you still had an on-prem setup. That was one of the key decisions and enablers for them.
So basically what they are doing right now is really using that data and those insights to build new R&D use cases, so it really helps them with their R&D pipelines and building new products around them. So they have unlocked the value of the data through data science. In terms of technologies, they are using a variety, so they decided to leverage Airflow, which is a scheduling engine and is part of Qubole as well, and then TensorFlow, which I’ll show you as part of the demo.
So just think about how you can leverage some of these modeling capabilities as part of life sciences. So what I’m going to talk about is a deep learning example which uses TensorFlow, which is one of the technologies supported by Google for AI and machine learning. But just before I go there, I know this might be repetitive for some of you, but I just wanted to quickly touch on what machine learning and deep learning are. So machine learning can be categorized into three types. The first is supervised learning, which involves a known, or training, data set. You can think of image recognition: there is an initial set of images for which you know what the objects are. So that’s the initial training data set, which you would provide to a model, and based on that you can train the model. Supervised learning is what I’m going to show you as part of the demo, and it is mostly used for classification types of work. The second type of machine learning is unsupervised learning. As you can see from the name, we do not provide any labeled inputs; the data is not defined initially. As the machine learns, it extracts features based on that input and tries to identify patterns in it. So it does not have any supervision. And the third type is semi-supervised learning, which obviously is a combination of supervised and unsupervised training, where you have a mixed set of data.
What’s the real difference between machine learning and deep learning? Deep learning is an extension of machine learning where it simulates an artificial neural network. Don’t worry about all of the mathematical calculations you see here. What I want to highlight is that deep learning involves multiple layers through which the data passes, and in each of those layers it tries to extract different features based on the input and output values, which can then help you identify different characteristics of that input. So, you can see those individual components. Those are neurons, which take an input and provide an output. As the data passes through them, the network will extract different features from it.
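The idea of data passing through layers of neurons, each taking inputs and producing an output, can be written out directly. This is a minimal sketch with hand-picked weights, assuming a two-neuron hidden layer with ReLU and a single sigmoid output; the weights, layer sizes, and function names are hypothetical and unrelated to the model in the demo.

```python
import math


def relu(values):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in values]


def sigmoid(values):
    """Logistic activation, applied element-wise."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def dense(inputs, weights, biases, activation):
    """One layer: each neuron computes a weighted sum plus bias, then activates."""
    sums = [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]
    return activation(sums)


def forward(inputs, layers):
    """Pass the input through every layer in order and return the final output."""
    out = inputs
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out


# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.0], sigmoid),
]
```

Training consists of adjusting those weights and biases so that the outputs match known labels; frameworks such as TensorFlow automate that step.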
Coming back to the example which I’m going to show, let me cover some of the key things it wants to identify and understand. By the way, this is a data set from Kaggle. On the Kaggle website, the data is provided by Merck.
So what they wanted to identify as part of this data set is: how can we ultimately predict the efficacy and safety of drugs, or the molecular activity of a chemical? Right?
So they initially started with a molecular data set. We passed that data set as an input to the learning model, which was then trained on this data set to extract molecular properties. So what it means is, when you have a new molecule in development, you might be able to leverage the same model, which is trained on the known data set, to extract some properties.
The whole idea is around leveraging supervised learning to extract the features and then leveraging that for new molecules.
So I’m pretty sure most of you have seen a lot of technologies out there in the market that are trying to help in building these artificial neural networks. I’m going to focus on some of the technologies which I use as part of the example. But Microsoft and different organizations are really investing heavily in building these technologies, right?
One pretty popular and well-known framework is TensorFlow, which is used for building deep neural networks. It is considered a backend technology for artificial neural networks or deep learning. Keras, which is a high-level API that sits on top of TensorFlow, just allows you to make those API calls and makes it easier for you. It’s considered a front end to TensorFlow.
There are other technologies, like CNTK or Theano, which can also run underneath Keras.
The other piece is the Spark data processing engine. You can imagine the amount of data you have. This is at petabyte scale. You really need a processing engine that can handle petabyte-scale data. Spark definitely provides that capability, and it also provides DataFrames, which allow you to transform the data quickly.
I’m going to build this example on Qubole, which is a self-service machine learning platform. With this you should see the poll coming in. (Silence)
Alright, so while you’re answering the poll I’ll switch to a development environment and walk you through the drug discovery example. What you’re seeing is Qubole’s platform, and what I’m leveraging is Qubole’s notebooks. You can think of Qubole Notebooks as an IDE for application development. This is just a place where you would write your code, and then it would basically run on cloud infrastructure. In this case, this notebook is attached to a particular cluster, which you can think of as decoupled infrastructure. This is where all the auto-scaling happens. As a user I’m not really concerned about where it’s running, how much infrastructure I need, and how I can manage it. I think that’s what Pradeep showed earlier.
So a quick walk-through. What I’m going to do is a quick walk-through of this notebook and some of the visualizations which come with it. The first thing as a part of any machine learning exercise is to load the data. So you would load your tables as well as the key aspects of the data. So you ingest this data from different data sources and you can have it in an object store. So in this case I’m storing it in an S3 location. As a part of this piece of code I’m going to read that into a Spark DataFrame, which allows me to process that data.
The next step is, given that this is a sample data set, I’m going to split it into two pieces. Some of the records will be training records and some of them will be testing records. I’ll use the training records to train my model and then verify the model against the testing records. As we train the model, we’ll do feature extraction. So what it means is, as data moves through the different layers of the artificial neural network, we’ll try to extract some of the features and evaluate those.
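The train/test split can be sketched in plain Python. The notebook's actual call is not shown in the talk; on a Spark DataFrame a comparable split is available through `DataFrame.randomSplit`. The function name, the 80/20 ratio, and the seed below are assumptions for illustration.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical usage: 100 record ids split 80/20.
train, test = train_test_split(range(100))
```

Keeping the test records out of training is what makes the final score an honest estimate of how the model behaves on unseen molecules.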
As I mentioned, Keras is an API that allows you to build that model. So in this case I’m passing some parameters to the API. One of them is neurons, which are basically the computation units, as you saw in the earlier slide. I would define how many neurons I want per layer. The second is the learning rate, how fast you want the network to learn as data is processed through it. Then the batch size: in each iteration, what is the batch size you want to submit. And then there are some other parameters around activation and optimizers, which are different functions you can use to define and build your model.
As you can see here, this is my Spark DataFrame. I’m just displaying all the parameters, or variations of the parameters, which I have as a part of it.
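The "variations of the parameters" can be generated as the Cartesian product of the candidate values for each hyperparameter. This is a minimal sketch; the parameter names follow the ones mentioned in the talk, but every value in `search_space` is a made-up placeholder and `expand_grid` is a hypothetical helper, not part of Keras or Spark.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter.
search_space = {
    "neurons": [32, 64],
    "learning_rate": [0.01, 0.001],
    "batch_size": [64, 128],
    "activation": ["relu", "tanh"],
    "optimizer": ["adam", "sgd"],
}


def expand_grid(space):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]


variations = expand_grid(search_space)
print(len(variations))
```

Each dict in `variations` is one configuration that a training function could receive; loading the list into a DataFrame gives a table like the one shown in the notebook.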
So the next piece I want to talk about is that this whole notebook has one key function, which is a training function that basically trains the model. This is the core of the notebook. It takes a bunch of parameters which we have already defined at the top, like the batch size, the optimizer, your input and training data set. Then it will basically run that model and provide you a final score on the data. This is where the training of the model happens and the final score is extracted.
What we then do, once the model is trained, is basically run different types of search on top of it. You could run a random search. There’s another option, grid search, which is basically an exhaustive search through the parameter space. So in this case I’m just running a random search, but grid search would definitely provide more exhaustive information. Once I have that, I basically rank all variations of my model. As you can see, I have multiple different values for the parameters and the scores identified as we train the model.
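The difference between grid search and random search, and the ranking step that follows, can be sketched as below. This is an illustration under stated assumptions: `toy_score` is a stand-in for actually training and scoring a model, and the search space values and function names are hypothetical.

```python
import random
from itertools import product


def grid_candidates(space):
    """Exhaustive search: every combination in the parameter space."""
    names = list(space)
    return [dict(zip(names, vals)) for vals in product(*(space[n] for n in names))]


def random_candidates(space, n_trials, seed=0):
    """Random search: sample n_trials combinations (repeats are possible)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]


def rank(candidates, score_fn, top_n=3):
    """Score every candidate and return the top_n (score, params) pairs, best first."""
    scored = [(score_fn(params), params) for params in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_score(params):
    """Placeholder for a training run; NOT a real model score."""
    return -abs(params["neurons"] - 64) - 1000 * abs(params["learning_rate"] - 0.01)


space = {"neurons": [16, 32, 64, 128], "learning_rate": [0.1, 0.01, 0.001]}
top_three = rank(grid_candidates(space), toy_score)
```

Random search evaluates fewer configurations and is therefore cheaper, at the cost of possibly missing the best combination that an exhaustive grid would find.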
In the end I’m just trying to figure out what my top three models are and then grab the top model out of those. One other thing is, while this model is being trained you can even visualize it as a part of the training. For that, TensorFlow supports TensorBoard, which is a visualization layer on top of TensorFlow that allows you to view how the model is being trained. So in this case I’m grabbing the topmost model, re-running it, and visualizing it in TensorBoard. As you can see, I’m just going to save the model in the end and then run TensorBoard as a part of it.
So let’s switch to TensorBoard. Okay, perfect. So this is what you would see in TensorBoard. This is provided by Google as part of the TensorFlow and TensorBoard framework. What you can see is the different learning patterns that happen as you move through the artificial neural network. This is one example of a loss function, which you can track and visualize while you are moving through the training.
You can also look at some of the graphs which are provided as a part of it. I’m not going to go into the details, but in this case I had different parameters being passed to the neural network, and each time it was being trained. So you can look at the training run for each of them and see what the value of the loss function was for the different kinds of input. This is useful for identifying those values, and then you can use them for tuning your model.
I hope this helps you get an overview of how you can build machine learning models for your life sciences use cases and leverage them.
So with this we’ll open it up for questions.