Jon Austin Osborne – Data Platforms 2017
Jon Austin Osborne: Today I’m going to talk to you a little bit about a topic that’s been very close to my heart for the past six months or so, and it revolves around how we go about managing user access profiles at Capital One, a very large enterprise. Just a little bit of my background really quickly. I did my undergrad at MIT in mechanical engineering. Currently, I’m pursuing a master’s from Georgia Tech in machine learning, and as I said, I work at Capital One as a machine learning engineer for their Center for Machine Learning.
To get started we have to look at what our problem actually is. It can be worded very simply: how do we manage user access? But there’s a lot of complication inherent in that. If any of you work in companies that have an Active Directory system or anything like that, what I mean by user access is any AWS resource, or on-premises server, or admin rights on a server, or Microsoft Office, or Outlook, or a distribution list. Anything that you are able to log into with single sign-on or something like that counts as one access.
To give you the scope of what we’re talking about here, Capital One has about 40,000 active associates currently at the company, and each of us has about 70, on average, of these accesses. That’s a lot. Now, 70, again, is just an average. There are associates who have upward of 500 of these accesses. There’s a huge issue that starts to come up, one that we’ve seen quite a bit in the past.
If someone logs into my profile and my credentials are compromised, not only do they have access to all of the things that I’m supposed to have access to, there’s a whole bunch of things that I have that I’m not supposed to have access to. For instance, as a large credit card company, we have a whole bunch of credit card transaction data. If I’m no longer working on credit cards, and maybe I move to cybersecurity or something like that, whoever just logged into my account now has all of this credit card transaction history, they can do whatever they want. It’s a huge, huge issue.
Now, our problem is that as people change teams, change projects, change roles, et cetera, these profiles don’t ever shrink. They just continue to grow and grow. If you actually watch the file sizes, they just grow every single day because nobody really wants to ever give up their permissions. If somebody said, “Hey, do you need this data table anymore?” you’re probably going to say, “I don’t know, but I want to keep it just in case,” and that’s what everybody says.
We needed a way to do this in a somewhat smart way. We needed some automation involved. To bring this problem scope down a little bit, we had to make a few assumptions. The first is that associates who are on the same team will have similar accesses, which makes sense: these are the people you sit with on a daily basis, you have the same manager. You’re probably going to be logging in to a lot of the same things and need a lot of the same permissions.
The second one is that by looking at the pure number of accesses that you have, we can determine which associates have a higher risk of having what I’m going to call, from this point forward, rogue accesses. What I mean by a rogue access is one that you no longer need that is currently on your account. As with any assumption, we have to make sure that those are valid. First one, same team, same access. Well, the Center for Machine Learning is a perfect counterexample for this assumption; it doesn’t work in our case. We have a whole bunch of lines of business that we work across. All of us have all of these different projects, and so all of us have all kinds of different datasets, resources, tools, and all the things that we need, so no two of us have the same provisioning on our accounts.
We started to look more into other aspects of the company and other teams, and we started to find the answer is a resounding no, that’s not valid at all. In almost no case could we take two people on a team and say, “Yes, these two should have almost exactly the same.” It’s very, very rare in our particular case because our lines of business are so intertwined. That’s okay, we still have another assumption to check, so how about the one with the large number of accesses?
If you created a box-and-whisker plot of everybody’s accesses, you get something like this. The pure numbers don’t really matter; this is just an example that I found. Essentially, we’re saying, “I want to look at these people, and I want to audit them and make sure they don’t have a whole bunch of accesses that they shouldn’t.” But couldn’t these also have a whole bunch of errors, even though they only have 20 accesses or whatever in some cases? Couldn’t there still be a bunch of problems?
In a more specific example, let’s say someone has three accesses and two of them are wrong. What if those two are admin rights on some very, very sensitive data? Then you have someone with 500 where it doesn’t even necessarily matter how many are wrong, but what if they’re distribution lists or something like that, something far less impactful? Sure, I’ll catch that person on the right, but I’m going to miss this person on the left, and there are a lot of these. Is this one valid? Kind of. You’ll catch some of the people, but in reality, everyone needs to be audited equally. It will give you a sense of who the higher-risk individuals are, but you’re still going to miss a whole bunch of it, so you have to start over.
None of our assumptions are valid, and so, again, what are we trying to do? We’re just trying to manage user access, but what we noticed when we were validating our assumptions was that that wasn’t really what we were asking. We are really asking, is there a way that I can create a cohesive grouping of people that are similar? Because if I can do that, I can say, “Okay, I don’t care what team you’re on, I don’t care what you’re doing. If you two individuals are working on similar projects, whether it’s in different lines of business or whatever it may be, if I can figure out groupings within the company, I can, one, determine what makes you different from that group, which may be an anomaly, and, two, give people that are joining your project a more accurate representation of the things that they’ll need.”
How can we do this? We use networks, or graphs. For those of you that don’t have a background in this, I’m going to give a really quick overview of what a graph is, because it’s going to be very important.
Basically, you have a concept of nodes and edges. Nodes are your data and edges are the relationships between those pieces of data. You can have either directed or undirected edges. Now, let’s say that I was tracking transactions or something like that. If I pay someone $20, I would have an edge pointing to them. It’s a one-way transaction, whereas if I have a LinkedIn connection or a Facebook friend or whatever it is, it’s a mutual relationship. That’s the difference between directed and undirected, and that’s basically all I’m going to say about graphs, but those are the basic pieces that you need to know for the remainder of this.
Now we can set up our problem. We can start to actually look into our data and figure out how we’re going to, one, process it in a way that makes sense and, two, put it into some machine learning model. In basic terms, this is what we have for Active Directory data. A lot of it has been stripped out, but for the purpose of this talk, we have an employee ID, which is an EID, and a big comma-separated list of accesses that they have. I also have a set of HR data that has the expected information that you would find: locations, managers, things like that.
In order to pass this to the machine learning model, we can’t leave it in this comma-separated format, so we have to clean it up a little bit, do some joins, and get it into a form that we can actually use. We end up with something like this. Instead of having each row represent one associate, we have each row represent an access. If you notice on the right, there’s another field here called VP employee ID. That was not one that was given to us. We had to figure it out, and it’s actually a little bit important for what we ended up doing in the initial version of our model. The reason for this is that we want to limit the size of the graph that we’re creating, which I will get into in a second, but that’s how we did that.
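The reshaping described here, from one row per associate to one row per access, can be sketched roughly as follows. This is only an illustration: the column names, the sample access names, and the `VP_OF` lookup are hypothetical, since the talk does not show the real schema.

```python
import csv
import io

# Hypothetical raw export: one row per associate, with all accesses packed
# into a single comma-separated field.
RAW = """eid,accesses
abc123,"aws-billing-ro,nyc-server-admin,cards-txn-read"
xyz789,"aws-billing-ro,cards-txn-read"
"""

# Hypothetical lookup from employee ID to VP employee ID (derived separately
# from the HR data).
VP_OF = {"abc123": "vp001", "xyz789": "vp001"}


def explode_accesses(raw_csv, vp_of):
    """Return one (eid, access, vp_eid) tuple per access."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        for access in record["accesses"].split(","):
            access = access.strip()
            if access:  # skip empty entries such as trailing commas
                rows.append((record["eid"], access, vp_of.get(record["eid"])))
    return rows


rows = explode_accesses(RAW, VP_OF)
for row in rows:
    print(row)
```

At the scale mentioned in the talk this step would presumably run in Spark rather than in plain Python; the sketch only shows the shape of the transformation.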
Now, the way we calculated that was just simple recursion. We just looked through that HR file and we were able to say, “All right, give me this person’s direct reports, and then theirs, and then theirs,” and so on, until we were able to create these trees of people within our organization. Again, this is solely for limiting the size of the graphs that we’re processing. It has nothing to do with the model itself.
Given that, we have our data in a format that we can use, so let’s start setting the graph up. We can do the nodes first. Now these nodes just represent associates. Again, let’s say it’s under someone, XYZ123. This has nothing to do with access at this point; this is just each individual. We haven’t said how they’re similar or anything like that. Now we have to look into what the edges are. Are we going to use undirected edges or directed edges? How are we going to do this? It’s a little bit complicated.
The problem is the entire model hinges on how good this edge condition is. If the edge condition is wrong, you have a completely useless model, so we need to determine how we say two people are similar. I have all of their accesses, I have a big set. They’re all distinct because I can’t have the same access twice. How do I determine if two people are similar?
What we used initially was the Jaccard index. Essentially, what the formula is saying is how much overlap there is between these two sets. We’re looking at the size of the intersection over the size of the union of the two sets. It ranges from zero to one, one being most similar, zero being least similar.
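As a rough sketch of that recursion, the function below walks up the reporting chain until it reaches a VP. The HR layout (a manager ID plus an `is_vp` flag per employee) is an assumption made for illustration; the talk does not describe the actual HR fields.

```python
# Hypothetical HR data: eid -> (manager_eid, is_vp).
HR = {
    "vp001": (None, True),
    "dir01": ("vp001", False),
    "mgr01": ("dir01", False),
    "abc123": ("mgr01", False),
    "xyz789": ("dir01", False),
}


def find_vp(eid, hr, _seen=None):
    """Recursively follow manager links until a VP is found.

    Returns None if the chain ends without a VP, a record is missing,
    or the data contains a reporting cycle.
    """
    seen = _seen if _seen is not None else set()
    if eid is None or eid not in hr or eid in seen:
        return None
    manager_eid, is_vp = hr[eid]
    if is_vp:
        return eid
    seen.add(eid)
    return find_vp(manager_eid, hr, seen)


vp_of = {eid: find_vp(eid, HR) for eid in HR}
print(vp_of)
```

Grouping associates by the resulting VP ID gives the smaller per-organization graphs mentioned in the talk.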
For the purposes of this, just assume it’s allowing us to know how similar two people’s provisions are.
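For reference, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch, with made-up access names:

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty profiles
    return len(a & b) / len(union)


# Hypothetical access profiles: two accesses shared out of four distinct ones.
alice = {"aws-billing-ro", "cards-txn-read", "spark-cluster-dev"}
bob = {"cards-txn-read", "spark-cluster-dev", "nyc-server-admin"}

similarity = jaccard(alice, bob)
print(similarity)  # 2 shared / 4 distinct = 0.5
```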
Great, we can go back to this. I can pick any node and then pick any of its neighbors; it doesn’t matter. Then I can calculate the Jaccard similarity. This one is 0.2. Great. I’ll make an edge, and I’m keeping that similarity as the weight of the edge. For instance, two nodes that are more similar, you can think of them as more tightly bound together, basically. 0.2, that’s a fairly weak edge, but I make it anyway. 0.6, okay, that’s a little bit more similar. Then I continue this process, but there’s a fairly obvious problem that’s about to happen.
It’s going to have a ton of edges. I have N nodes, and I will end up with this formula, essentially one potential edge for every pair of nodes. For the entire graph of Capital One, which is the eventual goal here, that makes about 800 million edges. The way that we want to do this, we want to have it run daily at first, but eventually on a kind of streaming basis. If I change my profile, I want that automatically reflected in our tools. This is not going to work.
How about this? What if we create a minimum threshold? We say, “Okay, we’re only going to make an edge between two people if they’re at least 60% similar or so.” All right. We can try that. That knocks out this edge because this was 0.2, so I can skip it. This one stays, this one goes. Great. Okay. This is a lot better. Well, there’s still a problem, right? I still had to calculate all of those. It really didn’t save me any time. It makes the actual community detection that we’re going to do a little bit easier because there are fewer edges to evaluate, but it still took me a very long time to calculate that graph.
This is still not going to work. We can do one of two things. We can require small graphs, which is what I was saying before about the VP thing, and this was the reason that we did that. By limiting the actual number of nodes, we are limiting at least how long it is going to take to even set the problem up before we can do anything of value. Or we can be smart about how we do this. Do we really have to check every single pair of nodes? You would think yes. That’s another “sort of.” We don’t necessarily have to know exactly how similar two people are. I just need to know if you’re similar enough. I don’t really care if you’re 65% or 68% similar; if you’re over 60%, I’ll take it.
To do this, we use another algorithm called MinHash LSH. I’m going to go fairly briefly into this. It’s a very detailed algorithm, but I’ll give a little run-through. There are two pieces, the first of which is the MinHashing. Essentially, what we’re trying to do here is, instead of computing every single one of those similarity indices, can we just estimate? Can we do that more quickly than having to actually sit there and do all these computations? The answer is yes. Our goal here is going to be to create what we’re going to call signatures. It’s exactly the same principle as your signature on a document or something. It’s something that will represent these large sets of accesses. We want to shrink them, essentially, but still retain some of the properties of the original set.
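The slide's formula is not reproduced in the transcript, but the 800 million figure is consistent with the number of unordered pairs of N nodes, N(N-1)/2, for the roughly 40,000 associates mentioned earlier. A quick check:

```python
def pair_count(n):
    """Number of unordered pairs among n nodes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


associates = 40_000  # approximate headcount quoted in the talk
print(f"{pair_count(associates):,}")  # 799,980,000 pairwise comparisons
```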
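A brute-force version of this thresholded graph might look like the sketch below (profile contents are invented). It keeps only edges at or above the threshold but, as noted, it still has to compute the similarity for every pair, which is what the comparison counter makes visible.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_edges(profiles, threshold=0.6):
    """Return ({(u, v): weight}, number_of_comparisons) using all pairs."""
    edges = {}
    comparisons = 0
    for u, v in combinations(sorted(profiles), 2):
        comparisons += 1
        weight = jaccard(profiles[u], profiles[v])
        if weight >= threshold:
            edges[(u, v)] = weight  # similarity kept as the edge weight
    return edges, comparisons


# Hypothetical profiles: a and b share 4 of 6 distinct accesses; c is unrelated.
profiles = {
    "a": {"acc1", "acc2", "acc3", "acc4", "acc5"},
    "b": {"acc1", "acc2", "acc3", "acc4", "acc6"},
    "c": {"acc7", "acc8"},
}

edges, comparisons = build_edges(profiles)
print(edges, comparisons)
```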
As an example, let’s say we have two people again. The Jaccard similarity is 0.375, not very similar, but let’s start taking some signatures. If I take a random subset of these accesses, let’s say four, I still only have one that overlaps. Before, I had three potential overlaps, but since I’m sampling randomly, it’s just the luck of the draw that there’s only one. What if I take a random subset of two? Now I have no matches. What we’re trying to accomplish here is to take signatures that are still going to give me roughly the same amount of matches but allow me to store a lot less data. This is the power that we’re trying to get out of this MinHashing step. Why do this? Well, there are a few advantages.
All of our accesses are these long code names that follow some naming structure that they forgot how to do years ago. Nobody knows how to read them. Instead of storing all those strings, we can just store a few of those items, and we can store them as integers rather than the actual strings, which is going to make our life a lot easier as we go forward. But that really still didn’t help us. Well, a little bit: it made them a little smaller, but we still have to check every single pair. There’s a second part of this, though.
LSH is the second part of this algorithm. It stands for Locality Sensitive Hashing. What we want to do here is what I was talking about before. This is where we want to do our estimating. This is a stochastic algorithm, so it’s not actually going to give you the same output every single time, but we want to find candidate pairs of nodes that are going to be similar enough, essentially. We’re going to decide on some level of acceptability, and we’re going to say, “Okay, we want pairs that are likely to be over this.” They may be under, they may be way over, or we may miss some. It happens, but that’s where the tuning part comes in.
The way that we do this is to separate that MinHash matrix that we just came up with into bands. This is what a MinHash matrix would look like. Let’s say I have five associates here, and these are the signatures that I’ve just created from that MinHashing step. Again, there are no access names left here. This is essentially unreadable for someone who has just picked this up. For our particular case, we know that these signatures represent the original sets. How do we separate them into bands? We literally just split the rows up.
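The talk illustrates signatures as random subsets. The textbook MinHash construction, sketched below, instead keeps the smallest hash value of the set under each of k hash functions, and the fraction of positions where two signatures agree estimates their Jaccard similarity. Whether the team's implementation matched this is an assumption; the salted SHA-1 hashing and the signature length of 256 are choices made purely for illustration.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of `item`, salted with `seed`."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(accesses, num_hashes=256):
    """One integer per hash function: the minimum hash over a non-empty set."""
    return [min(_hash(seed, a) for a in accesses) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical profiles: 3 shared accesses out of 8 distinct ones, so the
# exact Jaccard similarity is 3 / 8 = 0.375, as in the talk's example.
person_a = {"acc1", "acc2", "acc3", "acc4", "acc5"}
person_b = {"acc3", "acc4", "acc5", "acc6", "acc7", "acc8"}

sig_a = minhash_signature(person_a)
sig_b = minhash_signature(person_b)
estimate = estimate_similarity(sig_a, sig_b)
print(f"estimated similarity: {estimate:.3f} (exact: 0.375)")
```

The signatures are fixed-length lists of integers regardless of how many accesses a profile holds, which is the storage benefit described above. Comparing two signatures is cheaper than comparing two large sets, but every pair would still need to be compared without the LSH step.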
Now it may seem a little hand-wavy, I just picked two, but that will become clear in a second why I did this. We look through each of these bands, and we look for columns that are identical. In this first band, there are none. Every single column is different from the others. I can see in this next one the first two columns are equal, all the rest are different, and then the last one, the first two, again, are the same, and then C and E are the same. What we’ve done, we say, “Okay, give me AB and CE. If I see any columns that match in any of those bands, those two are likely to be similar.”
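The banding step can be sketched as follows. The signature values are invented so that the result mirrors the slide as described: no matches in the first band, A and B matching in the second, and both A/B and C/E matching in the third.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical MinHash signatures, six rows each (one list per associate).
SIGNATURES = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [7, 8, 3, 4, 5, 6],
    "C": [9, 1, 2, 2, 8, 9],
    "D": [4, 4, 6, 1, 0, 3],
    "E": [2, 5, 9, 9, 8, 9],
}


def lsh_candidates(signatures, rows_per_band):
    """Return pairs whose signatures are identical in at least one band."""
    candidates = set()
    sig_len = len(next(iter(signatures.values())))
    for start in range(0, sig_len, rows_per_band):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # Associates with the same slice of rows land in the same bucket.
            buckets[tuple(sig[start:start + rows_per_band])].append(name)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


pairs = lsh_candidates(SIGNATURES, rows_per_band=2)
print(sorted(pairs))  # [('A', 'B'), ('C', 'E')]
```

Only associates that share a bucket are ever compared, so most pairs are never examined at all. A production version would typically hash each band slice rather than use the raw tuple as the bucket key.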
Now, again, that was a little hand-wavy. Why did I just pick two? What if I had picked all six as a band? I would have gotten no matches. Or what if I had picked one row as a band? I would have gotten a bunch. This is where we can take a look, and we see that the probability of two people coming out as a candidate pair, as a function of how similar they are, follows an S-curve, essentially. What we can do is just tune this, the band heights and the signature size and all of this, based on the similarity that we’re trying to achieve. In our case, we’re looking for 0.6 similarity. We just pick the band size and the signature size that correspond to that probability. We can do this with a high degree of certainty.
Now, that was quite a big jump from our original goal, but we have this connected graph. That was the major issue with this whole problem: the processing involved to get to this stage. Now, to get back to the original question, we wanted to know, can I get groupings of people that are similar? To do that, we use something called the Louvain method for community detection.
Before we get into this, you have to understand one thing. We need to say, “How do I know if a grouping is good? I could take these nodes and just pick any random grouping, but how do I know if that’s a good grouping or not?” For that, we need to know modularity. Modularity essentially measures the density of edges within communities versus the density of edges connecting adjacent communities. For example, this would be a graph of high modularity. You can tell the orange ones here are all one group, and they’re highly interconnected. Then we have those purple ones, and that yellow one is alone.
With low modularity, who knows, these are all over the place. The grouping doesn’t really seem to have any correspondence to the actual edges being placed. We’re looking for these examples on the left, where we have these highly coupled groups that we can say are highly interrelated. This does take into account that edge weight that I was talking about before. Two people that are highly similar are much more likely to be grouped together.
How does this work? Every node initially starts in its own community, which is just represented by the color coding here. I can pick any node, and I’ll start with just this one. We look at each of its neighbors, and then we calculate the new modularity if we were to change its community to that of one of its neighbors. We can see it went up a little bit, to 0.1. There are no more neighbors to check, so I’m going to save that. Now I move on. Again, I can take a look at each of the neighbors. First, I’ll look at this gray one; okay, my modularity went to 0.2. Now I can look at the red one, and it went to 0.15, so I’m going to stick with the gray. I continue to do this all the way around the graph until I end up with no possible modularity increases remaining.
Maybe I’ll end up with something like this. For those of you who know algorithms, this is a greedy algorithm that is also local, which makes it a little bit different from a lot of the other community detection algorithms that exist. A lot of them look at the graph holistically rather than on a node-by-node basis.
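The S-curve has a standard closed form: with b bands of r rows each, two profiles with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, and the curve rises most steeply near (1/b)^(1/r). The values r = 6 and b = 20 below are illustrative only, chosen because they put the steep part near the 0.6 target; the talk does not state the parameters actually used.

```python
def candidate_probability(similarity, rows, bands):
    """P(candidate pair) = 1 - (1 - s^r)^b for b bands of r rows each."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approx_threshold(rows, bands):
    """Similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


ROWS, BANDS = 6, 20  # illustrative values; signature length = 6 * 20 = 120

print(f"approximate threshold: {approx_threshold(ROWS, BANDS):.3f}")
for s in (0.3, 0.5, 0.6, 0.7, 0.8):
    print(f"similarity {s:.1f} -> P(candidate) = "
          f"{candidate_probability(s, ROWS, BANDS):.3f}")
```

Taller bands (more rows each) push the threshold up and reduce false candidates; more bands push it down and reduce misses.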
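One common way to compute weighted modularity for an undirected graph is Q = Σ_c [ L_c / m - (d_c / 2m)² ], where m is the total edge weight, L_c is the weight of edges inside community c, and d_c is the summed weighted degree of its nodes. A minimal sketch on a toy graph (two triangles joined by one edge), assuming no self-loops:

```python
from collections import defaultdict


def modularity(edges, community):
    """Weighted modularity. edges: {(u, v): weight}; community: {node: label}."""
    m = sum(edges.values())
    if m == 0:
        return 0.0
    internal = defaultdict(float)  # weight of edges inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: triangles a-b-c and d-e-f, joined by the single edge c-d.
EDGES = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}

good = modularity(EDGES, {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1})
bad = modularity(EDGES, {"a": 0, "d": 0, "b": 1, "e": 1, "c": 2, "f": 2})
print(f"triangles as communities: {good:.3f}")
print(f"scrambled grouping:       {bad:.3f}")
```

Grouping each triangle together scores well above the scrambled grouping, which matches the high-versus-low modularity pictures described above.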
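The node-by-node pass described above can be sketched as below. This covers only the local-moving phase; the full Louvain method also collapses each community into a single node and repeats, and real implementations compute the modularity gain incrementally instead of recomputing the whole score for every trial move as this deliberately simple version does.

```python
from collections import defaultdict


def modularity(edges, community):
    """Weighted modularity. edges: {(u, v): weight}; community: {node: label}."""
    m = sum(edges.values())
    internal, degree = defaultdict(float), defaultdict(float)
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def local_moving(edges):
    """Greedy local phase: move nodes to a neighbor's community while Q rises."""
    nodes = sorted({n for pair in edges for n in pair})
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    community = {n: n for n in nodes}  # every node starts in its own community
    improved = True
    while improved:
        improved = False
        for node in nodes:
            best_label = community[node]
            best_q = modularity(edges, community)
            for nb in sorted(neighbors[node]):
                label = community[nb]
                if label == community[node]:
                    continue
                trial = dict(community)
                trial[node] = label
                q = modularity(edges, trial)
                if q > best_q + 1e-12:  # accept strict improvements only
                    best_q, best_label = q, label
            if best_label != community[node]:
                community[node] = best_label
                improved = True
    return community


# Toy graph: triangles a-b-c and d-e-f, joined by the single edge c-d.
EDGES = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}

community = local_moving(EDGES)
print(community)
```

Because every accepted move strictly increases modularity, the loop is guaranteed to stop.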
Great. We have communities now, but what can we do with them? Let’s say that I have three people that are in a group, and they share this list of four. It doesn’t really even matter what they are. I just picked four random things, but once I know what makes that group similar, I can look at each of those individuals and say, “What makes you different in this group? I understand that you are highly similar to these people, but what is it that makes you different?” In our case, a lot of the time that would be things from their past, old teams they were on, NYC server admin.
I changed the names of these accesses, of course, but this was an actual example of things that we were able to find: servers that nobody knows what they do anymore, things that are obsolete, and all kinds of issues.
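One simple way to express "what makes you different in this group" is to treat accesses held by a large share of the community as its common profile and flag the rest for review. The 50% cutoff and the access names below are assumptions for illustration, not the rule used in the talk.

```python
from collections import Counter


def flag_unusual(group, min_share=0.5):
    """For each member, list accesses held by fewer than `min_share` of the group.

    group: {eid: set of accesses}. Returns {eid: (unusual_accesses, fraction)}.
    """
    counts = Counter(a for accesses in group.values() for a in set(accesses))
    common = {a for a, n in counts.items() if n / len(group) >= min_share}
    report = {}
    for eid, accesses in group.items():
        held = set(accesses)
        unusual = sorted(held - common)
        fraction = len(unusual) / len(held) if held else 0.0
        report[eid] = (unusual, fraction)
    return report


# Hypothetical community: three people sharing four accesses, one of whom
# still holds an access from an old team.
SHARED = {"cards-txn-read", "spark-cluster-dev", "aws-billing-ro", "ml-notebooks"}
GROUP = {
    "e1": set(SHARED),
    "e2": set(SHARED),
    "e3": SHARED | {"nyc-server-admin"},
}

report = flag_unusual(GROUP)
for eid, (unusual, fraction) in report.items():
    print(eid, unusual, f"{fraction:.0%} potentially anomalous")
```

As noted later in the Q&A, flags like these are recommendations: a legitimately unique access will also show up here as a false positive.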
Another way that we were able to start to remedy this user access problem was with onboarding. It’s a little hard to read, but for any of you that have ever onboarded, which I assume is everybody at some point, you probably have come across the issue of not knowing where to start, and nobody else really seems to know where to start either sometimes. They just say, “Well, I don’t know. Just go see what that guy has and just take it.” It happens all the time. The problem is, now, if I just take everything that other person has, I not only take the things I do need, I take the things that they don’t need anymore. Now my profile is all out of whack and it’s got [unclear], and you’re just perpetuating the issue.
What if instead I could give you a table of accesses that you need and the likelihood that you will need them? These are some of the things that we’ve been trying to offer to hiring managers and managers that have to do these audits on a monthly or bi-weekly or six-month basis. One of the major issues here is that it’s really easy to show you this on a PowerPoint, but how do we actually boil this down for someone that has no idea what any of this means?
This was also one of the biggest pieces. We needed to create an interface for people to actually use this: people that don’t care why it works, they don’t care how it works, they just want to know what to do, and it has to be a little bit intuitive for them. Essentially, these were the requirements for our UI. This was our whole pipeline up to now. We were receiving our data on S3, we were processing it with Apache Spark, and we were storing it on Elasticsearch for now. These were just the requirements that we had. We had to build around these at this point.
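A table like that could be approximated by the share of a community's members who hold each access. This is an assumed, simplified scoring, and the group and access names are invented; the talk does not say how the likelihoods were actually computed.

```python
from collections import Counter


def access_likelihood(group):
    """Return [(access, share_of_members_holding_it)], most common first."""
    counts = Counter(a for accesses in group.values() for a in set(accesses))
    size = len(group)
    return sorted(
        ((access, n / size) for access, n in counts.items()),
        key=lambda row: (-row[1], row[0]),
    )


# Hypothetical community the new hire is joining.
GROUP = {
    "e1": {"cards-txn-read", "spark-cluster-dev", "ml-notebooks"},
    "e2": {"cards-txn-read", "spark-cluster-dev", "ml-notebooks"},
    "e3": {"cards-txn-read", "spark-cluster-dev", "nyc-server-admin"},
}

table = access_likelihood(GROUP)
for access, share in table:
    print(f"{access:<20} {share:.0%}")
```

Accesses near the top are ones almost everyone on the project holds; those at the bottom are the less likely ones, which may well include leftovers that nobody should copy.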
For our UI, we decided on React JS, just because it was such an extensible UI framework. It really lent itself to this use case. We used a library called Viz JS, which was absolutely outstanding for the network visualization. We hosted all of this on AWS. Before, I hope this is my last slide, but I wanted to show you what we ended up building. I can drag this right dimension over to that application. We still have it at the VP stage because it doesn’t– You may see one of the issues here. Let’s say I were to enter it for me. I have no direct reports. That’s a completely useless evaluation. By keeping it at a somewhat higher level and abstracting that away from the user, we can display it for anyone that we want.
As a user, I’m going to get the wrong insights if my grouping is too small here. I’m going to get the wrong– Just completely wrong information. Actually, let me refresh it. We have a nice visualization on the side where you can actually zoom in and see. It’s going to be a little hard for me to control this, but you can actually zoom in on each of these nodes, and they correspond to actual associates. You can see each of their IDs and everything. Now, let’s say that I want to look at my own profile here and figure out what I know I need, so I can take a look. I also get a percentage that shows how much of my profile is considered potentially anomalous.
As a manager, and this is very tiny, but I can have all of my accesses listed now. I can just go on to our platform and simply delete them. We’re also currently working on integrating this with our actual API to send the requests to remove these accesses automatically. As for our hiring managers or our people managers, you can actually just sort by direct reports and see exactly how risky they are. Or what if someone is just joining and they’re going to be working on a project very similar to mine, or working on the same project?
We can resolve that issue I was talking about earlier, about just taking all of my accesses. They can say, “All right, you’re going to be working with Austin for this, so why don’t you take a look.” Now I can give you this table again of all the accesses that you’ll most likely need and the ones that are a little bit less likely. It makes the problem so much smaller. Instead of looking through lists of 100 or 200, I can just look at 10. It’s much easier for me to know which of these my associates actually need and which ones they don’t.
That is the end of my material but I am happy to take any questions about any of our architecture, or the model, or anything.
Participant 1: Let me walk around.
Participant 2: Hi, I come from Identity and Access Management at Harvard. I see you showing these access controls from the directories. What about how people actually use their access? Like sourcing data from the logs of those central systems?
Jon Austin Osborne: It’s so funny you ask that. This actual project began that way, and the one issue was that our log data got locked right before it. Essentially, this is the first pass at this because our Active Directory data was available for me to use at this point. If you notice, for this entire model, there was nothing specific about this problem. This whole infrastructure, the way this is set up and the analysis, is completely extensible to a whole bunch of other things. Essentially, like I said before, picking that edge condition is the most important part by far.
If you have the right data set and you can massage it in the way that you can put it into the model, and then you can figure out the right relationship, what are you examining. I could have chosen people that had the same breakfast. It would have made no difference. I could have still done this, but it would have been useless, but this works with any data set. Yes, we were absolutely looking at the logs, and that’s probably the next step of this because that will absolutely give you some outstanding insights.
Participant 3: Two questions actually. You mentioned a lot of algorithms, and my question is how did you choose those? The second question actually relates to this: what is your performance metric? Basically, how do you know that your final model works well?
Jon Austin Osborne: How did we choose– what was the first one? What did we choose?
Participant 3: What was your decision-making process for choosing the algorithms that [crosstalk]–
Jon Austin Osborne: For choosing the algorithms? It’s a little bit iterative. The reason I’ve structured the deck in this particular way is because it’s actually very similar to how our thought process worked. There are problems that you don’t necessarily foresee sometimes. I think we didn’t necessarily count on– We were so focused on having the community detection work. That was the first thing we did. We figured out how we were going to do this community detection in a quick way, first of all. We spent a lot of time looking through a whole bunch of algorithms, and that’s how I ended up running across the Louvain method.
That was just looking through different research papers and then we– Obviously, there are a ton of different choices, and some of it is completely preference-based, but at the end of the day, we came up with that. Then we realized this was actually not the hardest problem here. The hardest problem was figuring out how we were going to process all that data pretty much in real time and do it at a user’s request. Creating the graph was actually the issue, so we spent quite a bit of time looking at ways to do that, and that’s how we ended up with the MinHash LSH implementation. Even still, there are so many things that we want to change, like moving our database from Elasticsearch to something like Neo4j.
There are so many things that we are looking into expanding, because it is very iterative and you can’t necessarily predict every single problem you’re going to run into. The second question was, oh, how do I evaluate? It’s tough. This particular use case is very tough to evaluate because it has a lot of false positives. For instance, there are some of these on that other rogue page. There are some things that I actually do need; they’re legitimate. I’m just the only person with those accesses, and maybe I only needed them temporarily, so they will show.
What we have found, and honestly it’s purely anecdotal, is that we are actually able to catch almost every single one you do not need, but it will show you a lot of false positives. In terms of the communities, we’re just using the modularity as the metric. When we create the graphs, if the modularity is too low, we will not allow that to be displayed to the user, because it indicates that the grouping was not good enough. We just don’t allow that to be displayed, because we don’t want to give people the wrong impression.
Modularity is one that we use, but it is a very complicated problem to figure out exactly how you evaluate how good these are. We could track how many of these are actually removed, but you have some people that just– I still can’t force you to remove them. We can only essentially give recommendations. We can’t even necessarily take statistics on the percentage of accesses that were actually removed because of this. It’s a tough problem. I think we are still struggling with it a little bit.
Participant 4: While calculating in your model, did the level of access or the sensitivity that any access carries play into the weights in building the model, or how did you account for that in your process?
Jon Austin Osborne: We played around with doing weightings. One of our issues was that for some of these, I started out with just the names. We didn’t have actual levels, for example, this is an admin access, this is a developer access, this is a monitoring access, things like that, which would have been gold. That would have been absolutely outstanding. What we ended up doing instead, we were actually able to run some other clustering algorithms before this, almost as preprocessing, to get our list down to accesses that people actually cared about. Now, for example, you will not see mailing lists on this, or you will not see things like Microsoft Office, IDEs, and stuff like that.
We actually did have to do some preprocessing there, but essentially, we were keeping everything at the same weight for now. That’s strictly due to the data that we have, so it’s less of a design choice and more of a restriction that we had. But, yes, absolutely, weighting this.
There are actually other data sets that we were looking at that do take into account the level, the actual number of permissions, and things like that. For instance, with AWS, you can have roles that have a whole bunch of permissions and then one that only allows you to look at billing or something like that. We have data sets like that, but that’s not in this particular one. Weighting is definitely valuable, yes, absolutely.
Participant 5: I have a two-part question. Could you throw some light on the data pipeline that’s essentially powering it, from getting data from Active Directory into your database? Then, just to clarify, are you using Neo4j or planning to use it? If so, why? That would be my question.
Jon Austin Osborne: Planning to use. I’ll go in reverse order. Planning to use it strictly because of the querying capabilities. Querying with what we’re doing right now, and the memory handling, are just not the best way to do this, mainly because we want to have the entire company as a graph.
[00:32:35] [END OF AUDIO]