Why AIOps is Critical for Networks

Speaker 1: This is Techstrong TV.

Mitch Ashley: I have the great pleasure of being joined by Andrew Colby. Andrew is VP of AIOps at Vitria. Welcome, Andrew.

Andrew Colby: Good afternoon, Mitch. And thank you.

Mitch Ashley: It’s a great topic. I’m excited to talk with you about it. We could go down the path of sharing war stories from telco experience, which really could be about 10 episodes of a different show. But today in the telco environment, or just in the business environment in general, with the economic conditions, competitive pressures, and looking for areas where we can get more for less, there are a lot of different parameters that have shifted or changed or maybe tightened that we’re currently working within. I’d love to get your perspective on that.

Andrew Colby: Certainly, and thank you. Yeah, I’d say we see cautious optimism. Obviously, I’m based in the US in the DC Metro area, Maryland. And the US, the government entities and quasi-governmental entities have been tightening the economic structure in order to tame inflation. Fortunately, that has not driven our economy and had the potential recessionary effect that was feared, but people are still cautious, businesses are still cautious. That said, it’s hard to hire people and it’s really hard to hire technical people. So a lot of companies are continuing to look towards how to leverage technologies and automation to build efficiency so that they can do more with either the same number of people or re-task their people to higher value purposes, and let the technology do some of the more menial and mundane tasks.

And we can explore this a little bit, especially in these new complex service delivery and network environments. It’s very difficult for me to imagine how an engineer who’s gone through anywhere from two to eight years of college education is going to really be happy spending their days collecting a lot of data across network, container management, VM and other infrastructure systems to figure out what’s going on. I mean, really that’s where a lot of the automation provides a significant amount of value, to let the engineers do the smart, difficult things that we want humans to do.

Mitch Ashley: And a lot of pressures around mean time to recovery, even looking at resiliency: how do we stand up under a stressful situation, whether it be a security attack that might be going on or some unobserved condition that our systems and networks have never been under?

Andrew Colby: Oh, there’s so much of that. So much is changing. It’s not just a person like you or me behind a smartphone that can actually report that there’s a problem, but it’s sensors and equipment that won’t necessarily report right away, so it needs to be detected. So that’s a whole nother additional dimension that service providers, large enterprise IT organizations are under, which is to be able to have this kind of real-time awareness of what’s going on. Whether the service is real time, like the video conference that we’re on or not, there really is a desire and expectation to have real time awareness of the service delivery to be able to detect what’s going on, react to it, address it before the user, whoever that is, the customer, the employee detects a problem, identifies a problem, and actually reports it. That’s really kind of the last line of defense that you want to have.

Now, that said, when your customers are reporting problems, I think it’s really important to have a way to hear that, and that’s another thing that we are able to do with VIA AIOps in particular. We have customers that use that care event data, which is… It’s kind of messy data. It’s human generated, so it’s late in the timeline, but when the users are telling you something’s going on, you need a way to be able to hear that so that you can definitively react. I mean, I know for most people, there’s nothing more frustrating than having an interaction with your technical support teams that tells you, well, there’s really no problem. There’s nothing wrong or I don’t see a problem.

Mitch Ashley: That’s when customers feel like they know more about it than you do, right? I am experiencing this problem, help me please.

Andrew Colby: Absolutely.

Mitch Ashley: Well, I do want to hear some more about VIA AIOps. Just about can’t stumble out the door without tripping over AI in some form, maybe generative AI at the moment. So I’d love to hear some more about your approach around AIOps.

Andrew Colby: Certainly. I mean, it really starts with ingesting a wide variety of information, structured, semi-structured, unstructured information, enriching it, in some cases, classifying it and providing additional context around it. So we take this tremendous amount of data, the telemetry, the faults and others, and by enriching it, we actually grow it even larger and then need to be able to process it in a way that provides meaning for the business. Now, historically, you’d have expert engineers who would say, oh, well this value should be here and above that or below that, would be where we care about it.

There’s two things that are happening simultaneously. There’s too many measures. I was just working with a team turning up a pretty simple system on a container management system, a CMS, and they said, “Well, there’s like 2,400 metrics just in that one system,” haven’t even gotten really to the application measures yet. So tell me what’s important. So that’s one of the challenges, is just you can’t set the values for everyone anymore. And then the other challenge is to know which of them are important. Again, it depends a little bit upon how the service is consumed. Is it one way or bi-directional? Is it real time? Is it latency sensitive? Is it packet loss sensitive? What are the characteristics of it? And depending on those is going to be what’s important and then what’s normal. I mean, every service has some sort of pattern throughout the day and throughout the days of the week.

The machine learning, the unsupervised learning is able to learn what those normals are and then identify when those measures and values deviate from that normal and alert on it. And more important or equally important to just alerting on it, is to be able to bring together all of the related information. And that’s where the enrichment becomes important. It’s not just enough to say, oh, well this measure changed, but well, what is it related to? What are the services that run over that infrastructure? What are the other measures that have changed among those same infrastructure at about the same time? What were the planned changes or manual changes that went in, in the prior 15 minutes? Because maybe one of them had an impact and maybe it didn’t. But that’s really important and useful information to know. So there’s all this bringing together of the enriched information in order to get a full picture of what’s going on.
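Editor's sketch: the idea of learning what is "normal" for each time of day and day of week, then alerting on deviation, can be illustrated with a very simple per-bucket baseline. This is not how VIA AIOps is implemented; it swaps in a plain mean and standard deviation per (weekday, hour) bucket as a stand-in for the unsupervised learning described above. The function names, the data layout, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_baseline(history):
    """Learn a per-(weekday, hour) baseline from (weekday, hour, value) samples.

    Returns {(weekday, hour): (mean, standard deviation)}.
    Buckets with fewer than two samples are skipped (hypothetical rule).
    """
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)
    return {
        key: (mean(values), stdev(values))
        for key, values in buckets.items()
        if len(values) >= 2
    }


def is_anomalous(baseline, weekday, hour, value, threshold=3.0):
    """Flag a value that deviates from its bucket's normal by more than
    `threshold` standard deviations. Unknown buckets are never flagged."""
    if (weekday, hour) not in baseline:
        return False
    mu, sigma = baseline[(weekday, hour)]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Made-up samples: Monday 09:00 is busy, Monday 03:00 is quiet.
history = [(0, 9, v) for v in (100, 102, 98, 101, 99)]
history += [(0, 3, v) for v in (10, 12, 9, 11, 8)]
baseline = learn_baseline(history)

busy_hour_flag = is_anomalous(baseline, 0, 9, 101)   # normal at 09:00
quiet_hour_flag = is_anomalous(baseline, 0, 3, 101)  # same value at 03:00
print(busy_hour_flag, quiet_hour_flag)
```

The same reading of 101 is unremarkable during the busy hour and a clear deviation during the quiet one, which is the reason a single fixed threshold per metric stops working once the service has a daily and weekly pattern.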

And that’s a lot of the value of the automation and the AIOps brings together. You don’t need to have your highly skilled, expensive engineers do a lot of that manual data gathering. Let the machine intelligence bring that together for you and let the engineers do the value that humans can do effectively, which is interpret that, identify new patterns, see what’s new, especially if you’re rolling out a new service or you have different behaviors going on. Now, there’s a wide variety of AI and ML that goes into this.
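Editor's sketch: the "bringing together" of related information for one deviating measure might look roughly like the following. It is a minimal illustration under assumed data shapes, not the product's logic: the dictionary fields, the `build_context` name, and the sample resources are hypothetical. The 15-minute look-back for changes comes from the conversation above; reusing the same window for "about the same time" is an assumption.

```python
from datetime import datetime, timedelta


def build_context(anomaly, service_map, anomalies, changes,
                  window=timedelta(minutes=15)):
    """Gather context for one anomaly: the services on that resource,
    other anomalies on the same resource at about the same time, and
    changes applied to it in the preceding window."""
    t = anomaly["time"]
    resource = anomaly["resource"]
    return {
        "anomaly": anomaly,
        "services": service_map.get(resource, []),
        "related_anomalies": [
            a for a in anomalies
            if a is not anomaly
            and a["resource"] == resource
            and abs(a["time"] - t) <= window
        ],
        "recent_changes": [
            c for c in changes
            if c["resource"] == resource
            and timedelta(0) <= t - c["time"] <= window
        ],
    }


# Made-up example data.
t0 = datetime(2023, 6, 1, 12, 0)
anomaly = {"resource": "router-7", "metric": "packet_loss", "time": t0}
anomalies = [
    anomaly,
    {"resource": "router-7", "metric": "latency",
     "time": t0 - timedelta(minutes=2)},
    {"resource": "router-9", "metric": "latency", "time": t0},
]
changes = [
    {"resource": "router-7", "summary": "firmware update",
     "time": t0 - timedelta(minutes=10)},
    {"resource": "router-7", "summary": "ACL edit",
     "time": t0 - timedelta(hours=3)},
]
service_map = {"router-7": ["video-conferencing"]}

context = build_context(anomaly, service_map, anomalies, changes)
print(context["services"])
print([a["metric"] for a in context["related_anomalies"]])
print([c["summary"] for c in context["recent_changes"]])
```

The change from ten minutes earlier is surfaced as a candidate, not a verdict: as noted above, it may or may not have had an impact, but an engineer should see it without having to go looking.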

You mentioned generative AI, which is a very interesting topic and has a lot of both technical and popular coverage and attention right now. And there are places where that’s effective to be used for things like the data extraction and ingestion, which is learning what the data is and understanding how to interpret it, as well as potentially developing a suggested fix or likely cause or likely fix that could be generated from all of the different inputs and individual root issues that are identified. But there’s also a variety of other traditional machine learning and AI techniques that really provide a lot of value and are a part of VIA AIOps, and we leverage very extensively.

Mitch Ashley: In a way, I feel like networking, security, they’re very broad topics. In a few words, you’re talking about many, many different technologies. The same is true I think also for AI. There’s so many fields of it. It’s a bit overwhelming to folks. I know you prescribe kind of taking an incremental approach. Don’t try to eat the whole elephant at once, if you will, that old adage. But what is that incremental approach? How do you work with customers to start to figure out the path they should go down?

Andrew Colby: That’s a great question. I’d say there’s a couple of steps to that. One is to focus on the business value. What is it that you want to achieve and how can you tell that you’ve achieved it? It’s not just enough to collect a lot of data and produce something that looks a little interesting, aha. Can you do something about it? Does it have an impact on the service? What is the measurable impact on the service? So being focused on business value and even directly measuring it is one important piece. The other is because this is all about network and service operations, it’s generally something that’s done across an entire service or entire company. It’s not so easy to try on maybe a departmental level. So it’s a big decision for companies. Because of the way VIA AIOps is built and structured, we enable it to be delivered or we enable delivery of what we call incremental transformation, which is the ability to augment the existing or augment the machine intelligence in AI with the institutional knowledge.

So to be able to specify policies, for example, for things that are… To differentiate what’s important for the business from what’s important in just a measure, to leverage the value of existing investments. Nobody’s starting this from scratch. So there’s always investments in application performance monitoring or other network and service monitoring tools. They’re not bad, but they just may be sort of siloed or they may have a real key piece of information in one area, and we want to leverage that across the entire service delivery. So that’s another way to provide that incremental transformation and leverage. And overall improving the efficiency of the operation staff and being able to deliver these really as individual use cases incrementally to continue to provide business value over time.

Mitch Ashley: You mentioned automation before, and you talked about metrics and what you should be observing. It all starts with getting the data and, as you mentioned, bringing it together. It’s not just correlation of events anymore. Yes, we do that and we need to continue to do that, but if you have 10 factors that are actually part of what’s going on, that’s usually bigger than what someone looking at a monitor or a screen up in an op center is really going to be able to put together. And it seems like that complexity, or maybe the speed of that happening, is also a big driver for where you might consider AI. Agree? Disagree?

Andrew Colby: Absolutely. I mean, the kind of highest level metrics are things like the mean time to understand, MTTU, and mean time to restore, MTTR. Those are top level metrics and you want to build down and drive down from those, which is what goes into them. What does it take to understand a problem? In one of our customers, the challenge was not only to understand the problem, but to identify which part of the network the problem was in, so they can get to the right fix agent quickly. Because sometimes you create, in an incident management system, a ticket, but if it goes to the wrong team, they have to triage it, evaluate it, and they say, “Oh, it’s not us. It has to go to…” Maybe it goes to the firewall team instead of the network, instead of the WAN team.

So that’ll take a lot of time. So getting the likely root issue and likely fix to be in the right area is really important to be able to do that. So it’s areas like what are the components that drive that MTTU? How do you measure it? We have a variety of customers and they take a variety of different approaches. In the end, what they want to achieve is a higher percentage of incidents that are handled through automation. And you can do that by decreasing the overall number of incidents as well as by increasing the number that are handled through automation, and we’re able to take both approaches.
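Editor's sketch: the top-level measures mentioned here can be computed from incident timestamps along these lines. The record fields and sample values are made up, and the exact definitions are assumptions: MTTU is taken as the mean time from occurrence to the problem being understood, MTTR as the mean time from occurrence to restoration, and the automation rate as the share of incidents handled through automation.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"occurred": datetime(2023, 1, 1, 10, 0),
     "understood": datetime(2023, 1, 1, 10, 20),
     "restored": datetime(2023, 1, 1, 11, 0),
     "automated": False},
    {"occurred": datetime(2023, 1, 2, 10, 0),
     "understood": datetime(2023, 1, 2, 10, 10),
     "restored": datetime(2023, 1, 2, 10, 30),
     "automated": True},
]


def mean_minutes(records, start, end):
    """Mean elapsed minutes between two timestamp fields."""
    return mean(
        (r[end] - r[start]).total_seconds() / 60 for r in records
    )


mttu = mean_minutes(incidents, "occurred", "understood")
mttr = mean_minutes(incidents, "occurred", "restored")
automation_rate = sum(r["automated"] for r in incidents) / len(incidents)

print(mttu, mttr, automation_rate)
```

Because the automation rate is a ratio, it rises either when more incidents are handled automatically or when the total number of incidents falls while the automated count holds, which matches the two approaches described above.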

Mitch Ashley: It goes back to what you were originally talking about, not being able to hire enough people, the right people, they may not exist, right?

Andrew Colby: Absolutely.

Mitch Ashley: The people we’d like to have are purple unicorns, but people don’t always have the skills we’re looking for. So it isn’t always just about downsizing people. Sometimes it’s about that curve of what you need to hire, right-sizing that so you can handle the workload with the staff that you have.

Andrew Colby: Right. And some of that is to be able to free up the staff from things like monitoring screens or systems that are just telling you red and green or up or down because that doesn’t have enough context, but you need to understand. And often it’s not just simply red or green, it’s gray or purple, which is, well, it’s working, but it’s not working to the level that we want or need or expect in order to provide the level of service that our customers expect or that our underlying service requires. So it’s being able to provide all that nuance as well as that level of detail and insight, all that enriched information. We go back to that again so that the right action can be taken. Eventually, once the right action gets taken 10 or 50 or 100 times, my expectation is that the trust will have been built in the system so that that action can be now taken in an automated fashion.

And again, that’s an opportunity to accelerate that time, to free up an engineer from doing something that they’ve done 98 times before and be able to more quickly allow the action to be taken. Obviously, you’d still maybe want to have some post analysis review to see, well, did we take the right action? Is that same action being taken every day? Maybe there’s some other problem that really we need to address, but still, as long as they can deliver the service in a way that meets the customers’ expectation and the service expectation, they’re achieving their goal. And they can enable engineers and others in network operations to kind of free up to think at the higher level about what they need to do and can do to continue to achieve that effectively.

Mitch Ashley: Yeah, we live in a world where it’s very easy for customers to say, your service is taking too long, I’ll jump to the next app or the next site or whatever. There is loyalty, but there’s also patience that tries that.

Andrew Colby: Absolutely.

Mitch Ashley: That trust and loyalty. [inaudible 00:16:54] few minutes left. I’d love to have you talk some more about the incremental approach. I think all the buzz about generative AI, and you hear everything from, we won’t need programmers. We won’t need these kind of people, we won’t need those kind of people. We have robotics in factories, but we still need people in factories, right?

Andrew Colby: Right.

Mitch Ashley: I’m not a subscriber to the… The pendulum doesn’t slam against the other side all of a sudden, those kind of changes don’t happen very often. So it’s probably somewhere in the middle where we end up. So knowing that, if you take that assumption, how do you get aggressive enough that you’re getting some value from AIOps, and not so conservative that you’re kind of missing the opportunity?

Andrew Colby: I’d say again, it goes back to aligning your actions with the business values you want to achieve. I’m working with a customer who says they want to achieve a value of running their operations with significantly fewer people, like maybe half or more, less than a traditional network operation center would have in a circuit-switched world, let’s say. That’s the goal from the top down. The people that are responsible from the bottom up, they’re like, “Whoa, slow down. Hold on. Don’t do all of your automation yet. We want to look at everything first. We want to see.” Because that’s how they’ve been used to dealing with it, and that’s some of the place where there’s tension in being able to do this. When the tests are done and you measure response time from issue occurrence to issue detection, to issue resolution, there’s a lot of human think time in there.

And it’s not wasted. It’s people doing their due diligence. Is it really down? Can I restore it myself? Do I need to take an action with an outside team? But those all need to be aligned, and that’s how you can achieve a result. And that’s hard in a large organization. And in all of these cases, we’re talking about large, complex organizations with large, complex service delivery environments.

Mitch Ashley: And you mentioned earlier the trust that’s built in that process too. We don’t just throw AI into the mix and say, it’s in charge now. Let’s walk away and hope it does all right. No, we need to know that-

Andrew Colby: We need to build trust.

Mitch Ashley: … what it’s going to do is the right thing, right?

Andrew Colby: Right. We need to build that trust. And that’s actually one of the reasons that, again, we go back to this incremental transformation. We think that’s really important to be able to do it in steps so that you can see, all right, this is what the system shows and we think the action from that system should be B. So we can give you a button to say, all right, when you believe it’s B, click on B. And again, you do that 50 times. After a while, you’re going to get tired of just saying, well, every time it’s B, why don’t you just do that for me automatically? And that’s where we want to get to.
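Editor's sketch: the step-by-step trust building described here (a person confirms the suggested action many times before it is automated) can be modelled as a simple counter. This is only an illustration of the idea, not Vitria's mechanism. The `TrustTracker` class, the rule that a disagreement resets the count, and the pattern and action names are assumptions; the threshold of 50 echoes the "50 times" in the conversation.

```python
from collections import defaultdict


class TrustTracker:
    """Counts how often an operator confirms the suggested action for a
    given situation, and reports when the streak is long enough to
    consider automating that action."""

    def __init__(self, required_confirmations=50):
        self.required = required_confirmations
        self.streak = defaultdict(int)

    def record(self, pattern, suggested, chosen):
        """Record one operator decision for a (pattern, suggestion) pair."""
        key = (pattern, suggested)
        if chosen == suggested:
            self.streak[key] += 1
        else:
            # Assumption: any disagreement resets the accumulated trust.
            self.streak[key] = 0

    def can_automate(self, pattern, suggested):
        return self.streak.get((pattern, suggested), 0) >= self.required


# Small threshold so the example is easy to follow.
tracker = TrustTracker(required_confirmations=3)
for _ in range(2):
    tracker.record("link-flap", "restart-interface", "restart-interface")
before = tracker.can_automate("link-flap", "restart-interface")

tracker.record("link-flap", "restart-interface", "restart-interface")
after = tracker.can_automate("link-flap", "restart-interface")

tracker.record("link-flap", "restart-interface", "escalate")
after_override = tracker.can_automate("link-flap", "restart-interface")
print(before, after, after_override)
```

Keeping the decision per situation and per action, instead of as one global switch, is what lets automation be introduced one use case at a time.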

Mitch Ashley: It’s kind of supervised learning for AI. Very much so. Well, Andrew, it’s been fascinating to talk with you. I hope you’ll get a chance to come back and chat some more. And I really appreciate you sharing sort of on the ground experiences with customers as you’re working in a large setting. It’s one thing to adopt AI in a startup, it’s another in a large telco environment that generally doesn’t make those kind of shifts on an hourly or daily basis. It takes a bit to get that flywheel turning in the new direction that you’re headed.

Andrew Colby: Absolutely.

Mitch Ashley: So, where can folks find out more about VIA AIOps?

Andrew Colby: Thanks, Mitch. Please go to vitria.com on our homepage, and on the Resource tab you’ll find a suite of information about how you can actually realize this business value and the types of capabilities Vitria and AIOps can deliver.

Mitch Ashley: Great. There’s a lot of great resources under that Resource tab, so be sure and check it out. Thanks for joining me again, Andrew Colby, who is a VP of AIOps at Vitria.

Andrew Colby: Thank you, Mitch.

Mitch Ashley: Talk to you again soon.

Andrew Colby: Look forward to it.