See all Inferless transcripts on YouTube

ML Model Deployment in Kubernetes with Sushant Hiray from RingCentral

22 minutes 42 seconds

🇬🇧 English

S1

Speaker 1

00:00

That's basically my constant struggle. To give you some context, RingCentral has customers in Europe; it has a huge customer base in the EU and Australia. So we not only want to support English, we have to support other languages, right? And the amount of traffic that we get for English workloads is still almost 10 times more than for the other workloads.

S1

Speaker 1

00:25

And all of these other workloads are then distributed across languages like Spanish, German, French, Italian. Each of these has a really small amount of traffic coming in. So the constant struggle is: hey, I have one GPU, and I don't have enough traffic from each of these languages.

S1

Speaker 1

00:45

I am not going to give each of these models its own GPU, right? Otherwise we are just wasting GPU cycles. So a lot of our work has been on taking an A100 and splitting it into multiple GPUs, so to speak, from a model perspective, so that we can effectively run multiple types of models. But that also brings other challenges: autoscaling in that scenario becomes super, super tricky.

S2

Speaker 2

01:29

Hi everyone, this is Aishwarya Goyal.

S3

Speaker 3

01:33

And I'm Nilesh Agarwal.

S2

Speaker 2

01:34

From Towards Scaling Inference. Towards Scaling Inference is a place where you can learn about ML deployments from zero to scale. We are very excited to have Sushant with us from RingCentral.

S2

Speaker 2

01:45

He is an expert AI researcher and data science leader, currently serving as the Director of Machine Learning at RingCentral, where he focuses on building a next-generation conversational AI platform. Previously, he was the co-founder and CTO at DeepAffects, which was acquired by RingCentral. Prior to that, he led the data engineering team at Lumiata. Thank you, Sushant, for joining us today.

S1

Speaker 1

02:09

Thanks, guys, for hosting me.

S2

Speaker 2

02:12

Great. So Sushant, to begin with, we would like to understand what kind of business problems you are solving through AI.

S1

Speaker 1

02:20

Yep, so RingCentral is a UCaaS/CCaaS provider, so we basically work with phone, meetings, the whole stack. My team is specifically focused on improving meeting experiences, and we basically started during the hybrid-work scenario, during the COVID period. What we've been involved in is adding various AI features to our meetings.

S1

Speaker 1

02:46

How do we go from a standard meeting experience, where you get fatigued, all the way to extracting as much info as possible from it? So we work on an entire suite of products that bring AI into different things like transcription, translation, and then summarizing meetings, the whole stack. That's what my team and I work on.

S2

Speaker 2

03:09

Interesting. And how do you currently deploy your models in production? Do you work with managed services like AWS SageMaker or Hugging Face, or have you built your own ML infrastructure from scratch?

S1

Speaker 1

03:24

So we have completely in-house infrastructure. We have separate infrastructure for training large-scale models, and then separate infrastructure for the inference pipeline.

S1

Speaker 1

03:35

All of this is built on top of Kubernetes. We were one of the early adopters of MLOps, so it's been an interesting journey for us.

S3

Speaker 3

03:48

Got it, that's super awesome. Just wanted to understand: as you guys are deploying your production infrastructure in Kubernetes, how has that experience been, and what are the challenges, especially in terms of utilization of resources there?

S3

Speaker 3

04:04

Just wanted to understand that part.

S1

Speaker 1

04:07

Yeah, so since we built the whole deployment process, all the way from building models to deploying them into production, from the ground up, it's been an interesting journey with a lot of learnings and a lot of challenges along the way. Over time we have made a lot of optimizations in how we push things into production, starting from something as simple as model versioning all the way to how you effectively monitor models. So there's been a lot of work on that front. In general we've faced quite a few challenges, both in terms of how you scale up workloads based on the traffic, and how you effectively use GPUs so that you can squeeze as much juice as possible from every single GPU we have out there. And then we've also done a lot of interesting benchmarks on how to effectively deploy models, whether it makes sense to run them on a GPU or whether you want to run them on a CPU.

S1

Speaker 1

05:08

So yeah, those are some of the things we constantly dabble with, since we also work with a wide variety of data. Model monitoring in itself is a pretty interesting challenge for us. It's not solved by any means, but we are spending a lot of our energy nowadays on making sure we not only monitor the health of AI models, but also keep the models state of the art. How do we keep making them better over time?

S1

Speaker 1

05:42

Yeah,

S3

Speaker 3

05:43

Correct. And thanks for giving us a bird's-eye view of the whole pipeline. I found that there are three or four different parts of the pipeline that you guys have successfully built there, so I just wanted to dive deep into each part of the pipeline.

S3

Speaker 3

05:58

Starting with the first, which is versioning, deployment, and optimization, right? Just wanted to understand (you don't need to deep-dive into the secret sauce) how you guys think about optimization before you deploy a model, and what are some of the things you can do there.

S1

Speaker 1

06:18

Yeah. So, you know, a lot of these things have become much more standard nowadays, but quantization was one of the things we started with very early. And there are, again, a lot of trade-offs, you know, whether you want to deploy something on CPU or on GPU. We have empowered our data scientists to be at a level where they can also understand these trade-offs and make those decisions.
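
As an aside for readers: a minimal sketch of the kind of quantization step described here, assuming a PyTorch model destined for CPU serving. The toy model below is purely illustrative, not RingCentral's actual pipeline.

```python
# Minimal sketch: dynamic int8 quantization of a PyTorch model for CPU serving.
# The model here is a hypothetical stand-in, not an actual production model.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
).eval()

# Quantize Linear weights to int8; activations are quantized on the fly at
# inference time. This usually shrinks the model and speeds up CPU inference,
# at the cost of a small accuracy trade-off that has to be validated.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    print(quantized(torch.randn(1, 512)).shape)  # torch.Size([1, 128])
```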

S1

Speaker 1

06:45

They are the ones who build the model, so they have the most intimate knowledge of how the model works and what the best way to deploy it might be. And then there is a platform team which supports them with a lot of these automated optimizations, to squeeze as much performance as possible while reducing the size of the model. That's the first part. The second part is: how do we deploy it?

S1

Speaker 1

07:13

What is the mode of deployment? For us, we have different kinds of workloads. We have streaming workloads like live transcription or live translation, but then we also do a lot of asynchronous processing; for example, once the recording is generated, we want to summarize it.

S1

Speaker 1

07:27

We do a lot of these things asynchronously. And both of these workloads require entirely different optimizations. For an offline workload, you might be okay with slightly longer latencies, but you want to batch very effectively. On a streaming workload, you want to make sure your latency is as low as possible, but you still want some level of batching; you don't want to send one request and run it on the entire GPU.

S1

Speaker 1

07:57

So there is a lot of work involved in how you effectively batch requests in the case of, let's say, live transcription. And then on the offline side, LLMs are the craze nowadays, but we've been working on generative AI for a while, and everybody wants to jump on the bandwagon with ChatGPT and everything. In general, once you have a good POC and you find a little bit of success, a lot of the work is going to be: now I'm getting a thousand simultaneous requests, how do I batch each of these effectively? How do I create a summarization framework on top of it? Those are some of the challenges that people will face on the asynchronous part of it as well.
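
As a rough illustration of the batching trade-off described above, here is a minimal micro-batching sketch; the batch size and wait budget are made-up values, not RingCentral's.

```python
# Minimal micro-batching sketch for a streaming workload: wait at most a small
# latency budget to assemble a batch, so the GPU sees batched work without a
# single request monopolizing it. Values below are illustrative only.
import queue
import time

MAX_BATCH_SIZE = 8       # hypothetical upper bound per forward pass
MAX_WAIT_SECONDS = 0.02  # hypothetical latency budget for assembling a batch

pending: "queue.Queue[str]" = queue.Queue()

def next_batch() -> list[str]:
    """Collect requests until the batch is full or the wait budget expires."""
    batch: list[str] = []
    deadline = time.monotonic() + MAX_WAIT_SECONDS
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Usage: a serving loop would call next_batch(), run one batched forward pass,
# then fan the results back out to the waiting callers.
```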

S1

Speaker 1

08:43

So for us,

S1

Speaker 1

08:45

model deployment and optimization is definitely a constant aspect, and it is never a solved problem, right? You want to squeeze as much performance as possible. We are much better than what we were a year before, but if we have the same conversation next year we'll be even better than we are today. So it's kind of a constant battle that people work on. But yeah, that's where the problems are difficult, and so more fun to solve.

S3

Speaker 3

09:16

Correct, that's pretty interesting, especially the part where the optimization and quantization pipeline differs for different kinds of workloads, right? So if I understand correctly, there are different types of tasks: some are more real-time, some are more batch.

S3

Speaker 3

09:33

And based on the type of job, there's also a different type of quantization, or GPU or memory support, that you give those kinds of tasks, right? Is that a correct understanding?

S1

Speaker 1

09:44

Correct, yeah. Even in terms of which type of machines you are running them on, right? Some of these tasks could be very memory-intensive but might not necessarily need that many CPUs. And even when you're running workloads on GPUs, you have a lot of pre-processing and a lot of post-processing, and you don't necessarily want to run everything on the GPU; that's a waste of GPU time and GPU cycles. So you want to create an effective pipeline where you figure out which sections of your pipeline make sense on which type of underlying hardware. A lot of the work then becomes very specific in terms of what you do on CPU versus what you do on GPU, or whether you are able to do everything on CPU.
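
To make the CPU/GPU split concrete, here is a minimal sketch of the pattern; the toy model and "tokenizer" are hypothetical stand-ins, not the actual pipeline. Pre- and post-processing stay on CPU, and only the forward pass goes to the GPU.

```python
# Minimal sketch of splitting a pipeline across hardware: CPU for pre/post-
# processing, GPU (if available) only for the model forward pass.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 128, 10)).to(device).eval()

def preprocess(text: str) -> torch.Tensor:
    # CPU-bound work: tokenization, padding, feature extraction.
    ids = [ord(c) % 1000 for c in text][:128]
    ids += [0] * (128 - len(ids))
    return torch.tensor([ids], dtype=torch.long)

@torch.no_grad()
def infer(batch: torch.Tensor) -> torch.Tensor:
    # Only this step touches the GPU.
    return model(batch.to(device)).cpu()

def postprocess(logits: torch.Tensor) -> list[int]:
    # Back on CPU: argmax, formatting, detokenization.
    return logits.argmax(dim=-1).tolist()

print(postprocess(infer(preprocess("hello world"))))
```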

S3

Speaker 3

10:32

And does networking also play a big part in managing the Kubernetes side, let's say in how the data flows from pre-processing to, say, the predictor pod, to the post-processing? What have been your learnings in that part of the pipeline? How critical is that?

S3

Speaker 3

10:50

And how does that affect latency?

S1

Speaker 1

10:54

Yeah. So we have been fortunate to have a pretty solid MLOps team. They were not MLOps per se; they were really good DevOps engineers who had battle-tested Kubernetes, and we groomed them towards the whole MLOps infra. In general, we have a very solid system running nowadays, where we have squeezed as much performance as possible in terms of latencies from every step of the pipeline. But there is still a lot of stuff. If we take the example of async workloads, you have a lot of queuing time.

S1

Speaker 1

11:30

A lot of the time, requests are just queued inside Kafka, waiting for some of these pods to come up and process them. So that's where we do a lot of optimization, and that's where a lot of time is spent in terms of how you effectively autoscale each of these. Because in the end, we are trying to maintain some level of SLA, and different workflows have different levels of SLAs.

S1

Speaker 1

11:53

Some of them are not customer-defined but internally defined, because we want to give an amazing experience. If you want to look at a meeting summary, you don't want to wait like two hours for the summary to come in; by that time, you're already in the next meeting. So there is an optimum product experience, and to drive that product experience you want to make sure your AI services are able to generate those within the required time, right?

S1

Speaker 1

12:19

And that's where we've faced a lot of challenges as well, because every model has its own latencies, and scaling one model is not as straightforward as scaling every other model. You have different types of models: scaling transformers could be a different task than scaling, let's say, a CNN model, just as an example, because they work differently. What we have done now is abstract a lot of these things out, and every model pushes its metrics, all of these model monitoring metrics, to Grafana, which is then used to autoscale a lot of these.
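
For readers unfamiliar with this pattern, here is a minimal sketch of a model worker exposing the kind of metrics described here. The metric names and port are hypothetical; in practice something like Prometheus scrapes them, Grafana visualizes them, and an autoscaler (a custom controller, HPA on custom metrics, or KEDA) acts on them.

```python
# Minimal sketch: a model worker exporting queue depth and latency metrics so an
# external autoscaler can act on them. Metric names and port are illustrative.
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

QUEUE_DEPTH = Gauge("model_queue_depth", "Requests waiting to be processed", ["model"])
LATENCY = Histogram("model_inference_seconds", "Per-request inference latency", ["model"])

def handle_request(model_name: str) -> None:
    with LATENCY.labels(model_name).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for a real forward pass

if __name__ == "__main__":
    start_http_server(9100)  # /metrics endpoint for Prometheus to scrape
    while True:
        # In a real worker this could come from Kafka consumer lag / queue size.
        QUEUE_DEPTH.labels("summarizer").set(random.randint(0, 20))
        handle_request("summarizer")
```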

S1

Speaker 1

13:04

We started off with something very ad hoc, and now we have come to a point where it's much more specialized, in the sense that a lot of things which used to happen manually are now automated, to the point where you just need to look at alerts and you don't need to worry about anything else.

S3

Speaker 3

13:22

Got it, got it. And one last question before I hand it over to Aishwarya. In terms of Kubernetes, especially when you're working with GPUs, since it doesn't come with native support for, let's say, memory sharing, how have you guys been able to do that on Kubernetes, say, use the same GPU to run multiple models and all those things? We just wanted to get a hundred-foot view.

S1

Speaker 1

13:47

That's basically my constant struggle. So to give you some context, right, RingCentral has customers in Europe.

S1

Speaker 1

13:57

It has a huge customer base in the EU and Australia. So we not only want to support English, we have to support other languages, right? And the amount of traffic that we get for English workloads is still almost 10 times more than for the other workloads. And all of these other workloads are then distributed.

S1

Speaker 1

14:17

You have Spanish, German, French, Italian, these four languages, and all of them have a really small amount of traffic coming in. So the constant struggle is: hey, I have one GPU, and I don't have enough traffic from each of these languages.

S1

Speaker 1

14:32

I am not going to give each of these models its own GPU, right? Otherwise we are just wasting GPU cycles. So a lot of our work has been on taking an A100, splitting it into multiple GPUs, so to speak, from a model perspective, so that we can effectively run multiple types of models. But that also brings other challenges: autoscaling in that scenario becomes super, super tricky. It's okay if it's just one GPU, you can split it as much as you want, but once you want to autoscale, how do you autoscale at that point? Because let's say you just have French workloads coming in, not German. When you bring in the next GPU, are you deploying all French, all German? Those are very difficult decisions that need to be made, because in the end it's all about infrastructure cost, right? You don't want to waste all of these infrastructure cycles, because bringing a GPU up and bringing it down itself takes so much time. So you kind of need to warm some of these GPUs up before they come in, while you still want to make sure it's a fine trade-off, right?

S1

Speaker 1

15:46

You don't want latency so poor that customers will stop using it, but you also don't want to end up spending like 10 times more than what you would have otherwise spent. But yeah, we work closely with NVIDIA, we are one of their biggest partners on this front as well, and we've worked with them on optimizing a lot of this stack. But I would say we are still far from where I want to be, and that's basically one of the interesting challenges that we are working on. Awesome.
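
As a simple illustration of the "many low-traffic languages, one GPU" idea (placeholder models, not the actual setup; in practice the GPU itself is often partitioned with NVIDIA MIG or time-slicing plus a serving framework), several per-language models can be packed into a single GPU-backed process and requests routed to the right one:

```python
# Minimal sketch: packing several low-traffic, per-language models into one
# GPU-backed process instead of dedicating a GPU to each. Models are placeholders.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

models = {
    lang: nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64)).to(device).eval()
    for lang in ("es", "de", "fr", "it")  # Spanish, German, French, Italian
}

@torch.no_grad()
def route(lang: str, features: torch.Tensor) -> torch.Tensor:
    # All languages share the same device; each request is routed to its model.
    return models[lang](features.to(device)).cpu()

print(route("fr", torch.randn(1, 256)).shape)  # torch.Size([1, 64])
```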

S2

Speaker 2

16:23

Very interesting. The next thing we would love to understand is: what do you feel are the implications of infrastructure cost on your business outcomes? Do they limit you in terms of experimenting with more models, growth, et cetera?

S1

Speaker 1

16:38

So what we have done is split training infrastructure cost from inference pipeline infrastructure cost. The inference pipeline is tied back to the product: as long as the unit economics make sense, it is still okay, you can scale up, but only to a certain point. That's what we do on the inference side. For training, we have a different way of looking at it.

S1

Speaker 1

17:06

That's our experimentation workload. We don't want to limit people in terms of the number of experiments they can run, but again, it's a huge team with very different types of experiments: the computer vision team has something different, whereas speech, where you're training a new speech-to-text model, takes a lot of time to run. So we have a couple of dedicated MLOps guys who are working on how we share this training platform, so that yes, you have to wait, but there shouldn't be so much waiting that it's bottlenecking and hampering your productivity.

S1

Speaker 1

17:46

Like, you're not able to train models. Here we basically work with NVIDIA DGX clusters, and we have a big setup on that front to train these models. In general it's shared: you get some time on the server and on this platform, you run your experiments, you get the results back, and then you spend some time analyzing while you request a new job. So we have a certain set of things, and that's kind of where my role is, to make sure we provision enough based on how much capacity we can get.

S1

Speaker 1

18:25

So it's kind of a fine balance. You don't want to overbuy to the point where capacity is not being used, or not being used for interesting enough things, but you don't want to underbuy to the point where your team is just sitting doing nothing, running workloads on their laptops and waiting for capacity. So it's a good balance to strike, but yeah, it's a very interesting challenge to solve for.

S1

Speaker 1

18:51

And as you know, fortunately we are at a very interesting time. We have generative AI, where every new day there is something crazy going on, and with all of this hype come a lot of new avenues for applying AI to problems which people previously thought were not possible. Which again makes my job difficult: now I need to figure out how to squeeze more performance from whatever infra we already have.

S2

Speaker 2

19:24

Yeah, I'm sure, definitely an interesting challenge to solve. Just the last question, Sushant: can you discuss any best practices or tips, based on your experience, for other ML engineers and practitioners looking to deploy a model in production?

S1

Speaker 1

19:41

Yeah, definitely. So, you know, it entirely depends on what stage of AI maturity a company is at. If you are an ML engineer at an early-stage startup, then your goal should not really be trying to optimize a pipeline and use Kubernetes and all that.

S1

Speaker 1

19:57

It's just a waste of your time and energy. You should focus on: hey, as long as I'm able to get my model-serving pipeline up and I'm able to serve my traffic, that should be your primary goal. But as your team matures, you want to spend more time not just looking at whether you are able to deploy properly, but more on monitoring.

S1

Speaker 1

20:21

A lot of times GPUs get stuck, and we only noticed this behavior once we started monitoring. Sometimes you're not able to squeeze out performance, and then you start looking at logs and figure out: oh, you know, this pod has just been sitting idle for a long time. What happened? Oh, something happened on the GPU and it got stuck; it went into an inconsistent state.

S1

Speaker 1

20:41

You will start noticing these weird problems at scale when you are trying to max out performance from each of these steps. So I think even in terms of ML monitoring, start off with something very rudimentary: how would you monitor a software stack? Start with the basics. Just monitor CPU, memory, how many requests are coming in and going out, and what the latency is. Start there.
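
A minimal sketch of this "start with the basics" advice, tracking only process CPU/memory and simple request/latency counters; the reporting style is illustrative and assumes the psutil package.

```python
# Minimal sketch: rudimentary service monitoring before anything fancier.
# Tracks process CPU and memory plus request count and p50 latency.
import time
import psutil

process = psutil.Process()
request_count = 0
latencies: list[float] = []

def record_request(started_at: float) -> None:
    global request_count
    request_count += 1
    latencies.append(time.monotonic() - started_at)

def report() -> None:
    cpu = process.cpu_percent(interval=None)
    rss_mb = process.memory_info().rss / 1e6
    p50 = sorted(latencies)[len(latencies) // 2] if latencies else 0.0
    print(f"cpu={cpu:.1f}% rss={rss_mb:.0f}MB requests={request_count} p50={p50 * 1000:.1f}ms")

# Usage: call record_request() around each request handler and report()
# periodically (or export the same numbers to your metrics system).
start = time.monotonic()
time.sleep(0.01)  # stand-in for handling one request
record_request(start)
report()
```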

S1

Speaker 1

21:04

And then as you move forward, try to bring in the fancier things which we keep hearing about in the model monitoring community, data drift and all of that. All of this is great to have, but in reality, if you don't know how to interpret each of these graphs, it doesn't really make sense. Unless these are actionable items: if you understand that some drift is coming but you don't know what to do with it, you're not really able to do anything with it.

S1

Speaker 1

21:32

So at least understand it, and then figure out what the next step is from there. If you understand there is some level of drift in your data or your distribution, then you need to, let's say, retrain your model, or maybe change something in your pipeline or pre-processing and post-processing. That way you can take these gradual steps. Not everything is needed on day one, but doing nothing is also not a good idea. Start with the bare minimum, and slowly over time, as your team grows and as maturity inside the organization grows, keep adding these additional levels of the stack, which can help create more formidable AI pipelines.
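
On the drift point, a rudimentary but actionable check can be as simple as comparing a current window of scores against a reference window. A minimal sketch, assuming scipy and synthetic data (thresholds and window sizes are made up):

```python
# Minimal sketch: a rudimentary data-drift check comparing today's score
# distribution against a reference window. Data here is synthetic.
import numpy as np
from scipy.stats import ks_2samp

reference = np.random.normal(0.0, 1.0, size=5_000)  # e.g. last month's confidence scores
current = np.random.normal(0.3, 1.0, size=1_000)    # e.g. today's confidence scores

stat, p_value = ks_2samp(reference, current)
if p_value < 0.01:
    # Only worth alerting if it is actionable: trigger a review, a retraining
    # job, or a look at pre/post-processing, as discussed above.
    print(f"possible drift (KS={stat:.3f}, p={p_value:.2g})")
else:
    print("no significant drift detected")
```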

S2

Speaker 2

22:15

Great, this is really great feedback, and I think it will really help other ML engineers who are in the process of deploying. So thank you so much, Sushant, for joining us today. It was really fun to learn about your journey and how you are dealing with all the challenges and the new initiatives.

S2

Speaker 2

22:35

Thank you so much. Awesome, thank you.