Rust at NetTV: An 84% Reduction in Infrastructure Costs

NetTV is the largest IPTV/OTT platform in Nepal, and at our scale we face a lot of interesting technical challenges. To give a sense of that scale, we serve over 500,000 users. It is our responsibility to ensure our services give users the best experience. This post presents one of the issues we faced in production and our approach to solving it.

Running out of memory

Recently, some of our nodes fell victim to the OOM killer. These nodes had one thing in common: they were running the same WebSocket service and nothing more. I aggregated conntrack entries across all of these nodes and, on multiple occasions, there were 1.5x to 4x more TCP connection entries than users online, even though there should be only one connection per device. Our logs showed no sign of malicious traffic. I assumed our application was establishing duplicate WebSocket connections for the same functionality. Fortunately, this never led to any incorrect behavior; each case was simply a duplicate connection.

I inspected the service on its own. Its functionality was trivial: a Node.js application connected to Redis and Kafka with the purpose of fanning out messages to end users. The duplicate connections were definitely concerning, but the real shock came when I saw that this service was provisioned a total of 26 cores and 52 GB of RAM. Worst of all, the service still OOM'ed. That's an expensive amount of CPU and DRAM. At this point we could have scaled up to solve our problem, but let's be honest: NetTV is growing fast and will only grow bigger. We need to solve the problem, not delay it.

Deciding between Golang and Rust

A work-stealing runtime, such as Go's runtime or tokio (with rt-multi-thread enabled), can move a task, and with it the socket it owns, across threads. WebSocket connections are long-lived and exchange ping-pong frames throughout their lifetime. Keeping each connection on one thread helps because its state stays warm in that core's CPU cache. With maximum performance in mind, I decided on a thread-per-core, share-nothing architecture and io_uring over epoll for maximum throughput.
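To make the share-nothing idea concrete, here is a minimal sketch using only the standard library. It is not the production code: the `Command` enum, `owner_of`, `spawn_workers`, and `run` are hypothetical names, plain OS threads with channels stand in for per-core async runtimes, and the threads are not pinned to cores (the standard library has no affinity API). Each connection is routed to one worker by its ID and never migrates, so every worker keeps its state in a private map with no locks.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical commands a worker understands (illustrative names only).
pub enum Command {
    /// Register a connection on the worker that owns it.
    Connect(u64),
    /// Deliver a payload to one connection.
    Deliver(u64, String),
}

/// Pick the worker that owns a connection. The mapping is fixed, so a
/// connection stays on the same thread for its whole lifetime.
/// Assumes `workers > 0`.
pub fn owner_of(conn_id: u64, workers: usize) -> usize {
    (conn_id % workers as u64) as usize
}

/// Spawn `workers` threads. Each thread owns a private map from connection
/// ID to the number of messages delivered; nothing is shared between
/// threads. A worker returns its map once its channel is closed.
pub fn spawn_workers(
    workers: usize,
) -> Vec<(Sender<Command>, JoinHandle<HashMap<u64, usize>>)> {
    (0..workers)
        .map(|_| {
            let (tx, rx) = mpsc::channel::<Command>();
            let handle = thread::spawn(move || {
                let mut delivered: HashMap<u64, usize> = HashMap::new();
                for cmd in rx {
                    match cmd {
                        Command::Connect(id) => {
                            delivered.entry(id).or_insert(0);
                        }
                        Command::Deliver(id, _payload) => {
                            // Messages for unknown connections are dropped.
                            if let Some(count) = delivered.get_mut(&id) {
                                *count += 1;
                            }
                        }
                    }
                }
                delivered
            });
            (tx, handle)
        })
        .collect()
}

/// Route every command to the owning worker, then shut the workers down
/// and collect their private state, indexed by worker number.
pub fn run(
    workers: usize,
    conn_ids: &[u64],
    messages_per_conn: usize,
) -> Vec<HashMap<u64, usize>> {
    let pool = spawn_workers(workers);
    for &id in conn_ids {
        let (tx, _) = &pool[owner_of(id, workers)];
        tx.send(Command::Connect(id)).unwrap();
        for n in 0..messages_per_conn {
            tx.send(Command::Deliver(id, format!("msg {n}"))).unwrap();
        }
    }
    pool.into_iter()
        .map(|(tx, handle)| {
            drop(tx); // closing the channel ends the worker's loop
            handle.join().unwrap()
        })
        .collect()
}
```

Calling `run(4, &[0, 1, 2, 3, 4, 5, 6, 7], 3)` spreads eight connections over four workers, and connection 5 only ever appears in worker 1's map. In a real service each worker would drive sockets through an async runtime rather than a blocking channel loop, but the ownership rule would be the same.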

Here, Rust was the clear winner, with the flexibility to choose both the I/O backend and the async runtime. Go, on the other hand, did not let us choose either.

Rewriting it in Rust

I wrote the initial version on top of the compio runtime. During the rewrite I ran into a bug with WebSocket that remained unsolved at the time, so I decided to fall back to tokio, creating a single-threaded tokio runtime per thread. I also tested different allocators, specifically jemalloc and mimalloc, and ended up using jemalloc. The Rust rewrite was complete in just 3 days, and by the 5th day we were already serving production traffic.

Post Analysis

Did rewriting it in Rust solve the problem? The short answer is yes. It did not, of course, fix the duplicate connections, but it did save us from the OOM killer. Here is an observation we made after deploying the Rust rewrite onto a single server with 4 cores and 16 GB of memory, handling almost a million active WebSocket connections.

A single server handling all of production traffic is not something you want to do. This was an HA setup, and I had multiple standby nodes ready to rebalance traffic.
[Image: grafana-websocket.jpeg]

With this deployment, we were already serving all of production traffic, including the duplicate connections. Now, let's compare the numbers to see the benefits of the Rust rewrite.

Resource     Before   After   Reduced By   % Reduction
CPU Cores    26       4       22 cores     ~84.6%
RAM          52 GB    16 GB   36 GB        ~69.2%
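For anyone who wants to check the arithmetic, the percentages follow from (before - after) / before. The small helper below is only an illustration; the function name `reduction` is ours and is not part of the service.

```rust
/// Returns (absolute reduction, percentage reduction) between a `before`
/// and an `after` value. Assumes `before` is non-zero.
pub fn reduction(before: f64, after: f64) -> (f64, f64) {
    let saved = before - after;
    (saved, saved / before * 100.0)
}
```

With the figures above, `reduction(26.0, 4.0)` gives 22 cores saved, or 22 / 26 ≈ 84.6%, and `reduction(52.0, 16.0)` gives 36 GB saved, or 36 / 52 ≈ 69.2%.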

A roughly 84% reduction in CPU cores speaks for itself.
