Our Journey to Autoscaling EKS Node Groups for Job Workloads

Created: 2026/04/16
Hello! My name is Ssup, and I work as an engineer on the Cluster team within Karrot's SRE organization. Our team handles a wide range of responsibilities, from operating AWS EKS clusters and configuring Istio-based service mesh networks to managing monitoring components such as Prometheus and Loki.

Most of Karrot's workloads run on AWS EKS clusters. Workloads can be classified in many ways, but one of the most common distinctions is between Server workloads and Job workloads. Unlike Server workloads, Job workloads have a clear start and end: they run once, complete their task, and stop.

A key characteristic of Job workloads is that they are difficult to interrupt once started. Most Job workloads must restart from the beginning if they are interrupted, so any disruption translates directly into wasted time and compute. This is especially true for long-running Jobs that take an hour or more to complete; interrupting them is simply not an option. This "hard to interrupt" nature is one of the main reasons why autoscaling EKS Node Groups for Job workloads is so challenging.

In this post, I will walk you through how we at Karrot worked around these constraints to successfully enable autoscaling for the EKS Node Groups running our Job workloads.

Previous Approach to Managing Job Workloads

As mentioned earlier, Job workloads are difficult to interrupt, which means a Node running a Job workload cannot be removed during the Scale-in phase of autoscaling. The more evenly Job workloads are spread across Nodes, the more they interfere with autoscaling. For this reason, Karrot operates a dedicated Node Group exclusively for Job workloads, separate from all other workloads.

The diagram above illustrates an example of separating the Node Group for Server workloads from the Node Group for Job workloads. Before the separation, Job workloads were running across all Nodes, making it impossible to perform Node Scale-in until all Job wo