Large-scale cluster management at Google with Borg

(1)

Large-scale cluster management at Google with Borg

Google Inc.

(2)

Agenda



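Goals of Borg (1, 2.1)

How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)

Units of compute resource allocation (2.2, 2.4)

Scheduling algorithm and resource allocation (3.2, 6.2)

Evaluation methodology (5.1)

Experimental results (5.2, 5.4, 5.5)

Lessons Learned (8)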

(3)

Agenda



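Goals of Borg (1, 2.1)

How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)

Units of compute resource allocation (2.2, 2.4)

Scheduling algorithm and resource allocation (3.2, 6.2)

Evaluation methodology (5.1)

Experimental results (5.2, 5.4, 5.5)

Lessons Learned (8)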

(4)

Borg



A cluster manager.

◦Runs hundreds of thousands of jobs from many thousands of different applications.

◦Across a number of clusters each with up to tens of thousands of machines.



With very high reliability and availability.

(5)

Workloads



Heterogeneous workload with two main parts.

◦Long-running services

 Handle short-lived latency-sensitive requests.

 High priority (prod).

◦Batch jobs

 Take a few seconds to a few days to complete.

 Low priority (non-prod).

(6)

Agenda

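Goals of Borg (1, 2.1)

How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)

Units of compute resource allocation (2.2, 2.4)

Scheduling algorithm and resource allocation (3.2, 6.2)

Evaluation methodology (5.1)

Experimental results (5.2, 5.4, 5.5)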

(7)

Jobs and Tasks



Job

◦Runs in one Borg cell.

◦Consists of many tasks.

◦Has properties and constraints.

 name, owner, number of tasks, priority.



Task

◦Maps to a set of Linux processes running in a container on a machine.

◦Has properties and constraints.

 resource requirements (CPU cores, RAM, disk space, disk access rate, TCP ports, etc.).
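The job/task relationship above can be sketched as a pair of data classes. This is only an illustrative model, not Borg's actual configuration format: the field names, resource keys, and the `total_request` helper are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    index: int
    # Per-task resource requirements; the keys are illustrative names.
    resources: dict[str, float]


@dataclass
class Job:
    name: str
    owner: str
    priority: int
    cell: str  # a job runs in exactly one cell
    tasks: list[Task] = field(default_factory=list)

    @property
    def task_count(self) -> int:
        return len(self.tasks)

    def total_request(self) -> dict[str, float]:
        """Sum the requirements of all tasks, per resource (hypothetical helper)."""
        total: dict[str, float] = {}
        for task in self.tasks:
            for resource, amount in task.resources.items():
                total[resource] = total.get(resource, 0.0) + amount
        return total


# Example: a job with three identical tasks.
job = Job(
    name="web-frontend",
    owner="alice",
    priority=200,
    cell="cell-a",
    tasks=[Task(i, {"cpu_cores": 0.5, "ram_gib": 2.0}) for i in range(3)],
)
```

Keeping the requirements on the task rather than the job mirrors the slide: the job carries identity and priority, while each task carries what it needs from a machine.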

(8)

Jobs and Tasks(Cont.)

(9)

Jobs and Tasks(Cont.)



Non-overlapping priority bands

◦In decreasing priority order: monitoring, production, batch, and best effort.

◦Tasks from higher-priority jobs can preempt tasks from lower-priority ones.

◦Tasks in the production priority band are not allowed to preempt one another.
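The preemption rules above can be written as a small predicate. This is a minimal sketch: the numeric band boundaries in `BAND_FLOORS` are hypothetical (the slides give only the band names and their order), and `band` and `can_preempt` are names invented for this example.

```python
from bisect import bisect_right

# Hypothetical lower bounds of each band, lowest band first.
# The real numeric ranges are not given in the slides.
BAND_FLOORS = [0, 100, 200, 300]
BAND_NAMES = ["best-effort", "batch", "production", "monitoring"]


def band(priority: int) -> str:
    """Return the name of the band that contains the given priority."""
    if priority < 0:
        raise ValueError("priority must be non-negative")
    return BAND_NAMES[bisect_right(BAND_FLOORS, priority) - 1]


def can_preempt(preemptor: int, victim: int) -> bool:
    """Apply the two rules from the slide to a pair of task priorities."""
    if preemptor <= victim:
        return False  # only strictly higher priority may preempt
    if band(preemptor) == "production" and band(victim) == "production":
        return False  # production tasks never preempt one another
    return True
```

For example, under these assumed boundaries a production task at priority 250 can preempt a batch task at 150, but not another production task at 210.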

(10)

Jobs and Tasks(Cont.)



Jobs with insufficient quota are immediately rejected upon submission.

◦Quota: a vector of resource quantities.

 (CPU, RAM, disk space, etc.)

◦Higher-priority quota costs more.
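Because quota is a vector, the admission check has to pass in every dimension at once. The sketch below assumes quota, current usage, and the request are plain dictionaries keyed by resource name; `fits_quota` and the resource keys are hypothetical and only illustrate the idea.

```python
Resources = dict[str, float]


def fits_quota(request: Resources, quota: Resources, used: Resources) -> bool:
    """Return True only if the request fits within quota for every resource.

    A resource with no quota entry is treated as having zero quota.
    """
    return all(
        used.get(resource, 0.0) + amount <= quota.get(resource, 0.0)
        for resource, amount in request.items()
    )


# Example values (made up for illustration).
quota = {"cpu_cores": 10.0, "ram_gib": 64.0}
used = {"cpu_cores": 8.0, "ram_gib": 16.0}

accepted = fits_quota({"cpu_cores": 2.0, "ram_gib": 8.0}, quota, used)
rejected = fits_quota({"cpu_cores": 3.0}, quota, used)
```

A job that fails this check would be rejected at submission time rather than queued, which is the behaviour the slide describes.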

(11)

Agenda

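Goals of Borg (1, 2.1)

How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)

Units of compute resource allocation (2.2, 2.4)

Scheduling algorithm and resource allocation (3.2, 6.2)

Evaluation methodology (5.1)

Experimental results (5.2, 5.4, 5.5)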

(12)

Architecture(Cont.)



Cell

◦A set of heterogeneous machines that run jobs in a cluster.

◦Median cell size: about 10k machines.



Alloc

◦A reserved set of resources on a machine.

(13)

Agenda

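Goals of Borg (1, 2.1)

How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)

Units of compute resource allocation (2.2, 2.4)

Scheduling algorithm and resource allocation (3.2, 6.2)

Evaluation methodology (5.1)

Experimental results (5.2, 5.4, 5.5)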

(14)

Scheduler



The scheduling algorithm consists of two parts.

◦Feasibility checking: finds machines on which the task could run.

◦Scoring: picks one of the feasible machines.

 Spreading load vs. best fit.

 Uses a hybrid method to reduce the amount of stranded resources (ones that cannot be used because another resource on the machine is fully allocated).
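The two-phase structure above can be sketched as a filter followed by a ranking. The slides do not give Borg's actual scoring formula, so `stranding_score` below is a stand-in chosen for this example: it prefers the machine whose leftover resources stay most balanced across dimensions, which is one simple way to avoid stranding. The `Machine` class and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    capacity: dict[str, float]  # assumed strictly positive per resource
    free: dict[str, float]


def feasible(machines: list[Machine], request: dict[str, float]) -> list[Machine]:
    """Phase 1: keep machines with enough free capacity in every dimension."""
    return [
        m for m in machines
        if all(m.free.get(r, 0.0) >= amount for r, amount in request.items())
    ]


def stranding_score(machine: Machine, request: dict[str, float]) -> float:
    """Illustrative score (lower is better): the gap between the largest and
    smallest leftover fraction after placement. A large gap means one resource
    is nearly exhausted while another sits idle, i.e. stranded."""
    leftover = [
        (machine.free[r] - request.get(r, 0.0)) / machine.capacity[r]
        for r in machine.capacity
    ]
    return max(leftover) - min(leftover)


def pick_machine(machines: list[Machine], request: dict[str, float]) -> Optional[Machine]:
    """Phase 2: choose the feasible machine with the lowest stranding score."""
    candidates = feasible(machines, request)
    if not candidates:
        return None
    return min(candidates, key=lambda m: stranding_score(m, request))


capacity = {"cpu": 16.0, "ram": 64.0}
machines = [
    Machine("A", capacity, {"cpu": 2.0, "ram": 60.0}),
    Machine("B", capacity, {"cpu": 8.0, "ram": 32.0}),
    Machine("C", capacity, {"cpu": 1.0, "ram": 64.0}),
]
choice = pick_machine(machines, {"cpu": 2.0, "ram": 8.0})
```

In this example machine C is infeasible (too little CPU), and placing the task on A would use up all of A's free CPU while leaving most of its RAM unusable, so B is selected.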

(15)

Performance Isolation

 To help with overload and overcommitment.

Latency-sensitive (LS) tasks vs. the rest (batch).

◦LS tasks can temporarily starve batch tasks for several seconds.

Compressible vs. non-compressible resources.

◦Terminates low-priority tasks when a machine runs out of non-compressible resources.

◦Throttles usage (favoring LS tasks) when a machine runs out of compressible resources.
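The difference between the two resource kinds can be sketched as a single decision function. This is a simplified illustration, not how the per-machine agent is actually implemented: `RunningTask`, `relieve`, and the resource names are assumptions, with CPU and disk I/O treated as compressible and RAM and disk space as non-compressible, following the examples given in the Borg paper.

```python
from dataclasses import dataclass

# Rate-based resources that can be reclaimed by slowing a task down.
COMPRESSIBLE = {"cpu", "disk_io"}


@dataclass
class RunningTask:
    name: str
    priority: int
    latency_sensitive: bool
    usage: float  # usage of the resource that is running short


def relieve(resource: str, tasks: list[RunningTask], capacity: float) -> tuple[str, list[str]]:
    """Return (action, task names) for a machine short on one resource."""
    total = sum(t.usage for t in tasks)
    if total <= capacity:
        return ("none", [])
    if resource in COMPRESSIBLE:
        # Throttle instead of killing, and spare latency-sensitive tasks.
        return ("throttle", [t.name for t in tasks if not t.latency_sensitive])
    # Non-compressible: terminate from lowest priority up until usage fits.
    victims = []
    for task in sorted(tasks, key=lambda t: t.priority):
        if total <= capacity:
            break
        victims.append(task.name)
        total -= task.usage
    return ("terminate", victims)


tasks = [
    RunningTask("web", 200, True, 4.0),
    RunningTask("batch-a", 100, False, 3.0),
    RunningTask("batch-b", 50, False, 2.0),
]
```

With these sample tasks and a capacity of 8.0, a RAM shortage terminates only `batch-b`, while a CPU shortage throttles both batch tasks and leaves `web` untouched.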

(16)

Agenda

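Goals of Borg (1, 2.1)

How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)

Units of compute resource allocation (2.2, 2.4)

Scheduling algorithm and resource allocation (3.2, 6.2)

Evaluation methodology (5.1)

Experimental results (5.2, 5.4, 5.5)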

(17)

Combined vs Segregated

(18)

Agenda

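Goals of Borg (1, 2.1)

How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)

Units of compute resource allocation (2.2, 2.4)

Scheduling algorithm and resource allocation (3.2, 6.2)

Evaluation methodology (5.1)

Experimental results (5.2, 5.4, 5.5)

Lessons Learned (8)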

(19)

Lessons Learned



The bad:

◦Jobs are restrictive as the only grouping mechanism for tasks.

◦One IP address per machine complicates things.

◦Optimizing for power users at the expense of casual ones.

(20)

Lessons Learned(Cont.)



The good:

◦Allocs are useful.

◦Cluster management is more than task management.

◦Introspection is vital.

◦The master is the kernel of a distributed system.

(21)

Conclusion



Virtually all of Google’s cluster workloads have switched to use Borg over the past decade.


