Large-scale cluster management at Google with Borg
Google Inc.
Agenda
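Goals of Borg (1, 2.1)
How users describe the compute resource requirements of jobs and tasks (2.3, 2.5)
Units of compute resource allocation (2.2, 2.4)
Scheduling algorithm and resource allocation (3.2, 6.2)
Evaluation methodology (5.1)
Experimental results (5.2, 5.4, 5.5)
Lessons Learned (8)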
Borg
A cluster manager.
◦Runs hundreds of thousands of jobs from many thousands of different applications.
◦Across a number of clusters each with up to tens of thousands of machines.
With very high reliability and availability.
Workloads
Heterogeneous workload with two main parts.
◦Long-running services
Handle short-lived, latency-sensitive requests.
High priority (prod).
◦Batch jobs
Take a few seconds to a few days to complete.
Low priority (non-prod).
Jobs and Tasks
Job
◦Runs in one Borg cell.
◦Consists of many tasks.
◦Has properties and constraints.
name, owner, number of tasks, priority.
Task
◦Maps to a set of Linux processes running in a container on a machine.
◦Has properties and constraints.
resource requirements (CPU cores, RAM, disk space, disk access rate, TCP ports, etc.).
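The job and task properties listed above can be pictured as a small data model. The following is only an illustrative sketch: the class names, field names, and example values are assumptions made for this example and are not Borg's actual configuration format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resources:
    """Per-task resource requirements (hypothetical field names)."""
    cpu_cores: float
    ram_gb: float
    disk_gb: float = 0.0


@dataclass
class Task:
    index: int               # position of the task within its job
    requirements: Resources


@dataclass
class Job:
    name: str
    owner: str
    priority: int
    cell: str                # a job runs in exactly one cell
    tasks: list[Task] = field(default_factory=list)

    @property
    def task_count(self) -> int:
        return len(self.tasks)


def make_job(name, owner, priority, cell, task_count, per_task):
    """Build a job whose tasks all share the same requirements."""
    tasks = [Task(i, per_task) for i in range(task_count)]
    return Job(name, owner, priority, cell, tasks)


# Example values are invented for illustration.
job = make_job("web-frontend", "alice", 200, "cell-a", 3, Resources(0.5, 2.0))
```

Here `job.task_count` is derived from the task list rather than stored separately, so the "number of tasks" property cannot drift out of sync with the tasks themselves.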
Jobs and Tasks(Cont.)
Non-overlapping priority bands
◦Monitoring, production, batch, and best effort (in decreasing priority order).
◦Tasks from higher-priority jobs can preempt tasks from lower-priority jobs.
◦Tasks in the production priority band are not allowed to preempt one another, which avoids preemption cascades.
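The preemption rule above can be sketched as a small predicate. This is a minimal sketch under stated assumptions: the numeric band boundaries and the function names `band_of` and `can_preempt` are invented for illustration; only the band order and the production-band exception come from the slide.

```python
# (band name, lowest priority in the band), highest band first.
# The numeric cut-offs are hypothetical.
BANDS = [
    ("monitoring", 300),
    ("production", 200),
    ("batch", 100),
    ("best_effort", 0),
]


def band_of(priority: int) -> str:
    """Map a numeric priority to its (non-overlapping) band."""
    for name, floor in BANDS:
        if priority >= floor:
            return name
    raise ValueError("priority must be non-negative")


def can_preempt(incoming: int, running: int) -> bool:
    """Return True if a task at `incoming` may preempt one at `running`."""
    if incoming <= running:
        return False  # only strictly higher priority can preempt
    if band_of(incoming) == "production" and band_of(running) == "production":
        return False  # production tasks never preempt one another
    return True
```

For example, a priority-250 task may preempt a priority-150 batch task, but not a priority-210 task, because both fall in the production band.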
Jobs and Tasks(Cont.)
Jobs with insufficient quota are immediately rejected upon submission.
◦Quota: a vector of resource quantities.
(CPU, RAM, disk space, etc.)
◦Higher-priority quota costs more.
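The admission check described above amounts to comparing two resource vectors. The sketch below assumes quota and demand are plain dictionaries keyed by resource name; the function names and numbers are hypothetical, and the sketch ignores the priority and time dimensions of quota.

```python
def total_demand(per_task: dict[str, float], task_count: int) -> dict[str, float]:
    """Scale a single task's requirements up to the whole job."""
    return {res: amount * task_count for res, amount in per_task.items()}


def admit(demand: dict[str, float], quota: dict[str, float]) -> bool:
    """Admit the job only if every requested resource fits within quota."""
    return all(quota.get(res, 0.0) >= amount for res, amount in demand.items())


# Invented example: 10 tasks, each asking for 0.5 cores and 2 GB of RAM.
demand = total_demand({"cpu": 0.5, "ram_gb": 2.0}, 10)
```

A job is rejected as soon as any single dimension of the vector exceeds the remaining quota, even if all other dimensions fit.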
Architecture(Cont.)
Cell
◦A set of heterogeneous machines that run jobs in a cluster.
◦Median cell size: about 10k machines.
Alloc
◦A reserved set of resources on a machine, in which one or more tasks can run.
Scheduler
The scheduling algorithm consists of two parts.
◦Feasibility checking: finds machines on which the task could run.
◦Scoring: picks one of the feasible machines.
Spreading load vs. best fit
Use a hybrid method to reduce the amount of stranded resources: ones that cannot be used because another resource on the machine is fully allocated.
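The two scheduling phases can be sketched as a filter followed by a scoring step. The slides do not give the scoring formula, so the `stranding` score below is an invented stand-in: it prefers the machine whose leftover resources stay most balanced after placement. The machine dictionaries and all names are assumptions for illustration.

```python
def feasible(machines, request):
    """Phase 1: keep machines with enough free capacity for every resource."""
    return [
        m for m in machines
        if all(m["free"].get(res, 0) >= amount for res, amount in request.items())
    ]


def stranding(machine, request):
    """Hypothetical score: spread between the most and least available
    resource (as a fraction of capacity) after placing the task."""
    left = [
        (machine["free"][res] - request.get(res, 0)) / machine["capacity"][res]
        for res in machine["capacity"]
    ]
    return max(left) - min(left)


def pick(machines, request):
    """Phase 2: choose the feasible machine with the lowest stranding score."""
    candidates = feasible(machines, request)
    if not candidates:
        return None  # the task stays pending
    return min(candidates, key=lambda m: stranding(m, request))


# Invented example cell with three machines of equal capacity.
capacity = {"cpu": 16, "ram_gb": 64}
machines = [
    {"name": "a", "capacity": capacity, "free": {"cpu": 2, "ram_gb": 48}},
    {"name": "b", "capacity": capacity, "free": {"cpu": 8, "ram_gb": 32}},
    {"name": "c", "capacity": capacity, "free": {"cpu": 1, "ram_gb": 60}},
]
chosen = pick(machines, {"cpu": 2, "ram_gb": 8})
```

In this example machine `c` is infeasible, and placing the task on `a` would use up its last CPU cores while leaving 40 GB of RAM stranded, so the sketch picks `b`.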
Performance Isolation
To help with overload and over-commitment.
Latency-sensitive (LS) tasks vs. the rest (batch).
◦LS tasks are capable of temporarily starving batch tasks for several seconds.
Compressible vs. non-compressible resources.
◦Terminates low-priority tasks when a machine runs out of non-compressible resources.
◦Throttles usage (favoring LS tasks) when a machine runs out of compressible resources.
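The two reactions to resource exhaustion can be sketched as follows. This is a simplified illustration, not the actual mechanism: the task dictionaries, the `handle_shortage` name, and the choice to throttle every batch task are assumptions. The split of resources into compressible (rate-based, such as CPU and disk I/O bandwidth) and non-compressible (such as memory and disk space) follows the Borg paper.

```python
# Rate-based resources that can be reclaimed without killing a task.
COMPRESSIBLE = {"cpu", "disk_io"}


def handle_shortage(resource, tasks, shortfall):
    """Return a list of (action, task name) pairs for an exhausted resource."""
    if resource in COMPRESSIBLE:
        # Favor latency-sensitive tasks: throttle batch tasks if there are any.
        batch = [t for t in tasks if not t["latency_sensitive"]]
        victims = batch or tasks
        return [("throttle", t["name"]) for t in victims]

    # Non-compressible: kill from lowest priority upward until enough is freed.
    actions, freed = [], 0.0
    for t in sorted(tasks, key=lambda t: t["priority"]):
        if freed >= shortfall:
            break
        actions.append(("kill", t["name"]))
        freed += t["usage"].get(resource, 0.0)
    return actions


# Invented example tasks on one machine.
tasks = [
    {"name": "web", "priority": 200, "latency_sensitive": True,
     "usage": {"cpu": 4.0, "ram_gb": 8.0}},
    {"name": "index", "priority": 100, "latency_sensitive": False,
     "usage": {"cpu": 2.0, "ram_gb": 4.0}},
    {"name": "logs", "priority": 50, "latency_sensitive": False,
     "usage": {"cpu": 1.0, "ram_gb": 2.0}},
]
```

With these tasks, a 5 GB RAM shortfall kills `logs` and then `index` and leaves `web` running, while a CPU shortfall only throttles the two batch tasks.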
Combined vs Segregated
Lessons Learned
The bad:
◦Jobs are restrictive as the only grouping mechanism for tasks.
◦One IP address per machine complicates things.
◦Optimizing for power users at the expense of casual ones.
Lessons Learned(Cont.)
The good:
◦Allocs are useful.
◦Cluster management is more than task management.
◦Introspection is vital.
◦The master is the kernel of a distributed system.
Conclusion
Virtually all of Google’s cluster workloads have switched to use Borg over the past decade.