
SIMD ARCHITECTURE FOR JOB SHOP SCHEDULING PROBLEM SOLVING

Kuan-Hung Chen¹, Shi-Chung Chang¹, Tzi-Dar Chiueh¹, Peter B. Luh², Xing Zhao²

¹Department of Electrical Engineering, Room 511, National Taiwan University, Taipei, Taiwan 10617

²Department of Electrical and Computer Engineering, University of Connecticut, Storrs, Connecticut 06269-2157, USA

ABSTRACT

Job shop is a typical environment for manufacturing high-variety and low-volume discrete parts. Good scheduling is critical and challenging to the competitiveness of job shops. The Lagrangian relaxation neural network (LRNN) developed in [1] provides an approach with quantifiable solution quality and successful industrial applications. To further speed up scheduling for large-scale problems, this paper exploits the parallelism of the LRNN approach for hardware implementation. New designs include a SIMD architecture, its associated instruction set, and detailed circuits. Logic-level simulation of the circuit design produces schedules consistent with those obtained by a software implementation. The hardware implementation is expected to achieve a one to two orders of magnitude speed-up over the software one.

1. INTRODUCTION

Job shop is a typical environment for manufacturing low-volume and high-variety discrete parts. In a job shop, parts with various due dates and priorities are to be processed on various types of machines. Job shop scheduling selects machines and beginning times for processing individual operations to achieve certain objectives under the given machine capacity and computation time constraints. A good solution to scheduling problems can result in significant savings. For example, a scheduling system developed by IBM-Japan is estimated to save over a million dollars a year for a major steel company [2].

There are two main challenges for effective scheduling: solution quality and solution finding speed. Theoretically, many job shop scheduling problems are NP-hard [3]. The generation of an optimal schedule often requires excessive computation time regardless of the methodology. Instead, near- or sub-optimal solutions are adopted for practical applications, and there have been many sub-optimal or heuristic methods [3].

Recently, there have been a series of scheduling methods with successful industrial applications [4][5] developed under a common framework of Lagrangian relaxation (LR). These methods relax the coupling constraint(s) of a scheduling problem by applying the Lagrangian relaxation technique. The original scheduling problem is then decomposed into independent, simpler optimization subproblems and a Lagrange multiplier optimization problem. Various optimization techniques are developed for efficient solution with quantifiable optimality. To further advance the computational efficiency of this class of methods, Luh et al. exploited the inherent parallelism of their LR-based job shop scheduling methods and designed an LR neural network (LRNN) algorithm [1] for parallel computing.

(This work was supported in part by the National Science Council of Taiwan, Republic of China, under grant NSC-89-2212-E-002-040 and by the National Science Foundation of the United States of America under grant DMI-9813176.)

In this paper, we further enhance the parallelism in LRNN and design parallel processing hardware for the computation-intensive parts of LRNN to speed up the computation. We first analyze and modify the LRNN scheduling algorithm to make it amenable to parallel hardware implementation. We then design for the algorithm a parallel processing architecture and its associated instruction set. Finally, a logic circuit implementation is designed, simulated, and then fabricated. The performance of our design is demonstrated by preliminary testing results of the hardware implementation.

The remainder of the paper is organized as follows. In Section 2, the problem formulation and the LRNN algorithm are presented. Section 3 describes a SIMD architectural design for the VLSI chip implementation. Section 4 presents two key modules of our circuit design, Verilog simulation results, and an estimation of the speed-up. Finally, a summary of our design is given in Section 5.

2. MODIFIED LRNN ALGORITHM

2.1. Problem Formulation

Consider a job shop where there are $H$ machine types and each machine type may consist of a few identical machines. There are $I$ parts to be scheduled over a time horizon of $K$ time units. Part $i$ has its due date $D_i$, weight (or priority) factor $w_i$, and requires $J_i$ sequential processing operations. Each operation requires processing by a machine of a specific type for a pre-specified number of time units. Processing of each operation must satisfy the operation precedence constraint, i.e., its processing may start only after the completion of its preceding operation. The number of operations assigned to machine type $h$ at time $k$ should not violate the machine capacity constraint, i.e., it should be no more than the number of machines available at that time, $M_{kh}$:

$$\sum_{i,j} \delta_{ijkh} \le M_{kh}, \quad k = 1, \ldots, K; \; h = 1, \ldots, H, \qquad (1)$$

where $\delta_{ijkh}$ is a 0-1 variable that equals 1 if operation $j$ of part $i$ is being processed by machine type $h$ at time $k$, and equals 0 otherwise. Variables $\delta_{ijkh}$ are determined once the beginning times $b_{ij}$ of all operations are decided.

The scheduling goal of on-time delivery for individual jobs is modeled as penalties on job delivery tardiness $T_i = \max[0, C_i - D_i]$, where $C_i$ is the completion time of part $i$ and is equal to the beginning time of the last operation of part $i$ plus its processing time.


The scheduling problem then boils down to determining the beginning times $b_{ij}$ of individual operations to minimize the total weighted part tardiness $\sum_i w_i T_i$ while satisfying all the constraints. Mathematically, this formulation is a separable optimization problem, since all the constraints are linear and the objective function is additive.
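Putting the pieces together, the formulation described above can be restated compactly as follows; the explicit form of the precedence constraint is our own notation (the text keeps it implicit), with $t_{ij}$ denoting the processing time of operation $j$ of part $i$ and $c_{ij} = b_{ij} + t_{ij} - 1$ its completion time:

$$\min_{\{b_{ij}\}} \; \sum_{i=1}^{I} w_i T_i, \qquad T_i = \max[0,\, C_i - D_i],$$

subject to

$$\sum_{i,j} \delta_{ijkh} \le M_{kh} \quad \forall\, k, h \qquad \text{(machine capacity)},$$

$$b_{i,j+1} \ge c_{ij} + 1 \quad \forall\, i,\; j = 1, \ldots, J_i - 1 \qquad \text{(operation precedence)}.$$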

2.2. Solution by LRNN

Now apply Lagrangian relaxation to the machine capacity constraints and obtain a "relaxed problem" as

$$\min_{\{b_{ij}\}} L, \quad \text{with } L = \sum_i w_i T_i + \sum_{k,h} \pi_{kh} \Big( \sum_{i,j} \delta_{ijkh} - M_{kh} \Big), \qquad (2)$$

subject to individual part constraints,

where $\pi_{kh}$ are the Lagrange multipliers. By exploiting the separability, the relaxed problem can be decomposed into the following decoupled part subproblems for a given set of multipliers:

$$\min_{\{b_{ij}\}} L_i, \quad \text{with } L_i = w_i T_i + \sum_{j=1}^{J_i} \sum_{k=b_{ij}}^{c_{ij}} \pi_{k h_{ij}}, \quad i = 1, \ldots, I, \qquad (3)$$

subject to the operation precedence constraints of part $i$,

where $h_{ij}$ denotes the machine type used by operation $j$ of part $i$ and $c_{ij}$ denotes the completion time of that operation. Each $\delta_{ijkh}$ in (2) is set, in the decomposition, to a value 0 or 1 according to its definition.

Let $L_i^*$ denote the minimal subproblem cost of part $i$ under the given multipliers. A dual problem is then obtained as

$$\max_{\pi \ge 0} D(\pi), \quad \text{with } D(\pi) = \sum_i L_i^* - \sum_{k,h} \pi_{kh} M_{kh}. \qquad (4)$$

The dual function $D$ is concave and provides a lower bound to the original scheduling problem. Interested readers may refer to [1] for details.

A surrogate subgradient method is adopted to solve the dual problem in (4). Under a given set of Lagrange multipliers $\pi_{kh}$, part subproblems can be solved independently among parts. After solving a part subproblem and obtaining the beginning times of individual operations, the method updates the subgradient of the dual function $D$ based on the solution of the subproblem. One iteration of the method consists of solving all part subproblems and updating the corresponding subgradient once. The procedure iterates until a convergent dual solution is obtained. As there may be machine capacity violations in the dual solution, a heuristic then adjusts it to a feasible one. Solving a part subproblem is the most computation-intensive step of all.

S-NBDP for Solving Subproblems

Each part subproblem (3) is a multistage optimization problem. We design a simplified neuron-based dynamic programming (S-NBDP) algorithm for its solution, which combines the ideas of NBDP [6] and SDP [7] with consideration of hardware implementation. The basic structure of the S-NBDP application to a part subproblem is depicted in the dash-lined box of Fig. 1, where state and comparison neurons perform backward DP [8] computation while a forward sweep procedure identifies the optimal schedule from the results of the backward DP. Neurons are connected based on precedence constraints. In Fig. 1, an arrow indicates the direction of data flow.

In the backward DP procedure for a part subproblem, a stage corresponds to an operation and a state corresponds to an operation beginning time. The backward DP is a stage-by-stage iterative procedure starting from the last stage. In stage $j$, all state neurons of the stage compute in parallel their respective cumulative costs by adding up the stage-wise cost and the optimal cost-to-go (OCTG) of the state. The stage-wise cost of a state is the summation of the multipliers associated with the machine type needed during the processing time of the stage (operation). The OCTG of a state represents the minimum cost to schedule the remaining part operations after the state (time). For a state in the last stage, the OCTG equals the tardiness penalty of the state.
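As a concrete illustration of the state-neuron computation, the following C sketch evaluates the cumulative cost of every state of one stage; the array and parameter names (pi_h, octg, cum_cost, proc_time) are ours, and the handling of beginning times too close to the end of the horizon is simplified.

void state_neurons(int K, int proc_time,
                   const double pi_h[],  /* multipliers of the machine type needed by this stage, per time k */
                   const double octg[],  /* OCTG of each state; tardiness penalty for the last stage         */
                   double cum_cost[])    /* output: cumulative cost of each state (beginning time) k         */
{
    /* On the chip, all K states of a stage are handled in parallel, one per PE. */
    for (int k = 0; k < K; k++) {
        double stage_cost = 0.0;
        /* Stage-wise cost: sum of the multipliers of the required machine
         * type over the processing window [k, k + proc_time - 1].          */
        for (int t = k; t < k + proc_time && t < K; t++)
            stage_cost += pi_h[t];
        cum_cost[k] = stage_cost + octg[k];   /* cumulative cost = stage cost + OCTG */
    }
}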

Figure 1: LRNN Structure.

Comparison neurons then find the OCTGs for individual states of the preceding stage, i.e., stage $j-1$. In the procedure, one comparison neuron (CN) takes as input the output of another CN. For the example in Fig. 1, the CN for the OCTG of state $k$ compares and finds the smaller value between the cumulative cost of state $k$ and the output of the CN for state $k+1$. Such a comparison procedure is obviously sequential, starting from the last state. However, software simulation results show that the beginning time of an operation obtained in one LRNN iteration usually differs from that of the previous iteration by only a few time units. This in turn suggests that only the OCTGs of some adjacent states in a stage may actually need to be calculated. Such an observation motivates our simplification of the comparison procedure.

In the simplified comparison procedure for a stage, a set of adjacent states is first identified based on the beginning time obtained in the previous iteration. Each CN for a state in the set performs the comparison normally, as described in the previous paragraph. For a state not in the set, the CN does nothing but relay the cumulative cost of the state. For each comparison neuron, there is a state flag that records which comparison operand is the minimum. In the case of Fig. 1, a flag value '1' indicates that the cumulative cost of the current state is the minimum and '0' otherwise. Computations by both the state and comparison neurons are functionally repetitive from one stage to the next.
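The comparison-neuron chain with the simplified (windowed) procedure can be sketched in C as follows; the window bounds win_lo and win_hi, chosen around the previous iteration's beginning time, and the treatment of the flags outside the window are our reading of the description above.

void comparison_neurons(int K, int win_lo, int win_hi,
                        const double cum_cost[], /* cumulative costs from the state neurons     */
                        double octg_out[],       /* running minima, i.e., OCTGs for stage j-1   */
                        int state_flag[])        /* 1 if the state's own cumulative cost is min */
{
    double relay = cum_cost[K - 1];              /* value passed along the CN chain             */
    for (int k = K - 1; k >= 0; k--) {           /* sequential, starting from the last state    */
        if (k < win_lo || k > win_hi) {
            relay = cum_cost[k];                 /* outside the window: relay own cost only     */
            state_flag[k] = 0;
        } else if (k == K - 1 || cum_cost[k] <= relay) {
            relay = cum_cost[k];                 /* own cumulative cost is the minimum          */
            state_flag[k] = 1;
        } else {
            state_flag[k] = 0;                   /* keep the value coming from state k+1        */
        }
        octg_out[k] = relay;                     /* OCTG made available to the preceding stage  */
    }
}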

New beginning times of individual operations are then identified from the state flags of all stages by a forward sweep procedure. It searches through the state flags stage by stage, starting from the first stage. Within a stage, the search is done state by state, starting from the state corresponding to the earliest beginning time of that stage (operation), which is equal to 1 plus the completion time of the previous stage. Whenever a flag of value 1 is found at a state of a stage, that time (state) is set as the beginning time of the operation (stage). The forward sweep completes S-NBDP for a part.
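In software, the forward sweep amounts to the following C sketch (the flattened flag array and the fallback behavior when no flag is found are our assumptions; on the chip this search is performed by the dedicated circuit of Section 4.2):

void forward_sweep(int num_stages, int K,
                   const int proc_time[],  /* processing time of each stage (operation)   */
                   const int state_flag[], /* state flags, flattened as state_flag[j*K+k] */
                   int begin_time[])       /* output: new beginning time of each stage    */
{
    int earliest = 0;                       /* earliest beginning time of the first stage */
    for (int j = 0; j < num_stages; j++) {
        begin_time[j] = earliest;           /* fallback if no flag of value 1 is found    */
        for (int k = earliest; k < K; k++) {
            if (state_flag[j * K + k] == 1) {
                begin_time[j] = k;          /* first flag found sets the beginning time   */
                break;
            }
        }
        /* The next stage may begin only after this operation completes,
         * i.e., at 1 plus the completion time of the current stage.       */
        earliest = begin_time[j] + proc_time[j];
    }
}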


Subgradient and multiplier updating

After solving a part subproblem in one iteration, the difference of machine usage at each time between the schedules resulting from the current and previous iterations, $\mathit{diff}_{kh}$, is calculated. Each element of the subgradient $g$ of the dual function $D$ is updated by

$$g_{kh}^{(n)} = g_{kh}^{(n-1)} + \mathit{diff}_{kh}^{(n)},$$

where $n = 1, 2, \ldots$ is the iteration index and $g_{kh}^{(0)} = -M_{kh}$. Multipliers are then updated by Lagrangian neurons according to the formula

$$\pi_{kh}^{(n+1)} = \pi_{kh}^{(n)} + \alpha^{(n)} \, g_{kh}^{(n)},$$

where $\alpha$ is the step size parameter. Note that both updates can be done in parallel among different $(k, h)$ pairs.
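A C-level sketch of the two updates is given below; the horizon and machine-type counts are taken from the test problem of Section 4.3, and the array names are ours. In the hardware described in Section 3, each PE handles the updates for its own time unit $k$, so all $(k, h)$ pairs are processed in parallel.

#define K_HORIZON 16   /* time horizon of the Section 4.3 test problem   */
#define H_TYPES    2   /* number of machine types in the test problem    */

void update_multipliers(double g[K_HORIZON][H_TYPES],       /* surrogate subgradient g_kh   */
                        double pi[K_HORIZON][H_TYPES],      /* Lagrange multipliers pi_kh   */
                        const int diff[K_HORIZON][H_TYPES], /* machine-usage change diff_kh */
                        double alpha)                       /* step size of this iteration  */
{
    for (int k = 0; k < K_HORIZON; k++) {
        for (int h = 0; h < H_TYPES; h++) {
            g[k][h]  += diff[k][h];        /* g_kh^(n)    = g_kh^(n-1) + diff_kh^(n)        */
            pi[k][h] += alpha * g[k][h];   /* pi_kh^(n+1) = pi_kh^(n) + alpha^(n) g_kh^(n)  */
        }
    }
}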

3. SIMD ARCHITECTURE

To exploit the parallelism of LRNN, a system as shown in Fig. 2 is designed, which consists of a PC, a micro-controller, and several LRNN chips. Software on the PC takes problem inputs, structures the data, and generates controlling commands to the micro-controller. The micro-controller then feeds the data from the PC to individual LRNN chips, controls their processing sequences, and returns their solutions to the PC for output. Each LRNN chip implements, with a limit on problem dimension, S-NBDP and the surrogate subgradient updating of multipliers. Chips can be cascaded for various problem dimensions. Based on the dual solution obtained by the LRNN chips, the software on the PC performs feasibility adjustment of the final solution.

Figure 2: Overall system architecture.

In this system, the LRNN chips carry out most of the computation for finding a dual optimum. A SIMD architecture, shown in Fig. 3, is designed to map the modified LRNN onto a parallel hardware implementation. Under this architecture, an instruction decoder decodes the instruction code from the micro-controller into control signals for the whole chip. Processing elements (PEs) carry out many arithmetic operations in parallel, such as the addition in calculating the cumulative cost by each state neuron, the comparison by each comparison neuron, and the updating of Lagrange multipliers. The parallelism of the first two types of operations is quite straightforward based on the modified LRNN structure, and it is natural to have one PE support each state and CN pair. As for multiplier (subgradient) updating, its parallelism is rooted in the fact that one multiplier (gradient) is defined for each machine type at each time period. Since a state corresponds to a time unit, the PE for a state can be used to store all the multipliers (gradients) of the corresponding time period, i.e., $g_{kh}$ and $\pi_{kh}$ for all $h$. Similarly, the calculation of the subgradient direction and the updating of Lagrange multipliers by the Lagrangian neurons can also be carried out by individual PEs. A forward sweep circuit (FSC) then reads the state flag values from the PEs and performs a sequential search to find the beginning times of individual operations. The global memory stores input data items such as due dates and output data items such as operation beginning times.

Figure 3: SIMD architecture of an LRNN chip.

4. CIRCUIT DESIGN

Designs of PE and FSC are more involved than those of the global memory and the instruction decoder. Key design concepts of the two are described as follows.

4.1. PE Design

Figure 4: PE architecture.

Fig. 4 depicts our circuit architecture design of the PE. To perform the arithmetic operations required by the LRNN algorithm, a standard circuit design for an arithmetic logic unit (ALU) is adopted. The ALU is capable of performing addition, comparison, etc. In executing most instructions, inputs to the ALU are latched by two registers, the ACC (accumulator) and the DR (data register). Registers R1, R2, and R3 latch data for arithmetic operations conditioned on a preceding ALU execution result. The local memory is required to store the Lagrange multipliers and the machine usage information for each machine type. A local bus communicates between the local memory and registers such as the ACC, DR, etc. within a PE. It also serves the global data communications with the global bus. Finally, instead of using a 16-bit storage unit in the local memory to store each 1-bit state flag, we design a stack for storing state flags.
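As a rough software model only (the field names and 16-bit widths are illustrative and do not reproduce the actual register-transfer design), the data associated with one PE, i.e., with one time unit of the horizon, can be pictured as:

#include <stdint.h>

#define H_TYPES 2                  /* machine types in the Section 4.3 test problem   */

typedef struct {
    int16_t  acc;                  /* ACC: accumulator latching one ALU input          */
    int16_t  dr;                   /* DR: data register latching the other ALU input   */
    int16_t  r1, r2, r3;           /* registers for conditional arithmetic operations  */
    int16_t  pi[H_TYPES];          /* Lagrange multipliers of this time unit, all h    */
    int16_t  g[H_TYPES];           /* subgradient components of this time unit         */
    int16_t  usage[H_TYPES];       /* machine usage information per machine type       */
    uint16_t state_flag_stack;     /* stack of 1-bit state flags, one bit per stage    */
} pe_state_t;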

4.2. Forward Sweep Circuit

The forward sweep circuit implements the forward sweep steps for the state flag search within a stage. Each state has a logic unit with three 1-bit inputs: a begin flag, with '1' indicating that this state is the earliest beginning state for the search within this stage and '0' otherwise; the state flag of this state; and an input search flag. The logic unit has two 1-bit outputs: an output search flag feeding the next state as its input search flag, and a time flag indicating the new beginning time. The search is a sequential procedure that always starts at the first state and completes at the last state. The search flag, which is initially '0', will be set to '1' at the earliest beginning state. The time flag of the logic unit is set to '1' if a state flag of value '1' is found at this state when the input search flag also has a value '1', and the search flag is then reset to '0'. Such a sequential search would require a long search time for a large-scale problem. Note that the output search flags should be 0 for states before the earliest beginning state and for states after the state whose time flag is set to '1'. So we adopt an idea similar to the "carry bypass" concept of the Manchester carry chain [9] to speed up the search.
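A behavioral C model of one logic unit of the forward sweep circuit is sketched below; the signal names are ours, and the Manchester-style bypass is only indicated by the trailing comment rather than modeled.

/* One forward-sweep logic unit (one state).
 * Inputs : begin_flag - 1 if this is the earliest beginning state of the stage
 *          state_flag - state flag written by the comparison neuron
 *          search_in  - input search flag from the previous state
 * Outputs: *time_flag - set to 1 to mark the new beginning time
 *          return     - output search flag fed to the next state               */
int fsc_unit(int begin_flag, int state_flag, int search_in, int *time_flag)
{
    int searching = search_in | begin_flag;   /* search turns on at the earliest state */
    if (searching && state_flag) {
        *time_flag = 1;                       /* beginning time found at this state    */
        return 0;                             /* reset the search flag: stop looking   */
    }
    *time_flag = 0;
    return searching;                         /* otherwise pass the search flag along  */
}
/* In the actual circuit, a carry-bypass-like path (as in a Manchester carry
 * chain) lets the search flag skip a whole group of states whose state flags
 * are all 0, shortening the otherwise long sequential search.                 */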

4.3. Experimental Results

Our circuit design contains 16 PEs in one chip. A test problem of 5 parts, 2 machine types, and a time horizon of 16 time units is used for logic-level simulation. Verilog is adopted as the logic simulator. A software implementation of our modified LRNN algorithm in C code serves as the benchmark. Specifically, the values of the Lagrange multipliers obtained by the C code and by the Verilog simulation are compared. Table 1 and Fig. 5 give the two sets of results, respectively. Comparing them, we see that they are identical, which supports the logical correctness of our circuit design.

Table 1: Multipliers of the type-1 machine from the C program.

State 1:   1     State 9:   0
State 2:  23     State 10:  0
State 3:   0     State 11: 12
State 4:  14     State 12:  0
State 5:  19     State 13:  0
State 6:  15     State 14:  0
State 7:  20     State 15:  0
State 8:   0     State 16:  0

Figure 5: Multipliers of the type-1 machine generated from the Verilog simulator (values identical to those in Table 1).

The proposed architecture has been fabricated by Taiwan Semiconductor Manufacturing Co. using a single-poly quadruple-metal 0.35-μm CMOS technology. The chip measures 4.56 mm x 4.24 mm. There are 16 processing elements and 356k transistors in the chip. Preliminary testing results show that the chip works at a clock rate of 100 MHz while drawing only 742 mW from a 3.3 V power supply. Under the above operating conditions, 0.15 ms is required to complete 10 iterations for the 5-part test problem above. The software solution takes approximately 3.5 ms on a K6-II 300 MHz computer, so a speed-up of more than 20 times is achieved. The gain derives from the parallel computation by the 16 PEs and from the simplification of the sequential pair-wise comparisons. By putting more PEs in one LRNN chip and cascading LRNN chips, a two orders of magnitude speed-up may be achieved for larger problems.

5. SUMMARY

In this paper, a SIMD architecture has been designed to implement the modified LRNN algorithm for speeding up job shop scheduling problem solving. The design concepts of the architecture and the circuit are presented. The logic-level simulation results show the feasibility of implementing the modified LRNN algorithm efficiently with parallel hardware under the SIMD architecture. The preliminary performance testing of the hardware implementation demonstrates the potential of a one to two orders of magnitude speed-up over the software implementation.

6. REFERENCES

[1] P. B. Luh, X. Zhao, and Y. Wang, "Lagrangian Relaxation Neural Networks for Job-Shop Scheduling," IEEE Transactions on Robotics and Automation, vol. 16, no. 1, pp. 78-88, February 2000.

[2] M. Numao and S. Morishita, "A Scheduling Environment for Steel-making Processes," in Proceedings, Fifth Conference on Artificial Intelligence for Applications, pp. 279-286, 1989.

[3] M. Pinedo, Scheduling: Theory, Algorithms, and Systems. Prentice-Hall, Inc., 1995.

[4] D. Y. Liao, S. C. Chang, K. W. Pei, and C. M. Chang, "Daily scheduling for R&D semiconductor fabrication," IEEE Transactions on Semiconductor Manufacturing, vol. 9, no. 4, pp. 550-561, Nov. 1996.

[5] D. J. Hoitomt, P. B. Luh, and K. R. Pattipati, "A practical approach to job shop scheduling problems," IEEE Transactions on Robotics and Automation, vol. 9, pp. 1-13, Feb. 1993.

[6] P. B. Luh, X. Zhao, L. S. Thakur, K. H. Chen, T. D. Chiueh, and S. C. Chang, "Architectural Design of Neural Network Hardware for Job Shop Scheduling," Annals of the CIRP, vol. 48/1, pp. 373-376, 1999.

[7] X. Zhao, K. H. Chen, P. B. Luh, T. D. Chiueh, S. C. Chang, and L. S. Thakur, "Integrated online job shop scheduling system," in SPIE International Symposium on Intelligent Systems and Advanced Manufacturing, Boston, MA, 1999.

[8] D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Inc., 1987.

[9] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley, 1992.
