Parallel and Distributed Computing

Parallel and Distributed Computing
Edited by Alberto Ros

Published by In-Teh
Olajnica 19/2, 32000 Vukovar, Croatia

Abstracting and non-profit use of the material is permitted with credit to the source. Statements and opinions expressed in the chapters are those of the individual contributors and not necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of information contained in the published articles. The publisher assumes no responsibility or liability for any damage or injury to persons or property arising out of the use of any materials, instructions, methods or ideas contained inside. After this work has been published by In-Teh, authors have the right to republish it, in whole or in part, in any publication of which they are an author or editor, and to make other personal use of the work.

© 2010 In-Teh, www.intechweb.org
Additional copies can be obtained from: publication@intechweb.org
First published January 2010
Printed in India
Technical Editor: Sonja Mujacic
Cover designed by Dino Smrekar
Parallel and Distributed Computing, edited by Alberto Ros. p. cm.
ISBN 978-953-307-057-5

Preface

Parallel and distributed computing has offered the opportunity of solving a wide range of computationally intensive problems by increasing the computing power of sequential computers. Although important improvements have been achieved in this field in the last 30 years, there are still many unresolved issues. These issues arise from several broad areas, such as the design of parallel systems and scalable interconnects, the efficient distribution of processing tasks, or the development of parallel algorithms. This book provides some very interesting and high-quality articles aimed at studying the state of the art and addressing current issues in parallel processing and/or distributed computing. The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.

I would like to thank all the authors for their help and their excellent contributions in the different areas of their expertise. Their wide knowledge and enthusiastic collaboration have made possible the elaboration of this book. I hope the readers will find it very interesting and valuable.
Alberto Ros
Departamento de Ingeniería y Tecnología de Computadores
Universidad de Murcia, Spain
a.ros@ditec.um.es

Contents

Preface

1. Fault tolerance of programmable devices
   Minoru Watanabe
2. Fragmentation management for HW multitasking in 2D Reconfigurable Devices: Metrics and Defragmentation Heuristics
   Julio Septién, Hortensia Mecha, Daniel Mozos and Jesus Tabero
3. TOTAL ECLIPSE — An Efficient Architectural Realization of the Parallel Random Access Machine
   Martti Forsell
4. Facts, Issues and Questions - GPUs for Dependability
   Bernhard Fechner
5. Shuffle-Exchange Mesh Topology for Networks-on-Chip
   Reza Sabbaghi-Nadooshan, Mehdi Modarressi and Hamid Sarbazi-Azad
6. Cache Coherence Protocols for Many-Core CMPs
   Alberto Ros, Manuel E. Acacio and José M. García
7. Using hardware resource allocation to balance HPC applications
   Carlos Boneti, Roberto Gioiosa, Francisco J. Cazorla and Mateo Valero
8. A Fixed-Priority Scheduling Algorithm for Multiprocessor Real-Time Systems
   Shinpei Kato
9. Plagued by Work: Using Immunity to Manage the Largest Computational Collectives
   Lucas A. Wilson, Michael C. Scherger and John A. Lockman III
10. Scheduling of Divisible Loads on Heterogeneous Distributed Systems
    Abhay Ghatpande, Hidenori Nakazato and Olivier Beaumont
11. On the Role of Helper Peers in P2P Networks
    Shay Horovitz and Danny Dolev
12. Parallel and Distributed Immersive Real-Time Simulation of Large-Scale Networks
    Jason Liu
13. A parallel simulated annealing algorithm as a tool for fitness landscapes exploration
    Zbigniew J. Czech
14. Fine-Grained Parallel Genomic Sequence Comparison
    Dominique Lavenier

Fault tolerance of programmable devices

Minoru Watanabe
Shizuoka University, Japan

1. Introduction

Currently, we are frequently facing demands for automation of many systems. In particular, demands for cars and robots are increasing daily. For such applications, high-performance embedded systems are necessary to execute real-time operations. For example, image processing and image recognition are heavy operations that tax current microprocessor units. Parallel computation on high-capacity hardware is expected to be one means to alleviate the burdens imposed by such heavy operations.

To implement such large-scale parallel computation onto a VLSI chip, the demand for a large-die VLSI chip is increasing daily. However, considering the ratio of non-defective chips under current fabrications, die sizes cannot be increased (1),(2). If a large system must be integrated onto a large-die VLSI chip or, as an extreme case, a wafer-size VLSI, the use of a VLSI including defective parts must be accomplished.

In the earliest use of field programmable gate arrays (FPGAs) (3)–(5), FPGAs were anticipated as defect-tolerant devices that accommodate inclusion of defective areas on the gate array because of their programmable capability. However, that hope was partly shattered because defects of a serial configuration line caused severe impairments that prevented programming of the entire gate array. Of course, a spare row method such as that used for memories (DRAMs) reduces the ratio of discarded chips (6),(7), in which spare rows of a gate array are used instead of defective rows by swapping them with a laser beam machine. However, such methods require hardware redundancy. Moreover, they are not perfect.
To use a gate array perfectly and not produce any discarded VLSI chips, a perfectly parallel programmable capability is necessary: one which uses no serial transfer. Currently, optically reconfigurable gate arrays (ORGAs) that support parallel programming capability and which never use any serial transfer have been developed (8)–(15). An ORGA comprises a holographic memory, a laser array, and a gate-array VLSI. Although the ORGA construction is slightly more complex than that of currently available FPGAs, the parallel programmable gate-array VLSI supports perfect avoidance of its faulty areas; it instead uses the remaining area. Therefore, the architecture enables the use of a large-die VLSI chip and even entire wafers, including fault areas. As a result, the architecture can realize extremely high-gate-count VLSIs and can support large-scale parallel computation.

This chapter introduces an ORGA architecture as a high defect tolerance device, describes how to use an optically reconfigurable gate array including defective areas, and clarifies its high fault tolerance. The ORGA architecture has some weak points in making a large VLSI, as do FPGAs. Therefore, this chapter also presents discussion of more reliable design methods to avoid those weak points.

Fig. 1. Overview of an ORGA.

2. Optically Reconfigurable Gate Array (ORGA)

The ORGA architecture has the following features: numerous reconfiguration contexts, rapid reconfiguration, and large-die-size VLSIs or wafer-scale VLSIs. A large-die-size VLSI can accommodate many physical gates, which increases the performance of large parallel computations. Furthermore, numerous reconfiguration contexts achieve huge virtual gates, with contexts several times more numerous than the physical gates. For that reason, such huge virtual gates can be reconfigured dynamically on the physical gates so that huge operations can be integrated onto a single ORGA-VLSI. The following sections describe the ORGA architecture, which presents these advantages.

2.1 Overall construction

An overview of an Optically Reconfigurable Gate Array (ORGA) is portrayed in Fig. 1. An ORGA comprises a gate-array VLSI (ORGA-VLSI), a holographic memory, and a laser diode array. The holographic memory stores reconfiguration contexts. A laser array is mounted on the top of the holographic memory for use in addressing the reconfiguration contexts in the holographic memory. One laser corresponds to one configuration context. When one laser is turned on, its beam propagates into a certain corresponding area on the holographic memory at a certain angle, so that the holographic memory generates a certain diffraction pattern. A photodiode array of a programmable gate array on an ORGA-VLSI can receive it as a reconfiguration context. Then, the ORGA-VLSI functions as the circuit of the configuration context. The reconfiguration time of such an ORGA architecture reaches nanosecond order (14),(15). Therefore, very-high-speed context switching is possible. Since the storage capacity of a holographic memory is extremely high, numerous configuration contexts can be used with a holographic memory. Therefore, the ORGA architecture can dynamically treat huge virtual gate counts that are larger than the physical gate count of an ORGA-VLSI.

2.2 Gate array structure

This section introduces a design example of a fabricated ORGA-VLSI chip. Based on it, a generalized gate array structure of ORGA-VLSIs is discussed.
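Before moving on to the gate-array details, the context-switching behavior described in Section 2.1 can be captured in a minimal sketch. Everything in it is an illustrative assumption made for this note (the ORGAModel class, the bit-list representation of a context); the ORGA does this optically, not in software.

```python
# Toy model of ORGA context switching (illustrative assumptions only).
class ORGAModel:
    def __init__(self, contexts):
        # contexts: one configuration pattern per laser, standing in for the
        # diffraction patterns stored in the holographic memory.
        self.contexts = contexts
        self.active = None

    def turn_on_laser(self, i):
        # Turning on laser i addresses one region of the holographic memory;
        # the photodiode array latches the whole pattern in parallel, so the
        # switching cost does not grow with gate count (no serial transfer).
        self.active = self.contexts[i]
        return self.active

# Usage: three stored contexts; switching among them is a single step.
orga = ORGAModel([[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]])
orga.turn_on_laser(2)  # the gate array now implements context 2
```

The point of the model is that selecting a context is one parallel step regardless of array size, which is what makes defect avoidance by reprogramming practical.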
Fig. 2. Gate-array structure of a fabricated ORGA. Panels (a), (b), (c), and (d) respectively depict block diagrams of a gate array, an optically reconfigurable logic block, an optically reconfigurable switching matrix, and an optically reconfigurable I/O bit.

2.2.1 Prototype ORGA-VLSI chip

The basic functionality of an ORGA-VLSI is fundamentally identical to that of currently available field programmable gate arrays (FPGAs). Therefore, an ORGA-VLSI takes an island-style gate array or a fine-grain gate array. Figure 2 depicts the gate array structure of a first prototype ORGA-VLSI chip. The ORGA-VLSI chip was fabricated using a 0.35 µm triple-metal CMOS process (8). A photograph of a board is portrayed in Fig. 3. Table 1 presents the specifications. The ORGA-VLSI chip consists of 4 optically reconfigurable logic blocks (ORLB), 5 optically reconfigurable switching matrices (ORSM), and 12 optically reconfigurable I/O bits (ORIOB), as portrayed in Fig. 2(a). Each optically reconfigurable logic block is surrounded by wiring channels. In this chip, one wiring channel has four connections. Switching matrices are located on the corners of optically reconfigurable logic blocks. Each connection of the switching matrices is connected to a wiring channel. The ORGA-VLSI has 340 photodiodes to program its gate array. The ORGA-VLSI can be reconfigured perfectly in parallel. In this fabrication, the distance between each photodiode was designed as 90 µm. The photodiode size was set as 25.5 × 25.5 µm² to ease the optical alignment. The photodiode was constructed between the N-well layer and the P-substrate. The gate array's gate count is 68. It was confirmed experimentally that the ORGA-VLSI itself is reconfigurable within a nanosecond-order period. [...]

... on running, and we would have to add only the time needed to transfer the status of each currently running task from the active context to the other one.

5.2 Preventive defragmentation

This defragmentation is fired by the Free Area Analyzer module, and it will be performed only if the free area is large enough, and it will try first to relocate islands inside the ...

... t_rem_i, and the allocation heuristic used is based on the 3D-adjacency concept. Figure 11.a shows an FPGA situation with six running tasks and a high fragmentation status (QF = 0.76). For each task Ti, example t_rem_i and t_marg_i values are shown. A global defragmentation will lead to the situation of Figure 11.b. We have supposed all tasks meet condition C2, and a tD ...

... tolerance analysis of optically reconfigurable gate arrays," World Scientific and Engineering Academy and Society Transactions on Signal Processing, Vol. 2, Issue 11, pp. 1457-1464, 2006.
[14] M. Miyano, M. Watanabe, F. Kobayashi, "Optically Differential Reconfigurable Gate Array," Electronics and Computers in Japan, Part II, Vol. 90, Issue 11, pp. 132-139, 2007.
[15] M. Nakajima, ...

... with a simple island made of two tasks, and its VL is shown in Figure 10.b. The island alarm is then only a bit that is set whenever the Free Area Analyzer module detects the presence of a pair of virtual edges in VL, which in the example appear as discontinued arrows.

Fig. 10. FPGA status with an island (a) (QF = 0.77), its vertex list (b), and FPGA status after defragmentation (c) (QF = 0.25).
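As a rough illustration of that alarm bit, the following sketch scans a vertex list for a pair of virtual edges. The (edge, is_virtual) encoding of VL is an assumption made here for illustration; the excerpt does not show the chapter's actual data structure.

```python
# Hedged sketch of the island alarm: a single bit set when the free-area
# vertex list (VL) contains a pair of virtual edges. The (edge, is_virtual)
# encoding is an illustrative assumption, not the chapter's real structure.
def island_alarm(vertex_list):
    virtual_edges = [edge for edge, is_virtual in vertex_list if is_virtual]
    # A pair of virtual edges means the free-area perimeter encloses a group
    # of still-running tasks cut off from the rest, i.e., an island.
    return len(virtual_edges) >= 2

# Usage: a VL with one virtual-edge pair fires the alarm.
vl = [("e1", False), ("e2", True), ("e3", False), ("e4", True)]
print(island_alarm(vl))  # True
```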
If the island alarm has been fired, we check first if we can relocate it or not, by demanding that for every task Ti in the island the following condition is satisfied:

C1: t_marg_i ≥ tD_island    (10)

where t_marg_i is computed as in (1) and tD_island is the time needed to ...

... medium and large tasks combined. The average number of running tasks comes from the average task size and is approximately 12 for S1, 8 for S2, and 6 for S3. For S4 it is more unpredictable. All the task sets have an excess of workload that forces the allocator to store some tasks temporarily in a queue, and even discard them when their latest starting time constraint is reached ...

... results of Table 2 are summarized in some figures. Figures 6 and 7 show how much computing volume (in percentage with respect to the whole computing volume of the task set) is discarded for each set and for each one of the selection heuristics, for hard and soft time constraints, respectively. We suppose all the other tasks have been successfully loaded and executed before their respective time constraints ...

... and QF for the less fragmented case, but do not behave so well with islands: F1 does not discriminate among 5.a and 5.c, and F2 chooses as more fragmented the case where the island is closer to the perimeter. F3 chooses as less fragmented 3.a instead of 3.f. Finally, F4 and HF do not discriminate among many of the cases proposed, and assign excessive fragmentation values to cases with several independent ...

... island alarm, or a fragmentation metrics alarm. The first alarm checked is the island alarm. An island is made of one or more tasks that have become isolated when all the tasks surrounding them have already finished. An island can appear only when a task-end event happens. It is obvious that removing an island by relocating its tasks can lead to a significant reduction of the fragmentation value, and ...

... all of them and incorporating them into the allocation environment (which for some of them is not possible). The experimental results are summarized in Table 2 and Figures 6, 7, 8 and 9. We have used a 20x20 FPGA with 400 area units, and as benchmarks several task sets with 100 tasks and different features each. We have used four different task size ranges. Set S1 is made of small tasks, with each randomly ...
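Returning to condition C1 above, the relocation test reduces to a simple check over the island's tasks. In this sketch the dictionary of per-task margins and the tD_island argument are illustrative assumptions; the chapter computes t_marg_i from its expression (1), which the preview does not include.

```python
# Hedged sketch of the island-relocation test (condition C1): the island can
# be relocated only if every task Ti in it has t_marg_i >= tD_island.
def can_relocate_island(t_marg, tD_island):
    # t_marg: {task_id: time margin available to Ti before its constraint
    #          would be violated}; tD_island: time to relocate the island.
    return all(margin >= tD_island for margin in t_marg.values())

# Usage: T2's margin (10) is smaller than the relocation time (15),
# so the island cannot be relocated safely.
print(can_relocate_island({"T1": 40, "T2": 10}, tD_island=15))  # False
```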
