A parallel implementation on modern hardware for geo-electrical tomographical software

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyễn Hoàng Vũ

A PARALLEL IMPLEMENTATION ON MODERN HARDWARE FOR GEO-ELECTRICAL TOMOGRAPHICAL SOFTWARE

Undergraduate Graduation Thesis (Full-time Program)
Major: Information Technology
Supervisor: Assoc. Prof. Dr. Sc. Phạm Huy Điển
Co-supervisor: Dr. Đoàn Văn Tuyến

HÀ NỘI – 2010

ABSTRACT

Geo-electrical tomographical software plays a crucial role in geophysical research. However, imported software is expensive and offers little customizability, which is essential for more advanced geophysical study. Moreover, these programs are unable to exploit the full potential of modern hardware, so their running time is inadequate for large-scale geophysical surveys. Developing domestic software that overcomes all these problems is therefore an essential task. The development of this software is based on our research in using parallel programming on modern multi-core processors and stream processors for high performance computing. While this inter-disciplinary project poses many challenges, it has also given us valuable insights into building scientific software and, especially, into the new field of personal supercomputing.

TABLE OF CONTENTS

INTRODUCTION
CHAPTER 1. HIGH PERFORMANCE COMPUTING ON MODERN HARDWARE
  1.1 An overview of modern parallel architectures
    1.1.1 Instruction-Level Parallel Architectures
    1.1.2 Process-Level Parallel Architectures
    1.1.3 Data parallel architectures
    1.1.4 Future trends in hardware
  1.2 Programming tools for scientific computing on personal desktop systems
    1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++
    1.2.2 GPU programming with CUDA
    1.2.3 Heterogeneous programming and OpenCL
CHAPTER 2. THE FORWARD PROBLEM IN RESISTIVITY TOMOGRAPHY
  2.1 Inversion theory
  2.2 The geophysical model
  2.3 The forward problem by differential method
CHAPTER 3. SOFTWARE IMPLEMENTATION
  3.1 CPU implementation
  3.2 Example Results
  3.3 GPU Implementation using CUDA
CONCLUSION
References

List of Acronyms

CPU     Central Processing Unit
CUDA    Compute Unified Device Architecture
GPU     Graphical Processing Unit
OpenMP  Open Multi-Processing
OpenCL  Open Computing Language
TBB     Intel Threading Building Blocks

INTRODUCTION

Geophysical methods are based on studying the propagation of different physical fields within the earth's interior. One of the most widely used fields in geophysics is the electromagnetic field, generated by natural or artificial (controlled) sources. Electromagnetic methods comprise one of the three principal technologies in applied geophysics (the other two being seismic methods and potential field methods). Many geo-electromagnetic methods are currently in use around the world. Of these, resistivity tomography is the most widely used, and it is of major interest in our work.

Resistivity tomography [17], or resistivity imaging, is a method used in exploration geophysics [18] to measure underground physical properties in mineral, hydrocarbon, ground water or even archaeological exploration. It is closely related to the medical imaging technique called electrical impedance tomography (EIT), and mathematically it is the same inverse problem. In contrast to medical EIT, however, resistivity tomography is essentially a direct current method. This method is relatively new compared to other geophysical methods. Since the 1970s, extensive research has been done on the inversion theory for this method, and it is still an active research field today. A detailed historical description can be found in [27].

Figure: Resistivity tomography surveys searching for oil and gas (left) or water (right).

Resistivity tomography has the advantage of being relatively easy to carry out with inexpensive equipment and has therefore seen widespread use all over the world for many decades. With the increasing computing power of personal computers, inversion software for resistivity tomography has been developed, most notably Res2Dinv by Loke [5]. According to geophysicists at the Institute of Geology (Vietnam Academy of Science and Technology), the use of imported resistivity software has encountered the following serious problems:

• The user interface is not user-friendly;
• Some computation steps cannot be modified to adapt to measurement methods used in Vietnam;
• With large datasets, the computational power of modern hardware is not fully exploited;
• High cost of purchasing and upgrading the software.

Resistivity software is a popular tool for both short-term and long-term projects in research, education and exploration by Vietnamese geophysicists. Replacing imported software is therefore essential not only to reduce cost but also to enable more advanced research on the theoretical side, which requires custom software implementations. The development of this software is based on research in using modern multi-core processors and stream processors for scientific software. This can also serve as the basis for solving larger geophysical problems on distributed systems if necessary.

Our resistivity tomographical software is an example of applying high performance computing on modern hardware to computational geoscience. For 2-D surveys with small datasets, sequential programs still provide results in acceptable time. Parallelizing for these situations gives faster response times and therefore increases research productivity, but it is not a critical feature. For 3-D surveys, however, datasets are much larger and computationally far more expensive. One solution is to use clusters. Clusters, however, are not a feasible option for many scientific institutions in Vietnam: they are expensive, consume a lot of power, and, being available only in large institutions, are inconvenient to access. Clusters are also unsuitable for field trips because of difficulties in transportation and power supply. Exploiting the parallel capabilities of modern hardware is therefore a must to enable cost-effective scientific computing on desktop systems for such problems. This helps reduce hardware cost and power consumption while increasing user convenience and software development productivity. These benefits are especially valuable to scientific software customers in Vietnam, where cluster deployment is costly in both money and human resources.

Chapter 1. High Performance Computing on Modern Hardware

1.1 An overview of modern parallel architectures

Computer speed is crucial in most software, especially scientific applications. As a result, computer designers have always looked for mechanisms to improve hardware performance. Processor speeds and packaging densities have been enhanced greatly over the past decades. However, due to the physical limitations of electronic components, other mechanisms have been introduced to improve hardware performance. According to [1], the objectives of architectural acceleration mechanisms are to

• decrease latency, the time from start to completion of an operation;
• increase bandwidth, the width and rate of operations.

Direct hardware implementations of expensive operations help reduce execution latency. Memory latency has been improved with larger register files, multiple register sets and caches, which exploit the spatial and temporal locality of reference in the program.
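How much these caches help depends on whether the program's memory accesses actually exhibit such locality. As a standalone illustration (the function names below are ours; the snippet is not part of the software described in Chapter 3), the two loops sum the same row-major matrix: the row-wise loop touches consecutive addresses and benefits from spatial locality, while the column-wise loop strides through memory and is typically much slower for large matrices.

```cpp
#include <cstdio>
#include <vector>

// Standalone sketch: sum a row-major n-by-n matrix in two traversal orders.
// The row-wise loop reads consecutive addresses and benefits from spatial
// locality in the caches; the column-wise loop jumps n doubles between
// accesses and usually runs noticeably slower for large n.
double sum_row_wise(const std::vector<double>& a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            s += a[i * n + j];            // consecutive memory addresses
    return s;
}

double sum_column_wise(const std::vector<double>& a, int n) {
    double s = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            s += a[i * n + j];            // stride of n doubles per access
    return s;
}

int main() {
    const int n = 2000;
    std::vector<double> a(n * n, 1.0);
    std::printf("%f %f\n", sum_row_wise(a, n), sum_column_wise(a, n));
    return 0;
}
```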
With the bandwidth problem, the solutions can be classified into two forms of parallelism: pipelining and replication. Pipelining [22] divides an operation into different stages to enable the concurrent execution of these stages for a stream of operations. If all of the stages of the pipeline are filled, a new result is available every unit of time it takes to complete the slowest stage. Pipelines are used in many kinds of processors. In the picture below, a generic pipeline with four stages is shown. Without pipelining, four instructions take 16 clock cycles to complete; with pipelining, this is reduced to just 7 clock cycles.
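The arithmetic behind these numbers is simple: on an s-stage pipeline where every stage takes one clock cycle, n instructions need n × s cycles without pipelining but only s + (n − 1) cycles once the pipeline is kept full (4 × 4 = 16 versus 4 + 3 = 7 here). The short standalone sketch below just evaluates these two expressions; the function names are illustrative only.

```cpp
#include <cstdio>

// Standalone sketch: completion time, in clock cycles, of n independent
// instructions on an s-stage pipeline, assuming every stage takes one cycle.
int cycles_without_pipelining(int n, int s) { return n * s; }

int cycles_with_pipelining(int n, int s) { return s + (n - 1); }

int main() {
    const int n = 4, s = 4;  // the four-instruction, 4-stage example above
    std::printf("without pipelining: %d cycles\n", cycles_without_pipelining(n, s));  // 16
    std::printf("with pipelining:    %d cycles\n", cycles_with_pipelining(n, s));     //  7
    return 0;
}
```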
On the other hand, replication duplicates hardware components to enable concurrent execution of different operations. Pipelining and replication appear at different architectural levels and in various forms complementing each other. While numerous, these architectures can be divided into three groups [1]:

• Instruction-Level Parallel (Fine-Grained Control Parallelism)
• Process-Level Parallel (Coarse-Grained Control Parallelism)
• Data Parallel (Data Parallelism)

These categories are not exclusive of each other; a hardware device (such as the CPU) can belong to all three groups.

1.1.1 Instruction-Level Parallel Architectures

There are two common kinds of instruction-level parallel architecture. The first is the superscalar pipelined architecture, which subdivides the execution of each machine instruction into a number of stages. As shorter stages allow for higher clock frequencies, the recent trend has been toward longer pipelines: the Pentium 4, for example, uses a 20-stage pipeline, and its latest core contains a 31-stage pipeline.

Figure: Generic 4-stage pipeline; the colored boxes represent instructions independent of each other [21].

A common problem with these pipelines is branching. When a branch occurs, the processor has to wait until the branch is resolved before it knows which instruction to fetch next. A branch prediction unit is built into the CPU to guess which path will be taken. However, if branches are predicted poorly, the performance penalty can be high. Some programming techniques for making branches in code more predictable for the hardware can be found in [2]. Tools such as the Intel VTune Performance Analyzer can be of great help in profiling programs for missed branch predictions.
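As a concrete standalone illustration of why predictability matters (this sketch does not reproduce the specific techniques of [2]; the names are illustrative), the data-dependent branch in the first loop below is hard to predict on random input but almost perfectly predictable on sorted input, while the second loop avoids the branch altogether:

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Standalone sketch: the same reduction written with a data-dependent branch
// and in a branchless form.
long sum_above_branchy(const std::vector<int>& v, int threshold) {
    long s = 0;
    for (int x : v) {
        if (x > threshold)   // hard to predict on random data,
            s += x;          // easy to predict if v is sorted
    }
    return s;
}

long sum_above_branchless(const std::vector<int>& v, int threshold) {
    long s = 0;
    for (int x : v)
        s += (x > threshold) ? x : 0;  // optimizing compilers usually turn this
                                       // into a conditional move, not a branch
    return s;
}

int main() {
    std::vector<int> v(1 << 20);
    for (int& x : v) x = std::rand() % 256;
    std::printf("%ld %ld\n", sum_above_branchy(v, 128), sum_above_branchless(v, 128));
    return 0;
}
```

Timing the branchy version on sorted versus shuffled input, or reading the branch-misprediction counters in a profiler such as VTune, makes the penalty visible.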
The second kind of instruction-level parallel architecture is the VLIW (very long instruction word) architecture. A very long instruction word usually controls up to 30 replicated execution units. An example of a VLIW architecture is the Intel Itanium processor [23]; as of 2009, Itanium processors can execute up to six instructions per cycle. For ordinary architectures, superscalar execution and out-of-order execution [...]

[...] a whole data set, usually a vector or a matrix. This allows applications to exhibit a large amount of independent parallel work. Both pipelining and replication have been applied to hardware [...]

[...] can provide the level of parallelism that was previously available only to cluster systems.

Figure: Intel Gulftown CPU.

1.1.3 Data parallel architectures

Data parallel architectures appeared very soon [...]
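To make the data-parallel idea concrete (one operation applied independently across a whole vector), the following standalone sketch parallelizes an element-wise vector update with an OpenMP pragma; it assumes an OpenMP-capable compiler, and the function name is illustrative only. OpenMP itself is discussed in section 1.2.1. The same loop structure is what maps naturally onto SIMD units and GPU threads.

```cpp
#include <cstdio>
#include <vector>

// Standalone sketch: an element-wise update over a whole vector. Every
// iteration is independent, so the loop can be spread across CPU cores
// (here with an OpenMP pragma, compiled with e.g. -fopenmp) or mapped
// onto SIMD units or GPU threads.
void scale_and_add(std::vector<double>& y, const std::vector<double>& x, double a) {
    const int n = static_cast<int>(y.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
    scale_and_add(y, x, 0.5);
    std::printf("y[0] = %f\n", y[0]);  // 2.5
    return 0;
}
```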
