
β_k = (z_k^T r_k) / (z_{k−1}^T r_{k−1})

p_k = z_k + β_k p_{k−1}

end for.
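For reference, the sketch below shows one way the preconditioned conjugate gradient iteration ending above can be written in C++. It is a minimal illustration only: the names matvec, precond, tol and max_iter are assumptions for this sketch, and the solver used in our code is the MTL4 implementation rather than this hand-written loop.

#include <cmath>
#include <cstddef>
#include <vector>

// Minimal preconditioned conjugate gradient sketch for A x = b, with A
// symmetric positive definite. 'matvec' applies A, 'precond' applies M^{-1}.
template <typename MatVec, typename Precond>
std::vector<double> pcg(MatVec matvec, Precond precond,
                        const std::vector<double>& b,
                        double tol, int max_iter)
{
    std::size_t n = b.size();
    std::vector<double> x(n, 0.0), r = b, z = precond(r), p = z;
    auto dot = [](const std::vector<double>& u, const std::vector<double>& v) {
        double s = 0.0;
        for (std::size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
        return s;
    };
    double rz = dot(r, z);
    for (int k = 0; k < max_iter && std::sqrt(dot(r, r)) > tol; ++k) {
        std::vector<double> Ap = matvec(p);
        double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        z = precond(r);
        double rz_new = dot(r, z);
        double beta = rz_new / rz;            // beta_k = (z_k^T r_k) / (z_{k-1}^T r_{k-1})
        for (std::size_t i = 0; i < n; ++i)
            p[i] = z[i] + beta * p[i];        // p_k = z_k + beta_k p_{k-1}
        rz = rz_new;
    }
    return x;
}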


3.2 CPU implementation

For each wave number, a large number of linear systems has to be solved. In theory, this number equals the number of data points, which means that a linear system must be solved for every electrode arrangement. This is the common approach in most research articles [9], [10]. However, according to the superposition property of the electrical field, the potential caused by a pair of current electrodes is the superposition of the potentials caused by each electrode individually:

v = v1 - v2,

where v1 is the potential caused by the positive electrode and v2 is the potential caused by the negative electrode. As a result, the potential only has to be calculated once for each electrode position, and the value for each data point can then be assembled from the single-electrode results using equation (2.1). The number of linear systems to solve therefore equals the number of electrodes, not the number of data points. For example, a survey with 100 electrodes can have more than 900 data points for a pseudosection depth of 10. With this observation, we reduce the number of required computations by a large factor.
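As an illustration, the sketch below assembles the value of one four-electrode data point from the single-electrode solutions. The names used here (potential, node_of, and the electrode indices) are assumptions for this example rather than our actual code: potential[e] is taken to be the solved potential field for a unit current injected at electrode e.

#include <vector>

// Hypothetical helper: value of one data point with current electrodes c1 (+)
// and c2 (-) and potential electrodes p1, p2, assembled by superposition from
// the per-electrode solutions. node_of(e) maps an electrode to its mesh node.
double data_point_value(const std::vector<std::vector<double>>& potential,
                        int (*node_of)(int),
                        int c1, int c2, int p1, int p2)
{
    // Potential at each measurement electrode due to the (c1+, c2-) pair.
    double vP1 = potential[c1][node_of(p1)] - potential[c2][node_of(p1)];
    double vP2 = potential[c1][node_of(p2)] - potential[c2][node_of(p2)];
    return vP1 - vP2;   // measured potential difference for this data point
}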

Although the number of linear systems is large, they can be solved independently, so parallelization can be done at a coarse-grained level. In our software, Intel Threading Building Blocks [4] is used for parallelizing the code, as this library is portable across platforms, provides good load balancing, and has good documentation and a large user community. The loop over all the linear systems is parallelized using TBB's parallel_for, and each linear system is solved with the Matrix Template Library 4 by the Open Systems Lab at Indiana University. Running times for typical test cases are encouraging, as near-linear scaling is achieved: on our Core 2 Quad Q9400 test system, for example, a speedup of 3.9x is observed.
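A minimal sketch of this coarse-grained parallelization with TBB is shown below. Matrix, Vector and solve_one are placeholders standing in for the MTL4 types and the per-electrode solver; the real code differs in detail.

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Solve one independent linear system A * x[i] = b[i] per electrode position.
// Solver is any callable taking (A, b[i]) and returning the solution vector.
template <typename Matrix, typename Vector, typename Solver>
void solve_all_electrodes(const Matrix& A,
                          const std::vector<Vector>& b,
                          std::vector<Vector>& x,
                          Solver solve_one)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, b.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                x[i] = solve_one(A, b[i]);   // systems are independent
        });
}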

3.2.1 Example results


Here we present some results for common underground models using the Wenner alpha electrode array. Each example contains two pictures: the top picture shows the model and the bottom picture shows the result produced by the forward modeling module.


Figure 24 Example 1 (model labels: Surface, Environment, Area with higher resistivity)




Figure 25 Example 2

Figure 26 Example 3 (model labels: Surface, Block 1, Block 2, Environment)

Figure 27 Example 4

3.3 GPU Implementation using CUDA

For the 2-D case, the number of electrodes is often less than 1000, so the multi-core CPU implementation above is usually adequate for 2-D surveys. For 3-D surveys, however, the number of electrodes can be a few thousand or more and the 3-D mesh is significantly larger than the 2-D mesh. It is therefore essential to obtain further performance gains for large datasets. Traditionally this has been done using clusters, which are expensive and inconvenient for ordinary users. At the moment, we are experimenting with porting the computationally expensive parts of the program to the GPU.

As can be seen from chapter 1, the GPU has a very different execution model from the CPU, and its memory hierarchy is also different. This leads to more work when porting a program from the CPU to the GPU.

Choosing the right matrix format is an important first step. There are a few common sparse matrix formats to choose from.

For dense matrices, most elements are nonzero and the storage format typically stores all matrix elements in row-major or column-major order. For sparse matrices, because a large number of elements are zero, it is more efficient to store only the nonzero ones to save space and increase computation speed. This requires a more complicated storage scheme than for dense matrices.

Some common sparse matrix formats are:

Compressed Sparse Row (CSR) format: a data vector stores all nonzero elements row by row, with elements of the same row adjacent and rows in their order of appearance in the matrix. A second vector stores the column index of each element, and a third vector stores the offset at which each row starts in the data vector.

For example, the matrix

A =
| 1 7 0 0 |
| 0 2 8 0 |
| 5 0 3 9 |
| 0 6 0 4 |

has the CSR representation

ptr     = [ 0 2 4 7 9 ]
indices = [ 0 1 1 2 0 2 3 1 3 ]
data    = [ 1 7 2 8 5 3 9 6 4 ].
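To make the layout concrete, here is a serial sparse matrix-vector product over the CSR arrays above. This is a sketch for illustration, not the MTL4 routine our code actually calls; row i owns the range [ptr[i], ptr[i+1]) of indices and data.

#include <vector>

// y = A * x with A stored in CSR form (ptr has n_rows + 1 entries).
void spmv_csr(int n_rows,
              const std::vector<int>& ptr,
              const std::vector<int>& indices,
              const std::vector<double>& data,
              const std::vector<double>& x,
              std::vector<double>& y)
{
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)
            sum += data[k] * x[indices[k]];   // indirect access into x
        y[i] = sum;
    }
}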


Diagonal format: every diagonal of the matrix that contains a nonzero is stored. Zeros are padded in so that every stored diagonal has the same length. This format is most suitable for matrices produced by finite differencing.

For example, the same matrix A has the diagonal representation

data =
| * 1 7 |
| * 2 8 |
| 5 3 9 |
| 6 4 * |

offsets = [ -2 0 1 ],

where entries marked * are padding.


ELL format: every row is assumed to have no more than a fixed number of non-zeros. Zeros are padded in if a row has fewer non-zeros. The column index of each element is also explicitly stored.

For example, the matrix A above has the ELL representation

data =
| 1 7 * |
| 2 8 * |
| 5 3 9 |
| 6 4 * |

indices =
| 0 1 * |
| 1 2 * |
| 0 2 3 |
| 1 3 * |.

COO format: the row index, the column index, and the value of every nonzero element are stored explicitly. For example, the matrix A above has the COO representation

row  = [ 0 0 1 1 2 2 2 3 3 ]
col  = [ 0 1 1 2 0 2 3 1 3 ]
data = [ 1 7 2 8 5 3 9 6 4 ].

Many other formats have also been proposed; some are hybrids of the existing types, while others are completely new. Finding the most appropriate format for a specific parallel architecture requires a deeper understanding of the hardware internals and of parallel algorithms, and is an interesting subject for further research.

In the CPU implementation, we use the CSR format, as it is the most common format in matrix libraries and is supported by many solvers. However, this format is inefficient on the GPU because memory access in the matrix-vector multiplication kernel is not contiguous, which significantly slows down GPU computation. As our mesh comes from finite differencing, we use the diagonal format on the GPU to enable contiguous (coalesced) memory access.
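The kernel below sketches a matrix-vector product in diagonal format. It assumes the padded diagonals are stored one after another with element (row, d) at position d * num_rows + row, so that consecutive threads in a warp read consecutive addresses; it illustrates the access pattern rather than reproducing our production kernel.

// y = A * x with A stored in diagonal (DIA) format.
// offsets[d] is the offset of diagonal d from the main diagonal; padding
// entries in 'data' are never read because of the column bound check.
__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int* offsets, const float* data,
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < num_cols)
            sum += data[d * num_rows + row] * x[col];   // coalesced read of data
    }
    y[row] = sum;
}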

We have also applied some common optimizations to the GPU kernels. Attention has been paid to using shared memory for threads within the same block to avoid accesses to the slower global memory; this alone proves quite effective. We also try to keep the processing on the GPU for as long as possible to avoid long memory transfers between GPU and CPU. For small cases, grouping several small matrices into a larger one helps utilize more of the memory bandwidth, which is otherwise under-used for such cases.
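As one example of the shared-memory usage mentioned above, the sketch below computes per-block partial sums for a dot product, the kind of reduction the conjugate gradient iteration needs. It assumes blockDim.x is a power of two and that the per-block results are combined afterwards on the host or in a second kernel; it illustrates the technique rather than reproducing our actual kernels.

// Per-block partial dot product: partial[blockIdx.x] = sum over this block's
// share of a[i] * b[i]. Launch with shared memory of blockDim.x * sizeof(float).
__global__ void dot_partial(const float* a, const float* b,
                            float* partial, int n)
{
    extern __shared__ float cache[];
    int tid = threadIdx.x;

    // Grid-stride loop: each thread accumulates a private partial sum.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];
    cache[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}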

Porting the whole computation part of the program to the GPU is an ongoing task that requires more time to complete. Below are preliminary results comparing the CPU and GPU conjugate gradient method with a diagonal preconditioner, after some optimization of the GPU kernels.

The CPU code is run on a Core 2 Quad Q9400 CPU (2.66 GHz, 6 MB cache). The GPU code is run on an MSI Nvidia 9800GT GPU (600 MHz core clock, 1375 MHz shader clock, 512 MB GDDR3). The time is measured for 2-D survey datasets; however, with 3200 electrodes the amount of computation is about the same as for large 3-D datasets. Memory copy time between CPU and GPU is included in the GPU timing to provide a fair comparison between GPU and CPU times.


Figure 28 Running time (s) versus number of electrodes (100 to 3200) for the CPU sequential, CPU parallel and GPU implementations (CPU sequential time is not measured beyond 800 electrodes because of the long running time).

It can be seen that for large datasets the GPU shows a speedup of more than 6x over the 4-core CPU. This is quite promising for an old GPU such as the GeForce 9800 GT. The GPU also has a 75 W TDP compared to the 120 W TDP of the CPU. The speedup can be improved further, as there is still much room for improvement in our GPU kernels, and GPUs are getting much faster with each generation, with better compilers and tools. GPU computing can therefore be a viable option for processing large 3-D survey datasets with a substantial speedup over CPU computing.

We are currently working on a multi-GPU implementation. This requires more research into the problem of load balancing between different GPUs. Using a cluster of multi-GPU nodes is also an interesting research direction, with the potential to solve larger problems at lower cost and power consumption.

CONCLUSION

In the first phase of our project, we have developed the forward module for the resistivity tomography problem. Our implementation can run on multi-core CPUs and GPUs. Future versions may add new modules for resistivity inversion and other geophysical features. We are also working on making our code more efficient and more scalable across different kinds of architectures.
