Independent Optimization of a Decomposed Connectivity Graph


joins, and then the tuples corresponding to this value are retrieved. In practice, efficiency is achieved by performing the joins one tuple at a time. There is, however, an important difference between this approach and the semijoin approach: the former is mainly used to synchronize the processing of the inner and outer relations and requires a lot of communication, while the latter is used to save communication at the price of a lot of local processing.

In principle, each transfer method can be combined with each join method, but only a small number of combinations were considered practical for R*. For example, with the nested-loop method it is not convenient to ship the entire inner relation I, because I must be scanned card(O) times, and this is efficient only if there are efficient access methods on the join attribute; shipping the relation would invalidate these access methods. This and similar considerations limit the possible cases:

- Nested loop, “ship the entire outer relation and do not store it”. The cost is the sum of the local join cost and the cost of shipping the outer relation O (here m denotes the maximum message size, Cmes the cost per message, and brackets denote rounding up):

C1 = C(nested-loop) + Cmes [card(O) size(O) / m]


- Merge-scan, “ship the entire outer relation and do not store it”. Similarly, the cost is:

C2 = C(merge-scan) + Cmes [card(O) size(O) / m]


- Nested loop, “inner relation fetched as needed”. In this case, for each tuple of O a request is sent to the site of I, and the NI tuples (on average) satisfying the request are returned. We have:

C3 = C(nested-loop) + Cmes card(O) (1 + [NI size(I) / m])

- Merge-scan, “inner relation fetched as needed”. Here we assume that one request message is sent to I for each distinct value of the join attribute in O, and that all the matching tuples of I are returned. This assumption differs from what is done in R*, where a message is sent for each tuple of O. If A is the join attribute of O, we have:

C4 = C(merge-scan) + Cmes val(A[O]) (1 + [NI size(I) / m])

- Merge-scan, “ship the entire inner relation and store it before use”. In this case, we incur the additional cost of storing and retrieving the relation at the remote site:

C5 = C(merge-scan) + Cmes [card(I) size(I) / m] + 2 nI CI/O (nI disk accesses at cost CI/O each, to store and later retrieve I)

- Move both relations to a third site. In this case, the communication cost is the sum of the costs of transmitting the inner relation and the outer relation to the third site; the join cost is C(merge-scan) or C(nested-loop).
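The alternatives above can be compared numerically. The following sketch evaluates the five cost formulas with illustrative parameter values (all numbers are assumptions, not R* defaults), where `msgs` implements the bracketed rounding-up term:

```python
import math

def msgs(total_bytes, m):
    # number of messages needed to ship total_bytes with message size m
    return math.ceil(total_bytes / m)

def r_star_costs(card_O, size_O, card_I, size_I, NI, val_A, n_io,
                 c_nl, c_ms, c_mes, c_io, m):
    """The five alternatives C1..C5 from the text (illustrative sketch)."""
    return {
        "C1_nl_ship_outer":       c_nl + c_mes * msgs(card_O * size_O, m),
        "C2_ms_ship_outer":       c_ms + c_mes * msgs(card_O * size_O, m),
        "C3_nl_fetch_inner":      c_nl + c_mes * card_O * (1 + msgs(NI * size_I, m)),
        "C4_ms_fetch_inner":      c_ms + c_mes * val_A * (1 + msgs(NI * size_I, m)),
        "C5_ms_ship_store_inner": c_ms + c_mes * msgs(card_I * size_I, m)
                                  + 2 * n_io * c_io,
    }

# Hypothetical relation and cost parameters:
costs = r_star_costs(card_O=1000, size_O=100, card_I=5000, size_I=80,
                     NI=2, val_A=50, n_io=400,
                     c_nl=900, c_ms=600, c_mes=1, c_io=0.02, m=4000)
best = min(costs, key=costs.get)   # cheapest alternative for these parameters
```

With these particular numbers, shipping the outer relation under merge-scan (C2) wins; changing the cardinalities or the message size shifts the balance among the alternatives.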

The R* optimizer enumerates all possible join sequences with all possible join methods, placing each join at every possible site. It then selects the least expensive access plan for the query. Dynamic programming is used to speed up the search, because the cost function satisfies the conditions required by this technique.
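The dynamic-programming idea can be sketched as follows. This is a toy Selinger-style program over join orders only; real R* also tracks join methods and sites, and the cost function below is a made-up placeholder, not the R* cost model:

```python
# Best cost for a subset S of relations is built from the best cost of
# S - {r} plus the cost of joining r last; subsets are memoized.

def best_plan(rels, join_cost):
    memo = {}

    def solve(s):
        if len(s) == 1:
            return 0, [next(iter(s))]
        if s in memo:
            return memo[s]
        best = None
        for r in s:                      # r is the relation joined last
            sub_cost, sub_order = solve(s - {r})
            total = sub_cost + join_cost(s - {r}, r)
            if best is None or total < best[0]:
                best = (total, sub_order + [r])
        memo[s] = best
        return best

    return solve(frozenset(rels))

# Assumed cardinalities for three relations (illustrative):
card = {"R": 100, "S": 10, "T": 1000}
cost, order = best_plan(card, lambda done, r: card[r] * min(card[x] for x in done))
```

Memoization is what makes the principle of optimality pay off: each subset of relations is costed once, instead of once per join order that contains it.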

5.3. General queries

In this section, we consider the general case of queries with joins and unions in optimization graphs. We show that the extension from join queries to general queries is important, and that many different execution strategies can be used for the same general query. The basic transformation used is the commutativity of joins and unions (i.e. criterion 5 in the previous chapter) to create distributed join graphs. We focus on the problem of a single join between fragmented relations; multiple joins can be solved by successively applying the following considerations.

Determining the strategy for performing a union is easy, since we do not care about the order in which the operand tuples enter the result relation. Therefore, the transfers of non-local fragments to the site where the union is performed can proceed in parallel. In terms of communication, the union has a delay equal to the transfer time of the largest fragment, and a total cost equal to the sum of all the fragment transfer costs.
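The union cost model just stated can be sketched directly: with parallel transfers, the delay is the largest single transfer while the cost is the sum of all of them. The per-byte transfer rate below is an assumption for illustration:

```python
# Fragments are shipped to the union site in parallel:
# delay = largest transfer, total cost = sum of all transfers.

def union_transfer(fragment_sizes, cost_per_byte=2):
    times = [size * cost_per_byte for size in fragment_sizes]
    return max(times), sum(times)   # (delay, total communication cost)

# Three non-local fragments of 40000, 25000, and 10000 bytes:
delay, total = union_transfer([40000, 25000, 10000])
# delay == 80000 (largest fragment), total == 150000 (sum of all three)
```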

The effect of commuting joins and unions

The commutation of joins and unions is shown in Figure 5.4, which presents three different optimization graphs of the same query. In Figure 5.4 and in all other figures of this section, circles represent fragments of relation R, and squares represent fragments of relation S. Consider first the optimization graphs of Figures 5.4a and 5.4b. In Figure 5.4a the fragments are first collected by unions and then joined; in Figure 5.4b the fragments are first joined and the partial results are then collected by a union. We call the first approach a nondistributed join and the second a distributed join. These two cases pose different optimization problems:

a - Nondistributed joins: This optimization problem is simpler, requiring only the selection of the pair of sites (possibly the same site) where the unions are performed. Once the two relations have been reconstructed, the query reduces to a simple join query between two relations, which can be optimized as shown in the previous section.

b - Distributed joins: This optimization problem is more difficult. Note that in Figure 5.4b the join graph of the join between R and S appears inside the bounding node (hypernode) that represents the union. Fragmentation criteria can be used to remove from the join graph the edges that correspond to empty joins, as shown in Chapter 4, Section 4.2. Once the minimal join graph is determined, the partial join results are sent to the same site to perform the union. The optimization graph of Figure 5.4b shows a distributed join; however, it is also possible to perform partial unions involving fragments of the same relation before performing the join. An example is shown in Figure 5.4c.

Figure 5.4. Commutation of joins and unions

In constructing the optimization graph G' of Figure 5.4c, starting from the optimization graph G of Figure 5.4b, the following rules apply:

- The fragments combined by a partial union are enclosed in a bounding node (in this example, this rule applies to {R1, R2} and {S2, S3}).

- If two fragments Ri and Sj are connected by an edge in G, then the bounding nodes in which they lie are also connected by an edge in G' (for example, the edge between R1 and S2 in G creates the edge between {R1, R2} and {S2, S3} in G').

Partitioned join graphs are particularly convenient from the viewpoint of optimizing their execution. In a partitioned graph we can distinguish disjoint subgraphs, and each such subgraph can be optimized independently. This property holds regardless of the method used to evaluate the cost of a strategy, and rests on the following arguments:

- The joins of disjoint subgraphs can be optimized independently.

- The union is not affected by the order in which its operands are assembled.

For example, the graph of Figure 5.5a is partitioned, so the optimization problem can be decomposed into two subproblems, involving the fragments (R1, R2, S1, S2) and (R3, S3) respectively. Figure 5.5b shows a possible final optimization graph for the query, which performs the partial unions of (R1, R2) and (S1, S2), the join between their results, the join between R3 and S3, and the final union of the two join results.
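The decomposition into independent subproblems amounts to finding the connected components of the join graph. A minimal sketch, using an edge set assumed to match the example of Figure 5.5a (the exact edges are not listed in the text):

```python
# Split a join graph into connected components; each component is an
# independently optimizable subproblem.

def components(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:                      # iterative depth-first search
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Assumed join graph consistent with Figure 5.5a: R3-S3 is disconnected
# from the rest, so two components emerge.
edges = [("R1", "S1"), ("R1", "S2"), ("R2", "S2"), ("R3", "S3")]
parts = components(edges)
# two subproblems: {R1, R2, S1, S2} and {R3, S3}
```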


Figure 5.5. Independent optimization of a decomposed connectivity graph

The convenience of transforming the optimization graph of Figure 5.5a into that of Figure 5.5b depends on the location and size of the fragments. Consider the following example: suppose a network with three sites, where fragment i is stored at site i, the result is required at site 1, and the operands and results of each partial operation all have the same size. It is then convenient to perform the union of each component at site 1, the join between R3 and S3 at site 3, and the final union at site 1.

The above properties allow many strategies to be constructed, combining partial unions of fragments in the connectivity graph; finding the best execution strategy for a given query requires the following steps:

- Generate all possible query optimization graphs.

- Apply the methods for join queries to optimize the joins, and add the costs of the unions; each optimization graph thus corresponds to a best strategy and its cost.

- Choose the cheapest strategy among them. Figure 5.6 shows, for a completely connected join graph between two relations R and S, each with two fragments, four different optimization graphs whose costs must be computed:

Figure 5.6. Different optimization graphs for the same query

- Perform the distributed join directly. This simply performs the four joins of Figure 5.6a and collects their results with a union. Note that the four joins cannot be optimized separately, because, for example, reducing the size of one fragment can benefit at least two joins.

- Perform the partial union of the fragments of R first; the problem then involves only two joins, and is shown in Figure 5.6b.

- Perform the partial union of the fragments of S first; the problem then involves only two joins, and is shown in Figure 5.6c.

- Perform the unions of the fragments of both relations, then perform their join; this corresponds to the nondistributed execution of the join, shown in Figure 5.6d.
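The selection among the four alternatives above reduces to costing each candidate optimization graph and keeping the cheapest. A minimal sketch, where the per-graph costs are assumed values (in practice each comes from optimizing that graph's joins and adding its union costs):

```python
# Pick the cheapest optimization graph given a cost estimate per graph.

def choose_strategy(costed_graphs):
    # costed_graphs: mapping graph label -> estimated total cost
    label = min(costed_graphs, key=costed_graphs.get)
    return label, costed_graphs[label]

# Assumed (illustrative) costs for the four graphs of Figure 5.6:
assumed_costs = {
    "5.6a distributed join":     120,
    "5.6b partial union of R":    95,
    "5.6c partial union of S":   110,
    "5.6d nondistributed join":  140,
}
best_label, best_cost = choose_strategy(assumed_costs)
# under these assumed costs, the graph of Figure 5.6b is chosen
```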


QUESTIONS AND EXERCISES


I. Theoretical questions

1) State the concept of distributed database and important aspects of a distributed database

2) Give an example and explain a system using distributed database design and implementation.

3) State the purpose of using distributed database

4) Draw a distributed database reference architecture model

5) Explain the components in the architectural model

6) State the reference architecture types for distributed database management systems.

7) State the main features of distributed databases

8) List the typical components and software required to build a distributed database.

9) List the types of distributed database management systems

10) State the types of data fragmentation

11) State the correctness conditions for horizontal fragmentation.

12) Give an example of the correctness conditions for horizontal fragmentation.

13) State the correctness conditions for vertical fragmentation.

14) Give an example of the correctness conditions for vertical fragmentation.

15) State the types of vertical fragmentation.

16) State the concept of mixed fragmentation; Give examples.

17) State the levels of distribution transparency.

18) State the methods of accessing distributed databases.

19) State the methods of accessing databases with fragmentation transparency.

20) State the concept of a fragmentation tree; Give an example.

21) State the concept of update tree; Give an example.

22) List the steps to access the database for each value.

23) List the steps to access the database after entering all values.

24) List the steps to access the database before entering values.

25) State the goals of distributed data design

26) List the types of information used for distributed design.

27) State the strategies for designing distributed databases.

28) Draw and explain the diagram of top-down design strategy

29) State the information requirements when designing horizontal fragmentation.

30) Give an example of database information when designing horizontal fragmentation.

31) State the steps to design a distributed database.


32) State the concept, representation, and qualification predicate of primary horizontal fragmentation.

33) Give an example of a globally fragmented relation.

34) State the concept, representation, and qualification predicate of derived horizontal fragmentation.

35) Give an example of a global relation with derived horizontal fragmentation.

36) State the concept and representation of vertical fragmentation.

37) Give an example of a vertically fragmented global relation.

38) State the concepts: simple predicate, set of simple predicates, minterm predicate, set of minterm predicates, relevant predicate, complete set of predicates, minimal set of predicates.

39) State Rule 1 concerning a predicate fi of Pr'.

40) Present the COM_MIN algorithm to find a set of simple predicates that is complete and minimal.

41) Present the PHORIZONTAL algorithm to find the set of minterm predicates.

42) State the primary horizontal fragmentation design method using the top-down approach.

43) State the derived horizontal fragmentation design method using the top-down approach.

44) State the concept of distributed join; Give an example.

45) State the concept and types of join graphs.

46) Define: Attribute usage value, attribute usage matrix

47) State the derived horizontal fragmentation design strategies.

48) State the information requirements of vertical fragmentation.

49) Present the clustering algorithm

50) Present the decomposition algorithm

51) Present the allocation problem.

52) Present the information requirements for the allocation problem.

53) Present the allocation model.

54) Define the operator tree of a query and how to build an operator tree from a query.

55) Present equivalent transformations

56) Define the canonical expression of a fragment query; Give an example.

57) List the steps to simplify horizontally fragmented relations.

58) State the steps to simplify the joins between the main horizontally fragmented relations; Give examples.

59) State the simplified steps for derived horizontal fragmentation.

60) State the simplified steps for vertical fragmentation.

61) State the issues of query optimization

62) State the goals of query optimization

63) State the new model of queries

64) State the importance of query optimization in distributed databases.

65) State how to use semijoin programs for join queries.


66) State how to determine semijoin programs in SDD-1.

67) State how to determine semijoin programs using the Apers, Hevner, and Yao algorithm.

68) State how to process queries using joins

II. Multiple choice questions

1. Distributed database is:

A. Unification of database theory and computer network technology.

B. Integration of telecommunications and information technology.

C. Integration of information technology and database.

2. The concept of distributed database system includes the concept of:

A. Distributed databases and computer network technology.

B. Centralized database and query optimization

C. Distributed databases and distributed database management systems.

3. Distributed database is:

A. A set of logically related databases distributed over a computer network.

B. A set of databases distributed over a computer network.

C. A set of databases installed and stored on servers.

4. Distributed database management system is:

A. A software system that manages a distributed database and makes the distribution transparent to users.

B. Distributed database access control software system.

C. A software system that performs data storage and retrieval operations over a computer network.

5. Characteristics of distributed databases are:

A. Data is distributed over computer networks and is logically related to each other.

B. A set of data files stored on the memory devices of a computer network.

C. A set of logically related data files

6. Data independence is understood as:

A. Data storage organization is transparent to users.

B. Application programs are independent of data storage organization.

C. Organize data storage on network servers

7. Characteristics of data independence in distributed database systems are:

A. Dispersion transparency

B. Distributed applications.
