Bloomopt Algorithm Optimizes Data Transfer in Path Expressions

path x.dongCo.nhaSanXuat.ten retrieves the name property value of the engine manufacturers of object x . We can create path expressions that contain properties as well as methods. Optimizing path expressions is a problem that has received much attention in the field of object query processing.

Path expressions provide a concise abstract notation for representing traversal of object composition graphs, allowing the shaping of predicates on values deeply nested within an object structure. They provide a unified mechanism for shaping queries containing composition and inherited member functions. Path expressions can be of single-valued or set-valued type, and can appear in a query as a member of a predicate, a target of a query, or a member of a reference list. A path expression is single-valued if each member of the path expression is a single-valued. If at least one member of the path expression is a set of values, then the entire expression is a set of values. A technique for traversing forward and backward through a path expression has been discussed in [31].

The problem of path expression optimization spans the entire query optimization process. During or after parsing a query, but before algebraic optimization, the compiler must identify path expressions that are potentially optimizable. This is typically achieved through rewriting techniques, which transform path expressions into logical algebraic equivalents. Once the path expression is expressed in algebraic form, the query optimizer seeks the least-cost solution. Finally, the optimal execution method may contain algorithms that efficiently compute path expressions, including hash joins, complex object aggregations, or indexed scans of path indices.

Consider first a query rewrite and algebraic optimization. Consider the path expression x.car.manufacturer.name . Suppose that each instance of a car has a reference to an engine object, each engine has a reference to a manufacturer object, and each manufacturer has a name field. Suppose that the types DongCo and NhaSanXuat have a corresponding type family. The first two links of the path above might have to trace the engine and manufacturer objects

resides on disk. The third path consists of only a lookup of a field within a producer object. Thus only the first two paths present query optimization opportunities in computing that path. An object query compiler needs a mechanism to distinguish these paths in a path that represents optimization possibilities.

There have been some studies on path expression optimization in a centralized environment following Ozkan's heuristic approach in [45], rewriting complex path expressions into a simple form in Truong Bao Chau's doctoral thesis [2]. The algorithms proposed in [52], [34] are all dynamic programming algorithms. The limitation of the algorithm in [52] is that it only considers each path expression, not combining path expressions together. The limitation of the algorithm in [34] is that there is no filtering mechanism during data transmission and the subgraph generation algorithm is not optimal. In sections 3.3 and 3.4, the thesis will present path expression optimization algorithms in the PT-HDT database.

3.3. BloomOpt algorithm optimizes data transfer in path expressions

3.3.1. Introduction

The query cost in a distributed system includes the local processing cost at each station and the remote data transmission cost when needed, in which the normal data transmission cost is the largest cost. The problem of optimizing distributed queries is reduced to the problem of minimizing the above two costs. Reducing the local processing cost at each station is the problem of query processing in the centralized model that some studies have conducted. To reduce the data transmission cost, it is necessary to minimize the transmission of unnecessary data. There are a number of filtering algorithms to limit the transmitted data, including the filtering mechanism of the Bloom filter [16]. Using the Bloom filter to process distributed queries in relational database systems has been mentioned in articles [35], [46]. In this section, the thesis proposes an algorithm to optimize data transmission between layers when processing path expression queries in the PT-HDT database using the Bloom filter to reduce data transmission.

3.3.2. Query with path expression

Queries in object-oriented databases are expressed by the OQL language, which is an extension of and inherited from the SQL language. The design of OQL is in the form of functions, the results of queries have types, this allows the results of one query to be the input for another query, so it is possible to express complex queries.

When performing a query in the HDT database, if the classes have complex attributes, the query will be directed to nested objects according to a path expression.

Example 3.3: The database schema consists of the following classes: CanBoNghienCuu (Research Staff), ToChuc (Organization), DiaChi (Address). The CanBoNghienCuu class has the attribute coQuan which is a set of objects of the ToChuc class , the CoQuan class has the attribute diaChi which is an object of the DiaChi class , the DiaChi class has the attribute thanhPho .

Figure 3.3 : Illustration of classes with complex attributes

Suppose the query here is to list the names of researchers in Hanoi city and can be written as:

SELECT c.name

FROM CanBoNghienCuu as c

WHERE c.coQuan.diaChi.thanhPho = “Hanoi”

c.coQuan.diaChi.thanhPho is a path expression or also known as an implicit connection. Each object of the class has a unique identifier (OID) that is initialized by the system, and implicit connections are made through these identifiers. A path expression can be single-valued if each of its components is single-valued, otherwise if at least one component is a set of values, the entire expression is a set of values. Optimizing path expressions is an important goal in processing queries to the PT-HDT database.

3.3.3. Bloom filter

3.3.3.1.Bloom filter structure

Bloom filter introduced in [16], is a probabilistic data structure to check whether an element is in a set or not. The structure of a Bloom filter includes:

- An array of n bits, initialized to all values 0

- A set of hash functions h 1 , h 2 ,..h k . Each hash function will map the key values to a value in the range from 1 to n, corresponding to the filter bit

- A set S consists of m key values

For each key value k in S:hash it according to the hash function, the i-th hash function has value h i (k), set the h i (k)-th bit of the Bloom filter to 1.

Filter function

Bloom is a definition of a

quarter x has ́

episode

S or

No. It is used as a pre - processing step of the search process . If after filtering

through filter

Bloom returns “ no ” then no need to practice

porch

work

Also , if the result is “ maybe ” then the action is done .

porch

search .

To determine a

any quarter x has

episode

S or not, we are

The n-bit vector has the value h 1 ( x),..., h k (x) from x through the hash key . If the k bits in the n-bit vector at corresponding positions h 1 (x), h 2 (x),…, h k (x) all have the value 1 then x “ can ”

exists in the set S with some probability , and n if at least one bit has the value 0

then affirm h is x does not belong

episode

S. We can only assert that x “ has

drug

episode

S is because in a bit vector , 1 bit can be

the value is 1 more

swim multiple quarters in S when starting

filter. Just one

bit 0 we have

affirmative statement hx does not belong

S swims because if x belongs to

S then all corresponding k bits will

is set to 1 at initialization Example 3.4:

filter

with a quarter of the x measure .

 Bloom filter consists of 9 bits, initially assigning all bits to 0.

 Using two hash functions: The first function h 1 (k) takes the values at odd positions (from left to right) of k when represented in binary, and converts them to values

decimal and then divide by 9. The h 2 (k) function is similar but takes the value at even positions.

 S = {23, 147, 352}

 The process of building a Bloom filter is illustrated in Table 3.2

Table 3.2: Illustration of Bloom filter construction

Step

Binary performance k's stool	h 1 (k)	h 2 (k)	Bloom Filter
Initialization				000000000
k = 23	10111	7	1	010000010
k = 147	10010011	0 (9 mod 9)	5	110001010
k = 352	101100000	6 (24 mod 9)	4	110011110

Maybe you are interested!

 If x = 15, the binary representation of x is 1111, h 1 (x) = h 2 (x) =3. At bit number 3 of the Bloom filter, the value is 0, so x is definitely not in S.

3.3.3.2. Error analysis

With a filter two errors can occur:

 False positive : Filter check is there but actual search is not there

 False negative : Filter check is not available but actual search is available

With Bloom filter, there is no possibility of false negative error , but only a very small probability of false positive error . The probability of encountering false positive error depends on the density of bits with value 1 and the number of hash functions.

The probability that a random bit of an n-bit array is assigned the value 1 by the function

hash rate is 1/n, the probability that that bit is not assigned the value 1 is (1 −

1 ) . Probability for bit

that is not assigned the value 1 by any hash function is (1 −

1 ) 𝑘 where k is a number

𝑛

hash functions. Given m key values, the probability that that bit remains 0 is

(1 −

1 ) mk ,

and the probability that that bit is set to 1 is (1 − (1 − 1 ) mk ) .

Consider the process of checking whether an element is in a set or not. The algorithm checks k positions in the array to see if they are equal to 1 or not, with probability

] mk k −km/nk

all of them are equal to 1 is (1 − [1 −1) ≈ (1 − e ) for very large n. Verify

This probability does not depend on the element being tested and is therefore called the false positive probability. The false positive probability can be reduced by choosing appropriate values of n, k, and m. The value of n needs to be quite large compared to m, and this probability can be further reduced by increasing the number of

hash function. In the best case, the optimal value of k is k = n ln2 , then the probability

The false positive is minimized with respect to k and will be p = 2 −k = 2 −nln2/m => n =

− mlnp . That is, to keep the probability of false positive constant, the size of

(ln2) 2

The array is linear in the number of elements in the set.

3.3.4. Using Bloom filters to reduce communication costs

We will describe an algorithm that uses a Bloom filter to process a query with a path expression between two classes. Suppose class C i has an attribute a k which is an object of class C j , class C i is located at station S, class C j is located at station R. It is easy to see that the size of an object of class C j is smaller than the size of an object of class C i . Suppose there is a query containing a path expression from class C i to class C j . Let the cost of transmitting a unit of data from station S to station R be transCost(S, R), assuming this transmission cost is symmetric, that is, transCost (S, R) = transCost (R, S).

The main idea here is that instead of transmitting all the objects of C j from R to S for association with C i (or vice versa), we will filter some objects of C j that are necessary for association with C i . This will reduce the amount of data transmitted and, consequently, the cost of data transmission. The filtering mechanism will be implemented by a Bloom filter. There are three cases: (1) The query occurs at station S (which contains C i ). (2) The query occurs at station R (which contains C j ). (3) The query occurs at a station other than S and R. We will consider each case in turn.

Case 1 : Query at station S

- Build a Bloom filter for all objects of class C i according to attribute value a k , called BF(C i ).

- Send BF(C i ) to R (the station containing C j ), BF(C i ) is used to filter the objects of C j .

- Send the objects of C j that satisfy the above filter condition to S

- Execute queries from objects of C i and C j at S

Algorithm 3.1 - Case1(C i , C j , R, S)

Input:

- Class C i at station R and class C j at station S

- Attribute a k of class C i is an object of class C j

Output:

- result is the result of the join operation C i and layer C j at station S

Steps to follow:

BF(C i ) = createBloomFilter(C i ,a k ); send(BF(C i ),S, R);

tempClass = filter(C j , BF(C i )); send(tempClass, R, S);

result = join(C i , tempClass); end; {Algorithm 3.1}

Case 2: Query at station R

 Build a Bloom filter for all objects of class C j according to the value of the identifying attribute, called BF(C j ).

 Send BF(C j ) to S (the station containing C i ), BF(C j ) is used to filter the objects of C i .

 Send objects of C i that satisfy the above filter conditions to R

 Execute queries from objects of C i and C j at R

Algorithm 3.2 – Case2(C i , C j , R, S)

Input:

- Class C i at station R and class C j at station S

Output:

- result is the result of the join operation C i and layer C j at station R

Steps to follow:

BF(C j )= createBloomFilter(C i , oid(C j )); send(BF(C j ), R, S);

tempClass = filter(C i , BF(C i )); send(tempClass, S, R);

result = join(tempClass, C j ); end; {Algorithm 3.2}

Case 3 : The query at the station does not contain both C i and C j , let's say station P ( P ≠ S , P ≠ R )

 Compare the cost of transmitting a unit of data in 3 cases:

o (a) transCost (S, R) + transCost (R, P)

o (b) transCost (R, S) + transCost (S, P)

o (c) transCost (S, P) + transCost (R, P)

 If the minimum cost is case (a), perform the steps as in case 1, then transmit the result from S to P.

 If the minimum cost is case (b), perform the steps as in case 2, then transmit the result from R to P

 If the minimum cost is case (c)

o If transCost (S, P) <= transCost (R, P)

 Build a Bloom filter for all objects in class C i

according to the attribute value a k , called BF(C i ).

 Send BF(C i ) to P, then from P send BF(C i ) to R. BF(C i ) is used to filter the objects of C j .

Comment