Research on access control model for big data - 12



Figure 4-1: Crawler process

We wrote a short program that runs the stages sequentially and measures the duration of each. The timing pseudocode is as follows:

start_time = now()
Execute Inject
after_inject_time = now()
inject_duration = after_inject_time - start_time
for (i = 1; i <= depth; i++)
{
    // record information for depth = i
    before_generate_time = now()
    Execute Generate
    after_generate_time = now()
    generate_duration = after_generate_time - before_generate_time
    Execute Fetch
    after_fetch_time = now()
    fetch_duration = after_fetch_time - after_generate_time
    Execute Parse
    after_parse_time = now()
    parse_duration = after_parse_time - after_fetch_time
    Execute Update CrawlDB
    after_update_time = now()
    update_duration = after_update_time - after_parse_time
    // record the total number of URLs fetched at this depth
}
before_invert_time = now()
Execute Invert Links
after_invert_time = now()
invert_duration = after_invert_time - before_invert_time
Execute Index
after_index_time = now()
index_duration = after_index_time - after_invert_time
total_duration = after_index_time - start_time
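The measurement scheme above can be sketched as a runnable Python harness. The `run_stage` body is a placeholder (the real program invokes the corresponding Nutch stages); only the timing structure mirrors the pseudocode:

```python
import time

def run_stage(name):
    """Placeholder for invoking a real crawl stage (inject, generate, fetch, ...)."""
    time.sleep(0.01)  # stand-in for actual work

def timed(name, log):
    """Run one stage and accumulate its duration (in seconds) into the log."""
    start = time.time()
    run_stage(name)
    log[name] = log.get(name, 0.0) + (time.time() - start)

def crawl(depth):
    """Time every stage of the crawl, like the pseudocode above."""
    log = {}
    start_time = time.time()
    timed("inject", log)
    for i in range(1, depth + 1):
        # one crawl-loop iteration: generate -> fetch -> parse -> update CrawlDB
        for stage in ("generate", "fetch", "parse", "updatedb"):
            timed(stage, log)
    timed("invertlinks", log)
    timed("index", log)
    log["total"] = time.time() - start_time
    return log

durations = crawl(depth=3)
```

In the experiment the per-stage durations were written to a log file; here they are simply collected in a dictionary.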


Crawling was performed under the following conditions:

| Condition | Value | Description |
|---|---|---|
| Seed URL | http://hcm.24h.com.vn | Initial URL to start the crawl |
| Depth | 3 | Repeat the crawl loop 3 times |
| URL filter | +^http://([a-z0-9]*\.)*24h.com.vn/ | Only URLs in the 24h.com.vn domain are accepted |

Table 1: Experimental crawl conditions
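The URL-filter rule in Table 1 follows the Nutch regex-urlfilter convention: a leading "+" accepts URLs matching the pattern that follows, "-" rejects them. A minimal sketch of how such a rule restricts the crawl frontier (the `accept` helper is our illustration, not Nutch's actual implementation):

```python
import re

# Nutch-style rule: '+' accepts URLs matching the pattern that follows it.
RULE = r"^http://([a-z0-9]*\.)*24h\.com\.vn/"
pattern = re.compile(RULE)

def accept(url):
    """Return True if the URL stays inside the 24h.com.vn domain."""
    return pattern.match(url) is not None

accept("http://hcm.24h.com.vn/bong-da.html")  # inside the domain: accepted
accept("http://example.com/page.html")        # outside the domain: filtered out
```

Any outlink whose URL fails this test is dropped before the next Generate step, which is what keeps the crawl confined to the 24h.com.vn domain.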

We run this program in the two environments described below, saving the timing results to a log file for comparison and evaluation.

5.2.3.1 Stand-alone

Run the system in stand-alone mode on the machine is-aupelf04, with the hardware configuration described above.

5.2.3.2 Distributed


Run the system on a Hadoop cluster as follows:


Figure 4-2: Empirical model of distributed crawler

5.2.4 Results

The per-stage times are as follows (a ratio below 100% means the distributed run was faster):

| Implementation process | Stage | Stand-alone (seconds) (1) | Distributed (seconds) (2) | Ratio % (2)/(1) |
|---|---|---|---|---|
| Injection | | 33 | 180 | 545% |
| Crawl loop, depth=1 (URLs fetched = 1) | Generate | 54 | 256 | 474% |
| | Fetch | 150 | 345 | 230% |
| | Parse | 210 | 260 | 124% |
| | Update DB | 157 | 340 | 217% |
| | Total | 571 | 1201 | 210% |
| Crawl loop, depth=2 (URLs fetched = 149) | Generate | 183 | 301 | 164% |
| | Fetch | 2112 | 1683 | 80% |
| | Parse | 1385 | 1079 | 78% |
| | Update DB | 851 | 902 | 106% |
| | Total | 4531 | 3965 | 88% |
| Crawl loop, depth=3 (URLs fetched = 10390) | Generate | 2095 | 1982 | 95% |
| | Fetch | 27365 | 18910 | 69% |
| | Parse | 6307 | 3550 | 56% |
| | Update DB | 2381 | 2053 | 86% |
| | Total | 38184 | 26495 | 69% |
| Invert Links | | 2581 | 2275 | 88% |
| Index | | 3307 | 2557 | 77% |
| Total time | | 49171 | 36673 | 75% |

Table 2: Statistical results of the experimental crawl in stand-alone and distributed modes

Since raw second counts are hard to assess at this scale, we convert them to a more intuitive form:

| Implementation process | Stage | Stand-alone (1) | Distributed (2) | Ratio % (2)/(1) |
|---|---|---|---|---|
| Injection | | 33 seconds | 180 seconds | 545% |
| Crawl loop, depth=1 (URLs fetched = 1) | Generate | 54 seconds | 256 seconds | 474% |
| | Fetch | 150 seconds | 345 seconds | 230% |
| | Parse | 210 seconds | 260 seconds | 124% |
| | Update DB | 157 seconds | 340 seconds | 217% |
| | Total | 571 seconds | 1201 seconds | 210% |
| Crawl loop, depth=2 (URLs fetched = 149) | Generate | 183 seconds | 301 seconds | 164% |
| | Fetch | 35 minutes | 28 minutes | 80% |
| | Parse | 23 minutes | 18 minutes | 78% |
| | Update DB | 14 minutes | 15 minutes | 106% |
| | Total | 1 hour 16 minutes | 1 hour 6 minutes | 88% |
| Crawl loop, depth=3 (URLs fetched = 10390) | Generate | 35 minutes | 33 minutes | 95% |
| | Fetch | 7 hours 36 minutes | 5 hours 15 minutes | 69% |
| | Parse | 1 hour 45 minutes | 59 minutes | 56% |
| | Update DB | 40 minutes | 34 minutes | 86% |
| | Total | 10 hours 36 minutes | 7 hours 22 minutes | 69% |
| Invert Links | | 43 minutes | 38 minutes | 88% |
| Index | | 55 minutes | 41 minutes | 77% |
| Total time | | 13 hours 40 minutes | 10 hours 11 minutes | 75% |

Table 3: Statistical results of the experimental crawl in stand-alone and distributed modes, in more intuitive units

(Note: all stage times were measured in seconds; for readability the longer durations have been converted to minutes and hours, with insignificant fractions rounded off.)
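The unit conversion in Table 3 and the (2)/(1) ratio used in both tables can be reproduced with a few lines of Python (a small helper of ours, not part of the experimental program):

```python
def human(seconds):
    """Convert a duration in seconds to 'H hours M minutes', rounding the minutes."""
    h, rem = divmod(round(seconds), 3600)
    m = round(rem / 60)
    if m == 60:          # rounding pushed the minutes up to a full hour
        h, m = h + 1, 0
    if h == 0:
        return f"{m} minutes"
    return f"{h} hours {m} minutes"

def ratio(distributed, standalone):
    """Percentage (2)/(1) as used in Tables 2 and 3."""
    return round(100 * distributed / standalone)

# Depth=3 crawl-loop totals from Table 2:
human(38184)        # stand-alone total
human(26495)        # distributed total
ratio(26495, 38184) # distributed time as a percentage of stand-alone time
```

Applying these helpers to the depth=3 row of Table 2 reproduces the "10 hours 36 minutes", "7 hours 22 minutes" and 69% entries of Table 3.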

5.2.5 Evaluation

We see that in stages where the amount of data to be processed is small, stand-alone execution is faster than distributed execution with MapReduce.

As the amount of data grows, as in the crawl loops at depth=2 and depth=3, the distributed system gradually overtakes local processing. For example, at depth=2 the total crawl-loop time of the distributed system is 88% of the local system's; at depth=3, with more than 10,000 URLs to process, the distributed run takes only 69% of the local time. This is consistent with the theory: MapReduce and HDFS are better suited to processing and storing large blocks of data.

The total crawling time on the distributed system is 75% of the crawling time on a single machine, demonstrating the benefit of applying distributed computing to search engines (crawling and indexing).

5.2.6 Conclusion

The results were not as strong as expected because the amount of data processed was still small. Crawling deeper (increasing the depth) would probably have improved them further. However, deeper crawling failed for the following reason: the number of URLs to process at depth=4 grew considerably, stretching the execution time to several days, and after the system had been running for a day or two some machines reset (probably due to power fluctuations or a loose power cord).

5.3 Experimental search on the index set

5.3.1 Data sample

The data was obtained by crawling 10 major Vietnamese newspaper websites to a depth of 4:

- Number of web pages fetched and indexed: 104,000
- Data size: 2.5 GB
- Crawling time (fetch + parse + index): 3 days

5.3.2 Hardware

The experimental hardware consists of the machines is-aupelf04, is-teacher02 and is-teacher06, with the configurations described above.

5.3.3 Implementation method

5.3.3.1 Search locally (stand-alone mode)

The data is placed entirely on the local file system of the machine is-aupelf04.

5.3.3.2 Search on HDFS

The data is placed on the distributed file system with the following configuration: namenode is-aupelf04; datanodes is-teacher02 and is-teacher06.
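Under the Hadoop generation used in the thesis, this layout would be expressed roughly as follows (the port number is an illustrative assumption; only the host names come from the experiment):

```xml
<!-- conf/core-site.xml on every node: the namenode runs on is-aupelf04 -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://is-aupelf04:9000</value>
  </property>
</configuration>
```

with the datanodes listed, one per line, in conf/slaves:

```
is-teacher02
is-teacher06
```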

5.3.3.3 Split the data and distribute it to the search servers

The data sample is split into two equal, non-intersecting parts (the two sub-samples share no URL) and distributed to the two search servers is-teacher02 and is-teacher06, port 2010.
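The split described above can be sketched as hash-based sharding: each URL is assigned to exactly one of the two search servers, so the sub-indexes are disjoint, and a front-end fans each query out to both shards and merges the hits. The server names match the experiment; everything else (the hash choice, the substring "search") is illustrative:

```python
import hashlib

SERVERS = ["is-teacher02:2010", "is-teacher06:2010"]

def shard_of(url):
    """Assign a URL to exactly one shard, so the sub-indexes never intersect."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return digest[0] % len(SERVERS)

def partition(urls):
    """Split the document set into one disjoint list per search server."""
    shards = [[] for _ in SERVERS]
    for url in urls:
        shards[shard_of(url)].append(url)
    return shards

def search(query, shards):
    """Fan the query out to every shard and merge the hits (toy substring match)."""
    hits = []
    for shard in shards:
        hits.extend(u for u in shard if query in u)
    return hits
```

Because the shards are disjoint, merging is a simple concatenation: no hit can appear twice, and each server only searches roughly half of the index.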

5.3.4 Query execution result table



| Query | HDFS time | Local time | Search-server time | HDFS / Local | Search servers / Local | Number of results |
|---|---|---|---|---|---|---|
| "human" | 3762 | 398 | 205 | 945% | 52% | 6503 |
| "football" | 5003 | 443 | 411 | 1129% | 93% | 35258 |
| "music" | 1329 | 211 | 194 | 630% | 92% | 16346 |
| "sport" | 3137 | 306 | 304 | 1025% | 99% | 51650 |
| "society" | 1184 | 205 | 200 | 578% | 98% | 13922 |
| "author" | 977 | 233 | 165 | 428% | 71% | 6431 |
| "topic" | 1524 | 168 | 181 | 907% | 108% | 1908 |
| "family" | 1536 | 237 | 272 | 648% | 115% | 18944 |
| "information system" | 8138 | 462 | 391 | 1761% | 85% | 127 |
| "organization" | 4053 | 189 | 193 | 2144% | 102% | 16649 |
| "traffic accident" | 5669 | 221 | 212 | 2565% | 96% | 1663 |
| "love" + "family" | 4672 | 301 | 309 | 1552% | 103% | 7087 |
| "security and order" | 1495 | 197 | 260 | 759% | 132% | 115 |
| "life" | 1211 | 155 | 162 | 781% | 105% | 5261 |
| "cook" | 429 | 81 | 69 | 530% | 85% | 1584 |
| "culture" | 1246 | 163 | 161 | 764% | 99% | 13167 |
| "tourist destination" | 4003 | 456 | 312 | 878% | 68% | 41 |
| "rules" | 958 | 165 | 130 | 581% | 79% | 209 |
| "criminal" | 5038 | 313 | 268 | 1865% | 86% | 15149 |
| "police" | 1959 | 317 | 182 | 618% | 57% | 3656 |
| "traffic safety" | 3915 | 188 | 141 | 2082% | 75% | 223 |
| "food hygiene" | 3129 | 327 | 411 | 957% | 126% | 130 |
| "company" | 1493 | 184 | 131 | 811% | 71% | 30591 |
| "individual" | 1309 | 226 | 173 | 579% | 77% | 7112 |
| "entertainment" | 1970 | 227 | 185 | 868% | 81% | 22327 |
| "children" | 1627 | 198 | 163 | 822% | 82% | 6071 |
| "education" | 4124 | 190 | 96 | 2171% | 51% | 23190 |
| "market moves" | 2523 | 177 | 153 | 1425% | 86% | 1045 |
| "image" | 2715 | 200 | 164 | 1358% | 82% | 1045 |
| "star" | 1510 | 233 | 163 | 648% | 70% | 19515 |
| "college entrance exam" | 6442 | 341 | 219 | 1889% | 64% | 1997 |
| "recruitment" | 1440 | 128 | 105 | 1125% | 82% | 8747 |
| "stock market" | 2553 | 138 | 135 | 1850% | 98% | 722 |
| "online game" | 726 | 184 | 186 | 395% | 101% | 3328 |

Table 4: Query execution results

5.3.5 Evaluation

The results show that searching an index set stored directly on HDFS is entirely unsuitable: execution times are many times longer than local execution, just as the theory predicts.

Most queries ran faster on the distributed search servers than centrally on a single machine, some nearly twice as fast. However, because the dataset is still small, these results are not yet conclusive.

Conclusion: Distributing the index and the search across search servers increased search speed compared to performing the search on a single machine.
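The "rate compared to local" columns in Table 4 are simply each execution time divided by the local time. Taking the first row as an example:

```python
def rate_vs_local(t, t_local):
    """Execution time as a percentage of the local execution time, as in Table 4."""
    return round(100 * t / t_local)

# Row for the query "human": HDFS 3762, Local 398, Search servers 205
rate_vs_local(3762, 398)  # HDFS vs local
rate_vs_local(205, 398)   # distributed search servers vs local
```

Values above 100% mean the configuration is slower than local search; values below 100% mean it is faster.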

5.4 Conclusion, applications and development directions

5.4.1 Results achieved

After several months of effort, the thesis has achieved the following results:

- … and models for big data.
- … on data access control, especially access control for big data.
- … the Hadoop core: HDFS and the MapReduce engine.


5.4.2 Applications

- … MapReduce with Hadoop.

- Kis Securities deployed IBM's mid-range data storage solution to enhance its data storage and processing capabilities.

- ACB Bank built a modular data center, applying IBM business-analytics solutions to process large blocks of data.

- Intel is currently supporting Da Nang city in deploying big-data solutions: turning the Da Nang data center into a green data center built on cloud computing technology and implementing pilot projects (POC, proof of concept), with Intel leading POCs on resource management. The Intel data center will continue to help Da Nang establish an open-standard data center that connects all data systems in the area, serving state and business administration and delivering public services to citizens and organizations over modern network technology.
