Research on access control model for big data - 12



Figure 4-1: Crawler process

We wrote a short program that runs the stages sequentially and measures the duration of each. The timing pseudocode is as follows:

start_time = now()
Execute Inject
after_inject_time = now()
inject_duration = after_inject_time - start_time
for (i = 1; i <= depth; i++)
{
    // record information for depth = i
    before_generate_time = now()
    Execute Generate
    after_generate_time = now()
    generate_duration = after_generate_time - before_generate_time
    Execute Fetch
    after_fetch_time = now()
    fetch_duration = after_fetch_time - after_generate_time
    Execute Parse
    after_parse_time = now()
    parse_duration = after_parse_time - after_fetch_time
    Execute Update CrawlDB
    after_update_time = now()
    update_duration = after_update_time - after_parse_time
    // record the total number of URLs fetched at this depth
}
before_invert_time = now()
Execute Invert Links
after_invert_time = now()
invert_duration = after_invert_time - before_invert_time
Execute Index
after_index_time = now()
index_duration = after_index_time - after_invert_time
total_duration = after_index_time - start_time
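The measurement scheme above can be sketched as a runnable Python harness. The `run_stage` body is a placeholder (the real program invokes the corresponding Nutch stages); only the timing structure mirrors the pseudocode:

```python
import time

def run_stage(name):
    """Placeholder for invoking a real crawl stage (inject, generate, fetch, ...)."""
    time.sleep(0.01)  # stand-in for actual work

def timed(name, log):
    """Run one stage and accumulate its duration (in seconds) into the log."""
    start = time.time()
    run_stage(name)
    log[name] = log.get(name, 0.0) + (time.time() - start)

def crawl(depth):
    """Time every stage of the crawl, like the pseudocode above."""
    log = {}
    start_time = time.time()
    timed("inject", log)
    for i in range(1, depth + 1):
        # one crawl-loop iteration: generate -> fetch -> parse -> update CrawlDB
        for stage in ("generate", "fetch", "parse", "updatedb"):
            timed(stage, log)
    timed("invertlinks", log)
    timed("index", log)
    log["total"] = time.time() - start_time
    return log

durations = crawl(depth=3)
```

In the experiment the per-stage durations were written to a log file; here they are simply collected in a dictionary.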


Crawling was performed under the following conditions:

| Condition | Value | Description |
|---|---|---|
| Seed URL | http://hcm.24h.com.vn | Initial URL to start the crawl |
| Depth | 3 | Repeat the crawl loop 3 times |
| URL filter | +^http://([a-z0-9]*\.)*24h.com.vn/ | Only URLs in the 24h.com.vn domain are accepted |

Table 1: Experimental crawl conditions
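The URL-filter rule in Table 1 follows the Nutch regex-urlfilter convention: a leading "+" accepts URLs matching the pattern that follows, "-" rejects them. A minimal sketch of how such a rule restricts the crawl frontier (the `accept` helper is our illustration, not Nutch's actual implementation):

```python
import re

# Nutch-style rule: '+' accepts URLs matching the pattern that follows it.
RULE = r"^http://([a-z0-9]*\.)*24h\.com\.vn/"
pattern = re.compile(RULE)

def accept(url):
    """Return True if the URL stays inside the 24h.com.vn domain."""
    return pattern.match(url) is not None

accept("http://hcm.24h.com.vn/bong-da.html")  # inside the domain: accepted
accept("http://example.com/page.html")        # outside the domain: filtered out
```

Any outlink whose URL fails this test is dropped before the next Generate step, which is what keeps the crawl confined to the 24h.com.vn domain.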

We run this program in the two environments described below, saving the timing results to a log file for comparison and evaluation.

5.2.3.1 Stand-alone

Run the system in stand-alone mode on the machine is-aupelf04, with the hardware configuration described above.

5.2.3.2 Distributed


Run the system on a Hadoop cluster as follows:


Figure 4-2: Empirical model of distributed crawler

5.2.4 Results

The per-stage times are as follows (a ratio below 100% means the distributed run was faster):

| Implementation process | Stage | Stand-alone (seconds) (1) | Distributed (seconds) (2) | Ratio % (2)/(1) |
|---|---|---|---|---|
| Injection | | 33 | 180 | 545% |
| Crawl loop, depth=1 (URLs fetched = 1) | Generate | 54 | 256 | 474% |
| | Fetch | 150 | 345 | 230% |
| | Parse | 210 | 260 | 124% |
| | Update DB | 157 | 340 | 217% |
| | Total | 571 | 1201 | 210% |
| Crawl loop, depth=2 (URLs fetched = 149) | Generate | 183 | 301 | 164% |
| | Fetch | 2112 | 1683 | 80% |
| | Parse | 1385 | 1079 | 78% |
| | Update DB | 851 | 902 | 106% |
| | Total | 4531 | 3965 | 88% |
| Crawl loop, depth=3 (URLs fetched = 10390) | Generate | 2095 | 1982 | 95% |
| | Fetch | 27365 | 18910 | 69% |
| | Parse | 6307 | 3550 | 56% |
| | Update DB | 2381 | 2053 | 86% |
| | Total | 38184 | 26495 | 69% |
| Invert Links | | 2581 | 2275 | 88% |
| Index | | 3307 | 2557 | 77% |
| Total time | | 49171 | 36673 | 75% |

Table 2: Statistical results of the experimental crawl in stand-alone and distributed modes

Since raw second counts are hard to assess at this scale, we convert them to a more intuitive form:

| Implementation process | Stage | Stand-alone (1) | Distributed (2) | Ratio % (2)/(1) |
|---|---|---|---|---|
| Injection | | 33 seconds | 180 seconds | 545% |
| Crawl loop, depth=1 (URLs fetched = 1) | Generate | 54 seconds | 256 seconds | 474% |
| | Fetch | 150 seconds | 345 seconds | 230% |
| | Parse | 210 seconds | 260 seconds | 124% |
| | Update DB | 157 seconds | 340 seconds | 217% |
| | Total | 571 seconds | 1201 seconds | 210% |
| Crawl loop, depth=2 (URLs fetched = 149) | Generate | 183 seconds | 301 seconds | 164% |
| | Fetch | 35 minutes | 28 minutes | 80% |
| | Parse | 23 minutes | 18 minutes | 78% |
| | Update DB | 14 minutes | 15 minutes | 106% |
| | Total | 1 hour 16 minutes | 1 hour 6 minutes | 88% |
| Crawl loop, depth=3 (URLs fetched = 10390) | Generate | 35 minutes | 33 minutes | 95% |
| | Fetch | 7 hours 36 minutes | 5 hours 15 minutes | 69% |
| | Parse | 1 hour 45 minutes | 59 minutes | 56% |
| | Update DB | 40 minutes | 34 minutes | 86% |
| | Total | 10 hours 36 minutes | 7 hours 22 minutes | 69% |
| Invert Links | | 43 minutes | 38 minutes | 88% |
| Index | | 55 minutes | 41 minutes | 77% |
| Total time | | 13 hours 40 minutes | 10 hours 11 minutes | 75% |

Table 3: Statistical results of the experimental crawl in stand-alone and distributed modes, in more intuitive units

(Note: all stage times were measured in seconds; for readability the longer durations have been converted to minutes and hours, with insignificant fractions rounded off.)
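The unit conversion in Table 3 and the (2)/(1) ratio used in both tables can be reproduced with a few lines of Python (a small helper of ours, not part of the experimental program):

```python
def human(seconds):
    """Convert a duration in seconds to 'H hours M minutes', rounding the minutes."""
    h, rem = divmod(round(seconds), 3600)
    m = round(rem / 60)
    if m == 60:          # rounding pushed the minutes up to a full hour
        h, m = h + 1, 0
    if h == 0:
        return f"{m} minutes"
    return f"{h} hours {m} minutes"

def ratio(distributed, standalone):
    """Percentage (2)/(1) as used in Tables 2 and 3."""
    return round(100 * distributed / standalone)

# Depth=3 crawl-loop totals from Table 2:
human(38184)        # stand-alone total
human(26495)        # distributed total
ratio(26495, 38184) # distributed time as a percentage of stand-alone time
```

Applying these helpers to the depth=3 row of Table 2 reproduces the "10 hours 36 minutes", "7 hours 22 minutes" and 69% entries of Table 3.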

5.2.5 Evaluation

We see that in stages where the amount of data to be processed is small, stand-alone execution is faster than distributed execution with MapReduce.

As the amount of data grows, as in the crawl loops at depth=2 and depth=3, the distributed system gradually overtakes local processing. For example, at depth=2 the total crawl-loop time of the distributed system is 88% of the local system's; at depth=3, with more than 10,000 URLs to process, the distributed run takes only 69% of the local time. This is consistent with the theory: MapReduce and HDFS are better suited to processing and storing large blocks of data.

The total crawling time on the distributed system is 75% of the crawling time on a single machine, demonstrating the benefit of applying distributed computing to search engines (crawling and indexing).

5.2.6 Conclusion

The results were not as strong as expected because the amount of data processed was still small. Crawling deeper (increasing the depth) would probably have improved them further. However, deeper crawling failed for the following reason: the number of URLs to process at depth=4 grew considerably, stretching the execution time to several days, and after the system had been running for a day or two some machines reset (probably due to power fluctuations or a loose power cord).

5.3 Experimental search on the index set

5.3.1 Data sample

The data was obtained by crawling 10 major Vietnamese newspaper websites to a depth of 4:

- Number of web pages fetched and indexed: 104,000
- Data size: 2.5 GB
- Crawling time (fetch + parse + index): 3 days

5.3.2 Hardware

The experimental hardware consists of the machines is-aupelf04, is-teacher02 and is-teacher06, with the configurations described above.

5.3.3 Implementation method

5.3.3.1 Search locally (stand-alone mode)

The data is placed entirely on the local file system of the machine is-aupelf04.

5.3.3.2 Search on HDFS

The data is placed on the distributed file system with the following configuration: namenode is-aupelf04; datanodes is-teacher02 and is-teacher06.
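Under the Hadoop generation used in the thesis, this layout would be expressed roughly as follows (the port number is an illustrative assumption; only the host names come from the experiment):

```xml
<!-- conf/core-site.xml on every node: the namenode runs on is-aupelf04 -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://is-aupelf04:9000</value>
  </property>
</configuration>
```

with the datanodes listed, one per line, in conf/slaves:

```
is-teacher02
is-teacher06
```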

5.3.3.3 Split the data and distribute it to the search servers

The data sample is split into two equal, non-intersecting parts (the two sub-samples share no URL) and distributed to the two search servers is-teacher02 and is-teacher06, port 2010.
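The split described above can be sketched as hash-based sharding: each URL is assigned to exactly one of the two search servers, so the sub-indexes are disjoint, and a front-end fans each query out to both shards and merges the hits. The server names match the experiment; everything else (the hash choice, the substring "search") is illustrative:

```python
import hashlib

SERVERS = ["is-teacher02:2010", "is-teacher06:2010"]

def shard_of(url):
    """Assign a URL to exactly one shard, so the sub-indexes never intersect."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return digest[0] % len(SERVERS)

def partition(urls):
    """Split the document set into one disjoint list per search server."""
    shards = [[] for _ in SERVERS]
    for url in urls:
        shards[shard_of(url)].append(url)
    return shards

def search(query, shards):
    """Fan the query out to every shard and merge the hits (toy substring match)."""
    hits = []
    for shard in shards:
        hits.extend(u for u in shard if query in u)
    return hits
```

Because the shards are disjoint, merging is a simple concatenation: no hit can appear twice, and each server only searches roughly half of the index.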

5.3.4 Query execution result table



| Query | HDFS time | Local time | Search-server time | HDFS / Local | Search servers / Local | Number of results |
|---|---|---|---|---|---|---|
| "human" | 3762 | 398 | 205 | 945% | 52% | 6503 |
| "football" | 5003 | 443 | 411 | 1129% | 93% | 35258 |
| "music" | 1329 | 211 | 194 | 630% | 92% | 16346 |
| "sport" | 3137 | 306 | 304 | 1025% | 99% | 51650 |
| "society" | 1184 | 205 | 200 | 578% | 98% | 13922 |
| "author" | 977 | 233 | 165 | 428% | 71% | 6431 |
| "topic" | 1524 | 168 | 181 | 907% | 108% | 1908 |
| "family" | 1536 | 237 | 272 | 648% | 115% | 18944 |
| "information system" | 8138 | 462 | 391 | 1761% | 85% | 127 |
| "organization" | 4053 | 189 | 193 | 2144% | 102% | 16649 |
| "traffic accident" | 5669 | 221 | 212 | 2565% | 96% | 1663 |
| "love" + "family" | 4672 | 301 | 309 | 1552% | 103% | 7087 |
| "security and order" | 1495 | 197 | 260 | 759% | 132% | 115 |
| "life" | 1211 | 155 | 162 | 781% | 105% | 5261 |
| "cook" | 429 | 81 | 69 | 530% | 85% | 1584 |
| "culture" | 1246 | 163 | 161 | 764% | 99% | 13167 |
| "tourist destination" | 4003 | 456 | 312 | 878% | 68% | 41 |
| "rules" | 958 | 165 | 130 | 581% | 79% | 209 |
| "criminal" | 5038 | 313 | 268 | 1865% | 86% | 15149 |
| "police" | 1959 | 317 | 182 | 618% | 57% | 3656 |
| "traffic safety" | 3915 | 188 | 141 | 2082% | 75% | 223 |
| "food hygiene" | 3129 | 327 | 411 | 957% | 126% | 130 |
| "company" | 1493 | 184 | 131 | 811% | 71% | 30591 |
| "individual" | 1309 | 226 | 173 | 579% | 77% | 7112 |
| "entertainment" | 1970 | 227 | 185 | 868% | 81% | 22327 |
| "children" | 1627 | 198 | 163 | 822% | 82% | 6071 |
| "education" | 4124 | 190 | 96 | 2171% | 51% | 23190 |
| "market moves" | 2523 | 177 | 153 | 1425% | 86% | 1045 |
| "image" | 2715 | 200 | 164 | 1358% | 82% | 1045 |
| "star" | 1510 | 233 | 163 | 648% | 70% | 19515 |
| "college entrance exam" | 6442 | 341 | 219 | 1889% | 64% | 1997 |
| "recruitment" | 1440 | 128 | 105 | 1125% | 82% | 8747 |
| "stock market" | 2553 | 138 | 135 | 1850% | 98% | 722 |
| "online game" | 726 | 184 | 186 | 395% | 101% | 3328 |

Table 4: Query execution results

5.3.5 Evaluation

The results show that searching an index set stored directly on HDFS is entirely unsuitable: execution times are many times longer than local execution, just as the theory predicts.

Most queries ran faster on the distributed search servers than centrally on a single machine, some nearly twice as fast. However, because the dataset is still small, these results are not yet conclusive.

Conclusion: Distributing the index and the search across search servers increased search speed compared to performing the search on a single machine.
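The "rate compared to local" columns in Table 4 are simply each execution time divided by the local time. Taking the first row as an example:

```python
def rate_vs_local(t, t_local):
    """Execution time as a percentage of the local execution time, as in Table 4."""
    return round(100 * t / t_local)

# Row for the query "human": HDFS 3762, Local 398, Search servers 205
rate_vs_local(3762, 398)  # HDFS vs local
rate_vs_local(205, 398)   # distributed search servers vs local
```

Values above 100% mean the configuration is slower than local search; values below 100% mean it is faster.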

5.4 Conclusion, applications and development directions

5.4.1 Results achieved

After several months of effort, the thesis has achieved the following results:

- … and models for big data.
- … on data access control, especially access control for big data.
- … the Hadoop core: HDFS and the MapReduce engine.


5.4.2 Applications

- … MapReduce with Hadoop.

- Kis Securities deployed IBM's mid-range data storage solution to enhance its data storage and processing capabilities.

- ACB Bank built a modular data center, applying IBM business-analytics solutions to process large blocks of data.

- Intel is currently supporting Da Nang city in deploying big-data solutions: turning the Da Nang data center into a green data center built on cloud computing technology and implementing pilot projects (POC, proof of concept), with Intel leading POCs on resource management. The Intel data center will continue to help Da Nang establish an open-standard data center that connects all data systems in the area, serving state and business administration and delivering public services to citizens and organizations over modern network technology.
