| Academic Open Internet Journal |
Volume 12, 2004 |
A
Simple Algorithm for Minimal Search Data Index: Active Partitioned Data Search
Using
1 Research Scholar, Dept. of Computer Science & Engineering
2 HOD, Dept. of Computer Science & engineering
Email:rnalliah@Yahoo.co.in
Abstract:
In
this paper, we examine the issue of mining association rules among items of a
large database of search engines. The mining of association rules can be mapped
into the problem of discovering large item sets where a large item set is a
group of items which appear in a sufficient number of transactions. The problem
of discovering large itemsets can be solved by constructing an index set of
itemsets first and then, identifying, within the subdivided set. Generally this
is done iteratively for each large Lk itemsets are equally
partitioned into Sk subsets. To
determine the related data of a searching text each index itemset contains the
two candidate keys. One is moral text and another one is key that indicates the
subset name where the data is originally located. To address this issue, we
propose the parallel algorithm for “Minimal Search Data Index(MSDI)” based on
the Active Partition Data Search (APDS). Extensive simulation study is
conducted to evaluate performance of the proposed algorithm
1.1 Parallel Virtual Machine
The system called Parallel Virtual Machine
(PVM) allows a heterogeneous collection of workstations and computers to
construct a parallel processing environment. PVM is portable and runs on a wide
variety of modern systems. The PVM system permits a heterogeneous collection of
UNIX computers networked together to be viewed by a user program as a single
parallel environment with some distinctive features:
A PVM computational model is shown in Figure 1.1. The dotted
lines indicate integer component communication and synchronization, and the
solid lines indicate the inter-instance communication and synchronization.
Input and Partitioning



![]()
![]()

Fig 1.1: PVM Computational Model
PVM tasks may
possess arbitrary control and dependency structures. In other words, at any
point in the execution of a concurrent application, any task in existence may
start or stop other tasks or add or delete computers from the virtual machine.
Any process may communicate and/or synchronize with any other. Any specific
control and dependency structure may be implemented under the PVM system by
appropriate use of PVM constructs and host language control-flow statements.
Owing to its
ubiquitous nature (specifically, the virtual machine concept) and also because
of its simple but complete programming interface, the PVM system has gained
widespread acceptance in the high –performance scientific computing community.
/*database elements added into the PVM by means of n X n
matrix format */
foreach transactions rowi
Î
SIZE do begin /*SIZE may be
network capacity */
foreach transactions colj
Î SIZE
do begin
a[rowi][col j ]
= SIZE/20 /* divide by
means of packet size */
pvm_spawn(“testdrive”,(char**)0,pvmTestDefault,”“, NPROCS,task_id);
/* Call the test drive for
data distribution */
/*
NPROCS can be a SIZE of the subset */
rowi ++;
colj ++;
end
end
foreach roxi Î NPROCS do begin
/* Connection Establishment */
pvm_initsend
(PvmDataDefault);
/* Data type Implementation */
pvm_pkint
(&num_data,1,1);
/* Packing a specified data set */
pvm_pkint
(&a[num_data*rowi] [num_data*colj],num_data,1);
/* Ready for Data Transmission */
pvm_send (tast_id[rowi],1);
end
foreach rowi Î
NPROCS do begin
/*
Recovery of specified task */
pvm-recv (task_id[I],2);
/* unpack an item for reading */
pvm_upkint
(&result [rowi],1,1);
/* Area that is used to substitute a
parallel algorithm */
end
1.2 System study
When we deeply think about recent trends of information technology,
Database and its mining concept plays a very important role. From the bottom
level “shop” to top level research areas, there is a database everywhere.
But how to manage these huge
databases is a question. For that “Distributive Rules” of data mining give the
initial solution to use a candidate key instead of searching an entire
database.
Based on this idea, we proposed the
datamining “distributive rules” based algorithm called Minimal Search Data
Index (MSDI). In that algorithm Index itemset plays a very important role.
Various algorithms have been
proposed to discover the large itemsets. Generally, these algorithms first
construct a candidate set of large itemsets based on some heuristics, and then
discover the subset that indeed contains large itemsets.
In this paper, we propose an algorithm MSDI for time reduced data search. It is very efficient in the filed of very large database management system. Here APDS algorithm is the backbone connectivity for this MSDI algorithm.
Searching For: Applications of Data Mining
![]()
Bypassed Link Ref Key I Data Mining, DSIII Ref Key II Ref Key III Ref Key IV (25 lakhs Data) Eg. Data Mining (25 lakhs Data) (25 lakhs Data) (25 lakhs Data) (More
than 1Crore Data) Search
![]()
![]()
![]()
![]()
Data Set III
DATA INDEX
![]()
Data Set IV
Data Set II
Data Set I
![]()
![]()
![]()
Huge Database
![]()
![]()
Fig 1.2 MSDI Architecture
/*Ln= Very Large
Database Variable*/
/* Set all the database index point to BOF */
foreach data datasub Î SIZE do begin
forall transaction
ip Î
n do begin
check
for data occurrences
if (ip eq Li
) then do begin
Raise a flag variable with different signals
end
end
data
++;
end
/*Li = Node that is connected with flag variable
*/
if (flv=GREEN) then do begin
while (| Li[
Lq=Li++;
process (L); /* Overall Index Value
*/
forall transactions LqÎ flv do begin
if (Lq != ‘\0’) then do begin
s[Ldats ]=getc(
Ldats ++
end
end
Lqf = File name that is
red from the index page
fp=fopen(Lqf,”r”)
While((g=getc(fp))!=EOF)
Display the
Result Page
Case I :
while (| Li[
If exists, raise the flag as green,
else raise the red flag.
Open the
green flaged datasets and read it until the null value exists
if (Lq != ‘\0’)
Open the
file with read mode, display the contents
simultaneously in paging format.
Case 2:
Pascal Mode Generation
-------------------------------- (Data Mining)
Data set I
Moral Page
![]()
Moral Dataset Security DS2 Data Mining DS3 CSMA/CD DS4
Index Page
QAS DS4

CSMA/CD-----------QAS
------------------------------------------- Data Mining
---------------------------------------------------------

![]()
![]()
Data set IV
![]()
Data set III
Fig 1.3 .Busy Establishment of Candidate and Index key that
indicate suitable subset in a large database
Comparision Experiments
Values in Milli Seconds.
|
Subset |
MSDI |
APDS |
Dataset
Size |
|
S1
(Index
Page Generation) |
0.25
|
------ |
1,27,500
items |
|
S2 (Subset
2) |
----- |
0.26
|
-----Do-- |
|
S3 (Subset
3) |
------ |
0.252 |
-----
Do ---- |
|
S4 (Subset
4) |
0.13 |
0.252 |
|
Total
|
0.38 |
0.779 |
|

Fig 1.4 Comparison Chart
Above table shows the relative
performance between MSDI and APDS. Here, we use | T|=15, i.e., each transaction
has 15 double items on an average, so as to have more large itemsets in later
passes for interest of presentation. The execution time of these two algorithms
is shown in figure. In the phase 4 of both the algorithms are the same. Phase 4
partitioned the database from very large database. But the other phases are
entirely different when compared to MSDI with APDS. Every result may result in
different microseconds. The tabulation with the two different database sizes.
But when we go to the execution time, the time can be reduced in a large data
set when compared to small data set. As mentioned before, in the MSDI algorithm
has the options of switching from APDS to MSDI after early passes for better
performance, and such an option is not adopted here. It can, nevertheless, be
seen from figure that the execution time of the APDS is larger than the total
execution time by MSDI.It can be seen from table that the execution time of the
first pass of MSDI is additional step compared with APDS. However, MSDI incurs
significantly smaller execution times than APDS in later passes, not only in
the second pass through out the table.

The increased database sized graph
shows that the execution time of MSDI increases linearly as the database size
increases, meaning that MSDI possesses the same important feature as APDS.
Also, we examine the performance of MSDI as the number of items, N, increases.
Table shows the execution times of MSDI when the number of item increases from
1, 00,000 to 1, 50,000 for these data sets 1.14 msec and 0.94 msec respectively.
In other words, a large transaction has a larger likelihood of having large
itemsets to process than a small transaction. Also, given a fixed minimum
support, when the number of items N increases, the execution time to obtain L,
increases since the size of L1 is usually close to N1 but the execution time to
obtain larger k-itemsets decreases since the support for an itemset is averaged
out by more items and then decreases. Consequently, as shown in table, when N
increases, execution time for small transactions increase a little more
prominently than those for large transactions. Performance of MSDI when the
number of items increases
N
|
Index
|
Reference
|
|
1,90,000 |
0.12Msec |
0.75 Msec |
|
3,00,000 |
0.09Msec |
0.32 Msec |
|
4,50,000 |
0.06Msec |
0.14 Msec |
We examined in this paper, the issue
of mining distributive rules among various items in
a very large database of search engine transactions. The problem of handling
large itemsets was solved by constructing an indexed set and then subdivided
the large database into number of small scale database. We proposed the algorithm
MSDI is especially effective for the identification of specific item in a very
large database. We compare this algorithm with APDS algorithm and it concluded
that the MSDI algorithm is better than APDS in case of data items increases.
Extensive simulation study has been concluded to evaluate performance of the
proposed algorithm and that simulation guides “MSDI is the algorithm that battles
with delay”.
1) R. Agrawal, T. Imielinski, and A.
Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc.
1993 ACM SIGMOD Int’l Conf. Management of Data, pp. 207-216, Wahington, D.C.,
May 1993
2) R.
Agrawal and J.C. Shafer, “Parallel Mining of Association Rules: Design,
Implementation, and Experience,” IEEE Trans. Knowledge and Data
3) R.
Agrawal and R.Srikant, “Fast Algorithms for Mining Association Rules,” Proc.
1994 Int’lconf. Very Large Data Bases, pp. 487-499,
4) R.
Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995 Int’l Conf.
Data Eng., pp.265-276,Tucson,Ariz., May 1997.
5) S.
Brain, R. Motwani, and C.Silverstein, “Beyond Market Basket: Generalizing
Association Rules to Correlations,” Proc. 1997 ACM SIGNMOD Int’l Conf.
Management of Data, pp. 265-276, Tucson, Ariz., May 1997
6) S.
Chaudhuri and U. Dayal, “An Overview of Data Warehousing and PLAP
Technology,”AGM SIGMOD Record, vol. 26, pp. 65-74, 1997.
7) M.S.
Chen, J. Han, and P.S. Yu, “Data Mining: An overview from a Database
Perspective,” IEEETrans. Knowledge and Data Engg, Vol.8, PP.866-883, 1996.
8) D.W.
Cheung, J. Han, V. Ng,A. Fu, and Y. Fu, “A Fast Distributed Algorithm for
Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed
Information Systems, PP. 1996 Int’l Conf. Data Enf., PP. 106-114, New Orleans,
Feb. 1996.
9) Y.
Fu and J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast Distributed Algorithm for
Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed
Information Systems, PP. 31-44, Miami Beach, Fla.,Dec. 1996.
10) D.W.
Cheung, J. Han, V. Ng, and C.Y. Wong, “Maintenance of Discovered Association
Rules in Large Databases: An Incremental Updating Technique,” Proc. 1996
Int’l Conf, Data Engg. PP. 106-114,
Technical College - Bourgas,