| Academic Open Internet Journal |
Volume 12, 2004 |
Dynamic Data Fetch: An Efficient Algorithm
for Query processing in large databases using Association Rules
1. Research Scholar, Department of Computer
Science & Engineering
2. Prof. and Head, Department of Computer
Science & Engineering
3. Student, Department of Electrical & Electronics
Engineering
Email: rnalliah@yahoo.co.in1,
Karthikv2k@yahoo.co.uk3
Abstract:
Association
rules are used extensively for data mining in many domains such as wholesale,
retail marketing, lottery selection etc., and all the domains which is dealing
with more number of data. Direction based large relationships are critical in
many domains, including geographic information system (GIS) and image
interpretation. They are also frequently used as selection conditions in large
queries. In this paper, we explore the processing of large queries and propose
and describe five basic data formulations to reduce the redundancy. The forward
path cross data fetch (FPCDF), the diagonal cross random data fetch (DCRDF),
the upper/lower diagonal cross random data fetch (U/LDCRDF) and the data merge
and fetch (DMF) respectively.
Keywords: Data mining, Association rule, query
processing, query merging.
1. INTRODUCTION:
Mining of association rules from large data sets has been focused topic in recent data mining research [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]. Many applications at mining associations require that mining be performed at multiple levels of abstraction. For example, the association relationship in different databases is expressed at a lower level of abstraction without duplication but carries more specific and concrete information. However, in many applications, it is unlikely that a large number of clients will issue exactly the same query. In this paper, we present algorithms for efficient use of query for such applications. We achieve this by considering merging queries with answers that overlap significantly. By merging these queries, the server has to process fewer queries and the amount of data or information may be reduced. We present a frame work for studying query merging, processing and its costs. We present a variety of algorithms for merging, some optimal and some heuristic.
2. PROBLEM
SPECIFICATION:
In this section,
we formalize the query processing problem by presenting a conceptual model. The
objective of our approach is to reduce the redundancy of a set of query
subscriptions made by clients to a server. We attempt to reduce the cost by
finding a different set of queries, with lower processing and transmission
costs, from which the clients can drive the answers to their original queries.
2.1 .CONCEPTUAL
MODEL:
The conceptual
model for a query subscription is shown in figure 1. In this model, we have a
set of clients C= {C1…..Cn}, that require
information. The information need of Ci is described
by a set of subscriptions. Each subscription consists of a query and its
timing, Requirements. For simplicity, we assume that all subscriptions have
identical timing requirements. Thus, we can view the subscriptions have
identical timing requirements.

Thus we can view the subscriptions of client Ci simply as a set of queries Qi. We will assume that Qi is relatively small but that its queries will be processed periodically (and answers set to the client) over a long period of time. We call Q the set of all queries received by the server .Clients send their queries to a server. The server periodically processes the queries against a database, and sends answers to the clients. Before processing queries the server runs a merge algorithm that combines”similar”queries. The output of the merge process is a collection M= {Mi} where each Mi contains the queries that are merged. The queries in each Mi are merged into a single query, mrg (Mi).we use ans (q) to represent the answer to query q. Thus, the server generates ans (mrg (Mi)) for each Mi in M.For completeness, we require that UiQi=UiMi. Similarly, we require that ans (q) ≤ans (mrg (Mi)) for every qεMi.
We call the difference between
the answer to the merged query sent to a client, ans (mrg (Mi)), and the
original query, ans (q), the irrelevant information for q sent to the client. To
illustrate these concepts ,say client C1
submits queries Q1={x,y}and c2 submits Q2= {z}.The server may merge them into M1={x,z}
and M2={y}.Then, the server runs mrg(M1) and mrg(M2) against the database and
generates A1 = ans(mrg(M1)) and
A2=ans(mrg(M2)). Note that A1 needs to be sent to both C1 and C2 and
that each must apply an extractor to be sent to obtain the desired answer. For example, the extractor, e, that C1 applies
to A1 should yield e (A1) =ans(x). Thus, when the server sends A1 out, it must
include a header containing information.
1. A list of clients that should receive A1.
2. For each such client c, one or more pairs (e, q), where e is an extractor and q is a query identifier. The extractor e is what client C needs to apply to obtain the answer to its original query q.
Note that more than one (e, q) pair is needed if multiple c queries are involved in A1.If clients do not need to know what queries generated answers, and then the query identifiers are not required. In our example, the information sent with A1 would be C1: (ex, x) and C2 :( ez, z). Client C1 then applies ex (A1) to obtain its answer to query x while C2 applies ez (A1) to obtain its answer. For example, the server could tag each individual answer object with the identifier of the query that generates the object, or with the identifier of the client should receive the object. Then, each extractor only needs to look for the appropriate tags. In some cases, the extractor for a query is the original query itself. In particular, this happens when queries only have selections and projections. In any case, the extractor needs to be installed on the client machine. Note that our basic model does not specify what kind of network is used to send queries to the server and answers to the clients. In the next section, we will introduce a parallel virtual machine in to the model.
2.2. THE PARALLEL
VIRTUAL MACHINE:
The system called parallel virtual machine (PVM)
allows a heterogeneous collection of workstations and computers to construct a
parallel processing environment. PVM is portable and runs on a wide variety of
modern systems. The PVM system permits a heterogeneous collection of UNIX
computers networked together to be viewed by a user program as a single
parallel environment with some distinctive features:
A PVM
computational model shown in Figure 1.1. The dashed arrows indicate integer
component communication and synchronization, and the solid lines represent the
inter-instance communication and synchronization. A general paradigm for
paradigm for program execution in the PVM system has the following steps:

Figure 1.1 PVM Computational Model
2.3 The merging
procedure for student database queries
We consider the database to be a
single relation R that has position attributes (e.g, “Roll no” and “Name”), as
well as other attributes describing that position .A student query has the
form.
I (C1≤Roll no≤C3)
Λ (C2≤Name≤C4) R
In figure (2) forward path cross data fetch (FPCDF), we
illustrate the merge procedure that can be used in this student database
queries in to a single selection query. Additionally queries in to a single
selection query. Additionally, it is easy to extract the answers to the
original queries from the answers to the merged queries. There are other
possible merge procedures for the student database query scenario. Figure (3) shows
the reverse path cross data fetch (RDCDF). This procedure also generates a
single merged query. But the query may have the way of collecting a data is
entirely different from other techniques (gathering data from room no. ‘n’ to
room no. 1). Although, the merge query contains less irrelevant information we
can again use the original query as the extraction function for this merging procedure.
Figure (4) shows Diagonal cross random data fetches (DC RDF) a merge procedure
that completely eliminates irrelevant information. However, four “merged”
queries are generated. A client implementing the extraction function for this
merging procedure needs to combine the answers to the four merged queries in
order to find the answer to the original query.
2.4 The Multiple queries
merging problem for student database queries
The multiple query merging problem is
the special case of the query merging problem when │Q│=2 or
more. Our student database query example
is convenient for illustrating why the multiple query merge problem is simple,
but why it is hard for more than three queries. In the multiple queries merging
problems we want to decide if it is worthwhile to merge two queries q1 and q2
in to a merged query q3. . For compactness in the following discussion, let us
denote size (qi) as si. Therefore, the cost of processing and transmitting
queries q1 and q2 separating will be KM+KT.S1 and KM +KT .S2, respectively, for
a total cost of 2 KM +KT (S1+S2). If we merge the queries into a single query
q3, the total cost will be
KM + KT.S3 +KU.U (Q, M), where U (Q,
M) =2.S3-S1-S2. We derive the U (Q, M) term in the following way: If we send a
message with only ans (q1), the client receives an answer with size S1.If we
send a message with ans (q3) instead, the client will receive an answer of size
S3. The difference (S3-S1) is the size of the irrelevant results received by
the client. We can use a similar derivation for the other client and conclude
that the size of the irrelevant information is (S3-S2).Therefore the total size
of irrelevant information is U (Q, M) =2.S3-S1-S2.Unfortunately, the general
problem (│Q│>2) is significantly harder, since these are, many
ways to combine a set of queries into a merged queries. For example, it we have
three or more queries as inputs, it could be the case that it is not worthwhile
merging any pair of them, but it is worthwhile merging the three or four
queries in to a single query. On the
other hand, it could be the case that it is worthwhile merging one specific
pair, but not the other pair and not the three or four queries. In conclusion,
we would have to consider all possible ways to partition the input queries into
subsets. For each possible partition we compute a cost and then we pick the
partition with minimum cost. This approach leads to an exponential algorithm.
Let us use our student database scenario to show a case when merging three or
four queries is optional, although merging any pair is not. In figure (5) (DMF),
we show 3 or 4 queries over our student database. Since there are 3 queries,
there are five ways to merge. We can merge two of them (three combinations), we
can merge all of them, or we can keep them separately.
2.5. Algorithm for the
query merging problem
In this section, we introduce five
basic formulations for the association rule based dynamic fetch algorithms and
combines good features of these approaches. We assume that “N” data sets are
randomly distributed to “P” processors initially such that each processor has
N/P cases.
2.5.1: Forward path
cross data fetch (FPCDF)
In this approach, database contents
are travel along the straight forward fashion. Initially, database pointer
placed in a BOF position and it travels towards the EOF.It is not only for a
single database it is applicable to “n” number of database system. It travels
along the “n” number of databases and select records randomly. For instance,
“n” number of class rooms having “n” number of students. Instead of selecting
repeated students from a single class room, we select students randomly from
“n” number of class rooms (selecting data from room No.1………….n). The duplication
approach is effective. But it is repeatedly travel in forward fashion. To
overcome this problem, we need more number of association rule algorithms to
distribute or select a different number of data simultaneously.
BOF- Beginning of File,
EOF- End of file
Database I:
Database II:

Fig 1.3
Fetching techniques of Forward Path Cross Data Fetch
2.5.2 Reverse path
cross data fetch (ROCDF)
In
this approach, data travel form EOF to BOF. It is the reverse process of
(FPCDF). Behavior of both the algorithms the same but, the way of collecting a
data is entirely different from other algorithms selecting data form room no.”N” to Room
No1.) FPCDF is better than the RPCDF .For example selecting students
(or) Data from room No.1 to room No.n/2 using FPCDF and selecting students(or)
data from room No: (n/2)+1 using RPCDF
Database
I:
Database II:

2.5.3
Diagonal cross random data fetch (DCRDF)
In this approach, we select the diagonal data element form 2 or more databases for instance if we want to select any no. of lucky persons in lottery database (or) any other database. This algorithm continues picking every diagonal element available in the database. It selects the elements only when the attributes and the entities are same (A=E).
Database I:
Database II:

2.5.4 Upper5 / Lower
Diagonal cross random Database fetch (U/L DCRDF)
This algorithm only takes the data which is located above (or) below the diagonal elements selection of data to focus only the diagonal elements. If the element is not a diagonal then we randomly take a value from 2 or more database. The constraints applied for the selection are attributes and is not equal to the entities (i.e.) E>A (or) E<A.
Database I:
Database II:

2.5.5: Data merge and
fetch (DMF):
In this approach, the algorithm is
to merge “n” database elements and ask the questions after the data merge and
maintains a set of queries. Initially, each set of contains each single query.
The Algorithm continues merging number of questions and answers according to
the internal behaviors of each algorithm. On the other hand, merge two
different questions randomly to make a single question. Each client can be
allocated to the different paths and answering the queries sent by clients.

3. Experimental
results:

4. Conclusion:
In this paper, we have studied the
query merging problem. We presented a very general framework for merging, and
we presented a variety of algorithms. To illustrate and experimentally evaluate
performance, we considered the sample queries as an example. Our results show
that the running time can be significantly decreased by applying a merge
problem. If we have small number of queries, we can use the Forward path cross
data fetch (FPCDF) and reverse path cross data fetch (RPCDF) algorithm. If the
running time is critical and subscriptions change dynamically then using the
Diagonal cross random data fetch (DCRDF) before applying and merging algorithm
improves the running time significantly. If finding the best solution is
important and the number of queries is large, data merge and fetch should be
used. Reducing the run time has been non trivial challenge for researchers in
the area. This will not only reduce the hardware overhead on a computer and
make many scientific computations feasible
Algorithm
|
Time in
Milliseconds
|
|
Forward
Path Cross Data Fetch (FPCDF) |
1.6 |
|
Reverse
Path Cross Data Fetch (RPCDF) |
1.68 |
|
Diagonal
Cross Random Data Fetch (DCRDF) |
1.8 |
|
Upper/Lower
Diagonal Cross Random Data Fetch (U/L DCRDF) |
1.87 |
|
Data
Merge and Fetch (DMF) |
1.9 |
[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining
Association Rules Between Sets of Items in Large Databases,” Proc. 1993 ACM
SIGMOD Int’l Conf. Management of Data, pp. 207-216, Washington, D.C., May 1993
[2] R.
Agrawal and J.C. Shafer, “Parallel Mining of Association Rules: Design,
Implementation, and Experience,” IEEE Trans. Knowledge and Data
[3] R.
Agrawal and R.Srikant, “Fast Algorithms for Mining Association Rules,” Proc.
1994 Int’lconf. Very Large Data Bases, pp. 487-499,
[4] R.
Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995 Int’l Conf.
Data Eng., pp.265-276,Tucson,Ariz., May 1997.
[5] S.
Brain, R. Motwani, and C.Silverstein, “Beyond Market Basket: Generalizing
Association Rules to Correlations,” Proc. 1997 ACM SIGNMOD Int’l Conf.
Management of Data, pp. 265-276, Tucson, Ariz., May 1997
[6] S.
Chaudhuri and U. Dayal, “An Overview of Data Warehousing and PLAP
Technology,”AGM SIGMOD Record, vol. 26, pp. 65-74, 1997.
[7] M.S.
Chen, J. Han, and P.S. Yu, “Data Mining: An overview from a Database
Perspective,” IEEETrans. Knowledge and Data Engg., Vol.8, PP.866-883, 1996.
[8] D.W.
Cheung, J. Han, V. Ng,A. Fu, and Y. Fu, “A Fast Distributed Algorithm for
Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed
Information Systems, PP. 1996 Int’l Conf. Data Enf., PP. 106-114, New Orleans,
Feb. 1996.
[9]
Y. Fu and J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast
Distributed Algorithm for Mining Association Rules,” Proc. 1996 Int’l Conf.
Parallel and Distributed Information Systems, PP. 31-44, Miami Beach, Fla.,Dec.
1996.
[10]
D.W. Cheung, J. Han, V. Ng, and C.Y. Wong, “Maintenance
of Discovered Association Rules in Large Databases: An Incremental Updating
Technique,” Proc. 1996 Int’l Conf, Data Engg., PP. 106-114,
[11]
T. Fu and J. Han,”Meta-Rule-Guided Mining of
Association Rules in Relational Databases,”Proc. First Int’l Workshop
Integration Knowledge Discovery with Deductive and Object-Oriented Databases
(KDOOD ’95), PP. 39-46,
Technical College - Bourgas,