Academic Open Internet Journal

www.acadjournal.com

Volume 12, 2004

 

 

 

 

Dynamic Data Fetch: An Efficient Algorithm for Query processing in large databases using Association Rules

 

N. Rajkumar1, S.N. Sivanandam 2 and M.R.Karthik3

 

1. Research Scholar, Department of Computer Science & Engineering

2. Prof. and Head, Department of Computer Science & Engineering

3. Student, Department of Electrical & Electronics Engineering

PSG College of technology, Coimbatore-641004, India

Email: rnalliah@yahoo.co.in1, Karthikv2k@yahoo.co.uk3

 

Abstract:

Association rules are used extensively for data mining in many domains such as wholesale, retail marketing, lottery selection etc., and all the domains which is dealing with more number of data. Direction based large relationships are critical in many domains, including geographic information system (GIS) and image interpretation. They are also frequently used as selection conditions in large queries. In this paper, we explore the processing of large queries and propose and describe five basic data formulations to reduce the redundancy. The forward path cross data fetch (FPCDF), the diagonal cross random data fetch (DCRDF), the upper/lower diagonal cross random data fetch (U/LDCRDF) and the data merge and fetch (DMF) respectively.

Keywords: Data mining, Association rule, query processing, query merging.

 

1. INTRODUCTION:

Mining of association rules from large data sets has been focused topic in recent data mining research [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]. Many applications at mining associations require that mining be performed at multiple levels of abstraction. For example, the association relationship in different databases is expressed at a lower level of abstraction without duplication but carries more specific and concrete information. However, in many applications, it is unlikely that a large number of clients will issue exactly the same query. In this paper, we present algorithms for efficient use of query for such applications. We achieve this by considering merging queries with answers that overlap significantly. By merging these queries, the server has to process fewer queries and the amount of data or information may be reduced. We present a frame work for studying query merging, processing and its costs. We present a variety of algorithms for merging, some optimal and some heuristic.

 

2. PROBLEM SPECIFICATION:

In this section, we formalize the query processing problem by presenting a conceptual model. The objective of our approach is to reduce the redundancy of a set of query subscriptions made by clients to a server. We attempt to reduce the cost by finding a different set of queries, with lower processing and transmission costs, from which the clients can drive the answers to their original queries.

 

2.1 .CONCEPTUAL MODEL:

The conceptual model for a query subscription is shown in figure 1. In this model, we have a set of clients C= {C1…..Cn}, that require information. The information need of Ci is described by a set of subscriptions. Each subscription consists of a query and its timing, Requirements. For simplicity, we assume that all subscriptions have identical timing requirements. Thus, we can view the subscriptions have identical timing requirements.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 Figure 1: Processing constraints of dynamic data fetch

                         Thus we can view the subscriptions of client Ci simply as a set of queries Qi. We will assume that Qi is relatively small but that its queries will be processed periodically (and answers set to the client) over a long period of time. We call Q the set of all queries received by the server .Clients send their queries to a server. The server periodically processes the queries against a database, and sends answers to the clients. Before processing queries the server runs a merge algorithm that combines”similar”queries. The output of the merge process is a collection M= {Mi} where each Mi contains the queries that are merged. The queries in each Mi are merged into a single query, mrg (Mi).we use ans (q) to represent the answer to query q. Thus, the server generates ans (mrg (Mi)) for each Mi in M.For completeness, we require that UiQi=UiMi. Similarly, we require that ans (q) ≤ans (mrg (Mi)) for every qεMi.

 

We call the difference between the answer to the merged query sent to a client, ans (mrg (Mi)), and the original query, ans (q), the irrelevant information for q sent to the client. To illustrate  these concepts ,say client C1 submits queries Q1={x,y}and c2 submits Q2= {z}.The server may merge them into M1={x,z} and M2={y}.Then, the server runs mrg(M1) and mrg(M2) against the database and generates A1 = ans(mrg(M1)) and  A2=ans(mrg(M2)). Note that A1 needs to be sent to both C1 and C2 and that each must apply an extractor to be sent to obtain the desired answer.  For example, the extractor, e, that C1 applies to A1 should yield e (A1) =ans(x). Thus, when the server sends A1 out, it must include a header containing information.

1. A list of clients that should receive A1.

2. For each such client c, one or more pairs (e, q), where e is an extractor and q is a query identifier. The extractor e is what client C needs to apply to obtain the answer to its original query q.

 

Note that more than one (e, q) pair is needed if multiple c queries are involved in A1.If clients do not need to know what queries generated answers, and then the query identifiers are not required. In our example, the information sent with A1 would be C1: (ex, x) and C2 :( ez, z). Client C1 then applies ex (A1) to obtain its answer to query x while C2 applies ez (A1) to obtain its answer. For example, the server could tag each individual answer object with the identifier of the query that generates the object, or with the identifier of the client should receive the object. Then, each extractor only needs to look for the appropriate tags. In some cases, the extractor for a query is the original query itself. In particular, this happens when queries only have selections and projections. In any case, the extractor needs to be installed on the client machine. Note that our basic model does not specify what kind of network is used to send queries to the server and answers to the clients. In the next section, we will introduce a parallel virtual machine in to the model.

 

2.2. THE PARALLEL VIRTUAL MACHINE: 

The system called parallel virtual machine (PVM) allows a heterogeneous collection of workstations and computers to construct a parallel processing environment. PVM is portable and runs on a wide variety of modern systems. The PVM system permits a heterogeneous collection of UNIX computers networked together to be viewed by a user program as a single parallel environment with some distinctive features:

  • Explicit message-passing model: For a collection of computational tasks, each one performs computations.
  • Heterogeneity support: PVM supports heterogeneity in terms of machines, networks and applications.
  • Multiprocessor support: PVM uses the native message-passing facilities on multiprocessors to take advantage of the underlying hardware.
  • Process-based computation: The unit of parallelism in PVM is a task, which is an independent sequential thread of control that transfers between communication and computation. PVM tasks are identified by an integer task identifier. The task identifiers are supplied by the PVM system and are not user chosen, because they must be unique across the entire virtual system.
  • User-configured host pool: The computational tasks execute on a set of machines that are selected by the user for program execution. Both single-processor and multiprocessor computers may part of the host pool. The host pool may be altered by adding and deleting machines during operation.

 

A PVM computational model shown in Figure 1.1. The dashed arrows indicate integer component communication and synchronization, and the solid lines represent the inter-instance communication and synchronization. A general paradigm for paradigm for program execution in the PVM system has the following steps:

  • A user writes one or more sequential programs in C, C++ or Fortran 77, which are programming languages that are supported by PVM. The programs contain embedded calls to the PVM system libraries.

 

Figure 1.1 PVM Computational Model

                      

 

 

2.3 The merging procedure for student database queries

We consider the database to be a single relation R that has position attributes (e.g, “Roll no” and “Name”), as well as other attributes describing that position .A student query has the form.

                               I (C1≤Roll no≤C3) Λ (C2≤Name≤C4) R

In figure (2) forward path cross data fetch (FPCDF), we illustrate the merge procedure that can be used in this student database queries in to a single selection query. Additionally queries in to a single selection query. Additionally, it is easy to extract the answers to the original queries from the answers to the merged queries. There are other possible merge procedures for the student database query scenario. Figure (3) shows the reverse path cross data fetch (RDCDF). This procedure also generates a single merged query. But the query may have the way of collecting a data is entirely different from other techniques (gathering data from room no. ‘n’ to room no. 1). Although, the merge query contains less irrelevant information we can again use the original query as the extraction function for this merging procedure. Figure (4) shows Diagonal cross random data fetches (DC RDF) a merge procedure that completely eliminates irrelevant information. However, four “merged” queries are generated. A client implementing the extraction function for this merging procedure needs to combine the answers to the four merged queries in order to find the answer to the original query.

 

2.4 The Multiple queries merging problem for student database queries

The multiple query merging problem is the special case of the query merging problem when │Q│=2 or more.  Our student database query example is convenient for illustrating why the multiple query merge problem is simple, but why it is hard for more than three queries. In the multiple queries merging problems we want to decide if it is worthwhile to merge two queries q1 and q2 in to a merged query q3. . For compactness in the following discussion, let us denote size (qi) as si. Therefore, the cost of processing and transmitting queries q1 and q2 separating will be KM+KT.S1 and KM +KT .S2, respectively, for a total cost of 2 KM +KT (S1+S2). If we merge the queries into a single query q3, the total cost will be

KM + KT.S3 +KU.U (Q, M), where U (Q, M) =2.S3-S1-S2. We derive the U (Q, M) term in the following way: If we send a message with only ans (q1), the client receives an answer with size S1.If we send a message with ans (q3) instead, the client will receive an answer of size S3. The difference (S3-S1) is the size of the irrelevant results received by the client. We can use a similar derivation for the other client and conclude that the size of the irrelevant information is (S3-S2).Therefore the total size of irrelevant information is U (Q, M) =2.S3-S1-S2.Unfortunately, the general problem (│Q│>2) is significantly harder, since these are, many ways to combine a set of queries into a merged queries. For example, it we have three or more queries as inputs, it could be the case that it is not worthwhile merging any pair of them, but it is worthwhile merging the three or four queries  in to a single query. On the other hand, it could be the case that it is worthwhile merging one specific pair, but not the other pair and not the three or four queries. In conclusion, we would have to consider all possible ways to partition the input queries into subsets. For each possible partition we compute a cost and then we pick the partition with minimum cost. This approach leads to an exponential algorithm. Let us use our student database scenario to show a case when merging three or four queries is optional, although merging any pair is not. In figure (5) (DMF), we show 3 or 4 queries over our student database. Since there are 3 queries, there are five ways to merge. We can merge two of them (three combinations), we can merge all of them, or we can keep them separately.

 

2.5. Algorithm for the query merging problem

In this section, we introduce five basic formulations for the association rule based dynamic fetch algorithms and combines good features of these approaches. We assume that “N” data sets are randomly distributed to “P” processors initially such that each processor has N/P cases.

 

2.5.1: Forward path cross data fetch (FPCDF)

In this approach, database contents are travel along the straight forward fashion. Initially, database pointer placed in a BOF position and it travels towards the EOF.It is not only for a single database it is applicable to “n” number of database system. It travels along the “n” number of databases and select records randomly. For instance, “n” number of class rooms having “n” number of students. Instead of selecting repeated students from a single class room, we select students randomly from “n” number of class rooms (selecting data from room No.1………….n). The duplication approach is effective. But it is repeatedly travel in forward fashion. To overcome this problem, we need more number of association rule algorithms to distribute or select a different number of data simultaneously.

BOF- Beginning of File, EOF- End of file

Database I:                                                     Database II:

Fig 1.3 Fetching techniques of Forward Path Cross Data Fetch

 

2.5.2 Reverse path cross data fetch (ROCDF)

                    In this approach, data travel form EOF to BOF. It is the reverse process of (FPCDF). Behavior of both the algorithms the same but, the way of collecting a data is entirely different from other algorithms selecting data form room no.”N” to Room No1.) FPCDF is better than the RPCDF .For example selecting students (or) Data from room No.1 to room No.n/2 using FPCDF and selecting students(or) data from room No: (n/2)+1 using RPCDF

Database I:                                                     Database II:

               

 2.5.3 Diagonal cross random data fetch (DCRDF)

               In this approach, we select the diagonal data element form 2 or more databases for instance if we want to select any no. of lucky persons in lottery database (or) any other database. This algorithm continues picking every diagonal element available in the database. It selects the elements only when the attributes and the entities are same (A=E).

 

Database I:                                                     Database II:

Fig 1.4 Selection of Lottery winners using DCRDF

 

2.5.4 Upper5 / Lower Diagonal cross random Database fetch (U/L DCRDF)

              This algorithm only takes the data which is located above (or) below the diagonal elements selection of data to focus only the diagonal elements. If the element is not a diagonal then we randomly take a value from 2 or more database. The constraints applied for the selection are attributes and is not equal to the entities (i.e.) E>A (or) E<A.

 

Database I:                                                     Database II:

 

Fig 1.5 Selection of Lottery winners using U/L DCRDF

 

 

 

2.5.5: Data merge and fetch (DMF):

                  In this approach, the algorithm is to merge “n” database elements and ask the questions after the data merge and maintains a set of queries. Initially, each set of contains each single query. The Algorithm continues merging number of questions and answers according to the internal behaviors of each algorithm. On the other hand, merge two different questions randomly to make a single question. Each client can be allocated to the different paths and answering the queries sent by clients.

 

3. Experimental results:

 

 

  • FPCDF – Forward Path Cross Data Fetch Forward
  • RPCDF -Reverse Path Cross Data Fetch
  • DCRDF-Diagonal Cross Random Data Fetch
  • U/L DCRDF - Upper/Lower Diagonal Cross Random Data Fetch
  • DMF- Data Merge and fetch

 

4. Conclusion:

In this paper, we have studied the query merging problem. We presented a very general framework for merging, and we presented a variety of algorithms. To illustrate and experimentally evaluate performance, we considered the sample queries as an example. Our results show that the running time can be significantly decreased by applying a merge problem. If we have small number of queries, we can use the Forward path cross data fetch (FPCDF) and reverse path cross data fetch (RPCDF) algorithm. If the running time is critical and subscriptions change dynamically then using the Diagonal cross random data fetch (DCRDF) before applying and merging algorithm improves the running time significantly. If finding the best solution is important and the number of queries is large, data merge and fetch should be used. Reducing the run time has been non trivial challenge for researchers in the area. This will not only reduce the hardware overhead on a computer and make many scientific computations feasible 

 

Algorithm

Time in Milliseconds

Forward Path Cross Data Fetch (FPCDF)

1.6

Reverse Path Cross Data Fetch (RPCDF)

1.68

Diagonal Cross Random Data Fetch (DCRDF)

1.8

Upper/Lower Diagonal Cross Random Data Fetch (U/L DCRDF)

 

1.87

Data Merge and Fetch (DMF)

1.9

REFERENCES

 

[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc. 1993 ACM SIGMOD Int’l Conf. Management of Data, pp. 207-216, Washington, D.C., May 1993

 

[2] R. Agrawal and J.C. Shafer, “Parallel Mining of Association Rules: Design, Implementation, and Experience,” IEEE Trans. Knowledge and Data Eng., vol.8, pp. 962-969,1996

 

[3] R. Agrawal and R.Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int’lconf. Very Large Data Bases, pp. 487-499, Santiago, Chile, Sept. 1994.

 

[4] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995 Int’l Conf. Data Eng., pp.265-276,Tucson,Ariz., May 1997.

 

[5] S. Brain, R. Motwani, and C.Silverstein, “Beyond Market Basket: Generalizing Association Rules to Correlations,” Proc. 1997 ACM SIGNMOD Int’l Conf. Management of Data, pp. 265-276, Tucson, Ariz., May 1997

 

[6] S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and PLAP Technology,”AGM SIGMOD Record, vol. 26, pp. 65-74, 1997.

 

[7] M.S. Chen, J. Han, and P.S. Yu, “Data Mining: An overview from a Database Perspective,” IEEETrans. Knowledge and Data Engg., Vol.8, PP.866-883, 1996.

 

[8] D.W. Cheung, J. Han, V. Ng,A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed Information Systems, PP. 1996 Int’l Conf. Data Enf., PP. 106-114, New Orleans, Feb. 1996.

 

[9]   Y. Fu and J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” Proc. 1996 Int’l Conf. Parallel and Distributed Information Systems, PP. 31-44, Miami Beach, Fla.,Dec. 1996.

 

[10]     D.W. Cheung, J. Han, V. Ng, and C.Y. Wong, “Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique,” Proc. 1996 Int’l Conf, Data Engg., PP. 106-114, New Orleans, Feb. 1996.

 

[11]     T. Fu and J. Han,”Meta-Rule-Guided Mining of Association Rules in Relational Databases,”Proc. First Int’l Workshop Integration Knowledge Discovery with Deductive and Object-Oriented Databases (KDOOD ’95), PP. 39-46, Singapore, Dec. 1995.

 

 

 

Technical College - Bourgas,

All rights reserved, © March, 2000