Academic Open Internet Journal

www.acadjournal.com

Volume 16, 2005

 

 

 

SOFTWARE AGENT BASED SEARCH ENGINE USING GRID TECHNOLOGY

 

 

 

Authors

 

D INDUMATHI,

LECTURER, DEPT. OF CSE,

PSG COLLEGE OF TECHNOLOGY,

COIMBATORE-641 004, INDIA

e-mail: indumathi_d@hotmail.com

             indumathi_d@yahoo.com

 

 

Dr. A.CHITRA,

PROFESSOR, DEPT. OF CSE,

PSG COLLEGE OF TECHNOLOGY,

COIMBATORE-641 004, INDIA

e-mail: achitra@cse.psgtech.ac.in

 

 

Abstract

 

            The amount of information available via networks and databases has increased rapidly. Existing search and retrieval engines provide limited assistance to users in locating the relevant information they need. Software agents may be what is needed to transform passive search and retrieval engines into active, personal assistants. This project aims at developing a search engine that allows users to search through heterogeneous resources stored in geographically distributed digital collections. Many search engines of this kind cannot rely on a single centralized index. In some cases, the distributed approach offers advantages over the centralized approach: it is more scalable, can be used on otherwise inaccessible material, and can provide advanced search options customized for each data source. This proposal addresses the situations where centralized indexes are infeasible and proposes the development of a decentralized search engine built on top of grid technology with the help of software agents.

 

Keywords: Searching, Grid Computing, Software agents, intelligent agents

 

INTRODUCTION

           

Search engines have become an essential component of everyday life in modern society. Most applications involve some interaction with search engines in one way or another. The Grid Computing Based Search Engine specifically addresses the situations where centralized indexes are infeasible and proposes the development of a decentralized search engine built on top of grid technology. Grid computing has the design goal of solving problems too big for any single supercomputer, while retaining the flexibility to work on multiple smaller problems. Thus, grid computing provides a multi-user environment [1].

 

            The goal of the project is to build a tool capable of searching and automatically categorizing vast amounts of geographically distributed information, which in many cases could not be searched effectively, if at all, by a centralized search engine such as those available on the web.

           

            The reason we deploy a grid for implementing a search engine is to better match grid computing capabilities to customer requirements.

 

            In most organizations, there are large amounts of underutilized computing resources. Most desktop machines are busy less than 5% of the time. In some organizations, even the server machines can often be relatively idle. Grid computing provides a framework for exploiting these underutilized resources and thus can substantially increase the efficiency of resource usage. Another function of the grid is to better balance resource utilization. An organization may have occasional unexpected peaks of activity that demand more resources. If the applications are grid enabled, they can be moved to underutilized machines during such peaks. In general, a grid can provide a consistent way to balance the loads on a wider federation of resources. This applies to CPU, storage, and other resources.

 

            The potential for massive parallel CPU capacity is one of the most attractive features of a grid. A CPU-intensive grid application can be thought of as many smaller “subjobs,” each executing on a different machine in the grid. The less these subjobs need to communicate with each other, the more “scalable” the application becomes. A perfectly scalable application will, for example, finish 10 times faster if it uses 10 times the number of processors.

 

            A grid federates a large number of resources contributed by individual machines into a greater total virtual resource. The grid can offer a resource balancing effect by scheduling grid jobs [2] on machines with low utilization.

 

 

            The Grid Computing Based Search Engine [3] provides an alternative approach to reliability that relies more on software technology than on expensive hardware. The systems in a grid can be relatively inexpensive and geographically dispersed. Thus, if there is a power failure or some other kind of failure at one location, the other parts of the grid are not likely to be affected. In critical, real-time situations, multiple copies of the important jobs can be run on different machines throughout the grid.

 

Existing system

 

            Most of the search engines that exist today are based on cluster computing. Cluster computing is primarily concerned with computational resources. Compute clusters basically consist of off-the-shelf PCs interconnected via a high-speed network. There is usually a single node that acts as the controller, with the remaining nodes acting as slaves. Cluster computing cannot truly be characterized as a distributed computing solution. Clusters usually contain a single type of processor and operating system, whereas grids can contain machines from different vendors running various operating systems. Clusters typically contain a static number of processors and resources. Cluster technology depends on extremely low network latency, which can cause problems if the cluster nodes are not close together.

            The major disadvantage of using parallel supercomputers is that one is tied to a specific vendor. Any refinement or improvement to the hardware or the controlling software is extremely expensive.

 

 

In such systems, the time taken to locate the required data is considerably longer. This may be because the server is overloaded or the computing resources are underutilized. As there is no load balancing [4], speedup is generally limited by the speed of the slowest node. Assuming that communication and calculation cannot be overlapped, any time spent communicating the data between processors directly degrades the speedup.
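The effect of the slowest node and of non-overlapped communication on speedup can be illustrated with a small model. The following is a minimal sketch and not part of the original system: the class name, the method name, and all timing figures are assumptions chosen for illustration. Speedup is taken as the serial time divided by the parallel time, where the parallel time is the compute time of the slowest node plus the communication time.

```java
public class SpeedupModel {

    /**
     * Speedup = serial time / parallel time. The parallel time is the compute
     * time of the slowest node plus communication time that cannot be
     * overlapped with computation.
     */
    public static double speedup(double serialSeconds, double[] nodeSeconds, double commSeconds) {
        double slowest = 0.0;
        for (double t : nodeSeconds) {
            slowest = Math.max(slowest, t);
        }
        return serialSeconds / (slowest + commSeconds);
    }

    public static void main(String[] args) {
        // Illustrative figures only: 100 s of serial work spread over 4 nodes.
        // Evenly balanced, no communication cost.
        System.out.println(speedup(100.0, new double[] {25, 25, 25, 25}, 0.0));
        // Unbalanced: the node that needs 40 s limits the whole job.
        System.out.println(speedup(100.0, new double[] {10, 20, 30, 40}, 0.0));
        // Balanced, but 25 s of non-overlapped communication is added.
        System.out.println(speedup(100.0, new double[] {25, 25, 25, 25}, 25.0));
    }
}
```

Under these assumed figures, the balanced case reaches the ideal speedup for four nodes, while either an unbalanced split or added communication time lowers it.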

 

Proposed System

 

            The proposed system focuses on making better and more cost-effective use of existing computing power and resources, with a view to sharing applications and collaborating on projects through distributed computing.

            A grid computing system is a distributed parallel collection of computers that enables the sharing, selection and aggregation of resources. This sharing is based on the resources' availability, capability, performance, cost and ability to meet quality-of-service requirements. The Grid merges people, computers, databases, instruments, and other resources in ways that were simply never possible before.

            By using an agent-enhanced search engine with grid technology, customer requirements can be satisfied by searching for the required data and retrieving the results in real time. At the same time, this improves resource usage and provides efficient workload optimization.

 

            The features of the proposed system are as follows:

 

Saves Time

            By employing a grid for the search engine the time for locating the required data is considerably shortened. This is possible by better utilizing and distributing existing compute resources.

 

Lower Computing Costs

            On a price-to-performance basis, the Grid platform gets more work done with less administration and budget than dedicated hardware solutions. Depending on the size of the network, the price-for-performance ratio for computing power can literally improve by an order of magnitude.

 

Faster Project Results

            The extra power generated by the Grid platform can directly impact an organization's ability to win in the marketplace by shortening product development cycles and accelerating research and development processes.

 

Better Product Results

            The power created by the Grid platform helps to ensure a higher quality product by allowing higher-resolution testing and results, and permits an organization to test more extensively prior to product release. The security, scalability, and manageability of Grid technology have been proven.

 

Load Balancing

            The Grid Computing Based Search Engine ensures that each node performs the same amount of work, i.e., the system is load balanced.

 

IMPLEMENTATION

 

            The implementation is done in Java on the Windows platform. Java-based agent programming [8] is used to implement the RMI communication, threads, and user interfaces. The client/server connectivity is established by creating remote objects. The searching mechanism runs on all the gridlets using the Fork/Join pattern and the DEQueue (double-ended queue) algorithm. The players in this proposal include:

 

•      GRID SERVER
       -      Job scheduling
       -      Job dispatching
       -      Job monitoring
       -      Aggregation of the results

•      CLIENT
       -      Connect to the grid server
       -      Give the search request
       -      Receive the result

•      GRIDLETS
       -      Run the search engine application
       -      Return the results

•      AGENTS
       -      Communication

 

            The first job of the Grid server is to generate the remote object and configure the gridlets based on the gridlet selection criteria. The Grid server then waits for the Client’s request and, upon receiving it, dispatches the search job to the configured gridlets. During the searching process, the Grid server monitors the gridlets using a watch monitor and a status monitor.

            The watch monitor’s job is to terminate the search process in the case of an infinite search. The status monitor’s job is to report the current status of the gridlet to the Grid server. Upon receiving the search results from the gridlets, the Grid server aggregates the results and returns them to the Client. The Client connects to the Grid server through Remote Method Invocation. Once the Client is connected, it can send its request to the Grid server and wait for the search results. The search request given by the Client must contain either the filename or part of the content of the file.
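The dispatch, watch-monitor, and aggregation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, each gridlet is simulated by a local thread rather than a remote machine, and the watch monitor is approximated by a fixed time limit after which unfinished searches are cancelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class GridServerSketch {

    /** Runs one search job per (simulated) gridlet and merges the results in job order. */
    public static List<String> dispatchAndAggregate(List<Callable<List<String>>> gridletJobs,
                                                    long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, gridletJobs.size()));
        List<String> merged = new ArrayList<>();
        try {
            // invokeAll returns when all jobs finish or the time limit expires;
            // jobs still running at that point are cancelled (the "watch monitor").
            List<Future<List<String>>> futures =
                    pool.invokeAll(gridletJobs, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<List<String>> future : futures) {
                if (future.isCancelled()) {
                    continue; // search was terminated, so it contributes nothing
                }
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // a gridlet that failed contributes no results in this sketch
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return merged;
    }

    /** Hypothetical jobs: two gridlets answer at once, a third never finishes on its own. */
    public static List<Callable<List<String>>> demoJobs() {
        List<Callable<List<String>>> jobs = new ArrayList<>();
        jobs.add(() -> List.of("a.txt"));
        jobs.add(() -> List.of("b.txt"));
        jobs.add(() -> {
            Thread.sleep(60_000); // stands in for an "infinite" search
            return List.of("never.txt");
        });
        return jobs;
    }

    public static void main(String[] args) {
        System.out.println(dispatchAndAggregate(demoJobs(), 1000));
    }
}
```

In this sketch the third job is cancelled when the one-second limit expires, and only the results of the two gridlets that finished are returned to the caller.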

 

            The search engine application must run on all the gridlets. After receiving the search job from the server, the gridlet searches the local disks of the clients connected to the network. The gridlet then stores the results in the result log, which is retrieved by the Grid server.
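The Fork/Join style of search running on a gridlet can be sketched with the fork/join framework in modern Java, whose worker threads use work-stealing double-ended queues. This is an assumption-laden stand-in: `java.util.concurrent.ForkJoinPool` was added to Java after this paper was written, so the original system presumably used its own implementation. The class name, the threshold, and the file names are hypothetical, and only filename matching is shown (the system described also matches partial file content).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSearch extends RecursiveTask<List<String>> {

    // Assumed cut-off: lists this small are searched directly instead of being split.
    private static final int THRESHOLD = 2;

    private final List<String> fileNames;
    private final String query;

    public ForkJoinSearch(List<String> fileNames, String query) {
        this.fileNames = fileNames;
        this.query = query;
    }

    @Override
    protected List<String> compute() {
        if (fileNames.size() <= THRESHOLD) {
            // Base case: sequential scan of a small slice.
            List<String> hits = new ArrayList<>();
            for (String name : fileNames) {
                if (name.contains(query)) {
                    hits.add(name);
                }
            }
            return hits;
        }
        // Split the list in half, fork the left half, compute the right half here.
        int mid = fileNames.size() / 2;
        ForkJoinSearch left = new ForkJoinSearch(fileNames.subList(0, mid), query);
        ForkJoinSearch right = new ForkJoinSearch(fileNames.subList(mid, fileNames.size()), query);
        left.fork();
        List<String> rightHits = right.compute();
        // Join the forked half and keep the original ordering of the file list.
        List<String> hits = new ArrayList<>(left.join());
        hits.addAll(rightHits);
        return hits;
    }

    public static List<String> search(List<String> fileNames, String query) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSearch(fileNames, query));
    }

    public static void main(String[] args) {
        List<String> files = List.of("grid.txt", "notes.doc", "grid-server.log", "todo.txt", "mygrid.cfg");
        System.out.println(search(files, "grid"));
    }
}
```

Forking one half and computing the other in the current thread keeps the calling worker busy, while idle workers can steal the forked half from its queue.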

 

            The gridlet selection criterion is based on the percentage of CPU utilization. The server runs CPUTESTER.EXE on each client to obtain its load. CPUTESTER.EXE calls the Win32 API and returns the CPU load of the client. The clients with a higher percentage of idle time are configured as gridlets by the Grid server. The gridlet selection process is described in Figure 1.
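The selection step can be sketched as a simple filter over the reported CPU loads. This is only an illustration: the class and method names, the client names, and the idle-time threshold are assumptions (the paper does not state a cut-off value), and the loads are passed in as plain numbers rather than obtained from CPUTESTER.EXE.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GridletSelector {

    /**
     * Returns the clients whose idle percentage (100 - CPU load) is at least
     * minIdlePercent, ordered from most idle to least idle.
     */
    public static List<String> selectGridlets(Map<String, Double> cpuLoadPercent,
                                              double minIdlePercent) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> entry : cpuLoadPercent.entrySet()) {
            double idlePercent = 100.0 - entry.getValue();
            if (idlePercent >= minIdlePercent) {
                selected.add(entry.getKey());
            }
        }
        // Lowest CPU load first, i.e. the most idle client heads the list.
        selected.sort((a, b) -> Double.compare(cpuLoadPercent.get(a), cpuLoadPercent.get(b)));
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical CPU loads (percent busy) reported by four clients.
        Map<String, Double> loads = new LinkedHashMap<>();
        loads.put("client1", 5.0);
        loads.put("client2", 85.0);
        loads.put("client3", 40.0);
        loads.put("client4", 10.0);
        // Assumed threshold: a client must be at least 50% idle to become a gridlet.
        System.out.println(selectGridlets(loads, 50.0));
    }
}
```

With these assumed loads, client2 is excluded because it is only 15% idle, and the remaining clients are ordered by increasing load.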

 

            The protocol used for client/server transactions is the Java Remote Method Protocol (JRMP) [5]. JRMP is responsible for the communication between the client’s stub and the server’s skeleton. The implementation of JRMP can be adapted to use either a TCP port or a JRMP port. Clients forward their search requests to the server using JRMP. Because JRMP runs over TCP, the server receives all the messages from the clients and vice versa; reliability is thus guaranteed at the transport layer.
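A client/server exchange over Java RMI, which uses JRMP by default, might look like the sketch below. It is not the authors' code: the interface name `SearchService`, its `search` method, the registry name, the sample file names, and the use of port 1099 on localhost are all assumptions, and the server and client run in a single JVM purely for demonstration.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.List;

public class RmiSearchDemo {

    /** Hypothetical remote interface exposed by the Grid server. */
    public interface SearchService extends Remote {
        List<String> search(String query) throws RemoteException;
    }

    /** Server-side implementation; it matches file names only, for brevity. */
    public static class SearchServiceImpl implements SearchService {
        private final List<String> fileNames;

        public SearchServiceImpl(List<String> fileNames) {
            this.fileNames = new ArrayList<>(fileNames);
        }

        @Override
        public List<String> search(String query) {
            List<String> hits = new ArrayList<>();
            for (String name : fileNames) {
                if (name.contains(query)) {
                    hits.add(name);
                }
            }
            return hits; // ArrayList is serializable, so it can cross the wire
        }
    }

    public static void main(String[] args) throws Exception {
        // Server side: export the remote object and register its stub.
        SearchServiceImpl impl = new SearchServiceImpl(List.of("grid-notes.txt", "budget.xls"));
        SearchService stub = (SearchService) UnicastRemoteObject.exportObject(impl, 0);
        Registry registry = LocateRegistry.createRegistry(1099); // assumed free port
        registry.rebind("SearchService", stub);

        // Client side: look up the stub and invoke the remote method.
        Registry clientView = LocateRegistry.getRegistry("localhost", 1099);
        SearchService remote = (SearchService) clientView.lookup("SearchService");
        System.out.println(remote.search("grid"));

        // Clean up so the JVM can exit.
        registry.unbind("SearchService");
        UnicastRemoteObject.unexportObject(impl, true);
        UnicastRemoteObject.unexportObject(registry, true);
    }
}
```

In a deployment like the one described, the registry and the implementation would live on the Grid server and the lookup would be performed from a separate client machine.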

     

                                              Figure 1 Gridlet selection process

      The project has been implemented successfully on the local area network (LAN) of the organization, with four clients and one grid server. The application is designed to work for any number of nodes connected in a network.

 

CONCLUSION

 

Technology equipment and services are heavily under-utilised by organisations relative to their capacity. Quoted statistics suggest that PCs are used at only between 5% and 10% of capacity, and servers at about 20% of capacity. By making use of the idle processing cycles of the clients, the search operation was performed using grid computing. This results in increased speed, reduced search time, and lower cost. In the future, beyond searching within a LAN, grid-computing appliances could be plugged into the Internet, tapping into thousands of high-performance computers and allowing us to do word processing, spreadsheet calculations, email, and so on with a very low-cost computing device.

 

 

REFERENCES

 

  1. K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman, “Grid Information Services for Distributed Resource Sharing”, 10th IEEE International Symposium on High Performance Distributed Computing, IEEE Press, pp. 181-184, 2001.
  2. I. Foster, C. Kesselman, and S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, International Journal of Supercomputer Applications, vol. 15, Sage Publications, USA, 2001.
  3. I. Foster, C. Kesselman, J. Nick, and S. Tuecke, “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Open Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002.
  4. Ian Foster and Carl Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, Inc., USA, 1999.
  5. Patrick Naughton and Herbert Schildt, The Complete Reference Java 2, Tata McGraw-Hill Publishing Company Limited, New Delhi, 2004.
  6. http://www.GridComputing.com
  7. http://www.globus.org/research/papers/anatomy.pdf
  8. Danny B. Lange and Mitsuru Oshima, Programming and Deploying Java Mobile Agents with Aglets, Addison Wesley, USA, 1998.

 

 

Technical College - Bourgas,

All rights reserved, © March, 2000