Academic Open Internet Journal
Volume 16, 2005
SOFTWARE AGENT BASED SEARCH ENGINE USING
GRID TECHNOLOGY
Authors
D INDUMATHI,
LECTURER, DEPT. OF CSE,
COIMBATORE-641 004,
e-mail: indumathi_d@hotmail.com
indumathi_d@yahoo.com
Dr. A.CHITRA,
PROFESSOR, DEPT. OF CSE,
COIMBATORE-641 004,
e-mail: achitra@cse.psgtech.ac.in
Abstract
The amount of information
available via networks and databases has rapidly increased. Existing search and
retrieval engines provide limited assistance to users in locating the relevant
information that they need. Software agents may prove to be the missing component
that transforms passive search and retrieval engines into active, personal
assistants. This project aims to
develop a search engine that allows users to search through
heterogeneous resources stored in geographically distributed digital
collections. Many of the search engines existing today do not have a single
centralized index. In some cases, the distributed approach offers advantages
over the centralized approach since it is more scalable, can be used on
otherwise inaccessible material, and can provide advanced search options
customized for each data source. This proposal addresses the situations where
centralized indexes are unfeasible and proposes the development of a
decentralized search engine built on top of grid technology with the help of
software agents.
Keywords: Searching, Grid Computing, Software agents,
intelligent agents
INTRODUCTION
Search engines have become an essential component of everyday life in modern society.
Most applications involve some interaction with search engines in one way
or another. The Grid Computing Based Search Engine specifically addresses the
situations where centralized indexes are unfeasible and proposes the
development of a decentralized search engine built on top of grid technology. Grid computing has the design
goal of solving problems too big for any single supercomputer while retaining
the flexibility to work on multiple smaller problems. Thus, grid
computing provides a multi-user environment[1].
The goal of the project is to build
a tool capable of searching and automatically categorizing vast amounts of
geographically distributed information, which in many cases could not be
searched effectively, if at all, by a centralized search engine such as those
available on the web.
The
reason we deploy a grid for implementing a search engine is to better match
grid computing capabilities to customer requirements.
In most organizations, there are large amounts of
underutilized computing resources. Most desktop machines are busy less than 5%
of the time. In some organizations, even the server machines can often be
relatively idle. Grid computing provides a framework for exploiting these
underutilized resources and thus, has the possibility of substantially
increasing the efficiency of resource usage. Another function of the grid is to
better balance resource utilization. An organization may have occasional
unexpected peaks of activity that demand more resources. If the applications
are grid enabled, they can be moved to underutilized machines during such
peaks. In general, a grid can provide a consistent way to balance the loads on
a wider federation of resources. This applies to CPU, storage, and other resources.
The potential for massive parallel CPU capacity is one
of the most attractive features of a grid. A CPU-intensive grid application can
be thought of as many smaller “subjobs,” each executing on a different machine
in the grid. The less these subjobs need to communicate with each other, the
more “scalable” the application becomes. A perfectly scalable application will,
for example, finish 10 times faster if it uses 10 times the number of
processors.
A
grid federates a large number of resources contributed by individual machines
into a greater total virtual resource. The grid can offer a resource balancing
effect by scheduling grid jobs[2] on machines with low
utilization.
The Grid Computing Based Search Engine[3] provides an alternative
approach to reliability that relies more on software technology than expensive
hardware. The systems in a grid can be relatively inexpensive and
geographically dispersed. Thus, if there is a power or other kind of failure at
one location, the other parts of the grid are not likely to be affected. In
critical, real-time situations, multiple copies of the important jobs can be
run on different machines throughout the grid.
Existing System
Most of the search engines that exist today are based on cluster
computing. Cluster computing is
primarily concerned with computational resources. Compute clusters
basically consist of off-the-shelf PCs interconnected via a high-speed
network. There is usually a single node that acts as the controller, with the
remaining nodes acting as slaves. Cluster computing cannot truly be characterized as a distributed computing solution.
Clusters usually contain a single type of processor and operating system, whereas
grids can contain machines from different vendors running various operating
systems. Clusters typically contain a static number of processors and
resources. Cluster technology depends on extremely low network latency, which can
cause problems if the cluster nodes are not close together.
The major disadvantage of using parallel supercomputers is that one is
tied to a specific vendor. Any refinement or improvement to the hardware or
the controlling software is extremely expensive.
In such systems, the amount of time taken to locate the required
data is considerably longer. This may be because the server is overloaded or
the computing resources are underutilized. As there is no load balancing[4], speedup is
generally limited by the speed of the slowest node. Assuming that communication
and calculation cannot be overlapped, any time spent communicating the data
between processors directly degrades the speedup.
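The two effects described above can be illustrated with a small model. This is a minimal sketch, not the authors' measurement method: the class name, the timing figures, and the assumption that the serial time equals the sum of the per-node times are all hypothetical.

```java
import java.util.Arrays;

/** Illustrative speedup model; names and timing figures are assumptions. */
final class SpeedupModel {

    /**
     * Speedup = serial time / parallel time. The parallel run finishes only
     * when the slowest node finishes, and communication time is added on top
     * because it is assumed not to overlap with calculation.
     */
    static double speedup(double[] nodeSeconds, double commSeconds) {
        double serial = Arrays.stream(nodeSeconds).sum();
        double slowest = Arrays.stream(nodeSeconds).max().orElse(0.0);
        return serial / (slowest + commSeconds);
    }

    public static void main(String[] args) {
        // Balanced load: four nodes, 10 s each, no communication.
        System.out.println(speedup(new double[] {10, 10, 10, 10}, 0));  // 4.0
        // Same total work, but one node holds 25 s of it.
        System.out.println(speedup(new double[] {25, 5, 5, 5}, 0));     // 1.6
        // Balanced load with 10 s of non-overlapped communication.
        System.out.println(speedup(new double[] {10, 10, 10, 10}, 10)); // 2.0
    }
}
```

In this model an unbalanced load and communication overhead both pull the speedup below the ideal value of 4 for four nodes.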
Proposed System
The proposed system focuses on making better
and more cost-effective use of existing computing power and resources, with a view
to sharing applications and collaborating on projects through distributed
computing.
A grid computing system is a distributed parallel
collection of computers that enables the sharing, selection and aggregation of
resources. This sharing is based on the resources' availability, capability,
performance, cost and ability to meet quality-of-service requirements. The Grid
merges people, computers, databases, instruments, and other resources in ways
that were simply never possible before.
By using an agent-enhanced search engine with grid
technology, the customer requirements can be satisfied by searching for the required data and
retrieving the results in real time. At the same time, it improves resource
usage and provides efficient workload optimization.
The
features of the proposed system are as follows:
Saves Time
By employing a grid for the search engine the time for
locating the required data is considerably shortened. This is possible by
better utilizing and distributing existing compute resources.
Lower Computing Costs
On a price-to-performance basis, the Grid platform gets
more work done with less administration and budget than dedicated hardware
solutions. Depending on the size of the network, the price-for-performance
ratio for computing power can literally improve by an order of magnitude.
Faster Project Results
The extra power generated by the Grid platform can
directly impact an organization's ability to win in the marketplace by
shortening product development cycles and accelerating research and development
processes.
Better Product Results
The power created by the Grid platform helps to ensure a
higher quality product by allowing higher-resolution testing and results, and
permits an organization to test more extensively prior to product release. The security, scalability, and manageability of
Grid technology have been proven.
Load Balancing
The Grid Computing Based Search Engine ensures that each node
performs the same amount of work, i.e., the system is
load balanced.
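One simple way to give each node a similar share of work is to deal the work items out round-robin. The sketch below is only an illustration under that assumption: the paper does not state how work is divided, and `WorkPartitioner` and `partition` are hypothetical names. Equal item counts approximate equal work only when the items are of similar size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits work items evenly across nodes. */
final class WorkPartitioner {

    /** Deals items out round-robin, so share sizes differ by at most one. */
    static <T> List<List<T>> partition(List<T> items, int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        List<List<T>> shares = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            shares.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            shares.get(i % nodeCount).add(items.get(i));
        }
        return shares;
    }
}
```

For example, five directories split across two gridlets give one gridlet three directories and the other two.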
IMPLEMENTATION
Implementation
is done in Java over Windows platform. Java based agent programming [8] is used
for implementing RMI, threads and user interfaces. The Client/Server
connectivity is established by creating remote objects. The searching mechanism
runs on all the gridlets using Fork/Join pattern and DEQueue algorithm. The Players in this proposal include
● GRID SERVER
  ➢ Job Scheduling
  ➢ Job Dispatching
  ➢ Job Monitoring
  ➢ Aggregation of the results
● CLIENT
  ➢ Connect to the grid server
  ➢ Give the search request
  ➢ Receive the result
● GRIDLETS
  ➢ Run the search engine application
  ➢ Return the results
● AGENTS
  ➢ Communication
The first job of the Grid server is to generate the remote object and configure the
gridlets based on the gridlet selection
criteria. The Grid server then waits for the Client’s request
and, upon receiving it, dispatches the search job to the
configured gridlets. During the searching process,
the Grid server has to monitor the gridlets using a
watch monitor and a status monitor.
The watch monitor’s job is to terminate the
search process in the case of an infinite search. The status monitor’s job is
to report the current status of the gridlet to the
Grid server. Upon receiving the search results from the gridlets,
the Grid server has to aggregate the results and return them to the Client. The
Client connects to the Grid server through Remote Method Invocation. Once
the Client is connected, it can send its request to the Grid server and wait
for the search results. The search request given by the Client must contain
either the filename or part of the content of the file.
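The dispatch, watch-monitor and aggregation steps described above can be sketched with the standard java.util.concurrent API. This is an illustration under stated assumptions, not the authors' code: each gridlet is modelled as a local `Callable`, the watch monitor is modelled as a timeout, and `WatchedDispatch`, `dispatch` and `timeoutMillis` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical server-side dispatcher; gridlets are modelled as local tasks. */
final class WatchedDispatch {

    /** Runs every gridlet job, cancels those still running at the timeout, merges the rest. */
    static List<String> dispatch(List<Callable<List<String>>> gridletJobs, long timeoutMillis) {
        List<String> merged = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, gridletJobs.size()));
        try {
            // invokeAll cancels any job that has not finished when the timeout
            // expires, which plays the role of the watch monitor here.
            List<Future<List<String>>> futures =
                    pool.invokeAll(gridletJobs, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<List<String>> future : futures) {
                if (future.isCancelled()) {
                    continue; // search was terminated; no results from this gridlet
                }
                try {
                    merged.addAll(future.get()); // aggregation step
                } catch (ExecutionException e) {
                    // A failed gridlet contributes no results in this sketch.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        } finally {
            pool.shutdownNow();
        }
        return merged;
    }
}
```

A status monitor is not modelled; in a fuller version each job could report progress back to the server while it runs.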
The search engine application must run on all the gridlets.
After receiving the search job from the server, the gridlet
has to search all the local disks of the clients connected to the network. The gridlet then stores the results in the result log, which is
then retrieved by the Grid server.
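A gridlet's disk search could be written with Java's fork/join framework, in which each subdirectory becomes a forked subtask. The sketch below is a minimal example under assumptions: it matches file names only (not file content), it does not guard against symbolic-link cycles, and `FileSearchTask` is a hypothetical name. `ForkJoinPool` schedules forked tasks on per-worker double-ended queues; whether that corresponds to the DEQueue algorithm used in this project is an assumption.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical fork/join task that finds files whose names contain a query. */
final class FileSearchTask extends RecursiveTask<List<String>> {
    private final File dir;
    private final String namePart;

    FileSearchTask(File dir, String namePart) {
        this.dir = dir;
        this.namePart = namePart.toLowerCase();
    }

    @Override
    protected List<String> compute() {
        List<String> hits = new ArrayList<>();
        List<FileSearchTask> subtasks = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if unreadable or not a directory
        if (entries == null) {
            return hits;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                FileSearchTask sub = new FileSearchTask(entry, namePart);
                sub.fork(); // schedule the subdirectory asynchronously
                subtasks.add(sub);
            } else if (entry.getName().toLowerCase().contains(namePart)) {
                hits.add(entry.getAbsolutePath());
            }
        }
        for (FileSearchTask sub : subtasks) {
            hits.addAll(sub.join()); // wait for and collect each subtask's hits
        }
        return hits;
    }

    /** Searches the tree under root and returns absolute paths of matching files. */
    static List<String> search(File root, String namePart) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new FileSearchTask(root, namePart));
        } finally {
            pool.shutdown();
        }
    }
}
```

The returned list plays the role of the result log that the Grid server later retrieves.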
The gridlet
selection criterion is based on the percentage of CPU utilization. The server
runs CPUTESTER.EXE on each client to obtain its load. CPUTESTER.EXE
calls the Win32 API and returns the CPU load of each client. The clients with a
higher percentage of idle time are configured as gridlets
by the Grid server. The gridlet selection process is
described in Figure 1.
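The selection step can be sketched as a filter and sort over the reported loads. This is a hedged illustration: it assumes the CPU loads have already been collected (for instance by CPUTESTER.EXE), and the cut-off `maxLoadPercent` and the names `GridletSelector` and `select` are hypothetical, since the paper does not state a threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical selector that picks the least-loaded clients as gridlets. */
final class GridletSelector {

    /** Returns the clients whose CPU load is at or below the cut-off, idlest first. */
    static List<String> select(Map<String, Double> cpuLoadPercent, double maxLoadPercent) {
        List<Map.Entry<String, Double>> candidates = new ArrayList<>();
        for (Map.Entry<String, Double> entry : cpuLoadPercent.entrySet()) {
            if (entry.getValue() <= maxLoadPercent) {
                candidates.add(entry);
            }
        }
        candidates.sort(Map.Entry.comparingByValue()); // lowest load (most idle) first
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Double> entry : candidates) {
            names.add(entry.getKey());
        }
        return names;
    }
}
```

With loads of 80%, 5%, 30% and 12% for four clients and an assumed cut-off of 40%, the three clients below the cut-off are selected, ordered from most idle to least idle.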
The protocol used for client/server transactions is the Java Remote Method Protocol
(JRMP)[5].
JRMP is responsible for the communication between the client’s stub and the
server’s skeleton. The implementation of JRMP can be adapted with either a TCP port or a JRMP port. Here, clients forward their search requests to the
server using JRMP. JRMP ensures that the server receives all the messages
from the clients and vice versa. Because JRMP runs over TCP, reliability is guaranteed at the
transport layer level.
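The client/server exchange over RMI can be sketched as follows. This is a minimal, assumption-laden example rather than the project's actual interface: `SearchService`, `GridServerImpl`, the binding name "GridSearch", and registry port 1099 are hypothetical, and the server searches an in-memory list instead of dispatching to gridlets. Current Java versions generate the client stub dynamically, so no separate stub or skeleton classes are written by hand.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical remote interface; the paper does not give the real signature. */
interface SearchService extends Remote {
    List<String> search(String query) throws RemoteException;
}

/** Toy server-side object that searches an in-memory list of file names. */
final class GridServerImpl implements SearchService {
    private final List<String> knownFiles;

    GridServerImpl(List<String> knownFiles) {
        this.knownFiles = new ArrayList<>(knownFiles);
    }

    @Override
    public List<String> search(String query) {
        List<String> hits = new ArrayList<>();
        for (String name : knownFiles) {
            if (name.contains(query)) {
                hits.add(name);
            }
        }
        return hits;
    }
}

final class RmiDemo {
    public static void main(String[] args) throws Exception {
        // Server side: export the remote object and bind its stub in a registry.
        GridServerImpl impl = new GridServerImpl(Arrays.asList("grid.txt", "agent.doc"));
        SearchService stub = (SearchService) UnicastRemoteObject.exportObject(impl, 0);
        Registry registry = LocateRegistry.createRegistry(1099); // assumed free port
        registry.rebind("GridSearch", stub);

        // Client side: look up the stub and call it; the call travels over JRMP.
        Registry clientView = LocateRegistry.getRegistry("localhost", 1099);
        SearchService remote = (SearchService) clientView.lookup("GridSearch");
        System.out.println(remote.search("grid"));

        // Unexport both objects so that the JVM can exit.
        UnicastRemoteObject.unexportObject(impl, true);
        UnicastRemoteObject.unexportObject(registry, true);
    }
}
```

In a real deployment the server and client halves of `main` would run on different machines, with the client using the server's host name in place of "localhost".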

Figure 1 Gridlet selection process
The project was successfully implemented on
the local area network (LAN) of the organization with four clients and one grid server. The application will work for any
number of nodes connected in a network.
CONCLUSION
Organisations greatly under-utilise their technology equipment and services in
relation to their capacity. Statistics quoted suggest that PCs are used at only 5% to
10% of capacity, and servers at about 20% of capacity. By making use of these idle
processing cycles of the clients, the search operation was performed using grid
computing. This results in increased speed, reduced search time and lower cost.
In the future, apart from searching in a LAN, grid-computing appliances could be
plugged into the Internet, tapping into thousands of high-performance
computers, allowing us to do word processing, spreadsheet calculations, email,
and so on with a very low-cost computing device.
REFERENCES