Kamalakar Karlaplem Ng Moon Pun University of Science and Technology Department of Computer Science Clear Water Bay, Kowloon, Hong Kong e-mail: firstname.lastname@example.org Phone: +852 2358 6991 Fax: +852 2358 1477
A major cost in executing queries in a distributed database system is the data transfer cost incurred intransferring multiple database objects (fragments) accessed by a query from di erent sites to the site where the query is initiated. The objective of a data allocation algorithm is to locate the fragments at different sites so as to minimize the total data transfer cost incurred in executing a set of queries. We develop a site-independent fragment dependency graph representation to model thedependencies among the fragments accessed by a query, and use it to formulate and solve data allocation problems for distributed database systems based on (query-site and move-small) query execution strategies. We show that an optimal solution can be achieved when the query-site query execution strategy is employed, and for the move-small query execution strategy we performed experimental evaluation aboutthe e ectiveness of a hillclimbing heuristic algorithm in achieving a near-optimal solution.
Keywords: data allocation, distributed database systems, query processing, distributed database design, hill-climbing heuristics, optimal allocation, max- ow mincut problem, network ow algorithms.
research has been funded by RGC CERG grant HKUST 609/94E.
Adistributed database is a collection of data which belongs logically to the same system but is physically spread over the sites of a computer network. A typical user can access the complete database from any site, a distributed database system processes and executes an user's query by accessing the data from multiple sites. A major component of query execution cost is the data transfer cost. That is, thetotal cost involved in moving the data from the sites where it is located to the site where the query is issued. It is desirable to reduce the amount of data transfer This can be achieved by optimally allocating the data (i.e. database objects or fragments) to the sites of the distributed database system. Optimal allocation of fragments is a complex problem because of mutual interdependency betweenallocation scheme (which gives the location of each of the fragments at various sites of a distributed database system) and query optimization strategy (which decides how a query can be optimally executed given an allocation scheme). We model the dependencies between the fragments and the amount of data transfer incurred to execute a query based on a query execution strategy as site independentfragment dependency graphs. We use these fragment dependency graphs to aid in coming up with an optimal or near optimal allocation of fragments. We call this approach to data allocation problem as query-driven data allocation in distributed database systems. Figure 1 shows the steps in query-driven approach developed in this paper for the data allocation problem. There are two aspects to thisproblem. The rst aspect is to represent and evaluate the set of queries accessing the distributed database. The second aspect is to use this information to come up with a formulation and solution for the data allocation problem. The distributed query processor is used to process a given set of queries. This query processing 10] consists of decomposing the queries and data localization. Data localizationinvolves identifying the fragments accessed by the query and generating the fragment query operator trees. These fragment query operator trees are further processed by taking into consideration a given distributed query execution strategy and the information about the fragment sizes to generate fragment dependency graphs. The fragment dependency graph models the dependencies between the...