Data grids aim to provide suitable solutions for data-intensive applications by means of grid-based tools. A data grid is specifically designed to store, manage, and provide reliable access to data.
Following basic grid principles, the environment is characterized by its heterogeneity. In a data grid, this includes different storage systems, data access mechanisms, data access policies, and data formats. The data grid management infrastructure must act as an abstraction layer that provides a common, standard, and efficient procedure for accessing the stored information. Although data grids allow heterogeneous data resources to be shared, only a few research efforts in the field aim to increase the performance of these solutions.
The aim of MAPFS-Grid is to develop a complete suite of services for high-performance access to huge volumes of data in a grid environment.
Different needs arise in data management and access in grids. We have identified three important aspects related to these needs:
MAPFS-Grid is a generic framework in which all these problems can be addressed by defining different services, each suited to one of the three identified scenarios. Together, these scenarios cover the needs of most data grid-based applications. All the developed services take advantage of two levels of parallelism. Moreover, the services provided by MAPFS-Grid are incorporated within the generic architecture of a grid that makes use of MAPFS.
Our first proposal is to provide a grid-like interface to MAPFS. This WSRF-compliant service, named PDAS, allows parallel I/O operations to be performed in a cluster environment. The service is derived from the Data Access and Integration Service (DAIS); PDAS adapts this concept with a focus on performance and parallelism.
The two levels of parallelism provided by PDAS are shown in the Figure. Level 1 parallelism is provided by several PDAS instances (one in every storage element), which support a distributed data repository. Level 2 parallelism is offered directly by MAPFS in those storage elements that are clusters. As the Figure shows, the data to be transferred are divided into blocks, which are sent to each storage element. If the storage element is a cluster, these data blocks are further divided internally and sent to each node.
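The two-level division described above can be illustrated with a minimal sketch. It is not the PDAS or MAPFS implementation: the function names (`split_evenly`, `two_level_split`, `reassemble`) and the even, contiguous splitting policy are assumptions made purely for illustration, since the actual distribution policy is not specified here.

```python
from typing import List


def split_evenly(data: bytes, parts: int) -> List[bytes]:
    """Split data into `parts` contiguous blocks whose sizes differ by at most one byte."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    base, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for i in range(parts):
        # The first `extra` blocks absorb the remainder, one byte each.
        size = base + (1 if i < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks


def two_level_split(data: bytes, nodes_per_element: List[int]) -> List[List[bytes]]:
    """Hypothetical two-level layout.

    Level 1: one block per storage element.
    Level 2: each block is subdivided across the nodes of that element
    (a non-cluster element is modelled as having a single node).
    """
    level1 = split_evenly(data, len(nodes_per_element))
    return [split_evenly(block, n) for block, n in zip(level1, nodes_per_element)]


def reassemble(layout: List[List[bytes]]) -> bytes:
    """Concatenate sub-blocks in order to recover the original data."""
    return b"".join(b"".join(sub_blocks) for sub_blocks in layout)


if __name__ == "__main__":
    payload = bytes(range(10))
    # One single-node storage element and one three-node cluster.
    layout = two_level_split(payload, [1, 3])
    print([[len(s) for s in element] for element in layout])
    assert reassemble(layout) == payload
```

In this sketch each storage element receives one level 1 block, and only elements with more than one node subdivide it further, which mirrors the distinction between cluster and non-cluster storage elements.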
The main advantage of PDAS is that it is a WSRF-compliant grid service that provides reasonably good performance and is easy to deploy in a grid scenario where the main components are clusters of workstations.
The performance of the GridFTP server can be optimized by modifying one of its modules, the Data Storage Interface (DSI), which is responsible for reading from and writing to the local storage system. We have taken advantage of the flexibility of the GridFTP server to replace its I/O operations with MAPFS I/O routines. The result is MAPFS-DSI, which enables GridFTP clients to read and write data in a storage system based on MAPFS. Since the architecture targeted by MAPFS is a cluster of workstations, the GridFTP server should be the master node of a cluster of workstations on which MAPFS is installed.
MAPFS-DSI is embedded within the general scenario in which GridFTP is used. As the Figure shows, two independent parts of the architecture can improve the performance of a data transfer operation, both from client to server (write operations) and from server to client (read operations). The first is the set of GridFTP-specific features (TCP stream parallelism and striping), which can be used in any GridFTP server. The second is the parallel access provided by MAPFS. The use of MAPFS within the GridFTP server therefore offers two levels of data parallelism and prevents the server storage system from becoming a bottleneck in the overall data transfer process. MAPFS-DSI offers great flexibility, since both levels of parallelism can be combined in different configurations.
MAPFS-DAI is an extension of the OGSA-DAI architecture whose aim is to improve performance. As the Figure shows, the MAPFS-DAI architecture is divided into four layers:
Therefore, the main advantage of MAPFS-DAI is its interoperability. Every storage element that exposes the OGSA-DAI interface can be used together with MAPFS-DAI elements. Because they share this common interface, several storage systems can be accessed in parallel.
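The idea of accessing several storage systems in parallel through one shared interface can be sketched as follows. This is only an illustration under stated assumptions: the `StorageElement` protocol, the `InMemoryElement` class, and the `parallel_read` function are hypothetical names, and none of them corresponds to the OGSA-DAI or MAPFS-DAI API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Protocol


class StorageElement(Protocol):
    """Hypothetical common interface; stands in for a shared access interface."""

    def read(self, name: str) -> bytes:
        ...


class InMemoryElement:
    """Toy storage element holding its fragment of each named resource in memory."""

    def __init__(self, fragments: Dict[str, bytes]) -> None:
        self._fragments = dict(fragments)

    def read(self, name: str) -> bytes:
        return self._fragments[name]


def parallel_read(elements: List[StorageElement], name: str) -> bytes:
    """Read the fragment of `name` from every element concurrently and join them in order."""
    if not elements:
        return b""
    with ThreadPoolExecutor(max_workers=len(elements)) as pool:
        # Executor.map returns results in the order of the input elements.
        fragments = list(pool.map(lambda element: element.read(name), elements))
    return b"".join(fragments)


if __name__ == "__main__":
    elements = [
        InMemoryElement({"dataset": b"first-"}),
        InMemoryElement({"dataset": b"second"}),
    ]
    print(parallel_read(elements, "dataset"))
```

Because the caller depends only on the shared `read` method, any element that implements it can take part in the parallel read, regardless of how it stores data internally.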
If you are interested in more details about the MAPFS-Grid tool, you can contact Alberto Sánchez.