Status Meeting: 10GigE Research Network

[phpwiki]
A meeting to discuss the implementation of the 10 Gigabit Ethernet Research Network was held on Friday, February 16, in Rust by the design and implementation group members Tommy Foley (ENG), Puri Bangalore and Fran Fabrizio (CIS), Doug McLean, Phillip Lindley, and John-Paul Robinson (IT).

The 10GigE Research Network (10GigErNET) is a project to facilitate inter-cluster job scheduling between clusters in the CIS and Engineering compute centers and eventually form a high bandwidth pathway between UAB research computing resources and research computing resources at other institutions. It is an important component of the UABgrid grid computing initiative.

!!Three main outcomes of this meeting were:
# A shared understanding of the goals, requirements, and considerations for the initial deployment of the 10GigE link.
# Construction of a diagram to capture the design goals of the 10GigErNET project.
# Recognition of the need to develop a UABgrid security policy that can be applied to resources facing the 10GigE link such that the internal networks of CIS and Engineering remain adequately protected.

!!Action items that emerged from this meeting were:
# Updated quotes reflecting performance emphasis on the Foundry SuperX 10GigE switches. (IT)
# Verify fiber run between CIS and Engineering extends to switch installation points. (IT)
# Cost and feasibility estimates for installing an additional network interface on each cluster for the initial 1GigE connection; each cluster needs three network interfaces (see summary for details). (CIS, ENG, IT)
# Cost and feasibility estimates for installing 10GigE network interfaces on each cluster. (CIS, ENG, IT)
# Development of a UABgrid security policy that adequately protects the new cluster interfaces from external networks, since these clusters will now be multi-homed between CIS and Engineering networks. (CIS, ENG, IT)

!!Summary discussion:

!Switch Performance Emphasis
To determine the needs for the 10GigE switching equipment, we agreed that the emphasis should be on the performance of the switch backplane rather than on features that lead to near-zero maintenance downtime. The general feeling was that ordinary switch maintenance cycles would not impact the ability to stage and manage inter-cluster jobs. This led to the recommendation that the Foundry SuperX switch is likely to meet the needs for the 10GigE switches to be located in CIS and Engineering.

The Foundry website contains [information on their 10GigE switches|http://www.foundrynet.com/technology/10gbe/]. The [SuperX line is described in detail here|http://www.foundrynet.com/products/enterprise/agg-bb-l23/fi-superx.html]. Please be sure to share your comments on the performance configuration of this switch.

!Network Diagram
The process of stepping through the design goals of this project led to the creation of [a diagram|http://lab.ac.uab.edu/files/10GigErNET/10gigernet.png], which seeks to capture the existing architecture and to highlight both the near-term needs of this project and the long-term goal of establishing a high-bandwidth connection through the UAB backbone to other high-performance research networks.
[/phpwiki]

[phpwiki]
This diagram shows the various networks and devices to be considered in this project. Specifically, it shows the 10GigE Research Network linking the CIS and Engineering clusters. This link will consist of two 10GigE switches, one each in CIS and Engineering. These switches will be connected to each other via a direct fiber link. The clusters will connect to the 10GigE switches via 1GigE connections initially with the intention of upgrading the cluster connections to 10GigE as the clusters are upgraded to 10GigE NICs.
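The topology described above can be sketched as a small graph to show what the initial deployment implies for end-to-end bandwidth. This is only an illustrative sketch: the node names (`cis-cluster`, `cis-switch`, and so on) are hypothetical labels, and the link speeds are the nominal figures from the paragraph above (1GigE cluster uplinks, a 10GigE inter-switch fiber link).

```python
from collections import deque

# Nominal link speeds in Gb/s for the initial deployment.
# Node names are hypothetical labels, not real host names.
LINKS_GBPS = {
    ("cis-cluster", "cis-switch"): 1,    # initial 1GigE cluster uplink
    ("cis-switch", "eng-switch"): 10,    # direct 10GigE fiber link
    ("eng-switch", "eng-cluster"): 1,    # initial 1GigE cluster uplink
}


def neighbors(node):
    """Yield every node that shares a link with `node`."""
    for a, b in LINKS_GBPS:
        if a == node:
            yield b
        elif b == node:
            yield a


def find_path(start, goal):
    """Breadth-first search for the shortest path between two nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def bottleneck_gbps(path):
    """Return the slowest nominal link speed along a path."""
    speeds = []
    for a, b in zip(path, path[1:]):
        speed = LINKS_GBPS.get((a, b))
        if speed is None:
            speed = LINKS_GBPS[(b, a)]
        speeds.append(speed)
    return min(speeds)


path = find_path("cis-cluster", "eng-cluster")
print(path, bottleneck_gbps(path))
```

Until the cluster uplinks are upgraded to 10GigE NICs, the nominal end-to-end rate between the clusters is bounded by the 1GigE uplinks rather than by the 10GigE inter-switch link.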

!On-going Performance Tuning and Monitoring
An open question remains as to whether the clusters can properly leverage 10GigE links, or whether they are even capable of saturating a 1GigE link. These questions will need to be answered as part of the post-deployment phase as we determine the combination of hardware and software tuning parameters that best meets the performance goals of each cluster.
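As a rough aid for the post-deployment measurements, the following sketch shows one way to express "saturation" numerically: the achieved bit rate of a transfer divided by the nominal link rate. The function names and the sample figures are assumptions for illustration; they are not measurements from either cluster, and protocol overhead is ignored.

```python
def link_utilization(bytes_transferred, seconds, link_gbps):
    """Return the fraction of nominal link capacity achieved by a transfer."""
    if seconds <= 0 or link_gbps <= 0:
        raise ValueError("seconds and link_gbps must be positive")
    achieved_bps = bytes_transferred * 8 / seconds
    return achieved_bps / (link_gbps * 1e9)


def transfer_seconds(size_bytes, link_gbps, efficiency=1.0):
    """Estimate transfer time at a given fraction of nominal link speed."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    return size_bytes * 8 / (link_gbps * 1e9 * efficiency)


# Hypothetical example: 60 MB moved in one second over a 1GigE link.
print(f"utilization: {link_utilization(60_000_000, 1.0, 1.0):.0%}")

# Nominal time to move 10 GB (10**10 bytes) at each link speed.
for gbps in (1, 10):
    print(f"{gbps} Gb/s: {transfer_seconds(10**10, gbps):.0f} s")
```

A utilization well below 1.0 on the 1GigE link would suggest that host-side tuning, rather than link speed, is the limiting factor.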

!Three Network Connections for each Cluster
Each cluster's existing network connectivity is provided by its head node, which has two network connections. The first connects the cluster to the UAB network via the respective CIS or Engineering departmental network. The second connects the head node to the intra-cluster network, a network private to each cluster that is used to distribute jobs across the compute nodes in the cluster. To also attach these clusters to the 10GigE Research Network, an additional external network link will need to be provisioned for each cluster. Depending on the existing configuration of the cluster, this may involve additional hardware purchases. At a minimum, this will need to be a 1GigE network connection, with the goal of supporting 10GigE links as hardware is installed in each cluster.
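The three-interface requirement can be written down as a simple checklist, which may help when preparing the cost and feasibility estimates. This is a minimal sketch: the role names (`departmental`, `intra_cluster`, `research`) and the example head node are hypothetical and do not describe the actual hardware inventory.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# The three network roles each head node needs, per the summary above.
# Role names are hypothetical labels chosen for this sketch.
REQUIRED_ROLES = {"departmental", "intra_cluster", "research"}


@dataclass
class HeadNode:
    name: str
    # Maps a network role to the nominal interface speed in Gb/s.
    interfaces: Dict[str, int] = field(default_factory=dict)

    def missing_roles(self) -> Set[str]:
        """Return the required roles that have no interface yet."""
        return REQUIRED_ROLES - set(self.interfaces)


# Hypothetical head node with only the two existing connections.
node = HeadNode("example-head", {"departmental": 1, "intra_cluster": 1})
print(sorted(node.missing_roles()))

# Provision the additional link, initially at 1GigE.
node.interfaces["research"] = 1
print(sorted(node.missing_roles()))
```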

!Additional UABgrid resources
For completeness, the diagram includes two UABgrid resources that are currently being developed. The UABgrid Data Staging node will be a shared file storage area that will help manage the distribution of job data to the clusters. It is attached directly to the 10GigE Research Network to maximize data transfer speeds. The UABgrid Science Gateway will be a web-based portal that will facilitate end-user job submission and management to UABgrid compute resources.

The dashed line connecting the 10GigE Research Network to the UAB Network Backbone represents the goal of having unconstrained bandwidth to external research networks. This is only a logical representation of the pathway; implementation decisions are still to be made.

Finally, the dashed boxes labeled "Grid Firewalls" represent an established UABgrid security policy protecting the clusters and their hosting networks from inappropriate access. This is an important design consideration, discussed in more detail below.

!UABgrid Security Policy
The clusters currently rely on the departmental firewalls in CIS and Engineering, which ensure that access to the clusters adheres to department-specific security policies. With the addition of the inter-cluster 10GigE link, there will now be a network that spans directly between CIS and Engineering. This makes it imperative to apply an appropriate security policy to the cluster interfaces newly exposed to this shared network link in order to preserve the security domains of the respective departmental networks. The importance of this security policy and its implementation becomes even more apparent when considering the long-term goal of extending high-bandwidth connectivity directly to external research networks.
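Since the policy itself is still to be developed, the following is only a sketch of the general shape such a policy could take: a default-deny rule set on the research-network interface that admits traffic only from known peers on known ports. Every value here is an assumption for illustration. The address ranges are reserved documentation ranges, not real UAB networks, and the port list is a placeholder rather than a proposal.

```python
import ipaddress

# Hypothetical peers permitted to reach a cluster's research-network interface.
# These are reserved documentation ranges, not real UAB address space.
ALLOWED_SOURCES = [
    ipaddress.ip_network("192.0.2.0/25"),    # placeholder: peer cluster
    ipaddress.ip_network("192.0.2.128/25"),  # placeholder: data staging node
]

# Placeholder port list; the real set would come from the UABgrid policy.
ALLOWED_PORTS = {22}


def is_allowed(src_ip, dst_port):
    """Default-deny check: permit only known sources on known ports."""
    addr = ipaddress.ip_address(src_ip)
    from_known_peer = any(addr in net for net in ALLOWED_SOURCES)
    return from_known_peer and dst_port in ALLOWED_PORTS


print(is_allowed("192.0.2.10", 22))    # known peer, permitted port
print(is_allowed("203.0.113.5", 22))   # unknown source
print(is_allowed("192.0.2.10", 80))    # known peer, port not permitted
```

The default-deny form means that traffic arriving over the shared link gains no access to a departmental network unless a rule explicitly grants it, which matches the goal of preserving each department's security domain.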