Protected Data Enclave Environments with PRDN

What is a Protected Network (PN) or a Protected Research Data Network (PRDN)?

A Protected Network (PN) is Duke’s name for its secure storage and processing infrastructure. A PN is a secure network environment (protected by firewalls, intrusion protection systems, and network ACLs) with dedicated virtual compute environments. It allows university users to store and analyze sensitive data, including administrative data such as HR, business operations, and academic records, as well as research and academic data. A Protected Research Data Network (PRDN) is a service designed and provided by Duke that leverages the PN and adds a layer to manage research and academic operations with regulatory or compliance requirements. Duke, one of the key contributors to the ImPACT project, has generously contributed its design and expertise with PNs and PRDNs to that project. The ImPACT project refers to the elements of the PRDN that it adopted and generalized for its own purposes as a Protected Data Enclave (PDE). What follows is an overview of the PDE architecture from Duke’s perspective, as used by the ImPACT project.

Respondent confidentiality concerns, proprietary and other data privacy concerns, regulatory restrictions, and data use agreements or other licensing terms impose data security obligations, which Duke manages administratively and technically in the PDE. The PDE also provides access to a high-performance computing environment that allows secure and efficient processing of large data sets. This makes it attractive to research teams while providing a degree of risk management that both external partners and the university administration find reassuring.

Access to the Duke PN is restricted to either Remote Desktop or Proconsul (https://github.com/carte018/Proconsul). Proconsul allows web-based access to Windows servers and creates dynamic credentials for a user on the Windows servers inside of the PN, reducing the risk of pass-the-hash compromises. Access to the PDE is provided through RDP or FastX with limited permissions.

Ideally, a PN is constructed dynamically, using automation to provision the firewall rules (network and host), the network itself (including subnets and ACLs), and the VMs, software, and storage needed. The firewall rules are fairly consistent between different instances of a PN: they allow administrative systems to access hosts inside of the PN for patching or for services like DNS or LDAP. A fully automated PN deployment also reduces the temptation to add a host to an existing protected enclave merely to shorten the time needed to deploy a new server.
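As a rough illustration of why consistent firewall rules lend themselves to automation, the following Python sketch generates the same baseline rule set for any PN subnet. It is not Duke's provisioning code. The Rule type, the baseline_rules function, the server addresses (taken from documentation address ranges), the ports, and the direction of each rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Rule:
    source: str   # CIDR allowed to initiate the connection
    dest: str     # CIDR receiving the connection
    proto: str
    port: int
    purpose: str


# Hypothetical shared administrative systems (documentation address range).
PATCH_SERVER = "192.0.2.10/32"
DNS_SERVER = "192.0.2.53/32"
LDAP_SERVER = "192.0.2.89/32"


def baseline_rules(pn_subnet: str) -> list[Rule]:
    """Return the allow rules every PN instance starts with.

    Anything not listed is assumed to be denied by default.
    """
    subnet = str(ip_network(pn_subnet))  # raises ValueError on a malformed CIDR
    return [
        Rule(PATCH_SERVER, subnet, "tcp", 22, "patching"),
        Rule(subnet, DNS_SERVER, "udp", 53, "DNS"),
        Rule(subnet, LDAP_SERVER, "tcp", 636, "LDAP over TLS"),
    ]


# Two PN instances get identical rules apart from their own subnet.
rules_a = baseline_rules("10.20.30.0/24")
rules_b = baseline_rules("10.20.31.0/24")
for rule in rules_a:
    print(f"{rule.purpose}: {rule.source} -> {rule.dest} {rule.proto}/{rule.port}")
```

Because the rule set is a function of the subnet alone, standing up a new PN costs no more manual effort than reusing an existing one, which is the point made above about removing the temptation to share enclaves.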

Automating the provisioning of a VM with base software is relatively easy, but allowing researchers to build more complex software environments while at the same time limiting access to the public internet can be challenging. Within the PN, we have used container deployment methods for both Singularity containers (singularity.lbl.gov) and Docker containers (docker.com). These rely on continuous integration (CI) services provided by our GitLab server to dynamically build containers outside of a PN and then allow them to be moved into the PN. The goal is to restrict data transfers into and out of the PN consistent with data use agreements, while still allowing researchers access to the tools (and the websites and support communities) they need to conduct their analysis. Examples of external sources include Docker build scripts, patches, and discipline-specific repositories scattered across the internet, as well as repositories like the Comprehensive R Archive Network (CRAN), which is mirrored at Duke but is outside of the PN. Researchers can develop their methods and toolsets outside of a PN and then move them into it when ready.
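The text does not describe how a built container is checked when it is moved into the PN. One plausible control, shown here purely as an assumption, is to compare the image's SHA-256 digest against a digest recorded when the CI job built it. The function names sha256_of and may_enter_pn and the file name analysis.sif are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large container images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def may_enter_pn(image: Path, recorded_digest: str) -> bool:
    """Accept the image only if it still matches the digest recorded at build time."""
    return sha256_of(image) == recorded_digest.lower()


# Demonstration with a stand-in file instead of a real container image.
with tempfile.TemporaryDirectory() as workdir:
    image = Path(workdir) / "analysis.sif"     # hypothetical image name
    image.write_bytes(b"stand-in for a container image")
    recorded = sha256_of(image)                # what the CI job would publish
    accepted = may_enter_pn(image, recorded)   # unchanged image passes
    image.write_bytes(b"modified after the build")
    rejected = not may_enter_pn(image, recorded)  # altered image fails

print("unchanged image accepted:", accepted)
print("altered image rejected:", rejected)
```

A check of this kind only shows that the image was not altered after the build. It says nothing about whether the contents comply with a data use agreement, which remains an administrative decision.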

Once the container is built and copied onto an appropriate server inside of a PN, the user can run it and have access to the full toolset they built outside. In some cases a proxy server is needed in addition to our build methodology, namely when an analysis running inside of the PN requires live lookups of public data, for example retrieving live geo tags from public sources.
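The proxy case can be pictured as an allowlist decision on each outbound request. The sketch below is a minimal illustration in Python, not a description of the proxy Duke runs. The host geo.example.org, the HTTPS-only requirement, and the function name proxy_allows are assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical public data source approved for live lookups from inside a PN.
ALLOWED_HOSTS = frozenset({"geo.example.org"})


def proxy_allows(url: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """Permit a request only if it uses HTTPS and targets an approved host exactly."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed


requests_to_check = [
    "https://geo.example.org/tags?lat=36.0&lon=-78.9",  # approved host
    "http://geo.example.org/tags",                      # wrong scheme
    "https://geo.example.org.evil.test/tags",           # look-alike host
    "https://files.example.net/upload",                 # unapproved host
]
decisions = [proxy_allows(url) for url in requests_to_check]
for url, allowed_flag in zip(requests_to_check, decisions):
    print("ALLOW" if allowed_flag else "DENY ", url)
```

Matching the full host name rather than a substring matters here, since a look-alike domain that merely begins with an approved name should not be let through.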

Overall, a PN is a simple construct that allows users to access sensitive data in a secure fashion while minimizing the friction typically associated with operating on a remote service, whether by reproducing the desktop model when appropriate or by giving researchers access to the same tools they use when working with less sensitive data. The key to achieving this is automating as much of the deployment as possible, from building the networks, firewalls, and VMs to the tooling researchers use to build, update, and deploy their software. The difficult problem is the variety of data security requirements imposed by data providers: automated provisioning that meets many different sets of requirements is harder than automating a single configuration.

In the PDE, Duke has coupled the secure infrastructure provided by the PN with operational, administrative and legal support for researchers on the processes involved in restricted data acquisition, use, and compliance. The PN is used for research with fewer formal compliance requirements and for internal administrative data such as HR, business operations, and academic records.