Architecture Overview

In the last post we enumerated the elements of ImPACT and now we’d like to describe the architecture in more detail.

Broadly, the architecture envisions four classes of principals or actors:

  1. First and foremost are the researchers or research teams who wish to perform analysis on some number of privacy-protected data sets.
  2. There are one or more data providers that wish to allow their data to be used for analysis without relinquishing control over it, and with assurances that their policies governing data use are honored.
  3. There are infrastructure providers (institutional or public cloud) that are capable of instantiating protected enclaves in which data processing can take place.
  4. Finally, there are representatives of institutional governance responsible for setting and enforcing data collection, curation and storage policies at a given academic institution.

All of them have a part to play in the multi-institutional collaborative research that ImPACT enables.


The figure below shows a somewhat simplified way in which the principals interact with the major elements of the architecture:

The major architectural elements include:

  • Virtualizable infrastructure on which the infrastructure provider can instantiate an enclave at the request of the research team. The virtualized servers in the enclave can be remotely accessed by the research team and can also be reconfigured with new analysis software.
  • Dataverse web application (http://dataverse.org) that is used by the data providers to register the protected datasets so they can be discovered by researchers. Critically, the data remains under control of the data provider.
  • Notary Service, which helps the principals negotiate data use agreement (DUA) policies. Data providers can register their policies with the service, and other principals (researchers, infrastructure providers, representatives of institutional governance) can make digitally signed statements, or attestations, confirming compliance with various parts of the policies. Statements recorded by the Notary Service can then be used to automatically grant or deny requests for access to the protected data, whether made by researchers directly or on their behalf from the enclave.

In this post we will provide further introduction to these architectural elements, leaving technical details to follow-up posts and deployment documentation.

 

Virtualizable Infrastructure

Creating isolated enclaves in virtualized infrastructure is a relatively well-understood process. It requires setting up appropriate access control rules on the hosts and on the networks to which the hosts belong in order to guarantee that only trusted individuals can access the hosts from a controlled set of IP addresses and that protected data can’t leave the enclave.
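
To make the two requirements concrete, here is a minimal sketch of the checks such access control rules express. It is an illustration only, not the ImPACT configuration: the network ranges and the function names inbound_allowed and outbound_allowed are assumptions, and a real deployment would enforce these rules in firewalls and host policy rather than in application code.

```python
import ipaddress

# Hypothetical allowlist: networks from which enclave hosts may be reached.
ALLOWED_SOURCES = [ipaddress.ip_network("192.0.2.0/24")]

# Hypothetical enclave-internal network; protected data may only flow
# to destinations inside it.
ENCLAVE_NETWORK = ipaddress.ip_network("10.20.0.0/16")


def inbound_allowed(source_ip: str) -> bool:
    """Requirement 1: hosts are reachable only from controlled addresses."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_SOURCES)


def outbound_allowed(dest_ip: str) -> bool:
    """Requirement 2: traffic may not leave the enclave network."""
    return ipaddress.ip_address(dest_ip) in ENCLAVE_NETWORK
```

Note that the address check covers only the "where from" half of the first requirement; establishing that the person connecting is a trusted individual is an authentication question handled separately.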

These two requirements are in tension with making the enclave infrastructure easy for researchers to use. Researchers want the enclave virtual environment to have a ‘desktop’ look and feel, and they want the ability to customize it to the needs of the project.

Two essential elements of the ImPACT solution to this problem are:

  • A browser-based remote desktop solution, called Proconsul, that provides responsive desktop access to enclave hosts running either Windows or Linux and uses federated identity solutions to authenticate and authorize users to specific hosts. Its advantages are simplicity of use, cross-platform support for both end-user clients and enclave hosts, and a high level of security.
  • A Singularity-based software pipeline that allows researchers to customize their enclave host environments by easily adding new trusted analysis tools. The tools can be built by the researchers themselves or by the organization’s IT personnel. They are automatically built and tested outside the enclave environment, digitally signed, and then securely made available in a repository within the enclave for researchers to use.
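
As a rough illustration of the last step of the software pipeline, the sketch below checks a tool image against a manifest of expected digests before it is offered inside the enclave. This is a simplification under stated assumptions: it is a digest check, not signature verification, and it assumes the manifest itself has already been verified against the signer's public key by tooling not shown here. The names sha256_hex, image_is_trusted and the manifest layout are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the image contents as hex."""
    return hashlib.sha256(data).hexdigest()


def image_is_trusted(image_bytes: bytes, image_name: str,
                     manifest: dict[str, str]) -> bool:
    """Accept an image only if its digest matches the manifest entry.

    `manifest` maps image names to expected SHA-256 hex digests and is
    assumed to have been signature-checked beforehand (not shown).
    """
    expected = manifest.get(image_name)
    if expected is None:
        return False  # unknown tools are never trusted
    return hmac.compare_digest(expected, sha256_hex(image_bytes))
```

An image that was modified after the build, or that was never listed in the manifest, is rejected in both cases.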

A special type of enclave based on Secure Multiparty Computation (SMC), which relies on cryptographic communications to selectively expose aggregate information about the data in data provider stores, will be addressed in a separate series of blog posts.

 

Dataverse

Dataverse is a widely used system that lets researchers publish social-science and other types of datasets and makes them discoverable by other researchers. Multiple instances of Dataverse can federate with each other to create a web of connections linking different datasets. Datasets can be searched by various types of metadata, and Dataverse extracts some common statistical information about the data to help researchers judge its relevance to a particular research effort.

An important aspect that ImPACT is addressing is providing Dataverse with the ability to extract searchable metadata information and various indexes from data stored on data provider infrastructure without the need to upload it into a Dataverse instance - an essential requirement, since data providers don’t wish to relinquish control of the data.

To address this problem, ImPACT is extending the functionality of Dataverse to support indexing such datasets remotely by creating the Dataverse Trusted Remote Storage Agent (TRSA).

TRSA is a piece of software that a data provider may download from a trusted repository and use to selectively index the datasets in its infrastructure, explicitly specifying the types of metadata to extract and communicating this metadata to a selected instance of Dataverse.

This process makes the protected dataset discoverable in the selected Dataverse without the data provider relinquishing control over its access. Instead of the dataset itself, Dataverse retains a pointer to the data stored within the data provider infrastructure.
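
The idea of selective indexing can be sketched as follows. This is not the TRSA implementation or its metadata format: the function build_metadata_record, the record layout and the example pointer are all assumptions made for illustration. The point it shows is that only the summaries the data provider explicitly asks for, plus a pointer, leave the provider's infrastructure, while the rows themselves never do.

```python
import csv
import io


def build_metadata_record(csv_text, dataset_pointer, fields_to_describe):
    """Summarize only the listed fields; never copy rows into the record."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    record = {
        "pointer": dataset_pointer,   # where the data stays (hypothetical)
        "row_count": len(rows),
        "variables": {},
    }
    for field in fields_to_describe:
        # Skip empty or missing cells so they do not break the summary.
        values = [float(r[field]) for r in rows if r[field] not in ("", None)]
        summary = {"count": len(values)}
        if values:
            summary["min"] = min(values)
            summary["max"] = max(values)
        record["variables"][field] = summary
    return record


# Example: the provider chooses to expose summaries for "age" only.
example = build_metadata_record(
    "age,income\n34,52000\n51,61000\n29,\n",
    "provider.example/datasets/survey",  # hypothetical pointer
    ["age"],
)
```

Here the "income" column is never described, because the provider did not list it.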

A researcher may attempt to download or otherwise access the data using the pointer, and a special access control agent will validate that all requirements of the data access policy, such as the DUA, have been satisfied before granting access.

 

Notary Service

The Notary Service serves as a record keeper in the architecture. It is aware of the DUA access policy requirements spelled out by the data provider for a specific dataset. A researcher (who discovered the dataset via Dataverse) is directed to the Notary Service, which then collects digital attestations from the researcher, the infrastructure provider, and the representatives of institutional governance. Simple examples of such attestations can be:

  • That infrastructure compliant with data provider requirements has been created - made by the infrastructure provider
  • That the research team has undergone training on the infrastructure - made by all members of the research team
  • That the project has approval from the institution - made by a representative of institutional governance.

Real-life examples of such attestations can be found in [1].

The Notary Service reports how far a given DUA policy has been satisfied, provides an audit trail of the attestations made for it so far, and prompts and reminds principals about the attestations that are still required.

The attestations, digitally signed and recorded by the Notary Service, are then used by the access control agent of the data provider to make decisions about granting or denying access to specific datasets by specific researchers.
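
To illustrate how recorded attestations could drive both the "what is still missing" view and the final access decision, here is a minimal sketch. It is not the Notary Service data model: the Attestation class, the policy dictionary, the role names and both functions are hypothetical, and verification of the digital signatures is deliberately left out (the sketch assumes every recorded attestation has already been verified).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    statement: str   # what is being attested
    made_by: str     # identity of the principal who signed it
    role: str        # role in which the principal attested


# Hypothetical DUA policy: each statement and the role that must attest to it.
POLICY = {
    "compliant-infrastructure-created": "infrastructure_provider",
    "project-approved-by-institution": "institutional_governance",
    "research-team-trained": "researcher",  # required from every team member
}


def remaining_attestations(policy, team, recorded):
    """List (statement, who) pairs that still need to be attested."""
    remaining = []
    for statement, role in policy.items():
        signers = {a.made_by for a in recorded
                   if a.statement == statement and a.role == role}
        if role == "researcher":
            # Every member of the research team must attest individually.
            remaining.extend((statement, m) for m in sorted(set(team) - signers))
        elif not signers:
            remaining.append((statement, role))
    return remaining


def access_granted(policy, team, recorded, requester):
    """Grant access only to team members once nothing remains outstanding."""
    return requester in team and not remaining_attestations(policy, team, recorded)
```

With a team of two researchers where only one has attested to the training, remaining_attestations returns a single outstanding item for the other researcher, and access_granted stays False for the whole team until that attestation is recorded.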

 

Conclusion

This concludes our brief introduction to the ImPACT architecture. In the next few posts we will provide more detailed descriptions of each of the components described today and begin discussing how the various pieces fit together to respond to the different use case scenarios.

References

[1] https://www.psc.isr.umich.edu/dis/data/dc/grid.html