Abstract:
Transferring data between different hospitals is often restricted, and federated analysis of
clinical data is a viable alternative. Existing federated analytics frameworks are often limited
in the types of input data they can process or the analyses they can perform. In the Personal
Health Train paradigm, the analysis algorithm (wrapped in a ’train’) travels between multiple
sites (e.g., hospitals, the so-called ’train stations’) that host the data in their protected
infrastructure, and only results, not the data, are transferred. Within the established infrastructure of the
German Medical Informatics initiatives, patients’ structured pseudonymized clinical data is
stored in FHIR servers at Data Integration Centers based on the HL7/FHIR profiles of the
German National Core Data Set.
Implementing trains as secured containers enables complex data analysis workflows, e.g.,
genomics pipelines or deep-learning algorithms, to travel between sites; such analytic methods
are generally not easily amenable to federated analysis. We present PHT-meDIC, a productively
deployed, interoperable, open-source implementation of the Personal Health Train paradigm. The scope
of applications for this platform ranges from machine learning algorithms to sophisticated
omics and image analysis with arbitrary input data. Light-weight virtualization permits the
automated deployment of complex data analysis pipelines (e.g., genomics, image analysis)
across multiple hospitals in a secure and scalable manner. We combine different open-source
third-party services with several custom-developed services. The separation into distinct services
allows flexible adaptation and extension in a scalable form. We achieve constant monitoring
and persistent execution of trains and provide governance template documents for deployment.
In our proposed security protocol, hospitals have pseudo-identifiers within the infrastructure
and can only access their own repository, which makes inference attacks less likely.
Results are always encrypted at rest. Only participating sites and the submitting user can access
them. Manipulation of trains will be detected at any stage.
Furthermore, researchers can use additional privacy mechanisms (e.g., the Paillier cryptosystem).
Trains are executed within an encapsulated environment using project-specific FHIR servers
or data warehouses. We successfully deployed the implementation for distributed analyses
of large-scale data. In the Leuko-Expert project, our platform has been extended for
interoperability with the architecture of other Medical Informatics Initiative partners.
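
The additional privacy mechanism named above, the Paillier cryptosystem, is additively homomorphic: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, so per-site counts can be aggregated without any single count being revealed. The following is a minimal textbook sketch of that property, not the PHT-meDIC implementation. The function names (keygen, encrypt, decrypt, add_encrypted), the tiny demo primes, and the example site counts are all assumptions chosen for illustration; the primes are far too small to be secure.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two distinct primes (assumed inputs)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)      # Carmichael function of n
    g = n + 1                         # common simple choice of generator
    # With g = n + 1, L(g^lam mod n^2) equals lam mod n, so mu is its inverse.
    mu = pow(lam, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m: int) -> int:
    """Encrypt an integer 0 <= m < n under the public key."""
    n, g = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1   # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c: int) -> int:
    """Recover the plaintext using L(x) = (x - 1) // n."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(pub, c1: int, c2: int) -> int:
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pub
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    # Hypothetical scenario: three stations each contribute an encrypted count.
    pub, priv = keygen(1009, 1013)         # toy primes, NOT secure
    running_total = encrypt(pub, 0)
    for site_count in (17, 25, 8):
        running_total = add_encrypted(pub, running_total, encrypt(pub, site_count))
    print(decrypt(pub, priv, running_total))
```

In such a scheme only the holder of the private key can read the aggregate, while each station sees nothing but ciphertexts. A real deployment would rely on a vetted cryptographic library and key sizes of at least 2048 bits.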