Background
Traditional computing approaches are no longer adequate for the tens of petabytes of data generated each year in healthcare settings. Health data comes from diverse sources, such as electronic health records, medical imaging, wearable devices, genomic sequencing, clinical research, and even social media, each with its own frequency and format. This combination of volume, variety, and velocity makes health data a typical big data scenario, in which structured and unstructured data must be aggregated and analysed to support clinical and administrative decisions.
Distributed computing technologies are well suited to dealing with the increasing volume and complexity of healthcare data. By distributing data and processing across multiple machines, healthcare organisations can integrate diverse data sources, implement real-time services, and analyse massive amounts of historical data to improve patient treatment and outcomes, as well as operational efficiency.
About This Course
The course is composed of seven modules covering foundational big data concepts, such as the “3Vs” definition (volume, velocity, and variety) and the different types of data and their sources, together with essential distributed computing principles, including data partitioning, data and model parallelism, fault tolerance, and horizontal scalability. All topics are contextualised with healthcare examples and demonstrated through specialised tools.
Course participants will develop hands-on experience with three key distributed computing technologies: Hadoop, Spark, and Kafka. Hadoop provides a distributed file system (HDFS) and a programming model (MapReduce) for batch processing of large healthcare datasets. Spark offers specialised libraries for data engineering, graph processing, and machine learning, supporting descriptive and predictive applications. Kafka enables real-time pipelines for data integration and for clinical monitoring and alerting.
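As a taste of the hands-on sessions, the following minimal PySpark sketch aggregates a hypothetical file of vital-sign readings with Spark's DataFrame API; the file name vital_signs.csv and the columns patient_id and heart_rate are illustrative assumptions, not course materials.

    # Minimal PySpark sketch: the dataset and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("VitalSigns").getOrCreate()

    # Load the readings into a DataFrame partitioned across the cluster.
    readings = spark.read.csv("vital_signs.csv", header=True, inferSchema=True)

    # Compute each patient's average heart rate in parallel.
    avg_hr = readings.groupBy("patient_id").agg(
        F.avg("heart_rate").alias("avg_heart_rate")
    )
    avg_hr.show()

    spark.stop()

The same DataFrame could equally be registered as a temporary view and queried with SQL, one of the approaches covered in the Spark SQL and DataFrames module.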
Learning Objectives
After completing this course, participants will be able to:

- Define key big data concepts, including the “3Vs” (volume, velocity, and variety), and identify the main types and sources of health data.
- Explain essential distributed computing principles such as data partitioning, data and model parallelism, fault tolerance, and horizontal scalability.
- Install and use Hadoop, Spark, and Kafka to build batch, analytical, and real-time processing pipelines for health data.
Intended Audience
This course is aimed at anyone interested in the intersection of data science and healthcare who wants to gain hands-on experience with distributed computing tools applied to health data. The emphasis is on orchestrating big data pipelines rather than interpreting results. Some experience with Python programming may be helpful but is not required.
Distributed Computing for Health Data
Big Data: Key Concepts and Terminology (9:47)
Big Data Sources (10:06)
Distributed Computing (7:48)
Big Data Processing Tools – Hadoop (9:12)
Hadoop Installation & Basic Usage (12:04)
Big Data Processing Tools – Spark Core Components (13:37)
Big Data Processing Tools – Spark SQL and DataFrames (6:42)
Spark Installation & Basic Usage (15:33)
Big Data Processing Tools – Kafka (10:49)
Kafka Installation & Basic Usage (20:19)
Review
Quiz (10 Questions)
Feedback