Background
Traditional computing approaches are no longer adequate for the tens of petabytes of data generated each year in healthcare settings. Health data comes from diverse sources, such as electronic health records, medical imaging, wearable devices, genomic sequencing, clinical research, and even social media, each with its own frequency and format. This combination of volume, variety, and velocity makes health data a typical big data scenario, in which structured and unstructured data must be aggregated and analysed to support clinical and administrative decisions.
Distributed computing technologies are well suited to dealing with the increasing volume and complexity of healthcare data. By distributing data and processing across multiple machines, healthcare organisations can integrate diverse data sources, implement real-time services, and analyse massive amounts of historical data to improve patient treatment and outcomes, as well as operational efficiency.
About This Course
The course is composed of seven modules covering foundational big data concepts, such as the “3Vs” definition (volume, velocity, and variety) and the different types of data and their sources, together with essential distributed computing principles, including data partitioning, data and model parallelism, fault tolerance, and horizontal scalability. All topics are contextualised with healthcare examples and demonstrated through specialised tools.
Course participants will develop hands-on experience with three key distributed computing technologies: Hadoop, Spark, and Kafka. Hadoop provides a distributed file system (HDFS) and a programming model (MapReduce) for batch processing of large healthcare datasets. Spark offers specialised libraries for data engineering, graph processing, and machine learning, supporting descriptive and predictive applications. Kafka enables real-time pipelines for data integration and for clinical monitoring and alerting.
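As a taste of the hands-on sessions, the following minimal PySpark sketch aggregates a hypothetical file of vital-sign readings with Spark's DataFrame API; the file name vital_signs.csv and the columns patient_id and heart_rate are illustrative assumptions, not course materials.

    # Minimal PySpark sketch: the dataset and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("VitalSigns").getOrCreate()

    # Load the readings into a DataFrame partitioned across the cluster.
    readings = spark.read.csv("vital_signs.csv", header=True, inferSchema=True)

    # Compute each patient's average heart rate in parallel.
    avg_hr = readings.groupBy("patient_id").agg(
        F.avg("heart_rate").alias("avg_heart_rate")
    )
    avg_hr.show()

    spark.stop()

The same DataFrame could equally be registered as a temporary view and queried with SQL, one of the approaches covered in the Spark SQL and DataFrames module.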
Learning Objectives
After completing this course, participants will be able to:

- Define key big data concepts, including the “3Vs” (volume, velocity, and variety), and identify the main types and sources of health data.
- Explain essential distributed computing principles such as data partitioning, data and model parallelism, fault tolerance, and horizontal scalability.
- Install and use Hadoop, Spark, and Kafka to build batch, analytical, and real-time processing pipelines for health data.
Intended Audience
This course is aimed at anyone interested in the intersection of data science and healthcare who wants to gain hands-on experience with distributed computing tools applied to health data. The emphasis is on orchestrating big data pipelines rather than interpreting results. Some experience with Python programming may be helpful but is not required.
Distributed Computing for Health Data
Big Data: Key Concepts and Terminology (9:47)
Big Data Sources (10:06)
Distributed Computing (7:48)
Big Data Processing Tools – Hadoop (9:12)
Hadoop Installation & Basic Usage (12:04)
Big Data Processing Tools – Spark Core Components (13:37)
Big Data Processing Tools – Spark SQL and DataFrames (6:42)
Spark Installation & Basic Usage (15:33)
Big Data Processing Tools – Kafka (10:49)
Kafka Installation & Basic Usage (20:19)
Review
Quiz (10 Questions)
Feedback