The Role of Data Science in Research

The training was very relevant. I am about to start a project that aims to predict phenotype based on genetic data, which I plan to approach using machine learning. I really enjoyed the discussions on pitfalls of machine learning, what makes them effective, what can be expected of them and what can’t be expected of them.

Eric Lucas, Post Doctoral Research Associate, Liverpool School of Tropical Medicine.

In line with these advancements, we have seen growing interest from professionals in academic disciplines outside computer science, who want to know which Data Science tools and techniques they need to prepare for the future, and what the relevant applications are in their area of specialisation.

Working with Liverpool School of Tropical Medicine (LSTM), we set out to address these questions and upskill their Department of Vector Biology in Data Science using Python. Our goal was to provide PhD students and postdoctoral researchers with transferable knowledge and Data Science skills they can apply to their research in Epidemiology and Bioinformatics.

In this article we will provide an overview of:

Essential Data Science techniques researchers need to know
Applications of Data Science in Epidemiology
Case study: A training plan for Liverpool School of Tropical Medicine

It’s worth noting that this Data Science training strategy can be applied in any field. Cambridge Spark Data Science and Machine Learning training programmes are designed to equip individuals with the skills to gather, analyse and interpret structured and unstructured data, in just two days.

Get in touch with us to learn more about the course!

An Introduction to Data Science in Python

The essential Data Science techniques researchers need to know about

To build Data Science capabilities, the first step is to upskill researchers and subject-matter experts in the foundations of Data Science using Python. Widely used techniques to start with are:

Data Science Essentials

Working with Jupyter notebooks
The NumPy library for array manipulation
The pandas library for data manipulation
Data cleaning and pre-processing
Data visualisation with Matplotlib and Seaborn
Applying Principal Component Analysis (PCA) in Python with scikit-learn
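
As a rough illustration of how the essentials above fit together, the sketch below cleans a small table with pandas, standardises it, and applies PCA with scikit-learn. The data is randomly generated for the example, and the column names and the choice of median imputation are assumptions, not part of the course material.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small measurements table (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
df.loc[[3, 17], "b"] = np.nan  # simulate two missing readings

# Data cleaning: fill missing values with the column median.
clean = df.fillna(df.median())

# Pre-processing: standardise each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(clean)

# PCA: project the four columns onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

Standardising before PCA is a common choice when columns are measured on different scales, since otherwise the largest-scale column tends to dominate the first component.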

Unsupervised Learning and Supervised Learning

Unsupervised Learning

The scikit-learn library for Machine Learning, and scikit-learn pipelines
k-means clustering
Hierarchical cluster analysis
Density-based clustering (DBSCAN)
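
A minimal sketch of the first two items, assuming synthetic data from scikit-learn's make_blobs rather than real samples: a pipeline scales the features and then runs k-means. The number of clusters is fixed at three here only because the toy data was generated with three centres.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 150 points drawn around 3 centres (not real samples).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Pipeline: scale the features, then cluster with k-means.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)

# fit_predict runs every pipeline step and returns one cluster label per point.
labels = model.fit_predict(X)

print(len(labels), "points assigned to", len(set(labels)), "clusters")
```

Wrapping the scaler and the clusterer in one pipeline keeps the pre-processing and the model together, so the same steps are applied whenever the model is reused.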

Supervised Learning

The k-Nearest Neighbour algorithm
Overfitting, underfitting, bias-variance tradeoff
Cross-Validation and hyperparameter tuning
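
The three topics above can be sketched together, under the assumption that scikit-learn's bundled iris dataset is an acceptable stand-in for research data. A k-Nearest Neighbour classifier is tuned with 5-fold cross-validation; the candidate values of k are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set that plays no part in tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Small k tends to overfit and large k tends to underfit;
# cross-validation compares the candidates on held-out folds.
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

The final score is computed on data the search never saw, which is what makes it a fair estimate of how the tuned model might generalise.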

Ensemble Models

Decision Trees
Bagging and bootstrapping: the intuition, the concept, the algorithm, and Random Forests in scikit-learn
Boosting classifiers: the intuition, visualisation, and boosting methods in scikit-learn
AdaBoost, XGBoost, LightGBM
Stacking in scikit-learn
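
To make the ensemble topics above more concrete, here is a hedged sketch that compares a single decision tree with a bagging-style model (random forest), a boosting model (AdaBoost) and a stacked model. It uses only scikit-learn and its bundled breast cancer dataset; XGBoost and LightGBM are separate libraries and are left out. The estimator counts and fold counts are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Stacking: a logistic regression learns how to combine the
# predictions of the random forest and the boosted model.
models["stacking"] = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Mean 3-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Which model scores best depends on the dataset, so the comparison is worth repeating on your own data rather than taking any ranking for granted.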

Applications of Data Science in Epidemiology

How researchers can make use of Machine Learning

Current research initiatives are using Machine Learning to detect health threats and to improve the accuracy and efficiency of diagnosis, with the aim of improving patient outcomes. Examples include:

Using Feature Engineering and Feature Selection to identify biomarkers capable of distinguishing between diseases, and to group samples with shared characteristics.
Applying regression models to examine the relationships between disease risk factors and outcomes.
Using random forests to make highly informative predictions for more targeted drug prescriptions.
Using convolutional neural networks (CNNs) for image analysis to detect diseases such as malaria.
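
As a small illustration of the first example, the sketch below runs univariate feature selection with scikit-learn on purely synthetic data. The features here are not real biomarkers, and the choice of an F-test with four selected features is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data only: 200 samples, 20 features, 4 of them informative.
X, y = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=4,
    n_redundant=0,
    shuffle=False,  # informative features are placed in columns 0-3
    random_state=0,
)

# Score each feature against the class label and keep the top four.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

# Column indices of the selected features, to compare with columns 0-3.
selected = selector.get_support(indices=True)
print(selected)
```

Univariate tests score each feature on its own, so they can miss features that only matter in combination; in practice they are often one step among several.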

Case Study

A training plan for researchers at the Liverpool School of Tropical Medicine

“The course was intended to improve the data science capability of our department, though each student had their own motivation for signing up. Personally, I was looking for an overview of machine learning tools, the necessary considerations when applying them, and indications about how to implement them,” said Eric Lucas, Post Doctoral Research Associate, Liverpool School of Tropical Medicine. In line with these learning objectives, Cambridge Spark delivered a three-day Introduction to Data Science using Python training session on-site at the Department of Vector Biology.

“I enjoyed learning about how the different machine learning tools work, their strengths and weaknesses. I do a lot of data analysis already (using a lot of tools that overlap strongly with machine learning, such as logistic regressions, PCA, clustering analysis) and I generally get a kick out of thinking about data,” said Eric Lucas, Post Doctoral Research Associate, Liverpool School of Tropical Medicine. “I was actively searching for organisations that could provide in-house machine learning courses, and the course which Raoul proposed matched very closely with what I envisaged.”

Interested in training for your teams?

Whether you're looking to train 5 people or 100 people, we have a variety of scalable training solutions to help you address a wide spectrum of training needs within the fields of Data Science, Artificial Intelligence, or Software Engineering.

Please contact us with your details and any known requirements. We'll then get in touch and guide you through every step of the way.