By Dr. Niclas Thomas
“Data science” is a recent addition to the vocabulary of analytics, and can mean many different things to different people. As far as I am concerned, it is a fashionable term used to define the intersection of advanced analytical techniques from statistics, mathematics, machine learning and programming. The complexity of each of these fields can make them seem inaccessible to most biomedical researchers, who themselves have broad, complex fields of research to understand and may therefore have little time left to dedicate to emerging fields like data science. I started my career as a mathematician and took the bold step of moving into medical research for my PhD. More specifically, I joined a T cell immunology group at UCL under the guidance of Prof. Benny Chain. It was during this period that I noticed how many immunologists (and clinical scientists) felt that they had neither the ability nor the time to learn how to implement machine learning through programming to complement their research. Whilst many researchers may indeed lack the necessary time, I would argue that every researcher has the ability to learn to program if it is taught in a comprehensible, easy-to-follow way.
Laura and I met at UCL whilst she was also undertaking a PhD in viral immunology. Funnily enough, Laura initially approached me for help with some fairly basic statistics, which over time developed into more complex analysis. This particular project, carried out at UCL, focussed on drawing novel insights into the metabolic regulation of immunopathology in chronic hepatitis B infection, relying on such lab techniques as flow cytometry to process precious liver samples. At the same time, I was relying on bench scientists to generate datasets to which I could apply my mathematical background to make novel findings - my PhD made use of machine learning to gain further insights into T cell migration in lymph nodes and the dynamics of the T cell receptor repertoire. I can safely say that the time I spent on my PhD led me to two key insights - firstly, that no exciting or novel finding relating to the immune system can be made without the tireless work of talented bench scientists, and secondly, that the tireless work of lab scientists deserves cutting-edge techniques when the data is being analysed.
This realisation led to many successful collaborations with other immunologists working on a whole host of diseases, including HIV, chronic hepatitis B infection and cancer. Thankfully for me, each and every one of these collaborators recognised the power of data science (when used appropriately!), but also how difficult it is to become a master of all trades. Each of these immunologists and clinical scientists was keen to learn the analytical skills to complement their clinical or lab skills, but felt that there was simply not enough time in the day to dedicate to beginning to learn some data science. It was this perception that our new book was born out of, and as such the sole objective was to provide an easy-to-follow introduction to data science for immunologists who had the desire to learn, yet may have had no formal training in the subject and little time to give to it. Our new book, called “Data Science for Immunologists”, introduces some key practical aspects of data science, alongside how to implement them programmatically, with the hope of overcoming the belief that it is too difficult or too time-consuming.
So our plan was to develop a textbook that would present the most frequently used methods in data science, including statistical tests, clustering, making predictions and visualising high-dimensional data. Many textbooks seem overwhelming to beginners, covering too many techniques or too much theoretical background, so we aimed to keep unnecessary theory to a minimum. We also felt that providing theoretical background alone was not the best way to learn new things, so we made sure to present each new technique alongside the code (in two different languages) to allow the reader to implement the techniques using datasets we felt they would recognise.
In total, the book took just under a year to write, which included developing the code to accompany each individual example in addition to writing the content itself. Looking back, as with many new projects, the first two or three months were relatively pain-free, when enthusiasm was at its peak and the excitement carried us through many late evenings. I began by sketching out a rough plan of the questions I’d most commonly been asked by immunologists to provide the content. The early chapters were the hardest to write, as their scope was much broader – concisely describing what data science is, why the book is required and who would benefit from reading the book. I view these chapters as key to convincing each new reader of the importance of data science in immunology, as well as why our book was what they needed!
If I were asked to give any advice to budding authors out there, I would be hard pushed to do so without sounding slightly clichéd. However, if I were to provide one piece of advice on writing an educational book, I would highlight how important it is to keep to a well-defined plan. Motivation comes relatively easily during the early stages of writing a book (or PhD), but inevitably wanes at certain points as the difficulty of combining a full-time job with writing a book sets in. During these moments, it was sometimes easy to lose sight of why we wanted to start writing the book, and it occasionally felt like the end was never in sight! The first full draft that we managed to produce was a significant milestone, and it was a massive relief to send the draft to our three volunteers (Alice Burton, Jemima Thomas and Leo Swadling) who kindly offered to review our book. Of course, following the reviews there was still plenty of work to do, which forced us to revisit our initial goal of completing the book within a year and to reset the ‘goal posts’ to the more realistic target of fourteen months.
Having published the book in February this year, we’re now extremely passionate about introducing data science to as many researchers as possible, putting the power in their hands to do their own analysis. To achieve this, we regularly demonstrate new worked-through examples of data science being used in immunology, and use Twitter and our website to showcase these examples. Finally, we are indebted to many conference organisers, who happily publicise our book to attendees. We would also like to thank the British Society for Immunology, and our friends and colleagues in particular who have been extremely helpful in spreading the word.
You can find out more about Data Science for Immunologists here and get your copy of the book!