CAGE the RAGE
Published:
Good Pratice in Data Analysis
In a couple of weeks, I will teach a course in Leicester on reproducible data analysis, which will emphasise the use of git, github and rmarkdown. The students on the course will be a mix of postgraduates and post-docs, so I am sure that they will have no problem learning the mechanics of using this software. Where I do anticipate a problem, is in getting them to change their routine research workflow. We are all lazy and like to cut corners, so why archive R scripts or produce extensive documentation? The danger is that they will follow my advice for a couple of days and then slip back into their old ways.
In order to make the teaching seem relevant to what they actually do, I have decided to base it around a relatively simple data analysis project that I have called CAGE. You will find the repo for CAGE in my github account.
CAGE stands for Classification After Gene Expression. The idea of the project is to compare different was of selecting features (probes) from a microarray for use in classifying the disease of a set of patients. I have taken the data from GEO study GSE210271. The details are in my repo’s README.
By developing this analysis from scratch, I hope that the students will come to see the benefits of working in a reproducible way. There is a companion project called RAGE (Regression After Gene Expression) that I will get the students to work through as a course exercise.
My own conversion to reproducible research came about through the bitter experience of analysing data without proper archiving and documentation. Perhaps, they too will have to learn the hard way; I can only try.