Datathon exposes students to real-world data science

Datathon lets students practice techniques for analyzing big data, says organizer and UND Associate Professor Brian Darby

In late March, about 30 students and professionals from North Dakota and South Dakota participated in the UND Datathon, which let them apply big-data analytical techniques to a public-health dataset. Creative Commons image.

Brian Darby, associate professor of biology at UND, has a quote he cites to justify using big data to tackle complex research problems.

The quote comes from British statistician and geneticist Ronald Fisher, who pioneered the application of statistics in scientific experiments. It reads, “No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question at a time. [I am] convinced that this view is wholly mistaken. Nature will best respond to a logical and carefully thought-out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed.”

The need to parse multiple research variables at once is what Darby wanted to impart to students who participated in the Datathon contest, which took place in late March. The Datathon was the first such competition to be offered at the University.

“It is similar to a Hackathon, where groups get together for a short amount of time and approach some sort of coding challenge,” Darby said. “Datathons allow groups to get together and analyze a dataset or test some problem with data that they can find online.”

Organized by the UND Biology Club, UND’s Dakota Cancer Collaborative on Translational Activity (DaCCoTA) and Sanford Research’s Research Design and Biostatistics Core, the week-long Datathon attracted about 30 participants (undergraduate and graduate students as well as professionals) from North Dakota and South Dakota. Split into teams, they had to apply big-data analytical techniques to tease out public-health insights from a dataset that contained roughly 100 variables from the National Health and Nutrition Examination Survey.

The teams’ final presentations were judged by Mark Williamson, statistician with DaCCoTA; David Sturdevant, biostatistician with the Research Design and Biostatistics Core; and Thomas Tiahrt, chair of Economics & Decision Sciences at the University of South Dakota. Winners were announced on April 2. No UND teams placed in the top three.

Real-world data science

“It is a real-world dataset that wasn’t designed to answer or test a particular skill like in a course environment,” Darby said.

When Parker Combs opened the data file for the first time, it felt “overwhelming,” he said. One of his teammates, Temidayo Adeluwa, had a similar reaction.

“You don’t know where to start,” he said. “We were not given questions to answer. We had to form our own questions that we could answer in only five days.”

Combs, who graduated from UND last year with a degree in computer science, and Adeluwa, who is pursuing a master’s degree in biomedical sciences, both work in the lab of Junguk Hur, assistant professor in the Department of Biomedical Sciences at UND’s School of Medicine and Health Sciences. Named “Team Hur,” the team also included Kai Guo, a post-doctoral affiliate in Hur’s lab.

In their analysis, the team posed several questions about potential factors in depression. One of their key findings was that individuals with children are less likely to develop depression. Another was a positive relationship between marijuana use and depression.

To arrive at their findings, Combs and Adeluwa utilized R, a software environment for data science and analytics. While Combs had little experience using the program, the Datathon let him learn the software so that he can use it in his job at the lab, where he works on software development.

For Adeluwa, the experience taught him how to formulate research questions and present insights in a coherent story format. Reflecting on the experience, Adeluwa said, “The organizers should do it again.”

Darby hopes that the Datathon becomes a regular occurrence to expose students to datasets in a variety of industries. “The main thing I would like participants to get from this is that the things we’re learning in class are really used in all sciences,” he said. “I mostly just want to give them an opportunity to see all things that are learned in terms of model selection and checking assumptions. And all that is an important part of dealing with these large, complex datasets. It’s a real life thing that scientists have to deal with.”

Winners:

Open Category:

1st place – Team “Data Quest”: Mohamed Ahmed, Amin Baabol, Abdulkadir Said

2nd place – “Team Jacks”: Laura Fox, Jacob Lacker, Jordan Johnson

3rd place – Team “JAPKAS”: Samuel Adjei, James Young, Bundu Paudel, Kenneth Annan, Richard Acquah Sarpong

Undergraduate Category (teams that have 51% representation of undergraduate students):

1st place – Team “SDSU Predictive Group”: Deepak Raj Joshi, Willy Tshiyole Ntumba, Hu Jie, Nicole Kneip, Rama Khadka

2nd place – Team “SdState2”: Matthew Questad, Yirong Wang, Benjamin Pond