
Finding the patterns: Landmark resource gleans datasets to guide future depression research

Dec 02, 2025, by Stephen Kostrzewa

As the world embraces smart devices, we have more personal health information than ever before, including mobile sensing information that could be used for depression assessment. Research in this area is a rapidly expanding field, making it more important than ever that scholars understand the fundamental patterns — and overarching limitations — in the data available.

A team of researchers led by Bryant University Assistant Professor of Information Systems and Analytics ML Tlachac, Ph.D., recently examined the wide range of datasets that have been used in depression screening studies involving smart devices. The team then created a comprehensive directory of information for researchers — and worked to better explain what that information means. The resulting paper, "Datasets of Smartphone Modalities for Depression Assessment: A Scoping Review," is available in early access on the IEEE Transactions on Affective Computing website.

“This is the foundational piece for so many different people, and it has so many different uses for so many different portions of the community, depending on what you choose to make of it,” Tlachac notes. “Nobody's done this. There have been other papers that have brought together studies, and there have been some that have identified common datasets or public datasets. But I haven't seen anyone who's really gone to this level to try to categorize what's in the field.”

Depression is believed to affect more than 322 million people worldwide, the paper notes. Studies indicate that it goes undiagnosed in up to 50 percent of affected people in high-income countries and 90 percent in low- and middle-income countries, and that nearly three-quarters of affected people do not receive any depression treatment.

Smart devices offer unprecedented insights into depression and anxiety, Tlachac notes. But parsing that information has become a challenge: What pieces of data are important indicators, which are white noise, and which are false positives? Figuring that out means understanding the fundamental patterns and overarching limitations in the datasets used, especially as many researchers tap the same datasets in their work.


“We're talking about the need for generalizability research,” says Tlachac. “And to me, that starts with understanding the datasets.”

The project had been on Tlachac’s mind since graduate school. “I read a lot of the papers in my field, and I was recognizing details and patterns across papers and across datasets that I noticed other people, I thought, weren’t necessarily making or thinking about,” says Tlachac. “I started seeing a lot of connections.”

“A paper like this would be important,” Tlachac remembers thinking. “I kept expecting to see it, but it never came. So, I said ‘Fine, I’ll do it.’”

Tlachac recruited a team of experts to assist with the project, including other data scientists and research psychiatrists to help contextualize the intricacies of the datasets. Bryant Assistant Professor of Information Systems and Analytics Geri Louise Dimas, Ph.D., served as an advisor on ethics and the standardized search methodology. Bryant Data Science major and research assistant Arielle LaPreay ’26 assisted with the project as well and received her first credit as a co-author on a published paper.

Their first task was identifying what was out there. Tlachac’s team examined hundreds of published papers that use smart devices in depression assessment and picked them apart, often working backwards to determine the datasets they had used.


Certain datasets, and the papers built on them, are highly publicized, says Tlachac, which can lead to their being consulted again and again while others are neglected. “People typically only really know the key datasets or the key players in the field. And they're not seeing the other datasets that are coming out,” Tlachac notes.

“I would ask people in my field how many datasets they thought there were, and they would guess maybe in the 20s, or thirty at most. We found 80,” says Tlachac.

By highlighting and identifying the full scope of information available, the paper shines a spotlight on the host of data that has been used. “I'm not saying this is a perfect resource, but this is the closest that we have as a community right now to a full categorization of the datasets in our field,” Tlachac notes.

In addition to cataloging the datasets, the paper also breaks them down by what they collected, including information regarding location/activity, communication logs, phone use, and vocal utterances.

“The data in these datasets is messy, and it’s reported in different ways, because what people think is important is different,” Tlachac notes.

They also looked at how the information was collected, where it came from, and the populations recruited for each study. The team’s research found that studies were largely inconsistent in the information they reported about the datasets they used, and that many samples were relatively homogeneous, limiting the translational value of their modeling results.


“Sometimes papers might include a little about the population they collected data from, or other small details that were relevant to the paper,” explains Tlachac. “But that skips over a lot of other information, which can lead to bias being introduced to a dataset.”

Tlachac points out that the majority of datasets had more women than men participants, for instance, and that more than 30 percent of the participant populations were students.

“Students are a huge portion of the datasets used in this field,” notes Tlachac. “And the thing about students, especially if you're looking at any sort of passive sensing information, is that students’ lives are regimented: They all have a rather homogeneous pattern in their behavior.”

The paper could serve as a key resource going forward — not only by offering a catalog of datasets to pull from but also by highlighting gaps in data collection. “This is about developing a clear idea of what we have,” Tlachac says firmly. “By finding the holes and inconsistencies, we can help move the field forward.”

“I'm really hoping that it starts a conversation on dataset bias,” Tlachac says. “I want this to be a great resource, but I also want it to be something people build off of.” 
