
Beyond the Assembly Line: Mastering the Elements of Clinical Data Science

Speeding access to lifesaving therapies depends on aggregating, cleaning, and transforming disparate data into one cohesive database. Legacy tools and processes put trials — and patients — at risk.

I’m probably not the only one who loves “I Love Lucy.” The 1950s TV sitcom records the hilarious adventures of a frustrated housewife as she tries out different jobs and hobbies. In one episode, Lucy works in a chocolate factory, wrapping chocolates as they pass along a conveyor belt. At first, the chocolates move slowly, but as the belt picks up speed, she resorts to hiding, eating, or throwing out the candy she can’t wrap quickly enough. The task is simply beyond the tools and training she has received.

This analogy may seem too simple to describe the state of clinical data management teams today, but, not too long ago, clinical data flowed like that slow belt of chocolates. Manual processing and quality control methods were enough to allow teams to keep up. This is no longer the case.

Today, more data from more sources flows through clinical trials than ever before. Clinical data management teams must routinely aggregate, clean, and transform data from electronic data capture (EDC) systems, electronic patient reported outcomes (ePRO), randomization and trial supply management (RTSM), central labs, medical devices, and other external locations to ensure that clinical databases are fit for use.

But, like the old chocolate factory team, they must use manual processes and disconnected tools to keep up with the volume of diverse data streaming across the virtual belt. They still check emails and servers for data deliveries, manually rerun code to refresh outputs, piece together countless spreadsheet trackers, and continuously shuffle data around from place to place, just to be able to use it.

At a time when more complex trial designs and shorter timelines demand real-time efficiency, pure data, and adherence to scientific standards, why do we continue to treat clinical data management like a slow, mechanical process? Why must data managers still face the same challenges they described five years ago: lengthy data-cleaning backlogs, inconsistent data formats and metadata, and long waits to receive external patient data from third parties? These issues can lead to uncertainty about data validity and regulatory compliance.

The costs of manual data aggregation and cleaning, although difficult to quantify fully, are another concern. Data queries, which involve checking data points with research site staffers or third-party data providers, can cost from $28 to $225 per data point, according to recent studies.

As the volume, sources, and complexity of clinical data continue to increase, relying on manual data cleaning poses risks to trials, patients, and pipelines. It also runs counter to the growing demand for — and shortage of — data managers trained to address today’s challenges.

Sponsors and CROs need to free data managers from mundane, repetitive tasks to harness their therapeutic and domain expertise. Instead, most are dispatching small armies of outsourced data checkers — supervised by statisticians and data managers — to handle this work. Like thousands of Lucys, they manually check data and record that checks have been done, but the industry’s needs have become far more complex.

Data management tools and approaches must evolve so that data managers can analyze data and tackle higher order problems sooner. This requires that their teams be able to aggregate, clean, and transform data from multiple disconnected sources without repeated email checks and spreadsheet updates. Advanced analytics would then become part of the clinical data management team’s function. Instead of eliminating jobs, it would lead to upskilling. However, change will require a thoughtful approach to data review, and closer collaboration with downstream stakeholders and upstream data providers.

Automating checks and queries

Automating checks and queries, and providing access to clinical data in one place, will help free data managers from manual work so they can assess critical quality issues earlier in a trial. Biopharma companies’ work with Veeva CDB, a clinical data workbench introduced in 2022, suggests that the application can automate approximately 30% of all the checks that data management teams now perform manually. It also allows most of those checks to close automatically when data is corrected, without the need for additional verification.

Automated data checks already exist in Vault EDC via its rules engine, which can spot basic inconsistencies and errors in data accessible via the EDC. Veeva CDB extends this capability to external clinical data by aggregating data from multiple systems and sources. It allows checks with explicit criteria to be performed without the need for manual checking and confirmation.
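As a rough illustration of what an explicit-criteria check looks like, consider the sketch below. The record fields, thresholds, and check logic are assumptions invented for this example; they do not reflect Vault EDC’s rules engine or Veeva CDB’s actual data model.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative record structure -- field names are assumptions,
# not Veeva's data model.
@dataclass
class LabRecord:
    subject_id: str
    consent_date: date
    collection_date: date
    hemoglobin_g_dl: float

def run_basic_checks(record: LabRecord) -> list[str]:
    """Apply explicit-criteria checks of the kind a rules engine can
    automate, returning a list of issues found."""
    issues = []
    if record.collection_date < record.consent_date:
        issues.append("collection date precedes consent date")
    if not 3.0 <= record.hemoglobin_g_dl <= 25.0:
        issues.append("hemoglobin outside plausible range")
    return issues

rec = LabRecord("SUBJ-001", date(2023, 1, 10), date(2023, 1, 5), 14.2)
print(run_basic_checks(rec))  # the date check fires for this record
```

Because the criteria are explicit, a check like this needs no human judgment to run or to close: once the offending value is corrected, rerunning it returns an empty list.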

With Veeva CDB, when a data manager receives an email alert that new data has been delivered, the application includes a log of all the basic checks the system has already performed, eliminating the need for users to go into another system and verify manually. Database readiness can be assessed at a glance, by patient or data domain: data status (such as SDV, freeze, or lock), queries that are open vs. closed, and how many issues remain open and for how long. Veeva CDB also tracks which data have already been reviewed.

Today, many data managers use their own systems for keeping information straight (e.g., highlighting spreadsheet sections in different colors or numbering them), introducing potential variability and error into the process. If a data manager marks row 5 in the data as reviewed, Veeva CDB will preserve that status even when new data has been added and the original row 5 is now row 7 (or 700). If the underlying data in that row changes, the data manager easily sees that review is needed. Veeva CDB also prevents duplicative reviews by allowing users to filter on data that hasn’t been reviewed.
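The idea of pinning review status to a record’s identity rather than its row position can be sketched in a few lines. The key scheme and content-hashing approach below are illustrative assumptions, not Veeva CDB’s implementation.

```python
import hashlib

def fingerprint(record: dict) -> str:
    """Hash a record's contents so edits remain detectable after a refresh."""
    return hashlib.sha256(repr(sorted(record.items())).encode()).hexdigest()

class ReviewTracker:
    """Track review status by a stable record key, not by row position.
    A conceptual sketch only, not Veeva CDB's implementation."""

    def __init__(self):
        self._reviewed: dict[str, str] = {}  # record key -> content hash

    def mark_reviewed(self, key: str, record: dict) -> None:
        self._reviewed[key] = fingerprint(record)

    def needs_review(self, key: str, record: dict) -> bool:
        # New records, and reviewed records whose data changed, need review.
        return self._reviewed.get(key) != fingerprint(record)

tracker = ReviewTracker()
row = {"subject": "SUBJ-001", "hgb": 14.2}
tracker.mark_reviewed("SUBJ-001/visit1/hgb", row)

# After a refresh the record may sit on a different row, but its key is
# stable, so the review status survives -- unless the data itself changed.
print(tracker.needs_review("SUBJ-001/visit1/hgb", row))                  # False
print(tracker.needs_review("SUBJ-001/visit1/hgb", {**row, "hgb": 9.0}))  # True
```

Filtering on `needs_review` is what prevents duplicative work: only new or changed records surface for a second look.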

Automating queries

Veeva CDB enables even more significant savings by allowing queries to be customized based on more complex data checks involving external data sources. For example, using the Veeva CDB Clinical Query Language (CQL), more advanced checks can now be automated: for instance, does the investigator’s assessment of disease progression match the result calculated from imaging data using the algorithm specified in the protocol? Queries can be sent automatically, not just to site staff, but also to the imaging data provider or electronic clinical outcome assessment (eCOA) provider, eliminating repeated phone calls, emails, and spreadsheets.

These improvements may not eliminate all manual cleaning processes for clinical data—at least, not yet—but promise to reduce them significantly and make it easier to meet new mandates for clinical data quality and patient safety. This is good news for patients and the industry, but also for data management teams. No longer like workers on an assembly line, clinical data professionals will be able to harness deeper therapeutic knowledge to move clinical data science into a new era and to advance professionally.

To learn more about Veeva CDB and how it can speed up your data management processes, click here.
