Creating a Curation Workflow for the Environmental Health Vocabulary for Human and Environmental Chemical Assessments
The US Environmental Protection Agency (EPA) uses systematic review methods to conduct human and environmental chemical assessments transparently and reproducibly. Evidence supporting these assessments is collected, evaluated, and integrated across various fields (e.g., animal bioassays, epidemiology, and ecology). Critical to evidence integration is the standardization of reported language so that information extracted from the literature can be aggregated and compared across studies. EPA created the environmental health vocabulary (EHV) to address this challenge by providing a controlled vocabulary for the evidence collected in support of an assessment. A controlled vocabulary is a collection of non-redundant, unambiguous terms that allows information to be described and interpreted consistently. The EHV is implemented in the EPA Health Assessment Workspace Collaborative (HAWC), a content management system used by several EPA chemical assessment programs. A curation workflow with both automated and manual review steps is used to continuously review candidate terms and update the EHV.
The EHV is currently a collection of terms for animal health outcomes resulting from chemical exposures. The terms are organized into a five-tier hierarchy in which each tier is the parent of the next: system, organ, effect, effect sub-type, and endpoint. The EHV was initially created by downloading information extracted from animal bioassay studies in HAWC and cross-referencing the terms with existing vocabularies from the Unified Medical Language System (UMLS). Experts then standardized the terms and placed them into the hierarchy. The EHV was added to HAWC as a feature to make future extractions of animal bioassay data more efficient. In HAWC, a user can describe the information being extracted with EHV terms or, if needed, with different terms.
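The five-tier hierarchy described above can be pictured with a small data structure. The following is only an illustrative sketch in Python, not how HAWC or KMS stores the EHV: the `Term` class, the `TIERS` tuple, and the example term names are hypothetical and are not taken from the vocabulary itself.

```python
from dataclasses import dataclass
from typing import Optional

# Tier names follow the order given in the text; the identifiers are assumptions.
TIERS = ("system", "organ", "effect", "effect_subtype", "endpoint")


@dataclass(frozen=True)
class Term:
    """One vocabulary term; every non-system term points at its parent tier."""
    name: str
    tier: str
    parent: Optional["Term"] = None

    def __post_init__(self):
        depth = TIERS.index(self.tier)  # raises ValueError for an unknown tier
        if depth == 0:
            if self.parent is not None:
                raise ValueError("a system-tier term has no parent")
        elif self.parent is None or self.parent.tier != TIERS[depth - 1]:
            raise ValueError(f"{self.tier} term needs a {TIERS[depth - 1]} parent")

    def path(self):
        """Return the term names from the system tier down to this term."""
        chain = [] if self.parent is None else self.parent.path()
        return chain + [self.name]


# Placeholder terms invented for illustration only.
system = Term("Example System", "system")
organ = Term("Example Organ", "organ", system)
effect = Term("Example Effect", "effect", organ)
subtype = Term("Example Sub-type", "effect_subtype", effect)
endpoint = Term("Example Endpoint", "endpoint", subtype)
```

Calling `endpoint.path()` walks back up through the parents, which is the kind of lookup that makes terms comparable across studies at any tier.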
The software used to implement the curation workflow includes Synaptica Knowledge Management System (KMS), Microsoft Excel, and HAWC.
The curation workflow was developed by manually reviewing the data collected in HAWC after the EHV was implemented. Terms that did not adhere to the EHV were considered candidate terms. Many candidate terms are not genuinely new; they may be variants of existing EHV terms. Automated processing steps were therefore created to quickly identify misspellings or differences in punctuation between a candidate term and a matching EHV term. A candidate term was identified as a synonym if it matched an EHV term after automated processing, or queued for manual curation if it did not. New terms and updates to existing terms are recorded in KMS, while changes to HAWC are reviewed by project teams before they are made.
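The automated triage step described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the workflow's actual implementation (which uses KMS, Excel, and HAWC): the function names, the `0.9` similarity cutoff, and the use of the standard-library `difflib` module for spelling tolerance are all choices made here for illustration.

```python
import difflib
import re
import string

# Map every ASCII punctuation character to a space (assumed normalization rule).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(term):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return re.sub(r"\s+", " ", term.lower().translate(_PUNCT_TO_SPACE)).strip()


def triage(candidate, ehv_terms, cutoff=0.9):
    """Return ("synonym", matched_term) or ("manual_review", None)."""
    index = {normalize(t): t for t in ehv_terms}
    key = normalize(candidate)
    # Exact match once case and punctuation differences are removed.
    if key in index:
        return ("synonym", index[key])
    # Near match to tolerate small misspellings; the cutoff is an assumption.
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return ("synonym", index[close[0]])
    return ("manual_review", None)


# Placeholder vocabulary invented for illustration; not actual EHV content.
vocabulary = ["Body Weight", "Liver Weight, Relative"]
```

With this sketch, `triage("body-weight", vocabulary)` and `triage("Bodey Weight", vocabulary)` both resolve to the existing term, while an unrelated candidate is queued for manual curation. A stricter cutoff sends more terms to manual review; a looser one risks false synonyms.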
The curation workflow has been run only once, and manual review is not yet complete. The results provided are percentages that are expected to change, because the amount of work completed in HAWC fluctuates with the number of active assessments and with the amount of data extraction that can be completed at any given time.
The EHV has benefited assessment teams by standardizing the data collected across studies, both within a single assessment and across assessments. This allows team members to more easily aggregate or compare data in tabular representations and visualizations. Assessment teams must standardize the language used within a project, so using the EHV reduces the burden of doing so independently while also making information more findable within and across assessments. The curation workflow allows the EHV to expand to capture more information and concepts across assessments. It also identifies priority areas for expansion by highlighting gaps in coverage, based on adherence to the EHV across projects. The automated processing, in turn, reduces the burden that manual review places on available resources.