PDF Entity Annotation Tool (PEAT)

Text mining approaches, including Deep Learning (DL), Artificial Intelligence (AI), and Machine Learning (ML), continue to expand at a rapid pace, but the tools researchers use to build the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish images of cars from images of trucks generally requires a set of images paired with a quantitative description of the features that characterize each vehicle type. Creating labeled textual data for natural language ML models of scientific literature is not currently integrated into the manual workflows that domain experts already use.

Published literature is rich in important information, such as embedded text, plots, and tables, all of which can serve as inputs for training ML and natural language processing (NLP) models once extracted and prepared in machine-readable formats. Currently, both the normalized data extraction that domain experts need and the extraction that supports ML/NLP model development are labor-intensive, cumbersome manual processes. Automatic extraction of data and information is heavily restricted by proprietary data formats and by a focus on print quality rather than machine readability. The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed to let users annotate publications in their current print format while capturing those annotations in a machine-readable format.

Impact/Purpose

We proposed a system in which annotations are made directly on the original PDF document structure, with no plain-text extraction. The resulting application makes annotation easier and more accurate. In addition, because users can easily create their own schema, the system can be used to annotate text under different domain-centric schemas relevant to subject matter experts. Different knowledge domains require distinct schemas and annotation tags to support machine learning.
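
As a rough illustration of how a user-defined schema and annotations anchored to the PDF page might be captured in a machine-readable form, consider the minimal sketch below. It does not reflect PEAT's actual data model or export format: the `Schema` and `Annotation` classes, their fields, the `export_annotations` function, and the example tags are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical structures for illustration only; not PEAT's real data model.

@dataclass(frozen=True)
class Schema:
    name: str    # assumed label for a knowledge domain
    tags: tuple  # annotation tags permitted under this schema


@dataclass
class Annotation:
    page: int    # 1-based page number within the PDF
    bbox: tuple  # (x0, y0, x1, y1) region on the page, in PDF points
    text: str    # the text the annotator highlighted
    tag: str     # must be one of the schema's tags


def validate(schema: Schema, annotation: Annotation) -> None:
    """Raise ValueError if the annotation does not fit the schema."""
    if annotation.tag not in schema.tags:
        raise ValueError(f"tag {annotation.tag!r} is not in schema {schema.name!r}")
    if annotation.page < 1:
        raise ValueError("page numbers start at 1")
    x0, y0, x1, y1 = annotation.bbox
    if x0 >= x1 or y0 >= y1:
        raise ValueError("bbox must have positive width and height")


def export_annotations(schema: Schema, annotations: list) -> str:
    """Validate every annotation, then serialize schema and annotations to JSON."""
    for annotation in annotations:
        validate(schema, annotation)
    payload = {
        "schema": asdict(schema),
        "annotations": [asdict(a) for a in annotations],
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    # Example domain schema and a single annotation (made-up values).
    toxicology = Schema(name="toxicology", tags=("species", "dose", "endpoint"))
    highlighted = Annotation(
        page=3,
        bbox=(72.0, 540.0, 180.0, 552.0),
        text="Sprague-Dawley rats",
        tag="species",
    )
    print(export_annotations(toxicology, [highlighted]))
```

Keeping the page number and bounding box alongside the tagged text is one way to preserve the link between an annotation and its location in the original print layout; a different domain would supply its own `Schema` with a different tag set.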

Citation

Markey, K., C. Stahl, B. Jewell, D. Shams, M. M. Taylor, A. Wilkins, S. Watford, A. Shapiro, and M. Angrish. PDF Entity Annotation Tool (PEAT). Open Journals, Austin, TX, 10(108):5336, (2025). [DOI: 10.21105/joss.05336]

Download(s)

DOI: PDF Entity Annotation Tool (PEAT)
Last updated on April 23, 2025