A Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks

  • Claudio Aracena
  • Luis Miranda
  • Thomas Vakili
  • Fabián Villena
  • Tamara Quiroga
  • Fredy Núñez-Torres
  • Victor Rocco
  • Jocelyn Dunstan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Annotated corpora are expensive to create, yet they are essential for building and evaluating reliable natural language processing systems. This study introduces a new corpus of 2,869 medical and admission reports collected by an occupational insurance and health provider. The corpus has been carefully annotated for personally identifiable information (PII) and is shared with this information masked. Two annotators followed annotation guidelines, and a referee later resolved annotation conflicts in a consolidation process to build a gold standard subcorpus. Inter-annotator agreement, measured in F1, ranges between 0.86 and 0.93 depending on the selected subcorpus. The value of the corpus is demonstrated by evaluating its use for named entity recognition (NER) of PII and for a classification task. In NER of PII, fine-tuned models reach an F1 of 0.911 and GPT-3.5 reaches 0.720. In the insurance coverage classification task, using the original or the de-identified corpus results in similar performance. The annotated data are released in de-identified form.
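
As an illustration of how agreement between two annotators can be expressed as F1, the sketch below treats one annotator's PII spans as the reference and the other's as predictions, and counts only exact matches. The span format, the label names, the example offsets, and the `span_f1` helper are assumptions made for this example; they are not taken from the paper, which may match spans differently.

```python
def span_f1(spans_a, spans_b):
    """Exact-match F1 between two collections of (start, end, label) spans.

    Hypothetical helper: annotator A is treated as the reference and
    annotator B as the prediction. With exact matching the score is
    symmetric, so the choice of reference does not change the result.
    """
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # neither annotator marked anything: full agreement
    matched = len(a & b)  # spans with identical offsets and label
    precision = matched / len(b) if b else 0.0
    recall = matched / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations for a single report (character offsets, label).
annotator_1 = [(0, 12, "NAME"), (30, 40, "ID"), (55, 65, "DATE")]
annotator_2 = [(0, 12, "NAME"), (30, 40, "ID"), (70, 80, "LOCATION")]

agreement = span_f1(annotator_1, annotator_2)
print(f"Inter-annotator F1: {agreement:.3f}")
```

Here the annotators agree on two of three spans each, so precision and recall are both 2/3 and the F1 is also 2/3. A corpus-level figure would pool matches over all documents before computing the score.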

Original language: English
Title of host publication: ClinicalNLP 2024 - 6th Workshop on Clinical Natural Language Processing, Proceedings of the Workshop
Editors: Tristan Naumann, Asma Ben Abacha, Steven Bethard, Kirk Roberts, Danielle Bitterman
Publisher: Association for Computational Linguistics (ACL)
Pages: 111-121
Number of pages: 11
ISBN (Electronic): 9798891761094
State: Published - 2024
Externally published: Yes
Event: 6th Workshop on Clinical Natural Language Processing, ClinicalNLP 2024, held at NAACL 2024 - Mexico City, Mexico
Duration: 21 Jun 2024 → …

Publication series

Name: ClinicalNLP 2024 - 6th Workshop on Clinical Natural Language Processing, Proceedings of the Workshop

Conference

Conference: 6th Workshop on Clinical Natural Language Processing, ClinicalNLP 2024, held at NAACL 2024
Country/Territory: Mexico
City: Mexico City
Period: 21/06/24 → …
