Adverse Drug Event Detection with n2c2, SIDER, and LLM Evaluation

This repository contains two related workflows for adverse drug event (ADE) extraction and evaluation:

A classical preprocessing pipeline that extracts ADE-drug pairs from n2c2 annotations, normalizes terms, links them to SIDER, and filters validated versus potentially novel pairs.
An LLM-based inference and evaluation workflow that runs ADE detection on either annotation files or parquet context windows and compares performance with and without SIDER context.

Current work centers on the parquet-window LLM workflow and its full-batch evaluation on the n2c2 test set.

Current State

As of 2026-03-14, the repository contains completed full-batch parquet inference outputs and evaluation artifacts for paired SIDER and no-SIDER runs across two run families (full and 2pred).

Current status:

Full parquet-window inference completed for both SIDER-enabled and no-SIDER prompts.
Full-batch comparison completed on all 202 test files.
Confidence-threshold sweep completed for 0.30, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, and 0.80.
Best tested SIDER operating region is the 0.75-0.80 plateau with overall F1 0.5140.
Without thresholding, the no-SIDER run performs better overall; the SIDER run overtakes it once --min-confidence is at least 0.55.

Key current artifacts:

data/outputs/llm_predictions_parquet_full_parquet_sider.csv
data/outputs/llm_predictions_parquet_full_parquet_no_sider.csv
data/outputs/full_batch_comparison_eval.txt
findings/eval.md

Repository Layout

ade-project/
├── README.md
├── requirements.txt
├── example.env
├── drug_atc.tsv
├── data/
│   ├── n2c2/
│   │   ├── raw/
│   │   │   ├── train/
│   │   │   ├── test/
│   │   │   ├── test_txts/
│   │   │   └── entity_dataset_w_3_sentence_grouping.parquet
│   │   └── processed/
│   │       ├── ade_drug_relations.csv
│   │       ├── n2c2_clean.csv
│   │       ├── n2c2_entities.csv
│   │       ├── n2c2_relations.csv
│   │       ├── n2c2_with_sider_context.csv
│   │       ├── potential_novel_ade_pairs.csv
│   │       └── validated_ade_drug_pairs.csv
│   ├── sider/
│   │   ├── raw/
│   │   │   ├── drug_names.tsv
│   │   │   └── meddra_all_se.tsv
│   │   └── processed/
│   │       └── sider_clean.csv
│   └── outputs/
│       ├── evaluation_gold_truth_ade_drug.csv
│       ├── evaluation_selected_test_ids.csv
│       ├── full_batch_comparison_eval.txt
│       ├── full_batch_comparison_eval_thr_07.txt
│       ├── full_batch_comparison_eval_thr_08.txt
│       ├── full_batch_comparison_eval_2pred_thr_080.txt
│       ├── full_batch_comparison_eval_2pred_thr_090.txt
│       ├── llm_predictions_parquet_2pred_parquet_full_no_sider.csv
│       ├── llm_predictions_parquet_2pred_parquet_full_sider.csv
│       ├── llm_predictions_parquet_full_parquet_no_sider.csv
│       └── llm_predictions_parquet_full_parquet_sider.csv
├── findings/
│   ├── eval.md
│   ├── OPENROUTER_TEST_RESULTS.md
│   └── overview.txt
├── notebooks/
│   └── 01_extract_n2c2_entities.ipynb
└── scripts/
    ├── config.py
    ├── estimate_costs.py
    ├── evaluate_results.py
    ├── extract_n2c2_entities.py
    ├── filter_validate.py
    ├── jsonl_to_csv.py
    ├── link_sider.py
    ├── llm_ade_detection.py
    ├── normalize_terms.py
    └── run_pipeline.py

Data Requirements

You need both of the following datasets available locally:

n2c2 ADE extraction dataset in data/n2c2/raw/
SIDER raw files in data/sider/raw/

Required raw files:

data/n2c2/raw/train/*.ann
data/n2c2/raw/train/*.txt
data/n2c2/raw/test/*.ann
data/n2c2/raw/test/*.txt
data/n2c2/raw/test_txts/*.txt
data/n2c2/raw/entity_dataset_w_3_sentence_grouping.parquet
data/sider/raw/drug_names.tsv
data/sider/raw/meddra_all_se.tsv

The parquet file is required for the current LLM workflow.
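
Before running either workflow, it can help to confirm that the raw inputs are in place. The following is a minimal sketch, not part of the repository: the function name find_missing_inputs is hypothetical, and the patterns are copied from the list of required raw files above.

```python
from pathlib import Path

# Glob patterns copied from the "Required raw files" list above.
REQUIRED_PATTERNS = [
    "data/n2c2/raw/train/*.ann",
    "data/n2c2/raw/train/*.txt",
    "data/n2c2/raw/test/*.ann",
    "data/n2c2/raw/test/*.txt",
    "data/n2c2/raw/test_txts/*.txt",
    "data/n2c2/raw/entity_dataset_w_3_sentence_grouping.parquet",
    "data/sider/raw/drug_names.tsv",
    "data/sider/raw/meddra_all_se.tsv",
]


def find_missing_inputs(repo_root: Path) -> list[str]:
    """Return every required pattern that matches no file under repo_root."""
    return [p for p in REQUIRED_PATTERNS if not any(repo_root.glob(p))]


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    missing = find_missing_inputs(Path("."))
    for pattern in missing:
        print(f"missing: {pattern}")
    if not missing:
        print("all required raw inputs found")
```

The check only tests for presence; it does not validate file contents or counts.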

Environment Setup

Python

The project is currently developed in a local virtual environment running Python 3.14.3.

Create and activate a virtual environment:

Windows PowerShell

python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt

Git Bash on Windows

python -m venv .venv
source .venv/Scripts/activate
python -m pip install --upgrade pip
pip install -r requirements.txt

Required packages

Core dependencies are listed in requirements.txt and include:

pandas
numpy
pyarrow
fastparquet
rapidfuzz
drug_named_entity_recognition

Notes:

drug_named_entity_recognition is used for normalization where available; parts of the code fall back to simpler normalization if it is unavailable.

Environment variables

The LLM workflow uses OpenRouter and expects an API key in a local .env file.

Copy the sample file and set your key:

cp example.env .env

Then edit .env so that it contains:

OPENROUTER_API_KEY=your_key_here
OPENROUTER_API_URL=https://openrouter.ai/api/v1/chat/completions

llm_ade_detection.py reads .env directly at runtime.
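
For readers unfamiliar with this pattern, the sketch below shows one way a script can read a .env file using only the standard library. It is an illustration under assumptions, not the code in llm_ade_detection.py: the helper names load_env_file and get_api_key are hypothetical, and only simple KEY=value lines are handled.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("OPENROUTER_API_KEY") or load_env_file(path).get(
        "OPENROUTER_API_KEY", ""
    )
    if not key or key == "your_key_here":
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; copy example.env to .env and add your key"
        )
    return key
```

Failing early with a clear message avoids starting a long inference run that would only fail at the first API call.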

Classical Pipeline

The classical preprocessing pipeline operates on n2c2 annotations and SIDER tables.

Run the full preprocessing pipeline

python scripts/run_pipeline.py

Useful variants

Run a subset of steps:

python scripts/run_pipeline.py --steps 1-3
python scripts/run_pipeline.py --steps 3,4
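
The two --steps forms above (a range such as 1-3 and a comma list such as 3,4) can be expanded into step numbers along these lines. This is a sketch of the general technique; the function parse_steps is hypothetical, and run_pipeline.py may parse the option differently.

```python
def parse_steps(spec: str) -> list[int]:
    """Expand a spec such as "1-3" or "3,4" into a sorted list of step numbers."""
    steps: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid step range: {part!r}")
            steps.update(range(start, end + 1))
        else:
            steps.add(int(part))
    return sorted(steps)


if __name__ == "__main__":
    print(parse_steps("1-3"))
    print(parse_steps("3,4"))
```

Collecting into a set first means overlapping specs such as 1-3,2 do not run a step twice.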

Skip steps with existing outputs:

python scripts/run_pipeline.py --skip-existing

Force rerun:

python scripts/run_pipeline.py --force

Include the evaluation step (requires existing predictions):

python scripts/run_pipeline.py --include-eval

Individual classical pipeline commands

python scripts/extract_n2c2_entities.py
python scripts/normalize_terms.py
python scripts/link_sider.py
python scripts/filter_validate.py

Outputs from this workflow are written mainly under data/n2c2/processed/ and data/sider/processed/.

LLM Workflow

The current research workflow uses scripts/llm_ade_detection.py to run inference with OpenRouter.

Supported --source and --mode values:

--source ann: use annotation and note files
--source parquet: use 3-sentence parquet windows
--mode pilot: small first-file test
--mode batch: multi-file processing
--mode full: full parquet processing

Full parquet inference with SIDER context

python scripts/llm_ade_detection.py \
  --source parquet \
  --mode full \
  --output-suffix full_parquet_sider

Full parquet inference without SIDER context

python scripts/llm_ade_detection.py \
  --source parquet \
  --mode full \
  --disable-sider-context \
  --output-suffix full_parquet_no_sider

These commands generate:

data/outputs/llm_predictions_partial_parquet_<suffix>.jsonl
data/outputs/llm_predictions_parquet_<suffix>.csv
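
The partial .jsonl file holds one JSON record per line, and the .csv is its tabular counterpart. The sketch below shows how such a conversion can be done with the standard library. It is not the repository's scripts/jsonl_to_csv.py: the function name and the choice to skip unparseable lines (for example, a last line cut off by an interrupted run) are assumptions, and no particular record fields are assumed.

```python
import csv
import json


def jsonl_to_csv(jsonl_path: str, csv_path: str) -> int:
    """Convert a JSONL file to CSV and return the number of rows written."""
    rows: list[dict] = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: skip malformed lines rather than abort.
                continue
            if isinstance(record, dict):
                rows.append(record)

    # Use the union of keys, in first-seen order, as the CSV header.
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # missing keys are written as empty cells
    return len(rows)
```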

Example model override

python scripts/llm_ade_detection.py \
  --source parquet \
  --mode full \
  --model meta-llama/llama-3.1-8b-instruct \
  --output-suffix full_parquet_llama31

Default model at present:

meta-llama/llama-3.3-70b-instruct

Evaluation Commands

Compare SIDER versus no-SIDER on the full parquet run

python scripts/evaluate_results.py \
  --predictions-with-sider data/outputs/llm_predictions_parquet_full_parquet_sider.csv \
  --predictions-without-sider data/outputs/llm_predictions_parquet_full_parquet_no_sider.csv \
  --batch-type full

This writes:

data/outputs/full_batch_comparison_eval.txt

Run thresholded comparison

Example at 0.75:

python scripts/evaluate_results.py \
  --predictions-with-sider data/outputs/llm_predictions_parquet_full_parquet_sider.csv \
  --predictions-without-sider data/outputs/llm_predictions_parquet_full_parquet_no_sider.csv \
  --batch-type full \
  --min-confidence 0.75 \
  --reuse-gold-truth \
  --batch-report-output data/outputs/full_batch_comparison_eval_thr_075.txt

Single prediction file evaluation

python scripts/evaluate_results.py \
  --predictions data/outputs/llm_predictions_complete.csv

Configuration

Shared path and threshold settings live in scripts/config.py.

Important current settings:

PARQUET_CONTEXT_PATH: parquet window input path
N2C2_TEST_TXTS_DIR: preferred test note text source
MIN_N2C2_FREQUENCY_FOR_NOVEL = 2
SIDER_FREQUENCY_THRESHOLD = 0
LLM_MAX_FILES = 202
LLM_NOTE_TRUNCATION_LENGTH = 3000
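
To illustrate how a frequency setting such as MIN_N2C2_FREQUENCY_FOR_NOVEL can separate validated pairs from potentially novel ones, here is a minimal sketch. The split rule is an assumption made for illustration (a pair found in SIDER counts as validated; a pair absent from SIDER is kept as potentially novel only if it occurs at least the minimum number of times), and split_pairs is a hypothetical name. The logic in filter_validate.py may differ.

```python
from collections import Counter

MIN_N2C2_FREQUENCY_FOR_NOVEL = 2  # value from the settings list above


def split_pairs(observed_pairs, sider_pairs, min_freq=MIN_N2C2_FREQUENCY_FOR_NOVEL):
    """Split (drug, ade) pairs into validated and potentially novel lists.

    observed_pairs: iterable of (drug, ade) tuples, one per n2c2 occurrence.
    sider_pairs: set of (drug, ade) tuples known to SIDER.
    """
    counts = Counter(observed_pairs)
    validated = sorted(p for p in counts if p in sider_pairs)
    novel = sorted(
        p for p, n in counts.items() if p not in sider_pairs and n >= min_freq
    )
    return validated, novel


if __name__ == "__main__":
    # Toy data, not taken from n2c2 or SIDER.
    observed = [
        ("warfarin", "bleeding"),
        ("drug_x", "rash"),
        ("drug_x", "rash"),
        ("drug_y", "dizziness"),
    ]
    sider = {("warfarin", "bleeding")}
    print(split_pairs(observed, sider))
```

In the toy data, the pair seen only once and absent from SIDER falls into neither list, which is the intended effect of a minimum-frequency filter.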

Current Findings Summary

From the current full-batch parquet evaluation:

No threshold: no-SIDER performs better overall with F1 0.4836 versus SIDER 0.4673.
Starting at min-confidence 0.55, SIDER becomes better overall than no-SIDER.
Best tested SIDER operating point is the 0.75-0.80 plateau with precision 0.4700, recall 0.5670, and F1 0.5140.
Confidence values are quantized, so thresholds within 0.55-0.60, 0.65-0.70, and 0.75-0.80 produce identical retained predictions.
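
The effect of a minimum-confidence threshold on quantized scores can be sketched as follows. The data is a toy example, and exact matching on (file_id, drug, ade) tuples is an assumption; evaluate_results.py may match pairs differently.

```python
def filter_by_confidence(predictions, min_confidence):
    """Keep predictions whose confidence is at least min_confidence."""
    return [p for p in predictions if p["confidence"] >= min_confidence]


def precision_recall_f1(predicted, gold):
    """Score two sets of (file_id, drug, ade) tuples by exact match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, not taken from the n2c2 evaluation.
predictions = [
    {"pair": ("101", "warfarin", "bleeding"), "confidence": 0.9},
    {"pair": ("101", "warfarin", "nausea"), "confidence": 0.5},
    {"pair": ("102", "lisinopril", "cough"), "confidence": 0.7},
]
gold = {("101", "warfarin", "bleeding"), ("102", "lisinopril", "cough")}

for threshold in (0.0, 0.55, 0.60, 0.75):
    kept = {p["pair"] for p in filter_by_confidence(predictions, threshold)}
    p, r, f1 = precision_recall_f1(kept, gold)
    print(f"min-confidence {threshold:.2f}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Because no toy confidence lies between 0.55 and 0.60, both thresholds retain the same predictions and give the same scores, which mirrors the plateaus described above. The reported operating point is also internally consistent: 2 × 0.4700 × 0.5670 / (0.4700 + 0.5670) ≈ 0.5140.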

For detailed write-ups, see:

findings/eval.md

Notes and Caveats

Access to n2c2 data requires appropriate authorization.
The OpenRouter workflow will fail without OPENROUTER_API_KEY.
scripts/evaluate_results.py supports both single-file evaluation and paired SIDER versus no-SIDER comparison.

License

This repository is intended for academic and research use. Ensure that you have the necessary rights and approvals for all datasets and API services used with it.
