This repository contains two related workflows for adverse drug event (ADE) extraction and evaluation:
- A classical preprocessing pipeline that extracts ADE-drug pairs from n2c2 annotations, normalizes terms, links them to SIDER, and filters validated versus potentially novel pairs.
- An LLM-based inference and evaluation workflow that runs ADE detection on either annotation files or parquet context windows and compares performance with and without SIDER context.
The current project state is centered on the parquet-window LLM workflow and its full-batch evaluation on the n2c2 test set.
As of 2026-03-14, the repository contains completed full-batch parquet inference outputs and evaluation artifacts for paired SIDER and no-SIDER runs across two run families (full and 2pred).
Current status:
- Full parquet-window inference completed for both SIDER-enabled and no-SIDER prompts.
- Full-batch comparison completed on all 202 test files.
- Confidence-threshold sweep completed for
0.30,0.50,0.55,0.60,0.65,0.70,0.75, and0.80. - Best tested SIDER operating region is the
0.75-0.80plateau with overall F10.5140. - Without thresholding, the no-SIDER run performs better overall; SIDER overtakes once
min-confidence >= 0.55.
Key current artifacts:
data/outputs/llm_predictions_parquet_full_parquet_sider.csvdata/outputs/llm_predictions_parquet_full_parquet_no_sider.csvdata/outputs/full_batch_comparison_eval.txtfindings/eval.md
ade-project/
├── README.md
├── requirements.txt
├── example.env
├── drug_atc.tsv
├── data/
│ ├── n2c2/
│ │ ├── raw/
│ │ │ ├── train/
│ │ │ ├── test/
│ │ │ ├── test_txts/
│ │ │ └── entity_dataset_w_3_sentence_grouping.parquet
│ │ └── processed/
│ │ ├── ade_drug_relations.csv
│ │ ├── n2c2_clean.csv
│ │ ├── n2c2_entities.csv
│ │ ├── n2c2_relations.csv
│ │ ├── n2c2_with_sider_context.csv
│ │ ├── potential_novel_ade_pairs.csv
│ │ └── validated_ade_drug_pairs.csv
│ ├── sider/
│ │ ├── raw/
│ │ │ ├── drug_names.tsv
│ │ │ └── meddra_all_se.tsv
│ │ └── processed/
│ │ └── sider_clean.csv
│ └── outputs/
│ ├── evaluation_gold_truth_ade_drug.csv
│ ├── evaluation_selected_test_ids.csv
│ ├── full_batch_comparison_eval.txt
│ ├── full_batch_comparison_eval_thr_07.txt
│ ├── full_batch_comparison_eval_thr_08.txt
│ ├── full_batch_comparison_eval_2pred_thr_080.txt
│ ├── full_batch_comparison_eval_2pred_thr_090.txt
│ ├── llm_predictions_parquet_2pred_parquet_full_no_sider.csv
│ ├── llm_predictions_parquet_2pred_parquet_full_sider.csv
│ ├── llm_predictions_parquet_full_parquet_no_sider.csv
│ └── llm_predictions_parquet_full_parquet_sider.csv
├── findings/
│ ├── eval.md
│ ├── OPENROUTER_TEST_RESULTS.md
│ └── overview.txt
├── notebooks/
│ └── 01_extract_n2c2_entities.ipynb
└── scripts/
├── config.py
├── estimate_costs.py
├── evaluate_results.py
├── extract_n2c2_entities.py
├── filter_validate.py
├── jsonl_to_csv.py
├── link_sider.py
├── llm_ade_detection.py
├── normalize_terms.py
└── run_pipeline.py
You need both of the following datasets available locally:
- n2c2 ADE extraction dataset in
data/n2c2/raw/ - SIDER raw files in
data/sider/raw/
Required raw files:
data/n2c2/raw/train/*.anndata/n2c2/raw/train/*.txtdata/n2c2/raw/test/*.anndata/n2c2/raw/test/*.txtdata/n2c2/raw/test_txts/*.txtdata/n2c2/raw/entity_dataset_w_3_sentence_grouping.parquetdata/sider/raw/drug_names.tsvdata/sider/raw/meddra_all_se.tsv
The parquet file is required for the current LLM workflow.
The current workspace is running in a local virtual environment with Python 3.14.3.
Create and activate a virtual environment:
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txtpython -m venv .venv
source .venv/Scripts/activate
python -m pip install --upgrade pip
pip install -r requirements.txtCore dependencies are listed in requirements.txt and include:
pandasnumpypyarrowfastparquetrapidfuzzdrug_named_entity_recognition
Notes:
drug_named_entity_recognitionis used for normalization where available; parts of the code fall back to simpler normalization if it is unavailable.
The LLM workflow uses OpenRouter and expects an API key in a local .env file.
Copy the sample file and set your key:
cp example.env .envThen copy example.env into .env and edit it as follows:
OPENROUTER_API_KEY=your_key_here
OPENROUTER_API_URL=https://openrouter.ai/api/v1/chat/completionsllm_ade_detection.py reads .env directly at runtime.
The classical preprocessing pipeline operates on n2c2 annotations and SIDER tables.
python scripts/run_pipeline.pyRun a subset of steps:
python scripts/run_pipeline.py --steps 1-3
python scripts/run_pipeline.py --steps 3,4Skip steps with existing outputs:
python scripts/run_pipeline.py --skip-existingForce rerun:
python scripts/run_pipeline.py --forceInclude evaluation step if predictions already exist:
python scripts/run_pipeline.py --include-evalpython scripts/extract_n2c2_entities.py
python scripts/normalize_terms.py
python scripts/link_sider.py
python scripts/filter_validate.pyOutputs from this workflow are written mainly under data/n2c2/processed/ and data/sider/processed/.
The current research workflow uses scripts/llm_ade_detection.py to run inference with OpenRouter.
Supported modes:
--source ann: use annotation and note files--source parquet: use 3-sentence parquet windows--mode pilot: small first-file test--mode batch: multi-file processing--mode full: full parquet processing
python scripts/llm_ade_detection.py \
--source parquet \
--mode full \
--output-suffix full_parquet_siderpython scripts/llm_ade_detection.py \
--source parquet \
--mode full \
--disable-sider-context \
--output-suffix full_parquet_no_siderThese commands generate:
data/outputs/llm_predictions_partial_parquet_<suffix>.jsonldata/outputs/llm_predictions_parquet_<suffix>.csv
python scripts/llm_ade_detection.py \
--source parquet \
--mode full \
--model meta-llama/llama-3.1-8b-instruct \
--output-suffix full_parquet_llama31Default model at present:
meta-llama/llama-3.3-70b-instruct
python scripts/evaluate_results.py \
--predictions-with-sider data/outputs/llm_predictions_parquet_full_parquet_sider.csv \
--predictions-without-sider data/outputs/llm_predictions_parquet_full_parquet_no_sider.csv \
--batch-type fullThis writes:
data/outputs/full_batch_comparison_eval.txt
Example at 0.75:
python scripts/evaluate_results.py \
--predictions-with-sider data/outputs/llm_predictions_parquet_full_parquet_sider.csv \
--predictions-without-sider data/outputs/llm_predictions_parquet_full_parquet_no_sider.csv \
--batch-type full \
--min-confidence 0.75 \
--reuse-gold-truth \
--batch-report-output data/outputs/full_batch_comparison_eval_thr_075.txtpython scripts/evaluate_results.py \
--predictions data/outputs/llm_predictions_complete.csvShared path and threshold settings live in scripts/config.py.
Important current settings:
PARQUET_CONTEXT_PATH: parquet window input pathN2C2_TEST_TXTS_DIR: preferred test note text sourceMIN_N2C2_FREQUENCY_FOR_NOVEL = 2SIDER_FREQUENCY_THRESHOLD = 0LLM_MAX_FILES = 202LLM_NOTE_TRUNCATION_LENGTH = 3000
From the current full-batch parquet evaluation:
- No threshold: no-SIDER performs better overall with F1
0.4836versus SIDER0.4673. - Starting at
min-confidence 0.55, SIDER becomes better overall than no-SIDER. - Best tested SIDER operating point is the
0.75-0.80plateau with precision0.4700, recall0.5670, and F10.5140. - Confidence values are quantized, so thresholds within
0.55-0.60,0.65-0.70, and0.75-0.80produce identical retained predictions.
For detailed write-ups, see:
findings/eval.md
- Access to n2c2 data requires appropriate authorization.
- The OpenRouter workflow will fail without
OPENROUTER_API_KEY. scripts/evaluate_results.pysupports both single-file evaluation and paired SIDER versus no-SIDER comparison.
This repository is intended for academic and research use. Ensure that you have the necessary rights and approvals for all datasets and API services used with it.