A Public Engineering Experiment: Building an Open-Source GAMP 5 Training Dataset

Many in pharma are currently discussing the potential of AI for GxP documentation. But to leverage this technology effectively, you need a fine-tuned model that truly understands the SDLC and follows strict SOPs to generate compliant artifacts. Fine-tuning such a model, however, requires something that currently does not exist in the public domain: a high-quality, GAMP-aligned training corpus.

Over the next 30 days, I am running a public experiment to build and test exactly that.

Today, I am publishing an initial, synthetically generated open-source dataset on my site and on my GitHub account, featuring:

50 User Requirements Specifications (URS)
50 Functional Specifications (FS)
50 Design Specifications (DS)

Pharma organizations deploying high-risk AI in 2026 desperately need GAMP-aligned training data to meet their technical-documentation and data-governance obligations (such as Art. 10 and Art. 11 of the EU AI Act). Today, there is no public corpus to seed this against. I know because I searched.

To be absolutely clear on provenance: every document in this corpus is synthetically generated strictly from regulatory primary sources (FDA guidance, ISPE GAMP 5, ICH, and ISO 13485). It contains zero anonymized customer data from past validation projects.

Within the next month, the repository will expand to include more document types from this internal set, adding Validation Plans (VP), Validation Reports (VR), Test Plans (TP), Test Reports (TR), Risk Assessments (RA), IQ/OQ/PQ protocols, and Traceability Matrices.

But here is the core thesis of this experiment: high-quality training data is not just a collection of perfect documents. According to the latest AI research, a model cannot truly understand compliance if it only sees the "happy path." To successfully navigate human QA rubrics, the model must explicitly learn from negative examples.

That is why this corpus covers a wide range of topics and includes detailed narratives of non-compliant scenarios. Specifically, we are engineering use cases of typical CSV (Computer System Validation) findings and deviations. The dataset doesn't just show the mistake; it trains the model on how these errors are discovered, why they constitute a compliance failure, and how to remediate them.

High-quality training data shouldn't be a proprietary bottleneck; it is shared infrastructure. I'm publishing this corpus to provide a reliable baseline so we can focus on the actual engineering challenge.

Once the dataset is fully public, the next phase begins: attempting to fine-tune a Qwen 3 (7B) model exclusively on this regulatory corpus.

I will be upfront: forcing an LLM to reliably navigate compliance and pass a human QA reviewer's rubric is not a trivial, weekend fine-tune. It will require rigorous testing, constant iteration, and likely some failures along the way. I will document that process transparently as it happens.

I don't have all the answers yet, but I will share the evaluation results as soon as we get there.

Where to find it

Download (ZIP, ~1.2 MB): neuralarchitects.ae/gxp-corpus — the corpus hub page, no email gate, CC-BY-SA 4.0.
Direct zip: gxp-corpus-v1.3.zip
GitHub repo: github.com/neuralarchitects-de/gamp5-corpus — star or watch the repo for release notifications. v1.3 release ships the same zip as a downloadable asset.