GPT-3.5 Misinformation
GPT-3.5 Misinformation Details
I developed a project management and versioning system for fine-tuning pre-trained Large Language Models (LLMs) from providers such as OpenAI and Google on augmented datasets. Users could upload datasets or generate them dynamically via Easy Data Augmentation (EDA), text translation, and other extensible methods. A custom datapoint editor ensured that each dataset matched the fine-tuning format required by the target model.
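To give a feel for the EDA step, here is a minimal sketch of two of the standard EDA operations, random swap and random deletion, using only the standard library. The function names, the `n`, `p`, and `seed` parameters, and the whitespace tokenization are assumptions made for this illustration; they are not taken from the project's code.

```python
import random


def random_swap(words, n=1, rng=None):
    """Return a copy of `words` with n random pairs of positions swapped."""
    rng = rng or random.Random()
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        # Pick two distinct positions and exchange their words.
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p=0.1, rng=None):
    """Drop each word with probability p, always keeping at least one word."""
    rng = rng or random.Random()
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty datapoint: fall back to one random word.
    return kept or [rng.choice(words)]


def augment(text, n_variants=2, seed=0):
    """Generate augmented variants of `text` (whitespace tokenization assumed)."""
    rng = random.Random(seed)
    words = text.split()
    variants = []
    for _ in range(n_variants):
        swapped = random_swap(words, n=1, rng=rng)
        variants.append(" ".join(random_deletion(swapped, p=0.1, rng=rng)))
    return variants
```

Calling `augment("the quick brown fox jumps", n_variants=3)` returns three perturbed copies of the sentence; passing a fixed `seed` keeps the generated dataset reproducible, which matters once datasets are versioned.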
The system supported hierarchical model training, allowing models to be trained sequentially or in parallel, starting from a base model. A dual-layer versioning system (horizontal and vertical) structured training workflows, such as book-chapter-style development. Models and datasets could also be shared globally across different projects, enabling collaborative training and dataset reuse.
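One way to picture the hierarchy is as a tree rooted at the base model. The sketch below is an illustration under an assumption of mine: that "vertical" corresponds to how many sequential fine-tuning steps separate a model from the base, and "horizontal" to a model's position among siblings trained in parallel from the same parent. The `ModelNode` class and its methods are hypothetical names, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ModelNode:
    """One model in the training tree (hypothetical structure)."""
    name: str
    parent: Optional["ModelNode"] = field(default=None, repr=False)
    children: List["ModelNode"] = field(default_factory=list, repr=False)

    def fine_tune(self, name: str) -> "ModelNode":
        """Register a new model trained on top of this one."""
        child = ModelNode(name=name, parent=self)
        self.children.append(child)
        return child

    @property
    def vertical(self) -> int:
        """Number of sequential training steps from the base model."""
        depth, node = 0, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth

    @property
    def horizontal(self) -> int:
        """Index among siblings trained from the same parent."""
        if self.parent is None:
            return 0
        return next(i for i, c in enumerate(self.parent.children) if c is self)

    @property
    def version(self) -> str:
        return f"{self.vertical}.{self.horizontal}"

    def lineage(self) -> List[str]:
        """Names from the base model down to this model."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))
```

For example, two models fine-tuned from the same base get versions `1.0` and `1.1`, and a further model trained on the first of them gets `2.0`, with `lineage()` recording the full training history.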
To optimize storage, augmented datasets were stored hierarchically, referencing the original data rather than duplicating large files. The live training status of models could be monitored, and running sessions could be terminated if needed. Model evaluation included HHH (Helpful, Honest, Harmless) criteria, alignment scoring, and semantic similarity analysis via SBERT to assess data quality.
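The reference-based storage idea can be sketched as follows: each augmented dataset records only the rows it adds plus a pointer to its parent, and the full dataset is materialized on demand by walking the chain. This is a simplified in-memory illustration; `DatasetRecord`, `DatasetStore`, and the method names are hypothetical, and the real system persisted this through a database rather than a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetRecord:
    dataset_id: str
    parent_id: Optional[str] = None   # None marks an original upload
    method: Optional[str] = None      # e.g. "eda" or "translation"
    own_rows: List[str] = field(default_factory=list)  # rows new at this level


class DatasetStore:
    """Keeps each row once; derived datasets point back to their parent."""

    def __init__(self) -> None:
        self._records: Dict[str, DatasetRecord] = {}

    def add_original(self, dataset_id: str, rows: List[str]) -> None:
        self._records[dataset_id] = DatasetRecord(dataset_id, own_rows=list(rows))

    def add_augmented(self, dataset_id: str, parent_id: str,
                      method: str, new_rows: List[str]) -> None:
        if parent_id not in self._records:
            raise KeyError(f"unknown parent dataset: {parent_id}")
        self._records[dataset_id] = DatasetRecord(
            dataset_id, parent_id=parent_id, method=method, own_rows=list(new_rows)
        )

    def resolve(self, dataset_id: str) -> List[str]:
        """Materialize the full dataset: ancestors' rows first, then own rows."""
        chain = []
        current: Optional[str] = dataset_id
        while current is not None:
            record = self._records[current]
            chain.append(record)
            current = record.parent_id
        rows: List[str] = []
        for record in reversed(chain):
            rows.extend(record.own_rows)
        return rows

    def stored_row_count(self) -> int:
        """Rows physically stored, as opposed to rows visible after resolve()."""
        return sum(len(r.own_rows) for r in self._records.values())
```

With a two-row original, an EDA dataset adding one row, and a translation dataset adding one more, `resolve()` on the last dataset yields four rows while only four rows are stored in total, instead of the nine a copy-per-dataset layout would hold.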
A comparative model analysis tool allowed users to inspect multiple models based on quantitative metrics, evaluation statistics, and training metadata. Each model's training history, augmentation methods, and datasets were fully documented.
The backend was built with Python and the frontend with Streamlit. SQLite served as the local-first database, and users could switch to a different database easily. A "Fail-Safe" mechanism ensured that if the application crashed or was closed mid-session, all progress, including active fine-tuning and dataset modifications, was saved so that users could resume seamlessly. The application was deployed via Docker.
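A fail-safe of this kind usually comes down to writing session state to durable storage after every meaningful change. The sketch below shows the general pattern with the standard-library `sqlite3` module so that it stays self-contained; the project itself used SQLAlchemy, and the table name `session_state`, the function names, and the `"job-1"` key are assumptions made for this example.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the checkpoint table. Use a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_state ("
        "key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def save_checkpoint(conn, key, state):
    """Persist `state` (a JSON-serializable dict) under `key`, replacing any old value."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO session_state (key, payload) VALUES (?, ?)",
            (key, json.dumps(state)),
        )


def load_checkpoint(conn, key, default=None):
    """Return the last saved state for `key`, or `default` if none exists."""
    row = conn.execute(
        "SELECT payload FROM session_state WHERE key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else default
```

On startup, the application would call `load_checkpoint` for each known job and resume polling any that were still marked as running. The in-memory default is only convenient for trying the functions out; a crash-safe setup needs an on-disk database file.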
This project highlights my expertise in AI model training, hierarchical versioning, real-time monitoring, and data augmentation, with a focus on efficiency, automation, and reliability.
Tech Stack
- Python (Programming Language)
- Streamlit
- OpenAI API
- Google Cloud Platform (GCP)
- Rational Software Architect
- Docker
- SQLite
- SQLAlchemy
- ORM
- Marshmallow
Project information
- Category: Full Stack
- Development: Solo
- Repository: Public
- Project date: Jan. 2024