Below is the updated FastAPI-Based Benchmarker TRD with no timelines and an additional layer of detail for every section. Each point now includes sub-goals clarifying what must be done and how. Enjoy!
Technical Requirements Document (TRD)
FastAPI-Based Benchmarker
1. Overview
The purpose of this TRD is to define the requirements, architecture, and specifications for a FastAPI-based application that benchmarks local Large Language Models (LLMs) via Ollama. The system will collect detailed performance metrics, store results in a database (MongoDB), and provide a lightweight web interface for users to submit prompts, compare metrics, and review historical benchmarks.
Sub-Goals
- Define the core functionalities:
- Identify how requests are sent to local LLMs (via Ollama).
- Define the structure for performance data collection.
- Decide on the tech stack:
- FastAPI for the web/API layer.
- MongoDB for data persistence.
- Jinja2 templates and JavaScript for a minimal UI.
- Ensure maintainability:
- Clear separation of concerns: services, models, routers, etc.
- Sufficient documentation and test coverage.
2. Scope
- System Name: Ollama Benchmarker
- Users:
- Developers or ML Engineers evaluating local model performance.
- QA engineers verifying consistent model output over time.
- Non-technical stakeholders needing a simple UI for side-by-side comparisons.
- In-Scope:
- Benchmarking local LLMs through Ollama.
- Collecting system usage metrics (CPU, memory, GPU).
- Timing metrics (time to first token, total time, chunk times).
- Logging and storing user prompt, model response, and benchmarks.
- Simple UI to input prompts, configure models, and explore results history.
- Out-of-Scope:
- Cloud-based LLM integration.
- Production-grade security measures such as full user management.
Sub-Goals
- Precisely define what “benchmarking” includes:
- Confirm which metrics (CPU, memory, GPU usage, etc.) are relevant.
- Confirm which timing metrics (time to first chunk, total time, chunk throughput) matter.
- Clarify minimal UI:
- Outline the page(s) needed.
- Identify user flows for setting prompts, viewing real-time benchmarks, and retrieving history.
- Define exclusions:
- Maintain focus on local LLMs.
- No third-party authentication or advanced user system.
3. Objectives
- Collect Performance Data
- Gather standardized metrics for consistent cross-model comparisons.
- Compare and Visualize
- Persist results to MongoDB (or JSON fallback) for historical trend analysis.
- Provide an interface that highlights differences (e.g., CPU usage, response speed).
- Ease of Use
- Enable quick user prompts in a minimal UI with minimal overhead.
- Generate structured benchmark results (JSON) automatically.
- Maintainability
- Keep code modular, consistent, and thoroughly tested.
- Ensure that the system can be extended with additional metrics or new LLMs.
Sub-Goals
- Performance Data Depth:
- Include CPU frequency fluctuations, memory usage deltas, GPU usage if applicable.
- Manage detailed timing for chunked responses.
- Data Visualization:
- Provide tabular listings of historical benchmarks.
- Offer basic statistical summaries (min, max, average).
- UI Simplicity:
- Accept prompt input and model list.
- Display results in real time or near real time with minimal confusion.
- Long-Term Maintainability:
- Use PEP8 style guidelines.
- Keep test coverage at a high level.
- Document all major classes and methods.
4. Architecture
```
            ┌─────────────┐
            │   Web UI    │
            │(HTML/JS/CSS)│
            └──────┬──────┘
                   │
                   ▼
            ┌────────────────┐
            │ FastAPI Server │
            │   (app/main)   │
            └──────┬─────────┘
                   │
      ┌────────────┼──────────────┬─────────────┐
      ▼            ▼              ▼             ▼
┌───────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐
│  Routers  │ │  Models  │ │  Services  │ │ Templates  │
│(benchmarks│ │(Pydantic)│ │(benchmark, │ │(index.html)│
│  etc.)    │ │          │ │  storage,  │ └────────────┘
└───────────┘ └──────────┘ │  ollama)   │
                           └──────┬─────┘
                                  │
                                  ▼
                           ┌────────────┐
                           │ MongoDB /  │
                           │ JSON Files │
                           └────────────┘
```
- FastAPI orchestrates HTTP requests.
- Routers separate concerns (benchmarks, history).
- Services contain business logic (benchmark, storage, Ollama interactions).
- Models define request and response formats using Pydantic.
- MongoDB persists data (fallback: JSON file store).
- Templates + JS for minimal user interface.
Sub-Goals
- Clear Layered Architecture:
- Separate each function of the system into logical modules (API, services, data models).
- Future-Proofing:
- Keep an open design for adding new LLM backends besides Ollama if needed.
- Minimal Complexity:
- Ensure the system is straightforward and not over-engineered for the current scope.
5. Detailed Requirements
5.1. Functional Requirements
- Model Benchmarking
- Req-F-1: Accept a list of models to benchmark given a user prompt.
- Req-F-2: Measure time to first chunk, total processing time, chunk sizes, etc.
- Req-F-3: Track CPU, memory usage, and optional GPU usage over the benchmark duration.
- Req-F-4: Capture and store the model response text for reference.
- Data Storage
- Req-F-5: Persist all benchmark data in MongoDB (or JSON) including ID, timestamps, model metadata, user prompt.
- Req-F-6: Store basic system hardware info (CPU cores, total memory, OS info) for context.
- Data Retrieval
- Req-F-7: Provide an endpoint to retrieve the entire benchmark history.
- Req-F-8: Provide an endpoint to retrieve a single benchmark by ID.
- Web Interface
- Req-F-9: Let users input a prompt, specify selected models, and run the benchmark from the browser.
- Req-F-10: Show benchmark results in real time or near real time in the UI.
- Req-F-11: Let users browse, sort, or filter historical benchmarks.
- Error Handling
- Req-F-12: Gracefully handle errors (timeouts, missing models), returning appropriate HTTP codes and messages.
- Req-F-13: Log errors in a structured manner for debugging.
Sub-Goals for Each Functional Requirement
- Req-F-1: Implement a POST endpoint accepting a `BenchmarkRequest` (prompt, list of models).
- Req-F-2: During model inference, record timestamps for the first token and the entire generation (see the timing sketch after this list).
- Req-F-3: Use psutil or an equivalent library to record CPU/mem usage over the request duration.
- Req-F-4: Append the final text output from the model to the result object.
- Req-F-5: Convert results to a Pydantic model and store them in MongoDB or a local JSON file.
- Req-F-6: Use a system metrics function to store CPU/memory/GPU stats with each benchmark.
- Req-F-7: Implement a GET endpoint returning a list of recent benchmarks in reverse chronological order.
- Req-F-8: Implement a GET endpoint accepting a benchmark ID to fetch a single record.
- Req-F-9: Provide an HTML page with a prompt input box and a multi-select for models.
- Req-F-10: Render partial results in the UI as soon as possible, or show a loading animation if synchronous.
- Req-F-11: Provide a history tab or section to display stored benchmarks and filter them by timestamp or model.
- Req-F-12: Throw or catch exceptions (e.g. `HTTPException`) for misconfigurations and timeouts.
- Req-F-13: Log the detailed stacktrace and user-facing messages at different log levels.
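As a rough sketch of the timing flow behind Req-F-2 and Req-F-4, the helper below assumes the Ollama client exposes the response as an async iterator of text chunks; the function name and returned keys are illustrative, not final.

```python
import time
from typing import AsyncIterator, Optional

async def time_generation(chunks: AsyncIterator[str]) -> dict:
    """Collect timing metrics for a streamed model response.

    `chunks` is any async iterator of text chunks (e.g. from the Ollama client wrapper).
    """
    start = time.perf_counter()
    first_chunk_at: Optional[float] = None
    chunk_times: list[float] = []
    pieces: list[str] = []

    async for chunk in chunks:
        now = time.perf_counter()
        if first_chunk_at is None:
            first_chunk_at = now
        chunk_times.append(now - start)  # elapsed time when each chunk arrived
        pieces.append(chunk)

    return {
        "time_to_first_chunk": (first_chunk_at - start) if first_chunk_at is not None else None,
        "total_time": time.perf_counter() - start,
        "chunk_times": chunk_times,
        "response_text": "".join(pieces),
    }
```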
5.2. Non-Functional Requirements
- Performance
- Req-NF-1: The system must handle concurrent benchmark requests (async/await).
- Req-NF-2: The system must not hang for indefinite durations; a maximum benchmark time or cancellation approach is recommended.
- Scalability
- Req-NF-3: The design should allow easy containerization or running across multiple processes.
- Security
- Req-NF-4: Sanitize all prompt input to avoid injection attacks.
- Req-NF-5: Keep environment-specific secrets (MongoDB credentials) out of the code repository.
- Maintainability
- Req-NF-6: Code should follow PEP8 style guidelines.
- Req-NF-7: Each major module should be covered by unit or integration tests.
- Usability
- Req-NF-8: UI must allow a novice user to run a benchmark with minimal instructions.
Sub-Goals for Each Non-Functional Requirement
- Req-NF-1: Properly implement FastAPI concurrency with `async` routes.
- Req-NF-2: Add a configurable timeout for model inference (see the timeout sketch after this list).
- Req-NF-3: Provide a Dockerfile or deployment instructions.
- Req-NF-4: Validate/sanitize all input and ensure no direct shell injection.
- Req-NF-5: Rely on environment variables or secrets manager for DB credentials.
- Req-NF-6: Enforce style checking with a tool like `black` or `flake8`.
- Req-NF-7: Create a test suite with coverage metrics for each module.
- Req-NF-8: UI layout is simple, label inputs clearly, show success/error messages.
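One possible way to meet Req-NF-2 (and lean on Req-NF-5's environment-based configuration) is an `asyncio.wait_for` wrapper; the variable name `BENCHMARK_TIMEOUT_SECONDS` and the default value are assumptions for illustration.

```python
import asyncio
import os

# Illustrative env-driven timeout; the variable name and default are assumptions.
BENCHMARK_TIMEOUT_SECONDS = float(os.getenv("BENCHMARK_TIMEOUT_SECONDS", "120"))

async def run_with_timeout(coro, timeout: float = BENCHMARK_TIMEOUT_SECONDS):
    """Await a per-model benchmark coroutine, converting a hang into a structured error."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        return {"success": False, "error": f"Benchmark exceeded {timeout}s timeout"}
```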
6. API Endpoints
| Endpoint | Method | Description | Request Body | Response |
|---|---|---|---|---|
| /api/benchmarks/run | POST | Run benchmarks on selected models | BenchmarkRequest | BenchmarkResponse (JSON) |
| /api/benchmarks/history | GET | Get recent benchmarks in descending timestamp order | None | Array of BenchmarkResponse |
| /api/benchmarks/history/{id} | GET | Retrieve a specific benchmark by ID | None | A single BenchmarkResponse or error |
Sub-Goals
- Implement POST logic (`/api/benchmarks/run`):
- Validate the incoming request.
- Launch concurrent tasks to benchmark each model.
- Return a `BenchmarkResponse`.
- Implement GET logic (`/api/benchmarks/history`):
- Query the storage backend for the most recent benchmarks.
- Return a list of responses.
- Implement GET logic (`/api/benchmarks/history/{id}`):
- Validate the `id` type (ObjectId or string).
- Return the single benchmark document or 404 if not found (a router sketch for these endpoints follows below).
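A minimal router sketch covering the three endpoints; the import paths and helper functions (`run_benchmarks`, `get_history`, `get_benchmark_by_id`) are hypothetical placeholders for the service layer described in Section 8.

```python
# app/routers/benchmarks.py (sketch)
from fastapi import APIRouter, HTTPException

from app.models.benchmark import BenchmarkRequest, BenchmarkResponse   # hypothetical path
from app.services.benchmark import run_benchmarks                      # hypothetical service
from app.services.storage import get_benchmark_by_id, get_history      # hypothetical storage

router = APIRouter(prefix="/api/benchmarks", tags=["benchmarks"])

@router.post("/run", response_model=BenchmarkResponse)
async def run(request: BenchmarkRequest) -> BenchmarkResponse:
    # Request validation is handled by the Pydantic model itself.
    return await run_benchmarks(request)

@router.get("/history", response_model=list[BenchmarkResponse])
async def history() -> list[BenchmarkResponse]:
    # The storage layer is expected to return results in reverse chronological order.
    return await get_history(limit=50)

@router.get("/history/{benchmark_id}", response_model=BenchmarkResponse)
async def history_item(benchmark_id: str) -> BenchmarkResponse:
    result = await get_benchmark_by_id(benchmark_id)
    if result is None:
        raise HTTPException(status_code=404, detail="Benchmark not found")
    return result
```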
7. Data Models
Sub-Goals
- BenchmarkRequest:
- Must contain prompt, models, and optional generation parameters (e.g., temperature, top_k).
- SystemInfo:
- Collect platform, CPU, memory, GPU details at the start of each benchmark run.
- BenchmarkResult:
- Store model name, timing details, throughput data, system impact, success status, final text response.
- BenchmarkResponse:
- Wraps the entire benchmark run with a timestamp and system info, plus the list of results.
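A sketch of these four models; field names and constraints are illustrative and would be refined during implementation.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field

class BenchmarkRequest(BaseModel):
    prompt: str = Field(..., max_length=8000)  # length cap is illustrative (see Risks)
    models: list[str]
    temperature: Optional[float] = None
    top_k: Optional[int] = None

class SystemInfo(BaseModel):
    platform: str
    cpu_count: int
    total_memory_mb: float
    gpu_name: Optional[str] = None

class BenchmarkResult(BaseModel):
    model: str
    success: bool
    time_to_first_chunk: Optional[float] = None
    total_time: Optional[float] = None
    tokens_per_second: Optional[float] = None
    cpu_percent_delta: Optional[float] = None
    memory_mb_delta: Optional[float] = None
    response_text: str = ""
    error: Optional[str] = None

class BenchmarkResponse(BaseModel):
    id: Optional[str] = None
    timestamp: datetime
    prompt: str
    system_info: SystemInfo
    results: list[BenchmarkResult]
```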
8. Implementation Sketch
A high-level code layout demonstrating how the components interact:
`app/main.py`
- FastAPI entry point; sets up routing and templates.
`app/routers/benchmarks.py`
- Contains the `/api/benchmarks/run`, `/api/benchmarks/history`, and `/api/benchmarks/history/{id}` endpoints.
`app/services/benchmark.py`
- Orchestrates concurrency for multiple models.
- Uses an `ollama_client` for local LLM interaction.
- Wraps data in a `BenchmarkResponse`.
`app/services/storage.py`
- Persists data to MongoDB (or fallback JSON).
- Retrieves records for history or a specific ID.
`app/services/ollama_client.py`
- Hypothetical wrapper over the Ollama CLI or local server.
- Returns chunk timing, text output, and throughput stats.
`app/utils/system_metrics.py`
- Gathers CPU/memory usage with `psutil` before and after model inference.
Sub-Goals
`app/main.py`:
- Register the `benchmarks` router.
- Mount static files for the minimal UI.
`app/routers/benchmarks.py`:
- Validate input with Pydantic.
- Call `BenchmarkService` methods.
- Handle exceptions for non-existent IDs.
`app/services/benchmark.py`:
- Spawn async tasks for each model to measure concurrency.
- Compute and record system metrics before and after the run.
- Return structured data in a `BenchmarkResponse`.
`app/services/storage.py`:
- Insert new documents in `save_benchmark()`.
- Support basic queries like `.find({}).sort("timestamp", -1)`.
`ollama_client.py`:
- Might use async `subprocess` calls or an HTTP client.
- Parse chunk times if provided by Ollama.
`system_metrics.py`:
- Snapshot CPU/memory usage at the start and end (see the sketch after this list).
- Potentially log usage at intervals throughout the run.
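A possible shape for `system_metrics.py`, using `psutil` calls only; GPU sampling is omitted here because it would require an additional, vendor-specific library, so treat this as a partial sketch.

```python
# app/utils/system_metrics.py (sketch)
import platform
import psutil

def system_info() -> dict:
    """One-time hardware/OS snapshot stored alongside each benchmark run."""
    freq = psutil.cpu_freq()
    return {
        "platform": platform.platform(),
        "cpu_count": psutil.cpu_count(logical=True),
        "cpu_freq_mhz": freq.current if freq else None,
        "total_memory_mb": psutil.virtual_memory().total / (1024 * 1024),
    }

def usage_snapshot() -> dict:
    """Point-in-time CPU/memory usage; call before and after inference and diff the values."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_percent": psutil.virtual_memory().percent,
    }
```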
9. Error Handling & Logging
- Use `HTTPException` for returning 4xx/5xx status codes.
- Try/Except blocks in services for handling internal errors.
- Structured Logging with timestamps, error messages, stack traces.
- Structured Logging with timestamps, error messages, stack traces.
Sub-Goals
- HTTP-Level Errors:
- Return user-friendly messages.
- Example: `HTTPException(status_code=400, detail="Invalid model name")`.
- Internal Errors:
- Log traceback for debug.
- Distinguish between warnings (e.g., model fallback) and critical errors (e.g., DB failure).
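A brief sketch of how a route-level handler might distinguish the two cases; `benchmark_model` is a hypothetical service call.

```python
import asyncio
import logging

from fastapi import HTTPException

logger = logging.getLogger("benchmarker")

async def run_single_model(model: str, prompt: str):
    try:
        return await benchmark_model(model, prompt)  # hypothetical service call
    except asyncio.TimeoutError:
        # Warning-level: the model hung, but the system itself is healthy.
        logger.warning("Model %s timed out", model)
        raise HTTPException(status_code=504, detail=f"Model '{model}' timed out")
    except Exception:
        # Error-level: log the full traceback, return a generic message to the user.
        logger.exception("Benchmark failed for model %s", model)
        raise HTTPException(status_code=500, detail="Internal benchmark error")
```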
10. Security Considerations
- Prompt Sanitization to avoid injection attacks in logs or shell calls.
- Environment Secrets for DB credentials.
- HTTPS recommended for production usage.
Sub-Goals
- Input Validation:
- Ensure prompt input does not contain malicious shell commands if used in any direct exec.
- Credential Management:
- Rely on environment variables or a secret manager for DB connection strings.
- Transport Layer Security:
- Deploy behind TLS in any public environment.
11. Testing
- Unit Tests
- For each function: system metrics retrieval, model invocation, JSON creation.
- Integration Tests
- Use a mock or real Ollama client to test the end-to-end pipeline.
- Verify that the database receives correct data.
- End-to-End Tests
- Start the full FastAPI app locally.
- Run a real benchmark, confirm the data structure in the response.
- Load & Performance Tests
- Evaluate concurrency via load testing tools (e.g., Locust).
- Check memory usage under many simultaneous prompts.
Sub-Goals
- Comprehensive Coverage:
- Ensure major routes and services are tested.
- Mocking:
- Mock external dependencies (Ollama, DB).
- Confirm the system logic without real side effects.
- Data Integrity Validation:
- Test that the stored JSON format matches the Pydantic model definitions.
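An illustration of the mocking approach, assuming the hypothetical router/service layout sketched in Sections 6-8; module paths and names are placeholders.

```python
# tests/test_benchmarks_api.py (sketch, assuming the hypothetical layout above)
from fastapi.testclient import TestClient

from app.main import app
from app.routers import benchmarks

client = TestClient(app)

def test_run_benchmark_returns_structured_response(monkeypatch):
    async def fake_run_benchmarks(request):
        # Stand-in for the real service so no Ollama or MongoDB is required.
        return {
            "timestamp": "2024-01-01T00:00:00Z",
            "prompt": request.prompt,
            "system_info": {"platform": "test", "cpu_count": 1, "total_memory_mb": 1024.0},
            "results": [],
        }

    monkeypatch.setattr(benchmarks, "run_benchmarks", fake_run_benchmarks)

    resp = client.post("/api/benchmarks/run", json={"prompt": "hello", "models": ["llama2"]})
    assert resp.status_code == 200
    assert resp.json()["prompt"] == "hello"
```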
12. Performance
- AsyncIO is used to handle concurrent requests.
- CPU/Memory usage measurement must remain accurate under load.
- The system should degrade gracefully if model calls saturate resources.
Sub-Goals
- Async Implementation:
- Use `await` and `async` in the main benchmark flow (see the concurrency sketch after this list).
- Possibly rely on concurrency primitives if multiple benchmarks run at once.
- Resource Monitoring:
- Confirm that CPU/memory usage metrics don’t conflict with other system processes.
- Limit the size of logs if usage data is collected frequently.
- Graceful Degradation:
- If the model call is blocking or times out, return an error object.
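A concurrency sketch for the main benchmark flow, assuming a per-model coroutine named `benchmark_model` (hypothetical); `return_exceptions=True` keeps one failing model from aborting the others.

```python
import asyncio

async def benchmark_all(models: list[str], prompt: str) -> list[dict]:
    """Benchmark several models concurrently; one failure never aborts the whole run."""
    tasks = [benchmark_model(name, prompt) for name in models]  # hypothetical per-model coroutine
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)

    results = []
    for name, outcome in zip(models, outcomes):
        if isinstance(outcome, Exception):
            # Degrade gracefully: record the error instead of failing the request.
            results.append({"model": name, "success": False, "error": str(outcome)})
        else:
            results.append(outcome)
    return results
```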
13. Deployment
- Local Development:
- Use `uvicorn app.main:app --reload` for testing.
- Containerization:
- Provide a Dockerfile with required dependencies.
- Production:
- Use a production server (gunicorn + uvicorn worker).
- Deploy behind a load balancer for scaling.
Sub-Goals
- Local Environment:
- Simple startup scripts or `docker-compose` for dev environment.
- Dockerfiles:
- Multi-stage build to reduce final image size.
- Expose default port 8000.
- Production Deployment:
- Document environment variables for DB connection.
- Possibly introduce a caching layer if usage grows.
14. Maintenance
- Regular Library Updates: Keep FastAPI, Motor, and psutil up to date.
- Code Review: All significant changes must be peer-reviewed.
- Documentation: Methods, classes, and major flows described in docstrings or external docs.
Sub-Goals
- Dependency Management:
- Use Poetry or pip-compile to track library versions.
- Periodically update them.
- Review Process:
- Ensure that each pull request is reviewed by at least one team member.
- Documentation:
- Maintain a README covering setup, usage, and known issues.
- Keep in-line docstrings explaining complex code paths.
15. (No Timelines)
(Removed all timeline references as requested.)
16. Risks & Mitigations
- Risk: Ollama call fails or times out.
- Mitigation: Implement an async timeout wrapper, return error info.
- Risk: Large prompts cause memory spikes.
- Mitigation: Warn or limit prompt size.
- Risk: MongoDB performance slowdown with large results.
- Mitigation: Implement indexes, archiving, or limit stored data.
Sub-Goals
- Timeout Handling:
- Confirm timeouts with an asynchronous approach.
- Return partial results if possible.
- Prompt Size Management:
- Provide a maximum prompt length in the UI or config.
- Scalable Storage:
- If the DB grows large, plan for archiving or sharding strategies.
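A sketch of the storage mitigation, assuming Motor with a `benchmarker` database and a `benchmarks` collection (both names are placeholders).

```python
# Sketch: a descending index on timestamp keeps the history endpoint fast as data grows.
import os

from motor.motor_asyncio import AsyncIOMotorClient

async def ensure_indexes() -> None:
    client = AsyncIOMotorClient(os.getenv("MONGODB_URI", "mongodb://localhost:27017"))
    collection = client["benchmarker"]["benchmarks"]  # database/collection names are placeholders
    await collection.create_index([("timestamp", -1)])
```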
17. Approvals
- Project Sponsor: Decides on overall direction and signs off on major features.
- Team Lead: Ensures architecture consistency and code quality.
- DevOps: Confirms deployment feasibility.
Sub-Goals
- Project Sponsor:
- Validate high-level functionality is in line with business priorities.
- Team Lead:
- Verify that the system meets coding standards and design patterns.
- DevOps:
- Check containerization, ensure minimal friction in deploying or scaling.
18. References
Sub-Goals
- Link Relevance:
- Confirm each reference is correct and up-to-date.
- Continuous Learning:
- Team should stay updated on new releases or best practices in FastAPI, Motor, etc.
19. Appendix
A.1: Example JSON output for a single benchmark
Sub-Goals
- Data Accuracy:
- Validate that fields like `time_to_first_chunk` and `total_time` are measured accurately.
- Clarity:
- Ensure each top-level field is self-explanatory or well-documented.
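An illustrative shape for A.1; field names follow the data-model sketch in Section 7, and all values are placeholders rather than real measurements.

```json
{
  "id": "65a1f0c2e4b0a1b2c3d4e5f6",
  "timestamp": "2024-01-01T12:00:00Z",
  "prompt": "Explain the difference between a list and a tuple in Python.",
  "system_info": {
    "platform": "Linux-6.5.0-x86_64",
    "cpu_count": 16,
    "total_memory_mb": 32768.0,
    "gpu_name": null
  },
  "results": [
    {
      "model": "llama2",
      "success": true,
      "time_to_first_chunk": 0.42,
      "total_time": 6.81,
      "tokens_per_second": 35.2,
      "cpu_percent_delta": 22.5,
      "memory_mb_delta": 180.4,
      "response_text": "A list is mutable, while a tuple is immutable...",
      "error": null
    }
  ]
}
```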
End of TRD
That’s the final FastAPI-Based Benchmarker TRD with no timeline constraints, plus detailed sub-goals for every section. Let me know if you need further clarification or extra detail in any specific area!
Hey Christopher, here’s a concise yet comprehensive look at the core FastAPI component that will drive user interaction with NovaSystem in the next phase of development:
Core Focus
- User Interaction Via FastAPI
  - The central goal is to have a single streamlined interface where users can submit prompts and retrieve responses, metrics, and agent steps.
  - This interface will unify the NovaSystem components (Core, Agents, Memory, etc.) behind one simple API.
- Essential Responsibilities
  - Receive Prompts: Accept a user’s request (prompt, optional parameters) via a POST endpoint (e.g., `/api/nova/ask`).
  - Orchestrate Responses: Pass the prompt to the relevant NovaSystem processes (agents, memory, local LLM).
  - Return Results: Send back structured JSON containing the system’s chain-of-thought steps (if required), final text, and any relevant metrics.
- Immediate Development Goals
  - Integration: Hook up the existing NovaSystem methods (like `process_message()`, `AgentOrchestrator.process_turn()`) to a single FastAPI route.
  - Session Management: Ensure each request can include or reference session context (if you are using sessions).
  - Minimal Data Model: Define the Pydantic models that represent incoming user prompts and outgoing responses (text, metrics, conversation state).
  - High-Level Logging: Track each request’s input prompt, system usage, and any errors within the FastAPI route.
Core FastAPI Component (Example)
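A minimal sketch of what this component might look like, assuming the `NovaRequest`/`NovaResponse` fields and mock response logic described in the Key Points and Next Steps below; exact field names are placeholders.

```python
import logging
import time
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

logger = logging.getLogger("novasystem.api")
app = FastAPI(title="NovaSystem API")

class NovaRequest(BaseModel):
    prompt: str
    session_id: Optional[str] = None

class NovaResponse(BaseModel):
    final_text: str
    chain_of_thought: Optional[list[str]] = None
    metrics: Optional[dict] = None

@app.post("/api/nova/ask", response_model=NovaResponse)
async def ask(request: NovaRequest) -> NovaResponse:
    start = time.perf_counter()
    logger.info("Prompt received (session=%s)", request.session_id)

    # Mock response logic for now; Next Steps covers swapping in real calls to
    # NovaSystemCore / AgentOrchestrator.process_turn().
    final_text = f"(mock) You asked: {request.prompt}"

    return NovaResponse(
        final_text=final_text,
        chain_of_thought=None,
        metrics={"time_to_response_seconds": time.perf_counter() - start},
    )
```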
Key Points
- Single `/api/nova/ask` Endpoint: All user prompts go through here.
- `NovaRequest` Model: Defines what the user can send (prompt, session ID).
- `NovaResponse` Model: Captures the final text, optional chain-of-thought, metrics, etc.
- Session Handling: If your conversation includes context, pass a session ID that the system recognizes.
- Central Logging: Each request can log relevant details (prompt, CPU usage, time to response).
What This Phase Achieves
- Unified Communication Channel: No matter which internal NovaSystem modules are invoked (Core, Agents, Memory), the user only sees the `/api/nova/ask` endpoint.
- Simple Expandability: You can add new features (like advanced metrics, memory lookups, or error handling) behind this endpoint without changing the user-facing interface.
- Chain-of-Thought & Metadata: If you decide to include chain-of-thought data or additional metadata in responses, it’s trivial to extend the `NovaResponse` model.
Next Steps
- Real Implementation: Replace the mock response logic with real calls to `AgentOrchestrator` and `NovaSystemCore`.
- Session/Context Management: Store conversation history either in-memory, via your `SessionManager`, or in a database.
- Authentication or Security: If future phases require user-based authentication, add a FastAPI dependency or JWT-based approach.
- Testing & Logging: Add test suites to ensure each request logs the correct data and returns the correct response format.
- Testing & Logging: Add test suites to ensure each request logs the correct data and returns the correct response format.
That’s it—an essential summary of the next development phase. We’re focusing on building a core FastAPI endpoint that glues together user prompts and NovaSystem’s internal logic. Once that is in place, it forms the solid foundation for more advanced features like chain-of-thought visualization, advanced memory, or multi-agent orchestrations.