Video Analysis: MVP & File API Implementation


Introduction: The Challenge of Video Analysis

Hey guys! Let's dive into a cool project: implementing video analysis for our gemini-media-mcp server. Right now, we rock at analyzing images and audio, but videos? Nope. Adding video analysis is a game-changer, but it's not a walk in the park: large videos can blow past request size and token limits, and uploaded files have to be tracked somehow. That's where our strategy comes in: break the work into smaller, more manageable phases.

Problem: Expanding Media Analysis Capabilities

Our current system, gemini-media-mcp, is pretty good at handling images and audio. But, come on, it's 2024! We need video analysis. This expansion promises a lot, but it also comes with some technical hurdles. Here's how we're going to tackle them.

Goal: A Robust and Scalable Video Analysis Feature

Our goal? To build a solid, scalable video analysis feature, following two key principles: YAGNI (You Ain't Gonna Need It) and SRP (Single Responsibility Principle). To keep things simple and get something working fast, we'll split the work into two phases, starting with an MVP.

  • Phase 1 (MVP): Get the basics down by analyzing short videos (up to 20MB) using inline_data. This is our minimum viable product.
  • Phase 2: Crank things up by adding support for longer videos (up to 2GB) using the File API. This also tackles file management on the client side. We're aiming for a solution that's both effective and easy to use.

Architectural Changes: How We'll Build It

Design Principles: Keeping It Clean

  • YAGNI (You Ain't Gonna Need It): We're starting small. We'll focus on the simplest solution for short videos first. We add complexity only when it's absolutely necessary.
  • SRP (Single Responsibility Principle): The video analysis logic will live in its own separate module, tools/video_analyzer.py. This keeps everything organized and prevents the analysis from getting tangled up with other media types. We believe in keeping things simple.

Sequence Diagram (Phase 1: MVP): Short Video Processing

This diagram shows how a short video gets processed. The client sends a request. The server checks the file. The file's bytes go straight into the Gemini API. The API gives back structured data. The server then returns a Pydantic object with the results.

sequenceDiagram
    participant C as Client
    participant S as Server
    participant VA as VideoAnalyzer
    participant GC as GeminiClient
    participant API as Gemini API

    C->>S: Call analyze_video(path)
    S->>VA: analyze_video(path)
    VA->>VA: File validation (size < 20MB)
    VA->>GC: generate_content(media_bytes)
    GC->>API: Request with inline_data
    API-->>GC: JSON response
    GC-->>VA: Response text
    VA->>VA: Parsing into VideoAnalysisResponse
    VA-->>S: Return Pydantic object
    S-->>C: Analysis result

Logical Diagram: Two-Phase Approach

This diagram shows how the server will decide how to process a video depending on its file size. This routing kicks in once both phases are implemented.

graph TD
    A[Video analysis request] --> B{File size < 20 MB?}
    B -- Yes --> C["Use inline_data (MVP)"]
    B -- No --> D["Use File API (Phase 2)"]
    C --> E[Analysis result]
    D --> F[Upload file, get file_uri]
    F --> G[Analysis using file_uri]
    G --> H[Analysis result + file_uri]
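
In code, that routing decision boils down to a single size check. Here's a minimal sketch; the helper name is hypothetical, and the 20MB constant reflects the inline_data request limit from the Gemini docs:

import os

# inline_data requests are capped at roughly 20MB total (per the Gemini docs)
INLINE_LIMIT_BYTES = 20 * 1024 * 1024

def select_strategy(video_path: str) -> str:
    """Hypothetical helper: pick the upload strategy from the file size."""
    size = os.path.getsize(video_path)
    return "inline_data" if size < INLINE_LIMIT_BYTES else "file_api"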

Concrete Implementation: Making It Real

Data Models and Signatures: The Building Blocks

1. VideoAnalysisResponse Model (in models/analysis.py): This model defines the structure of the response we get back from the video analysis. It will include things like the video title, a summary, the full transcript, key events, hashtags, and a file_uri if the File API is used.

from pydantic import BaseModel, Field
from typing import List, Optional

class VideoEvent(BaseModel):
    """Describes a key event in the video with a timestamp."""
    timestamp: str = Field(..., description="Event timestamp in MM:SS format.")
    description: str = Field(..., description="Event description.")

class VideoAnalysisResponse(BaseModel):
    """Structured response with video analysis results."""
    title: str = Field(..., description="Short and informative video title.")
    summary: str = Field(..., description="Video summary.")
    transcription: str = Field(..., description="Full audio track transcription.")
    events: List[VideoEvent] = Field(..., description="List of key events with timestamps.")
    hashtags: List[str] = Field(..., description="List of relevant hashtags.")
    file_uri: Optional[str] = Field(None, description="File URI after upload via File API for reuse.")
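
Assuming Pydantic v2, turning the model's JSON output into this structure is a one-liner. A quick usage example with made-up data:

raw = '''{
  "title": "Demo clip",
  "summary": "A short demo video.",
  "transcription": "Hello and welcome.",
  "events": [{"timestamp": "00:05", "description": "Speaker greets the audience."}],
  "hashtags": ["#demo"]
}'''

result = VideoAnalysisResponse.model_validate_json(raw)
print(result.title)     # "Demo clip"
print(result.file_uri)  # None (inline_data path, nothing was uploaded)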

2. analyze_video Function Signature (in tools/video_analyzer.py): This is the main function that handles the video analysis. It takes a video path, an optional user prompt, model name, and system instructions as inputs. It automatically decides whether to use inline_data or the File API based on the file size.

from models.analysis import VideoAnalysisResponse, ErrorResponse

def analyze_video(
    video_path: str,
    user_prompt: str = "",
    model_name: str | None = None,
    system_instruction_name: str = "default",
    system_instruction_override: str | None = None,
    system_instruction_file_path: str | None = None,
) -> VideoAnalysisResponse | ErrorResponse:
    """
    Analyzes a video file, automatically selecting the method (inline_data or File API)
    depending on the file size.
    """
    # ... implementation ...
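
To make the dispatch concrete, here's a minimal sketch of how the size check could drive the two code paths. Everything below is illustrative: the generate_content and upload_file methods are assumed names for the GeminiClient wrapper, not the project's actual API, and ErrorResponse is assumed to carry an error field.

import os
from typing import Protocol

from models.analysis import VideoAnalysisResponse, ErrorResponse

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # inline_data ceiling per the Gemini docs

class GeminiClientLike(Protocol):
    """Stand-in for the project's GeminiClient; method names are assumptions."""
    def generate_content(self, *, media_bytes: bytes | None = None,
                         file_uri: str | None = None, prompt: str = "") -> str: ...
    def upload_file(self, path: str) -> str: ...

def _analyze_video_sketch(client: GeminiClientLike, video_path: str,
                          user_prompt: str = "") -> VideoAnalysisResponse | ErrorResponse:
    if not os.path.isfile(video_path):
        return ErrorResponse(error=f"File not found: {video_path}")  # assumed field name

    file_uri = None
    if os.path.getsize(video_path) < INLINE_LIMIT_BYTES:
        # Phase 1 (MVP): ship the raw bytes as inline_data.
        with open(video_path, "rb") as f:
            raw_json = client.generate_content(media_bytes=f.read(), prompt=user_prompt)
    else:
        # Phase 2: upload once via the File API, then reference the returned URI.
        file_uri = client.upload_file(video_path)
        raw_json = client.generate_content(file_uri=file_uri, prompt=user_prompt)

    result = VideoAnalysisResponse.model_validate_json(raw_json)
    result.file_uri = file_uri  # hand the URI back so the client can reuse it
    return result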

File API Strategy: Keeping It Stateless

To keep our server stateless (meaning it doesn't store information about uploaded files), we won't save any file data. Instead, when analyzing large videos using the File API, the VideoAnalysisResponse will include a file_uri. The client (you) can save this URI and reuse it in subsequent requests to re-analyze the same file for up to 48 hours, without re-uploading. That's pretty cool, right?
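
Here's what that reuse could look like from the client side, sketched with the google-genai SDK directly rather than through our MCP tool. The model name and the saved URI value are placeholders:

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# file_uri saved from a previous VideoAnalysisResponse (placeholder value)
saved_uri = "https://generativelanguage.googleapis.com/v1beta/files/abc123"

response = client.models.generate_content(
    model="gemini-2.0-flash",  # model name is an assumption
    contents=[
        types.Part.from_uri(file_uri=saved_uri, mime_type="video/mp4"),
        "Summarize the key events again, focusing on the second half.",
    ],
)
print(response.text)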

Task TODO List: Your Checklist

  • Phase 1: Short Video Analysis (<20MB)
    • [ ] Create the VideoAnalysisResponse model in models/analysis.py. We need to start here to define the structure for our results.
    • [ ] Add the is_video_valid function in utils/file_utils.py to check the format and size of the video (see the sketch after this checklist). Gotta make sure we're only analyzing valid videos.
    • [ ] Implement analyze_video in tools/video_analyzer.py with the inline_data logic. This is the core of our MVP.
    • [ ] Integrate and register the new tool in server.py and tools/__init__.py. We need to make sure our server knows about this new function.
    • [ ] Add a test short video in tests/. Always test, test, test!
  • Phase 2: Long Video Analysis (File API)
    • [ ] Extend analyze_video to support the File API for files > 20MB. Make it handle larger files.
    • [ ] Add file_uri to VideoAnalysisResponse when using the File API. We need this to allow clients to reuse files.
    • [ ] Update GeminiClient to support video uploads via the File API. The client needs to know how to upload the videos.
    • [ ] Add tests for both logic branches (short and long videos). More tests.
  • Completion
    • [ ] Update the project documentation, describing the new feature. Let's make sure everyone knows how to use this.
    • [ ] Update the knowledge base with information about this solution. Let's document our work.
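
For the is_video_valid item above, here's a minimal sketch. The extension whitelist mirrors the video formats the Gemini docs list (double-check against the current docs); the function name matches the TODO, but its signature and return shape are our assumptions:

import os

# Subset of video formats listed in the Gemini docs; verify before shipping.
SUPPORTED_VIDEO_EXTENSIONS = {".mp4", ".mpeg", ".mpg", ".mov", ".avi", ".webm", ".wmv", ".3gp", ".flv"}
INLINE_LIMIT_BYTES = 20 * 1024 * 1024

def is_video_valid(video_path: str, max_size_bytes: int = INLINE_LIMIT_BYTES) -> tuple[bool, str]:
    """Return (ok, reason). Hypothetical helper for utils/file_utils.py."""
    if not os.path.isfile(video_path):
        return False, "file does not exist"
    ext = os.path.splitext(video_path)[1].lower()
    if ext not in SUPPORTED_VIDEO_EXTENSIONS:
        return False, f"unsupported format: {ext}"
    if os.path.getsize(video_path) >= max_size_bytes:
        return False, "file exceeds the inline_data size limit"
    return True, "ok"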

That's the plan, guys! Let's get to work and make this video analysis a reality!