
Changelog - LLaMA 3.1 8B & BGE-Large Embeddings

Changes Made

1. Replaced OpenAI with LLaMA 3.1 8B

  • ✅ Created llm_config.py for LLaMA API configuration
  • ✅ Updated agents.py to use LLaMA instead of OpenAI
  • ✅ Support for vLLM (self-hosted) and Vertex AI endpoints
  • ✅ Removed dependency on OpenAI API key
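
vLLM exposes an OpenAI-compatible REST API, so the client side only needs the base URL from `LLAMA_API_URL`. A minimal sketch of that wiring (function and constant names here are illustrative, not the actual contents of `llm_config.py`):

```python
import os

# Illustrative sketch only -- the real llm_config.py may differ.
# vLLM serves an OpenAI-compatible API, so chat completions live at
# /v1/chat/completions on the server named by LLAMA_API_URL.
DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def build_chat_request(prompt: str) -> tuple[str, dict]:
    """Return the (url, payload) pair for a vLLM chat completion call."""
    base = os.environ.get("LLAMA_API_URL", "http://localhost:8000").rstrip("/")
    url = f"{base}/v1/chat/completions"
    payload = {
        "model": DEFAULT_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return url, payload

url, payload = build_chat_request("Extract the invoice fields as JSON.")
print(url)
```

Because the endpoint is OpenAI-compatible, any OpenAI-style client can be pointed at it by overriding the base URL, which is what makes dropping the OpenAI API key possible.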

2. Replaced Embeddings with BGE-Large

  • ✅ Updated main.py to use BAAI/bge-large-en-v1.5 embeddings
  • ✅ Configurable via EMBEDDING_MODEL environment variable
  • ✅ Support for alternative models: thenlper/gte-large
  • ✅ Added normalize_embeddings=True for better similarity search
  • Important: a dedicated embedding model is used for embeddings, NOT LLaMA
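
A toy illustration of why `normalize_embeddings=True` matters: once vectors are unit-length, cosine similarity reduces to a plain dot product, which is the cheap operation vector databases optimize for. (Pure-Python sketch; the real pipeline gets normalized vectors directly from the BGE model.)

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
print(round(dot(a, b), 4))  # prints 0.96
```

With normalization done at embedding time, every stored and query vector is already unit-length, so similarity search never has to divide by vector norms.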

3. vLLM Deployment Scripts

  • deploy_llama.sh - Creates GCP VM with GPU
  • setup_llama_vm.sh - Part 1: Install NVIDIA drivers
  • setup_llama_vm_part2.sh - Part 2: Install CUDA and vLLM
  • ✅ Systemd service configuration for vLLM
  • ✅ Support for HuggingFace token for model access
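
The systemd unit keeps vLLM running across reboots and crashes. A unit along these lines (paths, user, and flags are illustrative, not the actual file installed by setup_llama_vm_part2.sh) would be:

```ini
[Unit]
Description=vLLM OpenAI-compatible server for LLaMA 3.1 8B
After=network-online.target

[Service]
# The HF token lets vLLM download the gated LLaMA weights.
Environment=HUGGING_FACE_HUB_TOKEN=hf_xxx
ExecStart=/opt/vllm/bin/python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 8000
Restart=always
User=vllm

[Install]
WantedBy=multi-user.target
```

`Restart=always` plus `WantedBy=multi-user.target` is what gives the "deploy once, survives reboots" behavior the scripts aim for.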

4. Updated Configuration

  • cloudbuild.yaml - Updated environment variables
  • ✅ Removed OPENAI_API_KEY dependency
  • ✅ Added LLAMA_API_URL, EMBEDDING_MODEL, EMBEDDING_DEVICE
  • ✅ Updated requirements.txt - Removed OpenAI, added sentence-transformers
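
For context, the deploy step in cloudbuild.yaml might look roughly like this (step layout and substitution names are illustrative; only the environment variables mirror the changes above):

```yaml
# Illustrative excerpt -- not the actual cloudbuild.yaml.
steps:
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    args:
      - gcloud
      - run
      - deploy
      - pdf-parser-ai
      - --image=gcr.io/$PROJECT_ID/pdf-parser-ai
      - --set-env-vars=LLAMA_API_URL=$_LLAMA_API_URL,EMBEDDING_MODEL=BAAI/bge-large-en-v1.5,EMBEDDING_DEVICE=cpu
```

Note the absence of OPENAI_API_KEY: the service now needs only the vLLM endpoint and the embedding settings.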

5. Documentation

  • DEPLOYMENT.md - Complete deployment guide
  • EMBEDDINGS.md - Embedding models guide
  • README.md - Updated with new configuration
  • weaviate_gcp_setup.md - Weaviate setup instructions
  • llama_gcp_setup.md - LLaMA setup instructions

Environment Variables

Required

  • WEAVIATE_URL - Weaviate cluster URL
  • LLAMA_API_URL - vLLM server URL (e.g., http://EXTERNAL_IP:8000)

Optional

  • WEAVIATE_API_KEY - For Weaviate Cloud
  • VERTEX_AI_ENDPOINT - Alternative to vLLM (Vertex AI)
  • EMBEDDING_MODEL - Embedding model (default: BAAI/bge-large-en-v1.5)
  • EMBEDDING_DEVICE - Device for embeddings (cpu or cuda)
  • HUGGING_FACE_HUB_TOKEN - For accessing LLaMA models (if required)
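
For local development, the variables above can be set like this (the URLs are placeholders; only `WEAVIATE_URL` and `LLAMA_API_URL` are required):

```shell
# Example configuration -- URL values are placeholders, adjust to your deployment.
export WEAVIATE_URL="https://my-cluster.weaviate.network"   # placeholder
export LLAMA_API_URL="http://10.0.0.5:8000"                 # placeholder vLLM server
export EMBEDDING_MODEL="BAAI/bge-large-en-v1.5"             # default value
export EMBEDDING_DEVICE="cpu"                               # or "cuda" on a GPU host
```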

Deployment Steps

  1. Deploy Weaviate (Cloud or self-hosted)
  2. Deploy LLaMA 3.1 8B using vLLM on GCP Compute Engine
  3. Prepare data using data_preparation.py
  4. Deploy AI service to Cloud Run

See deployment section in README.md for detailed instructions.

Cost Estimation

  • LLaMA 3.1 8B on T4 GPU: ~$1-4/hour
  • Weaviate Cloud: ~$50-200/month
  • Cloud Run: Pay per use
  • Total: ~$100-500/month
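
A back-of-envelope check of the total, under illustrative assumptions (a T4 rate of $1.20/hour and the GPU stopped outside an 8-hour working day; actual rates vary by region):

```python
# Assumed inputs -- adjust for your region and usage pattern.
gpu_rate_per_hour = 1.20   # assumed T4 on-demand rate within the ~$1-4/hour band
hours_per_day = 8          # GPU VM stopped outside working hours
weaviate_monthly = 100.0   # assumed mid-range Weaviate Cloud plan

gpu_monthly = gpu_rate_per_hour * hours_per_day * 30
total = gpu_monthly + weaviate_monthly
print(gpu_monthly, total)  # prints 288.0 388.0
```

That lands inside the ~$100-500/month band; running the GPU 24/7 or picking a larger GPU pushes toward the top of the range.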

Key Improvements

  1. No OpenAI dependency - Fully self-hosted solution
  2. Better embeddings - BGE-Large for improved RAG quality
  3. Cost-effective - a fixed ~$1-4/hour GPU cost instead of usage-based, per-token OpenAI API billing
  4. Better instruction following - LLaMA 3.1 8B is strong at structured JSON output and instruction following
  5. Scalable - Can run on single GPU (T4/L4/A10)
