Changelog - LLaMA 3.1 8B & BGE-Large Embeddings
Changes Made
1. Replaced OpenAI with LLaMA 3.1 8B
- ✅ Created `llm_config.py` for LLaMA API configuration
- ✅ Updated `agents.py` to use LLaMA instead of OpenAI
- ✅ Support for vLLM (self-hosted) and Vertex AI endpoints
- ✅ Removed dependency on OpenAI API key
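To illustrate the idea behind `llm_config.py`, here is a minimal sketch of endpoint selection between a self-hosted vLLM server and a Vertex AI fallback. The function name, environment handling, and defaults are assumptions for illustration, not the repo's actual code.

```python
import os

def resolve_llm_endpoint(env=None):
    """Return (backend, base_url) for the configured LLaMA endpoint.

    Prefers a self-hosted vLLM server; falls back to a Vertex AI endpoint.
    """
    env = os.environ if env is None else env
    vllm_url = env.get("LLAMA_API_URL")
    vertex_url = env.get("VERTEX_AI_ENDPOINT")
    if vllm_url:
        # vLLM exposes an OpenAI-compatible API under /v1
        return ("vllm", vllm_url.rstrip("/") + "/v1")
    if vertex_url:
        return ("vertex", vertex_url)
    raise RuntimeError("Set LLAMA_API_URL or VERTEX_AI_ENDPOINT")
```

Because vLLM speaks the OpenAI wire protocol, the rest of the agent code can keep using an OpenAI-style client pointed at the returned base URL, which keeps the migration away from OpenAI small.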
2. Replaced Embeddings with BGE-Large
- ✅ Updated `main.py` to use `BAAI/bge-large-en-v1.5` embeddings
- ✅ Configurable via `EMBEDDING_MODEL` environment variable
- ✅ Support for alternative models: `thenlper/gte-large`
- ✅ Added `normalize_embeddings=True` for better similarity search
- ✅ Important: Using a dedicated embedding model, NOT LLaMA, for embeddings
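Why `normalize_embeddings=True` helps: once vectors are unit-length, the dot product that many vector stores use for ranking equals cosine similarity, so scores become magnitude-independent. A self-contained sketch with toy 3-d vectors standing in for 1024-d BGE-Large embeddings:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 4.0, 6.0])  # same direction, different magnitude
# After normalization, parallel vectors score exactly 1.0 regardless of length.

# With sentence-transformers (not run here), the equivalent is roughly:
#   model = SentenceTransformer("BAAI/bge-large-en-v1.5")
#   emb = model.encode(texts, normalize_embeddings=True)
```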
3. vLLM Deployment Scripts
- ✅ `deploy_llama.sh` - Creates GCP VM with GPU
- ✅ `setup_llama_vm.sh` - Part 1: Install NVIDIA drivers
- ✅ `setup_llama_vm_part2.sh` - Part 2: Install CUDA and vLLM
- ✅ Systemd service configuration for vLLM
- ✅ Support for a Hugging Face token for model access
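The scripts above culminate in a systemd unit that keeps the vLLM server running across reboots. A minimal sketch of what such a unit might look like; the install path, user, model ID, and token value are assumptions, not the repo's actual files:

```ini
[Unit]
Description=vLLM OpenAI-compatible server for LLaMA 3.1 8B
After=network-online.target

[Service]
User=vllm
Environment=HUGGING_FACE_HUB_TOKEN=hf_xxx
ExecStart=/opt/vllm/bin/python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 8000
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now vllm` (unit name assumed) so the API comes back automatically after a VM restart.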
4. Updated Configuration
- ✅ `cloudbuild.yaml` - Updated environment variables
- ✅ Removed `OPENAI_API_KEY` dependency
- ✅ Added `LLAMA_API_URL`, `EMBEDDING_MODEL`, `EMBEDDING_DEVICE`
- ✅ Updated `requirements.txt` - Removed OpenAI, added sentence-transformers
5. Documentation
- ✅ `DEPLOYMENT.md` - Complete deployment guide
- ✅ `EMBEDDINGS.md` - Embedding models guide
- ✅ `README.md` - Updated with new configuration
- ✅ `weaviate_gcp_setup.md` - Weaviate setup instructions
- ✅ `llama_gcp_setup.md` - LLaMA setup instructions
Environment Variables
Required
- `WEAVIATE_URL` - Weaviate cluster URL
- `LLAMA_API_URL` - vLLM server URL (e.g., `http://EXTERNAL_IP:8000`)
Optional
- `WEAVIATE_API_KEY` - For Weaviate Cloud
- `VERTEX_AI_ENDPOINT` - Alternative to vLLM (Vertex AI)
- `EMBEDDING_MODEL` - Embedding model (default: `BAAI/bge-large-en-v1.5`)
- `EMBEDDING_DEVICE` - Device for embeddings (`cpu` or `cuda`)
- `HUGGING_FACE_HUB_TOKEN` - For accessing LLaMA models (if required)
Deployment Steps
- Deploy Weaviate (Cloud or self-hosted)
- Deploy LLaMA 3.1 8B using vLLM on GCP Compute Engine
- Prepare data using `data_preparation.py`
- Deploy AI service to Cloud Run
See the deployment section in README.md for detailed instructions.
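After step 2, the vLLM server can be smoke-tested with an OpenAI-style chat completion request against `{LLAMA_API_URL}/v1/chat/completions`. A sketch of the request payload; the model ID is an assumption about this deployment:

```python
def chat_payload(prompt, model="meta-llama/Llama-3.1-8B-Instruct"):
    """Build an OpenAI-compatible chat completion request body for vLLM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

payload = chat_payload("Reply with the word: ok")
# Send with e.g.:
#   requests.post(f"{LLAMA_API_URL}/v1/chat/completions", json=payload)
```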
Cost Estimation
- LLaMA 3.1 8B on T4 GPU: ~$1-4/hour
- Weaviate Cloud: ~$50-200/month
- Cloud Run: Pay per use
- Total: ~$100-500/month
Key Improvements
- No OpenAI dependency - Fully self-hosted solution
- Better embeddings - BGE-Large for improved RAG quality
- Cost-effective - a flat ~$1-4/hour for self-hosted inference vs. OpenAI's usage-based per-token pricing
- Better instruction following - LLaMA 3.1 8B excels at JSON output and instruction following
- Scalable - Can run on single GPU (T4/L4/A10)