Troubleshooting Guide

Overview

This guide covers common issues, debugging procedures, and resolution strategies for the JobHive platform. It is organized by system component and addresses both technical and operational troubleshooting scenarios.

Quick Reference - Common Issues

Emergency Response Checklist

🚨 SYSTEM DOWN
1. Check status page: https://status.jobhive.com
2. Verify AWS service health: https://status.aws.amazon.com
3. Check DataDog dashboards for system metrics
4. Review recent deployments in GitHub Actions
5. Check error rates in Sentry
6. Verify database connectivity
7. Check Redis/cache connectivity
8. Review load balancer health checks

🔍 INVESTIGATION PRIORITY
- High Error Rates → Check application logs
- Slow Response Times → Check database performance
- Failed Logins → Check authentication service
- AI Not Working → Check AI service logs
- Payment Issues → Check Stripe webhooks (see the Stripe CLI sketch below)
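
For payment issues, recent webhook event deliveries can be inspected with the Stripe CLI. A quick sketch, assuming a recent Stripe CLI is installed and authenticated; the local webhook path is an assumed example:

# List the ten most recent Stripe events
stripe events list --limit 10

# Forward webhooks to a local instance while debugging
stripe listen --forward-to localhost:8000/webhooks/stripe/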

Application Issues

Django Application Problems

High Response Times

Symptoms:
  • API endpoints taking > 2 seconds to respond
  • Users reporting slow page loads
  • DataDog showing elevated response times
Investigation Steps:
# Check current database connections
python manage.py dbshell
SELECT count(*) as active_connections FROM pg_stat_activity WHERE state = 'active';

# Check for long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query 
FROM pg_stat_activity 
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';

# Check Redis connectivity
redis-cli ping

# Check Celery worker status
celery -A config.celery_app inspect active
Common Causes & Solutions:
  1. Database Connection Pool Exhaustion
# In settings.py, adjust database connection settings
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        # CONN_MAX_AGE lives at this level, not inside OPTIONS
        'CONN_MAX_AGE': 0,  # Disable persistent connections
        # A hard connection cap (MAX_CONNS) requires a pooling backend
        # such as django-db-geventpool, or an external pooler like PgBouncer
    }
}

# Check for connection leaks in code
# Always use context managers or ensure connections are closed
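
A minimal sketch of the context-manager pattern; the query and the 'active' status value are illustrative:

from django.db import connection

def count_active_interviews():
    # The cursor is closed automatically when the block exits,
    # even if the query raises
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT count(*) FROM interview_interviewsession WHERE status = 'active'"
        )
        return cursor.fetchone()[0]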
  2. N+1 Query Problems
# Bad: Causes N+1 queries
interviews = InterviewSession.objects.all()
for interview in interviews:
    print(interview.user.email)  # Triggers additional query

# Good: Use select_related
interviews = InterviewSession.objects.select_related('user').all()
for interview in interviews:
    print(interview.user.email)  # No additional queries
  3. Cache Misses
# Check cache hit rates
import redis
from django.conf import settings

r = redis.Redis.from_url(settings.REDIS_URL)
info = r.info()
total = info['keyspace_hits'] + info['keyspace_misses']
hit_rate = info['keyspace_hits'] / total if total else 0.0
print(f"Cache hit rate: {hit_rate:.2%}")

# Warm up critical caches via a management command
from django.core.cache import cache
from django.core.management.base import BaseCommand
# SubscriptionPlan comes from its app's models module

class Command(BaseCommand):
    def handle(self, *args, **options):
        # Evaluate the queryset and cache it so the first request is a hit
        plans = list(SubscriptionPlan.objects.all())
        cache.set('subscription_plans', plans, 60 * 15)

Authentication Issues

Symptoms:
  • Users unable to log in
  • JWT token validation failures
  • “Unauthorized” errors on API calls
Investigation Steps:
# Check JWT token validity
python manage.py shell
from django.contrib.auth import get_user_model
from rest_framework_simplejwt.tokens import RefreshToken

User = get_user_model()
user = User.objects.get(email='user@example.com')
token = RefreshToken.for_user(user)
print(f"Access: {token.access_token}")
print(f"Refresh: {token}")

# Check Redis for session data
redis-cli
KEYS "*session*"
GET "session_key_here"
Common Solutions:
  1. Token Expiration Issues
# settings.py - Adjust token lifetimes
from datetime import timedelta

SIMPLE_JWT = {
    'ACCESS_TOKEN_LIFETIME': timedelta(minutes=60),  # Increase if too short
    'REFRESH_TOKEN_LIFETIME': timedelta(days=7),
    'ROTATE_REFRESH_TOKENS': True,
}
  2. Clock Synchronization
# Ensure server time is synchronized
sudo ntpdate -s time.nist.gov
timedatectl status

# Check for clock skew in logs
grep "clock skew" /var/log/jobhive/application.log
  3. Secret Key Issues
# Verify JWT secret key consistency across instances
python manage.py shell
from django.conf import settings
print(settings.SECRET_KEY[:10] + "...")  # Don't log full key

# Rotate secret key if compromised (will invalidate all tokens)
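
To generate a replacement key, Django ships a helper; a sketch, assuming the new value is then stored in the secret manager (see Secret Management below) rather than committed to code:

from django.core.management.utils import get_random_secret_key

# Print a fresh key, then update AWS Secrets Manager and redeploy all instances
print(get_random_secret_key())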

AI Service Issues

Sentiment Analysis Failures

Symptoms:
  • Sentiment scores not updating during interviews
  • AI analysis timing out
  • Inconsistent or null sentiment results
Investigation Steps:
# Test sentiment analysis directly
from jobhive.interview.agents.enhanced_sentiment_agent import EnhancedSentimentAgent

agent = EnhancedSentimentAgent()
test_text = "I am very excited about this opportunity and confident in my abilities."

try:
    result = agent.analyze_sentiment_with_context(test_text, {})
    print(f"Sentiment analysis result: {result}")
except Exception as e:
    print(f"Sentiment analysis failed: {e}")

# Check AI processing queue
from celery import current_app
inspect = current_app.control.inspect()
active_tasks = inspect.active()
print(f"Active AI tasks: {active_tasks}")
Common Solutions:
  1. Model Loading Issues
# Check if models are properly loaded
import torch  # confirm the torch backend is importable
from transformers import pipeline

try:
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    )
    print("Sentiment model loaded successfully")
except Exception as e:
    print(f"Model loading failed: {e}")
    # Download model manually if needed
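
A manual download can be done with huggingface_hub; a sketch, assuming that package is installed and the default cache directory is acceptable:

from huggingface_hub import snapshot_download

# Fetch all model files into the local Hugging Face cache
snapshot_download("cardiffnlp/twitter-roberta-base-sentiment-latest")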
  2. Memory Issues
# Check memory usage of AI processes
ps aux | grep -E "(celery|python)" | awk '{print $2, $4, $11}' | sort -k2 -nr

# If memory usage is high, restart workers
sudo supervisorctl restart celery-worker

# Or use memory-efficient model loading
export PYTORCH_TRANSFORMERS_CACHE=/tmp/transformers_cache
  3. API Rate Limiting
# Implement retry logic for external AI APIs
import time
import random
from functools import wraps

def retry_with_backoff(max_retries=3):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise
                    wait_time = (2 ** attempt) + random.uniform(0, 1)
                    time.sleep(wait_time)
            return None
        return wrapper
    return decorator
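
Usage is just a matter of decorating the call site; the endpoint URL below is a placeholder, not a real provider API:

import requests

@retry_with_backoff(max_retries=3)
def call_sentiment_api(text):
    # Transient network errors raise here and trigger the backoff above
    resp = requests.post("https://api.example.com/sentiment", json={"text": text}, timeout=10)
    resp.raise_for_status()
    return resp.json()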

Skill Assessment Errors

Symptoms:
  • Skill scores not calculated
  • Missing skill assessments in completed interviews
  • Inconsistent skill scoring
Debugging Steps:
# Test skill assessment pipeline
from jobhive.interview.agents.skill_assessment_agent import SkillAssessmentAgent
from jobhive.interview.models import InterviewSession

agent = SkillAssessmentAgent()
test_interview = InterviewSession.objects.get(id='test_interview_id')

try:
    skill_results = agent.assess_skills(test_interview)
    print(f"Skill assessment results: {skill_results}")
except Exception as e:
    print(f"Skill assessment failed: {e}")
    import traceback
    traceback.print_exc()

# Check skill categories and weights
from jobhive.interview.models import SkillCategory, ScoreWeight
print("Available skill categories:", list(SkillCategory.objects.values_list('name', flat=True)))
weights = ScoreWeight.objects.first()
print("Score weights:", weights.__dict__ if weights else "No ScoreWeight configured")
Solutions:
  1. Missing Skill Data
# Populate skill categories and skills
python manage.py shell
from jobhive.interview.models import SkillCategory, Skill

# Create missing skill categories
categories = ['Technical Skills', 'Communication', 'Problem Solving', 'Leadership']
for cat_name in categories:
    SkillCategory.objects.get_or_create(name=cat_name)

# Create skills under the matching category
tech_skills = ['Python', 'JavaScript', 'SQL', 'System Design']
tech_category = SkillCategory.objects.get(name='Technical Skills')
for skill_name in tech_skills:
    Skill.objects.get_or_create(name=skill_name, category=tech_category)
  2. Transcript Processing Issues
# Check transcript quality and processing
def debug_transcript_processing(interview_session):
    transcript = interview_session.get_transcript()
    
    if not transcript:
        print("No transcript available")
        return
    
    print(f"Transcript length: {len(transcript)} characters")
    print(f"Word count: {len(transcript.split())}")
    
    # Check for common transcript issues
    if len(transcript) < 50:
        print("WARNING: Transcript too short for meaningful analysis")
    
    # Check for transcription errors
    error_indicators = ['[inaudible]', '[unclear]', '***']
    for indicator in error_indicators:
        if indicator in transcript:
            print(f"WARNING: Transcript contains {indicator}")

Database Issues

PostgreSQL Performance Problems

Slow Query Performance

Investigation:
-- Enable query logging temporarily
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1s
SELECT pg_reload_conf();

-- Check currently running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query 
FROM pg_stat_activity 
WHERE state = 'active' 
AND now() - pg_stat_activity.query_start > interval '5 seconds';

-- Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_statio_user_tables
WHERE relname IN ('interview_interviewsession', 'interview_sentimentsession')
ORDER BY pg_total_relation_size(relid) DESC;

-- Check column statistics used by the query planner
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename IN ('interview_interviewsession', 'interview_sentimentsession')
ORDER BY tablename, attname;

-- Check index usage (rarely used indexes first)
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan ASC;
Solutions:
  1. Missing Indexes
-- Create missing indexes based on common queries
CREATE INDEX CONCURRENTLY idx_interview_session_user_status_date 
ON interview_interviewsession(user_id, status, start_time DESC);

CREATE INDEX CONCURRENTLY idx_sentiment_session_interview_created 
ON interview_sentimentsession(interview_session_id, created_at DESC);

-- Check index effectiveness
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM interview_interviewsession 
WHERE user_id = 123 AND status = 'completed' 
ORDER BY start_time DESC LIMIT 10;
  2. Query Optimization
# Optimize Django queries
# Bad: Multiple queries
user_interviews = []
for user in User.objects.all():
    interviews = user.interviewsession_set.filter(status='completed').count()
    user_interviews.append((user, interviews))

# Good: Single query with aggregation
from django.db.models import Count, Q
user_interviews = User.objects.annotate(
    completed_interviews=Count('interviewsession',
                               filter=Q(interviewsession__status='completed'))
).values('id', 'email', 'completed_interviews')

Connection Pool Issues

Symptoms:
  • “Too many connections” errors
  • Connection timeouts
  • Intermittent database connectivity
Investigation:
-- Check current connections
SELECT count(*) as total_connections FROM pg_stat_activity;

-- Check connections by database and user
SELECT datname, usename, count(*) AS connection_count
FROM pg_stat_activity
GROUP BY datname, usename
ORDER BY connection_count DESC;

-- Check idle connections
SELECT count(*) as idle_connections 
FROM pg_stat_activity 
WHERE state = 'idle';

-- Check connection limits
SELECT name, setting FROM pg_settings WHERE name = 'max_connections';
Solutions:
  1. Connection Pool Configuration
# settings.py - Optimize connection handling
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        # CONN_MAX_AGE is a top-level setting, not an OPTIONS entry
        'CONN_MAX_AGE': 300,  # Keep connections for 5 minutes
    }
}

# Use connection pooler (PgBouncer)
# Install PgBouncer and configure
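
A minimal pgbouncer.ini sketch; the host, database name, and pool sizes are assumptions to adapt:

[databases]
jobhive = host=127.0.0.1 port=5432 dbname=jobhive

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20

Point DATABASES['default'] at port 6432 so Django connects through the pooler instead of PostgreSQL directly.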
  2. Connection Leak Detection
# Add connection monitoring
from django.db import connection
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def handle(self, *args, **options):
        # Monitor connection usage
        with connection.cursor() as cursor:
            cursor.execute("""
                SELECT pid, now() - query_start as duration, query, state
                FROM pg_stat_activity 
                WHERE state != 'idle'
                ORDER BY duration DESC;
            """)
            
            active_queries = cursor.fetchall()
            for query in active_queries:
                if query[1] and query[1].total_seconds() > 300:  # 5 minutes
                    print(f"Long-running query: PID {query[0]}, Duration: {query[1]}")

Redis Cache Issues

Cache Connection Problems

Symptoms:
  • Redis connection timeouts
  • Cache misses despite data being present
  • Inconsistent caching behavior
Investigation:
# Check Redis connectivity
redis-cli ping

# Check Redis memory usage
redis-cli info memory

# Check connected clients
redis-cli info clients

# Monitor Redis commands
redis-cli monitor

# Check for Redis errors in logs
grep -i redis /var/log/jobhive/application.log
Solutions:
  1. Connection Pool Configuration
# settings.py - Optimize Redis connection pool
import os

CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': os.environ['REDIS_URL'],  # settings.py cannot reference settings
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
            'CONNECTION_POOL_KWARGS': {
                'max_connections': 50,
                'retry_on_timeout': True,
                'socket_keepalive': True,
            },
        },
    }
}
  2. Cache Key Management
# Check for cache key conflicts
import redis
from django.conf import settings

r = redis.Redis.from_url(settings.REDIS_URL)

# Count keys with SCAN (KEYS * blocks Redis; avoid it in production)
total_keys = sum(1 for _ in r.scan_iter('*'))
print(f"Total cache keys: {total_keys}")

# Check expiration on a sample of keys
for i, key in enumerate(r.scan_iter('*')):
    if i >= 10:
        break
    print(f"Key: {key}, TTL: {r.ttl(key)}")

# Clear problematic keys
for key in r.scan_iter('user_session:*'):
    r.delete(key)

Infrastructure Issues

AWS Service Problems

ECS Task Failures

Symptoms:
  • Tasks stopping unexpectedly
  • Health check failures
  • Unable to place tasks
Investigation:
# Check ECS service status
aws ecs describe-services --cluster jobhive-production --services jobhive-web

# Check task definitions
aws ecs describe-task-definition --task-definition jobhive-web-task

# Check stopped tasks
aws ecs list-tasks --cluster jobhive-production --desired-status STOPPED

# Get task details and failure reason
aws ecs describe-tasks --cluster jobhive-production --tasks TASK_ARN
Solutions:
  1. Memory/CPU Limits
// Adjust task definition resources
{
  "family": "jobhive-web-task",
  "cpu": "1024",  // Increase from 512
  "memory": "2048", // Increase from 1024
  "containerDefinitions": [
    {
      "name": "django-web",
      "memoryReservation": 1024,  // Soft limit
      "memory": 2048  // Hard limit
    }
  ]
}
  2. Health Check Issues
# Improve health check endpoint
from django.http import JsonResponse
from django.db import connections
from django.core.cache import cache

def health_check(request):
    health_status = {'status': 'healthy', 'checks': {}}
    
    # Database check with timeout
    try:
        with connections['default'].cursor() as cursor:
            cursor.execute("SELECT 1")
        health_status['checks']['database'] = 'healthy'
    except Exception as e:
        health_status['checks']['database'] = f'unhealthy: {str(e)}'
        health_status['status'] = 'unhealthy'
    
    # Cache check with read-back verification
    try:
        cache.set('health_check', 'ok', 10)
        if cache.get('health_check') != 'ok':
            raise RuntimeError('cache read-back mismatch')
        health_status['checks']['cache'] = 'healthy'
    except Exception as e:
        health_status['checks']['cache'] = f'unhealthy: {str(e)}'
        health_status['status'] = 'unhealthy'
    
    status_code = 200 if health_status['status'] == 'healthy' else 503
    return JsonResponse(health_status, status=status_code)
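
The endpoint also needs a route so the load balancer can reach it; a sketch, assuming the view lives in a health module:

# urls.py
from django.urls import path
from .health import health_check  # assumed module location

urlpatterns = [
    path('health/', health_check, name='health-check'),
]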

Load Balancer Issues

Symptoms:
  • 502/503 errors from load balancer
  • Uneven traffic distribution
  • Health check failures
Investigation:
# Check target group health
aws elbv2 describe-target-health --target-group-arn TARGET_GROUP_ARN

# Check load balancer attributes
aws elbv2 describe-load-balancers --load-balancer-arns LB_ARN

# Check listener rules
aws elbv2 describe-listeners --load-balancer-arn LB_ARN
Solutions:
  1. Health Check Configuration
# ALB health check settings
health_check:
  protocol: HTTP
  path: /health/
  healthy_threshold_count: 2
  unhealthy_threshold_count: 3
  timeout: 5
  interval: 30
  matcher: "200"
  2. Connection Draining
# Enable connection draining
aws elbv2 modify-target-group-attributes \
    --target-group-arn TARGET_GROUP_ARN \
    --attributes Key=deregistration_delay.timeout_seconds,Value=30

Networking Issues

DNS Resolution Problems

Investigation:
# Test DNS resolution
nslookup api.jobhive.com
dig api.jobhive.com

# Check Route 53 records
aws route53 list-resource-record-sets --hosted-zone-id ZONE_ID

# Test connectivity to external services
curl -I https://api.stripe.com
curl -I https://api.openai.com
Solutions:
  1. DNS Cache Issues
# Flush DNS cache (systemd-resolved)
sudo resolvectl flush-caches
# or on macOS
sudo dscacheutil -flushcache

# Check /etc/hosts for conflicts
grep jobhive /etc/hosts
  2. Security Group Configuration
# Check security group rules
aws ec2 describe-security-groups --group-ids sg-xxxxxxxxx

# Add missing rules
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxxx \
    --protocol tcp \
    --port 443 \
    --cidr 0.0.0.0/0

Monitoring and Alerting Issues

DataDog Integration Problems

Missing Metrics

Investigation:
# Test DataDog connectivity
from datadog.dogstatsd import DogStatsd

statsd = DogStatsd(host='localhost', port=8125)

# Test metric submission
statsd.increment('test.metric', tags=['source:troubleshooting'])
statsd.gauge('test.gauge', 42.0, tags=['source:troubleshooting'])

# Check the DataDog agent logs
# tail -f /var/log/datadog/agent.log
Solutions:
  1. Agent Configuration
# datadog.yaml
api_key: YOUR_API_KEY
site: datadoghq.com

# Enable logging
logs_enabled: true

# Check agent status
sudo datadog-agent status
  2. Metric Tagging Issues
# Ensure consistent tagging
from django.conf import settings
from datadog.dogstatsd import DogStatsd

statsd = DogStatsd(host='localhost', port=8125)

def send_metric_with_tags(metric_name, value, **kwargs):
    standard_tags = [
        f'environment:{settings.ENVIRONMENT}',
        'service:jobhive-api',
        f'version:{settings.VERSION}',
    ]
    all_tags = standard_tags + kwargs.get('tags', [])
    statsd.gauge(metric_name, value, tags=all_tags)
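
Call sites then supply only their own tags; the metric name below is illustrative:

send_metric_with_tags('jobhive.interviews.active', 42, tags=['component:interview'])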

Log Aggregation Issues

Missing Log Entries

Investigation:
# Check log file permissions
ls -la /var/log/jobhive/

# Check log rotation
logrotate -d /etc/logrotate.d/jobhive

# Test logging configuration
python manage.py shell
import logging
logger = logging.getLogger('jobhive')
logger.info('Test log message')

# Check if logs are being written
tail -f /var/log/jobhive/application.log
Solutions:
  1. Log File Permissions
# Fix log file permissions
sudo chown -R www-data:www-data /var/log/jobhive/
sudo chmod 644 /var/log/jobhive/*.log
sudo chmod 755 /var/log/jobhive/
  2. Log Format Issues
# Ensure proper JSON formatting
import logging
from pythonjsonlogger import jsonlogger

# Test log formatting
formatter = jsonlogger.JsonFormatter('%(asctime)s %(name)s %(levelname)s %(message)s')
handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger('test')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# This should produce valid JSON
logger.info('Test message', extra={'user_id': 123, 'action': 'test'})

Performance Debugging

Memory Issues

Memory Leaks

Investigation:
# Memory profiling
import tracemalloc
import gc

# Start tracing
tracemalloc.start()

# Your application code here
# ...

# Get memory statistics
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")

# Get top memory consumers
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 10 memory consumers:")
for stat in top_stats[:10]:
    print(stat)

# Force garbage collection
gc.collect()
Solutions:
  1. Object Lifecycle Management
# Proper resource cleanup
class InterviewProcessor:
    def __init__(self):
        self.resources = []
    
    def process_interview(self, interview_session):
        try:
            # Process interview
            result = self.analyze_interview(interview_session)
            return result
        finally:
            # Clean up resources
            self.cleanup_resources()
    
    def cleanup_resources(self):
        for resource in self.resources:
            if hasattr(resource, 'close'):
                resource.close()
        self.resources.clear()
  2. QuerySet Optimization
# Use iterator() for large querysets

# Bad: loads every object into memory at once
for interview in InterviewSession.objects.all():
    process_interview(interview)

# Good: streams objects in chunks of 100
for interview in InterviewSession.objects.iterator(chunk_size=100):
    process_interview(interview)

CPU Performance Issues

High CPU Usage

Investigation:
# Check CPU usage by process
top -p $(pgrep -d',' python)

# Profile Python application
python -m cProfile -o profile.stats manage.py runserver

# Analyze profile results
python -c "
import pstats
p = pstats.Stats('profile.stats')
p.sort_stats('cumulative').print_stats(20)
"

# Check for infinite loops or blocking operations
strace -p PID -c -f -e trace=read,write,recv,send
Solutions:
  1. Algorithm Optimization
# Optimize expensive operations
# Bad: O(n²) complexity
def find_similar_interviews(target_interview):
    similar = []
    all_interviews = InterviewSession.objects.all()
    
    for interview in all_interviews:
        if calculate_similarity(target_interview, interview) > 0.8:
            similar.append(interview)
    
    return similar

# Good: Use database filtering and indexing
def find_similar_interviews(target_interview):
    # Use database-level filtering
    similar = InterviewSession.objects.filter(
        job__title=target_interview.job.title,
        technical_accuracy__gte=target_interview.technical_accuracy - 10,
        technical_accuracy__lte=target_interview.technical_accuracy + 10
    ).exclude(
        id=target_interview.id
    )[:10]
    
    return similar
  2. Async Processing
# Move CPU-intensive tasks to background
from celery import shared_task

@shared_task
def process_interview_analysis(interview_id):
    interview = InterviewSession.objects.get(id=interview_id)
    
    # CPU-intensive AI processing
    results = run_comprehensive_analysis(interview)
    
    # Update interview with results
    interview.analysis_results = results
    interview.save()
    
    return results

# In views, trigger async processing
def complete_interview(request, interview_id):
    interview = get_object_or_404(InterviewSession, id=interview_id)
    interview.status = 'completed'
    interview.save()
    
    # Process analysis in background
    process_interview_analysis.delay(interview_id)
    
    return JsonResponse({'status': 'completed'})

Deployment Issues

Deployment Failures

CI/CD Pipeline Failures

Investigation:
# Check GitHub Actions logs
# Visit: https://github.com/your-org/jobhive/actions

# Check ECS deployment status
aws ecs describe-services --cluster jobhive-production --services jobhive-web

# Check CloudFormation stack status (if using)
aws cloudformation describe-stacks --stack-name jobhive-production

# Check recent ECS deployments
aws ecs list-tasks --cluster jobhive-production --service-name jobhive-web
Solutions:
  1. Build Issues
# GitHub Actions troubleshooting
- name: Debug Build Environment
  run: |
    echo "Python version: $(python --version)"
    echo "Pip version: $(pip --version)"
    echo "Available space: $(df -h /)"
    echo "Memory: $(free -h)"

- name: Test Dependencies
  run: |
    pip install -r requirements/production.txt
    python manage.py check --deploy
  2. Deployment Rollback
# Quick rollback to previous version
aws ecs update-service \
    --cluster jobhive-production \
    --service jobhive-web \
    --task-definition jobhive-web-task:PREVIOUS_REVISION

# Wait for rollback to complete
aws ecs wait services-stable \
    --cluster jobhive-production \
    --services jobhive-web

Configuration Issues

Environment Variables

Investigation:
# Check environment configuration
import os
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def handle(self, *args, **options):
        critical_vars = [
            'DATABASE_URL',
            'REDIS_URL',
            'SECRET_KEY',
            'STRIPE_SECRET_KEY',
            'DATADOG_API_KEY'
        ]
        
        for var in critical_vars:
            value = os.environ.get(var)
            if value:
                # Don't log sensitive values
                masked_value = value[:10] + "..." if len(value) > 10 else "***"
                print(f"{var}: {masked_value}")
            else:
                print(f"❌ {var}: NOT SET")
Solutions:
  1. Secret Management
# Check AWS Secrets Manager
aws secretsmanager get-secret-value --secret-id jobhive/production/django

# Update secrets
aws secretsmanager update-secret \
    --secret-id jobhive/production/django \
    --secret-string '{"SECRET_KEY":"new-secret-key"}'
  2. Configuration Validation
# Add configuration validation
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def handle(self, *args, **options):
        try:
            # Test database connection
            from django.db import connection
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1")
            print("✅ Database connection: OK")
        except Exception as e:
            print(f"❌ Database connection: {e}")
        
        try:
            # Test cache connection
            from django.core.cache import cache
            cache.set('test', 'ok', 10)
            cache.get('test')
            print("✅ Cache connection: OK")
        except Exception as e:
            print(f"❌ Cache connection: {e}")
        
        try:
            # Test external APIs
            import requests
            from django.conf import settings
            response = requests.get(
                'https://api.stripe.com/v1/account',
                headers={'Authorization': f'Bearer {settings.STRIPE_SECRET_KEY}'},
                timeout=10,
            )
            if response.status_code == 200:
                print("✅ Stripe API: OK")
            else:
                print(f"❌ Stripe API: HTTP {response.status_code}")
        except Exception as e:
            print(f"❌ Stripe API: {e}")

Emergency Procedures

System Recovery

Database Recovery

# Restore from snapshot
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier jobhive-db-restored \
    --db-snapshot-identifier jobhive-db-snapshot-2024-02-15

# Point-in-time recovery
aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier jobhive-db \
    --target-db-instance-identifier jobhive-db-recovery \
    --restore-time 2024-02-15T10:30:00.000Z

Application Recovery

# Scale up healthy instances
aws ecs update-service \
    --cluster jobhive-production \
    --service jobhive-web \
    --desired-count 6

# Force new deployment
aws ecs update-service \
    --cluster jobhive-production \
    --service jobhive-web \
    --force-new-deployment

Incident Response Checklist

  1. Assess Impact
    • Check system status dashboard
    • Verify user impact and scope
    • Determine severity level
  2. Immediate Actions
    • Notify team via Slack/PagerDuty
    • Begin incident documentation
    • Start resolution timer
  3. Investigation
    • Check recent deployments
    • Review error logs and metrics
    • Identify root cause
  4. Resolution
    • Implement fix or rollback
    • Verify system restoration
    • Monitor for regression
  5. Post-Incident
    • Update status page
    • Conduct post-mortem
    • Document lessons learned
This troubleshooting guide covers the most common issues and their resolutions, enabling rapid problem identification and system recovery.