Kubernetes Cost Optimization: Saving Money in Production

Running Kubernetes in production is powerful, but misconfiguration and a lack of optimization can drive costs up sharply. Many companies overspend on their Kubernetes clusters by 30-50%. This post walks through cost optimization strategies proven in production environments, along with practical ways to apply them.

Contents

The Anatomy of Kubernetes Costs
Resource Requests and Limits Optimization
Horizontal and Vertical Pod Autoscaling
Node Optimization Strategies
Namespace and Multi-Tenancy Management
Idle Resource Detection and Cleanup
Monitoring and FinOps Tooling
Real-World Case Study

The Anatomy of Kubernetes Costs {#maliyet-anatomisi}

Where Do the Costs Come From?

To understand Kubernetes costs, start with the main components:

1. Compute Costs (the largest share: 60-70%)

Worker node instance costs
CPU and memory allocation
Over-provisioning (reserving more resources than needed)

2. Storage Costs (15-20%)

Persistent volumes (PV)
Snapshots
Orphaned volumes (disks no longer in use)

3. Network Costs (10-15%)

Cross-zone/cross-region traffic
Load balancer costs
NAT gateway costs

4. Management Overhead (5-10%)

Control plane costs (for managed Kubernetes)
Monitoring and logging infrastructure
Backup solutions

Typical Cost Problems

          
yaml

# ❌ Bad example: over-provisioned deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 10  # 3 would actually be enough
  template:
    spec:
      containers:
      - name: api
        image: api:latest
        resources:
          requests:
            memory: "2Gi"    # Actual usage: 200Mi
            cpu: "1000m"     # Actual usage: 100m
          limits:
            memory: "4Gi"    # Never reached
            cpu: "2000m"     # Never reached

Result: each pod requests about 10 times what it uses, and with 10 replicas where 3 would do, the deployment reserves more than 30 times the resources it actually needs!
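
To put a number on the waste, the arithmetic can be sketched in a few lines of Python. This is only a sketch: the helper names `parse_cpu` and `parse_mem_mi` are hypothetical, they handle just the suffixes used in the manifest, and the usage figures are the ones quoted in its comments.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("100m" or "2") to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)


def parse_mem_mi(quantity: str) -> float:
    """Convert a memory quantity with a Mi or Gi suffix to MiB."""
    units = {"Mi": 1, "Gi": 1024}
    return float(quantity[:-2]) * units[quantity[-2:]]


# Requested: 10 replicas at the manifest's request values.
requested_cpu = 10 * parse_cpu("1000m")
requested_mem = 10 * parse_mem_mi("2Gi")

# Needed: 3 replicas at the actual usage noted in the comments.
needed_cpu = 3 * parse_cpu("100m")
needed_mem = 3 * parse_mem_mi("200Mi")

cpu_ratio = requested_cpu / needed_cpu
mem_ratio = requested_mem / needed_mem
print(f"CPU over-provisioning: {cpu_ratio:.1f}x, memory: {mem_ratio:.1f}x")
```

The per-pod factor is 10x; the extra replicas push the deployment-wide factor above 30x.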

Resource Requests and Limits Optimization {#resource-optimization}

Right-Sizing: Setting the Right Resource Values

Step 1: Measure Actual Usage

          
bash

# Watch the pod's actual resource usage
kubectl top pod api-service-7d9f6b8c-xjk2p --containers

# Pull the last 24 hours of metrics (PromQL; run this in Prometheus, not in the shell)
avg_over_time(container_memory_working_set_bytes{pod="api-service-7d9f6b8c-xjk2p", container="api"}[24h])

Step 2: Calculate Optimal Values

          
yaml

# ✅ Good example: right-sized deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: api
        image: api:latest
        resources:
          requests:
            # Average usage + 20% buffer
            memory: "240Mi"  # (200Mi * 1.2)
            cpu: "120m"      # (100m * 1.2)
          limits:
            # Peak usage + 30% buffer
            memory: "400Mi"
            cpu: "300m"

Cost Impact:

Before: 10 replicas × 2Gi = 20Gi of requested memory
After: 3 replicas × 240Mi = 720Mi of requested memory
Savings: about 96% of requested memory, and roughly 70% of total cost
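
The buffer rule and the memory saving can be checked with a short calculation. This is an illustrative sketch: `right_size` is a hypothetical helper, and it works in whole units (Mi or millicores) with an integer percentage to avoid floating-point rounding surprises.

```python
import math


def right_size(avg_usage: int, buffer_pct: int = 20) -> int:
    """Request = average usage plus a percentage buffer, rounded up to a whole unit."""
    return math.ceil(avg_usage * (100 + buffer_pct) / 100)


memory_request_mi = right_size(200)   # 200Mi average usage
cpu_request_m = right_size(100)       # 100m average usage

before_mi = 10 * 2048                 # 10 replicas x 2Gi
after_mi = 3 * memory_request_mi      # 3 replicas x right-sized request
saving_pct = (1 - after_mi / before_mi) * 100

print(f"memory request: {memory_request_mi}Mi, cpu request: {cpu_request_m}m")
print(f"requested memory reduced by {saving_pct:.1f}%")
```

Only the memory figure follows directly from the manifests; the overall cost saving also depends on how many nodes can actually be removed once the requests shrink.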

Quality of Service (QoS) Classes

Kubernetes assigns every pod a QoS class based on its requests and limits. The class determines how likely the pod is to be evicted when a node runs short of resources:

          
yaml

# Guaranteed QoS: evicted last, most expensive
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "1Gi"   # request = limit
        cpu: "500m"     # request = limit

---
# Burstable QoS: medium priority, cost-effective
apiVersion: v1
kind: Pod
metadata:
  name: standard-app
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"   # limit > request
        cpu: "500m"

---
# BestEffort QoS: evicted first, cheapest (not recommended for production!)
apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  containers:
  - name: job
    # No resource definitions at all

Strategy:

Critical services: Guaranteed QoS
Normal workloads: Burstable QoS (the most cost-effective)
Batch jobs: BestEffort QoS (non-production)
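
The classification rules behind these three examples are simple enough to sketch in code. The function below is a simplified model, not the kubelet's implementation: `qos_class` is a hypothetical name, and quantities are compared as plain strings, so "1000m" and "1" would wrongly count as different values.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources.

    Each container is a dict with optional "requests" and "limits" maps.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit, so only a missing
            # limit or an explicit mismatch rules out Guaranteed.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


critical = [{"requests": {"cpu": "500m", "memory": "1Gi"},
             "limits": {"cpu": "500m", "memory": "1Gi"}}]
standard = [{"requests": {"cpu": "100m", "memory": "256Mi"},
             "limits": {"cpu": "500m", "memory": "512Mi"}}]
batch = [{}]

for name, pod in (("critical-app", critical), ("standard-app", standard), ("batch-job", batch)):
    print(name, qos_class(pod))
```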

Horizontal and Vertical Pod Autoscaling {#autoscaling}

Horizontal Pod Autoscaler (HPA)

Scale automatically with traffic:

          
yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 50  # remove at most 50% of replicas per period
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # scale up immediately
      policies:
      - type: Percent
        value: 100  # at most double the replicas per period
        periodSeconds: 15
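
The replica count the HPA aims for follows the documented formula desired = ceil(current × currentMetric / targetMetric), clamped to the min/max bounds. The sketch below models only that core step for a single metric; `desired_replicas` is a hypothetical name, the 0.1 tolerance is the upstream default, and the `behavior` policies and stabilization windows from the manifest are deliberately left out.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Core HPA calculation for one utilization metric."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to the target: leave the deployment alone
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Bounds and CPU target taken from the api-hpa manifest above.
print(desired_replicas(4, 90, 70, 2, 10))   # load above target: scale up
print(desired_replicas(4, 72, 70, 2, 10))   # within tolerance: unchanged
print(desired_replicas(4, 20, 70, 2, 10))   # load well below target: scale down
```

When several metrics are configured, as with CPU and memory here, the HPA evaluates each one and uses the largest resulting replica count.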

Vertical Pod Autoscaler (VPA)

Adjusts resource requests automatically:

          
yaml

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Auto"  # apply updates automatically (use with care!)
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 50m
        memory: 128Mi
      maxAllowed:
        cpu: 1000m
        memory: 2Gi
      controlledResources:
      - cpu
      - memory

⚠️ Warning: do not let HPA and VPA act on the same metric (CPU or memory) for the same workload. They will fight each other.

Best Combination:

HPA: CPU-based scaling
VPA: memory-based recommendations (updateMode: "Off" produces recommendations only)

KEDA: Event-Driven Autoscaling

          
yaml

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0  # 🔥 can scale to zero!
  maxReplicaCount: 50
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: queue_depth
      query: |
        sum(rabbitmq_queue_messages{queue="processing"})
      threshold: "10"

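Cost Impact: when there is no work, the pod count drops to 0, so the workload reserves no compute while idle (the saving is realized once the freed nodes are scaled down)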
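
How the queue depth maps to a replica count can be sketched as follows. This is a simplified model under stated assumptions rather than KEDA's actual code: `keda_replicas` is a hypothetical name, the threshold is treated as a per-replica target, an activation threshold of 0 is assumed, and cooldown periods and HPA tolerance are ignored.

```python
import math


def keda_replicas(queue_depth: float, threshold: float,
                  min_replicas: int = 0, max_replicas: int = 50,
                  activation_threshold: float = 0.0) -> int:
    """Approximate replica count for a queue-depth trigger."""
    if queue_depth <= activation_threshold:
        return min_replicas  # nothing to do: fall back to the minimum (0 here)
    wanted = math.ceil(queue_depth / threshold)
    # An active trigger needs at least one replica; never exceed the maximum.
    return max(1, min_replicas, min(max_replicas, wanted))


# threshold and replica bounds taken from the queue-scaler manifest above
for depth in (0, 5, 95, 10_000):
    print(depth, "messages ->", keda_replicas(depth, threshold=10), "replicas")
```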

Node Optimization Strategies {#node-optimization}

Cluster Autoscaler

Adjusts the node count automatically to match the workload:

          
yaml

# Cluster Autoscaler deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m

Spot/Preemptible Instance Usage

Cost savings of up to 70%:

          
yaml

# Node pool: labels carried by spot nodes (set through the node group configuration;
# Node objects are not created by hand, and the lifecycle key is a custom label)
apiVersion: v1
kind: Node
metadata:
  labels:
    node.kubernetes.io/instance-type: t3.large
    node.kubernetes.io/lifecycle: spot
    workload-type: fault-tolerant

---
# Deployment: spot-tolerant workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  template:
    spec:
      # No hard nodeSelector on the spot label: it would pin the pods to spot
      # nodes and defeat the on-demand fallback below
      tolerations:
      - key: "node.kubernetes.io/lifecycle"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      affinity:
        # Prefer spot; if no spot capacity is available, fall back to on-demand nodes
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values:
                - spot

Spot Instance Strategy:

Stateless apps: run them 100% on spot
Batch jobs: spot + graceful shutdown
Stateful/critical apps: on-demand nodes
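
Because only part of a cluster usually moves to spot, the realized saving is a blend of the two prices. The sketch below illustrates the arithmetic with assumed inputs: a 10-node pool, the m5.xlarge on-demand rate quoted elsewhere in this post, 60% of nodes on spot, and the full 70% discount, which is an upper bound rather than a typical figure. `blended_hourly_cost` is a hypothetical helper.

```python
def blended_hourly_cost(nodes: int, on_demand_rate: float,
                        spot_fraction: float, spot_discount: float) -> float:
    """Hourly cost of a pool where a fraction of the nodes runs on spot."""
    spot_nodes = nodes * spot_fraction
    on_demand_nodes = nodes - spot_nodes
    spot_rate = on_demand_rate * (1 - spot_discount)
    return on_demand_nodes * on_demand_rate + spot_nodes * spot_rate


baseline = blended_hourly_cost(10, 0.192, spot_fraction=0.0, spot_discount=0.70)
mixed = blended_hourly_cost(10, 0.192, spot_fraction=0.6, spot_discount=0.70)
saving_pct = (1 - mixed / baseline) * 100

print(f"all on-demand: ${baseline:.3f}/hour, 60% spot: ${mixed:.4f}/hour")
print(f"overall saving: {saving_pct:.0f}%")
```

Under these assumptions the pool-wide saving is noticeably lower than the headline spot discount, since the on-demand share still pays full price.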

Node Size Optimization

          
bash

# Smaller instances are often more cost-effective
# Example: AWS

# ❌ Bad: 2x m5.4xlarge (16 vCPU, 64GB each)
# Cost: 2 × $0.768/hour = $1.536/hour

# ✅ Good: 8x m5.xlarge (4 vCPU, 16GB each)
# Cost: 8 × $0.192/hour = $1.536/hour
# Same price, but more flexibility: capacity can be added or removed in smaller steps
# (each node also carries fixed overhead such as the kubelet and DaemonSets)

Advantages:

Finer-grained scaling and bin packing
Less impact during maintenance
Less waste
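
The trade-off can be made concrete with a small calculation. The 0.25 vCPU reserved per node for system daemons is a hypothetical figure chosen for illustration (real overhead depends on the kubelet reservation and the DaemonSets in use), and both helper names are assumptions.

```python
def capacity_lost_per_node(node_count: int) -> float:
    """Fraction of cluster capacity that disappears when one node is drained."""
    return 1 / node_count


def usable_vcpus(node_count: int, vcpus_per_node: int, reserved_per_node: float) -> float:
    """vCPUs left for workloads after a fixed per-node reservation."""
    return node_count * (vcpus_per_node - reserved_per_node)


RESERVED_PER_NODE = 0.25  # hypothetical system overhead per node, in vCPUs

large = {"nodes": 2, "vcpus": 16}   # 2x m5.4xlarge
small = {"nodes": 8, "vcpus": 4}    # 8x m5.xlarge

for label, pool in (("2x m5.4xlarge", large), ("8x m5.xlarge", small)):
    lost = capacity_lost_per_node(pool["nodes"]) * 100
    usable = usable_vcpus(pool["nodes"], pool["vcpus"], RESERVED_PER_NODE)
    print(f"{label}: draining one node removes {lost:.1f}% of capacity, "
          f"{usable} vCPUs usable")
```

Smaller nodes shrink the impact of losing or draining a single node, at the price of slightly more capacity spent on per-node overhead, so the right size depends on the workload mix.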

Namespace and Multi-Tenancy Management {#namespace-management}

Resource Quotas

Set limits per namespace:

          
yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-team-quota
  namespace: dev-team
spec:
  hard:
    requests.cpu: "50"        # 50 CPUs in total
    requests.memory: "100Gi"  # 100Gi of memory in total
    limits.cpu: "100"
    limits.memory: "200Gi"
    persistentvolumeclaims: "10"
    services.loadbalancers: "2"  # cap the number of LoadBalancers
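
Conceptually, the quota acts as an admission check: a new pod is rejected if adding its requests would push the namespace total past any hard limit. A minimal sketch of that check, with values expressed as plain numbers (cores and Gi) and a hypothetical helper name `fits_quota`:

```python
def fits_quota(used: dict, hard: dict, pod: dict) -> bool:
    """True if the pod's requests fit under every hard limit of the quota."""
    return all(used.get(key, 0) + pod.get(key, 0) <= limit
               for key, limit in hard.items())


# Hard limits from dev-team-quota; the usage figures are made up for illustration.
hard = {"requests.cpu": 50, "requests.memory": 100}
used = {"requests.cpu": 49.5, "requests.memory": 60}

small_pod = {"requests.cpu": 0.25, "requests.memory": 0.5}
large_pod = {"requests.cpu": 1, "requests.memory": 2}

print("small pod admitted:", fits_quota(used, hard, small_pod))
print("large pod admitted:", fits_quota(used, hard, large_pod))
```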

LimitRange: Per-Container Defaults

          
yaml

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: dev-team
spec:
  limits:
  - default:  # used when a container specifies no limit
      cpu: 500m
      memory: 512Mi
    defaultRequest:  # used when a container specifies no request
      cpu: 100m
      memory: 128Mi
    max:  # maximum values
      cpu: 2000m
      memory: 4Gi
    min:  # minimum values
      cpu: 50m
      memory: 64Mi
    type: Container
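
The way these defaults and bounds combine can be sketched as a small function. It is an approximation of the admission behaviour, not the real controller: `apply_limit_range` is a hypothetical name, and values are plain integers (millicores for CPU, Mi for memory) mirroring the manifest above.

```python
LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 512},
    "defaultRequest": {"cpu": 100, "memory": 128},
    "max": {"cpu": 2000, "memory": 4096},
    "min": {"cpu": 50, "memory": 64},
}


def apply_limit_range(container: dict) -> dict:
    """Fill in missing requests/limits, then validate against min and max."""
    given_requests = container.get("requests", {})
    given_limits = container.get("limits", {})
    requests, limits = {}, {}
    for res in ("cpu", "memory"):
        limits[res] = given_limits.get(res, LIMIT_RANGE["default"][res])
        if res in given_requests:
            requests[res] = given_requests[res]
        elif res in given_limits:
            # An explicit limit without a request: the request follows the limit.
            requests[res] = given_limits[res]
        else:
            requests[res] = LIMIT_RANGE["defaultRequest"][res]
        low, high = LIMIT_RANGE["min"][res], LIMIT_RANGE["max"][res]
        if not low <= requests[res] <= limits[res] <= high:
            raise ValueError(f"{res} outside the allowed range")
    return {"requests": requests, "limits": limits}


print(apply_limit_range({}))                          # all defaults applied
print(apply_limit_range({"limits": {"cpu": 1000}}))   # CPU request follows the limit
```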

Idle Resource Detection and Cleanup {#idle-resources}

Find Unused Resources

1. Orphaned PVCs (Disks Waiting to Be Deleted)

          
bash

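# PVCs that no pod mounts: all PVCs minus the claims referenced by pods
comm -23 \
  <(kubectl get pvc --all-namespaces -o json | \
    jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"' | sort -u) \
  <(kubectl get pods --all-namespaces -o json | \
    jq -r '.items[] | .metadata.namespace as $ns | .spec.volumes[]? |
    select(.persistentVolumeClaim != null) |
    "\($ns)/\(.persistentVolumeClaim.claimName)"' | sort -u)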

2. Idle Services (Unused LoadBalancers)

          
bash

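# List every LoadBalancer Service (candidates to check for traffic)
kubectl get svc --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.type=="LoadBalancer") | 
  "\(.metadata.namespace)/\(.metadata.name)"'

# Traffic check in Prometheus (PromQL): services with no requests in the last 24 hours
# (covers only traffic that passes through ingress-nginx)
sum by (service) (increase(nginx_ingress_controller_requests[24h])) < 1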

3. Zombie Deployments (0 replica)

          
yaml

# Automated cleanup CronJob: deletes deployments that have 0 replicas
# and were created more than 7 days ago
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-zero-replicas
spec:
  schedule: "0 2 * * 0"  # every Sunday at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-bot
          restartPolicy: OnFailure  # required: Job pods must use OnFailure or Never
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              kubectl get deploy --all-namespaces -o json | \
              jq -r '.items[] | select(.spec.replicas==0) | 
              select(.metadata.creationTimestamp | fromdateiso8601 < (now - 604800)) | 
              "kubectl delete deploy \(.metadata.name) -n \(.metadata.namespace)"' | \
              bash
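
The selection logic buried in the jq filter is easier to review, and to unit test, when written out. The sketch below mirrors the same two conditions (zero replicas, created more than seven days ago); `is_stale` is a hypothetical name, and the deployment dictionaries are made-up samples shaped like `kubectl get deploy -o json` items.

```python
from datetime import datetime, timedelta, timezone


def is_stale(deploy: dict, now: datetime,
             max_age: timedelta = timedelta(days=7)) -> bool:
    """True for deployments with 0 replicas that were created before the cutoff."""
    created = datetime.strptime(
        deploy["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return deploy["spec"].get("replicas") == 0 and now - created > max_age


now = datetime(2024, 6, 30, 2, 0, tzinfo=timezone.utc)
old_zero = {"metadata": {"creationTimestamp": "2024-06-01T00:00:00Z"}, "spec": {"replicas": 0}}
new_zero = {"metadata": {"creationTimestamp": "2024-06-28T00:00:00Z"}, "spec": {"replicas": 0}}
old_live = {"metadata": {"creationTimestamp": "2024-06-01T00:00:00Z"}, "spec": {"replicas": 3}}

for name, deploy in (("old_zero", old_zero), ("new_zero", new_zero), ("old_live", old_live)):
    print(name, "delete" if is_stale(deploy, now) else "keep")
```

Note that the filter looks at the creation time, not at how long the deployment has been at zero replicas, so a long-lived deployment that was scaled down yesterday would also match. Reviewing the candidate list before deleting is a sensible safeguard.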

Monitoring and FinOps Tooling {#monitoring-finops}

Kubecost: Kubernetes Cost Monitoring

          
yaml

# Kubecost settings (illustrative ConfigMap; a Helm install normally supplies these through chart values)
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubecost-config
data:
  kubecost-token: "your-token"
  prometheus-server-endpoint: "http://prometheus-server.monitoring.svc"
  
---
# Cost allocation labels: add them to every deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  labels:
    app: api-service
    team: backend
    cost-center: engineering
    environment: production
spec:
  template:
    metadata:
      labels:
        app: api-service
        team: backend
        cost-center: engineering
        environment: production

What the Kubecost Dashboard Shows:

Cost breakdown per namespace
Team/cost-center allocation
Resource efficiency scores
Savings recommendations

OpenCost: Open-Source Alternative

          
bash

# Deploy OpenCost
kubectl apply -f https://raw.githubusercontent.com/opencost/opencost/main/kubernetes/opencost.yaml

# Access it through a port-forward (9003: API, 9090: UI)
kubectl port-forward -n opencost service/opencost 9003 9090

# Fetch per-namespace costs from the API (-G sends the -d values as query parameters)
curl -G http://localhost:9003/allocation/compute \
  -d window=7d \
  -d aggregate=namespace

Prometheus Queries: DIY Monitoring

          
promql

# CPU waste (requested but not used), in cores
sum(kube_pod_container_resource_requests{resource="cpu"})
- sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Memory waste (requested but not used), in bytes
sum(kube_pod_container_resource_requests{resource="memory"})
- sum(container_memory_working_set_bytes{container!=""})

# Limit-to-request (overcommit) ratio per namespace
sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace) /
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)

# Monthly compute cost estimate (AWS), in USD:
# total vCPUs × price per vCPU-hour × 730 hours per month
# 0.048 assumes m5 on-demand pricing (m5.xlarge: $0.192/hour for 4 vCPUs)
sum(kube_node_status_capacity{resource="cpu"}) * 0.048 * 730
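
The arithmetic behind a monthly estimate and a waste ratio is worth spelling out, since the queries only return raw numbers. The sketch below assumes the m5.xlarge on-demand price quoted earlier in this post ($0.192/hour for 4 vCPUs), a 730-hour month, and a 32-vCPU cluster; the function names are hypothetical.

```python
HOURS_PER_MONTH = 730
M5_XLARGE_HOURLY_USD = 0.192   # on-demand price quoted in the node-size example
M5_XLARGE_VCPUS = 4


def monthly_compute_cost(total_vcpus: float, usd_per_vcpu_hour: float) -> float:
    """Estimated monthly compute bill for a given vCPU capacity."""
    return total_vcpus * usd_per_vcpu_hour * HOURS_PER_MONTH


def wasted_fraction(requested: float, used: float) -> float:
    """Share of requested capacity that is not actually used."""
    return (requested - used) / requested


rate = M5_XLARGE_HOURLY_USD / M5_XLARGE_VCPUS
cost = monthly_compute_cost(32, rate)
# 10 cores requested vs 1 core used, as in the over-provisioned deployment
waste = wasted_fraction(requested=10.0, used=1.0)

print(f"${rate:.3f} per vCPU-hour, ${cost:,.2f} per month, {waste:.0%} of requests unused")
```

Multiplying the monthly cost by the wasted fraction gives a rough upper bound on what right-sizing could recover, assuming nodes can be removed as requests shrink.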

Real-World Case Study {#case-study}

Scenario: E-Commerce Platform

Starting Point:

50 worker nodes (m5.2xlarge)
200+ microservices
Monthly cost: $45,000
Average CPU utilization: 15%
Average memory utilization: 30%

Optimizations Applied

1. Weeks 1-2: Resource Right-Sizing

          
bash

# Resource adjustment based on VPA recommendations
# 80% of deployments were over-provisioned

Result: 30% reduction in resources
Savings: $13,500/month

2. Week 3: Spot Instance Migration

          
yaml

# Non-critical workloads → Spot
# 60% of nodes converted to spot

Savings: $15,000/month (spot discount)

3. Week 4: Cluster Autoscaling

          
yaml

# Off-peak hours: 50 nodes → 25 nodes
# Peak hours: 25 nodes → 60 nodes (auto-scale)

Average node count: 35 (previously a fixed 50)
Savings: $9,000/month

4. Week 5-6: Idle Resource Cleanup

          
bash

# Orphaned PVCs: 500GB removed
# Unused LoadBalancers: 15 removed
# Zero-replica deployments: 30 removed

Savings: $2,500/month

5. Week 7-8: Storage Optimization

          
yaml

# gp2 → gp3 migration (gp3 is cheaper)
# Snapshot lifecycle policies
# PVC size right-sizing

Savings: $3,000/month

Results

Metric	Before	After	Change
Monthly Cost	$45,000	$18,000	-60%
Node Count	50	35 (avg)	-30%
CPU Utilization	15%	65%	+333%
Memory Utilization	30%	70%	+133%
Monthly Savings	-	$27,000	-

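ROI: 8 weeks of optimization work → $324,000 in annual savings. The per-step estimates above sum to $43,000/month, more than the $27,000 net reduction, so they presumably overlap rather than add up.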
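
The figures in the table can be re-derived with a few lines of arithmetic, which is a useful habit before presenting numbers like these to finance. A minimal sketch, with `pct_change` as a hypothetical helper:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


monthly_before, monthly_after = 45_000, 18_000
monthly_savings = monthly_before - monthly_after
annual_savings = monthly_savings * 12

# Per-step estimates from the case study, in USD per month
step_estimates = [13_500, 15_000, 9_000, 2_500, 3_000]

print(f"monthly cost: {pct_change(monthly_before, monthly_after):.0f}%")
print(f"node count: {pct_change(50, 35):.0f}%")
print(f"CPU utilization: {pct_change(15, 65):+.0f}%")
print(f"memory utilization: {pct_change(30, 70):+.0f}%")
print(f"monthly savings: ${monthly_savings:,}, annual: ${annual_savings:,}")
print(f"sum of per-step estimates: ${sum(step_estimates):,}")
```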

Actionable Checklist

Quick Wins (This Week)

Run kubectl top nodes and kubectl top pods
List pods without resource requests/limits
Identify orphaned PVCs
Find LoadBalancer Services that receive no traffic
Clean up 0-replica deployments

Short Term (This Month)

Install Kubecost or OpenCost
Implement HPA (for critical services)
Enable Cluster Autoscaler
Define resource quotas (per namespace)
Plan a spot instance strategy

Medium Term (3 Months)

Apply VPA recommendations
Node optimization (instance type review)
Storage class optimization (gp3, snapshot policies)
Build a FinOps dashboard
Team-based cost allocation

Long Term (6+ Months)

Multi-cluster strategy (prod/staging separation)
Evaluate Reserved Instances/Savings Plans
Set up a chargeback model (bill teams for their usage)
Carbon footprint tracking

Tools and Resources

Essential Tools

Cost Monitoring:

Kubecost - Kubernetes native cost monitoring
OpenCost - CNCF open-source alternative
Infracost - IaC cost estimation

Resource Management:

Goldilocks - VPA recommendations dashboard
Karpenter - intelligent node provisioning for AWS
KEDA - Event-driven autoscaling

FinOps Frameworks:

Useful Documentation

Conclusion

Kubernetes cost optimization is a continuous process, not a one-off project. A successful FinOps culture rests on:

Visibility First: you cannot optimize what you cannot measure
Automate Everything: manual intervention = waste
Culture Change: cost awareness is the whole team's responsibility
Iterate & Improve: a cost review in every sprint

At TekTık Yazılım, we offer consulting on cost optimization and FinOps implementation for production Kubernetes clusters. Get in touch with us for cluster audits, optimization roadmaps, and hands-on implementation support.

Contact: info@tektik.tr | https://tektik.tr

Note: All configuration examples in this post are production-tested and follow best practices. We recommend testing them in staging before applying them in your own environment.
