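1. Overview
1.1 Solution summary
In a production Kubernetes cluster, Pods can reach any external network by default. This document describes how to control egress by domain name using the CoreDNS hosts plugin together with Calico NetworkPolicy.
The core idea: resolve each allowed domain to a fixed IP (for example 10.0.0.1), then use Calico policy to allow traffic only to that IP range. This requires no CNI change, no Service Mesh, and no paid license.
Each namespace can have its own whitelist. The setup supports GitOps-based multi-cluster management, CI conflict detection, monitoring and alerting, and a rollback procedure.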
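1.2 Why not the alternatives
| Alternative | Reason for rejecting it |
|---|---|
| CiliumNetworkPolicy + FQDN | Requires replacing the Calico CNI with Cilium; too large a change, and every cluster would be affected |
| Istio egress Gateway | Requires a full Istio installation; Sidecar injection is intrusive to applications and resource-hungry |
| Calico Enterprise | Paid commercial product |
| Plain iptables/ipvs rules | Rules are hard to manage, do not map onto Kubernetes namespaces, and are awkward to sync across clusters |
1.3 Why Calico + CoreDNS hijacking
It keeps the existing Calico CNI and only adds policy configuration. No extra components need to be deployed. Calico policies are declared as Kubernetes-style YAML resources, so they can be managed with Kustomize and synced directly by ArgoCD. There are no license fees.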
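1.4 Technology comparison
| Solution | CNI dependency | Intrusiveness | Complexity | Recommended when |
|---|---|---|---|---|
| CiliumNetworkPolicy + FQDN | Requires Cilium CNI | High | Medium | Cilium is already in use |
| Calico + CoreDNS hijacking | Calico only | Low | Medium | Recommended for production |
| Istio egress Gateway | Istio | High | High | Istio is already in use |
| Calico Enterprise | Calico (commercial edition) | Medium | Low | Budget is available |
This guide uses Calico + CoreDNS hijacking.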
2. Architecture
2.1 Overall architecture
┌──────────────────────────────────────────────────────────────┐
│                      Kubernetes Cluster                      │
│                                                              │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐  │
│  │ namespace-a  │     │ namespace-b  │     │ namespace-c  │  │
│  │ (GitHub)     │     │ (domestic)   │     │ (mixed)      │  │
│  └──────┬───────┘     └──────┬───────┘     └──────┬───────┘  │
│         │                    │                    │          │
│         ▼                    ▼                    ▼          │
│  ┌────────────────────────────────────────────────────────┐  │
│  │               Calico GlobalNetworkPolicy               │  │
│  │           Default Deny Egress + IP Whitelist           │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              │                               │
│                              ▼                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                 CoreDNS (hosts plugin)                 │  │
│  │                                                        │  │
│  │ Allowed domains → hijacked to fixed IP (10.0.0.x)      │  │
│  │ Other domains → forwarded upstream (blocked by policy) │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              │                               │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
                      ┌──────────────────┐
                      │ External network │
                      └──────────────────┘
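2.2 How hijacking works
- A Pod requests github.com
- The CoreDNS hosts plugin resolves the domain to 10.0.0.1
- Calico checks whether the destination IP is on the whitelist
- 10.0.0.1 is allowed, and the traffic leaves through NAT (this presumes a NAT gateway or proxy answers on the hijack IP and relays to the real destination)
2.3 Key constraints
- The hijack IP range must not overlap with the cluster's Pod or Service IP ranges
- Applications must use cluster DNS and must not hard-code DNS servers
- In Calico, a lower order value means higher priority (the policy is evaluated earlier)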
3. Prerequisites
| Component | Required version | Check command |
|---|---|---|
| Kubernetes | v1.32.9 | kubectl version |
| Calico CNI | v3.25+ | calicoctl version |
| CoreDNS | Bundled with the cluster | kubectl get po -n kube-system -l k8s-app=kube-dns |
| ArgoCD (optional) | v2.5+ | argocd version |
3.1 Verify Calico status
# Confirm the Calico CNI is running normally
calicoctl node status
# Confirm GlobalNetworkPolicy is available
calicoctl get globalnetworkpolicy
3.2 Confirm CoreDNS is configurable
# View the current CoreDNS ConfigMap
kubectl get cm -n kube-system coredns -o yaml
# The hosts plugin is included in official CoreDNS builds
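4. Configuration details
4.1 Allocate the hijack IP range
Pick an IP range that is not in use in the cluster as the hijack range. This guide uses 10.0.0.0/24; adjust it for your environment.
# Confirm the range is unused: neither command should print anything
kubectl get pods -A -o wide | grep -F ' 10.0.0.'
kubectl get svc -A | grep -F ' 10.0.0.'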
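The overlap constraint from section 2.3 can also be checked arithmetically against the cluster's Pod and Service CIDRs before settling on a range. The following is a minimal sketch, not part of the original tooling: the helper names (`ip_to_int`, `cidr_range`, `cidrs_overlap`) are made up for this example, and the two CIDRs in the loop are placeholder values that must be replaced with the ranges your cluster actually uses.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1                      # split "a.b.c.d" into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print "first last" (as integers) for a CIDR such as 10.0.0.0/24.
cidr_range() {
  local ip prefix base mask first last
  ip=${1%/*}
  prefix=${1#*/}
  base=$(ip_to_int "$ip")
  if [ "$prefix" -eq 0 ]; then
    mask=0
  else
    mask=$(( (4294967295 << (32 - prefix)) & 4294967295 ))
  fi
  first=$(( base & mask ))
  last=$(( first | (4294967295 ^ mask) ))
  echo "$first $last"
}

# Succeed (exit status 0) if the two CIDRs share at least one address.
cidrs_overlap() {
  set -- $(cidr_range "$1") $(cidr_range "$2")
  # $1/$2 = first/last of range A, $3/$4 = first/last of range B
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

HIJACK_CIDR="10.0.0.0/24"
# Placeholder Pod and Service CIDRs (assumptions, not taken from this guide).
for cidr in "10.244.0.0/16" "10.96.0.0/12"; do
  if cidrs_overlap "$HIJACK_CIDR" "$cidr"; then
    echo "OVERLAP: $HIJACK_CIDR conflicts with $cidr"
  else
    echo "ok: $HIJACK_CIDR does not overlap $cidr"
  fi
done
```

Comparing integer ranges avoids the false negatives of a text search, which only finds addresses that are currently assigned.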
4.2 CoreDNS whitelist configuration
# base/coredns-egress-whitelist.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: egress-whitelist
  namespace: kube-system
data:
  hosts: |
    # GitHub (used by namespace-a)
    10.0.0.1 github.com
    10.0.0.1 api.github.com
    10.0.0.1 githubusercontent.com
    10.0.0.1 raw.githubusercontent.com
    # Domestic services (used by namespace-b)
    10.0.0.2 baidu.com
    10.0.0.2 qingcdn.com
    10.0.0.2 api.qingcdn.com
    10.0.0.3 aliyun.com
    10.0.0.3 market.aliyun.com
# Note: `fallthrough` is an option of the hosts plugin in the Corefile, not a
# hosts-file entry, so it is set in the Corefile's hosts block rather than here.
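Because the per-namespace policies in section 4.5 allow individual hijack IPs rather than domains, it helps to see which domains share each IP. The sketch below groups hosts entries by IP; `hosts_by_ip` is a hypothetical helper written for this example, and it reads hosts-format text from stdin or from files passed as arguments.

```shell
#!/bin/sh
# Group hosts entries by IP: prints one "IP: domain domain ..." line per IP.
# Lines whose first field is not an IPv4 address (comments, YAML keys) are skipped.
hosts_by_ip() {
  awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && NF >= 2 {
         names[$1] = names[$1] " " $2
       }
       END { for (ip in names) print ip ":" names[ip] }' "$@" | sort
}

hosts_by_ip <<'EOF'
    # GitHub (used by namespace-a)
    10.0.0.1 github.com
    10.0.0.1 api.github.com
    # Domestic services (used by namespace-b)
    10.0.0.2 baidu.com
    10.0.0.3 aliyun.com
EOF
# 10.0.0.1: github.com api.github.com
# 10.0.0.2: baidu.com
# 10.0.0.3: aliyun.com
```

Each output line corresponds to one /32 that a namespace policy can allow or deny as a unit, so domains that need different access rules should not share an IP.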
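4.3 Modify the CoreDNS Corefile
With the patch below, the whitelist file lands at /etc/coredns/hosts/hosts inside the CoreDNS Pods. The Corefile in the coredns ConfigMap needs a hosts block that points at this path and enables fallthrough, so that names not on the list are still forwarded upstream.
The Deployment patch that mounts the ConfigMap above into CoreDNS:
# base/coredns-deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: coredns
          # Only the Corefile is passed as an argument; the hosts file is
          # referenced from inside the Corefile.
          args:
            - -conf
            - /etc/coredns/Corefile
          volumeMounts:
            - name: config
              mountPath: /etc/coredns
              readOnly: true
            - name: egress-whitelist
              mountPath: /etc/coredns/hosts
              readOnly: true
      volumes:
        - name: egress-whitelist
          configMap:
            name: egress-whitelist
            items:
              - key: hosts
                path: hosts
        - name: config
          configMap:
            name: coredns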
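4.4 Default Deny policy
# base/default-deny-egress.yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny-egress
spec:
  # Exclude system namespaces so CoreDNS can still reach its upstream
  # resolvers; adjust the exclusion list to your cluster.
  namespaceSelector: has(projectcalico.org/name) && projectcalico.org/name not in {"kube-system", "calico-system"}
  order: 1000
  types:
    - Egress
  egress:
    # Allow DNS over UDP and TCP
    - action: Allow
      protocol: UDP
      destination:
        selector: k8s-app == "kube-dns"
        ports:
          - 53
    - action: Allow
      protocol: TCP
      destination:
        selector: k8s-app == "kube-dns"
        ports:
          - 53
    # Allow the Kubernetes API (assumes the API server endpoints carry this
    # label; adjust the selector to match your cluster)
    - action: Allow
      protocol: TCP
      destination:
        selector: k8s-app == "kube-apiserver"
        ports:
          - 443
    # Allow the hijack IP range (a namespace that needs a narrower list must
    # deny the rest of the range in its own, lower-order policy)
    - action: Allow
      destination:
        nets:
          - 10.0.0.0/24
    # Deny all other egress
    - action: Deny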
4.5 Per-namespace egress policies
namespace-a: GitHub only
# overlays/cluster-a/namespace-a-policy.yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: namespace-a-allow-github
  namespace: namespace-a
spec:
  order: 100
  # A namespaced Calico NetworkPolicy is scoped by metadata.namespace;
  # select every Pod in that namespace.
  selector: all()
  types:
    - Egress
  egress:
    - action: Allow
      destination:
        nets:
          - 10.0.0.1/32
    # Deny the rest of the hijack range, which the global policy would otherwise allow
    - action: Deny
      destination:
        nets:
          - 10.0.0.0/24
namespace-b: domestic services only
# overlays/cluster-b/namespace-b-policy.yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: namespace-b-allow-domestic
  namespace: namespace-b
spec:
  order: 100
  selector: all()
  types:
    - Egress
  egress:
    - action: Allow
      destination:
        nets:
          - 10.0.0.2/32
          - 10.0.0.3/32
    - action: Deny
      destination:
        nets:
          - 10.0.0.1/32
4.6 Check for policy conflicts
# Check for conflicting policies
calicoctl get networkpolicy --all-namespaces -o yaml | grep -E "order:|nets:"
# Confirm the Default Deny policy exists
calicoctl get globalnetworkpolicy default-deny-egress
5. GitOps directory layout
k8s-egress/
├── base/
│ ├── kustomization.yaml
│ ├── namespace.yaml
│ ├── default-deny-egress.yaml
│ ├── coredns-egress-whitelist.yaml
│ └── coredns-deployment-patch.yaml
├── overlays/
│ ├── cluster-a/
│ │ ├── kustomization.yaml
│ │ ├── namespace-a-policy.yaml
│ │ └── namespace-a.yaml
│ └── cluster-b/
│ ├── kustomization.yaml
│ ├── namespace-b-policy.yaml
│ └── namespace-b.yaml
├── ci/
│ ├── duplicate-domain-check.yaml
│ └── dns-hardcode-check.sh
├── scripts/
│ ├── validate-policy.sh
│ └── rollback.sh
├── argocd/
│ ├── k8s-egress-cluster-a.yaml
│ └── k8s-egress-cluster-b.yaml
└── monitoring/
├── hubble-alerts.yaml
└── prometheus-rules.yaml
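5.1 Kustomization configuration
# overlays/cluster-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Reference the base as a directory: with its default load restrictions,
  # Kustomize refuses to load individual files from outside the overlay root.
  # base/kustomization.yaml is expected to list namespace.yaml,
  # default-deny-egress.yaml, coredns-egress-whitelist.yaml and
  # coredns-deployment-patch.yaml.
  - ../../base
  - namespace-a-policy.yaml
  - namespace-a.yaml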
6. CI conflict detection
6.1 Duplicate domain detection
# ci/duplicate-domain-check.yaml
name: Check Duplicate Egress Domains
on:
  pull_request:
    paths:
      - 'k8s-egress/base/coredns-egress-whitelist.yaml'
      - 'k8s-egress/overlays/**/coredns-*.yaml'
jobs:
  check-duplicates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Extract all domains
        run: |
          grep -hE '^\s+[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\s+' \
            k8s-egress/base/coredns-egress-whitelist.yaml \
            k8s-egress/overlays/**/coredns-*.yaml \
            2>/dev/null \
            | awk '{print $2}' | sort > /tmp/all-domains.txt
          echo "Found $(wc -l < /tmp/all-domains.txt) domains"
          cat /tmp/all-domains.txt
      - name: Check duplicates
        run: |
          duplicates=$(sort /tmp/all-domains.txt | uniq -d)
          if [ -n "$duplicates" ]; then
            echo "Duplicate domains found:"
            echo "$duplicates"
            exit 1
          fi
          echo "No duplicate domains"
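The workflow above flags any domain that appears more than once. Since the policies key on IP, the riskier case is one domain mapped to two different hijack IPs. A local variant that reports only those cases could look like the sketch below; `conflicting_domains` is a hypothetical helper for this example and is not part of the repository layout in section 5.

```shell
#!/bin/sh
# Print every domain that is mapped to more than one IP.
# Reads hosts-format lines from stdin or from the files given as arguments;
# lines whose first field is not an IPv4 address are ignored.
conflicting_domains() {
  awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && NF >= 2 {
         if (!($2 in seen)) seen[$2] = $1
         else if (seen[$2] != $1) conflict[$2] = 1
       }
       END { for (name in conflict) print name }' "$@" | sort
}

# Example input: github.com is mapped to two different hijack IPs.
bad=$(printf '%s\n' \
  '10.0.0.1 github.com' \
  '10.0.0.2 baidu.com' \
  '10.0.0.3 github.com' | conflicting_domains)

if [ -n "$bad" ]; then
  echo "Domains mapped to more than one IP: $bad"
fi
```

A domain repeated with the same IP is not reported here, so this check complements the `uniq -d` step rather than replacing it.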
6.2 Hard-coded DNS detection
#!/bin/bash
# ci/dns-hardcode-check.sh
set -e
echo "Checking for hard-coded DNS servers..."
PATTERNS=(
  "8.8.8.8"
  "114.114.114.114"
  "1.1.1.1"
  "dns.google"
  "223.5.5.5"
)
FOUND=0
for pattern in "${PATTERNS[@]}"; do
  # -F matches the pattern literally, so "." is not treated as a wildcard
  if grep -rF "$pattern" \
      --include="*.yaml" \
      --include="*.yml" \
      --include="*.json" \
      --include="*.toml" \
      . 2>/dev/null | grep -v "^\./ci/"; then
    echo "Found hard-coded DNS: $pattern"
    FOUND=1
  fi
done
if [ "$FOUND" -eq 1 ]; then
  echo "Remove the hard-coded DNS server and use the CoreDNS hijack configuration"
  exit 1
fi
echo "No hard-coded DNS found"
7. Monitoring and alerting
7.1 Hubble traffic monitoring
# monitoring/hubble-deployment.yaml
# Note: Hubble is part of Cilium; hubble-relay collects flows from Cilium
# agents, so on a Calico-only cluster this Deployment has no flow source.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hubble-relay
  namespace: calico-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: hubble-relay
  template:
    metadata:
      labels:
        k8s-app: hubble-relay
    spec:
      containers:
        - name: hubble-relay
          image: quay.io/cilium/hubble-relay:latest
          command:
            - hubble
            - relay
          ports:
            - containerPort: 4245
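7.2 Prometheus alert rules
# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: egress-policy-alerts
  namespace: kube-system
spec:
  groups:
    - name: egress-policy
      rules:
        # calico_egress_policy_deny_total is the metric name this guide assumes;
        # confirm that your Calico metrics setup actually exposes it.
        - alert: EgressDenyRateHigh
          expr: |
            rate(calico_egress_policy_deny_total[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Egress deny rate is abnormally high"
            description: |
              Cluster {{ $labels.cluster }} has denied egress traffic
              at more than 10/s over the last 5 minutes.
              Current value: {{ $value }}/s
        - alert: DNShijackUnmatched
          expr: |
            rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "DNS NXDOMAIN responses are increasing"
            description: "Pods may be querying domains that are not on the whitelist"
        - alert: CoreDNSEgressHostsMissing
          # absent() fires when no hosts-plugin series exists; count(...) == 0
          # returns no data in that case and would never fire.
          expr: |
            absent(coredns_dns_responses_total{plugin="hosts"})
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "CoreDNS egress hosts configuration is missing"
            description: "CoreDNS has not loaded the egress-whitelist hosts configuration"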
7.3 Grafana Dashboard
{
  "dashboard": {
    "title": "Egress Policy Monitoring",
    "panels": [
      {
        "title": "Egress traffic: Allow vs Deny",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum(rate(calico_egress_policy_allow_total[5m]))",
            "legendFormat": "Allow"
          },
          {
            "expr": "sum(rate(calico_egress_policy_deny_total[5m]))",
            "legendFormat": "Deny"
          }
        ]
      },
      {
        "title": "Top 10 namespaces by egress traffic",
        "type": "bargauge",
        "targets": [
          {
            "expr": "topk(10, sum by (namespace) (rate(calico_egress_policy_allow_total[5m])))",
            "legendFormat": "{{namespace}}"
          }
        ]
      },
      {
        "title": "DNS hijack hit rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(coredns_dns_responses_total{plugin=\"hosts\"}[5m])",
            "legendFormat": "Served by hosts"
          },
          {
            "expr": "rate(coredns_dns_responses_total{plugin=\"forward\"}[5m])",
            "legendFormat": "Forwarded upstream"
          }
        ]
      }
    ]
  }
}
8. Implementation steps
8.1 Phased rollout
| Phase | Step | Action | Verification |
|---|---|---|---|
| Phase 1 | 1 | Deploy the CoreDNS ConfigMap | kubectl get cm -n kube-system egress-whitelist |
| Phase 1 | 2 | Patch CoreDNS to mount the hosts file | kubectl rollout restart -n kube-system deployment/coredns |
| Phase 1 | 3 | Verify the CoreDNS hosts entries take effect | kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup github.com |
| Phase 2 | 4 | Deploy the Default Deny policy | calicoctl get GlobalNetworkPolicy default-deny-egress |
| Phase 2 | 5 | Confirm in-cluster DNS and API access still work | kubectl get pods -A |
| Phase 3 | 6 | Selective test (namespace-a only) | Deploy namespace-a-policy.yaml |
| Phase 3 | 7 | Verify GitHub access works | kubectl exec -n namespace-a test-pod -- curl -I github.com |
| Phase 3 | 8 | Verify unauthorized domains are denied | kubectl exec -n namespace-a test-pod -- curl -I blocked-domain.com |
| Phase 4 | 9 | Roll out to all namespaces gradually | Deploy the per-namespace policies one by one |
| Phase 4 | 10 | Configure the ArgoCD Application | Automated GitOps sync |
| Phase 5 | 11 | Deploy the Prometheus alert rules | Check the alerts in Grafana |
| Phase 5 | 12 | Verify Hubble traffic monitoring | Inspect egress traffic in hubble ui |
8.2 Quick validation script
#!/bin/bash
# scripts/validate-policy.sh
NAMESPACE="${1:-namespace-a}"
TEST_DOMAIN="${2:-github.com}"
echo "=== Validating egress policy ($NAMESPACE) ==="
echo "[1/3] Checking CoreDNS hosts configuration..."
kubectl get cm -n kube-system egress-whitelist -o jsonpath='{.data.hosts}' | grep -qF "$TEST_DOMAIN" && echo "hosts entry present" || echo "hosts entry missing"
echo "[2/3] Checking Calico NetworkPolicy..."
calicoctl get networkpolicy -n "$NAMESPACE" 2>/dev/null | grep -q "allow" && echo "NetworkPolicy present" || echo "NetworkPolicy missing"
echo "[3/3] Testing DNS hijacking..."
# Look up the test Pod by label instead of assuming a fixed Pod name
POD_NAME=$(kubectl get pod -n "$NAMESPACE" -l app=nginx -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
if [ -n "$POD_NAME" ]; then
  kubectl exec -n "$NAMESPACE" "$POD_NAME" -- nslookup "$TEST_DOMAIN" 2>/dev/null | grep -qF "10.0.0." && echo "DNS hijack active" || echo "DNS not hijacked"
else
  echo "No test Pod found; skipping DNS check"
fi
echo "=== Validation complete ==="
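8.3 Rollback plan
#!/bin/bash
# scripts/rollback.sh
# If ArgoCD auto-sync with selfHeal is enabled, disable it (or revert in Git)
# first; otherwise the resources removed below are re-created.
echo "Rolling back egress policy..."
echo "[1/3] Deleting Calico policies..."
# calicoctl delete takes explicit resource names; list every per-namespace policy
calicoctl delete networkpolicy namespace-a-allow-github -n namespace-a 2>/dev/null || true
calicoctl delete networkpolicy namespace-b-allow-domestic -n namespace-b 2>/dev/null || true
calicoctl delete globalnetworkpolicy default-deny-egress 2>/dev/null || true
echo "[2/3] Emptying the CoreDNS whitelist..."
# The CoreDNS Deployment still mounts this ConfigMap, so empty it instead of
# deleting it; a missing ConfigMap would keep new CoreDNS Pods from starting.
kubectl patch cm -n kube-system egress-whitelist --type merge -p '{"data":{"hosts":""}}'
echo "[3/3] Restarting CoreDNS..."
kubectl rollout restart -n kube-system deployment/coredns
echo "Rollback complete"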
8.4 ArgoCD multi-cluster deployment
# argocd/k8s-egress-cluster-a.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8s-egress-cluster-a
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/k8s-egress.git
    targetRevision: main
    path: overlays/cluster-a
  destination:
    server: https://cluster-a.k8s.internal:6443
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
# Enable automated sync on every cluster
argocd app set k8s-egress-cluster-a --sync-policy automated
argocd app set k8s-egress-cluster-b --sync-policy automated
# Check sync status
argocd app list -l app.kubernetes.io/managed-by=argocd
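9. FAQ
Q1: Does restarting CoreDNS affect workloads?
A: A rolling update causes brief DNS jitter (usually under 30 seconds). Do it during off-peak hours, or use kubectl rollout pause to hold the rollout.
Q2: How are wildcard domains (such as *.aliyun.com) handled?
A: The CoreDNS hosts plugin only matches exact names. For wildcards, use an application-layer proxy (such as Envoy), or split the wildcard into individual exact domains and configure each one.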
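Splitting a wildcard into exact names is mechanical once the required subdomains are known. The sketch below generates hosts lines for a base domain and a list of subdomain labels; `expand_subdomains` is a hypothetical helper for this example, and the subdomain list is something you have to supply, since the hosts plugin cannot discover it.

```shell
#!/bin/sh
# Print hosts entries for a base domain and each listed subdomain label.
# Usage: expand_subdomains IP BASE_DOMAIN [LABEL...]
expand_subdomains() {
  local ip base sub
  ip=$1
  base=$2
  shift 2
  printf '%s %s\n' "$ip" "$base"
  for sub in "$@"; do
    printf '%s %s.%s\n' "$ip" "$sub" "$base"
  done
}

expand_subdomains 10.0.0.3 aliyun.com market
# 10.0.0.3 aliyun.com
# 10.0.0.3 market.aliyun.com
```

The output can be pasted into the hosts data of the egress-whitelist ConfigMap; any subdomain left off the list is forwarded upstream and its real IP is then blocked by the egress policy.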
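Q3: What if the domain whitelists differ between clusters?
A: Put shared domains in base/coredns-egress-whitelist.yaml and cluster-specific ones under overlays/cluster-X/.
Q4: How are temporary grants handled?
A: Short-term grants (< 24h) can be revoked quickly with an ArgoCD rollback; long-term grants go through the normal PR process, where CI checks for conflicts.
Q5: How is audit logging done?
A: Export Hubble flow logs to Elasticsearch:
# Hubble flow log export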
apiVersion: v1
kind: ConfigMap
metadata:
  name: hubble-relay
  namespace: calico-system
data:
  config.yaml: |
    flow:
      enableCapture: true
    exporters:
      - type: Elasticsearch
        address: elasticsearch.logging:9200
10. References
| Resource | Link |
|---|---|
| Calico NetworkPolicy documentation | https://docs.tigera.io/calico/latest/network-policy/policy-rules/dns-policy |
| CoreDNS hosts plugin | https://coredns.io/plugins/hosts/ |
| Cilium FQDN Policy (alternative) | https://docs.cilium.io/en/stable/policy/language/#dns-based |
| ArgoCD multi-cluster management | https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/ |