Executive Summary
📢 Core Concept
IMS DR switchover is NOT "instant and seamless". The core risks are concentrated in the signaling detection window, session refresh, HSS/DRA load, and PCRF session inflation.
🎯 Key Conclusions
- SBC role: the SBC is not the core risk point. The real risk lies in PS/PGW plus UE behavior (P-CSCF IP re-assignment, registration triggers).
- Recovery dependency: users cannot all switch to the other site instantly. Recovery speed depends heavily on TCP vs UDP and on whether a re-registration or a new call is triggered.
- Rollback risk: the load risk during rollback (restoring the original network) is usually ≥ that of the switchover. Watch for a second rise in Cx/Sh signaling and a second climb in PCRF sessions.
Step-1: ATS01 Failure (STP ME)
📊 Failure Characteristics
Impacted nodes: MMTEL / SCC / SSF / IP-SM-GW
S-CSCF confirm time: ~30 s (detect window)
📞 Voice Service Impact Matrix
| Scenario | Behavior / Result |
|---|---|
| Ongoing call (released before ATS) | OK. Billing normal. |
| Ongoing call (media established) | Partial. Media continues (RTP OK), but billing and SRVCC may fail. The call is released when session refresh fails (max 30 min). |
| New call (within detect window) | Fail. New calls fail within the 30 s window and recover after the failure is confirmed. |
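The matrix above can be read as two simple timing rules. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the 30 s and 30 min values are taken from the text, and the exact boundary handling (half-open window, release at the next session refresh) is assumed rather than confirmed by the source.

```python
DETECT_WINDOW_S = 30             # from the text: S-CSCF confirm time of about 30 s
MAX_SESSION_REFRESH_S = 30 * 60  # from the text: release on refresh failure, max 30 min


def new_call_outcome(seconds_since_failure: float) -> str:
    """Outcome of a new call attempted this long after the ATS failure."""
    # Assumption: the detect window is half-open, [0, DETECT_WINDOW_S).
    return "fail" if 0 <= seconds_since_failure < DETECT_WINDOW_S else "ok"


def established_call_release_s(seconds_to_next_refresh: float) -> float:
    """Latest time after the failure at which a media-established call is released."""
    # Assumption: the call survives until its next session refresh, capped at 30 min.
    return min(seconds_to_next_refresh, MAX_SESSION_REFRESH_S)


print(new_call_outcome(10))              # fail (inside the detect window)
print(new_call_outcome(45))              # ok (failure already confirmed)
print(established_call_release_s(600))   # 600
print(established_call_release_s(3600))  # 1800 (capped at 30 min)
```

Billing and SRVCC behavior for established calls is not modeled here; the text only says they may fail.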
⚠️ Signaling Risk (Risk-1)
- Sh signaling is higher than on a normal night: ATS must re-download user data from HSS.
- The ATS01 subscription to HSS is not cancelled, so HSS still sends PNR to ATS01 and it fails, which impacts KPIs.
🧨 Additional Risk List (ATS Phase)
| Risk | Description |
|---|---|
| Risk-1 | CSCF needs about 30 s to confirm the ATS fault; services are impacted during this window. |
| Risk-2 | STP must detect the IP-SM-GW-1 fault and route to IP-SM-GW-2; otherwise MT-SMS fails. |
| Risk-3 | Potential license overload on ATS02 (takeover of the shifted traffic). |
| Risk-4 | Sh link load on DRA/HSS increases because of user data download. |
Step-2: CSCF STP ME (~10 min later)
⏱️ Key Latency Parameters
- Detect: 20 s
- ART (Retransmission): 6 s × 3 + 2 s ≈ 20 s
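The retransmission arithmetic above can be checked with a short sketch. The parameter names are hypothetical labels for the three terms in "6 s × 3 + 2 s"; the source does not say what each term represents, so reading them as interval, retry count, and final wait is an assumption.

```python
def retransmission_window_s(interval_s: float, retries: int, final_wait_s: float) -> float:
    """Total time spent retransmitting before the peer is declared faulty."""
    # Assumed reading of "6 s x 3 + 2 s": three 6 s intervals plus a 2 s final wait.
    return interval_s * retries + final_wait_s


window = retransmission_window_s(interval_s=6, retries=3, final_wait_s=2)
print(window)  # 20
```

With these values the window matches the 20 s detection time listed above.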
📉 Service Impact
| Scenario | Behavior / Result |
|---|---|
| Ongoing call | Released. No chain effect. |
| New register / call | Delay/Fail. Impacted by the 20 s detection window: fails within the window, recovers after the fault is confirmed. |
🌪️ Cx Signaling Storm
Cx signaling rises significantly. DRA/HSS link load is higher than on a normal night and must be monitored closely during the cutover window.
🧨 Additional Risk List (CSCF Phase)
| Risk | Description |
|---|---|
| Risk-1 | P-CSCF needs about 20 s to detect the CSC fault; services are impacted during this window. |
| Risk-2 | MO call/SMS may fail before the new registration completes (SBC returns 504). The UE registers to the second P-CSCF after a delay; the timer depends on the UE vendor. |
| Risk-3 | Register load is balanced to other SBCs, which may cause capacity/license overload on E/// SBC and IMS. |
| Risk-4 | Cx link load on DRA/HSS increases because of user data download. |
Step-3: SBC01 STP ME (~10 min later)
Detection & Limitations
CSC detection defaults to 30 s. Because registration uses the P-CSCF IP address (not a hostname), the UE cannot migrate quickly while that IP is down.
💣 Critical Risks (Risk-2, 3, 4)
🧨 Additional Risk List (SBC / UE Behavior)
| Risk | Description |
|---|---|
| Risk-1 | All MO services fail until the UE attempts a new registration. |
| Risk-2 | MT call/SMS depends on CS Retry / Transfer and CSFB paging; behavior may vary. |
| Risk-3 | PGW may allocate the faulty P-CSCF IP to the UE again, so some users take a long time to recover. |
| Risk-4 | PCRF sessions increase slightly; monitor the trend and duration. |
Recovery / Rollback Phase
📈 Signaling & Load Rebound
After IMS and SBC are restored, Cx/Sh signaling (to HSS) is higher than before the restore, and the DRA/HSS links come under pressure again.
PCRF sessions remain above normal and take several days to fall back.
🧨 Additional Risk List (Recovery / Rollback)
| Risk | Description |
|---|---|
| Risk-1 | After ATS/SBC/CSCF are restored, DRA/HSS load rises because aged or expired user data must be downloaded again. |
| Risk-2 | If auto user balance is enabled, the overall switch-back may take a long time. |
✅ Checkpoints
- Strictly follow the Rollback MOP.
- Check the ATS Plug-in (E/// Repository adaption).
- SBC user recovery is not instant and may take days.
SBC Specific Conclusion (Engineering View)
❓ Do users immediately "jump to the new SBC" after the SBC is stopped?
Conclusion: not necessarily. It depends on the terminal and the transport protocol. User migration is progressive.
| Scenario | Behavior | Result |
|---|---|---|
| Trigger new call | UE reselects P-CSCF | Fast |
| TCP registration | Link break → immediate re-registration | Fast |
| UDP registration | Waits for timeout / refresh cycle | Slow / Uncontrollable |
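The table above amounts to a small decision rule. The following is a minimal sketch of it, not a model of real UE behavior: `migration_speed` and the sample UE list are hypothetical, and treating a new call as overriding the transport is an assumption drawn from the "Trigger new call" row.

```python
from collections import Counter


def migration_speed(transport: str, new_call_triggered: bool = False) -> str:
    """Classify how quickly a UE is expected to move to another P-CSCF."""
    if new_call_triggered:
        return "fast"  # a new call makes the UE reselect a P-CSCF
    transport = transport.lower()
    if transport == "tcp":
        return "fast"  # link break leads to immediate re-registration
    if transport == "udp":
        return "slow"  # waits for a timeout or the refresh cycle
    raise ValueError(f"unknown transport: {transport!r}")


# Hypothetical sample: (transport, new call triggered during the outage)
ues = [("tcp", False), ("udp", False), ("udp", True), ("udp", False)]
summary = Counter(migration_speed(t, c) for t, c in ues)
print(dict(summary))  # {'fast': 2, 'slow': 2}
```

The share of UDP-registered UEs that do not place a call is what makes the migration progressive rather than instant.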
One-Sentence Summary
IMS DR is in essence not a "node switchover" but the combined effect of "detection window + session lifecycle + UE behavior + PS/PGW re-assignment".
👉 True monitoring focus: HSS/DRA, PCRF, and PGW