01

Executive Summary

📢 Core Concept

IMS DR switchover is NOT "instant and seamless". The core risks are concentrated in four areas: the signaling detection window, session refresh, HSS/DRA load, and PCRF session inflation.

🎯 Key Conclusions

  • SBC role: the SBC is not the core risk point. The real risk lies in PS/PGW and UE behavior (P-CSCF IP re-assignment, registration triggers).
  • Recovery dependency: users cannot all switch to the other system instantly. Recovery speed depends heavily on TCP vs UDP transport and on whether a re-registration or a new call is triggered.
  • Rollback risk: the load risk during rollback (restoring the original network) is usually ≥ that of the switchover. Watch for a renewed rise in Cx/Sh signaling and a renewed climb in PCRF sessions.
02

Step-1: ATS01 Failure (STP ME)

📊 Failure Characteristics

Impacted nodes: MMTEL / SCC / SSF / IP-SM-GW
S-CSCF confirmation time: ~30 s (detection window)

📞 Voice Service Impact Matrix

Scenario | Status | Behavior / Result
Ongoing call (released before ATS) | OK | Billing normal.
Ongoing call (media established) | Partial | Media continues (RTP OK), but billing and SRVCC may fail. The call is released when session refresh fails (max 30 min).
New call (inside detection window) | Fail | New calls fail within the 30 s window and recover once the failure is confirmed.
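The matrix can be read as a small decision rule. The following is a minimal Python sketch of that rule, not an implementation from any IMS product: `classify_call`, its arguments, and the `Impact` enum are hypothetical names, and "released before ATS" is interpreted here as "released before the ATS failure", which is an assumption.

```python
from enum import Enum

# S-CSCF confirmation time quoted in this section (approximate).
DETECT_WINDOW_S = 30


class Impact(Enum):
    OK = "OK"
    PARTIAL = "Partial"
    FAIL = "Fail"


def classify_call(setup_offset_s, released_before_failure=False):
    """Return the expected impact for one call (hypothetical helper).

    setup_offset_s: call setup time relative to the ATS failure, in seconds.
                    Negative means the call was set up before the failure.
    released_before_failure: True if the call also ended before the failure.
    """
    if setup_offset_s < 0:
        # The call already existed when the ATS failed.
        if released_before_failure:
            return Impact.OK       # billing normal
        return Impact.PARTIAL      # media continues, billing/SRVCC may fail
    if setup_offset_s < DETECT_WINDOW_S:
        # New call attempted inside the detection window.
        return Impact.FAIL
    # New call attempted after the failure has been confirmed.
    return Impact.OK
```

For example, `classify_call(10)` falls inside the detection window and returns `Impact.FAIL`, while `classify_call(45)` returns `Impact.OK`.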

⚠️ Signaling Load Risk (Risk-1)

  • Sh signaling is higher than on a normal night: the ATS must re-download data from the HSS.
  • The ATS01 subscription to the HSS is not cancelled, so the HSS sends PNRs that fail, which impacts KPIs.

🧨 Additional Risk List (ATS Phase)

Risk | Description
Risk-1 | The CSCF needs about 30 s to confirm the ATS fault; services are impacted during that window.
Risk-2 | The STP must detect the IP-SM-GW-1 fault and route to IP-SM-GW-2; otherwise MT-SMS fails.
Risk-3 | Potential license overload on ATS02 as it takes over the shifted traffic.
Risk-4 | Sh link load on the DRA/HSS rises because user data is downloaded.
03

Step-2: CSCF STP ME (~10 min later)

⏱️ Key Latency Parameters

  • Detect: 20s
  • ART (Retransmission): 6s × 3 + 2s ≈ 20s
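The ART figure is simple arithmetic, worked through below as a minimal sketch. The function and parameter names are hypothetical, and reading "6s × 3 + 2s" as three 6-second retransmission intervals followed by a 2-second final wait is an assumption about what the terms mean.

```python
def retransmission_window_s(interval_s, attempts, final_wait_s):
    """Total time spent retransmitting before the peer is declared down.

    Hypothetical helper: interval_s * attempts + final_wait_s.
    """
    return interval_s * attempts + final_wait_s


# Values taken from the ART bullet: 6 s x 3 + 2 s.
art_s = retransmission_window_s(interval_s=6, attempts=3, final_wait_s=2)
print(art_s)  # 20
```

With these inputs the window comes out at 20 s, which matches the detection figure listed alongside it.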

📉 Service Impact

Scenario | Behavior / Result
Ongoing call | Released. No chain effect.
New registration / call | Delay / Fail. Impacted by the 20 s detection window: fails inside the window, recovers after the failure is confirmed.

🌪️ Cx Signaling Storm

Cx signaling rises significantly, and DRA/HSS link load is higher than on a normal night. Monitor it closely during the cutover window.

🧨 Additional Risk List (CSCF Phase)

Risk | Description
Risk-1 | The P-CSCF needs about 20 s to detect the CSC fault; services are impacted during that window.
Risk-2 | MO call/SMS may fail until the new registration completes (the SBC returns 504). The UE registers to the second P-CSCF after a delay; the timer depends on the UE vendor (see the delays below).
Risk-3 | Registration load is balanced onto the other SBCs, which may cause capacity/license overload on the E/// SBC and the IMS.
Risk-4 | Cx link load on the DRA/HSS rises because user data is downloaded.

Risk-2 UE re-registration delay (field experience):

UE | Delay
iPhone | ~5 s
Samsung | ~5 s
Huawei | ~52 s
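The vendor-dependent delay in Risk-2 means MO service comes back in steps rather than all at once. The sketch below illustrates this with the experience values listed for iPhone, Samsung, and Huawei. The `reregistered_share` helper and the vendor mix are hypothetical, and treating each vendor's delay as a fixed value measured from the same starting instant is a simplifying assumption.

```python
# Experience values quoted in this section, in seconds (approximate).
UE_REREG_DELAY_S = {"iPhone": 5, "Samsung": 5, "Huawei": 52}


def reregistered_share(t_s, vendor_mix):
    """Share of UEs that have started re-registration t_s seconds in.

    vendor_mix: mapping vendor -> number of UEs (hypothetical input).
    """
    total = sum(vendor_mix.values())
    if total == 0:
        raise ValueError("vendor_mix must contain at least one UE")
    done = sum(
        count
        for vendor, count in vendor_mix.items()
        if UE_REREG_DELAY_S[vendor] <= t_s
    )
    return done / total


# Hypothetical mix, for illustration only.
mix = {"iPhone": 40, "Samsung": 40, "Huawei": 20}
print(reregistered_share(10, mix))  # 0.8
print(reregistered_share(60, mix))  # 1.0
```

Under this assumed mix, one fifth of the UEs stay without MO service for close to a minute, even though the majority recover within seconds.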
04

Step-3: SBC01 STP ME (~10 min later)

Detection & Limitations

CSC detection defaults to 30 s. Because registration uses the P-CSCF IP address (not a hostname), the UE cannot migrate quickly while that IP is down.

💣 Critical Risks (Risk-2, 3, 4)

Risk-2: MT Call
MT calls keep failing until re-registration completes.
Risk-3: PGW Re-assignment
The PGW may assign the failed P-CSCF IP to the UE again, prolonging registration recovery.
Risk-4: PCRF Session Inflation
User data is deleted after 2 hours. PCRF sessions run significantly above normal and take days to recover (until the PGW releases the bearers).

🧨 Additional Risk List (SBC / UE Behavior)

Risk | Description
Risk-1 | All MO services fail until the UE attempts a new registration.
Risk-2 | MT call/SMS depends on CS Retry / Transfer with CSFB paging; behavior may vary.
Risk-3 | The PGW may allocate the faulty P-CSCF IP to the UE again, so some UEs may take a long time to recover.
Risk-4 | PCRF sessions increase slightly; monitor the trend and duration.
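The PGW re-assignment risk can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the PGW picks a P-CSCF IP uniformly at random from a pool that still contains the failed address; real PGW selection logic is vendor-specific and is not described in this document. `expected_assignments` is a hypothetical helper.

```python
def expected_assignments(pool_size, failed):
    """Expected number of assignments until a UE gets a healthy P-CSCF IP.

    Toy model (assumption): each assignment picks uniformly from pool_size
    addresses, of which `failed` are down. The number of tries is then
    geometric with success probability (pool_size - failed) / pool_size.
    """
    if not 0 <= failed < pool_size:
        raise ValueError("need 0 <= failed < pool_size")
    return pool_size / (pool_size - failed)


# Two P-CSCF addresses, one failed: a UE needs two assignments on average.
print(expected_assignments(pool_size=2, failed=1))  # 2.0
```

Even in this simple model, some UEs land on the failed address several times in a row, which is consistent with the long recovery tail noted for Risk-3.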
05

Recovery / Rollback Phase

📈 Signaling & Load Rebound

After the IMS and SBC are restored, Cx/Sh signaling (towards the HSS) is higher than before the restoration, and the DRA/HSS links come under pressure again.
PCRF sessions remain above normal and take days to fall back.
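One way to watch for this rebound is to compare live KPI samples against a normal-night baseline. The following is a minimal sketch under stated assumptions: the KPI names, the numbers, and the 20% tolerance are all hypothetical and would need to come from the operator's own monitoring data.

```python
def flag_overload(samples, baseline, tolerance=0.2):
    """Return the KPI names whose sample exceeds baseline by > tolerance.

    samples, baseline: mapping KPI name -> value (hypothetical inputs).
    tolerance: allowed relative excess over baseline (0.2 = 20%).
    """
    return [
        name
        for name, value in samples.items()
        if value > baseline[name] * (1 + tolerance)
    ]


# Hypothetical normal-night baseline and rollback-time samples.
baseline = {"cx_tps": 1000, "sh_tps": 800, "pcrf_sessions": 50000}
samples = {"cx_tps": 1500, "sh_tps": 820, "pcrf_sessions": 90000}
print(flag_overload(samples, baseline))  # ['cx_tps', 'pcrf_sessions']
```

Running the same check at intervals over several days would show whether PCRF sessions are actually falling back towards the baseline.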

🧨 Additional Risk List (Recovery / Rollback)

Risk | Description
Risk-1 | After ATS/SBC/CSCF are restored, user data has aged out and must be re-downloaded, so DRA/HSS load rises.
Risk-2 | If auto user balance is enabled, the overall switch-back may take a long time.

Checkpoints

  • Strictly follow the Rollback MOP.
  • Check the ATS Plug-in (E/// Repository adaption).
  • SBC user recovery is not instant and may take days.
06

SBC-Specific Conclusion (Engineering View)

After an SBC is stopped, do users immediately move to the new SBC?

Conclusion: not necessarily. It depends on the UE and the transport protocol. User migration is progressive.

Scenario | Behavior | Result
New call triggered | UE reselects P-CSCF | Fast
TCP registration | Link breaks → immediate re-registration | Fast
UDP registration | Waits for timeout / refresh cycle | Slow / uncontrollable
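The progressive migration can be sketched with a simple model. It assumes that TCP-registered UEs move immediately and that UDP-registered UEs move only at their next registration refresh, with refresh times spread evenly over one refresh period. `migrated_share`, the TCP share, and the refresh period are hypothetical, and the model ignores new-call triggers and PGW re-assignment of a failed IP.

```python
def migrated_share(t_s, tcp_share, refresh_period_s):
    """Estimated share of users on the new SBC t_s seconds after the stop.

    tcp_share: fraction of users registered over TCP (hypothetical input).
    refresh_period_s: registration refresh period for UDP users, assumed
                      uniform across UEs (hypothetical input).
    """
    if not 0.0 <= tcp_share <= 1.0:
        raise ValueError("tcp_share must be between 0 and 1")
    if refresh_period_s <= 0:
        raise ValueError("refresh_period_s must be positive")
    # UDP users trickle over as their refresh timers fire.
    udp_done = min(max(t_s, 0) / refresh_period_s, 1.0)
    return tcp_share + (1.0 - tcp_share) * udp_done


# Hypothetical: half the users on TCP, one-hour refresh period.
print(migrated_share(0, tcp_share=0.5, refresh_period_s=3600))     # 0.5
print(migrated_share(1800, tcp_share=0.5, refresh_period_s=3600))  # 0.75
```

The larger the UDP share and the longer the refresh period, the flatter this curve becomes, which is why the UDP row is rated slow and uncontrollable.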
07

One-Sentence Summary

IMS DR is in essence not "node switchover" but the combined effect of "detection window + session lifecycle + UE behavior + PS/PGW re-assignment".

👉 True monitoring focus: HSS/DRA, PCRF, and PGW