SRE for Qatar's Energy Sector: Observability and Incident Response for OT/IT Convergence
How Site Reliability Engineering practices must adapt for Qatar's energy sector - observability stacks that bridge OT and IT, incident response for critical infrastructure, and SLOs for operational technology.
Qatar’s energy sector operates some of the most complex industrial systems on the planet. QatarEnergy’s North Field Expansion - the world’s largest LNG project - will increase Qatar’s LNG capacity to 126 million tonnes per year by 2027. Behind these physical operations sits an expanding layer of software: SCADA systems, distributed control systems, digital twins, predictive maintenance platforms, and enterprise integration middleware that connects operational technology to business systems.
When software failures in this context carry safety and operational continuity implications, standard Site Reliability Engineering practices are not sufficient. The SRE playbook written for web applications - where a 99.9% SLO means roughly 8.7 hours of acceptable downtime per year - does not translate directly to a production monitoring system where even 30 seconds of incorrect data could trigger an unnecessary emergency shutdown.
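The downtime arithmetic is worth making explicit. A quick sketch of the error-budget calculation (using the conventional 8,760-hour year):

```python
def annual_downtime_hours(slo: float, hours_per_year: float = 8760.0) -> float:
    """Hours of downtime per year permitted by an availability SLO."""
    return (1.0 - slo) * hours_per_year

# A 99.9% SLO permits roughly 8.76 hours of downtime per year;
# a 99.99% SLO permits under an hour.
```

The same formula scales to any window, which is why a budget that looks generous over a year can be consumed by a single incident in an OT context.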
This post covers how standard SRE practices must be adapted for the unique requirements of OT/IT convergence in Qatar's energy sector.

Why Standard SRE Breaks at the OT/IT Boundary
Google’s SRE principles - SLOs, error budgets, toil reduction, blameless postmortems - are fundamentally sound. But they were designed for IT systems where the blast radius of failure is user-facing downtime or data inconsistency. In operational technology environments, the blast radius is different:
Failure modes include physical consequences. A monitoring system that displays incorrect pressure readings, even briefly, can lead operators to make decisions with physical safety implications. The SRE concept of “acceptable error budget” must be redefined when errors can cascade into operational safety events.
Operators are the reliability layer. In IT systems, automation is the primary reliability mechanism - automated failover, automated scaling, automated remediation. In OT environments, human operators are the primary reliability layer. The observability stack must be designed to support operator decision-making, not replace it.
Change velocity is intentionally slow. SRE practices often assume that reliability improves with deployment frequency - smaller changes, faster feedback loops. In OT environments, change frequency is deliberately limited because every change carries risk that must be evaluated against operational schedules. The SRE team must optimise for change safety, not change velocity.
Network segmentation is a hard constraint. OT networks are air-gapped or heavily segmented from IT networks for security reasons. This makes standard observability architectures - where agents push telemetry to a central SaaS platform - impossible without significant adaptation.
Observability Architecture for OT/IT Convergence
The observability stack for Qatar’s energy companies must bridge two fundamentally different environments while respecting the security boundaries between them.
The Three-Zone Model
We design observability architectures around three zones:
Zone 1: OT Network (Purdue Level 0-3). This includes PLCs, RTUs, SCADA servers, and DCS controllers. Telemetry collection in this zone uses lightweight, purpose-built agents that collect data through industrial protocols - OPC-UA, Modbus, MQTT - and push it to a local data collector within the OT network. No IT-originated traffic enters this zone.
Zone 2: DMZ (Purdue Level 3.5). A dedicated demilitarised zone sits between OT and IT networks. Data flows one way: from OT to IT, through a data diode or a strictly controlled firewall with unidirectional rules. The DMZ hosts data aggregation and filtering - raw OT telemetry is normalised, sampled, and forwarded to the IT observability stack. This zone is where OT/IT convergence physically happens.
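The DMZ's normalise-and-sample step can be sketched as follows. This is a minimal illustration, not a production forwarder: the tag naming, the psi-to-bar conversion as the only normalisation rule, and the per-tag sampling interval are all assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    tag: str      # e.g. "train4.compressor.discharge_pressure" (hypothetical tag)
    value: float
    unit: str
    ts: float     # unix timestamp assigned by the OT-side collector

def normalise(r: Reading) -> Reading:
    """Convert raw units to the canonical units used on the IT side (bar)."""
    if r.unit == "psi":
        return Reading(r.tag, r.value * 0.0689476, "bar", r.ts)
    return r

def downsample(readings: list[Reading], interval_s: float) -> list[Reading]:
    """Keep at most one reading per tag per interval before forwarding to IT."""
    kept, last_ts = [], {}
    for r in sorted(readings, key=lambda r: r.ts):
        if r.ts - last_ts.get(r.tag, float("-inf")) >= interval_s:
            kept.append(r)
            last_ts[r.tag] = r.ts
    return kept
```

Sampling happens in the DMZ rather than the OT zone so that the raw, full-resolution stream remains available to the local historian while the IT stack receives only what it needs.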
Zone 3: IT Network and Cloud. The full observability platform - Prometheus, Grafana, OpenTelemetry collectors, alerting engines, and long-term storage - runs in the IT network or in AWS Bahrain (me-south-1). This zone handles correlation, dashboarding, alerting, and incident management. IT application telemetry (APIs, databases, middleware) feeds directly into this zone.
Technology Stack
For Qatar energy sector deployments, the observability stack typically includes:
OpenTelemetry as the telemetry collection standard for IT-side services. OTel collectors aggregate traces, metrics, and logs from application workloads and forward them to the backend.
Prometheus with Thanos or Cortex for metrics storage. Prometheus handles the collection and short-term storage. Thanos provides long-term storage in S3 (me-south-1) with global query capability across multiple Prometheus instances - essential when you have separate Prometheus deployments for different operational sites.
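Cross-site queries then go through the Thanos Querier, which exposes the standard Prometheus HTTP API. A small sketch of building such a query with the Python standard library; the service hostname, metric name, and label values are hypothetical:

```python
from urllib.parse import urlencode

def build_query_url(base: str, promql: str) -> str:
    """Build a Prometheus /api/v1/query URL (the standard HTTP query endpoint)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

# Hypothetical querier endpoint and metric: 5-minute average discharge
# pressure for one site, served from whichever Prometheus holds the data.
url = build_query_url(
    "http://thanos-query.observability.svc:9090",
    'avg_over_time(gas_train_discharge_pressure_bar{site="ras-laffan"}[5m])',
)
```

The point of the global query layer is that dashboards and alert rules do not need to know which site-local Prometheus instance holds a given series.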
Grafana for dashboarding and visualisation. Energy sector dashboards are not standard application dashboards. They must display OT metrics (pressure, temperature, flow rates) alongside IT metrics (API latency, error rates, queue depths) in a unified view that operators can interpret quickly.
Loki for log aggregation. Centralised logging across OT gateway logs, application logs, and infrastructure logs with NCA-compliant retention policies - typically 12 months online and 7 years archived for critical infrastructure systems.
PagerDuty or Opsgenie for incident management, integrated with the operator shift management system. Alert routing must consider both IT on-call rotations and OT operator shift schedules.
Defining SLOs for Operational Technology
SLOs for energy sector systems require a different approach from web application SLOs. The key differences:
Correctness SLOs take priority over availability SLOs. For a production monitoring dashboard, displaying incorrect data is worse than being unavailable. An operator who sees “dashboard unavailable” will fall back to manual readings. An operator who sees incorrect but plausible values may make decisions based on bad data. The primary SLO is data correctness - measured as the percentage of displayed values that are within acceptable tolerance of the source sensor reading.
Latency SLOs have operational meaning. In a web application, a p99 latency of 500ms is a performance target. In an OT monitoring system, data staleness has operational safety implications. If a pressure reading is 30 seconds old, the operator needs to know that. The latency SLO defines the maximum acceptable age of displayed data, and the observability system must visually indicate when data exceeds this threshold.
Availability SLOs are per-component, not per-service. The overall system availability target might be 99.99%, but individual components have different criticality levels. The alarm management subsystem has a stricter availability target than the historical reporting module. SLOs must be defined at the component level with explicit dependency mapping.
Example SLO framework for an energy sector monitoring platform:
| Component | Metric | Target | Measurement |
|---|---|---|---|
| Real-time display | Data freshness | < 5 seconds | % of readings within threshold |
| Real-time display | Data correctness | 99.999% | Compared against source sensors |
| Alarm system | Availability | 99.99% | Uptime of alarm processing pipeline |
| Alarm system | Delivery latency | < 2 seconds | Time from trigger to operator notification |
| Historical reports | Availability | 99.9% | Uptime of query interface |
| Historical reports | Data completeness | 99.95% | % of expected data points present |
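One way to measure the data-correctness SLO above is to periodically sample displayed values and compare them against the corresponding source sensor readings. A hedged sketch; the 0.1% relative tolerance is illustrative, since real tolerances are set per instrument with the operations team:

```python
def correctness_ratio(pairs: list[tuple[float, float]],
                      rel_tol: float = 0.001) -> float:
    """Fraction of (displayed, source) pairs within relative tolerance.

    rel_tol=0.1% is an illustrative default; tolerances should be set
    per instrument with the operations team.
    """
    if not pairs:
        return 1.0
    ok = sum(1 for displayed, source in pairs
             if abs(displayed - source) <= rel_tol * max(abs(source), 1e-9))
    return ok / len(pairs)

# Against the 99.999% target, one bad reading in 100,000 samples
# exhausts the correctness budget for that window.
```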
Incident Response for Critical Infrastructure
Incident response in Qatar’s energy sector follows a different pattern from IT incident response. The key adaptations:
Severity Classification
Standard IT severity levels (SEV1-SEV4) are insufficient. Energy sector incidents must be classified on two dimensions:
Operational impact - does this incident affect the ability of operators to monitor or control physical processes? An incident that degrades the monitoring dashboard is higher severity than an incident that takes down the historical reporting system, regardless of user count.
Safety relevance - could this incident, if unresolved, contribute to a safety event? Any incident with safety relevance automatically escalates to the highest severity tier and triggers notification to the operational safety team, not just the IT team.
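The two-dimensional classification can be expressed as a small decision function. The impact levels and tier names here are illustrative, not a standard taxonomy:

```python
from enum import Enum

class OperationalImpact(Enum):
    NONE = 0        # no effect on monitoring or control
    DEGRADED = 1    # monitoring degraded, control unaffected
    LOST = 2        # operators cannot monitor or control a process

def classify(impact: OperationalImpact, safety_relevant: bool) -> str:
    """Map the two dimensions to a severity tier (tier names illustrative)."""
    if safety_relevant:
        return "SEV1"   # always highest tier; notify the operational safety team
    if impact is OperationalImpact.LOST:
        return "SEV1"
    if impact is OperationalImpact.DEGRADED:
        return "SEV2"
    return "SEV3"
```

Note that safety relevance dominates: a safety-relevant incident is SEV1 even when the operational impact is nominally low.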
Response Procedures
Parallel notification paths. When an incident affects OT-adjacent systems, the IT incident response team and the operations control room must be notified simultaneously. The IT team manages the technical response. The operations team assesses whether manual fallback procedures need to be activated. These are parallel workstreams, not sequential.
Operator bridge. Every major incident involving OT-adjacent systems includes an operator representative on the incident bridge. The operator provides context that the IT team cannot have - whether the affected readings are currently critical to an operational decision, whether a workaround exists, and what the operational deadline for resolution is.
Controlled remediation. Unlike IT incidents where the fastest fix wins, energy sector incident remediation must be controlled. A rushed fix that introduces a different failure mode is worse than a longer outage with a verified remediation. Every remediation action is documented before execution and verified after execution.
Postmortem Adaptations
Blameless postmortems apply, but with additional requirements:
Root cause analysis must address both IT and OT factors. An incident where a database failover caused a 45-second data gap requires analysis of why the failover took that long (IT root cause) and why the monitoring system did not indicate data staleness to operators (OT root cause).
Action items must include both technical and procedural remediation. If the incident revealed that operators were not trained on manual fallback procedures for a specific monitoring subsystem, that training gap is an action item alongside the technical fix.
Regulatory reporting. Certain incidents in critical infrastructure environments must be reported to the Qatar NCA within defined timeframes. The postmortem process must include a regulatory reporting assessment.
Building an SRE Practice for Energy
For Qatar energy companies building an SRE practice, the recommended approach is incremental:
Start with observability. Deploy the three-zone observability architecture and establish baseline visibility across OT and IT systems. You cannot define meaningful SLOs or effective incident response without data.
Define SLOs with operators. SLO workshops must include OT operators, not just IT engineers. Operators know which metrics matter, what latency is acceptable, and where the current monitoring gaps are. Engineering defines the measurement methodology. Operators define the targets.
Build incident response muscle memory. Run tabletop exercises that simulate incidents at the OT/IT boundary. These exercises expose gaps in communication, tooling, and procedures before a real incident does.
Automate carefully. Automation in OT-adjacent systems must be introduced gradually with explicit operator approval. Start with automated alerting and diagnostics. Progress to automated remediation only for well-understood failure modes with verified rollback procedures.
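That gradual progression implies an explicit allowlist: a failure mode becomes eligible for automated remediation only once it is well understood, its rollback is verified, and operators have approved it. A minimal sketch of such a registry (failure-mode names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class RemediationPolicy:
    """Allowlist gating which failure modes may be auto-remediated.

    An entry is added only with explicit operator approval, and the stored
    flag records whether the rollback procedure has been verified.
    """
    approved: dict[str, bool] = field(default_factory=dict)

    def approve(self, failure_mode: str, rollback_verified: bool) -> None:
        self.approved[failure_mode] = rollback_verified

    def may_auto_remediate(self, failure_mode: str) -> bool:
        # Default-deny: anything not explicitly approved stays manual.
        return self.approved.get(failure_mode, False)
```

The default-deny posture is the point: unknown or partially understood failure modes always fall back to the operator-led response procedures above.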
Getting Started
If your organisation is operating software for Qatar’s energy sector and needs an SRE practice designed for OT/IT convergence, book a free 30-minute consultation with our team. We design observability architectures, define operationally meaningful SLOs, and build incident response processes for critical infrastructure environments in Qatar.