Enterprise Telecom BSS/OSS Platform
A telecom BSS/OSS platform processing millions of Event Data Records was failing at a 40% critical error rate. Over a 5+ year engagement, GYSP led the development of its core async microservices, cut critical errors to 2%, and scaled the platform across AWS infrastructure without accumulating the system bottlenecks that kill throughput at telecom load.
The Challenge
Telecom BSS/OSS platforms process every billable event a carrier generates (calls, SMS, and data sessions) in real time, at high throughput, with near-zero tolerance for error. Any event processing failure means potential revenue leakage or billing errors affecting millions of subscribers. The platform's EDRListener service was receiving heavy concurrent streams of Event Data Records but had reached a critical error rate of 40% across a volume of over one million runs: two in every five processed events were failing, creating an unsustainable operational and billing risk at telecom scale.
Alongside the event processing problem, the Django application layer needed to enforce granular, role-based access control across multiple client store tiers, a requirement typical of telecom platforms serving multiple enterprise and MVNO customers from a shared codebase. Heavy operational loads were creating system bottlenecks that the existing infrastructure couldn't absorb: memory-intensive tasks were blocking rather than queueing, degrading throughput precisely when the platform most needed to perform. And over a 5+ year engagement, the cloud infrastructure had to remain consistent and deployable across environments without accumulating drift.
Our Solution
GYSP led the complete development lifecycle of the EDRListener microservice, built on Python 3, Flask, and Asyncio to process concurrent EDR streams without blocking. Asyncio's event loop let the service handle large numbers of simultaneous events cooperatively, eliminating the sequential processing bottleneck that had contributed to the platform's failure rate.
The 40% critical error rate was addressed through systematic analysis: GYSP examined over one million failed runs in Kibana and AWS OpenSearch to identify the specific error patterns driving failures, then deployed targeted monitoring hotfixes that reduced the critical error rate from 40% to 2%.
AWS CloudFormation templates were used to safely mirror environments across teams, maintaining deployment consistency across the full 5+ year platform lifecycle. AWS Lambda was integrated for event-driven data routines, adding serverless orchestration to processing paths that didn't require persistent infrastructure.
Core Django application features were designed around the Django permissions framework, which provided granular, multi-tier access control across client store levels and supported the platform's multi-tenant architecture without compromising data isolation between tenants.
RabbitMQ and Redis were integrated to eliminate system bottlenecks under heavy load: memory-intensive operations were moved onto queues rather than executing inline, freeing the main processing paths. Custom Django management commands were engineered for batch processing operations, providing efficient, scriptable bulk data handling outside the request-response cycle. Pytest-driven testing discipline was maintained throughout.
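As a rough illustration of the multi-tier, tenant-isolated access control described above, the sketch below checks a permission against both a tenant and a store tier. It is framework-free on purpose and is not the platform's code: the `User` class, the `has_perm` helper, the tenant and tier names, and the permission codenames are all hypothetical. In a Django application, a check like this would normally be routed through the framework's own `User.has_perm` method and permission backends.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    # Hypothetical structure: (tenant, store tier) -> granted permission codenames
    grants: dict = field(default_factory=dict)


def has_perm(user: User, tenant: str, tier: str, perm: str) -> bool:
    """Return True only if the permission was granted for this exact tenant and tier."""
    return perm in user.grants.get((tenant, tier), set())


# Illustrative data: an operator with rights in one tenant's regional tier only.
operator = User(
    name="ops",
    grants={("mvno_a", "regional"): {"view_invoice", "export_usage"}},
)

print(has_perm(operator, "mvno_a", "regional", "view_invoice"))  # same tenant and tier
print(has_perm(operator, "mvno_b", "regional", "view_invoice"))  # different tenant
```

Keying grants on the tenant as well as the tier means a permission held in one tenant never leaks into another, which is the isolation property the paragraph above calls out.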
Key Deliverables
- Led the complete development lifecycle of the EDRListener microservice — Python 3, Flask, and Asyncio processing concurrent Event Data Record streams at telecom throughput
- Reduced critical errors from 40% to 2% across 1M+ failed runs via systematic Kibana and AWS OpenSearch analysis and targeted monitoring hotfixes
- Used AWS CloudFormation templates to safely mirror environments, maintaining deployment consistency across a 5+ year, multi-team engagement
- Integrated AWS Lambda for event-driven data routines, extending the platform with serverless orchestration for appropriate processing paths
- Designed Django core features around the permissions framework for granular, role-based access control across multiple client store tiers
- Integrated RabbitMQ and Redis to eliminate bottlenecks under heavy load by queuing memory-intensive tasks rather than blocking main processing paths
- Engineered custom Django management commands for batch processing operations outside the request-response cycle
Services Delivered
- High-Throughput Microservices
- Cloud Infrastructure & Serverless
- Log Monitoring & Observability
- Enterprise Web Development
Frequently Asked Questions
What are BSS/OSS systems and why do they require high-throughput event processing?
BSS (Business Support Systems) and OSS (Operations Support Systems) are the platforms that underpin telecom operations — BSS handles customer-facing processes like billing, order management, and revenue assurance; OSS manages the network itself. Because every billable event a carrier generates (calls, data sessions, SMS) must be captured, processed, and rated by the BSS in near real-time, any processing failure creates direct revenue leakage or billing errors affecting subscribers at scale.
How was the 40% critical error rate diagnosed and reduced to 2%?
GYSP systematically analysed over one million failed runs through Kibana dashboards backed by AWS OpenSearch, identifying the specific error patterns and failure modes driving the 40% critical error rate. Rather than making broad architectural changes, the team kept the fix precise: targeted monitoring hotfixes addressing the specific failure paths identified in the data. The result was a drop from 40% to 2%, a 95% relative reduction in critical errors, achieved without disrupting the live platform.
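The kind of pattern analysis described above can be pictured with a small sketch. It assumes failed-run records have already been exported as dictionaries. The field names `status` and `error_type`, the sample values, and both helper functions are hypothetical and are not taken from the platform's logs or from any Kibana or OpenSearch API. The second helper simply reproduces the arithmetic behind the 95% figure.

```python
from collections import Counter


def top_error_patterns(records: list[dict], limit: int = 3) -> list[tuple[str, int]]:
    """Count failed runs per error type and return the most frequent ones."""
    counts = Counter(r["error_type"] for r in records if r["status"] == "failed")
    return counts.most_common(limit)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate fell, relative to its starting value."""
    return (before - after) / before


# Illustrative records only; real data would come from the log store.
sample_runs = [
    {"status": "failed", "error_type": "timeout"},
    {"status": "failed", "error_type": "timeout"},
    {"status": "failed", "error_type": "malformed_record"},
    {"status": "ok", "error_type": ""},
]

print(top_error_patterns(sample_runs, limit=2))
print(round(relative_reduction(0.40, 0.02), 2))  # 0.95, i.e. a 95% relative reduction
```

Ranking failures by frequency is what makes a targeted fix possible: the few patterns at the top of the list account for most of the error rate.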
What is the EDRListener microservice and how does Asyncio enable concurrent stream processing?
The EDRListener is a purpose-built microservice that processes Event Data Records, the individual records generated for every billable telecom event, as they arrive in concurrent streams from network elements. Python's Asyncio library provides an event loop that handles many simultaneous connections cooperatively, allowing the service to process a large number of concurrent EDR streams without spawning a thread per connection. This makes it practical to handle telecom-scale event volumes efficiently within a single process.
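A minimal sketch of the cooperative model described above, using only the standard-library `asyncio` module. It is not the EDRListener's implementation: the function names, the record shape, and the `asyncio.sleep` call standing in for real non-blocking I/O are all assumptions made for illustration.

```python
import asyncio


async def handle_edr(record: dict) -> str:
    """Process one record; the sleep stands in for a non-blocking I/O call."""
    await asyncio.sleep(0.01)
    return f"{record['stream']}:{record['seq']}"


async def consume_stream(stream_id: str, count: int) -> list[str]:
    """Handle the records of a single stream in arrival order."""
    results = []
    for seq in range(count):
        results.append(await handle_edr({"stream": stream_id, "seq": seq}))
    return results


async def process_streams(stream_ids: list[str], per_stream: int) -> dict[str, list[str]]:
    """Run one consumer per stream concurrently on a single event loop."""
    batches = await asyncio.gather(
        *(consume_stream(stream_id, per_stream) for stream_id in stream_ids)
    )
    return dict(zip(stream_ids, batches))


if __name__ == "__main__":
    print(asyncio.run(process_streams(["cell-a", "cell-b"], per_stream=3)))
```

While one consumer is waiting on I/O, the event loop switches to another, so the streams overlap in time without a thread per stream. Order is still preserved within each stream because each consumer awaits its records one at a time.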
How were RabbitMQ and Redis used to eliminate system bottlenecks?
The bottlenecks were caused by memory-intensive operations executing inline within the main processing paths — blocking throughput when load was high. GYSP introduced RabbitMQ as a message broker and Redis for in-memory data caching, moving those memory-intensive tasks onto queues that process asynchronously. This decoupled the heavy operations from the critical processing paths, allowing the platform to absorb load spikes without degrading the throughput of the main event processing pipeline.
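The decoupling described above can be sketched with standard-library stand-ins so the example runs anywhere: `queue.Queue` plays the role of the message broker, a worker thread plays the role of a queue consumer, and a plain dictionary stands in for a Redis cache. None of this is the platform's code or the RabbitMQ or Redis client API, and every function name here is hypothetical.

```python
import queue
import threading


def heavy_task(n: int) -> int:
    """Stand-in for a memory-intensive operation."""
    return sum(i * i for i in range(n))


def worker(tasks: queue.Queue, cache: dict) -> None:
    """Consumer loop: takes work off the queue so the main path never runs it."""
    while True:
        n = tasks.get()
        try:
            if n is None:  # sentinel value tells the worker to stop
                return
            cache[n] = heavy_task(n)  # dict stands in for a Redis cache
        finally:
            tasks.task_done()


def handle_event(tasks: queue.Queue, n: int) -> str:
    """Main processing path: enqueue the heavy work and return at once."""
    tasks.put(n)
    return "accepted"


def run_demo(values: list[int]) -> tuple[list[str], dict[int, int]]:
    tasks: queue.Queue = queue.Queue()
    cache: dict[int, int] = {}
    consumer = threading.Thread(target=worker, args=(tasks, cache))
    consumer.start()
    acks = [handle_event(tasks, v) for v in values]
    tasks.put(None)  # request shutdown once the queued work is done
    consumer.join()  # wait for the worker to drain the queue
    return acks, cache


if __name__ == "__main__":
    print(run_demo([3, 4]))
```

The point of the pattern is that `handle_event` returns as soon as the task is queued, so a burst of heavy work lengthens the queue instead of stalling the main event processing path.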
More Telecom Case Studies
5G Telecom Giant, Thailand
A national 5G telecom leader in Thailand needed to migrate an entire on-premises platform to Oracle Cloud Infrastructure — across four heterogeneous data systems, with zero production downtime, and a mandate to build a reusable blueprint for 11 more sites to follow.
British Telecom
British Telecom needed to retire a legacy Mainframe platform without losing decades of embedded business logic. As ODI Lead, GYSP re-architected the system on Oracle Data Integrator 12c — work recognised internally with British Telecom's "Innovator of the Month" award.
