Senior VoIP Operations & Reliability Engineer (Carrier-Class Voice Platform)
Job Description
Why This Role Matters:
Voice is unforgiving. A dropped call, a one-way audio path, or a registration storm is visible to every customer at once. Our team can build the platform, but building it and running it are two different disciplines, and ambition without operational hardening is fragile. We need someone who has lived through real VoIP failures, learned from them, and can stand shoulder to shoulder with the developers to make sure the platform survives contact with production. If you want ownership of a modern, open-source, carrier-class platform from design review through 3am incident to the postmortem that makes it stronger, we want to talk to you.
About the Role:
Our software team is building a next-generation carrier-class voice platform. They are strong programmers, but they are not experienced operators, and there is a world of difference between code that works and infrastructure that stays up under real carrier load. We need a seasoned operator to close that gap and work hand in hand with the development team.
You are the person who has actually run this kind of system in production. You know the failure modes that do not show up in a code review, the things that break at 2am, and what it really takes to keep customers from ever noticing. Your job is to bring that operational reality into the platform from the inside: pairing with the programmers as they build, making sure the design can be operated, and then owning the platform in production with zero customer downtime.
There is an architectural side to this. You will sit in design reviews and push the team toward decisions that are operable, resilient, and testable, not just elegant in code. But the core of the role is operational: you are the experienced hand who keeps every system up, who owns every failure scenario end to end, and who instills operational discipline in the team by taking on-call shifts and training junior engineers to handle incidents.
You should be equally comfortable pairing with a developer to make a service observable and failure-aware, and at 3am driving an incident to resolution. We need that judgment, with years of real VoIP operations behind it.
In the meantime, this is not a future-only role. We already run a live Kamailio and Asterisk production system carrying real customer traffic today, and your first and most immediate mandate is to help harden it: shore up its reliability, close its failure gaps, and keep it solid while the next-generation platform is being built. Day-to-day production stability of the current system comes first.
What You Will Do:
Harden the current production system (immediate priority)
Take ownership of the reliability of our live Kamailio and Asterisk production system from day one, while the next-generation platform is still in development.
Assess the current system end to end and find its weak points: single points of failure, brittle failover, missing redundancy, capacity headroom, and the failure scenarios it does not yet handle gracefully.
Close those gaps incrementally and safely, without disrupting live customer traffic: add redundancy and failover, tighten configuration, and remove fragility.
Add the observability the current system is missing so problems are caught before customers feel them, and stand up alerting, dashboards, and SIP capture against the live fleet.
Stabilize day-to-day operations: triage and resolve recurring issues, document the system as it actually runs, and write the runbooks that do not exist yet.
Work hand in hand with the development team
Pair with the programmers throughout development as the operational voice in the room: review designs, challenge assumptions, and find the failure modes that code reviews miss.
Make operability a build-time requirement, not an afterthought: push for the logging, metrics, health checks, graceful shutdown, retry behavior, and failure handling that the team needs to add for the platform to survive production.
Transfer operational knowledge to the team: help developers understand how their code behaves under load and failure, and raise the whole group's instinct for production reality.
Map the full failure surface of the platform (node failure, data-center loss, upstream carrier outage, registration storms, partial network partitions, resource exhaustion) and make sure every scenario has a defined, tested behavior.
Design and run a rigorous test program: functional, load, stress, soak, and failover testing, with realistic call models (concurrent calls, BHCA, registration churn).
Build fault-injection and chaos testing into the pipeline so failure handling is proven, not assumed.
Validate the high-availability and scalability design under real conditions: active-active and active-passive topologies, geographic redundancy, graceful degradation, automated failover with measured recovery times, and capacity limits.
Keep it up (day-to-day reliability engineering)
Own platform uptime as a daily responsibility, not a quarterly goal. Customers should experience no downtime.
Build and own the observability stack: SIP capture (HEP/Homer), CDR and quality pipelines, metrics, dashboards, and alerting that catches problems before customers do.
Define SLOs and SLIs for signaling, media, and registration, and hold the platform to them.
Run incident response: detect, triage, mitigate, and resolve, then drive blameless postmortems and make sure the same failure cannot recur.
Write and maintain runbooks, and lead disaster-recovery and failover drills so the team can execute under pressure.
Participate in (and help design) a sustainable on-call rotation.
Tune and operate the production fleet: Asterisk, Kamailio, OpenSIPS, and the supporting network layer, under live carrier traffic.
What We Need You to Bring:
Core expertise (required)
Years of senior, hands-on experience operating and reliability-engineering production VoIP systems at carrier scale.
Deep, protocol-level command of SIP: dialogs, transactions, registration, NAT scenarios, SDP negotiation, forking, and the failure modes that surface only under load.
Expert-level Kamailio and/or OpenSIPS: routing logic, dispatcher and load balancing, registrar and usrloc, dialog and topology modules.
Expert-level Asterisk: PJSIP stack, dialplan, ARI/AMI, bridging and media handling, and its role as an application and media server behind a SIP proxy.
Media plane fluency: RTP, SRTP, RTSP, RTCP, codecs (G.711, G.729, Opus), transcoding, jitter, and the link between QoS marking (DSCP) and call quality.
A demonstrated track record of designing for and operating reliability, scalability, and fault tolerance in carrier-class environments (five-nines thinking, failure-domain isolation, blast-radius control).
Hands-on reliability engineering practice: SLOs and error budgets, incident command, postmortems, runbooks, and DR testing.
Strongly preferred
Performance and failure testing tooling: sipp for load and call modeling, fault injection and chaos tooling, and SIP troubleshooting with sngrep and Wireshark.
Observability depth with Homer/HEP, plus metrics and alerting stacks (for example Prometheus, Grafana, or equivalent).
Strong Linux operations and automation skills (Python, Lua, shell), and comfort with infrastructure-as-code and CI/CD pipelines.
RADIUS/Diameter integration for AAA, and experience with provisioning and subscriber management.
Fraud and security operations: detecting and stopping toll fraud, SIP scanning, and registration attacks.
Experience interconnecting with multiple upstream carriers and managing the routing and failover complexity that brings.
FreeSWITCH or other media servers as a complement to Asterisk.
How you work
You assume things will fail, and you design and test so that failure is contained and invisible to customers.
You measure before you optimize, and you instrument systems so failures are visible early.
You are calm and decisive in an incident, and rigorous afterward about making sure it never repeats.
You can challenge a design respectfully and precisely, and you write down the trade-offs so the team can reason about them later.
You work well alongside developers: you can teach operational thinking without condescension, and you would rather make the team better at running their own code than be the only one who can.
All applicants are considered for all positions without regard to race, religion, color, sex, gender, sexual orientation, pregnancy, age, national origin, ancestry, physical/mental disability, medical condition, military/veteran status, genetic information, marital status, ethnicity, citizenship or immigration status, or any other protected classification, in accordance with applicable federal, state, and local laws. By completing this application, you are seeking to join a team of hardworking professionals dedicated to consistently delivering outstanding service to our customers and contributing to the financial success of the organization, its clients, and its employees. Equal access to programs, services, and employment is available to all qualified persons. Those applicants requiring an accommodation to complete the application and/or interview process should contact a management representative.