AI Infrastructure Architect
Oracle Financial Services Software Ltd
Bengaluru/Bangalore
Not disclosed
Job Details
Job Description
What you will do (Key responsibilities)
1) Architect and deliver customer AI infrastructure (end-to-end)
- Lead architecture and implementation for secure, scalable AI/ML/LLM platforms based on customer requirements and constraints.
- Produce implementation-ready artifacts: HLD/LLD, reference architectures, network/topology diagrams, deployment plans, runbooks, and operational handover packs.
- Translate business and technical requirements into a scalable target state, and guide delivery teams through build, rollout, and production readiness.
2) Solve real enterprise constraints (network + access + topology)
- Design enterprise network topologies with segmentation/isolation: private subnets, route tables, security policies, egress control, private endpoints, controlled ingress patterns.
- Work within common enterprise constraints:
- Fixed network address plans (pre-approved CIDR ranges), IP allowlists/deny-lists, and limited routing flexibility
- Private connectivity requirements (VPN/Direct Connect/FastConnect/ExpressRoute), no public endpoints, and restricted DNS resolution
- Controlled administrative access (bastion/jump host, privileged access management, session recording, time-bound access)
- Restricted egress (proxy-only outbound, firewall-controlled destinations, egress allowlists, DNS filtering, no direct internet)
- Ensure secure data movement and integration patterns for AI workloads (east-west and north-south traffic).
- Customer-managed encryption and key custody (KMS/HSM, BYOK/HYOK, key rotation, certificate lifecycle)
- Strict TLS policies (mTLS, approved ciphers, enterprise PKI, certificate pinning where required)
- Identity and access controls (SSO/SAML/OIDC, RBAC/ABAC, least privilege, break-glass accounts)
- Data governance constraints (PII/PHI handling, residency/sovereignty, retention, audit evidence requirements)
- Secure software supply chain (approved base images, artifact signing, SBOM, vulnerability scanning, patch SLAs)
- Endpoint controls (EDR agents, OS hardening standards, restricted packages, golden images)
- Change management gates (CAB approvals, limited maintenance windows, separation of duties)
- Observability restrictions (logs can’t leave tenant, redaction/masking, approved collectors/forwarders only)
- Multi-tenant isolation and policy boundaries (namespace isolation, network policies, runtime sandboxing)
- High availability & DR expectations (multi-zone patterns, backup/restore, failover runbooks, RTO/RPO)
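As an illustration of the egress-allowlist and pre-approved CIDR constraints listed above, the following is a minimal Python sketch, not a prescribed implementation. The ranges in `EGRESS_ALLOWLIST` and the function name `is_egress_allowed` are hypothetical; in practice such checks are usually enforced by firewalls, proxies, or network policies rather than application code.

```python
import ipaddress

# Hypothetical pre-approved egress ranges (assumption for illustration only);
# real values would come from the customer's fixed network address plan.
EGRESS_ALLOWLIST = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def is_egress_allowed(destination_ip: str) -> bool:
    """Return True only if the destination falls inside an allowlisted CIDR range."""
    addr = ipaddress.ip_address(destination_ip)
    return any(addr in network for network in EGRESS_ALLOWLIST)
```

A default-deny check like this treats any destination outside the listed ranges as blocked, which mirrors the "no direct internet" posture described in the constraints.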
3) Security-by-design, InfoSec approvals, and guardrails for AI platforms
- Lead InfoSec engagement: threat modeling, control mapping, evidence collection, remediation plans, and security signoffs for AI infrastructure.
- Implement security controls and platform guardrails:
- TLS/SSL-only communication patterns; encryption-in-transit and encryption-at-rest
- API security: OAuth2/JWT/mTLS, gateway policies, request signing patterns where required
- Secrets management using vault/key management services, rotation and lifecycle controls
- IAM and least-privilege access models; tenant/project isolation
- VM hardening (CIS-aligned baselines), patching strategy, secure images
- “Kill switches” / emergency stop mechanisms for agents (tool-disable, egress cut-off, policy stop, rollback runbooks)
- AI infra guardrails: controlled tool execution, outbound allowlists, boundary policies, audit-ready logging
4) LLM hosting, GPU infrastructure, and scale
- Architect LLM hosting patterns: managed endpoints, self-hosted inference, multi-model routing, and workload isolation.
- Design and operationalize GPU-based inference at scale:
- Capacity planning, GPU node pools, scaling policies, cost/performance optimization
- Performance profiling and reliability patterns for inference services
- Build container/Kubernetes-based AI platforms (OKE/EKS/AKS/GKE as applicable):
- Secure cluster designs, namespaces/tenancy, node isolation, secrets, and safe rollout strategies
- Support AI frameworks and application runtimes on Kubernetes for scale and portability
5) Observability, reliability engineering, and operational readiness
- Define and implement observability across AI systems:
- Metrics, logs, traces, audit trails, and network call tracing
- Integration with enterprise observability tools (customer standard platforms)
- Define SLIs/SLOs for AI services:
- Latency, throughput, error rates, saturation, GPU utilization, queue depth, retry behavior
- Execute load testing and capacity validation for inference endpoints, vector stores, agent runtimes, and integration services.
- Build reliable ops workflows: incident response, runbooks, dashboards, alerting, and proactive health checks.
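To make the SLI/SLO responsibility above concrete, here is a small Python sketch that computes two of the listed indicators (p95 latency and error rate) from request samples and compares them with targets. The `RequestSample` type, the function names, and the threshold parameters are assumptions for illustration; a real deployment would typically derive these from the customer's observability platform.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestSample:
    latency_ms: float  # end-to-end latency of one inference request
    ok: bool           # False if the request failed


def p95_latency(samples: Sequence[RequestSample]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(s.latency_ms for s in samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def error_rate(samples: Sequence[RequestSample]) -> float:
    """Fraction of requests that failed."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(1 for s in samples if not s.ok) / len(samples)


def meets_slo(samples: Sequence[RequestSample],
              max_p95_ms: float,
              max_error_rate: float) -> bool:
    """True when both the latency and error-rate SLIs are within target."""
    return (p95_latency(samples) <= max_p95_ms
            and error_rate(samples) <= max_error_rate)
```

For example, `meets_slo(samples, max_p95_ms=500, max_error_rate=0.01)` would evaluate one measurement window against a hypothetical 500 ms / 1% objective.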
6) Disaster recovery and resilience for AI platforms
- Design DR strategies for AI solutions:
- Multi-AD / multi-region patterns, backup/restore for critical stores, IaC-based rebuilds
- Failover runbooks, RTO/RPO alignment, and validation exercises
- Ensure production-grade resilience and safe rollback for platform and application layers.
7) Red teaming and risk mitigation for AI infrastructure
- Drive security validation for AI infrastructure and agent deployments:
- Attack surface review, secrets leakage paths, egress abuse scenarios
- Prompt/tool misuse impact assessment at infrastructure level
- Implement mitigations and hardening measures with measurable controls.
8) Consulting leadership and stakeholder management
- Act as a trusted technical advisor to customer platform, network, and security teams.
- Communicate clearly with diverse stakeholders (CIO/CTO, Security, Infra, App teams) and drive decisions under ambiguity.
- Mentor engineers/architects, conduct design reviews, and build reusable delivery accelerators and blueprints.
Job role
Work location: Bengaluru, Karnataka, India
Department: Project & Program Management
Role / Category: Technology / IT Project Management
Employment type: Full Time
Shift: Day Shift
Job requirements
Experience: Min. 5 years
About company
Name: Oracle Financial Services Software Ltd
Job posted by Oracle Financial Services Software Ltd