AI Infrastructure Architect

Oracle Financial Services Software Ltd

Bengaluru/Bangalore

Not disclosed

Work from Office

Full Time

Min. 5 years

Job Details

Job Description

What you will do (Key responsibilities)

1) Architect and deliver customer AI infrastructure (end-to-end)

  • Lead architecture and implementation for secure, scalable AI/ML/LLM platforms based on customer requirements and constraints.
  • Produce implementation-ready artifacts: HLD/LLD, reference architectures, network/topology diagrams, deployment plans, runbooks, and operational handover packs.
  • Translate business and technical requirements into a scalable target state, and guide delivery teams through build, rollout, and production readiness.

2) Solve real enterprise constraints (network + access + topology)

  • Design enterprise network topologies with segmentation/isolation: private subnets, route tables, security policies, egress control, private endpoints, controlled ingress patterns.
  • Work within common enterprise constraints:
    • Fixed network address plans (pre-approved CIDR ranges), IP allowlists/deny-lists, and limited routing flexibility
    • Private connectivity requirements (VPN/Direct Connect/FastConnect/ExpressRoute), no public endpoints, and restricted DNS resolution
    • Controlled administrative access (bastion/jump host, privileged access management, session recording, time-bound access)
    • Restricted egress (proxy-only outbound, firewall-controlled destinations, egress allowlists, DNS filtering, no direct internet)
    • Customer-managed encryption and key custody (KMS/HSM, BYOK/HYOK, key rotation, certificate lifecycle)
    • Strict TLS policies (mTLS, approved ciphers, enterprise PKI, certificate pinning where required)
    • Identity and access controls (SSO/SAML/OIDC, RBAC/ABAC, least privilege, break-glass accounts)
    • Data governance constraints (PII/PHI handling, residency/sovereignty, retention, audit evidence requirements)
    • Secure software supply chain (approved base images, artifact signing, SBOM, vulnerability scanning, patch SLAs)
    • Endpoint controls (EDR agents, OS hardening standards, restricted packages, golden images)
    • Change management gates (CAB approvals, limited maintenance windows, separation of duties)
    • Observability restrictions (logs can’t leave tenant, redaction/masking, approved collectors/forwarders only)
    • Multi-tenant isolation and policy boundaries (namespace isolation, network policies, runtime sandboxing)
    • High availability & DR expectations (multi-zone patterns, backup/restore, failover runbooks, RTO/RPO)
  • Ensure secure data movement and integration patterns for AI workloads (east-west and north-south traffic).

3) Security-by-design, InfoSec approvals, and guardrails for AI platforms

  • Lead InfoSec engagement: threat modeling, control mapping, evidence collection, remediation plans, and security sign-offs for AI infrastructure.
  • Implement security controls and platform guardrails:
    • TLS/SSL-only communication patterns; encryption-in-transit and encryption-at-rest
    • API security: OAuth2/JWT/mTLS, gateway policies, request signing patterns where required
    • Secrets management using vault/key management services, rotation and lifecycle controls
    • IAM and least-privilege access models; tenant/project isolation
    • VM hardening (CIS-aligned baselines), patching strategy, secure images
    • “Kill switches” / emergency stop mechanisms for agents (tool-disable, egress cut-off, policy stop, rollback runbooks)
    • AI infra guardrails: controlled tool execution, outbound allowlists, boundary policies, audit-ready logging

4) LLM hosting, GPU infrastructure, and scale

  • Architect LLM hosting patterns: managed endpoints, self-hosted inference, multi-model routing, and workload isolation.
  • Design and operationalize GPU-based inference at scale:
    • Capacity planning, GPU node pools, scaling policies, cost/performance optimization
    • Performance profiling and reliability patterns for inference services
  • Build container/Kubernetes-based AI platforms (OKE/EKS/AKS/GKE as applicable):
    • Secure cluster designs, namespaces/tenancy, node isolation, secrets, and safe rollout strategies
    • Support AI frameworks and application runtimes on Kubernetes for scale and portability

5) Observability, reliability engineering, and operational readiness

  • Define and implement observability across AI systems:
    • Metrics, logs, traces, audit trails, and network call tracing
    • Integration with enterprise observability tools (customer standard platforms)
  • Define SLIs/SLOs for AI services:
    • Latency, throughput, error rates, saturation, GPU utilization, queue depth, retry behavior
  • Execute load testing and capacity validation for inference endpoints, vector stores, agent runtimes, and integration services.
  • Build reliable ops workflows: incident response, runbooks, dashboards, alerting, and proactive health checks.

6) Disaster recovery and resilience for AI platforms

  • Design DR strategies for AI solutions:
    • Multi-AD (availability domain) / multi-region patterns, backup/restore for critical stores, IaC-based rebuilds
    • Failover runbooks, RTO/RPO alignment, and validation exercises
  • Ensure production-grade resilience and safe rollback for platform and application layers.

7) Red teaming and risk mitigation for AI infrastructure

  • Drive security validation for AI infrastructure and agent deployments:
    • Attack surface review, secrets leakage paths, egress abuse scenarios
    • Prompt/tool misuse impact assessment at infrastructure level
  • Implement mitigations and hardening measures with measurable controls.

8) Consulting leadership and stakeholder management

  • Act as a trusted technical advisor to customer platform, network, and security teams.
  • Communicate clearly with diverse stakeholders (CIO/CTO, Security, Infra, App teams) and drive decisions under ambiguity.
  • Mentor engineers/architects, conduct design reviews, and build reusable delivery accelerators and blueprints.

Job role

Work location

Bengaluru, Karnataka, India

Department

Project & Program Management

Role / Category

Technology / IT Project Management

Employment type

Full Time

Shift

Day Shift

Job requirements

Experience

Min. 5 years

About company

Name

Oracle Financial Services Software Ltd

Job posted by Oracle Financial Services Software Ltd