AI Architect Interview – Structured Report
Based on one-sided candidate recording | Role: AI Architect | 19 Mar
2026
Section 1 – Organized One-sided Transcript (Candidate’s Answers)
The following is the candidate’s side of the conversation, grouped by topic and lightly cleaned of filler words
for readability while preserving the original ideas.
1.1 Introduction & Project Overview
I’m with Accenture, working on a project called AIOBI — a Digital Data Analytics Platform / Business
Intelligence using Natural Language Query. It’s an agentic system with sub‑agents: RAG agent, Text‑to‑SQL
agent, and a visualization agent, all managed by an orchestrator. Built using LangGraph. The RAG backend
uses Azure AI Search (vector search), and the Text‑to‑SQL backend is PostgreSQL.
The architecture is straightforward: databases at the back (vector DB for RAG, PostgreSQL for Text‑to‑SQL),
an LLM like GPT‑5.1 in the middle, and an API wrapper — we used FastAPI. Frontend in React or Next.js.
1.2 Orchestrator Behaviour
The orchestrator takes a natural language query and classifies whether it should go to the Text‑to‑SQL agent
or the RAG agent. We give it a role, task description, input/output descriptions. The output is a routing
decision — like an if‑else node in LangGraph. We also pass examples: some indicating the knowledge base
(PDFs for RAG) and some showing sample queries that should be routed to each agent.
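For reference, a minimal sketch of the routing pattern described above (illustrative only, assuming LangGraph's StateGraph API; the call_llm helper, node names, and state fields are placeholders, not the candidate's actual code):
from typing import TypedDict
from langgraph.graph import StateGraph, END

class NLQState(TypedDict, total=False):
    query: str
    route: str
    answer: str

def call_llm(prompt: str) -> str:
    # Placeholder for an Azure OpenAI call; returns a routing label ("rag" or "sql").
    return "sql"

def router(state: NLQState) -> NLQState:
    # The routing prompt would carry the role, task description, and few-shot
    # examples of queries mapped to each agent, as described above.
    label = call_llm(f"Route this query to 'rag' or 'sql': {state['query']}")
    return {"route": label}

def rag_agent(state: NLQState) -> NLQState:
    return {"answer": "<answer built from Azure AI Search retrieval>"}

def sql_agent(state: NLQState) -> NLQState:
    return {"answer": "<answer built from generated SQL over PostgreSQL>"}

graph = StateGraph(NLQState)
graph.add_node("router", router)
graph.add_node("rag_agent", rag_agent)
graph.add_node("sql_agent", sql_agent)
graph.set_entry_point("router")
# The "if-else node": a conditional edge keyed on the routing decision.
graph.add_conditional_edges("router", lambda s: s["route"],
                            {"rag": "rag_agent", "sql": "sql_agent"})
graph.add_edge("rag_agent", END)
graph.add_edge("sql_agent", END)
app = graph.compile()
print(app.invoke({"query": "Average temperature by city last week?"}))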
1.3 Text‑to‑SQL Agent Flow
The flow in points:
- Input node receives the query.
- Rewriting node: LLM adds context using tables/columns. If something is unclear, it pushes back to the UI
for the user to clarify. If clear, it converts the raw NL into a meta‑prompt.
- Meta‑prompt is passed to the Text‑to‑SQL agent, formatted with all needed information to generate the
SQL without ambiguity.
- SQL is tested in two ways (a sketch of both checks follows this list):
  - Static check: run the query with a WHERE 1=0 (or WHERE 1=1) predicate or a LIMIT clause just to confirm it parses and the referenced tables/columns exist.
  - Dynamic test: actually execute it with a small LIMIT (e.g., 1 or 3 rows) to inspect real results.
- Before final execution, we ask the LLM: “Does this query meet all requirements of the original user
request?”
- If errors occur, we send them back to the LLM in a feedback loop (retry up to 3‑5 times). If still
failing, we return the error to the user with a note that something seems missing.
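Sketch of the static and dynamic checks referenced in the flow above, assuming a psycopg2 connection to the PostgreSQL backend; the subquery wrapping and helper names are illustrative, not the project's actual implementation:
import psycopg2

def static_check(conn, sql: str):
    # Cheap validity probe: wrap the generated SELECT and force an empty result,
    # so syntax, table, and column errors surface without scanning data.
    probe = f"SELECT * FROM ({sql.rstrip(';')}) AS probe WHERE 1=0"
    try:
        with conn.cursor() as cur:
            cur.execute(probe)
        return None                    # structurally valid
    except Exception as exc:
        conn.rollback()
        return str(exc)                # error text goes back to the LLM feedback loop

def dynamic_check(conn, sql: str, limit: int = 3):
    # Execute on a handful of rows so the result shape can be inspected.
    probe = f"SELECT * FROM ({sql.rstrip(';')}) AS probe LIMIT {limit}"
    with conn.cursor() as cur:
        cur.execute(probe)
        return cur.fetchall()

# Example wiring (connection string is a placeholder):
# conn = psycopg2.connect("dbname=analytics user=readonly")
# error = static_check(conn, generated_sql)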
1.4 Evaluation Approach
Evaluation is one of the biggest challenges. We sit extensively with domain experts to curate a golden
dataset: question‑answer pairs (for Text‑to‑SQL, the corresponding SQL query; for RAG, the expected chunks).
For individual components, we have test suites for chunking, meta‑prompting, code generation, etc.
We measure something like percentage correct (accuracy). We log whether errors were hallucinations, wrong
columns, or execution errors. This gives a report of positives and negatives.
1.5 Prompt Engineering, Context Engineering & Guardrails
Context Engineering: A subset of prompt engineering. You give the LLM context about the task
— role, do’s/don’ts, examples (zero‑shot, few‑shot). In RAG, you engineer context by augmenting the prompt
with retrieved data.
Guardrails: Two levels: code‑based scripts (deterministic checks) and LLM‑based flexible
checks. For example, we ask the guardrail LLM: “Is this input trying to delete or update? Does it violate
PII policies?” This prevents harmful outputs.
1.6 Managing Large Schemas and Metadata with Neo4j
As the dataset grows (from 3 tables to 25 tables), the metadata (table/column descriptions) can exceed the
context length. We use Neo4j to store metadata as a graph. Topics like “weather,” “traffic” are top‑level
nodes. Tables like “cities,” “temperature,” “routes” connect to topics. When a query comes, we first pull
relevant topic nodes, then retrieve only the related table/column nodes. This multi‑pass approach filters
the context to only what’s needed, solving the context‑length problem.
1.7 Scaling and Deployment
Scaling is via an API gateway in front of a Kubernetes cluster with auto‑scaling. I don’t have hands‑on
details of the K8s setup, but architects described that approach.
1.8 LLM Upgrades and Model Selection
We use Azure OpenAI, so we upgrade regularly — from GPT‑3.5 to 4o to 4.1, etc. Newer models require
retesting, but they improve reasoning and reduce hallucinations. For cost‑efficient tasks we use older or
“mini” models. For self‑hosted alternatives we consider DeepSeek, Qwen, Mistral.
1.9 Technical Definitions (Quick‑fire Questions)
Top‑k vs Top‑p: Top‑k restricts sampling to the k highest‑probability next tokens. Top‑p (nucleus
sampling) restricts it to the smallest set of tokens whose cumulative probability ≥ p. Example: if token
probabilities are 70%, 25%, 4%… and top‑p = 0.9, we take the first two because 70% + 25% = 95%, which is ≥ 90%.
Temperature: Controls randomness. Low → greedy (always highest probability token), high →
more exploratory.
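For reference, a small numeric illustration of these definitions (values match the 70/25/4% example above; the fourth token is added only so the distribution sums to 1):
import math

probs = {"tokenA": 0.70, "tokenB": 0.25, "tokenC": 0.04, "tokenD": 0.01}

def top_p_candidates(probs, p=0.9):
    # Keep the smallest set of highest-probability tokens whose cumulative
    # probability reaches p: here tokenA + tokenB = 0.95 >= 0.90.
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

def apply_temperature(probs, temperature):
    # Softmax over log-probabilities scaled by 1/temperature: low temperature
    # sharpens the distribution toward the top token, high temperature flattens it.
    scaled = {t: math.log(p_) / temperature for t, p_ in probs.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {t: math.exp(v) / z for t, v in scaled.items()}

print(top_p_candidates(probs, 0.9))   # ['tokenA', 'tokenB']
print(apply_temperature(probs, 0.5))  # tokenA's share rises to about 0.88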
1.10 SQL Join Types
Left join: all rows from left table, plus matching rows from right table; non‑matching right
side gets NULLs.
Right join: all rows from right table, plus matching rows from left.
Full outer join: all rows from both tables, with NULLs where no match exists.
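An equivalent illustration of the three joins above using pandas merges (how='left' / 'right' / 'outer' mirror LEFT, RIGHT, and FULL OUTER JOIN); the tiny tables are invented for the example:
import pandas as pd

cities = pd.DataFrame({"id": [1, 2, 3], "city": ["Pune", "Delhi", "Goa"]})
temps = pd.DataFrame({"id": [2, 3, 4], "temp_c": [31, 29, 33]})

print(pd.merge(cities, temps, on="id", how="left"))   # all city rows; NaN temp_c for id=1
print(pd.merge(cities, temps, on="id", how="right"))  # all temp rows; NaN city for id=4
print(pd.merge(cities, temps, on="id", how="outer"))  # all rows from both sides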
1.11 Fibonacci Coding Exercise
The candidate wrote pseudocode in a thinking‑aloud style:
“Fibonacci is f(n) = f(n‑1) + f(n‑2). We’ll start from 0 and 1. I think a list would work. For i in
range(n): if i==0: append 0; elif i==1: append 1; else: append list[-1] + list[-2]. I tried to run it
and it gave output but needed debugging. Reason it didn’t print correctly: range wasn’t set up
properly.”
1.12 Wrap‑up: Career Motivation
“I’ve been on this project for 1.5 years. It’s now in maintenance mode — mainly ServiceNow tickets. I want to
explore more cutting‑edge agentic stuff, not just maintain what’s built.”
Section 2 – Reconstructed Interviewer Questions
Based on the candidate’s responses, the following questions were likely asked. They are presented in a logical
order, paired with the relevant answer summary.
Q1: “Please introduce your current project and role.”
(See 1.1) The candidate described AIOBI, an agentic BI platform using NLQ, with RAG, Text‑to‑SQL,
orchestrator, LangGraph, Azure AI Search, PostgreSQL.
Q2: “What is the system architecture?”
(See 1.1‑1.2) Backend DBs, LLM (GPT‑5.1), FastAPI middleware, React frontend; orchestrator
classifies and routes queries.
Q3: “Can you walk me through how the Text‑to‑SQL agent works?”
(See 1.3) Detailed flow: rewriting node → meta‑prompt → SQL generation → static/dynamic tests →
feedback loop.
Q4: “What challenges have you faced, especially around evaluation?”
(See 1.4) Curating golden datasets with domain experts, multi‑component test suites, accuracy
metrics.
Q5: “How do you handle prompt changes without derailing outputs?”
The candidate alluded to iterative tuning and testing but did not give a structured answer (see the
critique in Section 3).
Q6: “What is context engineering and how does it differ from prompt engineering?”
(See 1.5) Described context engineering as a subset; providing role, examples, do’s/don’ts, RAG
context augmentation.
Q7: “How do you implement guardrails?”
(See 1.5) Two‑level: deterministic code‑based checks (e.g., for PII) and flexible LLM‑based
checks (policy violations).
Q8: “What are the metrics you use for evaluating the Text‑to‑SQL and RAG agents?”
(See 1.4 & later parts) Accuracy/percentage correct. Mentioned hallucinations, wrong
columns, and execution errors. Did not name specific metrics like BLEU or Execution Accuracy.
Q9: “How do you deal with large database schemas when building prompts?”
(See 1.6) Neo4j metadata graph, topic‑based retrieval of relevant tables/columns to stay within
context length.
Q10: “What about scalability and deployment?”
(See 1.7) API gateway + Kubernetes auto‑scaling, though admitted limited personal hands‑on.
Q11: “How do you decide which LLM version to use, and how do you manage upgrades?”
(See 1.8) Azure OpenAI partnership, upgrade to latest after retesting; older/mini models for
cost; open‑source fallbacks like DeepSeek.
Q12: “Can you explain top‑k, top‑p and temperature?”
(See 1.9) Provided definitions with numerical example for top‑p.
Q13: “What are the differences between left, right, and outer joins in SQL?”
(See 1.10) Gave a correct, concise explanation.
Q14: (Coding exercise) “Write a Python function to generate the Fibonacci sequence up to n terms,
using recursion.”
(See 1.11) Candidate attempted iterative list approach with debug commentary; did not use
recursion as apparently requested.
Q15: “What is your motivation for leaving your current role?”
(See 1.12) Wants to move from maintenance to innovative agentic AI work.
Section 3 – Critique and Improved Answers
Below is a constructive evaluation of the candidate’s responses, highlighting weaknesses and offering a more
polished, architect‑level answer.
3.1 Overall Delivery NEEDS WORK
- Excessive fillers & rambling: The transcript contained many “yeah,” “I mean,” “like,”
and tangential loops. An AI Architect must communicate with clarity and conciseness.
- Lack of structure: Answers often wandered. For example, explaining the Text‑to‑SQL flow
jumped between validation, rewriting, and guardrails without a clear narrative.
- Vagueness on depth: When asked about scaling, the candidate said “I lack details” —
unacceptable for an architect role. Better to say “While I haven’t provisioned the K8s cluster myself, the
standard pattern we follow is…” and then describe the pattern confidently.
Better approach: Use the STAR method (Situation, Task, Action, Result) for complex
descriptions. Speak slowly, think, then deliver a well‑formed paragraph without fillers.
3.2 Architecture Walkthrough FAIR
The candidate mentioned LangGraph, FastAPI, and React, but did not walk through the architecture at a diagram
level or discuss trade-offs. As an architect, one should explain why these choices were made.
Improved answer: “We selected a modular agentic architecture with LangGraph for its explicit
state‑machine control. The orchestrator is a gating model that pre‑classifies NL inputs into RAG or Text‑to‑SQL
branches using few‑shot prompts and a routing function. Each agent is encapsulated behind a FastAPI
microservice, deployed on AKS for scale. We use Azure AI Search for vector retrieval (using Ada embeddings) and
PostgreSQL for transactional SQL data. The frontend is a Next.js app that calls a unified /nlq endpoint. For
observability, we integrate Phoenix/OpenTelemetry to track token usage, latency, and guardrail violations.”
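A minimal sketch of the unified /nlq endpoint mentioned in the improved answer, assuming FastAPI with a Pydantic request model; run_orchestrator is a placeholder for the compiled orchestrator graph, not the project's actual function:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class NLQRequest(BaseModel):
    query: str
    session_id: str | None = None    # optional handle for conversation history

def run_orchestrator(query: str) -> dict:
    # Placeholder: would invoke the orchestrator graph and return the answer
    # plus routing metadata (rag vs sql), token usage, and any guardrail flags.
    return {"route": "sql", "answer": "<result>"}

@app.post("/nlq")
def nlq(request: NLQRequest) -> dict:
    result = run_orchestrator(request.query)
    # Latency, token counts, and guardrail events would be exported to the
    # observability stack (e.g., OpenTelemetry/Phoenix) at this point.
    return result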
3.3 Evaluation Answer INSUFFICIENT
The candidate only mentioned “accuracy” and “golden dataset”. An architect should know specific metrics:
Execution Accuracy (EX), Exact Set Match (ESM), ROUGE‑L or BLEU for SQL,
validation‑set coverage, hallucination rate, and for RAG, context precision/recall,
faithfulness, answer relevancy. The answer lacked method naming and benchmark references.
Better answer: “For Text‑to‑SQL, we use Execution Accuracy (does the SQL produce the correct
result set on a held‑out test DB) and Exact Set Match (comparing the result rows directly). We also compute
SQL‑specific BLEU and ROUGE‑L against reference queries. For RAG, we measure context precision, context recall,
faithfulness, and answer relevancy using LLM‑as‑a‑judge. We curate a golden dataset of 500+ question‑SQL‑answer
triples. Additionally, we do component‑wise evaluations: chunking strategy (Hit Rate on top‑k), meta‑prompt
accuracy, and visualization code correctness using unit test suites.”
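To make the Execution Accuracy metric concrete, a hedged sketch of how it can be computed over a golden dataset; run_sql, generate_sql, and the dataset shape are assumptions for illustration:
def run_sql(conn, sql):
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()

def execution_accuracy(conn, golden, generate_sql):
    # golden: list of {"question": ..., "sql": <reference query>} pairs.
    correct = 0
    for item in golden:
        predicted = generate_sql(item["question"])
        try:
            # Compare result sets (order-insensitive) on the held-out test DB.
            match = set(run_sql(conn, predicted)) == set(run_sql(conn, item["sql"]))
        except Exception:
            conn.rollback()
            match = False              # execution errors count as failures
        correct += match
    return correct / len(golden)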
3.4 Context Engineering vs Prompt Engineering DECENT
The candidate correctly called context engineering a subset, but the distinction was fuzzy. The candidate should have
explained that prompt engineering is the overarching practice of designing the entire prompt structure, while
context engineering specifically deals with injecting relevant external information (retrieved chunks, metadata,
user intent tags).
Better answer: “Prompt engineering covers the system message, instruction templates, output
format, and few‑shot examples. Context engineering is the discipline of selecting and formatting the dynamic
contextual data that augments the prompt — such as RAG‑retrieved chunks, table schemas for Text‑to‑SQL, or
conversation history. It’s about what information you pack and how you serialize it to minimise the gap between
the model’s training distribution and the inference need.”
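A small sketch of that split in code: the static template is the prompt-engineering artefact, while selecting and serializing the schema entries and retrieved chunks is the context-engineering step; all names here are illustrative:
SQL_PROMPT_TEMPLATE = (
    "You are a Text-to-SQL assistant. Only produce SELECT statements.\n"
    "Schema:\n{schema}\n\n"
    "Reference documentation:\n{chunks}\n\n"
    "Question: {question}\n"
)

def build_prompt(question, schema_entries, retrieved_chunks, max_chunks=4):
    # Context engineering: pick only the relevant dynamic data and serialize it
    # compactly so it fits the model's context window.
    schema = "\n".join(f"- {table}: {description}" for table, description in schema_entries)
    chunks = "\n---\n".join(retrieved_chunks[:max_chunks])
    return SQL_PROMPT_TEMPLATE.format(schema=schema, chunks=chunks, question=question)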
3.5 Guardrails Answer ADVANCED
The answer touched on code‑based vs LLM‑based guardrails, which is good. But an architect should mention concrete
libraries (Guardrails AI, NVIDIA NeMo Guardrails) and cite examples like PII scrubbing, SQL injection
prevention, and output schema enforcement. Also, the candidate missed the importance of input
guardrails (e.g., refusing “DROP TABLE” instructions).
Better answer: “We implement a layered guard strategy. On the input side, a regex‑based filter
blocks dangerous keywords (DROP, DELETE) and an LLM classifier detects jailbreak attempts. On the output, we use
a PII anonymizer library (like Presidio) and a second LLM call that validates the response against our content
policy. We also use structured output (JSON mode or function calling) to enforce that SQL statements don’t
contain malicious clauses. For the Text‑to‑SQL agent, before execution we run a static analysis that ensures
only SELECT queries pass through.”
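An illustrative fragment of the deterministic layer described above: a keyword filter on the input plus a SELECT-only check before execution. This is a sketch of the idea, not the project's actual guardrail stack (the Presidio and LLM-based layers are omitted):
import re

WRITE_KEYWORDS = re.compile(r"\b(DROP|DELETE|TRUNCATE|ALTER|UPDATE|INSERT|GRANT)\b", re.IGNORECASE)

def passes_input_guardrail(user_query: str) -> bool:
    # Deterministic pre-filter on the raw natural-language input; an LLM
    # classifier for jailbreak/PII checks would run after this.
    return not WRITE_KEYWORDS.search(user_query)

def is_safe_sql(sql: str) -> bool:
    # Static output check before execution: exactly one statement, starting
    # with SELECT, with no write keywords anywhere in it.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return (len(statements) == 1
            and statements[0].upper().startswith("SELECT")
            and not WRITE_KEYWORDS.search(statements[0]))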
3.6 Large Schema Handling with Neo4j GOOD CONCEPT, POOR EXPLANATION
The idea of a topic‑driven metadata graph is innovative and architect‑level. However, the candidate struggled to
articulate it clearly, using confusing "hierarchy in a graph" metaphors and failing to mention standard
techniques like schema linking and query‑to‑schema alignment. An architect would also mention alternatives
such as table selection via dense retrieval, and explain why Neo4j was chosen (explicit relationship
traversal, deterministic results, no reliance on embeddings that can drift).
Better answer: “We built a semantic metadata graph in Neo4j where nodes represent topics
(weather, traffic), tables, and columns, with edges for belongs‑to, references. When a query arrives, we perform
a two‑hop traversal: first, we identify topic nodes relevant to the query using keyword matching and vector
similarity on topic descriptions; then we traverse the graph to collect only the tables and columns linked to
those topics. This prunes the schema context from ~10k tokens for a 25‑table database down to under 2k tokens.
It also handles schema evolution gracefully — new tables just get new nodes. Compared to dense retrieval, the
graph ensures consistent, deterministic schema linking, which is crucial for SQL accuracy.”
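A hedged sketch of the second hop of that traversal using the official neo4j Python driver; the node labels, relationship types, and connection details are assumptions, not the project's actual graph schema:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SCHEMA_QUERY = """
MATCH (t:Topic)-[:HAS_TABLE]->(tbl:Table)-[:HAS_COLUMN]->(col:Column)
WHERE t.name IN $topics
RETURN tbl.name AS table_name, tbl.description AS table_description,
       collect({name: col.name, description: col.description}) AS columns
"""

def schema_context_for(topics):
    # Hop 1 (matching the user query to topic nodes) is assumed to have
    # produced `topics`; hop 2 collects only the linked tables and columns.
    with driver.session() as session:
        return [dict(record) for record in session.run(SCHEMA_QUERY, topics=topics)]

# Only these pruned table/column descriptions are serialized into the
# Text-to-SQL meta-prompt, instead of the full 25-table schema.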
3.7 Fibonacci Coding Exercise MISMATCHED
The interviewer explicitly said “you have to use recursion.” The candidate wrote an iterative solution with a
list and debugged it aloud. This shows a failure to listen and to translate a requirement into code. The correct
recursive approach (with memoization to avoid the naïve version's exponential complexity) would be:
Correct implementation:
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

def fib_sequence(n):
    return [fib(i) for i in range(n)]

print(fib_sequence(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
The candidate should have clarified the requirement (e.g., “first n numbers” vs “up to a maximum number”) and
then presented a clean recursive solution, discussing time complexity and the importance of memoization.
3.8 SQL Joins SOLID
The explanation was accurate. However, the candidate hesitated and asked for the question to be repeated. For an
architect, the immediate answer should have been crisp: “LEFT JOIN returns all rows from the left table and only
the matches from the right; RIGHT JOIN is its mirror; FULL OUTER JOIN returns all rows from both, with NULLs
where no match exists.” No need for the extra qualifiers. Still, the content was correct.
3.9 Career Motivation HONEST BUT NEGATIVE
“Maintenance mode… ServiceNow tickets” sounds like complaining. An architect should position the reason
positively: “I’m eager to work on more complex, large‑scale agentic systems where I can apply my design skills
to solve novel problems, and I see this role as aligned with that growth.”
Better answer: “My current project has moved into a steady‑state phase. I’m grateful for the
learning, but I’m now seeking an opportunity where I can design next‑generation agentic architectures from
scratch, tackle challenges like multi‑agent orchestration and autonomous tool use, and collaborate with a
research‑focused team. Your opening seems perfectly aligned with that progression.”
3.10 Missing Topics GAPS
The candidate did not proactively discuss:
- Observability tools: Only mentioned Phoenix and LangFuse vaguely. An architect should know
OpenTelemetry, tracing, and metrics like faithfulness.
- Cost optimization: No mention of token‑usage reduction, caching, semantic caching, or
prompt compression.
- Multi‑agent patterns: Although the project is multi‑agent, the candidate didn’t discuss
debate, reflection, or plan‑execute patterns — all highly relevant for an agentic architect.
- Security: Beyond guardrails, no discussion of RBAC, row‑level security in NLQ, or tenant
isolation.
3.11 Suggested Talking Points for Future Interviews
- Use concrete numbers: “Improved SQL accuracy from 82% to 93% by introducing table‑graph schema linking.”
- Mention standard benchmarks: “We track BIRD, Spider, or WikiSQL metrics internally.”
- Show impact: “Reduced prompt tokens per query by 60% using Neo4j metadata pruning.”
- Discuss failure modes: “We handle ambiguous terms by engaging the user in a clarification loop, which
improved first‑attempt success by 20%.”
- Always bring the conversation back to architecture trade‑offs: why agentic vs single‑call, why LangGraph
vs Semantic Kernel, why Azure vs AWS.
End of Report — Prepared by AI Interview Evaluator