Building Blocks of Applied AI
February 6, 2025
2120 words
AI AI, *integrates AI into your toothbrush*. AI is everywhere these days: in 2024 more than half of the YC-funded startups were AI-based, and most Product of the Day winners on Product Hunt are AI apps. If we want to build these AI-driven apps, we need to understand how to interact with and leverage LLM capabilities.
But remember, applied AI requires as much common sense as it does programming. Designing the right architecture, choosing the right approaches, understanding the scope and requirements correctly, and managing costs effectively are all huge parts of the game.
I'm writing this after six months of learning and working with applied AI; these are more or less all the major things you need to get started with building fancy AI apps. I wrote it to read like a story rather than a book, and it's a high-level overview meant to make you familiar with the components of applied AI.
It All Starts With a Call to an LLM
Many AI applications start by calling an LLM (Large Language Model) through a paid API from providers like OpenAI or Anthropic. Self-hosting is possible, but it's often more complex, carries a high initial cost, and may underperform if we don't have serious infrastructure.
def call_llm_api(prompt: str) -> str:
    llm_params = {
        "model": "LLM_MODEL",
        "prompt": prompt,
        "temperature": 0.7,  # controls randomness/creativity
        "max_tokens": 256,   # limits the output length
        "top_p": 0.9,        # nucleus sampling parameter
        # many more, less used.
    }
    response = some_llm_client.create_completion(**llm_params)
    return response["choices"][0]["text"]
- prompt: The main message we feed to the LLM.
- temperature (0–1): Controls how creative or random the output is (0 is more deterministic).
- max_tokens: Caps how many tokens the model can generate.
- top_p: If using nucleus sampling, it sets a probability threshold for the token distribution.
Writing a good prompt is about 50% of the work done. More context typically improves responses but can spike costs. Sometimes we gotta gaslight the LLM a little, steering it toward the exact answer we want. Balancing context and cost is a big part of building successful AI apps.
Structured Outputs
Some newer LLMs like GPT-4o can directly parse outputs into structured data, especially when combined with libraries like pydantic. For instance:
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed
print(event)
# output looks like: CalendarEvent(name="science fair", date="Friday", participants=["Alice", "Bob"])
Coupling Multiple LLM Calls to Make a Chatbot
When we maintain a continuous exchange of messages with the LLM, we effectively have a chatbot. Everyone wants one these days for customer support, user interaction, or just to have AI on their site. Typically:
- A system prompt sets the assistant's style or role (e.g., You are a friendly AI chatbot).
- Each user message is appended to a chat history array.
- The AI's response is appended back, building an ongoing conversation.
chat_history = []

def chatbot_respond(user_message: str) -> str:
    system_prompt = "You are a helpful, friendly AI chatbot."
    chat_history.append({"role": "user", "content": user_message})

    # In practice, we'd summarize old messages or prune them
    recent_history = chat_history[-5:]

    llm_input = {
        "model": "gpt-4o-2024-08-06",
        "messages": [
            {"role": "system", "content": system_prompt},
        ] + recent_history,
    }

    # response = some_llm_call(llm_input)
    response = {
        "choices": [
            {"message": {"content": "Pretend AI response based on user_message"}}
        ]
    }

    ai_response = response["choices"][0]["message"]["content"]
    chat_history.append({"role": "assistant", "content": ai_response})
    return ai_response
We need to be careful with context size: the more message history we feed in, the higher our token usage, but accuracy and context improve too.
But the Chatbot Doesn't Know Anything About My Company?
Because LLMs are frozen at a specific training point, they won't automatically know our internal data or recent updates. That's where Retrieval-Augmented Generation (RAG) comes in.
If we have 500 pages of proprietary knowledge, we can't just shove all of it into the prompt. That's both expensive and might exceed token limits. The typical workflow is:
- Chunk our documents into manageable pieces.
- Convert each chunk into a vector embedding using a separate embedding model.
- Store those embeddings in a vector database.
- Convert our query into an embedding as well.
- Use cosine similarity (or another similarity measure) to find the most relevant chunks.
- Provide those chunks as additional context to the LLM.

def embed_text(text: str) -> list[float]:
    # Example embedding call
    return embedding_model.embed(text)

def build_vector_db(docs: list[str]):
    for doc in docs:
        vector = embed_text(doc)
        vector_db.store(vector, doc)

def rag_ask(user_query: str) -> str:
    query_vector = embed_text(user_query)
    top_docs = vector_db.similarity_search(query_vector, top_k=3)
    llm_prompt = f"""
    User's Question: {user_query}
    Relevant Info: {top_docs}
    Please provide a helpful answer:
    """
    return call_llm_api(llm_prompt)
Cosine similarity measures the angle between two vectors: if the angle is small (similarity close to 1), the documents are likely relevant. This way, we feed the LLM just what it needs, saving tokens and cost.
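For intuition, here's a minimal sketch of cosine similarity on plain Python lists (a vector database does this for us, far more efficiently and at scale):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # magnitudes (Euclidean norms)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite
    return dot / (norm_a * norm_b)

# toy example: the query vector points closer to doc1 than doc2
query = [0.9, 0.1, 0.3]
doc1 = [0.8, 0.2, 0.4]
doc2 = [0.1, 0.9, 0.7]
print(cosine_similarity(query, doc1))  # higher score
print(cosine_similarity(query, doc2))  # lower score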
Better Chatbots → Assistants
A chatbot that can handle images, voice input, data from our internal database, and more personalized context starts looking like an assistant. Assistants do more than just spit out text:
- They can handle multimodal input, using advanced models like Google's Gemini 2.0 (best in the market for multimodal inputs and outputs, in my opinion) or GPT-4o, which support images, audio, or structured data.
- They maintain long-term context or memory to understand more about the user.
- They can sometimes call tools, like searching the web or interacting with a calendar.
Tools? Wait a few minutes.
Agentic Systems: The New Hype
Agents are basically AI systems where the LLM decides what to do next—which tool to call, what questions to ask us, or whether to finalize. Workflows tend to be more predictable, but agentic systems let the LLM choose its own path (to an extent), which makes them non-deterministic.
5 Foundational Patterns to Build Agentic Systems
These are the five foundational patterns often used to build or extend agentic systems. Each pattern has its own pros, cons, and best-use scenarios. Sometimes they're more like flows (highly structured), and other times they're more agentic (dynamic, with the LLM deciding next steps). A combination of these can lead to extremely powerful solutions.
1. Prompt Chaining
We might say: "Please translate this text and create 10 quiz questions from it."
Sure, we could do this in a single LLM call, but the probability of an error or messy output is higher—this is a multi-step task.
Instead, we distribute the work across multiple calls:

- We receive the user's request.
- We pass the request to a translator LLM call.
- The translator returns the translated text.
- We then pass that translation to a quiz-maker prompt (with structured output).
- We get a more accurate and well-formatted set of quiz questions.
Ideal Use: Tasks that can be broken down into clear, deterministic steps.
Tradeoffs: More calls → higher cost and potentially slower speed, but often better accuracy than a single “mega-prompt.”
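A minimal sketch of this chain, reusing the call_llm_api helper from earlier (the prompts and the default target language are just illustrative):

def translate_text(text: str, target_language: str = "English") -> str:
    # Step 1: dedicated translation call
    prompt = f"Translate the following text into {target_language}:\n\n{text}"
    return call_llm_api(prompt)

def make_quiz(translated_text: str, num_questions: int = 10) -> str:
    # Step 2: dedicated quiz-maker call, fed the output of step 1
    prompt = (
        f"Create {num_questions} quiz questions from the text below. "
        f"Return them as a numbered list.\n\n{translated_text}"
    )
    return call_llm_api(prompt)

def translate_and_quiz(user_text: str) -> str:
    translated = translate_text(user_text)
    return make_quiz(translated)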
2. Parallelization
Sometimes, we want to run independent tasks on the same input. We could combine everything into one LLM call, but the chance of incomplete or messy output rises.
For example, imagine we have a transcription of a business call, and we want to do three things with it:
- Summarize it.
- Translate it into Hindi.
- Remove any informal or off-topic chatter.
Instead of doing these tasks one by one, we can parallelize them:

- We get the transcription from the user.
- We spin off three separate LLM calls:
  - Summarization
  - Translation
  - Cleanup (removing informal talk)
- Each LLM call runs in parallel.
- All three independent outputs are returned.
- A collector gathers all the outputs, arranges them, and returns the final result.
Another scenario: we might ask n LLMs, each with different prompts, to review some content (code, a blog post, images, etc.) and gather all their feedback.
Pros: We don't add sequential delays; tasks run concurrently, so it's faster overall.
Cons: Token usage can be high, so costs climb. However, we get refined outputs quickly.
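A rough sketch using Python threads, assuming call_llm_api from earlier is a blocking call (an async client would work just as well):

from concurrent.futures import ThreadPoolExecutor

def parallel_process(transcript: str) -> dict[str, str]:
    # One prompt per independent task, all over the same input
    prompts = {
        "summary": f"Summarize this business call transcript:\n\n{transcript}",
        "translation": f"Translate this transcript into Hindi:\n\n{transcript}",
        "cleanup": f"Rewrite this transcript with informal or off-topic chatter removed:\n\n{transcript}",
    }
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Fire off all three LLM calls concurrently
        futures = {name: pool.submit(call_llm_api, prompt) for name, prompt in prompts.items()}
        # The "collector": gather the three independent results
        return {name: future.result() for name, future in futures.items()}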
3. Routing
Routing moves us into more agentic territory because now the flow can change depending on the input. It's still somewhat procedural, but each request might follow a different path.
Suppose we're creating a science doubt solver. Science breaks down into Chemistry, Physics, and Biology, so the setup has three pieces:

- A classifier that decides which category the user's question belongs to.
- A specialized RAG pipeline for each category.
- That specialized LLM or RAG system returns the final answer.
Couldn't we just have one big RAG for all science? Yes, but it can lead to irrelevant matches, especially if keywords overlap. For instance, "effects of salt on the human body" is actually a Biology question, but a general cosine search system might pull in random Chemistry details on salt compounds too.
Pros: Higher accuracy on broad but specialized topics; each pipeline is tuned to its domain.
Cons: More complexity, possible misrouting in rare cases. Also, a single misclassification at the start can derail the entire answer.
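A minimal sketch of the routing step. The classifier is an ordinary LLM call, and the per-subject pipelines (chemistry_rag_ask and friends) are hypothetical specialized versions of the rag_ask function from earlier:

def classify_subject(question: str) -> str:
    # Cheap classification call; a small model is usually enough here
    prompt = (
        "Classify this science question as exactly one of: Chemistry, Physics, Biology.\n\n"
        f"Question: {question}\nAnswer with one word only:"
    )
    return call_llm_api(prompt).strip()

def route_question(question: str) -> str:
    # Hypothetical specialized RAG pipelines, one per subject
    subject_pipelines = {
        "Chemistry": chemistry_rag_ask,
        "Physics": physics_rag_ask,
        "Biology": biology_rag_ask,
    }
    subject = classify_subject(question)
    # Fall back to the general RAG pipeline if the classifier misbehaves
    handler = subject_pipelines.get(subject, rag_ask)
    return handler(question)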
4. Evaluator–Optimizer
I also call it reviewer–worker. This pattern is for tasks that require high precision, with iterative checks. It goes like this:

- We want a code snippet to be generated.
- The worker (optimizer) LLM creates an initial version.
- The reviewer (evaluator) LLM checks it and offers suggestions or points out mistakes.
- The worker refines the code based on that feedback.
- Steps 2–4 repeat until the reviewer is satisfied or we hit a max iteration cap.
Key Failure Point: Infinite loops if there's no iteration limit or if the LLMs keep finding new "improvements" forever.
Use Cases: When we need super high-quality outputs.
Tradeoffs: High cost (multiple calls per cycle) and slower speeds.
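A rough sketch of the loop, with an iteration cap to avoid the infinite-loop failure mode (the prompts and the "APPROVED" convention are made up for illustration):

def generate_with_review(task: str, max_iterations: int = 3) -> str:
    # Worker produces the first draft
    draft = call_llm_api(f"Write a code snippet for this task:\n\n{task}")
    for _ in range(max_iterations):
        # Evaluator checks the draft
        review = call_llm_api(
            f"Review the following code for this task: {task}\n\n"
            f"Code:\n{draft}\n\n"
            "If it is correct and clean, reply with exactly APPROVED. Otherwise list the problems."
        )
        if "APPROVED" in review:
            break  # the evaluator is satisfied
        # Worker refines the draft based on the evaluator's feedback
        draft = call_llm_api(
            f"Improve this code based on the review.\n\nCode:\n{draft}\n\nReview:\n{review}"
        )
    return draft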
5. Orchestrator–Workers
This pattern handles more complex, potentially non-deterministic tasks, especially when we can't predict all sub-tasks upfront.

- We have a big request, like "Design a database with user auth, cost constraints, and analytics."
- An orchestrator LLM breaks this into subtasks (e.g., an "auth specialist," "cost specialist," etc.)
- Each specialized LLM (the workers) handles its part, possibly in parallel.
- A synthesizer then collects and merges all the individual outputs.
- The orchestrator or synthesizer returns the final combined answer.
Key Failure Point: The synthesizer's context window can overflow if it tries to combine too many worker outputs. Also, cost rises quickly.
Speed: Potentially slow, as orchestrator, workers, and synthesizer may each require multiple calls.
Benefit: We can tackle very large or multifaceted tasks by dividing them into smaller, specialized pieces.
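A very rough sketch of the three roles, again reusing call_llm_api; the one-subtask-per-line convention is just an illustrative shortcut (a structured output like the pydantic example earlier would be more robust):

def orchestrate(big_request: str) -> str:
    # 1. Orchestrator: break the request into subtasks, one per line
    plan = call_llm_api(
        f"Break this request into a short list of independent subtasks, one per line:\n\n{big_request}"
    )
    subtasks = [line.strip("- ").strip() for line in plan.splitlines() if line.strip()]

    # 2. Workers: each specialized call handles one subtask (these could run in parallel)
    worker_outputs = [
        call_llm_api(f"You are a specialist. Complete this subtask:\n\n{subtask}")
        for subtask in subtasks
    ]

    # 3. Synthesizer: merge everything into one coherent answer
    combined = "\n\n".join(worker_outputs)
    return call_llm_api(
        f"Combine these partial results into one coherent answer to the original request "
        f"({big_request}):\n\n{combined}"
    )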
Flow vs. Agent
- Flows: Typically have a predetermined path; we know each step. (Examples: prompt chaining, parallelization in a fixed structure.)
- Agents: Have more dynamic decision-making at runtime (e.g., routing to the right domain expert, or deciding which tool to call next).
Often, real-world systems blend both. We might have a mostly procedural flow but insert agentic decisions where needed.
Core Agents
An agent is:
- An LLM (the brain),
- A set of tools (functions, APIs, or entire services we define),
- An environment that tracks what's happening outside the LLM (user inputs, system states, or responses from tools).
The LLM decides when to call a tool, when to ask for more info, or when to finalize the answer. This flexibility is super powerful but also unpredictable if we don't define our tools well or set iteration caps (infinite loops can happen). A tool can be complicated under the hood, but to the LLM it's just a function we define; we pass the LLM a list of tools with relevant documentation and examples so it knows when to use which one. We'll talk more about tooling specifics in another post—just know that's where the real agent magic lies.
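A bare-bones sketch of what that loop can look like. call_llm_with_tools is a hypothetical helper that returns either a tool request or a final answer; real SDKs (OpenAI, Anthropic) have their own tool-calling formats:

def run_agent(user_request: str, tools: dict, max_steps: int = 5) -> str:
    # tools maps a tool name to a regular Python function, e.g. {"web_search": web_search}
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):  # iteration cap to avoid infinite loops
        decision = call_llm_with_tools(messages, tools)  # hypothetical helper
        if decision["type"] == "final_answer":
            return decision["content"]
        # Otherwise the LLM asked for a tool: run it and feed the result back
        result = tools[decision["tool_name"]](**decision["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    return "Stopped after reaching the iteration cap."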
OK got it, *pulls out 10 agents to summarize a blog*
Please stop. With so many patterns—plus frameworks like LangFlow or LangGraph—it's easy to go overboard. More complexity = more cost + more debugging nightmares. Always get the requirements in one place before planning; often a simple RAG chatbot with thoughtful prompt engineering is enough.
My advice: Don't jump on agents directly. Start small, keep it simple, and only introduce more complex workflows or agentic behavior if we really need it. Half the time, we don't need a top-tier reasoning model for basic tasks like summarizing text. Provide well-crafted prompts, watch the token usage, and keep everything well-structured and documented.
Applied AI is about putting the right blocks together. Sometimes one LLM call with a clever prompt is all we need. Other times, we might need multiple calls with advanced patterns, or an entire agentic system. Whatever we do, let's stay mindful of cost, complexity, and user experience; that's what makes or breaks a product.
Excellent resources
A huge thanks to Anthropic for the blogs and code examples they write; really clear and concise.