Let's start at the end and work backwards. All the code for this example can be found here: https://github.com/ouachitalabs/baml-playground.
At work we sometimes need to extract values from semi-structured documents like 10-Qs or earnings call transcripts. Here's a quick script I wrote while exploring BAML:
$ uv run main.py "Combined Walmart U.S. and Samsclub U.S. net sales" --url "https://stock.walmart.com/sec-filings/all-sec-filings/content/0000104169-25-000137/0000104169-25-000137.pdf"
# Output:
{
  "company": "Walmart Inc.",
  "description": "Combined Walmart U.S. and Samsclub U.S. net sales for the latest quarter",
  "derivation": 144549000000,
  "facts": [
    {
      "value": 120911000000,
      "units": "USD",
      "start_date": "2025-05-01",
      "end_date": "2025-07-31",
      "location": "Page 15, Segment Information table, Walmart U.S. section, 'Net sales' row, '2025' column under 'Three Months Ended July 31,'"
    },
    {
      "value": 23638000000,
      "units": "USD",
      "start_date": "2025-05-01",
      "end_date": "2025-07-31",
      "location": "Page 15, Segment Information table, Sam's Club U.S. section, 'Net sales' row, '2025' column under 'Three Months Ended July 31,'"
    }
  ]
}
The numbers check out. Go verify them for yourself! These facts can sometimes be teased out with existing archaic and very tedious solutions like XBRL tags (made much simpler with the excellent Edgartools library, go check it out), but the tags are managed by the filers, and sometimes what we need out of the document is not tagged. Not to mention that XBRL is really only a standard for US filings, not Canadian/EU filings. Thus, we would like to have one solution to rule them all.
Since this is a document extraction problem, your first instinct might just be to send the PDF to ChatGPT and see if it gets the answer right. Sure enough, it does so pretty often. LLMs are exceedingly good at pulling relevant facts out of deep within documents like this: commit hashes, facts and figures, exact dates and timestamps. The LLM is really good at regurgitating these facts just because of how the underlying transformer models work. Note that I'm no ML scientist, and my hand waving here should be taken with a grain of salt.
Tool calls are neat. They're what break out a single LLM call into a multi-turn agent that can actually interact with something on the outside. Orchestrating tool calls is relatively simple when you use a library like openai - however, with BAML things are a bit more low-level. This is a good thing. Before trying BAML, I never truly understood what was happening under the hood when an LLM "decided" to make a tool call. What was that handoff process like? With BAML everything is nice and transparent. You can see everything as it happens, and the tooling around the language makes it very easy to test each prompt as a function with simple inputs/outputs.
Basically, a BAML function is a plain function interface wrapped around an LLM prompt. Things go in and come out. BAML helps shape the data flowing in and out in a reasonably transparent way. Any real compute done inside a BAML function is purely LLM state transfer. BAML functions are true pure functions, and side effects and mess are handled in the application layer rather than the routing layer.
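To make that handoff concrete, here's a schematic sketch of the pattern using hypothetical stand-in functions (this is not code from the repo; the real versions show up below): the LLM step is a pure call that returns data, and the decision to run a tool is just an ordinary branch in application code.

# Hypothetical stand-ins for illustration only -- not the repo's code.

def llm_pick_tool(message: str) -> str | None:
    """Stand-in for a BAML function: prompt goes in, structured data comes out."""
    # A real call would hit the model; here we fake a "calculator request".
    return "120911000000 + 23638000000" if "net sales" in message else None

def run_calculator(equation: str) -> int:
    """Stand-in for the side-effecting tool (the real version uses sympy)."""
    left, right = (int(part) for part in equation.split("+"))
    return left + right

request = llm_pick_tool("Combined Walmart U.S. and Sam's Club U.S. net sales")
if request is not None:             # the "decision" to use a tool is just data you inspect
    print(run_calculator(request))  # 144549000000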
To do the trick I showed at the beginning of the article, I’ve defined a few datatypes:
class Fact {
  value int @description(#"
    If the value is in currency, round to the nearest whole dollar. For example, always expand "$12.8 million" to 12800000 "USD"
  "#)
  units string
  start_date string
  end_date string?
  location string? @description(#"
    A plain text description of where this fact can be found on the PDF
  "#)
}

class FilingInformation {
  company string
  description string
  derivation int? @description(#"
    If asked for a multi-fact calculation, put the *final calculation* here in whole USD
  "#)
  facts Fact[] @description(#"
    If asked for a multi-fact calculation, put the *base facts* here
  "#)
}

class CalculatorRequest {
  equation string
  @@assert(contains_operation, {{ this|regex_match("[-+*/]") }} )
}
These are my data models. Basically users can extract filing information from an SEC filing (or any document, as you'll see next) and back it up with cold hard facts 😎. These facts contain a value, units, start/end dates, and a location description so you can find the fact in the doc for debugging purposes. Now, onto the functions:
function SearchDoc(document: pdf, query: string) -> FilingInformation {
  client Gemini
  prompt #"
    Extract the information from the document below.
    Always prefer to source a fact from within a table if possible.
    Info: {{query}}
    Document: {{document}}
    {{ ctx.output_format }}
  "#
}
function NeedCalculator(message: string, facts: Fact[]) -> CalculatorRequest {
  client Gpt5Nano
  prompt #"
    Given a message, create an expression that best represents the user's
    request. If you don't have to calculate anything, return null.
    Never give the answer or include the units. Always just raw numbers and
    a symbolic arithmetic operation (+, -, *, /)
    {{ ctx.output_format }}
    {{ _.role("user") }} {{ message }}
    {{ _.role("user") }} {{ facts }}
  "#
}
These are the only two BAML functions we need. SearchDoc is a function that accepts a PDF and a query, returning some sort of filing information. NeedCalculator is a poorly named function that takes a message (better thought of as the original query) and a list of facts and turns it into an equation. If you check the models above, you'll see that the CalculatorRequest output type is literally just a string named "equation." This is both hilarious and deeply concerning. How do we solve an equation as a string? It turns out LLMs are very good at solving these simple arithmetic problems, but they're just not trustworthy (read: deterministic) enough, even though they're correct very often. To remedy this, I thought we could just bring a nuke to a paintball match and import sympy - a full-blown symbolic algebra system 😀. Now, if the LLM ever decides to calculate derivatives or solve differential equations, it has the tools at its disposal.
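As a quick illustration (a standalone sketch, not the repo's main.py, which goes through sympy's Mathematica parser), here's what deterministic evaluation of that equation string looks like, plus the gratuitous firepower:

# A minimal sketch of deterministically evaluating the LLM's equation string.
# main.py uses sympy's Mathematica parser; sympify is a simpler stand-in for
# the same idea.
from sympy import diff, symbols, sympify

equation = "120911000000 + 23638000000"  # the kind of string NeedCalculator returns
print(sympify(equation))                 # 144549000000, exact integer arithmetic

# And the "nuke" part: symbolic calculus is sitting right there if it's ever needed.
x = symbols("x")
print(diff(x**2 + 3 * x, x))             # 2*x + 3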
So how do we orchestrate all of this? It's pretty simple! BAML works kind of like an OpenAPI spec generator if you've ever used one before. Every time you save your work, it regenerates the client super quickly (like in 5ms) - in our case, Python files that use Pydantic for data modeling. Just like with OpenAPI codegen, many languages are supported (Go, TypeScript, Python, etc.). You can then import the client and use the types and functions you've defined in your *.baml files. The code to orchestrate pretty much looks like this:
import argparse

from utils import encode_file_to_base64
from baml_py import Pdf
from baml_client.sync_client import b

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Search a PDF document for information')
    parser.add_argument('query', help='Search query for the document')
    parser.add_argument('--url', required=True, help='URL or path to the PDF document')
    args = parser.parse_args()

    # Pure LLM call: PDF + query in, typed FilingInformation out.
    pdf = Pdf.from_base64(encode_file_to_base64(args.url))
    result = b.SearchDoc(document=pdf, query=args.query)

    # If more than one base fact came back, ask for an equation and let sympy
    # do the arithmetic deterministically.
    if len(result.facts) > 1:
        request = b.NeedCalculator(
            message=args.query,
            facts=result.facts
        )
        from sympy.parsing.mathematica import parse_mathematica
        derivation = parse_mathematica(request.equation)
        result.derivation = derivation.doit()

    print(result.model_dump_json(indent=2))
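One side note on the codegen, since it's easy to miss: the BAML classes above land on the Python side as ordinary Pydantic models (under baml_client.types, if I have the generated layout right), so you can build and serialize them yourself with no LLM involved. A minimal sketch:

# A minimal sketch (assumes the generated baml_client from this project):
# the BAML classes are plain Pydantic models on the Python side.
from baml_client.types import Fact, FilingInformation

fact = Fact(
    value=23638000000,
    units="USD",
    start_date="2025-05-01",
    end_date="2025-07-31",
    location=None,  # optional BAML fields (string?) come through as Optional
)
info = FilingInformation(
    company="Walmart Inc.",
    description="Sam's Club U.S. net sales",
    derivation=None,
    facts=[fact],
)
print(info.model_dump_json(indent=2))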
Note how there are no import statements for OpenAI, Gemini, Anthropic, etc. You may not have noticed, but I'm using two totally different model providers here! I've used gemini-2.5-pro for the more complicated document parsing and gpt-5-nano for generating the simple equation. I can mix and match models or run requests to different models concurrently, taking the first one to respond. I can even round-robin them if I want (a concept I first read about in XBOW's post on alloys, which I'm very excited to try out in the future). You can also easily plug in OpenAI-compatible local LLM servers like Ollama and LM Studio.
client Gpt5Nano {
  provider openai
  options {
    model "gpt-5-nano"
    api_key env.OPENAI_API_KEY
  }
}

client Gemini {
  provider google-ai
  options {
    model "gemini-2.5-pro"
    api_key env.GEMINI_API_KEY
  }
}

client Lmstudio {
  provider openai-generic
  options {
    model "google/gemma-3-4b"
    base_url "http://localhost:1234/v1"
  }
}
In conclusion, I don't know what the future holds for programming agents and baking LLMs into software, but this approach is a great glimpse into how the future may look. I didn't even get into the concept of evals, which - if your calls to the LLM are factored out into pure functions - are now trivial to implement in BAML. The tooling around building things with LLMs is getting better every day. There are about a million different very opinionated frameworks, and sometimes it's nice to step back into a more flexible, lower level of the stack for a change.
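For a taste of what "trivial" means here, this is roughly the shape an eval could take (a hedged sketch that isn't in the repo, assuming the generated baml_client from this project and pytest as the runner):

# A hedged sketch of an eval as a plain pytest test: because NeedCalculator is
# a pure function, we can call it with fixed inputs and assert on its typed output.
from baml_client.sync_client import b
from baml_client.types import Fact

def test_need_calculator_produces_an_arithmetic_expression():
    facts = [
        Fact(value=2, units="USD", start_date="2025-05-01", end_date=None, location=None),
        Fact(value=3, units="USD", start_date="2025-05-01", end_date=None, location=None),
    ]
    request = b.NeedCalculator(message="Add the two values together", facts=facts)

    # Mirrors the @@assert on CalculatorRequest: the equation must contain an operator.
    assert any(op in request.equation for op in "+-*/")
    # And it should reference the base facts rather than invent new numbers.
    assert "2" in request.equation and "3" in request.equation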