While everyone's turning English into structured code, I need to turn structured output into English
After twenty years in quantitative finance, you develop a dialect. You stop noticing that you speak it.
A sentence like "The name screens cheap on EV/EBITDA relative to sector with positive momentum z-scores and positive earnings revisions" makes perfect sense to you. It tells a complete story: the stock is undervalued on a key metric, the price trend is favorable, and analysts are raising estimates. Three independent signals pointing the same direction. On a quant desk, that sentence is practically a buy thesis.
To my client sitting across the table, it's noise. You might as well be reading the ingredients off a shampoo bottle in a foreign language.
And it's not just quants. The entire financial industry speaks in tongues. Sell-side analysts will write something like "we see asymmetric risk/reward skewed to the upside with a favorable catalyst path into the print" and expect a normal human being to nod along. That sentence means "we think the stock goes up before earnings."
But it sounds like a transmission from another planet. The difference is that sell-side analysts have always written for other finance people. I needed to write for clients. Real people who are smart, accomplished, and perfectly capable of understanding an investment thesis, but who did not spend two decades marinating in financial statements, factor models or multivariate regressions.
Why this became my problem
When I set out to build Rubric Advisors, a wealth management firm rooted in a systematic approach, I thought the hard part would be the models. After two decades in quant finance, I had strong opinions about how to build factor models, risk decompositions, and valuation engines. I figured the intellectual challenge would be in the math.
I was wrong. The models were the easy part.
Building factor models that score stocks on value, growth, quality, momentum, and sentiment? I'd done that before. A risk model that decomposes volatility into systematic and idiosyncratic components? Straightforward. A valuation engine producing model-derived price estimates? Challenging, but tractable. Two decades of experience turn these into well-rehearsed engineering problems with known solutions.
The real bottleneck, the one I didn't see coming, was communication. Specifically: how do you take the output of institutional-grade quantitative systems and explain it to a wealth management client in a way that's clear, honest, and actually useful? And do it fairly automatically.
This isn't the same problem that institutional investors have. At an institutional fund, your "client" is often another quant, or at least someone who's fluent in the language. Or you have teams of people whose job is to translate what the quants produce into something fit for human consumption.
In wealth management, your client is a successful person who trusts you with their financial future and reasonably expects you to explain what you're doing with it. They don't need to understand the math. But they absolutely need to understand the reasoning behind your decisions. A client who can't follow the logic behind a portfolio decision can't evaluate it, can't pressure-test it, and can't maintain conviction when the market gets choppy.
Transparency isn't a nice-to-have. It's the foundation of the entire advisory relationship.
The reverse of code generation
Here's what struck me. I was watching LLMs transform natural language into working code. Taking a plain-English description of what a program should do and producing the precise, structured implementation. The entire industry was excited about this direction. Human language in, machine language out. It was all the buzz!
I needed the exact opposite.
My quantitative models produced structured, precise, machine-readable output. Factor scores, risk decompositions, valuation estimates, ratio tables. Dense, numerical, rigorous. And I needed to convert all of that into natural language that a human could read, understand, and evaluate. Machine language in, human language out.
It sounds like it should be easier. In some ways, it's actually harder.
Code generation has a clear correctness criterion: the code either runs or it doesn't. Narrative generation has no such luxury. A paragraph can be grammatically and technically accurate but still completely fail to communicate. It can faithfully report every number and still leave the reader with zero understanding of the investment thesis. The challenge isn't fidelity to the data. It's interpretation of the data. Connecting the dots across categories, framing the tensions, surfacing the insight that's implicit in the numbers but invisible to someone who doesn't live in them every day.
Inverting structured data into accessible narrative turned out to be a harder problem than I imagined. And it required a fundamentally different approach to prompt engineering than anything I'd seen in the code-generation world.
What the model has to work with
To understand why this is so hard, you need to see what the system actually ingests. Here's a simplified data envelope for a single large-cap stock. The real version is bigger, but it gives you a flavor:
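The sketch below is illustrative rather than exact: the figures that also appear later in this piece are real, while the field names and the remaining values are stand-ins for the actual schema.

```python
# Illustrative data envelope for one large-cap stock. Field names and any numbers
# not quoted elsewhere in this piece are placeholders, not the real schema.
envelope = {
    "profile": {"sector": "Technology", "market_cap_usd": 2.8e12},
    "factor_scores": {  # cross-sectional z-scores within the sector
        "value": -0.85, "growth": 1.30, "quality": 2.05,
        "momentum": 1.10, "sentiment": 1.45,
    },
    "risk_model": {"market_beta": 1.1, "value_loading": -0.3, "idiosyncratic_vol": 0.18},
    "valuation": {"model_estimate": 247.0, "range": (218.0, 276.0), "upside_pct": 6.5},
    "ratios": {"operating_margin": 0.315, "net_margin": 0.253, "roe": 1.574,
               "trailing_pe": 34.2, "forward_pe": 29.8},
    "earnings_surprises": [  # last four quarters: (quarter, actual EPS, consensus EPS)
        ("Q3-2025", 1.64, 1.59), ("Q2-2025", 1.40, 1.35),
        ("Q1-2025", 1.52, 1.50), ("Q4-2024", 2.40, 2.36),
    ],
    "analyst_consensus": {"target": 248.50, "analysts": 47, "range": (185.0, 300.0)},
}
```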
And on top of all that: five years of financial statements, nine tables of ratios, an earnings call transcript, 10-K excerpts, news sentiment, insider transactions, and peer comparables across 15 metrics.
A quant analyst reads all of this in about three minutes and forms a picture. The factor scores say high-quality growth name, not particularly cheap, with strong momentum and positive analyst sentiment. The risk model says it moves with the market and has a slight negative value tilt, consistent with a growth stock. The earnings history says consistent beats, supporting the positive sentiment score. The valuation model says modest upside. The thesis writes itself. If you speak the language.
The challenge is making that thesis understandable to someone who doesn't speak that dialect. I don't want to waste their time teaching my lingo, but my goal is to get them on the same page I'm on. That matters in the financial advisory business.
The single-prompt trap
The initial approach was the obvious one: feed all the data into a single LLM prompt, define a 13-section report structure with word targets, embed regulatory guardrails, and generate the report in one pass. The prompt was massive. Roughly 8,000 tokens of instructions sitting on top of 10,000 to 15,000 tokens of data.
What came back was technically correct. But deeply unsatisfying.
1. The output was bland
Here's what the Business Model section typically looked like:
The company operates in the technology sector with a market capitalization of approximately $2.8 trillion. It generates revenue through multiple business segments including products and services. The company has demonstrated strong profitability with an operating margin of 31.5% and a net margin of 25.3%. Revenue has grown year-over-year, supported by expansion in key markets. The company's return on equity stands at 157.4%, indicating efficient use of shareholder capital.
Every sentence is factually correct. Every sentence is boring. This reads like a Wikipedia page with better formatting. It lists numbers without connecting them. There's no insight about why the margins are where they are, what is driving the growth, or how the revenue segments interact. A client reads this and learns nothing they couldn't get from a Yahoo Finance summary page. It certainly doesn't help them understand my investment reasoning.
Compare that to what a human analyst might write:
The business model is deceptively simple: sell premium hardware at 35-40% gross margins, then lock the installed base into a services ecosystem that generates 70%+ margins. The 2.2 billion active devices aren't just a market share statistic. They're a recurring revenue engine. Every device sold a decade ago still funnels subscription revenue through today. Services revenue crossed $100B run rate last quarter and is growing at 15% while hardware growth has slowed to single digits. The margin mix shift alone explains most of the earnings trajectory.
The difference isn't just style. The human analyst connects data points across categories. The hardware margin informs the services strategy. The installed base explains the earnings trajectory. The growth rate differential frames the investment thesis. The single-pass model consistently failed to make these connections because it was simultaneously formatting, disclaiming, and counting words. It couldn't think and write at the same time. Which, if we're being honest, is something most humans struggle with too.
2. The compliance/regulatory guardrails made it robotic
Compliance rules made things worse. As a registered investment advisor, I can't just write freely. Here's what the Valuation section looked like after enforcing all thirty-plus regulatory rules:
Rubric's model-derived valuation estimate is approximately $247, subject to the assumptions described herein. This is a hypothetical result based on quantitative models and historical data; it does not represent actual portfolio performance and is not a guarantee of future results. The estimated valuation range under current assumptions is approximately $218 to $276. Model outputs indicate potential valuation range above current market price, subject to stated assumptions, corresponding to approximately 6.5% potential upside under current model assumptions.
Technically compliant. Practically unreadable. Every sentence is a disclaimer wrapped around a data point. A client reading this learns that the model thinks the stock might be worth more than it trades for, maybe, under certain assumptions, hypothetically, if the wind is blowing the right direction. It doesn't help anyone understand why the model arrived at that estimate or what would have to change for the estimate to be wrong.
The frustrating part is that each compliance rule is individually sensible. You genuinely shouldn't present a model estimate as "fair value" because it implies a precision no model has. You genuinely should disclaim hypothetical results. But when thirty of these rules fire simultaneously during generation, the model can't find a voice. It writes like a compliance officer who's been burned one too many times.
I tried tuning the prompt. I added instructions like "write in a casual, conversational tone, like a sharp analyst talking to a sophisticated client over coffee." The result was somehow worse. The model tried to be casual and compliant at the same time:
So here's the deal: Rubric's model-derived valuation estimate, which is a hypothetical result based on quantitative models and historical data and does not represent actual portfolio performance, comes in around $247.
Nobody talks like that. Not over coffee, not over anything. It read like a legal disclaimer wearing a Hawaiian shirt.
3. Key facts were wrong
The third problem was hallucination. Under the combined pressure of structural compliance and content generation, the model just started making things up. Specific examples I caught during testing:
- A report claimed "the company beat EPS estimates for 8 consecutive quarters" when the source data only covered 4 quarters. Apparently the model liked the narrative so much it decided to extend the streak on its own.
- A Valuation section cited "forward P/E of 24.3" when the actual forward P/E was 29.8. The model had pulled a number from its training data, not from the prompt.
- A Scenario Framework assigned a probability-weighted expected price of $285, even though its own stated weights work out to 0.25 x $320 + 0.50 x $270 + 0.25 x $240 = $275. The math was just wrong.
- A Financial Health section stated "gross margins expanded 200bps year-over-year" when the ratio tables showed 150bps.
Each error seems minor in isolation. In a regulated context, every one is a liability. The SEC does not distinguish between a human analyst who misquotes a number and an AI system that hallucinates one. The advisor is responsible for the accuracy of what goes out the door, regardless of how it was produced.
This is where the code-generation analogy breaks down most sharply. If an LLM writes a function that doesn't compile, the feedback is immediate and binary. Red or green. If it writes a paragraph that claims 200bps of margin expansion instead of 150bps, the error is subtle, plausible, and dangerous. There's no compiler for prose. The verification has to be built into the pipeline itself.
The breakthrough: Separate the Thinking from the Writing
The single-pass approach was asking the model to do four things at once:
- Understand the business deeply,
- Generate original insight,
- Comply with thirty-plus regulatory rules, and
- Format everything into a 13-section structure with word targets, charts and tables.
Honestly, that's not how a human analyst works. That's not how anyone works!
A human analyst reads the data, synthesizes what it means, develops a thesis, and then sits down to write the report. The thinking happens before the writing. The structure serves the insight, not the other way around. You don't start with the table of contents and fill in the analysis. You start with the analysis and figure out how to present it.
Once I realized that, the solution was almost obvious. Stop asking the model to do everything at once. Let it synthesize first, then conform.
Stage one: Think Like an Analyst
Strip away everything except analysis. No section numbers. No word counts. No mandatory tables. No compliance rules beyond "ground your claims in the data." Just seven free-form analytical prompts: business model, strategy, competitive moat, upside thesis, downside thesis, capital allocation, industry trends. Each section runs 700 to 1,000 words of unconstrained reasoning. Total output: 5,000 to 7,000 words of raw analytical prose.
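A minimal sketch of stage one, assuming a generic chat-completion wrapper; `call_llm` is a stand-in for whatever SDK you use, and the prompt wording is abbreviated.

```python
# Stage one: free-form analytical prompts. No section template, no word counts,
# no compliance rules beyond "ground every claim in the supplied data."
ANALYTICAL_ANGLES = [
    "business model", "strategy", "competitive moat", "upside thesis",
    "downside thesis", "capital allocation", "industry trends",
]

ANALYST_SYSTEM = (
    "You are an experienced equity analyst. Reason freely about the company using "
    "only the data provided, and ground every claim in that data. Do not format, "
    "disclaim, or summarize. Just think on paper."
)

def run_stage_one(envelope: dict, call_llm) -> dict[str, str]:
    """Produce 5,000-7,000 words of unconstrained analytical prose, one angle at a time."""
    narrative = {}
    for angle in ANALYTICAL_ANGLES:
        prompt = (
            f"Company data:\n{envelope}\n\n"
            f"Write 700-1,000 words analyzing the {angle}. Connect data points "
            "across categories and explain what the numbers mean for the thesis."
        )
        narrative[angle] = call_llm(system=ANALYST_SYSTEM, user=prompt)
    return narrative
```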
The difference was dramatic. Night and day.
Here's what this stage produces for the same Business Model section that was bland in the single-pass version:
The hardware business is a Trojan horse. Unit sales are essentially flat in mature markets, and the average selling price increases are modest: 3-5% annually, driven by mix shift toward Pro and Ultra tiers rather than actual pricing power on the base models. If you looked only at the hardware segment, you'd see a mature consumer electronics business growing low single digits. Wall Street would put a 12-15x multiple on it and move on.
But hardware is not the business. Hardware is the customer acquisition cost for a services ecosystem that now generates $96B in trailing revenue at 71% gross margins. Every device sold is a subscription funnel. iCloud storage, Apple Music, Apple TV+, the App Store's 15-30% commission, Apple Pay's transaction fees, AppleCare, the advertising business. The installed base of 2.2 billion active devices means even at low single-digit services penetration growth, the revenue compounds mechanically. Last quarter, services grew 14% year-over-year while product revenue grew 2%. The margin mix shift from 36% gross margins on hardware to 71% on services is the single largest driver of earnings growth, and it will continue for years without requiring any strategic brilliance from management.
Now that reads like an analyst wrote it. The insight, that hardware is really just a customer acquisition channel for services, isn't stated in any single data field. It emerges from connecting segment revenue data, the margin differential, and the installed base figure. The model made that connection because it wasn't simultaneously counting words and checking compliance boxes. It was doing one thing: thinking about the business. Give it room to breathe and it actually has something to say.
This is where the reverse code-generation problem gets solved. The model isn't just reporting structured data in paragraph form. It's interpreting it, finding the relationships between data points that a quant sees intuitively but that never appear in any single field. It turns out that LLMs are reasonably good at this, if you give them the room. These connections make all the difference between a data dump and a well-thought-out research report. And they only emerge when the model has the cognitive space to synthesize.
Stage two: Fact Check
Insight without verification is dangerous. Especially when it sounds really convincing.
Before the narrative feeds into the final report, a cheaper, faster model runs a verification pass. It receives the original source data and the stage-one narrative with narrow instructions: check every number against the source, correct mismatches, flag unsupported claims, preserve structure, return corrected text without commentary. No creativity required. Just accuracy.
This phase runs on a small model at near-zero temperature. It's a verification task, not a generation task, so a lightweight model is ideal. Think of it as a very diligent intern with a highlighter. And it catches real errors.
In testing, it corrected a narrative that stated "operating margins expanded from 28% to 31.5% over three years" when the ratio tables showed expansion from 29.2% to 31.5%. That's a 230bp expansion, not 350bp. Not a huge difference in isolation, but the kind of thing that makes a compliance officer's eye twitch. It caught a passage referencing "Q2 2025 revenue of $94.9B" when the source data showed $85.8B; the model had pulled a figure from a different quarter and didn't seem to notice. It flagged a claim about "6 consecutive quarters of positive earnings surprises" when the data only covered 4 quarters.
Each correction is small. Each one matters. The phase costs roughly one-tenth the price of the main model and adds only a few seconds to the pipeline.
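A sketch of that verification pass, again with `call_llm` as a stand-in; the instructions are paraphrased from the ones described above, and the model choice and temperature are the parts that actually matter.

```python
# Stage two: verification on a small, cheap model at near-zero temperature.
FACT_CHECK_SYSTEM = (
    "You are a fact checker. Compare every number and factual claim in the draft "
    "against the source data. Correct mismatches, flag claims the data cannot support, "
    "preserve the draft's structure, and return only the corrected text."
)

def fact_check(envelope: dict, narrative: dict[str, str], call_llm) -> dict[str, str]:
    """Return the narrative with every figure checked against the source envelope."""
    corrected = {}
    for angle, draft in narrative.items():
        corrected[angle] = call_llm(
            system=FACT_CHECK_SYSTEM,
            user=f"Source data:\n{envelope}\n\nDraft:\n{draft}",
            model="small",    # roughly one-tenth the cost of the main model
            temperature=0.0,  # verification, not generation
        )
    return corrected
```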
Stage three: Structure the Insight
Now the model receives the fact-checked narrative alongside the complete data envelope, including the quantitative outputs that were deliberately excluded from the narrative phase: factor scores, valuation estimates, risk model loadings. Its job is fundamentally different from what the single-pass approach attempted. It's not generating insight. It's structuring insight that already exists.
The 13-section template activates. Word targets are enforced. Compliance guardrails are on. But now they constrain synthesis, not creativity. The prompt includes a critical directive: "Do not reproduce the narrative verbatim; synthesize and weave the insights into your structured analysis." The stage-one narrative is raw material, not finished product. The structured report refines, tightens, and contextualizes.
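In code, stage three is mostly prompt assembly. A sketch, reusing the stand-ins from the earlier snippets; the actual template and rule text are far longer.

```python
# Stage three: structure the fact-checked narrative into the 13-section report.
REPORT_SYSTEM = (
    "You are writing a client-facing research report for a registered investment "
    "advisor. Follow the section template, word targets, and compliance rules exactly."
)

def build_report_prompt(envelope: dict, checked: dict[str, str],
                        template: str, compliance_rules: list[str]) -> str:
    """Combine data, narrative, template, and rules into the final-pass prompt."""
    rules = "\n".join(f"- {rule}" for rule in compliance_rules)
    analysis = "\n\n".join(f"[{angle}]\n{text}" for angle, text in checked.items())
    return (
        f"Section template and word targets:\n{template}\n\n"
        f"Compliance rules:\n{rules}\n\n"
        f"Quantitative outputs (factor scores, risk model, valuation):\n{envelope}\n\n"
        f"Analyst narrative (raw material, not finished product):\n{analysis}\n\n"
        "Do not reproduce the narrative verbatim; synthesize and weave the insights "
        "into your structured analysis."
    )
```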
Here's what the Valuation section looks like after the full pipeline:
Analysts are split on whether the services growth premium is already priced in. The consensus target of $248.50 among 47 analysts reflects a 7.1% upside from current levels, but the range is wide: the most bullish target sits at $300, the most bearish at $185. Estimate revisions have been net positive over the past 90 days, with forward EPS estimates ticking up from $7.75 to $7.92, a 2.2% increase that signals improving confidence in the services trajectory.
The earnings track record supports the optimism. The company beat consensus EPS estimates in all four reported quarters: Q3 2025 ($1.64 vs. $1.59 est., +3.1%), Q2 2025 ($1.40 vs. $1.35 est., +3.7%), Q1 2025 ($1.52 vs. $1.50 est., +1.3%), Q4 2024 ($2.40 vs. $2.36 est., +1.7%). The beats are consistent but modest, suggesting analysts are tracking the business closely rather than systematically underestimating it.
Model-implied valuation indicates the stock is priced below Rubric's estimated fair value under current assumptions. Rubric's model-derived valuation estimate is approximately $247, based on current earnings multiples, historical factor exposures, and analyst consensus estimates. This is a hypothetical result based on quantitative models and historical data; it does not represent actual portfolio performance and is not a guarantee of future results.
Go back and compare this to the single-pass version, the wall of disclaimers that said almost nothing. The compliance language is still here ("model-implied," "hypothetical result," "does not represent actual portfolio performance"), but it's woven into a section that actually says something. The debate framing ("whether the services growth premium is already priced in") came from the stage-one narrative. The guardrails shaped the language without strangling the insight.
A client reading this section understands the valuation debate. They can see that analysts are generally optimistic but the range is wide. They can see that earnings execution has been consistent. They can see where our model's estimate sits relative to consensus. They understand the reasoning. And that's the whole point.
Translating Quant Speak
The pipeline solves the translation problem that motivated the entire project. Consider how raw factor scores, the ones a quant reads in thirty seconds, appear in the final report.
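Here's roughly what the quant-facing view looks like. The numbers are illustrative (only the +2.05 quality z-score is quoted elsewhere in this piece), but the shape is right and consistent with the rankings described below.

```python
# What the quant sees: cross-sectional z-scores and sector percentiles.
# Illustrative values, consistent with the rankings in the translated paragraph below.
raw_factor_scores = {
    #  factor       (z-score, sector percentile)
    "value":        (-0.85, 28),
    "growth":       (+1.30, 84),   # top quintile
    "quality":      (+2.05, 93),   # top decile
    "momentum":     (+1.10, 81),
    "sentiment":    (+1.45, 88),
    "volatility":   (+0.60, 70),   # runs above the sector average
}
```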
If you showed the raw scores to a client, they'd smile politely and wonder when the meeting would be over.
What the report actually says:
Rubric's factor model ranks this stock in the top decile for quality within its sector, driven by the 157% return on equity and 31.5% operating margins that sit well above sector averages. Growth scores rank in the top quintile, reflecting the 14% services revenue trajectory and consistent earnings beats. Analyst sentiment is strongly positive, with estimate revisions trending upward across all horizons. The stock does not screen as cheap on traditional value metrics (the 34.2x trailing P/E is above sector median), and its volatility profile runs higher than average, consistent with a large-cap growth name that trades on narrative shifts.
The quant content is fully preserved. The z-scores, the factor rankings, the risk characteristics, they're all there. But they're expressed in language that a client can follow. "Top decile for quality" means something. "+2.05 z-score" does not, unless you already know what it means, in which case you didn't need the report in the first place.
This is the heart of the reverse code-generation problem. When an LLM converts natural language to code, it's compressing ambiguous human intent into precise machine instructions. When this pipeline converts factor scores into narrative, it's doing the opposite. Expanding precise numerical outputs into the contextual, interpretive language that humans need to understand what the numbers actually mean. The z-score is the compressed representation. The paragraph is the decompressed, human-readable version. Both contain the same information. Only one of them my clients can actually follow.
The compliance architecture
Regulatory compliance operates at two levels, and the separation is deliberate.
In the report body, inline rules shape how the model writes. Don't call anything a "Strong Buy." Don't present a model estimate as "fair value." Frame upside as model-implied and approximate. Present analyst consensus neutrally, as a data point, not as validation of a thesis. These rules are much easier to follow when the model is performing synthesis rather than generation. When it's synthesizing a pre-existing narrative into a structured format, the compliance rules function as style constraints on an already-formed analysis. When it's generating from scratch under the same constraints, the rules compete with the creative process and the output goes sideways. Or, as we saw, it puts on a Hawaiian shirt and tries to be casual about disclaimers.
Here's the key insight: compliance and clarity are not at odds if you sequence them correctly. The single-pass approach treated them as simultaneous constraints, and the model resolved the tension by sacrificing clarity. The multi-stage approach lets the model develop clear thinking first, then apply compliance as a refinement layer. The result is language that is both compliant and readable. Which is exactly what a regulator should want, because the net result of over-hedged language is that nobody actually reads it.
In the PDF export, nine structured disclosure sections (purpose and limitations, data sources, AI methodology, model limitations, hypothetical performance, forward-looking statements, conflicts of interest, and others) are appended programmatically. They're not generated by the LLM. They're hard-coded: always present, always correctly formatted, never subject to hallucination.
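A sketch of that second layer, assuming a simple section-assembly step before PDF rendering; the disclosure text itself would come from compliance and counsel, never from a model, and only the seven sections named above are shown here.

```python
# PDF export: disclosure sections are hard-coded, never generated by the LLM.
# Only the seven named above are listed; the real export carries the full set.
DISCLOSURES = {
    "Purpose and Limitations": "<fixed text reviewed by compliance>",
    "Data Sources": "<fixed text>",
    "AI Methodology": "<fixed text>",
    "Model Limitations": "<fixed text>",
    "Hypothetical Performance": "<fixed text>",
    "Forward-Looking Statements": "<fixed text>",
    "Conflicts of Interest": "<fixed text>",
}

def assemble_pdf_sections(report_body: str) -> list[tuple[str, str]]:
    """LLM-written body first, then the fixed disclosures, always in the same order."""
    return [("Report", report_body)] + list(DISCLOSURES.items())
```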
The end result is a report body that reads like an analyst wrote it. The disclosures read like a compliance officer or a lawyer wrote them. Each is appropriate for its purpose.
Cost Tradeoff
The pipeline costs more. Three LLM invocations instead of one. The narrative phase uses the most capable model available. The fact-check phase uses a small model at roughly one-tenth the cost. The report phase uses the capable model again with a higher token budget. Total latency is 45 to 90 seconds versus 20 to 30 for a single pass.
The system is more complex too. Three prompt templates. Two system prompts. Orchestration logic that handles phase failures gracefully, falling back to unchecked narrative if the fact-checker fails, reporting errors without crashing if the final report truncates.
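Stitched together, the orchestration looks roughly like this, reusing the functions sketched earlier; the error handling is simplified to the two failure modes that matter.

```python
import logging

# Orchestration: synthesize, then verify, then structure, with graceful degradation.
def generate_report(envelope: dict, template: str, rules: list[str], call_llm) -> dict:
    narrative = run_stage_one(envelope, call_llm)            # most capable model

    try:
        checked = fact_check(envelope, narrative, call_llm)  # small model, temperature ~0
    except Exception:
        # Fall back to the unchecked narrative rather than failing the whole run.
        logging.warning("fact-check stage failed; using unchecked narrative")
        checked = narrative

    try:
        prompt = build_report_prompt(envelope, checked, template, rules)
        body = call_llm(system=REPORT_SYSTEM, user=prompt, max_tokens=16_000)
    except Exception as exc:
        # Report the error without crashing; a truncated or failed report is flagged for review.
        return {"status": "error", "detail": str(exc)}

    return {"status": "ok", "body": body, "sections": assemble_pdf_sections(body)}
```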
What we got is a report that bridges the gap I set out to bridge. The factor scores are translated into business context. The valuation model output is framed as a debate, not a price target. The earnings history is woven into a narrative about whether management is executing. The regulatory language is present but unobtrusive. It reads like something a thoughtful analyst would actually put their name on.
A single-pass report that needs 15 to 20 minutes of cleanup (fixing hallucinated numbers, injecting cross-references, smoothing robotic prose) costs more in human time than the extra API calls ever will. A multi-stage report that a compliance officer can review without redlining every other sentence saves even more.
The principle generalizes
The pattern extends well beyond finance. When you ask a language model to simultaneously optimize for creativity, precision, structure, compliance, and formatting, quality degrades across all of them. The model spreads its attention across competing objectives and settles for "good enough" on each one. Separate the objectives into sequential stages, each with a clear mandate and the right model for the job, and the output improves on every dimension.
In finance, that means research reports clients can actually follow. In law, it means briefs with stronger reasoning. In medicine, it means patient summaries with clearer risk framing. In engineering, it means documentation that balances rigor and readability. Anywhere you have a domain expert who needs to communicate complex, structured information to a less specialized audience, the same pipeline architecture applies. Synthesize first, verify second, structure third.
Most LLM applications focus on the natural-language-to-structured-output direction: chatbots, code generation, data extraction, function calling. The reverse direction, structured data into accessible narrative, is less explored but equally important. It requires not just fluency but interpretation. Not just accuracy but insight. The model has to do what a great analyst does: look at the numbers and explain what they mean.
The bridge, not the replacement
I don't expect this pipeline, or any LLM architecture, to replace a human analyst. A great analyst brings judgment that no model possesses. The intuition from watching a management team navigate three tough cycles. The pattern recognition from reading thousands of earnings transcripts. The ability to weigh what a CEO didn't say as much as what they did. The model will never call an investor relations team, attend an industry conference, or sense that management's body language doesn't quite match the guidance. It doesn't have a gut.
But it can get closer than I expected.
What the pipeline does is handle the 80% of the work that is synthesis and structuring, and it does it with a level of analytical depth that genuinely surprised me. The "hardware as Trojan horse" insight, the margin mix shift narrative, the framing of valuation as debate rather than target: these emerged from the pipeline, not from manual editing. They aren't as sharp as what the best human analyst would produce on their best day. But they're in the same neighborhood. And they're produced in 90 seconds instead of two hours.
That frees me to focus on the 20% that requires genuine expertise. The contrarian insight. The non-obvious risk. The thesis that only forms after years of watching an industry evolve. The pipeline handles the translation. I handle the judgment.
When I started building Rubric Advisors, I thought the quant models would be the product. I was wrong. The quant models are the engine. The product is comprehension. It's the client sitting across from me and understanding not just what we're doing with their portfolio, but why. Models produce signals. Humans need stories.
A multi-stage LLM pipeline doesn't replace analysis. It separates synthesis from writing, creativity from compliance, insight from formatting. And in doing so, it turns raw model output into something that resembles what a seasoned analyst would deliver.
Not because the model became smarter. But because the architecture respected how analysts actually think.
That's the bridge. And it gets a little closer every day.
