Quantifying the Long-Form Data Bottleneck
Making up numbers: ❌. Making up numbers in a Monte Carlo simulation: ✅
Last post, I wrote about how long-form data might be a critical bottleneck in achieving AGI, slowing progress and providing a rare opportunity for tractable governance.
That post focused on the theory. Here, I’ll attempt to estimate the actual delay entailed by the data bottleneck.
To do that, I’ve built a Monte Carlo simulation tool (available here!) based on the open questions I previously laid out.
Simulation Framework
TL;DR
The basic framework is pretty simple—the bulk of the delay comes from the number of data collection ‘batches’ required, multiplied by the length of the collected workflows. This is represented by the second term in the equation below.
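In rough terms (this is my reading of the setup; the tool's exact formula may differ, and the contract-delay first term is described further below):

$$
\text{delay} \approx \underbrace{d_{\text{contract}}}_{\text{onboarding}} + \underbrace{\left\lceil \frac{\text{workflows still needed}}{\text{usable workflows per batch}} \right\rceil}_{\#\text{ of batches}} \times \underbrace{\big(\text{extrapolation ratio} \times \text{time horizon}\big)}_{\text{workflow length}}
$$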
More descriptively, a workflow is a single sample in the new long-context training data set; it consists of the various inputs & outputs involved in an x-month-long project, appended together into a cohesive flow. You can imagine the collection setup as involving some background software that, over the course of a project, copies all the searches you make, sources you read, messages you send, drafts you iterate, etc. into a long running doc.
These are full-time projects, so each worker can only produce one workflow at a time. Moreover, because labs don’t have infinite hiring capacity (effective workforce), they may need to conduct multiple batches of data collection to gather the required amount, and so the delay often ends up longer than simply the effective workflow length (i.e. how long the project actually needs to be).
I estimate the delay for three different domains (including consideration of capability transfer): general white-collar work, software/coding, and AI/ML research.
For Nerds
That’s the TL;DR—if you want to examine a more precise representation of what’s going on in the simulation, refer to the specifications below.
Random Variables
Fixed Inputs
Derived Quantities
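Pulling those three pieces together, here's a minimal end-to-end sketch in Python of how I imagine the simulation is structured. This is not the tool's actual code: the variable names, the PERT-via-Beta construction, and the pooling of the three domains into a single workforce are my assumptions, and industry extrapolation is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pert(low, mode, high, lam=4.0, size=None):
    """PERT distribution via a rescaled Beta on [low, high]."""
    a = 1 + lam * (mode - low) / (high - low)
    b = 1 + lam * (high - mode) / (high - low)
    return low + (high - low) * rng.beta(a, b, size)

def sample_lognormal(mean, sd, size=None):
    """Lognormal parameterized by its arithmetic mean and SD."""
    sigma2 = np.log(1 + (sd / mean) ** 2)
    return rng.lognormal(np.log(mean) - sigma2 / 2, np.sqrt(sigma2), size)

def simulate_delay(n=10_000, horizon_months=6):
    # Random variables (defaults described below)
    total_needed = sample_lognormal(1e6, 5e5, n)        # total workflows needed
    time_ratio   = sample_pert(0.1, 0.4, 0.9, size=n)   # time extrapolation ratio
    quality      = sample_pert(0.5, 0.7, 0.9, size=n)   # quality threshold
    existing     = sample_lognormal(3e5, 1e5, n)        # existing workflows
    contract     = rng.gamma(shape=2.25, scale=1 / 1.5, size=n)  # mean 1.5, SD 1 (months)

    # Fixed inputs (pooling the three domains into one ~58k-worker batch for simplicity)
    synthetic_frac = 0.05
    workers_per_batch = 35_000 + 15_000 + 8_000

    # Derived quantities
    net_needed   = np.maximum(total_needed * (1 - synthetic_frac) - existing, 0)
    batches      = np.ceil(net_needed / (workers_per_batch * quality))
    workflow_len = time_ratio * horizon_months          # months per collected workflow

    return contract + batches * workflow_len            # delay in months

delays = simulate_delay()
print(f"median ≈ {np.median(delays):.0f} months ({np.median(delays) / 12:.1f} years)")
```

Because this sketch pools all three domains into one big workforce, it will understate the per-domain delays relative to the actual tool.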
Variables — Specifics & Estimates
Next, I’ll elaborate a bit on each of the variables, and provide justification for my default(s). From most to least contentious + important (roughly):
Total Workflows Needed
Exactly what it sounds like.
Very important for determining the delay, but I’m quite uncertain about my estimate here, especially since labs seem pretty hush-hush about their training data.
Perhaps it’s reasonable to assume that the data requirements will be much closer to the level of fine-tuning than pre-training. If we assume that fine-tuning data sets are around 1% as large as pre-training data sets, then we can approximate that we’ll need around 1m samples1. In the papers I’ve seen, fine-tuning data set sizes range from approximately 50k to 2m samples, so this figure checks out.
However, I’m not actually sure if the data requirements will be so closely aligned to fine-tuning as opposed to pre-training. First, going back to the analogy from the last post: if you pre-trained a model solely on 3-token fragments, it seems doubtful that you could just fine-tune that model on 1% as many 300-token paragraphs and still get advanced capabilities like grammar. The difference between 10k-token samples and 1m-token workflows might not be as extreme, but it still seems like there would be many relevant patterns that are almost entirely absent from shorter samples, thus necessitating a greater number of long samples. Learning to handle a long-term project seems closer to learning grammar than to learning the relatively narrow capabilities taught through fine-tuning & RL. Second, as input length increases, the number of potential relationships and cross-dependencies within the data explodes. A larger state space requires more data to cover it, much as in robotics.
There’s obviously a significant difference between 50k vs. 500k vs. 5m vs. 50m, so definitely worth looking more into.
My default estimates (lognormal):
Mean: 1m
SD: 500k
Time Extrapolation Ratio
This variable determines the required sample length, as a fraction of the target time horizon (e.g. with a ratio of 0.4 and a 6-month horizon, each collected workflow needs to span roughly 2.4 months).
I’m also not terribly confident in my estimate here. On the one hand, we observe empirically that even when models are trained on, e.g., 8k-token sequences, their performance on long-context evals drops sharply when the input merely doubles to 16k. On the other hand, this is an observation about models with naively expanded context windows; it might not generalize to models that use RAG or other algorithms to handle lengthy inputs.
Nonetheless, it seems reasonable to assume somewhat limited extrapolation at the level of this simulation. It seems unlikely that a 1-2 day workflow (~1% of 6 months) would contain the signals relevant to an involved multi-month project. This might similarly be true for a months-long project vs. years-long career vs. centuries-long societal progress, but this is much more uncertain (and outside the scope of this simulation).
My default estimates (PERT):
Low: 0.1
Mode: 0.4
High: 0.9
Industry Extrapolation
This variable estimates how much data collected from one domain (e.g. white-collar workflows) can contribute to the capabilities needed for another domain (e.g. AI research workflows).
In other words, it considers how training on, for example, exclusively white-collar workflows, might still teach the model some skills relevant to handling software engineering projects.
Referring back to my analysis of Chinchilla scaling laws, we could represent this cross-domain transfer as ‘partially relevant data’. For example, models aren’t completely incoherent over long contexts because short samples provide the ‘grammar fraction’ of long samples—this can be approximated as a short sample ‘counting’ as a fraction of a long sample.
This variable basically asks: what % overlap is there between a white-collar workflow and an AI research workflow, and vice versa? Currently it’s a single variable that applies across each combination of domains, but it might be better to allow different rates to be set for different domain combinations.
Honestly, it’s a bit unclear how I’d go about making a principled estimate for this variable—the current default is largely just my intuition. Since this is an important variable, I definitely welcome suggestions.
My default estimates (PERT):
Low: 0.1
Mode: 0.2
High: 0.5
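For concreteness, here's one way this transfer rate could enter the accounting (my guess at the mechanics rather than the tool's actual implementation): workflows collected in other domains count toward a target domain at a discount equal to the transfer rate.

```python
# Hypothetical illustration: effective workflows available for AI/ML research,
# counting other domains' data at the cross-domain transfer rate (mode = 0.2).
transfer_rate = 0.2
collected = {"white_collar": 400_000, "coding": 150_000, "ai_ml": 20_000}  # made-up counts

effective_ai_ml = collected["ai_ml"] + transfer_rate * (
    collected["white_collar"] + collected["coding"]
)
print(effective_ai_ml)  # 20k + 0.2 * 550k = 130,000 effective workflows
```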
Quality Threshold
The % of collected workflows that are of high enough quality to use as training data.
I’m very uncertain about my estimate here. On the one hand, it could be high if labs only contract competent firms. On the other hand, even within competent firms it’s not necessarily the case that most projects turn out successfully. But I’m not sure if external success is even the most relevant metric here, as opposed to the actual process quality. Process quality could refer to clear documentation, explicit feedback loops, detailed task records, etc.—this seems harder to evaluate, but could be a more reliable standard.
Also, AI/ML research processes are not as well defined as basic white-collar work processes, so that domain might have higher variance and drag down the mode. It might also make sense to have multiple domain-specific variables here.
Again, the default here is mostly just intuition, and it covers a decently wide range.
My default estimates (PERT):
Low: 0.5
Mode: 0.7
High: 0.9
Existing Workflows
The # of workflows that can be collected and organized from existing data stores.
I assume this process takes much less time than actually generating new workflows, so in the model these existing workflows are simply subtracted from the total required.
Workers probably have a good deal of data on previous projects, but this data is also likely incomplete. For example, I’d imagine a lot of feedback data would be lost to unrecorded meetings, and it might be impossible to recall searches or sources used without citation (e.g. a coder referring to Stack Exchange solutions). If it ends up being important for a workflow to be arranged chronologically, then this would also pose a significant challenge.
To make my estimate, I assume that the average worker has 10 accessible 6-month-long workflows, out of which 5% are sufficiently complete, and labs are only interested in contracting the top 1-5% of workers2. But I’m still quite uncertain, and am not attached to any of these numbers.
My default estimates (lognormal):
Mean: 300k
SD: 100k
Synthetic Data Fraction
The % of workflows that can be synthesized. Like existing workflows, I simply subtract these from the total required as synthesis should take much less time compared to manual generation & collection. (However, compute constraints might be something to consider here).
Reasoning from my last post:
AI 2027’s predictions rely on synthetic data, but little evidence or reasoning is offered for why this would be an adequate solution. Intuitively, since models cannot independently produce high-quality long-form work (that is, in fact, what we are trying to train them to do), they would require human guidance to even attempt it. But to maintain the efficiency of automated synthesis, that guidance must be uniformly applied across the synthesized data, which will ultimately fail to represent the dynamic permutations of real human memory and attention patterns. Any attempt to use synthetic generation will only produce counterproductive rigidity and uniformity. Empirically, recent work shows that even inserting 1% synthetic data into a long-context fine-tuning data set causes measurable performance degradation.
My default estimate:
Synthetic data: 5% of total workflows required
Max % Workforce Leverage
The max % of each domain’s workforce that a lab can hire per batch. I assume max hiring costs of ~$10B/year, which is quite high, but not unrealistic given current frontier lab spending and expected revenues. I also assume a 50/50 budget split between white-collar work and coding/software/AI research (a rough sanity check on the implied per-worker costs follows the defaults below).
My default estimates:
White-collar: 0.05% (35k workers)
Coding/software: 1% (15k workers)
AI/ML research: 40% (8k workers)
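A rough sanity check on those counts (the 50/50 budget split is from above; the implied per-worker costs are just back-of-the-envelope, not inputs to the tool):

```python
# Implied percentages and per-worker costs under a ~$10B/year hiring budget,
# split 50/50 between white-collar and coding/software + AI/ML research.
budget = 10e9
wc, cs, ai = 35_000, 15_000, 8_000        # workers hired per batch

print(wc / 70e6, cs / 1.5e6, ai / 20e3)   # 0.0005, 0.01, 0.4 -> 0.05%, 1%, 40%
print((budget / 2) / wc)                  # ~$143k/year per white-collar worker
print((budget / 2) / (cs + ai))           # ~$217k/year per technical worker
```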
Contract Delay
One-off cost of onboarding workers, in months. Relatively negligible in most cases.
My default estimates (gamma):
Mean: 1.5
SD: 1
Base Workforce By Industry
Figures for the U.S. workforce. Pretty straightforward; honestly, this might not even be a necessary input now that I think about it.
My default estimates:
White-collar: 70m
Coding/software: 1.5m
AI/ML research: 20k
Default Simulation Results
Using all the above estimates, here are the simulation results for a 6-month time horizon:
Seems reasonable to me…but again I’m pretty uncertain of some of the defaults. Keep in mind that this is purely the delay time, not the total time to AGI. In fact, under this framework, it seems pretty plausible to me that if AGI requires not just 6-month but multi-year time horizons, it might be necessary to collect longer workflows, further stretching out the delay.
But quantification aside, the qualitative aspect of this ~4-8 year delay is particularly significant. If data generation & collection are the relevant bottlenecks, then there’s a major opportunity to enact governance measures targeting this visible, gradual, and legibly concerning process. It also makes for effective state-level policy, as some states disproportionately house the workers/firms that labs would want to contract with.
From my last post:
Data collection activities are concrete and observable—they serve as a visible friction point. If a frontier lab begins contracting to collect coding workflows, that’s a strong signal it’s aiming to automate AI research. If it starts licensing white-collar enterprise logs, this suggests employee replacement is on the list.
There exist routine regulatory justifications, like privacy or antitrust, that could be employed to target data collection activities. For example, California’s AB-2013 (effective starting January 2026) will require AI developers to publicly disclose the source and structure of their training data. Ideally, laws like this could be expanded to mandate transparency well before model deployment. Such disclosures would give the government a clearer picture of AI companies’ intentions and capabilities—potentially averting the kind of unilateral, destabilizing action described in AI 2027. Given this existing precedent, and the fact that the majority of frontier labs are headquartered in California, this governance approach seems particularly promising.
Share Feedback!
Feel free to play around with the tool, especially if you think any of my default values are totally off. Comment your results, interesting observations, and/or suggestions/questions/concerns—I’m very interested in hearing people’s feedback & improving the simulation!
Also, if you want to give feedback on my writing/blog in general: new anonymous feedback form! 🍋
BOTEC - total workflows needed
500 bil unique pre-training tokens * 0.01 = 5 bil / 5k avg sample size = 1m samples
BOTEC - existing workflows
70m WC * 5 WFs = 350m * 0.015 (quality) = 5.25m * 0.05 (completeness) = 262.5k EWFs
1.5m C/S * 5 WFs = 7.5m * 0.05 (quality) = 375k * 0.05 (completeness) = 18.75k EWFs
20k AI/ML * 5 WFs = 100k * 0.2 (quality) = 20k * 0.05 (completeness) = 1k EWFs
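The same arithmetic as a quick snippet, for anyone who wants to tweak the fractions (labels follow the BOTEC above):

```python
# workforce * workflows per worker * quality fraction * completeness fraction
def existing_workflows(workforce, wfs_per_worker, quality, completeness):
    return workforce * wfs_per_worker * quality * completeness

print(existing_workflows(70e6, 5, 0.015, 0.05))   # 262,500 white-collar EWFs
print(existing_workflows(1.5e6, 5, 0.05, 0.05))   #  18,750 coding/software EWFs
print(existing_workflows(20e3, 5, 0.2, 0.05))     #   1,000 AI/ML research EWFs
```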
Stefan Schubert and I had a Twitter fight related to the time extrapolation ratio, where my intuition was that most "long time horizon" tasks actually contain enough recursion and self-reference that they should be thought of as mere sequences of much shorter-horizon tasks.
This seems especially true to me from the data side. Even non-agentic LLMs can be pretty good at choosing between specific options, which suggests they can do short (let's say one-day) "planning" tasks for, e.g., six-month projects. Once that's done, an agent needs only pretty minimal scaffolding to prompt itself to pick up the plan, note which task comes next, and execute, without needing to have all the previous work in context. The longest task data you need for that agent to be capable (even at a 1:1 time extrapolation) might only be a couple of days, if not hours, depending on how long it takes to make a plan with good enough pointers to the kinds of tasks that need doing.
A simple example is the chapters of a book. You can outline those with some detail and subheadings in a day, but draft each chapter pretty independently over the course of a few weeks. In that sense, "writing a book" isn't an 18-month-long task, but 18 one-month tasks with some minor scaffolding that wouldn't be very different for a 6-month book or a 24-month book.
My sense is that most work can be thought of in this way, and if you turn just the time extrapolation ratio numbers down a lot in your model, timelines get quite short!
Of course, Stefan did think I was making some fundamental error here that I failed to understand, so who knows. https://x.com/Mjreard/status/1902466669940756767