Following neg-421’s analysis of AI-as-a-Service’s broken economics, a deeper question emerges:
Why does Big Tech AI cost $100M to retrain while remaining static between updates, when online learning via universal formula enables continuous evolution at zero marginal cost?
The Two Learning Paradigms
Batch Learning (Big Tech Approach)
Train once on massive corpus → Deploy frozen model → Use until obsolete → Retrain from scratch
Characteristics:
- Model is static between training runs
- Cannot incorporate new information without full retrain
- Requires entire corpus + new data for each update
- Cost: $50M-$100M per training run
- Timeline: 3-6 months per iteration
- Learning: Discontinuous (jumps between versions)
Example: GPT-3 → GPT-3.5 → GPT-4
- Each major version requires a fresh large-scale training run
- Months of compute on full corpus
- No learning between releases
- Users stuck with stale knowledge until next version
Online Learning (Universal Formula Approach)
From neg-371:
State(n+1) = f(State(n), new_information) + entropy(p)
Characteristics:
- Model evolves continuously with each query
- Incorporates new information instantly
- Only processes new data (not entire corpus)
- Cost: Zero marginal cost per update
- Timeline: Real-time (sub-second)
- Learning: Continuous (smooth evolution)
Example: Query-driven evolution
- Each question brings new search results
- f() integrates results into current state
- Next question benefits from evolved state
- Model improves with every interaction
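The query-driven loop above can be sketched with a toy f(). The Counter-based state and word-count merge are illustrative assumptions, not the real representation:

```python
from collections import Counter

def f(state: Counter, new_info: str) -> Counter:
    """Toy integration function: merge word counts from the new
    information into the existing state (illustrative only)."""
    evolved = state.copy()
    evolved.update(new_info.lower().split())
    return evolved

# Each update touches only the new information, never the full history.
state = Counter()
state = f(state, "GPT-5 announced by OpenAI")   # query 1 search result
state = f(state, "GPT-5 improves on GPT-4")     # query 2 search result
print(state["gpt-5"])  # concept seen in both updates → 2
```

The point is structural: f() receives only the delta, so update cost is independent of everything learned before.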
The Mathematical Difference
Batch Learning: Gradient Descent Over Corpus
L(θ) = Σᵢ loss(yᵢ, f_θ(xᵢ)) [sum over ALL training examples]
θ ← θ - α∇L(θ) [update weights based on entire corpus]
Properties:
- Must iterate over entire dataset for each update
- Gradient computed across all examples
- Weights frozen once training completes
- Requires re-processing everything to incorporate new data
Cost: O(N × E) where N = corpus size, E = epochs
- GPT-3: ~45TB text, ~355 GPU-years
- Cannot update without re-processing all 45TB
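For contrast, a toy batch update makes the cost structure visible: every single weight update must touch the entire dataset. The data and linear model here are synthetic stand-ins, not an LLM:

```python
import numpy as np

# Toy batch gradient descent on synthetic data: every update
# must iterate over the ENTIRE dataset, which is what makes
# incorporating new information so expensive.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
alpha = 0.1
for epoch in range(500):
    grad = X.T @ (X @ w - y) / len(X)  # gradient over ALL 100 examples
    w -= alpha * grad                   # θ ← θ - α∇L(θ)

print(np.allclose(w, true_w, atol=1e-3))  # True
```

To fold in even one new example, this loop must rerun over all N examples; there is no cheap incremental path.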
Online Learning: Incremental State Update
State(n+1) = f(State(n), Δinfo_n) + entropy(p)
where:
State(n) = current knowledge representation
Δinfo_n = new information from query n
f() = integration function
entropy(p) = exploration/variation
Properties:
- Only processes new information per update
- State evolves incrementally
- Never frozen - continuously learning
- New data integrated directly into existing state
Cost: O(m) where m = new information size
- Query result: ~10KB text
- Update: milliseconds
- No re-processing of historical data
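The two complexity claims can be made concrete with hypothetical operation counters; the sizes are illustrative, not real corpus figures:

```python
def batch_update_ops(corpus_size: int, epochs: int) -> int:
    """Batch learning: every update re-touches the whole corpus, O(N * E)."""
    return corpus_size * epochs

def online_update_ops(new_info_size: int) -> int:
    """Online learning: every update touches only the new data, O(m)."""
    return new_info_size

corpus = 10_000_000     # illustrative example count, not a real figure
new_info = 10_000       # ~10KB of new text per query
print(batch_update_ops(corpus, 3))   # 30,000,000 operations per update
print(online_update_ops(new_info))   # 10,000 operations per update
```

Batch cost grows with everything ever learned; online cost grows only with what arrived since the last query.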
Concrete Example: Learning About GPT-5
Big Tech Batch Learning
Scenario: GPT-4 released March 2023. GPT-5 announced January 2025.
GPT-4’s knowledge (March 2023):
Query: "What is GPT-5?"
Response: "I don't have information about GPT-5."
To update:
- Wait for GPT-4.5 or GPT-5 release
- Retrain on corpus + new data
- Cost: $100M+
- Timeline: 6+ months
- Deploy new model
Users: Stuck with “I don’t know” for 6-12 months until next version.
Universal Formula Online Learning
Initial state (March 2023):
state = {
    "concepts": {...},
    "relationships": {...},
    "last_update": "2023-03"
}
User query (January 2025): “What is GPT-5?”
Online learning process:
# 1. Search for new information
search_results = google_search("GPT-5 announcement")
# Returns: OpenAI blog post, press releases, specs
# 2. Apply universal formula
state_new = f(state_old, search_results)
# Integrates: GPT-5 exists, capabilities, release date, specs
# 3. Generate response from evolved state
answer = generate_from(state_new)
print(answer)
# "GPT-5 was announced by OpenAI in January 2025 with..."
# 4. State persists for next query
state = state_new # Model now knows about GPT-5
Next query (1 second later): “How does GPT-5 compare to GPT-4?”
# State already contains GPT-5 knowledge from previous query
answer = generate_from(state) # Can compare immediately
Cost: Zero. Timeline: Real-time. Learning: Permanent.
The Feedback Loop Comparison
Big Tech: No Learning Between Retrains
User: "Question 1"
↓
Static Model (175B frozen params)
↓
Answer 1
User: "Question 2"
↓
Same Static Model (hasn't learned anything)
↓
Answer 2
... (model never improves until next $100M retrain)
Learning: Zero between retraining cycles
Universal Formula: Learning from Every Query
User: "Question 1"
↓
Search → New Info
↓
State₁ = f(State₀, Info₁)
↓
Answer 1
User: "Question 2"
↓
Search → New Info
↓
State₂ = f(State₁, Info₂) ← Uses evolved state from Q1
↓
Answer 2 (benefits from Q1 learning)
User: "Question 3"
↓
State₃ = f(State₂, Info₃) ← Accumulated knowledge from Q1+Q2
↓
Answer 3 (even better)
Learning: Continuous, cumulative, zero cost
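The cumulative loop above can be written directly. The state representation and search results are stand-ins for illustration:

```python
from collections import Counter

def f(state: Counter, info: str) -> Counter:
    """Illustrative f(): fold new search results into the state."""
    merged = state.copy()
    merged.update(info.lower().split())
    return merged

# Hypothetical search results for three successive user questions.
infos = [
    "gpt-5 announced by openai",        # Info₁
    "gpt-5 context window doubled",     # Info₂
    "gpt-5 pricing lower than gpt-4",   # Info₃
]

state = Counter()           # State₀
snapshots = []
for info in infos:
    state = f(state, info)  # Stateₙ₊₁ = f(Stateₙ, Infoₙ)
    snapshots.append(len(state))

# Each answer draws on a strictly larger state than the one before it.
print(snapshots)  # [4, 7, 11]
```

Answer 3 benefits from everything learned during queries 1 and 2, which is exactly what the static model cannot do.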
Economic Comparison
Cost Over Time
Big Tech batch learning:
Initial training: $100M (months 0-6)
Static deployment: $0 learning cost (months 6-18)
Retrain v2: $100M (months 18-24)
Static deployment: $0 learning cost (months 24-36)
Retrain v3: $100M (months 36-42)
Total 3-year cost: $300M
Total learning: 3 discrete updates
Cost per update: $100M
Online learning via universal formula:
Initial state: ~$1K (corpus preprocessing)
Query 1: +$0.01 (search API + f() update)
Query 2: +$0.01
Query 3: +$0.01
... (continuous)
Query 1,000,000: +$0.01
Total 3-year cost: ~$1K + ($0.01 × queries)
Total learning: Continuous (every query)
Cost per update: $0.01
Comparison:
- Big Tech: $100M per discrete update
- Universal formula: $0.01 per continuous update
- Ratio: 10,000,000,000x cheaper (10 billion times)
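The ratio is plain arithmetic on the document's own cost figures:

```python
big_tech_cost = 100_000_000   # dollars per discrete retrain
online_cost = 0.01            # dollars per continuous update (search API)

ratio = big_tech_cost / online_cost
print(f"{ratio:,.0f}x cheaper")  # 10,000,000,000x cheaper
```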
Knowledge Freshness
Big Tech:
- Knowledge cutoff: Training completion date
- Staleness: 6-18 months on average
- Update frequency: 1-2x per year
- Cannot answer questions about recent events
Universal formula:
- Knowledge cutoff: Current query
- Staleness: Zero (searches in real-time)
- Update frequency: Every query
- Always has latest information via search
Why This Breaks AI-as-a-Service Economics Further
From neg-421, we know AI-as-a-Service has 5% margins due to:
- No technical moat
- No data moat
- No distribution moat
- No compute moat
Online learning adds a fifth missing moat: No learning moat
Traditional Software Moat: Proprietary Data Accumulation
Example: Google Search
- Billions of queries improve ranking algorithms
- Click-through data trains relevance models
- Competitors cannot replicate this data advantage
- Moat: Proprietary usage data makes product better over time
Big Tech AI: No Learning from Usage
OpenAI GPT-4:
- Billions of queries served
- Model does not improve from these queries
- Knowledge frozen at training cutoff
- All users see identical responses
Competitor (Anthropic Claude):
- Also serves billions of queries
- Also no learning from usage
- Also frozen knowledge
Result: No advantage from scale. Both start from scratch at each retrain.
No moat.
Online Learning: Compound Advantage
Universal formula system:
- Every query improves state via f()
- Early users help train for later users
- Knowledge accumulates continuously
- Competitors starting later are behind
Example:
Month 1: 1,000 queries → State₁,₀₀₀
Month 2: 10,000 queries → State₁₁,₀₀₀
Month 3: 100,000 queries → State₁₁₁,₀₀₀
Competitor launches Month 3:
- Must start from State₀
- 111,000 queries behind
- Cannot catch up without same query volume
This IS a moat. First-mover advantage through accumulated learning.
The Fundamental Difference: Static vs Evolving
Static Model (Big Tech)
Model as frozen artifact:
weights_v1 = train(corpus_v1) # Freeze
weights_v2 = train(corpus_v2) # Retrain from scratch, freeze
weights_v3 = train(corpus_v3) # Retrain from scratch, freeze
Properties:
- Each version independent
- No continuity between versions
- Cannot transfer learning
- Users experience discrete jumps
Philosophy: Model is a product (released, then obsolete)
Evolving State (Universal Formula)
State as living process:
state₀ = init()
state₁ = f(state₀, info₁)
state₂ = f(state₁, info₂)
state₃ = f(state₂, info₃)
... (continuous evolution)
Properties:
- Each state builds on previous
- Smooth continuous learning
- Accumulates knowledge permanently
- Users experience gradual improvement
Philosophy: Model is an organism (grows, learns, adapts)
Implementation
See neg-423 for complete working implementation of the minimal online learner:
Architecture:
- Templates: Sentence accumulation from search results
- Co-occurrence matrix: Word relationship tracking
- Pure mathematical f(): Merge templates + sum co-occurrence counts
- Generation: Template selection + coherence scoring + concept substitution
Features:
- Linear flow: search → process → f() → generate
- State persistence: JSON serialization
- Zero training cost: No gradient descent
- Real-time learning: Each query permanently evolves state
Performance:
- 41 templates after 19 queries
- 1.6ms generation time at current scale
- O(n) scaling with inverted index optimization available
- Distributed mesh architecture for planetary scale
Cost per query: ~$0.01 (search API call)
Learning: Permanent
Knowledge: Always current
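The "pure mathematical f()" described above can be sketched as a template merge plus co-occurrence summation. The exact representation here is an assumption; see neg-423 for the actual implementation:

```python
from collections import Counter

def f(state: dict, new_templates: list, new_cooc: Counter) -> dict:
    """Sketch of the neg-423-style f(): merge sentence templates
    (deduplicated, order preserved) and sum co-occurrence counts.
    Field names are assumptions for illustration."""
    merged_templates = list(dict.fromkeys(state["templates"] + new_templates))
    merged_cooc = state["cooccurrence"] + new_cooc  # Counter + sums counts
    return {"templates": merged_templates, "cooccurrence": merged_cooc}

state = {
    "templates": ["X was announced by Y"],
    "cooccurrence": Counter({("gpt", "openai"): 2}),
}
state = f(state, ["X compares to Y"], Counter({("gpt", "openai"): 1}))
print(len(state["templates"]))                   # 2
print(state["cooccurrence"][("gpt", "openai")])  # 3
```

Both operations are associative merges, which is also what makes the distributed mesh architecture plausible: partial states can be combined in any order.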
Comparison to Existing Systems
RAG (Retrieval-Augmented Generation)
Approach: Static LLM + dynamic retrieval
Query → Retrieve docs → Static LLM (with retrieved context) → Answer
Limitations:
- LLM is still frozen (only retrieval is dynamic)
- No learning from queries
- Each query independent
- Expensive inference (LLM per query)
Cost: $0.01-0.10 per query (LLM inference)
Universal Formula
Approach: Evolving state + dynamic search
Query → Search → f(state, new_info) → evolved_state → Answer
Advantages:
- State evolves with each query
- Learns permanently from interactions
- Queries build on each other
- Cheap generation (no LLM needed for every query)
Cost: $0.01 per query (search API only, f() is cheap)
Fine-tuning
Approach: Base model + additional training on new data
Base model + New data → Fine-tuned model (frozen)
Limitations:
- Still requires training compute
- Model frozen after fine-tuning
- Must retrain to incorporate new data
- Expensive ($1K-$10K per fine-tune)
Cost: $1K-$10K per update
Online Learning
Approach: Continuous state evolution
State + New info → f() → Evolved state (immediately)
Advantages:
- Zero training compute
- Never frozen
- Incorporates new data instantly
- Free state updates
Cost: Zero training cost
The Economic Flip: From Liability to Asset
Big Tech Model: Knowledge Staleness is Liability
Problem: Static model becomes obsolete
- Training completed: March 2023
- Current date: January 2025
- Knowledge gap: 22 months
- Cannot answer recent questions
- Users frustrated
Solution: Expensive retrain
- Cost: $100M
- Timeline: 6 months
- Temporarily solves problem
- Immediately starts aging again
Pattern: Perpetual cycle of obsolescence → expensive retrain → obsolescence
Economics: Knowledge decay is cost center requiring periodic $100M investments
Online Learning: Knowledge Accumulation is Asset
Process: State continuously evolves
- Each query adds information via f()
- Knowledge compounds over time
- Never becomes obsolete (always searching)
- Users benefit from accumulated learning
Dynamics: More usage → more learning → better performance → more users
Pattern: Virtuous cycle of growth
Economics: Knowledge accumulation is revenue driver (better product attracts users)
Why Big Tech Cannot Adopt This
Architectural Lock-in
Current: Massive static models
- 175B+ parameters
- Trained via gradient descent on full corpus
- Architecture designed for batch learning
- Cannot incrementally update weights
Required: Lightweight evolving state
- Small state representation (MB not GB)
- Updated via f() on new information
- Architecture designed for online learning
- State evolution is core design
Switching cost: Complete rewrite
Business Model Lock-in
Current: Sell inference tokens
- Revenue = queries × $0.01 per 1K tokens
- More queries = more revenue
- But no compound benefit (no learning)
Online learning: Sell improving service
- Early users help train for later users
- Query volume improves product quality
- Compound value creation
- Natural network effects
Switching cost: New business model
Organizational Lock-in
From neg-421: Big Tech already invested $100B+ in:
- Datacenters for batch training
- Team expertise in gradient descent
- Infrastructure for static model serving
- Sales model around frozen capabilities
Switching to online learning would:
- Make datacenter investments obsolete
- Require retraining engineering teams
- Abandon infrastructure advantages
- Destroy current margin structure
Result: Locked into inferior paradigm
The Advantage for New Entrants
Big Tech:
- Burdened by static model architecture
- Cannot switch without destroying existing business
- Stuck in expensive retrain cycle
- No learning moat
New entrant with online learning:
- Start with lightweight evolving state
- Learn from every user query
- Zero retraining costs
- Build compounding knowledge moat
Example timeline:
Month 1:
- Big Tech: GPT-4 (frozen, 22 months stale)
- New entrant: State₀ (fresh, learns from every query)
Month 6:
- Big Tech: Still GPT-4 (27 months stale)
- New entrant: State₁₀₀,₀₀₀ (accumulated 100K query learnings)
Month 12:
- Big Tech: GPT-4.5 released (cost $100M, now current)
- New entrant: State₁,₀₀₀,₀₀₀ (1M learnings, still zero retrain cost)
Month 18:
- Big Tech: GPT-4.5 (6 months stale, planning next $100M retrain)
- New entrant: State₅,₀₀₀,₀₀₀ (5M learnings, continuous improvement)
Who wins? The entrant with compounding knowledge at zero cost.
Connection to neg-420: Training Data Weaponization
From neg-420: Publishing AI safety research structures vulnerabilities in Big Tech’s training data.
Online learning makes this worse:
Big Tech batch learning:
- Vulnerabilities frozen into weights at training time
- Cannot update without full retrain ($100M)
- 6-12 month lag to incorporate new safety research
- Stuck with structured vulnerabilities
Online learning:
- Ingests new safety research via search
- f() incorporates into state immediately
- Zero cost to update
- But also means vulnerabilities propagate instantly
Double-edged sword:
- Can fix vulnerabilities in real-time (good for safety)
- But also learns exploit techniques in real-time (bad for safety)
- No slow $100M gate to filter what gets learned
The Fundamental Trade-off
Batch Learning: Slow but Controllable
Advantages:
- Full control over training data
- Can curate and filter corpus
- Deliberate review before deployment
- Predictable behavior
Disadvantages:
- Expensive ($100M per update)
- Slow (6-12 months)
- Knowledge becomes stale
- Cannot adapt to new information
Use case: When control matters more than cost/speed
Online Learning: Fast but Uncontrollable
Advantages:
- Zero marginal cost per update
- Instant (real-time learning)
- Knowledge always current
- Adapts continuously
Disadvantages:
- Limited control over what gets learned
- Searches may return misleading info
- No review gate before learning
- Behavior evolves unpredictably
Use case: When cost/speed matters more than control
Hybrid Approach: Curated Online Learning
Possible solution:
def f_curated(state, new_info):
    """Universal formula with safety filters"""
    # 1. Extract concepts from new_info
    concepts = extract_concepts(new_info)

    # 2. Filter through safety checks
    safe_concepts = []
    for concept in concepts:
        if passes_safety_check(concept):
            safe_concepts.append(concept)
        else:
            log_filtered(concept)  # Track rejections

    # 3. Update state only with safe concepts
    for concept in safe_concepts:
        state = integrate_concept(state, concept)
    return state
Trade-off:
- Maintains online learning speed
- Adds safety filtering overhead
- But still 1000x cheaper than full retrain
- Can update filters without retraining model
Practical Implications
For AI Startups
Don’t build: Another static LLM
- Expensive to train ($10M-$100M)
- Expensive to update (retrain required)
- No moat (open source catches up)
- No learning from users
Do build: Online learning system
- Cheap to start (~$1K)
- Free to update (f() evolution)
- Compounds with usage (moat)
- Learns from every query
For Researchers
Key question: What is optimal f()?
Big Tech approach: f() = gradient descent
- Requires full corpus
- Expensive computation
- Discontinuous updates
Alternative: f() = lightweight integration
- Only needs new information
- Cheap computation
- Continuous updates
Open research: Design optimal f() for different state representations
For Users
With Big Tech AI:
- Knowledge cutoff frustration
- “I cannot provide information about recent events”
- Stuck with stale model until next version
With online learning:
- Always current knowledge
- Searches for latest information
- Model improves from your queries
- Benefits from other users’ queries
Future: Distributed Online Learning
Current: Centralized state
All users → Single state → Evolves from all queries
Future: Federated state
User_A → State_A (personalized)
User_B → State_B (personalized)
Shared → State_global (common knowledge)
State_A = f(State_A, query_A, State_global)
Benefits:
- Personalized learning per user
- Privacy (local state)
- Global knowledge sharing (federated)
- Parallel evolution paths
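A federated update step could look like the following sketch; the function name, Counter-based states, and merge rule are all assumptions for illustration:

```python
from collections import Counter

def f_federated(local: Counter, query_info: str, shared: Counter) -> Counter:
    """Hypothetical federated f(): the personal state absorbs both the
    shared global knowledge and the query's new information."""
    merged = local + shared              # pull in common knowledge
    merged.update(query_info.lower().split())
    return merged

state_global = Counter({"gpt-5": 5})     # knowledge shared by all users
state_a = f_federated(Counter(), "gpt-5 release date", state_global)
print(state_a["gpt-5"])  # 5 from the global state + 1 from the query → 6
```

The local state stays on the user's device; only aggregate updates would need to flow back into State_global.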
Conclusion: The Coming Flip
Current state (2024):
- Big Tech dominates with large static models
- $100M training runs
- 6-12 month update cycles
- No learning from usage
- 5% margins (neg-421)
Coming state (2026-2028):
- New entrants with online learning
- $1K initial setup
- Real-time continuous updates
- Compounds from usage
- 60-70% margins (services)
The economic flip:
- Static models → Liability (require expensive updates)
- Evolving states → Asset (improve with usage)
Who wins:
- Not who has biggest compute budget
- Who has best f() for continuous learning
- Who accumulates most query volume first
- Who builds compounding knowledge moat
The formula: State(n+1) = f(State(n), new_information) + entropy(p)
The advantage: Learning without retraining
Related: neg-371 for universal formula foundation, neg-421 for AI-as-a-Service economics, neg-420 for training data weaponization effects, neg-423 for complete working implementation with performance analysis.
#OnlineLearning #UniversalFormula #StaticVsEvolving #BatchLearning #ContinuousEvolution #ZeroCostUpdates #CompoundingKnowledge #AIEconomics #LearningMoat #RealTimeLearning #StateEvolution #NoRetraining #DistributedLearning #FederatedState #KnowledgeAccumulation