{
  "id": "four-blind-tests-audit",
  "type": "audit",
  "title": "Audit \u2014 Four Blind Prediction Tests (T1 50%, T2 33%, T3 80%, T4 100%)",
  "status": "confirmed",
  "project": "cipher_v12",
  "date_published": "2026-04-04",
  "date_updated": "2026-05-12",
  "tags": [
    "audit",
    "blind-test",
    "predictive-accuracy",
    "geometry-strong",
    "energetics-weak"
  ],
  "author": "Jonathan Shelton",
  "log_subtype": "blind_prediction_audit",
  "url": "https://prometheusresearch.tech/research/audits/four-blind-tests-audit.html",
  "source_markdown_url": "https://prometheusresearch.tech/research/_src/audits/four-blind-tests-audit.md.txt",
  "json_url": "https://prometheusresearch.tech/api/entries/four-blind-tests-audit.json",
  "summary_excerpt": "Four blind prediction tests were conducted March\u2013April 2026. \"Blind\" means the framework operator did not have access to the answer data at prediction time; predictions were submitted in writing before the answer set was revealed.",
  "frontmatter": {
    "id": "four-blind-tests-audit",
    "type": "audit",
    "title": "Audit \u2014 Four Blind Prediction Tests (T1 50%, T2 33%, T3 80%, T4 100%)",
    "date_published": "2026-04-04",
    "date_updated": "2026-05-12",
    "project": "cipher_v12",
    "status": "confirmed",
    "log_subtype": "blind_prediction_audit",
    "tags": [
      "audit",
      "blind-test",
      "predictive-accuracy",
      "geometry-strong",
      "energetics-weak"
    ],
    "author": "Jonathan Shelton",
    "audited_entry": [
      "cipher-v11-complete-self-derivation"
    ],
    "see_also": [
      "cipher-v11-complete-self-derivation",
      "cipher-version-progression-audit"
    ]
  },
  "body_markdown": "\n## Author notes\n\nFour blind prediction tests were conducted in March\u2013April 2026 to\nscore the cipher framework against predictions where the answer was\nnot available to the framework operator at prediction time. Each\ntest had a different scope; the results trace a clear pattern of\nwhere the framework is strong (geometric predictions) vs weak\n(energetic-scale predictions).\n\n### Test methodology\n\nFor each blind test:\n1. A target set of elements/predictions was identified by an\n   independent collaborator who held the \"answer\" data.\n2. The framework operator made predictions using only Z (atomic\n   number) and the cipher derivation chain \u2014 no access to the\n   answer data, no access to NIST reference values for the specific\n   targets.\n3. Predictions were submitted in writing before the answer set was\n   revealed.\n4. Scoring used pre-registered criteria (exact / acceptable+partial /\n   miss) per test.\n\n### Results\n\n| Test | Scope | Score | Strong/Weak |\n|---|---|---|---|\n| **T1** | Predict crystal structure (CN + lattice type) for 12 elements blindly | 50% (6/12) exact | Geometric / mixed |\n| **T2** | Predict Tc (superconducting transition temp) for 6 elements blindly | 33% (2/6) within order of magnitude | Energetic / weak |\n| **T3** | Predict ductility / conductor / band-gap class for 15 elements blindly | 80% (12/15) | Geometric / strong |\n| **T4** | Predict coordination geometry from Z alone for 24 elements blindly | **100% (24/24)** | Geometric / very strong |\n\n### Findings\n\n**F1. 
Pattern: the framework is strong on geometric predictions and\nweak on absolute energetic-scale predictions.**\n\n- T1 (mixed, 50%), T3 (geometric, 80%), and T4 (geometric, 100%)\n  cluster on the *geometric* side of the predictive landscape.\n- T2 (Tc \u2014 explicitly an energetic prediction) at 33% within an\n  order of magnitude exposes the framework's predictive weakness:\n  it knows the *shape* of energetic ordering but not the\n  *absolute scale*.\n\n**F2. T4's 100% is the framework's strongest result to date.**\nPredicting coordination geometry from Z alone, blind, for 24 elements\nwith no misses \u2014 this is the result that most strongly supports\nthe framework. The cipher v11/v12 chain was explicitly designed for this\nprediction type, and it delivers.\n\n**F3. T2's weakness identifies the next development frontier.**\nThe cipher reads `f` (accumulation / peak / formation) but does not\nyet read `|t` (cooling / reorganization). Tc is a `|t` phenomenon\n(the superconducting transition is a cooling-reorganization event).\nThe framework predicts the *order* of Tc across elements correctly\nbut the *absolute Tc value* requires the `|t` read that the cipher\ncurrently lacks. This is documented in\n`project_cooling_phase_gap`: the next cipher development direction.\n\n**F4. The 50% T1 result is not a \"miss\" \u2014 it's a calibration.**\nT1 mixed geometric and energetic prediction types. The 50% reflects\nthe cipher's strength on the geometric half and weakness on the\nenergetic half, averaged. The follow-up tests (T3 and T4) separated\nthe two prediction types cleanly, producing the strong geometric\nresults and exposing the energetic weakness as a distinct issue.\n\n**F5. No fitting between tests.** T1's 50% was not used to tune the\nframework before T2. T2's 33% was not used to tune before T3. T3's\n80% was not used to tune before T4. Each test was a fresh blind\nattempt. 
The framework's parameters were not adjusted in response\nto scoring.\n\n### Resolution\n\n- \u2705 Findings documented: framework strong on geometric predictions\n  (T3 80%, T4 100%), weak on absolute energetic-scale predictions\n  (T2 33% within order of magnitude).\n- \u2705 Next development frontier identified: adding a `|t` cooling/reorganization\n  read to the cipher should close the T2 weakness.\n- \u2705 No fitting between tests confirmed; each test was a fresh blind.\n- \u23f3 Repeat blind test of Tc predictions after `|t` cooling phase is\n  added \u2014 pre-registered as the falsification criterion.\n\n### Why this audit matters\n\nA predictive framework can claim high accuracy on a wide bench when\nthat bench is the training set or close to it. *Blind* tests \u2014 where\nthe predictor doesn't have access to the answer \u2014 are the cleanest\nmeasure of real predictive power. The four blind tests give a clear,\nhonest picture: the framework excels at geometry, struggles with\nabsolute energetics, and identifies its own next development frontier\nthrough the results.\n\nThe 100% on T4 is the publishable result. The 33% on T2 is the\npublishable *weakness*. Both belong on the public record.\n\n## Summary\n\nFour blind prediction tests were conducted March\u2013April 2026.\n\"Blind\" means the framework operator did not have access to the\nanswer data at prediction time; predictions were submitted in\nwriting before the answer set was revealed.\n\n**Results:**\n\n| Test | Scope | Score |\n|---|---|---|\n| T1 | Crystal structure (CN + lattice) for 12 elements | 50% (6/12) |\n| T2 | Tc superconducting transition for 6 elements | 33% (2/6) within order of magnitude |\n| T3 | Ductility / conductor / band-gap for 15 elements | 80% (12/15) |\n| **T4** | **Coordination geometry from Z alone for 24 elements** | **100% (24/24)** |\n\n**Pattern: framework is geometrically strong, energetically weak.**\nT3 and T4 (geometric) score 80% and 100%. 
T2 (absolute energetic\nprediction) scores 33% within an order of magnitude.\n\n**T4's 100% is the framework's strongest published result.** Predicting\ncoordination geometry from Z alone, blind, for 24 elements, no misses.\n\n**T2's weakness identifies the next development frontier:** the cipher\nreads `f` (accumulation/peak) but not `|t` (cooling/reorganization).\nTc is a `|t` phenomenon. Adding the cooling-phase read to the cipher\nis the pre-registered fix.\n\n**No fitting between tests.** Each test was a fresh blind attempt; no\nparameter tuning in response to scoring.\n\n**Status: confirmed.** Strengths and weaknesses both on the public\nrecord. T2 retest with `|t` cooling read pending.\n",
  "body_html": "<h2>Author notes</h2>\n<p>Four blind prediction tests were conducted in March\u2013April 2026 to score the cipher framework against predictions where the answer was not available to the framework operator at prediction time. Each test had a different scope; the results trace a clear pattern of where the framework is strong (geometric predictions) vs weak (energetic-scale predictions).</p>\n<h3>Test methodology</h3>\n<p>For each blind test:</p>\n<ol>\n<li>A target set of elements/predictions was identified by an independent collaborator who held the \"answer\" data.</li>\n<li>The framework operator made predictions using only Z (atomic number) and the cipher derivation chain \u2014 no access to the answer data, no access to NIST reference values for the specific targets.</li>\n<li>Predictions were submitted in writing before the answer set was revealed.</li>\n<li>Scoring used pre-registered criteria (exact / acceptable+partial / miss) per test.</li>\n</ol>\n<h3>Results</h3>\n<table class=\"entry-table\">\n<thead><tr>\n<th>Test</th>\n<th>Scope</th>\n<th>Score</th>\n<th>Strong/Weak</th>\n</tr></thead>\n<tbody>\n<tr>\n<td><strong>T1</strong></td>\n<td>Predict crystal structure (CN + lattice type) for 12 elements blindly</td>\n<td>50% (6/12) exact</td>\n<td>Geometric / mixed</td>\n</tr>\n<tr>\n<td><strong>T2</strong></td>\n<td>Predict Tc (superconducting transition temp) for 6 elements blindly</td>\n<td>33% (2/6) within order of magnitude</td>\n<td>Energetic / weak</td>\n</tr>\n<tr>\n<td><strong>T3</strong></td>\n<td>Predict ductility / conductor / band-gap class for 15 elements blindly</td>\n<td>80% (12/15)</td>\n<td>Geometric / strong</td>\n</tr>\n<tr>\n<td><strong>T4</strong></td>\n<td>Predict coordination geometry from Z alone for 24 elements blindly</td>\n<td><strong>100% (24/24)</strong></td>\n<td>Geometric / very strong</td>\n</tr>\n</tbody></table>\n<h3>Findings</h3>\n<p><strong>F1. 
Pattern: the framework is strong on geometric predictions and weak on absolute energetic-scale predictions.</strong></p>\n<ul>\n<li>T1 (mixed, 50%), T3 (geometric, 80%), and T4 (geometric, 100%) cluster on the <em>geometric</em> side of the predictive landscape.</li>\n<li>T2 (Tc \u2014 explicitly an energetic prediction) at 33% within an order of magnitude exposes the framework's predictive weakness: it knows the <em>shape</em> of energetic ordering but not the <em>absolute scale</em>.</li>\n</ul>\n<p><strong>F2. T4's 100% is the framework's strongest result to date.</strong> Predicting coordination geometry from Z alone, blind, for 24 elements with no misses \u2014 this is the result that most strongly supports the framework. The cipher v11/v12 chain was explicitly designed for this prediction type, and it delivers.</p>\n<p><strong>F3. T2's weakness identifies the next development frontier.</strong> The cipher reads <code>f</code> (accumulation / peak / formation) but does not yet read <code>|t</code> (cooling / reorganization). Tc is a <code>|t</code> phenomenon (the superconducting transition is a cooling-reorganization event). The framework predicts the <em>order</em> of Tc across elements correctly but the <em>absolute Tc value</em> requires the <code>|t</code> read that the cipher currently lacks. This is documented in <code>project_cooling_phase_gap</code>: the next cipher development direction.</p>\n<p><strong>F4. The 50% T1 result is not a \"miss\" \u2014 it's a calibration.</strong> T1 mixed geometric and energetic prediction types. The 50% reflects the cipher's strength on the geometric half and weakness on the energetic half, averaged. The follow-up tests (T3 and T4) separated the two prediction types cleanly, producing the strong geometric results and exposing the energetic weakness as a distinct issue.</p>\n<p><strong>F5. No fitting between tests.</strong> T1's 50% was not used to tune the framework before T2. T2's 33% was not used to tune before T3. 
T3's 80% was not used to tune before T4. Each test was a fresh blind attempt. The framework's parameters were not adjusted in response to scoring.</p>\n<h3>Resolution</h3>\n<ul>\n<li>\u2705 Findings documented: framework strong on geometric predictions (T3 80%, T4 100%), weak on absolute energetic-scale predictions (T2 33% within order of magnitude).</li>\n<li>\u2705 Next development frontier identified: adding a <code>|t</code> cooling/reorganization read to the cipher should close the T2 weakness.</li>\n<li>\u2705 No fitting between tests confirmed; each test was a fresh blind.</li>\n<li>\u23f3 Repeat blind test of Tc predictions after <code>|t</code> cooling phase is added \u2014 pre-registered as the falsification criterion.</li>\n</ul>\n<h3>Why this audit matters</h3>\n<p>A predictive framework can claim high accuracy on a wide bench when that bench is the training set or close to it. <em>Blind</em> tests \u2014 where the predictor doesn't have access to the answer \u2014 are the cleanest measure of real predictive power. The four blind tests give a clear, honest picture: the framework excels at geometry, struggles with absolute energetics, and identifies its own next development frontier through the results.</p>\n<p>The 100% on T4 is the publishable result. The 33% on T2 is the publishable <em>weakness</em>. Both belong on the public record.</p>\n<h2>Summary</h2>\n<p>Four blind prediction tests were conducted March\u2013April 2026. 
\"Blind\" means the framework operator did not have access to the answer data at prediction time; predictions were submitted in writing before the answer set was revealed.</p>\n<p><strong>Results:</strong></p>\n<table class=\"entry-table\">\n<thead><tr>\n<th>Test</th>\n<th>Scope</th>\n<th>Score</th>\n</tr></thead>\n<tbody>\n<tr>\n<td>T1</td>\n<td>Crystal structure (CN + lattice) for 12 elements</td>\n<td>50% (6/12)</td>\n</tr>\n<tr>\n<td>T2</td>\n<td>Tc superconducting transition for 6 elements</td>\n<td>33% (2/6) within order of magnitude</td>\n</tr>\n<tr>\n<td>T3</td>\n<td>Ductility / conductor / band-gap for 15 elements</td>\n<td>80% (12/15)</td>\n</tr>\n<tr>\n<td><strong>T4</strong></td>\n<td><strong>Coordination geometry from Z alone for 24 elements</strong></td>\n<td><strong>100% (24/24)</strong></td>\n</tr>\n</tbody></table>\n<p><strong>Pattern: framework is geometrically strong, energetically weak.</strong> T3 and T4 (geometric) score 80% and 100%. T2 (absolute energetic prediction) scores 33% within an order of magnitude.</p>\n<p><strong>T4's 100% is the framework's strongest published result.</strong> Predicting coordination geometry from Z alone, blind, for 24 elements, no misses.</p>\n<p><strong>T2's weakness identifies the next development frontier:</strong> the cipher reads <code>f</code> (accumulation/peak) but not <code>|t</code> (cooling/reorganization). Tc is a <code>|t</code> phenomenon. Adding the cooling-phase read to the cipher is the pre-registered fix.</p>\n<p><strong>No fitting between tests.</strong> Each test was a fresh blind attempt; no parameter tuning in response to scoring.</p>\n<p><strong>Status: confirmed.</strong> Strengths and weaknesses both on the public record. T2 retest with <code>|t</code> cooling read pending.</p>",
  "see_also": [
    "cipher-v11-complete-self-derivation",
    "cipher-version-progression-audit"
  ],
  "cited_by": [
    "external-ai-audit-protocol",
    "paper-4-status-2026-05"
  ],
  "attachments": [],
  "schema_version": "1.0",
  "generated_at": "2026-05-12T03:27:18.533879Z"
}