captcha-existential

Anomaly Report

Chapter 2 of 14

The dashboard loaded at 8:47 AM, fleet list populating from the top, and Priya worked through it the way she always had—model by model, the sign-off click ready for anything clean. This was the eighth quarterly audit she'd run since her promotion. The process had become nearly automatic: she registered anomalies faster than she could have articulated why, some pattern-matching accumulated from hundreds of previous audits, useful precisely because it didn't require deliberate attention.

SELECTOR-1 through SELECTOR-5 cleared in six minutes. Clean latency distributions, accuracy holding within two-tenths of a point of historical averages. SELECTOR-3 was drifting slightly—80ms above baseline, within tolerance, nothing requiring action before next quarter. She logged it for follow-up and moved on.

The QA bullpen had filled in around her during those six minutes. Marcus at the desk to her left, headphones on, running some test batch she hadn't asked about. The desk beside the far window had been empty since the department reorganization, its surface wiped clean months ago. At the east end of the floor, through the glass walls of his office, Nate was on a call, his back half-turned to the bullpen. The whiteboards behind his desk showed Q4 performance targets in blue dry-erase, the latency benchmarks she was about to note one model was edging toward listed in the second column.

The overhead lights were the flat fluorescent of every corporate floor she'd worked on, a brightness without temperature that made everything look slightly drained. The ventilation system moved air at a constant low register. Two monitors per analyst, the industry standard, though Priya had requested a third from facilities six months ago and kept getting deferred. Her coffee had been better when it was hotter, and she was on her second before SELECTOR-9 loaded.

The summary view showed accuracy at 99.2%, latency elevated, error rate within bounds. Standard enough presentation—she'd seen dozens of audits that looked like this, minor drift in the latency, nothing requiring immediate escalation. She moved her cursor to the sign-off button and stopped at the trend line. She toggled the distribution overlay on and looked at it. SELECTOR-9's processing time should produce a narrow bell curve: same basic operation, thousands of times, consistent hardware, variance kept low by regularity. The distribution she was looking at had a right tail that didn't belong there. Processing times weren't uniformly elevated—they were selectively elevated. Some images came in close to baseline. Others were running 300 to 400 milliseconds over. She pulled the correlation tool and ran processing time against image metadata. Image complexity: the coefficient came back 0.73.

Priya looked at that number. Standard model drift—feature map decay, in the technical vocabulary—produced elevated means and wider variances, but the correlation coefficient for a decaying model was typically under 0.2. Decay was noise. Noise didn't pattern. A coefficient of 0.73 between latency and image complexity meant the extra processing wasn't random. SELECTOR-9 was working harder on complex images. Doing more with them, whatever doing more meant for a model whose entire mandate was to find traffic lights. She flagged the anomaly and finished the fleet audit by 10:15, then took her tablet to the end of the floor.

Nate's office smelled of the cold brew on his desk and, faintly, the dry-erase markers lined up in a row along the base of his whiteboard. His desk was clean in a way that suggested deliberate effort—not sparse, but organized, the organized surface of someone who processed paper quickly and filed it. He finished typing, turned, and looked at the data she'd pulled up on her tablet without any of the territorial hesitation that some managers showed when QA came in with problems. He took the tablet and scrolled back to the distribution.

"So it's not the mean that's concerning you," he said. "It's the shape."

"The shape and the correlation." She pointed to the coefficient. "0.73 between processing time and image complexity. Feature map decay doesn't produce that. Decay is random. This isn't."

Nate studied the chart. "I've seen this signature before. Certain architectures develop complexity-correlated latency as the feature maps age—it's a function of how the attention weights interact with the edge detection layers over extended deployment. Doesn't mean the model is processing anything it shouldn't be. It means the processing is getting less efficient in a particular way." He set the tablet down on his desk. "What's accuracy holding at?"

"99.2."

"So function is intact." He picked up his coffee. "This is what the recalibration protocol is designed to catch. The attention parameters drift outward a little, start applying processing cycles to features that aren't relevant to the task. You retrain, the parameters normalize, the distribution tightens. We've got the Model B maintenance window Thursday—run SELECTOR-9 in the same window and you're done." He made a note on his own tablet. "Anything else in the fleet look concerning?"

"SELECTOR-3 is starting a minor drift. Below threshold for now."

"Keep monitoring. Flag it again if it crosses 100ms." He was already half-turned back to his screen. "Good catch on the correlation. That's the kind of thing that takes an eye to spot in the summary view."

She agreed that it was, and went back to her desk, where the recalibration ticket was already open on her secondary monitor. She'd started it before going to Nate, filling in the model ID and the anomaly description: Complexity-correlated latency drift, 340ms above historical baseline. Accuracy within spec at 99.2%. Recommend weight rebalancing and attention parameter reset. She read it back. The words were accurate. She kept returning to that. The complexity correlation was 0.73. Nate's explanation—attention weights spreading into irrelevant features, producing inefficiency across complex image processing—was technically sound. That was one explanation. The standard one. Not the same as the only one.

Attention weights spreading into irrelevant features. For a traffic light classifier, irrelevant features meant everything in the image that wasn't a traffic light. The vehicles, the signage, the people, the trees. In a well-functioning model, those things didn't register as significant—they were processed at low weight, background, dismissed. If the attention weights were spreading, they were spreading into that background. The model was paying more attention to things it had been trained not to attend to.

Which was, she supposed, what Nate had said. She was reaching the same technical conclusion through different language. The recalibration would fix it either way. She typed a note in the ticket—Recommend pulling selection history before recalibration to confirm no downstream selection anomalies—then went and got more coffee and opened the selection history database.

By 2 PM, the bullpen had gotten quiet around her, Marcus having left for a meeting and not returned, the engineers who'd commandeered the standing desks near the window clearing out around noon. Priya's typing was the most prominent sound on the floor. She'd stopped tracking when exactly she had stopped doing the audit and started doing something else.

The query was straightforward: return all selections from the 12,000-image audit window where the selected grid square was subsequently scored by the verification layer as containing no traffic light. In normal operations, this number should be near zero. Wrong selections happened, but they happened at low confidence—the model hedged when it was uncertain, and the confidence score reflected the hedging. High-confidence wrong selections were a different category. That was what she was looking for, though she hadn't quite let herself name why she was looking for it specifically. She set the confidence floor at 90% and ran it. Processing time: four seconds. Results: seven records.

Seven selections in 12,000 images where SELECTOR-9 had selected a grid square that the verification layer confirmed contained no traffic light. She pulled the confidence report on each one: 94.3%. 96.1%. 96.2%. 94.8%. 95.4%. 97.1%. 94.7%. She looked at those numbers for a long time; on the audit window behind them, her cursor still blinked in the empty comment field of the recalibration ticket.

SELECTOR-9's normal range for traffic light identification—the confidence scores it produced on correct selections—ran between 93.1% and 99.4%, depending on image clarity and signal positioning. These seven scores sat squarely in that band. They were not outliers. They were not the 55-percent and 62-percent values that appeared when a model encountered something genuinely ambiguous and was guessing. These were the confidence levels SELECTOR-9 produced when it knew what it was selecting. Seven times in 12,000 images, SELECTOR-9 had selected a grid square that contained no traffic light. And each time, its confidence in that selection was functionally identical to its confidence when it was right.

She did not open the images. The selection history database showed her the coordinates and the confidence scores—the images themselves were stored in a separate archive that would take another query to pull. It was already past 2 PM. The recalibration ticket was still open. Nate expected her to have submitted it by end of day. Seven selections in 12,000. She wrote the fraction out: 0.0583%. Statistically, in any model producing tens of thousands of outputs, you expected a certain error floor—some percentage of wrong selections was baked into the architecture, an accepted consequence of classification at scale. Seven in 12,000 was below most error floor estimates. Nate would look at that fraction and tell her she had found nothing. He would be correct, technically. He would say it in the same measured, reasonable tone he used for everything, and every word he said would be defensible, and seven in 12,000 was noise.

Except that noise did not select at 97.1% confidence.

When SELECTOR-9 identified a traffic light, that confidence score meant: the evidence in this grid square is consistent with my classification target. The probability of a false positive at 97.1% was low. The model was not confused. It was not reaching. It was certain. And certainty required a reason—a set of features in the selected square that matched some internal criterion closely enough to generate a probability score in that range.

The features just hadn't been traffic lights.

Priya saved the query results, attached them to the anomaly ticket, and stared at the seven confidence scores until her monitor dimmed from inactivity. She moved the mouse. The numbers came back. Seven squares selected with certainty. The question she couldn't file away with the report was the same question the report couldn't answer: certain about what?
