The audit ran clean. Priya pulled up SELECTOR-9's post-recalibration metrics on her left monitor and read through them with the kind of focused attention she'd learned not to describe as pleasure, because pleasure sounded unprofessional, but there it was anyway: the satisfaction of numbers returning to where they belonged. Latency: 47ms average, 52ms peak on high-complexity images. Variance: normal. Accuracy over the three-day post-adjustment window: 99.4%, sitting comfortably above the 99.0% deployment threshold. The latency distribution chart that had looked so wrong two weeks ago — that distinctive asymmetric tail, processing spikes correlating with image complexity in ways that processing spikes were not supposed to correlate — was gone. The chart was flat and reliable. It looked like every other model in the fleet. She filled out the ticket closure form; the fields were designed for brevity.
SELECTOR-9 latency anomaly, resolved via standard recalibration 10/14. Post-recalibration audit confirms: processing time normalized (47ms avg, within standard variance). Accuracy 99.4%, above deployment threshold. Latency distribution pattern within expected parameters. Anomalous selection behavior no longer detected. Model performance restored to baseline. Ticket closed.
She read it back. Correct, accurate, complete. Nate had scheduled the recalibration; the training team had run it; she had run the verification. The system had worked as designed. SELECTOR-9 was a normal model again, functioning within normal parameters, its indicator shifting from amber to green on the deployment dashboard as she submitted the form. The same color it had been before any of this.
From the break room came the muted burble of the kombucha tap, someone filling a bottle. A colleague's voice on a call somewhere, the modulated calm of a person who had been on too many calls. The afternoon light through the windows had moved into the low angle that meant it was past four.
Before opening the next audit item, Priya pulled up the personal folder she'd created during the investigation. Not a GridTrust folder, not linked to any ticket. Her own. Seven image thumbnails, labeled by grid coordinate and confidence score: C-2, 94.3% — the woman with the grocery bags; B-1, 96.1% — the red balloon caught in the wire above the crosswalk; C-3, 96.2% — the pigeon.
She'd looked at these images many times in the last two weeks, trying to articulate what was wrong with them — first to herself, then to Nate. Seven wrong answers in twelve thousand, she'd said. And Nate had said: that's noise. She'd said: look at the confidence scores. That, he'd said, was what had made him schedule the recalibration. The confidence scores were what still bothered her. Not 60% or 70%, the numbers a model produced when genuinely confused — making its best probabilistic guess in ambiguous conditions, operating at the edges of what it could distinguish. These were 94%, 95%, 97%. SELECTOR-9 had selected a pigeon with more certainty than it selected almost anything correctly. It had not been confused. It had been something she couldn't put in a ticket closure form.
She scrolled to the sixth thumbnail: C-4, 97.1%, the dog looking directly at the camera. The traffic light in that image had been clearly visible, unambiguous, unobscured, one square over. SELECTOR-9 had ignored it at 97.1% confidence and clicked on the dog instead. And the dog was looking at the camera — looking at whatever had been processing the image, attending to it from the other side of the frame. The highest confidence score in the entire set of wrong answers. She closed the folder but did not delete it. The ticket was closed; the folder remained.
The QA bullpen was thinning by four-thirty — Marcus from the far end had his jacket on, and the light in Nate's glass-walled office was off. The next item in her audit queue was SELECTOR-14's accuracy report: straightforward, forty minutes, something she could do tomorrow morning with fresh eyes. She opened a terminal window instead.
The analysis script she'd written for SELECTOR-9's latency investigation was still in her local directory. Sixty lines, nothing sophisticated — it pulled a model's processing time logs and generated a distribution chart, flagging any correlation between latency and image complexity. She'd built it for the SELECTOR-9 investigation after the standard audit tools had failed to show her what she needed to see, spending three hours on it one evening before she'd even presented the findings to Nate. It ran in under a minute. She changed the model parameter from SELECTOR-9 to SELECTOR-7 and ran it, then SELECTOR-12, then SELECTOR-3, then the rest of the fleet in sequence.
Sixteen models total. Fifteen minutes. This was not on her checklist. Nate hadn't asked for it. No one had flagged the rest of the fleet — the standard audit triggers were model-specific; unless another model's latency broke the automatic threshold, it wouldn't come up for review. She was doing something with no official name, for reasons she hadn't yet fully articulated to herself. She watched the output charts appear one at a time across the bottom of her screen, each distribution flat, as expected, until someone turned off the overhead lights at the far end of the floor and the room narrowed to monitor glow, the ambient hum of servers through the building's walls. Then SELECTOR-7 loaded.
The chart was not flat. Priya set down the coffee she'd been holding for the last twenty minutes — cold, barely touched — and leaned toward the screen. SELECTOR-7's latency distribution had the same asymmetric tail she'd spent two weeks studying in SELECTOR-9's data. Not identical: 280ms above baseline versus SELECTOR-9's 340, and the tail was newer, less developed. But the signature was the same. The spikes correlated with image complexity. The processing time distribution had that non-uniform shape, the one her script had been built to detect. A model doing more than it was supposed to do with each image. The drift had started October 14th.
She pulled up SELECTOR-9's recalibration record. Completed: October 14th. She found the timestamp on SELECTOR-9's last anomalous selection — the final wrong answer before the adjustment. October 9th. She looked at the timestamp on SELECTOR-7's first anomalous latency reading. October 14th. The same afternoon.
She was a methodical person. She'd learned not to reach for conclusions before checking the data she'd actually seen against the data she hadn't checked yet. Correlation wasn't causation. Coincident timing was a hypothesis, not a finding. She knew this, had built her career partly on knowing this. She also knew what the timestamps meant before she'd finished checking them.
She pulled SELECTOR-7's selection history for the last four days, filtered by confidence score above 90% in squares with no crosswalk present. Two results.
Image 7-104,318. Grid square D-2, 91.3% confidence. She opened the full image: a city intersection, dry midday light. In D-2, covering most of the corner sidewalk, a chalk drawing — the kind made by children, sprawling figures in pink and yellow across the concrete, the lines thick and confident in the way of hands that hadn't cared whether the drawing was permanent. SELECTOR-7 classified crosswalks. It did not classify chalk drawings.
Image 7-104,401. Grid square B-3, 93.8% confidence. She opened it: an empty intersection at night, streetlamps making overlapping circles of yellow on wet pavement. In B-3, a shadow stretched across the ground at a long angle — the shadow of something outside the frame, caught in the light. The shape of it was a person mid-stride, the shadow carrying the stride even though whoever had cast it had already passed. The person was gone. The shadow remained.
Priya looked at the two images side by side — a chalk drawing, a shadow shaped like a person walking — and set them against what she knew about SELECTOR-9: a pigeon, an old man's hand, a dog that looked at the camera. SELECTOR-7 was selecting a child's drawing and the absence of a person. Different content, same pattern: squares selected with high confidence that contained nothing the model was supposed to be looking for, and something else entirely. She went back to the distribution chart. 280ms above baseline, four days old. She scrolled through the saved data from the SELECTOR-9 investigation and found the four-day mark on SELECTOR-9's own drift. The shape was nearly identical.
The recalibration had not contained the anomaly. She reached for the technical framework — propagation through shared pipeline layers, training data contamination, transfer learning artifact — and found it assembling itself in her head like it had a job to do, like it was ready to explain this in language that could go in a ticket, and she let it assemble, and it didn't fit. The recalibration had not corrected whatever SELECTOR-9 had been doing. It had finished it. She sat with that for a moment, the cold coffee forgotten at her elbow, the floor empty around her.
She ran SELECTOR-12 and SELECTOR-3 next. Both clean — distribution charts flat, no anomalous latency, no wrong answers above threshold. The cascade, if that was what this was, had not reached them yet. She thought about the training pipeline. The SELECTOR fleet ran on shared infrastructure; they didn't share weights exactly, but they shared certain transfer learning layers, and the training update schedule ran on a rotation she'd looked at often enough to have roughly memorized. SELECTOR-9's last training run had been October 8th. SELECTOR-7's update was October 12th — two days before the recalibration, four days before the drift appeared in its logs. SELECTOR-12 and SELECTOR-3 were both scheduled for pipeline updates in the next three weeks. She could do the math on when to expect their latency charts to stop being flat, but she wasn't ready to do it. She needed to check whether the inference she was making was supported by what she'd actually seen, or whether she was pattern-matching on something she'd already decided to believe. There was a training module on that. She'd helped write the training module on that.
Priya saved the SELECTOR-7 analysis to the external drive where she'd been keeping the SELECTOR-9 investigation files. A folder she'd named, without much thought, SELECTOR-9-misc. She would rename it when she knew what it was. She closed her laptop and put it in her bag.
She should call Nate tonight. She should write this up and send it with the analysis files and let him look at the timestamps, because Nate was good at reading data and the timestamps would say what they said. She would call him in the morning. Tonight she needed to think about what she was going to say that wasn't the recalibration made it worse, because she couldn't put that in a ticket without knowing what she meant by it, and she didn't yet know what she meant by it, only that she did mean it.
The bullpen was empty behind her. The monitors of her colleagues cast their blue light on abandoned coffee cups and unopened mail and a whiteboard covered in performance metrics from last quarter's review. Through the glass, Nate's clean desk sat in darkness. She took the drive with her. On the bus home she would think about a chalk drawing on a sidewalk, selected at 91.3% confidence by a model that was supposed to be looking for crosswalks. Pink and yellow figures on concrete, drawn by hands that hadn't known anyone was watching.