Computer-vision system for predatory wildlife identification

ML pipeline replaces manual expert review of camera-trap footage with automated species detection.

CASE FILE · CS-05 SHIPPED

“We went from sampling footage to processing all of it. The conservation insight is now data-driven, not anecdotal.”

Camera-trap footage now processed automatically: TB/mo

Client: Wildlife conservation organization

Sector: Sports

Service lines: Build · Agents

Window: 8 weeks fixed

Challenge

Hundreds of camera-trap units generated terabytes of footage per month. Manual expert review was slow, expensive, and inconsistent. Visual similarity between predator species (and between predators and large herbivores) made naive image classification unreliable, and lighting, weather, occlusion, and partial-frame captures degraded model accuracy further. The team needed a production-grade ML pipeline that could scale on cloud infrastructure and improve itself over time.

Solution

The solution was a computer-vision pipeline with custom preprocessing (lighting normalization, frame stabilization, occlusion handling), feature extraction tuned for fur patterns, facial features, and body shape, and an ensemble of classification models trained on labeled species datasets. It was built as an auto-scaling AWS inference pipeline with batch and real-time modes. Confidence-scored outputs flagged low-confidence frames for human review, which created a continuous-improvement loop.

CASE FILE · CS-05 · LONG-FORM

The full story.

The overview's above. Below is what actually happened — the trigger, the surprises, the decisions, the build, the cutover, and how it's holding up.

Why we got the call.

The trigger

The conservation organization had accumulated about 3TB of camera-trap footage over 18 months. They could afford to review roughly 10% of it. The reviewers were specialist conservation biologists at $100+/hour — pulling them off field work to watch night-IR video at 4× speed was the wrong use of their time, but funder reporting requirements meant somebody had to do it.

The funder pressure was the trigger. A multi-year grant renewal required reporting on predator-presence numbers across the whole monitored region, not extrapolated from a 10% sample. Without full-coverage data, the renewal was at risk. The grant was the organization's largest single funding source.

The harder problem underneath: one of the predator species the organization was monitoring is visually similar to a large co-occurring herbivore. Volunteer reviewers misidentified them in roughly 20% of frames. The grant's predator-presence number would have been inflated if we had just thrown more eyeballs at the problem without addressing the misidentification rate. The Conservation Director was specific: "I need accurate numbers, not just more numbers."

The internal data team had attempted two prior ML approaches. One used a generic image-classification model fine-tuned on a small labeled set; it hit 76% accuracy and stalled because of training-data scarcity. The other tried a commercial wildlife-ID API; it returned "large carnivore" for most predator frames without species-level resolution.

What we found that the brief didn't say.

The week-one discovery

In week zero we mapped the data: 47 camera-trap units across 8 sites, averaging 1.2GB per unit per day. Footage quality varied wildly, from night IR to day color, with partial-frame captures and blown highlights at sunrise. Some cameras had occlusion problems (grass growing in front of the lens in late summer); some had vibration from being mounted on living trees.

Expert review accuracy on a holdout set was 92%. That number startled the team: even the specialists got it wrong 8% of the time. The accuracy gap meant our model didn't need to beat humans on every frame; it needed to be reliably above 88%, with likely false positives flagged for human review.

The labeled training data we needed didn't exist as a single resource. Public datasets had general carnivore labels but not species-specific labels for our predator. The organization had ~6,000 expert-labeled frames from prior research. We supplemented with about 34,000 frames from public conservation datasets (eMammal, iWildCam, Snapshot Serengeti) filtered to the relevant predator + herbivore species. Total labeled training set: ~40,000 frames.

The hardest discovery: visual similarity wasn't uniform. The predator and the herbivore look similar in frontal poses at typical camera-trap distance, but distinctly different in profile or partial-frame. That fact would shape the architecture.

The tradeoffs we made + why.

The architecture decisions

**Ensemble vs single foundation model.** A single fine-tuned ResNet-50 hit 89% on our holdout set. Tempting to ship. We tested an ensemble of three models: a fine-tuned ResNet for general carnivore detection, a specialist binary classifier for predator-vs-similar-herbivore, and an EfficientNet for fine-grained features like fur pattern and ear shape. The ensemble hit 93%. The 4-point gap decided it: we chose the ensemble for its species-level accuracy and accepted the 3× inference cost.
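The case file does not say how the three model outputs were combined. One common option is soft voting, a weighted average of each model's per-class probabilities, and the sketch below assumes that approach. The function names, class labels, weights, and probabilities are all hypothetical.

```python
from typing import Dict, List, Tuple

Probs = Dict[str, float]  # class label -> probability from one model


def soft_vote(predictions: List[Probs], weights: List[float]) -> Probs:
    """Weighted average of per-class probabilities across ensemble members."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need at least one model and one weight per model")
    total_weight = sum(weights)
    labels = set().union(*(p.keys() for p in predictions))
    # A model that does not emit a label contributes 0.0 for that label.
    return {
        label: sum(w * p.get(label, 0.0) for p, w in zip(predictions, weights))
        / total_weight
        for label in labels
    }


def top_label(probs: Probs) -> Tuple[str, float]:
    """Return the winning label and its averaged probability."""
    label = max(probs, key=probs.get)
    return label, probs[label]


# Hypothetical outputs for one frame (illustrative numbers, not project data).
general_carnivore = {"predator": 0.55, "herbivore": 0.40, "other": 0.05}
binary_specialist = {"predator": 0.80, "herbivore": 0.20}
fine_grained = {"predator": 0.70, "herbivore": 0.25, "other": 0.05}

combined = soft_vote(
    [general_carnivore, binary_specialist, fine_grained], [1.0, 1.0, 1.0]
)
label, confidence = top_label(combined)
print(label, round(confidence, 3))
```

With equal weights this is a plain average. Giving the binary specialist a larger weight would be one way to lean on it for the predator-vs-herbivore call, though that choice is also an assumption here.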

**Custom preprocessing vs feed-raw-frames.** Camera-trap data has well-known degradations: night-IR exposure variations, camera shake on tree mounts, partial-frame captures when animals move fast. We built a preprocessing layer with lighting normalization (CLAHE on IR frames), frame stabilization (optical flow on adjacent frames in a clip), and occlusion detection (mask out grass / branch overlay). Preprocessing added 4 points of accuracy on the holdout — measurably more than any single model architecture change.
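To make the lighting-normalization step concrete, here is a minimal pure-Python sketch of the per-tile operation at the core of CLAHE: histogram equalization with a clipped histogram. It is a simplification, not the project's code. Full CLAHE also splits the frame into a grid of tiles and interpolates between their mappings, which OpenCV provides through `cv2.createCLAHE`. The function name, the tile values, and the clip limits below are hypothetical.

```python
from typing import List

Tile = List[List[int]]  # rows of 8-bit grayscale pixel values


def clip_limited_equalize(tile: Tile, clip_limit: int, levels: int = 256) -> Tile:
    """Equalize one tile using a histogram whose bins are capped at clip_limit."""
    pixels = [v for row in tile for v in row]
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1

    # Cap tall bins, then spread the clipped excess across all bins.
    # (Simplified redistribution: any remainder goes to the lowest bins.)
    excess = sum(max(0, h - clip_limit) for h in hist)
    bonus, remainder = divmod(excess, levels)
    hist = [
        min(h, clip_limit) + bonus + (1 if i < remainder else 0)
        for i, h in enumerate(hist)
    ]

    # Map each gray level through the cumulative distribution of the histogram.
    lut, running = [], 0
    for h in hist:
        running += h
        lut.append(round((levels - 1) * running / len(pixels)))
    return [[lut[v] for v in row] for row in tile]


def spread(tile: Tile) -> int:
    """Difference between the brightest and darkest pixel in the tile."""
    return max(max(row) for row in tile) - min(min(row) for row in tile)


# Hypothetical dark, low-contrast tile, as an under-exposed IR frame might produce.
dark_tile = [[10, 10, 11, 11], [10, 10, 11, 11], [12, 12, 13, 13], [12, 12, 13, 13]]
clipped = clip_limited_equalize(dark_tile, clip_limit=2)
unclipped = clip_limited_equalize(dark_tile, clip_limit=16)  # limit never reached
print(spread(dark_tile), spread(clipped), spread(unclipped))
```

The clip limit is what keeps noise in flat, dark regions from being amplified as strongly as plain equalization would: the clipped result stretches contrast, but less aggressively than the unclipped one.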

**PyTorch vs TensorFlow.** The client's data team worked in PyTorch. Our team had historically leaned toward TensorFlow, but the maintenance hand-off mattered more, so we went with PyTorch. Porting one of our internal training pipelines to PyTorch in week 1 cost us 2 days, but the post-handoff transition was clean.

**Confidence-thresholded human review.** Two options: ship a fully-automated pipeline that returns a species label for every frame (with the implicit "trust the model"), or ship a hybrid where low-confidence predictions go to a human review queue. We argued for the hybrid because of the 8% expert-baseline error rate — even a perfect model would still be wrong on the genuinely hard frames where humans were also wrong. The confidence-flagged review queue became the training-data flywheel: human corrections feed back as new labeled data, model accuracy improves over time. This was the highest-leverage architecture decision of the engagement.

**SageMaker vs self-managed.** SageMaker for training (Spot instances brought cost down ~70% on the longer training runs); Lambda for inference at modest scale. We documented the upgrade path to dedicated GPU instances for when inference volume justifies it.

How the work actually unfolded.

The build

Weeks 2-5 ran daily training cycles. First-generation models trained on 40,000 frames hit 86% on the holdout. Generation two added the preprocessing layer and hit 90%. Generation three was the ensemble — 93%.

The four architecture iterations were instructive. We tried a transformer-based vision model (ViT-Base) in iteration two; it tied the ResNet on accuracy but trained 4× slower and didn't justify the cost at our data scale. We tried a contrastive-learning pretraining approach in iteration three; it gave a small boost (~1 point) but added an entire pretraining pipeline the client's team would have to maintain. We kept it simple: fine-tuned ResNet ensemble with the preprocessing layer, which the data team can retrain and redeploy without exotic infrastructure.

Week 5 surprise: classification accuracy was 91.2% on the holdout but the human-review queue still rejected 4% of high-confidence predictions. Investigation found a recurring camera angle (one unit was mounted slightly tilted, putting the animal lower in frame than other units) that was over-represented in training and led to mis-calibrated confidence scores. We added a "production-distribution-shift" detector that flagged frames whose feature distribution differed from the training set; those frames got auto-routed to human review regardless of confidence score.
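The case file does not describe how the distribution-shift detector was built. One simple way to flag frames whose features sit far from the training set is a per-feature z-score check, sketched below under that assumption. The class name, the two example features (vertical position of the animal in frame and mean brightness), and the threshold of 3.0 are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


class ShiftDetector:
    """Flags feature vectors that sit far from the training distribution."""

    def __init__(self, training_features: Sequence[Sequence[float]],
                 z_threshold: float = 3.0) -> None:
        columns = list(zip(*training_features))
        self.means: List[float] = [mean(col) for col in columns]
        # Guard against zero variance so the division below is always defined.
        self.stds: List[float] = [pstdev(col) or 1e-9 for col in columns]
        self.z_threshold = z_threshold

    def score(self, features: Sequence[float]) -> float:
        """Largest absolute z-score across all features."""
        return max(
            abs(x - m) / s for x, m, s in zip(features, self.means, self.stds)
        )

    def is_shifted(self, features: Sequence[float]) -> bool:
        return self.score(features) > self.z_threshold


# Hypothetical features per frame: (vertical position of animal, mean brightness).
training = [
    [0.50, 0.40], [0.52, 0.42], [0.48, 0.38], [0.51, 0.41], [0.49, 0.39],
]
detector = ShiftDetector(training)

typical_frame = [0.51, 0.41]
tilted_camera_frame = [0.80, 0.40]  # animal sits much lower in frame
print(detector.is_shifted(typical_frame), detector.is_shifted(tilted_camera_frame))
```

A production detector would more likely work on learned embeddings than on two hand-picked features, but the routing consequence is the same: a shifted frame goes to human review regardless of its confidence score.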

Week 6 was confidence-threshold tuning. We swept the threshold from 0.5 to 0.95 against the holdout and found the sweet spot at 0.78: at that threshold, the auto-accepted predictions had 96% accuracy (better than human expert review at 92%) and the human-review queue absorbed ~22% of frames. The volume was tractable — the conservation biologists could review 22% of frames in their normal time budget.

How launch went.

The cutover

Week 7: the pipeline deployed behind their existing S3 footage-drop. Camera-trap units uploaded daily as before. The pipeline ingested, preprocessed, ran the ensemble inference, and emitted three outputs per frame: predicted species, confidence score, and a flag for human review if confidence was below threshold OR the production-distribution-shift detector fired.
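The per-frame output and the routing rule described above can be sketched as follows. The 0.78 default comes from the threshold tuning in week 6; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrameResult:
    species: str
    confidence: float
    needs_review: bool


def route(species: str, confidence: float, shift_detected: bool,
          threshold: float = 0.78) -> FrameResult:
    """Flag a frame for review if confidence is low OR the shift detector fired."""
    needs_review = confidence < threshold or shift_detected
    return FrameResult(species, confidence, needs_review)


auto_accepted = route("predator", 0.91, shift_detected=False)
shifted = route("predator", 0.91, shift_detected=True)
low_confidence = route("herbivore", 0.60, shift_detected=False)
print(auto_accepted.needs_review, shifted.needs_review, low_confidence.needs_review)
```

The OR matters: a confident prediction on an out-of-distribution frame is exactly the case the week-5 investigation uncovered, so confidence alone cannot clear a frame.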

By week 8 the full 3TB backlog had been processed. The conservation biologists were reviewing only the ~22% of confidence-flagged frames — about 660 hours of work versus the 3,000+ hours full manual review would have taken. The grant-reporting deadline came in week 10. The report shipped with full-coverage data for the first time in the organization's history.

How the work is holding up.

Ninety days later

Ninety days post-handoff: the human-flagged retraining loop has moved overall accuracy from 91.2% to 94.8%. The mechanism is simple — every week, the previous week's human-flagged frames feed back into the training set as new labels. The data team runs a retraining cycle monthly. Each cycle has produced 0.3-0.5 points of accuracy improvement, exactly as we'd modeled.
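The feedback step of that loop amounts to folding reviewer decisions back into the labeled set before the next retraining cycle. A minimal sketch, assuming labels are keyed by frame ID and that the reviewer's label always wins; the function name, frame IDs, and labels are hypothetical.

```python
from typing import Dict, List, Tuple

Review = Tuple[str, str, str]  # (frame_id, model_label, reviewer_label)


def merge_reviews(training_labels: Dict[str, str],
                  reviews: List[Review]) -> Tuple[Dict[str, str], int]:
    """Add reviewed frames to the training labels; count model errors corrected."""
    merged = dict(training_labels)  # leave the original set untouched
    corrections = 0
    for frame_id, model_label, reviewer_label in reviews:
        merged[frame_id] = reviewer_label  # the human label is treated as truth
        if reviewer_label != model_label:
            corrections += 1
    return merged, corrections


training_labels = {"frame-001": "predator", "frame-002": "herbivore"}
weekly_reviews = [
    ("frame-103", "predator", "herbivore"),  # reviewer overturned the model
    ("frame-104", "predator", "predator"),   # reviewer confirmed the model
]
updated_labels, corrections = merge_reviews(training_labels, weekly_reviews)
print(len(updated_labels), corrections)
```

Confirmed frames are kept as well as corrected ones, since both are new labeled examples from the production distribution.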

Full-coverage reporting is now feasible for every grant cycle. The Conservation Director's quote — "we went from sampling footage to processing all of it" — is operational reality. The grant was renewed.

Infrastructure costs settled at ~$2,400/month at current volume (training + inference). Training is the larger line item because the team is on a monthly retrain cadence. Inference is a flat ~$400/month thanks to SageMaker's pricing economics on batch endpoints. Compared to the alternative, $300K+/year for additional specialist reviewer hours, the math closed in week 1.
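The cost comparison works out as follows, using only the approximate figures quoted above. The variable names are ours, and the training figure assumes the monthly total is exactly training plus inference.

```python
# Approximate figures from the case file.
monthly_infra_usd = 2_400          # training + inference
monthly_inference_usd = 400
reviewer_alternative_usd = 300_000  # lower bound: "$300K+/year"

monthly_training_usd = monthly_infra_usd - monthly_inference_usd
annual_infra_usd = monthly_infra_usd * 12
cost_ratio = reviewer_alternative_usd / annual_infra_usd

print(monthly_training_usd)   # 2000
print(annual_infra_usd)       # 28800
print(round(cost_ratio, 1))   # 10.4
```

On these numbers the pipeline's yearly infrastructure bill is under a tenth of the reviewer-hours alternative, before counting the biologists' own review time on flagged frames.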

Honest retrospective.

What we'd do differently

We should have built the production-distribution-shift detector earlier. It was a week-5 hotfix; it should have been a week-2 design item. Training data distribution always drifts from production data distribution. Building the detector as a first-class artifact makes the system resilient to this from day one.

The decision to ship as an ensemble of small models instead of a single large model was correct but felt unusual. We'd defend it again on any vision engagement where the client team will own retraining post-handoff: smaller models are tractable for non-MLOps teams to retrain. A large foundation model would have given us 1-2 more accuracy points but locked the client into a specialist-only retraining workflow.

The confidence-thresholded human review pattern is the highest-leverage thing we shipped. We'd ship it on every classification engagement where the cost of false positives is non-trivial. The flywheel effect (human corrections → retraining → better accuracy → fewer human corrections) compounds over time and is the dominant story 6 months in.

ENGAGEMENT TIMELINE · 8 WEEKS FIXED

Every engagement runs through the same five gates of the FORGE method. Here’s how this case ran.

FORGE GATES · CS-05 SHIPPED

W0 · FRAME: Species classification rubric, edge-case dataset review (lighting / occlusion / partial-frame), labeled-data audit.

W1 · OUTLINE: Pipeline architecture, SageMaker training plan, feature-extraction design (fur patterns, facial features, body shape), batch + real-time modes.

W2–5 · REBUILD: Preprocessing layer (lighting normalization, frame stabilization), model ensemble training, cloud inference pipeline.

W6 · GOVERN: Confidence-threshold tuning, human-review queue for low-confidence frames, model-card documentation.

W7–8 · ENGAGE: First-month deployment across camera-trap network, feedback loop on misclassifications, retraining cadence agreed.

Results · Key metrics · CS-05 Verified

Automated: Species detection from raw camera-trap footage

Scalable: Cloud-native inference handles TB/month throughput

Confidence-scored: Low-confidence frames routed to human review

Self-improving: Continuous-learning loop from human-flagged corrections

STACK · CS-05 SHIPPED

Sector: Sports

Services: Build · Agents

Client: Wildlife conservation organization (anonymized)

Python · PyTorch / TensorFlow · OpenCV · AWS SageMaker · AWS Lambda · S3 · Custom feature-extraction pipeline

Client voice

We went from sampling footage to processing all of it. The conservation insight is now data-driven, not anecdotal.

Conservation Director · wildlife organization
