HARIS is a multi-stage surveillance AI that detects weapons, tracks people across frames, classifies actions from skeletons, and reasons about threat context — producing auditable alerts instead of black-box scores. Built for operators, not dashboards.
Most CCTV deployments are reactive. Hours of footage are reviewed only after an incident, cameras don't talk to each other, and the few "smart" systems that exist flood operators with false alarms or hide behind a single opaque score. A security operator watching 16 camera feeds cannot physically pay attention to all of them — and the moment that matters is usually the one nobody was watching.
HARIS is an AI-first video intelligence system that watches every feed in real time, flags the moments that matter, and tells operators why it flagged them — with a visible skeleton, a tracked identity, a weapon bounding box, and a reasoning trail they can audit.
Five specialist models run per frame, each solving one sub-problem and passing structured evidence to the next; a rule-based alert engine then reasons over their combined output. No single network is asked to do everything — that's what makes the output auditable.
| Stage | Component | What it contributes |
|---|---|---|
| Detect | RT-DETR (custom) | Per-frame bounding boxes for person / gun / knife classes. Fine-tuned on CCTV-domain data for improved recall at surveillance angles. |
| Track | BoT-SORT | Assigns stable IDs across frames so actions accumulate per-person, not per-detection. |
| Pose | RTMPose | 17 COCO keypoints per tracked person. Multi-tier confidence gating separates render-only vs classifier-usable skeletons. |
| Action | ST-GCN | Classifies 3-second skeleton windows into normal / suspicious / hostile categories. Works from geometry, not pixels — robust to lighting and clothing. |
| Re-ID | FaceNet watchlist | Optional face re-identification for known persons of interest. Privacy-gated, operator-enabled. |
| Reason | Alert engine | Geometric gates on weapon detections (aspect ratio, area ratio) + temporal persistence windows + cooldown logic + wrist-to-weapon holder binding. |
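The geometric gates in the Reason stage can be sketched as a plausibility check on each weapon box. This is a minimal illustration, not the production code: the function and threshold values (`min_aspect`, `max_aspect`, `max_area_ratio`) are assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def w(self) -> float:
        return self.x2 - self.x1

    @property
    def h(self) -> float:
        return self.y2 - self.y1

    @property
    def area(self) -> float:
        return self.w * self.h

def passes_geometric_gate(weapon: Box, frame_area: float,
                          min_aspect: float = 0.2, max_aspect: float = 5.0,
                          max_area_ratio: float = 0.15) -> bool:
    """Reject weapon boxes whose shape or size is implausible.

    Thresholds here are illustrative placeholders, not HARIS's tuned values.
    """
    if weapon.h == 0:
        return False
    aspect = weapon.w / weapon.h          # aspect-ratio gate
    area_ratio = weapon.area / frame_area  # area-ratio gate vs the frame
    return min_aspect <= aspect <= max_aspect and area_ratio <= max_area_ratio
```

Boxes that survive this gate still have to pass the temporal persistence and cooldown checks before an alert fires.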
False-positive reduction from the weapon detector v2 fine-tune, measured against the v1 baseline on CCTV-domain test footage.
Training set was built through a multi-source pipeline (COCO, CCTV footage, real-world guns/knives), deduplicated via perceptual hash + CLIP similarity, and grouped by source to prevent train/test leakage during cross-validation. Evaluation pairs CCTV-domain positives (UCF-Crime shooting/assault) with self-recorded negatives for dual-test-set honesty.
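The perceptual-hash half of that deduplication step can be sketched with a difference hash (dHash): one bit per horizontal neighbor pair in a downsampled grayscale grid, with near-duplicates detected by Hamming distance. The function names and the `max_dist` cutoff are illustrative; the CLIP-similarity pass is a separate embedding comparison not shown here.

```python
def dhash(gray):
    """Difference hash: one bit per horizontal neighbor pair.

    `gray` is a small 2D grayscale grid (e.g. 8 rows x 9 cols after resize);
    a real pipeline would downsample the frame first, e.g. with Pillow.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits

def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def is_duplicate(h1: int, h2: int, max_dist: int = 6) -> bool:
    """Treat images as near-duplicates when their hashes almost agree."""
    return hamming(h1, h2) <= max_dist
```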
The dashboard is designed as a professional DVR/NVR replacement — not a research notebook. Every overlay is toggleable, every threshold is live-tunable, and every alert shows its reasoning.
Skeleton + mannequin rendering for every tracked person. When pose estimation drops a frame, a last-valid-pose snapshot holds for ~1.5 seconds; below that, a generic body glow indicates presence without faking anatomy. Tracks fade along their velocity vector for 2 seconds after loss, instead of popping out abruptly.
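The hold-then-glow and velocity-fade behavior can be sketched as two small pure functions. The constants reuse the numbers above (1.5 s pose hold, 2 s fade); the function names and return conventions are assumptions for the sketch.

```python
POSE_HOLD_S = 1.5   # hold the last valid skeleton this long after a drop
FADE_S = 2.0        # fade a lost track along its velocity for this long

def render_state(last_pose, pose_age_s: float) -> str:
    """What to draw for a tracked person given how stale their pose is."""
    if last_pose is None:
        return "glow"        # presence indicator, no fake anatomy
    if pose_age_s <= POSE_HOLD_S:
        return "skeleton"    # last valid pose, held briefly
    return "glow"

def faded_position(last_xy, velocity, age_s: float):
    """Extrapolate a lost track along its velocity; alpha fades to zero."""
    if age_s >= FADE_S:
        return None          # track fully faded out, stop drawing
    alpha = 1.0 - age_s / FADE_S
    x = last_xy[0] + velocity[0] * age_s
    y = last_xy[1] + velocity[1] * age_s
    return (x, y), alpha
```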
Operators dial confidence sensitivity in real time. Drag up to suppress noisy low-confidence detections; drag down for security-critical contexts. Applies live to the detection panel, overlay strokes, the auto-flagger, and the threat heatmap timeline — no page reload.
The scrub bar renders a time-density heatmap of detected threats across the clip, so operators can scan a 10-minute video at a glance and jump directly to the seconds that matter. Manual flag buttons persist operator annotations alongside model alerts.
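The time-density heatmap reduces to binning alert timestamps across the clip and normalizing to the busiest bin. A minimal sketch, with the function name and bin count as assumptions:

```python
def threat_density(alert_times, clip_len_s, n_bins=60):
    """Bin alert timestamps into a normalized density array for the scrub bar.

    Returns one value in [0, 1] per bin; 1.0 marks the busiest moment.
    """
    bins = [0] * n_bins
    for t in alert_times:
        i = min(int(t / clip_len_s * n_bins), n_bins - 1)
        bins[i] += 1
    peak = max(bins) or 1  # avoid division by zero on quiet clips
    return [b / peak for b in bins]
```

The UI would map each value to a color intensity along the scrub bar, letting operators jump straight to the densest bins.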
Detected weapons are bound to the wrist of the nearest tracked person via pose-based proximity. The overlay draws a line from weapon to holder, so operators don't have to guess who is carrying what in crowded scenes.
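Wrist-to-weapon binding can be sketched as a nearest-wrist search over tracked skeletons. The COCO-17 wrist indices (9 and 10) are standard; the function name, confidence floor, and `max_dist` pixel cutoff are illustrative assumptions.

```python
import math

LEFT_WRIST, RIGHT_WRIST = 9, 10  # COCO 17-keypoint wrist indices

def weapon_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def bind_weapon_to_holder(weapon_box, tracks, max_dist=120.0):
    """Return the track id whose confident wrist lies closest to the weapon.

    `tracks` maps track_id -> 17 (x, y, conf) COCO keypoints.
    Returns None when no wrist falls within `max_dist` pixels.
    """
    cx, cy = weapon_center(weapon_box)
    best_id, best_d = None, max_dist
    for tid, kpts in tracks.items():
        for idx in (LEFT_WRIST, RIGHT_WRIST):
            x, y, conf = kpts[idx]
            if conf < 0.3:  # ignore low-confidence wrist keypoints
                continue
            d = math.hypot(x - cx, y - cy)
            if d < best_d:
                best_id, best_d = tid, d
    return best_id
```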
Per-clip brightness/contrast boost for low-light footage and a customizable dark tint for washed-out daytime clips. Both stackable, both persisted per-operator, both zero performance cost (GPU CSS filters).
Every alert carries its evidence: which frames fired, which persons were involved, which weapon class, which confidence scores, which temporal window. Operators can acknowledge, mark false-positive, or escalate — with the reasoning chain attached for review.
Every decision is traceable to a named sub-model. When HARIS is wrong, we know which stage was wrong — and can fix that stage without retraining the whole system. This is the difference between an engineered system and a demo.
Actions are classified over 3-second skeleton windows. Alerts require persistence — 3 out of 5 recent frames must agree, with at least 10% of the window above the persistence threshold. Single-frame detections never fire alerts.
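The 3-of-5 frame vote can be sketched as a fixed-size deque of per-frame verdicts. This covers only the frame-agreement rule; the 10%-of-window check and cooldown logic are separate gates not shown. Class and method names are illustrative.

```python
from collections import deque

class PersistenceGate:
    """Fire only when enough recent frame-level verdicts agree (3 of 5)."""

    def __init__(self, window: int = 5, required: int = 3):
        self.votes = deque(maxlen=window)  # oldest verdict drops automatically
        self.required = required

    def update(self, frame_is_hostile: bool) -> bool:
        """Record one frame's verdict; return whether the gate fires now."""
        self.votes.append(frame_is_hostile)
        return sum(self.votes) >= self.required
```

Because the gate needs three agreeing frames, a single-frame detection can never fire an alert, exactly as stated above.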
The web dashboard is one of several planned clients. A clean JSON boundary means future mobile and desktop clients can plug in to the same server: phone-as-camera, portable operator UI, on-demand face re-identification scans.
Evaluated on a dual test set: public UCF-Crime clips (shooting + assault categories) plus self-recorded domain-specific footage. Group-aware splits prevent source leakage in the metrics. Source-level deduplication before training.
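A group-aware split holds out entire sources, so clips from one video never land on both sides of the boundary. A minimal stdlib sketch (the function name and 20% test fraction are assumptions; a real pipeline might use scikit-learn's `GroupShuffleSplit`):

```python
import random

def group_split(samples, groups, test_frac=0.2, seed=0):
    """Split by source group so no source appears in both train and test.

    `groups[i]` names the source of `samples[i]` (e.g. a video or dataset id).
    """
    rng = random.Random(seed)
    uniq = sorted(set(groups))
    rng.shuffle(uniq)                          # randomize which sources are held out
    n_test = max(1, int(len(uniq) * test_frac))
    test_groups = set(uniq[:n_test])
    train = [s for s, g in zip(samples, groups) if g not in test_groups]
    test = [s for s, g in zip(samples, groups) if g in test_groups]
    return train, test
```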
Completed milestones and what's shipping before the graduation showcase.
We publish the caps. A system that pretends to have no limits is a system that hides them from its operators — and that's the opposite of what surveillance AI should be.
Five students · College of Computer Science & Information Technology · Imam Abdulrahman Bin Faisal University.
Supervised by faculty at the College of Computer Science & Information Technology, Imam Abdulrahman Bin Faisal University, Dammam, Kingdom of Saudi Arabia.