# Enhancement: Stuck Process Detection

**Date:** 2026-02-05
**Status:** Implemented
**Priority:** High

## Problem

The system uses PHP `exec()` to spawn Python background scripts (`sync_to_local.py`, `generate_webp.py`) via `nohup ... &`. This is fire-and-forget: if the Python script hangs mid-execution, the PHP parent cannot detect it, and no supervisor restarts the script.

The cron-based `process_retry_queue.py` (every 15 min) only handles items that **failed and were queued**. It cannot detect:
- Processes that hang mid-transfer (never complete, never queue)
- Zombie processes from crashed scripts
- Duplicate sync attempts when cron retries an item whose previous attempt is still hung
- Hangs in `generate_webp.py`, which has no timeout at all

### Current Gaps

| Gap | Impact |
|-----|--------|
| No hung process detection | Orphaned processes accumulate; files never sync |
| No lockfile/PID mechanism | Race conditions if cron retries while the previous attempt is still running |
| No max wall-clock timeout | `generate_webp.py` can hang indefinitely |
| `sync_to_local.py` hang before queue | If the script hangs before reaching the enqueue step, the file is never queued for retry and never syncs |
| No process inventory | No way to list/count active background sync processes |

### Worst-Case Scenario

1. `sync_to_local.py` starts, opens TCP connection to local server
2. WSL2 networking drops the connection (SYN-ACK loss)
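3. `requests.post()` waits up to the 60s read timeout; that timeout limits each wait for data on the socket, not the total request time, so the call can hang longer if the TCP connection stays half-open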
4. Script never reaches `enqueue_failed()`, never exits
5. File exists on remote but never syncs to local
6. No alert, no retry, no visibility

## Affected Files

| File | Server | Role |
|------|--------|------|
| `photoapi/process_retry_queue.py` | Remote | Cron-driven retry processor — needs stuck detection |
| `photoapi/sync_to_local.py` | Remote | Background sync — needs PID/lockfile tracking |
| `photoapi/generate_webp.py` | Remote | Background WebP — needs wall-clock timeout |
| `photoapi/upload.php` | Remote | Spawns background processes — needs PID logging |

## Recommended Changes

### Immediate: Stuck Item Detection in `process_retry_queue.py`

Add a "staleness" check: if a `meter_files` record was inserted more than N minutes ago and the file still exists on the remote server but has no corresponding queue entry and no record of successful sync, treat it as stuck.

**Approach:**
1. Query `meter_files` for records inserted in the last 24 hours that are older than the staleness threshold (N minutes)
2. For each, check if the file was synced (local server health check or `sync_status` column)
3. If not synced and not in `queue/`, re-enqueue it
4. Log as `STUCK_DETECTED` for monitoring
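
The selection logic in the steps above can be sketched as a pure function, kept separate from the database and filesystem access. This is a minimal sketch, not the actual implementation: the record keys `filename`, `inserted_at`, and `synced` are hypothetical stand-ins for whatever `meter_files` and the sync check really provide, and the 30-minute threshold is taken from the success criteria.

```python
from datetime import datetime, timedelta


def find_stuck(records, queued, now,
               min_age=timedelta(minutes=30),
               max_age=timedelta(hours=24)):
    """Return filenames that are unsynced, not queued, and old enough to be stuck.

    records: iterable of dicts with hypothetical keys
             'filename', 'inserted_at' (datetime), 'synced' (bool).
    queued:  set of filenames that already have an entry in queue/.
    """
    stuck = []
    for rec in records:
        age = now - rec["inserted_at"]
        if age < min_age or age > max_age:
            continue  # too fresh (may still be syncing) or outside the scan window
        if rec["synced"] or rec["filename"] in queued:
            continue  # already done, or already waiting for a retry
        stuck.append(rec["filename"])
    return stuck


# Illustrative data only; real rows would come from meter_files.
now = datetime(2026, 2, 5, 12, 0)
records = [
    {"filename": "a.jpg", "inserted_at": now - timedelta(minutes=45), "synced": False},
    {"filename": "b.jpg", "inserted_at": now - timedelta(minutes=5), "synced": False},
    {"filename": "c.jpg", "inserted_at": now - timedelta(hours=2), "synced": False},
    {"filename": "d.jpg", "inserted_at": now - timedelta(hours=3), "synced": True},
    {"filename": "e.jpg", "inserted_at": now - timedelta(hours=30), "synced": False},
]
for name in find_stuck(records, queued={"c.jpg"}, now=now):
    print(f"STUCK_DETECTED {name}")
```

Passing `now` explicitly keeps the function deterministic, which makes the threshold behaviour easy to test without touching the clock.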

### Immediate: Lockfile for Cron Overlap Prevention

Add a PID lockfile at the start of `process_retry_queue.py`:
- Check if `/tmp/process_retry_queue.lock` exists
- If PID in lockfile is still running, exit gracefully
- Otherwise, write own PID and proceed
- Remove lockfile on exit (including signal handlers)
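
A minimal sketch of the lockfile steps above, assuming a Unix host. The helper names (`pid_is_running`, `acquire_lock`, `release_lock`) are illustrative, not taken from the existing script; only the lock path comes from this document.

```python
import atexit
import os
import signal
import sys

LOCK_PATH = "/tmp/process_retry_queue.lock"


def pid_is_running(pid):
    """True if a process with this PID exists. Signal 0 only probes; nothing is delivered."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def acquire_lock(path=LOCK_PATH):
    """Return True if this process now holds the lock, False if a live process holds it."""
    try:
        with open(path) as f:
            holder = int(f.read().strip())
        if pid_is_running(holder):
            return False
    except (FileNotFoundError, ValueError):
        pass  # no lockfile, or unreadable contents: treat as stale
    with open(path, "w") as f:
        f.write(str(os.getpid()))
    return True


def release_lock(path=LOCK_PATH):
    """Remove the lockfile, but only if it still records our own PID."""
    try:
        with open(path) as f:
            if int(f.read().strip()) == os.getpid():
                os.remove(path)
    except (FileNotFoundError, ValueError):
        pass


def main():
    if not acquire_lock():
        print("previous run still active, exiting")
        return 0
    atexit.register(release_lock)
    # Convert SIGTERM into a normal interpreter exit so the atexit handler runs.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))
    # ... process the retry queue here ...
    return 0
```

The check-then-write sequence is not atomic, so two runs starting at the same instant could both pass the check. With a 15-minute cron interval that is unlikely; if it matters, holding an `fcntl.flock()` lock on the file for the lifetime of the run would close the gap.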

### Medium-Term: Wall-Clock Timeout for `generate_webp.py`

Guard the Pillow conversion with `signal.alarm()` or a `threading.Timer` set to a 120s maximum runtime. If the limit is exceeded, exit with a nonzero error code so the failure is visible in logs.
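
One way the `signal.alarm()` variant might look is sketched below. The names `wall_clock_limit` and `convert_with_limit` are hypothetical, and `convert` stands in for whatever function performs the Pillow conversion. The approach is Unix-only and works only from the main thread.

```python
import signal
from contextlib import contextmanager


@contextmanager
def wall_clock_limit(seconds):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`."""
    def on_alarm(signum, frame):
        raise TimeoutError(f"exceeded {seconds}s wall-clock limit")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def convert_with_limit(convert, limit=120):
    """Run convert() under the limit and return a process exit code."""
    try:
        with wall_clock_limit(limit):
            convert()
    except TimeoutError as exc:
        print(f"ERROR generate_webp: {exc}")
        return 1
    return 0
```

A script could end with `sys.exit(convert_with_limit(do_conversion))`, where `do_conversion` is the existing conversion routine. One caveat: Python runs signal handlers only when the interpreter regains control, so a long call inside C code may not be interrupted until it returns. If that proves to be a problem, running the script under an external limit such as coreutils `timeout` is a possible fallback.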

### Long-Term: Proper Job Queue

Replace the `exec()` + `nohup` launch pattern with a managed job queue or process supervisor:
- **Option A:** Redis + rq (Python Redis Queue) — lightweight, fits existing Python stack
- **Option B:** Supervisor process manager — manages long-running workers
- **Option C:** systemd service units — native Linux process management

## Architecture (Current vs Proposed)

```
CURRENT (fire-and-forget):
  upload.php
    |-> exec("nohup sync_to_local.py ... &")  -- no tracking
    |-> exec("nohup generate_webp.py ... &")   -- no tracking
    \-> echo JSON response

  process_retry_queue.py (cron */15)
    \-> only processes queue/*.json files (items that FAILED and were queued)

PROPOSED (tracked execution):
  upload.php
    |-> exec("nohup sync_to_local.py ... &")  -- PID logged
    |-> exec("nohup generate_webp.py ... &")   -- PID logged
    \-> echo JSON response

  process_retry_queue.py (cron */15)
    |-> Process queue/*.json files (failed items)
    |-> Scan meter_files for unsynced records (stuck items)
    |-> Check for hung PIDs, kill if stale
    \-> Re-enqueue stuck items
```
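
The "PID logged" and "check for hung PIDs" steps in the proposed flow could be sketched as follows. Everything here is an assumption made for illustration: the run directory, the `<name>.<pid>.pid` naming, the 300-second threshold, and the choice to have each Python script register itself rather than having `upload.php` write the PID.

```python
import os
import signal
import time
from pathlib import Path


def register_pid(run_dir, name):
    """Called by a background script at startup: record its PID in run_dir."""
    path = Path(run_dir) / f"{name}.{os.getpid()}.pid"
    path.write_text(str(os.getpid()))
    return path


def is_alive(pid):
    """True if a process with this PID exists (signal 0 is a probe only)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def reap_stale(run_dir, max_age=300, now=None, kill=False):
    """Return PIDs still alive more than max_age seconds after registering.

    PID files whose process has already exited are removed.
    With kill=True, each stale process is also sent SIGTERM.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(run_dir).glob("*.pid"):
        try:
            pid = int(path.read_text().strip())
        except ValueError:
            continue  # unreadable PID file: leave it for manual inspection
        if not is_alive(pid):
            path.unlink()  # process finished; clean up its marker
            continue
        if now - path.stat().st_mtime > max_age:
            stale.append(pid)
            if kill:
                os.kill(pid, signal.SIGTERM)
    return stale
```

Logging `len(reap_stale(...))` on each cron run would give the hung-process count a place in the retry queue log. Because the operating system can reuse PIDs, a production version should confirm that the PID still belongs to the expected script before killing it.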

## Monitoring Additions

```bash
# Check for hung sync processes (running >5 min); etimes = elapsed seconds (procps-ng ps)
ps -eo pid,etimes,args | awk '$2 > 300 && /[s]ync_to_local/'

# Check for orphaned queue items
ls -la /var/www/vhosts/aeihawaii.com/httpdocs/photoapi/queue/

# Check retry queue log for STUCK_DETECTED entries
grep STUCK_DETECTED /var/www/vhosts/aeihawaii.com/httpdocs/photoapi/logs/retry_queue.log
```

## Success Criteria

- [ ] `process_retry_queue.py` detects items stuck >30 min and re-enqueues them
- [ ] Lockfile prevents overlapping cron runs
- [ ] `generate_webp.py` has a 120s max wall-clock timeout
- [ ] Hung process count visible in retry queue logs
- [ ] No duplicate sync attempts for the same file

## Full Documentation

See `STUCK_PROCESS_DETECTION_ENHANCEMENT.md` (in this folder) for detailed specification.
