# Stuck Process Detection Enhancement

**Date:** 2026-02-05
**Status:** Planned
**Priority:** High
**Enhancement Folder:** `ENHANCEMENTS/STUCK_PROCESS_DETECTION/`

---

## 1. Problem Statement

The AEI Photo system spawns background Python processes from PHP by passing a `nohup ... &` command line to `exec()`:

```php
// upload.php — spawns sync_to_local.py
exec('nohup /usr/local/bin/python3.6 sync_to_local.py ... > /dev/null 2>&1 &');

// upload.php — spawns generate_webp.py
exec('nohup /usr/local/bin/python3.6 generate_webp.py ... > /dev/null 2>&1 &');
```

This is "fire-and-forget": PHP returns immediately and the mobile app gets its response in ~0.5s, but the background processes run unsupervised. If they hang or crash, nothing detects the failure or recovers from it.

### Current Retry Architecture

```
sync_to_local.py:
  Attempt 1 -> fail -> wait 2s
  Attempt 2 -> fail -> wait 4s
  Attempt 3 -> fail (no wait after the final attempt)
  All failed + source file exists -> enqueue JSON to queue/

process_retry_queue.py (cron every 15min):
  Read queue/*.json files
  Retry POST -> success -> delete JSON
  Retry POST -> fail -> increment retry_count
  retry_count >= 10 -> move to queue/failed/
```

### What the Retry Queue CANNOT Detect

| Scenario | Why It's Missed |
|----------|-----------------|
| `sync_to_local.py` hangs mid-transfer | Never reaches `enqueue_failed()`, so no queue entry created |
| `generate_webp.py` hangs on Pillow conversion | Has no timeout — can hang indefinitely |
| Zombie processes from crashed scripts | No PID tracking, no process inventory |
| Duplicate sync of same file | No lockfile — cron can re-try a queued item while a previous attempt is still running |
| WSL2 TCP half-open connection | `requests.post()` may hang beyond the 60s read timeout if TCP stays alive: the `requests` timeout bounds each wait on the socket, not the total duration of the call |

### Worst Case

1. Mobile app uploads photo — gets immediate success response
2. `sync_to_local.py` starts, TCP connection to local server established
3. WSL2 networking drops packets (documented ~40-60% SYN-ACK loss)
4. Process hangs in `requests.post()` — TCP stays half-open
5. Script never reaches `enqueue_failed()`, never exits
6. Photo exists on remote server but never syncs to local
7. **No alert, no retry, no visibility**

---

## 2. Affected Files

| File | Server | Path (Deployed) |
|------|--------|-----------------|
| `process_retry_queue.py` | Remote | `/var/www/vhosts/aeihawaii.com/httpdocs/photoapi/process_retry_queue.py` |
| `sync_to_local.py` | Remote | `/var/www/vhosts/aeihawaii.com/httpdocs/photoapi/sync_to_local.py` |
| `generate_webp.py` | Remote | `/var/www/vhosts/aeihawaii.com/httpdocs/photoapi/generate_webp.py` |
| `upload.php` | Remote | `/var/www/vhosts/aeihawaii.com/httpdocs/photoapi/upload.php` |

Backup location: `/var/opt/AEI_REMOTE/AEI_PHOTO_API_PROJECT/REMOTE/photoapi/`

---

## 3. Current Code Analysis

### process_retry_queue.py — Retry Logic (lines 84-154)

```python
# For each queue/*.json file:
retry_count = data.get('retry_count', 0)

if retry_count >= MAX_RETRIES:     # MAX_RETRIES = 10
    # Move to queue/failed/
    shutil.move(queue_file, failed_dir)
    return

if not os.path.isfile(file_path):
    # Source file gone — remove queue entry
    os.remove(queue_file)
    return

# Attempt sync
success = sync_file(...)
if success:
    os.remove(queue_file)
else:
    data['retry_count'] = retry_count + 1
    data['last_error'] = error_reason
    data['last_retry'] = datetime.now().isoformat()
    # Rewrite JSON
```

**Gap:** Only items in the `queue/` directory are processed. A file that was never queued (because the sync hung before reaching `enqueue_failed()`) is invisible to this script.

### sync_to_local.py — Inline Retry (lines 157-176)

```python
for attempt in range(1, INLINE_RETRIES + 1):   # INLINE_RETRIES = 3
    success, error_reason = sync_file(...)
    if success:
        break
    if attempt < INLINE_RETRIES:
        delay = RETRY_DELAYS[attempt - 1]      # [2, 4, 8]
        time.sleep(delay)

if not success and os.path.isfile(file_path):
    enqueue_failed(error_reason=error_reason, ...)
```

**Gap:** If the process hangs during `sync_file()` (inside `requests.post()`), it never reaches the `if not success` branch.
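
One possible way to close this gap is to put a hard wall-clock deadline around each attempt, so that a hung call becomes an ordinary failure and the script still reaches the enqueue step. The sketch below is illustrative only and is not the script's actual code: `wall_clock_limit`, `run_with_retries`, the 90-second default, and the `sync_file`/`enqueue` callables passed in are all assumptions. It relies on `signal.setitimer`, which is Unix-only and only works from the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def wall_clock_limit(seconds):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`."""
    def _on_alarm(signum, frame):
        raise TimeoutError("attempt exceeded %.1fs wall-clock limit" % seconds)

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)   # restore the old handler


def run_with_retries(sync_file, enqueue, delays=(2, 4, 8), retries=3, limit=90):
    """Hypothetical retry loop in which a hung attempt counts as a failure."""
    success, error_reason = False, None
    for attempt in range(1, retries + 1):
        try:
            with wall_clock_limit(limit):
                success, error_reason = sync_file()
        except TimeoutError as exc:
            success, error_reason = False, str(exc)
        if success:
            break
        if attempt < retries:
            time.sleep(delays[attempt - 1])
    if not success:
        # The real script also checks that the source file still exists.
        enqueue(error_reason)
    return success
```

With this shape, an attempt that blocks past the limit is recorded with a timeout reason and falls through to the queue like any other failure. The limit would need to be longer than the largest legitimate transfer time.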

### generate_webp.py — No Timeout

- Uses Pillow `Image.open()` and `.save()` with no wall-clock limit
- A corrupted image or filesystem issue could cause an indefinite hang

### upload.php — exec() Calls

```php
// Line 126-132: WebP generation
exec('nohup /usr/local/bin/python3.6 ' . escapeshellarg($generateScript) . ' ... > /dev/null 2>&1 &');

// Line 141-152: Local sync
exec('nohup /usr/local/bin/python3.6 ' . escapeshellarg($syncScript) . ' ... > /dev/null 2>&1 &');
```

**Gap:** No PID capture, no tracking, and no way to know whether the process started successfully.

---

## 4. Proposed Changes

### 4A. Stuck Item Detection in `process_retry_queue.py`

Add a new function that runs after the existing queue scan:

```python
def detect_stuck_items():
    """Find meter_files records that were never synced and never queued."""
    # Query meter_files for records inserted >30 min ago
    # where the source file still exists on disk
    # but no corresponding queue/*.json file exists
    # Re-enqueue these as stuck items
```

**Logic:**
1. Connect to the database (already has DB connection for logging)
2. Query `meter_files` for records created in the last 24 hours
3. For each record, check:
   - Does the source file still exist on the remote server?
   - Is there already a `queue/{hash}.json` file for this record?
   - Was it created more than 30 minutes ago? (avoid re-queuing items still being processed)
4. If source exists, no queue entry, and >30 min old: create a queue JSON file with `stuck_detected: true`
5. Log as `STUCK_DETECTED` with file details

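**Alternative (simpler, no DB required):**
- Scan the upload directories for files that are >30 min old
- Cross-reference against `queue/` and `queue/failed/` entries
- Any file in neither directory is potentially stuck, provided already-synced files can be excluded by some other record; without that check, every successfully synced file older than 30 min would also match
- Re-enqueue the files identified as stuck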
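
The filesystem-only variant could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the queue JSON is assumed to store the source path under a `file_path` key, and `queued_paths`, `find_stuck_files`, and the caller-supplied `is_synced` predicate are hypothetical names. Because a directory scan alone cannot tell a synced file from an unsynced one, the sketch leaves that decision to `is_synced`.

```python
import glob
import json
import os
import time


def queued_paths(queue_dir):
    """Collect source paths referenced by queue/ and queue/failed/ entries."""
    paths = set()
    patterns = [os.path.join(queue_dir, "*.json"),
                os.path.join(queue_dir, "failed", "*.json")]
    for pattern in patterns:
        for entry in glob.glob(pattern):
            try:
                with open(entry) as fh:
                    data = json.load(fh)
            except (OSError, ValueError):
                continue  # unreadable or half-written entry: skip it
            if not isinstance(data, dict):
                continue
            path = data.get("file_path")  # assumed key name
            if path:
                paths.add(os.path.abspath(path))
    return paths


def find_stuck_files(upload_dir, queue_dir, is_synced, min_age=30 * 60, now=None):
    """Return files older than min_age that are neither queued nor synced."""
    now = time.time() if now is None else now
    known = queued_paths(queue_dir)
    stuck = []
    for root, _dirs, files in os.walk(upload_dir):
        for name in files:
            path = os.path.abspath(os.path.join(root, name))
            if now - os.path.getmtime(path) < min_age:
                continue  # too recent, may still be in flight
            if path in known or is_synced(path):
                continue
            stuck.append(path)
    return sorted(stuck)
```

Each returned path would then be written as a new queue entry with `stuck_detected: true` and logged as `STUCK_DETECTED`.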

### 4B. Lockfile for Cron Overlap Prevention

Add to the top of `process_retry_queue.py`:

```python
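import fcntl
import logging
import os
import sys

LOCKFILE = '/tmp/process_retry_queue.lock'

def acquire_lock():
    """Take an exclusive, non-blocking lock; exit quietly if another run holds it."""
    # 'a+' avoids truncating the running instance's PID before we hold the lock
    lock_fd = open(LOCKFILE, 'a+')
    try:
        fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:  # IOError is an alias of OSError on Python 3
        lock_fd.close()
        logging.info("Another instance is running, exiting")
        sys.exit(0)
    # Lock held: record our PID for diagnostics
    lock_fd.seek(0)
    lock_fd.truncate()
    lock_fd.write(str(os.getpid()))
    lock_fd.flush()
    # Caller must keep this object referenced; closing it releases the lock
    return lock_fd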
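
To check the overlap behaviour without waiting for cron, a variant that returns `None` instead of exiting can be exercised directly. This is a sketch only: `try_lock` and the temporary lock path are hypothetical. It relies on the fact that `flock` locks belong to the open file description, so a second `open()` of the same path contends for the lock just as a second process would.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open, locked file object, or None if the lock is held."""
    fd = open(path, "a+")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        fd.close()
        return None
    fd.seek(0)
    fd.truncate()               # drop any stale PID from a previous holder
    fd.write(str(os.getpid()))
    fd.flush()
    return fd


lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
first = try_lock(lock_path)     # acquires the lock
second = try_lock(lock_path)    # refused while `first` is still open
first.close()                   # closing the file releases the lock
third = try_lock(lock_path)     # succeeds again
```

The lock is also released automatically when the process exits or crashes, so no stale-lock cleanup is needed.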

### 4C. Wall-Clock Timeout for `generate_webp.py`

```python
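import signal
import sys

from PIL import Image

TIMEOUT_SECONDS = 120  # wall-clock limit for one conversion

def timeout_handler(signum, frame):
    raise TimeoutError("WebP generation exceeded %ds wall-clock limit" % TIMEOUT_SECONDS)

# SIGALRM is Unix-only and must be installed from the main thread.
# The handler runs only when control returns to the interpreter, so a hang
# inside a single long C-level call may not be interrupted promptly.
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(TIMEOUT_SECONDS)

try:
    # existing Pillow conversion code
    img = Image.open(source_path)
    img.save(dest_path, 'WEBP', quality=quality)
except TimeoutError as exc:
    sys.exit(str(exc))  # message to stderr, exit status 1
finally:
    signal.alarm(0)  # cancel alarm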

### 4D. PID Logging in `upload.php` (Optional)

Capture the PID of spawned processes for monitoring:

```php
// The command must end with '&' so that $! holds the background job's PID
$syncCmd = 'nohup /usr/local/bin/python3.6 ... > /dev/null 2>&1 &';
$pid = (int) exec($syncCmd . ' echo $!');
error_log("sync_to_local.py spawned with PID: " . $pid);
```
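
Once PIDs are logged, a monitor could check whether a given PID still exists. The following is a minimal sketch, with `pid_is_running` as a hypothetical helper. It uses `os.kill(pid, 0)`, which delivers no signal and only performs the existence and permission checks. Two caveats: a zombie process still counts as existing, and PIDs are eventually reused, so a match should be confirmed against the process command line.

```python
import errno
import os


def pid_is_running(pid):
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        return False             # 0 and negatives address process groups
    try:
        os.kill(pid, 0)          # signal 0: existence/permission check only
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return False         # no such process
        if exc.errno == errno.EPERM:
            return True          # exists, but owned by another user
        raise
    return True
```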

---

## 5. Monitoring Commands

```bash
# List sync processes with elapsed time (running >5 min suggests a hang)
ps -eo pid,etime,args | grep '[s]ync_to_local'

# List WebP processes with elapsed time (running >2 min suggests a hang)
ps -eo pid,etime,args | grep '[g]enerate_webp'

# Count items in retry queue
ls /var/www/vhosts/aeihawaii.com/httpdocs/photoapi/queue/*.json 2>/dev/null | wc -l

# Count failed items
ls /var/www/vhosts/aeihawaii.com/httpdocs/photoapi/queue/failed/*.json 2>/dev/null | wc -l

# Check retry queue log for STUCK_DETECTED
grep STUCK_DETECTED /var/www/vhosts/aeihawaii.com/httpdocs/photoapi/logs/retry_queue.log

# Check sync log for recent activity
tail -50 /var/www/vhosts/aeihawaii.com/httpdocs/photoapi/logs/sync_to_local.log
```
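
For scripted checks, the age thresholds mentioned in the comments above can be applied to the output of `ps -eo pid,etime,args`, whose elapsed-time column has the form `[[dd-]hh:]mm:ss`. The sketch below shows one way to do that; `etime_to_seconds` and `find_long_running` are hypothetical helper names, and the matching on script name is a plain substring test.

```python
def etime_to_seconds(etime):
    """Convert a ps `etime` value ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)               # pad the missing hours field
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_long_running(ps_output, script_name, max_age_seconds):
    """Return (pid, age_seconds) for matching processes older than the limit."""
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)     # pid, etime, full command line
        if len(fields) < 3 or not fields[0].isdigit():
            continue                     # header or malformed line
        pid, etime, args = fields
        if script_name not in args:
            continue
        age = etime_to_seconds(etime)
        if age > max_age_seconds:
            hits.append((int(pid), age))
    return hits
```

The `ps` output itself could be captured with `subprocess.check_output(["ps", "-eo", "pid,etime,args"], universal_newlines=True)` and passed in as `ps_output`.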

---

## 6. Implementation Order

| Phase | Change | Risk | Effort |
|-------|--------|------|--------|
| 1 | Lockfile in `process_retry_queue.py` | Very low | 15 min |
| 2 | Wall-clock timeout in `generate_webp.py` | Low | 15 min |
| 3 | Stuck detection in `process_retry_queue.py` | Medium (needs testing) | 1-2 hours |
| 4 | PID logging in `upload.php` | Low | 15 min |

---

## 7. Long-Term Considerations

The current `nohup` + `exec()` pattern works but is inherently fragile. For higher reliability, consider:

- **Supervisor** (process manager): Manages long-running workers, auto-restarts on crash
- **Redis + rq** (job queue): Python-native task queue with retry, monitoring, and dead-letter support
- **systemd service units**: Native Linux process management with journald logging

These would replace the fire-and-forget pattern with managed execution, but require more infrastructure setup.

---

## 8. Success Criteria

- [ ] `process_retry_queue.py` detects items stuck >30 min and re-enqueues them
- [ ] Lockfile prevents overlapping cron runs
- [ ] `generate_webp.py` exits cleanly after 120s wall-clock timeout
- [ ] `STUCK_DETECTED` entries appear in retry queue log when items are recovered
- [ ] No duplicate sync attempts for the same file (lockfile + queue dedup)
