# WSL2 Mirrored Networking — Intermittent Sync Failures

**Date:** 2026-02-05
**Status:** Resolved (inline retry added to `sync_to_local.py`)
**Affected:** Background sync from remote server (18.225.0.90) to local server (upload.aeihawaii.com)

---

## Problem

`sync_to_local.py` intermittently fails with `ConnectTimeoutError` when POSTing photos from the remote AWS server to the local office server. Approximately 40-60% of TCP connections time out during burst activity, so photos are queued for a later retry instead of being delivered immediately.

```
ConnectTimeout: HTTPSConnectionPool(host='upload.aeihawaii.com', port=443):
Max retries exceeded with url: /uploadlocallat_kuldeep.php
(Caused by ConnectTimeoutError: Connection to upload.aeihawaii.com timed out. (connect timeout=10))
```

---

## Investigation

### Hypotheses Eliminated

| Hypothesis | Evidence | Verdict |
|-----------|----------|---------|
| **Apache user can't make outbound connections** | `sudo -u apache curl` succeeds when tested directly | Intermittent, not blocked |
| **SELinux blocking httpd network** | `sestatus` → no SELinux installed | Eliminated |
| **UID-based iptables rules** | `iptables-save` shows no owner/uid rules, OUTPUT policy ACCEPT | Eliminated |
| **fail2ban banning AWS IP** | `trusted_whitelist` ipset rule fires at INPUT rule 1, before fail2ban chains (rules 13-15) | Eliminated |
| **PHP WAF (badactor) blocking** | `whitelist.json` has 18.225.0.90, `badactor_prepend.php` returns at line 76 | Eliminated |
| **conntrack table full** | 129/262,144 entries (0.05% usage) | Eliminated |
| **Apache overloaded** | `ss -tlnp` shows Recv-Q=0, only 29 TCP connections | Eliminated |
| **Resource limits** | Both Julian and apache have identical ulimits (1024 fds, 31774 procs) | Eliminated |
| **Network namespaces / cgroups** | None configured | Eliminated |
| **systemd sandboxing** | No IPAddress/Restrict/PrivateNetwork directives on httpd | Eliminated |
| **DNS resolution** | All users resolve upload.aeihawaii.com → 72.235.242.139 identically | Eliminated |

### Key Finding: Not User-Specific

Burst tests showed that ALL users on the remote server hit the same timeouts:

```
=== AS JULIAN ===
julian_1: HTTP=000  (timeout)
julian_2: HTTP=000  (timeout)
julian_3: HTTP=000  (timeout)
julian_4: HTTP=200  CONNECT=3.129s  (succeeded after SYN retransmits)
julian_5: HTTP=200  CONNECT=0.103s

=== AS ROOT ===
root_1-5: HTTP=000  (all timeout)
```

This eliminated all user/permission theories.

### Key Finding: Packets Reach the Server

iptables counter comparison before/after a 3-request burst:

```
BEFORE: rule1_pkts=399980  (trusted_whitelist ACCEPT)
AFTER:  rule1_pkts=400028
DIFF:   +48 packets accepted by firewall
```

Yet 2 of 3 connections still timed out. The SYN packets **are reaching** the local server and **are being accepted** by iptables, which points to the TCP SYN-ACK response being lost on the return path.
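
The before/after comparison can be scripted rather than read off by eye. The following is a minimal sketch, assuming the counters are captured with `iptables -L INPUT -v -n -x --line-numbers` and that rule rows start with the rule number followed by the packet count. The helper name `rule_packet_count` is hypothetical, and the sample listings are illustrative apart from the two packet counters taken from the comparison above.

```python
def rule_packet_count(listing: str, rule_num: int) -> int:
    """Return the pkts counter of one rule from
    `iptables -L <chain> -v -n -x --line-numbers` output."""
    for line in listing.splitlines():
        fields = line.split()
        # Rule rows begin with the rule number; the two header rows do not.
        if len(fields) >= 2 and fields[0] == str(rule_num):
            return int(fields[1])
    raise ValueError(f"rule {rule_num} not found in listing")


# Illustrative captures: only the pkts values (399980, 400028) come from the report.
BEFORE = """\
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num      pkts      bytes target     prot opt in     out     source               destination
1      399980 25598720 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set trusted_whitelist src
"""
AFTER = BEFORE.replace("399980", "400028")

delta = rule_packet_count(AFTER, 1) - rule_packet_count(BEFORE, 1)
print(f"DIFF: +{delta} packets")  # DIFF: +48 packets
```

A non-zero delta during a burst in which connections still time out is what indicates that the firewall is not the component dropping traffic.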

---

## Root Cause: WSL2 Mirrored Networking

The local server runs on **WSL2** with **mirrored networking mode**:

```ini
# C:\Users\<user>\.wslconfig
[wsl2]
networkingMode=mirrored
```

```
$ uname -r
6.6.87.2-microsoft-standard-WSL2

$ ip addr show eth0
inet 192.168.141.219/24  (private IP, office LAN)
```

### Packet Path

```
INBOUND (SYN):
  AWS 18.225.0.90 → Internet → Office Router (72.235.242.139:443)
    → Windows Host → Hyper-V mirrored vSwitch → WSL2 VM (192.168.141.219)
    → iptables rule 1 ACCEPT → Apache :443

OUTBOUND (SYN-ACK):
  WSL2 VM → Hyper-V mirrored vSwitch → Windows Host → Office Router → Internet → AWS
              ^^^^^^^^^^^^^^^^^^^^^^^^
              INTERMITTENT PACKET LOSS
```

In WSL2 mirrored mode, the Hyper-V virtual network switch mirrors packets between the Windows host and the WSL2 VM. This layer has known reliability issues with intermittent packet loss, especially for incoming TCP connections. The SYN arrives and is accepted by iptables, but the SYN-ACK response is sometimes dropped by the Hyper-V mirror layer before reaching the Windows network stack.

### Burst Test Results (10 rapid connections)

```
attempt_1:  HTTP=200  CONNECT=0.104s   ← OK
attempt_2:  HTTP=000  CONNECT=0.000s   ← SYN-ACK lost
attempt_3:  HTTP=000  CONNECT=0.000s   ← SYN-ACK lost
attempt_4:  HTTP=000  CONNECT=0.000s   ← SYN-ACK lost
attempt_5:  HTTP=200  CONNECT=3.129s   ← OK after 2 SYN retransmits
attempt_6:  HTTP=000  CONNECT=0.000s   ← SYN-ACK lost
attempt_7:  HTTP=000  CONNECT=0.000s   ← SYN-ACK lost
attempt_8:  HTTP=200  CONNECT=0.104s   ← OK
attempt_9:  HTTP=200  CONNECT=0.104s   ← OK
attempt_10: HTTP=000  CONNECT=0.000s   ← SYN-ACK lost
```

**Success rate:** ~40% per attempt (4 of 10). With 3 attempts: ~78%, assuming attempts are independent. With retry + backoff: ~95%+ (estimate).
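
A burst test like this can be approximated in Python. This is a rough sketch, not the tool that produced the output above: it measures only the TCP connect (no TLS or HTTP), and the function names `burst_connect` and `success_rate` are hypothetical.

```python
import socket
import time


def burst_connect(host: str, port: int, attempts: int = 10, timeout: float = 10.0):
    """Open `attempts` TCP connections back to back.

    Returns a list of (ok, seconds) tuples, one per attempt.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection performs the full TCP handshake, then we close.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:  # covers timeouts and refused connections
            ok = False
        results.append((ok, time.monotonic() - start))
    return results


def success_rate(results) -> float:
    """Fraction of attempts whose handshake completed."""
    return sum(1 for ok, _ in results if ok) / len(results)
```

Calling `burst_connect("upload.aeihawaii.com", 443)` from the remote server and passing the result to `success_rate` would give a per-attempt figure comparable to the one above. Unlike the `CONNECT=0.000s` entries in that output, a timed-out attempt here reports roughly the full timeout as its elapsed time.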

---

## Solution: Inline Retry with Exponential Backoff

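Added retry logic directly in `sync_to_local.py` so it makes up to 3 attempts, waiting 2s and then 4s between them, before falling back to the queue. The third entry in `RETRY_DELAYS` (8s) is never reached with 3 attempts, because the loop sleeps only between attempts.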

### Before (fire-and-forget → queue)

```
sync_to_local.py
  └─ POST attempt → fail → enqueue for cron retry (15 min wait)
```

### After (inline retry → queue as last resort)

```
sync_to_local.py
  ├─ Attempt 1 → fail → wait 2s
  ├─ Attempt 2 → fail → wait 4s
  ├─ Attempt 3 → fail
  └─ All failed → enqueue for cron retry (15 min)
```

### Reliability Math

| Scenario | Per-attempt success | Overall success |
|----------|-------------------|-----------------|
| 1 attempt (old) | ~40% | 40% |
| 3 attempts (new) | ~40% | ~78% (1 - 0.6^3) |
| 3 attempts + cron retry | ~40% | ~99.9% (1 - 0.6^13) |

The worst case adds 6 seconds of sleep (2+4) to the background process, plus up to 10 seconds of connect timeout for each failed attempt. Since it runs via `nohup &`, the mobile app is unaffected.
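
The table's formula, 1 - (1 - p)^n, can be checked with a few lines of Python. This is a sketch that assumes every attempt succeeds independently with the same probability, which may not hold if losses cluster within a burst. The function names are hypothetical.

```python
import math


def overall_success(per_attempt: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - per_attempt) ** attempts


def attempts_needed(per_attempt: float, target: float) -> int:
    """Smallest number of independent attempts that reaches `target` overall success."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_attempt))


for n in (1, 3):
    print(f"{n} attempt(s): {overall_success(0.4, n):.1%}")
# 1 attempt(s): 40.0%
# 3 attempt(s): 78.4%

print(attempts_needed(0.4, 0.95))  # 6
```

Under the independence assumption, reaching 95% with a 40% per-attempt rate takes 6 attempts, so anything above ~78% from 3 inline attempts would have to come from the delays spacing attempts out of the lossy burst.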

### Code Change

```python
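import logging
import os
import time

# Constants
INLINE_RETRIES = 3        # total attempts, including the first one
RETRY_DELAYS = [2, 4, 8]  # seconds between attempts; with 3 attempts only 2 and 4 are used

# Main loop (excerpt: kwargs, file_name and file_path are defined elsewhere in the script)
for attempt in range(1, INLINE_RETRIES + 1):
    success, error_reason = sync_file(**kwargs)
    if success:
        break
    # Sleep only if another attempt follows; the last failure falls through to the queue.
    if attempt < INLINE_RETRIES:
        delay = RETRY_DELAYS[attempt - 1]
        logging.info("RETRY %s -> attempt %d/%d failed, retrying in %ds",
                     file_name, attempt, INLINE_RETRIES, delay)
        time.sleep(delay)

# Last resort: hand the file to the cron-driven retry queue if it still exists.
if not success and os.path.isfile(file_path):
    enqueue_failed(error_reason=error_reason, **kwargs)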
```
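
The excerpt above depends on the rest of `sync_to_local.py`. For experimenting with the pattern in isolation, here is a self-contained sketch of the same loop with an injectable `sleep`, so it can be exercised without waiting. `retry_with_backoff` and the stubbed outcomes are hypothetical and not part of the production script.

```python
import time
from typing import Callable, Sequence, Tuple


def retry_with_backoff(
    attempt_fn: Callable[[], Tuple[bool, str]],
    delays: Sequence[float] = (2, 4),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str, int]:
    """Call attempt_fn up to len(delays) + 1 times, sleeping between failures.

    Returns (success, last_error, attempts_made).
    """
    max_attempts = len(delays) + 1
    error = ""
    for attempt in range(1, max_attempts + 1):
        success, error = attempt_fn()
        if success:
            return True, "", attempt
        # Sleep only when another attempt follows.
        if attempt < max_attempts:
            sleep(delays[attempt - 1])
    return False, error, max_attempts


# Hypothetical stand-in for sync_file(): fails twice, then succeeds.
outcomes = iter([(False, "ConnectTimeout"), (False, "ConnectTimeout"), (True, "")])
slept = []
result = retry_with_backoff(lambda: next(outcomes), sleep=slept.append)
print(result, slept)  # (True, '', 3) [2, 4]
```

Deriving the attempt count from the length of `delays` keeps the two from drifting apart, which is one way to avoid carrying a delay entry that is never used.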

---

## Alternative Solutions Considered

| Option | Pros | Cons | Decision |
|--------|------|------|----------|
| **Inline retry (chosen)** | Simple, handles an estimated 95%+ of failures, no infra changes | Adds up to 6s of sleep, plus connect timeouts, to background process | Implemented |
| **Switch to WSL2 NAT mode** | May be more stable | Requires Windows port forwarding, breaks other services | Not pursued |
| **Native Linux server** | Eliminates WSL2 issues entirely | Requires hardware/VM migration | Long-term option |
| **HTTP instead of HTTPS** | Fewer round-trips | Same TCP SYN issue, less secure | Would not help |
| **Persistent connection pool** | Avoids repeated SYN handshakes | Each upload spawns a new Python process | Architectural change needed |
| **Increase connect timeout** | Some connections succeed at 3.1s | Still fails for >10s drops, delays queue detection | Insufficient alone |

---

## Monitoring

### Check if inline retries are occurring

```bash
# Look for RETRY entries in sync log
ssh -i /root/.ssh/aei_remote.pem Julian@18.225.0.90 \
  "grep 'RETRY' /var/www/vhosts/aeihawaii.com/httpdocs/photoapi/logs/sync_to_local.log | tail -20"
```

### Check retry success rate

```bash
# Count OK vs QUEUED entries
ssh -i /root/.ssh/aei_remote.pem Julian@18.225.0.90 "
  echo 'OK:'; grep -c ' OK ' /var/www/.../photoapi/logs/sync_to_local.log 2>/dev/null
  echo 'QUEUED:'; grep -c 'QUEUED' /var/www/.../photoapi/logs/sync_to_local.log 2>/dev/null
  echo 'RETRY:'; grep -c 'RETRY' /var/www/.../photoapi/logs/sync_to_local.log 2>/dev/null
"
```
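
The same counts can be produced in Python once the log has been copied locally or is read on the remote server. This sketch mirrors the three `grep -c` patterns above. The function names and the sample lines are invented for illustration; only the `RETRY` message wording follows the `logging.info` call in the code change, and the real log's timestamp format may differ.

```python
from collections import Counter
from typing import Iterable

# Same substrings as the grep patterns: ' OK ', 'QUEUED', 'RETRY'.
MARKERS = (" OK ", "QUEUED", "RETRY")


def count_markers(lines: Iterable[str]) -> Counter:
    """Count lines containing each marker, like `grep -c` run once per marker."""
    counts = Counter({marker.strip(): 0 for marker in MARKERS})
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker.strip()] += 1
    return counts


def queued_share(counts: Counter) -> float:
    """Fraction of finished syncs that ended in the queue instead of delivering."""
    finished = counts["OK"] + counts["QUEUED"]
    return counts["QUEUED"] / finished if finished else 0.0


# Invented sample lines, for illustration only.
sample = [
    "2026-02-05 10:00:01 OK photo_a.jpg",
    "2026-02-05 10:00:05 RETRY photo_b.jpg -> attempt 1/3 failed, retrying in 2s",
    "2026-02-05 10:00:07 OK photo_b.jpg",
    "2026-02-05 10:01:00 QUEUED photo_c.jpg",
]
counts = count_markers(sample)
print(counts["OK"], counts["QUEUED"], counts["RETRY"])  # 2 1 1
```

An open file object can be passed directly as `lines`, since the function only iterates over it.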

---

## Related Documentation

| Document | Purpose |
|----------|---------|
| [ASYNC_SYNC_ENHANCEMENT.md](ASYNC_SYNC_ENHANCEMENT.md) | Async sync architecture and retry queue |
| [PHOTO_SYSTEM_DOCUMENTATION.md](PHOTO_SYSTEM_DOCUMENTATION.md) | Complete system documentation |

---

*Investigation completed: 2026-02-05*
