MeshCore Security & Bug Audit

Audit Date: April 2026
Scope: MeshCore/src/ — core mesh networking, packet handling, crypto, CLI, radio wrappers, serial interface, bridges
Focus: Security vulnerabilities, programming bugs, and issues that can cause device hang / unresponsiveness


Summary Table

# Severity Fix Size Category File(s) Short Description Worst-Case Outcome
1 CRITICAL S Out-of-bounds write BaseChatMesh.cpp data[len] = 0 writes 1 byte past the payload[] array on max-length messages Crash/reboot on any node receiving a max-length DM or channel message.
4 HIGH S Pool exhaustion → hang Dispatcher.cpp Packet pool is finite and not recoverable; exhaustion makes device permanently deaf Permanent deaf node — all roles. Rapid burst fills pool; node drops all inbound radio.
5 HIGH S (partial) / XL (full) AES-ECB mode Utils.cpp Encryption uses AES-128 in ECB mode with no IV/nonce — leaks plaintext patterns Passive traffic analysis + replay injection on all encrypted messages. Partial mitigation (random padding, 2-line change, non-breaking) included. Full fix (AES-CTR) requires protocol version bump.
6 HIGH S TRACE flood DoS Mesh.cpp Crafted TRACE packets can drain TX budget and packet pool of target nodes Sustained DoS on repeaters/room servers. Unauthenticated. Self-recovers when flood stops.
7 HIGH S Buffer overflow in CLI CommonCLI.cpp Unbounded sprintf into fixed reply buffer with attacker-controlled strings Crash/reboot or possible code execution via long CLI command over BLE/radio/serial.
8 MEDIUM M ISR race condition RadioLibWrappers.cpp Global volatile state modified from ISR and main loop without atomic guards Single lost radio packet (self-recovering). Practically untriggerable (~20μs window vs ≥10ms demod).
9 HIGH S Radio stuck — no recovery Dispatcher.cpp Radio-not-in-RX detection sets a flag but takes no corrective action Permanent deaf node — all roles. 8s timeout fires but takes zero recovery action.
10 MEDIUM S AdvertDataParser OOB read AdvertDataHelpers.cpp Parser reads fields before checking if app_data_len is large enough Crash/reboot on malformed advertisement from rogue enrolled node.
11 MEDIUM S TRACE path offset overflow Mesh.cpp path_len << path_sz can overflow uint8_t, causing OOB payload read Crash/reboot or wrong forwarding. Unauthenticated — any SDR can trigger.
12 LOW S PacketQueue memory leak StaticPoolPacketManager.h new[] in constructor but no destructor to delete[] ~11 KB leaked per recreation. Low impact (objects created once in current firmware).
13 LOW S restoreFrom no validation SimpleMeshTables.h _next_idx read from file without bounds check — corrupted file causes OOB write Latent OOB write — no callers invoke restoreFrom() today.
14 MEDIUM L ⚠️ protocol Group channel 1-byte hash BaseChatMesh.cpp, MeshCore.h, Mesh.cpp Channel lookup uses only 1 byte of SHA-256 — brute-force injection in ~109 min Garbage message injection to any group channel. Unauthenticated. Breaking change — all nodes must update together.
15 LOW S RNG::nextInt modular bias + div-by-zero Utils.cpp num % (_max - _min) has bias; _max == _min causes division by zero Latent UB — not remotely triggerable. All current callers safe.
16 LOW XL ⚠️ protocol No HMAC replay protection Utils.cpp No nonce/sequence in MAC; replayed packets accepted after hash table cycles
17 LOW S IdentityStore sprintf overflow IdentityStore.cpp sprintf into 40-byte filename buffer with potentially long _dir + name Latent stack overflow — not triggerable in current firmware.
18 LOW S Passwords in plaintext on flash + echoed in replies CommonCLI.cpp, BaseChatMesh.cpp Passwords echoed in encrypted CLI reply and stored unencrypted on flash Credential exposure — physical flash access or authenticated admin CLI.
19 LOW M ftoa static buffer not reentrant TxtDataHelpers.cpp static char return buffer overwritten on second call in same expression Latent wrong CLI response — not triggerable in current firmware.
20 LOW S Serial frame truncation ArduinoSerialInterface.cpp Oversized frames partially captured and returned as valid truncated data Silent data loss on long serial messages. No crash, no security impact.
22 CRITICAL L Flash write blocks radio during acl.save() ClientACL.cpp, MyMesh.cpp Synchronous flash erase+write in main loop starves radio; triggers on every login Permanent radio hang (reboot required) on nRF52. Chains into Bug #9.

Fix Size Legend

Size Meaning Typical scope
S Small — trivial, low-risk 1-2 files, <15 lines changed, no API/protocol change
M Medium — moderate refactor 1-3 files, 15-50 lines, possibly touches multiple call sites
L Large — significant change 4+ files or 50+ lines, may need careful testing across platforms
XL Extra-large — protocol change Requires coordinated rollout of all mesh nodes; breaking wire format
⚠️ protocol Breaking protocol change All nodes in a mesh must update simultaneously or lose interop

Tier 1 — Do first (high severity, small fix, immediate payoff): | # | Severity | Fix Size | Why | |—|———-|———-|—–| | 1 | CRITICAL | S | 1-line bounds check prevents crash on every max-length message | | 7 | HIGH | S | sprintfsnprintf in 2 files; prevents remote code execution | | 9 | HIGH | S | ~15 lines in Dispatcher; prevents permanent deaf node | | 10 | MEDIUM | S | 3-line bounds check; prevents unauthenticated crash | | 11 | MEDIUM | S | 5-line type fix; prevents unauthenticated crash |

Tier 2 — Do next (high severity, slightly more work): | # | Severity | Fix Size | Why | |—|———-|———-|—–| | 4 | HIGH | S | Pool eviction logic in Dispatcher; prevents permanent hang | | 6 | HIGH | S | Rate limiter in Mesh.cpp; prevents unauthenticated DoS | | 5 | HIGH | S (partial) | 2-line random padding; non-breaking partial crypto improvement | | 22 | CRITICAL | L | Flash/radio coordination; requires nRF52-specific testing (7 files) |

Tier 3 — Plan for (lower severity or bigger change): | # | Severity | Fix Size | Why | |—|———-|———-|—–| | 8 | MEDIUM | M | ISR critical sections; correct but practically untriggerable race | | 19 | LOW | M | API change for ftoa; 9 call sites to update; latent-only | | 15 | LOW | S | Div-by-zero guard; latent-only, all current callers safe | | 17 | LOW | S | sprintf → snprintf; latent-only | | 18 | LOW | S | Stop echoing password; low-impact credential hardening | | 13 | LOW | S | Clamp index after restore; latent-only, no current callers | | 12 | LOW | S | Add destructors; affects unit tests mainly | | 20 | LOW | S | Reject oversized frame; data-loss-only, no security impact |

Tier 4 — Long-term / breaking protocol changes: | # | Severity | Fix Size | Why | |—|———-|———-|—–| | 14 | MEDIUM | L ⚠️ | Channel hash 1→4 bytes; all nodes must update together | | 16 | LOW | XL ⚠️ | Add replay nonce to HMAC; wire format change | | 5 | HIGH | XL (full) | AES-CTR with per-packet nonce; protocol version bump |


Detailed Findings


Bug #1 — CRITICAL: Out-of-Bounds Write on Text Message Receive

File: src/helpers/BaseChatMesh.cpp (lines ~215 and ~372)
Impact: Memory corruption, device crash/hang on receiving a max-length text or group text message.

Worst-Case Outcome: Any node (repeater, room server, sensor, or companion) that receives a max-length DM or channel message crashes immediately. The data[len]=0 write corrupts the stack frame of onRecvPacket(). On nRF52 (ARM Cortex-M4) this overwrites saved registers or the return address → HardFault → reboot. On ESP32 it may corrupt adjacent stack variables → undefined behavior or LoadProhibited exception → reboot. Triggerable by any peer who shares a channel secret (channel messages) or any contact (DMs). No brute-force needed for normal API traffic — just send the longest possible message.

Description:
When a text message or group text message is received, the code writes a null terminator at data[len] to turn the decrypted payload into a C string. The data pointer points into a uint8_t data[MAX_PACKET_PAYLOAD] local array (184 bytes). If len == MAX_PACKET_PAYLOAD (which is the case when the encrypted payload fills the entire packet), this writes one byte past the end of the array, corrupting the stack. On embedded devices this can overwrite a return address or adjacent local variable, leading to an immediate crash or unpredictable behavior.

The same pattern is repeated in multiple firmware examples (simple_repeater, simple_room_server, simple_sensor).

Code (peer text message):

// BaseChatMesh.cpp:215 — onPeerDataRecv()
if (type == PAYLOAD_TYPE_TXT_MSG && len > 5) {
    uint32_t timestamp;
    memcpy(&timestamp, data, 4);
    uint8_t flags = data[4] >> 2;

    // len can be > original length, but 'text' will be padded with zeroes
    data[len] = 0; // ← BUG: writes past end of data[] when len == MAX_PACKET_PAYLOAD

Code (group text message):

// BaseChatMesh.cpp:372 — onGroupDataRecv()
    uint32_t timestamp;
    memcpy(&timestamp, data, 4);

    // len can be > original length, but 'text' will be padded with zeroes
    data[len] = 0; // ← BUG: same out-of-bounds write

Note: data is declared on the stack in Mesh::onRecvPacket() as:

uint8_t data[MAX_PACKET_PAYLOAD]; // 184 bytes
int len = Utils::MACThenDecrypt(secret, data, macAndData, pkt->payload_len - i);
// len can be up to MAX_PACKET_PAYLOAD (184)

Bug #4 — HIGH: Packet Pool Exhaustion Causes Permanent Unresponsiveness

File: src/helpers/StaticPoolPacketManager.cpp, src/Dispatcher.cpp
Impact: Node becomes permanently deaf to all traffic; requires reboot to recover.

Worst-Case Outcome: Permanent deafness on any role. On companion_radio (16-slot pool): a BLE-paired phone app sending a burst of ~20 messages fills the pool before the radio can drain it. On repeater/room_server (32-slot pool): a TRACE/flood storm or a fast series of client logins can exhaust the pool. Once exhausted, allocNew() returns NULL for every incoming radio packet — the node is silently deaf to all mesh traffic. Only a reboot recovers. No external attacker needed — a legitimate client with poor connectivity retrying rapidly can trigger this.

Description:
The packet pool is a fixed-size static allocation. Once all packets are in use (held in outbound queue, inbound queue, or via ACTION_MANUAL_HOLD), allocNew() returns NULL and all incoming radio packets are silently discarded. There is no mechanism to reclaim or forcibly free queued packets. A burst of flood or TRACE packets, combined with long retransmit delays, can fill the pool. Once full, the node cannot receive or process any new traffic until rebooted.

The ACTION_MANUAL_HOLD path is particularly dangerous — if a sub-class holds a packet and never releases it (e.g. due to a logic error), that packet slot is permanently lost.

Code:

// StaticPoolPacketManager.cpp
mesh::Packet* StaticPoolPacketManager::allocNew() {
  return unused.removeByIdx(0);  // just get first one (returns NULL if empty)
}

// Dispatcher.cpp — checkRecv()
pkt = _mgr->allocNew();
if (pkt == NULL) {
    MESH_DEBUG_PRINTLN("WARNING: received data, no unused packets available!");
    // ← pkt stays NULL, incoming radio data is simply dropped!
}

// Dispatcher.cpp — processRecvPacket()
void Dispatcher::processRecvPacket(Packet* pkt) {
  DispatcherAction action = onRecvPacket(pkt);
  if (action == ACTION_RELEASE) {
    _mgr->free(pkt);
  } else if (action == ACTION_MANUAL_HOLD) {
    // sub-class is wanting to manually hold Packet instance
    // ← if sub-class never calls releasePacket(), this slot is LEAKED forever
  } else {
    _mgr->queueOutbound(pkt, priority, futureMillis(_delay));
  }
}

Bug #5 — HIGH: AES-128 ECB Mode Leaks Plaintext Patterns

File: src/Utils.cpp
Impact: Identical plaintext blocks produce identical ciphertext; enables pattern analysis and replay detection.

Worst-Case Outcome: Passive traffic analysis and replay injection on all encrypted messages (DMs, channel messages, CLI commands). Any LoRa SDR within radio range can capture packets and determine, without the key, that two messages are identical — because AES-ECB is deterministic. Group channel messages are especially exposed: all members share the same channel secret, so a common greeting from the same sender at the same timestamp produces byte-for-byte identical ciphertext. Attacker can also build a first-block codebook exploiting the known plaintext structure ([timestamp(4)] [type(1)] [sender_name...]) — identical first 16 bytes produce identical first ciphertext block, leaking sender identity and timing. Combined with Bug #16 (no replay protection), captured ciphertext packets can be retransmitted verbatim after the hasSeen() ring buffer cycles (~128 packets), and receivers will accept them as new valid messages. Does NOT enable direct decryption of unknown messages, but degrades encryption from IND-CPA (indistinguishable under chosen-plaintext) to a deterministic permutation — a fundamental cryptographic weakness. Partial mitigation available (non-breaking): replacing zero-padding with random-padding in the last ECB block (memset(tmp,0,16) -> random(tmp,16) in encrypt()) makes short messages (< 16 bytes plaintext, i.e. single-block) fully non-deterministic and randomizes the last block of all longer messages. This is backward-compatible — decryption ignores padding bytes. Full aligned blocks remain deterministic under ECB; a complete fix (AES-CTR with per-packet nonce) requires a future protocol version bump.

PoC: bug5-aes-ecb.py
Patch: bug5-aes-ecb.patch

Description:
All encryption uses AES-128 in ECB (Electronic Codebook) mode — each 16-byte block is encrypted independently with no chaining, no IV, and no nonce. This is a well-known insecure mode: - Identical plaintext blocks produce identical ciphertext blocks, leaking structural information. - The same message encrypted with the same key always produces the same ciphertext, enabling an observer to detect repeated messages. - Padding uses zero bytes, further reducing entropy in the last block.

Code:

// Utils.cpp — encrypt()
int Utils::encrypt(const uint8_t* shared_secret, uint8_t* dest, const uint8_t* src, int src_len) {
  AES128 aes;
  uint8_t* dp = dest;

  aes.setKey(shared_secret, CIPHER_KEY_SIZE);
  while (src_len >= 16) {
    aes.encryptBlock(dp, src);     // <- ECB mode: each block encrypted independently
    dp += 16; src += 16; src_len -= 16;
  }
  if (src_len > 0) {  // remaining partial block
    uint8_t tmp[16];
    memset(tmp, 0, 16);            // <- zero-padding
    memcpy(tmp, src, src_len);
    aes.encryptBlock(dp, tmp);     // <- no IV, no chaining
    dp += 16;
  }
  return dp - dest;
}

Severity Review: Keeping HIGH. ECB mode is a fundamental cryptographic weakness that affects every encrypted packet in the protocol. It does not directly expose plaintext (the AES key is still required), but it violates the basic expectation of semantic security (IND-CPA). The practical exploitation path — passive SDR traffic analysis + codebook construction + replay — is realistic for a motivated attacker near the mesh. However, it requires radio proximity and does not yield plaintext directly, which is why it stays HIGH rather than CRITICAL.


Bug #6 — HIGH: Denial-of-Service via TRACE Packet Flood

File: src/Mesh.cpp
Impact: Remote attacker can drain TX budget and fill packet queue of any target node.

Worst-Case Outcome: Sustained denial-of-service on repeaters and room servers (any firmware with allowPacketForward() returning true). Unauthenticated — TRACE packets carry no signature or encryption, so any LoRa radio (cheap module, SDR, or modified firmware) within radio range can execute this attack. The attacker sends a continuous stream of TRACEs, each with a unique trace_tag (bypassing the 128-entry hasSeen() dedup table) and the target’s 1-byte path hash (only 256 possibilities, trivially guessable or known from public key). Each matching TRACE is queued for retransmit, consuming 1 of 32 pool slots and burning TX airtime budget. At ~10 TRACEs/sec, the pool is exhausted in ~3 seconds; the TX budget drops to 0, pushing next_tx_time indefinitely into the future. Result: the repeater becomes deaf (pool full → can’t allocate for RX) and mute (no TX budget → can’t send). Normal mesh traffic is completely blocked for the duration of the flood. This is NOT a permanent hang — once the attacker stops transmitting, queued TRACEs drain via TX or outbound_expiry timeout (seconds), the pool refills, and updateTxBudget() restores the TX budget proportionally to elapsed idle time. The node self-recovers without reboot. The damage is availability-during-attack only, unlike Bugs #22/#9/#4 which cause permanent hangs. Multi-hop TRACE paths amplify through intermediate repeaters, allowing a single attacker to disrupt an entire chain. Companion radio nodes are unaffected (allowPacketForward() returns false).

Description:
TRACE packets use direct routing and bypass the flood deduplication. The TRACE handler forwards the packet if the current node’s hash matches a byte in the payload path — and since PATH_HASH_SIZE is 1 byte, this is easy to target. The forwarding allocates a priority-5 outbound slot and consumes TX airtime budget. An attacker can generate a continuous stream of unique TRACE packets (each with a different trace_tag), all targeting a specific node’s 1-byte hash, to: 1. Fill the outbound packet queue 2. Drain the TX duty-cycle budget 3. Exhaust the packet pool

Code:

// Mesh.cpp — onRecvPacket(), TRACE handler
if (pkt->isRouteDirect() && pkt->getPayloadType() == PAYLOAD_TYPE_TRACE) {
    if (pkt->path_len < MAX_PATH_SIZE) {
      // ...
      uint8_t flags = pkt->payload[i++];
      uint8_t path_sz = flags & 0x03;

      uint8_t len = pkt->payload_len - i;
      uint8_t offset = pkt->path_len << path_sz;
      if (offset >= len) {
        onTraceRecv(pkt, trace_tag, auth_code, flags, pkt->path, &pkt->payload[i], len);
      } else if (self_id.isHashMatch(&pkt->payload[i + offset], 1 << path_sz)
                 && allowPacketForward(pkt) && !_tables->hasSeen(pkt)) {
        pkt->path[pkt->path_len++] = (int8_t) (pkt->getSNR()*4);
        uint32_t d = getDirectRetransmitDelay(pkt);
        return ACTION_RETRANSMIT_DELAYED(5, d); // ← queued for retransmit, consuming pool + TX budget
      }
    }
    return ACTION_RELEASE;
}

Bug #7 — HIGH: Unbounded sprintf Buffer Overflow in CLI Handler

File: src/helpers/CommonCLI.cpp
Impact: Stack buffer overflow if an attacker sends a long CLI command; can corrupt memory or crash the device.

Worst-Case Outcome: Crash/reboot or potential code execution on repeater, room server, or sensor nodes. Remote attack (BLE/radio): An admin-authenticated client sends a 179-byte CLI command like set AAAA... (175-byte config). The firmware writes "unknown config: " (16 bytes) + 175 bytes = 191 bytes into a 161-byte reply[] buffer → 30-byte stack overflow. On ARM Cortex-M4 (nRF52) this overwrites saved r4-r7 and the return address (LR) → HardFault or attacker-controlled PC. Serial attack: 155-byte config → 171 bytes into 160-byte buffer → 11-byte overflow. The get owner.info path writes up to 122 bytes char-by-char with zero bounds checking. All three firmware variants (repeater, room server, sensor) are affected.

Description:
Multiple handleCommand / handleSetCmd code paths write attacker-controlled string data into the reply buffer using sprintf without any length check. The reply buffer is typically a fixed-size stack array in the calling firmware (often 160–256 bytes). A crafted long command string can overflow this buffer.

Code (unknown config echo):

// CommonCLI.cpp:729 — handleSetCmd()
  } else {
    sprintf(reply, "unknown config: %s", config);
    // ← 'config' is attacker-controlled, no length limit
  }

Code (password echo):

// CommonCLI.cpp:288 — handleCommand()
    } else if (memcmp(command, "password ", 9) == 0) {
      StrHelper::strncpy(_prefs->password, &command[9], sizeof(_prefs->password));
      savePrefs();
      sprintf(reply, "password now: %s", _prefs->password);

Code (owner.info — char-by-char write without bounds):

// CommonCLI.cpp:787 — handleGetCmd()
  } else if (memcmp(config, "owner.info", 10) == 0) {
    *reply++ = '>';
    *reply++ = ' ';
    const char* sp = _prefs->owner_info;
    while (*sp) {
      *reply++ = (*sp == '\n') ? '|' : *sp;  // ← no check on reply buffer end
      sp++;
    }
    *reply = 0;
  }

Bug #8 — MEDIUM: Race Condition on Global Radio State Variable

File: src/helpers/radiolib/RadioLibWrappers.cpp
Impact: Missed packets or radio state desync on interrupt timing edge cases.

Worst-Case Outcome: Single lost radio packet (self-recovering). In recvRaw(), the ISR can set STATE_INT_READY in the ~20μs window between state = STATE_IDLE and state = STATE_RX; the latter clobbers the flag, losing one received packet. The race window is orders of magnitude smaller than LoRa demodulation time (≥10ms at SF7), making occurrence probability vanishingly low. Not externally triggerable. System self-recovers on the next incoming packet.

PoC: No PoC can be written. The race window (~20μs between two assignments in recvRaw()) is 500× shorter than the fastest LoRa packet demodulation time (~10ms at SF7). There is no meshcore_py API or external radio stimulus that can control ARM hardware interrupt timing at the instruction level. The bug is proven by code analysis only.

Severity Review: MEDIUM — kept. The race is real but the window is effectively impossible to hit: LoRa demodulation takes ≥10ms, the vulnerable window is ~20μs, and the consequence (one lost packet) is indistinguishable from normal RF packet loss. No permanent state corruption, no hang, no security impact. Self-recovers on next packet. The fix is a defensive-correctness improvement, not an urgent operational fix.

Description:
The global volatile uint8_t state variable is written from the ISR (setFlag() — called on packet received or TX complete interrupt) and read/written from the main loop (recvRaw(), isSendComplete(), startSendRaw()). While volatile prevents compiler optimization, it does not provide atomicity on the read-check-write patterns in the main loop. On ARM Cortex-M (ESP32, nRF52), an interrupt can fire between a read and a write, causing TOCTOU races.

Race analysis — all state writers in the main loop:

Function Pattern Race window Practical risk
recvRaw() state = STATE_IDLE then state = STATE_RX ~20μs (includes startReceive() SPI call) Primary risk. ISR sets INT_READY between the two assignments; state = STATE_RX clobbers it. One RX packet lost.
isSendComplete() Check INT_READY, then state = STATE_IDLE <1μs No practical risk. During TX, only the TX-complete ISR fires; no second ISR can occur in this window.
startSendRaw() state = STATE_TX_WAIT <1μs No practical risk. Radio is being switched from IDLE to TX; no ISR pending.
onSendFinished() state = STATE_IDLE <1μs No practical risk. Called immediately after isSendComplete() returns true; no new ISR expected.
idle() / resetAGC() / startRecv() state = STATE_IDLE or STATE_RX <1μs Minimal risk. Called during initialization or calibration, not in hot receive path.

The primary race in recvRaw() step-by-step:

Main loop:
  [A] if (state & STATE_INT_READY)    → true (RX complete ISR fired)
  [B] readData() from radio FIFO      (SPI: ~50-200μs)
  [C] state = STATE_IDLE              → clears INT_READY + base state
  ------- ISR fires here: state |= STATE_INT_READY → state = 0x10 -------
  [D] if (state != STATE_RX)          → true (0x10 != 0x01)
  [E] startReceive()                  (SPI: ~10-50μs, puts radio in RX)
  [F] state = STATE_RX                → state = 0x01, CLOBBERS STATE_INT_READY
      → Packet that caused the ISR between [C] and [F] is LOST

Why it’s practically untriggerable: For the ISR to fire between [C] and [F], a second packet must have completed full LoRa demodulation in that window. Even at the fastest setting (SF7/500kHz), a minimum-size packet takes ~5ms to demodulate. The [C]-to-[F] window is ~20μs. The radio’s continuous-RX mode can overlap demodulation of a new packet with main-loop processing, but the new packet’s ISR won’t fire until demodulation completes — which takes far longer than the race window.

Code (current — vulnerable):

static volatile uint8_t state = STATE_IDLE;

static void setFlag(void) {
  state |= STATE_INT_READY;  // ISR: read-modify-write (safe — ISR can't be preempted by main loop)
}

int RadioLibWrapper::recvRaw(uint8_t* bytes, int sz) {
  int len = 0;
  if (state & STATE_INT_READY) {         // check flag
    len = _radio->getPacketLength();
    // ... read data ...
    state = STATE_IDLE;                  // [C] clear flag — ISR can fire after this
  }
  if (state != STATE_RX) {
    int err = _radio->startReceive();
    if (err == RADIOLIB_ERR_NONE) {
      state = STATE_RX;                  // [F] CLOBBERS any INT_READY set since [C]
    }
  }
  return len;
}

Fix: See bug8-isr-race.patch — wraps all main-loop reads and writes of state in noInterrupts() / interrupts() critical sections. Key changes: 1. recvRaw(): Atomically snapshot+clear INT_READY at the start; after startReceive(), only set STATE_RX if no new INT_READY arrived during the SPI call. 2. isSendComplete(): Atomically test-and-clear INT_READY. 3. All other writers (idle(), startRecv(), resetAGC(), startSendRaw(), onSendFinished()): Wrap state writes in critical sections for consistency. 4. All readers (isInRecvMode(), loop()): Snapshot state under noInterrupts() to prevent torn reads.


Bug #9 — HIGH: Radio Stuck Detection With No Recovery

File: src/Dispatcher.cpp
Impact: If the radio gets stuck in non-RX mode, the device becomes permanently deaf without reboot.

Worst-Case Outcome: Permanent deaf node requiring reboot — all roles (repeater, room server, sensor, companion). This bug is the defense-in-depth failure that makes Bug #22 fatal: when acl.save() corrupts the SPI bus on nRF52, the SX1262 radio gets stuck in a non-RX state. The 8-second timeout fires, sets ERR_EVENT_STARTRX_TIMEOUT, but takes no recovery action — no radio reset, no AGC cycle, no reboot. The radio stays stuck permanently. Even without Bug #22, any hardware glitch, power brownout, or SPI noise that leaves the radio in IDLE or TX state will be permanently fatal. The node appears online (BLE still works, serial still responds) but never receives a radio packet again.

Description:
The dispatcher checks if the radio has been outside RX mode for more than 8 seconds and sets an error flag. However, this flag is never acted upon — no attempt is made to reset the radio, force it back into RX mode, or reboot. The flag is only cleared by a manual resetStats() call. If the radio genuinely gets stuck (e.g. after a failed TX or a hardware glitch), the device silently stops receiving packets forever.

Code:

// Dispatcher.cpp — loop()
  bool is_recv = _radio->isInRecvMode();
  if (is_recv != prev_isrecv_mode) {
    prev_isrecv_mode = is_recv;
    if (!is_recv) {
      radio_nonrx_start = _ms->getMillis();
    }
  }
  if (!is_recv && _ms->getMillis() - radio_nonrx_start > 8000) {
    _err_flags |= ERR_EVENT_STARTRX_TIMEOUT;
    // ← That's it. No recovery action. Radio stays stuck.
  }

Bug #10 — MEDIUM: AdvertDataParser Reads Past Buffer

File: src/helpers/AdvertDataHelpers.cpp
Impact: Out-of-bounds read on crafted short advertisement packets; may read adjacent stack/heap data.

Worst-Case Outcome: Crash/reboot on any node (all roles) that processes a malformed advertisement. A compromised or rogue enrolled node broadcasts an advert with app_data_len=1 and flags=0x70 (all field flags set). The parser immediately performs 4 memcpy() calls reading 12 bytes past the 1-byte buffer — into adjacent stack/heap memory. On ESP32 this may cross into flash-mapped IRAM → LoadProhibited exception → reboot. On nRF52 it may cross an MPU boundary → HardFault → reboot. The attack requires a valid ed25519 signature (attacker must own a key pair = be an enrolled node), but hits every receiver in radio range simultaneously. The read data doesn’t leak back to the attacker (no info disclosure), but the crash is the goal.

Description:
The AdvertDataParser constructor checks if flags indicate the presence of lat/lon (8 bytes) and feature fields (2 bytes each) and reads them from app_data using memcpy. The bounds check (app_data_len >= i) only happens after all the reads. A crafted short payload with flags claiming lat/lon and features but providing only a few bytes will cause memcpy to read past the end of the app_data buffer.

Code:

// AdvertDataHelpers.cpp — AdvertDataParser constructor
  AdvertDataParser::AdvertDataParser(const uint8_t app_data[], uint8_t app_data_len) {
    _flags = app_data[0];
    _valid = false;

    int i = 1;
    if (_flags & ADV_LATLON_MASK) {
      memcpy(&_lat, &app_data[i], 4); i += 4;  // ← reads 4 bytes, no check on app_data_len
      memcpy(&_lon, &app_data[i], 4); i += 4;  // ← reads another 4 bytes
    }
    if (_flags & ADV_FEAT1_MASK) {
      memcpy(&_extra1, &app_data[i], 2); i += 2;  // ← no bounds check yet
    }
    if (_flags & ADV_FEAT2_MASK) {
      memcpy(&_extra2, &app_data[i], 2); i += 2;
    }

    if (app_data_len >= i) {  // ← bounds check happens HERE, too late!
      // ... parse name ...
      _valid = true;
    }
  }

Bug #11 — MEDIUM: TRACE Path Offset Overflow Causes OOB Read

File: src/Mesh.cpp
Impact: Out-of-bounds read in payload array from crafted TRACE packets; may crash or leak memory.

Worst-Case Outcome: Crash/reboot or unintended packet forwarding on any node (all roles). Unauthenticated — TRACE packets require no signature or encryption, so any SDR or modified firmware within radio range can trigger this. The attacker crafts a TRACE with path_len=32 and flags=0x03 (path_sz=3). The firmware computes offset = 32 << 3 = 256, which truncates to uint8_t 0. This causes the wrong branch (forward instead of “end of path”), and isHashMatch() reads 8 bytes from payload[9] — which may be past the actual payload end. On nRF52 → HardFault → reboot. On ESP32 → LoadProhibited → reboot. If the OOB bytes happen to match the node’s pub_key prefix (1/256 chance for 1-byte hash), the node incorrectly forwards the TRACE, amplifying the attack. Continuous broadcast of malformed TRACEs = low-cost wide-area DoS against every node in radio range.

Description:
In the TRACE packet handler, offset is calculated as pkt->path_len << path_sz where path_sz comes from attacker-controlled flags (bits 0–1, so 0–3). If path_len is, say, 60 and path_sz is 3, then 60 << 3 = 480, which overflows the uint8_t offset variable (wraps to 480 & 0xFF = 224). The subsequent pkt->payload[i + offset] can then read past the 184-byte payload buffer.

Code:

// Mesh.cpp — TRACE handler
      uint8_t flags = pkt->payload[i++];
      uint8_t path_sz = flags & 0x03;  // ← attacker-controlled: 0..3

      uint8_t len = pkt->payload_len - i;
      uint8_t offset = pkt->path_len << path_sz;  // ← overflow! e.g. 60 << 3 = 480 → truncated to 224
      if (offset >= len) {
        onTraceRecv(pkt, trace_tag, auth_code, flags, pkt->path, &pkt->payload[i], len);
      } else if (self_id.isHashMatch(&pkt->payload[i + offset], 1 << path_sz)
                 // ← i + offset can exceed payload_len

Bug #12 — LOW: PacketQueue Has No Destructor (Memory Leak)

File: src/helpers/StaticPoolPacketManager.h
Impact: If the packet manager is ever destroyed/recreated, all dynamically allocated queue memory leaks permanently.

Worst-Case Outcome: On current embedded firmware: no direct impactStaticPoolPacketManager is created once at startup and never destroyed. However: (1) Unit tests or CI builds using ASAN/Valgrind will flag ~11 KB of leaked memory per StaticPoolPacketManager(32) destruction (3 queue arrays × 9×32 = 864 bytes + 32 Packet objects × ~258 bytes = ~9.1 KB), masking real bugs in leak reports. (2) The PacketManager base class (Dispatcher.h) lacks a virtual destructor, so delete via a PacketManager* pointer is undefined behavior in C++ — the derived destructor (if added) would never be called. (3) Any future runtime reconfiguration (changing pool size, hot-restart) would leak all pool memory on each cycle. All roles affected equally (repeater, room server, sensor, companion).

Description:
PacketQueue allocates three arrays with new[] in its constructor but has no destructor to free them. While on embedded devices this is typically a one-time allocation, any reconfiguration or test scenario that recreates the manager will permanently leak memory.

Code:

// StaticPoolPacketManager.cpp
PacketQueue::PacketQueue(int max_entries) {
  _table = new mesh::Packet*[max_entries];      // ← allocated
  _pri_table = new uint8_t[max_entries];         // ← allocated
  _schedule_table = new uint32_t[max_entries];   // ← allocated
  _size = max_entries;
  _num = 0;
}
// ← No destructor defined anywhere — no delete[]

Bug #13 — LOW: SimpleMeshTables::restoreFrom Loads Indexes Without Validation

File: src/helpers/SimpleMeshTables.h
Impact: Corrupted file causes out-of-bounds write on next hasSeen() call; memory corruption / crash.

Worst-Case Outcome: Latent OOB write on ESP32 — currently no firmware variant calls restoreFrom(), so this bug cannot be triggered today. However, restoreFrom() is a public method compiled on ESP32 (behind #ifdef ESP32). If table persistence is added in the future, a corrupted flash file (from wear, power loss during write, or bit-flip) would load _next_idx > 128, and the next hasSeen() call writes 8 bytes at offset _next_idx * 8 past the start of _hashes[] — up to ~7 KB beyond the array. This corrupts heap metadata or adjacent globals, causing an immediate crash or silent data corruption. Worse, because the corrupted file is read on every boot, this becomes a crash loop — the device reboots, loads the corrupt file, crashes, reboots — until the flash file is manually erased via serial or full reflash.

Description:
restoreFrom() reads _next_idx and _next_ack_idx from a file without validating they are within bounds. If the file is corrupted (flash wear, incomplete write), _next_idx could be > MAX_PACKET_HASHES, causing _hashes[_next_idx * MAX_HASH_SIZE] to write outside the array on the next hasSeen() call.

Code:

// SimpleMeshTables.h
  void restoreFrom(File f) {
    f.read(_hashes, sizeof(_hashes));
    f.read((uint8_t *) &_next_idx, sizeof(_next_idx));       // ← no bounds check!
    f.read((uint8_t *) &_acks[0], sizeof(_acks));
    f.read((uint8_t *) &_next_ack_idx, sizeof(_next_ack_idx)); // ← no bounds check!
  }

// Later in hasSeen():
    memcpy(&_hashes[_next_idx * MAX_HASH_SIZE], hash, MAX_HASH_SIZE);
    _next_idx = (_next_idx + 1) % MAX_PACKET_HASHES;
    // ← if _next_idx was 999, _hashes[999*8] writes WAY past array end

Bug #14 — MEDIUM: Group Channel Hash Uses Only 1 Byte

File: src/MeshCore.h, src/Mesh.h, src/Mesh.cpp, src/helpers/BaseChatMesh.cpp
Impact: High collision rate between channels; brute-force message injection feasible in ~109 minutes.

Worst-Case Outcome: Garbage message injection to any group channel — unauthenticated, any LoRa SDR or modified firmware in radio range. The group message packet format uses only 1 byte of SHA-256 for channel identification (PATH_HASH_SIZE = 1 in MeshCore.h). With 256 possible hash values, a node with 40 channels has a 96% chance of at least one collision pair (birthday problem). An attacker who knows or guesses the target channel’s hash byte (1/256 blind, or trivially enumerable) needs only brute-force the 2-byte HMAC (CIPHER_MAC_SIZE = 2, 65,536 possibilities). At ~10 LoRa packets/sec, near-certain MAC collision in ~109 minutes. The successfully “injected” message decrypts to random garbage (attacker lacks the AES key), but MACThenDecrypt returns success and onGroupDataRecv processes it — displaying a corrupted message from a random “sender” (first 5 bytes parsed as sender ID + timestamp). Secondary impact: CPU waste DoS — flooding all 256 hash byte values forces the target to attempt HMAC-SHA256 + AES-128 decryption for every locally matching channel per packet. With 40 channels, a 256-packet burst triggers 40 full crypto operations. This is a protocol-level design weakness — the fix requires increasing the channel hash size (breaking backward compatibility with older firmware).

Description:
The searchChannelsByHash function compares only the first byte of the channel hash. With 256 possible values, roughly 1 in 256 channels will match any incoming group packet’s hash byte. The only remaining authentication is the 2-byte HMAC, which can be brute-forced in ~65,536 attempts.

The root constant is PATH_HASH_SIZE = 1 in MeshCore.h, which controls both GroupChannel::hash[] storage and the number of hash bytes written into group message packets. The hash is the first byte of SHA-256(channel_secret).

Code (channel lookup — 1-byte comparison):

// BaseChatMesh.cpp
int BaseChatMesh::searchChannelsByHash(const uint8_t* hash, mesh::GroupChannel dest[], int max_matches) {
  int n = 0;
  for (int i = 0; i < MAX_GROUP_CHANNELS && n < max_matches; i++) {
    if (channels[i].channel.hash[0] == hash[0]) {  // ← only 1 byte compared!
      dest[n++] = channels[i].channel;
    }
  }
  return n;
}

Code (receive path — 1-byte extraction):

// Mesh.cpp — onRecvPacket()
    case PAYLOAD_TYPE_GRP_DATA:
    case PAYLOAD_TYPE_GRP_TXT: {
      int i = 0;
      uint8_t channel_hash = pkt->payload[i++];  // ← only 1 byte read from packet
      uint8_t* macAndData = &pkt->payload[i];
      // ...
      int num = searchChannelsByHash(&channel_hash, channels, 4);
      for (int j = 0; j < num; j++) {
        int len = Utils::MACThenDecrypt(channels[j].secret, data, macAndData, pkt->payload_len - i);
        if (len > 0) {  // MAC matched — packet accepted, even if content is garbage
          onGroupDataRecv(pkt, pkt->getPayloadType(), channels[j], data, len);
          break;
        }
      }
    }

Code (packet construction — 1-byte hash written):

// Mesh.cpp — createGroupDatagram()
  int len = 0;
  memcpy(&packet->payload[len], channel.hash, PATH_HASH_SIZE); len += PATH_HASH_SIZE;  // ← 1 byte
  len += Utils::encryptThenMAC(channel.secret, &packet->payload[len], data, data_len);

Code (protocol constant):

// MeshCore.h
#define PATH_HASH_SIZE       1   // ← only 1 byte for all hash-based lookups

---

### Bug #15 — LOW: `RNG::nextInt` Has Modular Bias and Division-by-Zero Risk

**File:** `src/Utils.cpp`, `examples/companion_radio/MyMesh.cpp`  
**Impact:** Division by zero crash if `_max == _min`; biased random output for BLE PIN generation.

**Worst-Case Outcome:** **Latent undefined behavior** — not remotely triggerable in current firmware. All callers use hardcoded bounds where `_min < _max` (e.g., `nextInt(0, 5)`, `nextInt(1, 4)`, `nextInt(100000, 999999)`). The div-by-zero would only occur if a future caller passes dynamically computed bounds where equality is possible (e.g., `nextInt(0, num_peers)` when `num_peers == 0`). On ARM Cortex-M, unsigned integer modulo by zero is undefined behavior — GCC on ARM typically returns 0 silently, but some toolchains trap to HardFault → device reboot. The modular bias affects BLE PIN generation (`nextInt(100000, 999999)`, range = 899,999): 19.1% of PINs are ~0.02% more likely, reducing entropy from 19.78 bits to 19.78 bits (loss < 0.001 bits — negligible). BLE PIN range also incorrectly excludes 999999 (should be `nextInt(100000, 1000000)`). All firmware roles are potentially affected if a div-by-zero caller is added.

**Description:**  
The modulo operation `num % (_max - _min)` introduces bias when the range doesn't divide evenly into 2^32. More critically, if `_max == _min`, the expression becomes `num % 0`, which is undefined behavior in C++ — on most embedded platforms this causes a hardware fault / crash / hang.

**Code:**
```cpp
// Utils.cpp
uint32_t RNG::nextInt(uint32_t _min, uint32_t _max) {
  uint32_t num;
  random((uint8_t *) &num, sizeof(num));
  return (num % (_max - _min)) + _min;  // ← div-by-zero if _max == _min
}

Code (BLE PIN — excluded range):

// companion_radio/MyMesh.cpp
  _active_ble_pin = rng.nextInt(100000, 999999); // ← excludes 999999; should be 1000000

All current callers (safe in stock firmware): | Call Site | _min | _max | Purpose | |———–|——|——|——–| | Mesh.cpp getRetransmitDelay | 0 | 5 | Retransmit jitter | | Mesh.cpp getCADFailRetryDelay | 1 | 4 | CAD retry delay | | SensorMesh.cpp | 0 | 6 | Sensor retransmit | | MyMesh.cpp (repeater/room/companion) | 0 | 5*t+1 | Retransmit jitter | | MyMesh.cpp (companion BLE) | 100000 | 999999 | BLE pairing PIN |


Bug #16 — LOW: No Replay Protection in Encrypt-then-MAC Scheme

File: src/Utils.cpp
Impact: Previously captured packets can be replayed after the hasSeen hash table cycles.

Description:
The encrypt-then-MAC scheme uses only the shared secret for HMAC — no sequence number, nonce, or timestamp participates in the MAC. While the SimpleMeshTables::hasSeen() table provides deduplication, it is a small cyclic buffer (128 entries). After enough new packets are seen, old entries are evicted and a replayed packet will be accepted as new. This enables delayed replay attacks.


Bug #17 — LOW: IdentityStore Buffer Overflow in Filename Construction

File: src/helpers/IdentityStore.cpp
Impact: Stack buffer overflow if directory path + name exceeds ~34 characters.

Worst-Case Outcome: Latent stack overflow – not triggerable in current firmware. All four methods (load x2, save x2) use sprintf(filename, "%s/%s.id", _dir, name) into a char filename[40] stack buffer. Every call site in the codebase (26 total across all firmware variants) uses hardcoded _dir values ("" on nRF52/STM32, "/identity" on ESP32/RP2040) and the hardcoded literal "_main" as name. The worst-case current output is "/identity/_main.id" = 19 bytes, well within the 40-byte buffer (21-byte margin). Neither _dir nor name is reachable from any serial, BLE, or CLI command. The bug becomes exploitable only if a future code change passes user-controlled input (e.g., contact names, CLI arguments) as the name parameter. A name longer than ~25 characters (with _dir="/identity") would overflow the buffer by up to tens of bytes, corrupting the ARM saved registers and return address on the stack – causing a crash or potentially controlled code execution.

Description:
sprintf(filename, "%s/%s.id", _dir, name) writes into a 40-byte char filename[40] buffer with no length check. While name is typically set programmatically, a long _dir configuration string could push the total past 40 bytes.

Code:

// IdentityStore.cpp -- same pattern in all four methods
bool IdentityStore::load(const char *name, mesh::LocalIdentity& id) {
  bool loaded = false;
  char filename[40];
  sprintf(filename, "%s/%s.id", _dir, name);  // <- no size check

All callers verified safe: | Caller | _dir | name | Output bytes | |——–|——|——|————–| | All nRF52/STM32 variants | "" | "_main" | 10 | | All ESP32/RP2040 variants | "/identity" | "_main" | 19 |


Bug #18 — LOW: Passwords Stored Unencrypted on Flash and Echoed in CLI Replies

File: src/helpers/CommonCLI.cpp, src/helpers/CommonCLI.h, src/helpers/BaseChatMesh.cpp
Impact: Admin and guest passwords stored unencrypted on flash; echoed in encrypted CLI reply packets that the admin client decrypts automatically.

Worst-Case Outcome: Credential exposure requiring admin authentication or physical access. Three vectors: (1) The "password <new>" CLI command echoes the new admin password in the encrypted reply: sprintf(reply, "password now: %s", _prefs->password). The reply IS encrypted (AES-ECB) over the radio, but the admin’s companion radio decrypts it automatically — the plaintext password appears in the meshcore_py event payload and app UI. (2) The "get guest.password" CLI command returns the guest password in an encrypted reply: sprintf(reply, "> %s", _prefs->guest_password) — again, decrypted automatically by the client. (3) Both passwords are stored as raw char[16] at known offsets in the /com_prefs flash file (admin at offset 56, guest at offset 88) with no encryption or hashing — physical access to the device allows trivial extraction. The login protocol sends the password as plaintext within AES-ECB encrypted ANON_REQ packets (sendLogin() in BaseChatMesh.cpp), enabling ciphertext pattern analysis (same password = same ECB block). All roles affected (repeater, room server, sensor).

Description:
When an admin sets or changes the password, the reply echoes the new password in the encrypted CLI response. The get guest.password command returns the guest password in an encrypted reply. In both cases, the admin’s companion radio decrypts the reply automatically, so the plaintext password is visible to the client software. Additionally, all passwords are stored in plaintext in the preferences file on flash — anyone with physical device access can extract credentials directly.

Code (admin password echo):

// CommonCLI.cpp:284
    } else if (memcmp(command, "password ", 9) == 0) {
      StrHelper::strncpy(_prefs->password, &command[9], sizeof(_prefs->password));
      savePrefs();
      sprintf(reply, "password now: %s", _prefs->password);   // <- echoed over radio!

Code (guest password retrieval):

// CommonCLI.cpp:754
  } else if (memcmp(config, "guest.password", 14) == 0) {
    sprintf(reply, "> %s", _prefs->guest_password);  // <- plaintext over radio

Code (login sends password in payload):

// BaseChatMesh.cpp:558
    memcpy(&temp[4], password, len);  // <- password plaintext in encrypted packet
    pkt = createAnonDatagram(PAYLOAD_TYPE_ANON_REQ, self_id, recipient.id,
                             recipient.getSharedSecret(self_id), temp, tlen);

Code (flash storage – no encryption):

// CommonCLI.cpp:savePrefs()
    file.write((uint8_t *)&_prefs->password[0], sizeof(_prefs->password));       // offset 56
    file.write((uint8_t *)&_prefs->guest_password[0], sizeof(_prefs->guest_password)); // offset 88

Bug #19 — LOW: ftoa/ftoa3 Static Buffer Not Reentrant

File: src/helpers/TxtDataHelpers.cpp
Impact: Corrupted output when called twice in one expression (e.g. in sprintf argument list).

Worst-Case Outcome: Latent wrong CLI response – not triggerable in current firmware. All 9 call sites in CommonCLI.cpp use a single ftoa/ftoa3 per expression, or explicitly strcpy to a local buffer before the next call (the “radio” getter at lines 775-777). The bug will manifest only if a future developer writes sprintf(reply, "%s,%s", ftoa(lat), ftoa(lon)) – since both calls return a pointer to the same static char[16] buffer, the second call overwrites the first, and both %s format arguments resolve to the same (second) value. The result is a duplicated value in the CLI response text (e.g. “3.45,3.45” instead of “1.23,3.45”). No memory corruption, no crash, no security impact – only incorrect text output. The developer was clearly aware of the problem (the explicit strcpy workaround proves it), but the pattern is fragile and easy to misuse.

PoC: bug19-ftoa-static-buffer.py
Patch: bug19-ftoa-static-buffer.patch

Description:
Both StrHelper::ftoa() and StrHelper::ftoa3() return a pointer to a static char buffer. If called twice in the same sprintf or expression, the second call overwrites the first result before it is consumed. The handleGetCmd code for “radio” params works around this by copying to local buffers, but other call sites may not.

Code:

// TxtDataHelpers.cpp
const char* StrHelper::ftoa(float f) {
  static char tmp[16];  // <- shared across all calls
  int status;
  _ftoa(f, tmp, &status);
  // ...
  return tmp;  // <- pointer to static buffer, next call overwrites
}

const char* StrHelper::ftoa3(float f) {
  static char s[16];  // <- same problem
  // ...
  return s;
}

Severity Review: Keeping LOW. The bug is entirely latent – no current call site triggers it, and the impact if triggered is limited to wrong text in a CLI response (no memory safety issue). The fix is straightforward: change the API to accept a caller-provided buffer.


Bug #20 — LOW: Serial Frame Truncation Returns Corrupt Data

File: src/helpers/ArduinoSerialInterface.cpp
Impact: When oversized frames arrive, a partially-captured truncated frame is returned as valid.

Worst-Case Outcome: Silent data loss on long serial messages – affects only serial (UART) connections, not BLE or WiFi. When a companion app or meshcore_py script sends a frame larger than MAX_FRAME_SIZE (172 bytes) over serial, checkRecvFrame() stores the first 172 bytes in rx_buf and discards the rest. When all _frame_len bytes have been read from UART, the code clamps _frame_len to 172 and returns the truncated buffer as a valid frame. The command handler (handleCmdFrame) then parses a shorter-than-intended payload. In practice, this affects CMD_SEND_TXT_MSG and CMD_SEND_CHANNEL_TXT_MSG with long text payloads – the message is silently shortened before encryption and transmission. No crash, no memory corruption, no security impact – just data loss. The meshcore_py library’s serial_cx.py has no outbound size limit, so it can trigger this. The WiFi interface (SerialWifiInterface) already correctly rejects oversized frames with a debug log. The BLE interfaces are naturally limited by MTU.

PoC: bug20-serial-truncation.py
Patch: bug20-serial-truncation.patch

Description:
When _frame_len > MAX_FRAME_SIZE, the code reads bytes into rx_buf only up to MAX_FRAME_SIZE and discards the rest. But when the frame is “complete” (all _frame_len bytes received), it truncates _frame_len to MAX_FRAME_SIZE and returns the buffer. The problem is that rx_buf contains only the first MAX_FRAME_SIZE bytes, but the frame may have been longer — the caller receives a truncated, potentially malformed frame and tries to parse it.

Code:

// ArduinoSerialInterface.cpp — checkRecvFrame()
      default:
        if (rx_len < MAX_FRAME_SIZE) {
          rx_buf[rx_len] = (uint8_t)c;   // rest of frame will be discarded if > MAX
        }
        rx_len++;
        if (rx_len >= _frame_len) {  // received a complete frame?
          if (_frame_len > MAX_FRAME_SIZE) _frame_len = MAX_FRAME_SIZE;    // truncate
          memcpy(dest, rx_buf, _frame_len);
          _state = RECV_STATE_IDLE;
          return _frame_len;  // <- returns truncated/corrupt data as if valid
        }

Severity Review: Keeping LOW. The bug only affects serial UART connections (not BLE, not WiFi). The impact is silent text truncation – no memory safety issue, no crash, no security impact. Most real-world usage is via BLE (phone apps), which is unaffected.


Bug #22 — CRITICAL: Synchronous Flash Write During acl.save() Blocks Radio and Causes Hang

File: src/helpers/ClientACL.cpp, examples/simple_repeater/MyMesh.cpp (line ~1294)
Impact: Device becomes unresponsive for 200–500 ms on every login; on nRF52 with BLE this can drop the BLE connection entirely. Confirmed to cause observable “hang” on repeater login.

Worst-Case Outcome: Permanent radio hang (requires reboot) on nRF52 repeaters and room servers. When acl.save() triggers flash page erase via the SoftDevice, the CPU is halted for ~85 ms per page. If the SX1262 LoRa radio has an active SPI transaction at that moment, the SPI bus is corrupted — the radio enters an undefined state and never recovers. This chains into Bug #9 (stuck detection with no recovery). The node appears partially alive (BLE responds, serial works) but is permanently deaf to all LoRa mesh traffic. Risk scales with ACL size: more enrolled clients = more flash pages to erase/write = wider time window for SPI collision. On ESP32 the flash subsystem is less blocking, so the impact is a 200-500 ms radio blackout (temporary, not permanent). Triggered reliably ~5 seconds after any first client login to a repeater.

Description:
After a successful password login, the repeater sets dirty_contacts_expiry = futureMillis(5000). Five seconds later, in the main loop(), acl.save(_fs) is called. This performs a synchronous flash file delete + rewrite of the entire client table (up to 20 clients × 136 bytes = 2720 bytes).

On nRF52 with SoftDevice (BLE), the openWrite() function first calls _fs->remove("/s_contacts") which triggers a flash page erase via sd_flash_page_erase(). A single page erase takes ~85 ms during which the SoftDevice blocks CPU execution. Then the subsequent file.write() calls trigger additional flash page program operations. The total blocking time can reach 200–500 ms.

During this entire blocking period: 1. Dispatcher::loop() is not called — no radio RX/TX processing occurs 2. The radio ISR may fire (setting STATE_INT_READY), but recvRaw() is never called to service it 3. If the radio was mid-transmit of the login reply, the outbound_expiry timer may pass, causing the reply to be marked as a TX failure 4. On nRF52 with BLE, the SoftDevice may lose BLE connection events during prolonged flash operations, causing the BLE serial link to disconnect 5. The 8-second “radio stuck” detector (Bug #9) doesn’t help because it only sets a flag with no recovery action

Because login is the one operation guaranteed to dirty the ACL for a new client, this bug is triggered reliably on every first login to a repeater.

Code (loop triggers save after 5-second delay):

// MyMesh.cpp — loop()
  // is pending dirty contacts write needed?
  if (dirty_contacts_expiry && millisHasNowPassed(dirty_contacts_expiry)) {
    acl.save(_fs);              // ← blocks main loop for 200-500ms!
    dirty_contacts_expiry = 0;
  }

Code (save performs synchronous flash delete + write):

// ClientACL.cpp — openWrite() on nRF52
static File openWrite(FILESYSTEM* _fs, const char* filename) {
  #if defined(NRF52_PLATFORM) || defined(STM32_PLATFORM)
    _fs->remove(filename);                    // ← flash page ERASE (~85ms per page)
    return _fs->open(filename, FILE_O_WRITE); // ← allocate new file
  #else
    return _fs->open(filename, "w", true);
  #endif
}

// ClientACL.cpp — save()
void ClientACL::save(FILESYSTEM* fs, bool (*filter)(ClientInfo*)) {
  File file = openWrite(_fs, "/s_contacts");
  if (file) {
    for (int i = 0; i < num_clients; i++) {
      auto c = &clients[i];
      // 136 bytes per client × up to 20 clients = 2720 bytes of flash writes
      file.write(c->id.pub_key, 32);       // ← each write may trigger flash page program
      file.write((uint8_t *)&c->permissions, 1);
      // ... more writes ...
      file.write(c->out_path, 64);
      file.write(c->shared_secret, PUB_KEY_SIZE);
    }
    file.close();                           // ← final flush + metadata write
  }
}

Code (login sets the dirty flag):

// MyMesh.cpp — handleLoginReq()
    if (perms != PERM_ACL_GUEST) {
      dirty_contacts_expiry = futureMillis(LAZY_CONTACTS_WRITE_DELAY); // 5000ms
    }

Most Likely to Cause Device Hang / Unresponsive

Priority Bug # Severity Mechanism
1 #22 CRITICAL acl.save() blocks main loop 200–500 ms — SPI corruption on nRF52 permanently kills LoRa radio; triggered by normal first login
2 #1 CRITICAL OOB write on data[len]=0 corrupts stack — immediate crash on receiving max-length text message
3 #4 HIGH Packet pool exhaustion via flood/TRACE abuse — node permanently drops all traffic
4 #9 HIGH Radio stuck in non-RX with no recovery — device permanently deaf (chains from #22)
5 #6 HIGH TRACE flood drains TX budget and fills queue — node stops responding to normal messages (self-recovers)
6 #7 HIGH Unbounded sprintf overflow — 30-byte stack corruption via BLE/radio CLI command
7 #15 LOW RNG::nextInt division-by-zero if _max == _min — latent UB, not remotely triggerable