30. Substring with Concatenation of All Words

題目 / Problem

中文： 給你一個字串 s 和一個字串陣列 words，words 裡所有字串長度都相同。一個「拼接字串」是指：把 words 裡所有單字依任意排列順序首尾相連得到的字串（每個單字剛好用一次）。請回傳 s 中所有「拼接子字串」的起始索引，順序不限。

English: You are given a string s and an array words where every word has the same length. A concatenated string is the joining of all the words of words in any order (each word used exactly once). Return the starting indices of every substring of s that is such a concatenation. Any order is fine.

Constraints: - 1 <= s.length <= 10^4 - 1 <= words.length <= 5000 - 1 <= words[i].length <= 30 - s 和 words[i] 只含小寫英文字母 / lowercase letters only.

Worked example: - s = "barfoothefoobarman", words = ["foo","bar"] → [0, 9] - 索引 0 開始是 "barfoo" = ["bar","foo"]；索引 9 開始是 "foobar" = ["foo","bar"]。兩者都是 words 的排列。/ index 0 gives "barfoo", index 9 gives "foobar", both permutations of words.

名詞解釋 / Glossary

雜湊表 / hash map：把「鍵」對應到「值」的資料結構，查詢平均是 O(1)。這裡用它記錄「每個單字需要出現幾次」以及「目前視窗內每個單字出現幾次」。/ A key→value structure with average O(1) lookup; here it stores how many times each word is needed and how many are currently in the window.
多重集 / multiset：可以重複元素的集合，只看「每個元素出現幾次」而不看順序。題目要的就是「視窗內單字的多重集」剛好等於 words 的多重集。/ A set that allows duplicates and ignores order — exactly what "the words in the window must match words" means.
滑動視窗 / sliding window：用左右兩個邊界框住一段連續區間，邊界只往前移動、不回頭，藉此把重複計算省掉。/ A moving [left, right) interval whose ends only advance, avoiding repeated work.
雙指針 / two pointers：用 left、right 兩個下標指向視窗兩端。/ Two indices left and right marking the window's ends.
以單字為步長 / stepping by word length：因為每個單字長度固定為 L，視窗每次前進 L 個字元（剛好一個單字），而不是 1 個字元。/ Since each word has fixed length L, the window advances L characters (one whole word) at a time.
id 編號（C 版本用）：把每個不同的單字字串對應到一個整數，這樣計數就能用整數陣列而不必反覆比較字串。/ Mapping each distinct word string to an integer so counts live in an int array instead of comparing strings repeatedly.

思路

最直接的暴力法：拼接字串的總長度固定是 totalLen = 單字長度 L × 單字數量 k，所以候選的起點只有 s 中每個位置。對每個起點，把後面 totalLen 個字元切成 k 個小段，逐段查是不是 words 裡的單字、且數量不超量。這樣每個起點要花 O(k·L)，總共 O(n·k·L)，在最壞情況會太慢。問題出在：相鄰起點的切法其實高度重疊，我們重複切了很多一樣的單字。優化的關鍵是注意到「以單字為步長」的對齊：如果我們固定一個偏移量 i（i 從 0 到 L−1），那麼從 i 出發、每次跳 L 個字元，所有單字邊界都會對齊，於是同一個偏移類別裡可以用滑動視窗一次掃完，不必每個起點重來。對每個偏移 i，我們維護一個視窗計數表 window，右指針每次吃進一個單字：若它根本不在 need（需求表）裡，視窗整段作廢、從下一格重啟；若它在 need 裡但數量超過需求，就從左邊一直丟單字直到不超量；當視窗裡的單字數剛好等於 k，代表多重集完全吻合，記下 left 當作答案，再把最左單字丟掉、視窗右移繼續找。need 是不變量：視窗合法的條件永遠是「每個單字的數量都不超過 need，且總數等於 k」。

The brute force: the answer always has fixed length totalLen = L × k (word length times word count), so candidate starts are just positions in s. For each start, chop the next totalLen characters into k pieces and check each piece is a needed word without exceeding its quota — O(k·L) per start, O(n·k·L) overall, which is too slow because neighboring starts re-chop the same words over and over. The fix exploits word-length alignment: fix an offset i in 0..L−1 and only step by L characters; then every word boundary lines up, and the whole offset class can be swept with one sliding window instead of restarting at every index. For each offset we keep a window count map. The right pointer eats one word at a time: if that word isn't in need, the window is invalid so we wipe it and restart past the bad word; if it's needed but now over its quota, we drop words from the left until it's legal again; and when the window holds exactly k words, the multiset matches perfectly — we record left, drop the leftmost word, and slide on. The invariant is need: a legal window never exceeds any word's required count and its total is at most k.

逐步走查 / Walkthrough

輸入 / Input: s = "barfoothefoobarman", words = ["foo","bar"]. L = 3, k = 2, totalLen = 6, n = 18, need = {bar:1, foo:1}.

我們追蹤偏移 i = 0（其他偏移 1、2 不會對齊出答案）。視窗每次前進 3 個字元。 We trace offset i = 0 (offsets 1 and 2 align to nothing here). Window steps by 3.

right	s[right..+3]	在 need? / in need?	動作 / action	window	cnt	left	記錄? / record?
0	`bar`	✅	加入 / add	{bar:1}	1	0	—
3	`foo`	✅	加入；cnt==k / add; full	{bar:1,foo:1}	2	0	記 0；丟最左 `bar` / record 0, drop `bar`
			丟最左後 / after drop	{bar:0,foo:1}	1	3	—
6	`the`	❌	作廢視窗、重啟 / wipe & restart	{}	0	9	—
9	`foo`	✅	加入 / add	{foo:1}	1	9	—
12	`bar`	✅	加入；cnt==k / add; full	{foo:1,bar:1}	2	9	記 9；丟最左 `foo` / record 9, drop `foo`
			丟最左後 / after drop	{foo:0,bar:1}	1	12	—
15	`man`	❌	作廢視窗、重啟 / wipe & restart	{}	0	18	—

right = 18 時 18 + 3 > 18，停止。最終答案 / final answer: [0, 9]. ✅

Solution — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * 算法 / Algorithm:
 * 把 s 切成「單字大小」的格子，對每個起始偏移 0..L-1 各跑一次滑動視窗。
 * 視窗以單字為步長前進，維護內部單字計數並與 need 比較，齊全時記錄起點。
 * Cut s into word-sized cells; for each offset 0..L-1 run one sliding window
 * that steps by whole words, keeps per-word counts, and records a start index
 * whenever the window holds exactly the required multiset of words.
 */

/* 雜湊表節點：把一個單字字串對應到整數 id / hash node: maps a word string to an int id */
typedef struct Entry {
    const char *word;     /* 指向某個 words[i]，長度為 wordLen / points into words[i], length wordLen */
    int id;               /* 此單字的編號 / this word's id */
    struct Entry *next;   /* 同一桶內的下一個節點（鏈結法解衝突）/ next node in same bucket (chaining) */
} Entry;

#define TABLE 16384       /* 桶數量，取 2 的次方 / number of buckets, a power of two */

/* djb2 字串雜湊，只看前 len 個字元 / djb2 string hash over the first len chars */
static unsigned long hashStr(const char *s, int len) {
    unsigned long h = 5381;                       /* 經典起始值 / classic seed value */
    for (int i = 0; i < len; i++)                 /* 逐字元混合 / fold each byte in */
        h = h * 33 + (unsigned char)s[i];         /* h*33 + 字元值 / h*33 + byte */
    return h;
}

/* 查單字的 id；create=1 時找不到就新建。回傳 id；create=0 且找不到回傳 -1。
 * Look up a word's id; if create=1, insert a new one. Returns id, or -1 when
 * create=0 and the word is absent. */
static int wordId(Entry **table, int *distinct, const char *word, int len, int create) {
    unsigned long h = hashStr(word, len) % TABLE; /* 算出桶的位置 / pick the bucket */
    for (Entry *e = table[h]; e != NULL; e = e->next)  /* 走訪這條鏈結 / walk the chain */
        if (memcmp(e->word, word, len) == 0)      /* 比較 len 個位元組是否相同 / compare len bytes */
            return e->id;                         /* 命中，回傳既有 id / hit */
    if (!create) return -1;                       /* 只查不建：回 -1 / lookup-only miss */
    Entry *e = (Entry *)malloc(sizeof(Entry));    /* 配置新節點 / allocate a node */
    e->word = word;                               /* 記住單字位址 / remember the word pointer */
    e->id = (*distinct)++;                        /* 指派下一個 id 並讓計數加一 / next id */
    e->next = table[h];                           /* 插到鏈頭 / prepend into bucket */
    table[h] = e;
    return e->id;
}

int* findSubstring(char* s, char** words, int wordsSize, int* returnSize) {
    int n = (int)strlen(s);                       /* s 的長度 / length of s */
    int wordLen = (int)strlen(words[0]);          /* 每個單字長度（都相同）/ each word's length */
    int numWords = wordsSize;                     /* 單字數量 k / number of words */
    int totalLen = wordLen * numWords;            /* 完整拼接長度 / total match length */

    int *res = (int *)malloc(sizeof(int) * (n + 1)); /* 結果陣列，最多 n 個起點 / result, ≤ n starts */
    *returnSize = 0;                              /* 先記錄 0 個答案 / start with zero answers */
    if (totalLen > n) return res;                 /* 放不下就直接回空 / can't fit → empty */

    Entry **table = (Entry **)calloc(TABLE, sizeof(Entry *)); /* 桶陣列全設 NULL / all buckets NULL */
    int distinct = 0;                             /* 不同單字的個數 / number of distinct words */
    int *need = (int *)calloc(numWords, sizeof(int)); /* need[id] = 該單字需要幾個 / required count */
    for (int i = 0; i < numWords; i++) {          /* 建立需求表 / build the required counts */
        int id = wordId(table, &distinct, words[i], wordLen, 1); /* 取得/新建 id / get-or-create id */
        need[id]++;                               /* 需求加一 / one more required */
    }

    int *window = (int *)malloc(sizeof(int) * distinct); /* window[id] = 視窗內該單字數 / in-window count */

    for (int i = 0; i < wordLen; i++) {           /* 試每個起始偏移 0..L-1 / each starting offset */
        memset(window, 0, sizeof(int) * distinct);/* 清空視窗計數 / clear window counts */
        int left = i, cnt = 0;                    /* left=視窗左界, cnt=已配對單字數 / left edge & matched count */
        for (int right = i; right + wordLen <= n; right += wordLen) { /* 每次前進一個單字 / step one word */
            int id = wordId(table, &distinct, s + right, wordLen, 0); /* 右側單字的 id（不新建）/ right word's id */
            if (id == -1) {                       /* 不是目標單字 / not one of our words */
                memset(window, 0, sizeof(int) * distinct); /* 整個視窗作廢 / drop entire window */
                cnt = 0;
                left = right + wordLen;           /* 跳過壞單字、重新起算 / restart past the bad word */
                continue;
            }
            window[id]++;                         /* 把右側單字加入視窗 / add the right word */
            cnt++;
            while (window[id] > need[id]) {        /* 該單字數量超量了 / this word is over quota */
                int lid = wordId(table, &distinct, s + left, wordLen, 0); /* 最左單字 id / leftmost word id */
                window[lid]--;                    /* 從左邊移除一個 / drop one from the left */
                left += wordLen;
                cnt--;
            }
            if (cnt == numWords) {                /* 剛好湊齊所有單字 / window matches all words */
                res[(*returnSize)++] = left;      /* 記錄這個起點 / record start index */
                int lid = wordId(table, &distinct, s + left, wordLen, 0); /* 移除最左單字 / pop leftmost */
                window[lid]--;
                left += wordLen;                  /* 視窗右移，繼續找下一個 / slide and keep searching */
                cnt--;
            }
        }
    }

    for (int b = 0; b < TABLE; b++) {             /* 逐桶釋放鏈結節點 / free every chained node */
        Entry *e = table[b];
        while (e != NULL) { Entry *nx = e->next; free(e); e = nx; }
    }
    free(table);                                  /* 釋放桶陣列 / free buckets */
    free(need);                                   /* 釋放需求表 / free need */
    free(window);                                 /* 釋放視窗表 / free window */
    return res;                                   /* 回傳起點陣列 / return the indices */
}

Solution — C++

#include <vector>
#include <string>
#include <unordered_map>
using namespace std;

/*
 * 算法 / Algorithm:
 * 對每個起始偏移 0..L-1 跑一次以單字為步長的滑動視窗，用 unordered_map 維護
 * 視窗內單字計數並與 need 比較；計數齊全時記錄起點。
 * For each offset 0..L-1, run a word-stepping sliding window. An unordered_map
 * tracks in-window word counts vs. need; record a start when the window is full.
 */
class Solution {
public:
    vector<int> findSubstring(string s, vector<string>& words) {
        vector<int> res;                           // 答案：起始索引 / answer: start indices
        int wordLen = words[0].size();             // 每個單字長度（都相同）/ each word's length
        int numWords = words.size();               // 單字數量 k / number of words
        int n = s.size();                          // s 的長度 / length of s
        int totalLen = wordLen * numWords;         // 完整拼接長度 / total match length
        if (totalLen > n) return res;              // 放不下就回空 / can't fit → empty

        // unordered_map：鍵→值的雜湊表，平均 O(1) 查詢。need[w] = 單字 w 需要幾個。
        // unordered_map is a hash map (avg O(1)). need[w] = how many of word w we need.
        unordered_map<string, int> need;
        for (const string& w : words)              // 範圍 for 走訪 words / range-for over words
            need[w]++;                              // 該單字需求加一 / one more required

        for (int i = 0; i < wordLen; i++) {        // 試每個起始偏移 / try each offset
            unordered_map<string, int> window;     // 視窗內每個單字的計數 / counts inside window
            int left = i, cnt = 0;                 // left=視窗左界, cnt=已配對數 / left edge & matched count
            for (int right = i; right + wordLen <= n; right += wordLen) { // 一次前進一個單字 / step one word
                string word = s.substr(right, wordLen); // 取出右側單字 / extract the right word
                if (need.find(word) == need.end()) {    // find 回傳 end() 代表不存在 / not a needed word
                    window.clear();                // 整個視窗作廢 / drop entire window
                    cnt = 0;
                    left = right + wordLen;        // 跳過壞單字、重啟 / restart past bad word
                    continue;
                }
                window[word]++;                    // 把右側單字加入視窗 / add the right word
                cnt++;
                while (window[word] > need[word]) {// 該單字超量 / this word over quota
                    string lw = s.substr(left, wordLen); // 最左單字 / leftmost word
                    window[lw]--;                  // 從左邊移除一個 / drop one from the left
                    left += wordLen;
                    cnt--;
                }
                if (cnt == numWords) {             // 剛好湊齊所有單字 / window matches all words
                    res.push_back(left);           // 記錄起點 / record start index
                    string lw = s.substr(left, wordLen); // 移除最左單字 / pop leftmost
                    window[lw]--;
                    left += wordLen;               // 視窗右移繼續找 / slide and keep searching
                    cnt--;
                }
            }
        }
        return res;                                // 回傳所有起點 / return all start indices
    }
};

複雜度 / Complexity

Time: O(n × L)，其中 n = s.length、L = 單字長度。對每個偏移 i（共 L 個），右指針沿著 s 前進，所有偏移加起來右指針總共只走過約 n 個單字格；每處理一個單字需要 O(L) 來雜湊／切片，左指針的移除是攤還 O(1)（每個單字最多被丟一次）。建表 need 另花 O(k × L)。總和為 O(n × L)。 / Across all L offsets the right pointer visits about n word-cells total; each costs O(L) to hash/slice, and left-side removals are amortized O(1). Building need is O(k × L). Net O(n × L).
Space: O(k × L)，雜湊表存最多 k 個不同單字、每個長度 L，加上 window 與 need 的計數陣列。 / The maps hold up to k distinct words of length L, plus the window/need count arrays. O(k × L).

Pitfalls & Edge Cases

逐字元滑動會超時 / sliding by 1 char TLEs：必須以單字長度 L 為步長並對 L 個偏移各跑一次，才能把複雜度從 O(n·k·L) 降到 O(n·L)。漏掉「偏移迴圈」就會錯過沒對齊的答案。/ Step by L and loop over all L offsets; stepping by 1 (or forgetting the offset loop) either TLEs or misses misaligned matches.
重複單字 / duplicate words：words 可能有重複（例 2 有兩個 "word"）。用「計數」而非「集合」才正確——這就是 need[w] 存次數、且檢查 window[w] > need[w] 的原因。/ words can repeat; counting (not a plain set) and the window[w] > need[w] check handle quotas correctly.
超量時要從左邊收縮，而非整段重置 / shrink from the left, don't reset on overflow：遇到「合法單字但數量超標」時只要丟掉左邊單字直到合法即可；只有遇到「根本不是目標單字」才該整段作廢。混淆這兩種情況是常見錯誤。/ An over-quota valid word means shrink from the left; only a non-word warrants wiping the window. Confusing the two is a classic bug.
放不下要先擋掉 / guard totalLen > n：若拼接總長超過 s，直接回傳空陣列，避免 right + wordLen 越界與無謂計算。/ If totalLen > n, return empty up front to avoid out-of-bounds and wasted work.
回傳長度別忘了寫 / set *returnSize（C）：LeetCode C 介面靠 *returnSize 知道答案個數；忘了更新會讀到垃圾值。空答案也要把它設為 0。/ The C harness reads *returnSize; forgetting to update it (including setting 0 for empty) yields garbage output.
比較定長字串用 memcmp 而非 strcmp（C）/ compare with memcmp, not strcmp, in C：我們從 s 取的子字串沒有結尾的 '\0'，所以用已知長度 wordLen 做 memcmp 比較，避免讀過頭。/ Substrings taken from s aren't null-terminated, so compare exactly wordLen bytes with memcmp.