49. Group Anagrams
題目 / Problem
中文:給定一個字串陣列 strs,把彼此互為「易位構詞」(anagram)的字串分到同一組。所謂易位構詞,是指兩個字串用到完全相同的字母、且每個字母出現次數也相同,只是排列順序不同(例如 "eat" 和 "tea")。你可以用任意順序回傳答案,組與組之間、組內元素之間的順序都不要求。
English: Given an array of strings strs, group the strings that are anagrams of each other into the same group. Two strings are anagrams if they contain exactly the same letters with the same counts, just in a different order (e.g. "eat" and "tea"). You may return the answer in any order — neither the order of the groups nor the order within a group matters.
Constraints / 限制
- 1 <= strs.length <= 10^4
- 0 <= strs[i].length <= 100
- strs[i] 只由小寫英文字母組成 / consists of lowercase English letters only.
Worked example / 範例
Input: strs = ["eat","tea","tan","ate","nat","bat"]
Output: [["bat"],["nat","tan"],["ate","eat","tea"]]
"eat", "tea", "ate" 互為易位構詞 → 同一組;"tan", "nat" → 同一組;"bat" 自己一組。
名詞解釋 / Glossary
- 易位構詞 / anagram:兩字串字母種類與每種字母的數量完全相同,只是順序不同。判斷的關鍵是「字母多重集合」是否相等。/ Two strings using the exact same multiset of letters; only the order differs.
- 雜湊表 / hash map:一種以「鍵 → 值」存取資料的結構,平均能在 O(1) 時間查找與插入。本題用它把「某個簽名」映射到「屬於該簽名的字串清單」。/ A key→value structure with average O(1) lookup/insert; here it maps a signature to the list of strings sharing it.
- 簽名 / canonical key (signature):替每個字串算出一個「同組必相同、不同組必不同」的代表字串。兩種常見做法:把字母排序後的結果,或 26 個字母的出現次數編碼成字串。/ A representative string that is identical for anagrams and different otherwise — either the sorted letters, or the 26 letter-counts encoded as text.
- 計數陣列 / count array:長度 26 的整數陣列,索引 0..25 對應字母
a..z,記錄每個字母出現幾次。/ A length-26 int array indexed by letter (a→0 …z→25) holding each letter's frequency. unordered_map(C++):C++ 標準函式庫提供的雜湊表,鍵唯一、查找平均 O(1)。/ The C++ standard-library hash map: unique keys, average O(1) access.
思路
最直接的暴力想法:對每一對字串都判斷它們是不是易位構詞,是的話就放進同一組。但「判斷兩兩是否同組」要做 O(n²) 次比較,n 最大到 10⁴ 時就是上億次操作,太慢。問題的核心在於:我們其實不需要「兩兩比較」,只要能替每個字串算出一個代表它所屬組別的標記,標記相同就是同一組。這個標記就叫「簽名」。怎麼設計簽名?關鍵觀察是:易位構詞們把字母排序後會得到完全相同的字串,例如 "eat"、"tea"、"ate" 排序後都是 "aet"。於是把排序後的字串當作簽名,丟進雜湊表:簽名當鍵,值是一個清單,存放所有擁有這個簽名的原始字串。掃過一遍陣列,每個字串算簽名、塞進對應的清單,最後把雜湊表裡所有清單收集起來就是答案。雜湊表讓「找出某簽名對應的那組」變成平均 O(1),整體只需掃一遍。
另一種更快的簽名作法(C 版用這個):因為字母只有小寫 a–z,可以用一個長度 26 的計數陣列數每個字母出現幾次,再把這 26 個數字編碼成一個字串當簽名。這樣算簽名只要 O(L)(L 是字串長度),比排序的 O(L log L) 更省,而且同樣保證「同組簽名相同、不同組簽名不同」。
The brute-force idea is to compare every pair of strings and check whether they're anagrams, merging matches into groups. But pairwise comparison is O(n²), which blows up when n is 10⁴. The real insight is that we don't need to compare pairs at all — we just need a signature for each string that is identical for anagrams and different otherwise. Since anagrams become the same string once their letters are sorted ("eat", "tea", "ate" all sort to "aet"), we can use the sorted string as the signature. We feed everything into a hash map: the signature is the key, and the value is the list of original strings that produced it. One pass over the array — compute each signature, append the string to its bucket — and finally we collect all the buckets as the answer. The hash map turns "find the right group for this signature" into an average O(1) operation.
An even faster signature (used in the C version): since the alphabet is just lowercase a–z, count each letter into a length-26 array, then encode those 26 counts into a string. Building this signature is O(L) instead of sorting's O(L log L), while still guaranteeing same-group ⇔ same-signature.
逐步走查 / Walkthrough
輸入 / Input: ["eat","tea","tan","ate","nat","bat"],使用「排序後字串」當簽名,維護一個雜湊表 map: signature → list。
| 步驟 Step | 目前字串 word | 排序後簽名 sorted key | 雜湊表動作 map action | 雜湊表狀態 map state |
|---|---|---|---|---|
| 1 | "eat" |
"aet" |
新鍵,建立清單 / new key | {aet: [eat]} |
| 2 | "tea" |
"aet" |
鍵已存在,附加 / append | {aet: [eat, tea]} |
| 3 | "tan" |
"ant" |
新鍵,建立清單 / new key | {aet:[eat,tea], ant:[tan]} |
| 4 | "ate" |
"aet" |
鍵已存在,附加 / append | {aet:[eat,tea,ate], ant:[tan]} |
| 5 | "nat" |
"ant" |
鍵已存在,附加 / append | {aet:[eat,tea,ate], ant:[tan,nat]} |
| 6 | "bat" |
"abt" |
新鍵,建立清單 / new key | {aet:[eat,tea,ate], ant:[tan,nat], abt:[bat]} |
最後 / Finally:收集所有清單值 → [["eat","tea","ate"], ["tan","nat"], ["bat"]]。順序可任意,符合題目要求。/ Collect all map values; any order is accepted.
Solution — C
// 演算法 / Algorithm:
// 對每個字串算一個長度 26 的字母計數,編碼成「簽名」字串;
// 簽名相同 = 同組易位構詞。用簽名陣列分組(這裡用線性比對找已存在的簽名)。
// For each string, build a 26-letter count, encode it into a signature string;
// equal signatures = same anagram group. Group by matching signatures.
#include <stdlib.h> // malloc / free / qsort
#include <string.h> // strcmp / strlen / strcpy
// 比較函式給 qsort 用:把字串裡的字元由小到大排序
// Comparator for qsort: sorts characters ascending.
static int cmp_char(const void *a, const void *b) {
// *(char*)a 把 void* 轉回 char* 再取值 / cast void* back to char* and dereference
return (*(const char *)a) - (*(const char *)b);
}
// 回傳值:char***(指向「字串陣列的陣列」),即每組是一個 char**
// Returns char*** : an array of groups, each group being a char** (array of strings).
char ***groupAnagrams(char **strs, int strsSize,
int *returnSize, int **returnColumnSizes) {
// keys[i] 存第 i 組的簽名;用來判斷新字串該歸到哪一組
// keys[i] holds group i's signature, used to decide which group a string joins.
char **keys = (char **)malloc(sizeof(char *) * strsSize); // 最多 strsSize 組 / at most strsSize groups
char ***groups = (char ***)malloc(sizeof(char **) * strsSize); // 每組的字串清單 / list per group
int *counts = (int *)malloc(sizeof(int) * strsSize); // 每組目前有幾個字串 / size of each group
int groupCnt = 0; // 目前共幾組 / number of groups so far
for (int i = 0; i < strsSize; i++) {
// ---- 步驟 A:替 strs[i] 算簽名(排序它的字元複本)----
// Step A: compute signature of strs[i] by sorting a copy of its chars.
int len = (int)strlen(strs[i]); // 字串長度(不含結尾 '