274. H-Index
題目 / Problem
中文: 給定一個整數陣列 citations,其中 citations[i] 表示研究者第 i 篇論文被引用的次數。請回傳該研究者的 h 指數 (h-index)。
根據 h-index 的定義:h 指數是滿足「該研究者至少有 h 篇論文,每篇至少被引用 h 次」的最大整數 h。
English: Given an integer array citations where citations[i] is the number of citations of the researcher's i-th paper, return the researcher's h-index.
By definition: the h-index is the maximum value h such that the researcher has at least h papers each cited at least h times.
Constraints / 限制:
- 1 <= n <= 5000
- 0 <= citations[i] <= 1000
Example / 範例:
Input: citations = [3, 0, 6, 1, 5]
Output: 3
解釋:有 3 篇論文被引用次數 ≥ 3(即 3, 6, 5),剩下兩篇 ≤ 3,所以 h = 3。 Explanation: 3 papers each have at least 3 citations (3, 6, 5); the other 2 have at most 3, so h = 3.
名詞解釋 / Glossary
- h 指數 / h-index:最大的
h使得「至少h篇論文,每篇至少被引用h次」。注意h不能超過總論文數n。/ The largesthfor which at leasthpapers each have ≥hcitations. Capped by total paper countn. - 計數排序 / Counting Sort (Bucket):當數值範圍有限時,用一個額外陣列當「桶子」紀錄每個數值出現幾次,避免比較式排序的
O(n log n)成本。/ When values fall in a small range, count occurrences in a bucket array — this skips comparison-based sorting'sO(n log n)cost. - 降序排序 / Descending Sort:由大到小排列。排序後
citations[i]是第i+1高的引用數。/ Sort largest to smallest; afterwardscitations[i]is the (i+1)-th highest citation count. qsort:C 標準庫的排序函式,需傳入一個比較函式指標。/ C standard library sort; takes a comparator function pointer.- 指標 / Pointer (
int*):儲存記憶體位址的變數。*p取出該位址的內容,p[i]等同於*(p+i)。/ A variable holding a memory address.*preads the value there;p[i]is shorthand for*(p+i). - 後綴和直覺 / Suffix-count intuition:若把引用數從大到小排序,前
i篇都至少有citations[i-1]次引用,這是判斷 h-index 的核心。/ Sorting descending makes "the topipapers each have ≥citations[i-1]citations" trivially true, which is exactly what h-index needs.
思路
最直觀的做法是「列舉所有可能的 h 值」:對每個候選 h 從 0 到 n,掃過陣列數有幾篇論文 ≥ h,再判斷是否 ≥ h。這是 O(n²),可行但慢。更好的觀察是:把 citations 由大到小排序之後,第 i(從 1 開始)篇論文的引用數就是 citations[i-1],而前 i 篇論文每篇都至少有 citations[i-1] 次引用(因為已排序)。所以只要找到最大的 i 使得 citations[i-1] >= i,這個 i 就是答案。這種做法是 O(n log n),主要花費在排序。
更快的做法用 計數排序(桶子):注意到 h-index 不可能超過 n,所以引用數 ≥ n 的論文都可以併進同一個桶 bucket[n]。建一個長度 n+1 的桶陣列,掃一遍 citations 把每篇論文丟進對應桶(超過 n 的塞進 bucket[n])。然後從 h = n 往 0 遞減,累計目前為止「引用數 ≥ h」的論文數 count;第一個讓 count >= h 的 h 就是答案。這是純 O(n) 時間、O(n) 空間。
The brute-force idea is to try every candidate h from 0 to n and count how many papers meet it — O(n²). A better observation: if we sort citations in descending order, then for each rank i (1-indexed), the top i papers all have at least citations[i-1] citations (because the array is sorted). So we want the largest i with citations[i-1] >= i. That gives O(n log n), dominated by the sort.
We can do even better with counting sort / bucketing. Two key facts: the answer can never exceed n, and any paper with more than n citations contributes the same as one with exactly n citations for h-index purposes. So allocate a bucket array of size n+1, and for each paper drop it into bucket[min(c, n)]. Then walk h from n down to 0, accumulating count — the number of papers with ≥ h citations. The first h where count >= h is the answer. This runs in O(n) time and O(n) extra space.
逐步走查 / Walkthrough
Input: citations = [3, 0, 6, 1, 5], so n = 5.
Step 1 — Build buckets / 建立桶子 (size n+1 = 6, indices 0..5; values >5 collapse to bucket 5):
| Paper citations | Bucket index (min(c, 5)) |
|---|---|
| 3 | 3 |
| 0 | 0 |
| 6 | 5 (capped) |
| 1 | 1 |
| 5 | 5 |
Final bucket = [1, 1, 0, 1, 0, 2].
Step 2 — Scan h from n down to 0, accumulate count (papers with citations ≥ h):
| h | bucket[h] | count (running total) | count >= h ? | Decision |
|---|---|---|---|---|
| 5 | 2 | 2 | 2 >= 5? No | continue |
| 4 | 0 | 2 | 2 >= 4? No | continue |
| 3 | 1 | 3 | 3 >= 3? Yes | return 3 ✅ |
Answer / 答案:h = 3.
Solution — C
// Algorithm / 演算法:counting-sort bucket. Cap each citation at n, drop into a bucket,
// then sweep h from n down to 0 accumulating "papers with ≥ h citations". The first
// h where that running count >= h is the h-index. O(n) time, O(n) space.
#include <stdlib.h> // malloc, calloc, free / 動態記憶體配置函式
#include <string.h> // (not strictly needed, kept for portability / 可移植性保留)
int hIndex(int* citations, int citationsSize) {
int n = citationsSize; // n = 論文總數 / total number of papers
// 配置長度 n+1 的桶子陣列,calloc 會把每格初始化為 0
// Allocate a bucket array of length n+1; calloc zero-initializes every slot.
int* bucket = (int*)calloc(n + 1, sizeof(int));
// 第一輪:把每篇論文丟進對應桶子 / First pass: drop each paper into its bucket
for (int i = 0; i < n; i++) {
int c = citations[i]; // 取出第 i 篇的引用數 / read i-th citation count
// 引用數 > n 的論文一律塞進 bucket[n],因為 h 不可能超過 n
// Anything > n collapses into bucket[n], since h cannot exceed n.
if (c >= n) {
bucket[n]++; // 大於等於 n 的篇數 +1 / count of papers with ≥ n citations
} else {
bucket[c]++; // 否則丟進對應引用數的桶子 / otherwise, exact citation bucket
}
}
// 第二輪:從 h = n 往下掃,count 累計「引用數 >= h 的論文數」
// Second pass: sweep h from n down, accumulating papers with ≥ h citations.
int count = 0; // 後綴累計 / running suffix count
for (int h = n; h >= 0; h--) {
count += bucket[h]; // 加上恰好 h 次引用的論文 / add papers with exactly h citations
// 第一個滿足 count >= h 的 h 就是答案(h 從大到小掃,自然找到最大值)
// First h with count >= h is the answer (we scan top-down, so it's the maximum).
if (count >= h) {
free(bucket); // 釋放桶子 / free the heap allocation
return h; // 回傳答案 / return result
}
}
// 理論上 h = 0 一定成立,迴圈內就會回傳;這行只是給編譯器看
// h = 0 always satisfies count >= 0, so the loop always returns; this is just a fallback.
free(bucket);
return 0;
}
Solution — C++
// Algorithm / 演算法: same counting-sort bucket as the C version. Cap citations at n,
// bucket them, sweep h from n downward keeping a running count of "papers with ≥ h
// citations", and return the first h with count >= h. O(n) time, O(n) space.
#include <vector> // std::vector — 動態陣列 / dynamic array container
#include <algorithm> // std::min — 取最小值 / utility function
class Solution {
public:
int hIndex(std::vector<int>& citations) {
int n = static_cast<int>(citations.size()); // n 是論文總數 / total paper count
// std::vector<int> 預設初始化為 0;長度 n+1 對應 h 可能值 0..n
// std::vector<int> default-initializes elements to 0; length n+1 covers h = 0..n.
std::vector<int> bucket(n + 1, 0);
// range-for 是 C++11 的語法糖:for (auto x : container) 逐一拿出每個元素
// Range-for is C++11 sugar: iterates each element, here as `int c`.
for (int c : citations) {
// std::min(c, n) 把超過 n 的引用數壓回 n,因為 h 不會超過 n
// std::min(c, n) caps citations at n — h-index can't exceed n anyway.
bucket[std::min(c, n)]++; // 對應桶子 +1 / increment matching bucket
}
int count = 0; // 累計後綴計數 / running suffix count
// 從大的 h 開始掃,第一個成立的就是最大的 h
// Sweep from large h downward; first hit is automatically the maximum.
for (int h = n; h >= 0; --h) {
count += bucket[h]; // 把這一桶加進累計 / add this bucket's papers
if (count >= h) { // 至少 h 篇論文,每篇 ≥ h 次引用 / definition satisfied
return h; // 即是 h-index / this is the answer
}
}
return 0; // 不會執行到,h=0 必成立 / unreachable; h=0 always works
}
};
複雜度 / Complexity
- Time: O(n) — 兩次線性掃描:一次填桶子,一次從 n 往下累計。沒有排序,不需要
O(n log n)。 / Two linear passes — one to fill buckets, one to scan downward. No sort needed. - Space: O(n) — 桶子陣列長度為
n + 1。輸入陣列本身不算額外空間。 / The bucket array has lengthn + 1; input itself doesn't count as extra space.
Pitfalls & Edge Cases
- h 不能超過 n / h cannot exceed n:即使有論文被引用一百萬次,h 最多就是
n。所以可以安全地用min(c, n)把大值壓回。/ Even with a million citations, h is bounded byn. Capping withmin(c, n)is safe and necessary for the bucket size. - 全部 0 的情況 / all zeros:例如
[0, 0, 0]。bucket[0] = 3,但h = 3, 2, 1時 count 都是 0;h = 0時count = 3 >= 0成立,回傳 0。/ E.g.[0,0,0]: onlyh = 0satisfies the definition, and the loop reaches it correctly. - 單一元素 / single element:例如
[100],n = 1,會被壓進bucket[1],h = 1時count = 1 >= 1成立,回傳 1。/[100]withn = 1collapses intobucket[1];h = 1succeeds, returning 1. - 由大到小掃的重要性 / why scan top-down:必須從
h = n往下,第一個滿足的就是最大值。從小往大掃會錯過更大的解。/ Scanning fromndownward guarantees we find the maximumhfirst; bottom-up would find a valid but non-maximum value. - C 中要
free/ rememberfreein C:calloc配置的記憶體必須釋放,否則記憶體洩漏。提早return的分支也要先free。/ Heap memory fromcallocmust be released on every return path to avoid leaks. - 不要漏掉
bucket[n]/ don't forget bucket[n]:桶子大小是n + 1(索引 0..n),不是n。少一格會在c >= n時越界寫入。/ Bucket size isn + 1, notn— off-by-one here causes out-of-bounds writes for capped values.