Deep Learning 101, Taiwan's pioneering and highest-altitude deep learning meetup, launched on 2016/11/11 at 83F, Taipei 101
AI is a lonely journey, full of fear and the unknown; flashy, expensive courses and events are never a shortcut to success.
Heartfelt thanks to the AI enthusiasts from different organizations who shared their valuable experience under their real names at the time; if you would like any of this information removed, please let us know.
Founded by TonTon Huang Ph.D.; his employer at the time (台灣雪豹科技) sponsored the venue and refreshments free of charge.
Deep Learning 101 was created to popularize and share cutting-edge knowledge in deep learning and AI, in the firm belief that AI's value lies in solving real-world business problems.
Subscribe on YouTube | Facebook | GitHub Pages home | Star on GitHub | Website | Like on Hugging Face Space
👁️ Computer Vision (CV): Must-Read Resource Roundup
Editor's note: this page compiles key technical resources in computer vision, covering the latest papers and implementations in object detection, generative AI, image segmentation, and optical character recognition (OCR).
For more detailed notes, visit the GitHub repository: 👉 GitHub: Computer-Vision-Paper (Stars ⭐ welcome)
Table of Contents
- Anomaly Detection
- Object Detection
- Segmentation
- OCR
- Diffusion Model
- Digital Human
- Image Recognition
- Document AI
- DeepFake Detection
Anomaly Detection
- 2026-01-29|LLM2CLIP
  - Description: rebuilding the textual foundation of cross-modal representation learning with large language models
  - Resources: 📄 AlphaXiv 🐙 GitHub 📝 WeChat write-up
- 2025-09-24|FS-SAM2
  - Description: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation
  - Resources: 📄 AlphaXiv 📝 Chinese write-up: strong in both accuracy and efficiency
- 2025-09-20|MOCHA
  - Description: Multi-modal Objects-aware Cross-arcHitecture Alignment
  - Resources: 📄 AlphaXiv 📝 Chinese write-up: large gains when injected into YOLO
- 2025-07-16|CostFilter-AD
  - Description: Enhancing Anomaly Detection through Matching Cost Filtering
  - Resources: 🐙 GitHub 📝 Chinese write-up: a new unsupervised state of the art
- 2025-06-13|One-to-Normal
  - Description: Anomaly Personalization (a breakthrough in few-shot anomaly recognition)
  - Resources: 📄 AlphaXiv 📝 Chinese write-up
- 2025-06-06|DualAnoDiff (CVPR 2025)
  - Description: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation
  - Resources: 📄 AlphaXiv 📝 Chinese write-up: from Fudan University and Tencent Youtu, accepted at CVPR 2025
- 2025-05-15|AdaptCLIP
  - Description: Adapting CLIP for Universal Visual Anomaly Detection
  - Resources: 📄 AlphaXiv 🐙 GitHub 📝 Chinese write-up
- 2025-05-05|Multi-Modal LLM for AD
  - Description: Detect, Classify, Act: Categorizing Industrial Anomalies
  - Resources: 📄 AlphaXiv 📚 DeepWiki 💾 Dataset
- 2025-04-27|AnomalyCLIP
  - Description: Object-agnostic Prompt Learning for Zero-shot AD
  - Resources: 📄 AlphaXiv 📚 DeepWiki
- 2025-04-26|PaDiM
  - Description: a classic unsupervised anomaly detection method
  - Resources: 📄 AlphaXiv 📚 DeepWiki
- 2025-04-12|AA-CLIP
  - Description: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
  - Resources: 📄 AlphaXiv 📚 DeepWiki
- 2025-03-25|Dinomaly
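PaDiM, listed above, illustrates the core recipe behind many of these unsupervised detectors: fit a Gaussian to patch features of defect-free images, then score test patches by Mahalanobis distance. A minimal numpy sketch of that idea, with random toy features standing in for a CNN backbone (all names, shapes, and the shrinkage constant are illustrative, not taken from the PaDiM codebase):

```python
import numpy as np

def fit_gaussian(train_feats):
    """Fit a per-position Gaussian to patch features of normal images.
    train_feats: (n_images, n_patches, dim)."""
    n, p, d = train_feats.shape
    mean = train_feats.mean(axis=0)                   # (n_patches, dim)
    cov = np.empty((p, d, d))
    for i in range(p):
        centered = train_feats[:, i, :] - mean[i]
        # Shrinkage term keeps the covariance invertible
        cov[i] = centered.T @ centered / (n - 1) + 0.01 * np.eye(d)
    return mean, cov

def anomaly_scores(test_feats, mean, cov):
    """Per-patch Mahalanobis distance to the fitted normal distribution."""
    scores = np.empty(len(test_feats))
    for i, x in enumerate(test_feats):
        diff = x - mean[i]
        scores[i] = np.sqrt(diff @ np.linalg.inv(cov[i]) @ diff)
    return scores

rng = np.random.default_rng(0)
normal = rng.normal(size=(64, 16, 8))   # 64 normal images, 16 patches, 8-dim features
mean, cov = fit_gaussian(normal)
test = rng.normal(size=(16, 8))
test[3] += 6.0                          # inject an anomalous patch
scores = anomaly_scores(test, mean, cov)
print(int(scores.argmax()))             # → 3 (the injected patch)
```

Real implementations extract features from a pretrained backbone and reduce their dimensionality before fitting; the map of per-patch scores is then upsampled into an anomaly heatmap.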
Object Detection
- 2025|MCL (AAAI 2025)
  - Description: Multi-clue Consistency Learning (semi-supervised object detection for remote sensing)
  - Resources: 📄 AlphaXiv 🐙 GitHub 📝 Chinese write-up
- 2025-07-24|OV-DINO
- 2025-06-18|CountVid
  - Description: Open-World Object Counting in Videos (count whatever you point at)
  - Resources: 📄 AlphaXiv 📝 Chinese write-up
- 2025-06-15|GeoPix
- 2025-05-23|VisionReasoner
- 2025-03-14|Falcon
  - Description: A Remote Sensing Vision-Language Foundation Model
  - Resources: 📄 AlphaXiv 📚 DeepWiki
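Whichever detector you pick from the list above, the classic post-processing step is IoU-based greedy non-maximum suppression; a self-contained numpy sketch (boxes, scores, and the 0.5 threshold are toy values):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]    # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]  (overlapping box 1 is suppressed)
```

DETR-style detectors avoid NMS via set prediction, but YOLO-family models and most remote-sensing pipelines still rely on it.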
Segmentation
- SAM 3
- Perceive Anything Model
  - Description: Recognize, Explain, Caption, and Segment Anything (positioned as SAM 2 + LLM)
  - Resources: 📄 AlphaXiv 📝 Chinese write-up
- RemoteSAM
  - Description: Towards Segment Anything for Earth Observation
  - Resources: 📄 AlphaXiv 📚 DeepWiki
- InstructSAM
  - Description: Training-Free Framework for Remote Sensing
  - Resources: 🌐 Project 📄 AlphaXiv 📚 DeepWiki
- SAM 2 & Variants
  - Meta SAM 2: Meta's official latest segment-anything model.
  - Grounded SAM 2: Ground and Track Anything in Videos.
  - SAM2Long: from CUHK, focused on complex long-video segmentation.
  - SAM2-Adapter: the first adapter bringing SAM 2 to downstream tasks.
  - SAM2Point: a milestone in promptable 3D segmentation research.
- Other Notable Models
  - SAMURAI: Kalman filter + SAM 2 to handle fast motion and self-occlusion.
  - MatAnyone: video matting with hair-strand-level detail.
  - Exact (CVPR 2025): weakly supervised learning for remote-sensing image time series.
  - SegAnyMo (CVPR 2025): Segment Any Motion in Videos.
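When comparing the segmentation models above on the same data, the standard per-mask metric is intersection-over-union; a minimal numpy sketch (toy masks, with the empty-vs-empty convention as an explicit assumption):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # two empty masks: perfect agreement
    return np.logical_and(pred, gt).sum() / union

gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True                     # ground-truth 4x4 square
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True                   # prediction shifted by one pixel
print(round(mask_iou(pred, gt), 3))     # → 0.391  (9 / 23)
```

Video-object-segmentation benchmarks typically average this per frame and per object (the J score); matting models like MatAnyone are instead judged on soft alpha errors.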
Optical Character Recognition (OCR): analysis and detection of text in object and scene images
- Strengthen your OCR workflow with open-source models
- 2026-01-27|DeepSeek-OCR 2
- 2025-11-30|HunyuanOCR
- 2025-10-21|Chandra OCR
- 2025-10-19|PaddleOCR-VL
- 2025-08-18|DianJin-OCR-R1
- 2025-07-30|dots.ocr
- 2025-06-16|OCRFlux
- 2025-06-05|MonkeyOCR
  - Resources: 📚 DeepWiki 📄 AlphaXiv
- 2025-03-05|OpenOCR
  - Resources: 🐙 GitHub | 📝 OpenOCR: an open-source general-purpose OCR toolkit
- 2025-03-05|PP-DocBee
- 2025-03-03|olmocr
- 2025-02-05|MinerU
- 2024-12-15|markitdown
  - Resources: 🐙 GitHub
- 2024-10-29|OmniParser
- 2024-09-11|GOT-OCR-2.0
- 2024-08-20|PDF-to-Markdown tools
- Other useful tools and resources
  - RapidOCR: 🐙 GitHub
  - TableStructureRec: 🐙 GitHub 📝 table-structure recognition inference library
  - PaddleOCR tutorial: 📝 fine-tuning on medical certificates and receipts with PPOCRLabel
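Independent of which OCR model you choose, scanned pages are often binarized first; a numpy-only sketch of Otsu's global threshold (the "scan" below is synthetic, and production toolkits ship tuned implementations of this step):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: choose the threshold maximizing between-class variance.
    gray: 2-D uint8 array."""
    prob = np.bincount(gray.ravel(), minlength=256) / gray.size
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * prob[:t]).sum() / w0
        mu1 = (levels[t:] * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2          # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# Toy "scan": dark ink (~30) on bright paper (~220), plus mild noise
rng = np.random.default_rng(1)
base = np.full((64, 64), 220.0)
base[20:40, 10:50] = 30.0
page = np.clip(base + rng.normal(0, 5, base.shape), 0, 255).astype(np.uint8)
t = otsu_threshold(page)
binary = page < t                       # True where ink
print(binary[20:40, 10:50].mean() > 0.9, binary[:10, :10].mean() < 0.1)  # → True True
```

For unevenly lit photos of documents, adaptive (local) thresholding generally beats a single global Otsu cut.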
Diffusion Model
- 2025-05-28|Jodi
  - Description: a unified model for visual understanding and generation
  - Resources: 📄 AlphaXiv 🌐 Project
- 2025-05-27|AnomalyAny (CVPR 2025)
- 2025-05-23|HivisionIDPhotos
  - Description: smart ID-photo generator (matting, background replacement, arbitrary sizes)
  - Resources: 📚 DeepWiki 📝 tutorial
- 2025-05-19|Index-AniSora
  - Description: Bilibili's open-source SOTA anime video generation model
  - Resources: 📄 AlphaXiv 📚 DeepWiki 📝 Chinese write-up
- 2025-04-26|Insert Anything
- 2025-04-24|Phantom
- 2025-04-22|MAGI-1
- 2025-04-22|SkyReels V2
- 2025-04-14|FramePack
- 2025-04-14|Fantasy-talking
- 2025-04-05|SkyReels-A2
- 2025-03-10|HunyuanVideo-I2V
- 2025-02-25|Wan-Video
- 2025-02-14|FlashVideo
- 2025-01-28|Sana (ICLR 2025 Oral)
- Flux & Ecosystem
  - Flux Models: 🤗 Black Forest Labs
  - PuLID (2024-11-29): 🐙 GitHub 📝 ComfyUI tutorial
  - Leffa (2024-12-17): 🐙 GitHub 📝 Meta AI, preserves person appearance features
  - MagicQuill (2024-11-26): 🐙 GitHub 🤗 Space 📝 AI image-editing tool
- Practical Tools (ComfyUI & Others)
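The image and video generators above all build on the same diffusion forward process, which admits the closed form q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I). A numpy sketch of that noising step under the standard DDPM linear beta schedule (toy data; illustrative only, not any particular model's code):

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative alpha-bar under the standard DDPM linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Closed-form forward noising: x_t = sqrt(a_t) x_0 + sqrt(1 - a_t) eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                          # toy "image"
x_early = q_sample(x0, 10, alpha_bar, rng)    # barely perturbed
x_late = q_sample(x0, 999, alpha_bar, rng)    # essentially pure noise
print(alpha_bar[999] < 1e-3 < alpha_bar[10])  # → True
```

Training then amounts to predicting eps from x_t and t; sampling runs the learned reverse process, and video models add temporal layers on top of the same machinery.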
Digital Human
- Open Avatar Chat
- HeyGem
- Duix
- Linly-Talker
- Description: an intelligent interactive system combining LLMs with vision models
- 資源:🐙 GitHub
- CVPR 2025 / NeurIPS Resources
- Other Tools
Image Recognition
- ViT (Vision Transformer)
- Swin Transformer
- EfficientNetV2
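ViT's first step is to cut the image into fixed-size patches, flatten them, and project each to the model dimension; a numpy sketch of that patch-embedding stage (the random projection stands in for learned weights, and all sizes are toy values):

```python
import numpy as np

def patchify(img, patch=4):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = img.shape
    x = img.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * c)    # (n_patches, p*p*C)

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 16, 3))             # toy 16x16 RGB image
tokens = patchify(img)                         # 16 patches, each 4*4*3 = 48 values
W = rng.normal(size=(48, 64)) / np.sqrt(48)    # random projection to model dim 64
embeddings = tokens @ W                        # the transformer's input sequence
print(tokens.shape, embeddings.shape)          # → (16, 48) (16, 64)
```

Swin differs mainly in what happens after this step (windowed attention and patch merging), while the embedding itself is the same flatten-and-project idea.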
Document AI (Document Understanding & OCR)
- Donut (2022): OCR-free Document Understanding Transformer. 📄 arXiv:2111.15664
- LayoutParser (2021): Unified toolkit for Deep Learning Based Document Analysis. 📄 arXiv:2103.15348
- TrOCR (2021): Transformer-based OCR with Pre-trained Models. 📄 arXiv:2109.10282
- DiT (2022): Self-supervised Pre-training for Document Image Transformer. 📄 arXiv:2203.02378
- Nougat (2023): Neural Optical Understanding for Academic Documents. 📄 arXiv:2308.13418
📚 LayoutLM Series (click to expand)
- **LayoutLM (2020)**: Pre-training of Text and Layout. [📄 arXiv:1912.13318](./LayoutLM.md)
- **LayoutLMv2 (2021)**: Multi-modal Pre-training. [📄 arXiv:2012.14740](./LayoutLMv2.md)
- **LayoutXLM (2021)**: Multilingual Visually-rich Document Understanding. [📄 arXiv:2104.08836](./LayoutXLM.md)
- **LayoutLMv3 (2022)**: Pre-training with Unified Text and Image Masking. [📄 arXiv:2204.08387](./LayoutLMv3.md)
- Scene Text Recognition
- ABINet (2021): Read Like Humans. 📄 arXiv:2103.06495
- ABINet++ (2022): Iterative Language Modeling for Text Spotting. 📄 arXiv:2211.10578
- ABCNet v2 (2021): Adaptive Bezier-Curve Network. 📄 arXiv:2105.03620
- SVTR (2022): Scene Text Recognition with a Single Visual Model. 📄 arXiv:2205.00159
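Scene-text recognizers such as SVTR typically emit a per-timestep class distribution that is decoded with CTC; greedy CTC decoding (collapse repeats, then drop blanks) fits in a few lines of numpy (the charset and logits below are toy values, not from any of the models above):

```python
import numpy as np

BLANK = 0
CHARSET = "-abcdefghijklmnopqrstuvwxyz"   # index 0 ('-') stands for the CTC blank

def ctc_greedy_decode(logits):
    """Greedy CTC decoding: argmax per timestep, collapse repeats, drop blanks.
    logits: (T, n_classes) array of per-timestep class scores."""
    best = logits.argmax(axis=1)
    out, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(CHARSET[idx])
        prev = idx
    return "".join(out)

# Toy logits whose per-step argmax path is: c c <blank> a a t
path = [3, 3, 0, 1, 1, 20]
logits = np.full((6, len(CHARSET)), -5.0)
logits[np.arange(6), path] = 5.0
print(ctc_greedy_decode(logits))   # → cat
```

ABINet-style models instead refine predictions with an explicit language model, but CTC greedy or beam decoding remains the baseline readout.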
DeepFake Detection
- Multi-attentional Deepfake Detection (CVPR 2021)
- H. Zhao et al., Proceedings of the IEEE/CVF CVPR 2021.
- Geometric Features (CVPR 2021)
- Zekun Sun et al., Improving Efficiency and Robustness through Precise Geometric Features.
- 3D Decomposition (CVPR 2021)
- Xiangyu Zhu et al., Face Forgery Detection by 3D Decomposition.