Deep Learning 101, Taiwan’s pioneering and highest deep learning meetup, launched on 2016/11/11 @ 83F, Taipei 101

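AI is a lonely journey full of anxiety and unknowns; flashy paid courses and events are no shortcut to success.
Sincere thanks to the AI enthusiasts from various organizations who shared their valuable experience under their real names at the time; please let us know if you would like your information removed.
Initiated by TonTon Huang Ph.D., with the venue and refreshments sponsored free of charge by the company he worked for at the time (台灣雪豹科技).
Deep Learning 101 was founded to popularize and share cutting-edge knowledge in deep learning and AI, in the firm belief that the value of AI lies in solving real-world business problems.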

👁️ Computer Vision (CV): Essential Resources Roundup

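Editor's note: This page collects key technical resources in computer vision, covering the latest papers and implementations in object detection, generative AI, image segmentation, and optical character recognition (OCR).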

If you are looking for more detailed notes, visit the GitHub repository: 👉 GitHub: Computer-Vision-Paper (stars welcome ⭐)

AnomalyDetection

Anomaly Detection

2026-01-29｜LLM2CLIP
- Description: Reshaping the textual foundation of cross-modal representation learning with large language models
- Resources: 📄 AlphaXiv 🐙 GitHub 📝 WeChat official account write-up
2025-09-24｜FS-SAM2
- Description: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation
- Resources: 📄 AlphaXiv 📝 Chinese write-up: strong on both performance and efficiency
2025-09-20｜MOCHA
- Description: Multi-modal Objects-aware Cross-arcHitecture Alignment
- Resources: 📄 AlphaXiv 📝 Chinese write-up: big performance gains when injected into YOLO
2025-07-16｜CostFilter-AD
- Description: Enhancing Anomaly Detection through Matching Cost Filtering
- Resources: 🐙 GitHub 📝 Chinese write-up: raising the ceiling for unsupervised methods
2025-06-13｜One-to-Normal
- Description: Anomaly Personalization (a new breakthrough in few-shot anomaly recognition)
- Resources: 📄 AlphaXiv 📝 Chinese write-up
2025-06-06｜DualAnoDiff (CVPR 2025)
- Description: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation
- Resources: 📄 AlphaXiv 📝 Chinese write-up: accepted work from Fudan and Tencent YouTu
2025-05-15｜AdaptCLIP
- Description: Adapting CLIP for Universal Visual Anomaly Detection
- Resources: 📄 AlphaXiv 🐙 GitHub 📝 Chinese write-up
2025-05-05｜Multi-Modal LLM for AD
- Description: Detect, Classify, Act: Categorizing Industrial Anomalies
- Resources: 📄 AlphaXiv 📚 DeepWiki 💾 Dataset
2025-04-27｜AnomalyCLIP
- Description: Object-agnostic Prompt Learning for Zero-shot AD
- Resources: 📄 AlphaXiv 📚 DeepWiki
2025-04-26｜PaDiM
- Description: A classic unsupervised anomaly detection method
- Resources: 📄 AlphaXiv 📚 DeepWiki
2025-04-12｜AA-CLIP
- Description: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
- Resources: 📄 AlphaXiv 📚 DeepWiki
2025-03-25｜Dinomaly
- Description: The Less Is More Philosophy in Multi-Class Unsupervised AD
- Resources: 🐙 GitHub 📝 Chinese write-up

ObjectDetection

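Object Detection

2025｜MCL (AAAI 2025)
- Description: Multi-clue Consistency Learning (semi-supervised object detection for remote sensing)
- Resources: 📄 AlphaXiv 🐙 GitHub 📝 Chinese write-up
2025-07-24｜OV-DINO
- Description: Open-source, industrial-grade open-vocabulary object detection
- Resources: 🐙 GitHub 📝 Chinese write-up
2025-06-18｜CountVid
- Description: Open-World Object Counting in Videos (count whatever you point to in a video)
- Resources: 📄 AlphaXiv 📝 Chinese write-up
2025-06-15｜GeoPix
- Description: A pixel-level multimodal large model for remote sensing
- Resources: 🐙 GitHub 📝 Introduction from the Peking University lab
2025-05-23｜VisionReasoner
- Description: Unifying visual perception and reasoning with reinforcement learning (benchmarked against Qwen2.5-VL)
- Resources: 🐙 GitHub 📝 Chinese write-up
2025-03-14｜Falcon
- Description: A Remote Sensing Vision-Language Foundation Model
- Resources: 📄 AlphaXiv 📚 DeepWiki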

Segmentation

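Image Segmentation

SAM 3
- Resources: 🐙 GitHub 📝 WeChat official account write-up
Perceive Anything Model
- Description: Recognize, Explain, Caption, and Segment Anything (positioned against SAM2 + LLM)
- Resources: 📄 AlphaXiv 📝 Chinese write-up
RemoteSAM
- Description: Towards Segment Anything for Earth Observation
- Resources: 📄 AlphaXiv 📚 DeepWiki
InstructSAM
- Description: Training-Free Framework for Remote Sensing
- Resources: 🌐 Project 📄 AlphaXiv 📚 DeepWiki
SAM 2 & Variants
- Meta SAM 2: Meta's official Segment Anything Model 2.
  - 📝 Fine-tuning SAM 2 in 60 lines of code
- Grounded SAM 2: Ground and Track Anything in Videos.
- SAM2Long: Proposed by CUHK (The Chinese University of Hong Kong); focuses on segmenting complex long videos.
- SAM2-Adapter: The first to adapt SAM 2 to downstream tasks.
- SAM2Point: A milestone in promptable 3D segmentation research.
Other Notable Models
- SAMURAI: Kalman filter (KF) + SAM2 to handle fast-moving or self-occluding objects.
- MatAnyone: Video matting with hair-level detail.
- Exact (CVPR 2025): Weakly supervised learning on remote sensing image time series.
- SegAnyMo (CVPR 2025): Segment Any Motion in Videos.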

OCR

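Optical Character Recognition: analyzing and detecting text in images of objects or scenes

Supercharge your OCR workflow with open-source models
12 popular free and open-source OCR projects
2026-01-27｜DeepSeek-OCR 2
- Resources: 🐙 GitHub 🤗 HuggingFace 📝 WeChat official account write-up
2025-11-30｜HunyuanOCR
- Resources: 🐙 GitHub 📝 Tencent Hunyuan's 1B-scale all-in-one model
2025-10-21｜Chandra OCR
- Resources: 🐙 GitHub 📝 Beyond DeepSeek-OCR! A revolutionary breakthrough in OCR: Chandra OCR local deployment + hands-on evaluation
2025-10-19｜PaddleOCR-VL
- Resources: 🤗 HuggingFace 📝 A pinnacle of image-to-text recognition
2025-08-18｜DianJin-OCR-R1
- Resources: 🐙 GitHub 📝 DianJin OCR-R1: handles blurry stamps and cross-page tables alike
2025-07-30｜dots.ocr
- Resources: 🤗 HuggingFace 📝 Deploying a powerful 1.7B OCR model locally
2025-06-16｜OCRFlux
- Description: LLM-based PDF parsing for complex layouts and cross-page merging
- Resources: 🐙 GitHub 🌐 Demo
2025-06-05｜MonkeyOCR
- Resources: 📚 DeepWiki 📄 AlphaXiv
2025-03-05｜OpenOCR
- Resources: 🐙 GitHub 📝 General-purpose OCR tool OpenOCR open-sourced
2025-03-05｜PP-DocBee
- Resources: 🐙 GitHub 📝 Baidu's document image understanding
2025-03-03｜olmocr
- Resources: 🐙 GitHub 📝 Accurate PDF extraction with local deployment
2025-02-05｜MinerU
- Resources: 🐙 GitHub 📝 A powerful PDF-to-Markdown tool
2024-12-15｜markitdown
- Resources: 🐙 GitHub
2024-10-29｜OmniParser
- Resources: 🐙 GitHub 📝 From Alibaba: general-purpose document extraction for complex scenarios
2024-09-11｜GOT-OCR-2.0
- Resources: 📝 Introduction to the open-sourced model 📝 The OCR 2.0 era has arrived
2024-08-20｜PDF-to-Markdown tool
- Resources: 📝 Everything can be AI-ified! An open-source tool watched by 12,000 people
Other useful tools and resources
- RapidOCR: 🐙 GitHub
- TableStructureRec: 🐙 GitHub 📝 Table structure recognition inference library
- PaddleOCR tutorial: 📝 Fine-tuning on medical diagnosis certificates and receipts with PPOCRLabel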

Diffusion Model

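Diffusion Models

2025-05-28｜Jodi
- Description: A unified model for visual understanding and generation
- Resources: 📄 AlphaXiv 🌐 Project
2025-05-27｜AnomalyAny (CVPR 2025)
- Description: Stable Diffusion assisting visual anomaly detection, training-free
- Resources: 🌐 Project 📝 Chinese write-up
2025-05-23｜HivisionIDPhotos
- Description: Smart ID photo generator (matting, background replacement, arbitrary sizes)
- Resources: 📚 DeepWiki 📝 Tutorial article
2025-05-19｜Index-AniSora
- Description: Bilibili's open-source SOTA animation video generation model
- Resources: 📄 AlphaXiv 📚 DeepWiki 📝 Chinese write-up
2025-04-26｜Insert Anything
- Resources: 📄 AlphaXiv 📚 DeepWiki
2025-04-24｜Phantom
- Description: ByteDance's 1280x720 video generation model; runs with 10 GB of VRAM
- Resources: 🐙 GitHub 📝 Hands-on test report
2025-04-22｜MAGI-1
- Description: Sand AI's autoregressive large video generation model, billed as the world's first
- Resources: 🐙 GitHub 📝 Performance highlights analysis
2025-04-22｜SkyReels V2
- Description: Billed as the world's first unlimited-length video generation model, with film-level understanding
- Resources: 🐙 GitHub 📝 Media coverage
2025-04-14｜FramePack
- Description: ComfyUI plugin; runs a 13B model in 6 GB of VRAM and supports 1-minute videos
- Resources: 🐙 GitHub 📝 Cost-effectiveness analysis
2025-04-14｜Fantasy-talking
- Description: Audio-driven digital human based on Wan2.1
- Resources: 🌐 Project 📝 Explainer article
2025-04-05｜SkyReels-A2
- Resources: 📄 AlphaXiv 📚 DeepWiki 📝 Chinese write-up
2025-03-10｜HunyuanVideo-I2V
- Description: Tencent's open-source image-to-video model + LoRA training scripts
- Resources: 🐙 GitHub 📝 Hands-on tutorial
2025-02-25｜Wan-Video
- Description: Alibaba's Wan (萬相) large model open-sourced; all modalities, all sizes
- Resources: 🐙 GitHub 📝 Media coverage
2025-02-14｜FlashVideo
- Description: ByteDance's video enhancement algorithm; generates 1080P video in 102 seconds
- Resources: 🐙 GitHub 📝 Explainer article
2025-01-28｜Sana (ICLR 2025 Oral)
- Description: Open-sourced by NVIDIA, MIT, and Tsinghua; reportedly 100x faster than FLUX
- Resources: 🐙 GitHub 📝 Chinese write-up
Flux & Ecosystem
- Flux Models: 🤗 Black Forest Labs
  - Canny-dev Depth-dev Fill-dev Redux-dev
- PuLID (2024-11-29): 🐙 GitHub 📝 ComfyUI tutorial
- Leffa (2024-12-17): 🐙 GitHub 📝 Meta AI: preserving a person's features
- MagicQuill (2024-11-26): 🐙 GitHub 🤗 Space 📝 AI photo-editing tool
Practical Tools (ComfyUI & Others)
- OOTDiffusion: 🐙 GitHub 📝 AI virtual try-on tool
- ComfyUI Impact Pack: 🐙 GitHub 📝 The strongest face restoration
- OmniGen: 🐙 GitHub 📝 All-purpose image generation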

Digital Human

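Virtual Digital Humans

Open Avatar Chat
- Resources: 📝 Project introduction 📝 A viral GitHub tool; local deployment with no strings attached
HeyGem
- Resources: 🐙 GitHub 📝 Digital human cloning tool
Duix
- Resources: 🐙 GitHub 📝 Billed as the world's first open-source real-person digital human
Linly-Talker
- Description: An intelligent interactive system combining LLMs with vision models
- Resources: 🐙 GitHub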
CVPR 2025 / NeurIPS Resources
- EchoMimicV2 (CVPR 2025): 🐙 GitHub - Striking, Simplified Human Animation.
- Hallo3 (CVPR 2025): 🐙 GitHub - Highly Dynamic Portrait Animation.
- MimicTalk (NeurIPS 2024): 🐙 GitHub - 3D talking face.
Other Tools
- JoyGen: 🐙 GitHub (Audio-Driven 3D Editing)
- LatentSync: 🐙 GitHub
- MuseTalk: 🐙 GitHub

Image Recognition

Image Recognition

ViT (Vision Transformer)
- Resources: 🐙 GitHub 📝 Analysis article 📝 Transfer performance analysis
Swin Transformer
- Resources: 🐙 GitHub 📝 Beating CNNs the CNN way
EfficientNetV2
- Resources: 🐙 GitHub 📝 Smaller models and faster training

Document AI

Document Understanding & OCR

Donut (2022): OCR-free Document Understanding Transformer. 📄 arXiv:2111.15664
LayoutParser (2021): Unified toolkit for Deep Learning Based Document Analysis. 📄 arXiv:2103.15348
TrOCR (2021): Transformer-based OCR with Pre-trained Models. 📄 arXiv:2109.10282
DiT (2022): Self-supervised Pre-training for Document Image Transformer. 📄 arXiv:2203.02378
Nougat (2023): Neural Optical Understanding for Academic Documents. 📄 arXiv:2308.13418

📚 LayoutLM Series

- **LayoutLM (2020)**: Pre-training of Text and Layout. [📄 arXiv:1912.13318](./LayoutLM.md)
- **LayoutLMv2 (2021)**: Multi-modal Pre-training. [📄 arXiv:2012.14740](./LayoutLMv2.md)
- **LayoutXLM (2021)**: Multilingual Visually-rich Document Understanding. [📄 arXiv:2104.08836](./LayoutXLM.md)
- **LayoutLMv3 (2022)**: Pre-training with Unified Text and Image Masking. [📄 arXiv:2204.08387](./LayoutLMv3.md)

Scene Text Recognition
- ABINet (2021): Read Like Humans. 📄 arXiv:2103.06495
- ABINet++ (2022): Iterative Language Modeling for Text Spotting. 📄 arXiv:2211.10578
- ABCNet v2 (2021): Adaptive Bezier-Curve Network. 📄 arXiv:2105.03620
- SVTR (2022): Scene Text Recognition with a Single Visual Model. 📄 arXiv:2205.00159

DeepFake Detection

DeepFake Detection

Multi-attentional Deepfake Detection (CVPR 2021)
- H. Zhao et al., Proceedings of the IEEE/CVF CVPR 2021.
Geometric Features (CVPR 2021)
- Improving Efficiency and Robustness through Precise Geometric Features. Sun, Zekun et al.
3D Decomposition (CVPR 2021)
- Face Forgery Detection by 3D Decomposition. Xiangyu Zhu et al.