Reading Notes on "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"

Posted on 2021-06-29, updated on 2021-11-25. Categories: Study, Paper Reading


Abstract

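The paper proposes a new Transformer for computer vision.

It describes the problems that currently exist and proposes Swin Transformer to solve them.

It briefly introduces the idea behind Swin Transformer.

It reports SOTA results across vision tasks. (Reaches XX on the XX dataset.)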

Introduction

Modeling in computer vision has long been dominated by convolutional neural networks (CNNs).

In short: CNN models have been the dominant way to do computer vision for a long time.

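The paper then describes how CNNs have dominated CV ever since AlexNet, thanks to ever larger architectures…

It then turns to developments in NLP, where the landmark paper "Attention Is All You Need" appeared…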

Its tremendous success in the language domain has led researchers to investigate its adaptation to computer vision, where it has recently demonstrated promising results on certain tasks, specifically image classification [19] and joint vision-language modeling [46].

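In other words, the Transformer's success in language pushed researchers to study how well it adapts to computer vision, and it has recently shown promising results on some tasks, notably image classification [19] and joint vision-language modeling [46].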

[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.

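Next, the paper starts bringing the two together.

It then lists recent related work by others, critiques its shortcomings, and presents its own advantages.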

As illustrated in Figure 1(a), …

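Figure 1 is mainly a comparison with reference [19] (ViT): Swin's computational complexity is linear in image size rather than quadratic, and it builds hierarchical feature maps instead of a single-resolution feature map.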
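The linear-versus-quadratic claim can be checked with a little arithmetic. The sketch below uses the cost estimates given in the Swin paper for global multi-head self-attention (MSA) and window-based self-attention (W-MSA) on an h × w grid of patch tokens with channel dimension C and window size M. The function names and the example sizes are only illustrative.

```python
def msa_flops(h, w, C):
    """Global self-attention over all h*w tokens: quadratic in h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h, w, C, M=7):
    """Self-attention inside non-overlapping M x M windows: linear in h*w."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


if __name__ == "__main__":
    C = 96  # illustrative channel dimension (an assumption for this demo)
    for side in (56, 112, 224):
        ratio = msa_flops(side, side, C) / wmsa_flops(side, side, C)
        print(f"{side}x{side} tokens: global / windowed cost = {ratio:.1f}x")
```

Doubling the side length multiplies the windowed cost by exactly 4, while the global cost grows faster because of the (hw)² term.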

conveniently leverage advanced techniques

That is, making convenient use of advanced techniques.

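Figure 2 introduces the shifted window approach: self-attention is computed inside each window, and all query patches in a window share the same key set, which reduces latency.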
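A minimal sketch of the window idea, in plain Python on a grid of token indices rather than real features. `window_partition` is a hypothetical helper, not code from the paper. It only shows that every query in a window attends to the same M × M set of keys, which is what makes batched computation straightforward.

```python
def window_partition(grid, M):
    """Split an H x W grid (a list of rows) into non-overlapping M x M windows.

    Returns a list of windows, each a flat list of M*M entries in row-major order.
    """
    H, W = len(grid), len(grid[0])
    assert H % M == 0 and W % M == 0, "grid must be divisible by the window size"
    windows = []
    for top in range(0, H, M):
        for left in range(0, W, M):
            windows.append(
                [grid[r][c] for r in range(top, top + M) for c in range(left, left + M)]
            )
    return windows


H, W, M = 4, 4, 2  # toy sizes chosen for illustration
grid = [[r * W + c for c in range(W)] for r in range(H)]
windows = window_partition(grid, M)

# Within a window, every query token uses the whole window as its key set.
key_set = {query: tuple(win) for win in windows for query in win}
```

Here tokens 0, 1, 4, and 5 all map to the same key set, so their attention can be computed as one small matrix product per window.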

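The end of the Introduction stresses the advantages of the work once more: the model achieves strong performance across different areas of CV, such as object detection, segmentation, and classification, and backs up "strong" with numbers.

The last paragraph states the authors' belief: they bring the two fields together, and they hope the proposed model will encourage unified modeling.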

CNN and variants

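This part covers the dominance of CNNs in CV, from the pioneering AlexNet to the many CNNs that sprang up afterwards. Progress came not only from new architectures but also from improvements to individual convolution layers.

But since this paper wants to apply Transformers to CV, it naturally closes by pointing out their strong potential to drive a shift in modeling.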

Self-attention based backbone architectures

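This part covers work that uses self-attention (SA) as the backbone architecture, and notes that in earlier work the cost of memory access makes actual latency significantly larger than that of CNNs.

So this paper shifts windows between consecutive layers instead, which can also be implemented efficiently on general hardware.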
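The shift between consecutive layers can be sketched on a toy grid of token indices. This assumes the displacement of ⌊M/2⌋ and the cyclic-shift trick described in the paper. The helper names are made up for illustration, and the attention mask that the paper applies to windows containing wrapped-around tokens is left out.

```python
def partition(grid, M):
    """Split an H x W grid into non-overlapping M x M windows (flat lists)."""
    H, W = len(grid), len(grid[0])
    return [
        [grid[r][c] for r in range(top, top + M) for c in range(left, left + M)]
        for top in range(0, H, M)
        for left in range(0, W, M)
    ]


def cyclic_shift(grid, s):
    """Roll the grid s rows up and s columns left, wrapping around the edges."""
    rolled_rows = grid[s:] + grid[:s]
    return [row[s:] + row[:s] for row in rolled_rows]


H, W, M = 4, 4, 2  # toy sizes chosen for illustration
grid = [[r * W + c for c in range(W)] for r in range(H)]

regular = partition(grid, M)                        # windows in layer l
shifted = partition(cyclic_shift(grid, M // 2), M)  # windows in layer l + 1
```

In this toy case the first shifted window holds tokens 5, 6, 9, and 10, one from each of the four regular windows, so information can cross window boundaries from one layer to the next.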

Self-attention/Transformers to complement CNNs

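Using SA/Transformers to complement and augment CNNs.

It first introduces some related work (providing the capability to encode distant dependencies or heterogeneous interactions), then highlights its own work: it explores adapting Transformers to CV and is complementary to these earlier works.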

Transformer based vision backbones

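This part introduces the related work closest to ours, ViT [19], and its follow-ups.

After covering the strengths, it describes how follow-up work improved on the weaknesses. Of course, if everything had already been fixed, what would this paper be for? So it points out the other side's fatal flaw: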

but its architecture is unsuitable for use as a general-purpose backbone network on dense vision tasks or when the input image resolution is high, due to its low-resolution feature maps and the quadratic increase in complexity with image size.

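Put simply, its low-resolution feature maps and the quadratic growth of complexity with image size make the architecture unsuitable as a general-purpose backbone for dense vision tasks or for high-resolution input images.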
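To make the resolution argument concrete, the sketch below lists feature-map sizes for a 224 × 224 input. It assumes ViT's single 16 × 16 patch stride and the 4×, 8×, 16×, 32× downsampling stages described in the Swin paper. `feature_map_sizes` is a hypothetical helper.

```python
def feature_map_sizes(height, width, strides):
    """Return the (rows, cols) of the feature map at each downsampling stride."""
    return [(height // s, width // s) for s in strides]


vit = feature_map_sizes(224, 224, [16])             # one resolution throughout
swin = feature_map_sizes(224, 224, [4, 8, 16, 32])  # hierarchical stages

print("ViT :", vit)
print("Swin:", swin)
```

Dense prediction heads such as FPN or U-Net expect a multi-scale pyramid like the second list, which a single 14 × 14 map cannot supply.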

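It then mentions some concurrent work that modifies the ViT architecture to get better image classification.

But our work also gets good results on classification, even though it focuses on general-purpose performance and not just classification.

One work [63] explores a similar line of thinking, but its complexity is still quadratic, while ours is linear!

And of course, we are the best anyway.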

[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[63] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.

Method

Overall Architecture

Conclusion

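Say it once more at the end: the Swin Transformer we proposed is really pretty great!

Briefly summarize our novel idea, and note that the idea turns out to be quite useful.

Finally, since Swin Transformer works in CV, the authors hope to see its use in NLP investigated as well.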
