
[Attention Mechanisms] Attention Augmented Convolutional Networks


Original link: https://www.yuque.com/lart/papers/aaconv


Main content

We propose to augment convolutional operators with this self-attention mechanism by concatenating convolutional feature maps with a set of feature maps produced via self-attention.

Main work

First, recall two properties of the convolution operation itself:

  • Locality: locality via a limited receptive field
  • Translation equivariance: translation equivariance via weight sharing
    • The Convolution Operator is Translation Equivariant meaning it preserves Translations however the CNN processing allows for Translation Invariance which is achieved by means of a proper (i.e. related to spatial features) dimensionality reduction. https://aboveintelligent.com/ml-cnn-translation-equivariance-and-invariance-da12e8ab7049
    • While convolutions are translation equivariant and not invariant, an approximative translation invariance can be achieved in neural networks by combining convolutions with spatial pooling operators. https://chriswolfvision.medium.com/what-is-translation-equivariance-and-why-do-we-use-convolutions-to-get-it-6f18139d4c59

Although these properties have been shown to be crucial inductive biases when designing models that operate on images, the local nature of the convolutional kernel prevents it from capturing global context, which is often necessary for image recognition. This is an important weakness of convolution. (The convolution operator is limited by its locality and lack of understanding of global contexts.)

For capturing long range interactions, self-attention has emerged as a recent advance. The key idea behind self-attention is to produce a weighted average of values computed from hidden units. Unlike convolution or pooling, these weights are dynamic and are produced via a similarity function between hidden units, depending on the input features. As a result, the interaction between input signals depends on the signals themselves, rather than being predetermined by their relative positions as in convolution.

This paper therefore tries to bring self-attention into the convolution operation to model long range interactions. For discriminative visual tasks, it considers using self-attention in place of ordinary convolution and introduces a novel two-dimensional relative self-attention mechanism that maintains translation equivariance while being infused with relative position information, making it well suited to images.

Self-attention proves competitive as a standalone computational unit for replacing convolution. Note, however, that controlled experiments show that combining self-attention with convolution yields the best results. Convolutions are therefore not discarded entirely; instead, the self-attention mechanism is used to augment convolutions: the convolutional feature maps, which emphasize locality, are concatenated with the self-attention feature maps, which are capable of modeling longer range dependencies, to produce the final output.

Across multiple experiments, attention augmented convolution achieves consistent improvements. Moreover, the fully self-attentional model (without the convolutional part), which can be seen as a special case of attention augmentation, is only slightly worse than its fully convolutional counterpart on ImageNet, suggesting that self-attention is a powerful standalone computational primitive for image classification.

On the notion of a "primitive", I found the following explanation; roughly, it refers to the most basic concepts in an entire system.
https://stackoverflow.com/a/8022435
For me, it means something that cannot be decomposed (people use also the atomic word sometimes in that sense, but atomic is often also used for explanation on concurrency or parallelism with a different meaning).
For instance, on Unix (or Linux) the system calls, as seen by the application are primitive or atomic, they either happen or not (sometimes, they got interrupted and give an EINTR or ERESTART error).
And inside an interpreter, or even in the formal specification, of a language, the primitive are those operations which you cannot define, and which the interpreter deals with specially. Very often, cons is a primitive operation for Lisp dialects.

The paper also mentions some other work on attention in visual tasks:

  • reweigh feature channels using signals aggregated from entire feature maps
    • Squeeze-and-Excitation [SENet]
    • Gather-Excite [http://papers.nips.cc/paper/8151-gather-excite-exploiting-feature-context-in-convolutional-neural-networks.pdf]
  • refine convolutional features independently in the channel and spatial dimensions
    • BAM [Bam: bottleneck attention module]
    • CBAM [Cbam: Convolutional block attention module]
  • the additive use of a few non-local residual blocks that employ self-attention in convolutional architectures
    • non-local neural networks

Compared with existing approaches, the architecture proposed here does not rely on pretraining of the corresponding fully convolutional counterparts; instead, the self-attention mechanism is used throughout the entire network. In addition, the use of multi-head attention lets the model attend jointly to spatial and feature subspaces. (Multi-head attention splits the features into different groups along the channel dimension, transforms each group independently, and thus obtains more diverse feature representations.)

Furthermore, to improve the expressiveness of self-attention on images, the relative self-attention of [Self-attention with relative position representations, Music Transformer] is extended here to a two-dimensional form, which makes it possible to model translation equivariance in a principled way.

Such a structure directly produces additional feature maps, instead of recalibrating convolutional features via addition [Non-local neural networks, Self-attention generative adversarial networks] or gating [Squeeze-and-excitation networks, Gather-excite: Exploiting feature context in convolutional neural networks, Bam: bottleneck attention module, Cbam: Convolutional block attention module]. This property makes it possible to flexibly adjust the fraction of attention channels, considering a spectrum of architectures ranging from fully convolutional to fully attentional models.

Main architecture

(figure omitted)

  • H, W, Fin: the height, width and number of channels of the input feature map
  • Nh, dv, dk: the number of heads, the depth of the values (i.e. the number of channels of the value feature maps), and the depth of the queries and keys (these are all parameters of MHA, multi-head attention). It is required that dv and dk be divisible by Nh; \(d^h_v\) and \(d^h_k\) denote the per-head depth of the values and of the queries/keys.

Computing multi-head attention on image data

\(O_h = \text{Softmax}\left(\frac{(XW_q)(XW_k)^{\top}}{\sqrt{d^h_k}}\right)(XW_v)\)  (Equation 1)

How the multi-head output is computed

\(MHA(X) = \text{Concat}\left[O_1, \ldots, O_{N_h}\right]W^O\)  (Equation 2)

The multi-head output is formed by concatenating the outputs of the individual heads.

  1. in_tensor \((H,W,F_{in})\) =(flatten)=> X \((HW,F_{in})\) (We omit the batch dimension for simplicity.)
  2. Compute multi-head attention following the Transformer (a minimal PyTorch sketch follows this list):
    1. For head h, the self-attention output is given by Equation 1, where \(W_q\)/\(W_k\)/\(W_v\) have shapes \((F_{in}, d^h_q)/(F_{in}, d^h_k)/(F_{in}, d^h_v)\) and project the input X to the queries \(Q=XW_q\), keys \(K=XW_k\) and values \(V=XW_v\), whose shapes are \((HW, d^h_q)/(HW, d^h_k)/(HW, d^h_v)\).
    2. The outputs of all heads are concatenated and processed according to Equation 2, where \(W^O \in \mathbb{R}^{d_v \times d_v}\) (note that the concatenation of the \(N_h\) heads' \(O\) has depth \(d_v\), i.e. \(d_v=N_h \times d^h_v\)). The MHA result is then reshaped to \((H, W, d_v)\) to match the original spatial dimensions.
    3. multi-head attention
      1. Computational complexity: \(O((HW)^2d_k)\) (only the dominant term, the computation of \((XW_q)(XW_k)^T\), needs to be considered)
      2. Memory complexity: \(O((HW)^2N_h)\) (this includes the attention maps of all \(N_h\) heads)
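
To make these steps concrete, here is a minimal PyTorch sketch of the computation above (my own illustration, not the paper's code; the names Wq/Wk/Wv/Wo and the toy sizes are assumptions). It flattens an H x W feature map, applies Equation 1 per head and Equation 2 to merge the heads, and reshapes the result back to (H, W, d_v).

import torch

B, H, W, Fin = 2, 8, 8, 32
Nh, dk, dv = 4, 16, 16                # dk and dv must be divisible by Nh
dkh, dvh = dk // Nh, dv // Nh

X = torch.randn(B, H * W, Fin)        # flattened input (batch kept for clarity)
Wq, Wk = torch.randn(Fin, dk), torch.randn(Fin, dk)
Wv, Wo = torch.randn(Fin, dv), torch.randn(dv, dv)

def split_heads(t, d_h):
    # (B, HW, Nh * d_h) -> (B, Nh, HW, d_h)
    return t.reshape(B, H * W, Nh, d_h).transpose(1, 2)

Q, K, V = split_heads(X @ Wq, dkh), split_heads(X @ Wk, dkh), split_heads(X @ Wv, dvh)

# Equation 1: per-head scaled dot-product attention, one (HW x HW) map per head.
logits = Q @ K.transpose(-2, -1) / dkh ** 0.5       # (B, Nh, HW, HW)
O = logits.softmax(dim=-1) @ V                      # (B, Nh, HW, dvh)

# Equation 2: concatenate the heads, mix them with W^O and restore the spatial shape.
MHA = (O.transpose(1, 2).reshape(B, H * W, dv) @ Wo).reshape(B, H, W, dv)
print(MHA.shape)  # torch.Size([2, 8, 8, 16])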

Two-dimensional Positional Embeddings

"Two-dimensional" here is relative to the original architecture designed for one-dimensional language data; here the input is two-dimensional image data.

Since no explicit position information is used, self-attention satisfies \(MHA(\pi(X))=\pi(MHA(X))\), where \(\pi\) denotes an arbitrary permutation of the pixel positions. In other words, self-attention is permutation equivariant. This property makes it not very effective for modeling highly structured data such as images.
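
A quick sanity check of this property (my own, not from the paper): for a single head without positional terms, permuting the rows (pixels) of X permutes the attention output in exactly the same way.

import torch

torch.manual_seed(0)
HW, Fin, dk = 16, 8, 8
X = torch.randn(HW, Fin)
Wq, Wk, Wv = (torch.randn(Fin, dk) for _ in range(3))

def attn(x):
    logits = (x @ Wq) @ (x @ Wk).T / dk ** 0.5
    return logits.softmax(-1) @ (x @ Wv)

perm = torch.randperm(HW)
print(torch.allclose(attn(X[perm]), attn(X)[perm], atol=1e-5))  # True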

Several position encodings that augment activation maps with explicit spatial information have been proposed to address this issue:

  1. Image Transformer extends the sinusoidal waves first introduced in the original Transformer to 2 dimensional inputs.
  2. CoordConv concatenates positional channels to an activation map.
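
For reference, the CoordConv idea can be sketched as follows (a minimal illustration under my own assumptions, e.g. coordinates normalized to [-1, 1]; the paper's experiments additionally concatenate a radius channel r):

import torch

def concat_coords(x):
    """x: (B, C, H, W) -> (B, C + 2, H, W) with x/y coordinate channels appended."""
    B, _, H, W = x.shape
    ys = torch.linspace(-1, 1, H).view(1, 1, H, 1).expand(B, 1, H, W)
    xs = torch.linspace(-1, 1, W).view(1, 1, 1, W).expand(B, 1, H, W)
    return torch.cat([x, xs, ys], dim=1)

print(concat_coords(torch.randn(2, 16, 10, 12)).shape)  # torch.Size([2, 18, 10, 12])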

In the paper's experiments, these encodings turned out not to work well for image classification and object detection. The authors attribute this to the fact that, while such strategies break permutation equivariance, they cannot guarantee the translation equivariance that image tasks require (permutation equivariance vs. translation equivariance). Therefore, the existing relative position encoding [Self attention with relative position representations] is extended to two dimensions, and a memory-efficient implementation based on the Music Transformer is given.

Relative positional embeddings

Introduced in [Self attention with relative position representations] for the purpose of language modeling, relative self-attention augments self-attention with relative position encodings and enables translation equivariance while preventing permutation equivariance.

Two-dimensional relative self-attention is implemented here by independently adding relative width and relative height information.
The attention logit for how much pixel \(i=(i_x, i_y)\) attends to pixel \(j=(j_x, j_y)\) is computed as:

\(l_{i,j} = \frac{q_i^{\top}}{\sqrt{d^h_k}}\left(k_j + r^W_{j_x-i_x} + r^H_{j_y-i_y}\right)\)

  • \(q_i\) is the query vector for position \(i\), i.e. a vector of length \(d^h_k\) taken from Q.
  • \(k_j\) is the key vector for position \(j\), i.e. a vector of length \(d^h_k\) taken from K.
  • \(r^W_{j_x-i_x}\) and \(r^H_{j_y-i_y}\) are learned embeddings for the relative width \(j_x-i_x\) and relative height \(j_y-i_y\), each a vector of length \(d^h_k\).
  • The corresponding relative position parameter matrices \(r^W\) and \(r^H\) have sizes \((2W-1, d^h_k)\) and \((2H-1, d^h_k)\) respectively.

The output of a single head h then becomes:

\(O_h = \text{Softmax}\left(\frac{QK^{\top} + S^{rel}_H + S^{rel}_W}{\sqrt{d^h_k}}\right)V\)

Here the two \(S\) matrices are both of size \(HW \times HW\) and contain the relative position logits along the width and height dimensions:

  • \(S^{rel}_W[i, j]=q_i^{\top} r^W_{j_x-i_x}\)
  • \(S^{rel}_H[i, j]=q_i^{\top} r^H_{j_y-i_y}\)

Since only relative width/height information is used, we have \(S^{rel}_W[i, j]=S^{rel}_W[i, j+W]\) and \(S^{rel}_H[i, j]=S^{rel}_H[i, j+H]\), so the logits do not need to be computed for every (i, j) pair. This can be understood as follows (my own interpretation): in the two-dimensional feature map, rows run along the W direction (horizontal, the x direction) and columns along the H direction (vertical, the y direction). For an arbitrary point \(j\) and a fixed point \(i\):

  • in \(S_W\), \((j_x-i_x)\%W=[(j+nW)_x-i_x]\%W\), i.e. shifting j by nW positions in row-major order keeps it in the same column;
  • in \(S_H\), \((j_y-i_y)\%H=[(j+nH)_y-i_y]\%H\), i.e. shifting j by nH positions in column-major order keeps it in the same row.

The relative attention used here differs from the design in the original reference [Self attention with relative position representations], whose memory cost is \(O((HW)^2d^h_k)\) (relative embeddings \(r_{ij} \in \mathbb{R}^{HW \times HW \times d^h_k}\)). Instead, it is a 2D extension of the memory efficient relative masked attention algorithm proposed in the Music Transformer, extended to unmasked relative self-attention over 2-dimensional inputs, which brings the memory cost down to \(O(HWd^h_k)\): the relative position embedding \(r_{ij}\) is split into two parts, \(r^H \in \mathbb{R}^{(2H-1) \times d^h_k}\) and \(r^W \in \mathbb{R}^{(2W-1) \times d^h_k}\), which are shared across heads but not across layers. Per layer, only an additional \((2(H+W) - 2)d^h_k\) parameters are needed to model the relative distances along height and width.
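
To make the memory-efficient indexing concrete, here is a tiny standalone demonstration (my own) of the "relative to absolute" padding-and-reshaping trick for a single row of L = 3 positions; the rel_to_abs helper in the code section below applies exactly the same steps, with batch and head dimensions added.

import torch

L = 3
# rel[i, r] holds the logit of query i against relative offset r - (L - 1), i.e.
# offsets -2..2; filled with running integers so the mapping is easy to read.
rel = torch.arange(L * (2 * L - 1), dtype=torch.float).reshape(L, 2 * L - 1)

x = torch.cat([rel, torch.zeros(L, 1)], dim=1)          # pad one column -> (L, 2L)
flat = torch.cat([x.reshape(-1), torch.zeros(L - 1)])   # pad L - 1 zeros at the end
abs_logits = flat.reshape(L + 1, 2 * L - 1)[:L, L - 1:] # reshape and slice -> (L, L)

# abs_logits[i, j] == rel[i, (j - i) + (L - 1)], i.e. query i vs. key j at offset j - i.
print(abs_logits)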

Attention Augmented Convolution

The main advantages of the attention augmented convolution proposed in the paper:

  1. use an attention mechanism that can attend jointly to spatial and feature subspaces (each head corresponding to a feature subspace)
  2. introduce additional feature maps rather than refining them

The core computation of AAConv:

\(\text{AAConv}(X) = \text{Concat}\left[\text{Conv}(X),\ \text{MHA}(X)\right]\)

Similarly to the convolution, the proposed attention augmented convolution

  1. is equivariant to translation
  2. can readily operate on inputs of different spatial dimensions

Below, the parameter count of AAConv is analysed against an ordinary convolution with weights of shape \((F_{out}, F_{in}, k, k)\):

  • Let \(v=\frac{d_v}{F_{out}}\) be the ratio of the total MHA output channels to the total AAConv output channels;
  • Let \(\kappa = \frac{d_k}{F_{out}}\) be the ratio of the key depth in MHA to the total AAConv output channels.
  • Q, K and V are obtained via \(1 \times 1\) convolutions, contributing \((d_q + d_k + d_v)F_{in} = (2d_k + d_v)F_{in}=(2\kappa + v)F_{out}F_{in}\) parameters;
  • An additional \(1\times1\) convolution is used to mix the contribution of different heads, contributing \(d_v \cdot d_v=(vF_{out})^2\) parameters;
  • Besides the attention part, there is also the standard convolution (the Conv term in the formula above), with \(k^2(F_{out} - d_v)F_{in} = k^2(1 - v)F_{out}F_{in}\) parameters;
  • So, ignoring the relative position embeddings and the convolution bias, the overall parameter count is approximately \(F_{in}F_{out}(2\kappa + v + v^2\frac{F_{out}}{F_{in}} + k^2-k^2v)=F_{in}F_{out}(2\kappa + v(1-k^2) + k^2 + v^2\frac{F_{out}}{F_{in}})\)
  • The change in parameters relative to a plain convolution is \(\Delta_{params}\sim F_{in}F_{out}(2\kappa + v(1-k^2) + v^2\frac{F_{out}}{F_{in}})\), so replacing a 3x3 convolution slightly decreases the parameter count, while replacing a 1x1 convolution slightly increases it (a quick numeric check follows this list).
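
As a quick numeric check of these formulas (my own; the ratios v = kappa = 0.25 and the channel sizes are just assumed example settings):

F_in, F_out = 256, 256
v, kappa = 0.25, 0.25
d_v, d_k = int(v * F_out), int(kappa * F_out)

def aaconv_params(k):
    qkv = (2 * d_k + d_v) * F_in          # 1x1 conv producing Q, K and V
    mix = d_v * d_v                       # 1x1 conv mixing the heads
    conv = k * k * (F_out - d_v) * F_in   # standard convolution branch
    return qkv + mix + conv

def plain_conv_params(k):
    return k * k * F_in * F_out

for k in (1, 3):
    delta = aaconv_params(k) - plain_conv_params(k)
    approx = F_in * F_out * (2 * kappa + v * (1 - k * k) + v * v * F_out / F_in)
    print(k, delta, int(approx))  # delta > 0 for k = 1, delta < 0 for k = 3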

Attention Augmented Convolutional Architectures

  • In all experiments, AAConv is followed by batch normalization, which can learn to scale the contributions of the convolutional and the attention feature maps.
  • AAConv is used once per residual block.
  • Since the QK product has a large memory footprint, augmented convolutions are applied starting from the deepest stages and moving toward shallower ones, until the memory limit is reached.
  • To reduce the memory footprint of augmented networks, we typically resort to a smaller batch size and sometimes additionally downsample the inputs to self-attention in the layers with the largest spatial dimensions where it is applied (this presumably means downsampling before and upsampling after the attention computation; a small sketch follows this list). Downsampling is performed by applying 3x3 average pooling with stride 2 while the following upsampling (required for the concatenation) is obtained via bilinear interpolation.
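
A minimal sketch of that downsample-attend-upsample pattern, based on the description above (here attention stands for any attention module that returns a (B, C, H', W') map, e.g. the SelfAttention3D class defined in the code section below):

import torch.nn.functional as F

def downsampled_attention(x, attention):
    h, w = x.shape[-2:]
    x_small = F.avg_pool2d(x, kernel_size=3, stride=2, padding=1)   # 3x3 avg pool, stride 2
    attn_small = attention(x_small)
    # Bilinear upsampling back to the input resolution, as required for concatenation.
    return F.interpolate(attn_small, size=(h, w), mode="bilinear", align_corners=False)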

Experimental results

Position encodings

(results omitted: comparison of the position encoding variants listed below)

  • the position-unaware version of self-attention (referred to as None),
  • a two-dimensional implementation of the sinusoidal positional waves (referred to as 2d Sine) as used in [32],
  • CoordConv [29] for which we concatenate (x, y, r) coordinate channels to the inputs of the attention function,
  • our proposed two-dimensional relative position encodings (referred to as Relative).

Future directions

  • Several open questions from this work remain. In future work, we will focus on the fully attentional regime and explore how different attention mechanisms trade off computational efficiency versus representational power. For instance, identifying a local attention mechanism may result in an efficient and scalable computational mechanism that could prevent the need for downsampling with average pooling [Stand-alone self-attention in vision models].
  • Additionally, it is plausible that architectural design choices that are well suited when exclusively relying on convolutions are suboptimal when using self-attention mechanisms. As such, it would be interesting to see if using Attention Augmentation as a primitive in automated architecture search procedures proves useful to find even better models than those previously found in image classification [55], object detection [12], image segmentation [6] and other domains [5, 1, 35, 8].
  • Finally, one can ask to which degree fully attentional models can replace convolutional networks for visual tasks.

Code example

Based on the authors' TensorFlow implementation referenced in the paper, I adapted it to PyTorch.

import torch
from einops import rearrange
from torch import nn

def rel_to_abs(x):
    """
    Converts tensor from relative to absolute indexing.
    Details can be found at: https://www.yuque.com/lart/ugkv9f/oazsec

    :param x: B Nh L 2L-1
    :return: B Nh L L
    """
    B, Nh, L, _ = x.shape

    # Pad to shift from relative to absolute indexing.
    col_pad = torch.zeros(B, Nh, L, 1, device=x.device, dtype=x.dtype)
    x = torch.cat([x, col_pad], dim=3)

    flat_x = x.reshape(B, Nh, L * 2 * L)

    flat_pad = torch.zeros(B, Nh, L - 1, device=x.device, dtype=x.dtype)
    flat_x = torch.cat([flat_x, flat_pad], dim=2)

    # Reshape and slice out the padded elements.
    final_x = flat_x.reshape(B, Nh, L + 1, 2 * L - 1)
    final_x = final_x[:, :, :L, L - 1:]
    return final_x

def relative_logits_1d(x, rel_k):
    """
    Compute relative logits along one dimension.

    :param x: B Nh Hd L
    :param rel_k: 2L-1 Hd
    """
    rel_logits = torch.einsum("bndl, rd -> bnlr", x, rel_k)
    rel_logits = rel_to_abs(rel_logits)  # B Nh L 2L-1 -> B Nh L L
    return rel_logits

class RelativePosEmbedding(nn.Module):
    """
    Compute relative_logits.

    For ease, we 1) transpose height and width, 2) repeat the above steps and 3) transpose to eventually
    put the logits in their right positions.
    """

    def __init__(self, h, w, dim):
        super(RelativePosEmbedding, self).__init__()
        self.h = h
        self.w = w
        # Register the embeddings as learnable parameters; note that the second
        # positional argument of nn.init.normal_ is the mean, so pass std explicitly.
        self.rel_emb_w = nn.Parameter(torch.randn(2 * w - 1, dim))
        nn.init.normal_(self.rel_emb_w, std=dim ** -0.5)
        self.rel_emb_h = nn.Parameter(torch.randn(2 * h - 1, dim))
        nn.init.normal_(self.rel_emb_h, std=dim ** -0.5)

    def forward(self, x):
        """
        :param x: B Nh Hd HW
        :return: B Nh HW HW
        """
        Nh = x.shape[1]
        # Relative logits in width dimension first.
        rel_logits_w = relative_logits_1d(
            rearrange(x, "b nh hd (h w) -> b (nh h) hd w", h=self.h, w=self.w), self.rel_emb_w
        )
        rel_logits_w = rearrange(rel_logits_w, "b (nh h) w0 w1 -> b nh h () w0 w1", nh=Nh)
        # Relative logits in height dimension next.
        rel_logits_h = relative_logits_1d(
            rearrange(x, "b nh hd (h w) -> b (nh w) hd h", h=self.h, w=self.w), self.rel_emb_h
        )
        rel_logits_h = rearrange(rel_logits_h, "b (nh w) h0 h1 -> b nh h0 h1 w ()", nh=Nh)
        return rearrange(rel_logits_h + rel_logits_w, "b nh h0 h1 w0 w1 -> b nh (h0 w0) (h1 w1)")

class AbsolutePosEmbedding(nn.Module):
    """
    Given query q of shape [batch heads tokens dim] we multiply
    q by all the flattened absolute differences between tokens.
    Learned embedding representations are shared across heads
    """

    def __init__(self, h, w, dim):
        super().__init__()
        scale = dim ** -0.5
        self.abs_pos_emb = nn.Parameter(torch.randn(h * w, dim) * scale)

    def forward(self, x):
        """
        :param x: B Nh Hd HW
        :return: B Nh HW HW
        """
        return torch.einsum("bndx, yd -> bhxy", x, self.abs_pos_emb)

class SelfAttention3D(nn.Module):
    def __init__(self, in_dim, key_dim, value_dim, nh, hw, pos_mode="relative"):
        super(SelfAttention3D, self).__init__()
        self.dkh = key_dim // nh
        self.dvh = value_dim // nh
        self.nh = nh
        self.key_dim = key_dim
        self.value_dim = value_dim
        self.kqv_proj = nn.Conv2d(in_dim, 2 * key_dim + value_dim, 1)
        self.out_proj = nn.Conv2d(value_dim, value_dim, 1)
        if pos_mode == "relative":
            self.position_embedding = RelativePosEmbedding(h=hw[0], w=hw[1], dim=self.dkh)
        elif pos_mode == "absolute":
            self.position_embedding = AbsolutePosEmbedding(h=hw[0], w=hw[1], dim=self.dkh)
        else:
            # No positional term is added to the logits in this case.
            self.position_embedding = None

    def split_heads_and_flatten(self, _x):
        return rearrange(_x, "b (nh hd) h w -> b nh hd (h w)", nh=self.nh)

    def forward(self, x):
        """
        :param x: B C H W
        """

        # Compute q, k, v
        k, q, v = self.kqv_proj(x).split([self.key_dim, self.key_dim, self.value_dim], dim=1)
        q = q * self.dkh ** -0.5  # scaled dot-product

        # After splitting, shape is [B, Nh, dkh or dvh, HW]
        q, k, v = map(self.split_heads_and_flatten, (q, k, v))

        # [B, Nh, HW, HW]
        logits = torch.einsum("bndx, bndy -> bnxy", q, k)
        if self.position_embedding is not None:
            logits = logits + self.position_embedding(q)
        weights = logits.softmax(-1)
        attn_out = torch.einsum("bnxy, bndy -> bndx", weights, v)
        attn_out = rearrange(attn_out, "b nd hd (h w) -> b (nd hd) h w", h=x.shape[2], w=x.shape[3])

        # Project the concatenated heads back to the value dimension.
        attn_out = self.out_proj(attn_out)
        return attn_out

class AugmentedConv2d(nn.Module):
    def __init__(self, in_dim, out_dim, kernel_size, key_dim, value_dim, num_heads, hw, pos_mode):
        super(AugmentedConv2d, self).__init__()
        self.std_conv = nn.Conv2d(in_dim, out_dim - value_dim, kernel_size, padding=kernel_size // 2)
        self.attention = SelfAttention3D(
            in_dim, key_dim=key_dim, value_dim=value_dim, nh=num_heads, hw=hw, pos_mode=pos_mode
        )

    def forward(self, x):
        conv_out = self.std_conv(x)
        attn_out = self.attention(x)
        return torch.cat([conv_out, attn_out], dim=1)

if __name__ == "__main__":
    m = AugmentedConv2d(
        in_dim=4, out_dim=64, kernel_size=3, key_dim=32, value_dim=48, num_heads=2, hw=(10, 10), pos_mode="relative"
    )
    print(m(torch.randn(4, 4, 10, 10)).shape)

Some open questions

  • What exactly is the difference between permutation equivariance and translation equivariance?

Supplementary background

Self-attention has three inputs: the query Q, key K and value V. What do they actually mean? The following is excerpted from https://www.cnblogs.com/rosyYY/p/10115424.html:

  1. Q, K and V all contain embedded representations of the original data.
  2. Why is Q called "query"?
    1. Because each time one embedding is taken to "query" how well it matches every other embedding, which gives the attention weights.
  3. K and V form key-value pairs. The explanations found in various places are not very conclusive; the article 从Seq2seq到Attention模型到Self Attention(二) (https://zhuanlan.zhihu.com/p/47470866) mentions: "key and value originate from the paper Key-Value Memory Networks for Directly Reading Documents; in NLP, key and value usually point to the same word embedding vector." I will not go into more detail here.

Related links

  • Paper: https://arxiv.org/pdf/1904.09925.pdf
  • Code: https://GitHub.com/leaderj1001/Attention-Augmented-Conv2d
  • Analysis (Chinese): https://www.jiqizhixin.com/articles/2019-04-26-7
  • Multi-head attention: https://www.cnblogs.com/rosyYY/p/10115424.html
  • 从Seq2seq到Attention模型到Self Attention(二) – Zhihu: https://zhuanlan.zhihu.com/p/47470866
  • The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
    • Chinese translation: https://blog.csdn.net/qq_42208267/article/details/84967446
  • Self-attention mechanisms in natural language processing: https://www.cnblogs.com/robert-dlut/p/8638283.html
  • https://kexue.fm/archives/4765
