
Self-attention and cross-attention


Reference links:

- Why does self-attention compute Q, K, V through linear projections, and what is the principle or intuition behind it? (Zhihu): https://www.zhihu.com/question/592626839/answer/2965200007
- Cross-Attention in Transformer Architecture — merging two embedding sequences regardless of modality, e.g., image with text in the Stable Diffusion U-Net: https://vaclavkosar.com/ml/cross-attention-in-transformer-architecture

In a nutshell: cross-attention attends between positions of two different sequences, while self-attention attends between positions within a single sequence. The only difference lies in where Q and K/V come from: in self-attention the Q(uery), K(ey), and V(alue) are all derived from the same sequence, whereas in cross-attention Q comes from one sequence and K/V from another, usually of a different modality.

In other words, cross-attention mixes in two different sequences.
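A minimal sketch of both cases (the module name and dimensions are made up for illustration; single head, no output projection or masking): passing no `context` gives self-attention, passing a second sequence gives cross-attention.

```python
import torch
import torch.nn as nn
from typing import Optional


class SimpleAttention(nn.Module):
    """Single-head attention; self- or cross-attention depending on `context`."""

    def __init__(self, query_dim: int, context_dim: Optional[int] = None):
        super().__init__()
        context_dim = context_dim or query_dim       # default: K/V projected from the same space as Q
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        self.to_k = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v = nn.Linear(context_dim, query_dim, bias=False)

    def forward(self, x: torch.Tensor, context: Optional[torch.Tensor] = None) -> torch.Tensor:
        context = x if context is None else context  # no context given -> self-attention
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N_q, N_kv)
        return scores.softmax(dim=-1) @ v                       # (B, N_q, query_dim)


# Self-attention: Q, K, V all come from the same sequence.
x = torch.randn(2, 16, 64)
print(SimpleAttention(query_dim=64)(x).shape)                   # torch.Size([2, 16, 64])

# Cross-attention: Q from x, K/V from a second sequence (e.g. text embeddings).
text = torch.randn(2, 77, 128)
print(SimpleAttention(query_dim=64, context_dim=128)(x, context=text).shape)  # torch.Size([2, 16, 64])
```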

For example, the `CrossAttention` layer from the diffusers library (its docstring plus the core of its attention processor); note that when `encoder_hidden_states` is `None` the layer falls back to self-attention:

```python
class CrossAttention(nn.Module):
    r"""
    A cross attention layer.

    Parameters:
        query_dim (`int`): The number of channels in the query.
        cross_attention_dim (`int`, *optional*):
            The number of channels in the encoder_hidden_states. If not given, defaults to `query_dim`.
        heads (`int`, *optional*, defaults to 8): The number of heads to use for multi-head attention.
        dim_head (`int`, *optional*, defaults to 64): The number of channels in each head.
        dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
        bias (`bool`, *optional*, defaults to False):
            Set to `True` for the query, key, and value linear layers to contain a bias parameter.
    """
```

```python
# Core of the attention computation (from the layer's processor):
query = attn.to_q(hidden_states)
query = attn.head_to_batch_dim(query)

# K and V are projected from encoder_hidden_states (e.g. text embeddings);
# if it is None, the layer degenerates to plain self-attention over hidden_states.
encoder_hidden_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
key = attn.to_k(encoder_hidden_states)
value = attn.to_v(encoder_hidden_states)
key = attn.head_to_batch_dim(key)
value = attn.head_to_batch_dim(value)

attention_probs = attn.get_attention_scores(query, key, attention_mask)
hidden_states = torch.bmm(attention_probs, value)
```
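For orientation, a hedged usage sketch: the concrete dimensions and the forward call below are assumptions based on the docstring and processor arguments above, not tied to a particular diffusers version. It shows how such a layer is typically driven inside the Stable Diffusion U-Net, where image latent tokens supply Q and CLIP text embeddings supply K/V.

```python
# Hypothetical instantiation; parameter names follow the docstring above,
# dimensions roughly match the first U-Net level of Stable Diffusion v1.
attn = CrossAttention(query_dim=320, cross_attention_dim=768, heads=8, dim_head=40)

latents = torch.randn(1, 64 * 64, 320)   # flattened image latent tokens -> queries
text = torch.randn(1, 77, 768)           # CLIP text embeddings -> keys/values

# Assumed forward signature mirroring the processor arguments shown above.
out = attn(latents, encoder_hidden_states=text)   # -> (1, 64*64, 320)
```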

Applications:

The mechanism is already described in the original Transformer (the decoder's encoder-decoder attention), but there both sequences are text, i.e. the same modality, and it was not yet called cross-attention.

In Stable Diffusion: the U-Net's cross-attention layers inject the text condition, with Q coming from the image latent features and K/V from the CLIP text embeddings (see the usage sketch above).

Perceiver IO: a small set of learned latent vectors reads a potentially very long input array via cross-attention (Q from the latents, K/V from the inputs), so the subsequent self-attention stack only ever operates on the fixed-size latents; see the sketch below.
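A rough, self-contained sketch of the Perceiver-IO-style encoder pattern (the dimensions, names, and the use of PyTorch's built-in `nn.MultiheadAttention` are illustrative choices, not the original implementation): a small learned latent array supplies the queries, while the long input array supplies the keys and values.

```python
import torch
import torch.nn as nn

input_dim, latent_dim = 512, 256           # hypothetical feature sizes
batch, num_inputs, num_latents = 2, 10_000, 64

# Learned latent array: a small, fixed-size set of query vectors.
latents = nn.Parameter(torch.randn(num_latents, latent_dim))

# Cross-attention block: Q from the latents, K/V from the (much longer) input array.
cross_attn = nn.MultiheadAttention(
    embed_dim=latent_dim, num_heads=8,
    kdim=input_dim, vdim=input_dim, batch_first=True,
)

inputs = torch.randn(batch, num_inputs, input_dim)     # e.g. flattened image/audio features
queries = latents.unsqueeze(0).expand(batch, -1, -1)   # broadcast latents over the batch

out, _ = cross_attn(query=queries, key=inputs, value=inputs)
print(out.shape)   # torch.Size([2, 64, 256]): a fixed-size latent summary of the long input
```

Because the output always has `num_latents` tokens, any self-attention layers stacked after this block no longer depend on the input length.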