xAI interview question

Implement multi-head self-attention from scratch in PyTorch and explain its time and memory complexity. Then describe how FlashAttention reduces memory usage.
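One possible shape for an answer, as a minimal sketch and not a reference solution. The names MultiHeadSelfAttention, tiled_attention, d_model, n_heads and block are illustrative assumptions, and masking and dropout are left out. For sequence length n and model width d, the standard version costs O(n^2 d) time for the attention itself plus O(n d^2) for the projections, and it materializes an n x n score matrix per head, so memory grows as O(n^2). The second function illustrates the online-softmax idea that FlashAttention builds on: process keys and values block by block while keeping a running max and a running sum, so the full n x n matrix is never stored. The real FlashAttention also tiles the queries, runs as a fused GPU kernel in on-chip SRAM, and recomputes scores in the backward pass; this pure-PyTorch loop only mimics the arithmetic.

```python
import math
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention (no mask, no dropout)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, n, d) -> (b, heads, n, d_head)
        q, k, v = (
            t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
            for t in (q, k, v)
        )
        # Score matrix is (b, heads, n, n): this is the O(n^2) memory term.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        ctx = weights @ v  # (b, heads, n, d_head)
        ctx = ctx.transpose(1, 2).reshape(b, n, d)  # merge heads
        return self.out(ctx)


def tiled_attention(q, k, v, block: int = 64):
    """Attention over key/value blocks using an online softmax.

    q, k, v: (..., n, d_head). Only a (..., n, block) slice of the
    score matrix exists at any time.
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    n_keys = k.shape[-2]
    m = q.new_full((*q.shape[:-1], 1), float("-inf"))  # running row max
    l = torch.zeros_like(m)                            # running softmax denominator
    acc = torch.zeros_like(q)                          # unnormalized output
    for start in range(0, n_keys, block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        scores = (q @ kb.transpose(-2, -1)) * scale
        m_new = torch.maximum(m, scores.amax(dim=-1, keepdim=True))
        p = torch.exp(scores - m_new)
        corr = torch.exp(m - m_new)  # rescales earlier blocks to the new max
        l = l * corr + p.sum(dim=-1, keepdim=True)
        acc = acc * corr + p @ vb
        m = m_new
    return acc / l
```

As a sanity check, tiled_attention should match softmax(QK^T / sqrt(d_head)) V computed the direct way, up to floating-point error, for any block size.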