Unifying Modalities: Building Efficient Video Flows with PyTorch and Diffusion Transformers
As video generation shifts from specialized U-Net architectures to Diffusion Transformers (DiT), separating modalities is increasingly unnecessary. This session presents the Single-Stream paradigm, where text and video are embedded into a shared token space and processed as a single sequence by a standard PyTorch nn.TransformerEncoder, enabling joint attention across spatial, temporal, and semantic dimensions without modality-specific components.
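To make the single-stream idea concrete, here is a minimal sketch under assumed dimensions and projection layers (these names and sizes are illustrative, not the session's actual code): text and video tokens are projected into a shared width, concatenated into one sequence, and processed jointly by a standard nn.TransformerEncoder.

```python
import torch
import torch.nn as nn

d_model = 512
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=12,
)

# Hypothetical projections into the shared token space.
text_proj = nn.Linear(768, d_model)    # e.g. features from a text encoder
video_proj = nn.Linear(1024, d_model)  # e.g. patchified video latents

text_tokens = text_proj(torch.randn(2, 77, 768))       # (batch, text_len, d_model)
video_tokens = video_proj(torch.randn(2, 1024, 1024))  # (batch, T*H*W patches, d_model)

# Concatenate into a single sequence; modality-aware positional encodings
# (omitted here for brevity) would be added before this step.
tokens = torch.cat([text_tokens, video_tokens], dim=1)
out = encoder(tokens)  # joint attention across text and video in one pass
```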
We demonstrate Rectified Flow Matching in native PyTorch, replacing discrete noise schedules with straight-line probability paths parameterized by continuous flow time. The talk shows that multimodal DiT models reduce to conventional nn.TransformerEncoder layers applied to concatenated text and video tokens with modality-aware positional encodings.
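As a rough illustration of the rectified flow objective described above, the sketch below assumes a hypothetical `model(x_t, t, cond)` that predicts velocity: it samples a continuous flow time, interpolates along the straight-line path between data and noise, and regresses the constant velocity target.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, cond):
    """One training step. x0: clean video latents, cond: text tokens."""
    noise = torch.randn_like(x0)                   # endpoint x1 ~ N(0, I)
    t = torch.rand(x0.shape[0], device=x0.device)  # continuous flow time in [0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast over latent dims
    x_t = (1.0 - t_) * x0 + t_ * noise             # straight-line probability path
    v_target = noise - x0                          # constant velocity along the path
    v_pred = model(x_t, t, cond)                   # hypothetical velocity predictor
    return F.mse_loss(v_pred, v_target)
```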
Finally, we show how to optimize this architecture using torch.compile and FlashAttention (torch.nn.functional.scaled_dot_product_attention), producing a simpler, faster, and more maintainable training and inference pipeline. Attendees will leave with a deeper understanding of how recent advances in AI and PyTorch enable multimodal video generation.
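A minimal sketch of the optimization step, assuming a hand-rolled joint-attention block (the class and its shapes are illustrative): torch.nn.functional.scaled_dot_product_attention dispatches to FlashAttention-style kernels when the hardware and dtypes allow, and torch.compile can fuse the surrounding projections.

```python
import torch
import torch.nn.functional as F

class JointAttention(torch.nn.Module):
    """Attention over the concatenated text+video sequence using SDPA."""
    def __init__(self, d_model: int, nhead: int):
        super().__init__()
        self.nhead = nhead
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.proj = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for SDPA.
        q, k, v = (t.view(b, s, self.nhead, -1).transpose(1, 2) for t in (q, k, v))
        # Uses fused FlashAttention / memory-efficient kernels when available.
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))

# block = torch.compile(JointAttention(512, 8))  # compile to fuse surrounding ops
```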