Scaling through distributed training

Machine learning data sets and models continue to increase in size, bringing accuracy improvements in computer vision and natural language processing tasks. As a result, data scientists will increasingly encounter situations where a model or its training workload cannot fit on a single GPU instance. Distributed training enables scale beyond the limitations of one GPU, either through data parallelisation or model parallelisation. In this session, learn the basic concepts behind distributed training and understand how Amazon SageMaker can help you implement distributed training for your models faster.
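As a rough illustration of the data-parallel approach mentioned above, the sketch below shows how a SageMaker training job might be configured with the SageMaker Python SDK, enabling the SageMaker distributed data parallel library so the same training script runs across multiple GPU instances. This is a minimal, hedged example: the script name, IAM role, S3 path, instance type, and framework versions are placeholder assumptions, not values from this session.

```python
# Minimal sketch (assumptions: a train.py script adapted for data parallelism,
# an existing IAM role, and an instance type supported by SageMaker's
# distributed data parallel library, e.g. ml.p4d.24xlarge).
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                       # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    framework_version="1.13",                     # example PyTorch version
    py_version="py39",
    instance_count=2,                             # scale out across 2 instances
    instance_type="ml.p4d.24xlarge",
    # Enable SageMaker's distributed data parallel library:
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# Launch the distributed training job; the channel name and S3 URI are illustrative.
estimator.fit({"training": "s3://example-bucket/train-data"})
```

Model parallelisation follows a similar pattern but partitions the model itself across devices instead of replicating it, which is typically configured through a different `distribution` setting and additional changes in the training script.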