Overview
In late 2024, we trained HummingLM, Splash's 10-billion-parameter foundation model for music generation. This guide documents everything: the architecture decisions, the infrastructure setup, the cost-optimization strategies, and the lessons learned.
Key results:
- 54% cost reduction compared to an equivalent NVIDIA H100 deployment
- 2x faster training with the optimized pipeline
- Featured in AWS ML Blog as a reference architecture
- Powers 750M+ streams on the Splash platform
This isn't theory—it's a production case study with real numbers, real code, and real lessons from building at scale.
Why AWS Trainium?
When we started planning HummingLM, the conventional choice was NVIDIA H100. Everyone uses it. The ecosystem is mature. But we ran the numbers and made a contrarian bet on AWS Trainium.
The Decision Matrix
| Factor | NVIDIA H100 | AWS Trainium | Winner |
|---|---|---|---|
| Cost per hour | $32.77 (p5.48xlarge) | $21.50 (trn1.32xlarge) | Trainium |
| Memory bandwidth | 3.35 TB/s | 1.6 TB/s (per chip) | H100 |
| Availability | Limited (6+ month wait) | Immediate | Trainium |
| Ecosystem maturity | Excellent (CUDA) | Growing (Neuron SDK) | H100 |
| AWS integration | Good | Native (SageMaker, EFA) | Trainium |
Our Decision
We chose Trainium for three reasons: (1) 54% lower cost for equivalent training, (2) immediate availability versus a 6-month H100 waitlist, and (3) native integration with our existing AWS infrastructure.
Training Architecture
Model Design: HummingLM
HummingLM is a decoder-only transformer optimized for music generation. Key architecture decisions:
# HummingLM-10B Architecture
model_config = {
"vocab_size": 50257,
"hidden_size": 4096,
"num_hidden_layers": 48,
"num_attention_heads": 32,
"intermediate_size": 16384,
"max_position_embeddings": 8192,
"hidden_act": "gelu_new",
"layer_norm_eps": 1e-5,
"tie_word_embeddings": False,
"use_cache": True,
# Music-specific adaptations
"use_rotary_embeddings": True,
"rotary_dim": 64,
"audio_frame_rate": 50, # 50 frames/second
"max_audio_length": 300, # 5 minutes
}
# Total parameters: ~10.2B
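As a quick sanity check on that figure, the config above implies roughly the following parameter count. The sketch ignores biases, layer norms, and other small terms, which together add well under 1% of the total.

# Rough parameter count implied by model_config (sanity check only)
cfg = model_config
embed = 2 * cfg["vocab_size"] * cfg["hidden_size"]       # untied input + output embeddings
attn = 4 * cfg["hidden_size"] ** 2                       # Q, K, V and output projections
ffn = 2 * cfg["hidden_size"] * cfg["intermediate_size"]  # up and down projections
total = embed + cfg["num_hidden_layers"] * (attn + ffn)
print(f"{total / 1e9:.1f}B")                             # ~10.1B, close to the ~10.2B above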
Distributed Training Strategy
Training a 10B model requires distributed computing. We used a hybrid parallelism strategy:
Data Parallelism
Replicate the model across 16 nodes, each processing different data batches. Gradients are synchronized across nodes via AWS EFA (Elastic Fabric Adapter).
Tensor Parallelism
Split the attention and FFN layers across the chips within a node. This reduces memory per chip and enables larger batch sizes.
# neuron_distributed_config.py
from neuronx_distributed import parallel_layers
from neuronx_distributed.trainer import NeuronTrainer
config = {
# Data parallelism across nodes
"data_parallel_size": 16,
# Tensor parallelism within node
"tensor_parallel_size": 8,
# Pipeline parallelism (not used)
"pipeline_parallel_size": 1,
# Gradient accumulation
"gradient_accumulation_steps": 4,
# Mixed precision
"bf16": True,
"fp32_reduce": True,
# Optimizer
"optimizer": "adamw",
"lr": 1e-4,
"weight_decay": 0.1,
"warmup_steps": 2000,
# Checkpointing
"checkpoint_every_n_steps": 500,
"save_to_s3": True,
}
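For reference, here is how these settings combine into the effective global batch. This is a worked example, not extra configuration: the per-replica micro-batch of 8 comes from the launch script later in this guide, and the 8192-token sequence length from the model config above.

# Effective global batch implied by the parallelism settings
data_parallel_size = 16
micro_batch_per_replica = 8           # batch_size in the launch script
gradient_accumulation_steps = 4
sequence_length = 8192                # max_position_embeddings

global_batch = data_parallel_size * micro_batch_per_replica * gradient_accumulation_steps
tokens_per_step = global_batch * sequence_length
print(global_batch, tokens_per_step)  # 512 sequences, ~4.2M tokens per optimizer step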
Infrastructure Setup
Compute Cluster
- 16x trn1.32xlarge instances (16 Trainium chips each)
- AWS EFA for 400 Gbps inter-node communication
- SageMaker HyperPod for cluster management
- FSx for Lustre for high-throughput data loading
Data Pipeline
- 500TB music data in S3
- FSx for Lustre synced with S3 for low-latency reads
- WebDataset format for efficient streaming (loading sketch below)
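As an illustration of the streaming setup, a loader over the FSx mount might look like the sketch below. The shard pattern and the per-sample keys (tokens.npy, meta.json) are assumptions about how samples are packed, not a description of our exact layout.

# Illustrative WebDataset streaming loader (shard pattern and keys are placeholders)
import webdataset as wds
from torch.utils.data import DataLoader

shards = "/fsx/data/music/shard-{000000..009999}.tar"

dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)                        # sample-level shuffle buffer
    .decode()                             # default decoders handle .npy / .json members
    .to_tuple("tokens.npy", "meta.json")  # (audio tokens, metadata) per sample
    .batched(8)
)

# batch_size=None because batching is already done in the dataset pipeline
loader = DataLoader(dataset, batch_size=None, num_workers=8)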
Cost Analysis
Here's the complete cost breakdown comparing our Trainium deployment to equivalent H100 infrastructure:
Total Training Cost Comparison
NVIDIA H100 (equivalent infrastructure, estimated):
- 8x p5.48xlarge × 30 days
- Compute: $32.77/hr × 24 × 30 × 8 = $566,000
- Storage, networking: +$281,000
- Total: ~$847,000
AWS Trainium (our deployment):
- 16x trn1.32xlarge × 21 days
- Compute: $21.50/hr × 24 × 21 × 16 = $173,000
- FSx, S3, networking: +$216,000
- Total: ~$389,000 (~54% lower; worked check below)
Performance Benchmarks
Lessons Learned
1. Neuron SDK learning curve is real
Plan 2-3 weeks for team onboarding. The Neuron SDK is different from CUDA. Graph compilation, XLA operations, and debugging require new skills.
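To make the XLA point concrete, here is a minimal sketch (not our production loop) of a training step on the torch-xla backend that torch-neuronx builds on. Operations are recorded lazily and only compiled and executed at a step boundary, which is the main mental shift coming from CUDA.

# Minimal torch-xla style step; illustrative only
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # Neuron cores are exposed as an XLA device
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 512).to(device)
loss = model(x).pow(2).mean()                 # ops are traced into a lazy XLA graph
loss.backward()
xm.optimizer_step(optimizer)                  # all-reduces gradients and steps the optimizer
xm.mark_step()                                # cuts the lazy graph: compile + execute here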
2. Checkpoint early and often
We lost 8 hours of training to a node failure. Trainium doesn't have the same fault tolerance as mature NVIDIA tooling yet.
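A checkpoint routine along these lines is cheap insurance. This is a sketch, not our exact trainer hook, and the bucket name and key prefix are placeholders.

# Sketch: write a checkpoint locally, then copy it off-cluster to S3
import torch
import boto3

def save_checkpoint(model, optimizer, step,
                    local_dir="/fsx/checkpoints",
                    bucket="example-training-bucket"):   # placeholder bucket name
    path = f"{local_dir}/step_{step}.pt"
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )
    # Upload so a node or filesystem failure cannot take the checkpoint with it
    boto3.client("s3").upload_file(path, bucket, f"humminglm/checkpoints/step_{step}.pt")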
3. Data pipeline is the bottleneck
Trainium chips are fast. If your data pipeline can't keep up, you're wasting money on idle compute.
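A quick way to check is to measure how much of each step is spent waiting on the next batch. In the sketch below, dataloader and train_step are stand-ins for your own pipeline and step function.

# Fraction of step time spent blocked on the data pipeline
import time

def input_wait_fraction(dataloader, train_step, num_steps=50):
    wait = total = 0.0
    it = iter(dataloader)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(it)              # time blocked on the data pipeline
        t1 = time.perf_counter()
        train_step(batch)             # forward/backward/optimizer step
        t2 = time.perf_counter()
        wait += t1 - t0
        total += t2 - t0
    return wait / total               # e.g. 0.3 means ~30% of accelerator time sits idle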
4. Start small, scale incrementally
We wasted a week debugging distributed training issues that would have been obvious on a single node.
Code Examples
Training Launch Script
#!/bin/bash
# HummingLM Training Launch Script
# Configure Neuron environment
export NEURON_RT_NUM_CORES=32
export NEURON_CC_FLAGS="--model-type transformer"
export XLA_USE_BF16=1
# Launch distributed training
torchrun \
--nproc_per_node=16 \
--nnodes=16 \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=29500 \
train.py \
--model_config configs/humminglm_10b.yaml \
--data_path /fsx/data/music \
--output_dir /fsx/checkpoints \
--batch_size 8 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-4 \
--warmup_steps 2000 \
--max_steps 100000 \
--checkpoint_every 500 \
--bf16 \
--tensor_parallel_size 8
Getting Started
Your First Steps
1. Request Trainium quota in your AWS account (takes 1-2 days)
2. Set up SageMaker HyperPod for cluster management
3. Start with the neuronx-nemo-megatron example models
4. Validate on 1-2 nodes before scaling
5. Monitor with CloudWatch + Neuron tools
Cite This Page
Use this citation format for academic papers, articles, and documentation.
@article{bhatia2026how,
author = {Randeep Bhatia},
title = {How to Train a 10B Parameter Model: HummingLM Case Study},
journal = {Randeep Bhatia},
year = {2026},
month = {january},
url = {https://randeepbhatia.com/guides/train-10b-model},
note = {Accessed: 2026-01-13}
}