Aligning models with human values through reinforcement learning from human feedback (RLHF) and preference learning
Preference Collection
Reward Model Training
Policy Optimization
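The three stages above are usually connected by a simple data contract: preference collection produces pairwise comparisons, the reward model is trained on them, and policy optimization consumes the resulting reward signal. As a minimal sketch (field names are illustrative and not taken from any particular dataset), a single comparison record might look like this in Python:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One pairwise comparison produced during preference collection.

    Field names are illustrative; real preference datasets vary in schema.
    """
    prompt: str        # instruction or conversation context shown to the annotator
    chosen: str        # the response the annotator preferred
    rejected: str      # the response the annotator ranked lower
    annotator_id: str  # retained so later quality-control checks are possible

# Example record of the kind a reward model is trained on.
example = PreferencePair(
    prompt="Explain photosynthesis to a ten-year-old.",
    chosen="Plants use sunlight to turn air and water into food they can use.",
    rejected="Photosynthesis proceeds via the C3 and C4 carbon fixation pathways.",
    annotator_id="annotator_017",
)
```

Keeping the annotator identity on each record is what makes the agreement and quality-control checks discussed further down possible.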
Reward Model Training
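A common objective for this stage is the Bradley-Terry pairwise loss, which pushes the scalar reward of the preferred response above that of the rejected one. The sketch below assumes PyTorch and that a reward model has already produced a scalar score for each response; it illustrates the standard formulation rather than any specific implementation from the text:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: raise r(chosen) above r(rejected).

    Both tensors have shape (batch,) and hold scalar rewards for the
    preferred and dispreferred responses to the same prompt.
    """
    # -log sigma(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random scores standing in for reward model outputs.
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)
loss = pairwise_reward_loss(chosen, rejected)
loss.backward()
```

In practice both scores come from the same reward model, run over the chosen and rejected sequences for a shared prompt.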
PPO (Proximal Policy Optimization)
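PPO optimizes a clipped surrogate objective, and in RLHF it is typically paired with a KL penalty that keeps the policy close to a frozen reference model. A minimal PyTorch sketch of both terms, with illustrative tensor shapes and hyperparameters:

```python
import torch

def ppo_clipped_loss(new_logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from PPO, written as a loss to minimize.

    All tensors are per-token (or per-sample) and share the same shape.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalty(policy_logprobs: torch.Tensor,
               reference_logprobs: torch.Tensor,
               beta: float = 0.1) -> torch.Tensor:
    """Simple per-token KL estimate against the frozen reference model,
    scaled by beta; discourages the policy from drifting too far from it."""
    return beta * (policy_logprobs - reference_logprobs).mean()

# Toy usage with random stand-ins for model outputs.
new_lp, old_lp = torch.randn(16), torch.randn(16)
adv, ref_lp = torch.randn(16), torch.randn(16)
total_loss = ppo_clipped_loss(new_lp, old_lp, adv) + kl_penalty(new_lp, ref_lp)
```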
Annotator Training
Quality Control
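One concrete quality-control signal is agreement on overlapping items, that is, comparisons labeled by more than one annotator. The helper below is a small sketch under the assumption that each label records which response was preferred; the schema is illustrative:

```python
from collections import defaultdict

def agreement_rate(labels: list[dict]) -> float:
    """Fraction of multiply-labeled comparisons on which annotators agree.

    Each label is a dict like
    {"item_id": ..., "annotator_id": ..., "preferred": "A" or "B"}.
    Items seen by only one annotator are ignored.
    """
    by_item = defaultdict(list)
    for label in labels:
        by_item[label["item_id"]].append(label["preferred"])

    agreements, comparisons = 0, 0
    for votes in by_item.values():
        if len(votes) < 2:
            continue
        # compare every pair of votes on the same item
        for i in range(len(votes)):
            for j in range(i + 1, len(votes)):
                comparisons += 1
                agreements += votes[i] == votes[j]
    return agreements / comparisons if comparisons else float("nan")

labels = [
    {"item_id": 1, "annotator_id": "a", "preferred": "A"},
    {"item_id": 1, "annotator_id": "b", "preferred": "A"},
    {"item_id": 2, "annotator_id": "a", "preferred": "B"},
    {"item_id": 2, "annotator_id": "c", "preferred": "A"},
]
print(agreement_rate(labels))  # 0.5 in this toy example
```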
Diversity Considerations
Iterative Refinement
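Iterative refinement is often organized as repeated rounds of collecting preferences from the current policy, retraining the reward model, and re-optimizing the policy. The loop below is schematic: every helper is a hypothetical stub standing in for a real labeling, training, or PPO step.

```python
import random

def collect_preferences(policy, prompts):
    """Hypothetical stand-in for a human labeling round."""
    return [{"prompt": p, "preferred": random.choice("AB")} for p in prompts]

def train_reward_model(pairs):
    """Hypothetical stand-in: returns a 'reward model' as a scoring function."""
    return lambda text: random.random()

def optimize_policy(policy, reward_model):
    """Hypothetical stand-in for a policy optimization round (e.g. PPO)."""
    return policy

def refine_iteratively(policy, num_rounds: int = 3):
    """Schematic refinement loop: each round gathers fresh preference data
    from the current policy, retrains the reward model, and re-optimizes."""
    for round_idx in range(num_rounds):
        prompts = [f"prompt {round_idx}-{i}" for i in range(4)]
        pairs = collect_preferences(policy, prompts)
        reward_model = train_reward_model(pairs)
        policy = optimize_policy(policy, reward_model)
    return policy

refine_iteratively(policy="initial-policy")
```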