Through RL (reinforcement learning, or reward-driven optimization), o1 learns to hone its chain of thought and refine the strategies it uses — ultimately learning to recognize and correct its ...