Artificial Intelligence Value Alignment via Inverse Reinforcement Learning

Authors

  • João L. Duim, Escola de Matemática Aplicada da Fundação Getulio Vargas (EMAp/FGV)
  • Diego P. P. Mesquita, Escola de Matemática Aplicada da Fundação Getulio Vargas (EMAp/FGV)

Keywords:

Value alignment, Artificial Intelligence, Inverse Reinforcement Learning, AI Alignment, Deep Learning, RLHF

Abstract

Value alignment, one of the problems of Artificial Intelligence (AI) Alignment, concerns ensuring that AI systems adhere to human values. These intricate problems lack definitive solutions, and although significant research has been conducted to address them, substantial progress is still required. Current trajectories in AI development, particularly in Deep Learning and RLHF, pose significant existential risks due to potential misalignment between AI objectives and human values. In the present study, our objective is to address the AI value alignment problem through Inverse Reinforcement Learning (IRL). The central idea is to employ the IRL framework to learn a reward function from an expert whose behaviour is consistent with human values. The AI system then mimics the expert's actions, thereby aligning its behaviour with human values in a verifiable way.
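To make the IRL framing concrete, the following is a minimal, purely illustrative sketch (not the authors' implementation) of maximum-entropy-style IRL on a toy 5-state chain MDP: an "expert" who always moves right stands in for value-consistent behaviour, and gradient ascent recovers a reward under which that behaviour is optimal. The MDP, features, and hyperparameters are all assumptions made for this example.

```python
import numpy as np

# Toy maximum-entropy IRL sketch (illustrative assumptions throughout).
n_states, n_actions, horizon = 5, 2, 8

def step(s, a):
    """Deterministic chain dynamics: action 0 moves left, action 1 moves right."""
    return min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)

# Expert demonstration: always move right from state 0 (proxy for value-aligned behaviour).
expert_states, s = [], 0
for _ in range(horizon):
    expert_states.append(s)
    s = step(s, 1)
expert_counts = np.bincount(expert_states, minlength=n_states).astype(float)

theta = np.zeros(n_states)  # linear reward r(s) = theta[s] with one-hot state features
lr = 0.1
for _ in range(200):
    # Backward pass: soft finite-horizon values and the induced stochastic policy.
    V, policies = np.zeros(n_states), []
    for _ in range(horizon):
        Q = np.empty((n_states, n_actions))
        for st in range(n_states):
            for a in range(n_actions):
                Q[st, a] = theta[st] + V[step(st, a)]
        V = np.log(np.exp(Q).sum(axis=1))
        policies.append(np.exp(Q - V[:, None]))
    policies.reverse()  # policies[t] is now the policy at timestep t

    # Forward pass: expected state visitation counts under the current policy.
    D = np.zeros(n_states)
    D[0] = 1.0  # same start state as the expert
    model_counts = np.zeros(n_states)
    for t in range(horizon):
        model_counts += D
        D_next = np.zeros(n_states)
        for st in range(n_states):
            for a in range(n_actions):
                D_next[step(st, a)] += D[st] * policies[t][st, a]
        D = D_next

    # Max-ent log-likelihood gradient: expert minus model feature expectations.
    theta += lr * (expert_counts - model_counts)

print(np.argmax(theta))  # the recovered reward peaks at the rightmost state
```

An agent that then maximises the recovered reward reproduces the expert's rightward behaviour, which is the sense in which imitation of a value-consistent expert yields alignment that can be checked against demonstrations.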




Published

2025-01-20

Issue

Section

Abstracts