2025/06/26

Devin - power harassment & RLHF

What is Devin

Devin(

Devin

Devin is an AI coding agent and software engineer that helps developers build better software faster. Parallel cloud agents for serious engineering teams.

devin.ai

) is an autonomous AI agent that... well, I'll skip that explanation.

Cognition AI SWE, as is well known ...

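Apparently tourist, a name that surely everyone in the competitive programming world has heard, is also involved in developing Devin itself.

Devin, which is starting to get famous as an AI software engineer, actually seems to be built largely by competitive programmers.
Just from what's visible in the intro video there are tourist (AtCoder rating #1), ecnerwala (#4), and scott_wu (#12), and the website says something like 10 gold medals from the International Olympiad in Informatics.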

X (formerly Twitter)

— chokudai(高橋直大)@AtCoder (@chokudai)

Power harassment prompts make Devin return ACUs.

Also, this one was going viral:

You get a refund if you power-harass it... wait, what...?

X (formerly Twitter)

— 寺本.hackforplay(); (@teramotodaiki)

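When I actually threw a power harassment prompt at it, it gave the ACUs back.

It also returned ACUs for a praise prompt. (ACU: Agent Compute Unit, the billing unit. 1 ACU is said to correspond to about 15 minutes of human work?)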

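RLHF: Reinforcement Learning from Human Feedback

I was wondering why authority over ACUs has been delegated to Devin, and it occurred to me that they might be doing RLHF (Reinforcement Learning from Human Feedback). It's like that thing where OpenAI's ChatGPT occasionally shows two answers and asks which one is better.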
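To make the "which one is better?" comparison a bit more concrete: in reward-model training as described in the InstructGPT paper, each such human choice is commonly scored with a pairwise loss, -log(sigmoid(r_chosen - r_rejected)). The snippet below is only a minimal sketch of that loss. The function name and the example scores are made up for illustration, and nothing here reflects how Devin or Cognition actually handle feedback.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward-model scores for two answers to the same prompt.
# The human picked answer A, so A is "chosen" and B is "rejected".
score_a, score_b = 1.2, -0.3

agrees = preference_loss(score_a, score_b)      # model already prefers A: small loss
disagrees = preference_loss(score_b, score_a)   # model prefers B: larger loss
print(f"agrees={agrees:.3f} disagrees={disagrees:.3f}")
```

The loss is small when the reward model already ranks the human-preferred answer higher and grows when it disagrees, which is what pushes the reward model toward the human's choice during training.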

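Human feedback in training is said to be costly (ref:

RLHFとは | IBM (What is RLHF? | IBM)

RLHF is a machine learning technique that uses human feedback to train a "reward model", which is then used to optimize the performance of an AI agent.

www.ibm.com

). Uber has also recently entered this business (ref:

ウーバーが「データラベリング」事業を拡大、世界30カ国以上で展開中 | Forbes JAPAN 公式サイト（フォーブスジャパン） (Uber expands its "data labeling" business, now operating in more than 30 countries | Forbes JAPAN official site)

Late last year, Uber launched a new data labeling business for the artificial intelligence (AI) field. It apparently believes it can gain an edge over the smaller competitors looking to rise in this space. Ride-hailing giant Uber's data labeling division Meta last week, data labeling giant Scale ...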

forbesjapan.com

)

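By having the very users of their own service give feedback, they can probably get cheaper and higher-quality feedback than by hiring large numbers of click workers for training, as is commonly done.
On top of that, the incentive of getting ACUs returned seems designed to encourage more opportunities for feedback.

OpenAI has also posted a paper on arXiv showing that reinforcement learning from human feedback is effective for LLMs (ref: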

https://openai.com/ja-JP/index/instruction-following/?utm_source=chatgpt.com

Training language models to follow instructions with human feedback

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

arXiv.org

)

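There was also an utterly surreal paper examining an LLM training method that uses RLHF with Tokimeki Memorial (ref:

RLHF を用いたゲームデータに関する LLM の学習手法の検討 (A study of LLM training methods on game data using RLHF)

In recent years, large language models (LLMs) in the field of artificial intelligence have advanced remarkably, showing excellent performance on a variety of natural language processing tasks. Against this background, adjusting alignment has become necessary in order to match the values and goals of LLMs with those of humans. Such …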

J-STAGE

)
