OpenAI's Deep Research has greater fact-finding power than you, but it's still wrong half the time

OpenAI has developed generative artificial intelligence agents capable of accessing the web to seek answers to questions. While this technology shows promise, it is still a work in progress.

In a recent publication, OpenAI researchers discussed the success of their Deep Research technology in answering web questions compared to other models. Despite outperforming humans in tasks requiring extensive search, Deep Research still fails nearly half the time.

The BrowseComp test, created by Jason Wei and team, challenges AI agents to navigate the web for answers. These agents, with vast memory and limitless focus, are expected to surpass human capabilities in information retrieval.

The BrowseComp set of questions goes beyond simple facts, requiring agents to search for complex and deeply entangled information online. Human trainers developed questions that even OpenAI’s ChatGPT and early Deep Research models found impossible to answer.

Testing revealed that human search capabilities were lacking, with only 30% of questions answered after two hours of effort. Deep Research, particularly effective at niche and challenging questions, achieved a 51.5% accuracy rate.

However, Deep Research and models like GPT-4o showed calibration errors, giving overconfident yet incorrect answers. To address this, the researchers had Deep Research generate multiple candidate answers per question and then select the one it judged best, which improved its chances of landing on the correct answer.
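The multiple-answer strategy described here is essentially best-of-N sampling: draw several candidate answers and keep the one the model itself rates most confident. A minimal sketch, assuming a hypothetical `sample_answer` callable that returns `(answer, confidence)` pairs; the actual Deep Research internals are not public:

```python
def best_of_n(sample_answer, n=64):
    """Sample n candidate answers and return the one with the
    highest self-reported confidence (a sketch of best-of-N
    selection, not OpenAI's actual implementation)."""
    # Each call yields an (answer, confidence) pair.
    candidates = [sample_answer() for _ in range(n)]
    # Keep the answer the model is most confident about.
    return max(candidates, key=lambda pair: pair[1])[0]
```

In practice the candidates would come from repeated, independent model runs; the key design choice is that the model's own confidence score, not an external judge, does the final selection.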

The success of Deep Research is seen to increase with additional computational resources during web searches, highlighting the importance of scaling performance with computing power.

As Deep Research’s accuracy improves with more compute power and parallel tasks, strategies that encourage self-evaluation and multi-answer evaluation are crucial for enhancing AI model performance. Without that evaluation stage, the model struggles much of the time.


Also: With AI models beating every benchmark, it's time for human evaluation

One big hole in BrowseComp, the authors acknowledge, is that it is limited to questions that are easy for computers to parse and whose answers are easy to verify. None of the 1,266 questions involved "long responses or the ability to resolve ambiguity in user queries."

As a result, they argue, BrowseComp tests the "core" functions of AI agents but is not comprehensive. "Models must be very proficient at locating hard-to-find pieces of information, but it is not guaranteed that this generalizes to all tasks that require browsing."

Deep Research is available to OpenAI's Plus and Pro subscribers.
