GPT-4 trained on YouTube transcripts

罗布斯 · 2024-04-07 · 2024-07-18

# GPT-4 trained on YouTube transcripts

OpenAI reportedly used transcriptions of more than a million hours of YouTube videos to train GPT-4, its most advanced
large language model at the time. The transcription was part of an effort to gather high-quality training data, which
is crucial for developing and improving AI models like GPT-4. The company developed its Whisper audio transcription
model to assist in this process, and used it to transcribe the YouTube content.

OpenAI reportedly regarded the use of YouTube videos as training data as legally questionable, but believed it to be
fair use. OpenAI president Greg Brockman was personally involved in collecting the videos used for this purpose. The
company's spokesperson, Lindsay Held, stated that OpenAI curates a unique dataset for each of its models to help the
models' understanding of the world, and that it draws on numerous sources, including publicly available data and
partnerships for non-public data.

Google, which owns YouTube, has robots.txt files and Terms of Service that prohibit unauthorized scraping or
downloading of YouTube content. Google spokesperson Matt Bryant said that the company takes technical and legal
measures to prevent such unauthorized use when it has a clear legal or policy basis to do so.
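
For readers unfamiliar with how a robots.txt file expresses crawling rules, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and invented for illustration; they are not YouTube's actual rules. Note that robots.txt is only a request to crawlers, so a check like this shows what a well-behaved crawler would do and says nothing about Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
# These are NOT YouTube's actual rules.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/data"):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(url, "is", verdict)
```

A crawler that respects robots.txt would run a check of this kind before each request and skip any URL that is disallowed.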

Training GPT-4 on YouTube transcripts was reportedly part of a broader strategy by AI companies to overcome the
challenge of finding enough diverse data to train their models effectively. This strategy also included using data
from other sources such as GitHub, chess move databases, and schoolwork content from Quizlet.