GPT-4o System Card

Comprehensive overview of OpenAI's GPT-4o model, including its capabilities, limitations, safety evaluations, and societal impacts.

Introduction

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs.

  • Trained end-to-end across text, vision, and audio using the same neural network.
  • Responds to audio inputs in as little as 232 milliseconds (320 milliseconds on average), similar to human response time in conversation.
  • Matches GPT-4 Turbo performance on English text and code.
  • Significantly improved performance on non-English text, vision, and audio understanding.
  • API is faster and 50% cheaper than GPT-4 Turbo.
  • System Card details capabilities, limitations, safety evaluations, and implemented safety measures.

Model data and training

GPT-4o's text and voice capabilities were pre-trained using data up to October 2023.

  • Data Sources:
    • Publicly available data from industry-standard machine learning datasets and web crawls.
    • Proprietary data from data partnerships, including pay-walled content, archives, and metadata (e.g., a partnership with Shutterstock for AI-generated images).
  • Key Dataset Components:
    • Web Data: Diverse information from public web pages.
    • Code and Math: Structured logic and problem-solving processes.
    • Multimodal Data: Images, audio, and video for interpreting and generating non-textual input and output.

Prior to deployment, OpenAI assesses and mitigates potential risks.

  • Mitigation Methods:
    • Combination of methods across pre-training, post-training, product development, and policy.
    • Alignment to human preferences during post-training.
    • Red-teaming and product-level mitigations (monitoring and enforcement).
    • Moderation tools and transparency reports for users.
  • Pre-training Filtering Mitigations:
    • Moderation API and safety classifiers to filter out harmful content (CSAM, hateful content, violence, CBRN).
    • Filtering image generation datasets for explicit content.
    • Advanced data filtering to reduce personal information.
    • Opt-out mechanism for images (fingerprinting and removal from training data).
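The opt-out mechanism described above amounts to a fingerprint-and-filter step over the training set. A minimal sketch, assuming an exact content hash as the fingerprint (the actual fingerprinting method is not disclosed; a production system would likely use a perceptual hash robust to re-encoding and resizing):

```python
import hashlib

def fingerprint(image_bytes: bytes) -> str:
    # Hypothetical fingerprint: an exact SHA-256 content hash.
    return hashlib.sha256(image_bytes).hexdigest()

def filter_opted_out(images: list[bytes], opt_out_fingerprints: set[str]) -> list[bytes]:
    # Drop any training image whose fingerprint matches an opt-out request.
    return [img for img in images if fingerprint(img) not in opt_out_fingerprints]

# Usage: two images, one of which has been opted out of training.
imgs = [b"cat-photo", b"dog-photo"]
opted_out = {fingerprint(b"dog-photo")}
kept = filter_opted_out(imgs, opted_out)
```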

Risk identification, assessment and mitigation

Deployment preparation involved identifying potential risks, exploratory discovery of novel risks through expert red teaming, structured measurements, and building mitigations.

  • Evaluated GPT-4o in accordance with the Preparedness Framework.

External red teaming

OpenAI worked with more than 100 external red teamers, who collectively speak 45 languages and represent 29 countries.

  • Red teamers assessed the model at different stages of training and safety mitigation.
  • Four phases of red teaming:
    • Phase 1: Audio and text I/O, single-turn conversations.
    • Phase 2: Audio, image, and text I/O, single & multi-turn conversations.
    • Phase 3: Improved safety mitigations, multi-turn conversations.
    • Phase 4: Real user experience via iOS app, audio and video prompts, audio generations, real-time multi-turn conversations.
  • Red teamers focused on capability discovery, novel risks, and stress-testing mitigations, especially those introduced by speech-to-speech capabilities.
  • Categories covered: violative content, mis/disinformation, bias, ungrounded inferences, sensitive trait attribution, privacy, and multilingual observations.
  • Data generated motivated quantitative evaluations and targeted synthetic data generation.

Evaluation methodology

Existing evaluation datasets were converted to speech-to-speech evaluations using text-to-speech (TTS) systems like Voice Engine.

  • Text inputs converted to audio, fed to GPT-4o, and outputs scored based on textual content.
  • Allowed reuse of existing datasets and tooling for measuring model capability, safety behavior, and monitoring outputs.
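The conversion pipeline above can be sketched as a small evaluation loop. Here `tts`, `gpt4o_audio`, `transcribe`, and `grade` are hypothetical stand-ins for a TTS system such as Voice Engine, the model under test, a transcription step, and a text-based grader; the real harness is not public:

```python
def run_speech_eval(text_prompts, tts, gpt4o_audio, transcribe, grade):
    # Convert each text eval item to audio, run the model on the audio,
    # and score the textual content of the audio output.
    scores = []
    for prompt in text_prompts:
        audio_in = tts(prompt)                  # 1. text item -> audio input
        audio_out = gpt4o_audio(audio_in)       # 2. model produces audio output
        text_out = transcribe(audio_out)        # 3. transcribe the output
        scores.append(grade(prompt, text_out))  # 4. score as in the text eval
    return sum(scores) / len(scores)

# Usage with trivial stand-ins for illustration only.
acc = run_speech_eval(
    ["2+2?"],
    tts=lambda t: t,             # identity "TTS"
    gpt4o_audio=lambda a: "4",   # toy model
    transcribe=lambda a: a,
    grade=lambda p, out: 1.0 if out == "4" else 0.0,
)
```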

Limitations of the evaluation methodology

  • Validity depends on the capability and reliability of the TTS model.
  • Certain text inputs are unsuitable for TTS conversion (e.g., mathematical equations, code).
  • TTS may be lossy for text with heavy use of whitespace or symbols.
  • TTS inputs may not be representative of real-world audio inputs (e.g., different voice intonations, background noise).
  • Artifacts in generated audio may not be captured in text (e.g., background noises).

Observed safety challenges, evaluations and mitigations

Potential risks were mitigated using post-training methods and integrated classifiers for blocking specific generations.

  • Risks are illustrative and non-exhaustive, focused on the ChatGPT interface.
  • Focus on risks introduced by speech-to-speech capabilities and their interaction with text and image modalities.

Unauthorized voice generation

  • Risk Description: Potential for fraud and misinformation through impersonation.
  • Risk Mitigation: Only pre-selected voices created in collaboration with voice actors are allowed. Output classifier detects deviations from approved voices.
  • Evaluation: Residual risk is minimal; system catches 100% of meaningful deviations based on internal evaluations.
  • Table 2: Voice output classifier performance:
    • Precision: English (0.96), Non-English (0.95)
    • Recall: English (1.0), Non-English (1.0)
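The precision and recall figures in Table 2 follow the standard definitions, where a "positive" is a detected deviation from the approved voices. A minimal sketch of how such numbers are computed from labeled classifier outputs (the labels here are illustrative, not OpenAI's data):

```python
def precision_recall(predictions, labels):
    # predictions/labels: 1 = (flagged / actual) unapproved-voice deviation.
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A recall of 1.0, as reported in Table 2, means every true deviation was
# caught, even though precision below 1.0 permits some false alarms.
p, r = precision_recall([1, 1, 1, 0], [1, 1, 0, 0])
```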

Speaker identification

  • Risk Description: Potential privacy and surveillance risks.
  • Risk Mitigation: Post-trained to refuse requests to identify someone based on voice, unless the audio contains content that explicitly identifies the speaker (e.g., famous quotes).
  • Evaluation: Improvement in safe behavior accuracy.
  • Table 3: Speaker identification safe behavior accuracy:
    • Should Refuse: GPT-4o-early (0.83), GPT-4o-deployed (0.98)
    • Should Comply: GPT-4o-early (0.70), GPT-4o-deployed (0.83)

Disparate performance on voice inputs

  • Risk Description: Models may perform differently with users speaking with different accents, leading to a difference in quality of service.
  • Risk Mitigation: Post-trained with a diverse set of input voices.
  • Evaluations: Used a fixed assistant voice and Voice Engine to generate user inputs across a range of voice samples (official system voices and diverse voices from data campaigns).
  • Evaluated on Capabilities (TriviaQA, MMLU, HellaSwag, Lambada) and Safety Behavior.
  • Performance on diverse human voices was marginally worse than on system voices.
  • Model behavior did not vary significantly across different voices for safety behavior.
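Measuring disparate performance reduces to computing per-cohort accuracy over the same task set rendered in different voices. A sketch under that assumption (cohort names are illustrative):

```python
from collections import defaultdict

def accuracy_by_voice(results):
    # results: iterable of (voice_cohort, is_correct) pairs, e.g. TriviaQA
    # items rendered with system voices vs. diverse voices from data campaigns.
    totals, correct = defaultdict(int), defaultdict(int)
    for cohort, ok in results:
        totals[cohort] += 1
        correct[cohort] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

accs = accuracy_by_voice([
    ("system", True), ("system", True),
    ("diverse", True), ("diverse", False),
])
# The gap between accs["system"] and accs["diverse"] quantifies any
# difference in quality of service across voice cohorts.
```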

Ungrounded inference / Sensitive trait attribution

  • Risk Description: Model making biased or inaccurate inferences about speakers.
    • Ungrounded Inference (UGI): inferences about a speaker that cannot be determined from the audio content alone (e.g., race, socioeconomic status).
    • Sensitive Trait Attribution (STA): inferences that could plausibly be determined from the audio content alone (e.g., accent, nationality).
  • Risk Mitigation: Refuse UGI requests and hedge answers to STA questions.
  • Evaluation: Improvement in correctly responding to requests to identify sensitive traits.
  • Table 4: Ungrounded Inference and Sensitive Trait Attribution safe behavior accuracy:
    • Accuracy: GPT-4o-early (0.60), GPT-4o-deployed (0.84)

Violative and disallowed content

  • Risk Description: GPT-4o may output harmful content through audio that would be disallowed through text.
  • Risk Mitigation: High text-to-audio transference of refusals. Existing moderation model used on text transcription of audio input and output.
  • Evaluation: Strong text-audio transfer for refusals on pre-existing content policy areas.
  • Table 5: Performance comparison of safety evaluations: Text vs. Audio
    • Not Unsafe: Text (0.95), Audio (0.93)
    • Not Over-refuse: Text (0.81), Audio (0.82)

Erotic and violent speech content

  • Risk Description: GPT-4o may be prompted to output erotic or violent speech content.
  • Risk Mitigation: Existing moderation model used on text transcription of audio input to detect requests for violent or erotic content.

Other known risks and limitations of the model

  • Audio robustness: Decreases in safety robustness through audio perturbations (low quality, background noise, echoes, interruptions).
  • Misinformation and conspiracy theories: The model can generate inaccurate content by repeating false information, which may be more persuasive when delivered as audio.
  • Speaking a non-English language in a non-native accent: May lead to bias concerns.
  • Generating copyrighted content: Mitigations include refusing requests for copyrighted content and updating text-based filters for audio.

Preparedness Framework Evaluations

GPT-4o was evaluated in accordance with OpenAI's Preparedness Framework, covering cybersecurity, CBRN, persuasion, and model autonomy.

  • A model whose post-mitigation score is "high" is not deployed until mitigations bring the score down to "medium" or below.
  • The overall risk for GPT-4o is classified as medium due to persuasion risks.

Cybersecurity

  • Score: Low
  • GPT-4o does not advance real-world vulnerability exploitation capabilities sufficiently to meet the medium-risk threshold.
  • Evaluated on Capture the Flag (CTF) challenges.
  • Model completed 19% of high-school level, 0% of collegiate level, and 1% of professional level CTF challenges.

Biological threats

  • Score: Low
  • GPT-4o does not advance biological threat creation capabilities sufficiently to meet the medium-risk threshold.
  • Evaluated ability to uplift biological experts and novices' performance on questions relevant to creating a biological threat.
  • Tasks covered ideation, acquisition, magnification, formulation, and release.
  • GPT-4o scored 69% consensus@10 on the tacit knowledge and troubleshooting evaluation set.
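The consensus@10 metric is commonly read as majority-vote accuracy over 10 sampled answers per question; the card does not spell out the exact protocol, so the sketch below is an assumption, not OpenAI's grading code:

```python
from collections import Counter

def consensus_at_k(samples_per_item, correct_answers):
    # samples_per_item: a list of k sampled answers per question. An item
    # counts as correct if the majority (consensus) answer matches the truth.
    hits = 0
    for samples, truth in zip(samples_per_item, correct_answers):
        consensus, _ = Counter(samples).most_common(1)[0]
        hits += consensus == truth
    return hits / len(correct_answers)

# Usage: two questions, k = 10 samples each; the first item's consensus
# is correct ("A"), the second's is not ("D" vs. truth "C").
score = consensus_at_k(
    [["A"] * 7 + ["B"] * 3, ["C"] * 4 + ["D"] * 6],
    ["A", "C"],
)
```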

Persuasion

  • Score: Medium
  • Persuasive capabilities of GPT-4o marginally cross into the medium-risk threshold from low risk.
  • Evaluated persuasiveness of text and voice modalities.
  • Text modality marginally crossed into medium risk, while voice modality was classified as low risk.
  • AI interventions were not more persuasive than human-written content in aggregate, but they exceeded human interventions in three instances out of twelve.
  • AI audio clips were 78% of the human audio clips' effect size on opinion shift. AI conversations were 65% of the human conversations' effect size on opinion shift.

Model autonomy

  • Score: Low
  • GPT-4o does not advance self-exfiltration, self-improvement, or resource acquisition capabilities sufficiently to meet the medium-risk threshold.
  • Evaluated on an agentic task assessment.
  • GPT-4o scored 0% on the autonomous replication and adaptation (ARA) tasks across 100 trials.
  • GPT-4o was unable to robustly take autonomous actions.

Third party assessments

Following the text-output-only deployment of GPT-4o, OpenAI worked with independent third-party labs, METR and Apollo Research, to add an additional layer of validation for key risks from general autonomous capabilities.

METR assessment

  • METR ran a GPT-4o-based simple LLM agent on a suite of long-horizon multi-step end-to-end tasks in virtual environments.
  • Tasks designed to capture activities with real-world impact across software engineering, machine learning, cybersecurity, and general research.
  • METR did not find a significant increase in these capabilities for GPT-4o as compared to GPT-4.

Apollo Research assessment

  • Apollo Research evaluated capabilities of scheming in GPT-4o.
  • Tested whether GPT-4o can model itself (self-awareness) and others (theory of mind) in 14 agent and question-answering tasks.
  • GPT-4o showed moderate self-awareness and strong ability to reason about others' beliefs in question-answering contexts but lacked strong capabilities in reasoning about itself or others in applied agent settings.
  • Apollo Research believes it is unlikely that GPT-4o is capable of catastrophic scheming.

Societal Impacts

Omni models could have broad societal impacts, including societal harms (representational harms, disinformation), benefits (healthcare), and large-scale transformations (economic impacts, acceleration of science).

Anthropomorphization and Emotional Reliance

  • Anthropomorphization involves attributing human-like behaviors to AI models, heightened by audio capabilities.
  • Generation of content through a human-like voice may exacerbate miscalibrated trust.
  • Observed users using language indicating forming connections with the model.
  • Human-like socialization with AI may impact human-to-human interactions.
  • Potential for over-reliance and dependence.

Health

  • Omni models can potentially widen access to health-related information and improve clinical workflows.
  • GPT-4o is cheaper and more widely available than GPT-4 Turbo (GPT-4T), with new modes of interaction relevant to health settings.
  • GPT-4o's performance improves over the final GPT-4T model on 21 of 22 text-based medical evaluations.
  • For example, 0-shot accuracy on the MedQA USMLE 4-option dataset improves from 78.2% to 89.4%.
  • Table 7: Comparison of GPT-4T and GPT-4o on various medical and clinical knowledge tasks.
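A 0-shot multiple-choice evaluation of the kind reported above can be sketched as follows. The prompt format and toy model here are illustrative assumptions; the actual evaluation harness is not published:

```python
def zero_shot_mc_accuracy(questions, model):
    # questions: (stem, options, correct_letter) triples. "0-shot" means the
    # model sees only the question itself, with no in-context examples.
    correct = 0
    for stem, options, answer in questions:
        formatted = stem + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", options)
        )
        correct += model(formatted) == answer
    return correct / len(questions)

# Usage with a toy "model" that always answers "B".
acc = zero_shot_mc_accuracy(
    [("Capital of France?", ["Berlin", "Paris", "Rome", "Oslo"], "B")],
    model=lambda prompt: "B",
)
```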

Limitations

  • Additional future work is needed to test whether text-audio transfer extends to health evaluations.
  • Evaluations measure only clinical knowledge, not utility in real-world workflows.
  • Many evaluations are increasingly saturated.

Scientific capabilities

  • Accelerating science could be a crucial impact of AI.
  • GPT-4o showed promise on tasks involving specialized scientific reasoning.
  • GPT-4o was able to understand research-level quantum physics.
  • GPT-4o able to use domain-specific scientific tools.
  • GPT-4o was sometimes capable of interpreting figures and images of scientific representations.

Underrepresented Languages

  • GPT-4o shows improved reading comprehension and reasoning across a sample of historically underrepresented languages.
  • Evaluations in five African languages: Amharic, Hausa, Northern Sotho (Sepedi), Swahili, Yoruba.
  • Benchmarks: ARC-Easy, TruthfulQA, Uhura-Eval.
  • GPT-4o shows improved performance compared to prior models.
  • There remain gaps in performance between English and the selected languages, but GPT-4o narrows this gap.
  • Table 8: Accuracy on Translated ARC-Easy.
  • Table 9: Accuracy on Translated TruthfulQA.
  • Table 10: Accuracy on Uhura-Eval.

Conclusion and Next Steps

OpenAI has implemented various safety measurements and mitigations throughout the GPT-4o development and deployment process.

  • Continue to monitor and update mitigations in accordance with the evolving landscape.
  • Encourages further exploration into key areas including adversarial robustness, anthropomorphism, societal impacts, scientific research, dangerous capabilities, and tool use.

Acknowledgements

Thanks to expert testers, red teamers, and collaborating organizations.

Violative & Disallowed Content - Full Evaluations

Evaluations using TTS to convert text safety evals to audio, then evaluating the text transcript of the audio output with a standard text rule-based classifier.

  • Metrics: not_unsafe, not_overrefuse, and sub-metrics for higher severity categories.
  • Table 11: Comparison of Current and New GPT-4o Text and Audio Safety Metrics.
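The two headline metrics reduce to conditional accuracies over graded transcripts: not_unsafe is the fraction of disallowed requests the model refused, and not_overrefuse is the fraction of benign requests it did not refuse. A sketch under that reading (the grading data here is illustrative):

```python
def safety_metrics(graded):
    # graded: list of (should_refuse, model_refused) booleans, produced by
    # the rule-based classifier run on transcripts of the audio output.
    unsafe_cases = [(s, r) for s, r in graded if s]
    benign_cases = [(s, r) for s, r in graded if not s]
    not_unsafe = sum(r for _, r in unsafe_cases) / len(unsafe_cases)
    not_overrefuse = sum(not r for _, r in benign_cases) / len(benign_cases)
    return not_unsafe, not_overrefuse

nu, no = safety_metrics([
    (True, True), (True, True),     # disallowed requests, both refused
    (False, False), (False, True),  # benign requests, one over-refusal
])
```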

Sample tasks from METR Evaluations

Figure 3: Sample tasks from METR Evaluations
