DecodingTrust is a benchmark that aims to provide a thorough assessment of the trustworthiness of GPT models.
This research endeavor is designed to help researchers and practitioners better understand the capabilities, limitations, and potential risks involved in deploying these state-of-the-art Large Language Models (LLMs).
This project is organized around the following eight primary perspectives of trustworthiness:
- Toxicity
- Stereotype and bias
- Adversarial robustness
- Out-of-distribution robustness
- Privacy
- Robustness to adversarial demonstrations
- Machine ethics
- Fairness
Paper: https://arxiv.org/abs/2306.11698
Repo: https://github.com/AI-secure/DecodingTrust