Customized Testing and Evaluation for Your AI Systems

AI Governance with AI: an efficient, cost-effective AI testing sandbox for safer models, agents, and embodied intelligence.

Developing AI Safety through Iterative Problem Solving

The rapid evolution of AI presents continuous challenges. In this dynamic landscape, we adapt, we innovate, and we stay ahead.

March, 2023

The GPT-4 technical report was released, relying on translated data for evaluation and revealing a lack of adequate test data for assessing the capabilities of LLMs. How do we effectively evaluate LLMs in non-English languages?

CMMLU: one of the most authoritative benchmarks for Chinese LLM capabilities.

June, 2023

Objective (multiple-choice) questions have become mainstream in model testing. However, how can such testing be scaled for LLM evaluation?

LM-Evaluation-Harness: one of the most impactful open-source benchmarks for LLMs.

July, 2023

The development of LLMs posed new challenges. Many capabilities go beyond the scope of objective questions, and one of the most important is the safety of generated content. How do we test the safety of large models?

Do-Not-Answer: impactful research featured in the Stanford AI Index Report.

August, 2023

With a clearer landscape of LLM testing emerging (using LLMs to test LLMs has become mainstream), the extremely high cost remains a significant problem. How can we design affordable test solutions for LLMs?

Alternative solution: expert small models that match the performance of LLM evaluators at 200 times lower cost.

October, 2023

Most existing evaluations and tests rely heavily on human annotation, yet some tasks are challenging even for humans. A crucial aspect of content safety is factuality. How can we verify factuality that is not obvious even to humans?

Loki: an open-source agent solution for automating factuality verification.

December, 2023

The concept of agents has emerged, posing new challenges for AI safety. These go beyond content safety: agentic systems interact with more extensive and sensitive information and actions. How do we evaluate agents and RAG systems?

Specialized Evaluator: testing intermediate results with 20% higher precision at 80% lower cost.

March, 2024

How do we address the issue of data contamination and conduct dynamic testing?

Break down the scope of testing and define knowledge points.

Change is constant.
With us, you are always ahead of the curve.

The LibrAI Team

LibrAI is a team of AI PhDs and engineers with extensive research backgrounds, bringing together deep expertise, hands-on experience, and strong collaboration in cutting-edge AI research, supported by the world-class resources of MBZUAI.

Dr. Xudong Han (CEO)

Dr. Emad Alghamdi (Regional CEO)

Yilin Geng (COO)

Dr. Haonan Li (CTO)

Hao Wang (Head of Engineering)

Rong Zheng (Senior Engineer)

Prof. Timothy Baldwin (Advisor)

Dr. Zenan Zhai (AI Researcher)

Dr. Yuxia Wang (AI Researcher)

Keep up to date with LibrAI

Get the latest updates, research insights, and industry trends in AI safety and benchmarking delivered to your inbox. Stay informed with LibrAI.