FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing

Fengjian Xue1,†, Xuecheng Wu1,†, Heli Sun1,✉, Yunyun Shi1, Shi Chen1, Liangyu Fu2, Jinheng Xie3, Dingkang Yang4, Hao Wang1,✉, Junxiao Xue5, Liang He1
1Xi'an Jiaotong University   2Northwestern Polytechnical University   3National University of Singapore
4Fudan University   5Zhejiang Lab
† Equal contribution    ✉ Corresponding author
hiixfj@stu.xjtu.edu.cn
{hlsun, haowangx, lhe}@xjtu.edu.cn

Figure 1. Overview of FED-Bench. Representative samples from our benchmark covering seven basic emotions: happy, surprise, disgust, neutral, sad, fear, and angry.

Abstract

Facial expression image editing requires fine-grained control: identity and background must be strictly preserved while the expression is precisely manipulated. However, existing editing benchmarks primarily target general scenarios and lack high-quality facial images with corresponding editing instructions. Furthermore, current evaluation metrics exhibit systematic biases on this task, often favoring lazy editing (too little change) or overfit editing (exaggerated change). To bridge these gaps, we propose FED-Bench, a comprehensive benchmark that pairs rigorous test data with an accurate evaluation suite. First, we carefully construct a benchmark of 747 triplets through a cascaded and scalable pipeline, each comprising an original image, an editing instruction, and a ground-truth image for precise evaluation. Second, we introduce FED-Score, a cross-granularity evaluation protocol that disentangles assessment into three dimensions: Alignment for verifying instruction following, Fidelity for testing image quality and identity preservation, and Relative Expression Gain for quantifying the magnitude of expression changes, which mitigates the aforementioned evaluation biases. Third, we benchmark 18 image editing models, revealing that current approaches struggle to achieve high fidelity and accurate expression manipulation simultaneously, with fine-grained instruction following identified as the primary bottleneck. Finally, leveraging the scalability of our benchmark construction engine, we provide a 20k+ in-the-wild facial training set and demonstrate its effectiveness by fine-tuning a baseline model that achieves significant performance gains.

Benchmark Construction Pipeline

Figure 2. Overview of the FED-Bench construction pipeline.

To obtain high-quality Ground Truth images for refined evaluation, we design a rigorous multi-stage screening pipeline consisting of five stages:

1. Source Data Screening: We select SFEW 2.0 and DFEW as raw data sources for their diverse in-the-wild backgrounds and sufficient clarity, curating 747 high-quality source images.
2. Candidate Target Generation: We use a state-of-the-art editing model to generate candidate targets for the seven basic expressions: angry, disgust, fear, happy, neutral, sad, and surprise.
3. Coarse-grained Expression Recognition: We regroup the seven expressions into three polarities (Positive, Neutral, Negative) and apply a multi-model voting mechanism over a 5-MLLM ensemble, raising recognition accuracy to 84.87%.
4. Fidelity Ranking: We evaluate identity preservation (via ArcFace cosine similarity) and background consistency (via RMSE), retaining only the top two candidates.
5. Human Verification: Three human evaluators independently inspect the candidate pairs, and the final Ground Truth is determined by majority voting.
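
The coarse-grained voting in stage 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the expression-to-polarity mapping (in particular where "surprise" falls), the strict-majority rule, and the function names are all assumptions, since the page only states that seven expressions are regrouped into three polarities and voted on by five MLLMs.

```python
from collections import Counter

# ASSUMPTION: the page names the three polarities but not the grouping.
# The placement of "surprise" is a guess made purely for illustration.
POLARITY = {
    "happy": "positive",
    "surprise": "positive",
    "neutral": "neutral",
    "angry": "negative",
    "disgust": "negative",
    "fear": "negative",
    "sad": "negative",
}


def majority_polarity(predicted_expressions):
    """Return the polarity chosen by a strict majority of voters, else None.

    `predicted_expressions` holds one expression label per voting model
    (five labels for a 5-model ensemble). The strict-majority requirement
    is an assumption; the page does not describe tie handling.
    """
    votes = Counter(POLARITY[label] for label in predicted_expressions)
    polarity, count = votes.most_common(1)[0]
    return polarity if count > len(predicted_expressions) / 2 else None


def passes_coarse_check(predicted_expressions, target_expression):
    """Keep a candidate only if the voted polarity matches the target's."""
    return majority_polarity(predicted_expressions) == POLARITY[target_expression]
```

Voting on polarity rather than on the exact expression means that a candidate edited toward "fear" still passes when the models disagree between "fear", "sad", and "angry", because all three land in the same group under the assumed mapping.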

FED-Score Evaluation Protocol

We propose FED-Score, a decoupled evaluation protocol that integrates rule-based computations with the perceptual capabilities of MLLMs. The evaluation is explicitly decomposed into three dimensions:

Fidelity

Assesses whether the model faithfully preserves elements that should remain unchanged: Identity Preservation (ID) via ArcFace cosine similarity, Background Consistency (BG) via pixel-level RMSE in non-facial areas, and Perceptual Quality (PQ) via MLLM-based visual evaluation.
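
As a rough sketch of the two rule-based Fidelity terms, the snippet below computes a cosine similarity between two embedding vectors and an RMSE restricted to non-facial pixels. It assumes the face embeddings (for example from ArcFace) and a boolean face mask have already been produced elsewhere; the flat-list image representation and the function names are hypothetical simplifications, and a real pipeline would more likely operate on image arrays.

```python
import math


def cosine_similarity(emb_a, emb_b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    return dot / (norm_a * norm_b)


def background_rmse(src_pixels, out_pixels, face_mask):
    """RMSE over pixels where face_mask is False (the non-facial area).

    All three arguments are flat sequences of the same length;
    face_mask[i] is True when pixel i belongs to the face region.
    """
    squared_errors = [
        (s - o) ** 2
        for s, o, is_face in zip(src_pixels, out_pixels, face_mask)
        if not is_face
    ]
    if not squared_errors:
        raise ValueError("face_mask covers the whole image; no background left")
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Masking out the face before computing RMSE keeps the intended expression change from being counted as a background error.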

Alignment

Evaluates how accurately the generated image executes editing instructions: Semantic Consistency (SC) judges the text-image matching degree, and GT-based Expression Alignment (GTA) directly compares the target with the reference Ground Truth expression.

Relative Expression Gain (REG)

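Measures the actual magnitude of expression change relative to the expected change: the numerator compares the source image with the edited output, and the denominator compares the source image with the Ground Truth. A Gaussian penalty centered at 1.0 penalizes both lazy editing (REG well below 1, insufficient change) and overfit editing (REG well above 1, exaggerated distortion):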

$$\text{REG} = \frac{\text{LPIPS}_{\text{face}}(I_{\text{src}},\; I_{\text{trg}})}{\text{LPIPS}_{\text{face}}(I_{\text{src}},\; I_{\text{gt}})}, \quad \mathcal{S}_{\text{reg}} = \exp\!\left(-\frac{(\text{REG} - 1)^2}{2\sigma^2}\right)$$
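
A minimal sketch of the formula above, taking the two facial LPIPS distances as already-computed numbers. The value of σ is not given on this page, so the default `sigma=0.5` is purely an illustrative assumption, as are the function names.

```python
import math


def relative_expression_gain(lpips_src_to_output, lpips_src_to_gt):
    """REG: facial LPIPS(source, output) divided by facial LPIPS(source, GT)."""
    if lpips_src_to_gt <= 0:
        raise ValueError("the source-to-GT distance must be positive")
    return lpips_src_to_output / lpips_src_to_gt


def reg_score(reg, sigma=0.5):
    """Gaussian penalty centered at REG = 1.

    ASSUMPTION: sigma=0.5 is a placeholder; the page does not state it.
    """
    return math.exp(-((reg - 1.0) ** 2) / (2.0 * sigma ** 2))
```

Because the penalty depends only on the squared distance from 1, an edit that changes half as much as the Ground Truth (REG = 0.5) and one that changes half as much again (REG = 1.5) receive the same score, and only REG = 1 reaches the maximum of 1.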

Comprehensive FED-Score

The three dimensions are integrated via multiplication, ensuring that all dimensions must be simultaneously satisfied:

$$\text{FED-Score} = \mathcal{S}_{\text{fid}} \times \mathcal{S}_{\text{align}} \times \mathcal{S}_{\text{reg}}$$
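
The effect of multiplying rather than averaging can be illustrated with a short sketch. It assumes each of the three dimension scores has already been normalized to [0, 1]; how the sub-metrics (ID, BG, PQ, SC, GTA) are aggregated into the Fidelity and Alignment scores is not specified on this page, so the function simply takes the three scores as inputs.

```python
def fed_score(s_fid, s_align, s_reg):
    """Product of the three dimension scores.

    ASSUMPTION: each input is already normalized to the range [0, 1].
    """
    for name, value in (("s_fid", s_fid), ("s_align", s_align), ("s_reg", s_reg)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return s_fid * s_align * s_reg


def mean_score(s_fid, s_align, s_reg):
    """Arithmetic mean, shown only as a contrast to the product."""
    return (s_fid + s_align + s_reg) / 3.0
```

For a hypothetical lazy edit with scores (0.95, 0.90, 0.10), the product is about 0.09 while the mean is 0.65: under multiplication, a near-zero score in any one dimension cannot be offset by high scores in the other two.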

Benchmarking Results

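Table 1. Benchmarking results on FED-Bench under dense and simple instructions. BG↓: lower is better. REG: optimal at 1.0. Rows are sorted by Score in descending order.

Dense Instructions

| Method | ID | BG↓ | PQ | SC | GTA | REG | Score |
|---|---|---|---|---|---|---|---|
| Qwen-Image-Edit-Plus | .58 | 17.5 | 9.7 | 8.8 | 5.7 | 1.18 | .469 |
| SeedDream 4.0 | .62 | 15.5 | 9.5 | 9.1 | 4.3 | 1.37 | .379 |
| FLUX.2 Pro | .58 | 13.9 | 9.8 | 9.0 | 4.0 | 1.37 | .377 |
| Qwen-Image-Edit | .45 | 19.6 | 9.3 | 8.9 | 4.2 | 1.37 | .337 |
| FLUX-Kontext-FED | .68 | 7.3 | 9.7 | 6.2 | 3.4 | 0.95 | .332 |
| FLUX-Kontext-Pro | .52 | 7.3 | 9.8 | 7.6 | 3.5 | 0.99 | .327 |
| FLUX-Kontext-Max | .50 | 9.7 | 9.7 | 7.7 | 3.5 | 1.07 | .320 |
| Qwen-Image-Edit-2511 | .46 | 15.8 | 9.5 | 9.3 | 3.4 | 1.43 | .317 |
| Step1X v1p2 | .56 | 17.1 | 9.4 | 7.7 | 3.0 | 1.31 | .303 |
| FLUX-Kontext-Dev | .72 | 23.8 | 9.6 | 6.7 | 3.0 | 0.67 | .243 |
| SeedEdit 3.0 | .49 | 11.6 | 7.8 | 7.4 | 1.9 | 1.55 | .203 |
| UniWorld-v2 | .37 | 31.6 | 9.1 | 7.6 | 2.5 | 1.45 | .201 |
| DreamOmni2 | .81 | 24.1 | 9.6 | 5.0 | 2.6 | 0.45 | .168 |
| OmniGen2 | .56 | 63.7 | 7.3 | 6.3 | 2.6 | 1.52 | .155 |
| FLUX-Kontext-Fill | .11 | 5.1 | 9.9 | 5.5 | 1.7 | 1.56 | .155 |
| Step1X | .33 | 16.3 | 7.3 | 6.9 | 1.3 | 1.72 | .127 |
| Bagel | .42 | 47.2 | 6.5 | 6.8 | 1.8 | 1.57 | .115 |
| InstructPix2Pix | .08 | 47.0 | 0.0 | 0.0 | 0.0 | 2.56 | .001 |

Simple Instructions

| Method | ID | BG↓ | PQ | SC | GTA | REG | Score |
|---|---|---|---|---|---|---|---|
| Qwen-Image-Edit-Plus | .63 | 16.8 | 9.8 | 8.8 | 5.8 | 1.13 | .492 |
| SeedDream 4.0 | .69 | 16.4 | 9.7 | 8.0 | 5.0 | 1.26 | .413 |
| FLUX.2 Pro | .58 | 14.0 | 9.6 | 8.8 | 5.2 | 1.37 | .400 |
| Qwen-Image-Edit-2511 | .44 | 15.3 | 9.8 | 9.4 | 4.0 | 1.42 | .361 |
| Qwen-Image-Edit | .43 | 19.9 | 9.6 | 8.7 | 4.4 | 1.37 | .343 |
| Step1X v1p2 | .52 | 17.8 | 9.7 | 9.3 | 3.1 | 1.41 | .333 |
| FLUX-Kontext-FED | .68 | 7.8 | 9.7 | 6.5 | 2.9 | 0.96 | .325 |
| FLUX-Kontext-Max | .49 | 8.1 | 9.9 | 5.2 | 3.7 | 1.03 | .259 |
| SeedEdit 3.0 | .48 | 12.8 | 9.1 | 7.6 | 2.7 | 1.56 | .239 |
| FLUX-Kontext-Pro | .55 | 8.5 | 9.9 | 4.8 | 3.4 | 0.97 | .227 |
| Bagel | .53 | 13.1 | 5.4 | 4.1 | 2.4 | 1.29 | .163 |
| Step1X | .32 | 17.3 | 9.7 | 5.6 | 2.3 | 1.67 | .149 |
| OmniGen2 | .44 | 77.6 | 8.3 | 4.1 | 3.6 | 1.61 | .122 |
| FLUX-Kontext-Dev | .86 | 4.8 | 9.8 | 2.3 | 3.1 | 0.35 | .120 |
| UniWorld-v2 | .30 | 34.9 | 8.6 | 3.4 | 2.7 | 1.60 | .110 |
| FLUX-Kontext-Fill | .11 | 5.1 | 9.7 | 3.3 | 1.1 | 1.52 | .096 |
| DreamOmni2 | .45 | 36.9 | 8.9 | 1.2 | 2.0 | 1.40 | .049 |
| InstructPix2Pix | .08 | 41.2 | 0.2 | 0.1 | 0.1 | 2.42 | .004 |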

Qualitative Results

Figure 3. Qualitative comparison of facial expression editing on FED-Bench. Each row shows a different editing task with the original image, ground truth target, and results from evaluated models.

Figure 4. Qualitative comparison of the FED-Score on two editing tasks.

Figure 5. Qualitative analysis of the REG metric on two editing tasks, illustrating lazy editing vs. overfit editing patterns.

BibTeX

@misc{xue2026fedbenchcrossgranularbenchmarkdisentangled,
      title={FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing}, 
      author={Fengjian Xue and Xuecheng Wu and Heli Sun and Yunyun Shi and Shi Chen and Liangyu Fu and Jinheng Xie and Dingkang Yang and Hao Wang and Junxiao Xue and Liang He},
      year={2026},
      eprint={2603.29697},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.29697}, 
}