Week 4: Responsible Use and Final Pipeline

Day 22: Evaluation Metrics

Day 22 of 28 (18 min)

Goal

Learn that the right way to evaluate a sign-language system depends on the task: recognition and production are measured differently.

Learn

  • Recognition tasks may use accuracy, top-k accuracy, word error rate, or gloss error rate.
  • Production tasks may use pose similarity, timing measures, smoothness, and human ratings for understandability and naturalness.
  • A number alone does not prove the signs are understandable. Sign-language systems need technical metrics and human review, especially from Deaf or fluent ASL reviewers.
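The recognition metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example glosses are made up for this lesson, and the gloss error rate is assumed here to use the same edit-distance formula as word error rate, applied to gloss tokens.

```python
def accuracy(predicted, reference):
    """Fraction of items where the predicted label equals the reference label."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("need two non-empty lists of the same length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def top_k_accuracy(ranked_predictions, reference, k):
    """Fraction of items whose reference label is among the model's first k guesses."""
    if not reference or len(ranked_predictions) != len(reference):
        raise ValueError("need two non-empty lists of the same length")
    hits = sum(r in ranked[:k] for ranked, r in zip(ranked_predictions, reference))
    return hits / len(reference)


def word_error_rate(hypothesis, reference):
    """Token-level edit distance divided by the reference length.

    Both arguments are lists of tokens. With glosses as tokens, the same
    formula gives one possible gloss error rate (an assumption of this sketch).
    """
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] = edit distance between the reference prefix so far and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # reference token missing from output
                            curr[j - 1] + 1,      # extra token in output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical gloss sequences, for illustration only.
reference = ["ME", "GO", "STORE", "TOMORROW"]
hypothesis = ["ME", "GO", "SCHOOL"]
print(word_error_rate(hypothesis, reference))  # one substitution + one missing token
```

None of these functions says anything about understandability or naturalness. Those still need human ratings.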

Example

  • Recognition metric: the model gets 82 out of 100 isolated sign labels correct.
  • Human review metric: reviewers rate whether generated signing preserves meaning, grammar, timing, and naturalness.
  • A model can have a decent technical score and still fail on important signs or on signers who were not represented in the training set.
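To see how an overall score can hide a failure, it helps to break the same 82 out of 100 result down by sign. The sketch below uses invented counts (five hypothetical sign labels, 20 test items each) chosen only so that the total matches the example; they are not results from any real model.

```python
from collections import defaultdict


def per_label_accuracy(results):
    """Accuracy for each reference label, given (reference, predicted) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for reference, predicted in results:
        totals[reference] += 1
        if predicted == reference:
            correct[reference] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Invented counts for illustration: 20 test items per sign, 82 correct in total.
correct_counts = {"HELLO": 20, "THANK-YOU": 20, "YES": 20, "NO": 18, "HELP": 4}
results = []
for label, n_correct in correct_counts.items():
    results += [(label, label)] * n_correct            # correct predictions
    results += [(label, "OTHER")] * (20 - n_correct)   # wrong predictions

overall = sum(ref == pred for ref, pred in results) / len(results)
by_label = per_label_accuracy(results)

print(f"overall: {overall:.2f}")
for label, acc in sorted(by_label.items(), key=lambda item: item[1]):
    print(f"{label}: {acc:.2f}")
```

In this made-up breakdown the overall accuracy is 0.82, yet HELP is recognized only 4 times out of 20. The same kind of breakdown can be done per signer to check for people the model handles poorly.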

Practice

  1. Create a rubric with two technical checks and three human review checks.
  2. Add a pass/fail rule for when the system should not be publicly demonstrated.
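One possible way to write such a rubric down is as a small data structure with an explicit rule. Everything in this sketch is an assumption made for illustration: the check names, the scores, the minimum values, and the rule that any single failed check blocks a public demonstration. Your own rubric may use different checks and a different rule.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    kind: str       # "technical" or "human"
    score: float    # measured value, scaled 0.0 to 1.0
    minimum: float  # lowest acceptable value (an assumption you choose)

    def passed(self):
        return self.score >= self.minimum


def failed_checks(checks):
    """Names of all checks that fall below their minimum."""
    return [c.name for c in checks if not c.passed()]


def ready_for_public_demo(checks):
    """Pass/fail rule in this sketch: every check must pass."""
    return not failed_checks(checks)


# Hypothetical rubric: two technical checks and three human review checks.
rubric = [
    Check("top-1 accuracy on held-out signers", "technical", 0.82, 0.80),
    Check("lowest per-sign accuracy", "technical", 0.20, 0.60),
    Check("meaning preserved (reviewer rating)", "human", 0.90, 0.80),
    Check("grammar and timing (reviewer rating)", "human", 0.75, 0.80),
    Check("naturalness (reviewer rating)", "human", 0.85, 0.70),
]

print(ready_for_public_demo(rubric))
print(failed_checks(rubric))
```

With these invented scores, the rule blocks the demonstration and names the two checks that failed, which gives you something concrete to report.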

Checkpoint

Before moving on

You can explain why SignLLM evaluation needs both technical metrics and human reviewers.

Quality note

Report what was tested, who reviewed it, and what the system cannot do.